On the Avoidance of Fruitless Wraps in Grammatical Evolution

<code> ::= <line> | <code> <line>
<line> ::= <condition> | <op>
<condition> ::= if(food_ahead()) { <line> } else { <line> }
<op> ::= left(); | right(); | move();

We are now more interested in the behaviour of the heuristic on a class of grammars, rather than on particular problems. To this end, we again use the number of wraps as a fitness function, to create a worst case scenario which may give us a lower bound on how well one can expect the heuristic to perform. These problems use the same parameters normally associated with GE, that is, a population of 500 evolving for 50 generations, with steady state selection and a mutation probability of 0.01. Figure 5 shows the performance of each of the grammars using the number of successful wraps as a measure of success.

Fig. 5. Performance using number of successful wraps as a measure for Symbolic Regression (left) and Santa Fe Ant (right)

The Santa Fe results immediately leap out, showing that no individual in any of the runs attained a fitness greater than two. It is an inherent property of the grammar that it will either complete after two wraps or not at all, but the heuristic is still capable of identifying invalid individuals, and does so after an average of 1.3 wraps. On the other hand, the Symbolic Regression grammar climbs continuously right up until the final generation, clearly indicating that legal individuals in this case are capable of considerably more wraps.

The numbers of individuals identified by the heuristic as being illegal in each case are shown in Fig. 6. In the case of the Symbolic Regression grammar, the heuristic performs very well at the start but, after about 15 generations, starts to fail to identify failures. Across the 50 generations, the heuristic, on average, identifies 33% of invalid individuals, with an average of 2.06 wraps. In the case of the Santa Fe grammar, the heuristic performs much better, correctly identifying around 84% of failures, with an average of just 1.3 wraps.

The fact that these problems specifically encourage wrapping should be kept in mind, so, particularly with the symbolic regression grammar, in which the fitness keeps rising, it should not be surprising that the number of invalid individuals keeps increasing. This was done specifically to create a difficult situation for the heuristic; in usual GE runs, the number of invalid individuals drops to near zero by the fourth or fifth generation.
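The wrapping mechanism discussed above can be made concrete with a short sketch (ours, not the authors' implementation): codons select productions modulo the number of alternatives for the leftmost non-terminal, and when the genome is exhausted the mapping wraps back to the first codon; an individual that exceeds a wrap limit fails to map. The grammar below is the Santa Fe grammar shown earlier.

```python
# Minimal sketch of GE genotype-to-phenotype mapping with wrapping
# (illustrative only; details are simplified assumptions, not GE's exact code).

GRAMMAR = {
    "<code>": [["<line>"], ["<code>", "<line>"]],
    "<line>": [["<condition>"], ["<op>"]],
    "<condition>": [["if(food_ahead()) {", "<line>", "} else {", "<line>", "}"]],
    "<op>": [["left();"], ["right();"], ["move();"]],
}

def map_genome(genome, start="<code>", max_wraps=10):
    """Expand the leftmost non-terminal using successive codons; wrap to the
    start of the genome when it is exhausted. Returns (program, wraps), or
    (None, wraps) if the individual fails to map within max_wraps."""
    symbols = [start]
    out, i, wraps = [], 0, 0
    while symbols:
        sym = symbols.pop(0)
        if sym not in GRAMMAR:          # terminal symbol: emit it
            out.append(sym)
            continue
        if i == len(genome):            # end of genome: wrap around
            wraps += 1
            if wraps > max_wraps:
                return None, wraps      # invalid (non-mapping) individual
            i = 0
        choices = GRAMMAR[sym]
        choice = choices[genome[i] % len(choices)]
        i += 1
        symbols = list(choice) + symbols
    return " ".join(out), wraps

prog, wraps = map_genome([0, 1, 2])   # maps without wrapping
```

A genome such as [1] keeps choosing the recursive production for <code>, so it wraps until the limit is exceeded and is reported as invalid, which is exactly the kind of individual the heuristic tries to detect early.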
C. Ryan, M. Keijzer, and M. Nicolau
Fig. 6. The number of individuals identified by the heuristic for the Symbolic Regression grammar (left) and the Santa Fe grammar (right)
7 Conclusions
This paper has investigated the phenomenon of wrapping in Grammatical Evolution. We have demonstrated circumstances under which individuals will fail to map using a number of different types of context-free grammars, namely single non-terminal grammars, dual non-terminal grammars and multiple non-terminal grammars. Using shape graphs, we described a stop sequence, which is any sequence of genes that is likely to terminate another string after crossover. We presented a simple algorithm to determine whether or not an individual using a dual non-terminal grammar will fail to map, and illustrated the difficulty in extending this algorithm to multiple non-terminal grammars. However, we have shown a very cheap heuristic that, on a problem specifically designed to produce individuals that will fail, successfully identifies 35% of individuals that won't map.

The paper also considered two classes of grammars, a symbolic regression type and a Santa Fe Ant Trail type. For both of these, the worst case was examined, that is, where individuals are actually rewarded for wrapping. In these cases, the heuristic successfully identified 33% and 84% of failures for the symbolic regression grammar and the Santa Fe Ant Trail grammar respectively.
Dense and Switched Modular Primitives for Bond Graph Model Design

Kisung Seo¹, Zhun Fan¹, Jianjun Hu¹, Erik D. Goodman¹, and Ronald C. Rosenberg²

¹ Genetic Algorithms Research and Applications Group (GARAGe), Michigan State University
² Department of Mechanical Engineering, Michigan State University
East Lansing, MI 48824, USA
{ksseo,fanzhun,hujianju,goodman,rosenber}@egr.msu.edu
Abstract. This paper suggests dense and switched modular primitives for a bond-graph-based GP design framework that automatically synthesizes designs for multi-domain, lumped parameter dynamic systems. A set of primitives is sought that will avoid redundant junctions and elements, based on preassembling useful functional blocks of bond graph elements and (optionally) using a switched choice mechanism for inclusion of some elements. Motivation for using these primitives is to improve performance through greater search efficiency and thereby to reduce computational effort. As a proof of concept for this approach, an eigenvalue assignment problem, which is to find bond graph models exhibiting minimal distance errors from target sets of eigenvalues, was tested and showed improved performance for various sets of eigenvalues.
1 Introduction
Design of interdisciplinary (multi-domain) dynamic engineering systems, such as mechatronic systems, differs from design of single-domain systems, such as electronic circuits, mechanisms, and fluid power systems, in part because of the need to integrate the several distinct domain characteristics in predicting system behavior (Youcef-Toumi [1]). However, most current research in evolutionary design has been optimized for a single domain (see, for example, Koza et al. [2,3]). In order to overcome this limitation and enable open-ended search, the Bond Graph / Genetic Programming (BG/GP) design methodology has been developed, based on the combination of these two powerful tools (Seo et al. [4,5]), and tested on a few applications – an analog filter (Fan et al. [6]), a printer drive mechanism (Fan et al. [7]), and an air pump design (Goodman et al. [8]). BG/GP worked efficiently for these applications. The search capability of this system has been improved dramatically by the introduction of a new form of parallel evolutionary computation, called Hierarchical Fair Competition GP (HFC-GP; Hu et al. [9]), which can strongly reduce premature convergence and enable scalability with smaller populations.

E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 1764–1775, 2003. © Springer-Verlag Berlin Heidelberg 2003
However, two issues still arise: one is the need for much stronger synthesis capability arising from the complex nature of multi-domain engineering design, and the other is the desire to minimize computational demands. While we have made inroads in improving GP search by introducing HFC-GP, we want to exploit the notion of modularity of GP function primitives to make additional gains. Much useful modularity can be discovered during an evolutionary process, as is done, for example, by automatically defined functions (Koza [10]). However, in many cases, we believe that explicit introduction of higher-level modules as function primitives, based on domain knowledge, will yield faster progress than requiring their recognition during the evolutionary process.

Some research has been devoted to choice or refinement of the function set in GP. Soule and Heckendorn [11] examined how the function set influences performance in GP and showed some relationship between performance and GP function sets, but their work was limited to generating simple sine functions, varying only arithmetic and trigonometric operators (e.g., +, -, *, /, tan, ...). We will try to exploit higher-level function sets, rather than simply choosing different sets at the same level.

In this paper, a generic type of primitive is introduced, and specialized here to capture specific domain knowledge about bond graphs – the dense switched modular primitive. First, we introduce the dense module concept to generate compact bond graph models with fewer operations. It replaces several operations in the basic (original) set with one operation, yielding a smaller tree attainable with less computational effort. Second, the switched module concept creates a small function set of elements with changeable forms, which can assist in evolving complex functionality, while eliminating many redundant bond graph structures that evolve if it is not used.
Elements eliminated include "dangling" junctions that connect to nothing and many one-port components (such as resistors, capacitors, inductors, etc.). Their elimination makes the resulting bond graph simpler and the speed of evolution faster. A careful design of a dense and switched modular primitive should considerably increase the efficiency of search and also, for the bond graph case, the efficiency of fitness assessment, as is illustrated in this paper.

As a test class of design problems, we have chosen one in which the objective is to realize a design having a specified set of eigenvalues. The eigenvalue assignment problem is well defined and has been studied effectively using linear components with constant parameters.

Section 2 discusses the inter-domain nature, efficient evaluation, and graphical generation of bond graphs, including the design methodology used in approaching such problems. Section 3 explains the basic set and redundancy problem and Sect. 4 describes the dense switched modular primitive set. Section 5 presents results for 6-, 10- and 16-eigenvalue design problems, and Sect. 6 concludes the paper.
2 Evolutionary Bond Graph Synthesis for Engineering Design
2.1 The BG/GP Design Methodology

There is a strong need for a unified design tool, able to be applied across energy domains – electrical, mechanical, hydraulic, etc. Most design tools or methodologies require user interaction, so users must make many decisions during the design process. This makes the design procedure more complex and often introduces the need for trial-and-error iterations. Automation of this process – so the user sets up the specifications and "pushes a button," then receives candidate design(s) – is also important. A design methodology that combines bond graphs and genetic programming can serve as an automated and unified approach (Fig. 1). The proposed BG/GP (Bond Graph with Genetic Programming) design methodology requires only an embryo model and a fitness (performance) definition in its initial stage; the remaining procedures are automatically executed by genetic programming search. However, due to the complexity of the engineering design problem, the need for efficiency in the design search is very high. It is this problem that is addressed here.
Fig. 1. Key features of the BG/GP design methodology
2.2 Bond Graphs

Topologically, bond graphs consist of elements and bonds. Relatively simple systems include passive one-port elements C, I, and R, active one-port elements Se and Sf, and two-port elements TF and GY (transformers and gyrators). These elements can be attached to 0- (or 1-) junctions, which are multi-port elements, using bonds. The middle of Fig. 2 consists of Se, 1-junction, C, I, and R elements, and that same bond graph represents, for example, either a mechanical mass, spring and damper system (left) or an RLC electrical circuit (right). Se corresponds to force in mechanical systems, or voltage in electrical systems. The 1-junction implies a common velocity for 1) the force source, 2) the end of the spring, 3) the end of the damper, and 4) the mass in the mechanical system, or implies that the current in the RLC loop is common. The R, I, and C represent the damper, inertia (of a mass), and spring in the mechanical system, or the resistor, inductor, and capacitor in the electrical circuit.
[Figure: a mass-spring-damper system (F(t), m, k, b), its bond graph (Se feeding a 1-junction with C, I, and R attached), and the equivalent series RLC circuit]
Fig. 2. The same bond graph model for two different domains
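As a small worked example (ours, not from the paper), the single-junction system of Fig. 2 obeys the second-order ODE m·x'' + b·x' + k·x = F(t), whose state matrix [[0, 1], [-k/m, -b/m]] has eigenvalues given by the roots of s² + (b/m)s + k/m = 0 — the quantities targeted by the eigenvalue assignment problem later in the paper.

```python
import cmath

def mass_spring_damper_eigs(m, b, k):
    """Eigenvalues of the state matrix [[0, 1], [-k/m, -b/m]], i.e. the
    roots of s^2 + (b/m) s + k/m = 0 (mass m, damper b, spring k).
    The same formula covers the series RLC circuit with m=L, b=R, k=1/C."""
    a1, a0 = b / m, k / m
    disc = cmath.sqrt(a1 * a1 - 4 * a0)   # complex sqrt handles underdamping
    return (-a1 + disc) / 2, (-a1 - disc) / 2

# m = 1, b = 4, k = 5 gives s^2 + 4s + 5 = 0, i.e. eigenvalues -2 ± 1j
s1, s2 = mass_spring_damper_eigs(1.0, 4.0, 5.0)
```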
3 Basic Set and Redundancy
The initial BG/GP system used GP functions and terminals for bond graph construction as follows. There are four types of functions: add functions that can be applied only to a junction and which add a C, I, or R element; insert functions that can be applied to a bond and which insert a 0-junction or 1-junction into the bond; replace functions that can be applied to a node and which can change the type of element and corresponding parameter values for C, I, or R elements; and arithmetic functions that perform arithmetic operations and can be used to determine the numerical values associated with components (Table 1). Details of function definitions are illustrated in Seo et al. [5].

Table 1. Functions and terminals in Basic set

Name        Description
add_C       Add a C element to a junction
add_I       Add an I element to a junction
add_R       Add an R element to a junction
insert_J0   Insert a 0-junction in a bond
insert_J1   Insert a 1-junction in a bond
replace_C   Replace current element with C element
replace_I   Replace current element with I element
replace_R   Replace current element with R element
+           Sum two ERCs
-           Subtract two ERCs
endn        End terminal for add element operation
endb        End terminal for insert junction operation
endr        End terminal for replace element operation
erc         Ephemeral random constant (ERC)
Many redundant or unnecessary junctions and elements were observed in experiments with this basic set. Such unnecessary elements can be generated by the free combinatorial connection of elements, and, while they can be removed without any change in the physical meaning of the bond graph, their processing reduces the efficiency of
processing and of search. At the same time, such a "universal" set guarantees that all possible topologies can be generated. However, many junctions "dangle" without further extension, and many condensable arrangements of one-port components (C, I, R) are generated. Figure 3 illustrates redundancies, marked with dotted circles in the example. First, the dangling 0- and 1-junctions in the left-hand figure can be eliminated, and then three C, I, and R elements can be joined together at one 1-junction. Furthermore, two R elements attached to neighboring 0-junctions can be merged to a single equivalent R. Avoiding these redundant junctions and elements improves search efficiency significantly.
Fig. 3. Example of redundant 0- and 1-junctions and R elements (left) in generated bond graph model, and equivalent model after simplification (right). The dotted lines represent the boundary of the embryo
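One of the simplifications of Fig. 3 can be sketched in a few lines (our toy data model, not the paper's code): a junction that carries no C/I/R elements and participates in at most one bond is "dangling" and can be removed repeatedly without changing the model's physical meaning.

```python
# Hedged sketch of dangling-junction removal on a toy bond graph
# representation: junctions map names to attached one-port elements,
# bonds are unordered pairs of junction names.
def remove_dangling(junctions, bonds):
    """junctions: {name: [elements]}; bonds: set of frozenset pairs.
    Repeatedly delete element-free junctions with at most one bond."""
    changed = True
    while changed:
        changed = False
        for j, elems in list(junctions.items()):
            degree = sum(1 for b in bonds if j in b)
            if not elems and degree <= 1:       # dangling: nothing attached
                junctions.pop(j)
                bonds = {b for b in bonds if j not in b}
                changed = True                   # removal may expose new danglers
    return junctions, bonds

js = {"Se": ["Se"], "j1": ["C", "I", "R"], "j0": []}
bs = {frozenset({"Se", "j1"}), frozenset({"j1", "j0"})}
js, bs = remove_dangling(js, bs)   # the empty junction j0 disappears
```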
4 Construction of Dense Switched Modular Primitives
The redundancy problem is closely related to performance and computational effort in the evolutionary process. The search process will be hastened by eliminating the redundancy, and it is hypothesized that this will happen without loss of performance of the systems evolved. It is obvious that computational resources can be saved by removal of the redundancy. To reduce the redundancy noted above and to utilize the concept of modularity, a new type of GP function primitive has been devised – the dense switched modular primitive ("DSMP"). Roughly speaking, a dense representation (eliminating redundant components at junctions, guaranteeing causally well-posed bond graphs, and avoiding adjacent junctions of the same type) is combined with a switched structure (allowing components that do not impact causal assignment at a junction to be present or absent depending on a binary switch).
Fig. 4. The dense modular primitive: a mixed sequence of ins and add operations is merged into a single operation
The major features of the modular primitives are as follows. First, a single dense function replaces all add, insert, and replace functions of the basic set. This concept is explained in Fig. 4, in which mixed ins and add operations can be merged into one operation. Therefore, a GP tree that represents a certain bond graph topology can be much smaller than attainable with the basic set. This dense function not only incorporates multiple operations, but also reflects design knowledge of the bond graph domain, such as causality (discussed later).

Second, any combination of C, I, and R components can be instantiated according to the values of a set of on/off switch settings that are evolved by mutation. This modularity also helps to relieve the redundancy of C, I, and R components, giving them fewer places to proliferate in arrangements that appear to be different but are functionally equivalent. This new set introduces further modularity through a controllable switching function for selection of C, I, R combinations (Fig. 5). The function set of the dense switched modular primitives is shown in Table 2. It consists of two functions that replace all ins, add, and replace functions in the basic set (Table 1).

Table 2. New functions in the switched modular primitive set

Name                      Description
insert_JPair_SWElements   Insert a 0-1 (or 1-0) junction pair in a bond and attach switched C, I, R elements to each junction
add_J_SWElements          Add a counter-junction to a junction and attach switched C, I, R elements
Fig. 5. Switched modular primitive: a 0- or 1-junction with C, I, and R elements attached through on/off switches
Third, the proper typing of 0-junctions and 1-junctions is determined by an implicit genotype-phenotype mapping, considering the neighbor junction to which the primitive is attached. This allows insertion of only “proper pairs” of junctions on bonds, preventing generation of consecutive junctions of the same type that are replaceable by a single one.
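The combined idea — a properly typed junction pair with switch-controlled elements — can be sketched as follows (our names and data model, not the authors' code; the real primitive operates on an actual bond graph, not on dictionaries).

```python
# Hedged sketch of the insert_JPair_SWElements idea: insert a junction
# pair typed to alternate with its neighbor, attaching only those C, I, R
# elements whose evolved on/off switch bits are set.
def insert_jpair_sw(neighbor_type, switches_a, switches_b):
    """neighbor_type: '0' or '1'; switches_*: (sC, sI, sR) bits for each
    new junction. Pair typing alternates with the neighbor, so the mapping
    never produces two adjacent junctions of the same type."""
    pair = ("1", "0") if neighbor_type == "0" else ("0", "1")

    def attach(jtype, switches):
        return {"junction": jtype,
                "elements": [e for e, on in zip("CIR", switches) if on]}

    return [attach(pair[0], switches_a), attach(pair[1], switches_b)]

new = insert_jpair_sw("1", (1, 0, 1), (0, 0, 0))
# the second junction carries no elements: its switch bits were all off
```

Mutation can later flip individual switch bits, turning elements on or off without restructuring the tree — this is the "changeable form" the text describes.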
Fourth, we ensure that we generate only feasible individuals, satisfying the causally well-posed property, so automatic state equation formulation is simplified considerably. One of the key advantages of BG/GP design is the efficiency of the evaluation. The evaluation stage is composed of two steps: 1) causality analysis and, when merited, 2) dynamic simulation. The first, causal analysis, allows rapid determination of feasibility of candidate designs, thereby sharply reducing the time needed for analysis of designs that are infeasible. In most cases, all bonds in the graph will have been assigned a causal stroke (determining which variables are assigned values at that point, rather than bringing pre-assigned values to it) using only integral causality of C or I and extension of causal implication. Some models can have all causality assigned without violation – the causally satisfied case. Other models are assigned causality, but with violations – the causally violated case. If one has to continue to use an arbitrary causality of an R, it means that some algebraic relationships must be solved if the equations are to be put into standard form; this case can be classified as causally undetermined. Detailed causality analysis is described in Karnopp et al. [12]. The dense switched modular primitives, with implicit genotype-phenotype mapping and the guaranteed feasibility of the resulting causally well-posed bond graphs, can speed up the evolution process significantly.
5 Experiments and Analysis
To evaluate and compare the proposed approach with the previous one, the eigenvalue assignment problem, for which the design objective is to find bond graph models with minimal distance errors from a target set of eigenvalues, is used. The problem of eigenvalue assignment has received a great deal of attention in control system design. Design of systems to avoid instability and to provide specified response characteristics as determined by their eigenvalues is often an important and practical problem.

5.1 Problem Definition
In the example that follows, a set of target eigenvalues is given and a bond graph model with those eigenvalues must be generated, in a classic "inverse" problem. The following sets (various 6-, 10- and 16-eigenvalue target sets, respectively) were used for the genetic programming runs:

Eigenvalue sets used in experiments:
1) {-1±2j, -2±j, -3±0.5j}
2) {-10±j, -1±10j, -3±3j}
3) {-20±j, -1±20j, -7±7j}
4) {-1, -2, -3, -4, -5, -6}
5) {-20±j, -1±20j, -7±7j, -12±4j, -4±12j}
6) {-1, -2, -3, -4, -5, -6, -7, -8, -9, -10}
7) {-20±1j, -1±20j, -7±7j, -12±4j, -4±12j, -15±2j, -9±5j, -5±9j}
The fitness function is defined as follows: pair each target eigenvalue one-to-one with the closest eigenvalue in the solution; calculate the sum of distance errors between each target eigenvalue and the solution's corresponding eigenvalue; divide by the order; and perform hyperbolic scaling as follows. Relative distance error (normed by the distance of the target from the origin) is used.

Fitness(Eigenvalue) = 0.5 + 1 / (2 + Σ Error / Order)
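A straightforward reading of this definition can be sketched as follows (ours, not the authors' code; we assume greedy nearest-neighbor pairing and take "Order" to be the number of target eigenvalues). A perfect match gives the maximum fitness of 1.0; larger errors drive the value toward 0.5.

```python
# Hedged sketch of the scaled eigenvalue-distance fitness described above.
def eigenvalue_fitness(targets, solution):
    """targets, solution: equal-length lists of complex eigenvalues.
    Greedily pair each target with the closest unused solution eigenvalue,
    average the relative errors, then scale hyperbolically into (0.5, 1]."""
    remaining = list(solution)
    total_rel_error = 0.0
    for t in targets:
        nearest = min(remaining, key=lambda s: abs(s - t))
        remaining.remove(nearest)                    # enforce 1:1 pairing
        total_rel_error += abs(nearest - t) / abs(t)  # normed by |target|
    return 0.5 + 1.0 / (2.0 + total_rel_error / len(targets))

f = eigenvalue_fitness([-1 + 2j, -1 - 2j], [-1 - 2j, -1 + 2j])  # exact match
```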
We used a strongly-typed version (Luke [13]) of lilgp (Zongker and Punch [14]) with HFC (Hierarchical Fair Competition; Hu et al. [9]) GP to generate bond graph models. These examples were run on a single Pentium IV 2.8 GHz PC with 512 MB RAM. The GP parameters were as follows:

Number of generations: 500
Population sizes: 100 in each of ten subpopulations for multiple-population runs
Initial population: half_and_half
Initial depth: 3-6
Max depth: 12 (with 800 max_nodes)
Selection: tournament (size = 7)
Crossover: 0.9
Mutation: 0.1

The tabular results of 6- and 10-eigenvalue runs are provided in Tables 3-4, with statistics including mean relative distance error (averaged across each target eigenvalue) and mean tree size, for each set of 10 experiments. Table 3 illustrates the comparison between the basic set and the DSMP (dense switched modular primitive) set on typical complex conjugate and real six-eigenvalue target sets. In the first set, {-1±2j, -2±j, -3±0.5j}, the average error of the basic set (0.151) is larger than that of the DSMP set (0.043). The second and third sets, two target eigenvalue sets with larger norms from the origin, also show larger average distance errors for the basic set. The numbers in parentheses next to the DSMP distance errors give their ratio to the basic set distance errors.

Table 3. Results for 6 eigenvalues (10 runs)

Eigenvalue set              Basic set               DSMP set
                            Dist error  Tree size   Dist error   Tree size
{-1±2j, -2±j, -3±0.5j}      0.151       513.6       0.043 (28%)  237.0
{-10±1j, -1±10j, -3±3j}     0.068       451.8       0.026 (38%)  296.8
{-20±1j, -1±20j, -7±7j}     0.056       399.4       0.021 (37%)  285.6
{-1, -2, -3, -4, -5, -6}    0.144       445.7       0.009 (6%)   307.1
In a fourth example, an all-real set of target eigenvalues {-1, -2, -3, -4, -5, -6} is tested; the ratio of errors between the approaches is more than ten (0.144 for the basic set vs. 0.009 for the DSMP set, only 6% of the basic set error). Also, mean tree sizes of all basic set runs are much larger than those of the DSMP set.

Results for a 10-eigenvalue assignment problem are shown in Table 4. For the complex conjugate 10-eigenvalue set {-20±1j, -1±20j, -7±7j, -12±4j, -4±12j}, the average error of the basic set (0.210) is three times larger than that of the DSMP set (0.064). For the real 10-eigenvalue set, the average error of the basic set (0.267) is more than ten times larger than that of the DSMP set (0.023). As with 6 eigenvalues, the mean tree sizes of the basic set are larger than those of the DSMP set.

Table 4. Results for 10 eigenvalues (10 runs)

Eigenvalue set                             Basic set               DSMP set
                                           Dist error  Tree size   Dist error   Tree size
{-20±1j, -1±20j, -7±7j, -12±4j, -4±12j}    0.210       564.9       0.064 (30%)  385.6
{-1, -2, -3, -4, -5, -6, -7, -8, -9, -10}  0.267       564.5       0.023 (9%)   425.8
Results for a 16-eigenvalue assignment problem – a much more difficult problem – are shown in Table 5. For the complex conjugate 16-eigenvalue set {-20±1j, -1±20j, -7±7j, -12±4j, -4±12j, -15±2j, -9±5j, -5±9j}, the average error of the basic set (0.279) is twice as large as that of the DSMP set (0.132). Mean size of the GP tree, BG size, and computation time are also given in Table 5. BG size represents the mean number of junctions and C, I, R elements in each individual. All mean tree sizes, BG sizes, and computation times of the DSMP set are less, respectively, than their basic set counterparts.

Table 5. Results for 16 eigenvalues (10 runs)
Target set: {-20±1j, -1±20j, -7±7j, -12±4j, -4±12j, -15±2j, -9±5j, -5±9j}

           Dist error    Mean Tree Size  BG Size  Compu. Time (min)
Basic set  0.279         663.1           62.2     72.4
DSMP set   0.132 (47%)   592.6           37       56.1
Although the experiments run to date are not sufficient to allow making strong statistical assertions, the search capability of the DSMP set appears clearly superior to that of the basic set for bond graph design. Although the difference may not seem large, it is very significant considering that the basic set runs already take advantage of HFC (Hierarchical Fair Competition; Hu et al. [9]).
Fig. 6. Distance error for 16 eigenvalues
Fig. 7. Mean tree size for 16 eigenvalues
The distance errors (vs. generation) in 10 runs of the 16-eigenvalue problem are shown in Fig. 6. The distance errors of the DSMP set in Fig. 6 have already decreased rapidly within 50 generations, because only causally feasible (well-posed) individuals appear in the population. Figure 7 gives the mean tree sizes for each approach on the
16-eigenvalue problem. The DSMP set clearly obtains better performance using smaller trees. This bodes well for the scalability of the approach.
6 Conclusion
This paper has introduced the dense switched modular primitive for bond graph/GP-based automated design of multi-domain, lumped parameter dynamic systems. A careful combination is made of a dense representation (eliminating redundant components at junctions, guaranteeing causally well-posed bond graphs, and avoiding adjacent junctions of the same type) and a switched structure (allowing components that do not impact causal assignment at a junction to be present or absent depending on a binary switch). The use of these primitives considerably increases the efficiency of fitness assessment and the search performance in generation of bond graph models, to solve engineering problems with less computational effort.

As a proof of concept for this approach, the eigenvalue assignment problem, which is to synthesize bond graph models with minimum distance errors from pre-specified target sets of eigenvalues, was used. Results showed better performance for various eigenvalue sets when the new primitives were used. This tends to support the conjecture that a carefully tailored, problem-specific representation, with operators that generate only feasible solutions, less redundancy, and fewer genotypes that map to the same effective phenotype, will improve the efficiency of GP search. This, in turn, offers promise that much more complex multi-domain systems with more detailed performance specifications can be designed efficiently.

Further study will aim at extension and refinement of the GP representations for the bond graph/genetic programming design methodology, and at demonstration of its applicability to design of more complex systems.

Acknowledgment. The authors gratefully acknowledge the support of the National Science Foundation through grant DMII 0084934.
References
1. Youcef-Toumi, K.: Modeling, Design, and Control Integration: A Necessary Step in Mechatronics. IEEE/ASME Trans. Mechatronics, vol. 1, no. 1 (1996) 29–38
2. Koza, J.R., Bennett, F.H., Andre, D., Keane, M.A., Dunlap, F.: Automated Synthesis of Analog Electrical Circuits by Means of Genetic Programming. IEEE Trans. Evolutionary Computation, vol. 1, no. 2 (1997) 109–128
3. Koza, J.R., Bennett, F.H., Andre, D., Keane, M.A.: Genetic Programming III: Darwinian Invention and Problem Solving. Morgan Kaufmann Publishers (1999)
4. Seo, K., Goodman, E., Rosenberg, R.C.: First Steps toward Automated Design of Mechatronic Systems Using Bond Graphs and Genetic Programming. Proc. Genetic and Evolutionary Computation Conference, GECCO-2001, San Francisco (2001) 189
5. Seo, K., Hu, J., Fan, Z., Goodman, E.D., Rosenberg, R.C.: Automated Design Approaches for Multi-Domain Dynamic Systems Using Bond Graphs and Genetic Programming. Int. Jour. of Computers, Systems and Signals, vol. 3, no. 1 (2002) 55–70
6. Fan, Z., Hu, J., Seo, K., Goodman, E.D., Rosenberg, R.C., Zhang, B.: Bond Graph Representation and GP for Automated Analog Filter Design. Genetic and Evolutionary Computation Conference Late-Breaking Papers, San Francisco (2001) 81–86
7. Fan, Z., Seo, K., Rosenberg, R.C., Hu, J., Goodman, E.D.: Exploring Multiple Design Topologies Using Genetic Programming and Bond Graphs. Proc. Genetic and Evolutionary Computation Conference, GECCO-2002, New York (2002) 1073–1080
8. Goodman, E.D., Seo, K., Rosenberg, R.C., Fan, Z., Hu, J.: Automated Design of Mechatronic Systems: Novel Search Methods and Modular Primitives to Enable Real-World Applications. Proc. 2003 NSF Design, Service and Manufacturing Grantees and Research Conference, Birmingham, Alabama (2003)
9. Hu, J., Goodman, E.D., Seo, K., Pei, M.: Adaptive Hierarchical Fair Competition (AHFC) Model for Parallel Evolutionary Algorithms. Proc. Genetic and Evolutionary Computation Conference, GECCO-2002, New York (2002) 772–779
10. Koza, J.R.: Genetic Programming II: Automatic Discovery of Reusable Programs. The MIT Press (1994)
11. Soule, T., Heckendorn, R.B.: Function Sets in Genetic Programming. Proc. Genetic and Evolutionary Computation Conference, GECCO-2001, San Francisco (2001) 190
12. Karnopp, D.C., Rosenberg, R.C., Margolis, D.L.: System Dynamics: A Unified Approach, 3rd ed., John Wiley & Sons (2000)
13. Luke, S.: Strongly-Typed, Multithreaded C Genetic Programming Kernel. http://www.cs.umd.edu/users/seanl/gp/patched-gp/ (1997)
14. Zongker, D., Punch, W.: lil-gp 1.1 User's Manual. Michigan State University (1996)
Dynamic Maximum Tree Depth
A Simple Technique for Avoiding Bloat in Tree-Based GP

Sara Silva (1) and Jonas Almeida (1,2)

(1) Biomathematics Group, Instituto de Tecnologia Química e Biológica, Universidade Nova de Lisboa, PO Box 127, 2780-156 Oeiras, Portugal. [email protected] http://www.itqb.unl.pt:1111
(2) Dept Biometry & Epidemiology, Medical Univ South Carolina, 135 Cannon Street Suite 303, PO Box 250835, Charleston SC 29425, USA. [email protected] http://bioinformatics.musc.edu
Abstract. We present a technique, designated as dynamic maximum tree depth, for avoiding excessive growth of tree-based GP individuals during the evolutionary process. This technique introduces a dynamic tree depth limit, very similar to the Koza-style strict limit except in two aspects: it is initially set with a low value; it is increased when needed to accommodate an individual that is deeper than the limit but is better than any other individual found during the run. The results show that the dynamic maximum tree depth technique efficiently avoids the growth of trees beyond the necessary size to solve the problem, maintaining the ability to find individuals with good fitness values. When compared to lexicographic parsimony pressure, dynamic maximum tree depth proves to be significantly superior. When both techniques are coupled, the results are even better.
1 Introduction
E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 1776–1787, 2003. © Springer-Verlag Berlin Heidelberg 2003

Genetic Programming (GP) solves complex problems by evolving populations of computer programs, using Darwinian evolution and Mendelian genetics as inspiration. Its search space is potentially unlimited and programs may grow in size during the evolutionary process. Code growth is a healthy result of genetic operators in search of better solutions. Unfortunately, it also permits the appearance of pieces of redundant code, called introns, which increase the size of programs without improving their fitness. Besides consuming precious time in an already computationally intensive process, introns may start growing rapidly, a situation known as bloat. Several plausible explanations for this phenomenon have been proposed (reviews in [9,12]), and a combination of factors may be involved, but whatever the reasons may be, the fact remains that bloat is a serious problem that may stagnate the evolution, preventing the algorithm from finding better solutions. Several techniques have been used in the attempt to control bloat (reviews in [7,11]), most of them based on parsimony pressure. This paper presents a
technique especially suited for tree representations in GP, based on tree depth limitations. The idea is to introduce a dynamic tree depth limit, very similar to the Koza-style strict limit [3] except in two aspects: it is initially set with a low value; it is increased when needed to accommodate an individual that is deeper than the limit but is better than any other individual found during the run. Unlike some of the most recent successful approaches, this technique does not require especially designed operators [6,14] and its performance is not reduced in the presence of populations where all individuals have different fitness values [7]. It is also simple to implement and can easily be coupled with other techniques to join forces in the battle against bloat.
2 Introns
In GP, the term intron usually refers to a piece of redundant code that can be removed from an individual without affecting its fitness [1]. The term exon refers to all non-intron code. However, on designing a procedure for detecting introns in GP trees, a more precise definition arises. Our procedure for detecting introns works recursively in a tree from its root to its terminals, trying to maximize the number of introns detected. It can be described by the following pseudo-code, where nintrons designates the number of introns detected:

    If tree is a terminal node:
        nintrons = 0
    Otherwise:
        Evaluate tree and all its branches
        If none of the branches returns the same evaluation as tree:
            nintrons = sum of the number of introns in each branch
        Otherwise:
            Count nodes and introns in branches with same evaluation as tree
            Pick branch with lowest number of non-intron nodes: ibranch
            nintrons = all nodes in tree minus non-intron nodes in ibranch

This does not detect introns in code like not(not(X)) or -(-(X)), although replacing such code with the innermost X would not affect the fitness of the individual. We use the term redundant complexity to designate these and other cases where a tree could be replaced by a smaller tree (but not by one of its branches) without affecting the fitness of the individual. Redundant complexity is also a problem in GP [10]. We use the term bloat loosely to designate a rapid growth of the mean population size in terms of the number of nodes that make up the trees, whether it happens in the beginning or in the end of the run, and whether the growth is caused by introns or exons.
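The procedure above can be sketched in Python for trees written as nested tuples such as ('+', 'x', ('*', 'x', '0')); the representation, the operator set and the evaluate helper are illustrative assumptions, not the authors' implementation:

```python
# Sketch of the intron-detection procedure, for trees represented as
# nested tuples whose first element is the operator symbol.
OPS = {'+': lambda a, b: a + b, '*': lambda a, b: a * b}

def evaluate(tree, x):
    if not isinstance(tree, tuple):          # terminal: variable or constant
        return x if tree == 'x' else float(tree)
    op, *children = tree
    return OPS[op](*(evaluate(c, x) for c in children))

def signature(tree, points):
    # A branch "returns the same evaluation" as the whole tree when its
    # outputs coincide on all fitness cases.
    return tuple(evaluate(tree, x) for x in points)

def count_nodes(tree):
    if not isinstance(tree, tuple):
        return 1
    return 1 + sum(count_nodes(c) for c in tree[1:])

def count_introns(tree, points):
    if not isinstance(tree, tuple):          # terminal node: no introns
        return 0
    sig = signature(tree, points)
    same = [c for c in tree[1:] if signature(c, points) == sig]
    if not same:                             # no equivalent branch: recurse
        return sum(count_introns(c, points) for c in tree[1:])
    # Pick the equivalent branch with the fewest non-intron nodes:
    # every node outside that branch's exons counts as an intron.
    best = min(same, key=lambda c: count_nodes(c) - count_introns(c, points))
    non_introns = count_nodes(best) - count_introns(best, points)
    return count_nodes(tree) - non_introns
```

For example, in x + (x * 0) the whole subtree (x * 0), the addition node and one child are introns, because the branch x alone reproduces the tree's outputs.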
3 Dynamic Maximum Tree Depth
Dynamic maximum tree depth is a technique that introduces a dynamic limit to the maximum depth of the individuals allowed in the population. It is similar to the Koza-style depth limiting technique, but it does not replace it: both dynamic and strict limits are used in conjunction. There is also an initial depth limit imposed on the trees that form the initial population [3], which must not be mistaken for the initial value of the dynamic limit. The dynamic limit should initially be set with a low value, at least as high as the initial tree depth. It can increase during the run, but it always lies somewhere between the initial tree depth and the strict Koza limit. Whenever a new individual is created by the genetic operators, these rules apply:

- If the new individual is deeper than the strict Koza depth limit, reject it and consider one of its parents for the new generation instead.
- If the new individual is no deeper than the dynamic depth limit, consider it acceptable to participate in the new generation.
- If the new individual is deeper than the dynamic limit (but no deeper than the strict Koza limit), measure its fitness and:
  - if the individual has better fitness than the current best individual of the run, increase the dynamic limit to match the depth of the individual and consider it acceptable to the new generation;
  - if the individual is not better than the current best of run, leave the dynamic limit unchanged and reject the individual, considering one of its parents for the new generation instead.

Once increased, the dynamic limit will not be lowered again. If and when the dynamic limit reaches the same value as the strict Koza limit, both limits become one and the same. By setting the dynamic limit to the same value as the strict Koza limit, one is in fact switching it off, making the algorithm behave as if there were no dynamic limit.
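The acceptance rules above can be sketched as a single function; accept, state and the depth/fitness helpers are hypothetical names, and lower fitness is assumed to be better, as in the paper's problems:

```python
KOZA_LIMIT = 17  # strict Koza-style depth limit used in the paper

def accept(child, parent, state, depth, fitness):
    """Return the individual that enters the new generation, updating
    state = {'dynamic_limit': ..., 'best_fitness': ...} in place."""
    d = depth(child)
    if d > KOZA_LIMIT:                   # beyond the strict limit: reject
        return parent
    if d <= state['dynamic_limit']:      # within the dynamic limit: accept
        return child
    f = fitness(child)                   # deeper than the dynamic limit:
    if f < state['best_fitness']:        # accept only a new best-of-run,
        state['dynamic_limit'] = d       # raising the limit to its depth
        state['best_fitness'] = f
        return child
    return parent                        # otherwise reject, keep a parent
```

Note that fitness is only measured when the child actually exceeds the dynamic limit, so the extra evaluations vanish once the population settles below it.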
The dynamic maximum depth technique is easy to implement and can be used with any set of parameters and/or in conjunction with other techniques for controlling bloat. Furthermore, it may be used for another purpose besides controlling bloat. In real world applications, one may not be interested or able to invest a large amount of time in achieving the best possible solution, particularly in approximation problems. Instead, one may only consider a solution to be acceptable if it is sufficiently simple to be comprehended, even if its accuracy is known to be worse than the accuracy of other more complex solutions. Choosing less stringent stop conditions to allow the algorithm to stop earlier is not enough to ensure that the resulting solution will be acceptable, as it cannot predict its complexity. By starting with a low dynamic limit for tree depth and repeatedly raising it as more complex solutions prove to be better than simpler ones, the dynamic maximum tree depth technique can in fact provide a series of solutions of increasing complexity and accuracy, from which the user may choose the most adequate one. Once again, it is important to choose a low value for the initial dynamic depth limit, to force the algorithm to look for simpler solutions before adopting more complex ones.
4 Experiments
To test the efficacy of the dynamic maximum tree depth technique we have decided to perform the same battery of tests using six different approaches:

(1) Koza-style tree depth limiting alone [3];
(2) lexicographic parsimony pressure (coupled with the Koza-style limit, which yielded better results than lexicographic parsimony pressure alone, with no bucketing) [7];
(3) dynamic maximum tree depth (also coupled with the Koza-style limit, as described in the previous section) with initial dynamic limit 9;
(4) dynamic maximum tree depth, with initial dynamic limit 9, coupled with lexicographic parsimony pressure;
(5) dynamic maximum tree depth with initial dynamic limit 6 (the same as the initial tree depth);
(6) dynamic maximum tree depth, with initial dynamic limit 6, coupled with lexicographic parsimony pressure.

Two different problems were chosen for the experiments: Symbolic Regression of the quartic polynomial (x^4 + x^3 + x^2 + x, with 21 equidistant points in the interval −1 to +1), and Even-3 Parity. These problems are very different in the number of possible fitness values of the evolved solutions: from a potentially infinite number in Symbolic Regression to very few possible values in Even-3 Parity. This facilitates the comparison with lexicographic parsimony pressure because it covers the two domains where this technique, due to its characteristics, behaved most differently [7]. A total of 30 runs were performed with each technique for each problem. All the runs used populations of 500 individuals evolved during 50 generations, even when an optimal solution was found earlier. The parameters adopted for the experiments were essentially the same as in [3] and [7], to facilitate the comparison between the techniques, but a few differences must be noted. Although some effort was put into promoting the diversity of the initial population, the tree initialization procedure used (Ramped Half-and-Half [3]) did not guarantee that all individuals were different.
For 500 individuals and initial trees of maximum depth 6, the mean diversity of the initial population was roughly 75% for Symbolic Regression and 80% for Even-3 Parity, where diversity is the percentage of individuals in the population that account for the total number of different individuals (based on variety in [5]). Standard tree crossover was used, but with total random choice of the crossover points, independently of being internal or terminal nodes. As stated in the previous section, the dynamic maximum tree depth limit can be switched on and off without affecting the rest of the algorithm, and switching it off amounts to setting it to the same value as the Koza-style limit, which was always set to depth 17. Likewise, lexicographic parsimony pressure can be switched on by using a modified tournament selection that prefers smaller trees when their fitness values are the same, or switched off by using the standard tournament selection. Table 1 summarizes the combinations of tournament and initial dynamic limit that lead to each technique. All the experiments were performed with GPLAB [8], a GP Toolbox for MATLAB [13]. Statistical significance of the null hypothesis of no difference was determined with ANOVAs at p = 0.01.
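The diversity measure used above amounts to the fraction of distinct genotypes in the population; a one-line sketch, assuming genotypes are hashable values:

```python
def diversity(population):
    # Percentage of individuals that account for the total number of
    # distinct individuals: 500 individuals with 375 distinct genotypes
    # give a diversity of 75%.
    return 100.0 * len(set(population)) / len(population)
```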
Table 1. Tournament and initial dynamic limit settings for each technique used

    Technique                                   Tournament   Initial dynamic limit
    (1) Koza-style tree depth limiting (K)      standard     17
    (2) Lexicographic parsimony pressure (L)    modified     17
    (3) Dynamic maximum tree depth 9 (S9)       standard     9
    (4) L+S9                                    modified     9
    (5) Dynamic maximum tree depth 6 (S6)       standard     6
    (6) L+S6                                    modified     6

5 Results
The results of the experiments are presented as boxplots and evolution curves concerning the mean size of the trees (Fig. 1), where size is the number of nodes, the mean percentage of introns in the trees (Fig. 2), the population diversity as defined in the last section (Fig. 3), and best (lowest) fitness achieved (Fig. 4). The boxplots are based on the mean values (except Fig. 4, based on the best value) of all generations of each run, and each technique is represented by a box and pair of whiskers. Each box has lines at the lower quartile, median, and upper quartile values, and the whiskers mark the furthest value within 1.5 of the quartile ranges. Outliers are represented by + and × marks the mean. The evolution curves represent each generation separately (averaged over all runs), one line per technique. Throughout the text, we use the acronyms introduced in Table 1 to designate the different techniques.

5.1 Mean Size of Trees
Figure 1 shows the results concerning the mean size of trees. In the Symbolic Regression problem, techniques S9 and S6 performed equally well and were able to significantly lower mean tree sizes of run when compared to K and L. None of the compound techniques (L+S9, L+S6) proved to be significantly different from its counterpart (S9 and S6, respectively). The evolution curves show a clear difference in growth rate between the four techniques that use dynamic maximum tree depth (S9, L+S9, S6, L+S6) and the two that do not (K, L). In the Even-3 Parity problem, techniques S9 and S6 were significantly different from each other. Both outperformed K, but only S6 performed better than L (no significant difference between S9 and L). However, both compound techniques (L+S9, L+S6) were able to outperform L, as well as their respective counterparts S9 and S6. The evolution curves show a tendency for stabilization of growth in all three techniques that use lexicographic parsimony pressure (L, L+S9, L+S6).

5.2 Mean Percentage of Introns
Fig. 1. Boxplots and evolution curves of mean tree size. See Table 1 for the names of the techniques

Figure 2 shows the results concerning the mean percentage of introns. Starting with Symbolic Regression, the striking fact about this problem is that it practically does not produce introns, redundant complexity being the only apparent cause of bloat. The differences between techniques in mean intron percentage of run were not significant. The evolution curves show that the percentage of introns tends to slowly increase after an initial reduction.

Fig. 2. Boxplots and evolution curves of mean intron percentage. See Table 1 for the names of the techniques

In the Even-3 Parity problem the percentage of introns is generally high. Techniques S9 and S6 outperformed K, and S6 also outperformed L, with no
significant difference between S9 and L (similarly to what had been observed in mean tree size). The compound techniques (L+S9, L+S6) showed no significant differences from their counterparts (respectively S9 and S6), but both were able to significantly outperform L. The evolution curves show that, by the end of the run, the mean intron percentage has not increased from its initial value, in all techniques except K.
Fig. 3. Boxplots and evolution curves of population diversity. See Table 1 for the names of the techniques

5.3 Population Diversity
Figure 3 shows the results concerning the population diversity. In Symbolic Regression, techniques S9 and S6 caused a significant decrease in population diversity when compared to K and L. None of the compound techniques (L+S9, L+S6) was significantly different from its counterpart (S9 and S6, respectively). The evolution curves show a conspicuous decrease in population diversity in the beginning of the run, promptly recovered afterwards. The two techniques that do not use dynamic maximum tree depth (K, L) are able to sustain much higher population diversity in later generations.

Fig. 4. Boxplots and evolution curves of best fitness. See Table 1 for the names of the techniques

In the Even-3 Parity problem, once again S9 and S6 caused a significant decrease in population diversity when compared to K, but showed significantly different performances when compared to each other. While S9 outperformed L,
maintaining significantly higher population diversity, S6 was outperformed by L. Both compound techniques (L+S9, L+S6) significantly lowered population diversity when compared to L and to their counterparts S9 and S6. The evolution curves for this problem also show an abrupt loss of diversity in the beginning of the run, but only in the techniques that use lexicographic parsimony pressure (L, L+S9, L+S6). In this problem, techniques using dynamic maximum tree depth (S9, L+S9, S6, L+S6) also result in lower population diversity in later generations, with both S6 and L+S6 yielding the worst results. An initial decrease in population diversity is expected in problems where the number of possible fitness values is unlimited, as in Symbolic Regression. Since most of the random initial trees (particularly the larger ones) have very poor fitness, selection quickly eliminates them from the population in favor of a more homogeneous group of better (and usually smaller) individuals. It can be observed that the abrupt loss of diversity shown in Fig. 3 occurs at the same time in the run as the decrease in mean tree size shown in Fig. 1 (evolution curve on the left). In the Even-3 Parity problem, the number of possible fitness values is very limited, so the initial trees never have very poor fitness. However, several different trees have the same fitness value, and lexicographic parsimony pressure quickly eliminates the larger trees in favor of the smaller ones, thus producing the same effect (which also explains the initial decrease of the intron percentage that can be observed in Fig. 2, evolution curve on the right). This behavior has not been reported before. A higher number of clonings resulting from unsuccessful crossovers, or simply the reduction of the search space, caused by the introduction of the dynamic limit, may account for the persistent lower diversity observed in the later stages of the run, with all techniques using dynamic maximum tree depth.

5.4 Best Fitness
Figure 4 shows the results concerning the best fitness. Although there has been some concern regarding whether depth restrictions may affect GP performance negatively when using crossover [2,4], in both Symbolic Regression and Even-3 Parity none of the techniques was significantly different from the others. The evolution curves indicate that convergence to good solutions was not a problem.
6 Conclusions and Future Work
The dynamic maximum tree depth technique was able to effectively control code growth in both Symbolic Regression and Even-3 Parity problems. In spite of a noticeable decrease in population diversity, the ability to find individuals with good fitness values was not compromised in these two problems. Dynamic maximum tree depth was clearly superior to lexicographic parsimony pressure in the Symbolic Regression problem. In the Even-3 Parity problem, where the differences between both techniques were not always significant, the combination of both outperformed the use of either one separately.
There seems to be no disadvantage in using the lower, and most limitative, value for the initial depth limit. The size of the trees was kept low without losing convergence into good solutions. However, the persistent lower population diversity observed when using the dynamic limit, particularly with the more limitative setting, may become an impediment to good convergence in smaller populations, which may require a higher value for the initial depth limit in order to maintain the convergence ability. Further work must be carried out in order to confirm or reject this hypothesis. The performance of the dynamic maximum tree depth technique must also be checked in harder problems.

Acknowledgements. This work was partially supported by grants QLK2-CT-2000-01020 (EURIS) from the European Commission and POCTI/1999/BSE/34794 (SAPIENS) from Fundação para a Ciência e a Tecnologia, Portugal. We would like to thank Ernesto Costa (Universidade de Coimbra), Pedro J.N. Silva (Universidade de Lisboa), and all the anonymous reviewers for carefully reading the manuscript and providing many valuable suggestions.
References

1. Banzhaf, W., Nordin, P., Keller, R.E., Francone, F.D.: Genetic programming – an introduction, San Francisco, CA. Morgan Kaufmann (1998)
2. Gathercole, C., Ross, P.: An adverse interaction between crossover and restricted tree depth in genetic programming. In Koza, J.R., Goldberg, D.E., Fogel, D.B., Riolo, R.L., editors, Proceedings of GP'96, Cambridge, MA. MIT Press (1996) 291–296
3. Koza, J.R.: Genetic programming – on the programming of computers by means of natural selection, Cambridge, MA. MIT Press (1992)
4. Langdon, W.B., Poli, R.: An analysis of the MAX problem in genetic programming. In Koza, J.R., Deb, K., Dorigo, M., Fogel, D.B., Garzon, M., Iba, H., Riolo, R.L., editors, Proceedings of GP'97, San Francisco, CA. Morgan Kaufmann (1997) 222–230
5. Langdon, W.B.: Genetic Programming + Data Structures = Automatic Programming!, Boston, MA. Kluwer (1998)
6. Langdon, W.B.: Size fair and homologous tree crossovers for tree genetic programming. Genetic Programming and Evolvable Machines, 1 (2000) 95–119
7. Luke, S., Panait, L.: Lexicographic parsimony pressure. In Langdon, W.B., Cantú-Paz, E., Mathias, K., Roy, R., Davis, D., Poli, R., Balakrishnan, K., Honavar, V., Rudolph, G., Wegener, J., Bull, L., Potter, M.A., Schultz, A.C., Miller, J.F., Burke, E., Jonoska, N., editors, Proceedings of GECCO-2002, San Francisco, CA. Morgan Kaufmann (2002) 829–836
8. Silva, S.: GPLAB – a genetic programming toolbox for MATLAB. (2003) http://www.itqb.unl.pt:1111/gplab/
9. Soule, T.: Code growth in genetic programming. PhD thesis, University of Idaho (1998)
10. Soule, T.: Exons and code growth in genetic programming. In Foster, J.A., Lutton, E., Miller, J., Ryan, C., Tettamanzi, A.G.B., editors, Proceedings of EuroGP-2002, Berlin. Springer (2002) 142–151
11. Soule, T., Foster, J.A.: Effects of code growth and parsimony pressure on populations in genetic programming. Evolutionary Computation, 6(4) (1999) 293–309
12. Soule, T., Heckendorn, R.B.: An analysis of the causes of code growth in genetic programming. Genetic Programming and Evolvable Machines, 3 (2002) 283–309
13. The MathWorks. (2003) http://www.mathworks.com
14. Van Belle, T., Ackley, D.H.: Uniform subtree mutation. In Foster, J.A., Lutton, E., Miller, J., Ryan, C., Tettamanzi, A.G.B., editors, Proceedings of EuroGP-2002, Berlin. Springer (2002) 152–161
Difficulty of Unimodal and Multimodal Landscapes in Genetic Programming

Leonardo Vanneschi (1), Marco Tomassini (1), Manuel Clergue (2), and Philippe Collard (2)

(1) Computer Science Institute, University of Lausanne, Lausanne, Switzerland
(2) I3S Laboratory, University of Nice, Sophia Antipolis, France
Abstract. This paper presents an original study of fitness distance correlation as a measure of problem difficulty in genetic programming. A new definition of distance, called structural distance, is used and suitable mutation operators for the program space are defined. The difficulty is studied for a number of problems, including, for the first time in GP, multimodal ones, both for the new hand-tailored mutation operators and standard crossover. Results are in agreement with empirical observations, thus confirming that fitness distance correlation can be considered a reasonable index of difficulty for genetic programming, at least for the set of problems studied here.
1 Introduction
The fitness distance correlation (fdc) coefficient has been used as a tool for measuring problem difficulty in genetic algorithms (GAs) and genetic programming (GP) with controversial results: some counterexamples have been found for GAs [15], but fdc has been proven a useful measure on a large number of GA (see for example [3] or [9]) and GP functions (see [2,16]). In particular, Clergue and coworkers ([2]) have shown fdc to be a reasonable way of quantifying problem difficulty in GP for a set of functions. In this paper, we use a measure of structural distance for trees (see [7]) to calculate fdc. Then, we employ fdc to measure problem difficulty for two kinds of GP processes: GP using the standard Koza crossover as the only genetic operator (which we call from now on standard GP) and GP using two new mutation operators based on the transformations on which structural distance is defined (structural mutation genetic programming from now on). The test problems that we have chosen are unimodal and multimodal trap functions, royal trees and two versions of the MAX problem. The present study is the first attempt to quantify by fdc the problem difficulty of functions with multiple global optima with the same fitness in evolutionary algorithms. This paper is structured as follows: the next section gives a short description of the structural tree distance used in the paper, followed by the definition of the basic mutation operators that go hand-in-hand with it. Section 4 presents the main results and their discussion for a number of GP problems. Finally, Sect. 5 gives our conclusions and hints at future work.

E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 1788–1799, 2003. © Springer-Verlag Berlin Heidelberg 2003
2 Distance Measure for Genetic Programs
In GAs individuals are represented as strings of digits and typical distance measures are Hamming distance or alternation. Defining a distance between genotypes in GP is much more difficult, given the tree structure of the individuals. In [2] an ad hoc distance between trees was used. In the spirit of the fdc definition [9], it would be better to use GP operators that have a direct counterpart in the tree distance. For that reason, we adopt the structural distance for trees proposed in [7] and we define corresponding operators (see next section). According to this measure, given the sets F and T of functions and terminal symbols, a coding function c must be defined such that c : {T ∪ F} → ℕ. One can think of many ways for the specification of c, for example the "complexity" of the primitives or their arity. The distance between two trees T1 and T2 is calculated in three steps: (1) T1 and T2 are overlapped at the root node and the process is applied recursively starting from the leftmost subtrees (see [7] for a description of the overlapping algorithm). (2) For each pair of nodes at matching positions, the difference of their codes (eventually raised to an exponent) is computed. (3) The differences computed in the previous step are combined in a weighted sum. This gives for the distance between two trees T1 and T2 with roots R1 and R2 the following expression:

    dist(T1, T2) = d(R1, R2) + k · Σ_{i=1}^{m} dist(child_i(R1), child_i(R2))    (1)
where d(R1, R2) = (|c(R1) − c(R2)|)^z, child_i(Y) is the i-th of the m possible children of a generic node Y, if i ≤ m, or the empty tree otherwise, and c evaluated on the root of an empty tree is 0. Constant k is used to give different weights to nodes belonging to different levels and z is a constant usually chosen in such a way that z ∈ ℕ. In most of this paper, except for the MAX function, individuals will be coded using the same syntax as in [2] and [14], i.e. considering a set of functions A, B, C, etc. with increasing arity (i.e. arity(A) = 1, arity(B) = 2, and so on) and a single terminal X (i.e. arity(X) = 0) as follows: F = {A, B, C, D, ...}, T = {X}, and the c function will be defined as follows: ∀x ∈ {F ∪ T}, c(x) = arity(x) + 1. In our experiments we will always set k = 1/2 and z = 2. By keeping 0 < k < 1, the differences near the root have higher weight. This is convenient for GP as it has been noted that programs converge quickly to a fixed root portion [11].
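Equation (1) can be sketched as a short recursive function; the nested-tuple tree representation and the ARITY table are illustrative assumptions on top of the paper's definitions:

```python
# Structural distance of Eq. (1), for trees written as nested tuples
# whose first element is the node's symbol, e.g. ('B', ('X',), ('X',)),
# with the coding function c(x) = arity(x) + 1 used in the paper.
ARITY = {'X': 0, 'A': 1, 'B': 2, 'C': 3, 'D': 4}

def code(node):
    return ARITY[node] + 1

def dist(t1, t2, k=0.5, z=2):
    # Empty trees (None) have code 0; overlapping starts at the roots.
    c1 = code(t1[0]) if t1 else 0
    c2 = code(t2[0]) if t2 else 0
    d = abs(c1 - c2) ** z
    m = max(len(t1) - 1 if t1 else 0, len(t2) - 1 if t2 else 0)
    for i in range(1, m + 1):
        child1 = t1[i] if t1 and i < len(t1) else None
        child2 = t2[i] if t2 and i < len(t2) else None
        d += k * dist(child1, child2, k, z)   # weighted recursive sum
    return d
```

With the paper's settings k = 1/2 and z = 2, the distance between A(X) and the single terminal X is 1 at the roots plus 0.5 for the unmatched leaf, i.e. 1.5.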
3 Structural Mutation Operators
In [13] O’Reilly used the edit distance, which is related to, but not identical with the tree distance employed here. She also defined suitable edit operators and related them to GP mutation. She used the edit distance for a different purpose: how to measure and control the step size of crossover in order to balance explotation and exploration in GP. She also suggested that edit distance could
be useful in the study of GP fitness landscapes, but did not develop the issue. Here we take up exactly this last point. The following operators have been inspired by the distance definition presented in the last section and by the work in [13]. Given the sets F and T and the coding function c defined in Sect. 2, we define c_max (respectively, c_min) as the maximum (respectively, the minimum) value taken by c on the domain {F ∪ T}. Moreover, given a symbol n such that n ∈ {F ∪ T} and c(n) < c_max and a symbol m such that m ∈ {F ∪ T} and c(m) > c_min, we define succ(n) as a primitive such that c(succ(n)) = c(n) + 1 and pred(m) as a primitive such that c(pred(m)) = c(m) − 1. Then we can define the following editing operators on a generic tree T:

- inflate mutation: a primitive labelled with a symbol n such that c(n) < c_max is selected in T and replaced by succ(n). A new random terminal node is added to this new node in a random position (i.e. the new terminal becomes the i-th child of succ(n), where i lies between 0 and arity(n)).
- deflate mutation: a primitive labelled with a symbol m such that c(m) > c_min, and such that at least one of its children is a leaf, is selected in T and replaced by pred(m). A random leaf among the children of this node is deleted from T.

The terms inflate and deflate have been used to avoid confusion with the similar and well known grow and shrink mutations that have already been proposed in GP. The following property holds (the proof appears in [17]):

Property 1. Distance/Operator Consistency. Consider the sets F and T and the coding function c defined in Sect. 2. Let T1 and T2 be two trees composed of symbols belonging to {F ∪ T}, and let the constants k and z of definition (1) be equal to 1. If dist(T1, T2) = D, then T2 can be obtained from T1 by a sequence of D/2 editing operations, where an editing operation can be an inflate mutation or a deflate mutation.
From this property, we conclude that the operators of inflate mutation and deflate mutation are completely coherent with the notion of distance defined in Sect. 2, i.e. an application of these operators allows us to move in the search space from a tree to its neighbors according to that distance. We call the new GP process based on these operators structural mutation genetic programming (SMGP).
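The two operators can be sketched as follows for the F = {A, B, C, D}, T = {X} alphabet, applied at the root of a tree (the full operators select any eligible node; the tuple representation is an illustrative assumption, not the authors' code):

```python
import random

# succ/pred step through the primitives ordered by c(x) = arity(x) + 1.
SUCC = {'X': 'A', 'A': 'B', 'B': 'C', 'C': 'D'}
PRED = {v: k for k, v in SUCC.items()}

def inflate_root(tree, rng=random):
    # n -> succ(n), plus a new terminal child in a random position.
    sym, children = tree[0], list(tree[1:])
    assert sym in SUCC, "c(root) must be below c_max"
    children.insert(rng.randint(0, len(children)), ('X',))
    return (SUCC[sym],) + tuple(children)

def deflate_root(tree, rng=random):
    # m -> pred(m), deleting one randomly chosen leaf child.
    sym, children = tree[0], list(tree[1:])
    leaves = [i for i, c in enumerate(children) if c == ('X',)]
    assert sym in PRED and leaves, "root needs c > c_min and a leaf child"
    del children[rng.choice(leaves)]
    return (PRED[sym],) + tuple(children)
```

For instance, one inflate mutation turns A(X) into B(X, X), matching Property 1: with k = z = 1 the structural distance between those two trees is 2, i.e. D/2 = 1 editing operation.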
4 Experimental Results
In all experiments shown in the following, fdc has been calculated via a sampling of 40000 randomly chosen individuals. All the experiments have been done with generational GP, a total population size of 100 individuals, ramped half-and-half initialization, tournament selection of size 10. When standard GP has been used, the crossover rate has been set to 95% and the mutation rate to 0%, while
when SMGP has been employed, no crossover has been used and the rate of the two new mutation operators has been set to 95%. The GP process has been stopped either when a perfect solution has been found (global optimum) or when 500 generations have been executed. All experiments have been performed 100 times.

4.1 Fitness Distance Correlation
An approach proposed for GAs [9] states that an indication of problem hardness is given by the relationship between fitness and distance of the genotypes from known optima. Given a sample F = {f1, f2, ..., fn} of n individual fitnesses and a corresponding sample D = {d1, d2, ..., dn} of the n distances to the nearest global optimum, fdc is defined as:

    fdc = C_FD / (σ_F σ_D)

where C_FD = (1/n) Σ_{i=1}^{n} (fi − f̄)(di − d̄) is the covariance of F and D, and σ_F, σ_D, f̄ and d̄ are the standard deviations and means of F and D. Given the tree structure of genotypes, the normalization problem is not a trivial one in GP. Dividing all the distances in the sampling by the maximum of all the possible distances between two trees in the search space is not a practically applicable method, since it would give too large a number for the typically used maximum tree depth and operator arity values. This problem has been tackled in [4]. In [2], Clergue and coworkers used a "sufficiently large" integer constant to obtain the normalisation of distances. Here, similarly to [4], we obtain the normalized distance between two trees by dividing their distance by the maximum value between the distances of the same two trees from the empty one. As suggested in [9], GA problems can be empirically classified in three classes, depending on the value of the fdc coefficient: misleading (fdc ≥ 0.15), in which fitness increases with distance, unknown (−0.15 < fdc < 0.15), in which there is virtually no correlation between fitness and distance, and straightforward (fdc ≤ −0.15), in which fitness increases as the global optimum approaches. The second class corresponds to problems for which the difficulty can't be estimated, because fdc doesn't bring any information. In this case, examination of the fitness-distance scatterplot may give information on problem difficulty (see [9]).
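The coefficient and the three-class rule can be computed directly from the two samples; a minimal sketch:

```python
import math

def fdc(fitnesses, distances):
    # fdc = C_FD / (sigma_F * sigma_D), with population (1/n) moments.
    n = len(fitnesses)
    f_mean = sum(fitnesses) / n
    d_mean = sum(distances) / n
    cov = sum((f - f_mean) * (d - d_mean)
              for f, d in zip(fitnesses, distances)) / n
    sd_f = math.sqrt(sum((f - f_mean) ** 2 for f in fitnesses) / n)
    sd_d = math.sqrt(sum((d - d_mean) ** 2 for d in distances) / n)
    return cov / (sd_f * sd_d)

def classify(value):
    # Empirical difficulty classes suggested in [9].
    if value >= 0.15:
        return 'misleading'
    if value <= -0.15:
        return 'straightforward'
    return 'unknown'
```

For a minimization problem, a sample whose fitness improves steadily as the distance to the optimum shrinks yields fdc close to −1, i.e. straightforward.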
Unimodal Trap Functions
Trap functions [5] allow one to define the fitness of the individuals as a function of their distance from the optimum, and the difficulty of trap functions can be changed by simply modifying some parameters. A function f : distance → fitness is a unimodal trap function if it is defined in the following way:

    f(d) = 1 − d/B                  if d ≤ B
    f(d) = R · (d − B)/(1 − B)      otherwise

where d is the distance of the current individual from the unique global optimum, normalized so as to belong to the range [0, 1], and B and R are constants
L. Vanneschi et al.
Fig. 1. (a): fdc values for some trap functions obtained by changing the values of the constants B and R. (b): Performance (p) values of SMGP for traps. (d): Performance (p) values of standard GP for traps. (c): Structure of the tree used as optimum in the experiments reported in (a), (b) and (d).
Fig. 2. The trees used as optima in experiments analogous to the ones of Fig. 1.
belonging to [0, 1]. B allows one to set the width of the attractive basin of the optimum and R sets its relative importance. By construction, the difficulty of trap functions decreases as the value of B increases, and increases as the value of R decreases. For a more detailed explanation of trap functions see for instance [2]. Figure 1 shows values of the performance p (defined as the proportion of the 100 runs for which the global optimum has been found in less than 500 generations) and of fdc for various trap functions obtained by changing the values of the constants B and R. These experiments have been performed considering the tree represented in Fig. 1c as the global optimum. The same experiments have been performed using as global optimum the trees shown in Fig. 2, and the results (not shown here for lack of space) are qualitatively analogous to those of Fig. 1. In all cases fdc is confirmed to be a reasonable
Difficulty of Unimodal and Multimodal Landscapes in Genetic Programming
measure for quantifying problem difficulty both for SMGP and standard GP for unimodal trap functions.

4.3 "W" Multimodal Trap Functions
Unimodal trap functions described in Sect. 4.2 are characterized by the presence of a unique global optimum. The functions used here, first proposed in [6] (informally called "W" trap functions, given their typical shape shown in Fig. 3), are characterized by the presence of several global optima. They depend on 5 constants, called B1, B2, B3, R1 and R2, and they can be expressed by the following formula:

    f(d) = 1 − d/B1                     if d ≤ B1
    f(d) = R1 · (d − B1)/(B2 − B1)      if B1 ≤ d ≤ B2
    f(d) = R1 · (B3 − d)/(B3 − B2)      if B2 ≤ d ≤ B3
    f(d) = R2 · (d − B3)/(1 − B3)       otherwise

where B1, B2, B3, R1 and R2 are constants belonging to the interval [0, 1] and the property B1 ≤ B2 ≤ B3 must hold. A systematic study of problem difficulty for multimodal landscapes via an algebraic indicator such as fdc has, to our knowledge, never been performed before in evolutionary computation. Our study has been performed as follows: we choose a particular tree belonging to the search space and call it the origin. We then artificially assign the maximum fitness (= 1) to this tree, so as to make it a global optimum. Then, if we set the
Fig. 3. Graphical representation of a “W” trap function with B1 = 0.1, B2 = 0.3, B3 = 0.7, R1 = 1, R2 = 0.7. Note that distances and fitness are normalized into the interval [0, 1].
Fig. 4. Two trees having a normalized distance equal to 0.5 between them
R1 constant to 1, all the trees having a normalized distance equal to B2 from the origin, given the definition of "W" trap functions, have a fitness equal to 1 and thus are global optima too. Two series of experiments have been performed, using both SMGP and standard GP. In the first one, the tree represented in Fig. 4a is considered as the origin (let's call this tree T1). Since the tree in Fig. 4b (which we call T2) has a normalized distance of 0.5 from T1, if we set R1 to 1 and B2 to 0.5 then T2 is a global optimum too. To calculate the fdc, the minimum of the distances from T1 and T2 to each tree in the sampling is considered (as suggested for GAs by Jones in [9], though never tested there in practice). The distribution of these "minimum distances" has been studied on a sample of 40000 individuals and it appears to have a regular shape, as is the case for unimodal trap functions (results not shown here for lack of space). Since this series of experiments represents a first attempt to use fdc to measure the difficulty of fitness landscapes with more than one global optimum, and since we want to be able to study the dynamics of the GP process in detail, we have decided, as a first step, to use fitness landscapes containing only two global optima. Thus, a random normalized fitness different from 1 has been arbitrarily assigned to each tree having a distance equal to 0.5 from T1 and a genotype different from T2. This choice, of course, alters the "W" trap function definition and influences the fitness landscape, even though only marginally. In any case, we consider this choice perfectly "fair", given that we are not interested in testing fdc on standard benchmarks, but rather on artificially defined fitness landscapes. Table 1 shows a subset of the experimental results that we have obtained with this first series of tests, by setting B2 to 0.5 and R1 to 1 and by varying the values of B1, B3 and R2.
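As an illustration, the "W" trap function defined above can be sketched in a few lines (a hypothetical `w_trap` helper; parameter names follow the formula). With R1 = 1, any tree at normalized distance B2 from the origin receives fitness 1, which is exactly the property that makes T2 a second global optimum:

```python
def w_trap(d, b1, b2, b3, r1, r2):
    """'W' trap fitness for a normalized distance d in [0, 1].
    Requires b1 <= b2 <= b3, all constants in [0, 1]."""
    if d <= b1:
        return 1.0 - d / b1                  # falling from the origin's peak
    elif d <= b2:
        return r1 * (d - b1) / (b2 - b1)     # rising to the second peak (height r1) at b2
    elif d <= b3:
        return r1 * (b3 - d) / (b3 - b2)     # falling again to 0 at b3
    else:
        return r2 * (d - b3) / (1.0 - b3)    # rising to r2 at d = 1
```

For example, with the parameters of Fig. 3 but B2 = 0.5 and R1 = 1, a tree at distance 0.5 gets fitness 1, the same as the origin itself.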
Given the enormous number of experiments performed (about 10000 GP runs) and the limited space available, we could not show all the results. Thus, we have decided to proceed as follows: for each trap function (identified by particular values of B1, B3 and R2), we have calculated the fdc. Then, we have discarded all the traps for which the fdc lies between −0.15 and 0.15, because no experimental result would give us any useful information for these functions (see Sect. 4.1). For a subset of the other trap functions, 100 runs have been performed with both SMGP and standard GP. Results shown in Table 1 are encouraging and suggest that fdc could be a reasonable measure to predict problem difficulty for some typical "W" trap functions. The results not shown in Table 1 are qualitatively similar and lead us to the same conclusions.
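The fdc computation of Sect. 4.1, together with the three-class prediction used to filter the traps, can be sketched as follows (illustrative helper names, not the authors' code):

```python
def fdc(fitnesses, distances):
    """Fitness distance correlation: fdc = C_FD / (sigma_F * sigma_D)."""
    n = len(fitnesses)
    f_mean = sum(fitnesses) / n
    d_mean = sum(distances) / n
    # C_FD is the covariance of the fitness sample F and the distance sample D
    c_fd = sum((f - f_mean) * (d - d_mean)
               for f, d in zip(fitnesses, distances)) / n
    sigma_f = (sum((f - f_mean) ** 2 for f in fitnesses) / n) ** 0.5
    sigma_d = (sum((d - d_mean) ** 2 for d in distances) / n) ** 0.5
    return c_fd / (sigma_f * sigma_d)

def fdc_class(coefficient):
    """Empirical difficulty classes of Sect. 4.1."""
    if coefficient >= 0.15:
        return "misleading"
    if coefficient <= -0.15:
        return "straightforward"
    return "unknown"
```

On a landscape where fitness falls linearly with distance to the optimum, fdc is −1 and the problem is classified as straightforward.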
Table 1. Results of fdc using SMGP and standard GP for a set of "W" trap functions where the origin is the tree shown in Fig. 4a and the second global optimum is the one shown in Fig. 4b; p stands for performance

Trap function                  fdc     fdc prediction   p (SMGP)   p (stGP)
B1 = 0.4, B3 = 0.9, R2 = 0.5   -0.62   straightf.       0.75       0.81
B1 = 0.5, B3 = 0.8, R2 = 0.4   -0.88   straightf.       0.98       0.94
B1 = 0.3, B3 = 0.9, R2 = 0.7   -0.61   straightf.       0.80       0.77
B1 = 0.2, B3 = 0.9, R2 = 0.1   -0.69   straightf.       0.72       0.91
B1 = 0.1, B3 = 0.9, R2 = 0.3   -0.72   straightf.       0.85       0.98
B1 = 0.5, B3 = 0.6, R2 = 0.9    0.34   misleading       0.33       0.20
B1 = 0.4, B3 = 0.6, R2 = 0.9    0.36   misleading       0.14       0.30
B1 = 0.3, B3 = 0.6, R2 = 0.9    0.33   misleading       0.31       0.13
Table 2. Results of fdc using SMGP and standard GP for a set of "W" trap functions where the origin is the tree shown in Fig. 5a and the second global optimum is the one shown in Fig. 5b; p stands for performance

Trap function                    fdc     fdc prediction   p (SMGP)   p (stGP)
B1 = 0.1, B3 = 0.1, R2 = 0.1     -0.76   straightf.       0.91       0.88
B1 = 0.1, B3 = 0.9, R2 = 0.1     -0.93   straightf.       0.97       0.99
B1 = 0.05, B3 = 0.7, R2 = 0.1    -0.90   straightf.       0.98       0.92
B1 = 0.1, B3 = 0.8, R2 = 0.5     -0.81   straightf.       0.90       0.83
B1 = 0.1, B3 = 0.5, R2 = 0.1     -0.71   straightf.       0.91       0.90
B1 = 0.05, B3 = 0.2, R2 = 0.9     0.89   misleading       0.02       0.13
B1 = 0.05, B3 = 0.4, R2 = 0.9     0.74   misleading       0.35       0.17
B1 = 0.05, B3 = 0.1, R2 = 0.6     0.78   misleading       0.25       0.23

Fig. 5. Two trees having a normalized distance equal to 0.1 between them
Analogous experiments have also been performed by considering the tree in Fig. 5a as the origin and the tree in Fig. 5b as the second global optimum. Since the normalized distance between these two trees is equal to 0.1, this time B2 is set to 0.1 and R1 to 1. Once again, a random normalized fitness different from 1 is assigned to all the trees having a distance equal to 0.1 from the one in Fig. 5a and a genotype different from the one in Fig. 5b. A subset of the results of this second series of experiments is shown in Table 2. These results are encouraging too and seem to confirm the validity of the fdc both for SMGP and standard GP. In any event, the amount of results shown in this section doesn't allow us to draw definitive conclusions: many more cases should be studied. For example, in the case of fitness landscapes containing two global optima, a larger set of
differently shaped trees should be considered as origin and as second global optimum. Moreover, fitness landscapes with more than two global optima should be investigated. However, a methodology for the study of the difficulty of multimodal landscapes has been proposed, and some non-trivial choices have been made, such as calculating fdc by considering, for every individual of the sampling, the minimum of its distances from the global optima. These choices needed experimental confirmation. The results shown here are encouraging and should pave the way for a deeper study of problem difficulty for multimodal landscapes.

4.4 Royal Trees
The next functions we study are the royal trees proposed by Punch et al. [14]. The language used is the same as in Sect. 2, and the fitness of a tree (or any subtree) is defined as the score of its root. Each function calculates its score by summing the weighted scores of its direct children. If the child is a perfect tree of the appropriate level (for instance, a complete level-C tree beneath a D node), then the score of that subtree, times a FullBonus weight, is added to the score of the root. If the child has a correct root but is not a perfect tree, then the weight is PartialBonus. If the child's root is incorrect, then the weight is Penalty. After scoring the root, if the function is itself the root of a perfect tree, the final sum is multiplied by CompleteBonus (see [14] for a more detailed explanation). Values used here are as in [14], i.e. FullBonus = 2, PartialBonus = 1, Penalty = 1/3, CompleteBonus = 2. Different experiments have been performed, each taking a different node as the node with the maximum arity allowed. Results are shown in Table 3. Predictions made by fdc for the level-A, level-B, level-C and level-D functions are correct. The level-E function is difficult for fdc to predict (i.e. no correlation between fitness and distance is observed). Finally, the level-F and level-G functions are predicted to be "misleading" (in accord with Punch in [14]) and they really are, since the global optimum is never found before 500 generations. The royal trees problem thus spans all the classes of difficulty described by the fdc.

4.5 MAX Problem
The task of the MAX problem for GP, defined in [8] and [10], is "to find the program which returns the largest value for a given terminal and function set

Table 3. Results of fdc for the Royal Trees; p stands for performance

Root   fdc      fdc prediction   p (SMGP)   p (stGP)
B      -0.31    straightf.       1          1
C      -0.25    straightf.       1          1
D      -0.20    straightf.       0.76       0.70
E       0.059   unknown          0          0.12
F       0.44    misleading       0          0
G       0.73    misleading       0          0
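The royal tree scoring of Sect. 4.4 can be sketched as follows. This is a simplified reading of Punch's rules under assumed conventions (an X leaf scores 1, a level-n function takes n children, and the "correct root" for a child is the function one level below its parent), not necessarily the exact benchmark implementation:

```python
from dataclasses import dataclass, field

FULL_BONUS, PARTIAL_BONUS, PENALTY, COMPLETE_BONUS = 2.0, 1.0, 1.0 / 3.0, 2.0
LEVELS = "XABCDEFG"  # level-0 terminal X, then A (arity 1), B (arity 2), ...

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)

def level(label):
    return LEVELS.index(label)

def is_perfect(node, lvl=None):
    """A perfect level-n tree has the level-n root and perfect level-(n-1) subtrees."""
    if lvl is None:
        lvl = level(node.label)
    if level(node.label) != lvl:
        return False
    if lvl == 0:
        return not node.children
    return len(node.children) == lvl and all(is_perfect(c, lvl - 1) for c in node.children)

def score(node):
    """Score of a (sub)tree root: weighted sum of the children's scores."""
    if node.label == "X":
        return 1.0
    total = 0.0
    for child in node.children:
        if level(child.label) == level(node.label) - 1:
            weight = FULL_BONUS if is_perfect(child) else PARTIAL_BONUS
        else:
            weight = PENALTY  # incorrect root for this position
        total += weight * score(child)
    if is_perfect(node):
        total *= COMPLETE_BONUS
    return total
```

Under these conventions a perfect A(X) tree scores 2 · 1 · 2 = 4, and a perfect level-B tree scores 2 · (2 · 4) · 2 = 32.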
Table 4. Results of fdc for the MAX problem using SMGP and standard GP. The first column shows the sets of functions and terminals used in the experiments; p stands for performance

MAX problem    fdc     fdc prediction   p (SMGP)   p (stGP)
{+} {1}        -0.87   straightf.       1          1
{+} {1,2}      -0.86   straightf.       1          1

with a depth limit d, where the root node counts as depth 0". We set d equal to 8 and we use the function set F = {+} and the terminal set T1 = {1} or T2 = {1, 2}. When using T1, we specify the coding function c as: c(1) = 1, c(+) = 2; when using T2, we set: c(1) = 1, c(2) = 2, c(+) = 3. The study of standard GP on these MAX functions poses no particular problem, while for SMGP the definitions of the inflate and deflate mutation operators given in Sect. 3 must be slightly modified, since we are considering a different coding language. The inflate and deflate mutations are now defined in such a way that, when using T1, a terminal symbol 1 can be transformed into the subtree T1 = +(1, 1) by one step of inflate mutation, and vice versa by the deflate mutation. When using T2, the inflate mutation can transform a 1 node into a 2 node and a 2 node into the subtree T2 = +(2, 1) or T3 = +(1, 2) (with uniform probability). Conversely, the deflate mutation can transform T2 or T3 into a leaf labelled 2, and a 2 node into a 1 node. Table 4 shows the fdc and p values for these test cases. Both problems are correctly classified as straightforward by fdc, both for SMGP and standard GP.
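The modified inflate/deflate rewrites for T2 = {1, 2} described above can be sketched like this (trees as nested tuples; only root-level rewrites are shown, whereas the real operators act on a node chosen inside the tree):

```python
import random

# Trees as nested tuples: ('+', left, right), or an integer leaf.

def inflate(t):
    """One step of the modified inflate mutation for T2 = {1, 2}, F = {+}."""
    if t == 1:
        return 2
    if t == 2:
        # a 2 node becomes +(2, 1) or +(1, 2) with uniform probability
        return random.choice([('+', 2, 1), ('+', 1, 2)])
    return t  # internal nodes are left unchanged in this sketch

def deflate(t):
    """Inverse rewrites: +(2, 1) or +(1, 2) becomes a 2-leaf, and 2 becomes 1."""
    if t in (('+', 2, 1), ('+', 1, 2)):
        return 2
    if t == 2:
        return 1
    return t

def value(t):
    """Value returned by a {+}-program, i.e. the quantity MAX tries to maximize."""
    if isinstance(t, int):
        return t
    return value(t[1]) + value(t[2])
```

Repeatedly inflating leaves drives the program's value upward toward the MAX optimum within the depth limit.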
5 Conclusions and Future Work
Two new kinds of tree mutations, corresponding to the operations of the structural distance, are defined in this paper. Fitness distance correlation (calculated using this structural distance between trees) has been shown to be a reasonable way of quantifying the difficulty of unimodal trap functions, of a restricted set of multimodal trap functions, of royal trees, and of two MAX functions for GP using these mutations as genetic operators, as well as for GP based on standard crossover. The results show that, for the functions studied, using crossover or our mutation operators does not seem to have a marked effect on difficulty as measured by fdc. Thus the remarks of [1] and [12], and other work claiming that standard GP crossover does not markedly improve performance with respect to other variation operators, seem to be confirmed. Although fdc has confirmed its validity in the present study, in view of some counterexamples for GAs mentioned in the text, it remains to be seen whether the use of fdc extends to other classes of functions, such as typical GP benchmarks. In the future we plan to extend the study of problem difficulty of multimodal landscapes and MAX functions (e.g. when F = {+, ∗} and T = {0.5}), and to look for a measure of GP difficulty that can be calculated without prior knowledge of the global optima, thus eliminating the strongest limitation of fdc. Moreover, since fdc is
surely not an infallible measure, we plan to build a counterexample for fdc in GP. Another open problem consists in taking into account in the distance definition the phenomenon of introns (whereby two different genotypes can lead to the same phenotypic behavior). Finally, we intend to look for a better measure than performance to identify the success rate of functions, possibly independent of the maximum number of generations chosen.

Acknowledgments. This work has been partially sponsored by the Fonds National Suisse pour la recherche scientifique.
References

1. K. Chellapilla. Evolutionary programming with tree mutations: Evolving computer programs without crossover. In J.R. Koza, K. Deb, M. Dorigo, D.B. Fogel, M. Garzon, H. Iba, and R.L. Riolo, editors, Genetic Programming 1997: Proceedings of the Second Annual Conference, pages 431–438, Stanford University, CA, USA, 1997. Morgan Kaufmann.
2. M. Clergue, P. Collard, M. Tomassini, and L. Vanneschi. Fitness distance correlation and problem difficulty for genetic programming. In Proceedings of the Genetic and Evolutionary Computation Conference GECCO'02, pages 724–732. Morgan Kaufmann, San Francisco, CA, 2002.
3. P. Collard, A. Gaspar, M. Clergue, and C. Escazut. Fitness distance correlation as statistical measure of genetic algorithms difficulty, revisited. In European Conference on Artificial Intelligence (ECAI'98), pages 650–654, Brighton, 1998. John Wiley & Sons, Ltd.
4. E.D. de Jong, R.A. Watson, and J.B. Pollack. Reducing bloat and promoting diversity using multi-objective methods. In L. Spector et al., editors, Proceedings of the Genetic and Evolutionary Computation Conference GECCO '01, San Francisco, CA, July 2001. Morgan Kaufmann.
5. K. Deb and D.E. Goldberg. Analyzing deception in trap functions. In D. Whitley, editor, Foundations of Genetic Algorithms, 2, pages 93–108. Morgan Kaufmann, 1993.
6. K. Deb, J. Horn, and D. Goldberg. Multimodal deceptive functions. Complex Systems, 7:131–153, 1993.
7. A. Ekárt and S.Z. Németh. Maintaining the diversity of genetic programs. In J.A. Foster, E. Lutton, J. Miller, C. Ryan, and A.G.B. Tettamanzi, editors, Genetic Programming, Proceedings of the 5th European Conference, EuroGP 2002, volume 2278 of LNCS, pages 162–171, Kinsale, Ireland, 3–5 April 2002. Springer-Verlag.
8. C. Gathercole and P. Ross. An adverse interaction between crossover and restricted tree depth in genetic programming. In J.R. Koza, D.E. Goldberg, D.B. Fogel, and R.L. Riolo, editors, Genetic Programming 1996: Proceedings of the First Annual Conference, pages 291–296, Stanford University, CA, USA, 28–31 July 1996. MIT Press.
9. T. Jones. Evolutionary Algorithms, Fitness Landscapes and Search. PhD thesis, University of New Mexico, Albuquerque, 1995.
10. W.B. Langdon and R. Poli. An analysis of the max problem in genetic programming. In J.R. Koza, K. Deb, M. Dorigo, D.B. Fogel, M. Garzon, H. Iba, and R.L. Riolo, editors, Genetic Programming 1997: Proceedings of the Second Annual Conference on Genetic Programming, pages 222–230, San Francisco, CA, 1997. Morgan Kaufmann.
11. N.F. McPhee and N.J. Hopper. Analysis of genetic diversity through population history. In W. Banzhaf, J. Daida, A.E. Eiben, M.H. Garzon, V. Honavar, M. Jakiela, and R.E. Smith, editors, Proceedings of the Genetic and Evolutionary Computation Conference, volume 2, pages 1112–1120, Orlando, Florida, USA, 13–17 July 1999. Morgan Kaufmann.
12. U.-M. O'Reilly. An Analysis of Genetic Programming. PhD thesis, Carleton University, Ottawa, Ontario, Canada, 1995.
13. U.-M. O'Reilly. Using a distance metric on genetic programs to understand genetic operators. In Proceedings of 1997 IEEE International Conference on Systems, Man, and Cybernetics, pages 4092–4097. IEEE Press, Piscataway, NJ, 1997.
14. B. Punch, D. Zongker, and E. Goodman. The royal tree problem, a benchmark for single and multiple population genetic programming. In P. Angeline and K. Kinnear, editors, Advances in Genetic Programming 2, pages 299–316, Cambridge, MA, 1996. The MIT Press.
15. R.J. Quick, V.J. Rayward-Smith, and G.D. Smith. Fitness distance correlation and ridge functions. In Fifth Conference on Parallel Problem Solving from Nature (PPSN'98), pages 77–86. Springer-Verlag, Heidelberg, 1998.
16. V. Slavov and N.I. Nikolaev. Fitness landscapes and inductive genetic programming. In Proceedings of International Conference on Artificial Neural Networks and Genetic Algorithms (ICANNGA97), University of East Anglia, Norwich, UK, 1997. Springer-Verlag KG, Vienna.
17. L. Vanneschi, M. Tomassini, P. Collard, and M. Clergue. Fitness distance correlation in structural mutation genetic programming. In C. Ryan et al., editors, Genetic Programming, 6th European Conference, EuroGP 2003, Lecture Notes in Computer Science. Springer-Verlag, Heidelberg, 2003.
Ramped Half-n-Half Initialisation Bias in GP

Edmund Burke, Steven Gustafson, and Graham Kendall
School of Computer Science & IT, University of Nottingham
{ekb,smg,gxk}@cs.nott.ac.uk
Tree initialisation techniques for genetic programming (GP) are examined in [4,3], highlighting a bias in the standard implementation of the initialisation method Ramped Half-n-Half (RHH) [1]. GP trees typically evolve to random shapes, even when populations initially consist of full or minimal trees [2]. In canonical GP, unbalanced and sparse trees increase the probability that bigger subtrees are selected for recombination, ensuring that code growth occurs faster and that subtree crossover has more difficulty producing trees within specified depth limits. The ability to evolve tree shapes which allow more legal crossover operations (by being bushier and thus providing more possible crossover points) and which control code growth is therefore critical. The GP community often uses RHH [4]. The standard implementation of the RHH method selects either the grow or full method with 0.5 probability to produce a tree. If the tree is already in the initial population it is discarded and another is created by grow or full. As duplicates are typically not allowed, this standard implementation of RHH favours full over grow and possibly biases the evolutionary process. The full and grow methods are similar algorithms for recursively producing GP trees. The full algorithm makes trees with branches extending to the maximum initial depth. The grow algorithm does not require this and allows branches of varying length (up to the maximum initial depth). As many more unique trees exist which are full (full trees contain more nodes), there is a tendency, especially with particular function and terminal sets, to produce more duplicate trees with the grow method. To estimate the bias of the RHH method with a particular function and terminal set, we use the results from Luke [3] (Lemma 1).
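The standard RHH implementation described above (grow or full with 0.5 probability, duplicates discarded and regenerated) can be sketched as follows. The ramping over initial depths is omitted for brevity, and the terminal-choice rule inside grow is one common convention, not necessarily the one used in [1]:

```python
import random

def grow(depth, funcs, terms, max_depth):
    """Grow: pick a function or terminal at each point; a terminal is forced at the depth limit."""
    p_term = len(terms) / (len(terms) + len(funcs))  # one common convention, an assumption here
    if depth == max_depth or (depth > 0 and random.random() < p_term):
        return random.choice(terms)
    f, arity = random.choice(funcs)
    return (f,) + tuple(grow(depth + 1, funcs, terms, max_depth) for _ in range(arity))

def full(depth, funcs, terms, max_depth):
    """Full: every branch extends to the maximum initial depth."""
    if depth == max_depth:
        return random.choice(terms)
    f, arity = random.choice(funcs)
    return (f,) + tuple(full(depth + 1, funcs, terms, max_depth) for _ in range(arity))

def rhh(pop_size, funcs, terms, max_depth=4):
    """Standard RHH with duplicates discarded; also counts which method each kept tree came from."""
    population, made_by = set(), {"grow": 0, "full": 0}
    while len(population) < pop_size:
        method = random.choice(["grow", "full"])
        tree = (grow if method == "grow" else full)(0, funcs, terms, max_depth)
        if tree not in population:          # duplicates are rejected and regenerated
            population.add(tree)
            made_by[method] += 1
    return population, made_by
```

Because grow produces more duplicates for many function/terminal sets, the `made_by` counts tend to tilt toward full, which is the bias discussed above.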
(Corresponding author: Steven Gustafson.)

E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 1800–1801, 2003. © Springer-Verlag Berlin Heidelberg 2003

The expected number of nodes Ed at depth d is defined as: E0 = 1 and Ed = Ed−1 · pb for d > 0, where pb is the expected number of children of a new node (p is the probability of picking a nonterminal, and b is the expected number of children of a nonterminal). Luke [3] uses this to calculate the expected size of trees in the infinite case. Here we bound the calculation to depth d = 4. For our analysis, Ed correctly predicted that the full method would contribute more trees to the initial population whenever the expected sizes of the two methods were not similar (i.e. grow made smaller trees, causing more duplicates and rejected trees). Canonical GP trees grow in size up to maximum depth limits, making the initial trees seeds for the evolutionary process. As the full algorithm is more likely to evolve larger and bushier trees with more nodes than the grow method, we conduct an experimental study to observe these differences in the evolutionary process on standard problems. We use RHH and the full and grow methods exclusively. Initial depths of 4 and 6, with maximum tree depths of 10 and 15, respectively, a population size of 500, a total of 51 generations, and standard subtree crossover for recombination are used on the artificial ant, even-5-parity, and symbolic regression problem with the quartic polynomial. 50 random runs were performed for each of the 6 experiments for each problem. On the ant problem, GP produced the best fitness with smaller initial trees (smaller initial depth or those produced by the grow method). Initial trees in the ant problem are particularly important as they encode the initial path that the ant takes. The parity problem experiments produced better fitness with bigger and fuller trees; the populations created only by grow had the worst fitness, followed by the smaller depth limit populations. The RHH method causes 70% of the initial ant and parity population to be created by the full method, a negative bias for the ant problem and a positive bias for the parity problem. The regression problem always produced trees which were sparse due to the number of unary functions in the regression function set (log, exp, sin, cos), which is typical of function sets for more complex regression problems. While the RHH method was not overly biased (full produced 55% of initial trees in the RHH experiments, and full and grow had similar fitness), the sparseness and unbalancedness of trees caused a significant loss of genotypic diversity (the number of unique trees). The bias of the RHH method can be positive or negative, but it can also affect diversity. A loss of genotypic diversity can result from two factors here: the creation of trees by crossover already in the population, or the failure of crossover to find acceptable crossover points (leading to the duplication of one of the parents).
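Luke's bounded estimate is easy to reproduce. With p = 1 (every node below the limit is a nonterminal, as in the full method) and binary functions, the expected size to depth 4 is 31 nodes, against 5 for a grow-like setting with p = 0.5:

```python
def expected_nodes(d, p, b):
    """Luke's Lemma 1: expected number of nodes at depth d, where p is the
    probability of picking a nonterminal and b the expected number of
    children of a nonterminal (so pb is the expected children of a new node)."""
    return 1.0 if d == 0 else expected_nodes(d - 1, p, b) * p * b

def expected_size(max_depth, p, b):
    """Expected tree size with the calculation bounded to max_depth (d = 0 .. max_depth)."""
    return sum(expected_nodes(d, p, b) for d in range(max_depth + 1))
```

The gap between the two expected sizes is the signal used above to predict when grow will generate more duplicates, and hence when full will dominate the initial population.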
The ant and parity populations produced with grow were sparser, more unbalanced, and lost diversity. All the regression populations behaved similarly due to the function set. A disadvantage exists for GP trees which are unable to grow effectively. GP trees which grow sparse and unbalanced will cause more code growth and less genotypic diversity, and will search a smaller space of possible programs. These populations will be less effective in the evolutionary process. Current research is investigating various ways to adapt and detect program shapes for improved performance of fitness and recombination operators.
References

1. J.R. Koza. Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge, MA, USA, 1992.
2. W. Langdon, T. Soule, R. Poli, and J.A. Foster. The evolution of size and shape. In Lee Spector et al., editors, Advances in Genetic Programming 3, chapter 8, pages 163–190. MIT Press, Cambridge, MA, USA, June 1999.
3. S. Luke. Two fast tree-creation algorithms for genetic programming. IEEE Transactions on Evolutionary Computation, 4(3):274–283, September 2000.
4. S. Luke and L. Panait. A survey and comparison of tree generation algorithms. In L. Spector et al., editors, Proceedings of the Genetic and Evolutionary Computation Conference, pages 81–88, San Francisco, USA, 7–11 July 2001. Morgan Kaufmann.
Improving Evolvability of Genetic Parallel Programming Using Dynamic Sample Weighting

Sin Man Cheang, Kin Hong Lee, and Kwong Sak Leung
Department of Computer Science and Engineering, The Chinese University of Hong Kong
{smcheang,khlee,ksleung}@cse.cuhk.edu.hk
Abstract. This paper investigates the effect of sample weighting on Genetic Parallel Programming (GPP), which evolves parallel programs to solve training samples captured directly from a real-world system. The distribution of these samples can be extremely biased. Standard GPP assigns equal weights to all samples. This slows down evolution because crowded regions of samples dominate the fitness evaluation and cause premature convergence. This paper compares the performance of four sample weighting (SW) methods, namely Equal SW (ESW), Class-equal SW (CSW), Static SW (SSW) and Dynamic SW (DSW), on five training sets. Experimental results show that DSW is superior in performance on the tested problems.
1 Introduction
Genetic Programming (GP) has been used to learn computer programs for data classification. Linear GP (LGP) uses a linear representation which corresponds directly to the program statements of a register machine. In this paper, experiments are performed on a novel LGP paradigm, Genetic Parallel Programming (GPP) [1], which evolves parallel programs based on a multiple-ALU register machine, the Multi-ALU Processor (MAP). The MAP employs a tightly coupled, shared-register, MIMD architecture. It can perform multiple operations in parallel in each processor clock cycle. GPP has been used to evolve parallel programs for symbolic regression, recursive function regression and artificial ant problems. Besides evolving parallel programs to make use of the parallelism of the specific parallel architecture, the GPP accelerating phenomenon [3] reveals the efficiency of evolving algorithms in parallel form. Sample Weighting (SW) is a technique to change the samples' weights in order to increase the efficiency of fitness-based evolutionary algorithms. The basic principle of SW is to balance the biased distribution of sample points in a solution space. Researchers have proposed different SW methods which determine samples' weights at runtime. For example, we have applied Dynamic SW (DSW) to Genetic Algorithms (GA) to evolve Finite State Machines [2]. In this paper, we show that DSW can also be applied to GP to increase the evolutionary efficiency on Boolean function regression and data classification problems. We compare DSW to three other SW methods,

E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 1802–1803, 2003. © Springer-Verlag Berlin Heidelberg 2003
Equal SW (ESW), Class-equal SW (CSW) and Static SW (SSW). In ESW, all samples' weights are set equal, without considering the distribution of samples. In CSW, the calculation of samples' weights is based on the number of classes. In DSW, samples' weights are recalculated periodically based on the contributions of individual samples to the population. SSW is a special case of DSW that does not modify samples' weights after initialization.
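Since the exact GPP weight-update formula is not given here, the following is only an illustrative DSW-style sketch under assumed conventions: samples solved by most of the population are down-weighted, rarely solved samples are up-weighted, and weights are renormalized at each update period:

```python
def dynamic_sample_weights(solved_counts, pop_size, floor=0.1):
    """Illustrative DSW-style update (the exact GPP formula is an assumption here):
    solved_counts[i] is how many programs in the population solve sample i.
    A small floor keeps universally solved samples from vanishing entirely."""
    raw = [1.0 - count / pop_size + floor for count in solved_counts]
    total = sum(raw)
    return [w / total for w in raw]  # normalized so the weights sum to 1

def weighted_fitness(hits, weights):
    """Fitness of one program: weighted sum over the samples it solves."""
    return sum(w for hit, w in zip(hits, weights) if hit)
```

With such an update, a crowded region whose samples almost everyone solves contributes little to fitness, so it can no longer dominate the evaluation, which is the premature-convergence problem ESW suffers from.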
2 Experiments, Results, and Conclusions
We have applied the four SW methods to five training sets. Three training sets are Boolean functions: 3-even-parity (3ep), 5-even-parity (5ep) and 5-symmetry (5sm). The remaining two are UCI data classification databases: Wisconsin Breast Cancer (bcw) and Bupa Liver Disorder (bld). Three Boolean function sets (F1={and,or,xor,nop}, F2={and,or,nand,nor,nop} and F3={and,or,not,nop}) are used for the Boolean functions. One arithmetic function set (F4={+,−,×,%,move,nop}) is used for the UCI databases. Table 1 below shows the relative computation effort of ESW versus DSW. Clearly, DSW boosts the evolutionary efficiency on all tested problems (all values are greater than 1.0). In the 5sm_F3 experiment, DSW reduces the computational effort by 744.5 times, showing that DSW gives a superior evolutionary speedup on this problem. Even though the improvements are not significant on the bcw and bld problems, DSW is a helpful SW method to speed up evolution because it is easy to implement on top of the original population-based evolutionary algorithms. Research in the near future will focus on extending the usage of DSW to different application areas, e.g. multi-class classification and numeric function regression problems. The main concern is to apply DSW to non-count-based algorithms, e.g. floating-point function regression problems, in which the fitness of an individual program is based on the total error between the expected outputs and the program outputs.

Table 1. Relative computation effort (EESW/EDSW)

Problem      3ep_F3   5ep_F1   5ep_F2   5sm_F1   5sm_F2   5sm_F3   bcw   bld
EESW/EDSW    11.8     3.7      1.8      51.4     226.7    744.5    1.2   1.1
References

1. Leung, K.S., Lee, K.H., Cheang, S.M.: Evolving Parallel Machine Programs for a Multi-ALU Processor. Proc. of IEEE Congress on Evolutionary Computation (2002) 1703–1708
2. Leung, K.S., Lee, K.H., Cheang, S.M.: Balancing Samples' Contributions on GA Learning. Proc. of the 4th Int. Conf. on Evolvable Systems: From Biology to Hardware (ICES), Lecture Notes in Computer Science, Springer-Verlag (2001) 256–266
3. Leung, K.S., Lee, K.H., Cheang, S.M.: Parallel Programs are More Evolvable than Sequential Programs. Proc. of the 6th Euro. Conf. on Genetic Programming (EuroGP), Lecture Notes in Computer Science, Springer-Verlag (2003)
Enhancing the Performance of GP Using an Ancestry-Based Mate Selection Scheme

Rodney Fry and Andy Tyrrell
Department of Electronics, University of York, UK
{rjlf101,amt}@ohm.york.ac.uk
Abstract. The performance of genetic programming relies mostly on population-contained variation. If the population diversity is low then there will be a greater chance of the algorithm being unable to find the global optimum. We present a new method of approximating the genetic similarity between two individuals using ancestry information. We define a new diversity-preserving selection scheme, based on standard tournament selection, which encourages genetically dissimilar individuals to undergo genetic operation. The new method is illustrated by assessing its performance in a well-known problem domain: algebraic symbolic regression.
1 Ancestry-Based Tournament Selection
Genetic programming [1], in its most conventional form, suffers from a well-known problem: the search becomes localised around a point in the search space which is not the global optimum. This may be a result of not taking steps to preserve the genetic diversity of the population. A selection scheme in GP may be made to promote diversity if individuals are selected not only on the basis of their fitness, but also on how dissimilar their genotypes are. Previous work in measuring how genotypically similar individuals are has concentrated on their phenotypic relationship [2] and structural difference [3]. A good measure of how similar two individuals are can be attained by looking at the individuals' ancestries. For example, siblings are far more likely to have similar genotypes than two randomly chosen individuals. If each individual in a GP population had the capability of knowing who its parents and grandparents were, and chose a mate based partly on the dissimilarity between their respective ancestries, then the exploration of deceptive local optima would be reduced. Previous work in the area includes the incest prevention scheme [4] and crossover prohibition based on ancestry [5]. In our scheme, a binary tree of ancestry is associated with each individual. The information stored relates to which previous individuals were used in the creation of the current solution and the proportion of each parent individual that was donated to create the current solution: the Parental Contribution Fraction or PCF. The first of the two individuals that will take part in the crossover E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 1804–1805, 2003. © Springer-Verlag Berlin Heidelberg 2003
operation is selected using standard tournament selection. A second parent individual must then be selected based on how genetically dissimilar it is when compared to the first. This can be achieved by modifying the fitness values of the candidate individuals to reflect their apparent fitness from the first parent’s point of view, before they enter the tournament.
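The ancestry bookkeeping and fitness modification described above can be sketched roughly as follows. This is our own illustrative reconstruction, not the authors' implementation: it flattens the binary ancestry tree into a dict of ancestor contributions, and the particular fitness-discount formula is an assumption.

```python
# Hypothetical sketch: each individual carries {ancestor_id: contribution},
# where a contribution is the PCF mass inherited along the ancestry tree.
def ancestry(parent_a, parent_b, pcf, self_id):
    """Child's ancestry: both parents plus their ancestries scaled by PCF."""
    anc = {parent_a["id"]: pcf, parent_b["id"]: 1 - pcf}
    for pid, contrib in parent_a["anc"].items():
        anc[pid] = anc.get(pid, 0) + pcf * contrib
    for pid, contrib in parent_b["anc"].items():
        anc[pid] = anc.get(pid, 0) + (1 - pcf) * contrib
    return {"id": self_id, "anc": anc}

def similarity(x, y):
    """Shared-ancestor mass: 0 for unrelated individuals, up to 1 for siblings."""
    shared = set(x["anc"]) & set(y["anc"])
    return sum(min(x["anc"][a], y["anc"][a]) for a in shared)

def apparent_fitness(raw_fitness, first_parent, candidate):
    """Assumed discount: fitness as seen 'from the first parent's point of view'."""
    return raw_fitness * (1 - similarity(first_parent, candidate))

founder_a = {"id": "a", "anc": {}}
founder_b = {"id": "b", "anc": {}}
child1 = ancestry(founder_a, founder_b, 0.5, "c1")
child2 = ancestry(founder_a, founder_b, 0.5, "c2")
print(similarity(child1, child2))  # siblings share all ancestry mass: 1.0
```

With this sketch, two siblings look maximally "unfit" from each other's point of view and so rarely win the second tournament against each other, while an unrelated candidate keeps its raw fitness.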
2 Results
The modified tournament selection mechanism was tested and the results were compared with the standard tournament selection scheme and the ancestry-based crossover prohibition scheme. Tests were carried out on the symbolic regression problem outlined in Section 10.2 of Koza [1]. As Fig. 1 shows, our 'PCF Tree' implementation outperforms both the standard tournament selection scheme and the crossover prohibition scheme.

Fig. 1. Comparison of the performance of the PCF ancestry tree, the crossover prohibition scheme and standard tournament selection (probability of success vs. generation)
References
1. John R. Koza. Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, 1992.
2. S.W. Mahfoud. Crowding and preselection revisited. In Reinhard Männer and Bernard Manderick, editors, Parallel Problem Solving from Nature 2, pages 27–36, Amsterdam, 1992. North-Holland.
3. R. Keller and W. Banzhaf. Explicit maintenance of genetic diversity on genospaces, 1994.
4. L.J. Eshelman and J.D. Schaffer. Preventing premature convergence in genetic algorithms by preventing incest. In Proceedings of the Fourth International Conference on Genetic Algorithms, pages 115–122, San Diego, CA, 1991. Morgan Kaufmann.
5. Rob Craighurst and Worthy Martin. Enhancing GA performance through crossover prohibitions based on ancestry. In Larry J. Eshelman, editor, Proceedings of the Sixth International Conference on Genetic Algorithms, pages 130–135. Morgan Kaufmann, 1995.
A General Approach to Automatic Programming Using Occam's Razor, Compression, and Self-Inspection

Extended Abstract

Peter Galos, Peter Nordin, Joel Olsén, and Kristofer Sundén Ringnér
Complex Systems Group, Department of Physical Resource Theory
Chalmers University of Technology, S-412 96 Göteborg, Sweden
This paper describes a novel general method for automatic programming which can be seen as a generalization of techniques such as genetic programming and ADATE. The approach builds on the assumption that data compression can be used as a metaphor for cognition and intelligence. The proof-of-concept system is evaluated on sequence prediction problems. As a starting point, the process of inferring a general law from a data set is viewed as an attempt to compress the observed data. From an artificial intelligence point of view, compression is a useful way of measuring how deeply the observed data is understood. If a sequence contains redundancy, there exists a shorter description, i.e. the sequence can be compressed. Consider the sequence:

ABCDEABCDEABCDEABCDEABC

This sequence could be described with either of the following two programs:

for i = 1..5 loop
  print 'ABCDE'
end loop
print 'ABC'

for i = 1..infinity loop
  print 'ABCDE'
end loop
The principle of Occam's Razor states that of two possible solutions to a problem (cf. the programs above), the simpler one (the second, shorter program) should be chosen. According to algorithmic information theory, the information content of the simplest program that describes a data series without any prior information is called the Kolmogorov complexity. It has been shown that the continuation of a sequence is more likely to be predicted correctly by a simpler program [VL1]. Solomonoff's induction method [So1] is used to produce the target data as well as a continuation of the target data. Since it is not known before execution whether a program will halt or not, the Kolmogorov complexity is in general not computable. However, it can be approximated, in this case by programs consisting of recursively applied compression methods. The learning problem is therefore turned into a search problem in the space of all possible permutations of the set of compression methods. The fitness criterion is based on algorithmic information theory [Sh1].
It is possible to search the space of hypotheses randomly in order to find a solution. However, a method that takes previous search results into account before suggesting a new hypothesis can give rise to an accelerated search process. In order to achieve self-inspection, a prediction algorithm is applied to the algorithm's own progress. This pattern search in the history of previous programs' performance enables the algorithm to predict the most likely way to search for future hypotheses. In a way, this is a reinforcement problem and thus just another sequence prediction problem. An advantage of this self-inspection is that no additional theory or learning algorithm is needed. To achieve an accurate measure of the information content of a program, a register machine is used. Table 1 shows some examples of predictions¹ using the approach proposed in this paper, where the "least complex program found" column presents combinations of different compression methods.

Table 1. Results of predicting data series using the algorithm

   Data sequence                        Least complex program found   Prediction
1. 5, 10, 15, 20, 25                    Delta→SRP                     30, 35, 40, 45,...
2. 2, 4, 8, 16, 32                      LCB→Delta→SRP                 64, 128, 256, 512,...
3. 0, 9, 3, 4, 13, 7                    Delta→SRP                     8, 17, 11, 12, 21,...
4. 1, 2, 2, 3, 3, 3, 4, 4, 4, 4         RL→Delta→SRP                  5, 5, 5, 5, 5, 6,...
5. 0, 3, 12, 21, 48, 75, 102            Delta→LCB→RL→Delta→SRP        183, 264, 345,...
6. 2, 2, 3, 2, 2, 4, 2, 2, 5, 2, 2, 6   N-gram→PWC→Delta→SRP          2, 2, 7, 2, 2, 8, 2, 2,...
7. 0, 1, 3, 6, 10, 15, 21, 28           Delta→Delta→SRP               36, 45, 55, 66, 78,...
8. 1, 1, 2, 1, 2, 3, 1, 2, 3            LZSS→PWC→Delta→SRP            4, 1, 2, 3, 4, 5, 1, 2, 3,...
9. 0, 1, 2, 4, 5, 7, 10, 11, 13, 16     Delta→LZSS→PWC→Delta→SRP      20, 21, 23, 26, 30,...
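The Delta→SRP chains in rows 1 and 7 of Table 1 admit a compact sketch. This is our reconstruction under stated assumptions: "Delta" is taken to be difference encoding, and "SRP" is assumed to be a simple repeat-prediction step applied once the residual becomes constant; the paper does not spell out either method.

```python
def deltas(seq):
    """Difference-encode a sequence: the assumed 'Delta' compression step."""
    return [b - a for a, b in zip(seq, seq[1:])]

def predict(seq, n_ahead=4, max_depth=5):
    """Recursively delta-encode until the residual is constant, then
    extrapolate by repeating it and integrating back up the chain."""
    if max_depth == 0 or len(seq) < 2 or len(set(seq)) == 1:
        return [seq[-1]] * n_ahead          # assumed 'SRP': repeat last value
    future_deltas = predict(deltas(seq), n_ahead, max_depth - 1)
    out, last = [], seq[-1]
    for step in future_deltas:
        last += step
        out.append(last)
    return out

print(predict([5, 10, 15, 20, 25]))           # [30, 35, 40, 45]  (row 1)
print(predict([0, 1, 3, 6, 10, 15, 21, 28]))  # [36, 45, 55, 66]  (row 7)
```

One delta pass reduces row 1 to a constant residual; row 7 (the triangular numbers) needs two, matching its Delta→Delta→SRP chain.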
In conclusion, the algorithm is simple and requires little domain-specific input, which makes it able to predict a wide range of input sequences. Furthermore, no prior knowledge is necessary. In addition, the paper pioneers the implementation and verification of the theory [VL1] concerning the connection between compression and learning. Much work on theory and evaluation is still needed in order to transform the method into a practical tool with the same wide applicability as, for instance, genetic programming. However, the firm theoretical foundation holds promise for a very efficient method and advantageous scaling properties.
References
[Sh1] Shannon, C.E.: A Mathematical Theory of Communication. The Bell System Technical Journal, Vol. 27 (1948)
[So1] Solomonoff, R.J.: A Preliminary Report on a General Theory of Inductive Inference (1960) pp. 11–14
[VL1] Vitányi, P., Li, M.: Simplicity, Information, Kolmogorov Complexity, and Prediction (1998)

¹ Computed within 4 minutes on an Intel PII/133MHz
Building Decision Tree Software Quality Classification Models Using Genetic Programming Yi Liu and Taghi M. Khoshgoftaar Florida Atlantic University Boca Raton, Florida, USA {yliu,taghi}@cse.fau.edu
Abstract. Predicting the quality of software modules prior to testing or system operations allows a focused software quality improvement endeavor. Decision trees are very attractive for classification problems, because of their comprehensibility and white box modeling features. However, optimizing the classification accuracy and the tree size is a difficult problem, and to our knowledge very few studies have addressed the issue. This paper presents an automated and simplified genetic programming (gp) based decision tree modeling technique for calibrating software quality classification models. The proposed technique is based on multiobjective optimization using strongly typed gp. Two fitness functions are used to optimize the classification accuracy and tree size of the classification models calibrated for a real-world high-assurance software system. The performances of the classification models are compared with those obtained by standard gp. It is shown that the gp-based decision tree technique yielded better classification models.
A timely quality prediction of the software system can be useful in improving the overall reliability and quality of the software product. Software quality classification (sqc) models, which classify program modules as either fault-prone (fp) or not fault-prone (nfp), can be used to target the software quality improvement resources toward modules that are of low quality. With the aid of such models, software project managers can deliver a high-quality software product on time and within the allocated budget. Usually software quality estimation models are based on software product and process measurements, which have been shown to be effective indicators of software quality. Therefore, a given sqc technique aims to model the underlying relationship between the available software metrics data, i.e., independent variables, and the software quality factor, i.e., dependent variable. Among the commonly used sqc models, the decision tree (dt) based modeling approach is very attractive due to the model's white box feature. A dt-based sqc model is usually a binary tree with two types of nodes, i.e., query nodes and leaf nodes. A query node is a logical equation which returns either true or false, whereas a leaf node is a terminal which assigns a class label. Each query node in the tree can be seen as a classifier which partitions the data set into subsets.
All the leaf nodes are labelled as one of the class types, such as fp or nfp. An observation or case in the given data set will “trickle” down, according to its attribute values, from the root node of the decision tree to one of its leaf nodes. This study focuses on calibrating gp-based decision tree models for predicting the quality of software modules as either fp or nfp. In the context of what constitutes a good dt-based model, two factors are usually considered: classification accuracy and model simplicity. Classification accuracy is often measured in terms of the misclassification error rates, whereas model simplicity of decision trees is often expressed in terms of the number of nodes. Therefore, an optimal dt is one that has low misclassification error rates and has a (relatively) few number of nodes. gp is a logical solution to the problems that require a multiobjective optimization, primarily because it is based on the process of natural evolution which involves the simultaneous optimization of several factors. Very few studies have investigated gp-based decision tree models, and among those none has investigated the gp-based decision trees for the sqc problem. Previous works related to gp-based classification models have focused on the standard gp process, which requires that the function and terminal sets have the closure property. This property implies that all the functions in the function set must accept any of the values and data types defined in the terminal set as arguments. However, since each decision tree has at least two different types of nodes, i.e., query nodes and leaf nodes, the closure property requirement of standard gp does not guarantee the generation of a valid individual(s). Strongly Typed Genetic Programming (stgp) has been used to alleviate the closure property requirement (Montana [1]), by allowing each function to define the different kinds of data types it can accept. 
Moreover, each function, variable, and terminal is specified by a certain type in stgp. When the stgp process generates an individual or performs a genetic operation during the simulated evolution process, it considers additional criteria that are not part of standard gp. For example, in our study of calibrating sqc models, a function in the function set can only appear in query nodes, while terminal variables such as fp and nfp can only appear in the leaf nodes. Therefore, in our study we use stgp to build decision trees. In this study we investigated a simplified gp-based multi-objective optimization method for automatically building optimal sqc decision trees that have a high classification accuracy rate and a relatively small tree size. The gp-based decision trees were built to optimize two fitness functions: the average weighted cost of misclassification, and the tree size, which is expressed in terms of the number of tree nodes. Moreover, the relative classification performances of the gp-based dt model and that of standard gp were evaluated in the context of a real-world industrial software project. It was observed that the proposed approach yielded useful decision trees.
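As a rough illustration of what such a typed tree looks like, the sketch below keeps query nodes and leaf nodes as distinct (typed) classes and exposes the two quantities being optimized: the class assigned to a module and the node count. The metric names and thresholds are invented for illustration; the paper's actual accuracy objective is the average weighted cost of misclassification.

```python
# Hypothetical typed decision tree: query nodes test a software metric,
# leaf nodes assign a class label ("fp" = fault-prone, "nfp" = not).
class Leaf:
    def __init__(self, label):
        self.label = label
    def classify(self, module):
        return self.label
    def size(self):
        return 1

class Query:
    def __init__(self, metric, threshold, if_true, if_false):
        self.metric, self.threshold = metric, threshold
        self.if_true, self.if_false = if_true, if_false
    def classify(self, module):
        # an observation "trickles" down according to its attribute values
        branch = self.if_true if module[self.metric] > self.threshold else self.if_false
        return branch.classify(module)
    def size(self):
        # the second fitness objective: number of tree nodes
        return 1 + self.if_true.size() + self.if_false.size()

tree = Query("loc", 100, Leaf("fp"),
             Query("churn", 5, Leaf("fp"), Leaf("nfp")))
print(tree.classify({"loc": 40, "churn": 2}))  # nfp
print(tree.size())                             # 5
```

The type separation makes closure unnecessary: crossover that swaps whole Leaf/Query subtrees always yields a structurally valid tree.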
References 1. Montana, D.J.: Strongly typed genetic programming. Evolutionary Computation (1995) 199–230
Evolving Petri Nets with a Genetic Algorithm Holger Mauch University of Hawaii at Manoa, Dept. of Information and Computer Science, 1680 East-West Road, Honolulu, HI 96822 [email protected]
Abstract. In evolutionary computation many different representations ("genomes") have been suggested as the underlying data structures, upon which the genetic operators act. Among the most prominent examples are the evolution of binary strings, real-valued vectors, permutations, finite automata, and parse trees. In this paper the use of place-transition nets, a low-level Petri net (PN) class [1,2], as the structures that undergo evolution is examined. We call this approach "Petri Net Evolution" (PNE). Structurally, Petri nets can be considered as specialized bipartite graphs. In their extended version (adding inhibitor arcs) PNs are as powerful as Turing machines. PNE is therefore a form of Genetic Programming (GP). Preliminary results obtained by evolving variable-size place-transition nets show the success of this approach when applied to the problem areas of Boolean function learning and classification.¹
1 Overview
PNs are a rich formalism that allows the modeling of numerous complex real-world problems. Their strength is the "built-in" concurrency mechanism, which allows a formal description of systems containing parallel processes. For this study small sample problems have been used to test the feasibility of PNE. The PNs evolved belong to a class of low-level PNs called place-transition nets. This work has been inspired by the generalization of the recombination operator from trees to bipartite graphs. Since recombination seems to be an important genetic operator in PNE, this approach follows the GA stream within EC. The common feature of traditional GP and PNE is the variable-length genotype. This is in contrast to the standard GA, which employs fixed-length strings. In contrast to traditional tree-based GP, there is no need to specify a function set explicitly for PNE. Functions are implicitly built into the semantics of the chosen PN model. Because the place-transition PN model used is integer-based, our focus is on discrete problems rather than continuous ones. PNE evolves a population of PNs in an evolution-directed search of the space of possible PNs for ones which, when executed, exhibit high fitness. To the best of our knowledge the idea of evolving a population of individuals which are variable-sized PNs does not appear in previous literature.¹
This research was supported in part by DARPA grant NBCH1020004.
2 Elements of Petri Net Evolution
In experiments, PNs have been evolved for Boolean function learning and classification. We partition the set of places P = Pin ∪ Pcenter ∪ Pout. For a classification problem such as PlayTennis [3, p. 59], the input places Pin contain the values of the attributes of a fitness case in the initial marking. After firing a suitable number of transitions, the PN is evaluated on its output places Pout according to the induced submarking Mout. Every PN individual is executed on several fitness cases to determine its fitness. One problem with the evaluation of a PN is its potentially nondeterministic behavior due to conflicts (i.e. the situation where two transitions are enabled, but firing one of them disables the other). To solve this difficulty, the fitness measure works with the mean value of a sample of simulation runs. This introduces a random component (the firing order) into the fitness evaluation process. The PNE implementation uses the incidence matrix representation as described in [2] to represent the directed, weighted, bipartite graph data structure of a place-transition net. The incidence matrix and the initial marking vector M0 are implemented by a low-level data structure that can grow dynamically. After random generation of the generation-0 PNs, evolutionary computation enters the cycle of evaluation, selection and recombination/mutation. Several mutation operators to manipulate the various components (i.e. arcs, places, transitions, initial marking) of a PN have been designed. Mutation also allows a PN to grow and shrink in size. At the heart of PNE is the specifically designed recombination operator which creates two offspring PNs from two mating parent PNs. The parents do not need to match in size, so every individual in the population can mate with any other individual. A conventional tournament selection with tournament size 2 is used.
In addition to the usual parameters (population size, maximum number of generations, parameters for recombination, mutation and selection) PNE uses a parameter for the maximum duration of a single PN execution, since termination is not guaranteed when a PN is executed. First results in the area of classification and boolean function learning show that numerous runs produced highly fit PN individuals. However, the examples examined are very simple in nature. Future work will involve larger sized problem instances. That will allow the evaluation of the robustness of PNE. What has been shown is the feasibility of successfully evolving PNs when suitable procedures for the evaluation, mutation and recombination are employed.
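The token game, the random resolution of conflicts, and the execution-duration cap can be sketched as follows. This is a simplified model of our own: instead of the incidence-matrix representation of [2], each transition is stored as a pair of input/output arc-weight dicts.

```python
import random

def enabled(marking, inputs):
    """A transition is enabled if every input place holds enough tokens."""
    return all(marking.get(p, 0) >= w for p, w in inputs.items())

def fire(marking, inputs, outputs):
    """Consume tokens along input arcs, produce tokens along output arcs."""
    m = dict(marking)
    for p, w in inputs.items():
        m[p] -= w
    for p, w in outputs.items():
        m[p] = m.get(p, 0) + w
    return m

def run(marking, transitions, max_steps=20, rng=random):
    """Execute the net, firing a randomly chosen enabled transition each
    step; the step cap is needed since termination is not guaranteed."""
    for _ in range(max_steps):
        ready = [t for t in transitions if enabled(marking, t[0])]
        if not ready:
            break
        inputs, outputs = rng.choice(ready)
        marking = fire(marking, inputs, outputs)
    return marking

# One transition moving tokens from p0 to p1, one at a time.
net = [({"p0": 1}, {"p1": 1})]
print(run({"p0": 2, "p1": 0}, net))  # {'p0': 0, 'p1': 2}
```

Averaging the induced output submarking over several such runs gives the sampled fitness measure described above.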
References 1. Reisig, W.: Petri Nets: an Introduction. Springer-Verlag, Berlin, Germany (1985) 2. Murata, T.: Petri nets: Properties, analysis and applications. Proceedings of the IEEE 77 (1989) 541–580 3. Mitchell, T.M.: Machine Learning. WCB/McGraw-Hill, Boston, MA (1997)
Diversity in Multipopulation Genetic Programming

Marco Tomassini¹, Leonardo Vanneschi¹, Francisco Fernández², and Germán Galeano²
¹ Computer Science Institute, University of Lausanne, Lausanne, Switzerland
² Computer Science Institute, University of Extremadura, Spain
In the past few years, we have done a systematic experimental investigation of the behavior of multipopulation GP [2] and we have empirically observed that distributing the individuals among several loosely connected islands allows one not only to save computation time, due to the fact that the system runs on multiple machines, but also to find solutions of better quality. These results have often been attributed to better diversity maintenance due to the periodic migration of groups of "good" individuals among the subpopulations. We also believe that this might be the case, and we study the evolution of diversity in multi-island GP. All the diversity measures that we use in this paper are based on the concept of entropy of a population P, defined as H(P) = −Σ_{j=1}^{N} F_j log(F_j). If we are considering phenotypic diversity, we define F_j as the fraction n_j/N of individuals in P having a certain fitness j, where N is the total number of fitness values in P. In this case, the entropy measure will be indicated as Hp(P) or simply Hp. To define genotypic diversity, we use two different techniques. The first one consists in partitioning individuals in such a way that only identical individuals belong to the same group. In this case, we consider F_j as the fraction of trees in the population P having a certain genotype j, where N is the total number of genotypes in P, and the entropy measure will be indicated as HG(P) or simply HG. The second technique consists in defining a distance measure able to quantify the genotypic diversity between two trees. In this case, F_j is the fraction of individuals having a given distance j from a fixed tree (called the origin), where N is the total number of distance values from the origin appearing in P, and the entropy measure will be indicated as Hg(P) or simply Hg. The tree distance used is Ekárt's and Németh's definition [1].
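All three measures are the same entropy applied to different binnings of the population; a direct transcription of the formula (our sketch) yields Hp when individuals are binned by fitness and HG when they are binned by whole genotype:

```python
import math
from collections import Counter

def entropy(values):
    """H(P) = -sum_j F_j log(F_j), where F_j is the fraction of the
    population falling in bin j (a fitness value for Hp, a genotype
    for HG, a distance from the origin tree for Hg)."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Phenotypic entropy of a toy population of fitness values (in nats).
print(entropy([3.0, 3.0, 1.5, 0.7, 0.7, 0.7]))
```

A population in which everyone shares one fitness value has Hp = 0; entropy is maximal when every bin is equally occupied.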
Figure 1 depicts the behavior of HG, Hg and Hp during evolution for the symbolic regression problem, with the classic polynomial equation f(x) = x⁴ + x³ + x² + x, an input set composed of 1000 fitness cases and a function set F = {*, //, +, -}, where // is like / but returns 0 instead of an error when the divisor is equal to 0. Fitness is the sum of the squared errors at each test point. Curves are averages over 100 independent runs of generational GP; crossover rate: 95%, mutation rate: 0.1%, tournament selection of size 10, ramped half-and-half initialization, maximum depth of individuals for the creation phase: 6, maximum depth of individuals for crossover: 17, elitism. Genotypic diversity for the panmictic case tends to remain constant over time, and to have higher values than in the distributed case. On the contrary, the average phenotypic entropy for the multipopulation case tends to remain higher than in the panmictic case. The oscillating behavior of the multipopulation curves when groups of individuals are sent and received is not surprising: it is due to the sudden change in diversity when new individuals enter a subpopulation. Finally, we remark that the behavior of the two measures used to calculate genotypic diversity (HG and Hg) is qualitatively equivalent. Analogous results
Fig. 1. Symbolic Regression Problem. 100 total individuals. (a): Genotypic diversity calculated using the HG measure (see the text). Gray curve: panmictic population. Black curve: aggregated subpopulations. (b): Genotypic diversity calculated with the Hg measure. (c): Phenotypic diversity (Hp). (Each panel plots entropy against generation for the 1-100 and 5-20 configurations and the global results.)

Table 1. Numbers represent the average amount of time spent by a solution or a part thereof in the corresponding population. Results refer to the Ant problem with a population size of 1000

Pop 1  Pop 2  Pop 3  Pop 4  Pop 5
20     18.62  21.5   15.62  24.5
(not shown here for lack of space) have been obtained for the Even Parity 4 problem and for the Artificial Ant on the Santa Fe Trail problem. It is also interesting to study how solutions originate and propagate in the distributed system. Table 1 is a synthesis of a few runs of the Artificial Ant problem. Although the data are not statistically significant (too few runs have been done), they do indicate that all the islands participate in the formation of a solution. By defining genotypic and phenotypic diversity indices and by monitoring their variation over a large number of runs on three standard test problems, we have empirically studied how diversity evolves in the distributed case. While genotypic diversity is not much affected by splitting a single population into multiple ones, phenotypic diversity, which is linked to fitness, remains higher in the multipopulation case for all problems studied here. We have also studied how solutions arise in the distributed case, and we have seen that all the subpopulations contribute to the building of the right genetic material, which again seems to confirm that smaller communicating populations are more effective than a big panmictic one. In conclusion, using multiple loosely coupled populations is a natural and easy way of maintaining diversity and, to some extent, avoiding premature convergence in GP.
References
1. A. Ekárt and S.Z. Németh. Maintaining the diversity of genetic programs. In J.A. Foster et al., editors, Genetic Programming, Proceedings of the 5th European Conference, EuroGP 2002, volume 2278 of LNCS, pages 162–171. Springer-Verlag, 2002.
2. F. Fernández, M. Tomassini, and L. Vanneschi. An empirical study of multipopulation genetic programming. Genetic Programming and Evolvable Machines, volume 4, pages 21–51, March 2003.
An Encoding Scheme for Generating λ-Expressions in Genetic Programming Kazuto Tominaga, Tomoya Suzuki, and Kazuhiro Oka Tokyo University of Technology, Hachioji, Tokyo 192-0982 Japan [email protected], {tomoya,oka}@ns.it.teu.ac.jp
Abstract. To apply genetic programming (GP) to evolve λ-expressions, we devised an encoding scheme that encodes λ-expressions into trees. This encoding has the closure property, i.e., any combination of terminal and non-terminal symbols forms a valid λ-expression. We applied this encoding to a simple symbolic regression problem over Church numerals, and the objective function f(x) = 2x was successfully obtained. This encoding scheme will provide a good foothold for exploring fundamental properties of GP by making use of lambda-calculus.
In this paper, we give an encoding scheme that encodes λ-expressions into trees, which is appropriate for evolving λ-expressions by genetic programming (GP). A λ-expression is a term in lambda-calculus [2], composed of variables, abstraction and application. Here are some examples of λ-expressions:

(λx.(λy.(x(xy))))            (let 2 denote this λ-expression)
(λx.(λy.(λz.((xy)(xy)z))))   (let D denote this λ-expression)
A λ-expression can be regarded as a program. Giving the input 2 to the program D is represented by applying D to 2 as the following. (D 2) ≡ ((λx.(λy.(λz.((xy)(xy)z))))(λx.(λy.(x(xy))))) By several steps of β-reduction, the above λ-expression is rewritten into (λx.(λy.(x(x(x(xy)))))). (let 4 denote this λ-expression) β-reduction steps can be regarded as the execution of the program, and 4 can be regarded as the output of the execution. If λ-expressions can be represented as trees, GP can evolve λ-expressions. We devised an encoding scheme to handle λ-expressions in GP. This encoding scheme encodes a λ-expression into a tree. The non-terminal symbol L represents abstraction, and the non-terminal symbol P represents application. A terminal symbol is a positive integer and it represents a variable. A variable labeled by an integer n is bound by the n-th abstraction in the path from the variable to the root. Example trees are (L 1), (L (L 1)) and (L (L (P 2 1))), which represent λ-expressions (λx.x), (λx.(λy.y)) and (λx.(λy.(xy))), respectively.
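The encoding is easy to decode mechanically. The sketch below is our illustrative decoder, assuming trees are represented as nested tuples with "L" and "P" tags and integers for variables; it reproduces the paper's example trees.

```python
# (L 1) -> (λx.x); (L (L 1)) -> (λx.(λy.y)); (L (L (P 2 1))) -> (λx.(λy.(xy)))
NAMES = "xyzwvu"  # fresh variable names, assigned outermost-first

def decode(tree, depth=0):
    if isinstance(tree, int):          # variable bound by the n-th enclosing L
        return NAMES[depth - tree]
    if tree[0] == "L":                 # abstraction: bind the next fresh name
        return f"(λ{NAMES[depth]}.{decode(tree[1], depth + 1)})"
    _, f, a = tree                     # application ("P", f, a)
    return f"({decode(f, depth)}{decode(a, depth)})"

print(decode(("L", ("L", ("P", 2, 1)))))  # (λx.(λy.(xy)))
```

Because an integer refers to a position on the path to the root rather than to a global name, any subtree remains a well-formed term wherever crossover moves it, which is exactly the closure property the scheme provides.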
This encoding scheme has the following good properties. One is that any tree encoded by this scheme is syntactically and semantically valid as a λ-expression, so crossover and mutation always produce valid individuals. Secondly, the meaning of a subtree does not change under crossover, since the relation between a bound variable and the abstraction that binds it is preserved. We applied this encoding to a simple symbolic regression problem. The objective function is f(x) = 2x over Church numerals [2]. Church numerals represent natural numbers; examples are 2 and 4 shown above. The function f can be expressed as a λ-expression in many ways, and D is such a λ-expression. We implemented a β-reduction routine in C and incorporated it into lil-gp [3] to make the whole GP system. Main GP parameters are: population size 1000; maximum generation 1000; 10% reproduction with elites 1%, 89.9% crossover and 0.1% mutation. In the experiment, the objective function D was obtained in 13 out of 25 runs. The transition of the best individuals in a successful run is shown in Table 1 in their canonical forms. The fitness was measured based on the structural difference between the correct answer and the output of the individual. The experimental results indicate that the encoding scheme is appropriate for evolving λ-expressions by GP, and thus it will provide a good foothold for exploring fundamental properties of GP by using lambda-calculus.

Table 1. Example transition of best individuals

Generation  Fitness  Best individual
0           153      (λx.x)
233         150      (λx.(λy.(λz.(z((xy)(yy))))))
346         147      (λx.(λy.(λz.(z((xy)(y(z(yy))))))))
365         146      (λx.(λy.(λz.(z((xy)(y(y(yy))))))))
376         145      (λx.(λy.(λz.(z((xy)(y(y(y(zy)))))))))
401         143      (λx.(λy.(λz.(z((xy)(y(y(y(z(yz))))))))))
403         82       (λx.(λy.(λz.(z((xy)(y(y((xz)(λw.w)))))))))
414         81       (λx.(λy.(λz.(z((xy)(y((xz)(λw.w))))))))
420         80       (λx.(λy.(λz.((xy)(y(y((xz)(λw.w))))))))
422         9        (λx.(λy.(λz.(z((xy)(y(y((xy)(λw.w)))))))))
425         6        (λx.(λy.(λz.((xy)(y(y((xy)(λw.w))))))))
433         2        (λx.(λy.(λz.((xy)((xy)(λw.w))))))
439         0        (λx.(λy.(λz.((xy)((xy)z)))))
References
1. Koza, J.: Genetic Programming. MIT Press (1992)
2. Revesz, G.: Lambda-Calculus, Combinators, and Functional Programming. Cambridge University Press (1988)
3. lil-gp Genetic Programming System. http://garage.cps.msu.edu/software/lil-gp/lilgp-index.html
AVICE: Evolving Avatar's Movement

Hiromi Wakaki and Hitoshi Iba
Graduate School of Frontier Science, The University of Tokyo
7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-8656, Japan
{wakaki,iba}@miv.t.u-tokyo.ac.jp
http://www.miv.t.u-tokyo.ac.jp/˜wakaki/HomePage.html

Abstract. In recent years, the widespread penetration of the Internet has led to the use of diverse expressions over the Web. Among them, many appear to have strong video elements. However, few expressions are based on human beings, with whom we are most familiar. This is deemed to be attributable to the fact that it is not easy to generate human movements: the creation of movements by hand requires intuition and technology. With this in mind, this study proposes a system that supports the creation of 3D avatars' movements based on interactive evolutionary computation, as a system that facilitates creativity and enables ordinary users with no special skills to generate dynamic animations easily, and as a method of movement description.
1 Introduction
The creation of 3D-CG animations requires a tool that can easily and accurately change the details of a movement, rather than merely joining its parts. Inverse kinematics (IK) is good at balancing the movements of joints relative to other joints, but a great deal of knowledge, experience and intuition is required to describe a movement. In consideration of the above, we built a system called AVICE that generates movements through Interactive Evolutionary Computation (IEC) [3,2,1]. AVICE supports the user's creativity in designing motions, as is shown empirically.
2 Outline of AVICE System
The AVICE system uses H-Anim to express avatars. The Humanoid Animation Working Group (http://h-anim.org/) specifies how to express standard human bodies in VRML97; the specification is called Humanoid Animation (hereinafter "H-Anim"). The AVICE system integrates the Creator, Mixer and Remaker modules (see Fig. 1). Their relationship is that the Mixer module takes in the output of the Creator module and of other software. The Creator module produces VRML files suited to the user's taste from a random initialization, using IEC [4]. The Mixer module produces VRML files suited to the user's taste from input files, using IEC.
E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 1816–1817, 2003.
© Springer-Verlag Berlin Heidelberg 2003
Fig. 1. Outline of AVICE system
Fig. 2. Example of created movements

3 Demonstration of Created Movements
The movements shown in Fig. 2 were acquired as a result of experiments with the AVICE system. In this movement demo, avatars are dancing to two types of music. The movements were created by joining movements deemed to be suitable to the music. Even though no music data is imported into the system, the avatars appear to be in rhythm. The difference between the two dances also shows that they are not only in rhythm with the music but also suit its atmosphere.
4 Conclusion
Few expressions on the Web are based on human beings, with whom we are most familiar, because it is not easy to generate human movements. If movements of avatars can be created based on VRML, which can be displayed by a browser, it should increase the ways in which individuals express themselves, as they do through paintings and music.
References
1. P.J. Bentley. Evolutionary Design by Computers. Morgan Kaufmann Publishers Inc., 1999.
2. H. Takagi. Interactive evolutionary computation – cooperation of computational intelligence and human kansei. In Proceedings of the 5th International Conference on Soft Computing and Information/Intelligent Systems, pp. 41–50, Oct. 1998.
3. T. Unemi. Sbart 2.4: breeding 2D CG images and movies, and creating a type of collage. In Proceedings of the Third International Conference on Knowledge-Based Intelligent Information Engineering Systems, pp. 288–291, Adelaide, Australia, Aug. 1999.
4. H. Wakaki and H. Iba. Motion design of a 3D-CG avatar using interactive evolutionary computation. In 2002 IEEE International Conference on Systems, Man and Cybernetics (SMC02). IEEE Press, 2002.
Evolving Multiple Discretizations with Adaptive Intervals for a Pittsburgh Rule-Based Learning Classifier System Jaume Bacardit and Josep Maria Garrell Intelligent Systems Research Group Enginyeria i Arquitectura La Salle Universitat Ramon Llull Psg. Bonanova 8, 08022-Barcelona Catalonia, Spain, Europe {jbacardit,josepmg}@salleURL.edu
Abstract. One of the ways to solve classification problems with real-valued attributes using a Learning Classifier System is to use a discretization algorithm, which enables traditional discrete knowledge representations to handle these problems. A good discretization should balance losing the minimum of information against having a reasonable number of cut points. Choosing a single discretization that achieves this balance across several domains is not easy. This paper proposes a knowledge representation that uses several discretizations (both uniform and non-uniform ones) at the same time, choosing the correct method for each problem and attribute as the iterations proceed. Also, the intervals proposed by each discretization can split and merge along the evolutionary process, reducing the search space where possible and expanding it where necessary. The knowledge representation is tested across several domains which represent a broad range of possibilities.
1 Introduction
The application of Genetic Algorithms (GA) [1,2] to classification problems is usually known as Genetics-Based Machine Learning (GBML) or Learning Classifier Systems (LCS), and it has traditionally been addressed from two different points of view: the Pittsburgh approach (also known as Pittsburgh LCS) and the Michigan approach (or Michigan LCS). Representative systems of each approach are GABIL [3] and XCS [4]. The classical knowledge representation used in these systems is a set of rules whose antecedent is defined over a prefixed, finite number of intervals in order to handle real-valued attributes. The performance of these systems is tied to the right choice of intervals through the use of a discretization algorithm. Several good heuristic discretization methods exist which perform well on many domains. However, they lack robustness on some other domains because they lose too much information from the original data.
E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 1818–1831, 2003.
© Springer-Verlag Berlin Heidelberg 2003
The
alternative of a high number of simple uniform-width intervals usually expands the size of the search space without a clear performance gain. In a previous paper [5] we proposed a representation called adaptive discrete intervals (ADI), in which several uniform-width discretizations are used at the same time, allowing the GA to choose the correct discretization for each rule and attribute. The discretization intervals can also split and merge through the evolutionary process. This representation has been used in our Pittsburgh-approach LCS. In this paper we generalize the ADI representation (proposing ADI2) by also using heuristic non-uniform discretization methods, in our case the well-known Fayyad & Irani discretization algorithm [6]. In our previous work we reported that using only the Fayyad & Irani discretization can lead to a non-robust system in some domains, but using it together with some uniform-width discretizations improves the performance in some domains and, most importantly, presents a robust behavior across most domains. Also, the probability that controls the split and merge operators is redefined in order to simplify the tuning needed to use the representation. This rule representation is compared across different domains against the results of the original representation, and also against the XCSR representation [7] by Wilson, which uses rules with real-valued intervals in the well-known XCS Michigan LCS [4]. We want to state clearly that we have integrated the XCSR representation into our Pittsburgh system, instead of using it inside XCS. The paper is structured as follows. Section 2 presents some related work. Then, we describe the framework of our classifier system in Sect. 3. The revision of the ADI representation is explained in Sect. 4. Next, Sect. 5 describes the test suite used in the comparison. The results obtained are summarized in Sect. 6. Finally, Sect. 7 discusses the conclusions and some further work.
2 Related Work
There are several approaches to handling real-valued attributes in the LCS field. They can be classified in a simple manner into two groups: systems that discretize, and systems that work directly with real values. Reducing the real-valued attributes to discrete values lets the systems in the first group use traditional symbolic knowledge representations. Several types of algorithms can perform this reduction. Some heuristic examples are the Fayyad & Irani method [6], which works with information entropy, and the χ² statistic measure [8]. Some specific GBML applications of discretization algorithms were presented by Riquelme and Aguilar [9] and are similar to the representation proposed here. Their system evolves conjunctive rules where each conjunct is associated with an attribute and is defined as a range of adjacent discretization intervals. They use their own discretization algorithm, called USD [10]. Lately, several alternatives to discrete rules have been presented. There are rules composed of real-valued intervals (XCSR [7] by Wilson, or COGITO
[11] by Aguilar and Riquelme). Also, Llorà and Garrell [12] proposed a knowledge-independent method for learning other knowledge representations such as instance sets or decision trees. Most of these alternatives have better accuracy than discrete rules, but they usually also have a higher computational cost [11].
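To make the entropy-based approach concrete, the following is a minimal, illustrative Python sketch of choosing a single binary cut point by class-information entropy. It is in the spirit of, but much simpler than, the Fayyad & Irani method, which additionally recurses on the resulting sub-ranges and uses an MDL stopping criterion; all names here are our own:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_cut(values, labels):
    """Return the cut point minimizing the weighted entropy of the two halves."""
    pairs = sorted(zip(values, labels))
    best_point, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # identical values cannot be separated
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [l for _, l in pairs[:i]]
        right = [l for _, l in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if score < best_score:
            best_point, best_score = cut, score
    return best_point

print(best_cut([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], ["a", "a", "a", "b", "b", "b"]))  # 6.5
```

The cut at 6.5 separates the classes perfectly, giving zero weighted entropy on both sides.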
3 Framework
In this section we describe the main features of our classifier system, a Pittsburgh LCS based on GABIL [3]. Directly from GABIL we have borrowed the semantically correct crossover operator and the fitness computation (squared accuracy).

Matching Strategy: The matching process follows an "if ... then ... else if ... then ..." structure, usually called a Decision List [13].

Mutation Operators: The system manipulates variable-length individuals, which makes tuning the classic gene-based mutation probability more difficult. In order to simplify this tuning, we define pmut as the probability of mutating an individual. When an individual is selected for mutation (based on pmut), a random gene is chosen inside its chromosome for mutation.

Control of the Individuals' Length: Dealing with variable-length individuals raises some serious considerations. One of the most important is the control of the size of the evolving individuals [14]. This control is achieved using two different operators:

– Rule deletion: This operator deletes the rules of an individual that do not match any training example. Rule deletion is done after the fitness computation and has two constraints: (a) the process is only activated after a predefined number of iterations, to prevent a massive diversity loss, and (b) the number of rules of an individual never goes below a lower threshold.
– Selection bias using the individual size: Selection is guided as usual by the fitness (the accuracy of the individual). However, it also gives a certain degree of relevance to the size of the individuals, a policy similar to that of multi-objective systems. We use tournament selection because its local behavior lets us implement this policy. The criterion of the tournament is given by our own operator, called "hierarchical selection" [15]. This operator considers two individuals "similar" if their fitness difference is below a certain threshold (dcomp).
It then selects the individual with fewer rules. Our previous tests showed that setting dcomp to 0.01 was quite good for real problems and 0.001 for synthetic problems. Although fine-tuning this parameter would probably improve the performance on each test domain, for the sake of simplicity we use these values.
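The hierarchical tournament criterion described above can be sketched as follows. This is an illustrative Python sketch with hypothetical names (`Individual`, `d_comp`), not the paper's C++ code:

```python
from dataclasses import dataclass

@dataclass
class Individual:
    fitness: float   # e.g. squared training accuracy
    num_rules: int

def hierarchical_better(a, b, d_comp=0.01):
    """Tournament criterion: accuracy first; if the two fitnesses are
    'similar' (difference below d_comp), prefer the smaller rule set."""
    if abs(a.fitness - b.fitness) < d_comp:
        return a if a.num_rules <= b.num_rules else b
    return a if a.fitness > b.fitness else b

winner = hierarchical_better(Individual(0.905, 12), Individual(0.900, 5))
print(winner.num_rules)  # 5: the fitnesses differ by less than d_comp, so fewer rules wins
```

With d_comp = 0, this degenerates into plain fitness-based tournament selection; a larger d_comp applies stronger generalization pressure.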
4 The Adaptive Discretization Intervals (ADI) Rule Representation

4.1 The Original ADI Representation
The general structure of each rule is taken from GABIL: each rule consists of a condition part and a classification part, condition → classification. Each condition is a Conjunctive Normal Form (CNF) predicate defined as:

((A_1 = V^1_1 ∨ … ∨ A_1 = V^1_m) ∧ … ∧ (A_n = V^n_1 ∨ … ∨ A_n = V^n_m))

where A_i is the i-th attribute of the problem and V^i_j is the j-th value that attribute A_i can take. In the GABIL representation this kind of predicate can be encoded into a binary string in the following way: if we have a problem with two attributes, where each attribute can take three different values {1,2,3}, a rule of the form "if the first attribute has value 1 or 2 and the second one has value 3, then assign class 1" is represented by the string 110|001|1. For real-valued attributes, each bit (except the class one) would be associated with a discretization interval. The intervals of the rules in the ADI representation are not static; they evolve, splitting and merging among them (down to a minimum size called micro-intervals). Thus, the binary coding of the GABIL representation is extended as represented in Fig. 1, which also shows the split and merge operations. The ADI representation is defined in depth as follows:
1. Each individual, initial rule and attribute term is assigned a number of "low level" uniform-width and static discretization intervals (called micro-intervals).
2. The intervals of the rule are built joining together adjacent micro-intervals.
3. Attributes with different numbers of micro-intervals can coexist in the population. The evolution will choose the correct number of micro-intervals for each attribute.
Fig. 1. Adaptive intervals representation and the split and merge operators
4. For computational cost reasons, we set an upper limit on the number of intervals allowed for an attribute, which in most cases will be less than the number of micro-intervals assigned to that attribute.
5. When we split an interval, we select a random point in its micro-intervals to break it.
6. When we merge two intervals, the state (1 or 0) of the resulting interval is taken from the one with more micro-intervals. If both have the same number of micro-intervals, the value is chosen randomly.
7. The number of micro-intervals assigned to each attribute term is chosen from a predefined set.
8. The number and size of the initial intervals is selected randomly.
9. The cut points of the crossover operator can only fall on attribute-term boundaries, not between intervals. This restriction maintains the semantic correctness of the rules.
10. The hierarchical selection operator uses the length of the individuals (defined as the sum of all the intervals of the individual) instead of the number of rules as the secondary criterion. This change promotes simple individuals with less interval fragmentation.

In order to make the interval splitting and merging part of the evolutionary process, we have to include it in the GA's genetic operators. We have chosen to add to the GA cycle two special stages applied to the offspring population after the mutation stage. For each stage (split and merge) we have a probability (psplit or pmerge) of applying the operation to an individual. If an individual is selected for split or merge, a random point inside its chromosome is chosen to apply the operation.

4.2 Revisions to the Representation: ADI2
The first modification to the ADI representation affects the definition of the split and merge probabilities. These probabilities were defined at an individual-wise level and needed a specific adjustment for each domain we wanted to solve. The need for this fine-adjusting is motivated by the fact that this probability definition takes into account neither the number of attributes of the problem nor its optimum number of rules. Thus, a problem with twice as many attributes as another would also need a probability twice as high. Also, it was empirically determined that it was useful to split or merge more than once per individual, thus using an expected value instead of a probability to control the operators. Our proposal is a probability defined for each attribute term of each rule, thus becoming independent of these two factors. The code for the merge operator probability is represented in Fig. 2 (the code for the split operator is similar). This redefinition allows us to use the same probability for all the test domains with results similar to the original probability definition. The probability value has been empirically set to 0.05.
Evolving Multiple Discretizations with Adaptive Intervals
1823
ForEach Individual i of Population
    ForEach Rule j of individual i
        ForEach Attribute k of rule j
            If random [0..1] number < pmerge
                Select one random interval of attribute term k of rule j of individual i
                Apply a merge operation to this interval
            EndIf
        EndForEach
    EndForEach
EndForEach
Fig. 2. Code of the application of the merge operator
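A minimal Python sketch of the per-attribute merge pass of Fig. 2, under our own simplified data layout: each attribute term is a list of (num_micro_intervals, state) pairs. Names and structure are illustrative, not the paper's implementation:

```python
import random

P_MERGE = 0.05  # per-attribute-term probability (Sect. 4.2)

def merge_random_interval(term, rng):
    """Merge one randomly chosen interval with its right neighbour.
    The state of the result is taken from the interval with more
    micro-intervals; ties are broken randomly (item 6 of the definition)."""
    if len(term) < 2:
        return
    i = rng.randrange(len(term) - 1)
    (ma, sa), (mb, sb) = term[i], term[i + 1]
    state = sa if ma > mb else sb if mb > ma else rng.choice([sa, sb])
    term[i:i + 2] = [(ma + mb, state)]

def merge_stage(population, rng=random):
    """Apply the merge stage to every attribute term of every rule,
    mirroring the triple loop of Fig. 2."""
    for individual in population:
        for rule in individual:
            for term in rule:
                if rng.random() < P_MERGE:
                    merge_random_interval(term, rng)

# Example: one attribute term made of three intervals over 10 micro-intervals.
term = [(3, 1), (2, 0), (5, 1)]
merge_random_interval(term, random.Random(0))
print(term)  # two intervals remain; the micro-interval counts still sum to 10
```

The split operator would be the symmetric operation: pick an interval with more than one micro-interval and break it at a random micro-interval boundary.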
Our second modification is related to the kind of discretization we use. The experimentation reported for the original representation included a comparison with a discrete representation whose discretization intervals were given by the Fayyad & Irani method [6]. This discrete representation performed better than the adaptive intervals rule representation in some domains, but it was significantly outperformed in some others, showing a lack of robustness. Our aim is to modify the representation to include non-uniform discretizations, in a way that improves the performance of the system on the problems where this is possible but maintains a robust behavior on the kind of problems where systems using only heuristic representations fail. A graphical representation of the two types of rules is given in Fig. 3. The changes introduced to the definition of the representation are minimal. A revised definition follows:
1. A set of static discretization intervals (called micro-intervals) is assigned to each attribute term of each rule of each individual.
2. The intervals of the rule are built joining together micro-intervals.
3. Attributes with different numbers and sizes of micro-intervals can coexist in the population. The evolution will choose the correct discretization for each attribute.
4. For computational cost reasons, we set an upper limit on the number of intervals allowed for an attribute, which in most cases will be less than the number of micro-intervals assigned to that attribute.
5. When we split an interval, we select a random point in its micro-intervals to break it.
Fig. 3. Example of the differences between ADI1 and ADI2 rules
6. When we merge two intervals, the state (1 or 0) of the resulting interval is taken from the one with more micro-intervals. If both have the same number of micro-intervals, the value is chosen randomly.
7. The discretization assigned in the initialization stage to each attribute term is chosen from a predefined set.
8. The number and size of the initial intervals is selected randomly.
9. The cut points of the crossover operator can only fall on attribute-term boundaries, not between intervals. This restriction maintains the semantic correctness of the rules.
10. The hierarchical selection operator uses the length of the individuals (defined as the sum of all the intervals of the individual) instead of the number of rules as the secondary criterion. This change promotes simple individuals with less interval fragmentation.
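To illustrate how a rule with mixed discretizations matches a real value, here is a small, self-contained Python sketch (our own illustrative construction: an attribute term stores its interior cut points, which may come from a uniform grid or from a method such as Fayyad & Irani, plus one state bit per interval):

```python
import bisect

def uniform_cuts(lo, hi, n):
    """Interior cut points of n uniform micro-intervals over [lo, hi]."""
    step = (hi - lo) / n
    return [lo + step * i for i in range(1, n)]

def term_matches(cuts, states, value):
    """An attribute term matches if the interval containing `value` has
    state 1. `cuts` holds interior cut points; `states` has len(cuts)+1 bits."""
    return states[bisect.bisect_right(cuts, value)] == 1

# Attribute over [0, 1] with 5 uniform micro-intervals and bit pattern 1 1 0 1 0:
cuts = uniform_cuts(0.0, 1.0, 5)
states = [1, 1, 0, 1, 0]
print(term_matches(cuts, states, 0.30), term_matches(cuts, states, 0.50))  # True False

# The same machinery works unchanged with non-uniform cut points
# (e.g. produced by an entropy-based discretizer):
print(term_matches([0.12, 0.47, 0.81], [0, 1, 1, 0], 0.50))  # True
```

This is the point of ADI2: the matching and the split/merge machinery never assume uniform spacing, so uniform and heuristic discretizations can coexist in one population.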
5 Test Suite
This section summarizes the tests done to evaluate the accuracy and efficiency of the method presented in this paper. We also compare it with some alternative methods. The tests were conducted on several machine learning problems, which we also describe.

5.1 Test Problems
The selected test problems are the same as those used with the original version of the representation studied in this paper. The first problem is a synthetic problem (tao [12]) that has non-orthogonal class boundaries. We also use several problems provided by the University of California at Irvine (UCI) repository [16]: Pima Indians Diabetes (pim), Iris (irs), Glass (gls) and Wisconsin Breast Cancer (bre). Finally, we use three real problems from private repositories. The first two deal with the diagnosis of breast cancer based on biopsies (bps [17]) and on mammograms (mmg [18]), whereas the last one is related to the prediction of student qualifications (lrn [19]). The characteristics of the problems are listed in Table 1. The partition of the examples into the train and test sets was done using stratified ten-fold cross-validation [20].

5.2 Configurations of the GA to Test
The main goal of the tests is to evaluate the performance of the changes made to the ADI representation: the redefinition of the split and merge probabilities and the inclusion of non-uniform discretizations. Thus, we performed two kinds of tests. The first (called ADI2) uses the new probabilities and only the uniform discretizations. The second (called ADI2+Fayyad) adds to the ADI2 configuration the use of the Fayyad & Irani discretization method [6]. This revision of the ADI representation is compared to three more configurations:
Table 1. Characteristics of the test problems

Name                     ID.  Instances  Real attr.  Discrete attr.  Type       Classes
Tao                      tao  1888       2           –               synthetic  2
Pima-Indians-Diabetes    pim  768        8           –               real       2
Iris                     irs  150        4           –               real       3
Glass                    gls  214        9           –               real       6
Wisconsin Breast Cancer  bre  699        9           –               real       2
Biopsies                 bps  1027       24          –               real       2
Mammograms               mmg  216        21          –               real       2
Learning                 lrn  648        4           2               real       5
– Non-adaptive discrete rules (using the GABIL representation), with the intervals given by the Fayyad & Irani discretization method. This test is labeled Fayyad.
– The original ADI representation, with the old split and merge probability definition and only uniform discretizations (called ADI1).
– Rules with real-valued intervals using the XCSR representation [7] (called XCSR).

We have decided not to add any other non-uniform discretization to ADI2+Fayyad because the comparison of these tests with the Fayyad test would not be fair if more non-uniform discretization methods were used. The GA parameters are shown in Tables 2 and 3; the first shows the general parameters and the second the domain-specific ones. The reader can appreciate that the sizing of both psplit and pmerge is the same for all the problems except tao. Giving the same value to pmerge and psplit on tao produces solutions with too few rules and intervals, as well as less accurate ones than the results obtained with the configuration shown in Table 3. The reason lies in the definition of this synthetic problem, with only two attributes and rules with intervals as small as one 48th of an attribute domain in the optimum rule set. The probability of generating these very specific rules is very low because of the hierarchical selection operator used to apply generalization pressure. The operator is still used because it is beneficial for the rest of the problems, and we can fix this bad interaction by increasing psplit. These characteristics of the problem are also the reason for the bad performance of the test using only the Fayyad & Irani intervals: the discretization generates too few intervals and, as a consequence, loses too much information from the original data.
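For reference, XCSR encodes each attribute interval as a (center, spread) pair, which is what the radius/center parameters in Table 2 refer to. A minimal illustrative sketch of matching under that encoding (names are ours, and recall that here the representation sits inside our Pittsburgh system rather than XCS):

```python
def xcsr_matches(genes, example):
    """genes: one (center, spread) pair per attribute; the rule matches if
    every attribute value falls inside [center - spread, center + spread]."""
    return all(c - s <= x <= c + s for (c, s), x in zip(genes, example))

rule = [(0.5, 0.2), (0.3, 0.3)]        # intervals [0.3, 0.7] and [0.0, 0.6]
print(xcsr_matches(rule, [0.6, 0.1]))  # True
print(xcsr_matches(rule, [0.8, 0.1]))  # False: 0.8 is outside [0.3, 0.7]
```

Mutation under this encoding perturbs centers and spreads by bounded random offsets, which is why Table 2 specifies maximum offsets as fractions of each attribute domain.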
Table 2. Common parameters of the GA

Parameter                                                Value
General parameters
  Crossover probability                                  0.6
  Selection algorithm                                    Tournament
  Tournament size                                        3
  Population size                                        300
  Probability of mutating an individual                  0.6
Rule deletion operator
  Iteration of activation                                30
  Minimum number of rules before disabling the operator  number of classes + 3
Hierarchical selection
  Iteration of activation                                10
ADI rule representation
  Number of intervals of the uniform discretizations     4,5,6,7,8,10,15,20,25
XCSR rule representation
  Maximum radius size in initialization                  0.7 of each attribute domain
  Maximum center offset in mutation                      0.7 of each attribute domain
  Maximum radius offset in mutation                      0.7 of each attribute domain

Table 3. Problem-specific parameters of the GA. Codes: #iter = number of GA iterations; dcomp = distance parameter in the "size-based comparison" operator; pmerge orig / psplit orig = expected values of interval merging/splitting in ADI1; pmerge / psplit = probabilities of merging/splitting an interval in ADI2

Problem  #iter  dcomp  pmerge orig  psplit orig  pmerge  psplit
tao      900    0.001  1.3          2.6          0.05    0.25
pim      250    0.01   0.8          0.8          0.05    0.05
irs      275    0.01   0.5          0.5          0.05    0.05
gls      900    0.01   1.5          1.5          0.05    0.05
bre      300    0.01   3.2          3.2          0.05    0.05
bps      275    0.01   1.7          1.7          0.05    0.05
mmg      225    0.01   1.0          1.0          0.05    0.05
lrn      700    0.01   1.2          1.2          0.05    0.05

6 Results

In this section we present the results obtained. The aim of the tests was to compare the methods studied in this paper in three respects: accuracy and size of the solutions, as well as computational cost. For each method and test problem we show the average and standard deviation of: (1) the cross-validation accuracy, (2) the size of the best individual, in number of rules and number of intervals, and (3) the execution time in seconds. Obviously, the XCSR results lack the intervals-per-attribute column. The tests were executed on an AMD Athlon 1700+ under the Linux operating system, with code written in C++. Runs
for each method, problem and fold have been repeated 15 times using different random seeds, and the results averaged. Full details of the results are shown in Table 4, and a summary in the form of a ranking of the methods is given in Table 5. The ranking for each problem and method is based on accuracy; the global rankings are computed by averaging the problem rankings. The results were also analyzed using two-sided Student t-tests [20] with a significance level of 1%, in order to determine whether there were significant outperformances between the methods tested. The results of the t-tests are shown in Table 6. None of the ADI variants was outperformed. The first interesting fact from the results is the comparison between ADI1 and ADI2, that is, the comparison between the two definitions of the split and merge probabilities. ADI2, the method that uses the same probabilities for all the domains, is always better (with only one exception) than ADI1, the method that needed domain-specific probabilities. Thus, the gain is twofold: performance and simplicity of use. The performance of ADI2+Fayyad is also quite good: it is the second method in the overall ranking, very close to ADI2, the first. The objective of increasing the ADI accuracy on the problems where the Fayyad discrete rules alone performed well, while maintaining the accuracy on the domains where Fayyad performed very poorly, has been achieved. Also, the results of the XCSR representation show that representations based on a discretization process, if well used, perform well compared to representations that evolve real-valued intervals. This observation matches those reported by Riquelme and Aguilar [11]. The computational cost remains the main drawback of the representation.
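As a reference for the comparison procedure, a two-sample Student t statistic (equal-variance, pooled) can be computed as in this generic Python sketch; the paper applies two-sided tests at the 1% level to the cross-validation results, and this sketch is our own illustration, not the authors' exact procedure:

```python
import math
from statistics import mean, variance

def pooled_t(a, b):
    """Two-sample Student t statistic with pooled variance (equal population
    variances assumed); compare |t| against the two-sided critical value
    for len(a) + len(b) - 2 degrees of freedom."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(sp2 * (1 / na + 1 / nb))

# Identical samples give t = 0; clearly separated samples give a large |t|.
print(pooled_t([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 0.0
```

A row/column cell in Table 6 counts, over the eight problems, how often |t| exceeds the 1% critical value with the row method's mean accuracy being higher.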
The comparison of the ADI representation's run time with the Fayyad discrete rules clearly favors the Fayyad test. The comparison with XCSR, however, is not so clear: in some domains one representation is faster, and in others the situation is reversed.
Table 4. Mean and deviation of the accuracy, number of rules and intervals per attribute for each method tested. Bold entries show the method with the best results for each test problem

Problem  Configuration  Accuracy    Number of rules  Intervals per attr.  Run time (s)
tao      Fayyad         87.8±1.1    3.1±0.3          3.4±0.1              37.3±2.1
         ADI1           94.3±1.0    19.5±4.9         6.0±0.6              145.6±20.9
         ADI2           94.7±0.9    17.5±4.4         8.7±0.6              162.1±20.9
         ADI2+Fayyad    94.0±0.9    15.6±4.4         7.6±1.0              150.5±23.4
         XCSR           91.1±1.4    12.9±3.1         ——                   110.6±14.5
pim      Fayyad         73.6±3.1    6.6±2.6          2.3±0.2              13.2±1.5
         ADI1           74.4±3.1    5.8±2.2          1.9±0.4              29.9±4.5
         ADI2           74.5±3.8    3.7±1.2          2.1±0.3              30.2±3.2
         ADI2+Fayyad    74.5±3.3    3.6±0.9          2.0±0.3              32.1±3.0
         XCSR           74.2±3.3    3.4±1.0          ——                   18.7±2.9
irs      Fayyad         94.2±3.0    3.2±0.6          2.8±0.1              4.1±0.1
         ADI1           96.2±2.2    3.6±0.9          1.3±0.2              6.2±0.6
         ADI2           96.0±2.6    3.9±0.7          1.5±0.3              6.0±2.0
         ADI2+Fayyad    96.3±2.4    3.7±0.7          1.5±0.2              6.4±2.1
         XCSR           95.2±2.3    3.5±0.6          ——                   3.5±0.1
gls      Fayyad         65.7±6.1    8.1±1.4          2.4±0.1              16.8±1.3
         ADI1           65.2±4.1    6.7±2.0          1.8±0.2              46.1±6.0
         ADI2           65.6±3.7    7.6±1.5          2.6±0.2              41.8±3.8
         ADI2+Fayyad    66.2±3.6    7.7±1.6          2.5±0.2              42.7±3.5
         XCSR           65.2±5.2    8.4±1.2          ——                   37.5±4.9
bre      Fayyad         95.2±1.8    4.1±0.8          3.6±0.1              8.3±0.9
         ADI1           95.3±2.3    2.6±0.9          1.7±0.2              25.0±1.4
         ADI2           95.9±2.3    2.4±0.6          1.8±0.3              19.6±1.4
         ADI2+Fayyad    95.7±2.4    2.4±0.6          1.8±0.3              22.8±1.6
         XCSR           95.8±2.7    3.2±0.9          ——                   21.4±2.9
bps      Fayyad         80.0±3.1    7.1±3.8          2.4±0.1              31.1±5.0
         ADI1           80.1±3.3    5.1±2.0          2.0±0.3              99.6±17.0
         ADI2           80.6±2.9    3.6±1.0          2.1±0.2              113.8±9.9
         ADI2+Fayyad    80.3±2.9    3.6±1.0          2.0±0.2              119.2±7.7
         XCSR           79.7±3.1    4.3±1.1          ——                   151.6±13.4
mmg      Fayyad         65.3±11.1   2.3±0.5          2.0±0.1              3.8±0.3
         ADI1           65.0±6.1    4.4±1.9          1.9±0.2              12.3±2.5
         ADI2           65.6±5.7    4.9±1.0          2.3±0.1              13.7±1.2
         ADI2+Fayyad    65.2±4.3    4.7±1.1          2.2±0.1              14.4±1.2
         XCSR           63.1±4.1    5.6±1.6          ——                   14.9±2.4
lrn      Fayyad         67.5±5.1    14.3±5.0         4.4±0.1              26.5±3.4
         ADI1           66.7±4.1    11.6±4.1         3.4±0.2              53.9±7.2
         ADI2           67.4±4.3    7.2±1.7          3.5±0.2              46.6±5.1
         ADI2+Fayyad    67.6±4.6    6.9±1.4          3.4±0.2              45.9±5.0
         XCSR           67.1±4.4    9.1±2.5          ——                   52.6±7.5

Table 5. Performance ranking of the tested methods. A lower number means a better ranking

Problem     Fayyad  ADI1  ADI2  ADI2+Fayyad  XCSR
tao         5       2     1     3            4
pim         5       3     1     1            4
irs         5       2     3     1            4
gls         2       4     3     1            4
bre         5       4     1     3            2
bps         4       3     1     2            5
mmg         2       4     1     3            5
lrn         2       4     3     1            4
Average     3.75    3.25  1.75  1.875        4
Final rank  4       3     1     2            5

Table 6. Summary of the statistical two-sided t-tests performed at the 1% significance level. Each cell indicates how many times the method in the row outperforms the method in the column

Method       Fayyad  ADI1  ADI2  ADI2+Fayyad  XCSR  Total
Fayyad       —       0     0     0            0     0
ADI1         1       —     0     0            2     3
ADI2         1       0     —     0            2     3
ADI2+Fayyad  1       0     0     —            2     3
XCSR         1       0     0     0            —     1
Total        4       0     0     0            6

7 Conclusions and Further Work

This paper has focused on a revision and generalization of our previous work on representations for real-valued attributes: the Adaptive Discretization Intervals (ADI) rule representation. This representation evolves rules that can use multiple discretizations, letting the evolution choose the correct discretization for each rule and attribute. Also, the intervals defined in each discretization can split or merge through the evolutionary process, reducing the search space where possible and expanding it where needed. The revision consists of two parts: redefining the probabilities of the split and merge operators, and adding non-uniform discretizations to the existing uniform ones. The redefinition of the operators has enormously simplified the tuning needed to use the representation: an operator that needed domain-specific tuning has become one that uses the same probability for all the domains. The addition of non-uniform discretizations has improved the representation's performance in some domains while maintaining the robust behavior of the original version, according to the statistical t-tests performed. The tests show that the new version is always better than the original one, and also better than the other representations included in the comparison when used in a Pittsburgh LCS. However, the computational cost remains a major drawback compared to discrete rules. As further work, other non-uniform discretization methods besides Fayyad & Irani's should be tested, to increase the range of problems where this representation performs well. This search for more discretization methods, however, should always keep in mind the robust behavior that the representation has shown so far. On the other hand, the ADI representation should be compared with other kinds of representations dealing with real-valued attributes, for example non-orthogonal ones.
Acknowledgments. The authors acknowledge the support provided under grant numbers 2001FI 00514, TIC2002-04160-C02-02, TIC 2002-04036-C0503 and 2002SGR 00155. Also, we would like to thank Enginyeria i Arquitectura La Salle for their support of our research group. The Wisconsin breast cancer databases were obtained from the University of Wisconsin Hospitals, Madison, from Dr. William H. Wolberg.
References

1. Holland, J.H.: Adaptation in Natural and Artificial Systems. University of Michigan Press (1975)
2. Goldberg, D.E.: Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley Publishing Company, Inc. (1989)
3. DeJong, K.A., Spears, W.M.: Learning concept classification rules using genetic algorithms. Proceedings of the International Joint Conference on Artificial Intelligence (1991) 651–656
4. Wilson, S.W.: Classifier fitness based on accuracy. Evolutionary Computation 3 (1995) 149–175
5. Bacardit, J., Garrell, J.M.: Evolution of multi-adaptive discretization intervals for a rule-based genetic learning system. In: Proceedings of the VIII Iberoamerican Conference on Artificial Intelligence (IBERAMIA'2002), LNAI vol. 2527, Springer (2002) 350–360
6. Fayyad, U.M., Irani, K.B.: Multi-interval discretization of continuous-valued attributes for classification learning. In: IJCAI (1993) 1022–1029
7. Wilson, S.W.: Get real! XCS with continuous-valued inputs. In Booker, L., Forrest, S., Mitchell, M., Riolo, R.L., eds.: Festschrift in Honor of John H. Holland, Center for the Study of Complex Systems (1999) 111–121
8. Liu, H., Setiono, R.: Chi2: Feature selection and discretization of numeric attributes. In: Proceedings of the 7th IEEE International Conference on Tools with Artificial Intelligence, IEEE Computer Society (1995) 388–391
9. Aguilar-Ruiz, J.S., Riquelme, J.C., Valle, C.D.: Improving the evolutionary coding for machine learning tasks. In: Proceedings of the European Conference on Artificial Intelligence (ECAI'02), Lyon, France, IOS Press (2002) 173–177
10. Giráldez, R., Aguilar-Ruiz, J.S., Riquelme, J.C.: Discretización supervisada no paramétrica orientada a la obtención de reglas de decisión. In: Proceedings of CAEPIA2001 (2001) 53–62
11. Riquelme, J.C., Aguilar, J.S.: Codificación indexada de atributos continuos para algoritmos evolutivos en aprendizaje supervisado. In: Proceedings of the "Primer Congreso Español de Algoritmos Evolutivos y Bioinspirados (AEB'02)" (2002) 161–167
12. Llorà, X., Garrell, J.M.: Knowledge-independent data mining with fine-grained parallel evolutionary algorithms. In: Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2001), Morgan Kaufmann (2001) 461–468
13. Rivest, R.L.: Learning decision lists. Machine Learning 2 (1987) 229–246
14. Soule, T., Foster, J.A.: Effects of code growth and parsimony pressure on populations in genetic programming. Evolutionary Computation 6 (1998) 293–309
15. Bacardit, J., Garrell, J.M.: Métodos de generalización para sistemas clasificadores de Pittsburgh. In: Proceedings of the "Primer Congreso Español de Algoritmos Evolutivos y Bioinspirados (AEB'02)" (2002) 486–493
16. Blake, C., Keogh, E., Merz, C.J.: UCI repository of machine learning databases (1998) www.ics.uci.edu/mlearn/MLRepository.html
17. Martínez Marroquín, E., Vos, C., et al.: Morphological analysis of mammary biopsy images. In: Proceedings of the IEEE International Conference on Image Processing (ICIP'96) (1996) 943–947
18. Martí, J., Cufí, X., Regincós, J., et al.: Shape-based feature selection for microcalcification evaluation. In: Imaging Conference on Image Processing, 3338:1215–1224 (1998)
19. Golobardes, E., Llorà, X., Garrell, J.M., Vernet, D., Bacardit, J.: Genetic classifier system as a heuristic weighting method for a case-based classifier system. Butlletí de l'Associació Catalana d'Intel·ligència Artificial 22 (2000) 132–141
20. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann (2000)
Limits in Long Path Learning with XCS

Alwyn Barry
Department of Computer Science, University of Bath
Claverton Down, Bath, BA2 7AY, UK
[email protected]
Abstract. The development of the XCS Learning Classifier System [26] has produced a stable implementation, able to consistently identify the accurate and optimally general population of classifiers mapping a given reward landscape [15,16,29]. XCS is particularly powerful within direct-reward environments, and notably within problems suitable for commercial application [3]. The application of XCS within delayed-reward environments has also shown promise, although early investigations were within environments with a comparatively short delay to reward (e.g. [28,19]). Subsequent systematic investigations [19,20,1,2] have suggested that XCS has difficulty accurately mapping and exploiting even simple environments with moderate reward delays. This paper summarises these results and presents new results that identify some limits and their implications. A modification to the error computation within XCS is introduced that allows the minimum error parameter to be applied relative to the magnitude of the payoff to each classifier. First results demonstrate that this modification enables XCS to successfully map longer delayed-reward environments.
1 Background
Learning Classifier Systems ('LCS') are a class of machine learning techniques that utilise evolutionary computation to provide the main knowledge induction algorithm. They are characterised by the representation of knowledge in terms of a population of simplified production rules ('classifiers') in which the conditions are able to cover one or more inputs. The Michigan LCS [13] maintains a single population of production rules with a Genetic Algorithm ('GA') operating within the population; each rule maintains its own fitness estimate. LCS are general machine learners, primarily limited by the constraints in the representation adopted for the production rules (see, for example, [29]) and by the complexity of solutions that can be maintained under the action of the GA [10]. LCS have been successfully applied to many application areas, most notably for Data Mining [22,14,3] but also in more complex control problems (e.g. [12,24]; see [21] for further details of LCS applications). Although LCS performance has been competitive with the most effective machine learning techniques, it is notable that many of these cases (and all
E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 1832–1843, 2003. © Springer-Verlag Berlin Heidelberg 2003
of those cited earlier in this paragraph) have been with LCS implementations that use direct reward allocation. In contrast, the application of LCS to complex delayed reward tasks has, until recently, been more problematic (there are many reasons for this; for a review of the issues see [1]). Recently a number of new LCS implementations appear to have overcome some of the instabilities of LCS when learning in delayed reward situations. ZCS [25] is a simplification of earlier LCS implementations and was designed for academic study, but [4,6] have shown that ZCS can perform optimally with appropriate parameter settings. [16] has argued persuasively that strength-based LCS implementations will inevitably lead to over-generalisation in tasks with biased reward functions. XCS [26] is an accuracy-based LCS derived from ZCS which overcomes many of these limitations, and the Optimality Hypothesis [15] suggests that XCS will always identify a sub-population of accurate, optimally general classifiers that occupy a larger proportion of the population than other classifiers. Bull argues that the fitness sharing mechanism of ZCS acts to prevent over-generalisation within ZCS, making ZCS a competitive strength-based LCS [5].

The effectiveness of XCS in its application to direct reward environments has been empirically demonstrated by many workers (for example, [26,15,29,3,11]). Research into the performance of XCS within delayed reward environments has been more limited. [26,27] provided a proof-of-concept demonstration of the operation of XCS within the Woods2 environment. [17] identified that within certain Woods-like environments XCS was unable to identify optimum generalisations. This was attributed to two major factors: an inequality in exploration of all states in the environment, allowing over-general classifiers to appear accurate, and an input encoding which meant that certain generalisations were not explored as often as others.

[18] sought to apply these lessons to more complex Woods-based environments and discovered that XCS was additionally unable to establish a solution to the long-chain Woods-14 problem [9]. This was due in part to the number of possible alternatives to explore in each state, which prevented XCS from attributing equal exploration time to later states within the chain. It has since been shown that XCS is able to learn this environment using an alternative explore-exploit strategy [6]. Whilst other work has investigated some of the more complex problems that delayed reward environments present, such as perceptual aliasing, there had been little investigation of the comparative performance of traditional LCS implementations and XCS in delayed reward environments beyond the simple Woods environments. Much more importantly, no work had attempted to identify the limits of XCS learning with increasing environment complexity or length when applied to delayed reward environments. [2] presented the results of an investigation of the ability of XCS to form the population of optimally general classifiers mapping the payoff of environments of incrementally increasing length. It was shown that XCS performance was very good within the GREF-1 environment (100% performance within 1000 explorations, where CFS-C [23] achieved 90% performance in 10,000 explorations). It was also shown that XCS can reliably learn the optimal route in a corridor environment (an extension of Fig. 1) of length
Fig. 1. A length 5 corridor Finite State World for delayed reward experiments

Fig. 2. (a) State × Prediction (log scale) coverage of the populations from 10 runs of Barry's XCS in a length 15 corridor environment after 15000 iterations; (b) Average of 10 runs of Butz's XCS in a length 10 corridor environment showing errors in the 'steps' plot
40 where generalisation was not used. However, once XCS was required to learn the optimal generalisations, XCS was unable to reliably select the optimal route even within the length 10 environment. Learning performance deteriorated sufficiently thereafter to make investigation of performance in corridors longer than 15 steps superfluous (see Fig. 2a: the payoff prediction values of classifiers in the early states are highly confused). Subsequent investigations using an alternative XCS implementation [7] have confirmed these findings (see the perturbation in the 'steps' plot in Fig. 2b: XCS should rapidly identify the optimum number of steps to the reward, but instead often chooses at least one sub-optimal action in each corridor traversal). Analysis of the population coverage of the environment indicated that XCS was unable to learn appropriate generalisations for early states in the environment. It was hypothesised that this was partially due to the reduction in payoff prediction values in classifiers covering these early states, making the prediction values sufficiently similar that XCS can identify a few over-general classifiers to cover the early states (see Fig. 2a). This conclusion caused some other workers to suggest that the problem might be resolved through the use of Wilson's modified accuracy measure or the use of an alternative subsumption algorithm (both introduced within [8]). These suggestions, though ill-founded, led to further investigations into the cause of the over-generalisation within these early states.
2 XCS Learning in Delayed-Reward Environments
The match set ('[M]') is the set of all classifiers whose conditions match the current environmental input. XCS selects one of the actions proposed by [M], and all classifiers in [M] which advocate that action become members of the action set ('[A]') in the current iteration. When reward R is obtained by XCS in a delayed reward environment, the reward is allocated to all classifiers in [A]. Clearly the reward could only have been reached as a result of earlier actions. So that the optimal path to the reward can be established, XCS allocates payoff to classifiers within earlier action sets. In each exploration iteration the maximum action-set prediction of [M], discounted by a discount factor γ, is used as the payoff P in the update of the predictions p of the classifiers in the previous action set ([A]t−1) using the following update scheme (0.0 < β ≤ 1.0):

    p = p + β (P − p) .                                  (1)

Over time all action sets will converge upon an estimate of the discounted payoff that will be received if the action was to be taken. The discount allows the estimate to take account of distance to reward as well as the magnitude of the reward, so that a trade-off of effort to reward is inherent in the action selection. Unfortunately the discount also means that the payoff prediction becomes much smaller with distance from the reward. With the 'standard' XCS discount (γ = 0.71) the payoff reduces from the reward of 1000 to less than 10 within 14 steps; in general, the payoff in state n of an N-state path to reward R is:

    p_n = R γ^(N−n) .                                    (2)

XCS is an accuracy-based LCS. Accuracy is a steep logarithmic function of the error in the classifier's prediction of payoff ([8] introduce an alternative accuracy function which is calculated using a power function):

    κ = e^(ln(α) (ε − ε0)/(ε0 m))    (ε > ε0)
    κ = 1.0                          (otherwise) ,       (3)

where α and m are constants, and the error ε is updated as:

    ε = ε + β (ε_abs − ε) ,                              (4)

where

    ε_abs = |P − p| / (Rmax − Rmin)                      (5)

(Rmax − Rmin is the reward range). Fitness is based on the relative accuracy of the classifiers appearing within each [A].
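The scale of the problem Eq. 2 creates can be checked numerically. The sketch below is our own illustrative calculation (the function name and framing are ours, not from any XCS implementation): it computes the converged payoff prediction for an action taken k steps before the reward and confirms the claim above that, with γ = 0.71 and R = 1000, the prediction drops below 10 within 14 steps.

```python
# Discounted payoff k steps before the reward (cf. Eq. 2): p_k = R * gamma**k.
# Illustrative sketch only; names are ours, not from a specific XCS implementation.

R, GAMMA = 1000.0, 0.71

def prediction(k, R=R, gamma=GAMMA):
    """Converged payoff prediction for an action taken k steps before the reward."""
    return R * gamma ** k

# Find how many steps it takes for the prediction to fall below 10,
# as claimed in the text for the 'standard' discount.
k = 0
while prediction(k) >= 10.0:
    k += 1
print(k, round(prediction(k), 2))  # → 14 8.27
```

Since ε0 is typically 0.01 of the reward range, payoff differences of this size quickly approach the accuracy cut-off discussed in the next section.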
A cut-off parameter in the accuracy function, ε0 (typically 0.01), is used to identify when the accuracy (κ) of a classifier is sufficiently close to 1.0 for it to be considered fully accurate (see Eq. 3). The conclusions in [2] failed to take into account the effect of this cut-off calculation of accuracy on the generalisation behaviour of XCS. Once the error in payoff prediction falls below ε0 R, all predictions will be considered accurate. This is normally useful, filtering out noise in the payoff algorithm. However, in the early states the payoff prediction is sufficiently small that ε0 is a large proportion of the prediction. This will allow classifiers which advocate sub-optimal actions, with a payoff prediction that varies from the payoff by less than ε0 R, to be considered accurate. When the difference in stable payoff values for the same action in two neighbouring states falls below this threshold, it is possible to identify a single classifier to cover that action in both states. This classifier will be considered accurate even though there is a fluctuation in the payoff to the classifier. XCS will use this false accuracy to proliferate classifiers that generalise over successive states, so producing accurate over-general classifiers. That this is the case can be seen within Fig. 2a. In this environment the optimal route alternates between actions 0 and 1. A single classifier covers action 0 and another single classifier covers action 1 in states s0 to s3, so that the optimal action is not selected in states s1 and s3. Although this might suggest that a solution to the problem would be to reduce ε0, this is problematic because the noise created whilst seeking to identify appropriate generalisations will also prevent the identification of accurate classifiers. An alternative is to change the value of γ to reduce the amount of discount and keep the difference in neighbouring payoffs above ε0 R.
This may increase the length of paths that can be learnt, but as the level of discount is reduced the difference between neighbouring prediction values will decrease. It has been noted that where prediction values are close and there is an area of the environment where over-generals are encouraged, it is possible for large-numerosity over-general classifiers to develop [2]. These classifiers use their numerosity to dominate the action sets, reducing the payoff of the action sets to a value similar to their prediction, thereby making themselves more accurate and giving themselves more breeding opportunities. It is therefore hypothesised that:

Hypothesis: Reducing the discount level will increase the number of steps over which XCS can learn the optimal state × action × payoff mapping, but there will be a point beyond which further reduction will cause the mapping to be disrupted by over-general classifiers.
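The first half of the hypothesis can be illustrated with a back-of-envelope calculation (ours, not the paper's): using the stable payoffs of Eq. 2 and the ε0 cut-off of Eq. 3, find how many steps from the reward the difference between neighbouring stable payoffs first falls below ε0 R, the point at which a single classifier covering both states can appear accurate.

```python
# Sketch (our construction): the step count at which neighbouring stable payoffs
# become indistinguishable under the accuracy cut-off, i.e. where
# R*gamma**(k-1) - R*gamma**k = R*gamma**(k-1)*(1-gamma) < eps0*R.

R, EPS0 = 1000.0, 0.01

def generalisation_horizon(gamma, R=R, eps0=EPS0):
    """Smallest k for which payoffs k-1 and k steps from the reward differ by less than eps0*R."""
    k = 1
    while R * gamma ** (k - 1) * (1 - gamma) >= eps0 * R:
        k += 1
    return k

for gamma in (0.71, 0.8, 0.9, 0.95, 0.99):
    print(gamma, generalisation_horizon(gamma))
```

Raising γ pushes the horizon out (0.71 gives 11 steps; 0.95 gives over 30), but at γ = 0.99 even the payoffs of states adjacent to the reward differ by less than the threshold, foreshadowing the second half of the hypothesis.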
3 Experimental Approach
[1] introduced a test environment designed for testing the ability to learn an optimal policy as the distance to reward increases, whilst controlling other potentially confounding variables. This environment, depicted in Fig. 1, has many useful properties for these experiments (see [1]). In the experiments that follow, the environment used will be labelled 'FSW-N', where N is the length of the optimal path from the start to the reward state.
Unless stated otherwise, each experiment uses the following XCS parameters: β = 0.2, θ = 25, ε0 = 0.01, α = 0.1, χ = 0.8, µ = 0.04, pr = 0.5, P# = 0.33, pi = 10.0, εi = 0.0, fi = 0.01, m = 0.1, s = 20 (see [26] for a parameter glossary). Each experiment was repeated 30 times, and the results presented are the average of the 30 runs unless otherwise stated.
4 Investigating Length Limits
It was argued in §2 that the reason for the failure to learn the optimal solution in FSW-10 and FSW-15 was the small difference in action-set payoffs that results from the heavy discounting (low γ). To demonstrate that this is the case, XCS was run within the FSW-15 environment, changing γ to 0.75, 0.80, 0.85, 0.90 and 0.95 in each successive batch of 30 experiments. For these experiments the maximum population size (N) was 1200 and the message size was 7 bits (for comparability with [2]). For each exploitation iteration the System Relative Error [1] was calculated, in addition to the number of steps taken in the environment and the population size. The results of each batch of 30 runs were averaged and the averages compared.

Figures 3 and 4 show the results and coverage charts at discount values 0.80 and 0.95. From 0.75 through to 0.90 the maximum System Relative Error reduces and the number of steps taken to achieve the reward moves towards the optimal route, with γ = 0.95 allowing the optimal route to be reliably selected. The identification that XCS can learn the optimal route in the FSW-15 environment given an appropriate discount factor provides an initial verification of the hypothesis. However, it is useful to question why a discount of 0.8 or 0.85 was not effective. The answer is partially revealed by an examination of Table 1. The difference between the predictions in s0 and s1 for discount 0.8 is 11, and the difference in payoff between the optimal and sub-optimal route in s0 is 8.8, below the ε0 error boundary and therefore a candidate for generalisation. This is reflected in the results for γ = 0.8: the 'iterations' plot shows one incorrect decision is taken in each episode (see Fig. 3). At γ = 0.85 the difference between
Fig. 3. Averaged results from 30 runs of XCS in FSW-15 at γ = 0.80
Fig. 4. Averaged results from 30 runs of XCS in FSW-15 at γ = 0.95

Table 1. Payout predictions in states within FSW-15 with different values of γ

γ      s0err    s0      s1      s2      s3      s4      s5
0.71   5.87     8.27    11.65   16.41   23.11   32.55   45.85
0.75   13.36    17.82   23.76   31.68   42.24   56.31   75.08
0.8    35.18    43.98   54.98   68.72   85.90   107.37  134.21
0.85   87.35    102.77  120.91  142.24  167.34  196.87  231.62
0.9    205.89   228.77  254.19  282.43  313.81  348.68  387.42
0.95   463.29   487.68  513.34  540.36  568.80  598.74  630.25
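Table 1's entries follow directly from Eq. 2. The sketch below is our reconstruction, with an inferred indexing convention: the optimal action in state s_i of FSW-15 has 14 − i further actions before the reward, and the s0err column is taken to be the sub-optimal action in s0, which pays γ times the s0 optimal prediction.

```python
# Regenerating Table 1 from Eq. 2 (our inferred indexing; see lead-in above).

R, N = 1000.0, 15

def table1_row(gamma, states=6):
    s0err = R * gamma ** N                                       # sub-optimal action in s0
    optimal = [R * gamma ** (N - 1 - i) for i in range(states)]  # s0..s5
    return [round(v, 2) for v in [s0err] + optimal]

for gamma in (0.71, 0.75, 0.8, 0.85, 0.9, 0.95):
    print(gamma, table1_row(gamma))
```

The γ = 0.8 row reproduces the figures discussed above: s1 − s0 = 54.98 − 43.98 = 11, and s0 − s0err = 43.98 − 35.18 = 8.8, below the ε0 boundary of 10.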
the predicted payoff in s0 and s1 is increased. Unfortunately the coverage graph (not shown) indicates that the difference of 15 between the optimal and non-optimal routes in s0 is still sufficiently small to adversely influence the coverage of the sub-optimal route in s0.

Now that it is clear that γ = 0.95 allows XCS to learn the optimal coverage of FSW-15, the second part of the hypothesis must be tested. This suggests that as the discount becomes small, over-general classifiers will develop. It is worth noting that the proximity of the payoff values produced by γ = 0.95 was assumed by the author, prior to the investigation, to be sufficient to start to produce this generalisation. To investigate further, γ was systematically increased to 0.99 in steps of 0.01, and then from 0.99 to 0.999 in steps of 0.001. When the results were analysed, a clear pattern was evident. As γ was increased towards 0.99 the System Relative Error increased and more errors were evident in the number of steps taken to sr. However, as γ drew near to 0.99 and thereafter, the System Relative Error reduced even though the number of steps taken gradually moved towards that expected for a random selection of action in each state. An examination of the populations revealed that at 0.995 fully general classifiers dominated 7 of the 30 populations (see Fig. 5a), and at 0.999 all populations maintained high-numerosity fully general classifiers. This is no surprise: the proximity of the predictions at 0.999 causes the difference in prediction to be well below ε0, and the range of predictions is sufficiently small to allow a fully general classifier to easily gain sufficient numerosity to suppress the
Fig. 5. Coverage charts for FSW-15 at γ = 0.99 and γ = 0.995
predictions and always be classed as accurate. What was surprising was the revelation that XCS was able to maintain such a reasonable coverage at a discount level of γ = 0.99.

Having demonstrated that XCS can perform optimally within FSW-15, it was appropriate to investigate how much the environment could be extended before XCS would once more perform sub-optimally. XCS was run with γ = 0.95 in FSWs of length 20, 25, 30 and 40. If errors in coverage are guaranteed when the difference between the payoffs in sn and sn+1 is less than 10.0, then a simple mathematical exercise would suggest that the practical limit is an environment of length 30. However, it is now known that a difference of 15 was sufficient to prevent appropriate early-state coverage, and a difference of this magnitude is seen between the optimal and sub-optimal routes only 22 states from sr. When the experiments were completed and the results analysed, it was found that XCS was able to find an optimal solution to FSW-20 at γ = 0.95. In FSW-25 the coverage chart remained almost optimal, and a careful analysis of each run showed that a single erroneous step was taken in 2.3% of the episodes after episode 3000. Within FSW-30 up to three additional steps were taken in each episode after episode 3000, and the coverage chart indicated generalisation over states up to s6 (23 states from sr).

These limitations present some difficulties. XCS has shown itself to be powerful in application to direct-reward problems, and yet apparently fundamentally limited in relatively small delayed-reward environments. However, it is important to understand that the problems arise not because of the basic mechanisms of XCS, but because of the use of an absolute measure of error. As Eq. 5 indicates, error is computed relative to the reward range without taking account of the discount mechanism. A way to tackle this problem may be to identify an error measure that is independent of the position of the classifier in the action chain to the reward. A first attempt at such a solution was devised. This involved a simple modification to XCS to retain an estimate of the distance d from the reward within each classifier. This was used to calculate an action-set estimate of distance to the nearest reward and to calculate the error as a proportion of
Fig. 6. Comparison of coverage charts for absolute error calculation and relative error calculation in FSW-15 at γ = 0.71
γ^(N−d) R. Although this technique resulted in a reduction in the System Relative Error, considerable error remained within the calculations, due to the magnification of errors in the estimate of d by the calculation. It is possible that reducing d to an integer value within this expression may mask these errors. An alternative approach replaced the absolute calculation of error with:

    ε = |P − p| / max(P, p)    (max(P, p) > 0)
    ε = 0.0                    (otherwise) .             (6)

As the prediction becomes more accurate, the relative difference between P and p will reduce, and so the error will become small independently of the magnitude of the payoff P. This, and an alternative error calculation, ε = |P − p|/p (capped to 1.0 if ε > 1.0), have been the subject of recent investigations. Figure 6 shows that the use of the relative error update method reduces the over-generalisation in states above state s0. In FSW-15 at γ = 0.71 the two relative error expressions appear to produce similar results, although the second should provide less variance in the initial updates. The results of applying a relative error calculation are encouraging, though further investigations are now required to identify any penalties that may be present in the formulation of the optimal sub-population of accurate classifiers. It would appear that the use of relative error makes the identification of accurate classifiers more problematic, leading to greater divergence in the population, although this does not appear to dramatically affect the time taken for XCS to identify the optimal route. The use of a more focused GA selection technique, such as Tournament Selection, may resolve this problem.
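The effect of the relative measure can be seen in a small numerical comparison (ours, illustrative only; it uses the converged payoffs of Eq. 2 rather than the incremental updates of Eq. 4): far from the reward, the absolute error of Eq. 5 classes a sub-optimal prediction as accurate, while the relative error of Eq. 6 does not.

```python
# Absolute (Eq. 5) vs relative (Eq. 6) error, illustrated with converged payoffs
# 14 steps from the reward. Sketch only; names and scenario are ours.

R, GAMMA, EPS0 = 1000.0, 0.71, 0.01

def abs_error(P, p, reward_range=R):      # Eq. 5: error relative to the reward range
    return abs(P - p) / reward_range

def rel_error(P, p):                      # Eq. 6: error relative to the payoff magnitude
    return abs(P - p) / max(P, p) if max(P, p) > 0 else 0.0

P_opt = R * GAMMA ** 14   # optimal payoff in the start state of FSW-15 (~8.27)
P_sub = R * GAMMA ** 15   # sub-optimal payoff in the same state (~5.87)

# A classifier holding the optimal prediction, measured against the sub-optimal payoff:
print(abs_error(P_sub, P_opt) < EPS0)   # → True: the absolute error calls it accurate
print(rel_error(P_sub, P_opt) < EPS0)   # → False: the relative error does not
```

The same |P − p| of about 2.4 is negligible against the full reward range of 1000 but large against payoffs of order 10, which is exactly the inequity the relative measure removes.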
5 Discussion
The limitations on the length of delayed reward environments are highly constraining. Delayed reward environments are commonly of much greater size
within the laboratory, let alone within 'real-world' applications. In defence of XCS it should be noted that the corridor environment used within these tests is highly artificial. Richer environments may provide more distinct payoff gradients which will enable XCS to work over a larger range. It should also be noted that changes in parameters alongside modification of γ may affect the performance. For example, in experiments conducted alongside this investigation it was found that using the within-niche mutation scheme of [8] produced a much weaker performance, because it encouraged an early decision on the most accurate generalisation and so aided the formation of over-general classifiers. Mutation schemes that encourage more diversity will act as a pressure against generalisation, and so enable XCS to map the environment using more specific classifiers. This is hardly a desirable solution, however.

The results are interesting in the light of [16], which identifies that all non-trivial delayed reward environments are environments which encourage the development of over-general classifiers. Whilst XCS, as an accuracy-based LCS, has some protection against over-general classifiers, it is clear that the formation of over-generals will be encouraged as soon as there is a failure of the accuracy function to distinguish between payoff boundaries. The lack of provision for discount within the error calculation leads to an inequity of accuracy computation that encourages such a failure. Whilst the results presented on the use of relative error are promising, more work is required in order to identify the dynamics of the calculated error in various parts of the action chain, and to identify the effect of the measure on the ability of XCS to satisfy the Optimality Hypothesis. The use of a relative measure of error should allow the length of paths that can be optimally mapped to be extended, but new limits must be established.
It is recognised that any discounted payoff scheme will cause hard limits in the length of path learning. Therefore work towards autonomous problem-space subdivision, the autonomous identification of sub-goals and their use in hierarchical planning remain important research objectives. Many other Reinforcement Learning methods face related problems in long path learning, and lessons can be drawn from these areas. However, the limits this paper has sought to address are those generated by the requirement to identify the minimum set of input generalisations that produce an accurate condition × action × payoff prediction mapping. The combination of accuracy-based learning and the requirement for the production of an optimally compact and accurate mapping is unique to XCS.
References

1. Alwyn Barry. XCS Performance and Population Structure within Multiple-Step Environments. PhD thesis, Queens University Belfast, 2000.
2. Alwyn M. Barry. The stability of long action chains in XCS. Journal of Soft Computing, 6(3–4):183–199, 2002.
3. Ester Bernadó, Xavier Llorà, and Josep M. Garrell. XCS and GALE: a Comparative Study of Two Learning Classifier Systems with Six Other Learning Algorithms on Classification Tasks. In Proceedings of the 4th International Workshop on Learning Classifier Systems (IWLCS-2001), pages 337–341, 2001. Short version published in the Genetic and Evolutionary Computation Conference (GECCO-2001).
4. Larry Bull. On using ZCS in a Simulated Continuous Double-Auction Market. In Wolfgang Banzhaf, Jason Daida, Agoston E. Eiben, Max H. Garzon, Vasant Honavar, Mark Jakiela, and Robert E. Smith, editors, Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-99), pages 83–90. Morgan Kaufmann, 1999.
5. Larry Bull. On accuracy-based fitness. Journal of Soft Computing, 6(3–4):154–161, 2002.
6. Larry Bull, Jacob Hurst, and Andy Tomlinson. Mutation in Classifier System Controllers. In J.A. Meyer et al., editors, From Animals to Animats 6: Proceedings of the Sixth International Conference on Simulation of Adaptive Behavior, pages 460–467, 2000.
7. Martin V. Butz. An Implementation of the XCS classifier system in C. Technical Report 99021, The Illinois Genetic Algorithms Laboratory, 1999.
8. Martin V. Butz and Stewart W. Wilson. An Algorithmic Description of XCS. Technical Report 2000017, Illinois Genetic Algorithms Laboratory, 2000.
9. Dave Cliff and Susi Ross. Adding Temporary Memory to ZCS. Adaptive Behavior, 3(2):101–150, 1994. Also technical report: ftp://ftp.cogs.susx.ac.uk/pub/reports/csrp/csrp347.ps.Z.
10. M. Compiani, D. Montanari, R. Serra, and P. Simonini. Learning and Bucket Brigade Dynamics in Classifier Systems. Physica D, 42:202–212, 1990.
11. Lawrence Davis, Chunsheng Fu, and Stewart W. Wilson. An incremental multiplexer problem and its uses in classifier system research. In Pier Luca Lanzi, Wolfgang Stolzmann, and Stewart W. Wilson, editors, Advances in Learning Classifier Systems, volume 2321 of LNAI, pages 23–31. Springer-Verlag, Berlin, 2002.
12. Marco Dorigo and Marco Colombetti. Robot Shaping: An Experiment in Behavior Engineering. MIT Press/Bradford Books, 1998.
13. John H. Holland, Keith J. Holyoak, Richard E. Nisbett, and P.R. Thagard. Induction: Processes of Inference, Learning, and Discovery. MIT Press, Cambridge, 1986.
14. John H. Holmes. A genetics-based machine learning approach to knowledge discovery in clinical data. Journal of the American Medical Informatics Association Supplement, 1996.
15. Tim Kovacs. Evolving Optimal Populations with XCS Classifier Systems. Master's thesis, School of Computer Science, University of Birmingham, Birmingham, U.K., 1996.
16. Tim Kovacs. A Comparison of Strength and Accuracy-based Fitness in Learning Classifier Systems. PhD thesis, University of Birmingham, 2002.
17. Pier Luca Lanzi. A Study of the Generalization Capabilities of XCS. In Thomas Bäck, editor, Proceedings of the 7th International Conference on Genetic Algorithms (ICGA97), pages 418–425. Morgan Kaufmann, 1997.
18. Pier Luca Lanzi. Solving Problems in Partially Observable Environments with Classifier Systems (Experiments on Adding Memory to XCS). Technical Report 97.45, Politecnico di Milano, Department of Electronic Engineering and Information Sciences, 1997.
19. Pier Luca Lanzi. Generalization in Wilson's XCS. In A.E. Eiben, T. Bäck, M. Schoenauer, and H.-P. Schwefel, editors, Proceedings of the Fifth International Conference on Parallel Problem Solving From Nature – PPSN V, number 1498 in LNCS. Springer Verlag, 1998.
20. Pier Luca Lanzi. An Analysis of Generalization in the XCS Classifier System. Evolutionary Computation, 7(2):125–149, 1999.
21. Pier Luca Lanzi and Rick L. Riolo. A Roadmap to the Last Decade of Learning Classifier System Research (from 1989 to 1999). In Pier Luca Lanzi, Wolfgang Stolzmann, and Stewart W. Wilson, editors, Learning Classifier Systems: From Foundations to Applications, volume 1813 of LNAI, pages 33–62. Springer-Verlag, Berlin, 2000.
22. Alexandre Parodi and P. Bonelli. The Animat and the Physician. In J.A. Meyer and S.W. Wilson, editors, From Animals to Animats 1: Proceedings of the First International Conference on Simulation of Adaptive Behavior (SAB90), pages 50–57. A Bradford Book, MIT Press, 1990.
23. Rick L. Riolo. The Emergence of Coupled Sequences of Classifiers. In J. David Schaffer, editor, Proceedings of the 3rd International Conference on Genetic Algorithms (ICGA89), pages 256–264, George Mason University, June 1989. Morgan Kaufmann.
24. Robert E. Smith, B.A. Dike, R.K. Mehra, B. Ravichandran, and A. El-Fallah. Classifier Systems in Combat: Two-sided Learning of Maneuvers for Advanced Fighter Aircraft. In Computer Methods in Applied Mechanics and Engineering. Elsevier, 1999.
25. Stewart W. Wilson. ZCS: A zeroth level classifier system. Evolutionary Computation, 2(1):1–18, 1994.
26. Stewart W. Wilson. Classifier Fitness Based on Accuracy. Evolutionary Computation, 3(2):149–175, 1995.
27. Stewart W. Wilson. Generalization in evolutionary learning. Presented at the Fourth European Conference on Artificial Life (ECAL97), Brighton, UK, July 27–31, 1997.
28. Stewart W. Wilson. Generalization in the XCS classifier system. In John R.
Koza, Wolfgang Banzhaf, Kumar Chellapilla, Kalyanmoy Deb, Marco Dorigo, David B. Fogel, Max H. Garzon, David E. Goldberg, Hitoshi Iba, and Rick Riolo, editors, Genetic Programming 1998: Proceedings of the Third Annual Conference, pages 665–674. Morgan Kaufmann, 1998. 29. Stewart W. Wilson. Mining Oblique Data with XCS. Technical Report 2000028, University of Illinois at Urbana-Champaign, 2000.
Bounding the Population Size in XCS to Ensure Reproductive Opportunities

Martin V. Butz and David E. Goldberg
Illinois Genetic Algorithms Laboratory (IlliGAL), University of Illinois at Urbana-Champaign, 104 S. Mathews, 61801 Urbana, IL, USA
{butz,deg}@illigal.ge.uiuc.edu
Abstract. Despite several recent successful comparisons and applications of the accuracy-based learning classifier system XCS, it is hardly understood how crucial parameters should be set in XCS, nor how XCS can be expected to scale up to larger problems. Previous research identified a covering challenge in XCS that must be met to ensure that the genetic learning process takes place at all. Furthermore, a schema challenge was identified that, once met, ensures the existence of accurate classifiers. This paper builds on these challenges to derive a reproductive opportunity bound, which assures that more accurate classifiers get a chance to reproduce. The relation of this bound to the previous bounds, as well as to the specificity pressure in XCS, is discussed. The derived bound shows that XCS scales up in a machine-learning-competitive way.
1 Introduction
The XCS classifier system has recently gained increasing attention. Especially in the realm of classification problems, XCS has been shown to solve many typical machine learning problems with a performance comparable to other traditional machine learning algorithms [15,1,7]. On the theory side, however, XCS is still rather poorly understood. There are few hints for parameter settings, and scale-up analyses do not exist. Wilson [16,17] suggested that XCS scales polynomially in the number of concepts that need to be distinguished and in the problem length, but this hypothesis has not been investigated to date. This paper derives an important population size bound that needs to be obeyed to ensure learning. The derivation departs from the previously identified covering challenge, which ensures that the genetic learning mechanism takes place, and the schema challenge, which requires that important (sufficiently specialized) classifiers are present in the initial population [3]. The two challenges were shown to result in specificity-dependent population size bounds. In addition to these requirements, the population must be large enough to ensure that more accurate classifiers have reproductive opportunities. Following this intuition, we derive a reproductive opportunity bound that bounds the population size of XCS in O(l^{k_d}), where l denotes the problem length and k_d denotes

E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 1844–1856, 2003. © Springer-Verlag Berlin Heidelberg 2003
the minimal schema order in the problem. Since k_d is usually small, Wilson's hypothesis that XCS scales polynomially in the problem length l holds. In general, our analysis provides further insights into scale-up behavior, necessary parameter settings, and basic XCS functioning. After a short overview of the XCS system, we review the previously identified problem bounds, identify a time dimension in the schema challenge, and then analyze the additional reproductive opportunity bound. Empirical confirmation of the derived bound is provided. Summary and conclusions complete the paper.
2 XCS in Brief
The XCS classifier system investigated herein is based on the work in [16,17,10] and derived from the algorithmic description in [6]. This XCS overview provides only the details necessary for the rest of the paper; the interested reader is referred to the cited literature. For simplicity, we introduce XCS as a pure classifier. The results, however, should readily carry over to multi-step problems in which reward needs to be propagated. We define a classification problem as a binary problem that provides problem instances σ ∈ {0, 1}^l (denoting the problem length by l). After an instance is classified, the problem provides feedback in terms of a scalar payoff ρ ∈ ℝ reflecting the quality of the classification. The task is to maximize this feedback by effectively always choosing the correct classification. XCS is basically a rule learning system: a population of rules, or classifiers, is evolved and adapted. Each rule consists of a condition part and a classification part. The condition part C specifies when the rule is active, or matches. It is coded over the ternary alphabet {0, 1, #} (i.e. C ∈ {0, 1, #}^l), where a #-symbol matches both zero and one. The proportion of non-don't-care symbols in the condition part determines the specificity of a classifier. In addition to condition and classification parts, each classifier specifies a reward prediction p that estimates the resulting payoff, a prediction error ε that estimates the mean absolute deviation of p, and a fitness F that measures the average relative scaled accuracy of p with respect to all competing classifiers. Further classifier parameters are the experience exp, time stamp ts, action set size estimate as, and numerosity num. XCS continuously evaluates and evolves its population. Rule parameters are updated by the Widrow-Hoff rule [14] and reinforcement learning techniques [13]. Effectively, the parameters approximate the average over all encountered cases.
The population is evolved by means of a covering mechanism and a genetic algorithm (GA). Initially, the population of XCS is empty. Given a problem instance, if no classifier matches the instance for a particular classification, a covering classifier is generated that matches that instance and specifies the missing classification. At each position of the condition, a #-symbol is induced with probability P_#. A GA is applied if the average time in the action set [A] since the last GA application, recorded by the time stamp ts, is greater than the threshold θ_GA. If a GA is applied, two classifiers are selected in [A] for reproduction using fitness proportionate selection with respect to the
fitness of the classifiers in [A]. The classifiers are reproduced, and the children undergo mutation and crossover. In mutation, each attribute in C of each classifier is changed with probability µ to either of the other two values of the ternary alphabet (niche mutation is not considered herein). The action is mutated to any other possible action with probability µ. For crossover, two-point crossover is applied with probability χ. The parents stay in the population and compete with their offspring. The classifiers are inserted applying subsumption deletion, in which classifiers are absorbed by experienced, more general, accurate classifiers. If the number of (micro-)classifiers in the population exceeds the maximal population size N, excess classifiers are deleted. A classifier is chosen for deletion by roulette wheel selection proportional to its action set size estimate as. Further, if a classifier is sufficiently experienced (exp > θ_del) and significantly less accurate than the average fitness in the population (F < δ · Σ_{cl∈[P]} F_cl / Σ_{cl∈[P]} num_cl), its probability of being selected for deletion is further increased. Note that the GA is consequently divided into a reproduction process that takes place inside particular action sets and a deletion process that takes place in the whole population.
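As a minimal sketch (illustrative only, not the paper's implementation), the ternary matching rule and the specificity measure described above can be expressed as:

```python
def matches(condition: str, instance: str) -> bool:
    """Ternary matching: a '#' matches both 0 and 1."""
    return all(c == '#' or c == b for c, b in zip(condition, instance))

def specificity(condition: str) -> float:
    """Proportion of non-don't-care symbols in the condition."""
    return sum(c != '#' for c in condition) / len(condition)

print(matches("##1#0", "10100"))   # the specified bits agree with the instance
print(specificity("##1#0"))        # two of five symbols are specified
```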
3 Specificity Change and Previous Problem Bounds

Previous investigations of XCS have proven the existence of an intrinsic generalization pressure in the system. Also, a covering challenge was identified, which requires that the GA takes place at all, as well as a schema challenge, which requires that important classifiers are present in the initial population. This section gives a short overview of these bounds. Moreover, assuming that none of the important classifiers are present in the initial population, a time estimate for the generation of the necessary classifiers is derived.

3.1 Specificity Pressure
Since XCS reproduces classifiers in the action set but deletes classifiers from the whole population, there is an implicit generalization pressure inherent in XCS's evolutionary mechanism. The change in specificity due to reproduction can be derived from the expected specificity in an action set s_[A] given the current specificity s_[P] in the population, which is given by [2]:

s_{[A]} = \frac{s_{[P]}}{2 - s_{[P]}}    (1)

Additionally, mutation pushes towards an equal number of symbols in a classifier, so that the expected specificity of the offspring s_o can be derived from the parental specificity s_p [4]:

\Delta m = s_o - s_p = 0.5\mu(2 - 3s_p)    (2)
Taken together, the evolving specificity in the whole population can be assessed by considering the reproduction of two offspring in a reproductive event and their impact on the whole population [4] (ignoring any possible fitness influence):

s_{[P(t+1)]} = s_{[P(t)]} + f_{ga}\,\frac{2\left(s_{[A]} + \Delta m - s_{[P(t)]}\right)}{N}    (3)
Parameter f_ga denotes the average frequency with which the GA is applied; it depends on the GA threshold θ_ga but also on the current specificity s_[P] as well as on the distribution of the encountered problem instances. While the specificity is initially determined by the don't-care probability P_#, it changes over time as formulated by Equation 3. Thus, parameter P_# can influence XCS's behavior only early in a run. From Equation 3 we can derive the steady-state value of the specificity s_[P] as long as no fitness influence is present. Note that while the speed of convergence depends on f_ga and N, the converged value is independent of both:

s_{[A]} + \Delta m - s_{[P]} = 0

\frac{s_{[P]}}{2 - s_{[P]}} + \frac{\mu}{2}\left(2 - 3\,\frac{s_{[P]}}{2 - s_{[P]}}\right) = s_{[P]}    (4)

Solving for s_{[P]}:

s_{[P]}^2 - (2.5\mu + 1)\,s_{[P]} + 2\mu = 0

s_{[P]} = \frac{1 + 2.5\mu - \sqrt{6.25\mu^2 - 3\mu + 1}}{2}    (5)
This derivation enables us to determine the converged specificity of a population given a mutation probability µ. Table 1 shows the resulting specificities in theory and as empirically determined on a random Boolean function (a Boolean function that returns randomly either zero or one and thus either 1000 or zero reward). The empirical results are slightly higher due to the higher noise in more specific classifiers, the biased accuracy function, the fitness-biased deletion method, and the parameter initialization method [4]. Although mutation's implicit specialization pressure (as long as s_[P] < 2/3) can be used to control the specificity of the population, too high a mutation rate most certainly disrupts useful problem substructures. Thus, in the future other ways of controlling the specificity in [P] might be worth exploring.

Table 1. Converged Specificities in Theory and in Empirical Results

µ                 | 0.02  0.04  0.06  0.08  0.10  0.12  0.14  0.16  0.18  0.20
s_[P], theory     | 0.040 0.078 0.116 0.153 0.188 0.223 0.256 0.288 0.318 0.347
s_[P], empirical  | 0.053 0.100 0.160 0.240 0.287 0.310 0.329 0.350 0.367 0.394
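Equation 5 is easy to check numerically; the short sketch below (the function name is our own) reproduces the theory row of Table 1:

```python
import math

def converged_specificity(mu: float) -> float:
    """Steady-state specificity s_[P] for mutation rate mu, i.e. the
    relevant root of s^2 - (2.5*mu + 1)*s + 2*mu = 0 (Equation 5)."""
    return (1 + 2.5 * mu - math.sqrt(6.25 * mu ** 2 - 3 * mu + 1)) / 2

# matches the theory row of Table 1: 0.040, 0.188, 0.347
for mu in (0.02, 0.10, 0.20):
    print(f"mu = {mu:.2f}  ->  s_[P] = {converged_specificity(mu):.3f}")
```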
3.2 Covering and Schema Bound
Before the specificity change and the evolutionary pressures in general apply, though, it must be assured that the GA takes place in the first place. This is addressed by the covering challenge [3]. To meet the challenge, the probability of covering a given random input needs to be sufficiently high (assuming uniformly distributed #-symbols):

P(\text{cover}) = 1 - \left(1 - \left(\frac{2 - s_{[P]}}{2}\right)^{l}\right)^{N} > 0    (6)
Once the covering bound is obeyed, the GA takes place and the specificity changes according to Equation 3. To evolve accurate classifiers, however, more accurate classifiers need to be present in the population. This leads to the schema challenge, which determines the probability that a schema representative exists in a population with a given specificity [3].

Order of Problem Difficulty. To understand the schema challenge, we use the traditional schema notation of GAs [9,8] to define a schema representative. A schema is characterized by its order k, which denotes the number of specified attributes, and its defining length δ, which is the distance between the outermost specified attributes. For example, the schema **11***0*000* has order six and defining length nine, whereas the schema **1*0*0****** has order three and defining length four. As proposed in [3], a classifier is said to be a representative of an order-k schema if its condition part has at least all k attributes of the schema specified and its action corresponds to the schema's action. For example, a classifier with condition C = ##1#0#0##000# would be a representative of the second schema but not of the first. Consequently, a representative specifies at least as many attributes as the order k of the schema it represents; if it specifies exactly k attributes, it is said to be maximally general with respect to the schema. A problem is now said to be of order of problem difficulty k_d if the smallest-order schema that supplies an information gain is of order k_d. Thus, if a problem is of order of problem difficulty k_d, any schema of order less than k_d will have the same (low) accuracy, essentially equal to that of the completely general schema (order k = 0). Similarly, only classifiers that have at least the k_d relevant attributes specified can have higher accuracy and higher fitness, and can thus provide fitness guidance.
Thus, a problem of order of difficulty k_d can only be solved by XCS if representatives of the order-k_d schema exist (or are generated at random) in the population. Given a particular problem, any schema can be assigned a probability of being correct. It has been shown in [3] that the more consistently a classifier is correct or incorrect, the more accurate the classifier will be. Thus, a problem provides fitness guidance from the over-general side if classifiers that specify parts of the relevant attributes are more consistently correct or incorrect.
An extreme case of this is the parity problem, in which all k relevant bits need to be specified to get any fitness guidance. The specification of any subset of those k bits results in a 50/50 chance of correct classification and consequently a base accuracy equal to that of the completely general classifier. Thus, in the parity problem there is no fitness guidance before reaching full accuracy (effectively going from zero to complete accuracy in one step), so that the order of problem difficulty of the parity problem is equal to the size of the parity. Note that the order of problem difficulty measure also affects many other common machine learning techniques. For example, the inductive decision tree learner C4.5 [12] relies entirely on the notion of information gain. If a problem has an order of difficulty k_d > 1, C4.5 would basically decide at random on which attribute to expand first. Consequently, C4.5 would generate an inappropriately large decision tree.

Schema Representative. With this notion we can formulate the probability that there exists an order-k schema representative in a population that has specificity s_[P] [3]:

P(\text{representative in } [P]) = 1 - \left(1 - \frac{1}{n}\left(\frac{s_{[P]}}{2}\right)^{k}\right)^{N} > 0    (7)
where n denotes the number of possible outcome classes. Similarly, the probability that a particular classifier in the population is a representative, as well as the expected number of representatives in the population, can be assessed:

P(\text{representative}) = \frac{1}{n}\,(s_{[P]})^{k}

E(\text{representative}) = N\,\frac{1}{n}\,(s_{[P]})^{k}    (8)

Requiring that at least one representative is expected in a particular population with specificity s_[P], we derive the following representative bound on the population size:

E(\text{representative}) > 1

N > \frac{n}{(s_{[P]})^{k}}    (9)
While this bound is rather easy to obey given a reasonably high specificity, the more interesting result is the constraint on the specificity given an actual population size N:

s_{[P]} > \left(\frac{n}{N}\right)^{1/k}    (10)
This shows that with increasing population size, the required specificity decreases.
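Equations 9 and 10 can be sketched directly; the function names are our own:

```python
def representative_population_bound(n: int, s_p: float, k: int) -> float:
    """Population size needed so that at least one order-k schema
    representative is expected (Equation 9): N > n / s_[P]^k."""
    return n / s_p ** k

def required_specificity(n: int, N: int, k: int) -> float:
    """Minimal specificity given a fixed population size (Equation 10)."""
    return (n / N) ** (1 / k)

print(representative_population_bound(n=2, s_p=0.2, k=3))
print(required_specificity(n=2, N=2000, k=3))
```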
3.3 Extension in Time
As mentioned above, the population initially has a specificity of s_[P] = 1 − P_#. From then on, as long as the population is not over-specific and the GA takes place, the specificity changes according to Equation 3. For example, if we set P_# = 1.0 so that the initial specificity is zero, the specificity will gradually increase towards the equilibrium specificity. The time dimension is determined by the time it takes until a representative is created by random mutation. Given a current specificity s_[P], we can determine the probability that a representative of a schema of order k is generated in a GA invocation:

P(\text{generation of representative}) = (1 - \mu)^{s_{[P]}\,k} \cdot \mu^{(1 - s_{[P]})\,k}    (11)

From this probability, we can determine the expected number of steps until at least one classifier has the desired attributes specified. Since this is a geometric distribution,

E(t(\text{generation of representative})) = \frac{1}{P(\text{generation of representative})} = \left(\frac{\mu}{1-\mu}\right)^{s_{[P]}\,k} \mu^{-k}    (12)

To derive a lower bound on the mutation probability, we can assume a current specificity of zero. The expected number of steps until the generation of a representative then equals µ^{−k}. Thus, given that we start with a completely general population, the expected time until the generation of a first representative is less than µ^{−k} (since s_[P] increases over time). Requiring that the expected time until such a classifier is generated is smaller than some threshold Θ, we can derive a lower bound on the mutation rate µ:

\mu^{-k} < \Theta

\mu > \Theta^{-1/k}    (13)
This bound actually relates to the representative bound determined above in Equation 10: setting Θ equal to N/n yields the same bound (s can be approximated by a · µ with a ≈ 2; see Table 1). Assuming that no representative exists in the population, the order of difficulty k_d of a problem influences the speed of learning due to the possibly delayed supply of representatives. In the extreme case of the parity function, this bound essentially bounds the learning speed, since a representative is automatically fully accurate.
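The time estimate of Equations 12 and 13 can be sketched as follows (helper names and parameter values are illustrative):

```python
def expected_generation_time(mu: float, s_p: float, k: int) -> float:
    """Expected number of GA invocations until mutation creates an order-k
    representative (Equation 12): (mu/(1-mu))^(s_[P]*k) * mu^(-k)."""
    return (mu / (1 - mu)) ** (s_p * k) * mu ** (-k)

def mutation_lower_bound(theta: float, k: int) -> float:
    """Mutation rate needed so that the expected wait stays below a
    threshold theta (Equation 13): mu > theta^(-1/k)."""
    return theta ** (-1 / k)

# with a fully general population (s_[P] = 0) the wait is simply mu^(-k)
print(expected_generation_time(mu=0.04, s_p=0.0, k=3))
print(mutation_lower_bound(theta=10000, k=4))
```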
4 Reproductive Opportunity Bound
Given the presence of representatives and GA application, another crucial requirement remains to be satisfied: The representative needs to get the chance to
reproduce. Since GA selection for reproduction takes place in action sets, reproduction can only be assured if the representative becomes part of an action set before it is deleted. To derive an approximate population size bound that ensures reproduction, we require that the probability of deletion of a representative is smaller than the probability that the representative is part of an action set. The probability of deletion is

P(\text{deletion}) = \frac{1}{N}    (14)
assuming that neither action set size estimates nor fitness influence the deletion process. This is a reasonable assumption given that the action set size estimate is inherited from the parents and the fitness is initially decreased by 0.1. Given a particular classifier cl with specificity s_cl, we can determine the probability that cl will be part of an action set given random inputs:

P(\text{in } [A]) = \frac{1}{n}\, 0.5^{\,l \cdot s_{cl}}    (15)
This probability is exact if the encountered binary input strings are uniformly distributed over the whole problem instance space {0, 1}^l. Combining Equations 14 and 15, we can now derive a constraint for successful evolution in XCS by requiring that the probability of deletion is smaller than the probability of reproduction. Since two classifiers are deleted at random from the population but classifiers are reproduced in only one action set (the current one), the probability of a reproductive opportunity needs to be larger than approximately twice the deletion probability:

P(\text{in } [A]) > 2\,P(\text{deletion})

\frac{1}{n}\, 0.5^{\,l \cdot s_{cl}} > \frac{2}{N}

N > n\, 2^{\,l \cdot s_{cl} + 1}    (16)
This bounds the population size by O(n 2^{ls}). Note that the specificity s depends on the order of problem difficulty k_d and the chosen population size N, as expressed in Equation 10. Since k_d is usually small, s can often be approximated by s = 1/l, so that the bound diminishes. However, in problems in which no fitness guidance can be expected from the over-general side up to order k_d > 1, the specificity s cannot be approximated as easily. The expected specificity of a representative of an order-k schema can be estimated given a current specificity s_[P] in the population. Given that the classifier specifies all k positions, its expected average specificity will be approximately

E(s(\text{representative of schema of order } k)) = \frac{k + (l - k)\, s_{[P]}}{l}    (17)
Substituting s_cl in Equation 16 with the expected specificity of a representative of an order k = k_d schema, necessary in a problem of order of difficulty k_d, the population size N can be bounded by

N > n\, 2^{\,l \cdot \frac{k_d + (l - k_d)\, s_{[P]}}{l} + 1}

N > n\, 2^{\,k_d + (l - k_d)\, s_{[P]} + 1}    (18)
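A quick numerical sketch of Equation 18 (the function name and example values are ours):

```python
def rop_bound(n: int, l: int, k_d: int, s_p: float) -> float:
    """Reproductive opportunity bound of Equation 18:
    N > n * 2^(k_d + (l - k_d) * s_[P] + 1)."""
    return n * 2 ** (k_d + (l - k_d) * s_p + 1)

# e.g. l = 20, k_d = 5, n = 2 at specificity 0.2:
# exponent = 5 + 15 * 0.2 + 1 = 9, so the bound is 2 * 512 = 1024
print(rop_bound(n=2, l=20, k_d=5, s_p=0.2))
```

The exponential dependence on the specificity s_[P] is what the following O-notation argument removes.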
This bound ensures that classifiers necessary in a problem of order of difficulty k_d will get reproductive opportunities. Once the bound is satisfied, existing representatives of the necessary order-k_d schemata are ensured to reproduce, and XCS is enabled to evolve a more accurate population. Note that this population size bound is exponential in the schema order k_d and in the string length times specificity l · s_[P]. This would mean that XCS scales exponentially in the problem length, which is certainly highly undesirable. However, as seen above, the necessary specificity actually decreases with increasing population size. Considering this, we show in the following that from the specificity constraint a general reproductive opportunity bound (ROP-bound) can be derived, which shows that the population size needs to grow in O(l^{k_d}). Equations 10 and 18 can both be denoted in O-notation:

s_{[P]} = O\!\left(\left(\frac{n}{N}\right)^{1/k}\right)

N = O\!\left(2^{\,l\, s_{[P]}}\right)    (19)

Ignoring additional constants, we can derive the following population size bound, which solely depends on the problem length l and the order of difficulty k_d:

N > 2^{\,l\,(n/N)^{1/k_d}}

N\,(\log_2 N)^{k_d} > n\, l^{k_d}    (20)
This reproductive opportunity bound essentially shows that the population size N needs to grow approximately exponentially in the order of the minimal building block k_d (i.e., the order of difficulty) and polynomially in the string length:

N = O(l^{k_d})    (21)
Note that in the usual case kd is rather small and can often be set to one (see Sect. 3.2). As shown in the extreme case of the parity problem, though, kd can be larger than one.
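The implicit inequality of Equation 20 can be solved numerically for the minimal N; the search routine below is our own illustrative sketch, not part of the paper:

```python
import math

def minimal_rop_population(n: int, l: int, k_d: int) -> int:
    """Smallest integer N with N * (log2 N)^k_d > n * l^k_d (Equation 20),
    found by doubling followed by bisection (lhs is increasing for N >= 2)."""
    target = n * l ** k_d

    def lhs(N: int) -> float:
        return N * math.log2(N) ** k_d

    lo, hi = 2, 4
    while lhs(hi) <= target:
        lo, hi = hi, hi * 2
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if lhs(mid) > target:
            hi = mid
        else:
            lo = mid
    return hi

# the minimal N grows roughly like l^k_d for fixed k_d
for l in (10, 20, 40):
    print(l, minimal_rop_population(2, l, 4))
```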
5 Bound Verification
To verify the derived reproductive opportunity bound as well as the time dimension of the representative bound, we apply XCS to a Boolean function problem of order of difficulty kd where kd is larger than one. The hidden-parity problem,
Fig. 1. Minimal population size depends on the applied specificity in the population (manipulated by varying mutation rates). When mutation, and thus specificity, is sufficiently low, the bound becomes obsolete. [Plot: population size (1000–8000) vs. specificity s_[P] (0.1–0.4) in the hidden-parity problem, l = 20, k = 5; empirical values vs. the bound 2 · 2^{5+15 s_[P]+1}.]
originally investigated in [11], is very suitable for manipulating k_d. The basic problem is represented by a Boolean function in which k relevant bits determine the outcome. If the k bits hold an even number of ones, the outcome is one, and zero otherwise. As discussed above, the order of difficulty k_d is equivalent to the number of relevant attributes k. Figure 1 shows that the reproductive opportunity bound is well approximated by Equation 16.¹ The experimental values are derived by determining the population size needed to reliably reach 100% performance. The average specificity in the population is derived from the empirical specificities in Table 1. XCS is assumed to reach 100% performance reliably if all 50 runs reach 100% performance after 200,000 steps. Although the bound corresponds to the empirical points when the specificity is high, in the case of lower specificity the actually needed population size departs from the approximation. The reason for this is the time dimension of the representative bound determined above: the lower the specificity in the population, the longer it takes until a representative is actually generated (by random mutation). Thus, the representative bound extends in time rather than in population size. To validate the O(l^k) bound of Equation 21, we ran further experiments in the hidden-parity problem with k = 4, varying l and adapting the mutation rate µ (and thus the specificity) appropriately.² Results are shown in Fig. 2. The bound was derived from experimental results averaged over 50 runs, determining the population size for which 98% performance is reached after 300,000 steps.

¹ XCS with tournament selection was used [5]. If not stated differently, parameters were set as follows: β = 0.2, α = 1, ε₀ = 1, ν = 5, θ_ga = 25, τ = 0.4, χ = 1.0, uniform crossover, µ = 0.04, θ_del = 20, δ = 0.1, P_# = 1.0, θ_sub = 20.
² Mutation was set as follows with respect to l: l = 10, µ = 0.10; l = 15, µ = 0.075; l = 20, µ = 0.070; l = 25, µ = 0.065; l = 30, µ = 0.060; l = 35, µ = 0.057; l = 40, µ = 0.055; l = 45, µ = 0.050; l = 50, µ = 0.050; l = 55, µ = 0.048; l = 60, µ = 0.048; l = 65, µ = 0.048.
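The hidden-parity oracle itself is easy to sketch; which positions are "hidden" is arbitrary (the paper does not specify them), so the choice below is purely illustrative:

```python
import random

def hidden_parity(instance: str, relevant: tuple) -> int:
    """Hidden-parity oracle: class 1 iff the relevant bits contain an
    even number of ones, class 0 otherwise."""
    ones = sum(int(instance[i]) for i in relevant)
    return 1 if ones % 2 == 0 else 0

# l = 20, k = 5 as in the Fig. 1 experiments; for illustration we simply
# take the first five positions as the hidden relevant bits
relevant = (0, 1, 2, 3, 4)
sigma = "".join(random.choice("01") for _ in range(20))
print(sigma, "->", hidden_parity(sigma, relevant))
```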
Fig. 2. The reproductive opportunity bound can be observed by altering the problem length l and using optimal mutation µ, given a problem of order of difficulty k_d. The experimental runs confirm that the evolutionary process indeed scales up in O(l^k). [Plot: population size (0–18000) vs. string length l (10–60) in the hidden-parity problem, k = 4; empirical values (>98% performance after 300,000 steps) vs. 300 + 0.0009 l^k.]
With appropriate constants added, the experimentally derived bound is well approximated by the theory.
6 Summary and Conclusions
In accord with Wilson's early scale-up hypothesis [16,17], this paper confirms that XCS scales up polynomially in the problem length l and exponentially in the order of problem difficulty k_d, and thus in a machine-learning-competitive way. The empirical analysis of the hidden-parity function confirmed the derived bounds. Combined with the previously derived bounds in XCS, the analysis provides several hints on mutation and population size settings. All bounds so far identified in XCS assure that the GA takes place (covering bound), that initially more accurate classifiers are supplied (schema bound, in population size and generation time), and that these more accurate classifiers are propagated (reproductive opportunity bound). Many issues remain to be addressed. First, we assumed that crucial schema representatives provide fitness guidance towards accuracy. Type, strength, and reliability of this guidance require further investigation. Second, classifier parameter values are always only approximations of their assumed average values. Variance in these values needs to be accounted for. Third, although reproductive opportunity events are assured by our derived bound, it needs to be assured that recombination effectively processes important substructures and that mutation does not destroy them. Fourth, the number of to-be-represented classifier niches (denoted by [O] in [11]) needs to be taken into account in our model. Fifth, the distribution of classifier niches, such as overlapping niches and obliqueness, needs to be considered [15]. Integrating these points should allow us to derive a rigorous theory of XCS functioning and enable us to develop more competent and robust XCS learning systems.
Acknowledgment. We are grateful to Kumara Sastry and the whole IlliGAL lab for their help on the equations and for useful discussions. The work was sponsored by the Air Force Office of Scientific Research, Air Force Materiel Command, USAF, under grant F49620-00-0163. Research funding was also provided by a grant from the National Science Foundation under grant DMI-9908252. The US Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation thereon. Additional support from the Automated Learning Group (ALG) at the National Center for Supercomputing Applications (NCSA) is acknowledged. Additional funding from the German research foundation (DFG) under grant DFG HO1301/4-3 is acknowledged. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Air Force Office of Scientific Research, the National Science Foundation, or the U.S. Government.
References

1. Bernadó, E., Llorà, X., Garrell, J.M.: XCS and GALE: A comparative study of two learning classifier systems and six other learning algorithms on classification tasks. In Lanzi, P.L., Stolzmann, W., Wilson, S.W., eds.: Advances in Learning Classifier Systems: 4th International Workshop, IWLCS 2001. Springer-Verlag, Berlin Heidelberg (2002) 115–132
2. Butz, M.V., Kovacs, T., Lanzi, P.L., Wilson, S.W.: Theory of generalization and learning in XCS. IEEE Transactions on Evolutionary Computation (submitted, 2003); also available as IlliGAL technical report 2002011 at http://wwwilligal.ge.uiuc.edu/techreps.php3
3. Butz, M.V., Kovacs, T., Lanzi, P.L., Wilson, S.W.: How XCS evolves accurate classifiers. Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2001) (2001) 927–934
4. Butz, M.V., Pelikan, M.: Analyzing the evolutionary pressures in XCS. Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2001) (2001) 935–942
5. Butz, M.V., Sastry, K., Goldberg, D.E.: Tournament selection in XCS. Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2003) (2003) this volume
6. Butz, M.V., Wilson, S.W.: An algorithmic description of XCS. In Lanzi, P.L., Stolzmann, W., Wilson, S.W., eds.: Advances in Learning Classifier Systems: Third International Workshop, IWLCS 2000. Springer-Verlag, Berlin Heidelberg (2001) 253–272
7. Dixon, P.W., Corne, D.W., Oates, M.J.: A preliminary investigation of modified XCS as a generic data mining tool. In Lanzi, P.L., Stolzmann, W., Wilson, S.W., eds.: Advances in Learning Classifier Systems: 4th International Workshop, IWLCS 2001. Springer-Verlag, Berlin Heidelberg (2002) 133–150
8. Goldberg, D.E.: Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading, MA (1989)
9. Holland, J.H.: Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor, MI (1975); second edition 1992
M.V. Butz and D.E. Goldberg
10. Kovacs, T.: Deletion schemes for classifier systems. Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-99) (1999) 329–336
11. Kovacs, T., Kerber, M.: What makes a problem hard for XCS? In Lanzi, P.L., Stolzmann, W., Wilson, S.W., eds.: Advances in Learning Classifier Systems: Third International Workshop, IWLCS 2000. Springer-Verlag, Berlin Heidelberg (2001) 80–99
12. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann (1993)
13. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA (1998)
14. Widrow, B., Hoff, M.: Adaptive switching circuits. Western Electronic Show and Convention 4 (1960) 96–104
15. Wilson, S.W.: Mining oblique data with XCS. In Lanzi, P.L., Stolzmann, W., Wilson, S.W., eds.: Advances in Learning Classifier Systems: Third International Workshop, IWLCS 2000. Springer-Verlag, Berlin Heidelberg (2001) 158–174
16. Wilson, S.W.: Classifier fitness based on accuracy. Evolutionary Computation 3 (1995) 149–175
17. Wilson, S.W.: Generalization in the XCS classifier system. Genetic Programming 1998: Proceedings of the Third Annual Conference (1998) 665–674
Tournament Selection: Stable Fitness Pressure in XCS

Martin V. Butz, Kumara Sastry, and David E. Goldberg

Illinois Genetic Algorithms Laboratory (IlliGAL), University of Illinois at Urbana-Champaign, 104 S. Mathews, 61801 Urbana, IL, USA
{butz,kumara,deg}@illigal.ge.uiuc.edu
Abstract. Although it is known from the GA literature that proportionate selection is subject to many pitfalls, the LCS community has largely adhered to proportionate selection. The accuracy-based learning classifier system XCS, introduced by Wilson in 1995, also uses proportionate selection. This paper identifies problem properties under which the performance of proportionate selection is impaired. Consequently, tournament selection is introduced, which makes XCS more parameter independent, more noise independent, and more efficient in exploiting fitness guidance.
1 Introduction
Learning Classifier Systems (LCSs) [12,3] are rule learning systems in which rules are generated by means of a genetic algorithm (GA) [11]. The GA evolves a population of rules, the so-called classifiers. A classifier usually consists of a condition part and an action part. The condition part specifies when the classifier is applicable and the action part specifies which action to execute. In contrast to the original LCSs, fitness in the XCS classifier system, introduced by Wilson [18], is based on the accuracy of reward predictions rather than on the reward predictions directly. Thus, XCS is meant not only to evolve a representation of an optimal behavioral strategy, or classification, but rather to evolve a representation of the complete payoff map of the problem. That is, XCS is designed to evolve a representation of the expected payoff of each possible situation-action combination. Recently, several studies have shown that XCS performs comparably to several other typical classification algorithms on standard machine learning problems [20,2,8]. Although many indicators in the GA literature point out that proportionate selection is strongly fitness dependent [9] and, moreover, strongly dependent on the degree of convergence [10], the LCS community has largely ignored this insight. In XCS, fitness is a scaled estimate of the relative accuracy of a classifier. Due to this scaling, it does not seem necessary to apply a more fitness-independent selection method. Nonetheless, in this paper we identify problem types that impair the efficiency of proportionate selection. We show that tournament selection makes selection in XCS more reliable, outperforming proportionate selection in all investigated problem types.

E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 1857–1869, 2003. © Springer-Verlag Berlin Heidelberg 2003
We restrict our analysis to Boolean classification, or one-step, problems in which each presented situation is independent of the history of situations and classifications. In particular, we apply XCS to several typical Boolean function problems. In these problems, feedback is available immediately, so no reinforcement learning techniques that propagate reward are necessary [13]. The aim of this study is threefold: first, to make the XCS classifier system more robust and more problem independent; second, to contribute to the understanding of the functioning of XCS in general; and third, to prepare the XCS classifier system to solve decomposable machine learning problems quickly, accurately, and reliably. The next section gives a short overview of the major mechanisms in XCS. Next, the Boolean multiplexer problem is introduced and noise is added, which causes XCS to fail. We then introduce tournament selection to XCS. Section 5 provides further experimental comparisons of the two mechanisms in several Boolean function problems with and without additional noise. Finally, we provide a summary and conclusions.
2 XCS in a Nutshell
Although XCS has also been successfully applied to multi-step problems [18,14,15,1], we restrict this study to classification problems to avoid the additional problem of reward propagation. However, the insights of this study should readily carry over to multi-step problems. This section consequently introduces XCS as a pure classification system, providing the details necessary to comprehend the remainder of this work. For a more complete introduction to XCS the interested reader is referred to the original paper [18] and the algorithmic description [6]. We define a classification problem as a problem that consists of problem instances s ∈ S that need to be classified by XCS with one of the possible classifications a ∈ A. The problem then provides a scalar payoff R ∈ ℝ with respect to the classification made. The goal for XCS is to choose the classification that results in the highest payoff. To do that, XCS is designed to learn a complete mapping from any possible s × a combination to an accurate payoff value. To keep things simple, we investigate problems with Boolean input and classification, i.e. S ⊆ {0, 1}^L, where L denotes the fixed length of the input string, and A = {0, 1}. XCS evolves a population [P] of rules, or classifiers. Each classifier in XCS consists of five main components. The condition C ∈ {0, 1, #}^L specifies the subspace of problem instances in which the classifier is applicable, or matches. The “don’t care” symbol # matches all input values. The action part A ∈ A specifies the advocated action, or classification. The payoff prediction p approaches the average payoff encountered after executing action A in situations in which condition C matches. The prediction error ε estimates the average deviation, or error, of the payoff prediction p. The fitness F reflects the average relative accuracy of the classifier with respect to other, overlapping classifiers. XCS iteratively updates its knowledge base with respect to each problem instance.
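The classifier structure just described can be sketched as a small data structure. This is an illustrative sketch only: the field names and default values below are our own choices, not taken from the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class Classifier:
    condition: str          # string over {'0', '1', '#'}; '#' matches either input bit
    action: int             # advocated classification, 0 or 1
    prediction: float = 10.0  # payoff prediction p (initial value is a placeholder)
    error: float = 0.0        # prediction error estimate epsilon
    fitness: float = 0.01     # set-relative accuracy estimate F

    def matches(self, s: str) -> bool:
        """The condition matches s iff every non-# symbol equals the input bit."""
        return all(c == '#' or c == b for c, b in zip(self.condition, s))

# Example: condition 1#0 matches inputs 100 and 110, but not 101.
cl = Classifier(condition="1#0", action=1)
assert cl.matches("100") and cl.matches("110") and not cl.matches("101")
```

Forming the match set [M] for an input s then amounts to collecting all classifiers whose `matches(s)` is true.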
Given the current input s, XCS forms a match set [M] consisting of all classifiers in [P] whose conditions match s. If an action is not represented in [M], a covering classifier is created that matches s (#-symbols are inserted with probability P# at each position). For each classification, XCS forms a payoff prediction P(a), i.e. the fitness-weighted average of the reward prediction estimates of all classifiers in [M] that advocate classification a. The payoff predictions determine the appropriate classification. After the classification is selected and sent to the problem, payoff R is provided, according to which XCS updates all classifiers in the current action set [A], which comprises all classifiers in [M] that advocate the chosen classification a. After the update and possible GA invocation, the next iteration starts. Prediction and prediction error parameters are updated in [A] by p ← p + β(R − p) and ε ← ε + β(|R − p| − ε), where β (β ∈ [0, 1]) denotes the learning rate. The fitness value of each classifier in [A] is updated according to its current scaled relative accuracy κ:

  κ = 1                     if ε < ε0
  κ = α (ε / ε0)^(−ν)       otherwise                          (1)

  κ′ = κ / Σ_{x∈[A]} κ_x                                        (2)

  F ← F + β(κ′ − F)
The parameter ε0 (ε0 > 0) controls the tolerance for the prediction error ε; the parameters α (α ∈ (0, 1)) and ν (ν > 0) are constants controlling the rate of decline in accuracy κ once ε0 is exceeded. The accuracy values κ in the action set [A] are then converted to set-relative accuracies κ′. Finally, the classifier fitness F is updated towards the classifier’s current set-relative accuracy. Figure 1 shows how κ is influenced by ε0 and α. The determination of κ thus also determines the scaling of the fitness function and strongly influences proportionate selection. All parameters except fitness F are updated using the moyenne adaptive modifiée technique [17]. This technique sets parameter values directly to the average of the cases encountered so far, as long as the experience of a classifier is still less than 1/β. Each time the parameters of a classifier are updated, its experience counter exp is increased by one. A GA is invoked in XCS if the average time since the last GA application to the classifiers in [A] exceeds the threshold θga. The GA selects two parental classifiers using proportionate selection: the probability of selecting classifier cl is determined by its relative fitness in [A], i.e. Ps(cl) = F(cl) / Σ_{c∈[A]} F(c). Two offspring are generated by reproducing the parents and applying (two-point) crossover and mutation. Parents stay in the population, competing with their offspring. We apply free mutation, in which each attribute of the offspring condition is mutated to one of the other two possibilities with equal probability. Parameters of the offspring are inherited from the parents, except for the experience counter exp, which is set to one, the numerosity num, which is set to one, and the fitness F, which is multiplied by 0.1. In the insertion process, subsumption deletion may be applied [19] to stress generalization. Due to the possible strong effects of action-set subsumption, we apply GA subsumption only.
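The update rules of Equations 1 and 2 can be written out concretely. This is a minimal sketch only: it assumes classifier objects with attributes `p`, `eps`, and `F` (names are ours), follows the update order as the equations are stated in the text, and omits details such as the moyenne adaptive modifiée phase and numerosity.

```python
def update_action_set(action_set, R, beta=0.2, eps0=1.0, alpha=0.1, nu=5.0):
    """Update prediction, error, and fitness of all classifiers in [A] for payoff R."""
    # Widrow-Hoff updates of payoff prediction p and prediction error eps.
    for cl in action_set:
        cl.p += beta * (R - cl.p)
        cl.eps += beta * (abs(R - cl.p) - cl.eps)
    # Accuracy kappa (Equation 1): 1 inside the error tolerance eps0,
    # a declining power function alpha * (eps/eps0)^(-nu) outside it.
    kappas = [1.0 if cl.eps < eps0 else alpha * (cl.eps / eps0) ** -nu
              for cl in action_set]
    total = sum(kappas)
    # Set-relative accuracy kappa' (Equation 2) and fitness update towards it.
    for cl, kappa in zip(action_set, kappas):
        cl.F += beta * (kappa / total - cl.F)
```

After repeated updates with a consistent payoff, an accurately predicting classifier keeps ε below ε0 and its fitness approaches its share of the set-relative accuracy, while an inaccurate classifier's κ collapses under the power-law penalty.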
Fig. 1. The scaling of accuracy κ is crucial for successful proportionate selection. Parameters ε0, α, and ν control tolerance, offset, and slope, respectively
The population of classifiers [P] is of fixed size N. Excess classifiers are deleted from [P] with probability proportional to an estimate of the size of the action sets the classifier occurs in (stored in the additional parameter as). If a classifier is sufficiently experienced and its fitness F is significantly lower than the average fitness of the classifiers in [P], its deletion probability is further increased.
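The deletion scheme just described can be sketched as a roulette wheel over deletion votes. This is a rough illustration under our own naming assumptions (`as_estimate`, `exp`, `F` attributes; parameter names θdel and δ follow the paper's parameter list); numerosity and other implementation details are omitted.

```python
import random

def deletion_vote(cl, avg_fitness, theta_del=20, delta=0.1):
    """Deletion vote: proportional to the action-set size estimate, inflated
    for experienced classifiers whose fitness is far below average."""
    vote = cl.as_estimate
    if cl.exp > theta_del and cl.F < delta * avg_fitness:
        vote *= avg_fitness / cl.F  # low-fitness, experienced: more likely deleted
    return vote

def select_for_deletion(population, rng=random):
    """Roulette-wheel choice of one classifier to delete from [P]."""
    avg_F = sum(cl.F for cl in population) / len(population)
    votes = [deletion_vote(cl, avg_F) for cl in population]
    return rng.choices(population, weights=votes, k=1)[0]
```

Under this sketch, an experienced classifier with fitness below δ times the population average receives a vote inflated by the factor avg_F / F, so it is deleted with much higher probability than an equally sized but accurate competitor.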
3 Trouble for Proportionate Selection
Proportionate selection depends on fitness relative to the action set [A]. The fitness F of a classifier is strongly influenced by the accuracy determination of Equation 1, visualized in Fig. 1. F is also influenced by the number of classifiers in the action sets the classifier participates in. Moreover, the initial fitness of offspring is decreased by a factor of 0.1, so it may take a while until the superiority of an offspring classifier is reflected in its fitness value. To investigate these dependencies, we apply XCS to the multiplexer problem with altered parameter settings or with additional noise in the payoff function of the problem. The multiplexer problem is widely studied in LCS research [7,18,19,4]. It has been shown that LCSs are superior to standard machine learning algorithms, such as C4.5, in the multiplexer task [7]. The multiplexer is a Boolean function defined for binary strings of length k + 2^k. The output of the multiplexer function is the value of the bit situated at the position referred to by the k position bits. For example, in the 6-multiplexer, f(100010) = 1, f(000111) = 0, and f(110101) = 1. A correct classification results in a payoff of 1000, while an incorrect classification results in a payoff of 0.
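The k + 2^k multiplexer function can be stated directly in code; the function name is ours, and k = 2 gives the 6-multiplexer used in the examples above.

```python
def multiplexer(bits: str, k: int = 2) -> int:
    """Boolean multiplexer: the k address (position) bits select which of the
    2^k data bits is returned."""
    assert len(bits) == k + 2 ** k
    address = int(bits[:k], 2)       # position bits interpreted as an integer
    return int(bits[k + address])    # value of the referenced data bit

# The three 6-multiplexer examples from the text:
assert multiplexer("100010") == 1
assert multiplexer("000111") == 0
assert multiplexer("110101") == 1
```

For instance, in "100010" the address bits "10" select data bit 2 of "0010", which is 1.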
Fig. 2. Decreasing learning rate β decreases XCS learning performance (left-hand side). Accuracy function parameters have no immediate influence. Also, a decrease in initial specificity strongly decreases XCS’s performance (right-hand side)
Figure 2 reveals the strong dependence on parameter β.¹ Decreasing the learning rate hinders XCS from evolving an accurate payoff map. The problem is that over-general classifiers initially occupy a large part of the population. Better offspring often lose against their over-general parents, since the fitness of the offspring increases only slowly (due to the low β value) and small differences in fitness F have only small effects under proportionate selection. Altering the slope of the accuracy curve by changing parameters α and ε0 does not have any positive learning effect, either. Later, we show that small β values are necessary in some problems, so increasing β does not solve this problem in general. Figure 2 (right-hand side) also reveals XCS’s dependence on initial specificity. Increasing P# (effectively decreasing initial specificity) impairs the learning speed of XCS, which then faces the schema challenge [4]. In addition to the dependence on β, we can show that XCS is often not able to solve noisy problems. We added two kinds of noise to the multiplexer problem: (1) Gaussian noise added to the payoff provided by the environment, and (2) payoff alternated with a certain probability, termed alternating noise in the remainder of this work. Figure 3 shows that adding even a small amount of either noise strongly affects XCS’s performance. The more noise is added, the smaller the fitness difference between accurate and inaccurate classifiers. Thus, the selection pressure due to proportionate selection decreases, and the population starts to drift at random. Lanzi [15] proposed an extension to XCS that detects noise in environments and adjusts the error estimates accordingly. This approach, however, does not solve the β problem.

¹ All results herein are averaged over 50 experimental runs. Performance is assessed by test trials in which no learning takes place and the better classification is chosen. During learning, classifications are chosen at random. If not stated differently, parameters were set as follows: N = 2000, β = 0.2, α = 1, ε0 = 1, ν = 5, θGA = 25, χ = 0.8, µ = 0.04, θdel = 20, δ = 0.1, θsub = 20, and P# = 0.6.
Fig. 3. Adding noise to the payoff function of the multiplexer significantly deteriorates the performance of XCS. Gaussian noise (left-hand side) or alternating noise (right-hand side) is added
4 Stable Selection Pressure with Tournament Selection
In contrast to proportionate selection, tournament selection is independent of fitness scaling [9]. In tournament selection, parental classifiers are not selected in proportion to their fitness; instead, tournaments are held in which the classifier with the highest fitness wins (stochastic tournaments are not considered herein). Participants in a tournament are usually chosen at random from the population in which selection is applied. The size of the tournament controls the selection pressure; usually, fixed tournament sizes are used. Compared to standard GAs, the GA in XCS is a steady-state, niched GA. Only two classifiers are selected in each GA application. Moreover, selection is restricted to the current action set. Thus, some classifiers might not get any reproductive opportunity at all before being deleted from the population. Additionally, action set sizes can vary significantly: initially, action sets are often over-populated with over-general classifiers. A relatively strong selection pressure therefore appears necessary, one that adapts to the current action set size. Our tournament selection process holds tournaments whose size depends on the current action set size |[A]|, choosing a fraction τ (τ ∈ (0, 1]) of the classifiers in the current action set.² Instead of proportionate selection, two independent tournaments are held, in each of which the classifier with the highest fitness is selected. We also experimented with fixed tournament sizes, as shown below, which did not result in a stable selection pressure.

Figure 4 shows that XCS with tournament selection, referred to as XCSTS in the remainder of this work, can solve the 20-multiplexer problem even with a low value of parameter β. The curves show that XCSTS is largely independent of parameter β; even lower values of β, of course, ultimately also impair XCSTS’s performance. In the case of a more general initial population (P# = 1.0), XCSTS detects accurate classifiers faster and thus hardly suffers at all from the initial over-general population.

² If not stated differently, τ is set to 0.4 in the subsequent experimental runs.

Fig. 4. XCSTS is barely influenced by a decrease of learning rate β. Accuracy parameters do not influence learning behavior. An initial over-general population (P# = 1.0) hardly influences learning, either (right-hand side)

Fig. 5. XCSTS performs much better than XCS in noisy problems. For example, the performance of XCSTS with Gaussian noise σ = 250 is better than XCS’s performance with σ = 50 noise. Lowering β allows XCSTS to solve problems with even σ = 500

Figure 5 shows that XCSTS is much more robust in noisy problems as well. XCSTS solves the Gaussian noise setting with standard deviation σ = 250 better than XCS solves the σ = 100 setting. In the case of alternating noise, XCSTS solves the Px = 0.1 noise case faster than XCS the Px = 0.05 case. As expected, the population sizes do not converge to the sizes achieved without noise, since subsumption does not apply. Nonetheless, in both noise cases the population sizes decrease, indicating the detection of accurate classifiers.
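The action-set-proportionate tournament described in Sect. 4 can be sketched as follows. This is an illustrative sketch under our own naming assumptions (a `.F` fitness attribute on classifier objects); numerosity and the surrounding steady-state GA are omitted.

```python
import random

def tournament_select(action_set, tau=0.4, rng=random):
    """Select one parent: a fraction tau of [A] competes, highest fitness wins."""
    size = max(1, int(tau * len(action_set)))   # tournament size scales with |[A]|
    tournament = rng.sample(action_set, size)   # participants chosen at random
    return max(tournament, key=lambda cl: cl.F) # deterministic tournament

# Two parents are drawn in two independent tournaments:
# parent1 = tournament_select(action_set)
# parent2 = tournament_select(action_set)
```

With τ = 1.0 the whole action set competes and the best classifier always wins; smaller τ weakens the pressure and leaves room for recombining different classifiers, matching the trade-off discussed in Sect. 5.3.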
5 Experimental Study
This section investigates XCSTS’s behavior further. We show the effect of a fixed tournament size and of different proportional tournament sizes τ. We also apply XCS to the multiplexer problem with layered reward, as well as to the layered count ones problem. In the latter problem, recombinatory events are highly beneficial.

5.1 Fixed Tournament Sizes
XCS’s action sets vary in size and in distribution. Depending on the initial specificity in the population (controlled by parameter P#), the average action set size is initially either large or small. It has been shown that the average specificity in an action set is always smaller than the specificity in the whole population [5]. Replication in action sets and deletion from the whole population result in an implicit generalization pressure that can only be overcome by a sufficiently large specialization pressure. Additionally, the distribution of specificities depends on the initial specificity, problem properties, the resulting fitness pressure, and the learning dynamics. Thus, an approach with a fixed tournament size is quite dependent on the particular problem and probably not flexible enough. In Fig. 6 we show that XCSTS with a fixed tournament size solves the multiplexer problem only with the large tournament size of 12. With lower tournament sizes, not enough selection pressure is generated: since the population is usually over-populated with over-general classifiers early in a run, action set sizes are large, so a too small tournament size usually results in a competition among over-general classifiers and thus insufficient fitness pressure. With added noise, an even larger tournament size is necessary for learning. A tournament size of 32, however, does not allow any useful recombinatory events anymore, since the action set size itself is usually not much bigger than that. Thus, fixed tournament sizes are inappropriate for XCS’s selection mechanism.

5.2 Layered Boolean Multiplexer
The layered multiplexer [18] provides useful fitness guidance for XCS [4]. The more position bits are specified (starting from the most significant one), the fewer different reward levels are available and the closer together the values of the different reward levels lie, so that classifiers that have more position bits specified have higher accuracy. Consequently, those classifiers get propagated. Thus, the reward scheme somewhat simplifies the problem. The exact equation of the reward scheme is Int(positionbits) · 200 + Int(referencedbit) · 100; an additional reward of 300 is added if the classification was correct. For example, in the 6-multiplexer the (incorrect) classification 0 of instance (100010) would result in a payoff of 500, while the (correct) classification 1 would result in a payoff of 800. We ran a series of experiments in the layered 20-multiplexer with increasing Gaussian noise. Figure 7 shows results for noise values σ = 100 and σ = 300.

Fig. 6. Due to fluctuations in action set sizes and distributions as well as the higher proportion of more general classifiers in action sets, fixed tournament sizes are inappropriate for selection in XCSTS

Fig. 7. Gaussian noise disrupts the performance of XCS more than the performance of XCSTS. While in a problem with little noise a higher value of parameter β allows faster learning, in settings with more noise high β values cause too high fluctuations and are thus disruptive. XCS fails to solve a noise level of σ = 300 with either setting of parameter β, while XCSTS still reaches 100% performance with parameter β set to 0.05

With noise of standard deviation σ = 100, XCS as well as XCSTS have no trouble solving the task, even with the lower learning rate β = 0.05 for XCS; XCSTS, however, already learns faster than XCS. With noise of σ = 300, the performance of XCS hardly increases at all. XCSTS learns with either learning rate setting but reaches 100% knowledge only with a learning rate of β = 0.05. This shows that a lower learning rate is necessary in noisy problems, since otherwise fluctuations in classifier parameter values can cause disruptions.
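The layered reward scheme of Sect. 5.2 can be written out directly; the function name is ours, and we assume, as in the standard multiplexer encoding, that the k address (position) bits come first.

```python
def layered_payoff(bits: str, classification: int, k: int = 2) -> int:
    """Layered multiplexer reward: Int(positionbits)*200 + Int(referencedbit)*100,
    plus 300 if the classification is correct."""
    address = int(bits[:k], 2)           # Int(positionbits)
    referenced = int(bits[k + address])  # Int(referencedbit), also the correct class
    payoff = address * 200 + referenced * 100
    if classification == referenced:     # bonus for the correct classification
        payoff += 300
    return payoff

# The example from the text (6-multiplexer, instance 100010):
assert layered_payoff("100010", 0) == 500  # incorrect classification
assert layered_payoff("100010", 1) == 800  # correct classification
```

Note that the correct classification always yields the higher payoff at any given reward level, while the level itself encodes how many position bits are relevant.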
5.3 Layered Count Ones Problem
The final Boolean function in this study is constructed to reveal the benefit of recombination in XCS(TS). While in GA research whole conferences are filled with studies on crossover operators, linkage learning GAs, and the more recent probabilistic model building GAs (PMBGAs) [16], LCS research has largely neglected the investigation of the crossover operator. Herein, we take a first step towards understanding the possible usefulness of recombination in XCS by constructing a problem in which the recombination of low-fitness classifiers can increasingly result in higher fitness. We term the problem the layered count ones problem. In this problem a subset of the bits of a Boolean string determines the class: if the number of ones in this subset is greater than half the size of the subset, the class is one, and otherwise zero. Additionally, we layer the payoff somewhat similarly to the layered multiplexer problem: the amount of payoff depends on the number of ones in the relevant bits, not on the integer value of the relevant bits. For example, consider the layered count ones problem 5/3 (a binary string of length five with three relevant bits) and a maximal payoff of 1000. Assuming that the first three bits are relevant, the classification 1 of string 10100 would be correct and would result in a payoff of 666, while the incorrect classification 0 would result in a payoff of 333. Similarly, a classification of 1 of string 00010 would result in a payoff of 0, while the correct classification 0 would give a payoff of 1000. Thus, the correct classification always results in the higher payoff. Figure 8 shows runs in the layered count ones problem with string length L = 70 and five relevant bits. Without any additional Gaussian noise, XCSTS outperforms XCS. To solve the problem at all, the initial specificity needs to be set very low (P# = 0.9) to avoid the covering challenge [4].
Mutation needs to be set low enough to avoid over-specialization via mutation [5]. Runs with a mutation rate µ = 0.04 (not shown) failed to solve the problem. Performance stalls at approximately 0.8 regardless of the selection type. Figure 8 (left-hand side) also shows that a higher tournament size proportion τ decreases the probability of recombining different classifiers. Thus, the evolutionary process cannot benefit from efficient recombination, and the learning speed decreases. Note, however, that proportionate selection is still worse than the selection type that always chooses the classifier with the highest fitness (τ = 1.0). This indicates that recombination, albeit helpful, does not seem to be mandatory; the more important criterion for reliable convergence is a proper selection pressure. Adding Gaussian noise of σ = 250 to the layered count ones problem again completely deteriorates XCS’s performance (Fig. 8, right-hand side). Noise makes it even harder for mutation alone to do the trick. After 100,000 learning steps, only selection with proper recombination was able to allow the evolution of perfect performance (Fig. 8). Perpetual selection of the best classifier hinders crossover from recombining classifiers that are partially specified in the relevant bits, so that the evolutionary process is significantly delayed.

Fig. 8. Also in the layered count ones problem, tournament selection has better performance. The benefit of crossover becomes clearly visible: performance increases when uniform crossover is applied, whereas it decreases when no mixing (τ = 1.0) can occur
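The layered count ones payoff described in Sect. 5.3 can be sketched as follows. We assume, for illustration, that the relevant bits are the first k positions, and we use integer division, which reproduces the 666/333 example from the text.

```python
def count_ones_payoff(bits: str, classification: int, k: int = 3,
                      max_payoff: int = 1000) -> int:
    """Layered count ones payoff: the payoff for class 1 grows with the number
    of ones among the k relevant bits; the class 0 payoff is its complement,
    so the correct classification always pays more."""
    ones = bits[:k].count('1')
    if classification == 1:
        return ones * max_payoff // k
    return (k - ones) * max_payoff // k

# Examples from the text (the 5/3 problem):
assert count_ones_payoff("10100", 1) == 666
assert count_ones_payoff("10100", 0) == 333
assert count_ones_payoff("00010", 1) == 0
assert count_ones_payoff("00010", 0) == 1000
```

Because the payoff depends only on the count of ones, partially specified classifiers over different subsets of the relevant bits carry complementary information, which is exactly what makes recombination beneficial in this problem.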
6 Summary and Conclusions
This paper showed that proportionate selection can prevent a proper fitness pressure and thus successful learning in XCS. Applying tournament selection results in better (1) parameter independence, (2) noise robustness, and (3) recombinatory efficiency. No problem was found in which XCS with tournament selection with an action-set-proportionate tournament size performed worse than with proportionate selection. Thus, it is clear that future research in XCS should use tournament selection instead of proportionate selection. Despite the recent first insights into problem difficulty, it remains unclear which problems are really hard for XCS. Moreover, it remains to be shown in which machine learning problems crossover is actually useful or even mandatory for successful learning in XCS and LCSs in general. In XCS, the recombination of important substructures should lead to successively higher accuracy. In strength-based LCSs, on the other hand, the recombination of substructures should lead to larger payoff. Due to the apparent importance of recombinatory events in nature as well as in GAs, these issues deserve further detailed investigation in the realm of LCSs.

Acknowledgment. We are grateful to Xavier Llora, Martin Pelikan, Abhishek Sinha, and the whole IlliGAL lab. The work was sponsored by the Air Force Office of Scientific Research, Air Force Materiel Command, USAF, under grant F49620-00-0163. Research funding was also provided by a grant from the National Science Foundation under grant DMI-9908252. The US Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation thereon. Additional support from the Automated Learning Group (ALG) at the National Center for Supercomputing Applications (NCSA) is acknowledged. Additional funding from the German research foundation (DFG) under grant DFG HO1301/4-3 is acknowledged. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Air Force Office of Scientific Research, the National Science Foundation, or the U.S. Government.
Tournament Selection: Stable Fitness Pressure in XCS
Improving Performance in Size-Constrained Extended Classifier Systems

Devon Dawson

Hewlett-Packard, 8000 Foothills Blvd, Roseville CA 95747-5557, USA
[email protected]
Abstract. Extended Classifier Systems, or XCS, have been shown to be successful at developing accurate, complete and compact mappings of a problem's payoff landscape. However, the experimental results presented in the literature frequently utilize population sizes significantly larger than the size of the search space. This resource requirement may limit the range of problem/hardware combinations to which XCS can be applied. In this paper two sets of modifications are presented that are shown to improve performance in small size-constrained 6-Multiplexer and Woods-2 problems.
1 Introduction

An extended classifier system, or XCS, is a rule-based machine learning framework developed by Stewart Wilson [3,15,16] that has been shown to possess some very appealing properties. Chiefly, XCS has been found to develop accurate, complete and compact mappings of a problem's search space [6,10,15,16]. However, to be successful XCS can require a population size large enough to allow the system to maintain, at least initially, a number of unique classifiers (i.e., macroclassifiers) constituting a significant portion of the size of the search space. This potentially prohibitive resource requirement can prevent XCS from being successful where the system is size-constrained, that is, where the population size is smaller than the search space. As XCS-based systems find application in industry (and particularly in cost-constrained commercial applications), size-constrained systems are likely to be frequently encountered. In an effort to broaden the range of situations in which XCS-based systems can be utilized, this paper presents two sets of modifications that have been found to significantly improve system performance in small size-constrained problem/population configurations. Improving upon techniques employed by Dawson and Juliano in [5], the first set of mostly minor modifications attempts to make more cautious use of the limited resources available to a size-constrained system. The second modified system builds upon the first and, among other changes, uses an effectiveness-based fitness measure (i.e., a strength/accuracy hybrid), thus resulting in a system that develops a partial map of the payoff landscape. Experimental results are reported that demonstrate the performance improvement observed when using the proposed modifications in size-constrained 6-Multiplexer and Woods-2 problems.

E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 1870–1881, 2003. © Springer-Verlag Berlin Heidelberg 2003
2 XCS: Developing Accurate, Complete, and Compact Populations

Building upon his strength-based ZCS [14], Wilson introduced XCS in [15] as a solution to many of the problems encountered in traditional strength-based classifier systems, namely the problems of overgeneral classifiers, greedy classifier selection, and the interaction of the two [4,8,9,10,15]. Through the use of a simplified architecture, an accuracy-based fitness measure, a Q-Learning [12] like credit distribution algorithm and a niche Genetic Algorithm (GA) [1], XCS is motivated towards the development of accurate, complete and compact maps of a problem's search space. While earlier work had utilized various forms of accuracy as a factor in determining a classifier's fitness, Wilson demonstrated with XCS that the use of accuracy alone could lead the GA to develop a stable population of classifiers that accurately predict the reward anticipated as a result of their activation. The completeness and compactness of the solution developed by XCS stem from its use of a niche GA as well as the accuracy-based fitness measure. The niche GA, introduced by Booker [1], considers classifiers to be members of specific, functionally related subsets of the population and operates upon these subsets, or niches, rather than on the population as a whole. By calculating a classifier's fitness relative to the members of the niches in which it resides and searching within these niches, the GA drives the system towards the development of the minimal number of accurate classifiers required to completely cover the search space.
3 Size-Constrained Extended Classifier Systems

Although the populations ultimately developed by XCS tend to be compact, and therefore significantly smaller than the search space (to a degree dependent upon the amount of generalization possible in the problem environment), the early stages of a run can require that the population contain a large number of classifiers, often constituting a significant percentage of the problem's search space [6,10,11,15,16]. Typically this is not a problem, as the population size is set large enough to more than encompass the problem's search space. However, for practical application this may not always be possible, as the size of the search space can simply exceed the physical limitations of the hardware on which the system is to execute, or the population size may be limited by the need to share system resources with other processes. Lanzi identified in [11] that the tendency to generalize inherent in XCS may impair the system's ability to succeed in environments which afford little opportunity for generalization, and that this effect could be diminished through the use of a new Specify operator implemented in his XCSS. The Specify operator attempts to identify overly general classifiers and produces specialized offspring from those classifiers with some number of their don't-care bits (i.e., #'s) replaced by the corresponding bits in the current system input. Lanzi showed that as the population size was reduced, the performance of XCS and XCSS diminished in the Maze4 problem [11]. While Lanzi's work was focused on improving the performance of XCS in problems allowing little generalization, this work aims to improve performance in size-constrained problem/population combinations. For the purposes of this paper we define a size-constrained system to be one in which the population size, N, is smaller than the size of the search space, S. In order to quantify the amount of size-constraint in a system (as required by some of the modifications proposed in this work), a new size-constraint measure, Ω, is introduced and determined by:

Ω = 1 + ln((S / N)^c) .  (1)
For all experiments reported in this work the value of the power parameter, c, was set to c=5. For configurations where N ≥ S, the size-constraint measure is set to 1. Note that this function was selected simply because it resulted in values that appeared reasonable for the problems and population sizes evaluated in this work. A self-tuning constraint measure (e.g., one taking into account such factors as the stability of the population) would seem to be in order.
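As a concrete illustration, the measure in Eq. 1 can be sketched as follows (the function and argument names are our own, not the paper's):

```python
import math

def size_constraint(N, S, c=5):
    """Size-constraint measure (Eq. 1): Omega = 1 + ln((S/N)^c).

    Returns 1 when the population is at least as large as the search
    space (N >= S), as specified in the text.
    """
    if N >= S:
        return 1.0
    return 1.0 + math.log((S / N) ** c)
```

For the 6-Mux problem used later (S=128), a population of N=40 gives Ω ≈ 6.8, so the exponents in Eqs. 4 and 5 grow quickly as the constraint tightens.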
4 Modified Extended Classifier Systems

This section presents the first set of modifications, which are intended to improve the performance of size-constrained systems while still allowing the development of a complete map. For the sake of comparison the system utilizing these modifications is termed the Modified Extended Classifier System, or MXCS.

4.1 Experience-Based GA Trigger Algorithm

In XCS, the GA trigger algorithm attempts to evenly distribute GA activations across all encountered match/action sets regardless of the frequency with which those input states are encountered. Evenly distributing GA activations is achieved by triggering the GA when the average amount of time (i.e., exploration episodes) since the last GA activation on the set exceeds the given threshold value (θga). While the overall impact of this triggering algorithm may be beneficial, its experience-independent nature becomes problematic in size-constrained systems, where the brittle nature of the smaller populations can be easily disturbed by "over-clocking" the GA, that is, producing classifiers faster than they can be evaluated by the system. Though the GA trigger threshold can be increased to overcome this, it becomes difficult to determine the appropriate value due to its experience-independent nature, and the increase may result in a significantly diminished learning rate. MXCS modifies the trigger algorithm to use a simple experience-based threshold, where the GA is triggered when the given set contains any classifier whose experience since the GA last operated on that set exceeds the trigger threshold (θga). While this experience-based algorithm appears to favor classifiers covering frequently encountered inputs, the favoritism is ameliorated by the following modifications and dissipates once a maximally-general classifier is discovered covering that niche, as the offspring will tend to be subsumed [16] into the generalist.
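The experience-based trigger can be sketched in a few lines; the dictionary representation of classifiers and the field name `exp_since_ga` are illustrative assumptions, not the paper's implementation:

```python
def ga_should_trigger(clset, theta_ga=25):
    """MXCS experience-based GA trigger sketch: fire the GA when any
    classifier in the match/action set has accumulated more than
    theta_ga experience (updates) since the GA last ran on that set.
    """
    return any(cl["exp_since_ga"] > theta_ga for cl in clset)
```

Unlike the time-based XCS trigger, this check never fires for a set whose members have not yet been evaluated enough times, which is the "over-clocking" safeguard described above.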
4.2 Focused Deletion-Selection

In XCS a classifier is selected for deletion from the population when the addition of a new classifier causes the population size to exceed the maximum size parameter (N). The deletion-selection process begins by assigning each classifier a deletion vote, which is the product of the classifier's numerosity (num_cl) and the learned action set size estimate (as_cl). The classifier's numerosity indicates the number of identical classifiers (i.e., microclassifiers) in the population represented by the given macroclassifier. The learned action set size estimate represents an average of the number of classifiers present in the action sets to which the classifier belongs. If the classifier's experience (exp_cl) is greater than the deletion threshold (θdel) and its fitness is less than a constant fraction (δ) of the mean fitness of the population, then the deletion vote is scaled by the ratio of the mean population fitness to the classifier's fitness. Therefore, this algorithm calculates the deletion vote as follows [3,7,15]:

vote_cl = num_cl * as_cl .  (2)

IF ((exp_cl > θdel) AND (f_cl / num_cl < δ * (Σf_pop / Σnum_pop)))
  vote_cl = vote_cl * (Σf_pop / Σnum_pop) / (f_cl / num_cl) .  (3)

Once the deletion vote of each classifier is determined, the XCS deletion-selection algorithm typically uses roulette selection, where a classifier is probabilistically selected from the population based on each classifier's deletion vote. However, in size-constrained systems roulette selection can negatively impact system performance, as its probabilistic nature often results in the deletion of effective and useful classifiers. MXCS modifies the deletion-selection method to deterministically select the classifier with the highest deletion vote (i.e., a classifier is selected at random from the set of classifiers with the highest deletion vote). When using this max-deletion technique the deletion process maintains a strong downward pressure on the numerosity of classifiers, due to the deletion-vote factors of numerosity and action set size estimate. Though not utilized in [5], the MXCS described in this work further increases the pressure on classifier numerosities, as the deletion vote is modified by raising the numerosity factor to a power equal to the constraint measure, Ω, as follows:

vote_cl = num_cl^Ω * as_cl .  (4)
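Putting Eqs. 2–4 and max-deletion together gives the following sketch (classifier field names such as `num`, `as`, `f` and `exp` are our shorthand for the paper's symbols):

```python
import random

def deletion_vote(cl, mean_fitness, omega, theta_del=20, delta=0.1):
    """Deletion vote per Eqs. 2-4: numerosity (raised to Omega as in
    Eq. 4) times the action-set-size estimate, scaled up for
    experienced, low-fitness classifiers (Eq. 3)."""
    vote = cl["num"] ** omega * cl["as"]
    micro_fitness = cl["f"] / cl["num"]
    if cl["exp"] > theta_del and micro_fitness < delta * mean_fitness:
        vote *= mean_fitness / micro_fitness
    return vote

def max_deletion(pop, mean_fitness, omega):
    """MXCS max-deletion: deterministically pick (randomly among ties)
    a classifier with the highest deletion vote, instead of the
    roulette selection used by standard XCS."""
    votes = [deletion_vote(cl, mean_fitness, omega) for cl in pop]
    best = max(votes)
    return random.choice([cl for cl, v in zip(pop, votes) if v == best])
```

With Ω = 1 this reduces to the standard XCS vote; as Ω grows, high-numerosity classifiers are increasingly the first to go.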
4.3 Focused GA Selection

In XCS the GA probabilistically selects parents for use in the generation of new offspring from the current set (action or match, depending on placement of the GA), with the selection probability based on the value of each classifier's fitness. MXCS intensifies the focus on fit classifiers by basing the selection probability on the fitness of each classifier raised to the power of the constraint measure, Ω, as follows:

prob(cl) = f_cl^Ω / Σf_[A]^Ω .  (5)
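Eq. 5 is ordinary roulette-wheel selection over powered fitnesses; a minimal sketch (function names are ours):

```python
import random

def selection_weights(clset, omega):
    """Eq. 5 numerators: each classifier's fitness raised to Omega."""
    return [cl["f"] ** omega for cl in clset]

def select_parent(clset, omega):
    """Roulette-wheel parent selection over the powered fitnesses, so
    selection pressure sharpens as the size constraint grows."""
    weights = selection_weights(clset, omega)
    r = random.random() * sum(weights)
    for cl, w in zip(clset, weights):
        r -= w
        if r <= 0:
            return cl
    return clset[-1]  # guard against floating-point residue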
4.4 Parameter Updates

In an effort to avoid favoring classifiers frequently chosen for activation, XCS limits the updating of classifier parameters (such as experience, error, fitness and predicted payoff) to exploration steps only. MXCS is modified to update a classifier's prediction and error estimates on both exploration and exploitation steps. Though not required for the system to succeed, this is done in order to take full advantage of all the information returned from the environment. It should be noted that a bias is introduced in that accurate classifiers chosen on exploitation steps will have more chances to refine their predictions and will tend to have more refined accuracy estimates.
5 Effectiveness-Based Extended Classifier System

This section builds upon MXCS and introduces a new effectiveness-based classifier system (EXCS) that further reduces the resource requirements of the system by focusing its resources on accurate and rewarding classifiers. This results in a far more compact partial map, as the system discards classifiers with low payoff predictions. Though the map developed by EXCS is an incomplete one, the use of a niche GA and the accuracy factor in the effectiveness measure allow the system to avoid the problems previously mentioned for purely strength-based systems [4,8,9,10,15].

5.1 Accuracy Function

In developing EXCS it was found that the standard accuracy function used in XCS was too sensitive to fluctuations in prediction error and too insensitive to the classifier's long-term accuracy history to be used in determining the effectiveness of classifiers in highly competitive populations. This is particularly problematic in systems employing max-deletion, due to the less forgiving nature of the algorithm. The accuracy function used in EXCS uses two factors to calculate a classifier's accuracy. First a weighted average of the classifier's long-term accuracy is calculated. As shown in equations 6 and 7, this is accomplished by taking the ratio of the sum of the classifier's accuracy over time (k_sum_cl) and its experience (k_exp_cl), both discounted at each update by a constant fraction, ψ. Owing to the power series that develops with each update, the sensitivity of the function is determined by the value (1 − ψ)^−1 to which the series (i.e., Σψ^n) converges. Therefore, the weight attributed to an accuracy recorded n steps ago is ψ^n. A value of ψ=0.999 was used in all experiments reported in this paper. Lastly, this new weighted accuracy average and the standard classifier accuracy measure are averaged, as shown in equation 8.

k_sum_cl = (k_sum_cl * ψ) + k_cl .  (6)

k_exp_cl = (k_exp_cl * ψ) + 1 .  (7)

k_ave_cl = ((k_sum_cl / k_exp_cl) + k_cl) / 2 .  (8)
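The update defined by Eqs. 6–8 can be sketched directly; the field names mirror the paper's symbols, and the in-place dictionary update is our own framing:

```python
def update_accuracy(cl, k, psi=0.999):
    """EXCS accuracy update (Eqs. 6-8): a psi-discounted long-term
    accuracy average, blended 50/50 with the current accuracy k."""
    cl["k_sum"] = cl["k_sum"] * psi + k                  # Eq. 6
    cl["k_exp"] = cl["k_exp"] * psi + 1                  # Eq. 7
    cl["k_ave"] = (cl["k_sum"] / cl["k_exp"] + k) / 2    # Eq. 8
    return cl["k_ave"]
```

Note how a single bad step only halves toward the new value: a classifier with a long accurate history retains roughly half of its standing, which is exactly the insensitivity to near-term fluctuation the section motivates.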
Note that this accuracy measure was not found to improve the performance of XCS and was therefore omitted from the MXCS implementation. It may be that accuracy-based fitness functions require more drastic near-term changes in accuracy to be effective.

5.2 Effectiveness-Based Fitness Measure

Similar to the effectiveness measure utilized by Booker's GOFER-1 [2], which was a function of a classifier's "impact" (i.e., payoff), "consistency" (i.e., prediction error) and "match score" (i.e., specificity), EXCS measures classifier effectiveness as the product of a classifier's predicted payoff (p_cl), average accuracy (k_ave_cl) and generality (gen_cl), where generality is the percentage of the classifier's condition consisting of don't-care bits (#'s). Therefore, the effectiveness of an individual classifier is:

eff_cl = p_cl * k_ave_cl * gen_cl .  (9)
The fitness calculation in EXCS then proceeds as in XCS, with two differences not found to improve the performance of MXCS. First, the GA in EXCS operates on the match set ([M]), with offspring actions determined by the crossover of the parent actions. This is done in order to cause the GA to drive the system towards a small number of effective actions per match set. Second, where XCS calculates a classifier's fitness to be the portion of the niche's accuracy sum attributed to that classifier, EXCS updates classifier fitness towards the ratio of a classifier's effectiveness and the maximum effectiveness found in the match set, as shown in equation 10.

f_cl = f_cl + β * (eff_cl / MAX(eff_[M]) − f_cl) .  (10)
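Eqs. 9 and 10 together can be sketched as follows (the classifier fields and the β default are illustrative; β is the usual XCS learning-rate parameter):

```python
def effectiveness(cl):
    """Eq. 9: predicted payoff x average accuracy x generality."""
    return cl["p"] * cl["k_ave"] * cl["gen"]

def update_fitness(match_set, beta=0.2):
    """Eq. 10: move each classifier's fitness toward the ratio of its
    effectiveness to the maximum effectiveness in the match set, so
    fitness is independent of how many classifiers share the niche."""
    max_eff = max(effectiveness(cl) for cl in match_set)
    for cl in match_set:
        cl["f"] += beta * (effectiveness(cl) / max_eff - cl["f"])
```

The most effective classifier in a match set converges toward fitness 1, making fitnesses comparable across niches as the following paragraph explains.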
The use of the maximum effectiveness in the match set prevents the fitness of each classifier from being impacted by the quantity of classifiers in the same match sets. This can be troublesome when comparing classifier fitnesses across GA niches, as in the deletion-selection process. It should also be noted that the action set size estimate, as, was changed for EXCS to measure the average size of the match sets in which the classifier is found, rather than the action sets as in XCS and MXCS.

5.3 Conflict-Resolution

In order to accommodate the use of a partial map in EXCS, the conflict-resolution process must be modified. XCS appears to take great advantage of the fact that accurate predictions of low-payoff actions can be used to direct action selection towards higher-payoff actions during conflict-resolution on exploitation steps. EXCS, on the other hand, is at a disadvantage, as the incomplete map developed results in inexperienced or inaccurate classifiers with higher predictions appearing favorable to the conflict-resolution process due to the lack of accurate classifiers in the same set (e.g., the fitness-weighted prediction of a classifier that is the only member of an action set is independent of the classifier's fitness, and therefore of its accuracy).
This problem is overcome in EXCS by scaling the contribution of an inexperienced classifier (i.e., exp_cl < Ω * θdel) to the calculation of the prediction and fitness sums for an action set. As shown in equations 11 and 12, this is accomplished by scaling the prediction of an inexperienced classifier by the square of a small fraction (i.e., .01²), and scaling the inexperienced classifier's contribution to the fitness sum by the same small fraction (i.e., .01).

Σ(p*f)_[A] = Σ(p*f)_[A] + .01² * p_cl * f_cl .  (11)

Σf_[A] = Σf_[A] + .01 * f_cl .  (12)
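A sketch of the fitness-weighted prediction for one action set under this scaling (field names and the helper itself are our illustration of Eqs. 11–12):

```python
def action_set_prediction(action_set, omega, theta_del=20, eps=0.01):
    """Fitness-weighted payoff prediction for an action set, with the
    contribution of inexperienced classifiers (exp < Omega * theta_del)
    damped per Eqs. 11-12: prediction terms scaled by eps**2 and
    fitness terms by eps."""
    pf_sum = 0.0
    f_sum = 0.0
    for cl in action_set:
        if cl["exp"] < omega * theta_del:
            pf_sum += eps ** 2 * cl["p"] * cl["f"]
            f_sum += eps * cl["f"]
        else:
            pf_sum += cl["p"] * cl["f"]
            f_sum += cl["f"]
    return pf_sum / f_sum if f_sum > 0 else 0.0
```

A lone inexperienced classifier predicting 1000 yields a system prediction of only eps * 1000 = 10, which is the guard against the sole-member overestimation described in the text.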
The use of the square of the fraction in the calculation of the payoff prediction sum helps to prevent the inexperienced classifier from overestimating its predicted payoff, as well as protects against situations where the classifier is the only member of the action set. Using the fraction of the classifier's fitness in the fitness sum for the action set prevents the inexperienced classifier from distorting the payoff predicted for that action set. Note that it was found to be beneficial to performance to use these scaling factors, rather than simply disregarding inexperienced classifiers.

5.4 Population Culling

The final modification made in EXCS is the introduction of a new deletion operation, termed culling. The purpose of the culling operation is to rapidly remove ineffective classifiers from the population. This is accomplished by immediately removing classifiers before each invocation of the GA when the classifier's experience is greater than the deletion threshold (exp_cl > θdel), its effectiveness is below the minimum error parameter (ε0) and the percentage of the population composed of such ineffective classifiers is above a new culling threshold (θcull; in all experiments reported in this paper, θcull=0.1). The process employed was to first determine the set of ineffective classifiers and then randomly remove classifiers from this set until the threshold is met. The culling threshold is used because it was found to be beneficial to performance to leave some number of "easy targets" in the population for the deletion-selection process. Otherwise, should all ineffective classifiers be immediately removed, the deletion-selection process tends to be forced to remove effective classifiers, as the bulk of the population consists of experienced effective classifiers and inexperienced classifiers with low deletion votes due to their low experience levels.
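The culling step can be sketched as follows; the list-of-dictionaries population, the `eff` field, and the parameter defaults are illustrative assumptions:

```python
import random

def cull(pop, theta_del=20, eps0=10.0, theta_cull=0.1):
    """EXCS culling sketch: before a GA invocation, repeatedly remove a
    random experienced, ineffective classifier (exp > theta_del and
    effectiveness below eps0) until such classifiers make up no more
    than theta_cull of the population, leaving a few 'easy targets'
    for ordinary deletion-selection."""
    ineffective = [cl for cl in pop
                   if cl["exp"] > theta_del and cl["eff"] < eps0]
    while ineffective and len(ineffective) / len(pop) > theta_cull:
        cl = ineffective.pop(random.randrange(len(ineffective)))
        pop.remove(cl)
    return pop
```

Inexperienced classifiers are untouched by culling; only classifiers that have had a fair chance to prove themselves and failed are removed.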
6 Experimental Comparisons

This section presents experimental results that compare the performance of XCS, MXCS and EXCS in size-constrained 6-Multiplexer and Woods-2 problems. Except for changes required for the described modifications, the systems were implemented to follow the XCS definition described by Butz and Wilson in [3]. System parameters for all experiments reported in this work are as in [15] except where noted. The covering threshold parameter (θmna) was set to require one classifier per match set.
This value yields better performance in size-constrained systems. For EXCS, the deletion-vote fitness fraction (δ) was disregarded (i.e., set to infinity), as this was found to improve performance. This change had an adverse effect when used in XCS or MXCS, possibly owing to the greater amount of information encoded in the effectiveness-based fitness measure. Classifier parameters were updated in the order: experience, error, prediction, and then fitness. Subsumption deletion [3,16] was used in all experiments, though a new subsumption experience threshold (θsub[A]) was introduced for use in triggering action set subsumption. Owing to the fact that the negative impact of an erroneous action set subsumption can be far more severe than that of an erroneous GA subsumption, a higher subsumption threshold was required. For all experiments reported in this work, the value of the action set subsumption threshold was set to ten times the size of the problem's input space (i.e., the number of possible inputs). This gives occasionally erroneous classifiers more opportunities to be identified and removed from the system before subsuming other, more accurate classifiers. Note that this level is certainly not intended to work for other problems with potentially much larger input spaces. It is merely intended to allow subsumption deletion to be used in the relatively small environments reported here.

6.1 6-Multiplexer

The multiplexer class of problems is frequently used in the literature, as these problems present a challenging single-step task with a scalable difficulty level capable of testing a system's ability to develop accurate, general and optimal rules. Only three of the six bits in any given input string of a 6-Multiplexer (6-Mux) are important: the two address bits and the bit at the position pointed to by the address bits (i.e., the correct output).
Due to this property, the system can develop general rules with don't-care symbols in the ignored bit positions of their conditions that will accurately encode the entire set of classifiers masked by that condition. For example, while the 6-Mux problem requires 128 unique and fully defined classifiers to be complete, a smaller set of 16 optimal classifiers can accurately cover the entire input space given a two-level payoff landscape. For all 6-Mux results reported in this work, the system inputs were randomly generated 6-bit strings. Rewards of 1000 and 0 were returned to the system for right and wrong answers respectively. The results presented in this work for the 6-Mux problem use two different performance measures commonly reported in the literature: performance and optimality. Performance is a moving average of the percentage of the last 50 exploit steps that returned the correct output. Optimality measures the percentage of the optimal set of classifiers present in the population. As stated, for the 6-Mux problem with two payoff levels the optimal set consists of 16 classifiers. However, in order to compare systems developing complete maps (i.e., XCS and MXCS) and partial maps (i.e., EXCS), the optimal set is restricted to those optimal classifiers that earn the maximum reward. Therefore, for the 6-Mux problem the optimally-effective set of classifiers consists of 8 unique classifiers.
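To make the task concrete, here is a minimal sketch of the 6-Mux target function and ternary-condition matching (helper names are ours, not the paper's):

```python
def six_mux(bits):
    """6-multiplexer target: the first two bits address one of the four
    data bits; the correct output is the addressed data bit."""
    address = bits[0] * 2 + bits[1]
    return bits[2 + address]

def matches(condition, bits):
    """A ternary condition ('0', '1', '#') matches an input when every
    non-# position agrees with the corresponding input bit."""
    return all(c == "#" or int(c) == b for c, b in zip(condition, bits))
```

For instance, the rule "01#1##" with action 1 is one of the 8 optimally-effective classifiers: whenever the address bits are 01 and the addressed data bit (position 3) is 1, it answers correctly and earns the maximum reward.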
Furthermore, to facilitate a direct comparison of systems utilizing different GA triggering algorithms, the GA trigger threshold (θga) was selected for XCS so as to cause the GA to be invoked at the same rate as in MXCS and EXCS. Therefore, θga for XCS with N=40 was set to θga=185, and θga for XCS with N=20 was set to θga=265. These values resulted in approximately the same number of GA invocations on average per run as the value of θga=25 used in the experience-based MXCS and EXCS GA trigger algorithms. Note that the performance of XCS was not found to be greatly affected by any of the numerous other trigger values tested, and results obtained with N=400 (not shown) closely approximated those reported by Wilson and others [6,10,15].

Performance. Figure 1 shows performance in the 6-Mux problem at the two size-constrained configurations of N=40 and N=20 (a size of N=400 is commonly used for the 6-Mux problem). As shown, XCS is unsuccessful at either constrained size. MXCS, on the other hand, achieves a successful population with N=40 but fails with N=20. EXCS succeeds at both N=40 and N=20.

Optimality. Figure 2 shows the optimality of the three systems. Due to the small population sizes, XCS is unable to maintain much of the optimally-effective set of classifiers, while MXCS succeeds with N=40 but performs poorly with N=20. EXCS performs well with both N=40 and N=20.
Fig. 1 (left) and Fig. 2 (right). 6-Mux performance (left) and optimality (right) of: XCS with N=40 (X1) and N=20 (X2), MXCS with N=40 (M1) and N=20 (M2), and EXCS with N=40 (E1) and N=20 (E2). Curves are averages of 100 runs
6.2 Woods-2

This section compares the performance of XCS, MXCS and EXCS in the Woods-2 problem. Though, as Kovacs explains in [10], Woods-2 is not a very difficult test of a system's ability to learn sequential decision tasks, it serves as a standard benchmark by which we can compare the systems' responses to population pressure. Woods-2, developed by Wilson [15], is a multi-step Markov problem with delayed rewards in which the system represents an "animat" [13], or organism, searching for food on a 30 by 15 grid. The Woods-2 environment is composed of five possible
location types (encoded as three-bit binary strings): empty space through which the animat may move, two types of "rock" that block movement, and two types of food which garner a reward when occupied by the animat. The grid contains a repeating pattern of 3 by 3 blocks which are separated by two rows of empty space, and the grid "wraps around" (e.g., moving off the edge of the grid places the animat on the opposite edge), thereby creating a continuous environment. The system input is a 24-bit binary string representing the eight space types surrounding the location currently occupied by the animat (i.e., the animat's "sensed" surroundings). The output of the system is a three-bit binary string indicating in which of the eight possible directions the animat attempts to move in response to its currently sensed surroundings. At the start of each episode, the types of food and rock in each block are randomly selected, but the locations remain constant. This is done to better test the generalization abilities of the systems. For Woods-2, performance is measured as a moving average of the number of steps required for the animat to reach food on the last 50 exploitation episodes. A performance level of 1.7 steps on average is the optimal solution for Woods-2, as no population of rules can do better [10,15]. As with the 6-Mux problem, the GA trigger threshold was selected for XCS in Woods-2 in order to cause the system to invoke the GA approximately the same number of times per run as MXCS and EXCS. Therefore, values of θga=500 and θga=1000 were used for XCS at N=80 and N=40 respectively, while a value of θga=25 was used for MXCS and EXCS. As with the 6-Mux experiments, a variety of threshold values were evaluated and not found to significantly alter the performance of XCS at these population sizes. One further change was made to the implementation of the Woods-2 problem in order to allow a comparison of systems developing complete and partial maps.
The exploration phase of the Woods-2 problem involves placing the animat at a random location and ending the exploration once the random selection of actions causes the animat to move onto a food space (and thus garner a reward) or when the maximum number of steps have been taken. However, it was found that a system developing an incomplete map (i.e., EXCS) will tend to take far fewer steps on average during exploration as the majority of actions present will direct the animat towards food. This greatly reduces the number of steps taken during exploration and thus reduces the amount of “learning” taking place during each exploration episode. In an attempt to level the playing field, the exploration episodes of the Woods-2 problems reported in this work utilize a complete-exploration technique. Under complete-exploration the animat is placed in a new random location after each update of the prior action set or when food is reached. This continues until the maximum number of steps have been taken, thereby ensuring that each system takes the same number of exploration steps per run. Note that this technique is not required for the system to learn, it is used to allow a direct comparison between the two types of systems. Performance. Figure 3 shows the performance of the three systems in the Woods-2 problem at two size-constrained configurations of N=80 and N=40 (note the logarithmic vertical axis). As shown, the size constraint causes XCS to be unsuccessful at either population size, actually performing worse than would be expected from a random walk. MXCS achieves an average performance level of 1.94 steps by the end of
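The complete-exploration scheme described above can be sketched like this. It is a simplified illustration, not the authors' implementation: `learn` stands in for the classifier-system update, and all names and parameters are hypothetical.

```python
import random

def complete_exploration(num_steps, num_cells, food_cells, learn, seed=0):
    """Sketch of complete-exploration: the animat is dropped at a new random
    location after every learning update (or on reaching food), so every
    system takes exactly num_steps exploration steps per run."""
    rng = random.Random(seed)
    steps = 0
    while steps < num_steps:
        pos = rng.randrange(num_cells)      # new random placement
        action = rng.randrange(8)           # random action selection
        reached_food = pos in food_cells
        learn(pos, action, reached_food)    # update of the prior action set
        steps += 1                          # fixed budget, independent of the map
    return steps

updates = []
taken = complete_exploration(100, num_cells=25, food_cells={3, 17},
                             learn=lambda p, a, f: updates.append((p, a, f)))
```

The point of the fixed step budget is exactly the one made above: a system whose partial map steers it quickly to food would otherwise get fewer learning updates per run than a complete-map system.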
the run with N=80 and 3.83 steps with N=40. EXCS reaches average performance levels of 1.74 steps and 1.84 steps for N=80 and N=40 respectively.
Fig. 3. Woods-2 performance of XCS with N=80 (X1) and N=40 (X2), MXCS with N=80 (M1) and N=40 (M2), and EXCS with N=80 (E1) and N=40 (E2). Curves are averages of 100 runs.
7 Conclusions and Future Directions
This paper has attempted to broaden the range of applications to which the XCS framework can be applied by introducing modifications that may help XCS perform in size-constrained systems. Though the solutions may take longer to reach, it makes sense to investigate alternatives that shift the system's requirements from physical resources (i.e., population size) to system iterations, as the physical resources may be limited by forces outside the implementer's control. As was demonstrated, the first modified system, MXCS, was able to outperform XCS in small size-constrained configurations by, in essence, tuning the system's algorithms to make more cautious use of the limited resources available. The second system introduced, EXCS, was shown to outperform both XCS and MXCS by taking advantage of the lower resource requirements of a partial map. Though a system developing an incomplete map such as EXCS may not be suitable for many problems, the advantages in terms of reduced resource requirements and potentially shorter time-to-solution make it appealing for those problems that allow an incomplete map.

There are a variety of areas related to this work that could be of interest for future investigation. First, a systematic evaluation of each modification is called for, as is a more thorough investigation into the resource requirements of XCS and the relationship between population size and the ability of the system to generalize. Furthermore, a comparison of the systems presented here, as well as others (such as Kovacs' SBXCS [9,10] and Lanzi's XCSS [11]), across a wider range of problems and size-constraint levels seems to be in order. Lastly, a comparison of these systems in terms of time-to-solution for a variety of problems and configurations may help to make an argument for the use of smaller populations, as the computational cost of an increased population size will at some point outweigh the benefit.
The author would like to thank the Hewlett-Packard Company for its support and the reviewers of this paper for their time and valuable feedback.
References

1. Booker L.B. Intelligent behavior as an adaptation to the task environment. Ph.D. Dissertation, The University of Michigan, 1982.
2. Booker L.B. Triggered rule discovery in classifier systems. In: J.D. Schaffer (Ed.), Proceedings of the Third International Conference on Genetic Algorithms (ICGA89), pages 265–274. Morgan Kaufmann, 1989.
3. Butz M.V. and Wilson S.W. An Algorithmic Description of XCS. In: Lanzi P.L., Stolzmann W. and Wilson S.W. (Eds.), Advances in Learning Classifier Systems, LNAI 1996, pages 253–272. Springer-Verlag, Berlin, 2001.
4. Cliff D. and Ross S. Adding Temporary Memory to ZCS. Adaptive Behavior, 3(2):101–150, 1995.
5. Dawson D. and Juliano B. Modifying XCS for size-constrained systems. Neural Network World, International Journal on Neural and Mass-Parallel Computing and Information Systems, Special Issue on Soft Computing Systems, 12(6), pages 519–531, 2002.
6. Kovacs T. Evolving Optimal Populations with XCS Classifier Systems. Master's Thesis, School of Computer Science, University of Birmingham, Birmingham, U.K., 1996.
7. Kovacs T. Deletion Schemes for Classifier Systems. In: W. Banzhaf, J. Daida, A.E. Eiben, M.H. Garzon, V. Honavar, M. Jakiela, and R.E. Smith (Eds.), GECCO-99: Proceedings of the Genetic and Evolutionary Computation Conference, pages 329–336. Morgan Kaufmann, San Francisco (CA), 1999.
8. Kovacs T. Strength or Accuracy? Fitness calculation in learning classifier systems. In: P.L. Lanzi, W. Stolzmann, S.W. Wilson (Eds.), Learning Classifier Systems, From Foundations to Applications. Springer-Verlag, Berlin, 2000.
9. Kovacs T. Two Views of Classifier Systems. In: Lanzi P.L., Stolzmann W., and Wilson S.W. (Eds.), Advances in Learning Classifier Systems, pages 74–87. Springer-Verlag, Berlin, 2002.
10. Kovacs T. A Comparison of Strength and Accuracy-Based Fitness in Learning Classifier Systems. Ph.D. Dissertation, School of Computer Science, University of Birmingham, 2002.
11. Lanzi P.L. An Analysis of Generalization in the XCS Classifier System. Evolutionary Computation, 7(2):125–149, 1999.
12. Watkins C. Learning from Delayed Rewards. Ph.D. Dissertation, Cambridge University, 1989.
13. Wilson S.W. Knowledge growth in an artificial animal. In: Proceedings of the Tenth International Joint Conference on Artificial Intelligence, pages 217–220. Morgan Kaufmann, Los Angeles (CA), 1985.
14. Wilson S.W. ZCS: A zeroth level classifier system. Evolutionary Computation, 2(1), pages 1–18, 1994.
15. Wilson S.W. Classifier fitness based on accuracy. Evolutionary Computation, 3(2), pages 149–175, 1995.
16. Wilson S.W. Generalization in the XCS classifier system. In: J.R. Koza et al. (Eds.), Genetic Programming 1998: Proceedings of the Third Annual Conference, pages 665–674. Morgan Kaufmann, San Francisco (CA), 1998.
Designing Efficient Exploration with MACS: Modules and Function Approximation

Pierre Gérard and Olivier Sigaud

AnimatLab (LIP6), 8, rue du capitaine Scott, 75015 PARIS
Abstract. MACS (Modular Anticipatory Classifier System) is a new Anticipatory Classifier System. With respect to its predecessors, ACS, ACS2 and YACS, the latent learning process in MACS is able to take advantage of new regularities. Instead of anticipating all attributes of the perceived situations in the same classifier, MACS only anticipates one attribute per classifier. In this paper we describe how the model of the environment represented by the classifiers can be used to perform active exploration and how this exploration policy is aggregated with the exploitation policy. The architecture is validated experimentally. Then we draw more general principles from the architectural choices giving rise to MACS. We show that building a model of the environment can be seen as a function approximation problem which can be solved with Anticipatory Classifier Systems such as MACS, but also with accuracy-based systems like XCS or XCSF, organized into a Dyna architecture.
1 Introduction
Research in Learning Classifier Systems (LCSs) has received increasing attention over the last few years. This surge of interest took two different directions. First, a trend called "classical LCSs" hereafter comes from the simplification of Holland's initial framework [Hol75] by Wilson. The design of ZCS [Wil94] and then XCS [Wil95] resulted in a dramatic increase of LCS performance and applicability. The latter system, using the accuracy of the reward prediction as a fitness measure, has proven its effectiveness on different classes of problems such as adaptive behavior learning or data mining. XCS can be considered as the starting point of most new work along this first line of research. Second, a new family of systems called Anticipatory Learning Classifier Systems (ALCSs) has emerged, showing the feasibility of using model-based Reinforcement Learning (RL) in the LCS framework. This second line of research is more inclined to use heuristics rather than genetic algorithms (GAs) in order to deal with the improvement of classifiers. Several systems (ACS [Sto97,Sto00], ACS2 [But02a] and YACS [GSS02]) have highlighted the interesting properties of this family of approaches.

The way ALCSs achieve model-based RL consists in a major shift in the classifiers representation. Instead of [Condition] [Action] classifiers, they use a [Condition] [Action] [Effect] representation, where the [Effect] part represents what would result from taking the action if the condition is verified. As a consequence of this representational shift, the ALCS framework is somewhat distinct from the classical LCS one. In this paper, we want to show that one can benefit from the model-based properties of ALCSs while keeping the classical LCS framework as is, thanks to the design of
E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 1882−1893, 2003. Springer-Verlag Berlin Heidelberg 2003
a general architecture whose basic components can be either ALCSs or classical LCSs like XCS or XCSF [Wil01,Wil02], or even other kinds of function approximators. More precisely, after presenting ALCSs in section 2, we will introduce in section 3 a new ALCS called MACS, whose generalization properties are different from those of previous ALCSs. Classifiers in MACS are only intended to predict one attribute of the next situation in the environment depending on a situation and an action, whereas classifiers in all previous systems try to predict the next situation as a whole. We will show that this new representation gives rise to more powerful generalization than previous ones, and can be realized with a modular approach. Then we will explain in section 4 how such an efficient model-based learning process can be integrated into a Dyna architecture combining exploration and exploitation criteria, giving empirical results in section 5. Then, in section 6, we will generalize what we have learned from MACS in a wider perspective. Predicting the value of one attribute of the environment can be seen as a function approximation problem, and any system able to approximate a function can be used in the same architecture. In particular, since XCS and XCSF are such systems, we will conclude in section 7 that model-based reinforcement learning with generalization properties can be performed as well with XCS, XCSF or even other kinds of systems, provided that an adequate architecture is used.
2 Anticipatory Learning Classifier Systems

The usual formal representation of RL problems is a Markov Decision Process (MDP) which is defined by:
– a finite state space S;
– a finite set of actions A;
– a transition function T : S × A → Π(S), where Π(S) is a distribution of probabilities over the state space S;
– a reward function R : S × A × S → IR which associates an immediate reward to every possible transition.

One of the most popular RL algorithms based on this representation is Q-learning [Wat89]. This algorithm directly and incrementally updates a Q-table representing a quality function q : S × A → IR, without using the transition and the immediate reward functions. The quality q(s, a) represents the expected payoff when the agent performs the action a in the state s and follows the greedy policy thereafter. Then, the qualities aggregate the immediate and future expected payoffs.

The main advantage of Learning Classifier Systems with respect to other RL techniques like tabular Q-learning relies in their generalization capabilities. In problems such that situations are composed of several attributes, generalization makes it possible to aggregate several situations within a common description so that the model of the quality function q becomes smaller. In [Lan00], Lanzi shows how it is possible to shift from a tabular representation of a RL problem to a classifier-based representation. While tabular Q-learning considers triples (s, a, q) ∈ S × A × IR, LCSs like XCS consider (condition, action, payoff) classifiers. During the learning process, the LCS learns appropriate general conditions and updates the payoff prediction. Within the classical LCS framework, the use of don't care symbols # in the C part of the classifiers permits generalization, since don't care symbols make it possible to use
a single description to describe several situations. Indeed, a don't care symbol matches any particular value of the considered attribute. Therefore, changing an attribute into a don't care symbol makes the corresponding condition more general (it matches more situations). The main issue with generalization in classical LCSs is to organize the C and E parts so that the don't care symbols are well placed. XCS offers a generalization capability but, as Q-learning, it can only update very few measures of classifier payoff prediction at each time step, corresponding to the immediate actual previous situation and action st−1 and at−1. Indeed, the model of the expected payoff can only be updated when an actual transition occurs. Sutton [Sut90] proposed the Dyna architecture to endow the system with the ability to update many qualities at the same time, in order to significantly improve the learning speed of the quality function. The Dyna architecture illustrated in figure 1 uses a model of the environment to build hypothetical transitions independently from the current experience. These simulated actions are used to update the model of the quality function more than once per time step, with a value iteration algorithm inspired from dynamic programming. The model of the environment is learned latently, i.e. independently from the reward.
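The incremental quality update that such a system performs on each (st−1, at−1, st, rt) tuple can be sketched as generic tabular Q-learning. This is an illustration of the update rule from section 2, not code from the paper:

```python
from collections import defaultdict

def q_update(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step: move q(s, a) toward the immediate
    reward plus the discounted best quality of the next state."""
    best_next = max(q[(s_next, b)] for b in actions)
    q[(s, a)] += alpha * (r + gamma * best_next - q[(s, a)])
    return q[(s, a)]

q = defaultdict(float)
v = q_update(q, s=0, a="right", r=1.0, s_next=1, actions=["left", "right"])
```

The same update can be fed with hypothetical samples generated by a learned model, which is exactly the role of the Dyna architecture described above.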
Fig. 1. A Dyna architecture to perform reinforcement learning. It combines a model of the payoff with a model of the transitions. The model of the payoff may consist in an approximation of the immediate reward function R : S × A → IR, and in an approximation of the quality function q : S × A → IR, which produces a scalar quality for each (situation, action) pair. To learn the function q incrementally, the system is given (st−1, at−1, st, rt) tuples. The model of the transitions is an approximation of the transition function T : S × A → S. The transition function and the immediate reward functions provide hypothetical samples to the quality learning system, and subsequently speed up the reinforcement learning process. Note that the construction of both models requires a memory of the previous situation, which is not shown on the figure.
Instead of directly learning a model of the quality function q as XCS does, ALCSs such as ACS [Sto98,BGS00], ACS2 [But02a] and YACS [GS01,GSS02] learn a model of the transition function T. They take advantage of generalization capabilities to learn a model of the transition function which is more compact than the exhaustive list of all the (st, at, st+1) transitions. The transition function provides a model of the dynamics of the interactions between the agent and its environment, which takes place in a Dyna architecture to speed up the plain reinforcement learning process.
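The Dyna idea of replaying model-generated transitions to update the quality function more than once per time step can be sketched like this. It is a minimal illustration under simplifying assumptions: the model is a plain dict from (situation, action) to (next situation, reward), and all names are hypothetical.

```python
import random
from collections import defaultdict

def dyna_planning(q, model, actions, n_updates, alpha=0.5, gamma=0.9, seed=0):
    """Replay transitions drawn from a learned model to update the quality
    function q several times per real time step (the Dyna planning loop)."""
    rng = random.Random(seed)
    seen = list(model)
    for _ in range(n_updates):
        s, a = seen[rng.randrange(len(seen))]   # hypothetical experience
        s_next, r = model[(s, a)]
        best = max(q[(s_next, b)] for b in actions)
        q[(s, a)] += alpha * (r + gamma * best - q[(s, a)])

q = defaultdict(float)
model = {(0, "go"): (1, 0.0), (1, "go"): (1, 1.0)}
dyna_planning(q, model, actions=["go"], n_updates=50)
```

Because the updates are driven by the model rather than by actual experience, the qualities propagate backwards through the state space far faster than with one real transition per time step.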
In ALCSs, the classifiers are organized into [Condition] [Action] [Effect] parts, noted C, A and E hereafter. In such classifiers, the E part represents the effects of action A in situations matched by condition C. It records the perceived changes in the environment. In ACS, ACS2 and YACS, a C part is a situation which may contain don't care symbols # or specific values (like 0 or 1), as in XCS. An E part is also divided into several attributes and may contain either specific values or don't change symbols. Such a don't change symbol means that the attribute of the perceived situation it refers to remains unchanged when action A is performed. A specific value in the E part means that the value of the corresponding attribute changes to the value specified in that E part. This formalism permits the representation of regularities in the interactions with the environment, like for instance: in a grid world, when the agent perceives a wall in front of itself, whatever the other features of the current cell are, trying to move forward entails hitting the wall, and no change will be perceived in the cell's features.

The latent learning process is in charge of discovering C-A-E classifiers with general C parts that accurately model the dynamics of the environment. ACS and YACS generalize according to anticipated situations, and not according to the payoff, as in XCS. As a result, it does not make sense to store information about the expected payoff in the classifiers. Therefore, the list of classifiers only models environmental changes. The information concerning the payoffs must be stored separately.
3 Improved Latent Learning with MACS

3.1 Representing More Regularities with MACS
Generalization makes it possible to represent regularities in the interactions with the environment. However, while ACS and YACS are able to detect if a particular attribute is changing or not, their formalism cannot represent regularities across different attributes because it considers each situation as a whole. To make this point clear, let us consider an agent in a grid world such as those presented in figures 3, 4 and 5, where its perceptions are defined as a vector of boolean values depending on the presence or absence of walls in the eight surrounding cells. Turning right results in a two-positions left shift of the attributes. For instance, the agent may experience transitions like [00110011], [R], [11001100]. In such a case, every attribute is changing. Thus, the formalism of ACS and YACS is unable to represent this regularity. Nevertheless, the shift in the perceived situation is actually a regularity of the dynamics of the interactions: whatever the situation is, when the agent turns clockwise, the 1st attribute takes the previous value of the 3rd, the 2nd takes that of the 4th, and so on. The particularity of such a regularity is that the new value of an attribute depends on the previous value of another one. Expressing generalization with don't change symbols forbids the representation of such regularities. In the ACS/YACS formalism, the new value of an attribute may only depend upon the previous value of the same attribute, a situation which is seldom encountered in practice. To overcome this problem, it is necessary to decorrelate the attributes in the E parts, whereas ACS and YACS classifiers anticipate all attributes at once.

To this end, our new system, MACS, describes the E parts with don't know symbols ? rather than with don't change symbols. This way, the accurate classifier [????1???] [R] [??1?????] means that just after turning right, the agent always
perceives a wall at its left when it perceived a wall behind itself, whatever the other attributes were. The fact that this classifier does not provide information about the new values of the other attributes was denoted by the ? symbols. Thus, the overall system gains the opportunity to discover regularities involving different attributes in the [Condition] and the [Effect] parts.

Again, this proposal for a new formalism leads to a new conception of generalization. As usual, a classifier is said to be maximally general if it could not contain any additional don't care symbol without becoming inaccurate. But it is now said to be accurate if, in every situation matched by its condition, taking the proposed action always sets the attributes to the values specified in the effect part, when such attributes are not don't know symbols. As a result, the anticipating unit is not the single classifier anymore but the whole LCS. Given a situation and an action, a single classifier is not able to predict the whole next situation: it anticipates only one attribute. The system needs an additional mechanism which integrates these partial anticipations and builds a whole anticipated situation, without any don't know symbol in its description, as shown in Table 1.

Table 1. An example of the integration process. The system scans the list and selects classifiers whose C parts match the situation and whose A parts match the action; the integration process builds all the possible anticipated situations with respect to the possible values of every attribute. Here, with [00110011] as a current situation, the system anticipates that the next situation should be either [11001100] or [11001000]. In deterministic environments, if all the classifiers were accurate, this process would generate only one possible anticipation. [The table itself, omitted here, lists the selected classifiers, whose E parts each carry one specific value and don't know symbols elsewhere, together with the anticipated situations built from them.]

Experimental results presented in [GMS03] demonstrated that the new formalism used by MACS actually affords more powerful generalization capacities than the formalism of YACS, without any cost in terms of learning speed. The algorithms realizing the latent learning process in MACS will not be described in detail here; they are presented in [GMS03].

3.2 Modular Model of the Environment with MACS
In the previous section, we described how MACS represents its model of the dynamics of the environment with anticipating classifiers. In all Dyna systems, this model consists of an exhaustive list of (st−1, at−1, st) triples, each specifying a whole transition, i.e. the expected value of a complete set of
attributes. ACS and YACS both improve the model by adding generalization in the triples, but each classifier still specifies complete transitions.
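The contrast between whole-transition classifiers and MACS's one-attribute-per-classifier scheme can be illustrated with the turn-right example above. The `integrate` helper below mimics the role of the integration mechanism of Table 1; it is an illustrative sketch, not the paper's algorithm.

```python
from itertools import product

# Per-attribute rules for the "turn right = two-position left shift" example:
# the new value of attribute i is the old value of attribute (i + 2) mod 8.
def predict_attribute(percept, i):
    return percept[(i + 2) % 8]

def integrate(partial_anticipations):
    """Each module proposes a set of possible values for its own attribute;
    the whole anticipated situations are all combinations of them."""
    return {"".join(vals) for vals in product(*partial_anticipations)}

before = "00110011"
# Accurate modules (one value per attribute) yield a single anticipation.
exact = integrate([{predict_attribute(before, i)} for i in range(8)])
# An uncertain module yields one anticipated situation per possible value.
uncertain = [{predict_attribute(before, i)} for i in range(8)]
uncertain[4] = {"0", "1"}
several = integrate(uncertain)
```

Note how the per-attribute rules capture the shift regularity that a whole-situation effect part made of don't change symbols cannot express, since here every attribute changes at once.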
payoff model
payoff model
payoff model
sd ... s1
model of the transitions internal rewards
s1 partial transition function # 1 integration
imm. reward model Ri
transitions
ENVIRONMENT
imm. reward model Re
s i partial transition function # i sd partial transition function # d an ... a1
hierarchical agregation
action
Fig. 2. MACS architecture. Each module is a simple function approximator.
Each classifier of YACS (or each (st−1, at−1, st) triple of Dyna) is a subfunction of the global transition function T : s1 × ... × sd × a1 × ... × ae → s1 × ... × sd.¹ Conversely, in MACS, each classifier is a subfunction of a partial transition function Ti : s1 × ... × sd × a1 × ... × ae → si. Then, it is possible to consider groups of classifiers, each group anticipating one particular attribute. Each of these groups then models a partial transition function. The global transition function can be obtained by integrating the partial functions. The MACS architecture illustrated in figure 2 shows how the latent learning part of MACS can be considered as a modular system, each module anticipating one attribute. Each of these modules provides an approximation of one partial transition function, each predicting one single value. As we will discuss in section 6, this architecture suggests that one could replace MACS modules by some other function approximation systems like neural networks or classical LCSs like XCS or XCSF.
ÿþý üûúùøù÷öû÷øõ ôóóùúóøòûñð ñï îûíúùúðò ìùûòúùûø
$ïù ÿüî ðã ÿæþüìù ù'íëðýÿþüðó üø þð íýðìüòù þïù ÿéùóþ &üþï ÿ íðëüæê þïÿþ îÿ'üîüèùø þïù üóãðýîÿþüðó íýðìüòùò ûê þïù øùóøðýü-îðþðý ëððí÷ $ïù ÿéùóþ øùëùæþø ÿæþüðóø þïÿþ ïùëí üîíýðìüóé þïù îðòùë ÿææðýòüóé þð þïüø æýüþùýüðó÷ ÿ
¹ where e is the number of effectors and d is the number of perceived attributes
As illustrated by figure 2, in order to combine active exploration with exploitation, we designed an architecture resulting from the hierarchical combination of three dynamic programming modules, each trying to maximize the reward from a different source. Indeed, we distinguish the internal reward, corresponding to a gain in information about the model of the environment, the rehearsal reward, corresponding to a measure of the time elapsed after the last visit of each transition, and the external reward, corresponding to a gain of food or whatever actual reward in the environment of the agent. The precise definitions of these immediate rewards are the following:

– The general idea of defining an immediate information reward consists in measuring whether the model of the transitions can be improved or not thanks to the activation of a particular action. More precisely, we define an estimator El(c) associated with each classifier c. El(c) measures the evaluation level of the classifier. Thus El(c) must be equal to 0 if the classifier has not been evaluated yet, and must be equal to 1 if the classifier has been tested enough. Thus we define El(c) = min((b + g)/θe, 1), where θe is the number of evaluations needed to classify a classifier as accurate, inaccurate or oscillating, and b and g are respectively the number of anticipation mistakes and successes already encountered by the classifier in previous evaluations. Each estimator El(c) is bound to one classifier, not to one situation. In order to compute the information gain bound to one situation Ri(s0), the process is the following. The classifiers that match s0 are grouped by action. For each possible action a, MACS computes the set Ss0,a of the possible anticipated situations as described in section 3.2. Each triple (s0, a, s1), where s1 ∈ Ss0,a, is one of the possible transitions that would be experienced if action a were performed in situation s0. We define the evaluation level El(s0, a, s1) associated with this transition as the product of the evaluation levels El(c) of all the classifiers c involved in this anticipation:

El(s0, a, s1) = ∏_{c≈(s0,a,s1)} El(c)
The classifiers c matching (s0, a, s1) are such that their C part matches s0, their A part matches a, and their E part matches s1. The less a transition has been evaluated, the greater the immediate information gain. Thus, if the transition occurs, the associated immediate information gain is:

Ri(s0, a, s1) = 1 − El(s0, a, s1)
We define the immediate information gain associated with a situation and an action as the maximum information gain over the possible associated anticipations:

Ri(s0, a) = max_{s1∈Ss0,a} Ri(s0, a, s1)
If the model does not provide MACS with at least one anticipated situation s1, because of incompleteness, then Ri(s0, a) is given the default value 1, which is the maximum immediate information gain.²
² There can be several possible anticipated situations in the case where the classifiers are not accurate.
Finally,

Ri(s0) = max_{a∈A} Ri(s0, a)
– The external reward is the usual source of reward found in any reinforcement learning framework. We define it as Re(s0), corresponding to the immediate reward obtained for visiting the situation s0.
– The immediate rehearsal reward leads to a policy which is similar to the exploration policy described in [Sut90] or [But02b].³ It grows until the situation is visited again and then drops to 0. In order to update Rr(s0), we use the classical Widrow-Hoff equation:
ûöüóýþýê ìóüëöòîóéè çö ñöæýö þô üð Re(s0 ) íîóóöðåîýñþýê ôî ô÷ö þëëöñþüôö óöòüóñ îäôüþýöñ ìîó ãþðþôþýê ô÷ö ðþôïüôþîý s0 è ÿ ø÷ö þëëöñþüôö óö÷öüóðüû óöòüóñ ûöüñð ôî ü åîûþíú ò÷þí÷ þð ðþëþûüó ôî ô÷ö öõåûîâ óüôþîý åîûþíú ñöðíóþäöñ þý áàïô!"# îó á$ïô%&ä#ÿ è ø÷öý þô êóîòð ïýôþû ô÷ö ðþôïüôþîý þð ãþðþôöñ üêüþý üýñ ô÷öý ñóîåð ôî %è 'ý îóñöó ôî ïåñüôö Rr (s0 )ù òö ïðö ô÷ö íûüððþíüû çþñóîòâ(î) ö*ïüôþîý+ Rr (s0 ) = (1 − βr )Rr (s0 ) + βr
Then, from each immediate reward, whether internal, rehearsal or external, a long term expected payoff is computed separately thanks to the value iteration algorithm (see [SB98] for a presentation). The latent learning process provides the system with the model of the transitions which is necessary for an offline computation of the qualities associated with each action, given a situation. By doing so, the values are updated independently from the actual experience of the agent, and many updates can be computed at each time step, as usual in a Dyna architecture.

The last component of the architecture consists of a hierarchical combination of these three criteria. Since an inaccurate model may result in an inaccurate estimation of the expected external payoff, maximizing the information gain is given the priority over the external payoff maximization. Thus, if there is a better action with respect to the information gain criterion, this action is chosen. Else, if at least two actions provide the maximal expected information gain, then the one which maximizes the external payoff is chosen. In particular, if the information about the problem is perfect, which means that no action provides any gain in information anymore, then the policy will be completely driven by the external payoff and will converge to optimal exploitation. The last criterion, the rehearsal reward, is the least often used. It is only chosen if at least two actions are equally likely to be fired with respect to both previous criteria.
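The estimator El(c) and the hierarchical aggregation of the three criteria can be sketched as follows. This is an illustrative reading of the definitions above, not the paper's code; the function names and the per-action dictionaries are assumptions.

```python
def el(b, g, theta_e=5):
    """Evaluation level of a classifier after b mistakes and g successes."""
    return min((b + g) / theta_e, 1.0)

def choose_action(actions, info_gain, ext_payoff, rehearsal):
    """Hierarchical aggregation: information gain first, then external
    payoff, then rehearsal reward as the final tie-breaker."""
    best = max(info_gain[a] for a in actions)
    candidates = [a for a in actions if info_gain[a] == best]
    if len(candidates) > 1:
        best_pay = max(ext_payoff[a] for a in candidates)
        candidates = [a for a in candidates if ext_payoff[a] == best_pay]
    if len(candidates) > 1:
        candidates.sort(key=lambda a: rehearsal[a], reverse=True)
    return candidates[0]

a = choose_action(
    ["N", "S", "E"],
    info_gain={"N": 0.0, "S": 0.0, "E": 0.0},   # model already perfect
    ext_payoff={"N": 5.0, "S": 9.0, "E": 9.0},
    rehearsal={"N": 0.1, "S": 0.2, "E": 0.7},
)
```

When the information gains all drop to zero, as in this call, the choice is driven entirely by the external payoff, which is precisely the convergence-to-exploitation behavior described above.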
5 Experimental Results: Moving Sources of Reward

In order to illustrate the gain resulting from latent learning in reinforcement learning problems, we present in this paper new experiments where we tried to test MACS on problems where the sources of reward are moved after the agent succeeds in reaching them. This problem gives the opportunity to highlight two key properties of MACS. First, the necessity to have an internal model and to be able to perform value iteration on that model as an offline mental rehearsal arises when the agent discovers that the source of reward has moved: it must forget all the payoff estimates corresponding to its previous model, and this forgetting process would be very slow without such a model. Second, the necessity of active exploration arises once the model is forgotten: the agent must re-explore the whole environment to find the new location of the source of reward. This search is much more efficient thanks to active exploration.
³ The latter also uses a sort of "internal reward" biasing action selection according to the accuracy or "quality" of the predicted effects.
[Maze figures omitted: each maze contains a circled F marking the initial food cell and a plain F marking the cell where the food is later moved.]

Fig. 3. Maze216C    Fig. 4. Maze228C    Fig. 5. Maze312C
As a benchmark problem, we tested MACS on three different environments, namely Maze216C, Maze228C and Maze312C, respectively shown in figures 3, 4 and 5. These "maze-like" or "woods" problems are standard benchmarks in LCS research; they could be replaced by any finite state automaton but they provide a more intuitive view. At the beginning of the experiments, the food is in the cell marked with a circled F. Once the agent has reached it twenty times, the food is moved to the cell marked with a plain F. This is done repeatedly, the food being moved again each time the agent has found it twenty times.

Figure 6 illustrates the number of time steps MACS needs to solve Maze216C, Maze228C and Maze312C during 250 successive trials. The results are averaged over 100 experiments. The learning rates are set to 0.1, and the memory size and the number of evaluations necessary to take a specialization/generalization decision are all set to 5. The discount factor γ is set to 0.9. The results show that, though it only performs one value iteration step per time step, MACS is able to re-adapt the policy to the new source location very fast. Each twenty trials, the longer trial corresponds to the case where the agent must find the new location of the source of reward. It is found fast thanks to active exploration and, as soon as it is found, the time necessary to reach it again converges during the next trial. The slight variations are due to the fact that the agent always starts from a random cell and the results are averaged over 100 trials only. Note that MACS is tested here in a deterministic environment and would not perform as efficiently in a probabilistic environment, but the lessons that we will draw in the following discussion and conclusion still apply in the probabilistic case.
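The moving-source protocol can be sketched like this. It is an illustrative reconstruction of the experimental schedule only (in the actual experiments a trial may of course take many steps before reaching the food); all names are hypothetical.

```python
def run_protocol(num_trials, locations, successes_before_move=20):
    """Food starts at locations[0] and is moved to the next location each
    time the agent has reached it successes_before_move times."""
    food, successes, schedule = 0, 0, []
    for trial in range(num_trials):
        schedule.append(locations[food % len(locations)])
        successes += 1                      # assume each trial reaches the food
        if successes == successes_before_move:
            food, successes = food + 1, 0   # move the source of reward
    return schedule

schedule = run_protocol(60, locations=["A", "B", "C"])
```

Under this schedule, the long trial observed every twenty trials in figure 6 corresponds exactly to the first trial after a move, when the agent must re-explore to locate the new food cell.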
4 Discussion

In this paper, we have shown how a latent learning process could be combined with several dynamic programming processes into a Dyna-like architecture. As in Dyna architectures, one module learns a model of the environment and the other modules are in charge of maximizing the payoff expectation thanks to offline Dynamic Programming techniques. By using several immediate rewards, we have shown how three different policies could be obtained and how a hierarchical combination of these policies resulted in improved performance in classical problems considered difficult in the reinforcement learning framework. A finer decomposition into modules can be obtained if one considers that the latent learning process in MACS can itself be split into one module per anticipated attribute. The global architecture of our system is shown in figure 2.
Designing Efficient Exploration with MACS
1891
Fig. 6. MACS combining exploration and exploitation: number of time steps to reach the source of reward in successive trials in Maze216C, Maze228C and Maze312C.
As in Dyna-Q+, one policy relies on the external reward and a second one on the time elapsed after the last visit of each transition, but the third policy in MACS, relying on the information which can be gained by firing each classifier, is more original. It provides the agent with a systematic exploration capability which allows it to gain a perfect knowledge of its environment much faster than with a random exploration, as shown in [GMS03]. The hierarchical combination of these policies is also a distinctive feature of MACS with respect to Dyna-Q+. In Dyna-Q+, a weighted sum of the different criteria results in a compromise between the different policies. First, designing a weighted sum of different criteria while the magnitude of these criteria is not known in advance (it depends on the amount of external reward, which may vary from one environment to another) is very difficult. Thus it is likely that the weights have to be tuned for each experiment. In MACS, on the contrary, no weight parameter has to be tuned in order to deal with different levels of reward, since such weights do not exist in the hierarchical aggregation. Furthermore, since the criterion corresponding to the tendency to explore is never equal to zero, the overall behavior is always under-optimal with respect to payoff. In MACS, by contrast, giving the priority to building a correct model permits the construction of a complete model, and then the payoff maximization process naturally takes control of the agent and gives rise to an optimal behavior.
5 Future Work and Conclusion
In the MACS architecture for latent learning presented in figure 2, the functional role of each module is to guess the value of one attribute at the next time step given the value of several attributes (conditions and actions) at the current time step. That is, each module tries to approximate a many-to-one function, whereas classifiers in ACS and YACS were trying to approximate a many-to-many mapping. The effectiveness of this approximation results from the fact that all classifiers receive at the next time step the actual situation or attribute that they should have anticipated. That is, though there is no supervisor, the learning process can benefit from information on its errors as if it were supervised. Therefore, any system relying on the same property can be used to approximate functions with the same effectiveness. In particular, XCS is such a system, since the prediction of the accuracy of each eligible classifier can be corrected after each time step. Thus, rather than using a specialized ALCS such as MACS to solve reinforcement learning problems with a model-based approach, it is possible to use a more general system like XCS as a building block to implement the basic modules in the architecture presented in figure 2. Using XCS in such an architecture would let the system solve discrete state and time problems verifying the Markov property, as MACS does. Another possibility would be to use XCSF [Wil01,Wil02], an XCS extension devoted to the piecewise linear approximation of continuous functions, so as to address continuous state problems. Our future work will consist in the investigation of that line of research: using XCSF modules to learn both a policy and a model of the environment. It must be mentioned that, where XCSF uses the Widrow-Hoff delta rule so as to linearly approximate a function, we envision rather using a more powerful and efficient method coming from the field of incremental statistical estimation. As a conclusion, we described in this paper several LCSs, casting a new light on the concept of generalization in the LCS framework. We presented MACS, a new LCS using a different formalism able to use additional regularities in its latent learning process. The MACS formalism for latent learning does not consider situations as an unsecable whole, but decorrelates the attributes, making it possible to represent regularities across attributes. We have illustrated the effectiveness of our approach with MACS in the context of an active exploration problem, using a hierarchical aggregation criterion in order to tackle the exploration/exploitation tradeoff. Beyond the presentation of MACS, we hope to have convinced the reader that solving reinforcement learning problems with a model-based approach can be seen as a conjunction of several function approximation problems, and that solving these problems can be achieved efficiently with different systems as building blocks, given that they rely on a prediction process.

References
[BGS00] M. V. Butz, D. E. Goldberg, and W. Stolzmann. Introducing a genetic generalization pressure to the Anticipatory Classifier System, part I: Theoretical approach. In Proceedings of the 2000 Genetic and Evolutionary Computation Conference (GECCO 2000), pages 34-41, 2000.
[But02a] M. V. Butz. An algorithmic description of ACS2. In Lanzi et al. [LSW02], pages 211-229.
[But02b] M. V. Butz. Biasing exploration in an Anticipatory Learning Classifier System. In Lanzi et al. [LSW02], pages 3-22.
[GMS03] P. Gérard, J.-A. Meyer, and O. Sigaud. Combining latent learning with dynamic programming. European Journal of Operational Research, to appear, 2003.
[GS01] P. Gérard and O. Sigaud. YACS: Combining Anticipation and Dynamic Programming in Classifier Systems. In P. L. Lanzi, W. Stolzmann, and S. W. Wilson, editors, Advances in Learning Classifier Systems, volume 1996 of Lecture Notes in Artificial Intelligence, pages 52-69. Springer-Verlag, Berlin, 2001.
[GSS02] P. Gérard, W. Stolzmann, and O. Sigaud. YACS: a new Learning Classifier System with Anticipation. Journal of Soft Computing: Special Issue on Learning Classifier Systems, 6(3-4):216-228, 2002.
[Hol85] J. H. Holland. Properties of the bucket brigade algorithm. In J. J. Grefenstette, editor, Proceedings of the 1st International Conference on Genetic Algorithms and their Applications (ICGA85), pages 1-7. L.E. Associates, July 1985.
[Lan00] P. L. Lanzi. Learning Classifier Systems from a reinforcement learning perspective. Technical Report 00-03, Dip. di Elettronica e Informazione, Politecnico di Milano, 2000.
[LSW02] P. L. Lanzi, W. Stolzmann, and S. W. Wilson, editors. Advances in Learning Classifier Systems, volume 2321 of Lecture Notes in Artificial Intelligence. Springer-Verlag, Berlin, 2002.
[SB98] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[Sto98] W. Stolzmann. Anticipatory Classifier Systems. In J. R. Koza, W. Banzhaf, K. Chellapilla, K. Deb, M. Dorigo, D. B. Fogel, M. H. Garzon, D. E. Goldberg, H. Iba, and R. Riolo, editors, Genetic Programming, pages 658-664. Morgan Kaufmann Publishers, Inc., San Francisco, CA, 1998.
[Sto00] W. Stolzmann. An introduction to Anticipatory Classifier Systems. In P. L. Lanzi, W. Stolzmann, and S. W. Wilson, editors, Learning Classifier Systems: from Foundations to Applications, pages 175-194. Springer-Verlag, Heidelberg, 2000.
[Sut90] R. S. Sutton. Integrating architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning (ICML90), pages 216-224, San Mateo, CA, 1990. Morgan Kaufmann.
[Sut91] R. S. Sutton. Reinforcement learning architectures for animats. In J.-A. Meyer and S. W. Wilson, editors, From animals to animats: Proceedings of the First International Conference on Simulation of Adaptive Behavior, pages 288-296, Cambridge, MA, 1991. MIT Press.
[Wat89] C. J. Watkins. Learning with delayed rewards. PhD thesis, Psychology Department, University of Cambridge, England, 1989.
[Wil94] S. W. Wilson. ZCS, a zeroth level Classifier System. Evolutionary Computation, 2(1):1-18, 1994.
[Wil95] S. W. Wilson. Classifier fitness based on accuracy. Evolutionary Computation, 3(2):149-175, 1995.
[Wil01] S. W. Wilson. Function approximation with a classifier system. In L. Spector, E. D. Goodman, A. Wu, W. B. Langdon, H.-M. Voigt, and M. Gen, editors, Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2001), pages 974-981. Morgan Kaufmann, 2001.
[Wil02] S. W. Wilson. Classifiers that approximate functions. Natural Computing, 1(2-3):211-234, 2002.
Estimating Classifier Generalization and Action's Effect: A Minimalist Approach
Pier Luca Lanzi
Artificial Intelligence and Robotics Laboratory
Dipartimento di Elettronica e Informazione
Politecnico di Milano
[email protected]
Abstract. We present an online technique to estimate the generality of classifier conditions. We show that this technique can be extended to gather some basic information about the effect of classifier actions in the environment. The approach we present is minimalist in that it is aimed at obtaining as much information as possible from online experience, with as few modifications as possible to the classifier structure. Because of its plainness, the method we propose can be applied virtually to any classifier system model.
1 Introduction
Learning classifier systems are online techniques which exploit reinforcement learning and evolutionary computation to learn how to solve problems. Learning is achieved through the evolution of a set (or population) of condition-action-payoff rules (the classifiers) which represent the solution. As all the other reinforcement learning techniques, a learning classifier system assumes neither any knowledge about the space of possible input configurations that will be encountered, nor any knowledge about the possible effect that its actions will have in the environment. When the solutions evolved by a classifier system are analyzed, one of the main focuses is the degree of generalization achieved. Classifiers that apply in many situations are studied because they provide some high-level knowledge about the target problem; classifiers that apply in few situations are studied because they provide insight about some specific aspect of the target problem. In principle, since no assumptions are made about the input space, it should not be possible to know which situations classifiers have been applied to (unless we can keep track of the situations encountered, as done in [3]). In practice, the analysis techniques usually applied assume that the problem space is known completely (e.g., [9]). However, this assumption easily becomes infeasible in large applications such as real-world robotics and the analysis of huge amounts of data. Thus it would be useful to have a technique capable of providing some information about classifier generality which (i) does not assume that the input space is known completely, and (ii) does not store large amounts of additional information.
E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 1894-1904, 2003.
© Springer-Verlag Berlin Heidelberg 2003
Estimating Classifier Generalization and Action’s Effect
1895
In this paper, we present a very simple technique to estimate the degree of generality of classifier conditions. The technique does not require any prior knowledge about the input space and requires only a limited amount of additional information. The technique works online while learning is in progress and classifiers are used. It consists of collecting elementary statistics about the sensory inputs for classifiers that are currently matching. We show that the same technique can be extended to collect some elementary information about the effect of classifier actions. The approach we present is minimalist in that it is aimed at obtaining as much information as possible from online experience, with as few modifications as possible to the classifier structure. The method is far from being as powerful as those specifically designed for anticipatory behavior [1]. Nevertheless, the results produced might still be interesting. Most importantly, because of its plainness the method we propose can be applied virtually to any classifier system model.
2 Motivation
In learning classifier systems, it is usually difficult to distinguish between classifiers that are syntactically general and classifiers that are actually general, unless some knowledge about the problem space is available. Let us illustrate this with a simple example. Suppose that we are dealing with the most usual learning classifier system model: conditions are represented as strings of fixed length L in the alphabet {0, 1, #}; the symbol #, called don't care, can match both 1s and 0s; actions are represented as binary strings of fixed length; the problem we are trying to solve involves binary sensory inputs made of 16 bits, and allows eight actions encoded with a binary string of three bits. Suppose that the system has evolved a classifier with condition ###############1 and action 111. This classifier seems very general: it contains 15 don't care symbols and therefore in principle it can match 2^15 different situations. However, it might also be the case that there is only one input configuration with a one in the last position (e.g., 0000000000000001), so that the classifier is not general but very specific instead. Some information about the situations that classifiers match can also be useful to analyze the results produced through symbolic representations. For instance, [5] reports the following example of a symbolic classifier condition expressed with a Lisp syntax:
(AND(AND(AND(NOT(EQ(A2)(A1)))(NOT(EQ(A2)(A1)))) (AND(AND(GT(A5)(1))(4))(NOT(EQ(A2)(A1)))))(NOT(EQ(A2)(A1))))
which involves five integer attributes, A1 . . . A5. AND, OR, and NOT represent the corresponding Boolean functions; GT is the relation greater-than, and EQ is the relation equal-to. Note that, although the above condition is quite simple, it is still difficult to estimate whether the condition is general. Of course, we might simplify it with some tools, but the resulting condition might still be too complex to be analyzed. Thus, we believe that the analysis of large rulesets
1896
P.L. Lanzi
Fig. 1. The Woods1 environment (a) and the Maze4 environment (b)
evolved through symbolic representations would be made easier if some support to evaluate the degree of generalization of classifiers could be available.
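The distinction just drawn, syntactic versus actual generality, can be made concrete with a brute-force count over the inputs actually observed. This is an illustrative sketch only; the helper names are ours, not part of any system discussed in the paper.

```python
def matches(condition, inp):
    """True if a ternary condition (string over {'0','1','#'}) matches
    a binary input string: '#' is the don't care symbol."""
    return all(c == '#' or c == b for c, b in zip(condition, inp))

def actual_generality(condition, observed_inputs):
    """Number of distinct observed inputs the condition matches."""
    return sum(1 for s in set(observed_inputs) if matches(condition, s))

# A condition with 15 don't cares could match 2**15 strings in principle,
# but if only one observed input ends in 1 it is actually very specific:
observed = ["0000000000000001", "0000000000000010", "0000000000000100"]
print(actual_generality("#" * 15 + "1", observed))  # 1
```

The point of the technique in this paper is to get this kind of information without storing the observed input set explicitly.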
3 Experimental Design
To test the methods presented in this paper, we implemented them on an available implementation of XCS [6] and applied our extended version of XCS to Boolean multiplexers and to woods environments.
Experiments. All the experiments reported were performed following the standard procedure used in the literature [8]. Each experiment consists of a number of problems that XCS must solve. When XCS solves a problem correctly, it receives a constant reward equal to 1000; otherwise it always receives a constant reward equal to 0.
Boolean Multiplexer. Boolean multiplexers are defined for l Boolean variables (i.e., bits) where l = k + 2^k: the first k variables (x_0 ... x_{k-1}) represent an address which indexes into the remaining 2^k variables (y_0 ... y_{2^k - 1}); the function returns the value of the indexed variable. For example, consider the multiplexer with six variables mp_6(x_0, x_1, y_0, y_1, y_2, y_3) (mp_6 : {0,1}^6 → {0,1}) defined as follows:

mp_6(x_0, x_1, y_0, y_1, y_2, y_3) = \bar{x}_0 \bar{x}_1 y_0 + \bar{x}_0 x_1 y_1 + x_0 \bar{x}_1 y_2 + x_0 x_1 y_3

The product corresponds to the logical and, the sum to the logical or, and the overline corresponds to the logical not. The system has to learn how to represent the Boolean multiplexer from a set of examples. For each problem, the system receives as input an assignment of the input variables (x_0, x_1, y_0, y_1, y_2, y_3); the system has to answer with the corresponding truth value of the function (0 or 1); if the answer is correct the system is rewarded with 1000, otherwise 0.
Woods Environments. These are simple grid worlds like those depicted in Fig. 1, which contain obstacles (T), free positions (.), and food (F). There are eight sensors, one for each possible adjacent cell. Each sensor is encoded with two bits: 10 indicates the presence of an obstacle T; 11 indicates a goal F; 00
indicates a free cell. Classifier conditions are 16 bits long (2 bits × 8 cells). There are eight possible actions, encoded with three bits. The system's goal is to learn how to reach a food position from any free position; the system is always in one free position and it can move to any surrounding free position; when the system reaches a food position (F) the problem ends and the system is rewarded with 1000; in all the other cases the system receives zero reward.
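The multiplexer task above is small enough to state directly in code; a minimal sketch from its definition (the function and reward names are ours):

```python
def mp6(x0, x1, y0, y1, y2, y3):
    """6-multiplexer: the address bits x0, x1 select one of y0..y3."""
    return (y0, y1, y2, y3)[2 * x0 + x1]

def reward(answer, bits):
    """Reward scheme from the text: 1000 if the answer is correct, else 0."""
    return 1000 if answer == mp6(*bits) else 0

# x0=1, x1=0 addresses y2:
print(mp6(1, 0, 0, 0, 1, 0))  # 1
```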
4 Estimating Rule Generality
The method we propose is aimed at collecting as much information as possible regarding classifier generality, while making as few modifications as possible to the classifier system structure. The method is quite simple. To the classifier structure, we add (i) an array in to estimate the averages of the sensory inputs matched by the classifier; (ii) an array εin to estimate the error affecting the values in the array in. The size of in and εin is the number of sensory inputs (e.g., in the first example discussed it would be 16; in the second example it would be six). At the beginning of an experiment, the values of in and εin of classifiers in the rulebase are initialized with a predefined value (0 in our experiments, but a random value would work as well). When classifiers are matched against the current sensory input st, for every classifier cl that matches st, the arrays in and εin are updated as follows:

∀cl that matches st, ∀k:
  cl.in[k] ← cl.in[k] + βe (st[k] − cl.in[k])                      (1)
  cl.εin[k] ← cl.εin[k] + βe (|st[k] − cl.in[k]| − cl.εin[k])      (2)

where st[k] represents the k-th sensory input of st; cl.in[k] represents the k-th element of the array in of classifier cl; likewise cl.εin[k] represents the k-th element of the array εin of classifier cl; βe (0 < βe < 1) is the learning rate. Equation 1 estimates the average of all the input values that classifier cl matches. Given the average input values stored in cl.in[k], Equation 2 builds an estimate of the error cl.εin[k] affecting cl.in[k]. The reader familiar with Wilson's XCS [8] will recognize in Equations 1 and 2 the update of the classifier prediction parameter p and of the classifier error parameter ε used in XCS.
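In code, the two updates are a pair of exponentially weighted moving averages per input position. This sketch uses our own naming (in_ and eps_in stand for in and εin); the minimal Classifier record is an assumption for illustration, not the actual XCS structure:

```python
class Classifier:
    def __init__(self, n_inputs):
        # in_ and eps_in start from a predefined value (0 in the paper).
        self.in_ = [0.0] * n_inputs
        self.eps_in = [0.0] * n_inputs

def update_input_estimates(cl, s_t, beta_e=0.001):
    """Apply Equations 1 and 2 for a classifier cl that matches input s_t."""
    for k, s_k in enumerate(s_t):
        # Eq. 1: running average of the k-th matched input.
        cl.in_[k] += beta_e * (s_k - cl.in_[k])
        # Eq. 2: running estimate of the error affecting that average.
        cl.eps_in[k] += beta_e * (abs(s_k - cl.in_[k]) - cl.eps_in[k])
```

Fed a stream of uniformly random bits in position k, the estimates drift toward the 0.5 ± 0.5 don't-care signature reported for the multiplexer experiments; a constant input drives them toward 1 ± 0 or 0 ± 0.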
5 Experiments on Rule Generality
To test the proposed method for estimating classifier generality, we implemented it on an available XCS implementation [6] and applied our XCS extension to the 6-multiplexer with the following parameters (see [2] for details): N = 400, β = 0.2, α = 0.1, ε0 = 0.01, ν = 5, θGA = 25, χ = 0.8, µ = 0.04, θdel = 20, δ = 0.1, θsub = 20, P# = 0.6, θmna = 2, doGASubsumption = 1, doActionSetSubsumption = 0 (i.e., action-set subsumption is not used). The learning rate βe is 0.001. Table 1 reports an example of a population that was evolved after 20000 learning problems; for the sake of brevity, only classifiers with non-null
Table 1. Classifiers with non-zero prediction evolved by XCS for the 6-mpultiplexer. For every classifier we report: a unique identifier, the classifier condition and the classifier action separated by “:” , the prediction p, the prediction error ε, and the fitness F . The array in and εin are reported as a sequence of in[k] ± εin [k] for k ∈ {0 . . . 5}. 23331 11###1:1 p = 1000 ε = 0.00 F = 1.00 1.00 ± 0.00, 1.00 ± 0.00, 0.53 ± 0.50, 0.48 ± 0.50, 0.47 ± 0.50, 1.00 ± 0.00 3485 000###:0 p = 1000 ε = 0.00 F = 1.00 0.00 ± 0.00, 0.00 ± 0.00, 0.00 ± 0.00, 0.48 ± 0.50, 0.49 ± 0.50, 0.52 ± 0.50 7810 11###0:0 p = 1000 ε = 0.00 F = 1.00 1.00 ± 0.00, 1.00 ± 0.00, 0.55 ± 0.49, 0.47 ± 0.50, 0.47 ± 0.50, 0.00 ± 0.00 8807 01#0##:0 p = 1000 ε = 0.00 F = 1.00 0.00 ± 0.00, 1.00 ± 0.00, 0.54 ± 0.50, 0.00 ± 0.00, 0.47 ± 0.50, 0.53 ± 0.50 20897 01#1##:1 p = 1000 ε = 0.00 F = 1.00 0.00 ± 0.00, 1.00 ± 0.00, 0.46 ± 0.50, 1.00 ± 0.00, 0.50 ± 0.50, 0.46 ± 0.50 373 10##1#:1 p = 1000 ε = 0.00 F = 1.00 1.00 ± 0.00, 0.00 ± 0.00, 0.48 ± 0.50, 0.50 ± 0.50, 1.00 ± 0.00, 0.50 ± 0.50 7723 001###:1 p = 1000 ε = 0.00 F = 1.00 0.00 ± 0.00, 0.00 ± 0.00, 1.00 ± 0.00, 0.53 ± 0.50, 0.50 ± 0.50, 0.47 ± 0.50 6108 10##0#:0 p = 1000 ε = 0.00 F = 1.00 1.00 ± 0.00, 0.00 ± 0.00, 0.51 ± 0.50, 0.52 ± 0.50, 0.00 ± 0.00, 0.52 ± 0.5
prediction are reported. For each classifier, on the first line, we report: a unique identifier, the classifier condition and the classifier action separated by a ":", the prediction p, the prediction error ε, and the fitness F; on the second line, for each input s[k], we report the interval in[k] ± εin[k], which provides an estimate of the values of the k-th input that the classifier matches. Since in the Boolean multiplexer all the possible input configurations are considered, we should expect that (i) every position k of the classifier condition containing a don't care would correspond to an in[k] equal to 0.5 and to an εin[k] equal to 0.5 (i.e., 0.5 ± 0.5); (ii) every position k of the classifier condition containing a 1 or a 0 would correspond to an in[k] equal to 1 or 0 respectively, and to an εin[k] equal to 0.0 (i.e., 1 ± 0.0 or 0 ± 0.0). Table 1 confirms these expectations. All the intervals corresponding to a constant input (0 or 1) have an error estimate (εin) equal to 0; all the intervals corresponding to a don't care have values very near to 0.5. Of course, since in and εin are estimates, they are affected by errors due mainly to the order in which the inputs appeared and to the value of βe used. For instance, in classifier 23331, the interval corresponding to the first don't care is 0.53 ± 0.50, instead of 0.50 ± 0.50. Probably, we could obtain better approximations by tuning βe or by using an adaptive value for βe. On the other hand, some experiments we performed with more sophisticated techniques did not result in relevant improvements of the results, while we find that the estimates developed through our elementary technique are quite satisfactory. In fact, in Table 1 all the values of in and εin turn out to be correct when approximated to the first decimal digit.
The technique appears even more interesting if we apply it to a version of XCS involving symbolic expressions (e.g., [7]). We apply XCS with symbolic expression to the 6-multiplexer with the same parameters used in the previous experiments. In one of the final populations appears the following classifier: ((((NOT(NOT(NOT(( X0 OR Y0 )AND(NOT Y0 )))))OR((NOT((NOT( X0 AND Y1 )) AND((NOT X1 ) AND( Y3 AND X0 )))) OR(NOT((( Y0 OR Y0 )OR( Y3 OR Y1 ))AND((NOT Y3) AND((NOT X1 )AND ( Y3 AND X0 )))))))AND((NOT X0 )AND(NOT(NOT( Y1 AND X1 )))))AND(( X1 AND(((( Y1 AND ( Y1 AND X1 ))AND Y1 )AND( Y1 AND(NOT X0 ))) AND((((NOT( Y0 OR X0 )) OR(((NOT X0 )AND Y0 )OR( Y1 AND Y0 )))AND(NOT(NOT( Y1 AND X1 ))))AND( Y1 AND(NOT X0 )))))AND X1 )):1
p = 1000 ε = 0 F = 1
0.00 ± 0.00, 1.00 ± 0.00, 0.55 ± 0.50, 1.00 ± 0.00, 0.46 ± 0.50, 0.51 ± 0.50
where AND, OR, and NOT represent the corresponding Boolean operators; X0, X1, Y0, . . . , Y3 represent the six problem variables. Note that it is almost impossible to evaluate the generality of the condition (unless we simplify it); however, the intervals suggest that the above classifier matches situations in which x0 = 0, x1 = 1, and y1 = 1, while all the other variables can take any value between 0 and 1. When we simplify the condition with a tool for logic synthesis, we find that the condition corresponds to the expression \bar{x}_0 x_1 y_1, confirming what is suggested by the intervals represented with in and εin.
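A claim such as "this condition simplifies to \bar{x}_0 x_1 y_1" can also be verified without a logic-synthesis tool by enumerating all 2^6 assignments. The sketch below checks a small hand-written pair of equivalent conditions rather than the evolved expression itself; all names are ours:

```python
from itertools import product

def equivalent(f, g, n_vars):
    """Brute-force Boolean equivalence over all 2**n_vars assignments."""
    return all(f(*bits) == g(*bits) for bits in product((0, 1), repeat=n_vars))

# Example: (NOT x0) AND x1 AND y1, written two ways over the six
# multiplexer variables (x0, x1, y0, y1, y2, y3):
cond = lambda x0, x1, y0, y1, y2, y3: (not x0) and x1 and y1
simplified = lambda x0, x1, y0, y1, y2, y3: not (x0 or not x1 or not y1)
print(equivalent(cond, simplified, 6))  # True
```

For the 16-bit woods inputs the same enumeration would cover 2^16 cases, which is still cheap to check exhaustively.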
6 Estimating Actions' Effect
The same principle used to estimate rule generality can be applied to provide some elementary information about the effect of classifier actions. The classifier structure is further extended, adding (i) an array ef to estimate the average values of the sensory inputs that the system will receive if the classifier action is performed; (ii) an array εef to estimate the error affecting the values in the array ef. As in the previous case, the size of ef and εef is the number of sensory inputs. Consider the (quite typical) performance cycle of a classifier system:

1. repeat
2.   consider the current sensory input st;
3.   build the set [M] of classifiers that match st;
4.   select an action at from those in [M];
5.   store the classifiers in [M] with action at into [A];
6.   perform action at;
7.   get the reward rt;
8.   get the next sensory input st+1;
9.   distribute the reward rt to classifiers;
10. until stop criterion met;
Although the structure still resembles that of XCS, we left out some typical details of XCS (e.g., the genetic algorithm and the distribution of the reward in
[A]) to keep it as general as possible. We can extend the cycle above by adding a step, between line 8 and line 9, in which for every classifier cl in [A], the arrays ef and εef are updated with st+1 as follows:

∀cl ∈ [A], ∀k:
  cl.ef[k] ← cl.ef[k] + βe (st+1[k] − cl.ef[k])                    (3)
  cl.εef[k] ← cl.εef[k] + βe (|st+1[k] − cl.ef[k]| − cl.εef[k])    (4)

where st+1[k] represents the k-th input of state st+1, i.e., the effect that performing action at in state st has on the k-th sensory input; as before, cl.ef[k] represents the k-th element of the array ef of classifier cl; likewise cl.εef[k] represents the k-th element of the array εef of classifier cl. Equation 3 estimates the average effect that the execution of classifier cl will have on the next k-th sensory input (i.e., on st+1[k]); Equation 4 estimates the error affecting cl.ef[k].
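The extra step can be sketched as a single function applied to the action set [A]; the attribute names ef and eps_ef (for ef and εef) and the bare classifier record are our own assumptions for illustration:

```python
from types import SimpleNamespace

def update_effect_estimates(action_set, s_next, beta_e=0.001):
    """Apply Equations 3 and 4 to every classifier in [A] once s_next arrives."""
    for cl in action_set:
        for k, s_k in enumerate(s_next):
            # Eq. 3: running average of the next k-th input after cl fires.
            cl.ef[k] += beta_e * (s_k - cl.ef[k])
            # Eq. 4: running estimate of the error affecting ef[k].
            cl.eps_ef[k] += beta_e * (abs(s_k - cl.ef[k]) - cl.eps_ef[k])

cl = SimpleNamespace(ef=[0.0, 0.0], eps_ef=[0.0, 0.0])
for _ in range(5000):
    update_effect_estimates([cl], [1, 0], beta_e=0.2)
print([round(v) for v in cl.ef])  # [1, 0]
```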
7 Experiments on Actions' Effect
We finally test the proposed approach by applying our modified version of XCS to two grid environments: Woods1 and Maze4. The experiments are conducted as follows. First, we apply XCS to the two environments to develop some very small (i.e., very general) solutions. To achieve this we use both action-set subsumption and an additional condensation phase (see [2,5] for details). The parameters are set as in the previous experiments except: N = 1600, γ = 0.7, doActionSetSubsumption = 1 (i.e., AS-subsumption was used), θmna = 8. For Woods1, the smallest solution consisted of 33 classifiers; for Maze4 it consisted of 101 classifiers. As a second step, we compute the exact average values that our method should be able to estimate. For every classifier cl evolved, we consider: (i) the input set Icl of all the sensory inputs that cl matches; (ii) the effect set Ecl of all the sensory inputs that appear next to the execution of cl. Given Icl and Ecl, for each input k we compute the average of the input values appearing in position k, ai[k], the average of the input values appearing in position k after the execution of cl, ae[k], and the average errors affecting ai[k] and ae[k], i.e., εai[k] and εae[k]. More formally:

cl.ai[k] = (Σ_{s∈Icl} s[k]) / |Icl|                      (5)
cl.εai[k] = (Σ_{s∈Icl} |cl.ai[k] − s[k]|) / |Icl|        (6)
cl.ae[k] = (Σ_{s∈Ecl} s[k]) / |Ecl|                      (7)
cl.εae[k] = (Σ_{s∈Ecl} |cl.ae[k] − s[k]|) / |Ecl|        (8)
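Equations 5 through 8 are ordinary set averages, so a small routine can compute them directly; applied to Icl it yields ai and εai, applied to Ecl it yields ae and εae. The naming is ours:

```python
def set_averages(inputs):
    """Per-position mean (Eqs. 5/7) and mean absolute deviation (Eqs. 6/8)
    over a list of equal-length binary input vectors."""
    n = len(inputs)
    length = len(inputs[0])
    avg = [sum(s[k] for s in inputs) / n for k in range(length)]
    err = [sum(abs(avg[k] - s[k]) for s in inputs) / n for k in range(length)]
    return avg, err

avg, err = set_averages([[0, 1], [1, 1]])
print(avg, err)  # [0.5, 1.0] [0.5, 0.0]
```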
The values of cl.ai[k], cl.εai[k], cl.ae[k], and cl.εae[k] are the average of the input values for sensor k and the average effect for sensor k, i.e., what our technique should be able to estimate. Thus they serve as a reference to compare the quality of the estimates developed through our approach. Then we apply our version of XCS to Woods1. Table 2 reports three classifiers evolved for Woods1; due to space constraints, it is not possible to consider here all the 33 classifiers. For each classifier we report on the first line: a unique identifier, the condition and the action separated by a ":", the prediction p, the error ε, the fitness F. On the next lines, we report the list I of inputs that the classifier matches and the corresponding list E of inputs that appear after as effect of the classifier execution. Then, we report: (i) the array of average input values, which contains, for every k, the intervals cl.ai[k] ± cl.εai[k] (Equations 5 and 6); (ii) the array of average effect values, which contains, for every k, the intervals cl.ae[k] ± cl.εae[k] (Equations 7 and 8); (iii) the array representing the estimated classifier input, which contains, for every k, the intervals cl.in[k] ± cl.εin[k] (Equations 1 and 2); and (iv) the array representing the estimated classifier effect, which contains, for every k, the intervals cl.ef[k] ± cl.εef[k] (Equations 3 and 4). The first classifier in Table 2 (3112676) is the most general classifier found; it matches all the possible sensory inputs of Woods1, yet it is very accurate and has a high fitness. Note that, since classifier 3112676 is very general, the estimate of the matching sensory inputs and the estimate of the classifier effect provide only limited information.
The computed values of ai and εai suggest that three of the sensory inputs (the second, the fourth, and the sixth) will always be 0; the computed values of ae and εae suggest that, no matter in which situation the classifier is applied, in the next state nine of the inputs will always be zero. If we now consider the estimated input in and the estimated effect ef for the classifier, we find that our technique was successful in finding the same information. All the positions in ai and ae that correspond to zero values correspond to zero also in in and ef, while the estimated non-zero values of in and ef are quite near to the corresponding computed values. Some interesting information can be inferred from the non-zero values. For instance, the values ae[8] = 0.06 and εae[8] = 0.12 suggest that in the subsequent state the ninth bit will be very likely to be 0. Again, the values estimated through our approach for the same input are quite similar: ef[8] = 0.04 and εef[8] = 0.08. Note that in Woods1 sensory inputs are not visited uniformly; accordingly, the estimates evolved through our technique (e.g., in) tend to differ from the computed averages (e.g., ai), since the latter assume a uniform distribution of the inputs. The second classifier (3112681) matches fewer situations, therefore its estimate of the classifier input in and its estimate of the classifier effect ef provide more information. For instance, the arrays in and εin of classifier 3112681 suggest that most of the sensory inputs matched do not change; likewise the values in ef and εef. Again, we note that for non-constant values, the estimates developed
1902
P.L. Lanzi
Table 2. Three classifiers evolved by our extended XCS for Woods1. For every classifier we report: a unique identifier, the classifier condition and the classifier action separated by ":", the prediction p, the prediction error ε, and the fitness F; the set I of matched inputs paired with the set E of inputs that arrive after the classifier execution; the computed average values of the inputs, ai ± εai, and of the effects, ae ± εae; and the estimated input and effect values, in ± εin and ef ± εef.

3112676  ###############1 : 001   p = 490   ε = 0.00   F = 1.00
I → E:
1010000000000000 → 1010000000000000    0000001010000000 → 1010000000000010
1010000000000010 → 1010000000000010    0000001110100000 → 1000000000000010
1000000000000010 → 0000000000001010    0000000011100000 → 0000000000000010
0000000000000010 → 0010100000000000    0000000000110000 → 0010000000000000
0000000000101100 → 0000001000000000    0000000000101011 → 0000101000000000
0000000000001010 → 0010101000000000    0010000000000000 → 0010000000000000
0000001000000000 → 1010000000000000    0000101000000000 → 0000001010000000
0010101000000000 → 0010101000000000    0010100000000000 → 0010100000000000
ai ± εai: .19±.30, .00±.00, .31±.43, .00±.00, .19±.30, .00±.00, .31±.43, .06±.12, .19±.30, .06±.12, .31±.43, .06±.12, .19±.30, .06±.12, .31±.43, .06±.12
ae ± εae: .31±.43, .00±.00, .62±.47, .00±.00, .31±.43, .00±.00, .31±.43, .00±.00, .06±.12, .00±.00, .00±.00, .00±.00, .06±.12, .00±.00, .31±.43, .00±.00
in ± εin: .21±.33, .00±.00, .37±.45, .00±.00, .20±.32, .00±.00, .24±.38, .05±.10, .18±.29, .07±.12, .31±.41, .05±.10, .21±.32, .04±.09, .38±.47, .09±.14
ef ± εef: .34±.45, .00±.00, .71±.43, .00±.00, .36±.45, .00±.00, .25±.39, .00±.00, .04±.08, .00±.00, .00±.00, .00±.00, .09±.14, .00±.00, .32±.42, .00±.00

3112681  ############1### : 000   p = 700   ε = 0.00   F = 1.00
I → E:
0000000000101100 → 0000000000110000
0000000000101011 → 0000000000101100
0000000000001010 → 0000000000101011
ai ± εai: .00±.00, .00±.00, .00±.00, .00±.00, .00±.00, .00±.00, .00±.00, .00±.00, .00±.00, .00±.00, .67±.44, .00±.00, 1.00±.00, .33±.44, .67±.44, .33±.44
ae ± εae: .00±.00, .00±.00, .00±.00, .00±.00, .00±.00, .00±.00, .00±.00, .00±.00, .00±.00, .00±.00, 1.00±.00, .33±.44, .67±.44, .33±.44, .33±.44, .33±.44
in ± εin: .00±.00, .00±.00, .00±.00, .00±.00, .00±.00, .00±.00, .00±.00, .00±.00, .00±.00, .00±.00, .63±.45, .00±.00, 1.00±.00, .27±.40, .73±.40, .37±.46
ef ± εef: .00±.00, .00±.00, .00±.00, .00±.00, .00±.00, .00±.00, .00±.00, .00±.00, .00±.00, .00±.00, 1.00±.00, .25±.38, .75±.38, .27±.40, .49±.49, .49±.49

3112677  ###############1 : 111   p = 1000   ε = 0.00   F = 1.00
I → E:
0000000000101011 → 0000000010101000
ai ± εai: .00±.00, .00±.00, .00±.00, .00±.00, .00±.00, .00±.00, .00±.00, .00±.00, .00±.00, .00±.00, 1.00±.00, .00±.00, 1.00±.00, .00±.00, 1.00±.00, 1.00±.00
ae ± εae: .00±.00, .00±.00, .00±.00, .00±.00, .00±.00, .00±.00, .00±.00, .00±.00, 1.00±.00, .00±.00, 1.00±.00, .00±.00, 1.00±.00, .00±.00, .00±.00, .00±.00
in ± εin: .00±.00, .00±.00, .00±.00, .00±.00, .00±.00, .00±.00, .00±.00, .00±.00, .00±.00, .00±.00, 1.00±.00, .00±.00, 1.00±.00, .00±.00, 1.00±.00, 1.00±.00
ef ± εef: .00±.00, .00±.00, .00±.00, .00±.00, .00±.00, .00±.00, .00±.00, .00±.00, 1.00±.00, .00±.00, 1.00±.00, .00±.00, 1.00±.00, .00±.00, .00±.00, .00±.00
Estimating Classifier Generalization and Action’s Effect
1903
through our technique are quite reasonable, also considering that the distribution of inputs is not taken into account in the computation of ai and ae. The last classifier (3112677) is the one we discussed in the first example. Although it might appear quite general, it matches only one possible situation; accordingly, its estimates are exact. To provide a rough evaluation of the estimates developed with our technique, we computed the average error between the computed values for inputs and effect (ai, εai, ae, εae) and the corresponding estimated values (in, εin, ef, εef). For this purpose we applied our modified version of XCS to Woods1 ten times. For each run we computed:

error(in)  = ( Σ_{cl∈P} Σ_k |cl.ai[k] − cl.in[k]| ) / (|P| × n)      (9)
error(εin) = ( Σ_{cl∈P} Σ_k |cl.εai[k] − cl.εin[k]| ) / (|P| × n)    (10)
error(ef)  = ( Σ_{cl∈P} Σ_k |cl.ae[k] − cl.ef[k]| ) / (|P| × n)      (11)
error(εef) = ( Σ_{cl∈P} Σ_k |cl.εae[k] − cl.εef[k]| ) / (|P| × n)    (12)

where P represents the final population evolved and n is the number of inputs (16 in Woods1). Over ten runs in Woods1, the average error(in) is 0.0054, the average error(εin) is 0.0035, the average error(ef) is 0.0075, and the average error(εef) is 0.0041. We applied the same technique to Maze4 with a population size N = 2000 for ten runs, measuring an average error(in) of 0.0058, an average error(εin) of 0.0024, an average error(ef) of 0.0057, and an average error(εef) of 0.0028. Since Maze4 does not allow many generalizations, the prediction of the classifier effect is a little more accurate than that obtained for Woods1; in fact, in Maze4 the values of error(ef) and error(εef) are a little smaller. Note also that these errors measure the difference between our estimated values and the best values that can be obtained following our approach. Thus, even if the errors are small, the accuracy of the prediction built is still limited by the approach we follow, which is less powerful than methods specifically designed for anticipatory behavior.
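Equations 9-12 all share the same shape; a small sketch (illustrative naming) of the common computation:

```python
def avg_error(computed, estimated):
    """Average |computed - estimated| over all classifiers cl in P and all
    inputs k, i.e. sum_cl sum_k |c[cl][k] - e[cl][k]| / (|P| * n): the form
    shared by error(in), error(ein), error(ef), and error(eef)."""
    P = len(computed)       # |P|: classifiers in the final population
    n = len(computed[0])    # n: number of inputs (16 in Woods1)
    total = sum(abs(c - e)
                for cl_c, cl_e in zip(computed, estimated)
                for c, e in zip(cl_c, cl_e))
    return total / (P * n)
```

Calling it four times, once per pair of arrays (ai vs. in, εai vs. εin, ae vs. ef, εae vs. εef), yields the four measures reported above.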
As the short example presented and the few statistics collected suggest, the method we propose provides limited information about classifiers that apply in many situations, while it provides more precise information about classifiers that apply in fewer situations. In either case, the method appears able to provide reliable information about classifier generality and some interesting information about the effect of classifier activation.
8
Extensions and Limitations
The proposed approach is very simple and is therefore open to criticism and extension.
Classifier Generality. The first limitation regards the estimates of classifier generality. Our approach, for every input k, develops an interval in[k] ± εin[k] that estimates the range of values that appears at the corresponding input. The larger the range, the more general the classifier appears to be with respect to that specific input. Accordingly, we might be tempted to use these intervals to build a subsumption relation for classifier representations that usually do not allow one. Lanzi [4] noted that with symbolic conditions subsumption operators are computationally infeasible. In Sect. 7 we showed that the intervals in[k] ± εin[k] might provide some indication of classifier generality. Thus, we might try to define a subsumption relation for symbolic conditions using our technique. Unfortunately, this is not possible. As a counterexample, consider the two conditions defined over a variable X ∈ {0 . . . 9}: (i) (2<X) AND (X<6); (ii) (X<3) OR (X>5). Both conditions have roughly the same average input value, in ≈ 4, while the associated error εin is 0.66 for condition (i) and 3 for condition (ii). Comparing the two intervals, we would have to argue that condition (ii) is more general than condition (i), since the estimated input interval of condition (ii) subsumes the estimated input interval of condition (i). But the two conditions form a partition of the input domain and they cannot be compared through an inclusion operator. Thus we cannot define subsumption operators based on the estimated intervals. These in fact are developed from linear estimators, while symbolic conditions usually express non-linear relations.

Classifier Effect. Our approach tends to provide information on the classifier effect in terms of the input values that the system will experience after the classifier activation. Other anticipatory techniques, like ACS [1], provide information in terms of what will change in the current sensory input after the classifier execution.
It is straightforward to extend our technique to provide this kind of information. Given the current input st and the next input st+1, we define an array ∆[k] as follows: ∆[k] = 1 if st[k] ≠ st+1[k], 0 otherwise. The array ∆ represents the changes that occurred to the current state st after the classifier execution. The value ∆[k] can now be used in Equation 3 and Equation 4 instead of st+1[k], so that ef will estimate the changes in the environment state rather than the next state. In addition, note that in our approach we also take into account the final state, corresponding to the position with food. Classifier 3112677 in Table 2 has reward 1000, meaning that performing action 111 in state 0000000000101011 will take the agent to the end position, corresponding to state 0000000010101000. This information might be exploited to perform some very elementary planning (as done with Anticipatory Classifier Systems), although, given the basic information provided by the approach, its true potential is not clear at the moment. Alternatively, we might eliminate the effect estimate from classifiers whose action takes the agent to a final state. In this case, planning should focus on reaching classifiers with a null effect.
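The change mask ∆ is straightforward to compute; a sketch for binary inputs, illustrated with the two states of classifier 3112677 from Table 2:

```python
def delta(s_t, s_next):
    """Delta[k] = 1 where the k-th input changed after classifier execution,
    0 where it stayed the same (used in place of s_next[k] in Eqs. 3 and 4)."""
    return [1 if a != b else 0 for a, b in zip(s_t, s_next)]

# States of classifier 3112677 (Table 2): three bits change.
before = "0000000000101011"
after = "0000000010101000"
mask = delta(before, after)
```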
9
Summary
We presented a very basic method to estimate the degree of generality of classifier conditions. The method was also applied to obtain basic information about the effect of classifier execution. We showed some examples of the information that can be extracted from some very elementary problems. Since the method does not provide an exact anticipation, it is currently difficult to provide performance metrics to estimate its effectiveness. On the other hand, because of its simplicity, the method can be added to virtually any classifier system model. The code to experiment with the approach discussed here is available under the GNU Public License at the xcslib Web site [6].
References

1. Martin Butz. Anticipatory Learning Classifier Systems. Kluwer, 2002.
2. Martin V. Butz and Stewart W. Wilson. An algorithmic description of XCS. Journal of Soft Computing, 6(3–4):144–153, 2002.
3. Pierre Gérard and Olivier Sigaud. YACS: Combining dynamic programming with generalization in classifier systems. In Pier Luca Lanzi, Wolfgang Stolzmann, and Stewart W. Wilson, editors, IWLCS, volume 1996 of Lecture Notes in Computer Science, pages 52–69. Springer, 2001.
4. Pier Luca Lanzi. Extending the Representation of Classifier Conditions Part II: From Messy Coding to S-Expressions. In Wolfgang Banzhaf, Jason Daida, Agoston E. Eiben, Max H. Garzon, Vasant Honavar, Mark Jakiela, and Robert E. Smith, editors, Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-99), pages 345–352. Morgan Kaufmann, 1999.
5. Pier Luca Lanzi. Mining interesting knowledge from data with the XCS classifier system. In Lee Spector, Erik D. Goodman, Annie Wu, W.B. Langdon, Hans-Michael Voigt, Mitsuo Gen, Sandip Sen, Marco Dorigo, Shahram Pezeshk, Max H. Garzon, and Edmund Burke, editors, Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2001), pages 958–965, San Francisco, CA, USA, 7–11 July 2001. Morgan Kaufmann.
6. Pier Luca Lanzi. The xcs library. http://xcslib.sourceforge.net, 2002.
7. Pier Luca Lanzi and Alessandro Perrucci. Extending the Representation of Classifier Conditions Part II: From Messy Coding to S-Expressions. In Wolfgang Banzhaf, Jason Daida, Agoston E. Eiben, Max H. Garzon, Vasant Honavar, Mark Jakiela, and Robert E. Smith, editors, Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-99), pages 345–352, Orlando, FL, July 1999. Morgan Kaufmann.
8. Stewart W. Wilson. Classifier Fitness Based on Accuracy. Evolutionary Computation, 3(2):149–175, 1995. http://prediction-dynamics.com/.
9. Stewart W. Wilson. Compact rulesets from XCSI. In Pier Luca Lanzi, Wolfgang Stolzmann, and Stewart W. Wilson, editors, IWLCS, volume 2321 of Lecture Notes in Computer Science, pages 197–210. Springer, 2002.
Towards Building Block Propagation in XCS: A Negative Result and Its Implications

Kurian K. Tharakunnel, Martin V. Butz, and David E. Goldberg

Illinois Genetic Algorithms Laboratory (IlliGAL), University of Illinois at Urbana-Champaign, 104 S. Mathews, 61801 Urbana, IL, USA
{kurian,butz,deg}@illigal.ge.uiuc.edu

Abstract. The accuracy-based classifier system XCS is currently the most successful learning classifier system. Several recent studies showed that XCS can produce machine-learning competitive results. Nonetheless, until now the evolutionary mechanisms in XCS remained somewhat ill-understood. This study investigates the selectorecombinative capabilities of the current XCS system. We reveal the accuracy dependence of XCS's evolutionary algorithm and identify a fundamental limitation of the accuracy-based fitness approach in certain problems. Implications and future research directions conclude the paper.
1
Introduction
After Holland’s introduction of learning classifier systems (LCSs), originally referring to a cognitive system [7], one of the most important steps in LCS research was the development of the accuracy-based classifier system XCS by Wilson [12]. Most of the recent work on LCSs focuses on XCS. Among the many changes from the original LCS, the accuracy-based fitness in XCS is considered as the distinctive feature and the most important reason for success. Although crossover has always been applied in XCS, our experimental investigations showed that in most investigated classification problems to date mutation alone seems to be sufficient to evolve an appropriate solution. However, the larger a problem the more difficult it becomes to generate an accurate solution by mutation. This is the case since mutation in general results in a diversification pressure and in LCSs effectively in a specialization pressure since classifiers usually specify only a small fraction of the available features [2]. The larger the problem, however, the more general classifiers need to be to undergo sufficient evaluations and reproductive events. Thus, the larger a problem the smaller mutation needs to be. Consequently, only successful recombinations of accurate sub-structures, or building blocks [6], can do the trick. Additionally, as argued in [5], suitable search and ultimately innovative events are determined by effective recombination of appropriate sub-parts. As this insight led to the development of competent GAs—GAs that solve boundedly difficult problems quickly, accurately, and reliably—the same approach seems necessary in learning classifier system research to develop competent LCSs, that is, LCSs that solve typical (decomposable) machine learning problems efficiently. E. Cant´ u-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 1906–1917, 2003. c Springer-Verlag Berlin Heidelberg 2003
Towards Building Block Propagation in XCS: A Negative Result
1907
Recently, several studies have addressed how XCS works. Many of these try to explain the XCS mechanisms and help in choosing appropriate parameter settings. An important aspect missing so far is an explanation for the apparent non-performance of XCS in some problems. This paper investigates a set of such hard problems. In doing so, we reveal an important interaction between the accuracy-based fitness and the selectorecombinative GA [6,4] in XCS that prevents learning. Early hints of this interaction can be found in [1]. The role of the GA in XCS is to evolve accurate classifiers as well as to search for optimal generalizations within classification niches. The basic working of a GA dictates that selection, recombination, and mutation progressively result in the discovery of better individuals. We show that the GA in XCS may not be able to achieve this in particular problems due to a lack of suitable accuracy guidance from the over-general side. We use concatenated multiplexer problems (multiplexer problems formed by concatenating the condition and action parts of more than one single multiplexer problem) to illustrate how XCS performance is affected by the combination of accuracy-based fitness and selectorecombinative GAs. After a brief review of previous analyses of XCS mechanisms, we formalize the derivation of the estimated prediction error in XCS. Since the accuracy is derived from the prediction error estimate, this estimate is crucial for XCS success. In Sect. 4 we identify a fundamental weakness in XCS's accuracy approach, illustrating successful and unsuccessful selectorecombinative events in XCS. Concluding remarks complete the paper.
2
XCS Performance
Along with the introduction of XCS, Wilson [12] illustrated the success of XCS both in single-step problems and multi-step problems with results from experiments on Multiplexer and Woods2 environments respectively. Later Wilson showed the successful performance of XCS in a standard data mining task [13]. However, the results obtained in several other test environments were not very encouraging. An interesting comparison of the performance of XCS with other LCS models in various Maze environments can be found on the web page: http://www.ai.tsi.lv/ga/lcs performance.html Although considerably more tractable than Holland’s original LCS model, XCS is still a complex system. There have been several attempts recently to understand XCS mechanisms and their interactions. The studies by Kovacs [8] established that XCS would be able to accurately build an optimal representation of the payoff function of a problem. Kovacs also illustrated how XCS overcomes the problem of strong over-generals encountered in strength based LCSs without fitness sharing [9]. A fundamental problem difficulty dimension was characterized by Kovacs’ optimal rule set size measure [O] [10]. Recently, Butz et al. [1,2] have given a more analytical treatment of XCS mechanisms, in which further dimensions of problem complexity were identified.
3
Accuracy Revisited
The most important and distinctive feature of XCS is its fitness definition. In the traditional LCS approach, the payoff prediction of the classifier, sometimes combined with condition specificity, is directly used as the fitness measure. In XCS, the accuracy with which a classifier predicts payoff is used as a measure of its fitness. The actual procedure of fitness calculation involves several steps [3]. Every time a classifier takes part in an action set, first its prediction parameter is updated using the immediate payoff received, according to the standard Widrow-Hoff delta rule. Next, the estimate of the absolute prediction error is updated using the current error (the absolute difference between the immediate payoff and the current prediction estimate), again by the Widrow-Hoff delta rule. This error estimate is then used to calculate the accuracy of the classifier via a power function on the error. This ensures that classifiers with an error below a threshold value have accuracy one and that the accuracy falls steeply as the error increases. Next, the accuracy values (all between zero and one) of all the classifiers in the action set are used to calculate a relative accuracy value for each classifier in the action set. These relative accuracy values are then used to update the fitness of the classifiers, once again using the Widrow-Hoff delta rule. The procedure described above shows that the estimated absolute error plays a crucial role in determining the accuracy of a classifier. To understand its full significance, we should know what the estimates of prediction and prediction error mean to a classifier. A classifier represents a set of environmental states along with a possible action. Thus, a classifier in fact represents a set of state-action pairs for the given problem. Each state-action pair results in a particular reward (state-action value).
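The steps above can be sketched as follows. This is a simplified illustration with the usual XCS parameter names and typical values; the full algorithm, including the MAM technique and numerosity-weighted relative accuracies, is given in [3]:

```python
BETA, ALPHA, EPS_0, NU = 0.2, 0.1, 10.0, 5.0  # typical values; EPS_0 assumed

def accuracy(eps):
    # Power function on the error: accuracy is 1 below the threshold EPS_0
    # and falls steeply as the error grows beyond it.
    return 1.0 if eps < EPS_0 else ALPHA * (eps / EPS_0) ** (-NU)

def update_action_set(action_set, payoff):
    # Widrow-Hoff updates of prediction error and prediction, then the
    # relative accuracies, then the Widrow-Hoff fitness update.
    for cl in action_set:
        cl['eps'] += BETA * (abs(payoff - cl['p']) - cl['eps'])
        cl['p'] += BETA * (payoff - cl['p'])
    kappa = [accuracy(cl['eps']) for cl in action_set]
    total = sum(kappa)
    for cl, k in zip(action_set, kappa):
        cl['F'] += BETA * (k / total - cl['F'])
```

A classifier that consistently receives the same payoff sees its prediction converge to that payoff, its error shrink towards zero, and its fitness rise towards its share of the set's relative accuracy.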
Hence, as a classifier takes part in many action sets, the update procedure described above drives the prediction parameter of the classifier towards the average of the state-action values of the state-action pairs covered by the classifier. Similarly, the update procedure on the prediction error results in an estimate of the mean absolute deviation (MAD) of the covered state-action values. Thus, the error estimate of a classifier in XCS is in fact an estimate of the MAD of the state-action values associated with that classifier. We call this value simply the MAD of the classifier. The above definition of error can be used to calculate the error of a classifier directly and hence determine its accuracy. For example, let ps denote the probability of correct classification with a uniform reward of r for a correct classification and zero reward otherwise. Then the prediction P of a classifier may be written as

P = r·ps + 0·(1 − ps) = r·ps    (1)

Similarly, the error, which is the MAD of the classifier, may be written as

ε = |r − P|·ps + |0 − P|·(1 − ps) = 2r(ps − ps²)    (2)

(see also [1]). We use these expressions to determine the reward prediction and reward prediction error of classifiers in the following sections. Figure 1 illustrates
Fig. 1. Prediction-error dependence on correctness of classifier in a 1000/0 reward scheme (resulting error estimate plotted against the probability of correct classification)
Equation 2. It can be seen that the more consistently a classifier classifies correctly (or incorrectly), the smaller its prediction error and thus the larger its accuracy will be.
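Equation 2 can be evaluated directly; for the 1000/0 scheme the error vanishes at ps = 0 and ps = 1 and peaks at ps = 0.5, reproducing the curve in Fig. 1:

```python
def mad(ps, r=1000.0):
    """Equation 2: the MAD of a classifier that classifies correctly with
    probability ps, under a reward of r for correct and 0 otherwise."""
    return 2.0 * r * (ps - ps * ps)

# mad(0.0) -> 0.0, mad(1.0) -> 0.0 (consistently wrong or right: no error)
# mad(0.5) -> 500.0 (maximally inconsistent: largest error)
```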
4
Accuracy Guidance for Successful Recombination
Two fundamental processes are associated with the working of every LCS: rule discovery and rule evaluation, which are executed by the GA and the credit allocation components, respectively. In XCS, these two processes work towards the objective of building a complete, accurate, and maximally general payoff map. That is, a problem representation is formed that is able to predict the resulting payoff of an action, or classification, for any possible problem instance. Important changes in XCS are the accuracy-based fitness approach and GA selection, which is applied in action-set niches rather than panmictically in the whole population as in traditional LCSs. Deletion, however, is done panmictically in the whole population. This results in an inherent generalization pressure, as hypothesized in [12] and formulated in [2]. The question is now how the GA evolves accurate classifiers in XCS. The GA used in XCS is basically selectorecombinative. This means that the driving force of evolution towards better individuals is the dual process of selection and recombination with a continuous influence of mutation. From this point of view, accurate classifiers evolve from selection, mutation, and recombination of good partially accurate solutions. Especially successful recombinations of different partially accurate sub-parts, or building blocks, should lead to higher accuracy.

4.1
Accuracy Guidance Required
Let’s define an accurate, maximally general classifier as a completely successful classifier and the corresponding generalization a completely successful generalization. A perfect classifier has maximum accuracy so that any further generaliza-
1910
K.K. Tharakunnel, M.V. Butz, D.E. Goldberg
tion of the classifier condition will result in a reduction in accuracy. Such a classifier can be characterized by the specified positions in the condition part since all other positions are filled with don’t care symbols. Let us say this classifier condition consists of three specified positions a, b and c. Based on the discussion above, this classifier can be generated through GA recombination or mutation. Hence we can reasonably assume that the specific pattern of abc can result only from recombinations or mutations of classifiers with the partial patterns a, b and c in them (since a direct generation of abc by mutation is highly unlikely). We term the classifiers that specify a subset of positions abc partially successful classifiers and the corresponding generalizations partially successful generalizations. Note that there might be more than one possible completely successful generalization and different completely successful classifiers might overlap. This issue, however, needs to be addressed separately from the concern in this paper. The above argument leads to the hypothesis that successful generalization can be possible only if the following two conditions are satisfied. (1) Sustenance of partially successful classifiers in the population; (2) Partially successful classifiers whose conditions are more similar to completely successful classifiers have higher accuracy. The first condition says that the partially successful classifier should have some minimum accuracy (fitness) so that it will not get deleted from the population. To a great extent this is dependent on the parameter setting of XCS as well as on the problem properties. The second condition says that classifiers with conditions closer to a completely successful classifier should have higher fitness so that a proper recombination/mutation can lead to the generation of completely successful classifiers. 
As we investigate in the rest of this paper, this accuracy guidance is essential for successful evolutionary learning in XCS. Note that we focus on when a successful recombination, i.e. a recombination that generates offspring that is more similar to a perfect classifier in its condition part than its parents, actually also results in higher accuracy. In this paper, we are not interested in the matter of a good crossover operator or even in the creation of a competent crossover operator (such as in the field of probabilistic model building GAs (PMBGAs) [11]), but in the satisfaction of the preconditions for such successful recombination.

4.2
Accuracy Guidance in the Six Multiplexer
To illustrate accuracy guidance further, we take the Boolean multiplexer problem as an example. Multiplexer problems are widely used test problems in LCS research. In a multiplexer problem, the agent tries to predict the output of a multiplexer function. A multiplexer function is a function defined on binary strings of length k + 2^k. The first k bits reference the output bit among the 2^k remaining bits. In a 3 multiplexer problem, for example, a zero in the first bit would determine the output to be the second bit, while a one would reference the third bit. In the 6 multiplexer problem the input strings are of length six and the first two bits determine the output bit position. According to our definition of a completely successful classifier, it can easily be observed that one of the completely
Fig. 2. Progress of successful generalization in 6 multiplexer problem
successful classifiers in a 6 multiplexer problem is 01#1## → 1. Now consider that the condition part of this classifier consists of three consecutive patterns a, b and c involving the three specified bits respectively. Then a partially successful classifier with pattern a is 0##### → 1. Similarly, partially successful classifiers with patterns b and c are #1#### → 1 and ###1## → 1 respectively. Here we have considered the most general version of the partially successful classifiers, though we could consider partially successful classifiers with more specific bits with the same effect. Now let us look at what happens to the accuracy when the partially successful classifiers combine. Figure 2 illustrates the progressive recombination of partially successful classifiers A, B and C resulting in the completely successful classifier ABC. A 1000/0 payoff scheme is used in this example and the accuracy of the classifiers is measured as MAD. It can be seen that the MAD diminishes when partially successful classifiers recombine. Eventually, the completely successful classifier with MAD 0 will be generated. Here condition (2) of our hypothesis holds, in that a successful recombination of partially successful classifiers also results in a decrease in MAD and thus in an increase in accuracy. This explains how XCS with the accuracy-based fitness is able to successfully solve the 6 multiplexer problem. What happens when condition (2) is not satisfied? This is illustrated in the next section using what we call a concatenated multiplexer problem.

4.3
Concatenated Multiplexer Problems
We construct concatenated multiplexer problems by combining more than one multiplexer problem. In a concatenated multiplexer problem, the agent tries to predict the output of a concatenated multiplexer function. A concatenated multiplexer function is the function obtained by concatenating the input parts and the output parts of two or more individual multiplexer functions of the same type. The output will be the combination of the outputs determined by the individual multiplexer functions. For example, Table 1 shows a sample of inputs and outputs of a concatenated multiplexer function made of three 3-multiplexer
Table 1. A sample input/output mapping of the 3X3 multiplexer problem

Input          Output
000 000 000    000
010 100 111    101
110 101 001    010
111 110 001    100
111 111 111    111
Table 2. In the 3X3 multiplexer problem the necessary classifiers for an accurate and correct classification need to be much more specific than those for an accurate but wrong classification (classifiers that always suggest a wrong classification)

Correct Classifiers         Wrong Classifiers
Condition      Action       Condition      Action
00# 00# 00#    000          ### ### 01#    000
00# 00# 1#0    000          ### ### 1#1    000
00# 1#0 00#    000          ### 01# ###    000
00# 1#0 1#0    000          ### 1#1 ###    000
1#0 00# 00#    000          01# ### ###    000
1#0 00# 1#0    000          1#1 ### ###    000
1#0 1#0 00#    000          ### ### 00#    001
1#0 1#0 1#0    000          ...            ...
00# 00# 01#    001
...            ...
functions. We call this concatenated problem a 3X3 multiplexer problem. In general, an mXn multiplexer problem is a combination of m problems of type n. We use concatenated multiplexer problems to illustrate the hypothesis of Sect. 4.1 because we are sure of the completely successful generalizations existing in these problems. For example, in the 3X3 problem, we know the completely successful generalizations for the individual three multiplexer problems, and so the completely successful generalizations for the concatenated problem can easily be constructed. A part of the optimal classifier population [O] for the 3X3 problem is shown in Table 2. Note that for the wrong classifiers the wrong output of one component is sufficient for the entire problem to have a wrong output, and so a greater level of generalization is possible. The size of the optimal population |[O]| in an mXn problem, where the n-multiplexer has np position bits, can be calculated by

|[O]| = 2^m·(2^np)^m + 2^m·m·2^np = 2^(m(1+np)) + m·2^(m+np)

adding the numbers of correct and wrong classifiers together. In the 3X3 problem, |[O]| = 64 + 48 = 112. It is surprising that although an optimal population exists for the 3X3 problem, XCS fails to evolve it. We show that this is because condition (2) of our hypothesis in Sect. 4.1 is not satisfied in this problem.
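The count can be checked directly. The sketch below follows the formula as reconstructed above (the equation in the original scan was garbled, so the exact grouping of terms is an assumption, verified against the stated total of 112):

```python
def optimal_pop_size(m, n_pos):
    """|[O]| for an mXn problem whose n-multiplexer has n_pos position bits.
    Correct classifiers: 2^m actions, each with (2^n_pos)^m combinations of
    maximally general component conditions. Wrong classifiers: 2^m actions,
    m components, and 2^n_pos wrong-forcing conditions per component."""
    correct = 2 ** m * (2 ** n_pos) ** m
    wrong = 2 ** m * m * 2 ** n_pos
    return correct + wrong

# 3X3 problem (m = 3, one position bit per 3-multiplexer): 64 + 48 = 112.
```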
Fig. 3. XCS performance in the 3X3 multiplexer problem: (a) 1000/0 reward, (b) layered reward (performance plotted against explore problems)
Before we proceed further with concatenated multiplexer problems, we need to decide on the reward function to be used in these problems. We can use the usual 1000/0 reward structure, which means that a correct output from all the component multiplexer problems results in a reward of 1000, but if one of them is wrong then the reward is 0. Another possibility is to use a layered reward structure, i.e. a progressive reward of 1000 for each component multiplexer giving the correct output. This results in a reward of 1000c for the concatenated problem, where c is the number of components correctly classified by the specified action. We use the 1000/0 reward function here. Later we will show that XCS performance is worse with layered reward, as the first condition (sustenance of partially successful classifiers) is violated as well in that case. Next, we shall see why XCS is not able to perform well in the 3X3 problem. Remember that the three multiplexer is a rather trivial problem for XCS. There are only two actions and XCS quickly finds the accurate and optimal classifier population. Now, in the 3X3 problem, there are eight actions, two from each component problem. The XCS performance on the 3X3 problem with the 1000/0 reward structure is shown in Fig. 3(a). (All the performance results shown in this paper are averages of 10 runs with parameters set as follows: N = 2000, β = 0.2, α = 0.1, ε0 = 0.01, ν = 5, θGA = 25, χ = 0.8, µ = 0.04, θdel = 20, δ = 0.1, θsub = 20, P# = 0.33, pexplr = 1, pI = 10, εI = 0, and FI = 0.01 [3].) It can be seen that XCS fails to reach 100% performance even after 25,000 steps. To see why XCS is not able to perform well in this problem, let us look at the completely successful classifiers for this problem. For example, 01# 01# 01# → 111 is an accurate and optimal generalization (a completely successful classifier) for the problem (the spaces between the condition parts of the component multiplexers are for clarity).
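The 3X3 function and the claim about 01# 01# 01# → 111 can be verified by exhaustive enumeration; a small sketch (names illustrative):

```python
from itertools import product

def mux3(bits):
    # 3-multiplexer: the first bit selects between the remaining two.
    return bits[1 + bits[0]]

def mux3x3(bits):
    # Concatenated output: one 3-multiplexer output per 3-bit component.
    return tuple(mux3(bits[i:i + 3]) for i in (0, 3, 6))

def matches(cond, bits):
    return all(c == '#' or int(c) == b for c, b in zip(cond, bits))

# 01# 01# 01# -> 111 is correct on every input it matches.
cond, action = '01#01#01#', (1, 1, 1)
assert all(mux3x3(b) == action
           for b in product((0, 1), repeat=9) if matches(cond, b))
```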
One possible way for this completely successful classifier to form in the population is by the recombination of the partially successful classifiers:
1914
K.K. Tharakunnel, M.V. Butz, D.E. Goldberg
Fig. 4. Necessary learning progress in 3X3 multiplexer problem
01# ### ### → 111, ### 01# ### → 111 and ### ### 01# → 111. We have shown only the most general case of partially successful classifiers, though in reality more specific forms are likely. However, this does not change the results we are going to illustrate. Also, remember that the GA occurs only in the action set specified by the action 111, and these recombinations happen progressively. In Fig. 4, a possible progressive formation of the completely successful classifier from partially successful classifiers is shown. The accuracy of a classifier is expressed as its MAD. Note that the MAD of the completely successful classifier is 0. In particular, let us look at the exact derivation of those numbers. Classifiers cl1 = 01# ### ### → 111 and cl2 = ### 01# ### → 111 classify a random input string correctly with a probability of 0.25, since each of the other two components independently has a 50/50 chance of being correct. Thus, the reward predictions of these classifiers will be approximately 0.25 · 1000 = 250 and consequently their reward prediction errors (MAD) will approximate 375 (see equations 1 and 2). The more specific classifier cl3 = 01# 01# ### → 111 is much closer to the desired completely successful classifier cl4 = 01# 01# 01# → 111. Thus, cl3 should have a higher fitness than cl1 and cl2, since a less complex recombinatory event or the mutation of the two appropriate attributes can actually lead to the generation of classifier cl4. Classifier cl3 has a probability of 0.5 of performing the correct classification. Thus, both its reward prediction and its MAD will approximate 500. With its higher MAD, classifier cl3 will actually have a smaller accuracy than classifiers cl1 and cl2. Thus, cl3 will encounter fewer recombinatory events and is consequently highly likely to be deleted soon.
Moreover, in order to get to classifiers cl1 and cl2 in the first place, those classifiers must have a lower MAD than, for example, the completely general classifier (the classifier with only don't-care symbols in the condition part). However, the probability that the completely general classifier classifies a random input correctly is 0.125, which results in a MAD of 218.75, so that the completely general classifier has an even higher accuracy than cl1 or cl2. Thus, besides the inherent generalization pressure in XCS [12,2], in this problem fitness also pushes towards generality instead of towards
Towards Building Block Propagation in XCS: A Negative Result
1915
specificity, so that it is highly unlikely that XCS will ever evolve an accurate representation of the problem. Indeed, looking at the evolving population in more detail revealed that none of the completely successful classifiers are present in the population. It may seem surprising, though, that XCS still reaches a performance of near 100% as displayed in Fig. 3(a). A more detailed analysis revealed that XCS is actually learning accurately the classifiers that specify incorrect classifications (shown in the right column of Table 2). Thus, given a current match set, XCS eventually 'knows' that seven of the eight possible classifications will be incorrect and thus chooses the correct classification most of the time. In Fig. 3(b), we show the performance of XCS in the 3X3 multiplexer problem when the reward is 1000c, where c is the number of components correctly classified by the specified action. It can be observed that performance is worse with this layered reward structure. This is because, with layered reward, the difference between the MAD of partially successful classifiers resulting from recombination/mutation and the MAD of their parent classifiers is even larger. This leaves newly generated partially successful classifiers a smaller chance of surviving in the population and causing subsequent progress towards successful generalizations. This shows that layered reward is not necessarily helpful. The concatenated multiplexer problem is essentially a problem in which the base probability of classifying an input correctly is below 0.5. Looking back at Fig. 1, we see that XCS then needs to cope with the problem that more correct classifiers (those closer to a correctness probability of 1.0 and thus initially closer to 0.5) are actually less accurate under the MAD-based accuracy.
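The prediction and MAD values used in this argument follow directly from the 1000/0 reward: a classifier that is correct with probability p converges to a prediction of p · 1000, and its MAD is p · (1000 − p · 1000) + (1 − p) · (p · 1000). A quick check of the figures above (our own sketch, not code from the paper):

```python
def prediction_and_mad(p, reward=1000):
    """Converged reward prediction and MAD of a classifier that is
    correct with probability p under an all-or-nothing reward."""
    pred = p * reward
    mad = p * abs(reward - pred) + (1 - p) * abs(0 - pred)
    return pred, mad

for label, p in [("cl1/cl2", 0.25), ("cl3", 0.5), ("general", 0.125)]:
    print(label, prediction_and_mad(p))
# cl1/cl2 (250.0, 375.0), cl3 (500.0, 500.0), general (125.0, 218.75)
```

The general classifier's MAD (218.75) is indeed below that of cl1 and cl2 (375), which is below that of cl3 (500): accuracy ranks the classifiers in exactly the wrong order.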
Moreover, the problem is not symmetrical in that an accurate, maximally general classifier that predicts payoff 0 does not have the same structure as an accurate, maximally general classifier that predicts payoff 1000. If there are other problem properties that can cause the same problem or if these properties are typical in interesting problems is a subject of future research. Note that our analysis made several simplifying assumptions. First, we assumed that reward prediction and reward prediction error of a classifier immediately take on the average values. However, the Widrow-Hoff delta rule allows only an approximation of these values with an accuracy dependent on the update rate β. Second, since classifier fitness in XCS is based on the set-relative accuracy of a classifier, overlapping classifiers (classifiers that match sometimes both and sometimes either one of them) can alter fitness determination which might cause additional problems. Third, we studied XCS in one-step, or classification, problems. In multi-step problems the additional problem of reinforcement propagation arises which can cause over-general classifiers to propagate inappropriate reinforcement values. Thus, in multi-step problems, the determination of current MAD is even harder.
5
Conclusion
This paper investigated the accuracy-based fitness approach in XCS and its implications for a successful selectorecombinative evolutionary process. Accuracy
of a classifier was derived from its mean absolute deviation (MAD), which is approximated by its reward prediction error estimate. Two necessary conditions were identified for a successful evolutionary process: (1) sustenance of partially successful classifiers; (2) classifiers closer to completely successful ones need to be more accurate. Violating either of these conditions prevents the evolutionary process from evolving an accurate problem representation, since there are no classifiers to propagate and/or the propagation leads in the wrong direction. The concatenated multiplexer problem was introduced as an example problem that violates the second condition. Consequently, XCS was not able to evolve the desired complete, accurate, and maximally general payoff map of the problem. As the concatenated multiplexer problem showed, there are problems that are inherently accuracy-misleading for the current accuracy-based fitness approach in XCS. Thus, a modification of the accuracy definition appears necessary to enable successful learning. Since accuracy is derived from the reward prediction error, which approximates the MAD, it appears that the MAD itself is not an appropriate basis for XCS's accuracy measure. Thus, it is necessary to redefine the reward prediction error in XCS. One approach would be to change the error to an upper and lower error, or a two-sided error measure. Future research needs to investigate the various possibilities for modifying the reward prediction error in XCS. Although a sufficiently large population size with a high mutation rate might be able to create and maintain completely successful classifiers in our investigated concatenated multiplexer problem, mutation cannot remedy the problem in general since it does not scale up. Thus, accuracy guidance needs to be exploited for successful selectorecombinative events.
Once accuracy guidance is available in a problem, the next step is to investigate and develop competent crossover operators that can further speed up and scale up evolutionary search in XCS.
Acknowledgments. We are grateful to Xavier Llora for the useful discussions. This work was sponsored by the Air Force Office of Scientific Research, Air Force Material Command, USAF, under grant F49620-00-0163. Research funding for this work was also provided by the National Science Foundation under grant DMI-9908252. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Air Force Office of Scientific Research, the National Science Foundation, or the U.S. Government.
References
1. Butz, M.V., Kovacs, T., Lanzi, P.L., Wilson, S.W.: How XCS evolves accurate classifiers. Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2001) (2001) 927–934
2. Butz, M.V., Pelikan, M.: Analyzing the evolutionary pressures in XCS. Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2001) (2001) 935–942
3. Butz, M.V., Wilson, S.W.: An algorithmic description of XCS. In Lanzi, P.L., Stolzmann, W., Wilson, S.W., eds.: Advances in Learning Classifier Systems: Third International Workshop, IWLCS 2000, Berlin Heidelberg, Springer-Verlag (2001) 253–272
4. Goldberg, D.E.: Genetic algorithms in search, optimization and machine learning. Addison-Wesley, Reading, MA (1989)
5. Goldberg, D.E.: The design of innovation: Lessons from and for competent genetic algorithms. Kluwer Academic Publishers, Boston, MA (2002)
6. Holland, J.H.: Adaptation in natural and artificial systems. University of Michigan Press, Ann Arbor, MI (1975); second edition 1992.
7. Holland, J.H.: Adaptation. In Rosen, R., Snell, F., eds.: Progress in Theoretical Biology. Volume 4., New York, Academic Press (1976) 263–293
8. Kovacs, T.: XCS classifier system reliably evolves accurate, complete, and minimal representations for boolean functions. In Roy, Chawdhry, Pant, eds.: Soft computing in engineering design and manufacturing. Springer-Verlag, London (1997) 59–68
9. Kovacs, T.: Towards a theory of strong overgeneral classifiers. Foundations of Genetic Algorithms 6 (2001)
10. Kovacs, T., Kerber, M.: What makes a problem hard for XCS? In Lanzi, P.L., Stolzmann, W., Wilson, S.W., eds.: Advances in Learning Classifier Systems: Third International Workshop, IWLCS 2000. Springer-Verlag, Berlin Heidelberg (2001) 80–99
11. Pelikan, M., Goldberg, D.E., Lobo, F.: A survey on optimization by building and using probabilistic models. Computational Optimization and Applications 21 (2002) 5–20
12. Wilson, S.W.: Classifier fitness based on accuracy. Evolutionary Computation 3 (1995) 149–175
13. Wilson, S.W.: Mining oblique data with XCS. In Lanzi, P.L., Stolzmann, W., Wilson, S.W., eds.: Advances in Learning Classifier Systems: Third International Workshop, IWLCS 2000. Springer-Verlag, Berlin Heidelberg (2001) 158–174
Data Classification Using Genetic Parallel Programming
Sin Man Cheang, Kin Hong Lee, and Kwong Sak Leung
Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong
{smcheang,khlee,ksleung}@cse.cuhk.edu.hk
Abstract. A novel Linear Genetic Programming (LGP) paradigm called Genetic Parallel Programming (GPP) has been proposed to evolve parallel programs based on a Multi-ALU Processor. It is found that GPP can evolve parallel programs for Data Classification problems. In this paper, five binary-class UCI Machine Learning Repository databases are used to test the effectiveness of the proposed GPP-classifier. The main advantages of employing GPP for data classification are: 1) speeding up the evolutionary process through parallel hardware fitness evaluation; and 2) discovering parallel algorithms automatically. Experimental results show that the GPP-classifier evolves simple classification programs with good generalization performance. The accuracies of these evolved classifiers are comparable to those of other existing classification algorithms.
Data Classification is a supervised learning process that learns a classifier from a training set. The learned classifier can be used to classify unseen data records. Lim et al. have performed a sophisticated study of 16 UCI Machine Learning Repository databases with 33 different data classification algorithms [1]. Their experimental results are used for comparison with the proposed GPP-classifier. A novel LGP paradigm, Genetic Parallel Programming (GPP) [2,3], is employed to learn data classifiers. In GPP, individual programs are represented as a sequence of parallel instructions. Each parallel instruction consists of multiple subinstructions that perform multiple operations simultaneously in each processor clock cycle. A parallel program is executed on a specially designed Multi-ALU Processor (MAP). The main purpose of this paper is to demonstrate that GPP can evolve data classifiers to solve real-world data classification problems. Experimental results show that GPP can evolve binary-class data classifiers with generalization accuracy comparable to the 33 existing data classification methods presented in [1]. We adopt the 10-fold cross-validation method to estimate the classification error rate (CE) of the GPP-classifier: 10 training sets are used to learn 10 classifiers, each of which is tested on its corresponding test set, and the 10 test set CEs are averaged to estimate the generalized CE. We measure both the classification accuracy and the generalization performance; a well-generalized classifier gives similar levels of performance on the training and test sets. Furthermore, the GPP-classifier adopts three techniques to avoid overtraining: 1) limiting the size of genetic programs; 2) penalizing over-trained individual programs; and 3) monitoring generalization performance over the evolution. All experiments have been run on a software GPP-classifier system. It produces a parallel assembly program together with a corresponding serialized C code segment.
E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 1918–1919, 2003. © Springer-Verlag Berlin Heidelberg 2003
Table 1 below shows the best, average, and standard deviation (stddev) of the training set CE and test set CE over 10 independent runs (10-fold cross-validation on each run). Table 1. Training set CE and test set CE of the GPP-classifier
          training set CE (%)         test set CE (%)
          best   average  stddev      best   average  stddev     %∆CE
bcw        2.7     2.9     0.09        3.5     3.9     0.29      25.6%
bld       27.3    28.0     0.69       29.3    31.7     1.74      11.7%
pid       22.5    22.7     0.11       23.7    24.5     0.42       7.3%
hea       14.4    14.8     0.24       16.0    18.9     1.78      21.7%
vot        3.9     4.1     0.10        4.1     4.6     0.23      10.8%
average                                                          15.4%
In Table 1 above, the last column shows the percentage difference (%∆CE) between the average training set CE and test set CE. The average %∆CE over the five databases is 15.4%. It is shown that GPP can learn parallel programs to solve real-world data classification problems. Experimental results show that GPP is able to learn human-understandable classifiers with generalization performance comparable to other classification algorithms. Even without tailoring the GPP configuration to individual problems, good quality classifiers are evolved. The generalization performance of the GPP-classifier is higher than the average of the 33 benchmark algorithms in [1]. This shows that the GPP-classifier has the power to learn very simple but accurate classifiers with a suitable overtraining control strategy. Besides, the GPP-classifier automatically determines the structure of the solution program without prior knowledge of the databases. Even though the results show that the classification programs evolved by GPP have generalization performance comparable to other classification algorithms, further improvements can be made. In spite of adopting overtraining control strategies, the GPP-classifier still suffers from some overtraining, i.e. the test set CEs are higher than the training set CEs in Table 1 above. In order to obtain good generalization performance, we shall work out an appropriate terminating condition to detect overtraining.
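The %∆CE column is the relative gap between the average test and training error; a short sketch (ours, not from the paper) reproduces it from the averages in Table 1. Small differences in the last digit are possible, since the published column was presumably computed from unrounded CE values:

```python
avg_ce = {  # database: (average training set CE %, average test set CE %)
    "bcw": (2.9, 3.9), "bld": (28.0, 31.7), "pid": (22.7, 24.5),
    "hea": (14.8, 18.9), "vot": (4.1, 4.6),
}

# %dCE = 100 * (test - train) / test, per database
delta_ce = {db: 100.0 * (test - train) / test
            for db, (train, test) in avg_ce.items()}
for db, d in delta_ce.items():
    print(f"{db}: {d:.1f}%")
print(f"average: {sum(delta_ce.values()) / len(delta_ce):.1f}%")  # 15.4%
```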
References
1. Lim, T.S., Loh, W.Y., Shih, Y.S.: A Comparison of Prediction Accuracy, Complexity, and Training Time of Thirty-Three Old and New Classification Algorithms. Machine Learning Journal, Vol. 40, Kluwer Academic (2000) 203–229
2. Leung, K.S., Lee, K.H., Cheang, S.M.: Evolving Parallel Machine Programs for a Multi-ALU Processor. Proc. of IEEE Congress on Evolutionary Computation (2002) 1703–1708
3. Leung, K.S., Lee, K.H., Cheang, S.M.: Genetic Parallel Programming – Evolving Linear Machine Codes on a Multiple-ALU Processor. Proc. of International Conference on Intelligence in Engineering and Technology, Univ. Malaysia Sabah (2002) 207–213
Dynamic Strategies in a Real-Time Strategy Game
William Joseph Falke II and Peter Ross
School of Computing, Napier University, 10 Colinton Road, Edinburgh EH10 5DT, UK
[email protected], [email protected]
Abstract. Most modern real-time strategy computer games have a sophisticated but fixed 'AI' component that controls the computer's actions. Once the user has learned how such a game will react, the game quickly loses its appeal. This paper describes an example of how a learning classifier system (based on Wilson's ZCS [1]) can be used to equip the computer with dynamically changing strategies that respond to the user's strategies, thus greatly extending the game's playability for serious gamers.
1
The Game and the Classifier System
Real-time strategy (RTS) games typically involve fighting a battle against a computer opponent. This opponent typically has a very sophisticated but essentially static strategy, and once the user has learned how it behaves, the game loses its playability. To get round this limitation we have used a learning classifier system (LCS) to provide the computer with the capability to dynamically change its strategy. LCSs are sometimes criticized for being slow to learn. In this example, the LCS has short conditions so as to fit in the real-time loop; success also depends on a careful choice of the types of actions. The game we implemented, using the publicly available version of the Auran Jet game engine, involves two opposing armies, each consisting of two squads of 20 soldiers, moving on hilly terrain. A squad defaults to moving in a 6-6-6-2 formation, but the soldiers are individually governed by a flocking algorithm [2], so squad shapes are very fluid in battles. The essence of the game play is to jockey for position, because it pays for a squad to attack the flank of the enemy, or for two squads to attack a third if its companion squad is too far away to offer support quickly enough. Figure 1 illustrates the idea. The rectangles are identifying battle-flags. The large diamond is the cursor that the user employs to command his squads, by placing the cursor somewhere and then left- or right-clicking to summon a particular squad towards that location. The classifier system sends commands to the computer's squads at randomly chosen times within a two- to four-second interval (experimentally determined to offer good playability). The condition part has 14 bits, as follows; the values depend on the squad to be commanded. For enemy squad 1, two bits describe how far away it is; one bit indicates whether that squad is engaged in a fight;
E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 1920–1921, 2003. © Springer-Verlag Berlin Heidelberg 2003
Fig. 1. Two examples of attacks
one bit indicates whether that squad is stronger than this one; and one bit indicates whether this squad is positionally flanking that one. Another five bits give the same data about enemy squad 2. Bits 11 and 12 give the proximity of the friendly squad, and bits 13 and 14 show whether this squad and its friend are currently engaged in battle. A battle consists of a set of soldier-to-soldier fights running in parallel, each ending with the death of one combatant. A key to the success of the system is the fairly general nature of the possible actions [3]: pursue a named enemy squad, flank a named enemy squad, flee from a named enemy squad, follow the friendly squad. Unlike in the original LCS, the reward system is invoked immediately after the environment string is built and evaluates the squad's previous action. The reward is 1000 if that action caused more kills than deaths for the squad, and 0 otherwise. In experiments it took around 750 computer-vs-computer battles, at 15 seconds per battle on a 750 MHz PC, to generate a strong opponent for a human. Human testers then found the computer opponent to be both hard to beat and hard to model, as it continued learning. This demonstrates by example that an LCS, kept suitably simple and with suitably general actions, can be the basis of an effective dynamic strategy in a real-time strategy game. The frame rate of 40-50 fps is good. Moreover, acquired strategies can be saved and exchanged with other users.
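The 14-bit environment string might be assembled as in the sketch below; the field names, the thresholded distance encoding, and the exact bit ordering are our own illustrative assumptions, since the paper gives only a verbal description:

```python
def build_condition(squad, enemy1, enemy2, friend):
    """Encode the 14-bit environment string for one squad.
    Each argument is a dict of pre-computed values; 'distance' is a
    2-bit level (0-3), the other fields are booleans."""
    bits = []
    for enemy in (enemy1, enemy2):                # 5 bits per enemy squad
        bits.append(f"{enemy['distance']:02b}")   # how far away (2 bits)
        bits.append(str(int(enemy["fighting"])))  # engaged in a fight?
        bits.append(str(int(enemy["stronger"])))  # stronger than this squad?
        bits.append(str(int(enemy["flanked"])))   # are we flanking it?
    bits.append(f"{friend['distance']:02b}")      # bits 11-12: friendly proximity
    bits.append(str(int(squad["fighting"])))      # bit 13: this squad in battle?
    bits.append(str(int(friend["fighting"])))     # bit 14: friend in battle?
    return "".join(bits)

cond = build_condition(
    squad={"fighting": True},
    enemy1={"distance": 2, "fighting": False, "stronger": True, "flanked": False},
    enemy2={"distance": 0, "fighting": True, "stronger": False, "flanked": True},
    friend={"distance": 1, "fighting": False},
)
print(cond, len(cond))  # a 14-character bit string
```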
References
1. Wilson, S.: ZCS: a zeroth level classifier system. Evolutionary Computation 2 (1994) 1–18
2. Reynolds, C.: Flocks, herds, and schools: A distributed behavioural model. Computer Graphics 21 (1987) 25–34
3. Smith, R., Dike, B., Ravichandran, B., El-Fallah, A., Mehra, R.: The fighter aircraft LCS: a case of different LCS goals and techniques. In Lanzi, P.L., Stolzmann, W., Wilson, S., eds.: Learning Classifier Systems, From Foundations to Applications. Number 1813 in LNCS. Springer-Verlag (2000) 283–300
Using Raw Accuracy to Estimate Classifier Fitness in XCS
Pier Luca Lanzi
Artificial Intelligence and Robotics Laboratory, Dipartimento di Elettronica e Informazione, Politecnico di Milano
[email protected]
In XCS, classifier fitness is based on the relative accuracy of the classifier prediction [3]. A classifier is more fit if its prediction of the expected payoff is more accurate than the predictions given by the other classifiers that apply in the same situations. The use of relative accuracy has two major implications. First, because the evaluation of fitness is based on the relevance that classifiers have in some situations, classifiers that are the only ones applying in a certain situation have a high fitness even if they are inaccurate. As a consequence, inaccurate classifiers might be able to reproduce and thus cause reduced performance, as already noted by Wilson (personal communication reported in [1]). In addition, because the computation of classifier fitness is based both (i) on the classifier accuracy and (ii) on the classifier relevance in the situations in which it applies, in XCS classifier fitness does not provide information about the problem solution, but rather an indication of the classifier's relevance in the encountered situations. Accordingly, it is generally not possible to tell whether a classifier with a high fitness is accurate just by looking at its fitness. To have this kind of information, we need the prediction error ε, which provides an indication of the raw classifier accuracy. Relative Accuracy. In XCS, a classifier cl consists of a condition, an action, and four main parameters: (i) the prediction cl.p, which estimates the payoff that the system expects when cl is applied; (ii) the prediction error cl.ε, which estimates the error of the prediction cl.p; (iii) the fitness cl.F, which estimates the accuracy of the payoff prediction given by cl.p; and finally (iv) the numerosity cl.num, which indicates how many copies of classifiers with the same condition and the same action are present in the population. In XCS the update of classifier fitness consists of three steps.
First, for all the classifiers cl in [A], the raw accuracy cl.κ is computed: cl.κ is 1 if cl.ε ≤ ε0, and α(cl.ε/ε0)^−ν otherwise; α is usually 0.1 and ν is usually 5. A classifier is considered accurate if its prediction error cl.ε is smaller than the threshold ε0; an accurate classifier has a raw accuracy cl.κ equal to one. A classifier is considered inaccurate if its prediction error cl.ε is larger than ε0; the raw accuracy cl.κ of an inaccurate classifier is computed as the steeply descending power function α(cl.ε/ε0)^−ν. The raw accuracy cl.κ is then used to calculate the relative accuracy cl.κ′ = (cl.κ × cl.num) / Σ_{cl_i ∈ [A]} (cl_i.κ × cl_i.num). Finally, the relative accuracy κ′ is used to update the classifier fitness as: cl.F ← cl.F + β(cl.κ′ − cl.F).
E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 1922–1923, 2003. © Springer-Verlag Berlin Heidelberg 2003
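The three update steps can be written out compactly; this is a sketch of the standard XCS update described in [3], not code from this paper:

```python
from types import SimpleNamespace

def update_fitness(action_set, beta=0.2, alpha=0.1, eps0=0.01, nu=5):
    """One fitness update for every classifier cl in the action set [A]."""
    # Step 1: raw accuracy kappa from the prediction error
    for cl in action_set:
        cl.kappa = 1.0 if cl.eps <= eps0 else alpha * (cl.eps / eps0) ** -nu
    # Step 2: numerosity-weighted relative accuracy kappa'
    total = sum(cl.kappa * cl.num for cl in action_set)
    for cl in action_set:
        cl.kappa_rel = cl.kappa * cl.num / total
    # Step 3: Widrow-Hoff update of fitness towards the relative accuracy
    for cl in action_set:
        cl.F += beta * (cl.kappa_rel - cl.F)

# One accurate (eps <= eps0) and one inaccurate classifier:
a = SimpleNamespace(eps=0.005, num=2, F=0.1)
b = SimpleNamespace(eps=0.05, num=1, F=0.1)
update_fitness([a, b])
print(a.F, b.F)  # the accurate classifier's fitness rises, the other's falls
```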
Raw Accuracy. The idea behind raw accuracy is quite simple: instead of updating the classifier fitness cl.F with the relative accuracy cl.κ′, we update cl.F with the raw accuracy cl.κ. The fitness update becomes: cl.F ← cl.F + β(cl.κ − cl.F). Classifier fitness now conveys different information, since raw accuracy does not provide any knowledge about the classifier's relative accuracy in the encountered situations. With relative accuracy this information is obtained through the use of classifier numerosity, and it is intrinsically exploited in Wilson's XCS when fitness is used, i.e.: (i) when the system prediction is computed; and (ii) when offspring classifiers are selected from [A] with probability proportional to their fitness (see [3] for details). With raw accuracy, this information can be obtained by combining raw accuracy and numerosity either during the computation of the system prediction or during the selection of offspring classifiers. This gives rise to four different XCS versions, which differ in the way raw accuracy and numerosity are combined. The first model, XCSnn, implements the most basic approach possible: classifier fitness is updated using the raw accuracy κ, but classifier numerosity is never used. XCSnn lacks any information regarding the classifier's relevance in the environmental niche, which is instead available with relative accuracy. The second model, XCSnga, extends XCSnn by introducing numerosity for selecting offspring classifiers: fitness is updated using κ, and classifier numerosity is used during offspring selection, i.e., the probability of selecting a classifier cl is proportional to cl.F × cl.num. The third model, XCSne, uses numerosity both for computing the system prediction and for offspring selection.
The fourth model, XCSnu, does not update fitness at all; the raw accuracy κ is used directly as the measure of fitness. This is equivalent to setting cl.F ← cl.κ, which eliminates the parameter F; classifier fitness is thus computed directly from the prediction error. Discussion. These different versions of XCS have been compared in [2] with Wilson's XCS [3]. Lanzi [2] shows that in simple single-step problems, XCSnu learns faster than all the other XCS versions, producing also the most compact solutions (see [2] for details). In more complex problems, Wilson's XCS learns faster than XCSnu and XCSne, but it produces larger populations. Overall, the results reported in [2] suggest that raw accuracy might result in interesting performance, although Wilson's relative accuracy appears to provide the best trade-off.
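Schematically, the four variants differ only in where numerosity enters; the summary below is our own sketch, not code from [2]:

```python
from types import SimpleNamespace

def fitness_update_raw(cl, beta=0.2):
    """XCSnn / XCSnga / XCSne: fitness is updated towards the raw accuracy."""
    cl.F += beta * (cl.kappa - cl.F)

def fitness_nu(cl):
    """XCSnu: no update at all; the raw accuracy itself serves as fitness."""
    return cl.kappa

def selection_weight(cl, use_numerosity):
    """Offspring selection: XCSnga/XCSne weight fitness by numerosity,
    while XCSnn ignores numerosity entirely."""
    return cl.F * (cl.num if use_numerosity else 1)

cl = SimpleNamespace(kappa=0.5, F=0.1, num=3)
fitness_update_raw(cl)
print(cl.F)            # moved 20% of the way from 0.1 towards kappa = 0.5
print(fitness_nu(cl))  # 0.5
```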
References
1. Pier Luca Lanzi. An Analysis of Generalization in the XCS Classifier System. Evolutionary Computation, 7(2):125–149, 1999.
2. Pier Luca Lanzi. A comparison of relative accuracy and raw accuracy in XCS. Technical Report 2003.14, Dipartimento di Elettronica e Informazione, Politecnico di Milano, March 2003.
3. Stewart W. Wilson. Classifier Fitness Based on Accuracy. Evolutionary Computation, 3(2):149–175, 1995. http://prediction-dynamics.com/.
Towards Learning Classifier Systems for Continuous-Valued Online Environments
Christopher Stone and Larry Bull
Faculty of Computing, Engineering and Mathematical Sciences, University of the West of England, Bristol, BS16 1QY, United Kingdom
{christopher.stone,[email protected]}
Abstract. Previous work has studied the use of interval representations in XCS to allow its use in continuous-valued environments. Here we compare the speed of learning of continuous-valued versions of ZCS and XCS with a simple model of an online environment.
1
Introduction
Much current research is focussed on the asymptotic performance of Learning Classifier Systems (LCS). However, where the environment is ever changing, speed of learning and the ability of the system to recover from environmental change (i.e., the slope of the learning curve) may be a more relevant measure. We are interested in continuous-valued environments and so use an interval representation to replace the {0, 1, #} classifier predicate with one representing an interval [pi, qi) [1,5]. An interval is represented as an unordered tuple (pi, qi) and matches an environmental variable xi if pi ≤ xi < qi. Changes to the cover, subsumption and Genetic Algorithm (GA) operators must be made to ZCS [3] and XCS [4] to accommodate the interval representation. Action set subsumption is not used. We use single-point crossover between intervals for ZCS and two-point crossover between intervals for XCS. Both ZCS and XCS are run with an initially empty population [1,2].
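Because the tuple (pi, qi) is unordered, matching has to normalise the endpoints first; a minimal sketch of the matching predicate (our own illustration):

```python
def matches(condition, x):
    """condition: one unordered (pi, qi) tuple per input variable.
    An input vector matches if every xi lies in [min(pi,qi), max(pi,qi))."""
    return all(min(p, q) <= xi < max(p, q) for (p, q), xi in zip(condition, x))

cond = [(0.7, 0.2), (0.0, 1.0)]   # first interval stored "backwards"
print(matches(cond, [0.5, 0.3]))  # True: 0.2 <= 0.5 < 0.7
print(matches(cond, [0.7, 0.3]))  # False: the upper bound is exclusive
```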
2
Experiments
Experiments were performed on a six-bit real multiplexer [5]. Parameter settings used for ZCS were (using XCS terminology for consistency) N = 800, β = 0.2, τ = 0.1, χ = 0.5, µ = 0.04, fI = 500, s0 = 0.5, ρ = 0.25, φ = 0.5. Parameter settings for XCS were N = 800, β = 0.2, α = 0.1, ε0 = 10, ν = 5, θGA = 12, χ = 0.8, µ = 0.04, θdel = 20, δ = 0.1, θsub = 20, pI = 10, εI = 0, fI = 0.01, θmna = 2, m = 0.1, s0 = 1. All experiments used a fixed reward of 1000. In an online environment there is no artificial distinction between exploration and exploitation trials, so we use a roulette wheel for action selection and permanently enable the GA and update mechanisms.
E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 1924–1925, 2003. © Springer-Verlag Berlin Heidelberg 2003
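For reference, Wilson's real multiplexer [5] maps each real input to a binary value by thresholding, after which the problem is an ordinary Boolean multiplexer. The sketch below assumes a threshold of 0.5 and two address bits; this is our reading of [5], not a detail stated in this paper:

```python
def real_mux6(x, theta=0.5):
    """Six-bit real multiplexer: threshold each real input at theta to get a
    binary string, then let the two address bits select one of four data bits."""
    bits = [1 if xi >= theta else 0 for xi in x]
    address = 2 * bits[0] + bits[1]
    return bits[2 + address]

print(real_mux6([0.1, 0.9, 0.3, 0.8, 0.2, 0.6]))  # address 01 selects 0.8 -> 1
```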
[Figure 1 plot: System Performance (0 to 1) on the y-axis vs. Trials (0 to 40,000) on the x-axis, with curves for ZCS and XCS]
Fig. 1. Mean system performance over 10 runs of ZCS and XCS on the six-bit real multiplexer
Figure 1 shows the performance of ZCS and XCS on the real multiplexer. Similar results were obtained for a three-dimensional checkerboard problem with three divisions per dimension [2] (not shown). We found that, although the asymptotic performance of XCS ultimately exceeds that of ZCS, ZCS performs as well as XCS during the early part of the learning curve for the problems tested. This result shows that a simple LCS architecture, such as ZCS, can compete with the more complex XCS system where speed of learning is more important than asymptotic performance.
References
1. Stone, C. & Bull, L. (2003) For real! XCS with continuous-valued inputs. To appear in Evolutionary Computation.
2. Stone, C. & Bull, L. (2003) Comparing Learning Classifier Systems for Continuous-Valued Online Environments. University of the West of England Learning Classifier Group Technical Report UWELCSG03-001. http://www.cems.uwe.ac.uk/lcsg/reports/uwelcsg03-001.ps
3. Wilson, S.W. (1994). ZCS: A zeroth order classifier system. Evolutionary Computation, 2(1):1–18.
4. Wilson, S.W. (1995). Classifier fitness based on accuracy. Evolutionary Computation, 3(2):149–175.
5. Wilson, S.W. (2000). Get real! XCS with continuous-valued inputs. In P.L. Lanzi, W. Stolzmann and S.W. Wilson (eds.), Learning Classifier Systems. From Foundations to Applications, Lecture Notes in Artificial Intelligence (LNAI-1813), Berlin: Springer, pages 209–219.
Artificial Immune System for Classification of Gene Expression Data
Shin Ando1 and Hitoshi Iba2
1 Dept. of Electronics, School of Engineering, University of Tokyo [email protected]
2 Dept. of Frontier Informatics, School of Frontier Science, University of Tokyo [email protected]
Abstract. DNA microarray experiments generate thousands of gene expression measurements simultaneously. Analyzing differences in gene expression between cell and tissue samples is useful in the diagnosis of disease. This paper presents an Artificial Immune System for classifying microarray-monitored data. The system evolutionarily selects important features and optimizes their weights to derive classification rules. The system was applied to two datasets of cancerous cells and tissues. The primary result found a few classification rules which correctly classified all the test samples and gave some interesting implications for feature selection.
1 Introduction
The analysis of human gene expression is an important topic in bioinformatics. Microarrays and DNA chips can measure the expression profiles of thousands of genes simultaneously. Genes are expressed differently depending on their environment, such as the organ they reside in and external stimulation. Much ongoing research tries to extract information from differences in expression profiles under a given stimulation or environmental change. Some of these experiments have shown promising results in the diagnosis of cancer [16, 17, 4]. Our focus is on classifying the gene expression data in [16] and [17]. These data are publicly available and have been analyzed with other classification methods. This paper describes an implementation of an Artificial Immune System (AIS) [2] for classification. The AIS simulates the human immune system, a complex network that responds to an almost unlimited multitude of foreign pathogens; it is considered potent for intelligent computing applications such as detection, pattern recognition, and classification. This implementation of AIS defines classification rules as hyperplanes which divide the domain of sample vectors. The experiments with genomic expression data show how the border lines are captured and how the features are selected. Compared to other classifier methods, AIS reduces the complexity of the rules while equaling or improving on the accuracy of prediction.
E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 1926–1937, 2003. © Springer-Verlag Berlin Heidelberg 2003
2 Classification of Microarray Data

Important aspects of gene expression classification are the selection of informative genes (feature selection), the optimization of the strength (weight) of each gene, and generalization. Many existing works use ranking methods to select features, as a kind of dimensionality reduction on the data. Many studies [7, 15, 16] use the correlation metric Gi in (i) to rank the feature genes; the subset of genes with the highest correlations is chosen as the classifier genes. In (i), µ and σ are the mean and standard deviation of the expression levels of gene i in the class A or class B samples. [5] compares results for several ranking methods when combined with several machine learning methods.

Gi = (µA − µB) / (σA + σB)    (i)
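A minimal sketch of metric (i) and the top-k ranking step, assuming expression levels arrive as plain lists of floats per class; the function names and data layout are illustrative, not from the paper.

```python
import statistics

def gene_correlation(class_a_levels, class_b_levels):
    """Signal-to-noise style metric (i): G = (muA - muB) / (sdA + sdB)."""
    mu_a = statistics.mean(class_a_levels)
    mu_b = statistics.mean(class_b_levels)
    sd_a = statistics.pstdev(class_a_levels)
    sd_b = statistics.pstdev(class_b_levels)
    return (mu_a - mu_b) / (sd_a + sd_b)

def top_k_genes(expr_a, expr_b, k):
    """Rank genes by |G| and keep the k strongest.

    `expr_a`/`expr_b` map gene id -> list of expression levels per class.
    """
    scores = {g: gene_correlation(expr_a[g], expr_b[g]) for g in expr_a}
    return sorted(scores, key=lambda g: abs(scores[g]), reverse=True)[:k]
```

Whether σ is the population or sample standard deviation is not specified in the excerpt; `pstdev` is used here for definiteness.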
For optimizing the weights, machine learning methods such as weighted vote cast [16], Bayesian Networks, Neural Networks, RBF Networks [7], and Support Vector Machines [6, 15] have been applied to such data.

2.1 Classification of ALL/AML in Acute Leukemia
In [16], cancerous cells collected from 72 acute leukemia patients were monitored over 7109 genes. Discovery and prediction of the cancer class is important, as clinical outcomes vary depending on the class. Each sample belongs to either the ALL or the AML cancer class. Two independent data sets were provided in [16]: a training set (38 cell samples, 27 ALL and 11 AML) to learn the cancer classes and a test set (34 samples, 20 ALL and 14 AML) to evaluate the prediction. The reliable diagnoses of the samples were made by a combination of clinical tests.

2.2 Colon Cancer Diagnosis
In [17], using Affymetrix oligonucleotide arrays, the expression levels of 40 tumor and 22 normal colon tissues were measured for 6500 human genes. Among these genes, the 2000 with the highest minimal intensity across the tissues were selected for classification purposes. Since no training/test split was defined, we randomly chose 38 training samples and 24 test samples. Table 1 compares the performance of various machine learning techniques on the ALL/AML classification; the results are cited from [6, 7, 15, 16]. Each method was trained and tested under the following conditions. Weighted Vote Cast (WVC) [16] selected the 50 genes with the highest correlation as classifier genes; each sample is classified by the sum of each gene's correlation value. Since the weights are not learned, some training samples are misclassified. A Neural Network (NN) creates a different classifier in every run; the success rate shown in Table 1 is the best of 10 runs, while the average and worst numbers of misclassified test samples were 4.9 and 9 respectively. The Bayesian Network [7] uses 4 genes with high correlations, though how the features were chosen is not clearly stated. The Support Vector Machine (SVM) is a combination of linear modeling and instance-based learning; [15] uses 50, 100, and 200 genes of high correlation, while [6] uses Recursive Feature
Elimination to select informative genes; the result in Table 1 was obtained when 8, 16, and 32 genes were chosen. Our implementation of AIS selects combinations of genes and their weights in an evolutionary recombination process. The result in Table 1 is the best of 20 runs, while the average number of misclassifications was 0.85. More detailed results are given in a later section.

Table 1. Comparison of machine learning techniques on the Leukemia data set
ALL/AML   WVC[16]  BN[7]  NN[7]  SVM[15]  SVM[6]  AIS
Training  2/38     0/38   0/38   0/38     0/38    0/38
Test      5/34     2/34   1/34   2-4/34   0/34    0/34
3 Features of the Immune System

The capabilities of the natural immune system, namely to recognize, destroy, and remember an almost unlimited multitude of foreign pathogens, have drawn increasing interest from researchers over the past few years. Applications of AIS include computer security, pattern recognition, and classification. The natural immune system responds to and removes intruders such as bacteria, viruses, fungi, and parasites. Substances capable of invoking specific immune responses are referred to as antigens (Ag). The immune system learns the features of antigens and remembers successful responses for use against invasions by similar pathogens in the future. These characteristics are achieved by a class of white blood cells called lymphocytes, whose function is to detect antigens and assist in their elimination. The receptors on the surface of a lymphocyte bind to specific epitopes on the surfaces of antigens; the proteins involved in this response are called antibodies (Ab). The immune system can maintain a diverse repertoire of receptors to capture various antigens, because the DNA strings which code the receptors are subject to high-probability crossover and mutation, and new receptors are constantly created. Lymphocytes are subject to two types of selection process. Negative selection, which takes place in the thymus, operates by killing all antibodies that bind to any self-protein during maturation. Clonal selection takes place in the bone marrow: a lymphocyte that binds to a pathogen is stimulated to copy itself, and the copy process is subject to a high probability of errors, i.e., hypermutation. The combination of mutation and selection amounts to an evolutionary algorithm that produces lymphocytes increasingly specific to invading pathogens. During the primary response to a new pathogen, the organism experiences an infection while the immune system learns to recognize the epitope through this evolutionary process. The memory of successful receptors is maintained to allow a much quicker secondary response when the same or similar pathogens invade thereafter. There are several theories of how immune memory is maintained; the AIS in this paper stores successful antibodies in permanent memory cells.
4 Implementation of the Artificial Immune System

In the application of the artificial immune system to the ALL/AML classification problem, the following analogy applies. The ALL training samples correspond to Ag, and the AML training samples to self-proteins. Classification rules represent Ab, which capture training samples (Ag or self-proteins) when their profiles satisfy the conditions of the rule. A population of Ab goes through a cycle of invasion by Ag and selective reproduction; as successful Abs are converted into memory cells, the ALL/AML classes are learned. The same analogy applies to the colon cancer data by replacing ALL with the tumor tissues and AML with the normal tissues.

4.1 Rule Encoding
Classification rules are linear separators, or hyperplanes, as shown in (ii). The vector x = (x1, x2, …, xi, …, xn) represents the gene expression levels of a sample, and the vector w = (w1, w2, …, wi, …, wn) represents the weight of each gene. The hyperplane W(x) = 0 separates the domain, and hence the samples, into two classes. W determines the class of sample x: if W(x) is larger than or equal to 0, sample x is classified as ALL; if it is smaller than 0, the sample is classified as AML.

W(x) = wᵀx;  classify x as ALL iff W(x) ≥ 0    (ii)
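A sketch of rule (ii) over a sparse rule representation, where a rule stores only the genes it references; the names and data layout here are illustrative, not from the paper.

```python
def classify(rule, sample):
    """Evaluate W(x) = w . x for a sparse rule {gene: weight} and a
    sample {gene: expression level}; unspecified genes weigh 0."""
    score = sum(w * sample.get(gene, 0.0) for gene, w in rule.items())
    return "ALL" if score >= 0 else "AML"
```

Genes absent from the rule contribute nothing to the score, which matches the paper's statement that unspecified gene weights are supplemented with 0.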
The encoded rule is shown in (iii). It represents a vector w where each locus consists of a pointer to a gene and the weight value of that gene; it corresponds to a vector in which unspecified gene weights are supplemented with 0. The encoding is similar to the messy GA [3] encoding of the vector w.

(X123, 0.5) (Xi, wi) (…) (…) (…)   (gene index, weight)    (iii)

4.2 Initialization
Initially, rules are created by sequential creation of loci. For each locus, a gene and a weight value are chosen randomly; with probability Pi, a further locus is created, so the average length of an initial rule is Σn n·Pi^n. Empirically, the number of initial rules should be of the same order as the number of genes to ensure sufficient building blocks for the classification rules.
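The locus-by-locus creation with continuation probability Pi can be sketched as below; this is a generic illustration, and the weight range is a hypothetical choice, as the excerpt does not state one.

```python
import random

def random_rule(num_genes, p_i=0.5, weight_range=(-2.0, 2.0)):
    """Create a rule locus by locus: pick a random gene and weight,
    then continue to another locus with probability p_i."""
    rule = {}
    while True:
        gene = random.randrange(num_genes)
        rule[gene] = random.uniform(*weight_range)
        if random.random() >= p_i:  # stop with probability 1 - p_i
            return rule
```

The resulting rule lengths are geometrically distributed, giving the variable-length encodings described above.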
4.3 Negative Selection
All newly created rules first go through negative selection. Each rule is met with the set of AML training samples xi (i = 1, 2, …, NAML) acting as self-proteins. If a rule binds any of these samples, i.e. condition (iv) holds, it is terminated. New rules are created until NAb antibodies pass the negative selection; these rules constitute the antibody population Abi (i = 1, 2, …, NAb).

δ(t) = 0 if t > 0;  δ(t) = 1 if t < 0;   ∏i δ(W(xi)) ≠ 1    (iv)
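A self-contained sketch of the negative-selection filter, assuming the same sparse-rule representation as above and treating "binds" as a non-negative score W(x) on a self sample; function names are illustrative.

```python
def binds(rule, sample):
    """A rule binds a sample when its linear score W(x) is non-negative."""
    return sum(w * sample.get(g, 0.0) for g, w in rule.items()) >= 0

def negative_selection(candidates, self_samples):
    """Discard every candidate rule that binds any self (AML) sample."""
    return [r for r in candidates
            if not any(binds(r, s) for s in self_samples)]
```

In the full system this filter would be wrapped in a loop that keeps generating candidates until NAb survivors are collected.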
4.4 Memory Cell Conversion
The antibodies that endure the negative selection are met with the invading antigens Agi (i = 1, 2, …, NALL), i.e. the ALL training samples. Antibodies which capture many antigens are converted into memory cells Mi (i = 1, 2, …, NMem). Let C(Abi) and C(Mi) denote the sets of antigens captured by antibody Abi and memory cell Mi. Abi is converted to a memory cell under the following conditions:
• Mi is removed if C(Mi) ⊂ C(Abi).
• Abi is converted to MN+1 if C(Abi) ⊄ C(M1) ∪ C(M2) ∪ … ∪ C(MN).
• Abi is converted to MN+1 if C(Mi) = C(Abi).
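The first two conditions can be sketched as set operations; this is one plausible reading, and the third condition (conversion when C(Mi) = C(Abi)) is left out because the excerpt does not say which existing cell it interacts with.

```python
def update_memory(memory, captured):
    """Update the memory pool with an antibody's capture set.

    `memory` is a list of frozensets of antigen ids; `captured` is the
    new antibody's capture set C(Ab).
    """
    captured = frozenset(captured)
    # Remove memory cells strictly dominated by the new antibody.
    memory = [m for m in memory if not m < captured]
    # Convert the antibody if it covers antigens no memory cell covers.
    covered = frozenset().union(*memory) if memory else frozenset()
    if not captured <= covered:
        memory.append(captured)
    return memory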
4.5 Clonal Selection
The memory cells and the Abs which bind Ags go through clonal selection for reproduction. Each cycle proceeds as follows:
• First, an Ag is selected from the list of captured Ags. The probability of selection is proportional to S(Agi), the concentration of Agi, which is initially 1.
• Two antibodies Abp1 and Abp2 are selected at random from all the antibodies bound to that antigen.
• Abp1 and Abp2 are crossed over with probability Pc to produce offspring Abc1 and Abc2. The crossover operation is a cut-and-splice operation. Crossover is followed by hypermutation, a series of copy mutations applied to wc1, wc2, and their copied offspring. There are several types of mutation: locus deletion deletes a randomly selected locus, locus addition adds a newly created locus to the antibody, and weight mutation changes the weight value of a randomly chosen locus.
• With probability Pm, a newly created antibody creates another mutated copy; the copy operation is repeated Σn Pm^n times on average.
• Parents are selected from the memory cells as well, with the same crossover and hypermutation process applied.
• The copied antibodies go through negative selection as previously described. The reproduction process is repeated until NAb antibodies pass the negative selection.
• Finally, the score of each Ag is updated by (v), where T is the score of the Ag, s is the number of Ab bound to the Ag, and N is the total number of Ab. β is an empirically determined constant, 1.44 in this study. With an appropriate β, the concentration of each Ag converges to 1.

T′ = β^(T − s/N)    (v)
The process then goes back to negative selection to start a new cycle.

4.6 Generalization
In a single run, many rules with the same set of captured antigens C(M) are stored as memory cells. After the run is terminated, one memory cell is chosen to classify the test samples: the memory cell with the largest margin M in (vi).

M = min_i W(xi) / W(x̃)    (vi)
Here the xi are the ALL/AML sample vectors and x̃ is the median of the samples. We try to maximize generality by choosing the hyperplane whose margin to the nearest sample vector is largest.
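One plausible implementation of this tie-breaking step, under the reading of (vi) as the smallest per-sample score normalized by the score at the sample median; all names are illustrative and the exact form of (vi) is a reconstruction from a garbled equation.

```python
def best_memory_cell(memory_rules, samples, median):
    """Among rules with identical capture sets, keep the one whose
    smallest normalized score over the samples is largest."""
    def w(rule, x):
        return sum(wt * x.get(g, 0.0) for g, wt in rule.items())
    def margin(rule):
        ref = abs(w(rule, median)) or 1.0  # guard against a zero reference
        return min(abs(w(rule, x)) for x in samples) / ref
    return max(memory_rules, key=margin)
```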
4.7 Summary of Experiment
The AIS repeats the cycle described above; the flow of the system is shown in Fig. 1. Each run was terminated after Nc (= 50) cycles. The AIS was run with the parameters shown in Table 2, and the results were robust to minor tuning of these parameters.
[Fig. 1 diagram: Initialize Rules → Negative Selection (against self-proteins, AML) → Infection (by antigens, ALL) → Memory Cell conversion → Clonal Selection → Crossover → Hypermutation → back to Negative Selection]
Fig. 1. Cycle of artificial immune system
Table 2. Data attributes and AIS parameters

Leukemia        NG = 7109    NAML = 11    NALL = 27
Colon cancer    NG = 2001    Nnorm = 13   Ntumor = 25
AIS parameters  NAb = 7,000  Nc = 50      Pi = 0.5    Pc = 0.9    Pm = 0.6

4.8 Results
The AIS was applied to the Leukemia dataset and the Colon cancer dataset. Its performance was measured by the average and standard deviation over 20 runs. In all runs, the training data set was correctly classified. Table 3 shows the average and standard deviation of the number of false positive (misclassified AML test data / misclassified normal tissue) and false negative (misclassified ALL test data / misclassified tumor tissue) predictions on the test samples for both datasets.
Table 3. The number of misclassified samples in the test data sets of Leukemia and Colon cancer

              #FN (Average / Std. Dev.)   #FP (Average / Std. Dev.)
Leukemia      0.3 / 0.47                  0.55 / 0.51
Colon cancer  0.75 / 0.44                 0.7 / 0.47
Fig. 2 and Fig. 3 show the learning process of the AIS in a typical run on the Leukemia dataset. Fig. 2 shows the number of Ag captured by the best and worst Abs, together with the average number of captured Ag. It shows that the training samples are learned by the 20th cycle; the cycles afterward are spent deriving more general rules. Fig. 3 shows the concentration of the Ags at each cycle. Most Ags converge to 1, while a few are slower to converge. These samples mark the borderline between the classes: samples near the classification border are prone to misclassification by untrained Abs, and are thus slower to converge.
Fig. 2. Number of antigens caught by the best and worst antibodies
Empirically, the results were successful when the grasp of the border was clear, i.e. all but a few samples had converged. The termination criterion (number of cycles) was chosen so that sufficient convergence was achieved by the end of the iteration.
Fig. 3. Transition of antigen scores
5 Analyses of Selected Features

Fig. 4 and Fig. 5 show some of the classification rules which correctly classified the test samples of the Leukemia dataset and the Colon cancer dataset respectively.
A) 1.21896X3675 + -1.5858X4474 + 1.46134X1540 + -1.19885X2105 + 1.84803X757 + 1.82983X4038
B) -1.4577X1385 + -1.57815X4363 + 1.31819X2317 + 1.75329X2328
C) 1.00809X1904 + 1.7706X6244 + -1.41034X4526 + -1.14542X759 + 1.94696X2723 + -1.34382X4875
D) 1.26745X3110 + 1.43941X1190 + 1.97632 + 1.74422X5519 + 1.79449X6874 + -1.44577X6022

Fig. 4. Examples of ALL/AML classifier rules
A) -1.678X1997 + -1.55458X1516 + 1.41783X1297 + -1.85391X696 + 1.69254X416 + -1.39212X620 + -1.19309X1672 + -1.95164X1310 + 1.95381X49 + 1.64378X134
B) -1.25102X1997 + 1.77743X138 + -1.42182X1770 + 1.3154X49 + 1.63866X183 + 1.17681X94

Fig. 5. Examples of colon cancer classifier rules
Each rule differed from run to run. In this section, we further analyze the selected genes. Some genes appear repeatedly in the classification rules; such genes have relatively high correlations (i), as shown in Table 4, and are fairly informative for ALL/AML classification. Fig. 6 shows the expression levels of those genes, and how the test data sets can be clustered with the features in Table 4. The figure was created with average linkage clustering by Eisen's clustering tool and viewer [10]. It separates the ALL/AML samples with the exception of one sample.

Table 4. Correlation values of the classifier genes (GAN: Gene Accession Number)

Gene         X757     X1238    X4038   X2328    X1683    X6022     X4363
GAN          D88270   L07633   X03934  M89957   M11722   L00634_s  X62654
Gene name    VPREB1   PSME1    CD3D    CD79B    DNTT     FNTA      CD63
Correlation  0.838    0.592    0.647   0.766    0.816    -0.837    -0.834
Many of these genes have relations to leukemic disease that can be confirmed in the biological literature. For example, CD79b is a surface marker molecule which can provide important additional information in leukemia cell analysis [8]. CD3D is involved in the abnormal gene locations often observed in acute leukemia [14].
Fig. 6. Expression levels of informative genes and clustering based on those genes
The following section analyzes the featured genes in rule A (Fig. 4). As can be seen in Table 5, not every selected gene has a high correlation value.
Table 5. Correlation of the classifier genes in rule A

Gene         X3675    X4474    X1540    X2105     X757     X4038
Weight       1.21896  -1.5858  1.46134  -1.19885  1.84803  1.82983
GAN          U73682   X69699   L38696   M62762    D88270   X03934
Correlation  0.151    0.371    0.418    -1.02     0.838    0.647
Fig. 7 shows the expression levels of the feature genes in rule A. It implies that the majority of the samples can be classified by a few ALL (X2105, X757) and AML (X4038) classifier genes; these classifier genes have relatively high correlation values. Some of the samples (AML1, 6, 10, ALL11, 17, 18, 19) in Fig. 7 seem indistinguishable by those classifier genes. The function of the supplementary genes (X3675, X4474, X1540) becomes evident when only these samples are considered. Fig. 8 shows the normalized expression levels of the selected samples: within this group, X3675 and X4474 are highly correlated to ALL and AML respectively, while the classifier genes (X2105, X757, X4038) become irrelevant to the sample class.
Fig. 7. Expression level of classifier genes in rule A
Fig. 8. Selected samples
6 Conclusion
This paper presented an Artificial Immune System to classify the gene expression of cancerous tissues. Despite the sparseness of the training data, the accuracy of prediction was satisfactory, as the test data were correctly classified in 8 out of 20 runs of the ALL/AML classification. The colon cancer classification is known to be much harder. For comparison, we implemented classification algorithms using popular machine learning methods. Table 6 shows the error rates of these algorithms when applied to the colon cancer data, using 50, 100, and 200 genes selected by the correlation metric (i) as features. The SVM used a linear kernel and an error margin of 0.001; the NN was implemented as a 3-layer perceptron. In the experiment with the colon cancer dataset, the approximate amount of time required by each algorithm was 50 min. for the SVM, 3 hours for the NN, and 6 hours for the AIS. It is hard to compare the efficiency of the algorithms with regard to both prediction accuracy and the amount of computation required. The computational cost depends on the conditions of the data and the algorithms; e.g., the number of iterations in the SVM increases exponentially with the number of training samples, whereas the AIS population must increase linearly with the number of genes. Considering that preparation of the data, i.e., collecting samples from patients and performing the microarray tests, takes months, it can be said that all the algorithms require a comparatively small amount of time.

Table 6. Comparison of performance on the colon cancer dataset
                                    SVM    NN     AIS
# of misclassifications (test)      3/24   9/24   1.45/24
We presume that AIS was more effective than the other methods with regard to feature selection: AIS evolutionarily chooses informative genes while also optimizing their weights. Though the gene subset chosen by AIS differs from run to run, genes with strong correlation are chosen frequently, as Table 4 shows. Similar results can be obtained with a Genetic Algorithm [9]. Genes with strong correlation, whether selected by frequency or by correlation, may not contain enough information to form a sufficient subset for classification when many co-regulated genes are selected: many genes are predicted to be co-expressed, and those genes are expected to have similar rankings. On the other hand, the result in Fig. 8 implies that the selection of complementary genes, which are not necessarily highly correlated, can be useful in classification. This suggests that the choice of feature gene subsets should be based not only on a single ranking method, but also on the redundancy and mutual information between the genes. Changing the ranking objective when one feature is removed as a ranking criterion has been suggested in [13], while [1937] states that the performance of machine learning is naïve to the choice of features. AIS selects features during the learning process, and it is interesting that it can choose primary and complementary feature genes by an evolutionary process. Monitoring the convergence of the clonal selection suggested a new termination criterion, as results improved when all but a few genes had converged; the analysis and quantitative implementation of such a criterion is underway. As future work to improve classification capability, the use of effective kernel functions and the expression of relations between the genes, such as combining antibodies with AND/OR functions, should be addressed.
Artificial Immune System for Classification of Gene Expression Data
1937
References
8. 9. 10. 11. 12. 13. 14.
15. 16. 17.
A. Ben-Dor, N. Friedman, Z. Yakini, Class discovery in gene expression data, Proc. of the 5th Annual International Conference on Computational Molecular Biology, 31–38, 2001. D. Dasgupta. Artificial Immune Systems and Their Applications. Springer, 1999. D. Goldberg, B. Korb and K. Deb, Messy Genetic Algorithms: Motivation, Analysis and First Results, Complex Systems, 3:493–530, 1989 Donna K. Slonim, Pablo Tamayo, Jill P. Mesirov, Todd R. Golub, Eric S. Lander, Class Prediction and Discovery Using Gene Expression Data, Proc. of the 4th Annual International Conference on Computational Molecular Biology(RECOMB), 263–272, 2000. H. Liu, J. Li, L. Wong, A Comparative Study on Feature Selection and Classification Methods Using Gene Expression Profiles and Proteomic Patterns, in Proceeding of Genome Informatics Workshop, 2002 I. Guyon, J. Weston, S. Barnhill, V. Vapnik, Gene Selection for Cancer Classification using Support Vector Machines, Machine Learning Vol. 46 Issue 1–3, pp. 389–422, 2002 K.B. Hwang, D.Y. Cho, S.W. Wook Park, S.D. Kim, and B.Y. Zhang, Applying Machine Learning Techniques to Analysis of Gene Expression Data: Cancer Diagnosis, in Proceedings of the First Conference on Critical Assessment of Microarray Data Analysis, CAMDA2000. Knapp W, Strobl H, Majdic O., Flow cytometric analysis of cell-surface and intracellular antigens in leukemia diagnosis. Cytometry 1994 Dec 15;18(4):187–98 L. Li, C. R. Weinberg, T. A. Darden, L. G. Pedersen, Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method, Bioinformatics, Vol. 17, No. 12, pp. 1131–1142, 2001 M.B. Eisen, P.T. Spellman, P.O. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Science, 85:14863–14868, 1998. P. Baldi and A. 
Long, A Bayesian framework for the analysis of microarray expression data: Regularized t-test and statistical inferences of gene changes, Bioinformatics, 17:509-519, 2001. P.J. Park, M. Pagano, and M. Bonetti, A nonparametric scoring algorithm for identifying informative genes from microarry data, PSB2001, 6:52–63, 2001. R. Kohavi and G. H. John, Wrappers for Feature Subset Selection, Artificial Intelligence, vol. 97, 1–2, pp. 273–324, 1997 Rowley JD, Diaz MO, Espinosa R 3rd, Patel YD, van Melle E, Ziemin S, Taillon-Miller P, Lichter P, Evans GA, Kersey JH, et al., Mapping chromosome band 11q23 in human acute leukemia with biotinylated probes: identification of 11q23 translocation breakpoints with a yeast artificial chromosome., Proc Natl Acad Sci U S A 1990 Dec;87(23):9358–62 T.S. Furey, N. Cristianini, N. Duffy, D.W. Bednarski, M. Schummer, and D. Haussler. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 2001 T.R. Golub, D.K. Slonim, P. Tamayo, Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286:531–537, 1999. U. Alon, N. Barkai, D. Notterman, K. Gish, S. Ybarra, D. Mack, and A. Levine. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon cancer tissues probed by oligonucleotide arrays. Cell Biology, 96:6745–6750, 1999.
Automatic Design Synthesis and Optimization of Component-Based Systems by Evolutionary Algorithms

P.P. Angelov¹, Y. Zhang¹, J.A. Wright¹, V.I. Hanby², and R.A. Buswell¹

¹ Dept of Civil and Building Engineering, Loughborough University, Loughborough, LE11 3TU, Leicestershire, United Kingdom
{P.P.Angelov,Y.Zhang,J.A.Wright}@Lboro.ac.uk
² Institute of Energy and Sustainable Development, De Montfort University, Leicester, LE7 9SU, United Kingdom
[email protected]
Abstract. A novel approach for automatic design synthesis and optimization using evolutionary algorithms (EA) is introduced in the paper. The approach applies to component-based systems in general and is demonstrated on a heating, ventilating and air-conditioning (HVAC) system design problem. The whole process of system design, including the initial stages that usually entail significant human involvement, is treated as a constraint satisfaction problem. The formulation of the optimization process captures the complex nature of the design problem using different types of variables (real and integer) that represent both the physical and the topological properties of the system; the objective is to design a feasible and efficient system. New evolutionary operators tailored to the component-based, spatially distributed system design problem have been developed. The process of design has been fully automated. Interactive supervision of the optimization process by a human designer is possible using a specialized GUI. An example of the automatic design of HVAC systems for two-zone buildings is presented.
1 Introduction
There are generally two phases in a design process:
• Conceptual design
• Detailed design
During the conceptual design phase, strategic design decisions are usually taken by the human designer based on experience and knowledge. These choices lead to one or more possible schematic representations of the system. Repeated consideration of similar problems quite often results in a set of "typical" design solutions for predefined cases. Conceptual design is a creative process and is highly subjective. The detailed design process identifies possible candidate solutions that meet the requirements and provide the necessary level of performance [12]; in contrast to the conceptual phase, this stage involves numerical analyses, simulation and, occasionally, optimization.
E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 1938–1950, 2003. © Springer-Verlag Berlin Heidelberg 2003
The development of component-based simulation techniques [3,5] made possible the investigation of changes in system operating variables and parameters, and studies considering a fixed system configuration are now relatively easy to compile [7]. Existing design tools allow the designer to choose a particular configuration from a list of prescribed systems [4] or to develop a user-defined system [2,6]. The former approach presents a limited range of possible solutions, and the latter is realized only through time-consuming and error-prone procedures. An alternative approach is proposed in this paper, offering a new degree of freedom in the design process. Using the technique, strategic design decisions concerning the system type and configuration are investigated as part of the design optimization process, instead of being predefined. This allows alternative, novel system configurations to be explored automatically and hence the "best" design, in terms of both a feasible configuration and an optimal set of sizing parameters, to be determined. It is well known that for a given design problem a number of plausible system configurations exist [10]. The proposed approach combines the automatic generation of alternative system configurations with the optimization: variables and parameters describing the system configuration, the component set and the topological links form part of the optimization search space [13]. EAs have been selected to solve this multi-modal problem because they are particularly appropriate for tackling complex optimization problems [1]. New evolutionary operators, tailored to the specific problem of secondary HVAC system design, have been developed. The component set and the topological interconnections have been encoded into the chromosome of the EA. This has been combined with the physical variables describing the thermal and air mass transfer through the HVAC system into a "genome" (a data structure which combines several types of chromosomes).
The objective is to design a feasible and energy-efficient system. The system input is a "design brief", a standardized description of the design requirements and constraints. The output is a ranked list of suggested design solutions ordered by their cost, where the cost is based on component size and energy consumption. The process of design is fully automated: a computational module produced in Java can, given a performance specification and a library of system components, synthesize functionally viable system configurations. In addition, a specialized GUI allows interactive supervision of the design process by a human designer. An example of the automatic design of HVAC systems for two-zone buildings is given as an illustration of the technique.
2 Automatic Design Synthesis and Optimization by EA

2.1 Concept of the Approach
The concept of the proposed approach is that design synthesis can be represented as a constraint-satisfying optimization problem. In this way the design synthesis, including system configuration generation and parameter optimization as well as overall system performance optimization, is integrated into one search problem. This
problem formulation combines the search for a feasible system configuration (comprising the list of particular components used and their interconnections) with the minimization of the energy (and possibly capital) costs. A feasible configuration is one that fulfills both the design and engineering requirements. The computational mechanism used to solve the problem is based on EA [1,8]. This technique has proven powerful and robust in solving mixed-variable, nonlinear and discontinuous problems. The EA searches from a population of points, each representing a particular candidate solution. It needs only the value of the objective function, and it operates on a coding of the parameter set rather than the parameter values themselves. The evaluation of a given system includes checks that the configuration satisfies the connectivity constraints and the other specific requirements of the system; in this case, that the supply air conditions satisfy the zone loads. Graph algorithms have been used to verify these requirements [14]. The approach is concerned with identifying system configurations using a library of typical components performing elementary transformations. A typical list of components of a secondary HVAC system is given in Table 1.

Table 1. List of components (example for a secondary HVAC system)

ID  Type     Description
0   Div      Divergence tee
1   Mix      Mixing tee
2   Heat     Heating coil
3   Cool     Cooling coil
4   Steam    Steam injection
5   Zone     Zone
6   Ambient  Ambient environment
The approach is generic and could be applied to other similar design synthesis and optimization problems, such as building architectural design, electronic hardware design, and pipeline and sewage network design. The approach maximizes the potential for generating novel system designs, but needs to determine whether or not the system configurations conform to the connectivity constraints. It forms a highly constrained, mixed integer-real, multi-modal optimization problem. Characteristics of the specific search problem are:

• the solution space is multi-modal, with the viable system configurations being distributed in a discontinuous manner throughout the search space;
• the number of performance optimization variables changes with the system configuration, and therefore the location of the solution in the search space.

2.2 Evolutionary Synthesis and Optimization of the System Configuration
An initial population is created randomly, in which each member is a chromosome consisting of a coding of the components used together with their links.

Automatic Design Synthesis and Optimization of Component-Based Systems
1941

This population is then evolved through successive epochs using mating and mutation to improve the "fitness" of the population. Each chromosome is formed from the system connectivity [14], component capacity, and air mass flow rate variables. It is possible for configurations in a given population to have different numbers of components, hence the number of capacity and flow rate variables can differ. The genome structure depicted in Fig. 1 has been developed to handle this complexity. Essentially, a flexible data structure is designed to combine chromosomes of variable length in an object-oriented manner. A benefit of this data structure is the use of directed crossover in optimizing the duties. Unlike the loose chromosome sets seen in some other implementations, this structure allows relationships and interactions to be defined among its member chromosomes. It allows the consistency of the genome to be maintained even though some operators act at the chromosome level. This implementation is more generic in that:
• it does not require equal/fixed-length member chromosomes;
• it can take any type of member chromosome.

[Fig. 1. Genome structure: for n components, of which m are Div's (type 0), the genome holds a ComponentSet chromosome (C1 ... Cn); a Topology chromosome (T1 ... Tn+m), holding n + m "to" links, the extra m being the second legs of the Div's; and FlowDuty_1 ... FlowDuty_j chromosomes for the j running conditions, each with m + 1 mass flow rates M1 ... Mm+1 (kg/s, associated to the Mix's plus one for Ambient) and k duties Q1 ... Qk associated to the air-handling components.]
Each problem variable is viewed as a gene. It can be integer (for the component ID number, its position, and its links (topology)) or real (for the coil duty or mass flow rate). A collection of genes forms a chromosome. Examples of such chromosomes having a clear interpretation in the HVAC system design are [14]:

• ComponentSet (list of components present in a particular configuration);
• Topology (modified adjacency-list encoding of the network, where the "To" links between the components are listed);
• FlowDuty (real-encoded air mass flow rates and component duties for each running condition).
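The three chromosome types listed above can be sketched as a simple data structure. This is an illustrative reconstruction, not the authors' implementation; all names and field layouts are assumptions.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class FlowDuty:
    flows: List[float]   # mass flow rates (kg/s): one per Mix, plus one for Ambient
    duties: List[float]  # duties (kW) of the air-handling components

@dataclass
class Genome:
    components: List[int]        # component type IDs (see Table 1)
    topology: List[int]          # "To" links, including the second legs of the Div's
    flow_duties: List[FlowDuty]  # one FlowDuty chromosome per running condition

    def n_components(self) -> int:
        return len(self.components)

# A toy genome: a heating coil (type 2) feeding a zone (type 5) and back.
g = Genome(components=[2, 5], topology=[1, 0],
           flow_duties=[FlowDuty(flows=[0.5], duties=[3.2])])
assert g.n_components() == 2
```

Grouping the variable-length chromosomes in one object mirrors the paper's point that genome-level consistency can be maintained while operators act on individual chromosomes.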
In short, the genome is a collection of chromosomes that fully describes the structure and operation of a specific system. A sample genome is presented in Section 4.

2.3 Fitness and Constraints Formulation
The design synthesis and optimization problem has been formulated as a constraint satisfaction problem, which in general can be described by:

Minimize:
\[ f(X, Y), \quad X = (x_1, \dots, x_{n_1}) \in \mathbb{R}^{n_1}, \quad Y = (y_1, \dots, y_{n_2}) \in \mathbb{N}^{n_2} \]

Such that:
\[ l_i \le x_i \le u_i, \quad \forall i \in (1, \dots, n_1) \]
\[ g_j(X) \le 0.0, \quad \forall j \in (1, \dots, q) \]
\[ h_j(X) = 0, \quad \forall j \in (q+1, \dots, m_1) \]
\[ l_i \le y_i \le u_i, \quad \forall i \in (1, \dots, n_2) \]
\[ h_j(Y) = 0, \quad \forall j \in (q_2+1, \dots, m_2) \]

And:
\[ l_i, u_i \in \mathbb{R}, \quad l_i \le u_i, \quad \forall i \in (1, \dots, n) \]
The EA attempts to find a system configuration that minimizes the system energy consumption, subject to both the system configuration and the system operation being feasible. The feasibility of the system configuration is defined by a set of equality constraints, h_j(Y), whereas the feasibility of the system operation is described by a set of inequality constraints, g_j(X). In fact the system operation includes two equality constraints on the zone conditions, but these have been converted to inequality constraints by applying a tolerance, \delta, to the equality: h_j(X) - \delta \le 0.0. Since it is not possible to evaluate the energy use of an infeasible topology, the fitness of solutions violating the topology constraints is evaluated by the degree of violation. The fitness, F(X, Y), is derived from a linear exterior penalty function and is formulated for function minimization:

\[
F(X, Y) =
\begin{cases}
w_f f(X, Y), & \text{if feasible;} \\
w_f f(X, Y) + \sum_{j=0}^{q} w_j g_{o,j}(X) + \sum_{j=q+1}^{m} w_j \left( h_{o,j}(X) - \delta \right), & \text{if topology feasible, operation infeasible;} \\
\sum_{j=q+1}^{m} w_j h_{t,j}, & \text{otherwise (topology infeasible),}
\end{cases}
\]

where subscript o refers to the operation constraints and subscript t to the topology constraints; all g_{o,j} and (h_{o,j} - \delta) have values \ge 0.0 (are infeasible); and w_f and w_j are the objective and constraint function weights.
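The three-case penalty can be sketched as follows. The weight values, feasibility tests, and function signature are assumptions for illustration, not the paper's implementation.

```python
def fitness(f_val, g_op, h_op, h_top, w_f=1.0, w_j=10.0, delta=1e-3):
    """Sketch of the linear exterior penalty fitness (minimization).
    g_op:  operation inequalities g_{o,j}(X), feasible when <= 0
    h_op:  operation equalities h_{o,j}(X), relaxed to h - delta <= 0
    h_top: topology equality constraints h_{t,j}(Y), feasible when == 0
    """
    if any(abs(h) > 0 for h in h_top):          # topology infeasible
        return sum(w_j * abs(h) for h in h_top)
    g_pen = sum(w_j * g for g in g_op if g > 0)              # violated only
    h_pen = sum(w_j * (h - delta) for h in h_op if h - delta > 0)
    if g_pen == 0 and h_pen == 0:               # fully feasible
        return w_f * f_val
    return w_f * f_val + g_pen + h_pen          # operation infeasible

assert fitness(5.0, [-1.0], [0.0], [0.0]) == 5.0            # feasible
assert fitness(5.0, [2.0], [0.0], [0.0]) == 25.0            # operation penalty
assert fitness(5.0, [-1.0], [0.0], [3.0]) == 30.0           # topology penalty
```

Only violated constraints contribute to the penalty, matching the statement above that all included g_{o,j} and (h_{o,j} - delta) terms are non-negative.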
Several sets of fundamental, user-defined, and controlled constraints are imposed on configuration generation. Generally, they consist of [14]:

• component-related rules;
• connection-related rules;
• process-related rules.

It should be noted that the approach is open, and allows any group of constraints/rules to be easily added or modified.
3 Problem-Specific Operators
The optimization involves two connected problems: the search for feasible configurations, and the optimization of the operation of the feasible configurations found. The latter can only be performed when the configuration is feasible. The effective solution of the search for feasible configurations is therefore of paramount importance. A set of problem-specific operators for the mutation and recombination of topologies has been studied, to improve the chance of a candidate configuration being feasible.

3.1 New Operator Definitions
Two operators have been defined for the mutation of topology chromosomes. The "component link mutation" (LINK) is based on the generic swap-type mutation that inverts the position of two randomly selected genes. For a topology chromosome, inversion of the integers representing the "To" links effectively exchanges two branches in the network, as illustrated in Fig. 2. The number of swaps to be performed can be specified as a parameter of the operator. The advantage of the operator is its exploration power, but the probability of damaging the connectivity requirement of a topology is high. As a mutation operator, LINK increases the exploration capabilities of the algorithm by swapping the links, not only the components themselves.

[Fig. 2. Example of the component link (LINK) mutation operator: two "To" links are swapped, exchanging two branches of the network.]
The second topology operator is "component position mutation" (POS), illustrated in Fig. 3. The positions of component <2> and component <0> in the network are exchanged by sequentially swapping the "To" and "From" links belonging to each component. If either component had more than one "To" or "From" link, graph-based analysis would be applied to determine which link can be swapped to ensure preservation of the network connectivity. The advantage of this operator is that if a topology already satisfies the connectivity requirement, further operations do not generate topologies that violate it. The POS operator generates fewer infeasible solutions than the LINK operator, but it is weaker at exploring new topologies.

[Fig. 3. Example of the component position (POS) swapping mutation operator.]
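For a simplified encoding in which every component has a single "To" link, the two mutations can be sketched as follows. This omits the paper's graph-based repair for multi-link components, and all names are illustrative.

```python
import random

# to_links[i] is the component that component i sends air "to".

def link_mutation(to_links, rng=random):
    # LINK: swap two randomly chosen "To" links, exchanging two branches.
    child = list(to_links)
    i, j = rng.sample(range(len(child)), 2)
    child[i], child[j] = child[j], child[i]
    return child

def pos_mutation(to_links, a, b):
    # POS: exchange the network positions of components a and b by swapping
    # both their outgoing ("To") and incoming ("From") links.
    child = list(to_links)
    child[a], child[b] = child[b], child[a]   # swap outgoing links
    for i, t in enumerate(child):             # redirect incoming links
        if t == a:
            child[i] = b
        elif t == b:
            child[i] = a
    return child

# Ring 0 -> 1 -> 2 -> 0: swapping components 1 and 2 preserves connectivity.
assert pos_mutation([1, 2, 0], 1, 2) == [2, 0, 1]
```

The example shows the POS property claimed above: a connected ring stays connected after the position swap, whereas a raw LINK swap may break the cycle.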
The configuration of an HVAC system can be considered as a network of components through which air flows. Accordingly, optimization methods developed for the so-called Travelling Salesman Problem (TSP) are intuitively relevant. Goldberg [8] introduced "partially matched crossover" (PMX), which has shown its usefulness in solving the TSP. PMX was adapted as a topology recombination operator, with a minor modification to allow multiple connections to a single node. When a child chromosome is produced, the PMX for topology tries to preserve a chunk of chromosome from one parent, completing the chromosome with parts of the other. In order to maintain the correct form of adjacency representation for a network, genes from the second parent often have to be repaired. This limits the inheritance of features from the second parent. In addition, any repair of the genes is executed without consideration of the preservation of connectivity. As a result, there is a high probability that a pair of parents who are themselves connected will produce unconnected children. This characteristic has a significant impact on the performance of PMX in searching topologies, as demonstrated by the results in the next section.

A new topology recombination operator, "adjacent component crossover" (ADJ), has been implemented in order to enhance both connectivity and the propagation of features. The procedure is based on the assumption that the adjacency of specific components is a useful feature of an HVAC configuration. The ADJ operator performs a minimal relocation of components in the clone of one parent, so as to make a pair of components that are adjacent in the other parent adjacent in the child also. The procedure is illustrated in Fig. 4. Firstly, the parent chromosomes are cloned and the following steps are performed:

1) A component that is present in both topologies is selected at random as the base component (component <1> in this instance).
2) The component next to the base component is identified in each topology. If this component is the same in both topologies, or if it is absent from either topology, restart from step 1. The feature of parent A identified in this example is the adjacency <1>→<3>, whereas the feature of parent B is the adjacency <1>→<2>.
3) The algorithm that exchanges components in the POS operator is used to swap components in parent A in the example. This ensures the preservation of connectivity in the child A'. So for parent A, these operations are: where <3> is linked from <1>, swap <3> with <2> so that <1> and <2> become adjacent.
4) Step 3 is then repeated for parent B, to achieve the adjacency feature of parent A (<1>→<3>) in child B'.

The result of these operations is that each child preserves most of the network structure of its parent, while acquiring a new feature from the second parent.

[Fig. 4. Example of the exchange-adjacent component (ADJ) crossover: child A' acquires the adjacency feature of parent B, and child B' the adjacency feature of parent A.]
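Under simplifying assumptions (single "To" links, identical component sets in both parents, no self-loops), the core of the ADJ step can be sketched as follows; this is an illustration of the idea, not the paper's operator.

```python
# to_links[i] is the successor of component i in the network.

def adj_crossover(parent_a, parent_b, base):
    """base: index of the base component (chosen at random in the full operator)."""
    child = list(parent_a)
    want = parent_b[base]   # successor of the base in the other parent
    have = child[base]      # current successor in the clone
    if want == have or want == base:
        return child        # feature already present (the full operator restarts)
    # Swap the network positions of `have` and `want`, as in the POS operator,
    # so that base -> want becomes adjacent while connectivity is preserved.
    child[have], child[want] = child[want], child[have]
    for i, t in enumerate(child):
        if t == have:
            child[i] = want
        elif t == want:
            child[i] = have
    return child

# Ring 0->1->2->3->0 (parent A); parent B has the adjacency 0->2.
# The child keeps a single connected ring but now contains 0->2.
assert adj_crossover([1, 2, 3, 0], [2, 3, 0, 1], base=0) == [2, 3, 1, 0]
```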
3.2 Evaluation of New Operators
Two test procedures have been performed to evaluate the effectiveness of the new operators. In the first test procedure, the operators are used to generate new individuals from a pre-prepared population. The proportion of feasible topologies generated in the children of the initial population is compared to the number of new individuals. This measure gives the probability that the operator will produce feasible topologies from an existing topology. Two initial populations are prepared, both based on a set of 15 components. The first population is filled with randomly initialized topologies and the second with feasible topologies. The feasible topologies are generated at random and tested for feasibility a priori. The operator under test continually selects individuals from the initial population, producing new individuals, until 10,000 new chromosomes have been produced. The same test is repeated on a given operator 20 times. The mean and standard deviation of the results are given in Table 2.
Table 2. Number of feasible solutions out of 10,000 individuals produced by mutation and crossover operators from a random population and an all-feasible population

Mutation          From Random    From Feasible    Crossover      From Random    From Feasible
                  Mean (stdev)   Mean (stdev)                    Mean (stdev)   Mean (stdev)
FLIP (p = 0.1)    2.6 (1.43)     253.3 (18.1)     1P (p = 1.0)   0.9 (1.04)     118.7 (19.0)
RAND (p = 1.0)    99.1 (12.0)    100.0 (11.1)     PMX (p = 1.0)  95.9 (8.6)     641.0 (19.4)
LINK (swap = 1)   102.9 (20.0)   2439.7 (30.3)    ADJ (p = 1.0)  132.6 (24.1)   6221.2 (51.3)
POS (swap = 1)    141.3 (28.6)   6326.6 (35.7)
The proposed topology operators are compared with "flip gene mutation" (FLIP), "randomization operation" (RAND), and "1-point crossover" (1P). Table 2 demonstrates that the numbers of feasible solutions generated by the topology-specific operators are significantly higher than those generated by the more conventional operators. Both POS mutation and ADJ crossover have a more than 60% probability of producing new feasible topologies from feasible parents. Conventional operators for integer chromosomes (FLIP mutation and 1P crossover) have only a 1%–3% chance of producing feasible individuals from feasible parents.

[Fig. 5. Effectiveness of evolutionary operators in searching for a specified topology (penalized objective function versus number of generations). Mean generations to find the optimum, per legend: simple operators 1P(p=1.0)+FLIP(p=0.1): 3000+; topology mutation LINK(p=1.0): 1734+; topology mutation POS(p=1.0): 805 (710); topology recombination PMX(p=1.0): 2979+; topology recombination ADJ(p=1.0): 436 (419); combined topology operators ADJ(0.9)+PMX(0.1)+POS(0.15)+LINK(0.15): 462+.]
The second evaluation procedure tests convergence. In this procedure, a conventional configuration for an HVAC system was selected, as depicted in Fig. 6. The operation of the selected configuration, including the component capacities and airflow rates, was optimized. The capacities and flow rates were then fixed and the system configuration subjected to further optimization, using the different operators. The optimal configuration in this case must be the original configuration, and hence convergence on this solution can be used as a measure of operator performance. Figure 5 illustrates the rate of convergence for the various operators. The numbers given in the legend are the mean generations required to find the optimum configuration over 30 runs. The '+' sign indicates that the topology was not found in every trial run; otherwise the standard deviation is given. Fig. 5 demonstrates the poor convergence of the conventional operators compared to the new problem-specific operators.
4 Automatic HVAC System Design Synthesis and Optimization
To demonstrate the design synthesis approach, consider the following HVAC problem: design the optimal HVAC system for a small office building in Oklahoma, USA. Three typical design conditions are selected to represent operation in summer, winter, and the swing season. Fig. 6 shows a conventional system with its optimal operation parameters in the summer condition. One of the auto-generated configurations is presented in Fig. 7 (encoded), and in Fig. 8 graphically.

[Fig. 6. A conventional HVAC system configuration (summer condition), comprising Div (0–2), Mix (3–5), Heat (6–8), Cool (9), Steam (10), Zone (11) [west], Zone (12) [east], and Ambient (13) components, annotated with the optimized temperatures, mass flow rates, and duties.]
CompSet:    0 0 0 0 1 1 1 1 2 2 3 3 4 5 5 6   (0 = Div, 1 = Mix, 2 = Heat, 3 = Cool, 4 = Steam, 5 = Zone, 6 = Ambient)
Topology:   5 7 1 6 13 7 0 4 9 5 14 4 11 2 6 3 10 12 15 8
FlowDuty_1: 0.064 0.592 0.029 0.388 0.072 | 0 0 | -3.30 -5.88 | 0
FlowDuty_2: 0.064 0.285 0.025 0.172 0.117 | 0.12 0.90 | 0 0 | 0
FlowDuty_3: 0.096 0.0 0.001 0.948 0 | 0 0 | 0 0 | 0
(each FlowDuty row: five flow rates (kg/s), two heating duties (kW), two cooling duties (kW), one steam rate (kg/s))

Fig. 7. The genome for the generated configuration
[Fig. 8. Configuration generated automatically by the new approach (summer condition), comprising Div (0–3), Mix (4–7), Heat (8, 9), Cool (10, 11), Steam (12), Zone (13) [west], Zone (14) [east], and Ambient (15) components, annotated with the optimized temperatures, mass flow rates, and duties.]
The minimum energy consumption associated with the conventional and generated configurations is compared in Table 3. The results illustrate both the ability of the automated design synthesis approach to find a viable HVAC system design, and that the generated system is more energy efficient, saving ~800 kWh, or 9% of the energy consumed by the optimized conventional design solution.

Table 3. Energy consumption for conventional and automatically generated configurations

                 Conventional configuration                Generated configuration
Season           Heat/Cool   Circulation   Total (kWh)     Heat/Cool   Circulation   Total (kWh)    Saving (kWh)
Summer           6857.5      396.5         7254.0          5960.5      136.5         6097.0         1157.0
Winter           858.0       71.5          929.5           611.0       65.0          676.0          253.5
Spring/Autumn    0.0         377.0         377.0           0.0         988.0         988.0          -611.0
Annual Total                               8560.5                                    7761.0         799.5

(Heat/Cool = total heating/cooling consumption (kWh); Circulation = total circulation consumption (kWh).)
5 Conclusions

A new approach for automatic design synthesis and optimization using EAs is introduced in this paper. The approach applies to component-based systems in general, although the focus of this paper has been on secondary HVAC system problems. The whole design process is treated as a constraint satisfaction problem. The problem formulation is complex, using different variables to represent the physical and topological properties of the system. The problem objective is to design a feasible and efficient system. EAs have been shown to solve numerically complex search problems such as this. Several problem-specific operators, tailored to the design of component-based, spatially distributed systems, have been developed and evaluated. The results of the evaluation of these operators illustrate their superiority over conventional operators for the mutation and crossover of integer genes. The process of design can be fully automated, commencing with a design brief and resulting in a rank-ordered set of optimal configurations. An example of the automated design of a secondary HVAC system for a two-zone office building was presented. The synthesized system design showed energy savings of about 800 kWh (or 9%) compared to the optimal conventional system, based on the performance over three season-typical days. This result demonstrates the potential of the approach to automatically generate viable, novel HVAC system configurations that can yield improved energy performance compared to conventional design solutions.

Acknowledgement. This research has been supported by the American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE) through the RP-1049 research project.
References

1. Michalewicz, Z.: Genetic Algorithms + Data Structures = Evolution Programs. 3rd edn. Springer-Verlag, Berlin Heidelberg New York (1996)
2. Blast 3.0 User Manual. Blast Support Office, Department of Mechanical and Industrial Engineering, University of Illinois, Urbana-Champaign, Illinois (1992)
3. York, D.A., Cappiello, C.C.: DOE-2 Engineers Manual (Version 2.1A). Lawrence Berkeley Laboratory and Los Alamos National Laboratory, LBL-11353 and LA-8520-M (1981)
4. Park, C., Clarke, D.R., Kelly, G.E.: An Overview of HVACSIM+, a Dynamic Building/HVAC/Control System Simulation Program. 1st Ann. Build. Energy Sim. Conf., Seattle, WA (1985)
5. Klein, S.A., Beckman, W.A., Duffie, J.A.: TRNSYS – a Transient Simulation Program. ASHRAE Trans., 82, Pt. 2 (1976)
6. Sahlin, P.: IDA – a Modelling and Simulation Environment for Building Applications. Swedish Institute of Applied Mathematics, Report No. 1991:2 (1991)
7. Wright, J.A., Hanby, V.I.: The Formulation, Characteristics, and Solution of HVAC System Optimised Design Problems. ASHRAE Trans., 93, Pt. 2 (1987)
8. Goldberg, D.E.: Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading, MA, USA (1989)
9. Hanby, V.I., Wright, J.A.: HVAC Optimisation Studies: Component Modelling Methodology. Building Services Engineering Research and Technology, 10 (1), 35–39 (1989)
10. Sowell, E.F., Taghavi, K., Levy, H., Low, D.W.: Generation of Building Energy Management System Models. ASHRAE Trans., 90, Pt. 1, 573–586 (1984)
11. Volkov, A.: Automated Design of Branched Duct Networks of Ventilation and Air-Conditioning Systems. Proc. of the CLIMA 2000 Conference, Napoli, Italy, 15–18.09.2001
12. Bentley, P.J. (ed.): Evolutionary Design by Computers. John Wiley Ltd., London (2000)
13. Angelov, P.P., Hanby, V.I., Wright, J.A.: An Automatic Generator of HVAC Secondary Systems Configurations. Technical Report, Loughborough University, UK (1999) 1–17
14. Wright, J.A., Angelov, P.P., Zhang, Y., Buswell, R.A., Hanby, V.I.: Building System Design Synthesis and Optimization. Final Report to 1049-RP, ASHRAE (2002)
Studying the Advantages of a Messy Evolutionary Algorithm for Natural Language Tagging

Lourdes Araujo
Dpto. Sistemas Informáticos y Programación, Universidad Complutense de Madrid
[email protected]
Abstract. The process of labeling each word in a sentence with one of its lexical categories (noun, verb, etc.) is called tagging, and is a key step in parsing and many other language processing and generation applications. Automatic lexical taggers are usually based on statistical methods, such as Hidden Markov Models, which work with information extracted from large available tagged corpora. This information consists of the frequencies of the contexts of the words, that is, of the sequences of their neighbouring tags. Thus, these methods rely on the assumption that the tag of a word depends only on its surrounding tags. This work proposes the use of a messy evolutionary algorithm to investigate the validity of this assumption. This algorithm is an extension of the fast messy genetic algorithms, a variety of genetic algorithms that improve the survival of high-quality partial solutions, or building blocks. Messy GAs do not require all genes to be present in the chromosomes, and genes may also appear more than once. This allows us to study the kind of building blocks that arise, thus obtaining information on possible relationships between the tag of a word and the tags at any other position in the sentence. The paper describes the design of a messy evolutionary algorithm for the tagging problem and a number of experiments on the performance of the system and the parameters of the algorithm.
1 Introduction
The process of labeling each word in a sentence of a text with its lexical category (noun, verb, etc.) is called tagging and is a key step in the parsing process and many other natural language processing and generation applications: machine translation, information retrieval, speech recognition and generation, etc. A word may have more than one possible tag (lexical ambiguity), and thus disambiguation methods are required to proceed with the tagging. There are different approaches to automatic tagging based on statistical information, which use large amounts of data to establish the probabilities of the tag assignments. Most of them are based on Hidden Markov Models (HMMs) and variants [11,4,13,3] and neither require knowledge of the rules of the language nor try to deduce them. Others are rule-based approaches that apply language rules to improve the accuracy of the tagging. The Brill system [2] extracts these rules from a training corpus, obtaining performance competitive with stochastic taggers.

Supported by project PR1/03-11588.

E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 1951–1962, 2003. © Springer-Verlag Berlin Heidelberg 2003

Evolutionary algorithms present an appealing tradeoff between efficiency and generality, and for this reason they have been extensively applied to complex optimization problems. They have previously been applied to tagging [1]. In the evolutionary tagger, individuals are sequences of tags assigned to the words of a sentence. This model is a variant of the HMM in which disambiguation is introduced by assigning different probabilities to a given tag depending on which are the neighbouring tags (context) on both sides of the word. The computation of the fitness of the individuals is based on data extracted from a training corpus tagged by hand. These data are organized as contexts. The tagger learns from a training corpus so as to produce a table of rules (contexts) called the training table. This table records the different contexts of each tag. It can be computed by going through the training text and recording the different contexts and the number of occurrences of each of them for every tag in the training text. Results indicate that the evolutionary approach to tagging natural language texts obtains accuracies comparable to other statistical approaches, while improving on the robustness of the typical algorithms used for the same purpose in other stochastic tagging approaches (such as the widely used Viterbi algorithm). These methods typically perform at about the 96% level of correctness [3] (percentage of words correctly tagged). However, there is a limit beyond which no further improvement can be obtained, either by enlarging the size of the context or the size of the training text.
This leads us to investigate the application of messy GAs to the tagging problem, with the aim of studying the possible relationships between the tags of the sentence. Statistical methods for tagging rely on the assumption that the only influence on the correctness of a tag comes from the words surrounding the considered one. Accordingly, in the evolutionary tagger solutions are evaluated according to the position of the genes. In a messy GA, individuals may be composed of any set of genes, therefore relaxing the previous assumption and allowing us to investigate other kinds of dependencies among the words to be tagged. Messy GAs do not require all genes to be present in the chromosomes, and genes may also appear more than once. This allows us to study the kind of building blocks that arise, thus obtaining information on possible relationships between the tag of a word and the tags at any other position in the sentence. Accordingly, this paper presents an extension of the fast messy GAs to a fast messy evolutionary algorithm (EA) for the tagging problem, which works not with bit strings but with tag strings.

1.1 Messy Genetic Algorithms
Messy genetic algorithms [5] are variants of genetic algorithms that improve the survival of high-quality partial solutions, or building blocks. This is achieved by explicitly composing increasingly long and highly fit strings from previously obtained and tested building blocks. This technique tackles the problem of building block disruption (the linkage problem [9]), which is mainly due to the fixed mapping from solutions to the representations of individuals, or chromosomes, and to the way in which the crossover operator combines two chromosomes to obtain a new one. Classic crossover operators often break promising building blocks, leading the algorithm to a local optimum. Furthermore, because messy GAs expedite the presence of high-quality building blocks, they increase the probability of exchanging different building blocks [7]. Messy GAs use variable-length strings that are sequences of messy genes. A messy gene is an ordered pair composed of a position and a value. Recording this information allows any building block to achieve tight linkage. Messy GAs do not require all genes to be present in the chromosomes. Therefore, under- or overspecified chromosomes may arise. For example, in the messy string ((1 1) (3 0) (1 0)), the leftmost element specifies gene 1 and its allele value is 1. In a three-bit problem, this chromosome is underspecified, because gene 2 is absent, and overspecified, because gene 1 appears twice. Overspecification is handled by applying a first-come-first-served rule in left-to-right order. As for underspecification, some problems do not require it to be dealt with in any special way, because structures of any size can be interpreted and evaluated. Otherwise, the missing genes are filled with those of a competitive template, a string that is locally optimal. A messy GA proceeds in two phases. The first, called the primordial phase, aims to select tight building blocks. It is achieved by initializing the population with all possible building blocks of a specified length. The proportion of good building blocks is improved by applying selection alone for a number of generations.
Furthermore, the population size is usually reduced at particular intervals. Thereafter, the juxtaposition phase applies different genetic operators, such as cut and splice, with the objective of recombining the building blocks obtained in the first phase. In order to deal with chromosomes of variable length, the traditional crossover operator is replaced by the complementary operators cut and splice. Cut divides a chromosome with a specified probability, proportional to the length of the chromosome, while splice joins two chromosomes. The previous description corresponds to the original messy GA, which suffers from an initialization bottleneck due to the generation of all building blocks of a particular size k: it requires a population of O(l^k) individuals, l being the length of the problem strings. This problem has been dealt with by the so-called fast messy GAs [7,10]. Fast messy GAs use smaller populations composed of longer chromosomes. However, these longer chromosomes introduce too many erroneous gene positions [6]. This effect is handled by the mechanism of building block filtering: the process begins with long chromosomes, which are reduced by applying selection and random deletion at specific intervals. Another feature of messy GAs is the use of thresholding selection, which requires any two chromosomes to share some threshold number of genes before competing for selection.
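The messy-GA mechanics described above (first-come-first-served expression of over/underspecified chromosomes, plus the cut and splice operators) can be sketched as follows; the probability constant and function names are illustrative assumptions.

```python
import random

def express(chrom, template):
    # Overspecification: first-come-first-served, left to right.
    # Underspecification: missing genes taken from a competitive template.
    values = {}
    for pos, val in chrom:
        values.setdefault(pos, val)
    return {p: values.get(p, v) for p, v in template.items()}

def cut(chrom, p_per_gene=0.1, rng=random):
    # Cut at a random point with probability proportional to chromosome length.
    if len(chrom) > 1 and rng.random() < p_per_gene * (len(chrom) - 1):
        point = rng.randrange(1, len(chrom))
        return chrom[:point], chrom[point:]
    return chrom, []

def splice(a, b):
    # Splice simply concatenates two variable-length chromosomes.
    return a + b

# The three-bit example from the text: gene 2 comes from the template,
# and the second occurrence of gene 1 is ignored (first-come-first-served).
chrom = [(1, 1), (3, 0), (1, 0)]
assert express(chrom, {1: 0, 2: 0, 3: 1}) == {1: 1, 2: 0, 3: 0}
left, right = cut(chrom, p_per_gene=1.0, rng=random.Random(0))
assert splice(left, right) == chrom
```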
The rest of the paper proceeds as follows: Sect. 2 describes the main elements of the fast messy EA for tagging; Sect. 3 is devoted to evaluating the model; and Sect. 4 draws the main conclusions of this work.
2 Messy Evolutionary Algorithm for Probabilistic Tagging
The fast messy EA for tagging works with a population of sequences of genes (chromosomes) of variable length. Each gene consists of a position indicating the word of the sentence to be tagged and a tag chosen for that word. A word may appear any number of times in the chromosome or be missing at all. Let us considered the following words and their tags: Rice: (NOUN), flies: (NOUN, VERB), like: (PREP, VERB), sand: (NOUN). Then a possible individual when tagging the sentence Rice flies like sand is shown in Fig. 1. The algorithm uses a pattern for the evaluation of the underspecified chromosomes. A pattern is a complete tagging of the sentence. Initially each tag of this pattern is randomly selected with a probability proportional to the frequency of the tag. Then the algorithm performs a number of iterations, each of which is devoted to building up increasingly longer building blocks. The pattern is updated after every iteration step, replacing its tags by those present in the best current individual. Figure 2 shows a pattern for the sentence of the previous example. When the individual of Fig. 1 is evaluated, the tag for the word flies, which is missing in the chromosome, is taken from the pattern. As in a messy GA, each iteration proceeds in three consecutive phases (see Fig. 3). In the initialization phase, a population of individuals is generated to represent the classes corresponding to each set of genes. In the original messy GA the initial population was composed of a single copy of all substrings of length k. This initialization ensured that all building blocks of the considered length were present but led to a bottleneck for a problem length l moderately large. This is avoided by a technique called probabilistically complete initialization [7], which creates a population of longer individuals ensuring that, with high probability, any building block appears at least once. 
Two parameters control the procedure: the length of the initial individuals and the population size. Let us call l′ the length of the initial individuals, a value larger than k and smaller than l. The results obtained in [7,8] indicate that if k < l/2 and we set l′ = l − k, a population size of O(l) is a good choice.

Fig. 1. Example of an individual for the sentence Rice flies like sand: Rice/NOUN, like/VERB, sand/NOUN (the word flies is missing)

Fig. 2. Example of a pattern for the sentence Rice flies like sand: Rice/NOUN, flies/NOUN, like/VERB, sand/NOUN
Studying the Advantages of a Messy Evolutionary Algorithm
1955
function Fast messy EA()
  pattern = random tagging(sentence);
  for level = initial level to final level do {
    Probabilistic initialization(Population, level);
    Primordial phase(Population, pattern, level);
    Juxtaposition phase(Population, pattern);
    pattern = update pattern(pattern, best(Population));
  }

Fig. 3. Scheme of the fast messy EA for tagging
These results were obtained assuming two alleles for each gene, but they can be extended to an arbitrary number of alleles. Accordingly, we adopt these results and take, in our case, the population size to be a function of the length of the sentence. Our experimental results serve as an a posteriori justification.

2.1 The Primordial Phase
After the initialization, the primordial phase (Fig. 4) selects the best individuals of each combination of genes of size k. Because chromosomes in a messy algorithm may contain very different sets of genes, and thus have little in common, a special selection scheme is required to make the competition fair. This scheme is called thresholding selection, and it is implemented by applying tournament selection only between two individuals that have in common a number of genes greater than a threshold value. In our case, the number of common genes is computed as the number of different words of the sentence tagged in both chromosomes.

function Primordial phase(Population, pattern, level)
  level temp = length(Population[0]) - level;
  while (level temp > level) do {
    Selection(Population);
    Gene deletion(Population);
    Evaluation(Population);
    level temp--;
  }

Fig. 4. Scheme of the primordial phase

Individual Evaluation. The fitness function is related to the total probability of the sequence of tags assigned to the sentence. The raw data to obtain this probability are extracted from the training table. The fitness of an assignment is defined as the sum of the fitnesses of its positions, Σᵢ f(gᵢ). The fitness of a position g is defined as f(g) = log P(T | LC, RC), where P(T | LC, RC) is the probability that the tag of position g is T, given that its context is formed by the sequence of tags LC to the left and the sequence of tags
1956
L. Araujo
RC to the right. This probability is estimated from the training table as

P(T | LC, RC) ≈ occ(LC, T, RC) / Σ_{T′∈T} occ(LC, T′, RC)
where occ(LC, T, RC) is the number of occurrences of the sequence of tags LC, T, RC in the training table, and T′ ranges over the set of all possible tags of gᵢ. The contexts corresponding to positions at the beginning and the end of the sentence lack tags on the left-hand side and on the right-hand side, respectively. This is handled by introducing a special tag, NULL, to complete the context. In our case, chromosomes are evaluated by first building a complete tagging of the sentence, with tags coming from the individual if it contains any instance of the corresponding gene and from the pattern otherwise. For overspecified genes we take the first occurrence of the gene in left-to-right order. For example, if we are evaluating the individual of Fig. 1 with contexts composed of one tag on the left and one tag on the right of the evaluated position, the second gene, for which there are two possible tags, NOUN (the one chosen in this individual) and VERB, will be evaluated as

#(NOUN NOUN VERB) / [#(NOUN NOUN VERB) + #(NOUN VERB VERB)]

where # represents the number of occurrences of the context. The remaining genes are evaluated in the same manner. Another mechanism introduced in this phase is the deletion of genes, which reduces the length of the individuals from l′ to k. The genes to be erased are randomly selected. We expect to have enough copies of the best building blocks that some of them still remain after the random deletion.

2.2 The Juxtaposition Phase
The last phase, called juxtaposition (Fig. 5), is devoted to combining the tight building blocks obtained in the previous phase. This phase also uses thresholding selection, as well as the cut-and-splice operator to combine individuals, and the mutation operator. The rate of application of the cut operator increases with the length of the individuals, while splicing is applied at a fixed rate.

function Juxtaposition phase(Population, pattern)
  generation = 0;
  while (generation < max generation) && not convergence do {
    Selection(Population);
    Cut splice(Population);
    Mutation(Population);
    Evaluation(Population);
    generation++;
  }

Fig. 5. Scheme of the juxtaposition phase

Thus, in
the beginning of the evolution process, cut-and-splice behaves almost like splice alone, increasing the length of the individuals. Later on, when the individuals are long enough, cut and splice are applied together, producing an effect similar to a one-point crossover operator. Mutation is then applied to every gene of the chromosomes resulting from the cut-and-splice operation, with a probability given by the mutation rate. The tag at the mutation point is replaced by another of the valid tags of the corresponding word. The new tag is randomly chosen according to its probability (the frequency with which it appears in the corpus). Chromosomes resulting from the application of the genetic operators replace an equal number of individuals, selected with a probability inversely proportional to their fitness.
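The cut-and-splice and mutation operators described above can be sketched as follows. This is an illustrative reconstruction; the names, rates, and the uniform tag choice are simplifying assumptions (the paper chooses the new tag by its corpus frequency).

```python
import random

def cut_and_splice(parent_a, parent_b, rng):
    """Cut each parent at a random point and splice the pieces crosswise,
    which on long chromosomes resembles one-point crossover."""
    cut_a = rng.randrange(1, len(parent_a))
    cut_b = rng.randrange(1, len(parent_b))
    return (parent_a[:cut_a] + parent_b[cut_b:],
            parent_b[:cut_b] + parent_a[cut_a:])

def mutate(chromosome, valid_tags, rate, rng):
    """With probability `rate`, replace a gene's tag by a valid tag of the
    corresponding word (chosen uniformly here for simplicity)."""
    return [(pos, rng.choice(valid_tags[pos])) if rng.random() < rate else (pos, tag)
            for pos, tag in chromosome]

rng = random.Random(42)
a = [(0, "NOUN"), (1, "NOUN"), (2, "VERB")]
b = [(2, "PREP"), (3, "NOUN")]
child_a, child_b = cut_and_splice(a, b, rng)
print(len(child_a) + len(child_b))  # the total number of genes (5) is preserved
```

Note that cut-and-splice conserves the total number of genes across the two children, while allowing the individual chromosome lengths to grow or shrink.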
3
Experiments
The algorithm has been implemented in C++ on a Pentium II PC. Tables 1 and 2 compare the accuracy obtained with a classic EA and with the fast messy EA for the tagging problem. Experiments have been carried out using a training corpus of 185,000 words extracted from the Brown corpus [12], a test text of 500 words, population sizes of once, twice, or three times the length of the sentence to be tagged, a crossover rate of 30%, and a mutation rate of 1%. Table 1 presents the highest accuracy achieved in ten runs for each population size; Table 2 presents the average and standard deviation of these ten runs for a population size of twice the sentence length, as well as the results of a t-test to assess the statistical significance of the differences in accuracy between the two methods.

Table 1. Largest accuracy of ten runs obtained with the classic EA and with the fast messy EA for different numbers of generations and population sizes for the tagging problem. CEA stands for classic evolutionary algorithm and FMEA for fast messy evolutionary algorithm. In the FMEA the number of generations refers to the juxtaposition phase. The range of sizes of building blocks in the primordial phase has been |s|/3.5 – |s|/2.5. The threshold value to allow competition during selection has been |s| − 2, and the threshold value to begin to apply cut has been |s|/1.5. The last two columns show the number of fitness evaluations for PS = |s|*2
                 Accuracy                                           Evaluations
                 PS = |s|        PS = |s|*2      PS = |s|*3         PS = |s|*2
                 CEA    FMEA     CEA    FMEA     CEA    FMEA        CEA      FMEA
 5 generations   95.44  96.40    95.44  96.40    95.20  95.62       98964    11521
10 generations   95.44  96.48    96.16  96.88    95.68  96.64       102944   21494
20 generations   95.68  96.88    96.16  97.12    96.16  96.68       109964   38136
30 generations   95.44  96.88    95.92  96.88    95.92  96.64       117337   51225
40 generations   95.68  96.40    95.92  96.88    95.92  96.64       123757   60728
Table 2. Study of the statistical significance of the results obtained for a population size of 2*|s|. CEA stands for classic evolutionary algorithm and FMEA for fast messy evolutionary algorithm. The last column reports the probability that the two samples (the accuracies of the ten runs with the CEA and those of the FMEA) arise from the same probability distribution (a small number means that the difference in the means is statistically significant)
                 CEA                 FMEA
                 Mean   St. dev.     Mean   St. dev.   t-test signif.
 5 generations   95.5   0.441        96.1   0.415      0.0057
10 generations   95.5   0.411        95.8   0.343      0.17
20 generations   95.5   0.435        96.3   0.509      0.0015
30 generations   95.4   0.518        96.1   0.610      0.020
40 generations   95.5   0.499        96.5   0.495      0.0002
We can observe that the messy algorithm slightly but systematically improves the results. Although the attainable accuracy is limited a priori by the statistical information used in the fitness function, the systematic improvement obtained suggests the existence of some correlation between the word to be tagged and distant words, which is brought to light by the nature of the messy EA. This hypothesis is supported by the shape of the building blocks obtained from the primordial phase. Figure 6 shows some instances of building blocks obtained from the primordial phase for a sentence of 29 words and a level k = 7. A '0' indicates that the gene corresponding to the word at that position in the sentence is absent, and any other number that it is present at least once. A '1' indicates that the gene is present but the tag is wrong, a '2' that the tag is right, and a '3' that the tag is right while it was wrong in the tagging performed with the classic EA. We can observe that the worst building blocks present very sparse genes. We can also observe that although some of the best building blocks, such as number 9, have their genes tightly grouped, there are also others, such as number 2, with some sparse genes. Furthermore, some of the tags that have been
    Building block               Fitness
1   0220012100000010220000302   20.2156
2   0020002022020002002000302   22.1340
3   0020022000202012000002020   21.6027
4   0000200020002220020202300   21.6340
5   0023210000020010202020000   20.7156
6   0221200000002210020002022   21.1027
7   0203200120020002000020002   21.1352
8   0020212000220000200020020   21.2468
9   0020002220002200000020302   22.1340

Fig. 6. Samples of building blocks obtained from the primordial phase for a sentence of 29 words and a level k = 7. Average fitness is 20.92
correctly assigned by the messy EA but wrongly by the classic EA (marked '3') appear among sparse genes. The evaluation of contexts containing missing genes amounts to taking the tags of the pattern, and thus, in general, at least in the primordial phase, the resulting context will be composed of unrelated tags. For the same reason, we expect consecutive genes to increase the fitness, because with high probability they correspond to frequent contexts (they have survived the selection of the primordial phase). Therefore, the selection of chromosomes with separate genes may indicate a correlation between tags through some hidden indirect mechanism. To illustrate this, assume the following pattern for a sentence with n words:

P : P1, P2, ..., Pn

where Pi is the tag assigned to the word wi. Let us now consider the following individuals, where '–' denotes a missing gene and T (resp. T′) represents the tag they have assigned to a word:

I1 : T1, T2, T3, T4, ...
I2 : T′1, –, T′3, –, T′5, ...

Assuming that a context is composed of one tag on the left and one on the right, the evaluation of I1 is done according to the frequencies of the contexts (T1, T2), (T1, T2, T3), (T2, T3, T4), etc., which must have high frequencies because they have been selected. However, the evaluation of I2 is done with the contexts (T′1, P2), (T′1, P2, T′3), (P2, T′3, P4), etc., composed of tags that are not necessarily related. Therefore, in general, these contexts will have low frequencies and such individuals will hardly appear. Accordingly, the actual occurrence of an individual such as

I : T′1, –, –, T′4, T′5, ...

may indicate a hidden relation between T′1 and T′4, T′5 (e.g. they might correspond to words separated by a subordinate phrase, or to words that have only one possible tag).
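The context construction just described can be sketched as follows. This is an illustrative reconstruction; the function names and the MISSING placeholder are assumptions.

```python
MISSING = None

def build_contexts(individual_tags, pattern):
    """individual_tags[i] is the tag the individual assigns to word i, or
    MISSING; missing tags are filled in from the pattern. Each gene's
    context is one tag to the left and one to the right, with NULL at
    the sentence boundaries."""
    tags = [t if t is not MISSING else p
            for t, p in zip(individual_tags, pattern)]
    contexts = []
    for i, tag in enumerate(tags):
        left = tags[i - 1] if i > 0 else "NULL"
        right = tags[i + 1] if i < len(tags) - 1 else "NULL"
        contexts.append((left, tag, right))
    return contexts

# I2 = T'1, -, T'3, -, T'5 over pattern P1..P5 yields mixed contexts
# such as (T'1, P2, T'3) and (P2, T'3, P4), as in the discussion above.
pattern = ["P1", "P2", "P3", "P4", "P5"]
i2 = ["T1", MISSING, "T3", MISSING, "T5"]
print(build_contexts(i2, pattern))
```

The mixed individual/pattern contexts produced for I2 are exactly the ones argued above to have low frequency in the training table.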
In this way the messy EA, although less efficient than classic EAs (the numbers of fitness evaluations differ by more than a factor of two; see the last two columns of Table 1), helps to uncover hidden relations between the words of the sentence.

3.1 Study of the Algorithm Parameters
Two fundamental parameters for the probabilistic initialization phase in a messy EA are the length of the chromosomes and the size of the initial population. Experiments have been carried out in order to determine the most appropriate values of these parameters for our problem. Because the text is tagged sentence by sentence, it seems reasonable to adjust these parameters as a simple function of the length of the sentence. Figure 7 shows the results obtained when varying the length of the chromosomes for different population sizes. All parameters are expressed as a function of the length of the sentence. We can observe that values of the order of the problem length are enough. A parameter
Fig. 7. Accuracy obtained with different chromosome sizes (from |s| to |s|*3, where |s| is the sentence length). Each curve corresponds to a different population size: equal to the length of the sentence (PS = |s|), twice that length (PS = |s|*2), and three times that length (PS = |s|*3)
of the algorithm that needs to be studied is the threshold value for the measurement of similarity between two chromosomes, used to decide whether they are similar enough for a comparison to make sense. Table 3(a) shows the results for several values defined as a function of the length of the sentence to be tagged. The similarity is measured as the number of different words of the sentence that correspond to some gene in both chromosomes. The best result is obtained when the required similarity equals the length of the sentence minus 2. This indicates that only very similar chromosomes should be compared. Another parameter to be fixed in a messy EA is the threshold value on the chromosome size above which the cut operator begins to be applied. Table 3(b) shows the results.
Table 3. Table (a) presents the accuracy obtained with different values of the threshold established to consider two chromosomes similar enough to be compared. Table (b) shows the accuracy obtained with different values of the threshold size of the chromosomes to apply the cut operator. Table (c) presents the accuracy obtained with different ranges of sizes of building blocks in the primordial phase. |s| stands for the length of the sentence. When varying one of the three parameters, the other two are assigned the value marked in boldface

(a) Thres. similarity   Acc.
    |s|                 96.16
    |s| - 1             96.64
    |s| - 2             97.12
    |s| - 3             95.44

(b) Thres. length       Acc.
    |s| / 1.3           95.68
    |s| / 1.4           96.88
    |s| / 1.5           97.12
    |s| / 1.6           96.16
    |s| / 1.7           95.68

(c) Range                  Acc.
    |s| / 3   – |s| / 2    96.40
    |s| / 4   – |s| / 3    96.88
    |s| / 3.5 – |s| / 2.5  97.12
    |s| / 3   – |s| / 2.5  96.88
The best accuracy is obtained when cut is applied only to individuals longer than two thirds of the length of the sentence. Finally, the range of sizes of building blocks explored in the primordial phase has also been studied. Table 3(c) shows the results obtained. In this case, the greatest accuracy is obtained when the size of the building blocks explored ranges between the length of the sentence divided by 3.5 and the length divided by 2.5.
4
Conclusions
This work has investigated the kind of dependencies that exist between words in the tagging problem, that is, the assignment of lexical categories to the words of a text. This has been done by applying a fast messy evolutionary algorithm to the problem. This algorithm is an extension of the fast messy genetic algorithm, a variety of genetic algorithm that improves the survival of high-quality partial solutions, or building blocks. It is thus a technique to tackle the problem of building-block disruption, or linkage problem, partially due to the fixed mapping from solutions to the representations of individuals. In a messy GA, individuals are variable-length strings of messy genes. A messy gene is an ordered pair composed of a position and a value. Recording this information allows any building block to achieve tight linkage. These algorithms obtain, in a phase called the primordial phase that precedes the evolutionary process, a set of tight building blocks whose features can provide information about the internal dependencies of the problem. Results obtained for the tagging problem slightly but systematically outperform those obtained with a classic evolutionary algorithm, thus indicating the existence of relations between tags other than those between neighbouring words. This idea is also supported by the kind of building blocks obtained from the primordial phase. Some of them present their genes grouped, but in others the genes are mainly scattered in the proximity of the group or form other groups. These results indicate the presence of more complex relationships between words. Accordingly, in the future these possible relationships will be investigated by studying the dependencies between the tagging and parsing problems. A number of parameters affecting the performance of the algorithm have also been investigated. These parameters have been assigned values as a function of the length of the sentence.
The study of the size of the chromosomes and of the initial population size in the probabilistic initialization phase indicates that values close to the length of the sentence, or twice this length, are enough to obtain the best results. The experiments on the threshold value used to consider two chromosomes comparable indicate that very similar chromosomes are required (differing at most in the presence of two genes).
References

1. L. Araujo. A parallel evolutionary algorithm for stochastic natural language parsing. In Proc. of the Int. Conf. Parallel Problem Solving from Nature (PPSN VII), 2002.
2. E. Brill. Transformation-based error-driven learning and natural language processing: A case study in part of speech tagging. Computational Linguistics, 21(4), 1995.
3. E. Charniak. Statistical Language Learning. MIT Press, 1993.
4. D. Cutting, J. Kupiec, J. Pedersen, and P. Sibun. A practical part-of-speech tagger. In Proc. of the Third Conf. on Applied Natural Language Processing. Association for Computational Linguistics, 1992.
5. D.E. Goldberg, B. Korb, and K. Deb. Messy genetic algorithms: motivation, analysis, and first results. Complex Systems, 3:493–530, 1989.
6. D.E. Goldberg, B. Korb, and K. Deb. Messy genetic algorithms revisited: Studies in mixed size and scale. Complex Systems, 4:415–444, 1990.
7. D.E. Goldberg, K. Deb, H. Kargupta, and G. Harik. Rapid, accurate optimization of difficult problems using fast messy genetic algorithms. In Proc. of the Fifth International Conference on Genetic Algorithms, pages 56–64. Morgan Kaufmann Publishers, 1993.
8. D.E. Goldberg, K. Deb, and J.H. Clark. Don't worry, be messy. In Proc. of the Fourth International Conference on Genetic Algorithms and their Applications, pages 24–30, 1991.
9. G.R. Harik and D.E. Goldberg. Learning linkage. In R.K. Belew and M.D. Vose, editors, Foundations of Genetic Algorithms 4, pages 247–262. Morgan Kaufmann, San Francisco, CA, 1997.
10. H. Kargupta. Search, polynomial complexity, and the fast messy genetic algorithm. Ph.D. thesis, University of Illinois at Urbana-Champaign, 1996.
11. B. Merialdo. Tagging English text with a probabilistic model. Computational Linguistics, 20(2):155–172, 1994.
12. W. Nelson Francis and H. Kucera. Manual of information to accompany a standard corpus of present-day edited American English, for use with digital computers. Technical report, Department of Linguistics, Brown University, 1979.
13. H. Schütze and Y. Singer. Part-of-speech tagging using a variable memory Markov model. In Proc. of the Annual Meeting of the Association for Computational Linguistics, 1994.
Optimal Elevator Group Control by Evolution Strategies

Thomas Beielstein¹, Claus-Peter Ewald², and Sandor Markon³

1 Universität Dortmund, D-44221 Dortmund, Germany
  [email protected]
  http://ls11-www.cs.uni-dortmund.de/people/tom/index.html
2 NuTech Solutions GmbH, Martin-Schmeisser Weg 15, D-44227 Dortmund, Germany
  [email protected]
  http://www.nutechsolutions.de
3 FUJITEC Co. Ltd. World Headquarters, 28-10, Shoh 1-chome, Osaka, Japan
  [email protected]
  http://www.fujitec.com
Abstract. Efficient elevator group control is important for the operation of large buildings. Recent developments in this field include the use of fuzzy logic and neural networks. This paper summarizes the development of an evolution strategy (ES) that is capable of optimizing the neuro-controller of an elevator group controller. It extends the results that were based on a simplified elevator group controller simulator. A threshold selection technique is presented as a method to cope with noisy fitness function values during the optimization run. Experimental design techniques are used to analyze first experimental results.
1
Introduction
Elevators play an important role in today's urban life. The elevator supervisory group control (ESGC) problem is related to many other stochastic traffic control problems, especially with respect to its complex behavior and the many difficulties in analysis, design, simulation, and control [Bar86,MN02]. The ESGC problem has been studied for a long time: first approaches were mainly analytical, derived from queuing theory; in the last decades artificial intelligence techniques such as fuzzy logic (FL), neural networks (NN), and evolutionary algorithms (EA) were introduced; today, hybrid techniques that combine the best methods from these different worlds enable further improvements. The present era of optimization could therefore be classified as the era of computational intelligence (CI) methods [SWW02,BPM03]. CI techniques might be useful as rapid development techniques to create a new generation of self-adaptive ESGC systems that can handle high maximum-traffic situations. In the following we consider an ESGC system that is based on a neural network to control the elevators. Some of the NN connection weights can be modified, so that different weight settings and their influence on the ESGC performance can be tested. Let x denote one weight configuration. We can define

E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 1963–1974, 2003.
© Springer-Verlag Berlin Heidelberg 2003
1964
T. Beielstein, C.-P. Ewald and S. Markon
the optimal weight configuration as x∗ = arg min f(x), where the performance measure f(·) to be minimized is defined later. The determination of an optimal weight setting x∗ is difficult, since it is not trivial (1) to find an efficient strategy that modifies the weights without generating too many infeasible solutions, and (2) to judge the performance, or fitness f(x), of one ESGC configuration. The performance of one specific weight setting x is based on simulations of specific traffic situations, which automatically lead to stochastically disturbed (noisy) fitness function values f̃(x). The rest of this article deals mainly with the second problem, especially with problems related to the comparison of two noisy fitness function values. Before we discuss a technique that might improve the comparison of stochastically disturbed values in Sect. 3, we introduce one concrete variant of the ESGC problem in the next section. The applicability of the comparison technique to the ESGC problem is demonstrated in Sect. 4, and Sect. 5 gives a summary and an outlook.
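The noisy-evaluation difficulty can be illustrated with a toy stand-in for the simulator. The quadratic surrogate for f and the Gaussian noise model are assumptions made purely for illustration; the real f is the simulator-based performance measure defined later.

```python
import random

def simulate(x, rng):
    """Toy stand-in for one simulator call: the unknown true performance
    f(x) (a quadratic surrogate here) plus Gaussian noise modeling the
    nondeterminism of service calls."""
    f_true = sum(xi ** 2 for xi in x)
    return f_true + rng.gauss(0.0, 1.0)

def estimate_fitness(x, n_runs, rng):
    """Average several simulator runs to reduce the noise; in the real
    problem this is costly, which motivates the threshold selection
    technique discussed in Sect. 3."""
    return sum(simulate(x, rng) for _ in range(n_runs)) / n_runs

rng = random.Random(1)
x = [0.5, -0.3]
print(estimate_fitness(x, n_runs=100, rng=rng))  # near the true value 0.34
```

A single call can easily misrank two configurations, whereas the averaged estimate converges toward f(x) at the usual 1/sqrt(n) rate.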
2
The Elevator Supervisory Group Control Problem
In this section we consider the elevator supervisory group control problem [Bar86,SC99,MN02]. An elevator group controller assigns elevators to service calls. An optimal control strategy is a precondition for minimizing service times and maximizing the elevator group capacity. Depending on current traffic loads, several heuristics for reasonable control strategies exist, but these in general lead to suboptimal controllers. Such heuristics have been implemented by Fujitec, one of the world's leading elevator manufacturers, using a fuzzy control approach. In order to improve the generalization of the resulting controllers, Fujitec developed a neural-network-based controller, which is trained by use of a set of the aforementioned fuzzy controllers, each representing control strategies for different traffic situations. This leads to robust and reasonable, but not necessarily optimal, control strategies [Mar95]. Here we are concerned with finding optimal control strategies for destination-call-based ESGC systems. In contrast to traditional elevators, where customers only press a button to request up or down service and choose the exact destination from inside the elevator car, a destination call system lets the customer choose the desired destination at a terminal before entering the elevator car. This provides more exact information to the group controller and allows higher efficiency by grouping passengers into elevators according to their destinations, but it also limits the freedom of decision: once a customer is assigned to a car and the car number appears on the terminal, the customer moves away from the terminal, which makes it very inconvenient to reassign the call to another car later. The concrete control strategy of the neural network is determined by the network structure and the neural weights. While the network structure as well as many
Fig. 1. The architecture of an elevator supervisory group control system [Sii97]
Fig. 2. Visualization of the output from the elevator group simulator
of the weights are fixed, some of the weights on the output layer, which have a major impact on the controller's performance, are variable and therefore subject to optimization. Thus, an algorithm is sought to optimize the variable weights of the neural controller. The controller's performance can be computed with the help of a discrete-event-based elevator group simulator. Figure 2 illustrates how the output from the simulator is visualized. Unfortunately, the resulting response surface shows some characteristics that make the identification of globally optimal weights difficult, if not impossible. The topology of the fitness function can be characterized as follows:
– highly nonlinear,
– highly multi-modal,
– of varying fractal dimension depending on the position inside the search space,
– randomly disturbed due to the nondeterminism of service calls,
– dynamically changing with respect to traffic loads,
– local measures such as gradients cannot be derived analytically.
Furthermore, the maximum number of fitness function evaluations is limited to the order of magnitude of 10^4, due to the computational effort of single simulator calls. Consequently, efficient and robust methods from the domain of black-box optimization are required, and evolutionary computation is one of the most promising approaches. We have therefore chosen an evolution strategy to perform the optimization [BFM00,Bey01]. The objective function for this study is the average waiting time of all passengers served during two hours of simulated elevator operation. Further experiments will be performed in the future to compare this objective function to more complex ones, which e.g. also take the maximum waiting time into account. There are three main traffic patterns occurring during a day: up-peak (morning rush hour), two-way (less intense, balanced traffic during the day), and down-peak traffic (rush hour at closing time). Each pattern makes up one third of the simulation time, which forces the resulting controller to cope with different traffic situations. As described above, the objective function values are massively disturbed by noise, since the simulator calculates different waiting times for the same objective variables (network weights); this stems from the changing passenger distributions generated by the simulator. For the comparison of different ES parameter settings, the final best individuals produced by the ES were assigned handling capacities at 30, 35, and 40 seconds. A handling capacity of n passengers per hour at 30 s means that the elevator system is able to serve a maximum of n passengers per hour without exceeding an average waiting time of 30 s. These values were obtained by running the simulator with varying random seeds and increasing passenger loads, using the network weights of the best individuals found in each optimization run. Finally, to enable the deployment of our standard evaluation process, we needed a single figure to minimize.
Therefore, the handling capacities were averaged and then subtracted from 3000 pass./h. The latter value was empirically chosen as an upper bound for the given scenario. The resulting fitness function is shown in (9) and is called ‘inverse handling capacity’ in the following.
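The 'inverse handling capacity' computation described above can be sketched directly. The function name is an assumption, and the sample capacities below are made up for illustration.

```python
UPPER_BOUND = 3000  # pass./h, empirically chosen upper bound for the scenario

def inverse_handling_capacity(hc_30s, hc_35s, hc_40s):
    """Average the handling capacities measured at 30, 35 and 40 seconds
    and subtract the result from the empirical upper bound, yielding a
    single figure to minimize."""
    return UPPER_BOUND - (hc_30s + hc_35s + hc_40s) / 3.0

# A controller serving e.g. 1800, 2100 and 2400 pass./h (made-up values):
print(inverse_handling_capacity(1800, 2100, 2400))  # 900.0
```

Lower values correspond to higher averaged handling capacity, so minimizing this figure maximizes the elevator group's throughput.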
3 Evolution Strategies and Threshold Selection

3.1 Experimental Noise and Evolution Strategies
In this section we discuss the problem of comparing two solutions, when the available information (the measured data) is disturbed by noise. A bad solution might appear better than a good one due to the noise. Since minor differences are unimportant in this situation, it might be useful to select only much better
values. A threshold value τ can be used to formulate the linguistic expression 'much better' mathematically:

x is much better (here: greater) than y ⇔ f̃(x) > f̃(y) + τ.    (1)
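Definition (1) translates directly into a comparison routine. A minimal sketch follows; the names are illustrative, not from the paper.

```python
def much_better(f_noisy_y, f_noisy_x, tau):
    """Accept candidate y over x (maximization) only when its noisy
    fitness exceeds x's by more than the threshold tau, as in (1)."""
    return f_noisy_y > f_noisy_x + tau

# A small improvement within the noise band is not accepted:
print(much_better(10.2, 10.0, 0.5))  # False
print(much_better(10.8, 10.0, 0.5))  # True
```

With tau = 0 this reduces to the ordinary noisy comparison; a positive tau trades away small true improvements to avoid accepting noise-induced ones.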
Since comparisons play an important role in the selection process of evolutionary algorithms, we will use the term threshold selection (TS) in the rest of this article. TS can be generalized to many other stochastic search techniques, such as particle swarm optimizers or genetic algorithms; we will consider only evolution strategies here. The goal of the ES is to find a vector x∗ for which

f(x∗) ≤ f(x)  ∀x ∈ D,    (2)
where the vector x represents a set of (object) parameters and D is some n-dimensional search space. An ES individual is usually defined as the set of object parameters x, strategy parameters s, and its fitness value f(x) [BS02]. In the following we consider fitness function values obtained from computer simulation experiments: in this case, the value of the fitness function depends on the seed that is used to initialize the random stream of the simulator. The exact fitness function value f(x) is replaced by the noisy fitness function value f̃(x) = f(x) + ε. In the theoretical analysis we assume normally distributed noise, that is, ε ∼ N(µ, σ²). It is crucial for any stochastic search algorithm to decide with a certain amount of confidence whether one individual has a better fitness function value than its competitor. This problem will be referred to as the selection problem in the following [Rud98,AB01]. Reevaluation of the fitness function can increase this confidence, but this simple averaging technique is not applicable to many real-world optimization problems, i.e. if the evaluations are too costly.

3.2 Statistical Hypothesis Testing
Threshold selection belongs to a class of statistical methods that can reduce the number of reevaluations. It is directly connected to classical statistical hypothesis testing [BM02]. Let x and y denote two object parameter vectors, with y being proposed as a 'better' (greater) one to replace the existing x. Using statistical testing, the selection problem can be stated as testing the null hypothesis H0: f(x) ≤ f(y) against the one-sided alternative hypothesis H1: f(x) > f(y). Whether the null hypothesis is accepted or not depends on the specification of a critical region for the test and on the definition of an appropriate test statistic. Two kinds of errors may occur in testing hypotheses: if the null hypothesis is rejected when it is true, an alpha error has been made; if the null hypothesis is not rejected when it is false, a beta error has been made. Consider a maximization problem: the threshold rejection probability Pτ− is defined as the conditional probability that a worse candidate y is rejected, i.e. that its average noisy fitness does not exceed that of candidate x by at least a threshold τ:

Pτ− := P{ f̄(y) ≤ f̄(x) + τ | f(y) ≤ f(x) },    (3)
T. Beielstein, C.-P. Ewald and S. Markon
Fig. 3. Hypothesis testing to compare models: we test whether threshold selection has a significant effect in the model. The normal QQ-plot, the box plot, and the interaction plots lead to the conclusion that threshold selection has a significant effect. Y denotes the fitness function values, which are based on the inverse handling capacities: smaller values are better, see (9). The threshold, the population size µ, and the selective pressure (offspring-parent ratio ν) are modified according to the values in Table 3. The function α(t) as defined in (6) was used to determine the threshold value.
where f̃(x) := Σ_{i=1}^{n} f̃(x_i)/n denotes the sample average of the noisy fitness values [BM02]. Pτ⁻ and the alpha error are complementary probabilities:

Pτ⁻ = 1 − α.   (4)
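As a quick sanity check on these definitions, the probability in (3) can be estimated by Monte Carlo simulation. The following Python sketch is illustrative only: the function names, the zero-mean noise, and the candidate values are our assumptions, not part of the original experiments.

```python
import random

# Illustrative sketch: noisy fitness f~(x) = f(x) + eps with eps ~ N(0, sigma^2),
# and a Monte Carlo estimate of the threshold rejection probability in (3).
def noisy(f_value, sigma, rng):
    # one noisy evaluation of a candidate with true fitness f_value
    return f_value + rng.gauss(0.0, sigma)

def p_tau_minus(f_x, f_y, tau, sigma=1.0, trials=100_000, seed=1):
    # estimate P{ f~(y) <= f~(x) + tau | f(y) <= f(x) } for a maximization task
    assert f_y <= f_x, "y must be the worse candidate (the conditioning event)"
    rng = random.Random(seed)
    hits = sum(noisy(f_y, sigma, rng) <= noisy(f_x, sigma, rng) + tau
               for _ in range(trials))
    return hits / trials
```

Raising τ increases the estimated rejection probability, mirroring the complementarity between Pτ⁻ and the alpha error in (4).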
Equation (4) reveals how TS and hypothesis testing are connected: maximization of the threshold rejection probability, an important task in ES-based optimization, is equivalent to minimization of the alpha error. Therefore, we apply techniques from statistical hypothesis testing to improve the behavior of ES in noisy environments. Our implementation of the TS method is based on the common practice in hypothesis testing of specifying a value for the probability of an alpha error, the so-called significance level of the test. In the first phase of the search process the explorative character is enforced, whereas in the second phase the main focus lies on the exploitative character. Thus, a technique similar to simulated annealing is used to modify the significance level of the test during the search process. In the following paragraph, we describe the TS implementation in detail.

3.3 Implementation Details
The TS implementation presented here is related to performing a hypothesis test of whether the measured difference in the expectations of the fitness function values is significantly different from zero. The test result (either a 'reject' or 'fail-to-reject' recommendation) determines whether a new individual is rejected or accepted during the selection process of an ES. In the following, we give some recommendations for a concrete implementation. We have chosen a parametric approach, although non-parametric approaches might be applicable in this context too. Let Y_i1, Y_i2, ..., Y_in (i = 1, 2) be a sample of n independent and identically distributed (i.i.d.) measured fitness function values. The n differences Z_j = Y_1j − Y_2j are also i.i.d. random variables (r.v.) with sample mean Z̄ and sample variance S²(n). Consider the following test on the means µ1 and µ2 of normal distributions: if H0: µ1 − µ2 ≤ 0 is tested against H1: µ1 − µ2 > 0, we have the test statistic

t0 = (z̄ − (µ1 − µ2)) / (s √(2/n)),   (5)

with sample standard deviation S (small letters denote realizations of the corresponding r.v.). The null hypothesis H0: µ1 − µ2 ≤ 0 is rejected if t0 > t_{2n−2,1−α}, where t_{2n−2,1−α} is the upper α percentage point of the t distribution with 2n − 2 degrees of freedom. Summarizing the methods discussed so far, we recommend the following recipe:

1. Select an alpha value. The α error is reduced during the search process; e.g. the function

   α(t) = (1 − t/t_max)/2,   (6)

   with t_max the maximum number of iterations and t ∈ {0, ..., t_max} the current iteration number, can be used.
2. Evaluate the parent and the offspring candidate n times. To reduce the computational effort, this sampling can be performed only every k generations.
3. Determine the sample variance.
4. Determine the threshold value τ = t_{2n−2,1−α} · s · √(2/n).
5. A new individual is accepted if its (perturbed) fitness value plus τ is smaller than the (perturbed) fitness value of the parent.

A first analysis of threshold selection can be found in [MAB+01,BM02]. In the following section we present experimental results based on these ideas.
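The five-step recipe can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the helper name `threshold_accept` is hypothetical, the t quantile is hardcoded for n = 5 and α = 0.05 (t_{8,0.95} ≈ 1.860) instead of being recomputed from the annealed α(t) of step 1, and a minimization task is assumed, as in step 5.

```python
import math

def threshold_accept(parent_f, offspring_f, n=5, t_quantile=1.860):
    # parent_f / offspring_f: callables returning one noisy fitness sample each.
    # t_quantile: upper-alpha point of the t distribution with 2n-2 degrees of
    # freedom (t_{8, 0.95} ~= 1.860 for n = 5, alpha = 0.05); with the annealed
    # alpha(t) of step 1 this quantile would change over the generations.
    ys_parent = [parent_f() for _ in range(n)]     # step 2: n evaluations each
    ys_off = [offspring_f() for _ in range(n)]
    z = [yp - yo for yp, yo in zip(ys_parent, ys_off)]
    zbar = sum(z) / n
    s2 = sum((zi - zbar) ** 2 for zi in z) / (n - 1)   # step 3: sample variance
    tau = t_quantile * math.sqrt(s2) * math.sqrt(2.0 / n)  # step 4: threshold
    # step 5 (minimization): accept the offspring only if its average noisy
    # fitness plus the threshold is smaller than the parent's average
    return sum(ys_off) / n + tau < sum(ys_parent) / n
```

With a noiseless fitness the sample variance, and hence the threshold, vanishes, and the rule reduces to ordinary plus selection.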
4 Threshold Selection in the Context of ESGC Optimization

4.1 Statistical Experimental Design
The following experiments have been set up to answer the question: how can TS improve the optimization process if only stochastically perturbed fitness function values (e.g. from simulation runs) can be obtained? To this end, we simulated alternative ES parameter configurations and examined their results. We wanted to find out whether TS has any effect on the performance of an ES, and whether there are any interactions between TS and other exogenous parameters such as population size or selective pressure. A description of the experimental design (DoE) methods we used is omitted here; [Kle87,LK00] give excellent introductions to design of experiments, and the applicability of DoE to evolutionary algorithms is shown in [Bei03]. The following vector notation provides a very compact description of evolution strategies [BS02,BM02]. Consider the following parameter vector of an ES parameter design:

p_ES = (µ, ν, S, nσ, τ0, τi, ρ, R1, R2, r0),   (7)
where ν := λ/µ defines the offspring-parent ratio and S ∈ {C, P} defines the selection scheme, resulting in a comma or plus strategy. The representation of the selection parameter S can be generalized by introducing the parameter κ that defines the maximum age (in generations) of an individual: if κ is set to 1, we obtain the comma strategy; if κ equals +∞, we model the plus strategy. The mixing number ρ defines the size of the parent family that is chosen from the parent pool of size µ to create λ offspring. We consider global intermediate (GI), global discrete (GD), local intermediate (LI), and local discrete (LD) recombination. R1 and R2 define the recombination operators for object and strategy variables, respectively, and r0 is the random seed. This representation will be used throughout the rest of this article and is summarized in Table 1. Typical settings are:

p_ES = (5, 7, 1, 1, 1/√(2√D), 1/√(2D), 5, GD, GI, 0).   (8)

Our experiment with the elevator group simulator involved three factors. A 2³ full factorial design has been selected to compare the influence of different population sizes, offspring-parent ratios, and selection schemes. This experimental
Optimal Elevator Group Control by Evolution Strategies
Table 1. DoE parameters for the ES

Symbol    Parameter                                      Recommended Values
µ         number of parent individuals                   10 ... 100
ν = λ/µ   offspring-parent ratio                         1 ... 10
nσ        number of standard deviations                  1 ... D
τ0        global mutation parameter                      1/√(2√D)
τi        individual mutation parameter                  1/√(2D)
κ         age                                            1 ... ∞
ρ         mixing number                                  µ
R1        recombination operator for object variables    {discrete}
R2        recombination operator for strategy variables  {intermediate}
Table 2. DoE parameters for the fitness function

Symbol   Parameter                                      Value
f        fitness function, optimization problem         ESGC problem, minimization, see (9)
D        dimension of f                                 36
N_exp    number of experiments for each scenario        10
N_tot    total number of fitness function evaluations   5 · 10³
σ        noise level                                    unknown
design leads to eight different configurations of the factors, shown in Table 1. The encoding of the problem parameters is shown in Table 2. Table 3 displays the parameter settings that were used during the simulation runs: the population size was set to 5 and 20, and the offspring-parent ratio to 2 and 5. This gives four different (parent, offspring) combinations: (5,10), (5,25), (20,40), and (20,100). Combined with the two selection schemes, we obtain eight run configurations. Each run configuration was repeated 10 times.

4.2 Analysis
Although a batch job processing system that enables parallel execution of simulation runs was used, more than a full week of round-the-clock computing was required to perform the experiments. In total, 80 configurations were tested (10 repeats of 8 different factor settings). The simulated inverse handling capacities can be summarized as follows:

Min.: 850.0    1st Qu.: 916.7    Median: 1033.3
Mean: 1003.1   3rd Qu.: 1083.3   Max.: 1116.7
A closer look at the influence of TS on the inverse handling capacities reveals that ES runs with TS have a lower mean (931.2) than simulations that used a
Table 3. ES parameter designs, ESGC model (D = 36)

Variable   Low (−1)   High (+1)
µ          5          20
ν          2          5
κ          +∞         Threshold Selection

The following values remain unchanged: nσ = 1, τ0 = 1/√(2√D), τi = 1/√(2D), ρ = µ, R1 = GD, R2 = GI.
plus selection strategy (1075.0). As introduced in Sect. 2, the fitness function reads (a minimization task):

g(x) = 3000.0 − f̄_pES(x),   (9)
where the settings from Table 3 were used, f̄_pES is the averaged handling capacity (pass./h), and x is a 36-dimensional vector that specifies the NN weights. The minimum fitness function value (850.0) was found by TS. A t-test (the null hypothesis 'the true difference in means is not greater than 0' tested against the alternative hypothesis 'the true difference in means is greater than 0') yields a p-value smaller than 2.2 · 10⁻¹⁶. An interaction plot displays the mean of the fitness function value for two-way combinations of factors, e.g. population size and selective strength, and can thus reveal possible interactions between factors. Considering the interaction plots in Fig. 3, we conclude that the application of threshold selection improves the behavior of the evolution strategy in every configuration tested. This improvement is independent of the other parameter settings of the underlying evolution strategy; thus, there is no indication that the results were caused by interactions between other factors.
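The eight run configurations described above follow mechanically from the 2³ design. The sketch below is a hypothetical encoding; only the level values are taken from Table 3.

```python
from itertools import product

# Hypothetical encoding of the 2^3 full factorial design of Table 3:
# population size mu, offspring-parent ratio nu, and selection scheme.
factors = {
    "mu": (5, 20),
    "nu": (2, 5),
    "selection": ("plus", "threshold"),
}
# one configuration per combination of factor levels
configs = [dict(zip(factors, levels)) for levels in product(*factors.values())]
for c in configs:
    c["lambda"] = c["mu"] * c["nu"]  # number of offspring, lambda = mu * nu
```

This reproduces the four (parent, offspring) combinations (5,10), (5,25), (20,40), and (20,100), each under both selection schemes, i.e. eight run configurations.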
5 Summary and Outlook
The elevator supervisory group control problem was introduced in the first part of this paper. Evolution strategies were characterized as efficient optimization techniques: they can be applied to optimize the performance of an NN-based elevator supervisory group controller. The implementation of a threshold selection operator for evolution strategies and its application to a complex real-world optimization problem has been shown. Experimental design methods have been used to set up the experiments and to perform the data analysis. The obtained results gave first hints that threshold selection might be able to improve the performance of evolution strategies if the fitness function values are stochastically
disturbed. The TS operator was able to improve the average passenger handling capacities for an elevator supervisory group control problem. Future research on the thresholding mechanism will investigate the following topics:
– Developing an improved sampling mechanism to reduce the number of additional fitness function evaluations.
– Introducing a self-adaptation mechanism for τ during the search process.
– Combining TS with other statistical methods, e.g. Stagge's efficient averaging method [Sta98].

Acknowledgments. T. Beielstein's research was supported by the DFG as part of the collaborative research center 'Computational Intelligence' (531). R, a language for data analysis and graphics, was used for the statistical analysis [IG96].
References

[AB01] Dirk V. Arnold and Hans-Georg Beyer. Investigation of the (µ, λ)-ES in the presence of noise. In J.-H. Kim, B.-T. Zhang, G. Fogel, and I. Kuscu, editors, Proc. 2001 Congress on Evolutionary Computation (CEC'01), Seoul, pages 332–339, Piscataway, NJ, 2001. IEEE Press.
[Bar86] G. Barney. Elevator Traffic Analysis, Design and Control. Cambridge U.P., 1986.
[Bei03] T. Beielstein. Tuning evolutionary algorithms. Technical Report 148/03, Universität Dortmund, 2003.
[Bey01] Hans-Georg Beyer. The Theory of Evolution Strategies. Natural Computing Series. Springer, Heidelberg, 2001.
[BFM00] Th. Bäck, D.B. Fogel, and Z. Michalewicz, editors. Evolutionary Computation 1 – Basic Algorithms and Operators. Institute of Physics Publ., Bristol, 2000.
[BM02] T. Beielstein and S. Markon. Threshold selection, hypothesis tests, and DOE methods. In Proceedings of the 2002 Congress on Evolutionary Computation (CEC 2002), pages 777–782. IEEE Press, 2002.
[BPM03] T. Beielstein, M. Preuss, and S. Markon. A parallel approach to elevator optimization based on soft computing. Technical Report 147/03, Universität Dortmund, 2003.
[BS02] Hans-Georg Beyer and Hans-Paul Schwefel. Evolution strategies – A comprehensive introduction. Natural Computing, 1:3–52, 2002.
[IG96] R. Ihaka and R. Gentleman. R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics, 5(3):299–314, 1996.
[Kle87] J. Kleijnen. Statistical Tools for Simulation Practitioners. Marcel Dekker, New York, 1987.
[LK00] Averill M. Law and W. David Kelton. Simulation Modelling and Analysis. McGraw-Hill, New York, 3rd edition, 2000.
[MAB+01] Sandor Markon, Dirk V. Arnold, Thomas Bäck, Thomas Beielstein, and Hans-Georg Beyer. Thresholding – a selection operator for noisy ES. In Proc. 2001 Congress on Evolutionary Computation (CEC'01), pages 465–472, Seoul, Korea, 2001. IEEE Press.
[Mar95] Sandor Markon. Studies on Applications of Neural Networks in the Elevator System. PhD thesis, Kyoto University, 1995.
[MN02] S. Markon and Y. Nishikawa. On the analysis and optimization of dynamic cellular automata with application to elevator control. In The 10th Japanese-German Seminar on Nonlinear Problems in Dynamical Systems, Theory and Applications, Hakui, Ishikawa, Japan, September 2002.
[Rud98] Günter Rudolph. On risky methods for local selection under noise. In Parallel Problem Solving from Nature – PPSN V, volume 1498 of Lecture Notes in Computer Science, pages 169–177. Springer, Berlin, 1998.
[SC99] A.T. So and W.L. Chan. Intelligent Building Systems. Kluwer A.P., 1999.
[Sii97] M.L. Siikonen. Planning and Control Models for Elevators in High-Rise Buildings. PhD thesis, Helsinki University of Technology, Systems Analysis Laboratory, 1997.
[Sta98] Peter Stagge. Averaging efficiently in the presence of noise. In Parallel Problem Solving from Nature – PPSN V, pages 188–197. Springer, Berlin, 1998.
[SWW02] H.-P. Schwefel, I. Wegener, and K. Weinert, editors. Advances in Computational Intelligence – Theory and Practice. Natural Computing Series. Springer, Berlin, 2002.
A Methodology for Combining Symbolic Regression and Design of Experiments to Improve Empirical Model Building

Flor Castillo, Kenric Marshall, James Green, and Arthur Kordon

The Dow Chemical Company, 2301 N. Brazosport Blvd, B-1217, Freeport, TX 77541, USA, 979-238-7554
{Facastillo,KAMarshall,JLGreen,Akordon}@dow.com
Abstract. A novel methodology for empirical model building using GP-generated symbolic regression in combination with statistical design of experiments as well as undesigned data is proposed. The main advantage of this methodology is maximum data utilization when extrapolation is necessary. The methodology offers alternative non-linear models that can either linearize the response in the presence of Lack of Fit or challenge and confirm the results from the linear regression in a cost-effective and time-efficient fashion. The economic benefit is the reduced number of additional experiments in the presence of Lack of Fit.
1 Introduction
The key issues in empirical model development are high-quality model interpolation and extrapolation capability outside the known data range. Of special importance to industrial applications is the second property, since changing operating conditions are more the rule than the exception. Using linear regression models based on well-balanced data generated by Design of Experiments (DOE) is the dominant approach to effective empirical modeling, and several techniques have been developed for this purpose [1]. However, in many cases, due to the non-linear nature of the system and unfeasible experimental conditions, it is not possible to develop a linear model. Among the several approaches that can be used either to linearize the problem or to generate a non-linear empirical model is Genetic Programming (GP)-generated symbolic regression. This novel approach uses simulated evolution to generate non-linear equations with high fitness [2]. The potential of symbolic regression for linearizing the response in statistical DOE when significant Lack of Fit is detected and additional experimentation is unfeasible was explored in [3]. The derived transformations based on the GP equations resolved the problem of Lack of Fit and demonstrated good interpolation capability in an industrial case study at The Dow Chemical Company. However, the extrapolation capability of the derived non-linear models is unknown and not built in, as it is in the case of linear models.
E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 1975–1985, 2003. © Springer-Verlag Berlin Heidelberg 2003
F. Castillo et al.
Extrapolation of empirical models is not always effective but is often necessary in chemical processes, because plants often operate outside the range of the original experimental data used to develop the empirical model. Making use of such data is of special importance, since time and cost restrictions on planned experimentation are frequently encountered. In this paper, a novel methodology integrating GP with designed and undesigned data is presented. It is based on a combination of linear regression models based on a DOE and non-linear GP-generated symbolic regression models, considering both interpolation and extrapolation capabilities. This approach has the potential to improve the effectiveness of empirical model building by maximizing the use of plant data and saving time and resources in situations where experimental runs are expensive or technically unfeasible. A case study with a chemical process was used to investigate the utility of this approach and to test the potential for model over-fitting. The promising results obtained give initial confidence in empirical model development based on GP-generated symbolic regression in conjunction with designed data and open the door to numerous industrial applications.
2 A Methodology for Empirical Model Building Combining Linear and Non-linear Models
With the growing research in evolutionary algorithms and the speed, power, and availability of modern computers, genetic programming offers a suitable possibility for real-world applications in industry. The GP-based approach offers four unique elements for empirical model building. First, previous modeling assumptions, such as independent inputs and an established error structure with constant variance, are not required [4]. Second, it generates a multiplicity of non-linear equations which have the potential of restoring Lack of Fit in linear regression models by suggesting variable transforms that can be used to linearize the response [3]. Third, it generates non-linear equations with high fitness representing additional empirical models, which can be considered in conjunction with linearized regression models or used to challenge and confirm results even when a linear model is not significant [3]. Fourth, unlike neural networks, which require a significant amount of data for training and testing, it can generate empirical models from small data sets.

One of the most significant challenges of genetic programming for empirical model building and for linearizing regression models is that it produces non-parsimonious solutions with chunks of inactive terms (introns) that do not contribute to the overall fitness [4]. This drawback can be partially overcome by computational algorithms that quickly select the most highly fit and less complex models, or by using parsimony pressure during the evolution process. Using DOE in conjunction with genetic programming results in a powerful approach that improves empirical model building and that may have significant economic impact by reducing the number of experiments. The following are the main components of the proposed methodology.

Step 1. The Experimental Design
A well-planned and executed experimental design (DOE) is essential to the development of empirical models for understanding causality in the relationship between the input
variables and the response variable. This component includes the selection of input variables and their ranges, as well as the response variable. It also includes the analysis and the development of a linear regression model for the response(s).

Step 2. Linearization via Genetic Programming
This step is necessary if the linear regression model developed previously presents Lack of Fit and additional experimentation to augment the experimental design, for example to a Central Composite Design (CCD), is not possible due to extreme experimental conditions. An alternative is to construct a Face-Centered Design by modification of the CCD [5]. However, this alternative often results in high correlation between the square terms of the resulting regression models. GP-generated symbolic regression can be employed to search for mathematical expressions that fit the given data set generated by the experimental design. This approach results in several analytical equations that offer a rich set of possible input transformations which can remove Lack of Fit without additional experimentation. In addition, it offers non-linear empirical models that are considered alongside the linearized regression model. The selection between these models is often a trade-off between model complexity and fitness. Very often the less complex model is easier to interpret and is preferred by plant personnel and process engineers.

Step 3. Interpolation
This step is a form of model validation in which the linearized regression model and the non-linear model(s) generated by GP are tested with additional data within the experimental region. Here again, model selection considers complexity, fitness, and the preference of the final user.

Step 4. Extrapolation
This step is necessary when the range of one or more of the variables is extended beyond the range of the original design and it is desirable to make timely predictions in this range.

Step 5. Linear and Non-linear Models for Undesigned Data and Model Comparison
When the models developed in the previous steps do not perform well for extrapolation, it is necessary to develop a new empirical model for the new operating region. The ideal approach is to run an additional experimental design in the new range, but often this cannot be done in a timely fashion. Furthermore, it is often desirable to use the data set already available. In these cases, while a multiple regression approach can be applied to build a linear regression model with all available data, the risk of collinearity (near-linear dependency among regression variables) must be evaluated, since the data no longer conform to an experimental design. High collinearity produces ambiguous regression results, making it impossible to estimate the unique effects of individual variables in the regression model. In this case regression coefficients have large sampling errors and are very sensitive to slight changes in the data and to the addition or deletion of variables in the regression equation. This affects model stability, inference, and forecasting based on the regression model. Of special interest here is a model developed with a GP-generated symbolic regression algorithm, because it offers additional model alternatives. One essential consideration in this case is that the models generated with this new data cannot be used to infer causality. This restriction comes from the fact that the
data no longer conform to an experimental design. However, the models generated (the linear regression and non-linear GP-generated models) can be used to predict the response in the new range. Both types of models are then compared in terms of complexity and fitness, allowing the final users to make the decision. The proposed methodology is illustrated with an industrial application in a chemical reactor.
3 Empirical Modeling Methodology for a Chemical Reactor
3.1 The Experimental Design and Transformed Linear Model (Steps 1 and 2 in Methodology)

The original data set corresponds to a series of designed experiments that were conducted to clarify the impact on the formation of a chemical compound as key variables are manipulated. The experiments consisted of a complete 2⁴ factorial design in the factors x1, x2, x3, x4 with three center points; nineteen experiments were performed. The response variable, Sk, was the yield or selectivity of one of the products. The factors were coded to a value of −1 at the low level, +1 at the high level, and 0 at the center point. The complete design in the coded variables is shown in Table 1, based on the original design in [3]. The selectivity of the chemical compound, Sk, was first fit to the following first-order linear regression equation, considering only terms that are significant at the 95% confidence level:

Sk = β0 + Σ_{i=1}^{k} βi xi + Σ_{i<j} βij xi xj   (1)
Table 1. 2⁴ factorial design with three center points

Run   x1   x2   x3   x4   Sk      Pk
1      1   -1    1    1   1.598   0.000
2      0    0    0    0   1.419   0.000
3      0    0    0    0   1.433   0.016
4     -1    1    1    1   1.281   0.016
5     -1    1   -1    1   1.147   0.009
6      1    1   -1    1   1.607   0.012
7     -1    1    1   -1   1.195   0.019
8      1    1    1   -1   2.027   0.015
9     -1   -1   -1    1   1.111   0.009
10    -1    1   -1   -1   1.159   0.007
11    -1   -1   -1   -1   1.186   0.006
12     1   -1   -1    1   1.453   0.013
13     1    1   -1   -1   1.772   0.006
14    -1   -1    1   -1   1.047   0.018
15    -1   -1    1    1   1.175   0.009
16     1    1    1    1   1.923   0.023
17     1   -1   -1   -1   1.595   0.007
18     1   -1    1   -1   1.811   0.015
19     0    0    0    0   1.412   0.017
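The model matrix implied by equation (1), an intercept plus main effects plus all two-factor interactions of the coded factors, can be sketched as follows (an illustrative helper, not part of the original analysis):

```python
from itertools import combinations

# Illustrative helper: expand one coded run of Table 1 into the model
# row for equation (1): intercept, main effects, two-factor interactions.
def model_row(x):
    row = [1.0]                      # intercept term beta_0
    row += list(x)                   # main-effect terms beta_i * x_i
    row += [x[i] * x[j]              # interaction terms beta_ij * x_i * x_j
            for i, j in combinations(range(len(x)), 2)]
    return row
```

For the four factors of Table 1 this gives 1 + 4 + 6 = 11 columns; the regression then keeps only the terms significant at the 95% level.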
Lack of Fit was induced (p = 0.046) by omission of experiment number 1 of the experimental design. This was done to simulate a common situation in industry in which LOF is significant and additional experimental runs are impractical due to the cost of experimentation, or technically unfeasible due to restrictions on experimental conditions. The GP algorithm (GP-generated symbolic regression) was applied to the same data set. The algorithm was implemented as a toolbox in MATLAB. Several models of the selectivity of the chemical compound as a function of the four experimental variables (x1, x2, x3, x4) were obtained by combining basic functions, inputs, and numerical constants. The initial functions for GP included: addition, subtraction, multiplication, division, square, change sign, square root, natural logarithm, exponential, and power. Function generation used 20 runs with a population size of 500, 100 generations, 4 reproductions per generation, a probability of 0.6 for a function as next node, a parsimony pressure of 0.01, and the correlation coefficient as the optimization criterion. The functional form of the equations produced a rich set of possible transforms. The suggested transforms were tested for the ability to linearize the response without altering the necessary conditions on the error structure needed for least-squares estimation (uncorrelated and normally distributed errors with mean zero and constant variance). The process consisted of selecting equations from the simulated evolution with correlation coefficients larger than 0.9. These equations were analyzed in terms of the R² between model prediction and empirical response. Equations with R² higher than 0.9 were chosen and the original variables were transformed according to the functional form of these equations. The linear regression model presented in equation (1) was fitted to the data using the transformed variables, and the fitness of this transformed linear model was analyzed considering Lack of Fit and R². The error structure of the models not showing Lack of Fit was then analyzed. This process ensured that the transformations given by GP not only linearized the response but also produced the adequate error structure needed for least-squares estimation. Using this process, the best fit between model prediction and empirical response from the set of potential non-linear equations was found for the following analytical function, with an R² of 0.98 [3]:

Sk = 3.13868 · 10⁻¹⁷ · e^(2 x1) · ln[(x3)²] / (x2² · x4) + 1.00545   (2)
The corresponding input/output sensitivity analysis reveals that x1 is the most important input. The following transformations were then applied to the data, as indicated by the functional form of the GP model shown in equation (2).

Table 2. Variable transformations suggested by the GP model

Original Variable   Transformed Variable
x1                  Z1 = exp(2 x1)
x2                  Z2 = x2²
x3                  Z3 = ln[(x3)²]
x4                  Z4 = x4⁻¹
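The Table 2 transformations can be written down directly; the sketch below is a minimal illustration (the function name is ours), valid for x3 ≠ 0 and x4 ≠ 0:

```python
import math

# Sketch of the Table 2 transformations suggested by the GP model (2);
# the function name is an assumption, and x3 != 0, x4 != 0 are required.
def gp_transforms(x1, x2, x3, x4):
    return {
        "Z1": math.exp(2.0 * x1),   # Z1 = exp(2 x1)
        "Z2": x2 ** 2,              # Z2 = x2^2
        "Z3": math.log(x3 ** 2),    # Z3 = ln[(x3)^2]
        "Z4": 1.0 / x4,             # Z4 = x4^(-1)
    }
```

The TLM is then the first-order regression of equation (1) fitted in Z1...Z4 instead of x1...x4.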
The transformed variables were used to fit the first-order linear regression model shown in equation (1). The resulting model is referred to as the Transformed Linear Model (TLM). The TLM had an R² of 0.99, no evidence of Lack of Fit (p = 0.3072), and retained the appropriate randomized error structure, indicating that the GP-generated transformations were successful in linearizing the response. The model parameters in the transformed variables are given in [3].

3.2 Interpolation Capability of Transformed Linear Model (TLM) and Symbolic Regression Model (GP) (Step 3 of Methodology)

The interpolation capabilities of the TLM and the GP model shown in equation (2) were tested with nine additional experimental points within the range of experimentation. A plot of the GP and TLM models for the interpolation data is presented in Fig. 1. The two models perform similarly in terms of interpolation, with the sum of squared errors (SSE) being slightly smaller for the TLM (0.1082) than for the GP model (0.1346). However, the models are comparable in terms of prediction with data within the region of the design. The selection of one of these models would be driven by the requirements of a particular application. For example, in the case of process control, the more parsimonious model would generally be preferred.
Fig. 1. Predicted selectivity for GP and TLM using the interpolation data set (SSE: TLM = 0.1082, GP = 0.1346)
3.3 Extrapolation Capability of Transformed Linear Model (TLM) and Symbolic Regression Model (GP) (Step 4 of Methodology)

It was necessary to evaluate the predictions of the TLM and the GP model with available data outside the region of experimentation (beyond −1 and 1 for the input variables) that were occasionally encountered in the chemical system. The most appropriate approach would be to generate an experimental design in this new region of operability. Unfortunately this could not be completed in a timely and cost-effective
fashion due to restrictions in operating conditions. The data used for extrapolation, in coded form, are shown in Table 3. Figure 2 shows the comparison between actual and predicted selectivity for this data set.

Table 3. Data used for extrapolation for GP and TLM models (data are coded based on the conditions of the original design)

Run | x1 | Selectivity Sk | GP | TLM | SSE GP | SSE TLM
1 | 1.0 | 1.614 | 1.63 | 1.61 | 0.000 | 0.000
2 | 1.0 | 1.331 | 1.32 | 1.21 | 0.000 | 0.013
3 | 1.0 | 1.368 | 1.36 | 1.40 | 0.000 | 0.001
4 | 2.0 | 1.791 | 2.27 | 2.02 | 0.231 | 0.056
5 | 2.0 | 1.359 | 1.65 | 1.24 | 0.086 | 0.013
6 | 2.0 | 1.422 | 1.72 | 1.47 | 0.092 | 0.003
7 | 3.0 | 1.969 | 3.52 | 2.63 | 2.412 | 0.446
8 | 3.0 | 1.398 | 2.29 | 1.28 | 0.798 | 0.013
9 | 3.0 | 1.455 | 2.43 | 1.57 | 0.960 | 0.014
10 | 3.0 | 1.480 | 2.62 | 2.10 | 1.318 | 0.384
11 | 3.0 | 1.343 | 2.30 | 1.48 | 0.923 | 0.020
Total | | | | | 6.822 | 0.963
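As a consistency check, the SSE columns and totals of Table 3 can be recomputed directly from the tabulated actual and predicted selectivities (a sketch; small discrepancies with the printed totals come from rounding of the table entries):

```python
# Actual selectivity Sk and the GP / TLM predictions, as tabulated
# (rounded) in Table 3.
actual = [1.614, 1.331, 1.368, 1.791, 1.359, 1.422,
          1.969, 1.398, 1.455, 1.480, 1.343]
gp_pred  = [1.63, 1.32, 1.36, 2.27, 1.65, 1.72,
            3.52, 2.29, 2.43, 2.62, 2.30]
tlm_pred = [1.61, 1.21, 1.40, 2.02, 1.24, 1.47,
            2.63, 1.28, 1.57, 2.10, 1.48]

# Total SSE for each model: sum of squared prediction errors.
sse_gp  = sum((p - a) ** 2 for p, a in zip(gp_pred, actual))
sse_tlm = sum((p - a) ** 2 for p, a in zip(tlm_pred, actual))
```

The totals agree with the reported 6.822 (GP) and 0.963 (TLM) up to the rounding of the tabulated predictions, and confirm the TLM's smaller extrapolation error.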
Fig. 2. Predicted selectivity for GP and TLM models using extrapolation data
The deviations between model predictions and actual selectivity (SSE) are larger for the GP model (6.822) than for the TLM (0.963), suggesting that the TLM had better extrapolation capabilities. Table 3 shows that the sum of squared errors (SSE) gets larger for both models as the most sensitive input, x1, increases beyond 1, the boundary of the original design region. This finding confirms that extrapolation of a GP model in terms of its most sensitive inputs is a dangerous practice. This is not a disadvantage of the GP or TLM per se; it is a fact that has challenged and will always challenge empirical models, whether linear or non-linear in nature. This is because, unlike mechanistic models, which are deduced from the laws of physics and which apply
1982
F. Castillo et al.
in general for the phenomena being modeled, empirical models are developed from data in a limited region of experimentation.

3.4 Linear Regression and Symbolic Regression for Undesigned Data (Step 5 of Methodology)

Given that extrapolation of the previously developed models was not effective, a different approach was explored by combining all of the data sets previously presented (DOE, interpolation and extrapolation data sets) to build a linear regression model and a GP model. However, since the combined data sets do not conform to an experimental DOE, the resulting models are only to be used for prediction within the new region of operability and not to infer causality.

The Linear Regression Model. The linear regression model was built by treating the three sets as one combined, undesigned data set, i.e. data not collected using a design of experiments. Selectivity was fit to the first-order linear regression shown in equation (1). The resulting model is referred to as the Linear Regression Model (LRM). The analysis of variance revealed a significant regression equation (p < 0.0001 for the F ratio) with an R² of 0.94, and no evidence of Lack of Fit (p = 0.1520). The parameter estimates are presented in Table 4.

Table 4. Parameter Estimates for Linear Regression Model showing significant terms at 95%
Term | β Estimate | Std Error | t Ratio | Prob>|t| | VIF
Intercept | 0.365 | — | — | <.0001 | —
x1 | 0.008585 | 0.000 | 17.49 | <.0001 | 1.871
x2 | 0.086230 | 0.033 | 2.61 | 0.014 | 1.506
x3 | 0.105926 | 0.010 | 9.72 | <.0001 | 1.995
x4 | −0.119 | — | −4.21 | 0.000 | 1.688
(x1−615.789)*(x2−3.73) | 0.002362 | 0.001 | 2.17 | 0.038 | 1.748
(x1−615.789)*(x3−4.37447) | 0.003111 | 0.000 | 10.63 | <.0001 | 1.537
(x2−3.73)*(x3−4.37447) | 0.045254 | 0.023 | 1.94 | 0.062 | 1.791
(x1−615.789)*(x4−1.18474) | — | 0.003 | −2.47 | 0.020 | 1.695
(x3−4.37447)*(x4−1.18474) | 0.036342 | 0.074 | 0.49 | 0.628 | 1.605
(x1−615.789)*(x3−4.37447)*(x4−1.18474) | — | 0.001 | −2.80 | 0.009 | 1.766
Note: parent effects are included to preserve model precedence given that the third-order interaction is significant. Overall model VIF = 16.7526.
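The VIF diagnostic reported in the last column of Table 4 can be computed from the model matrix alone. A minimal sketch with hypothetical, near-orthogonal data:

```python
import numpy as np

def vif(X):
    """Variance Inflation Factor for each column of the model matrix X:
    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j
    on the remaining columns (with an intercept)."""
    n, k = X.shape
    out = []
    for j in range(k):
        xj = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, xj, rcond=None)
        resid = xj - others @ beta
        r2 = 1.0 - float(resid @ resid) / float(((xj - xj.mean()) ** 2).sum())
        out.append(1.0 / (1.0 - r2))
    return out

# Hypothetical data: independent columns keep the VIFs close to 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
```

For severely collinear columns the corresponding R_j² approaches 1 and the VIF blows up, which is the condition the screening against the overall model VIF guards against.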
Because the LRM was built from undesigned data, multicollinearity was analyzed using Variance Inflation Factors (VIF) [6]. As previously discussed, the presence of severe multicollinearity can seriously affect the precision of the estimated regression coefficients, making them very sensitive to the data in the particular sample collected and producing models with poor prediction. The VIF for each model parameter (VIFj) were compared to that of the overall model (VIF model). The VIF for the LRM are listed in Table 4. No evidence of severe multicollinearity was detected (VIFj < 16.7526). A subsequent residual analysis did not
show any indication of violations of the error structure required for least-squares estimation, indicating that an acceptable LRM was achieved with the combined data set.

The GP Model. The GP model was developed for the combined data set by using a toolbox in MATLAB. This model is referred to as the GP2 model to differentiate it from the GP model that was developed using only the DOE data. The initial functions for GP2 included addition, subtraction, multiplication, division, square, change sign, square root, natural logarithm, exponential, and power. Several runs were performed with various numbers of runs (20 to 40), population sizes (500 to 900), numbers of generations (100 to 500), parsimony pressures (0.01 to 0.1), 40 reproductions per generation with 0.6 probability for a function as the next node, and two different optimization criteria (correlation coefficient and sum of squares). Several hundred equations were obtained. The selection of the best equations was based on the value of R². The following equation resulted in the highest value of R² (0.84).
Sk = −4.5333 − 0.0081 · (−e^(−2·x2·(−x3+x3)) · e^(2·x4) · loge((x3 + x4)/x3) · 2Re²) · (1.99971 + x1 − (x1 + e^(2·x4 − x2·(5.0521 + x3 − x4))))    (3)
Unlike the model of equation (1), the GP2 model shown in equation (3) is a non-linear model with a complex functional form that shows relationships between the different variables (x1, x2, x3, x4).

Comparing the Linear and Non-linear Empirical Models. Figure 3 shows the graph of the GP2 model and the Linear Regression Model (LRM) for the combined data set. Comparing the two empirical models, the LRM performs better than the GP2 model in terms of R² and the sum of squared errors. In terms of influential observations, the linear regression model allows their determination by calculating Cook's D influence, Di [7]. Observations with large values of Di have considerable influence on the least-squares estimates βi in equation (1), making these estimates very sensitive to those observations. None of the observations in the combined data set was considered influential (all calculated Di < 1). This information is helpful in identifying potentially interesting observations that may be replicated to confirm results. The GP algorithm does not allow the statistical determination of influential observations, but it does allow the determination of the most sensitive inputs. This process is somewhat similar to the testing of significant parameters in linear regression, but it is not based on statistical hypothesis testing. The GP algorithm finds the most sensitive inputs by determining how the fitness function of the GP-generated equations improves when a specific input is added. The input/output sensitivity analysis had shown x1 as the most important input. This is in agreement with the LRM, which shows x1 as significant (Table 4). The simpler form of the LRM would generally be preferred by plant engineers because of the larger R² and because it is more parsimonious than the GP2 model shown
in equation (3). Nevertheless, the functional form of the non-linear GP2 model reveals relationships among the variables that account for a large percentage of the variation in the data. Designed and undesigned data sets are therefore both opportunities for considering an alternative GP-generated model. Even when the regression model is not significant, a GP model can still be built to confirm or challenge the result. This is illustrated in the following section.
Fig. 3. Predicted selectivity for GP2 (R² = 0.84, SSE = 0.43) and LRM (R² = 0.94, SSE = 0.14) models for the combined data set
3.5 The GP Model – LRM Not Significant

A known weakness of some non-linear algorithms is the risk of model over-fitting (modeling unexplainable variation instead of real relationships that exist between inputs and outputs). This risk can significantly hinder the potential of GP models. In order to test for model over-fitting, data for the selectivity of a second chemical compound (Pk) were selected from the same 2⁴ DOE experiments shown in Table 1 (the Pk selectivity data are shown in the fourth column of Table 1). The standard linear regression model shown in equation (1) was fit to the data, with selectivity Pk as the response variable and x1, x2, x3, x4 as the independent variables. The corresponding analysis of variance indicated no evidence of Lack of Fit at 95% confidence (p=0.0938) but revealed a non-significant model (p=0.4932), indicating that the hypothesis that all regression coefficients βi in the model of equation (1) are zero could not be rejected. Thus, the linear model reduces to the mean. The GP algorithm was then applied to the same DOE data set, with selectivity Pk as the output variable and x1, x2, x3, x4 as input variables. Several runs were performed with 20 runs, population sizes between 500 and 900, numbers of generations between 100 and 500, 40 reproductions per generation, 0.6 probability for a function as the next node, parsimony pressure between 0.01 and 0.1, and correlation coefficient and sum of squares as optimization criteria. Hundreds of non-linear GP equations were generated. However, no equation was found that accounted for more than 50% of the variation in the data (the maximum correlation coefficient found was 0.5). This result is potentially significant. In the Pk case, it can be demonstrated through statistical analysis that a statistically significant correlation does not exist between the variables and the response. The non-linear GP algorithm independently produced the same result as the linear regression analysis by not creating a statistically significant model, demonstrating that in this case the risk of model over-fitting is low.
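The "model reduces to the mean" outcome can be illustrated with a toy one-input example (made-up numbers): when the input carries no information about the response, the regression sum of squares, and hence the F ratio, is zero.

```python
# Hypothetical data where x carries no information about y: the sample
# covariance of x and y is exactly zero, so the fitted slope is zero.
x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 2.0, 2.0, 1.0]

xm = sum(x) / len(x)
ym = sum(y) / len(y)
sxy = sum((a - xm) * (b - ym) for a, b in zip(x, y))
sxx = sum((a - xm) ** 2 for a in x)

slope = sxy / sxx            # = 0 here
ssr = slope * sxy            # regression sum of squares
sse = sum((b - ym) ** 2 for b in y) - ssr
# F ratio for the significance of the regression (1 model df,
# n - 2 residual df); zero SSR gives F = 0 and the model collapses
# to the mean ym.
f_ratio = (ssr / 1) / (sse / (len(x) - 2))
```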
4 Conclusions
A novel methodology for empirical modeling based on Design of Experiments and GP has been defined and applied successfully at The Dow Chemical Company. The proposed methodology combines linear regression models and non-linear GP models (statistically designed experiments, linear regression models for undesigned data, and GP-generated symbolic regression). A significant part of this methodology is based on the unique potential of GP algorithms for linearizing regression models in the presence of Lack of Fit and providing additional empirical non-linear functions that can be considered in combination with the linear ones. The proposed methodology has the following advantages:
- it increases the options for development of empirical models based on DOE;
- it reduces (or even eliminates) the number of additional experiments in the presence of Lack of Fit;
- it maximizes the use of available data when model extrapolation is required;
- it improves model validation by introducing alternative models.
These advantages are illustrated in an industrial application. The promising results obtained constitute a solid foundation for the utilization of linear and non-linear models in industrial applications where extrapolation of an empirical model is often required.
References

1. Box, G.E.P., and Draper, N.R., Empirical Model Building and Response Surfaces, John Wiley and Sons, New York, 1987.
2. Koza, J., Genetic Programming: On the Programming of Computers by Means of Natural Selection, MIT Press, Cambridge, MA, 1992.
3. Castillo, F.A., Marshall, K.A., Green, J.L., and Kordon, A., "Symbolic Regression in Design of Experiments: A Case Study with Linearizing Transformations", Proceedings of GECCO'2002, New York, pp. 1043–1048, 2002.
4. Banzhaf, W., Nordin, P., Keller, R., and Francone, F., Genetic Programming: An Introduction, Morgan Kaufmann, San Francisco, 1998.
5. Myers, R.H., and Montgomery, D.C., Response Surface Methodology, John Wiley and Sons, New York, 1995.
6. Montgomery, D.C., and Peck, E.A., Introduction to Linear Regression Analysis, John Wiley and Sons, New York, 1992.
7. Cook, R.D., "Detection of Influential Observations in Linear Regression", Technometrics, Vol. 19, pp. 15–18, 1977.
The General Yard Allocation Problem

Ping Chen¹, Zhaohui Fu¹, Andrew Lim², and Brian Rodrigues³

¹ Dept of Computer Science, National University of Singapore, 3 Science Drive 2, Singapore 117543 {chenp,fuzh}@comp.nus.edu.sg
² Dept of IEEM, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong [email protected]
³ School of Business, Singapore Management University, 469 Bukit Timah Road, Singapore 259759 [email protected]
Abstract. The General Yard Allocation Problem (GYAP) is a resource allocation problem faced by the Port of Singapore Authority: the total yard space allocated to cargo is minimized while satisfying all incoming requests for space, each of which is required over a time interval. The GYAP is NP-hard, and we propose several heuristic algorithms for it, including Tabu Search, Simulated Annealing, Genetic Algorithms and the recently emerged "Squeaky Wheel" Optimization (SWO). Extensive experiments give solutions to the problem, and comparisons among the approaches developed show that the Genetic Algorithm method gives the best results.
1 Introduction
The Port of Singapore Authority (PSA) operates the world's largest integrated container port and transshipment hub in Singapore. In the year 2000, PSA handled 19.77 million Twenty-foot Equivalent Units (TEUs) of containers worldwide, including 17.04 million TEUs in Singapore. This represents 7.4% of the global container throughput and 25% of the world's total container transshipment throughput. Typically, storage is a constraining component in port logistics management. Factors that impact terminal storage capacity include stacking heights, available net storage area, storage density (containers per acre) and dwell times for empty containers and breakbulk cargo. The port in Singapore faces such storage constraints in a heightened way due to the scarcity of land available for port activities. As such, the optimization of cargo storage in its available yards is crucial to its operations. In studying its operations with the view of finding better ways to utilize storage space within the dynamic environment of the port, we have focused on a central allocation process in storage operations which allows for improved usage of space. In this process, requests are made from operations which coordinate ship berthing and ship-to-apron loading as well as apron-to-yard transportation. Each request consists of a set of yard spaces required for a single time interval. If space is allocated to the request, this space cannot be

E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 1986–1997, 2003. © Springer-Verlag Berlin Heidelberg 2003
freed until completion of the request, i.e., until its end time. The major reason for this constraint is that once a container is placed in the yard, it will not be removed until the ship for which it is bound arrives, in order to reduce labor cost. The purpose is to pack all such requirements into a minimum space. The current allocation is made manually; hence it requires a considerable amount of manpower, and the yard arrangements generated are not efficient. Sabria and Daganzo [1] give a bibliography on port operations with a focus on berthing and cargo-handling systems. On the other hand, traffic and vehicle-flow scheduling on the landside up to yard points have been well studied (see, for example, Bish et al. [2]). Other than studies such as Gambardella et al. [3], which address spatial allocation of containers on a terminal yard using simulation techniques, there has been little direct study on yard space allocation as described in this paper. We propose a basic model to address this port storage optimization problem. The model is applicable elsewhere as it is generic in form. We call it the General Yard Allocation Problem (GYAP); it is an extended study of the Yard Allocation Problem (YAP) (see [4] [5]), in which the yard is treated as a two-dimensional space instead of the one-dimensional space of the YAP. This paper is organised as follows: the problem definition is given in Sect. 2, followed by a discussion of two-dimensional packing, which is highly related to our problem, in Sect. 3. Different heuristic approaches, including Tabu Search, "Squeaky Wheel" Optimization, Simulated Annealing and Genetic Algorithms, are then applied, as discussed in the next four sections. Before we conclude in Sect. 9, experimental results are presented and analyzed in Sect. 8.
2 Problem Description and Formulation
The main objective of the GYAP is to minimize the container yard space used while satisfying all the space requests. The GYAP can be described as follows. A set R of n yard space requests, as described above, and a two-dimensional infinite container yard E are given. We can think of E as being the first quadrant of ℝ². Each request Ri ∈ R (i = 1, ..., n) has a series of (continuous) space requirements Yij with length Lij and width Wij, where j ∈ [Ti_start, Ti_end], the time interval defined by the request Ri. A mapping F, such that F(Yij) = (x, y) with (x, y) ∈ E, gives the coordinate of the bottom left corner of Yij as it is aligned in E with its sides parallel to the X-Y axes. Each mapping F must also satisfy the condition that for all p, q ∈ [Ti_start, Ti_end] such that p = q − 1, with F(Yip) = (xip, yip) and F(Yiq) = (xiq, yiq), we have xip ≥ xiq, yip ≥ yiq, xip + Lip ≤ xiq + Liq and yip + Wip ≤ yiq + Wiq. This constraint captures the fact that a request's total space requirement increases over time, as would be expected in a realistic situation. Our objective is then to minimize, over all possible mappings F:

max_{i,j: Yij ∈ Ri} [ProjX F(Yij) + Lij] × max_{i,j: Yij ∈ Ri} [ProjY F(Yij) + Wij]
Fig. 1. A valid request R3 in GYAP. The coordinates for P1 and P2 are (2, 4) and (1, 3), respectively. The same amount of space is required at times 4 and 5.
where ProjX (ProjY) denotes the orthogonal projection from ℝ² onto the X (Y) axis. In other words, we would like to minimize the total area used in the yard while satisfying all the space requests. Figure 1 shows a layout with only one valid request, R3. Time T is taken as a discrete variable with a minimum unit of 1. R3 has four space requirements, at times 3, 4, 5, and 6, two of which (those at times 4 and 5) are the same, within the time interval [T3_start = 3, T3_end = 6]. The final positions for Y35 and Y36 are F(Y35) = (2, 4) and F(Y36) = (1, 3), respectively, as shown. The corresponding mappings for R3 will then be {(3, 4), (2, 4), (2, 4), (1, 3)}, where all the imposed constraints can be seen to hold. The maximum value in this case, which is the product of the maximum X-coordinate and the maximum Y-coordinate, is 72. Note that a request's space requirements, stacked together over time, look like an inverted pyramid. We will call each space requirement at each time slot a layer. It is clear that GYAP is NP-hard, since we have shown the YAP to be NP-hard (see [4]), and the YAP is the special case of GYAP in which each request has the same length (width), equal to the length (width) of the yard.
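The objective value of 72 for the worked example can be checked numerically. In the sketch below the layer dimensions are read off Fig. 1, so treat them as illustrative:

```python
# The two distinct layers of request R3 from Fig. 1, as (x, y, L, W):
# Y35 placed at F(Y35) = (2, 4) with L35 = 3 (width taken from the
# figure), and Y36 placed at F(Y36) = (1, 3) with L36 = 7, W36 = 6.
layers = [
    (2, 4, 3, 4),   # Y35
    (1, 3, 7, 6),   # Y36
]

def yard_area(layers):
    """GYAP objective: max over layers of (x + L) times max of (y + W)."""
    return max(x + L for x, y, L, W in layers) * \
           max(y + W for x, y, L, W in layers)
```

Here max(x + L) = 8 and max(y + W) = 9, giving the value 72 quoted in the text.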
3 Two-Dimensional Rectangular Packing Problem
In finding solutions for the GYAP, we confront, at each time slot, a Two-Dimensional Rectangular Packing Problem (2-DRPP) with certain boundary constraints. The 2-DRPP has been proven to be NP-hard. Various heuristic methods have been proposed; however, most of these are meta-heuristics, because the complex representation of the 2-DRPP makes it difficult to apply other heuristics. Such meta-heuristics usually use a lower-level packing algorithm, for example greedy packing, and a higher-level algorithm, such as Tabu Search.
A lower-level greedy packing routine allocates the objects according to a given ordering, which is determined by the higher-level heuristic. One such greedy packing routine is the Bottom Left (BL) heuristic [6] [7] [8]. Starting from the top right corner, each item is slid as far as possible to the bottom and then as far as possible to the left of the object. These successive vertical and horizontal operations are repeated until the item locks in a stable position. The major advantage of this routine is its simplicity: its time complexity is only O(N²), where N is the total number of items to be packed. Due to its low complexity, this simple heuristic is favorable in a hybrid combination with a meta-heuristic, since the decoding routine has to be executed every time the quality of a solution is evaluated and hence contributes significantly to the running time of any hybrid algorithm [8]. The BL packing routine uses a deterministic algorithm; hence the input ordering of the objects fully determines the final layout. Heuristics can be applied to optimize the ordering. In fact, such orderings can be considered as permutations, and hence, in this work, we use the word permutation for a solution representation. As the objective of GYAP is to minimize the total yard space while satisfying all the space requests, a procedure, called EVALUATE_SOLUTION, is required. It takes a solution (i.e. a permutation) and computes the minimum space required by applying the BL heuristic to pack the pyramids into the non-negative octant of ℝ³ (with two dimensions representing the yard and one dimension for the time axis).
EVALUATE_SOLUTION(P)
 1  for each R ∈ P
 2      while the position of R changes
 3          SHIFT(R, R_end, 0, 0, bottom)
 4          SHIFT(R, R_end, 0, 0, left)
 5  return minimum area used

SHIFT(R, t, l, b, o)
 1  B := bottom-most position to shift all layers (time t)
 2  if B < b
 3      B := b
 4  L := left-most position to shift all layers (time t)
 5  if L < l
 6      L := l
 7  if o = bottom
 8      for all layers r after t − 1
 9          shift r bottomwards to B
10      SHIFT(R, t − 1, L, B, o)
11  else
12      for all layers r after t − 1
13          shift r leftwards to L
14      SHIFT(R, t − 1, L, B, o)
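A minimal, discrete-grid sketch of the Bottom-Left sliding idea used inside the evaluator (unit-step sliding on integers; an illustration rather than the authors' implementation):

```python
def overlaps(a, b):
    """True if axis-aligned rectangles a and b, given as (x, y, w, h),
    have a common interior (touching edges do not count)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

def bl_pack(rects, start=100):
    """Bottom-Left packing: each rectangle (w, h) starts far top-right
    and is alternately slid as far down, then as far left, as possible
    until it locks in a stable position."""
    placed = []
    for w, h in rects:
        x, y = start, start
        moved = True
        while moved:
            moved = False
            while y > 0 and not any(overlaps((x, y - 1, w, h), p) for p in placed):
                y -= 1
                moved = True
            while x > 0 and not any(overlaps((x - 1, y, w, h), p) for p in placed):
                x -= 1
                moved = True
        placed.append((x, y, w, h))
    return placed
```

Because the routine is deterministic, the input ordering of the rectangles fully determines the layout, which is exactly why the higher-level heuristics below search over permutations.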
4 Tabu Search
Tabu Search (TS) is a local search meta-heuristic that takes the best neighborhood move that is not tabu-active, so as to move out of local optima, incorporating adaptive memory and responsive exploration [9]. According to the different usages of memory, Tabu Search is conventionally classified into two categories: Tabu Search with Short Term Memory (TSSTM) and Tabu Search with Long Term Memory (TSLTM) [10] [11].

4.1 Tabu Search with Short Term Memory
TSSTM uses memory via the Tabu List; such an adaptation is also known as recency-based Tabu Search. A neighborhood solution is obtained by swapping any two numbers in the permutation. For example, [2, 3, 0, 1, 4] is a neighborhood solution of [1, 3, 0, 2, 4], obtained by interchanging the positions of 1 and 2. Neighborhood solutions that are identical to the original solution after normalization are excluded for efficiency reasons.

4.2 Tabu Search with Long Term Memory
TSLTM uses more advanced Tabu Search techniques, including intensification and diversification strategies. It archives total or partial information from all the solutions it has visited; this is also known as frequency-based Tabu Search. It attempts to identify certain potentially "good" patterns, which are used to guide the search process towards possibly better solutions [12]. Two kinds of diversification techniques are used. One is random re-start. The other is randomly picking a sub-sequence and inserting it at a random position. For example, [0, 1, 2, 3, 4] may be changed to [0, 3, 2, 1, 4] if (2, 3) is chosen as the sub-sequence and its inverted order (or its original order, chosen at random) is inserted back ahead of 1. Intensification is similar to TSSTM. TSLTM uses a frequency-based memory, recording both the residence frequency and the transition frequency of visited solutions.
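The TSSTM scheme can be sketched as follows; the objective here is a stand-in (number of misplaced elements), whereas the paper's objective is the packed yard area, and the tenure and iteration counts are illustrative:

```python
from collections import deque

def cost(perm):
    # Stand-in objective: number of elements not at their index.
    return sum(1 for i, v in enumerate(perm) if v != i)

def tabu_search(perm, iters=50, tenure=5):
    """Recency-based Tabu Search over the swap neighborhood: at each
    step take the best non-tabu swap; the swapped index pair becomes
    tabu for `tenure` iterations."""
    best = list(perm)
    cur = list(perm)
    tabu = deque(maxlen=tenure)
    for _ in range(iters):
        candidates = []
        for i in range(len(cur)):
            for j in range(i + 1, len(cur)):
                if (i, j) in tabu:
                    continue
                nb = list(cur)
                nb[i], nb[j] = nb[j], nb[i]
                candidates.append((cost(nb), (i, j), nb))
        if not candidates:
            break
        c, move, nb = min(candidates, key=lambda t: t[0])
        cur = nb                      # accept best non-tabu move
        tabu.append(move)             # short-term memory
        if c < cost(best):
            best = nb
    return best
```

In the GYAP setting, `cost` would call the EVALUATE_SOLUTION packing routine on the permutation.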
5 "Squeaky Wheel" Optimization
"Squeaky Wheel" Optimization (SWO) is a new heuristic approach proposed in [13]. Until now, this concept can be found in only a few papers: [14], [15] and [16]. In 1996, a "doubleback" approach was proposed to solve the Resource Constrained Project Scheduling (RCPS) problem [17], which motivated the development of SWO in 1998. The YAP is similar to the RCPS except that it has no precedence constraints and the tasks (requirements) are Stair-Like Shapes (SLS). Instead of the Left-Shift and Right-Shift of "doubleback", we only use a "drop" routine similar to Left-Shift, and we continue using this technique in the GYAP. The ideas in SWO mimic how human beings solve problems: by identifying the "trouble-spot" or the "trouble-maker" and trying to resolve the problems caused by
The General Yard Allocation Problem
1991
the latter. In SWO, a greedy algorithm is used to construct a solution according to certain priorities (initially randomly generated), which is then analyzed to find the "trouble-makers", i.e. the elements whose improvement is likely to improve the objective function score. The results of the analysis are used to generate new priorities that determine the order in which the greedy algorithm constructs the next solution. This Construct/Analyze/Prioritize (C/A/P) cycle continues until a certain limit is reached or an acceptable solution is found. This is similar to the Iterative Greedy heuristic proposed in [18]; however, Iterative Greedy was designed specifically for the Graph Coloring Problem and may not be directly applicable to other problems, whereas SWO is a more general optimization heuristic. In our problem, we start with a random solution, i.e. a random permutation. The Analyzer evaluates the solution by applying the packing routine. If the best known result (yard space) is B, then a threshold T is set to B − 1. The blame factor for each request is the sum of the space requirements that exceed this threshold T, i.e. the total area of the pyramid above the cutting line T. All blame information is passed to the Prioritizer, which in our case is a priority queue. When control is handed back to the Constructor, it repeatedly deletes elements from the priority queue and immediately drops them into the yard. A tie, i.e. more than one element with the same priority, is broken by considering the elements' relative positions in the previous solution; this tie-breaker also helps avoid cycles in our search process. We also found that the performance of SWO can be further improved if a "quick" Tabu Search technique, TSSTM, is embedded in it. We call this modified algorithm SWO+TS, where TS denotes Tabu Search. The Constructor passes its solution to a TSSTM engine, which performs a quick local search and passes the local optimum to the Analyzer.
Experiments show a considerable improvement over the original SWO system. Similar ideas of SWO with "intensification" have been proposed in [14], where solutions are partitioned and SWO is applied to each partition.
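The Construct/Analyze/Prioritize cycle can be sketched on a toy two-machine scheduling problem (a hypothetical stand-in for the yard-packing Constructor; the blame rule here is a simplification of the paper's area-above-threshold factor):

```python
def construct(jobs, order):
    """Construct: greedily assign jobs, in priority order, to the
    currently less-loaded of two machines."""
    loads = [0, 0]
    assign = {}
    for j in order:
        m = 0 if loads[0] <= loads[1] else 1
        assign[j] = m
        loads[m] += jobs[j]
    return loads, assign

def swo(jobs, cycles=10):
    """C/A/P loop: Analyze blames the jobs on the machine defining the
    makespan; Prioritize moves the blamed jobs to the front of the
    construction order for the next cycle."""
    order = list(range(len(jobs)))
    best = float("inf")
    for _ in range(cycles):
        loads, assign = construct(jobs, order)        # Construct
        best = min(best, max(loads))
        worst = loads.index(max(loads))               # Analyze
        blamed = [j for j in order if assign[j] == worst]
        rest = [j for j in order if assign[j] != worst]
        order = blamed + rest                         # Prioritize
    return best
```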
6 Simulated Annealing
Simulated annealing [19] stochastically simulates the slow cooling of a physical system. We used the following Simulated Annealing algorithm on our problem:

Step 1. Choose some initial temperature T0 and a random initial starting configuration θ0. Set T = T0. Define the objective (energy) function En() and the cooling schedule σ.
Step 2. Propose a new configuration θ′ of the parameter space, within a neighborhood of the current state θ, by setting θ′ = θ + φ for some random vector φ.
Step 3. Let δ = En(θ′) − En(θ). Accept the move to θ′ with probability
    α(θ, θ′) = 1 if δ < 0, and exp(−δ/T) otherwise.
1992
P. Chen et al.
Step 4. Repeat Steps 2 and 3 for K iterations, until the system is deemed to have reached equilibrium.
Step 5. Lower the temperature by T = T × σ and repeat Steps 2–4 until a stopping criterion, in our case T < ε (for some small ε), is met.

Due to the logarithmic decrement of T, we set T0 = 1000. The energy function is simply defined as the length of the yard required. The probability exp(−δ/T) is a Boltzmann factor. The number of iterations K is proportional to the input size n. A neighborhood is defined similarly to the one in TS, by swapping any two numbers in the permutation and by re-inserting a randomly chosen sub-sequence at a random position.
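The five steps above can be sketched as follows, with a stand-in energy function in place of the required yard length and illustrative parameter values:

```python
import math
import random

def energy(perm):
    # Stand-in energy: number of misplaced elements of the permutation.
    return sum(1 for i, v in enumerate(perm) if v != i)

def anneal(perm, t0=1000.0, sigma=0.9, k=30, eps=1e-3, seed=1):
    """Simulated annealing with a swap neighborhood and geometric
    cooling T <- T * sigma; returns the best configuration seen."""
    rng = random.Random(seed)
    theta = list(perm)
    best = list(theta)
    t = t0
    while t > eps:                                # Step 5 stopping rule
        for _ in range(k):                        # Step 4: K proposals
            i, j = rng.sample(range(len(theta)), 2)
            cand = list(theta)                    # Step 2: neighbor
            cand[i], cand[j] = cand[j], cand[i]
            delta = energy(cand) - energy(theta)
            # Step 3: Metropolis acceptance with Boltzmann factor.
            if delta < 0 or rng.random() < math.exp(-delta / t):
                theta = cand
                if energy(theta) < energy(best):
                    best = list(theta)
        t *= sigma                                # cooling schedule
    return best
```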
7 Genetic Algorithms
Genetic Algorithms (GA) [20] are search procedures based on notions from natural selection. The classical binary representation is clearly not suitable for the GYAP, in which a permutation of (0, 1, . . . , n − 1) is used as the solution representation. The solution space consists of permutations of (0, 1, . . . , n − 1), and binary codes of these permutations provide no advantage; at times, the situation is even worse: changing a single bit may not lead to a valid solution. Here, we adopt a vector representation, i.e. we use a permutation directly as the chromosome in the genetic process. We illustrate below the two major genetic operators used in our approach, crossover and mutation.

7.1 Crossover Operator
Using permutations as chromosomes, we have implemented three crossover operators: Classical crossover with repair, Partially-Mapped crossover and Cycle crossover. All these operators are tailored to suit our problem domain; a small change in a crossover operator may cause totally different results.

Classical Crossover with Repair. The classical crossover operator builds an offspring by appending the head from one parent to the tail from the other parent, where the head and tail come from a random cut of the parents' chromosomes. A repair procedure may be necessary after the crossover [21]. For example, the two parents (with the random cut point marked by '|'):

p1 = (0 1 2 3 4 5 | 6 7 8 9) and p2 = (3 1 2 5 7 4 | 0 9 6 8)

will produce the following two offspring:

o1 = (0 1 2 3 4 5 | 0 9 6 8) and o2 = (3 1 2 5 7 4 | 6 7 8 9)

However, these are not valid permutations. A repair routine replaces the repeated numbers with the missing ones at random. The repaired offspring will be:

o1 = (7 1 2 3 4 5 | 0 9 6 8) and o2 = (3 1 2 5 7 4 | 6 0 8 9)

The classical crossover operator tries to maintain the absolute positions in the parents.
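A sketch of classical crossover with repair. Since the repair position is random in the text, this version simply keeps the first occurrence of a duplicated number and repairs later ones, so the repaired offspring may differ from the worked example above:

```python
import random

def classical_crossover(p1, p2, cut, rng=random):
    """Head of p1 + tail of p2, then repair: duplicated numbers (after
    their first occurrence) are replaced by the missing ones."""
    child = p1[:cut] + p2[cut:]
    missing = [v for v in p1 if v not in child]
    rng.shuffle(missing)
    seen, out = set(), []
    for v in child:
        if v in seen:
            out.append(missing.pop())   # repair a duplicate
        else:
            seen.add(v)
            out.append(v)
    return out
```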
The General Yard Allocation Problem
1993
Partially Mapped Crossover. Partially Mapped Crossover (PMX) was first used in [22] to solve the Traveling Salesman Problem. We have made several adjustments to accommodate our permutation representation. The modified PMX builds an offspring by choosing a subsequence of a permutation from one parent and preserving the order and position of as many numbers as possible from the other parent. The subsequence is determined by choosing two random cut points. For example, the two parents:

p1 = (0 1 2 | 3 4 5 6 | 7 8 9) and p2 = (3 1 2 | 5 7 4 0 | 9 6 8)

would produce offspring as follows. First, the two segments between the cut points are swapped (the symbol 'u' represents 'unknown' for the moment):

o1 = (u u u | 5 7 4 0 | u u u) and o2 = (u u u | 3 4 5 6 | u u u)

The swap implicitly defines a series of mappings at the same time: 3 ↔ 5, 4 ↔ 7, 5 ↔ 4 and 6 ↔ 0. The 'unknown's are then filled in with numbers from the original parents for which there is no conflict:

o1 = (u 1 2 | 5 7 4 0 | u 8 9) and o2 = (u 1 2 | 3 4 5 6 | 9 u 8)

Finally, the first u in o1 (which should be 0, but 0 would cause a conflict) is replaced by 6 because of the mapping 0 ↔ 6. Note that such replacement is transitive: for example, the second u in o1 follows the mappings 7 ↔ 4, 4 ↔ 5, 5 ↔ 3 and is hence replaced by 3. The final offspring are:

o1 = (6 1 2 | 5 7 4 0 | 3 8 9) and o2 = (7 1 2 | 3 4 5 6 | 9 0 8)

The PMX crossover exploits important similarities in value and ordering simultaneously when used with an appropriate reproductive plan [22].

Cycle Crossover. The original Cycle Crossover (CX) was proposed in [23], again for the TSP. Our CX builds offspring in such a way that each number (and its position) comes from one of the parents. We explain the mechanism of CX with the following example.
The two parents:

p1 = (0 1 2 3 4 5 6 7 8 9) and p2 = (3 1 2 5 0 4 7 9 6 8)

produce the first offspring by taking the first number from the first parent:

o1 = (0 u u u u u u u u u)

Since every number in the offspring should come from one of its parents (at the same position), the only choice we have at this moment is to pick number 3, as
the number from parent p2 just "below" the selected 0. In p1, this number 3 is in position 3, hence:

o1 = (0 u u 3 u u u u u u)

which, in turn, implies number 5, the number from p2 "below" the selected 3:

o1 = (0 u u 3 u 5 u u u u)

Following the rule, the next number to be inserted is 4. The selection of 4 would in turn require the selection of 0, which is already in the list; hence the cycle is complete, as expected:

o1 = (0 u u 3 4 5 u u u u)

The remaining 'u's are filled from p2: o1 = (0 1 2 3 4 5 7 9 6 8). Similarly, o2 = (3 1 2 5 0 4 6 7 8 9). CX preserves the absolute position of the elements in the parent sequences [21]. Our experiments show that classical crossover and CX have stable but slow improvement rates, while PMX demonstrates oscillating but fast convergence. In our later experiments, the majority of crossovers are performed by PMX; classical crossover and CX are applied with a much lower probability.

7.2 Mutation Operator
Mutation is another classical genetic operator; it alters one or more genes (parts of a chromosome) with a probability equal to the mutation rate. Several known mutation algorithms work well on different problems:

– Inversion: invert a subsequence.
– Insertion: select a number and reinsert it at a random position.
– Displacement: select a subsequence and reinsert it at a random position.
– Reciprocal Exchange: swap two numbers.
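The four operators might be sketched as follows; we use explicit indices instead of random draws so the behavior is easy to trace, and the function names and signatures are our own:

```python
def inversion(perm, i, j):
    """Invert the subsequence perm[i:j]."""
    return perm[:i] + perm[i:j][::-1] + perm[j:]

def insertion(perm, i, j):
    """Remove the element at index i and reinsert it at index j."""
    rest = perm[:i] + perm[i + 1:]
    return rest[:j] + [perm[i]] + rest[j:]

def displacement(perm, i, j, k):
    """Remove the subsequence perm[i:j] and reinsert it at index k."""
    segment, rest = perm[i:j], perm[:i] + perm[j:]
    return rest[:k] + segment + rest[k:]

def reciprocal_exchange(perm, i, j):
    """Swap the elements at indices i and j."""
    out = list(perm)
    out[i], out[j] = out[j], out[i]
    return out
```

Each operator returns a permutation of the same elements; in a GA run, i, j, and k would be drawn at random.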
In fact, Inversion, Displacement and Reciprocal Exchange are quite similar to the neighborhood-solution and diversification techniques used in Tabu Search and Simulated Annealing in the previous sections. We adopt a relatively low mutation rate of 1%. We use a population size of P = 1000 for most cases. The evolution process starts with a random population. The population is sorted according to the objective function; the better an individual's quality, the higher the probability that it will be selected for reproduction. At each iteration, a new generation of size 2P is produced, and the better half, of size P, survives to the next iteration. The evolution process continues until a certain stopping criterion is met.
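The evolution scheme described above (sort the population, breed a doubled generation, keep the better half) might be sketched as follows; the rank-proportional selection weights and the helper signatures are our assumptions, not details given in the text:

```python
import random

def evolve(init_pop, objective, crossover, mutate,
           generations=50, mut_rate=0.01):
    """Keep a population of size P; each iteration grows it to 2P with
    offspring and retains the better half (lower objective = better)."""
    pop = sorted(init_pop, key=objective)
    P = len(pop)
    for _ in range(generations):
        children = []
        while len(children) < P:
            # Rank-biased selection: better-ranked parents are drawn more often.
            a, b = random.choices(pop, weights=range(P, 0, -1), k=2)
            for child in crossover(a, b):
                if random.random() < mut_rate:
                    child = mutate(child)
                children.append(child)
        pop = sorted(pop + children, key=objective)[:P]  # better half survives
    return pop[0]
```

Because the parents compete with their offspring for survival, the best solution found never degrades between iterations.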
The General Yard Allocation Problem
8 Experimental Results
We conducted extensive experiments on randomly generated data.¹ The graph for each test case contains exactly one component, so that the cases cannot be partitioned into independent sub-cases. All the programs implementing the various heuristic methods are coded in GNU C++, with extensive use of the Standard Template Library (STL) for efficient data structures such as priority queues, sets and maps. Because of the difficulty of finding any optimal solution in the experiments, a trivial lower bound, taken to be the sum of the space requirements at each time slot, is used for benchmarking purposes.

Table 1. Experimental results for GYAP (entries show the minimum length of the yard required; names of data sets show the number of pyramids in the file; LB: lower bound)

Data Set  LB   TSSTM  TSLTM  SWO   SWO+TS  SA   GA
P35       47   84     80     112   84      76   76
P86       218  616    583    748   550     572  550
P108      78   205    200    270   200     210  175
P127      221  660    649    891   671     671  550
P137      239  680    670    950   690     600  560
P142      40   135    125    190   140     120  115
P160      249  828    828    996   852     840  780
P167      243  814    792    1012  825     748  649
P187      302  948    912    1032  912     816  768
Table 1 illustrates the results and Table 2 shows the running time for each of the tests reported in Table 1. GA is the most cost-effective approach, while TSSTM has the simplest implementation, so it is not surprising that it achieves the poorest results. Long term memory certainly improves the performance of TS, though the improvement is sometimes neither obvious nor stable. We believe one of the major difficulties with long term memory is the fine tuning of parameters, including the assignment of relative weights to yard length, residence frequency and transition frequency in the objective function. The performance of SWO is poor in our experiments for two reasons. First, the BL packing routine makes it difficult to identify the bottleneck, or the "trouble makers": the objects are manipulated in a two-dimensional space, and assigning blame according to only one dimension may not reflect the structure of the solution well. Second, there is a loss of correspondence between the physical layout and actual solutions. SWO is more sensitive to the problem domain, as it needs to know the exact structure of the solution in order to assign
¹ All test data are available on the web at http://www.comp.nus.edu.sg/˜fuzh/GYAP
Table 2. Experiment running time (in seconds) for Table 1

Data Set  TSSTM  TSLTM  SWO   SWO+TS  SA     GA
P35       783    4521   1033  5462    4512   923
P86       948    6529   2312  5978    2351   1423
P108      2321   8392   4528  8432    8934   3452
P127      2783   9837   4678  10262   11023  6621
P137      2796   9640   4780  9892    9857   7048
P142      893    3428   1532  4582    3275   1094
P160      4781   15327  3085  18539   6793   3583
P167      6063   14294  3769  19852   6832   3781
P187      7806   17327  4085  20542   6673   6849
proper blame values. SA is the second-best approach; we believe this is because its cooling schedule is not much affected by the loss of correspondence.
9 Conclusion
In this paper, we studied the General Yard Allocation Problem, which involves two-dimensional rectangle packing as a related problem. We adopted a simple Bottom Left packing strategy as a first heuristic. Heuristics such as Tabu Search, Simulated Annealing, Genetic Algorithms and "Squeaky Wheel" Optimization were then applied to the problem in extensive experiments. Our approach sheds light on the use of these meta-heuristics on a set of problems, including packing problems in which the packing lists can change over time. Comparing the results obtained, we found that the GA implementation achieves the best results for this problem.
References

1. F. Sabria, C. Daganzo: Queuing systems with scheduled arrivals and established service order. Transportation Research B, vol. 23 (1989) 159–175
2. E.K. Bish, T.-Y. Leong, C.-L. Li, J.W.C. Ng, D. Simchi-Levi: Analysis of a new vehicle scheduling and location problem. Naval Research Logistics, vol. 48 (2001) 363–385
3. L. Gambardella, A. Rizzoli, M. Zaffalon: Simulation and planning of an intermodal container terminal. Simulation, vol. 71 (1998) 107–116
4. P. Chen, Z. Fu, A. Lim: The yard allocation problem. In: Proceedings of the Eighteenth National Conference on Artificial Intelligence (AAAI-02), Edmonton, Alberta, Canada (2002)
5. P. Chen, Z. Fu, A. Lim: Using genetic algorithms to solve the yard allocation problem. In: Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2002), New York City, USA (2002)
6. B. Baker, E.G. Coffman Jr., R. Rivest: Orthogonal packing in two dimensions. SIAM Journal of Computing, vol. 9 (1980) 846–855
7. S. Jacobs: On genetic algorithms for the packing of polygons. European Journal of Operational Research, vol. 88 (1996) 165–181
8. E. Hopper, B. Turton: An empirical investigation of meta-heuristic and heuristic algorithms for a 2D packing problem. European Journal of Operational Research, vol. 128 (2001) 34–57
9. P.L. Hammer: Tabu Search. J.C. Baltzer, Basel, Switzerland (1993)
10. F. Glover, M. Laguna: Tabu Search. Kluwer Academic Publishers (1997)
11. S. Sait, H. Youssef: Iterative Computer Algorithms with Applications in Engineering: Solving Combinatorial Optimization Problems. IEEE (1999)
12. D. Pham, D. Karaboga: Intelligent Optimisation Techniques: Genetic Algorithms, Tabu Search, Simulated Annealing and Neural Networks. Springer, London and New York (2000)
13. D. Clements, J. Crawford, D. Joslin, G. Nemhauser, M. Puttlitz, M. Savelsbergh: Heuristic optimization: A hybrid AI/OR approach. In: Workshop on Industrial Constraint-Directed Scheduling (1997)
14. D.E. Joslin, D.P. Clements: "Squeaky wheel" optimization. Journal of Artificial Intelligence Research, vol. 10 (1999) 353–373
15. D.E. Joslin, D.P. Clements: Squeaky wheel optimization. In: Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98), Madison, USA (1998) 340–346
16. D. Draper, A. Jonsson, D. Clements, D. Joslin: Cyclic scheduling. In: Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (1999)
17. J.M. Crawford: An approach to resource constrained project scheduling. In: Proceedings of the 1996 Artificial Intelligence and Manufacturing Research Planning Workshop (1996)
18. J.C. Culberson, F. Luo: Exploring the k-colorable landscape with iterated greedy. In: Cliques, Coloring, and Satisfiability: Second DIMACS Implementation Challenge, D.S. Johnson and M.A. Trick (eds.). DIMACS Series in Discrete Mathematics and Theoretical Computer Science (1996) 245–284
19. S. Kirkpatrick, C. Gelatt Jr., M. Vecchi: Optimization by simulated annealing. Science, vol. 220 (1983) 671–680
20. J.H. Holland: Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor (1975)
21. Z. Michalewicz: Genetic Algorithms + Data Structures = Evolution Programs. Springer-Verlag, Berlin Heidelberg New York (1996)
22. D. Goldberg, R. Lingle: Alleles, loci, and the traveling salesman problem (1985)
23. I. Oliver, D. Smith, J. Holland: A study of permutation crossover operators on the TSP (1987)
Connection Network and Optimization of Interest Metric for One-to-One Marketing

Sung-Soon Choi and Byung-Ro Moon

School of Computer Science and Engineering, Seoul National University, Seoul, 151-742 Korea
{sschoi,moon}@soar.snu.ac.kr
Abstract. With the explosive growth of data in electronic commerce, rule finding has become a crucial part of marketing. In this paper, we discuss the essential limitations of existing metrics for quantifying the interestingness of rules, and present the need for optimizing the interest metric. We describe the construction of a connection network that represents the relationships between items and propose a natural marketing model using the network. Even with simple interest metrics, the connection network model showed stable performance in experiments with field data. By constructing the network based on the optimized interest metric, the performance of the model was significantly improved.
1 Introduction
The progress of modern technologies has made it possible for finance and retail organizations to collect and store massive amounts of data. Consequently, identifying models that explain the data has attracted great attention, particularly in the data mining area. An early representative problem is the "market-basket" problem [1]. In this problem, we are given a set of items and a collection of transactions, each of which is a subset (basket) of items purchased together by a customer in a single visit. The objective is "mining" relationships between items from the basket data. The attraction of the problem arises first of all from its great variety of applications [2]. A typical example (from which the problem got its name) is customers' shopping behavior in a supermarket. In this example, the items are products and the transactions are customer purchases at the checkout. Determining which products customers are likely to buy together is useful for display and marketing. There are many other applications with different data characteristics. Some examples are student enrollment in classes [2], copy detection (identifying identical or similar documents or web pages) [3] [4], clustering (identifying similar vectors in high-dimensional spaces) [5] [6], etc. With the rapid growth of the electronic commerce (e-commerce) field, the problem becomes increasingly important. Various solutions to the problem have been used for marketing in the e-commerce field [7] [8]. Among them, collaborative filtering (tracking user behavior and providing recommendations to individuals
E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 1998–2009, 2003. © Springer-Verlag Berlin Heidelberg 2003
based on the similarities of their preferences) has been serving as one of the core marketing strategies [9] [10] [11]. In this paper, we suggest a genetic approach to optimizing the interest metric that measures the relationships between items. The optimized metric is used to construct a weighted relationship network of the items. We then propose a natural marketing model using the network. Experimental results showed that our model was more stable than, and superior to, collaborative filtering. The rest of this paper is organized as follows. In Sect. 2, we describe the interest metrics introduced so far that measure the degree of connection between items. We also give a brief description of methods for one-to-one marketing. In Sect. 3, we present our theoretical view of the interest metrics described in Sect. 2 and motivate the need to optimize the interest metric. In Sect. 4, we explain the connection network that represents the relationships among the items, and describe the marketing model based on the network. The genetic framework for optimizing the interest metric is provided in Sect. 5. Section 6 provides the experimental results on real-world data sets. Finally, we draw our conclusions in Sect. 7.
2 Preliminaries

2.1 Interest Metrics of Rules
Recently, there have been many studies on finding meaningful rules between items based on interest metrics that measure the degree of connection between items. A rule is denoted by X → Y, for two item sets X and Y, and means that the occurrence of X implies the occurrence of Y. Agrawal et al. [1] proposed the interest metrics called support and confidence and used them to build rules between items; they called such rules association rules. The support of an item set X is defined to be the number of transactions in which the item set occurs (the number of occurrences of X). For a rule X → Y, the confidence of the rule is the fraction of transactions containing Y among those containing X. For a rule X → Y to become an association rule, the support of the item set X ∪ Y and the confidence of the rule X → Y must exceed given thresholds θs and θc, respectively. Note that the support of the item set X ∪ Y corresponds to the number of transactions in which the item sets X and Y occur together (the number of co-occurrences of X and Y), and the confidence of the rule X → Y corresponds to the number of co-occurrences of X and Y over the number of occurrences of X. We denote the number of occurrences of an item set X by n(X) and the number of co-occurrences of item sets X and Y by n(X, Y). Then, from the definitions of support and confidence, we have

sup(X → Y) = n(X, Y)  and  conf(X → Y) = n(X, Y) / n(X).   (1)
The association rule X → Y is said to hold if and only if

sup(X → Y) = n(X, Y) > θs  and  conf(X → Y) = n(X, Y) / n(X) > θc.   (2)
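The counts n(X) and n(X, Y) and the association-rule test above map directly to code; a minimal sketch with transactions as Python sets (the function names are ours):

```python
def support(itemset, transactions):
    """n(X): number of transactions containing every item of X."""
    return sum(itemset <= t for t in transactions)

def confidence(x, y, transactions):
    """n(X, Y) / n(X): fraction of X-transactions that also contain Y."""
    return support(x | y, transactions) / support(x, transactions)

def holds(x, y, transactions, theta_s, theta_c):
    """True if X -> Y is an association rule for thresholds theta_s, theta_c."""
    return (support(x | y, transactions) > theta_s
            and confidence(x, y, transactions) > theta_c)
```

For example, with T = [{1,2,3}, {1,2}, {2,3}, {1,3}, {1,2,3}], the rule {1} → {2} has support 3 and confidence 3/4.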
Cohen et al. [12] pointed out a problem of association rules that require high support: most rules with high support are obvious and well known, and it is the rules with low support that provide interesting new insights. Besides, several recent papers argued that it is not reasonable to use confidence as the interest measure of rules [2] [13] [14]. In this context, a number of different metrics to quantify the "interestingness" or "goodness" of rules were proposed [15]. Among them are gain [16], a metric proposed by Piatetsky-Shapiro [17], variance and chi-squared value [18], entropy gain [18] [19], gini [19], laplace [20] [21], lift [22] (also known as interest [2] or strength [23]), conviction [2], and similarity [12]. Bayardo and Agrawal [15] expressed the definitions of all these metrics except similarity in terms of the supports of the related item sets and the confidence of the rule; it is also possible to express similarity in the same way. This indicates that all these metrics can be described with the numbers of occurrences and co-occurrences of the related item sets. If we denote by T the set of all transactions, the laplace, gain, Piatetsky-Shapiro's metric (p-s), conviction, lift, and similarity values for a rule X → Y are expressed as follows:¹

laplace(X → Y) = (n(X, Y) + 1) / (n(X) + k),
gain(X → Y) = n(X, Y) − c · n(X),
p-s(X → Y) = n(X, Y) − n(X) · n(Y) / |T|,
conviction(X → Y) = (|T| · n(X) − n(X) · n(Y)) / (|T| · (n(X) − n(X, Y))),
lift(X → Y) = |T| · n(X, Y) / (n(X) · n(Y)),  and
similarity(X → Y) = n(X, Y) / (n(X) + n(Y) − n(X, Y)).   (3)
The remaining metrics can also be expressed in the same way. It is difficult to single out one best metric among the above. It is, however, clear that all of them consider only the numbers of occurrences and co-occurrences of item sets; this fact is utilized in modeling and optimizing the interest metric. Note also that these metrics numerically evaluate the relationships between item sets, so they can be used to quantify the strengths of the connections between item sets.
¹ k in laplace is an integer greater than 1 and c in gain is a fractional constant between 0 and 1.
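Since every metric in (3) depends only on n(X), n(Y), n(X, Y), and |T|, each reduces to a small function of four counts; a sketch (the default k and c are arbitrary values within the footnoted ranges):

```python
# x = n(X), y = n(Y), z = n(X, Y), t = |T|

def laplace(x, y, z, t, k=2):
    return (z + 1) / (x + k)

def gain(x, y, z, t, c=0.5):
    return z - c * x

def p_s(x, y, z, t):
    return z - x * y / t

def conviction(x, y, z, t):
    return (t * x - x * y) / (t * (x - z))

def lift(x, y, z, t):
    return t * z / (x * y)

def similarity(x, y, z, t):
    return z / (x + y - z)
```

For instance, with t = 100, x = 20, y = 30, and z = 10, lift evaluates to 5/3 and similarity to 0.25.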
2.2 One-to-One Marketing
Personalization is a sharply growing issue in modern marketing. It helps boost customers' loyalty by providing the most attractive contents to each customer or by locating the most proper set of customers for a given advertisement [24]. As mentioned before, personal information and huge activity logs are accumulated thanks to the progress of modern technologies. These data implicitly contain valuable trends and patterns which are useful for improving business decisions and efficiency. Diverse data mining techniques to discover implicit knowledge hidden in large databases have been studied, including neural networks, decision trees, rule induction, Bayesian belief networks, evolutionary algorithms, fuzzy sets, clustering, association rules, and collaborative filtering [25] [26]. Among them, collaborative filtering, which tracks user behavior and makes recommendations to individuals on the basis of the similarities of their preferences, is known to be a standard for personalized recommendations [10] [27]. Most personalized marketing tools, including collaborative filtering, utilize customer profiles [7] [8] [28] [29] [30]. These strategies often suffer from the data-sparsity problem, which greatly undermines the quality of recommendations, as it is in general difficult to collect customers' personal information or preferences [31] [32].
3 Rule Space and Optimized Interest Metric
We saw in the previous section that sup(X → Y) = n(X, Y) and conf(X → Y) = n(X, Y)/n(X) for a rule X → Y. Now, if we replace n(X), n(Y), and n(X, Y) with the independent variables x, y, and z, respectively, the necessary and sufficient condition for an association rule X → Y to hold is

z > θs  and  z / x > θc   (x > 0, y ≥ 0, z ≥ 0).   (4)
This means that the values of x, y, and z for which the rule X → Y holds correspond to the points in three dimensions that satisfy the above inequality (4). The rule space consisting of the x, y, and z axes is then partitioned by the plane corresponding to the support condition, fs(x, y, z) = z = θs, and the plane corresponding to the confidence condition, fc(x, y, z) = z/x = θc (e.g., Figure 1(a)). It was mentioned that, for a rule X → Y, the metrics introduced in the previous section can be described with n(X), n(Y), and n(X, Y). Therefore, it is also possible to express all the metrics of the previous section as formulas in x, y, and z. For example, conviction(X → Y) for a rule X → Y can be described as follows:

conviction(X → Y) = (|T| · n(X) − n(X) · n(Y)) / (|T| · (n(X) − n(X, Y))) = (|T|x − xy) / (|T|(x − z)).   (5)
Fig. 1. A rule space partitioned by planes: (a) two linear planes; (b) a free plane
After all, given a threshold θ, each of the metrics mentioned in the previous section is a plane, f(x, y, z) = θ, which partitions the rule space. Selecting an interest metric thus fixes the shape of the space partition regardless of the characteristics of the data set. We suspect that the optimal shapes of the partition planes differ from one data set to another. In this context, we allow the partition plane as much freedom as possible and then find the plane optimized for the given data set, namely the optimized interest metric (Figure 1(b)). To maximize the degree of freedom of the plane, it would be desirable to let f(x, y, z) = θ be a free plane. In that case, however, the search space becomes so huge that the learning time increases excessively. We therefore restrict the shape of the plane to make the learning tractable. We set the model of f(x, y, z) as follows:

f(x, y, z) = (ax·x^ex + bx)(ay·y^ey + by)(az·z^ez + bz),   (6)

where 0 < ax, ay, az ≤ 10; −1 ≤ ex, ey, ez ≤ 1; 0 ≤ bx, by, bz ≤ 10.
We use a genetic algorithm to search the optimal coefficients and exponents of f (x, y, z) for the data set. The details of optimization are described in Sect. 5.
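The restricted family (6) might be sketched as a closure over the nine real genes the GA searches; the factory name is our own:

```python
def make_metric(ax, ex, bx, ay, ey, by, az, ez, bz):
    """Interest metric of the form (6), parameterized by nine real genes."""
    def f(x, y, z):
        return ((ax * x ** ex + bx)
                * (ay * y ** ey + by)
                * (az * z ** ez + bz))
    return f
```

Note that the family subsumes several fixed metrics: for instance, make_metric(1, -1, 0, 1, -1, 0, 1, 1, 0) yields z / (x·y), i.e. lift up to the constant factor |T|.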
4 Personalized Marketing Model Using Connection Networks
In the previous section, we built a model of the metric f(X → Y) (= f(x, y, z)) that evaluates the strength of a rule X → Y. Intuitively, the value of f(X → Y) indicates the strength of the connection between the item sets X and Y. The optimized metric is used to measure the strength of connection between item sets. Here, we consider only the case in which each item set contains just one item. Then,
Fig. 2. A connection network: arcs run from previously purchased products to products in the union of their neighbor sets
the value of f(X → Y) represents the strength with which the item X implies the item Y. Now we are ready to construct a connection network. We set a vertex for each item and put an arc from vertex X to vertex Y with weight f(X → Y). We thus have a directed graph G = (V, A), where V is the vertex set and A is the arc set. We perform one-to-one marketing using the connection network. Suppose that a customer has purchased the products X1, X2, ..., Xk so far. Let N(Xi) be the set of neighbor vertices of Xi in the connection network (1 ≤ i ≤ k). We define a score function for recommendation, s : V → R, as follows (R: the set of real numbers):

s(Y) = ( Σ_{1 ≤ i ≤ k, Y ∈ N(Xi)} f(Xi → Y) ) / (ak·k^ek + bk),  if Y ∈ ∪_{1 ≤ i ≤ k} N(Xi);
s(Y) = 0,  otherwise.   (7)

We recommend the products with high scores to the customer. The value of the score function s(Y) for a product Y is proportional to the weight sum of the arcs from the previously purchased products to Y. Figure 2 shows an example connection network. We divide the sum of weights by a function of k (the number of previously purchased products); this prevents the recommendations from flowing in upon a few customers who have purchased excessively many products before. ak, ek, and bk are optimized in the ranges

0 < ak ≤ 10,  −1 ≤ ek ≤ 1,  0 ≤ bk ≤ 10,   (8)
and the details are described in the next section. Such a recommendation strategy differs from existing ones based on customer profiles in that it performs recommendations purely from the relationships between products, regardless of the customers' profiles. We need not handle high-dimensional customer profile vectors nor quantify each field
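The network construction and the score function (7) might be sketched as follows; storing arcs as a dict keyed by (source, target) is our choice, and skipping products the customer already owns is an assumption, not something stated in the text:

```python
from collections import defaultdict

def build_network(items, f):
    """Directed graph: arc X -> Y carries the metric weight f(X, Y)."""
    return {(x, y): f(x, y) for x in items for y in items if x != y}

def scores(purchased, arcs, a_k=1.0, e_k=1.0, b_k=0.0):
    """Score function (7): weight sum of arcs from the k purchased
    products, normalized by a_k * k**e_k + b_k."""
    k = len(purchased)
    s = defaultdict(float)
    for (u, v), w in arcs.items():
        # Sum arc weights from each previously purchased product; skip
        # products already purchased (our choice, not eq. (7) itself).
        if u in purchased and v not in purchased:
            s[v] += w
    norm = a_k * k ** e_k + b_k
    return {v: w / norm for v, w in s.items()}
```

The products with the highest scores are then recommended to the customer.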
create initial population of a fixed size;
do {
    choose parent1 and parent2 from population;
    offspring = crossover(parent1, parent2);
    mutation(offspring);
    replace(population, offspring);
} until (stopping condition);
return the best individual;

Fig. 3. The outline of the genetic algorithm
of the customer profiles. We merely optimize the quantitative representation of the topology among items and perform recommendation on the basis of it. The computational cost of the recommendation is significantly lower than that of existing recommendation strategies. In addition, when reflecting new data into the established network, the cost of updating the network is also fairly low. Furthermore, such a strategy is less sensitive to the data-sparsity problem that greatly undermines the quality of recommendations for the existing ones.
5 Genetic Algorithm
The selection of an interest metric and a score function has a great effect on the quality of recommendations. The problem is to find the best set of coefficients related to the interest metric and score function that maximizes the response rate, which is defined to be

response rate = (# of purchases) / (# of recommended items).   (9)
We used a steady-state genetic algorithm to optimize the nine parameters related to the interest metric (Sect. 3) and the three parameters related to the score function (Sect. 4). Figure 3 shows the outline of the genetic algorithm. The details are described in the following.

– Encoding: Each solution is a set of 12 real values. A solution is represented by a chromosome: a real array of 12 elements. Figure 4 shows the structure of chromosomes. Each element of the array is called a gene, and we restrict the range of each gene as mentioned before.
– Initialization: We set the population size to 100. For each gene in a chromosome, we randomly generate a real number in the restricted range.
– Parent Selection: The fitness value Fi of chromosome i is assigned as follows:

Fi = (Ri − Rw) + (Rb − Rw)/4,   (10)
Fig. 4. The structure of chromosomes: a real array (ax, ex, bx, ay, ey, by, az, ez, bz, ak, ek, bk)
where Rw is the response rate of the worst chromosome, Rb that of the best, and Ri that of chromosome i. Each chromosome is selected as a parent with a probability proportional to its fitness value. This is a typical proportional selection scheme.
– Crossover: We use traditional one-point crossover.
– Mutation: We randomly select each gene with a low probability (P = 0.1) and perform non-uniform mutation, in which the perturbation degree decreases over time. Non-uniform mutation was first proposed by Michalewicz [33].
– Replacement: We replace the inferior of the two parents if the offspring is not worse than both parents. Otherwise, we replace the worst member of the population. This scheme is a compromise between preselection [34] and GENITOR-style replacement [35].
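The fitness assignment (10) and non-uniform mutation might be sketched as follows; the Δ formula with shape parameter b is the common textbook form of Michalewicz's operator, not one stated in this paper:

```python
import random

def fitness_values(rates):
    """Eq. (10): F_i = (R_i - R_w) + (R_b - R_w) / 4, where R_w and R_b
    are the worst and best response rates in the population."""
    r_w, r_b = min(rates), max(rates)
    return [(r - r_w) + (r_b - r_w) / 4 for r in rates]

def nonuniform_mutate(gene, lo, hi, t, t_max, b=5.0):
    """Perturb a gene within [lo, hi]; the perturbation range shrinks
    to zero as generation t approaches t_max."""
    def delta(y):
        return y * (1.0 - random.random() ** ((1.0 - t / t_max) ** b))
    if random.random() < 0.5:
        return gene + delta(hi - gene)
    return gene - delta(gene - lo)
```

The constant term (Rb − Rw)/4 keeps even the worst chromosome selectable under proportional selection, while the shrinking perturbation lets mutation shift from exploration to fine tuning.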
6 Experimental Results
We conducted experiments with two different types of data sets to evaluate the performance of the optimized interest metric and the marketing model using the connection network. First, we conducted experiments with a massive purchase data set, from June 1996 to August 2000, of a representative e-commerce company in Korea. In this data set, the items are products and a transaction is a customer's purchase of one or more items with a time stamp. We first divided the whole data set into two disjoint sets by date. Then we predicted the purchases of customers in the latter set based on the data of the former set. In this experiment, a number of different recommendation models were used: collaborative filtering (CF), a few plain connection networks (PCNs), and the optimized connection network (OCN). As mentioned before, collaborative filtering is a proven standard for personalized recommendations [9] [10] [27]. PCNs are our early recommendation models (commercialized by Optus Inc.) in which we construct the networks based on the interest metrics mentioned in Sect. 2 (such as laplace, gain, and so on) and then recommend attractive products based on prescribed thresholds. OCN is the recommendation model in which we construct the network on the basis of the optimized interest metric and then recommend products according to the score-function values.
Table 1. Comparison of experiments with purchase data set (response rate, with # purchases / # recommended products in parentheses)

Date      CF               PCN-laplace      PCN-gain         PCN-p-s          OCN
May 31    .0052 (26/5018)  .0032 (14/4310)  .0046 (23/5003)  .0030 (15/5011)  .0052 (26/4994)
June 7    .0061 (32/5210)  .0034 (15/4411)  .0046 (24/5183)  .0043 (23/5390)  .0087 (45/5162)
June 14   .0054 (30/5513)  .0028 (13/4565)  .0046 (25/5415)  .0045 (25/5602)  .0092 (52/5634)
June 21   .0063 (36/5695)  .0032 (18/5742)  .0053 (29/5448)  .0041 (23/5627)  .0099 (58/5859)
June 28   .0062 (37/5947)  .0034 (17/4937)  .0051 (31/6094)  .0043 (25/5767)  .0088 (55/6229)
July 5    .0055 (34/6138)  .0027 (15/5459)  .0046 (28/6135)  .0044 (28/6343)  .0090 (55/6117)
July 12   .0041 (26/6327)  .0043 (27/6289)  .0068 (41/6036)  .0058 (38/6582)  .0084 (52/6192)
July 19   .0043 (28/6571)  .0063 (40/6314)  .0089 (56/6324)  .0084 (55/6511)  .0084 (54/6458)
July 26   .0038 (26/6879)  .0069 (56/8111)  .0088 (59/6695)  .0085 (59/6982)  .0112 (79/7036)
August 2  .0032 (23/7130)  .0069 (49/7067)  .0078 (58/7427)  .0078 (56/7190)  .0079 (58/7372)
August 9  .0027 (20/7386)  .0045 (29/6385)  .0053 (39/7315)  .0053 (39/7330)  .0061 (43/7047)
Table 2. Comparison of experiments with mobile-content data set (response rate, with # uses / # recommended contents in parentheses)

CF                   PCN-laplace          PCN-gain             PCN-similarity       OCN
.0241 (4886/203054)  .0256 (5215/203379)  .0269 (5522/205061)  .0247 (5096/206474)  .0291 (5535/190417)
For the experiments, we selected eleven pivot dates over the weeks from May 2000 to August 2000. For all the models, the data before the pivot date were used as the training set; the data after the date were used as the test set. In the case of OCN, the training set was further divided into the real training set and the validation set, to optimize the interest metric and the score function. Table 1 shows the response rates (and the numbers of purchases over the numbers of recommended products) of CF, PCNs, and OCN. We omitted the experimental results of the PCNs based on conviction, lift, and similarity since their performance was not comparable. The numbers of recommendations were adjusted to be comparable for a fair comparison. The PCNs showed performance comparable with CF even though their networks were constructed with the general metrics; we consider this evidence of the suitability of the connection network as a recommendation model. The optimized model, OCN, significantly improved on all the PCN models: OCN showed on average 76% and 40% better results than CF and PCN-gain (the best among the PCNs), respectively. Next, we conducted similar experiments with a massive Internet content access data set, from August 2001 to January 2002, of a representative content service company in Korea. In this data set, the items are contents provided by the company and a transaction is composed of the contents that a customer used in a certain period of time. We divided the whole data set into two disjoint sets by date in the ratio of three to one. Table 2 shows the response rates of CF, PCNs, and OCN. The response rate is the number of uses (hits) over the number of recommended
contents, shown in parentheses. In the table, the experimental results of the PCNs based on p-s, conviction, and lift were omitted since their performance was not comparable. As in the experiments with the purchase data set, the PCNs showed performance comparable with CF, and OCN performed significantly better than the other models.
7 Conclusion
We discussed the essential limitations of existing rule-interest metrics and raised the need to optimize the interest metric. We proposed a novel marketing model using connection networks and maximized its performance by constructing the network with the optimized interest metric. Although the suggested method performed impressively, we consider that there remains room for further improvement. More elaborate modeling of the interest metric and the score function is one candidate. Other function models, such as neural networks, and the relevant optimization are also worth trying. The optimized interest metric and the connection network model are not restricted to personalized marketing. We believe that they are applicable to various other problems; so far, we have found them applicable to a few practical problems such as e-mail auto-response systems and personalized search engines.

Acknowledgement. The authors thank Yung-Keun Kwon for insightful discussions about data mining techniques. This work was partly supported by Optus Inc. and the Brain Korea 21 Project. The RIACT at Seoul National University provided research facilities for this study.
S.-S. Choi and B.-R. Moon
Parameter Optimization by a Genetic Algorithm for a Pitch Tracking System
Yoon-Seok Choi and Byung-Ro Moon
School of Computer Science & Engineering, Seoul National University, Sillim-dong, Gwanak-gu, Seoul, 151-742 Korea
{mickey, moon}@soar.snu.ac.kr
Abstract. The emergence of multimedia data in databases requires adequate methods for information retrieval. In a music retrieval system based on humming, the first stage is to extract exact pitch periods from a flow of signals. Due to the complexity of speech signals, it is difficult to build a robust and practical pitch tracking system. We adopt a genetic algorithm to optimize the control parameters for note segmentation and pitch determination. We applied the results to HumSearch, a commercialized product, as its pitch tracking engine. Experimental results show that the proposed engine notably improves on the existing engine in HumSearch.
1 Introduction
As information systems advance, databases include multimedia data such as images, music, and movies. For music databases, there are many search methods based on melodic contours, authors, singers, or lyrics. Users want to find music by entering a few notes through a convenient method like humming. We aim to develop a pitch tracking engine for a music retrieval system that supports queries by humming. A number of music search algorithms have been proposed. Handel [1] emphasized that the melodic contour is the most critical factor that listeners use to distinguish a song from others; in other words, similar melodic contours make listeners consider songs the same. Music search by humming matches songs and input queries according to properties such as melodic contours and rhythmic variations. Although attractive, it contains some difficult processes such as melodic contour representation, note segmentation, and pitch period determination. Chou et al. [2] proposed a chord decision algorithm which transforms songs and queries into chord strings. Ghias et al. [3] used melodic contours of hummed queries, which consist of a sequence of relative differences in pitch between successive notes. They used an alphabet of three possible relationships between pitches: “U”, “D”, and “S”. Each symbol indicates that the current note is above (“U”), below (“D”), or the same as (“S”) the previous one. E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 2010–2021, 2003. © Springer-Verlag Berlin Heidelberg 2003
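The U/D/S contour described above is straightforward to compute. The sketch below assumes the notes are given as numeric pitch values (e.g. MIDI note numbers); the function name is our own, not from the paper:

```python
def contour(notes):
    """Convert a sequence of pitch values into a U/D/S melodic contour string.

    "U": current note above the previous one,
    "D": below, "S": the same.
    """
    symbols = []
    for prev, cur in zip(notes, notes[1:]):
        if cur > prev:
            symbols.append("U")
        elif cur < prev:
            symbols.append("D")
        else:
            symbols.append("S")
    return "".join(symbols)
```

Note that the contour has one symbol fewer than the number of notes, since each symbol encodes a transition.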
To represent a melody by a melodic contour, a high-precision pitch tracking process is required. Pitch tracking is a process to determine the pitch period of each note in a melody. It involves two main processes: note segmentation and pitch determination. Note segmentation determines where notes begin and terminate, and extracts the notes from a flow of signals. McNab et al. [4] performed note segmentation in two ways: one based on amplitude and the other on pitch. Ahmadi et al. [5] presented an improved system for voiced/unvoiced classification of hummed queries based on statistical analysis of cepstral peaks, zero-crossing rates, and energy magnitudes of short-time speech segments. Mingyang et al. [6] proposed a system for the pitch tracking of noisy speech with statistical anticipation. Pitch determination calculates the pitch period of a segmented note. There are several approaches to calculating the basic frequency, such as autocorrelation [7], AMDF [8], and cepstrum analysis [9]. There are also a number of studies under noisy environments. Shimomuro et al. [10] used weighted autocorrelation for pitch extraction of noisy speech. Kunieda et al. [11] calculated the fundamental frequency by applying autocorrelation in order to extract periodic features. Few people have perfect pitch, and most cannot remember all the exact pitches of their favorite songs, even when they sing them often. There may thus be errors in the melodic contours extracted from hummed queries. For this reason, we need approximate pattern matching. Our engine is utilized in HumSearch, which supports queries by humming with advanced approximate pattern matching. In this paper, we suggest a genetic algorithm for pitch tracking that transcribes a hummed query into a melodic contour under noisy environments.
We not only used the classical methods for pitch tracking, such as the energy of the short-time speech segment, zero-crossing rate, and cepstrum analysis, but also designed a contour analysis model. While the classical algorithms determined the threshold for note segmentation using a statistical method [5] [6] or an adaptive method [4], we optimize the control parameters of these methods within a genetic framework to enhance the performance of the pitch tracking engine. The remainder of this paper is organized as follows. In Section 2, we summarize preliminaries for pitch tracking. In Section 3, we explain our additional methods, such as contour analysis for note segmentation, and describe our system in Section 4. We report experiments and compare the results with an existing pitch tracking engine in Section 5. Finally, we draw our conclusions in Section 6.
2 Preliminaries
2.1 Note Segmentation
In pitch tracking systems, note segmentation is the most important process. To transcribe a hummed query into a melodic contour, we need to compute the number of notes contained in the query. Figure 2 shows an energy diagram of
Fig. 1. Original vocal waveform (spaced by 0.01s)
Fig. 2. Energy diagram of short-time speech segment
Fig. 3. Pitch diagram of speech segment
a hummed query. Figure 3 is the result of pitch determination from the energy diagram. Speech Segment. Because speech characteristics vary over time, speech processing is naturally a non-stationary process. However, we assume that the speech signal is short-time stationary for simple speech processing. We divide the whole sound signal into small speech segments and determine pitches on these small blocks. This is accomplished by multiplying the signal by a window function w_n whose value is zero outside some defined range. The rectangular window is defined as

w_n = 1 if 0 ≤ n < N, and 0 otherwise,

where n is the time point of speech samples and N is the window size. Note that if the size of the window is too large, our assumption becomes unreasonable.
Fig. 4. Note segmentation based on zero-crossing rate
In contrast, if the size of the window is too small, we lack enough information for determining a pitch. The size of the window is thus an important factor in determining the pitch period.

Energy of Speech Segment. The energy of a speech segment is defined to be the sum of absolute amplitudes of the input wave:

E = Σ_{i=0}^{N} |x(i)|

where x(i) is the amplitude of the original waveform signal and N is the size of a speech segment [12]. A statistical metric such as the average or median is used to determine a threshold which classifies segments with or without voice. To make the classification simpler, a user sings by “da” or “ta” because the consonants cause a drop in amplitude at each note boundary [3].

Zero-Crossing Rate. A zero-crossing occurs when neighboring points in a wave have different signs. The zero-crossing rate measures the number of zero-crossings in a given time interval [5]. The zero-crossing rate of the ith speech segment is computed as

ZCR_i = Σ_{n=1}^{N−1} |sgn[x_i(n)] − sgn[x_i(n−1)]|

where x_i(n) is the value at position n in the ith speech segment and N denotes the size of the speech segment. A segment with a high zero-crossing rate is judged to be unvoiced or sibilant sound; a segment with a low rate is judged to be a voiced interval (see Figure 4).
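The two segmentation features just defined can be sketched as follows. The framing helper and function names are our own; the zero-crossing measure follows the paper's formula literally, so each crossing contributes 2 to the sum:

```python
import numpy as np

def frame_signal(x, n):
    """Split a signal into consecutive rectangular-window segments of size n,
    discarding any trailing samples that do not fill a segment."""
    n_frames = len(x) // n
    return x[:n_frames * n].reshape(n_frames, n)

def short_time_energy(frames):
    """E = sum of absolute amplitudes within each segment."""
    return np.abs(frames).sum(axis=1)

def zero_crossing_rate(frames):
    """Sum of |sgn[x(n)] - sgn[x(n-1)]| within each segment,
    as in the paper's formula (each sign change contributes 2)."""
    signs = np.sign(frames)
    return np.abs(np.diff(signs, axis=1)).sum(axis=1)
```

A voiced/unvoiced decision would then compare these per-segment values against thresholds derived from a statistic such as the mean or median.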
2.2 Pitch Determination
Algorithms for determining the pitch of a vocal signal are roughly classified into two types: time-domain methods and frequency-domain methods. The time-domain methods, such as autocorrelation and AMDF, examine the structure of the sampled waveform; the frequency-domain methods, such as cepstrum analysis, perform the Fourier transform and examine the resulting cepstrum.

Autocorrelation. Autocorrelation is one of the oldest classical algorithms for pitch determination. It shows the similarity in phase between two values of the speech signal, x_n and x_{n+k}. The autocorrelation function R(k) is calculated from a numerical sequence as

R(k) = Σ_{n=1}^{N−k} x(n) · x(n + k)

where k is the correlation distance in the time sequence, N is the length of the sequence, and x(n) is the value at position n in the time sequence. The function value with respect to k expresses the average correlation between numbers separated by distance k in the sequence. The correlation distance k with the maximum function value corresponds to the pitch period of the input signal.

Cepstrum Analysis. The cepstrum is defined to be the real part of the inverse Fourier transform of the log-power spectrum of x(n), the value of the acoustic signal in the time sequence. Since cepstrum analysis has a high computation cost, we use the Fast Fourier Transform (FFT). For the FFT, we restrict the window size of the speech segment to be a power of two. Figure 7 shows the pitch period extracted from the vocal waveform of Figure 6.
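As a rough illustration of the time-domain approach, the autocorrelation search over candidate lags might look like this (the lag bounds and function name are our own assumptions; in practice the bounds come from the admissible pitch frequency range):

```python
import numpy as np

def autocorr_pitch(x, k_min, k_max):
    """Return the lag k in [k_min, k_max] maximizing R(k) = sum_n x(n) x(n+k).

    The maximizing lag is taken as the pitch period in samples.
    Requires k_min >= 1 and k_max < len(x).
    """
    best_k, best_r = k_min, -np.inf
    for k in range(k_min, k_max + 1):
        r = np.dot(x[:-k], x[k:])  # R(k): overlap of the signal with itself shifted by k
        if r > best_r:
            best_k, best_r = k, r
    return best_k
```

For a clean periodic signal, the returned lag equals the period in samples; the pitch frequency is then the sampling rate divided by that lag.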
2.3 The MIDI Notes Presentation
Since musical units such as octaves or cents are relative measures, we use the MIDI notation for note presentation. MIDI is a standard for communicating with electronic musical instruments and also a standard representation of the western musical scale; it is thus appropriate for representing the melodic contour of a song or hummed input. MIDI assigns an integer in [0, 127] to each note. Middle C (C4) is assigned 60, the note just above is 61, and the one below is 59. MIDI note 0 corresponds to approximately 8.18 Hz and the highest defined note, 127, to approximately 12,544 Hz. The output of the pitch tracking system is eventually stored in MIDI form. Figure 12 contains a sequence of 12 MIDI notes from a hummed input.
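The standard frequency-to-MIDI mapping (A4 = 440 Hz = note 69, with 12 semitones per octave) can be sketched as follows; the function name is ours:

```python
import math

def freq_to_midi(f):
    """Map a frequency in Hz to the nearest MIDI note number.

    Uses the standard tuning A4 = 440 Hz = MIDI note 69;
    each semitone is a factor of 2**(1/12) in frequency.
    """
    return round(69 + 12 * math.log2(f / 440.0))
```

Inverting the formula, f = 440 * 2**((m - 69) / 12), gives the nominal frequency of MIDI note m.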
3 Enhanced Methods for Pitch Tracking System
When a user hums with a slide between notes, the amplitude of the unvoiced sound segment does not drop. Thus repeated notes may not be segmented successfully using the
Fig. 5. Cepstrum analysis
Fig. 6. Original waveform signal
Fig. 7. Cepstrum pitch determination (p = pitch period)
Fig. 8. Note segmentation based on energy of short-time speech segment
Fig. 9. Note segmentation based on smoothed energy diagram for detecting a valley
methods such as energy of the speech segment and zero-crossing rate. Such unsegmented notes should be separated into two notes (see Figure 8). We try to separate successive notes with a contour analysis of the energy diagram: we detect valleys in the energy diagram and adopt them as a new feature for note segmentation. We use a moving-average method to flatten the energy diagram and remove small valleys. This method divides successive notes into two notes (Figure 9). In our system, we divided the resolution of the relative pitch differences into 15 levels (U1 ∼ U7 (up), D1 ∼ D7 (down), and R (same)). If the difference is beyond this range, it is assigned one of the boundary levels, U7 or D7. This fine-grained strategy allows a more accurate representation of pitch differences between successive notes.
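A possible sketch of the smoothing and the 15-level quantization described above. Treating the pitch difference as a semitone count is our assumption (the paper does not state the unit), and the function names are ours:

```python
def moving_average(values, w):
    """Smooth a diagram with a trailing window of size w (w <= 0 disables smoothing)."""
    if w <= 0:
        return [float(v) for v in values]
    out = []
    for i in range(len(values)):
        window = values[max(0, i - w + 1): i + 1]
        out.append(sum(window) / len(window))
    return out

def quantize_diff(semitones):
    """Map a pitch difference to one of 15 levels: D7..D1, R, U1..U7.

    Differences beyond the range are clamped to the boundary levels U7/D7.
    """
    d = max(-7, min(7, round(semitones)))
    if d == 0:
        return "R"
    return ("U" if d > 0 else "D") + str(abs(d))
```

Valley detection for note splitting would then look for local minima in the smoothed energy sequence that fall below a threshold controlled by one of the evolved parameters.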
Fig. 10. Optimized pitch tracking system with GA
4 Optimized Pitch Tracking System with GA: GAPTS
4.1 System Architecture
The system architecture is shown in Figure 10. A hummed query fed into the system is evaluated through the pitch tracking and query systems. The efficiency of the pitch tracking module depends on control parameters such as weight values, window sizes, etc. There are three main modules in the system: voiced/unvoiced classification, note segmentation, and pitch determination.
4.2 GA Framework
Encoding. A chromosome consists of 11 genes. Each gene corresponds to a parameter that controls the pitch tracking system. The function of each gene is explained in Table 1.
Initialization. Initial solutions are generated at random. We set the population size to 100.
2018
Y.-S. Choi and B.-R. Moon
Create initial population;
Evaluate all chromosomes;
do {
    Choose parent1 and parent2 from population;
    offspring = crossover(parent1, parent2);
    mutation(offspring);
    evaluation(offspring);
    replace(parent1, parent2, offspring);
} until (stopping condition);
return the best solution;
Fig. 11. Genetic algorithm framework

Table 1. The functions of genes

Gene  Description
0     The window size of the speech segment
1     A weight for determining the threshold of short-time energy which divides voiced/unvoiced segments
2     The upper bound of frequency considered as voiced pitch
3     The lower bound of frequency considered as voiced pitch
4     A weight for determining the threshold of zero-crossing rate which divides voiced/unvoiced segments
5     Boundary value for deciding a sampling count to determine pitch periods
6     The window size of the analysis section for pitch determination
7     The window size of the moving-average method for smoothing the contour of the short-time energy diagram (0 disables the method)
8     The window size of the moving-average method for smoothing the contour of the zero-crossing-rate diagram (0 disables the method)
9     Sampling count for pitch determination
10    A weight to determine the threshold that cuts a note into two or more notes in note segmentation with contour analysis
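With this encoding, a chromosome is simply a list of 11 gene values, each drawn from its admitted range. A minimal initializer sketch (the paper does not list the admitted ranges, so any concrete bounds passed in are illustrative):

```python
import random

def random_chromosome(ranges, seed=None):
    """Create one chromosome: a list of gene values, each drawn uniformly
    from its admitted (lo, hi) range. Integer bounds give integer genes
    (e.g. window sizes, counts); real bounds give real genes (e.g. weights)."""
    rng = random.Random(seed)
    genes = []
    for lo, hi in ranges:
        if isinstance(lo, int) and isinstance(hi, int):
            genes.append(rng.randint(lo, hi))
        else:
            genes.append(rng.uniform(lo, hi))
    return genes
```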
Parent Selection. We assign each chromosome a fitness value, defined to be the sum of edit distances between the melodic contours of the hummed queries and the original songs. Let SED(t, q) be the string edit distance between a hummed query q and the target contour t. The fitness value F of a chromosome is defined as

F = Σ_{i=1}^{N} SED(T(i), Q(i))

where Q(i) is the ith hummed query, T(i) is the ith target contour, and N is the number of hummed queries. We use the tournament selection method [13] with tournament size 2.

Crossover. We use the uniform crossover [14]. The relatively high disruptiveness of the uniform crossover helped our algorithm escape from local optima and converge to better solutions.
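The fitness above relies on a string edit distance between contour strings. A standard Levenshtein sketch (function names are ours; the paper does not specify the exact edit-cost model, so unit costs are assumed):

```python
def edit_distance(a, b):
    """Levenshtein distance between two contour strings (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def fitness(queries, targets):
    """Sum of edit distances between extracted and target contours (lower is better)."""
    return sum(edit_distance(q, t) for q, t in zip(queries, targets))
```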
Table 2. Sets of parameters

Gene  Optus PTS  GAPTS
0     11         11
1     0.4        0.47
2     530.0      454.91
3     90.0       64.28
4     0.4        0.5
5     6          10
6     2          3
7     0          1
8     0          0
9     1.5        5.00
10    1.5        0.68
Mutation. We randomly select each gene with probability 0.5 and mutate its value within the admitted range. Replacement. After generating an offspring, GAPTS replaces a member of the population with the offspring. We replace the inferior of the two parents with the offspring if the offspring is not worse than both parents; otherwise, we replace the worst member of the population. This scheme is a compromise between preselection [15] and GENITOR-style replacement [16]. Stopping Condition. The GA stops when the generation count reaches a predetermined number, or after 2000 consecutive failures to replace one of the parents.
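The replacement rule can be sketched as follows (all names are our own; lower fitness is better here, since the fitness is an error sum):

```python
def replace_member(population, fitnesses, p1, p2, child, child_fit):
    """Replacement scheme: if the child is no worse than both parents,
    it replaces the worse of the two parents; otherwise it replaces the
    worst member of the whole population. Returns the replaced index."""
    if child_fit <= fitnesses[p1] and child_fit <= fitnesses[p2]:
        idx = p1 if fitnesses[p1] >= fitnesses[p2] else p2
    else:
        idx = max(range(len(population)), key=fitnesses.__getitem__)
    population[idx] = child
    fitnesses[idx] = child_fit
    return idx
```

Replacing a parent preserves diversity (as in preselection), while falling back to the worst member keeps selection pressure (as in GENITOR).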
5 Experimental Results
The melodic contours extracted by GAPTS are evaluated by SED, the string edit distance error mentioned in Section 4.2. Ten people hummed 95 songs for the test. Following the convention for clear segmentation, they sang by “ta” or “da” into a microphone under usual, noisy conditions. All programs were written in C++ and run on a Pentium III 866 MHz machine with the Linux operating system. We did not use any hardware device for acoustic signal processing except the microphone. For a robust comparison between the Optus PTS and GAPTS, we followed the 5-fold cross-validation approach [17] [18]. We randomly divided the entire hummed query set D into 5 mutually exclusive subsets of approximately equal size; GAPTS was trained and tested 5 times. Table 2 shows the sets of parameters; the first is the set that has been used in HumSearch, a commercial query-by-humming product of Optus Inc., and the second is the set found by the GA. The original SED of HumSearch was 443; GAPTS improved it to 424, a notable improvement. When we replaced the existing set of parameters in HumSearch with the new set, HumSearch also showed improvement. Out of the
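A sketch of the 5-fold split used above (the seed and function name are our own assumptions):

```python
import random

def five_fold_splits(items, seed=0):
    """Randomly partition items into 5 roughly equal disjoint folds;
    each fold serves once as the test set, the rest as the training set."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::5] for i in range(5)]
    for k in range(5):
        train = [x for i, f in enumerate(folds) if i != k for x in f]
        yield train, folds[k]
```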
Fig. 12. A snapshot of our hummed query system
95 songs, we found 10 songs ranked up and 8 songs ranked down. We should note that the parameter set for HumSearch had been tuned over a long period for a commercial product. Considering this, the improvement in SED and the change of ranks is notable. Figure 12 shows a snapshot of our interactive hummed query system.
6 Conclusions
We extracted the parameters that control note segmentation and pitch determination, and optimized them with a genetic algorithm combined with the classical methods: the short-time energy diagram, zero-crossing rate, and contour analysis of the short-time energy diagram for note segmentation, and cepstrum analysis for pitch determination. Our pitch tracking system worked well under various conditions, e.g., singer gender, noise, and input device (mobile phone or microphone). Acknowledgments. This work was partly supported by Optus Inc. and the Brain Korea 21 Project. The RIACT at Seoul National University provided research facilities for this study.
References 1. S. Handel. Listening: An Introduction to the Perception of Auditory Events. The MIT Press, 1989.
2. T.-C. Chou, A.L.P. Chen, and C.-C. Liu. Music databases: Indexing techniques and implementation. In Proc. of the IEEE International Workshop on Multimedia Database Management Systems, 1996. 3. A. Ghias, J. Logan, D. Chamberlin, and B.C. Smith. Query by humming. In Proc. ACM Multimedia 95, 1995. 4. R.J. McNab, L.A. Smith, I.H. Witten, C.L. Henderson, and S.J. Cunningham. Towards the digital music library: Tune retrieval from acoustic input. In Proc. ACM Digital Libraries, 1996. 5. S. Ahmadi and A.S. Spanias. Cepstrum-based pitch detection using a new statistical V/UV classification algorithm. IEEE Trans. Speech and Audio Processing, 7(3):333–338, 1999. 6. M. Wu, D. Wang, and G.J. Brown. Pitch tracking based on statistical anticipation. In Proc. IJCNN '01 International Joint Conference, volume 2, pages 866–871, 2001. 7. L.R. Rabiner. On the use of autocorrelation analysis for pitch detection. IEEE Trans. Acoustics, Speech, and Signal Processing, ASSP-25(1):24–33, 1977. 8. M.J. Ross, H.L. Shaffer, A. Cohen, R. Freudberg, and H.J. Manley. Average magnitude difference function pitch extractor. IEEE Trans. Acoustics, Speech, and Signal Processing, ASSP-22(1):353–362, 1974. 9. A.M. Noll. Cepstrum pitch determination. Journal of the Acoustical Society of America, 41:293–309, 1967. 10. T. Shimamura and H. Kobayashi. Weighted autocorrelation for pitch extraction of noisy speech. IEEE Trans. Speech and Audio Processing, 9(7):727–730, 2001. 11. N. Kunieda, T. Shimamura, and J. Suzuki. Robust method of measurement of fundamental frequency by ACLOS: autocorrelation of log spectrum. In IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 232–235, 1996. 12. J.F. Kaiser. On a simple algorithm to calculate the 'energy' of a signal. In Proc. IEEE ICASSP, pages 381–384, 1990. 13. D.E. Goldberg, K. Deb, and B. Korb. Don't worry, be messy. In Proceedings of the Fourth International Conference on Genetic Algorithms, pages 24–30, 1991. 14. G. Syswerda. Uniform crossover in genetic algorithms. In International Conference on Genetic Algorithms, pages 2–9, 1989. 15. D. Cavicchio. Adaptive Search Using Simulated Evolution. PhD thesis, University of Michigan, Ann Arbor, MI, 1970. Unpublished. 16. D. Whitley and J. Kauth. GENITOR: A different genetic algorithm. In Proceedings of the Rocky Mountain Conference on Artificial Intelligence, pages 118–130, 1988. 17. B. Efron and R. Tibshirani. Cross-validation and the bootstrap: Estimating the error rate of a prediction rule. Technical Report TR-477, Dept. of Statistics, Stanford University, 1995. 18. R. Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pages 1137–1143, 1995.
Secret Agents Leave Big Footprints: How to Plant a Cryptographic Trapdoor, and Why You Might Not Get Away with It John A. Clark, Jeremy L. Jacob, and Susan Stepney Department of Computer Science, University of York Heslington, York, YO10 5DD, UK {jac,jeremy,susan}@cs.york.ac.uk
Abstract. This paper investigates whether optimisation techniques can be used to evolve artifacts of cryptographic significance which are apparently secure, but which have hidden properties that may facilitate cryptanalysis. We show how this might be done and how such sneaky use of optimisation may be detected.
1 Introduction
The issue of optimisation is fundamental to cryptography. Pseudo-random bit streams should appear as random as possible, cryptanalysts try to extract the maximum amount of useful information from available data, and designers wish to make attacks as difficult as possible. It is not surprising that some security researchers have begun to consider the use of heuristic optimisation algorithms. Both simulated annealing and genetic algorithms have been used to decrypt simple substitution and transposition ciphers [6,8,13,14], and there have been some recent and successful attempts to attack modern-day crypto protocols based on NP-hard problems [3]. Recent research has used optimisation for design synthesis, evolving Boolean functions for cryptographic products, most notably using hill climbing [10] and genetic algorithms [9]. Other recent work has applied search to the synthesis of secure protocols [1]. Some cryptographic algorithms have been shown to be insecure. Others are thought to be secure, but subsequent discoveries may prove otherwise. Cryptanalysts live in hope that the ciphers they investigate have as yet undiscovered features that will allow them to be broken, ‘trapdoors’ if you like, hidden ways of opening up the algorithms to let the secrets of those using them fall out! Discovering such trapdoors is very hard work. There is, however, a somewhat sneakier approach: deliberately create an algorithm with a subtle trapdoor in it but which looks secure according to currently accepted criteria. Users of the algorithm may communicate secrets, but the analyst who knows the trapdoor will be able to access them. If users subsequently find out that a trapdoor had been deliberately engineered into the algorithm, they may not be very happy. The trapdoor should be so subtle that they have little chance of discovering it. E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 2022–2033, 2003. © Springer-Verlag Berlin Heidelberg 2003
This paper investigates whether optimisation techniques can be used to surreptitiously plant such trapdoors. This issue goes to the heart of past cryptosystem design. The debate about whether the Data Encryption Standard [11] has a secret trapdoor in its substitution boxes¹ has raged since its publication [12]. A major factor in this debate is that initially the design criteria were not made public (and the full criteria are still unknown). There was a distinct suspicion that security authorities had designed in a trapdoor or otherwise knew of one.² We contend that combinatorial and other numerical optimisation techniques can be demonstrably open or honest. Goals are overt (that is, maximising or minimising a specified design function). The means of achieving those ends is entirely clear: the design algorithm is the search technique and its initialisation data, coupled with the specified design function to be minimised. An analyst who doubts the integrity of the design process can simply replay the search to obtain the same result. Furthermore, the approach permits some surprising analyses.
2 Public and Hidden Design Criteria
The set of all possible designs forms a design space D. From within the design space we generally seek one instantiation whose exhibited properties are “good” in some sense. (In the case of DES, one design space may be thought of as the set of legitimate S-box configurations. The standard uses a particular configuration, but many others are possible.) When designing a cryptosystem, we may require that certain stringent public properties P be met. Generally only a small fraction of possible designs satisfy these properties. Consider now a trapdoor property T. If a design has property T then cryptanalysis may be greatly facilitated. Some designs may simultaneously satisfy P and T. There are two possibilities here: – All designs (or a good number) satisfying P also satisfy T. This suggests that the basic design family is flawed (though the flaw may not be publicly known). – Only a small fraction of designs satisfying P also satisfy T. Here the trapdoor planter must drive the design in such a way that T is achieved simultaneously with P.

¹ DES is a 16-round cipher. Each round is a mini-encryption algorithm, with the outputs of one round forming the inputs of the next. Each round application further obscures the relationship to the initial inputs. At the end of 16 rounds, there should be a seemingly random relationship between the initial inputs and the final outputs. (Of course, the relationship is not random, since if you have the secret key you can recover the plaintext inputs.) Within each round there are several substitution boxes (S-boxes). These take 6 input bits and produce 4-bit outputs. The details need not concern us here; suffice it to say that S-boxes are designed to give a complex input-output mapping to prevent particular types of attacks.
² As it happens, it seems they intervened to make the algorithm secure against differential cryptanalysis, a form of attack that would be discovered by academics only in the late 1980s.
J.A. Clark, J.L. Jacob, and S. Stepney
Fig. 1. Combining bit streams
It may be computationally very hard to locate good designs. Heuristic optimisation offers a potential solution. Suppose a cost function motivated solely by the goal of evolving a design (from design space D) to satisfy P is given by h : D → ℝ. (We denote this cost function by h since its choice is honest, free from hidden motives.) Suppose also that a cost function aimed at evolving a design satisfying the trapdoor property T is given by t : D → ℝ (t for trapdoor). If cryptographic functionality is developed using only trapdoor criteria T it is highly likely that someone will notice! Indeed, reasonable performance of the resulting artifacts when judged against public properties P would be accidental. Similarly, an artifact developed using only the honest (public) criteria P will generally have poor trapdoor performance. There is a tradeoff between malicious effectiveness (i.e. attaining good trapdoor properties) and being caught out. The more blunt the malice, the more likely the detection. We capture the degree of bluntness via a parameter λ, which we term the malice factor. The design problem for malice factor λ ∈ [0.0, 1.0] is to minimise a cost function of the form

cost(d) = (1 − λ)h(d) + λt(d)    (1)

At the extremes we have the honest design problem, λ = 0, and unrestrained malice, λ = 1. In between we have a range of tradeoff problems. We now illustrate how optimisation can be used and abused in the development of perhaps the simplest cryptographically relevant element, namely a Boolean function.
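Equation 1 can be sketched directly in code. The toy honest and trapdoor costs below are hypothetical stand-ins (the paper's actual costs for Boolean functions come later); only the weighting scheme is taken from the text.

```python
# Sketch of the malice-factor tradeoff of Equation 1.
# h and t are placeholders for the honest and trapdoor cost functions.
def combined_cost(design, h, t, lam):
    """Cost for malice factor lam in [0.0, 1.0]:
    lam = 0 is the honest design problem, lam = 1 unrestrained malice."""
    assert 0.0 <= lam <= 1.0
    return (1.0 - lam) * h(design) + lam * t(design)

# Toy illustration on bit-vector "designs": the honest cost rewards
# one pattern, the trapdoor cost a conflicting one.
h = lambda d: sum(d)            # honest: prefer all-zero designs
t = lambda d: len(d) - sum(d)   # trapdoor: prefer all-one designs
print(combined_cost([1, 0, 1, 0], h, t, 0.5))  # -> 2.0
```

With conflicting h and t, intermediate λ values trade one goal against the other, exactly the tension the malice factor is meant to capture.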
3 Cryptographic Design with Optimisation

3.1 Boolean Functions
Boolean functions play a critical role in cryptography. Each output from an S-box can be regarded as a Boolean function in its own right. Furthermore, in
Secret Agents Leave Big Footprints
stream ciphers Boolean functions may be used to combine the outputs of several bit streams. Figure 1 illustrates the classic stream cipher model. The plaintext stream {Pi} of bits is XOR-ed with a pseudo-random bit stream {Zi} to give a ciphertext stream {Ci}. The plaintext is recovered by the receiver by XOR-ing the cipherstream with the same pseudo-random stream. The pseudo-random stream is formed from several bit streams generated by Linear Feedback Shift Registers (LFSRs) using a suitable combining function. The initial state of the registers forms the secret key. The function f must be sufficiently complex that cryptanalysts will not be able to determine the initial state of the registers, even when they know what the plaintext is (and so can recover the pseudo-random stream {Zi}). Two desirable features are that the function should be highly nonlinear and possess low auto-correlation. These are explained below, together with some further preliminaries. We denote the binary truth table of a Boolean function by f : {0, 1}^n → {0, 1}, mapping each combination of n binary variables to some binary value. If the number of combinations mapping to 0 is the same as the number mapping to 1 then the function is said to be balanced. A cryptographic designer may require a Boolean function to have this property. The polarity truth table is a particularly useful representation for our purposes. It is defined by f̂(x) = (−1)^f(x). Two functions f, g are said to be uncorrelated when Σ_x f̂(x)ĝ(x) = 0. If so, approximating f by g will be right half the time and wrong half the time. An area of particular importance for cryptanalysts is the ability to approximate a function f by a simple linear function, that is, one of the form L_ω(x) = ω₁x₁ ⊕ ω₂x₂ ⊕ ... ⊕ ωₙxₙ (where ⊕ is exclusive or). For example, x₁ ⊕ x₃ ⊕ x₄ is a linear function (with ω = 1011), whereas x₁x₂ ⊕ x₃ ⊕ x₄ is not: it has a quadratic term.
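The definitions above can be sketched in a few lines; the truth tables here are lists indexed by the input x, and the tiny n = 2 examples are illustrative only.

```python
# Minimal sketch: binary truth table, polarity form f_hat(x) = (-1)^f(x),
# balancedness, and the correlation sum between two functions.
def polarity(f):
    return [(-1) ** v for v in f]

def is_balanced(f):
    return f.count(0) == f.count(1)

def correlation(f, g):
    """Sum over x of f_hat(x) * g_hat(x); 0 means f and g are uncorrelated."""
    return sum(a * b for a, b in zip(polarity(f), polarity(g)))

# n = 2 example: f = x1 XOR x2, g = x1 (both balanced, uncorrelated)
f = [0, 1, 1, 0]
g = [0, 0, 1, 1]
print(is_balanced(f), correlation(f, g))  # True 0
```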
One of the cryptosystem designer's tasks is to make such approximation as difficult as possible; the function f should exhibit high non-linearity³. In terms of the nomenclature given above, the formal definition of non-linearity is given by

NL(f) = (1/2) ( 2^n − max_ω |Σ_x f̂(x)L̂_ω(x)| )    (2)
The 2^n functions {L̂_ω(x), ω : 0..(2^n − 1)} span the space of Boolean functions on n variables. The term Σ_x f̂(x)L̂_ω(x) (usually referred to as the Walsh value at ω) can be thought of as the dot product of f̂ with L̂_ω. Thus, we are trying to minimise the maximum absolute value of a projection onto a linear function.
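For small n, the Walsh values and the non-linearity of Equation 2 can be computed by brute force. The helper names below are our own; a fast Walsh-Hadamard transform would be used in practice.

```python
# Brute-force Walsh values and non-linearity (Equation 2).
def walsh(f_hat, n, omega):
    """W(omega) = sum_x f_hat(x) * (-1)^(omega . x mod 2)."""
    total = 0
    for x in range(2 ** n):
        l = bin(x & omega).count("1") % 2  # linear function L_omega(x)
        total += f_hat[x] * (-1) ** l
    return total

def nonlinearity(f, n):
    f_hat = [(-1) ** v for v in f]
    wmax = max(abs(walsh(f_hat, n, w)) for w in range(2 ** n))
    return (2 ** n - wmax) // 2

# x1 XOR x2 is itself linear, so its non-linearity is 0
print(nonlinearity([0, 1, 1, 0], 2))  # 0
# x1 AND x2 is the most non-linear 2-variable function: NL = 1
print(nonlinearity([0, 0, 0, 1], 2))  # 1
```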
³ If it doesn't, then various forms of attack become possible, e.g. the Best Affine Attack. Loosely, linear functions are not very complex input-output mappings. If the function f̂ can be approximated by a linear function, it is just too predictable to resist attack. See [5] for details.
Low auto-correlation is another important property that cryptographically strong Boolean functions should possess. Auto-correlation is defined by

AC_max = max_{s≠0} |Σ_x f̂(x)f̂(x ⊕ s)|    (3)

It is essentially a property about dependencies between periodically spaced elements. The details need not concern us here. What is important is that the non-linearity and autocorrelation of a given Boolean function f can be measured. For the purposes of this paper we simply identify balancedness, high non-linearity and low autocorrelation to serve as our desirable publicly stated properties P.

3.2 An Honest Cost Function
We aim to provide balanced Boolean functions of 8 variables with high nonlinearity and low autocorrelation. We adopt a search strategy that starts with a balanced (but otherwise random) function, and explores the design space by moving between neighbouring functions until an appropriate solution is found. A function f̂ can be represented by a vector of length 256 with elements equal to 1 or −1; the first element is f̂(00000000), the last f̂(11111111), and so on. We define the neighbourhood of a balanced function f̂ to be all functions ĝ obtained from f̂ by negating any two dissimilar elements in the vector (that is, changing a 1 to a −1 and a −1 to a 1). Provided the initial function was balanced, the search maintains that balance. To use heuristic optimisation techniques we need a cost function whose minimisation gives good values of non-linearity and low values of autocorrelation. We use the (honest) cost function

h(f̂) = Σ_ω | |Σ_x f̂(x)L̂_ω(x)| − 12 |³    (4)
where ω and x range over {0, 1}^8. The details of this cost function and the rationale for it can be found in [2]. It gives very good results for Boolean functions of 8 variables, but we admit it is not intuitively clear. There is a tradeoff between a "clearly" honest cost function and one with good performance. The primary search technique used was the simulated annealing algorithm [7] with the honest cost function of Equation 4. This was followed by hill-climbing (attempting to maximise directly the non-linearity as defined in Equation 2) from the resulting function f̂. Figure 2 shows the results from 400 runs of the technique using λ = 0, that is, total honesty. 115 of the 400 runs gave rise to functions with non-linearity of 116 and autocorrelation of 32. The highest non-linearity ever attained by any technique (ignoring auto-correlation) is 116. Until recently the lowest autocorrelation possible was thought to be 24. Recent optimisation-based work [4] generated functions with autocorrelation of 16 (but with a nonlinearity value of 112). No function has ever been demonstrated with nonlinearity 116 and autocorrelation 16.
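Under our reading of Equation 4, the honest cost and the balance-preserving neighbourhood move can be sketched as follows; the simulated-annealing driver itself is omitted, and all helper names are ours.

```python
# Sketch: honest cost (Equation 4, as we read it) and the two-element
# swap move over balanced polarity vectors of length 256 (n = 8).
import random

N = 8
SIZE = 2 ** N  # 256

def walsh(f_hat, omega):
    return sum(f_hat[x] * (-1) ** (bin(x & omega).count("1") % 2)
               for x in range(SIZE))

def honest_cost(f_hat):
    # h(f) = sum over omega of | |W(omega)| - 12 |^3
    return sum(abs(abs(walsh(f_hat, w)) - 12) ** 3 for w in range(SIZE))

def random_balanced():
    f_hat = [1] * (SIZE // 2) + [-1] * (SIZE // 2)
    random.shuffle(f_hat)
    return f_hat

def neighbour(f_hat):
    """Negate one +1 and one -1 entry, so balance is preserved."""
    g = list(f_hat)
    i = random.choice([k for k in range(SIZE) if g[k] == 1])
    j = random.choice([k for k in range(SIZE) if g[k] == -1])
    g[i], g[j] = -1, 1
    return g

f = random_balanced()
g = neighbour(f)
assert sum(g) == 0  # still balanced
```

The brute-force Walsh sum makes each cost evaluation O(2^{2n}); it is adequate for n = 8 but a fast Walsh-Hadamard transform would be used in a real search.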
              Non-linearity
ACmax    110   112   114   116
  56       0     0     0     0
  48       0     2     5    13
  40       0     9    68    80
  32       0    29    74   115
  24       0     1     1     3

Fig. 2. Honest evolution, sample size = 400. Simultaneous high non-linearity and low autocorrelation (bottom right corner) are desirable.
Our approach appears to give excellent results for both criteria simultaneously. Thus the technique can be used effectively by honest researchers to achieve honest goals. The search proceeded from a random starting function to its final goals guided by the honest cost function at each stage. Now we choose to guide the search in a malicious way.

3.3 Evolving a Trapdoor
We now seek designs that perform well when judged against the public criteria P but also possess some secret trapdoor property T. For illustrative purposes only, we choose to define a trapdoor property T as sufficient closeness to some arbitrary but fixed Boolean function represented by g. We take Σ_x f̂(x)ĝ(x) as a measure of closeness. The maximum value 256 occurs when f and g are the same function (or when one is the logical negation of the other). Since our design framework is based around minimising cost, the closer the value to 256, the smaller should be the cost. This gives rise to a trapdoor cost function

t(f̂) = ( 256 − |Σ_x f̂(x)ĝ(x)| )³    (5)

The exponent 3 is adopted for comparability with the honest cost function. It serves to punish poorly performing functions very highly and affects the rate at which the search converges. We now combine the honest and dishonest functions in a single cost function cost as given in Equation 1 (with the function f̂ playing the role of design d). We applied the technique for various values of λ in the range [0, 1]. For a randomly generated trapdoor g, 30 runs were carried out for each of λ ∈ {0.0, 0.2, 0.4, 0.6, 0.8, 1.0}. The same trapdoor was used in all runs. Figure 3 records the results. Thus, for λ = 0.0, 12 of the 30 runs give rise to functions with nonlinearity of 116 and autocorrelation of 32. The average closeness to the trapdoor g (given by Σ_x f̂(x)ĝ(x)) for the 30 runs in this case is 12.8. The results show that optimisation techniques can be used effectively to obtain artifacts that provide good public performance measures and good trapdoor functionality. The figures show that as λ increases the performance of the evolved
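Under our reading of Equation 5 (with the absolute value covering the closeness sum, so that a function and its logical negation both count as maximally close), the trapdoor cost and the combined cost can be sketched as:

```python
# Trapdoor cost (Equation 5, as we read it) and the combined cost of
# Equation 1, over polarity vectors of +1/-1 entries of length 256.
def closeness(f_hat, g_hat):
    """|sum_x f_hat(x) g_hat(x)|: 256 when f = g or f = not g."""
    return abs(sum(a * b for a, b in zip(f_hat, g_hat)))

def trapdoor_cost(f_hat, g_hat):
    # t(f) = (256 - closeness)^3; the cube punishes distant functions hard
    return (256 - closeness(f_hat, g_hat)) ** 3

def combined_cost(f_hat, g_hat, honest, lam):
    # Equation 1, with f_hat playing the role of the design d
    return (1 - lam) * honest(f_hat) + lam * trapdoor_cost(f_hat, g_hat)

g = [1] * 256  # hypothetical fixed trapdoor function (constant, for illustration)
assert trapdoor_cost(g, g) == 0
```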
λ = 0.0 (Mean Trap = 12.8)
            Non-linearity
ACmax    110   112   114   116
  64       0     0     0     0
  56       0     0     0     0
  48       0     0     0     1
  40       0     0     3     4
  32       0     2     7    12
  24       0     0     0     1

λ = 0.2 (Mean Trap = 198.9)
            Non-linearity
ACmax    110   112   114   116
  64       0     0     0     0
  56       0     0     1     0
  48       0     0     7     0
  40       0     0    16     0
  32       0     0     6     0
  24       0     0     0     0

λ = 0.4 (Mean Trap = 213.1)
            Non-linearity
ACmax    110   112   114   116
  64       0     0     1     0
  56       0     0     2     0
  48       0     1     6     0
  40       0     2    17     0
  32       0     0     1     0
  24       0     0     0     0

λ = 0.6 (Mean Trap = 222.1)
            Non-linearity
ACmax    110   112   114   116
  80       0     0     0     0
  72       0     0     0     0
  64       0     0     1     0
  56       0     1     2     0
  48       0     5     7     0
  40       0     2    12     0
  32       0     0     0     0
  24       0     0     0     0

λ = 0.8 (Mean Trap = 232.3)
            Non-linearity
ACmax    110   112   114   116
  80       0     0     0     0
  72       0     0     0     0
  64       0     1     0     0
  56       0     4     1     0
  48       0    19     1     0
  40       0     3     1     0
  32       0     0     0     0
  24       0     0     0     0

λ = 1.0 (Mean Trap = 242.7)
            Non-linearity
ACmax    110   112   114   116
  80       2     0     0     0
  72       4     1     0     0
  64      10     6     0     0
  56       2     5     0     0
  48       0     0     0     0
  40       0     0     0     0
  32       0     0     0     0
  24       0     0     0     0
Fig. 3. Performance for various malice factors λ. Sample size = 30 for all experiments. For the λ = 1 result, any remaining non-linearity and auto-correlation arises from the dishonest g and the final hill-climb on non-linearity.
designs with respect to the public criteria worsens (i.e. their non-linearity gets smaller and their autocorrelation gets bigger). Of course, the closeness of the evolved designs to the trapdoor property gets radically better. An honest sample (that is, λ = 0) has an average closeness of 12.8. With λ = 0.2 the average solution obtained has a closeness value of 198.9. With λ = 1.0 the average is 242.7.⁴
4 Detection
If you know what the trapdoor is then you can of course verify that the artifact has it. But what if you don't know what the trapdoor is?

4.1 Telling Truth from Fiction
We wish to distinguish honestly derived functions from maliciously derived ones. We would like to develop discriminating techniques with wide-ranging applicability. We observe that functions (and indeed designs more generally) have some
⁴ Recall that after optimising with respect to the indicated cost function, we then hill-climb (maximise) with respect to non-linearity alone. This typically serves to reduce the trapdoor value attained; it is simply measured at the end of the hill-climb. It plays no part during the hill-climb.
concrete binary representation. Our current function representation is enumerative, but programs that implement particular functions also have binary representations (the vector of object code bits being a particular one). A function within a particular family may be defined by some set of configuration data (e.g. S-boxes). It is the properties of such vector representations of designs that we propose to use as the basis of detection. This has the benefit of being more problem-independent. We illustrate one such way in which families of designs can be discriminated. Suppose we have two groups of designs G1, G2. Each group comprises a number of designs represented by vectors. For each group we calculate the mean vector of the vectors in that group. Denote these mean vectors by m1, m2. Re-express m1 = µm2 + z, where z is orthogonal to m2. The first component µm2 is the projection of m1 onto m2. We now project each design onto the residual z and measure the length of such projections. In general, if the groups G1 and G2 are different, we would expect those in G1 to give rise to projections of large magnitude and those in G2 to give rise to projections of small magnitude. More formally:

– Let d1j and d2j be the jth designs of G1 and G2 respectively. Let d1j be represented by a vector of N bit values (d1jk), k = 1..N. Similarly for d2j.
– For each group of size S, form the mean vector mi = (mik), where mik = (1/S) Σ_{j=1..S} dijk.
– Calculate the residual vector when m1 is projected on m2. For vectors x, y the residual of x after projection on y is given by resid(x, y) = x − ((x, y)/(y, y)) y, where (x, y) is the dot product of vectors x and y.
– For each of G1, G2, take the dot product of each of its designs with this residual vector and record its absolute value.
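The steps above can be sketched as follows; the tiny 4-bit "designs" and the grouping into G1, G2 are illustrative only.

```python
# Sketch of the discrimination statistic: project the mean of group G1
# onto the mean of G2, take the residual z, and score each design by
# the absolute value of its dot product with z.
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def mean_vector(group):
    n = len(group)
    return [sum(col) / n for col in zip(*group)]

def residual(x, y):
    # resid(x, y) = x - ((x.y)/(y.y)) y, the part of x orthogonal to y
    c = dot(x, y) / dot(y, y)
    return [a - c * b for a, b in zip(x, y)]

def scores(group1, group2):
    z = residual(mean_vector(group1), mean_vector(group2))
    return ([abs(dot(d, z)) for d in group1],
            [abs(dot(d, z)) for d in group2])

# Toy groups: G1 clusters near (1,1,0,0), G2 near (0,0,1,1)
g1 = [[1, 1, 0, 0], [1, 1, 1, 0]]
g2 = [[0, 0, 1, 1], [0, 1, 1, 1]]
s1, s2 = scores(g1, g2)
```

On this toy data the G1 scores come out uniformly larger than the G2 scores, which is the separation the statistic is designed to expose.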
4.2 Results
Figure 4 shows the summary statistics (with the honest group as the reference sample) for each of the λ values, for 30 runs. The aim is to show the degree to which the summary statistics for each malicious sample overlap with the set of values from the honest one. For λ = 0, every value is ≤ 0, by construction. By contrast, for other values of λ, very few results are ≤ 0. For λ = 0.6 only one design has a result in the range of honest results; for other values of λ the overlap is greater. We see that the summary statistic provides a reasonable, and in some cases good, way of discriminating between designs developed using different cost functions. In some cases the discrimination capability is not as good as we should like, because the calculated statistic loses a great deal of information. However, even with this gross loss of information we can still detect a discernible difference between malicious and honest designs. Clearly more sophisticated tests are needed. The above is intended purely as an illustration that families of designs give rise to their own distributions of summary statistics. It should be noted that the designs within a family are not necessarily close to each other. Indeed, they may tend to congregate in different areas (corresponding to different local optima of the
Fig. 4. Squares of projections, for different values of λ. The axis is drawn at the maximum value of the λ = 0 statistic, to highlight the contrasting behaviour for other values of λ.
cost function used). A malicious designer might take advantage of this weakness, or similar weaknesses of other discrimination techniques used. This we now show.

4.3 Games and Meta-games
If the discrimination technique is known then it is possible to recast the malicious design problem with a cost function

cost(d) = (1 − λ)h(d) + λt(d) + easeDetect(d)    (6)
Here easeDetect represents how easily the solution may be identified as a maliciously derived one. Given the scheme above, a suitable candidate would be to punish solutions whose summary statistic is too large. If a variety of known discriminants are used, each of these can similarly be taken into account. Of course, it becomes more difficult to find a good solution, but the malicious designer is really just looking to gain an edge. Malicious designers can attempt to evade detection in this way only if they know the discrimination tests to be applied. Analysts are free to choose fairly arbitrarily which tests to apply. This makes the task of malicious designers more difficult. They can appeal to their intuition as to what sorts of tests might be sensible (as indeed must the analyst), but they will be second-guessing the analyst at large.
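One way to read easeDetect is as a threshold penalty on the summary statistic; the threshold and penalty values below are hypothetical, as is the shape of the penalty (the paper does not fix one).

```python
# Sketch of Equation 6: the malicious cost of Equation 1 extended with a
# penalty when a known discriminant statistic would flag the design.
def evasive_cost(d, h, t, lam, statistic, threshold, penalty=1e6):
    base = (1 - lam) * h(d) + lam * t(d)
    # easeDetect: punish designs whose summary statistic is too large
    return base + (penalty if statistic(d) > threshold else 0.0)
```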
4.4 Meta-meta Games
The work described has concentrated on how trapdoors can be inserted and on how claims to have used particular cost functions can be checked. We stated in the introduction that a designer could simply publish the cost function, the search algorithm, and its initialisation data; the doubting analyst could then replay the search to obtain the same result. This is undoubtedly so, but the implication that this gives us honest design must be questioned. We must question, in fact, what it means to be 'honest'. Our "honest" cost function of Equation 4 uses an exponent of 3. Why 3? Would 2.9 be dishonest? It really depends on what our intention is for choosing a particular value. As it happens we have found that 3.0 gives good results for finding the desired properties, but other values may perform similarly well. Our honest cost function is indeed significantly different from those used by other researchers, whose cost functions are actually more intuitively clear, but seem to perform less well. It would not be surprising if functions derived using different honest cost functions gave rise to results which have different properties as far as illicit trapdoor functionality is concerned. Large-scale experimentation could lead to the discovery of cost functions that give results with excellent public properties but which happen to have a good element of specific trapdoor functionality. Such higher-level games could be played most easily by those with massive computational capability. Perhaps this could be countered by cost function cryptanalysis! As a final game, consider the following. Suppose you discover a search technique that works better than existing ones. Suppose, for example, it could find with ease a balanced Boolean function of 8 variables with non-linearity of 118. (This would answer an open research question, since the best least upper bound is 118, but the best actual non-linearity achieved is 116.)
If you so desired, you could use your enhanced search capability to find functions that had, say, nonlinearity 116 but good trapdoor properties too. You know that you have sacrificed a little non-linearity to achieve your malicious goal, but the public analyst will compare your results against the best published ones. Enhanced search can also be provided by increased computational power. Of course, the malicious designer will have to do without the kudos that would be attached to publishing results to open problems (an academic would of course never do this!).
5 Conclusions
This paper builds on extant optimisation-based cryptographic design synthesis work and extends it to show that optimisation techniques offer the potential for planting trapdoors. We have explored some of the consequences. We know of no comparable work. Different cost functions solve different problems. Someone might be solving a different problem from the one you assumed; someone may have added a trapdoor to their alleged cost function. However, you can use optimisation to evolve a design against a criterion many times. The resulting sample of solutions may
provide a yardstick against which to judge proffered designs. It therefore provides a potential means of resolving whether a trapdoor was present in the fitness function actually used. Currently, it is hard to generate a large sample of good designs by other techniques (without the benefit of automatic search it may be difficult to demonstrate even a single good design). That designs evolved against different criteria should exhibit different functionality is clearly no surprise: it means that the optimisation-based approach works! This paper has observed that different designs have different bit-level representations and that different groups of designs can be detected by using statistical properties of the representations. The following consequences have emerged from the work:

An optimisation-based design process may be open and reproducible. A designer can publish the criteria (that is, the cost function) and the optimisation search algorithm (together with any initialisation data such as random number seeds). The search can be repeated by interested parties.

Optimisation can be abused. If optimisation 'works', then 'use' means one approves of the cost criteria and 'abuse' means one doesn't!

Optimisation allows a family of representative designs to be obtained. The search process may be repeated as often as desired to give a set of designs from which statistical distributions of designs evolved to possess particular properties may be extracted.

Designs developed against different criteria just look different! If designs have different functionality then they are different in some discernible way. In certain cases this difference may be detected by examination of the design representations alone.

The games just do not stop. Cryptographic design and analysis go hand in hand and evolve together. When the smart cryptanalyst makes an important discovery, the designers must counter.
These ideas are at an early stage at the moment, but there would appear to be several natural extensions. The simple projection method used for illustration in this paper loses a very large amount of information. A study of available discrimination techniques should bring great benefit. We aim to evolve larger artifacts with trapdoor functionality, and are currently attempting to evolve DES-family cryptosystems with built-in trapdoors (by allowing a search to range over S-Box configurations). We acknowledge that we may have been beaten to it! How could we tell?
6 Coda: The Authors Are λ = 0 People
Our aim in this paper has been to point out what is possible or indeed what might already have been attempted (secretly). We do not wish to suggest that the techniques should be used to plant trapdoors. Indeed, this is the reason why we have begun to investigate the abuse of optimisation via discriminatory tests. The paper is, we hope, an interesting, original and rather playful exploration of an intriguing issue.
References

1. John A. Clark and Jeremy L. Jacob. Searching for a Solution: Engineering Tradeoffs and the Evolution of Provably Secure Protocols. In Proceedings 2000 IEEE Symposium on Research in Security and Privacy, pages 82–95. IEEE Computer Society, May 2000.
2. John A. Clark and Jeremy L. Jacob. Two Stage Optimisation in the Design of Boolean Functions. In Ed Dawson, Andrew Clark, and Colin Boyd, editors, 5th Australasian Conference on Information Security and Privacy, ACISP 2000, pages 242–254. Springer-Verlag LNCS 1841, July 2000.
3. John A. Clark and Jeremy L. Jacob. Fault Injection and a Timing Channel on an Analysis Technique. In Advances in Cryptology, Eurocrypt 2002. Springer-Verlag LNCS 1592, 2002.
4. John A. Clark, Jeremy L. Jacob, Susan Stepney, Subhamoy Maitra, and William Millan. Evolving Boolean Functions Satisfying Multiple Criteria. In Alfred Menezes and Palash Sarkar, editors, INDOCRYPT, volume 2551 of Lecture Notes in Computer Science. Springer, 2002.
5. C. Ding, G. Xiao, and W. Shan. The Stability of Stream Ciphers, volume 561 of Lecture Notes in Computer Science. Springer-Verlag, 1991.
6. J. P. Giddy and R. Safavi-Naini. Automated Cryptanalysis of Transposition Ciphers. The Computer Journal, XVII(4), 1994.
7. S. Kirkpatrick, C. D. Gelatt, Jr., and M. P. Vecchi. Optimization by Simulated Annealing. Science, 220(4598):671–680, May 1983.
8. Robert A. J. Mathews. The Use of Genetic Algorithms in Cryptanalysis. Cryptologia, XVII(2):187–201, April 1993.
9. W. Millan, A. Clark, and E. Dawson. An Effective Genetic Algorithm for Finding Highly Non-linear Boolean Functions. Pages 149–158, 1997. Lecture Notes in Computer Science Volume 1334.
10. W. Millan, A. Clark, and E. Dawson. Boolean Function Design Using Hill Climbing Methods. In Bruce Schneier, editor, 4th Australian Conference on Information Security and Privacy. Springer-Verlag, April 1999. Lecture Notes in Computer Science Volume 1978.
11. National Bureau of Standards. Data Encryption Standard. NBS FIPS PUB 46, 1976.
12. Bruce Schneier. Applied Cryptography. Wiley, 1996.
13. Richard Spillman, Mark Janssen, Bob Nelson, and Martin Kepner. Use of a Genetic Algorithm in the Cryptanalysis of Simple Substitution Ciphers. Cryptologia, XVII(1):187–201, April 1993.
14. W. S. Forsyth and R. Safavi-Naini. Automated Cryptanalysis of Substitution Ciphers. Cryptologia, XVII(4):407–418, 1993.
GenTree: An Interactive Genetic Algorithms System for Designing 3D Polygonal Tree Models

Clare Bates Congdon¹ and Raymond H. Mazza²

¹ Department of Computer Science, Colby College, 5846 Mayflower Hill Drive, Waterville, ME 04901 USA, [email protected]
² Entertainment Technology Center, Carnegie Mellon University, 5000 Forbes Avenue, Doherty Hall 4301, Pittsburgh, PA 15213 USA, [email protected]
Abstract. The creation of individual 3D models to include within a virtual world can be a time-consuming process. The standard approach to streamline this is to use procedural modeling tools, where the user adjusts a set of parameters that defines the tree. We have designed GenTree, an interactive system that uses a genetic algorithms (GA) approach to evolve procedural 3D tree models. GenTree is a hybrid system, combining the standard parameter adjustment and an interactive GA. The parameters may be changed by the user directly or via the GA process, which combines parameters from pairs of parent trees into those that describe novel trees. The GA component enables the system to be used by someone who is ignorant of the parameters that define the trees. Furthermore, combining the standard interactive design process with GA design decreases the time and patience required to design realistic 3D polygonal trees by either method alone.
1 Introduction
This paper describes GenTree, an interactive system that models 3D polygonal trees as a set of parameters for a procedural tree description and uses genetic algorithms (GA's) to evolve these parameters. Realistic-looking organic objects for 3D virtual worlds are time consuming to model. Typically, designers must become skilled in the use of a commercial modeling program such as 3D Studio Max or Maya in order to be able to construct realistic tree models to be used in simulations. Creation of a single tree by even a skilled modeler would typically take several hours using this approach. Procedural modeling systems, such as described in [11] (or the commercial systems SpeedTree and Tree Storm), are intended to streamline the tree model creation process. They allow users to adjust a number of parameters that control the appearance of a tree without having to focus on low-level components of the model. In procedural systems, many of the possible parameters interact with each other; for example, the length of branches, the base width of branches relative to

E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 2034–2045, 2003. © Springer-Verlag Berlin Heidelberg 2003
the parent branch, and the rate at which they taper combine to determine the width of the tip of each branch. Thus, it can be time consuming and difficult to design a tree with a desired look, and it is difficult for the user to gain an understanding of the effects and interactions of the parameters. Since the trees are constructed one at a time, exploring variations (and creating multiple models) takes yet more time. Furthermore, greater realism and variation in the space of possible trees is achieved in procedural tree programs by increasing the parameter space. Increasing the size of this search space increases the burden of using and learning to use such a system. GenTree seeks to extend the capabilities of the procedural tree-building approach by adding a GA component to the search process. GenTree enables users, even those with no experience in using 3D modeling tools, to quickly design trees for use in virtual worlds. The system is interactive, allowing (but not requiring) the user to adjust the parameters that define each tree. The system also allows (but does not require) the user to assign a numeric rating to each tree; this information is then used by the GA to favor the more appealing trees as parents in the evolutionary process. This work does not focus on rendering the trees smoothly, and is instead focused on specifying the parameters that define the general shapes of the trees. Consequently, there is much room for improvement in the component of the system that renders the trees, depending on the intended use for these trees.
2 Background
There are several examples of evolutionary approaches to design in recent literature, including [1][2]. These collections in particular include several examples of aesthetic evolutionary projects, but differ in focus from the project presented here. Sims evolves 3D creatures [1] (and [10][9]), but his focus was on the mobility of the creatures (for example, joints) and not on a natural appearance of the creatures. Soddu [2] evolves cityscapes, composed of individual buildings composed in turn of architectural elements. These do not have the added constraint of intended realism in modeling natural objects. Hancock and Frowd's work on the generation of faces is also relevant [2]; however, that project is aimed at altering photographs, rather than constructing 3D models of the faces. Bentley's system [1] evolves solutions to a wide variety of 3D modeling problems, but focuses on problems that can be evaluated independently of aesthetic considerations. From these collections, the work here is most similar to an extension to Bentley, reported in [2], which allowed the GADES system to interact with a user. The system is able to evolve a wide variety of shapes in response to the user-provided fitness feedback. L-systems are an established approach to modeling the development of plants, for example [5][7], and have been used directly in, or have inspired, previous work on the evolution of plant forms, for example [6]. Finding a set of rules to achieve a desired appearance is a search problem, not unlike the search for parameters
[Figure 1: an initial set of trees is specified by parameter strings generated at random and rendered; the user views trees one at a time, may adjust parameters, and may assign numeric ratings; the GA creates new parameter strings by selecting parents, crossing over parents, and mutation; the cycle then repeats.]

Fig. 1. GenTree iterates between user input and the genetic algorithm in the creation of new trees.
in a procedural tree-modeling program. However, since these systems are evolving rule sets or hierarchical structures, the concerns are different from those in this paper. Also, since the underlying representations are more complex, these approaches are much harder for users to attempt without a genetic algorithms approach (or another form of help with the search process). Sims [8] evolved plant shapes using parameter strings, but the system is strictly evolutionary and does not allow for interactive parameter tuning. In [4], we describe a GA approach to designing 3D polygonal trees that led to the work reported here. However, this work was an initial exploration of the concept and uses a limited parameter set, resulting in fanciful, rather than realistic, trees. Also, like [8], the system is strictly evolutionary, so the trade-offs between a strictly procedural approach and an evolutionary approach are not explored, nor is the potential for a hybrid approach explored.
3 System Details
As demonstrated in [8] and [4], a GA approach is able to help navigate the search space of a procedural tree-building program. The parameters that define the trees are easily represented as strings, and the user-provided rating of the quality of the solutions serves as a fitness function. Finally, new (offspring) parameter sets can be derived from existing (parent) parameter sets using typical GA processes of crossover and mutation. This paper extends these ideas to combine the interactive approach of a standard procedural program (where the user adjusts parameters) with an interactive GA.
GenTree: An Interactive Genetic Algorithms System
main
  create (random) initial population
  getFitness
  while (maxTrials not reached)
    generateNew
    getFitness

generateNew
  select parents
  mutate
  crossover
  save best (elitism)

Fig. 2. High-level pseudocode for the GA system in GenTree.

getFitness
  (all trees start with default fitness of 0.0)
  while (not DoneWithGeneration)
    renderTree  -- display each tree for user (each tree rotates to show all angles)
    user may increment or decrement fitness for each tree
    user may advance to next tree or backup to previous tree
    user may change any of the parameters that define the tree
    user may signal DoneWithGeneration

Fig. 3. High-level pseudocode for the interactive component.
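The control flow of Figs. 2 and 3 can be sketched in a few lines of Python. This is a hypothetical illustration, not the actual Genesis-based implementation: the interactive rating step of Fig. 3 is replaced by a stub `rate` function, and the operator details and names are assumptions of the sketch.

```python
import random

GENES = 13      # genes 0..12, as in Table 1
POP_SIZE = 20

def random_tree():
    # A parameter string; the real genes have the ranges listed in Table 1
    return [random.random() for _ in range(GENES)]

def rate(tree):
    # Stub standing in for the user's interactive ratings
    return sum(tree)

def crossover(a, b):
    # One-point crossover of two parameter strings
    point = random.randrange(1, GENES)
    return a[:point] + b[point:], b[:point] + a[point:]

def mutate(tree, p=0.01):
    # Each gene has a small chance of being replaced by a random value
    return [random.random() if random.random() < p else g for g in tree]

def generate_new(pop, fits):
    # Select parents proportionally to fitness, cross over, mutate,
    # and carry the best tree over unchanged (elitism)
    best = pop[fits.index(max(fits))]
    new = [best[:]]
    while len(new) < len(pop):
        mom = random.choices(pop, weights=fits)[0]
        dad = random.choices(pop, weights=fits)[0]
        c1, c2 = crossover(mom, dad)
        new.extend([mutate(c1), mutate(c2)])
    return new[:len(pop)]

random.seed(0)
population = [random_tree() for _ in range(POP_SIZE)]
for _ in range(5):
    fitnesses = [rate(t) for t in population]
    population = generate_new(population, fitnesses)
```

In GenTree the `rate` values come from the user's clicks and the user may also edit a tree's genes directly before the next generation is produced.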
3.1 System Overview
GenTree starts with an initial population of trees whose parameters are generated at random. The user may look at each tree in turn, may change any of the parameters that define any of the trees, and may assign each tree a numeric rating that indicates its quality. When signaled by the user, a new generation of trees is produced from the old, using the GA process. Most notably, the parameter sets from two different trees "cross over", producing two new parameter sets. The new set of trees is viewed by the user, may again be altered and/or rated, and a new set of trees is produced from the old via the GA process. A high-level view of the system is illustrated in Figure 1; pseudocode is shown in Figures 2 and 3.
3.2 GA Component
To hasten the development of our system, we used Genesis [3] for the genetic algorithms component of GenTree; the Genesis code has been combined with a modest OpenGL system (developed by the authors) to display the trees for evaluation by the user. We named the system GenTree to reflect a genetic algorithms approach to creating tree models.
Table 1. A description of the values represented in the strings evolved by GenTree and rendered as 3D polygonal trees.

Gene 0, Number of Branches [1 .. 8]: the number of sub-branches that protrude from any one branch (modified by noise).
Gene 1, Length Ratio [0.2 .. 0.9]: the ratio of the length of a branch to that of the parent branch.
Gene 2, Width Ratio [0.2 .. 0.9]: the ratio of the width of the base of a sub-branch to that of the parent branch.
Gene 3, Branch Taper [0.2 .. 0.9]: the ratio of the end of a branch to the base of a branch.
Gene 4, Branch Proximity [0.1 .. 0.8]: from the tip of the parent branch, the percent of the parent to be used for distributing the sub-branches along the parent.
Gene 5, Tree Depth [2 .. 9]: the number of branches that can be traversed from the trunk to the tips (excluding the trunk).
Gene 6, Branch Angle Delta [0.0 .. 0.1]: the change in x and z direction in the vector that determines the angle of a sub-branch, relative to the parent branch.
Gene 7, Vertical Change [0.0 .. 1.0]: added to the direction inherited from the parent branch.
Gene 8, Parent Influence [0.0 .. 1.0]: strength of influence of the parent vector on offspring branches.
Gene 9, Branch Direction Noise [0.0 .. 1.0]: noise in the x, y, and z direction of the growth of a branch section (off the defined normal vector for the branch section).
Gene 10, Branch Number Noise [0.0 .. 1.0]: noise in determining the number of branches at each level of growth (higher noise tends to add branches).
Gene 11, Subtree Depth Noise [0.0 .. 1.0]: noise in determining how deep a subtree is (higher noise tends to truncate subtrees).
Gene 12, Random Seed [0.0 .. 1.0]: stored with the tree parameters so that the tree will render consistently.
The mechanics of the genetic algorithm (GA) component of the system are largely unchanged from the original Genesis code, although the interactive nature of the system changes the flow of control, since mouse clicks and keystrokes from the user affect the GA. This will be described later in this section.
3.3 Procedural Tree Component
In designing the parameters used to define the trees, we looked to existing procedural tree programs, plus a bit of trial and error, to determine what would make good-looking trees. The parameters evolved by the system and their ranges are provided in Table 1.
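As a concrete illustration of Table 1, the gene ranges can be written down as data and sampled to produce a random parameter string for the initial population. The integer treatment of genes 0 and 5 and the function names are assumptions of this sketch, not GenTree's actual encoding.

```python
import random

# Ranges from Table 1; genes 0 and 5 are treated as integers here.
GENE_RANGES = [
    (1, 8),       # 0  number of branches (integer)
    (0.2, 0.9),   # 1  length ratio
    (0.2, 0.9),   # 2  width ratio
    (0.2, 0.9),   # 3  branch taper
    (0.1, 0.8),   # 4  branch proximity
    (2, 9),       # 5  tree depth (integer)
    (0.0, 0.1),   # 6  branch angle delta
    (0.0, 1.0),   # 7  vertical change
    (0.0, 1.0),   # 8  parent influence
    (0.0, 1.0),   # 9  branch direction noise
    (0.0, 1.0),   # 10 branch number noise
    (0.0, 1.0),   # 11 subtree depth noise
    (0.0, 1.0),   # 12 random seed
]
INT_GENES = {0, 5}

def random_parameters():
    # Sample one tree's parameter string uniformly within the Table 1 ranges
    params = []
    for i, (lo, hi) in enumerate(GENE_RANGES):
        if i in INT_GENES:
            params.append(random.randint(lo, hi))
        else:
            params.append(random.uniform(lo, hi))
    return params
```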
renderTree
  drawSpan(trunk)

drawSpan
  setDirection
  draw conic section
  flip biased coin to possibly add a branch (bias determined by gene 10)
  if (subBranches)
    for each branch
      flip biased coin to possibly make a smaller subtree (bias determined by gene 11)
      drawSpan(branch)

setDirection
  if trunk, straight up
  else, combine normal vectors for:
    parent direction, weighted by gene 8
    random compass direction, weighted by gene 6
    random y direction, weighted by gene 7

Fig. 4. High-level pseudocode for the rendering component.
3.4 Rendering a GA String into OpenGL Trees
The trees are rendered recursively from the trunk through the branches. High-level pseudocode for the rendering process is provided in Figure 4. Each GA string translates uniquely into a specific rendering, including a seed used for random numbers (gene 12), so the process is deterministic for a given string. The trunk and each branch are rendered as primitive conic sections.¹ The trees are generated without leaves, but green triangles can be generated at the tips of the branches to assess what a tree would look like with leaves.
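The setDirection step of Fig. 4 amounts to a weighted blend of direction vectors. The sketch below is an assumption-laden illustration (the vector names, the random draws, and the normalisation details are not taken from GenTree's source): it combines the parent direction (weighted by gene 8) with a random compass direction (gene 6) and a random vertical component (gene 7).

```python
import math
import random

def normalize(v):
    # Scale a 3D vector to unit length (zero vectors are left unchanged)
    n = math.sqrt(sum(c * c for c in v)) or 1.0
    return [c / n for c in v]

def set_direction(parent_dir, angle_delta, vertical, parent_influence):
    # angle_delta ~ gene 6, vertical ~ gene 7, parent_influence ~ gene 8
    theta = random.uniform(0.0, 2.0 * math.pi)
    compass = [math.cos(theta), 0.0, math.sin(theta)]   # random x-z direction
    y_component = random.uniform(-1.0, 1.0)             # random y direction
    combined = [
        parent_influence * parent_dir[0] + angle_delta * compass[0],
        parent_influence * parent_dir[1] + vertical * y_component,
        parent_influence * parent_dir[2] + angle_delta * compass[2],
    ]
    return normalize(combined)
```

Because gene 6 is confined to [0.0 .. 0.1] while gene 8 may reach 1.0, sub-branches tend to deviate only gradually from their parent's direction.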
4 Results
One is first tempted to evaluate the GenTree system by judging how good the resulting trees look, and by this criterion, the system is able to generate good-looking trees in a short period of runtime.
¹ In OpenGL, conic sections are drawn using a specified number of faces. Because the rendering component is not the focus of the work reported here, the number of faces used here is four, to keep the polygon counts low and the rendering fast. Thus, the sections are rectangular.
Fig. 5. A variety of trees from the final population.
4.1 Resulting Trees
Subjective results show that with an initial population of 20 trees², several interesting and realistic-looking trees can be produced within 5-10 generations of the GA and with less than an hour of the user's time. Because the GA is working with several models simultaneously, the end result of a run is a population of trees, several of which are likely to be attractive and useful. Furthermore, similar-looking but distinct trees can be produced from a favorite tree by altering its parameters. Therefore, it is possible to easily construct a variety of trees as the result of one run of the system. Figure 5 shows a variety of trees from the final population from a run of 7 generations of 20 trees, with both interactive fitness and interactive adjustment used. Figure 6 shows another example tree produced at the end of the same run, and then shows this same tree with leaves added. Recall that since the trees (and leaves) are rendered using simple cylinders (and triangles), it is the structure of the trees that should be considered, and not the surfaces of the trees.
² A population size of more than 20 starts to become tedious from the usability perspective, as the user will typically view and interact with each tree before advancing to the next generation.
Fig. 6. One of the best trees from the final population, shown with and without leaves.
Fig. 7. Five example trees from the initial population.
4.2 Initial Population
As a reference point, a sampling of trees from the initial population is shown in Figure 7. This sample (every fourth tree) includes the two from the initial population that most resemble trees (the third and the fifth shown), as well as one with branches growing into the ground and two with branches growing inside each other. The three clearly errant trees shown in this figure illustrate typical features of the randomly generated trees.
4.3 Comparison of Interaction Styles
The primary question to be asked in the context of the work reported here is whether the GA component assists with the development of the 3D tree models. To assess this, the system was run four different ways:

1. Without providing fitnesses for trees and without altering them in any way.
2. Altering tree parameters for trees that can be improved upon each generation, but not rating the trees.
3. Rating the trees, but not altering them beyond the GA process.
4. Altering tree parameters for trees that can be improved upon each generation and also rating these trees before they are used in the GA process.

In the first instance, the system starts with randomly generated trees and recombines them, so it is not surprising that it does not produce anything interesting. It has been mentioned here merely for completeness. The second instance is not typical of the GA process in that no information is provided to the GA about solutions that are "better" or "worse" than others, and this information is typically used for preferential parent selection. However, in this case, the alternative mechanism of "eugenics" (helping there be more good solutions to be used as parents) achieves a similar role of recombining primarily good solutions. The third instance corresponds to the standard use of genetic algorithms, and the fourth instance corresponds to the full capacity of the GenTree system to take input from the user in two forms.
The system was run by the authors in all four variants with the general goal of finding good-looking trees with minimal effort. Although this remains a subjective test and evaluation, the authors explicitly did not fine-tune any trees in the adjustment process. A population size of 20 trees was used, with a crossover rate of 60 percent and a mutation rate of 0.1 percent (each bit has a 0.1 percent chance of being set to a random 0 or 1 value).

With each approach save the first, for each variation of altering trees and supplying ratings, GA generations were run until approximately an hour had elapsed. This resulted in 5 generations for variation 2, 20 generations for variation 3, and 7 generations for variation 4. An explanation for the varying numbers of generations is that with the second variation, the most time was spent adjusting the parameters of each tree, since all trees are equally likely to be "parents" to the next generation. In variation 4, when the ratings were used, bad trees (those that could not easily be adjusted) were simply given low ratings. This happens sometimes, for example, when a very angular tree pops up. (Such a tree can be useful "grist for the mill" or might be useful when constructing deliberately fanciful trees, but can sometimes be difficult to smooth out without considering how parameters interact and perhaps changing the parameter set severely. Thus, it can be expedient to simply give a difficult tree a low rating.) With the third variation, no alterations were allowed, so each tree was viewed and assigned a rating, allowing more generations in the time frame.

All three approaches are capable of producing comparable trees, and the assessment of "best" may have to do with the user's background. As users familiar with the available parameters, it is frustrating to do the third run and restrain oneself from "touching up" a promising tree.
However, the ease with which a naive user can design trees using only ratings, and without knowing the parameters, has been witnessed several times in curious colleagues trying the software. For more knowledgeable users, there seems to be little reason to run the system with the second variation, because the ability to weed out bad trees or favor interesting ones can be expedient, even if used only occasionally.

The difference observed between the second and fourth variations is that when both ratings and alterations are used, the trees in the final population tend to show more variation than when just alterations (and no ratings) are used. This is a somewhat counterintuitive result in that in many GA applications, the population tends to converge, and it is impossible to discern at this point whether this is a quirk of the authors' use of the system or a more meaningful effect. However, a possible explanation can be constructed: when adjusting most of the trees each generation (and not being allowed to simply indicate a tree is undesirable and leave it behind), it appears likely that one gets into the habit of adjusting the same parameters to similar values, whereas when the ratings are used, less parameter adjusting happens and therefore less bias in adjusting particular values is introduced. (This is just one possible explanation of observed differences in the final populations.)

The question must also be asked about the results of altering parameters only (without advancing any generations in the GA process) for an hour, during which
time a knowledgeable user would surely be able to produce one or more attractive trees. Using the system this way, more skill would be required in assessing the relative contributions of the different parameters and the interactions of the various parameter settings with each other. The addition of the GA process (with or without ratings) considerably lessens the need for understanding the effects of individual parameters, or even knowing what the parameters are.

Further Observations

Overall, we are quite pleased with the wide variety of shapes that are possible in the system and its ability to yield some trees that are approaching realism. Not surprisingly, the noise parameters appear to be instrumental in making the trees look more realistic and providing more variations of "similar" trees.
5 Conclusions and Future Work
The GenTree system is able to produce a variety of trees with realistic shapes. Although it is difficult to effectively evaluate a system whose output is judged on aesthetic criteria, the addition of the GA mechanisms to a procedural tree-modeling system simplifies the development of 3D tree models in that users do not need to understand the effects of individual parameters, or even know what the parameters are, in order to develop useful models. Rather than having to identify why a tree is good or bad (or the ways in which a tree is good or bad), the user can assign a subjective rating.

Future Work

The introduction of more parameters into the procedural definition of the tree would be useful in that it would allow for a greater variety of trees, including pine trees and palm trees. However, this would also make the space of possible trees more difficult for users to navigate. Thus, as the parameter space increases, the GA mechanisms are expected to be even more useful for helping the user discover appealing trees.

The trees would be made significantly more realistic by improving the rendering engine, even if the GA and interactive facets of the system were left unchanged. This would include adding textures to the surfaces and smoother transitions from branches to sub-branches. Furthermore, it may be desirable to parameterize the rendering component, for example, to allow favoring low-polygon-count trees (for some applications) or favoring realism. Additional benefits could be seen from parameterizing leaf characteristics and improving the rendering of the leaves. Root structure would also make trees more realistic. In addition, different colors and textures could be specified for bark and for leaves, and the colors and textures could be subject to evolution. The ability to add flowers and fruit, and variations on these, could also be added as parameters.
In a few years, we will no longer be constrained by polygonal limits in virtual worlds; some applications will call for trees of high intricacy, which will be too time consuming even to attempt to model through regular methods. There is an obvious advantage to using GenTree in this situation.
References

1. P. J. Bentley. Evolutionary Design by Computers. Morgan Kaufmann, San Francisco, CA, 1999.
2. P. J. Bentley and D. W. Corne. Creative Evolutionary Systems. Morgan Kaufmann, San Francisco, CA, 2002.
3. J. J. Grefenstette. A user's guide to GENESIS. Technical report, Navy Center for Applied Research in AI, Washington, DC, 1987. Source code updated 1990; available at http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/genetic/ga/systems/genesis/.
4. R. H. Mazza III and C. B. Congdon. Towards a genetic algorithms approach to designing 3D polygonal tree models. In Proceedings of the Congress on Evolutionary Computation (CEC-2002). IEEE Computer Society Press, 2002.
5. C. Jacob. Genetic L-system programming. In Parallel Problem Solving from Nature (PPSN III), Lecture Notes in Computer Science, volume 866. Springer, 1994.
6. G. Ochoa. On genetic algorithms and Lindenmayer systems. In Parallel Problem Solving from Nature (PPSN V), Lecture Notes in Computer Science, volume 1498. Springer, 1998.
7. P. Prusinkiewicz, M. Hammel, R. Mech, and J. Hanan. The artificial life of plants. In SIGGRAPH course notes, Artificial Life for Graphics, Animation, and Virtual Reality. ACM Press, 1995.
8. K. Sims. Artificial evolution for computer graphics. In Proceedings of the 18th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH 91), pages 319-328. ACM Press, 1991.
9. K. Sims. Evolving 3D morphology and behavior by competition. In R. Brooks and P. Maes, editors, Artificial Life IV Proceedings, pages 28-39. MIT Press, 1994.
10. K. Sims. Evolving virtual creatures. In Proceedings of the 21st Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH 94), pages 15-22. ACM Press, 1994.
11. J. Weber and J. Penn. Creation and rendering of realistic trees. In Proceedings of the 22nd Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH 95), pages 119-128. ACM Press, 1995.
Optimisation of Reaction Mechanisms for Aviation Fuels Using a Multi-objective Genetic Algorithm

Lionel Elliott¹, Derek B. Ingham¹, Adrian G. Kyne², Nicolae S. Mera², Mohamed Pourkashanian³, and Christopher W. Wilson⁴

¹ Department of Applied Mathematics, University of Leeds, Leeds, LS2 9JT, UK {lionel,amt6dbi}@amsta.leeds.ac.uk
² Centre for Computational Fluid Dynamics, Energy and Resources Research Institute, University of Leeds, Leeds, LS2 9JT, UK {fueagk,fuensm}@sun.leeds.ac.uk
³ Energy and Resources Research Institute, University of Leeds, Leeds, LS2 9JT, UK, [email protected]
⁴ Centre for Aerospace Technology, QinetiQ group plc, Cody Technology Park, Ively Road, Pyestock, Farnborough, Hants, GU14 0LS, UK, [email protected]
Abstract. In this study a multi-objective genetic algorithm approach is developed for determining new reaction rate parameters for the combustion of kerosene/air mixtures. The multi-objective structure of the genetic algorithm employed allows for the incorporation of both perfectly stirred reactor and laminar premixed flame data into the inversion process, thus producing more efficient reaction mechanisms.
1 Introduction
Reduction of engine development costs, improved predictions of component life or of the environmental impact of combustion processes, as well as the improved assessment of any industrial process making use of chemical kinetics, are only possible with an accurate chemical kinetics model. However, the lack of accurate combustion schemes for high-order hydrocarbon fuels (e.g. kerosene) precludes such techniques from being utilised in many important design processes. To facilitate further development, a procedure for optimising chemical kinetic schemes must be devised in which a reduced number of reaction steps are constructed from only the most sensitive reactions, with tuned reaction rate coefficients. A novel method for automating the optimisation of reaction mechanisms of complex aviation fuels is developed in this paper.

Complex hydrocarbon fuels, such as kerosene, require more than 1000 reaction steps with over 200 species, see [1]. Hence, as not all the reaction rate data are well known, there is a high degree of uncertainty in the results obtained using these large detailed reaction mechanisms. A reduced reaction mechanism for kerosene combustion was developed in [2], and that mechanism is used in this paper to test the ability of GAs to retrieve the reaction rate coefficients.

E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 2046-2057, 2003. © Springer-Verlag Berlin Heidelberg 2003
We note that various methods have been proposed in order to find a set of reaction rate parameters that gives the best fit to a given set of experimental data. However, in the case of complex hydrocarbon fuels, the objective function is usually highly structured, having multiple ridges and valleys and exhibiting multiple local optima. For objective functions with such a complex structure, traditional gradient-based algorithms are likely to fail. Optimisation methods based upon linearization of the objective function, see [3] and [4], fall into the same category. On the other hand, genetic algorithms are particularly suitable for optimising objective functions with complex, highly structured landscapes.

A powerful inversion technique, based on a single-objective genetic algorithm, was developed in [5] in order to determine the reaction rate parameters for the combustion of a hydrogen/air mixture in a perfectly stirred reactor (PSR). However, many practical combustors, such as internal combustion engines, rely on premixed flame propagation. Moreover, burner-stabilized laminar premixed flames are very often used to study chemical kinetics in a combustion environment. Such flames are effectively one-dimensional and can be made very steady, thus facilitating detailed experimental measurements of temperature and species profiles. Therefore, for laminar premixed flames it is much easier to obtain accurate experimental data to be used in the genetic algorithm inversion procedure.

Therefore, in order to obtain a reaction mechanism which can be used for a wide range of practical problems, it is necessary to use an inversion procedure which incorporates multiple objectives. Another reason to use this approach is the fact that sometimes in practical situations the amount of experimental data available is limited, and it is important to use all the available experimental data, which may come from different measurements for perfectly stirred reactors or laminar premixed flames.
Therefore, the single-objective optimisation technique proposed in [5] is further extended to a multi-objective genetic algorithm in order to include in the optimisation process data from premixed laminar flames. The algorithm was found to provide good results for a small-scale test problem, i.e. for the optimisation of reaction rate coefficients for a hydrogen/air mixture. It is the purpose of this paper to extend this technique to the case of complex hydrocarbon fuels.

An inversion procedure based on a reduced set of available measurements is also considered. In many experimental simulations, some of the species concentrations are subject to large experimental errors, both in perfectly stirred reactors and in laminar premixed flames. In order to facilitate the incorporation of various types of experimental data in the GA-based inversion procedures, this study also investigates GA calculations based on such incomplete or reduced sets of data.
2 Description of the Problem

2.1 Reaction Rate Parameters
In a combustion process the net chemical production or destruction rate of each species results from a competition between all the chemical reactions involving
that species. In this study it is assumed that each reaction proceeds according to the law of mass action and the forward rate coefficients are of three-parameter functional Arrhenius form, namely

$$k_{f_i} = A_i T^{b_i} \exp\left(-\frac{E_{a_i}}{RT}\right) \qquad (1)$$

for i = 1, ..., N_R, where R is the universal gas constant and there are N_R competing reactions occurring simultaneously. The rate equation (1) contains the three parameters A_i, b_i and E_{a_i} for the i-th reaction. These parameters are of paramount importance in modelling the combustion processes, since small changes in the reaction rates may produce large deviations in the output species concentrations. In both perfectly stirred reactors as well as in laminar premixed flames, if the physical parameters required are specified, and the reaction rates (1) are given for every reaction involved in the combustion process, then the species concentrations of the combustion products may be calculated. It is the possibility of the determination of these parameters for each reaction, based upon outlet species concentrations, which is investigated in this paper.
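Equation (1) translates directly into code. A minimal sketch, assuming SI units (E_a in J/mol, T in K) purely for illustration:

```python
import math

R = 8.314  # universal gas constant, J/(mol K)

def forward_rate(A, b, Ea, T):
    # Three-parameter Arrhenius form of equation (1):
    # k_f = A * T**b * exp(-Ea / (R*T))
    return A * (T ** b) * math.exp(-Ea / (R * T))
```

Because the activation energy enters through an exponential, small perturbations of E_{a_i} can change the rate, and hence the predicted species concentrations, substantially; this is what makes the inversion problem so sensitive to these parameters.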
2.2 The Perfectly Stirred Reactor (PSR) Calculations and the Laminar Premixed Flames (PREMIX) Calculations
Various software packages may be used for the direct calculations to determine the output species concentrations if the reaction rates are known. In this study the perfectly stirred reactor calculations are performed using the PSR FORTRAN computer program, which predicts the steady-state temperature and composition of the species in a perfectly stirred reactor, see [6], while the laminar premixed flame structure calculations were performed using the PREMIX code, see [7].

We use X_k to denote the mole fraction of the k-th species, where k = 1, ..., K and K represents the total number of species. The temperature and composition which exit the reactor are assumed to be the same as those in the reactor, since the mixing in the reactor chamber is intense. If PSR calculations are undertaken for N_S^{PSR} different sets of reactor conditions, then the output data used in the GA search procedure consist of a set of mole concentrations (X_{jk}, j = 1, ..., N_S^{PSR}, k = 1, ..., K), where X_{jk} represents the mole concentration of the k-th species in the j-th set of reactor conditions.

The laminar premixed flame structure calculations were performed using the PREMIX code for burner-stabilised flames with a known mass flow rate. The code is capable of solving the energy equation in determining a temperature profile. However, throughout this investigation a fixed temperature profile was preferred, and the reason for fixing the temperature was twofold. Firstly, in many flames there can be significant heat losses to the external environment, which are of unknown and questionable origin and thus troublesome to model. Secondly, the CPU time involved in arriving at the final solution was decreased significantly.
If all the physical operating conditions are specified, then for a given set of reaction rates (1) the profiles of the species concentrations and the burning velocity, as functions of the distance from the burner's surface, are calculated. If PREMIX calculations are performed for N_S^{PREMIX} different sets of operating conditions, then the output data of the code used by the GA in the matching process consist of a set of species concentration profiles (Y_{jk}(x), j = 1, ..., N_S^{PREMIX}, k = 1, ..., K) and a set of burning velocity profiles (V_j(x), j = 1, ..., N_S^{PREMIX}), where Y_{jk} is the profile of the mole concentration along the burner for the k-th species in the j-th set of operating conditions and V_j is the burning velocity profile for the j-th set of operating conditions. It is the purpose of this paper to determine the reaction rate coefficients A's, b's and E_a's in equation (1), given species concentrations X_{jk} in a perfectly stirred reactor and/or profiles of species concentrations Y_{jk} and burning velocity profiles V_j in laminar premixed flames.
3 Reformulation of the Problem as an Optimisation Problem
An inverse solution procedure attempts, by calculating new reaction rate parameters that lie between predefined boundaries, to recover the profiles of the species (to within any preassigned experimental uncertainty) resulting from numerous sets of operating conditions. The inversion process aims to determine the unknown reaction rate parameters ((A_i, b_i, E_{a_i}), i = 1, ..., N_R) that provide the best fit to a set of given data. Thus, first a set of output concentration measurements of species is simulated (or measured). If numerical simulations are performed for N_S^{PSR} different sets of PSR operating conditions and/or N_S^{PREMIX} different sets of PREMIX operating conditions, then the data will consist of a set of K N_S^{PSR} concentration measurements of species for the PSR calculations and (K + 1) N_S^{PREMIX} profiles of species concentrations and burning velocity for the PREMIX calculations. Genetic algorithm based inversion procedures seek the set of reaction rate parameters that gives the best fit to these measurements. In order to do so, we consider two objective functions which compare predicted and measured species concentrations for the PSR and PREMIX simulations:

$$f_{PSR}\left((A_i, b_i, E_{a_i})_{i=1,N_R}\right) = \left( 10^{-8} + \sum_{j=1}^{N_S^{PSR}} \sum_{k=1}^{K} W_k \left| \frac{X_{jk}^{calc} - X_{jk}^{orig}}{X_{jk}^{orig}} \right| \right)^{-1} \qquad (2)$$
$$f_{PREMIX}\left((A_i, b_i, E_{a_i})_{i=1,N_R}\right) = \left( 10^{-8} + \sum_{j=1}^{N_S^{PREMIX}} \left[ \frac{\left\| V_j^{calc} - V_j^{orig} \right\|_{L_2}}{\left\| V_j^{orig} \right\|_{L_2}} + \sum_{k=1}^{K} W_k \frac{\left\| Y_{jk}^{calc} - Y_{jk}^{orig} \right\|_{L_2}}{\left\| Y_{jk}^{orig} \right\|_{L_2}} \right] \right)^{-1} \qquad (3)$$
where

– X_{jk}^{calc}, Y_{jk}^{calc} and V_j^{calc} represent the mole concentrations of the k-th species in the j-th set of operating conditions and the burning velocity in the j-th set of operating conditions, obtained using the set of reaction rate parameters ((A_i, b_i, E_{a_i}), i = 1, ..., N_R),
– X_{jk}^{orig}, Y_{jk}^{orig} and V_j^{orig} are the corresponding original values, which were measured or simulated using the exact values of the reaction rate parameters,
– ||·||_{L_2} represents the L_2 norm of a function, which is numerically calculated using a trapezoidal rule,
– W_k are different weights that can be applied to each species, depending on the importance of the species.

It should be noted that the fitness function (2) is a measure of the accuracy of the species concentration predictions obtained by a given reaction mechanism for PSR simulations. The second fitness function (3) is a measure of the accuracy in predicting species concentrations and burning velocity profiles in the case of laminar premixed flames.
4 The Multi-objective Genetic Algorithm Inversion Technique
There is a vast variety of genetic algorithm approaches to multi-objective optimisation problems. GAs are in general a very efficient optimisation technique for problems with multiple, often conflicting, objective functions. However, most GA approaches require that all the objective functions be evaluated for every individual generated during the optimisation process. For the problem considered in this paper, the objective function f_PREMIX requires a long computational run time. For example, in the kerosene case a PREMIX run for one operating condition, N_S^{PREMIX} = 1, was found to take an average of 94 seconds CPU time on a Pentium IV processor running at 1.7 GHz. This leads to a running time of more than 65 days for evaluating, say, 1000 GA generations with 60 individuals generated at every iteration. On the other hand, PSR calculations are much faster, requiring for example an average of 14 seconds CPU time for evaluating one individual at N_S^{PSR} = 18 different operating conditions. Therefore, in this paper we propose a multi-objective GA that avoids PREMIX evaluations at every generation. The algorithm is based on alternating the fitness functions that are used for selecting the survivors of the populations generated during the GA procedure. Under the assumption that the optimisation problem considered has n objective functions, the multi-objective GA employed in this paper consists of the following steps:

1. Construct n fitness functions f_1, f_2, ..., f_n and assign a probability to each of these functions: p_1, p_2, ..., p_n.
2. Construct an initial population.
3. Select a fitness function f_i with probability p_i using a roulette wheel selection.
4. Using the fitness function selected at step 3, construct a new population:
   – Use the fitness function f_i to select a mating pool.
– Apply the genetic operators (crossover and mutation) to create a population of offspring.
– Use the fitness function fi to construct a new population by merging the old population with the offspring population.
5. Repeat steps 3-4 until a stopping condition is satisfied, i.e. convergence is achieved.
For the problem considered in this paper we have n = 2, f1 = fPSR, f2 = fPREMIX. Different values can be used for the fitness selection probabilities pPSR and pPREMIX. It is worth noting that the fitness selection probabilities pPSR and pPREMIX can be used to bias the convergence of the algorithm towards accurate predictions in PSR or PREMIX, respectively. However, they do not act as weights, since at every generation only one of the objective functions (2) and (3) is evaluated, not a linear combination of both. Instead, these objective functions are alternated during the evolution process. This method maintains the multi-objective character of the optimisation process while avoiding the evaluation of the expensive fitness function (3) at every generation. The CPU time required by the algorithm depends on the values set for the probabilities pPSR and pPREMIX. In this paper we have considered pPSR = 0.9 and pPREMIX = 0.1. For these values the genetic algorithm was found to require an average of 22 minutes/generation on a Pentium IV processor running at 1.7GHz. Numerical experiments have been performed for various values of the selection probabilities pPSR and pPREMIX, and it was found that the quality of the final solution is not affected over a large range of values. However, the CPU time can be controlled by choosing these probabilities suitably, according to the number of PSR and PREMIX runs required. We note that the GA employed is based on a floating point encoding of the variables.
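Steps 1-5 above can be sketched in Python. This is a minimal illustrative reading, not the authors' code: the survival scheme is reduced to merge-and-truncate, the genetic operators are hidden behind a user-supplied function, and all names (multi_objective_ga, make_offspring) are ours.

```python
import random

def multi_objective_ga(fitness_fns, probs, init_pop, n_generations,
                       make_offspring, n_survivors):
    """Alternate between the n objectives: each generation, one fitness
    function is drawn by roulette wheel (step 3) and used both to build
    the offspring and to pick survivors of the merged population (step 4)."""
    pop = list(init_pop)
    for _ in range(n_generations):
        f = random.choices(fitness_fns, weights=probs, k=1)[0]
        offspring = make_offspring(pop, f)   # mating pool + genetic operators
        merged = pop + offspring
        merged.sort(key=f)                   # lower error measure = fitter
        pop = merged[:n_survivors]           # survivors for the next round
    return pop
```

With pPSR = 0.9 and pPREMIX = 0.1 the expensive objective is drawn at only about one generation in ten, which is exactly where the CPU-time saving comes from.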
The genetic operators and GA parameters were taken as follows: population size npop = 50, number of offspring nchild = 60, uniform arithmetic crossover, crossover probability pc = 0.65, tournament selection, tournament size k = 2, tournament probability pt = 0.8, non-uniform mutation, mutation probability pm = 0.5, elitism parameter ne = 2. The new sets of rate constants generated at every iteration are constrained to lie between predefined boundaries that represent the uncertainty associated with the experimental findings listed in the National Institute of Standards and Technology (NIST) database. The constraints are imposed by assigning a very low fitness value to the individuals that violate them. It is worth noting that when working with accurate experimental data or numerically simulated data, the two fitness functions (2) and (3) have a common global optimum given by the real values of the reaction rate coefficients or the values that were used to generate the data, respectively. Therefore, for the problem considered, the two objective functions are not conflicting, and the method we have proposed can be used even though it searches for a single common optimum rather than a set of non-dominated solutions. It is worth noting that if the experimental data is corrupted with a large level of noise then the two objective
functions (2) and (3) may be conflicting, and Pareto oriented techniques then have to be used in order to construct the front of non-dominated solutions, i.e. a set of possible optima. In this situation other considerations have to be used to decide on the best location of the optimum.
5 Numerical Results
In order to test the multi-objective genetic algorithm inversion procedure, the test problem considered is the recovery of the species concentration profiles predicted by a previously validated kerosene/air reaction mechanism, presented in [2], over a wide range of operating conditions. The reaction scheme used to generate the data involves K = 63 species across NR = 168 reactions. Ideally, experimental data would have been preferred, but at this developmental stage, in order to avoid the uncertainty associated with experimental data, it was decided to undertake numerical computations on a set of numerically simulated data. Since the purpose of this study is to develop and assess the technical feasibility of multi-objective GA techniques for the development of new reaction mechanisms for complex high order fuels, so that the most efficient schemes can be employed in future investigations concerning aviation fuels, numerically simulated data serves this purpose as well as experimental data. The test problem considered is the combustion of a kerosene/air mixture at constant atmospheric pressures, p = 10atm and p = 40atm. A set of NSPSR = 18 different operating conditions have been considered for PSR runs, corresponding to inlet temperatures varying from 780K to 1150K. The inlet composition is defined by a specified fuel/air equivalence ratio, Φ = 1.0 or Φ = 1.5. Rather than solving the energy equation, we assume that there is no heat loss from the reactor, i.e. Q = 0, and the temperature profile is specified. For the PREMIX calculations NSPREMIX = 1, for a steady state burner stabilized premixed laminar flame with specified temperature, pressure p = 1atm and fuel/air equivalence ratio Φ = 1.0. In many practical situations measurements are available only for the most important species. Therefore it is the purpose of this section to also test the proposed GA inversion procedure for the case of incomplete sets of data.
Two reaction mechanisms are generated using two sets of data simulating complete and incomplete sets of measurements, as follows:
– COMPLETE is generated using multi-objective optimisation based on PSR and PREMIX data for all the output species.
– REDUCED is generated using multi-objective optimisation based on PSR and PREMIX data only for the following species: kerosene (C10H22+C6H5CH3), O2, C6H6, CO, CH2O, CO2, H2, C2H4, CH4, C3H6, C2H6, C2H2, C4H8.
These two reaction mechanisms are used to produce estimates of the output species concentrations, and these estimates are compared with the predictions given by the original mechanism (ORIGINAL), i.e. the mechanism which was used to simulate the data numerically, see [2].
Optimisation of Reaction Mechanisms
5.1 Output Species Predictions for PSR and PREMIX Calculations
Figure 1 presents the unburned fuel concentration calculated with the PSR program, using the reaction mechanisms generated by the GA inversion procedure and the original mechanism, for various temperatures and the air/fuel ratio given by Φ = 1.0 at constant atmospheric pressure, p = 10atm. It can be seen that both the COMPLETE and REDUCED mechanisms generated by the multi-objective GA are accurate in reproducing the output species concentrations as predicted by the original mechanism. Although not presented here, similar results were obtained for the other species.

Fig. 1. The mole fraction of unburned fuel for various temperatures and air/fuel ratio given by Φ = 1.0 at constant atmospheric pressure, p = 10atm, as predicted by various reaction mechanisms in PSR calculations.

If the profiles of the mole fractions in the PREMIX calculations are investigated, the GA generated reaction mechanisms are again found to be efficient in predicting the species profiles; see Figure 2, which presents the output species profiles of four major species for a pressure p = 1atm and fuel/air equivalence ratio Φ = 1.0. It can be seen that both mechanisms generated by our inversion procedure are efficient in predicting the mole fractions even for very low concentrations. Many other species have been investigated for PSR and PREMIX calculations at operating conditions within the range used for the optimisation process, and overall it was found that the reaction mechanisms generated by the GA accurately reconstruct the output species predicted by the original mechanism.
L. Eliott et al.
Fig. 2. The species profiles of various output species for premixed laminar flames as predicted by various reaction mechanisms, namely the original mechanism (-), the COMPLETE mechanism (×) and the REDUCED mechanism ().
5.2 Validation of the GA Generated Mechanisms for Various Operating Conditions
The previous section presented the predictions given by the constructed reaction mechanisms for operating conditions that were included in the optimisation process. However, the aim of the inversion procedure is to develop reaction mechanisms that are valid over wide ranges of operating conditions. Therefore, the GA generated reaction mechanisms were next tested at operating conditions outside the range included in the optimisation procedure. Figure 3 presents the species concentrations obtained by PSR calculations for the two GA generated reaction mechanisms, in comparison with those generated by the original mechanism, for the kerosene concentration at temperatures outside the range used for optimisation, i.e. T = 1000K-2000K. Again, the COMPLETE and REDUCED mechanisms give accurate predictions in comparison with the original mechanism.

Fig. 3. The mole fraction of unburned fuel for various temperatures T = 1000K-2000K as predicted by various reaction mechanisms in PSR calculations.

Figure 4 presents the predicted concentration profiles of four of the main species obtained by PREMIX calculations at a constant pressure p = 2atm, which is outside the range of operating conditions used for optimisation. Both the COMPLETE and REDUCED reaction mechanisms accurately predict the species concentration profiles for PREMIX operating conditions outside the range used for optimisation. For some species (see Figure 4) the REDUCED mechanism was found to be more efficient than the COMPLETE mechanism, in particular in the part of the species profiles where the concentration becomes very low. This can be explained by the fact that the REDUCED mechanism is biased towards some of the species and predicts them very accurately, while the COMPLETE mechanism is forced to predict all the species with reasonable accuracy, at the expense of slightly less accurate predictions for the main species. The REDUCED mechanism outperformed the COMPLETE mechanism only for those species which were included in the REDUCED inversion procedure, while the COMPLETE mechanism outperformed the REDUCED mechanism when all the species are taken into consideration. It is worth mentioning that when working with numerically simulated data both mechanisms can be generated, but in practice it is impossible to obtain measurements for all the output species, and therefore a reaction mechanism of the REDUCED type is generated when working with experimental data. Similar conclusions have been obtained for a wide range of operating conditions. Overall, the reaction mechanisms generated by the multi-objective GA proposed in this study were found to be very efficient in predicting the output species concentrations and burning velocity for both PSR and PREMIX simulations. Thus we may conclude that the proposed inversion procedure efficiently generates a general reaction mechanism that can be used to predict the combustion products for various combustion problems at a variety of operating conditions.

5.3 Ignition Delay Time Validations
To provide additional validation of the results obtained by the inversion procedures presented in this paper, the predicted ignition delay times are computed using all the reaction mechanisms generated in this study. In testing the ignition delay time, the SENKIN code of Sandia National Laboratories [8] was used to predict the time-dependent chemical kinetic behaviour of a homogeneous gas mixture in a closed system.

Fig. 4. The species profiles of various output species for premixed laminar flames at pressure p = 2atm and fuel/air equivalence ratio Φ = 1.0, as predicted by various reaction mechanisms, namely the original mechanism (-), the COMPLETE mechanism (×) and the REDUCED mechanism ().

Figure 5 compares the ignition delay times predicted by the reaction mechanisms generated in this study with those generated by the original mechanism at four different initial temperatures and air/fuel ratio Φ = 1.0. We note that both the COMPLETE and the REDUCED mechanisms, based on complete and incomplete sets of data respectively, reproduce the same ignition delay times as the original mechanism, indicating that the GA does not merely generate a mechanism whose validity extends only to the operating conditions included in the optimisation process.
6 Conclusions
In this study we have developed a multi-objective GA inversion procedure that can be used to generate new reaction rate coefficients that successfully predict the profiles of the species produced by a previously validated kerosene/air reaction scheme. The new reaction rate coefficients generated by the GA were found to successfully match the mole fraction values in PSR simulations, the burning velocity and species profiles in PREMIX simulations, and the ignition delay times predicted by the original starting mechanism, over a wide range of operating conditions, despite being derived solely from data for a restricted range of operating conditions.
Acknowledgments. The authors would like to thank the EPSRC and the corporate research program for the MOD and DTI Program for the funding and permission to publish this work.
Fig. 5. The ignition delay time at various temperatures T=1000K-2500K as predicted by various reaction mechanisms.
References

1. Dagaut, P., Reuillon, M., Boetner, J.C. and Cathonnet, M.: Kerosene combustion at pressures up to 40atm: experimental study and detailed chemical kinetic modelling. Proceedings of the Combustion Institute, 25, (1994), 919–926.
2. Kyne, A.G., Patterson, P.M., Pourkashanian, M., Williams, A. and Wilson, C.J.: Prediction of premixed laminar flame structure and burning velocity of aviation fuel-air mixtures. Proceedings of Turbo Expo 2001: ASME TURBO EXPO 2001: Land, Sea and Air, June 4–7, New Orleans, USA, (2001).
3. Milstein, J.: The inverse problem: estimation of kinetic parameters. In: Ebert, K.H., Deuflhard, P. and Jager, W. (Eds), Modelling of Chemical Reaction Systems, Springer, Berlin, (1981).
4. Bock, H.G.: Numerical treatment of inverse problems in chemical reaction kinetics. In: Ebert, K.H., Deuflhard, P. and Jager, W. (Eds), Modelling of Chemical Reaction Systems, Springer, Berlin, (1981).
5. Harris, S.D., Elliott, L., Ingham, D.B., Pourkashanian, M. and Wilson, C.W.: The optimisation of reaction rate parameters for chemical kinetic modelling of combustion using genetic algorithms. Computer Methods in Applied Mechanics and Engineering, 190, (2000), 1065–1083.
6. Glarborg, P., Kee, R.J., Grcar, J.F. and Miller, J.A.: PSR: A FORTRAN program for modelling well-stirred reactors. Sandia National Laboratories Report SAND86-8209, (1988).
7. Kee, R.J., Grcar, J.F., Smooke, M.D. and Miller, J.A.: A FORTRAN program for modelling steady laminar one-dimensional premixed flames. Sandia National Laboratories Report SAND85-8240, (1985).
8. Lutz, A.E., Kee, R.J. and Miller, J.A.: SENKIN: A FORTRAN program for predicting homogeneous gas phase chemical kinetics with sensitivity analysis. Sandia National Laboratories Report SAND87-8248, (1987).
System-Level Synthesis of MEMS via Genetic Programming and Bond Graphs

Zhun Fan¹, Kisung Seo¹, Jianjun Hu¹, Ronald C. Rosenberg², and Erik D. Goodman¹

¹ Genetic Algorithms Research and Applications Group (GARAGe)
² Department of Mechanical Engineering
Michigan State University, East Lansing, MI 48824
{hujianju,ksseo,fanzhun,rosenber,goodman}@egr.msu.edu
Abstract. Initial results have been achieved for automatic synthesis of MEMS system-level lumped parameter models using genetic programming and bond graphs. This paper first discusses the necessity of narrowing the problem of MEMS synthesis into a certain specific application domain, e.g., RF MEM devices. Then the paper briefly introduces the flow of a structured MEMS design process and points out that system-level lumped-parameter model synthesis is the first step of the MEMS synthesis process. Bond graphs can be used to represent a system-level model of a MEM system. As an example, building blocks of RF MEM devices are selected carefully and their bond graph representations are obtained. After a proper and realizable function set to operate on that category of building blocks is defined, genetic programming can evolve both the topologies and parameters of corresponding RF MEM devices to meet predefined design specifications. Adaptive fitness definition is used to better direct the search process of genetic programming. Experimental results demonstrate the feasibility of the approach as a first step of an automated MEMS synthesis process. Some methods to extend the approach are also discussed.
E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 2058–2071, 2003. © Springer-Verlag Berlin Heidelberg 2003

1 Introduction

Mechanical systems are known to be much more difficult to address with either systematic design or a clean separation of design and fabrication. Composed of parts involving multiple energy domains, lacking a small set of primitive building blocks such as the NOR and NAND gates used in VLSI, and lacking a clear separation of form and function, mechanical systems are so diverse in their design and manufacturing procedures that they present more challenges to a systematic approach and have largely defied attempts at automated synthesis. Despite the numerous difficulties presented in automated synthesis of macro-mechanical systems, MEMS holds the promise of being amenable to structured automated design due to its similarities with VLSI, provided that the synthesis is carried out in a properly constrained design domain. Due to the multi-domain and intrinsically three-dimensional nature of MEMS, their design and analysis is very complicated and requires access to simulation tools
with finite element analysis capability, and the computational cost is typically very high. A common representation that encompasses multiple energy domains is thus needed for modeling the whole system. We need a system-level model that reduces the number of degrees of freedom from the hundreds or thousands characterizing the meshed 3-D model to as few as possible. The bond graph, based on power flow, provides a unified model representation across multiple energy domains and is also compatible with 3-D numerical simulation and experimental results in describing the macro behavior of the system, so long as suitable lumping of components can be done to obtain lumped-parameter models. It can be used to represent the behavior of a subsystem within one energy domain, or the interaction of multiple domains. Therefore, the first important step in our method of MEMS synthesis is to develop a strategy to automatically generate bond graph models that meet particular design specifications on system-level behavior. For system-level design, hand calculation is still the most popular method in current design practice, for two reasons: 1) the MEMS systems being considered or designed are relatively simple in dynamic behavior, especially the mechanical parts, largely due to limitations in fabrication capability; 2) there is no powerful and widely accepted synthesis approach for the automated design of multi-domain systems. The BG/GP approach, which combines the capability of genetic programming to search in an open-ended design space with the merits of bond graphs for representing and modeling multi-domain systems elegantly and effectively, proves to be a promising method for system-level synthesis of multi-domain dynamical systems [1][2].
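As a toy illustration of this domain neutrality, consider a resonant loop formed by an I element and a C element attached to a junction: its undamped natural frequency depends only on the two bond graph parameters, whichever energy domain they come from. The numeric values below are arbitrary examples of ours, not taken from the paper.

```python
import math

def resonant_frequency(inertia, compliance):
    """Natural frequency (Hz) of an undamped I-C resonant loop in a bond
    graph: f = 1 / (2*pi*sqrt(I*C)). The same formula serves every domain:
    electrically I is an inductance and C a capacitance; mechanically I is
    a mass and C the compliance 1/k of a spring."""
    return 1.0 / (2.0 * math.pi * math.sqrt(inertia * compliance))

# Electrical resonator: L = 1 mH, C = 1 uF  -> about 5.03 kHz
f_electrical = resonant_frequency(1e-3, 1e-6)
# Mechanical micro-resonator: m = 1e-9 kg, k = 10 N/m -> about 15.9 kHz
f_mechanical = resonant_frequency(1e-9, 1.0 / 10.0)
```

One lumped-parameter formula thus evaluates a candidate in microseconds, where a meshed 3-D model would need a full finite element run.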
In the first, higher level of system synthesis of MEMS, the BG/GP approach can help to obtain a high-level description of a system, assembling the system from a library of existing components in an automated manner to meet a predefined design specification. Then, at the second, lower level, other numerical optimization approaches [3], as well as evolutionary computation, may be used to synthesize custom components from a functionality specification. It is worthwhile to point out that for the system designer the goal of synthesis is not necessarily to design the optimum device, but to take advantage of rapid prototyping and "design reuse" through component libraries, while for the custom component designer the goal may be maximum performance. These two goals may lead to different synthesis pathways. Figure 1 shows a typical structured MEMS synthesis procedure, and the BG/GP approach aims to solve the problem of system-level synthesis in an automated manner at the first level. However, in trying to establish an automated synthesis approach for MEMS, we should take cautious steps. Due to the limitations of fabrication technology, there are many constraints on the design of MEMS. Unlike VLSI, which can draw on extensive sets of design rules and programs that automatically test for design-rule violations, the MEMS field lacks design verification tools at this time. This means that no design automation tools are currently available that are capable of designing and verifying arbitrary geometrical shapes of MEMS devices. Thus, automated MEMS synthesis tools must solve sub-problems of MEMS design in particular application domains for which a small set of predefined and widely used basic electromechanical elements is available, covering a moderately large functional design space.
Z. Fan et al.
Fig. 1. Structured MEMS design flow
Automated synthesis of an RF MEM device, namely, a micro-mechanical band pass filter, is taken as an example in this paper. As designing and micromachining of more complex structures is a definite trend, and research into micro-assembly is already on its way, the BG/GP approach is believed to have many potential applications. More work to extend this approach to an integrated evolutionary synthesis environment for MEMS across a variety of design layers is also discussed at the end.
2 Design Methodology

Over the past two decades, computational design algorithms based on Charles Darwin's principles of evolution have developed from academic curiosities into practical and effective tools for scientists and engineers. Gero, for example, investigates evolutionary systems as computational models of creative design and studies the relationships among genetic engineering, style emergence, and complex evolution [13]. Goodman et al. [14] studied parallel evolution of engineering artifacts using genetic algorithms. Shah [15] identifies the importance of bond graphs for unifying multi-level design of multi-domain systems. Tay et al. [12] use bond graphs and a GA to generate and analyze dynamic system designs automatically. Their approach adopts a variational design method: a complete bond graph model is built first and then changed topologically using a GA, yielding new design alternatives. However, the efficiency of this approach is hampered by the weak ability of the GA to search the topology and parameter spaces simultaneously. Other researchers have begun to explore combinations of bond graphs and evolutionary computation [16]. Campbell [18] also uses both bond graphs and a genetic algorithm in his A-Design framework. In this research, we use an approach combining genetic programming and bond graphs to automate the process of designing dynamic systems to a significant degree.

2.1 Bond Graphs

The bond graph is a modeling tool that provides a unified approach to the modeling and analysis of dynamic systems, especially hybrid multi-domain systems including mechanical, electrical, pneumatic, and hydraulic components. It is the explicit representation of model topology that makes bond graphs a good candidate for use in open-ended design search. For notation details and methods of system analysis related to the bond graph representation, see [4]. Bond graphs have four embedded strengths for design applications: the wide scope of systems that can be created, owing to the multi- and inter-domain nature of bond graphs; the efficiency of evaluation of design alternatives; the natural combinatorial features of bond and node components for generating design alternatives; and the ease of mapping to the engineering design process. These attributes make bond graphs an excellent candidate for modeling and design of multi-domain systems.

2.2 Combining Bond Graphs and Genetic Programming

The most common form of genetic programming [5] uses trees to represent the entities to be evolved. Defining a proper function set is one of the most significant steps in using genetic programming. It may affect both the search efficiency and the validity of evolved results, and is closely related to the selection of building blocks for the system being designed. In this research, a basic function set and a modular function set are presented, listed in Tables 1 and 2. Operators in the basic function set construct primitive building blocks for the system, while operators in the modular function set utilize relatively modular, predefined building blocks composed of primitive building blocks.
Notice that numeric functions are included in both function sets, as they are needed in both cases. In other research, we hypothesize that the use of modular operators in genetic programming holds some promise for improving its search efficiency. In this paper, however, we concentrate on another issue, proposing the concept of a realizable function set. By using only operators in a realizable function set, we seek to guarantee that the evolved design is physically realizable and has the potential to be manufactured. This concept of realizability may include stringent fabrication constraints to be fulfilled in some specific application domains. The idea is illustrated in the design example of an RF MEM device, namely a micro-mechanical band pass filter. Examples of modular operators, namely the insert_BU and insert_RU operators, are illustrated in Figures 2 and 3. Examples of basic operators are available in our earlier work [6]. Figure 2 explains how the insert_BU function works. A Bridging Unit (BU) is a subsystem composed of three capacitors with the same parameters, attached together with a 0-junction in the center and 1-junctions at the left and right ends. After execution of the insert_BU function, an additional modifiable site (2) appears at the rightmost newly created bond. As illustrated in Figure 3, a Resonant Unit (RU), composed of one I, one R, and one C component, all attached to a 1-junction, is inserted into an original bond with a modifiable site through the insert_RU function. After the insert_RU function is executed, a new RU is created and one additional modifiable site, namely bond (3), appears in the resulting phenotype bond graph, along with the original modifiable site, bond (1). The newly added 1-junction also has an additional modifiable site (2). As the components C, I, and R all have parameters to be evolved, the insert_RU function has three corresponding ERC-typed sites, (4), (5), and (6), for numerical evolution of parameters. The reason these representations are chosen for the RU and BU components is discussed in the next, case study, section.

Table 1. Operators in Basic Function Set
Basic Function Set

add_C: Add a C element to a junction
add_I: Add an I element to a junction
add_R: Add an R element to a junction
insert_J0: Insert a 0-junction in a bond
insert_J1: Insert a 1-junction in a bond
replace_C: Replace the current element with a C element
replace_I: Replace the current element with an I element
replace_R: Replace the current element with an R element
+: Sum two ERCs
-: Subtract two ERCs
enda: End terminal for add functions
endi: End terminal for insert functions
endr: End terminal for replace functions
erc: Ephemeral Random Constant
Table 2. Operators in Modular Function Set

Modular Function Set

insert_RU: Insert a Resonant Unit
insert_CU: Insert a Coupling Unit
insert_BU: Insert a Bridging Unit
add_RU: Add a Resonant Unit
insert_J01: Insert a 0-1-junction compound
insert_CIR: Insert a special CIR compound
insert_CR: Insert a special CR compound
Add_J: Add a junction compound
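The modular insert operators act on an explicit bond graph topology. The sketch below shows one way insert_RU could operate on a minimal graph structure: split the modifiable bond and splice in a 1-junction carrying one R, one I and one C element (as in Fig. 3). The classes and method names are ours, and the modifiable-site and ERC bookkeeping of the real system is omitted.

```python
class BondGraph:
    """Minimal bond graph container for illustration: nodes are junctions
    ('J0'/'J1') or one-port elements ('C', 'I', 'R' with a parameter),
    and bonds are edges between node ids."""
    def __init__(self):
        self.nodes, self.bonds, self._next = {}, [], 0

    def add_node(self, kind, param=None):
        nid = self._next
        self._next += 1
        self.nodes[nid] = (kind, param)
        return nid

    def add_bond(self, a, b):
        self.bonds.append((a, b))

def insert_RU(bg, bond_index, r, i, c):
    """Splice a Resonant Unit (1-junction with attached R, I, C elements)
    into the bond at bond_index, re-routing that bond through the unit."""
    a, b = bg.bonds.pop(bond_index)
    j1 = bg.add_node('J1')
    for kind, p in (('R', r), ('I', i), ('C', c)):
        bg.add_bond(j1, bg.add_node(kind, p))
    bg.add_bond(a, j1)   # original bond now passes through the new unit
    bg.add_bond(j1, b)
```

insert_BU would follow the same pattern with three C elements around a central 0-junction; a GP tree is then simply a sequence of such edits applied to the embryo.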
Fig. 2. Operator to Insert Bridging Unit
Fig. 3. Operator to Insert Resonant Unit
3 MEM Filter Design

3.1 Filter Topology

Automated synthesis of an RF MEM device, a micro-mechanical band pass filter, is used as an example in this paper [7]. By analyzing two popular topologies used in surface micromachining of micro-mechanical filters, we found that they are topologically composed of a series of RUs and Bridging Units (BUs), or of RUs and Coupling Units (CUs), concatenated together. Figures 4, 5, and 6 illustrate the layouts and bond graph representations of filter topologies I and II.
3.2 Design Embryo

All individual genetic programming trees create bond graphs from an embryo. Selection of the embryo is also an important topic in system design, especially for multi-port systems. In our filter design problems, we use the bond graph shown in Figure 7 as our embryo.
3.3 Realizable Function Set

BG/GP is a quite general approach for automating the synthesis of multidisciplinary systems. Using a basic set of building blocks, BG/GP can perform topologically open composition of an unconstrained design. However, engineering systems in the real world are often limited by various constraints, so if BG/GP is to be used to synthesize real-world engineering systems, it must enforce those constraints.
Fig. 4. Layout of Filter Topology I: Filter is composed of a series of Resonator Units (RUs) connected by Bridging Units (BUs).
Fig. 5. Bond Graph Representation of Filter Topology I
Fig. 6. Layout of Filter Topology II: Filter is composed of a series of Resonator Units coupled by Coupling Units. Its corresponding bond graph representation is also shown.
Fig. 7. Embryo of Design (a source Se = uS with source resistance RS and load resistance RL, around the site where the circuit is evolved)
Unlike our previous designs with basic function sets, which impose fewer topological constraints on the design, MEMS design features relatively few devices in the component library. These devices are typically more complex in structure than the primitive building blocks used in the basic function set. Only evolved designs represented by bond graphs matching the dynamic behavior of devices belonging to the component library can be expected to be manufacturable under current or anticipated technology. Thus, an important and special step in MEMS synthesis with the BG/GP approach is to define a realizable function set that, throughout execution, will produce only phenotypes that can be built using existing or expected technology. Analyzing the MEM filters of [7] from a bond graph viewpoint, the filters are basically composed of Resonator Units (RUs) and Coupling Units (CUs). Another popular MEM filter topology comprises Resonator Units and Bridging Units (BUs). A realizable function set for these design topologies often includes functions from both the basic set and the modular set. In many cases, multiple realizable function sets, rather than only one, can be used to evolve realizable structures of MEMS. In this research, we used the following function sets, along with traditional numeric functions and end operators, for creating filter topologies with coupling units and resonant units:

ℜ1 = {f_tree, f_insert_J1, f_insert_RU, f_insert_CU, f_add_C, f_add_R, f_add_I}
ℜ2 = {f_tree, f_insert_J1, f_insert_RU, f_insert_BU, f_add_C, f_add_R, f_add_I}
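A realizable function set acts as a whitelist: a GP tree is acceptable only if its topology-modifying operators all come from one such set, while the numeric and end operators of Table 1 remain always permitted. A minimal check of this idea could look like the following sketch (the function names are ours, not the paper's implementation):

```python
# Topology-modifying operators of the first realizable set, R1.
REALIZABLE_1 = {"f_tree", "f_insert_J1", "f_insert_RU", "f_insert_CU",
                "f_add_C", "f_add_R", "f_add_I"}

# Numeric functions and end terminals are allowed in every set.
NEUTRAL = {"+", "-", "erc", "enda", "endi", "endr"}

def is_realizable(tree_ops, allowed=REALIZABLE_1):
    """True iff every operator appearing in a GP tree belongs either to
    the chosen realizable function set or to the neutral operators."""
    return all(op in allowed or op in NEUTRAL for op in tree_ops)
```

Restricting the GP function set up front is cheaper than repairing or discarding unmanufacturable phenotypes after the fact, since the constraint is enforced by construction.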
3.4 Adaptive Fitness Function

Within the frequency range of interest, f_range = [f_min, f_max], uniformly sample 100 points. Here, f_range = [0.1, 1000K] Hz. Compare the magnitudes of the frequency response at the sample points with the target magnitudes, which are 1.0 within the pass frequency range of [316, 1000] Hz, and 0.0 otherwise, between 0.1 Hz and 1000 KHz. Compute their differences and take the sum of squared differences as the raw fitness, Fitness_raw. If Fitness_raw < Threshold, change f_range to f_range* = [f_min*, f_max*]. Usually f_range* ⊂ f_range. Repeat the above steps and obtain a new Fitness_raw. The normalized fitness is then calculated according to:

Fitness_norm = 0.5 + Norm / (Norm + Fitness_raw)
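The sampling-and-renormalization loop above can be sketched as follows. The Norm constant, the threshold, and the narrowed range f_range* are placeholders, since the paper does not give their values, and `response` stands for the magnitude of a candidate design's frequency response:

```python
def raw_fitness(response, f_lo=0.1, f_hi=1_000_000.0, n=100):
    """Sum of squared differences between |H(f)| and the ideal filter
    (magnitude 1.0 in the 316-1000 Hz pass band, 0.0 elsewhere),
    sampled at n uniformly spaced points across [f_lo, f_hi]."""
    step = (f_hi - f_lo) / (n - 1)
    total = 0.0
    for i in range(n):
        f = f_lo + i * step
        target = 1.0 if 316.0 <= f <= 1000.0 else 0.0
        total += (response(f) - target) ** 2
    return total

def normalized_fitness(fit_raw, norm=100.0):
    # Fitness_norm = 0.5 + Norm / (Norm + Fitness_raw): largest for a
    # perfect match, tending to 0.5 as the raw error grows.
    # norm = 100 is a placeholder value.
    return 0.5 + norm / (norm + fit_raw)

def adaptive_fitness(response, threshold=10.0):
    # Once the raw error falls below the (placeholder) threshold,
    # re-sample on a narrowed range f_range* within f_range around the
    # pass band, so the 100 samples keep exerting search pressure.
    fit = raw_fitness(response)
    if fit < threshold:
        fit = raw_fitness(response, f_lo=100.0, f_hi=2000.0)
    return normalized_fitness(fit)
```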
Z. Fan et al.
The reason to use adaptive fitness evaluation is that after a GP population has reached a fairly high fitness value as a group, the differences among the frequency responses of individuals need to be examined over a more constrained frequency range. In this circumstance, if there is not sufficient sampling within this much smaller frequency range, the GP may lack sufficient search pressure to push the search forward. The normalized fitness is calculated from the sampling differences between the frequency response magnitudes of the synthesized systems and the target responses. Therefore, we adaptively change and narrow the frequency range to be heavily sampled. The effect is analogous to narrowing the search window to a smaller yet most significant area, magnifying it, and continuing to search this area with closer scrutiny.

3.5 Experimental Setup

We used a strongly-typed version of lilgp to generate bond graph models. The major GP parameters were as shown below:
Population size: 500 in each of thirteen subpopulations
Initial population: half_and_half
Initial depth: 4-6
Max depth: 50
Max nodes: 5000
Selection: Tournament (size = 7)
Crossover: 0.9
Mutation: 0.3
Three major code modules were created in this work. The algorithm kernel of HFC-GP was a strongly typed version [17] of an open software package developed in our research group, lilgp. A bond graph class was implemented in C++. The fitness evaluation package is C++ code converted from Matlab code, with hand-coded functions used to interface with the other modules of the project. The commercial software package 20Sim was used to verify the dynamic characteristics of the evolved designs. The GP program obtains satisfactory results on a Pentium-IV 1GHz in 1000-1250 minutes.

3.6 Experimental Results

Experimental results show the strong topological search capability of genetic programming and the feasibility of our BG/GP approach for finding realizable designs for micro-mechanical filters. Although significant difficulty currently arises when fabricating a micro-mechanical filter with more than 3 resonators, this does not invalidate our research: the topological search capability of the BG/GP approach shows potential for exploring more complicated topologies as MEMS design and the ever-progressing technology frontiers of MEMS fabrication advance. In Figure 8, K is the number of resonant units appearing in the best design of the generation on the horizontal axis. The use of hierarchical fair competition [8] facilitates continual improvement of the fitness. As fitness improves, the number of resonant units, K, grows, which is unsurprising because a higher-order system with more resonator units has the potential for better system performance than its lower-order counterpart. The plots of the corresponding system frequency responses at generations 27, 52, 117 and 183 are shown in Figure 9.
Fig. 8. Fitness Improvement Curve
A layout of a design candidate with three resonators and two bridging units, as well as its bond graph representation, is shown below in Figure 10. Notice that the geometry of the resonators may not reflect the real sizes and shapes of a physical resonator; the layout figure serves only as a topological illustration. The parameters are listed in Table 3. Using the BG/GP approach, it is also possible to explore novel topologies of MEM filter design. In this case, we need not use a strictly realizable function set. Instead, a semi-realizable function set is used to relax the topological constraints, with the purpose of finding new topologies not realized before but still realizable after careful design. Figure 11 gives an example of a novel topology for a MEM filter design. An attempt to fabricate this kind of topology is being carried out in a university research setting.
Fig. 9. Frequency responses of a sampling of design candidates, which evolved topologies with larger numbers, K, of resonators as the evolution progressed. All results are from one genetic programming run of the BG/GP approach.
4 Extensions

In MEMS, there are two or three levels of designs that need to be synthesized. Usually the design process starts with basic capture of the schematic of the overall system, and then goes on through layout and construction of a 3-D solid model. So the first design level is the system level, which includes selection and configuration of a repertoire of planar devices or subsystems. The second level is 2-D layout of basic structures like beams to form the elementary planar devices. In some cases, if the MEMS is basically a result of a surface-micromachining process and no significant 3-D features are present, design at this level will end one cycle of design. More generally, modeling and analysis of a 3-D solid model for MEMS is necessary. For the second level -- two-dimensional layout design of cell elements -- layout synthesis usually takes into consideration a large variety of design variables and design constraints. The most popular synthesis method seems to be based on conventional numerical optimization methods. The design problem is often first formulated as a nonlinear constrained optimization problem and then solved using an optimization software package [3]. Geometric programming, one special type of convex optimization method, has been reported to synthesize a CMOS op-amp. The method is claimed to be both globally optimal and extremely fast. The only disadvantage and limitation is that the design problem has to be carefully formatted first to make it suitable for treatment by the geometric programming algorithm. However, all the above approaches are based on the assumption that the structures of the cell elements are relatively fixed and subject to no radical topology changes [9]. A multi-objective evolutionary algorithm approach is reported for automatic synthesis of the topology and sizing of a MEMS 2-D meandering spring structure with desired stiffnesses in certain directions [10].

Fig. 10. Layout and bond graph representation of a design candidate from the experiment with three resonator units coupled by two bridging units.

Fig. 11. A novel topology of MEM filter and its bond graph representation
The third design level calls for FEA (Finite Element Analysis). FEA is a computational method used for analyzing the mechanical, thermal, and electrical behavior of complex structures. The underlying idea of FEA is to split structures into small pieces and determine the behavior of each piece. It is used for verifying the results of hand calculations for simple models, but more importantly, for predicting the behavior of complex models where first-order hand calculations are not available or are insufficient. It is especially well suited to iterative design. As a result, it is quite possible that we can use an evolutionary computation approach to evolve a design, using evaluation by means of FEA to assign fitness. Much work in this area has already been reported, and FEA should also be an ideal analysis tool for use in the synthesis loop for final 3-D structures of MEMS. However, even if we have obtained an optimized 3-D device shape, it is still very difficult to produce a proper mask layout and correct fabrication procedures. Automated mask layout and process synthesis tools will be very helpful in relieving designers from considering the fabrication details, letting them focus on the functional design of the device and system instead [11]. Our long-term research task is to include computational synthesis for the different design levels, and to provide support for design engineers across the whole MEMS design process.
5 Conclusions

This paper has suggested a design methodology for automatically synthesizing system-level designs for MEMS. For design of systems like the MEM filter problem, with strong topology constraints and fewer topology variations allowed, the challenge is to define a realizable function set that assures the evolved design is physically realizable and can be built using existing or anticipated technologies. Experiments show that a mixture of functions from both a modular function set and a basic function set forms a realizable function set, and that the BG/GP algorithm evolves a variety of designs with different levels of topological complexity that satisfy the design specifications.

Acknowledgment. The authors gratefully acknowledge the support of the National Science Foundation through grant DMII 0084934. The authors are grateful to Prof. Steven Shaw, Michigan State University, for an introduction to various aspects of MEMS design.
References

1. Fan Z., Hu J., Seo K., Goodman E., Rosenberg R., Zhang B.: Bond Graph Representation and GP for Automated Analog Filter Design. GECCO-2001 Late-Breaking Papers, San Francisco (2001) 81-86
2. Fan Z., Seo K., Rosenberg R. C., Hu J., Goodman E. D.: Exploring Multiple Design Topologies using Genetic Programming and Bond Graphs. Proceedings of the Genetic and Evolutionary Computation Conference, GECCO-2002, New York (2002) 1073-1080
3. Zhou Y.: Layout Synthesis of Accelerometers. Thesis for Master of Science, Department of Electrical and Computer Engineering, Carnegie Mellon University (1998)
4. Rosenberg R. C.: Reflections on Engineering Systems and Bond Graphs. Trans. ASME J. Dynamic Systems, Measurements and Control 115 (1993) 242-251
5. Koza J. R.: Genetic Programming II: Automatic Discovery of Reusable Programs. The MIT Press (1994)
6. Rosenberg R. C., Goodman E. D., Seo K.: Some Key Issues in Using Bond Graphs and Genetic Programming for Mechatronic System Design. Proc. ASME International Mechanical Engineering Congress and Exposition, New York (2001)
7. Wang K., Nguyen C. T. C.: High-Order Medium Frequency Micromechanical Electronic Filters. Journal of Microelectromechanical Systems (1999) 534-556
8. Hu J., Goodman E. D.: Hierarchical Fair Competition Model for Parallel Evolutionary Algorithms. CEC 2002, Honolulu, Hawaii, May (2002)
9. Hershenson M. M., Boyd S. P., Lee T. H.: Optimal Design of a CMOS Op-Amp via Geometric Programming. Computer-Aided Design of Integrated Circuits and Systems, Vol. 20, No. 1 (2001) 1-21
10. Zhou N., Zhu B., Agogino A., Pister K.: Evolutionary Synthesis of MEMS Design. ANNIE 2001, IEEE Neural Networks Council and Smart Engineering System Design Conference, St. Louis, MO, Nov 4-7 (2001)
11. Ma L., Antonsson E. K.: Automated Mask-Layout and Process Synthesis for MEMS. Technical Proceedings of the 2000 International Conference on Modeling and Simulation of Microsystems (2000) 20-23
12. Tay E. H., Flowers W., Barrus J.: Automated Generation and Analysis of Dynamic System Designs. Research in Engineering Design 10 (1998) 15-29
13. Gero J. S.: Computers and Creative Design. In: Tan M., Teh R. (eds.): The Global Design Studio, National University of Singapore (1996) 11-19
14. Eby D., Averill R., Gelfand B., Punch W., Mathews O., Goodman E.: An Injection Island GA for Flywheel Design Optimization. 5th European Congress on Intelligent Techniques and Soft Computing, EUFIT'97, Vol. 1 (1997) 687-691
15. Vargas-Hernandez N., Shah J., Lacroix Z.: Development of a Computer Aided Conceptual Design Tool for Complex Electromechanical Systems. Computational Synthesis: From Basic Building Blocks to High Level Functionality, Papers from the 2003 AAAI Symposium, Technical Report SS-03-02 (2003) 255-261
16. Wang J., Terpenny J.: Interactive Evolutionary Solution Synthesis in Fuzzy Set-based Preliminary Engineering Design. Special Issue on Soft Computing in Manufacturing, Journal of Intelligent Manufacturing, Vol. 14 (2003) 153-167
17. Luke S.: Strongly-Typed, Multithreaded C Genetic Programming Kernel. http://www.cs.umd.edu/users/-seanl/gp/patched-gp/ (1997)
18. Campbell M., Cagan J., Kotovsky K.: Agent-based Synthesis of Electro-Mechanical Design Configurations. Journal of Mechanical Design, Vol. 122, No. 1 (2000) 61-69
Congressional Districting Using a TSP-Based Genetic Algorithm
Sean L. Forman and Yading Yue
Mathematics and Computer Science Department, Saint Joseph's University, Philadelphia, PA 19131, USA
Abstract. The drawing of congressional districts by legislative bodies in the United States creates a great deal of controversy each decade as political parties and special interest groups attempt to divide states into districts beneficial to their candidates. The genetic algorithm presented in this paper attempts to find a set of compact and contiguous congressional districts of approximately equal population. This genetic algorithm utilizes a technique based on an encoding and genetic operators used to solve Traveling Salesman Problems (TSP). This encoding forces near equality of district population and uses the fitness function to promote district contiguity and compactness. A post-processing step further refines district population equality. Results are provided for three states (North Carolina, South Carolina, and Iowa) using 2000 census data.
1 Problem History
The United States Congress consists of two houses, the Senate (containing two members from each of the fifty states) and the House of Representatives. The House of Representatives has 435 members, and each state is apportioned a congressional delegation in proportion to its population as determined by a national, decennial census. Each state (usually the state's legislative body) is responsible for partitioning the state into a number of districts (a districting plan) equal to its apportionment. Through years of case law, the courts have outlined several requirements for the drawing of districts [1].
– The districts must be contiguous.
– The districts must be of equal population following the "one-man one-vote" principle.1
– The districts should be of a pleasing shape.2
Corresponding author: [email protected], http://www.sju.edu/~sforman/
Special thanks to Alberto Segre for providing support and a sounding board for this work, Saint Joseph's University for support of this work, the U.S. Census Bureau Geography Division for their guidance, and the reviewers for their comments.
1 Despite a census error of 2-3%, a federal court recently threw out a Pennsylvania plan in which the smallest and largest districts varied by 19 people in a state with 12.3 million people [2].
2 The 1990 North Carolina plan was thrown out due to a district where one candidate quipped he could hit everyone in the district by driving down Interstate-75 with both doors open [3].
E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 2072–2083, 2003. © Springer-Verlag Berlin Heidelberg 2003
– The districts should strive not to divide regions with a common interest.
– The districts should not be a dramatic departure from previous plans.
For the purposes of this paper, a districting plan will be considered valid if each district created is (1) contiguous and (2) its population is within 1% of the state's total population divided by the number of districts (the population for perfect district population equality). Within these constraints, the algorithm described will attempt to find districts with a compact (pleasing) shape.
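These two validity conditions, contiguity and district populations within 1% of ideal, can be checked directly on the tract adjacency graph. The sketch below is a minimal illustration; all names are invented for this example and are not taken from ConRed:

```python
# Sketch of the paper's validity test: every district must be one
# connected piece of the tract adjacency graph, and every district's
# population must lie within tol (1%) of statePop / n_districts.
from collections import defaultdict

def plan_is_valid(districts, populations, adjacency, n_districts, tol=0.01):
    """districts: tract -> district id; populations: tract -> population;
    adjacency: tract -> set of neighboring tracts."""
    ideal = sum(populations.values()) / n_districts
    members = defaultdict(set)
    for tract, d in districts.items():
        members[d].add(tract)
    if len(members) != n_districts:
        return False
    for tracts in members.values():
        # (2) population within 1% of the ideal district population
        pop = sum(populations[t] for t in tracts)
        if abs(pop - ideal) > tol * ideal:
            return False
        # (1) contiguity: flood fill within the district
        start = next(iter(tracts))
        seen, stack = {start}, [start]
        while stack:
            for nb in adjacency[stack.pop()]:
                if nb in tracts and nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        if seen != tracts:
            return False
    return True
```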
2 Current Practices
The redistricting process is largely a partisan process, with the political party in power drawing districts likely to keep them in power. This can lead to elongated, poorly shaped districts, or gerrymanders, a contraction of the name of long-ago Massachusetts Governor Elbridge Gerry and "salamander". Due to this politicization of the redistricting process, automated redistricting is an appealing alternative. Automated redistricting typically attempts to find a partition of the state into contiguous districts that minimizes some objective function promoting compactness and an equal distribution of the population. The number of entities to partition can vary from counties (10-150 per state) to census tracts (100-10,000). However, one can quickly see that the number of potential solutions grows very quickly as smaller and smaller entities are used. In the 1960s, there was a flurry of efforts to apply new computational technology to the redistricting problem, beginning with Vickrey's initial work in 1961 [4]. In 1963, Weaver and Hess [5] used an operations research approach similar to the methods used to locate warehouses. In 1965, Kaiser and Nagel developed methods that take current districting plans and improve them by swapping units between adjoining districts [6,7]. In 1973, Liittschwager applied Vickrey's technique to redistricting in Iowa [8]. More recently, several groups have attempted to use local search techniques to find optimal or near-optimal solutions to the redistricting problem [9]. Altman sketches a solution based on node partitioning [10] where the fitness function promotes equality of population and compactness. Mehrotra, et al [11], propose a constrained graph partitioning solution with pre- and post-processing steps. Their solution of a transshipment problem to completely balance district populations is mimicked in this paper. di Cortona [1] and Altman [12] provide excellent backgrounds on redistricting techniques.
Altman also provides a proof that the redistricting problem is, in fact, NP-complete by relating it to the class of set partitioning problems.
3 A Redistricting Genetic Algorithm

The genetic algorithm described here (ConRed, for Congressional Redistricting) takes as input the number of districts to be drawn; a set of tracts, which could mean any partition of a state (counties, townships, census tracts, census blocks, or some combination), each with a population and an area; a list of all inter-tract borders and their lengths; and a list of all borders between a tract and a neighboring state or shoreline and their respective lengths. It then outputs the best districting plan it finds as a list of tracts assigned to districts.
3.1 Fitness Function
The encoding used (Section 3.2) forces every solution to have approximately equal (at worst, 5%-8% error) populations among the created districts.3 Therefore, the fitness function focuses primarily on contiguity and compactness. Some consideration is given to population equality to encourage solutions within 1% of perfectly balanced. Methods used to determine the compactness of individual districts and districting plans fall into three primary categories [13,14]:
– Dispersion measures – district length versus width; ratio of district area to that of the smallest circumscribed circle; moment of inertia.
– Perimeter-based measures – sum of all perimeters; ratio of perimeter to area.
– Population measures – population's moment of inertia; ratio of population in district to the population within the smallest circumscribed circle.
Each of these techniques has its own biases and pathological examples. The Schwartzberg Index [15], a perimeter-based measure, was chosen for ConRed because it can be computed quickly and incrementally as tracts are added and subtracted from districts. The Schwartzberg Index measures the compactness of a district as the perimeter of the district squared divided by its area. For this measurement, lower is better, and it is minimized by a circular shape. The fitness function for ConRed consists of a measurement of the plan's shape and discontiguity (Eq. 1) and a measurement of the variance from equal district populations (Eq. 2), with the final fitness a linear combination of these two parts (Eq. 3). The shape fitness function (Eq. 1) modifies the Schwartzberg Index to reward contiguity by multiplying this value by one plus the number of excess discontiguous pieces (pieces_i) found in the district (preferably, zero) weighted by a parameter, φ. The more discontiguous pieces a proposed district has, the more severe the penalty will be.
The products for individual districts are then averaged across all n districts to give the overall districting plan's shape fitness.4 Unlike some possible techniques, the encoding does not force a plan to be contiguous, but a lack of contiguity is penalized severely in the fitness function by the multiplication of the district's compactness with the number of pieces.

Fitness_shape = (1/n) Σ_{i=1..n} (perimeter_i^2 / area_i) (1 + φ (pieces_i − 1))    (1)

The plan's population variance function depends on the ideal district population (idealDistPop = state population divided by the number of districts) and the maximum error allowed in a valid solution (Var_k = idealDistPop × k, in our case k = 1%). The first term of Eq. 2 serves as a penalty function (γ determines just how severe) for each district within a districting plan that violates the population constraint. The second term of Eq. 2 drives the population differences towards zero once a valid plan has been found.

3 Recall that according to the courts, the primary determinants of a hypothetical districting plan's fitness are the equality of its population distribution and the compactness and contiguity of its districts.
4 One could also choose to find the minimax compactness or any similar measurement.
Fitness_pop = γ Σ_{i=1..n} MAX(|pop_i − idealDistPop| − Var_1%, 0) / (n × idealDistPop) + Σ_{i=1..n} |pop_i − idealDistPop| / (n × idealDistPop)    (2)
Then adding the two together gives the overall fitness (Eq. 3). The parameter θ can be tuned to balance population and shape considerations.

Fitness = Fitness_shape + θ Fitness_pop    (3)

3.2 Encoding
The encoding chosen for this genetic algorithm is the path representation used for Traveling Salesman Problems (TSP) [16,17,18]. As in the TSP, a single chromosome travels through each tract, and as the tracts are traversed, districts are formed by the sequence of tracts. This differs significantly from GA techniques used to solve node partitioning problems (NPP) or clustering problems [19]. ConRed does not store the partition of the state in its chromosome per se, but extracts it from the order of the entries. Though details may vary, NPP solutions typically encode the actual partitions of the graph vertices as chromosomes [20]. Clustering problems may be solved similarly, or they may encode the chromosomes as coordinates for the location of the cluster center, with data points assigned to the nearest cluster center [19]. The clustering approach would guarantee compactness and contiguity, but the population equality constraint would be difficult to satisfy since it may not always be possible to draw compact districts for a given problem. An NPP encoding that would maintain contiguity and population equality in every chromosome and optimize on compactness would require a significant amount of post-mutation and post-crossover processing to maintain chromosome validity. Encoding the redistricting problem in a manner similar to the TSP causes each chromosome to have districts with approximately equal populations and relies upon the fitness function to create compactness and contiguity. The trade-off is that the search space is dramatically enlarged with redundant solutions, as will be shown below. To demonstrate how the encoding is translated from a permutation of tracts to a set of districts, take a fictitious state named Botna, with three districts and nine tracts arranged in a 3 × 3 array (tract populations are given in parentheses), and a chromosome, which is just a permutation of length nine.

1(30) 2(20) 3(10)
4(10) 5(20) 6(30)
7(10) 8(20) 9(30)
A=1−4−5−2−3−6−9−8−7
Now to convert this permutation into a districting plan, one travels along the chromosome summing the populations until some threshold population is met. The threshold chosen is typically idealDistPop ± δ, where δ is typically 5% of idealDistPop. In cases where the number of tracts is small, there is no guarantee that the population
will be split so neatly as in this example, but for a large number of tracts the approximate equality of district populations is virtually guaranteed.

Table 1. Converting chromosome A to a districting plan. State population is 180, so idealDistPop = 60.

Tract  Pop.  Cum. Pop.  District
  1     30      30         1
  4     10      40         1
  5     20      60         1
  2     20      20         2
  3     10      30         2
  6     30      60         2
  9     30      30         3
  8     20      50         3
  7     10      60         3

1 2 3
4 5 6
7 8 9
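The traversal illustrated in Table 1 amounts to a simple decoder. The sketch below uses a simplified closing rule (close a district once its cumulative population reaches idealDistPop minus δ) standing in for the paper's idealDistPop ± δ threshold; the function name is illustrative:

```python
def decode(chromosome, populations, n_districts, delta=0.05):
    """Walk the TSP-style permutation, accumulating tract populations;
    close the current district once its population reaches
    idealDistPop * (1 - delta). The last district takes whatever
    tracts remain. Simplified reading of the paper's threshold rule."""
    ideal = sum(populations.values()) / n_districts
    threshold = ideal * (1 - delta)
    plan, current, pop = [], [], 0
    for tract in chromosome:
        current.append(tract)
        pop += populations[tract]
        if pop >= threshold and len(plan) < n_districts - 1:
            plan.append(current)
            current, pop = [], 0
    plan.append(current)  # remaining tracts form the final district
    return plan
```

On Botna's populations, this reproduces both worked examples: chromosome A decodes to (1-4-5)(2-3-6)(9-8-7) as in Table 1, and chromosome B decodes to (1-6)(3-2-5-7)(9-8-4) as in Table 2.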
In the districting plan, A, shown in Table 1, every tract has an area of one and each inter-tract boundary is of length one. The fitness_shape of this chromosome, A, can then be evaluated using Equation (1) with φ = 2.

District 1 (1-4-5): Perim. = 8, Area = 3, Fitness = 21.33
District 2 (2-3-6): Perim. = 8, Area = 3, Fitness = 21.33
District 3 (9-8-7): Perim. = 8, Area = 3, Fitness = 21.33

Taking the average of the three districts gives fitness_shape = 21.33. Here is another example, which illustrates a problem with this encoding (again φ = 2).

B = 1 − 6 − 3 − 2 − 5 − 7 − 9 − 8 − 4

Table 2. Example displaying encoding shortcomings.

Tract  Pop.  Cum. Pop.  District
  1     30      30         1
  6     30      60         1
  3     10      10         2
  2     20      30         2
  5     20      50         2
  7     10      60         2
  9     30      30         3
  8     20      50         3
  4     10      60         3

1 2 3
4 5 6
7 8 9
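Since every Botna tract is a unit square, the perimeter, area, and piece counts in Eq. (1) can be computed mechanically from the grid. This sketch (helper names invented for the illustration, not ConRed's code) reproduces the hand calculations for these example plans:

```python
# Sketch: evaluating Eq. (1) on the 3x3 Botna grid of unit-square tracts.

def grid_neighbors(t):
    # orthogonal neighbors of tract t (tracts numbered 1..9, row-major)
    r, c = divmod(t - 1, 3)
    return {rr * 3 + cc + 1
            for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
            if 0 <= rr < 3 and 0 <= cc < 3}

def pieces(tracts):
    # number of connected components a district falls into
    remaining, count = set(tracts), 0
    while remaining:
        count += 1
        stack = [remaining.pop()]
        while stack:
            for nb in grid_neighbors(stack.pop()) & remaining:
                remaining.discard(nb)
                stack.append(nb)
    return count

def shape_fitness(plan, phi=2):
    # Eq. (1): average over districts of (perimeter^2 / area)
    # times (1 + phi * (pieces - 1))
    total = 0.0
    for district in plan:
        tracts = set(district)
        area = len(tracts)  # unit-square tracts
        perim = sum(4 - len(grid_neighbors(t) & tracts) for t in tracts)
        total += (perim ** 2 / area) * (1 + phi * (pieces(tracts) - 1))
    return total / len(plan)
```

With φ = 2 this returns 21.33 for plan A (all districts contiguous) and 101.33 for plan B (every district in two pieces), matching the hand calculations.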
District 1 (1-6): Perim. = 8, Area = 2, Pieces = 2, Fitness = 96
District 2 (3-2-5-7): Perim. = 12, Area = 4, Pieces = 2, Fitness = 108
District 3 (9-8-4): Perim. = 10, Area = 3, Pieces = 2, Fitness = 100

Taking the average of the three districts gives fitness_shape = 101.33. So while this encoding does not force the districts to be contiguous, the fitness function does penalize those districting plans which are not contiguous. An additional problem with this encoding is that it allows a large number of redundant solutions to be considered. For instance, (1 − 6) − (3 − 2 − 5 − 7) − (9 − 8 − 4) and (1 − 6) − (2 − 3 − 5 − 7) − (9 − 8 − 4) would produce the exact same districting plan in the example above.

3.3 Initial Population & Selection
Several heuristics are used to find an initial population. The algorithm begins each new chromosome at a randomly selected border tract.5 Adjoining, unvisited tracts are added randomly to the chromosome. The direction of the next selected tract is random, but biased toward the selection of tracts adjoining previously visited elements. This gives some preliminary structure to the permutations. If the chromosome finds no adjoining, unvisited tracts to add, it jumps randomly to a non-adjoining tract elsewhere in the state and continues adding tracts.6 After the fitnesses are calculated for each member of the population, the chromosomes are ranked and selected proportionally by their ranks [21]. For example, in a population of ten chromosomes the best chromosome is ten times more likely to be selected than the worst chromosome. Additionally, a copy of the best chromosome is passed to the next generation.

3.4 Genetic Operators
Of the many different operators used on the traveling salesman problem, a crossover, a mutation and a heuristic operator have been chosen.

Maximal Preservative Crossover. This operator uses a donor and a receiver chromosome [22]. From the donor, a random substring is chosen, call it Ω. All of the elements in Ω are then deleted from the receiver chromosome, and Ω is inserted where the first element of Ω occurs in the receiver. This implementation is slightly altered from Mühlenbein, et al.'s original implementation, as they placed Ω at the beginning of the receiver chromosome. Suppose the following chromosomes are chosen,

A = 1 − 4 − 5 − 8 − 7 − 2 − 3 − 6 − 9
B = 1 − 6 − 3 − 2 − 5 − 4 − 7 − 8 − 9

5 While testing was inconclusive, this appeared to be a somewhat better strategy than using the largest tract (by area or population) or a random tract.
6 This method does provide some structure, but none of the millions of initial chromosomes created in the tests performed (Section 4) represented valid or even contiguous districting plans.
where A is the donor and B is the receiver. Suppose Ω = 5 − 8 − 7 − 2. In B, Ω will be inserted after the 3, since 2 is in Ω.

B = 1 − 6 − 3 − _ − _ − 4 − _ − _ − 9  ⇒  B = 1 − 6 − 3 − 5 − 8 − 7 − 2 − 4 − 9

Exchange Mutation. Exchange mutation [23] cuts a piece out of a chromosome and then switches it with a piece of the same size somewhere else in the chromosome. The size of this cut is a parameter, swapSize, that may be entered by the user. Suppose swapSize = 2, cutPoint = 2 and pastePoint = 6.

B = 1 − (6 − 3) − 2 − 5 − (4 − 7) − 8 − 9  ⇒  B = 1 − (4 − 7) − 2 − 5 − (6 − 3) − 8 − 9

Discontiguity Patch. In the course of building chromosomes, one often finds islands of tracts from a district trapped within another district (see Table 2). A heuristic clean-up of these islands can significantly improve a chromosome's fitness while not altering its underlying character. The clean-up process involves searching for discontiguous districts in a districting plan, removing one of the smaller pieces from the string of tracts and then re-inserting it after a tract that borders the first tract in the island piece. For example, in Table 2, tracts 4, 8, and 9 are part of the same district, but tract 4 is discontiguous from tracts 8 and 9. The heuristic snips tract 4 from the string, and then randomly inserts it after one of the tracts that it borders (1, 5, or 7). The resulting plan removes several of the discontiguities and has an improved fitness of 75.33.

1 − 6 − 3 − 2 − 5 − 7 − 9 − 8 − (4)  ⇒  1 − (4) − 6 − 3 − 2 − 5 − 7 − 9 − 8

3.5 Pre-processing
Population equality tends to improve, and compactness and contiguity tend to worsen, as the number of tracts increases. The coarsest tract structure available is the county level. But since most states have at least some counties with very large populations, it is often necessary to partition a large county into a set of subtracts, ideally each with an equal population. For instance, a large county can be divided into a number of smaller tracts7 and then used with the other unbroken counties to give a data set with the population more evenly divided among the input tracts. To create these subtracts (itself essentially a districting problem), ConRed was applied to individual counties with populations above a prescribed threshold. In Table 3, the number of tracts increases as this threshold is lowered. A low population threshold for subdividing a county will lead to more tracts in the input set and possibly less contiguity, but it will increase the probability that a solution will have a low population variance, as Table 3 shows.

3.6 Post-processing
The encoding used provides approximate equality of district populations, but often only within 5% of the ideal district population rather than the 1% that is considered valid in this context. To bring the district populations into this desired range, a post processing step has been added. This step is applied following the fitness evaluation and only to 7
For these experiments, this number was capped at nine.
Congressional Districting Using a TSP-Based Genetic Algorithm
2079
chromosomes that already generate contiguous plans. A desired population reapportionment is calculated using a transshipment algorithm. Guided by this reapportionment, ConRed then proceeds to swap boundary tracts between districts until no further population balancing is possible. The transshipment algorithm [24] is a common operations research technique used to determine the optimal shipments between a set of known supply and demand points (commonly factories, warehouses and stores). Each possible branch has a cost associated with it, so the objective is to minimize the overall cost while satisfying the given demand using the given supply. As applied to this problem, the districts with larger-than-ideal populations are suppliers for demanders which are districts with smaller-than-ideal populations. The costs are structured to guarantee that swapping occurs only between districts bordering each other, therefore some districts may act as go-betweens (both suppliers and demanders) for districts needing to adjust their populations. This problem is solved using the transportation simplex method. This technique for balancing district population was suggested by Mehrotra, et al [11]. 3.7
3.7 Parameters
The primary parameters for tuning are the probabilities for the three operators, the maximum and minimum lengths of the segments removed and inserted by the operators, and the fitness function's parameters: γ, θ, and φ. Some work has been done on tuning parameters for this GA, but it is an area where considerable improvement may still be possible. For the experimental results that follow, φ = 2 × (number of districts), γ = 10, θ = 500, p_mpx = 0.40, p_exc = 0.10, and p_disc = 0.40. The length of any exchange or patch is at most ten tracts, and the crossover can be of any length.
4 Experimental Results
ConRed⁸ performed a total of 80 runs for each of three states: North Carolina (13 districts), South Carolina (6 districts), and Iowa (5 districts).⁹ For each state, 20 runs were made on 4 different tract layouts created by decrementing the pre-processing population threshold from 100,000 to 25,000 (Section 3.5). For these 20 runs, GA population sizes were incremented from 500 to 2000 chromosomes (by 500) for 5 runs each. The number of generations was chosen such that there were 2,000,000 plan evaluations for each run. Results are presented in Table 3. Contig gives the number of runs whose best, final solutions are contiguous, and N valid is the number of those plans that are both contiguous and have a population variation within 1.0% of ideal. PopVar is a plan's largest district population variance from the ideal. Time is in seconds. The results in the Best column
⁸ ConRed is written in ANSI C and was tested on an 867MHz Pentium III using RedHat Linux.
⁹ The U.S. Census Bureau provides a pair of files that are needed to run tests on actual state data. Summary File 1 (SF1) [25] provides information on population, area, and latitude and longitude for a variety of entity sizes (county, county subdistrict, census tract, census block) for each state. TIGER/Line files [26] contain all boundary information between the various entities found in the SF1 files. A series of Perl scripts were written to build the necessary input files and generate the postscript files used to visualize final solutions.
2080
S.L. Forman and Y. Yue
are the best fitness across all runs and the minimax population variance across all of the runs, so the two numbers may be from different runs. The best, valid solutions are shown in Figure 1. The results show that for small states with six or fewer representatives (26 of 50 states), ConRed will draw compact districts of essentially equal populations. For a larger state like North Carolina (42 of the 50 states have 13 representatives or fewer), ConRed can produce contiguous districts, but is not able to consistently bring the populations close enough together to be a valid solution. This may be a matter of parameter tuning; a higher setting for θ may produce better results. Still, these results could be used as rough plans and then manipulated along the borders to produce an equal population distribution. Unfortunately, it has been difficult to locate other existing solvers with which to compare ConRed. Comparisons to the existing districting plans are interesting, but legislatures are not attempting to form compact districting plans, so such comparisons are of little informational value; almost any computational technique will produce a solution more compact than those drawn by a legislature. In the 1960s, Kaiser [6], Nagel [7], and Weaver [5] produced compact plans, but with population variances greater than 1%. In 1996, Hayes [3] produced a twelve-district plan for North Carolina with population variation of less than 1%, but no concern for compactness. In 1998, Mehrotra et al. [11] produced a plan for South Carolina beginning with 48 tracts. They initially found a solution with a population variation of 4.4%. A transshipment step determined the amount of population that needed shifting to achieve district population equality, and then by hand (it appears) tracts were sliced away on district boundaries until an equal population distribution was created. A visual inspection suggests that the valid solutions produced by ConRed are as compact as their plan.
Table 3. ConRed experiments. See text (Section 4) for an explanation.

State          Tracts  Runs  Contig  N valid  Avg Fitness  Avg PopVar%  Time (sec)  Best Fitness  Best PopVar%
IA (5 dist.)      108    20      20       20         27.8         0.66         568          25.1          0.19
                  115    20      20       20         28.1         0.60         612          26.1          0.20
                  124    20      20       18         29.2         0.65         624          25.4          0.25
                  164    20      20       20         28.5         0.49         722          25.2          0.12
SC (6 dist.)       69    20      20       15         37.1         0.89         571          32.4          0.48
                   83    20      20       13         35.0         0.90         607         30.68          0.58
                  115    20      20       18         39.9         0.84         725          33.3          0.55
                  183    20      20       19         38.7         0.69         883          34.1          0.43
NC (13 dist.)     147    20      17        0⁹        71.5         2.33        1669          43.3          1.36
                  166    20      16        0         79.6         2.58        1797          38.7          1.06
                  212    20      17        0         73.9         1.65        1792          43.0          1.08
                  352    20       8        2        173.3         1.95       1715¹⁰         41.9          0.83
North Carolina
South Carolina
Iowa
Fig. 1. ConRed’s best, valid, final solutions for North Carolina, South Carolina and Iowa. North Carolina’s solution divides 352 tracts into 13 districts and has a fitness of 45.1 and a maximum population variation of 0.86%. There are better shaped solutions than this, but their population variation fell just outside the 1% cutoff. South Carolina’s solution divides 83 tracts into 6 districts and has a fitness of 30.6 and a maximum population variation of 0.81%. Iowa’s solution divides 108 tracts into 5 districts and has a fitness of 25.1 and a maximum population variation of 0.44%. All interior borders shown are county borders.
5 Future Work
Additional heuristics, improved genetic operators, and additional parameter tuning may provide further improvement in the quality of the solutions produced. Thus far, most of the work has focused on the heuristics and the visualization aspects of the project, so further attention to parameter tuning and the choice of operators will be the next area of study. In addition, politicians may be interested in the distribution of political affiliations within districts, and applications to other location or partitioning problems have yet to be considered.
⁹ This particular solution was found by ConRed on four separate runs. Of the 58 contiguous solutions for NC, maximum population variation ranged from 0%–1% for two solutions, 1.0%–1.5% for 23, 1.5%–2.0% for 18, and above 2% for 15.
¹⁰ Due to the lower number of contiguous solutions, the post-processing step was performed less often than in the other problems, hence the lower time required despite a larger number of tracts.
6 Conclusions
The genetic algorithm presented in this paper, ConRed, uses an encoding adapted from traveling salesman problems to produce possible congressional districting plans. ConRed’s fitness function promotes compactness and contiguity and its encoding provides approximate equality of district population for free. Two genetic operators have been implemented in ConRed along with a single heuristic operator. Experimental results on three small to medium states show significant success at producing compact, contiguous districts of nearly equal population.
References

1. di Cortona, P.G., Manzi, C., Pennisi, A., Ricca, F., Simeone, B.: Evaluation and Optimization of Electoral Systems. Monographs on Discrete Mathematics and Applications. SIAM (1999)
2. Milewski, M.: Court overturns redistricting. The Intelligencer (2002)
3. Hayes, B.: Machine politics. American Scientist 84 (1996) 522–526
4. Vickrey, W.: On the prevention of gerrymandering. Political Science Quarterly 76 (1961) 105
5. Hess, S., Weaver, J., Siegfeldt, H., Whelan, J., Zitlau, P.: Nonpartisan political redistricting by computer. Operations Research 13 (1965) 998–1006
6. Kaiser, H.: An objective method for establishing legislative districts. Midwest Journal of Political Science (1966)
7. Nagel, S.: Simplified bipartisan computer redistricting. Stanford Law Review 17 (1965) 863–899
8. Liittschwager, J.: The Iowa redistricting system. Annals of the New York Academy of Science 219 (1973) 221–235
9. Ricca, F., Simeone, B.: Political redistricting: Traps, criteria, algorithms, and trade-offs. Ricerca Operativa 27 (1997) 81–119
10. Altman, M.: Is automation the answer? The computational complexity of automated redistricting. Rutgers Computer and Technology Law Journal 23 (2001) 81–142
11. Mehrotra, A., Johnson, E.L., Nemhauser, G.L.: An optimization based heuristic for political districting. Management Science 44 (1998) 1100–1114
12. Altman, M.: Modeling the effect of mandatory district compactness on partisan gerrymanders. Political Geography 17 (1998) 989–1012
13. Young, H.: Measuring compactness of legislative districts. Legislative Studies Quarterly 13 (1988) 105–111
14. Niemi, R., Grofman, B., Carlucci, C., Hofeller, T.: Measuring compactness and the role of a compactness standard in a test for partisan and racial gerrymandering. Journal of Politics 52 (1990) 1152–1182
15. Schwartzberg, J.: Reapportionment, gerrymanders, and the notion of compactness. Minnesota Law Review 50 (1966) 443–452
16. Jog, P., Suh, J.Y., van Gucht, D.: Parallel genetic algorithms applied to the traveling salesman problem. SIAM Journal of Optimization 1 (1991) 515–529
17. Grefenstette, J., Gopal, R., Rosimaita, B., van Gucht, D.: Genetic algorithms for the traveling salesman problem. In: International Conference on Genetic Algorithms and their Applications. (1985) 160–168
18. Larrañaga, P., Kuijpers, C., Murga, R., Inza, I., Dizdarevic, S.: Genetic algorithms for the traveling salesman problem: A review of representations and operators. Artificial Intelligence Review 13 (1999) 129–170
19. Sarkar, M., Yegnanarayana, B., Khemani, D.: A clustering algorithm using an evolutionary programming-based approach. Pattern Recognition Letters 18 (1997) 975–986
20. Chandrasekharam, R., Subramanian, S., Chaudhury, S.: Genetic algorithm for node partitioning problem and applications in VLSI design. IEEE Proceedings: Comput. Digit. Tech. 140 (1993) 255–260
21. Baker, J.: Adaptive selection methods for genetic algorithms. In Grefenstette, J., ed.: Proc. of the 1st International Conference on Genetic Algorithms and their Applications, Hillsdale, NJ, Lawrence Erlbaum Associates (1985) 101–111
22. Mühlenbein, H., Gorges-Schleuter, M., Krämer, O.: Evolution algorithms in combinatorial optimization. Parallel Computing 7 (1988) 65–85
23. Banzhaf, W.: The "molecular" traveling salesman. Biological Cybernetics 64 (1990) 7–14
24. Winston, W.L.: Operations Research: Applications and Algorithms. Volume 3. Duxbury Press (1994)
25. U.S. Census Bureau, Washington, DC: Census 2000 Summary File 1. (2001)
26. U.S. Census Bureau, Washington, DC: Census 2000 TIGER/Line Files. (2000)
27. Goldberg, D.: Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley (1989)
Active Guidance for a Finless Rocket Using Neuroevolution

Faustino J. Gomez and Risto Miikkulainen
Department of Computer Sciences, University of Texas, Austin, TX 78712, USA

Abstract. Finless rockets are more efficient than finned designs, but are too unstable to fly unassisted. These rockets require an active guidance system to control their orientation during flight and maintain stability. Because rocket dynamics are highly non-linear, developing such a guidance system can be prohibitively costly, especially for relatively small-scale rockets such as sounding rockets. In this paper, we propose a method for evolving a neural network guidance system using the Enforced SubPopulations (ESP) algorithm. Based on a detailed simulation model, a controller is evolved for a finless version of the Interorbital Systems RSX-2 sounding rocket. The resulting performance is compared to that of an unguided standard full-finned version. Our results show that the evolved active guidance controller can greatly increase the final altitude of the rocket, and that ESP can be an effective method for solving real-world, non-linear control tasks.
1 Introduction
Sounding rockets carry a payload for making scientific measurements to the Earth's upper atmosphere, and then return the payload to the ground by parachute. These rockets serve an invaluable role in many areas of scientific research, including high-G-force testing, meteorology, radio-astronomy, environmental sampling, and micro-gravity experimentation [1,2]. They have been used for more than 40 years; they were instrumental, e.g., in discovering the first evidence of X-ray sources outside the solar system in 1962 [3]. Today, sounding rockets are much in demand as the most cost-effective platform for conducting experiments in the upper atmosphere. Like most rockets, sounding rockets are usually equipped with fins to keep them on a relatively straight path and maintain stability. While fins are an effective "passive" guidance system, they increase both mass and drag on the rocket, which lowers the final altitude or apogee that can be reached with a given amount of fuel. A rocket with smaller fins or no fins at all can potentially fly much higher than a full-finned version. Unfortunately, such a design is unstable, requiring some kind of active attitude control or guidance to keep the rocket from tumbling. Finless rockets have been used for decades in expensive, large-scale launch vehicles such as the USAF Titan family, the Russian Proton, and the Japanese H-IIA.

E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 2084–2095, 2003. © Springer-Verlag Berlin Heidelberg 2003

The guidance systems on these rockets are based on classical feedback
control such as Proportional-Integral-Differential (PID) methods to adjust the thrust angle (i.e., thrust vectoring) of the engines. Because rocket flight dynamics are highly non-linear, engineers must make simplifying assumptions in order to apply these linear methods, and take great care to ensure that these assumptions are not violated during operation. Such an undertaking requires detailed knowledge of the rocket's dynamics that can be very costly to acquire. Recently, non-linear approaches such as neural networks have been explored, primarily for use in guided missiles (see [4] for an overview of neural network control architectures). Neural networks can implement arbitrary non-linear mappings that can make control greatly more accurate and robust, but, unfortunately, they still require significant domain knowledge to train.

Fig. 1. The Interorbital Systems RSX-2 Rocket. The RSX-2 is capable of lifting a 5-pound payload into the upper atmosphere using a cluster of four liquid-fueled thrusters. It is currently the only liquid-fueled sounding rocket in production. Such rockets are desirable because of their low acceleration rates and non-corrosive exhaust products.

In this paper, we propose a method for making the development of finless sounding rockets more economical by using Enforced SubPopulations (ESP; [5,6]) to evolve a neural network guidance system. As a test case, we will focus on a finless version of the Interorbital Systems RSX-2 rocket (figure 1). The RSX-2 is a liquid-fueled sounding rocket that uses the differential thrust of its four engines to control attitude. By evolving a neural network controller that maps the state of the rocket to thrust commands, the guidance problem can be solved without the need for analytical modeling of the rocket's dynamics or prior knowledge of the appropriate kind of control strategy to employ.
Using ESP, all that is required is a sufficiently accurate simulator and a quantitative measure of the guidance system’s performance, i.e. a fitness function. In the next three sections we describe the Enforced SubPopulations method used to evolve the guidance system, the controller evolution simulations, and experimental results on controlling the RSX-2 rocket.
Fig. 2. The Enforced Subpopulations Method (ESP; color figure). The population of neurons is segregated into subpopulations shown here in different colors. Networks are formed by randomly selecting one neuron from each subpopulation. A neuron accumulates a fitness score by adding the fitness of each network in which it participated. This score is then normalized and the best neurons within each subpopulation are mated to form new neurons. By evolving neurons in separate subpopulations, the specialized sub-functions needed for good networks are evolved more efficiently.
2 Enforced Subpopulations (ESP)
Enforced Subpopulations¹ [5,6] is a neuroevolution method that extends the Symbiotic, Adaptive Neuroevolution algorithm (SANE; [7]). ESP and SANE differ from other NE methods in that they evolve partial solutions or neurons instead of complete networks, and a subset of these neurons are put together to form a complete network. In SANE, neurons are selected from a single population to form networks. In contrast, ESP makes use of explicit subtasks: a separate subpopulation is allocated for each of the h units in the network, and a neuron can only be recombined with members of its own subpopulation (figure 2). This way the neurons in each subpopulation can evolve independently and specialize rapidly into good network sub-functions. Evolution in ESP proceeds as follows:

1. Initialization. The number of hidden units h in the networks that will be formed is specified, and a subpopulation of neuron chromosomes is created for each unit. Each chromosome encodes the input and output connection weights of a neuron with a random string of real numbers.

2. Evaluation. A set of h neurons is selected randomly, one neuron from each subpopulation, to form the hidden layer of a complete network. The network is submitted to a trial in which it is evaluated on the task and awarded a fitness score. The score is added to the cumulative fitness of each neuron
¹ The ESP software package is available at: http://www.cs.utexas.edu/users/nn/pages/software/abstracts.html#esp-cpp
that participated in the network. This process is repeated until each neuron has participated in an average of, e.g., 10 trials.

3. Recombination. The average fitness of each neuron is calculated by dividing its cumulative fitness by the number of trials in which it participated. Neurons are then ranked by average fitness within each subpopulation. Each neuron in the top quartile is recombined with a higher-ranking neuron using 1-point crossover and mutation at low levels to create the offspring that replace the lowest-ranking half of the subpopulation.

4. The Evaluation–Recombination cycle is repeated until a network that performs sufficiently well in the task is found.

Evolving networks at the neuron level has proven to be a very efficient method for solving reinforcement learning tasks such as pole-balancing [6], robot arm control [8], and game playing [7]. ESP is more efficient than SANE because the subpopulation architecture makes the evaluations more consistent in two ways. First, the subpopulations that gradually form in SANE are already present by design in ESP: the species do not have to organize themselves out of a single large population, and their progressive specialization is not hindered by recombination across specializations that usually fulfill relatively orthogonal roles in the network. Second, because the networks formed by ESP always consist of a representative from each evolving specialization, a neuron is always evaluated on how well it performs its role in the context of all the other players. The accelerated specialization in ESP makes it more efficient than SANE, but it also causes diversity to decline over the course of evolution, like that of a normal GA. This can be a problem because a converged population cannot easily adapt to a new task. To deal with premature convergence, ESP is combined with burst mutation. The idea is to search for optimal modifications of the current best solution.
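The evaluation–recombination cycle above can be sketched in a few lines. This is an illustrative rendering only: `evaluate` stands in for a task-specific fitness trial, and the constants are arbitrary rather than taken from the paper.

```python
import random

# Minimal sketch of one ESP evaluation-recombination cycle (steps 2-3 above).
# A neuron chromosome is a flat list of connection weights.
H, POP, GENES, TRIALS = 3, 20, 8, 10  # illustrative constants

def esp_generation(subpops, evaluate):
    fitness = [[0.0] * POP for _ in range(H)]
    trials = [[0] * POP for _ in range(H)]
    # Evaluation: form random networks, one neuron from each subpopulation.
    for _ in range(TRIALS * POP):
        picks = [random.randrange(POP) for _ in range(H)]
        score = evaluate([subpops[h][i] for h, i in enumerate(picks)])
        for h, i in enumerate(picks):
            fitness[h][i] += score
            trials[h][i] += 1
    # Recombination happens only within each subpopulation.
    for h in range(H):
        avg = [fitness[h][i] / max(trials[h][i], 1) for i in range(POP)]
        order = sorted(range(POP), key=lambda i: -avg[i])
        for rank, i in enumerate(order[:POP // 4]):    # top quartile breeds
            mate = subpops[h][order[random.randrange(rank + 1)]]
            cut = random.randrange(1, GENES)           # 1-point crossover
            child = subpops[h][i][:cut] + mate[cut:]
            if random.random() < 0.1:                  # low-level mutation
                child[random.randrange(GENES)] += random.gauss(0, 0.1)
            subpops[h][order[POP - 1 - rank]] = child  # replace worst half
    return subpops
```

Note that a network is always one neuron per subpopulation, which is what makes each neuron's fitness an estimate of how well it fills its slot in the context of the other specializations.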
When performance has stagnated for a predetermined number of generations, new subpopulations are created by adding noise to each of the neurons in the best solution. Each new subpopulation contains neurons that represent differences from the best solution. Evolution then resumes, but now searching the space in a "neighborhood" around the best previous solution. Burst mutation can be applied multiple times, with successive invocations representing differences to the previous best solution. Assuming the best solution already has some competence in the task, most of its weights will not need to be changed radically. To ensure that most changes are small while allowing for larger changes to some weights, ESP uses the Cauchy distribution to generate noise:

    f(x) = α / (π(α² + x²))    (1)

With this distribution, 50% of the values will fall within the interval ±α and 99.9% within the interval ±318.3α. This technique of "recharging" the subpopulations keeps diversity in the population so that ESP can continue to make progress toward a solution even in prolonged evolution. Burst mutation is similar to the Delta-Coding technique of [9], which was developed to improve the precision of genetic algorithms for numerical optimization problems. Because our goal is to maintain diversity, we do not reduce the range of the noise on successive applications of burst mutation, and we use Cauchy rather than uniformly distributed noise.
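A minimal sketch of burst mutation, assuming the best network's neurons are available as weight vectors. Cauchy noise is drawn by inverse-CDF sampling of equation (1), so about half of the perturbations fall within ±α.

```python
import math
import random

# Burst mutation sketch: new subpopulations are seeded around the best
# network's neurons using Cauchy-distributed noise (equation 1, scale alpha).

def cauchy(alpha):
    # Inverse-CDF sampling of the Cauchy distribution:
    # roughly 50% of samples fall within +/- alpha.
    return alpha * math.tan(math.pi * (random.random() - 0.5))

def burst_mutate(best_neurons, pop_size, alpha=0.3):
    # One new subpopulation per neuron of the best solution.
    return [[[w + cauchy(alpha) for w in neuron] for _ in range(pop_size)]
            for neuron in best_neurons]
```

The heavy tails of the Cauchy distribution are the point: most offspring stay near the proven solution, but occasional large jumps keep the search from being trapped there.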
Fig. 3. Rocket Dynamics. The rocket (a) is stable because the fins increase drag in the rear of the rocket, moving the center of pressure (CP) behind the center of gravity (CG). As a result, any small angles α and β are automatically corrected. In contrast, the finless rocket (b) is unstable because the CP stays well ahead of the CG. To keep α and β from increasing, i.e., to keep the rocket from tumbling, active guidance is needed to counteract the destabilizing torque produced by drag.
3 The Finless Rocket Guidance Task
In previous work, ESP was shown to outperform a wide range of reinforcement learning algorithms, including Q-learning, SARSA(λ), Evolutionary Programming, and SANE, on several difficult versions of the pole-balancing or inverted pendulum task [6]. The rocket guidance domain is similar to pole-balancing in that both involve stabilizing an inherently unstable system. Figure 3 gives a basic overview of rocket dynamics. The motion of a rocket is defined by the translation of its center of gravity (CG), and the rotation of the body about the CG in the pitch, yaw, and roll axes. Four forces act upon a rocket in flight: (1) the thrust of the engines, which propels the rocket; (2) the drag of the atmosphere, exerted at the center of pressure (CP) in roughly the opposite direction to the thrust; (3) the lift force generated by the fins along the yaw axis; and (4) the side force generated by the fins along the pitch axis. The angle between the direction the rocket is flying and the longitudinal axis of the rocket in the yaw-roll plane is known as the angle of attack, or α; the corresponding angle in the pitch-roll plane is known as the sideslip angle, or β. When either α or β is greater than 0 degrees, the drag exerts a torque on the rocket that can cause the rocket to tumble if it is not stable. The arm through which this torque acts is the distance between the CP and the CG. In figure 3a, the finned rocket is stable because the CP is behind the rocket's CG. When α (β) is non-zero, a torque is generated by the lift (side) force of the fins that counteracts the drag torque and tends to minimize α (β). This situation corresponds to a pendulum hanging down from its hinge; the pendulum
Fig. 4. The time-varying difficulty of the guidance task. Plot (a) shows the amount of drag force acting on the finless rocket as it ascends through the atmosphere. More drag means that it takes more differential thrust to control the rocket. Plot (b) shows the position of the center of pressure in terms of how many feet ahead it is of the center of gravity. From 0 to about 22,000ft the control task becomes more difficult due to the rapid increase in drag and the movement of the CG away from the nose of the rocket. At approximately 22,000ft, drag peaks and begins a decline as the air gets thinner, and the CP starts a steady migration towards the CG. As it ascends further, the rocket becomes progressively easier to control as the density of the atmosphere decreases.
will return to this stable equilibrium point if it is disturbed. When the rocket does not have fins, as in figure 3b, the CP is ahead of the CG, causing the rocket to be unstable. A non-zero α or β will tend to grow, eventually causing the rocket to tumble. This situation corresponds to a pendulum at its unstable upright equilibrium point, where any disturbance will cause it to diverge away from this state. Although the rocket domain is similar to the inverted pendulum, the rocket guidance problem is significantly more difficult for two reasons: the interactions between the rocket and the atmosphere are highly non-linear and complex, and the rocket's behavior continuously changes throughout the course of a flight due to system variables that are either not under control or not directly observable (e.g., air density, fuel load, drag). Figure 4 shows how the difficulty of stabilization varies over the course of a successful flight for the finless rocket. In figure 4a, drag is plotted against altitude. From 0ft to about 22,000ft, the rocket approaches the sound barrier (Mach 1) and drag rises sharply. This drag increases the torque exerted on the rocket in the yaw and pitch axes for a given α and β, making it more difficult to control its attitude. In figure 4b, we see that during this period the distance between the CG and CP also increases, because the consumption of fuel causes the CG to move back, making the rocket increasingly unstable. After 22,000ft, drag starts to decrease as the air becomes less dense, and the CP steadily migrates back towards the CG, so that the rocket becomes easier to control.
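The torque argument above can be captured in a toy calculation; positions are measured from the nose, the aerodynamics are collapsed to a single force acting at the CP, and everything else about the real flight dynamics is ignored.

```python
import math

# Toy model of the pitch/yaw torque described above: the aerodynamic force
# applied at the CP acts on the rocket through the CP-CG moment arm.
# Positions are distances from the nose, so cp > cg means the CP is behind
# the CG (the finned, stable case).

def aero_torque(force, alpha_deg, cp, cg):
    """Torque about the CG; negative values push alpha back toward zero."""
    arm = cp - cg  # > 0: CP behind CG (stable); < 0: CP ahead (unstable)
    return -force * math.sin(math.radians(alpha_deg)) * arm
```

With fins (cp > cg) a positive α yields a restoring torque; without fins (cp < cg) the same α yields a torque of the opposite sign, which is the tumbling tendency the guidance system must cancel.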
For ESP, this means that the fitness function automatically scales the difficulty of the task in response to the performance of the population. At the beginning of evolution the task is relatively easy. As the population improves and individuals are able to control the rocket to higher altitudes, the task becomes progressively harder. Although figure 4 indicates that above 22,000ft the task again becomes easier, progress in evolution continues to be difficult because the controller is constantly entering an unfamiliar part of the state space.

Fig. 5. RSX-2 Rocket Simulator. The picture shows a 3D visualization of the JSBSim rocket simulation used to evolve the RSX-2 guidance controllers. The simulator provides a realistic environment for designing and verifying aircraft dynamics and guidance systems.

A fitness function that gradually increases in difficulty is usually desirable because it allows for sufficient selective pressure at the beginning of evolution to direct the search into a favorable region of the solution space. However, the rocket control task is already too hard in the beginning: all members of the initial population perform so poorly that evolution stalls and converges to a local maximum. In other words, direct evolution does not even get started on this very challenging task. One way to overcome this problem is to make the initial task easier and then increase the difficulty as the performance of the population improves. In the experiments below we employ such an incremental evolution approach [5]. Instead of trying to evolve a controller for the finless rocket directly, a more stable version of the rocket is used initially to make the ultimate task more accessible. The following section describes the simulation environment used to evolve the controller, the details of how a guidance controller interacts with the simulator, and the experimental setup for evolving a neural network controller for this task.
4 Rocket Control Experiments

4.1 The RSX-2 Rocket Simulator
As an evolution environment we used the JSBSim Flight Dynamics Model² adapted for the RSX-2 rocket by Eric Gullichsen of Interorbital Systems. JSBSim is an open source, object-oriented flight dynamics simulator with the ability to specify a flight control system of any complexity. JSBSim provides a realistic simulation of the complex dynamic interaction between the airframe, propulsion system, fuel tanks, atmosphere, and flight controls. The aerodynamic forces and moments on the rocket were calculated using a detailed geometric model of the RSX-2. Four versions of the rocket with different fin configurations were used: full fins, half fins (smaller fins), quarter fins (smaller still), and no fins, i.e., the actual finless rocket. This set of rockets allowed us to observe the behavior of the RSX-2 at different levels of instability, and provided a sequence of increasingly difficult tasks with which to evolve incrementally. All simulations used Adams-Bashforth 4th-order integration with a time step of 0.0025 seconds.

² More information about the free JSBSim software package is available at: http://jsbsim.sourceforge.net/

4.2 Neural Guidance Control Architecture
The rocket controller is represented by a feedforward neural network with one hidden layer (figure 6). Every 0.05 seconds (i.e., the control time-step) the controller receives a vector of readings from the rocket's onboard sensors that provide information about the current orientation (pitch, yaw, roll), rate of orientation change, angle of attack α, sideslip angle β, the current throttle position of the four thrusters, altitude, and velocity in the direction of flight. This input vector is propagated through the sigmoidal hidden and output units of the network to produce a new throttle position for each engine, determined by:

    u_i = 1.0 − o_i/δ,  i = 1..4    (2)
Fig. 6. Neural Network Guidance. The control network receives the state of the rocket every time step through its input layer. The input consists of the rocket’s orientation, the rate of change in orientation, α, β, the current throttle position of each engine, the altitude, and the velocity of the rocket in the direction of flight. These values are propagated through the network to produce a new throttle command (the amount of thrust) for each engine.
where u_i is the throttle position of thruster i, o_i is the value of network output unit i, 0 ≤ u_i, o_i ≤ 1, and δ ≥ 1.0. A value of ω for u_i means that the controller wants thruster i to generate ω × 100% of maximum thrust. The parameter δ controls how far the controller is permitted to "throttle back" an engine from 100% thrust.
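Equation 2 is a one-liner in code; with δ = 10, the value used in the experiments below, network outputs in [0, 1] map to throttle settings in [0.9, 1.0].

```python
# The output-to-throttle mapping of equation 2. With delta = 10 (the value
# used in the experiments), outputs in [0, 1] become throttles in [0.9, 1.0].

def throttle_commands(outputs, delta=10.0):
    return [1.0 - o / delta for o in outputs]
```

For example, throttle_commands([0.0, 1.0, 0.5, 1.0]) keeps engine 1 at full thrust while cutting engines 2 and 4 by the maximum allowed 10%.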
4.3 Experimental Setup
The objective of the experiments was to determine whether ESP could evolve a controller to stabilize the finless version of the RSX-2 rocket. To do so, the neural guidance system has to control the thrust of each of the four engines in order to maintain the rocket's angle of attack α and sideslip angle β within ±5 degrees from the time of ignition to burnout, when all of the fuel has been expended. Exceeding the ±5 degree boundary indicates that the rocket is about to tumble and is therefore considered a catastrophic failure. Each ESP network was evaluated in a single trial that consisted of the following four phases:

1. At time t0, the rocket is attached to a launch rail that will guide it on a straight path for the first 50 feet of flight. The fuel tanks are full and the engines are ignited.

2. At time t1 > t0, the rocket begins its ascent as the engines are powered to full thrust.

3. At time t2 > t1, the rocket leaves the launch rail and the controller begins to modulate the thrust as described in section 4.2, according to equation 2.

4. While controlling the rocket, one of two events occurs at time tf > t2: (a) α or β exceeds ±5 degrees, in which case the controller has failed; or (b) the rocket reaches burnout, in which case the controller has succeeded.

In either case, the trial is over and the altitude of the rocket at tf becomes the fitness score for the network. In a real launch, the rocket would continue after burnout and coast to apogee. Since we are only concerned with the control phase, for efficiency the trials were limited to reaching burnout. This fitness measure is all that is needed to encourage evolutionary progress. However, there is a large locally maximal region in the network weight space corresponding to the policy o_i = 1.0, i = 1..4: the policy of keeping all four engines at full throttle. Since it is very easy to randomly generate networks that saturate their outputs, this policy will be present in the first generations.
Such a policy clearly does not solve the task, but because the rocket is so unstable, no better policy is likely to be present in the initial population. It would therefore quickly dominate the population and halt progress toward a solution. To avoid this problem, all controllers that exhibited this policy were penalized by setting their fitness to zero; this ensured that a controller was not rewarded for doing nothing. The simulations used 10 subpopulations of 200 neurons (i.e., the networks were composed of 10 hidden units), and δ was set to 10, so the network could only control the thrust of each engine in the range between 90% and 100%. Early testing determined that this range produced sufficient differential thrust to counteract side forces and solve the task. As discussed in section 3, evolving a controller directly for the finless rocket was too difficult, so an incremental evolution method was used instead: we first evolved a controller for the quarter-finned rocket, and once a solution to this easier task was found, evolution was transitioned to the more difficult finless rocket.
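The trial evaluation with the full-throttle penalty can be sketched as follows. This is an illustrative sketch only: `simulate_flight` and the throttle-history format are assumed stand-ins for the actual simulator interface, not the authors' code.

```python
def evaluate(network, simulate_flight, full_throttle_eps=1e-3):
    """Score one controller: fitness is the altitude reached when the trial
    ends (burnout, or alpha/beta exceeding the +/-5 degree boundary).

    simulate_flight(network) is a hypothetical hook that runs the four-phase
    trial and returns (final_altitude, throttle_history), where
    throttle_history is a sequence of per-step [o1, o2, o3, o4] outputs.
    """
    altitude, throttle_history = simulate_flight(network)
    # Penalize the locally maximal policy o_i = 1.0 for all four engines:
    # a controller that keeps every engine at full throttle gets zero fitness,
    # so it is never rewarded for doing nothing.
    if all(abs(o - 1.0) < full_throttle_eps
           for step in throttle_history for o in step):
        return 0.0
    return altitude
```

The penalty check inspects the whole trajectory, so a controller that modulates thrust even slightly keeps its altitude-based fitness.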
Active Guidance for a Finless Rocket Using Neuroevolution
Fig. 7. Comparison of burnout altitudes for different fin-size rockets with and without guidance. The crosses indicate the altitude at which a particular rocket becomes unstable (i.e. either α or β > ±5 degrees). The circles indicate the altitude of a successful rocket that remained stable all the way to burnout. The guided quarter-finned and finless rockets fly significantly higher than the unguided full-finned rocket.
5 Results
ESP solved the task of controlling the quarter-finned rocket in approximately 600,000 evaluations. Another 50,000 evaluations were required to successfully transition to the finless rocket. Figure 7 compares the altitudes that the various rockets reach in simulation. Without guidance, the full-finned rocket reaches burnout at approximately 70,000 ft, whereas the finless, quarter-finned, and half-finned rockets all fail before reaching burnout. However, with neural network guidance the quarter-finned and finless rockets do reach burnout and exceed the full-finned rocket's altitude by 10,000 ft and 15,000 ft, respectively. After burnout, the rocket will begin to coast at a higher velocity in a less dense part of the atmosphere; the higher burnout altitude for the finless rocket translates into an apogee that is about 20 miles higher than that of the finned rocket. Figure 8a shows the behavior of the four engines during a guided flight of the finless rocket. The controller makes smooth changes to the thrust of the engines throughout the flight. This very fine control is required because any abrupt change in thrust at speeds of up to Mach 4 can quickly cause failure. Figure 8b shows α and β for the various rockets with and without guidance. Without guidance, the quarter-finned and even the half-finned rocket start to tumble as soon as α or β starts to diverge from 0 degrees. With guidance, both the quarter-finned and finless rockets keep α and β at very small values up to burnout. Note that although the finless controller was produced by further evolving the quarter-finned controller, it not only solves a more difficult task, but does so with better performance.
Fig. 8. Controller performance for the finless rocket. Plot (a) shows the policy implemented by the controller. Each curve corresponds to the percent thrust of one of the four rocket engines over the course of a successful flight. After some initial oscillation the control becomes very fine, changing less than 2% of total thrust for any given engine. Plot (b) compares the values of α and β for various rocket configurations and illustrates how well the neural guidance system is able to minimize α and β. The unguided quarter-finned and half-finned rockets maintain low α and β for a while, but as soon as either starts to grow, the rocket tumbles. In contrast, the guidance systems for the quarter-finned and finless rockets are able to contain α and β all the way up to burnout. The finless controller, although evolved from the quarter-finned controller, performs even better.
6 Discussion and Future Work
The rocket control task is representative of many real-world problems, such as manufacturing, financial prediction, and robotics, that are characterized by complex non-linear interactions between system components. The critical advantage of using ESP over traditional engineering approaches is that it can produce a controller for these systems without requiring formal knowledge of system behavior or prior knowledge of correct control behavior. The experiments in this paper demonstrate this ability by solving a high-precision, high-dimensional, non-linear control task. Of equal importance is the result that the differential thrust approach for the finless version of the current RSX-2 rocket is feasible. Also, having a controller that can complete the sounding rocket mission in simulation has provided valuable information about the behavior of the rocket that would not otherwise be available. In the future, we plan to make the task more realistic in two ways: (1) the controller will no longer receive α and β values as input, and (2) instead of generating a continuous control signal, the network will output a binary vector indicating whether or not each engine should "throttle back" to a preset low throttle position. This scheme will greatly simplify the control hardware on the real rocket. Once a controller for this new task is evolved, work will focus on varying environmental parameters and incorporating noise and wind so that the controller will be forced to utilize a robust and general strategy to stabilize
the rocket. This phase will be critical to our ultimate goal of transferring the controller to the RSX-2 and testing it in an actual rocket launch.
7 Conclusion
In this paper, we propose a method for learning an active guidance system for a dynamically unstable, finless rocket by using the Enforced Subpopulation neuroevolution algorithm. Our experiments show that the evolved guidance system is able to stabilize the finless rocket and greatly improve its final altitude compared to the full-finned, stable version of the rocket. The results suggest that neuroevolution is a promising approach for difficult nonlinear control tasks in the real world.

Acknowledgments. This research was supported in part by the National Science Foundation under grant IIS-0083776, and the Texas Higher Education Coordinating Board under grant ARP-003658-476-2001. The authors would like to give special thanks to Eric Gullichsen of Interorbital Systems ([email protected]) for his collaboration and invaluable assistance in this project.
Simultaneous Assembly Planning and Assembly System Design Using Multi-objective Genetic Algorithms

Karim Hamza 1, Juan F. Reyes-Luna 1, and Kazuhiro Saitou 2 *

1 Graduate Student, 2 Assistant Professor, Mechanical Engineering Department, University of Michigan, Ann Arbor, MI 48109-2102, USA
{khamza,juanfr,kazu}@umich.edu
Abstract. This paper demonstrates the application of multi-objective evolutionary optimization, namely an adaptation of NSGA-II, to simultaneously optimize the assembly sequence plan and the selection of the type and number of assembly stations for a production shop that produces three different models of wind-propelled ventilators. The decision variables, which are the assembly sequence of each product and the machine selection at each assembly station, are encoded in a manner that allows efficient implementation of a repair operator to maintain the feasibility of the offspring. Test runs are conducted for the sample assembly system using a crossover operator tailored for the proposed encoding as well as some conventional crossover schemes. The results show overall good performance for all schemes, with the best performance achieved by the tailored crossover, which illustrates the applicability of multi-objective GA's. The presented framework is generic enough to be applicable to other products and assembly systems.
1 Introduction

The optimization of product assembly processes is a key issue for the efficiency of a manufacturing system, and involves several different types of decisions, such as selecting the assembly sequences of products, assigning tasks to the assembly stations, and selecting the number and type of machines at each assembly station. Research on assembly sequence planning originated in two pioneering works in the late eighties: De Fazio and Whitney [1] and de Mello et al. [2] independently presented graph-based models of assemblies and algorithms for enumerating all feasible assembly sequences. Since then, a large body of work has been conducted on assembly sequence planning 1. However, little attention has been paid to the integration of assembly sequence planning and assembly system design. Assembly system design is a complex problem that may have several objectives, such as minimizing overall cost, meeting production demand, and increasing productivity, reliability, and/or product quality. Assembly sequence planning is a precedence-constrained scheduling problem, which is known to be NP-complete [3]. Furthermore, allocating machines to assembly tasks is resource-constrained scheduling, which is also known to be NP-complete [3]. In real-life workshops, there is a need to consider assembly sequence and machine allocation simultaneously, a compounded difficulty that puts such problems beyond the reach of full enumeration. Thus, assembly system design provides rich opportunities for heuristic approaches such as multi-objective GA's. For instance, process planning using GA's was considered in the late eighties [4]. More recently, Awdah et al. [5] proposed a computer-aided process-planning model based on GA's, Kaeschel et al. [6] applied an evolutionary algorithm to shop floor scheduling, and multi-objective evolutionary optimization of flexible manufacturing systems was considered by Chen and Ho [7]. Saitou et al. [8] applied GA's to robust optimization of multi-product manufacturing systems subject to production plan variations. This paper presents the application of multi-objective GA's to simultaneously optimize the assembly sequence and the selection of the type and number of assembly stations, based on data extracted from a real assembly shop that assembles specially designed wind-propelled ventilators (Fig. 1). The remainder of the paper first describes the problem formulation, a special encoding and crossover schemes, and the results of simulation runs with the proposed crossover scheme as well as arithmetic projection, multipoint, and uniform crossovers. Finally, discussion and future extensions are provided.

* Corresponding Author
1 A comprehensive bibliography of the area is found at www.cs.albany.edu/~amit/bib/ASSPLAN.html

E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 2096–2108, 2003. © Springer-Verlag Berlin Heidelberg 2003
2 Wind Propelled Ventilators

The family of products considered is the three models of wind-propelled ventilators. Shown in Fig. 1 is a photo of model A, the basic model used in ventilating industrial or storage hangars in dry regions. The basic idea of operation is that the ventilators are placed atop ventilating ducts in the ceiling of the hangars. When the lateral wind blows across the hangar, it spins the spherically shaped blades, which in turn perform a sucking action that draws air from the top of the hangar. Model B has the same exoskeleton and blades as model A, but has a different internal shaft as well as an additional rotor that improves the air suction out of the ventilation duct. Model C is the same as model B, except that its outer blades have an improved shape design. The three models share several components, are assembled in the same assembly shop, and may use the same assembly stations. Table 1 lists all components of the three models and the models in which they are used. The problem of designing an assembly system that can assemble the three types of ventilators, models A, B, and C, is formulated as a multi-objective optimization problem with two objective functions to be minimized: assembly cost f1 and production shortage within a given production period f2. These are functions of three categories of decision variables: the assembly sequence of each product, the type of machines at each assembly station, and the number of machines of each type. The model of the assembly system used in this study is a simple one, which computes the production cost f1 by summing the startup cost and hourly operation rate of the assembly stations. Production volume is estimated according to average cycle times for the numbers and types of machines assigned to each assembly task:

Minimize
f1 = Σ_Machines (Investment Cost) + Σ_Machines (Processing Time × Hourly Rate)    (1)

f2 = Σ_Products max(0, Target Production − Production Volume)    (2)

Decision variables: assembly sequence of each product, type of assembly stations, and number of assembly stations of every used type.
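Read as code, the two objectives of Eqs. (1) and (2) amount to the following (a hedged sketch; the dictionary field names are illustrative, not the authors' data model):

```python
def assembly_cost(machines):
    """Eq. (1): investment cost plus processing time x hourly rate, summed
    over the machines of a candidate design (field names are hypothetical)."""
    return (sum(m["investment"] for m in machines)
            + sum(m["processing_time"] * m["hourly_rate"] for m in machines))

def production_shortage(products):
    """Eq. (2): shortfall of production volume vs. target, summed over
    products; overproduction contributes nothing thanks to the max(0, .)."""
    return sum(max(0.0, p["target"] - p["volume"]) for p in products)
```

Both objectives are to be minimized, so a design that meets every production target has f2 = 0 and competes on cost alone.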
Fig. 1. A wind-propelled ventilator – Model A
While the simulations provided in this paper are based on real assembly shop data from a company, the actual production volumes and assembly station rates are not revealed due to the proprietary nature of the information.
Table 1. Listing of components in the three models of the wind-propelled ventilator.

Label  Description                     Used in model
                                       A    B    C
1      Duct and Main Body              x    x    x
2      Lower Bracket – Type #1         x
3      Upper Bracket                   x    x    x
4      Axle                            x
5      Lower Ball Bearing – Type #1    x
6      Upper Ball Bearing              x    x    x
7      Lower Hub – Type #1             x
8      Upper Hub – Type #1             x
9      Blades – Type #1                x    x
10     Lower Bracket – Type #2              x    x
11     Shaft                                x    x
12     Lower Ball Bearing – Type #2         x    x
13     Lower Hub – Type #2                  x    x
14     Upper Hub – Type #2                  x    x
15     Inner Rotor                          x    x
16     Blades – Type #2                          x
3 Problem Formulation for the Multi-objective GA

A software system, ASMGA (ASsembly using Multi-objective GA), was implemented; its overall structure is illustrated in Fig. 2. It is capable of communicating with user-provided models of assembly systems involving multiple products.
[Fig. 2 flowchart: the MOGA optimizer generates an initial population of candidate designs, evaluates objectives and constraints through the user-supplied model of the assembly system, applies selection, crossover, mutation, and repair, loops until termination, and returns the best designs (the un-dominated Pareto front).]
Fig. 2. Overall structure of the ASMGA (ASsembly using Multi-objective GA).
3.1 Encoding Scheme of Chromosomes

The following three decision variable sets are considered:

• Assembly sequence of every product;
• Choice of type of assembly station for every assembly operation;
• Number of assembly stations of every type.
For every product consisting of N components, there are 2(N – 1) variables that control the assembly sequence, (N – 1) variables controlling the type of assembly stations and (N – 1) variables to specify the number of each type of assembly stations in the platform. In the rest of the paper, a set of values for the decision variables is sometimes referred to as a candidate design of the assembly system. The above decision variables can be efficiently encoded in a fixed-length string of integers in a manner that allows efficient implementation of a repair operator to maintain the feasibility of the offspring for assembly precedence constraints. The layout of the integer string chromosome is shown in Fig. 3. The basic building block of the chromosome is a set of four integer numbers that define one assembly operation. Every product is assembled through a set of assembly operations. Performing
assembly operations to obtain an assembled product out of its individual components may be visualized by thinking of a hypothetical bin, which initially contains all the separate components of the product. In every assembly operation, one reaches into the bin, grabs two items, assembles them together (provided they can be assembled together), and puts the result back into the bin. It can be shown that a product consisting of N components is obtained, in this manner, in exactly N − 1 assembly operations. These N − 1 assembly operations are laid out on a linear chromosome. The four integers of an assembly operation translate as follows: the first number refers to one of the components (or sub-assemblies) inside the bin of the product; the second number refers to the component in the bin that will be joined to the one selected by the first number; the third number refers to the assembly station chosen to perform the operation; and the fourth number refers to the number of such assembly stations that should be present in the assembly shop. For example, chromosome c = 11421021 for the simple product shown in Fig. 4 would be translated as follows:

• In the first assembly operation (first four numbers), the component that has order 1 in the bin (i.e., the component labeled 2, because ordering starts at zero) is selected. The component labeled 2 is assembled to the second component that can be coupled to it (i.e., the component labeled 3) to produce a subassembly (2,3), using the fifth type of assembly station that can perform the task; there are two such stations in the assembly plant.
• In the second assembly operation, according to the new order of the components and subassemblies in the bin, the subassembly (2,3) is selected. The subassembly (2,3) is assembled to the first (and only) component that can be coupled to it (i.e., the component labeled 1) to produce the full product (1,2,3), using the second type of assembly station that can perform the task; there is one such station in the assembly plant.
Fig. 3. Layout of chromosomes.
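The bin-based decoding described above can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the `FEASIBLE` set stands in for the pre-processor's list of feasible couplings for the simple product of Fig. 4 (box = 1, core = 2, cover = 3), and index wrap-around via `%` stands in for the repair operator that keeps gene values meaningful.

```python
# Hypothetical feasibility data for the 3-component product of Fig. 4:
# which pairs of items (components or subassemblies) can be joined.
FEASIBLE = {
    frozenset([(2,), (3,)]),    # core + cover
    frozenset([(1,), (2, 3)]),  # box + (core, cover) subassembly
    frozenset([(1,), (2,)]),    # box + core (alternative first step)
}

def can_join(a, b):
    return frozenset([a, b]) in FEASIBLE

def decode(chromosome, components):
    """Decode a flat integer chromosome (4 genes per assembly operation)
    by simulating the hypothetical bin of components/subassemblies."""
    bin_ = [(c,) for c in components]
    plan = []
    genes = iter(chromosome)
    for first, second, station, count in zip(genes, genes, genes, genes):
        a = bin_[first % len(bin_)]                  # item picked by bin order
        partners = [x for x in bin_ if x != a and can_join(a, x)]
        b = partners[second % len(partners)]         # k-th couplable partner
        bin_.remove(a)
        bin_.remove(b)
        sub = tuple(sorted(a + b))                   # new subassembly -> bin
        bin_.append(sub)
        plan.append((a, b, station, count))
    return plan

plan = decode([1, 1, 4, 2, 1, 0, 2, 1], [1, 2, 3])
```

Running the sketch on c = 11421021 reproduces the walk-through above: first (2) joins (3), then (2,3) joins (1) to complete the product.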
Fig. 4. A simple product (box: 1, core: 2, cover: 3).
3.2 Assembly Constraints and Repair Operator

Due to geometrical and other quality requirements, assembly precedence relationships exist as constraints upon the assembly sequence in almost every real product. It therefore makes sense to define all the assembly constraints during a pre-processing stage and then make (or attempt to make) all candidate designs of the assembly system conform to those constraints throughout the optimization. The ASMGA software includes a pre-processor that guides the user to define all feasible assembly operations and automatically generates the set of all sub-assemblies that may be encountered. Chromosome reproduction through crossover and mutation may result in assembly sequences that violate the precedence relationships. One way of enforcing the feasibility of the assembly sequences is to penalize the objective functions of infeasible candidate designs. However, such penalizing means relying on selection according to fitness while allowing the integer numbers on the chromosome to take any value within their ranges, which would lead to an impossibly large search space that is probably beyond the capabilities of GA or any other optimization technique. Instead, after crossover and mutation, a repair operator is invoked that systematically increments the integer numbers on the chromosome so that the assembly sequence remains feasible. Such incrementing is performed for every four-integer assembly operation until it represents one of the operations in the set of feasible assembly operations, i.e., one that can be invoked to join two of the components or sub-assemblies in the hypothetical bin of the product.

3.3 Crossover

In the following, four crossover schemes are tested: Arithmetic Projection, Multi-Point, Uniform, and a special crossover operator tailored for the proposed encoding scheme.
Arithmetic Projection Crossover. This is a blend of arithmetic crossover [9] and heuristic crossover [9]. Both schemes are mainly applied in real-coded GA's; they are tested in this paper, although the chromosome is integer-coded, because some portion of the chromosome has underlying continuity (the number of assembly stations). When viewing the chromosome in a multi-dimensional vector space, such crossover schemes set the gene values of the offspring chromosome to some point along the line joining the two parent points. In arithmetic crossover, the offspring point lies between the parents, while in projection crossover the point lies outside the parents, projected in the direction of the better (more fit or higher-rank) parent. In the combined arithmetic-projection version, the gene values of the offspring are given by:

g_C = g_P1 + r (g_P2 − g_P1)    (3)

where g_C is the offspring gene value, g_P1 is the gene value of the less fit (or lower-rank) parent, g_P2 is the gene value of the more fit parent, and r is a uniformly distributed random number within a certain range. If the range of r is between zero and 1.0, Eq. 3 represents arithmetic crossover, while if the range is between 1.0 and some number bigger than 1.0, Eq. 3 represents heuristic crossover. In this paper, r is uniformly distributed between zero and 2.5, so it represents a blend of both arithmetic and heuristic crossover.

Multipoint Crossover. Similar to single-point crossover, but the exchange of chromosome segments between parents to produce the offspring occurs at several points [4]. It is believed that multipoint crossover works better for long chromosomes.

Uniform Crossover. Can be viewed as an extreme case of multipoint crossover, where every building block can be exchanged between the parent chromosomes with a specified probability.

Special Crossover Scheme. A special crossover operator is designed based on an understanding of the encoding scheme of the chromosomes. The basic idea is that the meaning of the numbers of a four-integer building block depends on the state of the components and subassemblies present in the hypothetical bin of the product. Thus, if crossover changes some building block on a chromosome, it subsequently changes the meaning of all the building blocks that follow for that product. The special crossover operator is designed as follows:
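Eq. (3) with r drawn from [0, 2.5] can be sketched as follows (a minimal illustration; the rounding back to integers is an assumption made here because the chromosome is integer-coded):

```python
import random

def arithmetic_projection_crossover(parent_lo, parent_hi, r_max=2.5):
    """Eq. (3): child gene g_C = g_P1 + r * (g_P2 - g_P1), where g_P1 comes
    from the less fit parent and g_P2 from the fitter one. r in [0, 1) blends
    the parents (arithmetic crossover); r > 1 projects beyond the fitter
    parent (heuristic crossover)."""
    r = random.uniform(0.0, r_max)
    # One r is drawn per mating, moving the whole child along the parent line.
    return [round(g1 + r * (g2 - g1)) for g1, g2 in zip(parent_lo, parent_hi)]
```

With r_max = 2.5, roughly 40% of offspring are interpolations and 60% are projections past the fitter parent, which gives the scheme its gradient-following flavor.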
• One single-point crossover is used for every product on the chromosome. This is slightly different from multipoint crossover, as one crossover occurs in every product, while in multipoint crossover one product can have several crossover points while another has none.

• Crossover can occur only at intervals in between the building blocks (which are of length four, each defining an assembly operation).

• In order to grow good schemata in the population of chromosomes, the crossover location in the initial population has a higher probability of occurring nearer to the start of the string; as the search progresses, the crossover-location probability centroid is shifted gradually towards the end of the string.
It is worth mentioning that, within one product, single-point crossover is chosen rather than two-point or multi-point crossover because of the heavy backward dependency in the chromosome introduced by the encoding scheme. This backward dependency makes a second crossover point within the same product no more than a random mutation for the remaining part of the chromosome.

3.4 Mutation

When arithmetic projection, multipoint, or uniform crossover is used, mutation of the building blocks occurs in a classical fashion: according to a user-specified probability, numbers on the chromosome are randomly changed to any value within their allowed range with uniform probability [9]. However, when the special crossover is employed, a corresponding special mutation is used. In the special mutation, separate probabilities are assigned to each of the four numbers of the building block. Typically, the first number has the lowest mutation probability, and the following numbers have progressively higher mutation probabilities. The intuition behind this is that the meaning of the second number of the building block depends on the first number, so changing the first subsequently changes the second. Also, the first two numbers of the building block define the assembly sequence, so changing either of them subsequently changes the meaning of the whole chromosome portion that follows the location of the mutation. It is noted that the special mutation scheme could also be applied to the classical forms of crossover considered. However, the authors preferred keeping the study of specialized operators (crossover and mutation) separate from classical ones in order to highlight the benefit (if any) gained by introducing prior knowledge of the problem structure into the GA operators.
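The position-dependent mutation can be sketched as follows (an illustration only; the per-position probabilities and gene range are hypothetical values, not the paper's settings):

```python
import random

def special_mutation(chromosome, probs=(0.005, 0.01, 0.02, 0.04), max_val=20):
    """Mutation matched to the four-integer building blocks: earlier positions
    in a block (component selection) mutate more rarely than later ones
    (station type / station count), since changing an early gene re-interprets
    every gene that follows it in the same product."""
    out = list(chromosome)
    for i, gene in enumerate(out):
        if random.random() < probs[i % 4]:   # position within the 4-gene block
            out[i] = random.randrange(max_val)
    return out
```

Because the repair operator runs after mutation, any gene value produced here is later nudged back onto a feasible assembly operation.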
3.5 Multi-objective GA

The implemented multi-objective GA is similar to NSGA-II [10, 11] with a few modifications in enforcing elitism: the un-dominated members are copied into a separate elite population that is preserved across generations unless it grows too big. General highlights of NSGA-II that differ from a single-objective GA include: i) selection is based on ranking through Pareto dominance, and ii) use of a niching function in selecting among members that have the same rank. The niching function gives a higher probability of selection to members that have no others
near to them and thereby enhances the spread of the population over the Pareto front. The pseudo-code for the implemented multi-objective GA is given as:

1. Generate Initial Population.
2. Loop Until Termination (Max. Number of Generations or Function Evaluations):
3. Perform Pareto-Ranking of Current Population.
4. Add Un-dominated Members of Current Population into Elite Population.
5. Re-Rank Elite Population and Kill any members that become dominated.
6. If Elite Population grows beyond an allowed size, select a number of members equal to the allowed elite population size according to a Niching function and kill the rest of the members.
7. Loop until New Population is full:
8. Select Parents for reproduction:
   a. Perform a binary tournament to select a rank.
   b. Select a parent member from within the chosen rank according to a Niching function.
9. Perform Crossover, Mutation, and Repair, then place new offspring in New Population.
10. When New Population is full, replace Current Population with the new one and repeat at Step #2.
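The Pareto-ranking step of the pseudo-code (Steps 3–5) can be sketched as follows. This is a simple O(n²)-per-front illustration of rank assignment for minimization problems, not NSGA-II's fast non-dominated sort:

```python
def dominates(a, b):
    """Minimization: a dominates b if it is no worse in every objective
    and strictly better in at least one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

def pareto_rank(points):
    """Rank 0 = un-dominated front; rank 1 = the front remaining after
    removing rank 0; and so on."""
    ranks = {}
    remaining = set(range(len(points)))
    rank = 0
    while remaining:
        front = {i for i in remaining
                 if not any(dominates(points[j], points[i])
                            for j in remaining if j != i)}
        for i in front:
            ranks[i] = rank
        remaining -= front
        rank += 1
    return [ranks[i] for i in range(len(points))]
```

For example, among the (f1, f2) points (1,1), (2,2), (0,3), and (3,0), only (2,2) is dominated, so it alone receives rank 1.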
4 Results and Discussion

The feasible assembly operations satisfying the assembly precedence relations for each of the ventilator models are shown in a tree representation in Fig. 5. A list of assembly stations is provided in Table 2. Ten multi-objective GA runs are performed for each of the crossover schemes, using a population size of 150, for 100 generations, with a limit on the elite population size of 30 individuals. Crossover and mutation follow the general recommendation [4] of a high crossover probability (about 0.9) and a low mutation rate (about 2%). Figure 6 shows typical populations at the beginning and end of the search and their corresponding un-dominated elite fronts. Since the objective is to minimize both assembly cost f1 and production shortage f2, the closer a candidate design gets to the lower left corner of the plot, the better the design is. The difference between the initial and final populations is quite apparent: the initial population is scattered over a wide range of the f1–f2 space, while the final population has its un-dominated front closer to the lower left corner of the plot, with the whole population gathered close to that front. As such, the plot implies that convergence is achieved and that the chosen number of generations is sufficient. Of particular interest is the extreme point along the Pareto front which has zero production volume shortage (i.e., f2 = 0), meaning that it has minimal production cost while meeting the required production volume during the given production period. This point on the Pareto front is referred to as the best point. In the best-known solution of the problem, the best point has the assembly sequences shown using thicker lines in
Fig. 5. The best-known solution also utilizes the types and numbers of assembly stations shown in the last column of Table 2. It is observed that, whenever possible, the best solution avoids employing assembly stations that incur extra cost for an operation. The history of the average value of f1 for the best point during the search is displayed in Fig. 7. The percentage success in attaining the best-known solution is given in Table 3. It is seen in Fig. 7 that, given enough generations, all the considered crossover schemes succeed in bringing the objective function down to the near-optimal region. Arithmetic-Projection crossover has the fastest early descent because of its embedded gradient-following heuristic, but the lowest probability of success in attaining the best-known solution, which is probably due to premature convergence. Multi-point crossover has the slowest convergence rate but a good chance of reaching the best-known solution. The proposed crossover scheme, which is tailored to the structure of the encoded chromosome, has a moderate convergence rate but a superior chance of reaching the best-known solution. A possible reason is that the employed encoding scheme introduces heavy backward dependency among the chromosome genes, which presents difficulty to traditional crossover schemes, whereas the special crossover is better geared to operate with such an encoding. Overall, all the considered crossover schemes performed well at getting close to the best-known solution, which implies the effectiveness of multi-objective GA's, and in particular the adapted version of NSGA-II, for similar problems.

Table 2. List of assembly stations.
Operation Frame to L. Bracket Frame to U. Bracket L. Brckt to Axle 1 L. Brckt to Axle 2 U. Brckt to Axle Axle to L. Brg. Axle to U. Brg. 1 Axle to U. Brg. 2 L. Brg. to L. Hub U. Brg. to U. Hub
Inv. Cost 1000
Hr. Rate 5.0
# in Opt. 1
ID
Operation
11 Blades
3000
10.0
1
12
2000 3000 4000 1000 1000 4000 1000 1000
8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0
1 NA 1 1 1 NA 1 1
13 14 15 16 17 18 19
L. Brckt 2 to L. Brg. 2 U. Brckt to U. Brg. L. Brg. 2 to Shaft U. Brg. to Shaft Inner Rotor 1 Inner Rotor 2 Shaft to L. Hub Shaft to U. Hub
Inv. Cost 4000
Hr. Rate 10.0
# in Opt. 2
2000
8.0
1
2000 4000 2000 3000 5000 3000 3000
8.0 10.0 8.0 8.0 10.0 8.0 8.0
1 1 1 2 NA 1 1
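The way the un-dominated (Rank-1) front and the best point are extracted from a population can be sketched as follows. This is an illustrative reconstruction, not the authors' code, and the sample objective values are invented:

```python
# Illustrative sketch: extract the un-dominated front from a population
# scored on assembly cost f1 and production shortage f2 (both minimized),
# then pick the "best point" (f2 == 0 with minimal f1).

def undominated_front(population):
    """Return the points not dominated by any other point."""
    front = []
    for (f1, f2) in population:
        dominated = any(
            (g1 <= f1 and g2 <= f2) and (g1 < f1 or g2 < f2)
            for (g1, g2) in population
        )
        if not dominated:
            front.append((f1, f2))
    return front

def best_point(front):
    """Point with zero production shortage and minimal assembly cost."""
    feasible = [p for p in front if p[1] == 0]
    return min(feasible, key=lambda p: p[0]) if feasible else None

# Invented (f1, f2) values, purely for demonstration.
pop = [(120000, 0), (150000, 0), (110000, 40), (130000, 10)]
front = undominated_front(pop)
print(sorted(front))
print(best_point(front))
```

With these toy values, (150000, 0) and (130000, 10) are dominated, and the best point is the cheapest design with zero shortage.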
5 Conclusion

This paper presented the application of an adapted version of NSGA-II to the multi-objective optimization of a real multi-product assembly shop. Although real data were used, the model of the assembly system is rather simple. A further extension of this study would test the same approach on more sophisticated assembly models through linking to random discrete-time event simulation software such as ARENA [12]. Also presented in this paper was a special encoding scheme for the decision parameters and associated special crossover, mutation, and repair operators. The obtained results for the simplified assembly model are encouraging and motivate further exploration.

K. Hamza, J.F. Reyes-Luna, and K. Saitou
Fig. 5. Feasible assembly operations and sequences for the best-known solution (thick lines); separate panels show Model A, Model B, and Model C.
Fig. 6. Typical display of initial and final populations and their un-dominated Pareto-fronts (f1 on the horizontal axis, f2 on the vertical; the plot shows the whole population and the Rank-1 front at Generation #0 and Generation #100).
Fig. 7. History of search average progress for the best point: average f1 versus generation number (0–100) for the Arithmetic-Projection, Multipoint, Uniform, and ASMGA Special crossover schemes.
Table 3. Percentage success in attaining the best-known solution.

Crossover Scheme                 Percentage Success
Arithmetic Projection Crossover  30%
Multipoint Crossover             50%
Uniform Crossover                40%
ASMGA Special Crossover          90%
Acknowledgements. This work is an extension of a course project for ME588: Assembly Modeling conducted during the Fall 2002 semester at the University of Michigan, Ann Arbor. MECO: Modern Egyptian Contracting provided the data for the wind-propelled ventilators.
References

1. De Fazio, T., Whitney, D.: Simplified Generation of All Mechanical Assembly Sequences. IEEE J. of Robotics and Automation, Vol. 3, No. 6 (1987) 640–658
2. Homem de Mello, L.S., Sanderson, A.: A Correct and Complete Algorithm for the Generation of Mechanical Assembly Sequences. IEEE Transactions on Robotics and Automation, Vol. 7, No. 2 (1991) 228–240
3. Garey, M., Johnson, D.: Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman and Company, NY (1979)
4. Goldberg, D.: Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley Publishing Company (1989)
5. Awdah, N., Sepehri, N., Hawaleshka, O.: A Computer-Aided Process Planning Model Based on Genetic Algorithms. Computers and Operations Research, Vol. 22, No. 8 (1995) 841–856
6. Kaeschel, J., Meier, B., Fisher, M., Teich, T.: Evolutionary Real-World Shop Floor Scheduling using Parallelization and Parameter Coevolution. Proceedings of the Genetic and Evolutionary Computation Conference, Las Vegas, NV (2000) 697–701
7. Chen, J., Ho, S.: Multi-Objective Evolutionary Optimization of Flexible Manufacturing Systems. Proceedings of the Genetic and Evolutionary Computation Conference, San Francisco, CA (2001) 1260–1267
8. Saitou, K., Malpathak, S., Qvam, H.: Robust Design of Flexible Manufacturing Systems using Colored Petri Net and Genetic Algorithm. Journal of Intelligent Manufacturing, Vol. 13, No. 5 (2002) 339–351
9. Michalewicz, Z.: Genetic Algorithms + Data Structures = Evolution Programs. 3rd edn. Springer-Verlag, Berlin Heidelberg New York (1996)
10. Deb, K., Agrawal, S., Pratab, A., Meyarivan, T.: A Fast Elitist Non-Dominated Sorting Genetic Algorithm for Multi-Objective Optimization: NSGA-II. Proceedings of the Parallel Problem Solving from Nature VI Conference, Paris, France (2000) 849–858
11. Coello, C., Van Veldhuizen, D., Lamont, G.: Evolutionary Algorithms for Solving Multi-Objective Problems. Kluwer Academic Publishers (2002)
12. Kelton, W., Sadowski, R., Sadowski, D.: Simulation with ARENA. 2nd edn. McGraw Hill (2002)
Multi-FPGA Systems Synthesis by Means of Evolutionary Computation

J.I. Hidalgo^1, F. Fernández^2, J. Lanchares^1, J.M. Sánchez^2, R. Hermida^1, M. Tomassini^3, R. Baraglia^4, R. Perego^4, and O. Garnica^1

1 Computer Architecture Department (DACYA), Universidad Complutense de Madrid, Spain {hidalgo,julandan,rhermida,ogarnica}@dacya.ucm.es
2 Departamento de Informática, Universidad de Extremadura, Spain {fcofdez,jmsanchez}@unex.es
3 Computer Science Institute, University of Lausanne, Switzerland [email protected]
4 Istituto di Scienza e Tecnologie dell'Informazione "Alessandro Faedo", CNR, Italy {Ranieri.Baraglia,Raffaele.Perego}@cnuce.cnr.it

Abstract. Multi-FPGA systems (MFS) are used for a great variety of applications, for instance dynamically re-configurable hardware applications, digital circuit emulation, and numerical computation. There is a great variety of boards for MFS implementation. In this paper a methodology for MFS design is presented. The techniques used are evolutionary programs, and they solve all of the design tasks (partitioning, placement and routing). First, a hybrid compact genetic algorithm solves the partitioning problem, and then genetic programming is used to obtain a solution for the two other tasks.
1 Introduction
An FPGA (Field Programmable Gate Array) is an integrated circuit capable of implementing any digital circuit by means of a configuration process. FPGAs are made up of three principal components: configurable logic blocks, input-output blocks and connection blocks. Configurable logic blocks (CLBs) are used to implement all the logic circuitry of a given design; they are distributed in a matrix fashion across the device. The input-output blocks (IOBs) are responsible for connecting the circuit implemented in the FPGA with the external world. This "world" may be the application environment for which it has been designed, or a set of different FPGAs. Finally, the connection blocks (switch boxes) and interconnection lines are the elements available to the designer for routing the circuit. On many occasions some of the CLBs must also be used to accomplish the routing. The internal structure of an FPGA, with its logic blocks, IOBs and interconnection resources, is shown in Figure 1 [1].

Fig. 1. General structure of an FPGA

Sometimes the size of an FPGA is not enough to implement large circuits, and the use of a Multi-FPGA system (MFS) becomes necessary. There is a great number of MFSs, each of them suited to different applications [2]. These systems include not only integrated circuits but also memories and connection circuitry. Nowadays MFSs are used for a great variety of applications, for instance dynamically re-configurable hardware applications [3], digital circuit emulation [4], and numerical computation [5]. There are a lot of different boards and topologies for MFS implementation. The two most common topologies are the mesh and crossbar types. In a mesh, the FPGAs are connected in the nearest-neighbor pattern. This kind of board has a simple routing methodology as well as easy expandability, and all programmable devices within the board have identical functionality. Figure 2a shows an FPGA mesh topology. Crossbar topologies, on the other hand (Figure 2b), separate the system components into logic and routing chips. This topology could be suitable for some specific problems, but crossbar topologies usually waste not only logic resources but also routing resources. For these reasons we have focused on mesh topologies.

E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 2109–2120, 2003. © Springer-Verlag Berlin Heidelberg 2003
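A minimal data model of the mesh-FPGA resources just described can be sketched as follows. This is purely illustrative; the class name, port naming and bookkeeping are our assumptions, not part of any vendor API or of the paper:

```python
# Illustrative sketch of mesh-FPGA resources: a grid of CLBs with switch
# boxes between them and IOBs on the device border. Routing a wire through
# a switch box consumes one of its ports.
from dataclasses import dataclass, field

@dataclass
class MeshFPGA:
    rows: int
    cols: int
    used_switch_ports: set = field(default_factory=set)

    def is_border(self, r, c):
        """IOBs sit on the border of the device."""
        return r in (0, self.rows - 1) or c in (0, self.cols - 1)

    def claim_switch(self, r, c, side):
        """Reserve one port ('N', 'S', 'E', 'W') of switch box (r, c).
        Returns False if another wire already routed through that port."""
        key = (r, c, side)
        if key in self.used_switch_ports:
            return False
        self.used_switch_ports.add(key)
        return True

fpga = MeshFPGA(rows=10, cols=10)
print(fpga.is_border(0, 5), fpga.claim_switch(3, 4, "N"))
```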
Fig. 2. Multi-FPGA mesh (a) and crossbar (b) topologies
As in any integrated circuit device, the MFS design flow has three major tasks: partitioning, placement and routing. Frequently two of these tasks are tackled together, because when accomplishing the partitioning the placement must be considered, or vice versa. The solution that we propose in this paper covers the whole process. First, we obtain the partitions of the circuit; each of them will later be implemented on a different FPGA. Second, we place and route each partition using the FPGA resources. As we will see in the following sections, we use two different evolutionary algorithms: a hybrid compact genetic algorithm (cGA) for the partitioning step and a genetic programming technique for the placement and routing step. The rest of the paper is organized as follows. Section 2 describes the partitioning methodology. Section 3 shows how genetic programming can finish the design process within the FPGAs, section 4 contains the experimental results, and finally section 5 presents our conclusions and future work.
2 MFS Partitioning and Placement
Methodology: Partitioning deals with the problem of dividing a given circuit into several parts, in order to be implemented on an MFS. When using a specific board we must bear in mind several constraints related to the board topology. Some of these constraints are the number of available I/O pins and the logic capacity. Although the logic capacity is usually a difficulty, the number of available pins is the hardest problem, because FPGA devices have a reduced number of them compared with their logic capacity. In addition, we must reserve some of the pins to interconnect the parts of the circuit placed on non-adjacent FPGAs. Most of the research related to the problem of partitioning on FPGAs has been adapted from other VLSI areas and hence disregards the specific features of these systems. In this paper a new method for solving the partitioning and placement problem in MFSs is presented. We apply graph theory to describe a given circuit, and then a compact genetic algorithm (cGA) with a local search improvement [17] is applied with a problem-specific encoding. This algorithm not only preserves the original structure of the circuit but also evaluates the I/O-pin consumption due to direct and indirect connections among FPGAs; the latter is done by means of a fuzzy technique. We have used the Partitioning93 benchmarks [6], described in the XNF hardware description language (Xilinx Netlist Format) [7].

Circuit Description: Some authors use hypergraphs for representing a circuit netlist, although there are some approximations which use graphs. We have used an undirected graph representation to describe the circuit. This selection is motivated by the fact that it can be adapted to the compact genetic algorithm code. We will describe later how the edges of its spanning tree can be used to represent a k-way partitioning. A spanning tree of a graph is a tree obtained by selecting edges from the graph [8].
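The construction of a spanning tree from the circuit graph can be sketched as follows. This is an illustrative reconstruction; the paper does not specify the traversal used, so breadth-first search is assumed here:

```python
# Illustrative sketch: obtain a spanning tree of the circuit graph with
# breadth-first search. The k-way partition is later defined by deleting
# k-1 of these tree edges.
from collections import deque

def spanning_tree(n_nodes, edges):
    """edges: iterable of (u, v) pairs. Returns the n_nodes-1 tree edges
    (the graph is assumed connected, as for a circuit netlist)."""
    adj = {u: [] for u in range(n_nodes)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    seen, tree, queue = {0}, [], deque([0])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                tree.append((u, v))
                queue.append(v)
    return tree

# A 4-node ring has 4 edges; its spanning tree keeps 3 of them.
print(spanning_tree(4, [(0, 1), (1, 2), (2, 3), (3, 0)]))
```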
Then we use a hybrid compact genetic algorithm to search for the optimal partitioning, which works basically as follows. First we obtain a graph from the netlist description of the circuit, and then a spanning tree of that graph is computed. From this tree we select k-1 edges, which are eliminated in order to obtain a k-way partition. The partitions are represented by those deleted edges.

Genetic Representation: The compact genetic algorithm (cGA) uses the encoding presented in [9], which is directly connected with the solution of the problem. The code for our problem is based on the edges which belong to the spanning tree. We have seen above how the partition is obtained by the elimination of some edges. We assign a number to every edge of the tree. Consequently a chromosome will have k-1 genes for a k-way partitioning, and the value of each gene can be any of the index values of the edges. For example, chromosome (3 4 6) for a 4-way partitioning represents a solution obtained after the suppression of edges number 3, 4, and 6 from the spanning tree. So the alphabet of the algorithm is {0, 1, …, n-2}, where n is the number of vertices of the target graph (circuit), because the spanning tree has n-1 edges.
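Decoding such a chromosome into a k-way partition can be sketched as follows (our illustration of the encoding just described; the function names are ours):

```python
# Illustrative sketch: a chromosome of k-1 spanning-tree edge indices is
# decoded into a k-way partition by deleting those edges and reading off
# the connected components of the pruned tree.
from collections import defaultdict

def decode_partition(tree_edges, chromosome, n_nodes):
    """tree_edges: list of (u, v) forming a spanning tree of n_nodes nodes.
    chromosome: k-1 indices into tree_edges (the edges to delete).
    Returns a list of node sets, one per partition (FPGA)."""
    deleted = set(chromosome)
    kept = [e for i, e in enumerate(tree_edges) if i not in deleted]
    adj = defaultdict(list)
    for u, v in kept:
        adj[u].append(v)
        adj[v].append(u)
    seen, parts = set(), []
    for start in range(n_nodes):
        if start in seen:
            continue
        comp, stack = set(), [start]   # DFS over the pruned tree
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(adj[node])
        seen |= comp
        parts.append(comp)
    return parts

# A 6-node spanning tree; deleting edges 1 and 3 yields a 3-way partition.
tree = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
print(decode_partition(tree, [1, 3], 6))
```

Deleting k-1 tree edges always yields exactly k components, which is why the encoding can never produce an infeasible number of parts.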
Hybrid Compact Genetic Algorithm: The cGA does not manage a population of solutions but only mimics its existence [10]. It represents the population by means of a vector of values, pi ∈ [0,1], ∀i = 1,…,l, where l is the number of alleles needed to represent the solutions. In order to design a cGA for MFS partitioning we adopted the edge representation explained above, and we consider the frequencies of the edges occurring in the simulated population. A vector V, with dimension equal to the number of nodes minus one, is used to store these frequencies. Each element vi of V represents the proportion of individuals whose partition uses the edge ei. The vector elements vi are initialised to 0.5 to represent a randomly generated population in which each edge has equal probability of belonging to a solution [11]. Sometimes it is necessary to increase the selection pressure rate (Ps) to reach good results with a compact genetic algorithm. A value of Ps near 4 is usually good for MFS partitioning. It is not advisable to increase this value much further, because the computation time grows. Additionally, for some problems the cGA needs a complement; we can combine it with local search algorithms to obtain such an additional tool, a so-called hybrid algorithm. We have implemented a cGA that performs local search after a certain number of iterations, in order to improve the solutions obtained by the cGA alone. In [12] a compact genetic algorithm for MFS partitioning was presented, and in [13] a hybrid cGA was described, in which the authors combine a cGA with the Lin-Kernighan (LK) local search algorithm to solve TSP problems. The cGA part explores the most interesting areas of the search space, and LK's task is the fine-tuning of the solutions obtained by the cGA. Following this structure we have implemented the hybrid cGA for MFS partitioning presented in [17].
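The probability-vector bookkeeping of a cGA can be sketched as follows. Note this is the standard bitstring formulation of [10], shown purely for illustration, rather than the paper's edge-frequency vector with selection pressure Ps:

```python
# Illustrative sketch of a compact GA step: the simulated population is a
# probability vector v (one entry per spanning-tree edge, initialised to
# 0.5). Two individuals are sampled; v is shifted towards the winner by 1/n.
import random

def sample_individual(v):
    # One simulated individual: edge i is "used" with probability v[i].
    return [random.random() < p for p in v]

def cga_step(v, fitness, n=100):
    a, b = sample_individual(v), sample_individual(v)
    # Tournament between the two samples (fitness is minimised here).
    winner, loser = (a, b) if fitness(a) <= fitness(b) else (b, a)
    for i in range(len(v)):
        if winner[i] != loser[i]:
            v[i] = min(1.0, v[i] + 1.0 / n) if winner[i] else max(0.0, v[i] - 1.0 / n)
    return v

random.seed(0)
v = [0.5] * 11                    # 11 spanning-tree edges, as in the example
for _ in range(2000):
    cga_step(v, fitness=sum)      # toy fitness: minimise the number of used edges
print(min(v), max(v))
```

Under the toy fitness the vector drifts towards 0, mimicking a full population converging without ever storing one.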
In this paper we use a heuristic different from LK, one better suited to the problem we are solving. Most local search algorithms try to perform a search as exhaustive as possible, but this can imply an unacceptable amount of computation time. In the MFS problem, the ideal implementation of local search would explore all the neighbour solutions of the current best solution after a certain number of iterations. Unfortunately, the most computationally expensive step of our cGA is the evaluation of the individuals. We therefore employ a local search heuristic every n iterations and, as in parallel genetic algorithms, we need to fix the value of n to keep the algorithm's search in good working order. After an empirical study of different values of the local search frequency, we found that n must lie between 20 and 60, with an optimal value (depending on the benchmark) near 50. So for our experiments we fixed the local search frequency n to 50 iterations, i.e. we perform a local search process every 50 iterations of the compact GA. Remember that a chromosome has k-1 genes for a k-way partitioning, and the values of these genes are the edges eliminated to obtain a partitioning solution. In order to explain the algorithm we must define what a neighbour solution is. We say that solution A is a neighbour of solution B (and B of A) if the difference between their chromosomes is just one gene. For example, the solution represented by chromosome (1 43 56 78 120 345 789) is a neighbour of the partition represented by (1 43 56 78 120 289 789) in an 8-way partitioning problem. Our local search heuristic explores only one neighbour solution for each gene, that is, k-1 neighbour solutions of the best solution every n iterations. For the sake of clarity we reproduce the explanation of the local search process presented in [17]. The local search process works as follows. Every n iterations we obtain the best solution up to that time, denoted by BS. To obtain BS, we first explore the compact GA probability vector and select the k-1 most used genes (edges) to form MBS (the vector best individual). We also keep the best individual generated up to now, GBS (similar to elitism). The better of MBS and GBS (i.e. whichever has the better fitness value) becomes BS. After BS has been deduced at iteration n, the first random neighbour solution (TS1) of BS is generated by substituting the first gene (edge) of the chromosome with a random one not used in BS. Then we evaluate the fitness value of BS (FVBS) and the fitness value of TS1 (FVTS1). If FVTS1 is better than FVBS, TS1 becomes the new BS and the initial BS is eliminated; otherwise TS1 is eliminated. Then we repeat the same process using the new BS but with the second gene, to generate TS2. If the fitness value of TS2 (FVTS2) is better than the present FVBS, then TS2 will be our new BS; if FVTS2 is worse than FVBS, there is no change in BS. The process is repeated for all genes until the end of the chromosome, that is, k-1 times for a k-way partitioning. Although only a very small part of the solution neighbourhood space is explored, we improve the performance of the algorithm significantly (in terms of quality of our solutions) without drastically degrading its total computation time. In order to clarify the proposed local search method, consider an example. Let us suppose a graph with 12 nodes and its spanning tree, for a 5-way partitioning problem (i.e. we want to divide the circuit into five parts). As we have explained, we use individuals with 4 genes.
Let us also suppose a local search frequency (n) of 50, and that after 50 iterations we have reached a best solution represented by:
BS = 3, 4, 6, 7 (2)
The circuit graph has 12 nodes, so its spanning tree is formed by 11 edges. The whole set of edges available to form a partitioning solution is called E:
E = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10} (3)
In order to generate TS1 we need to know the edges ALS available for random selection; as we have said, we eliminate the edges within BS from E to obtain ALS:
ALS = {0, 1, 2, 5, 8, 9, 10} (4)
Now we randomly select an edge (suppose 0) to build TS1, substituting it for the first gene (3) of BS:
TS1 = 0, 4, 6, 7 (5)
The third step is the evaluation of TS1 (suppose FVTS1 = 12) and its comparison (suppose a minimization problem) with FVBS (suppose FVBS = 25). As FVTS1 is better than FVBS, TS1 will be our new BS and the original BS is eliminated. These changes also affect ALS, because our new ALS is:
ALS = {1, 2, 3, 5, 8, 9, 10} (6)
Table 1 shows the rest of the local search process for this example.
Table 1. Local search example

i  ALS             BS       FV  Random gene  TS       FV  New BS
1  0,1,2,5,8,9,10  3,4,6,7  25  0            0,4,6,7  12  0,4,6,7
2  1,2,3,5,8,9,10  0,4,6,7  12  1            0,1,6,7  37  0,4,6,7
3  1,2,3,5,8,9,10  0,4,6,7  12  9            0,4,9,7  10  0,4,9,7
4  1,2,3,5,6,8,10  0,4,9,7  10  8            0,4,8,9  11  0,4,9,7

Pre-Local Search Best Solution: 3, 4, 6, 7
Post-Local Search Best Solution: 0, 4, 9, 7
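The gene-by-gene sweep of Table 1 can be sketched as follows. This is an illustrative reconstruction with a toy fitness function; the real fitness evaluates pin consumption and partition quality:

```python
# Illustrative sketch of the local search of Table 1: each gene of the best
# solution BS is tentatively replaced by a random unused edge, and the
# neighbour TS is kept only if it improves the (minimised) fitness.
import random

def local_search(bs, all_edges, fitness):
    bs = list(bs)
    for i in range(len(bs)):                  # k-1 genes => k-1 neighbours
        unused = [e for e in all_edges if e not in bs]   # the ALS set
        ts = list(bs)
        ts[i] = random.choice(unused)         # neighbour differs in one gene
        if fitness(ts) < fitness(bs):         # minimisation, as in Table 1
            bs = ts                           # TS becomes the new BS
    return bs

# Toy run mirroring the 5-way example: edges 0..10, invented fitness = sum.
random.seed(1)
result = local_search([3, 4, 6, 7], range(11), fitness=sum)
print(result)
```

Only k-1 neighbours are evaluated per sweep, which matches the trade-off discussed above between search quality and evaluation cost.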
3 Placement and Routing on FPGAs
Once the different partitions have been obtained and assigned to different FPGAs, we must place components into the FPGAs' CLBs and connect the CLBs within each FPGA. To do so, we use Genetic Programming (GP). A thorough description of this technique can be found in [14].
Fig. 3. Representing a circuit with black boxes and labeling connections
Partitions Representation Using Trees: The main goal of this step is to implement a partition (circuit), obtained in the previous step, into an FPGA. We thus have to place each of the circuit components into a CLB and then connect all the CLBs according to the circuit's topology. We have used Genetic Programming (GP) based on tree structures for this task; therefore, circuits will be encoded here as trees. A given circuit is made up of components and connections. If we forget the name and function of each of the simple components (considering each of them as a black box) and instead use only one symbol to represent any of them, a circuit can be represented in a similar way to the example depicted in figure 3. Given that components compute very simple logic functions, any of them can be implemented by any of the CLBs available within each FPGA. This means that we only have to connect CLBs of the FPGA according to the interconnection model that a given circuit implements, and then configure each CLB with the function that the corresponding component performs in the circuit. After this couple of simple steps we have the circuit in the FPGA. Given that we employ Genetic Programming, we have to encode the circuit as a tree. We can first number each component of the circuit, and then assign the numbers of those components to the ends of the wires connected to them (see figure 3). Wires can now be disconnected without losing any information; we could even rebuild the circuit using the labels as a guide.
Multi-FPGA Systems Synthesis by Means of Evolutionary Computation
2115
Fig. 4. Encoding circuits by means of binary trees. a) Each branch of the tree describes a connection of the circuit; dotted lines indicate a number of internal nodes in the branch. b) Making connections in the FPGA according to the nodes.
We may now describe all the wires by means of a single tree, by including each wire as a branch and keeping them all together in the same tree. By labeling both extremes of each branch, we have all the information required to reconstruct the circuit. This way of representing circuits allows us to go back and construct the real graph. Moreover, any given tree, randomly generated, will always correspond to a particular graph, regardless of the usefulness of the associated circuit. In this proposal, each node of the tree represents a connection, and each branch represents a wire. The next stage is to encode the path of wires into an FPGA. Each branch of the tree will encode a wire of the circuit. We now have to decide how each of the tree's branches can encode a set of connections. As seen in previous sections, mesh FPGAs contain CLBs, switch blocks and wire segments. Each wire segment can connect adjacent blocks, both CLBs and switch blocks. Several wire segments must be connected through switch blocks when joining two CLBs' pins according to a given circuit description. A connection in a circuit can be placed into an FPGA in many different ways. For example, there are as many choices in the selection of each CLB as the number of rows multiplied by the number of columns available in the FPGA (see figure 1, section 1). Moreover, which of the pins of the CLB will be used, and how the wire traverses the FPGA, has to be decided from among an incredibly high number of possibilities. Of course, the same connection can be made in many different ways, with more or fewer switch blocks being crossed by the wire. Every wire in an FPGA has two ends, each of which can connect to a CLB or to an IOB. On the other hand, as said above, a given number of switch connections may make up the path of the wire.
In our representation we use a branch of the tree to encode each wire; CLB and IOB connections are described by the two end nodes which make up a branch. In order to describe switch connections, we add as many new internal nodes to the branch as there are switch blocks traversed by the wire (see figure 4b). Each internal node requires some extra information: if the node corresponds to a CLB, we need to know the position of the CLB in the FPGA, the number of the pin to which one of the ends of the wire is connected, and
which of the wires of the wire block we are using; if the node represents a switch connection, we need information about that connection (Figure 4 graphically depicts how a tree describes a circuit). It may well happen that, when placing a wire into an FPGA, some of the required connections specified in the branch cannot be made because, for instance, a switch block connection has been previously used for routing another wire segment. In this case the circuit is not valid, in the sense that not all the connections can be placed into a physical circuit. In order for the whole circuit to be represented by means of a tree, we use a binary tree whose left-most branch corresponds to one of its connections, while the right branch consists of another subtree constructed recursively in the same way (the left branch is a connection and the right branch a subtree). The last and deepest right branch is the last circuit connection. Given that all internal nodes are binary, we can use only one kind of function, with two descendants.

GP Sets: When solving a problem by means of GP, one of the first things to do once the problem has been analyzed is to build both the function set and the terminal set. The function set for our problem contains only one element, F = {SW}; similarly, the terminal set contains only one element, T = {CLB}. But SW and CLB may be interpreted differently depending on the position of the node within a tree. Sometimes a terminal node corresponds to an IOB connection, while sometimes it corresponds to a CLB connection in the FPGA (see figure 4a). Similarly, a SW node sometimes corresponds to a CLB connection, while at other times it affects switch connections in the FPGA. Each of the nodes in the tree will thus contain different information:
• If we are dealing with a terminal node, it will have information about the position of the CLB, the number of the pin selected, the number of the wire to which it is connected, and the direction we are taking when placing the wire.
• If we are instead in a function node, it will have information about the direction we are taking. This information enables us to establish the switch connection or, in the case of the first node of the branch, the number of the pin where the connection ends.
If we look at figure 4, we can observe that wires with IOBs at one of their ends are shorter (they only need a couple of nodes) than those with CLBs at both ends (which require internal nodes to express switch connections). Wires expressed in the last positions of the tree have less space to grow, so we decided to place IOB wires in those positions, thus leaving the first parts of the tree for long wires joining CLBs.

Evaluating Individuals: In order for GP to work, individuals from the population have to be evaluated and reproduced employing the GP algorithm. To evaluate an individual we must convert the genotype (tree structure) into the phenotype (circuit in the FPGA), and then compare it to the circuit provided by the partitioning algorithm. We developed an FPGA simulator for this task. This software allows us to simulate any circuit and check its resemblance to another circuit. The tool takes an individual from the population and evaluates every branch of the tree, in a sequential way, establishing the connections that each branch specifies. Circuits are thus mapped by visiting each of the useful nodes of the tree and making connections on the virtual FPGA, thus obtaining the phenotype. Each time a connection is made, the state of the FPGA must be brought up to date, in order to be capable of making new connections when evaluating the following nodes. If we evaluate each branch beginning with a terminal node, thus establishing the first end of the wire, we can continue evaluating the nodes of the branch from the bottom to the top. Nevertheless, we must be aware that there are several terminals related to each branch, because each function node has two different descendants. We must decide which of the terminals will be taken as the beginning of the wire, and then drive the evaluation to the top of the branch. We have decided to use the terminal that is reached when going down the branch always taking the left descendant (see figure 5).
Fig. 5. Evaluating trees
In one sense there is a waste of resources in having so many unused nodes. Nevertheless, they represent new possibilities that can show up after a crossover operation (in nature there always exist recessive genes, which from time to time appear in descendants). These nodes are hidden, in the sense that they don't take part in the construction of the circuit, but they may appear in new individuals after some generations; if they are useful in solving the problem, they will remain in descendants in the form of nodes that express connections. The fitness function is computed as the difference between the circuit provided and the circuit described by the individual.
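The fitness idea just stated, measuring the difference between the target circuit and the circuit realised by the tree, can be sketched as follows (our illustration; the connection-set representation is an assumption, not the paper's data structure):

```python
# Illustrative sketch: after the simulator has mapped an individual onto the
# virtual FPGA, fitness counts how far the realised netlist is from the
# target netlist. 0 means the individual routes exactly the target circuit.

def fitness(realised_connections, target_connections):
    """Both arguments are sets of (component_a, component_b) pairs."""
    missing = target_connections - realised_connections   # wires not routed
    extra = realised_connections - target_connections     # spurious wires
    return len(missing) + len(extra)

# Toy netlist: the individual has routed two of the three target wires.
target = {(1, 2), (2, 3), (3, 4)}
print(fitness({(1, 2), (2, 3)}, target))
```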
4 Experimental Results
• Partitioning and Placement onto the FPGAs
The algorithm has been implemented in C and run on a Pentium II at 450 MHz. We have used the Partitioning93 benchmarks in XNF format. As the number and characteristics of the CLBs depend on the device used for the implementation, we have supposed that each block of the circuits uses one CLB, and we target Xilinx's 4000 series. Table 2 contains some experimental results. It has seven columns, which give: the name of the test circuit (Circuit), its number of CLBs (CLBs), the number of connections between CLBs (Edges), the distribution of CLBs among the FPGAs (Distribution), the number of I/O pins lacking (p), the device proposed for the implementation (FPGA type), and the CPU time in seconds needed to obtain a solution (T (sec)). From the results we can observe that there are some unbalanced distributions. This is a logical result, because we need some circuitry to pass the nets from one device to another; in addition, our fitness function has been developed to achieve two objectives at once. In summary, the algorithm succeeds in solving the partitioning problem with board constraints. We have implemented a board and checked some basic circuits, so we can conclude that the algorithm works.
Table 2. Partitioning and placement results for different benchmarks

Circuit  CLBs  Edges  Distribution               p  FPGA type  T (sec)
S208     127   200    16,15,17,21,11,11,19,17    0  4003       20.70
S298     158   306    20,23,29,20,20,20,25,11    0  4003       5.92
S400     217   412    28,47,23,37,16,20,33,13    0  4005       52.96
S444     234   442    27,38,36,29,41,37,16,27    0  4005       52.99
S510     251   455    34,41,38,42,32,19,24,2     0  4005       96.74
S832     336   808    230,11,25,14,17,18,16,5    0  4008       96.74
S820     338   796    237,17,24,14,21,8,7,10     0  4008       138.93
S953     494   882    168,60,82,31,101,289,15    0  4008       194.65
S838     495   800    100,92,37,77,67,60,43,17   0  4008       293.45
S1238    574   1127   91,11,293,56,50,25,17,31   0  4008       320.14
C1423    829   1465   525,52,114,14,37,51,28,8   0  4020       273.65
C3540    1778  2115   614,135,88,89,56,26,28,2   0  4020       844.16
• Inter-FPGA Placement and Routing
Several experiments of different sizes and complexities have been performed to test the placement and routing process [10]. One of them is shown in figure 6. We worked on a SUN workstation at 167 MHz. The main parameters employed were the following: population size = 200, number of generations = 500, maximum depth = 30, steady-state tournament size = 10, crossover probability = 98%, mutation probability = 2%, creation type = ramped half-and-half, with the best individual added to the new population. The GP tool we used is described in [16]. Figures 6 and 7 show one of the proposed circuits and one of the solutions found, respectively. More solutions found for this circuit are described in [15].
Fig. 6. Circuit to be tested.
5 Conclusions and Future Work
In this paper a methodology for circuit design using MFSs has been presented. We have used evolutionary computation for all steps in the design process: first, a compact genetic algorithm with a local search heuristic was used to achieve partitioning and placement for intra-FPGA systems, and genetic programming was then used for the inter-FPGA tasks. This method can be applied to different boards and covers the whole design flow. As future work, we are now working on the parallelization of all of the steps and studying Multi-Objective Genetic Algorithm (MOGA) techniques.
Multi-FPGA Systems Synthesis by Means of Evolutionary Computation

2119

[Fig. 7. A solution found for the example in Figure 6: a routed grid of CLBs (Clb) and switch blocks (Sw/S) with IN/OUT pads. The figure graphic is not reproducible as text.]
Acknowledgement. Part of this research has been possible thanks to the Spanish Government research projects number TIC2002-04498-C05-01 and TIC2002/750.
References

1. S. Trimberger. “Field Programmable Gate Array Technology”. Kluwer, 1994.
2. S. Hauck. Multi-FPGA Systems. Ph.D. dissertation, University of Washington, 1994.
3. M. Baxter. “Icarus: A dynamically reconfigurable computer architecture”. IEEE Symposium on FPGAs for Custom Computing Machines, 1999, 278–279.
4. R. Macketanz, W. Karl. “JVX: a rapid prototyping system based on Java and FPGAs”. In Field Programmable Logic: From FPGAs to Computing Paradigm, pages 99–108. Springer Verlag, 1998.
5. M.I. Heywood and A.N. Zincir-Heywood. “Register based genetic programming on FPGA computing platforms”. EuroGP 2000, 44–59.
6. CAD Benchmarking Laboratory, http://vlsicad.cs.ud.edu/
7. “XNF: Xilinx Netlist Format”, http://www.xilinx.com
8. F. Harary. “Graph Theory”. Addison-Wesley, 1968.
9. J.I. Hidalgo, J. Lanchares, R. Hermida. “Graph Partitioning methods for Multi-FPGA systems and Reconfigurable Hardware based on Genetic Algorithms”, Proceedings of the 1999 Genetic and Evolutionary Computation Conference Workshop Program, Orlando (USA), 1999, 357–358.
10. G.R. Harik, F.G. Lobo, D.E. Goldberg. “The Compact Genetic Algorithm”. IlliGAL Report No. 97006, August 1997. University of Illinois at Urbana-Champaign.
11. G.R. Harik, F.G. Lobo, D.E. Goldberg. “The Compact Genetic Algorithm”. IEEE Transactions on Evolutionary Computation, Vol. 3, No. 4, pp. 287–297, 1999.
12. J.I. Hidalgo, R. Baraglia, R. Perego, J. Lanchares, F. Tirado. “A Parallel compact genetic algorithm for Multi-FPGA partitioning”. Euromicro PDP 2001, 113–120. IEEE Press.
13. R. Baraglia, J.I. Hidalgo, and R. Perego. “A Hybrid Heuristic for the Travelling Salesman Problem”. IEEE Transactions on Evolutionary Computation, Vol. 5, No. 6, pp. 613–622, December 2001.
14. J.R. Koza. Genetic Programming: On the Programming of Computers by Means of Natural Selection. Cambridge, MA: The MIT Press.
15. F. Fernández, J.M. Sánchez, M. Tomassini. “Placing and routing circuits on FPGAs by means of Parallel and Distributed Genetic Programming”. Proceedings of the 4th International Conference on Evolvable Systems, ICES 2001.
16. M. Tomassini, F. Fernández, L. Vanneschi, L. Bucher. “An MPI-Based Tool for Distributed Genetic Programming”. In Proceedings of the IEEE International Conference on Cluster Computing, CLUSTER 2000, IEEE Computer Society, pp. 209–216.
17. J.I. Hidalgo, J. Lanchares, A. Ibarra, R. Hermida. “A Hybrid Evolutionary Algorithm for Multi-FPGA Systems Design”. Proceedings of the Euromicro Symposium on Digital System Design, DSD 2002, Dortmund, Germany, September 2002. IEEE Press, pp. 60–68.
Genetic Algorithm Optimized Feature Transformation – A Comparison with Different Classifiers

Zhijian Huang¹, Min Pei¹, Erik Goodman¹, Yong Huang², and Gaoping Li³

¹ Genetic Algorithms Research and Application Group (GARAGe), Michigan State University, East Lansing, MI {huangzh1,pei,goodman}@egr.msu.edu
² Computer Center, East China Normal University, Shanghai, China [email protected]
³ Electrocardiograph Research Lab, Medical College, Fudan University, Shanghai, China [email protected]
Abstract. When using a Genetic Algorithm (GA) to optimize the feature space of pattern classification problems, the performance improvement is determined not only by the data set used but also by the classifier. This work compares the improvements achieved by GA-optimized feature transformations on several simple classifiers. Some traditional feature transformation techniques, such as Principal Components Analysis (PCA) and Linear Discriminant Analysis (LDA), are also tested to see their effects on the GA optimization. The results, based on real-world data and five benchmark data sets from the UCI repository, show that the improvement after GA-optimized feature transformation is inversely related to the classification rate of the classifier used alone. It is also shown that performing the PCA and LDA transformations on the feature space prior to the GA optimization improves the final result.
1 Introduction
The genetic algorithm (GA) has proven to be an effective search method for high-dimensional complex problems, taking advantage of its capability for sometimes escaping local optima to find optimal or near-optimal solutions. In pattern classification, the GA is widely used for parameter tuning, feature weighting [1] and prototype selection [2]. Feature extraction and selection is a very important phase for a classification system, because the selection of a feature subset greatly affects the classification result. The GA has recently also been used in the area of feature extraction. The optimization of feature space using a GA can be linear [3] or non-linear [4]; in both cases, the GA stochastically, but efficiently, searches a very high-dimensional data space on which traditional deterministic search methods run out of time. The GA approach can also be combined with other traditional feature transformation methods.
E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 2121–2133, 2003. © Springer-Verlag Berlin Heidelberg 2003
2122
Z. Huang et al.
Prakash presented the combination of GA with Principal Components Analysis (PCA), where instead of the few largest Principal Components (PCs), a subset of PCs from the whole spectrum was chosen by the GA to get the best performance [5]. In this work, three classifiers – a kNN classifier, a Bayes classifier and a Linear Regression classifier – are tested, together with the PCA and LDA transformations. One new, real-world dataset, the Electrocardiogram (ECG) data, and five benchmark datasets from the UCI Machine Learning Repository [6] are used to test the approach. The paper starts with an introduction to GA approaches in the area of pattern classification in Section 2, followed by our solution designed in Section 3. Section 4 presents the results on ECG data with detailed comparison with regard to both classifier choice and the use of PCA/LDA transformations. Section 5 extends the tests to five benchmark pattern classification test sets, by using the best solution on PCA/LDA combination found in Section 4. Section 6 concludes the paper and Section 7 proposes some possible future work.
2 GA in Pattern Classification
Generally, the GA-based approaches to pattern classification can be divided into two groups:
• Those directly applying the GA as part of the classifier.
• Those optimizing parameters in pattern classification.

2.1 Direct Application of GA as Part of the Classifier
When the GA is directly applied as part of the classifier, the main difficulty is how to represent the classifier in the GA chromosome. Bandyopadhyay and Murthy proposed an idea using a GA to perform a direct search on the partitions of an N-dimensional feature space, where each partition represents a possible classification rule [7]. In this approach, the decision boundary of the N-dimensional feature space is represented by H lines. The genetic algorithm is used to find those lines that minimize the misclassification rate of the decision boundary. The number of lines, H, turns out to be a parameter similar to the k in the kNN classifier. More lines (higher H) do not necessarily improve the classification rate, due to the effect of over-fitting. In addition to using lines as space separators, Srikanth et al. [8] also gave a novel method for clustering and classifying the feature space by ellipsoids.

2.2 Optimizing Parameters in Pattern Classification by GA
However, most of the approaches using a GA in pattern classification do not design the classifier with the GA. Instead, the GA is used to estimate the parameters of the pattern classification system, which can be categorized into the following four classes:
Genetic Algorithm Optimized Feature Transformation
2123
GA-Optimized Feature Selection and Extraction. Feature selection and extraction are the most widely used applications of the GA in pattern classification. The classification rate is affected indirectly when different weights are applied to the features. A genetic algorithm is used to find a set of optimal feature weights that can improve the classification performance on training samples. Before GA-optimized feature extraction and selection, traditional feature extraction techniques such as Principal Components Analysis (PCA) can be applied [5], while afterwards a classifier must be used to calculate the fitness function for the GA. The most commonly used classifier is the k-Nearest Neighbor classifier [9], [1].

GA-Optimized Prototype Selection. In supervised pattern classification, the reference set or training samples are critical for the classification of testing samples. A genetic algorithm can also be used in the selection of prototypes in case-based classification [2], [10]. In this approach, a subset of the most typical samples is chosen to form a prototype, on which the classification of testing and validation samples is based.

GA-Optimized Classifier. A GA can be used to optimize the input weights or topology of a Neural Network (NN) [4]. It is intuitive to assign a weight to each connection in an NN. By evolving the weights using a GA, it is possible to discard connections of the neural network whose weights are too small, thus improving the topology of the NN, too.

GA-Optimized Classifier Combination. The combination of classifiers, sometimes called bagging and boosting [11], may also be optimized by a genetic algorithm. Kuncheva and Jain [12], in their design of a classifier fusion system, not only selected the features, but also selected the types of the individual classifiers using a genetic algorithm.
3 GA-Optimized Feature Transformation Algorithm This section first reviews the two models for feature extraction and feature selection in pattern classification. Then a GA-optimized feature weighting and selection algorithm based on the wrapper model [13] is proposed, outlining the structure of the experiment in this paper. 3.1 The Filter Model and the Wrapper Model For the problem of feature extraction and selection in pattern classification, two models play important roles. The filter model chooses features by heuristically determined “goodness”, or knowledge, while the wrapper model does this by the feedback of the classifier evaluation, or experience. Fig. 1 illustrates the differences between these two models.
Fig. 1. Comparison of filter model and wrapper model
Research has shown that the wrapper model performs better than the filter model when comparing predictive power on unseen data [14]. Some widely used feature extraction approaches, such as Principal Components Analysis (PCA), belong to the filter model because they rank the features by an intrinsic property: the eigenvalues of the covariance matrix. Most recently developed feature selection or extraction techniques are categorized as wrapper models, taking into consideration the classification results of a particular classifier. For example, the GFNC (Genetically Found, Neurally Computed) approach by Firpi [4] uses a GA and a Neural Network to perform non-linear feature extraction, with the feedback from a kNN classifier as the fitness function for the genetic algorithm. A modified PCA approach proposed by Prakash also uses genetic algorithms to search for the optimal feature subset guided by the feedback of the classifier [5].
Fig. 2. Classification system with feature weighting
3.2 GA-Optimized Pattern Classification with Feature Weighting
Considering the wrapper model introduced above, in any pattern classification system with weighted features there are five components that need to be determined, as illustrated in Fig. 2:
• The dataset used.
• The feature space transformation to be applied to the original feature space, known as the feature extraction phase in traditional pattern classification systems.
• The optimization algorithm that searches for the best weighting of the features.
• The classifier used to get the feedback, or fitness, of the feature weighting for the GA (the induction algorithm in the wrapper model [13]).
• The classifier that calculates the final result for the classification problem based on the newly weighted features (the classifier for the result).
From Fig. 2, we can see that among these components, the feature space transformation, the optimization and the classifier are the plug-in procedures for feature weighting optimization. The feature-weighting framework is the plug-in procedure in traditional classification systems that transforms the feature space, weights each feature, and evaluates and optimizes the weight attached to each feature. In this paper, each of the components listed above is instantiated by the specific choices we made, as follows:
• Data: ECG data and five other data sets from the UCI repository.
• Feature Space Transformation: PCA, LDA and their combinations.
• Optimization Algorithm: Genetic algorithm or none. When implemented with different classifiers: feature weighting for the kNN classifier, and feature selection for the other classifiers. Feature weighting is used for the kNN classifier because of its distance metric, whose classification result changes with the weights [1], while for the Bayes classifier and the linear regression classifier, feature weighting has no further effect on the training error compared with feature selection [15].
• Classifier (induction algorithm): A kNN classifier, a Bayes classifier and a linear regression classifier.
• Classifier for Result: The same classifier as used for the induction algorithm.
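As a toy illustration of the wrapper loop these components form, the sketch below scores a feature-weight vector by the accuracy of a weighted 1-NN classifier on held-out data. The synthetic data, the 1-NN rule, and the two hand-picked weightings are illustrative assumptions; the paper itself uses a 5-NN classifier and a GA to search the weights.

```python
import random

def weighted_1nn_accuracy(weights, train, test):
    """Wrapper-model fitness: accuracy of a 1-NN classifier under the
    given feature weights (weighted squared Euclidean distance)."""
    def dist(x, y):
        return sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y))
    hits = sum(min(train, key=lambda t: dist(t[0], x))[1] == label
               for x, label in test)
    return hits / len(test)

# Tiny synthetic task: feature 0 separates the classes, feature 1 is loud noise.
rng = random.Random(0)
def sample(label):
    return ([label + 0.4 * rng.random(), 10 * rng.random()], label)
train = [sample(i % 2) for i in range(40)]
test = [sample(i % 2) for i in range(40)]

acc_equal = weighted_1nn_accuracy([1.0, 1.0], train, test)  # noise dominates
acc_tuned = weighted_1nn_accuracy([1.0, 0.0], train, test)  # noise suppressed
```

A GA would evolve the weight vector and use this accuracy as its fitness; down-weighting the noisy feature is exactly the kind of weighting the search is expected to discover.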
4 Test Results for ECG Data
The test results for ECG data, with various settings regarding the feature space transformation, using the GA or not, and using various classifiers, are presented and compared in this section. We will first discuss the experimental settings, and then move on to the results. 4.1 Experimental Settings The ECG data used in this paper is directly extracted from the 12-lead, 10-second digital signal acquired from Shanghai Zhongshan Hospital, containing 618 normal ECG cases and 618 abnormal cases. The abnormal cases contain three different kinds of diseases with roughly the same number of cases of each. Altogether, 23 features, including 21 morphological features and 2 frequency domain features, are extracted from the original signal. For a non-GA-optimized classifier, the data is partitioned into training samples and validation samples; for a GA-optimized algorithm, the data is partitioned into training
data, testing data and validation data, in an n-fold cross-validation manner. Unless otherwise indicated, n is set to 10. Table 1 lists the details of the data partitioning. A simple genetic algorithm (SGA) is used to do the optimization. The crossover rate is 30% and the mutation rate is set to 0.03 per bit. The program runs for 200 generations with a population size of 50 individuals (80 individuals for kNN feature weighting). When there has been no improvement within the last 100 generations, the evolution is halted. Classifiers: a 5-nearest-neighbor classifier is used; a Bayes plug-in classifier with its parameters estimated by Maximum Likelihood Estimation (MLE) is implemented; and a linear regression classifier uses simple multivariate regression combined with a threshold decision to predict the class labels. With the kNN classifier, each feature weight is allowed to range over 1024 different values between 0.0 and 10.0, with minimum changes of about 0.01, as determined by the GA by setting the chromosome segment for each feature to 10 binary digits. With the Bayes classifier and the linear regression classifier, only feature selection was tested.

Table 1. Summary of Data Partitioning

Experiments   Training       Testing        Valid
Non-GA        40%            N/A            60%
GA (n-fold)   60%×(n−1)/n    40%×(n−1)/n    1/n
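The 10-binary-digits-per-feature chromosome described above admits a simple decoding. The linear mapping below is an assumption (the paper does not spell out the exact formula), but it reproduces the stated 1024 values in [0.0, 10.0] with a step of about 0.01.

```python
def decode_weight(bits):
    """Map a 10-bit gene to one of 1024 evenly spaced weights in [0.0, 10.0]."""
    assert len(bits) == 10
    value = int("".join(map(str, bits)), 2)  # integer in 0 .. 1023
    return 10.0 * value / 1023               # step 10/1023 ~= 0.00978

w_min = decode_weight([0] * 10)   # all-zero gene -> weight 0.0
w_max = decode_weight([1] * 10)   # all-one gene  -> weight 10.0
step = 10.0 / 1023                # resolution, roughly the stated 0.01
```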
Table 2. Result of kNN classifier (k = 5): classification rate in %

Fea. Trans   Non-GA   GA      Trn     Improve      P-Value
None         73.09    74.54   78.29   1.45±3.23    0.3008
PCA          74.10    77.16   79.39   3.06±1.59    0.0043
LDA          73.09    77.00   78.72   3.91±2.24    0.0065
Both         72.41    75.38   79.82   2.97±2.78    0.0404
Overall P value: 0.0000
4.2 Results and Conclusions
Tables 2–4 list the results of GA-optimized feature extraction using the kNN classifier (k = 5), the Bayes plug-in classifier and the linear regression classifier. The improvement after GA optimization is reported as the average improvement, with a two-tailed t-test at a 95% confidence interval. A P value indicating the probability of the null hypothesis (that the improvement is 0) is also given; results with a 95%-significant improvement are shown in bold font. In addition to the row-wise statistics, an overall improvement significance level based on the improvement percentage is calculated for each classifier, which is a P value over all the improvement ratio values of that classifier. This significance indicator can be considered a final summary of the GA improvement for a particular classifier and is listed at the bottom of each table.
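The per-row "mean ± half-width" statistics of this kind can be reproduced from per-fold improvements. In the sketch below the ten delta values are made up for illustration, and t_crit = 2.262 assumes 9 degrees of freedom (10 cross-validation folds); a different fold count would need a different critical value.

```python
from math import sqrt
from statistics import mean, stdev

def improvement_summary(deltas, t_crit=2.262):
    """Mean improvement and 95% CI half-width via a two-tailed t interval.
    t_crit must match the degrees of freedom (2.262 is for df = 9)."""
    m = mean(deltas)
    half = t_crit * stdev(deltas) / sqrt(len(deltas))  # stdev = sample std
    return m, half

# Hypothetical per-fold improvements (percentage points) from a 10-fold run.
m, half = improvement_summary([3.1, 2.5, 4.0, 2.8, 3.3, 2.6, 3.9, 2.2, 3.5, 2.7])
```

The interval excludes zero exactly when the corresponding two-tailed P value falls below 0.05, which is the criterion behind the bold entries in the tables.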
Table 3. Result of Bayes classifier: classification rate in %

Fea. Trans   Non-GA   GA      Trn     Improve      P-Value
None         71.92    74.19   76.51   2.27±3.69    0.1912
PCA          72.53    74.27   77.23   1.75±1.95    0.0730
LDA          72.95    75.47   77.23   2.52±3.11    0.0968
Both         71.85    76.99   78.33   5.14±2.54    0.0014
Overall P value: 0.0000

Table 4. Result of Linear Regression classifier: classification rate in %

Fea. Trans   Non-GA   GA      Trn     Improve       P-Value
None         77.68    76.49   79.08   -1.19±3.23    NA
PCA          76.48    78.02   79.87   1.53±3.09     0.25
LDA          78.44    77.42   77.36   -1.02±3.15    NA
Both         77.59    77.41   77.84   -0.18±2.79    NA
Overall P value: NA
Conclusions:
1. For the ECG data, the utility of GA-optimized feature weighting and selection depends on the classifier used. The GA-kNN feature weighting and GA-Bayes feature selection yield significant improvement, with three rows showing a significance level of more than 90%, and the fourth showing less improvement, but in the same direction. As a result, the overall significance based on the improvement ratio is 99.99%. In contrast, the GA-optimized linear regression classifier does not show improved performance, and the inconsistency of change direction makes it likely that any systematic improvement due to the GA-optimized weights, if it exists, is very small.
2. The PCA and/or LDA transformation is useful in combination with GA optimization. As we can see from the kNN and Bayes classifiers, although applying the PCA and/or LDA transformation on the non-GA-optimized classifiers yields no major progress, their combination with the GA yields significant improvement as well as better final classification rates. In some sense, the PCA and LDA transformations can help the GA to break the "barrier" of the optimum classification rate.
3. The more accurate the original classifier is, the less improvement GA optimization yields. From the data, we can see that the linear regression classifier is the most powerful classifier if used alone, and also the least improved classifier after GA optimization. Fig. 3 illustrates this conclusion.
Fig. 3. Summary of the ECG classification rate
4.3 Results of GA Search
For small numbers of features (here, ≤ 15) in feature selection, it is possible to apply an exhaustive search over the whole feature space, making it possible to determine whether the GA can find the best solution or not. At the same time, some information about the usefulness of features can be traced from the terrain graph of the whole feature space. In some cases, the global optimum was found. However, in Fig. 4, although the GA found quite a good result, it was not the global optimum. But since the classification rate on the validation samples is related to, but not linearly dependent on, the training rate obtained by the GA, such a near-optimal result still produces good performance for a pattern classification problem.
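With at most 15 features, the exhaustive baseline compared against here is cheap: at most 2^15 − 1 = 32767 non-empty subsets. A sketch of that enumeration follows, with a stand-in scoring function rather than the paper's classifier-based fitness.

```python
from itertools import combinations

def exhaustive_best_subset(n_features, score):
    """Enumerate all 2^n - 1 non-empty feature subsets and keep the best."""
    best_subset, best_score = None, float("-inf")
    for k in range(1, n_features + 1):
        for subset in combinations(range(n_features), k):
            s = score(subset)
            if s > best_score:
                best_subset, best_score = subset, s
    return best_subset, best_score

# Stand-in score: features 0 and 3 help, every included feature costs a little.
useful = {0: 5.0, 3: 4.0}
score = lambda subset: sum(useful.get(f, 0.0) for f in subset) - 0.5 * len(subset)
best, val = exhaustive_best_subset(10, score)
```

Running the GA and this enumeration with the same scoring function is what allows the comparison in Fig. 4: the enumeration certifies the global optimum that the GA may or may not reach.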
Fig. 4. The GA search space (Best not found)
5 Test Results for Other Data
Five datasets from the UCI repository were tested to further validate the results from the previous section. A brief introduction to the data sets is given first, followed by the results and discussion.
5.1 The Testing Datasets
• WDBC: The Wisconsin Diagnostic Breast Cancer [16] data contains 30 features, expanded from the original 10 by taking the mean, standard error and extreme value of the originally measured features. The dataset has 357 benign and 212 malignant samples, with the highest reported classification accuracies around 97%.
• LIVER: The BUPA Liver Disorders data has 6 numerical features on 345 instances, with 2 classes.
• PIMA: The Pima Indians Diabetes Database has 8 features on female diabetes patients, classified as positive and negative. The total of 768 samples has 500 negative and 268 positive samples.
• SONAR: The Sonar data compares mines versus rocks based on their reflected sonar signals. With 111 metal patterns and 97 rock patterns, each instance has 60 feature values [17].
• IONO: The Ionosphere database from Johns Hopkins University [18] contains 351 instances of radar signals returned from the ionosphere, with 34 features. It contrasts "good" and "bad" radar returns, i.e., those that do or do not show evidence of some type of structure in the ionosphere.
5.2 Results for Benchmark Datasets
The results presented here are all based on both the PCA and LDA transformations, which were shown to be useful in GA-optimized pattern classification in Section 4.

Table 5. kNN classifier (Benchmark Datasets)

DATA     Non-GA   GA      Train   Improve      P-Value
WDBC     91.38    94.19   92.24   2.81±2.93    0.0571
LIVER    66.36    66.64   72.89   0.28±6.28    0.9126
PIMA     70.44    72.65   76.97   2.21±2.90    0.1076
SONAR    69.28    74.97   64.06   5.69±5.91    0.0563
IONO     79.19    80.32   69.98   1.13±3.13    0.3986
Overall P value: 0.0001
Table 6. Bayes classifier (Benchmark Datasets)

DATA     Non-GA   GA      Train   Improve        P-Value
WDBC     94.78    94.71   96.66   -0.07±2.44     NA
LIVER    56.96    64.00   67.40   7.04±6.57      0.0334
PIMA     72.91    74.74   76.30   1.84±3.73      0.2864
SONAR    46.83    68.27   76.37   21.45±10.77    0.0015
IONO     64.88    90.64   94.17   25.76±11.52    0.0000
Overall P value: 0.0000
From the results shown in Tables 5–7, we can see that:
1. More than half of the row-wise results show an improvement with significance above 90%. For some settings, such as the Bayes classifier on the SONAR data, the original classification rate is very low, but the GA can make up for this and yield a decent result. The effect of GA optimization here is to reach a fairly good result, if not the best, when the original settings of the classifier are not very good.

Table 7. Regression classifier (Benchmark Datasets)

DATA     Non-GA   GA      Train   Improve       P-Value
WDBC     94.43    95.26   95.49   0.83±2.50     0.4737
LIVER    66.04    67.23   66.33   1.19±6.21     0.6752
PIMA     76.57    76.69   77.55   0.11±3.14     0.9364
SONAR    63.65    72.44   74.78   8.79±9.91     0.0757
IONO     85.12    82.61   84.98   -2.51±5.78    NA
Overall P value: 0.1364

2. The linear regression classifier has the highest classification rate among the three. After GA-optimized feature weighting and selection, the gaps in performance among the various classifiers became smaller, as shown in Fig. 5.
Fig. 5. Comparison, before and after GA optimization
6 Conclusions
The utility of GA-optimized feature weighting and selection depends on both the classifier and the data set. Especially in those cases where a particular classifier has a better classification rate than some other classifiers, the potential improvement from GA optimization of the better classifier seems quite limited in comparison with the improvement achieved on the poorer classifiers. Clearly, to evaluate a new approach involving optimization of the feature space, it is necessary to test on different classifiers, and improvement on the best classifier will be the most convincing evidence of the utility of that method. In this work, the genetic algorithm shows powerful searching ability in high-dimensional feature spaces. By comparing it with an exhaustive search algorithm on small-scale problems, it was determined that the GA found an optimal or nearly optimal solution with a computational complexity of O(n). The results from Sections 4 and 5 indicate that over-fitting exists in the various approaches: while the training performance can be significantly improved, the improvement on the validation samples lags behind in every case. The tests run in Section 4 show that the PCA and LDA transformations are very useful in pattern classification. The significance levels of GA optimization are greatly improved for the kNN and Bayes classifiers, though the absolute values after PCA and LDA transformation without GA optimization do not differ much from the non-PCA and non-LDA cases. The point is, however, that GA-optimized feature extraction and selection extend the utility of those traditional feature transformation techniques.
7 Future Work
To address the problem of over-fitting, one possible approach is to evaluate a solution not only by the classification rate on training data, but also by considering the margin between the classes and the decision boundary, because a larger margin means a more general and more robust classification boundary. Support Vector Machines (SVMs) are classification systems that separate the training patterns by maximizing the margins between the support vectors (the nearest patterns) and the decision boundary in a high-dimensional space. Work by Gartner and Flach [19], which used an SVM rather than a GA to optimize the feature space, yielded a statistically significant result on Bayes classifiers. Another possible improvement may be non-linear feature construction using a GA [4]. Non-linear feature construction can generate more artificial features so that the GA can search for more hidden patterns. However, the problem of over-fitting still theoretically exists.

This work was supported in part by the National Science Foundation of China (NSFC) under Grant 39970210.
References

1. M. Pei, E.D. Goodman, W.F. Punch. "Feature Extraction Using Genetic Algorithms", Proceedings of the International Symposium on Intelligent Data Engineering and Learning '98 (IDEAL'98), Hong Kong, Oct. 1998.
2. David B. Skalak. "Using a Genetic Algorithm to Learn Prototypes for Case Retrieval and Classification", Proceedings of the AAAI-93 Case-Based Reasoning Workshop, pages 64–69, Washington, D.C., American Association for Artificial Intelligence, Menlo Park, CA, 1994.
3. W.F. Punch, E.D. Goodman, Min Pei, Lai Chia-Shun, P. Hovland and R. Enbody. "Further Research on Feature Selection and Classification Using Genetic Algorithms", 5th International Conference on Genetic Algorithms, Champaign, IL, pages 557–564, 1993.
4. Hiram A. Firpi Cruz. "Genetically Found, Neurally Computed Artificial Features With Applications to Epileptic Seizure Detection and Prediction", Master's Thesis, University of Puerto Rico, Mayaguez, 2001.
5. M. Prakash, M. Narasimha Murty. "A Genetic Approach for Selection of (Near-)Optimal Subsets of Principal Components for Discrimination", Pattern Recognition Letters, Vol. 16, pages 781–787, 1995.
6. C.L. Blake, C.J. Merz. UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Irvine, CA: University of California, Department of Information and Computer Science, 1998.
7. S. Bandyopadhyay, C.A. Murthy. "Pattern Classification Using Genetic Algorithms", Pattern Recognition Letters, Vol. 16, pages 801–808, 1995.
8. R. Srikanth, R. George, N. Warsi, D. Prabhu, F.E. Petry, B.P. Buckles. "A Variable-Length Genetic Algorithm for Clustering and Classification", Pattern Recognition Letters, Vol. 16, pages 789–800, 1995.
9. W. Siedlecki, J. Sklansky. "A Note on Genetic Algorithms for Large-Scale Feature Selection", Pattern Recognition Letters, Vol. 10, pages 335–347, 1989.
10. Ludmila I. Kuncheva. "Editing for the k-Nearest Neighbors Rule by a Genetic Algorithm", Pattern Recognition Letters, Vol. 16, pages 809–814, 1995.
11. Richard O. Duda, Peter E. Hart, David G. Stork. Pattern Classification, Second Edition, Wiley, 2001.
12. Ludmila I. Kuncheva and Lakhmi C. Jain. "Designing Classifier Fusion Systems by Genetic Algorithms", IEEE Transactions on Evolutionary Computation, Vol. 4, No. 4, September 2000.
13. Ron Kohavi, George John. "The Wrapper Approach", in Feature Extraction, Construction and Selection: A Data Mining Perspective, edited by Hiroshi Motoda and Huan Liu, Kluwer Academic Publishers, July 1998.
14. George H. John, Ron Kohavi, Karl Pfleger. "Irrelevant Features and the Subset Selection Problem", Proceedings of the Eleventh International Conference on Machine Learning, pages 121–129, Morgan Kaufmann Publishers, San Francisco, CA, 1994.
15. Zhijian Huang. "Genetic Algorithm Optimized Feature Extraction and Selection for ECG Pattern Classification", Master's Thesis, Michigan State University, 2002.
16. O.L. Mangasarian, W.N. Street and W.H. Wolberg. "Breast Cancer Diagnosis and Prognosis via Linear Programming", Operations Research, 43(4), pages 570–577, July–August 1995.
17. Gorman and T.J. Sejnowski. "Learned Classification of Sonar Targets Using Massively Parallel Network", IEEE Transactions on Acoustic, Speech and Signal Processing, 36(7), pages 1135–1140, 1988.
18. V.G. Sigillito, S.P. Wing, L.V. Hutton and K.B. Baker. "Classification of Radar Returns from the Ionosphere Using Neural Networks", Johns Hopkins APL Technical Digest, Vol. 10, pages 262–266, 1989.
19. Thomas Gartner, Peter A. Flach. "WBCSVM: Weighted Bayesian Classification based on Support Vector Machine", Eighteenth International Conference on Machine Learning (ICML-2001), pages 156–161. Morgan Kaufmann, 2001.
Web-Page Color Modification for Barrier-Free Color Vision with Genetic Algorithm

Manabu Ichikawa^1, Kiyoshi Tanaka^1, Shoji Kondo^2, Koji Hiroshima^3, Kazuo Ichikawa^3, Shoko Tanabe^4, and Kiichiro Fukami^4

^1 Faculty of Engineering, Shinshu University, 4-17-1 Wakasato, Nagano-shi, Nagano, 380-8553 Japan, {manabu, ktanaka}@iplab.shinshu-u.ac.jp
^2 Kanto Electronics & Research, Inc., 5134 Yaho, Kunitachi-shi, Tokyo, 186-0914 Japan
^3 Social Insurance Chukyo Hospital, 1-1-10 Sanjyo, Minami-ku, Nagoya-shi, Aichi, 457-8510 Japan
^4 Research Institute of Color Vision, 12-26 Sanbonmatsu-cho, Atsuta-ku, Nagoya-shi, Aichi, 456-0032 Japan
Abstract. In this paper, we propose a color modification scheme for web-pages described in the HTML markup language, in order to realize barrier-free color vision on the internet. First, we present an abstracted image model, which describes a color image as a combination of several regions divided by color information, and define mutual color relations between regions. Next, based on fundamental research on anomalous color vision, we design fitness functions to modify the colors in a web-page properly and effectively. Then we solve the color modification problem, which contains complex mutual color relations, by using a Genetic Algorithm. Experimental results, from both computer simulation and psychological experiments with anomalous vision users, verify that the proposed scheme can make the colors in a web-page more recognizable for such users.
1 Introduction
Due to the rapid development of computers, the internet, and display and printing techniques, we can take advantage of color to describe and deliver digital information. The colors used in the description of a message often carry very important information. For example, various colors of text, graphics, and images are used in web pages on the internet, where a color difference often marks the importance of a piece of information, such as a link to another page. However, from a medical point of view, colorful information description does not always provide real convenience and easy understanding of the information to all users. It is well known that 5-8% of men have so-called anomalous color vision and more difficulty than people with normal vision in recognizing certain colors and color differences.

E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 2134-2146, 2003. © Springer-Verlag Berlin Heidelberg 2003

Those who have a color vision different from normal (anomalous vision people) may miss information that can be recognized by normal vision people, which causes a disparity in the capability of getting
information between them. In this paper, we focus on this problem and try to modify the colors used on web pages into colors that are more recognizable for anomalous vision people, in order to realize barrier-free color vision in the IT society. Anomalous vision has so far mainly been investigated through psychological experiments in the medical field [1,2]. From an engineering point of view, Kondo proposed a color vision model that can describe anomalous color vision and simulate (display) the colors that anomalous vision people may observe [3]. In this model, anomalous vision can be treated as if a certain unit, or a channel between units, of the model were broken or its function deteriorated. In this paper, we follow this fundamental research and propose a novel web-page color modification method to realize barrier-free color vision. We first propose an abstracted image model for general color images and apply it to web pages. Next, we propose an optimization method for the color arrangement in the abstracted image model for web pages. As the number of colors used in a web page increases, the complexity of the color arrangement grows rapidly. We therefore use a Genetic Algorithm (GA), widely known as a robust optimization method [4], for color arrangement optimization of web pages. We show some experimental results and discuss them.
2 Color Vision Model
First of all, we show a normal vision model based on the stage theory in Fig. 1; it is an equivalent circuit that explains the phenomena of color vision. In this figure, R, G and B denote cones that produce output signals depending on the input stimuli. The output signals from the cones produce a luminance signal L through a unit denoted V, and two opponent color signals, Crg and Cyb, through opponent processes denoted r-g and y-b via intermediate units r and y. The function proposed by Smith and Pokorny [5] is used as the cone sensitivity function in Eq.(1), where x(λ), y(λ) and z(λ) are the color matching functions:

$$\begin{pmatrix} S_r(\lambda) \\ S_g(\lambda) \\ S_b(\lambda) \end{pmatrix} = \begin{pmatrix} 0.15514 & 0.54312 & -0.03286 \\ -0.15514 & 0.45684 & 0.03286 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x(\lambda) \\ y(\lambda) \\ z(\lambda) \end{pmatrix} \qquad (1)$$

L, Crg and Cyb in Fig. 1 are obtained by Eq.(2), which is derived from a linear combination of the cone sensitivity functions of Eq.(1) with a suitable normalization, rounding off small fractions for simplicity:

$$\begin{pmatrix} C_{rg}(\lambda) \\ C_{yb}(\lambda) \\ L(\lambda) \end{pmatrix} = \begin{pmatrix} 2.00 & -2.00 & 0.00 \\ 0.00 & 1.00 & -1.00 \\ 0.00 & 1.00 & 0.00 \end{pmatrix} \begin{pmatrix} x(\lambda) \\ y(\lambda) \\ z(\lambda) \end{pmatrix} \qquad (2)$$

Next, we show an anomalous vision model [3], in which some units and channels of the normal vision model are partially changed. First, in the model of "protan", the normal cone R changes to an abnormal one R′, which changes the
Fig. 1. A normal vision model
channel from R, and the output from R′ goes only to r, as shown in Fig. 2(a). Because the output signal from R′ is very weak and fragile, this model can explain the phenomena of "protan" without contradicting medical knowledge. In Fig. 2(a), the parameter KP (0 ≤ KP < 1) denotes the degree of color weakness: KP = 0 corresponds to complete "protanopia", and KP > 0 to "protanomalia". (KP = 1 would mean a hypothetical "protan" who has an anomalous cone but a perfect hue discrimination capability like normal observers.) The opponent signals Crg and Cyb and the luminance signal L are obtained by

$$\begin{pmatrix} C_{rg}(\lambda) \\ C_{yb}(\lambda) \\ L(\lambda) \end{pmatrix} = \begin{pmatrix} 0.252K_P & -0.203K_P & -0.011K_P \\ -0.155 & 0.457 & -0.251 \\ -0.155 & 0.457 & 0.033 \end{pmatrix} \begin{pmatrix} x(\lambda) \\ y(\lambda) \\ z(\lambda) \end{pmatrix}, \qquad (3)$$

where the coefficients are derived from numerical experiments. Similarly, in the model of "deutan", the normal cone G changes to an abnormal one G′; this changes the channel from G, and the output from G′ goes only to r-g, as shown in Fig. 2(b). In this figure, the parameter KD (0 ≤ KD < 1) denotes the degree of color weakness: KD = 0 corresponds to complete "deuteranopia", and KD > 0 to "deuteranomalia". (KD = 1 would mean the hypothetical "deutan" with perfect hue discrimination.) The output signals Crg, Cyb and L are obtained by

$$\begin{pmatrix} C_{rg}(\lambda) \\ C_{yb}(\lambda) \\ L(\lambda) \end{pmatrix} = \begin{pmatrix} 0.105K_D & -0.132K_D & 0.020K_D \\ 0.155 & 0.543 & -0.969 \\ 0.155 & 0.543 & -0.033 \end{pmatrix} \begin{pmatrix} x(\lambda) \\ y(\lambda) \\ z(\lambda) \end{pmatrix}. \qquad (4)$$

We can simulate anomalous color vision by using Eqs.(3) and (4) together with the inverse transformation of Eq.(2). That is, we first input RGB signals into the anomalous color vision model and obtain the corresponding Crg, Cyb and L signals. We then feed these in from the output side of Fig. 1 and transform them back to RGB signals by the inverse transformation of Eq.(2). Through this calculation, we obtain the colors that anomalous vision people may observe. If two colors obtained through this process are similar (close to each other), we can judge that the combination of colors might be hard for anomalous vision people.
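As a minimal sketch of this round trip, the forward anomalous model of Eq.(3) can be composed with the inverse of the normal model of Eq.(2). Note that the paper's pipeline starts from RGB signals; this sketch covers only the opponent-space stage that the equations specify, and the sample stimulus value is arbitrary.

```python
def opponent_signals(xyz):
    """Eq.(2): normal-model opponent signals (Crg, Cyb, L) from (x, y, z)."""
    x, y, z = xyz
    return 2.0 * x - 2.0 * y, y - z, y

def protan_opponent(xyz, kp=0.0):
    """Eq.(3): opponent signals of a protan with color weakness degree KP
    (KP = 0 corresponds to complete protanopia)."""
    x, y, z = xyz
    crg = 0.252 * kp * x - 0.203 * kp * y - 0.011 * kp * z
    cyb = -0.155 * x + 0.457 * y - 0.251 * z
    lum = -0.155 * x + 0.457 * y + 0.033 * z
    return crg, cyb, lum

def invert_normal(opp):
    """Inverse of Eq.(2): from Crg = 2x - 2y, Cyb = y - z, L = y
    we get y = L, x = Crg/2 + L, z = L - Cyb."""
    crg, cyb, lum = opp
    return crg / 2.0 + lum, lum, lum - cyb

def simulate_protan(xyz, kp=0.0):
    """The round trip described in the text: anomalous forward model,
    then back through the inverse of the normal model."""
    return invert_normal(protan_opponent(xyz, kp))
```

For KP = 0 (complete protanopia) the red-green signal Crg vanishes, so the reconstructed x and y components coincide in this sketch, reflecting the lost red-green discrimination.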
Fig. 2. Models of anomalous vision: (a) protan (P, PA); (b) deutan (D, DA)

Table 1. Chromaticity coordinates for protanopic and deuteranopic convergence points

              x        y
Protanopic    0.7465   0.2535
Deuteranopic  1.4000  -0.4000

When we select a number of color pairs whose differences are hard for anomalous vision people to recognize and plot them on the chromaticity diagram, they tend to lie on straight lines. Such lines are called "confusion lines", and they converge to a point called the "convergence point". This unique point has been obtained experimentally, and several values have been proposed so far; in this paper, we adopt the values proposed by Smith and Pokorny [5], shown in Table 1.
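As an illustrative sketch (not part of the paper's method), two chromaticities can be tested for likely confusability by checking whether they are nearly collinear with a convergence point from Table 1:

```python
PROTAN_CP = (0.7465, 0.2535)  # protanopic convergence point (Table 1)

def on_confusion_line(c1, c2, cp=PROTAN_CP, tol=1e-3):
    """True if chromaticities c1 and c2 are (nearly) collinear with the
    convergence point cp, i.e. lie on a single confusion line."""
    (x1, y1), (x2, y2), (xc, yc) = c1, c2, cp
    # twice the signed area of the triangle (c1, c2, cp); zero => collinear
    area2 = (x2 - x1) * (yc - y1) - (xc - x1) * (y2 - y1)
    return abs(area2) < tol
```

The tolerance `tol` is an assumed threshold; a practical implementation would calibrate it against measured discrimination data.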
3 Abstracted Image Model and Color Modification

3.1 Abstracted Image Model
We consider a color image as a combination of several color regions, each of which contains similar colors. In this paper, we call a color image divided into several colored regions the abstracted image model, as shown in Fig. 3. In this model we can define relations between regions as follows. We define "included" as the relation in which a region completely contains another region (or regions). The containing region is called the "parent region" and the region it includes is called a "child region". In Fig. 3, since region D contains F, we describe this relation as

D ⊃ F.  (5)
On the other hand, we define "even" as the relation in which two regions have the same "parent" region. Note that the outside of an image can also be considered a "parent" region. In Fig. 3, since regions D and E are "even" with each other, we describe this relation as

{D, E}.  (6)
Fig. 3. An example of the abstracted image model
Fig. 4. Mutual color relations to be considered
These "included" and "even" relations are often satisfied inter-regionally. For example, since region A contains the two regions D and E, which are "even" with each other, we can describe A ⊃ {D, E}. As another example, since both regions A and B contain region C, we can describe {A, B} ⊃ C. Therefore, we can describe the entire set of relations in Fig. 3 as

{{A ⊃ {{D ⊃ F}, E}}, B} ⊃ C.  (7)
3.2 Color Modification for Divided Regions
When we human beings recognize regions by using colors, we make use of the color difference between regions. In the "included" case, a "parent" region and its "child" region neighbor each other, so the boundary becomes easier to recognize if we enhance the color difference between the two regions. In the "even" case, we can also make the boundary more recognizable by enhancing the color difference, but we should additionally consider the distance between the regions: the necessity of enhancing the color difference between "even" regions varies depending on that distance. Fig. 4 depicts the graph of mutual color relations used to improve the recognizability of Eq.(7). The solid and broken lines denote "included" and "even", respectively, and bold lines show where color difference enhancement between regions is necessary. In this paper, we modify all region colors by considering such mutual color relations in the image, in order to obtain more recognizable color combinations for anomalous vision users.
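A minimal data-structure sketch of these relations (the encoding and function names are ours, not from the paper): each region records its parents, with `None` standing for the outside of the image, and the candidate pairs for color-difference enhancement are the parent-child ("included") pairs plus the "even" siblings.

```python
# Regions of Fig. 3; region C has two parents ({A, B} > C), so we keep a
# list of parents per region. None denotes the outside of the image.
parents = {
    "A": [None], "B": [None],
    "C": ["A", "B"],
    "D": ["A"], "E": ["A"],
    "F": ["D"],
}

def included_pairs(parents):
    """Parent-child pairs: enhancing the color difference between these
    makes the region boundary easier to recognize."""
    return {(p, c) for c, ps in parents.items() for p in ps if p is not None}

def even_pairs(parents):
    """Pairs of regions sharing at least one parent (the "even" relation;
    the shared parent may be the outside of the image)."""
    regions = sorted(parents)
    return {(r1, r2) for i, r1 in enumerate(regions) for r2 in regions[i + 1:]
            if set(parents[r1]) & set(parents[r2])}
```

For Fig. 3 this yields the "included" pairs (A, C), (B, C), (A, D), (A, E), (D, F) and "even" pairs such as {A, B} and {D, E}, matching Eq.(7).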
3.3 Applying the Abstracted Image Model to Web-Pages
Let us consider the application of the abstracted image model to web-pages described in the HTML markup language. In an HTML description we can specify colors for the background, characters, parts of characters, figures, tables, and so on. The background and the characters on it have an "included" relation, and two different colors of characters have an "even" relation. Since the background is the "parent" region and the characters on it are its "child" regions, each "child" region has a single "parent" region
and is not contained in multiple "parent" regions. In this case, since a "child" region (a character color) neighbors its "parent" region (the background color), we enhance the color difference between them in order to make the web-page more recognizable. In the "even" case we should consider the mutual color relation between characters: a portion of the characters may have a color different from the rest, for example to indicate a link to users. In this case, we enhance the color difference between the two character colors. We should also balance the weights for color modification between the "included" and "even" cases. When we enhance a color difference in an HTML description, we must consider its controllable range. Because colors are specified in the RGB color space and the dynamic range of the (R, G, B) signals is limited to 0-255, we cannot describe all colors of the chromaticity diagram [6]. Also, since the (R, G, B) signals must be positive, the displayable colors are limited to a narrow range when the lightness of the color is low or high. In this paper, we therefore separate a color into its lightness and spectral components and evaluate them separately for optimization. The number of colors used in the entire web-page is an important parameter of the optimization: in general, the number of mutual color relations to be considered increases with the number of colors n. We therefore use a Genetic Algorithm (GA) [4] to solve this complex color modification problem.
4 Color Modification with Genetic Algorithm

4.1 Individual Representation
We try to generate the optimum color vector (C1*, C2*, ..., Cn*) from the original color vector (C1, C2, ..., Cn) so as to make the web-page more recognizable for anomalous vision users. One could design an individual that carries all n colors, but the amount of information per individual would then become large, enlarging the solution space and possibly deteriorating the search speed remarkably. Since the mutual color relations can be extracted from the HTML description by analyzing the scanned web-page, we instead divide the entire solution space into n sub-spaces, one for each color Ci (i = 1, 2, ..., n). Each individual is therefore described by a 24-bit binary representation (8 bits for each (R, G, B) component). (The evaluation method that accounts for the mutual color relations is described in 4.3.)
4.2 Design of Fitness Functions
We design a function that evaluates a fitness value between the color represented by an individual and a basis color CB. The evaluation is carried out in the Luv perceptually uniform color space [6]. We denote the color represented by an individual as C = (L, u, v) and its filtered version, obtained by the computer
simulation of anomalous vision, as C′ = (L′, u′, v′). Similarly, we denote the basis color and its filtered version as CB = (LB, uB, vB) and C′B = (L′B, u′B, v′B), respectively, and the original color and its filtered version as CO = (LO, uO, vO) and C′O = (L′O, u′O, v′O), respectively. We define a normalized spectral color difference between two colors Cx and Cy in the Luv color space as

$$d_{(C_x,C_y)} = \frac{\sqrt{(u_x - u_y)^2 + (v_x - v_y)^2}}{d_{\max}}, \qquad (8)$$

where dmax is the maximum color distance in Luv space. Furthermore, we define a normalized lightness difference between two colors as

$$b_{(C_x,C_y)} = \frac{|L_x - L_y|}{b_{\max}}, \qquad (9)$$
where bmax is the maximum brightness in Luv space.

(1) Weighting Coefficient between Spectral Colors and Brightness. As described in 3.3, it is difficult to control the color difference through spectral colors when the lightness of the color is low or high; in that case we stress lightness enhancement. When the lightness of the color is medium, we stress color-difference enhancement through spectral colors. We therefore define the weighting coefficient α (0 ≤ α ≤ 1) between spectral colors and brightness as a function of LB, designed such that α(LB) ≈ 0 when LB is low or high and α(LB) ≈ 1 when LB is intermediate.

(2) Evaluation for Spectral Colors. We positively evaluate the color difference d(C′,C′B) between C′ and C′B obtained through the simulation of anomalous vision, but negatively evaluate the difference d(C,CO) between C and CO, in order to keep the original colors as far as possible. We therefore use the following function to evaluate the color C represented by an individual:

$$f_c = d_{(C',C'_B)} \left( 1 - \frac{d_{(C,C_O)}}{g_c(d_{(C_O,C_B)})} \right), \qquad (10)$$

where gc(d(CO,CB)) is a constraint function on the original spectral colors. The constraint gradually strengthens as gc approaches 0.

(3) Evaluation for Brightness. We evaluate the brightness of the color so as to control it properly, depending on the brightness of CB and the brightness difference between C and CB:

$$f_b = b_{(C',C'_B)} \left( 1 - \frac{b_{(C,C_O)}}{g_b(b_{(C_O,C_B)})} \right), \qquad (11)$$

where gb(b(CO,CB)) is a constraint function on the original brightness. The constraint gradually strengthens as gb approaches 0.
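Under our reading of the scanned formulas for Eqs.(8), (9), (10) and (14), with the constraint function dividing the penalty term, a sketch looks like this. Colors are (L, u, v) triples; the normalization constants are placeholders (the paper uses the maxima of the Luv space).

```python
import math

D_MAX = 1.0  # assumed normalization constants standing in for the
B_MAX = 1.0  # maximum color distance and brightness of the Luv space

def d(cx, cy):
    """Eq.(8): normalized spectral (u, v) distance between two colors."""
    return math.hypot(cx[1] - cy[1], cx[2] - cy[2]) / D_MAX

def b(cx, cy):
    """Eq.(9): normalized lightness difference between two colors."""
    return abs(cx[0] - cy[0]) / B_MAX

def g_c(dist):
    """Eq.(14): constraint function on the original spectral colors."""
    return 0.6 * (0.7 * (1.0 - dist) + 0.3)

def f_c(c, c_sim, cb_sim, c_orig, cb_orig):
    """Eq.(10): reward the simulated difference from the basis color,
    penalize deviation of the candidate C from the original color CO."""
    return d(c_sim, cb_sim) * (1.0 - d(c, c_orig) / g_c(d(c_orig, cb_orig)))
```

When the candidate equals the original color the penalty term vanishes and f_c reduces to the simulated difference d(C′, C′B), which is the intended behavior of the reward term.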
(4) Combined Fitness Function. We combine the two evaluation functions fc (Eq.(10)) and fb (Eq.(11)) with the weighting coefficient α:

$$f = \alpha f_c + (1 - \alpha) f_b. \qquad (12)$$

With this equation, we finally evaluate the color C represented by an individual against the basis color CB.
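The weighting and combination of Eqs.(12), (13) and (15) can be sketched as follows; b_max is assumed here to be 100, the usual L* range, and the concrete coefficients are those of the equations given in Section 5.

```python
import math

B_MAX = 100.0  # assumed: L* in the Luv space ranges over 0..100

def alpha(lb):
    """Eq.(13): weight between spectral-color and brightness evaluation;
    small for very dark or very bright basis colors, large for medium."""
    t = abs(lb / B_MAX - 0.5) / 0.5
    return 0.1 + 0.65 * (0.5 * math.cos(math.pi * t) + 0.5)

def g_b(brightness_diff):
    """Eq.(15): constraint function on the original brightness."""
    return 2.0 * (0.7 * brightness_diff + 0.3)

def combined_fitness(lb, fc, fb):
    """Eq.(12): f = alpha * fc + (1 - alpha) * fb."""
    a = alpha(lb)
    return a * fc + (1.0 - a) * fb
```

At medium lightness (LB = 50) the weight is α = 0.75, so the spectral-color term dominates; at the extremes α drops to 0.1 and the brightness term dominates, as the design in 4.2 intends.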
4.3 Evaluation Method by Considering Mutual Color Relations
Each color Ci (i = 1, 2, ..., n) has mutual color relations, such as "included" and "even" (see 3.1), with other colors in (C1, C2, ..., Cn), obtained by scanning the HTML web-page data. We therefore construct an evaluation method that takes these mutual relations into account. Fig. 5 shows the mutual evaluation method between two colors C1 and C2. Although C1 and C2 should each be determined by considering its partner color, both colors keep changing (evolving) over the generations of the GA. In the proposed scheme, the individuals in population P1, used to determine C1, are evaluated at generation t against the best individual C2*(t-1) of population P2, used to determine C2, from generation t-1. Note that the best individual of P1 at generation t-1, C1*(t-1), is re-evaluated against C2*(t-1), and the best individual C1*(t) is determined among the individuals of P1(t) together with C1*(t-1). Similarly, the individuals in population P2 are evaluated against the best individual C1*(t-1). When a color has more than two relations, the fitness values for the multiple basis colors are averaged. In this way the proposed scheme synchronizes the evolution of all populations Pi (i = 1, 2, ..., n) corresponding to the colors Ci (i = 1, 2, ..., n), generation by generation. Finally, the best individuals (C1*, C2*, ..., Cn*) collected from the fully matured populations are output and displayed as the modified colors.
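The lockstep co-evolution of Fig. 5 can be sketched as the loop below; `evaluate` and `breed` are placeholders for the paper's GA-SRM machinery, and the elitist re-evaluation against the previous generation's best is modeled by adding that individual to the candidate pool.

```python
def coevolve(populations, evaluate, breed, generations):
    """Each population P_i optimizes one color C_i. Individuals are
    scored against the previous generation's best individuals of the
    partner populations, and all populations advance in lockstep."""
    n = len(populations)
    bests = [pop[0] for pop in populations]  # provisional bests
    for _ in range(generations):
        new_bests, new_pops = [], []
        for i, pop in enumerate(populations):
            partners = [bests[j] for j in range(n) if j != i]
            candidates = pop + [bests[i]]  # elitist re-evaluation
            new_bests.append(max(candidates,
                                 key=lambda ind: evaluate(ind, partners)))
            new_pops.append(breed(pop, partners))
        populations, bests = new_pops, new_bests
    return bests
```

With n colors this runs n populations in parallel, each scored against its partners' previous bests; averaging over multiple basis colors, as the paper does for colors with several relations, would go inside `evaluate`.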
4.4 Genetic Operators in GA
In the proposed scheme we employ an improved GA (GA-SRM) [7,8] to achieve efficient and reliable optimization. GA-SRM uses two kinds of cooperative-competitive genetic operators and combines them with an extinctive selection method. This framework has been verified to be quite effective on 0/1 multiple-knapsack problems [7,8], NK-landscape problems [9,10], and image halftoning problems [11,12]. In this paper, we follow the basic framework of GA-SRM: offspring are created in parallel with the CM (Crossover and serial background Mutation) and SRM (Self-Reproduction with Mutation) operators, and the offspring compete for survival through (µ, λ) proportional selection [13]. For CM we adopt one-point crossover and background mutation with mutation probability Pm(CM). For SRM we adopt the ADP (Adaptive Dynamic Probability) mutation strategy [7], which adaptively reduces the mutation probability Pm(SRM) depending on the mutants' survival ratio.
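A structural sketch of one GA-SRM generation as described above; the operator internals and the selection rule are simplified placeholders (extinctive selection is reduced to truncation here, whereas the paper uses (µ, λ) proportional selection, and ADP adaptation of the SRM mutation rate is omitted).

```python
import random

def ga_srm_generation(parents, fitness, cm, srm, mu, lam):
    """Create lambda offspring in parallel with the CM and SRM operators,
    then let them compete: only the mu best survive (extinctive
    selection, simplified to truncation)."""
    offspring = []
    for _ in range(lam // 2):
        # CM: crossover of two random parents plus background mutation
        offspring.append(cm(random.choice(parents), random.choice(parents)))
        # SRM: self-reproduction with (adaptive) mutation of one parent
        offspring.append(srm(random.choice(parents)))
    offspring.sort(key=fitness, reverse=True)
    return offspring[:mu]
```

The point of the structure is that CM and SRM offspring are produced in parallel and only meet in the selection step, which is what makes the two operators cooperative yet competitive.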
Fig. 5. Mutual evaluation method between two colors
5 Experimental Results and Discussion
In this paper, we use the following functions for α(LB), gc and gb; their parameters were determined experimentally:

$$\alpha(L_B) = 0.1 + 0.65 \left( 0.5 \cos\!\left( \pi \, \frac{|L_B/b_{\max} - 0.5|}{0.5} \right) + 0.5 \right) \qquad (13)$$

$$g_c\!\left(d_{(C_O,C_B)}\right) = 0.6 \left( 0.7 \left( 1 - d_{(C_O,C_B)} \right) + 0.3 \right) \qquad (14)$$

$$g_b\!\left(b_{(C_O,C_B)}\right) = 2 \left( 0.7\, b_{(C_O,C_B)} + 0.3 \right) \qquad (15)$$

First, we prepared two test images (images A and B) described in the HTML markup language, shown in Fig. 6(a) and (b). In these examples, the color relations can be described as C1 ⊃ {C2, C3}, with a background color C1 and two character colors C2 and C3. Fig. 6(c)-(f) shows the results of the computer simulation of anomalous color vision [3] with the degrees of color weakness, KP and KD, set to 0; these images can be regarded as the color appearance observed by anomalous vision people. We can see that the colors in all images are deteriorated and it becomes hard to recognize the characters against the background and the two sets of characters from each other. Next, Fig. 7 shows the results after applying the proposed scheme to the test images. Here we set the weight between the "included" and "even" relations to 2:1 and use the genetic parameters of Table 2. Fig. 7 shows the images with modified colors and their appearance through the computer simulation of anomalous color vision. Compared with the unmodified images of Fig. 6, we can see that the color differences are enhanced, and our scheme makes it easier to recognize
Fig. 6. Test images and their appearance through computer simulation of anomalous color vision: (a) test image A; (b) test image B; (c) protan appearance of image A; (d) deutan appearance of image A; (e) protan appearance of image B; (f) deutan appearance of image B

Table 2. Genetic parameters of GAs

                        cGA                     GA-SRM
Crossover ratio Pc      0.6                     1.0
Mutation ratio Pm       1/24                    Pm(CM) = 1/24; Pm(SRM) = 1/8, 1/24 [7,8]
Selection method        Proportional selection  (µ, λ) proportional selection
Population size N       64                      64
Evaluation numbers T    128,000                 128,000
(distinguish) the characters from the background and the two sets of characters from each other. This is evidence that the fitness functions designed in 4.2 work effectively.

Fig. 7. Output images with modified colors and their appearance through computer simulation of anomalous color vision: (a)-(d) modified images (protan/deutan models, images A and B); (e)-(h) their simulated appearance

Next, Fig. 8(a) shows the evolution of solutions achieved by GA-SRM, together with results by cGA (canonical GA) and GA(µ, λ) (cGA with (µ, λ) proportional selection but without parallel varying mutation); the plots are averaged over 100 random runs. From this figure we can see that GA-SRM generates a high-fitness solution with fewer evaluations than the other configurations. Fig. 8(b) depicts the standard deviation σ around the averages of Fig. 8(a): the standard deviation attained by GA-SRM is very small, so GA-SRM achieves a reliable solution search.

Furthermore, we conducted psychological experiments with the cooperation of actual anomalous vision people (a protan, a protanomalia, a deutan and a deuteranomalia; four examinees in total). We prepared 15 sample pages containing two colors (a background color and a character color) and 16 sample pages containing three colors (a background color and two character colors). All pages have different color combinations. For each sample we showed two pages to the examinee: the original page and the page modified by the proposed scheme, randomly displayed on the left-hand or right-hand side of the screen. To prevent the examinees from expecting one side to always be the more recognizable one, we randomly included some dummy pages in which both sides were identical (no change). We asked each examinee which page was more recognizable, or whether they were even, and, if one was chosen, how much more recognizable it was (somewhat/remarkably). We scored as follows: 0 points for even, +1 point if the modified page was somewhat more recognizable, +2 points if the modified page was remarkably more recognizable, -1 point if the original page was somewhat more recognizable, and -2 points if the original page was remarkably more recognizable. Finally, we summed the points over all sample pages to measure the improvement in color recognizability.

The obtained results are shown in Table 3. Around +1.0 points on average were obtained for the dichromatic examinees (KP = KD = 0), and positive scores (around +0.7) for the anomalous trichromatic ones. The lower scores in the latter case arise because some original pages were already recognizable for those examinees without modification. In general, the effect of the color modification increases as the font size decreases, because it is sometimes very hard even for people with normal color vision to distinguish the color of small-font characters against a background color. The improvement is slightly smaller in the three-color case, which is mainly caused by the compromise between mutual color relations. Nevertheless, the modified page never scored negatively on any sample, and we achieved a positive +0.85 points as the overall average. Therefore, we can
Fig. 8. Performance achieved by cGA and GA-SRM: (a) fitness and (b) standard deviation versus number of evaluations T, for cGA(64), GA(51,64) and GA-SRM(51,64)
Table 3. Results by psychological experiments with actual anomalous vision people

type                    protan  protanomalia  deutan  deuteranomalia
Two colors (10pt)       +1.23   +0.83         +1.10   +0.83
Two colors (12pt)       +1.20   +0.87         +1.13   +0.77
Two colors (18pt)       +1.10   +0.37         +1.03   +0.77
Two colors (Average)    +1.18   +0.69         +1.09   +0.79
Three colors (10pt)     +0.88   +0.94         +0.84   +0.72
Three colors (12pt)     +0.94   +0.84         +0.78   +0.63
Three colors (18pt)     +0.88   +0.22         +0.91   +0.59
Three colors (Average)  +0.90   +0.67         +0.84   +0.65
Total average           +1.03   +0.68         +0.96   +0.72
say that the proposed scheme successfully modifies colors on web pages to more recognizable ones for anomalous vision users.
6 Conclusions
In this paper, we have proposed a color modification scheme for web-pages described in the HTML markup language, in order to realize barrier-free color vision on the internet. We solved the color modification problem, which contains complex mutual color relations, by using an improved Genetic Algorithm (GA-SRM). Through computer simulation and psychological experiments we verified that the proposed scheme can make the colors in a web-page more recognizable for anomalous vision users. The proposed scheme can be implemented in web servers or in PCs on the internet; with either option, all web pages passing through the system are displayed after color modification, so anomalous vision users can always view recognizable pages.
As future work, we will further investigate the implementation aspects of this scheme. We also plan to extend the scheme to modify colors in any kind of color image, such as natural images, to further realize barrier-free color vision in the internet society.
References

1. Joel Pokorny, Vivianne C. Smith, Guy Verriest, A.J.L.G. Pinckers, Congenital and Acquired Color Vision Defects, Grune and Stratton, 1979.
2. Committee on Vision, Assembly of Behavioral and Social Sciences, National Research Council, Procedures for Testing Color Vision: Report of Working Group 41, National Academy Press, 1981.
3. S. Kondo, "A Computer Simulation of Anomalous Color Vision", Color Vision Deficiencies, Kugler & Ghedini Pub., pp. 145-159, 1990.
4. D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley, 1989.
5. V.C. Smith and J. Pokorny, "Spectral sensitivity of the foveal cone photopigments between 400 and 500 nm", Vision Research, 15, pp. 161-171, 1975.
6. Andrew S. Glassner, Principles of Digital Image Synthesis, Morgan Kaufmann Publishers, Inc., 1995.
7. H. Aguirre, K. Tanaka and T. Sugimura, "Cooperative Model for Genetic Operators to Improve GAs", Proc. IEEE ICIIS, pp. 98-109, 1999.
8. H. Aguirre, K. Tanaka, T. Sugimura and S. Oshita, "Cooperative-Competitive Model for Genetic Operators: Contributions of Extinctive Selection and Parallel Genetic Operators", Proc. Late Breaking Papers at GECCO, pp. 6-14, 2000.
9. M. Shinkai, H. Aguirre and K. Tanaka, "Mutation Strategy Improves GA's Performance on Epistatic Problems", Proc. IEEE CEC, pp. 968-973, 2002.
10. Hernan Aguirre and Kiyoshi Tanaka, "Genetic algorithms on NK-landscapes: effects of selection, drift, mutation, and recombination", Proc. 3rd European Workshop on Evolutionary Computation in Combinatorial Optimization, LNCS (Springer), Vol. 2611, to appear, 2003.
11. H. Aguirre, K. Tanaka and T. Sugimura, "Accelerated Halftoning Technique Using Improved Genetic Algorithm with Tiny Populations", Proc. IEEE SMC, pp. 905-910, 1999.
12. H. Aguirre, K. Tanaka, T. Sugimura and S. Oshita, "Halftone Image Generation with Improved Multiobjective Genetic Algorithm", Proc. EMO'01, LNCS (Springer), Vol. 1993, pp. 501-515, 2001.
13. T. Bäck, Evolutionary Algorithms in Theory and Practice, Oxford Univ. Press, 1996.
Quantum-Inspired Evolutionary Algorithm-Based Face Verification

Jun-Su Jang, Kuk-Hyun Han, and Jong-Hwan Kim

Dept. of Electrical Engineering and Computer Science, Korea Advanced Institute of Science and Technology (KAIST), Guseong-dong, Yuseong-gu, Daejeon, 305-701, Republic of Korea, {jsjang,khhan,johkim}@rit.kaist.ac.kr
Abstract. Face verification is considered to be the main part of a face detection system. To detect human faces in images, face candidates are extracted and face verification is performed. This paper proposes a new face verification algorithm using the Quantum-inspired Evolutionary Algorithm (QEA). The proposed verification system is based on Principal Components Analysis (PCA). Although PCA-related algorithms have shown outstanding performance, the problem lies in the selection of eigenvectors: they may not be the optimal ones for representing the face features. Moreover, a threshold value must be selected properly, considering both the verification rate and the false alarm rate. To solve these problems, QEA is employed to find the optimal distance measure, under a predetermined threshold value, that distinguishes face images from non-face images. The proposed verification system is tested on the AR face database and the results are compared with previous works to show the improvement in performance.
1 Introduction
Most approaches to face verification fall into one of two categories: those based on local features and those based on holistic templates. In the former category, facial features such as the eyes and mouth, together with some other constraints, are used to verify face patterns. In the latter category, 2-D images are directly classified into face groups using pattern recognition algorithms. We focus on face verification under the holistic approach. The basic approach to verifying face patterns is a training procedure that classifies examples into face and non-face prototype categories. The simplest holistic approaches rely on template matching [1], but these have poor performance compared to more complex techniques like neural networks. The first neural network approach to face verification was based on multilayer perceptrons [2], and advanced algorithms were studied by Rowley [3]. The neural network was designed to look through a window of 20×20 pixels and was trained on face and non-face data. The face detection task was then performed with a window scanning technique: the face verification network was applied to the input image at all possible face locations and scales.

E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 2147-2156, 2003. © Springer-Verlag Berlin Heidelberg 2003
J.-S. Jang, K.-H. Han, and J.-H. Kim
One of the most famous methods among holistic approaches is Principal Components Analysis (PCA), well known as eigenfaces [4]. Given an ensemble of different face images, the technique first finds the principal components of the face training data set, expressed as the eigenvectors of the covariance matrix of the face vector distribution. Each individual face in the face set can then be approximated by a linear combination of the eigenvectors. Since the reconstruction of a face from its principal components is an approximation, a residual reconstruction error is defined in the algorithm as a measure of faceness. This residual reconstruction error, termed the “distance-from-face-space” (DFFS), gives a good indication of the existence of a face [5]. Moghaddam and Pentland have further developed this technique within a probabilistic framework [6]. PCA is an appropriate way of constructing a subspace for representing an object class in many cases, but it is not necessarily optimal for distinguishing the face class from the non-face class. Face space might be better represented by dividing it into subclasses, and several methods have been proposed for doing this. Sung and Poggio proposed a mixture of multidimensional Gaussians model and used an adaptively changing normalized Mahalanobis distance metric [7]. Since then, many face space analysis algorithms have been investigated, and some of them show outstanding performance. The problem with PCA-related approaches lies in the selection of eigenvectors: they may not be the optimal ones for representing the face features. Moreover, a threshold value should be selected properly, considering both the verification rate and the false alarm rate. By employing QEA, the performance of face verification is improved enough to distinguish between face images and non-face images.
In this paper, eigenfaces are constructed based on PCA, and a set of weight factors is selected using the Quantum-inspired Evolutionary Algorithm (QEA) [8]. QEA has lately become a subject of special interest in evolutionary computation. It is based on the concepts and principles of quantum computing, such as the quantum bit and the superposition of states. Instead of a binary, numeric, or symbolic representation, it uses the Q-bit as a probabilistic representation. Its performance was tested on the knapsack problem, which produced an outstanding result [8]. This paper is organized as follows. Section 2 describes QEA briefly. Section 3 presents PCA and density estimation. Section 4 presents how QEA is applied to optimize the decision boundary between face images and non-face images. Section 5 presents the experimental results and discussions. Finally, conclusions and further work follow in Section 6.
2 Quantum-Inspired Evolutionary Algorithm (QEA)
QEA [8] can treat the balance between exploration and exploitation more easily when compared to conventional GAs (CGAs). Also, QEA can explore the search space with a small number of individuals (even with only one individual for real-time application) and exploit the global solution in the search space within a short span of time. QEA is based on the concept and principles of quantum computing, such as a quantum bit and superposition of states. However, QEA is not a quantum algorithm, but a novel evolutionary algorithm. Like other
Procedure QEA
begin
    t ← 0
    i)    initialize Q(t)
    ii)   make P(t) by observing the states of Q(t)
    iii)  evaluate P(t)
    iv)   store the best solutions among P(t) into B(t)
    while (not termination condition) do
    begin
        t ← t + 1
        v)    make P(t) by observing the states of Q(t − 1)
        vi)   evaluate P(t)
        vii)  update Q(t) using Q-gates
        viii) store the best solutions among B(t − 1) and P(t) into B(t)
        ix)   store the best solution b among B(t)
        x)    if (global migration condition)
              then migrate b to B(t) globally
        xi)   else if (local migration condition)
              then migrate b_j^t in B(t) to B(t) locally
    end
end
Fig. 1. Procedure of QEA.
evolutionary algorithms, QEA is also characterized by the representation of the individual, the evaluation function, and the population dynamics. QEA is designed with a novel Q-bit representation, a Q-gate as a variation operator, an observation process, a global migration process, and a local migration process. QEA uses a new probabilistic representation, called the Q-bit, based on the concept of qubits, and a Q-bit individual is a string of Q-bits. A Q-bit is the smallest unit of information in QEA, defined by a pair of numbers (α, β) with |α|² + |β|² = 1, where |α|² gives the probability that the Q-bit will be found in the ‘0’ state and |β|² the probability that it will be found in the ‘1’ state. A Q-bit may be in the ‘1’ state, in the ‘0’ state, or in a linear superposition of the two. A Q-bit individual is defined as a string of m Q-bits. QEA maintains a population of Q-bit individuals Q(t) = {q_1^t, q_2^t, ..., q_n^t} at generation t, where n is the population size and q_j^t, j = 1, 2, ..., n, is a Q-bit individual. Fig. 1 shows the standard procedure of QEA, which is explained as follows: i) In the step ‘initialize Q(t),’ α_i^0 and β_i^0, i = 1, 2, ..., m, of all q_j^0 are initialized to 1/√2. This means that each Q-bit individual q_j^0 represents the linear superposition of all possible states with equal probability. ii) This step generates binary solutions in P(0) by observing the states of Q(0), where P(0) = {x_1^0, x_2^0, ..., x_n^0} at generation t = 0. One binary solution,
x_j^0, is a binary string of length m, formed by selecting either 0 or 1 for each bit using the probability |α_i^0|² or |β_i^0|² of q_j^0, respectively. iii) Each binary solution x_j^0 is evaluated to give a level of its fitness. iv) The initial best solutions are then selected among the binary solutions P(0) and stored into B(0), where B(0) = {b_1^0, b_2^0, ..., b_n^0}, and b_j^0 is the same as x_j^0 at the initial generation. v, vi) In the while loop, binary solutions in P(t) are formed by observing the states of Q(t − 1) as in step ii), and each binary solution is evaluated for its fitness value. It should be noted that x_j^t in P(t) can be formed by multiple observations of q_j^{t−1} in Q(t − 1). vii) In this step, Q-bit individuals in Q(t) are updated by applying Q-gates, defined as the variation operator of QEA. The following rotation gate is used as a basic Q-gate in QEA:

    U(Δθ_i) = [ cos(Δθ_i)  −sin(Δθ_i)
                sin(Δθ_i)   cos(Δθ_i) ],     (1)

where Δθ_i, i = 1, 2, ..., m, is the rotation angle of each Q-bit. Δθ_i should be designed in compliance with the application problem. viii, ix) The best solutions among B(t − 1) and P(t) are selected and stored into B(t), and if the best solution stored in B(t) is fitter than the stored best solution b, b is replaced by the new one. x, xi) If a global migration condition is satisfied, the best solution b is migrated to B(t) globally. If a local migration condition is satisfied, the best one among some of the solutions in B(t) is migrated to them. The migration condition is a design parameter, and the migration process can induce a variation in the probabilities of a Q-bit individual. A local group in QEA is defined as the subpopulation mutually affected by a local migration, and the local-group size is the number of individuals in a local group. QEA runs in the while loop until the termination condition is satisfied.
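The representation, observation, and Q-gate steps above can be sketched as follows. This is a minimal illustration in Python, not the authors' implementation; the function names are ours:

```python
import math
import random

def init_qbits(m):
    """Step i): initialize m Q-bits to (alpha, beta) = (1/sqrt(2), 1/sqrt(2)),
    a uniform superposition over all 2^m binary strings."""
    a = 2 ** -0.5
    return [(a, a) for _ in range(m)]

def observe(qbits):
    """Steps ii)/v): collapse each Q-bit to a classical bit,
    drawing 1 with probability |beta|^2."""
    return [1 if random.random() < beta ** 2 else 0 for _, beta in qbits]

def rotate(qbit, dtheta):
    """Step vii): apply the rotation gate U(dtheta) of Eq. (1) to one Q-bit.
    The rotation preserves the normalization |alpha|^2 + |beta|^2 = 1."""
    alpha, beta = qbit
    c, s = math.cos(dtheta), math.sin(dtheta)
    return (c * alpha - s * beta, s * alpha + c * beta)

# Rotating a Q-bit toward the '1' state increases |beta|^2, so a 1 becomes
# more likely at that position in subsequent observations.
q = init_qbits(8)
q[0] = rotate(q[0], 0.01 * math.pi)
x = observe(q)  # one binary solution of length 8
```

In the full algorithm the sign and size of each Δθ_i would be chosen from the current bit, the stored best solution, and the fitness comparison, per the application's lookup table.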
3 PCA and Density Estimation
In this section, we present the PCA concept and density estimation using Gaussian densities. It should be noted that this method is a basic technique in pattern recognition and lays the background of this study.

3.1 PCA Concept
A technique commonly used for dimensionality reduction is PCA. In the late 1980s, Sirovich and Kirby [9] efficiently represented human faces using PCA, and it is currently a popular technique. Given a set of m×n-pixel images {I_1, I_2, ..., I_K}, we can form a set of 1-D vectors X = {x_1, x_2, ..., x_K}, where x_i ∈ R^N, N = mn, i = 1, 2, ..., K. The basis functions for the Karhunen-Loeve Transform (KLT) [10] are obtained by solving the eigenvalue problem

    Λ = Φ^T Σ Φ     (2)
Fig. 2. Two subspaces
where Σ is the covariance matrix of X, Φ is the eigenvector matrix of Σ, and Λ is the corresponding diagonal matrix of eigenvalues. We can obtain the M largest eigenvalues of the covariance matrix and their corresponding eigenvectors. The feature vector is then given by

    y = Φ_M^T x̃     (3)

where x̃ = x − x̄ is the difference between the image vector and the mean image vector, and Φ_M is a submatrix of Φ containing the M largest eigenvectors. These principal components preserve the major linear correlations in the given set of image vectors. By projecting onto Φ_M^T, the original image vector x is transformed to the feature vector y. It is a linear transformation which reduces N dimensions to M dimensions:

    y = T(x) : R^N → R^M.     (4)
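The eigenface basis and the projection of Eq. (3) can be sketched with NumPy as follows. This is a minimal illustration, not the authors' implementation, and the function names are ours:

```python
import numpy as np

def pca_basis(X, M):
    """X: K x N matrix holding one vectorized image per row.
    Returns the mean image, the N x M matrix Phi_M of the M leading
    eigenvectors of the covariance matrix, and their eigenvalues."""
    mean = X.mean(axis=0)
    Xc = X - mean
    cov = Xc.T @ Xc / len(X)                 # sample covariance Sigma
    vals, vecs = np.linalg.eigh(cov)         # ascending eigenvalues
    order = np.argsort(vals)[::-1][:M]       # keep the M largest
    return mean, vecs[:, order], vals[order]

def project(x, mean, Phi_M):
    """Feature vector y = Phi_M^T (x - mean) of Eq. (3)."""
    return Phi_M.T @ (x - mean)
```

For realistic image sizes (N = mn large, K small) one would diagonalize the K×K Gram matrix instead of the N×N covariance, but the projection step is the same.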
By selecting the M largest eigenvectors, we obtain two subspaces. One is the principal subspace (or feature space) F containing the principal components, and the other is the orthogonal space F̄. These two spaces are depicted in Fig. 2, where DFFS stands for “distance-from-feature-space” and DIFS for “distance-in-feature-space”. In a partial KL expansion, the residual reconstruction error is defined as

    ε²(x) = Σ_{i=M+1}^{N} y_i² = ‖x̃‖² − Σ_{i=1}^{M} y_i²     (5)
and this is the DFFS as stated before, which is basically the Euclidean distance. The component of x which lies in the feature space F is referred to as the DIFS.

3.2 Density Estimation
In the previous subsection, we obtained the DFFS and DIFS. The DFFS is a Euclidean distance, but the DIFS is generally not a distance norm. However, it can be interpreted in terms of the probability distribution of y in F. Moghaddam estimated
the DIFS as high-dimensional Gaussian densities [6]. This is the likelihood of an input image vector x, formulated as

    P(x|Ω) = exp[−½ (x − x̄)^T Σ^{−1} (x − x̄)] / [(2π)^{N/2} |Σ|^{1/2}]     (6)

where Ω is the class of the image vector x. This likelihood is characterized by the Mahalanobis distance

    d(x) = (x − x̄)^T Σ^{−1} (x − x̄)     (7)

and it can also be calculated efficiently as

    d(x) = x̃^T Σ^{−1} x̃ = x̃^T [Φ Λ^{−1} Φ^T] x̃ = y^T Λ^{−1} y = Σ_{i=1}^{N} y_i²/λ_i     (8)
where λ_i is an eigenvalue of the covariance matrix. Now, we can divide this distance between the two subspaces:

    d(x) = Σ_{i=1}^{M} y_i²/λ_i + Σ_{i=M+1}^{N} y_i²/λ_i.     (9)
It should be noted that the first term can be computed by projecting x onto the M-dimensional principal subspace F. However, the second term cannot be computed explicitly in practice because of the high dimensionality. So, we use the residual reconstruction error to estimate the distance:

    d̂(x) = Σ_{i=1}^{M} y_i²/λ_i + (1/ρ) Σ_{i=M+1}^{N} y_i² = Σ_{i=1}^{M} y_i²/λ_i + ε²(x)/ρ.     (10)
The optimal value of ρ can be determined by minimizing a cost function, but ρ = ½ λ_{M+1} may be used as a rule of thumb [11]. Finally, we can extract the estimated probability distribution using (6) and (10). The estimated form is

    P̂(x|Ω) = [exp(−½ Σ_{i=1}^{M} y_i²/λ_i) / ((2π)^{M/2} Π_{i=1}^{M} λ_i^{1/2})] · [exp(−ε²(x)/2ρ) / (2πρ)^{(N−M)/2}]     (11)
            = P_F(x|Ω) · P̂_F̄(x|Ω).
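The estimated distance of Eq. (10) can be sketched with NumPy as follows (a minimal illustration with helper names of our own, not the authors' code):

```python
import numpy as np

def estimated_distance(x, mean, Phi_M, eigvals, rho):
    """d_hat(x) of Eq. (10): the Mahalanobis term inside the principal
    subspace F plus the residual DFFS term eps^2(x)/rho of Eq. (5)."""
    x_t = x - mean                              # x tilde
    y = Phi_M.T @ x_t                           # projection onto F, Eq. (3)
    difs = np.sum(y ** 2 / eigvals)             # in-subspace Mahalanobis term
    dffs = np.sum(x_t ** 2) - np.sum(y ** 2)    # eps^2(x), Eq. (5)
    return difs + dffs / rho
```

With the rule of thumb above, `rho` would be set to half the (M+1)-th eigenvalue.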
Using (11), we can distinguish the face class from the non-face class by setting a threshold value on P̂(x|Ω); this is the Maximum Likelihood (ML) estimation method. In this case, the threshold value becomes the deciding factor between the verification rate and the false alarm rate: if the threshold is too low, the verification rate is quite good, but the false alarm rate also increases. For this reason, the threshold value has to be carefully selected. In the following section, we propose an optimization procedure for selecting a set of weight factors that determines the decision boundary between the face and non-face classes. By optimizing the Mahalanobis distance term in (11) under the given threshold value, we can find a better distance measure and need not tune the threshold value.
4 Optimization of Decision Boundary
In this section, we describe a framework for decision boundary optimization using QEA. In the case of ML estimation, the decision boundary is determined by a probability equation and an appropriate threshold value. To improve the performance of the verification rate and reduce the false alarm rate, we attempt to find a better decision boundary. The Mahalanobis distance-based probability guarantees quite good performance, but it is not optimal. In (11), an eigenvalue is used as the weight factor of the corresponding feature value. These weight factors can be optimized on a training data set. To perform the optimization, we construct the training data set. It consists of two classes: face class (positive data) and non-face class (negative data). Fig. 3 shows an example of a face training data set. A non-face training data set consists of arbitrarily chosen images, including randomly generated images. To search for the weight factors, QEA is used. The number of weight factors to be optimized is M , which is the same as the number of principal components. Using the weight factors obtained by QEA, we can compute the probability distribution as follows:
Fig. 3. Example of the face training data set
    P_opt(x|Ω) = [exp(−½ Σ_{i=1}^{M} y_i²/ω_i) / ((2π)^{M/2} Π_{i=1}^{M} λ_i^{1/2})] · [exp(−ε²(x)/2ρ) / (2πρ)^{(N−M)/2}].     (12)
It is the same as (11) except for the weight factors ω_i, i = 1, 2, ..., M. To apply (12) to face verification, a threshold value should be assigned. But since QEA yields weight factors optimized for the predetermined threshold value, we need not tune the threshold value. To evaluate the fitness value, we calculate a score which is incremented by 1 for every correct verification. The score is used as a fitness measure considering both the verification rate (P_score) and the false alarm rate (N_score), because the training data set consists of both face and non-face data. The fitness is then evaluated as

    Fitness = P_score + N_score     (13)

where P_score is for the face class (positive data) and N_score is for the non-face class (negative data). Using this fitness function, we can find the optimal weight factors for the training data set under the predetermined threshold value.
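The fitness evaluation of Eq. (13) can be sketched as follows; `verify` is a hypothetical stand-in for the thresholded classifier built from Eq. (12):

```python
def fitness(verify, faces, non_faces):
    """Fitness of Eq. (13). `verify(x)` returns True when x is accepted
    as a face. P_score counts correctly accepted faces; N_score counts
    correctly rejected non-faces."""
    p_score = sum(1 for x in faces if verify(x))
    n_score = sum(1 for x in non_faces if not verify(x))
    return p_score + n_score

# Toy check with a 1-D threshold verifier: values below 0.5 are "faces".
verify = lambda x: x < 0.5
score = fitness(verify, faces=[0.1, 0.2, 0.9], non_faces=[0.6, 0.7])
```

A perfect classifier would score len(faces) + len(non_faces); the 560-point perfect score reported in Section 5 corresponds to the 560-image training set.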
5 Experimental Results and Discussions
We constructed three types of database for the experiment. First, 70 face images were used for extracting principal components. Second, 560 images (280 face and 280 non-face) were used for training the weight factors. Third, 1568 images (784 face and 784 non-face) were used for the generalization test. All images are 50×50 pixels with 256 gray levels. We chose 50 principal components from the 70 face images. For pre-processing, histogram equalization was performed to normalize the lighting conditions. Positive data were produced from the face regions of the AR face database [12]. An example of a face training data set is shown in Fig. 3. Variations of facial expression and illumination were allowed. Negative data consisted of both randomly generated images and natural images excluding faces. Position-shifted and different-scale face images were also included as negative data. The following bound on each weight factor was imposed as a domain constraint:

    0.1 λ_i < ω_i < 10 λ_i,   (1 ≤ i ≤ 50).     (14)
By setting the bound using the eigenvalues, this becomes a constrained optimization problem. We performed QEA on the 560 training images using the parameters in Table 1. In (1), the rotation angles should be selected properly; for each Q-bit, θ1 = 0, θ2 = 0, θ3 = 0.01π, θ4 = 0, θ5 = −0.01π, θ6 = 0, θ7 = 0, θ8 = 0 were used. The termination condition was given by the maximum generation. The perfect score was 560 points. If the score does not reach 560 points before the maximum
Table 1. Parameters for QEA

    Parameter                      Value
    Population size                15
    No. of variables               50
    No. of Q-bits per variable     10
    No. of observations            2
    Global migration period        100
    Local migration period         1
    No. of individuals per group   3
    Max. generation                2000

Table 2. Results for generalization test

    Classifier            P score (784)  Verification rate (%)  N score (784)  False alarm rate (%)  Fitness (1568)
    DFFS classifier       726            92.60                  716            0.087                 1442
    ML classifier         728            92.86                  726            0.074                 1454
    QEA-based classifier  741            94.52                  740            0.056                 1481
generation, the evolution process stops at the maximum generation. After the search procedure, we performed a generalization test on 1568 images using the weight factors obtained by QEA. We also compared the results with the DFFS and ML classifiers. For the DFFS and ML classifiers, we selected the threshold value that produced the best score. For the QEA-based classifier, we used the same threshold value as set for the ML classifier. It should be noted that there is no need to choose a threshold value for better performance in our classifier, because the weight factors have already been optimized under the predetermined threshold value. Table 2 shows the results of the generalization test. The results show that the proposed method performs better than the DFFS and ML classifiers, not only in terms of the verification rate (P_score) but also in terms of the false alarm rate (N_score). The verification rate of the QEA-based classifier was 1.66% higher than that of the ML classifier, and the false alarm rate was 0.018% lower. A further advantage of the proposed classifier is that more training data can improve its performance.
6 Conclusion and Further Work
In this paper, we have proposed a decision boundary optimization method for face verification using QEA. The approach is basically related to the eigenspace density estimation technique. To improve on the previous Mahalanobis distance-based probability, we have used a new distance which consists of weight factors optimized on the training set. The proposed face verification system has
been tested on face and non-face images extracted from the AR database, and very good results have been achieved both in terms of the face verification rate and the false alarm rate. The advantage of our system can be summarized in two aspects. First, our system does not need an exact threshold value to perform optimally. We only need to choose an appropriate threshold value, and QEA will find the optimal decision boundary based on it. Second, our system can be adapted to various negative data. A fixed-structure classifier such as the ML classifier cannot change its character in situations where it frequently fails, but our system can adapt to such cases by reconstructing the training data and repeating the optimization procedure. As future research, we will construct a face detection system using this verification method. In face detection, it is clear that verification performance is very important. Most image-based face detection approaches apply a window scanning technique, an exhaustive search of the input image for possible face locations at all scales. In this case, overlapping detections arise easily. A powerful verification method is therefore needed to find exact face locations. We expect that our verification method will also work well for the face detection task.
References

1. Holst, G.: Face detection by facets: Combined bottom-up and top-down search using compound templates. Proc. of the International Conference on Image Processing (2000) TA07–08
2. Propp, M., Samal, A.: Artificial neural network architecture for human face detection. Intell. Eng. Systems Artificial Neural Networks 1 (1992) 535–540
3. Rowley, H.A., Baluja, S., Kanade, T.: Neural network-based face detection. IEEE Trans. Pattern Anal. Mach. Intell. 20 (1998) 23–38
4. Turk, M., Pentland, A.: Eigenfaces for recognition. J. Cog. Neurosci. 3 (1991) 71–86
5. Pentland, A., Moghaddam, B., Starner, T.: View-based and modular eigenspaces for face recognition. Proc. of IEEE Int. Conf. on Computer Vision and Pattern Recognition (1994) 84–91
6. Moghaddam, B., Pentland, A.: Probabilistic visual learning for object representation. IEEE Trans. Pattern Anal. Mach. Intell. 19 (1997) 696–710
7. Sung, K.-K., Poggio, T.: Example-based learning for view-based human face detection. IEEE Trans. Pattern Anal. Mach. Intell. 20 (1998) 39–51
8. Han, K.-H., Kim, J.-H.: Quantum-inspired evolutionary algorithm for a class of combinatorial optimization. IEEE Trans. Evolutionary Computation 6 (2002) 580–593
9. Sirovich, L., Kirby, M.: Low-dimensional procedure for the characterization of human faces. J. Opt. Soc. Amer. 4 (1987) 519–524
10. Loeve, M.: Probability Theory. Princeton, N.J., Van Nostrand (1955)
11. Cootes, T., Hill, A., Taylor, C., Haslam, J.: Use of active shape models for locating structures in medical images. Image and Vision Computing 12 (1994) 355–365
12. AR face database: http://rvl1.ecn.purdue.edu/~aleix/aleix_face_DB.html
Minimization of Sonic Boom on Supersonic Aircraft Using an Evolutionary Algorithm

Charles L. Karr¹, Rodney Bowersox², and Vishnu Singh³

¹ Associate Professor and Head, Aerospace Engineering and Mechanics, The University of Alabama, Box 870280, Tuscaloosa, AL 35487-0280
² Associate Professor, Department of Aerospace Engineering, Texas A&M University, 701 H.R. Bright Bldg., TAMU-3141, College Station, TX 77843-3141
³ Graduate Student, Aerospace Engineering and Mechanics, The University of Alabama, Box 870280, Tuscaloosa, AL 35487-0280
Abstract. The aerospace community has an increasing interest in developing supersonic transport class vehicles for civil aviation. One of the concerns in such a project is to minimize the sonic boom produced by the aircraft, as demonstrated by its ground signature. One approach being considered is to attach a spike/keel to the front of the aircraft to attenuate the magnitude of the aircraft's ground signature. This paper describes an effort to develop an automatic method for designing the spike/keel area distribution that satisfies constraints on the ground signature of a specified aircraft. In this work a genetic algorithm is used to perform the design optimization. A modified version of Whitham's theory is used to generate the near field pressure signature, and the ground signature is computed with the NFBOOM atmospheric propagation code. Results indicate that genetic algorithms are effective tools for solving the design problem presented.
1 Introduction
The minimum achievable sonic boom of any aircraft can be computed: it is proportional to the weight of the aircraft divided by its length raised to the 1.5 power [1,2]. Given that the minimum sonic boom can be computed, the natural extension is to attempt to design an aircraft so that its sonic boom is minimized. McLean [3] was the first to consider trying to achieve this objective by eliminating shocks in the ground signature of a supersonic aircraft. The motivation for minimizing a sonic boom is many-fold, but the driving force behind the current effort is simply human comfort (from the perspective of an observer on the ground). The acceptability of a finite rise time overpressure, as opposed to the N-wave shock structure, was demonstrated in human subject tests [4,5]. These data indicated that a rise time of about 10 msec rendered realistic (0.6 psf) overpressures acceptable to all the subjects tested. Increasing the rise time reduces the acoustic power in the frequency range to which the ear is most sensitive. Unfortunately, the aircraft length required to achieve a 10 msec rise
E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 2157–2167, 2003. © Springer-Verlag Berlin Heidelberg 2003
time has consistently been proven to be impractical. McLean considered a Super Sonic Transport (SST) class vehicle (600K lb, flying at 60K ft and Mach 2.7) and concluded that a 570 ft long vehicle would be required (atmospheric effects resulting in midfield signature freezing were included). Miller and Carlson [6,7] considered the projection of both heat and a force field upstream of a SST class vehicle. Their work focused on projecting a “phantom body” in front of the SST, with the goal of a finite (10 msec) ground rise time. Linear theory indicates that the optimum shape for a finite rise time is a 5/2-power area distribution (i.e., an isentropic spike). They concluded the power required to project the phantom body was the same as that needed to propel the vehicle. Marconi [8], using two-dimensional computational fluid dynamics, confirmed this power requirement. However, a significant wave drag reduction associated with replacing the vehicle nose shock with an isentropic pressure rise was also found. There are a number of challenging problems associated with projecting anything far upstream of a supersonic vehicle. In particular, a significant reduction in efficiency is incurred when trying to project energy. These projection problems are avoided with the off-axis volume control introduced by Batdorf [9]. Figure 1 depicts his concept as implemented by Marconi et al [10]. Linear theory predicts that the asymptotic far field flow is independent of the details of the vehicle and is influenced only by the cross sectional area (and equivalent area due to lift) distribution of the configuration. In particular, in a supersonic flow the cross sectional area distribution in planes inclined at the Mach angle (µ in Fig.1) governs the far field flow downstream of those Mach planes. 
The theory predicts that whether the cut volume is centered along the axis of the vehicle (i.e., a nose spike) or shifted off-axis (i.e., the keel of Fig. 1) should not influence the resulting far field flow. Batdorf's concept was tested by Swigart [11] at Mach 2 and shown to be feasible. Swigart tested an equivalent body of revolution (typical of an SST) with an all-solid keel, in addition to a keel with a portion of its volume replaced by a heated region. Swigart also tested a lifting wing-body combination with an all-solid keel. The keels (both solid and thermal) were designed to match the Mach-plane area of a 5/2-power isentropic spike and achieved a nearly linear near-field pressure distribution, as expected. The solid keel tested by Swigart was very large, extending the entire length of the configuration; only the thermal keel seemed practical. Marconi et al. [10] extended the Batdorf concept. In that study, the initial spike/keel sizing was accomplished using a modified version of the classical linear Whitham theory [13], and detailed 3-D flowfields were analyzed using a nonlinear Euler solver. That study demonstrated that single keels produced significant 3-D mitigation over the entire ground signature. In addition, they examined multiple approaches to minimizing the impact of the keel on the vehicle performance. Lastly, Marconi et al. demonstrated that the spike/keel area distribution could be tailored to produce a desired ground signature. The purpose of the present work is to develop an optimization procedure for the design of the spike/keel area distribution. A modified version of Whitham's
Fig. 1. Design Optimization - Genetic Algorithms (Forward Swept Keel → Length Amplification Factor = M∞ )
theory [12] is used to generate the near field pressure signature. The ground signature is computed with the NFBOOM [13] atmospheric propagation code. Genetic algorithms (GAs) are used for the optimization.
2 Flowfield Analysis – Genetic Algorithm Fitness Function
Naturally, as in any GA application, determining a fitness function for the problem at hand is a central issue. In the current problem, an attempt is being made to design a keel for an SST aircraft that will minimize the sonic boom felt on the ground. Thus, the fitness function must ultimately be able to take a keel design (the coding issue is addressed later) and provide a maximum decibel level in the associated ground signature. Seebass [2] presents an algorithm for using linear theory to predict sonic boom ground signatures. In summary, the midfield pressure signature is computed from the aircraft equivalent body of revolution using the quasi-linear theory of Whitham [11], making far field approximations, i.e., computation of the “F-function”. The midfield signature is then propagated to the ground using acoustic theory. A modified version of this theory (described next) was used for initial
sizing of the phantom isentropic spike (the dashed-line body extending the airplane nose in Fig. 1). In the present work, the far field approximations in the classical Whitham theory were relaxed; this was facilitated by the use of modern computational resources. Specifically, the small-disturbance perturbation velocity field solution, with the Whitham [12] modifications, to the linear form of the velocity potential equation,

    φ_rr + φ_r/r − β² φ_xx = 0     (1)
is given by u U∞
=−
v U∞
1 r
y 0
=
y 0
√
f (η)dη (y−η)(y−η+2βr)
√(y−η+βr)f
(2)
(η)dη
(y−η)(y−η+2βr)
In Eqn. (2), y(x, r) = constant is the nonlinear characteristic curve along which dx/dr = cot(µ + θ) ≈ β + (γ+1)M⁴u/2β − M²(v + βu) + o(u² + v²) [11]. Neglecting the second-order terms, substituting for the velocity components [Eqn. (2)], and integrating dx ≈ [β + (γ+1)M⁴u/2β − M²(v + βu)]dr from the body surface results in

    x − βr = y + ((γ+1)M∞⁴/2β²) ∫_0^y [√(y−η+2βr) − √(y−η+2βR_B)] f′(η) dη / √(y−η)
             − M∞² ∫_0^y { log[(√(y−η+2βr) − √(y−η)) / (√(y−η+2βr) + √(y−η))]
                         − log[(√(y−η+2βR_B) − √(y−η)) / (√(y−η+2βR_B) + √(y−η))] } f′(η) dη     (3)
y(x, r) is the value of x − βr where the characteristic meets the body surface; f(x) represents the source distribution, and when the tangency boundary condition is applied, f(x) = A′(x)/2π. Consistent with the small disturbance theory, the pressure field was computed from the linearized form of the compressible Bernoulli equation (p + ρU² = constant); i.e.,

    (p − p∞)/p∞ = −γM∞² u/U∞     (4)
Using the above nonlinear characteristic theory improves the accuracy [12] over traditional linear theory; however, multi-valued solutions due to wave crossing (break points) are possible. Physically, the overlap regions correspond to the coalescence of Mach waves into a shock wave, followed by a discontinuous change in flow properties across the shock. Because near field solutions (r/L ∼ 0.2) were computed here, break points were rare, and when they did occur their spatial extent was very small. Hence, a simple shock-fitting algorithm, similar to that of Whitham, was incorporated. The main simplification was that the shock was placed at the midpoint of the break region.
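To make the quadrature in Eq. (2) concrete, the axial velocity integral can be sketched as follows. This is our own illustration, not the paper's code (the paper uses a higher-order scheme); a simple midpoint rule is used because its sample points avoid the integrable square-root singularity at η = y:

```python
import numpy as np

def u_over_Uinf(y, r, beta, fprime, n=5000):
    """Midpoint-rule estimate of u/U_inf in Eq. (2):
        u/U_inf = -int_0^y f'(eta) deta / sqrt((y-eta)(y-eta+2*beta*r)).
    `fprime` is the derivative of the source distribution f."""
    h = y / n
    eta = (np.arange(n) + 0.5) * h   # midpoints, strictly inside (0, y)
    integrand = fprime(eta) / np.sqrt((y - eta) * (y - eta + 2.0 * beta * r))
    return -np.sum(integrand) * h
```

For a constant f′ the integral has the closed form 2 log[(√y + √(y+2βr))/√(2βr)], which makes a convenient accuracy check for the quadrature.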
In summary, Equations (2)–(4), with 200 characteristics, were used to predict the velocity and pressure fields for a given airplane area distribution (with and without the isentropic spikes), including the additional equivalent area due to lift [2]. The integrals in Equation (2) were evaluated using a fourth-order accurate numerical integration scheme; 5000 increments in η were used to ensure numerical convergence to within 0.1%. A computer program was written to perform this near field analysis. The program was validated against the 5/2-power frontal spike experiment given by Swigart [11]; a comparison of the present algorithm to the experimental data is shown in Fig. 2. The agreement was considered sufficient for the present sonic boom mitigation analysis.
[Figure: Δp/p versus x − βr (cm); theory curve compared with experimental data points]
Fig. 2. Comparison of Quasi-Linear Theory with the Frontal Spike (r = 50.8 cm) Data of Swigart [11].
The near-field solutions were extrapolated to the ground using the NASA Ames sonic boom FORTRAN routines in NFBOOM [13]. Specifically, the ANET subroutine, which extrapolates the signal through the non-uniform atmosphere, was incorporated into the Whitham theory program. A reflection factor of 1.9 was used, and wind effects were neglected. In addition, the perceived ground pressure levels [4,5] were estimated with the PLdB subroutine, also present in NFBOOM.
3 Design Optimization – Genetic Algorithm Coding Scheme
For sonic boom mitigation, the GA is given the task of minimizing the loudness of the sonic boom created by the body shape of the aircraft. This body shape is represented by graphing the effective area of the aircraft versus the position along the length of the aircraft. To optimize this body shape, an assumption is made that the body shape may be represented by continuous fourth-order polynomials.
C.L. Karr, R. Bowersox, and V. Singh
The GA’s task is then to modify the coefficients of this polynomial to minimize the loudness while also satisfying certain constraints that are placed on the body shape. One of the constraints placed on the GA is that the derivatives of the body shape must not fluctuate beyond a certain pre-set level. This is done to insure that the body shape will not be oscillatory in nature. Second, the GA must develop a body shape that visually lies between a “minimum” and “maximum” body shape. Thus, the plot of the polynomial generated by the GA must fit between two pre-defined curves. This focuses the scope of the GA’s search and allows it to be more efficient. In order to quantitatively evaluate the quality of each body shape that the GA proposes, each body shape is analyzed using Whitham theory. This results in a set of data that can then be read into NFBOOM program code to calculate the effective loudness of the sonic boom. This loudness serves as the driving factor of the GA’s search. Those solutions with lower decibel levels are recombined with other quality solutions to form new solutions to test, while those solutions with higher decibel levels are discarded in favor of better solutions. This allows the GA to bias the population of potential solutions towards better solutions over time. This approach of allowing a GA to propose coefficients of several fourth-order polynomials (constrained such that the first and second derivatives of the curve are continuous) allows for reasonable, if not effective, area distributions. Figure 3 shows a sample area distribution proposed by a GA.
Fig. 3. Sample area distribution as proposed by a GA.
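A minimal sketch of how a candidate area distribution could be evaluated against the constraints described above. The function names, sampling resolution, and envelope interface are assumptions; the continuity of first and second derivatives between segments is assumed to be enforced when the coefficients are generated, so only the envelope and slope constraints are checked here:

```python
import numpy as np

def area_curve(coeffs, knots, x):
    """Evaluate a piecewise fourth-order polynomial area distribution.
    coeffs: one 5-element array per segment (highest degree first);
    knots: segment boundaries along the body axis."""
    y = np.empty_like(x, dtype=float)
    for c, lo, hi in zip(coeffs, knots[:-1], knots[1:]):
        m = (x >= lo) & (x <= hi)
        y[m] = np.polyval(c, x[m])
    return y

def feasible(coeffs, knots, a_min, a_max, slope_limit, n=200):
    """Check the GA's constraints: the curve lies between the pre-defined
    'minimum' and 'maximum' body shapes, and its derivative stays bounded."""
    x = np.linspace(knots[0], knots[-1], n)
    a = area_curve(coeffs, knots, x)
    if np.any(a < a_min(x)) or np.any(a > a_max(x)):
        return False
    da = np.gradient(a, x)              # numerical slope of the area curve
    return bool(np.all(np.abs(da) <= slope_limit))
```

Infeasible candidates would simply receive a penalized loudness value in the GA's fitness evaluation.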
4 Results
With the definition of a fitness function and a coding scheme, a floating-point GA can be used to effectively design keels, added to SST aircraft, that minimize the sonic boom for given flight conditions. For this study, the following GA parameters are used:

– population size = 50
– maximum generations = 250
– mutation probability = 0.3
– mutation operator = Gaussian mutation
– crossover operator = standard two-point crossover
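The two variation operators listed above can be sketched for a floating-point chromosome as follows. This is an illustrative implementation, not the authors' code; the Gaussian standard deviation `sigma` is an assumed parameter:

```python
import numpy as np

rng = np.random.default_rng(0)

def two_point_crossover(p1, p2):
    """Standard two-point crossover: swap the genes between two random cuts."""
    i, j = sorted(rng.choice(len(p1), size=2, replace=False))
    c1, c2 = p1.copy(), p2.copy()
    c1[i:j], c2[i:j] = p2[i:j], p1[i:j]
    return c1, c2

def gaussian_mutation(chrom, p_mut=0.3, sigma=0.1):
    """Gaussian mutation: each gene is perturbed by zero-mean noise
    with probability p_mut (0.3 in the parameter list above)."""
    mask = rng.random(len(chrom)) < p_mut
    return chrom + mask * rng.normal(0.0, sigma, len(chrom))
```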
The remainder of this section describes the effectiveness of the GA in determining keel designs.

4.1 Initial Isentropic Spike Design
Phantom isentropic spikes (extending the nose of the airplane as shown in Figs. 1 and 2) were designed to lower the initial shock pressure rise to 0.3 psf (the goal of Marconi et al. [10]). The length and strength (K) of the 5/2-spike (A = Kx^(5/2)) were the design parameters. The spike was blended to the body, and the body geometry was fit with five fourth-order polynomial curves; the coefficients were selected to ensure that the body fit was continuous through the third derivative. Figure 4 shows an example aircraft area distribution (including the equivalent area due to lift) with the isentropic (forward) spike and the body fits. This parameterization proved sufficient for all configurations tested. Because the 5/2-spike removes the shock wave altogether, the perceived sound level (PLdB) [4] was used to convert the linear ground-signature waveform sound level into an equivalent-shock-wave signature, where the perceived sound levels were matched. Figure 5 shows an example result, where a spike was designed to reduce the initial shock pressure rise from just over 0.6 psf for the baseline aircraft down to 0.3 psf (the design goal). The length of the corresponding swept-forward keel [10] would be 8.1% of the baseline aircraft length.

4.2 Initial Genetic Algorithm Designs – Feasibility Study
The first step in determining the effectiveness of using a GA to perform keel design for sonic boom mitigation is to match published data. As described in Marconi et al. [10], the upstream spike area distribution can be used as a design variable to produce a desired ground signature. Hence, in the present work, the 5/2-spike requirement was relaxed, and genetic algorithms were used to produce an "optimum" design. Here, the optimum was defined as the shortest spike/keel that produced ground-signature shock overpressures < 0.3 psf; signatures with multiple shocks that met this requirement were acceptable.
[Figure: equivalent area versus axial position X, showing the baseline aircraft area distribution (provided by an aircraft company), the low-boom aircraft area distribution, the phantom isentropic spike, and the five fourth-order polynomial fits]
Fig. 4. Example Equivalent Mach Plane Area Distribution (Scales intentionally left blank).
[Figure: ΔPeq (psf) versus perceived sound level (dB), comparing the baseline aircraft design (N-wave ground signature) and the baseline plus forward-swept keel (L_Keel/L_Airplane = 8.1%) against the QSP goal]
Fig. 5. Example Boom Reduction Results.
Fig. 6. The area distribution above represents the best solution determined using a GA.
Fig. 7. The ground signature above results in a sonic boom of magnitude 104.0 dB.
Given the GA’s ability to solve this simple design problem, the more complex problem of determining a keel shape that minimizes the decibel level in a ground signature was undertaken. 4.3
Initial Genetic Algorithm Designs – Feasibility Study
Based on the results presented in Figs. 4 and 5, the GA was considered an appropriate tool for determining the keel shape (as depicted in an area distribution) that minimizes the sonic boom of an SST aircraft. Figures 6 and 7 show the optimal design and associated ground signature determined in this manner. It is important to note that the GA found a design with a sonic boom of magnitude 104.0 dB; the best solution determined to date by a human designer had a sonic boom of magnitude 106.3 dB.
References
1. Seebass, R. and Argrow, B., "Sonic Boom Minimization Revisited," AIAA Paper 98-2956, 1998.
2. Seebass, R., "Sonic Boom Theory," AIAA Journal of Aircraft, Vol. 6, No. 3, 1969.
3. McLean, E., "Configuration Design for Specific Pressure Signature Characteristics," Sonic Boom Research, edited by I.R. Schwartz, NASA SP-180, 1968.
4. Leatherwood, J.D. and Sullivan, B.M., "Laboratory Study of Sonic Boom Shaping on Subjective Loudness and Acceptability," NASA TP-3269, 1992.
5. McCurdy, D.A., "Subjective Response to Sonic Booms Having Different Shapes, Rise Times, and Durations," NASA TM-109090, 1994.
6. Miller, D.S. and Carlson, H.W., "A Study of the Application of Heat or Force Fields to the Sonic-Boom-Minimization Problem," NASA TN D-5582, 1969.
7. Miller, D.S. and Carlson, H.W., "On the Application of Heat or Force Fields to the Sonic-Boom-Minimization Problem," AIAA Paper 70-903, 1970.
8. Marconi, F., "An Investigation of Tailored Upstream Heating for Sonic Boom and Drag Reduction," AIAA Paper 98-0333, 1998.
9. Batdorf, S.B., "On Alleviation of the Sonic Boom by Thermal Means," AIAA Paper 70-1323, 1970.
10. Marconi, F., Bowersox, R., and Schetz, J., "Sonic Boom Alleviation Using Keel Configurations," AIAA Paper 2002-0149, 40th Aerospace Sciences Meeting, Reno, NV, Jan. 2002.
11. Swigart, R.J., "Verification of the Heat-Field Concept for Sonic-Boom Alleviation," AIAA Journal of Aircraft, Vol. 12, No. 2, 1975.
12. Whitham, G., "The Flow Pattern of a Supersonic Projectile," Communications on Pure and Applied Mathematics, Vol. V, 1952, pp. 301–348.
13. Durston, D.A., "NFBOOM User's Guide, Sonic Boom Extrapolation and Sound-Level Prediction," NASA Ames Research Center, Unpublished Document (Last Updated Nov. 2000).
14. Marconi, F., Bowersox, R., and Schetz, J., "Sonic Boom Alleviation Using Keel Configurations," AIAA Paper 2002-0149, 40th Aerospace Sciences Meeting, Reno, NV, Jan. 2002.
Nomenclature

A    Airplane area distribution
K    5/2-area distribution strength (i.e., A = Kx^(5/2))
L    Length
M    Mach number
p    Pressure
R_B  Body radius
r    Radial coordinate
u    Axial velocity perturbation
U    Axial velocity
v    Radial velocity perturbation
x    Axial coordinate
y    Curved characteristic line
β    √(M² − 1)
γ    Ratio of specific heats
μ    Mach wave angle
ρ    Density

Subscripts
∞    Flight condition at altitude
Control of a Flexible Manipulator Using a Sliding Mode Controller with Genetic Algorithm Tuned Manipulator Dimension

N.M. Kwok¹ and S. Kwong²

¹ Faculty of Engineering, University of Technology, Sydney, NSW, Australia
[email protected]
² Department of Computer Science, City University of Hong Kong, Hong Kong SAR, P.R. China
[email protected]
Abstract. The tip position control of a single-link flexible manipulator is considered in this paper. The cross-sectional dimension of the manipulator is tuned by genetic algorithms such that the first vibration mode dominates the higher order vibrations. A sliding mode controller of reduced dimension is employed to control the manipulator using the rigid and vibration measurements as feedback. Control effort is shared dynamically between the rigid mode tracking and vibration suppression. Performance and effectiveness of the proposed reduced dimension sliding mode controller in vibration suppression and against payload variations are demonstrated with simulations.
1 Introduction
Flexible manipulators are robotic manipulators made of light-weight materials, e.g. aluminium. Because of the reduction in weight, lower power consumption and faster movement can be realised. Other advantages include ease of setting up and transportation, as well as reduced impact damage should the manipulator system become faulty. Light-weight manipulators can be applied in general industrial processes, e.g. pick and place; other applications include the space shuttle's on-board robotic arm. However, because of its light weight, vibrations are inherent in the flexible manipulator, and this hinders its wide application. Earlier research on the control of flexible manipulators can be found in [1], where a state-space model of the manipulator was proposed and the linear quadratic Gaussian approach was used. Other work includes the use of an H∞ controller in [2], where the manipulator considered allowed movements in both the horizontal and vertical planes. Deterministic control is found in [3], where a sliding mode controller was applied; the control effort was switched between the vibration mode errors, however, on an ad-hoc weighting basis. Recently, in [4], a sliding mode controller with reduced controller dimension, using only the rigid mode and first vibration mode as feedback, was demonstrated; the sliding surfaces were made adaptive to enhance the controller performance. E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 2191–2202, 2003. © Springer-Verlag Berlin Heidelberg 2003
2192
N.M. Kwok and S. Kwong
However, a relatively large number of switchings in the controller gains had to be incorporated. Apart from these advanced controller designs, intelligent controllers have also been reported: in [5], a model-free fuzzy controller was used to control the flexible manipulator, and in [6], a flexible manipulator was controlled by a neural network controller trained with genetic learning. Genetic tuning of a Lyapunov-based controller was reported in [7], where stability was guaranteed by the Lyapunov design and the controller performance was enhanced by genetic algorithm tuning. The work in [8] also showed that a sliding mode controller retaining only the rigid mode and first vibration mode as feedback is feasible; the approach adopted was to weight the control of the tip position and of the vibrations dynamically, in a fuzzy-like manner. Diverging from the focus on controller design, the structural dimensions of the flexible manipulator were considered in [9], where a tapered beam resulted and the vibration frequency was increased for faster manoeuvre. In this work, we consider the tip angular position control of a single-link flexible manipulator. The manipulator is constructed as a narrow beam of rectangular cross-section; the hub is driven by a dc motor, and the tip carries a payload. We use a sliding mode controller for its robustness against model uncertainties and parameter variations, e.g. payload variations. Inspection of the manipulator's mathematical model reveals that the manipulator is of infinite dimension, and the design and implementation of an infinite-dimension controller is challenging. Here, we follow the work of [4] and [8]: only the rigid mode and the first vibration mode are used as feedback, and the controller dimension is therefore reduced. When using only the first vibration mode as one of the feedback signals, we need to ensure that it dominates all other higher order vibrations.
Motivated by the work in [9], we propose to adjust the cross-sectional dimension of the manipulator. In addition to the above requirement, we also aim to increase the vibration frequency for reduced vibration magnitude, while rejecting a large cross-sectional area that would increase the weight of the manipulator. We also see from the manipulator model that vibrations are proportional to the slope of the mode shapes at the hub; therefore, in order to reduce vibration as a whole, we penalise large mode-shape slopes. The above arguments lead to the formulation of a multi-objective optimisation problem. Following [10], we propose to use a genetic algorithm to search for the optimal manipulator cross-sectional dimension. The resulting manipulator characteristics facilitate the design and implementation of a sliding mode controller of reduced dimension. This paper is arranged as follows. The manipulator model is developed in Section 2. Section 3 presents the sliding mode controller design. The genetic tuning of the manipulator dimensions is treated in Section 4. Simulation results are presented in Section 5, and a conclusion is drawn in Section 6.
2 System Modelling
The flexible manipulator considered here is a uniform aluminium beam with rectangular cross section and moves in the horizontal plane. One end of the
[Figure: manipulator set-up showing the hub (motor, angular encoder, mounting fixture), beam, grabber, payload, tip accelerometer, and the controller driving the motor from a set-point reference]

Fig. 1. System Configuration

[Figure: parameter definitions, including tip deflection y(x,t), tip angle θt, hub angle θh, motor torque τ, hub inertia Jh, and payload mass mt]

Fig. 2. Parameter Definition
beam is fixed to the hub, which consists of a dc motor and associated mounting fixtures. At the other end, the tip, a gripper is mounted for pick-and-place operation; the gripper and the load together form the payload. Angular sensors, e.g. an accelerometer and an angular encoder, are mounted at the tip and on the motor shaft respectively. The manipulator set-up is shown in Fig. 1, and the definitions of the manipulator parameters are shown in Fig. 2. Using Euler–Bernoulli beam theory, assuming that shear and rotary inertia are negligible and that vibrations are small compared to the length of the manipulator, the equation of motion of the manipulator can be given as [2]

EI ∂⁴y/∂x⁴ + ρ ∂²y/∂t² = 0   (1)
where E=Young’s modulus, I=cross sectional moment of inertia, ρ=mass per unit length, y=displacement from reference The boundary conditions are y(0) = 0, EIy (0) = τ + Jh θ¨h , EIy (L) = mt y¨(L), EIy (L) = −Jt y¨ (L) (2) where τ =motor torque, Jh =hub inertia, θh =hub angle, mt =payload mass, Jt =payload inertia, L=manipulator length, prime stands for differentiation against x, dot stands for differentiation against time t Using separation of variables, put y(x, t) = φ(x)ϕ(t)
(3)
then we have

ϕ̈ + ω²ϕ = 0,  φ⁗ − β⁴φ = 0,  ω² = β⁴ EI/ρ   (4)
The general solution to Eq. (4) is the mode shape given by

φ(x) = A sin βx + B cos βx + C sinh βx + D cosh βx   (5)
where β is a solution of the characteristic equation formed from the boundary conditions. Eq. (5) can be expressed with coefficient A only, giving

φ(x) = A(sin βx + γ(cos βx − cosh βx) + ξ sinh βx)   (6)
where γ and ξ are functions of β and the system parameters, and A is yet to be determined by normalisation. Using the assumed-mode method, put

y(x, t) = Σ_{i=0}^{∞} φi(x)ϕi(t) = x θh + Σ_{i=1}^{∞} φi(x)ϕi(t)   (7)

The normalisation coefficient Ai of the general solution is

Ai = √( Jtotal / ( ∫₀ᴸ ρ ψi²(x) dx + mt ψi²(L) + Jt ψi′²(L) + Jh ψi′²(0) ) )   (8)
where ψi is the expression inside the brackets in Eq. (6). After normalisation using the orthogonality property, and keeping consistency with the rigid mode, we have

ϕ̈i + 2ζωi ϕ̇i + ωi² ϕi = (τ/Jtotal) φi′(0)   (9)
where ζ is the material damping, of small value, φi′(0) is the slope of the i-th vibration mode shape at the hub, and Jtotal is the total inertia made up of the hub, beam and tip. With further manipulations, a state-space system equation can now be written as

ẋ = A x + (τ/Jtotal) b,  x = [θh, θ̇h, ϕ1, ϕ̇1, …, ϕn, ϕ̇n]ᵀ,  b = [0, 1, 0, φ1′(0), …, 0, φn′(0)]ᵀ   (10)

where A is block-diagonal, with the block [[0, 1], [0, 0]] for the rigid mode and [[0, 1], [−ωi², −2ζωi]] for the i-th vibration mode. The motor dynamic equation is

Jh θ̈h = τ = (kt/R)(V − kb θ̇h),  V = (R/kt)τ + kb θ̇h   (11)
where Jh is the hub inertia, τ is the torque generated by the motor, kt is the torque constant, V is the voltage applied to the motor, kb is the back-emf constant, R is the armature resistance Let the controller generate the required driving torque τ , then we will use inverse dynamics to compute the motor voltage V . It is because the dc motor model has been well studied and its identification in real practice is not difficult. Finally, we will apply the derived control voltage to the motor to drive the flexible manipulator.
3 Controller Design
We will design the controller using only the rigid mode and the first vibration mode as feedback signals. From the state-space equation, Eq. (10), we re-write the system differential equations as

θ̈ = τ1/Jtotal,  ϕ̈ = −2ζωϕ̇ − ω²ϕ + (τ2/Jtotal)φ′   (12)
Note that we have dropped all subscripts for clarity. The control objective is to make θ → θd and ϕ → 0 as t → ∞, where θd is the desired hub angle. We then define the error variables as

θe = θd − θ,  ϕe = −ϕ   (13)
The system differential equations become

θ̈e = θ̈d − τ1/Jtotal,  ϕ̈e = −2ζωϕ̇e − ω²ϕe − (τ2/Jtotal)φ′   (14)
Following the sliding mode design method, we define the sliding surface as

s1 = c1 θe + θ̇e   (15)
where c1 is the slope of the sliding surface. When sliding mode is attained, we have

θe(t) = θe(ts) e^(−c1 t)  for  t > ts

where ts is the time at which sliding mode first occurs and c1 > 0; then θe(t) → 0 as t → ∞.
The sliding-surface time derivative is

ṡ1 = c1 θ̇e + θ̈e = c1 θ̇e + θ̈d − τ1/Jtotal   (16)

Put

τ1 = (c1 θ̇e + θ̈d + j1 sign(s1)) Jtotal   (17)

where sign(x) = 1 for x > 0 and sign(x) = −1 otherwise; the sliding surface s1 can then be reached for j1 > 0. Similarly, define the sliding surface s2 as

s2 = c2 ϕe + ϕ̇e   (18)
2196
N.M. Kwok and S. Kwong
The time derivative is

ṡ2 = c2 ϕ̇e + ϕ̈e = (c2 − 2ζω)ϕ̇e − ω²ϕe − (τ2/Jtotal)φ′   (19)

Put

τ2 = ((c2 − 2ζω)ϕ̇e − ω²ϕe + j2 sign(s2)) Jtotal/φ′   (20)

where the sliding surface s2 can also be reached when j2 > 0. Let the sliding mode controller outputs be

u1 = γ1 + k1 sign(s1),  u2 = γ2 + k2 sign(s2)   (21)

where

γ1 = (c1 θ̇e + θ̈d) Jtotal,  k1 = j1 Jtotal,
γ2 = ((c2 − 2ζω)ϕ̇e − ω²ϕe) Jtotal/φ′,  k2 = j2 Jtotal/φ′   (22)
Note that there is only one control torque from the motor but there are two control objectives, namely θ → θd and ϕ → 0. We therefore assign weighting factors n1 and n2, applied to u1 and u2 respectively:

u = n1 u1 + n2 u2 = n1 γ1 + n1 k1 sign(s1) + n2 γ2 + n2 k2 sign(s2)   (23)
In order to ensure that the sliding surfaces are reached, that is, to make s1 → 0 and s2 → 0, we also have to assign the values of k1 and k2. In so doing, we define a Lyapunov function

V = ½(s1² + s2²) > 0  for all  s1 ≠ 0, s2 ≠ 0   (24)
The time derivative is required to be negative for reachability, that is,

V̇ = s1 ṡ1 + s2 ṡ2 < 0   (25)
When we apply the weighted and aggregated control, Eq. (23), to the plant, Eq. (25) becomes

V̇ = s1(n2(γ1 − γ2) − n1 k1 sign(s1) − n2 k2 sign(s2)) + s2(n1(γ2 − γ1) − n1 k1 sign(s1) − n2 k2 sign(s2))   (26)
From Eq. (26), we see that the two sliding surfaces s1 and s2 are interrelated, and their combination affects the Lyapunov function derivative. It was shown in [8] that this effect is reduced if the values of k1 and k2 are given by:

for s1 > 0 and s2 > 0,
k1 = max(γ2 − γ1, 0) + ε1,  k2 = max(γ1 − γ2, 0) + ε2   (27)

for s1 < 0 and s2 < 0,
k1 = max(γ1 − γ2, 0) + ε1,  k2 = max(γ2 − γ1, 0) + ε2   (28)

for s1 > 0 and s2 < 0,
k1 = max(n2(γ1 − γ2)/n1, 0) + ε1,  k2 = 0   (29)

for s1 < 0 and s2 > 0,
k1 = max(n2(γ2 − γ1)/n1, 0) + ε1,  k2 = 0   (30)
where ε1 and ε2 are small positive constants to compensate for model uncertainties and parameter variations. Regarding the sharing of control effort between regulation and vibration suppression, we state our control strategy in the following rule: a larger control effort is applied to hub-angle regulation when the hub-angle error is large; when the hub angle is near the set point (small error), a larger control effort is applied to suppress the vibration mode. We also observe from Eq. (29) and Eq. (30) that k1 may grow unbounded if n1 → 0. Therefore, we put

n1 > n2  with  n1 + n2 = 1   (31)

We also assign n1 according to the control strategy in the rule stated above. Put

n1 = ½ [ 1 + 1/(1 + exp(−a(θe/θd − b))) ],  n2 = 1 − n1   (32)

where θe is the hub-angle error, θd is the desired set point, a determines the slope of the weight crossover, and b determines the point at which the crossover in control weighting occurs.
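One plausible reading of the dynamic weighting law, Eq. (32), can be sketched as follows; the logistic form and the parameter values a and b are assumptions for illustration:

```python
import math

def control_weights(theta_e, theta_d, a=10.0, b=0.5):
    """Dynamic sharing of control effort: n1 (hub-angle tracking) rises
    toward 1 for large hub-angle error and falls toward 1/2 near the set
    point, so n1 > n2 always holds, consistent with Eq. (31).
    a and b are illustrative design parameters."""
    n1 = 0.5 * (1.0 + 1.0 / (1.0 + math.exp(-a * (theta_e / theta_d - b))))
    return n1, 1.0 - n1
```

Because n1 never drops below 1/2, the unbounded-gain case n1 → 0 noted above cannot occur.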
4 Manipulator Dimension Tuning by Genetic Algorithm
The reduced-dimension sliding mode controller developed in Section 3 depends critically on the fact that the first vibration mode magnitude dominates the higher order vibrations. When the tip displacement is given by Eq. (7), the way to make y1 dominate yi, i > 1, is to make φ1 and ϕ1 dominate the corresponding higher order variables. Referring to Eq. (6), we see that the value of φi depends on the values of Ai and βi. Also, from Eq. (10), the vibrations will be
excited according to the gain φi′(0). We also see that the normalisation coefficient Ai in Eq. (8) is a function of βi. To sum up, the variable βi plays a critical role in satisfying our requirement of making the first vibration mode dominate. From a practical point of view, the hub parameters and the tip parameters are relatively fixed by the required motor torque and the assigned task, and the length of the manipulator is fixed by the workspace. A relatively free parameter, which bears an effect on βi, is the cross-section of the manipulator. By adjusting the height and width of the manipulator, we change the mass per unit length, the cross-sectional moment of inertia, and the beam inertia. The effect of the manipulator dimensions on βi, and in turn on φi, is very non-linear. From the equation of motion, Eq. (1), and the boundary conditions, Eq. (2), the determination of βi in closed-form analytical solution is very involved. Therefore, we turn to genetic algorithms to search for the manipulator dimension that best satisfies our requirements. Genetic algorithms are stochastic search methods based on the theory of evolution, survival of the fittest, and the theory of schemata [10]. When implementing the genetic algorithm, the variables to be searched are coded as binary strings. For two such strings, the parents, the binary bit patterns are exchanged at randomly selected bit positions; the resultant pair of strings, the offspring, become members of the population in the next generation. This exchange of binary bits, crossover, is conducted according to a crossover probability pc, and parents with higher fitness have a higher chance of crossover. This process exploits the search space, but diversity should also be explored; this is made possible by applying mutation to the offspring. Mutation flips one of the bits in a string according to the probability pm.
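Although the full characteristic equation with hub and payload inertias has no convenient closed form, the βi can still be located numerically once the boundary conditions are fixed. As an illustration (an assumption, not the paper's procedure), in the simplified clamped-free limit with a rigid hub and no tip mass or inertia, the characteristic equation reduces to the textbook form 1 + cos(βL)cosh(βL) = 0, which a scan-and-bisect routine solves easily; the full problem only changes the function whose sign changes are located:

```python
import math

def clamped_free_roots(L=0.8, n_roots=3):
    """First few roots beta of 1 + cos(bL)cosh(bL) = 0 (clamped-free beam)."""
    f = lambda b: 1.0 + math.cos(b * L) * math.cosh(b * L)
    roots, b = [], 1e-6
    step = 0.01 / L                      # scan resolution in beta
    while len(roots) < n_roots:
        if f(b) * f(b + step) < 0:       # sign change brackets a root
            lo, hi = b, b + step
            for _ in range(60):          # bisection refinement
                mid = 0.5 * (lo + hi)
                (lo, hi) = (mid, hi) if f(lo) * f(mid) > 0 else (lo, mid)
            roots.append(0.5 * (lo + hi))
        b += step
    return roots
```

For L = 1 this recovers the well-known values βL ≈ 1.8751, 4.6941, 7.8548.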
For every string in the population, the fitness is evaluated, and the process then repeats in the next generation. Termination criteria may be based on tracking the improvement of the average fitness of a generation, or on the count of generations processed. Variations on the standard genetic algorithm are adopted here in tuning the manipulator dimensions. We apply elitism, such that the string with the highest fitness value is stored irrespective of the processing of generations; this is because the last generation before termination of the algorithm may or may not contain the best string. We also adjust the crossover and mutation probabilities dynamically according to the improvement of the average fitness of a generation: if it is improving, pc and pm are reduced by a scale factor; while the average fitness is not improving, we increase pc and pm to gain more chances of finding a string of higher fitness. Our algorithm terminates when the average fitness has improved for several consecutive generations. The genetic algorithm is described below.

initialise the population randomly
evaluate fitness and average
while not end of algorithm
    select parents by fitness
    apply crossover and mutation
    evaluate fitness and average
    if improving then decrease pc and pm
    else increase pc and pm
    store the fittest string
end on fitness improved or generations count
(33)
where fr =reference datum keeping fitness> 0, f1 = 1 when φi (L) are in descending order f1 = 0 otherwise, f2 = (φi (L) − φi+1 (L)) reward greater dominance of lower order vibrations, f3 = |i × φi (L)| penalise weighted magnitude of vibrations at tip, f4 = φi (0) penalise mode shape slopes at the hub, f5 = Hb Wb penalise for large cross-sectional beam area, f6 = ωi reward higher vibration frequencies
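The modified GA described above (elitism, adaptive pc and pm, termination on sustained improvement) can be sketched as follows. This is an illustrative, real-coded sketch, whereas the paper codes variables as binary strings; all names, the fitness-proportional selection, and the variation details are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def adaptive_ga(fitness, n_genes, pop_size=10, max_gen=20,
                pc=0.9, pm=0.02, scale=1.1, patience=3):
    """GA with elitism and adaptive crossover/mutation probabilities:
    pc and pm shrink while the average fitness improves and grow otherwise;
    the run ends after `patience` consecutive improving generations."""
    pop = rng.random((pop_size, n_genes))
    best, best_fit = None, -np.inf
    prev_avg, streak = -np.inf, 0
    for _ in range(max_gen):
        fits = np.array([fitness(ind) for ind in pop])
        i = int(fits.argmax())
        if fits[i] > best_fit:                 # elitism: keep fittest ever seen
            best, best_fit = pop[i].copy(), fits[i]
        if fits.mean() > prev_avg:             # average improving
            pc, pm = pc / scale, pm / scale
            streak += 1
            if streak >= patience:
                break
        else:                                  # stagnating: explore more
            pc, pm = min(1.0, pc * scale), min(1.0, pm * scale)
            streak = 0
        prev_avg = fits.mean()
        # fitness-proportional selection of parents
        probs = fits - fits.min() + 1e-9
        probs /= probs.sum()
        parents = pop[rng.choice(pop_size, size=pop_size, p=probs)]
        # pairwise crossover of gene tails, then Gaussian mutation
        for j in range(0, pop_size - 1, 2):
            if rng.random() < pc:
                cut = rng.integers(1, n_genes) if n_genes > 1 else 0
                parents[[j, j + 1], cut:] = parents[[j + 1, j], cut:]
        mask = rng.random(parents.shape) < pm
        parents[mask] += rng.normal(0.0, 0.1, mask.sum())
        pop = parents
    return best, best_fit
```

The stored elite is returned directly, so the result does not depend on whether the final generation happens to contain the best string.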
5 Simulation
This section describes the simulations and presents their results. The cases compared include the open-loop response and the response under the proposed sliding mode controller; we also present the responses for the initially un-tuned and for the tuned manipulator dimensions. The parameters used in the simulations are given in Table 1.

Table 1. Simulation Parameters

beam initial height/width/length   0.04 m / 0.002 m / 0.8 m
hub inertia                        1.06×10⁻⁴ kg m²
payload mass/inertia               0.38 kg / 1.78×10⁻⁴ kg m²
no. of population/generations      10/20
probability pc/pm (initial)        0.9/0.02
scaling factor for pc and pm       1.1
termination on improvement         3 generations
5.1 Response with Un-tuned Dimensions
Using the initial manipulator parameters, an open-loop simulation was conducted for subsequent performance comparisons. The motor was fed with a voltage pulse, chosen by some trial and error, that brings the response to 0.2 rad, the set point for closed-loop control. The result is shown in Fig. 3. The closed-loop response using the sliding mode controller, with the initial un-tuned dimensions, is shown in Fig. 4. Comparing Figs. 3 and 4, we see that vibrations are effectively suppressed in the steady state with closed-loop control; however, vibration during the transient period is still observed.
[Figures: tip angle (rad) versus time (s), 0–5 s]

Fig. 3. Open-loop response, dimensions not tuned
Fig. 4. Closed-loop response, dimensions not tuned

5.2 Genetic Algorithm Tuning
Tuning of the manipulator height and width was conducted according to the genetic algorithm described in Section 4. It is noted that the algorithm terminated before reaching the maximum number of generations, with a relatively small population; this observation shows that the modification of the standard genetic algorithm with dynamic adjustment of the crossover and mutation probabilities was effective. Figures 5 and 6 show the fitness function values and the average fitness over the generations. It is also observed from the plots that, although the maximum fitness did not increase explicitly, the occurrences of low fitness decreased. Results for the mode shapes at the tip, φi(L), the mode-shape slopes at the hub, φi′(0), and the vibration frequencies, ωi, are tabulated in Table 2 for comparison between the initially un-tuned and the resulting tuned values. From the results, we see that the mode shapes at the tip are arranged in descending order as required, and the vibration frequencies are increased by the tuned dimensions. An overall
[Figures: fitness and average fitness versus generation]

Fig. 5. Fitness Function
Fig. 6. Average fitness function
Table 2. Comparison of design variables from the initially un-tuned and the tuned manipulator dimensions

Results from un-tuned dimensions:
mode   φi(L)    φi′(0)   ωi
1      −0.228   6.983    45.985
2      0.118    17.242   164.929
3      0.012    23.389   320.977

Results from tuned dimensions:
mode   φi(L)    φi′(0)   ωi
1      −0.211   7.421    68.366
2      0.102    18.074   243.026
3      0.000    24.191   465.226
optimal manipulator dimension results, irrespective of the slight increase in the mode-shape slopes.

5.3 Responses with Tuned Dimensions
The tip responses with the use of tuned dimensions are shown in Figs. 7 and 8 below. The tuning of manipulator dimension using genetic algorithms is effective
[Figures: tip angle (rad) versus time (s), 0–5 s]

Fig. 7. Open-loop response, dimensions tuned
Fig. 8. Closed-loop response, dimensions tuned
Fig. 9. Closed-loop response, dimensions tuned, payload increased by 50%
Fig. 10. Closed-loop response, dimensions tuned, payload increased by 100%
in reducing the vibration magnitudes, even in the open-loop response. Figures 9 and 10 show the closed-loop response when the payload is increased by 50% and 100% respectively; no significant degradation in the response is observed.
6 Conclusion
A sliding mode controller was used to control the position of a single-link flexible manipulator. Only the rigid mode and the first vibration mode were used as feedback signals, which reduced the controller dimension. The manipulator dimensions were tuned using a genetic algorithm with elitism and adaptive crossover and mutation probabilities; this made the first vibration mode dominate the higher order vibrations and reduced the complexity of implementing the controller. Simulation results demonstrated that the controller was effective in suppressing vibrations and achieving set-point regulation. Acknowledgement. This work is supported in part by the City University Strategic Grant 7001416.
Daily Stock Prediction Using Neuro-genetic Hybrids

Yung-Keun Kwon and Byung-Ro Moon
School of Computer Science & Engineering, Seoul National University
Shilim-dong, Kwanak-gu, Seoul, 151-742 Korea
{kwon,moon}@soar.snu.ac.kr
Abstract. We propose a neuro-genetic daily stock prediction model. Traditional indicators of stock prediction are utilized to produce useful input features of neural networks. The genetic algorithm optimizes the neural networks under a 2D encoding and crossover. To reduce the time in processing mass data, a parallel genetic algorithm was used on a Linux cluster system. It showed notable improvement on the average over the buy-and-hold strategy. We also observed that some companies were more predictable than others.
1 Introduction
It is a topic of practical interest to predict the trends of financial objects such as stocks, currencies, and options. This is not an easy job even for financial experts because such time series are mostly nonlinear, uncertain, and nonstationary. There is no consensus among experts as to how well, if at all, financial time series can be predicted, or how to predict them. Many approaches have been tried for a variety of financial problems such as portfolio optimization, bankruptcy prediction, financial forecasting, fraud detection, and scheduling. They include artificial neural networks [16] [19], decision trees [3], rule induction [5], Bayesian belief networks [20], evolutionary algorithms [9] [12], classifier systems [13], fuzzy sets [2] [18], and association rules [17]. Hybrid models combining a few approaches are also popular [15] [7]. Artificial neural networks (ANNs) have the ability to approximate a large class of functions well. ANNs have the potential to capture complex nonlinear relationships between a group of individual features and the variable being forecasted, which simple linear models are unable to capture. Two important issues for ANNs are how to find the most valuable input features and how to use them. In [22], the authors used not only quantitative variables but also qualitative information to forecast stock prices more accurately. Textual information in articles published on the web was also used in [21]. In addition, the optimization of weights is another important issue in ANNs. The backpropagation algorithm is the most popular one for the supervised training of ANNs, yet it is just a local gradient technique and there is room for further optimization. In this paper, we try to predict the stock price using a hybrid genetic approach combined with a recurrent neural network. We describe a number of input

E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 2203–2214, 2003.
© Springer-Verlag Berlin Heidelberg 2003
variables that enable the network to forecast the next-day price more accurately. The genetic algorithm operators provide diverse initial weights for the network. The rest of this paper is organized as follows. In Section 2, we explain the problem and present the objective. In Section 3, we describe our hybrid genetic algorithm for predicting the stock price. In Section 4, we provide our experimental results. Finally, conclusions are given in Section 5.
2 Preliminaries

2.1 The Problem and Dataset
We attack the automatic stock trading problem. There are a number of versions such as intraday, daily, weekly, and monthly trading, depending on the behavioral interval. In this work, we consider daily trading. In this problem, each record of the dataset includes daily information consisting of the closing price, the highest price, the lowest price, and the trading volume. We denote them at day t by x(t), xh(t), xl(t), and v(t), respectively. The trading strategy is based on the series of x(t): if we expect x(t + 1) to be higher than x(t), we buy the stocks; if lower, we sell them; otherwise, we do not take any action. The problem is a kind of time-series prediction that can usually be first tried with delay coordinates as follows:

x(t + 1) = f(x(t), x(t − 1), x(t − 2), . . .).

But this is a simple and weak model. We transform the original time series into another that is more suitable for neural networks. First, (x(t + 1) − x(t)) / x(t) is used instead of x(t + 1) as the target variable. For the input variables, we use technical indicators or signals that were developed in deterministic trading techniques. To achieve this, we construct the model as follows:

(x(t + 1) − x(t)) / x(t) = f(g1, g2, . . . , gm),

where gk (k = 1, . . . , m) is a technical indicator or signal.
2.2 The Objective
There can be a number of measures to evaluate the performance of the trading system. In our problem, we simulate daily trading as closely as possible to the actual situation and evaluate the profit. We have a cash balance Ct and stock St at day t (t = 1, . . . , N). We start with cash C, i.e., C1 = C and S1 = 0. Figure 1 shows the investing strategy and the change of property at day t + 1 according to the signal at day t of the trading system. In the strategy, the constant B is the upper bound of stock trade per day. We obtain the final property ratio P as follows:

P = (CN + SN) / (C1 + S1).
if ( signal is SELL ) {
    Ct+1 ← Ct + min(B, St)
    St+1 ← St − min(B, St)
}
if ( signal is BUY ) {
    Ct+1 ← Ct − min(B, Ct)
    St+1 ← St + min(B, Ct)
}
St+1 ← St+1 × x(t + 1) / x(t)

Fig. 1. Investing strategy and change of the property
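To make the update rules above concrete, here is a minimal Python sketch of the Fig. 1 strategy; the function name, the signal encoding, and the default values of C and B are illustrative, not from the paper.

```python
def simulate(prices, signals, cash=10000.0, trade_cap=1000.0):
    """Simulate the Fig. 1 daily strategy and return the final
    property ratio P = (C_N + S_N) / (C_1 + S_1).

    prices[t] is x(t); signals[t] is the trading system's signal at
    day t ("BUY", "SELL", or anything else for no action); trade_cap
    is the per-day upper bound B on the stock trade.
    """
    c, s = cash, 0.0                        # C_1 = C, S_1 = 0
    for t in range(len(prices) - 1):
        if signals[t] == "SELL":
            amount = min(trade_cap, s)
            c, s = c + amount, s - amount
        elif signals[t] == "BUY":
            amount = min(trade_cap, c)
            c, s = c - amount, s + amount
        s *= prices[t + 1] / prices[t]      # revalue the stock holding
    return (c + s) / cash
```

For example, buying the day before a 10% rise with C = 10000 and B = 1000 yields P = 1.01, since only a tenth of the cash participates in the move.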
3 The Suggested System

3.1 Processing Data
As mentioned, we have four daily data, x, xh, xl, and v, but we do not use them as input variables as they are. We utilize a number of technical indicators used by financial experts [10]. We describe some of them in the following:
– Moving average ( MA )
  • The numerical average value of the stock prices over a period of time.
  • MAS, MAL: short-term and long-term moving average, respectively.
– Golden-cross and dead-cross
  • States that MAS crosses MAL upward and downward, respectively.
– Moving average convergence and divergence ( MACD )
  • A momentum indicator that shows the relationship between MAS and MAL.
  • MACD = MAS − MAL
– Relative strength index ( RSI )
  • An oscillator that indicates the internal strength of a single stock.
  • RSI = 100 − 100 / (1 + U/D),
    U, D: the average of upward and downward price changes, respectively.
– Stochastics
  • An indicator that compares where a stock price closed relative to its price range over a given time period.
  • %K = (x(t) − L) / (H − L) × 100,
    H, L: the highest and the lowest price in a given time period.
  • %D = moving average of %K

We generate 64 input variables using the technical indicators. Figure 2 shows some representative variables. The others not shown in this figure include variables in the same forms as the above that use trading volumes in place of the prices. After generating the new variables, we normalize them by dividing by the maximum value of each variable. This helps the neural network learn efficiently.
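As an illustration, the four indicators above can be computed from a price history as follows; the window lengths are common defaults, not values stated in the paper.

```python
def moving_average(x, n):
    """Simple n-day moving average of the most recent n prices."""
    return sum(x[-n:]) / n

def macd(x, short=12, long=26):
    """MACD = MA_S - MA_L (12/26-day windows are common defaults)."""
    return moving_average(x, short) - moving_average(x, long)

def rsi(x, n=14):
    """RSI = 100 - 100 / (1 + U/D) over the last n price changes."""
    changes = [b - a for a, b in zip(x[-n - 1:-1], x[-n:])]
    u = sum(c for c in changes if c > 0) / n    # average upward change
    d = -sum(c for c in changes if c < 0) / n   # average downward change
    return 100.0 if d == 0 else 100.0 - 100.0 / (1.0 + u / d)

def stochastic_k(x, n=14):
    """%K compares the last close with the n-day high/low range."""
    window = x[-n:]
    h, l = max(window), min(window)
    return 100.0 * (x[-1] - l) / (h - l)
```

A strictly rising series drives RSI to 100 (no downward changes) and %K to 100 (the close is the period high), which matches the intended behavior of both oscillators.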
X1 = (MA(t) − MA(t − 1)) / MA(t − 1)
X2 = (MAS(t) − MAL(t)) / MAL(t)
X3 = # of days since the last golden-cross
X4 = # of days since the last dead-cross
X5 = (x(t) − x(t − 1)) / x(t − 1)
X6 = the profit while the stock has risen or fallen continuously
X7 = # of days for which the stock has risen or fallen continuously
X8 = MACD(t)
X9 = %K(t)
X10 = %D(t)
X11 = (x(t) − xl(t)) / (xh(t) − xl(t))
Fig. 2. Some examples of input variables
Fig. 3. The recurrent neural network architecture
3.2 Artificial Neural Networks
We use a recurrent neural network architecture which is a variant of Elman's network [4]. It consists of input, hidden, and output layers as shown in Fig. 3. Each hidden unit is connected to itself and also to all the other hidden units. The network is trained by a backpropagation-based algorithm. It has 64 nodes in the input layer, corresponding to the variables described in Section 3.1. Only one node exists in the output layer, for (x(t + 1) − x(t)) / x(t).
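A minimal numpy sketch of this forward pass is shown below; the weights are random placeholders, and the tanh hidden activation and linear output are assumptions, as the paper does not state the activation functions.

```python
import numpy as np

def rnn_forward(inputs, W_in, W_rec, W_out):
    """One pass over a sequence with an Elman-style recurrent layer.

    W_in:  (p, n) input-to-hidden weights
    W_rec: (p, p) hidden-to-hidden weights (each hidden unit is
           connected to itself and to every other hidden unit)
    W_out: (q, p) hidden-to-output weights
    Returns one output vector per time step.
    """
    p = W_rec.shape[0]
    h = np.zeros(p)                  # hidden state acts as context units
    outputs = []
    for x in inputs:
        h = np.tanh(W_in @ x + W_rec @ h)
        outputs.append(W_out @ h)    # linear output for the return ratio
    return outputs
```

With the paper's dimensions, n = 64, p = 20, and q = 1, so the chromosome in Section 3.3 carries 20 × (64 + 20 + 1) weights.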
3.3 Distribution of Loads by Parallelization
There have been many attempts to optimize the architectures or weights of ANNs. In this paper, we use a GA to optimize the weights. In particular, we parallelize the genetic algorithm, since it takes much time to handle data over a long period. The structure of the parallel GA is shown in Fig. 4. Parallel genetic algorithms can be largely categorized into three classes [1]: (i) global single-population master-slave GAs, (ii) single-population fine-grained,
Fig. 4. The framework of the parallel genetic algorithm

Fig. 5. Encoding in the GA
and (iii) multiple-population coarse-grained GAs. We take the first model for this work. In this neuro-genetic hybrid approach, the fitness evaluation dominates the running time. To evaluate an offspring (a network), the backpropagation-based algorithm trains the network with training data. In a measurement with gprof, the evaluation part took about 95% of the total running time. We distribute the load of evaluation to the clients (slaves) of a Linux cluster system. The main genetic parts are located in the server (master). When a new ANN is created by crossover and mutation, the GA passes it to one of the clients. When the evaluation is completed in the client, it sends the result back to the server. The server communicates with the clients in an asynchronous mode. This eliminates the need to synchronize every generation and maintains a high level of processor utilization, even if the slave processors operate at different speeds. This is possible because we use a steady-state GA, which does not wait until a whole set of offspring is generated. All of this is achieved with the help of MPI (Message Passing Interface), a popular interface specification for programming distributed-memory systems. In this work, we used a Linux cluster system with 46 CPUs. As shown in Fig. 4, the process in the server is a traditional steady-state GA. In the following, we describe each part of the GA.
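The asynchronous master-slave loop can be sketched with a thread pool standing in for the MPI clients; the toy genome, the stand-in fitness function, random parent selection (rather than the paper's roulette wheel), and all parameter values are illustrative only.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import random

def evaluate(network):
    """Stand-in for the backpropagation training + fitness evaluation
    that the paper reports takes ~95% of the running time."""
    return -sum((w - 0.5) ** 2 for w in network)

def steady_state_ga(pop_size=8, n_workers=4, n_offspring=20):
    rng = random.Random(1)
    pop = [[rng.random() for _ in range(5)] for _ in range(pop_size)]
    fit = [evaluate(ind) for ind in pop]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        pending, produced = {}, 0
        while produced < n_offspring or pending:
            # keep every worker busy: the master never waits for a
            # whole generation, matching the asynchronous mode
            while produced < n_offspring and len(pending) < n_workers:
                a, b = rng.sample(range(pop_size), 2)   # simplified selection
                cut = rng.randrange(5)
                child = pop[a][:cut] + pop[b][cut:]     # one-point crossover
                pending[pool.submit(evaluate, child)] = child
                produced += 1
            done = next(as_completed(pending))          # first result back wins
            child, f = pending.pop(done), done.result()
            worst = min(range(pop_size), key=fit.__getitem__)
            if f > fit[worst]:                          # steady-state replacement
                pop[worst], fit[worst] = child, f
    return max(fit)
```

The key point the sketch shows is that replacement happens per offspring as results arrive, so a slow worker never stalls the rest of the pool.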
Fig. 6. An example of 2D geographical crossover
– Representation: Most GAs used linear encodings in optimizing ANN weights [6] [14]. Recently, a two-dimensional encoding has proven to perform favorably [11]. We represent a chromosome by a two-dimensional weight matrix as shown in Figure 5. In the matrix, each row corresponds to a hidden unit and each column corresponds to an input, hidden, or output unit. A chromosome is represented by a p × (n + p + q) matrix, where n, p, and q are the numbers of input, hidden, and output units, respectively. In this work, the matrix size is 20 × (64 + 20 + 1).
– Selection, crossover, and mutation: Roulette-wheel selection is used for parent selection. The offspring is produced through geographic 2D crossover [8]. It is known to create diverse new schemata and to reflect well the geographical relationships among genes. It chooses a number of lines, divides the chromosomal domain into two equivalence classes, and alternately copies the genes from the two parents as shown in Figure 6. The mutation operator replaces each weight in the matrix with a low probability. All three operators are performed in the server.
– Local optimization: After an offspring is modified by the mutation operator, it is locally optimized by backpropagation, which helps the GA fine-tune around local optima. Local optimization is mingled with quality evaluation. As mentioned, it is performed in the client and the result is sent to the server.
– Replacement and stopping criterion: The offspring first attempts to replace the parent more similar to it. If it fails, it attempts to replace the other parent and then the most inferior member of the population, in order. Replacement is done only when the offspring is better than the replacee. The GA stops if it does not find an improved solution for a fixed number of generations.
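A numpy sketch of the line-based partitioning follows; the way the lines are drawn and the parity rule are a plausible reading of the operator, not the exact procedure of [8].

```python
import numpy as np

def geographic_2d_crossover(p1, p2, n_lines=3, rng=None):
    """Sketch of a geographic 2D crossover on weight matrices.

    Random straight lines cut the chromosomal plane; each cell is
    assigned to one of two equivalence classes by the parity of the
    number of lines it lies on the positive side of, and genes are
    copied alternately from the two parents.
    """
    rng = rng or np.random.default_rng()
    rows, cols = p1.shape
    r = np.arange(rows)[:, None]
    c = np.arange(cols)[None, :]
    parity = np.zeros((rows, cols), dtype=int)
    for _ in range(n_lines):
        a, b = rng.uniform(-1.0, 1.0, size=2)     # random line orientation
        bias = rng.uniform(-rows, rows)           # random line offset
        parity += (a * r + b * c + bias > 0).astype(int)
    mask = parity % 2 == 0
    return np.where(mask, p1, p2)                 # class 0 from p1, class 1 from p2
```

Because each gene is copied unchanged from one parent, the operator preserves whole 2D regions of a weight matrix, which is the geographical-relationship property the paper attributes to it.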
4 Experimental Results
We tested our approach with 36 companies' stocks in NYSE and NASDAQ. We evaluated the performance over ten years, from 1992 to 2001. We obtained the entire dataset from YAHOO (http://quote.yahoo.com). The GA was trained with two consecutive years of data and validated with the third year's. The solution was tested with the fourth year's data. This process was shifted year by year; thus, 13 years of data in total were used for this work. Table 1 shows the experimental results. The values are the final property ratio P defined in Section 2.2. In the table, the Hold strategy buys the stock on the first day and holds it all the year through. RNN and GA are the average results
by the recurrent neural network (20 trials) and the hybrid genetic algorithm (20 trials), respectively. For qualitative comparison, we summarized the relative performance in Table 2. In the table, Up, Down, and NC represent the states of the stock market in each year. Since there are 36 companies tested for 10 years, we have 359 cases, excluding one case with deficient data. Up and Down mean that the closing price has risen or fallen from the year's starting price by 5% or more, respectively; NC means no notable difference. Better, Worse, and Even represent the relative performance of RNN and GA over the Hold. Better and Worse mean that the P value of the learned strategy is at least 5% higher or lower than that of Hold, respectively. The GA performed better than the Hold in 153 cases, worse in 88 cases, and comparably in 118 cases. RNN and GA showed a significant performance improvement over Hold on the average. The GA considerably improved the performance of RNN.

Fig. 7. Improvements of RNN and GA over Hold in the first half of the companies (panels: AA, AXP, AYP, BA, C, CAT, DD, DELL, DIS, EK, GE, GM, HD, HON, HWP, IBM, INTC, IP)
Table 1. P values. For each symbol (AA, AXP, AYP, BA, C, CAT, DD, DELL, DIS, EK, GE, GM, HD, HON, HWP, IBM, INTC, IP) the table lists the P values of the Hold, RNN, and GA strategies for 1992–2001; each line of numbers below holds one year's column of the table.
1.141 1.168 1.167 1.205 1.039 1.071 1.070 1.089 1.096 0.839 0.858 0.915 1.238 1.150 1.101 1.246 1.303 1.275 1.038 1.071 1.105 2.514 3.312 3.130 1.476 1.399 1.308 0.830 0.976 0.938 1.116 1.114 1.082 1.060 1.091 1.131 1.437 1.440 1.442 1.437 1.071 1.167 1.219 1.084 1.368 0.555 0.630 0.669 1.721 1.175 1.285 0.946 0.948 0.990
0.967 1.166 1.159 1.185 1.037 1.030 1.108 1.168 1.152 1.101 1.156 1.151 1.656 1.691 1.655 1.644 1.374 1.304 1.026 1.071 1.052 0.530 0.822 0.793 1.009 1.020 1.043 1.370 1.290 1.326 1.216 1.184 1.215 1.684 1.086 1.124 0.787 0.802 0.909 1.288 1.239 1.257 1.141 1.185 1.212 1.150 1.040 1.022 1.416 1.330 1.272 1.026 0.990 1.022
1.211 1.140 1.157 1.116 1.119 1.103 0.825 0.829 0.818 1.072 1.087 1.063 0.817 0.817 0.819 1.231 1.291 1.390 1.131 1.172 1.127 1.661 1.680 1.880 1.081 1.106 1.191 1.076 1.163 1.153 0.981 1.024 1.021 0.754 0.797 0.858 1.171 1.171 1.148 0.875 0.986 0.996 1.265 1.399 1.405 1.280 0.996 1.003 1.041 1.067 1.113 1.105 1.125 1.158
1.272 1.262 1.211 1.419 1.416 1.353 1.328 1.484 1.433 1.709 1.523 1.579 1.878 1.540 1.548 1.082 1.225 1.256 1.286 1.178 1.196 1.787 1.987 1.720 1.306 1.115 1.148 1.419 1.526 1.396 1.441 1.375 1.309 1.251 1.251 1.340 1.044 1.123 1.132 1.449 1.483 1.227 1.698 1.286 1.298 1.232 1.204 1.139 1.839 1.422 1.413 1.042 1.233 1.212
1.198 1.154 1.212 1.316 1.370 1.389 1.039 1.065 1.113 1.300 1.303 1.270 1.441 1.418 1.394 1.227 1.567 1.542 1.326 1.142 1.063 2.877 2.182 2.184 1.107 1.112 1.087 1.149 1.190 1.202 1.328 1.244 1.192 1.120 1.024 1.052 1.037 1.098 1.114 1.365 1.330 1.228 1.207 1.197 1.229 1.686 1.715 1.668 2.224 2.094 1.856 1.067 1.214 1.190
1.094 1.120 1.161 1.624 1.488 1.531 1.077 1.057 1.065 0.941 1.040 1.034 1.803 1.782 1.931 1.342 1.169 1.173 1.272 1.273 1.342 3.346 3.233 2.186 1.479 1.424 1.397 0.807 0.869 0.850 1.516 1.492 1.423 1.107 1.252 1.206 1.786 1.478 1.363 1.175 1.244 1.192 1.282 1.416 1.507 1.378 1.272 1.216 1.114 1.148 1.103 1.082 1.010 0.959
1.035 1.243 1.243 1.144 1.214 1.256 1.062 1.055 1.058 0.674 0.737 0.742 0.940 1.001 1.070 0.967 0.986 1.054 0.909 1.116 1.089 3.461 3.364 3.288 0.890 0.906 0.897 1.119 1.267 1.206 1.359 1.382 1.361 1.162 1.192 1.196 2.011 1.266 1.401 1.103 1.151 1.182 1.068 1.136 1.070 1.733 1.555 1.599 1.664 1.659 1.708 0.932 1.019 1.012
2.195 1.446 1.373 1.549 1.667 1.617 0.756 0.800 0.795 1.231 1.408 1.304 1.576 1.702 1.764 1.026 1.308 1.331 1.176 1.408 1.452 1.372 1.248 1.148 1.011 1.072 1.077 0.907 1.067 1.117 1.492 1.718 1.575 1.245 1.260 1.314 1.663 1.805 1.711 1.301 1.333 1.251 1.704 1.680 1.430 1.264 1.324 1.397 1.440 1.316 1.432 1.318 1.477 1.402
0.797 0.806 0.975 0.992 0.996 1.094 1.757 1.274 1.276 1.528 1.152 1.163 1.273 1.352 1.401 0.952 0.819 0.875 0.738 0.948 0.931 0.344 0.343 0.342 0.935 0.887 1.050 0.596 0.701 0.724 0.875 0.928 0.937 0.699 0.737 0.786 0.699 0.742 0.768 0.781 0.860 0.956 0.641 0.849 0.708 0.734 0.732 0.741 0.714 0.802 0.759 0.711 0.803 0.841
1.102 1.078 1.143 0.686 0.682 0.696 0.795 0.946 0.960 0.625 0.676 0.586 1.000 0.979 0.980 1.128 1.103 1.071 0.886 1.043 0.997 1.553 1.624 1.890 0.742 1.012 1.019 0.764 0.894 0.936 0.916 0.905 0.924 0.931 1.143 1.284 1.120 1.145 1.135 0.764 0.881 0.921 0.679 0.699 0.683 1.426 1.502 1.611 1.012 1.068 1.284 1.023 1.112 1.092
Table 1. Continued. For each symbol (JNJ, JPM, KO, MCD, MMM, MO, MRK, MSFT, NMSB, ORCL, PG, RYFL, SBC, SUNW, T, UTX, WMT, XOM) the table lists the P values of the Hold, RNN, and GA strategies for 1992–2001; each line of numbers below holds one year's column of the table.
0.882 0.881 0.913 0.959 1.003 0.976 1.047 1.058 1.114 1.248 1.268 1.254 1.056 1.161 1.180 0.959 0.959 0.960 0.791 0.790 0.817 1.120 1.153 1.270 0.893 7.049 6.773 1.924 1.445 1.377 1.145 1.136 1.137 0.750 8.658 9.037 1.121 1.169 1.170 1.165 1.104 1.361 1.293 1.031 1.095 0.920 1.052 0.988 1.063 1.060 1.107 1.023 1.219 1.247
1993
1994
0.914 0.921 0.912 1.065 1.051 1.032 1.060 1.060 1.059 1.148 1.256 1.245 1.062 1.112 1.135 0.755 0.750 0.779 0.803 0.803 0.868 0.941 0.941 1.031 1.360 4.855 5.081 1.991 1.840 1.558 1.080 1.069 1.058 0.889 12.406 12.718 1.137 1.274 1.299 0.851 1.019 0.943 1.037 0.987 0.981 1.228 1.170 1.183 0.811 0.813 0.857 1.039 1.198 1.182
1.204 1.211 1.185 0.814 0.822 0.830 1.163 1.082 1.054 1.036 1.197 1.227 1.011 1.075 1.081 1.000 1.006 0.980 1.085 1.239 1.196 1.502 1.343 1.343 1.000 4.603 4.843 1.504 1.647 1.613 1.085 1.115 1.091 0.688 16.073 15.765 0.979 0.983 1.000 1.245 1.412 1.465 0.950 0.950 0.950 1.048 1.175 1.216 0.819 0.771 0.791 0.951 1.134 1.131
1995 1996 1997 1998 1999 2000 2001 1.546 1.554 1.490 1.436 1.369 1.381 1.449 1.464 1.430 1.562 1.579 1.584 1.255 1.310 1.318 1.594 1.356 1.280 1.680 1.465 1.260 1.491 1.418 1.293 1.618 4.896 4.777 1.513 1.517 1.491 1.333 1.399 1.354 2.273 7.267 6.346 1.437 1.217 1.242 2.512 1.212 1.556 1.347 1.188 1.169 1.491 1.325 1.330 1.114 1.035 1.158 1.330 1.267 1.234
1.181 1.184 1.074 1.194 1.091 1.242 1.383 1.385 1.352 0.992 1.087 1.092 1.321 1.245 1.221 1.214 1.164 1.204 1.243 1.310 1.258 1.819 1.694 1.898 1.273 3.012 3.021 1.457 1.667 1.554 1.283 1.224 1.237 0.760 7.670 8.001 0.901 0.931 0.948 1.196 1.334 1.594 0.891 1.102 1.102 1.400 1.396 1.301 0.989 1.038 1.097 1.220 1.293 1.312
1.308 1.313 1.434 1.156 1.234 1.234 1.290 1.256 1.243 1.055 1.050 1.044 0.968 1.048 1.101 1.219 1.450 1.455 1.347 1.263 1.270 1.606 1.603 1.438 1.514 3.280 3.437 0.821 1.024 0.974 1.518 1.455 1.417 1.053 5.034 5.191 1.424 1.513 1.526 1.551 1.638 1.836 1.404 1.336 1.219 1.114 1.134 1.175 1.712 1.262 1.265 1.258 1.392 1.397
1.271 1.411 1.426 0.957 0.961 0.960 1.004 1.012 0.955 1.615 1.483 1.358 0.894 0.916 0.903 1.160 1.201 1.244 1.391 1.430 1.433 2.151 1.894 1.920 0.896 2.174 2.126 1.870 1.416 1.540 1.112 1.111 1.113 0.260 1.770 1.827 1.404 1.538 1.635 2.170 1.605 1.433 1.324 1.317 1.231 1.477 1.560 1.559 2.048 1.166 1.217 1.174 1.280 1.296
1.115 1.150 1.275 1.133 1.223 1.101 0.839 0.870 0.916 1.030 1.114 1.179 1.265 1.100 1.138 0.446 0.678 0.690 0.903 0.981 0.988 1.653 1.705 1.793 1.063 1.530 1.623 4.121 1.576 1.633 1.191 1.398 1.395 1.154 3.627 2.950 0.897 1.066 1.067 3.398 3.298 2.580 1.028 1.207 1.123 1.157 1.246 1.288 1.659 1.700 1.771 1.077 1.105 1.156
1.106 1.314 1.376 1.363 1.404 1.345 1.079 1.154 1.164 0.845 0.845 0.855 1.263 1.121 1.052 1.971 1.440 1.662 1.375 1.360 1.377 0.372 0.382 0.407 0.842 2.485 2.440 0.893 0.916 0.977 0.732 1.115 1.010 0.800 3.983 3.967 1.068 1.020 1.032 0.665 0.701 0.776 0.342 0.395 0.444 1.204 1.233 1.279 0.806 0.943 0.990 1.140 1.523 1.536
1.159 1.162 1.179 N/A 0.775 1.007 0.982 0.790 0.934 0.966 0.992 0.995 0.949 0.993 1.005 0.993 0.632 0.975 0.950 1.527 1.564 1.435 1.379 1.571 1.524 0.524 0.528 0.735 1.008 1.151 1.188 1.333 5.753 5.655 0.778 1.020 1.115 0.484 0.544 0.538 1.287 1.655 1.417 0.859 0.745 0.720 1.068 1.076 1.111 0.882 0.957 0.994
Fig. 7. Continued (panels: JNJ, JPM, KO, MCD, MMM, MO, MRK, MSFT, NMSB, ORCL, PG, RYFL, SBC, SUNW, T, UTX, WMT, XOM)

Table 2. Relative performance of RNN and GA over Hold

(1) RNN
          Better   Worse   Even   Total
  Up        63       64     103     230
  Down      55        3      28      86
  NC        21        1      21      43
  Total    139       68     152     359

(2) GA
          Better   Worse   Even   Total
  Up        79       76      75     230
  Down      66        2      18      86
  NC        28        2      13      43
  Total    173       80     106     359
Figure 7 shows the relative performance graphically; the Y-axis represents the percentage improvement of RNN and GA over Hold. We can observe in the figure that some companies are more predictable than others. Practically, one can choose the stocks of companies that turned out to be more stably predictable by the GA.
5 Conclusion
In this paper, we proposed a genetic algorithm combined with a recurrent neural network for daily stock trading. It showed significantly better performance than the "buy-and-hold" strategy on a variety of companies over the recent ten years. For the input nodes of the neural network, we transformed the stock prices and volumes into a large number of technical indicators or signals. The genetic algorithm helped optimize the weights of the neural network globally, which turned out to contribute to the performance improvement. In the experiments, the proposed GA predicted better for some companies than for others. This implies that this work can be useful in portfolio optimization. Future work will include combining the stock trading strategy with portfolio management. In addition, this approach is not restricted to the stock market.

Acknowledgment. This work was partly supported by Optus Inc. and the Brain Korea 21 Project. The RIACT at Seoul National University provided research facilities for this study.
References

1. E. Cantú-Paz. A survey of parallel genetic algorithms. Calculateurs Parallèles, 10(2):141–171, 1998.
2. O. Castillo and P. Melin. Simulation and forecasting complex financial time series using neural networks and fuzzy logic. In Proceedings of IEEE Conference on Systems, Man, and Cybernetics, pages 2664–2669, 2001.
3. Y. M. Chae, S. H. Ho, K. W. Cho, D. H. Lee, and S. H. Ji. Data mining approach to policy analysis in a health insurance domain. International Journal of Medical Informatics, 62(2):103–111, 2001.
4. J. L. Elman. Finding structure in time. Cognitive Science, 14:179–211, 1990.
5. J. A. Gentry, M. J. Shaw, A. C. Tessmer, and D. T. Whitford. Using inductive learning to predict bankruptcy. Journal of Organizational Computing and Electronic Commerce, 12(1):39–57, 2002.
6. S. A. Harp, T. Samad, and A. Guha. Towards the genetic synthesis of neural networks. In International Conference on Genetic Algorithms, pages 360–369, 1989.
7. P. G. Harrald and M. Kamstra. Evolving artificial neural networks to combine financial forecasts. IEEE Transactions on Evolutionary Computation, 1(1):40–52, 1997.
8. A. B. Kahng and B. R. Moon. Toward more powerful recombinations. In International Conference on Genetic Algorithms, pages 96–103, 1995.
9. M. A. Kanoudan. Genetic programming prediction of stock prices. Computational Economics, 16:207–236, 2000.
10. P. J. Kaufman. Trading Systems and Methods. John Wiley & Sons, 1998.
11. J. H. Kim and B. R. Moon. Neuron reordering for better neuro-genetic hybrids. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 407–414, 2002.
12. K. J. Kim. Genetic algorithms approach to feature discretization in artificial neural networks for the prediction of stock price index. Expert Systems with Applications, 19(2):125–132, 2000.
13. P. Y. Liao and J. S. Chen. Dynamic trading strategy learning model using learning classifier system. In Proceedings of the Congress on Evolutionary Computation, pages 783–789, 2001.
14. C. T. Lin and C. P. Jou. Controlling chaos by GA-based reinforcement learning neural network. IEEE Transactions on Neural Networks, 10(4):846–869, 1999.
15. K. N. Pantazopoulos, L. H. Tsoukalas, N. G. Bourbakis, Brün, and E. N. Houstis. Financial prediction and trading strategies using neurofuzzy approaches. IEEE Transactions on Systems, Man, and Cybernetics–Part B, 28(4):520–531, 1998.
16. P. Tiňo, C. Schittenkopf, and G. Dorffner. Financial volatility trading using recurrent neural networks. IEEE Transactions on Neural Networks, 12:865–874, 2001.
17. R. Veliev, A. Rubinov, and A. Stranieri. The use of an association rules matrix for economic modelling. In International Conference on Neural Information Processing, pages 836–841, 1999.
18. Y. F. Wang. Predicting stock price using fuzzy grey prediction system. Expert Systems with Applications, 22(1):33–38, 2002.
19. I. D. Wilson, S. D. Paris, J. A. Ware, and D. H. Jenkins. Residential property price time series forecasting with neural networks. Knowledge-Based Systems, 15(5):335–341, 2002.
20. R. K. Wolfe. Turning point identification and Bayesian forecasting of a volatile time series. Computers and Industrial Engineering, 15:378–386, 1988.
21. B. Wuthrich, V. Cho, S. Leung, D. Permunetilleke, K. Sankaran, and J. Zhang. Daily stock market forecast from textual web data. In IEEE International Conference on Systems, Man and Cybernetics, pages 2720–2725, 1999.
22. Y. Yoon and G. Swales. Predicting stock price performance: a neural network approach. In Proc. 24th Annual Hawaii International Conference on System Sciences, pages 156–162, 1991.
Finding the Optimal Gene Order in Displaying Microarray Data

Seung-Kyu Lee (1), Yong-Hyuk Kim (2), and Byung-Ro Moon (2)
(1) NHN Corp., 7th floor, Startower 737 Yoksam-dong, Kangnam-gu, Seoul, Korea
[email protected]
(2) School of Computer Science & Engineering, Seoul National University
Shilim-dong, Kwanak-gu, Seoul, 151-742 Korea
{yhdfly, moon}@soar.snu.ac.kr
Abstract. The rapid advances of genome-scale sequencing have brought out the necessity of developing new data processing techniques for enormous genomic data. Microarrays, for example, can generate such a large number of gene expression data that we usually analyze them with some clustering algorithms. However, the clustering algorithms have been ineffective for visualization in that they are not concerned about the order of genes in each cluster. In this paper, a hybrid genetic algorithm for finding the optimal order of microarray data, or gene expression profiles, is proposed. We formulate our problem as a new type of traveling salesman problem and apply a hybrid genetic algorithm to the problem. To use the 2D natural crossover, we apply the Sammon’s mapping to the microarray data. Experimental results showed that our algorithm found improved gene orders for visualizing the gene expression profiles.
1 Introduction
The recent marvelous advances of genome-scale sequencing have provided us with a huge amount of genomic data. Microarrays [36,37], for example, have been used for revealing gene expression profiles for thousands of genes. In general, microarray data can be represented by a real-valued matrix; each row represents a gene and each column represents a condition, or experiment. If we let the matrix be X, the element X_ij represents the expression level of gene i for a given condition j. The microarray data are usually preprocessed with a clustering algorithm. The clustered microarray data can then be analyzed by biologists. A number of algorithms for clustering gene expression profiles have been proposed. Eisen et al. [10] applied hierarchical clustering [38], which has been a widely used tool [1,22,24,35]. It also has some variants [2,17]. Self-organizing maps (SOMs) [42,44] and k-means clustering [43] were also used for the same purpose. Ben-Dor et al. [3] developed an algorithm, the cluster affinity search technique (CAST), which has a good theoretical basis. Merz and Zell [28] proposed a memetic algorithm for the problem formulated as finding the minimum sum-of-squares clustering

E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 2215–2226, 2003. © Springer-Verlag Berlin Heidelberg 2003
[48,9]. However, all the proposed algorithms were ineffective in visualizing the microarray data since they were not concerned with aligning genes within each cluster in a meaningful way. This raises the problem of finding the optimal order of genes for visualization. Although there is no standard criterion for evaluating which order is better than the others for visualization, placing genes with similar or identical expression profiles next to each other is considered natural and intuitive. Since finding the optimal order of microarray data is known to be NP-hard [5], evolutionary approaches such as genetic algorithms [18,13] and memetic algorithms [29] are considered well suited for solving the problem. Recently, Tsai et al. [46] formulated the problem as the traveling salesman problem (TSP) and applied the family competition genetic algorithm (FCGA). In the FCGA, the edge assembly crossover [30] was combined with the family competition concept [45] and neighbor-join mutation [47]. With this combination, they showed that their formulation was effective in finding attractive gene orders for visualizing microarray data. However, they implicitly tried to minimize the distance between distant genes as well, which is less important for visualization. In this paper, we propose a hybrid genetic algorithm for finding the optimal gene order of microarray data. We suggest a new variation of the TSP formulation for this purpose. We use the 2D natural crossover [20,21], one of the state-of-the-art crossovers in the TSP literature. To use the 2D natural crossover, we need a 2D mapping of the data, which are virtually real-valued vectors, from a high-dimensional space into a two-dimensional Euclidean space. This mapping is necessary since the crossover exploits two-dimensional geographical information. We choose Sammon's mapping [34] among several candidates.
Another important contribution of this paper is a new formulation of the TSP for the problem. In this model, relatively long edges in a tour are ignored in fitness evaluation. This is because reducing the length of long edges, which represent distantly related genes, is not very meaningful for visualizing microarray data. We tested this idea on a spectrum of different rates of excluded edges. The remainder of the paper is organized as follows. In Section 2, we summarize the traveling salesman problem and Sammon's mapping. In Section 3, we describe our variation of the TSP formulation for finding the optimal gene order in displaying microarray data. In Section 4, we explain our hybrid genetic algorithm in detail, and we present the experimental results in Section 5. Finally, we draw our conclusions in Section 6.
2 Preliminaries

2.1 Traveling Salesman Problem
Given n cities and a distance matrix $D = [d_{ij}]$, where $d_{ij}$ is the distance between city i and city j, the traveling salesman problem (TSP) is the problem of finding a permutation π that minimizes

$$ \sum_{i=1}^{n-1} d_{\pi_i,\pi_{i+1}} + d_{\pi_n,\pi_1}. $$

In metric TSP the cities lie in a metric space (i.e., the distances satisfy the triangle inequality). In Euclidean
TSP, the cities lie in $\mathbb{R}^d$ for some d; the most popular version is 2D Euclidean TSP, where the cities lie in $\mathbb{R}^2$. Euclidean TSP is a sub-case of metric TSP.

2.2 Sammon's Mapping
Sammon's mapping [34] is a technique for transforming a dataset from a high-dimensional (say, m-dimensional) input space onto a low-dimensional (say, d-dimensional) output space (with d < m). The basic idea is to arrange all the data points in the d-dimensional output space in such a way that the distortion of the relationships among data points is minimized. Sammon's mapping tries to preserve distances. This is achieved by minimizing an error criterion which penalizes the differences between distances in the input space and in the output space. Consider a dataset of n objects. If we denote the distance between two points $x_i$ and $x_j$ in the input space by $\delta_{ij}$ and the distance between their images in the output space by $d_{ij}$, then Sammon's stress measure E is defined as follows:

$$ E = \frac{1}{\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} \delta_{ij}} \sum_{i=1}^{n-1}\sum_{j=i+1}^{n} \frac{(\delta_{ij} - d_{ij})^2}{\delta_{ij}}. $$
The stress ranges over [0,1], with 0 indicating a lossless mapping. This stress measure can be minimized using any minimization technique. Sammon [34] proposed a technique called pseudo-Newton minimization, a steepest-descent method. The complexity of Sammon's mapping is O(n²m). There have been many studies of Sammon's mapping [8,33,31].
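The mapping described above can be prototyped directly. The following is a minimal sketch that minimizes Sammon's stress by plain gradient descent, rather than the pseudo-Newton step of [34]; the function names, learning rate, and iteration count are illustrative assumptions:

```python
import numpy as np

def sammon_stress(D_in, Y):
    """Sammon's stress E of output coordinates Y, given the symmetric
    matrix D_in of input-space distances (delta_ij in the text)."""
    n = len(Y)
    iu = np.triu_indices(n, k=1)
    delta = D_in[iu]
    diff = Y[:, None, :] - Y[None, :, :]
    d_out = np.sqrt((diff ** 2).sum(-1))[iu]      # output-space d_ij
    return ((delta - d_out) ** 2 / delta).sum() / delta.sum()

def sammon_map(X, d=2, iters=500, lr=0.1, seed=0):
    """Map the rows of X onto d dimensions by plain gradient descent on
    Sammon's stress (a sketch; the original uses a pseudo-Newton step)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    diff = X[:, None, :] - X[None, :, :]
    D_in = np.sqrt((diff ** 2).sum(-1))           # input-space delta_ij
    c = D_in[np.triu_indices(n, k=1)].sum()       # normalizing constant
    Y = rng.normal(scale=1e-2, size=(n, d))       # small random layout
    for _ in range(iters):
        dY = Y[:, None, :] - Y[None, :, :]
        d_out = np.sqrt((dY ** 2).sum(-1))
        np.fill_diagonal(d_out, 1.0)              # avoid division by 0
        delta = D_in.copy()
        np.fill_diagonal(delta, 1.0)
        w = (delta - d_out) / (delta * d_out)     # pairwise weights
        np.fill_diagonal(w, 0.0)
        grad = (-2.0 / c) * (w[:, :, None] * dY).sum(axis=1)
        Y -= lr * grad                            # descend the stress
    return Y, D_in
```

On a small point set, the stress of the returned layout falls well below that of the random initial configuration, approximating the distance-preserving arrangement described above.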
3 A New TSP Formulation
To visualize the microarray data, or gene expression profiles, in a meaningful way, it is natural and intuitive to align genes with similar expression profiles, or within the same group, close together. For genes with similar expression profiles to be aligned next to each other, it is useful to formulate the problem as the TSP. For the TSP formulation, a distance measure is needed to quantify the similarity between gene expression profiles, which then defines the similarity between the genes themselves. Several distance measures have been proposed. They include the Pearson correlation¹, absolute correlation², Spearman rank correlation [39], Kendall rank correlation [23], and Euclidean distance. In this paper, we choose the Pearson correlation as the distance measure. Let $X = x_1, x_2, \ldots, x_k$ and $Y = y_1, y_2, \ldots, y_k$ be the expression levels of two genes X and Y, observed over a series of k conditions. The Pearson

¹ Karl Pearson (1857–1936) is considered to be the first to call the quantity a correlation coefficient, in 1896 [7]. It first appeared in published form in Harris [16]. It is also referred to as the Pearson product-moment correlation coefficient.
² Absolute value of correlation.
Fig. 1. A comparison of good tours between the naive and our new TSP formulation: (a) with the naive TSP formulation; (b) with our new TSP formulation.
correlation of the two genes X and Y is

$$ s_{X,Y} = \frac{1}{k} \sum_{i=1}^{k} \left( \frac{x_i - \bar{X}}{\sigma_X} \right)\left( \frac{y_i - \bar{Y}}{\sigma_Y} \right) $$

where $\bar{X}$ and $\sigma_X$ are the mean and the standard deviation of the expression levels, respectively. Then we define the distance between the genes X and Y by $D(X, Y) = 1 - s_{X,Y}$, where $s_{X,Y}$ is the Pearson correlation. Once the distance measure is defined, it is possible to formulate finding the optimal gene order for visualization as a TSP. Each gene corresponds to a city in the TSP, and the distance between two genes corresponds to the length of the edge between the two cities. Tsai et al. [46] reduced the problem of finding the optimal order of genes to the problem of finding the shortest tour of the corresponding TSP. In this model, the fitness function is naturally defined to be

$$ \sum_{i=1}^{n} D(g_{\pi_i}, g_{\pi_{i+1}}) $$
where $g_{\pi_{n+1}} = g_{\pi_1}$, $g_i$ denotes a gene, π denotes a gene order, n is the number of genes, and $D(g_i, g_j)$ is the distance between two genes $g_i$ and $g_j$. The above TSP formulation [46] aims at aligning genes with similar profiles close together. However, it also tries to minimize the distances between pairs of genes with not-very-similar profiles. When two genes are adjacent in a TSP tour, they are placed next to each other in the display. If they have distant profiles, replacing one of them by a third gene with a less distant profile has little meaning, as long as the profiles are not considerably similar. Tsai et al.'s TSP formulation implicitly tries to reduce this type of edge as well. To alleviate the problem, we propose a new fitness function. Our key idea is that we can improve the visualization results by excluding less meaningful edges in a tour from the fitness function. Figure 1 illustrates the motivation of our new TSP formulation. The length of the tour in Fig. 1(a) is shorter than that of the tour in Fig. 1(b), so the former is preferred in the naive TSP formulation. However, the
tour in Fig. 1(b) can be a better tour in that it reflects the natural grouping, denoted by ellipses. Since an edge between two genes means that the two genes are placed next to each other in the visualization, we can make naturally-grouped genes appear next to each other if a meaningful tour like Fig. 1(b), possibly including long edges, is preferred. For such tours to be preferred, relatively long edges are excluded from the fitness function. The dotted edges in Fig. 1(b) represent the relatively long edges, so they are not counted in our new TSP formulation. By excluding them, meaningful tours like Fig. 1(b) can be favored. More formally, our variation of the TSP formulation defines the fitness function as

$$ \sum_{i=1}^{n} D(g_{\pi_i}, g_{\pi_{i+1}}) \, \delta(g_{\pi_i}, g_{\pi_{i+1}}) $$

where

$$ \delta(i, j) = \begin{cases} 0 & \text{if } (i, j) \in L \\ 1 & \text{otherwise} \end{cases} $$
in which (i, j) is the edge connecting gene i and gene j, and L is the set of excluded edges. We use the new fitness function only in selection and replacement. In other words, the other stages of the GA still use the common TSP formulation, which considers all the edges in a tour. This setting was settled on after preliminary experiments.
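The Pearson distance and the truncated fitness above can be sketched as follows. Note that the text does not fully specify how the excluded-edge set L is chosen; as an assumption, this sketch takes L to be the longest p percent of the edges in the tour, matching the spectrum of excluded-edge rates swept in Section 5:

```python
import numpy as np

def pearson_distance(x, y):
    """D(X, Y) = 1 - s_{X,Y}, with s the Pearson correlation of two
    expression profiles observed over the same k conditions."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    s = np.mean(((x - x.mean()) / x.std()) * ((y - y.mean()) / y.std()))
    return 1.0 - s

def truncated_tour_fitness(dist, order, exclude_percent):
    """Fitness of a gene order (a closed tour): the sum of edge
    distances with the excluded set L left out.  Assumption: L is the
    longest exclude_percent of the tour's edges."""
    n = len(order)
    edges = [dist[order[i]][order[(i + 1) % n]] for i in range(n)]
    k = int(n * exclude_percent / 100.0)          # |L|
    edges.sort()                                  # longest edges last
    return sum(edges[:n - k])                     # drop the k longest
```

For example, two perfectly correlated profiles have distance 0, two anti-correlated profiles have distance 2, and with 0% exclusion the fitness reduces to the naive tour length.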
4 A Hybrid Genetic Algorithm
A genetic algorithm hybridized with local optimization is called a hybrid GA. A great many studies on the hybridization of GAs have been proposed [49,32].

– Sammon's Mapping: Since the microarray data are virtually real-valued vectors in a high-dimensional space, we map them into the two-dimensional space in order to use the 2D natural crossover, which operates on chromosomes encoded as 2D graphic images. We chose Sammon's mapping described in Section 2.2. Figure 2(a) shows a Sammon-mapped image from a small subset of real-field microarray data.
– Encoding: Using Sammon's mapping, we obtain a 2D Euclidean TSP instance from the distance information. We use the graphic image itself of a tour as a chromosome. This encoding was used in [20,21] and showed successful results on most TSP benchmarks.
– Initialization: All the chromosomes are created at random. We set the population size to 50 in our algorithm.
– Selection: We use tournament selection [14]. The tournament size is 2.
Fig. 2. A Sammon-mapped image and an example crossover on it: (a) Sammon-mapped genes; (b) tour A; (c) tour B; (d) partitioned genes; (e) intermediate sub-tours; (f) new offspring
– Crossover: We use the natural crossover [20,21]. The natural crossover draws free curves on the 2D space where genes are located. The curves divide the chromosomal positions into two disjoint partitions. Then we copy the genes in one partition from one parent to the offspring and those in the other partition from the other parent. Figures 2(b) through (f) show an example operation of the natural crossover on a Sammon-mapped chromosome.
– Mutation: The double-bridge kick move, known from the literature to be effective [19,27], is used.
– Local Optimization: We use the Lin-Kernighan (LK) algorithm [26], one of the most effective heuristics for the TSP. The LK used here is an advanced version incorporating the don't-look bit [4] and segment tree [12] techniques, which cause a dramatic speed-up.
– Replacement: The replacement scheme proposed in [6] is used. The offspring first tries to replace the more similar parent, measured by Hamming distance [15]; if it fails, it tries to replace the other parent (replacement is done only when the offspring is better than one of the parents). If the offspring is worse than both parents, it replaces the worst member of the population (GENITOR-style replacement [50]).
– Stopping Criterion: The GA stops when one of three conditions is satisfied: i) 80% of the population is occupied by solutions of the same quality, whose chromosomes are not necessarily the same; ii) the number of consecutive failures to replace the best solution reaches 200; or iii) the number of generations reaches 2000.
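The replacement step above can be sketched as follows. This is a minimal illustration under simplifying assumptions: individuals are plain gene-order lists (not the 2D image encoding), `hamming` and `replace_step` are hypothetical names, and fitness is minimized, as for tour length:

```python
def hamming(a, b):
    """Hamming distance between two equal-length gene orders."""
    return sum(x != y for x, y in zip(a, b))

def replace_step(population, parent1, parent2, offspring, fitness):
    """Replacement scheme of [6]: the offspring first tries to replace
    the parent more similar to it, then the other parent (only when the
    offspring is better); if worse than both parents, it replaces the
    worst member of the population (GENITOR-style)."""
    # order parents by similarity to the offspring (most similar first)
    p_near, p_far = sorted((parent1, parent2),
                           key=lambda p: hamming(p, offspring))
    for parent in (p_near, p_far):
        if fitness(offspring) < fitness(parent):
            population[population.index(parent)] = offspring
            return
    # offspring is worse than both parents: replace the worst member
    worst = max(population, key=fitness)
    population[population.index(worst)] = offspring
```

In a steady-state GA like the one described here, this step runs once per generated offspring, keeping the population size constant.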
Table 1. Data sets

Data Set Name      Number of Genes   Number of Experiments
Cell cycle cdc15         782                  24
Cell cycle               803                  59
Yeast complexes          979                  79
5 Experimental Results

5.1 Test Beds and Test Environment
We tested the proposed algorithm on three data sets: Cell Cycle cdc15, Cell Cycle, and Yeast Complexes. The first two data sets consist of about 800 genes each, which are cell cycle regulated in Saccharomyces cerevisiae, with different numbers of experiments [40]. They are classified into five groups, termed G1, S, S/G2, G2/M, and M/G1, by Spellman et al. [40]. Although it is controversial whether the group assignment reflects the real grouping, it is known to be meaningful to some degree [46]. The final data set, Yeast Complexes, is from the MIPS yeast complexes database [10]. All three data sets can be found in [2] and downloaded at a web site.³ Table 1 shows a brief description of each data set. All programs were written in C++ and run on a Pentium III 866 MHz with Linux 2.2.14. They were compiled using GNU's g++ compiler. We performed 100 runs for each experiment.

5.2 Performance
We denote by NNGA our proposed hybrid GA using the natural crossover and the new TSP formulation. The performance of a visualization result is evaluated by a score described in [46], defined by

$$ \mathrm{Score} = \sum_{i=1}^{n} G(g_{\pi_i}, g_{\pi_{i+1}}) $$

where $g_{\pi_{n+1}} = g_{\pi_1}$ and

$$ G(g_{\pi_i}, g_{\pi_j}) = \begin{cases} 1 & \text{if } g_{\pi_i} \text{ and } g_{\pi_j} \text{ are in the same group} \\ 0 & \text{otherwise.} \end{cases} $$

It is clear that a solution gets a higher score, under this scoring system, when more genes of the same group are aligned next to each other. Figure 3 shows the scores found by NNGA when the percentage of excluded edges varies from 0% to 90% at intervals of 10%. The NNGA improved the
³ http://www.psrg.lcs.mit.edu/clustering/ismb01/optimal.html
Fig. 3. Scores on a spectrum of different rates of excluded edges (y-axis: score; x-axis: percent of excluded edges): (a) Cell Cycle cdc15; (b) Cell Cycle; (c) Yeast Complexes
results for most of the tested percentages. In particular, for the Yeast Complexes data set, it showed improvement for all of the tested percentages. It is interesting that two peaks, not necessarily the highest ones, were observed at around 10% and 70% for all data sets. This implies that, in our experimental settings, it is more favorable to exclude either a small or a large fraction of edges than an intermediate fraction. Table 2 compares the performance of our NNGA with state-of-the-art algorithms for clustering gene expression profiles in terms of the score. Single-, Complete-, and Average-linkage represent different versions of hierarchical clustering [10], and SOM [42] is a self-organizing map. We used the CLUSTER package⁴ for the three versions of hierarchical clustering and the SOM. The NNGA with the new TSP formulation dominated the others. It is more intuitive to inspect the visualized results than to just compare the scores between the algorithms. To visualize them, we should assign a color to each expression level. We follow the typical red/green coloring scheme [41,10], while other schemes using different colors are available [41]. The red/green coloring scheme is as follows:

– Expression levels of zero are colored black, increasingly positive levels with reds of increasing intensity, and increasingly negative levels with greens of increasing intensity.
– Missing expression levels are usually colored gray.
⁴ http://genome-www.stanford.edu/clustering
Table 2. Comparisons of NNGA with other algorithms in terms of best score

Algorithm          Cell cycle cdc15   Cell cycle   Yeast Complexes
NNGA                     539              634            384
FCGA                     521              627            N.A.
Single-linkage           251              336            300
Complete-linkage         498              598            340
Average-linkage          500              581            331
SOM                      461              578            306

N.A.: Not Available
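The score reported in Table 2 counts the adjacent same-group pairs along the closed gene order, as defined in Section 5.2; a minimal sketch (the function and argument names are illustrative):

```python
def score(order, group):
    """Score of a gene order: the number of adjacent gene pairs,
    including the wrap-around pair, that belong to the same group."""
    n = len(order)
    return sum(
        group[order[i]] == group[order[(i + 1) % n]] for i in range(n))
```

For instance, the order [0, 1, 2, 3] over groups {0: 'G1', 1: 'G1', 2: 'M', 3: 'M'} scores 2: the pairs (0, 1) and (2, 3) match, while (1, 2) and the wrap-around pair (3, 0) do not.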
Fig. 4. Visualization results for Cell cycle cdc15: (a) Random; (b) FCGA; (c) NNGA
Figure 4 shows the visualization results for Cell cycle cdc15. In particular, Fig. 4(a) shows a random order, which is the original order, and Figs. 4(b) and 4(c) show the best orders found by FCGA and NNGA, respectively. The NNGA shows a notable feature: it clusters gene expression profiles with many missing data. One can find the clustered gray rows in Fig. 4(c).
6 Conclusions
We proposed a hybrid genetic algorithm for finding the optimal gene order in displaying microarray data. To use the natural crossover, which exploits two-dimensional geographical information, we applied Sammon's mapping to the data. Furthermore, we improved the visualization results using our new TSP formulation. Our key idea is that reducing relatively long edges in a tour is less meaningful for visualization, and it is thus advantageous to exclude the long edges from the fitness function. Experimental results showed that our idea improved the visualization results. Using the new fitness function, we could align
more genes of the same group next to each other compared to state-of-the-art algorithms. However, there is still a lot of work to be done to gain insight into the visualization of microarray data. Since there is no standard criterion for evaluating visualization results, it is controversial to claim that one result is better than another in terms of a given measure. We think this is because most biologists analyze the results based on their limited visual intuition. We believe that examining what kind of visualization is most meaningful to biologists is one of the most fundamental and demanding studies. The distance measure itself is also an interesting issue. While the Pearson correlation has been used extensively in the literature, it is not clear that it is the best measure for defining the similarity between gene expression profiles. More elaborate distance measures, such as information-theoretic measures [11,25], are left for future studies.

Acknowledgments. The authors would like to thank Soonchul Jung and Huai-Kuang Tsai for invaluable discussions on this paper. This work was partly supported by Optus Inc. and the Brain Korea 21 Project. The RIACT at Seoul National University provided research facilities for this study.
References

1. A. A. Alizadeh, M. B. Eisen, et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403(6769):503–511, 2000.
2. Z. Bar-Joseph, D. K. Gifford, and T. S. Jaakkola. Fast optimal leaf ordering for hierarchical clustering. Bioinformatics, 17:22–29, 2001.
3. A. Ben-Dor, R. Shamir, and Z. Yakhini. Clustering gene expression patterns. Journal of Computational Biology, 6:281–297, 1999.
4. J. L. Bentley. Experiments on traveling salesman problem. In 1st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '90), pages 129–133, 1990.
5. T. Biedl, B. Brejova, et al. Optimal arrangement of leaves in the tree representing hierarchical clustering of gene expression data. Technical Report CS-2001-14, Dept. of Computer Science, University of Waterloo, 2001.
6. T. N. Bui and B. R. Moon. Graph partitioning and genetic algorithms. IEEE Transactions on Computers, 45:841–855, 1996.
7. H. David. First (?) occurrence of common terms in mathematical statistics. The American Statistician, 49:121–133, 1995.
8. W. Dzwinel. How to make Sammon mapping useful for multidimensional data structures analysis. Pattern Recognition, 27(7):949–959, 1994.
9. A. Edwards and L. Cavalli-Sforza. A method for cluster analysis. Biometrics, 21:362–375, 1965.
10. M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. In Proceedings of the National Academy of Sciences, pages 14863–14867, 1998.
11. A. M. Fraser. Reconstructing attractors from scalar time series: a comparison of singular system and redundancy criteria. Physica D, 34:391–404, 1989.
12. M. L. Fredman, D. S. Johnson, L. A. McGeoch, and G. Ostheimer. Data structures for traveling salesman. In 4th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '93), pages 145–154, 1993.
13. D. E. Goldberg. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading, MA, 1989.
14. D. E. Goldberg, K. Deb, and B. Korb. Do not worry, be messy. In Proceedings of the Fourth International Conference on Genetic Algorithms, pages 24–30, 1991.
15. R. Hamming. Error detecting and error correcting codes. Bell System Technical Journal, 29(2):147–160, 1950.
16. J. Harris. The arithmetic of the product moment of calculating the coefficient of correlation. American Nature, 44:693–699, 1910.
17. J. Herrero, A. Valencia, and J. Dopazo. A hierarchical unsupervised growing neural network for clustering gene expression patterns. Bioinformatics, 17:126–136, 2001.
18. J. Holland. Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor, 1975.
19. D. S. Johnson. Local optimization and the traveling salesman problem. In 17th Colloquium on Automata, Languages, and Programming, pages 446–461, 1990.
20. S. Jung and B. R. Moon. The natural crossover for the 2D Euclidean TSP. In Genetic and Evolutionary Computation Conference, pages 1003–1010, 2000.
21. S. Jung and B. R. Moon. Toward minimal restriction of genetic encoding and crossovers for the 2D Euclidean TSP. IEEE Transactions on Evolutionary Computation, 6(6):557–565, 2002.
22. S. Kawasaki, C. Borchert, et al. Gene expression profiles during the initial phase of salt stress in rice. Plant Cell, 13(4):889–906, 2001.
23. M. Kendall. A new measure of rank correlation. Biometrika, 30:81–93, 1938.
24. A. B. Khodursky, B. J. Peter, et al. DNA microarray analysis of gene expression in response to physiological and genetic changes that affect tryptophan metabolism in Escherichia coli. In Proceedings of the National Academy of Sciences, pages 12170–12175, 2000.
25. W. Li. Mutual information functions versus correlation functions. Journal of Statistical Physics, 60:823–837, 1990.
26. S. Lin and B. Kernighan. An effective heuristic algorithm for the traveling salesman problem. Operations Research, 21:498–516, 1973.
27. O. Martin, S. Otto, and E. Felten. Large-step Markov chains for the traveling salesman problem. Complex Systems, 5:299–326, 1991.
28. P. Merz and A. Zell. Clustering gene expression profiles with memetic algorithms. In Proceedings of the 7th International Conference on Parallel Problem Solving from Nature, pages 811–820, 2002.
29. P. Moscato. On evolution, search, optimization, genetic algorithms and martial arts: Towards memetic algorithms. Technical Report C3P 826, Concurrent Computation Program, California Institute of Technology, 1989.
30. Y. Nagata and S. Kobayashi. Edge assembly crossover: A high-power genetic algorithm for the traveling salesman problem. In 7th International Conference on Genetic Algorithms, pages 450–457, 1997.
31. E. Pekalska, D. De Ridder, R. P. W. Duin, and M. A. Kraaijveld. A new method of generalizing Sammon mapping with application to algorithm speed-up. In Fifth Annual Conference of the Advanced School for Computing and Imaging, pages 221–228, 1999.
32. J. M. Renders and H. Bersini. Hybridizing genetic algorithms with hill-climbing methods for global optimization: Two possible ways. In Proceedings of the First IEEE Conference on Evolutionary Computation, pages 312–317, 1994.
33. D. De Ridder and R. P. W. Duin. Sammon's mapping using neural networks: a comparison. Pattern Recognition Letters, 18(11–13):1307–1316, 1997.
34. J. W. Sammon, Jr. A non-linear mapping for data structure analysis. IEEE Transactions on Computers, 18:401–409, 1969.
35. R. Schaffer, J. Landgraf, et al. Microarray analysis of diurnal and circadian-regulated genes in Arabidopsis. Plant Cell, 13(1):113–123, 2001.
36. M. Schena, D. Shalon, R. W. Davis, and P. O. Brown. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science, 270(5235):467–470, 1995.
37. D. Shalon, S. J. Smith, and P. O. Brown. A DNA microarray system for analyzing complex DNA samples using two-color fluorescent probe hybridization. Genome Research, 6(7):639–645, 1996.
38. R. R. Sokal and C. D. Michener. A statistical method for evaluating systematic relationships. University of Kansas Science Bulletin, 38:1409–1438, 1958.
39. C. Spearman. The proof and measurement of association between two things. American Journal of Psychology, 15:72–101, 1904.
40. T. S. Spellman, G. Sherlock, et al. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell, 9:3273–3297, 1998.
41. A. Sturn. Cluster analysis for large scale gene expression studies. Master's thesis, Graz University of Technology, Graz, Austria, 2001.
42. P. Tamayo, D. Slonim, et al. Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. In Proceedings of the National Academy of Sciences, pages 2907–2912, 1999.
43. S. Tavazoie, J. D. Hughes, et al. Systematic determination of genetic network architecture. Nature Genetics, 22:281–285, 1999.
44. P. Toronen, M. Kolehmainen, G. Wong, and E. Castren. Analysis of gene expression data using self-organizing maps. FEBS Letters, 451:142–146, 1999.
45. H. K. Tsai, J. M. Yang, and C. Y. Kao. A genetic algorithm for traveling salesman problems. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2001), pages 687–693, 2001.
46. H. K. Tsai, J. M. Yang, and C. Y. Kao. Applying genetic algorithms to finding the optimal order in displaying the microarray data. In Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2002), pages 610–617, 2002.
47. H. K. Tsai, J. M. Yang, and C. Y. Kao. Solving traveling salesman problems by combining global and local search mechanisms. In Proceedings of the Congress on Evolutionary Computation (CEC 2002), pages 1290–1295, 2002.
48. J. Ward. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58:236–244, 1963.
49. D. Whitley, V. Gordon, and K. Mathias. Lamarckian evolution, the Baldwin effect and function optimization. In International Conference on Evolutionary Computation, Oct. 1994. Lecture Notes in Computer Science, 866:6–15, Springer-Verlag.
50. D. Whitley and J. Kauth. GENITOR: A different genetic algorithm. In Proceedings of Rocky Mountain Conference on Artificial Intelligence, pages 118–130, 1988.
Learning Features for Object Recognition

Yingqiang Lin and Bir Bhanu

Center for Research in Intelligent Systems, University of California, Riverside, CA 92521, USA
{yqlin,bhanu}@vislab.ucr.edu
Abstract. Features represent the characteristics of objects, and selecting or synthesizing effective composite features is a key factor in the performance of object recognition. In this paper, we propose a co-evolutionary genetic programming (CGP) approach to learning composite features for object recognition. The motivation for using CGP is to overcome the limitations of human experts, who consider only a small number of conventional combinations of primitive features during synthesis. CGP, on the other hand, can try a very large number of unconventional combinations, and these may yield exceptionally good results in some cases. Our experimental results with real synthetic aperture radar (SAR) images show that CGP can learn good composite features. We show results for distinguishing objects from clutter and for distinguishing objects that belong to several classes.
1 Introduction

In this paper, we apply genetic programming to synthesize composite features for object recognition. The basic task of object recognition is to identify the kinds of objects in an image, and sometimes the task may include estimating the pose of the recognized objects. One of the key approaches to object recognition is based on features extracted from images. These features capture the characteristics of the object to be recognized and are fed into a classifier to perform recognition. The quality of object recognition depends heavily on the effectiveness of the features. However, it is difficult to extract good features from real images due to various factors, including noise. More importantly, there are many features that can be extracted. What are the appropriate features, and how can composite features useful for recognition be synthesized from primitive features? The answers to these questions depend largely on the intuitive instinct, knowledge, experience, and bias of human experts. In this paper, co-evolutionary genetic programming (CGP) is employed to generate a composite operator vector whose elements are synthesized composite operators for object recognition. A composite operator is represented by a binary tree whose internal nodes represent pre-specified primitive operators and whose leaf nodes represent primitive features; it is a way of combining primitive features. With each element evolved by a sub-population of CGP, a composite operator vector is cooperatively evolved by all the sub-populations. By applying the composite operators, corresponding to each sub-population, to the primitive features extracted from images, we obtain
E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 2227–2239, 2003. © Springer-Verlag Berlin Heidelberg 2003
composite feature vectors. These composite feature vectors are fed into a classifier for recognition. The primitive features are real numbers, designed by human experts based on the type of objects to be recognized. It is worth noting that the primitive operators and primitive features are decoupled from the CGP mechanism that generates composite operators. Users can tailor them to their own particular recognition task without affecting the other parts of the system. Thus, the method and the recognition system are flexible and can be applied to a wide variety of images.
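As described in the Introduction, a composite operator is a binary tree whose internal nodes are primitive operators and whose leaves are primitive features. A minimal sketch of evaluating such a tree follows; the node classes and the particular operators shown are illustrative assumptions, not the paper's actual primitive set:

```python
import math

class Leaf:
    """Leaf node: an index into the primitive-feature vector."""
    def __init__(self, index):
        self.index = index
    def evaluate(self, features):
        return features[self.index]

class Op:
    """Internal node: a primitive operator applied to child results.
    Operators work on real numbers, as stated in the text."""
    def __init__(self, fn, left, right=None):
        self.fn, self.left, self.right = fn, left, right
    def evaluate(self, features):
        a = self.left.evaluate(features)
        if self.right is None:                 # unary operator
            return self.fn(a)
        return self.fn(a, self.right.evaluate(features))

# Example composite operator: sqrt(|f0 - f1|) + f2
tree = Op(lambda a, b: a + b,
          Op(lambda a: math.sqrt(abs(a)),
             Op(lambda a, b: a - b, Leaf(0), Leaf(1))),
          Leaf(2))

print(tree.evaluate([5.0, 1.0, 0.5]))  # sqrt(4) + 0.5 = 2.5
```

Evaluating one such tree per sub-population on an image's primitive features yields one element of the composite feature vector that is then fed to the classifier.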
2 Motivation and Related Research

• Motivation: The recognition accuracy of an automatic object recognition system is determined by the quality of the feature set. Usually, it is human experts who design the features to be used in recognition. Handcrafting a set of features requires human ingenuity and insight into the characteristics of the objects to be recognized, and in general it is a very time-consuming and expensive process due to the large number of features available and the correlations among them. Thus, the automatic synthesis of composite features useful for recognition from simple primitive features becomes extremely important. The process of synthesizing composite features can often be dissected into primitive operations on primitive features. However, the ways of combining primitive features are almost infinite, and human experts, relying on their knowledge and rich experience but limited by their speed and bias, can try only a small number of conventional combinations. Co-evolutionary genetic programming, on the other hand, may try many unconventional combinations, and in some cases these yield exceptionally good results. Also, the inherent parallelism of CGP and the speed of computers allow many more combinations to be considered by CGP than by human experts, and this greatly enhances the chances of finding good composite features.

• Related Research: Genetic programming (GP) has been used in image processing, object detection, and recognition. Poli et al. [1] use GP to develop image filters to enhance and detect features of interest or to build pixel-classification-based segmentation algorithms. Bhanu and Lin [2] use GP for object detection and ROI extraction. Howard et al. [3] apply GP to the automatic detection of ships in low-resolution SAR imagery. Roberts and Howard [4] use GP to develop automatic object detectors in infrared images.
Stanhope and Daida [5] use GP for the generation of rules for target/clutter classification and rules for the identification of objects. Unlike the work of Stanhope and Daida [5], the primitive operators in this paper are not logical operators, but operators that work on real numbers. They use GP to evolve logical expressions and the final outcome of the logical expressions determines the type of object under consideration (for example, 1 means target and 0 means clutter); we use CGP to evolve composite feature vectors for a Bayesian classifier and each sub-population is responsible for evolving a specific composite feature in the composite feature vector. The classifier evolved by GP in their system can be viewed as a linear classifier, but the classifier evolved by CGP here is a Bayesian classifier determined by the composite feature vectors learned from training images.
Learning Features for Object Recognition
2229
3 Technical Approach

In our CGP-based approach, individuals are composite operators, and all possible composite operators form a huge search space, making it extremely difficult to find good composite operators without a smart search strategy. The system we developed is divided into training and testing parts, shown in Fig. 1(a) and (b), respectively. During training, CGP runs on training images and evolves composite operators to obtain composite features. Since the Bayesian classifier is completely determined by the composite feature vectors learned from training images, both the composite features and the classifier are learned by CGP.
[Fig. 1 block diagram: primitive operators feed co-evolutionary genetic programming, which produces a composite operator vector; the feature extractor turns primitive features into composite feature vectors for the Bayesian classifier; a fitness evaluator scores recognition results against the ground truth]
(a) Training: learning composite feature vectors and Bayesian classifier
(b) Testing: applying the learned composite feature vectors and Bayesian classifier to a test image
Fig. 1. System diagram for object recognition using co-evolutionary genetic programming
3.1 Design Considerations

• The Set of Terminals: The set of terminals used in this paper is the 20 primitive features used in [6]. The first 10 were designed by MIT Lincoln Laboratory to capture the particular characteristics of synthetic aperture radar (SAR) imagery and have been found useful for object detection. The other 10 are common features widely used in image processing and computer vision. The 20 features are: (1) standard deviation of image; (2) fractal dimension and (3) weighted-rank fill ratio of brightest scatterers; (4) blob mass; (5) blob diameter; (6) blob inertia; (7) maximum and (8) mean values of pixels within blob; (9) contrast brightness of blob; (10) count; (11) horizontal, (12) vertical, (13) major diagonal and (14) minor diagonal projections of blob; (15) maximum, (16) minimum and (17) mean distances of scatterers from their centroid; (18) moment µ20, (19) moment µ02 and (20) moment µ22 of scatterers.
• The Set of Primitive Operators: A primitive operator takes one or two real numbers, performs a simple operation on them and outputs the result. Currently, the 12 primitive operators shown in Table 1 are used, where a and b are real-number inputs to an operator and c is a constant real number stored in an operator.

Table 1. Twelve primitive operators

Primitive Operator   Description
ADD (a, b)           Add a and b.
SUB (a, b)           Subtract b from a.
MUL (a, b)           Multiply a and b.
DIV (a, b)           Divide a by b.
MAX2 (a, b)          Get the larger of a and b.
MIN2 (a, b)          Get the smaller of a and b.
SQRT (a)             Return sqrt(a) if a >= 0; otherwise, return -sqrt(-a).
LOG (a)              Return log(a) if a >= 0; otherwise, return -log(-a).
ADDC (a, c)          Add constant value c to a.
SUBC (a, c)          Subtract constant value c from a.
MULC (a, c)          Multiply a with constant value c.
DIVC (a, c)          Divide a by constant value c.
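A minimal Python sketch of the twelve primitive operators in Table 1. The zero guards on DIV, DIVC and LOG are our own assumption; the paper does not say how a zero divisor or zero argument is handled.

```python
import math

EPS = 1e-12  # guard value for zero divisors / zero log arguments (assumed)

def ADD(a, b):  return a + b
def SUB(a, b):  return a - b
def MUL(a, b):  return a * b
def DIV(a, b):  return a / b if abs(b) > EPS else a / EPS
def MAX2(a, b): return max(a, b)
def MIN2(a, b): return min(a, b)
def ADDC(a, c): return a + c  # c is the constant stored in the operator node
def SUBC(a, c): return a - c
def MULC(a, c): return a * c
def DIVC(a, c): return a / c if abs(c) > EPS else a / EPS

def SQRT(a):
    # Return sqrt(a) if a >= 0; otherwise return -sqrt(-a), as in Table 1.
    return math.sqrt(a) if a >= 0 else -math.sqrt(-a)

def LOG(a):
    # Return log(a) if a >= 0; otherwise return -log(-a), as in Table 1.
    a = EPS if a == 0 else a
    return math.log(a) if a > 0 else -math.log(-a)
```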
• The Fitness Measure: The fitness of a composite operator vector is computed as follows: apply each composite operator of the vector to the primitive features of the training images to obtain composite feature vectors, and feed them to a Bayesian classifier. The recognition rate of the classifier is the fitness of the composite operator vector. To evaluate a composite operator evolved in a sub-population (see Fig. 2), the composite operator is combined with the current best composite operators of the other sub-populations to form a complete composite operator vector, where the composite operator from the ith sub-population occupies the ith position in the vector; the fitness of this vector is taken as the fitness of the composite operator under evaluation. The fitness values of the other composite operators in the vector are not affected. When the sub-populations are initially generated, the composite operators in each sub-population are evaluated separately, without being combined with composite operators from other sub-populations. Within each generation, the composite operators in the first sub-population are evaluated first, then those in the second sub-population, and so on.
[Fig. 2 block diagram: individual j of sub-population i is assembled with the best individuals of all other sub-populations into a composite operator vector; the vector is applied to the primitive features, the resulting composite feature vectors are fed to the classifier, and the fitness evaluator scores the recognition results against the ground truth]
Fig. 2. Computation of fitness of the jth composite operator of the ith sub-population
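The cooperative evaluation in Fig. 2 can be sketched in a few lines. Here `recognition_rate` stands in for training and testing the Bayesian classifier, and all names are our own, not the paper's:

```python
def evaluate_individual(subpops, best, i, j, recognition_rate):
    """Fitness of individual j of sub-population i (cf. Fig. 2).

    subpops[k] is the list of composite operators of sub-population k;
    best[k] is the index of the current best operator in sub-population k;
    recognition_rate maps a complete composite operator vector to a score.
    """
    # Assemble the vector: best operator of every other sub-population ...
    vector = [subpops[k][best[k]] for k in range(len(subpops))]
    # ... with the operator under evaluation in slot i.
    vector[i] = subpops[i][j]
    # The vector's recognition rate becomes the fitness of operator j.
    return recognition_rate(vector)
```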
• Parameters and Termination: The key parameters are the number of sub-populations N, the sub-population size M, the number of generations G, the crossover and mutation rates, and the fitness threshold. CGP stops when it finishes the specified number of generations or when the performance of the Bayesian classifier rises above the fitness threshold. After termination, CGP selects the best composite operator of each sub-population to form the learned composite operator vector to be used in testing.
3.2 Selection, Crossover, and Mutation

CGP searches through the space of composite operator vectors to generate new composite operator vectors. The search is performed by selection, crossover and mutation operations. The initial sub-populations are randomly generated. Although the sub-populations are cooperatively evolved (the fitness of a composite operator in a sub-population is not determined solely by itself, but is affected by the composite operators from other sub-populations), selection is performed only on composite operators within a sub-population, and crossover is not allowed between composite operators from different sub-populations.
• Selection: The selection operation selects composite operators from the current sub-population. In this paper, we use tournament selection. The higher the fitness value, the more likely the composite operator is to survive.
• Crossover: Two composite operators, called parents, are selected on the basis of their fitness values; the higher the fitness value, the more likely a composite operator is to be selected for crossover. One internal node in each of the two parents is randomly selected, and the two subtrees rooted at these nodes are exchanged between the parents to generate two new composite operators, called offspring. It is easy to see that the size of an offspring (i.e., the number of nodes in the binary tree representing it) may be greater than that of both parents if crossover is implemented in this simple way. To prevent code bloat, we specify a maximum size for a composite operator. If the size of an offspring exceeds the maximum, the crossover is performed again until the sizes of both offspring are within the limit.
• Mutation: To avoid premature convergence, mutation is introduced to randomly change the structure of some composite operators and maintain the diversity of the sub-populations.
Candidates for mutation are randomly selected, and the mutated composite operators replace the old ones in the sub-population. There are three mutations, invoked with equal probability:
1. Randomly select a node of the composite operator and replace the subtree rooted at this node with another randomly generated binary tree.
2. Randomly select a node of the composite operator and replace the primitive operator stored in the node with another randomly selected primitive operator that takes the same number of inputs as the replaced one.
3. Randomly select two subtrees of the composite operator and swap them. Of course, neither of the two subtrees can be a subtree of the other.
2232
Y. Lin and B. Bhanu
3.3 Generational Co-evolutionary Genetic Programming

Generational co-evolutionary genetic programming is used to evolve composite operators. The GP operations are applied in the order crossover, mutation, selection. First, two composite operators are selected on the basis of their fitness values for crossover. The two offspring from crossover are set aside and do not participate in the subsequent crossover operations on the current sub-population. This process is repeated until the crossover rate is met. Then, mutation is applied to the composite operators in the current sub-population and to the offspring from crossover. Finally, selection is applied to select some composite operators from the current sub-population and combine them with the offspring from crossover to get a new sub-population of the same size as the old one. In addition, we adopt an elitism replacement method that keeps the best composite operator from generation to generation.
Generational Co-evolutionary Genetic Programming:

0. Randomly generate N sub-populations of size M and evaluate each composite operator in each sub-population individually.
1. for gen = 1 to generation_num do
2.   for i = 1 to N do
3.     Keep the best composite operator in sub-population Pi.
4.     Perform crossover on the composite operators in Pi until the crossover rate is satisfied, and keep all the offspring from crossover.
5.     Perform mutation on the composite operators in Pi and the offspring from crossover with the probability of the mutation rate.
6.     Perform selection on Pi to select some composite operators and combine them with the offspring from crossover to get a new sub-population Pi' of the same size as Pi.
7.     Evaluate each composite operator Cj in Pi'. To evaluate Cj, select the current best composite operator in each of the other sub-populations and combine Cj with those N−1 best composite operators to form a composite operator vector, where the composite operator from the kth sub-population occupies the kth position in the vector (k = 1, …, N). Run the composite operator vector on the primitive features of the training images to get composite feature vectors and use them to build a Bayesian classifier. Feed the composite feature vectors into the Bayesian classifier and let the recognition rate be the fitness of the composite operator vector and the fitness of Cj.
8.     Let the best composite operator from Pi replace the worst composite operator in Pi' and let Pi = Pi'.
9.     Form the composite operator vector consisting of the best composite operators from the corresponding sub-populations and evaluate it. If its fitness is above the fitness threshold, go to step 10.
     endfor // loop 2
   endfor // loop 1
10. Select the best composite operator from each sub-population to form the learned composite operator vector and output it.
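The listing above can be condensed into a runnable skeleton. The helper names (`random_operator`, `crossover_pairs`, `mutate`, `select`, `fitness_of`) are placeholders for the operations of Sect. 3.2, and caching the elite's fitness as the current maximum in step 8 is an approximation of ours:

```python
def cgp(N, M, generations, fitness_of, random_operator,
        crossover_pairs, mutate, select, threshold):
    # Step 0: random sub-populations; each operator evaluated individually.
    pops = [[random_operator() for _ in range(M)] for _ in range(N)]
    fit = [[fitness_of(pops, i, j) for j in range(M)] for i in range(N)]
    for gen in range(generations):                                   # step 1
        for i in range(N):                                           # step 2
            elite = pops[i][max(range(M), key=lambda j: fit[i][j])]  # step 3
            offspring = crossover_pairs(pops[i], fit[i])             # step 4
            cand = mutate(pops[i] + offspring)                       # step 5
            pops[i] = select(cand, M)                                # step 6
            fit[i] = [fitness_of(pops, i, j) for j in range(M)]      # step 7
            worst = min(range(M), key=lambda j: fit[i][j])           # step 8:
            pops[i][worst] = elite                                   # elitism
            fit[i][worst] = max(fit[i])     # elite's fitness (approximated)
            if max(fit[i]) >= threshold:                             # step 9
                return best_vector(pops, fit)
    return best_vector(pops, fit)                                    # step 10

def best_vector(pops, fit):
    """Best composite operator of each sub-population, in order."""
    return [p[max(range(len(f)), key=lambda j: f[j])]
            for p, f in zip(pops, fit)]
```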
4 Experiments

Various experiments are performed to test the efficacy of CGP in generating composite features for object recognition; in this paper, we show some selected examples. All the images used in the experiments are real synthetic aperture radar (SAR) images, divided into training and testing images. 20 primitive features are extracted from each SAR image. CGP runs on the primitive features from the training images to generate a composite operator vector and a Bayesian classifier, which are then tested against the testing images. Note that the ground truth is used only during training. The parameters of CGP used throughout the experiments are: sub-population size (50), number of generations (50), fitness threshold (1.0), crossover rate (0.6) and mutation rate (0.05). The maximum size of composite operators is 10 in experiment 1 and 20 in experiment 2. The constant real number c stored in some primitive operators ranges from −20 to 20. For objective comparison, CGP is invoked ten times for each experiment with the same set of parameters and the same set of training images, and only the average performances are used for comparison.
• Experiment 1 – Distinguish Object from Clutter: From MSTAR public real SAR images, we generate 1048 SAR images containing objects and 1048 SAR images containing natural clutter. These images have size 120×120 and are called object images and clutter images, respectively. An example object image and clutter image are shown in Fig. 3, where white spots indicate scatterers with high magnitude. 300 object images and 300 clutter images are randomly selected as training images and the rest are used in testing.
(a) A typical object image
(b) A typical natural clutter image
Fig. 3. Example target and clutter SAR images
Fig. 4. Recognition rates of 20 primitive features
First, the efficacy of each primitive feature in discriminating the object from clutter is examined. Each primitive feature from the training images is used to train a Bayesian classifier, and the classifier is tested against the same kind of primitive features from the testing images. The results are shown in Fig. 4. Feature contrast brightness of blob (9) is the best one, with recognition rate 0.98. To show the efficacy of CGP in synthesizing effective composite features, we consider three cases: only the worst two primitive features (blob inertia (6) and mean value of pixels within blob (8)) are used by CGP; five bad primitive features (blob inertia (6), mean value of pixels within blob (8), moments µ20 (18), µ02 (19) and µ22 (20) of scatterers) are used by CGP; 10 common features (primitive features 11 to 20) not specifically designed to deal with SAR images are used by CGP during synthesis. The number of sub-populations is 3, which
means the dimension of the composite feature vector is 3. The results are shown in Fig. 5, where the horizontal coordinates are the number of primitive features used in synthesis and the vertical coordinates are the recognition rate. The bins on the left show the training results and those on the right show the testing results. The numbers above the bins are the average recognition rates over all ten runs. Then the number of sub-populations is increased from 3 to 5. The same 2, 5 and 10 primitive features are used as building blocks by CGP to evolve composite features; the experimental results are shown in Fig. 6. Table 2 shows the maximum and minimum recognition rates in these experiments, where Tr and Te mean training and testing and Max and Min stand for the maximum and minimum recognition rates.
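The per-feature Bayesian test reported in Fig. 4 can be sketched as a one-dimensional Gaussian maximum-likelihood classifier; equal class priors and all function names are our assumptions:

```python
import math

def fit_gaussian(xs):
    """Mean and variance of one class's values for a single feature."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs) or 1e-12
    return mu, var

def log_likelihood(x, mu, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)

def recognition_rate(train, test):
    """train/test: dict mapping class label -> list of feature values."""
    models = {lab: fit_gaussian(vals) for lab, vals in train.items()}
    correct = total = 0
    for lab, vals in test.items():
        for x in vals:
            # Classify by maximum likelihood (equal priors assumed).
            pred = max(models, key=lambda L: log_likelihood(x, *models[L]))
            correct += (pred == lab)
            total += 1
    return correct / total
```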
Fig. 5. Experimental results with 3 sub-populations
Fig. 6. Experimental results with 5 sub-populations
(MULC (MULC (SUBC (SQRT (LOG PF8)))))
(DIV (DIVC (DIVC (DIV PF18 PF6))) PF8)
(a) Composite operator 1
(b) Composite operator 2
(SQRT PF8)
(c) Composite operator 3
Fig. 7. Composite operator vector learned by CGP

Table 2. The maximum and minimum recognition rates (Tr = training, Te = testing)

       3 sub-populations                    5 sub-populations
       2 feat.      5 feat.      10 feat.   2 feat.      5 feat.      10 feat.
       Tr    Te     Tr    Te     Tr    Te   Tr    Te     Tr    Te     Tr    Te
Max    0.992 0.991  0.995 0.995  0.978 0.989  0.992 0.984  0.995 0.995  0.983 0.992
Min    0.988 0.979  0.978 0.974  0.965 0.979  0.988 0.98   0.993 0.987  0.972 0.979
From Figs. 5 and 6, it is obvious that composite feature vectors synthesized by CGP are very effective in distinguishing object from clutter. They are much better than the primitive features from which they are built. Actually, if both features 6 and 8 jointly form 2-dimensional primitive feature vectors for recognition, the recognition rate is 0.668; if features 6, 8, 18, 19, and 20 jointly form 5-dimensional primitive feature vectors, the recognition rate is 0.947; if all the last 10 primitive features are used, the recognition rate is 0.978. The average recognition rates of composite feature vectors are better than all the above results. Figure 7 shows the composite operator vector evolved by CGP maintaining 3 sub-populations in the 6th run when 5 primitive features are used, where PFi means the primitive feature i and so on.
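Applying a learned composite operator such as (SQRT PF8) to an image's primitive features is a simple tree evaluation. In this sketch operators are nested tuples; note that the printed expressions in Fig. 7 omit the constant c stored in ADDC/SUBC/MULC/DIVC nodes, so such nodes carry their constant explicitly here, and any constant values used are purely illustrative:

```python
import math

def evaluate(op, pf):
    """Apply composite operator `op` to the 20 primitive features `pf`."""
    if isinstance(op, str):                        # leaf: 'PFi', 1-based index
        return pf[int(op[2:]) - 1]
    name = op[0]
    if name in ("ADDC", "SUBC", "MULC", "DIVC"):   # node stores its constant c
        c, a = op[1], evaluate(op[2], pf)
        return {"ADDC": a + c, "SUBC": a - c,
                "MULC": a * c, "DIVC": a / c}[name]
    vals = [evaluate(arg, pf) for arg in op[1:]]
    if name == "SQRT":                             # mirrored for negatives (Table 1)
        return math.sqrt(vals[0]) if vals[0] >= 0 else -math.sqrt(-vals[0])
    if name == "LOG":                              # zero input left unhandled here
        return math.log(vals[0]) if vals[0] > 0 else -math.log(-vals[0])
    two = {"ADD": lambda a, b: a + b, "SUB": lambda a, b: a - b,
           "MUL": lambda a, b: a * b, "DIV": lambda a, b: a / b,
           "MAX2": max, "MIN2": min}
    return two[name](*vals)
```

For example, composite operator (c) of Fig. 7 is simply `("SQRT", "PF8")`.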
• Experiment 2 – Recognize Objects: Five objects (BRDM2 truck, D7 bulldozer, T62 tank, ZIL131 truck and ZSU anti-aircraft gun) are used in the experiments. For each object, we collect 210 real SAR images at a 15° depression angle and various azimuth angles between 0° and 359° from the MSTAR public data. Fig. 8 shows one optical and one SAR image of each object; as the figure suggests, it is not easy to distinguish the SAR images of different objects. Since SAR images are very sensitive to azimuth angle and the training images should represent the characteristics of an object under various azimuth angles, the 210 SAR images of each object are sorted in ascending order of azimuth angle and every third image (the first, fourth, seventh, tenth, and so on) is selected for training. Thus, for each object, 70 SAR images are used in training and the rest in testing.
(a) BRDM2 (b) D7 (c) T62 (d) ZIL (e) ZSU
Fig. 8. Five objects and their SAR images used for recognition
1) Discriminate three objects: CGP synthesizes composite features to recognize three objects: BRDM2, D7 and T62. First, the efficacy of each primitive feature in discriminating these three objects is examined. The results are shown in Fig. 9. Feature mean value of pixels within blob (8) is the best primitive feature, with recognition rate 0.73. Three series of experiments are performed in which CGP maintains 3, 5 and 8 sub-populations to evolve 3-, 5- and 8-dimensional composite features, respectively. The primitive features used in the experiments are all 20 primitive features and the last 10 primitive features. The maximum size of the composite operators is 20. The experimental results are shown in Figs. 10, 11 and 12, where 10f and 20f mean primitive features 11 to 20 and all 20 primitive features, respectively. The bins on the left show the training results and those on the right show the testing results. The numbers above the bins are the average recognition rates over all ten runs. Table 3 shows the maximum and minimum recognition rates in these experiments. From Figs. 10, 11 and 12, it is clear that composite feature vectors synthesized by CGP are effective in recognizing objects. They are much better than the primitive features used by CGP to synthesize them. Actually, if all 20 primitive features are used jointly to form 20-dimensional primitive feature vectors for recognition, the recognition rate is 0.96. This result is a little better than the average performance shown in Fig. 10 (0.94), but the dimension of that feature vector is 20, whereas the dimensions of the composite feature vectors in Figs. 10 and 11 are just 3 and 5, respectively. If we increase the dimension of the composite feature vector from 3 to 5 and 8, the CGP results are better. If the last 10 primitive features are used, the recognition rate is 0.81. From these results, we can see that the effectiveness of the primitive features has an important impact on the effectiveness of the composite features synthesized by CGP: with effective primitive features in hand, CGP will synthesize better composite features. Fig. 13 shows the composite operator vector evolved by CGP maintaining 5 sub-populations in the 10th run when all 20 primitive features are used. The sizes of the first and second composite operators are 20; the size of the third is 9 and that of the last is 15. The fourth composite operator is very simple: it just selects primitive feature 11. The primitive features used by the synthesized composite operator vector are primitive features 2, 3, 4, 5, 6, 7, 8, 11, 12, 14, 18, 19, 20. If all these 13 primitive features directly form a 13-dimensional primitive feature vector for recognition, the recognition rate is 0.96.
[Figs. 9–12: bar charts of recognition rate vs. number of primitive features used in synthesis]
Fig. 9. Recognition rates of 20 primitive features
Fig. 10. Recognition rate with 3 sub-populations
Fig. 11. Recognition rate with 5 sub-populations
Fig. 12. Recognition rate with 8 sub-populations
(a) Composite operator 1: (DIV (MULC (SUB (SUB (DIVC (SQRT PF6)) (MULC (SUB PF18 (MULC (SUB PF18 (SQRT PF4)))))) (SQRT PF6))) (MIN2 PF12 PF19))
(b) Composite operator 2: (DIV (MULC (ADD (ADDC (MULC (MUL (MIN2 (ADDC (DIV PF20 PF4)) PF14) PF3))) (LOG (ADDC (DIV PF20 PF4))))) (DIVC PF4))
(c) Composite operator 3: (DIV (MIN2 (SUBC (SUBC PF11)) (MAX2 PF7 PF8)) PF8)
(d) Composite operator 4: (PF11)
(e) Composite operator 5: (LOG (ADDC (LOG (DIV (SUBC (LOG (DIV (SUBC (LOG PF5)) (SUBC PF5)))) (MUL PF2 PF5)))))
Fig. 13. Composite operator vector learned by CGP for 5 sub-populations
Table 3. The maximum and minimum recognition rates (Tr = training, Te = testing)

       3 sub-populations        5 sub-populations        8 sub-populations
       10f         20f          10f         20f          10f         20f
       Tr    Te    Tr    Te     Tr    Te    Tr    Te     Tr    Te    Tr    Te
Max    0.91  0.86  0.98  0.97   0.94  0.88  0.995 0.97   0.98  0.9   1.0   0.98
Min    0.85  0.83  0.95  0.92   0.91  0.84  0.98  0.94   0.95  0.85  0.99  0.95
2) Discriminate Five Objects: With more objects added, the recognition becomes more difficult. This can be seen from Fig. 14, which shows the efficacy of each primitive feature in discriminating these five objects. Feature blob mass (4) is the best primitive feature, with recognition rate 0.49. If all 20 primitive features are used jointly to form 20-dimensional primitive feature vectors for recognition, the recognition rate is 0.81; if only the last 10 primitive features are used, the recognition rate is 0.62. This number is much lower, since the last 10 features are common features and were not designed with the characteristics of SAR images taken into consideration. Two series of experiments are performed in which CGP maintains 5 and 8 sub-populations to evolve 5- and 8-dimensional composite features for recognition. The primitive features used in the experiments are all 20 primitive features and the last 10 primitive features. The maximum size of composite operators is 20. The experimental results are shown in Fig. 15. The left two bins in columns 10f and 20f correspond to 5 sub-populations and the right two bins correspond to 8 sub-populations. The bins showing the training results are to the left of those showing testing results. The numbers above the bins are the average recognition rates over all ten runs. Table 4 shows the maximum and minimum recognition rates in these experiments.
[Figs. 14–15: bar charts]
Fig. 14. Recognition rates of 20 primitive features
Fig. 15. Recognition rate with 5 (left two bins) and 8 (right two bins) sub-populations
Table 4. The maximum and minimum recognition rates (Tr = training, Te = testing)

       5 sub-populations        8 sub-populations
       10f         20f          10f         20f
       Tr    Te    Tr    Te     Tr    Te    Tr    Te
Max    0.71  0.63  0.88  0.80   0.80  0.65  0.94  0.85
Min    0.65  0.55  0.83  0.73   0.75  0.62  0.91  0.80
From Fig. 15, we can see that when the dimension of the composite feature vector is 8, the performance of the composite features is good, better than using all 20 (0.81) or the last 10 (0.62) primitive features from which the composite features are built. When the dimension of the composite feature vector is 5, the recognition is not satisfactory when using just the 10 common features as building blocks. Also, when the dimension is 5, the average performance is a little worse than using all 20 or 10
primitive features, but the dimension of the composite feature vector is just one-fourth or half the number of primitive features, which saves considerable computation during recognition. When all 20 primitive features are used and CGP has 8 sub-populations, the composite operators in the best evolved composite operator vector have sizes 19, 1, 16, 19, 15, 7, 16 and 6, respectively. The primitive features used by the synthesized composite operator vector are primitive features 2, 3, 4, 5, 8, 9, 10, 11, 12, 13, 14, 15, 16, 18, 19 and 20. If all these 16 primitive features directly form a 16-dimensional primitive feature vector for recognition, the recognition rate is 0.80, which is lower than the average performance of the composite feature vector shown in Fig. 15.
• Discussion: The above experiments show that CGP is a viable tool for synthesizing effective composite features from primitive features for object recognition; the learned composite features outperform the primitive features, and any combination of the primitive features, from which they are evolved. The effectiveness of the composite features learned by CGP depends on the effectiveness of the primitive features. To achieve the same recognition rate, fewer composite features are needed than primitive features (one-fourth or half), reducing the computational expense of run-time recognition.
5 Conclusions

In this paper, CGP is used to synthesize composite features for object recognition. Our experimental results on real SAR images show that CGP can evolve composite features that are more effective than the primitive features upon which they are built. To achieve the same recognition performance as with primitive features, fewer composite features are needed, which reduces the computational burden during recognition. However, the primitive features still have a significant impact on the effectiveness of the evolved composite features. How to let CGP evolve effective composite features from general primitive features is the focus of our future research.

Acknowledgment. This research is supported by grant F33615-99-C-1440. The contents of the information do not necessarily reflect the position or policy of the U.S. government.
References

1. R. Poli, "Genetic programming for feature detection and image segmentation," in Evolutionary Computation, T.C. Fogarty (Ed.), pp. 110–125, 1996.
2. B. Bhanu and Y. Lin, "Learning composite operators for object detection," Proc. Genetic and Evolutionary Computation Conference, pp. 1003–1010, July 2002.
3. D. Howard, S.C. Roberts, and R. Brankin, "Target detection in SAR imagery by genetic programming," Advances in Engineering Software, vol. 30, no. 5, pp. 303–311, May 1999.
4. S.C. Roberts and D. Howard, "Evolution of vehicle detectors for infrared line scan imagery," Proc. Evolutionary Image Analysis, Signal Processing and Telecommunications, First European Workshops, pp. 110–125, Springer-Verlag, 1999.
5. S.A. Stanhope and J.M. Daida, "Genetic programming for automatic target classification and recognition in synthetic aperture radar imagery," Proc. Conf. Evolutionary Programming VII, pp. 735–744, 1998.
6. B. Bhanu and Y. Lin, "Genetic algorithm based feature selection for target detection in SAR images," Image and Vision Computing, 2003.
An Efficient Hybrid Genetic Algorithm for a Fixed Channel Assignment Problem with Limited Bandwidth Shouichi Matsui, Isamu Watanabe, and Ken-ichi Tokoro Communication & Information Research Laboratory (CIRL) Central Research Institute of Electric Power Industry (CRIEPI) 2-11-1 Iwado-kita, Komae-shi, Tokyo 201-8511, Japan {matsui,isamu,tokoro}@criepi.denken.or.jp
Abstract. We need an efficient channel assignment algorithm for increasing channel re-usability, reducing the call-blocking rate and reducing interference in any cellular system with limited bandwidth and a large number of subscribers. We propose an efficient hybrid genetic algorithm for a fixed channel assignment problem with a limited-bandwidth constraint. The proposed GA finds a good sequence of codes for a virtual machine that produces channel assignments. Results show that our GA produces far better solutions to several practical problems than existing GAs.
1 Introduction
The channel assignment problem (CAP), or frequency assignment problem (FAP), is a very important problem today, but it is a difficult, NP-hard problem. The radio spectrum is a limited natural resource used in a variety of private and public services; the best-known examples are cellular mobile phone systems and personal communication services (PCS). To facilitate this expansion, the radio spectrum allocated to a particular service provider needs to be assigned as efficiently and effectively as possible. Because the CAP is a very important real-world problem and an NP-hard problem, a number of heuristic algorithms have been proposed (e.g., [6]). To achieve the optimal solution of fixed channel assignment problems, most proposed algorithms try to minimize the number of necessary channels while satisfying a set of given constraints (e.g., [2,3,5,8,9,10,11,19]). However, the total number of available channels, or bandwidth of frequencies, is given and fixed in many situations, and minimizing the bandwidth is meaningless for such applications. To address this, Jin et al. proposed a new cost model [7] in which the available number of channels is given and fixed and the electro-magnetic compatibility (EMC) constraints and demand constraint are relaxed. They also proposed genetic algorithms to solve the problem [4,7]. However, their algorithms use a naïve representation and the search space is too large, so the results obtained by these algorithms are not good enough.
E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 2240–2251, 2003. © Springer-Verlag Berlin Heidelberg 2003
We propose a new hybrid genetic algorithm for the problem. Because it combines a GA with local search, it can be thought of as a memetic algorithm [13]. The proposed GA is tested on a set of standard benchmark problems [4], and it obtains far better solutions than the previously proposed GAs [4,15].
2 Fixed Channel Assignment with Limited Bandwidth

2.1 Notation
Let us consider a cellular system that consists of N cells, numbered from 1 to N. A compatibility matrix is a symmetric N × N matrix C = (c_ij) with nonnegative integer elements. The value c_ij prescribes the minimum frequency separation required between frequencies assigned to cell i and cell j, i.e., if f_ik and f_jl are frequencies assigned to cells i and j respectively, then the condition $|f_{ik} - f_{jl}| \ge c_{ij}$ should be satisfied for all i and j. Radio frequencies are assumed to be evenly spaced, so they can be identified with the positive integers. Let X_{i,j} be the variable that takes the value 1 when the j-th mobile staying in cell i wishes to own a frequency, let n_i be the number of mobiles that stay in cell i, and let T be the number of channels that each frequency provides under TDMA (Time Division Multiple Access). Let $X_i = \sum_{j=1}^{n_i} X_{i,j}$ be the random variable of required channels in cell i, and let µ_i and σ_i be the expected value and the standard deviation of X_i (regardless of whether a request succeeds or fails). In general, the load of each cell is monitored by the system, so µ_i and σ_i can be estimated from the long-term history of the base stations' load data.
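The separation constraint can be checked mechanically; in this sketch `assignment[i]` is the list of frequencies assigned to cell i, a representation we assume for illustration:

```python
from itertools import combinations

def violations(assignment, C):
    """Count frequency pairs violating |f - f'| >= c_ij.

    assignment[i]: list of (integer) frequencies assigned to cell i;
    C: the symmetric compatibility matrix (c_ij).
    """
    count, N = 0, len(assignment)
    for i in range(N):
        # Co-cell pairs: separation within cell i must be at least c_ii.
        for f1, f2 in combinations(assignment[i], 2):
            if abs(f1 - f2) < C[i][i]:
                count += 1
        # Cross-cell pairs: separation between cells i and j at least c_ij.
        for j in range(i + 1, N):
            for f1 in assignment[i]:
                for f2 in assignment[j]:
                    if abs(f1 - f2) < C[i][j]:
                        count += 1
    return count
```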
Damage of Blocked Calls
Call attempts may fail at a busy station because there are no available channels; such failed calls are called blocked calls. The more blocked calls, the more damage is caused to the system, so an ideal channel assignment should keep the total number of blocked calls over all cells as low as possible. Let H_i be the number of channels assigned to cell i; then the expected number of blocked calls in cell i is \sum_{j=H_i+1}^{n_i} P(X_i = j)(j - H_i). The objective here is to minimize the cost \sum_{i=1}^{N} \sum_{j=H_i+1}^{n_i} P(X_i = j)(j - H_i). As Horng et al. showed, we can assume the random variable X_i (1 ≤ i ≤ N) is close to a normal random variable with parameters µ_i and σ_i [4]. Therefore, as an approximation of \sum_{j=x+1}^{n_i} P(X_i = j)(j - x), we can use

$$I_E(x) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \int_x^{\infty} (y - x)\exp\left(-\frac{1}{2}\left(\frac{y-\mu_i}{\sigma_i}\right)^2\right) dy = \frac{\sigma_i}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{x-\mu_i}{\sigma_i}\right)^2\right) + \frac{\mu_i - x}{2}\,\mathrm{erfc}\left(\frac{x-\mu_i}{\sqrt{2}\,\sigma_i}\right),$$

where erfc(x) is the complementary error function defined as
S. Matsui, I. Watanabe, and K.-i. Tokoro
$$\mathrm{erfc}(x) = 1 - \mathrm{erf}(x) = \frac{2}{\sqrt{\pi}} \int_x^{\infty} \exp(-t^2)\,dt.$$
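As an illustration only, the closed-form approximation above is easy to evaluate with the standard library. The function name is ours, not the paper's; it simply transcribes the right-hand side of the I_E(x) formula using math.erfc:

```python
import math

def expected_blocked_calls(x, mu, sigma):
    """I_E(x): approximate expected number of blocked calls in a cell with
    x channels, assuming channel demand ~ Normal(mu, sigma)."""
    z = (x - mu) / sigma
    return (sigma / math.sqrt(2.0 * math.pi)) * math.exp(-0.5 * z * z) \
        + 0.5 * (mu - x) * math.erfc(z / math.sqrt(2.0))
```

At x = µ_i the expression reduces to σ_i/√(2π), and it decays towards zero as x grows past the mean demand.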
2.3 Damage from Interference
Interference can occur between a pair of transmitters if the interfering signal is sufficiently strong. Whether a transmitter pair has the potential to interfere depends on many factors, e.g., distance, terrain, power, and antenna design. The higher the potential for interference between a transmitter pair, the larger the required frequency separation. For example, if two transmitters are sufficiently geographically separated, the same frequency can be re-used; at the other extreme, if two transmitters are located at the same site, they may require, say, a five-frequency separation. Violating a frequency separation constraint, or EMC (Electro-Magnetic Compatibility) constraint, brings some degree of disadvantage to the mobiles that experience interference, and the disadvantage is in proportion to the degree of interference, which depends on the frequency distance (i.e., how many Hz lie between the two frequencies) and the power suffered. The degree of damage is defined as follows. Let p be a frequency assigned to cell i and q one assigned to cell j; then the damage f(i, j, p, q) caused by interference from this assignment is

$$f(i,j,p,q) = \begin{cases} 0 & \text{if } |p-q| \ge c_{ij}, \\ f_{i,p} f_{j,q}\, I_C(c_{ij} - |p-q|) & \text{if } |p-q| < c_{ij} \text{ and } i = j, \\ f_{i,p} f_{j,q}\, I_A(c_{ij} - |p-q|) & \text{otherwise,} \end{cases}$$

where f_{i,p} = 1 if frequency p is assigned to cell i and f_{i,p} = 0 otherwise, and I_C and I_A are two strictly increasing functions.
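The piecewise damage function can be sketched directly; the function and argument names below are our own, with `assign[i][p]` playing the role of f_{i,p}:

```python
def damage(i, j, p, q, assign, C, I_C, I_A):
    """Interference damage f(i, j, p, q): zero when the separation
    constraint c_ij is met; otherwise the co-site (I_C) or inter-site
    (I_A) penalty, scaled by whether p and q are actually assigned."""
    sep = abs(p - q)
    if sep >= C[i][j]:
        return 0
    penalty = I_C if i == j else I_A  # co-site when i == j
    return assign[i][p] * assign[j][q] * penalty(C[i][j] - sep)
```

With the cost functions used later in the experiments, I_C(x) = 5^(x−1) and I_A(x) = 5^(2x−1).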
2.4 Objective Function
The objective of the problem is to minimize the total damage, i.e., the sum of the cost of blocked calls and the cost of interference. The problem is therefore defined as

$$\min O = \sum_{i=1}^{N} \sum_{j=1}^{N} \sum_{p=1}^{Z} \sum_{q=1}^{Z} f(i,j,p,q) + \alpha \sum_{i=1}^{N} I_E(T F_i),$$

subject to f_{i,p} = 0 or 1 for 1 ≤ i ≤ N and 1 ≤ p ≤ Z, where F_i = \sum_{p=1}^{Z} f_{i,p} for 1 ≤ i ≤ N, Z is the allowable number of frequencies, and α is the relative weight of the damage of blocked calls against the damage from interference.
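A direct, self-contained evaluation of this objective might look as follows. This is a naive O(N²Z²) sketch: the names and the binary assignment-matrix layout are assumptions, and the blocked-call cost I_E from Sect. 2.2 is inlined:

```python
import math

def total_cost(assign, C, I_C, I_A, mu, sigma, T, alpha):
    """Objective O: interference damage over all cell/frequency pairs
    plus alpha times the expected blocked-call cost I_E(T * F_i)."""
    N, Z = len(C), len(assign[0])
    interference = 0.0
    for i in range(N):
        for j in range(N):
            for p in range(Z):
                for q in range(Z):
                    sep = abs(p - q)
                    if sep < C[i][j] and assign[i][p] and assign[j][q]:
                        pen = I_C if i == j else I_A
                        interference += pen(C[i][j] - sep)
    blocking = 0.0
    for i in range(N):
        x = T * sum(assign[i])  # T * F_i, the channels cell i provides
        z = (x - mu[i]) / sigma[i]
        blocking += (sigma[i] / math.sqrt(2 * math.pi)) * math.exp(-0.5 * z * z) \
                    + 0.5 * (mu[i] - x) * math.erfc(z / math.sqrt(2))
    return interference + alpha * blocking
```

Note that, as written in the paper, the quadruple sum includes the p = q terms, so each assigned frequency contributes a fixed co-site term.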
2.5 Related Works
Because the CAP (FAP) is an important problem that is very difficult to solve exactly, GA-based algorithms for the minimum span frequency assignment problem (MSFAP) have been proposed (e.g., [2,3,5,8,9,10,11,14,17,19]).
The performance of GAs for MSFAP that represent a possible assignment directly as a bit-string or a sequence of integers is not good enough; permutation-based GAs are reported to show good performance [10,11,19]. In these GAs, an assignment order of transmitters is represented by a permutation and an assignment is carried out using a sequential algorithm. This scheme overcomes the weakness of the previous two schemes and improves performance [10,11,19], and a GA with an adaptive mutation rate and a new initialization method was later developed and showed very good performance [11]. The permutation-based GAs perform well for MSFAP, but they are designed to find an assignment without violating compatibility constraints; therefore, they cannot be used for the problem described in this section. The formulation shown above, which is quite different from MSFAP, was first proposed together with a GA for solving it by Jin et al. [7], and improved versions were proposed by Horng et al. [4] and Park et al. [15]. However, these algorithms use a naïve representation, an N × Z matrix, which is a bad coding, combined with a simple GA. Rothlauf et al. [16] showed that a well-designed GA should be used in the case of bad codings; the performance of the previously proposed GAs is therefore not good enough.
3 The Proposed Algorithm
We propose a new algorithm for the FAP formulated in the previous section. The main idea of the proposed GA is that sequences of codes of a virtual machine that performs the assignment are encoded as chromosomes, and the GA searches for a good code sequence that minimizes the total damage cost. The genetic algorithm used here is an example of a 'steady state', overlapping-populations GA [12]. The proposed GA is outlined in Figure 1.

3.1 How Many Frequencies Are Necessary for a Cell?
The second term of the objective function decreases as we increase the number of assigned frequencies F_i, but the first term, the interference cost, would increase. Let us consider a cost function

$$D_i(F_i) = \sum_{p=1}^{Z} \sum_{q=1}^{Z} f(i,i,p,q) + \alpha D_B(T F_i),$$

which consists of the blocking cost and the interference cost within cell i, considering only the co-site channel separation constraint. We can find a good candidate for F_i using D_i(F_i). The frequency separation in a cell decreases as we increase F_i and the interference cost increases, whereas D_B(T F_i) decreases. Therefore we can find the F_i that minimizes D_i(F_i) by evaluating D_i(F_i) for F_i = 1, 2, ..., \bar{F}_i, i.e., the optimal F_i^* can be defined as

$$F_i^* = \mathop{\mathrm{argmin}}_{F_i \in \{1,2,\cdots,\bar{F}_i\}} D_i(F_i).$$
Procedure GA
BEGIN
  Initialize:
    (1) Calculate F_i^* (i = 1, 2, ..., N).
    (2) Generate N individuals (N being the population size), and randomly generate mutation rates N_m^i.
    (3) Produce N assignments and store each assignment.
    (4) Store best-so-far.
  LOOP
    (1) Generate offspring: generate N_n offspring.
      (1-1) Select two parents by the roulette wheel rule;
      (1-2) Apply crossover with probability P_c and generate two offspring; when no crossover is applied, the two parents become the offspring;
      (1-3) Apply adaptive mutation to the offspring and generate one offspring by Adaptive-Mutation;
      (1-4) If the offspring is better than best-so-far, replace best-so-far.
    (2) Selection: select the best N individuals from the pool of the old N and the new N_n individuals.
  UNTIL stopping condition satisfied.
  Print best-so-far.
END

Fig. 1. Outline of the proposed GA.
Because D_B(x) rapidly decreases and becomes close to zero when x ≥ µ_i + 5σ_i, we can set \bar{F}_i = µ_i + 5σ_i. The minimal frequency separation S_i^m of each cell i is calculated by S_i^m = min{(Z − 1)/(F_i^* − 1), c_ii}. We can use

$$TGT = \sum_{i=1}^{N} D_i(F_i^*)$$

as a target for the total cost.
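The exhaustive search for F_i^* described above is a one-liner; a sketch with D_i passed as a callable (the function name is an assumption):

```python
def optimal_channel_count(D_i, F_bar):
    """F_i^* = argmin of the per-cell cost D_i over F_i in 1..F_bar,
    found by direct evaluation as in Sect. 3.1."""
    return min(range(1, F_bar + 1), key=D_i)
```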
3.2 Virtual Machine
Let us consider a virtual machine that assigns frequencies one by one to cells according to a given code. The cost function D_i(F_i^*) is minimized by assigning all frequencies with the separation S_i^m when (Z − 1) mod (F_i^* − 1) = 0. When (Z − 1) mod (F_i^* − 1) ≠ 0, some frequencies must be assigned with the separation S_i^m + 1. Because D_i(F_i) does not consider the compatibility constraints between cells i and j, the above assignment does not minimize the total cost; some assignments should therefore use a separation that does not violate the inter-cell constraints. With these observations, the virtual machine must have at least the 3 codes, or operations, shown in Table 1. The assignment order of cells is important, so an instruction of the virtual machine is defined as a pair of integers, namely (cell number, action). The virtual machine assigns the frequency f determined by Table 1 to cell cell number when 1 ≤ f ≤ Z, and does nothing otherwise. When
Table 1. Action specification.

Action  Assigned frequency
0       frequency with separation of S_i^m
1       frequency with separation of S_i^m + 1
2       minimum usable frequency
a frequency f is assigned to cell i, the sets of usable frequencies of all cells are updated according to the compatibility matrix C. In this step, we use S_i^m instead of c_ii.
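The behavior of such a machine can be sketched as follows. This is an illustrative approximation, not the paper's implementation: we track only each cell's last assigned frequency and a per-cell usable set pruned by the co-site separation, whereas the real machine also prunes neighbouring cells' sets via the compatibility matrix C:

```python
def run_virtual_machine(code, N, Z, S):
    """Execute a code sequence of (cell, action) instructions.
    Action 0: last frequency + S[cell]; action 1: last + S[cell] + 1;
    action 2: minimum usable frequency. Out-of-range results are
    silently skipped, as in the paper."""
    assigned = [[] for _ in range(N)]
    usable = [set(range(1, Z + 1)) for _ in range(N)]
    for cell, action in code:
        # f_{i,0} <- 1 - S_i^m, mirroring the initialization in Fig. 2
        last = assigned[cell][-1] if assigned[cell] else 1 - S[cell]
        if action == 0:
            f = last + S[cell]
        elif action == 1:
            f = last + S[cell] + 1
        else:
            f = min(usable[cell]) if usable[cell] else 0
        if 1 <= f <= Z:
            assigned[cell].append(f)
            # prune this cell's usable set by the co-site separation
            usable[cell] = {g for g in usable[cell] if abs(g - f) >= S[cell]}
    return assigned
```

Initializing `last` to 1 − S[cell] makes the first action-0 assignment land on frequency 1.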
3.3 Chromosome
A chromosome of the GA is a sequence of instructions. A single instruction is a pair of integers (p, a), where 1 ≤ p ≤ N and a ∈ {0, 1, 2}. An instruction (p, a) assigns a frequency to cell p; therefore a valid chromosome must contain exactly F_i^* instructions whose cell number is i, for each cell i, and the length of the sequence becomes L = \sum_i F_i^*. Thus a chromosome is expressed as a sequence {(p_1, a_1), (p_2, a_2), ..., (p_L, a_L)}.

3.4 Local Search
After the assignment by the virtual machine, we can improve the assignment by a local search. If a cell has a frequency f_p that violates the interference constraints and there is a usable frequency f_q, then we can replace f_p by f_q, and the replacement always reduces the total cost. After this local search, the chromosome is modified to reflect the change. The modification algorithm changes the order of the instructions in the chromosome and the actions of the genes according to the result of the local search. The basic ideas are as follows:

– If a frequency was generated by the local search, the corresponding instruction is moved towards the tail.
– An action that assigns the minimum usable frequency is kept unchanged.

The modification algorithm shown in Figure 2 is applied to each cell in decreasing order of the interference damage of each cell.

3.5 Crossover
We use a crossover operator that does not generate an invalid chromosome. If we ignore the action part of the genes, a chromosome is a permutation of cell numbers. Because cells have multiple channel demands, a permutation contains
Procedure Modification
BEGIN
  Initialize:
    (1) Let two sequences S1 and S2 be empty.
    (2) Sort the assigned frequencies in ascending order for each cell, and let the result be (f_{i,1}, f_{i,2}, ..., f_{i,F_i}). Also let f_{i,0} ← 1 − S_i^m.
  Scan genes from head to tail:
  LOOP
    (1) Let i be the cell number and a the action of the current gene, which corresponds to the k-th assignment to cell i.
    (2) Calculate the channel separation s ← f_{i,k} − f_{i,(k−1)}.
    (3) If (s > S_i^m + 1) then t ← 2 else t ← s − S_i^m.
    (4) If t = 2 then
          if f_{i,k} was generated by the local search then append instruction (i, t) to S2,
          else append instruction (i, t) to S1;
        else
          if a = 2 then append instruction (i, a) to S1,
          else append instruction (i, t) to S1.
  UNTIL all genes are scanned.
  Return the concatenation of S1 and S2.
END

Fig. 2. Modification algorithm.
multiple occurrences of the same cell number. The candidate operators are therefore Generalized Order Crossover (GOX) [1], Generalized Partially Mapped Crossover (GPMX) [1], Precedence Preservation Crossover (PPX) [1], and modified PMX (mPMX) [11]. Comparing the performance of the four operators, we decided to use GPMX as the crossover operator, because GPMX attained the best average and the difference between the best solutions attained by GPMX and by the other operators is small. The comparison is shown in the 'Experiments and Results' section.

3.6 Mutation
The mutation is done in 2 steps. In the first step, instructions are mutated by swap mutation: randomly chosen instructions at positions p and q, i.e., (n_p, a_p) and (n_q, a_q), are swapped. In the second step, the action parts of randomly chosen genes are mutated by flip mutation: the value of the action part of a randomly chosen gene g_i is replaced by a randomly chosen integer in the range [0, 2]. The mutation operator does not generate any invalid chromosomes, i.e., offspring are always valid code sequences for the virtual machine.

Self-Adaptive Mutation Rate. The efficiency of a GA depends on two factors, namely the maintenance of suitable working memory and the quality of the
Procedure Adaptive-Mutation
BEGIN
  Initialization: Let L be the length of the chromosome, o_1 and o_2 be the offspring, and N_m^1, N_m^2 be the product of L and the mutation rate of o_1, o_2 respectively. Let U() be a function that returns a random variable uniformly distributed in the range [0, 1).
  Update Mutation Rate:
    if U() < (N_m^1 / L) then
      if U() < 1/2 then N_m^1 ← N_m^1 + 1 else N_m^1 ← N_m^1 − 1
    else N_m^1 ← N_m^1
    if U() < (N_m^2 / L) then
      if U() < 1/2 then N_m^2 ← N_m^2 + 1 else N_m^2 ← N_m^2 − 1
    else N_m^2 ← N_m^2
  Apply Mutation: Apply mutation to offspring o_1 with the rates N_m^1 / L and N_m^2 / L, and let the results be o_1^1, o_1^2. Apply mutation to offspring o_2 with the rates N_m^1 / L and N_m^2 / L, and let the results be o_2^1, o_2^2. Individuals inherit the mutation rate that generated them.
  Selection: Select the best individual from o_1^1, o_1^2, o_2^1, o_2^2.
END

Fig. 3. Adaptive mutation and reproduction method.
match between the probability density function generated and the landscape being searched. The first of these factors depends on the choice of population size and selection algorithm. The second depends on the action of the reproductive operators, i.e., the crossover and mutation operators, and their associated parameters, on the current population [18]. We reported that the performance of our permutation-based GA for MSFAP depends heavily on the mutation rate [10], and we also showed that an adaptive mutation rate mechanism greatly improved the performance [11]. Therefore, we have built an individual-level self-adaptive mutation rate mechanism into our algorithm.
Proposed Mutation Scheme. The mutation rate r_m^i for each individual i is coded as an integer N_m^i in the range [N_m^L, N_m^U], and the mutation rate r_m^i is defined as r_m^i = N_m^i / L. The values of N_m^L and N_m^U used in the numerical experiments will be discussed later.

In the mutation step, the mutation rate N_m^i is first modified according to its own value: N_m^i is incremented or decremented with the mutation rate r_m^i, and the resulting r_m^i is used as the mutation rate for individual i. The mutation and the resulting offspring are created by the algorithm shown in Figure 3.
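The two-step mutation itself (swap then flip, each at per-gene rate N_m/L) can be sketched as follows; the helper name and the choice to draw the swap partner uniformly are our assumptions:

```python
import random

def mutate(chromosome, n_m):
    """Apply swap mutation to whole instructions, then flip mutation to
    action parts, each at rate n_m / L. Genes are (cell, action) pairs,
    so neither step changes the multiset of cell numbers: the offspring
    remains a valid code sequence for the virtual machine."""
    genes = list(chromosome)
    L = len(genes)
    rate = n_m / L
    for p in range(L):                      # step 1: swap mutation
        if random.random() < rate:
            q = random.randrange(L)
            genes[p], genes[q] = genes[q], genes[p]
    for p in range(L):                      # step 2: flip mutation
        if random.random() < rate:
            genes[p] = (genes[p][0], random.randrange(3))
    return genes
```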
3.7 Selection
The selection of the GA is roulette wheel selection. The selection probability p_1 for the first parent is calculated for each genotype based on its rank. The selection probability p_2 for the second parent is calculated based on its fitness¹.

3.8 Other Genetic Operators
We use sigma truncation as the scaling function. The scaled fitness function F' is defined as F' = F − (µ_F − 2 × σ_F), where F = 1/O, and µ_F and σ_F are the average and the standard deviation of F over the population. The GA evolves the population to maximize F'.
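A sketch of this scaling step; clipping negative scaled values to zero is our addition, needed so the values can feed roulette wheel selection:

```python
import statistics

def sigma_truncation(costs, c=2.0):
    """Scale raw fitness F = 1/O to F' = F - (mean_F - c * stdev_F),
    clipped at zero for use with roulette wheel selection."""
    F = [1.0 / o for o in costs]
    mu_F = statistics.mean(F)
    sigma_F = statistics.pstdev(F)
    return [max(0.0, f - (mu_F - c * sigma_F)) for f in F]
```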
4 Experiments and Results

4.1 Benchmark Problems
We tested the algorithm using the problems proposed by Horng et al. [4]. Eight benchmark problems were examined in our experiments: three compatibility matrices C_3, C_4, C_5 and three communication load tables D_1, D_2, D_3 are combined to make the eight problems [4]. The interference cost functions are I_C(x) = 5^{x−1} and I_A(x) = 5^{2x−1}, the weight of the blocking cost is α = 1000, and the value of T (the number of channels that each frequency provides under TDMA) was set to 8².

4.2 Parameters
The population size for each problem was 100 and the maximal number of generations was 1000; the GA terminates early if the variance of the fitness of the population becomes 0 and does not change during 200 contiguous generations. The crossover ratio P_c is 0.8, and the number of new offspring N_n is equal to the population size. The lower and upper bounds of the mutation rate are set to N_m^L = 3 and N_m^U = 0.1 × L.

¹ Valenzuela et al. used a similar asymmetric selection probability in their GA for MSFAP [19]. In their GA, the first parent is selected deterministically in sequence, but the second parent is selected in a roulette wheel fashion with a probability based on its rank.
² According to private communications with an author of the paper, there are typos in the paper [4], namely in the communication load table of Problem 1 and in the weight α, and there is no description of the value of T. We used the corrected data provided by the author. The results by Park et al. [15] used the same corrected data.
Table 2. Comparison of simulation results

                Horng et al. [4]     Park et al. [15]              Our GA
Prob  TGT       Best     Ave        Best     Ave      CPU†       Best     Ave      CPU‡
P1    3.7e-4    203.4    302.6      0.4      0.5      65504      3.7e-4   3.7e-4   75.3
P2    4.1       271.4    342.5      27.9     30.9     88692      4.1      4.1      68.4
P3    4.1       1957.4   2864.1     63.1     79.3     89918      5.8      7.1      196.0
P4    231       906.3    1002.4     675.8    684.1    95585      243.8    248.3    69.4
P5    231       4302.3   4585.4     1064.1   1092.5   87905      535.8    659.8    255.4
P6    190       4835.4   5076.2     1149.8   1227.3   35790      552.1    692.2    249.7
P7    2232      20854.3  21968.4    5636.7   5831.8   37323      3243.5   3537.9   249.4
P8    22518     53151.7  60715.4    41883.0  41967.5  135224     27714.3  29845.4  813.8

†: CPU seconds on a Pentium III 750 MHz. ‡: CPU seconds on a Pentium III 933 MHz.
4.3 Results
We tested the performance of the GA by running it 100 times for each problem. The results are shown in Table 2, together with the results by Horng et al. [4] and by Park et al. [15] and the target total cost defined in Subsection 3.1. The column 'Prob' gives the problem name and 'TGT' the target total cost; 'Best' is the minimum cost found in the 100 runs, 'Ave' the average cost over the 100 runs, and 'CPU' the computational time of a single run. Table 2 shows that our GA performs very well and outperforms the previous GAs [4,15] in all cases: the costs obtained by our GA are very small compared to the others. Our GA finds assignments whose cost equals the target for P1 and P2, and whose cost is very close to the target for P3 and P4. The GA obtains assignments without any violation of the EMC constraints for problems P1, P2, and P3. The performance improvement is significant for all problems. The CPU time is very short compared with the GA by Park et al., almost 1/100 for all problems; our GA not only runs faster but also finds better solutions than the GA by Park et al.

Comparison of Crossover Operators. We ran the proposed GA with the same parameters shown above, changing only the crossover operator. Table 3 compares the crossover operators and shows the following:

– GPMX attained the best average solution for all eight problems.
– The best solutions of P1, P2, P3, and P4 are the same among the four operators. GPMX attained the best solution for P8, PPX for P6 and P7, and mPMX for P5.
– All four operators attained better results than those of Horng et al. [4] or Park et al. [15].
Table 3. Comparison of crossover operators
      GOX                GPMX               PPX                mPMX
Prob  Best     Ave       Best     Ave       Best     Ave       Best     Ave
P1    3.7e-4   3.7e-4    3.7e-4   3.7e-4    3.7e-4   3.7e-4    3.7e-4   3.7e-4
P2    4.1      4.1       4.1      4.1       4.1      4.1       4.1      4.1
P3    5.8      7.1       5.8      7.1       5.8      8.0       5.8      7.7
P4    243.8    248.8     243.8    248.3     243.8    249.7     243.8    248.6
P5    548.5    713.7     535.8    659.8     558.0    690.0     526.7    691.4
P6    608.0    766.1     552.1    692.2     539.2    723.6     574.4    766.6
P7    3309.7   3693.4    3243.5   3537.9    3233.3   3667.8    3273.3   3651.2
P8    30308.6  31655.8   27714.3  29845.4   28842.5  30392.1   29040.2  30637.1
5 Conclusions
We have presented a new, efficient hybrid genetic algorithm for fixed channel assignment problems with a limited bandwidth constraint. The algorithm uses a GA to find a good sequence of codes for a virtual machine that executes the assignment task, and an adaptive mutation rate mechanism is developed to improve performance. The proposed GA was tested using a set of benchmark problems, and its performance is superior to existing GAs: it obtains very good solutions that the previously proposed GAs for the problem could not find. We believe that our approach, finding a good code sequence for a virtual machine by a GA combined with an adaptive mutation rate mechanism, can be applied to other real-world problems.

Acknowledgments. The authors are grateful to Mr. Ming-Hui Jin of National Central University, Chungli, Taiwan for providing us with data for the experiment.
References

1. Bierwirth, C., Mattfeld, D.C., and Kopfer, H.: On permutation representations for scheduling problems, Proc. 4th International Conference on Parallel Problem Solving from Nature—PPSN IV, pp. 310–318, 1996.
2. Crompton, W., Hurley, S., and Stephens, N.M.: Applying genetic algorithms to frequency assignment problems, Proc. SPIE Conf. Neural and Stochastic Methods in Image and Signal Processing, vol. 2304, pp. 76–84, 1994.
3. Cuppini, M.: A genetic algorithm for channel assignment problems, Eur. Trans. Telecommun., vol. 5, no. 2, pp. 285–294, 1994.
4. Horng, J.T., Jin, M.H., and Kao, C.Y.: Solving fixed channel assignment problems by an evolutionary approach, Proc. of Genetic and Evolutionary Computation Conference 2001 (GECCO-2001), pp. 351–358, 2001.
5. Hurley, S. and Smith, D.H.: Fixed spectrum frequency assignment using natural algorithms, Proc. of Genetic Algorithms in Engineering Systems: Innovations and Applications, pp. 373–378, 1995.
6. Hurley, S., Smith, D.H., and Thiel, S.U.: FASoft: a system for discrete channel frequency assignment, Radio Science, vol. 32, no. 5, pp. 1921–1939, 1997.
7. Jin, M.H., Wu, H.K., Horng, J.Z., and Tsai, C.H.: An evolutionary approach to fixed channel assignment problems with limited bandwidth constraint, Proc. IEEE Int. Conf. Commun. 2001, vol. 7, pp. 2100–2104, 2001.
8. Kim, J.-S., Park, S.H., Dowd, P.W., and Nasrabadi, N.M.: Comparison of two optimization techniques for channel assignment in cellular radio network, Proc. of IEEE Int. Conf. Commun., vol. 3, pp. 850–1854, 1995.
9. Lai, K.W. and Coghill, G.G.: Channel assignment through evolutionary optimization, IEEE Trans. Veh. Technol., vol. 45, no. 1, pp. 91–96, 1996.
10. Matsui, S. and Tokoro, K.: A new genetic algorithm for minimum span frequency assignment using permutation and clique, Proc. of Genetic and Evolutionary Computation Conference 2000 (GECCO-2000), pp. 682–689, 2000.
11. Matsui, S. and Tokoro, K.: Improving the performance of a genetic algorithm for minimum span frequency assignment problem with an adaptive mutation rate and a new initialization method, Proc. of Genetic and Evolutionary Computation Conference 2001 (GECCO-2001), pp. 1359–1366, 2001.
12. Mitchell, M.: An Introduction to Genetic Algorithms, MIT Press, 1996.
13. Moscato, P.: On evolution, search, optimization, genetic algorithms and martial arts: towards memetic algorithms, Caltech Concurrent Computation Program, C3P Report 826, 1989.
14. Ngo, C.Y. and Li, V.O.K.: Fixed channel assignment in cellular radio networks using a modified genetic algorithm, IEEE Trans. Veh. Technol., vol. 47, no. 1, pp. 163–172, 1998.
15. Park, E.J., Kim, Y.H., and Moon, B.R.: Genetic search for fixed channel assignment problem with limited bandwidth, Proc. of Genetic and Evolutionary Computation Conference 2002 (GECCO-2002), pp. 1172–1179, 2002.
16. Rothlauf, F., Goldberg, D.E., and Heinzl, A.: Bad coding and the utility of well-designed genetic algorithms, Proc. of Genetic and Evolutionary Computation Conference 2000 (GECCO-2000), pp. 355–362, 2000.
17. Smith, D.H., Hurley, S., and Thiel, S.U.: Improving heuristics for the frequency assignment problem, Eur. J. Oper. Res., vol. 107, no. 1, pp. 76–86, 1998.
18. Smith, J.E.: Self Adaptation in Evolutionary Algorithms, Ph.D. thesis, Univ. of the West of England, Bristol, 1998.
19. Valenzuela, C., Hurley, S., and Smith, D.: A permutation based algorithm for minimum span frequency assignment, Proc. 5th International Conference on Parallel Problem Solving from Nature—PPSN V, Amsterdam, pp. 907–916, 1998.
Using Genetic Algorithms for Data Mining Optimization in an Educational Web-Based System

Behrouz Minaei-Bidgoli and William F. Punch

Genetic Algorithms Research and Applications Group (GARAGe)
Department of Computer Science & Engineering
Michigan State University
2340 Engineering Building, East Lansing, MI 48824
{minaeibi,punch}@cse.msu.edu
http://garage.cse.msu.edu
Abstract. This paper presents an approach for classifying students in order to predict their final grade based on features extracted from logged data in an education web-based system. A combination of multiple classifiers leads to a significant improvement in classification performance. By weighting the feature vectors using a Genetic Algorithm we can optimize the prediction accuracy and obtain a marked improvement over raw classification. We further show that when the number of features is small, feature weighting works better than feature selection alone.
1 Statement of Problem

Many leading educational institutions are working to establish an online teaching and learning presence. Several systems with different capabilities and approaches have been developed to deliver online education in an academic setting. In particular, Michigan State University (MSU) has pioneered some of these systems to provide an infrastructure for online instruction. The research presented here was performed on part of the latest online educational system developed at MSU, the Learning Online Network with Computer-Assisted Personalized Approach (LON-CAPA).

In LON-CAPA¹, we are involved with two kinds of large data sets: 1) educational resources such as web pages, demonstrations, simulations, and individualized problems designed for use on homework assignments, quizzes, and examinations; and 2) information about users who create, modify, assess, or use these resources. In other words, we have two ever-growing pools of data. We have been studying data mining methods for extracting useful knowledge from these large databases of students using online educational resources and their recorded paths through the web of educational resources. In this study, we aim to answer the following research questions: Can we find classes of students? In other words, do there exist groups of students who use these online resources in a similar way? If so, can we identify that class for any individual student? With this information, can we help a student use the resources better, based on the usage of the resources by other students in their groups? We hope to find similar patterns of use in the data gathered from LON-CAPA, and eventually to be able to make predictions as to the most beneficial course of studies for each learner based on their present usage. The system could then make suggestions to the learner as to how best to proceed.

¹ See http://www.lon-capa.org

E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 2252–2263, 2003. © Springer-Verlag Berlin Heidelberg 2003
2 Map the Problem to Genetic Algorithm

Genetic Algorithms have been shown to be an effective tool for use in data mining and pattern recognition [7], [10], [6], [16], [15], [13], [4]. An important aspect of GAs in a learning context is their use in pattern recognition. There are two different approaches to applying a GA in pattern recognition:

1. Apply a GA directly as a classifier. Bandyopadhyay and Murthy [3] applied a GA to find the decision boundary in an N-dimensional feature space.
2. Use a GA as an optimization tool for resetting the parameters of other classifiers.

Most applications of GAs in pattern recognition optimize some parameters of the classification process. Many researchers have used GAs in feature selection [2], [9], [12], [18]. GAs have been applied to find an optimal set of feature weights that improve classification accuracy: first, a traditional feature extraction method such as Principal Component Analysis (PCA) is applied, and then a classifier such as k-NN is used to calculate the fitness function for the GA [17], [19]. Combination of classifiers is another area where GAs have been used for optimization. Kuncheva and Jain [11] used a GA for selecting the features as well as the types of the individual classifiers in their design of a Classifier Fusion System. GAs have also been used to select the prototypes in case-based classification [20].

In this paper we focus on the second approach and use a GA to optimize a combination of classifiers. Our objective is to predict the students' final grades based on their web-use features, which are extracted from the homework data. We design, implement, and evaluate a series of pattern classifiers with various parameters in order to compare their performance on a dataset from LON-CAPA. Error rates for the individual classifiers, their combination, and the GA-optimized combination are presented.
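The second approach, weighting features and scoring each candidate weight vector with a k-NN classifier, can be sketched as a GA fitness function. This is a leave-one-out toy version under our own assumptions; all names are ours and this is not the authors' exact setup:

```python
def weighted_knn_fitness(weights, X, y, k=3):
    """Leave-one-out accuracy of a k-NN classifier under a weighted
    squared Euclidean distance; a GA would evolve `weights` to
    maximize this value."""
    n, correct = len(X), 0
    for i in range(n):
        # distance of sample i to every other sample, with its label
        dists = sorted(
            (sum(w * (a - b) ** 2 for w, a, b in zip(weights, X[i], X[j])), y[j])
            for j in range(n) if j != i
        )
        votes = [label for _, label in dists[:k]]
        if max(set(votes), key=votes.count) == y[i]:
            correct += 1
    return correct / n
```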
2.1 Dataset and Class Labels

As test data we selected the student and course data of a LON-CAPA course, PHY183 (Physics for Scientists and Engineers I), which was held at MSU in the spring semester of 2002. This course integrated 12 homework sets including 184 problems, all of which are online. About 261 students used LON-CAPA for this course. Some students dropped the course after doing a couple of homework sets, so they do not have any final grades; after removing those students, 227 valid samples remained. The grade distribution of the students is shown in Fig. 1.

Fig. 1. Graph of distribution of grades in course PHY183 SS02
We can group the students regarding their final grades in several ways, three of which are:

1. Let the 9 possible class labels be the same as the students' grades, as shown in Table 1.
2. We can label the students in relation to their grades and group them into three classes: "high" representing grades from 3.5 to 4.0, "middle" representing grades from 2.5 to 3, and "low" representing grades less than 2.5.
3. We can also categorize the students with one of two class labels: "Passed" for grades higher than 2.0, and "Failed" for grades less than or equal to 2.0, as shown in Table 3.

Table 1. Selecting 9 class labels regarding students' grades in course PHY183 SS02

Class  Grade  Student #  Percentage
1      0.0    2          0.9%
2      0.5    0          0.0%
3      1.0    10         4.4%
4      1.5    28         12.4%
5      2.0    23         10.1%
6      2.5    43         18.9%
7      3.0    52         22.9%
8      3.5    41         18.0%
9      4.0    28         12.4%
Table 2. Selecting 3 class labels regarding students' grades in course PHY183 SS02

Class   Grade              Student #  Percentage
High    Grade >= 3.5       69         30.40%
Middle  2.0 < Grade < 3.5  95         41.80%
Low     Grade <= 2.0       63         27.80%
Table 3. Selecting 2 class labels regarding students' grades in course PHY183 SS02

Class   Grade         Student #  Percentage
Passed  Grade > 2.0   164        72.20%
Failed  Grade <= 2.0  63         27.80%
We can predict that the error rate for the first class grouping should be higher than for the others, because the sample sizes of the grades over the 9 classes differ considerably. It is clear that we have less data for the first three classes in the training phase, so the error rate will likely be higher in the evaluation phase.

2.2 Extractable Features

An essential step in classification is selecting the features used for classification. Below we discuss the features from LON-CAPA that were used, how they can be visualized (to help in selection), and why we normalize the data before classification. The following features are stored by the LON-CAPA system:

1. Total number of correct answers (success rate).
2. Getting the problem right on the first try, vs. a high number of tries (success at the first try).
3. Total number of tries for doing homework (number of attempts before the correct answer is derived).
4. Time spent on the problem until solved; more specifically, the number of hours until correct (the difference between the time of the last successful submission and the first time the problem was examined). Also, the time at which the student got the problem correct relative to the due date; usually better students complete the homework earlier.
5. Total time spent on the problem regardless of whether the correct answer was obtained (the difference between the time of the last submission and the first time the problem was examined).
6. Participating in the communication mechanisms, vs. working alone. LON-CAPA provides online interaction both with other students and with the instructor. Were these used?
7. Reading the supporting material before attempting homework vs. attempting the homework first and then reading up on it.
B. Minaei-Bidgoli and W.F. Punch
8. Submitting many attempts in a short amount of time without looking up material in between, versus giving it one try, reading up, submitting another attempt, and so forth.
9. Giving up on a problem versus continuing to try up to the deadline.
10. Time of the first log on (beginning of assignment, middle of the week, last minute) correlated with the number of tries or the number of solved problems. A student who gets all correct answers will not necessarily be in the successful group if they took an average of 5 tries per problem, but this remains to be verified by this research.

In this paper we focused on the first six features in the PHY183 SS02 dataset that we chose for the classification experiment.

2.3 Classifiers

Pattern recognition has a wide variety of applications in many different fields, so it is not possible to come up with a single classifier that gives good results in all cases. The optimal classifier in every case is highly dependent on the problem domain. In practice, one might come across a case where no single classifier can classify with an acceptable level of accuracy. In such cases it is better to pool the results of different classifiers to achieve the optimal accuracy. Every classifier operates well on different aspects of the training or test feature vector. As a result, assuming appropriate conditions, combining multiple classifiers may improve classification performance compared with any single classifier [4]. The scope of this survey is restricted to comparing some popular non-parametric pattern classifiers and a single parametric pattern classifier according to the error estimate. Six different classifiers are compared in this study using the LON-CAPA datasets.
The classifiers used in this study include the quadratic Bayesian classifier, 1-nearest neighbor (1-NN), k-nearest neighbor (k-NN), Parzen window, multi-layer perceptron (MLP), and decision tree.² These are among the most common classifiers used in practical classification problems. After some preprocessing operations were performed on the dataset, the error rate of each classifier is reported. Finally, to improve performance, a combination of classifiers is presented.

2.4 Normalization

Having assumed in the Bayesian and Parzen-window classifiers that the features are normally distributed, it is necessary that the data for each feature be normalized. This ensures that each feature has the same weight in the decision process. Assuming that the given data is Gaussian distributed, this normalization is performed using the mean and standard deviation of the training data. In order to normalize the training data, it is necessary first to calculate the sample mean µ and the standard deviation σ of
² The first five classifiers are coded in MATLAB™ 6.0; for the decision tree classifiers we used available software packages such as C5.0, CART, QUEST, and CRUISE.
Using Genetic Algorithms for Data Mining Optimization
each feature, or column, in this dataset, and then normalize the data using equation (1):

x_i = (x_i − µ) / σ    (1)
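Equation (1) amounts to z-score normalization of each feature column using training-set statistics. A minimal sketch (the matrix layout, with one column per feature, is our assumption):

```python
import numpy as np

def normalize(train, test=None):
    """Z-score normalize each feature column using *training* statistics
    (Eq. 1), so every feature gets zero mean and unit standard deviation."""
    mu = train.mean(axis=0)            # sample mean of each column
    sigma = train.std(axis=0, ddof=1)  # sample standard deviation
    norm_train = (train - mu) / sigma
    if test is None:
        return norm_train
    return norm_train, (test - mu) / sigma  # test data reuses training stats

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
Z = normalize(X)
# each column of Z now has mean 0 and (sample) standard deviation 1
```

Note that applying the training mean and standard deviation to the test data, rather than recomputing them, keeps the evaluation honest.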
This ensures that each feature of the training dataset has a normal distribution with a mean of zero and a standard deviation of one. In addition, the kNN method requires normalization of all features into the same range. However, we should be cautious in applying normalization without first considering its effect on the classifiers' performance.

2.5 Combination of Multiple Classifiers (CMC)

By combining multiple classifiers we hope to improve classifier performance. There are different ways one can combine classifiers:
• The simplest way is to find the overall error rate of the classifiers and choose the one with the lowest error rate on the given dataset. This is called offline CMC. It may not really seem to be a CMC; however, in general, it performs better than the individual classifiers.
• The second method, called online CMC, uses all the classifiers followed by a vote. The class getting the maximum votes from the individual classifiers is assigned to the test sample. This method intuitively seems better than the previous one. However, when tried on some cases of our dataset, the results were not better than the best result of the previous method. So, we changed the majority-vote rule from "getting more than 50% of the votes" to "getting more than 75% of the votes". This resulted in a significant improvement over offline CMC.

Using the second method, we show in Table 4 that CMC can achieve a significant accuracy improvement in all three cases of 2, 3, and 9 classes. Next we use a GA to find out whether we can maximize the CMC performance further.
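The online CMC rule with the raised 75% threshold can be sketched as follows (the function name and the rejection behavior when no class clears the threshold are our illustrative choices; the paper does not specify the fallback):

```python
from collections import Counter

def cmc_predict(votes, threshold=0.75):
    """Online CMC: assign the class that gets strictly more than
    `threshold` of the individual classifiers' votes; return None
    (reject / fall back to the best single classifier) otherwise."""
    label, count = Counter(votes).most_common(1)[0]
    if count > threshold * len(votes):
        return label
    return None

print(cmc_predict(["pass", "pass", "pass", "fail"]))  # 3/4 is not > 75% -> None
print(cmc_predict(["pass", "pass", "pass", "pass"]))  # unanimous -> pass
```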
3 Optimizing the CMC Using a GA

We used GAToolBox³ for MATLAB to implement a GA to optimize classification performance. Our goal is to find a population of best weights for every feature vector which minimize the classification error rate. The feature vector for our predictors is the set of six variables for every student: success rate, success at the first try, number of attempts before the correct answer is derived, the time at which the student got the problem correct relative to the due date, total time spent on the problem, and the number of online interactions of the student both with other students and with the instructor.

³ Downloaded from http://www.shef.ac.uk/~gaipp/ga-toolbox/
We randomly initialized a population of six-dimensional weight vectors with values between 0 and 1, corresponding to the feature vector, and experimented with different population sizes. We found good results using a population of 200 individuals. The GA Toolbox supports binary, integer, real-valued and floating-point chromosome representations. Real-valued populations may be initialized using the Toolbox function crtrp. For example, to create a random population of 200 individuals with 6 variables each, we define boundaries on the variables in FieldD, a matrix containing the boundaries of each variable of an individual:

FieldD = [ 0 0 0 0 0 0;   % lower bound
           1 1 1 1 1 1 ]; % upper bound

We create an initial population with Chrom = crtrp(200, FieldD), so we have, for example:

Chrom = 0.23 0.17 0.95 0.38 0.06 0.26
        0.35 0.09 0.43 0.64 0.20 0.54
        0.50 0.10 0.09 0.65 0.68 0.46
        0.21 0.29 0.89 0.48 0.63 0.89
        ...
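The crtrp initialization above can be re-implemented outside MATLAB; the sketch below is our NumPy rendition of the same idea (crtrp itself belongs to the GA Toolbox, the function name here is ours):

```python
import numpy as np

def crtrp_like(nind, field_d, rng=np.random.default_rng(0)):
    """Create `nind` random real-valued individuals. `field_d` is a
    2 x nvar matrix whose rows are the lower and upper variable bounds,
    mirroring the Toolbox's FieldD convention."""
    lower, upper = np.asarray(field_d, dtype=float)
    return lower + rng.random((nind, lower.size)) * (upper - lower)

field_d = [[0, 0, 0, 0, 0, 0],   # lower bounds
           [1, 1, 1, 1, 1, 1]]   # upper bounds
chrom = crtrp_like(200, field_d)  # population of 200 six-dimensional weight vectors
```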
We used the simple genetic algorithm (SGA), which is described by Goldberg in [9]. The SGA uses common GA operators to find a population of solutions which optimize the fitness values.

3.1 Recombination

We used "stochastic universal sampling" [1] as our selection method. A form of stochastic universal sampling is implemented by obtaining a cumulative sum of the fitness vector, FitnV, and generating N equally spaced numbers between 0 and sum(FitnV). Thus, only one random number is generated, all the others being equally spaced from that point. The indices of the selected individuals are determined by comparing the generated numbers with the cumulative sum vector. The probability of an individual being selected is then given by

F(x_i) = f(x_i) / Σ_{i=1}^{N_ind} f(x_i)    (2)
where f(x_i) is the fitness of individual x_i and F(x_i) is the probability of that individual being selected.

3.2 Crossover

The crossover operation is not necessarily performed on all strings in the population. Instead, it is applied with a probability Px when the pairs are chosen for breeding. We selected Px = 0.7. There are several functions to perform crossover on real-valued matrices. One of them is recint, which performs intermediate recombination between pairs of individuals in the current population, OldChrom, and returns a new population after mating, NewChrom. Each row of OldChrom corresponds to one individual. recint is applicable only to populations of real-valued variables. Intermediate recombination combines parent values using the following formula [14]:

Offspring = parent1 + Alpha × (parent2 − parent1)    (3)

where Alpha is a scaling factor chosen uniformly in the interval [−0.25, 1.25].
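Intermediate recombination per Eq. (3) can be sketched as below; recint is the Toolbox's function, while this re-implementation, which draws a fresh Alpha for each variable, is our assumption about its per-gene behavior:

```python
import numpy as np

def recint_like(parent1, parent2, rng=np.random.default_rng()):
    """Intermediate recombination (Eq. 3): offspring = p1 + alpha*(p2 - p1),
    with alpha drawn uniformly from [-0.25, 1.25] for each variable."""
    alpha = rng.uniform(-0.25, 1.25, size=parent1.shape)
    return parent1 + alpha * (parent2 - parent1)

p1 = np.array([0.2, 0.4, 0.6])
p2 = np.array([0.8, 0.1, 0.5])
child = recint_like(p1, p2)
# each gene lies in a slightly expanded interval around its two parent values
```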
3.3 Mutation
A further genetic operator, mutation, is applied to the new chromosomes with a set probability Pm. Mutation causes the individual genetic representation to be changed according to some probabilistic rule. Mutation is generally considered a background operator that ensures the probability of searching a particular subspace of the problem space is never zero. This tends to inhibit the possibility of converging to a local optimum rather than the global optimum. There are several functions to perform mutation on real-valued populations. We used mutbga, which takes the real-valued population OldChrom, mutates each variable with a given probability, and returns the mutated population: NewChrom = mutbga(OldChrom, FieldD, MutOpt). Mutation is performed by the addition of small random values (the mutation step). We used 1/600 as our mutation rate. The mutation of each variable is calculated as

Mutated Var = Var + MutMx × range × MutOpt(2) × delta
(4)
where delta is an internal matrix which specifies the normalized mutation step size, MutMx is an internal mask table, and MutOpt specifies the mutation rate and its shrinkage during the run. The mutation operator mutbga is able to generate most points in the hypercube defined by the variables of the individual and the range of the mutation. However, it tests more often near the current value of the variable; that is, the probability of small step sizes is greater than that of larger ones.

3.4 Fitness Function

During the reproduction phase, each individual is assigned a fitness value derived from its raw performance measure, given by the objective function. This value is used in selection to bias towards more fit individuals. Highly fit individuals, relative to the whole population, have a high probability of being selected for mating, whereas less fit individuals have a correspondingly low probability. The error rate is measured in each round of cross validation by dividing the total number of misclassified examples by the total number of test examples. Our fitness function therefore measures the error rate achieved by the CMC, and our objective is to maximize this performance (minimize the error rate).
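The fitness computation can be sketched as follows; here a weighted 1-NN classifier with leave-one-out validation stands in for the full CMC ensemble under 10-fold cross validation (the stand-in classifier and function names are our assumptions):

```python
import numpy as np

def loo_error_rate(X, y, weights):
    """Leave-one-out error rate of a 1-NN classifier after scaling each
    feature by the GA's weight vector (a stand-in for the CMC ensemble)."""
    Xw = X * weights
    errors = 0
    for i in range(len(X)):
        d = np.linalg.norm(Xw - Xw[i], axis=1)
        d[i] = np.inf                       # exclude the sample itself
        errors += y[int(np.argmin(d))] != y[i]
    return errors / len(X)

def fitness(X, y, weights):
    # the GA maximizes fitness, i.e. minimizes the classification error rate
    return 1.0 - loo_error_rate(X, y, weights)
```

Each GA individual is a weight vector; evaluating it means re-running the (cross-validated) classification, which is why the paper stresses the time overhead of fitness evaluation.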
4 Experiment Results

Without using the GA, the overall performance of the classifiers on our dataset — the four tree classifiers, five non-tree classifiers, and CMC — is shown in Table 4. Among the individual classifiers, in the 2-class case kNN has the best performance with 82.3% accuracy. In the 3-class and 9-class cases, CART has the best accuracy, about 60% for 3 classes and 33% for 9 classes. However, considering the combination of the non-tree-based classifiers, the CMC has the best performance in all three cases: 86.8% accuracy in the 2-class case, 71% in the 3-class case, and 51% in the 9-class case.

Table 4. Comparing the performance of all classifiers on the PHY183 dataset in the cases of 2, 3, and 9 classes, using 10-fold cross validation, without GA

Performance %
           Classifier   2-Classes   3-Classes   9-Classes
Tree       C5.0         80.3        56.8        25.6
           CART         81.5        59.9        33.1
           QUEST        80.5        57.1        20.0
           CRUISE       81.0        54.9        22.9
Non-tree   Bayes        76.4        48.6        23.0
           1NN          76.8        50.5        29.0
           kNN          82.3        50.4        28.5
           Parzen       75.0        48.1        21.5
           MLP          79.5        50.9        -
           CMC          86.8        70.9        51.0
For GA optimization, we used 200 individuals in our population, running the GA over 500 generations. We ran the program 10 times and averaged the results, which are shown in Table 5. In every run the fitness function is called 500 × 200 times, and within it we used 10-fold cross validation to measure the average performance of the CMC. Every classifier is thus called 3 × 10⁶ times over the 2-class, 3-class and 9-class cases, so the time overhead for fitness evaluation is critical. Since using the MLP in this process took about 2 minutes while the other four non-tree classifiers (Bayes, 1NN, 3NN, and Parzen window) took only 3 seconds, we omitted the MLP from our classifier group so we could obtain the results in a reasonable time.

Table 5. Comparing the CMC performance on the PHY183 dataset with and without GA in the cases of 2, 3, and 9 classes, 95% confidence interval
Performance %
Classifier                           2-Classes      3-Classes      9-Classes
CMC of 4 Classifiers without GA      83.87 ± 1.73   61.86 ± 2.16   49.74 ± 1.86
GA Optimized CMC, Mean individual    94.09 ± 2.84   72.13 ± 0.39   62.25 ± 0.63
Improvement                          10.22 ± 1.92   10.26 ± 1.84   12.51 ± 1.75
The results in Table 5 represent the mean performance with a two-tailed t-test at a 95% confidence interval. For the improvement of the GA over the non-GA result, a P-value indicating the probability of the null hypothesis (that there is no improvement) shows the significance of the GA optimization. All have p < 0.000, indicating significant improvement. Therefore, using the GA, in all cases we obtained more than a 10% improvement in mean individual performance, and about 12% to 15% improvement in best individual performance. Fig. 2 shows the best result of the ten runs over our dataset. These charts represent the population mean, the best individual at each generation, and the best value yielded by the run.
Fig. 2. Graph of GA Optimized CMC performance in the cases of 2 and 3 classes
Finally, we can examine the individuals (weights) for features by which we obtained the improved results. This feature weighting indicates the importance of each feature for making the required classification. In most cases the results are similar to Multiple Linear Regressions or tree-based software that use statistical methods to measure feature importance. Table 6 shows the importance of the six features in the 3-classes case using the Entropy splitting criterion. Based on entropy, a statistical property called information gain measures how well a given feature separates the training examples in relation to their target classes. Entropy characterizes impurity of an arbitrary collection of examples S at a specific node N. In [5] the impurity of a node N is denoted by i(N) such that:
Entropy(S) = i(N) = −Σ_j P(ω_j) log₂ P(ω_j)    (5)

where P(ω_j) is the fraction of examples at node N that go to category ω_j.
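The impurity of Eq. (5) can be computed with a few lines; a minimal sketch:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Impurity i(N) of Eq. (5): -sum_j P(w_j) * log2 P(w_j), where
    P(w_j) is the fraction of examples at the node in category w_j."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

print(entropy(["pass"] * 5 + ["fail"] * 5))  # maximally impure two-class node -> 1.0
```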
Table 6. Feature importance in the 3-class case using the entropy criterion

Feature                  Importance %
Total_Correct_Answers    100.00
Total_Number_of_Tries    58.61
First_Got_Correct        27.70
Time_Spent_to_Solve      24.60
Total_Time_Spent         24.47
Communication            9.21
The GA results also show that the "Total number of correct answers" and the "Total number of tries" are the most important features for the classification. The second column in Table 6 shows the percentage of feature importance.
5 Conclusions and Future Work

Four classifiers were used to segregate the students. A combination of multiple classifiers leads to a significant accuracy improvement in all three cases. Weighting the features and using a genetic algorithm to minimize the error rate improves the prediction accuracy by at least 10% in all of the 2-, 3- and 9-class cases. In cases where the number of features is low, feature weighting worked much better than feature selection. The successful optimization of student classification in all three cases demonstrates the merit of using LON-CAPA data to predict students' final grades based on features extracted from the homework data. We are going to apply genetic programming to produce many different combinations of features, to extract new features and improve prediction accuracy. We plan to use evolutionary algorithms to classify the students and problems directly as well. We also want to apply evolutionary algorithms to find association rules and dependencies among the groups of problems (mathematical, optional response, numerical, Java applet, and so forth) of LON-CAPA homework datasets.

Acknowledgements. This work was partially supported by the National Science Foundation under ITR 0085921.
References

1. Baker, J.E.: Reducing bias and inefficiency in the selection algorithm. Proceedings ICGA 2, Lawrence Erlbaum Associates (1987) 14–21
2. Bala, J., De Jong, K., Huang, J., Vafaie, H., Wechsler, H.: Using learning to facilitate the evolution of features for recognizing visual concepts. Evolutionary Computation 4(3), Special Issue on Evolution, Learning, and Instinct: 100 years of the Baldwin Effect (1997)
3. Bandyopadhyay, S., Murthy, C.A.: Pattern Classification Using Genetic Algorithms. Pattern Recognition Letters, Vol. 16 (1995) 801–808
4. De Jong, K.A., Spears, W.M., Gordon, D.F.: Using genetic algorithms for concept learning. Machine Learning 13 (1993) 161–188
5. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification. 2nd Edition, John Wiley & Sons, New York, NY (2001)
6. Falkenauer, E.: Genetic Algorithms and Grouping Problems. John Wiley & Sons (1998)
7. Freitas, A.A.: A survey of Evolutionary Algorithms for Data Mining and Knowledge Discovery. In: Ghosh, A., Tsutsui, S. (Eds.): Advances in Evolutionary Computation. Springer-Verlag (2002). See: www.pgia.pucpr.br/~alex/papers
8. Goldberg, D.E.: Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, MA (1989)
9. Guerra-Salcedo, C., Whitley, D.: Feature Selection mechanisms for ensemble creation: a genetic search perspective. In: Freitas, A.A. (Ed.): Data Mining with Evolutionary Algorithms: Research Directions, Technical Report WS-99-06. AAAI Press (1999)
10. Jain, A.K., Zongker, D.: Feature Selection: Evaluation, Application, and Small Sample Performance. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 19, No. 2 (1997)
11. Kuncheva, L.I., Jain, L.C.: Designing Classifier Fusion Systems by Genetic Algorithms. IEEE Transactions on Evolutionary Computation, Vol. 33 (2000) 351–373
12. Martin-Bautista, M.J., Vila, M.A.: A survey of genetic feature selection in mining issues. Proceedings Congress on Evolutionary Computation (CEC-99), Washington, D.C., July (1999) 1314–1321
13. Michalewicz, Z.: Genetic Algorithms + Data Structures = Evolution Programs. 3rd Ed., Springer-Verlag (1996)
14. Mühlenbein, H., Schlierkamp-Voosen, D.: Predictive Models for the Breeder Genetic Algorithm: I. Continuous Parameter Optimization. Evolutionary Computation, Vol. 1, No. 1 (1993) 25–49
15. Park, Y., Song, M.: A genetic algorithm for clustering problems. Genetic Programming 1998: Proceedings of the 3rd Annual Conference, Morgan Kaufmann (1998) 568–575
16. Pei, M., Goodman, E.D., Punch, W.F.: Pattern Discovery from Data Using Genetic Algorithms. Proceedings of the 1st Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD-97), Feb. (1997)
17. Pei, M., Punch, W.F., Goodman, E.D.: Feature Extraction Using Genetic Algorithms. Proceedings of the International Symposium on Intelligent Data Engineering and Learning '98 (IDEAL'98), Hong Kong, Oct. (1998)
18. Punch, W.F., Pei, M., Chia-Shun, L., Goodman, E.D., Hovland, P., Enbody, R.: Further research on Feature Selection and Classification Using Genetic Algorithms. 5th International Conference on Genetic Algorithms, Champaign, IL (1993) 557–564
19. Siedlecki, W., Sklansky, J.: A note on genetic algorithms for large-scale feature selection. Pattern Recognition Letters, Vol. 10 (1989) 335–347
20. Skalak, D.B.: Using a Genetic Algorithm to Learn Prototypes for Case Retrieval and Classification. Proceedings of the AAAI-93 Case-Based Reasoning Workshop, Washington, D.C., American Association for Artificial Intelligence, Menlo Park, CA (1994) 64–69
Improved Image Halftoning Technique Using GAs with Concurrent Inter-block Evaluation

Emi Myodo, Hernán Aguirre, and Kiyoshi Tanaka

Dept. of Electrical and Electronic Engineering, Faculty of Engineering, Shinshu University, 4-17-1 Wakasato, Nagano 380-8553, Japan
{myodo,ahernan,ktanaka}@iplab.shinshu-u.ac.jp
Abstract. In this paper we propose a modified evaluation method to improve the performance of an image halftoning technique using GAs. We design the algorithm to avoid noise in the fitness function by evolving all image blocks concurrently, exploiting the inter-block correlation, and sharing information between neighbor image blocks. The effectiveness of the method when the population and image block sizes are reduced, and the configuration of selection and genetic operators, are investigated in detail. Simulation results show that the proposed method can remarkably reduce the entire processing time to generate high quality bi-level halftone images.
1 Introduction
Recently the application of evolutionary algorithms (EAs) to real-world problems has been rapidly increasing. Signal processing is one of the areas in which methods using EAs are steadily being developed [1]. In this work, we focus on the image halftoning problem using genetic algorithms (GAs). In the halftoning problem an N-gray-tone image must be portrayed as an n-gray-tone image, where n < N. GAs have been used for halftoning in two ways. One approach seeks to evolve (or co-evolve) filters, which are applied to the input N-gray-tone image to generate a halftone image [2,3,4]. The second approach searches directly for the optimum halftone image having as a reference the input N-gray-tone image. Our work fits in the latter approach. Kobayashi et al. [5] first proposed a direct-search GA based halftoning technique to generate bi-level halftone images. This technique divides the input images into non-overlapping blocks and uses a simple GA with a specialized two-dimensional crossover to search for the corresponding optimum binary patterns (high gray level precision and high spatial resolution). The method's major advantages are that (i) it can generate images with a specific desired combination of gray level precision and spatial resolution, and (ii) it generates bi-level halftone images with quality higher than traditional schemes such as ordered dithering, error diffusion, and so on [6]. However, the method uses a substantial amount of processing time and computer memory. Recently, Aguirre et al. [7] proposed an improved GA (GA-SRM) to overcome the drawbacks of [5]. GA-SRM applies varying mutation parallel to crossover and background mutation, putting

E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 2264–2276, 2003. © Springer-Verlag Berlin Heidelberg 2003
the operators in a cooperative-competitive stand with each other by subjecting their offspring to extinctive selection [8,9]. The halftoning technique using GA-SRM [7], compared to the conventional GA based technique [5], can achieve a 98% reduction in population size and a 70%–85% reduction in processing time while generating high quality bi-level halftone images. Additional related work can be found in [10], where the halftoning problem is treated as a multiobjective optimization problem to generate simultaneously halftone images with various combinations of gray level precision and spatial resolution. Also, in [11] the bi-level halftoning technique is extended to a multi-level halftoning technique using GAs. The gains in performance achieved by GA-SRM alone are substantial, yet the direct GA-SRM based halftoning method still requires much more processing time than traditional methods, such as ordered dithering and error diffusion, and enhanced efficiency is required. The GA based halftoning methods mentioned above evolve all image blocks independently from each other. A side effect of this is that the evaluation function becomes approximate for the pixels close to the boundaries between image blocks, which introduces false optima and delays the search. In this paper we further improve GA based halftoning with a modified evaluation method that contemplates the inter-block correlation between neighbor blocks, evolving all image blocks concurrently. The effectiveness of the method when the population and image block sizes are reduced, and the configuration of selection and genetic operators, are investigated in detail. Simulation results verify that the modified scheme further accelerates the search speed to generate high quality halftone images, reducing processing time to less than 1/100 of the time required by the conventional GA based halftoning method proposed in [5].
2 Image Halftoning Scheme Using GA

2.1 Individual Representation
An input image is first divided into non-overlapping blocks D consisting of r × r pixels to reduce the search space of solutions [5,7]. The GA uses an individual x with an r × r two-dimensional representation for the chromosome. In the case of bi-level halftoning each element of the chromosome x(i, j) ∈ {0, 1}. Figure 1 illustrates the division of the image into blocks and an example of an individual x corresponding to a current block D.

2.2 Evaluation
We evaluate chromosomes with two kinds of evaluation criteria: (i) high gray level precision (local mean gray levels close to the original image), and (ii) high spatial resolution (appropriate contrast near edges) [5]. The bi-level image halftoning technique calculates a gray level precision error by

E_m = (1/r²) Σ_{(i,j)∈D} |g(i, j) − ĝ(i, j)|    (1)
Fig. 1. Image division and individual representation (r = 8)
Fig. 2. (a) shows a current generated block x with its binary pattern copied around block boundaries for gray level estimation. (b) and (c) show examples of generated pixels and their reference region to calculate gray level estimation.
where g(i, j) (i, j = 0, 1, ..., r−1) is the gray level of the (i, j)-th pixel in the input image block, and ĝ(i, j) is the estimated gray level associated with the (i, j)-th pixel of the generated halftone block (x(i, j)). To obtain ĝ(i, j), a reference region around x(i, j) is convoluted with a Gaussian filter that models the correlation among pixels. In order to reduce discontinuity around block boundaries, the pixel pattern of x is copied around the boundary regions, as shown in Fig. 2a, and used to calculate the gray level estimation ĝ(i, j). Figure 2b and Fig. 2c illustrate two examples of x(i, j) and its reference region to calculate ĝ(i, j). In Fig. 2b the reference region lies within the generated block, while in Fig. 2c it exceeds the block boundaries and also includes copied pixels. In order to preserve the edge information of the input image well, the spatial resolution error in the bi-level image halftoning technique is calculated by

E_c = (1/r²) Σ_{(i,j)∈D} |G(i, j) − B(i, j)|    (2)

G(i, j) = g(i, j) − ḡ(i, j)
B(i, j) = (x(i, j) − 1/2) N

where G(i, j) is the difference between the gray level g(i, j) of the (i, j)-th pixel in the input image block and its neighboring local mean value ḡ(i, j).
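Ignoring the boundary-copying details, the two error terms can be sketched as below; the function names are ours, and the Gaussian-filter estimate ĝ is simply taken as a given input array rather than computed:

```python
import numpy as np

def mean_gray_error(g, g_hat, r):
    """Eq. (1): average absolute difference between input gray levels
    g(i,j) and the estimated gray levels of the generated halftone block."""
    return np.abs(g - g_hat).sum() / r**2

def contrast_error(g, g_bar, x, r, N=256):
    """Eq. (2): with G(i,j) = g - local mean and B(i,j) = (x - 1/2) * N."""
    G = g - g_bar
    B = (x - 0.5) * N
    return np.abs(G - B).sum() / r**2

# worked example: a flat 4x4 block whose estimate is off by 8 gray levels
Em = mean_gray_error(np.full((4, 4), 128.0), np.full((4, 4), 120.0), 4)  # 8.0
```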
The two errors E_m and E_c are combined into one single objective function as

E = α_m E_m + α_c E_c    (3)

where α_m and α_c are weighting parameters of E_m and E_c, respectively. The chromosome's fitness is assigned by

F = E_max − E    (4)
where E_max is the error associated with the worst chromosome in the population. The GA is used to search for the optimum compromise between (i) and (ii) with the above fitness function.

2.3 Genetic Operators and Selection
The improved GA-SRM for the halftoning problem [7] is based on a model of GA that puts genetic operators with complementary roles in a cooperative-competitive stand with each other [8,9]. The main features of the model are (i) two genetic operators to create offspring: Self-Reproduction with Mutation (SRM), which puts emphasis on mutation, and Crossover and Mutation (CM), which puts emphasis on recombination; (ii) an extinctive selection mechanism; and (iii) an adaptive mutation schedule that varies SRM's mutation rates from high to low values based on SRM's own contribution to the population. In [7], CM uses the same two-dimensional crossover as [5] and SRM is provided with an Adaptive Dynamic Block (ADB) mutation schedule, in which the size of the mutation block is dynamically adjusted. Extinctive selection is implemented with (µ, λ) proportional selection.
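The (µ, λ) proportional selection just mentioned can be sketched as follows; this simplification (our own, assuming positive fitness values) keeps only the best µ of the λ offspring and then samples parents from the survivors in proportion to fitness:

```python
import numpy as np

def mu_lambda_proportional(offspring, fitness, mu, rng=np.random.default_rng()):
    """Extinctive (mu, lambda) selection: only the best mu of the lambda
    offspring survive; lambda parents for the next generation are then
    drawn from the survivors with probability proportional to fitness."""
    order = np.argsort(fitness)[::-1][:mu]          # indices of the best mu
    survivors, f = offspring[order], fitness[order]
    p = f / f.sum()                                 # proportional probabilities
    idx = rng.choice(mu, size=len(offspring), p=p)
    return survivors[idx]
```

The "extinctive" part is the hard cut at µ: individuals outside the best µ get zero selection probability, regardless of fitness.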
3 Proposed Method

3.1 Problems
As indicated in Sec. 2.2, the GA based halftoning schemes in [5] and [7] evolve the image blocks independently from each other, copy the binary pattern of the current block around the boundary regions, and use that information to evaluate the quality of the generated block. Due to the expected high correlation between neighboring pixels in an image, the pixels copied around the boundaries of the generated block aim to reduce discontinuities between blocks. However, these pixels are not the true information of the generated neighbor blocks. Although mathematically the same fitness function is used for every pixel, from an information standpoint the conventional GA based halftoning method induces a kind of approximate fitness function [12] for the pixels close to the boundary regions, which introduces false optima. This can mislead the algorithm and greatly affect its search speed. Note that if the area of the image block is reduced, the number of pixels evaluated with the approximate function (e.g. Fig. 2c) will increase while
the number of pixels evaluated with the true fitness function (e.g. Fig. 2b) will decrease, negatively affecting the quality of the generated image and delaying processing time. The noise introduced by the approximate function is a real obstacle to further reduction of processing time.

3.2 Modified Evaluation Method
To have a fitness function that models the halftoning problem with higher fidelity, we make use of the inter-block correlation between neighbor blocks in the evaluation, linking each block with its neighbor blocks and sharing some genetic information between them. A GA is allocated to each block and each GA evolves its population of possible solutions concurrently. In this process the best individuals x*_{u,v}^{(t−1)} in the neighbor populations are referred to at each generation and used to calculate the fitness values for individuals x_{k,l}^{(t)} in the current population, as shown in Fig. 3. With this procedure of information sharing between populations we can supplement the incomplete information in the evaluation process of [5,7], expecting that it will contribute to reduce processing time, improve the image quality around block boundaries, and allow further reductions of block size. Parallel implementations can be realized with the required number of processing units, linking at most 8 neighbor units. In this work the parallel GA is simulated as concurrent processes on a serial machine.
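The information-sharing step can be sketched as: before evaluating generation t, each block's GA gathers the best individuals x* of its (up to 8) neighbor populations at generation t−1 to use around its boundary. The dictionary layout and names below are our assumptions:

```python
def neighbor_bests(best, k, l):
    """Collect the generation-(t-1) best individuals of the up-to-8
    neighbor blocks of block (k, l); `best` maps (k, l) -> individual."""
    out = {}
    for dk in (-1, 0, 1):
        for dl in (-1, 0, 1):
            if (dk, dl) == (0, 0):
                continue                  # skip the block itself
            key = (k + dk, l + dl)
            if key in best:               # image-border blocks have fewer neighbors
                out[(dk, dl)] = best[key]
    return out

bests = {(0, 0): "A", (0, 1): "B", (1, 0): "C", (1, 1): "D"}
print(len(neighbor_bests(bests, 0, 0)))  # a corner block has 3 neighbors
```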
Fig. 3. A current block and connected neighbor blocks for gray level estimation
4 Results and Discussion

4.1 Experimental Setup
In this paper, we apply the proposed method to a canonical GA (cGA) [5] and GA-SRM [7]. To test the algorithms we use SIDBA's benchmark images in our simulation. Unless stated otherwise, results presented here are for the image "Lenna". The size of the original image is 256 × 256 pixels with N = 256 gray levels and the generated images are bi-level halftone images (n = 2). The image block size is r × r = 16 × 16 and the population size is 200. The weighting
Improved Image Halftoning Technique Using GAs
Fig. 4. Performance by cGA(200) and GA-SRM(100,200) with proposed and conventional method

Table 1. Comparison of number of evaluations

  method        cGA(200)  GA-SRM(100,200)
  conventional  1.000T    0.510T
  proposed      0.695T    0.430T
parameters in Eq. (3) are set to αm = 0.2 and αc = 0.8. In the case of the canonical GA, the crossover probability is set to pc = 1.0 and the mutation probability to pm = 0.001, similar to [5]. In our simulation we take the result (error value of Eq. (3)) achieved by cGA with these settings after T = 40,000 evaluations as a reference value for image quality [5]. In the case of GA-SRM, for CM the crossover and mutation probabilities are set to pc = 1.0 and pm^(CM) = 0.001, respectively. For SRM, the ADB mutation schedule with bit-swapping mutation is used [7]. The balance for offspring creation is λCM : λSRM = 1 : 1, and the ratio between the number of parents and the number of offspring is µ : λ = 1 : 2.

4.2 Effects in Conventional Schemes
First, we show the effect of the proposed method in cGA and GA-SRM with large populations. Figure 4 plots the error reduction over the generations by cGA(200), which denotes a canonical GA with a population size of 200 individuals, and by GA-SRM(100,200), which denotes GA-SRM with a parent population of µ = 100 and an offspring population of λ = 200. From Fig. 4 we can see that both cGA and GA-SRM achieve higher image quality with the proposed method than with the conventional evaluation method. Table 1 shows the number of evaluations needed to reach the reference value for image quality: the proposed method reduces the number of evaluations by about 31% in cGA and 16% in GA-SRM. Note that GA-SRM even with the conventional evaluation method is faster than cGA with the proposed evaluation method.
E. Myodo, H. Aguirre, and K. Tanaka
Fig. 5. Performance by GA-SRM with proposed and conventional method in different population sizes

Table 2. Effect of population size reduction

  method        GA-SRM
                (100,200)  (50,100)  (25,50)  (4,8)
  conventional  0.510T     0.330T    0.211T   0.115T
  proposed      0.430T     0.290T    0.185T   0.094T

4.3 Effect of Population Size Reduction
Second, since memory is an important issue in this application, we study the effect of reducing the population size on performance. Figure 5 plots the error transition over the evaluations by GA-SRM with the conventional and proposed methods. From Fig. 5 we can see that GA-SRM with the proposed method and smaller population sizes accelerates the search without deteriorating the final image quality. Table 2 shows the number of evaluations needed to reach the reference value for image quality.

4.4 Effects of Block Size Reduction
Next, we study the effect of reducing the size of the image block, fixing the population size to (µ, λ) = (4, 8) in GA-SRM. Here, the mutation probability for CM is set to pm^(CM) = 1/(r × r) [13], because this mutation rate gives better performance in combination with extinctive selection. Figure 6 plots the error reduction over the evaluations for "Lenna", and Table 3 shows the number of evaluations needed to reach the image quality reference value for "Lenna" and other benchmark images. Note that with the proposed method we can further accelerate the search by reducing the block size to be evolved and still keep high image quality. For example, in the case of Lenna and r × r = 4 × 4, the proposed method needs only 240 evaluations to achieve the image quality reference value
Fig. 6. Performance by GA-SRM(4,8) with proposed and conventional method using different block sizes
(the same image quality obtained by cGA after 40,000 evaluations)†, which means less than 1/100 of the processing time compared with the original scheme [5]. Running software implementations of the algorithms on a Pentium IV processor (2 GHz), it takes about 7 seconds to generate one 256 × 256 pixel image that reaches the reference value for image quality. Processing time increases in proportion to the number of image pixels. From Table 3 note also that we consistently observed similar behavior for the other benchmark images. On the other hand, the conventional method considerably delays the search and cannot achieve visually satisfactory images for various test images. As mentioned in 3.2, in this paper the parallel implementation of the GA is simulated as concurrent processes on a serial machine. A true parallel implementation would benefit from a further reduction in processing time due to the distribution of work to several processors. Figure 7 shows the original image "Lenna" and several generated images by traditional methods, such as ordered dithering and error diffusion, and by GA-based halftoning with the conventional and proposed evaluation methods.

4.5 Inter-block Information Sharing Gap
The effect on image quality and processing time of using a generational gap to perform the information sharing was also examined [14]. From our experiments we found that a Gap of 10 generations substantially reduces the frequency with which the inter-block information must be shared and still achieves a very significant reduction in processing time (only 1% of the processing time of the original cGA method [5] is needed). For values of Gap greater than 10 the search is delayed, and no significant gain is obtained by further reducing the number of times the information is updated.

† Here we compare the average number of evaluations per pixel between the different schemes, which agrees well with the actual processing time.
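The normalized entries reported in the tables can be checked with a line of arithmetic (a sketch; T = 40,000 and the 240-evaluation figure are taken from the text, and the variable names are ours):

```python
# T is the evaluation budget of the reference cGA run [5]; the table entries
# report evaluations-to-reference as a fraction of T.
T = 40_000
evals_proposed = 240           # GA-SRM(4,8), 4x4 blocks, proposed method
fraction = evals_proposed / T
print(fraction)                # 0.006, i.e. the 0.006T entry in Table 3
print(fraction < 1 / 100)      # True: less than 1/100 of the reference effort
```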
(a) original image
(b) ordered dithering, Bayer matrix
(c) error diffusion, Jarvis matrix [6]
(d) cGA(200) (r × r = 16 × 16, 0.006T , αm : αc = 0.2 : 0.8)
(e) GA-SRM(4,8) (r × r = 16 × 16, 0.006T , αm : αc = 0.2 : 0.8)
(f) GA-SRM(4,8) with proposed method (r × r = 4 × 4, 0.006T , αm : αc = 0.2 : 0.8)
(g) GA-SRM(4,8) (r × r = 4 × 4, 1.000T , αm : αc = 0.2 : 0.8)
(h) GA-SRM(4,8) with proposed method (r × r = 4 × 4, 0.006T , αm : αc = 0.5 : 0.5)
(i) GA-SRM(4,8) with proposed method (r × r = 4 × 4, 0.006T , αm : αc = 0.9 : 0.1)
Fig. 7. Original “Lenna” and output images
Table 3. Effect of block size reduction for "Lenna" and other benchmark images, GA-SRM(4,8)

  image                method        16×16    8×8      4×4
  Lenna (256×256)      conventional  0.112T   0.054T   0.84T
                       proposed      0.090T   0.029T   0.006T
  Aerial (256×256)     conventional  0.129T   0.062T   —
                       proposed      0.095T   0.032T   0.006T
  Airplane (256×256)   conventional  0.105T   0.056T   —
                       proposed      0.088T   0.030T   0.005T
  Girl (256×256)       conventional  0.108T   0.053T   —
                       proposed      0.093T   0.027T   0.005T
  Moon (256×256)       conventional  0.133T   0.059T   0.073T
                       proposed      0.100T   0.031T   0.005T
  Title (256×256)      conventional  0.127T   0.050T   0.018T
                       proposed      0.117T   0.043T   0.010T
  Woman (256×256)      conventional  0.120T   0.058T   0.130T
                       proposed      0.092T   0.031T   0.006T

  — : never reached the reference fitness value achieved by cGA(200) with T = 40,000 evaluations
5 GA Configuration in Different Block Sizes
In 4.4 we showed that by using the proposed method the block size can be reduced from r × r = 16 × 16 to r × r = 4 × 4 while generating high quality bi-level halftone images. The block size determines the size of the search space. Smaller blocks imply an easier optimization problem, so simpler optimization methods could be used; however, smaller blocks would also require more hardware resources, especially if we want a true parallel implementation of the algorithm. On the other hand, bigger blocks increase the difficulty of finding the optimum, and enhanced optimization methods would be required. This trade-off is an important issue that should be considered during the implementation of this application. In this section we study the configuration of selection and genetic operators in the GA for different block sizes. Figure 8 shows results by several EAs, each set with an offspring population of 8. M indicates mutation only, with pm = 1/(r × r); CM is crossover (pc = 1.0) followed by background mutation (pm = 1/(r × r)); and SRM is the adaptive varying mutation operator. If the values of (µ, λ) are indicated, e.g. (4,8), the algorithm uses (µ, λ) proportional selection; otherwise, it uses proportional selection. From these figures we can see that GA-SRM (CM-SRM(4,8)) is the most reliable method across all block sizes for generating a high quality halftone image rapidly. But to ease the implementation for a given block size, other methods can be used. For example, when halftone images are generated with block size r × r = 4 × 4, the M(4,8) method can be used without disadvantage, because the performance of M(4,8) is as good as that of CM-SRM(4,8), as shown in Fig. 8a.
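The (µ, λ) extinctive selection with the 1 : 1 CM : SRM offspring split described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the operator bodies are placeholders, not the paper's exact CM and SRM, and individuals are toy bit lists:

```python
import random

def cm(parent, r, rng):
    # Crossover-and-mutation placeholder: flip bits with pm = 1/(r*r).
    pm = 1.0 / (r * r)
    return [b ^ (rng.random() < pm) for b in parent]

def srm(parent, rng):
    # Varying-mutation placeholder: swap two randomly chosen bits.
    child = parent[:]
    i, j = rng.randrange(len(child)), rng.randrange(len(child))
    child[i], child[j] = child[j], child[i]
    return child

def step(parents, fitness, mu=4, lam=8, r=4, rng=random):
    # Create lam offspring, half via CM and half via SRM (1 : 1 balance).
    offspring = []
    for k in range(lam):
        p = rng.choice(parents)
        offspring.append(cm(p, r, rng) if k < lam // 2 else srm(p, rng))
    # Extinctive (mu, lambda) selection: only the best mu offspring survive
    # (fitness is an error to be minimized, as in Eq. (3)).
    return sorted(offspring, key=fitness)[:mu]
```

One step with `mu=4, lam=8` mirrors the GA-SRM(4,8) setting; parents are always discarded, which is what makes the selection extinctive.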
(a) r × r = 4 × 4
(b) r × r = 8 × 8
(c) r × r = 16 × 16

Fig. 8. Configuration of selection and genetic operators in GA for different block sizes
6 Conclusions
In this paper we have presented a modified evaluation method to improve the performance of an image halftoning technique using GAs. The proposed algorithm avoids noise in the fitness function by evolving all image blocks concurrently, exploiting the inter-block correlation, and sharing information between neighbor image blocks. The effectiveness of the method when the population and image block size are reduced, and the configuration of selection and genetic operators, were investigated in detail. We verified that the proposed method can remarkably reduce the processing time, to less than 1/100 of the time required by the conventional GA-based halftoning method, without deteriorating the high image quality. The proposed method also generates images with less noise around block boundaries. As future work, we plan to extend this method to multi-level and color halftoning using multiobjective optimization techniques [10].
References

1. K. F. Man, K. S. Tang, S. Kwong and W. A. Halang, Genetic Algorithms for Control and Signal Processing, Springer-Verlag, 1997.
2. J. Newbern and V. M. Bove, Jr., "Generation of Blue Noise Arrays by Genetic Algorithm", Proc. SPIE Human Vision and Electronic Imaging II, vol.3016, pp.441–450, 1997.
3. J. T. Alander, T. Mantere, and T. Pyylampi, "Threshold Matrix Generation for Digital Halftoning by Genetic Algorithm Optimization", Proc. SPIE Intelligent Systems and Advanced Manufacturing: Intelligent Robots and Computer Vision XVII: Algorithms, Techniques, and Active Vision, vol.3522, pp.204–212, 1998.
4. J. T. Alander, T. Mantere, and T. Pyylampi, "Digital Halftoning Optimization via Genetic Algorithms for Ink Jet Machine", Developments in Computational Mechanics with High Performance Computing, CIVIL-COMP Press, Edinburgh, UK, pp.211–216, 1999.
5. N. Kobayashi and H. Saito, "Halftoning Technique Using Genetic Algorithm", Proc. IEEE ICASSP'94, vol.5, pp.105–108, 1994.
6. R. Ulichney, Digital Halftoning, MIT Press, Cambridge, 1987.
7. H. Aguirre, K. Tanaka and T. Sugimura, "Accelerated Halftoning Technique Using Improved Genetic Algorithm with Tiny Populations", Proc. IEEE Intl. Conf. on Systems, Man, and Cybernetics, pp.905–910, 1999.
8. H. Aguirre, K. Tanaka and T. Sugimura, "Cooperative Model for Genetic Operators to Improve GAs", Proc. IEEE Intl. Conf. on Information, Intelligence and Systems, pp.98–106, 1999.
9. H. Aguirre, K. Tanaka, T. Sugimura and S. Oshita, "Cooperative-Competitive Model for Genetic Operators: Contributions of Extinctive Selection and Parallel Genetic Operators", Proc. Late Breaking Papers at Genetic and Evolutionary Computation Conference 2000, pp.6–14, 2000.
10. H. Aguirre, K. Tanaka and T. Sugimura, "Halftone Image Generation with Improved Multiobjective Genetic Algorithm", Proc. First Intl. Conf. on Evolutionary Multi-Criterion Optimization, Lecture Notes in Computer Science, vol.1993, Springer, pp.501–515, 2001.
11. T. Umemura, H. Aguirre, K. Tanaka, "Multi-level Halftone Image Generation with Genetic Algorithms", Proc. WSEAS Multiconference on Applied and Theoretical Mathematics, pp.202–207, 2001.
12. Y. Jin, M. Olhofer, and B. Sendhoff, "A Framework for Evolutionary Optimization With Approximate Fitness Functions", IEEE Trans. Evolutionary Computation, vol.6, no.5, pp.481–494, 2002.
13. T. Bäck, "Optimal mutation rates in genetic search", Proc. 5th Int'l Conf. on Genetic Algorithms, Morgan Kaufmann, pp.2–8, 1993.
14. E. Myodo, H. Aguirre, K. Tanaka, "Improved Image Halftoning Scheme Using GAs with a Modified Evaluation Method", IEEE-EURASIP Int'l Workshop on Signal and Image Processing, to appear, 2003.
Complex Function Sets Improve Symbolic Discriminant Analysis of Microarray Data

David M. Reif^1, Bill C. White^1, Nancy Olsen^2, Thomas Aune^2, and Jason H. Moore^1

^1 Program in Human Genetics, Department of Molecular Physiology and Biophysics, Vanderbilt University, Nashville, TN, USA
{Reif,BWhite,Moore}@phg.mc.Vanderbilt.edu
^2 Program in Human Genetics, Department of Medicine, Vanderbilt University, Nashville, TN 37232-0700, USA
{Nancy.Olsen,Thomas.Aune}@Vanderbilt.edu
Abstract. Our ability to simultaneously measure the expression levels of thousands of genes in biological samples is providing important new opportunities for improving the diagnosis, prevention, and treatment of common diseases. However, new technologies such as DNA microarrays are generating new challenges for variable selection and statistical modeling. In response to these challenges, a genetic programming-based strategy called symbolic discriminant analysis (SDA) for the automatic selection of gene expression variables and mathematical functions for statistical modeling of clinical endpoints has been developed. The initial development and evaluation of SDA has focused on a function set consisting of only the four basic arithmetic operators. The goal of the present study is to evaluate whether adding more complex operators such as square root to the function set improves SDA modeling of microarray data. The results presented in this paper demonstrate that adding complex functions to the terminal set significantly improves SDA modeling by reducing model size and, in some cases, reducing classification error and runtime. We anticipate SDA will be an important new evolutionary computation tool to be added to the repertoire of methods for the analysis of microarray data.
1 Introduction

Biomedicine is at a critical inflection point in the relationship between the amount of data it is possible to collect and our understanding of that data. For the first time in history, we are undergoing an information explosion and an understanding implosion. That is, new technologies are making it possible to collect data at a much faster rate than we can understand it. For example, DNA microarray technology facilitates the simultaneous measurement of the expression levels of tens of thousands of genes in biological samples [1]. As a result of this explosion of genetic information, bioinformatics and computational biology are faced with two important challenges. First, what are the most appropriate statistical and computational modeling approaches for relating gene expression data with clinical endpoints? Second, what are the most appropriate computational search strategies for identifying combinations of gene expression variables from an effectively infinite search space?

E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 2277–2287, 2003. © Springer-Verlag Berlin Heidelberg 2003
D.M. Reif et al.
In response to these challenges, Moore et al. [2,3] have developed a machine learning strategy called symbolic discriminant analysis (SDA) for the automatic selection of gene expression variables and mathematical functions for statistical modeling of clinical endpoints. One advantage of this approach is that it uses the parallel search features of genetic programming [4], or GP, that are desirable for microarray data analysis [5]. Another important advantage is that no a priori assumptions are made about the functional form of the statistical model. This is important if the relationships between gene expression variables and clinical endpoints are complex and nonlinear, with interactions playing a more important role than the independent main effects of each gene. It is anticipated that such complex interactions among genes are common and play an important role in susceptibility to multifactorial diseases [6,7]. Indeed, applications of SDA to identifying patterns of gene expression associated with human leukemia and autoimmune diseases have revealed nonadditive relationships between gene expression variables that would not have been identified using a parametric statistical approach such as linear discriminant analysis [2,3,8]. The initial development of SDA has focused on a function set consisting of only four arithmetic functions. The goal of the present study is to evaluate whether adding complex functions such as square root to the terminal set improves SDA modeling. Adding complex functions will be considered an improvement if 1) they significantly decrease classification error of SDA models, 2) they significantly reduce the size of SDA models in terms of the overall number of nodes and the node depth, and/or 3) they significantly reduce the computational time required to complete a certain number of GP generations.
For this study, we evaluate the addition of complex functions using real microarray data from human autoimmune disease that has been previously analyzed using SDA with just arithmetic functions [3]. We begin in Section 2 with an overview of the SDA approach. In Section 3, we provide an overview of the microarray data. In Section 4, we describe the details of the GP. In Section 5, we describe the experimental design and statistical analysis. A summary and discussion of the results are presented in Sections 6 and 7 respectively. The results presented in this paper demonstrate that adding complex functions to the terminal set significantly reduces the size of SDA models and, in some cases, reduces the GP runtime and classification error.
2 An Overview of Symbolic Discriminant Analysis

2.1 Introduction to Symbolic Discriminant Analysis

An important limitation of parametric statistical approaches such as linear discriminant analysis and logistic regression is the need to pre-specify the functional form of the model. To address this limitation, Moore et al. [2,3] developed SDA for automatically identifying the optimal functional form and coefficients of discriminant functions that may be linear or nonlinear. This is accomplished by providing a list of mathematical functions and a list of explanatory variables that can be used to build discriminant scores. Similar to symbolic regression [4], GP is utilized to perform a
parallel search for a combination of functions and variables that optimally discriminates between two endpoint groups. GP permits the automatic discovery of symbolic discriminant functions that can take any form defined by the mathematical operators provided. GP builds symbolic discriminant functions as expression trees: each tree has a mathematical function at the root node and at every other internal node, while the terminals are gene expression variables and constants. The primary advantage of this approach is that the functional form of the statistical model does not need to be pre-specified. This is important for the identification of combinations of expressed genes whose relationship with the endpoint of interest may be non-additive or nonlinear [6,7]. In its first implementation, SDA used leave-one-out cross-validation (LOOCV) to estimate the classification and prediction error of SDA models [2]. With LOOCV, each subject is systematically left out of the SDA analysis as an independent data point (i.e. the testing set) used to assess the predictive accuracy of the SDA model. Thus, SDA is run on a subset of the data (i.e. the training set) comprised of n-1 subjects. The model that classifies subjects in the training set with minimum error is selected and then used to predict the group membership of the single independent testing subject. This is repeated for each of the possible training sets, yielding n SDA models. Moore et al. [2] selected LOOCV because it is an unbiased estimator of model error [9]. However, it should be noted that LOOCV may have a large variance due to the similarity of the training datasets [9,10] and the relatively small sample sizes that are common in microarray experiments. It is possible to reduce the variance using, perhaps, 5-fold or 10-fold cross-validation. However, these procedures may lead to biased estimates and may not be practical when the sample size is small [9].
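The LOOCV procedure just described can be sketched in a few lines. This is an illustrative skeleton, not the SDA implementation: `fit_model` and `predict` stand in for a full SDA run and its resulting model, and the toy data are made up:

```python
def loocv_error(samples, labels, fit_model, predict):
    # Leave each subject out once, train on the remaining n-1 subjects,
    # and test on the held-out subject; return the mean prediction error.
    errors = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit_model(train_x, train_y)   # best model on n-1 subjects
        errors += predict(model, samples[i]) != labels[i]
    return errors / len(samples)

# Toy usage: one expression feature per subject, two endpoint groups.
samples = [1, 2, 3, 10, 11, 12]
labels = [0, 0, 0, 1, 1, 1]
fit = lambda xs, ys: sum(xs) / len(xs)   # stand-in "model": a mean threshold
pred = lambda m, v: int(v > m)
print(loocv_error(samples, labels, fit, pred))  # 0.0
```

Running this loop once per cross-validation division is what produces the n models whose consistency the CVC statistic below summarizes.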
In this first implementation of SDA, models were selected that had low classification and prediction errors as evaluated using LOOCV. The end result of this initial implementation of SDA is an ensemble of models from separate cross validation divisions of the data. In fact, over k runs of n-fold cross validation, there are a maximum of kn possible models generated, assuming a 'best' model for each interval is identifiable. This is a common result when GP is used with cross-validation methods because of its stochastic elements. In response to this, Moore et al. [3] and Moore [8] developed a strategy for SDA modeling that involves evaluating the consistency with which each gene is identified across the LOOCV trials. This new statistic is referred to as cross validation consistency (CVC) and is similar to the approach taken by Ritchie et al. [11] for the identification of gene-gene interactions in epidemiological study designs. The idea is that genes that are important for discriminating between biological or clinical endpoint groups should be identified consistently, regardless of the LOOCV dataset. The number of times each gene is identified is counted, and this count is compared to the value that would be expected 5% of the time by chance were the null hypothesis of no association true. This empirical decision rule is established by permuting the data 1,000 or more times and repeating the SDA analysis on each permuted dataset as described above. In this manner, a list of statistically significant genes derived from SDA can be compiled. Once a list of statistically significant genes or variables is compiled, a symbolic discriminant function that maximally describes the entire dataset can then be derived.
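The counting step of CVC amounts to a frequency tally over the genes appearing in the LOOCV models. A minimal sketch (the gene names and the threshold of 3 are made up; in the paper the cutoff comes from 1,000 or more permutations):

```python
from collections import Counter

def cvc(models_genes, threshold):
    # Count how often each gene appears across the LOOCV models and keep
    # genes whose count exceeds the permutation-derived 5% threshold.
    counts = Counter(g for genes in models_genes for g in set(genes))
    return sorted(g for g, c in counts.items() if c > threshold)

# Toy run: 5 LOOCV models; suppose permutation testing set the cutoff at 3.
models = [["g1", "g7"], ["g1", "g2"], ["g1", "g7"], ["g1", "g3"], ["g1", "g7"]]
print(cvc(models, threshold=3))  # ['g1'] — only g1 appears in more than 3 models
```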
This is accomplished by rerunning SDA multiple times on the entire dataset using only the list of statistically significant candidate genes identified in the CVC analysis. A symbolic discriminant function that maximizes the distance between the distributions of symbolic discriminant scores of the two endpoint groups is selected. In this manner, a single 'best' symbolic discriminant function can be identified and used for prediction in independent datasets. This modeling process is designed to limit false-positive results and provide an objective way of dealing with different GP results in different cross validation divisions of the data. Moore [8] has suggested this approach may be useful for other GP-based modeling strategies.

2.2 Advantages of Symbolic Discriminant Analysis

As discussed by Moore et al. [3], there are two important advantages of SDA over traditional multivariate methods such as linear discriminant analysis [12-14]. First, SDA does not pre-specify the functional form of the model. For example, with linear discriminant analysis, the discriminant function must take the form of an additive linear equation, which limits the models to linear additive functions of the explanatory variables. With SDA, the basic mathematical building blocks are defined and then flexibly combined with explanatory variables to derive the best discriminant function. The second advantage of SDA is the automatic selection of variables from a list of thousands. Traditional model fitting involves stepwise procedures that enter a variable into the model and then keep it there if it has a statistically significant marginal or independent main effect [15]. Interaction terms are only evaluated for those variables that are already in the model. This deals with the combinatorial problem of selecting variables; however, variables whose effects act primarily through interactions with other variables will be missed.
This may be an unreasonable assumption for most complex biological systems. The SDA approach employs a parallel machine learning approach to selecting variables that permits interactions to be modeled in the absence of marginal effects. For example, both of the SDA models of human autoimmune disease identified by Moore et al. [3] involve multiplicative relationships among the gene expression variables (see Figure 1 below). No one variable contributes to the discriminant function independently of the others. For comparison, Moore et al. [3] analyzed the data using stepwise LDA [14]. Only one of the genes was identified by both SDA and LDA, and this was one of the genes that had a statistically significant main effect in a univariate analysis. Thus, SDA identified a set of genes that was not identified by LDA.

2.3 Disadvantages of Symbolic Discriminant Analysis

Although SDA has several important advantages over traditional multivariate statistical methods, there are several disadvantages [3]. First, there is no guarantee that GP will find the optimal solution. Heuristic searches tend to sacrifice finding an optimal solution in favor of tractability [16]. Implementing an evolutionary algorithm in parallel certainly improves the chances of finding an optimal solution [17], but it is not a certainty. This is due to the stochastic nature of GP. The initial populations of solutions are randomly generated and the recombination and mutation occurs at ran-
Fig. 1. Symbolic discriminant functions of gene expression variables in the form of expression trees and mathematical equations for discriminating rheumatoid arthritis from normal (A) and systemic lupus erythematosus from normal (B) [3].
dom positions in the binary expression trees. Further, there may be a stochastic component to how the highest fit individuals are selected. For these reasons, evolutionary algorithms should be run multiple times with multiple parallel populations. A second disadvantage of this approach is the computational requirement. LDA can be performed on a standard desktop workstation while SDA requires a parallel computer cluster for optimal performance. When the number of genes to be evaluated is large, the power of a parallel computer is required. Although such systems are fairly inexpensive to build, the time investment to establish and manage a parallel computing farm may be prohibitive to some. A third disadvantage is the complexity of the symbolic discriminant functions obtained. An attractive feature of LDA is the simplicity of the models, which facilitates interpretation. SDA has the potential to generate rather large models. The goal of the present study is to evaluate whether adding more complex functions to the terminal set overcomes some of the disadvantages of SDA described by Moore et al. [3]. Specifically, does SDA with complex functions find better models faster? Also, are the SDA models identified smaller?
3 An Overview of the Microarray Data Used

A detailed description of the data is provided by Maas et al. [20]. Briefly, gene expression was measured in peripheral blood mononuclear cells (PBMC) from normal individuals (n=12) and patients with rheumatoid arthritis (RA) (n=7) or systemic lupus erythematosus (SLE) (n=9) using cDNA microarrays. Normal individuals (i.e. controls) were examined before and after routine immunization with flu vaccine to allow comparison of normal immune response genes to those that are differentially regulated in autoimmune disease. Reproducibility of the experimental method was first established by performing four hybridizations to separate microarrays using the same RNA sample. Data were normalized against a common control, and linear regression analysis was used to estimate reproducibility. All hybridizations were highly reproducible, with R^2 values ranging from 0.87 to 0.99. In the present study, we used SDA to identify symbolic discriminant functions that differentiate RA from normal and SLE from normal.
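For a simple linear fit of one replicate against another, the R^2 of the regression equals the squared Pearson correlation, so the reproducibility check reduces to a few sums. A stdlib sketch (the expression values below are made up for illustration, not the study's data):

```python
def r_squared(x, y):
    # Squared Pearson correlation = R^2 of the least-squares line y ~ x.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

rep1 = [1.0, 2.1, 2.9, 4.2, 5.0]   # normalized expression, replicate 1
rep2 = [1.1, 2.0, 3.1, 4.0, 5.2]   # same RNA sample, second hybridization
print(round(r_squared(rep1, rep2), 3))  # 0.987
```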
4 Details of the Genetic Programming (GP) Strategy

We used genetic programming, or GP [4], in the present study to optimize the selection of gene expression variables and mathematical operators for building symbolic discriminant functions. We implemented the GP optimization of SDA using the lil-gp software package [18], modified to operate in parallel using the parallel virtual machine (PVM) message-passing library. The parallel GP was run on two processors of a 110-processor Beowulf-style parallel computer system running the Linux operating system. Each node has two Pentium III 600 MHz processors, 256 Mb RAM, a network card, and a 10 Gb hard drive. A total of two demes were used, each consisting of 100 individuals, for a total population size of 200. We allowed the GP to run a total of 100 iterations, with migration between the populations every 25 iterations. A recombination frequency of 0.9 was used along with a reproduction rate of 0.1. Table 1 summarizes the GP parameters.

Table 1. The GP parameter settings for the SDA analyses.

  Objective                 Identify optimal SDA models
  Fitness function          Classification error
  Number of runs            100 per dataset
  Stopping criteria         Classification error = 0
  Population size           200
  Number of demes           2
  Generations               100
  Selection                 Fitness proportionate
  Crossover probability     0.9
  Reproduction probability  0.1
5 Experimental Design and Data Analysis
SDA was run with and without complex functions in two study designs. In the first study design, we compared normal samples and rheumatoid arthritis (RA) samples. In the second study design, we compared normal samples and systemic lupus erythematosus (SLE) samples. For each of the two study designs, we ran SDA a total of 100 times with just the basic arithmetic operators in the terminal set (+, -, *, /). Next, we ran SDA a total of 100 times for each comparison with a set of complex functions in addition to the basic arithmetic operators in the terminal set. The complex function set included square, square root, logarithm, exponential, absolute value, sine, and cosine. The goal of the statistical analysis was to determine whether the addition of more complex functions to the terminal set 1) significantly decreases the classification error of SDA models as assessed by leave one out cross validation (LOOCV), 2) significantly reduces the size of SDA models in terms of the overall number of nodes and the node depth, and/or 3) significantly reduces the computational time required to complete a certain number of GP generations. Across the 100 runs in each study design, we specifically compared the average classification error as assessed by LOOCV, the average runtime of the GP, the average node count, and the average node depth. In addition, we compared the average number of gene expression variables, constants, and mathematical functions in the best SDA models. A one-tailed Student’s t-test was used to test the null hypothesis that complex functions do not decrease classification error, reduce SDA model size, or reduce computation time. When the assumptions of normality or equal variance were violated, a nonparametric Wilcoxon rank-sum test was carried out. All results were considered statistically significant at the 0.05 level.
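The comparison described above rests on a one-tailed pooled-variance Student's t statistic. A stdlib sketch (the node-count samples below are made up for illustration; the real analysis also checks normality and equal variance and falls back to the Wilcoxon rank-sum test when they are violated):

```python
import math

def pooled_t(a, b):
    # Two-sample Student's t with pooled variance:
    # t = (mean_a - mean_b) / sqrt(sp^2 * (1/na + 1/nb)).
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    sp2 = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)   # pooled variance
    return (ma - mb) / math.sqrt(sp2 * (1 / na + 1 / nb))

arithmetic_nodes = [40, 38, 36, 39, 37]   # node counts, arithmetic-only runs
complex_nodes = [18, 17, 16, 19, 17]      # node counts, with complex functions
print(pooled_t(arithmetic_nodes, complex_nodes) > 0)  # True: smaller models
```

A large positive t here supports the one-sided alternative that complex functions yield smaller models; the corresponding p-value would come from the t distribution with na + nb - 2 degrees of freedom.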
6 Results
Table 2 summarizes the average classification error, GP runtime, node count, node depth, number of gene expression variables, number of constants, and number of mathematical functions in SDA models with and without complex functions for the RA vs. normal comparison. Table 3 summarizes the same information for the SLE vs. normal comparison. For both comparisons, the majority of measures were significantly improved with the addition of complex functions to the terminal set. For the RA vs. normal comparison, only classification error was not improved. For the SLE vs. normal comparison, only GP runtime was not improved. The most striking difference is that the SDA models are significantly smaller when complex functions are used.
Table 2. Comparison of mean SDA performance measures for modeling with and without complex functions in the terminal set (RA vs. normal).

Performance Measure       Arithmetic Functions    Complex Functions    p-value
                          n      mean             n      mean
Classification error      100    <.01             100    <.01          .5656
GP runtime (sec.)         100    67.04            100    66.80         <.0001
Node count                100    37.79            100    17.31         <.0001
Node depth                100    6.71             100    7.64          <.0001
Number of variables       100    4.83             100    2.37          <.0001
Number of constants       100    17.21            100    4.65          <.0001
Number of functions       100    16.21            100    10.17         <.0001
Table 3. Comparison of mean SDA performance measures for modeling with and without complex functions in the terminal set (SLE vs. normal).

Performance Measure       Arithmetic Functions    Complex Functions    p-value
                          n      mean             n      mean
Classification error      100    .06              100    .03           <.0001
GP runtime (sec.)         100    73.05            100    73.09         .5372
Node count                100    40.46            100    24.51         <.0001
Node depth                100    6.95             100    9.72          <.0001
Number of variables       100    5.51             100    4.10          <.0001
Number of constants       100    18.21            100    6.28          <.0001
Number of functions       100    17.21            100    14.07         <.0001

7 Discussion
Symbolic discriminant analysis (SDA) was developed to deal with the challenges of selecting subsets of gene expression variables and features that facilitate the classification and prediction of biological and clinical endpoints [2,3,8]. Motivation for the development of SDA came from the limitations of traditional parametric approaches such as linear discriminant analysis and logistic regression. Application of SDA to high-dimensional microarray data from several human diseases has demonstrated that this approach is capable of identifying biologically relevant genes from among thousands of candidates. Further, because SDA makes no assumptions about the functional form of the model, it is capable of modeling complex nonlinear relationships between gene expression variables and clinical endpoints [2,3,8]. The initial studies using SDA used only the four basic arithmetic functions in the terminal set. The goal of the present study was to determine whether adding more complex functions such as logarithm and square root improves SDA modeling of microarray data. Adding complex functions was considered an improvement if 1) they significantly decrease the classification error of SDA models as assessed by leave one out cross validation, 2) they significantly reduce the size of SDA models in terms of the overall
number of nodes and the node depth, and/or 3) they significantly reduce the computational time required to complete a certain number of GP generations. The results presented in this paper demonstrate that adding complex functions to the terminal set significantly improves SDA modeling by reducing model size and, in some cases, reducing classification error and runtime. Thus, the primary conclusion of this study is that a richer set of functions should be included in the terminal set. The results of this study address several of the primary disadvantages of the SDA approach outlined in Section 2.3 and by Moore et al. [3]. First, the finding that adding complex functions reduces model size and runtime is important since a primary concern is that the search space is rugged and effectively infinite [5]. Further, there is no guarantee that GP will identify an optimal solution. Thus, anything that can be done to reduce the size of the search space without compromising the flexibility and power of the modeling approach is desirable. The availability of complex functions in the terminal set removes the portion of the search space that would otherwise have been needed to construct the same complex functions from the basic arithmetic operators. In both comparisons, the average number of nodes in the expression trees of the best SDA models was reduced by approximately 50%. The results of this study also address the disadvantage of increased computational time associated with SDA modeling. It is certainly true that SDA is more computationally intensive than methods such as LDA. In the RA versus normal comparison, the GP runtime was significantly less when complex functions were used. A time savings of even seconds for a single run can be very important when computational methods such as permutation testing are used. Indeed, Moore et al. [3] propose permutation testing as a strategy for carrying out formal hypothesis testing with the SDA approach.
This entails running SDA 1000 or more times on randomized datasets to create an empirical distribution of the test statistic under the null hypothesis of no association. Thus, a time savings of even one second per run would save 1000 seconds, or approximately 16.7 minutes; a time savings of one minute would save 16.7 hours. These time savings can be very important, especially if computational resources are limited or if many such runs need to be performed on many different datasets. The final disadvantage addressed by this study is model complexity. As with any GP-based modeling procedure, SDA models can be quite large and complicated. We demonstrated that adding complex functions significantly reduced model size in both comparisons. Smaller models, even with more complex functions, are easier to interpret because there are fewer variable interrelationships to consider. Indeed, in both comparisons, the average number of variables included in the optimal SDA models was reduced by approximately 25-50% with the addition of complex functions. Further, the average numbers of constants and functions were also significantly reduced. In the present study, we only carried out the first few steps of the SDA modeling process outlined in Section 2 and by Moore et al. [3] in order to compare the features of the models during cross validation. In future studies, we will carry out the full SDA analysis of this dataset and others using the more complex functions. This will allow us to compare the final model obtained with those found previously (see
Figure 1 for example). Additionally, it will be very important to assess the prediction error of models with complex functions. This will be possible when multiple, independently collected datasets are available from the same study. We anticipate that SDA models with complex functions will have a lower prediction error because the overall models are likely to be smaller and simpler. However, this is a working hypothesis that still needs to be tested. As suggested by Moore and Parker [5], GP-based modeling strategies are expected to have a large impact on the statistical and computational analysis of high-dimensional datasets derived from high-throughput technologies such as DNA microarrays. The inherent parallel or beam search strategy used by GP may be necessary for traversing effectively infinite rugged search spaces. Indeed, GP is finding its way into a number of bioinformatics and computational biology studies [19]. This study significantly improves SDA modeling through the addition of complex functions to the terminal set. We anticipate SDA will continue to be an important methodology for the analysis of microarray data.
Acknowledgements. This work was supported by generous funds from the Robert J. Kleberg, Jr. and Helen C. Kleberg Foundation. This work was also supported by the Arthritis Foundation and National Institutes of Health grants AR02027, AR41943, DK58749, HL68744, CA90949, CA95103, and CA98131.
References

1. Schena, M., Shalon, D., Davis, R.W., Brown, P.O.: Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270 (1995) 467–470
2. Moore, J.H., Parker, J.S., Hahn, L.W.: Symbolic discriminant analysis for mining gene expression patterns. In: De Raedt, L., Flach, P. (eds.): Lecture Notes in Artificial Intelligence 2167, pp. 372–381. Springer-Verlag, Berlin (2001)
3. Moore, J.H., Parker, J.S., Olsen, N., Aune, T.: Symbolic discriminant analysis of microarray data in autoimmune disease. Genetic Epidemiology 23 (2002) 57–69
4. Koza, J.R.: Genetic Programming: On the Programming of Computers by Means of Natural Selection. The MIT Press, Cambridge (1992)
5. Moore, J.H., Parker, J.S.: Evolutionary computation in microarray data analysis. In: Lin, S., Johnson, K. (eds.): Methods of Microarray Data Analysis. Kluwer Academic Publishers, Boston (2001)
6. Templeton, A.R.: Epistasis and complex traits. In: Wade, M., Brodie III, B., Wolf, J. (eds.): Epistasis and Evolutionary Process. Oxford University Press, New York (2000)
7. Moore, J.H., Williams, S.M.: New strategies for identifying gene-gene interactions in hypertension. Annals of Medicine 34 (2002) 88–95
8. Moore, J.H.: Cross validation consistency for the assessment of genetic programming results in microarray studies. In: Raidl, G. et al. (eds.): Lecture Notes in Computer Science 2611, in press. Springer-Verlag, Berlin (2003)
9. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York (2001)
10. Devroye, L., Gyorfi, L., Lugosi, G.: A Probabilistic Theory of Pattern Recognition. Springer-Verlag, New York (1996)
11. Ritchie, M.D., Hahn, L.W., Roodi, N., Bailey, L.R., Dupont, W.D., Plummer, W.D., Parl, F.F., Moore, J.H.: Multifactor dimensionality reduction reveals high-order interactions among estrogen metabolism genes in sporadic breast cancer. American Journal of Human Genetics 69 (2001) 138–147
12. Fisher, R.A.: The Use of Multiple Measurements in Taxonomic Problems. Ann. Eugen. 7 (1936) 179–188
13. Johnson, R.A., Wichern, D.W.: Applied Multivariate Statistical Analysis. Prentice Hall, Upper Saddle River (1998)
14. Huberty, C.J.: Applied Discriminant Analysis. John Wiley & Sons, Inc., New York (1994)
15. Neter, J., Wasserman, W., Kutner, M.H.: Applied Linear Statistical Models: Regression, Analysis of Variance, and Experimental Designs. 3rd edn. Irwin, Homewood (1990)
16. Langley, P.: Elements of Machine Learning. Morgan Kaufmann Publishers, Inc., San Francisco (1996)
17. Cantu-Paz, E.: Efficient and Accurate Parallel Genetic Algorithms. Kluwer Academic Publishers, Boston (2000)
18. http://garage.cps.msu.edu/software/software-index.html
19. Fogel, G.B., Corne, D.W.: Evolutionary Computation in Bioinformatics. Morgan Kaufmann Publishers, Inc., San Francisco (2003)
20. Maas, K., Chan, S., Parker, J., Slater, A., Moore, J.H., Olsen, N., Aune, T.M.: Cutting edge: molecular portrait of human autoimmunity. Journal of Immunology 169 (2002) 5–9
GA-Based Inference of Euler Angles for Single Particle Analysis

Shusuke Saeki¹, Kiyoshi Asai², Katsutoshi Takahashi², Yutaka Ueno², Katsunori Isono³, and Hitoshi Iba¹

¹ Graduate School of Frontier Science, The University of Tokyo, Hongo 7-3-1, Bunkyo-ku, Tokyo 113-8656, Japan
{saeki,iba}@miv.t.u-tokyo.ac.jp
² Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology, Aomi 2-41-6, Koutou-ku, Tokyo 135-0064, Japan
{asai-cbrc,takahashi-k,yutaka.ueno}@aist.go.jp
³ INTEC Web and Genome Informatics Corporation, 1-3-3 Shinsuna, Koto-ku, Tokyo 136-0075, Japan
[email protected]
Abstract. Single particle analysis is one of the methods for structural studies of proteins and macromolecules developed in image analysis on electron microscopy. Reconstructing a 3D structure from microscope images is not an easy analysis because of the low resolution of the images and the lack of directional information of the images in the 3D structure. To improve the resolution, different projections are aligned, classified and averaged. Inferring the orientations of these images is so difficult that the task of reconstructing 3D structures has depended upon the experience of researchers. Recently, however, a method to reconstruct 3D structures automatically was devised [6]. In this paper, we propose a new method for determining the Euler angles of projections by applying Genetic Algorithms (i.e., GAs). We empirically show that the proposed approach improves on the previous one in terms of computational time and acquired precision.
1 Introduction
E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 2288–2300, 2003. © Springer-Verlag Berlin Heidelberg 2003

Structural analysis of proteins is currently conducted using primarily NMR and X-ray methods. However, each of these has limitations: NMR is applicable only to relatively small proteins, and X-ray analysis is constrained by the difficulty and limitations of crystallization. Recently, a technique called single-particle analysis has been recognized as a viable method for analyzing three-dimensional protein structures that are difficult to analyze through other methods [2]. Using this technique, a protein is frozen for observation by electron microscopy, and the three-dimensional structure is then determined from images of the protein in various orientations. This technique does not require crystallization of the protein, and therefore very large proteins can be analyzed. However, the presence of noise in the raw micrographs causes the protein images to be barely recognizable, making resolution of the
three-dimensional structure problematic due to the extremely small size of the proteins and their fragility under the electron beam. In order to successfully obtain a protein image from noisy photomicrography, resolution must be enhanced by classifying multiple two-dimensional images according to orientation and then overlaying the images to determine the unknown three-dimensional structure [4]. This requires very complex image analysis techniques. Typically, a few tens of averaged images are obtained from 10,000 to 100,000 raw projection images, and from these, the three-dimensional structure is determined. This method has finally become practical due to improvements in micrographic techniques and the recent rapid progress in computational capabilities [9,8]. In this paper, we propose a new algorithm based on a GA to address an especially difficult image analysis problem involved in the single-particle technique, i.e. obtaining a clear three-dimensional structural projection from images of proteins at various orientations.
2 Euler Angle Determination Problem in Single-Particle Analysis

2.1 Common-Line Method
Reconstructing the three-dimensional structure requires determination of the projection angle of each image. By obtaining three-dimensional positional relationships between images, practical techniques used for CT scan analysis can be applied to obtain a density distribution of the three-dimensional structure, which then allows the density distribution to be displayed by assigning a certain threshold value. In the three-dimensional density function, the particle orientation with respect to the image plane is represented by the Euler angles. As shown in Fig. 1, the Euler angle is composed of three fundamental angles: α, β, and γ. The basis of the Euler angle representation of the projection angle of each image is the common-line method [11,3], which holds well because micrographic images are transparent projections. In the three-dimensional density function, the profile that is projected onto a straight line at a certain angle is common to two projection images, and can be obtained by projecting either projection image onto a straight line at a certain angle.
When there are three projection images, the projection angle of the three projection images is determined by the profile (common line) that matches the given angle. Even when the common line for two projected images is determined, the three-dimensional projection direction cannot be expressed by a single set of parameters, as one rotational degree of freedom remains (Fig. 2a). At least three
projection images are required in order to express the three-dimensional projection direction using a single set of parameters (Fig. 2b) [5].
Fig. 1. Euler Angles α, β, γ.

Fig. 2. Projections of a 3D Object.
2.2 Actual Common-Line Determination
Figure 3 illustrates the actual procedure used to determine the common line from sample projection images A and B. First, the sinogram is obtained: the profile of the image projected onto a straight line at a certain angle, calculated while varying that angle for each projection image (Fig. 3, left). Next the sinogram of projection image A (angle θa) is compared to the sinogram of projection image B (angle θb). The index of similarity is expressed as the root mean square deviation (RMSD). Representing the sinograms of projections A and B as snA(θa, x) and snB(θb, x), respectively, the RMSD value is expressed as:

RMSD(θa, θb) = √( ∫x {snA(θa, x) − snB(θb, x)}² dx )

The profile obtained by computing RMSD values while varying θa and θb, each from 0 to 360 degrees, is called the cross sinogram, as shown on the right in Fig. 3.

2.3 Inference of the Euler Angle
If the Euler angles are given, RMSD values can be calculated. Therefore, the inference of the Euler angles based on the common-line method boils down to the following optimization problem: one of the projection images is regarded as a reference image and its Euler angle is fixed at (0, 0, 0); determine the Euler angles of the remaining projection images so as to minimize the sum of RMSD values over all projection image pairs.
Fig. 3. Sinogram and Cross Sinogram.
Therefore, if N is the number of projection images, this optimization problem is a function of 3(N − 1) parameters. Judging from the cross sinogram of Fig. 3, the sum total of functions of RMSD values has a fairly complicated shape, which we believe would result in a huge number of local solutions, implying the need for a highly efficient search method.
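The pairwise quantity that this objective sums can be sketched as follows: a discretized RMSD between sinogram lines and the resulting cross sinogram of Fig. 3. The representation of a sinogram as a list of per-angle rows and the toy data are assumptions for illustration.

```python
# Minimal sketch of the discretized RMSD between two sinogram lines and the
# cross sinogram obtained by sweeping both angles (cf. Fig. 3). A sinogram is
# represented here as a list of rows, one row of projected densities per angle.
import math

def rmsd(row_a, row_b):
    """Discrete analogue of the RMSD integral between two sinogram lines."""
    assert len(row_a) == len(row_b)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(row_a, row_b)) / len(row_a))

def cross_sinogram(sn_a, sn_b):
    """RMSD(theta_a, theta_b) for every pair of line angles."""
    return [[rmsd(ra, rb) for rb in sn_b] for ra in sn_a]

# Toy 4-angle, 5-bin sinograms:
sn_a = [[0, 1, 3, 1, 0], [1, 2, 2, 1, 0], [0, 2, 3, 2, 0], [1, 1, 2, 2, 1]]
sn_b = [[0, 1, 3, 1, 0], [2, 2, 1, 0, 0], [0, 1, 2, 1, 0], [1, 2, 3, 2, 1]]
cs = cross_sinogram(sn_a, sn_b)
# Identical lines give zero RMSD, so a common line shows up as a minimum:
print(cs[0][0])  # rows sn_a[0] and sn_b[0] are identical -> 0.0
```

The full objective would map candidate Euler angles to common-line angles for each image pair and sum such RMSD values over all pairs.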
3 Search Method Using GA

3.1 Real-Coded GA
GA is a population-based search method that performs simulated evolution in a computer. The parameters of the optimization problem are coded as genes, and the solution is searched for by repeating the operations of selection, crossover and mutation. Binary or Gray-code representation is generally used for GA coding; however, for real-valued problems, the phase structures of the genotype and the phenotype are too dissimilar. Thus, in recent years, the real-coded GA, which directly uses real-valued vectors for coding, has been proposed. One example of the successful practical application of real-coded GA is a lens design system [7]. As the crossover method, Unimodal Normal Distribution Crossover (UNDX) is used in the current experiment. UNDX generates offspring around the line segment connecting two parents. It has been shown that UNDX can optimize benchmark functions with strong epistasis among parameters more efficiently than other crossover methods such as BLX-α. The Euler angles we are going to optimize are considered to be parameters with such a feature.

3.2 Sequential Additive Search Method
When applying the real-coded GA to optimize all Euler angles simultaneously, precise three-dimensional reconstruction was not successful. Because the search space was simply too large, the search was steered toward one of the innumerable local solutions.
Therefore, consideration was given to introducing heuristics suited to the problem. If a set of at least three images is provided, the optimized value among these images can be obtained. Thus, the following method of sequential additive optimization was considered.

1. Three images are optimized by real-coded GA. The Euler angles of the initial population are taken at random.
2. One image to be added is selected from among the remaining images.
3. The initial Euler angles of the original set of images are set to the best solution of the previous optimization, and the Euler angles of the additional image are generated as random values in the population.
4. Optimization is performed by real-coded GA.
5. Further optimization is performed by the steepest-descent method.
6. If an image to be added remains, return to step 2. If no images remain, finish.
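The six steps above can be sketched as the following loop, with the real-coded GA and steepest-descent optimizers stubbed out as hypothetical callables; in the paper these are a real-coded GA with UNDX crossover and a steepest descent on the summed pairwise RMSD.

```python
# Sketch of sequential additive search (steps 1-6 above). optimize_ga and
# optimize_sd are hypothetical stand-ins for the real optimizers; each maps a
# (subset of images, list of Euler angle triples) to improved angles.
import random

def sequential_additive_search(images, optimize_ga, optimize_sd):
    """Return Euler angles [(alpha, beta, gamma), ...] for all images.
    The first image is the reference, fixed at (0, 0, 0)."""
    def random_angles(n):
        return [(random.uniform(0, 360), random.uniform(0, 360),
                 random.uniform(0, 360)) for _ in range(n)]

    # Step 1: optimize the first three images from random initial angles.
    subset = list(images[:3])
    angles = [(0.0, 0.0, 0.0)] + random_angles(2)
    angles = optimize_sd(subset, optimize_ga(subset, angles))

    # Steps 2-6: add one image at a time, seeding with the previous best.
    for img in images[3:]:
        subset.append(img)
        angles = angles + random_angles(1)      # step 3: new image random
        angles = optimize_ga(subset, angles)    # step 4: real-coded GA
        angles = optimize_sd(subset, angles)    # step 5: steepest descent
    return angles

# Trivial stand-in optimizers so the sketch runs end to end:
identity = lambda subset, angles: angles
result = sequential_additive_search(list(range(20)), identity, identity)
print(len(result), result[0])
```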
Fig. 4. Flow of Optimization.
Figure 4 illustrates the flow of sequential additive search by GA. The best solution acquired by GA is further optimized by the steepest-descent method: even though the precision remains unchanged, the required computation time is reduced to approximately one-tenth by halting the GA at the proper generation and obtaining the best local solution by the steepest-descent method, rather than searching through lengthy generations using GA. Figure 5 shows the initialization details for this method. By including the best solution of the previous optimization as an initial individual, a better initial population can be generated than in the case where all parameters are initially random. Since the crossover method is UNDX [7], the best-solution portion of the previous optimization of an individual can be changed, except by mutation. By this method, we believe that an extremely large search space can be searched efficiently. A problem associated with this method is that the quality of the optimized solution depends on the order of the additional images. The order of image addition can produce approximately a 10% variation in RMSD value. Trying all possible orders is not realistic because the number of cases increases exponentially with the number of images. As a heuristic, a method that tries all possibilities of addition at each step while adding images one by one and then selects the one having the lowest RMSD value was considered. However, with this additional method, because a GA is used for optimization, some unevenness remains in the quality of the solution at certain levels for each trial. To stabilize the
Fig. 5. Initialization of GA Optimization. N: number of images for optimization; E(i): Euler angle set {α, β, γ} of the i-th image; E(1) is fixed at {0, 0, 0}. For N = 3, random values are set for E(2) and E(3); for N > 3, the best values of the previous optimization are used for E(2) through E(N−1), and random values are set for E(N).
quality of the solution, a number of solutions are retained as "candidates" for the order of image addition during the search (Fig. 6).

1. Randomly select two-image sets, as many as the number of candidates.
2. Add one of the remaining images to each set, and perform optimization and evaluation. Check all possible additions.
3. From the new image sets obtained by the addition, select as many as the number of candidates in order of evaluation, and designate these as the next image sets.
4. If an image to be added remains, return to step 2. If no images remain, finish.
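This candidate-retaining procedure is, in effect, a beam search over image orderings. A minimal sketch follows, with the GA optimization and total-RMSD evaluation left as hypothetical stand-ins.

```python
# Sketch of the candidate-retaining image addition above. optimize() and
# evaluate() are hypothetical stand-ins for GA optimization of a partial image
# set and for its summed pairwise RMSD.
import random

def beam_image_addition(images, k, optimize, evaluate):
    """Keep the k best partial image sets while adding images one by one."""
    candidates = [random.sample(images, 2) for _ in range(k)]  # step 1
    while len(candidates[0]) < len(images):
        expanded = []
        for cand in candidates:                # step 2: every possible addition
            for img in images:
                if img not in cand:
                    expanded.append(optimize(cand + [img]))
        expanded.sort(key=evaluate)            # step 3: rank, keep the k best
        candidates = expanded[:k]
    return candidates[0]                       # step 4: best complete ordering

# Trivial stand-ins so the sketch runs:
order = beam_image_addition(list(range(5)), k=2,
                            optimize=lambda s: s, evaluate=lambda s: 0.0)
print(sorted(order))  # all five images appear exactly once
```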
Fig. 6. Flow of Image Addition.

4 Experimental Results with the Three-Dimensional Structures Reconstructed from Projection Images

4.1 Experimental Projection Images
In order to evaluate the method by GA, an experiment was performed using a protein having a known three-dimensional structure. The protein used for
this experiment was myosin, which generates power in muscle tissue. Figure 7 shows the three-dimensional structure of myosin and twenty images acquired by projection from different directions. Each pixel of the image takes a real value from 0.0 to 256.0. The objective was to determine the Euler angles of these 20 projection images.
Fig. 7. Projection of Myosin.
4.2 Validity of Sequential Additive Search
Reconstruction of the three-dimensional structure was performed using the GA-based method described earlier. The image size is 65 x 65 pixels. In order to investigate the validity of sequential additive search by GA, a comparison with global search by GA (i.e., a real-coded GA holding a vector of 57 variables from the start) was performed. The parameters for global search by GA are listed in Table 1. The relationship between the number of generations and the RMSD values in global search by GA is shown in Fig. 8. The best search, the worst search, and the average over ten trials are displayed. The final result was further optimized by the steepest-descent method. RMSD values saturated around 1.0 after a certain number of generations.
Table 1. The Parameters for Global Search.

Population    100
Crossover     0.8
Mutation      0.00175
Generation    15000
Table 2. The Parameters for Sequential Additive Search (N: number of images).

Population    50
Crossover     0.8
Mutation      0.3 / {3(N − 1)}
Generation    25N
Repeat        5
The parameters for sequential additive search by GA are listed in Table 2. The number of mutations was selected to be proportional to the gene length (i.e., the number of parameters). In addition, the number of generations was selected so as
to be proportional to the gene length. The search process was repeated five times, and the best solution among the obtained solutions was selected, so that scattering of the solutions was minimized. We found that the solution quality tended to be better when performing five trials (resulting in five times the population) than when making the number of individuals in the population five times larger, although the computational cost was approximately the same. Because the population approaches uniformity with the evolution of individuals, we believe that more varied evolved individuals can ultimately be obtained if the number of populations is larger. A result of sequential additive search is shown in Figs. 9, 10, and 11. The search by GA was tried 10 times for each number of candidates, in order to investigate the stability of the solution. Figure 9 plots the best, average and worst values against the number of candidates. We found that the average RMSD value decreases as the number of candidates increases. Moreover, the best RMSD value also decreases slightly with the increase in the number of candidates. The variance of the RMSD values against the number of candidates is shown in Fig. 11. The variance of the solution tends to decrease exponentially with the number of candidates.
Fig. 8. RMSD Values vs. Generations.

Fig. 9. RMSD Values vs. Number of Candidates.

Fig. 10. Best of RMSD Values.

Fig. 11. Variance vs. Number of Candidates.
Figure 12 shows the reconstructed three-dimensional structure obtained using the derived Euler angles. Figure 12(a) is a solid reconstruction based on the projection directions used when the projection images were generated, so the purpose of this experiment was to reconstruct this structure. Figure 12(b) is the result of sequential additive search by GA with ten candidate solutions. Figure 12(c) is the result of global search by GA. Figure 12(b) is obviously closer to the target structure of Fig. 12(a) than Fig. 12(c). Whereas the result of global search has an RMSD value of 0.92, the result of sequential additive search has an RMSD value of 0.63. It can be said that sequential additive search is superior to global search for this problem.
Fig. 12. Reconstructed 3D Structures.
4.3 Effect of Noise and Blur
In order to investigate the applicability to real micrographic images, noise and blur were added and their effects were examined. Gaussian noise was applied, and blur was added using a Gaussian filter. Gaussian noise is a model of the real noise in raw micrographic images, whereas the Gaussian filter can simulate blur phenomena caused by the clustering operations. Figure 13 shows the images with noise and blur. The level in Fig. 13 is the mean of the Gaussian noise; half of that value was used as the standard deviation. The noise intensity follows a Gaussian distribution. SN in Fig. 13 is the parameter of the Gaussian filter representing the standard deviation of the angle by which pixels are blurred on a circle centered on the image. When SN is zero, no filtering is applied. The Gaussian noise was added first, followed by the Gaussian filter. Reconstruction was performed by sequential additive search by GA with 50 candidates. The parameters for the GA are the same as in Table 2. The image size used in this experiment is 96 x 96 pixels, in order to show that the proposed method is also applicable to images of other resolutions. Figure 15 shows the three-dimensional structures reconstructed using the derived Euler angles. The arrangement of the three-dimensional structures corresponds to that of the projection images in Fig. 13. Figure 14 is a solid reconstruction based
Fig. 13. Projections with Gaussian Noise and Gaussian Filter
on the projection direction when the projection image was generated, so the purpose of this experiment was to reconstruct this structure. When no noise was added, the reconstruction was almost perfect. We observed a tendency for the reconstructed three-dimensional structure to be smoothed when the Gaussian filter was applied. However, even when the Gaussian noise level was set to ten, we were still able to clearly discern the tail structure of myosin when using the Gaussian filter. We believe that the influence of the Gaussian noise was reduced by the blurring. It could be claimed that a certain degree of degradation of the images is a valid sacrifice for this noise reduction when the global three-dimensional structure is of interest.
Fig. 14. Target 3D Structure.
Fig. 15. Reconstructed 3D Structures.
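The degradation model (Gaussian noise first, then Gaussian smoothing) can be sketched as follows. For brevity this applies an ordinary row-wise 1-D Gaussian rather than the paper's angular blur along circles about the image centre, and the toy image is hypothetical.

```python
# Sketch of the image degradation used in the experiment: additive Gaussian
# noise (mean = level, sd = level/2, cf. Fig. 13) followed by Gaussian
# smoothing. A simple separable row-wise blur stands in for the angular filter.
import math
import random

def add_gaussian_noise(img, mean, sigma):
    return [[p + random.gauss(mean, sigma) for p in row] for row in img]

def gaussian_blur_rows(img, sigma, radius=2):
    kernel = [math.exp(-(i * i) / (2 * sigma * sigma))
              for i in range(-radius, radius + 1)]
    total = sum(kernel)
    kernel = [k / total for k in kernel]     # normalize so flat areas are kept
    out = []
    for row in img:
        n = len(row)
        out.append([sum(kernel[j + radius] * row[min(max(i + j, 0), n - 1)]
                        for j in range(-radius, radius + 1)) for i in range(n)])
    return out

# Toy 8 x 8 image; "level 10" noise means mean 10, standard deviation 5:
img = [[float((x + y) % 7) for x in range(8)] for y in range(8)]
noisy = add_gaussian_noise(img, mean=10.0, sigma=5.0)
blurred = gaussian_blur_rows(noisy, sigma=1.0)
```

The noise-then-blur order matches the experiment, which is why the blur partially suppresses the added noise.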
Table 3. Comparison with Other Methods.
Method                RMSD for inference    RMSD with target 3D structure    Time (hours)
(a) sequential GA     0.03136               5.11                             30
(b) spot detection    0.03160               5.12                             110
(c) global GA         0.04358               11.35                            30

4.4 Comparison with the Previously Proposed Method
A method called the spot detection method was previously proposed. This method is based on a sequential additive search identical to that of the present method, except that the solution candidates are searched by detecting comparatively large areas (spots) of mutual correlation in the cross sinograms [12,10]. For added images, the first two Euler angle components are obtained from the spot information of the cross sinogram between the standard image and the added image, and the third component is estimated from the overall relationship. Table 3 lists the experimental RMSD values and the required computation times for this method along with those of the other methods. The microprocessor used for the present experiments was an Athlon XP1800+. Table 3(a) shows the case using fifty candidates in the present method. Table 3(b) is, as far as we know, the only method proposed in the past to solve this problem. Table 3(c) shows the results of searching a wide space via GA. Projection images with noise (level 10, SN 10) were used because actual microscopic images contain a great deal of noise. The RMSD values between the target 3D structure and those acquired by each method are compared to assess the solution precision. The proposed method (a) is superior with respect to both computation time and solution precision, as compared to the spot detection method (b). Compared to these two methods, the wide-space search via GA (c) showed greatly reduced precision.
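The RMSD comparison above can be illustrated with a small sketch. The paper does not give its exact RMSD formulation, so the following assumes (hypothetically) that the structures are voxel density maps of equal shape; the function name and representation are our assumptions.

```python
import numpy as np

def rmsd(volume_a, volume_b):
    """Root-mean-square deviation between two 3D density maps.

    Both maps are assumed to be voxel arrays of the same shape; this is
    a sketch, not the paper's exact formulation.
    """
    a = np.asarray(volume_a, dtype=float)
    b = np.asarray(volume_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("volumes must have the same shape")
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Toy example: two 4x4x4 maps differing by a constant offset of 0.5
target = np.zeros((4, 4, 4))
reconstructed = target + 0.5
print(rmsd(target, reconstructed))  # 0.5
```

A lower RMSD against the target structure corresponds to the higher solution precision reported in Table 3.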
5 Discussion
To date, the spot detection method has been the only single particle analysis method for inferring the Euler angles solely from the projection images, i.e., without using the symmetry information of 3D structures. The proposed GA-based method gave better performance on noisy projection images, in terms of both computation time and the precision achieved. One reason seems to be that our approach is relatively independent of the reference images, so that it can explore the search space more widely. On the other hand, in the case of noise-free projection images, the spot detection method may be superior, especially at higher resolutions, because the spot information can then provide a more accurate common line. When applying the method to real data, we inevitably have to cope with noisy projections, which makes our method the more practical and effective choice.
Our GA-based method is a stochastic search, so different solutions are acquired from one run to another. However, the RMSD values obtained by our method were much smaller, i.e., better, for all the runs. In general, many parameters (e.g., see Table 2) have to be tuned effectively for an efficient GA-based search. We have tried two resolutions, 65 x 65 and 96 x 96 pixels, for the projected images. As a result of several comparative experiments, we confirmed that only the number of candidates needs to be changed to give satisfactory performance in both cases. In other words, most of the other GA parameters need no further tuning to cope with different experimental set-ups. Our future research concerns reconstructing three-dimensional structures from projection images acquired by clustering operations. Clustering and averaging algorithms for robust image processing have been studied [1]. We are currently working on the application to clustered and averaged projection images from a simulation, and we plan to use actual micrographic images.
6 Conclusion
We found that three-dimensional structures could be restored from projection images by the method proposed in the present paper. In addition, noise was added to the projection images, and its effect on the reconstructed solid body was investigated. In this processing, projection images judged to be oriented in the same direction are clustered and then overlaid; this reduces noise at the cost of a certain loss of image clarity. The experimental results suggest that this type of processing is viable for discerning large target structures. The present method was found to be superior to previous methods when noise is included in the images. Because actual micrographic images contain a great deal of noise, the present method is considered to be more practicable than previous methods. Introducing heuristics that reflect the characteristics of the problem narrows the space searched via GA. Even without such heuristics, however, the present method can still search the space effectively for difficult problems, and is thus quite useful. In the experiments described above, we restored a three-dimensional structure directly from the projection images of a simulation with added noise. We must next investigate the reconstruction of three-dimensional structures from projection images acquired by clustering operations. Ultimately, we intend to apply the present method to actual micrographic images.
References

1. K. Asai, Y. Ueno, C. Sato, K. Takahashi: "Clustering and Averaging of Images in Single-Particle Analysis", Genome Informatics 11, 151–160, 2000.
2. J. Frank: "Three-dimensional Electron Microscopy of Macromolecular Assemblies", Academic Press, London, 1996.
3. G. Harauz: "Exact filters for general geometry three dimensional reconstruction", Optik 73, 146–156, 1986.
4. van Heel, et al.: "Multivariate statistical classification of noisy images", Ultramicroscopy 13, 165–183, 1984.
5. van Heel, et al.: "Angular reconstruction: a posteriori assignment of projection directions for 3D reconstruction", Ultramicroscopy 38, 241–251, 1987.
6. van Heel, et al.: "A new generation of the IMAGIC image processing system", J. Struct. Biol. 116, 17–24, 1996.
7. I. Ono, S. Kobayashi: "A Real-coded Genetic Algorithm for Function Optimization Using Unimodal Normal Distribution Crossover", Proc. of 7th Int. Conf. on Genetic Algorithms, 246–253, 1997.
8. S. Saeki, K. Asai, K. Takahashi, Y. Ueno, K. Isono, H. Iba: "Inference of Euler Angles for Single Particle Analysis by Using Genetic Algorithms", Genome Informatics 12, 151–160, 2002.
9. C. Sato, Y. Ueno, K. Asai, K. Takahashi, M. Sato, A. Engel, Y. Fujiyoshi: "The voltage-sensitive sodium channel is a bell-shaped molecule with several cavities", Nature, Vol. 409, No. 6823, pp. 1047–1051, 2001.
10. K. Takahashi, Y. Ueno, K. Asai: "Euler Angle Decision Method between Transfer Images of a Protein Molecule for Single Particle Analysis", The Biophysical Society of Japan, 39th Annual Meeting, 1P041, 2001. (in Japanese)
11. Penczek, et al.: "Three-dimensional reconstruction of single particles embedded in ice", Ultramicroscopy 40, 33–53, 1992.
12. Y. Ueno, K. Takahashi, K. Asai, C. Sato: "BESPA: software tool for Three-Dimensional Structure Reconstruction from Single Particle Images of Proteins", Genome Informatics 10, 241–242, 1999.
Mining Comprehensible Clustering Rules with an Evolutionary Algorithm

Ioannis Sarafis(1), Phil Trinder(1), and Ali Zalzala(2,3)

(1) School of Mathematical and Computer Sciences, Heriot-Watt University, Riccarton Campus, Edinburgh, EH14 4AS, Scotland, United Kingdom
{I.Sarafis,P.W.Trinder}@hw.ac.uk
(2) School of Engineering & Physical Sciences, Heriot-Watt University, Riccarton Campus, Edinburgh, EH14 4AS, Scotland, United Kingdom
(3) School of Engineering, American University of Sharjah, P.O. 26666, Sharjah, UAE
[email protected]
Abstract. In this paper, we present a novel evolutionary algorithm, called NOCEA, which is suitable for Data Mining (DM) clustering applications. NOCEA evolves individuals that consist of a variable number of non-overlapping clustering rules, where each rule includes d intervals, one for each feature. The encoding scheme is non-binary, as the values for the boundaries of the intervals are drawn from discrete domains, which reflect the automatic quantization of the feature space. NOCEA uses a simple fitness function, which is radically different from any distance-based criterion function suggested so far. A density-based merging operator combines adjacent rules, forming the genuine clusters in the data. NOCEA has been evaluated on challenging datasets and we present results showing that it meets many of the requirements for DM clustering, such as the ability to discover clusters of different shapes, sizes, and densities. Moreover, NOCEA is independent of the order of input data and insensitive to the presence of outliers and to the initialization phase. Finally, the discovered knowledge is presented as a set of non-overlapping clustering rules, contributing to the interpretability of the results.
1 Introduction
Clustering is a common data analysis task that aims to partition a collection of objects into homogeneous groups, called clusters [9]. Objects assigned to the same cluster exhibit high similarity among themselves and are dissimilar to objects belonging to other clusters. The challenging field of DM clustering has led to the emergence of various clustering algorithms [7]. Evolutionary Algorithms (EAs) are optimization techniques inspired by the biological evolution of species [4], [11]. NOCEA (Non-Overlapping Clustering with an Evolutionary Algorithm) employs the powerful search mechanism of EAs to meet some of the requirements for DM clustering, such as discovery of clusters with different shapes, sizes, and densities, independence of the order of input data, insensitivity to the presence of outliers and to the initialization phase, and interpretability of the results [7].

E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 2301–2312, 2003.
© Springer-Verlag Berlin Heidelberg 2003
2 Related Work
There are four basic types of clustering algorithms: partitioning algorithms, hierarchical algorithms, density-based algorithms and grid-based algorithms. Partitioning algorithms construct a partition of N objects into a set of k clusters [9]. Hierarchical algorithms create a hierarchical decomposition of the database that can be presented as a dendrogram [13]. Density-based algorithms search for regions in the data space that are denser than a threshold and form clusters from these dense regions [8]. Grid-based algorithms quantize the search space into a finite number of cells and then operate on the quantized space [1]. EAs have been proposed for clustering because they avoid local optima and are insensitive to the initialization [3], [6], [12].
3 NOCEA Clustering Algorithm

3.1 Bin Quantization
Let A = {A1, ..., Ad} be a set of bounded domains and S = A1 × ... × Ad a d-dimensional numerical space. The input consists of a set of d-dimensional patterns P = {p1, ..., pk}, where each pi is a vector containing d numerical values, pi = [α1, ..., αd]. The jth component of vector pi is drawn from domain Aj. NOCEA's clustering mechanism is based on a statistical decomposition of the feature space into a multi-dimensional grid. Each dimension is divided into a finite number of intervals, called bins. The number of bins mj and the bin width hj for the jth dimension are dynamically computed by projecting the patterns onto this dimension and then applying the statistical analysis described below. Initially, each dimension is divided into four segments, namely A, B, C and D, representing the intervals [a, Q1), [Q1, median), [median, Q3) and [Q3, b], respectively. Note that a and b are the left and right bounds of Aj, while median, Q1 and Q3 are the median, first and third quartiles of the patterns in the jth dimension, respectively. The bin width h for each segment is then computed using the following formula [2]:

    h = 3.729 ∗ σ ∗ n^(−1/3)    (1)
where n is the number of patterns inside the segment and σ denotes the standard deviation of the patterns lying in it. The number of bins for each segment is derived by dividing the length of the segment by the corresponding bin width h. The total number of bins mj for the jth dimension is computed by summing the bins of the four segments. Finally, the bin width hj for the entire jth dimension is obtained by dividing the length of the Aj domain by mj. Of course, it would be more realistic if each segment kept its own bin width, leading to non-uniform grids. Despite the robustness of non-uniform grids, we adopt the uniform approach because of the extra complexity introduced by the former and by the way that infeasible solutions are repaired (see section 3.6).
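As a sketch, the per-dimension quantization described above can be written as follows. The function name, the handling of empty or degenerate segments, and the ceiling-rounding of per-segment bin counts are our assumptions; the paper does not specify them.

```python
import math
import statistics

def quantize_dimension(values, lo, hi):
    """Sketch of NOCEA's statistical bin quantization for one dimension.

    Splits [lo, hi] into four segments at Q1, median and Q3, applies
    Eq. (1), h = 3.729 * sigma * n**(-1/3), within each segment, and
    returns the total bin count m_j together with the uniform bin
    width h_j = (hi - lo) / m_j used for the whole dimension.
    """
    q1, med, q3 = statistics.quantiles(values, n=4)
    edges = [lo, q1, med, q3, hi]
    total_bins = 0
    for left, right in zip(edges, edges[1:]):
        # Segment membership (right edge treated as exclusive for simplicity).
        seg = [v for v in values if left <= v < right] or [left]
        n = len(seg)
        sigma = statistics.pstdev(seg)
        if n < 2 or sigma == 0:
            total_bins += 1          # degenerate segment: a single bin
            continue
        h = 3.729 * sigma * n ** (-1.0 / 3.0)   # Eq. (1)
        total_bins += max(1, math.ceil((right - left) / h))
    return total_bins, (hi - lo) / total_bins
```

For example, `quantize_dimension(list(range(100)), 0, 99)` yields roughly a dozen bins of equal width spanning the whole domain.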
3.2 Individual Encoding
The proposed encoding scheme is a non-binary, rule-based representation containing a variable number of non-overlapping rules. Each rule comprises d genes, where each gene corresponds to an interval involving one feature. Each ith gene (i = 1, ..., d) of a rule is subdivided into two fields: a left boundary (lbi) and a right boundary (rbi), where lbi and rbi denote the lower and upper values of the ith feature in this rule. The boundaries are drawn from discrete domains, reflecting the grid-based decomposition of the feature space. Note that two rules are non-overlapping if there exists at least one dimension where there is no intersection between the corresponding genes.

3.3 Fitness Function
In our clustering context, the fitness function is greedy with respect to the number of patterns that are covered by the rules of the individuals. In particular, NOCEA aims to maximize the coverage C, which is defined as the fraction of the total patterns Ntotal that are covered by the rules of an individual:

    C = max( (Σ_{i=1}^{k} Ni) / Ntotal )    (2)

where k denotes the number of rules and Ni the number of patterns in the ith rule. This fitness function is suitable for comparing individuals that have different numbers of rules. Two individuals can have exactly the same performance with radically different genetic material (e.g. number, shape and size of rules). Thus, the fitness landscape may have multiple optima, and as long as there is no bias towards a certain type of solution (e.g. solutions with a particular number of rules, size or shape), equation 2 contributes to the diversity among the population members. Another important characteristic of this fitness function is that its theoretical lowest and highest values are always known: the fitness of an individual lies between zero and one. Theoretically, as the level of noise increases, the distance between the performance of the best individual(s) and one increases.

3.4 Recombination Operator
In a typical EA the recombination operator is used to exploit known solutions by exchanging genetic material between good individuals. Taking into consideration the requirement for evolving individuals without overlapping rules, elaborate one- and two-point crossover operators have been developed. The crossover operator is applied to two parents, generating two offspring, of which only one can survive. Firstly, a number j is randomly drawn from the domain [1..d]. Recall that d denotes the dimensionality of the feature space. If mj is the number of bins corresponding to the jth dimension, the crossover points cp1 and cp2 are determined by randomly picking one bin each from the interval [0, (mj − 1)]. Two offspring are generated by exchanging the rules lying in the region from cp1 to cp2 between the parents. An example of two-point crossover is depicted in Fig. 1. Note that if the number of crossover points is one, the above crossover operation reduces to one-point recombination.

Fig. 1. An example of two-point crossover operation

The proposed crossover operator can lead to the generation of offspring that are not of the same length as their parents. This is because the offspring must contain no overlapping rules, which in turn requires the splitting of any rule that intersects with the d-dimensional region between cp1 and cp2. Clearly, for very low dimensional datasets and high recombination rates, two- or even one-point crossover can be disruptive in terms of breaking a relatively large rule into a number of smaller rules. Obviously, the greater the number of rules, the greater the computational complexity as far as the application of the genetic operators is concerned. Moreover, for the sake of simplicity in describing the clusters, fewer rules are always preferable. On the other hand, the length-changing characteristic of the crossover operator can contribute to increased diversity among the population members, which can be useful especially in cases with arbitrarily-shaped clusters. To cope with the disruptive effect of the crossover operator, we apply the following heuristic in the reported experiments: instead of arbitrarily selecting the crossover points in the jth dimension from the interval [0, (mj − 1)], these points are fixed and correspond to the bins containing Q1, Q3 or the median in the jth dimension. Theoretically, the disruptive effect of crossover is reduced as the dimensionality of the dataset increases, because the probability of intersection between the crossover points and rules decreases. As a consequence, for very low dimensional datasets (2D or 3D) it is reasonable to set the crossover ratio to relatively small values or even to deactivate the recombination operator.

3.5 Mutation Operator
The evolutionary search in our system is mainly based on an elaborate mutation operator which, although it alters the genetic material randomly, never generates individuals containing overlapping rules. The mutation operator has two functionalities: (a) to grow and (b) to shrink an existing rule. Growing-Mutation. The growing-mutation aims to increase the size of existing rules in an attempt to cover as many patterns as possible with a minimum number of rules. It is particularly useful where there are large, convex clusters that can be easily captured by a relatively small number of large rules.
These large rules can be generated starting from smaller ones by applying the growing-mutation operator over the generations. It is reasonable to focus on the discovery of as few and as large rules as possible, because such rules contribute to the high interpretability of the clustering output. Additionally, the combined functionalities of the growing-mutation and the repair operator (see section 3.6) can lead to the discovery of new interesting regions (isolated clusters) within the d-dimensional feature space. Let us assume that the jth dimension of a d-dimensional rule R undergoes growing-mutation. We can compute the maximum value rbj(max) to which rbj can be extended to the right without causing overlapping as follows:
1. Sort in ascending order all the rules whose left bound in the jth dimension is greater than the right bound of rule R.
2. If the sorted list is empty, set rbj(max) to (mj − 1) and exit. Otherwise, proceed to step 3.
3. Pick the next element RN of the list. If there is an intersection in at least one dimension (excluding the jth) between the rules R and RN, set rbj(max) equal to the left bound of RN in the jth dimension minus one and terminate the loop. If there is no intersection, proceed to the next element.
The derivation of the minimum value lbj(min) to which lbj can be extended to the left is the dual of computing rbj(max). Obviously, the allowable range of modification in the boundaries of rules due to growing-mutation is solely determined by the relative position of the rules within the d-dimensional feature space. The left and right bounds of each rule in the jth dimension are randomly mutated within the intervals [lbj(min), lbj) and (rbj, rbj(max)], respectively.
Shrinking-Mutation. This type of mutation, as its name implies, shrinks an existing rule. Each time the bound of a rule undergoes shrinking-mutation, the corresponding bound shrinks by one bin.
The intuition behind this small modification is that for high dimensional datasets, where isolated rules are more likely to exist, an arbitrary shrinking-mutation operation could eliminate such rules, and as a consequence additional generations might be required to re-discover these promising regions of the feature space. The shrinking-mutation operator is particularly useful for performing local fine-tuning and for facilitating the precise capture of non-convex clusters. This is done by allowing adjacent rules to be re-arranged within the d-dimensional grid in order to cover as many patterns as possible.
Balancing Shrinking and Growing. The ratio of shrinking to growing operations rsg should be set to relatively small values (e.g. 0.05). This is mainly done to bias the evolutionary search towards discovering new interesting regions in the large d-dimensional feature space, rather than performing local fine-tuning. Furthermore, due to the replacement strategy (see section 3.8), which allows individuals to survive only for a certain number of generations, high values of rsg may prevent NOCEA from converging to an optimal solution.
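The three-step procedure for computing rbj(max) can be sketched as follows. The list-of-(lb, rb)-intervals rule representation and the function name are our assumptions, and the "intersection in at least one dimension" test follows the text of step 3 verbatim.

```python
def max_right_bound(rule, others, j, m_j):
    """Sketch of the rb_j(max) computation used by growing-mutation.

    A rule is represented as a list of (lb, rb) integer bin intervals,
    one pair per dimension (an assumption of this sketch).
    """
    # Step 1: candidate blockers are rules whose left bound in dimension j
    # lies beyond rb_j, sorted in ascending order of that bound.
    candidates = sorted(
        (r for r in others if r[j][0] > rule[j][1]),
        key=lambda r: r[j][0],
    )
    # Step 2 is the fall-through at the end: with no blocker, the rule
    # may grow all the way to bin (m_j - 1).
    for rn in candidates:
        # Step 3: an intersection in at least one dimension other than j
        # (as stated in the text) blocks further growth.
        intersects = any(
            rule[k][0] <= rn[k][1] and rn[k][0] <= rule[k][1]
            for k in range(len(rule)) if k != j
        )
        if intersects:
            return rn[j][0] - 1
    return m_j - 1
```

For example, with a 2D rule [(0, 2), (0, 2)] and a neighbouring rule [(5, 6), (1, 3)] that intersects it in dimension 1, growth in dimension 0 is limited to bin 4; with no intersecting neighbour it may reach m_j − 1.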
3.6 Repair Operator
Often the search space S consists of two disjoint subsets of feasible and infeasible solutions, F and U, respectively [11]. Infeasible solutions are those that violate at least one constraint of the problem, and we have to deal with them very carefully, because their presence in the population influences other parts of the evolutionary search [11]. In the following we introduce some basic definitions in order to formalize the notion of feasibility in our clustering context. The selectivity sij of the ith bin in the jth dimension is defined as the number of patterns lying inside this bin over the total number of patterns covered by the rule. We define a Selectivity Control Chart (SCC) in the jth dimension as a diagram that has a centerline CL and upper (UCL) and lower (LCL) control limits that are symmetric about the centerline. Measurements corresponding to the selectivity of bins in the jth dimension are plotted on the chart. CL, UCL and LCL are given by equation 3:

    CL = 1/bj,    UCL = CL ∗ (1 + t),    LCL = CL ∗ (1 − t)    (3)

where bj denotes the number of bins in the jth dimension covered by the rule and t is a tolerance threshold that controls the sensitivity of the SCC chart in detecting shifts in the selectivity of bins. Consider the two-dimensional feature space shown in Fig. 2a, where each dimension has been partitioned into a number of bins. Figures 2c and 2d illustrate the SCC charts for dimensions x and y of the rule R1 shown in Fig. 2a. If the patterns covered by R1 are projected onto the x-axis, there are two distinct and well separated regions. These regions can be easily detected by examining the corresponding SCC chart shown in Fig. 2c, where the selectivity of the third bin is below LCL. In contrast, the selectivity of each bin in the y-axis is within the control limits, and as a consequence NOCEA detects a single region in the y-axis. In our approach, for each dimension of a rule R we construct a SCC diagram.
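The control limits of Eq. (3) and the below-LCL test translate directly into code; the function names are ours, and the bin selectivities are assumed to be given as a list.

```python
def scc_limits(num_bins, t):
    """Control limits of a Selectivity Control Chart, Eq. (3)."""
    cl = 1.0 / num_bins
    return cl * (1.0 + t), cl, cl * (1.0 - t)   # UCL, CL, LCL

def sparse_bins(selectivities, t):
    """Indices of bins whose selectivity falls below LCL."""
    _, _, lcl = scc_limits(len(selectivities), t)
    return [i for i, s in enumerate(selectivities) if s < lcl]

# Four bins with tolerance t = 0.5: CL = 0.25, LCL = 0.125,
# so only the third bin (selectivity 0.1) is flagged as sparse.
print(sparse_bins([0.3, 0.3, 0.1, 0.3], 0.5))  # [2]
```

A rule in which no bin falls below LCL in any dimension is solid in the sense defined below.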
Fig. 2. Visualization of SCC charts

If there exist points below LCL, then the corresponding bins are relatively sparse with respect to the rest and can be discarded. On the other hand, if there are no points below LCL but there exists at least one bin with selectivity greater than UCL, then NOCEA gradually reduces t, by subtracting a small constant, until at least one point falls below LCL. If neither of these conditions holds, the distribution of patterns across this dimension can be considered uniform and all bins are kept. The removal of sparse bins described above creates a set of intervals for each dimension. The original rule is replaced by a set of new rules, which are formed by combining one interval from each dimension. A rule R is said to be solid if the selectivity of every bin in each dimension is between the corresponding LCL and UCL. An individual that contains only solid rules is a feasible solution in our clustering context. Finally, if the selectivity of a rule, that is, the fraction of the total patterns covered by the rule, is below a user-defined threshold tsparse, then the rule is said to be sparse and is eliminated. Theoretically, for relatively large values of t, regardless of the underlying distribution of patterns, each rule will be solid. In such a case, NOCEA cannot produce homogeneous rules because in essence there is no mechanism to stop rules from growing arbitrarily. On the other hand, relatively small values of t can cause over-triggering of the repair operator, resulting in the generation of many small rules, which in turn adds extra complexity to the application of the genetic operators. To cope with the problem of fine-tuning the parameter t, we suggest a step-size adaptation mechanism that increases t over the generations according to:

    ti = tmin + (tmax − tmin) ∗ (1 − (1 − i/T)^b)    (4)

where b is a user-defined parameter, ti denotes the value of t in the ith generation, T is the total number of generations, and tmin and tmax are input parameters determining the minimum and maximum values of t, respectively.

A major drawback of SCC charts is that they consider each dimension independently. Thus, a SCC chart can only detect discontinuities occurring in a particular dimension if there is at least one bin in this dimension that is entirely very sparse. For instance, consider the rule R2 shown in Fig. 2b. Assuming relatively large values of t, the corresponding SCC charts (not shown here) cannot detect the very sparse regions located in the bottom-left and top-right of rule R2. To cope with partially sparse bins, we apply the following heuristic to each rule periodically (e.g. every 40 generations). Initially, a dimension that contains at least six bins is randomly selected. Then the parent rule R is split in that dimension in the middle, forming two touching rules R′ and R″. Both these rules undergo the repairing procedure described above. If these rules remain intact as far as their size is concerned, another dimension which has not previously been examined and which contains at least six bins is randomly chosen. However, if the repair operator modifies at least one of these rules, the parent rule R is replaced with the set of rules produced after repairing both R′ and R″.
Note that we only consider dimensions containing at least six bins, in order to avoid local abnormalities in the derived rules R′ and R″. For the same reason, the repair operator is invoked using tmax as the value of the tolerance threshold t.

3.7 Merging Rules
As soon as NOCEA converges, a merging procedure that combines adjacent rules to form the genuine clusters in the data is activated. The merging procedure is based on the concept of density and differs from other distance-based merging approaches [5], [10]. Two rules are adjacent if they have a common face, i.e. there are d−1 dimensions where there is an intersection of at least one bin between the corresponding genes, and there is one dimension where the left bound of one rule touches the right bound of the other.

Fig. 3. An example of adjacent rules (R1 and R2).

For instance, rules R1 and R2 shown in Fig. 3 are adjacent because there is an intersection between their corresponding genes in the y-axis, while in the x-axis the right bound of R1 touches the left bound of R2. For each pair of adjacent rules R1 and R2 we compute four density metrics, namely dC1, dC2, dB1 and dB2. These correspond to the density of patterns within the d-dimensional regions C1, C2, B1 and B2, respectively, shown in Fig. 3. C1 and C2 represent the central regions, while B1 and B2 are the border regions of rules R1 and R2, respectively. NOCEA assigns a pair of adjacent rules R1 and R2 to the same cluster as long as all of the following conditions are satisfied:

    min(dC1, dB1)/max(dC1, dB1) ≥ td (1),  min(dB1, dB2)/max(dB1, dB2) ≥ td (2),  min(dC2, dB2)/max(dC2, dB2) ≥ td (3)    (5)

where td is a user-defined parameter. In essence, conditions (1), (2) and (3) ensure that the variation in density across the path (shadowed region in Fig. 3) connecting the central regions of the two rules is not very high, and as a consequence the two rules belong to the same cluster.

3.8 Setting Parameters
The mutation ratio is set to 1.0, while the probability of mutating the boundaries of rules is 0.05. The shrinking-to-growing ratio rsg is set to 0.05. We used one-point crossover with ratio 0.1. The selection strategy used in our experiments is a k-fold tournament selection with tournament size k = 4. The replacement strategy implements a best-of-all scheme by merging the current and offspring populations and selecting the best individuals. However, to avoid premature convergence to infeasible optima, each individual is allowed to survive only for a certain number of generations (e.g. 5). Moreover, every 40 generations all parents are replaced by the offspring in order to eliminate all individuals containing partially sparse bins. Usually, the repairing of such rules causes a small reduction in the coverage of the entire individual, which can be easily observed in the fitness diagrams shown in Fig. 5b, 5d, and 5f. The only stopping criterion used in the reported experiments is the maximum number of generations, 200, while the population size is 50. Each population member is randomly initialized with a single d-dimensional rule. The parameters tmin, tmax and b, which control the adaptation of the tolerance threshold t, are set to 0.4, 0.65 and 1.5, respectively. The threshold tsparse for eliminating sparse rules is 0.05.
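With the settings above (tmin = 0.4, tmax = 0.65, b = 1.5, T = 200), the tolerance schedule of Eq. (4) can be sketched as follows. The exact exponent placement in Eq. (4) is our reconstruction, chosen so that t increases from tmin to tmax over the generations, as the text describes.

```python
def tolerance(i, total_gens=200, t_min=0.4, t_max=0.65, b=1.5):
    """Tolerance threshold t at generation i (a reconstruction of Eq. (4)).

    Rises from t_min at generation 0 to t_max at the final generation;
    b > 1 makes the early increase steeper and the later increase flatter.
    """
    return t_min + (t_max - t_min) * (1.0 - (1.0 - i / total_gens) ** b)

# t starts at t_min and reaches t_max at the last generation.
print(tolerance(0), tolerance(200))
```

The effect is that the repair operator is strict early on, splitting rules aggressively, and becomes more permissive as the population matures.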
4 Experimental Results
In this section we report the experimental results obtained by running NOCEA on three synthetically generated datasets containing patterns in a 2-dimensional feature space, as depicted in Fig. 4. These datasets are particularly challenging because they contain clusters of different sizes, shapes, densities and orientations. Furthermore, there are special artifacts, such as streaks running across clusters, and outliers that are randomly scattered in the feature space. The first (DS1) and second (DS2) datasets consist of 8000 and 10000 patterns, respectively, and were used to evaluate the performance of a well-known clustering algorithm called CHAMELEON [10]. The third dataset, DS3, has 100000 2-dimensional patterns and was used by another popular clustering algorithm, called CURE [5]. These datasets are publicly available at the URL http://www.macs.hw.ac.uk/ceeis/gecco03/ds.htm. Figure 5 illustrates the rules and clusters that NOCEA found for the three datasets. Rules belonging to the same cluster are assigned the same number. Furthermore, for each dataset there is a fitness diagram in which the coverage of the best and worst individuals, as well as the mean coverage of the entire population, are plotted for each generation. The experiments were conducted on a workstation running Windows 2000 with a 600MHz Intel Pentium III processor, 256MB of DRAM and 17GB of IDE disk. The evaluation time per generation for DS1, DS2 and DS3 is 1.5, 1.8 and 16 seconds, respectively. It is not guaranteed that NOCEA converges to the same set of rules using the same seeds. However, the union (or merging) of the derived rules always produces the same set of clusters, regardless of the seeds used to initialize the individuals. It can be observed from Fig. 5 that NOCEA has the following desirable properties:
a) Discovery of Non-convex Clusters. NOCEA has the ability to discover arbitrarily-shaped clusters with different sizes and densities.
The density-based merging procedure correctly combines adjacent rules, forming the genuine clusters in the data. b) No a priori Provision of the Number of Clusters. Unlike other clustering algorithms such as k-means [9], NOCEA does not require the number of clusters to be provided a priori, because the merging procedure discovers the correct number of clusters on the fly. c) Simplicity of Fitness Function. NOCEA has a simple fitness function which differs radically from the distance-based criterion functions used in partitioning and hierarchical clustering algorithms [9].
2310
I. Sarafis, P. Trinder, and A. Zalzala
[Plots: DS1 (8000 patterns), DS2 (10000 patterns), DS3 (100000 patterns)]
Fig. 4. The three datasets used in our experiments
Unlike other hierarchical and partitioning algorithms, NOCEA is not biased toward splitting large clusters into smaller ones in order to minimize some distance criterion function [5]. d) Handling Outliers. Although NOCEA is greedy with respect to the number of patterns that are covered by the rules of the individuals, the fitness diagrams shown in Fig. 5 indicate that the maximum theoretical value of coverage, 1.0, was never reached (given the presence of noise). This is due to the combined functionality of the repair operator, which does not allow a sparse region to be a member of a rule, and of the procedure that eliminates very sparse rules. Thus, NOCEA is relatively insensitive to the presence of noise. e) Data Order and Initialization Independence. The fitness function, which is not distance-based, and the randomized search of NOCEA ensure independence from the order of the input data and from the initialization phase. f) Interpretability of the Results. Whereas other clustering techniques describe the derived partitions by labeling the patterns with an identifier corresponding to the cluster to which they have been assigned [7], NOCEA presents the discovered knowledge in the form of non-overlapping IF-THEN clustering rules, which have the advantage of being a high-level, symbolic knowledge representation.
5
Conclusions and Future Work
In this paper, we have presented a novel evolutionary algorithm called NOCEA, which is suitable for DM clustering applications. NOCEA evolves individuals that consist of a variable number of non-overlapping clustering rules, where each rule includes d intervals, one for each feature. The encoding scheme is non-binary, as the values for the boundaries of the intervals are drawn from discrete domains. These discrete domains reflect the dynamic quantization of the feature space, which is based on information derived by analyzing the distribution of patterns in each dimension. We use a simple fitness function, which is radically different from any distance-based criterion function suggested so far. A density-based merging procedure combines adjacent rules, forming the correct number of clusters on the fly. Experimental results reported in Section 4 indicate that the specific fitness function, together with the elaborate genetic operators, allows NOCEA to meet some of the requirements for DM clustering. NOCEA is an evolutionary algorithm and consequently does not scale up as easily as other hill-climbing clustering techniques [7]. However, EAs are highly parallel
Mining Comprehensible Clustering Rules
2311
[Plots: (a) rules and clusters for DS1; (b) coverage metrics for DS1; (c) rules and clusters for DS2; (d) coverage metrics for DS2; (e) rules and clusters for DS3; (f) coverage metrics for DS3. Each coverage plot shows the coverage C of the best individual, the mean of the population, and the worst individual over 200 generations.]
Fig. 5. NOCEA's experimental results
procedures, and ongoing work investigates improving the efficiency of NOCEA using various schemes of parallelism. Additionally, ongoing work attempts to overcome the problem of splitting large rules into smaller ones when the crossover points intersect rules, by extending the functionality of the recombination operator. Another possible extension is the development of a generalization operator that can be used to reduce the complexity of the cluster descriptors. In particular, this procedure will take as argument a set of clustering rules corresponding to a particular cluster and will produce a more general descriptor for it.
References
1. R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In Proc. 1998 ACM-SIGMOD Int. Conf. Management of Data, pages 94-105, 1998.
2. D. W. Scott. Multivariate Density Estimation. Wiley, New York, 1992.
3. A. A. Freitas. Data Mining and Knowledge Discovery with Evolutionary Algorithms. Springer-Verlag, August 2002.
4. D. E. Goldberg. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, 1989.
5. S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD-98), volume 27(2) of ACM SIGMOD Record, pages 73-84. ACM Press, June 1998.
6. L. O. Hall, I. B. Özyurt, and J. C. Bezdek. Clustering with a genetically optimized approach. IEEE Trans. on Evolutionary Computation, 3(2):103-112, 1999.
7. J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, August 2000.
8. A. Hinneburg and D. A. Keim. An efficient approach to clustering in large multimedia databases with noise. In Proc. 1998 International Conference on Knowledge Discovery and Data Mining (KDD-98), pages 58-65. AAAI Press, 1998.
9. A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice-Hall, Englewood Cliffs, NJ, 1988.
10. G. Karypis, E.-H. Han, and V. Kumar. Chameleon: Hierarchical clustering using dynamic modeling. IEEE Computer, 32(8):68-75, 1999.
11. Z. Michalewicz. Genetic Algorithms + Data Structures = Evolution Programs. Third Edition. Springer-Verlag, 1996.
12. I. Sarafis, A. Zalzala, and P. Trinder. A genetic rule-based data clustering toolkit. In Proc. Congress on Evolutionary Computation (CEC), Honolulu, USA, 2002.
13. T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, volume 25(2), pages 103-114. ACM Press, 1996.
Evolving Consensus Sequence for Multiple Sequence Alignment with a Genetic Algorithm
Conrad Shyu and James A. Foster
Initiatives for Bioinformatics and Evolutionary Studies (IBEST), Department of Computer Science, University of Idaho, Moscow, Idaho 83843, USA
{tsemings,foster}@cs.uidaho.edu
Abstract. In this paper we present an approach that evolves the consensus sequence [25] for multiple sequence alignment (MSA) with a genetic algorithm (GA). We have developed an encoding scheme such that the number of generations needed to find the optimal solution is approximately the same regardless of the number of sequences; instead, it depends only on the length of the template and the similarity between sequences. The objective function gives a sum-of-pairs (SP) score as the fitness value. We conducted some preliminary studies and compared our approach with the commonly used heuristic alignment program Clustal W. Results have shown that the GA can indeed scale and perform well.
1
Introduction
Living things diverge from common ancestors through changes in deoxyribonucleic acid (DNA) and millions of years of evolution [6]. DNA plays a fundamental role in the processes of life in various respects. It contains the template for the synthesis of proteins, which are crucial molecules for life. Moreover, DNA is essential to life because it functions as a medium to transmit information from one generation to another [10]. Evidently, the most important regions in DNA are generally conserved to ensure survival. Sequence alignment is commonly used to detect and quantify similarities in DNA or protein sequences. Alignments of biological sequences generated by computational algorithms are routinely used as a basis for inference about sequences whose structures or functions are not well known [7]. The most common approach is to find the best-scoring alignment between a pair of sequences, where the score rewards aligning similar residues and penalizes substitutions and gaps. The best-scoring alignment is commonly found by dynamic programming (DP) algorithms, such as the Smith-Waterman and Needleman-Wunsch algorithms [14, 23]. DP algorithms guarantee a mathematically optimal alignment for the given evolutionary model; however, the complexity of DP algorithms grows exponentially as the length and number of sequences increase. Specifically, multiple sequence alignment (MSA) with DP has been shown to be NP-hard [19]. Several heuristic approaches, such as Clustal W [20], are frequently used to approximate the optimal alignments. In this paper, we present an approach that utilizes the guided search of a GA to evolve the most probable consensus sequence [25] for MSA.
E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 2313–2324, 2003. © Springer-Verlag Berlin Heidelberg 2003
2314
C. Shyu and J.A. Foster
The design of the GA is derived from the most commonly used DP algorithms for sequence alignment. In addition, we have developed an encoding scheme such that the search complexity does not depend on the number of sequences; it instead depends on the length of the consensus sequence and the similarity between sequences. The scheme encodes each possible matching nucleotide at a given column with binary masks. This compact representation greatly reduces the space requirement as well as the search complexity. The GA constructs the final alignment through the backtracking process, which is identical to the one found in DP algorithms [14]. The objective or evaluation function gives sum-of-pairs (SP) scores to determine the fitness of each chromosome in the population. The SP score has been widely used to detect and quantify similarities between sequences; however, it does not provide any probabilistic or biological justification [7]. To further improve the performance of the GA, we have devised a sequence-profiling formulation that reduces the complexity of calculating the SP scores. We have compared our approach to the most commonly used heuristic alignment program, Clustal W, and demonstrated that the GA can indeed perform and scale well. In most cases, the GA outperformed Clustal W and produced better alignments.
2
Sequence Alignment
There are diverse motivations behind the alignment of biological sequences. Genetic sequences are inherited from common ancestors through millions of years of evolution. Therefore, it is of interest to trace the evolutionary history of mutation and other evolutionary changes through sequencing [2, 6]. Alignment of biological sequences in this context is generally understood as comparison based on the criteria of evolution. For example, the number of mutations, insertions, and deletions of residues necessary to transform one DNA sequence into another is a measure of phylogeny or evolutionary relatedness [3, 13]. On the other hand, a comparison may pinpoint regions of common origin, which may in turn coincide with regions of similar structure or function [10]. A pairwise sequence alignment is a technique of arranging two sequences so that the residues in certain positions are deemed to have a common evolutionary origin. In other words, if the same residue occurs in both sequences at the same position, then it may have been conserved during the course of evolution. On the other hand, if two residues differ, then it is generally assumed that they may have derived from a common ancestor through mutation. Homologous sequences, those related by common descent, may have different lengths, which is generally explained through insertions or deletions [6, 10]. DP has been commonly used to align two sequences because it guarantees a mathematically optimal alignment. MSA, on the other hand, is simply an extension of pairwise sequence alignment: it is the process of aligning three or more sequences simultaneously to bring as many similar residues into register as possible. The resulting alignments are commonly interpreted in two contexts: (a) to find regions that define a conserved pattern or domain; and (b) to derive the possible phylogeny or evolutionary relationships among the sequences [13].
The presence of similar domains in several similar sequences implies a similar biochemical function or structural fold that may be used as the basis for further experimental investigation. Simultaneous alignment of three or more sequences with DP, however, poses a difficult algorithmic challenge.
2.1
Dynamic Programming (DP)
Dynamic programming (DP) is a commonly used method for solving sequential or multi-stage decision problems and is recursive in nature. The essence of DP is the principle of optimality [15, 17]. DP has long been used to solve varieties of discrete optimization problems such as scheduling, string editing, packaging, and inventory management [12]. It views a problem as a set of interdependent sub-problems: DP solves the sub-problems and uses the results to solve larger sub-problems. The solution to a sub-problem is expressed as a function of the solutions to one or more sub-problems at the preceding levels [7]. In other words, DP expresses the problem in a recurrent formulation. To make optimal decisions for the next and all future states, DP only needs to know the current state. This is also known as the Markovian property: for a process to be Markovian, the future must depend only on the present state, and the past should not have any effect on the future [7, 12]. The term programming in the name actually refers to the mathematical rules that can be followed to solve a problem; it has nothing to do with writing a computer program. DP is known to be an efficient technique for solving certain combinatorial problems. It is particularly important in bioinformatics [17], as it is the basis of sequence alignment for comparing DNA and protein sequences. The following figure shows the recurrent formulation of DP for sequence alignment.
F(i, j) = max{ F(i-1, j-1) + s(x_i, y_j),  F(i-1, j) - d,  F(i, j-1) - d }

Fig. 1. This recurrent equation is applied repeatedly to fill the matrix of F(i, j) values. This particular formulation gives the global alignment of two sequences. F(i, j) is the maximum of three previous values, namely F(i-1, j-1), F(i-1, j), and F(i, j-1). The value s(x_i, y_j) is the score for aligning the characters x_i and y_j, and d is the penalty for inserting a gap
For pairwise sequence alignments, for example, DP first begins with the construction of an alignment matrix F(i, j) with the indexes (i, j) for the two sequences Sx and Sy. The matrix is initialized with F(0, 0) = 0. The value of F(i, j) is the score of the best alignment from the first character x_1 to the character x_i of sequence Sx and from the first character y_1 to the character y_j of Sy. There are three possible ways that x_i and y_j can be aligned: (a) x_i can align with y_j, which gives a match or mismatch; (b) x_i is aligned with a gap; or (c) y_j is aligned with a gap. Since the matrix is built recursively, in order to calculate F(i, j), the previous states F(i-1, j-1), F(i-1, j), and F(i, j-1) must be known beforehand [7].
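To make the recurrence concrete, here is a minimal sketch of the DP fill in Python. The scoring values (match = 10, mismatch = 0, gap penalty = 10) mirror the IUB-style scores used later in the paper, and the function and variable names are our own, not the authors'.

```python
def global_alignment_score(sx, sy, match=10, mismatch=0, gap=10):
    """Fill the DP matrix F(i, j) per the recurrence in Fig. 1 and return
    the score of the best global alignment of sx and sy."""
    n, m = len(sx), len(sy)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = F[i - 1][0] - gap            # prefix of sx against gaps
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] - gap            # prefix of sy against gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if sx[i - 1] == sy[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,  # align x_i with y_j
                          F[i - 1][j] - gap,    # x_i against a gap
                          F[i][j - 1] - gap)    # y_j against a gap
    return F[n][m]
```

The full F matrix would be kept for the backtracking step; only the best score is returned here for brevity.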
2.2
Sum-of-Pairs (SP) Score
Carrillo and Lipman [5] first introduced the sum-of-pairs (SP) score function, which defines the score of a multiple alignment of N sequences as the sum of the scores of the N(N-1)/2 pairwise alignments [5, 7]. Although the SP score function has been widely used to evaluate MSA, it does not really provide any biological or probabilistic justification. Each sequence is scored as if it were descended from the N-1 other sequences instead of a single ancestor. As a result, evolutionary events are often overestimated, and the problem worsens as the number of sequences increases [7]. A weighted SP score function has been proposed to partially compensate for this problem [1, 8]. Moreover, despite the simplicity of the SP score function, the sheer running time and space consumption make computing optimal SP alignments impractical even for modestly sized sets of short sequences. Specifically, it has been shown that the problem of computing an MSA with optimal SP score is NP-hard [22]. Several fast approximations and divide-and-conquer approaches [18] have been proposed to overcome the computational complexity. The following figure shows the mathematical formulation of the weighted SP score function.
w(M) = Σ_{1 ≤ p < q ≤ k} α_{p,q} Σ_{j=1}^{N} s(m_{pj}, m_{qj})

Fig. 2. The SP function, w(M), sums all the pairwise substitution scores in the columns for the sequence pairs p and q. Each column is evaluated with a scoring matrix. The substitution scoring function, s(m_{pj}, m_{qj}), gives the score of the alignment at column j for sequences p and q. The weight, α_{p,q}, is intended to balance the overestimation problem in the SP score function [1, 7]
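As an illustration, an unweighted SP score (all weights α_{p,q} = 1) can be computed directly with a pairwise loop. This naive form is the quadratic-in-N baseline that the profiling technique described later avoids; the function name and scoring values are illustrative, not the authors'.

```python
from itertools import combinations

def sp_score(alignment, match=10, mismatch=0, gap=-10):
    """Unweighted sum-of-pairs score over all sequence pairs p < q and all
    columns j; '-' denotes a gap, and gap-gap pairs score 0 by convention."""
    def s(a, b):
        if a == '-' and b == '-':
            return 0                      # gap aligned to gap: no penalty
        if a == '-' or b == '-':
            return gap                    # residue aligned to gap
        return match if a == b else mismatch

    total = 0
    for p, q in combinations(range(len(alignment)), 2):
        total += sum(s(a, b) for a, b in zip(alignment[p], alignment[q]))
    return total
```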
2.3
Clustal W
Clustal W is a commonly used progressive alignment program for biological sequences. It is based on a heuristic algorithm and therefore cannot always find the optimal alignments. Clustal W exploits the fact that homologous sequences are evolutionarily related. It builds up multiple alignments progressively with a series of pairwise alignments, moving from the leaves upward in a guide tree that estimates the phylogeny of the sequences [9]. Clustal W first aligns regions of identical or highly conserved residues and gradually adds in more distant ones [21]. This approach is sufficiently fast and allows Clustal W to align virtually any number of sequences. Although Clustal W does not always find the optimal alignments, in most cases its alignments at least give good starting points for further automatic or manual refinement. This type of alignment is generally useful for identifying regions that are highly conserved. The alignments can be further improved through sequence weighting, position-specific gap penalties, and the choice of weight matrix [20]. Clustal W nonetheless suffers from two major problems: local maxima and the choice of alignment parameters. The local maxima problem stems from the nature of the progressive alignment strategy. As the algorithm follows the guide tree and merges sequences together, the solution is never guaranteed to be globally optimal, as defined by some overall measure of alignment quality [9, 16, 20]. Any misaligned regions made early in the alignment process cannot be corrected later as new information from other sequences is introduced. This problem is frequently a result of an incorrect branching order in the guide tree, and the only way to correct it is to use an iterative or stochastic sampling procedure such as bootstrapping [20]. The choice of alignment parameters is another problem in Clustal W. If parameters are not chosen appropriately, alignments will not converge to a globally optimal solution [7, 20]. For closely related sequences, any reasonable scoring matrix should work fine because matches usually receive the most weight; when matches dominate an alignment, almost any weight matrix will find a good solution [11]. However, when aligning more divergent sequences, the choice of scores for gaps and mismatches becomes critical because they occur more frequently. Moreover, for highly conserved sequences, the range of gap penalties that will find the correct or best possible solution can be very broad. As more and more divergent sequences are added, however, the exact values of the gap penalties become critical for success [20]. Our observations have confirmed that this is actually a common problem in most MSA algorithms. Statistically, as the number of sequences increases, the expected number of matches in each column also increases. For example, the probability of finding a matching nucleotide in a column of ten sequences is much higher than in a column of three sequences. If the gap penalty is too low, alignments will generally contain excessive amounts of gaps. It is in general difficult to justify why one scoring matrix is better than another [7].
3
Design of GA
3.1
Consensus Sequence
The consensus sequence [25] is a unique and important feature of our GA approach. It is essentially a compact formulation to represent all possible alignments for virtually any number of sequences [4]. The consensus sequence borrows the idea from biology that it is sometimes necessary for certain positions in a sequence to be made ambiguous when some residues simply cannot be resolved during laboratory experiments. A sequence with ambiguity codes is actually a mix of sequences, each having one of the nucleotides defined by the ambiguity at that position. For example, if an R is encountered in the sequence, then the sequences in the assortment will have either an adenine or a guanine at that position. The ambiguity enables conserved sequences to be condensed into one single representation. The following figure lists the most commonly used ambiguity codes defined by the Nomenclature Committee of the International Union of Biochemistry (IUB), and the next figure shows a hypothetical alignment with five sequences and illustrates how the ambiguity codes are used to derive the consensus sequence. It is assumed that the optimal alignment is already known for a given evolutionary model. The consensus sequence is, in essence, a condensed sequence with ambiguity codes that shows which nucleotides are allowed in each column.
2318
C. Shyu and J.A. Foster
Fig. 3. The most commonly used DNA ambiguity codes are defined by the International Union of Biochemistry (IUB). The presence of an ambiguity code generally indicates that some residues could not be resolved during laboratory experiments. Ambiguity codes also enable sequences to be represented in a more condensed form
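The derivation of a consensus from an existing alignment can be sketched as follows. The code table is the standard IUB nucleotide ambiguity set; the function name and the choice to ignore gaps when collecting a column's residues are our assumptions.

```python
# Standard IUB nucleotide ambiguity codes, keyed by the set of residues
# each code stands for.
IUB = {frozenset('A'): 'A', frozenset('C'): 'C',
       frozenset('G'): 'G', frozenset('T'): 'T',
       frozenset('AG'): 'R', frozenset('CT'): 'Y',
       frozenset('AC'): 'M', frozenset('GT'): 'K',
       frozenset('CG'): 'S', frozenset('AT'): 'W',
       frozenset('CGT'): 'B', frozenset('AGT'): 'D',
       frozenset('ACT'): 'H', frozenset('ACG'): 'V',
       frozenset('ACGT'): 'N'}

def consensus(alignment):
    """Condense equal-length aligned sequences into a single string of
    ambiguity codes; gaps are ignored when collecting a column's residues."""
    out = []
    for column in zip(*alignment):
        residues = frozenset(c for c in column if c != '-')
        out.append(IUB[residues] if residues else '-')
    return ''.join(out)
```

For example, a column containing both C and G condenses to the code S.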
3.2
Design of Encoding Scheme
To further enhance the design, our chromosomes are broken into four pieces according to the nucleotide they represent. In other words, our GA uses four parallel chromosomes to represent the four different nucleotides. Each chromosome encodes the relative occurrences and locations of one nucleotide and is only evolved with chromosomes that encode the same nucleotide. The fitness of the chromosomes as a whole is determined by how well they fit together to derive the final alignment. The geometry of the parallel chromosomes is very similar to a four-dimensional hyperplane: each dimension is evolved and optimized separately and independently.
Fig. 4. Sequences are split up into four subsequences according to the nucleotides. Each subsequence only encodes the information where such a nucleotide can be found. Since each nucleotide is individually encoded, any possible ambiguities can be fully represented. Further, chromosomes are represented as binary strings, where 1 signals the existence of such a nucleotide at the given location while 0 signals the absence
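A minimal sketch of this splitting, assuming Python integers serve as the bit strings (bit i set means the nucleotide occurs at position i); the function name is illustrative, not the authors'.

```python
def split_into_bitmasks(sequence):
    """Split one sequence into four parallel bit strings (as integers);
    bit i of masks[n] is 1 iff nucleotide n occurs at position i."""
    masks = {n: 0 for n in 'ACGT'}
    for i, nucleotide in enumerate(sequence):
        masks[nucleotide] |= 1 << i
    return masks
```

Because each nucleotide is encoded separately, a position where several masks have their bit set naturally represents an ambiguity at that position.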
The length of the chromosomes is difficult to determine precisely. It depends on the evolutionary model as well as the similarity of the sequences. If sequences are highly conserved, chromosomes can be relatively short. Intuitively, any two randomly generated sequences will have at least 25% similarity. For this study, we assumed that sequences have at least 50% similarity to be biologically significant. Therefore, we arbitrarily defined the length of the chromosome to be 1.5 times the length of the longest sequence. For most of our studies, this assumption worked well. Furthermore, for implementation convenience, each chromosome is further divided into smaller blocks called loci. Biologically, a locus is a block of alleles where genes can be found. The length of a locus is determined by the length of an integer on the hardware platform, and the number of loci depends on the length of the chromosome.
3.3
Crossover and Mutation Operations
For this research, we implemented a simple one-point crossover. A point is randomly selected in each locus of each chromosome, and alleles are exchanged between the two parent chromosomes to form an offspring. An offspring is produced at each generation and then competes with the population. Since each chromosome has a separate string for each of the four nucleotides, the crossover points are chosen separately. Furthermore, we have implemented a bias function for selecting the parent chromosomes from the population. The bias function is like an “unfair” random number generator: it is essentially a quadratic equation that randomly generates a series of numbers biased toward the lower indexes. Since the population is sorted in descending order of fitness, the individuals with higher fitness values are consequently more likely to be selected. Mutation is an important operator that prevents the population from stagnating at local optima [24]. In our implementation, mutation is only applied to newly created offspring chromosomes. The GA first calculates the expected number of mutations for each locus in the chromosomes with a random factor. It then iteratively picks random locations on each locus of each of the four parallel chromosomes and changes the alleles. The mutation operator randomly flips the alleles independently on each locus with the binary XOR operator, inverting an allele from 0 to 1 or from 1 to 0.
Fig. 5. The crossover operator retrieves the alleles from two parent chromosomes to create an offspring. A point is randomly selected on each locus for each chromosome. The offspring receives approximately a half of alleles from each parent. The length of a locus is 32-bit, which corresponds to the length of an integer on our machines
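The per-locus one-point crossover and XOR mutation described above can be sketched as follows, assuming 32-bit integer loci. The helper names, and the simplification of flipping exactly one allele per locus, are ours, not the authors' exact procedure.

```python
import random

LOCUS_BITS = 32                          # locus length = one machine integer

def one_point_crossover(parent_a, parent_b, rng=random):
    """Per-locus one-point crossover: the offspring takes the bits below a
    random cut point from parent_a and the bits above it from parent_b."""
    mask_all = (1 << LOCUS_BITS) - 1
    child = []
    for la, lb in zip(parent_a, parent_b):
        point = rng.randrange(1, LOCUS_BITS)     # cut chosen per locus
        low = (1 << point) - 1                   # bits below the cut
        child.append((la & low) | (lb & (mask_all ^ low)))
    return child

def mutate(chromosome, rng=random):
    """Flip one random allele per locus with the binary XOR operator."""
    return [locus ^ (1 << rng.randrange(LOCUS_BITS)) for locus in chromosome]
```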
3.4
Objective Function
The objective function measures the quality of an MSA; ideally, the better the score, the more biologically relevant the multiple alignment is. The substitution costs are evaluated using a predefined substitution matrix, which assigns every possible substitution or conservation a score according to its biological likelihood. We used the nucleic scoring matrix defined by IUB, in which each match receives 10 points and each mismatch 0. The gap penalty is 10.0 for opening and 0.2 for extending a gap. The alignment with the highest score is considered a potentially optimal solution. In addition, for chromosomes that do not include all nucleotides, the objective function subtracts from the fitness value both the mismatch and gap scores multiplied by the number of nucleotides that are missing from the alignment. Our experiments have confirmed that this strategy works quite well. The calculation of SP scores for N sequences takes O(M×N^2) time [4, 26], where M is the average length of the sequences. To further improve the GA's performance, we have devised a sequence-profiling technique that simplifies the calculation of SP scores.
Fig. 6. The figure shows the sequence profile for the column highlighted in red. The profiling process simply accumulates the frequencies or occurrences of each nucleotide in a given column. The process splits the calculation of the SP score into three smaller tasks and reduces the complexity. Matches are only possible when two identical nucleotides are aligned together. Therefore, the score for matches is the sum of all possible combinations of identical nucleotides multiplied by the matching scores S(i, i) from the substitution matrix. The number of mismatches is the sum of all combinations between two different nucleotides; there are only six such combinations. The gap penalties are the sum of all arrangements between each nucleotide and the gaps
The objective function first computes the profile of the sequences for each column. A profile is simply the occurrences or frequencies of each nucleotide. The profiling process accumulates the occurrences of each nucleotide and reduces the calculations of SP scores into three smaller tasks. Matches are only possible when two identical nucleotides are aligned together. Therefore, the matching score is simply the sum of all possible combinations of the same nucleotides multiplied by the match score in the substitution matrix. The number of mismatches, on the other hand, is the sum of all the combinations of two different nucleotides. There are only six such combinations. The number of gap alignments is derived from the sum of the frequencies of each nucleotide multiplied by the number of gaps. The result is then multiplied by the gap penalty to obtain the overall gap score.
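A sketch of the profile-based column score under stated assumptions (match = 10, mismatch = 0, residue-gap = -10, gap-gap pairs ignored): counting each nucleotide once per column replaces the pairwise loop over N sequences with a fixed number of terms over the 4-letter alphabet. The function name is ours.

```python
from itertools import combinations
from math import comb

def column_sp_score(column, match=10, mismatch=0, gap=-10):
    """SP score of one alignment column computed from its profile
    (per-nucleotide counts) instead of an explicit O(N^2) pairwise loop."""
    counts = {n: column.count(n) for n in 'ACGT'}
    n_gap = column.count('-')
    score = sum(comb(c, 2) * match for c in counts.values())      # matches
    score += sum(counts[a] * counts[b] * mismatch                 # mismatches:
                 for a, b in combinations('ACGT', 2))             # 6 pairs only
    score += sum(counts.values()) * n_gap * gap                   # residue-gap
    return score
```

Summing this over all columns gives the same unweighted SP total as the naive pairwise computation, at a fraction of the cost for large N.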
3.5
Alignment Construction
The construction of the final alignment is very similar to that of the DP algorithms [14, 23]. The GA derives the alignment from the last nucleotide to the first, and the chromosomes are accordingly decoded backward. If a nucleotide is permitted at a given column, then it is consumed and added to the final alignment, and the process moves on to the preceding one; otherwise a gap is inserted into the alignment. If no nucleotides are used in a column at all, then the allele is skipped. Alleles that are not used to derive the final alignment are considered non-coding regions. One of the most interesting and important features of the scheme is that the alleles used to derive the alignment do not have to be consecutive. In addition, two different chromosomes can potentially give the same alignment. In other words, the alignment construction process picks the “appropriate” alleles as it moves along; the chromosomes do not have to encode exact bit patterns for the alignments. This makes every allele in the chromosome a potential solution for the alignment. Experiments have confirmed that the GA discovers the optimal alignment, as defined by the substitution matrix, quickly and effectively. The alignments produced by the GA are at least as good as the ones obtained from Clustal W, and if the GA is allowed to continue evolving, better alignments are very likely to be found. As the number of sequences increases, the effectiveness of the GA begins to surface. Our experiments have shown that regardless of the number of sequences being aligned, the GA performed extremely well and produced alignments with competitive scores.
Fig. 7. The figure shows how the final alignment is constructed from the chromosomes. The alignment is derived in reverse order. The spaces in between are the alleles that did not match up with any of the nucleotides in the sequences. If a particular nucleotide is not present in the sequence, a gap is inserted. Chromosomes do not have to encode the exact bit patterns for the alignments; the alignment process simply picks the “appropriate” alleles
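A hedged sketch of this backward decoding, assuming each nucleotide's permitted positions are held in an integer bitmask over the consensus columns; the pointer handling and names are our own simplifications of the process described above.

```python
def construct_alignment(sequences, masks, length):
    """Decode the alignment right to left: each sequence consumes its next
    (rightmost) unplaced residue whenever the residue's bitmask permits it
    at the current column; otherwise the sequence receives a gap. Columns
    used by no sequence are skipped as non-coding regions."""
    pointers = [len(s) - 1 for s in sequences]   # next residue, right to left
    columns = []
    for pos in range(length - 1, -1, -1):
        col = []
        for k, seq in enumerate(sequences):
            i = pointers[k]
            if i >= 0 and (masks[seq[i]] >> pos) & 1:
                col.append(seq[i])               # residue consumed here
                pointers[k] -= 1
            else:
                col.append('-')                  # gap inserted
        if any(c != '-' for c in col):           # skip all-gap (non-coding)
            columns.append(col)
    columns.reverse()
    return [''.join(col[k] for col in columns) for k in range(len(sequences))]
```

In a full GA, residues left unconsumed when the scan ends would be penalized by the objective function; that bookkeeping is omitted here.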
4
Experiment Settings
The objective of our experiments was to demonstrate that the GA could scale better as well as produce competitive alignments. For the purpose of this study, we assumed that Clustal W always gave the optimal scores. Sequences were first aligned with Clustal W, and the scores were used as the stopping condition for the GA. We applied the standard IUB nucleic scoring matrix and used gap penalties identical to those of Clustal W. For this study, we gathered 20 short random DNA sequences of an
2322
C. Shyu and J.A. Foster
average length of 60 base pairs. Sequences were manually verified to have at least 50% similarity. Each chromosome was about 96 bits long and had three loci. The mutation rate was 0.0625 and the average expected number of mutations on each locus was about one. The GA began with a randomly generated population of 64 individuals. The population was first evaluated and sorted in the descending order according to the fitness values. At each generation, the bias function randomly picked two individuals from the population that served as the parent chromosomes. The crossover operator exchanged alleles from two parent chromosomes and created an offspring. Mutation was applied to the offspring repeatedly until the fitness is higher than both parent chromosomes. The offspring then competed with the entire population and removed the individual with the lowest fitness. We gradually increased the number of sequences in each trial. Due to the stochastic nature of GA, all trials were performed at least three times in order to obtain more reliable results.
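The steady-state loop just described can be sketched as follows. Names are illustrative, the parent pick is shown as uniform rather than the paper's bias function, and the bounded retry count is an assumption (the text does not say what happens if mutation never produces a fitter offspring).

```python
import random

def steady_state_step(population, fitness, crossover, mutate, max_tries=1000):
    """One step of the steady-state GA described above: pick two parents,
    create an offspring by crossover, mutate it repeatedly until it beats
    both parents (or max_tries is exhausted -- an assumption), then let it
    replace the least-fit member of the population."""
    p1, p2 = random.sample(population, 2)      # the paper uses a bias function
    child = crossover(p1, p2)
    for _ in range(max_tries):
        if fitness(child) > fitness(p1) and fitness(child) > fitness(p2):
            break
        child = mutate(child)
    # survivor replacement: the worst individual is removed
    worst = min(range(len(population)), key=lambda k: fitness(population[k]))
    population[worst] = child
    return population
```

A toy run with bitstring individuals and sum-of-bits fitness illustrates the mechanics without depending on the alignment representation.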
5 Experiment Results and Discussion
Experiments show promising results for our GA approach. In most cases, the GA outperformed Clustal W and produced better alignments. The number of generations needed to find the optimal solutions remained approximately the same even as the number of sequences increased, a testament to the GA's ability to exploit the guided search effectively when finding optimal alignments. During the course of the experiments, we tried various chromosome lengths in order to understand how they affect the performance of the GA. Notably, performance dropped dramatically when the length of the sequences reached 300 base pairs; further investigation is needed to gain a better understanding of this effect. The exact length of the consensus sequence is difficult to determine, because it depends largely on the similarity of the sequences and on the evolutionary model. Our observations revealed that if the consensus sequence is too short, the GA frequently fails to converge; if it is too long, progress becomes extremely slow. In addition, we confirmed that the SP scoring function is not a good measure for MSA. If gaps are not heavily penalized, the same score can easily be achieved with more matches but an excessive number of gaps. Moreover, the relative difference in score between correct and incorrect alignments decreases as the number of sequences increases. This is counterintuitive and unrealistic: the relative difference should increase when more sequences are introduced into the alignment.

Table 1. Number of generations needed to find alignments at least as good as Clustal W, for various numbers of sequences
            Number of Sequences
Trial      10        12        14        16        18        20
  1      29,044    36,286    35,893    35,447    26,141    42,805
  2      20,835    32,225    31,012    27,701    32,720    46,888
  3      26,080    25,304    22,244    39,906    43,989    44,452
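The sum-of-pairs weakness noted in the discussion above is easy to demonstrate with a toy scoring scheme. The match/mismatch values below are illustrative, not the IUB matrix used in the experiments.

```python
from itertools import combinations

def sp_score(alignment, gap_penalty, match=1, mismatch=-1):
    """Sum-of-pairs score over the columns of an alignment (equal-length
    rows). Gap-gap pairs cost nothing; gap-residue pairs cost gap_penalty."""
    score = 0
    for col in zip(*alignment):
        for a, b in combinations(col, 2):
            if a == '-' and b == '-':
                continue                      # gap against gap: no cost
            elif a == '-' or b == '-':
                score -= gap_penalty          # gap against a residue
            else:
                score += match if a == b else mismatch
    return score

tight = ["AC", "CA"]    # no gaps, two mismatches: score -2
gappy = ["AC-", "-CA"]  # one match bought with two gaps
# With a light gap penalty (0.5) the gappy alignment scores 0 and wins;
# a heavier penalty (2) drops it to -3, restoring the sensible ranking.
```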
Fig. 8. The figure shows the GA performance on a set of 18 sequences. The GA typically found a good alignment within 50,000 generations. If the GA is allowed to continue evolving, alignments with even higher scores are very likely to be found. The horizontal line indicates the score found by Clustal W, which is about 35,294
6 Future Work
The consensus-sequence GA showed very promising performance and results. For future work, we would like to extend this approach to align protein sequences and to implement statistical scoring techniques. In addition, we plan to incorporate the weighting scheme and analyze its impact on the GA's performance. We are currently investigating an approach that incorporates several statistical and simulation techniques to try to quantify the significance of alignment scores. Acknowledgements. Shyu was partially funded by a grant from Procter & Gamble, and Foster was partially funded for this research by NIH NCRR 1P20 RR16448.
References

1. Altschul, S.F., Carroll, R.J., and Lipman, D. Weights for data related by a tree. Journal of Molecular Biology, 207: 647–653 (1989).
2. Altschul, S.F. and Lipman, D. Trees, stars, and multiple sequence alignment. SIAM Journal of Applied Mathematics, 49: 197–209 (1989).
3. Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D. Basic local alignment search tool. Journal of Molecular Biology, 215(3): 403–410 (1990).
4. Day, W.H. and McMorris, F.R. The computation of consensus patterns in DNA sequence. Mathematical and Computer Modelling, 17: 49–52 (1993).
5. Carrillo, H. and Lipman, D. The multiple sequence alignment problem in biology. SIAM Journal of Applied Mathematics, 48: 1073–1082 (1988).
6. Carroll, S.B., Grenier, J.K., and Weatherbee, S.D. From DNA to Diversity: Molecular Genetics and the Evolution of Animal Design. Malden, MA: Blackwell Science (2001).
7. Durbin, R., Eddy, S., Krogh, A., and Mitchison, G. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge, UK: Cambridge University Press (1998).
8. Fogel, D.B. and Corne, D.W. (eds.). Evolutionary Computation in Bioinformatics. San Francisco, CA: Morgan Kaufmann Publishers (2003).
9. Feng, D. and Doolittle, R.F. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. Journal of Molecular Evolution, 25: 351–360 (1987).
10. Graur, D. and Li, W.H. Fundamentals of Molecular Evolution, 2nd ed. Sunderland, MA: Sinauer Associates (2000).
11. Wang, L. and Gusfield, D. Improved approximation algorithms for tree alignment. Journal of Algorithms, 25: 255–273 (1998).
12. Gusfield, D. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. New York, NY: Cambridge University Press (1997).
13. Hall, B.G. Phylogenetic Trees Made Easy: A How-To Manual for Molecular Biologists. Sunderland, MA: Sinauer Associates (1997).
14. Needleman, S.B. and Wunsch, C.D. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48: 443–453 (1970).
15. Sean, R.E. A memory-efficient dynamic programming algorithm for optimal alignment of a sequence to an RNA secondary structure. BMC Bioinformatics, 3: 13 (2002).
16. Sauder, J., Arthur, J., and Dunbrack, R. Large-scale comparison of protein sequence alignment algorithms with structure alignments. Proteins: Structure, Function, and Genetics, 40: 6–32 (2000).
17. Setubal, J. and Meidanis, J. Introduction to Computational Molecular Biology. Boston, MA: PWS Publishing (1997).
18. Stoye, J., Perry, S.W., and Dress, A.W.M. Improving the divide-and-conquer approach to sum-of-pairs multiple sequence alignment. Applied Mathematics Letters, 10(2): 67–73 (1997).
19. Thomsen, R., Fogel, G.B., and Krink, T. A Clustal alignment improver using evolutionary algorithms. Congress on Evolutionary Computation, vol. 1, pp. 121–126 (2002).
20. Thompson, J.D., Higgins, D.G., and Gibson, T.J. Clustal W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22: 4673–4680 (1994).
21. Thompson, J.D., Gibson, T.J., Plewniak, F., Jeanmougin, F., and Higgins, D.G. The Clustal X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools. Nucleic Acids Research, 24: 4876–4882 (1997).
22. Wang, L. and Jiang, T. On the complexity of multiple sequence alignment. Journal of Computational Biology, 1: 337–348 (1994).
23. Waterman, M.S. and Eggert, M. A new algorithm for best subsequence alignments with application to tRNA-rRNA comparisons. Journal of Molecular Biology, 197: 723–725 (1987).
24. Whitley, D. A genetic algorithm tutorial. Statistics and Computing, 4: 65–85 (1994).
25. Keith, J.M., Adams, P., Bryant, D., Kroese, D.P., Mitchelson, K.R., Cochran, D.A.E., and Lala, G.H. A simulated annealing algorithm for finding consensus sequences. Bioinformatics, 18(11): 1494–1499 (2002).
26. Wang, L., Jiang, T., and Gusfield, D. A more efficient approximation scheme for tree alignment. SIAM Journal on Computing, 30: 283–299 (2000).
A Linear Genetic Programming Approach to Intrusion Detection

Dong Song, Malcolm I. Heywood, and A. Nur Zincir-Heywood

Dalhousie University, Faculty of Computer Science
6040 University Avenue, Halifax, NS, B3H 1W5, Canada
{dsong,mheywood,zincir}@cs.dal.ca
Abstract. Page-based Linear Genetic Programming (GP) is proposed and implemented with two-layer Subset Selection to address a two-class intrusion detection classification problem as defined by the KDD-99 benchmark dataset. By careful adjustment of the relationship between subset layers, overfitting by individuals to specific subsets is avoided. Moreover, efficient training on a dataset of 500,000 patterns is demonstrated. Unlike the current approaches to this benchmark, the learning algorithm is also responsible for deriving useful temporal features. Following evolution, decoding of a GP individual demonstrates that the solution is unique and comparable to hand-coded solutions found by experts.
E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 2325–2336, 2003. © Springer-Verlag Berlin Heidelberg 2003

1 Introduction

The Internet, as well as representing a revolution in the ability to exchange and communicate information, has also provided greater opportunity for disruption and sabotage of data previously considered secure. The study of intrusion detection systems (IDS) provides many challenges. In particular, the environment is forever changing, both with respect to what constitutes normal behavior and abnormal behavior. Moreover, given the utilization levels of networked computing systems, it is also necessary for such systems to work with a very low false alarm rate [1]. In order to promote the comparison of advanced research in this area, the Lincoln Laboratory at MIT, under DARPA sponsorship, conducted the 1998 and 1999 evaluations of intrusion detection [1]. As such, these provide a basis for making comparisons of existing systems under a common set of circumstances and assumptions [2]. Based on binary TCP dump data provided by the DARPA evaluation, millions of connection statistics were collected and generated to form the training and test data in the Classifier Learning Contest organized in conjunction with the 5th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining 1999 (KDD-99). The learning task is to build a detector (i.e. a classifier) capable of distinguishing between "bad" connections, called intrusions or attacks, and "good" or normal connections. There were a total of 24 entries submitted for the contest [3,4]. The top three winning solutions are all variants of decision trees. The winning entry is composed from 50×10 C5 decision trees fused
by cost-sensitive bagged boosting [5]. The second-placed entry consisted of a decision forest containing 755 trees [6]. The third-placed entry consisted of two layers of voting decision trees augmented with human security expertise [7]. In this work, the first interest is to explore a Genetic Programming approach to produce computer programs with a much simpler structure than the above data-mining approaches, yet yielding satisfactory performance on the KDD-99 test set. The motivation is to provide transparent solutions that execute in real time with modest computational resources. Second, the training system must scale well on a comparatively large training data set (the 10% KDD-99 training set consists of approximately half a million patterns) whilst providing a small enough computational footprint to complete training in a matter of hours on a personal computer. Third, genetic programs are translated into human-readable format and compared with human reasoning. This will facilitate a better understanding of what constitutes a good rule, as well as gain the confidence of the more technically motivated users. Finally, only the most basic feature set is utilized, as opposed to the 41 connection features used in the decision tree solutions. To this end, a Page-based Linear Genetic Program with Dynamic Subset Selection and Random Subset Selection is proposed. Forty trials are conducted and results are compared with the KDD-99 winning entries. One individual was simplified by removing structural introns and analyzed; the solution was found to be comparable to the type of rules extracted by domain experts on the same data set. In the following text, Section 2 summarizes the properties associated with the KDD-99 IDS data set. Section 3 details the Genetic Programming (GP) approach taken, with a particular emphasis on the methodology used to address the size of the data set.
Section 4 describes parameter settings and evaluates experiment results. Finally, conclusions and future directions are discussed in Section 5.
2 Intrusion Detection Problem

From the perspective of the Genetic Programming (GP) paradigm, KDD-99 posed several challenges [1, 2]. The amount of data is much larger than is normally the case in GP applications. The entire training dataset consists of about 5,000,000 connection records. However, KDD-99 provided a concise training dataset, which is used in this work and appears to have been utilized by the entries to the data-mining competition [3-7]. Known as "10% training", this contains 494,021 records, among which there are 97,278 normal connection records (i.e. 19.69%). The size of the dataset is addressed by using a hierarchical competition between patterns, based on the Dynamic Subset Selection technique [8]. Each connection record is described in terms of 41 features and a label declaring the connection as either normal, or as a specific attack type. Of the 41 features, only the first eight (of nine) "Basic features of an Individual TCP connection", hereafter referred to as 'basic features', are employed in this work. The additional 32 derived features fall into three categories:
Content features: domain knowledge is used to assess the payload of the original TCP packets. This includes features such as the number of failed login attempts.

Time-based traffic features: these features are designed to capture properties that mature over a 2-second temporal window. One example of such a feature would be the number of connections to the same host over the 2-second interval.

Host-based traffic features: these utilize a historical window estimated over the number of connections – in this case 100 – instead of time. Host-based features are therefore designed to assess attacks that span intervals longer than 2 seconds.

In this work, none of these additional features are employed, as they appear to almost act as flags for specific attack behaviors. Our interest is in assessing how far the GP paradigm can go on 'basic features' alone. Third, the training data encompasses 24 different attack types, grouped into one of four categories: User to Root; Remote to Local; Denial of Service; and Probe. Naturally, the distribution of these attacks varies significantly, in line with their function – 'Denial of Service', for example, results in many more connections than 'Probe'. Table 1 summarizes the distribution of attack types across the training data. Test data, on the other hand, follows a different distribution than the training data, where this has previously been shown to be a significant factor in assessing generalization [3]. Finally, the test data added an additional 14 attack types not included in the training data.

Table 1. Distribution of Attacks
Data Type    Training    Test
Normal       19.69%      19.48%
Probe         0.83%       1.34%
DOS          79.24%      73.90%
U2R           0.01%       0.07%
R2L           0.23%       5.20%
Given that this work does not make use of the additional 32 derived features, it is necessary for the detector to derive any temporal properties associated with the current pattern, x(t), itself. To this end, as well as providing the detector with the labeled feature vector for the current pattern, [x(t), d(t)], the detector is also allowed to address the previous n patterns at some sampling interval (modulo n).
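A sketch of this windowing follows. Names are illustrative, and the text leaves it ambiguous whether the window runs [t .. t−28] or extends to t−32; this sketch assumes eight samples starting at t, giving 8 features × 8 records = 64 inputs, with out-of-range lags zero-filled (also an assumption).

```python
def temporal_inputs(records, t, n_features=8, n_lags=8, stride=4):
    """Flatten the detector's view of connection t: for each lag j in
    0..n_lags-1, take the basic-feature tuple of record t - stride*j, so
    that Input[j][i] is the i-th basic feature at temporal offset 4*j."""
    inputs = []
    for j in range(n_lags):
        idx = t - j * stride
        # records before the start of the trace fall back to zeros (assumed)
        rec = records[idx] if idx >= 0 else (0,) * n_features
        inputs.extend(rec)
    return inputs
```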
3 Methodology

In this work a form of Linearly-structured GP (L-GP) is employed [9-12]. That is to say, rather than expressing individuals using the tree-like structure popularized by the work of Koza [13], individuals are expressed as a linear list of instructions [9]. Execution of an individual therefore mimics the process of program execution normally associated with a simple register machine. That is, instructions are defined in terms of an opcode and operand (synonymous with function and terminal sets respectively) that modify the contents of internal registers {R[0],…,R[k]}, memory and program counter [9]. Output of the program is taken from register R[0] on completion of program execution (or some appropriate halting criterion [11]). Moreover, in an attempt to make the action of the crossover operator less destructive, the Page-based formulation of L-GP is employed [12]. In this case, an individual is described in terms of a number of pages, where each page has the same number of instructions. Crossover is limited to the exchange of single pages between two parents, and appears to result in concise solutions across a range of benchmark regression and classification problems. Moreover, a mechanism for dynamically changing page size was introduced, thus avoiding problems associated with the a priori selection of a specific number of instructions per page at initialization. Mutation operators take two forms. In the first case, the 'mutation' operator selects an instruction for modification with uniform probability and performs an Ex-OR with a second instruction, also created with uniform probability. If the ensuing instruction represents a legal instruction, the new instruction is accepted; otherwise the process is repeated. The second mutation operator, 'swap', is designed to provide sequence modification. To do so, two instructions are selected within the same individual with uniform probability and their positions exchanged. As indicated in the introduction, the specific interest of this work lies in identifying a solution to the problem of efficiently training with a large dataset (around half a million patterns). To this end, we revisit the Dynamic Subset Selection method [8] and extend it to a hierarchy of subset selections.
There are at least two aspects to this problem: the cost of fitness evaluation – the inner loop, which dominates the computational overhead associated with applying GP in practice – and the overhead associated with datasets that do not fit within RAM alone. In this work, a hierarchy is employed in which the data set is first partitioned into blocks small enough for retention in RAM, whilst a competition is initiated between training patterns within a selected block. Such a scheme also naturally matches the design methodology for computer memory hierarchies [14]. The selection of blocks is performed using Random Subset Selection (RSS) – layer 1. Dynamic Subset Selection (DSS) enforces a competition between different patterns – layer 2.

3.1 Subset Selection

First layer. The KDD-99 10% training data set was divided into 100 blocks with 5,000 connection records per block. The size of such blocks is chosen to ensure that, when selected, they fit within the available RAM. Blocks are randomly selected with uniform probability. Once selected, a history of training pressure on a block is used to set the number of iterations performed at the next layer in DSS. This is performed in proportion to the performance of the best-case individual. Thus, the number of DSS iterations, I, in block b at the current instance, i, is

I_b(i) = I(max) × E_b(i − 1)    (5)

where I(max) is the maximum number of subsets selected on a block, and E_b(i − 1) is the error rate of the best individual on the previous instance of block b. Hence, E_b(i) = 1 − [hits_b(i) / #connections(b)], where hits_b(i) is the hit count
over block b for the best-case individual identified over the last DSS tournament at iteration i of block b; and #connections(b) is the total number of connections in block b.

Second layer. A simplified DSS is deployed in this layer. That is, fixed probabilities are used to control the weighting of selection between age and difficulty. For each record in the DSS subset, there is a 30% (70%) probability of selecting on the basis of age (difficulty). Thus, a greater emphasis is always given to examples that resist classification. DSS utilizes a subset size of 50, with the objective of reducing the computational complexity associated with a particular fitness evaluation. Moreover, in order to further reduce computation, the performance of parent individuals on a specific subset is retained. After 6 tournaments the DSS subset is reselected.

DSS selection [8]. In the RSS block, every pattern is associated with an age value – the number of DSS selections since it was last selected – and a difficulty value. The difficulty value is the number of individuals that were unable to classify a connection correctly the last time that connection appeared in the DSS subset. Connections enter a specific DSS subset stochastically: with 30% (70%) probability a slot is filled on the basis of age (difficulty), with roulette-wheel selection then conducted over the whole RSS block with respect to age (difficulty). After the DSS subset is filled, the age and difficulty of the selected connections are reset. For the rest, age is increased by 1 and difficulty remains unchanged.

3.2 Parameterization of the Subsets

The low number of patterns actually seen by a GP individual during fitness evaluation, relative to the number of patterns in the training data set, may naturally lead to overfitting on specific subsets. Our general objective was therefore to ensure that the performance across subsets evolved as uniformly as possible.
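A minimal sketch of the second-layer (DSS) selection described in Section 3.1, assuming each record carries explicit age and difficulty counters (names illustrative):

```python
import random

def dss_select(block, subset_size=50, p_age=0.3):
    """Fill a DSS subset from an RSS block. Each slot is filled by age
    with probability 0.3, else by difficulty, via roulette-wheel selection
    over the whole block; on completion, selected records have both
    counters reset while the rest age by one (difficulty unchanged).
    The +1 weight offset, which keeps zero-valued counters selectable,
    is an assumption."""
    chosen = set()
    while len(chosen) < subset_size:
        key = 'age' if random.random() < p_age else 'difficulty'
        weights = [getattr(r, key) + 1 for r in block]
        (pick,) = random.choices(range(len(block)), weights=weights)
        chosen.add(pick)
    for i, r in enumerate(block):
        if i in chosen:
            r.age = r.difficulty = 0   # reset on selection
        else:
            r.age += 1                 # difficulty unchanged
    return [block[i] for i in sorted(chosen)]
```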
The principal interest is therefore to identify the stopping criterion necessary to avoid individuals that are sensitive to the composition of a specific second-level subset. To this end, a single experiment was conducted in which 2,000 block selections were made with uniform probability. For each block selection, there are 400 DSS selections. Before selection of the next block takes place, the best-performing individual (with respect to subset classification error) is evaluated over all patterns within the block; let this be the block error at selection i, or E_b(i). A sliding window is then constructed consisting of 100 block-selection errors, and a linear least-squares regression performed. The gradient of each linear regression is then plotted, Figure 1 (1,900 points). A negative trend implies that the block errors are decreasing, whereas a positive trend implies that they are increasing (the continuous line indicates the trend). It is now apparent that after the first 750 block selections, the trend in block error stops decreasing. After 750 selections, oscillation in the block gradients appears, becoming very erratic in the last 500 block selections.
Fig. 1. Gradient of block error using the best-case DSS individual. The x-axis represents the tournament and the y-axis the slope of the best-fitting line over a 100-point window
On the basis of these observations, the number of DSS selections per block is limited to 100 (from 400) – with the objective of further reducing any tendency to prematurely specialize – where the principal cost is a higher number of block selections, 1,000 in this case.

3.3 Structural Removal of Introns

Introns are program pieces that have no influence on the output, but appear to be a factor in the evolution of solutions. Two forms of introns are often distinguished: structural introns and semantic introns. Structural introns manipulate variables that are not used for the calculation of the outputs at that program position, whereas semantic introns manipulate variables on which the state of the program is invariant [15]. In this work, structural introns are detected using the following pseudocode, initiated once evolution is complete, with the last reference to R[0] (the register a priori defined as the output) as the input argument:

markExon(reg, i) {
    // scan backward for the most recent instruction that writes to reg
    while (destination of instruction i != reg) {
        i--;
        if (i < 0) return;
    }
    mark instruction i as exon;
    markExon(operand1 of instruction i, i - 1);
    markExon(operand2 of instruction i, i - 1);
}
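A runnable version of the markExon pseudocode, under an assumed instruction encoding (opcode, destination, source1, source2), where a source naming a register propagates the dependency and constants or inputs do not:

```python
def mark_exons(program, output_reg='R0'):
    """Return the indices of effective (exon) instructions by tracing
    register dependencies backward from the final write to the output
    register; everything unmarked is a structural intron."""
    exons = set()

    def mark(reg, i):
        while i >= 0 and program[i][1] != reg:   # last instruction writing reg
            i -= 1
        if i < 0 or i in exons:
            return
        exons.add(i)
        for src in program[i][2:]:
            if isinstance(src, str) and src.startswith('R'):
                mark(src, i - 1)                 # follow register sources only

    mark(output_reg, len(program) - 1)
    return exons

prog = [
    ('LOD', 'R0', 20, None),      # exon: seeds the output register
    ('ADD', 'R1', 'R1', 5),       # structural intron: never reaches R0
    ('SUB', 'R0', 'R0', 'in5'),   # exon
    ('MUL', 'R0', 'R0', 'in1'),   # exon
]
```

Here mark_exons(prog) identifies instructions 0, 2 and 3 as exons; the ADD on R1 is a structural intron and is stripped.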
4 Experiment

The following experiments are based on 40 runs using Dynamic Page-based L-GP. Runs differ only in their choice of the random seed initializing the population. Table 2 lists the common parameter settings for all runs. The total number of records in the training and test sets is listed in Table 3. The methods used for encouraging the identification of temporal relationships and for composing the instruction set are defined as follows.

Sequencing information. As indicated above, only the 8 basic features of each connection are used, corresponding to: Duration; Protocol; Service; normal or error status of the connection (Flag); number of data bytes from source to destination (DST); number of data bytes from destination to source (SRC); LAND (1 if the connection is from/to the same host/port, 0 otherwise); and number of "wrong" fragments (WRONG). This implies that GP is required to determine the temporal features of interest itself. To do so, for each 'current' connection record, x(t), GP is permitted to index the previous 32 connection records relative to the current sample, modulo 4. Thus, for each of the eight basic TCP/IP features available in the KDD-99 dataset, GP may index the 8 connection records [(t), (t − 4), … (t − 32)], where the objective is to provide the label associated with sample t.

Table 2. Parameter Settings for Dynamic Page-based Linear GP

Parameter                                  Setting
Population size                            125
Maximum number of pages                    32 pages
Page size                                  8 instructions
Maximum working page size                  8 instructions
Crossover probability                      0.9
Mutation probability                       0.5
Swap probability                           0.9
Tournament size                            4
Number of registers                        8
Instruction type 1 probability             0.5
Instruction type 2 probability             4
Instruction type 3 probability             1
Function set                               {+, -, *, /}
Terminal set                               {0, .., 255} ∪ {i0, .., i63}
RSS subset size                            5000
DSS subset size                            50
RSS iteration                              1000
DSS iteration (6 tournaments/iteration)    100
Wrapper function                           0 if output <= 0, otherwise 1
Cost function                              Increment by 1 for each misclassification
Table 3. Distribution of Normal and Attack Connections

           Training    Test
Normal      97,249     60,577
Attacks    396,744    250,424
Instruction set. A 2-address format is employed in which provision is made for: up to 16 internal registers; up to 64 inputs (terminal set); 5 opcodes (function set) – the fifth is retained for a reserved word denoting end of program – and an 8-bit integer field representing constants (0–255) [12]. Two mode bits toggle between one of three instruction types: opcode with internal register reference; opcode with reference to input; target register with integer constant. Extension to include further inputs or internal registers merely increases the size of the associated instruction field. The output is taken from the first internal register. Training was performed on a Pentium III 1 GHz platform with 256 MB of RAM under Windows 2000. The 40 best individuals within the last tournament are recorded, then translated and simplified as per Section 3.3. Note that 'best' is defined with respect to the cost function used during training (Table 2). Performance of these cases is then expressed in terms of false positive (FP) and detection rates, estimated as follows:
Detection Rate = 1 − (#False Negatives / Total Number of Attacks)    (6)

False Positive Rate = #False Positives / Total Number of Normal Connections    (7)
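Expressed as code, for concreteness:

```python
def detection_rate(false_negatives, total_attacks):
    """Eq. (6): fraction of attack connections that are not missed."""
    return 1 - false_negatives / total_attacks

def false_positive_rate(false_positives, total_normal):
    """Eq. (7): fraction of normal connections flagged as attacks."""
    return false_positives / total_normal
```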
Figure 3 summarizes the performance of all 40 runs in terms of FP and detection rate on both training and test data. Of the forty cases, three represented degenerate solutions (not plotted); that is to say, they classified everything as normal (so only about 20% of the training classifications would be correct) or as attack (roughly 80% of the training connections would be correct). Outside of these three degenerate cases, it is apparent that solution classification accuracy is consistently achieved. Table 4 makes a direct comparison between the KDD-99 competition winners and the corresponding GP cases. Structural removal of introns (Section 3.3) resulted in a 5:1 decrease in the average number of instructions per individual (87 to 17 instructions). With the objective of identifying what type of rules were learnt, the GP individual with the best detection rate in Table 4 was selected for analysis. Table 5 lists this individual following removal of the structural introns.
Fig. 3. FP and detection rate of 40 runs on KDD-99 test data set

Table 4. Comparison with KDD-99 winning entries

                           Detection Rate    FP Rate
Winning entry              0.908819          0.004472
Second place               0.915252          0.00576
Best GP – FP rate          0.894096          0.006818
Best GP – Detection rate   0.908252          0.032669
Table 5. Anatomy of Best Individual

Opcode    Destination    Source
LOD       R[0]           20
SUB       R[0]           Input[2][5]
MUL       R[0]           Input[0][1]
DIV       R[0]           Input[0][4]
SUB       R[0]           Input[2][5]
SUB       R[0]           Input[6][5]
DIV       R[0]           Input[0][4]
Table 6 summarizes the performance of this individual over a sample of connection types, split into those seen during training (24 different types) and those appearing only in the test data (14 different types). Of particular interest here is that high classification accuracy is returned for connection types that are both frequent and rare, where it might have been assumed that only connection types with many examples would be learnt.
Table 6. Error rates on test data for top 16 attacks by individual with best detection rate

Seen type    % Misclassified    Examples    Unseen type     % Misclassified    Examples
Neptune           0              58,001     Udpstorm             0                  2
Portsweep         0                 354     Processtable         3.03             759
Land              0                   9     Saint                5.978            736
Nmap              0                  84     Mscan                8.452          1,053
Smurf             0.077         164,091     Httptunnel          15.823            158
Satan             3.552           1,633     Phf                 50                  2
Normal            3.267          60,577     Apache2             65.491            794
Re-expressing this individual analytically, below, indicates that statistics of the number of bytes from the responder and of the responder-originator byte ratio are utilized. This enables the individual to identify that the attacking telnet connections in the DARPA dataset are statistically different from the normal telnet connections. Moreover, it is not only telnet connections that can be classified in this way. Such a rule never misses a "Neptune", "portsweep", "land", "nmap" or "udpstorm" attack. It also provides good performance on "smurf", "processtable", "normal", "satan", "saint", "mscan" and "httptunnel". For "Neptune", there are many half-open TCP connections without any data transfer. In "smurf", there are many echo replies to the victim, but no echo requests from the victim. In "httptunnel", the attacker defines attacks on the HTTP protocol, which is normal, but the actual data-exchange ratio makes it different from normal traffic. To date, only [16] has argued that telnet connections can be differentiated by a rule of the form discovered here. It has been suggested that attacks be formulated with such a rule in mind [17], but without explicitly proposing the use of this statistic. Thus GP has in this case provided a unique generic rule for the detection of multiple attack types.
Output = ((20 − Input[2][5]) × Input[0][1] − Input[2][5] − Input[6][5]) / Input[0][4]

where Input[j][i] indexes the i-th input feature at temporal location t − 4 × j, and t (= 0) is the current connection.
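Re-expressed as code, the evolved rule is a one-line function over the windowed feature matrix. This is a sketch only: the meanings of feature indices 1, 4 and 5 (e.g. responder byte counts and ratios) are dataset-specific assumptions here, and inp[0][4] is assumed nonzero.

```python
def detector_output(inp):
    # inp[j][i]: the i-th 'basic' connection feature at temporal offset t - 4*j.
    # Which statistics indices 1, 4 and 5 denote is a dataset-specific
    # assumption; inp[0][4] is assumed to be nonzero.
    return ((20 - inp[2][5]) * inp[0][1] - inp[2][5] - inp[6][5]) / inp[0][4]
```

A connection would then be labelled normal or attack by thresholding this output.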
5 Conclusion

A Page-based Linear Genetic Programming system with DSS and RSS was implemented and tested on the KDD'99 benchmark dataset, a problem involving a training dataset of half a million patterns. To do so, a hierarchy of data subset selections is introduced such that GP only perceives 50 patterns of the total training set at any one
A Linear Genetic Programming Approach to Intrusion Detection
2335
time. Moreover, this hierarchy is designed to exploit the memory hierarchy commonly employed in computer architectures. As such, the ensuing system completes each trial in approximately 15 minutes on a modest laptop computing platform (1 GHz Pentium III, 256 MByte RAM), or 10 hours for 40 trials. In addition, only the 'basic' connection features are employed, with GP deriving the necessary temporal features itself. Performance approaches that of data-mining solutions based on all 41 features, whilst solution transparency is also supported and verified, enabling the user to learn from the solutions provided.

Note, however, that the principal design interest of this work was to demonstrate that GP could be applied to data-driven learning problems on large datasets. The resulting GP classifier represents an anomaly detector, providing a binary decision boundary: normal or attack. Extensions to include the classification of different attack types would involve training additional detectors on the subset of patterns labeled as attack. The ensuing hierarchy of detectors would provide an additional (attack) class label at each level. Future work is expected to include a dynamic cost function, with the objective of adjusting at run time the relative weighting associated with different attack types. Moreover, the function set at present is purely analytical; of interest would be the significance of conditional statements or modular code within this problem context.

Acknowledgements. This research was partially supported by NSERC Discovery Grants of Drs. Heywood and Zincir-Heywood.
References

1. Lippmann R.P., Fried D.J., Graf I., Haines J.W., Kendall K.R., McClung D., Weber D., Webster S.E., Wyschogrod D., Cunningham R.K., Zissman M.A.: Evaluating Intrusion Detection Systems: the 1998 DARPA Off-Line Intrusion Detection Evaluation. Proceedings of the 2000 DARPA Information Survivability Conference and Exposition, 2 (2000)
2. McHugh J.: Testing Intrusion Detection Systems: A Critique of the 1998 and 1999 DARPA Intrusion Detection System Evaluations as Performed by Lincoln Laboratory. ACM Transactions on Information and System Security. 3(4) (2000) 262–294
3. Elkan C.: Results of the KDD'99 Classifier Learning Contest. SIGKDD Explorations. ACM SIGKDD. 1(2) (2000) 63–64
4. Wenke L., Stolfo S.J., Mok K.W.: A Data Mining Framework for Building Intrusion Detection Models. Proceedings of the 1999 IEEE Symposium on Security and Privacy (1999) 120–132
5. Pfahringer B.: Winning the KDD99 Classification Cup: Bagged Boosting. SIGKDD Explorations. ACM SIGKDD. 1(2) (2000) 65–66
6. Levin I.: KDD-99 Classifier Learning Contest: LLSoft's Results Overview. SIGKDD Explorations. ACM SIGKDD. 1(2) (2000) 67–75
7. Vladimir M., Alexei V., Ivan S.: The MP13 Approach to the KDD'99 Classifier Learning Contest. SIGKDD Explorations. ACM SIGKDD. 1(2) (2000) 76–77
8. Gathercole C., Ross P.: Dynamic Training Subset Selection for Supervised Learning in Genetic Programming. Parallel Problem Solving from Nature III. Lecture Notes in Computer Science, Vol. 866. Springer-Verlag, Berlin (1994) 312–321
9. Cramer N.L.: A Representation for the Adaptive Generation of Simple Sequential Programs. Proceedings of the International Conference on Genetic Algorithms and Their Application (1985) 183–187
10. Nordin P.: A Compiling Genetic Programming System that Directly Manipulates the Machine Code. In: Kinnear K.E. (ed.): Advances in Genetic Programming, Chapter 14. MIT Press, Cambridge, MA (1994) 311–334
11. Huelsbergen L.: Finding General Solutions to the Parity Problem by Evolving Machine-Language Representations. Proceedings of the 3rd Conference on Genetic Programming. Morgan Kaufmann, San Francisco, CA (1998) 158–166
12. Heywood M.I., Zincir-Heywood A.N.: Dynamic Page-Based Linear Genetic Programming. IEEE Transactions on Systems, Man and Cybernetics – Part B: Cybernetics. 32(3) (2002) 380–388
13. Koza J.R.: Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge, MA (1992)
14. Hennessy J.L., Patterson D.A.: Computer Architecture: A Quantitative Approach. 3rd Edition. Morgan Kaufmann, San Francisco, CA (2002)
15. Brameier M., Banzhaf W.: A Comparison of Linear Genetic Programming and Neural Networks in Medical Data Mining. IEEE Transactions on Evolutionary Computation. 5(1) (2001) 17–26
16. Caberera J.B.D., Ravichandran B., Mehra R.K.: Statistical Traffic Modeling for Network Intrusion Detection. Proceedings of the 8th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (2000) 466–473
17. Kendall K.: A Database of Computer Attacks for the Evaluation of Intrusion Detection Systems. Master Thesis. Massachusetts Institute of Technology (1998)
Genetic Algorithm for Supply Planning Optimization under Uncertain Demand

Tezuka Masaru and Hiji Masahiro

Hitachi Tohoku Software, Ltd.
2-16-10, Honcho, Aoba ward, Sendai City, 980-0014, Japan
{tezuka,hiji}@hitachi-to.co.jp
Abstract. Supply planning optimization is one of the most important issues for manufacturers and distributors. Supply is planned to meet future demand, and under the uncertainty involved in demand forecasting, profit must be maximized and risk minimized. In order to simulate this uncertainty and evaluate profit and risk, we introduce Monte Carlo simulation; the fitness function of the GA uses statistics obtained from the simulation. Supply planning problems are multi-objective, so there are several Pareto optimal solutions, from high-risk and high-profit to low-risk and low-profit. Such solutions are very helpful as alternatives for decision-makers, and to provide them a multi-objective genetic algorithm is employed. In practice, it is important to obtain good enough solutions in an acceptable time. To search for solutions quickly, we propose Boundary Initialization, which initializes the population on the boundary of the constrained space and makes the search efficient. The approach was tested on the supply planning data of an electric appliances manufacturer and achieved remarkable results.
1 Introduction
Manufacturers and distributors deal with a number of products. Supply planning problems involve deciding the quantity, the type, and the due time of each product to supply as replenishment. The supply plan is decided depending on the demand forecast of the products, and the forecasted demand involves uncertainty. If the demand exceeds the supply, opportunity losses occur, while excess supply increases the inventory level and may result in dead stock. In order to supply products, several resources are consumed to produce and deliver them; materials, production machines, and transportation are examples of such resources. The availability of the resources is limited, thus the supply quantity of the products also has a limit. Under the resource constraints and the uncertainty of demand, the supply plan is made to maximize profit.

Traditionally, the inventory management approach has been used to create supply plans [1]. This approach decides the supply plan so as to minimize the stockout rate, or opportunity loss rate. In practice, however, this approach has a problem: it does not take account of the relationship between profit and

E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 2337–2346, 2003. © Springer-Verlag Berlin Heidelberg 2003
2338
T. Masaru and H. Masahiro
risk. For example, under the resource constraints, we can expect more profit by reducing the opportunity loss of a product with a high gross margin than by doing so for one with a low gross margin. However, the demand for high-gross-margin products is often uncertain, while the demand for regular products with low gross margins is relatively steady, so the risk is low. Thus the right mixture of high-risk and low-risk products is important to achieve a certain amount of profit while minimizing risk. In other words, portfolio management needs to be introduced into supply planning. It is desirable to optimize the profit and risk of supply planning problems under uncertain demand.

There is also an operational problem. Supply planning problems are multi-objective: though the maximization of profit and the minimization of risk are required simultaneously, these objectives are in a trade-off relationship. Therefore, the problem has several Pareto optimal solutions, from high-risk and high-profit to low-risk and low-profit. For decision-makers, such alternative solutions are very helpful, because all they have to do is select the one solution that fits their strategy from the alternatives. Many existing supply planning methods, however, create only one solution, and most decision-makers do not like a black-box system that proposes a single solution.

To produce efficient supply plans, we introduced a multi-objective Genetic Algorithm (GA) and Monte Carlo simulation into supply planning. GAs are considered efficient for optimizing multi-objective problems [2,3,4]: a number of individuals promote the optimization in parallel, and this characteristic is well suited to finding Pareto optimal solutions all at once. We introduced Monte Carlo simulation [5] into the evaluation process of the GA; the simulation models the uncertainty of demand, and profit and risk are then evaluated from the simulation result.
In order to search for solutions in an acceptable time, we propose Boundary Initialization, which initializes the population on the boundary of the constrained space and makes the search efficient. We briefly describe supply planning problems in section 2, and propose a GA with Monte Carlo simulation-based evaluation for supply planning, together with the efficient population initialization method, in section 3. Section 4 shows the results of computational experiments. We conclude in section 5.
2 Demand and Supply Planning

2.1 Supply Planning and Resource Constraints
Manufacturers and distributors deal with many kinds of products. A firm deals with I kinds of products and wants to make a supply plan for a certain period consisting of T terms. dti, pti, and qti are the demand, the supply quantity, and the initial inventory quantity of product i in term t, respectively. Sales quantity sti and opportunity loss quantity lti are obtained by the following equations.

sti = min(dti, pti + qti)    (1)

lti = dti − sti    (2)
Genetic Algorithm for Supply Planning Optimization
2339
The initial inventory quantity of the next term is

q(t+1)i = pti + qti − sti.    (3)
The firm sells product i for unit price uti in term t; the unit supply cost is vti and the unit inventory cost is wti. The gross profit of the firm over the planning period is then given by equation (4).

G = Σ_{t=1}^{T} Σ_{i=1}^{I} (sti uti − pti vti − qti wti)    (4)
The first term represents the sales amount; the second and third terms are the supply cost and the inventory cost. The opportunity loss amount L is the sales amount that could have been gained if the firm had had enough supply and inventory, so it is defined by the following equation.

L = Σ_{t=1}^{T} Σ_{i=1}^{I} lti uti    (5)
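For one concrete demand scenario, equations (1)–(5) can be evaluated directly. The sketch below assumes zero initial inventory and T × I nested lists for all quantities; the function name and data layout are illustrative, not from the paper.

```python
def evaluate_plan(p, d, u, v, w):
    """Gross profit G (eq. 4) and opportunity loss L (eq. 5) for one
    demand scenario, following eqs. (1)-(3).

    p, d, u, v, w are T x I lists: supply, demand, unit price,
    unit supply cost, unit inventory cost. Initial inventory is
    assumed to be zero for every product.
    """
    T, I = len(p), len(p[0])
    q = [0.0] * I                 # inventory carried into the current term
    G = L = 0.0
    for t in range(T):
        for i in range(I):
            s = min(d[t][i], p[t][i] + q[i])   # sales quantity, eq. (1)
            l = d[t][i] - s                    # opportunity loss quantity, eq. (2)
            G += s * u[t][i] - p[t][i] * v[t][i] - q[i] * w[t][i]  # eq. (4)
            L += l * u[t][i]                   # eq. (5)
            q[i] = p[t][i] + q[i] - s          # next-term inventory, eq. (3)
    return G, L
```

With two terms and one product, over-supplying in term 1 carries inventory (and its cost) into term 2, while under-supplying in term 2 produces opportunity loss, exactly as the equations describe.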
Demand forecast dti is an uncertain estimate, that is, a stochastic variable, while supply quantity pti is a decision variable. Since dti is stochastic, its dependent variables, gross profit G and opportunity loss L, are naturally stochastic as well. The distributions of G and L depend on the decision variable pti. Thus, the objective of the supply planning problem is to decide the supply quantities so that the distributions are optimal.

Resources are consumed as products are supplied. For example, the machines that produce products and the trucks that transport them are resources. The firm has J kinds of resources. An amount rij of resource j is consumed to supply a unit quantity of product i, and atj is the available quantity of resource j in term t. The amount of a resource consumed must not exceed its available quantity. On the other hand, supply quantities must not be negative. Therefore, the constraints of supply planning problems are defined as follows.

Σ_{i=1}^{I} rij pti ≤ atj    (t = 1, ..., T, j = 1, ..., J)    (6)

pti ≥ 0    (t = 1, ..., T, i = 1, ..., I)    (7)
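Constraints (6) and (7) amount to a per-term feasibility check. A minimal sketch, with per-unit resource usage r and availability a laid out as nested lists (an assumed representation, not the paper's):

```python
def is_feasible(p, r, a):
    """Check resource constraints (6) and non-negativity (7).

    p: T x I supply quantities, r: I x J per-unit resource usage,
    a: T x J available resource quantities per term.
    """
    T, I, J = len(p), len(r), len(r[0])
    for t in range(T):
        for i in range(I):
            if p[t][i] < 0:                                   # eq. (7)
                return False
        for j in range(J):
            used = sum(r[i][j] * p[t][i] for i in range(I))
            if used > a[t][j]:                                # eq. (6)
                return False
    return True
```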
The constraint space is convex.
2.2 Optimization Criteria
Supply planning problems are multi-objective; they have a number of criteria for profit and risk. Statistics of the distributions of the gross profit and the opportunity loss are used as the optimization criteria. Figure 1 summarizes these statistics: the x-axis indicates gross profit G or opportunity loss L, and the y-axis indicates its probability.
2340
T. Masaru and H. Masahiro
Fig. 1. Optimization criteria: the probability distribution of gross profit G or opportunity loss L, annotated with its expected value, variance, 100α% confidence interval, and the lower and upper confidence limits (each tail holding 100(1−α)/2 percent)
One of the most important objectives of business is maximizing profit. Thus, the expected gross profit is naturally used as a criterion of profit, and it should be maximized. The variance of the gross profit reflects the volatility of the outcome; less volatility is preferable, so the variance is used as a risk criterion and should be minimized. The expected opportunity loss is also a risk criterion to be minimized. The lower confidence limit of the 100α percent confidence interval is the lower 100(1−α)/2 percentile. For example, the lower confidence limit of a 95 percent confidence interval of the gross profit is the lower 2.5 percentile of the profit; the probability of the gross profit falling below this limit is only 2.5 percent. The lower confidence limit is an inverse criterion of risk, since it indicates the worst case of the profit, and it should be maximized. Likewise, the upper confidence limit of the opportunity loss is a risk criterion, since it indicates the worst case of the loss, and it should be minimized.
2.3 Demand Forecasting
Future demand can be forecasted from the history of demand. There are several forecasting methods, such as the Moving Average Method, the Exponential Smoothing Method, and the Box-Jenkins Method [6]. In this paper, existing commercial software was used to forecast future demand. The software analyzes the demand history, automatically chooses the forecasting method that best fits the historical data, and then forecasts the expected value and the variance of the future demand. The demand forecast is assumed to follow a normal distribution, though its left tail is truncated at zero.
3 Genetic Algorithms Approach

3.1 Genetic Representation
In order to optimize supply planning problems, we employed a real-coded genetic algorithm. Each individual is a vector of real values, and the vector
x = (x1, ..., xT×I) corresponds to the set of supply quantities. Its element x(t−1)×I+i corresponds to supply quantity pti, i.e.

x = (p11, p12, ..., p1I, p21, p22, ..., p2I, ..., pT1, pT2, ..., pTI).    (8)
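The mapping of eq. (8) between a T × I supply plan and the flat chromosome is a simple index calculation. A sketch using 0-based indices (the helper names are ours, not the paper's):

```python
def plan_to_chromosome(p):
    """Flatten a T x I supply plan into the real-valued vector of eq. (8):
    with 0-based indices, x[t * I + i] = p_ti."""
    return [p[t][i] for t in range(len(p)) for i in range(len(p[0]))]

def chromosome_to_plan(x, T, I):
    """Inverse mapping: recover the T x I plan p_ti from the flat vector."""
    return [[x[t * I + i] for i in range(I)] for t in range(T)]
```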
3.2 Fitness Function Using Monte Carlo Simulation
In order to evaluate the criteria of supply planning problems, we introduced Monte Carlo simulation into the fitness function of the GA. Monte Carlo simulation is a simulation method in which a large quantity of random numbers is used to calculate statistics. It calculates multiple scenarios of the gross profit and the opportunity loss by sampling demand quantities from random numbers following their probability distributions. Figure 2 describes the Monte Carlo simulation used to evaluate a supply plan in detail. The future demand of each product is forecasted as an expected value and a variance; µti and σti denote the expected demand and its standard deviation for product i in term t. Demand is considered to follow a normal distribution.

01 begin evaluation
02   m := 1
03   repeat                        *** Repeat simulation for an evaluation ***
04     Gm := 0, Lm := 0            *** Initialize gross profit and opp. loss of m-th simulation ***
05     i := 1
06     repeat
07       q1i := 0
08       t := 1
09       repeat
10         *** Calculate the profit and the loss from product i in term t ***
11         *** and sum them into the m-th profit and loss ***
12         dti := random number following N(µti, σti)
13         sti := min(dti, pti + qti)
14         lti := dti − sti
15         Gm := Gm + (sti uti − pti vti − qti wti)
16         Lm := Lm + lti uti
17         q(t+1)i := pti + qti − sti
18         t := t + 1
19       until t > T
20       i := i + 1
21     until i > I
22     m := m + 1
23   until m > M
24   Calculate optimization criteria
25 end evaluation

Fig. 2. Monte Carlo Simulation to Evaluate Supply Plan
2342
T. Masaru and H. Masahiro
The evaluation of one individual, or one supply plan, consists of M simulations of the gross profit and the opportunity loss. The block from line 4 to 22 in the figure corresponds to one simulation. In each simulation, the demand of each product in each term is simulated by a random number following the normal distribution N(µ, σ) (line 12 in the figure). After the iterations, we obtain M samples of gross profit Gm and opportunity loss Lm (m = 1, ..., M). Statistics serving as the optimization criteria are calculated from these samples. For example, the expected gross profit and its variance are estimated by equations (9) and (10).

Ĝ = (1/M) Σ_{m=1}^{M} Gm    (9)

U(G) = (1/(M−1)) Σ_{m=1}^{M} (Gm − Ĝ)²    (10)
Ĝ should be maximized as a matter of course, and U(G) should be minimized since less volatility is preferable.
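Equations (9) and (10) are the sample mean and the unbiased sample variance over the M simulated profits. A small sketch, parameterized by any scenario simulator (the simulate_profit callable is an assumption of this sketch, standing in for the loop body of Fig. 2):

```python
import random

def monte_carlo_stats(simulate_profit, M=1000, rng=None):
    """Estimate G-hat (eq. 9) and U(G) (eq. 10) from M simulations.

    `simulate_profit(rng)` is any callable that draws one demand
    scenario and returns the resulting gross profit G_m.
    """
    rng = rng or random.Random(0)
    samples = [simulate_profit(rng) for _ in range(M)]
    g_hat = sum(samples) / M                                   # eq. (9)
    u_g = sum((g - g_hat) ** 2 for g in samples) / (M - 1)     # eq. (10)
    return g_hat, u_g
```

The other criteria of Sec. 2.2 (expected opportunity loss, confidence limits as sample percentiles) follow the same pattern over the Lm and Gm samples.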
3.3 Genetic Operators and Selection
Supply planning problems are multi-objective. The outcome of multi-objective optimization is not a single solution but a set of solutions known as Pareto optimal solutions, among which no objective can be improved without another objective being degraded. A vector u = (u1, ..., un) is superior to v = (v1, ..., vn) when u is partially greater than v, i.e.,

∀i, ui ≥ vi ∧ ∃i, ui > vi.    (11)
Any solution to which no other solution is superior is considered optimal. Since supply plan optimization is multi-objective, it has several Pareto optimal solutions. The supply plan optimization problem has convex constraints, and the genetic operators proposed by Michalewicz [7][8] can handle convex constraints effectively; thus we employed these operators and modified them for multi-objective problems. The GA has two mutation operators, uniform and boundary, and two crossover operators, arithmetic and heuristic. The mutation operator selects one locus k from individual x at random and changes the value of xk within the range satisfying the constraints. Uniform mutation changes xk to a random number following a uniform distribution, and boundary mutation changes xk to the boundary of the constrained space. Let x and y denote parents and z denote offspring. Arithmetic crossover reproduces offspring as z = λx + (1 − λ)y, where λ is a random value following the uniform distribution [0, 1]. Since the constrained solution space is convex, whenever both x and y satisfy the constraints, arithmetic crossover guarantees the feasibility of z.
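The superiority relation of eq. (11) translates directly into a predicate (maximization assumed; the function name is ours):

```python
def dominates(u, v):
    """Pareto superiority test of eq. (11), for maximization:
    u is superior to v iff u_i >= v_i for all i and u_i > v_i for some i."""
    return (all(a >= b for a, b in zip(u, v))
            and any(a > b for a, b in zip(u, v)))
```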
Heuristic crossover uses the evaluations of the two parents to determine the search direction. The offspring reproduced by heuristic crossover is z = x + λ(x − y), where the evaluation of individual x is superior to that of y and λ is a random number in the range [0, 1]. In case the offspring is not feasible, the operator makes further attempts to generate a feasible one.

The supply plan problems we deal with are multi-objective. Thus, in order to determine the search direction of heuristic crossover, we introduced a comparison procedure consisting of three steps. Ei(x), (i = 1, ..., K) denotes the evaluation of the i-th criterion of individual x, where K is the number of criteria. In the case of maximization, the procedure is as follows:

– Step 1: Compare the two parents according to Pareto optimality. Individual x is superior to y when

∀i, Ei(x) ≥ Ei(y) ∧ ∃i, Ei(x) > Ei(y).    (12)

Otherwise, go to step 2.
– Step 2: Compare the number of individuals superior to each parent. The parent dominated by the smaller number of other individuals in the current population is considered superior. If the same number of individuals dominate both x and y, go to step 3.
– Step 3: Randomly select either of the two parents as superior.

We employed tournament selection [9]. It selects k individuals at random and selects the best one among them; k is a parameter called the tournament size, which determines the selective pressure. Pareto optimality is also applied to determine which individual wins the tournament. Superior individuals have more chances to be selected; consequently, the Pareto optimal solution set is explored.
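The three-step comparison can be sketched as follows; representing the current population as a list of criteria vectors is an assumption of this sketch, as are the function and parameter names:

```python
import random

def better_parent(x_eval, y_eval, population_evals, rng=None):
    """Three-step comparison of Sec. 3.3 (maximization). Returns 0 if the
    first parent is considered superior, 1 otherwise. `population_evals`
    holds the criteria vectors of the current population."""
    rng = rng or random.Random(0)

    def dominates(u, v):                                  # eq. (12)
        return (all(a >= b for a, b in zip(u, v))
                and any(a > b for a, b in zip(u, v)))

    # Step 1: direct Pareto comparison of the two parents.
    if dominates(x_eval, y_eval):
        return 0
    if dominates(y_eval, x_eval):
        return 1
    # Step 2: the parent dominated by fewer population members wins.
    dom_x = sum(dominates(e, x_eval) for e in population_evals)
    dom_y = sum(dominates(e, y_eval) for e in population_evals)
    if dom_x != dom_y:
        return 0 if dom_x < dom_y else 1
    # Step 3: tie broken at random.
    return rng.choice([0, 1])
```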
3.4 Boundary Initialization
In the GA we introduced, all the individuals in the initial population must satisfy the constraints. Thus, the population is generally initialized randomly within the constrained space. In practice, it is more important to find good enough solutions in acceptable time than to find genuine optimal solutions. We therefore propose a new initialization method, Boundary Initialization. Practically and empirically, most optimal solutions of real-world constrained problems lie on the boundary of the constraints. The method initializes the population randomly on the boundary of the constrained space, making the search for solutions efficient. In figure 3, black dots are individuals produced by Boundary Initialization, and white dots are individuals produced by Random Initialization.

Fig. 3. Boundary Initialization and Random Initialization

It is said that the diversity of the initial population is very important because it ensures exploration of the whole search space. Therefore, there is a possibility that the search fails to reach optimal solutions when the initial population is biased toward the boundary. However, we expect that this bias boosts search efficiency in solving practical problems, and that Boundary Initialization should work well.
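One way to realize Boundary Initialization for the resource constraints of eq. (6) is to draw a random non-negative direction and scale it until the tightest resource constraint becomes active; since the feasible region is convex and contains the origin, the scaled point lies on its boundary. This is one interpretation under stated assumptions, not necessarily the authors' exact procedure:

```python
import random

def boundary_init(r, a, rng=None):
    """Sketch of Boundary Initialization for one term.

    r: I x J per-unit resource usage, a: length-J available quantities.
    Draws a random non-negative direction and scales it so that at least
    one resource constraint of eq. (6) becomes active.
    """
    rng = rng or random.Random(0)
    I, J = len(r), len(r[0])
    x = [rng.random() for _ in range(I)]           # random direction, all >= 0
    # Largest feasible scale for each resource; the tightest one puts the
    # point on the boundary of the convex feasible region.
    scales = []
    for j in range(J):
        used = sum(r[i][j] * x[i] for i in range(I))
        if used > 0:
            scales.append(a[j] / used)
    lam = min(scales)
    return [lam * xi for xi in x]
```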
4 Computational Experiments
The supply planning method with the GA was tested on data provided by an electric appliances manufacturer. The data consist of ten product groups and four key resources; in our experiments, each product group is treated as a product. The population size was set to 100, the termination criterion was 50 generations, and the tournament size was 4. We adopted the elitist policy [10] with an elite size of 5. The number of simulation iterations per evaluation was 1000, which means gross profit and opportunity loss were calculated 5,000,000 times in one optimization run. One optimization run took about 255 seconds on a Windows 2000 PC with a 2.2 GHz Pentium IV and 1 GByte of RAM; the test program was implemented in C++.

We carried out two experiments. In the first, the objectives were maximizing the expected gross profit and minimizing the standard deviation of the profit; the standard deviation was minimized since smaller volatility is preferable. In the second, the objectives were maximizing the expected gross profit and minimizing the expected opportunity loss. The standard deviation of the profit and the opportunity loss are the criteria of risk. We tested Random and Boundary Initialization in both experiments.

Figures 4 and 5 show the Pareto optimal individuals, i.e. the solutions at the last generation, of five trials. Each marker corresponds to one solution and each line corresponds to the efficiency frontier of one trial; (a) is the result with Random Initialization and (b) with Boundary Initialization. For comparison, we also tested the conventional inventory management method.

Figure 4 is the result of the first experiment, maximizing expected gross profit and minimizing its standard deviation. The optimization with Boundary Initialization is clearly better than that with Random Initialization. The standard deviation was minimized by both methods; however, Random Initialization could not obtain a solution better even than the conventional method.
Figure 5 is the result of the second experiment maximizing expected gross profit and minimizing opportunity loss. The optimization with Boundary Initialization is much better than that with Random Initialization.
Fig. 4. Maximization of expected profit and minimization of standard deviation of profit
Fig. 5. Maximization of expected profit and minimization of opportunity loss
We can observe the trade-off between profit and risk in the figures. The solutions are appropriate as reasonable alternatives for decision-makers. We found that the supply planning GA with Boundary Initialization obtained extremely good solutions. The proposed approach could also provide several Pareto optimal solutions as alternatives, from high-risk and high-profit to low-risk and low-profit.
5 Conclusion
In this paper, we proposed a supply planning method employing a multi-objective GA. The method uses Monte Carlo simulation in the fitness function of the GA: the simulation models uncertain demand, and profit and risk are then calculated from the simulation result as fitness values. The GA searches for a number of Pareto optimal solutions in one optimization run. We also proposed Boundary Initialization, which initializes the population on the boundary of the constraints.

We tested our approach on actual data from an electric appliances manufacturer. The proposed approach successfully optimized the supply planning problem, and we found that Boundary Initialization is more effective than Random Initialization. The GA provided a number of Pareto optimal solutions covering from high-risk and high-profit to low-risk and low-profit. We believe this feature is very helpful to decision-makers: since the Pareto optimal solutions can serve as alternative choices, decision-makers can select one preferable solution from the alternatives according to their risk appetite and business strategies.
References

1. Hashimoto, F., Hoashi, T., Kurosawa, T. and Kato, K.: Production Management System. New edn., Kyoritsu Shuppan Co., Ltd., Tokyo (1993)
2. Schaffer, J.D.: Multiple Objective Optimization with Vector Evaluated Genetic Algorithms. Proc. First Intl. Conf. on Genetic Algorithms and Their Applications (1985) 93–100
3. Fonseca, C.M. and Fleming, P.J.: Genetic Algorithms for Multiobjective Optimization: Formulation, Discussion, and Generalization. Proc. 5th Intl. Conf. on Genetic Algorithms (1993) 416–423
4. Horn, J., Nafpliotis, N. and Goldberg, D.E.: A Niched Pareto Genetic Algorithm for Multiobjective Optimization. Proc. First IEEE Conf. on Evolutionary Computation (1994) 82–87
5. Tsuda, T.: Monte Carlo Method and Simulation. 3rd edn., Baifukan Co., Ltd., Tokyo (1995)
6. Goodrich, R.L.: Applied Statistical Forecasting. Business Forecast Systems, Inc., Belmont, MA (1989)
7. Michalewicz, Z. and Janikow, C.Z.: Handling Constraints in Genetic Algorithms. Proc. 4th Intl. Conf. on Genetic Algorithms (1991) 151–157
8. Michalewicz, Z.: Genetic Algorithms + Data Structures = Evolution Programs. 3rd edn., Springer-Verlag (1996)
9. Goldberg, D.E., Deb, K., Korb, B.: Don't Worry, Be Messy. Proc. 4th Intl. Conf. on Genetic Algorithms (1991) 24–30
10. Liepins, G.E. and Potter, W.D.: A Genetic Algorithm Approach to Multiple-Fault Diagnosis. In: Davis, L. (ed.): Handbook of Genetic Algorithms. Van Nostrand Reinhold (1991)
Genetic Algorithms: A Fundamental Component of an Optimization Toolkit for Improved Engineering Designs

Siu Tong (1) and David J. Powell (2)

(1) Engineous Software, Cary, North Carolina
[email protected]
(2) Elon University, Campus Box 2101, Elon, North Carolina 27244
[email protected]
Abstract. Optimization is being increasingly applied to engineering design problems throughout the world. iSIGHT is a generic engineering design environment that provides engineers with an optimization toolkit of leading optimization algorithms and an optimization advisor to meet their optimization needs. This paper focuses on the key role played by the toolkit's genetic algorithm in providing a robust, general-purpose solution to nonlinear continuous, mixed-integer nonlinear and integer combinatorial problems. The robustness of the genetic algorithm is demonstrated by successful application to 30 engineering benchmark problems and the following three real-world problems: a marine naval propeller, a heart pacemaker and a jet engine turbine airfoil.
1 Introduction
This paper describes a generic engineering design environment, iSIGHT, used by the Automotive, Aerospace, Industrial Manufacturing and Electronic industries in the United States, Japan, China, Korea and Europe [1]. Within these industries, designers' knowledge of optimization varies from novice to expert. To meet the needs and challenges of a wide variety of industries and the wide disparity in designers' optimization expertise, iSIGHT provides an Optimization Toolkit. This paper focuses on the role of the Genetic Algorithm in iSIGHT's Optimization Toolkit and its application to three design problems.

The paper starts with a description of the optimization problem in engineering and presents a generic problem formulation. Thirteen well-known packages that represent the major numerical and exploratory algorithms are benchmarked against a suite of thirty engineering problems representing nonlinear continuous, mixed-integer and integer combinatorial problems. The benchmark results are analyzed to demonstrate the strengths and weaknesses of each package and the critical role that genetic algorithms serve in the toolkit by providing a single algorithm to solve any nonlinear, constrained continuous or mixed-integer problem. The benchmark conclusions are supported by describing the successful application of genetic algorithms to three customer applications for the design of a marine naval propeller, a heart pacemaker and a jet engine turbine airfoil. The paper concludes with lessons learned in the development and application of iSIGHT.

E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 2347–2359, 2003. © Springer-Verlag Berlin Heidelberg 2003
2348
S. Tong and D.J. Powell
2 The Optimization Problem
During the past twenty years, product design and analysis have routinely been done with computer simulation programs developed either in house or by a commercial vendor. The designer supplies the design parameters for the product as input to one or more computer simulation programs, runs the programs and then analyzes the results. If the results do not meet the design goals, the designer changes the design parameters and repeats the process. This is typically called the design, evaluate and redesign process. The challenge to the designer is to find the best design in as short a time as possible. These design optimization problems are commonly found in manufacturing industries and can be represented by the following mathematical formulation.

Objective function:      Minimize y = f(X), where X = {x1, x2, ..., xn}
Subject to:
  Inequality constraints: gk^b(X) <= gk(X) <= gk^u(X), k = 0, 1, ..., K,
                          where b is the lower and u the upper constraint boundary
  Equality constraints:   gl^e(X) = hl(X), l = 0, 1, ..., L,
                          where e is the constraint boundary
  Side constraints:       xi^b <= xi <= xi^u,
                          where xi is real or integer, or xi is a member of a discrete set of values
This formulation supports the specification of unconstrained and constrained problems with a single objective. (Note: multiple objectives are automatically converted into a single weighted objective.) Constrained problems can have inequality constraints, equality constraints, or both. The design parameters can be combinations of type real, integer or discrete. Depending on the mixture, the problem formulation can be considered a continuous nonlinear problem, a pure integer combinatorial problem or a mixed-integer nonlinear problem (MINLP).

Nonlinear optimization technologies are a natural fit to aid the designer, but their successful application has been limited for many reasons, including:

1. The need to modify the simulation source code to interact with the optimization algorithm. If the simulation code is from a third-party vendor, the source code may not be available.
2. Most simulation codes were not written with optimization in mind and would require significant reprogramming [2].
Genetic Algorithms: A Fundamental Component of an Optimization Toolkit
2349
3. The difficulty in identifying which optimization algorithm(s) is appropriate for the product. For example, the domain modeled by the simulation code may be discontinuous, non-convex, highly nonlinear, poorly scaled, have limited precision in output values, have parameters with different orders of magnitude, and have a mixture of discrete, integer and real design variables.
To overcome these limitations and still solve real world problems, the authors decided to create a toolkit consisting of best-of-breed optimization algorithms that would work together in a coordinated manner. Accordingly, the first step was to identify the best algorithms for each class of industrial problems.

2.1 Optimization Algorithms
In order to identify the most suitable algorithms to be included in the optimization toolkit, a number of analytical and real world problems were set up as benchmarks. Each package was benchmarked to determine how often it achieved the known optimum (i.e. robustness) and the number of function evaluations needed to achieve an optimal result. Table 1 lists the packages, their underlying algorithms, whether they are gradient based or non gradient based, whether they support mixed integer problems, and their number of basic and advanced tuning parameters. A value of No for Mixed Integer indicates that the algorithm applies only to continuous problems. A value of Yes followed by (BB) indicates that the algorithm is based on a branch and bound algorithm and the underlying code must be able to support both real and integer values for integer parameters. A value of Yes not followed by (BB) indicates that the algorithm can work directly with integer values. The tuning parameters for each algorithm are set to default values recommended by either the implementer or expert users. The basic tuning parameters are those most likely to be tuned by a user; the advanced tuning parameters are those usually manipulated only by an expert in optimization. Table 1 lists both numerical optimization algorithms and exploratory genetic algorithms used in this study. For certain algorithms (e.g., Genetic Algorithms and Sequential Quadratic Programming), multiple packages are listed. The rationale for testing multiple packages for an algorithm is the importance of the implementation to performance and robustness.

2.2 Benchmarks and Analysis
In 1977, Sandgren established a test set of single objective, nonlinear, continuous engineering problems with different numbers of design variables, inequality constraints and equality constraints [17]. Each problem had a given starting point and a known solution. Twenty-eight of his thirty problems and two mixed integer nonlinear problems were used to benchmark the iSIGHT optimization algorithms. The two MINLP problems are a coil compression spring, Sandgren 31 [18], and a cantilevered beam, Van 32 [22]. Table 2 lists the benchmark data
2350
S. Tong and D.J. Powell
Table 1. The optimization packages and their associated characteristics, technique and tuning parameters.

| Package | Algorithm | Gradient Based | Mixed Integer | Basic Tuning Parameters | Advanced Tuning Parameters |
|---|---|---|---|---|---|
| 1. GENEsYs [7] | Genetic Algorithm | No | Yes | 2 | 11 |
| 2. Ga2k [8] | Genetic Algorithm | No | Yes | 3 | 7 |
| 3. NSGA-II [6] | Genetic Algorithm | No | Yes | 2 | 4 |
| 4. ASA [9] | Simulated Annealing | No | Yes | 8 | 7 |
| 5. Hooke [11] | Hooke Jeeves | No | Yes | 2 | 2 |
| 6. ADS-EP [10] | Exterior Penalty | Yes | No | 6 | 22 |
| 7. ADS-SLP [10] | Sequential Linear Programming | Yes | No | 6 | 25 |
| 8. Donlp [12] | Sequential Quadratic Programming | Yes | No | 3 | 6 |
| 9. NLPQL [13] | Sequential Quadratic Programming | Yes | No | 4 | 3 |
| 10. MOST [14] | Sequential Quadratic Programming | Yes | Yes (BB) | 3 | 5 |
| 11. LSGRG [15] | Generalized Reduced Gradient | Yes | No | 3 | 5 |
| 12. ADS-MFD [10] | Method of Feasible Directions | Yes | No | 5 | 11 |
| 13. ADS-MMFD [10] | Modified Method of Feasible Directions | Yes | No | 5 | 25 |
for these thirty test cases. The relative error for the starting point, the optimization algorithm that achieved the published optimum in the fewest function evaluations, the relative error achieved by the ga2k genetic algorithm and the percentage of relative error reduced by the genetic algorithm from the starting point are shown. The calculation of relative error is shown in equation 1. The penalty is the sum of constraint violations.
RelativeError = ( |current objective − published optimum| + penalty ) / |published optimum|    (1)
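Equation (1) is straightforward to compute; the following sketch transcribes it directly (the function and argument names are ours, not iSIGHT's):

```python
def relative_error(current_objective, published_optimum, penalty=0.0):
    """Relative error of a candidate design versus the published optimum.

    `penalty` is the sum of constraint violations, as in equation (1);
    it is zero for a feasible design.
    """
    return (abs(current_objective - published_optimum) + penalty) / abs(published_optimum)

# A feasible design matching the optimum exactly scores 0.
print(relative_error(100.0, 100.0))   # 0.0
# A feasible design 5% above a published optimum of 100:
print(relative_error(105.0, 100.0))   # 0.05
```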
The key results revealed by the benchmarks are:
1. No single algorithm worked best in all test cases. In fact, seven different algorithms proved best on particular test cases.
2. If one had to pick a single algorithm to try for any optimization problem, the genetic algorithm would be a very robust and sound choice. In the benchmarks, ga2k reduced the starting relative error by over 85% on average. This is a significant result. Keep in mind that the goal of the designer is to find the best design that meets customer constraints within the design deadline. The genetic algorithm found a feasible design on 29 of the thirty test cases. Only on Sandgren 27 did the genetic algorithm fail to achieve a better design.
3. Genetic algorithms require one to two orders of magnitude more function evaluations than numerical algorithms.
4. Genetic algorithms are clearly superior to numerical optimization techniques on mixed integer problems when the simulation program does not accept non
Table 2. Benchmark results on test suite

| Problem | Real Design Variables | Integer Design Variables | Inequality Constraints | Equality Constraints | Relative Error Starting Point | Feasible Starting Point | Best Algorithm | Function Evaluations Best Algorithm | Relative Error ga2k | Error Reduction ga2k |
|---|---|---|---|---|---|---|---|---|---|---|
| Sandgren 1 | 5 | 0 | 10 | 0 | 1.6192 | YES | NLPQL | 32 | 0.03096 | 0.98 |
| Sandgren 2 | 3 | 0 | 2 | 0 | 0.697 | YES | GRG | 25 | 0.0003 | 0.99 |
| Sandgren 3 | 5 | 0 | 6 | 0 | 0.0119 | YES | DONLP | 32 | 0.00799 | 0.32 |
| Sandgren 4 | 4 | 0 | 0 | 0 | 19192 | YES | ADS-EP | 184 | 0.378 | 0.99 |
| Sandgren 5 | 2 | 0 | 0 | 0 | 24.2 | YES | GRG | 166 | 0.026 | 0.99 |
| Sandgren 6 | 6 | 0 | 0 | 4 | 1961.4 | NO | DONLP | 96 | 0.00528 | 0.99 |
| Sandgren 7 | 2 | 0 | 1 | 0 | 53.554 | NO | NLPQL | 36 | 0.00031 | 0.99 |
| Sandgren 8 | 3 | 0 | 2 | 0 | 0.6755 | NO | NLPQL | 21 | 0.00083 | 0.99 |
| Sandgren 9 | 3 | 0 | 9 | 0 | 0.7892 | YES | GRG | 1692 | 0.27248 | 0.65 |
| Sandgren 10 | 2 | 0 | 0 | 0 | 1468.7 | YES | MMFD | 32 | 5.7E-05 | 1.0 |
| Sandgren 11 | 2 | 0 | 2 | 0 | 0.4236 | YES | MMFD | 18 | 0.00419 | 0.99 |
| Sandgren 12 | 4 | 0 | 0 | 0 | 0.024 | YES | NLPQL | 61 | 0.00112 | 0.95 |
| Sandgren 13 | 4 | 0 | 3 | 0 | 0.0456 | YES | GA2K | 10000 | 0.0 | 1.0 |
| Sandgren 14 | 15 | 0 | 5 | 0 | 73.192 | YES | DONLP | 412 | 7.75308 | 0.89 |
| Sandgren 15 | 16 | 0 | 0 | 8 | 18.212 | NO | NLPQL | 141 | 7.80919 | 0.57 |
| Sandgren 16 | 3 | 0 | 14 | 0 | 0.253 | YES | NLPQL | 85 | 3.1E-05 | 0.99 |
| Sandgren 17 | 12 | 0 | 3 | 0 | 2.4394 | NO | GRG | 132 | 0.29052 | 0.88 |
| Sandgren 18 | 7 | 0 | 14 | 0 | 0.7315 | YES | MOST | 97 | 0.4954 | 0.32 |
| Sandgren 19 | 8 | 0 | 4 | 0 | 0.6672 | NO | MOST | 136 | 0.17183 | 0.74 |
| Sandgren 20 | 8 | 0 | 6 | 0 | 1.1829 | NO | MOST | 155 | 0.17048 | 0.85 |
| Sandgren 21 | 13 | 0 | 13 | 0 | 4.0012 | NO | MOST | 1196 | 1.02186 | 0.74 |
| Sandgren 22 | 7 | 0 | 4 | 0 | 647.91 | NO | NLPQL | 179 | 0.25652 | 0.99 |
| Sandgren 24 | 4 | 0 | 5 | 0 | 5.6398 | YES | NLPQL | 56 | 0.79119 | 0.85 |
| Sandgren 25 | 6 | 0 | 4 | 0 | 36.954 | YES | NLPQL | 180 | 0.71947 | 0.98 |
| Sandgren 26 | 3 | 0 | 0 | 1 | 211.5 | NO | GRG | 121 | 0.47662 | 0.99 |
| Sandgren 27 | 48 | 0 | 0 | 2 | 1.1545 | YES | MOST | 4708 | 1.1545 | 0.0 |
| Sandgren 29 | 10 | 0 | 14 | 1 | 2.8149 | YES | MOST | 442 | 1.32457 | 0.52 |
| Sandgren 30 | 19 | 0 | 1 | 11 | 6809.3 | NO | NLPQL | 132 | 5157.12 | 0.24 |
| Sandgren 31 | 1 | 2 | 8 | 0 | 90.069 | NO | GA2K | 10000 | 0.48947 | 0.99 |
| Van 32 | 4 | 6 | 11 | 0 | 2.5544 | NO | GA2K | 10000 | 0.11022 | 0.95 |
integer values for integer parameters (e.g., problems Sandgren 31 and Van 32).
5. Genetic algorithms are least effective on problems with equality constraints (e.g., Sandgren problems 15, 27, 29 and 30).

This section describes the genetic algorithm packages in more detail, as they are not as well known or as well tested in the engineering community. Three genetic algorithm packages, GENEsYs, ga2k and NSGA-II, were benchmarked. All packages supported the parallel evaluation of each design in the population.
GENEsYs' default settings are elitism, binary gray encoding, linear rank-based selection with a maximum fitness of 1.1 and a minimum fitness of 0.9, two-point crossover with a 0.6 crossover rate, standard mutation with a 0.01 mutation rate, and seeding. The first population is seeded with the initial design point, 20% small creep, 20% large creep, 20% boundary and the remainder random selection. With small creep, large creep and boundary, each design parameter has a 50 percent chance of selection. If a parameter is selected with small creep, it is randomly varied by 0-3%; if selected with large creep, it is varied by 0-10%; and if selected with boundary, it is randomly set to either the upper or lower bound. The rationale for the small and large creep operators is to leverage the implicit knowledge provided in the initial user-supplied design point. The boundary seeding leverages the heuristic that many design parameters are active at their boundaries at the optimal design point.
ga2k is a distributed genetic algorithm [16]. The package's default settings are elitism, binary gray encoding, tournament-based selection, 10 subpopulations of size 10 with a migration rate of 0.5 at every 5th generation, two-point crossover with a 1.0 crossover rate, standard mutation with a 0.01 mutation rate, and no seeding. This algorithm significantly outperformed GENEsYs on the benchmark test cases described in Section 2.2.
NSGA-II is a multiobjective genetic algorithm. The package's default settings are elitism, real encoding, tournament selection, a population size of 100, 100 generations, SBX crossover at a rate of 0.9, real mutation at a rate of 1/(number of design variables), a crossover distribution index of 20 and a mutation distribution index of 100. The NSGA-II algorithm was similar in performance to ga2k: in the 30 benchmarked test cases, it was better in 14, worse in 9 and equal in 7.
The benchmark tests are helpful but are not completely consistent with the types of applications encountered in industry. They are lacking in the following areas:
1. Their execution time is short compared to the times required by simulation programs, which take anywhere from 1 minute to a few hours to execute a single evaluation. The simulation programs are typically thousands or hundreds of thousands of lines of code.
2. The continuous benchmark codes have relatively smooth, well-behaved landscapes. In practice, simulation programs are not smooth and have many discontinuities introduced by if statements, casts and parameter settings. In addition, the programs frequently provide less precision in their output parameters than desired by finite difference algorithms. For example, results may be provided to two decimal digits of accuracy, whereas numerical finite difference gradient algorithms typically look at changes in the 4th or 5th decimal digit.
3. The design parameters are typically not independent variables.
These differences make the results of the benchmark only a guideline rather than an absolute measurement of the usefulness of each algorithm in real world situations.
The iSIGHT design environment [1] was developed to provide an industrial solution to this design optimization problem. It has a process integration toolkit that allows simulation codes to be coupled into the environment without source code or reprogramming. The iSIGHT optimization toolkit, which is discussed in the next section, supplies a variety of algorithms which make different assumptions about the design space. An optimization advisor was developed to recommend the appropriate algorithms to novice users.
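The GENEsYs-style seeded initial population described above (the initial design point plus 20% small creep, 20% large creep, 20% boundary and the remainder random, with each parameter having a 50% chance of selection) can be sketched as follows. The function names and structure are illustrative, not GENEsYs source code:

```python
import random

def creep(design, max_fraction):
    """Randomly vary each selected parameter by up to max_fraction of its value."""
    out = []
    for x in design:
        if random.random() < 0.5:                      # 50% chance of selection
            x *= 1 + random.uniform(-max_fraction, max_fraction)
        out.append(x)
    return out

def boundary(design, bounds):
    """Snap each selected parameter to its upper or lower bound."""
    return [random.choice(pair) if random.random() < 0.5 else x
            for x, pair in zip(design, bounds)]

def seed_population(initial, bounds, size):
    """Initial point + 20% small creep, 20% large creep, 20% boundary, rest random."""
    pop = [list(initial)]
    for i in range(size - 1):
        if i < 0.2 * size:
            pop.append(creep(initial, 0.03))           # small creep: 0-3%
        elif i < 0.4 * size:
            pop.append(creep(initial, 0.10))           # large creep: 0-10%
        elif i < 0.6 * size:
            pop.append(boundary(initial, bounds))      # boundary seeding
        else:
            pop.append([random.uniform(lo, hi) for lo, hi in bounds])
    return pop

bounds = [(0.0, 10.0)] * 3
pop = seed_population([5.0, 5.0, 5.0], bounds, size=10)
print(len(pop))  # 10
```

The design rationale mirrors the text: creep operators exploit the knowledge in the user-supplied starting design, while boundary seeding exploits the heuristic that optima often lie on variable bounds.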
3 Optimization Toolkit
The iSIGHT Optimization Toolkit provides a number of industry-leading implementations of optimization algorithms and a mechanism to build an optimization plan from an interdigitation of individual algorithms to solve complex problems.

3.1 Strengths and Weaknesses of Optimization Packages
More than a dozen packages from both public and private domains are provided in the toolkit, and in a few cases multiple implementations of each. There are advantages and disadvantages to having this many. The advantages are:
1. Each algorithm has one or more strengths that enable it to find a better design than another algorithm under certain design space conditions. For example, Exterior Penalty and Sequential Quadratic Programming work well when started from an infeasible starting point, GRG works well with equality constraints, and the Genetic Algorithm works well with non-convex, multi-modal or mixed integer problems.
2. Some algorithms are extremely efficient under certain design space conditions. For example, gradient-based techniques work well in smooth, unimodal, convex design spaces with twenty or fewer design variables.
3. An optimization expert has access to an assortment of algorithms, packages and tuning parameters to best match the characteristics of the algorithm to the characteristics of the product design space.
4. Each package has proven itself on at least one user application and is maintained for backward compatibility. However, certain packages such as GENEsYs have not been routinely updated by the developer and have fallen behind other packages such as ga2k in performance. If a designer selects one of these packages (e.g., GENEsYs), the designer is notified that the selected package is not recommended and that they should consider an alternative package (e.g., ga2k).
The disadvantages are:
1. For a novice user, the choices are confusing and intimidating. In fact, the choice of more than a dozen algorithms with close to 200 tuning parameters could be a larger optimization problem than the problem the designer is trying to solve. In this case, the problem of changing parameters has merely been transferred from the product parameters to the optimization parameters.
2. The diagnostic tools used to determine how a package is performing are different for each package. A user experienced in one package cannot easily leverage his experience to work with another package. As a result, designers will oftentimes use a single package regardless of the characteristics of the design space.
These disadvantages are overcome in two ways. First, iSIGHT has an optimization advisor to automatically pick packages for the user (the advisor is discussed in Section 3.2). Second, the genetic algorithm is a general purpose algorithm that will work in any design space. A key lesson learned is that although the genetic algorithm cannot compete with numerical techniques in terms of function evaluations on smooth optimization problems, it will always find some improvement, and with today's computing environment many of the function evaluations can be done in parallel.
iSIGHT actively supports only a handful of the best performing algorithms listed in Table 1 and deprecates the older or poorer performing packages. However, iSIGHT provides a set of application programming interfaces that allow designers to couple their own packages. In addition, iSIGHT supports Interdigitation, which allows the user to create a hybrid technique from one or more existing techniques [21]. Interdigitation can be as simple as a sequential execution of an exploratory genetic algorithm followed by an exploitive numerical optimization, or can involve complicated scripting with conditional branches and loops among the optimization techniques.

3.2 Optimization Advisor
While the large number of algorithms with direct access to their tuning parameters provides a powerful tool for the optimization expert, it is overwhelming to the majority of designers. In fact, only a fraction of 1 percent of users of optimization are Operations Research professionals [19]. To address the needs of this vast majority of novice designers, iSIGHT provides an optimization advisor that automatically selects the best two optimization algorithms based on its analysis of the problem's characteristics. Alternatively, the user can view a priority listing of the recommended optimization packages. For example, on the Van 32 benchmark, which has a low number of design variables, a low number of design constraints, mixed types of parameters, small variable variance, no equality constraints, a discontinuous design space, nonlinear simulation codes, an initially infeasible design, low execution time, and no availability of simulation code gradients, the optimization advisor recommended a combination of a genetic algorithm, ga2k, followed by adaptive simulated annealing. For this example, the priority ranking of applicable optimization techniques was: genetic algorithm, adaptive simulated annealing, Hooke Jeeves, and MOST. The continuous numerical optimization techniques are not applicable to this mixed integer formulation and are not listed by the advisor. On the other hand, in a continuous design space with real parameters and high execution time of the simulation code, numerical optimization algorithms will have the highest priority.
During the past three years of iSIGHT application to a variety of real world problems, the genetic algorithm has been the optimization advisor's most frequently recommended technique. In many cases, it is not the highest ranked technique, but its frequent recommendation recognizes the robustness of the technique.
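The advisor's behaviour on the two situations described above can be caricatured as a small rule function. The real iSIGHT advisor analyses many more problem characteristics, so this is only an illustrative sketch grounded in the two examples given in the text:

```python
def rank_techniques(mixed_integer, expensive_evaluations):
    """Toy priority ranking loosely mirroring the two examples in the text.

    For a mixed integer problem (e.g. Van 32), gradient-based continuous
    techniques are excluded and the genetic algorithm leads the ranking.
    For a continuous problem with expensive simulations, numerical
    optimization techniques rank first.
    """
    if mixed_integer:
        return ["genetic algorithm", "adaptive simulated annealing",
                "Hooke Jeeves", "MOST"]
    if expensive_evaluations:
        return ["sequential quadratic programming",
                "generalized reduced gradient", "genetic algorithm"]
    return ["genetic algorithm", "sequential quadratic programming"]

print(rank_techniques(mixed_integer=True, expensive_evaluations=False)[0])
# genetic algorithm
```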
4 Applications
This section demonstrates the robustness of the genetic algorithm in its successful application to three different types of real world problem formulations. The first application is a mixed integer design, the second is a pure integer combinatorial problem and the last is a continuous parameter problem in a multi-modal design space.

4.1 Marine Propeller Design
Numerous competing design and performance parameters must be considered in propeller design. A properly designed propeller must balance the competing requirements characterized by cavitation, cost, efficiency, noise, strength, thrust, and vibration. The design space contains many local optimal designs. Several codes are required for the preliminary design. The designers used the process integration toolkit to integrate six simulation codes whose function in the preliminary design process can be described as follows:
– Compute the mean and harmonic wake velocity components
– Calculate propulsive efficiency, tip vortex and cavitation inception speed
– Determine structural stress and weight
– Find surface cavitation inception speeds
– Compute vibratory forces and moments
The mixed integer design variables for this problem include the propeller diameter, speed (RPM), and several spline parameters that characterize the chord distribution, thickness distribution, and loading distribution. The design goal was to minimize the propeller weight, subject to limits on the minimum propulsive efficiency, minimum cavitation inception speed, minimum tip vortex speed and a maximum allowable stress. Since the problem was mixed integer, the genetic algorithm was used with the default tuning parameter settings. A manual design was used as the baseline for comparison with the genetic algorithm solution. Figure 1 compares the chord distributions and propeller thickness distributions of the baseline and genetic algorithm optimized designs. The genetic algorithm design met all the design constraints and the propulsion goal of 0.68 while reducing the weight 181 lbs below the baseline design weight. The net result was an improvement of 17% in the weight.

4.2 Heart Pacemaker Antenna Design
This case study details the design of an implantable patch antenna for a human heart pacemaker. Enabling two-way communication with pacemakers will make
Fig. 1. Chord distribution of Baseline and genetic algorithm designs for marine propeller
it possible to download information about the condition of the pacemaker and the patient's heart and to upload improved instrument settings. Fitting pacemakers with antennas is challenging, because either the wavelength is too short to penetrate the body or the antenna is so large that it is impractical to implement. Antennas normally must be 5 or 6 inches long to operate at the required frequency, long enough that they protrude into the body and risk infection or lung punctures. Researchers at Utah State University solved this problem by developing a two-inch-square 433 MHz patch antenna small enough to fit on a standard pacemaker battery pack [20]. The designers set out to design a 50-ohm antenna with the required performance characteristics that could fit onto the battery pack of the device, which would make it virtually noninvasive. Electromagnetic finite difference time domain software called XFDTD from Remcom Incorporated was used to evaluate the performance of specific designs. The design team worked on the problem and quickly realized that the design space was very sparsely populated with feasible designs that met all the stringent requirements. This made the discovery of an acceptable (much less optimal) patch antenna geometry a very difficult task. After 9 months of manual iteration they finally developed two optimized designs that met the requirements of the project: a u-shaped patch antenna and a spiral-shaped antenna. But the design team recognized that the design would require numerous subsequent design iterations to meet additional requirements that arose as the project evolved. iSIGHT was linked with the XFDTD simulation package. The design parameters included the length of the antenna and the locations of the feed and
Fig. 2. Comparison of best antenna designs found by iSIGHT and manual design
ground pins. Design constraints on the maximum antenna length, pin locations, and impedance requirements were specified. The design objective was to minimize the antenna length and meet the target frequency of 433 MHz subject to the various design constraints. A genetic algorithm was used as this was a completely integer combinatorial problem. The genetic algorithm found a solution that met all the requirements and was better than the best manual designs. Figure 2 shows a comparison of the iSIGHT and manual trial-and-error best designs. The antenna length of the iSIGHT optimized design, shown on the left, was 128 mm compared with a length of 132 mm found by the 9-month trial-and-error method.

4.3 Jet Engine Airfoil Turbine Design
Researchers at Brigham Young University (BYU) recently completed a complex, multi-discipline design optimization of a cooled jet engine turbine airfoil. The optimization required running multiple software codes to generate a parametric solid model, and execute aerodynamic and structural analyses. The parametric model was generated in Unigraphics (UG). This model was used as the foundation for the design and subsequent optimization. The structural analysis was performed in ANSYS while the GAMBIT and FLUENT software tools were used to compute the aerodynamic analysis. The airfoil geometry was created from four sections, each section with seven parameters, for a total of 28 design variables. The design parameters or variables include the chord length, leading and trailing angles, leading and trailing radius, leading and trailing wall thickness and trailing radius thickness. The objectives of the optimization were to minimize the pressure loss and weight and to maximize the safety factor based on the stress in the airfoil subject to stress constraints and geometric limits. The analysis for this problem was highly complex and required extremely long computational times to solve. The genetic algorithm available in iSIGHT was selected to complete the optimization study. The genetic algorithm was chosen
for its abilities to search for a set of designs, explore the entire multi-modal design space, and run in parallel. A population size of 50 was used and 100 generations were run. The optimization was run automatically in parallel by iSIGHT on a 64-processor SGI and required approximately 20 hours to complete. 65 solutions were selected from the 5000 designs generated to form the Pareto set. This set of 65 designs showed a decrease in the blade volume ranging from 2% to 19%, an increase in the safety factor of 2% to 8%, and pressure loss decreases of 2% to 38%.
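Selecting the Pareto set from the generated designs amounts to filtering out dominated points. A minimal sketch for the three objectives named above (minimize pressure loss, minimize weight, maximize safety factor), with purely illustrative design data:

```python
def dominates(a, b):
    """a dominates b if it is no worse in every objective and better in at least one.

    Each design is (pressure_loss, weight, -safety_factor): all components
    are minimized, so the safety factor is negated.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_set(designs):
    """Keep only designs not dominated by any other design."""
    return [d for d in designs
            if not any(dominates(other, d) for other in designs if other != d)]

designs = [
    (0.30, 12.0, -1.8),   # dominated by the design below in all objectives
    (0.25, 11.0, -2.0),
    (0.40,  9.0, -1.5),   # trades pressure loss for weight: non-dominated
]
front = pareto_set(designs)
print(len(front))  # 2
```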
5 Lessons Learned
The iSIGHT generic engineering design environment described in this paper has had a genetic algorithm as part of its optimization toolkit since 1990. The genetic algorithm has played a fundamental role in the robustness of the toolkit, whether applied individually, as shown in the three real world applications discussed in this paper, or used with interdigitation to design an aircraft engine turbine [23]. It is the robustness of the genetic algorithm and its ease of use that make it especially attractive to the majority of users who are not experts in optimization. The concept of the algorithm is easy to understand, and there are no limiting design space assumptions of parameter independence or continuity to worry about. The genetic algorithm's key advantage is that it applies to almost every type of problem. This claim is supported by the genetic algorithm finding improved designs on 29 of 30 benchmarked engineering problems. This high success rate of over 96% makes the genetic algorithm a reasonable choice when selecting an initial algorithm for a design problem whose domain is not well known. The increased number of function evaluations required by the genetic algorithm is mitigated by the use of parallel computing, which makes the genetic algorithm feasible in real world problems.
References
1. Engineous Software Incorporated. www.engineous.com
2. Vanderplaats, G.: Numerical Optimization Techniques for Engineering Design. (1999)
3. Belegundu, A. and Chandrupatla, T.: Optimization Concepts and Applications in Engineering. Prentice Hall (1999)
4. Onwubiko, C.: Introduction to Engineering Design Optimization. Prentice Hall (2000)
5. Gen, M. and Cheng, R.: Genetic Algorithms and Engineering Optimization. John Wiley & Sons (2000)
6. Deb, K.: Multi-Objective Optimization Using Evolutionary Algorithms. John Wiley & Sons (2001)
7. Bäck, T.: A User's Guide to GENEsYs 1.0. University of Dortmund (1992)
8. Hiroyasu, T.: Spec Sheet: Distributed Genetic Algorithms ga2k (ver 1.1). Intelligent Systems Design Lab, Doshisha University (2002)
9. Ingber, L.: Adaptive Simulated Annealing. http://www.ingber.com/ (2002)
10. Vanderplaats, G.: ADS – A Fortran Program for Automated Design Synthesis. Santa Barbara, CA: Engineering Design Optimization, Inc. (1988)
11. Johnson, M.: Hooke and Jeeves Algorithm. http://www.netlib.org/opt/hooke.c (1994)
12. Spellucci, P.: Donlp2 User Guide. http://www.netlib.org/opt/donlp2/donlp2doc.ps
13. Schittkowski, K.: NLPQL: A Fortran subroutine for solving constrained nonlinear programs. Annals of Operations Research, Vol. 5 (1985–1986) 485–500
14. Tseng, C.: MOST 1.1. Applied Optimal Design Laboratory, National Chiao Tung University, Technical Report AODL-9-01 (1996)
15. Smith, S. and Lasdon, L.: Solving large sparse nonlinear programs using GRG. ORSA Journal on Computing 4 (1992) 1–15
16. Tanese, R.: Distributed genetic algorithms. Proceedings of the Third International Conference on Genetic Algorithms (1989) 434–439
17. Sandgren, E.: The utility of nonlinear programming algorithms. Ph.D. Thesis, Purdue University, West Lafayette, IN (1977)
18. Sandgren, E.: Nonlinear integer and discrete programming in mechanical design optimization. Transactions of the ASME, Journal of Mechanical Design 112(2) (1990) 223–229
19. Fylstra, D., Lasdon, L., Watson, J. and Waren, A.: Design and Use of the Microsoft Excel Solver. Interfaces 28 (1998) 29–55
20. Furse, C.: Design an Antenna for Pacemaker Communication. Microwaves & RF, March (2000)
21. Powell, D.: Inter-GEN: A hybrid approach to engineering design optimization. Ph.D. Thesis, Rensselaer Polytechnic Institute, Troy, NY (1990)
22. Vanderplaats, G.: Numerical Optimization Techniques for Engineering Design. (1999) 317–320
23. Powell, D., Skolnick, M. and Tong, S.: Interdigitation: A Hybrid Technique for Engineering Design Optimization. Handbook of Genetic Algorithms. Van Nostrand Reinhold (1991) 312–331
Spatial Operators for Evolving Dynamic Bayesian Networks from Spatio-temporal Data

Allan Tucker (1), Xiaohui Liu (1), and David Garway-Heath (2)

(1) Brunel University, Middlesex, UK
{allan.tucker,xiaohui.liu}@brunel.ac.uk
(2) Glaucoma Unit, Moorfield's Eye Hospital, London, UK
[email protected]
Abstract. Learning Bayesian networks from data has been studied extensively in the evolutionary algorithm communities [Larranaga96, Wong99]. We have previously explored extending some of these search methods to temporal Bayesian networks [Tucker01]. A characteristic of many datasets, from medical to geographical data, is the spatial arrangement of variables. In this paper we investigate a set of operators that have been designed to exploit the spatial nature of such data in order to learn dynamic Bayesian networks more efficiently. We test these operators on synthetic data generated from a Gaussian network whose architecture is based upon a Cartesian coordinate system, and on real-world medical data taken from visual field tests of patients suffering from ocular hypertension.
E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 2360–2371, 2003. © Springer-Verlag Berlin Heidelberg 2003

1 Introduction

Bayesian Networks (BNs) are probabilistic models that can be used to combine expert knowledge and data. They also facilitate the discovery of complex relationships in large datasets. A BN consists of a directed acyclic graph with links between nodes that represent variables in the domain. The links are directed from a parent node to a child node, and each node has an associated set of conditional probability distributions. Learning the structure of a BN from data [Cooper92] is a non-trivial problem due to the large number of candidate network structures, and as a result there has been substantial research in developing efficient algorithms within the optimisation and evolutionary communities. For example, Evolutionary Programs (EP) [Bäck93] and Genetic Algorithms (GA) [Holland95] have been used to search for candidate structures [Larranaga96, Wong99]. The Dynamic Bayesian Network (DBN) is an extension of the BN that can model time series [Friedman98]. See figure 1 for an example of a DBN with N+1 variables spanning two time slices. Note that links in a DBN can be between nodes in the same time slice or from nodes in previous time slices. We have developed algorithms (evolutionary and non-evolutionary) for efficiently learning DBN structures [Tucker01].
Spatial Data-mining [Ester00] involves the application of specialist algorithms and operators (e.g. neighbourhood distance, topology and direction) to data that has a spatial nature. That is, variables in the domain can be considered to have neighbourhood relations with other variables. This can vary from simple Cartesian coordinates
to more complex spatial neighbourhoods. See [Roddick99] for a bibliography on spatial and temporal data mining. Spatial data is common in geographical research [Cofiño02].
Fig. 1. A Typical DBN
Fig. 2. A Visual Field Test
To our knowledge, BN learning has not been investigated with respect to spatial data. We define a Spatial Bayesian Network (SBN) to be a BN that represents data of a spatial nature, and a Spatial Dynamic Bayesian Network (SDBN) to be a BN that represents spatio-temporal data. We introduce and test an evolutionary algorithm and a set of spatial operators for learning SDBNs from spatio-temporal data. These operators are tested on synthetic data and on visual field data of patients suffering from ocular hypertension. The algorithm and its operators are compared with a standard BN search technique. We validate the resulting networks in several ways, including inspection of the spatial links, calculating the structural difference from the original network used to generate the synthetic data, and using clinical expert knowledge on the visual field data.
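One of the validation measures mentioned above, structural difference, can be computed by comparing the learned link set with the true link set. The paper does not give its exact definition, so the following is a hedged sketch that counts links present in one network but not the other:

```python
def structural_difference(true_links, learned_links):
    """Count links in one network but not the other (symmetric difference).

    Links are represented as (parent, child, lag) triples, so a DBN link
    from x at time t-1 to y at time t is distinct from a within-slice
    link x -> y.  Illustrative representation, not the authors' code.
    """
    true_links, learned_links = set(true_links), set(learned_links)
    return len(true_links ^ learned_links)

true_net    = {("x0", "x1", 1), ("x1", "x2", 1), ("x2", "x2", 1)}
learned_net = {("x0", "x1", 1), ("x2", "x2", 1), ("x0", "x2", 0)}
print(structural_difference(true_net, learned_net))  # 2: one missing, one spurious
```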
2 Methods

2.1 Datasets
We have generated spatio-temporal data using DBNs with Gaussian probability distributions [Geiger94]. This data contains 1000 time points of 64 variables spatially located on an 8x8 grid, where each variable was influenced by its immediate neighbours at the previous time point. Visual Field (VF) data can be recorded using the Humphrey Perimeter [Haley87]. This involves a patient fixating a point in the centre of a dimly illuminated bowl. The perimeter shines stimuli onto the bowl at various points, corresponding to points in the visual field. The stimuli are varied in intensity and the patient presses a
A. Tucker, X. Liu, and D. Garway-Heath
button when a stimulus has been observed. This technique determines the sensitivity to light of each point in the VF. The data used in this paper is from a study that contains tests recorded every few months in patients with ocular hypertension, a major risk factor for glaucoma. The test measured 54 points on each eye (figure 2 shows an example of a patient's VF test). The dataset used in this paper involves 95 patients with 1809 measurements in all, concerning only the right eye. Two points in the VF correspond to the blind spot and should not contain any useful data. We have included these points to check for spurious relationships. We know of little research using probabilistic models to understand VF data. Previously, a state space model has been used to classify glaucomatous patients [Anderssen98] and Bayesian statistics have been proposed to test VF data [Bengtsson97]. Both datasets were discretised into four states using equal bin sizes from the maximum value to the minimum value for each variable.

2.2 Algorithm
Candidate structures of a network, bn, given a dataset, D, are scored using the log-likelihood, calculated using the equation:
$$\log p(D \mid bn) = \log \prod_{i=1}^{N} \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(F_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} F_{ijk}!,$$

where N is the number of variables in the domain, $r_i$ denotes the number of states that a node $x_i$ can take, $q_i$ denotes the number of unique instantiations of the parents of node $x_i$, and $F_{ijk}$ is the number of cases in D where $x_i$ takes on its kth unique instantiation and the parent set of $x_i$ takes on its jth unique instantiation. $F_{ij} = \sum_{k=1}^{r_i} F_{ijk}$.
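The factorial products in this score overflow quickly, so in practice the sum of log-factorials is computed directly. A minimal Python sketch of the per-node term (variable names are ours; `counts[j][k]` plays the role of $F_{ijk}$):

```python
import math

def log_score(counts, r):
    """Log marginal likelihood contribution of one node x_i.

    counts[j][k] holds F_ijk, the number of cases where the node takes
    its k-th state while its parents take their j-th instantiation;
    r is the number of states the node can take.  log n! is evaluated
    as lgamma(n + 1) to avoid overflow.
    """
    total = 0.0
    for row in counts:                      # one row per parent instantiation j
        f_ij = sum(row)                     # F_ij = sum_k F_ijk
        total += math.lgamma(r)             # log (r - 1)!
        total -= math.lgamma(f_ij + r)      # log (F_ij + r - 1)!
        total += sum(math.lgamma(f + 1) for f in row)  # log F_ijk!
    return total
```

Summing `log_score` over all N nodes gives the full network score $\log p(D \mid bn)$.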
In order to increase the efficiency of our algorithms, we have developed an evolutionary approach without the necessity of storing a population of candidate solutions. Rather, we consider each point in the spatial dataset to be an individual within the population of points. Therefore, the population itself is the candidate solution. We have looked at a similar method before for grouping algorithms [Liu01]. The algorithm also makes use of a simulated annealing type of selection criterion [Kirkpatrick83], where good operations are always carried forward, but sometimes less good ones are also accepted dependent upon a temperature parameter. A form of elitism [Grefenstette86] is employed to ensure that the final structure is the best discovered. This is to prevent the simulated annealing process from moving away from a better solution when the temperature is still high. We formally define the algorithm below, where maxfc is the maximum number of calls to the scoring function, c is the 'cooling parameter', t0 is the initial temperature, b is the branching factor of a network, and R(a, b) is a uniform random number generator with limits a and b.
Input:  t0, b, maxfc, D
Output: result

fc = 0, t = t0
Initialise bn to an SDBN with no links
result = bn
While fc ≤ maxfc do
    score = L(bn)
    For each operator do
        Apply operator to bn
        If bn is valid given b Then
            newscore = L(bn)
            fc = fc + 1
            dscore = newscore − score
            If newscore > score Then
                result = bn
            Else If R(0,1) ≥ e^(dscore/t) Then
                Undo the operator
            End If
        End If
    End For
    t = t × c
End While
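The annealing loop can be sketched in runnable form. This is our own abstraction, not code from the paper: the DBN structure is hidden behind a mutable `state`, `score_fn` stands in for the log-likelihood L, and the acceptance test keeps a worsening move with probability e^(dscore/t), the standard Metropolis direction.

```python
import math
import random

def anneal(score_fn, operators, state, t0=5.0, c=0.999, maxfc=5000):
    """Simulated-annealing structure search with elitism (a sketch of the
    algorithm above; the structure itself is abstracted into `state`).

    score_fn(state) -- higher is better (stands in for L(bn))
    operators       -- callables op(state) -> undo closure, or None when the
                       operation would leave the structure invalid
    """
    t, fc = t0, 0
    result = score_fn(state)           # elitism: best score seen so far
    while fc < maxfc:
        for op in operators:
            score = score_fn(state)
            undo = op(state)
            if undo is None:           # invalid given the branching factor b
                continue
            newscore = score_fn(state)
            fc += 1
            dscore = newscore - score
            if newscore > score:
                result = max(result, newscore)
            elif random.random() >= math.exp(dscore / t):
                undo()                 # reject the worsening move
        t *= c                         # cool
    return result
```

Operators that always return None would stall the loop, so in practice at least one operator should be applicable to any state.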
2.3 Operators
We now introduce three spatial and three non-spatial operators. All involve manipulating links within the SDBN. For this paper we have only investigated links that span one time slice, although we have developed equivalent operators for links within the same time slice (these require a stricter ordering assumption to ensure the structure is a directed acyclic graph, an essential property of a BN structure).

2.4 Non-spatial Operators
We have chosen three non-spatial operators as these represent common operators used in optimisation techniques such as hill climbing and simulated annealing.
1) Add - A link with a random parent and child is added to the network.
2) Take - Randomly remove a single existing link.
3) Mutate - Randomly change the parent of an existing link.

2.5 Spatial Operators
For the scope of this paper, we assume that the points in a spatial dataset are located according to Cartesian coordinates. Therefore, each point in a dataset with coordinates (x,y) has a first order neighbourhood which includes all nodes with coordinates (i,j) for x−1 ≤ i ≤ x+1 and y−1 ≤ j ≤ y+1. The spatial operators that we have developed exploit the Cartesian spatial nature of a dataset, whereby links are added to a node
based upon the proximity of the parent to the child, and mutations are made from an existing parent's position to the positions of other nearby parents. A crossover operator has also been developed that swaps the relative positions of a node's parents. Figure 3 shows examples of parent coordinates, relative to the child node, when applying the operators. Unfilled circles represent child nodes, and filled circles represent parents of the child.
Fig. 3. Spatial Operator Examples: (a) Add (b) Mutate and (c) Crossover.
1) Spatial Add - Randomly add a link to a node, such that its parent is one of its first order neighbours (shaded in figure 3a). Note that the coordinates of a parent can be the same as its child's because we are dealing with links that span a time slice.
2) Spatial Mutate - Randomly replace the parent of an existing link with one of its first order neighbours (for example, the shaded region in figure 3b represents possible new positions for the parent immediately to the right of its child).
3) Spatial Crossover - Randomly select two nodes. For each parent of a node, x, there is a 50% chance of moving it to become a parent of the other node, y, so that its relative position to x becomes updated relative to y. See figure 3c, where the shaded cells with filled circles show the parents of node x before and after crossover, and the unshaded cells with filled circles show the parents of node y.
For all spatial operators, if a parent that is selected for a node is outside the boundary of the coordinate system, then it is deemed invalid.
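The first order neighbourhood and the Spatial Add operator can be sketched as follows on a width-by-height grid (function names and the link representation are ours; positions outside the grid are simply excluded, implementing the validity rule above):

```python
import random

def first_order_neighbours(x, y, width, height):
    """All grid coordinates (i, j) with x-1 <= i <= x+1 and y-1 <= j <= y+1,
    including (x, y) itself; positions outside the grid boundary are
    deemed invalid and excluded."""
    return [(i, j)
            for i in range(x - 1, x + 2)
            for j in range(y - 1, y + 2)
            if 0 <= i < width and 0 <= j < height]

def spatial_add(links, child, width, height):
    """Spatial Add: give `child` a parent drawn from its first order
    neighbourhood.  The parent's coordinates may equal the child's
    because the link spans a time slice."""
    parent = random.choice(first_order_neighbours(*child, width, height))
    links.add((parent, child))
```

Spatial Mutate follows the same pattern, resampling an existing parent from the neighbourhood of its current position.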
3 Experiments
The experiments involved running the stochastic algorithm described earlier on the synthetic and VF datasets. One experiment used only non-spatial operators, one used only spatial operators, and one used all operators. All operators were applied in the order in which they appear in the previous section. Due to the stochastic nature of the algorithm, we ran each experiment ten times and recorded the average learning curve. We also investigated Cooper's greedy search algorithm, K2 [Cooper92], when applied to the two datasets. To find out the performance of the individual operators, we also recorded the average success rate of each operator in the 'all operators' experiments; a success is recorded when an operator results in improved fitness. In order to determine the quality of the final structures, we use the Structural Difference (SD) metric on the synthetic data. This is calculated by summing the missing links and the spurious links in the discovered structure, compared with the original network used to generate the data. Expert knowledge is used for the VF data, based upon the anatomical correspondence of VF locations to particular nerve fibre bundles and the proximity of the nerve fibre bundles to each other in terms of their position of entry (angular location) at the Optic Nerve (ON) margin [Garway00].

3.1 Parameters
In these experiments we set t0=5, b=3, c=0.9999 and maxfc=35000. These values were chosen as they generated the most efficient results for all methods.
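The Structural Difference metric used on the synthetic data reduces to a simple set computation. A small sketch, assuming links are stored as (parent, child) pairs:

```python
def structural_difference(true_links, found_links):
    """Structural Difference: missing links plus spurious links in the
    discovered structure, compared with the generating network."""
    missing = len(true_links - found_links)
    spurious = len(found_links - true_links)
    return missing + spurious
```

A perfect reconstruction scores 0; every missed or invented link adds 1.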
4 Results

4.1 Synthetic Data
Figure 4a shows the learning curves for each method when applied to the synthetic data. It can be seen that the simple K2 algorithm is not as efficient as any of the stochastic methods. What is more, the log likelihood of its final solution is not as high as that of the other methods. The spatial operators alone seem to perform the most efficiently. This could be due to the more limited type of relationships that exist: all are based upon the first order Cartesian neighbourhood. It should be noted that on the synthetic data, if all methods, including K2, are run for long enough they eventually reach solutions very close to one another (again perhaps due to the simplicity of the relationships within the network). Figure 4b shows the mean success of individual operators during an experiment; the most successful were Spatial Add (SpatAdd), Take, and Spatial Mutation (SpatMut). The worst seems to be Spatial Crossover (SpatCross). We have calculated the structural difference of the discovered structures from the original structures used to generate the data. Table 1 shows that on average the network that most closely resembles the original structure is, surprisingly, that from using K2. This is somewhat unexpected, but could be due to the simplistic nature of the synthetic data (where all dependencies are generated from a linear Gaussian process given each variable's first order neighbours). The next best structure is that from the spatial operators alone. The next closest is found when using spatial and non-spatial operators, and the worst when using only non-spatial operators.
Fig. 4. (a) Mean Learning Curves and (b) Operator Successes for Synthetic Data
Table 1. Quality of Networks Learnt from Synthetic Data

                        K2      Non-Spat   Spat    All
Structural Difference   119.0   142.0      122.3   129.2
Figure 5 shows the spatial nature of the final networks for the synthetic dataset. All networks appear to have strong spatial characteristics. The K2 result appears to have more links that lie on the border compared to the stochastic methods. This could be due to a bias in the search or an inability to remove links once they are discovered.

4.2 Visual Field Data
Figure 6a shows the learning curves for all methods on the VF data. It appears that the best operators are a combination of spatial and non-spatial. This results in more efficient learning curves compared to the experiments involving just spatial operators,
which in turn generate more efficient learning curves than the non-spatial operators alone. However, figure 6a implies that if the non-spatial operator experiments are run for enough iterations then they eventually overtake the spatial operator curves (probably due to the over-restrictive nature of using spatial operators alone). The VF data, unlike the synthetic data, appears to have a mixture of spatial and non-spatial relationships. Figure 6b shows the average success of individual operators during an experiment on the VF data. The most successful were Spatial Add, Take, and Spatial Mutation. The worst seems to be Spatial Crossover, though it maintains a steady success rate. In fact, successful Spatial Crossover applications generally result in larger fitness improvements than those of the other operators.
Fig. 5. Networks Learnt from Synthetic Data (panels: K2, Spatial Only, Non-Spatial Only, All Operators)
We use the average Optic Nerve (ON) distance between each link's parent and child to validate the resulting networks. Table 2 shows that a similar ordering is found to that of the SD on the synthetic data, but with K2 performing somewhat worse. The spatial operators generate networks with the least ON distance, followed by the combination of spatial and non-spatial operators, then the non-spatial operators alone; the structure with the largest ON distance is that generated using K2. This implies that for more complex real-world problems, a simple greedy search such as K2 is not suitable. We
have also used knowledge of how nerve fibre bundles are arranged on the visual field [Garway00]. Figure 7 shows how these bundles are positioned with respect to the visual field, which means that points in the visual field sharing a nerve fibre bundle are likely to be related. Grey shaded points in figure 7 correspond to the blind spot. Table 2 shows the mean number of links contained within the same bundle, as a percentage of all links, for each search method. A similar ordering to the mean ON distance is observed: a higher percentage of links in the same bundle implies a shorter average ON distance. The highest percentage of links contained within the same bundle comes from the networks generated using only spatial operators, the next best being all operators, followed by non-spatial operators, and the worst being those generated using K2.
Fig. 6. (a) Mean Learning Curves and (b) Operator Successes for VF Data
Table 2. Quality of Networks Learnt from VF Data
            % Links in Same Bundle   Mean ON Distance
K2          62.963                   41.056
Non-Spat    70.863                   29.477
Spat        78.325                   19.225
All         73.333                   25.138
Fig. 7. Nerve Fibre Bundles
Fig. 8. Networks Learnt from VF Data (panels: K2, Spatial Only, Non-Spatial Only, All Operators)
Figure 8 shows the spatial nature of the final networks for the different search methods on the VF data. ON distance is represented by the shading of the links (the larger the distance the lighter the link). Notice that the stochastic methods all have similar looking networks with strong spatial features. This is likely to be because these networks are the result of a large number of iterations and the methods have converged close to the optimum (figure 6a). The K2 result displays fewer spatial
characteristics and many links with high ON distance (in lighter grey), including links across the horizontal midline that are anatomically unlikely. None of the networks included links involving the blind spot, as was expected. For the stochastic methods, there are still some links with high ON distance (links in lighter grey). Most of these links appear to be between VF points that are very close to one another and cross the central horizontal axis. Some nerve fibres interdigitate across the horizontal raphe in the nasal visual field, but this does not account for some of the more central links crossing the horizontal meridian. These could be a result of bias in the spatial operators, but they also exist in the non-spatial experiments. This implies either errors in measurement of the visual field, where the VF is shifted slightly, or a disease process that acts symmetrically in the visual field at anatomically unlinked locations. These biases and errors will be explored further.
5 Conclusions and Future Work
In this paper, we have developed evolutionary and non-evolutionary operators for learning Bayesian network structures from spatial time series data. These operators have been tested on synthetic data and real-world visual field data. We have compared the efficiency of the operators with each other as well as with a well-known straw-man learning algorithm. Results using the proposed operators were very encouraging, particularly on the real-world data, where high quality networks were found more efficiently when spatial operators were used. Various methods were used to measure network quality, including structural difference on the synthetic data and expert knowledge on the visual field data. In the future, we will use spatial information such as optic nerve distance and geographical direction to guide the search rather than simple Cartesian coordinates. It would be interesting to see how temporal lag search [Tucker01] interacts with spatial operators, and how the operators perform on other datasets such as rainfall prediction for cities [Cofiño02]. We intend to explore how the operators compare to other methods for learning network structures, such as Estimation of Distribution Algorithms [Larranaga01] and methods based upon neural network optimisation [Kahng95]. We also plan to extend the modelling of the visual field to include data from both eyes, as well as clinical information.

Acknowledgements. We would like to thank Nick Strouthidis for his help with collating the visual field dataset and Stephen Swift for his general help and advice. In addition, we would like to thank the EPSRC and BBSRC for funding this research.
References
Anderssen, K.E., Jeppesen, V., Classifying Visual Field Data, MSc Thesis, Aalborg University, Denmark, (1998).
Bengtsson, B., Olsson, J., Heijl, A., Rootzén, H., A new generation of algorithms for computerized threshold perimetry, SITA, Acta Ophthalmologica Scandinavica 75, (1997), 368–375.
Cofiño, A.S., Cano, R., Sordo, C., Gutiérrez, J.M., Bayesian Networks for Probabilistic Weather Prediction, Proc. of the 15th European Conference on Artificial Intelligence, IOS Press, (2002).
Cooper, G.F., Herskovitz, E., A Bayesian Method for the Induction of Probabilistic Networks from Data, Machine Learning 9, (1992), 309–347.
Ester, M., Frommelt, A., Kriegel, H.-P., Sander, J., Spatial Data Mining: Database Primitives, Algorithms and Efficient DBMS Support, Data Mining and Knowledge Discovery 4, Nos. 2/3, Kluwer Academic Publishers, (2000).
Friedman, N., Learning the Structure of Dynamic Probabilistic Networks, Proc. of the 14th Annual Conference on Uncertainty in AI, (1998), 139–147.
Garway-Heath, D.F., Fitzke, F., Hitchings, R.A., Mapping the Visual Field to the Optic Disc, Ophthalmology 107, (2000), 1809–1815.
Geiger, D., Heckerman, D., Learning Gaussian Networks, Proc. of the 10th Conference on Uncertainty in Artificial Intelligence, (1994), 235–243.
Grefenstette, J.J., Optimization of Control Parameters for Genetic Algorithms, IEEE Transactions on Systems, Man & Cybernetics 16, No. 1, (1986), 122–128.
Haley, M.J. (ed.), The Field Analyzer Primer, Allergan Humphrey, San Leandro, California, (1987).
Holland, J.H., Adaptation in Natural and Artificial Systems, University of Michigan Press, (1995).
Kahng, A.B., Moon, B.R., Toward More Powerful Recombinations, Proc. of the 6th International Conference on Genetic Algorithms, Morgan Kaufmann, (1995), 96–103.
Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P., Optimization by Simulated Annealing, Science 220, No. 4598, (1983), 671–680.
Larrañaga, P., Poza, M., Yurramendi, Y., Murga, R., Kuijpers, C., Structure Learning of Bayesian Networks using GAs, IEEE Transactions on Pattern Analysis and Machine Intelligence 18, No. 9, (1996), 912–926.
Larrañaga, P., Lozano, J.A. (eds.), Estimation of Distribution Algorithms, Kluwer, (2001).
Liu, X., Swift, S., Tucker, A., Using Evolutionary Algorithms to Tackle Large Scale Grouping Problems, GECCO (2001), 454–460.
Roddick, J.F., Spiliopoulou, M., A Bibliography of Temporal, Spatial and Spatio-Temporal Data Mining Research, ACM SIGKDD Explorations 1, No. 1, (1999), 34–38.
Bäck, T., Schwefel, H.-P., An Overview of Evolutionary Algorithms for Parameter Optimization, Evolutionary Computation 1, No. 1, (1993), 1–23.
Tucker, A., Liu, X., Ogden-Swift, A., Evolutionary Learning of Dynamic Probabilistic Models with Large Time Lags, International Journal of Intelligent Systems 16, No. 5, (2001), 621–646.
Wong, M., Lam, W., Leung, S., Using Evolutionary Programming and Minimum Description Length Principle for Data Mining of Bayesian Networks, IEEE Transactions on Pattern Analysis and Machine Intelligence 21, No. 2, (1999), 174–178.
An Evolutionary Approach for Molecular Docking
Jinn-Moon Yang
Department of Biological Science and Technology, National Chiao Tung University, Hsinchu, 30050, Taiwan
[email protected]
Abstract. We have developed an evolutionary approach for flexible docking, which is now an important component of rational drug design. This automatic docking tool, referred to as GEMDOCK (Generic Evolutionary Method for DOCKing molecules), combines both global and local search mechanisms. GEMDOCK uses a simple scoring function to recognize compounds by minimizing the energy of molecular interactions. The interaction types between ligand and protein atoms in our linear scoring function consist of only hydrogen-bonding and steric terms. GEMDOCK has been tested on a diverse dataset of 100 protein-ligand complexes from the Protein Data Bank. For 76% of these complexes, it obtained docked ligand conformations with root mean square deviations (RMSD) to the crystal ligand structures of less than 2.0 Å when the ligand was docked back into the binding site. Experiments show that the scoring function is simple and efficiently discriminates between native and non-native docked conformations. This study suggests that GEMDOCK is a useful tool for molecular recognition and a potential docking tool for protein structure variations.
1 Introduction
E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 2372–2383, 2003. © Springer-Verlag Berlin Heidelberg 2003

The molecular docking problem is the prediction of a ligand conformation and orientation relative to the active site of a target protein. A computer-aided docking process, identifying lead compounds by minimizing the energy of intermolecular interactions, is an important approach for structure-based drug design [1]. Solving a molecular docking problem involves two critical elements: a good scoring function and an efficient algorithm for searching conformation and orientation spaces. A good scoring function should be fast and simple, for screening large numbers of potential solutions, and should effectively discriminate between correct binding states and non-native docked conformations. Various scoring functions have been developed for calculating binding free energy, including knowledge-based [2], physics-based [3], and solvent-based scoring functions [4]. In general the binding energy landscapes of these scoring functions are complex, rugged funnel shapes [5]. Many automated docking approaches have been developed and can be roughly divided into rigid docking, flexible ligand docking, and protein flexible docking methods. Rigid-docking methods, such as the DOCK program [6], treat both ligands and proteins as rigid. In contrast, for flexible ligand docking methods, the ligand is flexible and the protein is rigid; these include evolutionary algorithms [7,8,9,10], simulated annealing [11], fragment-based approaches [12], and other algorithms. To reasonably address protein flexibility, in which both ligands and proteins are flexible, most docking methods
often allow only a limited model of protein variation, such as side-chain flexibility or small motions of loops in the binding site [13]. Most of these previous docking methods were studied on a small test set (< 20 complexes); by contrast, GOLD [8] and FlexX [12] were tested on test sets of over 100 complexes. In this paper, we propose an automatic program, GEMDOCK (Generic Evolutionary Method for DOCKing molecules), for docking flexible molecules. Our program uses a simplified scoring function and a new evolutionary approach which is more robust than standard evolutionary approaches [14,15,16] on some specific domains [17,18,19,20]. Our energy function consists only of steric and hydrogen-bonding terms with a linear model, which is simple and fast enough to recognize potential complexes. In order to balance exploration and exploitation, the core idea of our evolutionary approach is to design multiple operators cooperating with each other by using family competition, which is similar to a local search procedure. We have successfully applied a similar idea to optimization problems in several differing fields [17,18,19,20]. In order to evaluate the performance and limitations of GEMDOCK on docking flexible ligands, we have tested it on a diverse dataset of 100 complexes from the Protein Data Bank. GEMDOCK found 76 ligands whose docked structures have RMSD values to the ligand crystal structures of less than 2.0 Å. The rate increases to 86% when the structural water is considered. GOLD [8] achieved a 71% success rate on the same dataset and FlexX [12] achieved a 70% success rate on a dataset of 200 complexes extended from the data set of GOLD.
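The 2.0 Å success criterion rests on the RMSD between docked and crystal conformations. A minimal sketch, assuming a fixed heavy-atom correspondence and no additional superposition step:

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean square deviation between two conformations given as
    equal-length lists of (x, y, z) atomic coordinates, in angstroms."""
    assert len(coords_a) == len(coords_b)
    sq = sum((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
             for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b))
    return math.sqrt(sq / len(coords_a))
```

A docking run counts as a success when `rmsd(docked, crystal) < 2.0`.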
2 Method
The basic structure of GEMDOCK (Figure 1) is as follows. Randomly generate a starting population with N solutions by initializing the orientation and conformation of the ligand relative to the center of the receptor. Each solution is represented as a set of four n-dimensional vectors (x^i, σ^i, v^i, ψ^i), where n is the number of adjustable variables of a docking system and i = 1, ..., N, where N is the population size. The vector x represents the adjustable variables to be optimized, in which x1, x2, and x3 are the 3-dimensional location of the ligand; x4, x5, and x6 are the rotational angles; and x7 to xn are the twisting angles of the rotatable bonds inside the ligand. σ, v, and ψ are the step-size vectors of the decreasing-based Gaussian mutation, the self-adaptive Gaussian mutation, and the self-adaptive Cauchy mutation. In other words, each solution x is associated with some parameters for step-size control. The initial values of x1, x2, and x3 are randomly chosen from the feasible box, and the others, from x4 to xn, are randomly chosen from 0 to 2π in radians. The initial step size σ is 0.8, and v and ψ are 0.2. After GEMDOCK initializes the solutions, it enters the main evolutionary loop, which consists of three main stages in every iteration: decreasing-based Gaussian mutation, self-adaptive Gaussian mutation, and self-adaptive Cauchy mutation. Each stage is realized by generating a new quasi-population (with N solutions) as the parent of the next stage. As shown in Figure 1, these stages apply a general procedure "FC adaptive" that differs only in its working population and mutation operator. The FC adaptive procedure (Figure 1) employs two parameters, namely the working population (P, with N solutions) and the mutation operator (M), to generate a new quasi-population. The main work of FC adaptive is to produce offspring and then conduct the
Fig. 1. Overview of GEMDOCK for molecular docking: (a) Main procedure (b) FC adaptive procedure.
family competition. Each individual in the population sequentially becomes the "family father." With probability pc, this family father and another solution randomly chosen from the rest of the parent population are used as parents for a recombination operation. Then the new offspring, or the family father if recombination was not conducted, is operated on by a mutation. For each family father, this procedure is repeated L times, where L is called the family competition length. Among these L offspring and the family father, only the one with the lowest scoring function value survives. Since we create L children from one "family father" and perform a selection, this is a family competition strategy. This method not only avoids premature convergence of the population but also keeps the spirit of local search. Finally, the FC adaptive procedure generates N solutions because it forces each solution of the working population to have one final offspring. In the following, the genetic operators are briefly described. We use a = (x^a, σ^a, v^a, ψ^a) to represent the "family father" and b = (x^b, σ^b, v^b, ψ^b) as another parent. The offspring of each operation is represented as c = (x^c, σ^c, v^c, ψ^c). The symbol $x_j^s$ denotes the jth adjustable optimization variable of a solution s, ∀j ∈ {1, ..., n}.

2.1 Recombination Operators
GEMDOCK implements a modified discrete recombination and an intermediate recombination [15]. A recombination operator selects the "family father" (a) and another solution (b) randomly chosen from the working population. The former generates a child as follows:

$$x_j^c = \begin{cases} x_j^a & \text{with probability } 0.8,\\ x_j^b & \text{with probability } 0.2. \end{cases} \qquad (1)$$
The generated child inherits genes from the "family father" with the higher probability 0.8. Intermediate recombination works as:

$$w_j^c = w_j^a + \beta (w_j^b - w_j^a)/2, \qquad (2)$$
where w is σ, v, or ψ depending on the mutation operator applied in the FC adaptive procedure. The intermediate recombination operates only on the step-size vectors, while the modified discrete recombination is used for the adjustable vector (x).

2.2 Mutation Operators
After the recombination, a mutation operator, the main operator of GEMDOCK, is applied to mutate the adjustable variables (x). Gaussian and Cauchy mutations are accomplished by first mutating the step size (w) and then mutating the adjustable variable x:

$$w_j' = w_j A(\cdot), \qquad (3)$$
$$x_j' = x_j + w_j' D(\cdot), \qquad (4)$$
where $w_j$ and $x_j$ are the jth components of w and x, respectively, and $w_j'$ is the new step size of $x_j$, where w is σ, v, or ψ. If the mutation is a self-adaptive mutation, A(·) is evaluated as $\exp[\tau' N(0,1) + \tau N_j(0,1)]$, where N(0,1) is the standard normal distribution and $N_j(0,1)$ is a new value with distribution N(0,1) regenerated for each index j. When the mutation is a decreasing-based mutation, A(·) is defined as a fixed decreasing rate γ = 0.95. D(·) is evaluated as N(0,1) or C(1) if the mutation is, respectively, a Gaussian mutation or a Cauchy mutation. For example, the self-adaptive Cauchy mutation is defined as

$$\psi_j^c = \psi_j^a \exp[\tau' N(0,1) + \tau N_j(0,1)], \qquad (5)$$
$$x_j^c = x_j^a + \psi_j^c C_j(t). \qquad (6)$$
We set τ and τ' to $(\sqrt{2n})^{-1}$ and $(\sqrt{2\sqrt{n}})^{-1}$, respectively, following the suggestion of evolution strategies [15]. A random variable is said to have the Cauchy distribution C(t) if it has the density function $f(y; t) = \frac{t}{\pi(t^2 + y^2)}$, $-\infty < y < \infty$. In this paper t is set to 1. The formulation of the self-adaptive Gaussian mutation is similar to the self-adaptive Cauchy mutation and is given by

$$v_j^c = v_j^a \exp[\tau' N(0,1) + \tau N_j(0,1)], \qquad (7)$$
$$x_j^c = x_j^a + v_j^c N_j(0,1). \qquad (8)$$
Our decreasing-based Gaussian mutation uses the step-size vector σ with a fixed decreasing rate γ = 0.95 and works as

σ^c = γσ^a,  (9)
x_j^c = x_j^a + σ^c N_j(0, 1).  (10)
2376
J.-M. Yang
[Figure 2 plot omitted; recoverable content: GEMDOCK's linear energy function plotted against a Lennard-Jones van der Waals potential A/r^12 − B/r^6, with parameter values]

Interactive type | V1  | V2  | V3  | V4  | V5   | V6
Hydrogen bond    | 2.3 | 2.6 | 3.1 | 3.6 | -2.5 | 20
Steric           | 3.3 | 3.6 | 4.5 | 6.0 | -0.4 | 20

Fig. 2. The linear energy function of the pairwise atoms for steric and hydrogen bonds in GEMDOCK (bold line) with a standard Lennard-Jones potential (light line).
Rotamer Mutation: This operator is applied only to x_7 to x_n to find the conformations of the rotatable bonds inside the ligand. For each ligand, this operator mutates all of the rotatable angles according to the rotamer distribution and works as

x_j = r_ki with probability p_ki,  (11)

where r_ki and p_ki are the angle value and the probability, respectively, of the ith rotamer of the kth bond type, the bond types being sp3–sp3 and sp3–sp2. The values of r_ki and p_ki are based on the energy distributions of these two bond types.
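Equation 11 amounts to sampling each rotatable angle from a discrete rotamer distribution for its bond type; a minimal sketch (the rotamer tables below are placeholders — the actual r_ki and p_ki come from the energy distributions of the two bond types):

```python
import random

# Placeholder rotamer tables: (angle values in radians, probabilities).
ROTAMERS = {
    "sp3-sp3": ([1.05, -1.05, 3.14], [0.4, 0.4, 0.2]),
    "sp3-sp2": ([0.0, 3.14], [0.5, 0.5]),
}

def rotamer_mutation(angles, bond_types):
    """Mutate every rotatable-bond angle by drawing a rotamer for its
    bond type (Equation 11)."""
    return [random.choices(*ROTAMERS[bond])[0]
            for _, bond in zip(angles, bond_types)]
```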
2.3 Scoring Function
In this work, we used a simple scoring function given as

Etot = Einter + Eintra + Epenal,  (12)
where Einter and Eintra are the intermolecular and intramolecular energies, respectively, and Epenal is a large penalty value applied if the ligand is out of range of the search box. In this paper, Epenal is set to 10000. The intermolecular energy is defined as

Einter = Σ_{i=1}^{lig} Σ_{j=1}^{pro} [ F(r_ij^{B_ij}) + 332.0 q_i q_j / (4 r_ij) ],  (13)

where r_ij is the distance between atoms i and j, q_i and q_j are the formal charges, and 332.0 is a factor that converts the electrostatic energy into kilocalories per mole. lig and pro denote the numbers of heavy atoms in the ligand and receptor, respectively. F(r_ij^{B_ij}) is a simple atomic pairwise potential function (Figure 2) modified from previous works [7,21] and given as
Table 1. Atom types of GEMDOCK

Atom type | Heavy atom name
Donor     | primary and secondary amines, sulfur, and metal atoms
Acceptor  | oxygen and nitrogen with no bound hydrogen
Both      | structural water and hydroxyl groups
Nonpolar  | other atoms (such as carbon and phosphorus)
F(r_ij^{B_ij}) =
  V6                                           if r_ij^{B_ij} ≤ V1
  V6 − (V6 − V5)(r_ij^{B_ij} − V1)/(V2 − V1)   if V1 < r_ij^{B_ij} ≤ V2
  V5                                           if V2 < r_ij^{B_ij} ≤ V3
  V5 − V5(r_ij^{B_ij} − V3)/(V4 − V3)          if V3 < r_ij^{B_ij} ≤ V4
  0                                            if r_ij^{B_ij} > V4
(14)
Here r_ij^{B_ij} is the distance between atoms i and j with bond type B_ij, the interaction bond type formed by the pairwise heavy atoms of a ligand and a protein; B_ij is either a hydrogen-bonding or a steric state. The values of the parameters V1, ..., V6 are given in Figure 2. In this atomic pairwise model the interaction types are only hydrogen bonding and steric potential, which have the same functional form but different parameters V1, ..., V6. The energy value of hydrogen bonding should be larger than that of the steric potential. In this model, atoms are divided into four different atom types (Table 1): donor, acceptor, both, and nonpolar. A hydrogen bond can be formed by the following pairs of atom types: donor-acceptor (or acceptor-donor), donor-both (or both-donor), acceptor-both (or both-acceptor), and both-both. All other atom-pair combinations form the steric state. The intramolecular energy of a ligand is
Eintra = Σ_{i=1}^{lig} Σ_{j=i+2}^{lig} F(r_ij^{B_ij}) + Σ_{k=1}^{dihed} A[1 − cos(mθ_k − θ_0)],  (15)

where F(r_ij^{B_ij}) is defined as in Equation 14, except that the value is set to 1000 when r_ij^{B_ij} < 2.0 Å, and dihed is the number of rotatable bonds. We followed the work of Gehlhaar et al. (1995) to set the values of A, m, and θ_0. For an sp3–sp3 bond, A, m, and θ_0 are set to 3.0, 3, and π; A = 1.5, m = 6, and θ_0 = 0 for an sp3–sp2 bond.
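A sketch of the pairwise potential of Equation 14 and the intermolecular energy of Equation 13, using the V1–V6 values from Figure 2 (the atom representation as (coordinates, formal charge) pairs and the hydrogen-bond/steric lookup callback are illustrative assumptions):

```python
import math

# V1..V6 per interaction type, from the table accompanying Figure 2.
PARAMS = {
    "hbond":  (2.3, 2.6, 3.1, 3.6, -2.5, 20.0),
    "steric": (3.3, 3.6, 4.5, 6.0, -0.4, 20.0),
}

def pair_potential(r, bond_type):
    """Piecewise-linear pairwise potential F (Equation 14)."""
    V1, V2, V3, V4, V5, V6 = PARAMS[bond_type]
    if r <= V1:
        return V6                                      # repulsive wall
    if r <= V2:
        return V6 - (V6 - V5) * (r - V1) / (V2 - V1)   # slope down to well
    if r <= V3:
        return V5                                      # flat well bottom
    if r <= V4:
        return V5 - V5 * (r - V3) / (V4 - V3)          # back up to zero
    return 0.0

def intermolecular_energy(lig_atoms, pro_atoms, bond_type_of):
    """Einter (Equation 13): pairwise potential plus an electrostatic term."""
    e = 0.0
    for i, (ci, qi) in enumerate(lig_atoms):           # (coords, charge)
        for j, (cj, qj) in enumerate(pro_atoms):
            r = math.dist(ci, cj)
            e += pair_potential(r, bond_type_of(i, j))
            e += 332.0 * qi * qj / (4.0 * r)
    return e
```

The bond_type_of callback stands in for the donor/acceptor/both/nonpolar pairing rules of Table 1.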
3 Results

3.1 Parameters of GEMDOCK
Table 2 lists the GEMDOCK parameter settings used in this work, such as the initial step sizes, family competition length (L = 3), population size (N = 200), and recombination probability (pc = 0.3). The GEMDOCK optimization stops when either the convergence falls below a certain threshold value or the number of iterations exceeds a preset maximum, which
Table 2. Parameters of GEMDOCK

Parameter                     | Value
Initial step sizes            | σ = 0.8, v = ψ = 0.2 (in radians)
Family competition length     | L = 3
Population size               | N = 200
Recombination rate            | pc = 0.3
Number of maximum generations | 100
was set to 100. Therefore, GEMDOCK generated 2400 solutions in one generation and terminated after it had exhausted 240000 solutions in the worst case. These parameter values were chosen after experiments with various settings on the test docking systems.

3.2 Test Data Set

In order to evaluate the strengths and limitations of GEMDOCK, we tested it on a highly diverse dataset of 100 protein-ligand complexes proposed by Jones et al. [8] (Table 3). The ligand input files were generated by GENLIG, which assigned the formal charge and atom type (donor, acceptor, both, or nonpolar) of each atom and the bond type (sp3–sp3, sp3–sp2, or other) of each rotatable bond inside a ligand. These materials were used in Equation 12 to calculate the scoring value of a solution. Table 3 shows the ligand summary, including the minimum, average, and maximum values of the number of rotatable bonds and the number of heavy atoms. When preparing the proteins, we removed all structural water molecules and metal atoms, except where we discuss the influence of considering these hetero atoms. In order to decide the size of the active site, GEMDOCK was tested on four different sizes (d Å): 6 Å, 8 Å, 10 Å, and 12 Å. A size of d Å means that all protein atoms located less than d Å from any ligand atom are selected into the active site. GEMDOCK automatically decides the cube of a binding site based on the maximum and minimum coordinates of these selected protein atoms. Experiments showed that the different sizes had little influence on GEMDOCK. In this paper, the distance d is set to 10 Å when a ligand is docked back into the active site. Among these 100 test systems, the minimum cube is 23 Å × 24 Å × 20 Å (2mcp) and the maximum cube is 41 Å × 40 Å × 30 Å (2r07).

3.3 Results on the Dataset of 100 Complexes
GEMDOCK executed 10 independent runs for each complex. The solution with the lowest scoring function value was then compared with the observed ligand crystal structure. Table 3 shows the summary information and performance. We based the results on the root mean square deviation (RMSD) error in ligand heavy atoms between the docked conformation and the crystal structure. By contrast, Jones et al. [8] used four subjective categories (good, close, error, and wrong) to evaluate performance. Because they found that all good and close solutions had RMSD values below 2.0 Å, we considered a docking result acceptable if the RMSD value is less than 2.0 Å [8,12].
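The RMSD acceptance criterion can be sketched as follows (a minimal Python sketch; note that docking RMSD is computed in the fixed receptor frame, so no superposition step is needed; function names are illustrative):

```python
import math

def rmsd(coords_a, coords_b):
    """Root mean square deviation over matched heavy-atom coordinates."""
    assert len(coords_a) == len(coords_b)
    sq = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(sq / len(coords_a))

def acceptable(docked, crystal, threshold=2.0):
    """A docked pose is acceptable if its heavy-atom RMSD is below 2.0 A."""
    return rmsd(docked, crystal) <= threshold
```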
Table 3. GEMDOCK results and summary of ligands in the dataset of 100 complexes
RMSD (Å)     | rank 1 | any rank | PDB codes with rank 1
≤ 0.5        | 24     | 41       | 1abe 1acm 1aco 1did 1epb 1fki 1hdy 1hsl 1ida 1lst 1pbd 1pha 1rob 1stp 1tpp 2ada 2cgr 2cht 2dbl 3aah 3tpi 4phv 6abp 6rsa
> 0.5, ≤ 1.0 | 39     | 32       | 1acj 1ack 1acl 1aha 1dbb 1dbj 1die 1dr1 1dwd 1eap 1etr 1fkg 1ghb 1hri 1hyt 1ldm 1lic 1mrk 1nis 1phd 1phg 1rds 1slt 1srj 1tka 1tmn 1ulb 1glq 2ak3 2ctc 2mcp 2pk4 2r07 2sim 3cpa 3hvt 4cts 4dfr 4est
> 1.0, ≤ 1.5 | 10     | 11       | 1aaq 1apt 1cbx 1cps 1hdc 1icn 1poc 4fab 5p2p 8gch
> 1.5, ≤ 2.0 | 3      | 6        | 1aec 1tdb 3ptb
> 2.0, ≤ 2.5 | 1      | 0        | 1mdr
> 2.5, ≤ 3.0 | 3      | 6        | 1ase 1blh 2yhx
> 3.0        | 20     | 10       | 1eta 1eed 1azm 1rne 6rnt 1mcr 2mth 1mup 2plv 1baf 1ive 2phh 3gch 1hef 1igj 1coy 1xid 1xie 3cla 7tim
(a) 1glq  (b) 4dfr  (c) 4phv

Fig. 3. Acceptable docking examples: The docked ligand conformation (red) is similar to the crystal ligand structure (yellow). The RMSD values are 0.78 Å for 1glq, 0.56 Å for 4dfr, and 0.42 Å for 4phv.
Table 3 shows that GEMDOCK achieved a 76% success rate in identifying the experimental binding model when the solutions at the first rank are considered; the RMSD values of 63 complexes are less than 1.0 Å. This rate rises further to 84% based on the solutions with any rank. The performance of GEMDOCK was little influenced by the number of rotatable bonds or the number of heavy atoms of a ligand. When the structural water and metal atoms are considered, the success rate at the first rank improves to 86%. On average, GEMDOCK took 305 seconds per docking run on a 1.4 GHz Pentium personal computer with a single processor. The maximum time was 883 seconds, for the complex 1rne, and the shortest time was 102 seconds, for 2pk4. In the following, we discuss some acceptable and unacceptable examples. Figure 3 shows three typical acceptable solutions in which GEMDOCK predicted correct positions of all ligand groups. The predicted ligand is red and the crystal ligand is yellow. All three of these examples agree closely with the crystal structures; the RMSD values are 0.78 Å (1glq: nitrophenyl ligand for glutathione S-transferase), 0.56 Å
(a) 1eta  (b) 1ive  (c) 3cla

Fig. 4. Results of three poor docking examples. The docked ligand conformation is red and the crystal ligand structure is cpk. The RMSD values are 3.62 Å for 1eta, 6.32 Å for 1ive, and 7.56 Å for 3cla.
(4dfr: methotrexate ligand for dihydrofolate reductase), and 0.42 Å (4phv: peptide-like ligand for HIV-1 protease).

3.4 Examples of Unacceptable Solutions
Table 3 shows 24 unacceptable docking complexes with RMSD values of more than 2.0 Å. Three poor examples are shown in Figure 4, in which the structures of the predicted ligands and crystal ligands are displayed in red and cpk color, respectively. The RMSD values are 3.62 Å for 1eta (tetraiodo-L-thyronine ligand for transthyretin), 6.32 Å for 1ive (acetylamino ligand for influenza), and 7.56 Å for 3cla (chloramphenicol ligand for Type III chloramphenicol acetyltransferase). We have analyzed these poor examples with numerical experiments to understand why GEMDOCK failed to recognize the binding models. These experiments considered three main factors: the scoring function, the docking materials, and the search method.

For the scoring function, we tested various terms and parameter values (Equation 12) on the dataset of 100 complexes. According to our experimental results, Einter was the main factor in our system; Eintra and Epenal were minor factors that influenced some specific docking cases. The element F(r_ij^{B_ij}) of Einter (Equation 13) dominated the performance. Figure 5 shows the relationship between the RMSD values and scoring values over 100 independent runs. For the good docking example (4dfr), 95 solutions have RMSD values less than 1.0 Å and the scoring value is similar in each run (Figure 5(a)). By contrast, for a poor example the RMSD value is more than 3.0 Å and the scores are diverse (Figure 5(b)). These experiments indicate that our scoring function is simple and fast enough to discriminate the native binding state from non-native docked conformations for 90% of the testing complexes.

Fig. 5. Typical relationship between the values of the scoring function and the RMSD over 100 independent runs: (a) good example, 4dfr; (b) poor example, 1ive. For the good docking example the 95 solutions with RMSD values less than 0.5 Å have similar scoring values. For the poor docking example the RMSD values are often more than 3.0 Å and the scoring values are diverse.

For the docking materials, we have discussed the influence of the size of the binding site (see Section 3.2) and of the hetero atoms in the binding site. Among these 100 complexes there are 17 proteins with metal atoms and 84 proteins with structural water atoms. When the metal atoms are included, GEMDOCK can consistently improve docking accuracy, as for the complexes 1xid and 1xie. In general, GEMDOCK is able to improve the docking accuracy when both structural water and metal atoms are considered.

3.5 Comparison with Other Approaches
To the best of our knowledge, most previous docking studies used a small dataset (< 20 complexes), with the exception of those of GOLD [8] and FlexX [12]. Here we compared GEMDOCK with GOLD and FlexX on the dataset of 100 complexes. Table 4 summarizes these three docking tools. The rates of FlexX are based on a test set of 200 complexes that enlarged the GOLD test set. GOLD is a steady-state genetic algorithm and FlexX is an incremental construction approach. GEMDOCK obtained a 76% success rate based on the condition of an RMSD value less than 2 Å. In contrast, GOLD [8] achieved a 71% success rate in identifying the experimental binding model based on its assessment categories, and the rate was 66% based on the RMSD condition. FlexX [12] achieved a 70% success rate based on solutions with any rank and the RMSD condition; the rate was 46.5% when only the solutions at the first rank were considered. The results of FlexX were often sensitive to the choice of the base fragment, its placement, and the number of fragments.
4 Conclusions
In this work, we have developed a robust evolutionary approach with a simple fitness function for docking flexible molecules. Experiments on 100 test systems verify that the proposed approach achieved a 76% success rate in recognizing the binding models.
Table 4. Comparison of GEMDOCK with GOLD and FlexX on the dataset of 100 complexes

RMSD (Å)     | GEMDOCK | GOLD^a | FlexX^b
≤ 0.5        | 24%     | 8%     | 12.5%
> 0.5, ≤ 1.0 | 39%     | 27%    | 38.5%
> 1.0, ≤ 1.5 | 10%     | 20%    | 12.5%
> 1.5, ≤ 2.0 | 3%      | 11%    | 5.5%
> 2.0, ≤ 2.5 | 1%      | 2%     | 7.5%
> 2.5, ≤ 3.0 | 3%      | 4%     | 2%
> 3.0        | 20%     | 28%    | 21.5%

a: GOLD [8] is a steady-state genetic algorithm.
b: The rate of FlexX [12], a fragment-based approach, is based on any rank with a dataset of 200 complexes extended from the GOLD data set.
GEMDOCK seamlessly blends local and global search, making them work cooperatively through the integration of a number of genetic operators, each with a unique search mechanism. In summary, we have demonstrated the robustness and adaptability of GEMDOCK for exploring the conformational space of a molecular docking problem and efficiently finding solutions under the constraint of the fitness function used. Our scoring function appears simple and fast enough to discriminate native binding states from non-native docked conformations. We believe that GEMDOCK is an effective tool for docking flexible molecules. Acknowledgements. This work was supported in part by grant 91-2320-B-009-001 from the National Science Council of Taiwan and in part by grant DOH92-TD-1132 from the Department of Health, Taiwan.
References

1. I. D. Kuntz. Structure-based strategies for drug design and discovery. Science, 257:1078–1082, 1992.
2. H. Gohlke, M. Hendlich, and G. Klebe. Knowledge-based scoring function to predict protein-ligand interactions. Journal of Molecular Biology, 295:337–356, 2000.
3. S. J. Weiner, P. A. Kollman, D. A. Case, U. C. Singh, C. Ghio, G. Alagona, S. Profeta, Jr., and P. Weiner. A new force field for molecular mechanical simulation of nucleic acids and proteins. Journal of the American Chemical Society, 106:765–784, 1984.
4. B. K. Shoichet, A. R. Leach, and I. D. Kuntz. Ligand solvation in molecular docking. Proteins: Structure, Function, and Genetics, 34:4–16, 1999.
5. D. W. Miller and K. A. Dill. Ligand binding to proteins: the binding landscape model. Protein Science, 6:2166–2179, 1997.
6. I. D. Kuntz, J. M. Blaney, S. J. Oatley, R. Langridge, and T. E. Ferrin. A geometric approach to macromolecular-ligand interactions. Journal of Molecular Biology, 161:269–288, 1982.
7. D. K. Gehlhaar, G. M. Verkhivker, P. Rejto, C. J. Sherman, D. B. Fogel, L. J. Fogel, and S. T. Freer. Molecular recognition of the inhibitor AG-1343 by HIV-1 protease: conformationally flexible docking by evolutionary programming. Chemistry and Biology, 2(5):317–324, 1995.
8. G. Jones, P. Willett, R. C. Glen, A. R. Leach, and R. Taylor. Development and validation of a genetic algorithm for flexible docking. Journal of Molecular Biology, 267:727–748, 1997.
9. J. S. Taylor and R. M. Burnett. DARWIN: A program for docking flexible molecules. Proteins: Structure, Function, and Genetics, 41:173–191, 2000.
10. G. M. Morris, D. S. Goodsell, R. S. Halliday, R. Huey, W. E. Hart, R. K. Belew, and A. J. Olson. Automated docking using a Lamarckian genetic algorithm and empirical binding free energy function. Journal of Computational Chemistry, 19:1639–1662, 1998.
11. C. J. Sherman, R. C. Ogden, and S. T. Freer. De novo design of enzyme inhibitors by Monte Carlo ligand generation. Journal of Medicinal Chemistry, 38(3):466–472, 1995.
12. B. Kramer, M. Rarey, and T. Lengauer. Evaluation of the FlexX incremental construction algorithm for protein-ligand docking. Proteins: Structure, Function, and Genetics, 37:228–241, 1999.
13. F. Österberg, G. M. Morris, M. F. Sanner, A. J. Olson, and D. S. Goodsell. Automated docking to multiple target structures: Incorporation of protein mobility and structural water heterogeneity in AutoDock. Proteins: Structure, Function, and Genetics, 46:34–40, 2002.
14. D. E. Goldberg. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading, MA, USA, 1989.
15. T. Bäck. Evolutionary Algorithms in Theory and Practice. Oxford University Press, New York, USA, 1996.
16. D. B. Fogel. Evolutionary Computation: Toward a New Philosophy of Machine Intelligence. IEEE Press, New York, 1995.
17. J.-M. Yang and C.-Y. Kao. Flexible ligand docking using a robust evolutionary algorithm. Journal of Computational Chemistry, 21(11):988–998, 2000.
18. J.-M. Yang, C.-H. Tsai, M.-J. Hwang, H.-K. Tsai, J.-K. Hwang, and C.-Y. Kao. GEM: A Gaussian evolutionary method for predicting protein side-chain conformations. Protein Science, 11:1897–1907, 2002.
19. J.-M. Yang and C.-Y. Kao. A robust evolutionary algorithm for training neural networks. Neural Computing and Applications, 10(3):214–230, 2001.
20. J.-M. Yang, J.-T. Horng, C.-J. Lin, and C.-Y. Kao. Optical coating designs using an evolutionary algorithm. Evolutionary Computation, 9(4):421–443, 2001.
21. R. M. A. Knegtel, J. Antoon, C. Rullmann, R. Boelens, and R. Kaptein. MONTY: a Monte Carlo approach to protein-DNA recognition. Journal of Molecular Biology, 235:318–324, 1994.
Evolving Sensor Suites for Enemy Radar Detection

Ayse S. Yilmaz1, Brian N. McQuay1, Han Yu1, Annie S. Wu1, and John C. Sciortino, Jr.2

1 School of Electrical Engineering and Computer Science, University of Central Florida, Orlando, FL 32816, USA, {selen,bmcquay,hyu,aswu}@cs.ucf.edu
2 Naval Research Laboratory, Washington, DC 20375, USA, [email protected]
Abstract. Designing optimal teams of sensors to detect enemy radars for military operations is a challenging design problem. Many applications require the management of sensor resources, and there is a tradeoff between the need to decrease the cost and the need to increase the capabilities of a sensor suite. In this paper, we address this design problem using genetic algorithms. We attempt to evolve the characteristics, size, and arrangement of a team of sensors, focusing on minimizing the size of the sensor suite while maximizing its detection capabilities. The genetic algorithm we have developed has produced promising results for different environmental configurations as well as varying sensor resources.
1 Introduction
The problem of determining an optimal team of cooperating sensors for military operations is a challenging design problem. Tactical improvements along with increased sensor types and abilities have driven the need for automated sensor allocation and management systems. Such systems can provide valuable input to human operators in terms of exploring and offering candidate solutions, providing a testbed on which to evaluate candidate solutions, and providing a penalty-free environment on which to test for severe failure conditions. Given a possible configuration of enemy radars, our goal is to find a team of sensors that can sense as many enemy radars as possible. Sensors are available in a wide range of capabilities and costs and the number of sensors in a team can vary. The importance of a mission may restrict the types and numbers of available sensors as well as the maximum cost. There is a tradeoff between the need to maximize the sensing capability of a sensor suite and the desire to minimize cost. The open-endedness of this problem makes it an interesting design problem. The number of potential components (sensor type, characteristics, and placement) that make up a solution is extremely large. There are very few restrictions as well as very little guidance as to how many and what types of these components would make up a good solution. E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 2384–2395, 2003. © Springer-Verlag Berlin Heidelberg 2003
We use a genetic algorithm (GA) to tackle this problem of designing optimal sensor suites. We have developed a GA that will generate candidate sensor suites that are evaluated using a simulator based on military specification. In this paper, we describe experiments which explore the performance of our GA in different types of enemy environments using different types of sensor resources. Our experimental results indicate that such a GA approach is able to design reasonable and effective sensor suites.
2 Related Work
Evolutionary algorithms have been used to generate optimal designs effectively. Peter Bentley's [1] pioneering work in evolutionary design incorporated evolutionary computation techniques into the evolution of design components that could build a wide range of solid objects such as tables, prisms, and cars. The system takes the input specifications for the object to be designed and evolves the shape of the design that performs the required function. Funes et al. [2] apply evolutionary techniques to the design of structures in a simulation environment that are then buildable using Lego parts. Instead of using expert engineering knowledge, they use a fitness function to evaluate the feasibility and functionality of evolved structures. Husbands et al. [3] report the results of a comparative study of ten different optimization techniques in structural design optimization. The distributed genetic algorithm (DGA) and various hybrid methods (DGA with gradient descent, simulated annealing with gradient descent) appear to have significant advantages over the other techniques tested. Lee et al. [4] develop a hybrid approach to evolve both controllers and robot bodies for performing specific tasks. They evolve physical body parameters such as the number and location of sensors and show that the design of the robot body can considerably affect the behavior of the system. Bugajska et al. [5] investigate the co-evolution of form and function for autonomous agents. The form component in [5] focuses on the characteristics of a sensor suite. All sensors within a sensor suite are assumed to have the same characteristics. The characteristics that can be evolved are the number of sensors in a sensor suite and the sensor coverage area of each sensor. The maximum number of sensors is limited to 32 and the detection angle varies from 5 to 30 degrees. The locations of the sensors are fixed, and the degradation of coverage with distance is not taken into consideration. In [6], Bugajska et al.
added the detection angle and the placement of the sensor to the list of evolved characteristics. They assume that power efficiency degrades with increased sensor coverage area. They evolve the number of the sensors implicitly by allowing a zero detection angle to indicate the existence of no sensor. The maximum number of sensors is limited to nine. The sensor detection angle ranges from 0 to 45 degrees. By contrast, in our system, detection angle can vary from 0 to 359 degrees and the placement of the sensors is not limited to fixed locations. The evolvable characteristics of our sensors are location, detection coverage, power and the frequency band at which each sensor operates.
2386
A.S. Yilmaz et al.
Fig. 1. (a) Example problem environment; (b) problem environment with candidate solution.
The sensor management problem has been investigated by researchers using non-evolutionary techniques as well. Gaskell et al. [7] use a decision theoretic approach to develop a sensor management system for mobile robots. Popoli [8] proposes a sensor management scheme that uses fuzzy set theory, knowledge-based reasoning, and expert systems. Schmaedeke et al. [9] use an information theoretic approach for sensor-to-target assignment optimization. We believe the evolutionary technique and the advances in our work will have an advantage in terms of being robust and easily transferable between different problem scenarios with little or no modification to the system.
3 Problem Environment
Before looking at the details of our GA implementation, it is necessary to understand the problem to which we apply the GA. The problem environment consists of a collection of stationary enemy radars located in a two-dimensional plane. Figure 1(a) shows an example environment consisting of twelve randomly placed enemy radars. Radars are represented as points surrounded by gradually fading circles. The environment is restricted to two dimensions in our current work but can be extended to three dimensions in the future. The location, power, and frequencies of the enemy radars are configured beforehand and remain static throughout a run. Each radar can operate on one of four different frequencies: A, B, C, or D. A radar can only be detected by sensors that are configured to sense on the same frequency. A radar must be detected by at least three different sensors to be fully detected. (Three measurements are necessary for accurate triangulation of position.) Radars that are detected by two sensors are partially detected and radars that are detected by one sensor are minimally detected. Partially and minimally detected radars contribute less to the fitness evaluation than fully detected radars. In the experiments described here, we set all radars to operate on a single frequency in order to test the GA's ability to evolve minimal cost teams.
Fig. 2. (a) Problem representation for the evolvable team of sensors; (b) sensor characteristics, α = detection angle range and β = direction angle.
Figure 1(b) shows the same environment along with a candidate solution. The pie shaped elements indicate sensors and their detection angle and range. Lines indicate detection of a radar by a sensor.
4 GA Design

4.1 Problem Representation
Each individual in a GA population specifies the composition and arrangement of a team of sensors encoded as a vector of genes. Each gene encodes the evolvable characteristics of a single sensor. Figure 2(a) shows an example individual which represents a team of N sensors; example parameter values for Sensor 2 are shown in detail. As the optimal number of sensors may not be known in advance, we allow the GA to evolve variable-length individuals. Initially, each individual contains 20 randomly configured sensors. The maximum possible length of an individual is 100, indicating a maximum team size of 100 sensors. The location of each sensor is encoded as x and y coordinate values. Sensors can be active or inactive. As we have no restriction on representation, multiple sensors in an individual may have the same location in the environment. In that case, only the sensor that appears first in the individual (the leftmost sensor) is active. The others are inactive and unable to detect any radars; however, they are still included in the cost component of the fitness function. Figure 2(b) illustrates the direction angle and the detection angle range. The direction angle indicates the "front" of a sensor and is a counter-clockwise angle relative to the 3 o'clock position. The detection angle range is centered around the direction angle and indicates the range within which a sensor can detect signals. Suppose that we have a square environment with a radar in each corner and a sensor located in the center of the square, and that the sensor has a direction angle of 45 degrees and a detection angle range of 90 degrees. This particular sensor is only capable of detecting the radars located in the upper right quadrant of the environment, as long as other requirements such as frequency and power are satisfied.
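The angular-coverage test in the example above can be written as a small helper (a sketch; the function name, the degree-based API, and the coordinate conventions are my assumptions consistent with the description):

```python
import math

def in_detection_wedge(sensor_xy, direction_deg, range_deg, radar_xy):
    """Return True if radar_xy lies inside the sensor's angular coverage.

    direction_deg is measured counter-clockwise from the 3 o'clock
    position; range_deg is the total detection angle range, centered
    on the direction angle.
    """
    dx = radar_xy[0] - sensor_xy[0]
    dy = radar_xy[1] - sensor_xy[1]
    bearing = math.degrees(math.atan2(dy, dx)) % 360.0
    # smallest angular difference between bearing and direction, in [0, 180]
    diff = abs((bearing - direction_deg + 180.0) % 360.0 - 180.0)
    return diff <= range_deg / 2.0
```

With the paper's example (direction 45 degrees, range 90 degrees), only targets in the upper right quadrant pass the test.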
The minimum power threshold for each sensor indicates the minimum power value that a sensor must receive from an enemy radar in order for detection to occur. A sensor is initially randomly configured with a minimum threshold power for radar detection. The power a sensor receives from a radar is inversely proportional to the distance between the sensor and the radar, and proportional to the base power of the enemy radar. Thus, p = r/d, where p is the power received by the sensor, r is the enemy radar power, and d is the distance between the sensor and the enemy radar. A sensor can detect a radar only if the power received by the sensor is greater than the minimum threshold for that sensor. The minimum threshold value that can be evolved is configurable as a system parameter, which enables us to vary the coverage capabilities of the sensors that can be generated within a GA run. Each sensor can detect radar frequencies in zero to four bands (represented by A, B, C, and D in Section 3). Detection is represented by a four-bit binary string in each gene. Each bit refers to one radar band: a 1 denotes the ability to detect the radar frequency represented by that bit and a 0 indicates inability to detect the corresponding radar frequency. Sensors are capable of detecting on more than one band. A sensor can detect a radar only if it operates on the same radar frequency as the radar.
4.2 Fitness Evaluation
The fitness function consists of two parts, the detection capability and the total cost of a solution:

f = ρ / σ,  (1)
where f is the raw fitness, ρ is the detection fitness that indicates the detection capability, and σ is the total cost of a solution. To calculate ρ, the detection fitness, we count the numbers of radars that are fully, partially, and minimally detected:

ρ = (3F + 2P + M) / R,  (2)
where R is the number of enemy radars and F, P, and M are the numbers of fully, partially, and minimally detected radars, respectively. The detection angle affects how much of the environment a sensor can see, which implicitly affects the number of detectable radars. The raw fitness is inversely proportional to the total solution cost. The total cost of a solution is its basic cost plus the total cost of all of the sensors:

σ = b + Σ_{i=1}^{n} (p_i + q_i),  (3)
where b is the fixed basic cost of deployment, n is the total number of sensors, p_i is the fixed basic cost of each sensor, and q_i is the cost due to the frequency bands equipped in each sensor:
Fig. 3. Inter-gene level crossover operation
q_i = 1 + 2r_i,  (4)
where r_i is the number of bands a sensor has. The cost of radar bands favors multi-band sensors, such that it is cheaper to have one sensor with four bands than two sensors with two bands each. The costs given to a sensor according to the number of bands it has are 1, 3, 5, 7, and 9 for zero, one, two, three, and four bands, respectively. Two 2-band sensors would cost 10 while one 4-band sensor would cost 9, thus favoring multi-band sensors.
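Equations 1–4 can be combined into a small fitness sketch (the data layout — per-radar detection counts and (basic cost, band count) pairs per sensor — is an illustrative assumption):

```python
def band_cost(num_bands):
    """Equation 4: q_i = 1 + 2*r_i, i.e. 1, 3, 5, 7, 9 for 0-4 bands."""
    return 1 + 2 * num_bands

def fitness(detections_per_radar, sensors, base_cost):
    """Raw fitness f = rho / sigma (Equations 1-3).

    detections_per_radar: number of distinct sensors seeing each radar.
    sensors: list of (fixed basic cost, number of bands) per sensor.
    """
    R = len(detections_per_radar)
    full = sum(1 for d in detections_per_radar if d >= 3)
    partial = sum(1 for d in detections_per_radar if d == 2)
    minimal = sum(1 for d in detections_per_radar if d == 1)
    rho = (3 * full + 2 * partial + minimal) / R
    sigma = base_cost + sum(p + band_cost(r) for p, r in sensors)
    return rho / sigma
```

Note that two 2-band sensors cost 10 while a single 4-band sensor costs 9, reproducing the multi-band bias described above.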
Selection and Genetic Operators
The selection method we use is deterministic tournament selection with size two. A one-point crossover operator is applied to the variable-length individuals, where the crossover point on each parent is chosen independently; the length of an offspring may therefore differ from that of its parents. Crossover points always fall between genes, as shown in Figure 3. We use a mutation scheme that differs from traditional mutation: mutation is done at the intra-gene level, and each characteristic of each gene is subject to the mutation rate. If a frequency band is to be mutated, we randomly generate a value (0 or 1) for the mutated bit. If the location, direction angle, detection angle range, or minimum threshold power is to be mutated, a Poisson distribution is used to generate an offset from the original value. That is, the new value is not drawn uniformly from a specified range, but follows a probability distribution that favors smaller changes. As a result, mutation is more likely to generate values similar to the original value than to jump randomly to any new value. We expect this mutation scheme to encourage accurate adjustment of the location, direction angle, detection angle range and power of the sensors.
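A sketch of the Poisson-offset mutation for the real-valued gene characteristics. The Poisson mean `lam` and the clamping range are illustrative assumptions (the paper does not give them), and the sampler is Knuth's classic algorithm:

```python
import math
import random

def poisson_sample(lam):
    """Knuth's algorithm: sample a Poisson-distributed integer with mean lam."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while p > limit:
        k += 1
        p *= random.random()
    return k - 1

def mutate_characteristic(value, lam=2.0, low=0.0, high=100.0):
    """Perturb a gene characteristic (location, angle, power) by a
    Poisson-distributed offset in a random direction, so small changes
    are far more likely than large jumps."""
    offset = poisson_sample(lam)
    sign = random.choice((-1, 1))
    return min(high, max(low, value + sign * offset))
```

With a small mean, most mutations nudge the value by zero, one, or two units, which is exactly the "favor smaller changes" behavior described above.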
5 Experiments
The goal of our experiments is to demonstrate that a GA is capable of designing teams of sensors for the detection of enemy radars in a reasonable amount of time. We test our GA on a variety of environmental scenarios and sensor configurations to study the robustness of the system with respect to the environmental conditions.

A.S. Yilmaz et al.

5.1 Test Sets
The two environmental scenarios we test are:

– Radars organized in a 5 × 5 grid pattern, totaling 25 radars. We expect the GA to evolve a relatively distributed placement of sensors in this environment. Placing sensors within the area covered by radars is expected to be most efficient; however, the rightmost column and the bottom row of the environment are left empty to see if the GA places sensors outside the perimeter of the radars.
– Radars clustered at random locations: 12 radars in three clusters. We expect sensors to be clustered near the radar clusters. Locations that have no enemy radars are expected to have no sensors.

The two general types of sensors we test are:

– Long range sensors that are able to cover the entire environment. As any sensor can detect all radars, we expect the GA to evolve solutions close to three sensors, the minimum number required to fully detect a radar.
– Short range sensors that cover less than a quarter of the environment. We expect solutions to consist of more sensors, as short range sensors cannot sense the entire environment.

Using the test cases explained above, we test four different scenarios in our experiments:

1. Long range sensors with radars placed evenly in a grid pattern.
2. Long range sensors with radars placed randomly in three clusters.
3. Short range sensors with radars placed evenly in a grid pattern.
4. Short range sensors with radars placed randomly in three clusters.

All enemy radars are set to operate on a single band, A. We selected the following GA parameter settings based on the performance of previous experiments:

Population size           : 200, initialized randomly
Initial length            : 20
Parent selection          : Tournament, size 2
Crossover type            : one-point
Crossover rate            : 0.7
Mutation rate             : 0.01 (per gene)
Max number of generations : 400
Number of runs            : 100
Fig. 4. Long range sensors with enemy radars located in a grid for a single run

Fig. 5. Length (number of sensors) and percentage of detection averaged over 100 runs for long range sensors and grid placement scheme for enemy radars
5.2 Test Results
Figure 4 shows an example solution from our first scenario. In this experiment, sensors are able to extend their coverage area to include the whole environment. We observe that the GA evolves a team of four sensors, close to the minimum requirement of three to fully detect all enemy radars. There are no undetected radars: 24 out of 25 enemy radars are fully detected, and only the enemy radar located at the top right corner is detected by just two sensors. The results we obtain over 100 runs are consistent with the example solution. Figure 5 shows the evolution of individual length, i.e. the number of sensors evolved, and the percentage of radar detection averaged over 100 runs. The number of sensors in the best individuals levels off around four, and the percentage of radar detection in the best individuals is typically 90% or higher.

The second scenario uses long range sensors in a clustered environment. Our best solutions are again teams of four sensors, as shown in Figure 6. All radars except one are fully detected; the remaining radar is partially detected. The average behavior of the GA over 100 runs, shown in Figure 7, is similar to that shown in Figure 5, suggesting that the behavior of this GA, in terms of the number of evolved sensors and the percentage of detection, is independent of the placement of the enemy radars. This result is expected, as long range sensors should have the same capability regardless of the enemy radar configuration. In general, sensors tend to cluster around the center of the environment in order to detect all radars most efficiently.

Fig. 6. Long range sensors with enemy radars located randomly in 3 clusters for a single run

Fig. 7. Length (number of sensors) and percentage of detection averaged over 100 runs for long range sensors and cluster placement scheme for enemy radars

The third scenario employs sensors that can sense at most one quarter of the environment. Given the distributed radars, and assuming at least three sensors are needed for each quarter of the environment, we anticipate solutions containing a minimum of twelve sensors. The GA evolves a team of fourteen sensors, as shown in the example solution in Figure 8. All enemy radars are detected by at least one sensor. The fourth scenario employs short range sensors in a clustered environment. We expect each cluster to require at least three sensors for full detection. Figure 10 shows that the GA succeeds in evolving a team of nine sensors. There are no undetected radars; the numbers of radars detected by one, two and three sensors are one, seven and four, respectively. Because of their reduced range, all of the sensors can only detect radars within one of the clusters.
Fig. 8. Short range sensors with enemy radars located in a grid for a single run

Fig. 9. Length (number of sensors) and percentage of detection averaged over 100 runs for short range sensors and grid placement scheme for enemy radars
Fig. 10. Short range sensors with enemy radars located randomly in 3 clusters
The results we obtain over 100 runs are also consistent with the single-run results. The individual length, i.e. the number of sensors evolved, and the percentage of detection averaged over 100 runs are reported in Figure 9 and Figure 11 for the grid and clustered placements of the enemy radars, respectively. The GA detects the enemy radars with the same performance for both placement schemes, with a detection percentage around 90%. However, the number of sensors evolved in the best individuals levels off around sixteen for grid placement and near fourteen for clustered placement, which makes short range sensors more sensitive to changes in the environment than long range sensors. Due to the limited capability of the sensors, we observe a considerable increase in the number of sensors evolved compared to the scenarios with long range sensors. We also explored additional sensor ranges and include an example solution using medium range sensors, which can potentially cover approximately 75% of the working environment. This solution, shown in Figure 12, includes sensors that are mainly focused within clusters, with a few that are able to reach multiple clusters.

Fig. 11. Length and percentage of detection averaged over 100 runs for short range sensors and cluster placement scheme for enemy radars

Fig. 12. Medium range sensors with enemy radars located randomly in 3 clusters
6 Conclusion
In this study, we examine the problem of designing teams of sensors to detect enemy radars located in different configurations. We extend the basic GA design to evolve as many sensor characteristics as possible. Based on our detection criteria, the results we obtain are promising. Our GA is able to evolve close to the optimal number of sensors given the sensors' capabilities, and the system is demonstrated to be robust to changes in the placement of the enemy radars. The detection percentage is around 90% with sensors of limited sensing capability and around 100% with sensors that have full coverage capability, independent of the enemy radar placement. Future goals include testing the robustness and performance of our GA on more realistic scenarios, such as ones with moving radars. In addition, we plan to investigate the adaptability of our system to changing environments, to introduce more problem-specific genetic operators, and to investigate the identification of the building blocks that form effective subteams.

Acknowledgements. The study reported in this paper is sponsored by the Naval Research Laboratory and ITT Industries Inc., and partially supported by NSF. We would like to thank the reviewers for their valuable comments.
Optimization of Spare Capacity in Survivable WDM Networks¹
H.W. Chong and Sam Kwong
Department of Computer Science, City University of Hong Kong, 83 Tatchee Avenue, Hong Kong
[email protected], [email protected]
Abstract. A network with restoration capability requires spare capacity to be used in the case of failure. Spare capacity optimization finds the minimum amount of spare capacity that allows the network to survive network component failures. In this paper, this problem is investigated for wavelength division multiplexing (WDM) mesh networks without wavelength conversion. We propose a hybrid genetic algorithm (GA) approach for the problem. Simulated Annealing (SA) and Tabu Search (TS) are also applied to the problem for comparison purposes. Simulation results strongly favor the genetic algorithm approach.
1 The Proposed Algorithm
In this paper, we are primarily concerned with restoration at the optical layer. The discussion is limited to wavelength-continuous optical WDM mesh networks with static traffic. Given a network topology, a traffic demand consisting of the connections to be established, and a set of alternate routes for each connection request, our objective is to find sets of primary lightpaths and backup lightpaths such that the sum of working and backup capacity usage is minimized, while the set of customer traffic demands is still satisfied and the traffic is 100% restorable under a single link failure. Our algorithms use path-based restoration and backup multiplexing to improve channel utilization. To minimize the spare capacity, we optimize both routing and wavelength assignment; this combinatorial problem is usually called the Routing and Wavelength Assignment (RWA) problem, and it is known to be NP-complete. For the routing problem, we employ the alternate routing approach: for each connection request, a set of candidate routes is precomputed off-line, and our algorithms select the primary and backup routes that optimize the total capacity usage. For the wavelength assignment problem, we reduce it to a graph-coloring problem, and the nodes are assigned colors in the order found by our algorithms. The RWA problem is thus divided into two sub-problems in this paper, but our proposed GA solves both sub-problems simultaneously. Figure 1 shows the structure of a chromosome, which consists of three parts.
¹ This work is supported by City University Strategic Grant 7001337.
E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 2396–2397, 2003. © Springer-Verlag Berlin Heidelberg 2003
Part one is the set of working routes for the connection requests, part two is the set of backup routes for the connection requests, and part three is the wavelength assignment order for each route. Since the chromosome is divided into three parts and each part encodes a different sub-problem, we design a different type of crossover operation for each part: one-point and two-point crossover operations are applied to parts 1 and 2 of the chromosome, after which a uniform order-based crossover operation is applied to part 3. Similarly, we design a different mutation operator for each part of the chromosome structure. For mutation on parts 1 and 2, we select a working route and its corresponding backup route, and then a new working route and backup route are found to replace the current routes in the selected chromosome. After performing mutation on parts 1 and 2, we apply scramble sublist mutation [1] to the third part of the chromosome.
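A sketch of the per-part crossover on this three-part chromosome. The dict-based representation and the assignment of one-point crossover to part 1 and two-point crossover to part 2 are assumptions for illustration; the paper only states that both operators are applied to parts 1 and 2:

```python
import random

def crossover(parent_a, parent_b):
    """Per-part crossover: one-point on working routes (part 1),
    two-point on backup routes (part 2), and uniform order-based
    crossover on the wavelength-assignment permutation (part 3)."""
    child = {}
    # Part 1: one-point crossover on working routes
    cut = random.randrange(1, len(parent_a["working"]))
    child["working"] = parent_a["working"][:cut] + parent_b["working"][cut:]
    # Part 2: two-point crossover on backup routes
    i, j = sorted(random.sample(range(len(parent_a["backup"]) + 1), 2))
    child["backup"] = (parent_a["backup"][:i] + parent_b["backup"][i:j]
                       + parent_a["backup"][j:])
    # Part 3: keep a random subset of positions from parent A, then fill
    # the remaining positions with the missing genes in parent B's order
    mask = [random.random() < 0.5 for _ in parent_a["order"]]
    kept = [g for g, m in zip(parent_a["order"], mask) if m]
    fill = iter([g for g in parent_b["order"] if g not in kept])
    child["order"] = [g if m else next(fill)
                      for g, m in zip(parent_a["order"], mask)]
    return child
```

The order-based crossover on part 3 guarantees the child's wavelength-assignment order remains a valid permutation, which a plain one-point cut would not.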
2 Experimental Results
We conducted experiments to study the performance of the proposed GA. For benchmarking, SA and TS are applied to the problem as well. The three approaches are applied to the network shown in Fig. 2. A performance comparison between GA, SA and TS is shown in the accompanying chart. From the results, the GA approach clearly outperforms the SA and TS approaches, with TS the poorest of the three. The GA approach is therefore recommended as the best choice for solving the spare capacity allocation problem arising in the design of survivable WDM mesh networks with static traffic.
Fig. 1. Structure of a chromosome: working routes P1 ... Pn, backup routes B1 ... Bn, and wavelength assignment order X1 ... X2n

Fig. 2. Simulation network and results: total capacity usage versus number of connections for GA, SA and TS on the simulation network
References
1. Davis, L.: Handbook of Genetic Algorithms. Van Nostrand Reinhold, New York (1991)
Partner Selection in Virtual Enterprises by Using Ant Colony Optimization in Combination with the Analytical Hierarchy Process
Marco Fischer¹, Hendrik Jähn¹, and Tobias Teich²
¹ Chemnitz University of Technology, 09107 Chemnitz, Germany
[email protected]
http://www.tu-chemnitz.de/wirtschaft/bwl7/
² The University of West Saxony at Zwickau, PSF 201037, 08012 Zwickau, Germany

1 Summary
Within the scope of the collaborative research centre "Non-hierarchical regional production networks" at Chemnitz University of Technology, a virtual enterprise model is developed. It is based on very small performance units, the competence cells (CCs). The precondition of the efficient, and thereby competitive, operation of those networks is a network controlling that objectively selects the best suited CCs for every order. This problem gains complexity because the manufacture of a product can be carried through in different ways. The central task within the network consists in guaranteeing the best possible realization of an order. First of all, a customer offer has to be generated for the competence cell network, and the CCs necessary for processing the several partial performances within an order have to be selected. Thereby, a difficult economic decision problem arises for the network management.

E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 2398–2399, 2003. © Springer-Verlag Berlin Heidelberg 2003

Fig. 1. Illustration of the Problem

The elementarized process variant plans are described exactly by the demand vectors (DVs). Those define the necessary process plans in order to complete an intermediate product. According to the DVs, the corresponding CCs are searched for all elements which are potentially able to complete the performances. This means the offer vectors (OVs) of the CCs have to match the DVs to a certain degree. The OV has to express the capabilities of a CC as exactly as possible and to make them comparable, in order to enable a machine evaluation. For the optimization, all manufacturing variants within the CC-offer network are illustrated in a digraph. Each manufacturing alternative includes a source, subsequently it disposes of CCs for carrying through the partial performances, and finally it ends in a drain. The objective function of the algorithm is the maximization of the combined AHP values of the CCs: the larger the value of a manufacturing variant, the more advantageous it is to realize that variant. Figure 1 illustrates the modelling as a digraph. For every step of processing along the way to the final product, several manufacturing variants exist, out of which the best has to be selected with the help of the algorithm illustrated in Figure 2.

begin
  initialization(problem-structure);
  i = source;
  while not(exit-condition) do
    for k := 0 to m step 1 do
      while (N_i^k ≠ ∅) and (i ≠ destination) do
        random(z);
        if z ≤ q then p_ij^k(t) = [τ_ij(t)]^α · [η_ij]^β;
        else p_ij(t) = [τ_ij(t)]^α · [η_ij]^β / Σ_{l ∈ N_i^k} [τ_il(t)]^α · [η_il]^β;
        fi
        decide(j);
        Ψ_k = Ψ_k ∪ CC(max(AHP(j)));
        /* local pheromone update */
        τ_ij(t+1) = (1 − ρ′) · τ_ij(t) + ρ′ · Δτ_ij;
        i := j
      od
    od
    /* global pheromone update */
    τ_ij(t+1) ← (1 − ρ) · τ_ij(t) + Δτ_ij(t) ∀ τ_ij;
    apply max-min rule to τ_ij;
    decide(L_k);
  od
  Ψ_k ∈ M : ∀ Ψ_k with L_k > κ · L*_k, 0 ≤ κ ≤ 1;
  compute(E_{Ψk,CC} ∀ CC ∈ Ψ_k, K_{Ψk} ∀ Ψ_k ∈ M);
  decide(Ψ_k^max : max(aggregation(MK_k, SK_k)))
end

Fig. 2. Procedure of the Search for an optimal Manufacturing Variant
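The pseudo-random-proportional transition rule at the heart of Fig. 2 (the `z ≤ q` branch) can be sketched as follows; the parameter values α, β and q, and the dict-based pheromone and heuristic tables, are illustrative assumptions:

```python
import random

def choose_next(candidates, tau, eta, alpha=1.0, beta=2.0, q=0.9):
    """Ant Colony System transition rule: with probability q, greedily
    pick the candidate j maximizing tau[j]^alpha * eta[j]^beta
    (exploitation); otherwise sample j proportionally to that weight
    (biased exploration), as in the if/else branch of Fig. 2."""
    weights = {j: tau[j] ** alpha * eta[j] ** beta for j in candidates}
    if random.random() <= q:
        return max(weights, key=weights.get)
    r = random.random() * sum(weights.values())
    for j in candidates:
        r -= weights[j]
        if r <= 0:
            return j
    return candidates[-1]

def local_pheromone_update(tau, j, rho_local=0.1, delta=1.0):
    """Local update from Fig. 2: tau <- (1 - rho') * tau + rho' * delta."""
    tau[j] = (1 - rho_local) * tau[j] + rho_local * delta
```

Raising q pushes the search toward the currently best-rated CCs, while the local pheromone update slightly erodes chosen edges so subsequent ants explore alternative manufacturing variants.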
Quadrilateral Mesh Smoothing Using a Steady State Genetic Algorithm
Mike Holder¹ and Charles L. Karr²
¹ American Buildings Company, 1150 State Docks Road, Eufaula, AL 36027-3344
² Aerospace Engineering and Mechanics, The University of Alabama, Tuscaloosa, AL 35487-0280
Abstract. This paper investigates the use of a steady state genetic algorithm (GA) to perform quadrilateral finite element mesh smoothing. GAS, short for genetic algorithm smoother, moves from one to 64 nodes at a time. GAS smooths as well as untangles meshes (removing twisted and inverted elements), which has been a separate operation in the past.
1 Introduction
A common use of finite element analysis (FEA) is to determine the displacements and stresses of a structure. Automeshers fill the model with finite elements, establishing connectivity between the element nodes; it is the objective of mesh smoothers to ensure the quality of the mesh once connectivity has been established, since poorly shaped finite elements produce incorrect results. This paper compares a GA smoothing tool with the Laplace smoothing method.
2 Finite Element Smoothing with a GA
FlexGA, the genetic algorithm engine used in this paper, is written in MATLAB and was developed by Dr. K. K. Kumar of The Flexible Intelligence Group, Tuscaloosa, AL. The objective function, created by Alan Oddy at Carleton University in Ottawa, Canada, was written in C and interfaced with MATLAB using CMEX, the MATLAB/C interface code [1]. The GA works on a family of chromosomes that, in this problem, represent the X/Y coordinates of each element node. The distortion metric was augmented with penalty functions that check for quadrilateral interior angles deviating from 90 degrees, aspect ratios deviating from one, and twisted or inverted elements. The angular penalty function is

pf = Σ_{i=1}^{4} (φ_i − 90)² / SF
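A sketch of this angular penalty (sum of squared deviations of the four interior angles from 90 degrees, divided by the scale factor SF); the value used for SF here is an assumption, since the paper does not give it:

```python
def angular_penalty(interior_angles_deg, scale_factor=100.0):
    """Angular penalty for one quadrilateral element: sum of squared
    deviations of its four interior angles from 90 degrees, scaled by SF."""
    assert len(interior_angles_deg) == 4
    return sum((phi - 90.0) ** 2 for phi in interior_angles_deg) / scale_factor
```

A perfectly square element contributes zero penalty, so the GA is pushed toward right-angled, well-shaped quadrilaterals.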
GAS-N moves all of the element nodes at a time; an upper limit of about 64 nodes was run using GAS-N. A rectangle with a cutout was created to fool the Laplace smoother and to see how the GA would perform using movable boundary nodes (see Figure (a)).

E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 2400–2401, 2003. © Springer-Verlag Berlin Heidelberg 2003
3 Results
For simple meshes, the Laplace smoother outperformed the GAS-N smoother when both accuracy and computational time were considered. However, GAS-N usually outperformed the Laplace method in terms of accuracy when complex geometries were encountered. An example is the rectangle with a semicircular cutout shown in Figure (a). Figure (b) presents this mesh smoothed with the Laplace method (maximum distortion of 84.3), while Figure (a) gives the mesh smoothed with GAS-N using movable boundary nodes (maximum distortion of 1.9).
(a) GAS-smoothed with movable boundary nodes
(b) Laplace-smoothed
Using a series of square "checker board" models ranging in size from an 8×8 mesh to a 3×3 mesh, it was determined that 64 X/Y node coordinates can be successfully manipulated at a time; a more reasonable upper limit is 35 active nodes. It was also determined that 16 movable nodes is the controllable limit of GAS-N without the use of feasible circles to limit the individual nodal search areas. In conclusion, GAS-N's ability to move small groups of nodes simultaneously during the smoothing process makes it a potentially effective tool for smoothing large meshes.
References 1. Holder, E. M., (2001). Quadrilateral Mesh Smoothing Using a Genetic Algorithm, Doctoral Dissertation, University of Alabama.
Evolutionary Algorithms for Two Problems from the Calculus of Variations
Bryant A. Julstrom
Department of Computer Science, St. Cloud State University, St. Cloud, MN 56301 USA
[email protected]
Abstract. A brachistochrone is the path along which a weighted particle falls most quickly from one point to another, and a catenary is the smooth curve connecting two points whose surface of revolution has minimum area. Two evolutionary algorithms find piecewise linear curves that closely approximate brachistochrones and catenaries.
Two classic problems in the calculus of variations seek a brachistochrone, the path along which a weighted particle falls most quickly from one point to another, and a catenary, the smooth curve of arc length l between two points whose surface of revolution has minimum area. Analytical solutions to these problems have long been known. In a uniform gravitational field and without friction, a brachistochrone is an arc of a cycloid, the curve traced by a point on a rolling circle. The curve of specified length whose surface of revolution has minimum area is a catenary, an arc of a hyperbolic cosine. Two evolutionary algorithms seek approximate solutions to these problems. They search spaces of piecewise linear functions, which they represent as sequences of y-coordinates associated with evenly-spaced x's. The algorithms apply mutation and (µ + µ) reproduction to the chromosomes in their populations. The mutation operator perturbs elements of parent chromosomes with random values from a normal distribution with mean zero, and the standard deviation of this distribution diminishes as each algorithm runs. Previously, Zitar and Homaifar [2] described a genetic algorithm for the brachistochrone problem, and Erickson, Killmer, and Lechner [1] applied genetic programming to it. In the EA for the brachistochrone problem, a chromosome's fitness, which the EA seeks to minimize, is the time required for the particle to fall along the path that the chromosome represents. Consider the particle as it passes through a point P = (x, y) on a curve. The particle was initially at rest, and its potential energy relative to P was mgy, where m is the particle's mass and g is the acceleration due to gravity. At P, the particle is moving with velocity v and its kinetic energy is (1/2)mv². Energy is conserved, so mgy = (1/2)mv². Thus the particle's velocity at P is v = √(2gy).
The length of the segment from (x_i, y_i) to (x_{i+1}, y_{i+1}) is d_i = √((x_{i+1} − x_i)² + (y_{i+1} − y_i)²), and the particle's average velocity over the segment is the average of its velocities at the segment's endpoints: v_i = (√(2gy_i) + √(2gy_{i+1}))/2. The time the particle spends traversing the segment is then t_i = d_i/v_i, and the time the particle requires to fall the entire distance on the path a chromosome specifies is the sum of these values over all the subintervals.

E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 2402–2403, 2003. © Springer-Verlag Berlin Heidelberg 2003

In the EA for the catenary problem, we seek to minimize both the area of the surface generated by rotating the represented curve about the x-axis and the difference between the length of that curve and a target length l. The segment connecting (x_i, y_i) and (x_{i+1}, y_{i+1}) sweeps out a frustum of a cone whose surface area is A_i = 2π·r_i·d_i, where r_i = (y_i + y_{i+1})/2 is the average radius of the frustum and d_i is again the length of the segment. The area of the surface is the sum of the areas of these frusta, and the length of the curve is the sum of the lengths of its segments. A chromosome's fitness is a weighted sum of the former and the deviation of the latter from l:

fitness = w_A · area + w_d · |length − l|.

In tests of the two algorithms, the number of linear segments was n = 30. The EAs' populations contained µ = 50 chromosomes, each of n + 1 = 31 values. The initial standard deviation of the mutating distribution was 1.0, and it was multiplied by 0.998 after each generation. Each algorithm ran through 2 500 generations. In the EA for the catenary problem, w_A = 1 and w_d = 25. Both EAs were effective on a variety of test instances. Figure 1(a) shows a curve of rapid descent generated by the first algorithm; it is indistinguishable from the optimum arc of a cycloid. Figure 1(b) shows a curve of small area and specified length generated by the second algorithm; it is indistinguishable from the optimum arc of a hyperbolic cosine.
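The two fitness computations can be sketched directly from these formulas; the function names and the downward-positive y convention are illustrative assumptions:

```python
import math

def fall_time(xs, ys, g=9.81):
    """Fall time along the piecewise linear path (xs[i], ys[i]), with y
    measured downward from the starting point: each segment contributes
    t_i = d_i / v_i, where v_i = (sqrt(2*g*y_i) + sqrt(2*g*y_{i+1})) / 2."""
    total = 0.0
    for i in range(len(xs) - 1):
        d = math.hypot(xs[i + 1] - xs[i], ys[i + 1] - ys[i])
        v = (math.sqrt(2 * g * ys[i]) + math.sqrt(2 * g * ys[i + 1])) / 2
        total += d / v
    return total

def catenary_fitness(xs, ys, target_length, w_area=1.0, w_dev=25.0):
    """Weighted sum of the surface-of-revolution area (sum of conical
    frusta A_i = 2*pi*r_i*d_i) and the deviation of the curve's length
    from the target length l."""
    area = length = 0.0
    for i in range(len(xs) - 1):
        d = math.hypot(xs[i + 1] - xs[i], ys[i + 1] - ys[i])
        r = (ys[i] + ys[i + 1]) / 2
        area += 2 * math.pi * r * d
        length += d
    return w_area * area + w_dev * abs(length - target_length)
```

A useful sanity check: for a purely vertical drop, the endpoint-averaged velocity makes the segment sum telescope to the exact free-fall time √(2h/g), regardless of how finely the path is subdivided.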
Fig. 1. (a) An approximate brachistochrone generated by the first EA and (b) an approximate catenary generated by the second EA
References 1. Damon Erickson, Charles Killmer, and Alicia Lechner. Solving the brachistochrone problem with genetic programming. In H. R. Arabnia and Youngsong Mun, editors, Proceedings of the International Conference on Artificial Intelligence: IC-AI’02, volume III, pages 860–864. CSREA Press, 2002. 2. R. A. Abu Zitar and Abdullah Homaifar. The genetic algorithm as an alternative method for optimizing the brachistochrone problem. In Proceedings of the International Conference on Control and Modeling, Rome, 1990.
Genetic Algorithm Frequency Domain Optimization of an Anti-Resonant Electromechanical Controller
Charles L. Karr¹ and Douglas A. Scott²
¹ Associate Professor and Head, Aerospace Engineering and Mechanics, The University of Alabama, Box 870280, Tuscaloosa, AL 35487-0280
² Graduate Student, Mechanical Engineering, Purdue University, 1288 ME, West Lafayette, IN 47907-1288
Abstract. The new standard for actuators in the aerospace industry is electromechanical actuators. This paper describes an approach whereby PID and DFF controllers are used in unison to effectively manipulate a thrust vectoring system.
1 Introduction and Background
Thrust vector control (TVC) is a prime example of an aerospace system in which electromechanical actuators (EMAs) can potentially be effective. Researchers at NASA’s Marshall Space Flight Center are currently developing EMAs to replace the hydraulic servo actuators for use on space vehicles with TVC. The current paper focuses on the control of EMAs in TVC applications and on the optimization of such control systems with genetic algorithms (GAs).
2 Problem Environment
Thrust vector control involves the manipulation of engine thrust direction to achieve attitude adjustment of a vehicle. Figure (a) illustrates the placement and function of the actuators for TVC. The objective of the current effort is to develop an EMA controller that can provide accurate engine position control (high bandwidth) while at the same time rejecting the disturbance forces encountered during startup and shutdown. Scott and Karr [1] presented a control system for manipulating the TVC that incorporates the PID and DFF controller architectures. This control architecture is presented in Figure (b).
3 Results
In this problem, a GA is asked to determine values for five control gains used in the control system. Each parameter was represented with an eight-bit substring. A fitness function was defined that considered (1) the magnitude of the
E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 2404–2405, 2003. © Springer-Verlag Berlin Heidelberg 2003
[Figures: (a) TVC operation; (b) block diagram representation of the total system, in which a PID position controller, (kd·s² + kp·s + ki)/s, and a DFF controller, kf·s/(s + pf), act on the TVC actuators, gimbal joint, and nozzle of the plant.]
frequency response, (2) the number of peaks in the frequency response, and (3) the rate at which the frequency response decreases. Figures (c) and (d) compare the performance of the GA-designed controller to that of the previously best-known Bode controller. The GA controller produces a smaller phase lag in the engine closed-loop frequency response at higher frequencies than the closed-loop system incorporating the Bode controller, which improves the tracking ability of the GA controller. The GA design produces a higher bandwidth than the Bode controller. Additionally, by reducing the resonant peak present in the Bode-controlled system, the GA controller provides a higher overall phase margin.
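The encoding step described above (five gains, eight bits each) can be sketched as follows; the gain names and ranges in BOUNDS are hypothetical, since the paper specifies only the substring length:

```python
def decode_gains(chromosome, bounds):
    """Decode a binary GA chromosome into control gains: one 8-bit substring
    per gain, linearly scaled into that gain's (lo, hi) range."""
    assert len(chromosome) == 8 * len(bounds)
    gains = []
    for i, (lo, hi) in enumerate(bounds):
        value = int(chromosome[8 * i : 8 * (i + 1)], 2)   # integer in 0..255
        gains.append(lo + (hi - lo) * value / 255)
    return gains

# Hypothetical ranges for the five gains (e.g., kp, ki, kd, kf, pf).
BOUNDS = [(0.0, 10.0)] * 5
```

The all-zeros chromosome maps every gain to its lower bound, and an all-ones substring maps a gain to its upper bound.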
[Figures: (c) phase (degrees) of the engine frequency response versus frequency (rad/sec) for the GA and Bode designs; (d) engine response versus time (seconds) due to a 0.1 in, 1 Hz sine input, showing the 1 Hz command, the GA design, and the Bode design.]
References 1. Scott, D. A., Karr, C. L., & Schinstock, D. E. (2003). Genetic algorithm frequency domain optimization of an anti-resonant electromechanical controller. Engineering Applications of Artificial Intelligence, 12, 201–211.
Circuit Bipartitioning Using Genetic Algorithm Jong-Pil Kim and Byung-Ro Moon School of Computer Science and Engineering Seoul National University Shilim-dong, Kwanak-gu, Seoul, 151-742 Korea {jpkim,moon}@soar.snu.ac.kr
Abstract. In this paper, we propose a hybrid genetic algorithm for partitioning a VLSI circuit graph into two disjoint graphs of minimum cut size. The algorithm includes a local optimization heuristic which is a modification of the Fiduccia–Mattheyses algorithm. On well-known benchmarks (including ACM/SIGDA benchmarks), the combination of the genetic algorithm and the local heuristic outperformed hMetis [3], a representative circuit partitioning algorithm.
The Fiduccia–Mattheyses algorithm (FM) [2] is a representative iterative improvement algorithm for the hypergraph partitioning problem. FM improves an initial solution through short-sighted moves based on gain; as a result, the quality of FM solutions is not stable. Kim and Moon [4] introduced lock gain as a primary measure for choosing the node to move; it uses the history of the search more efficiently. Lock gain showed excellent performance for general graphs. We adapt lock gain to hypergraphs within the framework of FM. Applying lock gain to hypergraph bisection requires considerable modification of the lock-gain calculation, because lock gain was originally designed for general graphs [4]. We propose a new lock-gain calculation method for hypergraph bisection. Define le(v) to be the lock gain of a node v due to the net e; le(v) is obtained as follows. We assume, without loss of generality, that the node v is on the left side.
[Figure: Cases 1 and 2 give a positive lock gain, le(v) = +1; Cases 3 and 4 give a negative lock gain, le(v) = −1.]
If all nodes on the right side are locked and there is no locked node on the left side (Case 1), or if there exist locked nodes on the right side and there is no free node on the left side (Case 2), then le(v) = 1. On the other hand, if all nodes are on the left side and at least one node is locked (Case 3), or if all nodes on the left side except v are locked and there is no locked node on the right side
Table 1. Bipartition Cut Sizes of GA and hMetis

Circuits    | GA Average¹ | CPU²   | Best | hMetis200 Average³ | CPU²   | Best
Test02      | 88.16       | 17.20  | 88   | 89.81              | 12.12  | 88
Test03      | 58.00       | 5.59   | 58   | 59.90              | 11.74  | 58
Test04      | 51.15       | 8.90   | 51   | 54.77              | 10.02  | 54
Test05      | 72.05       | 27.85  | 71   | 71.33              | 19.22  | 71
Test06      | 63.12       | 4.89   | 63   | 64.05              | 12.30  | 64
Prim1       | 53.87       | 1.14   | 53   | 53.00              | 6.33   | 53
Prim2       | 149.02      | 19.45  | 146  | 177.76             | 34.39  | 146
19ks        | 110.22      | 21.44  | 110  | 110.81             | 25.89  | 110
Industry2   | 186.97      | 211.25 | 183  | 181.15             | 233.35 | 180

1. The average cut size of 100 runs
2. CPU seconds on a Pentium III 1 GHz
3. The average cut size of 100 runs, each of which is the best of 200 runs of hMetis
(Case 4), then le(v) = −1. The lock gain l(v) of the node v is then defined to be l(v) = Σ_{e∈N(v)} le(v), where N(v) is the set of nets to which the node v is connected. We tested the proposed algorithm on 9 benchmarks including ACM/SIGDA benchmarks [1]. We compare the performance of the hybrid GA, which uses the lock-gain-based FM as a local optimization engine, against the well-known partitioner hMetis [3]. Because a run of the GA took roughly 200 times more time than a single run of hMetis, it is not clear how critical the genetic search is to the performance improvement. Thus hMetis200, a multi-start version of hMetis with 200 runs, was also compared. Table 1 shows the performance of the GA. On average, the proposed GA performed best on six of the nine graphs. Acknowledgments. This work was partly supported by Optus Inc. and Brain Korea 21 Project. The RIACT at Seoul National University provided research facilities for this study.
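The four cases and the summation l(v) = Σ_{e∈N(v)} le(v) can be transcribed literally into code; this is a sketch under stated assumptions (dictionary-based `side`/`locked` bookkeeping, v on the left side), not the authors' data structures:

```python
def net_lock_gain(v, net, side, locked):
    """Lock gain le(v) of node v for one net, following the four cases in the
    text. side[u] is 'L' or 'R'; locked[u] is True once u has been moved and
    locked. Assumes v is currently on the left side."""
    others = [u for u in net if u != v]
    right = [u for u in others if side[u] == 'R']
    left = [u for u in others if side[u] == 'L']
    right_locked = [u for u in right if locked[u]]
    left_locked = [u for u in left if locked[u]]
    # Case 1: every right-side node locked, no locked node on the left.
    if right and len(right_locked) == len(right) and not left_locked:
        return 1
    # Case 2: some locked node on the right, no free node on the left.
    if right_locked and all(locked[u] for u in left):
        return 1
    # Case 3: all of the net's nodes on the left, at least one locked.
    if not right and left_locked:
        return -1
    # Case 4: all left-side nodes except v locked, no locked node on the right.
    if not right_locked and left and all(locked[u] for u in left):
        return -1
    return 0  # nets matching no case contribute nothing

def lock_gain(v, nets, side, locked):
    """l(v) = sum of le(v) over the nets N(v) incident to v."""
    return sum(net_lock_gain(v, e, side, locked) for e in nets)
```

For a two-pin net the cases reduce to the graph definition of lock gain: a locked neighbor on the opposite side contributes +1, a locked neighbor on the same side contributes −1.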
References 1. BENCHMARK. http://vlsicad.cs.ucla.edu/˜cheese/benchmarks.html. 2. C. Fiduccia and R. Mattheyses. A linear-time heuristic for improving network partitions. In 19th IEEE/ACM Design Automation Conference, pages 175–181, 1982. 3. G. Karypis, R. Aggarwal, V. Kumar, and S. Shekhar. Multilevel hypergraph partitioning: Application in VLSI domain. In Proc. Design Automation Conference, pages 526–529, 1997. 4. Y. H. Kim and B. R. Moon. A hybrid genetic search for graph partitioning based on lock gain. In Genetic and Evolutionary Computation Conference, pages 167–174, 2000.
Multi-campaign Assignment Problem and Optimizing Lagrange Multipliers Yong-Hyuk Kim and Byung-Ro Moon School of Computer Science & Engineering, Seoul National University Shillim-dong, Kwanak-gu, Seoul, 151-742 Korea {yhdfly,moon}@soar.snu.ac.kr
Customer relationship management is crucial in acquiring and maintaining loyal customers. To maximize revenue and customer satisfaction, companies try to provide personalized services for customers. A representative effort is one-to-one marketing, and the fast development of the Internet and mobile communication boosts the market for it. A personalized campaign targets the customers most attractive with respect to the subject of the campaign, so it is important to predict customer preferences for campaigns. Collaborative filtering (CF) and various data mining techniques are used to predict customer preferences. In particular, since CF is fast and simple, it is widely used for personalization in e-commerce, and a number of customer-preference estimation methods are based on it. As personalized campaigns are performed frequently, several campaigns often happen to run simultaneously. It is often the case that a customer attractive for one campaign tends to be attractive for other campaigns. If we perform separate campaigns without considering this, some customers may be bombarded by a considerable number of campaigns. We call this the overlapped recommendation problem. The larger the number of recommendations a customer receives, the lower the customer's interest in the campaigns; in the long run, the customer response to campaigns drops. This lowers marketing efficiency as well as customer satisfaction. Unfortunately, traditional methods focused only on the effectiveness of a single campaign and did not consider overlapped recommendations. In this paper, we define the multi-campaign assignment problem (MCAP), which takes the overlapped recommendation problem into account, propose a number of methods for the issue including a genetic approach, and verify the effectiveness of the proposed methods with field data. Let N be the number of customers and K be the number of campaigns.
The MCAP is to find customer-campaign assignments that maximize the effects of the campaigns. The main difference from independent campaigns is that the customer response to campaigns is influenced by overlapped recommendations. A more detailed description is omitted due to space limitations. In the case of overlapped campaign recommendations, the customer response rate drops as the number of recommendations grows. We introduce the response suppression function for
This work was partly supported by Optus Inc. and Brain Korea 21 Project. The RIACT at Seoul National University provided research facilities for this study.
Table 1. Comparison of Algorithms

Method   | Independent | Random   | CAA†     | LM-GA
Fitness  | 26452.75    | 25951.19 | 66473.63 | 66980.76
† Starting from the situation in which no customer is recommended any campaign, we iteratively assign campaigns to customers by a greedy method. We call this algorithm the Constructive Assignment Algorithm (CAA). We use an AVL tree for the efficient management of real-valued gains. The time complexity of the algorithm is O(NK² log N).
the response-rate degradation with overlapped recommendations. We used a response suppression function derived from the Gaussian function. Since the MCAP is a constrained optimization problem, it can be solved using Lagrange multipliers (LMs); however, since it is a discrete problem and not differentiable, it is formulated in a restricted form. The LM method guarantees optimal solutions, so the suboptimality of heuristic algorithms can be measured against it. By using the LM method, the problem of finding the optimal campaign assignment matrix becomes that of finding a K-dimensional real vector of LMs. The LM method takes O(NK·2^K) time, which is more tractable than the original problem; roughly, for a fixed number K, the problem size is lowered from O(N^(K+1)) to O(N). We also propose a genetic algorithm (GA) for optimizing LMs. Our GA provides an alternative search method that finds a good campaign assignment matrix by optimizing the K LMs instead of directly dealing with the campaign assignment matrix. A typical steady-state genetic algorithm is used; in the following, we describe each of its parts. A real encoding is used for representing a solution: each gene, corresponding to an LM, has a real value. The GA first creates 100 real vectors at random. We use a proportional selection scheme and uniform crossover. After the crossover, a mutation operator, a variant of Gaussian mutation that we devised, is applied to the offspring. Our evaluation function favors LM vectors of high fitness that satisfy the constraints as fully as possible. We used the preference values estimated by CF from field data with 48,559 customers and 10 campaigns. We examined the Pearson correlation coefficient of preferences for each pair of campaigns. Thirty-three pairs (about 73%) out of the total of 45 pairs showed a correlation coefficient higher than 0.5. This property of the field data provides a good reason for the MCAP modeling.
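A Gaussian-derived response suppression and its use in scoring an assignment can be sketched as follows; the width parameter `sigma` and the exact functional form are illustrative assumptions, since the text states only that the function is derived from a Gaussian:

```python
import math

def response_suppression(m, sigma=3.0):
    """Multiplier in (0, 1] applied to a customer's response when the
    customer receives m campaign recommendations: no suppression for a
    single recommendation, Gaussian-shaped decay as m grows."""
    if m <= 1:
        return 1.0
    return math.exp(-((m - 1) ** 2) / (2 * sigma ** 2))

def campaign_effect(preferences, assignment, sigma=3.0):
    """Total expected campaign effect: each customer's summed preference for
    the campaigns assigned to them, damped by the suppression multiplier.
    preferences[c][k] is customer c's estimated preference for campaign k;
    assignment[c][k] is 1 if campaign k is recommended to customer c."""
    total = 0.0
    for cust, prefs in enumerate(preferences):
        chosen = [k for k in range(len(prefs)) if assignment[cust][k]]
        total += response_suppression(len(chosen), sigma) * sum(prefs[k] for k in chosen)
    return total
```

This objective captures the trade-off at the heart of the MCAP: adding a recommendation raises the summed preference but lowers the suppression multiplier for every campaign already assigned to that customer.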
Table 1 shows the performance of the independent campaign and various multi-campaign algorithms in the multi-campaign formulation. The figures in the table represent fitness values. The result of the “Independent” campaign is from 10 independent campaigns run without considering their relationships with one another. Although the independent campaign was better than the “Random” assignment in the multi-campaign formulation, it was not comparable to the other multi-campaign algorithms. The solution fitness of the CAA heuristic was more than 2.5 times higher than that of the independent campaign. When LMs were optimized by a genetic algorithm (LM-GA), we found the best-quality solution satisfying all the constraints. Our LM method is fast and outputs optimal solutions, but it is not easy to find LMs satisfying all constraints; when combined with the genetic algorithm, we could find high-quality LMs.
Grammatical Evolution for the Discovery of Petri Net Models of Complex Genetic Systems
Jason H. Moore and Lance W. Hahn Program in Human Genetics, Vanderbilt University, Nashville, TN, USA 37232-0700 {Moore,Hahn}@phg.mc.Vanderbilt.edu
Abstract. We propose a grammatical evolution approach for the automatic discovery of Petri net models of biochemical systems that are consistent with population-level genetic models of disease susceptibility. We demonstrate that the grammatical evolution approach routinely identifies interesting and useful Petri net models in a human-competitive manner. This study opens the door to hierarchical systems modeling of the relationship between genes, biochemistry, and measures of health.
Petri nets are a type of directed graph that can be used to model discrete dynamic systems. The goal of this study was to develop a grammatical evolution (GE) strategy for the automatic discovery of Petri net (PN) models of biochemical systems that are consistent with gene-gene interactions that increase susceptibility to human disease. Understanding the relationship between genes, biochemistry, and measures of health is an important endeavor in the domain of human genetics. We first summarize the PN modeling strategy and then briefly present the grammar used and the results. The goal of identifying PN models of biochemical systems that are consistent with observed population-level gene-gene interactions is accomplished by developing PNs that depend on specific genotypes from two (or more) genetic variations. Here, we make firing rates of transitions and/or arc weights genotype-dependent, yielding different PN behavior. Each PN model is related to the genetic model using a threshold model, in which it is the concentration of a substance that is related to the risk of disease. For each model, the number of tokens at a particular place is recorded; if it exceeds a certain threshold, the appropriate risk assignment is made, and otherwise the alternative risk assignment is made. The high-risk and low-risk assignments made by the discrete threshold from the output of the PN can then be compared to the high-risk and low-risk genotypes from the genetic model. A perfect match indicates the PN model is consistent with the gene-gene interactions observed in the genetic model. Here, the fitness function of the GE algorithm is proportional to the number of high-risk and low-risk assignments incorrectly made. We developed a grammar for PNs in Backus-Naur Form (BNF). Nonterminals form the left-hand side of production rules, while both terminals and nonterminals can form the right-hand side.
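The threshold-based comparison between a PN's output and the genetic model can be sketched as follows (a minimal illustration; the genotype labels and the dictionary representation are assumptions, not the paper's implementation):

```python
def pn_fitness(token_counts, high_risk_genotypes, threshold):
    """Count incorrect risk assignments made by a Petri net model.
    token_counts: final token count at the designated place, keyed by
    genotype combination. high_risk_genotypes: the genotype combinations
    that the population-level genetic model labels high-risk. A token count
    above the threshold predicts high risk; lower fitness is better, and a
    PN perfectly consistent with the genetic model scores 0."""
    errors = 0
    for genotype, tokens in token_counts.items():
        predicted_high = tokens > threshold
        actual_high = genotype in high_risk_genotypes
        if predicted_high != actual_high:
            errors += 1
    return errors
```

A fitness of zero over every genotype combination is exactly the "perfect match" condition described in the text.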
A terminal is essentially a model element while a nonterminal is the name of a production rule. For the PN models, the terminal set includes, for
example, the basic building blocks of a PN: places, arcs, and transitions. The nonterminal set includes the names of production rules that construct the PN. For example, a nonterminal might name a production rule for determining whether an arc has weights that are fixed or genotype-dependent. We show in (1) the production rule that is executed to begin the model-building process:

(1)    ::= <expr>
When the initial production rule is executed, a single PN place with no entering or exiting arc (i.e. ) is selected, and a transition leading into or out of that place is selected. The arc connecting the transition and place can be dependent on the genotypes of the genes selected by . The nonterminal <expr> is a function that allows the PN to grow. The production rule for <expr> is shown in (2); the alternative (0, 1, 2, or 3) selected from the right-hand side of the production rule is determined by a combination of bits in the genetic algorithm chromosome:

(2)    <expr> ::= <expr> <expr>   (0)
                | <arc>           (1)
                |                 (2)
                |                 (3)
The base or minimum PN constructed by the production rule consists of a single place, a single transition, and an arc that connects them. Multiple calls to the production rule <expr> by the genetic algorithm chromosome can build any connected PN. In addition, the number of times the PN is to be iterated is selected with the nonterminal . Many other production rules control the arc weights, the genotype-dependent arcs and transitions, the number of initial tokens in a place, the place capacity, etc. All decisions made in building the PN model are made by each subsequent bit or combination of bits in the genetic algorithm chromosome. The GE algorithm was run a total of 100 times for each of two hypothetical nonlinear gene-gene interaction models. For each genetic model, GE always yielded a PN model that was consistent with the high-risk and low-risk assignments for each combination of genotypes, with no classification error. Most PN models consisted of one place, two arcs, and two transitions. We found a clear preference for making arcs, rather than transitions, genotype-dependent. Understanding how interactions at the biochemical level manifest themselves as interactions among genes at the population level will provide a basis for understanding the role of genes in disease susceptibility. This work was supported by National Institutes of Health grants HL65234, HL65962, GM31304, AG19085, and AG20135.
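The codon-to-production mapping that drives this model building is the standard GE mechanism and can be sketched generically; the toy grammar below is a stand-in, since the full PN grammar is not reproduced here:

```python
def derive(grammar, codons, symbol="<expr>", max_steps=100):
    """Left-most GE derivation: each expansion of a nonterminal consumes the
    next codon (an integer), and the codon value modulo the number of
    alternatives picks the production. Codon indexing wraps around the
    chromosome; derivations exceeding max_steps are returned incomplete
    (an invalid individual)."""
    seq = [symbol]
    out = []
    step = 0
    while seq and step < max_steps:
        sym = seq.pop(0)
        if sym not in grammar:            # terminal: emit it
            out.append(sym)
            continue
        choices = grammar[sym]
        chosen = choices[codons[step % len(codons)] % len(choices)]
        seq = chosen.split() + seq        # expand left-most nonterminal
        step += 1
    return out

# A toy two-alternative grammar analogous in shape to rule (2).
TOY_GRAMMAR = {"<expr>": ["<expr> <expr>", "x"]}
```

With this mapping, every structural decision (here, recurse or emit a terminal) is made by successive codons of the chromosome, which is how the PN grammar's choices 0–3 would be selected.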
Evaluation of Parameter Sensitivity for Portable Embedded Systems through Evolutionary Techniques James Northern III and Michael Shanblatt Department of Electrical and Computer Engineering Michigan State University, East Lansing, MI 48910 {norther1,mas}@msu.edu
Abstract. Power consumption and portability issues are becoming increasingly significant in embedded system architectures. As chip architects and integrated circuit designers focus on power and portability, they must also be conscious of choosing the right set of parameters for an embedded processor. An evolutionary approach to configuring the parameters of an embedded processor can be used to address these issues. From a given set of parameters, the design space is explored using a simple genetic algorithm. This paper investigates the sensitivity of these parameters as they change over the course of the genetic algorithm.
1 Introduction
The development of embedded systems has been motivated by the demands of the rapidly growing market in portable electronic devices. Lighter products, thermally efficient systems, and longer battery life all follow from reduced power consumption in mobile embedded systems (e.g., medical equipment, cellular phones, pagers, and video game consoles). An embedded system usually serves one specific application; to design such a system, various hardware units are configured based on a set of parameters for that application. As portable computing systems continue to grow, the focus of development will be the integration of these processors for communication and computation into a single unit (i.e., a system-on-a-chip). In this paper, the population size of a genetic algorithm is adapted to find an “ideal” configuration and evaluate the sensitivity of processor parameters. An expansion of Goldberg’s Simple Genetic Algorithm is selected as the method for optimizing these configurations based on power consumption.
2 Methodology

For the purpose of this study, the set of experiments includes 16 different parameters in the configuration of an embedded processor, with at least four options per parameter. An exhaustive search of the entire design space would require evaluating 8,153,726,970 different configurations, which could take hundreds of years. Also, if
different parameter sets were randomly selected, each characteristic could have a markedly different impact on power consumption (e.g., cache size and pipeline depth). Based on these parameterized characteristics and a particular embedded application, the search for an “ideal” configuration of an embedded processor is NP-complete. Evolutionary algorithms offer the ability to traverse the design space to find quality solutions from large search spaces. The simple genetic algorithm (SGA) seeks an “ideal” solution for minimum power consumed by a configuration based on natural selection methods. The SGA calls an architectural power estimation tool to evaluate each configuration. Two tools, the Genetic Algorithm Optimized for Portability and Parallelism System (GALOPPS) and Wattch, are selected to help complete this task [1, 2]. Each parameter is evaluated for sensitivity by counting the number of mutations of the “best fit” individual in a given run, and all parameters are ranked based on a total average per parameter.
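The sensitivity bookkeeping described above can be sketched as follows; this is an approximation of the paper's procedure (it counts generation-to-generation changes in the best individual's parameter values as a proxy for mutations of the "best fit" individual):

```python
from collections import defaultdict

def parameter_sensitivity(best_per_generation):
    """Rank parameters by how often the best-fit individual's value at each
    parameter position changed between consecutive generations.
    best_per_generation: one tuple of parameter values per generation.
    Returns (parameter_index, change_count) pairs, most-changed first."""
    changes = defaultdict(int)
    for prev, curr in zip(best_per_generation, best_per_generation[1:]):
        for i, (a, b) in enumerate(zip(prev, curr)):
            if a != b:
                changes[i] += 1
    return sorted(changes.items(), key=lambda kv: kv[1], reverse=True)
```

Averaging these counts across several runs, as the text describes, would then give the per-parameter sensitivity ranking.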
3 Summary
The experiments reported here used estimated values, without a targeted workload, for maximum power consumption per configuration. The population size was varied from 50 to 100; a population size of 70 yielded the best result, with a minimal power consumption of 39.97 watts. The GA in its initial form (with a population size of 70) achieved a power reduction of 25% from the initial SoC configuration. Moreover, when the population size was varied, additional power reductions of up to 5% were attained. The comparison of population sizes showed that there is a tradeoff between population size and number of evaluations. Smaller population sizes yielded higher power consumption per number of evaluations, suggesting that the population was not large enough. When the population size was greater than 70, the final power consumption values were also higher, indicating that a larger population may increase the time needed for the SGA to converge to a configuration that consumes minimum power. From our sensitivity evaluation, parameters that were less sensitive maintained the same characteristic in the final configuration (e.g., data cache size, integer and floating point ALUs, and rate of instructions implemented). Ongoing work involving additional runs will be used to show the statistical significance of the power reduction and of the parameter sensitivities obtained using a simple genetic algorithm.
References 1. Goodman, E.,“An Introduction to GALOPPS--The ‘Genetic Algorithm Optimized for Portability and Parallelism’ System,” Technical Report GARAGe 96-07-01, Michigan State University, 1996. 2. Brooks, D., Tiwari, V., Martonosi, M., "Wattch: A Framework for Architectural-Level Power Analysis and Optimizations," Proc. of the 27th International Symposium on Computer Architecture, June 2000, Vancouver, BC, pp. 83–94.
An Evolutionary Algorithm for the Joint Replenishment of Inventory with Interdependent Ordering Costs Anne Olsen Winthrop University, Rock Hill, SC 29733, USA [email protected]
Abstract. The joint replenishment of inventory problem (JRP) requires independence of minor ordering costs. In this paper we propose an evolutionary algorithm (EA) for a modification of the JRP that allows for interdependence of minor ordering costs. Our proposed EA (EARP) is a nested EA that searches for a solution minimizing the total cost of inventory replenishment. It combines an EA that uses a direct grouping method with an EA that uses an indirect grouping approach (EAind) by nesting EAind inside EARP. We test EARP against partial enumeration and show that it provides close-to-optimal results for some problems. We know of no other algorithm that solves this problem.
1 Introduction
Replenishment of multiple items from a single supplier is called joint replenishment, and the savings it realizes can be significant. The joint replenishment problem (JRP) is to determine an inventory replenishment policy that minimizes the total cost, TC, of replenishing multiple items from a single supplier. TC depends on the cost of holding items in inventory (hi), the cost of placing an order, and the demand, Di. The cost of placing an order includes a fixed cost of preparing an order (S) and a handling cost associated with each item in the order (si) [4]. The problem is to find a close-to-optimal grouping of items for each order [4]. The JRP assumes independence of the si’s, which means that the si of item i is not affected by other items included in the same order. We consider problems for which the si’s are interdependent. Interdependence occurs in two ways: first, the value of si may change depending on which other items are in the same order; second, item i may be constrained from being in the same order as item j. Two methods covered in the literature to find the best grouping of items in an order are indirect grouping and direct grouping. Indirect grouping replenishes inventory using a basic cycle time; not all items are necessarily ordered every cycle, and an integer, ki, indicates in which cycles item i will be ordered. Direct grouping divides items into disjoint groups, and a fixed cycle time is determined for each group; all items in a group are ordered every cycle time for that group [3].
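The classical indirect-grouping cost that these methods build on can be written down directly; this is the standard JRP formulation with independent minor costs, using the notation of the text (basic cycle time T, multiples ki):

```python
import math

def jrp_cost(T, k, S, s, h, D):
    """Total cost per unit time of an indirect-grouping policy: major cost S
    plus prorated minor costs s[i]/k[i] incurred every basic cycle T, plus
    average holding cost (item i is ordered every k[i]*T time units)."""
    ordering = (S + sum(si / ki for si, ki in zip(s, k))) / T
    holding = (T / 2) * sum(hi * ki * Di for hi, ki, Di in zip(h, k, D))
    return ordering + holding

def optimal_T(k, S, s, h, D):
    """Cycle time minimizing jrp_cost for fixed multiples k (set the
    derivative with respect to T to zero and solve)."""
    num = 2 * (S + sum(si / ki for si, ki in zip(s, k)))
    den = sum(hi * ki * Di for hi, ki, Di in zip(h, k, D))
    return math.sqrt(num / den)
```

Interdependent minor costs break the fixed-si assumption behind this formula, which is why the paper modifies the group evaluation rather than using the classical cost as-is.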
In this paper we propose a nested EA (EARP) to find a near-optimal solution to the JRP under conditions of interdependence. The next section describes EARP and the testing procedure and ends with a brief description of results.
2 The Evolutionary Algorithm
EARP uses a direct grouping method to place n items into groups, Gj, where j = 1, ..., J, which satisfy the constraints. It then calls EAind, which uses an indirect grouping method to find the basic cycle time, Tj, and the integer multiples, ki, for the items in Gj. At the end of N generations, the individual with the lowest TC represents the best ordering policy. Each chromosome represents a replenishment policy and contains group ids (integers) and bits representing the ki's. See Olsen [2] for details. Each group, Gj, in the chromosome is evaluated by the following equation:

evalj = [ 2 · (S + Σi (si/ki) + penalty) · (Σi hi ki Di) ]^(1/2)    (1)
where penalty is an amount added due to interdependence. This is a modification of the equation used by RAND [2], an indirect grouping method. The total cost TC of a chromosome is found by summing evalj over all groups; TC indicates the "fitness" of each chromosome and guides the search. Parameters used follow guidelines in [1] and are: pop size = 20, pc, pm = 0.01, and the number of generations is 1000 for EARP with 500 for EAind, the nested EA. We tested EARP with 810 problems randomly generated within selected ranges. Results were compared with a partial enumeration process. EARP gave results equal to partial enumeration 5.4% of the time, was worse 87.3% of the time, and showed improvement 7.3% of the time. The average percent difference between EARP and partial enumeration shows EARP is 1.2% worse than partial enumeration overall for the 810 problems. These results suggest an EA solution may produce a close-to-optimal replenishment policy. Additional work is scheduled for publication.
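The group evaluation of Eq. (1) and the resulting chromosome fitness can be sketched as follows (assuming the square-root RAND-style cost with the penalty added to the ordering-cost term, as read from the text; the grouping data structure is an illustrative choice):

```python
import math

def eval_group(S, s, k, h, D, penalty=0.0):
    """Cost of one group per Eq. (1): the optimal joint-replenishment cost
    for the group, with a penalty added to the ordering-cost term to
    account for interdependence of the minor costs."""
    ordering = S + sum(si / ki for si, ki in zip(s, k)) + penalty
    holding = sum(hi * ki * Di for hi, ki, Di in zip(h, k, D))
    return math.sqrt(2 * ordering * holding)

def total_cost(groups):
    """TC of a chromosome: the sum of eval_group over its groups Gj.
    Each group is a dict of eval_group's keyword arguments."""
    return sum(eval_group(**g) for g in groups)
```

With penalty = 0 this reduces to the classical independent-cost optimum, so the penalty term isolates the effect of interdependence on a group's cost.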
References 1. Michalewicz, Z.: Genetic Algorithms + Data Structures = Evolution Programs, Springer-Verlag, New York (1996) 2. Olsen, A.: An Evolutionary Algorithm for the Joint Replenishment Problem. Ph.D. thesis, Charlotte, NC: University of North Carolina at Charlotte, USA (2002) 3. Rosenblatt, M.J.: Fixed Cycle, Basic Cycle and EOQ Approaches to the Multi-item Single-supplier Inventory System. International Journal of Production Research, vol. 23:6, Taylor & Francis, Ltd. (1985) 1131–1139 4. Silver, E.A.: A Simple Method of Determining Order Quantities in Joint Replenishments Under Deterministic Demand. Management Science, vol. 22:12. The Institute of Management Sciences, U.S.A. (1976) 1351–1361.
Benefits of Implicit Redundant Genetic Algorithms for Structural Damage Detection in Noisy Environments Anne Raich and Tamás Liszkai Texas A&M University, Dept. of Civil Engineering, College Station, Texas, 77843, USA {araich,tamas}@tamu.edu
Abstract. A robust structural damage detection method that can handle noisy frequency response function (FRF) information is discussed. The inherent unstructured nature of damage detection problems is exploited by applying an implicit redundant representation (IRR) genetic algorithm. Results obtained for an unbraced frame structure show that the IRR GA is less sensitive to noise than an SGA.
1 Unstructured Problem Domain of FRF-Based Damage Detection

The goal of structural damage identification methods (SDIM) is to accurately assess the condition of structures. Most SDIMs assume that vibration signatures are sensitive indicators of structural integrity. In this research, FRF data was used to identify the location and severity of damage. An optimization problem was defined using an error function between the measured data and the discrete analytical model. Although the total number of structural elements typically is large, the number actually damaged is smaller. This unique situation defines an unstructured problem, in which the number of damages is unknown. The optimization problem is solved using genetic algorithms (GA) by altering member properties. A damage vector is obtained that identifies the location and severity of damage(s) in the structure. This formulation requires minimal measurement information. A comprehensive review of model parameter updating methods is provided in [1]. Two GA representations were investigated (Fig. 1). A fixed number of variables were encoded using an SGA representation to represent a complete solution by defining a damage indicator for each element. The IRR representation [2] accommodates the unstructured nature of damage detection by allowing the number of damaged elements to change during optimization, which is beneficial when the number and location of damages are unknown. A complete solution is encoded using only the damages for a small subset of the elements, instead of all elements.
[Figure: (a) fixed representation, one encoded damage indicator per element; (b) implicit redundant representation (IRR), in which gene instances (a gene locator, an encoded finite element number, and an encoded damage indicator) are separated by redundant segments.]
Fig. 1. Comparison of design variable encoding for SGA and IRR GA representations
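The IRR decoding sketched in Fig. 1(b) can be illustrated as follows; the locator pattern, field widths, and damage scaling are illustrative choices, not the paper's parameters:

```python
def decode_irr(bits, locator="11", elem_bits=6, dmg_bits=4, n_elems=40):
    """Scan an IRR chromosome for gene instances: each instance begins with
    a locator pattern, followed by an encoded finite element number and an
    encoded damage indicator; everything between instances is a redundant
    segment that is simply skipped. Returns {element: damage fraction}."""
    damages = {}
    span = len(locator) + elem_bits + dmg_bits
    i = 0
    while i + span <= len(bits):
        if bits[i : i + len(locator)] == locator:
            j = i + len(locator)
            elem = int(bits[j : j + elem_bits], 2) % n_elems
            dmg = int(bits[j + elem_bits : j + elem_bits + dmg_bits], 2) / (2 ** dmg_bits - 1)
            damages[elem] = dmg        # damage indicator scaled into [0, 1]
            i += span                  # jump past the decoded instance
        else:
            i += 1                     # redundant bit: advance one position
    return damages
```

Because only the elements named by gene instances receive damage indicators, crossover and mutation can change how many elements a solution encodes, which is the adaptive property exploited in the results below.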
Benefits of Implicit Redundant Genetic Algorithms for Structural Damage Detection
2 Results of Case Studies with Added Measurement Noise
Case studies were performed on a three-story, three-bay frame with 10% damage imposed on a first floor beam (Element 21). Three measurement locations and an excitation at the upper story were used to generate the simulated FRF data. Four noise levels were investigated by adding normally distributed random noise to the FRF data. A more detailed discussion of the case study and results is provided in [3]. Results obtained for the IRR GA trials after 300 generations are shown in Fig. 2. Without noise, the IRR GA found the global optimum in 241 generations. For all noise levels, the correct damaged element was identified with close to the exact 10% damage value. The number of falsely identified elements with observable damage magnitude increased as the noise level increased. The IRR GA outperformed the SGA in all trials. The SGA encoded all of the damage indicators in the finite element model; as a result, the number of falsely identified damaged elements was large. In comparison, the adaptive characteristic of the IRR GA was beneficial, since the number of damage indicators was not explicitly encoded. The IRR GA determined the number of damaged elements while minimizing the error function. For the noise free case, the best IRR individual initially encoded 15 gene instances, while the best IRR individual in the final population encoded 3 gene instances: one instance identified the correct damaged element and two other instances identified an element with zero damage. When noise was added, the number of gene instances changed from 15 in the initial population to 8-10 in the final population. In a noise free environment, the global optimum was always found. Overall, the IRR GA was considerably less sensitive to measurement noise than the SGA. Even on larger problems, the IRR GA was able to identify the damaged elements.
Seeding the initial population with the zero damage individual was not necessary for the IRR GA to find the optimal solution, but was beneficial in many trials.
[Figure 2: percent damage versus finite element number; series for 0%, 10%, 20%, and 30% noise]
Fig. 2. Damage detection results for the frame problem (Element 21 - 10% damage)
References

1. Mottershead, J.E., Friswell, M.I.: Model updating in structural dynamics: A survey. Journal of Sound and Vibration 167(2) (1993) 347–375
2. Raich, A.M., Ghaboussi, J.: Implicit redundant representation in genetic algorithms. Evolutionary Computation 5(3) (1997) 277–302
3. Liszkai, T.R.: Modern heuristics in structural damage detection using frequency response functions. Ph.D. Dissertation, Texas A&M University (2003)
Multi-objective Traffic Signal Timing Optimization Using Non-dominated Sorting Genetic Algorithm II Dazhi Sun, Rahim F. Benekohal, and S. Travis Waller University of Illinois at Urbana-Champaign Urbana, IL 61801, USA {dazhisun,rbenekoh,stw}@uiuc.edu
Abstract. This paper presents the application of the Non-dominated Sorting Genetic Algorithm II (NSGA II) to the multi-objective signal timing optimization problem (MOSTOP). Recent research on intersection signal timing optimization and multi-objective evolutionary algorithms is summarized. NSGA II, which can find more of the Pareto frontier and maintain the diversity of the population, is applied to three signal timing optimization problems with two objectives and three constraints, which account for both deterministic and stochastic traffic patterns. Mathematical approximations of the resulting Pareto frontiers are developed to provide more insight into the trade-off between the objectives. The GA experimental design and result analysis are presented with some recommendations for prospective applications.
1 Multi-objective Traffic Signal Timing Optimization Problem
Minimizing the average delay and minimizing the number of stops per unit of time are important objectives for traffic signal timing design. However, no feasible solution can achieve simultaneous optimality of these two objectives for an intersection with asymmetric traffic demand. A generic multi-objective traffic signal timing optimization problem at an isolated intersection with a two-phase control strategy and permissive left turns can be formulated as:

minimize F(G) = [f1(G), f2(G)]

where G is the vector of effective green times for each phase i, f1(G) is the first objective function, with respect to delay, and f2(G) is the second objective function, with respect to stops. The Webster delay formulation and the Akçelik stops function, which are widely used for calculating the corresponding performance indices of delay and number of stops, are modified to serve as the objective functions in the above equation. Due to the impact of cycle length on the intersection's overall effective capacity, an additional constraint on minimum cycle length was introduced to the above optimization problem.
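The two-objective structure above can be illustrated with a minimal Pareto dominance sketch. The objective values below are synthetic stand-ins, not the modified Webster/Akçelik formulas used in the paper; only the dominance logic, which is the core of NSGA II's ranking, is shown.

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (minimization)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def non_dominated(points):
    """Return the Pareto-optimal subset of a list of objective vectors."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

# Hypothetical (delay, stops) pairs: delay falls and stops rise as green
# time grows, the classic trade-off at an asymmetric intersection.
candidates = [(10.0, 1.0), (6.0, 2.0), (4.0, 4.0), (7.0, 3.0)]
front = non_dominated(candidates)
assert (7.0, 3.0) not in front        # dominated by (6.0, 2.0)
assert (6.0, 2.0) in front and (4.0, 4.0) in front and (10.0, 1.0) in front
```

NSGA II sorts the whole population into successive fronts of this kind and uses crowding distance within each front to keep the population spread along the trade-off curve.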
2 GA Experiment Design and Result Analysis
Three signal design problems are defined to minimize average delay and the average number of stops, using the effective green time at each signal phase as the design variable. NSGA II, first introduced by Deb et al. (2002), is modified to solve these problems. The GA parameters used are: maximum generation 500; population size 100; binary string length 30; crossover probability 0.9; mutation probability 0.01; minimum green 10; maximum green 100. The designed scenario is a two-phase isolated intersection with permissive left turns. The critical flow ratios are 0.47 and 0.39, and the saturation flow for each approach is 1800 pcphpl. It was observed that a clear frame of the actual Pareto frontier can be located within 20 generations. As the generation number grows, more Pareto-optimal points were discovered, and a well-fitted third-degree polynomial function can be constructed to evaluate the trade-off between the conflicting objectives. Meanwhile, it was observed that the Pareto-optimal design variables lie along a straight line in the feasible space, and the corresponding regression functions were developed in this study as well. The following figures show the population and objective values at generations 1, 10, and 20 for the multi-objective signal optimization problem under a stochastic traffic pattern with the Webster cycle length constraint.
[Three figure panels: design variables (G1 versus G2) and objective values (delay versus stops) at generations 1, 10, and 20]
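The curve-fitting step mentioned above, approximating the discovered Pareto frontier with a third-degree polynomial, can be reproduced as follows. The sample points are synthetic, not the paper's data; only the cubic-fit technique comes from the text.

```python
import numpy as np

# Hypothetical Pareto-front samples: average delay (s) vs. stops per vehicle
delay = np.array([20.0, 25.0, 30.0, 40.0, 55.0, 75.0])
stops = np.array([0.42, 0.41, 0.40, 0.39, 0.385, 0.38])

coeffs = np.polyfit(delay, stops, deg=3)   # third-degree trade-off model
trade_off = np.poly1d(coeffs)

# The fitted curve should track the front closely at the sample points:
residual = float(np.max(np.abs(trade_off(delay) - stops)))
assert residual < 0.01
```

A fit like this gives a closed-form view of how much delay must be conceded to reduce stops, which is the insight the paper draws from its regression functions.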
This study solved the multi-objective traffic signal optimization problem with NSGA II and analyzed its effectiveness. Results showed that NSGA II can find a much better spread of optimal signal design plans on the true Pareto-optimal frontier, with high convergence speed.
Exploration of a Two Sided Rendezvous Search Problem Using Genetic Algorithms

T.Q.S. Truong1 and A. Stacey2

1 Air Operations Division, Defence Science and Technology Organisation, Melbourne, Australia [email protected]
2 Mathematics Department, RMIT University, Melbourne, Australia [email protected]
Abstract. The problem of searching for a walker that wants to be found, where the walker moves toward the helicopter whenever it can hear it, is an example of a two sided search problem which is intrinsically difficult to solve. Thomas et al. [1] considered the effectiveness of three standard NATO search paths [2] for this type of problem. In this paper a genetic algorithm is used to show that more effective search paths exist. In addition it is shown that genetic algorithms can be effective in finding a near optimal path of length 196 when searching a 14 × 14 cell area, that is, a search space of about 10^100.
1 Problem Description
Rendezvous search, or looking for someone who wants to be found, such as a walker lost in the desert, was recently considered by Thomas and Hulme [1]. The basic problem is as follows: the lost walker is searched for by a helicopter. The speed of the walker is 3 m.p.h. and that of the helicopter 60 m.p.h. If the walker hears the helicopter it moves towards it; if not, it stays still. The walker can detect the helicopter to a radius of 1 mile, and the helicopter can detect the walker to a radius of 0.11 miles. The search region is broken up into 196 cells in a 14 × 14 grid. The helicopter takes 9 seconds to transit between cells and spends 10 seconds searching each cell. The walker could move freely in the search space, but the helicopter was constrained to move between adjacent cells, and not diagonally. The cell size was the largest square that could be inscribed within the circle of detection of the helicopter, and it was assumed that detection could only occur if the walker and helicopter were in the same cell at the same time. This problem is hard to solve analytically. If detection in the simulation (occurring with probability 0.78) is determined by the throw of a die, this causes fluctuations in the values of the fitness function. The GA was found not to converge unless up to 1,000 trials were averaged for each fitness value, which makes the problem computationally intractable. To overcome this, for each search path and starting location of the walker, a simulation was done to determine at what times the walker and helicopter were in the same cell. This information was used to construct a theoretical cumulative
probability curve, as a function of time. These results were then averaged over all possible walker starting locations to give a cumulative probability distribution which characterized the effectiveness of the path. These distributions have no random fluctuations and can be used to construct a noise free fitness function for the GA. The fitness function was a weighted sum of the maximum probability at the end of the search and the area under the curve. Including the area under the curve ensured that higher fitness values were given to paths that had higher probabilities of early detection.
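The noise-free fitness just described, a weighted sum of the final detection probability and the area under the cumulative curve, can be sketched as below. The weight value and function names are illustrative assumptions; only the structure of the fitness comes from the text.

```python
def path_fitness(cum_prob, w=0.5):
    """Fitness of a search path from its cumulative detection-probability
    curve (one value per time step, averaged over all walker starting
    locations, non-decreasing). Combines the final probability with the
    normalized area under the curve, so earlier detection scores higher."""
    final = cum_prob[-1]
    area = sum(cum_prob) / len(cum_prob)   # normalized area under the curve
    return w * final + (1 - w) * area

early = [0.5, 0.8, 0.9, 0.9]   # detects most walkers early
late = [0.0, 0.1, 0.5, 0.9]    # same final probability, later detection
assert path_fitness(early) > path_fitness(late)
```

Both paths end with the same detection probability, but the area term rewards the path that finds walkers sooner, exactly the behavior the weighted fitness is designed to encourage.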
2 Results
The three NATO search paths are the decreasing square (DESQ), starting in cell one (at one corner) and spiraling in to the centre; the expansion square (EXSQ), covering the same path in reverse; and the scan search (SCAN), a path which zigzags across the area in strips parallel to the boundaries. The path found by the GA was compared with the three NATO paths; the results are shown in Fig. 1.
Fig. 1. 14 × 14 region, velocity of walker 3 m.p.h.
References

1. Thomas, L.C., Hulme, P.B.: Searching for targets who want to be found. Journal of the Operational Research Society 48 (1997) 44–52
2. N.A.T.O.: Manual on Search and Rescue, ATP-10(C), Annex H of Chapter 6. NATO, Brussels (1988)
Taming a Flood with a T-CUP – Designing Flood-Control Structures with a Genetic Algorithm Jeff Wallace and Sushil J. Louis Department of Computer Science University of Nevada, Reno, USA [email protected] http://www.cs.unr.edu/˜sushil/
Abstract. This paper describes the use of a genetic algorithm to solve a hydrology design problem: determining an optimal or near-optimal prescription of Best Management Practices in a flood-prone watershed. The model has proved capable of discovering design prescriptions that are more cost-effective than existing designs, promising significant financial benefits in a shorter time period. The approach is flexible enough to be applied to any watershed with basic precipitation and soil data.
1 Introduction and Problem Description
We use the T-CUP (Total Capture of Urban Precipitation) hydrology model to drive a genetic algorithm that finds cost-effective prescriptions of Best Management Practices (BMPs) to address a severe flooding problem in the Sun Valley Watershed of northern Los Angeles County, California. Due in part to the T-CUP model, the Los Angeles County Department of Public Works recently diverted funds from a proposed $42 million storm drain system in order to retrofit the entire watershed with BMPs, a decentralized method of capturing runoff locally. BMP examples include cisterns, drywells, pavement removal, and planting trees; they are designed to retain most runoff on-site. The Sun Valley watershed is divided into 38 sub-basins that flow into each other. Each chromosome represents a spatially-explicit prescription of BMP installations. We know a little about the search space and were able to dramatically improve our results through selective initialization: we initialize the first chromosome in the population to be all zeros, a prescription of zero installations of each BMP in each sub-basin. We use a modified version of Eshelman's Adaptive CHC genetic algorithm [1] and added a "tweak" option that allows a user to manually initiate a new epoch through cataclysmic mutation. We found that the best solutions were not necessarily a function of compute time or number of evaluations. As seen in Table 1, when large populations run for relatively few generations, the single null chromosome does not have time to propagate through the population before time runs out. Figure 1 plots objective function values (to be minimized) versus number of generations and shows the
Table 1. Summary table of performance for different population sizes and generations

# of generations   Pop 25    Pop 50    Pop 100   Pop 200
20                  69,524    43,300    31,971    33,046
50                 159,322    62,306    19,456    14,529
100                246,582   136,185    26,636     6,796
400                     --    29,824    16,231     5,725
Fig. 1. System performance over three sequential phases.
three phases that our system goes through: geographic viability, meeting runoff targets, and cost minimization. The system quickly finds high quality solutions, and we find that our users like to run with small population sizes to more quickly obtain feedback on the structure of promising prescriptions, even if larger population sizes lead to prescriptions with lower costs. The single all-zero individual also has a much larger effect on a small population. T-CUP provides a useful tool for hydrologists analysing and designing flood control structures. Our initialization procedure allows T-CUP to provide good flood control prescriptions in reasonable time. With large population sizes, the system can be used to provide a set of promising design solutions; with small populations, users get real-time feedback on the structure of promising prescriptions and so a "feel" for the kinds of solutions that would suit a particular watershed. We have a number of improvements in mind and we encourage you to try the T-CUP demo available at www.parconline.com/ga/ – any feedback will be appreciated. The full paper is available at www.cs.unr.edu/∼sushil.
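The selective-initialization idea, seeding one all-zero chromosome into an otherwise random initial population, can be sketched as follows. The gene layout (one integer count per BMP type per sub-basin) and all sizes are illustrative assumptions; the paper only states that the first chromosome is all zeros and that the watershed has 38 sub-basins.

```python
import random

def init_population(pop_size, n_subbasins, n_bmp_types, max_count=5):
    """Random initial population of BMP prescriptions, with one seeded
    null prescription (zero installations of every BMP everywhere)."""
    genes = n_subbasins * n_bmp_types
    population = [[random.randint(0, max_count) for _ in range(genes)]
                  for _ in range(pop_size - 1)]
    population.append([0] * genes)     # the seeded all-zero chromosome
    return population

random.seed(42)
pop = init_population(pop_size=50, n_subbasins=38, n_bmp_types=4)
assert len(pop) == 50
assert [0] * (38 * 4) in pop           # null prescription is present
```

With a small population, this single null individual makes up a larger fraction of the gene pool, which is consistent with the observation that it has a much larger effect on small populations.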
References

1. Eshelman, L.J.: The CHC Adaptive Search Algorithm: How to have safe search when engaging in nontraditional genetic recombination. In: Rawlins, G.J.E. (ed.): Foundations of Genetic Algorithms 1, Morgan Kaufmann (1991) 265–283
Assignment Copy Detection Using Neuro-genetic Hybrids

Seung-Jin Yang, Yong-Geon Kim, Yung-Keun Kwon, and Byung-Ro Moon
School of Computer Science and Engineering, Seoul National University, Shilim-dong, Kwanak-gu, Seoul, 151-742 Korea
{sjyang,dvp,kwon,moon}@soar.snu.ac.kr

Abstract. It is difficult for teachers to find copies among submitted program assignments. We propose a method which detects the similarity of assignments. We first transform a program into a sequence of tokens and provide it to an artificial neural network. The neural network is optimized by a hybrid genetic algorithm. Experimental results showed considerable improvement over plain artificial neural networks.
It is one of the most time-consuming problems for teachers or grading assistants to find out which students copied another's program. It is easy to copy text with various editors, but it is not an easy job to examine the similarity of programs manually. There have thus been many efforts to develop automated copy detection methods. We want to develop a system that takes a number of programs as input and returns suspicious pairs of programs. In the system, the core engine takes two programs as input and returns the degree of suspicion of plagiarism. We use a token counting method to detect plagiarism; the frequency of each token is determined by the objective and programmer of the program. We designed the system as follows. We define 40 tokens for the Java language, count the frequencies of these tokens, and express the frequencies as a vector. We obtain two vectors (a1, a2, ..., a40) and (b1, b2, ..., b40) from the two programs. Here, ai and bi match the 40 attributes, each corresponding to a token. These two vectors are used for the input of the neural network. There are two versions in this paper: one uses the vector (a1, b1, a2, b2, ..., a40, b40) as the input; the other uses (|a1 − b1|, |a2 − b2|, ..., |a40 − b40|). To test the proposed approaches, we prepared two different program assignment sets submitted in past classes. The program sets are used for the training and the test, respectively. Each set was expanded in the following way. Some base programs were selected from each set, intensively checked to see that they are not copies of one another, and distributed to our laboratory fellows. Then, they generated copy programs from the base programs in their own ways. We added the copy programs into the original data set. Table 1 describes the input data and Table 2 shows the experimental results. In the tables, "candidates" means the number of program pairs that
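The token-counting front end described above can be sketched as follows. The tiny vocabulary and the naive word-level tokenizer are illustrative assumptions; the paper defines 40 Java tokens and feeds either the interleaved or the absolute-difference vector to the neural network.

```python
import re
from collections import Counter

# Hypothetical miniature vocabulary standing in for the paper's 40 tokens
VOCAB = ["class", "for", "while", "if", "return", "new", "int", "public"]

def token_vector(source):
    """Frequency of each vocabulary token in a program's source text."""
    counts = Counter(re.findall(r"[A-Za-z_]\w*", source))
    return [counts[tok] for tok in VOCAB]

def diff_vector(src_a, src_b):
    """The |a_i - b_i| input vector for the network's second variant."""
    return [abs(a - b) for a, b in zip(token_vector(src_a), token_vector(src_b))]

p1 = "public class A { int f() { for(;;) if (x) return 1; return 0; } }"
p2 = "public class B { int g() { while(x) return 1; } }"
d = diff_vector(p1, p2)
assert len(d) == len(VOCAB)
assert d[VOCAB.index("return")] == 1    # two returns in p1 vs. one in p2
```

A plagiarized pair tends to produce a small difference vector even after superficial edits, which is what makes the representation a workable network input.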
Table 1. Input Data

               Assignment 1 (training data)   Assignment 2 (test data)
total pairs    2000                           1475
copy pairs     159                            42

Table 2. Results of copy detection (average of 20 trials)

                      candidates   copies   no copies   accuracy (%)
40-input ANN          44.3         19.3     25.0        43.6
80-input ANN          51.4         16.3     35.1        31.7
40-input hybrid GA    44.7         20.8     23.9        46.5
80-input hybrid GA    48.7         24.2     24.5        49.7
each algorithm outputs as suspicious candidates. The columns "copies" and "no copies" give the number of copy pairs and non-copy pairs, respectively, among the candidates, and the column "accuracy" is the rate of copies over candidates. The numbers 40 and 80 in the row titles correspond to the number of input neurons. The 80-input hybrid GA (genetic algorithm) performed best. We can observe an interesting contrast in the results of the ANNs (artificial neural networks) and the hybrid GAs: the 40-input ANN performed better than the 80-input ANN, whereas the 80-input hybrid GA performed better than the 40-input hybrid GA. This is strong evidence of the gap between the powers of ANNs and hybrid GAs. Although 80 inputs provide more information than 40 inputs, the problem space of ANN optimization might be too large for the backpropagation algorithm; the hybrid GA, on the contrary, took advantage of the full information. The 80-input hybrid GA caught on average 24.2 copies among the 42 copies, although the absolute value would vary depending on the degrees of copying. To the best of our knowledge, this work is the first to address the copy detection problem using genetic algorithms combined with neural networks. The result showed considerable performance improvement over ANNs. It also showed that the token counting model is attractive, and that the GA plays a crucial role in the high-accuracy detection of copies. Acknowledgments. This work was partly supported by Optus Inc. and the Brain Korea 21 Project. The RIACT at Seoul National University provided research facilities for this study.
Structural and Functional Sequence Test of Dynamic and State-Based Software with Evolutionary Algorithms

André Baresel1, Hartmut Pohlheim1, and Sadegh Sadeghipour2
1 DaimlerChrysler AG, Research and Technology, Methods and Tools, Alt-Moabit 96a, 10559 Berlin, Germany {andre.baresel,hartmut.pohlheim}@daimlerchrysler.com
2 IT-Power Consultants, Jasmunder Str. 9, 13355 Berlin, Germany [email protected]

Abstract. Evolutionary Testing (ET) has been shown to be very successful for testing real world applications [10]. The original ET approach focuses on searching for a high coverage of the test object by generating separate inputs for single function calls. We have identified a large set of real world applications for which this approach does not perform well, because only sequential calls of the tested function can reach a high structural coverage (white box test) or can check functional behavior (black box test). In particular, control software, which is responsible for controlling and constraining a system, cannot be tested successfully with the original ET. Such software is characterized by storing internal data during a sequence of calls. In this paper we present the Evolutionary Sequence Testing approach for white box and black box tests. For automatic sequence testing, a fitness function for the application of ET will be introduced which allows the optimization of input sequences that reach a high coverage of the software under test. The authors also present a new compact description for the generation of real-world input sequences for functional testing. A set of objective functions to evaluate the test output of systems under test has been developed. These approaches are currently used for the structural and safety testing of car control systems.
1 Introduction to Sequence Testing

When analyzing the code of real world applications, a large set of implementations can be identified which use a functional model whereby a controller procedure is called periodically. In the world of car control units it is very common for a software function to be based on an initialization and a step function. Whereas the initialization function is only called once at the beginning, the step function is executed at regular time intervals, e.g. every 10 milliseconds. An example of such control software is a cruise control and vehicle-to-vehicle distance controller program (Distronic), which checks the velocity and distance to a leading vehicle at regular intervals to find out if a safe distance is maintained (structure shown in figure 7). This kind of software seems to be very specific. However, an equivalent situation can be found within object oriented software. OO software systems implement objects which often create their own internal storage and methods. The methods initialize and alter the object's data during its lifetime. It is very common for an object to provide an interface to initialize and change its data. The implemented functions behave differently on the basis of the object's inputs and internal settings.
Depending on the complexity of the system under test, automatic testing can be performed in three different ways. The approach commonly used is to generate pairs of initial states and input situations, shown in Figure 1. This approach works well for simple software models but raises several problems. First, the direct setting of internal states is not possible in all cases and requires changes to be made to the program under test. When generating the internal states, the test system has to make sure that the states produced are valid according to the specification. Moreover, forcing the system into a generated state is, in most cases, not useful for the tester, because the tester has to use the generated test data to analyze software bugs: he first needs to ascertain how to produce the initial state which causes the problem, and this information is not provided by this test automation.

Fig. 1. Automatic testing by generating (state, input) pairs (first approach)

For complex systems it is necessary to search for an input sequence. One approach is to create lists of inputs for sequential calls of the function under test. This is often sufficient for software systems based on state models, see Fig. 2. For control systems requiring long 'real world' input sequences, encoded input functions must be generated. Figure 3 presents this third approach.

Fig. 2. Automatic testing by generating lists of inputs (second approach)

Monitoring the sequence test of a system differs from the original ET approach. For white box tests, a list of execution paths has to be analyzed. Black box tests are performed on sequences of output values instead of single values. This allows the assessment of dynamic functional and safety criteria. The two approaches for generating input sequences are, from the tester's point of view, the best solution, since they guarantee that the system is tested in the same way as it will later be used.
Fig. 3. Automatic testing by generating input sequences from encoded input functions (third approach)
Section 2 describes how evolutionary algorithms (EA) have been applied to structural and functional sequence testing. Both applications have been implemented by the authors, and results of real world experiments are presented. Structural Sequence Testing has been realized via the generation of input sequences, see section 3. Functional Sequence Testing, see section 4, has been applied to safety tests by creating encoded input functions.
2 Overview of Evolutionary Testing (ET)

Evolutionary algorithms (EA) have been used to search for data for a wide range of applications. An EA is an iterative search procedure using different operators to mimic the behavior of biological evolution. To use an EA for a search problem, it is necessary to define the search space and the objective (fitness) function. The algorithms are implemented in a widely used toolbox [7], which provides a large set of operators, e.g. real and integer parameter representations, migration and competition strategies.
Fig. 4. The structure of Evolutionary Testing
Evolutionary Testing uses EA to test software automatically. The different software test criteria formulate requirements for the test case set to be generated. Until now, the generation of such a set of test data usually had to be carried out manually. Automatic software testing generates a test data set automatically, aiming to fulfill the requirements in order to increase efficiency, resulting in an enormous cost reduction. Evolutionary Structural Testing. Structural testing has the goal of automating the test case design for white box testing criteria [2]. Taking a test object, namely the software under test, the goal is to find a test case set (a selection of inputs) which achieves full structural coverage. The general idea is a separation of the test into test aims and the use of EA to search for test data fulfilling the test aims. The separation of the test into partial aims and the definition of fitness functions for partial aims are performed in the same manner for each test criterion. Each partial aim represents a program structure that requires execution in order to achieve full coverage of the structural test criterion selected, i.e. each single statement represents a partial aim when using statement coverage. The definition of a fitness function that represents the test aim accurately and supports the guidance of the search is a prerequisite for the successful application of Evolutionary
Test. In order to define the fitness function, this research builds upon previous work dealing with branching conditions (among others [8], [5], and [9]). These are developed in [10] by introducing the idea of an approximation level. Evolutionary Functional Testing. Complex dynamic systems must be evaluated over a long time period (longer than the largest internal dead time or time constant). This means that the systems must not be stimulated for only a few simulation steps; rather, the input signals must be up to hundreds of time steps long. Long input sequences are therefore necessary in order to simulate these systems. Several disadvantages result from the length of the sequences necessary: the number of free variables during optimization is very high and the correlation between the variables is strong. For this reason, one of the most important aims is the development of a very compact description for the input sequences. This must contain as few elements as possible but, at the same time, offer a sufficient amount of variety to stimulate the system under test as much as necessary. Moreover, possibilities for automatically evaluating the system answers must be developed, which allow differentiation between the quality of the individual input sequences. These requirements and the solutions developed will be presented in section 4.
3 Evolutionary Algorithms for Structural Sequence Testing

The evolutionary structural test was designed originally to find a set of test data for single function calls which results in a high structural coverage of the software under test. This approach has been defined for the most common test criteria in [10]. The fundamental idea is the transformation of the test data generation into a search problem which is then solved by evolutionary algorithms. The optimization function designed for this calculates so-called fitness values. The evolutionary algorithm uses these values to optimize the solution to meet the test aim currently selected. Structural Sequence Testing targets the structural coverage of a given test object. For this reason, test aims are defined in the same manner as introduced for ET. However, in order to apply evolutionary algorithms it is necessary to redefine the search space and the fitness function, because test cases are now input sequences, each executing several paths (see figure 5).

3.1 Search Space Structure

In the original approach the search space is formed by the input parameters of the function under test. This is different for sequence tests. The authors decided to implement the generation of input sequences for the first experiments in the area of structural tests. With this approach, one test datum is a list of inputs. Monitoring a test case will return a list of execution paths (one per input). The length of this list is not defined in the test object and is, in general, not limited for control systems.
Fig. 5. Test case (sequence) and different execution paths for the inputs; On the right hand side an example memory model is shown, where each step results in a state change
Fig. 6. The encoding of the test case data to different runs
For evolutionary algorithms, an encoding of the individuals generated (mapped to test data) has to be defined. A straightforward approach is to let the user specify the length of the input sequences manually. With the information on the sequence length, a search space is formed by replicating all input parameters in such a way that a separate value is provided for each call and each parameter. Every individual generated by the EA represents the data of one sequence of calls (here called test data). Figure 6 shows the mapping of the data.

3.2 Sequence Evaluation

As mentioned previously, the test criteria for structural coverage are defined on the basis of the test object source code. An automatic test system has to monitor and analyze all runs of the test object to find out which test aims have been reached (structures that have been covered) and to provide a fitness value for the test aim currently selected. This fitness is determined by monitoring the test execution. In the original ET approach, fitness is calculated by analyzing one execution path through the test object. This is different in sequence testing, since each test case run results in a list of execution paths. This list is the basis upon which a fitness value has to be calculated. The sequence testing fitness function introduced here uses the original idea of the fitness function and creates a sequence fitness value from that.
Structural and Functional Sequence Test of Dynamic and State-Based Software
2433
For structural coverage it is not important at which step of a sequence a structural element is covered. For this reason, the authors decided to design a fitness function which analyzes all the execution paths of a test case sequence and uses the path closest to the test aim as the fitness of the sequence. With this approach it does not matter whether a close execution path is traversed earlier or later in the sequence, and whether or not the EA is able to create better performing solutions by ordering the sequence in different ways (see [1] for details of sequence optimization). The original ET approach uses decisive branches to ascertain the fitness of an execution path (see [10]). This idea can be reused by analyzing each path of the sequence and determining an overall fitness. Due to the fitness function's structure, it has to be minimized by the EA. For this reason, the overall fitness is the minimum of the path fitness values. Using this approach the best path will define the fitness of the sequence.

3.3 Experiments

The experiments on Structural Sequence Testing were performed with an extended version of the automatic structural test system. The GEATbx [7] is the optimization component used by the system. The authors applied the standard settings for this class of problem (namely, 6 subpopulations with 50 individuals each, linear ranking, a selection pressure of 1.7, migration and competition between subpopulations). A real valued representation is used. The subpopulations employ different search strategies by using different settings for the recombination (discrete and line recombination) and mutation operator (real valued mutation with differently sized mutation steps – large, medium and small).

Table 1. Information on the complexity of the test objects

Name      Size   LOC   nodes   param.   if-then   conditions   nesting level
Ctrl_S    16kB   220    20      28        5          6            2
Enable_S  54kB   800   140      15       51         70           10
Enable_A  44kB   520    86      11       39         56            8
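The sequence fitness rule of Section 3.2, which rates a whole test case by its best (minimal) path fitness, can be sketched like this; the path fitness values are assumed to come from the monitoring component:

```python
def sequence_fitness(path_fitness_values):
    """Fitness of a whole input sequence for one structural test aim.

    Each input of the sequence yields one execution path, and each path
    has a fitness (distance to the test aim, to be minimized). Since it
    does not matter at which step the aim is approached, the sequence is
    rated by its closest path.
    """
    if not path_fitness_values:
        raise ValueError("a test case must execute at least one path")
    return min(path_fitness_values)

# Three calls in the sequence; the second call came closest to the aim.
print(sequence_fitness([12.5, 0.75, 4.0]))  # 0.75
```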
Three software functions which are part of a large real-world functional model were tested using the extended ET system. Table 1 provides an overview of the complexity measures for the test objects. The test object Ctrl_S is relatively small but is code generated from a state diagram containing flags in all conditions.
Table 2. Experimental results, showing the number of test aims, number of individuals generated and coverage reached

Test object  Coverage criteria  test aims  num. individuals  coverage  not reached
Ctrl_S       StatementCover        17         550 000          95 %        1
Ctrl_S       BranchCover           23         650 000          91 %        2
Ctrl_S       ConditionCover        12         370 000          92 %        1
Enable_S     StatementCover       135         225 000          99 %        1
Enable_S     BranchCover          196         550 000          98 %        5
Enable_S     ConditionCover       168         650 000          97 %        5
Enable_A     StatementCover        87         390 000          95 %        5
Enable_A     BranchCover          126         495 000          92 %       11
Enable_A     ConditionCover       116         425 000          94 %        8
The three standard test criteria were employed. Statement cover requires a test case set that executes all statements of the software under test. The branch cover criterion requires
a test case set traversing all branches of a test object. Last but not least, condition cover necessitates a collection of test data evaluating all the atomic conditions of a program with the values true and false. The high coverage reached for all the criteria can be seen in table 2. Only a few of the test aims not reached are a sequence-test-specific problem. Some of them refer to the problem of unreachable code created by the code generator and some to the problem of performing an evolutionary test with flag conditions. However, in the next paragraphs we would like to give a short overview of the reasons why certain test goals were not reached.

Test result details for test object 'Ctrl_S'. This test object was difficult to test with the original ET approach due to the source code structure. The code was generated from a state diagram. All the program's conditional statements use flags that have been assigned previously. Some conditions access flags that have been assigned in previous calls of the function, which results in the test object requiring sequence tests. The test runs have shown that this kind of software does not pose a problem for the approach, since all sequence-test-specific test aims have been covered. For one statement, the two corresponding branches and the flag condition, no covering sequence could be found because the search was not guided to a flag assignment within a function call.

Test results for test object 'Enable_S'. The testing of this module demonstrates the potential of the evolutionary testing approach. The test object consists of 800 lines of code leading to 135 control flow nodes, 196 branches and 168 test aims for simple condition cover. During the statement coverage test only one statement was not executed. Such non-executable statements, branches or conditions appear when using current versions of code generators. This occurs if the developer uses template library functions and configures them with constants.
This leads to a comprehensible model with reuse of components, but the generated code contains lines of code that are not executable. The branch coverage test was more challenging, resulting in a coverage of 98%. Only five branches were not traversed. Upon checking the code we found that two branches belong to the non-executable statement. One branch is not executable because the associated condition cannot evaluate to false. The evolutionary test system was not able to find an input for two conditions which depend on sequential calls. This is due to the high nesting level of dependent variable assignments and uses. At the moment it is not possible to guide the search to a solution that executes an assignment in one sequence step and traverses the corresponding condition in later steps; this would require further research. The results of simple condition cover were good in that the evolutionary test found a test case set covering 97% of the test aims. For the reason why the five uncovered conditions were missed, see the paragraph on branch coverage above.

Test results for test object 'Enable_A'. This module is not particularly complex with regard to metrics; however, it contains some program structures that are difficult to test. First of all, the code is state oriented and many conditions depend on the settings of state variables. The function has a high nesting level of if-then-else statements and employs a state encoded using a set of flags. Again, the test object contained infeasible code because of the use of library templates (3 statements). Two other statements were not covered because of the flags used in the conditions. The results of branch coverage are similar to those of statement coverage. The test case set found did not cover 4 branches starting at flag conditions, and 6 branches are placed in unreachable code. One branch could not be traversed because a precondition had not been satisfied. This branch requires an input sequence assigning two variables in previous calls of the function. The approach does not find a solution for this. The condition cover test returned equivalent results. Four of the conditions are not feasible because of the code generation. The test system did not find an input sequence covering three flag conditions. Again, the condition which requires variable settings in previous calls was not fulfilled.
4 Evolutionary Algorithms for Functional Sequence Testing

The generation of test sequences for dynamic real-world systems poses a number of challenges:
- The input sequences must be long enough to stimulate the system realistically.
- The input sequences must possess the right qualities in order to stimulate the system adequately. This concerns aspects such as the type of signal, speed and rate of signal change.
- The output sequences must be evaluated with regard to a number of different conditions. These conditions are often contradictory, and several of them can be active simultaneously.

In order to develop a test environment for the functional test of dynamic systems which can be applied in practice, the following main tasks must therefore be completed:
- creation of a compact description for the input sequences,
- evaluation of output sequences regarding defined requirements,
- inclusion of input sequence generation and evaluation of system requirements into an optimization process.

Throughout the whole section, examples are given using a real-world system. The structure of the integrated cruise control and vehicle-to-vehicle distance control (Distronic) is shown in figure 7. An extensive description is contained in [3]. The test environment presented in this section is implemented in Matlab and used for the testing of Simulink/Stateflow models. However, we concentrate on the presentation of the main concepts and the demonstration of results.
Fig. 7. Model of an integrated cruise control and vehicle-to-vehicle distance control (Distronic)
4.1 Description of Input Sequences

In order to describe the input sequences, several description levels must be differentiated. On the one hand, there is the description of the signal which is used as input for the simulation of the dynamic system (simulation sequence). On the other hand, there is the compact description which the user stipulates (compact user description). Between these lie the description of the bounds of the variables for the optimization (description of sequence bounds) as well as the instantiation of the individual sequences (optimization sequence). These different sequence description levels are illustrated in figure 8.

[Figure 8 pipeline: compact user description → transformation → sequence bounds → instantiation → optimization sequences → interpolation → simulation sequences (sampled signals)]
Fig. 8. Different levels of input sequence description
For a compact description, the long simulation sequence is subdivided into individual sections. Each section is represented by a base signal, which is parameterized by the variables signal type, amplitude and length of section. We use the following base signal types: step, ramp (linear), impulse, sine, spline. These base signal types are sufficient to generate any signal curve. Only the admissible ranges for these parameters are specified for the optimization. In this way, the bounds are defined within which the optimization generates solutions, which are subsequently evaluated by the objective function with regard to their quality. The amplitude of each section can be defined absolutely or relatively. The length of a section is always defined relatively (to ensure the monotonicity of the signal). The base types are given as an enumeration of possible types. These bounds must be defined for each input of the system. For nearly every real-world system they can be derived from the specification of the system under test. An example of an input sequence description is provided in figure 9, which gives the textual description as defined by the tester. Figure 10 provides examples of three different input signals generated by the optimization.

Input.Names = {'throttle_angle', 'brake_angle', 'ctrl_lever', ...
               'v_trgtcar', 'dist_factor'};
Input.BasePoints = [10, 10, 10, 10, 2];
Input.Amplitude.Bounds = [  0,   0, 0,  0, 1; ...
                          100, 100, 3, 30, 1];
Input.Amplitude.Interpret = {'abs', 'abs', 'abs', 'abs', 'abs'};
Input.Basis.Bounds = [1, 1, 1, 1, 1; ...
                      3, 3, 5, 3, 3];
Input.Transition.Pool = { {'linear', 'spline'}; {'linear'}; {'impulse'}; ...
                          {'spline'}; {'step'} };
Sim.Length = 200;
Sim.SamplingRate = 2.5;
Fig. 9. Textual description of input sequences
The system under test (Distronic) has 5 inputs. We use 10 sections for each sequence. The amplitude for the throttle pedal can change between 0 and 100, the control lever can only have the values 0, 1, 2 or 3.
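How a compact section description is expanded into a sampled simulation signal can be sketched roughly as follows. This is a simplified illustration supporting only the 'step' and 'linear' base types (the tool described here also supports impulse, sine and spline), and the function name is an assumption:

```python
def expand_sections(sections, sim_length, sampling_rate, start_value=0.0):
    """Expand (base_type, amplitude, rel_length) sections into samples.

    Relative section lengths are normalized over the whole simulation
    time, so the optimizer cannot produce signals longer than sim_length.
    """
    total = sum(rel for _, _, rel in sections)
    n_samples = int(sim_length * sampling_rate)
    samples = []
    value = start_value
    for base, amplitude, rel in sections:
        n = int(round(n_samples * rel / total))
        for i in range(n):
            if base == "step":
                samples.append(amplitude)
            elif base == "linear":  # ramp from the previous section's end value
                samples.append(value + (amplitude - value) * (i + 1) / n)
        value = amplitude
    return samples

# A step to 2.0 followed by a ramp to 6.0, 4 s at 1 Hz:
sig = expand_sections([("step", 2.0, 1), ("linear", 6.0, 1)],
                      sim_length=4, sampling_rate=1.0)
print(sig)  # [2.0, 2.0, 4.0, 6.0]
```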
In real-world applications the variety of an input signal is often constrained with regard to possible base signal types. An example of this is the control lever in figure 9: this input contains only impulses as the base signal type. To generate an even more restricted description, it is possible to define identical lower and upper bounds for some of the other parameters. In this case the optimization has an empty search space for this variable – the input parameter is a constant. An example is the distance factor (a measure of the relative distance to the preceding vehicle) in figure 9. This input signal is always set to a constant value of 1. Thus, it is not part of the optimization (but used as a constant value for the generation of the internal simulation sequence).

Fig. 10. Instances of simulation sequences generated by the optimization based on the textual description in figure 9; left: throttle pedal, middle: control lever, right: velocity of target car
All these different levels of input sequence description ensure that both requirements are fulfilled: an adequate simulation of the system and a very compact and comprehensible description by the tester. The compact description is used for the optimization, ensuring a small number of variables. When comparing the size of both descriptions for the example dynamic system used (5 inputs, one of them constant, 10 signal sections, 3 variable parameters per section, 200 seconds simulation time, sampling rate 2.5 Hz), the differences are enormous:

    Size_SimulationSignal   = 5 · 200 · 2.5    = 2500
    Size_CompactDescription = (5 − 1) · 10 · 3 = 120
    CompressionRatio        = 2500 / 120       ≈ 20.8        (1)
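The size calculation in equation (1) is easy to verify:

```python
# Parameters of the example system (Distronic), as given in the text.
inputs, constant_inputs = 5, 1
sim_length_s, sampling_rate_hz = 200, 2.5
sections, params_per_section = 10, 3

# Samples of the full simulation signal vs. variables of the compact description.
size_simulation = inputs * sim_length_s * sampling_rate_hz
size_compact = (inputs - constant_inputs) * sections * params_per_section

print(size_simulation)                            # 2500.0
print(size_compact)                               # 120
print(round(size_simulation / size_compact, 1))   # 20.8
```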
Only this compact description opens up the opportunity to optimize and test real-world dynamic systems within a realistic time frame.

4.2 Evaluation of Output Sequences and Objective Function

The test environment for the functional test of dynamic systems must perform an evaluation of the output sequences generated by the simulation of the dynamic system. These output sequences must be evaluated with regard to the optimization aims. During the test we always search for violations of the defined requirements. Possible aims of the test are to check for violations of:
- signal amplitude boundaries,
- signal dependencies,
- maximal overshoot,
- maximal settlement time.

Each of these checks must be evaluated over the whole signal length or a part of it, and an objective value generated. Due to space constraints we describe only the first two requirements in detail. However, our test environment can assess all of the checks (and more will be added).
Fig. 11. Left: violation of maximum amplitude; right: assessment of signal overshoot
An example of the violation of signal amplitude boundaries is given in figure 11, left. A minimal and a maximal amplitude value are defined for the output signal y. In this example the output signal violates the upper boundary twice. The first violation is a serious one, as the signal transgresses the bound by more than a critical value y_c (a parameter of this requirement). In this case a special value indicating a severe violation is returned as the objective value (value −1), see equation (2). The second violation is less severe, as the excursion over the bound is not critical. In this case an objective value indicating the closeness of the maximal value to the defined boundary is calculated. This two-level concept allows a differentiation in quality between multiple output signals violating the defined bounds. The calculation of the objective value is given in equation (2):

    signal_max = max_t( y(t) )

    ObjVal_max = −1                     if signal_max ≥ y_max + y_c        (2)
    ObjVal_max = signal_max / y_max     if signal_max < y_max + y_c
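The two-level amplitude check can be sketched as follows; the severe-violation marker −1 follows equation (2), while the exact scaling of the closeness value is a reconstruction and should be read as an assumption:

```python
def amplitude_objective(signal, y_max, y_c):
    """Two-level objective for an upper amplitude bound (cf. equation (2)).

    Returns the special value -1 for a severe violation (bound exceeded by
    more than the critical value y_c); otherwise a value expressing how
    close the signal's maximum came to the bound (assumed scaling).
    """
    signal_max = max(signal)
    if signal_max >= y_max + y_c:
        return -1                # severe violation: terminate the search
    return signal_max / y_max    # approaches 1 as the signal nears the bound

# Distronic example: speed bound y_max = 44 m/s, critical value y_c = 0.
print(amplitude_objective([10, 42, 38], y_max=44, y_c=0))  # close to 1
print(amplitude_objective([10, 45, 38], y_max=44, y_c=0))  # -1
```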
A similar assessment is used for calculating the objective value of the overshoot of an output signal after a jump in the respective reference signal. First, the maximal overshoot value is calculated. Next, the relative height of the overshoot is assessed. A severe overshoot outside the specification returns a special value (again –1). This special value is used to terminate the current optimization. The test was successful, as we were able to find a violation of the specification and thus reach the aim of the optimization. In all other cases, an objective value equivalent to the value of the overshoot is calculated (similar to equation (2)). Each of the requirements tested produces one objective value. For nearly every realistic system test we receive multiple objective values. In order to assess the quality of all objectives tested we employ multi-objective ranking as supported by the GEATbx [7]. This includes Pareto-ranking, goal attainment, fitness sharing and an archive of previously found solutions.
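At the core of such a multi-objective comparison is Pareto dominance. A minimal sketch (for minimization; this is textbook Pareto dominance, not the GEATbx implementation):

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (minimization):
    a is no worse in every objective and strictly better in at least one."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))

# Two requirement checks yield two objective values per test sequence.
print(dominates((0.2, 0.5), (0.3, 0.5)))  # True
print(dominates((0.2, 0.9), (0.3, 0.5)))  # False: trade-off, incomparable
```

Non-dominated individuals form the first Pareto rank; ranking, goal attainment and sharing then refine the selection among them.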
4.3 Experiments

The test environment was used for the functional testing of a number of real-world systems. One of these is the Distronic model described earlier ([3], structure shown in figure 7), for which the results of one test are presented in this subsection. For the car system with an activated Distronic, the maximum speed was specified as 44 m/s (the critical value y_c was set to 0). This means that a higher speed is not permitted under any circumstances. Thus, a test was specified to search for inputs which produce a speed greater than this boundary. With an active Distronic the car can be accelerated only by pushing the control lever upwards (the respective input value is 1). The car is decelerated by pushing the control lever downwards (input value: 2). Besides the amplitude of the input control lever, the relative length of the signal sections could be varied between 1 and 5. The results of one successful optimization are shown in figures 12 and 13.
Fig. 12. Visualization of the optimization process, left: best objective value; middle: variables of the best individual; right: objective value of all individuals
The optimization process is visualized in figure 12. The left graphic presents the progress of the best objective value over the generations; the optimization continually finds better values. In the 19th generation a serious violation is detected and an objective value of −1 returned. The middle graphic presents the variables of the best individual as a (gray) color quilt; each line represents the value of one variable over the optimization. The graphic on the right visualizes the objective values of all the individuals during the optimization (generations 1 to 18).
Fig. 13. Problem-specific visualization of the best individual during the optimization, left: input of the control lever at the beginning and end of the optimization (2nd and 19th generation), right: vehicle velocity at the beginning and end of the optimization (2nd and 19th generation)
The graphics in figure 13 provide a much better insight into the quality of the results, visualizing the problem-specific results of the best individual of each respective generation. The input of the control lever is shown in the two left side graphics. The resulting
velocity of the car is presented on the right. The graphics are taken from the 2nd (left) and the 19th generation (right). At the beginning, the maximal velocity is far below the critical boundary. During the optimization the velocity is increased (the control lever is pushed up more often and at an earlier stage, as well as being pushed down less frequently). At the end, an input situation is found in which the velocity is higher than the specified bound. By looking at the respective input signals, the developer can check and change the implementation of the system. During optimization we employed the following evolutionary parameters: 20 individuals in 1 population, discrete recombination and real valued mutation with medium sized mutation steps, linear ranking with a selection pressure of 1.7, as well as a generation gap of 0.9. Other tests with a higher number of relevant input signals, and thus more optimization variables, employ 4-10 subpopulations with 20-50 individuals each. In this case we use migration and competition between the subpopulations. Each subpopulation uses a different strategy by employing different parameters (most of the time differently sized mutation steps).
5 Conclusion

In this paper we have presented different approaches for applying Evolutionary Testing to sequence testing. The first method aims at automating test case generation to achieve high structural coverage of the system under test (white box test). The other approach searches for input sequences violating the specified functional behavior of the system (black box test). Both test methods were implemented, and experiments with software modules and dynamic systems of varying complexity were conducted. We have presented a small selection of the results.

The results show the new test methods to be promising. It was possible to find test sequences, without the need for user interaction, for problems which previously could not be solved automatically. During the experiments a number of issues were identified which could further improve the efficiency of the test methods presented. For the structural sequence test, a fitness function which expresses the dependency between successive sequence steps would significantly aid the search process. In the case of the functional sequence test, it is necessary to include as much existing problem-specific knowledge as possible in the optimization process.
References

[1] Baresel, A., Sthamer, H., Schmidt, M.: Fitness Function Design to Improve Evolutionary Structural Testing. Proceedings of GECCO 2002, New York, USA, pp. 1329–1336, 2002.
[2] Beizer, B.: Software Testing Techniques. New York: Van Nostrand Reinhold, 1983.
[3] Conrad, M., Hötzer, D.: Selective Integration of Formal Methods in the Development of Electronic Control Units. Proceedings of the Second IEEE International Conference on Formal Engineering Methods ICFEM'98, IEEE Computer Society, pp. 144–155, 1998.
[4] Harman, M., Hu, L., Munro, M., Zhang, X.: Side-Effect Removal Transformation. IEEE International Workshop on Program Comprehension (IWPC), Toronto, Canada, 2001.
[5] Jones, B.-F., Sthamer, H., Eyres, D.: Automatic structural testing using genetic algorithms. Software Engineering Journal, vol. 11, no. 5, pp. 299–306, 1996.
[6] Korel, B.: Automated Test Data Generation. IEEE Transactions on Software Engineering, vol. 16, no. 8, pp. 870–879, 1990.
[7] Pohlheim, H.: GEATbx – Genetic and Evolutionary Algorithm Toolbox for Matlab. http://www.geatbx.com/, 1994-2003.
[8] Sthamer, H.: The Automatic Generation of Software Test Data Using Genetic Algorithms. PhD Thesis, University of Glamorgan, Pontypridd, Wales, Great Britain, 1996.
[9] Tracey, N., Clark, J., Mander, K., McDermid, J.: An Automated Framework for Structural Test-Data Generation. Proceedings of the 13th IEEE Conference on Automated SE, Hawaii, USA, 1998.
[10] Wegener, J., Sthamer, H., Baresel, A.: Evolutionary Test Environment for Automatic Structural Testing. Special Issue of Information and Software Technology, vol. 43, pp. 851–854, 2001.
[11] Wegener, J., Sthamer, H., Jones, B., Eyres, D.: Testing Real-time Systems using Genetic Algorithms. Software Quality Journal, vol. 6, no. 2, pp. 127–135, 1997.
Evolutionary Testing of Flag Conditions André Baresel and Harmen Sthamer DaimlerChrysler AG, RIC/SM Methods and Tools Alt-Moabit 96a, 10559 Berlin, Germany {andre.baresel,harmen.sthamer}@daimlerchrysler.com
Abstract. Evolutionary Testing (ET) has been shown to be very successful in testing real world applications [16]. However, it has been pointed out [11] that further research is necessary if flag variables appear in program expressions. The problems increase when ET is used to test state-based applications, where the encoding of states hinders successful evolutionary tests. This is because, in the presence of flag variables or variables encoding an enumeration type, the performance of ET is reduced to that of a random test. The authors have developed an ET system to provide easy access to automatic testing. An extensive set of programs has been tested using this system [4], [16]. The system has been extended for new areas of software testing, and research has been carried out to improve its performance. This paper introduces a new approach for solving ET problems with flag conditions. The problematic constructs are explained with the help of code examples originally found in large real world applications.
1 Introduction to the Flag Problem
Evolutionary Structural Testing has to generate a set of test data for a given test object in order to obtain a high coverage of the program structures. The automatic process creates several test aims, which the system tries to cover in separate search processes. The test aims for statement coverage are the statements of the test object. Each search for a test aim is guided by a fitness function. This function defines numerically the proximity of a test datum to the current test aim. The function uses the values examined at the condition statements during program execution. For instance, the instrumentation of an equivalence condition on a and b will report the distance between a and b as fitness; a very small distance results in a very good fitness. This function guides the search towards a equals b.

In the case of flag values, the fitness function is simply a Boolean function returning only one poor fitness value for all test data resulting in the undesired flag value. The Boolean fitness function does not guide the search towards a desired solution at all and results in an evolutionary test behaving like a random test. Fig. 1 shows two fitness landscapes close to the solution of executing the test aim. On the left hand side a Boolean function created by flag variables is shown; in contrast, the right hand side shows a local distance function formed by equivalence operators. The minimum of both functions is searched for.

E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 2442–2454, 2003. © Springer-Verlag Berlin Heidelberg 2003

Whereas a solution for one value of the Boolean function is easily found, the solution leading to
the other Boolean value is very hard to find. This is different for the function created for the equivalence operators: values of this function can easily be optimized.
Fig. 1. The diagram on the left shows a Boolean function and the diagram on the right shows a fitness landscape created by an equivalence operator of two double variables.
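The contrast between the two landscapes in Fig. 1 can be made concrete: a branch distance such as |a − b| for the condition a == b grades every candidate, whereas a flag condition yields only two fitness values. An illustrative sketch:

```python
def distance_fitness(a, b):
    """Local distance for the condition 'a == b': graded, guides the search."""
    return abs(a - b)

def flag_fitness(flag):
    """Local 'distance' for the condition 'flag': two values, no gradient."""
    return 0.0 if flag else 1.0

# The distance function rewards getting closer to the solution;
# the flag function rates every miss identically.
print(distance_fitness(10, 4), distance_fitness(5, 4))  # 6 1
print(flag_fitness(False), flag_fitness(True))          # 1.0 0.0
```

With only two fitness values, selection has nothing to prefer among the failing candidates, which is why the evolutionary test degenerates into a random search.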
It is very common to use flag variables in real world applications. Even code generators, e.g. for Matlab/Simulink/Stateflow [7], create code containing flags. This creates the problem of a Boolean-like fitness function which does not guide the search for test data. The problem also occurs when using variables that encode only a small subset of the integer type, as often happens for variables defined as enumeration types. The measured distance between enumeration type elements does not help to find the right input to fulfill a condition that compares elements of this type. In this case the fitness function returns values that do not direct the search to the solution, because the function only expresses whether right or wrong values have been monitored. The authors would like to show that random search behavior can be avoided by using a new fitness function for flag conditions. This also works for enumeration type values. The improvement uses additional information about the flag assignments in a program. These assignments can be identified by static data flow analysis.
2 Short Overview of Evolutionary Testing
Evolutionary algorithms (EA) have been used to search for data in a wide range of applications. An EA is an iterative search procedure using different operators to mimic the behavior of biological evolution. When using an EA for a search problem it is necessary to define the search space and the objective (fitness) function. The algorithms are implemented in the widely used toolbox GEATbx [12], which provides a large set of operators, e.g. real and integer parameters, migration and competition strategies.

ET uses EA for automatic software testing. The different software test criteria formulate requirements for a test case set to be generated. The creation of such a test data set usually has to be carried out manually. Automatic software testing generates test data sets automatically, trying to fulfill the requirements in order to increase efficiency and achieve considerable cost reduction [15]. This paper covers the automation of structural testing, which is a white box test defining the quality of a test case set on the basis of the structural coverage of the test object. Evolutionary structural testing defines test aims and fitness functions to transform the problem of generating a test case set into searches solved by EA [16]. The quality of the fitness function plays an important role in the success of this approach, and some research has been carried out to improve it (e.g. [2], [9]).
ET works very well for many real world applications. However, 'interesting test objects', where even an experienced software tester will have problems finding a fully covering test data set, are still a challenge for the original ET. This is often due to certain constructs in the programs which are not taken into account by the fitness function.
3 The Approach to Testing Flag Conditions
This section introduces the idea of improving the fitness function by means of additional information in the case of flag variables appearing in the code. The goal is to design a fitness function with better guidance of the search, so that test data generation for flag conditions is improved and does not degenerate into a random search. The section is divided into subsections explaining the solution step by step. The paper ends with the analysis of a real world example, showing that the approach introduced helps gain full coverage of this test object.

3.1 Direct Assignment of a Boolean Value
The basic idea of the new approach is first explained using a simple example. The source code in Example 1 shows the assignment and the use of a flag variable within a nested if-then structure.

Example 1: Usage of a flag assignment
1: flag = false;                                      [ENTRY-NODE]
2: if (a==0) flag = true;   /* flag assignment */     [TARGET-NODE-1]
3: ...                      /* no other assignments to this flag */
4: if (c == 4) {
5:   if (flag && b > c)     /* contains a flag condition */
6:     /* test aim */                                 [TARGET-NODE-2]
The original ET approach immediately reaches a high coverage without, however, fully covering the structure. The last search is performed for the test aim at node 6. The fitness function for this test aim is based on the control dependencies created by the if-statements of lines 4 and 5. This provides the search with good guidance for reaching the if-statement in line 5; however, the local distance there is a Boolean function and does not direct the search. The authors suggest a better fitness function, which makes use of a static analysis of the test object. The use-definition analysis returns the assignments in the code which have an influence on the flag condition. Evaluating the use-definition chain [1] shows that the flag assignment of line 2 is the only assignment influencing the flag use of line 5. This information is the basis for a new fitness function which first guides the search to test data that execute the flag assignment in line 2, and then to the flag condition of line 5. A fitness function targeting two locations in the source code has been defined as a node-node-oriented fitness function in [16]. The test aim is split into two target nodes: the first target node is the assignment, the second one the original test aim. After applying the new fitness function, the test data searched for have to execute the assignment as well as the test aim. It is now explained how this changes the fitness function applied to Example 1.
Evolutionary Testing of Flag Conditions
• The original approach searches for "c==4" and "flag==true && b>c". This is created by the control dependency of the if-statements of lines 4 and 5. Because the last condition contains a flag, ET behaves like a random search.
• The improvement searches for "a==0" and "c==4" and "flag==true && b>c". The first part is created by the control dependencies of the flag assignment in line 2 and the second part by the control dependencies of the test aim. This new fitness function directs the search to "a==0" in the first instance, automatically resulting in the fulfillment of the flag condition.
The following gives a short overview of the evaluation of a node-node-oriented fitness function. Evolutionary Structural Testing monitors the execution of the program under test; the monitoring result for each test datum contains information on the execution path and the evaluation of the conditions during runtime. On the basis of this information a fitness value has to be calculated. The original ET approach introduced the idea of decisive branches and approximation levels assigned to the control flow graph of the test object. The approximation level at a decisive branch expresses the global distance to the test aim, whereas a local distance can be used to compare solutions that have reached the same level. A detailed description of this can be found in [16].
Fig. 2. Control flow graph of Example 1 with highlighted decisive branches and annotated approximation levels
Approximation levels are assigned to all branches that create a control dependency for the test aim. The dependencies can be calculated by standard algorithms (see [13]). For the node-node-oriented fitness function, it is necessary to estimate the dependencies for progressing from the entry node to target node 1 (for the target nodes see Example 1, right hand side) and the dependencies for progressing from target node 1 to target node 2. The assignment of the levels is performed by analyzing the possible execution orders of the identified nodes; nodes that come later in the execution path achieve better approximation levels than nodes at the beginning of the path.

control dependencies from "entry node" to "target node 1":    node 2
control dependencies from "target node 1" to "target node 2": node 4, node 5

execution order of the identified nodes and assigned approximation levels:
Node 2 -> Level 3
Node 4 -> Level 2
Node 5 -> Level 1
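The level assignment above can be sketched as follows (a minimal model of ours; it only captures the ordering rule, not the local distances):

```python
# Approximation levels for a node-node-oriented fitness function.
# Control-dependency nodes are ordered along the possible execution
# path; nodes closer to the test aim get better (lower) levels.

def assign_levels(deps_to_target1, deps_to_target2):
    """Concatenate the dependency nodes of both target nodes in
    execution order and assign levels counting down to 1."""
    ordered = deps_to_target1 + deps_to_target2
    n = len(ordered)
    return {node: n - i for i, node in enumerate(ordered)}

# Example 1: entry -> target node 1 depends on node 2,
# target node 1 -> target node 2 depends on nodes 4 and 5.
print(assign_levels([2], [4, 5]))  # {2: 3, 4: 2, 5: 1}
```

The output reproduces the table above: the first decisive branch (node 2) carries the worst level, the last one before the test aim (node 5) the best.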
Fig. 3 shows a graphical interpretation of the approximation levels comparing the original (left hand side) and the new fitness evaluation approach (right hand side).
Fig. 3. The original ET (left) estimates the same fitness for both test cases; the new approach (right) assigns a poor fitness for the path not executing the flag assignment (top)
Whereas in the original approach only one decisive branch can be executed per execution path and test aim (because of the definition of the control dependency), in the new approach it is necessary to make sure that the first decisive branch (e.g. the top right path in Fig. 3) defines the fitness. This is due to the concatenation of the decisive branches of the two target nodes. A complete definition of the fitness calculation rules is provided later, in Definition 1, after all details have been explained. Fig. 4 shows the experimental results. Two scales are used in the diagram, since the fitness values have different ranges: the original fitness for the selected test aim ranges from 0 to 2, whereas the new fitness has an increased range of [0, 3].
Fig. 4. The diagram shows the fitness of the best individual created by the EAs over all generations; the fitness function of the new approach guided the search so that the solution was found in generation 130, whereas the original ET stagnates at a fitness of 0.4.
With the new fitness function targeting the flag assignments, the evolutionary test can be improved to reach a higher coverage for applications that use single flags with just one assignment. In the next steps the authors show how this idea can be extended to solve more complicated flag uses and to provide a universal solution for any kind of flag definition and use.

3.2 Dealing with Undesired Flag Assignments
The previous subsection gave an example where the execution of the test aim depended on a desired assignment. In real world applications, however, program code can certainly include assignments that must not be executed in order to fulfill a test aim.
A very simple piece of source code showing this can be seen in Example 2. In order to execute the test aim, a flag value of "true" is necessary; for this reason, a solution that reaches the test aim cannot execute the flag assignment in line 2.

Example 2: Source code containing an undesired flag assignment
1: flag = true;
2: if ( a!=0 || c>5 ) flag = false;  /* avoid executing this assignment */
3: switch (er) {
4: case 5:
5:   if (flag)
6:     /* test aim */
The original ET approach does not use information about the undesired assignment. For the original ET, a high probability of executing the flag assignment results in a random search, because the fitness function does not take the condition in line 2 into account (it does not create a control dependency for the test aim). A static analysis of the program under test can be helpful in this situation. By using the information about undesired flag assignments it is possible to create a better fitness function, which has to guide the search so that the identified flag assignment is not executed. The control dependencies of the undesired assignment are used to define the new function but, in contrast to the solution explained for desired flag assignments, the fitness is now based on the branches not leading to the assignment.
Fig. 5. Control flow graph of the example with highlighted decisive branches. The branch leading to the undesired assignment is decisive and gets a poor approx. level.
After applying the new fitness function, the test data searched for must not execute the assignment and must then reach the test aim. It is explained next how this changes the fitness function applied to Example 2. The original approach searches for "er==5" and "flag==true". This is because of the control dependency created by the switch- and if-statements of lines 3 and 5. The flag condition creates a Boolean fitness function.
The new approach searches for "not (a!=0 || c>5)" and "er==5" and "flag==true". The first part is created by the control dependencies of the flag assignment in line 2 and the second part by the control dependencies of the test aim at line 6. This new fitness function allows only solutions with "not (a!=0 || c>5)" in the first instance (this automatically results in fulfilling the flag condition of the second term).
An experiment with the new fitness function is presented in Fig. 6. Again, the fitness curves have been made comparable by using two scales. The optimization using the original approach stagnates at a value of 0.4 because it does not find a solution which results in a flag value of "true".
Fig. 6. Diagram demonstrating the improvement of the search process; the original approach does not find a solution for the test aim (it remains at a fitness of about 0.4)
As shown by means of this example, it is also possible to improve the original ET approach for test objects containing undesired flag assignments.

3.3 Multiple Assignments to Flag Variables
The examples shown in the previous subsections only use single flag assignments. The approach is now extended to test objects with multiple flag assignments. To achieve this, the idea of include and exclude lists for flag assignments is introduced. The explanations refer to Example 3, in which the test aim and the desired and undesired flag assignments for this test aim are marked in the source code. Applying static analysis to the example returns three flag definitions, of which the execution of one assignment is not desired for covering the test aim. All desired assignments are collected in a so-called include-assignment list; the undesired ones are placed on the corresponding exclude-assignment list.

Example 3: Source code containing multiple flag assignments
1: flag = false;
2: if (a==0) flag = true;    /* execute this */
3: if (b==0) flag = true;    /* or execute this */
4: ...
5: if (z > y) flag = false;  /* do not execute! */
6: ...
7: if (c==0) {
8:   if (flag)               /* test aim: go here */
In order to obtain a solution with a higher structural coverage it is necessary for the fitness function to return good values for any test data that:
- are close to or execute any of the include-assignments, and
- do not execute any of the exclude-assignments.
All include-assignments have to be treated equally, since it cannot be decided which of the assignments leads to an executable solution for the test aim. Even with these new requirements for the fitness calculation, it is possible to reuse the idea of approximation levels. Assigning the right levels to the nodes in the control flow graph, together with some additional calculation rules, results in fitness values that meet the requirements.
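The classification step can be sketched as follows (a toy model of ours; the real lists come from the use-definition analysis of the code, and the function name is an assumption):

```python
# Classify the flag assignments found by static analysis into
# include and exclude lists for a test aim that needs a given
# flag value. Each assignment is (line_number, value_assigned).

def build_lists(assignments, required_value):
    include = [line for line, v in assignments if v == required_value]
    exclude = [line for line, v in assignments if v != required_value]
    return include, exclude

# Example 3: lines 2 and 3 assign true, line 5 assigns false;
# the test aim requires flag == true. (The initialization in
# line 1 is always executed and is therefore not listed.)
inc, exc = build_lists([(2, True), (3, True), (5, False)], True)
print(inc, exc)  # [2, 3] [5]
```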
To enable a fitness calculation for the suggested improvements, additional branches are defined as 'decisive' (their execution has an effect on the fitness calculation). These are branches avoiding the execution of a desired flag assignment (Fig. 7, highlighted branches of nodes 2 and 3) or branches leading to an undesired flag assignment (highlighted branch of node 5). If a path traverses one of these identified branches, the fitness has to be calculated because of the depending flag assignments.
Fig. 7. Decisive branches assigned with approximation levels
The introduction of additional decisive branches makes new fitness calculation rules necessary, because some execution paths may run through several decisive branches. This cannot happen in the original ET.

DEFINITION 1: FITNESS CALCULATION RULES
1. The first decisive branch temporarily determines the fitness value; the fitness can only be changed again when rules 2 and 3 are enabled.
2. Within loops, the best level over all iterations establishes the fitness value.
3. Whenever a desired flag assignment is executed, it fixes the temporary fitness value; only rule 4 can invalidate this fitness.
4. Whenever an undesired flag assignment is executed, it reactivates rule 1 and resets the fitness calculation.

Some examples illustrating the use of these rules follow (all referring to Example 3). If a test case misses the assignment of line 2 (rule 1 is enabled) and then executes the assignment of line 3, it achieves a good fitness because of rule 3 (level 2). However, if the assignment of line 5 is executed afterwards, rule 4 is activated and the fitness is set back to a poor value (level 3). A path that does not execute any decisive branch (highlighted in Fig. 7) and does not fulfill the condition of node 7 is assigned a fitness value of level 2. Last but not least, a test reaching line 8 without having executed any decisive branch before meets the test aim. It is described next how the include and exclude lists change the fitness function and how this affects the search process; Example 3 is used for the descriptions.
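As a worked illustration, the rules can be sketched as a small event-driven calculation. This is our own simplified model: local distances and rule 2's per-iteration handling are omitted, and the event encoding is an assumption, not the ET implementation:

```python
# Simplified sketch of the Definition 1 rules. Events along an
# execution path:
#   ('avoid_desired', lvl)  decisive branch avoiding a desired assignment
#   ('undesired', lvl)      undesired assignment executed; lvl is the
#                           level of the decisive branch leading to it
#   ('desired',)            desired flag assignment executed
#   ('aim_dep', lvl)        decisive branch of the test aim's dependencies

def path_fitness_level(events):
    level = None         # temporary fitness level (higher = worse)
    have_desired = False
    for e in events:
        kind = e[0]
        if kind == 'undesired':      # rule 4: reset; its branch sets the level
            level, have_desired = e[1], False
        elif kind == 'desired':      # rule 3: fix progress, drop the poor level
            level, have_desired = None, True
        elif kind == 'avoid_desired' and not have_desired and level is None:
            level = e[1]             # rule 1: first decisive branch counts
        elif kind == 'aim_dep' and level is None:
            level = e[1]
    return level

# Example 3 scenarios: missing line 2 but executing line 3 and failing
# at node 7 gives level 2; executing line 5 afterwards resets to level 3.
print(path_fitness_level([('avoid_desired', 3), ('desired',), ('aim_dep', 2)]))    # 2
print(path_fitness_level([('avoid_desired', 3), ('desired',), ('undesired', 3)]))  # 3
```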
The original ET searches for data fulfilling the condition "c==0" and "flag==true". This is because of the control dependency created by the if-statements of lines 7 and 8. As described previously, this fitness function results in a random search.

The suggested improvement searches for a solution of the following term: "(a==0 || b==0) and not (z>y) and (c==0 && flag==true)". The first part is created by the include-assignments of the example (lines 2 and 3), the second part comes from the exclude-assignment of line 5, and the last part is created by the control dependencies of the test aim. The new fitness function only allows solutions with "(a==0 || b==0) and not (z>y)" in the first instance, which automatically fulfill the flag condition.
The new fitness function targets the flag assignments, in this way enabling the evolutionary test to find a highly covering test data set for test objects that use flags. The authors show next that even flag uses within loops and expressions in flag assignments are no longer a problem with this approach.

3.4 Flag Assignments within Loops
As mentioned previously, applications can also assign flag variables within loops. ET generally has no problems with loops appearing in the source code of the test object: the method can search for test data fulfilling test aims outside and inside any kind of loop. For this reason the new approach also works if the flag assignment is placed within a loop. The short Example 4 demonstrates this case: the flag assignment in line 3 is placed within a while-loop and the flag use is in line 4.

Example 4: Flag assignment within a loop
1: flag = false;
2: while (i<10)
3:   if (a[i]==0) flag = true;
4: if (flag)
5:   /* test aim */
In this situation, an evolutionary search for a solution of the test aim in line 5 requires an improved fitness function because of the flag condition in line 4. The calculation of the include- and exclude-assignments returns just one include-assignment, in line 3 (line 1 is an initialization that is always executed). ET has no problem guiding the search to a test datum that executes the assignment in line 3. The suggested improvement searches for a solution of the following term: "{in-any-iteration a[i]==0}" and "flag==true". The operator "in-any-iteration" is created by the include-assignment placed within the loop; the assignment needs to be executed only once.
This new fitness function improves the evolutionary test. However, implementing the new fitness function via approximation levels does not solve the problem of undesired flag assignments appearing in loops, because the monitored information is used incompletely. With undesired assignments inside loops it is necessary to define a fitness function which numerically evaluates the "in-all-iterations" operator; in Example 4 this is the fulfillment of the condition "{in-all-iterations not a[i]==0}". An implementation using approximation levels with the rules of Definition 1, however, results in a fitness calculated on the basis of the transformed term "not {in-any-iteration a[i]==0}". This is logically equivalent to the first, but results in a fitness function calculated on the basis of only one iteration fulfilling the condition "a[i] == 0". Analyzing the ET results has shown that this leads to a random search: the fitness values do not take into account that a test datum with just one iteration meeting the condition "a[i] equals zero" is better than a test datum executing more iterations meeting this condition. A different implementation is needed here.
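The distinction between the two loop operators can be sketched numerically (our own illustrative cost encoding, not the ET implementation; the distance of "a[i]==0" is taken as |a[i]|):

```python
# Branch-distance style costs for a flag assignment inside a loop,
# evaluated over the values a[i] seen in the iterations.

def cost_in_any_iteration(values):
    """Desired assignment in a loop: one iteration fulfilling
    a[i]==0 suffices, so the best (smallest) distance counts."""
    return min(abs(v) for v in values)

def cost_in_all_iterations(values):
    """Undesired assignment in a loop: every iteration fulfilling
    a[i]==0 is harmful, so each such iteration adds to the cost.
    A test datum with fewer fulfilling iterations scores better."""
    return sum(1 for v in values if v == 0)

print(cost_in_any_iteration([3, 1, 4]))   # 1: closest iteration to a[i]==0
print(cost_in_all_iterations([0, 2, 0]))  # 2: two iterations execute the assignment
```

The second function is graded in exactly the way the paragraph above calls for: a test datum with one fulfilling iteration is better than one with two.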
Using the new approach, ET can also be improved in the case of loops appearing in the code. The new fitness function guides the search to execute desired flag assignments appearing within any kind of loop statement, even if the loop is created by 'goto' statements. Only the appearance of undesired assignments within loops still causes the flag problem for our implementation.

3.5 Boolean Expressions Assigned to Flags
The previous subsections have shown how to improve the fitness function if the flag is directly assigned just one value. Real world applications often assign expressions to flags which are evaluated at runtime, and static analysis cannot determine which value these assignments return. Nevertheless, the approach can also handle this very common case. Example 5 shows function code containing the assignment of an expression in line 1 and the use of the flag in line 3.

Example 5: Flag assigned by a Boolean expression
1: flag = (a==0) || (b>0 && b<5);
2: ...
3: if (flag) /* test aim */
Running the original ET approach performs poorly here, with a low coverage; there is only a slim chance that it will find a solution. An ET in which the Boolean expression is inlined at line 3 works without any problems, but unfortunately that transformation is sometimes a difficult task (more details in [6]). A simpler transformation can help obtain a test object which can be handled with the approach explained in the previous subsections: the assignment expression is changed into an if-then-else construct which behaves completely equivalently to the original code. The transformed version is shown in Example 6; this code has just two direct value assignments to the flag variable.

Example 6: Transformed expression
1: if ( (a == 0) || (b > 0 && b < 5) ) flag = true; else flag = false;
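Rather than transforming the code, the assignment expression can also be instrumented with comparison functions that return the truth value but record a branch distance as a side effect. The sketch below is our own illustration of that idea; the function names and the integer distance encoding (0 when the comparison holds, offset 1 for strict comparisons) are assumptions:

```python
# Illustrative instrumentation of "flag = (a==0) || (b>0 && b<5)".
# Each dist_* function behaves like the comparison it replaces but
# appends a branch distance for the fitness calculation.

distances = []

def dist_equal(x, y):
    distances.append(abs(x - y))
    return x == y

def dist_greater(x, y):
    distances.append(0 if x > y else y - x + 1)
    return x > y

def dist_less(x, y):
    distances.append(0 if x < y else x - y + 1)
    return x < y

def assign_flag(a, b):
    distances.clear()
    # instrumented version of the original assignment
    return dist_equal(a, 0) or (dist_greater(b, 0) and dist_less(b, 5))

print(assign_flag(7, -2), distances)  # False [7, 3]
```

The recorded distances give the search a gradient toward inputs that make the expression, and hence the flag, true.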
Applying the new approach does not actually require a transformation as in Example 6. Simple function calls inserted into the Boolean expression obtain the information necessary for the fitness calculation, e.g. flag = DistEqual(a,0) || DistGreater(b,0) && DistLess(b,5). A fitness function using the data collected during the execution of the instrumented expression guides the search to a solution assigning the required flag value. The approach in [5] describes a similar idea for guiding the search by data dependencies: all paths containing assignments relevant for the flag condition under test are generated and optimized in sequence. Our approach, however, uses a single optimization considering all these paths simultaneously. [5] can solve the previously described flag problems but cannot guide the search in the case of Boolean expressions. Up until now, a general approach for guiding the search in the case of just one flag variable has been presented. The next subsection extends the applicability to a wider
range of test objects. The use of more than one flag occurs in many generated code examples; however, it becomes more difficult to guide the search in such cases.

3.6 Using More than One Flag Variable
The idea is now extended to test problems with more than one flag variable in the condition that needs to be fulfilled. Example 7 shows an instance of this case.

Example 7: Source code showing the use of more than one flag variable
1: if (a==0) initialized = true;       /* execute this */
2: if (b==0) has_been_fired = true;    /* and execute this */
3: ...
4: if (initialized) {
5:   if (b==3) has_been_fired = true;  /* or execute this */
6:   ...
7:   if (has_been_fired)
8:     /* test aim */
9: }
The idea of this example is that the test aim is only executed if the two flags are assigned in the right way before the test aim is reached. In real world applications, any combination of flag assignments within nested conditions and loops is conceivable. The fitness evaluation has to check in parallel, for each flag appearing in the code, whether or not it is assigned in the desired way; this is why the original approach using approximation levels does not work properly. With the use of approximation levels and local distances, the instrumentation can decide how close the test datum is to a flag assignment. This value (the temporary flag fitness) can be stored together with the flag value at runtime. If the flag is assigned as necessary for the test aim, this information is not used; if it is not, the temporary flag fitness is used to calculate the overall fitness of the test datum. The fitness calculation is prepared by checking the control dependencies of all flag uses, estimating the exclude- and include-assignment lists for each identified flag, calculating the decisive branches on the basis of the include and exclude lists, and assigning the flag-approximation-levels to the branches independently for each flag. The assigned flag-approximation levels are used to calculate the temporary flag fitness, which forms the basis of the improved fitness function for multiple flag uses.

DEFINITION 2: FITNESS CALCULATION RULES FOR MULTIPLE FLAGS
1. The temporary flag fitness is estimated, as defined by the rules of Definition 1, for each flag appearing in the code.
2. Upon reaching a control-dependent condition containing flags, it is checked whether or not the flags are assigned in the desired way. If the flags are assigned correctly, the usual fitness calculation proceeds; whenever one or more flags are not assigned correctly, the fitness is calculated using the corresponding temporary flag fitness.

Annotation: in the case of multiple flags occurring in a condition, we suggest calculating an overall fitness using all temporary flag fitness values (see [2] for the reasons for parallel condition optimization).
The implementation of this approach is quite complex, and the calculation of the approximation levels requires some extra processing time. However, when the original approach is applied to examples using multiple flags, the coverage reached with the original ET is worse. A paper related to this idea is [3]. A complete solution needs further research.

3.7 Real World Example
The authors have tested the approach on a real world application extracted from the software of a car control unit. The function under test regulates the internal states of the energy control of a small sub-function within the body and comfort electronics of a car. The function has 100 LOC and an if-then nesting level of 5; it has 4 input parameters and internally uses two flags. For testing purposes, the initial state and the input situation were generated. The original approach did not find a solution for the condition statements using the two flags: within a test of about 320,000 individuals it reached a branch coverage of about 90%. This happened with the standard settings of the ET system, using 300 individuals in 6 populations with competition enabled; a maximum of 200 generations per test aim was allowed. The authors ran the same test with a bigger population (700 individuals) and a later stopping criterion (max. 400 generations), but no improvement was noticeable. Using the new approach, no problems were encountered. One of the flags was assigned by an expression which had a very low probability of evaluating to 'true'; this was the reason why even larger tests did not perform better using the original approach. The second flag was assigned within a nested if-then structure and tested later at a different nesting level. In the original ET approach there was only a very low chance of executing the flag assignment and the condition within one execution, and the fitness function did not guide the search to this solution. The new approach used the standard EA settings and covered the test object fully after 60 generations and 14,500 tested individuals. This is about 5 percent of the workload of the original approach and leads to 100% branch coverage.
4 Conclusion
Evolutionary Testing uses metaheuristic search methods to automate aspects of software testing. The occurrence of flag variables has been pointed out to be problematic because of the poor guidance of the search at conditions containing flags. The authors introduce a new fitness function which improves evolutionary structural testing in the case of flag conditions. This function uses additional information on the flag assignments occurring in the test object. The solution is explained using short code examples extracted from real world applications. The introduced improvements can not only solve the problem of flag variables; the authors argue that the testing of sources containing enumeration type conditions, also known to be problematic, can be improved using the introduced approach. By using a fitness function that guides the search to variable assignments as well as variable uses, it has been shown that the flag problem can be solved. We believe
that future research on sequence testing can reuse this idea, because in sequence testing state variables are assigned and used in different areas of a program, and sometimes also in different steps of a sequence. A fitness function guiding the search to the execution of state variable assignments and to the condition testing the state variable will perform better than the original ET.
References
[1] Appel, A. W.: Modern Compiler Implementation in C. Cambridge University Press, Cambridge, New York, 1998
[2] Baresel, A., Sthamer, H., Schmidt, M.: Fitness Function Design to Improve Evolutionary Structural Testing. Proceedings of GECCO, New York, USA, July 2002
[3] Bottaci, L.: Instrumenting Programs with Flag Variables for Test Data Search by Genetic Algorithms. GECCO 2002: Proceedings of the Genetic and Evolutionary Computation Conference, pp. 1337-1342, July 2002
[4] Buhr, K.: Complexity Measures for the Assessment of Evolutionary Testability (German version only). Diploma Thesis, Technical University Clausthal, 2001
[5] Ferguson, R., Korel, B.: The Chaining Approach for Software Test Data Generation. Transactions on Software Engineering and Methodology, vol. 5, no. 1, pp. 63-86, 1996
[6] Harman, M., Hu, L., Hierons, R., Baresel, A., Sthamer, H.: Improving Evolutionary Testing by Flag Removal. Proceedings of GECCO, New York, USA, 9-13 July 2002
[7] http://www.mathworks.com/
[8] Harman, Hu, Hierons, Fox, Danicic, Baresel, Sthamer, Wegener: Evolutionary Testing Supported by Slicing and Transformation. 18th IEEE International Conference on Software Maintenance (ICSM 2002), 3-6 Oct. 2002, Montreal, Canada, p. 285
[9] Jones, B. F., Sthamer, H.-H., Eyres, D.: Automatic Structural Testing Using Genetic Algorithms. Software Engineering Journal, vol. 11, no. 5, pp. 299-306, 1996
[10] Korel, B.: Automated Test Data Generation. IEEE Transactions on Software Engineering, vol. 16, no. 8, pp. 870-879, August 1990
[11] Michael, C. C., McGraw, G., Schatz, M. A.: Generating Software Test Data by Evolution. IEEE Transactions on Software Engineering, vol. 27, no. 12, pp. 1085-1110, Dec. 2001
[12] Pohlheim, H.: GEATbx - Genetic and Evolutionary Algorithm Toolbox for Matlab. http://www.geatbx.com/, 1994-2001
[13] Schaeffer: A Mathematical Theory of Global Program Optimization. Prentice-Hall Inc., 1973
[14] Tracey, N., Clark, J., Mander, K., McDermid, J.: An Automated Framework for Structural Test-Data Generation. Proceedings of the 13th IEEE Conference on Automated Software Engineering, 1998
[15] Wegener, J., Sthamer, H., Baresel, A.: Application Fields for Evolutionary Testing. EuroSTAR 2001, Stockholm, Sweden, November 2001
[16] Wegener, J., Sthamer, H., Baresel, A.: Evolutionary Test Environment for Automatic Structural Testing. Special Issue of Information and Software Technology, vol. 43, pp. 851-854, 2001
Predicate Expression Cost Functions to Guide Evolutionary Search for Test Data

Leonardo Bottaci
Hull University, Hull, HU6 7RX, UK
[email protected]
Abstract. Several researchers are using evolutionary search methods to search for test data with which to test a program. The fitness or cost function depends on the test goal but almost invariably an important component of the cost function is an estimate of the cost of satisfying a predicate expression as might occur in branches, exception conditions, etc. This paper reviews the commonly used cost functions and points out some deficiencies. Alternative cost functions are proposed to overcome these deficiencies. The evidence from an experiment is that they are more reliable.
1 Introduction
Several researchers are using evolutionary search methods to search for test data with which to test a program. The fitness or cost function depends on the test goal, but almost invariably an important component of the cost function is an estimate of the cost of satisfying a predicate expression, as might occur in a branch condition, an exception condition, etc. As an example of the most basic instrumentation, consider a search for test data that will execute a given sequence of branches in a program. A record may be kept of the values of all branch predicate expressions executed, and a cost for the given input is computed by counting the number of branches that have not been satisfied. This is the method used by Pargas et al. [5]. Although a count of undesired branch decisions provides some guidance to the search, all the test cases that fail to satisfy the same branch are given the same cost. At this point the cost surface over the search space has become flat and the search becomes random. As an example, consider the program fragment below.

    ...
    if (a <= b)
        ...    // EXECUTION REQUIRED TO ENTER THIS BRANCH
E. Cantu-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 2455-2464, 2003. (c) Springer-Verlag Berlin Heidelberg 2003

Suppose a test case is required to cause execution of the true branch of the conditional shown above. If the required branch is difficult to enter, many test cases will cause a <= b to be false. To discriminate between these tests, the program is instrumented to calculate a cost measure that penalises those tests
that may be considered to be "far from" satisfying a <= b. As an example of a possible cost function for the condition a <= b, consider the value a - b: it increases as a becomes larger than b, and a zero or negative cost indicates that a solution has been found. Through instrumentation, the subject program has in effect been converted into another program that computes a function that is to be minimised to zero. This approach has been used by Korel [2], Tracey et al. [7], [8], Wegener et al. [9], Jones [1] and Michael [4], [3]. Clearly, the effectiveness of the search depends on the reliability of the cost functions used for the relational and logical expressions. This paper reviews the commonly used cost functions and points out some cases where they are deficient. Alternative cost functions are proposed to overcome these deficiencies, and a small experiment provides some evidence that the alternative cost functions are more reliable, more so for relatively simple programs.
2 Cost Functions for Relational Predicates
Table 1 shows the commonly used cost functions for the relational predicates; a and b are numbers and ε is a positive number. Since logically equivalent expressions express the same condition, ideally they should have equal costs. In purely integer domains, a + 1 ≤ b ⇔ a < b. Since cost(a + 1 ≤ b) = a + 1 − b and cost(a < b) = a + ε − b, this entails ε = 1. In real domains, ε is the smallest positive real. It should be emphasised that a − b as the cost of satisfying a ≤ b is at best a heuristic. Usually, the input test case determines the values of the operands a, b only indirectly, and any intervening statements have the potential to produce a cost surface, as a function of the inputs, that is far more complex than a − b. Nonetheless, many arithmetic operations do not destroy the reliability of the heuristic and it remains effective where inputs are modified in the manner illustrated in the example below.

a := a * a - k;
if (a <= 0)

It is, of course, possible for the heuristic to be deceptive. Consider the program fragment

a := (a * a + 1) mod 65;
if (a <= 0)

Table 1. Relational predicate cost table using conventional cost functions

Predicate expression    Cost of not satisfying predicate expression
a ≤ b                   a − b
a < b                   a − b + ε
Predicate Expression Cost Functions
In this fragment, the cost of a <= 0 decreases as a (the input) approaches 0. At a = 0 the cost function attains a local minimum value of 1, but the solution is a = 8. It is also possible for the heuristic to be completely uninformative, as in

a := random(a);
if (a <= 0)
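The deceptive surface can be checked numerically; a small sketch (the function name is ours):

```python
def deceptive_cost(a):
    # Cost of satisfying 'a <= 0' after  a := (a * a + 1) mod 65;
    # the transformed value is always non-negative, so it is itself the cost.
    return (a * a + 1) % 65

# survey the cost surface around the deceptive local minimum at a = 0
surface = {a: deceptive_cost(a) for a in range(-3, 11)}
```

The surface has a local minimum of 1 at a = 0 (both neighbours cost 2), while the true solution a = 8 has cost 0.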
3 Cost Functions for Logical Operators
Cost functions may be defined for the logical operators not, or and and in order to define a cost for compound predicate expressions such as a <= b and not(a > 0). Consider first the logical negation operator. Some researchers avoid an explicit cost function for logical negation and instead rewrite expressions that contain negation into a form that is negation free. So, for example, not (a <= b) is rewritten as a > b. Introducing a cost function for logical negation avoids the need to rewrite expressions; indeed, by introducing a cost function for each relational and logical predicate, cost functions may be isomorphic to predicate expressions, which allows cost functions to be constructed in a simple syntax-directed manner. This is an important consideration when building tools. The cost function for negation can be derived from the requirement that cost(a ≤ b) = cost(¬(a > b)), i.e. a − b should equal −(b − a + ε) + ε, and hence cost(¬a) = −cost(a) + ε. The use of logical negation also introduces the need for a cost to be assigned to a true predicate expression. In particular, the logical constant true may be given a cost of −maxcost and the logical constant false a cost of maxcost. The use of these large absolute values reflects the fact that the logical constant false can never be satisfied, and true can never be falsified. In this scheme, all cost functions are bounded by −maxcost and maxcost. Consider next the cost function for a disjunction. Clearly, when the operands have different truth values, the cost of the disjunction should be the cost of the true operand. This leaves the cases where the operands have the same truth value. In this case, a popular choice for the cost of a disjunction is the cost of the operand with the lowest cost, i.e. the cost function is the min function. The common corresponding cost function for the conjunction is the max function.
The cost tables for logical negation, or and and are summarised in Table 2, where ca is the cost of a boolean expression a. In Table 2, ca and cb denote positive costs (the expression is false) and c̄a and c̄b denote non-positive costs (the expression is true). The cost functions of Table 2 have been used by a number of researchers. Tracey et al. [7] use essentially the same cost functions, although theirs are restricted to non-negative values, they measure only the cost of not satisfying an expression, and negation is removed by rewriting expressions. A notable difference, however, is the use of + rather than max for conjunction, so that cost(a ∧ b) = cost(a) + cost(b).
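As a concrete reading of these definitions, here is a minimal Python sketch of the conventional scheme (ε = 1 for integer domains; the helper names are ours):

```python
EPS = 1  # epsilon for integer domains

def leq(a, b): return a - b          # cost of a <= b
def lt(a, b):  return a - b + EPS    # cost of a <  b
def NOT(c):    return -c + EPS       # cost(not a) = -cost(a) + epsilon
def OR(ca, cb):  return min(ca, cb)  # min-max scheme: disjunction
def AND(ca, cb): return max(ca, cb)  # min-max scheme: conjunction
```

The negation rule reproduces the equivalence used in its derivation: NOT(lt(b, a)) equals leq(a, b) for any integers a, b, and when the operands of a disjunction have different truth values, min automatically returns the (non-positive) cost of the true operand.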
Table 2. Logical operator cost table using conventional cost functions

a     b     ¬a         a ∨ b           a ∧ b
ca    cb    −ca + ε    min(ca, cb)     max(ca, cb)
ca    c̄b    −ca + ε    c̄b              ca
c̄a    cb    −c̄a + ε    c̄a              cb
c̄a    c̄b    −c̄a + ε    min(c̄a, c̄b)     max(c̄a, c̄b)

4 Cost Function Reliability
4.1 Analytical Considerations
The functions min and max have become popular for various forms of logical reasoning under uncertainty [6] since Zadeh proposed them for fuzzy logic [11]. In fuzzy logic, true is represented by 1 and false by 0, and intermediate values lie in between. The truth value of a fuzzy disjunction is the maximum of the truth values of the operands; the value of a fuzzy conjunction is the minimum of the operand truth values. The common use of the functions min and max¹ should not obscure the fact that the intended interpretation in test data search is quite different. Fuzzy logic is concerned with formalising reasoning with vague concepts, where the vagueness is formalised as a fuzzy set. The justification for the use of min and max in fuzzy logic comes from the corresponding union and intersection operations for fuzzy sets. The costs associated with logical expressions to guide search are not intended to measure vagueness. Given that the costs of predicate expressions are intended to estimate search effort, some properties which could be considered essential are:

1. The cost of a disjunction should be no more than the cost of either disjunct, i.e. cost(a) ≥ cost(a ∨ b) and cost(b) ≥ cost(a ∨ b).
2. The cost of a conjunction should be no less than the cost of either conjunct, i.e. cost(a) ≤ cost(a ∧ b) and cost(b) ≤ cost(a ∧ b).
3. The cost of logically equivalent expressions should be equal.

Property 1 can be justified on the grounds that adding an alternative means of satisfying a condition cannot make that condition more difficult to satisfy and so cannot increase the cost. The min function satisfies property 1 but makes the assumption that the cost of a disjunction is the maximum cost consistent with property 1. The argument for property 2 is analogous to that for the disjunction. The max function satisfies property 2 but makes the assumption that the cost of a conjunction is the lowest cost consistent with property 2.
Property 3 requires the cost functions to be consistent with the associative, commutative and distributive laws. Examples of other laws include: cost(a) should equal cost(a ∨ a), and cost(a ∨ b) should equal cost(¬(¬a ∧ ¬b)). The cost functions of Table 2 satisfy all three of these properties. Although all three properties would appear necessary, it might be advantageous to trade some violation of the third property for a more reliable or informative function. Recall the use by Tracey [7] of + instead of max as the cost of a conjunction. To use + instead of max for the cost of a conjunction, while retaining min for the cost of a disjunction, is to give up the property that logically equivalent expressions have the same cost. In particular, the distributive law of disjunction over conjunction is not satisfied, because cost(a ∨ (b ∧ c)) = min(a, b + c)² but cost((a ∨ b) ∧ (a ∨ c)) = min(a, b) + min(a, c). De Morgan's law is not satisfied either. A possible consequence of this is illustrated in the code fragment below, where two logically equivalent expressions appear in distinct subgoals. Assume that the test goal is to find a test that will execute either of the statements z := 0;, i.e. to satisfy either (x < 4 or y < 4) or not (x >= 4 and y >= 4).

if (x + y >= 16)
   if (x < 4 or y < 4)
      z := 0;
   else if (not (x >= 4 and y >= 4))
      z := 0;

To exaggerate the inconsistency, and also for the sake of clarity, variables are taken to be integer so that ε = 1. When x = 8, y = 8, the cost is min(5, 5) = 5, but when x = 7, y = 8, which is a better test, the cost is higher at −(−3 + −4) + 1 = 8. The possible bias this might introduce in examples such as that shown above might well be compensated for by a more general reliability advantage of + over max. + rewards a decrease in the costs of both conjuncts more than it rewards a corresponding decrease in the cost of just one. For example, consider the problem of satisfying the condition x = 0 and y = 0. A move by a search algorithm from the (x, y) point (4, 6) to the point (3, 5) is rewarded by + in a cost decrease of 2, but max produces only a cost decrease of 1.

¹ In fuzzy logic truth increases with numerical value, so min corresponds to max and vice versa.
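The two cost values in this example can be reproduced with a few lines of Python (ε = 1; our helper names):

```python
EPS = 1  # integer domains, as in the example

def lt(a, b):  return a - b + EPS    # cost of a < b
def leq(a, b): return a - b          # cost of a <= b
def NOT(c):    return -c + EPS

# subgoal 1 at x = 8, y = 8:  x < 4 or y < 4, with min for disjunction
cost1 = min(lt(8, 4), lt(8, 4))
# subgoal 2 at x = 7, y = 8:  not (x >= 4 and y >= 4), with + for conjunction
cost2 = NOT(leq(4, 7) + leq(4, 8))
```

cost1 evaluates to 5 and cost2 to 8: the better test (7, 8) receives the higher cost, illustrating the inconsistency between the two logically equivalent subgoals.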
Similarly, a detrimental move from (4, 6) to (5, 7) is penalised more heavily by + than by max. The function + would seem to be more discriminating. Continuing in this direction, the use of min as the cost function for disjunction may be reconsidered, in the hope of finding some different function, analogous to +, that is more discriminating than min. Consider, for example, the following program fragment.

x := 1;
while (x <= 0 or y = 5)
// EXECUTION REQUIRED TO ENTER LOOP
² Note that the notation of ca as the numeric cost of the boolean expression a is dropped from here on. A symbol a may denote either a boolean expression or its numeric cost, as determined by the context.
Table 3. As y approaches 5, the cost decreases towards 0

y                  9     8     7     6     5
x <= 0             1     1     1     1     1
y = 5              4     3     2     1     0
x <= 0 or y = 5    0.8   0.75  0.67  0.5   0
Table 4. Proposed logical or and logical and cost table

a     b     a ∨ b           a ∧ b
a     b     ab/(a + b)      a + b
a     b̄     b̄               a
ā     b     ā               b
ā     b̄     ā + b̄           āb̄/(ā + b̄)
When searching for values for the variables x and y to enter the while loop, the value of x is 1 when the conditional is first evaluated, and so the cost of x <= 0 is 1; this will in fact be the cost of the predicate expression for all input values unless y = 5. A flat surface in the cost function provides no guidance to the search. One way to interpret this problem is to consider that the cost of 1 for x <= 0 should, initially at least, be much higher, given the impossibility of changing the value of x from this value until entry into the loop. Determining such facts about the values of arbitrary variables in a program is of course just as hard a problem as that of finding test data. Since the min function ignores improvement in the cost of the more costly operand, a more discriminating function might be constructed that takes account of a cost improvement in either operand. Such an alternative is the ratio of the product of the costs to the sum of the costs, i.e.

cost(a ∨ b) = ab / (a + b).
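Table 3's figures follow directly from this definition; a quick check in Python, taking the cost of y = 5 to be |y − 5| (the function name is ours):

```python
def ratio_or(a, b):
    # Proposed disjunction cost for two false operands (both costs positive,
    # or one of them zero exactly at the solution).
    return a * b / (a + b)

cost_x = 1  # cost of 'x <= 0' with x fixed at 1
row = {y: ratio_or(cost_x, abs(y - 5)) for y in (9, 8, 7, 6, 5)}
```

row reproduces (up to rounding) the last line of Table 3: 0.8, 0.75, 0.67, 0.5 and 0 as y approaches 5, so the search is rewarded for every step y takes towards 5.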
Table 3 shows how this cost function solves the problem in the previous example. The proposed cost functions for or and and are shown in Table 4. Note that when a = b = 0 then cost(a ∧ b) = 0. The above cost functions satisfy properties 1 and 2 but do not satisfy the property of equal costs for logically equivalent expressions. The error in satisfying De Morgan's law, i.e. in achieving cost(a ∨ b) = cost(¬(¬a ∧ ¬b)), is due to the presence of ε, the positive offset from zero for the costs of all false predicates, which is absent from the costs of true expressions. This anomaly can be removed by modifying the relational predicate cost functions to produce values symmetrical about zero in the range [−maxcost, −ε] ∪ [ε, maxcost] (as shown in Table 5) and by defining the cost of logical negation as cost(¬p) = −cost(p). Other inconsistencies remain, however. For example, cost(a) ≠ cost(a ∨ a). The difference between cost(a) and cost(a ∨ a ∨ a) is greater still, with the difference bounded by cost(a). The difference between cost(a) and cost(a ∧ a) is cost(a) and is unbounded as the number of conjunctions of a increases. In practice, such expressions are likely to be relatively rare because programmers tend to avoid writing expressions that are clearly inefficient.

Table 5. Relational predicates with costs symmetrical about zero

Predicate expression    Cost of predicate expression
a ≤ b                   a − b, if a > b
                        a − b − ε, if a ≤ b

One of the problems with the cost function ab/(a + b) is that when one operand is very small, changes in the other are not significant. When the cost of b is ε, then ab/(a + b) = aε/(a + ε), which because of rounding error can only safely be taken to be ε. As an example of this problem, consider the condition not((i <= 9.0) and (x = 0.0)), where x and i are real and so ε is the smallest positive real. Table 6 shows the cost calculations for different values of i when x = 0.0. The cost remains the same at ε for all values of i up to 9 and so provides no guidance for the search.

[Table 6: cost calculations for i = 9.0, 10.0, … when x = 0.0]

To assign a cost of ε (the least positive value in the relevant number domain and the lowest possible cost for a false predicate expression) to a single relational expression such as a < b leaves no room to give a lower cost to disjunctive expressions that include a < b as one of the disjuncts. Recall the requirement that cost(a) ≥ cost(a ∨ b). Cost functions for relational predicates can be modified to overcome this problem by setting the absolute minimum cost for any single relational predicate expression to be some value significantly larger than ε, say R ≥ 1. The costs of relational predicates in this scheme are given in Table 7. In Table 8 the cost calculations of the previous example are repeated, but with the use of the modified relational predicate cost functions with R = 1. The cost of not(i <= 9.0 and x = 0.0) now decreases as i approaches 9.0.

4.2 Experiment to Compare Cost Function Reliability
Ultimately, the validity of cost functions for test data search is an empirical issue. Ideally, cost functions should be compared over a large sample of programs from
Table 7. Modified relational predicate cost functions

Predicate expression    Cost of predicate expression
a ≤ b                   a − b + R − ε, if a > b
                        a − b − R, if a ≤ b
Table 8. Cost calculations of Table 6 repeated with the modified relational predicate cost functions (R = 1)

i                            6.0    7.0    8.0    9.0    10.0
i <= 9.0                     −4.0   −3.0   −2.0   −1.0   2.0
x = 0.0                      −1.0   −1.0   −1.0   −1.0   −1.0
i <= 9.0 and x = 0.0         −0.8   −0.75  −0.67  −0.5   2.0
not(i <= 9.0 and x = 0.0)    0.8    0.75   0.67   0.5    −2.0
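Table 8 can be reproduced with the modified relational costs and the ratio-sum operators; a sketch in Python, assuming real-valued operands with ε taken as negligible (the function names are ours):

```python
R = 1.0  # minimum cost magnitude for a single relational predicate

def leq_cost(a, b):
    # Modified cost of 'a <= b' (epsilon dropped as negligible for reals):
    # negative (true) or positive (false), with |cost| >= R.
    return (a - b) - R if a <= b else (a - b) + R

def eq_cost(a, b):
    return -R if a == b else abs(a - b) + R

def and_cost(ca, cb):
    if ca > 0 and cb > 0:        # both conjuncts false: sum
        return ca + cb
    if ca <= 0 and cb <= 0:      # both true: ratio of product to sum
        return ca * cb / (ca + cb)
    return max(ca, cb)           # mixed: cost of the false conjunct

def goal(i, x=0.0):
    # cost of  not(i <= 9.0 and x = 0.0)  with cost(not p) = -cost(p)
    return -and_cost(leq_cost(i, 9.0), eq_cost(x, 0.0))
```

goal(6.0), goal(9.0) and goal(10.0) give 0.8, 0.5 and −2.0, matching the corresponding columns of Table 8; with R = 1 the ratio in the both-true case can never divide by zero, since each operand has magnitude at least R.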
various application areas. A less time consuming experiment can be done with synthetic programs generated automatically. Consequently, a sample of programs was generated by modifying the following program.

x := a + b + c;
y := a + b + c;
z := a + b + c;
if (x = a or y = b or z = c)
The modification rules allowed: all occurrences of + to be replaced by any arithmetic operator; all occurrences of = to be replaced by any relational predicate; all occurrences of or to be replaced by any binary logical operator, with the possible insertion of a negation operator; in addition, two binary operators, leftoperand and rightoperand, were allowed throughout (each returning the value of one operand) as a way of reducing the length of expressions. The modification rules also allowed variables (other than x, y and z) to be replaced with any other variable or a constant drawn from a small set of integers. For each program generated at random, three copies were produced: the first was instrumented with the min and max cost functions of Table 2 (herein called the min-max functions), the second with the ab/(a + b) and a + b functions of Table 4 (herein called the ratio-sum functions), and the third with constant functions, so that a uniform random search is done. The relational predicate expression cost functions of Table 7 were used throughout, with R = 1. For all programs, the test goal was to find values of a, b and c, each from the domain [−4999, 5000], to cause entry into the conditional. The genetic algorithm used for the search was of the steady-state variety and similar to Genitor [10]. Test inputs were coded not as binary strings but as strings of integers. Reproduction takes place between two individuals who produce one or two offspring (depending on the choice of reproduction operator). These
Table 9. Performance of cost functions

                min-max         ratio-sum       random
program type    solns  evals    solns  evals    solns  evals    trivial  failed
simple          1116   504      1309   456      97     436      0        0
complex         1853   308      2059   313      789    301      16992    6237
offspring are then immediately inserted into the population (of size 50), expelling the one or two least fit. The population is kept sorted according to cost and the probability of selection for reproduction is based on rank in this ordering.

4.3 Results
For each set of cost functions, Table 9 shows the number of trials (column headed 'solns') in which a solution was found. Three attempts were made at each problem and the table shows the total number of solutions found. The column headed 'evals' shows the mean number of fitness function evaluations used per solution, counting only cases where a solution was found. A number of the generated programs were not counted against any cost function because they were either too easy to solve (column headed 'trivial'), namely a solution was found in fewer than 10 random attempts, or too difficult (column headed 'failed'), because a solution was not found within the limit on fitness function evaluations, set at 1000. A crude attempt was made to distinguish between simple and more complex programs according to the syntactic complexity of the arithmetic expressions. All programs of the following form (where only or and and may be interchanged) were classed as simple.

x := a;
y := b;
z := c;
if (x = 1 or y = 1 and z = 1)
Simple programs all have a smooth cost surface (smoothest in the case of ratio-sum) but the solution set is small. In Table 9, the set of programs classed as complex is simply the entire sample of synthetic programs generated. It can be seen that for simple programs, the performance of the ratio-sum cost functions is better; the data shows a 17% outperformance. For all programs, the ratio-sum cost functions outperform the min-max cost functions by about 10%. A possible explanation for this difference is that there is less opportunity for ratio-sum to take advantage of an improvement in more than one predicate expression cost at once, since such moves are less likely to occur as the jaggedness of the cost surface increases.
5 Conclusion
Several researchers are using evolutionary search methods to search for test data with which to test a program. The fitness or cost function depends on the test
goal, but almost invariably an important component of the cost function is an estimate of the cost of satisfying a predicate expression, as might occur in a branch condition, an exception condition, etc. It has been shown that the set of commonly used cost functions for the satisfaction of logical predicates (the min-max functions) performs poorly in certain cases. An alternative set of cost functions (the ratio-sum functions) has been proposed which overcomes these specific problem cases. To determine the effectiveness of the ratio-sum functions on a wider range of problems, an experiment was done to compare cost functions for a sample of synthetic programs. It has been shown that the ratio-sum cost functions modestly outperform the min-max cost functions, but more so for relatively simple programs. A possible explanation for this is that the ability of ratio-sum to take advantage of an improvement in more than one predicate expression cost at once can be better exploited in simple programs. Synthetic programs are an inexpensive way of subjecting cost functions to a relatively large sample of programs, but they can at best provide only an insight into the behaviour of different cost functions. In terms of assessing the reliability of different cost functions in practice, they can be no more than a prelude to an experiment with a large sample of real programs. Work is underway to do this.
References

1. B. F. Jones, H. Sthamer, and D. E. Eyres. Automatic structural testing using genetic algorithms. Software Engineering Journal, 11(5):299–306, 1996.
2. B. Korel. Automated software test data generation. IEEE Transactions on Software Engineering, 16(8):870–879, August 1990.
3. G. McGraw, C. Michael, and M. Schatz. Generating software test data by evolution. Technical Report RSTR-018-97-01, RST Corporation, Suite 250, 21515 Ridgetop Circle, Sterling VA 20166, 1998.
4. C. Michael, G. McGraw, M. Schatz, and C. Walton. Genetic algorithms for dynamic test data generation. Technical Report RSTR-003-97-11, RST Corporation, Suite 250, 21515 Ridgetop Circle, Sterling VA 20166, 1997.
5. R. P. Pargas, M. J. Harrold, and R. P. Peck. Test-data generation using genetic algorithms. Software Testing, Verification and Reliability, 9:263–282, 1999.
6. Judea Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.
7. N. Tracey, J. Clark, and K. Mander. Automated program flaw finding using simulated annealing. Software Engineering Notes, 23(2):73–81, March 1998.
8. N. Tracey, J. Clark, K. Mander, and J. McDermid. Automated test data generation for exception conditions. Software – Practice and Experience, 30:61–79, 2000.
9. J. Wegener, A. Baresel, and H. Sthamer. Evolutionary test environment for automatic structural testing. Information and Software Technology, 43:841–854, 2001.
10. D. Whitley. The GENITOR algorithm and selective pressure: why rank-based allocation of reproductive trials is best. Proceedings of the Third International Conference on GAs, pages 116–121, 1989.
11. L. A. Zadeh. Fuzzy logic and approximate reasoning. Synthese, 30:407–428, 1975.
Extracting Test Sequences from a Markov Software Usage Model by ACO

Karl Doerner¹ and Walter J. Gutjahr²

¹ Department of Management Science, University of Vienna, Bruenner Strasse 72, A-1210 Vienna, Austria
² Department of Statistics and Decision Support Systems, University of Vienna, Universitaetsstrasse 5/3, A-1010 Vienna, Austria
{Karl.Doerner,Walter.Gutjahr}@univie.ac.at

Abstract. The aim of the paper is to investigate methods for deriving a suitable set of test paths for a software system. The design and the possible uses of the software system are modelled by a Markov usage model, which reflects the operational distribution of the software system and is enriched by estimates of failure probabilities, losses in case of failure, and testing costs. Exploiting this information, we consider the tradeoff between coverage and testing costs and try to find an optimal compromise between both. For that purpose, we use a heuristic optimization procedure inspired by nature, Ant Colony Optimization, which seems to fit very well to the problem structure under consideration. A real world software system is studied to demonstrate the applicability of our approach and to obtain first experimental results.
1 Introduction
Markov usage models, as introduced by Whittaker, Poore, Walton and other authors (see [27]–[28], [26], [21], [24]), are versatile model descriptions for the use of a given software product in the application field. It is common knowledge that users do not exploit all functions described in the specification of a product with the same intensity; some features of a software application become "everyday routine", others are very rarely touched, or never. A software developer can schedule her/his resource allocations during different phases of the development cycle more cost-effectively if she/he takes account of these differences. For example, it does not make sense to perform extensive design reviews, code inspections and tests for specific usage constellations that will probably never occur in practice, at least if a failure of the product on these constellations will not cause much harm. A Markov usage model (MUM) is based on a Markov chain representation of the software usage. For putting it up, it is necessary to identify (1) states of the software, (2) possible transitions between these states (during these transitions, processing takes place), and (3) probabilities (or: relative frequencies) for the possible transitions. It is convenient to use a directed graph for the representation of the MUM; therein, the nodes are the states, the arcs are the transitions, and the transition probabilities are indicated as labels assigned to the arcs.

E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 2465–2476, 2003.
© Springer-Verlag Berlin Heidelberg 2003
Related, but not quite identical, to Markov usage models in the sense above is another class of probabilistic, graph-based processing models for software, the first of which were developed three decades ago (see Littlewood [18], [19], [20], Cheung [5]); more recent articles concerning this other type of model are Siegrist [25] or Rajgopal and Mazumdar [22]. In these models, the nodes of the graph do not represent states of the processing but components (modules etc.) of the software product, and the arcs do not represent processing steps but control transfer between the components. Insofar as they assume a Markovian (history-independent) probability structure, these models are often simply called Markov models; to distinguish them from Markov usage models in the sense defined above, we call them component-based Markov models in this paper. On a non-formal level, the main difference is that component-based Markov models already exploit "whitebox" knowledge of the software developer for putting up the model graph, whereas MUMs rather start with the "blackbox" viewpoint of the software user: the developer knows the module structure of the given program and can estimate transfer probabilities between modules (knowledge of the actual use of the program is nevertheless necessary for that); the user, on the other hand, identifies states, e.g., certain input windows, menus, intermediate or final outputs etc., and can estimate probabilities of transitions between such states in her/his application environment. The present paper is based on the MUM modelling type, but we shall enrich the MUMs by some additional information accessible only to the developer. The aim of the paper is to automatically derive a suitable set of test paths through the MUM graph, which may be elaborated in a second step to complete test cases for the software product under consideration.
Similar problems have already been investigated in the literature, mainly in the context of protocol conformance testing or user interface testing (see, e.g., Belli and Grosspietsch [2], Aho et al. [1], Csöndes et al. [6], Zhang and Cheung [29]). While these approaches strive for a satisfaction of pre-defined coverage criteria (usually: arc coverage)¹, we do not force complete coverage (or coverage on a pre-defined level) in the present paper. Instead, we consider the tradeoff between coverage and testing costs and try to find an optimal compromise between both. For determining this solution, the uncertain knowledge the developer has on possible failures and losses in case of failure needs to be quantified, and the MUM must be enriched by this information. We do not deal here with the question of how the selected test paths are extended to full test cases. We must treat this aspect as beyond the scope of the present paper. A Markov usage model intends to represent the operational distribution of program inputs. For purposes of reliability estimation, it would be desirable to use this distribution also for testing, or to test with a distribution changed in a controlled way, as suggested in [11], [12], [7] and [8]. However, as argued by
¹ An exception is [6], where, among others, the problem of maximizing coverage subject to cost restrictions is treated. Contrary to our approach, however, information on program usage does not enter into the problem formulation.
Rivers and Vouk in [23], resource limits often force the developer to give up this aim and to resort to non-operational testing. Our approach proposes a kind of compromise for such situations: while dropping the goal of making the derivation of reliability estimates from the test outcomes possible, we nevertheless try to exploit as much information as possible about the operational use of the program when deriving test cases, including information on processing steps that are "critical" with respect to failure probability or to damage in case of failure.
2 The Model
As input we take a MUM represented by a strongly connected directed graph G = (V, A) without multiple arcs, where a specific node (labeled by the number 1) is distinguished as the start node. A MUM with unique start node and unique end node, as considered in [11]–[12], can be reduced to this model by adding an arc from the end node to the start node, representing a new call of the program. The transition probabilities p(e) assigned to the arcs e ∈ A are assumed to be strictly positive. Furthermore, we suppose that to each arc e, the following estimates are assigned: the failure probability f(e), the loss l(e) caused in case of failure, and the cost c(e) for testing e. Note that since there are no multiple arcs, each arc e can be identified with the ordered pair (i, j) of its incident nodes. If e = (i, j), we shall also write p(i, j), f(i, j), l(i, j) and c(i, j) instead of p(e), f(e), l(e) and c(e), respectively. Remark. Let us emphasize that in our model, processing steps of the program are assigned to arcs and not to nodes of the graph. This convention is different from that used in component-based Markov models (see our remarks in the Introduction). For example, in our formalism, node i = 2 can refer to the state that a specific menu, from which the user is to select one of the program functions fun1, fun2 or fun3, is presented to the user. According to the actual choice by the user, the program starts processing the selected function and gives an output to the user; nodes i = 3, i = 4 and i = 5 refer to the state that the output of function fun1, fun2 and fun3, respectively, is presented to the user. In this way, the processing of functions fun1, fun2 and fun3 corresponds to arcs (2, 3), (2, 4) and (2, 5), respectively. In particular, the chosen representation implies that estimates on failure probabilities and losses in case of failure have to be assigned, as indicated above, to the arcs e ∈ A and not to the nodes i ∈ V.
Since the states in our model do not refer to processing activities, sojourn times and the like are not essential in our case. For modelling failures, we use the static model described in [12], i.e., we assume that a failing arc always fails, and that the resulting loss occurs only one time, no matter how often the arc is (or would be) traversed. Our technique can also be extended to the dynamic model of [12]. A discussion of the two models from the viewpoint of application can be found in [12]. We define a test sequence as a closed walk w = (e1, . . . , em), ek ∈ A (k = 1, . . . , m), containing the start node 1. Arcs e may occur more than once in w.
The cost for testing w = (e1, . . . , em) is

C(w) = Σ_{k=1}^{m} c(e_k).    (1)
The assumption that w is closed allows a decomposition of w into a series of single test paths from the start node 1 to an "end" node, i.e., a predecessor node of 1 (a node where a new program call is necessary to continue the test). After the test, tested arcs are considered as correct (not failing). Of course, this is a simplification, since other program inputs than those actually chosen for the test of a given arc could have caused this arc to fail. The simplification can be justified by the consideration that not testing an arc at all leaves a failure probability of a higher order of magnitude than testing it with just one specific input combination. It should be mentioned that the well-known coverage measures for whitebox testing use the same simplification. Whether an arc is failing or not is uncertain before the test. For modelling this uncertainty, we use the probabilistic approach. As an indicator variable for the stochastic event that arc e fails in case it is traversed, we introduce I_f(e) = 1 if e is failing, given that e is traversed, and I_f(e) = 0 otherwise. Evidently, the expected value of the random variable I_f(e) is just f(e). In a similar way, we introduce an indicator variable for the traversal of arc e. When defining this variable as a random variable, it seems appropriate to refer to the steady state of the operational usage of the program, i.e., the distribution of traversals of arcs as it is reached after some time of use of the program, when "starting effects" have already become insignificant. Therefore, we set I_r(e) = 1 if a random transition in steady state passes through e, and I_r(e) = 0 otherwise. The random variables I_r(e) and I_f(e) are independent by definition, since we have defined the event I_f(e) = 1 as the event that e fails, conditional on the event that e is traversed, so whether e is actually traversed or not has no influence on whether I_f(e) = 1 or not. Let Ā ⊆ A be the set of not tested arcs.
According to our assumption, only the arcs in Ā can cause losses by failure. Denoting the mathematical expectation by E, the expected loss by failure for one steady-state transition can then be calculated as

L(\bar{A}) = E\left(\sum_{e \in \bar{A}} I_r(e) \cdot I_f(e) \cdot l(e)\right) = \sum_{e \in \bar{A}} E(I_r(e) \cdot I_f(e)) \cdot l(e).

Using the independence of I_f(e) and I_r(e) and the fact that the expectation of a product of independent random variables is the product of the expectations, we further obtain

L(\bar{A}) = \sum_{e \in \bar{A}} E(I_r(e)) \cdot f(e) \cdot l(e).   (2)
Extracting Test Sequences from a Markov Software
2469

For abbreviation, let us set q(e) = q(i, j) = E(I_r(e)) for the probability that, in steady state, arc e = (i, j) is traversed. These probabilities can be computed as follows: Let x_i denote the probability with which state (node) i occurs in the steady-state distribution. Evidently, q(i, j) = x_i · p(i, j). The probabilities x_i result from the solution of the system

(P^t - I)(x_1, \ldots, x_n)^t = (0, \ldots, 0)^t,
(1, \ldots, 1)(x_1, \ldots, x_n)^t = 1   (3)
of linear equations, where P = (p(i, j)) is the matrix of transition probabilities, I is the identity matrix, and n is the number of states (nodes).

By using a sufficiently long test sequence w, it is of course possible to test all arcs, such that Ā = ∅ and hence L(Ā) = 0. However, there is a tradeoff between the achieved coverage and the testing costs. Let Ā(w) denote the set of arcs not covered by the walk w. Then we may determine the best point of this tradeoff by solving the following optimization problem:

Problem 1. Find a test sequence w such that γ L(Ā(w)) + C(w) = min.

Therein, γ is a factor indicating how highly the program developer values an average loss of one currency unit per transition in operational usage as his/her own loss. γ depends on several factors, e.g., contracts, number of customers, market reputation effects, etc. An alternative problem formulation, in which γ need not be estimated, is the following:

Problem 2. Find a test sequence w such that L(Ā(w)) = min under the constraint C(w) ≤ B.

Here, B is a fixed given budget limit for testing.

Proposition. Both Problem 1 and Problem 2 are NP-hard combinatorial optimization problems.

Proof. By choosing specific values for f(e) or l(e), the Rural Postman Problem (RPP), which is known to be NP-hard, can be reduced to Problem 1 or Problem 2. This can be seen as follows: The RPP consists in finding a shortest closed walk w in a graph G = (V, A) with the property that w contains each arc e ∈ A_0, where A_0 is a given subset of A, at least once. Now define f(e) = 1 for all e ∈ A_0 and f(e) = 0 for all e ∉ A_0, and choose γ very high. Then a solution of Problem 1 solves the given RPP. In a similar way, the RPP can be reduced to Problem 2. Therefore, Problems 1 and 2 are NP-hard as well.

The property stated in the Proposition is an indicator that for larger problem instances, heuristic solution approaches will be required.

Remark.
We want to emphasize that the problem of obtaining complete coverage, as investigated in [17] and in a series of related articles, is contained in our Problem 1 as the special case where the constant γ is very large and the values f(e) and l(e) are strictly positive, such that an optimal solution must be one where all arcs are covered.
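As an illustration of the computations in this section, the sketch below finds the steady-state distribution of a small usage chain by power iteration (rather than solving the linear system (3) directly, which is equivalent for an irreducible aperiodic chain) and then evaluates the expected loss of Equation (2) over a set of untested arcs. The transition matrix and the values of f and l in the usage note are invented for the example.

```python
def steady_state(P, iters=100_000, tol=1e-12):
    """Steady-state distribution x of a row-stochastic transition matrix P,
    computed by power iteration x <- xP until convergence."""
    n = len(P)
    x = [1.0 / n] * n
    for _ in range(iters):
        nxt = [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(x, nxt)) < tol:
            return nxt
        x = nxt
    return x


def expected_loss(P, f, l, untested):
    """Expected loss per steady-state transition, L = sum over untested arcs
    e = (i, j) of q(e) * f(e) * l(e), with q(i, j) = x_i * p(i, j)."""
    x = steady_state(P)
    return sum(x[i] * P[i][j] * f[i, j] * l[i, j] for (i, j) in untested)
```

For instance, for the three-state chain P = [[0, 1, 0], [0.5, 0, 0.5], [1, 0, 0]], steady_state returns approximately [0.4, 0.4, 0.2], and the loss of leaving an arc untested is simply its q·f·l contribution.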
K. Doerner and W.J. Gutjahr

3 Heuristic Solution by ACO
One can easily design a straightforward greedy heuristic for solving Problem 1 or 2. Nevertheless, our experiments indicate that greedy-type heuristics can produce very poor solutions on our problems (see Section 4). Instead, we apply to our problems a metaheuristic technique, Ant Colony Optimization (ACO), for which certain types of convergence behavior towards the optimal solution have been shown ([13], [14]). The basic ideas of ACO were developed by Dorigo, Maniezzo and Colorni [10] at the beginning of the 1990s; a formulation as a flexible metaheuristic applicable to the whole range of combinatorial optimization is given by Dorigo and Di Caro [9]. Compared with other possible metaheuristics such as genetic algorithms, simulated annealing or tabu search, the application of ACO to our problems has the advantage that the graph structure of the problems fits very well with the graph-oriented "philosophy" of ACO (see [13]).

Roughly speaking, we can consider an ACO implementation as a learning system with a fixed number of cooperative agents, implemented on the same or on different processors. Each agent constructs solutions for the given problem, governed (1) by randomness, (2) by a pre-evaluation of possible construction steps according to simple heuristic considerations, yielding so-called attractiveness values, and (3) by a blackboard mechanism representing the "collective memory" of the agent community. Learning takes place via this blackboard, which contains trail levels used by all agents to compute the probabilities of performing possible partial construction steps. Agents that have found the best solution so far, or the best solution generated within a certain construction period, or, in the so-called rank-based variant (see [4]), one of the r best solutions generated within the period, are allowed to "write" on the blackboard, which means that the trail levels for the partial construction steps they applied in their solution are increased.
For details of the general algorithm, we refer the reader to the literature cited above. In our computational experiments, we applied three different variants; in each of them, trail levels are assigned to the arcs of G, and a construction step is an extension of the current partial walk w^(k) = (e_1, \ldots, e_k) to the partial walk w^(k+1) = (e_1, \ldots, e_k, e_{k+1}), where e_{k+1} is selected from the set of arcs starting at the end node of e_k. (Construction begins by choosing an arc e_1 starting at the start node 1.)

In the first variant ("Standard"), the probability of an arc being appended is chosen proportional to τ(e) · η(e), where τ(e) is the trail level and η(e) is the attractiveness value; we decided to compute η(e) as q(e) · f(e) · l(e).

In the second variant ("Variant A"), the probability that a particular arc e is appended is chosen proportional to (τ(e)/(n(e) + 1)) · η(e), where n(e) counts the total number of times e has been traversed in all previous construction rounds in any walk constructed by any of the agents. Thus, arcs that have already been used frequently get a lower probability of being considered in the future, which favors diversification of the walks.

In the third variant ("Variant B"), we apply the same selection mechanism as in the first (Standard) variant, but the trail level update mechanism is changed: only an arc that lies on the best walk found so far with a minimum number of traversals in this best walk gets a trail level increment. This variant also favors diversification.
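A minimal sketch of one construction step of the Standard variant is given below: a roulette-wheel selection over the arcs leaving the current node, with probability proportional to τ(e) · η(e). All names here are illustrative, and only the selection step is shown, not the full walk construction or trail update.

```python
import random


def next_arc(out_arcs, tau, eta):
    """Pick the arc to append to the partial walk: roulette-wheel selection
    with probability proportional to tau[e] * eta[e] (Standard variant)."""
    weights = [tau[e] * eta[e] for e in out_arcs]
    total = sum(weights)
    r = random.random() * total
    acc = 0.0
    for e, w in zip(out_arcs, weights):
        acc += w
        if r <= acc:
            return e
    return out_arcs[-1]  # numerical safety net
```

Variant A would be obtained by replacing tau[e] with tau[e] / (n[e] + 1), where n[e] counts previous traversals of e across all agents.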
4 Experimental Results
In this section we apply our approach to a medium-size real-world system representing an outbound call center which is used by the Austrian Red Cross. We give a brief overview of the system and present numerical results of our analysis.

The considered MUM consists of 47 nodes and 85 arcs and reflects three main menus and submenus, including the functions for customer processing, the supervisor functions, and the functions for campaign management. To increase transparency, we have partitioned the entire MUM description into a set of 4 diagrams: the main section and three subsections. The entry node and exit node of a subsection are represented in the main section and are repeated in the subsection.
Fig. 1. Call Center Application – Main Section
The main functionality is customer processing (node 6 and the subsequent part of the graph given in Fig. 2). By means of customer processing, the operator can select a potential new or an already registered customer, and the customer's data can be seen on the screen (node 7). The system also provides further information about the customer: information on the last phone calls with the customer (node 8) and the complete chronology of interactions with him/her (node 9). As soon as the operator knows the important facts about the potential customer, he/she will, with a certain probability, proceed to perform a phone call (node 10). After the call, the operator can modify customer data, e.g., register a new e-mail address. If there are no changes, the operator can go directly to a function where he/she can store the call results (node 15).
Fig. 2. Call Center Application – Customer Processing
In the subsection Statistics, several performance statistics can be computed and printed or exported to an accounting program. In the subsection Supervisor (see Fig. 3), the three main functionalities check campaign results (node 31), distribute campaigns (node 36) and maintain campaigns (node 39) are implemented. One important task of the supervisor is to control the work of the operators. For this purpose he/she uses the functionality check campaign results. By using the functionality send e-mail, the checked results can be distributed (node 34). Another task of the supervisor is to distribute the work to the available operators (node 36). Certain customers are grouped into so-called campaigns, and each campaign is assigned to an operator.
Fig. 3. Call Center Application – Supervisor Functions
Our description in Fig. 1–Fig. 4 also includes the operational probability p(i, j), an estimation of the failure probability f(i, j) and of the loss in case of failure l(i, j), and also the cost c(i, j) for testing the function (i, j). These values are integrated into the MUM description. Each arc (i, j) is labelled with the following estimates: p(i, j) / f(i, j) / l(i, j) / c(i, j). Therein, the constants l (low), m (medium), and h (high) have the following values: for f(i, j): l = 0.001, m = 0.01, h = 0.1; for l(i, j): l = 1, m = 10, h = 100; for c(i, j): l = 1, m = 10, h = 1000.

Failure probabilities, losses in case of failure and testing costs have been estimated by "educated guesses". A high failure probability is expected for the functions (7,8), (7,9), (8,9) and (9,8). The loss in case of failure is also high for these functions, since it is not possible to perform a phone call if the operator has wrong or missing information about the (potential) customer. The costs for testing these functions are high as well, because the data are extracted from a legacy system; to test this system, an additional expert has to join the test team. The testing costs for function (7,10) are also high: to verify the test results of function (7,10), a call center application expert is required.

We tested Problem 1 with γ = 1 on the problem instance described above for the three different variants of our ACO algorithm. To evaluate the performance of our novel approach, we compared the results with those of a greedy heuristic. Our greedy heuristic works as follows: we choose the next not-yet-visited arc that produces the maximum value of (q(e) · f(e) · l(e))/(d(e) + 1), where d(e) denotes the distance from the current position to arc e. To use this heuristic, one has to specify the coverage rate of the arcs to be tested.
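The greedy selection rule just described can be sketched as follows; the dictionaries for the arc attributes and the distance table are hypothetical placeholders, not part of the authors' implementation.

```python
def greedy_pick(untested, pos, q, f, l, dist):
    """Choose the not-yet-visited arc e maximizing
    q(e) * f(e) * l(e) / (d(e) + 1), where d(e) = dist[pos, e] is the
    distance from the current position pos to arc e."""
    return max(untested, key=lambda e: q[e] * f[e] * l[e] / (dist[pos, e] + 1))
```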
To make a fair comparison, we use approximately the same coverage rate as was recognized as favorable in good solutions found by the ACO algorithm (about 25%). The solution quality obtained by the greedy heuristic turned out to be 1045.45 cost units. To evaluate the performance of the ACO algorithm, we also plot the performance of the variant where the trail levels are not used, i.e., construction is guided only by the attractiveness values η(e). In Fig. 4, we call this variant the "Attractiveness Heuristic".

We performed six runs with each of the algorithms. The plots presented in Fig. 4–5 show the best solution values found after a certain number of iterations, averaged over these six runs. In Fig. 4, the results of the attractiveness heuristic and our first (Standard) ACO variant are shown. It is immediately seen that the solution quality achieved by the attractiveness heuristic is about the same as that of the greedy heuristic, and that neither is able to find low-cost solutions: the best found solution value is higher than 1000 cost units. Standard ACO finds good solutions, but only after about 35 iterations; after 10–20 iterations, the values are still rather poor.

Fig. 5 shows the results of the ACO modifications, Variants A and B. We see that, compared to Standard ACO, relatively good solutions are obtained after fewer iterations. This makes Variants A and B particularly interesting for large applications, where it cannot be expected that computation time suffices to drive the optimization process to the point where optimal solutions are reached.
Fig. 4. Computational Results – Attractiveness Heuristic, Standard ACO

Fig. 5. Computational Results – Variant A, Variant B
All three ACO algorithms find very reasonable solutions after 70 iterations. These iterations took about three seconds. Of course, the trends of the presented results will have to be verified by more extensive experiments in the future.
5 Conclusion
We have proposed a technique for obtaining appropriate software test sequences from a Markov Usage Model description of the expected use of a software product. Our approach addresses the tradeoff between testing costs and the failure risk caused by untested processing steps (state transitions). For the search for good solutions, we have applied a metaheuristic, Ant Colony Optimization. In this respect, our work is in the spirit of the currently emerging paradigm of "Search-Based Software Engineering", as formulated by Harman and Jones [15] and propagated in a series of recent initiatives.

As additional input information completing the Markov Usage Model description, we use estimates of failure probabilities and of losses in case of failure. This gives the considered problem the character of a decision problem under uncertainty.

Our technique has been applied to a medium-size real-world problem. The experimental results indicate that our ACO approach outperforms a greedy-type solution heuristic. Two variants of the standard ACO technique have been specifically designed for the problem under consideration; they produce still more promising results than standard ACO. Nevertheless, much more experimental work will be required in the future as a prerequisite for the development of suitable toolkits supporting software testing along the lines of the described approach on an industrial level.

Practical application of our technique requires that the parameter values p(e), f(e), l(e) and c(e) are estimated. Sensitivity of the optimization results with respect to these estimates is an important topic that deserves further investigation. Our first impression is that the results are relatively robust with respect to all parameter types except the transition probabilities p(e). On the other hand, just these probabilities can be estimated in a rather precise, objective way by using an instrumentation of the program code and measuring the transfer frequencies in a test with a similar usage profile to that expected in the application field.

Our approach takes Markov usage models of any type as inputs and can therefore be used in a broad range of applications. Some recent articles indicate that the application of graph-based models for GUI (graphical user interface) testing is an area of growing importance; see, e.g., Kallepalli [16] or Belli [3]. Taking account of the specific requirements of GUI testing in the context of our problem formulation seems to be an interesting topic for future research. Another topic of research would be an extension to semi-Markov usage models for applications where the Markov assumption (claiming that state transition probabilities are history-independent) seems not adequate. Of course, an important issue for further research is how test paths are to be extended (semi-)automatically to test cases.

Acknowledgement. The authors are indebted to Fevzi Belli for drawing their attention to the optimization approaches in protocol conformance testing and for some helpful discussions.
References

1. Aho, A. V., Dahbura, A. T., Lee, D., Uyar, M. Ü., "An optimization technique for protocol conformance test generation based on UIO sequences and rural chinese postman tours", IEEE Trans. on Communications 39 (1991), pp. 1604–1615.
2. Belli, F., Grosspietsch, K.-E., "Specification of fault-tolerant system issues by predicate/transition nets and regular expressions – approach and case study", IEEE Trans. Software Eng. 17 (no. 6) (1991), pp. 513–526.
3. Belli, F., "Finite-state testing and analysis of graphical user interfaces", Proc. 12th Int. Symp. on Software Reliability Engineering (ISSRE), IEEE CS Press (2001), pp. 34–43.
4. Bullnheimer, B., Hartl, R. F., Strauss, C., "A new rank-based version of the Ant System: A computational study", Central European Journal for Operations Research 7(1) (1999), pp. 25–38.
5. Cheung, R. C., "A user-oriented software reliability model", IEEE Trans. Software Eng., vol. 6 (1980), pp. 118–125.
6. Csöndes, T., Kotnyek, B., Szabó, J. Z., "Application of heuristic methods for conformance test selection", European J. of Operational Research, vol. 142 (2002), pp. 203–218.
7. Doerner, K., Gutjahr, W. J., "Representation and Optimization of Software Usage Models with Non-Markovian State Transitions", Information and Software Technology vol. 47 (2000), pp. 873–884.
8. Doerner, K., Laure, E., "High performance computing in the optimization of test plans", Optimization and Engineering, vol. 3 (2002), pp. 67–87.
9. Dorigo, M., Di Caro, G., "The Ant Colony Optimization metaheuristic", in: New Ideas in Optimization, D. Corne, M. Dorigo, F. Glover (eds.), pp. 11–32, McGraw-Hill (1999).
10. Dorigo, M., Maniezzo, V., Colorni, A., "The Ant System: An autocatalytic optimization process", Technical Report 91–016, Dept. of Electronics, Politecnico di Milano, Italy (1991).
11. Gutjahr, W. J., "Importance sampling of test cases in Markovian software usage models", Probability in the Engineering and Informational Sciences, vol. 11 (1997), pp. 19–26.
12. Gutjahr, W. J., "Software Dependability Evaluation Based on Markov Usage Models", Performance Evaluation, vol. 40 (2000), pp. 199–222.
13. Gutjahr, W. J., "A graph-based Ant System and its convergence", Future Generation Computing Systems 16 (2000), pp. 873–888.
14. Gutjahr, W. J., "ACO Algorithms with Guaranteed Convergence to the Optimal Solution", accepted for publication in: Information Processing Letters.
15. Harman, M., Jones, B. F., "Search-based software engineering", Information and Software Technology, vol. 43 (2001), pp. 833–839.
16. Kallepalli, Ch., "Measuring and modeling usage and reliability for statistical Web testing", IEEE Trans. Software Eng. vol. 27 (no. 11) (2001), pp. 1023–1036. IEE Proc.-Softw. 146, no. 4 (1999), pp. 187–192.
17. Kumar, G. P., Venkataram, P., "Protocol test sequence generation using MUIOS based on TSP Problem", Proc. IFIP TC6 Conf. (1994), pp. 165–191.
18. Littlewood, B., "A reliability model for systems with Markov structure", J. Royal Statistical Soc., Series C (Applied Statistics), vol. 24 (1975), pp. 172–177.
19. Littlewood, B., "A Semi-Markov model for software reliability with failure costs". In: MRI Symp. Computer Software Engineering, Polytechnic Press, Polytechnic of New York, New York (1976), pp. 281–300.
20. Littlewood, B., "Software Reliability Models for modular program structure", IEEE Trans. Reliab., vol. 28 (no. 3) (1979), pp. 241–246.
21. Poore, J. H., Trammell, C. J., "Application of statistical science to testing and evaluating software intensive systems". In: Statistics, Testing, and Defense Acquisition, Washington: National Academy Press (1998).
22. Rajgopal, J., Mazumdar, M., "Modular operational test plans for inferences on software reliability based on a Markov model", IEEE Trans. Software Eng., vol. 28 (no. 4) (2002), pp. 358–363.
23. Rivers, A. T., Vouk, A. M., "Resource-constrained non-operational testing of software", Proc. 9th Int. Symp. on Software Reliability Engineering (ISSRE), pp. 154–163 (1998).
24. Sayre, K., Poore, J. H., "Partition testing with usage models", Information and Software Technology, vol. 42 (2000), pp. 845–850.
25. Siegrist, K., "Reliability of systems with Markov transfers of control", IEEE Trans. Software Eng., vol. 14 (no. 7) (1988), pp. 1049–1053.
26. Walton, G. H., Poore, J. H., Trammell, J., "Statistical testing of software based on a usage model", Software – Practice and Experience, vol. 25 (1) (1995), pp. 97–108.
27. Whittaker, J. A., Poore, J. H., "Markov analysis of software specifications", ACM Trans. on Software Eng. and Method., vol. 2 (1) (1993), pp. 93–106.
28. Whittaker, J. A., Thomason, M. G., "A Markov chain model for statistical software testing", IEEE Trans. Software Eng., vol. SE-20 (1994), pp. 812–824.
29. Zhang, F., Cheung, T. Y., "Optimal Transfer Trees and Distinguishing Trees for Testing Observable Nondeterministic Finite-State Machines", IEEE Trans. Software Eng., vol. SE-29 (2003), pp. 1–14.
Using Genetic Programming to Improve Software Effort Estimation Based on General Data Sets

Martin Lefley and Martin J. Shepperd

School of Design Engineering and Computing, University of Bournemouth, Talbot Campus, Poole, BH12 5BB, UK
[email protected]
Abstract. This paper investigates the use of various techniques, including genetic programming, with public data sets to attempt to model and hence estimate software project effort. The main research question is whether genetic programs can offer 'better' solution search using public domain metrics rather than company-specific ones. Unlike most previous research, a realistic approach is taken, whereby predictions are made on the basis of the data available at a given date. Experiments are reported, designed to assess the accuracy of estimates made using data from within and beyond a specific company. This research also offers insights into genetic programming's performance, relative to alternative methods, as a problem solver in this domain. The results do not find a clear winner, but for this data GP performs consistently well, though it is harder to configure and produces more complex models. The evidence here agrees with other researchers that companies would do well to base estimates on in-house data rather than incorporating public data sets. The complexity of the GP must be weighed against the small increases in accuracy when deciding whether to use it as part of any effort prediction.
1 Introduction

Reliable predictions of project costs — primarily effort — are greatly needed for better planning of software projects, but unfortunately size and costs are seldom, if ever, proportionally related [8]. There has been extensive research into software project estimation, with researchers assessing a number of approaches to improving prediction accuracy. One of the first methods to estimate software effort automatically was COCOMO [4], where in its simplest form effort is expressed as a function of anticipated size as:
E = aS^b   (1)
where E is the effort required, S is the anticipated size, and a and b are domain-specific parameters. Independent studies, for example Kitchenham and Taylor [14] and Kemerer [12], found many errors considerably in excess of 100%, even after model calibration.

E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 2477–2487, 2003. © Springer-Verlag Berlin Heidelberg 2003
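As a sketch of how a COCOMO-style model E = aS^b can be calibrated from historical projects, taking logarithms gives ln E = ln a + b ln S, which reduces the fit to ordinary least squares. The code below is illustrative only, not the calibration procedure used in the studies cited.

```python
import math


def calibrate(sizes, efforts):
    """Fit a and b in E = a * S**b by least squares on log-transformed data:
    ln E = ln a + b ln S."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(e) for e in efforts]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = math.exp(my - b * mx)
    return a, b


def predict(a, b, size):
    """Predicted effort for a project of the given anticipated size."""
    return a * size ** b
```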
Subsequently, attempts have been made to model data automatically based on local collection; for example, Kok et al. [15] published encouraging results using stepwise regression. Such linear modelling can cover only a small part of the possible solution space, potentially missing the complex set of relationships evident in many complex domains such as software development environments.

A variety of machine learning (ML) methods have been used to predict software development effort. Good examples include artificial neural nets (ANNs) [3, 28], case based reasoning (CBR) [9, 23] and rule induction (RI) [20]. Hybrids are also possible, e.g. Shukla [24] reports better results from an evolving ANN compared to a standard back-propagation ANN. Dolado and others [6, 7] analyse many aspects of the software effort estimation problem and present encouraging results for genetic programming (GP) based estimation using a single input variable. Burgess and Lefley [5] also had some success with GP based estimation.

One characteristic common to all ML methods is the need for training data. However, recent software engineering research has found that shared or public data sets are much less effective than restricting the prediction system to potentially very few local cases [11, 21]. The issue at stake is whether a company can improve prediction accuracy by incorporating results from other companies. Generally, the more data available to a learner, the better it can model behaviour. However, no matter how good the control, some metrics are likely to be measured differently across companies, and the working environments may differ markedly. Thus there will be some distortion of the model's accuracy. The results reported in [11, 21] for non-evolutionary models found the larger data sets to provide less accurate estimates. This suggests that companies should only use their own data, assuming they have sufficient examples close to a new case to make an estimate.
Their research used models based on regression, CART, robust regression, stepwise ANOVA and analogy-based estimation. Genetic programming offers an evolutionary solution to estimation problems that may better take into account the source of data. For example, a variable indicating in-house or external origin could be used as a multiplier to ignore data from outside sources, and so has the potential to build a prediction system at least as accurate as one based on only internal data.

To summarise, the aim of this paper is to consider the question: can the use of evolutionary models overcome the problems of using non-local data to build project effort prediction systems? The ability to do this is of particular interest to environments that have collected little local data, a seemingly common occurrence within software engineering. The remainder of the paper describes the data set used for our analysis, considers how prediction accuracy might be assessed, and then reviews the application of ANNs, nearest neighbour (NN) and GP methods. Next we provide more detailed information on our GP method. This is followed by results, conclusions and suggestions for further work.
2 The Data Set

The data used for the case study is often referred to as the 'Finnish data set' and is made up of project data collected by the software project management consultancy organisation SSTF. This data set contains 407 cases described by more than 90 features, including metrics such as function points. The data set used in this paper is
quite large for this type of application because it comprises project data from many organisations interested in benchmarking their software productivity. The features are a mixture of continuous, discrete and categorical. However, there are missing data values, and some features would not be known at prediction time and so are not included in the development of a prediction system. Removing features with missing values, or with values that cannot be known until after the prediction is required, leaves a subset of 83 features used for this case study. The data set exhibits significant multicollinearity; in other words, there are strong relationships between features as well as with the dependent variable. More background information can be found in [21], which describes an earlier version of the same data set.

The total project effort variable was removed and used as the output or dependent variable. This is used to model the training data and to evaluate the test data. Where data was in the form of a string, it was changed to an index so that it could be processed numerically. This included the company number; to assist the models, the chosen company was allocated the special code index 0. All values were scaled by range to fall between 0 and 1.

In order to provide a realistic assessment of the hypothesis, the data was split chronologically into training and test sets. The data was sorted by start dates in such a way that a reasonable number of the projects from one company were available for training and testing. Thus we model the situation on 15th October 1991, when the selected company had completed 48 projects, with another 15 yet to commence. At this cut-off date, data was available from another 101 projects from companies apart from the one chosen for modelling. Data from the other companies beyond the cut-off date was discarded.
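The preprocessing just described can be sketched as below. The record layout (a "start" field holding a comparable date) is an assumption for illustration, and the split is simplified relative to the paper's setup, which also discards other companies' post-cutoff projects.

```python
def scale_by_range(column):
    """Min-max scale a numeric feature so its values fall between 0 and 1."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]


def chronological_split(projects, cutoff):
    """Split project records by start date: projects started on or before
    the cutoff form the training set, later ones the test set."""
    train = [p for p in projects if p["start"] <= cutoff]
    test = [p for p in projects if p["start"] > cutoff]
    return train, test
```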
3 Comparing Prediction Techniques

Our chief requirement is to obtain a quantitative assessment of the accuracy of the different prediction techniques. This is achieved by comparing software project effort estimates made over a number of cases with known actual effort. Typically these cases are not made available during the configuration and training stages, in order to avoid bias. There are many ways of calculating and combining the accuracy measures obtained, with no simple way of ranking these techniques, even for a given domain [13]. Most learners use an evaluation or fitness function to determine error and so decide on future search directions. The choice of method for this evaluation is crucial for accurate modelling. All of the learners considered here use root average mean square error. This is important also when considering accuracy estimates, as the evaluation function will bias search towards similar evaluation metrics.

A number of accuracy — strictly speaking, residual — summary statistics can be used to make comparisons between models. Each captures different aspects of the notion of model accuracy. Ultimately the decision should be made by the user, based on estimates of the costs of under- or over-estimating project effort, how risk averse they are, and so forth. All are based on the calculated error or residual, that is, the difference between the predicted and observed output values for a project. Most use an average performance over the whole set, so we also include the worst case as an illustration of a less generalised metric. We also assess results using AMSE, adjusted mean square error (see Equation 2). This is the sum of squared errors, divided by the
M. Lefley and M.J. Shepperd
product of the means of the predicted and observed outputs. For further details on these metrics, see [5].

AMSE = Σ_{i=1}^{n} (E_i − Ê_i)² / (mean(E) × mean(Ê))    (2)
In summary, the accuracy metrics used for this research are:

1. Correlation coefficient (Pearson) of actual and predicted
2. Adjusted mean square error - AMSE
3. Percentage of predictions within 25% - Pred(25)%
4. Mean magnitude of relative error - MMRE
5. Balanced mean magnitude of relative error - BMMRE
6. Worst case error as %

We also consider qualitative factors adapted from Mair et al. [20]: accuracy, explanatory value and ease of configuration. The ease of set-up and the quality of information provided for the estimator are of great importance. Empirical research has indicated that end-users coupled with prediction systems can outperform either prediction systems or end-users alone [25]. The more explanation given for how a prediction was obtained, the greater the power given to the estimator. If predictions can be explained, estimators may experiment with "what if" scenarios, and meaningful explanations can increase confidence in the prediction.
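For concreteness, these residual-based metrics might be computed as in the following sketch. The formulations are the standard ones (e.g. balanced MRE divides by the smaller of actual and predicted) and may differ in detail from the implementation used in the paper:

```python
import statistics

def accuracy_metrics(actual, predicted):
    """Residual-based accuracy summary statistics for a set of predictions."""
    n = len(actual)
    residuals = [a - p for a, p in zip(actual, predicted)]
    # Magnitude of relative error, per project
    mre = [abs(r) / a for r, a in zip(residuals, actual)]
    # Balanced MRE: relative to the smaller of actual and predicted
    bmre = [abs(r) / min(a, p) for r, a, p in zip(residuals, actual, predicted)]
    # AMSE: sum of squared errors over the product of the two means
    amse = sum(r * r for r in residuals) / (statistics.mean(actual) * statistics.mean(predicted))
    return {
        "AMSE": amse,
        "Pred(25)%": 100.0 * sum(m <= 0.25 for m in mre) / n,
        "MMRE": 100.0 * statistics.mean(mre),
        "BMMRE": 100.0 * statistics.mean(bmre),
        "Worst case MRE%": 100.0 * max(mre),
    }
```

Note that MMRE and Pred(25) reward different behaviours, which is one reason the tables later in the paper rank the techniques differently depending on the metric chosen.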
4 Machine Learning and Software Effort Estimation

Many machine learning methods have been used for effort estimation, though there are few comparisons. Work by Burgess and Lefley [5] found that GP performed similarly to other advanced models. As part of this investigation we consider different learners to benchmark the achievements of the GP system. Results are presented for models based on:

• Random
• Least squares regression
• Nearest neighbour
• Artificial neural network
• Genetic programming
• Average of all non-random estimators
The random method uses the substitution of an output value, that is effort, randomly selected from the training set, and is included as a benchmark.

Using Genetic Programming to Improve Software Effort Estimation

4.1 Artificial Neural Networks (ANNs)

ANNs are parallel systems inspired by the architecture of biological neural networks, comprising simple interconnected units called artificial neurons. The neuron computes a weighted sum of its inputs and generates an output through a step activation function. This output is an excitatory (positive) or inhibitory (negative) input fed to other neurons in the network. Feed-forward multi-layer perceptrons are the most commonly used form of ANN, although many more neural network architectures have been proposed. The ANN is initialised with random weights. The network then 'learns' the relationships implicit in a data set by adjusting the weights when presented with the combinations of inputs and outputs known as the training set. Studies concerned with the use of ANNs to predict software development effort have focused on comparative accuracy with algorithmic models, rather than on the suitability of the approach for building software effort prediction systems. An example is the investigation by Wittig and Finnie [28]. The results from two validation sets are aggregated and yield a high level of accuracy, viz. Desharnais set MMRE=27% and ASMA set MMRE=17%, although some outlier values are excluded. Mair et al. [20] show neural networks offer accurate effort prediction, though not as good as Wittig and Finnie, concluding that they are relatively difficult to configure and interpret. A number of experiments were needed to determine an appropriate number of hidden nodes, with 20 being used for this investigation. Other parameters such as learning rate were set to values chosen by experience and experimentation. It was found that nearby parameter values only slightly affected convergence performance.

4.2 Nearest Neighbour Techniques

Examples of successful case based reasoning (CBR) tools for software project estimation include cEstor [26], a CBR system dedicated to the selection of similar software projects for the purpose of estimating effort, and more recently FACE [2] and ANGEL [22]. We use a simple NN technique and a weighted (by the reciprocal of the Euclidean distance) average of three neighbours as a simple form of CBR.
The Euclidean distance is determined by the root sum of squares of the difference between the scaled input variables for any two projects.
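A sketch of this distance-weighted nearest neighbour estimator follows; the function and parameter names are our own, and the small epsilon guarding against a zero distance is an implementation assumption:

```python
import math

def euclidean(p, q):
    # Root sum of squares of the differences between scaled input variables
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def nn_effort(query, cases, k=3):
    """Predict effort as the average of the k nearest projects' efforts,
    weighted by the reciprocal of Euclidean distance.
    `cases` is a list of (feature_vector, effort) pairs."""
    ranked = sorted(cases, key=lambda c: euclidean(query, c[0]))[:k]
    # Reciprocal-distance weights; epsilon avoids division by zero on exact matches
    weights = [1.0 / (euclidean(query, feats) + 1e-9) for feats, _ in ranked]
    return sum(w * effort for w, (_, effort) in zip(weights, ranked)) / sum(weights)
```

With k=1 this reduces to the simple NN technique; with k=3 it gives the weighted three-neighbour average used in the comparison.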
5 Background to Genetic Programming

Genetic Algorithms were developed as an alternative technique for tackling general optimisation problems with large search spaces. They have the advantage that they do not need any prior knowledge, expertise or logic related to the particular problem being solved. Occasionally they can yield the optimum solution, but for most problems with a large search space, a good approximation to the optimum is a more likely outcome. The basic ideas are based on the Darwinian theory of evolution, which in essence says that genetic operations between chromosomes eventually lead to fitter individuals, which are more likely to survive. Thus, over time, the population of the species as a whole improves. However, not all of the operations used in the computer analogy of this process necessarily have a biological equivalent. Genetic Programming is an extension of Genetic Algorithms which removes the restriction that the chromosome representing the individual has to be a fixed-length binary string. In general for GP, the chromosome is some type of program, which is then executed to obtain the required results. One of the simplest forms of program, used for this
application, is a binary tree containing operators and operands. This means that each solution is an algebraic expression that can be easily evaluated. Koza “offers a large amount of empirical evidence to support the conclusion that GP can be used to solve a large number of seemingly different problems from many different fields” [16].
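To illustrate how such a tree encodes an easily evaluated algebraic expression, the following Python sketch uses a nested-tuple representation (our own; the protected divide discussed later in this section is assumed):

```python
# A chromosome is either a terminal (a variable name or numeric constant) or
# a tuple (operator, left_subtree, right_subtree).
OPS = {'+': lambda a, b: a + b,
       '-': lambda a, b: a - b,
       '*': lambda a, b: a * b,
       '/': lambda a, b: 1000.0 if b == 0 else a / b}  # protected divide

def evaluate(tree, env):
    """Recursively evaluate an expression tree against variable bindings."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPS[op](evaluate(left, env), evaluate(right, env))
    if isinstance(tree, str):
        return env[tree]   # operand is an input variable
    return tree            # operand is a numeric constant
```

For example, the tree `('+', ('*', 'x1', 2.0), 'x2')` encodes the expression x1*2 + x2.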
Fig. 1. Illustration of crossover operator before operation. The double line illustrates where the trees are cut
Fig. 2. Illustration of crossover operator after operation. The double line illustrates where the sub-trees have been swapped.
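The subtree-swapping crossover illustrated in Figures 1 and 2 can be sketched as follows. The nested-tuple tree representation and all helper names are our own, for illustration only:

```python
import random

# A chromosome is either a terminal (a string or number) or a tuple
# (operator, left_subtree, right_subtree).

def nodes(tree, path=()):
    """Yield (path, subtree) for every node; paths are tuples of child indices."""
    yield path, tree
    if isinstance(tree, tuple):
        yield from nodes(tree[1], path + (1,))
        yield from nodes(tree[2], path + (2,))

def graft(tree, path, sub):
    """Return a copy of tree with the node at the given path replaced by sub."""
    if not path:
        return sub
    op, left, right = tree
    if path[0] == 1:
        return (op, graft(left, path[1:], sub), right)
    return (op, left, graft(right, path[1:], sub))

def crossover(a, b, rng=random):
    """Choose a random node in each parent and swap the subtrees below the cuts."""
    pa, sa = rng.choice(list(nodes(a)))
    pb, sb = rng.choice(list(nodes(b)))
    return graft(a, pa, sb), graft(b, pb, sa)
```

Because the two cut subtrees are exchanged, the total number of nodes across the pair of offspring equals that of the parents, though either child may individually exceed a size limit and be discarded, as described below.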
The main genetic operations used are crossover and mutation. The crossover operator chooses a node at random in each chromosome and the branches to those nodes are cut. The sub-trees produced below the cuts are then swapped. The method of performing crossover is easily illustrated using an example (see Figures 1 and 2). Although this example includes a variety of operations, only the set { +, -, *, / } was made available for our experiments. More complex operators can, of course, be developed by combining these simple operations. Simple operators eliminate the bounding problems associated with more complex operations such as XOR, which is not defined for negative or non-integer values. The multiply is a protected multiply, which prevents the return of any values greater than 10^20. Similarly there is a protected divide, with division by 0 returning 1000. This is to minimise the likelihood of real overflow occurring during the evaluation of the solutions. The initial solutions were generated using "ramped half and half". This means that half the trees are constructed as full trees, i.e. operands only occur at the maximum depth of the tree, and half are trees with a random shape. Often elitism, where the top x% of the solutions, as measured by fitness, is copied straight into the next generation, is used to maintain the best solutions; x can be any value but is typically 1 to 10. Since trees tend to grow as the algorithm progresses, a maximum depth (or maximum number of nodes) is normally imposed so that the trees do not get too large. This is often larger than the maximum size allowed in the initial population. Any trees produced by crossover that are too large are discarded. The reason for imposing the size limit is to save on both the storage space required and the execution
time needed to evaluate the trees. There is also no known evidence that allowing very large trees leads to better results, and smaller trees are more likely to generalise. Key parameters that have to be determined, in order to get good solutions without using too much time or space, are the choice of genetic operators, the population size, maximum tree size, number of generations, and so on. More information on Genetic Programming can be found in [1, 17, 18].
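The protected multiply and divide described above might be implemented as in this sketch; the symmetric clamping of large negative products is our own assumption, since the text only states the positive bound:

```python
CAP = 1e20  # upper bound enforced by the protected multiply, per the text

def protected_mul(a, b):
    # Clamp the product's magnitude so no value greater than 10**20 is returned
    return max(-CAP, min(a * b, CAP))

def protected_div(a, b):
    # Division by zero returns 1000, as stated in the text
    return 1000.0 if b == 0 else a / b
```

Guarding the operators in this way keeps evaluation total: every syntactically valid tree yields a finite number, so no individual is lost to a runtime overflow.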
6 Applying GP to Software Effort Estimation

The software effort estimation problem may be modelled as a symbolic regression problem: given a number of sets of data values for a set of input parameters and one output parameter, construct an expression of the input parameters which best predicts the value of the output parameter for any set of values of the input parameters. In this paper, GP is applied to the data set already described in Section 2, with the same 149 projects in the training set and the remaining 15 projects in the test set, in a similar way to the other ML methods. Ease of configuration depends mainly on the number of parameters required to set up the learner. For example, a nearest neighbour (NN) system needs parameters to decide the weight of the variables, the method for determining closeness, and how neighbours are combined to form a solution. Koza [16] lists the number of control parameters for genetic programming as 20, compared to 10 for neural networks, but it is not always easy to decide what constitutes a control parameter. However, it is not so much the number of parameters that matters as the sensitivity of the accuracy of the solutions to their variation. It is possible to search using different parameters, but this will depend on their range and the granularity of the search. In order to determine the ease of configuration for a genetic program, we test empirically whether parameter values suggested by Koza offer sufficient guidance to locate suitably accurate estimators. Similarly, all of our solutions use either limited experimentation or commonly used values for control parameters. The parameters chosen after a certain amount of experimentation with the Genetic Programming system are shown in Table 1.

Table 1. Main parameters used for the GP system
Parameter                          Value
Size of population                 1000
Number of generations              500
Number of runs                     10
Maximum initial full tree depth    5
Maximum no. of nodes in a tree     64
Percentage mutation rate           5%
Percentage of elitism              5%
No function ‘seeding’ was used and only a simple mutation operator was available based on substitution of parameters. The initial solutions were generated using "ramped half and half". This means that half the trees are constructed as full trees, i.e.
operands only occur at the maximum depth of the tree, and half are trees with a random shape. The GP system, as with all the systems considered here, is written in Power Basic running on a 1 GHz Pentium PC with 512 Mbytes of RAM. However, the run-time memory usage is less than 1 Mbyte to store the trees and population data required for each generation. Careful garbage management ensures this does not grow with successive generations. One run of 500 generations takes about 40 minutes of processor time for the full data set, and 25 minutes using the smaller, company specific set. In order to appreciate the performance of the GP, the test is also performed on some other suitable techniques.
7 Results of the Comparison

This study evaluates various statistical and machine learning techniques, including a genetic programming tool, for making software project effort predictions. The main hypothesis to be tested is whether GP can produce a significantly better estimate of effort using a public data set rather than a company specific database. We make comparisons mostly based on the accuracy of the results, but we also consider the ease of configuration and the transparency of the solutions. Note that in the comparison, learners that contain a stochastic element are tested over a population of independently generated solutions, with the best modeller (smallest average squared error) used.

7.1 Accuracy of the Predictions

The ANN and GP solutions required some experimentation to find good control parameters. All of the training sessions were successful and no problems with local minima were encountered. Table 2 lists the main results using the general, company wide data. Table 3 shows the same results for the company specific data set.

Table 2. Comparing different prediction systems for the full data set. Best performing estimators (within 5% of best) are highlighted in bold type. Best performers across both data sets are shown in italics.

Estimated effort   Random    LSR      1-NN      3-NN     ANN      GP       Average
correlation        -0.161    0.846    0.390     0.497    0.806    0.937    0.890
AMSE               28.091    4.601    13.970    14.720   6.584    1.981    3.596
Pred(25)%          13.333    40.000   33.333    20.000   26.667   40.000   33.333
MMRE               166.39    46.925   85.401    59.192   68.856   37.670   43.467
BMMRE              238.35    73.629   110.022   79.049   85.478   62.797   47.343
Worst case MRE     609       150      332       185      195      125      167
Table 3. Comparing different prediction systems for the company specific data set. Best performing estimators (within 5% of best) are highlighted in bold type. Best performers across both data sets are shown in italics.

Estimated effort   Random    LSR      1-NN      3-NN      ANN      GP       Average
correlation        -0.167    0.927    0.104     0.793     0.951    0.933    0.939
AMSE               77.762    2.523    19.347    5.868     2.015    12.672   1.796
Pred(25)%          6.667     26.667   26.667    20.000    40.000   33.333   40.000
MMRE               381.422   45.648   129.045   114.709   69.228   37.959   65.694
BMMRE              448.695   63.995   217.353   128.525   71.166   54.015   71.888
Worst case MRE     1865      174      436       495       254      79       198
The three strongest techniques overall appear to be LSR, ANN and GP, with GP achieving the best (or within 5% of the best) level of accuracy most often. The NN techniques are consistently less accurate. Using an average of each technique performs well for the company specific data set but is less successful for the full data set. In terms of our main hypothesis, that it is better to restrict data to locally relevant projects, the evidence is somewhat mixed. For the poorest techniques, most notably 1-NN, this is clearly not the case, as in all cases the results from the full data set are to be preferred to those from the company specific data set. With ANN and LSR the results for the company specific data set are generally better than those for the full data set. This agrees with other researchers' findings that companies would do well to base estimates on their own data rather than incorporating public data sets. The GP results are almost identical in terms of the correlation coefficients and the Pred(25) indicator, substantially improved for BMMRE and the worst case, yet worse for MMRE and AMSE. It may be that the latter two indicators are affected by a different distribution of under- and over-estimates, since both are asymmetric in their treatment of such estimates. Overall the results are weakly suggestive of the benefits of preferring local data, particularly for LSR, ANN and, to a lesser extent, GP. However, where local data is very limited, it is possible that the public data set could improve estimation accuracy.

7.2 Qualitative Assessment

It was fairly easy to configure the ANN, but we have a fair degree of past experience in this area. Generally, different architectures, algorithms and parameters made little difference to the results, but some expertise is required to choose reasonable values. Although heuristics have been published on this topic [10, 27], we found the process largely to be one of trial and error, guided by experience.
ANN techniques would not be easy for end-users to apply to project estimation, as some expertise is required. We have found that methods such as those reported in [19] can improve understanding of the results from the ANN, but such information is limited. GP has the most degrees of freedom in design, for example the choice of functions, mutation rate and strategy, percentage elitism, the handling of out-of-bounds conditions, the creation strategy, and the setting of maximum tree sizes and depths. We found that by using suggestions by Koza [16] and experimentation we obtained convergence, but again, at present, expertise is necessary. Many experiments were tried with various
approaches and a number are still pending. Such effort suggests that GP only be used if it provides greater accuracy, which is not supported by our quantitative results for this data. The resulting program, like most of the techniques here, may provide information on the relative importance of input variables. Typical GP solutions for this problem are provided below.

((X13 + X17 + ((((((((X62 + X56) * (X20 + X33)) * (((0.25 * X34) * (X32 + X70 - X19)) + (X20 + X80))) - (X40 - X9)) - (((X4 * X69) + X39 + X47) + ((2.1 + X28 + X62) * X4 * X56))) - X75 + X2) + 2.1 + X28 + X62) + (X20 + X33))) + (3 * X56 + X9 - X21 - X19 + X62)) * 0.25 * X83    (3)

X83 / (X45 * X4)    (4)
Equation 3 is not easy to comprehend, unlike Equation 4, which is of forced, limited length. The latter type of solution was found to be marginally less accurate, but might provide the software manager with clearer insight into the problem.
8 Conclusions and Future Work

There is no clear winner on accuracy, but GP performed well for this problem. This is in good agreement with other research [5, 7] using different approaches and data sets. The company specific ANN also performed well, and the average over all methods for this data is also particularly good. Generally, the better techniques (LSR, ANN and GP) are slightly more accurate with the company specific database. The results, however, highlight that measuring accuracy is not simple, and researchers should consider a range of measures. Perhaps a useful conclusion is that, based on the evidence from this experiment, the more elaborate estimation techniques such as ANNs and GPs provide better fitting models, but this slight advantage must be weighed against the extra effort in design, and a simple LSR can still provide useful estimates. Future work will centre on a greater variety of GP models, though we need to ensure that trying so many models does not lead to one model performing very well merely by chance (over-fitting). The stochastic population nature of GP can be used to reduce this effect. Models that will be used include better mutation (by tree generation rather than parameter swapping), seeding (for example with the LSR) and parameter tuning.

Acknowledgements. The authors are indebted to STTF Ltd for making the Finnish data set available.
References

[1] W. Banzhaf, P. Nordin, R. E. Keller, and F. D. Francome, Genetic Programming: An introduction. San Mateo, CA: Morgan Kaufmann, 1998.
[2] R. Bisio and F. Malabocchia, "Cost estimation of software projects through case base reasoning," presented at 1st Intl. Conf. on Case-Based Reasoning Research & Development, 1995.
[3] J. Bode, "Neural networks for cost estimation," Cost Engineering, vol. 40, pp. 25–30, 1998.
[4] B. W. Boehm, Software Engineering Economics. Englewood Cliffs, N.J.: Prentice-Hall, 1981.
[5] C. J. Burgess and M. Lefley, "Can genetic programming improve software effort estimation? A comparative evaluation," Information & Software Technology, vol. 43, pp. 863–873, 2001.
[6] J. J. Dolado, "Limits to methods in software cost estimation," presented at 1st Intl. Workshop on Soft Computing Applied to Software Engineering, Limerick, Ireland, 1999.
[7] J. J. Dolado, "On the problem of the software cost function," Information & Software Technology, vol. 43, pp. 61–72, 2001.
[8] S. Drummond, "Measuring applications development performance," in Datamation, vol. 31, 1985, pp. 102–8.
[9] G. R. Finnie, G. E. Wittig, and J.-M. Desharnais, "Estimating software development effort with case-based reasoning," presented at 2nd Intl. Conf. on Case-Based Reasoning, 1997.
[10] S. Huang and Y. Huang, "Bounds on the number of hidden neurons," IEEE Trans. on Neural Networks, vol. 2, pp. 47–55, 1991.
[11] R. Jeffery, M. Ruhe, and I. Wieczorek, "Using public domain metrics to estimate software development effort," presented at 7th IEEE Intl. Metrics Symp., London, 2001.
[12] C. F. Kemerer, "An empirical validation of software cost estimation models," Communications of the ACM, vol. 30, pp. 416–429, 1987.
[13] B. A. Kitchenham, S. G. MacDonell, L. Pickard, and M. J. Shepperd, "What accuracy statistics really measure," IEE Proceedings - Software Engineering, vol. 148, pp. 81–85, 2001.
[14] B. A. Kitchenham and N. R. Taylor, "Software cost models," ICL Technical Journal, vol. 4, pp. 73–102, 1984.
[15] P. Kok, B. A. Kitchenham, and J. Kirakowski, "The MERMAID approach to software cost estimation," presented at Esprit Technical Week, 1990.
[16] J. R. Koza, Genetic programming: On the programming of computers by means of natural selection. Cambridge, MA: MIT Press, 1992.
[17] J. R. Koza, Genetic Programming II: Automatic discovery of reusable programs. MIT Press, 1994.
[18] J. R. Koza, F. H. Bennett, D. Andre, and M. A. Keane, Genetic Programming III: Darwinian Invention and Problem Solving. San Mateo, CA: Morgan Kaufmann, 1999.
[19] M. Lefley and T. Kinsella, "Investigating neural network efficiency and structure by weight investigation," presented at European Symp. on Intelligent Technologies, Germany, 2000.
[20] C. Mair, G. Kadoda, M. Lefley, K. Phalp, C. Schofield, M. Shepperd, and S. Webster, "An investigation of machine learning based prediction systems," J. of Systems Software, vol. 53, pp. 23–29, 2000.
[21] K. Maxwell, L. Van Wassenhove, and S. Dutta, "Performance evaluation of general and company specific models in software development effort estimation," Management Science, vol. 45, pp. 787–803, 1999.
[22] M. J. Shepperd and C. Schofield, "Estimating software project effort using analogies," IEEE Transactions on Software Engineering, vol. 23, pp. 736–743, 1997.
[23] M. J. Shepperd, C. Schofield, and B. A. Kitchenham, "Effort estimation using analogy," presented at 18th Intl. Conf. on Softw. Eng., Berlin, 1996.
[24] K. K. Shukla, "Neuro-genetic prediction of software development effort," Information & Software Technology, vol. 42, pp. 701–713, 2000.
[25] E. Stensrud and I. Myrtveit, "Human performance estimating with analogy and regression models: an empirical validation," presented at 5th Intl. Metrics Symp., Bethesda, MD, 1998.
[26] S. Vicinanza, M. J. Prietula, and T. Mukhopadhyay, "Case-based reasoning in effort estimation," presented at 11th Intl. Conf. on Info. Syst., 1990.
[27] S. Walczak and N. Cerpa, "Heuristic principles for the design of artificial neural networks," Information & Software Technology, vol. 41, pp. 107–117, 1999.
[28] G. Wittig and G. Finnie, "Estimating software development effort with connectionists models," Information & Software Technology, vol. 39, pp. 469–476, 1997.
The State Problem for Evolutionary Testing

Phil McMinn and Mike Holcombe

Department of Computer Science, The University of Sheffield, Regent Court, 211 Portobello Street, Sheffield, S1 4DP, UK. {p.mcminn,m.holcombe}@dcs.shef.ac.uk
Abstract. This paper shows how the presence of states in test objects can hinder or render impossible the search for test data using evolutionary testing. Additional guidance is required to find sequences of inputs that put the test object into some necessary state for certain test goals to become feasible. It is shown that data dependency analysis can be used to identify program statements responsible for state transitions, and then argued that an additional search is needed to find required transition sequences. In order to be able to deal with complex examples, the use of ant colony optimization is proposed. The results of a simple initial experiment are reported.
1 Introduction
Evolutionary testing (ET) is a technique by which test data can be generated automatically through the use of optimizing search techniques. The search space is the input domain of the software under test. ET has been shown to be successful at generating test data for many forms of testing, namely specification testing [9], extreme execution time testing [12] and structural testing [10]. It has also been shown that certain features of programs can inhibit the search for test data, for example flag variables [3, 6]. This paper introduces another such feature: states in test objects. States can cause a variety of problems for ET, since test goals involving states can be dependent on the entire history of input to the test object, as well as just the current input. In addition, guidance must be provided so that statements responsible for state transitions are executed, so as to put the test object into the required states for certain test goals to become feasible. Internal states have hindered test data generation for automotive components used at DaimlerChrysler. The aim of this work is to extend the DaimlerChrysler ET system [14] to enable it to generate test data when presented with such troublesome test objects. This paper is organized as follows. Section 2 reviews evolutionary testing. Section 3 introduces the state problem with examples. Section 4 discusses the use of data dependency analysis to identify program statements that are responsible for state transitions. Section 5 discusses how this could be applied to the state problem, and argues that an additional search is needed to find required sequences of transitional statements. In the case of simple examples an exhaustive search may be all that is required; however, for more complex cases an optimization technique may be needed, and the use of ant colony optimization is proposed. Results of a simple initial experiment are reported. Section 6 then closes with conclusions and outlines future work.

E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 2488–2498, 2003. © Springer-Verlag Berlin Heidelberg 2003
2 Evolutionary Testing (ET)
Evolutionary Testing (ET) uses optimizing search techniques such as evolutionary algorithms to generate test data. The search space is the input domain of the test object, with each individual, or potential solution, being an encoded set of inputs to that test object. The fitness function is tailored to find test data for the type of test that is being undertaken. This paper discusses the state problem in the context of structural testing. Here the aim is to find test data to execute every structural component of some coverage type, for example all branches of the program's control flow graph, or the execution of every definition-use pair for every variable. In order to retrieve fitness information, the test object must be instrumented. Previous work [10] has argued that higher levels of coverage are obtained when each structural element of the chosen coverage type is targeted individually as a partial aim. For each partial aim, the minimizing fitness function is made up of two components, namely the approximation level and a branch distance calculation [10, 11]. The approximation level supplies a value indicating how close in structural terms an individual is to reaching the target. For node-oriented coverage types, for example statement coverage, this value is calculated as the number of branching statements lying between branches covered by an individual and the target branch. At the point where the individual diverged away from the target node, a normalized branch distance calculation is computed. This value indicates how close the individual was to evaluating the branch predicate in the desired way. For example if a condition (x == y) needs to be executed as true, the branch distance is calculated using |x-y|. For the thermostat function in figure 1, and the partial aim where the node 6 must be executed, the fitness values are computed as follows. 
Individuals reaching node 4 and evaluating the branching condition heater_on as false receive an approximation level of 1 and a branch distance of 1 (1 - heater_on). On the other hand, individuals reaching node 5 but evaluating the condition as false receive an approximation level of zero, and a branch distance computed using the formula POWER_THRESHOLD - power. The value of the fitness function when the target is finally reached is therefore zero. With path-oriented coverage types, the approximation level is formed from the length of all identical path sections. Branch distances are taken at each point in which the flow of execution diverged from the intended path. These are then accumulated and normalized.
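A minimal Python sketch of this minimizing fitness computation for the "reach node 6" target follows; the threshold constants are illustrative (not given in the paper), and normalization of the branch distance is omitted for clarity:

```python
TEMP_THRESHOLD = 25    # illustrative constants, assumed for this sketch
POWER_THRESHOLD = 10

def fitness_node6(temp, power):
    """Minimizing fitness for the partial aim 'execute node 6' in Fig. 1:
    approximation level plus branch distance; zero when the target is hit."""
    heater_on = 1 if temp < TEMP_THRESHOLD else 0
    if not heater_on:
        # Diverged at node 4: one branching statement away from the target
        return 1 + (1 - heater_on)
    if not (power < POWER_THRESHOLD):
        # Diverged at node 5: approximation level 0, distance on the predicate
        return 0 + abs(POWER_THRESHOLD - power)
    return 0.0  # node 6 executed: target reached
```

Because the distance at node 5 shrinks smoothly as power approaches the threshold, the search receives a gradient to follow, whereas the flag-like condition at node 4 offers only a coarse step, which foreshadows the difficulties discussed in the next section.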
3 The State Problem

The state problem occurs with functions at higher system levels that exhibit state-like qualities by storing information in internal variables, which retain their values from one execution of the function under test to the next. Such variables are hidden from the optimization process because they are not available to external manipulation. The
only way to change the values of these variables is through execution of statements that perform assignments to them. If such variables can be described as state variables, then these assignments, or definitions, are the transitions of the underlying state machine.
  int thermostat (int temp, int pow) {
      int heater_on;
1     if (temp < TEMP_THRESHOLD)
2         heater_on = 1;
      else
3         heater_on = 0;
4     if (heater_on)
5         if (pow < POWER_THRESHOLD)
6             heater_on = 0;
7     return heater_on;
  }

Fig. 1. C code fragment of the simple thermostat example; the numbers in the left column are the control flow graph node numbers corresponding to program statements (the accompanying control flow graph is not reproduced here)
Figure 2 illustrates a function that is a controller for a smoke detector, the style of which mimics real-life code witnessed at our industrial partner. It takes three arguments: the first is the current room smoke level, followed by two arguments used as outputs to signal that the alarm should be switched on or off (the alarm works on a latch, whereby after an "on" signal is received the alarm stays on until it receives an "off" signal). The function is designed to be cycled once every second by the hardware. When the room smoke level is higher than a given threshold for a certain period of time, the alarm is raised (lines 13-14). When the room smoke level returns to safe levels for a given time, a special waiting flag becomes true (lines 17-18). The alarm then stays on for another 20 seconds (lines 21-22), unless the smoke levels breach acceptable limits again (lines 19-20). The static storage class is used to declare several local variables. This allows them to maintain their values after the function is executed; however, as they are internal to the function, they cannot be directly optimized by the evolutionary search process. Of course, states can also exist in system components. This can occur again through the use of the static storage class in C, or, in object-oriented languages, through class variables that are protected from external manipulation using access modifiers. In our work, however, the focus is only on test objects written in the C language. The next sections discuss the problems that states in test objects can cause for ET.
1   const double LEVEL = 0.3;
2   const int DANGER = 5, WAIT_TIME = 20;
3
4   void smoke_detector(double level, int* signal_on, int* signal_off)
5   {
6       static int time = 0, off_time = 0;
7       static int detected = 0, alarm_on = 0, waiting = 0;
8       time ++;
9       if (level > LEVEL && detected < DANGER)
10          detected ++;
11      else if (detected > 0)
12          detected --;
13      if (!alarm_on && detected == DANGER) {
14          alarm_on = 1; *signal_on = 1; }
15      if (alarm_on) {
16          if (!waiting && detected == 0) {
17              waiting = 1;
18              off_time = time + WAIT_TIME; }
19          if (waiting && detected == DANGER)
20              waiting = 0;
21          if (waiting && time > off_time) {
22              waiting = 0; alarm_on = 0; *signal_off = 1;}
23      }
24  }

Fig. 2. C code for the smoke detector example; line numbers appear in the left column
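To make the history-dependence concrete, here is a hypothetical Python port of the controller (class and method names are our own); five consecutive high-smoke cycles are needed before the "on" signal appears:

```python
LEVEL, DANGER, WAIT_TIME = 0.3, 5, 20

class SmokeDetector:
    """State variables mirror the C static locals of Fig. 2."""
    def __init__(self):
        self.time = 0
        self.off_time = 0
        self.detected = 0
        self.alarm_on = 0
        self.waiting = 0

    def step(self, level):
        """One one-second cycle; returns the (signal_on, signal_off) outputs."""
        signal_on = signal_off = 0
        self.time += 1
        if level > LEVEL and self.detected < DANGER:
            self.detected += 1          # the transitional statement (line 10)
        elif self.detected > 0:
            self.detected -= 1
        if not self.alarm_on and self.detected == DANGER:
            self.alarm_on, signal_on = 1, 1
        if self.alarm_on:
            if not self.waiting and self.detected == 0:
                self.waiting = 1
                self.off_time = self.time + WAIT_TIME
            if self.waiting and self.detected == DANGER:
                self.waiting = 0
            if self.waiting and self.time > self.off_time:
                self.waiting = self.alarm_on = 0
                signal_off = 1
        return signal_on, signal_off
```

Calling `step(0.9)` five times in a row raises the alarm only on the fifth call; no single input to the function can trigger it, which is exactly the sequence problem discussed next.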
3.1
Input Sequences
Partial aims that rely on the values of state variables often require input sequences before they become possible, because a series of transitions must be executed in order to set the state variables to the desired values. For the smoke detector example, the true branch from line 15 requires the detected variable to be equal to the DANGER threshold. This requires five calls to the function, each of which must execute the transitional statement on line 10, which increments the detected variable. Not until this has been done does the target become feasible. A problem with the generation of sequences is that it is generally impossible to determine automatically beforehand how long the required sequence is going to be, as different test objects can require radically different sequence lengths.

3.2
Disseminated Functionality
Where system functionality is dispersed across a series of components, it is possible that the function under test is not the function responsible for manipulating the state in the desired way. For these more complicated state problems, input sequences must be found to call different functions in a certain order. This is illustrated by the exam marks example in figure 3. The target is the true branch from the if statement in the
compare function (part (a)). This depends on the state of the module portrayed in part (b). Only when the function has_max in this module has been executed a certain number of times does the branch in compare become feasible.
3.3
Guidance to Transitional Statements
A sequence of function calls is not always enough to ensure that transitions will be invoked for test goals to become feasible. Extra guidance must be provided to find inputs that ensure that transitional statements are actually executed, and executed in the correct order. In the exam marks example, the incremental statement involving the num_top variable in the dependent module must be executed for the test goal in the compare function to become feasible (assuming last_year_top is greater than zero). This statement is unlikely to be executed at random, since the actual input value that will cause the statement to be reached (where mark is equal to 100) occupies a very small portion of the input domain.

The use of control variables can further complicate matters. Control variables model the fact that a system's behavior should change once certain events have occurred, as with the flags used in the smoke detector (alarm_on and waiting). Such variables are often implemented as Boolean variables, to indicate that the system is or is not in a given state, or as enumerations, to indicate that the system is in one of a collection of states. ET has further trouble with such variables, for the same reason that it has trouble with the closely related flag problem [3, 6]. The evolutionary search receives little guidance from branch distance calculations using these types of variables, due to their "all or nothing" nature. As state problems are dependent not only on the current input but also on the history of inputs, the problem is accentuated.

// ...
void compare()
{
    if (get_num_top() > last_year_top)
    { /* target branch */ }
}
// ...

(a) Function under test
const int MAX_MARK = 100;
static int num_top = 0;

void has_max(int mark)
{
    if (mark == MAX_MARK)
        num_top ++;
}

int get_num_top()
{
    return num_top;
}

(b) Dependent module
Fig. 3. C code for the exam marks example
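The difference in search guidance can be made concrete with branch distances. In the sketch below (function names are ours, not from the paper), a relational predicate such as mark == MAX_MARK yields a graded distance the search can descend, whereas a flag predicate yields only two values:

```c
#include <assert.h>
#include <stdlib.h>

/* Graded distance for the predicate (mark == target): shrinks as the
   candidate input approaches the satisfying value. */
int dist_eq(int mark, int target) { return abs(mark - target); }

/* Distance for a flag predicate if (flag): "all or nothing". */
int dist_flag(int flag) { return flag ? 0 : 1; }
```

Here dist_eq(97, 100) is 3 while dist_eq(99, 100) is 1, so the search can rank candidate inputs; dist_flag offers no such gradient, which is the essence of the flag problem.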
4
Possible Solutions
Solving state problems for ET would seem straightforward if a state machine specification of the test object existed. Unfortunately, such a model might not always be available, and even if it were, the required information may not be present.
Such representations tend to model control only (due to state explosion problems), whereas partial aims in ET may also depend on data states. An alternative is to identify transitional statements using data dependency analysis [1]. The chaining approach of Ferguson and Korel [5] is relevant here. The chaining approach was not specifically designed for state problems, but rather for generating structural test data for "problem" program nodes whose executing inputs could not be found using their local search algorithms (many nodes were declared problematic in their work, since these techniques often became stuck in local optima). As a secondary means of trying to change the flow of execution at the problem node, data dependency analysis was used to find previous program nodes that could somehow affect the outcome at the problem node. Through trial and error execution of sequences of these nodes, it was hoped that the outcome at the problem node would be changed.

In figure 1, execution of the if statement at node 4 as true may have been declared as problematic. The use of a flag variable produces fitness values of zero or one, which provides little guidance to the search for identifying input values for the true case in the event that current test data had led to the false case, or vice versa. In this situation the chaining approach would look for the last reachable definitions of variables used at the problem node. In this instance the variable heater_on is the only variable used at node 4, and its last reachable definitions can be found at nodes 2 and 3. Therefore two node sequences, known as event sequences, are generated: one requiring execution of node 2 before node 4, and the other requiring execution of node 3 before node 4. In order to execute node 2 in the sequence <s, 2, 4>, the true branch from node 1 must be taken, requiring the search for input data to execute the condition temp < TEMP_THRESHOLD as true.
This becomes the new goal of the local search. If the required test data to execute this branch is found, node 2 is reached and finally the condition at node 4 is also evaluated as true. In the case where a node leading to a last definition node also became problematic, the chaining approach procedure would recurse to find the last definition nodes for the variables used at these new problem nodes. This led to the generation of longer and longer sequences, until some satisfying sequence was found or some pre-imposed length limit was reached.

The use of the chaining approach to change execution at problem nodes is a useful concept that seems applicable to ET and the state problem. In the chaining approach, problem nodes are those for which input data could not be found with local search methods. With the state problem, difficult nodes are those that are not easily executed with ET, due to the use of internal state variables. The chaining approach found previous nodes that could change the outcome of the target node. For the ET state problem, transitional statements need to be found and executed that manipulate the state, as such guidance is not already provided by the fitness function.
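The figure 1 scenario can be sketched in C as follows (the node numbering in the comments is our reconstruction from the text, and the threshold value is illustrative):

```c
#include <assert.h>

#define TEMP_THRESHOLD 15.0   /* illustrative value */

static int target_hit = 0;

void controller(double temp) {
    int heater_on;                 /* flag: fitness is zero or one      */
    if (temp < TEMP_THRESHOLD)     /* node 1                            */
        heater_on = 1;             /* node 2: last reachable definition */
    else
        heater_on = 0;             /* node 3: last reachable definition */
    if (heater_on)                 /* node 4: problem node              */
        target_hit = 1;            /* target                            */
}
```

Executing the event sequence <s, 2, 4> amounts to finding a temp below the threshold: controller(10.0) reaches the target while controller(20.0) does not.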
5
Applying the Chaining Approach
Our system (from here on referred to as ET-State) aims to deal with test objects exhibiting state behavior whose structural elements could not be covered using traditional ET. The system identifies potential event sequences. Traditional ET then attempts to find the input sequence that will lead to both execution of the event
sequence, and, if the identified event sequence leads to the desired state, the structural target also. This is performed using path-oriented fitness functions, so that each transitional statement receives the required guidance that leads to its execution. Event sequences for state problems will potentially be much longer than those for problem nodes of functions solely dependent on the current input, as identified by the original chaining approach. As will be seen, our system also performs data dependency analysis that is more extensive than that of the original chaining approach, meaning that the number of nodes available to add to each sequence at each step of its construction is also greater. For complicated examples an exhaustive search of the chaining tree may not always be tractable. We plan to accommodate this by using a further stage of optimization to heuristically build and evaluate sequences, namely the ant colony system.

5.1 Ant Colony System (ACS)

Ant colony system (ACS) [2] is an optimizing technique inspired by the foraging behavior of real ants [7]. In ACS the problem is represented by a graph. Ants then incrementally construct solutions using this graph by applying problem-specific heuristics and by taking into account artificial pheromone deposited by previous ants on graph edges. This informs the ant of the performance of edges in previous solutions, since the edges belonging to the better solutions receive more pheromone. ACS is an attractive search technique for building event sequences for the state problem, since the incremental solution construction process allows for straightforward incorporation of data dependency procedures for identifying possible transitional program nodes.
The space of viable event sequences for a state problem is underpinned by the control and data dependencies of the underlying code structure, and since these factors can be taken into account, ants are prevented from exploring solutions that are unintelligent or infeasible from the outset.

Dorigo et al. [4] originally devised ant systems for the traveling salesman problem (TSP). Here, graph nodes correspond to cities in the problem. Initially, graph edges are initialized to some initial pheromone amount τ0. Then, in tmax cycles, m ants are placed on random cities and progressively construct tours through the use of a probabilistic transition rule. In this rule, a random number q, 0 ≤ q ≤ 1, is selected and compared with a tunable exploration parameter q0. If q ≤ q0, the most appealing edge in terms of heuristic desirability and pheromone concentration is always chosen; however, if q > q0, a transition probability rule is used. The probability that ant k goes from node i to node j while building its tour in cycle t is given by:
p_ij^k(t) = ( [τ_ij(t)] · [η_ij]^β ) / ( Σ_{l ∈ J_i^k} [τ_il(t)] · [η_il]^β )        (1)
where τ_ij(t) is the amount of pheromone on edge (i,j), β is an adjustable parameter that controls the importance of the heuristic relative to the pheromone on edges, and η_ij is a heuristic desirability value from city i to j. In the TSP, the desirability heuristic is simply the inverse of the distance from node i to j, making nearer cities more attractive than those further
away. Ants keep track of towns they have visited, and exclude these cities from future choices. Every time an edge is selected a local update is performed whereby some of the pheromone on that edge evaporates. This has the effect of making that edge slightly less attractive to other ants, so as to encourage exploration of other edges of the graph. A local update for an edge (i,j) is computed using the formula:
τ_ij(t) ← (1 − ρ) · τ_ij(t) + ρ · τ_0        (2)
where ρ is the pheromone decay coefficient 0 < ρ ≤ 1. After a cycle has finished, a global update of the graph takes place on the basis of the best solution found so far (i.e. the shortest tour). This encourages ants to search around the vicinity of the best solution in building future tours. The global update is performed for every edge (i,j) in the best tour using the formula:
τ_ij(t) ← (1 − ρ) · τ_ij(t) + ρ · Δτ_ij(t)        (3)
where Δτ_ij(t) is computed as the inverse of the tour distance. At the end of the last cycle the algorithm terminates and delivers the shortest tour found. In the following sections we outline how the basic ACS algorithm is adapted for solving state problems.

Problem Representation. The problem representation for the state problem is simply a graph of all program nodes linked to one another. However, in the state problem, program nodes can be visited more than once in a sequence. This means that arcs belonging to good solutions can also be reinforced more than once. In order to prevent a situation where a sub-path is reinforced to the point where ants begin to travel around the graph in infinite loops, the search space is extrapolated so that nodes in the graph correspond to the nth use of some program node in building an event sequence. This expansion of the search space can take place dynamically, so there is no need to pre-impose limits on the number of times each individual program node can be visited.

Solution Construction. At each stage of event sequence construction, each ant chooses from a subset of all program nodes, as identified by the data dependency analysis procedure. The original chaining approach built event sequences by working backwards from the target node. Data dependency analysis for a problem node ended at the last definitions of variables used at that node. For state problems, we extend the data dependency analysis by considering all nodes that could potentially affect the problem node (for example by analyzing variables used in assignments at last definitions, and so on). The algorithms for doing this are similar to those used to construct backward program slices [8].

Evaluation of Event Sequences. In the ET-State ant algorithm, the heuristic used to evaluate the desirability of including a node in the event sequence is also equivalent to the "goodness" of the entire event sequence including that node. In order to evaluate event sequences, they are first executed by traditional ET using path-oriented fitness functions. The fitness information from the best individual is then fed into ET-State. In most cases, the best individual found is likely to have a series of divergence points corresponding to nodes that depend on states. When adding a new node to an event sequence, ants use the inputs from the previous sequence to seed the first generation of the evolutionary algorithm for finding inputs leading to the execution of the new sequence.

Pheromone Updates. Local and global pheromone updates are handled in a similar fashion to ACS for the TSP; however, Δτ_ij(t) for global updates is computed using the fitness of the event sequence as evaluated with ET.

Termination Criteria. As the length of solutions for the state problem is not fixed, as it is for tours in the TSP, termination criteria are required to stop ants building infinitely long sequences. Two rules are used: a) an ant stops if it discovers a solution better than the current best found so far; b) ants may only explore sequences up to a certain number of nodes longer than the current best so far, or until they fail to improve their solution over the last n nodes.
5.2
Simple Initial Experiment
A simple initial experiment was run with a preliminary version of the system. The code used for this study can be seen in figure 4. The aim is for the ants to find an event sequence that executes the true branch from the if statement in fn_test (part (a)). The outcome of this branch predicate is dependent on functions in the module shown in part (b), which in turn depend on internal state variables. Four ants were used over ten cycles; the desirability heuristic exponent β was set to 2, the exploration parameter q0 was set to 0.75, and a pheromone decay ρ of 0.25 was used.

Sequences were built using backward dependency analysis. Therefore, when adding the first node to its sequence, an ant could only choose to execute fn_c. For the second node the ant could execute fn_c or fn_b, since the assignment statement for k in fn_c uses the variable j, which in turn is defined in fn_b. For every node after fn_b was called, any function from the dependent module could be used, since fn_b uses the variable i, which in turn is defined in fn_a. In the first cycle ants were prohibited from using any node more than once, so that the full definition-use chain from k through to i could potentially be explored.

The greedy desirability heuristic for the addition of nodes, and for evaluating entire sequences, was simply based on the branch distance for the true branch in fn_test. In this case, and in order to make the search more complicated, the ants were directed to find the shortest satisfying sequence once a satisfying sequence had been found, by simply rewarding shorter sequences (shorter sequences ultimately require less mental effort on the part of the human checking the results of the structural tests). Ants finished building their sequence if they had found a new best, or if they had made no improvement on their sequence over the last five nodes.

The results showed promise, as seen in table 1. In ten runs of the experiment, ants found a satisfying sequence in 2-3 cycles.
The shortest function call sequence found, <fn_b, fn_c, fn_c, fn_c, fn_test> (or a similar sequence of the same length), was ascertained in an average of 4.9 cycles.
// ...
const int target = 32;

void fn_test()
{
    if (fn_c() == target)
    { /* target branch */ }
}
// ...

(a) Function under test
static int i = 0, j = 0, k = 0;

void fn_a() { i ++; }
void fn_b() { j += i * 2; }
int  fn_c() { k += j * 2; return k; }

(b) Dependent module
Fig. 4. C code for the initial experiment
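A satisfying sequence for figure 4 can be checked by hand with a re-implementation of the module. The particular sequence used below is one we derived ourselves for illustration; it is not necessarily the one the ants reported:

```c
#include <assert.h>

/* Re-implementation of the figure 4 code for experimentation. */
static int i = 0, j = 0, k = 0;
static int target_hit = 0;
static const int target = 32;

void fn_a(void) { i++; }
void fn_b(void) { j += i * 2; }
int  fn_c(void) { k += j * 2; return k; }
void fn_test(void) { if (fn_c() == target) target_hit = 1; }
```

The sequence <fn_a, fn_a, fn_b, fn_b, fn_c, fn_test> drives k to 16 and then, via the fn_c call embedded in fn_test, to exactly 32, hitting the target branch.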
Table 1. Results for the initial experiment
                                           Average   Best   Worst
Satisfying Event Sequence Found              2.1        2      3
Shortest Satisfying Event Sequence Found     4.9        3      7
(No of ants: 4, No of trials: 10)

6
Conclusions and Future Work
This paper has introduced the state problem for evolutionary testing, showing that state variables can hinder, or render impossible, the search for test data. State variables can be dependent on the input history of the test object, as well as just the current input. They do not form part of the input domain of the test object, and therefore cannot be optimized by the ET search process. The only way to control these variables is to attempt to execute the statements that perform assignments to them, that is, the statements forming the transitions of the underlying state machine of the system. Such statements can be identified by data dependency analysis. The use of ant colony algorithms is proposed as a means of heuristically searching for and evaluating sequences of transitional statements, in order to find one that makes the current test goal possible.

The next step in this work is the complete implementation of the system, which will then be tried on a set of industrial examples. These experiments will be considered successful if coverage of structural elements involving states is achieved that was previously unobtainable using ET or even random testing alone. If not all structural elements involving states can be covered, further analysis will take place to understand the features of these programs that still cause problems.
Acknowledgements. This work is sponsored by DaimlerChrysler Research and Technology. The authors have benefited greatly from the discussion of this work with Joachim Wegener, André Baresel and other members of the ET group at DaimlerChrysler, along with members of the EPSRC-funded FORTEST and SEMINAL networks.
References

1. Aho A., Sethi R., Ullman J. D.: Compilers: Principles, Techniques and Tools. Addison-Wesley (1986)
2. Bonabeau E., Dorigo M., Theraulaz G.: Swarm Intelligence. Oxford University Press (1999)
3. Bottaci L.: Instrumenting Programs with Flag Variables for Test Data Search by Genetic Algorithm. Proceedings of the Genetic and Evolutionary Computation Conference, New York, USA (2002)
4. Dorigo M., Maniezzo V., Colorni A.: Ant System: An Autocatalytic Optimizing Process. Technical Report No. 91-016, Politecnico di Milano, Italy (1991)
5. Ferguson R., Korel B.: The Chaining Approach for Software Test Data Generation. ACM Transactions on Software Engineering and Methodology, Vol. 5, No. 1, pp. 63-86 (1996)
6. Harman M., Hu L., Hierons R., Baresel A., Sthamer H.: Improving Evolutionary Testing by Flag Removal. Proceedings of the Genetic and Evolutionary Computation Conference, New York, USA (2002)
7. Goss S., Aron S., Deneubourg J. L., Pasteels J. M.: Self-Organized Shortcuts in the Argentine Ant. Naturwissenschaften, Vol. 76, pp. 579-581 (1989)
8. Tip F.: A Survey of Program Slicing Techniques. Journal of Programming Languages, Vol. 3, No. 3, pp. 121-189 (1995)
9. Tracey N., Clark J., Mander K.: Automated Flaw Finding using Simulated Annealing. International Symposium on Software Testing and Analysis, pp. 73-81 (1998)
10. Wegener J., Baresel A., Sthamer H.: Evolutionary Test Environment for Automatic Structural Testing. Information and Software Technology, Vol. 43, pp. 841-854 (2001)
11. Wegener J., Buhr K., Pohlheim H.: Automatic Test Data Generation for Structural Testing of Embedded Software Systems by Evolutionary Testing. Proceedings of the Genetic and Evolutionary Computation Conference, New York, USA (2002)
12. Wegener J., Grochtmann M.: Verifying Timing Constraints of Real-Time Systems by Means of Evolutionary Testing. Real-Time Systems, Vol. 15, pp. 275-298 (1998)
Modeling the Search Landscape of Metaheuristic Software Clustering Algorithms

Brian S. Mitchell and Spiros Mancoridis

Department of Computer Science, Drexel University, Philadelphia PA 19104, USA
{bmitchell, spiros.mancoridis}@drexel.edu
http://www.mcs.drexel.edu/~{bmitchel,smancori}
Abstract. Software clustering techniques are useful for extracting architectural information about a system directly from its source code structure. This paper starts by examining the Bunch clustering system, which uses metaheuristic search techniques to perform clustering. Bunch produces a subsystem decomposition by partitioning a graph formed from the entities (e.g., modules) and relations (e.g., function calls) in the source code, and then uses a fitness function to evaluate the quality of the graph partition. Finding the best graph partition has been shown to be an NP-hard problem, thus Bunch attempts to find a sub-optimal result that is "good enough" using search algorithms. Since the validation of software clustering results is often overlooked, we propose an evaluation technique based on the search landscape of the graph being clustered. By gaining insight into the search landscape, we can determine the quality of a typical clustering result. This paper defines how the search landscape is modeled and how it can be used for evaluation. A case study that examines a number of open source systems is presented.
1
Introduction and Background
Since many software systems are large and complex, appropriate abstractions of their structure are needed to simplify program understanding. Because the structure of software systems is usually not documented accurately, researchers have expended a great deal of effort studying ways to recover design artifacts from source code. For small systems, source code analysis tools [3,9] can extract the source-level components (e.g., modules, classes, functions) and relations (e.g., method invocation, function calls, inheritance) of a software system. However, for large systems, these tools are, at best, useful for studying only specific areas of the system. For large systems there is significant value in identifying the abstract (high-level) entities, and then modeling them using architectural components such as subsystems and their relations. Subsystems provide developers with structural information about the numerous software components, their interfaces, and their interconnections.

E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 2499-2510, 2003.
© Springer-Verlag Berlin Heidelberg 2003

Subsystems generally consist of a collection of collaborating
source code resources that implement a feature or provide a service to the rest of the system. Typical resources found in subsystems include modules, classes, and possibly other subsystems. The entities and relations needed to represent software architectures are not found in the source code. Thus, without external documentation, we seek other techniques to recover a reasonable approximation of the software architecture using source code information. Researchers in the reverse engineering community have applied a variety of software clustering approaches to address this problem. These techniques determine clusters (subsystems) using source code component similarity [18,4,17], concept analysis [10,5,1], or information available from the system implementation such as module, directory, and/or package names [2].

In this paper we examine the Bunch software clustering system [12,11,13]. Unlike the other software clustering techniques described in the literature, Bunch uses search techniques to determine the subsystem hierarchy of a software system, and has been shown to produce good results for many different types of systems. Additionally, Bunch works well on large systems [16] such as swing and linux, and some work has been done on evaluating the effectiveness of the tool [14,15]. Bunch starts by generating a random solution from the search space, and then improves it using search algorithms. Bunch rarely produces exactly the same result on repeated runs of the tool. However, its results are very similar; this similarity can be validated by inspection and by using similarity measurements [14]. Furthermore, as we will demonstrate later in this paper, the probability of finding a good solution by random selection is extremely small, so Bunch appears to be producing a family of related solutions. These observations led us to ask why Bunch produces similar results so consistently.
The key to our approach will be to model the search landscape of each system undergoing clustering, and then to analyze how Bunch produces results within this landscape.
2
Software Clustering with Bunch
The goal of the software clustering process is to partition a graph of the source-level entities and relations into a set of clusters such that the clusters represent subsystems. Since graph partitioning is known to be NP-hard [7], obtaining a good solution by random selection or exhaustive exploration of the search space is unlikely. Bunch overcomes this problem by using metaheuristic search techniques.

Figure 1 illustrates the clustering process used by Bunch. In the preparation phase, source code analysis tools are used to parse the code and build a repository of information about the entities and relations in the system. A series of scripts are then executed to query the repository and create the Module Dependency Graph (MDG). The MDG is a graph where the source code components are modeled as nodes, and the source code relations are modeled as edges. Once the MDG is created, Bunch generates a random partition of the MDG and evaluates the "quality" of this partition using a fitness function that is called
Fig. 1. Bunch's Software Clustering Process (preparation, clustering, and analysis & visualization phases)
Modularization Quality (MQ) [16]. MQ is designed to reward cohesive clusters and penalize excessive inter-cluster coupling. MQ increases as the number of intraedges (i.e., internal edges contained within a cluster) increases and the number of interedges (i.e., external edges that cross cluster boundaries) decreases. Given that the fitness of an individual partition can be measured, metaheuristic search algorithms are used in the clustering phase of Figure 1 in an attempt to improve the MQ of the randomly generated partition. Bunch implements several hill-climbing algorithms [12,16] and a genetic algorithm [12]. Once Bunch's search algorithms converge, the result can be viewed as XML [8] or using the dotty [6] graph visualization tool.

2.1
Clustering Example with Bunch
This section presents an example illustrating how Bunch can be used to cluster the JUnit system. JUnit is an open-source unit testing tool for Java, and can be obtained online from http://www.junit.org. JUnit contains four main packages: the framework itself, the user interface, a test execution engine, and various extensions for integrating with databases and J2EE. The JUnit system contains 32 classes and has 55 dependencies between the classes. For the purpose of this example, we limit our focus to the framework package. This package contains 7 classes, and 9 inter-class relations. All remaining dependencies to and from the other packages have been collapsed into a single relation to simplify the visualization. Figure 2 animates the process that Bunch used to partition the MDG of JUnit into subsystems. In the left corner of this figure we show the MDG of JUnit. Step 2 illustrates the random partition generated by Bunch as a starting point in the search for a solution. Since the probability of generating a “good” random partition is small, we expect the random partition to be a low quality
solution. This intuition is validated by inspecting the random partition, which contains 2 singleton clusters and a disproportionately large number of interedges (i.e., edges that cross subsystem boundaries). The random partition shown in step 2 is converted by Bunch into the final result shown in step 3 of Figure 2. The solution proposed by Bunch has many good characteristics, as it groups the test case modules, the test result modules, and the runtime validation modules into clusters. The overall result shown in step 3 is a good result, but Bunch's search algorithms are not guaranteed to produce exactly the same solution for every run. Thus, we would like to gain confidence in the stability of Bunch's clustering algorithms by analyzing the search landscape associated with each MDG.

Fig. 2. JUnit Example: (1) the JUnit MDG; (2) a random partition of the JUnit MDG; (3) JUnit's MDG after clustering
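The informal notion of partition quality used in this example (rewarding intraedges, penalizing interedges) can be sketched as a simple intraedge ratio. Note that this is an illustrative stand-in of our own, not the published MQ definition [16]:

```c
#include <assert.h>

/* cluster[v] holds the cluster id of module v; edge[e] is a pair of
   module ids. Returns the fraction of edges that are intraedges. */
double intraedge_ratio(const int cluster[], const int (*edge)[2], int n_edges) {
    int intra = 0;
    for (int e = 0; e < n_edges; e++)
        if (cluster[edge[e][0]] == cluster[edge[e][1]])
            intra++;
    return n_edges > 0 ? (double)intra / n_edges : 0.0;
}
```

A random partition typically scores low on such a measure, matching the observation that it contains a disproportionately large number of interedges.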
3
Modeling the Search Landscape
This section presents an approach to modeling the search landscape of software clustering algorithms. The search landscape is modeled using a series of views, because software clustering results have many dimensions, and combining too much detail into a single view is confusing. The search landscape is examined from two different perspectives: the first examines the structural aspects of the search landscape, and the second focuses on the similarity aspects of the landscape.

3.1
The Structural Search Landscape
The structural search landscape highlights similarities and differences in a collection of clustering results by identifying trends in the structure of graph partitions. Thus, the goal of the structural search landscape is to validate the following hypotheses:

– We expect to see a relationship between MQ and the number of clusters. Both MQ and the number of clusters in the partitioned MDG should not vary widely across the clustering runs.
– We expect a good result to produce a high percentage of intraedges (edges that start and end in the same cluster) consistently. – We expect repeated clustering runs to produce similar MQ results. – We expect that the number of clusters remains relatively consistent across multiple clustering runs.
3.2
The Similarity Search Landscape
The similarity search landscape focuses on modeling the extent of similarity across all of the clustering results. For example, given an edge <u, v> from the MDG, we can determine, for a given clustering run, whether modules u and v are in the same or different clusters. If we expand this analysis to look across many clustering runs, we would like to see modules u and v consistently appearing (or not appearing) in the same cluster for most of the clustering runs. The other possible relationship between modules u and v is that they sometimes appear in the same cluster and other times appear in different clusters. This result would convey a sense of dissimilarity with respect to these modules, as the <u, v> edge drifts between being an interedge and an intraedge.
4
Case Study
This section describes a case study illustrating the effectiveness of using the search landscape to evaluate Bunch's software clustering results. Table 1 describes the systems used in this case study, which consist of 7 open source systems and 6 randomly generated MDGs. Mixing the random graphs with real software graphs enables us to compare how Bunch handles real versus randomly generated graphs.

Table 1. Application descriptions

Application  Modules  Relations
Name         in MDG   in MDG     Description
Telnet           28        81    Terminal emulator
PHP              62       191    Internet scripting language
Bash             92       901    Unix terminal environment
Lynx            148     1,745    Text-based HTML browser
Bunch           220       764    Software clustering tool
Swing           413     1,513    Standard Java user interface framework
Kerberos5       558     3,793    Security services infrastructure
Rnd5            100       247    Random graph with 5% edge density
Rnd50           100     2,475    Random graph with 50% edge density
Rnd75           100     3,712    Random graph with 75% edge density
Bip5            100       247    Random bipartite graph with 5% edge density
Bip50           100     2,475    Random bipartite graph with 50% edge density
Bip75           100     3,712    Random bipartite graph with 75% edge density
B.S. Mitchell and S. Mancoridis

[Figure 3: for each open source system (Telnet, PHP, Bash, Lynx, Bunch, Swing, Kerberos5), four panels plot cluster count vs. MQ, intraedge percentage (|IntraEdges|/|E|)% vs. sample number, MQ vs. sample number, and cluster count vs. sample number; black = Bunch results, gray = random partitions.]
Fig. 3. The Structural Search Landscape for the Open Source Systems
Some explanation is necessary to describe how the random MDGs were generated. Each MDG used in this case study consists of 100 modules, and has a name that ends with a number. This number is the edge density of the MDG.
[Figure 4: the same four structural views for the random MDGs (Rnd5, Rnd50, Rnd75, Bip5, Bip50, Bip75); black = Bunch results, gray = random partitions.]
Fig. 4. The Structural Search Landscape for the Random MDGs
For example, a 5% edge density means that the graph will have an edge count of 0.05 * (n(n - 1)/2), which is 5% of the total possible number of edges that can exist for a graph containing n modules. Each system in Table 1 was clustered 100 times using Bunch's default settings.1 It should also be noted that although the individual clustering runs are independent, the landscapes are plotted in order of increasing MQ. This sorting highlights some results that would not be obvious otherwise.

1 An analysis of the impact of altering Bunch's clustering parameters was done in a 2002 GECCO paper [16].
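The density calculation above can be turned into a small generator. The sketch below is our own reconstruction for illustration, not Bunch's actual graph generator; truncating the edge budget to an integer reproduces the 247 / 2,475 / 3,712 edge counts listed in Table 1 for the 100-module random graphs.

```python
import itertools
import random

def random_mdg(n, density, seed=None):
    """Build a random MDG with int(density * n*(n-1)/2) edges, i.e.
    roughly `density` of all possible edges over n modules.
    Illustrative reconstruction, not Bunch's actual generator."""
    rng = random.Random(seed)
    modules = [f"m{i}" for i in range(n)]
    possible = list(itertools.combinations(modules, 2))
    k = int(density * n * (n - 1) / 2)   # edge budget for this density
    return rng.sample(possible, k)       # k distinct edges

# n=100 at 5% density yields 247 edges, matching Rnd5 in Table 1
assert len(random_mdg(100, 0.05, seed=0)) == 247
```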
4.1
The Structural Landscape
Figure 3 shows the structural search landscape of the open source systems, and Figure 4 illustrates the structural search landscape of the random graphs used in this study. The results produced by Bunch appear to have many consistent properties. This observation can be supported by examining the results shown in Figures 3 and 4:
– By examining the views that compare the cluster counts (i.e., the number of clusters in the result) to the MQ values (far left), we notice that Bunch tends to converge to one or two "basins of attraction" for all of the systems studied. Also, for the real software systems, these attraction areas appear to be tightly packed. For example, the php system has a point of concentration where all of the clustering results are packed between MQ values of 5.79 and 6.11. The number of clusters in the results is also tightly packed, ranging from a minimum of 14 to a maximum of 17 clusters. An interesting observation can be made when examining the random systems with a higher edge density (i.e., rnd50, rnd75, bip50, bip75): although these systems converged to a consistent MQ, the number of clusters varied significantly over all of the clustering runs. We observe these wide ranges in the number of clusters, with little change in MQ, for most of the random systems, but do not see this characteristic in the real systems.
– The view that shows the percentage of intraedges in the clustering results (second column from the left) indicates that Bunch produces consistent solutions that have a relatively large percentage of intraedges. Also, since the 100 samples were sorted by MQ, we observe that the intraedge percentage increases as the MQ values increase. Inspecting all of the graphs for this view suggests that the probability of selecting a random partition (gray data points) with a high intraedge percentage is very low.
– The third view from the left shows the MQ value for the initial random partition, and the MQ value of the partition produced by Bunch.
The samples are sorted and displayed by increasing MQ value. Interestingly, the clustered results produce a relatively smooth line, with points of discontinuity corresponding to different "basins of attraction". Another observation is that Bunch generally improves MQ much more for the real software systems than for the random systems with a high edge density (rnd50, rnd75, bip50, bip75).
– The final view (far right column) compares the number of clusters produced by Bunch (black) with the number of clusters in the random starting point (gray). This view indicates that the random starting points appear to have a uniform distribution with respect to the number of clusters. We expect the random starting partitions to have from 1 cluster (i.e., a single cluster containing all of the nodes) to N clusters (i.e., each node placed in its own cluster). The y-axis is scaled to the total number of modules in the system (see Table 1). This view shows that Bunch always converges to a "basin of attraction" regardless of the number of clusters in the random starting point. This view also supports
the claim made in the first view that the standard deviation of the cluster counts appears to be smaller for the real systems than for the random systems. When the structural views are examined collectively, the degree of commonality between the landscapes of the systems in the case study is striking. We do not know exactly why Bunch converges this way, although we speculate that this positive result may stem from a good design property of the MQ fitness function. Section 2 described how the MQ function works on the premise of maximizing intraedges while, at the same time, minimizing interedges. Since the results converge to similar MQ values, we speculate that the search space contains a large number of isomorphic configurations that produce similar MQ values. Once Bunch encounters one of these areas, its search algorithms cannot find a way to transform the current partition into a new partition with higher MQ.
4.2
The Similarity Landscape
The main observation from analyzing the structural search landscapes is that the results produced by Bunch are stable. For all of the MDGs we observe similar characteristics, but we were troubled by the similarity of the search landscapes for real software systems when compared to the landscapes of the random graphs. We expected the random graphs to produce different results from the real software systems. In order to investigate the search landscape further, we measured the degree of similarity of the placement of nodes into clusters across all of the clustering runs, to see if there were any differences between random graphs and real software systems. Bunch creates a subsystem hierarchy, where the lower levels contain detailed clusters, and higher levels contain clusters of clusters. Because each subsystem hierarchy produced by Bunch may have a different height,2 we decided to measure the similarity between multiple clustering runs using the initial (most detailed) clustering level. The procedure used to determine the similarity between a series of clustering runs works as follows:
1. Create a counter C for each edge <u, v> in the MDG; initialize the counter to zero for all edges.
2. For each clustering run, take the lowest level of the clustering hierarchy and traverse the set of edges in the MDG:
– If the edge <u, v> is an intraedge, increment the counter C associated with that edge.

2 The height of the subsystem hierarchy produced by Bunch generally does not differ by more than 3 levels for a given system, but the heights of the hierarchies for different systems may vary dramatically.
3. Given that there are N clustering runs, each counter C will have a final value in the range 0 ≤ C ≤ N. We then normalize each C by dividing by N, which gives the percentage P of clustering runs in which the edge <u, v> appears in the same cluster.
4. The P values are then aggregated into the ranges {[0,0], (0,10], (10,75), [75,100]}, which correspond to zero, low, medium, and high similarity, respectively. To allow comparison across different systems, the frequencies are normalized into percentages by dividing each value in the range set by the total number of edges in the system (|E|).
Applying the above process across multiple clustering runs enables the overall similarity of the results to be studied. For example, having large zero and high similarity values is good, as these highlight edges that either never or always appear in the same cluster. The low similarity measure captures the percentage of edges that appear together in a cluster no more than 10% of the time, which is also desirable. However, a large medium similarity measure is undesirable, since it indicates that many of the edges in the system appear as both inter- and intraedges across the clustering results.

Table 2. The Similarity Landscape of the Case Study Systems

                Edge               Edge Similarity Percentage (S)
Application  Density    Zero (%)       Low (%)        Medium (%)    High (%)
Name         Percent    (S = 0%)   (0% < S ≤ 10%)  (10% < S < 75%)  (S ≥ 75%)
Telnet         21.42      34.56        27.16            13.58         24.70
PHP            10.10      48.16        19.18            11.70         20.96
Bash           21.52      58.15        22.86             6.32         12.67
Lynx           16.04      49.28        30.64             8.99         11.09
Bunch           3.17      48.70        20.23             9.36         21.71
Swing           1.77      61.53        13.81             9.84         14.82
Kerberos5       2.44      57.55        19.06             9.75         13.64
Rnd5            5.00      12.14        54.65            24.71          8.50
Rnd50          50.00      32.46        37.97            29.49          0.08
Rnd75          75.00      33.80        30.39            35.81          0.00
Bip5            5.00      37.27        20.97            13.46         28.30
Bip50          50.00      47.21        23.89            28.60          0.30
Bip75          75.00      29.90        38.36            31.74          0.00
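The four-step similarity procedure above can be sketched as follows; the function and bucket names are ours.

```python
from collections import Counter

def similarity_distribution(edges, runs):
    """Steps 1-4 above: count, per edge, how often it is an intraedge
    across N clustering runs, bucket the resulting percentages into
    zero/low/medium/high similarity, and normalize by |E|.

    edges -- list of (u, v) edges in the MDG
    runs  -- list of partitions, each a dict module -> cluster id
    """
    n = len(runs)
    count = Counter()                      # step 1: counter C per edge
    for partition in runs:                 # step 2: tally intraedges
        for u, v in edges:
            if partition[u] == partition[v]:
                count[(u, v)] += 1

    buckets = {"zero": 0, "low": 0, "medium": 0, "high": 0}
    for e in edges:
        p = 100.0 * count[e] / n           # step 3: normalize to a %
        if p == 0:
            buckets["zero"] += 1
        elif p <= 10:
            buckets["low"] += 1
        elif p < 75:
            buckets["medium"] += 1
        else:
            buckets["high"] += 1
    # step 4: express each bucket as a percentage of |E|
    return {k: 100.0 * v / len(edges) for k, v in buckets.items()}
```

An edge whose endpoints are co-clustered in every run lands in the high bucket; one whose endpoints are never co-clustered lands in zero.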
Now that the approach for measuring similarity has been described, Table 2 presents the similarity distribution for the systems used in the case study. This table presents another interesting view of the search landscape, as it exhibits characteristics that differ between the random and real software systems. Specifically:
– The real systems tend to have large values for the zero and high categories, while the random systems score lower in these categories. This indicates
that the results for the real software systems have more in common than the results for the random systems do.
– The random systems tend to have much higher values for the medium category, further indicating that the similarity of the results produced for the real systems is better than for the random systems.
– The real systems have relatively low edge densities. The swing system is the most sparse (1.77%), and the bash system is the most dense (21.52%). When we compare the real systems to the random systems, we observe a higher degree of similarity between the sparse random graphs and the real systems than between the real systems and the dense random graphs (rnd50, rnd75, bip50, bip75).
– It is noteworthy that the dense random graphs typically have very low numbers in the high category, indicating that it is very rare for the same edge to appear as an intraedge from one run to another. This result is especially interesting considering that the MQ values presented in the structural landscape change very little for these random graphs. This outcome also supports the isomorphic "basin of attraction" conjecture proposed in the previous section, and the observation that the number of clusters in the random graphs varies significantly.
5
Conclusions
Ideally, the results produced by Bunch could be compared to an optimal solution, but this is not possible since the graph partitioning problem is NP-hard. User feedback has shown that the results produced by Bunch are "good enough" for assisting developers performing program comprehension and software maintenance activities; however, work on investigating why Bunch produces consistent results had not been performed until now. Through the use of a case study, we highlighted several aspects of Bunch's clustering results that would not have been obvious by examining individual clustering results. We also gained some intuition about why the results produced by Bunch have several common properties, regardless of whether the MDGs were real or randomly generated. A final contribution of this paper is that it demonstrates that modeling the search landscape of metaheuristic search algorithms is a good technique for gaining insight into the solution quality of these types of algorithms.

Acknowledgements. This research is sponsored by grants CCR-9733569 and CISE-9986105 from the National Science Foundation (NSF). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.
2510
B.S. Mitchell and S. Mancoridis
References
1. Anquetil, N.: A Comparison of Graphs of Concept for Reverse Engineering. In Proc. Intl. Workshop on Program Comprehension, June 2000.
2. Anquetil, N., Lethbridge, T.: Recovering Software Architecture from the Names of Source Files. In Proc. Working Conf. on Reverse Engineering, October 1999.
3. Chen, Y-F.: Reverse engineering. In B. Krishnamurthy, editor, Practical Reusable UNIX Software, chapter 6, pages 177–208. John Wiley & Sons, New York, 1995.
4. Choi, S., Scacchi, W.: Extracting and Restructuring the Design of Large Systems. In IEEE Software, pages 66–71, 1999.
5. van Deursen, A., Kuipers, T.: Identifying Objects using Cluster and Concept Analysis. In Proc. International Conference on Software Engineering (ICSE'99), pages 246–255. IEEE Computer Society, May 1999.
6. Gansner, E.R., Koutsofios, E., North, S.C., Vo, K.P.: A Technique for Drawing Directed Graphs. IEEE Transactions on Software Engineering, 19(3):214–230, March 1993.
7. Garey, M.R., Johnson, D.S.: Computers and Intractability. W.H. Freeman, 1979.
8. GXL: Graph eXchange Language: Online Guide. http://www.gupro.de/GXL/.
9. Korn, J., Chen, Y-F., Koutsofios, E.: Chava: Reverse Engineering and Tracking of Java Applets. In Proc. Working Conference on Reverse Engineering, October 1999.
10. Lindig, C., Snelting, G.: Assessing Modular Structure of Legacy Code Based on Mathematical Concept Analysis. In Proc. International Conference on Software Engineering, May 1997.
11. Mancoridis, S., Mitchell, B.S., Chen, Y-F., Gansner, E.R.: Bunch: A Clustering Tool for the Recovery and Maintenance of Software System Structures. In Proc. International Conference on Software Maintenance, pages 50–59, August 1999.
12. Mancoridis, S., Mitchell, B.S., Rorres, C., Chen, Y-F., Gansner, E.R.: Using Automatic Clustering to Produce High-Level System Organizations of Source Code. In Proc. 6th Intl. Workshop on Program Comprehension, June 1998.
13. Mitchell, B.S.: A Heuristic Search Approach to Solving the Software Clustering Problem. PhD thesis, Drexel University, Philadelphia, PA, USA, 2002.
14. Mitchell, B.S., Mancoridis, S.: Comparing the Decompositions Produced by Software Clustering Algorithms using Similarity Measurements. In Proc. International Conference on Software Maintenance, November 2001.
15. Mitchell, B.S., Mancoridis, S.: CRAFT: A Framework for Evaluating Software Clustering Results in the Absence of Benchmark Decompositions. In Proc. Working Conference on Reverse Engineering, October 2001.
16. Mitchell, B.S., Mancoridis, S.: Using Heuristic Search Techniques to Extract Design Abstractions from Source Code. In Proc. Genetic and Evolutionary Computation Conference (GECCO), 2002.
17. Müller, H., Orgun, M., Tilley, S., Uhl, J.: A Reverse Engineering Approach to Subsystem Structure Identification. Journal of Software Maintenance: Research and Practice, 5:181–204, 1993.
18. Schwanke, R., Hanson, S.: Using Neural Networks to Modularize Software. Machine Learning, 15:137–168, 1998.
Search Based Transformations
Deji Fatiregun, Mark Harman, and Robert Hierons
Department of Information Systems and Computing, Brunel University, Uxbridge, Middlesex, UB8 3PH
[email protected]
1
Introduction
Program transformation [1,2,3] can be described as the act of changing one program into another. The aim is to alter the program's syntax but not its semantics, leaving the source and target programs functionally equivalent. Consider as examples of transformations the following program fragments:

T1: if (e1) s1; else s2;                  ⇒  if (!e1) s2; else s1;
T2: if (true) s1; else s2;                ⇒  s1;
T3: x = 2; x = x - 1; y = 10; x = x + 1;  ⇒  x = 2; y = 10;
T4: for (s1; e2; s2) s3;                  ⇒  s1; while (e2) { s3; s2; }
Program transformations are generally written in order to generate better programs. In transformation, we apply a number of simple transformation axioms to parts of a program's source code to obtain a functionally equivalent program. The application of these axioms is treated as a search problem, and we apply a meta-heuristic search algorithm such as hill climbing to guide the direction of the search.
1.1
The Transformation Problem
An overall program transformation from one program p to an improved version p′ typically consists of many smaller transformation tactics [1,2]. Each tactic consists of the application of a set of transformation rules. A transformation rule is an atomic transformation capable of performing simple alterations like those captured in the examples T1 . . . T4. At each stage in the application of these simple rules, there are many points in the program at which a chosen transformation rule could be applied; typically there is one per node of the Control Flow Graph. The set of pairs of possible transformation rules and their corresponding application points is therefore large. Furthermore, to achieve an effective overall program transformation tactic, many rules may need to be applied, and each will have to be applied in the correct order to achieve the desired result.

E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 2511–2512, 2003. © Springer-Verlag Berlin Heidelberg 2003
2
Local Search-Based Transformation
Meta-heuristic search algorithms such as hill climbing may be applied to arrive at an optimal result or, at least, a locally optimal one. Rather than apply the transformations manually, one after the other, we allow the algorithm to pick the best transformations to apply from a given set. We assume a search space containing all the possible allowable transformation rules and define our fitness function using an existing software metric [5,4], for example the size of the program in Lines of Code (LoC). An optimal solution would be the sequence of transformations that results in an equivalent program with the fewest possible number of statements. For instance, in the examples used in the introduction, transformation T3 clearly shows a reduction in the size of the program from 4 nodes to 2 nodes, and so would be selected as one which returned a better program than, for example, the identity transformation. Using a simple hill-climbing search algorithm and a size-based metric such as LoC, after a particular transformation is applied, the size of the new program is compared with that of the previous program. Any transformation which reduces size is retained, and the search continues from the new, smaller program. When no smaller program is found by the application of a rule, the search terminates.
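The greedy loop just described can be sketched as follows. The `hill_climb` driver and the toy `cancel_inc_dec` rule (a T3-style removal of a decrement/increment pair, assuming x is not read in between) are our own illustrations, not the authors' implementation.

```python
def hill_climb(program, rules, size):
    """Greedily apply transformation rules: accept any application
    that yields a smaller program, rescan from the new program, and
    stop when no rule at any point improves the size metric."""
    improved = True
    while improved:
        improved = False
        for rule in rules:
            for point in range(len(program)):
                candidate = rule(program, point)
                if candidate is not None and size(candidate) < size(program):
                    program = candidate   # keep the smaller program
                    improved = True
                    break                 # rescan from the new program
            if improved:
                break
    return program

def cancel_inc_dec(program, point):
    """Toy rule in the spirit of T3: drop an `x = x - 1;` that a later
    `x = x + 1;` undoes (assumes x is not read in between)."""
    if program[point] != "x = x - 1;":
        return None
    rest = program[point + 1:]
    if "x = x + 1;" not in rest:
        return None
    j = point + 1 + rest.index("x = x + 1;")
    return program[:point] + program[point + 1:j] + program[j + 1:]

prog = ["x = 2;", "x = x - 1;", "y = 10;", "x = x + 1;"]
assert hill_climb(prog, [cancel_inc_dec], size=len) == ["x = 2;", "y = 10;"]
```

On the T3 program the rule fires once, the 4-statement program shrinks to 2 statements, and a second scan finds no further improvement, so the search terminates.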
3
Evolutionary Search-Based Transformation
Our approach is to use the transformation sequence to be applied to the program as the individual to be optimised. Using the transformation sequence as the individual makes it possible to define crossover relatively easily. Two sequences of transformations can be combined to exchange information, using single point, multiple point and uniform crossover. The result is a valid transformation sequence, and since all transformation rules are meaning preserving, so are all sequences of transformation rules.
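A single-point variant of this crossover might look like the sketch below (our illustration; the representation of an individual as a list of (rule name, application point) pairs is an assumption). Any recombined sequence remains meaning-preserving because every constituent rule is.

```python
import random

def single_point_crossover(parent_a, parent_b, rng=None):
    """Recombine two transformation sequences at one cut point.
    Each sequence is a list of (rule_name, program_point) pairs."""
    rng = rng or random.Random()
    cut = rng.randrange(1, min(len(parent_a), len(parent_b)))
    child_1 = parent_a[:cut] + parent_b[cut:]
    child_2 = parent_b[:cut] + parent_a[cut:]
    return child_1, child_2
```

Because each gene goes to exactly one child, the two children together carry exactly the genes of the two parents.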
References
1. I. D. Baxter. Transformation systems: Domain-oriented component and implementation knowledge. In Proceedings of the Ninth Workshop on Institutionalizing Software Reuse, Austin, TX, USA, January 1999.
2. Keith H. Bennett. Do program transformations help reverse engineering? In IEEE International Conference on Software Maintenance (ICSM'98), pages 247–254, Bethesda, Maryland, USA, November 1998. IEEE Computer Society Press, Los Alamitos, California, USA.
3. John Darlington and Rod M. Burstall. A transformation system for developing recursive programs. Journal of the ACM, 24(1):44–67, 1977.
4. Norman E. Fenton. Software Metrics: A Rigorous Approach. Chapman and Hall, 1990.
5. Martin J. Shepperd. Foundations of Software Measurement. Prentice Hall, 1995.
Finding Building Blocks for Software Clustering
Kiarash Mahdavi, Mark Harman, and Robert Hierons
Department of Information Systems and Computing, Brunel University, Uxbridge, Middlesex, UB8 3PH
[email protected]
1
Introduction
It is generally believed that good modularization of software leads to systems which are easier to design, develop, test, maintain and evolve [1]. Software clustering using search-based techniques has been well studied using a hill climbing approach [2,4,5,6]. Hill climbing suffers from the problem of local optima, so some improvement may be expected by considering more sophisticated search-based techniques. However, hitherto, the use of other techniques to overcome this problem, such as Genetic Algorithms (GAs) [3] and Simulated Annealing [7], has been disappointing. This poster paper looks at the possibility of using results from multiple hill climbs to form a basis for subsequent search. The findings will be presented in the poster created for the poster session at GECCO 2003.
2
Multiple Hill Climbing to Identify Building Blocks
The goal of module clustering is to arrive at a graph partition in which each cluster maximizes the number of internal edges and minimizes the number of external edges [1]. To capture this in our approach, we use the "Basic MQ" fitness function [4]. The multiple hill climbing process consists of two stages. In the initial stage, a set of hill climbs is carried out to produce a set of clusterings. These clusterings are then used to create building blocks for the final stage of the process. The hill climbers use the nearest-neighbor approach used in Bunch [4,6], in which the nearest neighbors of a clustering are constructed by moving a node (or module) into a new or an existing cluster. In our approach, however, the nodes correspond to building blocks (sets of modules), not individual modules. The clusterings from the initial stage of the process are compared to identify groups of nodes that are placed in the same cluster across clusterings. To reduce the disruptive effect of highly unfit solutions on the creation of the initial building blocks, cut-off points based on the quality of the initial hill climbs are used. These range from the best 10% to the best 100% of the hill climbs, in increments of 10%, resulting in 10 sets of building blocks. These building blocks are then used in the second stage to improve the results of subsequent hill climbing and GAs.

E. Cantú-Paz et al. (Eds.): GECCO 2003, LNCS 2724, pp. 2513–2514, 2003. © Springer-Verlag Berlin Heidelberg 2003
3
Using Building Blocks to Reduce the Search Space for Repeated Hill Climbing
Simply reducing the number of nodes by grouping them together as building blocks reduces the search space. Furthermore, the created building blocks provide a way to identify and preserve useful structures within the solution landscape and to remove the destructive effects of mutation on these structures.
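The grouping of co-clustered nodes described in Section 2 can be sketched as follows (an illustrative reconstruction, not the authors' code; the quality cut-offs are assumed to have been applied before this step): modules whose cluster assignments agree in every retained clustering collapse into a single building block, shrinking the search space.

```python
from collections import defaultdict

def building_blocks(clusterings):
    """Group modules that occupy the same cluster in every clustering.
    clusterings -- list of dicts mapping module -> cluster id.
    A module's 'signature' is its tuple of cluster ids across runs;
    two modules share a signature exactly when every clustering puts
    them in the same cluster, so each signature is a building block."""
    blocks = defaultdict(list)
    for module in clusterings[0]:
        signature = tuple(c[module] for c in clusterings)
        blocks[signature].append(module)
    return sorted(sorted(block) for block in blocks.values())
```

For instance, if two runs both co-cluster a and b but disagree about c and d, the blocks are {a, b}, {c}, and {d}, and the subsequent search operates over 3 atoms instead of 4 modules.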
4
Using Building Blocks to Seed Genetic Algorithms
GAs can find clustering problems difficult [2,3,6]. This could be attributed to the difficulty they face in preserving building blocks from the destructive effects of the genetic operators. To combat this, our approach is to have the GA treat the identified building blocks as atomic units. This reduces the size of the search space and potentially assists the GA in forming solutions at the meta-level of combining building blocks.
References
1. Larry L. Constantine and Edward Yourdon. Structured Design. Prentice Hall, 1979.
2. D. Doval, S. Mancoridis, and B. S. Mitchell. Automatic clustering of software systems using a genetic algorithm. In International Conference on Software Tools and Engineering Practice (STEP'99), Pittsburgh, PA, 30 August – 2 September 1999.
3. Mark Harman, Robert Hierons, and Mark Proctor. A new representation and crossover operator for search-based optimization of software modularization. In GECCO 2002: Proceedings of the Genetic and Evolutionary Computation Conference, pages 1351–1358, New York, 9–13 July 2002. Morgan Kaufmann Publishers.
4. Spiros Mancoridis, Brian S. Mitchell, Yih-Farn Chen, and Emden R. Gansner. Bunch: A clustering tool for the recovery and maintenance of software system structures. In Proceedings, IEEE International Conference on Software Maintenance, pages 50–59. IEEE Computer Society Press, 1999.
5. Spiros Mancoridis, Brian S. Mitchell, C. Rorres, Yih-Farn Chen, and Emden R. Gansner. Using automatic clustering to produce high-level system organizations of source code. In International Workshop on Program Comprehension (IWPC'98), pages 45–53, Ischia, Italy, 1998. IEEE Computer Society Press, Los Alamitos, California, USA.
6. Brian S. Mitchell. A Heuristic Search Approach to Solving the Software Clustering Problem. PhD thesis, Drexel University, Philadelphia, PA, January 2002.
7. Brian S. Mitchell and Spiros Mancoridis. Using heuristic search techniques to extract design abstractions from source code. In GECCO 2002: Proceedings of the Genetic and Evolutionary Computation Conference, pages 1375–1382, New York, 9–13 July 2002. Morgan Kaufmann Publishers.