〈 p, c1 〉, …, 〈 p, cn 〉 are all the means-ends links that share the same parent p. SeqComp(T) is an operation that generates the body of the procedural reasoning rule referred to above by sequentially composing the goal or task children identified in each of the means-ends links with the same parent p. The i* model in itself does not provide any information on what this sequence should be. This needs to be provided by the analyst or, by default, obtained from a left-to-right reading of the means-ends links for the same parent in an SR diagram. For example, in the SR diagram of actor Retail System of figure 1, task Handle Online Order and goal Sell Product are connected by a means-ends link; therefore, the rule SellProduct() <- ϕ | HandleOnlineOrder() can be added to the Rulebase of agent RetailSystem. The belief formula ϕ and the parameters of the goal and task can be specified according to the real case. 4. For all a in Actors and for each p in NT for which a link 〈 p, c 〉 in TDLink exists in the SR model for that actor (where c in (NT∪NG)), the corresponding agent 〈 a, B, G, P, A 〉 in Agents must satisfy the property that goal(p) <- ϕ | SeqComp(T) is an element of P. Here T = {c1,…,cn}, given that
〈 p, c1 〉, …, 〈 p, cn 〉 are all the task decomposition links that share the same parent p. SeqComp(T) is as defined in rule 3. Note that, in the rules defined above, the execution order of sub-tasks within the task-decomposition links is left-to-right by default. The belief formulas of each practical reasoning rule cannot be generated completely automatically; instead, those beliefs are specified by designers. Take task Handle Online Order as the parent task node for example: this task is decomposed into three sub-tasks, Confirm Customer, Let Payment System Handle Payment and Let Product Management System Send Product. Using the above rule leads to: HandleOnlineOrder() <- ϕ | BEGIN letpaymentsystemhandlepayment(); confirmcustomer(); letproductmanagementsystemsendproduct() END.
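The mapping from SR links to procedural reasoning rules in rules 3 and 4 is mechanical enough to sketch in code. The fragment below is only an illustration of that mapping, not the authors' tool: the link representation, the identifier normalisation and the `phi` placeholder for the belief formula are assumptions introduced here, and the children are passed in the order that reproduces the rule shown in the text.

```python
# Illustrative sketch (not the authors' implementation): generate 3APL-style
# procedural reasoning rules from i* means-ends and task-decomposition links.
# The data model and the 'phi' placeholder for the belief formula are assumptions.

def identifier(name):
    """Turn an i* element name such as 'Handle Online Order' into a 3APL-style identifier."""
    return "".join(name.split())

def seq_comp(children):
    """Sequentially compose the children, read left to right by default (rules 3 and 4)."""
    body = "; ".join(identifier(c).lower() + "()" for c in children)
    return f"BEGIN {body} END"

def rule_for(parent, children, belief="phi"):
    """Emit goal(p) <- belief | SeqComp(T) for one parent and its children."""
    return f"{identifier(parent)}() <- {belief} | {seq_comp(children)}"

# Example: the task decomposition of Handle Online Order used in the text.
print(rule_for("Handle Online Order",
               ["Let Payment System Handle Payment",
                "Confirm Customer",
                "Let Product Management System Send Product"]))
```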
5. For all a in Actors and for each triple 〈 s, m, c 〉 in SCLinks in the SR model for that actor, the corresponding agent 〈 a, B, G, P, A 〉 in Agents must satisfy the property that belief(m, s, c) is an element of B. For brevity, we do not describe how beliefs about softgoal contributions are used in agent programs – we will flag, however, that they can play a critical role in selecting amongst procedural reasoning rules. For example, there are two ways to achieve goal Own Product for an actor: one is Go Shopping, the other is Shopping Online. Assume that task GoShopping
has a positive contribution to the softgoals Low Effort, Convenient and Time Saving, while task ShoppingOnline has positive effects on those three softgoals. 6. For all dependencies 〈 To, Td, ID 〉 in SD, there exist agents 〈 To, Bo, Go, Po, Ao 〉, 〈 Td, Bd, Gd, Pd, Ad 〉 in Agents, such that if 〈 To, Td, ID 〉 is in DG(SD), then goal(ID) is an element of Go. Notice that these rules require that the creation conditions be communicated by the depender agent to the ESA agent. The ESA monitors all of the actions/tasks performed by each agent, all of the messages exchanged and all of the beliefs (usually creation conditions for dependencies) communicated by individual agents for consistency and for constraint violations (e.g. the FormalTROPOS-style conditions associated with dependencies). When any of these is detected, the ESA generates a user alert. We shall select one task-dependency and one resource-dependency related to agent Retail System in order to illustrate rule 6. Actor Customer depends on actor Retail System to perform task Buy Product Online and to provide Confirmation of buying. According to rule 6, for agent Customer, BuyProductOnline() is in the Goalbase, and the rules shown below are in its Rulebase:
Request(confirmation) <- product(P) AND needconfirmation(P) |
BEGIN send(retailsystem, request, requestProvide(confirmation)); send(ESA, inform, believe(needconfirmation)) END
Task(BuyProductOnline)
For agent RetailSystem, two rules are generated for these two dependency relationships:
received(customer, request, requestProvide(confirmation)) |
BEGIN send(customer, request, offer(confirmation)); send(ESA, inform, believe(Offered(confirmation))) END
received(customer, request, requestPerform(BuyProductOnline)) |
BEGIN Perform(BuyProductOnline); send(ESA, inform, believe(Performed(BuyProductOnline))) END
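The ESA's monitoring role described above can be read as a simple fact-collection-and-check loop. The sketch below is an assumption-laden illustration of that idea only (the constraint format, the class name and the alert mechanism are hypothetical), not the paper's 3APL ESA agent.

```python
# Illustrative sketch of an ESA-style monitor: it collects the believe(...) facts
# that agents report via send(ESA, inform, ...) and raises a user alert whenever a
# registered dependency condition is violated. All names here are hypothetical.

class ESAMonitor:
    def __init__(self):
        self.beliefs = set()       # (agent, belief) facts reported by individual agents
        self.constraints = []      # (description, predicate over the current belief set)

    def add_constraint(self, description, predicate):
        self.constraints.append((description, predicate))

    def inform(self, agent, belief):
        """Called when an agent performs send(ESA, inform, believe(belief))."""
        self.beliefs.add((agent, belief))
        self.check()

    def check(self):
        for description, predicate in self.constraints:
            if not predicate(self.beliefs):
                print(f"ALERT: constraint violated -> {description}")

# Example: once a confirmation is needed, it must eventually be offered.
esa = ESAMonitor()
esa.add_constraint(
    "needconfirmation implies Offered(confirmation)",
    lambda beliefs: (("customer", "needconfirmation") not in beliefs)
                    or (("retailsystem", "Offered(confirmation)") in beliefs))
esa.inform("customer", "needconfirmation")           # alerts until the offer arrives
esa.inform("retailsystem", "Offered(confirmation)")  # constraint satisfied afterwards
```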
Fig. 2 (provided below) is a snapshot of the Online Shopping 3APL agent system. It provides insight into the communication messages and the flow of the desired system.
Fig. 2. Communication messages
4 Conclusions
This paper has shown how agent-oriented conceptual modelling techniques such as the i* framework can be employed to model web-based systems. Along with this, we have suggested an approach to executing i* models by translating them into a set of interacting agents (services) implemented in the 3APL language. This approach makes use of the advantages of i* for early-phase requirements engineering and validates the model by mapping it into an executable specification, so that the design can be observed in an emulation program. We are working towards automating the approach proposed in this paper. Furthermore, we also believe in a co-evolution approach in which the overall model is composed of an i* model and 3APL agents. How to co-evolve the i* model and the 3APL agents remains future work.
High-Dimensional Micro-array Data Classification Using Minimum Description Length and Domain Expert Knowledge Andrea Bosin, Nicoletta Dessì, and Barbara Pes Università degli Studi di Cagliari, Dipartimento di Matematica e Informatica, Via Ospedale 72, 09124 Cagliari [email protected], {dessi, pes}@unica.it
Abstract. This paper reports on three machine learning methods, i.e. Naïve Bayes (NB), Adaptive Bayesian Network (ABN) and Support Vector Machines (SVM), for multi-target classification on micro-array datasets involving a large feature space and very few samples. By adopting the Minimum Description Length criterion for ranking and selecting relevant features, experiments are carried out to investigate the accuracy and effectiveness of the above methods in classifying many targets, as well as to study the effects of feature selection on the sensitivity of each classifier. The paper also shows how the knowledge of a domain expert makes it possible to decompose the multi-target classification into a set of binary classifications, one for each target, with a substantial improvement in accuracy. The effectiveness of the MDL criterion for deciding on particular feature subsets is supported by empirical results showing that MDL is comparable with the entropy-based feature selection methodologies reported in earlier work. Keywords: Data Mining & Knowledge Discovery, Machine Learning, Bioinformatics.
1 Introduction
DNA micro-arrays are novel biotechnologies that are being used increasingly in cancer research [1]. Micro-arrays measure the expression levels of a large number of genes in a biological sample. Since the samples come from patients whose pathological condition (e.g. cancer or non-cancer) has already been diagnosed by a doctor, the correlations between genes and pathologies can be studied. Hence, datasets derived from micro-arrays are very interesting to analyse, as they can be used to build diagnostic classifiers. Because of the high cost of micro-array production, the number of samples is often limited, while the large number of gene expressions requires a careful computational analysis. In this extreme setting of few observations on many features, the construction of a classifier generally proceeds by selecting a small informative set of genes that can distinguish among the various classes of pathology (feature selection), by choosing an appropriate mathematical model for the classifier, and by estimating the parameters of the
model based on the "training set" of samples whose classification is known in advance. Finally, the accuracy and the sensitivity of the classifier are tested on an independent set of samples not used in training the classifier. In this paper, all these distinct steps are globally referred to as the main classification task (MCT). Previous studies [2][3] have shown that the accuracy of the classifier can be greatly improved by reducing the number of features used as input to the classification model, since (1) many genes are not at all correlated with the pathology, and (2) the high level of correlation between the expression levels of many genes makes much of the information redundant. The relevance of good feature selection methods has been discussed, for example, by Guyon and colleagues [4]. As witnessed by recent literature [5][6][7], Bayesian networks and Support Vector Machines techniques may be useful to identify expression patterns in micro-array data, while the MDL criterion [8] is recognized as a new foundation both for feature selection and for classifier optimization. Specifically, we explore the role of MDL in building and evaluating the performance of three classifiers, i.e. Naïve Bayes (NB), Adaptive Bayesian Network (ABN) and Support Vector Machines (SVM). The Acute Lymphoblastic Leukemia (ALL) dataset [9] provides an interesting testbed for the experiments presented here, whose goals are to: a) evaluate the effectiveness of the MDL principle as a criterion for selecting features and comparing classifiers; b) investigate the accuracy and effectiveness of the above classifiers (NB, ABN, SVM) in handling many ALL sub-types; c) study the effects of feature selection on the accuracy and the sensitivity of classifiers; d) investigate whether it is possible to increase the performance of classifiers, taking advantage of the diagnostic knowledge gleaned from a human expert. In general, finding good classifiers is known to be a difficult task when gene expression profiles are used as complex biomarkers defining many different classes of cancer (multi-target classification). To maintain the feasibility of this study, we pursued a 2-stage design:
• at stage I, we attempted a multi-target classification involving all considered ALL sub-types;
• at stage II, we focused on decomposing the problem into a set of binary classification problems, one for each ALL sub-type.
Feature selection and classification methods, when applied to ALL gene expression data, assume as target classes the specific ALL sub-types, while features correspond to the expression levels measured for each gene. The paper is organized as follows. Section 2 illustrates the feature selection heuristic, while Section 3 provides some insight on the classification algorithms. Section 4 presents the experimental study case and the related results. Results are further discussed in Section 5. Finally, Section 6 outlines conclusions.
2 Feature Selection Heuristic
Feature selection methods can be classified as filters and wrappers [10][11]. The essential difference between these approaches is that a wrapper method makes use of the algorithm that will be employed in building the final classifier, while a filter method does not. Filters only look at intrinsic characteristics of the data, by addressing the feature selection problem in isolation from the classifier design. Conversely, wrapper methods search through the space of all possible feature subsets and estimate the quality of a particular subset by measuring the predictive accuracy of a classifier built on this subset. Thus, wrappers rely on some classification algorithm, while filters are independent of the classification method and may be used in conjunction with any classifier. While more practical and easy to implement, filters don't explicitly select features, leaving the final choice to the user. Besides, they evaluate features independently of one another and hence are unable to handle correlations between them. Wrappers, on the other hand, may be too slow when applied to high-dimensional datasets. We adopt a mixed strategy which consists in (i) ranking features according to the strength of their correlation with the target and (ii) applying a wrapper approach on a reduced search space, where only top-ranked features are included. Specifically, the first step (i) is performed using a scoring metric derived from the MDL principle [8]. Basically, it states that the best theory to infer from training data is the one that minimizes the length (i.e. the complexity) of the theory itself and the length of the data encoded with respect to it. MDL provides a criterion to judge the quality of a classification model and, as a more particular case, it can be employed to address the problem of feature selection, by considering each feature as a simple predictive model of the target class. So, as described in [12], we rank each feature according to its description length, which reflects the strength of its correlation with the target class. To this end, we use the MDL measure given by [13]:
MDL = \sum_{j} \log_2 \binom{N_j + C - 1}{C - 1} - \sum_{j} \sum_{i=1}^{N_j} \log_2 (p_{ji})        (1)
where Nj is the number of training instances with the j-th value of the given feature, C is the number of target class values, and pji is the probability of the value of the target class taken by the i-th training instance with the j-th value of the given feature (estimated from the distribution of target values in the training data). More details can be found in [5]. In the subsequent wrapper step (ii), we start with the set of the N top-ranked features (e.g. N = 20), build a classifier on this set of N features and test its accuracy. Then, we extend the set of features by adding the next k top-ranked features (e.g. k = 10), build a new classifier on the set of N+k features, and again test its accuracy. This procedure is repeated as long as the accuracy of the classifier keeps improving.
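The per-feature description length of equation (1) can be computed directly from counts. The sketch below is only an illustration of the formula, assuming a feature that has already been discretized and is given as a list of values with a parallel list of class labels; it is not the code used in the reported experiments.

```python
# Illustrative implementation of the MDL feature score of equation (1).
# Assumes the feature is already discretized; not the authors' code.
from math import log2, comb
from collections import Counter, defaultdict

def mdl_score(feature_values, class_labels):
    C = len(set(class_labels))
    by_value = defaultdict(list)                 # group class labels by feature value j
    for v, y in zip(feature_values, class_labels):
        by_value[v].append(y)
    score = 0.0
    for labels in by_value.values():
        N_j = len(labels)
        counts = Counter(labels)
        score += log2(comb(N_j + C - 1, C - 1))  # model cost for value j
        for y in labels:                         # data cost: -sum_i log2(p_ji)
            score -= log2(counts[y] / N_j)
    return score                                 # lower = stronger correlation with the target

# Toy check: a perfectly separating feature has a shorter description than a random one.
labels = ["A", "A", "B", "B"]
print(mdl_score([0, 0, 1, 1], labels))   # informative feature (smaller score)
print(mdl_score([0, 1, 0, 1], labels))   # uninformative feature (larger score)
```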
3 Classification Methods
Naïve Bayes is a very simple Bayesian network consisting of a special node, i.e. the target class, that is parent of all other nodes, i.e. the features, which are assumed conditionally independent given the value of the class. The NB network can be quantified against a training set of pre-classified samples, i.e. we can compute the probability associated to a specific value of each feature, given the value of the class. Then, any new sample can be easily classified by applying the Bayes rule. Despite its strong independence assumption, which is clearly unrealistic in several application domains, NB has been shown to be competitive with more complex state-of-the-art classifiers [14][15][16]. In recent years, a lot of research has focused on improving NB classifiers by relaxing their full independence assumption. One of the most interesting approaches is based on the idea of adding correlation arcs between the features of a NB classifier. Specific structural constraints are imposed on these "augmenting arcs" [14][15], in order to maintain computational simplicity in learning. The algorithm we evaluated, the Adaptive Bayesian Network [13], is a variant of the approach proposed in [15]. Specifically, ABN relaxes the NB full independence assumption according to a greedy approach based on MDL. Support Vector Machines [17] have been successfully applied to a wide range of pattern recognition problems. A support vector machine finds an optimal separating hyperplane between members and non-members of a given class in an abstract space that is represented by a kernel function. This technique avoids the computational burden of explicitly representing the feature vectors.
4 Experiments
The dataset used in this work is the Acute Lymphoblastic Leukemia dataset [9]. ALL is a heterogeneous disease consisting of various leukemia sub-types that remarkably differ in their response to chemotherapy [18]. The ALL dataset contains all known ALL sub-types (including T-ALL, E2A-PBX1, TEL-AML1, BCR-ABL, MLL, Hyperdip > 50) and consists of 327 samples, each one described by the expression level of 12558 genes [9]. Specifically, the dataset includes 215 training samples and 112 testing samples. The classifiers used in the experiments (namely NB, ABN, SVM) are trained and tested according to a main classification task (MCT), whose steps are defined as follows:
1. Rank all features according to the MDL feature selection criterion.
2. Select the set of the N top-ranked features (e.g. start with N = 20).
3. Build a classifier from a training set D, filtered according to the selected features.
4. Test classifier accuracy on a test set T, filtered according to the selected features.
5. Extend the set from N to N+k by adding the next k top-ranked features (e.g. k = 10).
6. Repeat steps 2 to 5 as long as the accuracy of the classifier increases.
We denote as C(X, D, T) the MCT aiming to classify the sub-type X, on the training set D and on the test set T. All the experiments have been carried out using standard data mining tools [21].
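As a concrete reading of the MCT loop, the sketch below wires a pre-computed feature ranking to a standard classifier. The use of scikit-learn, the GaussianNB stand-in for the paper's NB/ABN/SVM implementations, and the accuracy-based stopping test are assumptions of this illustration; they are not the "standard data mining tools" of [21].

```python
# Sketch of the main classification task C(X, D, T); an illustration only.
# Assumes features are pre-ranked (e.g. by the MDL score of equation (1)).
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

def mct(X_train, y_train, X_test, y_test, ranked_features, n_start=20, k=10,
        make_classifier=GaussianNB):
    best_acc, best_n = -1.0, None
    n = n_start
    while n <= len(ranked_features):
        cols = ranked_features[:n]                       # keep only the top-n features
        clf = make_classifier().fit(X_train[:, cols], y_train)
        acc = accuracy_score(y_test, clf.predict(X_test[:, cols]))
        if acc <= best_acc:                              # stop once accuracy no longer improves
            break
        best_acc, best_n = acc, n
        n += k                                           # extend the feature set by k
    return best_n, best_acc

# Toy usage with random data (the 12558 genes would replace the 50 columns here).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 50)); y = rng.integers(0, 2, size=60)
ranked = list(range(50))                                 # placeholder for the MDL ranking
print(mct(X[:40], y[:40], X[40:], y[40:], ranked, n_start=10, k=10))
```

A different classifier (e.g. an SVM) can be plugged in through the `make_classifier` factory without changing the loop.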
4.1 Stage I
At this stage, we perform a multi-target classification C(X,D,T) involving all considered ALL sub-types: the label X is the set of all six considered ALL sub-types, D includes the full 215 training samples and T includes the full 112 testing samples. Each sample is described by 12558 features. The accuracy of NB, ABN and SVM classifiers is shown in Table 1 as the number N of top-ranked features increases. These results cannot unequivocally demonstrate which is the best classifier; however, they do give some evidence. The Bayesian classifiers outperform SVM when N < 200, while their accuracy is similar when N = 400. As can be seen, the number of features needed to achieve a stabilization in accuracy is very high (400), and the resulting classifiers have a high level of misclassifications (8-10%).

Table 1. Stage I: misclassifications of NB, ABN and SVM multi-target classifiers

N      20   40   60   100  200  400
NB     29   21   18   18   12   11
ABN    27   22   15   12   9    9
SVM    39   45   58   38   23   9
4.2 Stage II
At this stage our methodology focuses on formalizing how the human expert approaches the ALL diagnosis and on integrating this approach into a machine learning process. Essentially, the human experts, i.e. the physicians, use a divide-and-conquer approach based on clinical knowledge, experience and observation [18]. When analyzing a biological sample, they first look for evidence of the most common or unambiguous sub-type, T-ALL, against all the others (here referred to as OTHERS1). If there is no T-ALL evidence, they look for the next most common or unambiguous sub-type, E2A-PBX1, against the remaining (here referred to as OTHERS2). Then the process steps through TEL-AML1 vs. OTHERS3, BCR-ABL vs. OTHERS4, MLL vs. OTHERS5 and finally Hyperdip > 50 vs. OTHERS, which groups all samples not belonging to any of the previous sub-types. As Figure 1 shows, this approach can be reproduced in a machine learning process that builds and evaluates six binary sub-models (one for each ALL sub-type) in turn. Being responsible for a single ALL sub-type, each binary model must be trained on a specific set that separates the samples belonging to its own target class from the samples belonging to other classes. At each level, the classification process C(X,D,T) is a binary classification whose label X is a single ALL sub-type. Each C(X,D,T) gains information from the previous level because the ALL sub-types related to the latter are left out when the former is trained and tested. At each level of the decision tree, the classification process is carried out on different sub-types, starting from the original training and test sets that are progressively reduced from one level to the next, as shown in Table 2.
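The diagnostic decision tree can be read as a cascade of one-vs-rest binary classifiers applied in the fixed clinical order just described. The sketch below is an illustration of that cascade only; the scikit-learn SVC stand-in, the label encoding and the function names are assumptions, not the pipeline used for the reported experiments.

```python
# Illustrative cascade of binary classifiers mirroring Fig. 1; not the authors' code.
# 'X' is a feature matrix and 'y' the ALL sub-type of each sample; the classifier
# factory is a placeholder for the paper's NB/ABN/SVM models.
import numpy as np
from sklearn.svm import SVC

ORDER = ["T-ALL", "E2A-PBX1", "TEL-AML1", "BCR-ABL", "MLL", "Hyperdip>50"]

def train_cascade(X, y, make_classifier=lambda: SVC(kernel="linear")):
    models, X_rem, y_rem = [], X, np.asarray(y)
    for sub_type in ORDER:
        target = (y_rem == sub_type).astype(int)      # sub-type vs. OTHERS at this level
        clf = make_classifier().fit(X_rem, target)
        models.append((sub_type, clf))
        keep = y_rem != sub_type                      # remaining samples feed the next level
        X_rem, y_rem = X_rem[keep], y_rem[keep]
    return models

def predict_cascade(models, x):
    for sub_type, clf in models:
        if clf.predict(x.reshape(1, -1))[0] == 1:
            return sub_type
    return "OTHERS"
```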
Table 3 shows the overall accuracy (i.e. the sum of misclassifications at each level) of the six binary NB, ABN and SVM classifiers as the number N of top-ranked features increases.

Fig. 1. Stage II: diagnostic decision tree (Level 1: C(T-ALL, D1, T1) → T-ALL; Level 2: C(E2A-PBX1, D2, T2) → E2A-PBX1; Level 3: C(TEL-AML1, D3, T3) → TEL-AML1; Level 4: C(BCR-ABL, D4, T4) → BCR-ABL; Level 5: C(MLL, D5, T5) → MLL; Level 6: C(Hyperdip > 50, D6, T6) → Hyperdip > 50; remaining samples are labelled OTHERS)

Table 2. Stage II: number of training (D) and test (T) samples at each level of the decision tree
Level  ALL sub-type                 D            T
1      T-ALL vs. OTHERS1            28 vs. 187   15 vs. 97
2      E2A-PBX1 vs. OTHERS2         18 vs. 169   9 vs. 88
3      TEL-AML1 vs. OTHERS3         52 vs. 117   27 vs. 61
4      BCR-ABL vs. OTHERS4          9 vs. 108    6 vs. 55
5      MLL vs. OTHERS5              14 vs. 94    6 vs. 49
6      Hyperdip > 50 vs. OTHERS     42 vs. 52    22 vs. 27
Table 3. Stage II: overall accuracy of NB, ABN and SVM

N      10   20
NB     8    4
ABN    10   4
SVM    5    5
Table 4 summarizes the accuracy (i.e. the number of misclassifications) of NB, ABN and SVM classifiers at each level of the diagnostic decision tree. The results only refer to the classifiers built on the 20 top-ranked features, which turns out to be the "optimal" number of features since no improvement occurs when increasing the number of selected features.

Table 4. Stage II: misclassifications of NB, ABN and SVM at each single level

Level  1   2   3   4   5   6
NB     0   0   0   3   0   1
ABN    0   0   0   3   0   1
SVM    0   0   1   3   0   1
5 Discussion
The results at stage I (Table 1) agree with recent literature [19] and show that small feature sets and good predictive classifiers are very difficult to achieve with a multi-target classification. Table 1, in comparison with Table 3, also reveals that feature selection is much more effective for the binary models of stage II. Indeed, the number of misclassified samples is higher at stage I, whose best accuracy is 9 misclassifications with up to 400 features, while the diagnostic tree achieves 4 misclassifications with only 120 features (i.e. 6 binary models x 20 features each). In addition, the convergence to the "optimal" set of features is slower at stage I. At stage II, we have built NB, ABN, SVM binary classifiers at every single level L (L = 1, 2, 3, 4, 5, 6) and the best accuracy occurs with a set of 20 features at each single level. At stage I, we have built NB, ABN, SVM classifiers for an increasing number N of features (i.e. N = 20, 40, 60, 100, 200, 400) up to the "optimal" value of N = 400. In order to compare the effectiveness of both the proposed approaches, we look for the features common to both stages. Figure 2 shows, for each level (i.e. target) at stage II, the number of features in common with stage I. As Figure 2 illustrates, there are many features in common for T-ALL, E2A-PBX1, TEL-AML1, but only a few in common for BCR-ABL, MLL, Hyperdip > 50. This seems to suggest that genes relevant to ALL sub-types T-ALL, E2A-PBX1, TEL-AML1 are promptly selected at both stages, but genes relevant to BCR-ABL, MLL, Hyperdip > 50 are hard to select at stage I. A possible explanation is that the strength of the correlation of genes with a given ALL sub-type can be very different for different sub-types. Hence, if a single sub-type is assumed as target (stage II), the feature selection process top-ranks only those genes that are relevant for the target and discards all the others (even if they are strongly correlated with another sub-type). On the contrary, in the multi-target classification (stage I), all the genes most strongly correlated with any one of the sub-types are top-ranked, but the target they are mostly correlated to is not evident at all. Figure 2 shows that the "optimal" sets of features relevant to targets T-ALL, E2A-PBX1, TEL-AML1 at stage II are top-ranked at stage I, too, meaning that the correlation of each
set with its target is similar in strength. This assertion is partially true for MLL and Hyperdip > 50, but definitely not true for BCR-ABL, meaning that the correlation is less strong. Another interesting point emerging from the experiments at stage II is that almost no overlap exists between the groups of 20 genes selected at each level. Specifically, only one overlapping gene has been found between levels 2-5, 3-4, 5-6, and this could suggest that very few genes (if any) are responsible for more than one sub-type of ALL. The results presented in this paper can be compared with [20], where six different heuristics for feature selection, based on entropy theory, χ2-statistics and t-statistics, are explored by learning NB and SVM classifiers on the ALL dataset [9]. In [20] the best feature selection heuristic ranks attributes according to their entropy and selects features "having an entropy value less than 0.1 if these exist, or the 20 features with the lowest entropy values otherwise", up to a maximum of 20 features. Table 5 shows a comparison between the two approaches. The feature selection heuristic proposed in [20] allows for learning SVM and NB classifiers that respectively misclassify 5 and 7 samples. According to the heuristic proposed in this paper, which is based on MDL, the SVM classifiers (with 10 and 20 features) result in 5 misclassifications, while the NB classifiers (with 20 and 10 features) result in 4 and 8 misclassifications.
Fig. 2. Number of common features at stages I and II (y-axis: n. of common features, 0-25; x-axis: n. of features at stage II, from 20 to 400; one curve per target: T-ALL, E2A-PBX1, TEL-AML1, BCR-ABL, MLL, Hyperdip > 50)

Table 5. Comparison of different feature selection heuristics
            This work                 Ref. [20]
classifier  NB    ABN    SVM          NB        SVM
heuristic   MDL   MDL    MDL          entropy   entropy
errors      4     4      5            7         5
6 Conclusions
Our emphasis throughout this paper is on the practical application of MDL and hence we make extensive use of experiments on the considered data set [9]. As we see, MDL provides an objective framework under which rather disparate approaches to micro-array data classification can co-exist and be compared. Specifically, we have shown that the effectiveness of MDL as a feature selection heuristic is comparable with the best heuristic (entropy) discussed in [20]. Our experiments also confirm that Bayesian networks can be used to build classifiers with good accuracy, as proposed in [5]. In addition, both Bayesian networks and SVM classifiers greatly benefit from performing a feature selection based on MDL. We have seen that the knowledge of a domain expert makes it possible to decompose a multi-target classification problem, and efficiently learn a set of binary classifiers, one for each target, with a substantial improvement in accuracy. This result highlights the relevance of domain knowledge for decomposing multi-target classification problems into a structured set of binary classification problems.
References
[1] Hardimann G., Microarray methods and applications: Nuts & bolts. DNA Press, 2003.
[2] Golub T.R., et al., Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286, 531-537, 1999.
[3] Khan J., et al., Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine, 7, 673-679, 2001.
[4] Guyon I., Weston J., Barnhill S., and Vapnik V., Gene Selection for Cancer Classification using Support Vector Machines. Machine Learning, 46(1-3): 389-422, 2002.
[5] Bosin A., Dessì N., Liberati D., Pes B., Learning Bayesian Classifiers from Gene-Expression MicroArray Data, CIBB-WILF 2005.
[6] Furey T., Cristianini N. et al., Support vector machine classification and validation of cancer tissue samples using microarray expression data, Bioinformatics, 16:906-914, 2002.
[7] Valentini G., Gene expression data analysis of human lymphoma using support vector machines and output coding ensembles, Artificial Intelligence in Medicine, 2002.
[8] Barron A., Rissanen J., Yu B., The minimum description length principle in coding and modelling, IEEE Transactions on Information Theory, 44: 2743-2760, 1998.
[9] St. Jude Children's Research Hospital. http://www.stjuderesearch.org/data/ALL1/.
[10] Blum A., Langley P., Selection of relevant features and examples in machine learning, Artificial Intelligence, 97:245-271, 1997.
[11] Kohavi R., John G., Wrappers for feature subset selection. Artificial Intelligence, 97:273-324, 1997.
[12] Kononenko I., On biases in estimating multi-valued attributes, IJCAI95, 1034-1040, 1995.
[13] Yarmus J.S., ABN: A Fast, Greedy Bayesian Network Classifier, 2003. http://otn.oracle.com/products/bi/pdf/adaptive_bayes_net.pdf.
[14] Friedman N., Geiger D., Goldszmidt M., Bayesian Network Classifiers, Machine Learning, 29: 131-161, 1997.
[15] Keogh E., Pazzani M.J., Learning the structure of augmented Bayesian classifiers, International Journal on Artificial Intelligence Tools, Vol. 11, No. 4, 587-601, 2002.
[16] Cheng G., Greiner R., Comparing Bayesian Network Classifiers, Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann Publishers, Inc., San Francisco, 1999.
[17] Vapnik V., Statistical Learning Theory, Wiley-Interscience, New York, NY, USA, 1998.
[18] Yeoh E.J. et al., Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling, Cancer Cell, 1:133-143, 2002.
[19] Mukherjee S., Classifying Microarray Data Using Support Vector Machines, Understanding And Using Microarray Analysis Techniques: A Practical Guide. Kluwer Academic Publishers, Boston, MA, 2003.
[20] Liu H., Li J., Wong L., A Comparative Study on Feature Selection and Classification Methods Using Gene Expression Profiles and Proteomic Patterns, Genome Informatics 13: 51-60, 2002.
[21] http://www.oracle.com
On Solving Edge Detection by Emergence M. Batouche, S. Meshoul, and A. Abbassene PRAI Group, LIRE Laboratory, Mentouri University of Constantine {batouche, meshoul, abbassene}@wissal.dz
Abstract. Emergence is the process of deriving some new and coherent structures, patterns and properties in a complex system. Emergent phenomena occur due to interactions (non-linear and distributed) between the elements of a system over time. An important aspect concerning emergent phenomena is that they are observable on a macroscopic level, whereas they are produced by the interaction of the elements of the system on a microscopic level. In this paper, we attempt to grab some emergence and complexity principles in order to apply them to problem solving. As an application, we consider the edge detection problem, a key task in image analysis. Problem solving by emergence consists in discovering the local interaction rules which will be able to produce a global solution to the problem that the system faces. More clearly, it consists in finding the local rules which will have some awaited and adequate global behavior, so as to solve a given problem. This approach relies on evolving cellular automata using a genetic algorithm. The aim is to find automatically the rules that allow solving the edge detection problem by emergence. For the sake of simplicity and convenience, the proposed method was tested on a set of binary images. Very promising results have been obtained.
1 Introduction
Many systems in nature produce complex patterns, which emerge from the local interactions of relatively simple individual components. Notably, this type of emergent pattern formation often occurs without the existence of a central control [4]. Such systems consist of many components, with local interactions only and no central control. Examples of emergent pattern formation in such systems include the foraging and nest-building behavior of social insects, spiral waves in cultures of amoebae, synchronized oscillations in the brain, etc. To simulate the behavior of complex systems, cellular automata (CA) have been used as a powerful mathematical tool [5]. CA have the advantage of being easy to understand and implement. To exploit all the power of CA (and consequently all the power of complex systems), one must take into account some paramount characteristics, especially the concept of emergence [4]. Emergence is the direct result of the complexity of interactions inside the system. The elements (components) of the system interact locally; the interactions between them are simple, but the significant number of elements and the feedback loop phenomenon (duplex interaction between the system and its environment) produce a complex and interesting behavior (see figure 1).
These local interactions occur on a micro level, but the observed (emergent) phenomena occurring at the macro level seem to have no relationship with the local interactions (a surprising emergent phenomenon). Taking CA as modeling tools, we associate the system's local interactions with the CA transition rules. These rules use local information to produce the future state of each cell [3]. CA can have very elaborate behaviors and even carry out calculations. It has been proven that a CA can be a universal machine equivalent to a Turing machine [5]. If one wants to exploit all the power of this tool, one has to understand the emergence phenomenon. The question is: What are the rules that provide the right global system behavior? This question is known as the "Inverse Problem" (see figure 2). A possible solution to the "Inverse Problem" is to use an optimization strategy, in order to find the appropriate CA that solves a given task. The search space would be the space of local interaction rules. Search is guided by a quality measure calculated at the global level, which would reflect the adequacy of the system to the problem [1]. A possible manner to deal with this problem is to use the Genetic Algorithm (GA) search scheme. The mixture of Cellular Automata and Genetic Algorithms, known as Evolving Cellular Automata (EvCA) [6], provides a powerful and multi-purpose tool. In this work, we try to solve the edge detection problem [9] in an emergent manner. We want to find a CA that will have a global behavior corresponding to the edge detection task. Edge detection is a key problem in image analysis [9]. It is typically the primary task of any automatic image understanding process. The aim is to find border pixels between dissimilar regions using an evolutionary cellular automaton. The local rule of this cellular automaton emerges (rather than being handcrafted) from the evolution of a population of candidate cellular automata by means of a Genetic Algorithm optimization strategy. A simple and efficient CA rule for border detection is revealed by the GA. This rule is run over a cellular automaton initialized with the pixel intensities of the image to be segmented. In this way, a simple, intrinsically parallel algorithm for border detection emerges.
Fig. 1. Emergence of a global property from local interactions
Fig. 2. The "Inverse Problem", Inversing the "Emergence Arrow"
2 Problem Formulation
A border is a frontier between two homogeneous areas of the image. A homogeneous zone is a part of the image containing pixels which exhibit similar characteristics (intensity, color, texture…). The inter-zone contrast must be relatively strong. The border between two homogeneous areas constitutes what we call an "edge" [2, 9]. In the image processing literature [9], we find many edge detection methods based on various algorithms. Most of these methods are based on directional derivative calculations [9]. Other methods, which avoid derivative calculations, use prior knowledge concerning the nature and characteristics of the treated image, but these requirements limit the applicability of the method [2]. We propose here to construct an alternative method based on the evolution of a CA population that uses simple local rules [5]. The aim is to obtain, after an evolutionary process guided by a fitness function (an edge detection quality value), a cellular automaton able to segment an image and to detect significant edges by emergence.
3 Cellular Automata Principles
Cellular Automata were invented by Stanislaw Ulam (1909-1984) and John von Neumann (1903-1957) at the end of the forties at Los Alamos National Laboratory (United States) [5]. A CA is a D-dimensional grid (lattice) composed of agents called cells. These cells interact and evolve in time, changing their states in a discrete way. The transition rules of each automaton (cell) are local and simple: they take into account only the states of the neighboring cells [3]. A cellular automaton is a discrete dynamical system made up of a regular lattice of cells. Formally, it is a quintuplet A = (L, D, S, NR, R). L is the D-dimensional cellular lattice. At each discrete moment t, each cell can have only one state from the finite state set S. At each time step, all the cells update their states simultaneously. The transition process is driven by a local update rule R, which is the same for all cells (1). This rule takes the local neighborhood N of the cell (the states of the neighboring cells) and produces the future state of the cell [3].
S_{i,j}^{t+1} = R\left(N\left(S_{i,j}^{t}\right)\right) \qquad (1)
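To make the synchronous update of equation (1) concrete, the sketch below applies an arbitrary local rule to a 2D binary lattice with a 3x3 Moore neighborhood; the neighborhood choice, the border handling and the toy rule are assumptions of this illustration, not details fixed by the paper.

```python
# Minimal synchronous CA step for equation (1); an illustration only.
# Assumes a 2D binary lattice and a 3x3 Moore neighborhood clamped at the borders.
import numpy as np

def ca_step(grid, rule):
    """Apply the local rule R to every cell simultaneously: s(t+1) = R(N(s(t)))."""
    h, w = grid.shape
    new_grid = np.empty_like(grid)
    for i in range(h):
        for j in range(w):
            i0, i1 = max(i - 1, 0), min(i + 2, h)
            j0, j1 = max(j - 1, 0), min(j + 2, w)
            new_grid[i, j] = rule(grid[i0:i1, j0:j1])   # the rule sees only the neighborhood
    return new_grid

# Example rule (a placeholder, not the evolved edge-detection rule): a cell becomes 1
# when its neighborhood is not uniform, i.e. it sits on a boundary between regions.
toy_rule = lambda patch: int(patch.min() != patch.max())
image = np.zeros((6, 6), dtype=int); image[:, 3:] = 1
print(ca_step(image, toy_rule))
```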
4 Genetic Algorithms Principles
Genetic Algorithms (GAs) [6] are stochastic search methods inspired by the theory of the evolution of species. A GA maintains a population of potential solutions, represented by chromosomes (genotypes). Each chromosome constitutes the genetic code of an individual (phenotype) of that population [1]. The survival of an individual is conditioned by its adaptation to the environment (survival of the fittest, i.e. according to its adequacy to the role given to it). A fitness value is assigned to each individual, which quantifies its capacity of adaptation. This fitness is proportional to the degree to which the individual accomplishes a task (partially solves a problem). The fitness function is the individual selection criterion. The theory of evolution suggests that some ratio of the present generation (ancestors) can be selected to build the next generation (offspring). The new population is generated by the application of the genetic operators (crossover and mutation) [6]. An evolution process is applied, trying to attain the optimal solution (or an approximate one). This solution will appear by combination and selection of individuals over the generations of candidate solutions. This optimization process is frequently used to find approximately satisfactory solutions. Figure 3 shows the general scheme of a genetic algorithm.

P0 ← Initial Population;
Calculate_individual_fitness(P0);
WHILE not_end DO
    P_tmp ← P0 ∪ mutate(P0) ∪ crossover(P0);
    Calculate_individual_fitness(P_tmp);
    P0 ← select_best_fitness_from(P_tmp);
END

Fig. 3. Simplified Genetic Algorithm
5 Evolving Cellular Automata (EvCA)
Evolving Cellular Automata (EvCA) is a methodology that takes advantage of the notion of evolutionary strategies. We apply this method as an optimization scheme in the process of searching for the best CA rule that is able to perform some computational task. An initial CA population, composed of chromosomes coding the local transition rule of each CA of the population, is randomly generated. The GA then evolves this initial population (Generation Zero) by applying the genetic operators: mutation and crossover. The selection of the individuals of the next generation is conditioned by the value of the fitness function [6]. This cyclic process leads to the emergence of a set of interesting individuals. These individuals have a high fitness and are thus the most suited; they constitute the elite. One of them will be the chosen final solution [6]. It is important to choose the chromosome encoding, which corresponds to the local rule of the CA, carefully. Another serious point is the choice of the mutation and crossover operators. But the keystone remains the choice of the fitness function, which completely conditions the behaviour of the GA and the convergence of the evolution process towards an acceptable solution.
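The EvCA cycle is essentially the GA of Figure 3 run over CA rule encodings. The sketch below is an illustration of that outer loop under several assumptions: rule packets are encoded as dictionaries mapping neighborhood patterns to next states, the fitness function is a trivial stand-in for actually running the CA on reference images, and the population parameters are arbitrary.

```python
# High-level EvCA loop in the spirit of Fig. 3; an illustrative sketch only.
import random

PATCHES = [tuple(random.randint(0, 1) for _ in range(9)) for _ in range(15)]

def random_packet():
    return {p: random.randint(0, 1) for p in PATCHES}

def fitness(packet):
    # Placeholder: in the paper this runs the CA on reference images and compares
    # the result with the target edge map (equation (3) or (5) below).
    return sum(packet.values()) / len(packet)

def crossover(a, b):
    return {p: (a if random.random() < 0.5 else b)[p] for p in PATCHES}

def mutate(packet, rate=0.05):
    return {p: (1 - s if random.random() < rate else s) for p, s in packet.items()}

def evolve(pop_size=50, generations=100):
    population = [random_packet() for _ in range(pop_size)]
    for _ in range(generations):
        offspring = [mutate(crossover(*random.sample(population, 2))) for _ in range(pop_size)]
        population = sorted(population + offspring, key=fitness, reverse=True)[:pop_size]
    return population[0]          # the elite individual

best = evolve()
```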
6 The Proposed Method The proposed approach takes advantage of the capacity of Cellular Automata (CA) for generating and transforming a wide variety of patterns to implement the computational task of edge detection. Performing a computation with a CA means
that the input to the computation (the image to segment) is encoded as the initial configuration; the output (the segmented image) is decoded from the configuration reached by the dynamics at some later time step. The intermediate dynamical steps, which transform the input into the output, are taken as the steps in the computation. The computation emerges from the CA rule being obeyed locally by each cell [3]. Since it is very difficult to specify by hand, a priori, the particular dynamical rule suitable for a desired computation, an evolutionary process is applied in their search [4, 7, 8]. This calculation (the edge detection task) must be an emergent phenomenon, produced by the execution of simple local rules by each cell of the CA [3]. The proposed method can be summarized as follows. An Evolutionary Algorithm (Genetic Algorithm) is applied in the search for adequate CA dynamical rules that perform edge detection over a set of reference images and their segmented counterparts. The fitness function is a measure of the quality of the edge detection process that results from the application of the CA on the set of reference images. The evolutionary process starts from a random initial population of CA rules. The result is a set of evolved local transition rules defining a CA edge detector.
6.1 The Rule Format
We represent the CA rule performing the edge detection task as the concatenation of the states of the cells in the immediate neighborhood of the cell to be updated (including that cell), followed by the next state of the central cell after the update [7]. This rule can be run when its neighborhood part matches a pixel patch (of the same size) in the image. Then we update the central pixel with the "next state" part of the rule. The matching between the neighborhood and the image patch is done modulo rotational operators (Rotate (0°, 90°, 180° and 270°); Flip-Flop (Horizontal, Vertical)); the rules are said to be rotationally symmetric. Simple rules cannot represent all possible pixel patch configurations in the image; therefore, we assemble them into rule packets. In this manner, each individual in the GA population is represented by a chromosome. The chromosome is simply the rule packet. We can introduce a weak matching criterion by using a similarity threshold ε:
difference(rule, pixel patch) ≤ ε        (2)
To make the CA deterministic, we introduce the following constraint: each rule of the rule packet must be different from the other rules, modulo the rotational operators and the similarity threshold.
Fig. 4. CA rule formed by the neighborhood and the next state
Fig. 5. Rotational equivalence
Fig. 6. Rules packet that forms an individual of the population
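The matching step of Section 6.1 (comparing a rule's neighborhood part with an image patch modulo rotations, flips and the threshold ε of (2)) can be sketched as follows. The 3x3 array representation and the Hamming-style difference are assumptions of this illustration, not the authors' implementation.

```python
# Illustrative rule matching with rotational symmetry (Section 6.1); not the authors' code.
# A rule is (neighborhood, next_state) where the neighborhood is a 3x3 binary array.
import numpy as np

def variants(patch):
    """All rotations and flips of a 3x3 neighborhood (rotational symmetry of the rules)."""
    for k in range(4):
        r = np.rot90(patch, k)
        yield r
        yield np.fliplr(r)
        yield np.flipud(r)

def matches(rule_neighborhood, image_patch, eps=0):
    """Weak matching criterion (2): difference(rule, pixel patch) <= eps, modulo symmetry."""
    return any(np.sum(v != image_patch) <= eps for v in variants(rule_neighborhood))

def apply_packet(packet, image_patch, eps=0):
    """Return the new central pixel, or keep it unchanged if no rule in the packet matches."""
    for neighborhood, next_state in packet:
        if matches(neighborhood, image_patch, eps):
            return next_state
    return image_patch[1, 1]
```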
6.2 The Crossover Operator
Due to the form of the chromosome (a rule packet), we introduce two different crossover operators, a horizontal one and a vertical one. They are applied in an equiprobable way.
6.2.1 Horizontal Crossover
This operator takes two parent rule packets and produces two child rule packets by interchanging some rules between them. This makes it possible to combine the best rules to obtain better individuals. This strategy keeps the cohesion and the stability of the crossover operator. The horizontal crossover operator adjusts and improves the solutions obtained, focusing on a local search area (exploitation).
6.2.2 Vertical Crossover
The vertical crossover operator exchanges rule parts between parent chromosomes. This allows obtaining child rule packets appreciably different from the parents. The vertical crossover operator explores a wider area of the search space. In this way, it behaves comparably to a mutation operator, which diversifies the search process (exploration).
Fig. 7. Horizontal crossover operator
Fig. 8. Vertical crossover operator
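A minimal sketch of the two operators follows, assuming rule packets are lists of (neighborhood, next_state) pairs of equal length; the single-point cut positions are arbitrary choices made for this illustration.

```python
# Illustrative horizontal and vertical crossover on rule packets; not the authors' code.
# A packet is a list of rules; each rule is (neighborhood_tuple, next_state).
import random

def horizontal_crossover(parent_a, parent_b):
    """Exchange whole rules between the two packets (exploitation)."""
    cut = random.randrange(1, len(parent_a))
    return (parent_a[:cut] + parent_b[cut:],
            parent_b[:cut] + parent_a[cut:])

def vertical_crossover(parent_a, parent_b):
    """Exchange parts of the rules themselves, rule by rule (exploration)."""
    child_a, child_b = [], []
    for (na, sa), (nb, sb) in zip(parent_a, parent_b):
        cut = random.randrange(1, len(na))
        child_a.append((na[:cut] + nb[cut:], sb))   # mixed neighborhood, swapped next state
        child_b.append((nb[:cut] + na[cut:], sa))
    return child_a, child_b
```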
6.3 Mutation Operator
Applying a mutation operator to a rule packet can be done by changing the state of some randomly chosen cell in a rule of the packet, according to a probability function. Of course, precautions are taken to keep the integrity of the chromosome: it must always be valid and runnable, and must not present a rule duplication or contradiction.
6.4 The Fitness Function and the Selection Operator
To evaluate the segmentation quality of a rule packet, we execute it on a CA, taking the image pixel intensities as the initial configuration. We compare the resulting pattern with the previously segmented image (the target edge). We use the following fitness:
F = 1 / (Nb_Bad_Classified_Pixels + 1)        (3)
This formula attains 1 (the best fitness) when Nb_Bad_Classified_Pixels (wrongly classified edge pixels) approaches 0. The selection operator randomly chooses the most suited individuals from the current population (the current generation). The probability of selecting an individual is proportional to the value of its fitness. Then, we apply the crossover and mutation operators on the selected individuals in order to build the new population (the next generation). Another fitness function could be used, which quantifies more accurately the quality of the segmentation. So, let us consider the following quantities [2]: TP: True Positives, correctly classified pixels; FP: False Positives, incorrectly classified pixels; FN: False Negatives, unclassified pixels. We then calculate the Building Detection Percentage (BDP) (4), i.e. the percentage of correctly classified pixels in a particular class:
BDP = TP / (TP + FN)        (4)
We compute the BDP ratio for "edge" and for "non-edge" class pixels, respectively BDP_Edge and BDP_background. The selected fitness function is then calculated as the product of these two values:
F = BDP_Edge x BDP_background        (5)
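Both fitness variants compare the CA output with the target edge map pixel by pixel. The sketch below shows how (3), (4) and (5) could be computed from two binary arrays; it is an illustration under that assumption, not the code used in the experiments.

```python
# Illustrative computation of the fitness functions (3) and (5) from binary edge maps.
import numpy as np

def fitness_simple(result, target):
    """Equation (3): 1 / (number of wrongly classified pixels + 1)."""
    return 1.0 / (np.sum(result != target) + 1)

def fitness_bdp(result, target):
    """Equation (5): BDP_Edge x BDP_background, with BDP = TP / (TP + FN) per class (4)."""
    def bdp(cls):
        tp = np.sum((target == cls) & (result == cls))   # correctly classified pixels of the class
        fn = np.sum((target == cls) & (result != cls))   # pixels of the class that were missed
        return tp / (tp + fn) if (tp + fn) else 1.0
    return bdp(1) * bdp(0)                               # edge class = 1, background = 0

target = np.array([[0, 1, 0], [0, 1, 0], [0, 1, 0]])
result = np.array([[0, 1, 0], [0, 0, 0], [0, 1, 0]])     # one missed edge pixel
print(fitness_simple(result, target), fitness_bdp(result, target))
```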
7 Experimental Results
The images of size 100x100 (figure 9) were obtained by running the GA for 500 generations. The size of the population was 50 packets (individuals); each packet contains 15 symmetrical rules. The individual that scored the best fitness has a value of 95.637%. The fifteen (15) selected rules are represented as "neighborhood" → "future state". The rules are symmetrical and keep the same result when applied to the rotated image.
Fig. 9. Sample results from the training set (initial image, target edge, result edge)
Fig. 10. Sample results from the test set (initial image, target edge, result edge)
Fig. 11. GA final rules packet
8 Conclusion
The proposed methodology is a purely emergent one. It is based on evolution and training principles. A CA was trained to solve edge detection by emergence, and GAs were used to evolve the cellular automata. The search process is guided by the optimization of a fitness function which represents the segmentation quality; it is the criterion to be improved during the search for a valid solution. Our future work will concern the extension of the method to more complex images, such as 16- and 256-grey-level images and also color images. We could also explore the use of other evolutionary algorithms such as Genetic Programming or Artificial Immune Systems. This approach seems promising and could be transposed to solve other general problems. Indeed, this methodology may be applied to build a powerful general framework for solving broad-spectrum problems by emergence.
Acknowledgments This work was supported by CMEP – PROGRAMME TASSILI under Project 05 MDU 642.
References
1. Bar-Yam, Y.: Dynamics of Complex Systems, The Advanced Book Studies in Nonlinearity Series, Westview Press, (2000)
2. Davis, L. S.: A Survey of Edge Detection Techniques, Computer Graphics and Image Processing, 12 (1975) 248-270
3. Ganguly, N., Sikdar, B. K., Deutsch, A., Canright, G., Chaudhuri, P. P.: A Survey on Cellular Automata, Project BISON (IST-2001-38923), (2001)
4. Georgé, J. P.: Résolution de problèmes par émergence, PhD Thesis, Université Toulouse III, July 2004
5. Langton, C. G.: Studying artificial life with cellular automata, Physica D, 22 (1986) 120-149
6. Mitchell, M., Crutchfield, J. P., Das, R.: Evolving Cellular Automata with Genetic Algorithms: A Review of Recent Work, in Proceedings of the First International Conference on Evolutionary Computation and Its Applications (EvCA'96, SFI), Moscow, (1996)
7. Moreno, J. A., Paletta, M.: Evolving Cellular Automata for Noise Reduction in Images, in Proceedings of CAEPIA'2001, (2001)
8. Rosin, P. L.: Training Cellular Automata for Image Processing, in Proceedings of the Scandinavian Conference on Image Processing, SCIA'05, (2005) 195-204
9. Shapiro, L. G., Stockman, G. C.: Computer Vision, Prentice Hall Inc. (2001)
Clustering Microarray Data Within Amorphous Computing Paradigm and Growing Neural Gas Algorithm S. Chelloug, S. Meshoul, and M. Batouche PRAI Group, LIRE Laboratory Mentouri University of Constantine {chelloug, meshoul, batouche}@wissal.dz
Abstract. The work described in this paper covers mainly the exploration of an important paradigm called amorphous computing. With current smart systems composed of a great number of cognitive entities, amorphous computing offers useful tools and languages to make a coherent behavior emerge from local communications and limited capabilities. In order to emphasize its capabilities, the problem of clustering microarray data has been solved within this new computing paradigm. Moreover, it is difficult and time consuming to deal with a large amount of noisy gene expression data. The core motivation for using amorphous computing comes from its self-assembly property, which lets clusters of complex gene expressions emerge. In particular, the process of clustering was applied with respect to the Growing Neural Gas algorithm (GNG), which is an incremental learning method and a visual technique. Although the GNG draws important features from the Self-Organizing Map (SOM), it yields accurate results when no information about the initial distribution is available. This contribution considers a huge number of amorphous computing entities placed irregularly, each with a randomly selected reference vector. Using the GNG, the visualization of clusters of gene expressions is obtained by amorphous computing particles. The results obtained using the NetLogo platform are very encouraging and argue that self-organization by means of local interactions and irregularly placed particles will achieve good performance in a real amorphous computing system.
1 Introduction
The current interest in the parallel systems field appears in the investigation of the amorphous computing paradigm. Before its advent, parallel systems explored computational elements provided with some information: identities, coordinates and the configuration surrounding each element. To that effect, amorphous computing challenges previous paradigms and considers a huge number of computational elements placed irregularly and communicating locally via broadcast. Also, these entities receive neither geometric information nor time synchronization. At the heart of the success of amorphous computing, self-organization, self-healing and self-assembling properties are achieved with simple rules and mechanisms. In addition, one of the most studied technologies that raises considerable interest is called
microarrays. From a material point of view, microarrays are microscopic slides that contain an ordered series of samples. They are used to determine the expression level of genes in a sample [1]. Hence, the most significant advantages of microarrays concern the extraction of new knowledge to help medical diagnosis. The amount of data obtained is very large; it is represented by a matrix, where each element encodes the expression level of a gene under a specific condition. However, due to the data size, errors and noise, the innovation of new techniques to analyse gene expressions is still under research. Since several clustering algorithms have been proposed, the choice of the appropriate technique is a problematic one. In this paper, the GNG algorithm is applied for clustering microarray data. It was advocated by Bernd Fritzke as an incremental learning technique that overcomes limitations of the SOM. It successively creates more nodes based on simple concepts (reference vector, error, and neighborhood). The contribution reported below suggests a new direction for clustering microarray data using amorphous computing and the GNG approach, in quest of analyzing gene expressions by an in-depth interpretation of the obtained results. Thus, this paper includes six sections. In the first section, we present amorphous computing as a challenging paradigm. Next, we introduce microarrays. The GNG algorithm is explained in section 4. In addition, the proposed framework for clustering microarray data is described in section 5. Some experimental results are presented and discussed in section 6. Finally, conclusions and future work are drawn.
2 Amorphous Computing Actually, the state of artificial intelligence is very different to that of its earlier applications. While the first researchers designed theories and techniques to program the actions of a single cognitive entity, it was observed that new real problems involve distributed and open systems [2]. Hence, these systems are based on self-organization, self-healing and self-assembling properties that will enable the efficient development of complex applications. Using millions of tiny computing elements integrated with sensors and actuators and embed them into materials [3] is the principal characteristic behind this new vision. So far, the progress in robots has taken a radically different form that raises interest challenges. As explained in [4], the next robots will be built of a huge number of tiny programmable modules that could assemble themselves into different shapes and reconfiguring into a needed tool. In addition, the innovation of such systems is motivated by two factors: first biological systems exhibit remarkable robustness and self-healing properties relying on chemical substances and simple functions; second, nanotechnology has given support to such systems. In the context of sensor networks, which aim at providing a powerful solution to the monitoring and control of critical applications, paintable computing is described as computational elements that can be painted on an arbitrary surface and organize themselves to do an important task [5]. Namely, pushing computing is another progress, where the notion of self-assembly is a central one. It is defined as a set of pushing elements, which can sense and react to the physical world by means of installing, executing and passing a code [6]. In spite of the novelty of sensor networks and cellular engineering, important limitations are raised from the question of programming such systems. One
Clustering Microarray Data Within Amorphous Computing Paradigm
811
way to overcome this problem is called amorphous computing. The attempt to introduce amorphous computing is to obtain a coherent behavior from the corporation of a mass of identical irregularly placed particles that may affect actions, might be mobile, have modest memory, are not synchronized and communicate with few neighbors. In addition, they are possibly faulty, sensitive to their environment, receive no time synchronization information and each particle’s clock is initialized to a random time [7, 8]. So far, most approaches related to amorphous computing are described as languages. Furthermore, a global description is converted by an interpreter to low actions. The Origami Shape Language ‘OSL’ presented by Radhika Nagpal [9] encompasses the property of robustness by using average behavior rather than individual one. Also, the translation of a high description defined by means of six axioms to a low representation is done by gradients that produce good estimate of distance. Additionally, the Growing Point Language ‘GPL’ described by Daniel Coore [10] has used metaphors from Biology and makes it possible to define topological structures by the exploration of growing points that can split, die off and merge. Another lay of improvement in the field of amorphous computing is Resilient Serial Execution on Amorphous Computer ‘RSEAM’ based on a simple semantic of grouping nodes into cells [11]. Unlike the above languages, RSEAM draws heavily upon the problem of writing serial programs that are not aware of the underlying infrastructure and limitations of amorphous computing particles. The aim is to ensure that a program can be converted to a set of capsules interacting between them. Applied to the executing of a swarm of robots, the crucial issue for digital hormone model is to view robots as biological systems that can communicate and collaborate via hormones and receptors [12]. It inspired amorphous computing paradigm by relying on broadcast. Beside this approach, Stoy and Nagpal [13] have described a self-reconfigurable method used successively to move modules and arrive to a desired final configuration by a recruitment gradient generated to attract a wandering module. More specifically, the synthesis of 2D shapes using amorphous computing particles is possible by the compilation of a predetermined global shape into a program that instructs the agents to grow the shape. Kondacs [14] explained that replication and location based control mechanism are the two actions executed by amorphous computing entities to synthesis a 2D shape. Recently, Frost has taken ideas from biology and sensor networks and developed a decentralized approach for mapping unknown arbitrary 2D shapes. The developed method is very interesting. It is able to map shapes using simple actions [15].
3 Microarrays Even though the idea of understanding cell’s functions might be relatively old, only some important features of cells are successively explained. As noted in [16], the production of proteins stands out as a powerful mechanism that involves two steps: the transcription of the gene information into molecules of mRNA is the first phase. During the second one, mRNA molecules are translated into proteins within individual cells. Accordingly, biologist researches have defined the gene expression level as the measurement of the relative representation of a large number of mRNA species [17]. One of the differences that exist between actual and previous methods
812
S. Chelloug, S. Meshoul, and M. Batouche
concerns the measurement of the expression level of thousand of genes simultaneously [18]. This is achieved by microarrays technology that forms a project, which is concerned with the investigation of electronic, mathematic, chemistry and computer science to study gene expressions. Already, [1] defines microarrays as a small surface, made in glass or nylon, and spotted in an organized order with a large number of probe DNA sequences. In such technology, the emphasis is characterized by four steps: the fabrication of microarrays, labeling of each mRNA, the transcription of the two mRNA samples into cDNA, which are both applied to the microarray and hybridize with arrayed probe DNA and then the extraction of the combined fluorescence from each mRNA with a scanner [16]. Obviously, there are good reasons for using microarrays: identify gene regulated in similar manner and characterize cellular differences between different cell types [1]. Beside these features, microarrays are widely used to deal with cancer research [19]. Once all steps are applied, microarrays provide a data matrix where the rows represent gene expressions, columns are experiment conditions and each element (i, j) is a real number indicating the value taken by the gene i in experiment j . As biologists are confronted with a large amount of complex gene expressions [18], generally including missing values and noise [20], the analysis of microarray data is an important issue.
4 Clustering Clustering is the process of organizing a collection of K-dimensional vectors into groups whose members share similar features in some way [21]. In order to partition an initial input space, a wide range of clustering algorithms has been suggested in the literature. Each of them exhibits some advantages and problems. Clustering algorithms fall into two groups: supervised algorithms and unsupervised ones. Other authors distinguish hierarchical and non-hierarchical clustering techniques [22, 23]. Partitioning techniques choose the number of clusters before the actual process and frequently refer to the K-means algorithm. For a relatively more complex task, the SOM is a more suitable way that learns the topological mapping of n dimensional space. It is viewed as two layers of neurons. It is a more suitable way that learns the topological mapping of n dimensional space. Due to a competitive learning process that leads to a flexible net, SOM folds into a cloud formed by the input data [24]. For some cases, where neither the size of the map nor the organization of the map is available, the GNG seems a powerful solution. The GNG is among the recent clustering algorithms. It is motivated by a need to have a structure that is well suitable to represent the given input by modifying and extending the network architecture during learning. It appears quite different from SOM but it draws important features from it. Its principle is very simple: starting with two nodes, the algorithm constructs a graph that preserves the topology in a very general sense. During the process of clustering and for each input signal x an edge is inserted between the two closest nodes. Formally, each node k consists of a reference vector (wk ) , a local accumulated error (errork ) and a set of edges defining its topological neighbors. The basic algorithm can be found in [21], however, we describe here a few key aspects of the GNG. Once two randomly connected units are selected and connected by a zero age
Clustering Microarray Data Within Amorphous Computing Paradigm
813
edge, the formation of clusters is obtained by an execution of a competitive Hebbian learning process. It proceeds in locating the closest nodes to each signal x and connecting them by an edge. Assume that s and t are the nodes of interest with 2
2
ws − x as the smallest value and wt − x as the second smallest one, for all nodes k, the unit s will be translated to another position with a linear movement by updating its weight vector. The new error value is:
errors ← errors + ws − x
2
(1)
Jim Holmtrom explained that updating the error with squared distance is a way of detecting nodes that cover a large portion of the input distribution. So, the novel position of the winner and its topological neighbors is thus:
ws←ws +ew(x+ ws ) , ew[0,1] wn←wn +en(x+ wn) ∀ n ∈ neighbors(s) en∈[0,1]
(2) (3)
The next step meant to maintain the Delaunay triangulation by incrementing the variable edge of s and all its topological neighbors. This process must be controlled during learning, so recent edges have their age reset to 0. More specifically, edges that should not be included in the Delaunay triangulation are moved. We identify them by looking to their age variable that becomes old. After a predefined number λ of iterations, a new node will be inserted between the node u with the largest error and its neighbor v with the second largest error. The insertion is done by respect to this equation:
(wu + wv) (4) 2 Due to node insertion, the errors of u and v will be invalid. To that effect, u , v and r must update their error’s variable using the following formulas: wr ←
erroru←α × erroru
(5)
errorv←α x errorv
(6)
errorr ←erroru
(7)
5 The Proposed Framework Taking inspiration from the GNG, we propose a contribution that can be used to cluster microarrays data within the context of amorphous computing. The basic idea is that a massively distributed system of amorphous computing particles can dynamically change their links in a network. They communicate locally to accomplish the construction of a graph that represents well the input gene expressions. Using amorphous computing particles, one would have to investigate: 1. How can an amorphous computing particle consider itself as the winner that has the minimum distance to some stimulus or the maximum error?
814
S. Chelloug, S. Meshoul, and M. Batouche
2. How can amorphous computing particles create edges between them? 3. Assuming that the GNG algorithm is applied to mobile nodes, how can static amorphous computing entities affect the graph generated? To answer these questions, we have conducted some adjustments to the GNG: 1. A related approach presented by Ellie D’Hondt [25] suggests the organization of a global task (in this case, the recognition of the winner) by the construction of an amorphous computing hierarchy. Theoretical and experimental results have shown that the amorphous computing hierarchy demonstrates general benefits. It minimizes cost, simplifies programming, facilities analysis, increases efficiency and termination is guaranteed [26]. So, amorphous computing particles organize themselves into clubs at different levels. At the first level of the hierarchy, the club algorithm proceeds by electing leaders. Each amorphous computing element chooses a random number and decrements it. If it reaches zero without being interrupted, it declares itself leader and diffuse a recruitment message to other particles. Thus, a particle that receives a message becomes member of the group and can eventually be recruited by further leaders. The higher level of the hierarchy is obtained by tree region algorithm. Its principle is similar to the above one but uses a parameter h to determine the bound of the region. 2. To apply the GNG, the second question can be solved by the attribution of relative coordinates to amorphous computing particles. In this context, Frost [17] has considers the problem of amorphous shape mapping and associates relative coordinates to the computational elements on the basis of the distance and orientation from a reference one. 3. Without moving, amorphous computing particles choose another one satisfying some conditions and pass it the past of the computation. We note that with irregularly placed particles, it is not always possible to find amorphous computing particles that satisfy equations (2, 3). We extend them as follow:
wk =(ws +ew(x+ ws ))±δ
(8)
wn =(wn +en(x+ wn))±δ
(9)
Interactions between a mass of amorphous computing particles are the principle key of this work. In particular, a mass of amorphous computing particles is placed randomly. Each of them is associated with a reference vector. In fact, the first step is a solution, which improves high performance to organize some tasks assigned to amorphous computing particles. If p is the highest leader of the hierarchy, it
associates itself (0,0) coordinates, alters its variable state to true and sends relative coordinates to others. On a more profound level, each amorphous computing particle is associated with a queue (msg-queue) to store received messages. In addition, an amorphous computing entity might be active or not depending on its active variable (active), which indicates if it is incorporated in the current graph. However, this pseudo-code proposed can be abstracted by the following actions: diffusion of a message by the root of the hierarchy, emission of the required information, modification of error’s variable and then the selection of another particle to give it the past of the computation.
Clustering Microarray Data Within Amorphous Computing Paradigm
815
6 Experimental Results An implementation of the proposed method has been made in order to conduct some experimental results. Netlogo 2.1 was the programming environment chosen. It is a programmable modeling environment for simulating natural phenomena. It can simulate a huge number of independent agents which act in parallel and execute their programs independently, if necessary interacting through local communication. It can take advantage of multiple processors by having multiple copies of Netlogo open, in separate virtual Java machines. However, it includes some built-in agents’ variables that hold the initial values like the coordinates of the agents. In fact, it is necessary to hide some aspects of Netlogo and write our programs with respect to the constraints of amorphous computing paradigm. In order to cluster a relatively small number of gene expressions extracted freely from the PDR database available from: http://www.biologie.ens.fr/ lgmgml/publication/pdr1gal/home.html. 3000 amorphous computing particles are randomly distributed; each runs asynchronously the amorphous GNG. Figure 1 illustrates the initial configuration of amorphous computing particles that are placed irregularly. As we can observe from table 1, λ is very small to increase the probability of finding a new particle satisfying equation (4). In an attempt to test the technique and visualize the clusters obtained, we have referred to the U-Matrix (Unified matrix), usually constructed on the top of neurons. For the sake of terminology, if n is a neuron (amorphous computing particle), nn its set of immediate neighbors and wn its reference vector [29]:
u −height(n)=∑m∈NN(n) d(wn − wm)
(10)
Fig. 1. Initial configuration of amorphous computing particles Table 1. The considered parameters
ew
en
α
β
age
λ
δ
0.25
0.23
0.75
0.6
20
2
0.8
816
S. Chelloug, S. Meshoul, and M. Batouche
<0.45 <0.48 <0.2 >=0.8
Fig. 2. Configuration after 1 iteration
Fig. 3. Visualization of clusters
Table 2. Clusters of gene expressions
Color of the cluster
Gene expressions
Number of entities
Cyan
YPR198W YAL004W YPR196W YAL002W
YAL010C YAL038W YAL044C YAL043C YPR193 YAL003W YAL035W
Magenta
YAL045C YAL047C YAL042W YPR194C
YAL039C YAL036C YAL048C YPR195C
Yellow
YAL046C
YAL009W YAL011W YAL001C YPR199C YPR197C YAL049C YAL007C YAL037W YAL008W YAL043C-A YAL005C YAL040C YAL012W
1263
1533
114
% 42.1
51.1
4.73
The fundamental principal is to assign different colors to amorphous computing particle on the basis of the value obtained. The application of the U-Matrix after the first iteration does not show any cluster. The pseudo-code executed leads to the modification of particles’ reference vectors to visualize well the organization of the input gene expressions. In figure (3), amorphous computing articles have clearly cluster gene expressions and three clusters are illustrated. These results illustrated in table 2 show that a great number of genes have high level expression and are assigned to the second cluster. Only one gene has low level expressions and the others have important gene level expressions. Although the interpretation we have given is not sufficient, clustering gene expressions with amorphous computing provides a good scalability and time. Finally, the quality of the result is analyzed by the average quantization error (AQE):
∑
1 x − wc , AQE = 1 1+ AQE N x : An input vector, wc : the reference vector of the winner.
(11)
Clustering Microarray Data Within Amorphous Computing Paradigm
817
After 1500 iterations, the quality of the map is 0,9722 that represents a very interesting value.
7 Conclusion This work presents amorphous computing as a distributed control method that allows a coherent behavior to come from local interactions of irregularly placed particles. Combined with the GNG, it demonstrates that a mass of computational entities that receive neither geometric information nor time synchronization can self-organize to cluster gene expressions. Within the scope of this contribution, we show that a massive parallel system faced to limited information is a powerful tool in the first phase of knowledge extraction. If successful in a real amorphous computing system, this technique would also move complexity raised with microarray data. With regard to its limitations, this technique suffers from the problem related to the appropriate parameters, which is addressed by both clustering methods.
Acknowledgments This work was supported by CMEP – PROGRAMME TASSILI under Project 05 MDU 642.
References [1] T.Pasanen, J.Saarela, I.Saarikko, T.Toivanen, M.Tolvanen, M.Vihinen, G.Wong,
‘DNA Microarray Data Analysis’, Part 1, Editors Jarno Tuimala And M. Minna Laine CSC, ISBN 952-9821-89-1, 2003. [2] K-P. Sycara, “Multiagent Systems”, AI Magazine, Vol 19, No.2, 1998. [3] R.Nagpal “Programmable Self-Assembly Using Biologically-Inspired Multiagent Control”, AAMAS’02, Bologna (Italy), July 2002. [4] R.Nagpal, M.Mamei, “Engineering Amorphous Computing Systems”, Invited chapter in Methodologies and software engineering for agent systems, editors Bergenti, 2004. [5] W. Buthera, “Programming a Paintable Computer”, Phd thesis, Media Lab, MIT, Cambridge, 2002. [6] J.Lifton, “Pushpin Computing: A Platform for Distributed Sensor Networks”, Master’s of Science in Media Arts and Sciences, MIT, September 2002. [7] H. Abelson, D. Allen, D. Coore, C. Hanson, G. Homsy, T. Knight, Jr., R. Nagpal, E. Rauch, G.Sussman,R.Weiss,“AmorphousComputing”, AI MEMO 1665, MIT, August 99. [8] J.Beal, “A Robust Amorphous Hierarchy From Persistent Nodes”, AI Memo 2003 - 012, MIT, May 2003. [9] R.Nagpal, “Programming Self-Assembly: Constructing Global Shapes using Biologically-Inspired Local Interconnections and Orgami Mathematics”, thesis, MIT, 2001. [10] D.Coore, “Botanical Computing, a Development Approach to Generating Interconnect Topologies on an Amorphous Computer”, thesis, MIT, February 1999.
818
S. Chelloug, S. Meshoul, and M. Batouche
[11] A. Sutherland, “Towards RSEAM: Resilient Serial Execution On Amorphous Machines”, Master’s thesis of engineering in electrical engineering and computer science, MIT, Cambridge, June 2003. [12] W-M.Shen, P-Will, A.Galastyan, C-M.Chuong, “Hormone-Inspired Self-Organization and Distributed Control of Robotic Swarms”, Autonomous Robots 17, 93-105, Netherlands, 2004 [13] K. Stoy, R. Nagpal, “Self-Repair trough Scale Independent Self-Reconfiguration”, proceeding of 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems, Sendai, 2004. [14] A. Kondacs, “Biologically Inspired, Self-Assembly Of Two-Dimensional Shapes Using Global To Local Compilation”, International Joint Conference On Artificial Intelligence (IJCAI), 2003. [15] C.Frost, “Amorphous Mapping Objects”, thesis in TCC, University of Virginia, May 2004. [16] G.Grothaus,“Biologically Interpretable Disease Classification Based on Gene Expression Data”, Master’s thesis, Virginia Polytechnic Institute, May 2005. [17] G.Unger, “Linear Separability and Classificability of Gene Expression Datasets”, Master’s thesis, Aviv University, February 2004. [18] A.Mateos, J.Herrero, J.Tomames, J.Dopazo, “Supervised Neural Networks for Clustering Conditions in DNA array Data after Reducing Noise by Clustering Gene Expression Profiles”, Center for Cancer Research (CNIO), Madrid, Spain, Microarray data analysis2, KluwerAcademic, Pp. 91-103. [19] K.Aas, “Microarray Data Mining: Asurvey”, SAMBA/02/01, January 2001. [20] Cho, H.Won, “Machine Learning in DNA Microarray Analysis for Cancer Classification”, Yonsei University, Korea, Asia-Pacific Bioinformatics Conference, Vol .19, Australia, 2003. [21] J. Holmstrom, “Growing Neural Gas experiments with GNG, GNG with utility and supervised GNG”, Uppsala Master’s thesis in computer science, Uppsala University, Sweden, 2002. [22] H.Jin, W-H.Shun, KS.Leung, M-L.Wong, “Expending Self-organizing map and Cluster Analysis”, Chinese and Lingnan University, Elsevier 163(2004) 157-173, 2003. [23] J. Neumann, “Holistic Processing of Hierarchical Structures in Connectionist Networks”, Phd thesis, University of Edinburgh, 2001. [24] T.Liao “Image Segmentation - Hybrid Method Combining Clustering and Region Merging”, Thesis, Monash University, November, 2003. [25] E.’Hondt, “Exploring the Amorphous Computing Paradigm”, Master’s thesis in computer science, Vrije University, Brussel, August 2000. [26] D. Coore, R. Nagpal, R. Weiss, “Paradigms for structure in an amorphous computer”, AI Memo No. 1614, October 97. [27] A.Ultsech, “U*-Matrix: A tool to visualize clusters in high dimensional data”, University of Marburg, Germany, December 2003.
Conflict-Directed Relaxation of Constraints in Content-Based Recommender Systems Dietmar Jannach and Johannes Liegl Institute for Business Informatics & Application Systems University Klagenfurt, A-9020 Klagenfurt, Austria [email protected], [email protected]
Abstract. Content-based recommenders are systems that exploit detailed knowledge about the items in the catalog for generating adequate product proposals. In that context, query relaxation is one of the basic approaches for dealing with situations, where none of the products in the catalogue exactly matches the customer requirements. The major challenges when applying query relaxation are that the relaxation should be minimal (or optimal for the customer), that there exists a potentially vast search space, and that we have to deal with hard time constraints in interactive recommender applications. In this paper, we show how the task of finding adequate or customer optimal relaxations for a given recommendation problem can be efficiently achieved by applying techniques from the field of model-based diagnosis, i.e., with the help of extended algorithms for computing conflicts and hitting sets. In addition, we propose a best-effort search algorithm based on branch-and-bound for dealing with hard problems and also describe how an optimal relaxation can be immediately obtained when partial queries can be (pre-)evaluated. Finally, we discuss the results of an evaluation of the described techniques, which we made by extending an existing knowledge-based recommender system and which we based on different real-world problem settings.
1 Introduction Content-based recommender systems are a special class of recommendation systems that operate on the basis of explicit knowledge about customer requirements, product characteristics, and recommendation logic (typically some sort of "filter rules") that determines the set of products to be proposed, when given some specific customer requirements. One of the main problems of such content-based systems, however, is that situations can easily arise, where all of the existing products are filtered out and no adequate proposal can be made [1] as there exists no exact match. "No products found" is an undesirable system response in such situations, in which we would appreciate a more intelligent behaviour, i.e., an explanation of the situation or – even better – a list of products that fulfil as many as possible of the originally posted requirements. In that context, approaches based on Case-Based Reasoning (CBR) have the advantage that they are in principle capable of also retrieving products that M. Ali and R. Dapoigny (Eds.): IEA/AIE 2006, LNAI 4031, pp. 819 – 829, 2006. © Springer-Verlag Berlin Heidelberg 2006
820
D. Jannach and J. Liegl
are similar to the user's query. Nonetheless, also these similarity-based approaches have their limitations, for example, that the most similar product might not be acceptable for the user, or that the means for explaining the proposal are limited [9]. Another approach of dealing with such situations is therefore to "relax" [10] the problem by giving up some of the requirements (i.e., remove parts of the query) and then test whether there exists a product that fulfils at least the remaining requirements. However, finding good relaxations is not a trivial problem, because typically many different alternative relaxations exist, the proposed relaxations should be minimal (or optimal for the user), and finally, the time-frame for finding such relaxations is strictly limited, because recommender systems are interactive applications. In this paper, we show how the task of finding adequate relaxations for a given recommendation problem can be efficiently achieved by applying techniques from the field of model-based diagnosis, i.e., with the help of extended algorithms for computing conflicts and hitting sets. After giving an introductory example, we develop a general and implementation-independent model of the recommendation problem and show how the relaxation problem can be mapped to a model-based diagnosis problem such that extended algorithms for conflict-identification and hitting set computation can be applied. In addition, we also propose a best-effort search algorithm based on branch-and-bound for dealing with hard problems and finally discuss experimental results which were achieved by extending an existing recommender system with these algorithms and by using different real-world test cases. The paper ends with a discussion of related and future work.
2 Example In the following, we will give a simple example from the domain of digital cameras for illustrating the relaxation problem and the importance of conflict-directed search. Let us assume our product database (PDB) of digital cameras contains the following entries. ID p1 p2 p3 p4
USB true false true false
Firewire false true false true
Price 400 500 200 400
Resolution 5 5 4 5
Make Canon Canon Fuji HP
Our knowledge base of filter rules (FRS) comprises the following definitions. f1: IF the customer requires high-quality printouts of the pictures THEN recommend cameras with a resolution of 5 mega-pixels. f2: IF the customer wants to have a cheap camera THEN recommend cameras with a price smaller than 300. f3: IF customer needs a USB port THEN recommend cameras with a USB port. f4: IF customer wants extended connectivity THEN recommend cameras supporting Firewire. f5: IN ANY CASE recommend cameras with a resolution higher than 3 mega-pixels.
Conflict-Directed Relaxation of Constraints
821
Now, let us assume that a customer has the following requirements (REQ): USB-support, extended-connectivity, only cheap cameras, high-quality printouts. Given these requirements, the product database and the filter-rules, no single product will fulfil all the requirements. We can now try to relax (retract) some of the filter rules, in order to find products that fulfil as many of the customer's constraints as possible. A simple algorithm for finding an adequate relaxation is to compute all possible combinations of filters (the powerset of FRS) and check for each of these combinations whether products remain when these filter rules are retracted. Of course, such an algorithm will – in the worst case – require 2n checks, which is inadequate for realistic settings, where response times below one second are required. We therefore propose adopting a conflict-directed approach similar to Reiter's [13] Hitting-Set algorithm: If we look closer on the problem situation of our example, we see that there are two "minimal conflicts", c1 = {f1, f2} and c2 = {f3, f4}, i.e., in any valid problem relaxation, at least one of the filters of c1 and one of the filters of c2 has to be removed. We also see that f5 is not involved in any conflict, which basically means that we do not have to take f5 into account when searching for relaxations. Furthermore, based on Reiter's general theory [13], we can conclude that computing the set of possible, minimal relaxations (i.e., minimal diagnoses in the sense of [13]) can be efficiently done by computing the Hitting Set of all minimal conflicts. In our example problem the set of minimal relaxations thus is the set: {{f1, f3}, {f1, f4}, {f2, f3}, {f2, f4}}. Given these conflicts, finding the first valid relaxation {f1,f3} thus involves only the examination of three paths ({f1},{f2},{f1,f3}), i.e., three queries to the database, when constructing the search tree in breadth-first manner (Fig. 1).
9
9
9
9
Fig. 1. Hitting set search tree for example problem
3 Conflict-Directed Filter Relaxation In the following, we develop a basic formalization of content- or filter-based1 recommender problems by mapping the problem of finding adequate products in a catalog to a selection-query for a relational database, which has the advantage that we 1
We view "filter-based recommenders" [1] as a special class of content-based approaches.
822
D. Jannach and J. Liegl
can rely on an existing, well-established formalism. In addition, knowledge-based recommenders or simple product finders in fact work by dynamically constructing database queries for filtering the products. However, please note that the chosen formalisation does not necessarily imply that a database system has to be used; it is rather used here to clearly characterize the problem. Content-based recommender systems rely on the existence of a product catalog that contains detailed descriptions of the items to be recommended. In general, such a product catalog can be described as a relational database table as follows. Definition (Product catalog): Items in a product catalog are described by a set of attributes A; each attribute ai ∈ A is assigned a domain di. A product catalog P is then a subset of the Cartesian product of the domains di, i.e. P ⊆ d1 x … x dn. The set of suitable products for some given customer requirements is determined by a set of filter rules, which were informally sketched as "if-then" rules in the example section. Definition (Filter rule): Let C be the set of attributes for describing customer requirements. A filter rule f can be described by the two characteristics f.AC, and f.FE, where f.AC is a Boolean formula (activation constraint2) over customer requirements C which describes the condition, when the filter rule shall be applied. f.FE represents the actual filter expression on the elements of the catalog (subquery) and is a Boolean formula over constant values, attributes of the product catalog and attributes from C. Note that in our definition we allow the usage of variables (from C) in the filter expression, e.g. for modelling dynamic filter rules like "In any case, propose only products whose price is equal or lower to the price the customer has specified." The recommendation problem consists of finding a set of items from the catalog that fulfils all the filter expressions of the active filter rules. Definition (Recommendation problem): A recommendation problem RP can be described by a tuple
: P is a product catalog, FRS is the set of filter rules, C the set of attributes for describing customer requirements, and CRS a function over the elements of C describing the actual customer requirements. The compound filter expression CFE for RP is a Boolean formula defined to be the conjunction of the filters whose activation condition is true, given some customer requirements CRS, i.e, CFE =
∧
f∈FRS , f.AC = true
(f.FE).
Finding a solution to
then corresponds to performing the database selection σCFE on the product catalog P. If none of the products in P satisfies all the compound expression CFE, we aim at finding a relaxation of the problem by retracting some of the filter rules, such that the selection results in a non-empty set of items from the catalog. 2
The notion of "activation constraints" is inspired by the Dynamic Constraint Satisfaction approach, see e.g., [9]. We introduce that concept such that we can apply the approach also for "knowledge-based" recommenders, like, e.g., [7].
Conflict-Directed Relaxation of Constraints
823
Definition (Valid relaxation): Given a recommendation problem RP
, for which the size of the selection |σCFE (P)| = 0, a valid relaxation is a set RFRS ⊆ FRS such that the solution for the modified recommendation problem RP'
contains at least one element. Lemma: Given a recommendation problem RP
and P≠∅, a valid relaxation will always exist, because if we set RFRS = FRS, the selection query will be empty and all p ∈ P are in the solution for RP. In general, however, we are interested in finding optimal or minimal relaxations, e.g., we should try to find products that fulfil as many of the customer requirements as possible. Definition (Minimal relaxation): A relaxation RFRS for RP is said to be minimal if there exists no RFRS' ⊂ RFRS such that RFRS' is a valid relaxation for RP. Computing minimal relaxations. One possible algorithm to compute possible relaxations is to compute the powerset of FRS and evaluate the value of the individual solutions. Because of the natural inefficiency of such an approach, we propose to apply a conflict-directed approach for finding relaxations: Given the definitions above, we can view the relaxation problem as a model-based diagnosis problem in the sense of [13] and correspondingly view the set of filter rules as the set of diagnosable and thus possibly faulty components of the system. Definition (Conflict): Given a recommendation problem RP
, a conflict CF is a subset of FRS such that there exists no solution for RP
. A conflict CF is said to be minimal if there is no CF' ⊂ CF which is also a conflict for RP
. Adapted Hitting-Set algorithm (sketch): Given a recommendation problem RP
, construct a HS-DAG3 for the collection K of conflicts in breadth-first manner. Each node n of the HS-DAG is labelled with a conflict CS(n) ∈ K. Edges leading away from n are labelled with a filter rule f ∈ CS(n), the set of edge labels from the root to n is referred to as H(n). Every call to the Theorem Prover TP [13] at a node n returns true when there exists a solution for the adapted recommendation problem RP
(meaning that n can be closed) or a conflict in the other case. Computing minimal conflicts. If we are given a recommendation problem RP
whose result set is empty, the whole set of (active) filter rules of course represents a conflict. However, the size of the conflict sets directly influences the size of the resulting search tree, i.e., we are in general interested in finding small or minimal conflict sets. We therefore propose to use Junker's algorithm for efficiently computing such minimal conflicts [8]: QUICKXPLAIN is a recent, nonintrusive conflict detection algorithm that – based on a divide and conquer strategy – decomposes the overall problem based on the concept of "preferred constraints". The main advantage of the approach lies in its general applicability, i.e., it is not bound to 3
Hitting Set Directed Acyclic Graph; according to [13], finding the set of minimal diagnoses corresponds to computing the Hitting Set of all minimal conflicts.
824
D. Jannach and J. Liegl
specific inference and dependency-tracking mechanisms in the underlying reasoning engine. In addition, QUICKXPLAIN also supports search for "preferred" conflicts for cases where not all elements of the conflict have the same priority. When applied to our problem, conflict detection means to identify minimal sets of active filter rules that have no products in common, i.e., which will lead to an empty result set. Best-effort search for customer-optimal relaxations. When using Reiter's algorithm, we search for diagnoses in breadth-first manner, implicitly assuming that smaller diagnoses (relaxations) are preferable. In recommender applications, however, not all of the filter rules may have equal importance for the customer. On the one hand the domain engineer might annotate the rules in advance with some priority of application based on his/her expert knowledge; on the other hand we could allow the user to express his/her preferences on the importance of rules or derive it from other (external) data sources like from the behaviour of other users. Definition (Optimal relaxation): Given a recommendation problem RP
, let RC be a function over the elements of FRS describing the costs of retracting a single filter f ∈ FRS. Let COSTS be a function describing the costs of a relaxation as an integer number, which takes the set of retracted filters, the individual costs RC, and eventually also customer characteristics4 into account. A relaxation RFRS for RP is said to be optimal if there exists no other set RFRS' ⊆ FRS such that COSTS(RFRS', CRS, RC) < COSTS(RFRS, CRS, RC). For ensuring monotonicity of the COSTS function, COSTS(RFRS', CRS, RC) < COSTS(RFRS, CRS, RC), has to hold in cases where RFRS' ⊂ RFRS We introduce this general cost function, as minimal relaxations may not always be optimal, i.e., it might be better to relax two rules with lower importance than to relax one rule which is highly important for the customer. Applying breadth-first search is thus not reasonable in that context because a potentially many different conflicts have to be computed for finding the first solution; exhaustive search is only possible for very small problems. We therefore propose to explore the search space in depth-first manner, such that a first (non-optimal) relaxation can be found in very short time. After having found this initial solution, we proceed by applying a branch-and-bound search in which we prune those search paths that will definitely not lead to a better solution. We also propose to implement a time-constrained best-effort variant, where we define a maximum time (e.g., one second) for optimization. Of course, optimality of the result cannot be guaranteed then in general, but our experiments suggest that even rather complex problems can be solved in very short time or good relaxations can be found, e.g., when using a domain-independent, priority-based search heuristic. Depth-first algorithm (sketch): Given a recommendation problem RP
, construct the HS-DAG for the collection K of minimal5 conflicts in depth-first manner. Label edges and nodes similar to the algorithm described previously. As a general search heuristic, sort the elements of each conflict according to their weight in 4 5
The cost function could, for instance, take personal preferences of the user into account. Note that we rely on minimal conflicts that are computed with QuickXPlain, which also means that tree pruning [13] is not required.
Conflict-Directed Relaxation of Constraints
825
ascending order. Remember the value of the optimization function for the first valid relaxation as the currently best solution. When further exploring the search tree, always remember the value of the currently best solutions and prune those paths that will definitely result in solutions which are worse than the optimum found so far. In addition – since the relaxations that are found in the depth-first approach may be nonminimal – we need to minimize each relaxation that is found before proceeding with the search process. This minimization operation, however, requires at most n-1 additional consistency checks for a relaxation of size n, since we can check each of the elements of the actual relaxation individually.
4 Implementation and Discussion In order to evaluate our approach, we have implemented the proposed algorithms in the knowledge-based Advisor Suite [7] system. Our particular aim was to demonstrate that the diagnosis approach is suitable given the hard real-time requirements of interactive recommender applications: in general, given a set of n diagnosable components, i.e., filter rules, the computational complexity of finding diagnoses of cardinality k is O(nk). From various real-world instantiations of the Advisor Suite system we know that a typical knowledge base contains about 40-70 filter rules; we also claim that the size of allowed relaxations should be limited to a maximum of five to seven filter rules for pragmatic reasons. From practical settings we know that users are not satisfied with proposals where for instance half of the rules (corresponding to the original wishes) could not be applied. The most costly and frequent operation during the construction of the HS-DAG and the computation of conflicts is to determine for a given set of filter rules whether these rules, if applied together, lead to an empty result set. For increased performance, we have chosen to perform a preprocessing step in which we evaluate the active filter rules individually, i.e., for each of the filter rules, we set up the list of products that fulfil the filter expression. For all of the filter rules that contain no variables in the filter expression, we can pre-compute the set of products that fulfil the expression. For all other rules, we have to set up the product list at run-time for each customer session. The results of these filters can be represented as bit-sets in compact way, i.e., for each product in the catalog we set a flag, if it is in the result or not (Fig. 2). The computation of the conjunction of the filter rules can then be efficiently done by applying a "bitwise and" on these bit-sets.
Fig. 2. Bit-set representation of filter rules
826
D. Jannach and J. Liegl
If the resulting bit-set contains no entries, we know that the result set is empty and we have to relax some of the filters. Note that this sort of representation is not mandatory for applying our approach but was actually implemented in the Java-based Advisor Suite system. Our measurements of the running times (see below) also include numbers for an implementation variant that does not rely on these pre-processing step but rather on individual database queries (HS/DBQ). A particular side-effect when using the pre-processing technique is that we can directly determine the optimal or minimal relaxation without even constructing the HS-DAG: If we remember for each product the set of filter rules that had a "0" for the product and the sum of the weights of these rules (see Fig. 2) we only have to determine that product in the (empty) result set which minimizes the given optimization function. As such, the search for the best valid relaxation can be done by only evaluating each filter rule once, computing the conjunction of the bit-sets, and finding the minimum by scanning the result set (compare also the results of [6] for finding the first maximal succeeding subquery). We could therefore compute the optimal relaxation for all test cases in a few milliseconds (see column BSA in the table below). Still, this is only possible due to a very specific implementation in our system and is not as general as the described conflict-directed approach. In the different test examples we varied the relaxation goal (cardinality only/optimization), the number of (active) filter rules, the number of products and conflicts, the average size of the conflicts, and the cardinality of the minimal diagnoses6. Our test databases contained up to 2.000 different products – the ramp up time for constructing the needed data structures is about 2 seconds on a Pentium IV desktop PC, which is not problematic since this has to be done only once during server start-up. The additional memory requirement for the data structures alone is nbProducts * nbFilterRules bits and thus strictly limited. The following tables show the average running times for a rather hard test case for one of our real-world knowledge bases that contains 70 filter rules. In the test cases, 25 of them were active for some given customer requirements. The number of and size of the conflicts was adequately increased with the size of the minimal relaxation that was to be found. The numbers in the table correspond to average running times in milliseconds for finding the first/best relaxation. In that test case, all of the problems could be solved within the targeted time-frame of one second, even when using database queries instead of the bit-set representation. If no pre-computation of filter results was done, e.g., because all filter rules contain variables, the search process needed about 40 additional milliseconds in our test cases. The performance of the depth-first search in that test case is comparable (or slightly better) than the breadth-first approach, i.e., a good solution was found early and a lot of pruning could be done, i.e., fewer nodes had to be expanded. In general this will vary, depending on the number of conflicts and diagnoses, the search heuristic for the depth-first search and the particular characteristics of the importance factors and so forth. An implementation variant based on A* produced results comparable to those of the depth-first search. 6
Note that a relaxation of size 10 is already rather unrealistic and of little help for the user.
Conflict-Directed Relaxation of Constraints Table 1. Breadth-first search for first diagnosis
C 1 3 5 7 10
HS/BS HS/DBQ BSA 12 14 14 18 67
177.1 345.6 416.6 472.9 861.2
10 10 10 10 10
827
Table 2. Depth-first / Branch & bound for optimal diagnosis
#N
#TP
C
HS/BS
HS/DBQ
#N
#TP
1 4 23 95 1346
15 22 75 99 228
1 3 5 7 10
12 15 16 16 35
164.5 338.5 414.6 465.8 697.1
1 6 19 62 471
15 22 75 99 187
C: Cardinality of smallest relaxation HS/BS: HS-DAG construction, usage of bit-set representation HS/DBQ: HS-DAG construction, usage of dynamically constructed queries BSA: Analysis of bit-sets only (no HS-DAG) #N: Number of search tree nodes #TP: "Theorem prover" calls: DB-queries / Bit-set operations
5 Related Work and Conclusions Related work. Query-relaxation for content-based recommenders was recently addressed in [9] and [10]. In this work, McSherry proposed an approach in which the user is incrementally presented individual explanations for the retrieval failure and the problem is iteratively solved. Such system behaviour can in principle also be implemented based on our approach if we compute a minimal conflict in each step and let the user decide how to proceed. Nonetheless, there might be many interactions required (e.g., if we need to relax many filters), the system will not be able to predict in general if the relaxation of the next sub-query will succeed, and no overall optimum can be guaranteed. In addition, the number of queries required for computing all conflicts before entering the recovery process may be too large for interactive applications in their approach. An comparable iterative repair approach is also implemented in our surrounding framework [7], where the user can interactively change the priorities of the rules according to his/her preferences. The identification of the reasons for a failure of a query is also an issue in the area of Cooperative Query Answering. In [6], Godfrey shows complexity results for finding Minimal Failing Sub-queries MSF (correspond to conflicts in our approach) and maximal succeeding queries XSS (corresponding to our relaxed query) and proposes algorithms for finding one and enumerating all MSF/XSS. Our algorithms can be seen as alternatives to these basic mechanisms which are in addition capable of exploiting priorities for steering the (best-effort) search process and also incorporate recently developed algorithms for fast detection of minimal conflicts [8]. Incremental relaxation of overconstrained problems is also a well-known technique in the Constraint Satisfaction world, see e.g., [12] or [14], where soft constraints and Partial Constraint Satisfaction were introduced. These approaches, however, were designed to work in the context of some specific reasoning technique and cannot be easily applied to other problem solving or inference algorithms.
828
D. Jannach and J. Liegl
From the technological perspective, model-based diagnosis techniques for detecting problems in Software systems have been introduced e.g., in [2] or [4]. We see our approach in the general tradition of that work. Our future work includes the extension of the approach towards the computation of "repair alternatives": Instead of computing only possible relaxations, the goal will be to find a minimal set of changes in the user requirements such that a solution can be found. Finally, we also aim at evaluating approaches that exploit abstraction hierarchies within the knowledge base (see e.g., [3]), an approach that was also suggested as a tool for tackling the relaxation problem for recommenders in [11]. Conclusions. In this paper, we have shown how conflict-directed search can be used to significantly enhance the search for minimal relaxations of unsuccessful queries in filter-based recommender systems. The evaluation of the approach in a knowledgebased advisory framework has shown that even hard problem instances can be solved within the short time frames that are required in interactive recommender applications. Furthermore it was demonstrated how the usage of specialized data structures and pre-processing techniques can further reduce the number of required database queries and thus shorten response times. In our current work, we aim at generalizing this pre-processing approach and also extending it with a mechanism with which we can also compute relaxations (based on the bit-set matrix) that lead at least to n different products, which shall give the end user more choices when searching and comparing different items in a proposal.
References [1] D. Bridge. Product recommendation systems: A new direction. In R. Weber and C. Wangenheim, eds., Proceedings of the Workshop Programme at the Fourth International Conference on Case-Based Reasoning, 2001, p. 79-86. [2] L. Console, G. Friedrich, D. T. Dupré: Model-Based Diagnosis Meets Error Diagnosis in Logic Programs. IJCAI'1993, Chambéry, France, 1993, p. 1494-1501. [3] A. Felfernig A., G. Friedrich, D. Jannach, M. Stumptner: Hierarchical Diagnosis of large configurator knowledge bases. In: S. McIlraith, D. T. Dupré (Eds.): 12th Intl. Workshop on Principles of Diagnosis, 2001, p. 55-62. [4] A. Felfernig, G. Friedrich, D. Jannach, M. Stumptner. Consistency-based diagnosis of configuration knowledge bases, Artificial Intelligence, 152(2), 2004, p. 213-234. [5] E. Freuder, R. J. Wallace: Partial Constraint Satisfaction, Artificial Intelligence (58), 1992. [6] P. Godfrey. Minimization in Cooperative Response to Failing Database Queries, International Journal of Cooperative Information Systems Vol. 6(2), 1997, p. 95-149. [7] D. Jannach. ADVISOR SUITE - A knowledge-based sales advisory system. In: Proceedings of ECAI/PAIS 2004, Valencia, Spain, IOS Press, 2004, pp. 720-724. [8] U. Junker: QUICKXPLAIN: Preferred Explanations and Relaxations for OverConstrained Problems. Proceedings AAAI'2004, San Jose, 2004, p. 167-172. [9] D. McSherry, Explanation of Retrieval Mismatches in Recommender System Dialogues, ICCBR'03 Mixed-Initiative Case-Based Reasoning Workshop, Trondheim, Norway, 2003. [10] D. McSherry, Incremental Relaxation of Unsuccessful Queries, Lecture Notes in Computer Science, Volume 3155, 2004, p. 331–345. [11] N. Mirzadeh, F. Ricci and M. Bansal, Supporting User Query Relaxation in a Recommender System, Proceedings of 5th Intl. Conference on Electronic Commerce and Web Technologies, EC-Web '04, Zaragoza, Spain, 2004, p. 31-40.
Conflict-Directed Relaxation of Constraints
829
[12] S. Mittal, B. Falkenhainer. Dynamic constraint satisfaction problems. In: Proceedings of the 8th National Conference on Artificial Intelligence, AAAI'89, 1989, p. 25-32. [13] R. Reiter. A theory of diagnosis from first principles. Artificial Intelligence, 32(1), Elsevier, 1987, p. 57-95. [14] T. Schiex, H. Fargier, and G. Verfaille. Valued constraint satisfaction problems: Hard and easy problems. In International Joint Conference on Artificial Intelligence IJCAI'95, Montreal, Canada, 1995, p. 631-639.
Modeling pH Neutralization Process Via Support Vector Machines Dongwon Kim and Gwi-Tae Park* Department of Electrical Engineering, Korea University, 1, 5-ka, Anam-dong, Seongbuk-ku, Seoul 136-701, Korea {upground, gtpark}@korea.ac.kr
Abstract. This paper discusses the use of support vector machines for modeling and identification of pH neutralization process. Support vector machines (SVM) and kernel method have become very popular as methods for learning from examples. We apply SVM to model pH process which has strong nonlinearities. The experimental results show that the SVM based on the kernel substitution including linear and radial basis function kernel provides a promising alternative to model strong nonlinearities of the pH neutralization but also to control the system. Comparisons with other modeling methods show that the SVM method offers encouraging advantages and has better performance.
1
Introduction
Process modeling is a very important step in the control, diagnosis and optimization of the process system. For most cases, the model is constructed from the data because first principle models are very difficult to be obtained. Neural networks, fuzzy systems, etc. can all be used to model chemical processes using historical data. Unfortunately, insufficient data hamper the accuracy of these models because they rely strongly on the data when inducing process behavior [1]. Many methods have been proposed to overcome this problem. As promising methods, there is support vector machine (SVM) which is introduced by Vapnik [2]. It is an elegant tool for solving pattern recognition and regression problems. Over the past few years, a lot of researchers from neural network and mathematical programming fields are attracted to devote themselves to the research on SVM. The basic theory is well understood and applications work successfully in practice. In many applications [3,4,5], SVM has been shown to provide higher performance than traditional learning machines [6] and has been introduced as powerful tools. In this paper, SVM and kernel methods (KMs) are first applied to model and identify pH neutralization process. In the following, description of pH process and basic concepts of SVM will be described. In addition, we examine the feasibility of applying SVM in the strong nonlinearities of the pH neutralization process by comparing it with other techniques. *
Corresponding author.
M. Ali and R. Dapoigny (Eds.): IEA/AIE 2006, LNAI 4031, pp. 830 – 837, 2006. © Springer-Verlag Berlin Heidelberg 2006
Modeling pH Neutralization Process Via Support Vector Machines
831
2 pH Neutralization Process To demonstrate the high modeling accuracy of the SVM, we apply it to a highly nonlinear process of pH neutralization of a weak acid and a strong base. This model can be found in a variety of practical areas including wastewater treatment, biotechnology processing, and chemical processing [7-9]. The pH is the measurement of the acidity or alkalinity of a solution containing a proportion of water. It is mathematically defined, for dilute solution, as the negative decimal logarithm of the hydrogen ion concentration [H+] in the solution, that is, pH=-log10[H+]. In the continuously stirred tank reactor (CSTR) [10] investigated, shown in Fig. 1, acetic acid (HAC) of concentration Ca flows into the tank at flow rate Fa , and is neutralized by sodium hydroxide(NaOH) of concentration Cb which flows into the tank at rate Fb . The equations of the CSTR can be described as follows.
Fig. 1. Continuous tank reactor for pH neutralization (influent acid and reagent base streams, effluent stream)
The process equations for the CSTR are given by

V\,\frac{dW_a}{dt} = F_a C_a - (F_a + F_b)\,W_a, \quad V\,\frac{dW_b}{dt} = F_b C_b - (F_a + F_b)\,W_b    (1)
where the constant V is the volume of the content in the reactor, and Wa and Wb are the concentrations of the acid and the base, respectively. The above equations describe how the concentrations Wa and Wb change dynamically with time subject to the input streams Fa and Fb. To obtain the pH in the effluent, we need to find a relation between the instantaneous concentrations Wa and Wb and the pH value. This relationship can be described by a nonlinear algebraic equation known as the titration or characteristic curve. Depending on the chemical species used, the titration curve varies. The static titration curve can be described as [9]
W_b + 10^{-pH} - 10^{\,pH - pK_w} - \frac{W_a}{1 + 10^{\,pK_a - pH}} = 0    (2)
where pK_a = -\log_{10} K_a. The static relationship between base flow rate and pH in the reactor is plotted in Fig. 2.

Fig. 2. Titration curve for pH neutralization (pH in the effluent versus base flow rate Fb)
It can be seen that the strong nonlinearity inherent in the pH process is characterized by its steady state titration curve. We consider the weak acid-strong base neutralization process. By fixing the acid flow-rate Fa (81 cc/min) at a specific value, the process is regarded as a single variable system with base flow-rate Fb and the pH in the effluent being the input and output, respectively. The (Fb, y_pH) data pairs were produced by using the process physical model with the parameter values given in Table 1.

Table 1. Parameters and initial values for pH process

Variable   Meaning                         Initial setting
V          Volume of tank                  1000 cc
Fa         Flow rate of acid               81 cc/min
Fb         Flow rate of base               515 cc/min
Ca         Concentration of acid in Fa     0.32 mole/l
Cb         Concentration of base in Fb     0.05 mole/l
Ka         Acid equilibrium constant       1.76*10^-5
Kw         Water equilibrium constant      1.0*10^-14
Wa(0)      Concentration of acid           0.0435 mole/l
Wb(0)      Concentration of base           0.0432 mole/l
The base flow rate Fb was given by

F_b = 515 + 51.5\,\sin(2\pi t / 25), \quad t \le 150    (3)
To obtain such data pairs, we applied the Newton-Raphson method, given by Eq. (4):

pH_{i+1} = pH_i - \frac{f(pH_i)}{f'(pH_i)}    (4)
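As an illustration of how Eqs. (2) and (4) can be combined, the following Python sketch solves the titration equation for the pH by Newton-Raphson iteration with a numerical derivative. It is only a minimal sketch: the function names, the numerical-derivative step and the stopping tolerance are assumptions made here for illustration, not the implementation used in the paper.

import math

# Equilibrium constants from Table 1
K_A = 1.76e-5     # acid equilibrium constant
K_W = 1.0e-14     # water equilibrium constant
PK_A = -math.log10(K_A)
PK_W = -math.log10(K_W)

def titration_residual(ph, wa, wb):
    # Left-hand side of Eq. (2); it is zero at the true pH
    return wb + 10.0 ** (-ph) - 10.0 ** (ph - PK_W) - wa / (1.0 + 10.0 ** (PK_A - ph))

def solve_ph(wa, wb, ph0=7.0, tol=1e-9, max_iter=100):
    # Newton-Raphson iteration of Eq. (4); f'(pH) is approximated by a central difference
    ph, h = ph0, 1e-6
    for _ in range(max_iter):
        f = titration_residual(ph, wa, wb)
        df = (titration_residual(ph + h, wa, wb) - titration_residual(ph - h, wa, wb)) / (2 * h)
        step = f / df
        ph -= step
        if abs(step) < tol:
            break
    return ph

# Example with the initial concentrations of Table 1
print(solve_ph(0.0435, 0.0432))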
The system inputs of the SVM structure consist of the delayed terms of F_b(t) and y_pH(t), which are the input and output of the process, i.e.

\hat{y}_{pH}(t) = \varphi\big(F_b(t-3), F_b(t-2), F_b(t-1), y_{pH}(t-3), y_{pH}(t-2), y_{pH}(t-1)\big)    (5)

where \hat{y}_{pH} and y_{pH} denote the SVM model output and the actual process output, respectively, and 500 data pairs are generated. Further discussion of the process can be found in [9].
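To make the data-generation procedure of this section concrete, the sketch below integrates the CSTR balances of Eq. (1) with a simple Euler scheme, drives the base flow rate with the sinusoid of Eq. (3), converts the concentrations to pH with the solve_ph function sketched above, and assembles the delayed-input pairs of Eq. (5). The integration step, the one-sample-per-minute rate and holding Fb constant within each sampling interval are assumptions made for illustration only.

import math

V, FA, CA, CB = 1000.0, 81.0, 0.32, 0.05   # Table 1 values
wa, wb = 0.0435, 0.0432                    # initial concentrations Wa(0), Wb(0)
dt = 0.01                                  # Euler step in minutes (assumed)

def fb_signal(t):
    # Base flow rate of Eq. (3)
    return 515.0 + 51.5 * math.sin(2.0 * math.pi * t / 25.0)

fb_hist, ph_hist = [], []
for k in range(500):                       # 500 data pairs, sampled once per minute (assumed)
    fb = fb_signal(float(k))
    for _ in range(int(1.0 / dt)):         # integrate Eq. (1) over one sampling interval
        dwa = (FA * CA - (FA + fb) * wa) / V
        dwb = (fb * CB - (FA + fb) * wb) / V
        wa += dt * dwa
        wb += dt * dwb
    fb_hist.append(fb)
    ph_hist.append(solve_ph(wa, wb))

# Delayed-input regressors of Eq. (5): 6 inputs per sample, target y_pH(t)
X, y = [], []
for t in range(3, 500):
    X.append(fb_hist[t-3:t] + ph_hist[t-3:t])
    y.append(ph_hist[t])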
3 Support Vector Machine Modeling SVMs can be applied to regression problems by the introduction of an alternative loss function [2]. The basic idea in SVR is to map the input data x into a higher dimensional feature space via a nonlinear mapping Φ; a linear regression problem is then obtained and solved in this feature space. The following presents some basic concepts of SVR as described by prior research. A detailed explanation may be found in [3]. In the SVM method, the regression function is approximated by the following function:
y = \sum_{i=1}^{l} w_i \Phi_i(x) + b    (6)
where \{\Phi_i(x)\}_{i=1}^{l} are the features of the inputs, and \{w_i\}_{i=1}^{l} and b are coefficients. The coefficients are estimated by minimizing the regularized risk function

R(C) = C\,\frac{1}{l}\sum_{i=1}^{l} L_\varepsilon(d_i, y_i) + \frac{1}{2}\|w\|^2    (7)
where

L_\varepsilon(d_i, y_i) = \begin{cases} 0 & \text{for } |d - y| < \varepsilon \\ |d - y| - \varepsilon & \text{otherwise} \end{cases}    (8)
and ε is a prescribed parameter. In Eq. (7), L_\varepsilon(d_i, y_i) is the ε-insensitive loss function, which indicates that it does not penalize errors below ε. The term \frac{1}{2}\|w\|^2 is used as a flatness measurement of Eq. (6), and C is a regularized constant determining the tradeoff between the training error and the model flatness. Introduction of slack variables ζ, ζ* leads Eq. (7) to the following constrained function
Minimize

R(w, \zeta^*) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{l}(\zeta_i + \zeta_i^*)    (9)

s.t.

w\Phi(x_i) + b - d_i \le \varepsilon + \zeta_i, \quad d_i - w\Phi(x_i) - b \le \varepsilon + \zeta_i^*, \quad \zeta_i, \zeta_i^* \ge 0    (10)
Thus, function (6) becomes the explicit form

f(x, \alpha_i, \alpha_i^*) = \sum_{i=1}^{l} w_i \Phi_i(x) + b = \sum_{i=1}^{l}(\alpha_i - \alpha_i^*)\,\Phi(x_i)^T \Phi(x) + b = \sum_{i=1}^{l}(\alpha_i - \alpha_i^*)\,K(x_i, x) + b    (11)
In formula (11), the Lagrange multipliers \alpha_i and \alpha_i^* satisfy the constraints \alpha_i \alpha_i^* = 0, \alpha_i \ge 0, \alpha_i^* \ge 0, and they can be obtained by maximizing the dual form of function (9)
\Phi(\alpha_i, \alpha_i^*) = \sum_{i=1}^{l} d_i(\alpha_i - \alpha_i^*) - \varepsilon\sum_{i=1}^{l}(\alpha_i + \alpha_i^*) - \frac{1}{2}\sum_{i=1}^{l}\sum_{j=1}^{l}(\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\,K(x_i, x_j)    (12)
with constraints
\sum_{i=1}^{l}(\alpha_i - \alpha_i^*) = 0, \quad 0 \le \alpha_i \le C, \quad 0 \le \alpha_i^* \le C    (13)
Based on the nature of quadratic programming, only a number of the coefficients among \alpha_i and \alpha_i^* will be nonzero, and the data points associated with them are referred to as support vectors. The form \Phi(x_i)^T \Phi(x) in Eq. (11) is replaced by a kernel function of the form

K(x, y) = \Phi(x)^T \Phi(y)    (14)
There are different kernels for generating the inner products to construct machines with different types of nonlinear decision surfaces in the input space. We employed the following kernel functions:

\text{linear:} \quad K(x, y) = x^T y, \qquad \text{RBF:} \quad K(x, y) = \exp\left(-\frac{1}{\sigma^2}\,\|x - y\|^2\right)    (15)
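The two kernels of Eq. (15) are straightforward to implement; the sketch below gives direct Python versions together with the Gram matrix K(x_i, x_j) that appears in the dual problem of Eq. (12). The function names and the default kernel width are illustrative assumptions.

import numpy as np

def linear_kernel(x, y):
    # K(x, y) = x^T y, Eq. (15)
    return float(np.dot(x, y))

def rbf_kernel(x, y, sigma=1.0):
    # K(x, y) = exp(-||x - y||^2 / sigma^2), Eq. (15); sigma is an assumed width
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / sigma ** 2))

def gram_matrix(X, kernel):
    # Kernel matrix K_ij = K(x_i, x_j); such a matrix could be passed to an SVR
    # solver that accepts precomputed kernels.
    n = len(X)
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(X[i], X[j])
    return K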
Using the two types of kernel functions, linear and radial basis function, for SVR, approximate models are constructed and their results are compared. The accuracy was quantified in terms of mean squared error (MSE) values. With respect to SVR modeling, we have considered two cases in which the inputs to the SVR consist of 4 and 6 inputs, respectively. The MSE values according to the kernel function and the number of inputs to the SVR are summarized in Table 2, where they can be compared at a glance. As can be seen from the table, the RBF kernel provides the best results regardless of the number of inputs.
Table 2. Kernel functions and corresponding accuracy of pH process

kernel type   No. inputs   MSE
linear        4            0.925
linear        6            0.865
RBF           4            0.0001
RBF           6            0.0001
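A comparison in the spirit of Table 2 can be reproduced with an off-the-shelf SVR implementation. The sketch below uses scikit-learn (assumed to be available) on the X, y arrays built in the data-generation sketch of Section 2; the choice of C, ε and the 4-input column selection are illustrative assumptions, so the resulting MSE values will not match the table exactly.

import numpy as np
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

X6 = np.asarray(X)                      # 6 delayed inputs, Eq. (5)
y_arr = np.asarray(y)
X4 = X6[:, [1, 2, 4, 5]]                # 4-input case: drop the oldest Fb and pH delays (assumed)

split = int(0.7 * len(y_arr))           # simple 70/30 train/test split (assumed)
for kernel in ("linear", "rbf"):
    for name, inputs in (("4 inputs", X4), ("6 inputs", X6)):
        model = SVR(kernel=kernel, C=100.0, epsilon=0.01, gamma="scale")
        model.fit(inputs[:split], y_arr[:split])
        pred = model.predict(inputs[split:])
        print(kernel, name, mean_squared_error(y_arr[split:], pred))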
Fig. 3. Actual data vs. model output and their errors (4 inputs are used): (a) actual data and model data; (b) errors between actual and model data

Table 3. Performance comparison of various identification models

Model                 MSE (PI)
Model [9] (USOCPN)    0.230
Model [9] (SSOCPN)    0.012
SVR                   0.0001
Figs. 3-4 show the results of the SVR with the RBF kernel corresponding to 4 and 6 inputs. When we employed 4 inputs for the SVR model, the actual data vs. model output and their corresponding errors are depicted in Fig. 3. When 6 inputs are utilized for modeling, the results are shown in Fig. 4. In that figure, the actual pH data vs. model output and their corresponding errors can also be seen. Table 3 provides a comparative analysis with other models. From Table 3, it becomes obvious that the nonlinear pH process can be modeled reasonably well by the SVR and that its accuracy outperforms the other models.
Fig. 4. Actual data vs. model output and their errors (6 inputs are used): (a) actual data and model data; (b) errors between actual and model data
4 Conclusions In this paper, we have dealt with the problem of modeling the pH neutralization process via support vector machines. We applied support vector machines with various kernel functions to the system. As a result, the SVM based on the kernel method has been found to work well. In particular, the SVM with the RBF kernel function provides the best results. We can thus see that the SVM based on kernel functions provides a promising alternative for modeling the nonlinear pH process.
Acknowledgment The authors would like to thank the Korea Science & Engineering Foundation for its financial support. This work was supported by grant No. R01-2005-000-11044-0 from the Basic Research Program of the Korea Science & Engineering Foundation.
References 1. Wang, Y., Rong, G., and Wang, S.: Hybrid fuzzy modeling of chemical processes. Fuzzy Sets and Systems. 130 (2002) 265-275. 2. Vapnik, V.: The Nature of Statistical Learning Theory. New York, John Wiley (1995). 3. Wang, W., Xu, Z.: A heuristic training for support vector regression. Neurocomputing. 61 (2004) 259-275. 4. Barzilay, O., Brailovsky, V.: On domain knowledge and feature selection using a support vector machine. Pattern Recognition Lett. 20 (1999) 475-484. 5. Drucker, H., Wu, D., Vapnik, V.: Support vector machines for spam categorization. IEEE Trans. Neural Networ. 10 (1999) 1048-1054. 6. Burges, C.: A tutorial on support vector machines for pattern recognition. Data Min. Knowl. Discov. 2 (1998) 7. Shinskey, F.G.: pH and pION Control in Process and Waste Streams. New York, John Wiley (1973)
8. Hall, R.C., Seborg, D.E.: Modeling and Self-Tuning Control of a Multivariable pH Neutralization Process. Proc. ACC. (1989) 1822-1827. 9. Nie, J., Loh, A.P., Hang, C.C.: Modeling pH neutralization processes using fuzzy-neural approaches. Fuzzy Sets and Systems 78 (1996) 5-22. 10. McAvoy, T.J.: Time optimal and Ziegler-Nichols control. Ind. Engrg. Chem. Process Des. Develop 11 (1972) 71-78.
Generating Tutoring Feedback in an Intelligent Training System on a Robotic Simulator Roger Nkambou1, Khaled Belghith2, and Froduald Kabanza2 1
Université du Québec à Montréal Montréal, Québec H3C 3P8, Canada [email protected] 2 Université de Sherbrooke, Sherbrooke, Québec J1K 2R1, Canada [email protected], [email protected]
Abstract. Manipulating the Space Station Remote Manipulator (SSRMS), known as “Canadarm II”, on the International Space Station (ISS) is a very challenging task. The astronaut does not have a direct view of the scene of operation and must rely on cameras mounted on the manipulator and at strategic places of the environment where it operates. In this paper, we present Roman Tutor, an intelligent robotic simulator we are developing to address this kind of problem. We also show how a new approach for robot path planning called FADPRM can be used to provide highly relevant tutoring feedback for training on such a manipulator under the constraint of restricted sight.
1 Introduction This paper presents Roman Tutor, an intelligent tutoring system to support astronauts in learning how to operate the Space Station Remote Manipulator (SSRMS), an articulated robot arm mounted on the International Space Station (ISS). Fig. 1-a illustrates a snapshot of the SSRMS on the ISS. Astronauts operate the SSRMS through a workstation located inside one of the ISS compartments. As illustrated in Fig. 1-b, the workstation has an interface with three monitors, each connected to a camera placed at a strategic location of the ISS. There are a total of 14 cameras on the ISS, but only three of them are seen at a time through the workstation. A good choice of the camera on each of the three monitors is essential for a correct and safe operation of the robot. SSRMS can be involved in various tasks on the ISS, ranging from moving a load from one place of the station to another, to inspecting the ISS structure (using a camera on the arm’s end effector) and making repairs. These tasks must be carried out very carefully to avoid collision with the ISS structure and to maintain safety-operating constraints on SSRMS (such as avoiding collisions with itself and singularities). At different phases of a given manipulation, such as moving a payload using the arm (Fig. 2), the astronaut must choose a setting of cameras that provides him with the best visibility while keeping a good appreciation of his evolution in the task. Thus, astronauts are trained not only to manipulate the arm per se, but also to recognize
visual cues on the station that are crucial in mentally reconstructing the actual working environment from just three monitors, each giving a partial and restricted view, and to remember and be able to select cameras depending on the task and other parameters. One challenge in developing a good training simulator is of course to build it so that one can reason on it. This is even more important when the simulator is built for training purposes [1]. Up until now, simulation-based tutoring has been possible only if there is an explicit problem space associated with the tasks carried out during training, so as to be able to track student actions and to generate relevant tutoring feedback [2,3]. Knowledge and model tracing are only possible under these conditions [4]. However, it is not always possible to explicitly develop a well-informed task structure in some complex domains, especially in domains where spatial knowledge is used, as there are too many possibilities to solve a given problem. This paper proposes a solution to this issue through a system called Roman Tutor, which uses a path planner to support spatial reasoning on a simulator and makes model-tracing tutoring possible without an explicit task structure. We developed a simulator of the ISS and SSRMS with quite realistic rendering and kinematics constraints, as in the real environment. The simulator knowledge structure will be described later. Fig. 2 illustrates a snapshot of the simulator with SSRMS moving a payload from one place to another on the ISS. This simulates an operation that actually took place during the construction of the ISS. As most complex tasks deal in one way or another with moving the SSRMS, and for the simulator to be able to understand students’ operations in order to provide feedback, it must itself be aware of the space constraints and be able to move the arm by itself. A path planner that calculates arm moves without collision and consistent with the best available camera views is the key training resource on which other resources and abstract tutoring processes hinge. After a brief description of the path planner, we outline the different components of Roman Tutor and show how the FADPRM path planner is used to provide highly relevant tutoring feedback to the learner.
Fig. 1. ISS with SSRMS (a) and the workstation (b)
2 The FADPRM Path-Planner In the literature, several approaches dealing with the path-planning problem for robots in constrained environments can be found [5-7]. Several implementations were carried out on the basis of these various approaches, and most of them are relatively effective and precise. The fact is that none of these techniques deals with the problem of restricted sight we are faced with in our case [8]. That is why we designed and implemented FADPRM [9], a new flexible and efficient approach for robot path planning in constrained environments. In addition to the obstacles the robot must avoid, our approach takes account of desired and non-desired (or dangerous) zones. This makes it possible to take into account the disposition of the cameras on the station. Thus, our planner will try to bring the robot into zones offering the best possible visibility of the progression while trying to avoid zones with reduced visibility (Fig. 2). FADPRM [9] allows us to put in the environment different zones with arbitrary geometrical forms. A degree of desirability dd, a real number in [0, 1], is assigned to each zone. The dd of a desired zone is then near 1, and the more it approaches 1, the more the zone is desired; similarly, for a non-desired zone the dd is in [0, 0.5]. On the International Space Station, the number, the form and the placement of zones reflect the disposition of cameras on the station. A zone covering the field of vision of a camera will be assigned a high dd (near 1) and will take a shape which resembles that of a cone (Fig. 2), whereas a zone that is not visible by any of the cameras present on the station will be considered a non-desired zone with a dd near 0 and will take an arbitrary polygonal shape (a cube in Fig. 2).
Fig. 2. SSRMS avoiding a non-desired zone (cube) and going through a desired zone (cone)
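A minimal sketch of how zones and their degrees of desirability could be represented is given below. The class, the overlap predicate and the default value returned when a configuration lies in no zone are illustrative assumptions, not the actual FADPRM data structures.

from dataclasses import dataclass

@dataclass
class Zone:
    geometry: object   # e.g. a cone for a camera field of view, a box for a blind spot
    dd: float          # degree of desirability, a real number in [0, 1]

def configuration_dd(q, zones, overlaps):
    # Average dd of the zones overlapping configuration q, as in the roadmap
    # labeling described below; overlaps(q, zone) is a hypothetical geometric test.
    dds = [zone.dd for zone in zones if overlaps(q, zone)]
    return sum(dds) / len(dds) if dds else 0.5   # neutral default when no zone applies (assumed)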
The robot workspace (the ISS environment here) is then preprocessed into a roadmap of collision-free robot motions in regions with the highest desirability degree. More precisely, the roadmap is a graph such that every node n is labeled with its corresponding robot configuration n.q and its degree of desirability n.dd, which is the average of the dds of the zones overlapping with n.q. An edge (n,n') connecting two nodes is also assigned a dd equal to the average of the dds of the configurations in the path segment (n.q, n'.q). The dd of a path (i.e., a sequence of nodes) is the average of the dds of its edges. Following probabilistic roadmap methods (PRM) [10], we build the roadmap by picking robot configurations probabilistically, with a probability that is biased by the density of obstacles. A path is then a sequence of collision-free edges in the roadmap, connecting the initial and goal configurations. Following the Anytime Dynamic A* (AD*) approach [11], to get new paths when the conditions defining safe zones have dynamically changed, we can quickly re-plan by exploiting the previous roadmap. On the other hand, paths are computed through incremental improvements, so the planner can be stopped at any time to provide a collision-free path, and the more time it is given, the better the path optimizes moves through desirable zones. Therefore, our planner is a combination of the traditional PRM approach [10] and AD* [11], and it is flexible in that it takes into account zones with degrees of desirability. This explains why we called it Flexible Anytime Dynamic PRM (FADPRM). More precisely, FADPRM works as follows. The input is an initial configuration, a goal configuration, a 3D model of obstacles in the workspace, a 3D specification of zones with corresponding dd, and a 3D model of the robot. Given this input:
• To find a path connecting the initial and goal configurations, we search backward from the goal towards the initial (current) robot configuration. Backward instead of forward search is done because the robot is moving, hence its current configuration is not necessarily the initial configuration; we want to re-compute a path to the same goal when the environment changes before the goal is reached.
• A probabilistic queue OPEN contains nodes on the current roadmap’s frontier (i.e., nodes to be expanded because they are new or because they have previously been expanded but are no longer up to date w.r.t. the desired path) and a list CLOSED contains non-frontier nodes (i.e., nodes already expanded).
• Search consists of repeatedly picking a node from OPEN, generating its predecessors and putting the new ones or out-of-date ones in OPEN.
• The density of a node is the number of nodes in the roadmap with configurations that are a short distance away (proximity being an empirically set parameter, taking into account the obstacles in an application domain). The distance estimate to the goal takes into account the node's dd and the Euclidean distance to the goal.
• A node n in OPEN is selected for expansion with probability proportional to (1-β) / density(n) + β * goal-distance-estimate(n)
with 0 ≤ β ≤ 1.
This equation implements a balance between fast-solution search and best-solution search by choosing different values for β. With β near to 0, the choice of a node to be expanded from OPEN depends only on the density around it. That is, nodes with lower density will be chosen first, which is the heuristic used in traditional PRM approaches to guarantee the diffusion of nodes and to accelerate the search for a path [10]. As β approaches 1, the choice of a node to be expanded from OPEN will rather depend on its estimated distance to the goal. In this case we are seeking optimality rather than speed.
• To increase the resolution of the roadmap, a new predecessor is randomly generated within a small neighborhood radius (that is, the radius is fixed empirically based on the density of obstacles in the workspace) and added to the list of successors in the roadmap generated so far. The entire list of predecessors is returned.
• Collision is delayed: detection of collisions on the edges between the current node and its predecessors is delayed until a candidate solution is found; if there is a collision, we backtrack. Collisions that have already been detected are stored in the roadmap to avoid redoing them.
• The robot may start executing the first path found. Concurrently, the path continues being improved by re-planning with an increased value of β.
• Changes in the environment (moving obstacles, moving arm, or changes in dd for zones) cause updates of the roadmap and re-planning.
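The selection rule above, with probability proportional to (1-β)/density(n) + β * goal-distance-estimate(n), amounts to a weighted random draw over the OPEN queue. The following sketch is purely illustrative; the density and goal-distance-estimate callables, and the guard against a zero density, are assumptions rather than the planner's actual API.

import random

def select_node(open_nodes, beta, density, goal_distance_estimate):
    # Weight each frontier node by (1 - beta) / density(n) + beta * goal_distance_estimate(n)
    # and draw one node with probability proportional to its weight.
    weights = [(1.0 - beta) / max(density(n), 1) + beta * goal_distance_estimate(n)
               for n in open_nodes]
    return random.choices(open_nodes, weights=weights, k=1)[0]

In a full planner this draw would sit inside the expansion loop, with β increased between successive re-planning passes as described above.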
We implemented FADPRM as an extension to the Motion Planning Kit (MPK)[10] by changing the definition of PRM to include zones with degrees of desirability and changing the algorithm for searching the PRM with FADPRM. The calculation of a configuration dd and a path dd is a straightforward extension of collision checking for configurations and path segments. For this, we customized the Proximity Query Package (PQP)[12].
3 Roman Tutor 3.1 Roman Tutor Architecture Roman Tutor’s architecture contains the following main components: the simulator, the FADPRM path planner, the movie generator, the task editor, the student model and the tutoring sub-system. As most knowledge related to simulation-based robotic manipulation in Roman Tutor mainly involves spatial reasoning, an appropriate structure is needed for the simulator. We equipped our system with the FADPRM path planner, which, as we said before, provides a framework to support the reasoning process within the simulator. However, this reasoning base alone is not sufficient to provide useful tutoring explanations to guide and orient the astronaut during his manipulations. The level of explanation that could be given here remains very limited because the planner is connected at the basic level of the simulator, which is made up essentially of physical components in the environment such as the robot, obstacles, cameras and zones, and some related low-level parameters and variables such as the robot’s configuration, its degree of desirability and whether it is colliding or not. This level is equivalent to the structure proposed by Forbus [13]. However, a proper use of FADPRM can exploit information at that level to generate very important tutoring feedback without any hard-coded structure.
3.2 Roman Tutor User Interface The Roman Tutor interface (Fig. 3) resembles that of the robotic workstation on which the astronaut operates to manipulate the SSRMS. We have three monitors, each of them connected to one of the fourteen cameras present on the station. On each monitor, we have buttons and functionalities to move the corresponding camera: tilt, pan and zoom. SSRMS can be manipulated in two modes: the For mode or the Joint-by-Joint mode. In the For mode, the learner moves the robot starting from its end-effector’s position by commanding it to go forward or backward, left or right, and up or down. In the Joint-by-Joint mode, the learner selects a joint in the robot and moves it according to the corresponding link. In both modes, the manipulation is done incrementally. While manipulating the robot, the astronaut can choose and change the camera in each monitor to have a better sight of the zone he is working in. A window at the bottom of the Roman Tutor interface contains the trace of what has been done so far by the learner. Every operation done by the learner is posted in this window: the selection of a new camera in a monitor, the displacement of a camera and the manipulation of the robot in the For/Joint-by-Joint mode. This trace also contains all information relevant to the current state: whether there is a collision or not, the coordinates of the end-effector, and the position and orientation of the cameras (Fig. 3).
Fig. 3. Roman Tutor User Interface
The Trace window thus keeps a continuous record of all operations done so far by the learner. So if he has made a mistake (a collision, for example), by examining the Trace window he can find out what caused it and see what to do to recover from it. Roman Tutor‘s interface contains four main menus: Task, Mode, Ask and Tools. From the ‘Task’ menu, the learner chooses one of the several tasks he wants to work on. From the second menu, ‘Mode’, the learner chooses between two different modes: Free and Assisted. In the ‘Assisted’ mode, the non-cognitive tutor intervenes when needed to support the learning of the student by guiding him or by giving him an illustrated video of the task he has to do. In the ‘Free’ mode, the learner relies only on the Trace window to carry on his task. In both modes, the ‘Ask’ menu allows the learner to ask different types of questions while executing a task. In the last menu, ‘Tools’, the expert is provided with three different tools that will help him design and validate new tasks to add into the system. These tools are: the FADPRM Path-Planner, the Movie Generator and the Task Generator.
4 Using the FADPRM Path-Planner for Tutoring Assistance One of the main goals of an intelligent tutoring system is to actively provide relevant feedback to the student in problem solving situations [3]. This kind of support becomes very difficult when an explicit representation of the training task is not available. This is the case in the ISS environment, where the problem space associated with a given task consists of an infinite number of paths. Moreover, there is a need to generate new tasks (or sub-tasks) on the fly without any cognitive structure. Roman Tutor brings a solution to these issues by using FADPRM as the principal resource for tutoring feedback. Roman Tutor includes four different types of tasks on which we may train astronauts: Spatial Awareness, GoTo, Inspect and Repair. The ‘Spatial Awareness’ task improves the learner’s knowledge about the space station’s environment by providing him with exercises such as naming and locating ISS elements, zones and cameras. This is a very important activity since astronauts do not have a complete view of the station while manipulating the robot and must memorize a spatial model of the ISS in order to execute the different tasks. The tutor can invoke this type of activity when it notices a lack of understanding in the student profile about some environment-related knowledge. In Fig. 4, the learner is asked to name an element of the station and to select the camera from which it is shown. In the ‘GoTo’ task, the learner has to move the SSRMS, carrying a load or not, from one position to a different one on the ISS. The Inspect and Repair tasks are variants of the ‘GoTo’ task. In ‘Inspect’, the astronaut is trained on how to go towards an element or a zone of the station and how to inspect it at several different points. In the ‘Repair’ task, the astronaut is trained on how to go towards an element of the station and how to execute some repairs on it at several points using the manipulator. These “GoTo”-like tasks are described in the next paragraph.
Fig. 4. ‘Spatial Awareness’ Task in Roman Tutor
During a ‘GoTo’ task, the non-cognitive tutor uses the anytime capability of the FADPRM path planner to incrementally validate the student’s action or sequence of actions, give information about the next relevant action or sequence of actions, and generate relevant task demonstration resources using the movie generator. In Fig. 5, the learner is shown an illustration (demonstration) of the task he is working on. The movie is created so that the end-effector follows the generated path (in red). Roman Tutor gives access to a menu that allows the learner to ask relevant questions during manipulations. These questions may be of three different forms: How To, What If and Why Not. Several extensions are associated with these different questions. For example, with ‘How to’, one could have: How to Go To, How to avoid obstacle, How to go through zone. Roman Tutor answers How-To questions by generating a path consistent with the best camera views using FADPRM and by calling the movie generator to build an interactive animation that follows that path. The incremental planning capability of FADPRM is used by Roman Tutor to answer the What-If and Why-Not questions. In both cases, Roman Tutor provides the learner with relevant illustrations explaining whether his action or sequence of actions is appropriate or out of scope (and thus may bring him to a dead end) given the generated plan that corresponds to the correct path. We see here the importance of having FADPRM as the planner in our system to guide the evolution of the astronaut. By taking into account the disposition of the cameras on the station, we are assured that the proposed plan the learner is following passes through zones that are visible from at least one of the cameras placed in the environment.
Fig. 5. ‘GoTo’ Task in Roman Tutor
5 Conclusion In this paper, we described how a new approach for robot path planning called FADPRM can play an important role in providing tutoring feedback for training on a robot simulator. The capability of our path planner, built within the simulator’s architecture, to provide predictions and to determine what manipulations might achieve a desired effect makes it a useful component in simulation-based training systems. We also detailed the architecture of the intelligent tutoring system Roman Tutor in which FADPRM is integrated. All the components of Roman Tutor but the cognitive tutor are working properly. This constitutes a very important contribution, especially in the field of intelligent tutoring systems: we showed that it is not necessary to plan in advance what feedback to give to the learner or to explicitly create a complex task graph to support the tutoring process. The next step is to improve the expressiveness of the domain knowledge structures underlying the SSRMS to lead to a better diagnosis of the problems experienced by the learner during manipulations. Also, an evaluation of the system will be done in collaboration with the Canadian Space Agency. Acknowledgment. We would like to thank the Canadian Space Agency, The Infotel Group and the Natural Sciences and Engineering Research Council of Canada
(NSERC) for their logistic and financial support. Special thanks to Mahie Khan, Daniel Dubois, Patrick Hohmeyer and Kaufmann Meudja, who contributed to the design of this system. Special thanks also to all members of J.C. Latombe’s laboratory at Stanford University for providing us with MPK.
References 1. Forbus, K.: Articulate software for science and engineering education. In Forbus, K., Feltovich, P., and Canas, A. (Eds.) Smart machines in education: The coming revolution in educational technology. AAAI Press (2001) 2. Angros, R., Johnson, W.L., Rickel, J., Scholer, A.: Learning domain knowledge for teaching procedural skills. AAMAS, 1372-1378 (2002) 3. VanLehn, K.: The advantages of Explicitly Representing Problem Spaces. User Modeling, Springer Verlag LNAI 2702:3. (2003) 4. Crowley R.S., Medvedeva, O., Jukic, D.: SlideTutor - A model-tracing Intelligent Tutoring System for teaching microscopic diagnosis. Proceedings of the 11th International Conference on Artificial Intelligence in Education (2003) 5. Latombe, J.C.: Robot motion planning. Kluwer Academic Publishers, Boston, MA, (1991) 6. Overmars, M.H.: Recent developments in motion planning. P. Sloot, C. Kenneth Tan, J. Dongarra, A. Hoekstra (Eds.): Computational Science - ICCS 2002, Part III, Springer-Verlag, LNCS 2331, (2002) 3-13 7. LaValle, S. M., Branicky, M. S., Lindemann, S.R.: On the relationship between classical grid search and probabilistic roadmaps. International Journal of Robotics Research 23: 673–692, (2004) 8. Roy, J., Nkambou, R., Kabanza, F.: Supporting spatial awareness in training on a telemanipulator in space. Intelligent Tutoring Systems, LNCS 3220, 860-863, Springer-Verlag Berlin Heidelberg (2004) 9. Belghith, K., Kabanza, F., Hartman, L., Nkambou, R.: Anytime Dynamic Path-Planning with Flexible Probabilistic Roadmaps. In Proc. of IEEE International Conference on Robotics and Automation, ICRA (2006) 10. Sanchez, G., Latombe, J.C.: A single-query bi-directional probabilistic roadmap planner with lazy collision checking. Int. Symposium on Robotics Research (ISRR'01), Lorne, Victoria, Australia, November 2001. Published in Robotics Research: The Tenth Int. Symp. R.A. Jarvis and A. Zelinsky (eds.), Springer Tracts in Advanced Robotics, Springer, 403-417 (2003) 11. Likhachev, M., Ferguson, D., Stentz, A., Thrun, S.: Anytime Dynamic A*: An Anytime Replanning Algorithm. In Proc. of International Conference on Automated Planning and Scheduling, June (2005) 12. Larsen, E., Gottschalk, S., Lin, M., Manocha, D.: Fast Distance Queries using Rectangular Swept Sphere Volumes. In Proc. of IEEE International Conference on Robotics and Automation, 4:24-28 (2000) 13. Forbus, K.: Using qualitative physics to create articulate educational software. IEEE Expert, May/June, 32-41 (1997)
Elaborating the Context of Interactions in a Tutorial Dialog Josephine Pelle and Roger Nkambou Université du Québec à Montréal Montréal, Québec H3C 3P8, Canada [email protected], [email protected]
Abstract. In Socratic tutorial dialogues, the tutor selects the most appropriate question to ask based on assumptions about what the learner knows. However, there is no guarantee that the learner will understand the question; indeed, the assumptions of the tutor are sometimes inaccurate. In this case, an appropriate action for the tutor is to revise its current GOAL as well as its current dialogue plan. In this paper, we present an instance of this issue in PROLOG-TUTOR, a tutoring system for Logic Programming. Our contribution is an explicit treatment of the dialogue management mechanism which supports the revision of the tutor’s intention in a Socratic dialogue. This is done using a combination of the theory of accommodation of communicative acts with the notion of revising intentions.
1 Introduction In Socratic dialogues, the main goal is to guide the learner toward an understanding of the concepts and the principles which underlie the domain learned. Several types of interactions support tutorial dialogues: explanations ([1, 3]), decompositions ([12]), hinting ([7, 20]) and questions ([9]). When questions are asked in a tutorial dialogue, there are two different schemes of intervention: the dialogue is directed by the tutor, or the tutor and the learner have equal participation in the dialogue. In the former case, only the tutor is allowed to take the initiative in the dialogue. In this scheme of interaction, the tutor selects the most appropriate question to ask based on assumptions about what the learner knows ([4, 5]). However, the information contained in the learner model is sometimes inaccurate. Therefore, there is no guarantee that the learner will understand the question that is asked of him. For that reason, the management of tutoring dialogues should anticipate this kind of situation in order to trigger an appropriate action. In this case, one example of an appropriate action for the tutor is to revise its current GOAL as well as its current dialogue plan. For the tutor, the intended goal of this revision is to make the learner understand what it meant with the last question asked. In this paper, we present an instance of this issue in PROLOG-TUTOR, a tutoring system for Logic Programming (section 2). We describe a model of dialogue management to support the revision of a predefined dialogue plan (section 3). This model uses a combination of the theory of accommodation of communicative acts in information states ([14]) with
the notion of revising intentions ([2]). We analyze research work related to the revision of dialogue plans in tutorial interactions (section 4). We outline the more general context of research from which this model is taken and we discuss some evaluation issues in this respect (section 5).
2 Elaborating the Context of a Question in a Tutorial Dialogue One learning activity in PROLOG-TUTOR is the RESOLUTION of a GOAL given a PROLOG KNOWLEDGE-BASE. In that activity, the learner has to give the solution of such a resolution. If he fails, the goal of PROLOG-TUTOR is to guide him toward the solution. The resolution is a procedural task which consists in the application of a set of principles in order to perform a sequence of appropriate computations. When the learner fails to solve an exercise on Resolution correctly, the tutor engages an articulation process ([19]). Articulation is a tutorial strategy where the tutor takes the learner step by step through some reasoning toward the solution or toward the realisation of his own errors. The execution of each step necessitates the elicitation of a skill: applying a principle, recalling a domain statement, etc. Articulation is a generic process. When applied to a procedure, it consists in asking the learner questions for each step of the procedure. In this way the learner is expected to reflect on the principles that govern the execution of each step, as well as the principles which govern the execution of the overall procedure. There is an implicit way and an explicit way to elicit the learner’s reflection through articulation. The implicit way consists in asking him to give the result of the execution of a step. In that case, the tutor assumes that a good execution implies that the learner masters the concepts and the principles that underlie the achievement of this step. The explicit way consists in making the learner explicitly apply the principle of the domain which is most appropriate at the current stage of the procedure (Table 1).
Table 1. Types of questions of Prolog-tutor in articulating a skill

Question that promotes implicit reflection at the first step of a RESOLUTION:
[T1]: What is the first element in the knowledge base which corresponds to the GOAL grand_father(X,john)
[S2]: Grand_father(X,Y) :- father(X,Z), father(Z,Y)

Question that promotes explicit reflection at the first step of a RESOLUTION:
[T1']: Can you unify your GOAL grand_father(john,X) with the Head of the Horn Clause: Grand_father(X,Y) :- father(X,Z), father(Z,Y)
[S2']: Yes
[T2']: So what will you do next?
In any case, when the learner is unable to answer the question of the tutor, the reaction of the tutor will depend on its pedagogical strategy. We have implemented several scenarios in PROLOG-TUTOR which aim at promoting the learner’s reflection [19]. In the most basic scenario, the tutor provides an explanation to the learner when he is unable to answer a question.
2.1 Using a Simulated Learner Model to Adapt a Predefined Dialogue Plan So far, PROLOG-TUTOR uses a simulated learner model in order to choose the next question to ask to the learner. The learner model is drawn upon a Bayesian Network which expresses conditional relationships between the knowledge elements considered in the domain of Logic Programming ([18]). The content of the learner model is a list of knowledge elements, each element being associated with a probability of mastery. In order to choose the next question to ask to the learner, the tutor uses a rule R1 (Table 2). Table 2. A Rule to choose if a step should be articulated
R1: If min_{i in SkillOfCurrentStep}(Probability_of_mastery[i]) > δ ⇒ remove_next_question_from(Private.DialoguePlan)

Table 3. Example of elaboration of the context of a question

Predetermined dialogue plan:
[T1] What is the first element of your knowledge base which corresponds to the GOAL to prove
[S2] grand_father(X,Y):- father(X,Z), father(Z,Y)
[T3] Right. If you use this prolog-rule, what would you do next?
[S4] Solve the new goals father(X,Z) and father(Z,john)
[T5] Ok. Give me the first solution that you obtain
[S6] X = valery, Z = john.
[T7] Right. And what about father(Z,john)
[S8] Z = stephen
[T8] Not really. The variable Z used in father(X,Z) is the same as the one used in father(Z,john) …

Adapted dialogue plan with temporary accommodations:
[T1] What is the first element of your knowledge base which corresponds to the GOAL to prove
[S2] grand_father(X,Y):- father(X,Z), father(Z,Y)
[T3 [AdaptPredeterminedDialoguePlan]] Right. What solution do you obtain with your first sub-goal? (adaptation of the predetermined plan)
[S4] Don’t understand
[T3 [AccomodateCurrentQuestion]] Ok. Your new sub-goals are father(X,Z), father(Z,john) (incremental elaboration of the context of the question)
[S6] Don’t understand
[T3 [AccomodateCurrentQuestion]] When you associate father(X,john) with the prolog-rule: 1. unify the GOAL with the head of the rule; 2. apply the substitutions of the unification to the Tail of the rule; 3. the facts of the tail of the prolog-rule become your new goals; 4. in this case, your new goals are father(X,Z) and father(Z,john)
[S8] …
This rule states that a given step should not be articulated if, among the skills associated with that step, the one with the minimum probability of mastery is above a threshold δ. In a real context, this rule is important since it allows the tutor to be responsive to the learner’s needs: the tutor focuses the articulation of a procedure on the steps whose corresponding skills the learner is lacking. Table 3 illustrates two instances of tutorial interactions where one can see the difference between: 1) the sequential articulation of the steps of a procedure based on a predetermined plan and 2) the adaptive articulation of the steps of a procedure, based on the learner model.

2.2 Possible Reasons for the Inaccuracy of the Learner’s Model

In PROLOG-TUTOR, the Bayesian network which supports the updating of the learner model is not a temporal Bayesian network. We use a heuristic approach which calculates the mean of all the probabilities obtained for each knowledge element at each update of the network. Moreover, the elements of the network as well as their relationships have been defined on the basis of the personal point of view of the author. For these reasons, the probabilities that are reported in the student model are likely to be inaccurate. When the next question to ask is chosen on the basis of such a model, the learner may not understand the context of this question. This is explained as follows. Let us suppose that the last step that has been articulated with the learner in a procedural task was step I. Let us also consider that the next step that will be articulated (chosen on the basis of R1) is J, J ≠ I+1. Thus, when asked to execute step J, the learner may find himself confused because he is not able to link that step with the last articulated step I. This may happen because the learner does not understand what has happened in the steps between I and J. Moreover, even if the learner knows what has happened, he may lack the knowledge which explains the passage through the steps between I and J.

2.3 Revision of the Tutor’s Intentions and Accommodation of Its Utterance

From the above remarks, we have conceived the tutoring interactions as follows. If the learner is unable to answer the question of step J while the preceding step is step I, I ≠ J-1, then (Fig. 1):
o The tutor reconsiders its current intention to evolve with the predetermined dialogue plan.
o The tutor revises its current intention and accommodates its utterance: its new intention is to propose a first elaboration of the context of the current question, expecting that the learner will get a better understanding of the question. The elaboration is minimal at first in order to make the student find out by himself the elements of the gap between I and J. After the first elaboration, the tutor prompts the learner to try again to answer the question that has been asked in the current step (step J). If he fails, the tutor
gives a more detailed elaboration. This process continues until the tutor has given a maximal elaboration. When a maximal elaboration has been stated, the tutor gives an explanation of the answer to the question of the current step (step J). Then, it revises again its intention: its purpose is to go back to continue with the higher level dialogue plan, from step J.
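The behaviour described in Sections 2.1-2.3 can be summarized as a short control loop: skip the steps whose associated skills are already mastered (rule R1), and when the learner fails on a step that does not immediately follow the last articulated one, elaborate the context incrementally before explaining the answer. The Python sketch below is only a schematic rendering; the threshold value and the helper methods (ask, elaborations, explain, is_correct) are assumptions, not PROLOG-TUTOR's actual interface (the system itself is written in Java).

DELTA = 0.7   # mastery threshold (assumed value for illustration)

def articulate(steps, learner_model, tutor):
    last = None
    for step in steps:
        # Rule R1: skip the step if even its least mastered skill is above the threshold
        if min(learner_model[skill] for skill in step.skills) > DELTA:
            continue
        answer = tutor.ask(step.question)
        if not step.is_correct(answer) and last is not None and step.index != last.index + 1:
            # The learner is lost: elaborate the context from the last articulated step,
            # starting with a minimal elaboration and giving more detail on each failure.
            for context in tutor.elaborations(last, step):
                answer = tutor.ask(step.question, context=context)
                if step.is_correct(answer):
                    break
            else:
                tutor.explain(step)   # maximal elaboration exhausted: explain the answer
        last = step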
3 A Model to Elaborate the Context of a Question in a Tutorial Dialogue In order to elaborate the context of a question, the tutor uses 3 models. The first model allows the tutor to produce the basic utterances of the dialogue. The second model allows the management of information during the dialogue. The third model defines the rules which govern how the dialogue will evolve.
Fig. 1. Example of diagnosis revision through the elaboration of a dialogue question when guiding the learner in a “Resolution” task in Logic Programming (annotated screenshot showing the knowledge base supporting the proof of the GOAL, the statement of the exercise, a first elaboration, a second and maximal elaboration, and the negative feedback with confirmation of the diagnosis)
3.1 Modeling the Content of the Utterances of the Tutor This model relates each step I of a procedural task to: 1) the questions that may be asked to the learner and the goal of each of these questions (explicit reflection or implicit reflection); 2) the feedback (positive/negative) that is given to the learner for each type of question; 3) the explanation of the answer to that question in case the learner fails to answer correctly; 4) for each step K preceding I, the utterances that elaborate what has been computed between K and I (the sets Σ described below). The contextual model provides the tutor with the resources necessary to produce the appropriate utterances for each step examined in a tutorial dialogue. For each step S of the procedure, this model comprises the following sets: {[Q,G]}, the set of the questions that are associated with S, with their respective goals; α, a set comprising the expected answer for each question that could be asked in S; {Σ_IS}, a set that comprises several sets, where each set Σ_IS contains utterances corresponding to the elaboration of the computations that have happened from step I to S; and Sk, the set of skills that are associated with S. Then S is defined by the combination <S, {[Q,G]}, α, Σ_IS, Sk>. The utterance in Σ_(S-1)S is a default context, and this utterance recalls the link between the current step (step S) and its predecessor (step S-1). For each of the other steps I < (S-1), Σ_IS comprises a set of utterances. Each utterance gives more details about what has been computed between I and S. These utterances may be characterized as follows:
– Σ_IS(0) is a general summary of the rationale of S. This utterance is used as an introduction by the tutor when passing from step I to step S in the articulation process. Before asking the question in step S, Σ_IS(0) is produced in order to maintain a certain coherence in the flow of the dialogue.
– For K > I ∧ K < S, Σ_IS(K) is an explanation of how the output of step K was drawn and how it may be used to perform step S.
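The combination <S, {[Q,G]}, α, Σ_IS, Sk> can be captured by a small record type. The following Python dataclass is only an illustrative rendering (the original system is implemented in Java, and the field names used here are assumptions).

from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class StepModel:
    step_id: int
    questions: List[Tuple[str, str]]                 # {[Q, G]}: question and its goal
    expected_answers: Dict[str, str]                 # alpha: expected answer per question
    contexts: Dict[int, List[str]] = field(default_factory=dict)
    # contexts[I] plays the role of Sigma_IS: utterances elaborating what happened
    # from step I to this step, ordered from the general summary Sigma_IS(0)
    # to the most detailed explanation
    skills: List[str] = field(default_factory=list)  # Sk: skills elicited by the step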
3.2 Modeling the State of the Tutorial Dialogue In order to interact with the learner in a tutorial dialogue, the tutor manages a memory that defines the current information state in the system. The notion of information states that we use is an adaptation of the dialogue game board ([10, 11]). The content of an information state is divided between the information that is shared between the dialogue participants and the information that is private to each dialogue participant. The private part is only accessible by the tutor. It contains a representation of the dialogue plan, an agenda and a learner model. The representation of the dialogue plan is a stack which contains a predetermined sequence of steps. These are the steps that will be addressed by the tutor in the articulation process. The top of the stack is the
first step of the procedure and the bottom is the last step of the procedure. This sequence is drawn from the structural model of the procedural task that is being articulated. In this case, it is the resolution of a GOAL in PROLOG. The CurrentAgenda contains the short term goals of the tutor. In order to model explicitly the fact that a predetermined plan of dialogue may change, we have attributed a set of intentions to the tutor. These intentions are modeled through the use of TutorGOALs. The learner model is a set of propositions which state the probability of mastery of the learner with respect to a knowledge element in the domain. Each time the learner answers a question, his model is updated accordingly: evidence is sent to the Bayesian network. This evidence activates the knowledge elements that are associated with the skills elicited through the currently articulated step. The shared part of the information state (Table 4) contains the information that both the learner and the tutor can access: the last utterance that the tutor has stated (CTU), the currently articulated step (CS), the learner’s answer to the CTU if it is a question (LA) and the last step articulated before the current step (LS).

Table 4. Information state for a tutorial dialogue

Private Information:
DialoguePLAN: StackSet(Step)
CurrentAgenda: StackSet(TutorGOALs)
LearnerModel: SET(Prop(KnowledgeElements[Probs]))

Shared Information:
CTU: Current Tutor Utterance
CS: Currently Articulated Step
LA: Learner Answer to the CTU
LS: Last Step Articulated Before
3.3 Modeling the Management of the Tutorial Dialogue: Rules for Dialogue Adaptation and Rules for Dialogue Accommodation Two types of goals have been defined: EVOLVE and ELABORATE. The goal EVOLVE allows the dialogue plan to be adaptive to the learner model (Table 5). Using R1 (Table 2), the tutor modifies this dialogue plan in order to select the next step that will be articulated. This goal is set when the conditions of the current information state are normal. We consider that these conditions are normal in two cases: 1) the learner’s answer to the currently articulated step is correct; 2) the currently articulated step naturally follows (as defined in the structural model of the procedure) the last articulated step. Thus, by default, the goal EVOLVE is at the top of the stack which represents the CurrentAgenda. If none of the normal conditions is fulfilled in the current information state of the dialogue, the goal ELABORATE is pushed onto the top of the CurrentAgenda (revision of the intention of the tutor, Table 6). The tutor will then accommodate the CTU in order to elaborate the context of the last question that it has asked, before providing an explanation of the correct answer.
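The rules of Tables 5-7 below operate on the information state of Table 4. As a bridge, the following Python sketch renders that state as two record types; it is illustrative only (the system itself is written in Java) and the type names are assumptions.

from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class PrivateState:
    dialogue_plan: List[int] = field(default_factory=list)        # DialoguePLAN: stack of steps
    current_agenda: List[tuple] = field(default_factory=list)     # CurrentAgenda: stack of tutor goals
    learner_model: Dict[str, float] = field(default_factory=dict) # knowledge element -> probability of mastery

@dataclass
class SharedState:
    ctu: Optional[str] = None   # Current Tutor Utterance
    cs: Optional[int] = None    # Currently articulated Step
    la: Optional[str] = None    # Learner Answer to the current question
    ls: Optional[int] = None    # Last Step articulated before the current one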
Table 5. Rules to adapt the predetermined dialogue plan

Rule: Adapt the predetermined Dialogue Plan
Conditions: associate(Step, Private.pop(DialoguePLAN)); maximumProbMastery(Sk(Step)) < δ
Effect: push(Private.currentAgenda, EVOLVE(Step))
Table 6. Rules to accommodate the utterance of the tutor to the learner needs

Rule: Accommodate the next utterance by setting the tutor goal to ELABORATE
Conditions: ¬ normal(evaluateCurrentConditions(Shared.QA, Shared.CS, Shared.LS))
Effect: push(Private.currentAgenda, ELABORATE(Shared.LS, Shared.CS))
Table 7. Rules to update the dialogue information state

Rule: Update information state
Conditions: get(Private.CurrentAgenda) == EVOLVE
Effect: updateLS(Shared.CS); updateCS(pop(Private.DialoguePlan)); getQuestion(Shared.CS, Shared.CTU)
Conditions: get(Private.CurrentAgenda) == ELABORATE
Effect: getContexts(Shared.CS, Shared.CTU)
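Read together, Tables 5-7 describe one update cycle over the information state sketched earlier. The code below is a schematic rendering of that cycle; the helper callables (normal, prob_mastery, get_question, get_contexts) and the threshold value are assumptions, not the actual implementation.

def next_move(private, shared, normal, prob_mastery, get_question, get_contexts, delta=0.7):
    # One dialogue move following Tables 5-7 (schematic).
    if normal(shared):
        # Table 5: pop steps from the plan until one with an insufficiently mastered skill is found
        step = private.dialogue_plan.pop()
        while private.dialogue_plan and prob_mastery(step) >= delta:
            step = private.dialogue_plan.pop()
        private.current_agenda.append(("EVOLVE", step))
    else:
        # Table 6: the learner did not understand the current question
        private.current_agenda.append(("ELABORATE", (shared.ls, shared.cs)))

    goal, arg = private.current_agenda.pop()
    if goal == "EVOLVE":
        # Table 7, EVOLVE branch: advance to the selected step and ask its question
        shared.ls, shared.cs = shared.cs, arg
        shared.ctu = get_question(shared.cs)
    else:
        # Table 7, ELABORATE branch: elaborate the context of the current question
        shared.ctu = get_contexts(shared.cs, shared.ctu)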
The difference between adaptation and accommodation lies in their respective consequences on the dialogue plan. Adaptation will definitely change the evolution of the predetermined dialogue plan. Adaptation depends on the model of the learner, rather than on his answer. The notion of accommodation consists in changing temporarily the course of the dialogue. The goal of this change is to take into account the needs of the interlocutor, perceived through one of his utterances ([14]). In accommodation, the global goal does not change, while the focus of the current state of the dialogue may change temporarily. In our particular case, the global goal is to conduct an adaptive dialogue with a learner in order to help him reflect on the solution to a problem. However, if the adaptation is based on assumptions that are not
accurate, the goal (focus) becomes – temporarily – the elaboration of the context of an utterance. There are also update rules which support the updating of the information state after a dialogue move (Table 7).

3.4 Implementation

PROLOG-TUTOR is implemented in Java and includes:
– One class which encodes the structural model of a procedure
– One class which encodes the contextual model for each step of the procedure. This class also encodes all the functions related to the management of this information.
– One class which encodes the information state of the dialogue
– One class which encodes the management of the dialogue
– One library of classes which encode the structure of the Bayesian Network and its updating. This library has been obtained from the work of Cozman [8]
– One class which encodes the management of the updating of the Bayesian network
4 Related Work Dialogue-based tutoring has been investigated from several perspectives. Some systems such as Auto-tutor ([16]) focus on natural language processing issues while modeling several types of hints. Other systems address the formalisation of the hinting process ([20, 21]). The hinting process has also been addressed in an implicit manner in ITS, in the course of student diagnosis ([6, 7, 13]). Tutoring dialogues based on the specific articulation of the problem at hand have also been studied ([12, 17]). Moore [15] has shown how an intelligent tutor can refer to past explanations in a tutoring dialogue. Our approach is mainly inspired by these two lines of research. Our approach differs, however, in that we have incorporated the notion of context and the notion of intention of the tutor. The hints are associated with the specific state of the dialogue (not that of the resolution of the problem at hand). Therefore, the tutor is aware of what has happened in the past and it can use that information to make hypotheses about the explanation of the learner’s difficulties. Moreover, the tutor is explicitly attributed some specific goals. This approach allows an organization of the tutoring dialogue plan in small units of focus that are nested. This also allows the explicit modelling of several kinds of tutoring strategies in the dialogue.
5 Future Work: Using Tutorial Dialogue for the Explicit Promotion of Reflection In PROLOG-TUTOR, our goal is to propose some models of tutorial dialogues which explicitly promote learner reflection. This paper describes a very specific case of the consequence of such tutorial dialogues. While the ideas in the current paper will not be the direct objects of an evaluation, PROLOG-TUTOR is now in evaluation with
respect to the more general goal of promoting explicit reflection. Ten undergraduate students in computer science are participating in this evaluation. The goal of this experiment is to assess: 1) whether the way questions are asked to the learner promotes reflection; 2) if there is reflection, what type of reflective processes are triggered by these questions (recall, inference, analysis, etc.).
6 Conclusion This paper has presented how a tutoring dialogue can be adapted and accommodated to fit the learner’s needs in PROLOG-TUTOR, an ITS for Logic Programming. We have described a set of models inspired by: the dialogue game board model, the theory of the accommodation of dialogue utterances, and the notion of intention. These models support a framework which allows a tutor to: 1) adapt a predefined dialogue plan by removing some dialogue steps; 2) accommodate the current dialogue focus in order to elaborate the context of a question that has been asked to the learner. The adaptation is based on the learner model and its goal is to skip the dialogue steps that may be irrelevant for the learner, according to his model. The goal of the elaboration is to revisit the steps that have been skipped, to ensure that the learner understands well the context of the current step of the dialogue. This position contributes to similar research that focused on the formalisation of hinting in tutorial dialogue, revisiting past explanations, etc. Our future work concerns the evaluation of the more general framework on which PROLOG-TUTOR is built. This framework comprises a set of models to explicitly promote the student’s reflection in tutorial dialogues.
References
1. Cawsey, A., Planning Interactive Explanations. International Journal of Man-machine Studies, 1993. 38(2): p. 169-200.
2. Cawsey, A., J. Galliers, G. Logan, S. Reese, and K. Sparck-Jones. Revising Beliefs and Intentions: A Unified Framework for Agent Interaction. in Prospects in artificial intelligence: AISB. 1993: Amsterdam, IOS Press.
3. Chi, M.T., N. de Leeuw, M.H. Chiu, and C. Lavancher, Eliciting self-explanations improves understanding. Cognitive Science, 1994. 18(3): p. 439-477.
4. Clancey, W.J., Knowledge-Based Tutoring: The Guidon Program. 1987, Cambridge, Massachusetts: MIT Press.
5. Collins, A. and A.L. Stevens, Goals and strategies of inquiry teachers, in Advances in instructional psychology, R. Glaser, Editor. 1982, Lawrence Erlbaum Assoc.: Hillsdale, NJ. p. 65-119.
6. Conati, C., A. Gertner, and K. VanLehn, Using Bayesian Networks to Manage Uncertainty in Student Modeling. Journal of User Modeling and User-Adapted Interaction, 2002. 12(4): p. 371-417.
7. Corbett, A.T. and J.R. Anderson, Knowledge Tracing: Modeling the Acquisition of Procedural Knowledge. User Modeling and User Adapted Interaction, 1995. 4: p. 253-278.
8. Cozman, F.G., The JavaBayes system. The ISBA Bulletin, 2001. 7(4): p. 16-21.
9. Dimitrova, V., J. Self, and P. Brna. Involving the Learner in Diagnosis - Potentials and Problems. in Tutorial at Web Information Technologies: Research, Education and Commerce. 2000. Montpellier, France.
10. Ginzburg, J., Interrogatives: Questions, facts and dialogue, in Handbook of Contemporary Semantic Theory. 1996, Blackwell: Oxford.
11. Ginzburg, J. Clarifying utterances. in Proceedings of the Twente Workshop on the Formal Semantics and Pragmatics on Dialogues. 1998.
12. Heffernan, N. and K. Koedinger. An Intelligent Tutoring System Incorporating a Model of an Experienced Human Tutor. in International Conference on Intelligent Tutoring Systems. 2002. Biarritz, France: Springer Verlag.
13. Koedinger, K., J. Anderson, W. Hadley, and M. Mark, Intelligent Tutoring Goes To School in the Big City. Int. Journal of Artificial Intelligence in Education, 1997. 8: p. 30-43.
14. Larsson, S., R. Cooper, and E. Engdahl. Question accommodation and information states in Dialogue. in Third Workshop in Human-Computer Conversation. 2000. Bellagio.
15. Moore, J. and C. Paris, Planning Text for Advisory Dialogues. Computational Linguistics, 1993. 19(4): p. 651-694.
16. Pearson, N., A. Graesser, R. Kreuz, V. Pomeroy, and T.R. Group, Simulating human tutor dialog moves in AutoTutor. International Journal of Artificial Intelligence in Education, 2001. 12: p. 23-39.
17. Razzaq, L. and H. Heffernan. Tutorial dialog in an equation solving intelligent tutoring system. in Proceedings of 7th Annual Intelligent Tutoring Systems Conference. 2004. Maceio, Brazil.
18. Tchetagni, J. and R. Nkambou. A framework for the hierarchical representation of the learner model using Bayesian Networks. 2002. Biarritz, France: Springer-Verlag.
19. Tchetagni, J., R. Nkambou, and F. Kabanza. Epistemological Remediation in Intelligent Tutoring Systems. in Industrial and Engineering Applications of Artificial Intelligence and Expert Systems. 2004. Ottawa, Canada.
20. Tsovaltzi, D. and C. Matheson. Formalising hinting in tutorial dialogues. in EDILOG: 6th workshop on the semantics and pragmatics of dialogue. 2002. Edinburgh.
21. Woolf, B. and T. Murray. A Framework for Representing Tutorial Discourse. in 10th joint conference on artificial intelligence. 1987.
Static Clonal Selection Algorithm Based on Match Range Model Jungan Chen1, Dongyong Yang2, and Feng Liang1 1 Zhejiang Wanli University Ningbo, Zhejiang, 315100, China [email protected], [email protected] 2 Zhejiang University of Technology, No.6 District, Zhaohui Xincun Hangzhou, Zhejiang, 310032, China [email protected]
Abstract. The Static Clonal Selection Algorithm (SCSA) was proposed to generate detectors for intrusion detection. A gene expression is used to represent detectors, which take the form of classification rules. However, a full match rule is used, and this gene expression cannot freely express classification rules with the OR operator. In this work, by combining the gene expression with a partial match rule, which is an important component of the negative selection algorithm, a new expression which can express classification rules with the OR operator is proposed. However, the match threshold in the match rule is difficult to set. Inspired by T-cell maturation, a match range model is proposed. Based on this model and the new expression, a Static Clonal Selection Algorithm based on the Match Range Model is proposed. The proposed algorithm is tested by a simulation experiment for self/nonself discrimination. The results show that the proposed algorithm is more effective at generating detectors with partial classification rules than SCSA, which generates detectors with full conjunctive rules using 'and', and that the match range is self-adapted.
1 Introduction In the human immune system (HIS), the genes of the gene library are inherited from the ancestor. Through the Baldwin effect, the HIS generates new antibodies used to detect antigens that had attempted to attack the ancestor's body. Similarly, the genes of the gene library in an artificial immune system (AIS) can be the genes of nonselves and be used to generate detectors to detect the nonselves which attempted to attack before. For this purpose, the Static Clonal Selection Algorithm (SCSA) was proposed to generate detectors. When nonself data are gathered, SCSA is used to extract nonself patterns from the nonself data and generate detectors which contain nonself patterns without overlapping self patterns [1]. SCSA evolves detectors existing in the form of classification rules which classify nonself from self. The Negative Selection Algorithm (NSA) is used to delete detectors which detect any selves. A simple binary expression proposed by De Jong is used in order to encode the conjunctive rule detectors [2]. Each gene, which represents an
attribute of the detector phenotype, is combined with the others by an "AND" operator. It is therefore impossible to express classification rules composed of "OR" in SCSA. Also, it is difficult for full classification rules to be extracted from nonselves and to survive the NSA process, because a full match rule is used in SCSA [1]. In this work, a new detector expression, the r-classification, which combines the binary expression proposed by De Jong with a partial match rule, is proposed. Classification rules composed of "OR" can be expressed by the r-classification. The algorithm with the r-classification is therefore more effective at generating detectors, because the r-classification can express rules with 'or' and does not require extracting full classification rules, but only (r+1) partial rules with 'and' from nonselves. However, in the r-classification, the threshold r is difficult to set. In this work, inspired by T-cell maturation, a match range model with selfmax and selfmin is proposed. In the match range model, selfmax and selfmin are self-organized because they are evaluated against the selves. This suggests a novel idea: the match range model can be combined with the r-classification, with r replaced by selfmax and selfmin. Based on this idea, a Static Clonal Selection Algorithm based on the Match Range Model (SCSA-MRM) is proposed in this work. SCSA-MRM uses a genetic algorithm to evolve detectors with a self-adapted match range and to extract partial rules from nonselves.
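The r-classification matching idea described above can be illustrated with a small Python sketch (illustrative only, not the authors' implementation; the attribute encoding and the '*' wildcard convention are assumptions made here for readability):

def genes_matched(detector, record):
    """Count attribute positions on which detector and record agree.
    Both are tuples of attribute values; '*' in the detector is a
    hypothetical "don't care" marker used only for this illustration."""
    return sum(1 for d, x in zip(detector, record) if d != '*' and d == x)

def r_classification_match(detector, record, r):
    """The detector fires on `record` when at least r+1 genes match."""
    return genes_matched(detector, record) >= r + 1

# Example: a three-attribute rule (protocol, port range, flag) with r = 1
detector = ('TCP', 'port20-30', 0)
print(r_classification_match(detector, ('TCP', 'port20-30', 1), r=1))  # True
print(r_classification_match(detector, ('UDP', 'port80-90', 1), r=1))  # False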
2 Static Clonal Selection Algorithm 2.1 R-Classification and Match Range Model The match rule is an important component of the negative selection algorithm (NSA) [3]. For a binary representation, there exist several partial match rules with a match threshold r (termed r-match rules), such as Hamming, r-contiguous and r-chunks [3][4][5]. The match threshold r makes it possible for a single detector to detect more similar nonselves. For example, consider a detector '100110' with r=2: if the match rule is r-contiguous, then strings matching '100***' or '*001**' will be detected. In this work, a new detector expression, the r-classification, which combines the binary expression proposed by De Jong with an r-match rule, is proposed. Classification rules composed of "OR" can be expressed by the r-classification. A detector matches a nonself when (r+1) genes match between them. For example, consider a classification rule such as: If (Gene 1=TCP or UDP) and (Gene 2=[20,30]) and (Gene 3=0) Then detect nonself. When the match threshold r is set to 1 in the r-classification, the rule can be expressed as the following three rules connected by OR: If (Gene 1=TCP or UDP) and (Gene 2=[20,30]) or (…others…) Then detect nonself; If (Gene 2=[20,30]) and (Gene 3=0) or (…others…) Then detect nonself; If (Gene 1=TCP or UDP) and (Gene 3=0) or (…others…) Then detect nonself.
Fig. 1. Match range model
So the algorithm with the r-classification will be more effective at generating detectors, because the r-classification can express rules with 'or' and does not require extracting full classification rules with 'and', but only (r+1) partial rules from nonselves. However, in the r-classification, r is difficult to set. In the human immune system, T-cell maturation goes through two processes, positive and negative selection [5]. Positive selection requires T-cells to recognize self cells with lower affinity, while in negative selection T-cells must not recognize self cells with higher affinity. So there is a range between lower and higher affinity. In this work, inspired by T-cell maturation, a match range model, shown in Fig. 1, is proposed. Selfmax is the maximal distance between a detector and the selves. Selfmin is the minimal distance and must be bigger than 0. The range between selfmin and selfmax belongs to the self space. A nonself is detected when the distance is bigger than selfmax or smaller than selfmin, which is termed the range match rule. A detector with the range match rule will not detect any selves once selfmax and selfmin are set, because every self belongs to the range between selfmin and selfmax. So the NSA operator can be replaced by setting selfmax and selfmin. The match range model can be combined with the r-classification, with r replaced by selfmax and selfmin. Based on this idea, a Static Clonal Selection Algorithm based on the Match Range Model (SCSA-MRM) is proposed in this work. 2.2 The Model of the Algorithm In the human immune system, the main features of the clonal selection theory are [6, 7, 8]: (1) The new cells are copies of their parents (clones) subjected to a mutation
mechanism with high rates (somatic hypermutation); (2) Elimination of newly differentiated lymphocytes carrying self-reactive receptors; (3) Elimination of newly differentiated lymphocytes carrying low affinity antigenic
receptors. Based on the clonal selection theory, the model of SCSA-MRM is shown in Fig. 2. In Step 3, reproduction applies only a mutation operator, and there are five mutation operators [1]; this corresponds to somatic hypermutation in the first feature of the clonal selection theory. One parent produces only one child.
Steps 5~7 simulate the selection operator, which selects detectors with high fitness for the next generation. These processes act as the third feature of the clonal selection theory. Step 5 is the fitness scoring shown in Fig. 3; it is based on the Binary Immune System Model [9].

maxgen: the maximal generation
S: the rate to select self samples from the self set
PSize: the size of the detector population
DETECTORS: the detector population
gen: current generation
1: initialize the detector population
2: For gen=0 to maxgen
3:   children = reproduction (only mutation)
4:   DETECTORS = { children ∪ DETECTORS }
5:   Fitness scoring(DETECTORS)
6:   Rank DETECTORS with fitness from big to small
7:   DETECTORS = Select PSize detectors with higher fitness from DETECTORS
8: End

Fig. 2. Algorithm of SCSA-MRM
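The following is a minimal Python sketch of this evolutionary loop (not the authors' implementation): the mutation and fitness-scoring routines are assumed to be supplied, corresponding to the five mutation operators of [1] and the scoring of Fig. 3.

def scsa_mrm(init_population, mutate, fitness_scoring, psize, maxgen):
    """Evolve detectors: clone with mutation, score, keep the PSize fittest."""
    detectors = list(init_population)
    for _ in range(maxgen):
        children = [mutate(d) for d in detectors]        # one child per parent
        detectors = children + detectors                 # children ∪ DETECTORS
        scores = fitness_scoring(detectors)              # e.g. Fig. 3 procedure
        ranked = sorted(zip(detectors, scores), key=lambda p: p[1], reverse=True)
        detectors = [d for d, _ in ranked[:psize]]       # survivors
    return detectors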
When a subset of antibodies meets many antigens, there is a competition among the antibodies to detect the antigens. The antibody which detects more antigens will multiply and grow the fastest. In Fig. 3, Steps 6~7 mean that some detectors randomly meet the nonselves, just as antibodies randomly meet antigens. Steps 8~19 are the competition to detect the nonself samples. The more nonself samples one detector detects, the higher fitness it has, which means the detector has more chance to multiply. Step 9 is processed as in Fig. 5, which calculates the number of nonselves detected by a detector with its match range; the distance is based on the gene expression proposed by De Jong [1, 2]. Steps 17~19 follow "The antibody in the sample with the highest match score has its match score added to its fitness. The fitness of all other antibodies remains unchanged [9]". The value of C in Step 20 is usually 3 [1, 9]. Step 2 in Fig. 3 is shown in Fig. 4, which is used to evaluate the values of the two major parameters of the match model, selfmax and selfmin, and to create the match range. A detector with the range match rule will not detect any selves once selfmax and selfmin are set. This process corresponds to the second feature of the clonal selection theory and is used to delete detectors which detect any selves, so the False Positive (FP) rate is always equal to 0. In Fig. 4, the distance in Step 5 is based on the gene expression proposed by De Jong [1, 2]. Step 9 is the process of positive selection: selfmin must be bigger than 0, or the detector will be deleted, which means that selfmax is set equal to the gene length (l) and its fitness is the lowest.
C: the number of times to repeat
DctSample: a sample of DETECTORS
D: the size of DctSample
NonSample: a sample of the nonself set
A: the size of NonSample
MaxDetectedNum: the maximal number of nonselves in NonSample detected by a detector in DctSample
MaxSet: the detectors which detect MaxDetectedNum nonselves in NonSample
Hdn[ij]: distance between DctSample[i] and NonSample[j]
DctCount[i]: the number of NonSample detected by DctSample[i]
Fitness[i]: the fitness of DETECTORS[i]
r: the match threshold
Fitness Scoring (DETECTORS)
1: For each detector DETECTORS[i]
2:   SetSelfminSelfmax(DETECTORS)
3:   Fitness[i]=0
4: End
5: For Lc=1 to C * the Size of DETECTORS
6:   DctSample is randomly selected from DETECTORS without replacement
7:   NonSample is randomly selected from the nonself set without replacement
8:   For each detector DctSample[i]
9:     DctCount[i] = the number of nonselves in NonSample detected by DctSample[i]
10:    If DctCount[i] > MaxDetectedNum then
11:      MaxDetectedNum = DctCount[i], MaxSet = DctSample[i]
12:    End
13:    If DctCount[i] = MaxDetectedNum then
14:      MaxSet = {MaxSet ∪ DctSample[i]}
15:    End
16:  End
17:  For each detector MaxSet[k] in MaxSet
18:    The fitness of MaxSet[k] += MaxDetectedNum / Size of MaxSet
19:  End
20: End
21: the processes are repeated for (population size × C) times

Fig. 3. Fitness Scoring of SCSA-MRM
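A compact Python sketch in the spirit of Fig. 3 (not a faithful reimplementation): `detects(detector, pattern)` is an assumed predicate, for example the range match rule of Fig. 5, and fitness is shared among the best detectors of each randomly drawn sample.

import random

def fitness_scoring(detectors, nonself_set, d_size, a_size, c, detects):
    """Competition-based scoring: winners of each drawn sample share the score."""
    fitness = {id(d): 0.0 for d in detectors}
    for _ in range(c * len(detectors)):
        dct_sample = random.sample(detectors, min(d_size, len(detectors)))
        non_sample = random.sample(nonself_set, min(a_size, len(nonself_set)))
        counts = [sum(detects(d, n) for n in non_sample) for d in dct_sample]
        best = max(counts)
        winners = [d for d, cnt in zip(dct_sample, counts) if cnt == best]
        for w in winners:                      # share the score among winners
            fitness[id(w)] += best / len(winners)
    return [fitness[id(d)] for d in detectors]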
l: the number of bits in a detector
SELF: self samples
Hds[ij]: the distance between DETECTORS[i] and SELF[j]
selfmax[i]: maximal distance between DETECTORS[i] and SELF
selfmin[i]: minimal distance between DETECTORS[i] and SELF
Fitness[i]: the fitness of DETECTORS[i]
SetSelfminSelfmax(DETECTORS)
1: Select S*(size of self set) number of SELF from the self set
2: For each detector DETECTORS[i]
3:   selfmax[i]=0, selfmin[i]=l
4:   For each self sample SELF[j]
5:     Compute Hds[ij]
6:     If Hds[ij] > selfmax[i] then selfmax[i] = Hds[ij]
7:     If Hds[ij] < selfmin[i] then selfmin[i] = Hds[ij]
8:   End
9:   If selfmin[i]=0 then selfmax[i]=l, Fitness[i]=0
10: End

Fig. 4. Setting the selfmax and selfmin
1: DctCount[i]=0
2: For each nonself NonSample[j]
3:   compute Hdn[ij]
4:   if Hdn[ij] > Selfmax or Hdn[ij] < Selfmin then DctCount[i] += 1
5: End

Fig. 5. Computation of the number of nonselves detected by detector
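A compact Python sketch of Figs. 4 and 5, assuming a Hamming-style distance over equal-length binary detectors (the paper computes the distance from De Jong's gene expression, which is not reproduced here):

def hamming(a, b):
    """Number of differing bits between two equal-length binary strings."""
    return sum(x != y for x, y in zip(a, b))

def set_self_range(detector, self_samples, gene_length):
    """Fig. 4: selfmin/selfmax are the extreme distances to the self samples."""
    dists = [hamming(detector, s) for s in self_samples]
    selfmax, selfmin = max(dists), min(dists)
    if selfmin == 0:                     # detector covers a self: invalidate it
        return gene_length, 0, 0.0       # selfmax = l, selfmin = 0, fitness = 0
    return selfmax, selfmin, None

def count_detected(detector, nonself_samples, selfmax, selfmin):
    """Fig. 5: a nonself is detected when its distance falls outside the range."""
    return sum(1 for n in nonself_samples
               if hamming(detector, n) > selfmax or hamming(detector, n) < selfmin)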
2.3 Experiment for Self/Nonself Discrimination The objective is to understand the following: (1) the comparison between SCSA and SCSA-MRM; (2) the match range searched by the genetic algorithm; (3) whether the algorithm is self-adapted.
In this experiment, data sets from the UCI Repository of machine learning databases are used, because two data sets from the UCI are used in J. Kim's paper [1]. A minimal entropy discretization algorithm is used to discretize these data sets [10]. The first data set is the Wisconsin Diagnostic Breast Cancer (WDBC) data, which has 30 attributes. The WDBC data is different from the data in J. Kim's paper, which is the Wisconsin Breast Cancer Database [1]. It consists of 569 examples with two classes, 'benign' and 'malignant': 357 examples belong to 'benign' and 212 belong to 'malignant'. The second data set is the iris data with 4 attributes, which is the same as the data in J. Kim's paper [1]. It has a total of 150 examples with three classes: 'Setosa', 'Versicolour', 'Virginica'. Each class has 50 examples. For the comparison between SCSA and SCSA-MRM, both the WDBC data and the iris data are used. In the WDBC data, 'malignant' is defined as the self set and 'benign' as the nonself set. In the iris data, 'Virginica' is defined as the self set and the others as the nonself set. Both SCSA and SCSA-MRM are run ten times with D=5, A=5, S=1, maxgen=500, PSize=10, C=3. The mutation rate for the WDBC data is 0.1 and the mutation rate for the iris data is 0.2. Results are shown in Figs. 6 and 7.
To illustrate the self-adapted character of SCSA-MRM, the iris data set is used. The algorithm is run ten times with D=5, A=5, S=1, maxgen=1000, PSize=10, C=3, and a mutation rate of 0.2. Before the 500th generation, 'Virginica' is defined as the self set and 'Versicolour' is defined as the nonself set. After the 500th generation, the data set is changed: 'Virginica' and 'Setosa' are defined as the self set, while 'Versicolour' is still defined as the nonself set. Results are shown in Fig. 8. 2.4 Analysis 1. The comparison between SCSA and SCSA-MRM. Fig. 6 shows the comparison between SCSA and SCSA-MRM. In (b) the TP of the two algorithms is almost the same. But in (a), the TP of SCSA is equal to 0, which indicates that the binary expression proposed by De Jong is not flexible enough to express classification rules with OR and that SCSA cannot extract full match rules from the nonselves within the maximal generation. So SCSA-MRM is better than SCSA, and the r-classification rule proposed in this work is effective.
Fig. 6. The True Positive of the different algorithms: (a) TP of WDBC data; (b) TP of Iris data

Fig. 7. Match range of SCSA-MRM: (a) match range of WDBC data; (b) match range of Iris data
2. The match range searched by the genetic algorithm. Fig. 7 shows that the match range evolves to the optimized range because of the clonal selection driven by the genetic algorithm. The space outside the range between selfmax and selfmin is the nonself space.
Fig. 8. SCSA-MRM: (a) self-adapted TP; (b) self-adapted match range, selfmin and selfmax; (c) average age
3. Whether the algorithm is self-adapted. When the self data is changed at the 500th generation, detectors are evolved again, as shown in Fig. 8(c). The TP in (a) and both selfmax and selfmin in (b) are self-adapted.
3 Conclusion In summary, SCSA-MRM is more effective than SCSA because the r-classification combined with the match range is flexible enough to express complex rules. The r-classification can express rules with 'or' and does not require extracting full classification rules with 'and', so it is easy to extract partial classification rules from nonselves. When the self data is not constant, SCSA-MRM is self-adapted because selfmax and selfmin are evaluated from the self data. According to "the different value of r for each detector can fill
the potential holes [11]", it may be possible to solve the holes problem [11] because there is no match threshold r but a match range in this algorithm. Of course, further research is required, because the experiments are not based on network data and the size of the detector population is not self-adapted when both self and nonself are changed.
Acknowledgements Thanks for the assistance received by using the UCI Repository of machine learning databases [http://www.ics.uci.edu/~mlearn/MLRepository.html].
References 1. Kim J. W.: Integrating Artificial Immune Algorithms for Intrusion Detection, PhD Thesis,Department of Computer Science, University College London (2002). 2. De Jong, Kenneth A.. William M. Spears, and Diana F. Gordon.: Using Genetic Algorithms for Concept Learning. In Machine Learning (1993) 161-188. 3. Forrest S., Perelson A. S., Allen, L. and Cherukuri, R.: Self-nonself Discrimination in a Computer, Proceedings of the 1994 IEEE Symposium on Research in Security and Privacy, Los Alamos, CA: IEEE Computer Society Press (1994) 4. Hofmeyr S.A.: An Immunological Model of Distributed Detection and its Application to Computer Security, PhD Dissertation, University of New Mexico (1999). 5. Gonzalez F.: A Study of Artificial Immune Systems applied to Anomaly Detection, A Dissertation presented for the Doctor of Philosophy Degree, The University of Memphis (2003). 6. de Castro L. N. and Von Zuben F. J.: Artificial Immune Systems: Part I – Basic Theory and Applications, Technical Report – RT DCA 01/99 (1999) 7. de Castro L. N., Von Zuben F. J.: Learning and Optimization Using the Clonal Selection Principle, IEEE Transactions on Evolutionary Computation, Special Issue on Artificial Immune Systems (2002) 239-251 8. Wenjian L.: Research on Artificial Immune Model and Algorithms Applied to Intrusion Detection, PhD Dissertation, University of Science and Technology of China (2003) 9. Forrest S.,Javornik B., Smith R.E.and Perelson A.S.: Using genetic algorithms to explore pattern recognition in the immune system,Evolutionary Computation (1993) 191-211, 10. Dougherty J., Kohavi R. and Sahami M.: Supervised and unsupervised discretization of continuous features. Available http://robotics.stanford.edu/~ronnyk/disc.ps 11. D'haeseleer P., Forrest S. and Helman P.: A Distributed Approach to Anomaly Detection,,1997, Available http://www.cs.unm.edu/~forrest/isa_papers.htm
Appendix The parameter (D, A, Mutate Rate) sensitivity of SCSA-MRM is investigated. The WDBC data set is used. SCSA-MRM is run ten times with S=1, maxgen=500, PSize=10, C=3. The results are shown in Figs. 9~11.
Fig. 9. SCSA-MRM with different D

Fig. 10. SCSA-MRM with different A

Fig. 11. SCSA-MRM with different Mutate Rate
Diagnosing Faulty Transitions in Recommender User Interface Descriptions Alexander Felfernig Computer Science and Manufacturing, University Klagenfurt A-9020 Klagenfurt, Austria [email protected]
Abstract. Complex assortments of products and services offered by online selling platforms require the provision of sales support systems assisting customers in the product selection process. Knowledge-based recommenders are intelligent sales assistance systems which guide online customers through personalized sales dialogs and automatically determine products which conform to their needs and wishes. Such systems have been successfully applied in a number of application domains such as financial services or digital cameras. In this context, the construction of recommender user interfaces is still a challenging task. In many cases faulty models of recommender user interfaces are defined by knowledge engineers and no automated support for debugging such models is available. In this paper we discuss a formal model for defining the intended behaviour of recommender user interfaces and show the application of model-based diagnosis concepts which allow the automated debugging of those definitions. Experiences show that this approach significantly increases the productivity of recommender user interface development and maintenance. Keywords: knowledge-based recommenders, diagnosis, user interface design.
1 Introduction Simulating the behaviour of human sales experts, recommender systems [1,4,6,9,12] support customers in the product or service selection process. These technologies are of great importance for making product assortments accessible to online customers without technical product domain knowledge. In this paper we concentrate our discussions on knowledge-based recommenders [1,4] which exploit deep knowledge about the product domain in order to determine products or services fitting to the wishes and needs of the customer. Such representations allow the inclusion of automated debugging mechanisms which support the effective development and maintenance of a recommender application. Two basic aspects have to be considered when implementing a knowledge-based recommender application. Firstly, the relevant set of products has to be identified and transformed into a formal representation, i.e. a recommender knowledge base has to be defined. Such a knowledge base consists of a structural description of the provided set of products, a description of the possible set of customer requirements and a set of constraints restricting the possible combinations of customer requirements and product properties. Secondly, a model of a recommender
process [4] has to be defined which serves as the basis for personalized dialogs. Knowledge bases and process definitions can be automatically (no programming is needed) translated into an executable advisor, where each state of the process definition corresponds to a Web-page in the generated application [4]. In each state a user can articulate his requirements and depending on the given answers the process control determines the following state. In a final state, product recommendations are displayed, e.g. concrete digital cameras or financial services (savings, bonds, etc.). In this paper we focus on a situation where knowledge engineers develop a model of the user interface of a knowledge-based recommender application. The interfaces of such applications are described by a finite number of states, where state transitions are triggered by user requirements represented by a set of variable settings (see Figure 1).
Fig. 1. Example (faulty) process definition. {c1, c2, ..., c11} are transition conditions between states qi; var(qi) represents the variable instantiated by the user in state qi. State variables: var(q0)=kl, var(q1)=aw, var(q2)=id, var(q3)=ds, var(q4)=av, var(q5)=sh, var(q6)=sl. Transition conditions: c1: kl=beginner, c2: kl=expert, c3: aw=yes, c4: aw=no, c5: id=shortterm, c6: id<>shortterm, c7: kl<>beginner, c8: av=no, c9: kl=beginner, c10: av=yes, c11: kl=average.
Finite state automata [7] can be used for expressing the expected behaviour of a recommender user interface [4]. Typically, customers make decisions by specifying values (requirements) for a subset of a given set of variables. Depending on the answers the automaton changes its state, e.g. an expert (kl = expert) who isn't interested in financial services advisory (aw = no) is forwarded to the state q3, where a direct product search can be performed, i.e. different subsets of variables are defined by different paths in the automaton. Note that the design in Figure 1 is faulty, since an expert (kl = expert) who wants to be guided by a financial advisory process (aw = yes) and is interested in long-term investments (id = longterm) and doesn't have any available funds (av = no) comes to a standstill at the input of availability (the conditions {c2, c9} and {c2, c11} are contradictory). In such a situation, mechanisms are needed allowing an effective identification of the sources of inconsistencies. In the following we show how concepts from model-based diagnosis (MBD) [10] are applied for identifying a minimal set of changes allowing consistency recovery in finite state representations of recommender user interfaces. The presented concepts have been implemented within the scope of the Koba4MS [4] project (Koba4MS is an acronym for Knowledge-based Advisors for Marketing & Sales; Austrian Research Fund Project 808479), where finite state models are used to specify the intended behaviour of a recommender user interface.
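To make the inconsistency above concrete, here is a minimal sketch (a hypothetical encoding, not the Koba4MS implementation) that checks whether a set of equality/inequality transition conditions over finite-domain variables is jointly satisfiable:

domains = {'kl': {'beginner', 'average', 'expert'}, 'aw': {'yes', 'no'},
           'id': {'shortterm', 'mediumterm', 'longterm'}, 'av': {'yes', 'no'}}

# each condition restricts one variable; only a subset is shown for illustration
conditions = {'c2': ('kl', '=', 'expert'), 'c9': ('kl', '=', 'beginner'),
              'c11': ('kl', '=', 'average'), 'c3': ('aw', '=', 'yes')}

def satisfiable(cond_names):
    """True iff every variable keeps at least one allowed value."""
    allowed = {v: set(vals) for v, vals in domains.items()}
    for name in cond_names:
        var, op, val = conditions[name]
        allowed[var] &= {val} if op == '=' else allowed[var] - {val}
    return all(allowed[v] for v in allowed)

print(satisfiable({'c2', 'c9'}))   # False: expert vs. beginner
print(satisfiable({'c2', 'c11'}))  # False: expert vs. average
print(satisfiable({'c2', 'c3'}))   # True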
The remainder of this paper is organized as follows. In Section 2 we introduce the concept of Predicate-Based Finite State Automata (PFSA), a representation formalism for the design of recommender user interfaces. In Section 3 we discuss our approach to the automated debugging of PFSA-based representations using Model-Based Diagnosis (MBD). In Section 4 we present results from evaluating the presented concepts.
2 Finite State Models of Recommender User Interfaces For the following discussions on the application of model-based diagnosis concepts to the automated debugging of recommender user interface descriptions we introduce a variant of Predicate-Based Finite State Automata (PFSA) [15] (see e.g. Figure 1) which is a natural and compact approach to define state transitions (transitions are defined in terms of domain restrictions of finite domain variables). Note that we restrict our discussions to the case of acyclic automata. Definition 1 (PFSA): we define a Predicate-based Finite State Automaton (PFSA) to be a 6-tuple (Q, Σ, ∏, E, S, F), where
• Q = {q1, q2, ..., qj} is a finite set of states, where var(qi) = xi is a finite domain variable assigned to qi, prec(qi) = {φ1, φ2, ..., φm} is the set of preconditions of qi (φk = {cr, cs, ..., ct} ⊆ ∏), postc(qi) = {ψ1, ψ2, ..., ψn} is the set of postconditions of qi (ψl = {cu, cv, ..., cw} ⊆ ∏), and dom(xi) = {xi=di1, xi=di2, ..., xi=dip} denotes the set of possible assignments of xi, i.e. the domain of xi.
• Σ = {xi = dij | xi = var(qi), xi = dij ∈ dom(xi)} is a finite set of variable assignments (input symbols), the input alphabet.
• ∏ = {c1, c2, ..., cq} is a set of constraints (transition conditions) restricting the set of words accepted by the PFSA.
• E is a finite set of transitions ⊆ Q × ∏ × Q.
• S ⊆ Q is a set of start states.
• F ⊆ Q is a set of final states. □
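As a sketch of how the 6-tuple of Definition 1 might be held in code (hypothetical data structure, not the Koba4MS representation; conditions are kept as opaque labels whose semantics live in a separate table, as in Example 1 below):

from dataclasses import dataclass

@dataclass
class PFSA:
    """Container mirroring (Q, Sigma, Pi, E, S, F)."""
    states: set        # Q
    var: dict          # var(qi) -> finite-domain variable name
    domains: dict      # variable -> set of possible values (Sigma via dom)
    conditions: set    # Pi, transition-condition labels
    transitions: list  # E, triples (q_from, {condition labels}, q_to)
    start: set         # S
    final: set         # F

# the example of Figure 1 (cf. Example 1 below)
pfsa = PFSA(
    states={'q0', 'q1', 'q2', 'q3', 'q4', 'q5', 'q6'},
    var={'q0': 'kl', 'q1': 'aw', 'q2': 'id', 'q3': 'ds',
         'q4': 'av', 'q5': 'sh', 'q6': 'sl'},
    domains={'kl': {'beginner', 'average', 'expert'}, 'aw': {'yes', 'no'},
             'id': {'shortterm', 'mediumterm', 'longterm'},
             'ds': {'savings', 'bonds', 'stockfunds', 'singleshares'},
             'av': {'yes', 'no'}, 'sh': {'stockfunds', 'singleshares'},
             'sl': {'savings', 'bonds'}},
    conditions={f'c{i}' for i in range(1, 12)},
    transitions=[('q0', {'c2'}, 'q1'), ('q0', {'c1'}, 'q2'), ('q1', {'c4'}, 'q3'),
                 ('q1', {'c3'}, 'q2'), ('q2', {'c6', 'c7'}, 'q4'), ('q2', {'c5'}, 'q6'),
                 ('q4', {'c8', 'c9'}, 'q6'), ('q4', {'c10', 'c11'}, 'q5')],
    start={'q0'},
    final={'q3', 'q5', 'q6'},
)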
Preconditions of a state qi (prec(qi) = {φ1, φ2, ..., φm}) can be automatically derived from the reachability tree of a Predicate-based Finite State Automaton. Figure 2 depicts the reachability tree for the PFSA of Figure 1. The state q2 is accessed twice in the reachability tree. The corresponding preconditions (prec(q2)) are derived from the transition conditions of the paths leading to q2, i.e. {{c1}, {c2, c3}}. Similarly, postc(qi) is the set of postconditions of the state qi which are as well derived from the reachability tree, e.g. the state q4 has two postconditions, namely {{c8, c9}, {c10, c11}}. Example 1 includes a PFSA describing the behaviour of a financial services advisory dialog (the graphical representation is contained in Figure 1). States in the PFSA represent questions which can be posed to customers during an advisory session. Subsequent states in the PFSA are determined depending on the answers given by a customer. We denote the set of input sequences leading to a final state of a PFSA as the accepted language. In the context of user interface development for recommender applications, supporting concepts are needed which allow the effective identification of faulty
transition conditions, e.g. if a sequence of user inputs reaches state qi, there must exist an extension of this input sequence to a final state of the PFSA. Path expressions are the basis for expressing well-formedness properties on process definitions (see Definition 2).
Q = {q0, q1, q2, q3, q4, q5, q6}.
/* knowledge level of the customer */ var(q0) = kl.
/* advisory wanted */ var(q1) = aw.
/* duration of investment */ var(q2) = id.
/* direct specification of services */ var(q3) = ds.
/* availability of financial resources */ var(q4) = av.
/* high risk products */ var(q5) = sh.
/* low risk products */ var(q6) = sl.
dom(kl) = {kl=beginner, kl=average, kl=expert}.
dom(aw) = {aw=yes, aw=no}.
dom(id) = {id=shortterm, id=mediumterm, id=longterm}.
dom(ds) = {ds=savings, ds=bonds, ds=stockfunds, ds=singleshares}.
dom(av) = {av=yes, av=no}.
dom(sh) = {sh=stockfunds, sh=singleshares}.
dom(sl) = {sl=savings, sl=bonds}.
prec(q0) = {{true}}. prec(q1) = {{c2}}. prec(q2) = {{c1}, {c2, c3}}. prec(q3) = {{c2, c4}}. ...
postc(q0) = {{c2, c4}, {c2, c3, c5}, {c2, c3, c6, c7, c8, c9}, {c2, c3, c6, c7, c10, c11}, {c1, c5}, {c1, c6, c7, c8, c9}, {c1, c6, c7, c10, c11}}.
postc(q1) = {{c4}, {c3, c5}, {c3, c6, c7, c8, c9}, {c3, c6, c7, c10, c11}}. ...
postc(q4) = {{c8, c9}, {c10, c11}}. postc(q3) = {{true}}. postc(q5) = {{true}}. postc(q6) = {{true}}.
Σ = {kl=beginner, kl=average, kl=expert, aw=yes, aw=no, ..., sl=savings, sl=bonds}.
∏ = {c1, c2, ..., c11}.
E = {(q0, {c2}, q1), (q0, {c1}, q2), (q1, {c4}, q3), (q1, {c3}, q2), (q2, {c6, c7}, q4), (q2, {c5}, q6), (q4, {c8, c9}, q6), (q4, {c10, c11}, q5)}.
S = {q0}.
F = {q3, q5, q6}.
Example 1. Predicate-based Finite State Automaton (PFSA)
Definition 2 (consistent path): Let p = [(q1,C1,q2), (q2,C2,q3), ..., (qi−1,Ci−1,qi)] ((qα,Cα,qβ) ∈ E) be a path from a state q1 ∈ S to a state qi ∈ Q. p is consistent (consistent(p)) iff ∪Cj is satisfiable. □ Using the notion of a consistent path we can now introduce the following set of well-formedness rules specifying structural properties of a PFSA. These rules are relevant for building knowledge-based recommender applications. Due to the generality of the presented diagnosis approach, we can introduce further well-formedness properties for domain-specific purposes. A consistent path in a PFSA leading to a state qi must have a direct postcondition, i.e., (qi,Ci,qj) propagating the path (i.e. each consistent path must be extensible).
Definition 3 (extensible path): Let p = [(q1,C1,q2), (q2,C2,q3), ..., (qi−1,Ci−1,qi)] be a path from a state q1 ∈ S to a state qi ∈ Q - F. p is extensible (extensible(p)) iff ∃ (qi,Ci,qi+1): C1 ∪ C2 ∪ ... ∪ Ci−1 ∪ Ci is satisfiable. □
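As an illustration only (not the paper's implementation), Definitions 2 and 3 can be phrased directly over the transition conditions; `satisfiable` is an assumed test over a set of condition labels, such as the finite-domain check sketched after Figure 1.

def path_consistent(path, satisfiable):
    """Definition 2: the union of all condition sets along the path is satisfiable.
    `path` is a list of transitions (q_from, condition_set, q_to)."""
    union = set().union(*(c for _, c, _ in path)) if path else set()
    return satisfiable(union)

def path_extensible(path, outgoing, satisfiable):
    """Definition 3: some outgoing transition of the path's last state keeps
    the accumulated conditions satisfiable."""
    union = set().union(*(c for _, c, _ in path)) if path else set()
    return any(satisfiable(union | c) for _, c, _ in outgoing)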
Fig. 2. Reachability tree of a PFSA. The tree represents the expansion of the PFSA shown in Figure 1. {p1, p2, ..., p7} are the possible paths of the PFSA. For each path pi the corresponding path variables are shown.
A state qi is a decision point for determining the subsequent state. The selection of subsequent states depends on the definition of the direct postconditions for qi, where each postcondition has to be unique for determining the subsequent state. A state qi is deterministic if each of its postconditions is unique for determining subsequent states. Definition 4 (deterministic state): Let p = [(q1,C1,q2), (q2,C2,q3), ..., (qi−1,Ci−1,qi)] be a path from a state q1 ∈ S to a state qi ∈ Q - F. A state qi is deterministic (deterministic(qi)) iff ∀(qi,Ci1,qj), (qi,Ci2,qk) ∈ E: C1 ∪ C2 ∪ ... ∪ Ci−1 ∪ Ci1 ∪ Ci2 is contradictory (Ci1 ≠ Ci2). □ All transitions of a PFSA should be accessible, i.e. for each transition there exists at least one corresponding path including this transition. Definition 5 (accessible transition): A transition t = (qi,Ci,qi+1) (postcondition of state qi) is accessible (accessible(t)) iff there exists a path p = [(q1,C1,q2), (q2,C2,q3), ..., (qi−1,Ci−1,qi)] (q1 ∈ S): C1 ∪ C2 ∪ ... ∪ Ci−1 ∪ Ci is satisfiable. □ A PFSA is well-formed if the set of defined well-formedness rules is fulfilled. Definition 6 (well-formed PFSA): A PFSA is well-formed iff
• each consistent path p = [(q1,C1,q2), (q2,C2,q3), ..., (qi−1,Ci−1,qi)] (q1 ∈ S, qi ∈ Q - F) is extensible to a consistent path p' = [(q1,C1,q2), (q2,C2,q3), ..., (qi−1,Ci−1,qi), (qi,Ci,qj)].
• ∀qk ∈ Q: deterministic(qk).
• ∀t = (qk,Ck,ql) ∈ E: accessible(t). □
3 Diagnosing Faulty Transitions in User Interface Descriptions Faulty transition conditions should be identified as soon as possible. Consequently, we need support in the repair of PFSAs that are not well-formed through an automated identification of minimal sets of transition conditions responsible for the faulty behaviour. We apply model-based diagnosis (MBD) [10] by introducing the concept of a PFSA Diagnosis Problem and a corresponding PFSA Diagnosis. Our approach starts with the description of a system (SD) which is in our case the description of the structure and the intended behaviour of the PFSA (well-formedness rules discussed in Section 2). If the observable behaviour of the system conflicts with its intended behaviour (some of the well-formedness rules are violated), the task of the diagnosis algorithm is to determine components (in our case transition conditions) which, when assumed to be functioning abnormally, explain the discrepancy between the actual and the intended system behaviour. In order to apply MBD concepts, we transform a PFSA definition into a corresponding constraint-based representation [14] consisting of restrictions representing well-formedness properties and transitions of the PFSA. A calculated diagnosis represents a minimal set of transitions which are responsible for the faulty behaviour of the automaton, i.e. are inconsistent with the defined well-formedness properties (see Section 2). Note that there can be different diagnoses providing an explanation for the faulty behaviour of a PFSA. We define a Predicate-based Finite State Automaton Diagnosis Problem (PFSA Diagnosis Problem) as follows. Definition 7 (PFSA Diagnosis Problem): A PFSA Diagnosis Problem is represented by a tuple (SD, TRANS), where SD = STAT ∪ WF. STAT is the structural description of a PFSA represented by a set of finite domain variables. WF is the intended behaviour of a PFSA which is represented by a set of constraints on STAT. Finally, TRANS represents a set of transition conditions. □ The basic elements of a PFSA Diagnosis Problem, i.e., STAT, WF, TRANS, define a corresponding Constraint Satisfaction Problem (CSP) [14]. Solutions for this CSP cover all interaction paths, i.e., accepted runs of the PFSA. In this context, STAT represents a set of finite domain variables related to paths of the reachability tree (the variables of our example PFSA are depicted in Figure 2 and Example 2), e.g. {kl1, id1, av1, sl1} are variables corresponding to the path p1. Note that the projection of all solutions for the CSP (STAT, WF, TRANS) to e.g. the variables {kl1, id1, av1, sl1} represents input sequences accepted by path p1. Example 2 (STAT): STAT = {kl1, id1, av1, sl1, kl2, id2, av2, sh2, kl3, id3, sl3, kl4, aw4, id4, av4, sh4, kl5, aw5, id5, av5, sl5, kl6, aw6, id6, sl6, kl7, aw7, ds7} □ The construction of the well-formedness rules (Definitions 3-5) WF = WFextensibility ∪ WFaccessibility ∪ WFdeterminism is now exemplified by the accessibility rule related to the transition (q2, {c6, c7}, q4) of our example PFSA (see Figure 1). Example 3 (well-formedness rules for accessibility): WFaccessibility(q2, {c6, c7}, q4) = {id1≠noval ∨ id2≠noval ∨ id4≠noval ∨ id5≠noval}. □ The transition (q2, {c6, c7}, q4) must be accessible for at least one of the paths p1, p2, p4, p5, i.e. at least one of the variables id1, id2, id4, id5 must have a value ≠ noval in each solution for the CSP defined by (STAT, WF, TRANS). Since in our case a
reachability tree (see e.g. Figure 2) represents the complete expansion of a corresponding PFSA, not all the paths are necessarily consistent. If a path of the reachability tree represents such an illegal trajectory, i.e. no consistent value assignment exists for the corresponding variables, all variables of this path have the assignment noval. This is assured by meta-constraints defined for each path in the reachability tree, e.g. for path p1: kl1 ≠ noval ∧ id1 ≠ noval ∧ av1 ≠ noval ∧ sl1 ≠ noval ∨ kl1 = noval ∧ id1 = noval ∧ av1 = noval ∧ sl1 = noval. An example for the definition of a transition condition (TRANS) is the following (see Example 4). We represent the transition condition c1 of our example PFSA. Example 4 (TRANS for PFSA): TRANS = {c1: (kl1 = beginner ∨ kl1 = noval) ∧ (kl2 = beginner ∨ kl2 = noval) ∧ (kl3 = beginner ∨ kl3 = noval). ...}. □ The condition generated in Example 4 contains those variables belonging to paths including (q0, {c1}, q2), i.e. {kl1, kl2, kl3}, which belong to the paths {p1, p2, p3}. For the CSP (STAT, WF, TRANS) the possible values of {kl1, kl2, kl3} are defined by c1, i.e. the value can be either beginner or noval (in the case that the corresponding path is an illegal trajectory). Given a specification of (SD, TRANS), a PFSA Diagnosis is defined as follows. Definition 8 (PFSA Diagnosis): A PFSA Diagnosis for a PFSA Diagnosis Problem (SD, TRANS) is a set S ⊆ TRANS s.t. SD ∪ (TRANS − S) is consistent. □ A diagnosis exists under the assumption that STAT ∪ WF is consistent [2]. The calculation of diagnoses is based on the concept of minimal conflict sets. Definition 9 (Conflict Set): a Conflict Set (CS) for (SD, TRANS) is a set {c1, c2, ..., cn} ⊆ TRANS, s.t. {c1, c2, ..., cn} ∪ SD is inconsistent. CS is minimal iff ¬∃ CS' ⊂ CS:
876
A. Felfernig
4 Evaluation The presented approach to introduce Model-Based Diagnosis (MBD) [10] into the development of knowledge-based recommenders does not change the structure of the user interface supporting the design of recommender processes, but rather complements the existing knowledge acquisition component [4] with concepts supporting the detection of faulty areas in a given PFSA definition. This approach is not restricted to knowledge-based recommendation, but is generally applicable to domains, where a user interface is explicitly defined in terms of a finite state model (in the form of a PFSA). A performance evaluation [2] shows the applicability of the presented algorithm in (interactive) real-world settings. Typically, recommender process definitions consist of about 15-40 states – a size which makes the application of the presented debugging concepts meaningful. In many cases there exist a number of alternative diagnoses explaining the sources of inconsistencies in a given process definition. In this context we will include additional ranking mechanisms for diagnoses which take into account the probability of a transition condition to be faulty (e.g. the probability of a transition condition to be faulty is higher if there exists a large number of incoming paths to the related state). We have compared development efforts related to projects without debugging support for process definitions and with a corresponding debugging support. Our development environment has a logging functionality which stores each activity in the recommender design environment (e.g. change/edit a transition condition, etc.). The projects investigated have a similar structure w.r.t. number of product properties, customer properties, constraints, number of states in the PFSA etc. The observation of time efforts shows that the usage of the debugging concepts can significantly reduce efforts related to the development and maintenance of recommender user interfaces. In order to empirically show the direct positive influence of debugging techniques we have conducted an experiment with ninety-seven participating students all having experiences in developing knowledge-based recommender applications. In this experiment participants were interacting with a web-based application providing the basic functionality of our debugging environment. The participants had to solve the task of identifying a minimal set of constraints in given recommender process definitions which should be changed in order to make the process definition consistent with the well-formedness properties presented in this paper. The participants had to propose a set of repair actions which make the process definition consistent with the well-formedness rules. The hypothesis of the experiment was that automated debugging support leads to effort reductions in the detection and repair of faulty transition conditions. The result of the experiment was a significant decrease of time efforts due to the application of the presented debugging concepts (on an average of 36% - time reductions depend on the number of faulty transitions conditions in the process definition). A detailed analysis of the consequences of applying automated debugging concepts in recommender application development is currently ongoing and is in the line of our research focus to improve the effectiveness of recommender application development processes.
Diagnosing Faulty Transitions in Recommender User Interface Descriptions
877
5 Related Work The increasing size and complexity of configuration knowledge bases motivated the application of model-based diagnosis (MBD) [10] in knowledge-based systems development [3]. Similar motivations led to the application of MBD in technical domains such as the development of hardware designs [5], onboard diagnosis for automotive systems [11] and in software development [8]. A value-based model of program execution is introduced in a.o. [8], which is an extension of the approach of debugging imperative languages [5]. Using MBD, the identification of errors is based on source code analysis. Additionally, a set of test cases specifying the expected behaviour of the program is required - this set encompasses concrete values for variables, assertions, reachability constraints for paths, etc. An overview on different applications of model-based diagnosis techniques in software debugging is given in [13]. The representation of the intended behaviour of recommender user interfaces in the form of finite state representations is discussed in [2,4]. This approach is novel in the context of developing knowledge-based applications and, due to its formal foundation, allows an automated translation of the graphical representation into a recommender application. Basically there are three approaches to the implementation of recommender applications. Collaborative Filtering [6,9,12] and Content-based Filtering [9] do not exploit deep knowledge about the product domain. Collaborative Filtering is based on the assumption that customer preferences are correlated, i.e. similar products are recommended to customers with similar interest profiles. Content-based filtering focuses on the analysis of a given set of products already ordered by a customer. Based on this information, products are recommended which resemble products already ordered (products related to similar categories). Additionally, there are a number of approaches combining these basic approaches in order to gain an improved quality of the resulting solutions. Using knowledge-based approaches, the relationship between customer goals and preferences and offered products can be modelled explicitly. An overview on the Koba4MS development environment can be found in [4]. Such model-based knowledge representations are the major precondition for the application of model-based diagnosis [10] discussed in this paper.
6 Conclusions In this paper we have presented an approach to the automated debugging of recommender process definitions representing the intended behaviour of user interfaces of knowledge-based recommenders. Predicate-based finite state representations are applied in order to model interaction paths of recommender applications. For debugging purposes those finite state automata are translated into a corresponding representation of a constraint satisfaction problem. Results of an experiment show that our approach can significantly increase the productivity of recommender development.
References [1] Burke, R. Knowledge-based Recommender Systems. Encyclopedia of Library & Information Systems, 69, 32, 2000. [2] Felfernig, A., and Shchekotykhin, K. Debugging User Interface Descriptions of Knowledge-based Recommenders. Workshop Notes of the IJCAI’05 Workshop on Configuration, 13–18, 2005. [3] Felfernig, A., Friedrich, G., Jannach, D., and Stumptner, M. Consistency-based Diagnosis of Configuration Knowledge Bases. AI Journal 2, 152, 213–234, 2004. [4] Felfernig, A. Koba4MS: Selling Complex Products and Services Using KnowledgeBased Recommender Technologies. 7th IEEE International Conference on E-Commerce Technology, 92–100, 2005. [5] Friedrich, G., Stumptner, M., and Wotawa, F. Model-based diagnosis of hardware designs. AI Journal 111, 2, 3–39, 1999. [6] Herlocker, J. L., Konstan, J. A., Terveen, L. G., and Riedl, J. T. Evaluating Collaborative Filtering Recommender Systems. ACM Transactions on Information Systems 22, 1, 5–53, 2004. [7] Hopcroft, J., and Ullman, J. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley Publishing Company, Massachusetts, USA, 1979. [8] Mateis C., Stumptner, M., and Wotawa, F. Modeling Java programs for diagnosis. 14th European Conference on Artificial Intelligence, 171–175, 2000. [9] Pazzani, M. A Framework for Collaborative, Content- Based and Demographic Filtering. Artificial Intelligence Review 13, 5-6, 393–408, 1999. [10] Reiter, R. 1987. A theory of diagnosis from first principles. AI Journal 23, 1, 57–95, 1987. [11] Sachenbacher, M., Struss, P., and Carlen, C. A Prototype for Model-Based On-Board Diagnosis of Automotive Systems. AICOM 13, 2, 83–97, 2000. [12] Sarwar, B., Karypis, G., Konstan, J. A., and Riedl, J. T. Item-based collaborative filtering recommendation algorithms. 10th Int. WWW Conference, 285–295, 2001. [13] Stumptner, M., and Wotawa, F. A Survey of Intelligent Debugging. AI Communications 11, 1, 35–51, 1998. [14] Tsang, E. Foundations of Constraint Satisfaction. Academic Press, London, 1993. [15] V.Noord, G., and Gerdemann, D. Finite State Transducers with Predicates and Identities. Grammars 4, 3, 263–286, 2001.
An Unsupervised Method for Ranking Translation Words Using a Bilingual Dictionary and WordNet Kweon Yang Kim1, Se Young Park2, and Dong Kwon Hong3 1
School of Computer Engineering, Kyungil University, 33 Buho-ri, Hayang-up Kyungsan-si, Kyungsangpook-do, Korea [email protected] 82-53-850-7287 2 Dept. of Computer Engineering, Kyungpook National University, Korea [email protected] 82-53-950-5550 3 Dept. of Computer Engineering, Keimyung University, Korea [email protected] 82-53-580-5281
Abstract. In the context of machine translation, picking the correct translation for a target word among multiple candidates is an important process. In this paper, we propose an unsupervised method for ranking translation word selection for Korean verbs, relying only on a bilingual Korean-English dictionary and WordNet. We focus on deciding which translation of the verb target word is the most appropriate by using a measure of inter-word semantic relatedness, through five extended relations, between the possible translation pairs of the target verb and some indicative noun clues. In order to reduce the weight of possibly unwanted senses of the noun translation, we rank the weight of the possible senses for each noun translation word in advance. The evaluation shows that our method outperforms the default baseline performance and previous work. Moreover, this approach provides an alternative to supervised corpus-based approaches that rely on a large corpus of sense-annotated data.
1 Introduction The task of translation word selection in machine translation is the process of selecting the correct translation word for a given target word among multiple candidates. The resolution of sense ambiguity and the selection of the correct translation in non-restricted text are perhaps the central problem in the context of machine translation. Recently, corpus-based solutions for translation word selection have become quite popular. However, these supervised approaches for selecting translation words heavily rely on the availability of manually created sense-tagged or bilingual parallel corpora, which require much human effort. An alternative approach is to take advantage of the knowledge available in machine readable dictionaries [3, 8]. In this paper, we propose an unsupervised method of ranking translation word selection that attempts to select the correct translation word for Korean verbs using
the measure of semantic relatedness that the WordNet similarity package [12] provides. Knowledge sources are extracted from a bilingual Korean-English dictionary and WordNet [5]. From the Korean-English dictionary, we first extract the possible English translation words corresponding to each sense of the given Korean words. We focus on the semantic relatedness between possible translation words from the point of view of the target language context. However, when investigating the senses of possible translation words for a given source language word, we do not know which sense is the most appropriate for each translation word. Considering all possible senses for each translation word introduces senses that are unwanted from the point of view of the source language word, which may bring about an incorrect translation word selection. In order to avoid the application of unwanted senses for the noun translation, we rank the weight of the possible word senses for each translation word by measuring the semantic similarity in WordNet between the translation word and the near translation synonyms that are described in the bilingual dictionary. From WordNet, we exploit the knowledge with which the semantic relatedness between the senses of a translation word pair for a source language word pair is measured.
2 Previous Work Approaches based on statistical machine translation exploit bilingual parallel corpora or hand-crafted knowledge for selecting translation words [2]. However, as Koehn and Knight [9] pointed out, those knowledge sources are not easily available. To ease knowledge acquisition, Dagan and Itai [4] proposed an approach for resolving word sense ambiguity and selecting a translation word using statistical data from a monolingual corpus of the target language. They use statistics on lexical co-occurrence within syntactic relations in a monolingual corpus of the target language. Their method can relieve the knowledge acquisition problem by relying only on a monolingual corpus of the target language that is sense-untagged. However, their method is apt to select an inappropriate translation word because it is based only on the simple mapping information between a source word and its possible translation words, obtained from a bilingual dictionary and a sense-untagged target language corpus. When seeing an occurrence of a target language word, their method counts the word occurrence as a translation word of all the corresponding source language words. This can mislead the selection of the correct translation word because the translation word may itself have several senses from the point of view of the target language. The source of the problem is that this method considers unnecessary senses of translation words that have nothing to do with the senses of the original source language word. A possible solution is to use sense-tagged monolingual corpora or bilingual parallel corpora [9, 11]. However, knowledge sources like sense-tagged or parallel corpora are generally hard to come by because they require manual creation. Consider the translation of the Korean phrase "nonmun-ul ssu-da" (which means "write a paper"). In the phrase, the noun 'nonmun' has only one sense and has its corresponding
translation word paper in the Korean-English dictionary. However, the translation word paper has multiple senses in WordNet. For simplicity’s sake, we took into consideration only the first three senses of the translation word paper, which are explained in the following WordNet entries:
• paper#n#1, {paper} − a material made of cellulose pulp derived mainly from wood or rags or certain grasses
• paper#n#2, {composition, paper, report, theme} − an essay (especially one written as an assignment); "he got an A on his composition"
• paper#n#3, {newspaper, paper} − a daily or weekly publication on folded sheets; contains news and articles and advertisements; "he read his newspaper at breakfast"
In addition, the Korean verb ‘ssu-da’ has several translation words, such as write, put on, use, employ, and speak, each corresponding to a respective sense in the Korean-English dictionary. Table 1 shows the list of combination pairs of possible translation words for ‘nonmun’ and ‘ssu-da’. For this example, the method of Dagan and Itai finds the pairs of possible translation words for the source language words using a bilingual dictionary and estimates how likely each combination pair of translation words is to occur in a target language corpus. We find that the incorrect translation word use is selected for the given target word ‘ssu-da’ because the co-occurrence frequency of the pair use-paper dominates that of all other translation pairs in the target language corpus.

Table 1. Combination pair of translation words with its co-occurrence frequency in the target language corpus
pair of source words        pair of translation words        freq.
Noun         Verb           Verb         Noun
‘nonmun’     ‘ssu-da’       write        paper                 22
                            put on       paper                  0
                            use          paper                 47
                            employ       paper                  0
                            speak        paper                  0
                            …
Their method fails to select the correct translation word for the target word ‘ssu-da’ in the example “nonmun-ul ssu-da” because of the sense ambiguity of a translation word. In almost all co-occurrences of use and paper in the target language corpus, the translation word paper is used in its first WordNet sense, which is not the correct corresponding sense for the source language word ‘nonmun’. In consequence, this method selects the incorrect translation word pair use-paper instead of the correct pair write-paper.
3 Dictionary Based Approaches

Recently, corpus-based solutions for word sense disambiguation have become popular. In general, such approaches rely on the availability of a sense-tagged corpus, which requires manual creation. An alternative is dictionary-based approaches that take advantage of the information available in machine readable dictionaries or lexical databases. The Lesk algorithm [10] can be identified as the starting point of dictionary-based approaches. It does not require any sense-tagged training instances. This approach disambiguates the sense of a particular target word by comparing the dictionary glosses of its possible senses with those of its surrounding words in the context. The target word is assigned the sense whose gloss has the maximum overlap of words with the glosses of its surrounding words. The intuition behind this approach rests on two observations: the sense of the target word is semantically related to the surrounding words in context, and the overlaps of sense glosses reflect the degree of semantic relatedness between the senses of a target word and the senses of the surrounding words in context. Such connections indicate the senses in which these words are used, rather than arising by chance. The main problem of the Lesk algorithm is that dictionary glosses tend to be quite short and may not provide sufficient information to measure the relatedness between a given target word and the surrounding words in context. To overcome this limitation of the Lesk algorithm, Banerjee and Pedersen [1] recently presented a measure of extended gloss overlaps that exploits the rich network of word sense relations in WordNet rather than simply considering the glosses. They suggest an extended approach that exploits not only explicit relations through direct links but also implicit relations through indirect links that WordNet provides. In their experiments, they used seven relations of each word sense in WordNet: synset, hypernym, hyponym, holonym, meronym, troponym, and attribute. In effect, the glosses of the surrounding words in the context are expanded to include the glosses of these related words, to which they are semantically related through the extended relations in WordNet. This measure is not restricted to any specific part of speech and can be applied to measure relatedness between senses of any part of speech. Not all of the relations that WordNet provides are equally helpful, and the optimal selection of the best relations depends on the application of the overlap measure.
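To make the gloss overlap idea concrete, the following sketch (our own illustration, not code from the cited work) implements a simplified Lesk-style comparison using NLTK's WordNet interface; the function names and the use of NLTK are assumptions for illustration only.

from nltk.corpus import wordnet as wn

def tokens(text):
    return set(text.lower().split())

def lesk_overlap(target_word, context_word):
    # Score each sense of the target word by counting gloss-word overlaps
    # with the glosses of all senses of the context word.
    context_gloss_words = tokens(" ".join(s.definition() for s in wn.synsets(context_word)))
    best_sense, best_score = None, -1
    for sense in wn.synsets(target_word):
        score = len(tokens(sense.definition()) & context_gloss_words)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense, best_score

Because only the short glosses are compared (e.g. lesk_overlap("paper", "write")), the overlaps are often too sparse to be reliable, which is exactly the limitation that the extended gloss overlap measure addresses.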
4 Ranking Translation Word Selection

Our algorithm for ranking translation word selection is based on the hypothesis that the correct translation of a target word is highly related to the senses of the translation words of its surrounding clue words. The Korean-English dictionary and WordNet provide good information for mapping from a source language word to its possible translation words and from a translation word to its word senses. The algorithm first identifies a set of possible translation words for the target word and its surrounding clue words using a Korean-English bilingual dictionary.
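As a sketch of this first step (our own data layout, not the paper's file format), a dictionary lookup can be represented as a mapping from a Korean entry word to its senses, each holding a representative translation and its near synonyms, mirroring the kind of entries shown in Figure 1 below.

BILINGUAL_DICT = {
    "gisa": [
        {"translation": "engineer", "near_synonyms": ["technician"]},
        {"translation": "article", "near_synonyms": ["account", "news", "statement", "story", "report"]},
        {"translation": "rider", "near_synonyms": ["horseman", "equestrian"]},
    ],
    "nonmun": [
        {"translation": "paper", "near_synonyms": ["treatise", "dissertation", "thesis"]},
    ],
}

def candidate_translations(korean_word):
    # Possible English translation words (one per sense) for a Korean word.
    return [sense["translation"] for sense in BILINGUAL_DICT.get(korean_word, [])]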
Figure 1 shows part of the Korean-English bilingual dictionary. Such a bilingual dictionary generally classifies the meaning of an entry word into several senses and groups its translation words and near synonyms by each sense class.
技師  /gisa1/    an engineer; a technician
記事  /gisa2/    an article; an account; news; a statement; a story; a report
騎士  /gisa3/    a rider; a horseman; an equestrian
論文  /nonmun1/  a paper; a treatise; a dissertation; a thesis; “write a paper.”

Fig. 1. A part of the Korean-English dictionary
In the Korean-English dictionary, each Korean lexical entry word may have multiple senses, and each sense has corresponding translation words. In Figure 1, the Korean noun ‘gisa’ has three senses and three representative translation words, engineer, article, and rider, with the near synonyms technician, account, news, statement, story, report, horseman and equestrian according to each sense. The Korean noun ‘nonmun’ has only one sense, with one representative translation word paper and near synonyms such as treatise, dissertation, and thesis. Near synonyms [6] are words whose senses are close to that of the representative translation word of each sense. In addition to the sense ambiguity of the source language word, the translation word may have multiple senses in WordNet. For example, the translation word paper, which is the corresponding equivalent of the source word ‘nonmun’, has three different noun senses, paper#n#1, paper#n#2, and paper#n#3, as described in Section 2. In order to reduce the weight of possibly unwanted senses of the noun translation word, which may bring about an incorrect translation word selection, our algorithm needs to decide which sense of the representative translation word is more appropriate for the source language word before we start to compute the score of semantic relatedness between combination pairs of possible senses. Let us suppose that the k-th clue word in a given source language sentence is t_k and that the l-th translation of t_k is t_k^l. The weight of the m-th sense t_k^{lm} of translation word t_k^l is computed as follows:
w(t_k^{lm}) = \frac{\sum_i \max_j sim(t_k^{lm}, NS_{ij}(t_k^l))}{\sum_m \sum_i \max_j sim(t_k^{lm}, NS_{ij}(t_k^l))}

where NS_{ij}(t_k^l) denotes the j-th WordNet sense of the i-th near synonym of t_k^l.
The weight of sense t_k^{lm} is obtained by computing the semantic similarity between the possible senses of the representative translation word and the near synonyms (NS) of t_k^l. To obtain the semantic similarity between two senses, we use the measure proposed by Jiang and Conrath, which combines the information content of each sense and of their most specific subsumer, the concept that captures the largest set of commonalities between
each sense [7]. For the word paper, which is the translation of the source word ‘nonmun’, the sense paper#n#2 has a higher weight than the sense paper#n#1, because paper#n#2 is more semantically similar to the near synonyms of the translation word paper, such as treatise, dissertation, and thesis. After the weight of each sense is determined, our algorithm computes the score of semantic relatedness between combination pairs of possible senses. The algorithm selects the translation word of the target word that has the maximum relatedness using the following equation:
\arg\max_i \sum_k \max_{j,l,m} related(t_0^{ij}, t_k^{lm}) \cdot w(t_k^{lm})
This equation computes a relatedness score for each translation word t_0^i of the target word t_0, where t_0^{ij} is the j-th sense of t_0^i and t_k^{lm} is the m-th sense of translation word t_k^l of a surrounding clue word t_k. Two senses are considered to be semantically related when they share a set of words that occur in their respective glosses. Counting overlaps between gloss words alone is unlikely to capture relatedness, because glosses are too short to yield a reliable measure. However, if we consider the glosses and usage examples of all the senses that are connected directly and indirectly to a sense by extended relations, we obtain a much larger set of words with which to measure relatedness. Rather than simply considering the glosses of the surrounding words in the context, the concept hierarchy of WordNet is exploited to allow the glosses of word senses related to the context words to be compared. WordNet is a broad-coverage, machine readable lexical database widely used in natural language processing. The wide availability of WordNet has led to the development of a number of approaches to the word sense disambiguation problem. Each sense of a word is mapped to a synset in WordNet, and all synsets in WordNet are linked by a variety of semantic relations. In addition to the glosses associated with each sense, WordNet provides many example usages that play the role of sense-tagged training instances. On their own, these usage examples and glosses are short, but their number grows much larger when extended through the set of relations that WordNet provides. In our experiments, we adopt the following extended relations that WordNet provides. Figure 2 shows the five extended relations between noun and verb that we employ.

/* noun–verb relation */
syns-syns; syns-glos; syns-exam
glos-syns; glos-glos; glos-exam
exam-syns; exam-glos; exam-exam

/* coordinate(noun)–verb relation */
syns(coor)-syns; syns(coor)-glos; syns(coor)-exam
glos(coor)-syns; glos(coor)-glos; glos(coor)-exam
exam(coor)-syns; exam(coor)-glos; exam(coor)-exam
/* noun–coordinate(verb) relation */
syns-syns(coor); glos-syns(coor); exam-syns(coor)
syns-glos(coor); glos-glos(coor); exam-glos(coor)
syns-exam(coor); glos-exam(coor); exam-exam(coor)

/* hyponyms(noun)–verb relation */
syns(hypo)-syns; syns(hypo)-glos; syns(hypo)-exam
glos(hypo)-syns; glos(hypo)-glos; glos(hypo)-exam
exam(hypo)-syns; exam(hypo)-glos; exam(hypo)-exam

/* noun–troponyms(verb) relation */
syns-syns(trop); glos-syns(trop); exam-syns(trop)
syns-glos(trop); glos-glos(trop); exam-glos(trop)
syns-exam(trop); glos-exam(trop); exam-exam(trop)

Fig. 2. Extended relations (syns: synsets, glos: definitions, exam: examples, coor: coordinate, hypo: hyponyms, trop: troponyms)
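The sense-weighting step described above can be sketched as follows (our own illustration, not the authors' implementation, which uses the Perl WordNet::Similarity package [12]); we assume NLTK's WordNet interface and the Brown information-content file as stand-ins, and the function names are ours.

from nltk.corpus import wordnet as wn, wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')

def sense_weights(translation_word, near_synonyms):
    # Weight each noun sense of the translation word by its maximum
    # Jiang-Conrath similarity to the senses of its near synonyms,
    # normalised over all senses of the translation word.
    senses = wn.synsets(translation_word, pos=wn.NOUN)
    raw_scores = []
    for sense in senses:
        total = 0.0
        for ns in near_synonyms:
            sims = [sense.jcn_similarity(s, brown_ic)
                    for s in wn.synsets(ns, pos=wn.NOUN)]
            total += max(sims) if sims else 0.0
        raw_scores.append(total)
    norm = sum(raw_scores) or 1.0
    return {sense: score / norm for sense, score in zip(senses, raw_scores)}

For example, sense_weights("paper", ["treatise", "dissertation", "thesis"]) should give the 'essay' sense paper#n#2 a higher weight than the 'material' sense paper#n#1, as discussed above.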
While Pedersen and Banerjee used the immediately surrounding context words, we choose the most relevant words, which provide much more indicative clues for selecting the correct translation word. The focus in much recent work is on local collocation knowledge, including a variety of distance and syntactic relations. For English tasks, local collocation knowledge such as n-grams is expected to provide important evidence for resolving sense ambiguity because English is a fixed-order language. However, Korean is a partially free-order language, and therefore the ordering of the words surrounding the ambiguous word does not provide significantly meaningful information for resolving sense ambiguity in Korean. Korean has some particularities: plenty of inflectional verb endings, postpositions instead of prepositions, and so on. The postpositions and verb endings represent syntactic relations such as predicate-argument and modifier-modified relations. We deal with three major predicate-argument relations, the verb-object, verb-locative, and verb-instrument relations, which construct complement-head structures. The postpositions attached to the noun are either ‘ul’/‘rul’, ‘ae’, or ‘ro’/‘uro’, which usually represent the syntactic relations between the noun and the verb: object, location, and instrument respectively. We focus on the selection of translation words for Korean transitive verbs. Therefore we decided to adopt the three major predicate-argument relations noun-‘ul’/’rul’ (object) + verb, noun-‘ae’ (locative) + verb, and noun-‘ro’/’uro’ (instrument) + verb, rather than the immediately surrounding words, to improve the performance of translation word selection. In order to extract syntactic features, we would need a robust syntactic parser. However, the accuracy of current parsers is not high. Therefore we adopt a simple partial parser that uses only a part-of-speech tagger, without robust parsing, to identify syntactic relations such as “the first noun with postposition ‘ul’/‘rul’, ‘ae’, or
‘ro’/‘uro’ to the left or right of the target word in context”. This provides a better approximation of syntactic relations than simply requiring a limited distance between surrounding words in the context, while relying only on a part-of-speech tagger, which is simpler and more readily available than a robust syntactic parser. Sometimes it is hard to identify such a syntactic relation in the case of an embedded sentence. For example, the sentence “gang-i bumram-ha-nun gut-ul mak-da” (keep the river from overflowing) has the first noun ‘gut-ul’ (ING) for the verb ‘mak-da’. In this case, the algorithm extracts the nominal form ‘bumram’/overflow of the preceding verb ‘bumram-ha’/overflow as the verb-object relation.
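The following sketch (our own, with hypothetical tag names and a hypothetical token representation) illustrates how such a clue noun could be picked from a part-of-speech tagged sentence without a full parse.

OBJECT = {"ul", "rul"}
LOCATIVE = {"ae"}
INSTRUMENT = {"ro", "uro"}

def clue_noun(tagged_tokens, verb_index):
    # tagged_tokens: list of (surface, pos, postposition-or-None) triples.
    # Return the first noun to the left of the target verb that carries an
    # object, locative, or instrument postposition, with its relation label.
    for surface, pos, post in reversed(tagged_tokens[:verb_index]):
        if pos == "NOUN" and post in OBJECT | LOCATIVE | INSTRUMENT:
            relation = ("object" if post in OBJECT
                        else "locative" if post in LOCATIVE
                        else "instrument")
            return surface, relation
    return None, None

For “nonmun-ul ssu-da”, the noun ‘nonmun’ with postposition ‘ul’ would be returned as the object clue for the verb ‘ssu-da’.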
5 Experiments and Results

In order to evaluate our algorithm for ranking translation word selection, the 1.4-million-word Gemong Korean encyclopedia is used as the test data set. Our experiments are performed on ten Korean transitive verbs, ‘mak-da’, ‘ssu-da’, ‘seu-da’, ‘jap-da’, ‘mandul-da’, ‘nanu-da’, ‘mut-da’, ‘bat-da’, ‘ut-da’, and ‘jit-da’, which appear with high frequency and high ambiguity in our experimental corpus. This data set consists of 12,513 instances, each of which contains a sentence with a target verb to be translated. In order to compare our results with a human-created answer, human experts annotated these instances manually with the correct translation words in advance.

Table 2. Accuracy of translation word selection
Target Verbs    Instances   Senses   Accuracy(%)
                                     Baseline   Lesk   P&B   RTW
‘mak-da’              515        7         27     18    38    52
‘ssu-da’             2084        9         21     15    41    64
‘seu-da’              933       10         40     22    40    47
‘jap-da’              560        8         32     20    37    46
‘mandul-da’          3573       10         29     18    42    49
‘nanu-da’             734        6         45     17    48    63
‘mut-da’              585        7         35     18    25    47
‘bat-da’             1722       19         38     21    35    49
‘ut-da’               881       11         30     17    40    57
‘jit-da’              926       13         28     15    36    46
Average              1251       10         33     18    38    52
We define the accuracy for our evaluation task as the number of correct translations over the number of answers. The results of the experiment are summarized in Table 2. The performance of our algorithm was compared with the baseline performance (Baseline), which is achieved by assigning all occurrences to the most frequent
translation word. The experimental results show that our algorithm (RTW: Ranking Translation Word) performs better than the default baseline (Baseline), the original Lesk algorithm (Lesk), and Pedersen and Banerjee’s measure (P&B).
6 Conclusions

In this paper we presented an unsupervised method for ranking translation words based on a bilingual dictionary and WordNet, which are easily obtainable knowledge sources. The translation words are selected by measuring semantic relatedness through the five extended relations between the possible translations of the target word and those of the surrounding clue words in the source language text. Considering all possible senses of each translation word introduces senses that are unwanted from the viewpoint of the source language word and may bring about an incorrect translation word selection. In order to avoid applying unwanted senses, we first weight the possible word senses of each translation word by measuring semantic similarity in WordNet between the translation word and the near synonyms described in a bilingual dictionary. The evaluation suggests that our approach performs better than the other approaches, including the baseline. We report an average accuracy of 52% on ten ambiguous Korean verbs. While the results are generally lower than those of supervised approaches, which require a great deal of human effort, they are significant because our approach relies only on a bilingual dictionary and a target language lexical database, and therefore provides an alternative to supervised approaches that rely on a large corpus of sense-annotated data.
Acknowledgements

This work has been supported by grant number R01-2003-000-10001-0 from the Basic Research Program of the Korea Science and Engineering Foundation.
References

1. Banerjee, S. and Pedersen, T.: Extended Gloss Overlaps as a Measure of Semantic Relatedness, Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence (2003) 805-810.
2. Brown, P., Cocke, J., Pietra, V., Pietra, S., Jelinek, F., Lafferty, J., Mercer, R. and Roosin, P.: A Statistical Approach to Machine Translation, Computational Linguistics, Vol. 16, No. 2 (1990) 79-85.
3. Copestake, A., Briscoe, T., Vossen, P., Ageno, A., Casteloon, I., Ribas, G., Rigau, G., Rodriguez, H., and Samiotou, A.: Acquisition of Lexical Translation Relations from MRDs, Machine Translation, Vol. 9, No. 3 (1995) 183-219.
4. Dagan, I. and Itai, A.: Word Sense Disambiguation Using a Second Language Monolingual Corpus, Computational Linguistics, Vol. 20, No. 4 (1994) 563-596.
5. Fellbaum, C.: WordNet: An Electronic Lexical Database, MIT Press (1998).
6. Inkpen, D. and Hirst, G.: Automatic Sense Disambiguation of the Near-Synonyms in a Dictionary Entry, Proceedings, Fourth Conference on Intelligent Text Processing and Computational Linguistics (2003) 258-267.
7. Jiang, J. and Conrath, D.: Semantic Similarity based on Corpus Statistics and Lexicon Taxonomy, Proceedings of the International Conference on Research in Computational Linguistics (1997) 19-33.
8. Klavans, J. and Tzoukermann, E.: Combining Corpus and Machine Readable Dictionary Data for Building Bilingual Lexicons, Machine Translation, Vol. 10, No. 3 (1996) 1-34.
9. Koehn, P. and Knight, K.: Knowledge Sources for Word-Level Translation Models, Empirical Methods in Natural Language Processing Conference (2001) 27-35.
10. Lesk, M.: Automatic Sense Disambiguation Using Machine Readable Dictionaries: how to tell a pine cone from an ice cream cone, Proceedings of the Fifth Annual International Conference on Systems Documentation (1986) 24-26.
11. Li, H. and Li, C.: Word Translation Disambiguation Using Bilingual Bootstrapping, Computational Linguistics, Vol. 30, No. 1 (2004) 1-22.
12. Patwardhan, S. and Pedersen, T.: The CPAN WordNet::Similarity package, http://search.cpan.org/~sid/WordNet-Similarity-0.06/ (2005).
Neuro-fuzzy Learning for Automated Incident Detection

M. Viswanathan, S.H. Lee, and Y.K. Yang

College of Software, Kyungwon University, Bokjeong-Dong, Sujung-Gu, Seongnam-Si, Kyunggi-Do, South Korea 461-701
{murli, ykyang}@kyungwon.ac.kr
Abstract. Traffic incidents such as vehicle accidents, weather and construction works are a major cause of congestion. Incident detection is thus an important function in freeway and arterial traffic management systems. Most of the large scale and operational incident detection systems make use of data collected from inductive loop detectors. Several new approaches, such as probe vehicles and video image processing tools, have recently been proposed and demonstrated. This research aims at model development for automatic incident detection and travel time estimation employing neuro-fuzzy techniques. As a first step, in this paper we develop an initial model for incident detection using a standard neuro-fuzzy algorithm. In subsequent development we propose a model where the fuzzy rules are themselves extracted from the data using an associative data mining algorithm. The results of the initial experiments suggest that the proposed model has plausible incident detection rates and low false alarm rates. The test results also suggest that the proposed model enhances accuracy of incident detection in an arterial road and can be expected to contribute to formal traffic policy. Keywords: Neuro-Fuzzy Learning; Automated Incident Detection; Intelligent Traveler Systems; Association Rule Mining.
1 Introduction

Incidents are non-recurring events such as accidents, disabled vehicles, spilled loads, temporary maintenance and construction activities, signal and detector malfunctions, and other special and unusual events that disrupt the normal flow of traffic and cause motorist delay. Incident detection is an important function in freeway and arterial traffic management systems. The focus of traffic policy has also recently shifted from traffic management towards offering advanced information services to users. Users want to receive correct information about the status of road traffic all the way to their destinations. For these reasons, numerous investigations address travel time estimation, incident detection and data fusion. Among these topics, we study incident detection models. When the measured data from traffic observations is applied directly to incident detection, the detection produces many false alarms. This is because the data may contain noise, sometimes due to traffic detector breakdowns. Furthermore, the detectors measure different types of data, and there can be data from many sources. Therefore, measured
data must go through a data fusion process, which aims to merge information from different sources and reduce false alarms. This research uses spot speed data from which outliers of measurement have been removed by data fusion [1]. This research develops an initial model for incident detection using the ANFIS (adaptive-network-based fuzzy inference system) neuro-fuzzy algorithm. The fuzzy rules incorporated into the ANFIS algorithm are extracted by a careful analysis of the traffic data, which consist of vehicle speed measurements. A further model for the automatic extraction of fuzzy rules is also proposed. From the experimental analysis we observe high incident detection rates and low false alarm rates for the current system.
2 Background

The plethora of incident detection algorithms can be grouped into algorithms based on pattern recognition and statistical methods, algorithms using traffic models and theory, and other hybrid state-of-the-art incident detection methods based on neural networks and fuzzy methods. Among the well-known pattern recognition algorithms are the California algorithm [11, 14, 15], which detects incidents using the occupancy rate at specified times, the APID algorithm [12, 13, 14, 15], which compares occupancy between two detector areas, as well as wave analysis, PATREG, and the MONICA algorithm. Statistical algorithms include the SND and Bayesian algorithms [13, 14], which use the probability of incident occurrence inferred from blocked crossroads in downstream traffic. Algorithms using traffic models and theory include dynamic algorithms, which assume that sudden changes in the freeway system follow predictable patterns due to the integrated relation between speed-density and traffic-density in detection analysis. Additionally, the (modified) McMaster algorithm, based on catastrophe theory, uses traffic theory to explain a "jump" in speed in the operation of a freeway that shows little change in occupancy. The final group of hybrid algorithms includes fuzzy set algorithms, neural network algorithms, probabilistic neural networks and automatic vehicle identification systems [11, 14]. Fuzzy algorithms use the reliability and unreliability of deduction as a guiding principle. When the data is unreliable and noisy, as in the case of the incident decision, fuzzy algorithms find the most plausible decision within the boundaries defined by the fuzzy membership functions. Fuzzy membership, differing from ‘crisp’ 0 and 1 decisions, represents a range of occurrence or likelihood in respect of incidents [2]. The fuzzy rules and fuzzy sets regarding the output value of the fuzzy controller are dependent and complementary. Fuzzy if-then rules or fuzzy conditional statements are expressions of the form IF A THEN B, where A and B are labels of fuzzy sets characterized by appropriate membership functions [6].
3 Adaptive-Network-Based Fuzzy Inference System

ANFIS (Adaptive-Network-Based Fuzzy Inference System) selects membership function parameters automatically and is thus a fuzzy inference method adapted to data modeling. Given an input/output data set, it constructs a fuzzy inference system that tunes the membership function parameters using back-propagation or a combination of least-squares estimation and the back-propagation algorithm. In the architectural framework, we assume that there are two inputs x and y and one output z. We use occupancy and spot speed as input data in the proposed model. ANFIS is an adaptive network that works like an adaptive network simulator of Takagi-Sugeno fuzzy if-then rules. This adaptive network has a predefined topology, described below.

Fig. 1. Structure of Basic ANFIS (layer 1: membership functions A1, A2, B1, B2 of the inputs x and y; layer 2: product nodes Π; layer 3: normalization nodes N; layer 4: weighted rule consequents; layer 5: summation node Σ producing the output f)

Layer 1: each node of this layer is adaptive and its function is:
O_i^1 = \mu_{A_i}(x)    (1)

where x is the input to node i and O_i^1 is its output, which represents the membership degree of the input in the corresponding linguistic label. Usually the bell membership function is used:

\mu_{A_i}(x) = \frac{1}{1 + \left[ \left( \frac{x - c_i}{a_i} \right)^2 \right]^{b_i}}    (2)

where a_i, b_i and c_i form the parameter set; these are the premise parameters.

Layer 2: nodes in this layer are not adaptive; they simply multiply the incoming signals. Their outputs represent the firing strength of a rule.

w_i = \mu_{A_i}(x) \times \mu_{B_i}(y), \quad i = 1, 2    (3)

Layer 3: nodes in this layer are not adaptive; they calculate the ratio of the rule firing strengths as follows:

\bar{w}_i = \frac{w_i}{w_1 + w_2}, \quad i = 1, 2    (4)

Layer 4: nodes in this layer are adaptive and their transfer function is as follows:

O_i^4 = \bar{w}_i f_i = \bar{w}_i (p_i x + q_i y + r_i)    (5)

where p_i, q_i and r_i are the consequent parameters.

Layer 5: nodes in this layer are non-adaptive and simply sum their inputs as follows:

O_1^5 = \text{overall output} = \sum_i \bar{w}_i f_i = \frac{\sum_i w_i f_i}{\sum_i w_i}    (6)
We apply ANFIS to our proposed model that is used for incident detection.
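The following sketch (our own illustration, not the authors' implementation) shows a single forward pass through the two-input, two-rule ANFIS network described by equations (1)-(6); the parameter values used here are placeholders.

def bell(x, a, b, c):
    # Generalized bell membership function, equation (2).
    return 1.0 / (1.0 + (((x - c) / a) ** 2) ** b)

def anfis_forward(x, y, premise, consequent):
    # Layer 1: membership degrees of the inputs, equation (1)
    mu_A = [bell(x, *premise["A"][i]) for i in range(2)]
    mu_B = [bell(y, *premise["B"][i]) for i in range(2)]
    # Layer 2: rule firing strengths, equation (3)
    w = [mu_A[i] * mu_B[i] for i in range(2)]
    # Layer 3: normalized firing strengths, equation (4)
    w_bar = [wi / sum(w) for wi in w]
    # Layers 4 and 5: weighted Sugeno consequents and their sum, equations (5)-(6)
    f = [p * x + q * y + r for (p, q, r) in consequent]
    return sum(wb * fi for wb, fi in zip(w_bar, f))

premise = {"A": [(2.0, 2.0, 1.0), (2.0, 2.0, 5.0)],
           "B": [(2.0, 2.0, 1.0), (2.0, 2.0, 5.0)]}
consequent = [(0.1, 0.2, 0.0), (0.3, -0.1, 1.0)]
print(anfis_forward(3.0, 4.0, premise, consequent))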
4 Generating Fuzzy Classification Rules

As described in the previous section, the ANFIS system is based on neuro-fuzzy principles. The application of fuzzy if-then rules also improves the interpretability of the results and provides more insight into the classifier structure and the decision-making process. Therefore we focus on the extraction of fuzzy rule-based classifiers from labeled data describing traffic volumes and other information. Figure 2 shows an example of the fuzzy membership functions automatically generated by the ANFIS system. However, the accuracy of the fuzzy rules produced by this method is low. Therefore, we suggest using an iterative approach for developing fuzzy classifiers, as described in [18]. The initial model is derived from the data; subsequently, feature selection and rule-base simplification are applied to reduce the model, while a genetic algorithm is used for parameter optimization.
Fig. 2. Fuzzy Rules Generated by ANFIS
It is shown in [18] that compact, accurate and interpretable fuzzy rule-based classifiers can be obtained from labeled observation data in an iterative approach. First, an initial model is derived from the observation data. Secondly, feature selection and rule-base simplification methods are applied to reduce the initial model. Finally, a real-coded genetic algorithm (GA) is applied to optimize the model's parameters in
order to improve the classification accuracy. The GA can also be applied in a loop together with rule-base simplification. In this case, the GA uses a multi-criterion objective to search not only for model accuracy but also for model redundancy. This redundancy is then exploited to reduce and simplify the rule base. Finally, the GA is applied with a multi-criterion function where the redundancy is suppressed in order to make the rules more distinguishable while preserving the accuracy. The result is a compact fuzzy rule base of low complexity with high classification accuracy.
5 Neuro-fuzzy Algorithm for Incident Detection

The aim of this paper is to adapt the ANFIS algorithm to detect incidents in a scenario where the available input data consists of speed measurements from moving probe vehicles. Therefore, in our case, we use ANFIS as the neuro-fuzzy algorithm and input data based on vehicle speed. The speed data is pre-processed to obtain the original speed values and the rate of increase or decrease in speed, calculated by comparing the current speed with past speeds. After calculating the speed rate (increase or decrease in speed) and the spot speed, we use this data as input and detect incidents through ANFIS using the filtered data. Figure 3 illustrates this process from data collection to incident detection. Fuzzy rules are constructed as follows:

IF AVGSPD is Small and PDSPD is Very Small then output is Normal
IF AVGSPD is Small and PDSPD is Small then output is Normal
……
IF AVGSPD is Very Large and PDSPD is Large then output is Normal
Fig. 3. Architecture of Incident Detection System
First, the two input values, the rate of change of the current speed and the average speed in each observation, are given to the system. For each value the membership degree is calculated by the fuzzy membership function, and these values are passed into the neural network according to the fuzzy rules. The neural network multiplies the values of all rules by their weights, and the resulting values, after addition, are sent to the output layer. These values undergo de-fuzzification by the fuzzy membership function, and the final output value is obtained. The structure of the adapted ANFIS algorithm is shown in Figure 4. Once the error rate of the model reaches 0.01 or the learning has gone through 1350 epochs, the process is terminated.
Fig. 4. Structure of ANFIS
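A skeleton of this training procedure is sketched below (our own simplification, not the authors' code); model.forward and model.update_parameters are hypothetical stand-ins for the ANFIS forward pass and its hybrid least-squares/back-propagation update.

def train_anfis(model, data, target_error=0.01, max_epochs=1350):
    # data: list of ((avg_speed, speed_change), label) pairs.
    # Stop once the mean squared error reaches 0.01 or after 1350 epochs,
    # as stated above.
    error = float("inf")
    for epoch in range(max_epochs):
        error = 0.0
        for (avg_speed, speed_change), label in data:
            prediction = model.forward(avg_speed, speed_change)
            error += (label - prediction) ** 2
            model.update_parameters(avg_speed, speed_change, label)
        error /= len(data)
        if error <= target_error:
            break
    return model, error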
6 Performance Evaluation of Proposed Model Performance of the proposed model is evaluated by using the detection rate and false alarm rate of incident detection. For the empirical analysis we use the spot speed as input data as mentioned earlier. The expressions for the incident detection rate (DR) and false alarm rate (FAR) are as follows:
DR = t / (t + t′)    FAR = f / (t × s)    (7)

where t is the number of real incidents detected, t′ is the number of incidents that were not detected, f is the number of false alarms during incident detection, and s is the number of spots. Figure 5 shows the result of learning using the speed and the rate of increase or decrease. Blue circles represent target test values and the red stars represent the original input data. Figure 5 shows successful incident detection in all but 8 cases when using a data set of 1096 input values. We repeated this experiment with similarly-sized datasets containing traffic speed information for 30 days. Based on the empirical analysis, the overall average detection rate is found to be 92.72% and the false alarm rate is approximately 11.54%.
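A small helper (ours, for illustration) computes these two measures exactly as defined in equation (7).

def detection_rate(detected, undetected):
    # DR = t / (t + t')
    return detected / (detected + undetected)

def false_alarm_rate(false_alarms, detected, spots):
    # FAR = f / (t * s)
    return false_alarms / (detected * spots)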
Fig. 5. Result of Learning
Fig. 6. Error rate in ANFIS
The error is high initially and, as seen in Figure 6, decreases gradually with the learning volume over many learning epochs. It is important to remember that the test data is based on a single information source, GPS-based probe vehicles, and therefore it would be optimal to fuse this data with incident information from other sources such as fixed detectors. This will be one of our future enhancements to this research work.
7 Summary

This research aimed to develop and evaluate a model for incident detection by adapting a neuro-fuzzy algorithm. The available inputs to this system were the spot speed data obtained from a probe car equipped with GPS devices and the change in speed calculated by comparing current and past data. These inputs were given to the ANFIS algorithm, which detected the occurrence of an incident. We also propose a method for the automatic extraction of fuzzy rules. As a result of performance testing, the detection rate (DR) of this model was 92.72% and the false alarm rate (FAR) was around 11.54%. These results suggest that the model is robust and performs well. The model showed a high error rate for irregular patterns of input data but a low error rate for regular patterns. In the future, we would like to enhance the inner structure of the ANFIS algorithm using fuzzy membership functions extracted automatically from the data and to include other input measurements in the analysis of traffic flow patterns.
Acknowledgement This research was supported by the MIC (Ministry of Information and Communication), Korea, under the ITRC (Information Technology Research Center) support program supervised by the IITA (Institute of Information Technology Assessment).
References

1. Keechoo Choi, YounShik Chung, "A Data Fusion Algorithm For Estimating Link Travel Time", Intelligent Transportation Systems Journal, Vol. 7, No. 3-4, 2002, pp. 235-260.
2. Ding Zheng, Wolfgang Kainz, "Fuzzy Rule Extraction from GIS Data with a Neural Fuzzy System for Decision Making", Proceedings of the 7th ACM International Symposium on Advances in Geographic Information Systems, 1999, pp. 79-84.
3. Roger Jang, "ANFIS: Adaptive-network-based fuzzy inference systems", IEEE Trans. on Systems, Man and Cybernetics, Vol. 23, No. 3, May/June 1993, pp. 665-685.
4. M. Sugeno and T. Yasukawa, "A Fuzzy logic based approach to qualitative modeling", IEEE Trans. on Fuzzy Systems, Vol. 1, No. 1, 1993, pp. 7-31.
5. M. Kubat, "Decision trees can initialize radial basis function networks", IEEE Trans. on Neural Networks, Vol. 9, No. 5, 1998, pp. 813-821.
6. T. Takagi and M. Sugeno, "Fuzzy identification of systems and its applications to modeling and control", IEEE Trans. Syst., Man, Cybern., Vol. 15, 1985, pp. 116-132.
7. H. Takagi and I. Hayashi, "NN-driven Fuzzy Reasoning", Int. J. of Approximate Reasoning, Vol. 5, No. 3, 1991, pp. 191-212.
8. Takeshi Yamakawa, "A Neo Fuzzy Neuron and Its Applications to System Identification and Prediction of the System Behavior", Proceedings of the 2nd International Conference on Fuzzy Logic & Neural Networks, 1992, pp. 477-483.
9. L. A. Zadeh, "Fuzzy Logic", IEEE Computer Society, Vol. 21, 1988, pp. 83-93.
10. Forbes, G. J., "Identifying Incident Congestion", ITE Journal, Vol. 62, No. 6, 1992, pp. 17-22.
11. Dipti Srinivasan, Ruey Long Cheu, Young Peng Poh and Albert Kim Chwee Ng, "Development of an intelligent technique for traffic network incident detection", Engineering Applications of Artificial Intelligence, Vol. 13, Issue 3, 1 June 2000, pp. 311-322.
12. Dipti Srinivasan, Xin Jin and Ruey Long Cheu, "Adaptive neural network models for automatic incident detection on freeways", Neurocomputing, Vol. 64, March 2005, pp. 473-496.
13. N. E. Thomas, "Multi-state and multi-sensor incident detection systems for arterial streets", Transportation Research Part C: Emerging Technologies, Vol. 6, Issues 5-6, 10 December 1998, pp. 337-357.
14. Jiuh-Biing Sheu, "A fuzzy clustering-based approach to automatic freeway incident detection and characterization", Fuzzy Sets and Systems, Vol. 128, Issue 3, 16 June 2002, pp. 377-388.
15. Jiuh-Biing Sheu, "A sequential detection approach to real-time freeway incident detection and characterization", European Journal of Operational Research, Vol. 157, Issue 2, 1 September 2004, pp. 471-485.
Kwon, S.W., Lee, K.J. and Vizzini, A.J., "Modeling of Fire-Exposed Composite Structures", Proceedings of the 18th Annual Technical Conference of the American Society for Composites, Gainesville, Florida, October 2003.
16. Lee, K. and Cho, C., "A Geometrically Nonlinear Nine-Node Solid Shell Element Formulation with Reduced Sensitivity to Mesh Distortion", Computer Modeling in Engineering and Sciences, Vol. 3, No. 3, pp. 339-350, 2002.
17. Sun, C.T., Mechanics of Aircraft Structures, John Wiley & Sons, Inc, 1988.
18. Johannes A. Roubos, Magne Setnes and Janos Abonyi, "Learning fuzzy classification rules from labeled data", Information Sciences, Vol. 150, Issues 1-2, March 2003, pp. 77-93.
Intelligent GIS: Automatic Generation of Qualitative Spatial Information

Jimmy A. Lee and Jane Brennan

University of Technology, Sydney, FIT, P.O. Box 123, Broadway NSW 2007, Australia
[email protected]
Abstract. This paper reports on an extension to Geographic Information Systems (GIS) that can intelligently analyse and record qualitative information about the surrounding area when adding a feature to a map. This recorded qualitative spatial information can be utilised to perform queries such as path generation using landmarks. Although there is a lot of research on qualitative spatial reasoning, none of the currently available GIS actually incorporate this kind of functionality. There have been systems developed that do have functions for adding new features or generating paths; however, they do not generally analyse and record, or use, qualitative information. We have implemented a prototype illustrating our approach. Keywords: Intelligent Systems, Spatial Reasoning, Knowledge Representation, GIS.
1 Introduction

Most Geographic Information System (GIS) computations are quantitative, relying on visualisation tools and user interaction to provide the qualitative insights that users need. Although quantitative representations do allow powerful and efficient calculations, they fall short when users lack information about, for example, the exact positioning or the geometry of the features involved. It has been recognised that a quantitative approach is an inappropriate representation of human cognition and spatial reasoning (e.g. [7]). Most human reasoning about physical space appears to use a qualitative interpretation of that space. Therefore, purely quantitative methods are inadequate, and qualitative methods are essential to effectively represent spatial information. In recent years, there have been various proposals for spatial descriptions which are qualitative rather than quantitative, to overcome the problem of imprecise spatial information (see [3] for an overview). However, practical applications applying the results of these research efforts are rare. This paper reports on the development of an extension to a GIS. We examined several aspects of GIS functionality in terms of adding qualitative reasoning capabilities to these systems, and included qualitative feature adding and path generation in our GIS extension. When a new feature is added to the map, the system will automatically analyse and record the qualitative information of its surrounding area. This recorded qualitative spatial information will be utilised to perform intelligent queries such as path generation using landmarks. Although there is a lot of research on qualitative spatial reasoning (QSR), none of the currently
available GIS do actually incorporate this. There have been systems developed that do have functions for adding new features, or generating paths; however they do not generally analyse and record, or use, qualitative data/information. A system that did generate qualitative path descriptions is the GRAAD system by Moulin and Kettani [9]. It provides an integrated logical and analogical framework for reasoning about time and space, while generating and describing the paths of a virtual pedestrian in a virtual environment. The analogical framework based on the notion of spatial conceptual map (SCM) specifies a spatial environment; and the logical framework based on a possible worlds approach carries out temporal reasoning and generates plans that are used to compute the virtual pedestrian’s displacements in the SCM. It also generates natural language descriptions of the paths. In contrast, our system uses “real world” GIS data and also allows for adding features with qualitative spatial information inferred by the system.
2 Knowledge Representation in GIS

Data in GIS is generally represented in raster or vector format. These files are usually stored in a database together with associated information in the database table. Data in the table can be categorised into two types: quantitative information such as “20” or “40.5” and qualitative information such as “front” or “back”. Users can query this information to retrieve particular details such as “which school is on the left of this traffic light”. Our extension to GIS allows for qualitative inferences for added features and qualitative path generation. Many of the existing GIS do have functions to add features, and some also allow for path generation. We reviewed several GIS and found that while all of them had functions to add features, none of them incorporate an intelligent agent to automatically record qualitative information about the surrounding area into the database table. Users would therefore have to enter this information manually. Some systems such as OpenMap® and GRASS® do not even have database tables, as they store data using Java methods. The MapObject® extension of the ArcGIS® package does have a path finding function using shortest path methods. None of the other systems have path generation available. In TakukGIS® and AccuGlobe®, data can also be stored in the GIS database and represented in a database table, and there are create/edit feature functions.

2.1 Qualitative Spatial Reasoning in GIS

Research in QSR is motivated by a wide variety of possible application areas such as GIS [5], robotic navigation [8], spatial propositional semantics of natural languages [1, 10], engineering design such as qualitative kinematics [6] and many others. As Cohn [2] points out: “The principal goal of qualitative reasoning is to represent not only our everyday commonsense knowledge about the physical world, but also the underlying abstractions used by engineers and scientists when they create quantitative models.” (p. 323) However, these findings are rarely used in practical applications. The extension
we are bringing forward in this paper is, to our knowledge, the first one to actually add functionality to a GIS that enables the addition of new features coupled with the inference of qualitative spatial information about the surrounding area.

2.2 AccuGlobe®

AccuGlobe® [4] is an ideal GIS software package for prototype implementation, as it is the only user-friendly free GIS software that has create/edit feature functions. The software includes functionality such as zooming, panning, measuring, selecting features, drawing objects, etc. Information, either quantitative or qualitative, is stored in a database table. However, when new features are added to the map, the system will not automatically analyse and generate the feature's relations to other objects in the map. Instead, the user has to enter the information into the database table manually.
3 Integrating Qualitative Aspects into GIS

We developed a prototype that integrates qualitative aspects into an existing GIS by allowing features to be added qualitatively. In addition, the prototype automatically analyses and records qualitative information about the surrounding area. We also developed a road map representation that supports qualitative spatial reasoning based on landmarks, thereby providing an effective tool for path generation. We used PERL to implement our prototype. AccuGlobe® provides functionality to export the feature attributes recorded in the shape file to a working directory, and even to convert them to an easily readable and editable text file format. The prototype can perform any necessary action on the exported text files instead of working directly on the shape files. AccuGlobe® will automatically convert the text file back to a shape file during the import process. Figure 1 shows the framework we used for the prototype.
Fig. 1. Integrative framework
3.1 Qualitative Database Table Design

Figure 2 shows a simple example of a database table as used in the shape files, illustrating feature entities and attributes. The words "Feature", "Type", and "Streetname" in the top row in bold are the feature entities. The words in the bottom row "A", "House", and "Market St" are the feature attributes.

Feature    Type     Streetname
A          House    Market St
Fig. 2. Database table example
3.2 Road Segment Layout An appropriate road segment layout is very important as the path-finding algorithm generates its way through these road segments. After some trials and testing, we have come up with an easy and effective way to organise these segments. This layout also allows the inference of relative direction relations both when adding features to the map and when generating a path. Roads are generally already divided into segments in GIS data. However, we found it effective to use road intersections as segment dividers. In Figure 3a, there is only one road with no intersection, so there will be only one road segment, which will receive the unique segment id “0”. In Figure 3b, there is an intersection between the vertical road and the horizontal road. Therefore the horizontal road will be divided into two different road segments with the ids “0” and “0.1”. The vertical road also needs a separate id since it is also a road segment; its id is “1”. The layout displayed in Figure 3c follows the same logic. There is a sequence and meaning for each segment id in order for the path-finding algorithm to work, and to support the inference of qualitative spatial direction relations. Each road segment is either vertical or horizontal. Moreover, all vertical road segments will have segment ids in odd numbers (before the decimal point) and all horizontal road segments will have segment ids in even numbers (before the decimal point). The numbering always starts at the left or the top for horizontal or vertical road segments respectively. The value before the decimal point is like the Y-axis in a graph having the smallest value at the top and the largest value at the bottom. The value after the decimal point is like the X-axis, with the smallest value on the left and the largest value on the right. Figure 4 shows this pattern clearly by grouping the segments accordingly. This pattern will allow the path-finding algorithm, which will be discussed in sub-section 3.4, to determine which direction to take in order to reach the destination. Any road network can be transformed into this layout because the layout does not consider distances. In practical terms, the road segment layout could be another map layer. The sample data used for the prototype implementation includes mainly buildings, referred to as address features. In order to fully utilise the road segment layout, address features are issued with “house ids” that start with the value of the segment id on which they are located. Figure 5 illustrates the procedure of assigning house ids.
Fig. 3. Division of road segments
Fig. 4. X and Y-axis pattern in the proposed road segment layout
Fig. 5. Address features added to road segment
Suppose that we want to add three address features on a horizontal road segment with segment id "0.1". The leftmost address feature that is located on this particular road segment will have the house id "0.10". The second leftmost address feature on the same road segment will have the house id "0.11", and the third leftmost address feature will have the house id "0.12". They are all preceded by "0.1" to indicate their road segment location. Figure 5 illustrates this clearly: the three address features are added onto the road segment with segment id "0.1". If a new address feature were added between two existing address features such as "0.10" and "0.11", the new house id would be chosen between "0.10" and "0.11", e.g. "0.101".
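A minimal sketch of this id scheme (our own simplification, not the prototype's Perl code) shows how a new house id can be appended at the end of a segment or placed between two existing ids by extending the decimal string, as in the "0.10"/"0.11" to "0.101" example above.

def next_house_id(segment_id, existing_on_segment):
    # House id for a feature appended after the existing features of a segment.
    return f"{segment_id}{len(existing_on_segment)}"

def house_id_between(left_id, right_id):
    # A lexicographically intermediate id, e.g. between "0.10" and "0.11".
    return left_id + "1"

print(next_house_id("0.1", ["0.10", "0.11"]))  # "0.12"
print(house_id_between("0.10", "0.11"))        # "0.101"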
3.3 Adding Features by Storing and Inferring Qualitative Spatial Relations

We modified the attribute tables of an existing data set in order to accommodate the storage of qualitative relations that can either be entered or inferred. In addition to relative direction relations such as left and right, we also included a small subset of topological relations. In order to resolve any reference frame issues, we presume a bird's eye view. For an address feature on a horizontal path, we go from left to right, which means the right is always the front and the left is the back. For an address feature on a vertical path, we go from top to bottom, which means the bottom is always the front and the top is the back. Figure 6 shows an example of a modified attribute table. If we need to record the qualitative relations of "Feature A" (house id "2.11") with the features of its surrounding area, we have to create new entities for these qualitative relations in the attribute table. With this recorded, qualitative information about the feature's location in relation to other objects can easily be accessed by looking up the attribute table. The entities "Front", "Back", "Left", "Right", "LeftFrBack", and "RightFrBack" are added to the attribute table, as displayed in Figure 6. We can see that the entity "Right" has a nil attribute value. This is because no feature can be reached by turning right at Feature A; this is represented by a "nil" in the attribute table.
Fig. 6. Attribute table with added qualitative relations’ entities
Fig. 7. Qualitative relations of features
Figure 7 illustrates some address features lying on both the vertical path and the horizontal path. House id "2.1" is located on a horizontal path and house id "1.2" on a vertical path. The attributes of their qualitative relations with the surrounding features, which can be extracted from the database table, are summarised below Figure 7. If the user goes the other way around, from bottom to top or from right to left, the relations will automatically be transformed (e.g., the front becomes the back, etc.).

Horizontal path (house id "2.1"):
Front: house id "2.2"
Back: house id "2.0"
Left: house id "1.2"
Right: house id "3.2"
Left from back: Nil
Right from back: house id "1.1"

Vertical path (house id "1.2"):
Front: house id "3.2"
Back: Nil
Left: house id "2.2"
Right: house id "2.1"
Left from back: house id "0.1"
Right from back: house id "0.2"
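These inferred relations can be represented as a simple mapping from relation name to neighbouring house id (a sketch of our own, not the prototype's table format); the transformation for the opposite travel direction is our assumption of how "the front becomes the back" generalises.

relations = {
    "2.1": {"Front": "2.2", "Back": "2.0", "Left": "1.2", "Right": "3.2",
            "LeftFrBack": None, "RightFrBack": "1.1"},
    "1.2": {"Front": "3.2", "Back": None, "Left": "2.2", "Right": "2.1",
            "LeftFrBack": "0.1", "RightFrBack": "0.2"},
}

def opposite(relation):
    # Swap relations when travelling in the opposite direction.
    swap = {"Front": "Back", "Back": "Front", "Left": "Right", "Right": "Left",
            "LeftFrBack": "RightFrBack", "RightFrBack": "LeftFrBack"}
    return swap[relation]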
3.4 Path Finding Algorithm

This section explores the simple path-finding algorithm we developed. Due to our road segmentation into vertical or horizontal segments, we can split our path-finding algorithm into four functions: "odd-to-odd", "odd-to-even", "even-to-odd" and "even-to-even". The "odd-to-odd" function is used if both the start and destination address features have odd house ids and therefore lie on vertical road segments in the map. A simple example of the odd-to-odd function is shown in Figure 8, where the start and destination address features have the house ids "1.11" and "1.21" respectively. In order to select the appropriate function, the system takes the values of the house ids, "1.11" and "1.21". Both values before the decimal point are "1", which is odd. Hence, the odd-to-odd function is selected. The odd-to-odd function itself determines which particular road segments are needed for navigation. The function determines the position of the destination address relative to the start address. It checks the house ids again. The number before the decimal point is "1" for both; hence they lie on the same level of the y-axis. After the decimal point, the start
Fig. 8. Odd-to-odd function example
address has a value of "11" and the destination address has a value of "21". The destination address's value is larger than the start address's value on the x-axis; the destination address is therefore on the right of the start address. This results in a path that goes straight to the top, turns right at the next available turn-off, then continues straight until it finds the right turn with segment id "1.2", along which it can locate the feature address with house id "1.21". The other functions (e.g. the odd-to-even function) have the same underlying logic: mainly, they check the values before and after the decimal points to determine in which direction the path should be generated.
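A simplified sketch of this decision logic (ours, not the prototype's Perl implementation) uses the digits before and after the decimal point of the two house ids to choose the initial direction of the generated path.

def initial_direction(start_id, dest_id):
    start_y, start_x = start_id.split(".")
    dest_y, dest_x = dest_id.split(".")
    if start_y == dest_y:  # same level on the y-axis
        return "right" if int(dest_x) > int(start_x) else "left"
    return "down" if int(dest_y) > int(start_y) else "up"

print(initial_direction("1.11", "1.21"))  # "right", as in the Figure 8 example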
4 Prototype Evaluations

For the software evaluation, we used a photographic image [4] that consists of several layers, one for each feature type, such as traffic lights, stop signs, etc. We have successfully added features to the map qualitatively using topological relation descriptions. The prototype automatically generates and records the feature id, street id, street name, and the qualitative relations with the surrounding area.

4.1 Experimental Details

We tested our GIS extension against several data entries. Once the data was added, we checked the database tables for the inferred qualitative information. We did this for three different data entry types, namely quantitative addition, simple qualitative addition, and qualitative addition on different streets, and subsequently uploaded the newly inferred information into the image map. The following is an example of a simple qualitative addition of a new feature. When we add an address feature between No. 3 and No. 4 of Water Street, any further information required by the attribute table is automatically inferred by the system. Figure 9 shows the newly added feature with the reference addresses entered by the user and all the other attributes inferred by the system. The figure also shows the new feature being uploaded into the map.
Fig. 9. Added feature to map and feature attributes to database table
We further evaluated the usefulness of our data representation by utilising this qualitative information for path description generation using direction relations and landmarks, as shown in Figure 10. This gave us an additional level of evaluation when doing this with newly added features. Without this qualitative information being recorded, it is not possible to effectively generate a path description, as there is insufficient information about the surrounding area of each feature.
Fig. 10. Path description
4.2 Limitations of the System

Despite its usefulness, our system also has limitations that need to be discussed. As explained in Section 3.3, for an address feature on a horizontal path we go from left to right, which means the right is always the front and the left is the back; for an address feature on a vertical path we go from top to bottom, which means the bottom is always the front and the top is the back. There may be a problem if two address features lie side by side, one across the street from the other. There is no front, back, left or right relation in this case. Another limitation is quantitative distance, which is currently only used as part of the path generation output, but not during the reasoning process. If we were to use quantitative distance for path generation, we would again face problems when data is imprecise. While every road network can theoretically be transformed into our road segment layout, our system does not yet provide the functionality for such a transformation and relies on a manual transformation from an existing road network into what is commonly known as a "Manhattan structure". For the generation of path descriptions, if there is a cluster of addresses between the two locations (start and destination), the path description generated by the system could be too long to be useful, as every single address would be considered. A solution could be to automatically skip the path description of some addresses if the list is too long.
5 Conclusions and Outlook

In this paper we discussed the possibility of an extension to Geographic Information Systems (GIS) that can intelligently analyse and record qualitative information of the
surrounding area when adding a feature to a map. We then presented a prototype implementation of such a system based on a road segment layout and a path finding algorithm that not only supports spatial reasoning but also allows the automatic inference and storage of qualitative spatial relations when new features are added to a map. This recorded qualitative spatial information can be utilised to perform queries such as path generation using landmarks; this particular function was also implemented in our prototype. An evaluation of the system confirmed the well-known assumption that purely quantitative methods are inadequate and qualitative methods are essential to effectively represent spatial information. Our current prototype is text-based, but we are planning a graphical interface to allow user input for feature addition. The path generation still needs to be optimised as it is not necessarily the shortest path. This is a general problem that has received a lot of attention in GIS research.
References
1. Aurnague, M., Vieu, L.: A three-level approach to the semantics of space. In: Zelinsky-Wibbelt, C. (ed.): The semantics of prepositions – from mental processing to natural language processing. Mouton de Gruyter, Berlin (1993)
2. Cohn, A.: The Challenge of Qualitative Spatial Reasoning. ACM Computing Surveys, Vol. 27, No. 3 (1995) 323-324
3. Cohn, A.G., Hazarika, S.M.: Qualitative Spatial Representation and Reasoning: An Overview. Fundamenta Informaticae, Vol. 46, No. 1-2 (2001) 2-32
4. DDTI (Digital Data Technologies, Inc.) (2005) http://www.ddti.net
5. Egenhofer, M., Mark, D.: Naïve Geography. In: Frank and W. Kuhn (eds.): Spatial Information Theory: A theoretical basis for GIS. Proc. COSIT'95, LNCS No. 988, Springer-Verlag (1995) 1-16
6. Faltings, B.: A symbolic approach to qualitative kinematics. Artificial Intelligence 56(2) (1992)
7. Kuipers, B.: Modeling Spatial Knowledge. Cognitive Science 2 (1978) 129-153
8. Kuipers, B., Byun, Y.: A robot exploration and mapping strategy based on a semantic hierarchy of spatial representations. Journal of Robotics and Autonomous Systems (1991) 47-63
9. Moulin, B., Kettani, D.: Route generation and description using a logical and an analogical framework. Annals of Mathematics and Artificial Intelligence 24 (1998) 155-179
10. Sablayrolles, P.: Spatio-temporal semantics in natural language: the case of motion. In: M. Aurnague et al. (eds.): Semantics of Time, Space, Movement and Spatio-Temporal Reasoning, Working Papers of the 4th Intl. Workshop (1992) 69-88
On-Line Learning of a Time Variant System

Fernando Morgado Dias1, Ana Antunes1, José Vieira2, and Alexandre Manuel Mota3

1 Escola Superior de Tecnologia de Setúbal do Instituto Politécnico de Setúbal, Departamento de Engenharia Electrotécnica, Campus do IPS, Estefanilha, 2914-508 Setúbal, Portugal, Tel.: +351 265 790000, {aantunes, fmdias}@est.ips.pt
2 Escola Superior de Tecnologia de Castelo Branco, Departamento de Engenharia Electrotécnica, Av. Empresário, 6000 Castelo Branco, Portugal, Tel.: +351 272 330300, [email protected]
3 Departamento de Electrónica e Telecomunicações, Universidade de Aveiro, 3810-193 Aveiro, Portugal, Tel.: +351 234 370383, [email protected]
Abstract. In the present work a sliding window approach for the Levenberg-Marquardt algorithm is used for on-line modelling of a time variant system. The system used is a first-order cruise control in which a modification is introduced to change the system gain at some point of operation. The initial control of the cruise control is performed by a PI controller that is not particularly optimised but is enough to keep the system working within the intended range; it is then replaced by an Artificial Neural Network as soon as it is trained, using an Internal Model Controller loop.
1 Introduction

Among the many algorithms used for training in the field of Artificial Neural Networks (ANNs), the Levenberg-Marquardt algorithm has been considered the most effective one, but its use has been mostly restricted to off-line training because of the difficulties of implementing a true iterative version. These difficulties come from computing the derivatives for the Hessian matrix, inverting this matrix and computing the region for which the approximation contained in the calculation of the Hessian matrix is valid (the trust region). In the present work a different approach is suggested: the use of the Levenberg-Marquardt algorithm on-line in a batch version with a sliding window and Early Stopping, allowing the use of the Levenberg-Marquardt algorithm in the same way it is used in the off-line approaches. The implementation also uses a hybrid Direct/Specialized evaluation procedure and the final solution is tested with a cruise control system with a time variant gain. The gain is changed in the middle of the experiment to test the capabilities of adaptation of the solution proposed. Although in most real systems such a time variance does not exist and rather a slow change would be found, this is an extreme situation and if the solution proposed works here it will also be valid for slowly varying systems, such as those resulting from system degradation.
2 The Newton, Gauss-Newton and Levenberg-Marquardt Algorithms

Starting from the Taylor series approach of second order, for a generic function F(x), the following can be written:

F(x_{k+1}) = F(x_k + Δx_k) ≈ F(x_k) + G(x, k)·Δx_k + (1/2)·Δx_k^T·H(x, k)·Δx_k    (1)

where G(x, k) is the gradient of F(x), Δx_k is x_{k+1} − x_k and H(x, k) is the Hessian matrix of F(x). If the derivative of equation (1) with respect to Δx_k is taken and set to zero, equation (2) is obtained:

G(x, k) + H(x, k)·Δx_k = 0    (2)

This equation can be re-written in the following form:

Δx_k = −H(x, k)^{-1}·G(x, k)    (3)

The updating rule for the Newton algorithm is then obtained:

x_{k+1} = x_k − H(x, k)^{-1}·G(x, k)    (4)

Considering a generic quadratic function as the objective function to be minimized, as represented in equation (5) for a Multi Input Single Output system (here the iteration index is omitted and i is the index of the outputs):

F(x) = Σ_{i=1}^{N} v_i^2(x)    (5)

The gradient can be expressed in the following form:

G(x) = 2·J^T(x)·v(x)    (6)

and the Hessian matrix can be expressed in the following form:

H(x) = 2·J^T(x)·J(x) + 2·S(x)    (7)

where J(x) is the Jacobian and S(x) is:

S(x) = Σ_{i=1}^{N} v_i(x)·∂²v_i(x)/(∂x_k ∂x_j)    (8)

If it can be assumed that S(x) is small when compared to the product of the Jacobian, then the Hessian matrix can be approximated by the following:

H(x) ≈ 2·J^T(x)·J(x)    (9)

This approach gives the Gauss-Newton algorithm:

Δx_k = −[2·J^T(x_k)·J(x_k)]^{-1}·2·J^T(x_k)·v(x_k)    (10)

One limitation that can happen with this algorithm is that the simplified Hessian matrix might not be invertible. To overcome this problem a modified Hessian matrix can be used:

Hm(x) = H(x) + μ·I    (11)

where I is the identity matrix and μ is a value that makes Hm(x) positive definite and therefore invertible. This last change in the Hessian matrix corresponds to the Levenberg-Marquardt algorithm:

Δx_k = −[2·J^T(x_k)·J(x_k) + μ_k·I]^{-1}·2·J^T(x_k)·v(x_k)    (12)
where μ is now written as μ_k to show that this value can change during the execution of the algorithm. The Levenberg-Marquardt algorithm is due to the independent work of both authors in [1] and [2]. The parameter μ_k is the key of the algorithm since it is responsible for stability (by assuring that the Hessian can be inverted) and for the speed of convergence. It is therefore worth taking a closer look at how this value is calculated. The modification of the Hessian matrix will only be valid in a neighbourhood of the current iteration. This corresponds to searching for the correct update for the next iteration x_{k+1} while restricting this search to |x − x_k| ≤ δ_k. There is a relationship between δ_k and μ_k, since raising μ_k makes the neighbourhood δ_k diminish [3]. As an exact expression relating these two parameters is not available, many solutions have been developed. The one used in the present work was proposed by Fletcher [3] and uses the following expression:

r_k = (V_N(x_k) − V_N(x_k + p_k)) / (V_N(x_k) − L_k(x_k + p_k))    (13)

to obtain a measure of the quality of the approximation. Here V_N is the function to be minimized, L_k is the estimate of that value calculated from the Taylor series of second order and p_k is the search direction, in the present situation the search direction given by the Levenberg-Marquardt algorithm. The value of r_k is used in the determination of μ_k according to the following algorithm:

1. Choose the initial values of x_0 and μ_0.
2. Calculate the search direction p_k.
3. If r_k > 0.75 then set μ_k = μ_k/2.
4. If r_k < 0.25 then set μ_k = 2·μ_k.
5. If V_N(x_k + p_k) < V_N(x_k) then the new iteration is accepted.
6. If the stopping condition for the training is not met, return to step 2.
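A compact sketch of how the update of equation (12) and the μ-adaptation of equation (13) fit together is shown below. It is illustrative only: `residuals` and `jacobian` are assumed to be caller-supplied callables returning v(x) and J(x), and this is not the MATLAB/NNSYSID implementation used in the paper.

```python
import numpy as np

def levenberg_marquardt(x, residuals, jacobian, mu=1e-2, max_iter=100, tol=1e-8):
    """One possible batch Levenberg-Marquardt loop with Fletcher's mu rule."""
    VN = lambda p: float(np.sum(residuals(p) ** 2))            # eq. (5)
    for _ in range(max_iter):
        v, J = residuals(x), jacobian(x)
        G = 2.0 * J.T @ v                                       # eq. (6)
        H = 2.0 * J.T @ J                                       # eq. (9)
        # eq. (12): search direction computed with the regularised Hessian
        p = -np.linalg.solve(H + mu * np.eye(len(x)), G)
        L = VN(x) + G @ p + 0.5 * p @ H @ p                     # 2nd-order Taylor estimate L_k
        predicted = VN(x) - L
        r = (VN(x) - VN(x + p)) / predicted if predicted != 0 else 0.0   # eq. (13)
        if r > 0.75:
            mu /= 2.0
        elif r < 0.25:
            mu *= 2.0
        if VN(x + p) < VN(x):              # accept the step only if it improves V_N
            x = x + p
        if np.linalg.norm(G) < tol:        # simple stopping condition
            break
    return x
```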
3 On-Line Version

As pointed out before, the difficulties come from computing the derivatives for the Hessian matrix, inverting this matrix and computing the trust region, the region for which the approximation contained in the calculation of the Hessian matrix is valid. In the literature, some attempts to build on-line versions can be found, namely the work done by Ngia [4], developing a modified iterative Levenberg-Marquardt algorithm which includes the calculation of the trust region, and the work in [5], which implements a Levenberg-Marquardt algorithm in sliding window mode for Radial Basis Functions.

3.1 A Double Sliding Window Approach with Early Stopping

The current work is an evolution of the one presented in [6], where an on-line version of the Levenberg-Marquardt algorithm was implemented using a sliding window with Early Stopping and a static test set. In the present work two sliding windows are used, one for the training set and another for the evaluation set, with all the data being collected on-line. As in the previous work, the Early Stopping technique [7], [8] is used for avoiding the overfitting problem, because it is almost mandatory to employ a technique to avoid overtraining when dealing with systems that are subject to noise. The Early Stopping technique was chosen over other techniques that could have been used (like Regularization and Pruning techniques) because it has a lower computational burden. The use of two sliding windows introduces some difficulties, since both data sets will be changing during the training and evaluation phases. For these two windows it is necessary to decide their relative position. In order to be able to perform Early Stopping in a valid way, it was decided to place the windows in such a way that new samples go into the test window and the samples that are removed from the test set go into the training set, according to figure 1.
Fig. 1. Relative position of the training and test sets
If the inverse relative position of the two windows were used, the samples would become part of the test set after having been part of the training set, and so the objective of evaluating the generalization ability would be compromised. In order to save some of the time necessary to collect all the samples needed to fill both the test and training windows, the training is started after some data has been
collected but before the windows are both filled. The test window always keeps the same number of samples, while the training window grows in the initial stage. The choice of keeping the test window always at the same number of points was made with the objectives of maintaining this window as stable as possible (since it is responsible for producing the numerical evaluation of the models) and of using a minimal test window that should not be shortened.

The windows may not change in each training iteration, since all the time between samples is used for training, which may permit several training epochs before a new sample is collected. But each time the composition of the windows is changed, the test and training errors will probably be subject to an immediate change that might be interpreted as an overtraining situation. The Early Stopping technique is here used in conjunction with a measure of the best model that is retained for control. Each time there is a change in the windows, the values of the best models (direct and inverse) must be re-evaluated because the previous ones, obtained over a different test set, are no longer valid for a direct comparison.

The procedure used for the identification of the direct model on-line is represented in figure 2. As was already explained, training starts when a predefined amount of data has been collected. After each epoch the ANN is evaluated with a test set. The value of the Mean Square Error (MSE) obtained is used to perform Early Stopping and to retain the best models. The conditions for overtraining and the maximum number of epochs are then verified. If they are true, the Flag, which indicates that the threshold of quality has been reached, will also be verified; if it is on, the training of the inverse model starts, otherwise the models will be reset since new models need to be prepared. Resetting here means that the model's weights are replaced by random values between -1 and 1, as is done for the initial models. After testing the conditions for overtraining and the maximum number of epochs, if they are both false, the predefined threshold of quality will also be tested and, if it has been reached, the variable Flag will be set to on. In either case the remaining time of the sampling period is tested to decide if a new epoch is to be performed or if a new sample is to be collected and training is to be performed with this new sample included in the sliding window. Each time a new sample is collected, both the direct and inverse models must be re-evaluated with the new test set and the information about the best models updated.

The procedure for the inverse model is very similar and almost the same block diagram could be used to represent it. The on-line training goes on switching from direct to inverse model each time a new model is produced. The main difference between the procedures for the direct and inverse models lies in the evaluation step. While the direct model is evaluated with a simple test set, the inverse model is evaluated with a control simulation corresponding to the hybrid Direct/Specialized approach for generating inverse models [9]. During the on-line training the NNSYSID [10] and NNCTRL [11] toolboxes for MATLAB were used.
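A rough sketch of this window bookkeeping is given below. It is illustrative only (not the authors' MATLAB code): the window lengths follow the experiment in Section 5, and the `patience` parameter of the overtraining check is an assumption.

```python
from collections import deque

class DoubleSlidingWindow:
    """New samples enter the test window; samples displaced from the test
    window move into the training window, which grows up to its maximum."""
    def __init__(self, test_len=100, train_max=200):
        self.test = deque(maxlen=test_len)
        self.train = deque(maxlen=train_max)

    def add_sample(self, sample):
        if len(self.test) == self.test.maxlen:
            # the oldest test sample is displaced into the training window
            self.train.append(self.test[0])
        self.test.append(sample)

    def ready(self, min_points=240):
        # training starts once a preset amount of data has been collected,
        # even before both windows are completely filled
        return len(self.test) + len(self.train) >= min_points

def overtraining(test_mse_history, patience=5):
    # declare overtraining when the test MSE has not improved for
    # `patience` consecutive epochs (simple Early Stopping criterion)
    best_epoch = test_mse_history.index(min(test_mse_history))
    return best_epoch < len(test_mse_history) - patience
```

On each sampling period the caller would add the new sample, retrain for as many epochs as the remaining time allows, and re-evaluate the retained best models against the updated test window.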
Fig. 2. Block diagram for the identification of a direct model on-line
4 Time Variant System

The time variant system used for this test is a cruise control with a variable gain according to equation (14):

G(s) = 0.05/(s + 0.05)   if sample ≤ 500
G(s) = 0.075/(s + 0.05)  if sample > 500    (14)

that is, the gain is increased by 50% after sample 500 is collected. The system used is rather simple but the time variance introduced allows testing the functionality of the proposed algorithm.
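A rough discrete-time sketch of this test system is shown below; the zero-order-hold discretisation and the 150 s sampling period of Section 5 are assumptions made only for illustration, not the authors' simulation code.

```python
import numpy as np

def cruise_response(u, h=150.0, switch_sample=500):
    """Simulate the first-order plant of equation (14) sample by sample.
    The numerator switches from 0.05 to 0.075 after the switch sample,
    so the DC gain rises from 1.0 to 1.5 (the 50% increase)."""
    a = np.exp(-0.05 * h)                  # pole of 1/(s + 0.05) sampled at h
    y, out = 0.0, []
    for k, uk in enumerate(u):
        num = 0.05 if k <= switch_sample else 0.075
        b = (num / 0.05) * (1.0 - a)       # ZOH gain giving DC gain num/0.05
        y = a * y + b * uk
        out.append(y)
    return np.array(out)

# unit step input: the output settles near 1.0, then near 1.5 after sample 500
y = cruise_response(np.ones(1000))
print(round(y[499], 3), round(y[999], 3))  # -> 1.0 1.5
```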
5 Results

The test sequence is composed of 100 points, the sliding window used for training has a maximum of 200 samples and training starts after 240 samples have been collected. Both direct and inverse models were one hidden layer models with 6 neurons on the hidden layer and one linear output neuron. The direct model has as inputs the past two samples of both the output of the system and the control signal. The sampling period used was 150 seconds, which allowed performing several epochs of training between each control iteration. During the initial phase of collecting data a PI was used in order to keep the system operating within the range of interest. The PI parameters are Kp=0.01 and Ki=0.01. After this initial phase the PI is replaced by an IMC controller, using the direct and inverse models trained on-line. The first inverse model is ready at sample 243, that is only 2 samples after the training has started. After the 240 samples have been collected it only took one sampling period to complete the training of the direct model and another sampling period to complete the inverse model even though the Matlab code was running on a personal computer with a Celeron processor at 466MHz using 64Mbytes of memory.
Fig. 3. Result obtained using the IMC control strategy and the proposed on-line learning strategy with a time variant system
As can be seen from figure 3, after the gain is changed the quality of control decreases and an oscillation appears that can reach 20 to 30 degrees. For a new model to be prepared, a test window and a training window with data that belong to the new system need to be completed. The necessary points are collected in the next 300 samples. Since the system was changed at sample 500, the renewal of the data in the sliding windows is only complete at sample 800. During this stage training and evaluation are quite difficult since there is a mixture of data from the two systems, but once the two sliding windows are filled with data from the new system the solution proposed is able to create accurate models: the quality of control is re-established at sample 795.
6 Conclusion

This paper presents on-line identification and control of a time varying system using the Levenberg-Marquardt algorithm in a batch version with two sliding windows and Early Stopping. The problems pointed out in Section 3 regarding performing Early Stopping under a changing sliding window for the training set were not critical, and a good choice of the parameters for identification of the overtraining situation and for the maximum number of iterations for each attempt to create a model was sufficient to obtain reasonable models to perform IMC control. The PI controller is used here just to maintain the system in the operating range while data is being collected and is disconnected as soon as the ANN models are ready. As shown here, even for a noisy system, for which overtraining is a real problem, it is possible to create models on-line of acceptable quality, as can be concluded from the values presented in table 1. The artificial time variant system used in this experiment is an extreme situation compared with most real time variant systems, which vary slowly. Nevertheless, the successful application of the Levenberg-Marquardt sliding window solution to this situation shows that it will also work for slowly varying systems. With this artificial time variant system it can be seen that learning is very difficult when the sliding windows contain data from both before and after the change. This corresponds to training an ANN with mixed data from two different systems. Once this situation is overcome, the models of the new system are rapidly obtained. This problem would not happen for a slowly changing system. The sliding window solution with Early Stopping for the Levenberg-Marquardt algorithm is very interesting since it does not limit the capabilities of the algorithm and overcomes the limitations of application of the traditional solution.
References
1. K. Levenberg, "A method for the solution of certain problems in least squares," Quart. Appl. Math., vol. 2, pp. 164–168, 1944.
2. D. Marquardt, "An algorithm for least-squares estimation of nonlinear parameters," SIAM J. Appl. Math., vol. 11, pp. 431–441, 1963.
3. M. Nørgaard, O. Ravn, N. K. Poulsen, and L. K. Hansen, Neural Networks for Modelling and Control of Dynamic Systems, Springer, 2000.
4. Lester S. H. Ngia, System Modeling Using Basis Functions and Application to Echo Cancelation, Ph.D. thesis, Department of Signals and Systems, School of Electrical and Computer Engineering, Chalmers University of Technology, 2000.
5. P. Ferreira, E. Faria, and A. Ruano, "Neural network models in greenhouse air temperature prediction," Neurocomputing, vol. 43, no. 1-4, pp. 51–75, 2002.
6. Fernando Morgado Dias, Ana Antunes, José Vieira, and Alexandre Manuel Mota, "Implementing the Levenberg-Marquardt algorithm on-line: A sliding window approach with early stopping," 2nd IFAC Workshop on Advanced Fuzzy/Neural Control, 2004.
7. N. Morgan and H. Bourlard, "Generalization and parameter estimation in feedforward nets: Some experiments," Advances in Neural Information Processing Systems, Ed. D. Touretzsky, Morgan Kaufmann, pp. 630–637, 1990.
8. Jonas Sjöberg, Non-Linear System Identification with Neural Networks, Ph.D. thesis, Dept. of Electrical Engineering, Linköping University, Sweden, 1995.
9. Fernando Morgado Dias, Ana Antunes, and Alexandre Mota, "A new hybrid direct/specialized approach for generating inverse neural models," WSEAS Transactions on Systems, vol. 3, issue 4, pp. 1521–1529, 2004.
10. M. Nørgaard, "Neural network based system identification toolbox for use with Matlab, version 1.1, technical report," Tech. Rep., Technical University of Denmark, 1996.
11. M. Nørgaard, "Neural network based control system design toolkit for use with Matlab, version 1.1, technical report," Tech. Rep., Technical University of Denmark, 1996.
Bioinformatics Integration Framework for Metabolic Pathway Data-Mining

Tomás Arredondo V.1, Michael Seeger P.2, Lioubov Dombrovskaia3, Jorge Avarias A.3, Felipe Calderón B.3, Diego Candel C.3, Freddy Muñoz R.3, Valeria Latorre R.2, Loreine Agulló2, Macarena Cordova H.2, and Luis Gómez2

1 Departamento de Electrónica, [email protected]
2 Millennium Nucleus EMBA, Departamento de Química
3 Departamento de Informática,
Universidad Técnica Federico Santa María, Av. España 1680, Valparaíso, Chile
Abstract. A vast amount of bioinformatics information is continuously being introduced to different databases around the world. Handling the various applications used to study this information presents a major data management and analysis challenge to researchers. The present work investigates the problem of integrating heterogeneous applications and databases towards providing a more efficient data-mining environment for bioinformatics research. A framework is proposed, and GeXpert, an application using the framework towards metabolic pathway determination, is introduced. Some sample implementation results are also presented.
1 Introduction
Modern biotechnology aims to provide powerful and beneficial solutions in diverse areas. Some of the applications of biotechnology include biomedicine, bioremediation, pollution detection, marker-assisted selection of crops, pest management, biochemical engineering and many others [3, 14, 22, 21, 29]. Because of the great interest in biotechnology there has been a proliferation of separate and disjoint databases, data formats, algorithms and applications. Some of the many types of databases currently in use include: nucleotide sequences (e.g. Ensemble, Genbank, DDBJ, EMBL), protein sequences (e.g. SWISS-PROT, InterPro, PIR, PRF), enzyme databases (e.g. Enzymes), metabolic pathways (e.g. ERGO, KEGG: Kyoto Encyclopedia of Genes and Genomes database) and literature references (e.g. PubMed) [1, 4, 11, 6, 23]. The growth in the amount of information stored has been exponential since the 1970s: as of April 2004 there were over 39 billion bases in Entrez NCBI (National Center of Bioinformatics databases), while the number of abstracts in PubMed has been growing by 10,000 abstracts per week since 2002 [3, 13]. The availability of this data has undoubtedly accelerated biotechnological research. However, because these databases were developed independently and are
managed autonomously, they are highly heterogeneous, hard to cross-reference, and ill-suited to process mixed queries. Also, depending on the database being accessed, the data is stored in a variety of formats including: a host of graphic formats, RAW sequence data, FASTA, PIR, MSF, CLUSTALW, and other text-based formats including XML/HTML. Once the desired data is retrieved from one of the databases it typically has to be manually manipulated and addressed to another database or application to perform a required action such as: database and homology search (e.g. BLAST, Entrez), sequence alignment and gene analysis (e.g. ClustalW, T-Coffee, Jalview, GenomeScan, Dialign, Vector NTI, Artemis) [8]. Beneficial application developments would occur more efficiently if the large amounts of biological feature data could be seamlessly integrated with data from literature, databases and applications for data-mining, visualization and analysis. In this paper we present a framework for bioinformatic literature, database, and application integration. An application based on this framework is shown that supports metabolic pathway research within a single easy-to-use graphical interface with assisting fuzzy logic decision support. To our knowledge, an integration framework encompassing all these areas has not been attempted before. Early results have shown that the integration framework and application could be useful to bioinformatics researchers. In Section 2, we describe current metabolic pathway research methodology. In Section 3 we describe existing integrating architectures and applications. Section 4 describes the architecture, and GeXpert, an application using this framework, is introduced. Finally, some conclusions are drawn and directions of future work are presented.
2 Metabolic Pathway Research
For metabolic pathway reconstruction, experts have traditionally used a time-intensive iterative process [30]. As part of this process, genes first have to be selected as candidates for encoding an enzyme within a potential metabolic pathway within an organism. Their selection then has to be validated with literature references (e.g. non-hypothetical genes in Genbank) and using bioinformatics tools (e.g. BLAST: Basic Local Alignment Search Tool) for finding orthologous genes in various other organisms. Once a candidate gene has been determined, sequence alignment of the candidate gene with the sequence of the organism under study has to be performed in a different application for gene locations (e.g. Vector NTI, ARTEMIS). Once the required genes have been found in the organism, the metabolic pathway has to be confirmed experimentally in the laboratory. For example, using the genome sequence of Acidithiobacillus ferrooxidans diverse metabolic pathways have been determined. One major group of organisms that is currently undergoing metabolic pathway research is bacteria. Bacteria possess the highest metabolic versatility of the three domains of living organisms. This versatility stems from their expansion into different natural niches, a remarkable degree of physiological and genetic
adaptability and their evolutionary diversity. Microorganisms play a major role in the carbon cycle and in the removal of natural and man-made waste chemical compounds from the environment [17]. For example, Burkholderia xenovorans LB400 is a bacterium capable of degrading a wide range of PCBs [7, 29]. Because of the complex and noisy nature of the data, any selection of candidate genes as part of metabolic pathways is currently only done by human experts prior to biochemical verification. The lack of integration and standards in database, application and file formats is time-consuming and forces researchers to develop ad hoc data management processes that could be prone to error. In addition, the possibility of using soft-computing-based pattern detection and analysis techniques (e.g. fuzzy logic) has not been fully explored as an aid to the researcher within such environments [1, 23].
3 Integration Architectures
The trend in the field is towards data integration. Research projects continue to generate large amounts of raw data, and this is annotated and correlated with the data in the public databases. The ability to generate new data continues to outpace the ability to verify them in the laboratory and therefore to exploit them. Validation experiments and the ultimate conversion of data to validated knowledge need expert human involvement with the data and in the laboratory, consuming time and resources. Any effort of data and system integration is an attempt towards reducing the time spent by experts unnecessarily which could be better spent in the lab [5, 9, 20]. The biologist or biochemist not only needs to be an expert in his field as well stay up to date with the latest software tools or even develop his own tools to be able to perform his research [16, 28]. One example of these types of tools is BioJava, which is an open source set of Java components such as parsers, file manipulation tools, sequence translation and proteomic components that allow extensive customization but still require the development of source code [25]. The goal should be integrated user-friendly systems that would greatly facilitate the constructive cycle of computational model building and experimental verification for the systematic analysis of an organism [5, 8, 16, 20, 28, 30]. Sun et al. [30] have developed a system, IdentiCS, which combines the identification of coding sequences (CDS) with the reconstruction, comparison and visualization of metabolic networks. IdentiCS uses sequences from public databases to perform a BLAST query to a local database with the genome in question. Functional information from the CDSs is used for metabolic reconstruction. One shortcoming is that the system does not incorporate visualization or ORF (Open Reading Frames) selection and it includes only one application for metabolic sequence reconstruction (BLAST). Information systems for querying, visualization and analysis must be able to integrate data on a large scale. Visualization is one of the key ways of making the researcher work easier. For example, the spatial representation of the genes within the genome shows location of the gene, reading direction and metabolic
function. This information is made available by applications such as Vector NTI and Artemis. The function of the gene is interpreted through its product, normally the protein in a metabolic pathway, whose description is available from several databases such as KEGG or ERGO [24]. Usually, the metabolic pathways are represented as a flowchart of several levels of abstraction, which are constructed manually. Recent advances in metabolic network visualization include virtual reality use [26], mathematical model development [27], and a modeling language [19]. These techniques synthesize the discovery of genes and are available in databases such as Expasy, but they should be transparent to the researcher through their seamless integration into the research application environment. The software engineering challenges and opportunities in integrating and visualizing the data are well documented [16, 28]. There are some application servers that use a SOAP interface to answer queries. Linking such tools into a unified working environment is non-trivial and has not been done to date [2]. One recent approach towards human-centered integration is BioUse, a web portal developed by the Human Centered Software Engineering Group at Concordia University. BioUse provided an adaptable interface to NCBI, BLAST and ClustalW, which attempted to shield the novice user from unnecessary complexity. As users became increasingly familiar with the application the portal added shortcuts and personalization features for different users (e.g. pharmacologists, microbiologists) [16]. BioUse was limited in scope as a prototype and its website is no longer available. Current research is continued in the CO-DRIVE project but it is not completed yet [10].
4 Integration Framework and Implementation
This project has focused on the development of a framework using open standards towards integrating heterogeneous databases, web services and applications/tools. This framework has been applied in GeXpert, an application for bioinformatics metabolic pathway reconstruction research in bacteria. In addition, this application includes the utilization of fuzzy logic for helping in the selection of the best candidate genes or sequences for specified metabolic pathways. Previously, this categorization was typically done in an ad hoc manner by manually combining various criteria such as e-value, identities, gaps, positives and score. Fuzzy logic enables an efficiency enhancement by providing an automated sifting mechanism for a very manually intensive bioinformatics procedure currently being performed by researchers. GeXpert is used to find, build and edit metabolic pathways (central or peripheral), perform protein searches in NCBI, and perform nucleotide comparisons of organisms versus the sequenced one (using tblastn). The application can also search for 3D models associated with a protein or enzyme (using the Cn3D viewer), generate ORF diagrams for the sequenced genome, generate reports relating to the progress of the project, and aid in the selection of BLAST results using T-S-K (Takagi-Sugeno-Kang) fuzzy logic.
Fig. 1. High level framework architecture
The architecture is implemented in three layers: presentation, logic and data [18]. As shown in Figure 1, the presentation layer provides for a web-based as well as a desktop-based interface to perform the tasks previously mentioned. The logical layer consists of a core engine and a web server. The core engine performs bioinformatic processing functions and uses different tools as required: BLAST for protein and sequence alignment, ARTEMIS for analysis of nucleotide sequences, Cn3D for 3D visualization, and a T-S-K fuzzy logic library for candidate sequence selection. The Web Server component provides an HTTP interface into the GeXpert core engine. As seen in Figure 1, the data layer includes a communications manager and a data manager. The communication manager is charged with obtaining data (protein and nucleotide sequences, metabolic pathways and 3D models) from various databases and application sources using various protocols (application calls, TCP/IP, SOAP). The data manager implements data persistence as well as temporary cache management for all research-process-related objects.

4.1 Application Implementation: GeXpert
GeXpert [15] is an open source implementation of the framework previously described. It relies on the Eclipse [12] builder for multiplatform support. The GeXpert desktop application as implemented consists of the following:

– Metabolic pathway editor: tasked with editing or creating metabolic pathways, shows them as directed graphs of enzymes and compounds.
– Protein search viewer: is charged with showing the results of protein searches with user specified parameters.
– Nucleotide search viewer: shows nucleotide search results given user specified protein/parameters.
– ORF viewer: is used to visualize surrounding nucleotide ORFs with a map of colored arrows. The colors indicate the ORF status (found, indeterminate, erroneous or ignored).

The GeXpert core component implements the application logic. It consists of the following elements:

– Protein search component: manages the protein search requests and its results.
– Nucleotide search component: manages the nucleotide search requests and its results. In addition, calls the Fuzzy component with the search parameters specified in order to determine the best candidates for inclusion into the metabolic pathway.
– Fuzzy component: attempts to determine the quality of nucleotide search results using fuzzy criteria [1] (see the sketch after this list). The following normalized (0 to 1 values) criteria are used: e-value, identities, gaps. Each has five membership functions (very low, low, medium, high, very high); the number of rules used is 243 (3^5).
– ORF component: identifies the ORFs present in the requested genome region.
– Genome component: manages the requests for genome regions and its results.

The GeXpert communications manager receives requests from the GeXpert core module to obtain and translate data from external sources. This module consists of the following subcomponents:

– BLAST component: calls the BLAST application indicating the protein to be analyzed and the organism database to be used.
– BLAST parser component: translates BLAST results from XML formats into objects.
– Cn3D component: sends three-dimensional (3D) protein models to the Cn3D application for display.
– KEGG component: obtains and translates metabolic pathways received from KEGG.
– NCBI component: performs 3D protein model and document searches from NCBI.
– EBI component: performs searches on proteins and documentation from the EBI (European Bioinformatics Institute) databases.

The GeXpert data manager is tasked with administrating the loading and storage of application data:

– Cache component: in charge of keeping temporary search results to improve throughput. Also implements aging of cache data.
– Application data component: performs data persistence of metabolic paths, 3D protein models, protein searches, nucleotide searches, user configuration data and project configuration for future usage.
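A minimal zero-order Takagi-Sugeno-Kang sketch of such a scoring step is shown below. It is purely illustrative: the membership-function shapes and rule consequents are assumptions, and it enumerates the 5^3 combinations of the three normalized inputs rather than reproducing the rule base actually used in GeXpert.

```python
import itertools
import numpy as np

LABELS = ["very_low", "low", "medium", "high", "very_high"]
CENTERS = np.linspace(0.0, 1.0, len(LABELS))   # centres of triangular MFs

def membership(x):
    """Degrees of the five triangular membership functions at x in [0, 1]."""
    width = 0.25
    return np.clip(1.0 - np.abs(x - CENTERS) / width, 0.0, None)

def tsk_score(e_value, identities, gaps):
    """Zero-order T-S-K quality score for a BLAST hit; the three inputs are
    assumed to be already normalised to [0, 1]."""
    mu = [membership(e_value), membership(identities), membership(gaps)]
    num = den = 0.0
    for i, j, k in itertools.product(range(5), repeat=3):   # 5^3 rule combinations
        w = mu[0][i] * mu[1][j] * mu[2][k]                  # product t-norm firing strength
        # illustrative constant consequent: good hits have low e-value,
        # high identities and few gaps
        z = ((4 - i) + j + (4 - k)) / 12.0
        num, den = num + w * z, den + w
    return num / den if den > 0 else 0.0

print(round(tsk_score(0.02, 0.95, 0.05), 2))   # strong candidate: score near 1
print(round(tsk_score(0.90, 0.30, 0.70), 2))   # weak candidate: much lower score
```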
Fig. 2. Metabolic pathway viewer
4.2 Application Implementation: GeXpert User Interface and Workflow
GeXpert is to be used in metabolic pathway research work following a research workflow similar to that of IdentiCS [30]. The workflow and some sample screenshots are given:

1. The user must provide the sequenced genome of the organism to be studied.
2. The metabolic pathway of interest must be created (it can be based on the pathway of a similar organism). In the example in Figure 2, the glycolysis/gluconeogenesis metabolic pathway was imported from KEGG.
3. For each metabolic pathway a key enzyme must be chosen in order to start a detailed search. Each enzyme could be composed of one or more proteins (subunits). Figure 3 shows the result of the search for the protein homoprotocatechuate 2,3-dioxygenase.
4. Perform a search for amino acid sequences (proteins) in other organisms.
5. Translate these sequences into nucleotides (using tblastn) and perform an alignment in the genome of the organism under study. If this search does not give positive results, return to the previous step and search for another protein sequence. In Figure 4, we show that the protein homoprotocatechuate 2,3-dioxygenase has been found in the organism under study (Burkholderia xenovorans LB400) in chromosome 2 (contig 481), starting at position 2225819 and ending at position 2226679. The system also shows the nucleotide sequence for this protein.
6. For a DNA sequence found, visualize the ORF map (using the ORF Viewer) and identify whether there is an ORF that contains a large part of said sequence. If this is not the case, go back to the previous step and choose another sequence.
7. Verify whether the DNA sequence for the ORF found corresponds to the chosen sequence or a similar one (using blastx). If not, choose another sequence.
8. Establish as found the ORF of the enzyme subunit and start searching the surrounding ORFs for sequences capable of coding the other subunits of the enzyme or other enzymes of the metabolic pathway.
9. For the genes that were not found in the surrounding ORFs, repeat the entire process. For the enzyme, the 3D protein model can be obtained and viewed if it exists (using Cn3D as integrated into the GeXpert interface).
10. The process is concluded by the generation of reports with the genes found and with associated documentation that supports the information about the proteins utilized in the metabolic pathway.

Fig. 3. Protein search viewer

Fig. 4. Nucleotide search viewer
(subunits). Figure 3 shows the result of the search for the protein homoprotocatechuate 2,3-dioxygenase. Perform a search for amino acid sequences (proteins) in other organisms. Translate these sequences into nucleotides (using tblastn) and perform an alignment in the genome of the organism under study. If this search does not give positives results it return to the previous step and search another protein sequence. In Figure 4, we show that the protein homoprotocatechuate 2,3-dioxygenase has been found in the organism under study (Burkholderia xenovorans LB400) in the chromosome 2 (contig 481) starting in position 2225819 and ending in region 2226679. Also the system shows the nucleotide sequence for this protein. For a DNA sequence found, visualize the ORF map (using the ORF Viewer) and identify if there is an ORF that contains a large part of said sequence. If this is not the case go back to the previous step and chose another sequence. Verify if the DNA sequence for the ORF found corresponds to the chosen sequence or a similar one (using blastx). If not choose another sequence. Establish as found the ORF of the enzyme subunit and start finding in the surrounding ORFs sequences capable of coding the other subunits of the enzyme or other enzymes of the metabolic pathway. For the genes that were not found in the surrounding ORFs repeat the entire process. For the enzyme, the 3D protein model can be obtained and viewed if it exists (using CN3D as integrated into the GeXpert interface). The process is concluded by the generation of reports with the genes found and with associated related documentation that supports the information about the proteins utilized in the metabolic pathway.
Conclusions
The integrated framework approach presented in this paper is an attempt to enhance the efficiency and capability of bioinformatics researchers. GeXpert is a
demonstration that the integration framework can be used to implement useful bioinformatics applications. GeXpert has so far proven to be a useful tool for our researchers; it is currently being enhanced for functionality and usability improvements. The current objective of the research group is to use GeXpert in order to discover new metabolic pathways for bacteria [29]. In addition to our short-term goals, the following development items are planned for the future: using fuzzy logic in batch mode, web services and a web client, blastx for improved verification of the ORFs determined, a multi-user mode to enable multiple users and groups with potentially different roles to share in a common research effort, peer-to-peer communication to enable the interchange of documents (archives, search results, research items), thus enabling a network of collaboration in different or shared projects, and the use of intelligent/evolutionary algorithms to enable learning based on researcher feedback into GeXpert.
Acknowledgements

This research was partially funded by Fundación Andes.
References
1. Arredondo, T., Neelakanta, P.S., DeGroff, D.: Fuzzy Attributes of a DNA Complex: Development of a Fuzzy Inference Engine for Codon-'Junk' Codon Delineation. Artif. Intell. Med. 35 1-2 (2005) 87-105
2. Barker, J., Thornton, J.: Software Engineering Challenges in bioinformatics. Proceedings of the 26th International Conference on Software Engineering, IEEE (2004)
3. Bernardi, M., Lapi, M., Leo, P., Loglisci, C.: Mining Generalized Association Rules on Biomedical Literature. In: Moonis, A., Esposito, F. (eds): Innovations in Applied Artificial Intelligence. Lect. Notes Artif. Int. 3353 (2005) 500-509
4. Brown, T.A.: Genomes. John Wiley and Sons, NY (1999)
5. Cary, M.P., Bader, G.D., Sander, C.: Pathway information for systems biology (Review Article). FEBS Lett. 579 (2005) 1815-1820
6. Claverlie, J.M.: Bioinformatics for Dummies. Wiley Publishing (2003)
7. Cámara, B., Herrera, C., González, M., Couve, E., Hofer, B., Seeger, M.: From PCBs to highly toxic metabolites by the biphenyl pathway. Environ. Microbiol. (6) (2004) 842-850
8. Cohen, J.: Computer Science and Bioinformatics. Commun. ACM 48 (3) (2005) 72-79
9. Costa, M., Collins, R., Anterola, A., Cochrane, F., Davin, L., Lewis, N.: An in silico assessment of gene function and organization of the phenylpropanoid pathway metabolic networks in Arabidopsis thaliana and limitations thereof. Phytochem. 64 (2003) 1097-1112
10. CO-Drive project: http://hci.cs.concordia.ca/www/hcse/projects/CO-DRIVE/
11. Durbin, R.: Biological Sequence Analysis, Cambridge, UK (2001)
12. Eclipse project: http://www.eclipse.org
13. Entrez NCBI Database: www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Nucleotide
14. Gardner, D.: Using genomics to help predict drug interactions. J. Biomed. Inform. 37 (2004) 139-146
15. GeXpert sourceforge page: http://sourceforge.net/projects/gexpert
16. Javahery, H., Seffah, A., Radhakrishnan, T.: Beyond Power: Making Bioinformatics Tools User-Centered. Commun. ACM 47 11 (2004)
17. Jimenez, J.I., Miambres, B., Garca, J., Daz, E.: Genomic insights in the metabolism of aromatics compounds in Pseudomonas. In: Ramos, J.L. (ed): Pseudomonas, vol. 3. NY: Kluwer Academic Publishers (2004) 425-462
18. Larman, C.: Applying UML and Patterns: An Introduction to Object-Oriented Analysis and Design and Iterative Development. Prentice Hall PTR (2004)
19. Loew, L.W., Schaff, J.C.: The Virtual Cell: a software environment for computational cell biology. Trends Biotechnol. 19 10 (2001)
20. Ma, H., Zeng, A.: Reconstruction of metabolic networks from genome data and analysis of their global structure for various organisms. Bioinformatics 19 (2003) 270-277
21. Magalhaes, J., Toussaint, O.: How bioinformatics can help reverse engineer human aging. Aging Res. Rev. 3 (2004) 125-141
22. Molidor, R., Sturn, A., Maurer, M., Trajanosk, Z.: New trends in bioinformatics: from genome sequence to personalized medicine. Exp. Gerontol. 38 (2003) 1031-1036
23. Neelakanta, P.S., Arredondo, T., Pandya, S., DeGroff, D.: Heuristics of AI-Based Search Engines for Massive Bioinformatic Data-Mining: An Example of Codon/Noncodon Delineation Search in a Binary DNA Sequence. Proceedings of IICAI (2003)
24. Papin, J.A., Price, N.D., Wiback, S.J., Fell, D.A., Palsson, B.O.: Metabolic Pathways in the Post-genome Era. Trends Biochem. Sci. 18 5 (2003)
25. Pocock, M., Down, T., Hubbard, T.: BioJava: Open Source Components for Bioinformatics. ACM SIGBIO Newsletter 20 2 (2000) 10-12
26. Rojdestvenski, I.: VRML metabolic network visualizer. Comp. Bio. Med. 33 (2003)
27. SBML: Systems Biology Markup Language. http://sbml.org/index.psp
28. Segal, T., Barnard, R.: Let the shoemaker make the shoes - An abstraction layer is needed between bioinformatics analysis, tools, data, and equipment: An agenda for the next 5 years. First Asia-Pacific Bioinformatics Conference, Australia (2003)
29. Seeger, M., Timmis, K.N., Hofer, B.: Bacterial pathways for degradation of polychlorinated biphenyls. Mar. Chem. 58 (1997) 327-333
30. Sun, J., Zeng, A.: IdentiCS - Identification of coding sequence and in silico reconstruction of the metabolic network directly from unannotated low-coverage bacterial genome sequence. BMC Bioinformatics 5:112 (2004)
The Probability Distribution of Distance TSS-TLS Is Organism Characteristic and Can Be Used for Promoter Prediction

Yun Dai1, Ren Zhang2, and Yan-Xia Lin3

1 School of Mathematics and Applied Statistics, University of Wollongong, Australia
2 School of Biological Sciences, University of Wollongong, Australia
3 School of Mathematics and Applied Statistics, University of Wollongong, Australia
Abstract. Transcription is a complicated process which involves the interactions of promoter cis-elements with multiple trans-protein factors. The specific interactions rely not only on the specific sequence recognition between the cis- and trans-factors but also on a certain spatial arrangement of the factors in a complex. The relative positioning of the involved cis-elements provides the framework for such a spatial arrangement. The distance distribution between gene transcription and translation start sites (TSS-TLS) is the subject of the present study to test an assumption that over evolution, the TSS-TLS distance becomes a distinct character for a given organism. Four representative organisms (Escherichia coli, Saccharomyces cerevisiae, Arabidopsis thaliana and Homo sapiens) were chosen to study the probability distribution of the distance TSS-TLS. The statistical results show that the distance distributions vary significantly and are not independent of species. There seems to be a trend of increased length of the distances from simple prokaryotic to more complicated eukaryotic organisms. With the specific distance distribution data, computational promoter prediction tools can be improved for higher accuracy.
1 Introduction
Transcription initiation is the first and one of the most important control points in regulating gene expression, and promoters play a pivotal role in this process in all living organisms. A promoter is the nucleotide sequence region upstream of the transcription start site. It includes a core promoter part that governs the basic transcription process and regulatory sequences that control more complicated temporal and spatial expression patterns. It is generally believed that the regulation of transcription relies not only on the specific sequence pattern recognition and interaction between the DNA sequences (cis-elements) and other factors (trans) but also on their spatial organization during the process. This implies that the cis-elements generally need to be located at a defined distance from one another. With the vast amount of genomic data available and the rapid development of computational tools, there is an increasing advancement in the computational
techniques to predict promoters, including recognition of the promoter region, identification of the functional TSS, or both [1, 3, 6, 7]. Many algorithms have been developed for the prediction of promoters, and they vary in performance. In general, the algorithms for promoter recognition can be classified into two groups: the signal-based approach and the content-based approach. There are also methods that combine both, looking for signals and for regions of specific compositions [6]. Although a great deal of research has been undertaken in the area of promoter recognition, the prediction techniques are still far from satisfactory. Most programs inevitably produce a considerable level of false positives (FPs) at any significant level of true positive (TP) recognition. Promoter recognition systems for large-scale screening require acceptable ratios of TP and FP predictions (i.e. those that maximize TP recognition while minimizing FP recognition). The probability distribution of the distance between the transcription start site and translation start site (TSS-TLS) has been studied and utilized in promoter prediction of the E. coli bacterium, a model prokaryotic organism. It has been demonstrated that, by combining the information of the E. coli empirical probability distribution of the distance TSS-TLS with a common neural network promoter prediction tool, NNPP2.2 [8, 10], promoter prediction of E. coli can be improved significantly [2]. This has triggered our further interest in examining the TSS-TLS distance distribution in different living organisms, with the assumption that over evolution, the TSS-TLS distance becomes a distinct character for a given organism. Here, we report our investigation results on four living organisms and discuss the significance and potential application.
2 The Distribution of Distance TSS-TLS
The distance TSS-TLS is defined as the number of base pairs between the TSS and the first nucleotide (A in the case of ATG) of the TLS. For convenience, the following notation will be used in this paper. Given a genome sequence, let s denote the position of the TSS of a gene and let D(s) denote the distance (in bases) between s and its TLS. D is considered a random variable and we hypothesize that, for different organisms, the probability distribution of D would be significantly different. For each organism, the distribution of D should have its own special characteristics. To test our hypothesis, we considered the available data of four different representative organisms in this study: a model bacterium Escherichia coli [4], a model monocellular eukaryote, the yeast Saccharomyces cerevisiae [12], a model higher plant Arabidopsis thaliana [5] and human Homo sapiens [11]. All of them have records of the positions of TSS and TLS identified through laboratory testing. In a genome sequence, sometimes one TLS might correspond to multiple TSSs. Considering the probability distribution of the distance TSS-TLS, all such valid pairs will be counted. Thus, the sample sizes for each species, E. coli, S. cerevisiae, A. thaliana and H. sapiens, are 820, 202, 18958 and 12763 respectively.
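As a small illustration (assuming the TSS/TLS position pairs have already been extracted from the annotation; strand handling is omitted), the distance sample and part of the summary statistics reported in Table 1 below could be computed as follows.

```python
import numpy as np

def tss_tls_distances(pairs):
    """Return the sample of TSS-TLS distances and a few summary statistics
    for one organism, given (TSS, TLS) position pairs."""
    d = np.array([abs(tls - tss) for tss, tls in pairs], dtype=float)
    return d, {
        "mean": d.mean(),
        "sd": d.std(ddof=1),          # sample standard deviation
        "median": float(np.median(d)),
        "min": d.min(),
        "max": d.max(),
        "n": len(d),
    }

# toy example: three genes, one of which has two alternative TSSs
d, stats = tss_tls_distances([(100, 156), (400, 468), (410, 468), (900, 998)])
print(stats["median"], stats["n"])
```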
Table 1. Summary statistics for distance TSS-TLS of different organisms

Organism        Mean      SD        Median  Kurtosis  Skewness  Min  Max      Sample Size
E. coli         98.37     117.90    56      11.73     2.92      0    920      820
S. cerevisiae   110.78    102.90    68      2.77      1.79      9    481      202
A. thaliana     137.20    134.27    98      32.09     3.58      1    3214     18958
H. sapiens      10904.95  44422.41  353     227.42    12.69     1    1313939  12727
We produce the histogram and smoothed probability density function of the distance TSS-TLS for the four organisms and present them in Figures 1 and 2. Summary statistics for the four organisms are given in Table 1. In Figures 1 and 2, all four organisms have positively-skewed distributions. From Table 1, we note that both the mean and the median of the samples increase from E. coli to S. cerevisiae, then to A. thaliana and finally to H. sapiens. Since the distributions are positively skewed, the values of the median are more meaningful. Additionally, the smoothed density functions for E. coli and S. cerevisiae have higher and narrower peaks than those for A. thaliana and H. sapiens. This fact can also be seen from the values of the sample standard deviation given in Table 1. Figures 1 and 2 clearly show that the smoothed density functions for the four organisms are significantly different. Since the sample size from each individual organism is reasonably large and the differences between the smoothed density functions are significant, we are confident that the probability distributions of distance TSS-TLS differ from organism to organism. (This can be tested by a nonparametric method; since the case is so clear, such a test is omitted.)
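One such nonparametric check, sketched here with stand-in samples only (the test itself is omitted in the paper), is the two-sample Kolmogorov-Smirnov test:

```python
import numpy as np
from scipy.stats import ks_2samp

# Illustrative only: stand-in samples drawn from skewed distributions; in
# practice these would be the observed TSS-TLS distances of two organisms.
rng = np.random.default_rng(0)
d_ecoli = rng.gamma(shape=1.0, scale=100.0, size=820)
d_yeast = rng.gamma(shape=1.5, scale=75.0, size=202)

# Two-sample Kolmogorov-Smirnov test of whether the two distance samples
# come from the same distribution.
stat, p_value = ks_2samp(d_ecoli, d_yeast)
print(f"KS statistic = {stat:.3f}, p-value = {p_value:.3g}")
```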
Fig. 1. The histogram and smoothed density of distance TSS-TLS for E. coli and S. cerevisiae
Fig. 2. The histogram and smoothed density of distance TSS-TLS for A. thaliana and H. sapiens
In summary, the distance TSS-TLS of simple prokaryotes tends to be shorter than that of complicated eukaryotes, and the values of the distance TSS-TLS of simple prokaryotes are clustered more closely than those of complicated eukaryotes. The random variable D is related to the TSS position. Therefore, it must contain information about the location of the TSS. The above fact indicates that for different organisms the probability distributions of D are different. This indicates that the amount of information on the TSS position provided by the distribution of D will also differ: some distributions may provide more information than others. For example, Figure 3 shows that the distance TSS-TLS for E. coli and S. cerevisiae might contain more information on the TSS than that for A. thaliana does. (Due to a scaling problem, we are unable to add the smoothed density function for H. sapiens to Figure 3. According to the statistics for H. sapiens, its smoothed density function will be flatter than the other three and the information on the TSS will be less than for the others as well.)
3 Improving Promoter Predictions Through the Information of Distance TSS-TLS
Given a DNA sequence, denote by S the set containing all TSSs of the underlying DNA sequence and by s a position in the DNA sequence. Currently, many promoter prediction tools are able to provide the probability that s is a position of TSS, i.e. P(s ∈ S) (s ∈ S means s is a position for TSS). However, the false positive rate of the predictions is very high, because these tools were unable to take into account any other particular information from the underlying DNA sequence when they were developed, for example NNPP2.2 [9, 8, 10].

Fig. 3. Comparison of the Smoothed Density of the Distance TSS-TLS for E. coli, S. cerevisiae and A. thaliana

In the previous section, we have demonstrated that different species have different probability distributions for the distance TSS-TLS. Therefore, the information of the distance TSS-TLS should benefit promoter prediction. In the following we give a formula to show how we combine the probability P(s ∈ S) with the information of the distance TSS-TLS given by an individual organism to improve TSS prediction for that organism. The probability that position s is a TSS position and that the distance between s and the relevant TLS lies between d − a and d + a can be expressed as

P(s ∈ S and D(s) ∈ [d − a, d + a])    (1)

where a and d are positive integers and D(s) denotes the distance between s and the TLS associated with s. We suggest using this joint probability to replace P(s ∈ S) as a score for the purpose of predicting the TSS position. Probability (1) can be evaluated by using the following formula:

P(s ∈ S and D(s) ∈ [d − a, d + a]) = P(s ∈ S) · P(D(s) ∈ [d − a, d + a] | s ∈ S).    (2)

By using the above formula, the estimation of P(s ∈ S and D(s) ∈ [d − a, d + a]) can be obtained provided that estimates of P(s ∈ S) and P(D(s) ∈ [d − a, d + a] | s ∈ S) are available. The estimation of P(s ∈ S) can be obtained from other promoter prediction techniques, for example NNPP2.2. The estimation of P(D(s) ∈ [d − a, d + a] | s ∈ S), as was done in [2], can be provided by the empirical sample distribution of the species under consideration. If the
estimate of P(s ∈ S) is obtained through NNPP2.2, then, based on the design of NNPP2.2, we always let a = 3 in (1). A previous study on E. coli has shown that using the probability P(s ∈ S and D(s) ∈ [d − a, d + a]) to measure the likelihood of a TSS position significantly reduces the level of false positives [2]. The main reason we suggest using P(s ∈ S and D(s) ∈ [d − a, d + a]) instead of P(s ∈ S) alone to predict the TSS is the following. In practice, once a position s in a DNA sequence is suspected to be a TSS position, we study the position s as well as the gene sequence around it. Therefore, while testing whether s is a true TSS position, we should not ignore the information shown in its neighbourhood. From our data (Figures 1-2), we found that all the density functions are strongly positively skewed. Considering the plots of the histogram and smoothed density of the distance TSS-TLS for A. thaliana and H. sapiens, we found that, once the distance TSS-TLS grows beyond a certain point, the value of the probability density function quickly drops to a very small value, and beyond that point the density function provides very limited information on the distribution of the TSS position. Before that point, however, the probability density function of the distance TSS-TLS has a bell shape within a certain range of distances close to zero. This range varies from species to species, and the centre, the peak and the fit of the bell shape differ from species to species as well. All of this indicates that the information of the distance TSS-TLS within that range might help the prediction of the TSS. In practice, if we are only interested in predicting the TSS within a particular range, formula (2) can be modified as follows:

P(s ∈ S, D(s) ∈ [d − a, d + a] and D(s) < K) = P(s ∈ S) P(D(s) < K | s ∈ S) P(D(s) ∈ [d − a, d + a] | s ∈ S and D(s) < K)    (3)

where the estimate of P(s ∈ S) can be obtained from another promoter prediction tool, say NNPP2.2; for different species, P(D(s) < K | s ∈ S) and P(D(s) ∈ [d − a, d + a] | s ∈ S and D(s) < K) can be estimated from the sample drawn from the population of the underlying species. For example, for the A. thaliana data, K can be chosen as 706. Among the A. thaliana sample (size 18958), 94.77% of the TSSs have a distance TSS-TLS of less than 706. Since the sample size is very large, we expect that 94.77% is a reasonable estimate of the probability P(D(s) < 706 | s ∈ S). Considering A. thaliana genes with distance TSS-TLS less than 706, Figure 4(a) gives the histogram which provides the information on the conditional probability P(D(s) ∈ [d − a, d + a] | s ∈ S and D(s) < 706), where a = 3. Another example is H. sapiens. If K is chosen as 42517, then by the same reasoning P(D(s) < 42517 | s ∈ S) can be estimated from the sample as about 94%, i.e. approximately 94% of H. sapiens genes have a distance TSS-TLS of less than 42517. The histogram for the H. sapiens sample with distance TSS-TLS < 42517 is given in Figure 4(b).
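Formulas (2) and (3) amount to multiplying the base predictor's probability by window probabilities estimated from the empirical distance distribution. The Python sketch below illustrates this computation; the function name, the use of a plain array of sample distances and the handling of the truncation point K are our own choices, and the base probability P(s ∈ S) is assumed to come from an external tool such as NNPP2.2.

    import numpy as np

    def distance_score(p_s, d, sample_distances, a=3, K=None):
        """Joint score P(s in S and D(s) in [d-a, d+a]) per formulas (2)/(3).

        p_s              -- estimated P(s in S) from a base promoter predictor
        d                -- distance between candidate position s and its TLS
        sample_distances -- empirical TSS-TLS distances for the organism
        a                -- half-width of the distance window (a = 3 for NNPP2.2)
        K                -- optional truncation point, as in formula (3)
        """
        dist = np.asarray(sample_distances, dtype=float)
        if K is None:                                   # formula (2)
            p_window = np.mean((dist >= d - a) & (dist <= d + a))
            return p_s * p_window
        p_below_K = np.mean(dist < K)                   # P(D(s) < K | s in S)
        trunc = dist[dist < K]
        if trunc.size == 0:
            return 0.0
        p_window = np.mean((trunc >= d - a) & (trunc <= d + a))
        return p_s * p_below_K * p_window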
[Figure: two panels showing the histogram and smoothed density of the distance TSS-TLS for the truncated samples. Panel (a): A. thaliana, distance < 706; panel (b): H. sapiens, distance < 42517. x-axis: TSS and TLS Distance (base); y-axis: Density Estimate.]
Fig. 4. The histogram and smoothed density of distance TSS-TLS for truncated situations
Figure 4 indicates that the TSS position has a strong relationship with the position of its associated TLS when their distance is less than 706 for A. thaliana and less than 42517 for H. sapiens. Since more than 94% of the distances TSS-TLS are less than 706 for A. thaliana and less than 42517 for H. sapiens, it is sufficient to use (3) instead of (2) to predict the TSS for A. thaliana and H. sapiens.
4 Conclusion
Statistical analysis shows that the position of the TSS of a gene is not independent of that of the TLS in the gene sequence. Our study also revealed significant variation among the organisms tested in the distribution of the distance TSS-TLS. The distance is shortest in the E. coli bacterium and increases through budding yeast, then the Arabidopsis plant and finally human. Whether these differences reflect a certain trend during evolution is worth further investigation. Based on the species-specific distance distribution data, our previously demonstrated TLS-NNPP method [2] can now be used for predicting promoters in other species, including complex eukaryotic ones, with increased accuracy. The idea developed in this paper can also be used to improve other computational promoter prediction approaches as long as the technique is developed for general situations. P.S. All the data and empirical distributions for the four species studied in this paper are available by contacting any one of the authors.
Acknowledgements. This project is supported by a 2006 URC Small Grant and by the Center for Statistical and Survey Methodology, School of Mathematics and Applied Statistics, University of Wollongong.
References
1. Bajic, V.B., Seah, S.H., Chong, A., Krishnan, S.P.T., Koh, J.L.Y. and Brusic, V. (2004). Computer model for recognition of functional transcription start sites in RNA polymerase II promoters of vertebrates. Journal of Molecular Graphics and Modelling, 21, 323-332.
2. Burden, S., Lin, Y.-X. and Zhang, R. (2005). Improving promoter prediction for the NNPP2.2 algorithm: a case study using E. coli DNA sequences. Bioinformatics, 21, 601-607.
3. Huerta, A. M. and Collado-Vides, J. (2003). Sigma 70 Promoters in Escherichia coli: Specific Transcription in Dense Region of Overlapping Promoter-like Signals. J. Mol. Biol., 333, 261-278.
4. Karp, P. D., Riley, M., Saier, M., Paulsen, I.T., Collado-Vides, J., Paley, S., Pellegrini-Toole, A., Paley, S., Bonavides, C. and Gama-Castro, S. (2002). The EcoCyc database. Nucl. Acids Res., 30, 56-58.
5. Molina, C. and Grotewold, E. (2005). Genome wide analysis of Arabidopsis core promoters. BMC Genomics, 6, 1-12.
6. Ohler, U. and Niemann, H. (2001). Identification and analysis of eukaryotic promoters: recent computational approaches. Trends in Genetics, 17, 56-60.
7. Qiu, P. (2003). Computational approaches for deciphering the transcriptional regulatory network by promoter analysis. Biosilico, 1, 235-133.
8. Reese, M. (2003). http://www.fruitly.org/seq tools/promoter.html.
9. Reese, M. and Eeckman, F. (1995). Novel neural network prediction systems for human promoters and splice sites. Proceedings of the Workshop on Gene-Finding and Gene Structure Prediction, Philadelphia, Pennsylvania, edited by D. Searls, J. Fickett, G. Stormo and M. Noordewier.
10. Reese, M. G. (2001). Application of a time-delay neural network to promoter annotation in the Drosophila melanogaster genome. Computers and Chemistry, 26, 51-56.
11. Suzuki, Y., Yamashita, R., Sugano, S. and Nakai, K. (2004). DBTSS, DataBase of Transcriptional Start Sites: progress report 2004. Nucleic Acids Research, 32, 78-81.
12. Zhu, J. and Zhang, M. Q. (1998). SCPD: a promoter database of the yeast Saccharomyces cerevisiae. Bioinformatics, 15, 607-611.
Protein Stability Engineering in Staphylococcal Nuclease Using an AI-Neural Network Hybrid System and a Genetic Algorithm
Christopher M. Frenz
Department of Biochemistry, New York Medical College, Valhalla, NY 10595
Phone: (914) 714-5635, [email protected]
Abstract. With the recent increases in the use of protein-based therapeutics for the treatment of disease, protein stability engineering is an area of growing importance. Feed forward neural networks are trained to predict mutation induced protein stability changes in Staphylococcal nuclease to an accuracy of 93%. These neural networks are then implemented as part of a larger AI framework that is capable of engineering mutations within a protein sequence to yield the target stability or as the basis for a fitness function of a genetic algorithm capable of producing a population of mutation containing sequences of the desired stability. The AI neural network hybrid and genetic algorithm based approaches are significantly faster than other computational methods of protein engineering and can be utilized for proteins in which structural information is lacking.
1 Introduction
Protein-based therapeutics, such as insulin, human growth hormones, and antibodies, are increasingly used as the basis for pharmaceutical treatments. One of the key factors in the success of any protein therapeutic is the stability of the protein, since this has a direct effect on both the active shelf life of the treatment and its active life within the human body. Thus, in order to maximize the lifespan and effectiveness of a treatment, engineering for increased protein stability is often essential [1]. The stability of a protein is determined by a diversity of factors, such as electrostatics, hydrophobicity, and residue packing, and these complex intermolecular interactions often make stability prediction, and hence stability engineering, a challenge. This is especially true when multiple mutations are required, since studies have demonstrated that the stability of proteins containing multiple mutations is not always equal to the sum of the stability changes induced by each individual mutation [2]. Furthermore, the challenge of stability engineering is heightened by the need to maintain the activity of the protein when introducing stabilizing mutations. Current methods rely on structural models to address this by determining which stabilizing mutations produce minimal shifts in the backbone of a given protein fold, in order to ensure that the active site geometry and/or flexibility remains relatively unchanged [3].
Current methods of examining mutation-induced protein stability changes include molecular dynamics simulations, which are computationally intensive, and the duration of the calculations often makes them impractical. Other, less computationally intensive methods have involved the statistical analysis of tolerable amino acid substitutions within known elements of tertiary structure, or calculations of the electrostatic interactions within the tertiary structure, thereby limiting the applicability of these techniques to proteins whose structures are known [4,5]. Recently a neural network based approach to protein stability prediction was utilized for Staphylococcal nuclease. The input to this network was based on evolutionary similarities between residues in the wild type protein sequence and the mutation-bearing sequences, rather than on tertiary structure, and the technique allowed for the rapid prediction of a protein stability score relative to that of the wild type [6]. Recent studies conducted by Capriotti et al. and Cheng et al. have also demonstrated that machine learning approaches can be utilized to predict mutational effects on protein stability from sequence information [7,8]. Each of these approaches utilized support vector machines and was able to predict the effects of single-site mutations across a range of proteins. Machine learning based methodologies offer the additional potential advantage of reducing mutation-induced effects on activity, if activity screening is utilized during training set creation, since positions where mutations are shown to have a detrimental effect on activity can be eliminated at this stage. This study seeks to apply neural networks trained to predict protein stability to protein stability engineering tasks, by demonstrating an artificial intelligence (AI)-neural network hybrid system that can pinpoint mutations and/or combinations of mutations that will yield a protein of the target stability, as well as genetic algorithms that can evolve a population of sequences with the desired stability.
2 Methods
2.1 Preparation of Input Patterns
Mutation effects on stability correspond to both the position of the mutation within the protein structure and the severity of the mutation at that position. Input patterns were prepared using amino acid similarity tables based on the evolutionary frequency of amino acid substitutions in homologous proteins, derived by Jones et al. [9]. The more frequently an amino acid replaces another over evolutionary time, the more similar those amino acids are, and accordingly the greater the similarity between two residues the greater the likelihood that tertiary structure and stability are maintained [7]. The wild type amino acid sequence was set as the reference sequence. Each mutation-containing sequence was then parsed into its individual residue constituents and its similarity to the wild type residue at the corresponding position calculated as described by Frenz [6]. In this way, an array of similarity values was generated for each sequence. All array/residue positions that were never mutated were then removed, since the stability values of these positions never change and hence would provide no information to the network. The finalized array for any given mutant sequence consisted of the similarity scores obtained for each of the 20 possible mutation sites utilized in the study [6]. The 20 mutation sites were based on staphylococcal nuclease stability data collected by Green & Shortle [10] and are represented by the single-residue mutation data shown in Table 1. Green & Shortle used guanidine hydrochloride denaturation to determine the stability changes induced by amino acid mutations relative to the wild type. Each of the 20 similarity scores in these reduced arrays was then input into one of the input neurons of the neural network.
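A minimal sketch of this input-pattern preparation is given below. It is an illustration only: the similarity table, the list of the 20 retained positions and the convention used for identical residues are placeholders, not the values actually derived from the Jones et al. matrix in [6, 9].

    def encode_mutant(wild_type, mutant, positions, similarity):
        """Build the 20-dimensional input array for one mutant sequence.

        wild_type, mutant -- protein sequences of equal length
        positions         -- residue positions (0-based) retained as network inputs
        similarity        -- dict mapping (wild_aa, mutant_aa) -> similarity score,
                             e.g. derived from an evolutionary substitution matrix
        """
        pattern = []
        for pos in positions:
            w, m = wild_type[pos], mutant[pos]
            # identical residues get the maximal similarity by convention (assumed)
            pattern.append(1.0 if w == m else similarity[(w, m)])
        return pattern

    # Hypothetical usage (toy values, not the published similarity table):
    # sites = [6, 14, 17, 22]                      # subset of the 20 mutated positions
    # x = encode_mutant(wt_seq, mut_seq, sites, sim_table)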
2.2 Neural Network Construction and Training
The type of network employed in this study is a feed-forward neural network that learns with adaptive-learning-rate back-propagation. It consists of three layers: an input layer, a hidden layer, and an output layer laid out in a 20-6-1 architecture, with each layer fully interconnected to the preceding layer by a weight matrix. The input layer is used to transfer the array of similarity measures into the neural network. A bias is also added to each neuron in the hidden and output layers, which scales the neuron's input values before they pass through the neuron's bipolar sigmoid transfer function. The transfer function for the output unit was a bipolar sigmoid scaled between 0.55 and 1.3, yielding a relative stability value that the network was trained to associate with the input pattern [6].
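The original implementation used the MATLAB Neural Network Toolbox. As an illustration only, the forward pass of such a 20-6-1 network with bipolar sigmoid (tanh) units and an output rescaled to the 0.55-1.3 range can be sketched in Python as follows; the random weight initialisation and the exact form of the output scaling are our assumptions, and back-propagation training is omitted.

    import numpy as np

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(scale=0.1, size=(6, 20)), np.zeros(6)   # input -> hidden
    W2, b2 = rng.normal(scale=0.1, size=(1, 6)), np.zeros(1)    # hidden -> output

    def forward(x):
        """20-6-1 feed-forward pass with bipolar sigmoid (tanh) units.

        The output is rescaled from (-1, 1) to the (0.55, 1.3) range used for
        relative stability values (scaling interpretation is an assumption).
        """
        h = np.tanh(W1 @ x + b1)
        y = np.tanh(W2 @ h + b2)
        return 0.55 + (y + 1.0) * (1.3 - 0.55) / 2.0

    # forward(np.zeros(20)) returns a single relative-stability value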
Table 1. Mutation and relative stability data used to train and validate the neural network

Mutation   Relative Stability   Mutation            Relative Stability
WT         1.00                 A69T                1.13
I15V       1.05                 A90S                1.10
I72V       1.02                 I15V+Y85A (1)       1.09
Y85A       1.01                 I72V+Y113A (1)      1.10
I92V       1.01                 I18M+T33S (1)       1.09
Y113A      0.99                 A69T+A90S (1)       1.3
A130G      0.96                 L7A+G79S (2)        0.8
L7A        0.89                 V23F+L37A (2)       0.73
V23F       0.84                 L37A+E75V (2)       0.79
L37A       0.89                 T62A+V66L (2)       0.68
T62A       0.88                 T62A+G88V (3)       0.69
V66L       0.82                 E75V+P117L (3)      0.78
E75V       0.79                 P117L+N118D (3)     0.86
G79D       0.87                 V23F+A69T (3)       1.17
G79S       0.82                 V23F+A90S (4)       1.06
G88V       0.78                 L7A+I92V (4)        0.96
P117L      0.92                 V23F+A130G (4)      0.76
N118D      0.86                 T33S+L37A (4)       1.01
I18M       1.07                 V23F+G79S (5)       0.59
T33S       1.09                 I15V+L37A (5)       1.01
T62G       1.08                 I72V+Y85A (5)       1.07

The number in parentheses indicates the cross-validation subset.
The network used in this study was programmed in MATLAB Release 12, utilizing the Neural Network Toolbox. A training set for the aforementioned neural network was constructed from the mutant sequence and relative stability pairs of Staphylococcal nuclease described by Green & Shortle [10]. Over-training was avoided by using five subsets of the 42 mutant sequences in the training set as cross-validation sequences (Table 1). These subset sequences were removed from the training set and used to determine whether over-training had occurred. Training was conducted until the prediction set error dropped below 10^-2 or stopped improving with continued training epochs. Once this occurred, training was stopped and the validation sequences were input into the neural network. If the network was able to predict the relative stabilities of the mutant validation sequences to within 10% of the target values, the neural network architecture was saved. If the prediction error was off by more than 10%, the network was discarded and a new neural network was trained [6]. Five networks were trained in this manner (one for each subset) and the finalized results were the averages obtained from these 5 networks.
2.3 Artificial Intelligence Construction
The major components of an artificial intelligence system are a knowledge base and a control mechanism that governs how the knowledge in the knowledge base is used to reach a given goal. In AI-neural network hybrid systems, trained neural networks can serve as the knowledge base component [11]. In this particular case, the five trained neural networks make up the entire knowledge base, since the average of these networks provides an accurate representation of the effects mutational severity and position have upon the stability of staphylococcal nuclease and can thus be considered to encompass a reasonable body of knowledge upon which further decisions can be made. The control mechanism of this AI hybrid system utilizes depth-first search. The system is initially presented with a target stability range, and begins to engineer a suitable mutant by taking the wild-type sequence and making a mutation within that sequence. This mutated sequence is then input into the neural network and a stability prediction is made. If the prediction is in the target range, the sequence is output and the calculation is complete. If the prediction is outside the target range, the control mechanism considers whether or not the mutation has brought the stability closer to the target stability. If yes, the mutation is kept and an additional mutation is made in an attempt to reach the target. If no, the mutation is discarded and another mutation is made. This process is repeated until a sequence of the desired stability is reached. The AI logic also allows the maximum number of allowable mutations to be set by the user, and hence the depth-first search algorithm will backtrack and consider a path invalid if it cannot reach a solution with a number of mutations less than or equal to this cutoff value. This is a desirable feature, since a solution that uses only a small number of mutations is less likely to adversely affect the function of the protein. It is also advisable that mutations only be made at residues that are not highly conserved or crucial for protein function, in order to preserve function.
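A sketch of the control mechanism described above is given below. It is not the authors' code: the predictor is assumed to be a callable wrapping the averaged network outputs, the set of allowed single mutations is supplied by the user, and the "closer to the target" test is implemented with the distance to the midpoint of the target range, which is one reasonable reading of the description.

    def engineer(sequence, target_lo, target_hi, predict, allowed, max_mut, depth=0):
        """Depth-first search for a mutant whose predicted stability hits the target.

        predict  -- callable returning the (averaged) network prediction for a sequence
        allowed  -- list of (position, residue) single mutations the search may apply
        max_mut  -- maximum number of mutations permitted in a solution
        """
        score = predict(sequence)
        if target_lo <= score <= target_hi:
            return sequence                          # goal reached
        if depth >= max_mut:
            return None                              # backtrack: path invalid
        target_mid = (target_lo + target_hi) / 2.0
        for pos, aa in allowed:
            if sequence[pos] == aa:
                continue
            mutant = sequence[:pos] + aa + sequence[pos + 1:]
            # keep a mutation only if it moves the prediction toward the target
            if abs(predict(mutant) - target_mid) < abs(score - target_mid):
                result = engineer(mutant, target_lo, target_hi, predict,
                                  allowed, max_mut, depth + 1)
                if result is not None:
                    return result
        return None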
2.4 Genetic Algorithm Construction
Genetic algorithms allow for the simulation of evolutionary processes in silico [11]. For the purposes of this simulation, each of the twenty residue positions utilized in the neural network was considered a gene and each of the twenty possible amino acids that could occupy that position was considered an allele. The trained neural networks formed the basis of a fitness function in which fitness was defined as a predicted stability within an acceptable target range. An initial population of 5000 sequences, each containing a single random mutation, was created and the fitness function applied to this population. All sequences deemed unfit were removed from the population, and up to 100 randomly chosen sequences that met the fitness criteria were each allowed to breed with a second random member of the population. Breeding generates offspring with a genetic makeup comprised of alleles from each parent sequence: for each residue position (gene) there is a 50% chance of the allele coming from either parent. During breeding there is also a 15% chance that a random mutation is introduced in any given gene, which simulates genetic drift and helps to ensure that the simulation results in a heterogeneous population at the end. At the end of the breeding step, the fitness function is applied again and the unfit members are removed from the population. Simulations are run for a total of 50 successive generations. At the end of the simulation it is possible to screen the resulting population on other desired criteria, such as eliminating members that contain more than a maximum allowable number of mutations.
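The genetic algorithm of Section 2.4 can be sketched as follows. The population-handling details (how the second parent is drawn and how survivors and offspring are pooled) are our own simplifications of the description above, and the stability predictor is again assumed to be available as a callable.

    import random

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

    def evolve(wild_type, positions, predict, lo, hi,
               pop_size=5000, breeders=100, generations=50, mut_rate=0.15):
        """Evolve sequences whose predicted stability lies in the target range."""
        def fit(seq):
            return lo <= predict(seq) <= hi

        def random_point_mutant(seq):
            s = list(seq)
            s[random.choice(positions)] = random.choice(AMINO_ACIDS)
            return "".join(s)

        population = [random_point_mutant(wild_type) for _ in range(pop_size)]
        for _ in range(generations):
            population = [s for s in population if fit(s)]     # remove unfit members
            if len(population) < 2:
                break
            offspring = []
            for p1 in random.sample(population, min(breeders, len(population))):
                p2 = random.choice(population)
                child = list(p1)
                for pos in positions:                 # 50% chance each gene comes from p2
                    if random.random() < 0.5:
                        child[pos] = p2[pos]
                    if random.random() < mut_rate:    # simulated genetic drift
                        child[pos] = random.choice(AMINO_ACIDS)
                offspring.append("".join(child))
            population += offspring
        return [s for s in population if fit(s)]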
3 Results and Discussion
3.1 Neural Network Results
The neural network is fed both mutant residue similarity information (the similarity score) and mutant residue position information (the location in the array). Through training, the network learns which positions are most crucial to stability along with the effects of mutation severity at each position. The network learns how multiple inputs work together to produce an output, and in this way learns to associate multiple-mutation patterns with a given stability. An overall accuracy of 92.8% was obtained for 55 Staphylococcal nuclease unknowns, and some representative predictions are displayed in Figure 1. Furthermore, the neural networks were able to accurately predict the effects of non-additive mutation combinations, such as L37A/G79S/P117L, which, based on the summation of the stability effects of the constituent mutations, should be significantly more destabilizing than observed [6]. In addition to the accuracy of the technique, the other major advantage is that the neural network based predictions are not computationally intensive. Neural network training takes only 3 minutes on a 600 MHz Pentium III processor and predictions for unknowns are instantaneous. This makes protein engineering tractable in finite time even on a desktop computer, since it enables the artificial intelligence component and the genetic algorithms to evaluate hundreds of mutation combinations in a period of several seconds, rather than the weeks to months that could be expected from structure-based techniques.
[Figure: comparison of representative neural network stability predictions with the experimentally determined values.]
Fig. 1. Representative neural network predictions compared to the values experimentally determined by Green & Shortle [10]
The other advantage of this neural network based methodology is that no explicit structural information is required, which means that this technique is especially viable when the structure of the target protein remains unknown. The one disadvantage of this technique is that the neural network can only make accurate predictions for residue positions that it was trained to work with. However, with the advent of site-directed mutagenesis, a significant number of mutations have already been characterized.
3.2 AI-Neural Network Hybrid Results
The AI-neural network hybrid system can be an effective means of engineering protein sequences that correspond to a target stability. When asked to engineer a mutant that has 1.07 times the stability of the wild type with a three-mutation maximum, the AI-neural network hybrid suggests the mutations T33G/Y113C (Figure 2). This is a probable mutation combination, since T33S mutants are shown to be 1.08 times as stable as the wild type and mutation of Y113 to a non-aromatic uncharged residue had no significant effect on stability [10]. Both serine (S) and glycine (G) are aliphatic amino acids with side chains that are shorter in length than threonine (T). When asked to engineer a mutant that has a stability of 0.83 that of the wild type, the hybrid system suggested a G88A mutation, which is also probable, since mutation of G88 to the longer aliphatic amino acid valine destabilizes the protein to 0.79 that of the wild type [10]. Alanine is an aliphatic amino acid that is intermediate in length between glycine and valine, and hence it is likely to be destabilizing but not to the same extent as valine.
Fig. 2. AI-neural network hybrid output for the engineering of a stabilizing and destabilizing mutant protein sequence
3.3 Genetic Algorithms Results
The genetic algorithm is a more computationally intensive approach than the AI-neural network hybrid system, yet it offers the advantage of evolving multiple solutions within the targeted stability range. A representative simulation, which requested that the population members have a relative stability of 1.07 ± 2, yielded 72 distinct solutions. In the initial generation, a sequence containing a T33D mutation was determined to be fit, since it yielded a predicted stability of 1.09 times that of the wild type; this is probable, since hydrogen bonding calculations in the Swiss PDB viewer indicate that the aspartate (D) mutant allows for improved H-bonding over threonine. By the end of the simulation, 69% of these sequences contained this single mutation. Additional generations built upon this initial point mutation by incorporating additional mutations into the sequence, which further modified the resultant stability scores to yield sequences whose stabilities encompassed the full range of fit stabilities.
4 Conclusions
Both the AI-neural network hybrid and the genetic algorithms are effective methods of protein stability engineering, yet each provides distinct advantages. The depth-first-search based AI algorithm is less computationally intensive and provides a means of arriving at a single solution to the stability-engineering problem. The genetic algorithm approach is more computationally intensive, yet the populations of sequences generated offer researchers a variety of mutation combinations to choose from, which has the benefit of allowing them to choose the mutation combinations that are best suited for their stability engineering needs. These methodologies illustrate that neural network based methods of protein stability engineering are viable alternatives in cases where structure-based methods are not feasible, either due to limitations of processing time or a lack
of structural information. The utility of machine learning based approaches for purposes of enzyme engineering is also further supported by recent work in which machine learning based methodologies were used to engineer proteinase K variants with activities up to 45 times that of the wild type enzyme [12]. These findings suggest that machine learning based approaches may be applicable to a diversity of protein engineering tasks and further suggest the potential for machine learning based systems that could be used to simultaneously engineer enhancements in stability and activity.
References
[1] Cleland, J.L., Jones, A.J.S., Craik, C.S.: Introduction to protein engineering. In: Cleland, J.L. & Craik, C.S. (Eds.), Protein Engineering: Principles and Practice, Wiley-Liss, NY, (1996) 1-32.
[2] Braxton, S.: Protein engineering for stability. In: Cleland, J.L., Craik, C.S. (Eds.), Protein Engineering: Principles and Practice, Wiley-Liss, NY, (1996) 299-316.
[3] Korkegian, A., Black, M.E., Baker, D., Stoddard, B.L.: Computational thermostabilization of an enzyme. Science 308 (2005) 857-860.
[4] Borjesson, U., Hunenberger, P.H.: Effect of mutations involving charged residues on the stability of staphylococcal nuclease: a continuum electrostatics study. Protein Engineering 16 (2003) 831-40.
[5] Topham, C.M., Srinivasan, N., Blundell, T.L.: Prediction of the stability of protein mutants based on structural environment-dependent amino acid substitution and propensity tables. Protein Engineering 10 (1997) 7-21.
[6] Frenz, C.M.: Neural network-based prediction of mutation induced protein stability changes in Staphylococcal nuclease at 20 residue positions. Proteins: Structure, Function, Bioinformatics, 59 (2005) 147-151.
[7] Capriotti, E., Fariselli, P., Calabrese, R., Casidio, R.: Predicting protein stability changes from sequences using support vector machines. Bioinformatics 21 Suppl 2 (2005) ii54-ii58.
[8] Cheng, J., Randall, A., Baldi, P.: Prediction of protein stability changes for single-site mutations using support vector machines. Proteins: Structure, Function, Bioinformatics, 62 (2006) 1125-32.
[9] Jones, D.T., Taylor, W.R., Thornton, J.M.: The rapid generation of mutation data matrices from protein sequences. CABIOS 8 (1992) 275-82.
[10] Green, S.M., Shortle, D.: Patterns of nonadditivity between pairs of stability mutations in staphylococcal nuclease. Biochemistry 32 (1993) 10131-9.
[11] Tsoukalas, L.H., Zadeh, L.A., Uhrig, R.E., Hines, J.W.: Fuzzy and Neural Approaches in Engineering, Wiley, NY (1996).
[12] Gustafsson, C., Liao, J., Warmuth, S., Ness, J., Wang, R., Minshull, J.: Engineering proteinase K using machine learning and synthetic genes. PepTalk (2006) Abstract Book.
Identification of Over and Under Expressed Genes Mediating Allergic Asthma
Rajat K. De (1) and Anindya Bhattacharya (2)
(1) Machine Intelligence Unit, Indian Statistical Institute, 203 B. T. Road, Kolkata 700108, India, [email protected]
(2) Department of Computer Science & Engineering, Netaji Subhash Engineering College, Techno City, Garia, Kolkata 700152, India
Abstract. The present article focuses on identifying some of the genes mediating the development of asthma. Here we apply a pattern recognition based approach to identify the genes that are severely over or under expressed in the allergen samples. The methodology involves clustering on gene expression and fold values, followed by determining the similarity/dissimilarity among various clusters and measuring the extent of over/under expression of genes. From this analysis we have identified several genes that have significantly changed their expression values in the asthmatic condition, and we report them in the present article. Some of these observations are supported by earlier investigations. Others have stayed unnoticed so far, but may play a crucial role in mediating the development of asthma. Keywords: mouse, clustering, fold value, Jaccard score, DB-index.
1 Introduction
Modern genome research, aided by high-throughput technologies, is producing an enormous amount of data that has to be organized, classified and interpreted. In functional genomics, it is increasingly appreciated that various cellular processes are rooted in the dynamic interactions among their many constituents, such as DNA, RNA, proteins and small molecules. This leads to the emergence of some challenging problems, including the determination of functions of genes/proteins, interactions among genes/proteins and proteins/proteins, and pathway analysis. Identification of genes whose expression patterns mediate a particular disease is also of great interest. Asthma is an inflammatory disease that remains poorly understood and hard to control. This disease is characterized by airway hyper reactivity (AHR, defined by exaggerated airflow obstruction in response to bronchoconstrictors), mucus overproduction and chronic eosinophilic inflammation. AHR and mucus overproduction are consistently linked to asthma symptoms and morbidity. Asthma is thought to be mediated by Th2 lymphocytes, which produce a limited repertoire of cytokines, including interleukin-13 (IL-13) [1, 2, 3, 4].
In this article, we report a new methodology based on clustering for identifying a set of genes mediating the development of asthma. These genes are either over expressed or under expressed in the allergen samples as compared to the normal ones, and are selected from a large set of genes using a methodology involving the Partitioning Around Medoids (PAM) [5] and Fuzzy c-means (FCM) [6] clustering algorithms. The selection considers the oligonucleotide microarray gene expression data GDS958 [7], dealing with the expression patterns of as many as 22690 genes in both normal and allergen samples. The clustering algorithms have been applied to gene expression values as well as fold values (the ratio of expression values of genes in allergen and normal samples). This is followed by determining the similarity/dissimilarity among various clusters and measuring the extent of over/under expression of genes. A set of genes mediating the development of asthma has been newly identified, along with those already reported by some earlier investigations [1, 3, 8].
2 Data Description
The oligonucleotide microarray gene expression data (GDS958) used here were obtained from mouse lung tissue [7]. The data set contains samples from two strains, termed Wild Type mouse and IL-13 Knocked Out mouse samples. Each of these Wild Type and IL-13 Knocked Out mouse samples can be either an allergen sample or a control sample. Two different treatments were used for both the Wild Type strain and the IL-13 Knocked Out strain: (i) House Dust Mite (HDM) and (ii) Phosphate Buffered Saline (PBS). Allergen samples were obtained from the HDM treated Wild Type and IL-13 Knocked Out strains, whereas control samples were obtained from the PBS treated Wild Type and IL-13 Knocked Out strains. Thus we have four different types of sample: (1) HDM treated Wild Type allergen samples, (2) HDM treated IL-13 Knocked Out allergen samples, (3) PBS treated Wild Type control samples and (4) PBS treated IL-13 Knocked Out control samples. In the present investigation, we considered only the samples of the Wild Type strain, i.e., (1) HDM treated Wild Type allergen samples and (2) PBS treated Wild Type control samples. The data set GDS958 contains expression patterns for 22690 genes obtained from six Wild Type samples. Three samples, GSM21415, GSM21418 and GSM21420, are HDM treated Wild Type allergen samples, whereas GSM21422, GSM21424 and GSM21426 are PBS treated Wild Type control samples. Further information on these data is available at [7]. It is to be mentioned here that the data set contains expression profiles of cytokines, including IL-13, IL-4 and IL-5, and their receptors, known as key mediators of the allergen-induced immediate development of asthma, owing to hyper-reactivity of the airway and mucus overproduction in the lung as an immediate response to the house dust mite allergen. The data set was last updated on December 12, 2004, so it is expected to contain recent information.
3 Methodology
The methodology we have used here is based on the well known partitioning around medoids (PAM) and fuzzy c-means (FCM) clustering algorithms. Both algorithms require an expected number k of clusters as input for their execution. For example, if we give k = 3 as input, the algorithms partition the entire data set into 3 clusters. Thus selecting a suitable value of k is crucial for these algorithms. In order to find the best k for a set of data points to be clustered, we have used a cluster validity index, namely the Davies-Bouldin index (DB-index) [9], which varies with the number of clusters k. The lower the value of the DB-index, the better the clustering, and vice versa. For details of the clustering algorithms and the expression for the DB-index, one may refer to [5, 9]. These clustering algorithms have been executed on (i) the expression values of genes given in the data set and (ii) the fold values of genes defined in Section 3.2.
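The DB-index itself is not written out in the text (see [9]); the Python sketch below uses its usual form, the average over clusters of the worst-case ratio of within-cluster scatter to between-centre distance, and is intended only as an illustration of how k can be selected.

    import numpy as np

    def db_index(X, labels, centers):
        """Davies-Bouldin index; lower values indicate better clustering [9].

        X       -- (n_points, n_features) data matrix
        labels  -- cluster label of each point
        centers -- (k, n_features) cluster centres
        """
        X, labels, centers = map(np.asarray, (X, labels, centers))
        k = len(centers)
        scatter = np.array([
            np.mean(np.linalg.norm(X[labels == i] - centers[i], axis=1))
            for i in range(k)
        ])
        worst = []
        for i in range(k):
            ratios = [(scatter[i] + scatter[j]) / np.linalg.norm(centers[i] - centers[j])
                      for j in range(k) if j != i]
            worst.append(max(ratios))
        return float(np.mean(worst))

    # Typical use: run the clustering for k = 2, 3, ... and keep the k with the
    # smallest db_index value.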
3.1 Clustering on Expression Values
Clustering on expression values considers gene expression values in both normal (control) and allergen samples. Let there be n normal samples and m allergen samples. Each gene present in the data set is considered a data point or pattern; that is, each data point (i.e., each gene) has dimension n for the wildtype-normal set and m for the wildtype-allergen set. Each of these sets of genes was grouped into k clusters using the clustering algorithms, where we select the k for which the DB-index attains a minimum. Thus for the wildtype-normal samples we have (say) knormal control clusters S1, S2, . . . , Sknormal containing genes with different domains of expression values. Similarly, for the wildtype-allergen samples, kallergen clusters T1, T2, . . . , Tkallergen consist of genes with different domains of expression values. The genes in any one of these clusters are expected to be co-expressed, whereas genes in different clusters should have different domains of expression values. Next we find the relationship (i.e., similarity or dissimilarity) between the control and allergen clusters. For this purpose we use the Jaccard score, which is widely used in the literature [10]. Assume that the ith control cluster Si has the highest degree of similarity with the jth allergen cluster Tj; it is then expected that the domains of expression values of genes in Si and Tj are similar. If Si and Ti contain genes in control and allergen samples, respectively, with high expression values, then it is expected that most of the genes (with high expression values) in cluster Si are also present in cluster Ti. If a gene g is present in Ti but not in Si, then we may infer that gene g is over expressed in the allergen samples. Thus the genes in the set (Ti − (Si ∩ Ti)) are over expressed in the allergen samples. On the other hand, if both Tj and Sj contain genes with low expression values, the presence of gene g in Tj but not in Sj leads to the inference that gene g is under expressed in the allergen samples. Thus the set (Tj − (Sj ∩ Tj)) contains only the genes that are under expressed in the allergen samples.
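The cluster-matching step can be illustrated with the short Python sketch below. The representation of clusters as sets of gene identifiers and the function names are our own; whether the returned set difference T − (S ∩ T) is interpreted as over or under expressed depends, as described above, on whether the matched clusters hold high or low expression values.

    def jaccard(cluster_a, cluster_b):
        """Jaccard score between two clusters given as sets of gene identifiers."""
        a, b = set(cluster_a), set(cluster_b)
        return len(a & b) / len(a | b) if a | b else 0.0

    def match_and_difference(control_clusters, allergen_clusters):
        """Pair each control cluster with its most similar allergen cluster and
        return, per pair, the genes present in the allergen cluster only."""
        result = {}
        for name, S in control_clusters.items():
            best = max(allergen_clusters,
                       key=lambda t: jaccard(S, allergen_clusters[t]))
            T = allergen_clusters[best]
            result[(name, best)] = set(T) - (set(S) & set(T))   # T - (S intersect T)
        return result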
3.2 Clustering on Fold Values
Apart from the clustering on expression values, the genes were also grouped by these clustering algorithms on their fold values, which are defined below. Let there be n control samples and m allergen samples, and let the expression value of a gene g in the ith control sample be cg^i and that in the jth allergen sample be ag^j. Then the fold value fg^ij of gene g is defined as

fg^ij = ag^j / cg^i.    (1)

Thus we have mn fold values for a gene g. We form an mn-dimensional fold vector vg corresponding to gene g as vg = [fg^11, fg^12, . . . , fg^ij, . . . , fg^mn]^T. If the fold value fg^ij of a gene g is nearly equal to one, the expression values of gene g in the ith control and jth allergen samples are almost equal; therefore, we may say that gene g has no role in mediating the development of asthma. A fold value fg^ij greater than one (or less than one) indicates that gene g is over expressed (or under expressed) in the jth allergen sample compared to the ith control sample. It is to be mentioned here that, in an ideal situation, if gene g has no role in mediating asthma, the corresponding fold vector would be close to the vector v0 = [1, 1, . . . , 1]^T, where all the components of v0 are 1. In order to compute an average extent of over or under expression of the genes in an allergen cluster, we measure the Euclidean distance Davg between the corresponding cluster center and v0, given by

Davg = ||vA − v0||,

where vA = [vA^11, vA^12, . . . , vA^ij, . . . , vA^mn]^T, i = 1, 2, . . . , n; j = 1, 2, . . . , m, is the center of an allergen cluster. The higher the value of Davg, the higher the extent of over or under expression of the genes contained in the allergen cluster. Moreover, in order to determine whether Davg indicates an average over or under expression, we use a term SIGN defined as

SIGN = Σ_{i=1}^{n} Σ_{j=1}^{m} (vA^ij − 1).
If SIGN > 0 then Davg indicates an average over expression. On the other hand, if SIGN < 0 then Davg indicates an average under expression. For SIGN > 0, the highest Davg value indicates that the genes in the allergen cluster have the highest extent of over expression (on average). Similarly, for SIGN < 0, the highest Davg value indicates the highest extent of under expression (on average).
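The fold vector, Davg and SIGN computations can be sketched as follows (an illustration only; the ordering of the mn fold values is immaterial for the distance and the sign).

    import numpy as np

    def fold_vector(control, allergen):
        """mn-dimensional fold vector of a gene: every allergen/control expression ratio."""
        c = np.asarray(control, dtype=float)    # n control expression values
        a = np.asarray(allergen, dtype=float)   # m allergen expression values
        return (a[None, :] / c[:, None]).ravel()

    def cluster_direction(center):
        """Davg and SIGN for an allergen-cluster centre in fold-vector space."""
        v = np.asarray(center, dtype=float)
        v0 = np.ones_like(v)
        d_avg = np.linalg.norm(v - v0)
        sign = np.sum(v - 1.0)
        return d_avg, ("over" if sign > 0 else "under")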
4 Results and Analysis
Here we identify some genes mediating the development of asthma, by applying the aforesaid methodology based on the clustering algorithms (PAM and FCM)
on the gene expression data set GDS958 containing 22690 genes. Some of the genes found to be over or under expressed in the allergen samples were not known earlier. The clustering algorithms were applied to both the expression values and the fold values of the genes.
4.1 On Expression Values
In the case of clustering based on the expression values of the data set GDS958, the dimension of the data point corresponding to a gene in both the normal (GSM21422, GSM21424 and GSM21426) and the allergen (GSM21415, GSM21418 and GSM21420) samples is three, as n = m = 3. Here we have provided k = 2 as input to both algorithms, as the DB-index attains its minimum at k = 2 for both PAM and FCM (Figure 1). Thus we have obtained two control clusters (S1 and S2) and two allergen clusters (T1 and T2). We have computed the Jaccard score to determine the similarity between control and allergen clusters. From the Jaccard scores, we have found that cluster S1 is the most similar to T1 and S2 to T2. In fact, the clusters S1 and T1 contain genes with low expression values, and S2 and T2 contain those with high expression values. Thus we infer that the genes present in S1 but not in T1 are over expressed in the allergen samples. Similarly, the genes in S2 but not in T2 are under expressed in the allergen samples. By applying both PAM and FCM, we have found a few genes, out of the entire set of 22690 genes, which are over expressed in the allergen samples. Some of these observations were already made in [11, 12, 13, 14]. In the present investigation, we have found some new genes, including Gpnmb (glycoprotein (transmembrane) nmb), Fxyd4 (FXYD domain-containing ion transport regulator 4), Serpina3n (serine (or cysteine) proteinase inhibitor, clade A, member 3N), Slc26a4 (solute carrier family 26, member 4), Igl-V1 (immunoglobulin lambda chain, variable 1), A430103C15Rik (RIKEN cDNA A430103C15 gene), Igh-4 (immunoglobulin heavy chain 4 (serum IgG1)), AI324046 (expressed sequence AI324046), IgK-V1 (immunoglobulin kappa chain variable 1 (V1)), IgK-V5 (immunoglobulin kappa chain variable 5 (V5 family)) and Muc5ac (mucin 5, subtypes A and C, tracheobronchial/gastric), which are over expressed in the allergen samples.
Fig. 1. Variation of DB-index computed on expression values with respect to k
None of them were previously found to be responsible for asthma, but in our experiment the very large changes in expression values for all of these genes in the allergen samples suggest that they may have a very important role in mediating the development of asthma. Likewise, the set of significantly under expressed genes found in the allergen samples includes Ppp1r2 (protein phosphatase 1, regulatory (inhibitor) subunit 2), Cyp2a4 (cytochrome P450, family 2, subfamily a, polypeptide 4), Tncc (troponin C, cardiac/slow skeletal), Dusp16 (dual specificity phosphatase 16), Scnn1b (sodium channel, nonvoltage-gated 1 beta) and Slc7a10 (solute carrier family 7 (cationic amino acid transporter, y+ system), member 10). These genes have newly been found to be under expressed in the allergen samples. Some observations that are in accordance with earlier investigations include the over expression of Clca3 (chloride channel calcium activated 3) [12], Ear11 (eosinophil-associated, ribonuclease A family, member 11) [8] and Sprr2a (small proline-rich protein 2A) [15] in the allergen samples.
4.2 On Fold Values
The clustering algorithms were also executed on the fold values. Here we are looking for three clusters, corresponding to (i) the genes not (over/under) expressed in the allergen samples, (ii) those over expressed in the allergen samples and (iii) those under expressed in the allergen samples. Thus we have considered k = 3 as input to these algorithms. We have three control samples (GSM21422, GSM21424 and GSM21426), i.e., n = 3, and three allergen samples (GSM21415, GSM21418 and GSM21420), i.e., m = 3, so the fold vector vg of a gene g has dimension 9. It is to be mentioned here that the clusters containing over or under expressed genes may also include genes that were not expressed much differently from the control samples. In order to find the genes that were much over or under expressed compared to the control samples, we further divided the sets of over or under expressed genes. For this purpose, we used the DB-index to find the number of clusters into which each of these sets is to be divided further. The cluster containing the over expressed genes was divided into 2 sub-clusters (for both PAM and FCM), as the DB-index attains a minimum at k = 2 (Figure 2). The distances of these 2 cluster centers from the vector v0 were computed. These distances show the extent of over expression of the genes (contained in the corresponding cluster) in the allergen samples as compared to the control samples. The genes in the cluster with center farthest from the vector v0 have the highest degree of over expression. Considering the cluster containing the most over expressed genes, as shown by the distance from its center to the vector v0, 158 genes were found to be the most over expressed in the allergen samples by both PAM and FCM. Out of these 158 genes, Gpnmb, Fxyd4, Serpina3n, Slc26a4, Igl-V1, A430103C15Rik, Igh-4, AI324046, IgK-V1, IgK-V5 and Muc5ac are some examples which were also found to be over expressed based on the expression values of the genes.
Fig. 2. Variation of DB-index computed on fold values of over expressed genes with respect to k
Fig. 3. Variation of DB-index computed on fold values of under expressed genes with respect to k
Similarly, the cluster containing the under expressed genes was divided into 2 sub-clusters for PAM and 3 sub-clusters for FCM, as the DB-index attains its minimum at k = 2 for PAM and at k = 3 for FCM (Figure 3). As in the case of the over expressed genes, the distances of all the cluster centers from the vector v0 are computed to check the extent of under expression. The cluster containing the genes with the highest extent of under expression in the allergen samples compared to the control samples is selected based on these distance values; using this distance measure, the farthest cluster is identified. As in the case of clustering based on expression values, Ppp1r2, Cyp2a4, Tncc, Dusp16, Scnn1b and Slc7a10 are some examples of under expressed genes.
Table 1. List of genes newly found over expressed in allergen samples

GenBank Accession Number   Gene Title                                                        Gene Symbol
NM_053110                  glycoprotein (transmembrane) nmb                                  Gpnmb
NM_033648                  FXYD domain-containing ion transport regulator 4                  Fxyd4
NM_009252                  serine (or cysteine) proteinase inhibitor, clade A, member 3N     Serpina3n
NM_011867                  solute carrier family 26, member 4                                Slc26a4
AK008145                   immunoglobulin lambda chain, variable 1                           Igl-V1
BC024402                   RIKEN cDNA A430103C15 gene                                        A430103C15Rik
BC008237                   immunoglobulin heavy chain 4 (serum IgG1)                         Igh-4
AK007826                   immunoglobulin heavy chain (J558 family)                          Igh-VJ558
BE687919                   expressed sequence AI324046                                       AI324046
Z95478                     immunoglobulin kappa chain variable 1 (V1)                        Igk-V1
U29768                     immunoglobulin kappa chain variable 5 (V5 family)                 Igk-V5
BM250782                   tumor necrosis factor receptor superfamily, member 9              Tnfrsf9
AK008656                   mucin 5, subtypes A and C, tracheobronchial/gastric               Muc5ac
BB113173                   cDNA sequence BC032204                                            BC032204
Table 2. List of genes newly found under expressed in allergen samples

GenBank Accession Number   Gene Title                                                                  Gene Symbol
NM_025800                  protein phosphatase 1, regulatory (inhibitor) subunit 2                     Ppp1r2
NM_007812                  cytochrome P450, family 2, subfamily a, polypeptide 4                       Cyp2a4
NM_009393                  troponin C, cardiac/slow skeletal                                           Tncc
NM_130447                  dual specificity phosphatase 16                                             Dusp16
NM_011325                  sodium channel, nonvoltage-gated 1 beta                                     Scnn1b
NM_017394                  solute carrier family 7 (cationic amino acid transporter, y+ system),       Slc7a10
                           member 10
In order to restrict the size of the article, we have not reported all of these over/under expressed genes here. Thus we have found some new genes, including Gpnmb, Fxyd4, Serpina3n, Slc26a4, Igl-V1, A430103C15Rik, Igh-4, AI324046, IgK-V1, IgK-V5 and Muc5ac, which were found to be over expressed in the allergen samples by both PAM and FCM executed on both expression and fold values (Table 1). Similarly, Ppp1r2, Cyp2a4, Tncc, Dusp16, Scnn1b and Slc7a10 were found to be under expressed in the allergen samples (Table 2). We may infer from this that these genes must play an important role in mediating the development of asthma.
4.3 Immunoglobulins (Ig) and Their Activators
In order to describe how these newly found over expressed genes mediate the development of asthma, we consider the immunoglobulins, which are well known mediators of asthma. First, let us review the various classes and the basic structure of immunoglobulins. Immunoglobulins are glycoprotein molecules that are produced by plasma cells in response to an immunogen and which function as antibodies. An immunoglobulin (Ig) consists of 2 identical light chains (L) and 2 identical heavy chains (H). At the three-dimensional level, an immunoglobulin (Ig) chain consists of one N-terminal variable domain, V, and one (for an L chain) or several (for an H chain) C-terminal constant domain(s), C. Immunoglobulins can be divided into five different classes, based on differences in the amino acid sequences in the constant region of the heavy chains: (1) IgG - gamma heavy chains, (2) IgM - mu heavy chains, (3) IgA - alpha heavy chains, (4) IgD - delta heavy chains and (5) IgE - epsilon heavy chains.
(1) IgK (kappa) genes with light chains: In our analysis we have found IgKV1 and IgKV5 significantly over expressed in the asthma samples. These are IgK genes, and their products form the N-terminal variable domain of immunoglobulins. (2) IgL (lambda) genes with light chains: Similarly, IgLV1 has been found to be significantly over expressed in the asthma samples. These are IgL genes, and their products also form the N-terminal variable domain of immunoglobulins. (3) IgH (heavy) genes with heavy chains: The heavy chain genes IgH4, IgHVJ558 and AI324046 have also been found to be significantly over expressed in the asthma samples. These are IgH genes. Thus these 6 genes take part in the construction of immunoglobulins. Immunoglobulin E (IgE), a class of immunoglobulins, has been identified as the key molecule in mediating allergic asthma, and the production of IgE is significantly increased in the allergic asthma condition. Hence the increase in the expression levels of these 6 genes causes an increase in the production of immunoglobulins, thereby playing a significant role in mediating allergic asthma.
5 Conclusions
The present investigation demonstrates a pattern recognition based approach for the identification of genes mediating the development of asthma. The methodology involves the application of clustering algorithms to the expression and fold values of genes, followed by determining the similarity/dissimilarity among various clusters and measuring the extent of over and under expression of the genes contained in different clusters. On applying this methodology, we have identified the IL-13, Spp1, Clca3, Ctsk, Ear11, Sprr2a, Chi3l3, Arg1 and Itln genes as heavily over expressed in the asthma samples. Previous work and biological evidence suggest that these genes are responsible for asthma. We have also identified several new genes, by the methodology applied to both expression and fold values, including Gpnmb, Fxyd4, Serpina3n, Slc26a4, Igl-V1, A430103C15Rik, Igh-4, AI324046, IgK-V1, IgK-V5 and Muc5ac, that are severely over expressed in the allergic asthma condition. Similarly, we have found several genes, including Ppp1r2, Cyp2a4, Tncc, Dusp16, Scnn1b and Slc7a10, that are heavily under expressed in the asthmatic condition. Therefore, these genes may be considered mediators of asthma.
References
1. Grunig, G., Warnock, M., Wakil, A.E., Venkayya, R., Brombacher, F., Rennick, D.M., Sheppard, D., Mohrs, M., Donaldson, D.D., Locksley, R.M., Corry, D.B.: Requirement for IL-13 independently of IL-4 in experimental asthma. Science 282 (1998) 2261-2263
2. Wills-Karp, M., Luyimbazi, J., Xu, X., Schofield, B., Neben, T.Y., Karp, C.L., Donaldson, D.D.: Interleukin-13: Central mediator of allergic asthma. Science 282 (1998) 2258-2261
3. Wills-Karp, M.: Interleukin-13 in asthma pathogenesis. Current Allergy and Asthma Reports 4 (2004) 123-131
4. Welch, J.S., Escoubet-Lozach, L., Sykes, D.B., Liddiard, K., Greaves, D.R., Glass, C.K.: Th2 cytokines and allergic challenge induce Ym1 expression in macrophages by a STAT6-dependent mechanism. The Journal of Biological Chemistry 277 (2002) 42821-42829
5. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann, CA, USA (2001)
6. Bezdek, J.C.: Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York (1981)
7. http://www.ncbi.nlm.nih.gov/projects/geo/gds/gds browse.cgi?gds=958
8. Cormier, S.A., Yuan, S., Crosby, J.R., Protheroe, C.A., Dimina, D.M., Hines, E.M., Lee, N.A., Lee, J.J.: Th2-mediated pulmonary inflammation leads to the differential expression of ribonuclease genes by alveolar macrophages. American Journal of Respiratory Cell and Molecular Biology 27 (2002) 679-687
9. Davies, D.L., Bouldin, D.W.: A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence 1 (1979) 224-227
10. Jain, A.K., Murty, M.N., Flyn, P.J.: Data clustering: A review. ACM Computing Surveys 31 (1999) 264-323
11. Godat, E., Lecaille, F., Desmazes, C., Duchene, S., Weidauer, E., Saftig, P., Bromme, D., Vandier, C., Lalmanach, G.: Cathepsin K: a cysteine protease with unique kinin-degrading properties. Biochemical Journal 383 (Pt 3) (2004) 501-506
12. Nakanishi, A., Morita, S.: Role of gob-5 in mucus overproduction and airway hyperresponsiveness in asthma. In: Proceedings of the National Academy of Sciences of the United States of America, USA (2001) 5175-5180
13. Vercelli, D.: Arginase: marker, effector, or candidate gene for asthma? The Journal of Clinical Investigation 3 (2003) 1815-1817
14. Zimmermann, N., King, N.E., Laporte, J., Yang, M., Mishra, A., Pope, S.M., Muntel, E.E., Witte, D.P., Pegg, A.A., Foster, P.S., Hamid, Q., Rothenberg, M.E.: Dissection of experimental asthma with DNA microarray analysis identifies arginase in asthma pathogenesis. The Journal of Clinical Investigation 3 (2003) 1863-1874
15. Zimmermann, N., Doepker, M., Witte, D., Stringer, K., Fulkerson, P., Pope, S., Brandt, E., Mishra, A., King, N., Nikolaidis, N., Wills-Karp, M., Finkelman, F., Rothenberg, M.: Expression and regulation of small proline-rich protein 2 in allergic inflammation. American Journal of Respiratory Cell and Molecular Biology 32 (2005) 428-435
Correlogram-Based Method for Comparing Biological Sequences
Debasis Mitra (1), Gandhali Samant (1), and Kuntal Sengupta (2)
(1) Department of Computer Sciences, Florida Institute of Technology, Melbourne, Florida, USA, {dmitra, gsamant}@fit.edu
(2) Authentec Corporation, Melbourne, Florida, USA
Abstract. In this article we propose an abstract representation of a sequence using a constant-sized 3D matrix. The representation may subsequently be utilized for many analytical purposes. We have attempted to use it for comparing sequences, and have analyzed the method's asymptotic complexity. Providing a metric for sequence comparison is an operation underlying many bioinformatics applications. In order to show the effectiveness of the proposed sequence comparison technique, we have generated phylogenies over two sets of bio-sequences and compared them with the ones available in the literature. The results show that our technique is comparable to the standard ones. The technique, called the correlogram-based method, is borrowed from the image analysis area. We have also performed experiments with synthetically generated sequences in order to compare the correlogram-based method with the well-known dynamic programming method. Finally, we discuss some other possibilities for how our method can be used or extended.
1 Introduction
Sequence comparison constitutes one of the most fundamental operations in many problems in bioinformatics. For this reason, many sequence comparison techniques have been developed in the literature, sometimes targeting specific problems. In this article we propose a novel comparison method and show a few of its usages. Possibly the most accurate sequence comparison technique is the dynamic programming algorithm of Smith and Waterman [1981]. The primary objective of this algorithm is to align two sequences optimally. When the objective is to come up with a distance or a similarity value between two sequences, global alignment provides a mechanism to achieve that; the value of the optimizing function in that case is typically utilized as a similarity parameter. Another popular and efficient sequence comparison method is the BLAST algorithm [Altschul, 1990]. However, its primary purpose is to find highly similar (homologous) subsequences shared by two bio-sequences; BLAST is a problem-specific algorithm and is not a competitor to our method.
Our proposed technique is based on a similar method introduced for comparing two images [Huang et al, 1999]. While an image is a two-dimensional organization of pixels, a bio-sequence is a one-dimensional organization of characters from a finite alphabet (typically nucleic acids or amino acids). We create a mathematical representation of a sequence and subsequently use that representation for the comparison purpose. In the following sections we describe the method (section 2) and some experiments toward using it for sequence comparison (section 3). We conclude with a discussion of a few other possibilities for using the correlogram representations of sequences (section 4). The concept of a correlogram has been used in the field of bioinformatics before. Macchiato et al. [1995] used correlograms to analyze autocorrelation characteristics of active polypeptides. Further, correlograms have been used for analyzing spatial patterns in various experiments, e.g., Bertorelle et al [1995] used correlograms to study DNA diversity. Rosenberg et al [2003] used correlograms in their studies regarding patterns of transitional mutation biases within and among mammalian genomes. However, the representation has not been used for sequence comparison purposes before.
2 Correlogram and Its Usage in Sequence Comparison
2.1 Correlogram of a Sequence
Let a sequence be indicated by s = a1a2 . . . an, where |s| = n and, for all i, ai ∈ Σ; Σ is the finite alphabet of nucleic acids ({A, T, G, C} for DNA or {A, U, G, C} for RNA) or of the twenty amino acids for a protein. Let |Σ| = m (m is 4 or 20). Definition 1. A correlogram for s is a 3-dimensional matrix of size (m x m x d), where 0 < d < n is a chosen maximum distance; the entry Corrs(x, y, i), for x, y ∈ Σ and 0 ≤ i ≤ d, holds the normalized frequency with which the character y occurs exactly i positions after the character x in s.
Fig. 1. A shell of a Correlogram over {A,T,G,C} and d=3
Fig. 2. A layer for a sample correlogram for S
The correlogram-plane for the distance i=0 is nothing but a normalized histogram of the frequencies of occurrence of the characters in the sequence; the corresponding plane (for i=0) is a 2D diagonal matrix.
2.2 Computing Correlogram
The following algorithm computes a correlogram given an input sequence s and a d value.
Algorithm ComputeCorrelogram (string S, integer d)
// Let S = a1a2 . . . an, where each aj ∈ Σ
(1) for each x, y ∈ Σ do
(2)   for i = 0 through d do
(3)     CorrS(x, y, i) = 0;
(4) for i = 0 through d do
(5)   for integer j = 1 through (n - i) do
(6)     CorrS(aj, aj+i, i) = CorrS(aj, aj+i, i) + 1/(n - i);
(7) return CorrS;
End Algorithm.
The complexity of the initialization in lines 1 through 3 is O(m^2 d), where |Σ| = m. The complexity of the main computing loops in lines 4 through 6 is O(nd). So, the total complexity is O(max{m^2 d, nd}). For a large sequence n >> m, and hence the complexity is O(nd). Also, since m is a constant (m = 20 or 4), and so is d, ComputeCorrelogram is a linear algorithm with respect to the sequence length n.
2.3 Using Correlograms to Compare Sequences
Once a sequence is transformed into a correlogram it is possible to measure a distance between two correlograms corresponding to two sequences. Definition 2. The distance between two sequences S and T is lST = L(CorrS, CorrT), where the function L is one of the standard L-norms of distance metrics.
For the L0-norm: lST = Σ_{x,y ∈ Σ, 0 ≤ i ≤ d} |CorrS(x, y, i) - CorrT(x, y, i)| / (|S| + |T| + 1)
For the L1-norm: lST = sqrt( Σ_{x,y ∈ Σ, 0 ≤ i ≤ d} [CorrS(x, y, i) - CorrT(x, y, i)]^2 ) / (|S| + |T| + 1)
Higher order L-norm distance metrics may be defined accordingly. Since both correlogram matrices are of the same dimension, the computation of any of these L-norm distances is of the order of O(m^2 d). We used the L1-norm for our experiments. The distance measure is apparently a metric, as evidenced in some of our preliminary experiments (not presented here).
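For concreteness, here is a minimal Python sketch of the correlogram construction and of a correlogram distance in the spirit of Definition 2; it is an illustration under our own naming, not the authors' implementation, and it shows the sum-of-absolute-differences variant (the experiments reported in this paper used the square-root form instead).

    from itertools import product

    def compute_correlogram(s, d, alphabet):
        # corr[(x, y, i)] = normalized frequency of y occurring exactly i positions after x in s
        # (assumes 0 < d < len(s), as in Definition 1)
        n = len(s)
        corr = {(x, y, i): 0.0 for x, y in product(alphabet, repeat=2) for i in range(d + 1)}
        for i in range(d + 1):
            for j in range(n - i):
                corr[(s[j], s[j + i], i)] += 1.0 / (n - i)
        return corr

    def correlogram_distance(s, t, d, alphabet):
        # sum of absolute differences between the two correlograms,
        # scaled by (|S| + |T| + 1) as in Definition 2
        cs = compute_correlogram(s, d, alphabet)
        ct = compute_correlogram(t, d, alphabet)
        return sum(abs(cs[k] - ct[k]) for k in cs) / (len(s) + len(t) + 1)

    # example usage on two short DNA strings
    print(correlogram_distance("AGCTTAGTCT", "CTAGCTTAGT", d=3, alphabet="ATGC"))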
3 Experiments We have done three sets of experiments in order to study the effectiveness of the correlogram method in the sequence comparison. They are described below. 3.1 Experiments with Synthetic Data In this set of experiments we compared our proposed technique with Smith-Waterman’s [1981] Dynamic Programming (DP) method over some synthetically generated sequences. For some experiments, we start with a target sequence S and deform it systematically to S’ and measure the distance (or similarity with DP) between the two sequences (lSS’) using the two methods (correlogram and DP). In other experiments, we start with two arbitrary sequences S1 and S2 and deform one of them (S2) systematically to S2’ and measure how the distance between them (lS1S2’) changes with the
deformation. Such experiments with synthetic sequences have never been done before, to the best of our knowledge. For lack of space we provide only some sample results of our experiments from this set [for details see the Tech Report, Samant et al, 2005]. The same conclusion holds over all such experiments: the correlogram method is more sensitive to the deformation of a sequence than the DP method.
Fig. 3. Correlogram vs. DP scores against character deletion positions (iterations)
Figure 3 shows the result (lS1S2’) obtained by deleting a character from the second sequence S2. The position j of the deleted character is systematically varied between 1≤ j ≤n, for |S2|=n. As expected, DP is not very sensitive to the position of the character being deleted, whereas the correlogram method shows some fluctuation as the position progresses over the string S2. A cautionary note here is that the absolute values of the two methods should not be compared as the two methods measure different aspects – distances and similarities. Rather their relative change with respect to the control parameter (position of deletion in this experiment) should be compared. Our conclusion is that the correlogram method, overall, is more suitable for comparing sequences when character deletion takes place. This is expected, as we create a richer abstract representation (correlograms) for each of the sequences before we compare the two sequences, vis a vis the DP method. Figure 4 shows the results from a similar experiment where a sequence is wrapped around systematically (first with one character, then with 2 characters, and so on – “iterations” on X-axis in the figure indicates this number), and the distance is measured between the original and the deformed sequence. For example, a wrapping around of string AGCTTAGTCT for i=2 is, CTAGCTTAGT. The higher sensitivity of the correlogram-based method is evidenced in Fig 4 as well. Other experiments, done with systematic character addition, and by reversal of a sequence, also provided the same conclusion as drawn from the character deletion-experiment.
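To make the two deformations concrete, here is a tiny Python sketch of how such deformed sequences can be generated; the helper names are ours, and the exact generation procedure used in the experiments is not specified beyond this description.

    def delete_at(seq, j):
        # remove the character at 1-based position j (the character-deletion deformation)
        return seq[:j - 1] + seq[j:]

    def wrap_around(seq, i):
        # move the last i characters to the front (assumes 1 <= i < len(seq)),
        # e.g. wrap_around("AGCTTAGTCT", 2) == "CTAGCTTAGT" as in the example above
        return seq[-i:] + seq[:-i]

    print(delete_at("AGCTTAGTCT", 3))
    print(wrap_around("AGCTTAGTCT", 2))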
Fig. 4. Correlogram vs. DP scores against systematic circular permutation of the source sequence
3.2 Experiment with Equine Influenza Virus
In this experiment we used a set of protein sequences that is important for immunity against the horse influenza virus. The protein sequences are products of the hemagglutinin (HA) gene, which has gone through multiple mutations over ten years (1990 through 2000) as infected horses moved through different parts of the USA. We constructed the phylogeny tree from the distance values generated by the correlogram method and compared the tree with a standard work on the same data available in the literature [Lai et al, 2004]. Figures 5a and 5b show the two trees over the set of viruses (SA90/AF197243, SU90/X68437, LM92/X85087, HK92/L27597, KY91/L39918, KY92/L39917, KY94/L39914, KY95/AF197247, KY96/AF197248, KY97/AF197249, KY98/AF197241, FL93/L39916, FL94/AF197242, AR93/L39913, AR94/AF197245, AR95/AF197244, AR96/AF197246, NY99/AY273167, OK00/AY273168). The first part of each element in this set is the common identifier of the corresponding influenza-A virus, whereas the second part is the respective accession number in the database (EMBL-EBI, European Bioinformatics Institute, http://www.ebi.ac.uk/cgi-bin/expasyfetch). The strings are a few hundred characters long. Lai et al [2004] used GeneTool version 1.1 (http://www.biotools.com/products/genetool.html) to generate the distance matrix and then ran the Phylip software (Neighbor-joining method) from the University of Washington to draw the tree (Fig 5a). We used the correlogram method to generate the distance matrix and used the same program from the Phylip package (http://evolution.genetics.washington.edu/phylip.html) for the phylogeny construction. The two trees are small enough to be compared manually. The similarity between the two trees justifies the usability of the correlogram method in drawing phylogenies. The minor differences between the two trees necessitate further investigation of their biological significance.
3.3 Experiment with Parvo-Virus
The parvovirus family resides in the intestines of higher organisms. These viruses are known to cause illness and death in children and are a focus of medical research. The RNA sequences of the viruses of this family from different organisms have been used, and we have measured the distances between the sequences using our proposed method. The sequences are from the set (B19 virus, Bovine parvovirus, Canine parvovirus strain B, Feline panleukopenia virus (strain 193), Murine minute virus (strain MVMI), Porcine parvovirus (strain NADL-2), Raccoon parvovirus, Adeno-associated virus 2, and Galleria mellonella densovirus) studied in the literature [Chapman et al, 1993]. The strings are around 5000 characters long. We have drawn a phylogeny tree from the distance matrix generated over the family. Again, the striking similarity (Figures 6a and 6b) with Chapman et al's tree demonstrates the strength of the correlogram-based method.
Fig. 5. Phylogeny trees of the Horse Influenza HA1, a. Lai et al (2004), b. Correlogram-based
Fig. 6. Phylogeny trees of the Parvovirus RNA sequences, a. Chapman et al (1993), b. Correlogram-based
The two viruses AAV2 and B19 have low sequence similarity but high structural similarity compared to other viruses. Their coming closer to each other in the correlogram-based phylogeny suggests that our proposed method has a stronger capability to classify structures. However, this needs further experimentation to verify.
4 Conclusion and Pointers
In this research we have proposed a new approach toward sequence representation and investigated a few of its capabilities. In order to fully understand the significance of the representation, many new questions and challenges need to be addressed. Some of them are posed below.
4.1 Information Content of a Correlogram
Is a correlogram reversible? In other words, can one reconstruct a string given a correlogram? Obviously, when the distance range of the correlogram matrix is maximal, or d = |s| - 1 for a string s, no information is lost and the corresponding correlogram should be reversible. A relevant question is: does there exist a smaller list of integers i < |s| such that a correlogram over this list of i's (refer to the discussion in section 2.1) will create a loss-less representation? In that case, the correlogram representation may be used for string encryption purposes.
4.2 Using Correlogram for Finding Patterns
Any sequence comparison method can be utilized for finding a given pattern P over a longer target string T (|T| > |P|). However, the cost of the method could be prohibitive for such a pattern-finding purpose. For example, the DP method has quadratic time-complexity for each comparison (P with each subsequence of T of size |P|) over scanning the target sequence. We have implemented a modified version of the correlogram method that scans and searches for a pattern over a target sequence in linear time [Samant et al, 2005]. As our experiments indicate that the correlogram method may have a potential for finding structurally similar patterns, it may have significant impact in bioinformatics. For example, protein docking may be expedited by such a candidate pattern-matching pre-scan.
4.3 Extending Correlogram with Gap Handling Capability
When a pair of characters, say AG, appears at a distance i=7 on a sequence (as A - - - - - G), it may appear at a distance 8 or 6 on a corresponding mutated sequence, where a new character is inserted into or deleted from the original sequence. With the objective of using the correlogram-based representation for comparing such modified sequences, we extended the basic correlogram technique. In this Gapped-correlogram representation of a sequence we make a weighted distribution of the frequency counts over the respective adjacent cells of the basic correlogram (over the distance i). Thus, e1*Freq(x, y, i) is added to Freq(x, y, i+1) and to Freq(x, y, i-1), and e2*Freq(x, y, i)
is added to Freq(x, y, i+2) and to Freq(x, y, i-2), where 0 < e1, e2 < 1 are the weight factors. For example, e1 and e2 may be 0.5 and 0.25 respectively. This type of weighting can be extended to an arbitrary number of adjacent cells, not just to two cell-distances. The weight vector itself may be normalized as a probability distribution that adds up to 1, e.g., e0 + 2*(e1 + e2) = 1 above, where e0 is the weight factor for Freq(x, y, i) itself, which was e0 = 1 before. The gapped-correlogram has obvious biological appeal in sequence comparison. However, our preliminary experiments with gapped-correlograms over the problems addressed here did not show any significant difference from the results obtained using the basic correlograms [Samant et al, 2005]. We suspect broader experiments may show some impact. Acknowledgement. Mavis McKenna provided some data and insight for this work.
References
Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990) "Basic local alignment search tool," Journal of Molecular Biology, 215:403-410.
Bertorelle, G., and Barbujani, G. (1995) "Analysis of DNA diversity by spatial autocorrelation," Genetics, 140(2):811-819.
Chapman, M. S., and Rossmann, M. G. (1993) "Structure, Sequence and Function Correlations among Parvoviruses," Virology, 194(2):491-508.
Huang, J., Mitra, M., Zhu, W.J. and Zabih, R. (1999) "Image Indexing using color correlograms," International Journal of Computer Vision, 35(3):245-268.
Lai, A. C.K., Rogers, K. M., Glaser, A., Tudor, L., and Chambers, T. (2004) "Alternate circulation of recent equine-2 influenza viruses (H3N8) from two distinct lineages in the United States," Virus Research, 100(2):159-164.
Macchiato, M. F., Cuomo, V., and Tramontano, A. (1995) "Determination of the autocorrelation orders of proteins," Genetics, 140:811-819.
Rosenberg, M. S., Subramanian, S., and Kumar, S. (2003) "Patterns of Transitional Mutation Biases Within and Among Mammalian Genomes," Molecular Biology and Evolution, 20(6):988-993.
Samant, G., and Mitra, D. (2005) "Correlogram Method for Comparing Bio-Sequences," Florida Institute of Technology Technical Report No. CS-2006-01, http://www.cs.fit.edu/~tr/cs-2006-01.pdf
Smith, T.F., and Waterman, M.S. (1981) "Identification of common molecular subsequences," Journal of Molecular Biology, 147:195-197.
Learning Genetic Regulatory Network Connectivity from Time Series Data Nathan Barker, Chris Myers, and Hiroyuki Kuwahara University of Utah, 50 S. Central Campus Dr. Salt Lake City UT, 84112 {barkern, myers, kuwahara}@vlsigroup.ece.utah.edu
Abstract. Recent experimental advances facilitate the collection of time series data that indicate which genes in a cell are expressed. This paper proposes an efficient method to generate the genetic regulatory network inferred from time series data. Our method first encodes the data into levels. Next, it determines the set of potential parents for each gene based upon the probability of the gene’s expression increasing. After a subset of potential parents are selected, it determines if any genes in this set may have a combined effect. Finally, the potential sets of parents are competed against each other to determine the final set of parents. The result is a directed graph representation of the genetic network’s repression and activation connections. Our results on synthetic data generated from models for several genetic networks with tight feedback are promising.
1 Introduction
High throughput technologies such as cDNA microarrays and oligonucleotide chips can collect gene expression data on a large number of genes in parallel which can potentially be used to determine the connectivity of genetic regulatory networks [1]. These techniques measure gene activity and by inference the amount of the protein in the cell produced by each gene. They can be used to gather time series data in which a series of measurements of gene activity is taken over a period of time. Many techniques exist to analyze gene expression data. One approach is clustering which groups genes with similar expression patterns [2]. Other approaches modify machine learning algorithms to create Boolean functions that would describe when a gene should be active [3, 4, 5, 6]. Many other methods such as the work by Friedman et al. use Bayesian analysis to learn genetic networks [7]. These methods consider each gene as a variable in a joint probability distribution. If knowledge of gene A’s expression level yields information about gene B, the genes are said to be correlated. In this case, one gene may be a (potential) parent of the other. Recently, Sachs et al. applied a Bayesian analysis to find causal connectivities in the human T-cell signaling network [8].
This material is based upon work supported by the National Science Foundation under Grant No. 0331270.
Unfortunately, Bayesian approaches have several limitations. First, Bayesian approaches often have difficulty determining which gene is the parent and which is the child. Second, these approaches rely primarily on correlational data and only use time to separate the data points. Third, these approaches can only be applied to networks that are acyclic. This last one we consider a major limitation since feedback control is a crucial component of genetic regulatory networks. To address some of these limitations, Dynamic Bayesian Networks (DBN) have been introduced which incorporate an aspect of time to allow networks with cycles to be found. One example is the work by Yu et al. that uses two different scoring functions to find the best matching network to their data [9]. They first generate a DBN to limit the number of parents considered for a gene since the second step is exponential in the number of parents. Next, the algorithm assigns an influence score to each potential connection. To find this score, they build a cumulative distribution function (CDF) for the child’s potential parents. A CDF is basically a set of bins that indicate the number of times the genes are seen in a given configuration or an earlier one. For example, a configuration may be that proteins produced by all genes are at their highest state. If as the parent’s value increases the resulting CDF is more positive than the previous one the parent is considered an activator. If the resulting CDF is more negative, then the parent is considered a repressor. As there can be many different parents and the data can be noisy, each case contributes a vote on the likelihood of activation or repression. To evaluate their method, they generate synthetic data for 10 genetic networks that include 20 genes in which 8 to 12 are connected to at least one other gene. Using this synthetic data, their method is shown to be fairly successful at recovering the initial networks given sufficient data. They still do, however, occasionally have problems determining the direction of influence. Time is also still only used to separate data points. Finally, only two of their networks have any feedback (i.e., cycles), and these cycles are several genes long. This paper, therefore, presents a new method that targets learning genetic networks with feedback and in particular tight feedback. While this method is similar to the work by Yu et al. it differs in several key ways. First, while Yu’s method begins with an expensive search for a best fit network, our method begins with a local analysis to determine the best potential parents for each gene. Second, our method utilizes the child’s own level in the calculations to further help in determining a child’s parent genes. Third, Yu’s method neglects the time series nature of the data and simply computes a CDF that represents correlation within each individual experimental data point. Our method considers two sequential time series data points to find the percentage of time a child gene rises in a given configuration. We believe that by utilizing the time series nature of the data, our method can potentially better determine activation and repression behavior especially in the case where there is tight feedback in the network. In other words, we feel that not only the context of when a gene is high or low gives information, but more importantly the context in which it rises. Our method is evaluated using synthetic data generated from models of several genetic networks that include tight feedback. 
Our results indicate that our method is more successful than Yu’s in learning these types of networks.
2 Formalisms
Our method produces a directed graph model of the genetic regulatory network that indicates which gene products activate and repress other genes. A directed graph model is the tuple ⟨S, C⟩ where S is the set of species in the network and also the vertices of the graph. Connections between species, or edges in the graph, are represented by C ⊆ 2^(S×I) × S. Each connection, (P, s) ∈ C, is composed of a set of parents, P, for a child, s. Each parent in P is paired with its type of influence (i.e., I = {a, r}, where 'a' represents activation and 'r' represents repression). A directed graph representation of the genes in the phage λ decision network is shown in Figure 1(a) [10]. The 'a' edges are represented by → and the 'r' edges are represented by ⊣. For example, N → CII implies that the proteins produced from gene N activate the gene CII. If there are n species, there are an astronomical 2^(4^n) possible network topologies.
Experiment 1 (one of the time series experiments shown in Figure 1(b)):
Time   N    Cro   CI   CII   CIII
0      0    0     0    0     0
5      0    30    0    0     0
10     0    30    0    0     2
..     ..   ..    ..   ..    ..
100    45   133   12   29    25
Fig. 1. (a) High level network representation for phage λ. (b) Time series data.
Time series data is assumed to be of the form D = ⟨S, E⟩ where S is the set of species (typically proteins or mRNA) that is being monitored, and E is the set of time series data experiments. Each element of E is of the form Δ = ⟨M, δ⟩, where M is the set of species mutated in the experiment, and δ is the set of data points. A major goal of this work is to integrate results from many different experiments to improve the quality of our predictions. For the best results, time series data should include both wild-type and mutational experiments. A data point α ∈ δ is of the form α = {τ, ν} where τ ∈ R is a time point and ν ∈ R^|S| is a vector over S, which includes a value for each species. Time series data collected from microarray data, for example, would provide an expression level for each gene monitored by the microarray. Figure 1(b) shows example time series data. Our method requires time series data as it is difficult to determine the direction of causality from purely correlational data. Consider for example the 5 gene network shown in Figure 1(a) in which the proteins from gene CII activate CI. With only correlational data it would appear that both CII and CI's proteins are either high or low together. It is, however, difficult to determine which gene activates the other. By adding the element of time, the gene that rose first can be detected, and thus an activation pattern can be determined more easily.
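For readers who prefer code, the formalism above maps naturally onto simple container types; the Python sketch below is one possible encoding, with all class and field names chosen by us rather than taken from the paper.

    from dataclasses import dataclass, field
    from typing import Dict, List, Set

    @dataclass
    class DataPoint:              # alpha = {tau, nu}
        time: float                           # tau, the time point
        values: Dict[str, float]              # nu, one value per monitored species

    @dataclass
    class Experiment:             # Delta = <M, delta>
        mutated: Set[str] = field(default_factory=set)         # M, species mutated in this experiment
        points: List[DataPoint] = field(default_factory=list)  # delta, the data points

    @dataclass
    class TimeSeriesData:         # D = <S, E>
        species: List[str]
        experiments: List[Experiment]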
3 Discovering Genetic Networks
The GeneNet algorithm shown in Figure 2 generates a directed graph representation of the connectivity between genes in a genetic network. The main inputs to the GeneNet algorithm are the species, S, and the time series experiments, E. It also takes three optional inputs. The first is the set of initial network connections, C, representing background knowledge about known connections between species used to reduce computational time. The next inputs are the probability thresholds, T, for activation and repression. The final inputs are the encodings, or ranges of time series data values, specified in L for each molecular species. These encodings can be automatically calculated by taking the highest and lowest level of each species and dividing up the state space evenly. They can also be calculated by partitioning the level of each species into equal size bins, which is the default since it appears to be the most effective. The GeneNet algorithm first determines the level encodings if they are not provided and encodes the experimental data into discrete levels for each of the species. Next, each species is considered separately to determine the species (parents) that activate or repress the gene that produces this species (child). If parents are already provided in C for a species, it is skipped. The experiments where the child species is mutated are discarded as these provide no information for this species. After initial parent genes are selected, the next step is to determine if parent genes act together to influence the child. The last step of the GeneNet algorithm is to determine the most likely parents using the CompetePossibleParents function which considers sets of potential parents together. The result of our method is a new network which includes only those activation or repression edges supported by the experimental data. Currently, our method does not look for self-activation or self-repression, though it does allow for cycles resulting from any other type of feedback. The rest of this section uses the 5 gene network shown in Figure 1 as an example. This network is comprised of the 5 genes involved in the Lysis-Lysogeny decision in the phage λ [10]. The running example focuses on gene CIII as the child gene.
GeneNet(Species S, Expts E, NetCon C, Thresholds T, Encodings L)
  (E, L) := EncodeExpts(S, E, L)
  foreach s ∈ S
    if (∗, s) ∉ C
      E′ := E − { experiments in which s is mutated }
      C := SelectInitialParents(s, S, E′, C, T, L)
      C := CreateMultipleParents(s, S, E′, C, T, L)
      C := CompetePossibleParents(s, S, E′, C, T, L)
  return ⟨S, C⟩
Fig. 2. The GeneNet algorithm
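The paper gives no pseudocode for EncodeExpts, so the following Python sketch shows only one plausible reading of the default encoding (equal-size bins, i.e., roughly the same number of data points per discrete level); all names here are our own assumptions.

    def encode_levels(values, num_levels=3):
        # assign each measurement of one species a discrete level in 0 .. num_levels-1,
        # putting roughly the same number of data points into each bin
        order = sorted(range(len(values)), key=lambda idx: values[idx])
        bin_size = max(1, len(values) // num_levels)
        levels = [0] * len(values)
        for rank, idx in enumerate(order):
            levels[idx] = min(rank // bin_size, num_levels - 1)
        return levels

    # example: one species' expression values over ten time points
    print(encode_levels([0, 0, 2, 5, 12, 25, 29, 30, 45, 133]))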
3.1 Scoring Parents
Crucial to each step of the GeneNet algorithm is a method to evaluate different potential sets of parents for each child gene. This is accomplished with the ScoreParents function, shown in Figure 3, which returns a score representing the likelihood of a parent set of genes activating or repressing a child gene. This score is computed by tallying votes for activation (votes_a), repression (votes_r), and unrelated (votes_u) for each potential level assignment. These votes are initially set to 0. Next, all possible level assignments for the other genes of interest, G, are calculated, and a level assignment, l′, is selected. The level assignments for the set of potential parents, P, are also calculated, and a level assignment, l, not equal to l0 (the lowest level assignment for the parents) is selected. Next, a probability ratio is calculated. The numerator is the probability of the child rising between two consecutive time points given that the parents are at the levels specified by l and the other genes of interest are at the levels specified by l′. The denominator is the probability of the child rising when the parents are at their lowest level, l0, while the other genes are at the levels specified by l′. If this probability ratio, probRatio, is larger than the activation threshold, Ta, then there is a vote for this parent set activating the child. If the ratio is smaller than a repression threshold, Tr, then there is a vote for repression. Otherwise, there is a vote for unrelated. The process continues checking each combination of level assignments. When all level assignments have been considered, the score is determined by taking the votes for activation and subtracting the votes for repression and dividing by the total amount of votes obtained. If activation votes outweigh repression votes the score is positive. If repression votes outweigh activation votes the score is negative.
ScoreParents(s, P, G, E, T, L)
  votes_a, votes_r, votes_u := 0
  foreach l′ ∈ levelAssignments(G, L)
    foreach l ∈ (levelAssignments(P, L) − {l0})
      probRatio = P(s ↑ | νP = l, νG = l′) / P(s ↑ | νP = l0, νG = l′)
      if probRatio > Ta then votes_a++
      else if probRatio < Tr then votes_r++
      else votes_u++
  return (votes_a − votes_r) / (votes_a + votes_r + votes_u)
Fig. 3. The ScoreParents function
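The following Python sketch mirrors the ScoreParents computation, assuming the rise probabilities P(s↑ | ...) have already been estimated from consecutive time points and are supplied as a lookup function; the names and the probability-table interface are our assumptions, not part of the paper.

    from itertools import product

    def score_parents(rise_prob, parent_levels, other_levels, t_a, t_r):
        # rise_prob(p, g) -> estimated P(child rises | parents at levels p, other genes at levels g),
        # or None when that configuration was never observed in the data
        # parent_levels / other_levels: number of discrete levels for each parent / other gene
        votes_a = votes_r = votes_u = 0
        lowest = tuple(0 for _ in parent_levels)                    # l0: every parent at its lowest level
        for g in product(*(range(n) for n in other_levels)):        # l' over the other genes of interest
            for p in product(*(range(n) for n in parent_levels)):   # l over the parent set
                if p == lowest:
                    continue
                num, den = rise_prob(p, g), rise_prob(lowest, g)
                if num is None or not den:
                    continue                                        # skip configurations with no data
                ratio = num / den
                if ratio > t_a:
                    votes_a += 1
                elif ratio < t_r:
                    votes_r += 1
                else:
                    votes_u += 1
        total = votes_a + votes_r + votes_u
        return (votes_a - votes_r) / total if total else 0.0

Fed with the rise probabilities behind Table 1 below and suitable thresholds, this reproduces the reported scores (for example, 0.2 for CII as the sole parent of CIII).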
Intuitively, if a parent set truly activates a child, then whenever the parents are low, the child should have a lower rising rate between time points than when the parents are high. This should be true regardless of the level of other species in the system. However, if the selected gene is not a parent and the true parents are in the G set, then by fixing the true parents to a low state, the probability of the child rising should not change as the false parent’s level is increased.
3.2 Selecting Initial Parent Sets
The SelectInitialParents algorithm, shown in Figure 4, is used to select species from the set S as potential parents for the child species s. Every species except the child is considered as a potential parent. The ScoreParents function is used to determine if this parent is a potential activator or repressor of the child gene. If the score is higher than the vote threshold, Tv (default value is 0.5), the parent gene is included in the set of network connections, C, as an activator. If the score is lower than the negative of the vote threshold, Tv, the parent gene is included as a repressor.
SelectInitialParents(s, S, E, C, T, L)
  foreach p ∈ S − {s}
    score = ScoreParents(s, {p}, {{s}}, E, T, L)
    if score ≥ Tv then C := C ∪ {({(p, a)}, s)}
    else if score ≤ −Tv then C := C ∪ {({(p, r)}, s)}
  return C
Fig. 4. The SelectInitialParents algorithm
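A direct Python rendering of this selection loop, reusing a scoring function such as the score_parents sketch above; the representation of a connection as a set of (parent, 'a'/'r') pairs follows the formalism of Section 2, and the names are ours.

    def select_initial_parents(child, species, score_fn, t_v=0.5):
        # score_fn(parent) -> ScoreParents score for a single candidate parent of `child`
        connections = []
        for p in species:
            if p == child:
                continue
            score = score_fn(p)
            if score >= t_v:
                connections.append(({(p, "a")}, child))   # candidate activator
            elif score <= -t_v:
                connections.append(({(p, "r")}, child))   # candidate repressor
        return connections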
Considering the gene CIII as the child, the scores obtained for the 4 other genes as parents are shown in Table 1. The result is that N, Cro, and CI are found to potentially repress CIII while CII potentially activates CIII. CII, however, is discarded since it does not have a high enough score.

Table 1. Rates and values for potential parents of CIII (columns Level0-Level2 give the level of the potential parent; rows give the level of CIII)

N             Level0   Level1   Level2   L1/L0   L2/L0   Votes                              Score
CIII Level0    5.4%     2.9%     3.7%    0.53    0.68    votes_a 0, votes_r 5, votes_u 1    -0.83
CIII Level1    9.3%     7.2%     6.7%    0.78    0.72
CIII Level2    8.6%     6.4%     6.1%    0.75    0.71

Cro           Level0   Level1   Level2   L1/L0   L2/L0   Votes                              Score
CIII Level0   11.5%     1.8%     1.5%    0.16    0.13    votes_a 0, votes_r 6, votes_u 0    -1.0
CIII Level1   14.2%     4.7%     3.1%    0.33    0.23
CIII Level2    9.7%     5.0%     4.2%    0.52    0.43

CI            Level0   Level1   Level2   L1/L0   L2/L0   Votes                              Score
CIII Level0   19.0%     1.7%     1.0%    0.09    0.05    votes_a 0, votes_r 6, votes_u 0    -1.0
CIII Level1   17.1%     2.6%     1.2%    0.15    0.07
CIII Level2   11.6%     2.7%     1.1%    0.23    0.09

CII           Level0   Level1   Level2   L1/L0   L2/L0   Votes                              Score
CIII Level0    3.1%    13.7%      -      4.32      -     votes_a 3, votes_r 2, votes_u 0     0.2
CIII Level1    4.4%     7.4%    12.6%    1.65    2.83
CIII Level2   19.4%     5.5%     6.8%    0.28    0.35
3.3 Creating Multiple Parent Sets
The CreateMultipleParents algorithm, shown in Figure 5, is used to test if two or more parents exert a combined effect on the child gene. The current implementation, though, is limited to parent sets of size two. The algorithm considers two parent sets, p1 and p2, at a time. It first computes the scores for these parent sets (note that this can actually be retrieved from a cache rather than being recomputed, and that R() removes a or r from each pair in p1 and p2, leaving only the parents). Next, it computes the score for the combination of these two parent sets (i.e., p1 ∪ p2). If the score for the combination is better than both of the simpler parent sets, then this combination is inserted into C and potentially combined again. Finally, subsets are removed from the set of parents.
CreateMultipleParents(s, S, E, C, T, L)
  foreach (p1, s) ∈ C
    score1 := ScoreParents(s, R(p1), {{s}}, E, T, L)
    foreach (p2, s) ∈ C − {(p1, s)}
      score2 := ScoreParents(s, R(p2), {{s}}, E, T, L)
      scoreb := ScoreParents(s, R(p1) ∪ R(p2), {{s}}, E, T, L)
      if abs(scoreb) ≥ abs(score1) and abs(scoreb) ≥ abs(score2)
        then C := C ∪ {(p1 ∪ p2, s)}
  C := removeSubsets(C)
  return C
Fig. 5. The CreateMultipleParents algorithm
Table 2(a) shows the votes and scores for the three different potential parents of CIII as well as the different combinations of parents. For this example, the combination of N and Cro does not have a better score than Cro alone, so it is not retained. The repressor combination of Cro and CI, however, has a score better or equal to each individually, so this combination is retained.

Table 2. Information computed for child CIII

(a) CreateMultipleParents Algorithm
Parents    votes_a   votes_r   votes_n   Score     Retained
N             0          5         1     -0.833
Cro           0          6         0     -1.0
CI            0          6         0     -1.0
N, Cro        0         23         1     -0.958     no
N, CI         0         21         3     -0.875     no
Cro, CI       0         24         0     -1.0       yes

(b) CompetePossibleParents Algorithm
Parents    votes_a   votes_r   votes_n   Score
N            27         12        15      0.27
Cro, CI       0         71         1     -0.98
3.4 Competing Possible Parent Sets
The CompetePossibleParents algorithm in Figure 6 competes parent sets until only one parent set remains. The getContenders function returns sets of parents to be competed (for example, the sets with the highest and lowest scores in C). Next, a score is calculated for each contender, q, controlling for the levels of the other contenders (i.e., Q − {q}). Finally, the scores are compared and the losers, parent sets with scores closest to 0, are removed from C. Also if a parent set's score represents a change in influence (for example, an activator with a negative score), this parent set is removed. This algorithm is run in a tournament fashion to limit the number of configurations which prevents the data from becoming too sparse. The contenders for CIII are {N} and {Cro, CI}. Table 2(b) shows the scores for each set of genes. The winner is the set {Cro, CI} so the GeneNet algorithm correctly identifies that CIII is repressed by Cro and CI.
CompetePossibleParents(s, S, E, C, T, L)
  while |{c ∈ C | c = (∗, s)}| > 1
    Q = getContenders(s, S, E, C, T, L)
    foreach q ∈ Q
      Scores_q = ScoreParents(s, {q}, (Q − {q}) ∪ {{s}}, E, T, L)
    C := removeLosers(C, Q, Scores)
  return C
Fig. 6. The CompetePossibleParents algorithm
4 Results
To evaluate our method, it must be tested on known networks. As there are currently very few known genetic networks, and even fewer with available time series data, we generated synthetic data for the phage λ network and several randomly generated networks. This procedure is shown in Figure 7. The high level description of the gene interactions is translated into a very detailed reaction-based model expressed in the Systems Biology Markup Language (SBML). This purely reaction-based model can be simulated stochastically with the U. of Tennessee's Exact Stochastic Simulator (ESS). ESS implements an optimized version of Gillespie's stochastic simulation algorithm [11], which produces the time evolutions of species in a well-stirred biochemical system. A limited number of time points from this simulation are selected to produce a synthetic time series data run for the original SBML model, which is then used by the GeneNet algorithm to construct a network. The original network is compared to the results of our method. As shown in Figure 7, 7 of the 10 arcs in the phage λ network are obtained and no spurious arcs are generated. The tool described in [9] recovers 3 of the 10 arcs and reports 3 spurious arcs. In addition to the phage λ network, our method is examined on 18 synthetic networks inspired by the 4 gene plasmids created by Guet et al. [12].
Fig. 7. Data flow for the GeneNet algorithm evaluation (the optional inputs C, T, and L are not provided during evaluation)

Table 3. Results for 10 gene networks

Number of        GeneNet                              DBN [9]
Experiments      Recall           Precision           Recall           Precision
1                26% (26/100)     19% (26/135)         4% (4/100)      15% (4/26)
5                58% (58/100)     46% (58/126)        11% (11/100)     48% (11/23)
10               75% (75/100)     57% (75/132)        14% (14/100)     42% (14/33)
25               82% (82/100)     65% (82/127)        37% (37/100)     47% (37/79)
50               94% (94/100)     70% (94/135)        47% (47/100)     49% (47/96)
Given 20 experiments with 100 data points generated by ESS for each of these networks, the GeneNet algorithm finds all 72 arcs but reports a total of 74. The DBN tool from [9] only finds 33 of the 72 arcs but reports 43 total arcs. We also created 10 high level 10 gene randomized networks that followed the Guet et al. framework. For this example, we also varied the number of experiments, where each experiment includes 200 data points. The results are shown in Table 3 using both our GeneNet algorithm and Yu's DBN algorithm. The runtimes of the two tools are very similar. Precision differences between the algorithms are minimal for the first two cases, but increase as more experiments are added. There are larger differences in recall. For example, the GeneNet algorithm finds 26 percent of the arcs using only 1 experiment and nearly 60 percent using only 5 experiments. On the other hand, the DBN algorithm is only able to find 11 percent of the arcs with 5 experiments. With 50 experiments, GeneNet finds 94 percent while the DBN method still only finds 47 percent.
5 Discussion
This paper describes a new method for learning connectivities in genetic networks from time-series data using a Bayesian type of analysis. Although the results are
promising, there are ways to improve them. To reduce noise from the stochasticity of the data, a low pass filter can be used. Interpolation can also be used to create intermediate data points to help smooth the data. Transitive edges, a special case of extra edges formed from 'shortcuts' when multiple genes are connected by a path, may potentially be removed or marked for later review by graph traversal algorithms. The use of data with interventions (i.e., mutational data) may also significantly improve the results. While our method does allow for cycles, as opposed to traditional Bayesian approaches, it does not yet detect self-activation or repression. Finally, we plan to apply our method to real experimental data.
Acknowledgments The authors thank Jing Yu of Duke for providing his software and Michael Samoilov and Adam Arkin of UCBerkeley for their comments on this work.
References
1. Brown, P.A., Botstein, D.: Exploring the new world of the genome with DNA microarrays. Nature Genet. 21 (1999) 33–37
2. Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D.: Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA 95 (1998) 14863–14868
3. Liang, S., Fuhrman, S., Somogyi, R.: REVEAL, a general reverse engineering algorithm for inference of genetic network architectures. In: Pacific Symposium on Biocomputing. Volume 3. (1998) 18–29
4. Akutsu, T., Miyano, S., Kuhara, S.: Identification of genetic networks from a small number of gene expression patterns under the boolean network model (1999)
5. Ideker, T.E., Thorsson, V., Karp, R.M.: Discovery of regulatory interactions through perturbation: Inference and experimental design (2000)
6. Lähdesmäki, H., Shmulevich, I., Yli-Harja, O.: On learning gene regulatory networks under the boolean network model. Machine Learning 52 (2003) 147–167
7. Friedman, N., Linial, M., Nachman, I., Pe'er, D.: Using bayesian networks to analyze expression data. Journal of Computational Biology 7 (2000) 601–620
8. Sachs, K., Perez, O., Pe'er, D., Lauffenburger, D.A., Nolan, G.P.: Causal protein-signaling networks derived from multiparameter single-cell data. Science 22 (2005) 523–9
9. Yu, J., Smith, V.A., Wang, P.P., Hartemink, A.J., Jarvis, E.D.: Advances to bayesian network inference for generating causal networks from observational biological data. Bioinformatics 20 (2004) 3594–3603
10. Ptashne, M.: A Genetic Switch. Cell Press & Blackwell Scientific Publishing (1992)
11. Gillespie, D.T.: Exact stochastic simulation of coupled chemical reactions. J. Phys. Chem. 81 (1977) 2340–2361
12. Guet, C.C., Elowitz, M.B., Hsing, W., Leibler, S.: Combinatorial synthesis of genetic networks. Science 296 (2002) 1466–1470
On Clustering of Genes Raja Loganantharaj1, Satish Cheepala2, and John Clifford2 1
Bioinformatics Research Lab,University of Louisiana at Lafayette PO Box 44330, Lafayette, LA 70504 [email protected] 2 Department of Biochemistry and Molecular Biology, LSU Health Sciences Center and Feist-Weiller Cancer Center, Shreveport, LA 71130
Abstract. The availability of microarray technology at an affordable price makes it possible to determine expression of several thousand genes simultaneously. Gene expression can be clustered so as to infer the regulatory modules and functionality of a gene relative to one or more of the annotated genes of the same cluster. The outcome of clustering depends on the clustering method and the metric being used to measure the distance. In this paper we study the popular hierarchal clustering algorithm and verify how many of the genes in the same cluster share functionality. Further, we will also look into the supervised clustering method for satisfying hypotheses and view how many of these genes are functionally related.
1 Introduction
A gene is the biologic unit of heredity. It is a segment of DNA that contains all the information necessary for the production of a biologically active product (protein or RNA). It is important to study and to understand the regulation and function of genes so as to control or to explain cellular behaviors at the molecular level. A primary regulatory event in the expression of a gene is DNA transcription, the process by which one strand of DNA is copied into an RNA sequence by the basal transcriptional machinery of the cell. Transcription is controlled by a host of DNA-binding transcription factors which, along with associated coactivators or corepressors, bind to specific sequences in the promoter and enhancer regions of the gene [1]. If a pair of genes is up- or down-regulated by the same set of binding factors under the same conditions, they are said to be co-regulated. Such co-regulated genes are expected to contribute to similar cell functions and behaviors [2]. Some of the objectives of microarray analysis include the identification of co-expressed genes, thereby inferring functional relationships. This process is similar to inferring a cause from the effects, and therefore the conclusions will only be plausible, but not necessarily correct. Clustering is a technique of grouping objects with similar features. Gene expression clustering involves grouping genes of similar or close expression profiles over several different conditions [3]. The expression of a gene under N different conditions may be viewed as a point in an N dimensional space at a location defined by expression values in each dimension (condition). The expression similarity of a pair of genes becomes the spatial proximity of a pair of points corresponding to these genes. The
outcome of gene expression clustering is dependent on the clustering method and on the metric of measuring the distance. The popular metric of measurements being used in microarray analysis include Euclidean distance and Pearson correlation distance. The hierarchal clustering is a popular and a useful unsupervised method to cluster gene expressions [3]. Different clustering methods or different distance measuring metric may result in different outcomes. Several approaches have been used to compare different clustering methods including the following validity indexes: Dunn’s, Davies-Bouldin, Silhouette, C, Goodman-Kruskal, Isolation, Jaccard and Rand. All these validation indices are based directly or indirectly on the information used for clustering; hence a method that uses any of these indices does not relate or provide any indication of the effectiveness of meeting the expected outcome of clustering when the expected outcome is in a different parameter space. We are proposing a technique based on the intended purpose of gene clustering, which is to group functionally related genes together. The proposed method is illustrated using the data set that we have created as part of a larger study to determine the cancer suppressive mechanism of a class of chemicals called retinoids. This paper is organized as follows. We briefly describe the data being used in this study and its preprocessing in section 2. It is followed by gene ontology in section 3. In section 4 we describe supervised and unsupervised clustering. This is followed by results in section 5. In section 6, we summarize and provide concluding remarks.
2 Data and Preprocessing We have conducted a DNA microarray experiment using the AFFYMETRIX 430 2.0 array, which contains oligonucleotide probe sets representing approximately 39,000 genes. This experiment is part of a larger study to determine the cancer suppressive mechanism of a class of chemicals called retinoids [4]. The major biologically active retinoid is all-trans retinoic acid (ATRA). We have studied the effects of ATRA on skin cancer prevention using the mouse skin 2-stage chemical carcinogenesis protocol. The mouse skin 2-stage chemical carcinogenesis protocol is one of the beststudied models and most informative with regard to understanding molecular mechanisms of carcinogenesis and identifying chemopreventive agents [5]. Skin tumors can be readily induced in this model by the sequential application of a carcinogen, referred to as the initiation stage, followed by repetitive treatment with a noncarcinogenic tumor promoter, referred to as the promotion stage. The initiation stage, accomplished by a single application of the carcinogen dimethylbenzanthracene (DMBA) to the skin, results in a small subset of keratinocytes (skin cells) carrying a mutation in a critical gene(s). The promotion stage requires repeated (twice weekly) application of tumor promoting agents such as 12-O-tetradecanoylphorbol-13-acetate (TPA) that causes the initiated cells to proliferate, eventually producing tumors. ATRA has been shown to be a highly efficient suppressor of tumor initiation and promotion in this model [6]. Here we describe analysis of the gene expression profiles obtained from microarrays for the following mouse skin samples subjected to the 2-stage protocol for 3 weeks; (1) controls treated with acetone solvent alone, (2) TPA (1 µg/application dissolved in 200
µl acetone), (3) ATRA alone (5 µg/application), and (4) TPA plus ATRA. We chose the 3 week time point in the 2-stage protocol, which is 5-7 weeks prior to the appearance of tumors, in order to identify gene expression changes early in the carcinogenic process that may be influenced by ATRA. Out of the 39,000 genes, 159 were expressed over a factor of 2 after the treatment of TPA compared to normal conditions. This set is called filtered-genes and we will use this set for further clustering and processing. Expression values of each gene are normalized in order to compensate for the variations of each gene’s absolute expression value. Suppose the expression value of a gene, say g, under a condition, i, is egi. Then the normalized expression value of egi becomes (egi - μ)/σ where μ and σ are respectively the mean and the standard deviation of the expression of the gene g over all the conditions.
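A minimal Python sketch of this per-gene normalization; the function name is ours, and we use the population standard deviation since the paper does not specify which form was used.

    import statistics

    def normalize_gene(expression):
        # z-score normalize one gene's expression values across all conditions: (e_gi - mu) / sigma
        mu = statistics.mean(expression)
        sigma = statistics.pstdev(expression)
        if sigma == 0:
            return [0.0 for _ in expression]   # a constant gene is mapped to all zeros
        return [(e - mu) / sigma for e in expression]

    # example usage on one gene's values over four conditions
    print(normalize_gene([1.0, 2.0, 4.0, 8.0]))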
3 Gene Ontology The gene ontology (GO) project [7] provides structured controlled vocabularies to address gene products consistently over several databases including FlyBase (Drosophila),the Saccharomyces Genome Database (SGD) and the Mouse Genome Database (MGD). The ontology describes gene products in terms of their associated biological processes, cellular components and molecular functions for each annotated gene. Each description of a gene product is arranged in a hierarchy from more general to very specific. In this work we identify the molecular function of each gene and use the functionality as a measure of success of a cluster. For each gene in the filtered set, we have obtained the GO id and then we have obtained the specific molecular function id. Out of the 159 selected genes, 99 have known functional annotations. These are distributed among several functional groups created by Onto-expression, as shown in Figure 1 [8].
4 Gene Clustering
There are two types of clustering, namely supervised and unsupervised. Supervised clustering is often used to group genes with specific expression patterns. In this experiment, we are interested in sets of genes in two categories: 1) those that are upregulated by TPA (≥2-fold) and are oppositely regulated (counter-regulated) when ATRA is coadministered with TPA, and 2) those genes whose upregulation by TPA remains unaffected by coadministration with ATRA (expression changed ≤1.2-fold comparing condition 2 to condition 4). The expression profile of the cluster for the first category is shown in Figure 2. Twenty-three genes fit this category. From the clustering, it is clear that the expression level of genes for skin receiving ATRA treatment alone (condition 3) is scattered, since this condition was not taken into consideration for the supervised learning. The second category of genes are upregulated by TPA, but either do not change or are further upregulated upon coadministration of TPA and ATRA, relative to TPA treatment alone. The normalized expression levels for this set of genes are shown in Figure 3.
Fig. 1. Functional Distributions of Genes using GO ontology
(Figure 2 plots normalized expression on the y-axis against the four conditions C, TPA, ATRA, and TPA+ATRA on the x-axis; the genes plotted are Tnfrsf19, Dach1, Arhgap1, Smc4l1, Rps6ka6, 4933432I09Rik, D12Ertd553e, 2310003P10Rik, 5430411K16Rik, Mlp, Als2cr2, Hoxb2, Capg, Txnl2, 2310010I16Rik, 9430077D24Rik, Igk-V28, Ccnc, Arih2, BC059177, Kcnj11, Egfl5, and AA408451.)
Fig. 2. Supervised clustering of genes that are over expressed by TPA and counter regulated by ATRA
4.1 Unsupervised Clustering Hierarchical clustering has been successfully used to cluster microarray gene expression data. The data points are represented as hierarchical series of nested clusters and this representation has a single cluster at the root level and each branch leads to a cluster from top to leaf node [9]. There are two ways of building hierarchical
(Figure 3 plots normalized expression on the y-axis against the four conditions C, TPA, ATRA, and TPA+ATRA on the x-axis; the genes plotted are Il1f6, Ptx3, Bcl2a1a, Cxcl1, S100a8, 2010002A20Rik, Stfa3, Sema3e, Chi3l3, D4Ertd117e, Ccl2, Slc7a11, 2410016F19Rik, Clecsf8, Sprr2h, Gyk, Scca2-rs1, Scca2-rs1, Siat7f, Cebpd, Krt2-8, Cd207, BC004690, Il1f9, Mup1, Chst10, Lcn2, 8430436F23Rik, 4632427C23Rik, A930014K01Rik, and Stfa1.)
Fig. 3. Expression of genes that do not change, or are further increased in expression with the coadministration of ATRA and TPA relative to TPA treatment alone
clustering namely bottom up and top down. In the bottom up approach every data point is considered to be a cluster and a cluster is merged into another cluster based on their proximity to each other. The proximity measures include single link, average link, complete link and un-weighted pair group. We use average linkage clustering, which is defined as the average of all the distances among all the pairs of elements between two clusters, say m and n. It is represented formally as Average link distance =
Σ { d_ik | e_i ∈ Cl_m ∧ e_k ∈ Cl_n } / (|Cl_m| * |Cl_n|), where d_ik is the distance between the elements e_i of cluster m and e_k of cluster n, and |Cl_r| is the size, or the total number of genes, of cluster Cl_r.
When a cluster is merged into another cluster, a branch is formed and the process continues until no more individual clusters remain. Once the hierarchical cluster tree is constructed, only one cluster exists at the root level, and it includes all the genes. As we go down the tree, each branch indicates the division of a cluster into more clusters, and the measure of closeness among the clusters is also increasing. The Euclidean distance d_ik between a pair of genes, say g_i and g_k with expression values under n conditions, is given by d_ik = sqrt( Σ_{m=1..n} (e_im - e_km)^2 ), where e_im is the expression value of gene g_i under the condition m. We have normalized the expression values of the 159 filtered genes and have created clusters using the Hierarchical Clustering Explorer Version 3.5 [10] with the parameters: average linkage clustering with Euclidean distance. As we increase the similarity among the clusters, the number of clusters also increases. With a similarity of 0.75 the 159 genes are clustered into 4 groups, as shown in Figure 4. As
Fig. 4. 4 clusters with similarity of 0.75 (Euclidean distance)
Fig. 5. 13 clusters with similarity of 0.85 (Euclidean distance)
we increase the similarity to 0.85, the genes are further broken down into 13 clusters, as shown in Figure 5. Alternatively, the distance can be based on the correlation coefficient. The distance d_ik based on the Pearson correlation coefficient between a pair of gene expressions, say g_i and g_k, is given by d_ik = 1 - σ_ik / (σ_i * σ_k), where σ_ik is the covariance of the expression of genes i and k, and σ_i is the standard deviation of the expression of gene i. With the Pearson distance, the hierarchical clustering resulted in 4 clusters, as shown in Figure 6, when the minimum similarity is set to 0.85.
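To illustrate the two distance choices in code, here is a small sketch using SciPy's average-linkage hierarchical clustering; HCE itself is an interactive tool, so this only mirrors the underlying computation, and the data matrix below is a random placeholder for the 159 normalized expression profiles.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import pdist

    X = np.random.rand(159, 4)                    # rows = genes, columns = the four conditions

    euclidean = pdist(X, metric="euclidean")      # d_ik = sqrt(sum_m (e_im - e_km)^2)
    pearson = pdist(X, metric="correlation")      # d_ik = 1 - Pearson correlation of the two profiles

    for name, dist in (("Euclidean", euclidean), ("Pearson", pearson)):
        tree = linkage(dist, method="average")    # average-linkage hierarchical clustering
        labels = fcluster(tree, t=4, criterion="maxclust")   # cut the tree into 4 clusters
        print(name, "cluster sizes:", np.bincount(labels)[1:])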
Fig. 6. 4 clusters with similarity of 0.85 (Pearson distance)
The expression profiles of these clusters are shown in Figures from 7 through 10.
(Figure 7 plots normalized expression on the y-axis against the four conditions C, TPA, ATRA, and TPA+ATRA on the x-axis for the genes of cluster 1, including BC012278, Fkbp8, 2810468K05Rik, Kif5b, BC004690, Ppif, Sprr1b, 2410016F19Rik, Minpp1, BC012278, Nr6a1, 2700038P16Rik, 2310051D06Rik, Zfp68, 2010005O13Rik, Bcas2, 4632427C23Rik, 9130215M02Rik, Tncc, AA589481, Apobec3, B230209C24Rik, D130054N24Rik, Klrg1, Ahsa1, Ccm2, Ttk, AU015195, Mknk2, Rfxank, and BC019206.)
Fig. 7. Normalized expression of cluster1 (Pearson distance)
[Line plot, panels "Gene Expression of Cluster 2 (Pearson Dist)" and "Normalized Expression of Cluster 1 (Correlation Dist)": normalized expression (y-axis) across the conditions C, TPA, ATRA and TPA+ATRA (x-axis); the legend lists the plotted genes, including Rps20, Igbp1, Eif3s6ip, Krt2-8, Ywhaz, Stfa3, Sema3e, Cd207, Eif2ak4, Ide, Trim27, Zfp289, Cxcl1, Ccl2, Slc7a11, Sprr2h, Slpi, Gyk, Chi3l3 and several RIKEN clones.]
Fig. 8. Normalized expression of cluster2 (Pearson distance)
[Two line plots, panels "Gene Expression of Cluster 3 (Pearson Dist)" and "Gene Expression Cluster 4 (Pearson Dist)": normalized expression (y-axis) across the conditions C, TPA, ATRA and TPA+ATRA (x-axis); the legends list the plotted genes, including Pex6, Uble1a, Col23a1, A2lp, Fancc, Arih2, Mpst, Wars2, Prg4, Frap1, Lrrn6a, Kcnj11, Hoxb2, AA409802, Mrps9, Gtl2, Asf1a, Tm7sf1, Zfp398, Pja1, Mib1, Scca2-rs1, Wee1, Sfrp5, C79248, Hadhsc, Chst10, Mup1, Siat7f and several RIKEN clones.]
Fig. 9. Normalized expression of cluster 3 (Pearson distance)
Fig. 10. Normalized expression of cluster 4 (Pearson distance)
4.2 Quality of Clustering
The number of clusters and the content of each cluster depend on the clustering method and on the metric used to measure distance. Several approaches have been used to compare clustering methods, including the following validity indices: Dunn's, Davies-Bouldin, Silhouette, C, Goodman-Kruskal, Isolation, Jaccard and Rand. Bolshakova et al. [11] have used some of these indices to compare different proximity measures for hierarchical clustering. Recently, Speer et al. [12] have used GO functional annotations to cluster genes using a minimum spanning tree with single-link proximity measures. Incompatible links in the minimum spanning tree are removed to form a minimal number of spanning trees, and each spanning tree forms a cluster. They applied the Davies-Bouldin index for estimating the quality of the clusters. All these approaches are based directly or indirectly on the information used for clustering, and they ignore the intended purpose of the clustering. DNA microarray expression profiles are clustered with the expectation that genes with similar molecular functions group together. In an ideal situation, a one-to-one mapping from a gene expression cluster to a molecular function group is expected. We propose a method based on an expected outcome of clustering, that is, the degree to which functional grouping is achieved. Suppose n annotated genes are clustered into m groups based on their gene expression profiles, and assume that these n genes form k functional clusters based on the GO annotation. Let p_ir be the probability of selecting a gene of functional group i from cluster r. We define the quality of clustering using the entropy of the clusters with respect to functional associations:

    Entropy of cluster r = −Σ_i p_ir log2(p_ir),            for i ∈ 1..k
    Total entropy = −Σ_r Σ_i p_ir log2(p_ir),               for i ∈ 1..k and r ∈ 1..m
For example, suppose a cluster r has n1 and n2 genes from functional groups 1 and 2, respectively. Assuming that there are no other genes in the cluster and that each gene is equally likely to be selected, the probability of selecting a gene from functional group 1 is n1/(n1+n2). The total entropy of the system, which is the sum of the entropies of all the clusters, provides a measure of the quality of clustering. A
lower entropy value indicates better cluster quality. In the ideal situation, where each functional group has a one-to-one mapping with a cluster, the total entropy of the system is 0. When comparing clustering methods that involve different numbers of genes, we must use the entropy per gene as the measure for comparing clustering quality.
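A minimal sketch of this entropy-based quality measure is given below. It assumes the functional annotation is supplied as a list of (cluster, functional group) membership pairs, which is our own encoding for the example; the per-annotation normalization corresponds to the entropy per gene discussed above.

    import math
    from collections import defaultdict

    def clustering_entropy(memberships):
        # memberships: iterable of (cluster_id, functional_group_id) pairs,
        # one pair per functional annotation of a clustered gene.
        counts = defaultdict(lambda: defaultdict(int))
        for cluster, group in memberships:
            counts[cluster][group] += 1

        total_entropy = 0.0
        n_annotations = 0
        for cluster, group_counts in counts.items():
            cluster_size = sum(group_counts.values())
            n_annotations += cluster_size
            for c in group_counts.values():
                p = c / cluster_size            # p_ir, assuming equal selection probability
                total_entropy += -p * math.log2(p)
        return total_entropy, total_entropy / n_annotations   # total entropy and entropy per gene

    # Toy example: one pure cluster (entropy 0) and one mixed cluster.
    pairs = [("c1", "fg1")] * 5 + [("c2", "fg1")] * 2 + [("c2", "fg2")] * 2
    print(clustering_entropy(pairs))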
5 Results
Out of the 159 filtered genes, 99 have a GO functional annotation, and we selected eleven functional groups, each containing a minimum of 4 genes. The total number of functional IDs is 131, indicating that a gene may belong to more than one functional group. The distribution of genes among the functional groups for each cluster, under the Pearson correlation distance of hierarchical clustering with similarity 0.85, is shown in Table 1, where fgr represents functional group r.
Table 1. Functional distribution of genes (Pearson correlation distance measurement)
Clusters   fg1   fg2   fg3   fg4   fg5   fg6   fg7   fg8   fg9   fg10   fg11
   1        9     9     6     6     5     5     2     4     3     2      5
   2        3     3     5     2     0     2     2     2     1     4      0
   3        4     4     4     5     4     4     5     2     4     3      1
   4        1     0     0     1     4     0     1     1     1     0      2
Table 2. Entropy for each clustering method
Clustering Methods                   Total Entropy   Number of genes   Entropy per gene
Pearson correlation (4 clusters)         11.48             131              0.088
Euclidean distance (4 clusters)          12.43             131              0.095
Euclidean distance (13 clusters)         27.13             131              0.207
Supervised Learning (2 clusters)          6.645             64              0.104
6 Summary and Discussion
DNA microarray technology allows the study of the expression patterns of thousands of genes simultaneously under different conditions, so as to understand and infer which genes are co-regulated, what the potential regulatory mechanisms are, and what the functions of co-regulated genes may be. If a pair of genes has a similar expression profile, they are said to be co-expressed, which suggests that these genes may be co-regulated or may have similar molecular functions. In this work, we have studied the popular hierarchical clustering algorithm for its effectiveness in clustering gene expression. We have investigated supervised and hierarchical clustering with Euclidean distance and Pearson correlation distance. The hierarchical clustering algorithm, with a minimum similarity cut-off point of 0.85, resulted in 4 clusters with Pearson correlation distance. With Euclidean distance, the
same similarity cut-off point resulted in 13 clusters, as shown in Section 4.1. For comparison, we also created 4 clusters with Euclidean distance using a similarity cut-off point of 0.75. For measuring or comparing the quality of clusters, several approaches have been used, including the following validity indices: Dunn's, Davies-Bouldin, Silhouette, C, Goodman-Kruskal, Isolation, Jaccard and Rand. Recently, Speer et al. [12] proposed a clustering method for functional partitioning based on the Gene Ontology and used the Davies-Bouldin index for validating the clusters. All these validation indices ignore the expected outcome of clustering. We have proposed a technique for measuring the quality of gene clusters that relates to, or provides some indication of, the intended functional grouping. Our method is based on the entropy of the functional groups in each cluster. The total entropy provides a measure for comparing the effectiveness of clustering genes so as to infer functionally related groups. When comparing clustering methods with different numbers of genes, the entropy per gene must be used. A small entropy per gene indicates better cohesiveness of functional grouping in the clusters; in the ideal situation, where the expression clusters yield the functional grouping exactly, the entropy takes its lowest value, 0. The supervised learning method clustered 64 genes into two groups, with a total entropy of 6.645 and an entropy per gene of 0.104, as shown in Table 2. The hierarchical clustering of genes with Pearson correlation offers the lowest entropy per gene and hence provides the best clustering metric. Note, however, that the entropy value is still quite high, indicating that co-expressed genes are not always co-functional. This observation is consistent with the occurrence of many false positives when inferring co-functionality from co-expression. The method we have proposed is general enough to be used for other purposes, such as measuring co-regulation from co-expression. An investigation is under way to measure the effectiveness of gene expression clusters for co-regulation, and we are conducting studies to compare our method with other proposed validation methods. Acknowledgements. The first author acknowledges partial support from the Louisiana Governor's Information Technology Initiatives (GITI) for this work.
References 1. Mount, D.W., Bioinformatics: sequence and genome analysis. 2nd ed. 2004, Cold Spring Harbor, N.Y.: Cold Spring Harbor Laboratory Press. xii, 692. 2. Jensen, L.J. and S. Knudsen, Automatic discovery of regulatory patterns in promoter regions based on whole cell expression data and functional annotation. Bioinformatics, 2000. 16(4): p. 326-33. 3. Draghici, S., Data Analysis Tools for DNA Microarrays. Mathematical Biology and Medcine Series. 2003: Chapman & Hall/CRC. 4. H. Xu, C.S., McCauley E, Coombes K, Xiao L, Fischer SM and Clifford JL, Chemoprevention of skin carcinogenesis by phenylretinamides: retinoid receptor independent tumor suppression. Clinical Cancer Research, 2006. 12(3). 5. DiGiovanni, J., Multistage carcinogenesis in mouse skin. Pharmacol Ther, 1992. 54(1): p. 63-128.
6. Verma, A.K., Inhibition of both stage I and stage II mouse skin tumour promotion by retinoic acid and the dependence of inhibition of tumor promotion on the duration of retinoic acid treatment. Cancer Research, 1987. 47: p. 5097-101. 7. go-ontology, the gene ontology http://www.geneontology.org/. 8. Draghici, S., et al., Global functional profiling of gene expression. Genomics, 2003: p. 98-104. 9. Eisen, M.B., Gene Cluster. p. Hierarchical clustering, self-organizing maps (SOMs), kmeans clustering, principal component analysis http://rana.lbl.gov/EisenSoftware.htm. 10. Seo, J., H. Gordish-Dressman, and E.P. Hoffman, An interactive power analysis tool for microarray hypothesis testing and generation. Bioinformatics, 2006(in press). 11. Bolshakova, N., F. Azuaje, and P. Cunningham. An integrated tool for microarray data clustering and cluster validity assessment. in Proceedings of the 2004 ACM Symposium on Applied Computing (SAC). 2004: ACM. 12. Speer, N., C. Spieth, and A. Zell. A Memetic Clustering Algorithm for the Functional Partition of Genes Based on the Gene Ontology. in Proceedings of the 2004 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB 2004). 2004: IEEE Press. 13. Christian Spieth, F.S., Nora Speer, and Andreas Zell. Clustering-based Approach to Identify Solutions for the Inference of Regulatory Networks. 2005.
Towards Automatic Concept Hierarchy Generation for Specific Knowledge Network
Jian-hua Yeh¹ and Shun-hong Sie²
¹ Department of Computer and Information Science, Aletheia University, Taipei, Taiwan, R.O.C. [email protected]
² Department of Library and Information Science, Fu Jen Catholic University, Taipei, Taiwan, R.O.C. [email protected]
Abstract. This paper discusses an automatic concept hierarchy generation process for specific knowledge networks. Traditional concept hierarchy generation uses hierarchical clustering to group similar terms, and the resulting hierarchy is usually not satisfactory in terms of human recognition. Human-provided knowledge networks present strong semantic features, but their generation process is both labor-intensive and inconsistent for large-scale hierarchies. The method proposed in this paper combines the results of a specific knowledge network and automatic concept hierarchy generation, producing a human-readable, semantically oriented hierarchy. This generation process can efficiently reduce manual classification effort, which is an exhausting task for human beings. An evaluation method is also proposed in this paper to verify the quality of the resulting hierarchy. Keywords: concept hierarchy, hierarchical clustering, cluster partitioning, tree similarity.
1 Introduction
Knowledge network generation has traditionally been manual, labor-intensive work, since humans are capable of creating semantic hierarchies effectively. But with the growing number of real-world concepts and relationships, it is increasingly difficult for humans to generate and maintain large knowledge networks. Meanwhile, the quality of the knowledge structure is hard to maintain, because humans cannot apply their creation criteria consistently. Therefore, human-generated knowledge networks such as web directories and large organizational category hierarchies are usually difficult to maintain or extend. Our experience in constructing a government ontology [1] also shows that it is difficult to generate a fully semantic-aware concept hierarchy by relying purely on traditional data clustering algorithms. This paper therefore focuses on an automatic concept hierarchy generation process with a guided clustering method, called GCC, which reduces the human effort needed to maintain large knowledge networks while keeping a reasonable and consistent semantic quality.
2 Issues of the Concept Hierarchy Construction Process
Generating a knowledge network (or concept hierarchy) manually is a complex process. Each concept must first be extracted by a human, the dependencies among concepts (including inclusion and subordination) must be identified, and only then can the hierarchy be created. The whole process can be complicated, time-consuming and inconsistent, since not everyone shares the same recognition of the knowledge; even a single person finds it hard to keep consistent criteria throughout the hierarchy construction process. Since the invention of the World Wide Web, this problem has become far more difficult to handle. Portal sites maintaining web directories struggle to keep up with the size of their structures, since it is not practical for humans to process such a huge number of concepts. Human-generated knowledge networks have the advantage of being strongly semantic, but this is only attainable in small-scale domains, such as corporate or personal use. The scalability and consistency problems form a barrier to the manual construction of knowledge networks. While manual construction of knowledge networks has limitations, research on automatic hierarchy production has gained much attention in recent years [2,3]. Several classification and clustering methods have been proposed to solve the hierarchy generation problem [9-12]. Among these methods, hierarchical agglomerative clustering (HAC) algorithms [3,4] attract a lot of attention, since the clustering result is presented in the form of a tree structure, that is, a hierarchy. This kind of algorithm has great potential for automatic hierarchy generation, since it offers both scalability and consistency. While this is a promising technology, it still has a shortcoming in its current development: the clusters generated by the automatic process seldom carry semantics, since the automatic process based on information retrieval (IR) techniques is usually term-based rather than concept-based. As a consequence the clustering result is seldom semantic-aware, and thus often not acceptable to human recognition. The automatic generation process will improve as information retrieval and computational linguistics techniques advance, but for now it is not possible to generate high-quality knowledge networks through traditional clustering methods alone. From the discussion above, neither the manual nor the automatic generation process is practical on its own. Therefore, a new hierarchy construction technology should be developed to solve this problem. In this paper, we propose a hybrid method to help knowledge network (or concept hierarchy) construction. A guided clustering algorithm, called Guided Concept Clustering (GCC), is designed for automatic concept hierarchy generation, while the generation process is supervised and directed towards existing semantic hierarchies. The focus of GCC is on the generation of a scalable and semantic-aware hierarchy.
3 Design Concept of Guided Concept Clustering (GCC)
The design of the proposed method is based on a voting procedure embedded in the clustering process. Suppose we are going to generate a concept hierarchy for some concept C with an initial document set D, and that the concept hierarchies of C contained in other existing web directories have been fetched in advance. If there
are k fetched networks {N1, N2, …, Nk}, and the underlying document sets {DS1, DS2, …, DSk} corresponding to {N1, N2, …, Nk} are identified, then each document in set DSi (where i = 1, 2, …, k) is labeled with its hierarchy path, as shown in Fig. 1. After the labeling process, each document in its document set carries hierarchy path information. Next, if a document in the initial document set D also exists in some DSi, the path information is copied to it as a reference. This "attaching" process gives the documents that appear in other existing hierarchies votes for the later processing stages. These votes are called document foreign votes (DFVs).
Fig. 1. Labeling process for initial document set
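The following sketch illustrates the labeling and attachment step in Python. The data structures (dictionaries keyed by a document identifier, lists of path labels) and the function name are assumptions made for the example rather than the system's actual implementation.

    # Hierarchy paths from k existing networks, keyed by document id, e.g.
    # network_paths[doc] = ["Astronomy", "Astrophysics", "Black Holes"]
    def attach_foreign_votes(initial_docs, paths_per_network):
        dfv = {}                              # doc id -> list of hierarchy paths (its "votes")
        for doc in initial_docs:
            votes = []
            for network_paths in paths_per_network:
                if doc in network_paths:      # the document also appears in this directory
                    votes.append(network_paths[doc])
            dfv[doc] = votes
        return dfv

    odp = {"doc1": ["Astronomy", "Astrophysics", "Cosmology"]}
    yahoo = {"doc1": ["Astronomy", "Cosmology"], "doc2": ["Astronomy", "Galaxies"]}
    print(attach_foreign_votes(["doc1", "doc2", "doc3"], [odp, yahoo]))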
The next step is to apply a hierarchical clustering algorithm to the document set D. The clustering algorithm applied here is modified from hierarchical agglomerative clustering (HAC) to incorporate the DFV information. The clustering decision function (inter-object distance function) is given by the following equations:
EQ1: Let d_u = {w_u,1, …, w_u,n} and d_v = {w_v,1, …, w_v,n} be the two vectors of term weights for documents d_u and d_v, where

    w_u,i = tf_u,i × idf_i = (freq_u,i / max_j freq_u,j) × log(N / n_i).

The document similarity estimation function is

    sim1 = sim(u, v) = (d_u · d_v) / (|d_u| × |d_v|)
         = Σ_{i=1..n} w_u,i × w_v,i / ( √(Σ_{i=1..n} w_u,i²) × √(Σ_{i=1..n} w_v,i²) ).

EQ2: Let p_u = {l_u,1, …, l_u,n} and p_v = {l_v,1, …, l_v,n} be the path information (from root node to leaf node) for documents d_u and d_v. The path similarity estimation function is

    sim2 = DFV_{u,v} = Σ_{i=1..n} h(l_u,i, l_v,i),   where h(l_u,i, l_v,i) = 1/(n − i) if l_u,i = l_v,i, and 0 otherwise.

EQ3: The grand similarity estimation function is

    sim(u, v) = w1 × sim1 + w2 × sim2,   with 0 ≤ w1 < 1, 0 ≤ w2 < 1, w1 + w2 = 1.
We use EQ3 as the main decision function for hierarchical agglomerative clustering. In EQ3, the weight parameters can be set individually. The terminating condition is reached when only one cluster exists. Since the root of each existing hierarchy is the same as the target concept C, the root concept of the clustering result necessarily converges to C. The whole concept clustering process is shown in Fig. 2, and the algorithm for the proposed guided clustering method is shown in Fig. 3.
Fig. 2. Cluster construction process
procedure GCC
    C_1 = {d_i}, ∀i
    M_1 = sim(C_1, φ)
    k = 1
    while |C_k| > 1 do
        C_{k+1} = cluster(C_k, M_k);
        M_{k+1} = sim(C_{k+1}, M_k);
        k = k + 1;
    end;
    answer = C_k;
end procedure;

Fig. 3. Pseudo code for the guided concept clustering process
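A compact Python sketch of the decision functions EQ1-EQ3 follows. The tf-idf weight vectors are assumed to have been computed beforehand, 0-based path indices are used (which avoids a zero denominator at the leaf position), and all names and example values are our own.

    import math

    def cosine_sim(wu, wv):
        # EQ1: cosine similarity over tf-idf weight vectors of equal length.
        dot = sum(a * b for a, b in zip(wu, wv))
        nu = math.sqrt(sum(a * a for a in wu))
        nv = math.sqrt(sum(b * b for b in wv))
        return dot / (nu * nv) if nu and nv else 0.0

    def path_sim(pu, pv):
        # EQ2: matching path labels closer to the root receive weight 1/(n - i).
        n = min(len(pu), len(pv))
        return sum(1.0 / (n - i) for i in range(n) if pu[i] == pv[i])

    def grand_sim(wu, wv, pu, pv, w1=0.6, w2=0.4):
        # EQ3: weighted combination of content and path similarity (w1 + w2 = 1).
        return w1 * cosine_sim(wu, wv) + w2 * path_sim(pu, pv)

    print(grand_sim([0.2, 0.5, 0.0], [0.1, 0.4, 0.3],
                    ["Astronomy", "Astrophysics"], ["Astronomy", "Cosmology"]))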
Since the hierarchical clustering algorithm produces a binary tree hierarchy, we apply a cluster partitioning algorithm [7, 15] to produce our final result hierarchy. In addition, we developed a partition naming strategy that produces hierarchy labels from the highest-frequency terms observed during the clustering process. In the next section, a real-world experiment is presented to verify our GCC algorithm.
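As an illustration of the naming strategy, the hedged sketch below labels a partition with its most frequent terms; the simple tokenization, the lower-casing and the choice of the top three terms are assumptions made for the example, not the strategy's actual parameters.

    from collections import Counter

    def name_partition(documents, top_k=3):
        # documents: list of token lists belonging to one partition (cluster).
        counts = Counter(token.lower() for doc in documents for token in doc)
        return " / ".join(term for term, _ in counts.most_common(top_k))

    docs = [["galaxies", "spiral", "galaxies"], ["galaxies", "clusters"], ["spiral", "arms"]]
    print(name_partition(docs))    # e.g. "galaxies / spiral / clusters"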
4 Experiment
As a real-world example, a category of the LookSmart Web Directory was chosen for our demonstration. Suppose we want to generate a concept hierarchy for the category "Astronomy", which also appears in the web directories of Yahoo (yahoo.com) and the Open Directory Project (ODP, dmoz.org); both partial hierarchies are shown in Table 1. We use documents categorized by the LookSmart Web Directory (looksmart.com) as the initial experimental document set (with no hierarchical information included); LookSmart also contains an "Astronomy" hierarchy, which we use for evaluating the experimental result. The hierarchy construction process proceeds in the following steps. First, each document in the initial document set is labeled with its hierarchy path information in Yahoo or ODP, if possible. Second, the clustering process starts with equal weights for Yahoo and ODP. After the clustering process finished, the resulting hierarchy is shown in Table 2, and related information is shown in Table 3. Table 2 shows that the result hierarchy (in this example, level 1 and level 2) based on the LookSmart-linked documents is clustered into something similar to a mixture of the Yahoo and ODP hierarchies. The underscored categories in Table 1(a) and 1(b) are all used in the result hierarchy, as shown in Table 2. Most of the generated categories have category names corresponding to either Yahoo or ODP, or to both of them.
Table 1. Example category information from (a) ODP and (b) Yahoo
(a) Partial ODP Astronomy category (underscored terms appear in the experiment result)
Level1 (Astronomy): … Archaeoastronomy@ (95); Astrophysics@ (63); Calendars and Timekeeping (47); Cosmology (42); Dark Sky@ (144); Eclipses, Occultations and Transits (84); Extrasolar Planets (43); Extraterrestrial Life (147); Galaxies (54); Interstellar Medium (16); Personal Pages (145); Solar System (716); Star Clusters (17); Stars (127); …
Level2 (Astronomy > Astrophysics, for example): Computational Software@ (7); History@ (184); Institutions; Black Holes@ (63); Cosmology@ (42); Neutron Stars@ (24); Stellar Evolution@ (12); …
Table 1. (continued)
(b) Partial Yahoo Astronomy category (underscored terms appear in the experiment result)
Level1 (Astronomy): … Astronomers (102); Astronomical Calendars (20); Astrophysics (127); Auroras@; Comets, Meteors, and Asteroids@; Companies@; Conferences and Events@; Cosmology (113); Deep Sky (13); Eclipses@; Education (174); Extrasolar Planets (31); FITS (Flexible Image Transport System) (1); Galaxies (49); History (16); Humor@; Infrared (11); Interstellar Medium (72); Light Pollution@; News and Media (29); Organizations (23); Pictures (43); Planetaria (62); Radio Astronomy (78); Research (248); Scientific Notation@; SETI (Search for Extraterrestrial Intelligence)@; Software (14); Solar System (987); Space@; Stars (222); Telescopes (26); Web Directories (12); X-Ray (20); …
Level2 (Astronomy > Astrophysics, for example): Ask an Expert (3); Conferences (1); Cosmology@; Galaxies (7); Gamma-Ray Observation (4); Institutes (48); Neutrino Detection@; Observatories@; Relativity@; Research (22); Space Physics@; Stellar Phenomena@; Surveys (6); Universal Origins@; …
To verify the quality of the resulting concept hierarchy generated by GCC, we refer to several tree similarity measures [13,14]. The similarity measures are shown in Table 4. The results show that the experimental hierarchy is over 60% similar to a mixture of the Yahoo and ODP hierarchies. In Table 4, "structural similarity" denotes the portion of the concepts in the result hierarchy that appear in the union of the ODP and Yahoo hierarchies, and "structural difference" denotes the portion of the concepts in the result hierarchy that are missing from that union.
Table 2. Cluster construction result (most categories are classified and have a corresponding category name)
Level1 (Astronomy): Astrophysics (ODP/Yahoo); Cosmology (ODP); Dark Sky (ODP); Deep Sky (Yahoo); Education (Yahoo); Galaxies (ODP/Yahoo); Personal Pages (ODP); Reference; Solar System (Yahoo); Space (Yahoo); Star Clusters (ODP); X-Ray (Yahoo)
Level2 (Astronomy > Astrophysics): Black Holes (ODP); Cosmology (ODP/Yahoo); Galaxies (Yahoo); Research (Yahoo); Reference; Space physics (Yahoo); …
Table 3. Related information for cluster construction result

Hierarchy processed      Total node count
ODP: Astronomy                3,901
Yahoo: Astronomy              3,016
Our result: Astronomy         1,347
Table 4. Structural similarity measure for concept hierarchy construction result

Experiment: w1=0.6, w2=0.4,   Structural similarity with    Structural difference from
average-link HAC              union of ODP & Yahoo          union of ODP & Yahoo
Level1                               77.27%                        22.73%
Level1 & Level2                      61.66%                        38.34%
5 Conclusion
Though this is only the first step of our guided concept clustering experiments, several research issues remain to be studied in order to improve our clustering strategy. First of all, the resulting concept hierarchy is partially learned from the features of existing hierarchies, and the weight distribution over those hierarchies affects the clustering result, so it is important to study how to decide the weight distribution objectively. Secondly, comparing the LookSmart hierarchy with our experimental result may reveal interesting phenomena indicating the differing viewpoints of the ODP/Yahoo and LookSmart directory editors. This issue can be further extended to provide a visual comparison tool with which users can adjust their manually generated, shallow concept hierarchies for our GCC algorithm. Thirdly, the labeling process introduced in this paper can be generalized to use not only identical-document comparison (including paths) but also a fuzzier strategy that combines category path information with document similarity, and even hyperlinks, anchor texts and other useful information.
The GCC method presented in this paper draws on the experience embodied in existing web directories, extracting the path information they contain and applying it to generate a new concept hierarchy. The method uses this information as a reference to guide the formation of the target hierarchy, which is both automatic and semantic-aware. This process can reduce human effort efficiently, while keeping the semantic quality of the generated knowledge network. The GCC algorithm can be used not only for web directories but also in other specific domains, such as biology, energy and other expert domains.
References 1. Chao-chen Chen, Jian-hua Yeh, Shun-hong Sie, “Government Ontology and Thesaurus Construction: A Taiwan Experience”, in Proceedings of the 8th International Conference of Asian Digital Libraries (ICADL2005), Bankok, Tailand, December 2005. 2. Jain, A.K. and Dubes, R.C. (1988) Algorithms for clustering data, Prentice Hall. 3. Jain, A.K., Murty, M.N., Flynn, P.J. (1999) Data Clustering: A Review, ACM Computing Surveys, 31, 264-323. 4. Widyantoro, D., Ioerger, T. R., and Yen, J. (2002) An Incremental Approach to Building a Cluster Hierarchy, in Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM-02). 5. San Antonio, Organizing topic-specific web information, Conference on Hypertext and Hypermedia, Texas, United States Pages: 133 – 141, Year of Publication: 2000 6. S. Chakrabarti, B. Dom and P. Indyk “Enhanced hypertext categorization using hyperlinks.” Proceedings ACM SIGMOD International Conference on Management of Data (pp.307-318), Seattle, Washington: ACM Press, 1998. 7. S.-L. Chuang and L.-F. Chien, “Automatic query taxonomy generation for information retrieval applications” Online Information Review (OIR), 27(4):243-255, 2003 8. V. Kashyap, C. Ramakrishnan and T. C. Rindflesch, “Towards (Semi-)automatic Generation of Bio-medical Ontologies” Poster Proceedings of the AMIA 2003 Annual Symposium, November, 2003, Washington, DC. 9. Daphne Koller, Mehran Sahami “Hierarchically classifying documents using very few words.” Proceedings of ICML-97, 14th International Conference on Machine Learning, 1997. 10. F. Li and Y. Yang. “A loss function analysis for classification methods in text categorization” The Twentith International Conference on Machine Learning (ICML'03), pp472-479, 2003. 11. R.E. Valdes-Perez and etc, "Demonstration of Hierarchical Document Clustering of Digital Library Retrieval Results". Joint Conference on Digital Libraries (JDCL '01), Roanoke, VA (presented as a demonstration), June 24-28, 2001. 12. Y. Yang, J. Zhang and B. Kisiel. “A scalability analysis of classifiers in text categorization” ACM SIGIR'03, pp 96-103, 2003. 13. Bhavsar, V., Boley, H., Yang, L., “A Weighted-Tree Similarity Algorithm for Multi-Agent Systems in E-Business Environments”, Computational Intelligence Journal, 20(4), 2004. 14. Tom Morrison, “Similarity Measure Building for Website Recommendation within an Artificial Immune System”, Ph.D. Thesis, University of Nottingham, 2003. 15. B. Mirkin., “Mathematical Classification and Clustering”, Kluwer, 1996.
An Open and Scalable Framework for Enriching Ontologies with Natural Language Content Maria Teresa Pazienza and Armando Stellato AI Research Group, Dept. of Computer Science, Systems and Production University of Rome, Tor Vergata Via del Politecnico 1, 00133 Rome, Italy {pazienza, stellato}@info.uniroma2.it
Abstract. Knowledge Sharing is a crucial issue in the Semantic Web: SW services expose and share knowledge content which arise from distinct languages, locales, and personal perspectives; a great effort has been spent in these years, in the form of Knowledge Representation standards and communication protocols, with the objective of acquiring semantic consensus across distributed applications. However, neither ontology mapping algorithm nor knowledge mediator agent can easily find a way through ontologies as they are organized nowadays: concepts expressed by hardly recognizable labels, lexical ambiguity represented by phenomena like synonymy and polysemy and use of different natural languages which derive from different cultures, all together push for expressing ontological content in a linguistically motivated fashion. This paper presents our approach in establishing a framework for semi-automatic linguistic enrichment of ontologies, which led to the development of Ontoling, a plug-in for the popular ontology development tool Protégé. We describe here its features and design aspects which characterize its current release.
1 Introduction
The scenario offered by the SW (and by the Web in general) is characterized by huge quantities of documents and by users willing to access them. Though machine readability is a primary aim for allowing automatic exchange of data, several SW services such as intelligent Q&A and semantic search engines still need to recognize and expose knowledge expressed in the sole way humans can easily understand it: natural language. Moreover, the role of different cultures and languages is fundamental in a real World aWare Web and, though English is recognized de facto as a "lingua franca" all over the world, much effort must be spent to preserve other idioms expressing different cultures. As a consequence, multilinguality has been cited as one of the six challenges for the Semantic Web [1]. These premises suggest that ontologies as we know them now should be enriched so that they not only cover formally expressed conceptual knowledge but also expose their content in a linguistically motivated fashion. In this paper we introduce our work in establishing a framework for semi-automatic linguistic enrichment of ontologies, which has involved identifying different categories of linguistic resources and planning their exploitation to augment the linguistic expressivity of ontologies. This effort has led to the development of
Ontoling1, a plugin for the popular ontology editing tool Protégé [6]. We describe here the features characterizing its current release and discuss some of the innovations we are planning for the near future.
2 Linguistic Enrichment of Ontologies: Motivation and Desiderata
Whether we consider the billions of documents currently available on the web, or the millions of users who access their content, enriching conceptual knowledge with natural language expressivity seems to us a necessary step towards realizing true knowledge integration and shareability. To achieve this objective, we should reconsider the process of Ontology Development to include the enrichment of semantic content with proper lexical expressions in natural language. Ontology Development tools should reflect this need, supporting users with dedicated interfaces for browsing linguistic resources: these are to be integrated with classic views over knowledge data such as class trees, slot and instance lists, offering a set of functionalities for linguistically enriching concepts and, possibly, for building new ontological knowledge starting from linguistic knowledge. Considering some of our past experiences [8,9] with knowledge-based applications dealing with concepts and their lexicalizations, a few basic functionalities for browsing linguistic resources (from now on, LRs) emerged as mandatory:
− Search term definitions (glosses)
− Ask for synonyms
− Separate different senses of the same term
− Explore genus and differentia
− Explore resource-specific semantic relations
as well as some others for ontology editing:
− Add synonyms (or translations, for bilingual resources) as additional labels for identifying concepts
− Add glosses to concept descriptions (documentation)
− Use notions from linguistic resources to create new concepts
While ontologies have undergone a process of standardization which culminated, in 2004, with the promotion of OWL [4] as the official ontology language for the Semantic Web, linguistic resources still maintain heterogeneous formats and follow different models, which makes the development of such an interface tricky. In the next two sections we address this problem and propose our solution for integrating available LRs.
3 Lexical Resources, an Overview
"The term linguistic resources refers to (usually large) sets of language data and descriptions in machine readable form, to be used in building, improving, or evaluating natural language (NL) and speech algorithms or systems" [3]. In particular, this definition
¹ OntoLing is freely available for download at: http://ai-nlp.info.uniroma2.it/software/OntoLing
includes lexical databases, bilingual dictionaries and terminologies, all of which may prove necessary in the context of a more linguistically aware approach to KR. In past years several linguistic resources were developed and made accessible (a few of them for free), so that a wide range of resources is now available, ranging from simple word lists to complex MRDs and thesauri. These resources differ greatly in the explicit linguistic information they expose, which may vary in format, content granularity and motivation (linguistic theories, task- or system-oriented scope, etc.). Multiple efforts have been made in the past towards achieving a consensus among different theoretical perspectives and system design approaches. The Text Encoding Initiative [14] and the LRE-EAGLES (Expert Advisory Group on Linguistic Engineering Standards) project [2] are just a few, with the objective of making possible the reuse of existing linguistic resources, promoting the development of new linguistic resources for those languages and domains where they are not yet available, and creating a cooperative infrastructure to collect, maintain and disseminate linguistic resources on behalf of the research and development community. However, at present, with the lack of a standard for existing LRs, it appears evident that the desiderata for the functionalities described in Section 2 depend upon the way these resources are organized. Often, even local agreement on the model adopted to describe a given resource (or series of resources) does not prevent an incorrect formulation of its content. This is due to the fact that many resources were initially conceived for humans and not for machines. In some cases [12] synonyms are clustered upon the senses related to the particular term being examined; in others [13] they are simply reported as flat lists of terms. In several dictionaries, synonyms are mixed with extended definitions (glosses) in an unpredictable way, and it is not possible to distinguish them automatically. Terms reported as synonyms may sometimes not be true synonyms of the selected term, but may represent more specific or more general concepts (this is the case of the Microsoft Word synonym prompter). Of course, the resources mentioned above are mere dictionaries not adhering to any particular linguistic model, though they may be valuable resources in their own right. A much stronger model is offered by WordNet [5] which, being a structured lexical database, presents a neat distinction between words, senses and glosses, and is characterized by diverse semantic relations such as hypernymy/hyponymy, antonymy, etc. Though not originally realized for computational uses, WordNet has become a valuable resource in human language technology and artificial intelligence. Furthermore, the development of WordNets in several other languages [11] has definitively contributed to the diffusion of the WordNet schema as a widely accepted model for LRs.
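For illustration, the short Python sketch below queries WordNet through NLTK for exactly the kind of information listed in Section 2 (glosses, synonyms, more general and more specific senses). NLTK and its WordNet corpus are assumed to be installed, the example term is arbitrary, and this code is not part of the OntoLing code base.

    from nltk.corpus import wordnet as wn   # requires: nltk.download("wordnet")

    term = "court"
    for sense in wn.synsets(term, pos=wn.NOUN):
        print(sense.name())
        print("  gloss    :", sense.definition())                    # term definition (gloss)
        print("  synonyms :", sense.lemma_names())                   # synonymical expressions
        print("  genus    :", [h.name() for h in sense.hypernyms()]) # more general senses
        print("  species  :", [h.name() for h in sense.hyponyms()])  # more specific senses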
4 Accessing Linguistic Resources: The Linguistic Watermark To cope with all of these heterogeneous LRs, we introduced the notion of Linguistic Watermark, as the series of characteristics and functionalities which distinguish a particular resource inside our framework. As we can observe from the Class Diagram in Fig. 1, we sketched a sort of classification of linguistic resources, with the addition of operational aspects. LRs are in fact structured and described in terms of their features
Fig. 1. The Linguistic Watermark
and how their lexical information is organized; the diagram has then been completed with query methods for accessing resource’s content. We thus implemented this schema as a java package on its own, which can externally be imported by any application willing to exploit natural language resources like lexicons and terminologies. The core of the package is composed of an Abstract Class, named LinguisticInterface, which is both the locus for a formal description of a given linguistic resource and a service-provider for exposing the resource specific methods. The other abstract classes and interfaces in the package, which can be implemented or not, depending on the profile of the resource being wrapped, provide instead the signatures for known interface methods.
We have currently developed several implementations of the Linguistic Watermark. Two of them, the Wordnet Interface and the latest DICT Interface, being freely distributable, have been made publicly available on the Ontoling site. The first one is an almost totally complete implementation of the Linguistic Watermark. The Wordnet Interface is in fact a ConceptualizedLR, because its linguistic expressions are clustered upon the different senses related to the each term. These senses – “synsets”, in Wordnet terminology – have been implemented through the Concept interface, which we see bounded by the import statement in the class diagram. Wordnet is a LRWithGlosses, as glosses are neatly separated from synonyms and organized in a one-to-one relation with synsets. Finally, Wordnet Interface implements TaxonomicalLR, as its indexed word senses are organized in a taxonomy of more specific/more generic objects. The other one, DICT Interface, is based on the Dictionary Server Protocol (DICT) [12], a TCP transaction based query/response protocol that allows a client to access dictionary definitions from a set of natural language dictionary databases. The DICT interface is conceptualized too, though its word senses are not indexed as in Wordnet (that is, it is not possible to correlate senses of two different terms upon the same meaning). DICT Interface is also a BilingualLinguisticInterface, as its available wordlists provide translations for several idioms. Other available interface classes denote Flat resources (as opposed to Conceptualized ones), which contain flat lists of linguistic expressions for each defined term, and BidirectionalTranslators, which represent a further specialization of Bilingual Linguistic Interfaces providing bidirectional translation services. We defined two classes of methods for browsing LRs: those defined in advance in the interfaces, which can thus be exploited inside automatic processes, and other very specific resource-dependent methods, which are loaded at run-time when the LR is interfaced to some browsing application (e.g. Ontoling). Two methods available in LinguisticInterface: getLexicalRelationList and getConceptualRelationList act thus as service publishers, the former providing different methods for exploring lexical relations among terms or relating terms to concepts, the latter reporting semantic relations among concepts.
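To give a feel for how these traits combine, here is a much simplified sketch of the interface hierarchy, written in Python for readability. The actual Linguistic Watermark package is a Java library, and the method names below are abridged assumptions rather than its real signatures.

    from abc import ABC, abstractmethod

    class LinguisticInterface(ABC):
        @abstractmethod
        def get_lexical_relation_list(self): ...      # published lexical-relation services
        @abstractmethod
        def get_conceptual_relation_list(self): ...   # published conceptual-relation services

    class ConceptualizedLR(ABC):
        @abstractmethod
        def get_concepts(self, term): ...             # word senses for a term

    class LRWithGlosses(ABC):
        @abstractmethod
        def get_gloss(self, concept): ...             # one gloss per sense

    class TaxonomicalLR(ABC):
        @abstractmethod
        def get_hyponyms(self, concept): ...          # more specific senses

    class WordnetInterface(LinguisticInterface, ConceptualizedLR, LRWithGlosses, TaxonomicalLR):
        # A wrapper implementing (almost) the full watermark, as described above;
        # the bodies would delegate to a WordNet backend in a real implementation.
        def get_lexical_relation_list(self): return ["synonymy", "antonymy"]
        def get_conceptual_relation_list(self): return ["hypernymy", "hyponymy"]
        def get_concepts(self, term): return []
        def get_gloss(self, concept): return ""
        def get_hyponyms(self, concept): return []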
5 Ontoling Architecture The architecture of the Ontoling plugin (see Fig. 2) is based on three main components: 1. the GUI, characterized by the Linguistic Resource browser and the Ontology Enrichment panel 2. the external library Linguistic Watermark, which has been presented in the previous section, providing a model for describing linguistic resources 3. the core system and an additional external component for accessing a given linguistic resource. This component, which can be loaded at runtime, must implement the classes and interfaces contained in the Linguistic Watermark library, according to the characteristics of the resource which is to be plugged. In the following sections we provide details on the above components.
[Fig. 2 architecture diagram: a GUI facade with the Linguistic Browser and the Ontology Browser communicates with the Ontoling core, which accesses a Linguistic Interface implemented by resource-specific wrappers (Wordnet Interface, FreeDict Interface, … Interface), each wrapping its own Linguistic Resource.]
Fig. 2. Ontoling Architecture
5.1 Ontoling Core Application The core component of the architecture is responsible for interpreting the Watermark of linguistic resources and for exposing those functionalities which suit to their profile. Moreover, the behavior of the whole application is dependant on the nature of the loaded resource and is thus defined at run-time. Several methods for querying LRs and for exposing results have been encapsulated into objects inside a dedicated library of behaviors: when a given LR is loaded, the core module parses its Linguistic Watermark and assigns specific method-objects to each GUI event. With such an approach, the user is provided with a uniform view over diverse and heterogeneous linguistic resources, as they are described in the Linguistic Watermark ontology, and easily learns how to interact with them (thus familiarizing with their peculiarities) by following a policy which is managed by the system. For example, with a flat resource, a search on a given term will immediately result in a list of (potential) synonyms inside a dedicated box in the GUI; instead, with a conceptualized resource, a list of word senses will appear in a results table at first, then it will be browsed to access synonymical expressions related to the selected sense. Analogous adaptive approaches have been followed for many other aspects of the Linguistic Watermark (mono or bidirectional Bilingual Translators, presence of glosses, Taxonomical structures and so on…) sometimes exploding with combinatorial growth. 5.2 Ontoling User Interface Once activated, the plugin displays two main panels, the Linguistic Browser on the left side, and the Ontology Panel on the right side (see Fig. 3).
[Fig. 3 screenshot annotations: the Linguistic Browser (left), with conceptual and lexical relation explorers, fields for sense IDs, glosses and synonyms, and a results table; the Ontology Panel (right).]
Fig. 3. A screenshot of the Ontoling Plugin
The Linguistic Browser is responsible for letting the user explore the loaded linguistic resource. Fields and tables for searching the LR and for viewing the results, according to the modalities decided by the core component, are made available. The menu boxes on the left of the Linguistic Browser are filled at run time with the methods for exploring LR-specific lexical and conceptual relations. The Ontology Panel, on the right, offers a perspective over ontological data in the classic Protégé style. By right-clicking on a frame (class, slot or instance), the typical editing menu appears, with some further options provided by Ontoling to:
1. search the LR by using the frame name as a key
2. change the name of the selected frame to a term selected from the Linguistic Browser
3. add terms selected from the Linguistic Browser as additional labels for the selected frame
4. add glosses as a description for the selected frame
5. add IDs of senses selected from the Linguistic Browser as additional labels for the frames
6. create a new frame with a term selected from the Linguistic Browser as frame name (identifier)
7. only in the class and slot browsers: if the LR is a TaxonomicalLR, explore hyponyms (up to a chosen level) of the concept selected on the Linguistic Browser and reproduce the tree on the frame browser, starting from the selected frame, if available
Fig. 4. Enriching an English OWL concept with selected Italian terms
These functionalities allow not only for linguistic enrichment of ontologies, but can be helpful for Ontologists and Knowledge Engineers in creating new ontologies or in improving/modifying existing ones. Note how functionality 5 has not a rigid linguistic motivation, but is indeed dedicated to those willing to build an artificial controlled vocabulary which contains direct references to the senses of a particular resource. How terms and glosses are added to the description of ontologies concepts, depends on the ontology model which is being adopted and is explained in detail in the following section.
6 Using Ontoling with Protégé and Protégé OWL When a frame-based approach was first adopted in Protégé as a knowledge model for representing ontologies and knowledge bases, no explicit effort was dedicated to the representation of possible alternate labels (synonyms) for concepts neither to support the idea of multilingualism in Ontologies. Frame names were almost as equivalent as IDs, and people were only encouraged, as it is common practice in computer programming when addressing variable names, to adopt “meaningful and expressive names” to denote these IDs. The Protégé model was indeed quite strong and expressive, so that every ontology developer could deal with his linguistic needs at a metaontological level and find the right place for them. Rare examples exist of Protégé
customized applications which deal with multilingualism and/or wider linguistic descriptions [8], but no official agreement was yet established. Later on, with the advent of OWL as a KR standard for the Semantic Web, and with the official release of the Protégé OWL plugin [7], things started to converge towards a minimal agreement for the use of language inside ontologies. To cope with Protégé standard model, we defined the notion of terminological slot, as a slot which is elected by the user to contain different linguistic expressions for concepts. This way, to use Ontoling with standard Protégé, a user only needs to define a proper metaclass and metaslot, containing the elected terminological slot; naturally, the same slot can be dedicated to instances at class level. Multilingual ontologies can also be supported by creating different slots and selecting each of them as terminological slots during separate sessions of Linguistic Enrichment, with diverse LRs dedicated to the different chosen languages. Glosses can instead be added to the common “documentation” slot which is part of every frame by default. Conversely, Linguistic Enrichment of OWL Ontologies follows a more predictable path, thanks to OWL’s language dedicated Annotation Properties, such as rdfs:label and owl:comment. When Ontoling recognizes a loaded ontology as expressed in the OWL language, the terminological slot is set by default to rdfs:label. In this case the xml:lang attribute of the label property is automatically filled with the language declared by the Linguistic Interface (see Fig. 4). As a further step, we are considering to give a greater emphasis to terms, seeing them no more as labels attached to concepts, but reifying them as concrete ontological elements. Many-to-many relationships can be established between concepts and terms, which can thus be accessed both ways. This approach guarantees greater linguistic awareness over ontological data, and is particularly useful when the conceptual content of Ontologies must be retrieved from documents, user questions and other cases where interaction with natural language content is required. This process is however again far from standardization and thus requires an agreement over the way terms, concepts and their relations are modeled.
7 Conclusions and Future Work It appears evident that, in a process which has already been widely described and discussed in literature such as Ontology Development, the role of language must not be underestimated. If we believe that knowledge resources will really help in making the Semantic Web dream become true, we have to face the real aspects which characterize the Web as we know it, now. Thousands of millions of documents which are available on the web are mostly written in natural language; at the same time, people like to interact with computers using even more friendly interfaces, and we do not know better solution than commonly spoken language. A more linguistic awareness could also help semantic search engines in augmenting the retrieval of proper results, or, at least, in excluding information which is not pertinent to the intention behind the submitted query. In this work we stressed the need of providing a general framework for dealing with heterogeneous linguistic resources and for exploiting their content in the process of ontology development. Different functionalities for augmenting the linguistic
expressivity of existing ontologies or for helping users in developing new knowledge resources from scratch have been identified and implemented in the presented Ontoling plugin for Protégé. Ontoling, with the Wordnet Interface as its first available plugin, has been adopted by a community of users coming from diverse research areas, from pure linguists approaching ontologies, to ontology developers exploiting specific parts of Wordnet’s taxonomical structure as a basis for creating their own domain ontology, up to users needing its main functionalities for adding synonyms to concepts of existing ontologies. With the recent release of the DICT Interface we added a little step in assisting multilingual ontology development and we now look forward other available resources (such as [10,11]) to be added to Ontoling plugin library.
References 1. V. R. Benjamins, J. Contreras, O. Corcho and A. Gómez-Pérez. Six Challenges for the Semantic Web. SIGSEMIS Bulletin, April 2004 2. N. Calzolari, J. McNaught, and A. Zampolli EAGLES Final Report: EAGLES Editors Introduction. EAG-EB-EI, Pisa, Italy 1996. 3. R. A. Cole, J. Mariani, H. Uszkoreit, A. Zaenen, and V. Zue, Eds. Survey of the State of the Art in Human Language Technology, Cambridge University Press, Cambridge, UK, 1997. 4. M. Dean, D. Connolly, F. van Harmelen, J. Hendler, I. Horrocks, D. L. McGuinness, P. F. Patel-Schneider, and L. A. Stein. OWL Web Ontology Language 1.0 Reference, W3C Working Draft 29 July 2002, http://www.w3.org/TR/owl-ref/. 5. Fellbaum, C.: WordNet - An electronic lexical database. MIT Press, (1998). 6. J. Gennari, M. Musen, R. Fergerson, W. Grosso, M. Crubézy, H. Eriksson, N. Noy, and S. Tu. The evolution of Protégé-2000: An environment for knowledge-based systems development. International Journal of Human-Computer Studies, 58(1):89–123, 2003. 7. H. Knublauch, R. W. Fergerson, N. F. Noy, M. A. Musen. The Protégé OWL Plugin: An Open Development Environment for Semantic Web Applications Third International Semantic Web Conference - ISWC 2004, Hiroshima, Japan. 2004 8. M.T. Pazienza, A. Stellato, M. Vindigni, A. Valarakos, V. Karkaletsis. Ontology integration in a multilingual e-retail system. HCI International 2003, Crete, Greece, 2003 9. M. T. Pazienza, A. Stellato, L. Henriksen, P. Paggio, F. M. Zanzotto. Ontology Mapping to support ontology-based question answering. Proceedings of the second MEANING workshop. Trento, Italy, February 2005 10. Emanuele Pianta, Luisa Bentivogli, Christian Girardi. MultiWordNet: developing an aligned multilingual database". In Proceedings of the First International Conference on Global WordNet, Mysore, India, January 21-25, 2002 11. P. Vossen. EuroWordNet: A Multilingual Database with Lexical Semantic Networks, Kluwer Academic Publishers, Dordrecht, 1998 12. www.dict.org/bin/Dict 13. www.freelang.com 14. www.tei-c.org/
Acquiring an Ontology from the Text
A Legal Case Study
Núria Casellas¹, Aleks Jakulin², Joan-Josep Vallbé¹, and Pompeu Casanovas¹
¹ Institute of Law and Technology, Universitat Autònoma de Barcelona (Catalonia, Spain) {nuria.casellas, pep.vallbe, pompeu.casanovas}@uab.es
² Jožef Stefan Institute, Ljubljana, Slovenia [email protected]
Abstract. A topic ontology applies the usual ontological constructs to the task of annotating the topic of a document. The topic is the highly summarized essence of the document. The topics are usually chosen intuitively and rarely questioned. However, we have studied several ways of allocating frequently asked questions from a legal domain into a set of topical sub-domains. Our criteria were: 1) The sub-domains should not overlap. 2) The sub-domain should be objectively identifiable from the words of the text. 3) Which words and grammatical categories can serve as keywords? 4) Can the structure of sub-domains be induced semiautomatically from the text itself? Keywords: ontology acquisition, topic ontology, textual statistics, legal ontology.
1 Introduction
The work we present in this paper is part of the development within the European SEKT¹ project of the prototype Iuriservice,² a web-based application that retrieves answers to questions raised by newly recruited judges in the Spanish Judiciary. This system offers access to a frequently asked questions (FAQ) database through a natural language interface, where the judge describes the problem at hand and the application responds with a list of relevant question-answer pairs that offer solutions to that particular problem. Within this system, ontologies are being used to provide a more accurate search than the basic keyword search [1, 2]. From data gathered through an ethnographic survey [3], a corpus of 756 questions was extracted. Then, with the participation and collaboration of the General Council of the Judiciary (GCPJ), the questions containing dilemmas (mostly regarding best-practices during on-duty hours or staff management),
¹ http://www.sekt-project.com
² Developed by Intelligent Software Components, S.A. (iSOCO, http://www.isoco.com) and the Institute of Law and Technology (UAB).
were answered by professionals (the tutors of the Spanish School of the Judiciary). Both, the questions and the answers contained professional knowledge (or lack of professional knowledge) and the purpose of the tool is, thus, the gathering, the maintenance and the distribution of this professional judicial knowledge among the judiciary. In order to conceptualise that professional and domain dependent knowledge, an ontology was extracted manually from the selection of terms (nouns) and relations (adjectives, verbs) of each of the questions gathered. This ontology is called Ontology of Professional Judicial Knowledge (OPJK) [5, 6]. The classes and instances were extracted manually as the 756 questions’ corpus was sufficiently compact to allow manual extraction and, at the time, there was no semi-automatic extraction tool available for Spanish. This ontology is used as the underlying semantic model (OPJK) to compute the input question with natural language processing software (tokenizer, morphological analyser, basic shallow parser and semantic analysis) for retrieving the question-answer pair of FAQ stored within the system [4]. The process of extraction of concepts from the text in order to both model OPJK and show the importance of the source of the data (text) when constructing ontologies of legal professional knowledge has been largely explained in other papers [5]. This paper focuses on the use of this same corpus of questions to model a second ontology, this time a topic ontology, to allow legal sub-domain detection. The sub-domain detection helps with search: if we can detect the topic of a query (such as gender violence), we can offer answers that cover that particular topic. Topic detection does not require the user to be familiar with a specific nomenclature of topics. First, we will introduce topic ontologies and the manually developed classification into topics (or legal sub-domains) by the legal expert’s team. Then, we will describe the analysis of the results provided by a classifier, which results in recommendations on how to prepare the corpus prior to the statistical analysis of the text.
2 A Topic Ontology for the Corpus of Questions
A topic is a semantic annotation of a document that people instinctively use to search and organize information. While we can semantically annotate many facets of a document, the notion of topic covers the document as a whole: it is the conceptualization of the document with the greatest breadth. When we seek information, our primary intention is usually to discover information on the same topic as that of our problem. Topic can be distinguished from other descriptors of a document, such as language, style, punctuation, grammar, reputation, graphical design, author, date, and so on. The conceptual structure of documents and topics can be mapped to the RDF structure. We treat both topics and documents as objects, and we use the transitive hasSubject predicate to associate a document with its topics.
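To make the hasSubject association concrete, the sketch below writes the corresponding triples with the Python rdflib library; the namespace, topic and document URIs are invented for the example and are not taken from the paper.

```python
from rdflib import Graph, Namespace, Literal, RDF, RDFS
from rdflib.namespace import OWL

# Hypothetical namespace for the topic ontology (not from the paper).
EX = Namespace("http://example.org/topic-ontology#")

g = Graph()
g.bind("ex", EX)

topic = EX.GenderViolence
doc = EX.Question042

# Topics and documents are both treated as plain objects (resources).
g.add((topic, RDF.type, EX.Topic))
g.add((topic, RDFS.label, Literal("gender violence", lang="en")))
g.add((doc, RDF.type, EX.Document))

# The document is annotated with its topic through the transitive hasSubject predicate.
g.add((EX.hasSubject, RDF.type, OWL.TransitiveProperty))
g.add((doc, EX.hasSubject, topic))

print(g.serialize(format="turtle"))
```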
Such a restricted family of ontologies, with somewhat different semantics, is referred to as topic ontologies, to distinguish them from more general schema ontologies [10]. Fig. 1 shows a description of a topic and a document in the RDF format.
Fig. 1. An example of the RDF description of topics and documents
Initially, the team of domain experts identified several sub-domains (topics) intuitively gathered from the questions. The detection of sub-domains led, at the beginning, to a classification of questions into these different sub-domains to ease the construction of the ontology, which focused, at first, on two specific sub-domains: on-duty and family issues (containing gender violence and protection or restraining orders). The manual classification of questions into sub-domains resulted in the list shown below (see also Fig. 2):
1. On-duty
2. Family issues
  2.1. Gender violence
  2.2. Minors
  2.3. Divorce/separation/custody
  2.4. Other
3. Immigration
4. Property
5. Sentencing
  5.1. Execution
  5.2. Breach of order/judgment
  5.3. Form
  5.4. Notification
6. Proceedings
  6.1. Evidences
  6.2. Competence
  6.3. Criteria
7. Court office
  7.1. Organization
  7.2. Civil servants
  7.3. Prosecution
  7.4. Infrastructure
  7.5. Security
  7.6. Informatics
  7.7. Experts
  7.8. Police and security
  7.9. Incompatibilities
  7.10. Relations with the GCJP
  7.11. Other
  7.12. Lawyers
8. Business Law
9. Traffic accidents
10. Criminal Law
  10.1. Drugs
  10.2. Organized crime
  10.3. Theft/Robbery/Burglary
Later on, during the manual conceptualization of OPJK, the domain experts and ontology engineers in the team realised that most of the questions regarding on-duty and family issues were necessarily related to procedural moments (phases or acts, proceedings). Moreover, decision-making and judgements were also necessarily related to proceedings, and gender violence (included within family issues) was necessarily related to Criminal Law, and thus to the other sub-domains being modelled.
Fig. 2. Screenshot of the intuitively detected sub-domains, where the domains are directly related to the activity of the judge
This fact showed, again intuitively, that the questions could have been misclassified and also that a question could refer to more than one intuitive sub-domain at the same time. This intuitive conclusion was confirmed by the results provided by a support vector machine classifier. In simple terms, the classifier attempted to learn by which keywords to distinguish the topics or sub-domains of questions. Ideally, the classifier would automatically, correctly and unambiguously decide which sub-domain a particular question belongs to. In practice, however, the classifier will make mistakes. Some of the mistakes might indicate that the categories are not specified in sufficient detail. Other mistakes may indicate incorrectly labelled questions, but there might also be questions that can be correctly classified in several ways. The classifier first analysed the intuitive classification into sub-domains and provided us with two different statistical reports. The first report was a list of keywords for each sub-domain, showing which terms the classifier considered to be in favour of or against that sub-domain. The second report was a diagnosis for each of the intuitively and manually classified questions, listing arguments for or against that classification.
2.1 Operation of the Classifier
The technology needed for topic detection is classification, a ubiquitous technique in machine learning. A classifier (or classification model) is an automatic detector of topics in documents. Since the classifier cannot work with text directly, the documents are represented with features. Features are concepts that can be easily detected in the text. Normally, the ordering of features is ignored; we refer to this assumption as the "bag of features" assumption. Still, we do count the number of appearances of each feature, so each feature is represented by a number proportional to its number of appearances. If the feature was not detected in a document, the number will be zero.
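A minimal sketch of this bag-of-features representation using scikit-learn's CountVectorizer; the two toy "questions" are invented placeholders, not items from the corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy stand-ins for (already lemmatised) questions.
docs = [
    "guardia cadaver levantamiento medico guardia",
    "menor juicio abogado guardia",
]

# Word order is ignored ("bag of features"); each feature is a count of appearances,
# and the count is zero where the feature does not occur in a document.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(X.toarray())
```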
In the past, bags of features have been referred to as "bags of words". It is true that a feature can indicate a word. For the case of Spanish text, the features instead correspond to lemmas: the tense, gender or other grammatical properties of a word are ignored. In some cases, features can denote multi-word expressions with a specific meaning (such as orden de protección). Some lemmas are usually considered not to be features, such as prepositions and other function words. Also, punctuation is not usually captured by any features. Thus, features should correspond to carriers of relevant semantic information.
As the classification algorithm, we have employed the support vector machine (SVM) approach with a dot product kernel [12, 8]. This is perhaps the most frequently used technique for classification of text, usually considered to be one of the best [9]. For that reason, unsuccessful grounding of an ontology cannot be blamed on the choice of the classification algorithm, but on the preparation of the data or on the topic ontology.
It is usually insufficient to merely analyze the classification performance of a model. To diagnose the reasons and the courses of remedy, we need to examine the problems in more detail. Specifically, we should understand the causes of the errors. Such causes will provide insight about the missing background knowledge, which can then be introduced manually. To facilitate the analysis, we employ two techniques:
– Identifying the decisive features for each concept.
– Identifying the instances where the concept is not identified correctly.
The SVM classifier encodes each feature with a real-valued variable. All the features are encoded in a feature vector x. The sub-domain is encoded with the variable y, which takes the value of +1 when the sub-domain is present and -1 when it is absent. The classifier then acts as a function y = f(x). SVM with the dot product kernel is associated with a function of the following form:
$f(\mathbf{x}) = \begin{cases} -1 & \text{if } \sum_i \alpha_i y_i (\mathbf{x}_i \cdot \mathbf{x}) < 0 \\ +1 & \text{otherwise} \end{cases}$
Here, $y_i, \mathbf{x}_i$ are the labelled training instances, and $\alpha_i$ are the parameters of the model. While this sum is over the training instances with nonzero $\alpha_i$, an alternative is to convert it into a sum over the features [7]. In that case, f takes the form:
$f(\mathbf{x}) = \begin{cases} -1 & \text{if } d + \sum_j w_j [\mathbf{x}]_j < 0 \\ +1 & \text{otherwise} \end{cases}$
Here, $[\mathbf{x}]_j$ denotes the j-th dimension of vector x. We can interpret the weights w as a degree of importance of each individual feature. However, rare words would appear to have a very high importance, while in fact they only apply to rare cases. Instead, our measure of the importance of feature j is going to be the net leverage q across all the training instances:
$q(j) = \sum_i w_j [\mathbf{x}_i]_j$
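The weights and the net leverage can be recovered from any linear SVM implementation. The sketch below uses scikit-learn's LinearSVC as a stand-in for the learner used in the paper and assumes that X is the bag-of-features matrix, y the ±1 sub-domain labels and feature_names the corresponding lemma list.

```python
import numpy as np
from sklearn.svm import LinearSVC

def net_leverage(X, y, feature_names, top_k=20):
    """Fit a linear SVM and compute q(j) = sum_i w_j * [x_i]_j for every feature j."""
    clf = LinearSVC(C=1.0)
    clf.fit(X, y)                                   # y is +1/-1 per question
    w = clf.coef_.ravel()                           # one weight per feature
    q = w * np.asarray(X.sum(axis=0)).ravel()       # net leverage across the corpus

    order = np.argsort(q)
    in_favor = [(feature_names[j], q[j]) for j in order[::-1][:top_k] if q[j] > 0]
    against = [(feature_names[j], q[j]) for j in order[:top_k] if q[j] < 0]
    return in_favor, against
```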
A positive sign of the net leverage indicates that the particular feature votes in favor of a sub-domain across the whole corpus, and a negative sign indicates votes against the sub-domain. While the net leverage allows an overall diagnosis of the corpus, leverage can also be computed for a particular question only. A question may sometimes be misclassified: a question belonging to the sub-domain but classified as not belonging to it is a false negative, while a question not belonging to the sub-domain but classified as belonging to it is a false positive. By examining misclassifications we can see what went wrong and seek courses of remedy.
Analysis of Keywords. To display the "keywords" of a particular sub-domain, we simply build the classifier to detect the sub-domain. Then, we seek out the features with the highest net leverage: these features are the "arguments in favor". The "arguments against" are the features with the lowest (most negative) net leverage. Analysing the lists of keywords for each sub-domain, the domain experts detected that most terms stated to be in favour had no relevant meaning for that particular sub-domain, that some terms stated to be against a sub-domain could actually be relevant for it, and that some terms, neither in favour nor against, had no more special relation to that particular sub-domain than to another.
GUARDIA: model for guardia [on-duty] (#q: 79)
* Keywords in favor: 291.91 (170) guardia [on-duty]; 124.85 (119) cadaver [corpse]; 108.18 (973) ","; 80.68 (131) asunto [suit]; 65.29 (146) medico [doctor]; 65.09 (250) pedir [to request]; 60.64 (117) llamar [to call]; 59.66 (64) traslado [transfer]; 57.74 (86) levantamiento [corpse removing]; 52.16 (51) entrada [entry]; 45.37 (62) urgente [urgent]; 44.65 (82) en_caso_de [in the case of]; 41.39 (207) persona [person]; 39.81 (186) policia [police]; 38.77 (55) registro [search]; 37.50 (113) auto [court order]; 36.77 (63) social [social]; 36.42 (58) remitir [to refer]; 34.18 (156) ante [before]; 33.31 (325) estar [to be]
* Keywords against: 139.79 (738) "."; 73.89 (337) caso [case]; 57.73 (143) abogado [lawyer]; 48.19 (130) menor [minor]; 46.26 (318) juicio [trial]; 45.28 (215) pero [but]; 44.59 (144) mandar [to order]; 43.23 (110) seguir [to follow]; 39.64 (117) malo [bad/ill]; 38.69 (110) trato [treatment]; 38.34 (137) funcionario [civil servant]; 36.97 (986) "?"; 35.75 (149) rapido [fast/quick]; 31.30 (105) poner [to put]; 31.15 (129) ese [that]; 30.18 (53) autopsia [autopsy]; 28.66 (80) trabajo [work]; 27.76 (210) judicial [judicial]; 27.28 (141) delito [offence]; 26.08 (55) telefono [telephone]
For example, if we take the on-duty sub-domain (shown above), legal domain experts can identify that on-duty [guardia], corpse [cadaver], doctor [medico], corpse removing [levantamiento], entry [entrada], urgent [urgente], police [policia] and search [registro] are relevant terms for the on-duty period. However, suit [asunto], transfer [traslado], person [persona], court order [auto], social [social] and to refer [remitir] have no relevant meaning for that particular sub-domain. Moreover, the list of keywords against a particular question being classified into the on-duty sub-domain contains some terms which can be identified by
domain experts as being in favour of a question belonging to the on-duty sub-domain. The terms autopsy [autopsia] and telephone [telefono] are especially relevant to on-duty questions, as most autopsies and line tappings are requested during on-duty hours. We also have to bear in mind that most on-duty requests are made through a phone call to the judge (who might be at home). Also, some other terms of the list against questions being classified into the on-duty sub-domain could be identified as relevant to on-duty questions, but could be relevant to other sub-domains as well. Suit [caso], lawyer [abogado], minor [menor], trial [juicio], civil servant [funcionario] and offence [delito] are relevant keywords for the on-duty sub-domain but also for the family issues sub-domain (minor), the criminal law sub-domain (offence), the court office sub-domain (civil servant and lawyer), the proceedings sub-domain (trial) or any sub-domain (suit). A particular feature may be an argument against a sub-domain even if it does appear in the corresponding questions. This can be explained by the fact that the particular feature appears even more frequently in other sub-domains. But as some sub-domains only have a small number of examples (questions), the classifier can make mistakes.
Analysis of Misclassifications. Because all the questions have been manually classified by the experts, we know the true classification of each question. If the question is then shown to the classifier, we can check whether the automatic classification is consistent with the manual one. If there has been a mistake, we can examine which features were decisive in the classification. We show the five features with the highest and the five features with the lowest leverage for the wrong classification, and we do the same for the correct classification. The errors may arise for four reasons: overvaluing of arguments in favor of the wrong decision and against the correct decision, and undervaluing of arguments in favor of the right decision and of arguments against the wrong decision. In practice, the classifier might not detect an important feature, or might detect an irrelevant one.
When this analysis was compared with the analysis made by the same domain experts of the diagnoses for each of the questions originally classified in the on-duty sub-domain, they could detect that:
1. Some on-duty questions had been misclassified in that sub-domain and clearly belonged to another (see the example below).
¿Qué hago con el procedimiento cuando mando análisis de ADN y tardan casi un año en llegar los resultados? [What do I do with the proceedings when I order a DNA analysis and the results take almost a year to arrive?]
True=guardia  Predicted=proceso/pruebas
* Keywords in favor of guardia: 0.00 (0) estar; 0.00 (0) juzgar; 0.00 (0) unico; 0.00 (0) guardia; 0.00 (0) permanente
* Keywords against guardia: 1.35 (4) mandar; 0.36 (6) llegar; 0.06 (6) ano; 0.04 (1) "?"; 0.00 (4) procedimiento
* Keywords in favor of proceso/pruebas: 2.99 (6) analisis; 1.85 (7) casi; 1.85 (7) adn; 1.29 (6) ano; 1.11 (6) resultado
* Keywords against proceso/pruebas: 0.32 (4) procedimiento; 0.04 (1) "?"; 0.00 (0) estar; 0.00 (0) juzgar; 0.00 (0) unico
2. Some on-duty questions were correctly classified into the on-duty period but were also relevant to other sub-domains (see the example below).
¿Cómo proceder en aquellos casos en que una mujer interpone denuncia por malos tratos, pero pide expresamente que no se acorde ninguna medida de alejamiento? [How should one proceed in those cases in which a woman files a complaint for ill-treatment but expressly asks that no restraining order be issued?]
True=guardia  Predicted=familia/violencia_domestica
* Keywords in favor of guardia: 1.63 (6) ninguno; 0.88 (3) pedir; 0.47 (4) proceder; 0.23 (2) ","; 0.11 (4) mujer
* Keywords against guardia: 1.55 (4) trato; 1.47 (4) malo; 0.75 (4) pero; 0.65 (3) caso; 0.65 (5) alejamiento
* Keywords in favor of familia: 5.24 (4) medida; 2.85 (4) malo; 2.71 (4) trato; 2.66 (5) alejamiento; 1.79 (4) denunciar
* Keywords against familia: 1.18 (4) proceder; 1.06 (6) ninguno; 1.00 (3) pedir; 0.63 (5) aquel; 0.40 (3) caso
3. Some true on-duty questions were shown, wrongly, to belong to another sub-domain (see the example below).
Me llaman desde el depósito municipal de cadáveres diciendo que ha venido la funeraria pidiendo la licencia para llevarse un cadáver. ¿Qué tengo que comprobar antes de darles la licencia? ¿Qué hay que hacer? [They call me from the municipal morgue saying that the funeral home has come asking for the licence to take a corpse away. What do I have to check before giving them the licence? What has to be done?]
True=guardia  Predicted=proceso/pruebas
* Keywords in favor of guardia: 9.60 (9) cadaver; 2.25 (4) llamar; 0.88 (3) pedir; 0.00 (0) estar; 0.00 (0) juzgar
* Keywords against guardia: 5.10 (13) licencia; 2.89 (6) municipal; 1.99 (5) deposito; 1.27 (7) funeraria; 1.16 (5) antes_de
* Keywords in favor of proceso: 2.87 (13) licencia; 0.79 (5) antes_de; 0.60 (3) pedir; 0.59 (6) municipal; 0.35 (5) deposito
* Keywords against proceso: 1.63 (9) cadaver; 0.78 (4) dar; 0.58 (2) "."; 0.33 (4) venir; 0.28 (4) llamar
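A sketch of how such diagnosis reports could be reproduced: cross-validated predictions expose the false positives and false negatives, and the per-question leverage of each feature supplies the "arguments" for either decision. The variable names (X as a sparse count matrix, y_true, questions) are assumptions, not the paper's code.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.svm import LinearSVC

def diagnose(X, y_true, feature_names, questions, cv=10, top_k=5):
    """List misclassified questions together with their most decisive features."""
    clf = LinearSVC(C=1.0)
    y_pred = cross_val_predict(clf, X, y_true, cv=cv)   # out-of-fold predictions
    clf.fit(X, y_true)                                  # refit once to inspect weights
    w = clf.coef_.ravel()

    for i in np.where(y_pred != y_true)[0]:
        leverage = w * X[i].toarray().ravel()           # per-question leverage
        order = np.argsort(leverage)
        kind = "false negative" if y_true[i] == 1 else "false positive"
        print(f"{kind}: {questions[i]}")
        print("  in favor:", [feature_names[j] for j in order[::-1][:top_k]])
        print("  against: ", [feature_names[j] for j in order[:top_k]])
```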
In conclusion, the data showed that the classifier encountered several problems in classifying the questions, mostly due to deficiencies in the original classification. One of the reasons was automatic misclassification, as shown above. There was, however, another reason: the on-duty sub-domain needed to be considered differently. The on-duty sub-domain does not represent a conceptual domain of specific knowledge; it represents a time domain. The on-duty period involves all those questions that arise at a certain time. Therefore, the questions belonging to the on-duty period could and should also belong to the other sub-domains, which represent conceptual domains and contain the questions in a certain knowledge area.
Secondly, the classifier misjudged the relevance of some of the keywords used for argumentation. As stated above, some terms were not relevant for a particular sub-domain. However, the main difficulties were:
1. The use of terms that are completely irrelevant for the legal domain (at least on their own), like but [pero], to order [mandar], to follow [seguir], to put [poner], that [ese] and work [trabajo];
2. The inability to capture relevant compound forms. For example, bad/ill [malo] and treatment [trato] are regarded separately, while they actually form a compound: bad/ill treatment [malos tratos] is a relevant key-phrase, used to identify the questions related to gender violence.
Therefore, a conceptual change was adopted and on-duty was regarded, from that point, as a time domain rather than a sub-domain in itself. The data also had to be prepared so as to reduce the number of errors.
3 Text Preparation for Ontology Modeling
One of the main principles in Corpus Linguistics is that the analyst has to do his best to avoid as much "noise" in the corpus as he can. There are two important operations: to unify all the words with a very similar meaning, and to remove the words that do not carry relevant meaning. The first operation is achieved using stemming, lemmatisation and thesauri (including WordNet). In stemming, we remove the final few letters of the word that carry grammatical nuances but do not distinguish the meaning. For example, we represent the words dog, dogs, doggy with dog+. A verb such as be may take several forms: is, are, were, being, be. During lemmatisation, we replace all these forms, which may reflect tense, gender, inflection, irregularity, etc., with the simple canonical one: be. While this particular example may seem contrived, other languages, such as Spanish, have many more such examples. Such languages benefit much more from lemmatisation, while English does well already just with stemming [11]. The second important practice is removing the words that are deemed not to carry any relevant meaning for the task of determining sub-domains and topics. Such words (or lemmas) are referred to as stop words. There may be two reasons why they are to be ignored. Some words may carry no conceptual meaning on their own (e.g. prepositions, conjunctions, etc.). Although a more sophisticated method would be able to gain information from such words, the "bag of words" approach cannot, so such words are mere noise. The second type of word may have meaning on its own, but does not add any (semantic) value for the purpose of our analysis (e.g. auxiliary verbs). A list of such words is then used to prepare the data before classification.
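A minimal illustration of these two operations with NLTK: a Spanish Snowball stemmer stands in for the full lemmatisation performed by a morphological analyser, and a tiny hand-made stop-word list is removed. The example sentence and the stop-word list are invented for the sketch.

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("spanish")

# Tiny hand-made stop-word list; the paper derives its list from POS categories instead.
STOP = {"de", "la", "que", "el", "en", "y", "a", "los", "del"}

def normalise(tokens):
    """Lower-case, drop stop words and conflate grammatical variants of a word."""
    out = []
    for tok in tokens:
        tok = tok.lower()
        if tok in STOP:
            continue
        out.append(stemmer.stem(tok))
    return out

print(normalise("El juez pide la licencia durante la guardia".split()))
```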
3.1 Irrelevant Grammatical Categories
In any linguistic corpus we may find many “grammatical” or function words that are unlikely to be useful for search. For example, common function words in a
text sample in Spanish are de, la, que, el, en, y, a, los, del, etc. These words do have important functions in Spanish, but they hardly contribute any useful information for search or topic classification. The stop words mentioned above (among others) correspond to different grammatical categories and are therefore tagged in different ways by the morphological analyser we are using (FreeLing, http://garraf.epsevg.upc.es/freeling/). For instance, de [of], en [in] and a [to] are prepositions [SPS00]; la [the, fem.] and el [the, masc.] are definite, feminine/masculine, singular articles [TDFS0, TDMS0]; que [that, which] and y [and] are a subordinate [CS00] and a coordinate [CC00] conjunction, and so forth. The words in our corpora tagged with the following standard codes (corresponding to function words) will be considered stop words. The first two letters of the code are sufficient for their identification, according to the EAGLES tags (e.g. TD—, CC—, etc.):
– Articles, Determiners: Demonstrative (DD—), Possessive (DP—), Interrogative (DT—), Exclamatory (DE—), Indefinite (DI—)
– Pronouns: Personal (PP—), Demonstrative (PD—), Possessive (PX—), Indefinite (PI—), Interrogative (PT—), Relative (PR—)
– Conjunctions: Coordinated (CC00), Subordinated (CS00)
– Interjections: (I)
– Prepositions: Adpositions (SPC-), Prepositions (SPS00)
– Numerals: Cardinal (MC—), Ordinal (MO—)
– Punctuation signs: (F)
The main category of meaningful words that should also be included in the stop word list, in order to ease the linguistic treatment, is that of the auxiliary verbs. Some of their forms are in fact removed during the lemmatization process, since all the forms of a verb (including compound ones such as ha hecho [has done]) are reduced to its infinitive form, so that the auxiliary ha [from haber, "to have"] is deleted when it functions as an auxiliary verb. Since some errors can be made in this process and some auxiliary verbs can be interpreted by the morphological analyser as full verbs, we propose the a priori elimination of all auxiliary verbs. The first two letters of their EAGLES code are VA—.
Another issue regarding the classifier is that some verbs, like ser and estar [to be], tener [to have something] and haber [to have, to exist], are used too frequently to have an important value in classification. In other words, none of these verbs can be seen as defining any special kind of domain, since they may be found in any kind of standard sentence. Our proposal is to eliminate them as well, as we do with the auxiliary verbs. This should be done after lemmatization [11] (i.e. once the program has reduced all the verbal forms to the infinitive), in order to make it easier for the program to identify the verbal forms ser, estar, tener and haber.
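In code, the POS-based filtering amounts to checking the first letters of the EAGLES tag attached to each lemma. The sketch below works on (lemma, tag) pairs and assumes the tags follow the conventions listed above; producing the pairs (e.g. with FreeLing) is outside the sketch, and the example entries are illustrative.

```python
# EAGLES tag prefixes of the function-word categories to be discarded.
STOP_PREFIXES = (
    "TD", "DD", "DP", "DT", "DE", "DI",    # articles and determiners
    "PP", "PD", "PX", "PI", "PT", "PR",    # pronouns
    "CC", "CS",                            # conjunctions
    "I",                                   # interjections
    "SP",                                  # adpositions and prepositions
    "MC", "MO",                            # numerals
    "F",                                   # punctuation
    "VA",                                  # auxiliary verbs
)

# Frequent verbs removed after lemmatisation, as proposed in the text.
STOP_LEMMAS = {"ser", "estar", "tener", "haber"}

def keep(lemma, tag):
    """Return True if a (lemma, EAGLES tag) pair should stay in the corpus."""
    return not tag.startswith(STOP_PREFIXES) and lemma not in STOP_LEMMAS

tagged = [("el", "TDMS0"), ("juez", "NCMS000"), ("pedir", "VMIP3S0"),
          ("licencia", "NCFS000"), ("ser", "VSIP3S0")]
print([lemma for lemma, tag in tagged if keep(lemma, tag)])
```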
3.2 Words to Be Preserved in the Corpus
In general terms, the words that contribute most to the meaning of a text are nouns and verbs. With the exceptions mentioned above, these kinds of words have to be preserved in the corpora in order to perform a good search. The codes are:
1. For verbs (main verbs): VM—
2. For nouns: (a) Common nouns: NC— (b) Proper nouns: NP—
Regarding the other categories not mentioned so far (adjectives, adverbs and abbreviations), we think they should be preserved, for several reasons. First, adjectives (AQ—) and substantives (NC— and NP—), in a broad semantic sense, are considered to form a joint category that some authors call 'noun', which means that the adjective has not always been treated as a completely independent category, for it works as a complement to the sense of the substantive. To delete it would imply a great loss of information. A typical example of this is malos tratos [ill-treatment], which is formed by an adjective [malos] plus a substantive [tratos]. Secondly, although their strength and importance in sense-adding value is not that of the adjective (at least in terms of occurrence), adverbs (RG—) are still important for our purposes, because they may add important information about the use (and thus the sense) of a verb in many cases. An example of this may be tramitar urgentemente [to proceed urgently]. Thirdly, abbreviations (Y) (like etc., but also like LeCrim [abbreviated form for referring to the Criminal Code]) are very important for our purposes, since legal discourse makes great use of them, and they can be of great value for classification.
3.3 Domain Dependent Words
The Ontology of Professional Judicial Knowledge (OPJK), introduced above, currently covers the sub-domains of guardia [on-duty period], gender violence and procedural law. It still has to be completed with issues from criminal and commercial law. Obviously, the words which appear in this ontology, whether they are concepts or instances, ought to be especially taken into account in the classification process. In other words, OPJK ought to be seen (for this purpose) as a statement of relevant linguistic terms that are important for capturing the semantics we care about. This ontology is being built from textual protocols extracted from extended field work, which perforce implies a strong relation between words and real data. Moreover, the extraction of the words (concepts and instances) has been performed manually, mostly by hand by the legal expert team. Thus the words which appear in OPJK ought to be given some special consideration. Moreover, as a good number of the terms in OPJK do not correspond to a single word but to a compound form, these compound forms ought to be taken into account as single occurrences and not as two. Typical forms extracted from OPJK are the following (a sketch of how such compound forms can be protected in the corpus is given after the list):
– Noun + Preposition + Noun (juzgado de guardia, diligencia de embargo, etc.)
– Adjective + Noun (malos tratos, lesa humanidad, etc.)
– Noun + Adjective (documento procesal, enjuiciamiento civil, etc.)
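As announced above, here is a small sketch of how such compound forms could be protected before vectorisation, by joining them into single underscore-linked tokens (the corpus itself already uses forms such as en_caso_de and antes_de); the term list is an invented sample, in practice it would be drawn from OPJK.

```python
import re

# A few compound forms of the kinds listed above; the real list would come from OPJK.
COMPOUNDS = ["juzgado de guardia", "diligencia de embargo",
             "malos tratos", "enjuiciamiento civil"]

def protect_compounds(text, compounds=COMPOUNDS):
    """Join each multi-word term into a single underscore-linked token."""
    for term in sorted(compounds, key=len, reverse=True):
        pattern = re.compile(re.escape(term), flags=re.IGNORECASE)
        text = pattern.sub(term.replace(" ", "_"), text)
    return text

print(protect_compounds("la mujer denuncia malos tratos ante el juzgado de guardia"))
```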
3.4 Experiments
We have examined the above heuristics in an experiment, performing two replications of 10-fold cross-validation. The experiments were based on different ways of preparing the text before attempting to ground the ontology. This is expected to reveal where the semantics is hidden. To summarize precision (prec) and recall in a single number, we have used the F-score:
$\text{F-score} = \dfrac{2 \times \text{prec} \times \text{recall}}{\text{prec} + \text{recall}}$
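The evaluation scheme can be reproduced with standard tooling: one binary classifier per sub-domain, cross-validated predictions, and the F-score defined above. The function below is a sketch; X, the label vectors and the use of LinearSVC are assumptions rather than the paper's setup.

```python
from sklearn.model_selection import cross_val_predict
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

def evaluate_subdomains(X, labels_by_subdomain, cv=10):
    """Cross-validated F-score of one binary classifier per sub-domain.

    labels_by_subdomain maps each sub-domain name to a +1/-1 label vector.
    """
    scores = {}
    for name, y in labels_by_subdomain.items():
        y_pred = cross_val_predict(LinearSVC(C=1.0), X, y, cv=cv)
        scores[name] = f1_score(y, y_pred, pos_label=1)   # 2*prec*recall/(prec+recall)
    return scores
```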
Table 1. A comparison of different processing techniques for text. Best results were obtained by eliminating irrelevant POS categories (P+stop). All the techniques combined (All) was not far behind, and identifying synonyms was found helpful. Prioritizing 'important' words (l+lexicon) and/or important POS categories (P+imp) proved less successful; perhaps this was due to our approach of increasing the scale of the important words. But no processing (None) was almost always the worst.

sub-domain             |   # | None | lemm. | l+lexicon | l+syn | l+stop | P+stop | P+imp | All
1. On-duty             |  79 | 0.33 | 0.23  | 0.38      | 0.23  | 0.36   | 0.37   | 0.23  | 0.37
2. Family issues       | 103 | 0.31 | 0.49  | 0.45      | 0.27  | 0.45   | 0.24   | 0.49  | 0.24
2.1. Gender violence   |  62 | 0.34 | 0.30  | 0.43      | 0.18  | 0.47   | 0.23   | 0.29  | 0.24
2.2. Minors            |  19 | 0.00 | 0.74  | 0.64      | 0.38  | 0.51   | 0.00   | 0.82  | 0.00
2.3. Divorce           |  18 | 0.00 | 0.10  | 0.00      | 0.20  | 0.11   | 0.15   | 0.23  | 0.15
3. Immigration         |  26 | 0.15 | 0.21  | 0.06      | 0.07  | 0.24   | 0.23   | 0.20  | 0.21
4. Property            |  10 | 0.14 | 0.00  | 0.18      | 0.00  | 0.00   | 0.16   | 0.00  | 0.00
5. Sentencing          |  48 | 0.22 | 0.32  | 0.24      | 0.47  | 0.48   | 0.43   | 0.33  | 0.42
5.1. Execution         |  40 | 0.25 | 0.34  | 0.29      | 0.52  | 0.48   | 0.44   | 0.35  | 0.44
6. Proceedings         | 307 | 0.49 | 0.53  | 0.51      | 0.54  | 0.55   | 0.59   | 0.54  | 0.59
6.1. Evidence          |  61 | 0.13 | 0.24  | 0.18      | 0.28  | 0.17   | 0.32   | 0.30  | 0.29
6.2. Competence        |  45 | 0.18 | 0.12  | 0.13      | 0.13  | 0.18   | 0.22   | 0.12  | 0.24
7. Court office        | 151 | 0.25 | 0.24  | 0.23      | 0.27  | 0.25   | 0.29   | 0.26  | 0.29
7.1. Organization      |  40 | 0.05 | 0.02  | 0.00      | 0.00  | 0.05   | 0.15   | 0.02  | 0.09
7.2. Civil servants    |  17 | 0.21 | 0.10  | 0.41      | 0.19  | 0.36   | 0.21   | 0.15  | 0.19
7.3. Experts           |  20 | 0.12 | 0.25  | 0.48      | 0.53  | 0.25   | 0.37   | 0.36  | 0.36
7.4. Infrastructure    |  19 | 0.00 | 0.00  | 0.00      | 0.00  | 0.00   | 0.10   | 0.00  | 0.10
7.6. Informatics       |  13 | 0.00 | 0.07  | 0.00      | 0.19  | 0.14   | 0.07   | 0.07  | 0.07
7.12. Lawyers          |   9 | 0.00 | 0.46  | 0.00      | 0.00  | 0.00   | 0.00   | 0.46  | 0.00
7.10. GCJP             |  11 | 0.00 | 0.17  | 0.00      | 0.00  | 0.17   | 0.17   | 0.17  | 0.17
7.9. Incompatibilities |  14 | 0.19 | 0.00  | 0.19      | 0.00  | 0.00   | 0.00   | 0.00  | 0.00
8. Business law        |  28 | 0.38 | 0.26  | 0.27      | 0.30  | 0.46   | 0.24   | 0.32  | 0.24
9. Traffic accidents   |  11 | 0.09 | 0.31  | 0.00      | 0.31  | 0.00   | 0.31   | 0.29  | 0.31
10. Criminal law       |  10 | 0.00 | 0.00  | 0.16      | 0.18  | 0.00   | 0.00   | 0.00  | 0.00
RANKS                  |     | 2.89 | 3.59  | 3.35      | 4.26  | 5.41   | 6.07   | 4.80  | 5.62
The higher the F-score, the better the grounding performance. To summarize the F-scores across a number of experiments, we have used the average rank of a particular data-preparation technique across the sub-domains. Each sub-domain had a weight proportional to the number of its instances, so as to prevent small sub-domains from exerting too much influence. The higher the rank, the better the method. Our implementation used the following techniques:
– Lexicon: the words or sequences of words that are deemed to be important have a higher weight in the SVM model. This weighting is done by increasing the scale of the corresponding dimension (see the sketch at the end of this section).
– Synonyms: we replace alternate expressions with a single canonical one.
– Stop words: we remove the terms that correspond to a list of irrelevant words or phrases.
– POS stop: we remove the lemmas that correspond to irrelevant POS categories.
– POS important: we increase the weight of the terms that appear under an important POS category.
The results of the analysis are shown in Table 1. Some infrequent sub-domains were eliminated. We can conclude that POS-based filtering is very successful, and that establishing a list of synonyms also helps. We can also observe that the important word list was especially helpful for the 'On-duty' sub-domain, but not as much for other sub-domains. The lack of success with important words is perhaps due to our way of implementing it by increasing the scale of the variable.
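The 'lexicon' and 'POS important' variants boost selected dimensions by rescaling them before training. A minimal sketch, assuming X is the sparse count matrix and important is a set of feature names taken, for instance, from OPJK:

```python
import numpy as np

def upweight(X, feature_names, important, factor=2.0):
    """Rescale the columns of a sparse count matrix X for 'important' features."""
    scale = np.ones(len(feature_names))
    for j, name in enumerate(feature_names):
        if name in important:
            scale[j] = factor
    return X.multiply(scale).tocsr()
```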
4 Conclusions
In this paper we have explained the construction of a topic ontology from the corpus of questions (obtained by ethnographic field work and related to the Iuriservice prototype developed within the SEKT project). The development started with a manually modelled topic ontology, which was soon called into question by the evidence provided by the classifier. The automatic classification of a set of questions, taking into account the previous manual classification, revealed misclassified questions and low relevance of the sub-domain keywords. That analysis provided recommendations not only regarding the sub-domains themselves (on-duty) but also regarding the representation of the text. Thus, the analysis has demonstrated the need to prepare the corpus before the automated (or semi-automated) extraction of ontologies, in order to ensure better results. In the future we will work on the evaluation of those recommendations and will use a tool (OntoGen) to semi-automatically organize the previously treated corpus of questions into a topic ontology.
Acknowledgements. SEC2001-2581-C02-01 and EU-IST (IP) IST-2003-506826 SEKT.
References
[1] Benjamins, V.R.; Contreras, J.; Blázquez, M.; Rodrigo, L.; Casanovas, P.; Poblet, M. (2004). "The SEKT use legal case components: ontology and architecture", in T.B. Gordon (Ed.), Legal Knowledge and Information Systems. Jurix 2004. IOS Press, Amsterdam, 2004, pp. 69-77.
[2] Benjamins, V.R.; Casanovas, P.; Contreras, J.; López-Cobo, J.M.; Lemus, L. (2005). "Iuriservice: An Intelligent Frequently Asked Questions System to Assist Newly Appointed Judges", in V.R. Benjamins et al., Law and the Semantic Web, Springer-Verlag, London, Berlin, pp. 205-22.
[3] Casanovas, Pompeu; Poblet, Marta; Casellas, Núria; Vallbé, Joan Josep; Ramos, Francisco; Benjamins, Richard; Blázquez, Mercedes; Rodrigo, Luis; Contreras, Jesús; Gorroñogoitia, Jesús (2005): D 10.2.1 Legal Scenario. Deliverable WP10 Case Study: Intelligent integrated decision support for legal professionals. SEKT Project (www.sekt-project.com).
[4] Casanovas, P.; Poblet, M.; Casellas, N.; Contreras, J.; Benjamins, V.R.; Blázquez, M. (2005): "Supporting newly-appointed judges: a legal knowledge management case study". Journal of Knowledge Management, Vol. 9 No. 5, pp. 7-27.
[5] Casanovas, P.; Casellas, N.; Tempich, C.; Vrandečić, D.; Benjamins, V.R. (2005): "OPJK modeling methodology". Lehman, J.; Biasiotti, M.A.; Francesconi, E.; Sagri, M.T. (eds.): IAAIL Proceedings. LOAIT - Legal Ontologies and Artificial Intelligence Techniques. Bologna, June 2005, pp. 121-133.
[6] Casellas, N.; Blázquez, M.; Kiryakov, A.; Casanovas, P.; Benjamins, V.R. (2005). "OPJK into PROTON: legal domain ontology integration into an upper-level ontology". R. Meersman et al. (Eds.): OTM Workshops 2005, LNCS 3762. Springer-Verlag, Berlin Heidelberg, pp. 846-855.
[7] Jakulin, A.; Možina, M.; Demšar, J.; Bratko, I.; Zupan, B. (2005). "Nomograms for visualizing support vector machines". Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining. KDD '05.
[8] Joachims, T. (1999): "Making large-scale SVM learning practical". In B. Scholkopf, C. Burges, and A. Smola (eds.). Advances in Kernel Methods - Support Vector Learning. Cambridge (MA): MIT Press.
[9] Lewis, D.D.; Yang, Y.; Rose, T.; Li, F. (2004). "RCV1: A New Benchmark Collection for Text Categorization Research". Journal of Machine Learning Research 5: 361-397.
[10] Terziev, I.; Kiryakov, A.; Manov, D. (2004). D 1.8.1. Base upper-level ontology (BULO) Guidance, report EU-IST Integrated Project (IP) IST-2003-506826 SEKT.
[11] Vallbé, J.J.; Martí, M.A.; Fortuna, B.; Jakulin, A.; Mladenić, D.; Casanovas, P. (2005): "Stemming and lemmatisation: improving knowledge management through language processing techniques". Casanovas, P.; Bourcier, D.; Noriega, P.; Cáceres, E.; Galindo, F., The regulation of electronic social systems. Law and the Semantic Web. Proceedings of the B4-Workshop on Artificial Intelligence and Law, IVR'05, Granada, May 25th-27th. http://www.lefis.org. XXII World Conference of Philosophy of Law and Social Philosophy. Instituto de Investigaciones Jurídicas, UNAM (México) [In press].
[12] Vapnik, V.N. (1999): The Nature of Statistical Learning Theory. Springer-Verlag New York, 2nd edition.
Terminae Method and Integration Process for Legal Ontology Building
Sylvie Despres (CRIP5, Equipe IAD, Université Paris 5) and Sylvie Szulman (LIPN, UMR 7030, Université Paris 13)
Abstract. This paper describes the construction method of a legal application ontology. This method is based on the merging of micro-ontologies built from European Community directives. The terminae construction method from texts, enhanced by an alignment process with a core legal ontology, is used for building the micro-ontologies. A merging process then allows the legal ontology to be constructed.
1 Introduction
This paper presents the construction method of a legal application ontology. The aim of this ontology is to support the development of formal models of legislation to be used in legal knowledge-based systems. This construction method is based on building micro-ontologies and merging them. A micro-ontology refers to a set of concepts and properties describing a restricted context of a domain. Each micro-ontology from a European directive is built both by using the semi-automatic terminae method [3] and by aligning it with a legal core ontology, clo [11]. A core ontology covers a field such as law, which may consist of many sub-domains like criminal law, private law, European Union law, etc. Terminae is based on knowledge elicitation from texts and allows creating a domain model by analyzing a corpus with Natural Language Processing (NLP) tools, as in the sekt project [18]. The method combines knowledge acquisition tools based on linguistics with modeling techniques so as to keep links between models and texts. During the building process [1], it is assumed that: (1) the ontology builder should have a comprehensive knowledge of the domain, so that she/he will be able to decide which terms (nouns, phrases, verbs or adjectives) are domain terms and which concepts and relations are labeled with these domain terms; (2) the ontology builder knows well how the ontology will be used. The alignment process takes place during the micro-ontology construction. [16], [14] and [10] defined ontology alignment as follows: ontology alignment consists in bringing two or more ontologies into mutual agreement, making them consistent and coherent and allowing the aligned ontologies to reuse information from one another. In alignment, the original ontologies persist, with links established between them. Alignment is usually performed when the ontologies cover complementary domains. The alignment process was performed mostly
by hand, with the terminae tool. Our ontology is therefore structured around clo. terminae provides easy import of concepts from clo but doesn't check whether consistency is maintained once the operations are performed. The merging process [8] is applied to the micro-ontologies. It consists in creating a single coherent ontology from two or more existing ontologies with overlapping parts. The article is structured as follows: Section 2 describes our ontology construction method; Section 3 applies the method to two European directives around the concepts travailleur (employee) and citoyen (citizen); Section 4 presents a comparison with related works. The article ends with a discussion of the relation between ontology and text.
2 Construction Method
Our construction method for building the application ontology includes two steps:
– building micro-ontologies from texts using terminae and alignment with a core ontology;
– merging the micro-ontologies obtained at the first step.
2.1 Building Micro-ontologies
Our micro-ontology construction is based on the semi-automatic terminae method enhanced by an alignment process with a core ontology. The interest of aligning with a top ontology such as dolce and a core legal ontology such as lri-core [5] has been shown in [9]. The legal core ontology concepts constitute the common conceptual denominator of the field; therefore, normally, their reuse is facilitated. In this work, clo has been chosen to achieve an alignment from the natural language definitions of the concepts. clo was chosen because it is built from dolce and is available in the OWL language, whereas lri-core is still under development. The terminae method works with the terms that occur in the analyzed texts so as to describe the concepts thus constructed in a formal ontology. This ontology involves two kinds of concepts: terminological and non-terminological. The terminological concepts are created through the ontology elaboration process. They are linked to their term occurrences in the corpus through their terminological forms. These forms contain a definition of the concepts in natural language and a list of synonyms. The non-terminological concepts are created to structure the ontology. The terminae tool provides material support to the method. Moreover, terminae allows importing and exporting ontologies in the owl language, which aids the alignment and merging processes. syntex [4] is used in the linguistic study activity to obtain the domain terms. The linguae concordancer included in terminae allows pattern recognition. mfd [7] uses association rules to find lexical relations.
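To make the OWL export tangible, here is a minimal sketch of what a terminological concept, its natural-language definition and an alignment link to a core-ontology class could look like as triples; the URIs are invented and the actual terminae export certainly differs in detail.

```python
from rdflib import Graph, Namespace, Literal, RDF, RDFS
from rdflib.namespace import OWL

MICRO = Namespace("http://example.org/d-travailleur#")   # invented namespace
CLO = Namespace("http://example.org/clo#")               # stand-in for the real CLO URIs

g = Graph()
g.bind("micro", MICRO)

# A terminological concept, keeping its natural-language definition as an annotation.
g.add((MICRO.Travailleur, RDF.type, OWL.Class))
g.add((MICRO.Travailleur, RDFS.label, Literal("travailleur", lang="fr")))
g.add((MICRO.Travailleur, RDFS.label, Literal("employee", lang="en")))
g.add((MICRO.Travailleur, RDFS.comment, Literal(
    "Any person who, in the Member State concerned, is protected as an "
    "employee under national employment law.", lang="en")))

# Alignment: the micro-ontology concept is placed under a core-ontology concept.
g.add((MICRO.Travailleur, RDFS.subClassOf, CLO.NaturalPerson))

print(g.serialize(format="turtle"))
```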
2.2 Merging Micro-ontologies
Ontology merging aims to create a unique ontology from the original ontologies. The original ontologies will be unified and replaced by the new ontology. Before
merging micro-ontologies, there has to be a correspondence between the concepts of the ontologies to merge. [15] define ontology integration as the iteration of the following steps: (1) find the places in the ontologies where they overlap; (2) relate concepts that are semantically close via equivalence and subsumption relations; (3) check the consistency, coherence and non-redundancy of the results. [16] defined the set of basic operations for ontology merging as: removing a class from the ontology; removing a concept from the list of parent concepts because a more appropriate parent exists in another ontology; adding a concept to the list of parents of the considered class because an appropriate parent concept has been found in another ontology; renaming a concept to conform to naming conventions in another ontology; moving a concept from one ontology to another; removing a slot from the list of slots of a concept because a similar slot was inherited from a new parent concept; moving a slot because it is more appropriately defined in a parent concept; renaming a slot. The problems that arise when merging are the mismatches that may occur between separate ontologies. Ontologies may differ at two levels: the language level or the meta-level (the level of the primitives that are used to specify ontologies). In our context, mismatches at the language level do not occur because the micro-ontologies are written in the same language (owl). The mismatches occur at the ontological and terminological levels.
3 Results
In this section, our first results on the legal application ontology are described. The construction of this ontology was initiated using two directives (hereafter referred to as "D-travailleur" and "D-citoyen"). First, the application of the linguistic method (cf. Section 2.1) to the micro-ontology construction from "D-travailleur" is detailed. Then, the micro-ontology established from "D-citoyen" is presented. Finally, the merging of these two micro-ontologies is studied.
3.1 Micro-ontology Building
The studied directives were written by the European Union Council, which is the first decision organ of the Union. The French version is the basis of the linguistic study, but we used the English version to translate our examples. The first directive is "Council directive 2001/23/EC of 12 March 2001 on the approximation of the laws of the Member States relating to the safeguarding of employees' rights in the event of transfers of undertakings, businesses or parts of undertakings or businesses". The second is "Council Directive 2004/58/EC on the right of citizens of the Union and their family members to move and reside freely within the territory of the Member States". Henceforth, "D-travailleur" is used to present our construction method for micro-ontologies.
The Corpus Constitution. The corpus consists of a single text ("D-travailleur") of seven pages (3133 words) selected by jurists. This corpus is imposed, closed
and small-sized. This small size is not an obstacle to the elaboration of the model, because a directive is self-contained and cohesive.
Linguistic Study. This step consists in selecting the terms and lexical relations that will be modeled. The results of this stage are quite raw and will be further refined. syntex, linguae and mfd are the tools used. syntex yielded 900 candidate terms. The most frequent terms were: travailleur (employee) (56 occurrences), transfert (transfer) (41 occurrences), cédant (transferor) (15 occurrences), cessionnaire (transferee) (14 occurrences). Because we work on a small corpus, there are many hapaxes (terms which appear only once in the corpus) which are significant and have to be kept, such as the syntagm "Community Charter of the Fundamental Social Rights" or the verb "to abrogate". The linguae and mfd tools have been used to explore the relations between central concepts. For example, mfd detects two patterns (notify to transferee, transfer to transferee). Then the relation between transferor and transferee is given by a pattern of linguae (see Figure 1).
(lemme : cédant) (*) (lemme : cessionnaire)   [(lemma: transferor) (*) (lemma: transferee)]
cédant notifie au cessionnaire   (transferor notifies transferee)
Fig. 1. Relation study with linguae
Normalization. The normalization step is a semantic study and constitutes a bridge between the lexical and syntactic study and modeling. It consists in a particular conceptualization process based on corpus analysis. The concept description is based on the semantic analysis of the term occurrences. The semantic interpretation of the terms is driven by expert knowledge and application requirements. The method starts with the study of the central concepts of the model. These central concepts are dictated by the aim of the ontology building, namely the description of the travailleur (employee) concept in the context of the outsourcing of his/her enterprise in the European Community. We start with three central concepts: travailleur (employee), transfert (transfer) and licenciement (redundancy).
a. Modeling bootstrap. From the terminological form of the term bound to a central concept, we find terms and relations describing the concept. Terms are translated into concepts and relations into roles. A natural language definition is either found in the directive or established from the term occurrences in the directive. The concept definition is given, in comprehensive form, from the term occurrences found in the text with structural and/or functional properties. A structural property references a link toward an ancestor concept (for instance, a subsumption link). A functional property describes a domain property between concepts. The following figures
present concept definitions obtained by using the text. Only some definitions are found in the directive. The syntax used simplifies the linguistic expression so as to remain closer to the ontology language.
* Study of travailleur (employee). The terminological form involves the following definition, found in the directive "D-travailleur" (see Figure 2):
"toute personne qui, dans l'état membre concerné est protégé en tant que travailleur dans le cadre de la législation nationale sur l'emploi"
["employee shall mean any person who, in the Member State concerned, is protected as an employee under national employment law"]
Fig. 2. Definition of the travailleur (employee) term
The definition of the concept employee has been established from the occurrences of the term employee in the text (see Figure 3). Three important points are stressed in this definition: (1) the European Community is constituted of Member States which have national laws; (2) an employee is protected by national employment laws; (3) there is no explicit link between national laws and European laws. All these elements have to be represented in the ontology.
travailleur / employee
Structural properties (Propriétés structurelles):
– is a person (est une personne)
– belongs to an enterprise (appartient à une entreprise)
– belongs to a Member State (appartient à un Etat membre)
Functional properties (Propriétés fonctionnelles):
– having rights and interests (possédant des droits et des intérêts)
– having representatives (ayant des représentants)
– having working conditions (ayant des conditions de travail)
– is the object of the transfer of his/her enterprise (est l'objet du transfert de son entreprise)
– is protected by national employment law (est protégée par une législation nationale)
Fig. 3. Concept travailleur (employee) definition
b. Modeling consolidation. During the modeling consolidation process, four kinds of activities occur:
– generalization, that is to say, the search for ancestor concepts. It consists in studying and inserting new concepts along the bottom-up axis.
– specialization, that is to say, the search for child concepts. It consists in studying and inserting new concepts along the top-down axis, respecting differentiation principles [2].
– clustering, which consists in creating new concepts and regrouping identical properties.
– alignment, which may occur during the previous activities and requires the concept and role meanings. The ontology used is clo.
The domain modeling requires the introduction of non-terminological concepts for structuring the ontology. The work along the top-down axis is dedicated to the specialization and differentiation of the concepts already defined. The roles describe the functional properties, and some of them restrict the inherited roles.
* Generalization of the employee concept. The employee ancestor concepts are found by working along the ascending axis. The structural properties are the linguistic expressions from which the search for the father concepts is carried out. The structural property is a person leads to the study of the term person in the legal domain. In the directive, the term natural or legal person exists and is used in the transferor definition. Recourse to the expert helped us to create three meanings for this term. Each meaning is represented by a concept. The concepts naturalPerson, legalPerson and naturalLegalPerson are integrated in the micro-ontology. The linguistic syntagm employee is a person is translated into a subsumption link between employee and naturalPerson. The structural property belongs to of the employee concept leads to the study of the terms enterprise and member states. This property is described by the role belongs to.
* The study of the functional properties. The functional properties are used for defining the roles. The terms rights, representatives, working conditions and national employment law are modeled as concepts in the micro-ontology. These concepts are bound to the employee concept by a role which expresses their linguistic property.
* The role definition. The bottom-up structuring concept juridicalObject, created for this purpose, gathers all of the juridical concepts described in a directive. It is defined by the role governed by, which takes its values in the range defined by the Social-Object concept from clo. The opposite role to be governed by has been created. Lexical properties such as covered by or is protected by, which constitute specializations of the property governed by (see Figure 4), were also used in the text.
Name        | Domain concept  | Value concept
isGoverned  | redundancy      | nationalEmploymentLaw
isGoverned  | socialObject    | juridicalObject
governs     | juridicalObject | socialObject
isProtected | employee        | nationalEmploymentLaw
belongs to  | employee        | memberStates
Fig. 4. An excerpt of the defined generic roles
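The generic roles of Fig. 4 translate naturally into OWL object properties with a domain, a range and, for the specialisations discussed below, a subPropertyOf link. A sketch with invented URIs and lightly adapted names:

```python
from rdflib import Graph, Namespace, RDF, RDFS
from rdflib.namespace import OWL

M = Namespace("http://example.org/d-travailleur#")   # invented namespace
g = Graph()
g.bind("m", M)

def role(name, domain, range_):
    """Declare an object property together with its domain and range."""
    g.add((M[name], RDF.type, OWL.ObjectProperty))
    g.add((M[name], RDFS.domain, M[domain]))
    g.add((M[name], RDFS.range, M[range_]))

role("isGoverned", "SocialObject", "JuridicalObject")
role("governs", "JuridicalObject", "SocialObject")
role("isProtected", "Employee", "NationalEmploymentLaw")
role("belongsTo", "Employee", "MemberState")

# Role hierarchy: "is protected by" specialises "is governed by".
g.add((M.isProtected, RDFS.subPropertyOf, M.isGoverned))

print(g.serialize(format="turtle"))
```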
A role hierarchy was therefore created between the roles to be governed by and to be protected by.
* The alignment with clo. An alignment with clo is achieved for each bottom-up structuring concept, such as juridicalObject and naturalPerson. The naturalPerson concept is identified with the clo physicalPerson concept, defined in clo by "Cognitive objects have a specific dependence on agentive physical objects (e.g. a natural person)". The document concept is represented in clo and defined by "An information realization that realizes (at least) a text". A subsumption link is defined between the document concept and the juridicalObject concept. Figure 5 presents an excerpt of the micro-ontology "D-travailleur".
Fig. 5. An excerpt of the worker micro-ontology
"D-citoyen" Micro-ontology Building. The same method has been used to build the "D-citoyen" micro-ontology around the central concepts citoyen (citizen), libre circulation (freely moving), Etats membres (Member States). Figure 6 shows an excerpt of this micro-ontology.
Fig. 6. An excerpt of the citizen micro-ontology
3.2 Merging Micro-ontologies: First Results
For merging, union and intersection are the most prominent approaches. In the union approach, the merged ontology is the union of all entities in both source ontologies, where differences in the representation of similar concepts have been resolved. In the intersection approach, the merged ontology consists only of the parts of the source ontologies which overlap [8]. The aim of this work is to obtain an ontology of European Union law around the travailleur (worker) concept. The jurists' requirements led to the building of a minimal ontology from the merging of micro-ontologies. The merging process is an interesting means both to reveal the inconsistencies between the models used to describe the concepts in the different directives and, if possible, to build a shared model of the studied concepts throughout the directives. Therefore, the intersection approach was chosen to build this ontology. The merging process for our micro-ontologies is only beginning, but some issues already emerge. Each of our micro-ontologies represents a model of one legal issue of the European law domain. Therefore, in order to obtain a unique ontology of the travailleur (worker) concept, merging is necessary. The process was carried out with the two above-mentioned micro-ontologies, as shown in Figures 5 and 6. The "D-citoyen" micro-ontology constitutes the reference for the merging process. The two micro-ontologies are written in the same language; therefore, there are no mismatches at this level. The mismatches occur at the ontological and terminological levels. The merging is around the terms travailleur (employee) and travailleur salarié (worker). These two terms refer to the same concept, denoted by travailleur salarié (worker). Resorting to the terminological form allows the detection of the differences between the definitions associated with these concepts. The result of this study consists in moving the travailleur salarié (employee) concept from the "D-travailleur" micro-ontology to the "D-citoyen" micro-ontology, under the travailleur (worker) concept. The generic role appartient à (belongs to) attached to the travailleur (employee) concept is removed because it exists at the citizen concept level and is inherited. The terminological form associated with the worker concept contains all the linked terms.
4 Comparison with Related Works
Our work is based on texts, enhanced by an alignment process and a merging of the resulting micro-ontologies. This approach is clearly different from the legal ontology building presented in [5], [12]. Our approach is the only one in the juridical domain that permits traceability from texts to the concepts and relations of the built ontology. Thus it presents a great advantage for the management of ontology evolution and, normally, it should aid the enrichment of the constructed ontologies. Our approach can be compared to the Ontology of Professional Legal Knowledge (OPLK) project [6]. A middle-out approach is adopted and NLP tools are used for the linguistic study. By focusing on relations between concepts, our
method differs from the OPLK project, which is centered on adjectives to construct relations between concepts. The constructed ontologies differ because of the nature of the corpus. If we consider, as [17] does, that juridical language includes several levels (the law-maker's language, the judge's language that interprets the law-maker's language, and the language of jurisprudence), our work is situated at the first level, whereas the OPLK project is at the second level, since its ontology is built from pragmatic knowledge. In fact, directives constitute normative texts that define the objects and doers of the selected reality and control EU law.
5 Conclusion
The terminae method gives a central role to the text and makes it possible to establish relations between text and ontology. It differentiates the linguistic and conceptual levels; therefore, terms and concepts are distinct. A terminological network is built from the terms and their linguistic relations. The ontology concepts are built from the terms relevant for the application. The structural and functional properties are used for elaborating a local concept description. This local concept is linked to core ontology concepts through an alignment process. During this step, a correspondence is established between a signified of a term and a formal concept. A formalization process such as OntoSpec [13] should permit the alignment between formal concepts.
Acknowledgments. This research was carried out in the context of the specific Action "Ontologie du droit et langages juridiques" (G. Lame) funded by the RTP "Droit et systèmes d'information" (Danièle Bourcier). We thank our jurist colleague Nicolas Moisard (IDEA, Centre Juridique de Poitiers) for his collaboration during the project and the validation steps.
References
1. N. Aussenac-Gilles, B. Biébow, and S. Szulman. Revisiting ontology design: a methodology based on corpus analysis. In R. Dieng and O. Corby, editors, Knowledge Engineering and Knowledge Management: Methods, Models, and Tools. Proc. of the 12th International Conference (EKAW'2000), LNAI 1937, pages 172–188. Springer-Verlag, 2000.
2. B. Bachimont. Engagement sémantique et engagement ontologique : conception et réalisation d'ontologies en ingénierie des connaissances. In J. Charlet, M. Zacklad, G. Kassel, and D. Bourigault, editors, Ingénierie des Connaissances, évolutions récentes et nouveaux défis, pages 305–323, Paris, 2000. Eyrolles.
3. B. Biébow and S. Szulman. Terminae: A linguistics-based tool for building of a domain ontology. In D. Fensel and R. Studer, editors, Proc. of the 11th European Workshop (EKAW'99), LNAI 1621, pages 49–66. Springer-Verlag, 1999.
4. D. Bourigault and C. Fabre. Approche linguistique pour l'analyse de corpus, volume 25, pages 131–151. Université Toulouse Le Mirail, 2000.
5. J. Breuker. Constructing a legal core ontology: lri-core. In WONTO’2004, 2004. 6. P. Casanovas, N. Casellas, C. Tempich, D. Varandecic, and R. Benjamins. OPKJ modeling methodology. In LOAIT- Legal Ontologies and Artificial Intelligence Techniques, IAAIL Workshop series, pages 121–133, 2005. 7. V. Ceausu and S. Després. Une approche mixte pour la construction d’une ressource terminologique. In IC 2004, pages 211–223, 2004. 8. J. de Bruijn, F. Martin-Recuerda, D. Manov, and M. Ehrig. Technical report. 9. S. Després and S. Szulman. Construction of a legal ontology from a europen community legislative text. In Thomas F. Gordon, editor, Jurix 2004, pages 79– 88. IOS press, 2004. 10. J. Euzenat, T. Le Bach, J. Barrasa, P. Bouquet, J. De Bo, R. Dieng, M. Ehrig, M. Hauswirth, M. Jarrar, R. Lara, D. Maynard, A. Napoli, G. Stamou, H. Stuckenschmidt, P. Shvaisko, S. Tessaris, S. Van Acker, and I. Zaihrayeu. Technical report. 11. A. Gangemi, A. Prisco, M.T. Sagri, G. Steve, and D. Tiscornia. Some ontological tools to support legal regulatory compliance, with a case study. In Workshop WORM Core. LNCS, Springer Verlag, 2003. 12. A. Gangemi, M.-T. Sagri, and D. Tiscornia. A Constructive Framework for Legal Ontologies, pages 97–124. LNAI 3369. Springer, 2005. 13. G. Kassel. Une méthode de spécification semi-informelle d’ontologies. In Actes des 13 èmes journées francophones d’Ingénierie des Connaissances (IC 2002), pages 75–87, 2002. 14. M. Klein. Combining and relating ontologies: an analysis of problems solutions. In A. Gomez-Perez, M. Gruninger, H. Stuckenschmidt, and M. Uschold, editors, Workshop on Ontologies and Information Sharing, IJCAI’01, pages 309–327, Seattle, USA, 2001. 15. D.L. McGuinness, R. Fikes, J. Rice, and S. Wilder. An environment for merging and testing large ontologies. In Seventeenth International Conference on Principles of Knowledge Representation And Reasoning KR-2000), pages 483–493, 2000. 16. N. Fridman Noy and M.A. Musen. An algorithm for merging and aligning ontologies: Automation and tool support. In Proc.of the Workshop on Ontology Management at the Sixteenth National Conference on Artificial Intelligence (AAAI-99), Orlando, 1999. FL: AAI Press. 17. W. Peters, D. Tiscornia, and M.T. Sagri. The structuring of legal knowledge in lois. In LOAIT - Legal Ontology and Artificial Intelligence Techniques. 18. URL: http://www.gate.ac.uk/projects/sekt.
An Approach to Automatic Ontology-Based Annotation of Biomedical Texts Gayo Diallo, Michel Simonet, and Ana Simonet Laboratoire TIMC, Université Grenoble 1, Faculté de Médecine, 38700 La Tronche, France {gayo.diallo, michel.simonet, ana.simonet}@imag.fr
Abstract. Sharing and enriching documents is now expected, and is made possible by tools enabling users to perform different kinds of annotations. We propose an ontology-based approach to automate the semantic annotation of texts; ontologies are represented in OWL (Web Ontology Language), which is supported by Semantic Web tools such as Racer for reasoning and Jena. The tool for automatic semantic annotation, which supports word-based and stem-based pre-indexing techniques, is presented, and it is evaluated on three medical corpora in English and French (brain disease area, cardiology and the OHSUMED collection). The evaluation shows differences in the results obtained according to the pre-indexing mode used.
1 Introduction
The amount of electronic resources available nowadays is growing continuously, making their organisation and efficient access difficult. The Semantic Web initiative [1] is about adding formal structures and semantics (metadata and knowledge) to web content for easy management and access. To make the resources machine-understandable, it proposes in particular to enrich them with descriptions called annotations. Annotation tools can be viewed according to the functionalities they implement, which can be classified into three main categories: cataloguing using Dublin Core (DC) [2], textual and conceptual [3]. Annotea [4] and SemTag [5] illustrate those various aspects. Annotea is an outcome of the World Wide Web Consortium (W3C) efforts and mainly deals with textual annotation, i.e., the process of adding pieces of text (comments, examples, questions) to a given document in a collaborative manner. SemTag is a project of the IBM Almaden Research Center and deals with ontology-based automatic semantic annotation, which can be considered as a way to perform concept-based indexing. Some authors have studied the annotation process from computational or cognitive perspectives [6]. Up to now, many annotation tools have been conceived for the purpose of indexing. They usually provide semi-automatic procedures (a learning algorithm in Melita [7], S-Cream [8]) in addition to manual support (Melita) to hand out semantic annotations. Sometimes, these annotations are only provided manually (Shoe [9], Comma [10]) or together with some cataloguing procedure. In the medical
field, health professionals naturally use textual annotations to link documents and to guide the reading from one document to another (or to a part of a document). In this paper we focus on semantic annotation (also called conceptual annotation), which consists in attaching "semantic labels" to a document or parts of a document, using semantic features provided by a formal knowledge organization called an ontology. Using ontologies in the annotation process has a particular benefit: the queries used for information retrieval and the annotated documents share the same vocabulary. We present an approach to automatic conceptual annotation using Natural Language Processing (NLP) technology. This approach extends a tool developed in previous work, the Noesis Annotation Tool (NAT) [3], which supports the manual annotation of documents. Regarding conceptual annotation, many tools use NLP in order to automatically extract information from documents (Melita, S-Cream). The remainder of this paper is organized as follows. In Section 2 we give some preliminaries, presenting the manual annotation tool and the domain ontologies used to evaluate our approach. In Section 3 we explain the process of automatic semantic feature extraction and present some technical aspects. An evaluation of our approach is presented in Section 4, and Section 5 concludes by discussing future extensions and improvements.
2 Preliminaries 2.1 Noesis Annotation Tool The NAT (see Fig. 1) has been developed in the context of the NOESIS Project which is an Integrated Project of the European 6th Framework Program (2004 – 2006). Its objective is to build a platform for wide scale integration and visual integration of medical intelligence, with particular emphasis on Knowledge Management and Decision aid.
Fig. 1. Noesis Annotation Tool (manual indexing)
The three kinds of annotation considered (cataloging, conceptual and textual) have been implemented in a single tool. One novelty of the Noesis Annotation Tool is the use of a terminology server which, in combination with the ontology, brings a real support for the users’ indexing task. This terminology server aims at bridging the gap between the ontology formal language and user’s natural language terms. It generally provides candidate concepts for a user’s term. With the current version of the tool, the annotation process (cataloguing, textual, and conceptual) is performed manually. We propose to automate the conceptual annotation through an automatic features extraction process using an ontology. 2.2 Ontology for Automatic Conceptual Annotation An ontology, as specification of a conceptualization [11] may serve to represent and organize the knowledge in specific domain. In this paper ontologies are used for the purpose of conceptual document representation. The ontology model used here is inspired by the Semantic Document Representation (SDR) module of the UMSIIS approach for unifying the management of structured and informal information sources [12]. UMSIIS distinguishes the domain ontology and the ontologies for defining the terminology used in a specific domain. The model we use relies on three basic elements: concepts, relationships between concepts, and finally terms in two languages (French and English) to manage the multilingual aspects. Each concept is represented by one or several terms (see Fig. 2 and 3). This consideration is important since the algorithm we use is syntax mapping based. We choose the DL-based ontology language, Ontology Web Language (OWL)1 to represent the model. It provides the necessary elements and is compatible with Semantic Web tools such as Jena API for ontology management and Racer, a DL-based reasoner. Jena [13] is a set of tools developed within the framework of the HP Labs Semantic Web Program project, allowing to manipulate ontologies and to apply inference mechanisms. 2.3 The Ontologies Used Before we can annotate a text corpus with semantic information, we need to first establish one or more ontologies, or use already-established ones, that define the way we can classify documents. We have used two ontologies: one of the cardio-vascular domain and one of the brain cognitive functions. The Cardio-Vascular Ontology. A cardio-vascular (CV) ontology has been elaborated in the context of the European project Noesis [14]. The core ontology was taken from the MeSH thesaurus, by selecting the concepts in 10 hierarchies dealing with the CV domain. The first step to enrich the vocabulary has been automatically done through UMLS. The UMLS terms corresponding to the core concepts have been added in English, French, Italian, Spanish and German. It comprises 690 concepts. The original version of the ontology uses the SKOS specification [15] which provides both prefLabel for preferred terms and altLabel for other terms. In order to
(Footnote 1: OWL, http://www.w3.org/TR/owl-features/)
Fig. 2. Example of a Concept of the Cardio-Vascular Ontology
Fig. 3. A Concept of the Brain Cognitive Functions Ontology
provide the same feature (i.e., a preferred term for each language) without using the SKOS specification for simplicity, we have introduced two relations of type « owl:DatatypeProperty » called « PrefLabelFR » and « PrefLabelEn » for the French and English preferred terms. The other terms of a concept are designated by RDF elements « rdfs:label » [16]. The Ontology of Brain Cognitive Functions. In a former project called BC3 (Brain & Heart Knowledge Base) we have designed a functional and an anatomical ontology of the brain. The functional ontology consists of the CAC2 classification of cognitive function of the brain [19] through the subsumption relation. We have migrated to OWL-DL this work and extended it formally in order to take into account the cognitive tests. The anatomical ontology is not used in the present project. In practice, to determine if a given cognitive function is affected the doctor carries out a set of tests. For example the verbal modality of the cognitive function “memory“ can be tested by the “Palm Tree Test”, “The verbal fluency test”, etc. We decided to create two general disjoint concepts: “CognitiveFunction” and “CognitiveTest” to reflect those two aspects and we have introduced two inverse relations “IsTestedBy” and “Tested” through the “owl:ObjectProperty”, which illustrate the link between a 2
(Footnote 2: Centre d'Anatomie Cognitive, Hospital of Pitié Salpêtrière, Paris.)
cognitive function and a cognitive test and conversely. Both “IstestedBy” and “Tested” relations are specialised to take into account the different modalities of a cognitive function. Each concept is designed by one or several French or English terms. We have used the Protégé Editor tool3 to edit the Ontology and the Racer [18] reasoner to test the consistency of the Ontology (525 concepts).
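To fix ideas, the concept/term organisation described in this section can be pictured with a small data model. The sketch below is only an illustration of ours (the actual ontologies are OWL files handled with Protégé, Jena and Racer); the class, field and identifier names are invented.

```python
from dataclasses import dataclass, field

@dataclass
class Concept:
    """A concept of the ontology, denoted by terms in several languages."""
    concept_id: str                  # e.g. a MeSH/UMLS-derived identifier (hypothetical here)
    pref_label_fr: str               # preferred French term (cf. PrefLabelFR)
    pref_label_en: str               # preferred English term (cf. PrefLabelEn)
    alt_labels: list = field(default_factory=list)    # other terms (cf. rdfs:label)
    is_a: list = field(default_factory=list)          # subsumption links (concept ids)
    is_tested_by: list = field(default_factory=list)  # e.g. CognitiveFunction -> CognitiveTest

# A cognitive function and the tests that assess it (values are illustrative only).
memory_verbal = Concept(
    concept_id="CF_MemoryVerbal",
    pref_label_fr="mémoire verbale",
    pref_label_en="verbal memory",
    alt_labels=["memory (verbal modality)"],
    is_a=["CF_Memory"],
    is_tested_by=["CT_VerbalFluencyTest", "CT_PalmTreeTest"],
)

def terms_for(concept: Concept, lang: str) -> list:
    """All surface terms that may denote the concept in the given language."""
    preferred = concept.pref_label_en if lang == "en" else concept.pref_label_fr
    return [preferred] + concept.alt_labels

print(terms_for(memory_verbal, "en"))
```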
3 Semantic Annotation Process
The automatic acquisition of the semantic features of a domain is an important issue, since it reduces human intervention. The identification of ontology concepts is performed through two distinct phases: a pre-processing phase and a semantic feature extraction step.
3.1 Preprocessing Step with Apache Lucene
In order to pre-index corpora, we use Apache Lucene4, a high-performance, full-featured search engine API implemented in Java which supports both indexing and search of large document collections. The customizability and modular approach of Lucene were essential in the choice of this Information Retrieval tool. Lucene does not index files but Document objects. To index and then search files, all files must first be converted to Document objects, which we call Lucene documents (the basic indexing unit). All the information associated with Lucene documents is stored in units called fields, where each field has its name and textual value (e.g., content, author, etc.). A Lucene module called Index Writer builds a Token Stream (a sequence of words), which describes information about the token text. A Token contains linguistic properties and other information such as the start and end offsets and the type of the string (i.e., the lexical or syntactic class that the token belongs to). Every Lucene token has its own position in the token stream. The task of the Analyzer (which uses the Porter algorithm) is to take a text and provide a stream of tokens. Each analyzer includes one or more tokenizers and may include filters. Lucene provides a French Analyzer that produces stemmed text in French. We have customized this in order to provide root-based and word-based tokens.
Table 1. Tokens generated following the stem-based and word-based modes
Text: [...Mais il existe des difficultés d'attention et un ralentissement qui affectent l'ensemble des activités cognitives...]
  Stem: [exist][difficulté][attent][ralent][affectent][ensembl][activité][cognit]
  Word: [existe][difficultés][attention][ralentissement][affectent][ensemble][activités][cognitives]

Text: [...It was shown that the percentage stenosis in the coronary arteries measured by coronary angiography was negatively correlated...]
  Stem: [shown][percentag][stenosi][coronari][arteri][measur][coronari][angiographi][neg][correl]
  Word: [shown][percentage][stenosis][coronary][arteries][measured][coronary][angiography][negatively][correlated]

(Footnote 3: http://protege.stanford.edu/index.html. Footnote 4: http://lucene.apache.org.)
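To make the contrast of Table 1 concrete, the following self-contained sketch mimics the two pre-indexing modes. It is an illustration of ours, not the Lucene analyzers used by the system: the suffix list is an invented stand-in for the Porter algorithm.

```python
import re

STOPWORDS = {"il", "des", "d", "et", "un", "qui", "l", "mais", "the", "that", "was", "in", "by"}

def word_tokens(text):
    """Word-based mode: lowercased word forms, stopwords removed."""
    return [t for t in re.findall(r"[a-zA-Zà-ÿ]+", text.lower()) if t not in STOPWORDS]

def crude_stem(token):
    """Very rough stand-in for a Porter-style stemmer (illustrative only)."""
    for suffix in ("issement", "ations", "ation", "ives", "ive", "és", "es", "s", "e"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def stem_tokens(text):
    """Stem-based mode: each word form is reduced to an approximate root."""
    return [crude_stem(t) for t in word_tokens(text)]

sentence = "il existe des difficultés d'attention et un ralentissement"
print(word_tokens(sentence))   # word-based index entries
print(stem_tokens(sentence))   # stem-based index entries (more conflation)
```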
This step provides a word-based or a stem-based index of the text corpora, which we store in a MySQL database.
3.2 The Semantic Features Extraction Step
The semantic feature extraction strategy is based on simple phrase identification. The identified phrases represent the different terms used to denote the concepts of the ontology. The input of this step has four components: the text corpora, the ontology, the stemming mode and, finally, the target language. Text corpora are processed using the algorithm described in Fig. 4. Each term of a concept is treated according to the analyzer used to build the index in the first step. The concept weighting uses the CF*IDF measure (concept frequency * inverse document frequency), which adapts to the concept level the classical TF/IDF method operating at the term level [19]. We have implemented the semantic feature extraction in Java and the semantic index exploration in Python. The Python script uses the HyperGraph5 source code to
Fig. 4. Algorithm for features extraction
Fig. 5. Semantic Index Visualization
(Footnote 5: http://hypergraph.sourceforge.net)
visualize the ontology as a hyperbolic tree to explore the semantic index. The advantage of the hyperbolic view is that users can understand the ontology structure by seeing the relational cross-links between the vertices. Fig. 5 shows a screenshot of the semantic index.
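A rough rendering of the extraction-and-weighting step is given below. It is a simplification of ours: exact lower-cased term matching on raw text instead of the Lucene index, and a toy two-concept ontology with invented term lists.

```python
import math
from collections import Counter

# Toy ontology: concept id -> surface terms (already normalized to lower case).
ONTOLOGY = {
    "M0025176": ["peripheral vascular diseases", "peripheral vascular disease"],
    "M0001187": ["angiography"],
}

def concept_frequencies(doc_text, ontology):
    """Count, for one document, how many times each concept's terms occur."""
    text = doc_text.lower()
    freqs = Counter()
    for cid, terms in ontology.items():
        for term in terms:
            freqs[cid] += text.count(term)
    return freqs

def cf_idf(docs, ontology):
    """CF*IDF: concept frequency weighted by inverse document frequency."""
    per_doc = [concept_frequencies(d, ontology) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(cid for freqs in per_doc for cid in freqs if freqs[cid] > 0)
    weights = []
    for freqs in per_doc:
        total = sum(freqs.values()) or 1
        weights.append({cid: (f / total) * math.log(n_docs / doc_freq[cid])
                        for cid, f in freqs.items() if f > 0})
    return weights

docs = [
    "coronary angiography showed peripheral vascular disease",
    "angiography of the coronary arteries",
    "stenosis in the arteries measured by ultrasound",
]
print(cf_idf(docs, ONTOLOGY))
```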
4 Evaluation
4.1 The Corpora
Three corpora have been used to evaluate our approach. The first one, called the brain corpus, is composed of 45 medical reports [20] and 285 web pages related to the brain obtained through the Yahoo! API6. The second one consists of 430 scientific publications chosen from three journals on cardiology available from BiomedCentral ("BMC Cardiovascular Disorders", "Cardiovascular Diabetology", "Cardiovascular Ultrasound").7 The last one is a subset of the OHSUMED test collection. The OHSUMED test collection is a set of 348,566 references from MEDLINE, the on-line medical information database, consisting of titles and/or abstracts from 270 medical journals over a five-year period (1987-1991) and 101 test queries. The available fields are title, abstract, MeSH indexing terms, author, source, and publication type. We have selected the first 2,182 documents, which we have indexed using the Cardiovascular Ontology.
4.2 Experiment Analysis
For each corpus we have performed two tests, using respectively the stem-based and the word-based pre-indexing. The results of concept identification are presented in Table 2.
Table 2. Number of concepts and number of positions per concept identified according to the two pre-indexing modes
Corpus      | Stem: Concepts / Positions | Word: Concepts / Positions
Brain       | 2.630 / 11.691             | 2.346 / 10.606
Cardiology  | 15.969 / 115.977           | 14.380 / 107.199
OHSUMED     | 4.492 / 9.748              | 3.591 / 7.372

(Footnote 6: http://developer.yahoo.net/. Footnote 7: http://www.biomedcentral.com/.)

Table 3. Frequency and Weight (cf.idf) of the concept "M0025176" with preferred term "Peripheral Vascular Diseases" for three documents of the cardiology corpus

Document | Stem: Frequency / Weight | Word: Frequency / Weight
0        | 4 / 0.131                | 2 / 0.067
141      | 3 / 0.044                | 2 / 0.030
210      | 6 / 0.071                | 5 / 0.061
By examining Table 2 we note that when the stem pre-indexing mode is used, more concepts are identified than with the simple word-based pre-indexing mode. This is due to the fact that each word of an ontology term is reduced to its root (stem). For example, a concept represented by both the singular and the plural form of a term is counted twice when using the stem mode. Taking the cardiology corpus, we note (Table 3) that the concept "M0025176", with the preferred term "Peripheral Vascular Diseases" and the other term "Peripheral Vascular Disease", has different frequencies according to the two modes. Taking the brain corpus indexed with the ontology of brain cognitive functions, when the stem mode is used the algorithm associates all the occurrences of the word "raisonner" in the corpus with the concept "Raisonnement", as the words "raisonner" and "Raisonnement" have the same stem. This is a well-known ambiguity due to the use of stemming for text indexing [21]. Contrary to SemTag, which implements a disambiguation algorithm called TBD (Taxonomy Based Disambiguation), indexing errors made during the automatic annotation process have to be corrected manually with the NAT. In addition to the situation described above, it may also occur that the same term is present in different concepts (typically in the cardiovascular ontology, which is enriched with terms from UMLS). An example is the term "Angiography", which is the preferred term of the concept "M0001187" and an alternative term of "M0001731" (Arteries) and "M0002726" (Blood Vessels).
5 Conclusion and Future Work
In this paper we have presented an approach to automatic feature extraction for semantic text annotation using Semantic Web technologies and ontologies. This approach may be applied to several biomedical domains, as shown by the evaluation. Semantic markup, if present, might serve as indexing terms for a hybrid information retrieval engine (keyword- and concept-based search). That is, in addition to indexing documents according to the text of their words or stems, we might also index them according to their semantic markup. The semantic feature extraction time is proportional to the size of the ontology, and the process becomes slow for large ontologies (several hundreds of concepts, each represented by several terms). We hope to improve the processing time by segmenting the ontology into disjoint sub-branches in order to carry out a parallel feature extraction. Only the concepts of the domain ontology are extracted at present; the recognition of named entities will allow a richer description of the documents (e.g., recognition of a cognitive test value). Building an ontology for named entity recognition in medical reports is ongoing work.
Acknowledgments This work is partly supported by European Commission (NOESIS project, IST-2002-507960).
References 1. Berners-Lee, T., Hendler, J., Lassila O.: Semantic web. Scientific American, 1(1):68-88, (2000) 2. Hillmann D. : Guide d’utilisation du Dublin Core, consulted on 11/29/2005 (http://www.bibl.ulaval.ca/DublinCore/usageguide-20000716fr.htm) 3. Patriarche, R., Gedzelman, S., Diallo, G., Bernhard, D., Bassolet, C.G., Ferriol, S., Girard, A., Mouries, M., Palmer, P., Simonet, A., Simonet, M.: A Tool for Conceptual and Textual Annotation of Documents. e-challenges conference, Ljubljana, 19-21 Oct. (2005) 4. Kahan, J., Koivunen, M.R., Prud’Hommeaux, E., Swick, R.: Annotea : an open RDF Infrastructure for Shared Web Annotations, Proceedings of WWW10, May 1-5 (2001), Hong-Kong, pp. 623-632 5. Dill, S., Eiron, N., Gibson, D., Gruhl, D., Guha, R., Jhingran, A., Kanungo, T., Rajagopalan, S., Tomkins, A., Tomlin, J.A., Zien, J.Y.: SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation. In Proceedings International WWW Conference, Budapest, Hungary (2003) 6. Caussanel J.P., Cahier J.P., Zacklad M., Charlet J. : Les Topic Maps sont-ils un bon candidat pour l’ingénierie du Web Sémantique ?, in Proceedings of IC’2002, Rouen, June (2002) 7. Ciravegna, F., Dingli, A., Petrelli, D., Wilks, Y.: User-system cooperation in document annotation based on information extraction. In Proceedings of the 13th International Conference on Knowledge Engineering and Knowledge Management EKAW02. Sringer Verlag (2002) 8. Handschuh, S., Staab, S., Ciravegna, F.: S-Cream, Semi-Automatic Creation of Metadata. Proc. of the European Conference on Knowledge Acquisition and Management. EKAW02. (2002) 9. Heflin, J., Hendler, J. : Searching the web with shoe. In Artifial Intelligence for Web Search. Pages 35-40. AAAI Press, (2000) 10. Gandon, F.: Agents handling annotation distribution in a corporate semantic Web. In Web Intelligence and Agent Systems, IOS Press International Journal (Eds) Jiming Liu, Ning Zhong, Vol.1, Number 1, (2003) pp 23-45, ISSN : 1570-1263 WI Consortium 11. Gruber, T. R. A translation approach to portable ontologies. Knowledge Acquisition, 5(2):199-220, (1993) 12. Diallo, G., Simonet, M., Simonet, A., Bringing Together Structured and Unstructured Sources, DEXA’06 Krakow, Poland. Paper submitted. 13. Carroll, J. J., Dickinson, I., Dollin, C., Reynolds, D., Seaborne, A., Wilkinson, K.: “Jena: implementing the semantic web recommendations”, in Proceedings of the 13th International World Wide Web Conference, pp. 74 - 83, New York (2004) 14. Gedzelman, S., Simonet, M., Bernhard, D., Diallo, G., Palmer, P.: Building an Ontology of Cardio-Vascular Diseases for Concept-Based Information Retrieval. Computer in Cardiology, Lyon -France (2005) 15. SKOS : Schema for Knowledge Organisation Systems. http://www.w3.org/TR/2005/WDswbp-skos-core-guide-20050510/ 16. Lassila, O., Swick, R.: Ressource Description Framework (RDF) model and syntax specification. W3C Working Draft (1998) http://www.w3.org/TR/WD-rdf-syntax 17. Diallo, G., Bernhard, D., Batrancourt, B., Simonet, M. : Vers une Ontologie anatomofonctionnelle du cerveau, Journée Web Sémantique Médicale, Rennes (2003)
18. Haarslev, V., Möller, R.: Racer: A Core Inference Engine for the Semantic Web. In Proceedings of the 2nd International Workshop on Evaluation of Ontology-based Tools (EON2003), located at the 2nd International Semantic Web Conference ISWC 2003, Sanibel Island, Florida, USA, October 20, pages 27–36, (2003) 19. Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Information Processing and Management, Vol. 24, Issue 5, (1998); 513-523 20. Batrancourt, B.: Modélisation et outils logiciels pour une connaissance anatomofonctionnelle du cerveau humain. Thèse de Doctorat, Université Lyon 1 Décembre (2003) 21. Kraaij, W., Pohlmann, R.: Viewing stemming as recall enhancement. In Proc. of SIGIR'96, pp 40-48, Zurich , Switzerland (1996)
Lexical and Conceptual Structures in Ontology Christophe Roche Condillac Research Group, LISTIC Lab., University of Savoie, Campus Scientifique, 73376 Le Bourget du Lac cedex, France [email protected] http://ontology.univ-savoie.fr
Abstract. Strictly speaking, there is no direct relationship between ontology and text. The former is concerned with conceptualization, as an extra-linguistic representation of conceptual knowledge, whereas the latter is a matter for linguistics. Nevertheless, ontology and text are more and more connected, in particular for ontology acquisition from texts. The main goal of this article, viewed as an introduction to the special session on "Ontology and Text" of the IEA/AIE 06 Conference, is to claim that "the lexical structure and the conceptual structure in general do not fit". As a matter of fact, writing technical documents remains a linguistic activity which belongs to rhetoric, and one of the principles of linguistics is the incompleteness of texts. Even if we can learn a lot of useful information from texts for building conceptual systems, we cannot directly consider a word as a lexicalized concept, nor the hyponymy relationship as a linguistic translation of the subclass relationship.
1 Introduction
Ontology is a very popular research topic [1], [2], [3], mainly due to what it promises: a way of capturing a shared and common understanding of a domain that can be understood and used by humans as well as by programs. Even if there is no real agreement on what an ontology is, we can summarize most of the definitions used today by saying that an ontology is a shared description of the concepts and relationships of a domain expressed in a computer-readable language. Finally, everybody agrees that an ontology can also be considered as a vocabulary of terms with their definitions: "An [explicit] ontology may take a variety of forms, but necessarily it will include a vocabulary of terms and some specification of their meaning (i.e. definitions)." [4]. Ontology building then raises two main problems: on the one hand a linguistic one, the building of the vocabulary of terms, and on the other hand a conceptual one, the definition of concepts. And since we use words for speaking about knowledge, ontology building from texts seems to be a good idea.
2 The Lexical Structure
Domain knowledge is mainly conveyed through scientific and technical texts: we use names for denoting concepts, properties, relationships; and the definition of concepts
relies on using words: “a voltage relay is a kind of relay”, “a Kaplan turbine is a type of propeller turbine in which the pitch of the blades can be changed to improve performance”. The words of usage (i.e. the technical words which are used in specialized texts) as their linguistic relationships – lexical or syntactic – more or less translate the knowledge structure. Let us begin by finding the terms which can denote concepts. Insofar as texts contain linguistic expressions denoting concepts, extracting candidate terms from corpus by automatic text analysis is today an active research domain [5], [6]. Texts are first analyzed from a lexical and syntactic point of view in order to associate to each word its grammatical class and its canonical form. Then linguistic expressions can be automatically extracted using, for example, lexical and syntactic patterns (regular expressions). The “adjective substantive” and “substantive substantive” patterns allow to extract expressions like “electromagnetic relay”, “threshold relay”, “voltage relay” from a technical corpus about relays. The result, which must be validated by experts, is a lexicon of words considered as concept’s names. This lexicon is structured according to linguistic relationships like hypernymy, hyponymy, synonymy, meronymy and so on. Here too, these linguistic relationships can be “automatically” extracted from the corpus using a syntactic analysis based on verbal patterns, and in particular on the verb “to be”: “a voltage relay is a kind of relay”, “the three main types of water turbines are Pelton wheels, Francis turbines, and Kaplan or propeller type turbines”. We can also use the previously extracted expressions. For example, linguistic expressions made up of several words with the same “queue” (i.e. ending with the same words, e.g. with the same noun) give interesting information about the structure of the lexicon. The following linguistic expressions “voltage relay”, “threshold relay’, “electromagnetic relay” can be considered as hyponyms of “relay” (in this article, the linguistic expressions will be written between quotation marks) as well as “propeller turbine” and “Kaplan turbine” are hyponyms of “turbine”.
Fig. 1. An example of lexical structure
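To make the procedure concrete, here is a minimal sketch of ours (not the actual extraction tooling): candidate terms are gathered with a naive "modifier + head noun" pattern, and the expressions sharing the same head are grouped as hyponym candidates of that head.

```python
import re
from collections import defaultdict

STOP_MODIFIERS = {"a", "the", "of", "and", "kind", "type", "is"}

def candidate_terms(sentences):
    """Naive 'modifier + head-noun' pattern over plain text (no real POS tagging)."""
    pattern = re.compile(r"\b([a-z-]+)\s+(relay|turbine)s?\b")   # toy head nouns
    terms = set()
    for s in sentences:
        for m in pattern.finditer(s.lower()):
            modifier, head = m.groups()
            if modifier not in STOP_MODIFIERS:
                terms.add(f"{modifier} {head}")
    return terms

def group_by_head(terms):
    """Expressions ending with the same head noun become hyponym candidates of that head."""
    groups = defaultdict(set)
    for term in terms:
        groups[term.split()[-1]].add(term)
    return dict(groups)

corpus = [
    "A voltage relay is a kind of relay.",
    "The electromagnetic relay and the threshold relay are described below.",
    "A Kaplan turbine is a type of propeller turbine.",
]
print(group_by_head(candidate_terms(corpus)))
# e.g. {'relay': {'voltage relay', 'electromagnetic relay', 'threshold relay'},
#       'turbine': {'kaplan turbine', 'propeller turbine'}}
```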
3 The Conceptual Structure
We have to notice that putting linguistic relationships between linguistic expressions, such as hyponymy or synonymy, does not define them. Even if we consider a word, or a set of synonymous words, as a "lexicalised concept" as in WordNet (where a concept is a synset, i.e. a set of synonymous words) [7], the words are not defined: a concept is more than a word, and an ontology is not a structured lexicon.
Although a structured lexicon does not define a conceptual model, its study can bring a lot of useful information for building one. As a matter of fact, the domain conceptualization appears through the words of usage, considered as terms denoting concepts, and through their lexical structure, which translates a conceptual structure. Let us go back to the linguistic expressions made up of several words with the same "queue" (i.e. ending with the same words). From these data we can extract useful information for building the conceptual structure. The following linguistic expressions, "voltage relay", "threshold relay" and "on-off relay", if they are considered as lexicalized concepts, denote different kinds of relays.
Fig. 2. The conceptual structure corresponding to the lexical structure of figure 1
Or, in a more formal language:
∀x VoltageRelay(x) → Relay(x)
∀x On-OffRelay(x) → Relay(x)
∀x ThresholdRelay(x) → Relay(x)
Such an approach assumes that a noun denotes a concept, while the hyponymy relationship between nouns translates the subsumption relationship between concepts. In a few words, the fundamental hypothesis of this approach is that the conceptual structure fits, or translates, the lexical structure.
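Under this hypothesis the translation can even be mechanised. The sketch below, an illustration of ours, turns hyponymy links of the lexical network directly into subsumption statements of the kind shown above; the next section questions precisely this shortcut.

```python
def lexical_hyponym_links():
    """A toy lexical structure: (hyponym expression, hypernym expression)."""
    return [
        ("voltage relay", "relay"),
        ("threshold relay", "relay"),
        ("on-off relay", "relay"),
    ]

def to_concept_name(expression):
    """Naively promote a linguistic expression to a 'lexicalised concept' name."""
    return "".join(w.capitalize() for w in expression.replace("-", " ").split())

def subsumption_axioms(links):
    """One axiom 'forall x: Child(x) -> Parent(x)' per hyponymy link."""
    return [f"forall x: {to_concept_name(c)}(x) -> {to_concept_name(p)}(x)"
            for c, p in links]

for axiom in subsumption_axioms(lexical_hyponym_links()):
    print(axiom)
```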
4 Is It so Simple?
Does the linguistic structure fit the conceptual structure? Does the conceptual structure translate the linguistic structure? Or, on the contrary, does the linguistic structure translate (or should it translate) the conceptual structure? These questions are not equivalent. The first one implies that the linguistic and conceptual analyses can be viewed as independent processes, and we then have a new problem to solve, that of aligning the two structures. The second question corresponds to a semasiological approach, where we first find the vocabulary of terms and then define them. The last question corresponds to an onomasiological approach, as in terminology [8], [9], whose main goal is first to define a conceptualization of the world and then to name the concepts. Ontology building from texts corresponds to a semasiological approach: we first find terms (mainly nouns) and then define them as concepts. Most of the present works follow the approach we have described, even if the methods and techniques can be
more sophisticated (for example in relying on both linguistic and mathematical techniques for extracting candidate terms). But is such a built conceptual structure really sound? and can it be really shared and reused? Let us go back to the simple example of relays and let us use this conceptualization for a CMS (content management system) application. Documents can be semantically annotated based on this conceptualization [10], it means a document is classified, or indexed, on every concept the content of the document refers to. Then, looking for information about a concept, for example “threshold relay”, must return all information about this kind of relay, using if necessary the “is a” inheritance relationship for inferences. But in this case searching information on “threshold relay” does not return any information about “voltage relay”. This is not correct. As a matter of fact, for experts, all information about “voltage relay” concerns “threshold relay”: if the conceptualization is not wrong, it is not completely correct. Where is the problem? Working with the experts on their domain conceptualization, independently of the words of usage, it appeared that in fact a
Footnote 1:
Formal languages are mainly logic-oriented. The concepts are represented as unary predicates when their attributes, or slots, are represented as binary predicates also called roles. Description logic [13] is a good example of logic dedicated to knowledge representation as OWL, the web ontology language [14], is a good example of languages for building ontology based on the W3C philosophy. But logic is a neutral (or flat) language which is not able to represent the different kinds of knowledge: a unary predicate can represent either a concept or a property when binary predicates can represent either attributes (internal relationships) or relationships between concepts (external relationships). The frame representation languages [15], in spite of the criticism of [16], are different in the sense that they allow to directly express the structure of concepts in terms of slots and relationships.
Fig. 3. The “correct” conceptual structure corresponding to the lexical structure of figure 1
differences. Furthermore, everybody can agree with the classical (Aristotelian) definition of a concept: a
The lexicon of an LSP (Language for Special Purpose, or Specialized Language) and the vocabulary of concept names are in fact two different things, which should be studied independently and then connected if necessary. Indeed, let us remark that the lexicon of an LSP is made of words of usage, which can be polysemic, whereas concept names are, by definition, monosemic; likewise, the hyponymy relationship can be a multiple relation (the same word can have several hypernyms), whereas the subsumption relationship based on the specific difference theory cannot be multiple.
Using the same approach based on the "specific difference" theory, the experts defined the ontology of turbines as described in the following figure, showing that Kaplan and propeller turbines are similar, since they are both kinds of reaction hydraulic turbine with blades, but differ according to whether the pitch of the blades is movable. Note: the "specific difference" ontologies of this article were realized with the OCW (Ontology Craft Workbench) environment of Ontologos corp. (http://www.ontologoscorp.com).
Fig. 4. The possible relationships between Kaplan and Propeller turbines
Fig. 5. The “specific difference” ontology of turbines
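As an illustration of the "specific difference" principle (a sketch of ours, not the OCW representation), a concept can be encoded as its genus plus the differences that distinguish it, so that Kaplan and propeller turbines share the same genus and differ in a single attribute.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Concept:
    name: str
    genus: Optional[str] = None                       # parent concept
    differentia: dict = field(default_factory=dict)   # distinguishing attributes

reaction_turbine = Concept("ReactionHydraulicTurbineWithBlades")
propeller = Concept("PropellerTurbine", genus=reaction_turbine.name,
                    differentia={"blade_pitch": "fixed"})
kaplan = Concept("KaplanTurbine", genus=reaction_turbine.name,
                 differentia={"blade_pitch": "movable"})

def compare(a: Concept, b: Concept):
    """Siblings are alike through their shared genus and distinct through a specific difference."""
    same_genus = a.genus == b.genus
    differences = {k: (a.differentia.get(k), b.differentia.get(k))
                   for k in set(a.differentia) | set(b.differentia)
                   if a.differentia.get(k) != b.differentia.get(k)}
    return same_genus, differences

print(compare(kaplan, propeller))   # (True, {'blade_pitch': ('movable', 'fixed')})
```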
5 Conclusion As domain knowledge is mainly conveyed by technical and scientific texts, and since linguistic expressions are used to express this knowledge and to denote concepts and relationships between them, it seems interesting to build ontology – understood as a conceptualization of a domain – from texts. The knowledge extracted from texts is a linguistic knowledge. The lexical structure is a network of words – can we really consider them as lexicalized concepts? – linked
by linguistic relationships like hyponymy, synonymy, etc. This lexical structure is a linguistic “picture” of the conceptual structure of the domain; a picture built in a given and particular context of corpus, which can differ from one context to another. The lexical structure is not an “isomorphic” translation of the conceptual model. As a matter of fact, texts fall within the language in action. Using rhetorical figures, such as metonymy and ellipse, is a very ordinary practice even for writing technical document. Let us recall the example of “voltage relay” where we use a specific difference for denoting a concept (metonymy principle). Furthermore, one of the principles in textual linguistics, even in technical documents, is the incompleteness of texts. A conceptual model “directly” built from a lexical structure will probably not be reusable, because too corpus-dependent, and then probably not correct if we consider an ontology as a non-contingent knowledge. We have to keep in mind that words of usage and concepts belong to different disciplines: the signified (signifié – Saussure) of a word in a text is not a concept and a lattice of words linked by linguistic relationships is not necessarily an ontology. Even if we can extract useful information from texts about the domain conceptualization, we have also to keep in mind that: “the lexical structure does not fit the conceptual structure”.
References 1. Staab S.. Handbook on Ontologies by Steffen Staab (Editor), Rudi Studer (Editor), Springer, 2004 2. Gomez 04. Ontological Engineering : with examples from the areas of Knowledge Management, e-Commerce and the Semantic Web by Asuncion Gomez-Perez, Oscar Corcho, Mariano Fernandez-Lopez Springer, 2004 3. Roche 03. Ontology: a Survey. 8th Symposium on Automated Systems Based on Human Skill and Knowledge IFAC, September 22-24 2003, Göteborg, Sweden 4. « Ontologies : Principles, Methods and Applications » M. Ushold & M. Gruninger. Knowledge Engineering Review, Vol. 11, n° 2, June 1996 5. Buitelaar 05. Ontology Learning from Text: Methods, Evaluation and Applications (Frontiers in Artificial Intelligence and Applications, Vol. 123) by P. Buitelaar (Editor) Ios Press Publication (July 1, 2005) 6. Daille 04. Recent Trends in Computational Terminology. Special issue of Terminology 10:1 (2004). Edited by Béatrice Daille, Kyo Kageura, Hiroshi Nakagawa and Lee-Feng Chien, Benjamins publishing company, 2004 Wordnet. 7. http://wordnet.princeton.edu/ 8. ISO 704: Terminology work - Principles and methods. International Organization for Standardization. 2000 9. Cabré 99. Terminology: Theory, methods and applications. M. Teresa Cabré Castellví, Benjamins publishing company,1999 10. Kiryakov, Popov, Terziev, Manov, and Ognyanoff 05. Semantic Annotation, Indexing, and Retrieval.Elsevier's Journal of Web Sematics, Vol. 2, Issue (1), 2005. 11. Gruber 03. A Translation Approach to Portable Ontology Specifications. Knowledge Systems Laboratory September 1992 - Technical Report KSL 92-71 Revised April 1993. Appeared in Knowledge Acquisition, 5(2):199-220, 199
12. Roche 01. The ‘Specific-Difference’ Principle: a Methodology for Building Consensual and Coherent Ontologies. IC-AI’2001: Las Vegas, USA, June 25-28 2001 13. Baader 03. The Description Logic Handbook. Franz Baader, Diego Calvanese, Deborah McGuinness, Daniele Nardi, Peter Patel-Schneider, editors. Cambridge University Press, 2003 14. OWL Web Ontology Language: http://www.w3.org/TR/owl-features/ 15. Wright, J. M., M. S. Fox, and D. Adam. 84 "SRL/1.5 Users Manual." Technical report;Robotics Institute, Carnegie-Mellon University 1984. 16. Woods 75. What’s in a Link: Foundations for Semantic Networks. Representation and Understanding: Studies in Cognitive Science, 35-82, edited by D.G. Bobrow and A.M. Collins, New York: Academic Press, 1975.
Discovering Verb Relations in Corpora: Distributional Versus Non-distributional Approaches
Maria Teresa Pazienza 1, Marco Pennacchiotti 1, and Fabio Massimo Zanzotto 2
1 University of Roma Tor Vergata, Via del Politecnico 1, Roma, Italy {pazienza, pennacchiotti}@info.uniroma2.it
2 DISCo, University of Milano Bicocca, Via B. Arcimboldi 8, Milano, Italy [email protected]
Abstract. Verbs represent a way in which ontological relationships between concepts and instances are expressed in natural language utterances. Moreover, an organized network of semantically related verbs can play a crucial role in applications. For example, if a Question-Answering system could exploit the direction of the entailment relation win → play, it may expand the question “Who played against Liverpool?” with “X won against Liverpool” and it may avoid the expansion of “Who won against Liverpool?” in “X played against Liverpool” that would be wrong. In this paper, we present a survey of the methods proposed to extract verb relations in corpora. These methods can be divided in two classes: those using the Harris distributional hypothesis and those based on point-wise assertions. These methods are analysed and compared.
1 Introduction
Learning domain ontologies (or semantic models) from texts is a largely debated problem that goes under different names according to the research area: terminology extraction [1], named entity recognition [2], lexical acquisition, and information extraction [3]. Models for terminology extraction and structuring [1] have always aimed to extract terms from domain corpora and to structure them in is-a hierarchies [4] or other conceptual networks. Terms are strictly related to ontologies, as they are the linguistic counterpart of domain concepts. Named entity extraction as well as information extraction give the possibility of populating ontologies with concept and relationship instances. Traditionally, these techniques are based on a pre-existing semantic model (or ontology) of the domain defining relevant concepts called named entity categories (e.g., location, team, person) and relevant relations between concepts called templates (e.g., Match(Home,Visitors,Result,Score,Date)). Given a text, the resulting systems aim to extract named entities, which are instances of the categories, and fillers for the template slots. For example, given the sentence "Marcus Camby won for NY at the Yankees Stadium", these systems have to extract:
– 3 named entities: "Marcus Camby" as instance of person, "NY" as instance of team, and "Yankees Stadium" as instance of location;
– a partial instance of the Match template: Match(New York Yankees,-,HomeWinner,-,-)
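As a toy illustration of ours, these extraction targets can be written down as plain data structures that an IE system would have to fill (the values are those discussed above; None marks unfilled slots):

```python
# Named entities: surface string -> category from the pre-existing semantic model.
named_entities = {
    "Marcus Camby": "person",
    "NY": "team",
    "Yankees Stadium": "location",
}

# A partial instance of the Match(Home, Visitors, Result, Score, Date) template.
match_template = {
    "Home": "New York Yankees",
    "Visitors": None,
    "Result": "HomeWinner",
    "Score": None,
    "Date": None,
}

print(named_entities)
print(match_template)
```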
These activities are clearly very difficult. What is written in the sample sentence is distant with respect to its interpretation according to the template. However, some systems really achieve good performances in some Information Extraction tasks [2]. Ontology learning from texts is thus not proposing a new challenge to natural language processing (NLP); it is pushing the scalability of these methods, which give the basis over which more scalable models can be produced. In particular, verbs and verb phrases, one of the linguistic counterparts of ontological relationships between concepts and instances, have been fairly well studied. A clear model of verb semantics could be helpful in determining a clear model of the relationships between concepts. Relations between verbs therefore play a very important role. This will be clarified with an example. Consider the question "What college did Marcus Camby play for?". A system could find the answer in the snippet "Marcus Camby won for Massachusetts" only if the question verb play is related to the verb win according to some ontological resource, even if play and win have different meanings. Understanding verb relations and discovering such relations between verbs in corpora is a very interesting research field. This can give some insights for modelling better ontological resources. Moreover, some results and methods to induce such relations between verbs from corpora are already available. These are mainly based on two hypotheses:
– the Harris' Distributional Hypothesis [5]
– the pointwise assertion hypothesis [6]
These two hypotheses have originated very different methods for discovering relations between words in general and between verbs in particular. The rest of the paper is organised as follows. Sec. 2 shortly revises the possible verb relations; this gives an important view on how relationships between concepts should be modelled. Sec. 3 and Sec. 4 describe the automatic models to extract these verb relations from corpora: Sec. 3 covers the methods based on the distributional hypothesis and Sec. 4 the pointwise assertion methods.
2 Verb Relations: A Classification
WordNet [7] is a very large lexical knowledge base. Its organization can give hints to better understand which relations hold between verbs. Verbs are basically organized in synonymy sets (synsets), and different kinds of semantic relations can hold between two verbs (i.e., two synsets): troponymy, causation, backward-presupposition, and temporal inclusion. All these relations are intended as specific types of lexical entailment. According to the definition in [7], lexical entailment
holds between two verbs vt and vh when the sentence Someone vt entails the sentence Someone vh (e.g. “Someone wins” entails “Someone plays”). Lexical entailment is then an asymmetric relation. The four types of WordNet lexical entailment can be classified looking at the temporal relation between the entailing verb vt and the entailed verb vh . Troponymy represents the hyponymy relation between verbs. It stands when vt and vh are temporally co-extensive, that is, when the actions described by vt and vh begin and end at the same times (e.g. limp→walk ). The relation of temporal inclusion captures those entailment pairs in which the action of one verb is temporally included in the action of the other (e.g. snore→sleep). Backwardpresupposition stands when the entailed verb vh happens before the entailing verb vt and is necessary for vt . For example, win entails play via backwardpresupposition as it temporally follows and presupposes play. Finally, in causation the entailing verb vt necessarily causes vh . In this case, the temporal relation is thus inverted with respect to backward-presupposition, since vt precedes vh . In causation, vt is always a causative verb of change, while vh is a resultative stative verb (e.g. buy→own, and give→have). As a final note, it is interesting to notice that the Subject-Verb structure of vt is preserved in vh for all forms of lexical entailment. The two verbs have the same subject. The only exception is causation: in this case the subject of the entailed verb vh is usually the object of vt (e.g., X give Y → Y have), as in most cases the subject of vt carries out an action that changes the state of the object of vt , that is then described by vh .
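This classification can be summarised in a small machine-readable form; the sketch below is ours, and the examples are those given in the text.

```python
# WordNet-style lexical entailment types, keyed by the temporal relation
# between the entailing verb vt and the entailed verb vh.
ENTAILMENT_TYPES = {
    "troponymy":               {"temporal": "vt and vh co-extensive", "example": ("limp", "walk")},
    "temporal_inclusion":      {"temporal": "vt included in vh",      "example": ("snore", "sleep")},
    "backward_presupposition": {"temporal": "vh precedes vt",         "example": ("win", "play")},
    "causation":               {"temporal": "vt precedes vh",         "example": ("buy", "own")},
}

def describe(relation):
    info = ENTAILMENT_TYPES[relation]
    vt, vh = info["example"]
    return f"{vt} -> {vh} ({relation}, {info['temporal']})"

for r in ENTAILMENT_TYPES:
    print(describe(r))
```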
3 Distributional Semantics
3.1 The Harris' Distributional Hypothesis
The Distributional Hypothesis [5] has been fairly well explored and exploited in learning similarity between words and between more structured expressions. The hypothesis is the following: words that tend to occur in the same contexts tend to have similar meanings. This seems to be a simple statement, but it has given very interesting results. It is a very strong assertion, as word semantics is modelled using only contextual information; target words themselves are only used to select contexts from a corpus. Trying to formalize, given a set of words W and a set of contexts Ctx drawn from a corpus C, we can define a function that associates the contexts to each subset of words:
C : 2^W → 2^Ctx   (1)
The Distributional Hypothesis suggests that the similarity sim_w between two sets of words, W1 and W2, is computable as the similarity sim_ctx between the related sets of contexts:
sim_w(W1, W2) ≈ sim_ctx(C(W1), C(W2))
(2)
The problem is then how to compute the similarity between contexts. Contexts are generally represented in a feature space F = F1 × ... × Fn. As in any vector space model, this space also hosts the intensional representations of sets of contexts I(Ctx) (e.g., their centroids). The function that computes the intensional representations is:
I : 2^Ctx → F   (3)
The similarity between two sets of contexts (or between their intensional representations) is then computed over the space F. In the following we will use w ∈ W and c ∈ Ctx in the similarities to indicate, respectively, the singletons {w} and {c}. As in any other Machine Learning problem, the methods applying this principle differ according to:
– how contexts are represented as features of the space F (using bag-of-words models, syntactic representations, or semantic models);
– how, eventually, intensional representations of sets of contexts are computed;
– how the similarity measure over contexts is defined.
Before starting the analysis of the methods we need to give another classification, which depends on how the similarity is used in an algorithm that generates a lexicon of similar words starting from a corpus C. We can distinguish two ways: a direct approach, which exploits the similarity directly, and an indirect approach, which, working in the context space, selects words sharing similar contexts. In the following, we review some of these learning algorithms with respect to this classification.
3.2 Discovering Similarity Using the Direct Approach
The direct approach has been fairly well studied, and it is applicable when the corpus C is known a priori. Given a set W of relevant words or linguistic structures, the corpus C is scanned for each element w ∈ W. Contexts for each w are gathered and represented in the feature space F. Pairs (w′, w″) in W × W are ranked according to sim(w′, w″), and these are retained as good pairs if their similarity is greater than a threshold α, i.e. sim(w′, w″) > α. Sometimes, to improve selectiveness, this threshold depends on one of the two elements, i.e. α_w. The applicability to known corpora depends on two factors. The first is that the set W has to be defined a priori. This is generally done by analysing the frequency, or similar indicators, of the words w in the corpus. The second applicability limitation depends on the fact that the contexts of each w have to be represented in the F space. This means that the corpus has to be completely scanned; only then can the similarity between different words be computed. The general algorithm DH_direct(C), which directly applies the Distributional Hypothesis, is described in the following. It takes as input a corpus C and returns a set of equivalence classes Wi containing words or linguistic structures:
DH_direct(C) returns (W1, ..., WN)
Given a corpus C and the related function C:
1. Let W be the most important elements in C.
2. For each w ∈ W:
   (a) retrieve all the contexts C(w) and map them into the feature space F;
   (b) eventually compute I(C(w)).
3. For each (wi, wj) ∈ W × W:
   (a) compute sim_ctx(C(wi), C(wj));
   (b) if sim_ctx(C(wi), C(wj)) > α_wi then put wj in Wi.
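A compact Python rendering of DH_direct is given below. It is a sketch of ours under simplifying assumptions: bag-of-words contexts in a fixed window, cosine similarity as sim_ctx, and a single global threshold α; the toy corpus and the threshold value are invented.

```python
import math
import re
from collections import Counter, defaultdict

def contexts(corpus, targets, window=2):
    """C(w): bag-of-words drawn from a +/- window context around each occurrence of w."""
    ctx = defaultdict(Counter)
    for sentence in corpus:
        tokens = re.findall(r"[a-z]+", sentence.lower())
        for i, tok in enumerate(tokens):
            if tok in targets:
                ctx[tok].update(tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window])
    return ctx

def cosine(c1, c2):
    """sim_ctx: cosine between two bag-of-words context vectors."""
    num = sum(c1[f] * c2[f] for f in set(c1) & set(c2))
    den = math.sqrt(sum(v * v for v in c1.values())) * math.sqrt(sum(v * v for v in c2.values()))
    return num / den if den else 0.0

def dh_direct(corpus, targets, alpha=0.5):
    """Group target words whose context vectors are similar enough."""
    ctx = contexts(corpus, targets)
    classes = defaultdict(set)
    for wi in targets:
        for wj in targets:
            if wi != wj and cosine(ctx[wi], ctx[wj]) > alpha:
                classes[wi].add(wj)
    return dict(classes)

corpus = [
    "the team won the match against Liverpool",
    "the team played the match against Liverpool",
    "the striker scored a goal in the match",
]
print(dh_direct(corpus, {"won", "played", "scored"}))
# won and played end up in each other's class; scored stays apart.
```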
(4)
where mi(pat, s, w) is the point-wise mutual information [10]: mi(pat, s, w) = log
p(pat, s, w) p(pat, s)p(s, w)
(5)
Probabilities are estimated with the maximun likelihood principle over the corpus. We define the vectors Is (where s is the slot X or Y ) as the vectors having 1 in the feature (s, w) and 0 otherwise. The similarity sim(p1 , p2 , s) between p1 and p2 according to the slot s is defined as follows : sim(p1 , p2 , s) =
(I(p1 ) + I(p2 ))I(p1 , p2 )IsT I(p1 )IsT + I(p2 )IsT
where I(p1 , p2 ) is a matrix whose elements on the diagonal are:
(6)
Discovering Verb Relations in Corpora
I(s,w),(s,w) (p1 , p2 ) =
1047
1 if I(s,w) (p1 ) > 0 and I(s,w) (p2 ) > 0 0 otherwise
and elements out of the diagonal are 0. The similarity simp (p1 , p2 ) between two patterns is then computed as follows: simp (p1 , p2 ) = sim(p1 , p2 , X) × sim(p1 , p2 , Y ). (7) 3.3
Anchor-Based Algorithms: indirect Approaches
The indirect approaches (e.g. [11, 12, 13]) have been proposed to apply the Distributional Hypothesis also when the corpus C is a-priori not known (e.g. the Web that is potentially infinite). The problem there is that the set W can not be computed in advance. If W were somehow given, the more similar words to a word w can be found only when simw (w, w ) has been computed for each w ∈ W . This is generally unfeasible due to the large size of the corpus. The pursued idea is the following. Given a seeding set of words or linguistic structures WS considered similar or realising a unique semantic relation (e.g., the is-a relation such as in [11]), a set CS of prototypical contexts are extracted. Each element c in the set CS that has some important property is called anchor [?]. An anchor highly characterises the contexts where the set of words WS appears. For each of the element c, it is then possible to derive the set of words: Wc = C −1 ({c ∈ CT X|simctx(c, c ) > α})
(8)
These sets can be used to enrich the original set of word W with similar words. The similarity is always estimated using the similarity between contexts. DH indirect(C,(W1 ,...,WN )) returns (W1 ,...,WN ) Given a corpus C and the releted function C: 1. For each Wi : (a) Wi = Wi : (b) For each relevant set W ⊆ Wi : i. compute IW = I(C(W )) ii. extract W = C −1 ({c ∈ CT X|simctx(c, IW ) > α}) iii. select the most relevant W ⊆ W iv. Wi = Wi ∪ W Hearst [11] by first mined a wide collection of texts to identify, in a set of frequently occurring lexico-syntactic patterns, lexical relations of interest. Main focus was in discovering patterns highlighting hyponymyc lexical relationship between two or more noun phrase in naturally-occurring text. Basic assumption was that the structure of a language can indicate the meanings of lexical items. The problem lies in finding surface linguistic expressions that, frequently and reliably, indicate relations of interest. Limited to hyponymy relation, Hearst identified a set of lexico-syntactic patterns:
1048
M.T. Pazienza, M. Pennacchiotti, and F.M. Zanzotto
– occurring frequently and in several text genres – recognizable with little or no pre-encoded knowledge By first these patterns were discovered by observation, then, to find new patterns automatically, a list of terms for which the specified relation hold was gathered. The corpus was mined to find expressions in which these terms occurred near one another. Then commonalities among these environments were extracted and assumed to indicate the relation of interest. Each time the specific list of terms is used as an anchor. Hearst declares the method has been successfull for hyponimy relation, while not for meronymy. Such a different result has been ascribed to the ”naming” nature of hyponymy relation. Ravichandran and Hovy [12] explored the power of surface text patterns for open-domain Question Answering systems. They recognized that in several Q/A systems specific types of answer are expressed by using characteristic phrases (that could be described in regular expressions). They described a pattern-learning algorithm and focused on scaling relation extraction to the Web. In fact with their algorithm it is possible to infer surface patterns from a small set of instance seeds (the anchor in this approach) by extracting substrings relating seeds in corpus sentences. The presence of any variant of the answer term causes the same treatement as for the original answer term. Nevertheless the patterns cannot handle long-distance dependencies. The approach has been tested on several relations providing good results for specific relations (such as birthdate) while lower precision revealed for generic and frequent ones (as is-a and part-of). Also this algorithm is in the line of learning then extracting approach. Szpektor et al. [13] defined a fully unsupervised learning algorithm for webbased extraction of entailment relations. By assuming that the same meaning can be expressed in a text in a huge variety of surface forms, they focused on acquiring paraphrase patterns that represent different forms in which a certain meaning can be expressed. This approach has been applied to acquire entailment relations from the Web. Paraphrase acquisition results in finding linguistic structures (called templates) sharing the same lexical elements describing the context of a sentence. These lexical elements are used as anchors. Templates extracted from different sentences, while connecting the same anchors, are assumed to paraphrase each other. We recognize a structuring model for which we distinguish between syntactic anchors (such as subject, object, verb) and a context anchor (such as prepositional phrase). Main problems relate to both finding matching anchors and identifying template structure. Specific algorithms have been developed for both problems. A broad range of semantic relations varying from synonymy to more complex entailment have been extracted.
4 Non-distributional Approaches
Since the Distributional Hypothesis [5] suggests equivalence between words, the related methods can discover only symmetric relations. However, consider again
the question “What college did Marcus Camby play for?”. A system could find the answer in the snippet “Marcus Camby won for Massachusetts”, as the question verb play is related to the verb win. The vice-versa is not true: if the question were “What college did Marcus Camby win for?”, the snippet “Marcus Camby played for Massachusetts” could not be used. This is because winning entails playing but not vice-versa. The relation between win and play is asymmetric, and these kinds of relations cannot be easily discovered using the distributional hypothesis. The idea is instead that some point-wise assertions carry relevant semantic information (as in [6]). These point-wise assertions can be detected at three levels: at the sentence level as done in [14], at the level of relations between sentences [15], and, finally, at the document level as in [16]. In the following sections we will review these methods.

4.1 Single Sentence Patterns
In [14] it has been observed that class-level and word-level selectional preferences [17] offer a very interesting place to search for these asymmetric entailment relations between verbs. Indeed, selectional preferences indicate an entailment relation between a verb and its arguments. For example, the selectional preference {human} win may be read as a smooth constraint: if x is the subject of win then it is likely that x is a human, i.e. win(x) → human(x). It follows that a selectional preference like {player} win may be read as suggesting the entailment relation win(x) → play(x). To exploit this principle for entailment detection, we need to find the specific verb selectional preferences. Our method consists of two steps. Firstly, it is necessary to translate the verb selectional expectation into specific Subject-Verb lexico-syntactic patterns (P(vt, vh)). Secondly, we need to define a statistical measure (S(vt, vh)) that captures the verb preference. This measure describes how stable and commonly agreed the relation between the target verbs (vt, vh) is.

The above idea requires Subject-Verb textual entailment lexico-syntactic patterns. It often happens that verbs can undergo an agentive nominalization, e.g., play vs. player. The overall procedure to verify whether an entailment between two verbs (vt, vh) holds in a point-wise assertion is: whenever it is possible to apply the agentive nominalization to the hypothesis vh, scan the corpus to detect those expressions in which the personified hypothesis verb is the subject of a clause governed by the text verb vt. Given the two investigated verbs (vt, vh), the assertion is formalized in a set of textual entailment lexico-syntactic patterns that we call nominalized patterns Pnom(vt, vh). This set contains the following textual patterns:

Pnom(vt, vh) = { “agent(vh)|number:sing vt|person:third,tense:present”,
                 “agent(vh)|number:plur vt|person:nothird,tense:present”,
                 “agent(vh)|number:sing vt|tense:past”,
                 “agent(vh)|number:plur vt|tense:past” }
where agent(v) is the noun deriving from the personification of the verb v and elements such as l|f1,...,fN are the tokens generated from the lemma l by applying the constraints expressed by the feature-value pairs f1, ..., fN. For example, in the case of the verbs play and win, the related set of textual entailment expressions derived from the patterns is Pnom(win, play) = {“player wins”, “players win”, “player won”, “players won”}. Given a pair vt and vh we may thus define an entailment strength indicator S(vt, vh). Specifically, the measure Snom(vt, vh) we use is derived from point-wise mutual information [10]:

Snom(vt, vh) = log [ p(vt, vh | nom) / ( p(vt) · p(vh | pers) ) ]    (9)
where nom is the event of having a nominalized textual entailment pattern and pers is the event of having an agentive nominalization of a verb.
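The sketch below shows how such a score could be estimated from raw corpus counts; it is a loose reading of eq. (9) in which the conditional probabilities are approximated by simple relative frequencies, and all counts are invented for illustration.

import math

def s_nom(count_patterns, count_vt, count_agent_vh, total):
    # count_patterns : hits of the nominalized patterns for (vt, vh)
    # count_vt       : occurrences of the verb vt in the corpus
    # count_agent_vh : occurrences of the agentive nominalization of vh
    # total          : number of tokens used to turn counts into frequencies
    p_joint = count_patterns / total
    p_vt = count_vt / total
    p_agent_vh = count_agent_vh / total
    return math.log(p_joint / (p_vt * p_agent_vh))

# Toy numbers: "player wins/won ..." hits against the raw verb/noun frequencies.
print(s_nom(count_patterns=120, count_vt=50000, count_agent_vh=8000, total=10000000))
# ≈ 1.10 > 0, which points towards the entailment win → play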
4.2 Multiple Sentence Patterns
The lexico-syntactic patterns introduced in [15] have been developed to detect several kinds of verb relations: similarity, strength, antonymy, enablement, and happens-before. Even if, as discussed in [15], these patterns are not specifically defined as entailment detectors, they can be fairly useful for this purpose. In particular, some of these patterns can be used to investigate backward-presupposition entailment. Verb pairs related by backward-presupposition are not completely temporally included one in the other (cf. Sec. 2): the entailed verb vh precedes the entailing verb vt. One set of lexical patterns in [15] seems to capture the same idea: the happens-before (hb) patterns. Indeed, they are used to detect verbs that do not temporally overlap, a relation semantically very similar to entailment, and these patterns in fact show a positive relation with the entailment relation under investigation. The following set reports the happens-before lexico-syntactic patterns (Phb) described in [15]:

Phb(vt, vh) = { “vh|t:inf and then vt|t:pres”, “vh|t:inf * and then vt|t:pres”,
                “vh|t:past and then vt|t:pres”, “vh|t:past * and then vt|t:pres”,
                “vh|t:inf and later vt|t:pres”, “vh|t:past and later vt|t:pres”,
                “vh|t:inf and subsequently vt|t:pres”, “vh|t:past and subsequently vt|t:pres”,
                “vh|t:inf and eventually vt|t:pres”, “vh|t:past and eventually vt|t:pres” }
where * is matched with any word. For example, for the classical pair (vt, vh) = (win, play), the happens-before patterns are realised in Phb(win, play) = {“play * and then win”, “played and then win”, ..., “play and eventually win”}. Also in [15], a mutual-information-related measure is used as the statistical indicator, so the two methods seem to be fairly well in line.
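As a toy illustration of how such templates can be instantiated for a concrete verb pair (again, not code from [15]), the sketch below expands the connectives over naive verb forms; the ad-hoc "-ed" past tense is an assumption standing in for a real morphological lexicon.

def happens_before_patterns(vt, vh, vh_past=None):
    # Expand the happens-before templates for the pair (vt, vh).
    vh_past = vh_past or vh + "ed"          # crude morphology, illustration only
    connectives = ["and then", "and later", "and subsequently", "and eventually"]
    patterns = []
    for conn in connectives:
        patterns += [f"{vh} {conn} {vt}", f"{vh_past} {conn} {vt}"]
    # wildcard variants exist only for the "and then" templates
    patterns += [f"{vh} * and then {vt}", f"{vh_past} * and then {vt}"]
    return patterns

print(happens_before_patterns("win", "play", vh_past="played"))
# ['play and then win', 'played and then win', ..., 'played * and then win']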
4.3 Document Patterns
The other approach we analyse is the “quasi-pattern” used in [16] to capture lexical entailment between two sentences. The pattern has to be discussed in the more general setting of probabilistic entailment between texts: the text T and the hypothesis H. The idea is that the implication T → H holds (with a degree of truth) if the probability that H holds knowing that T holds is higher than the probability that H holds alone, i.e.:

p(H|T) > p(H)    (10)
In [16], words in the two sentences H and T are supposed to be mutually independent; consequently, the previous relation between the probabilities of H and T also holds for word pairs. Then, a special case can be applied to verb pairs:

p(vh|vt) > p(vh)    (11)
Equation (11) can be interpreted as the result of the following “quasi-pattern”: the verbs vh and vt should co-occur in the same document. It is thus possible to formalize the idea in the probabilistic entailment “quasi-patterns” Ppe reported below:

Ppe(vt, vh) = { “vh|person:third,t:pres” ∧ “vt|person:third,t:pres”,
                “vh|t:past” ∧ “vt|t:past”,
                “vh|t:pres cont” ∧ “vt|t:pres cont”,
                “vh|person:nothird,t:pres” ∧ “vt|person:nothird,t:pres” }
Also according to [16] point-wise mutual information is a relevant statistical indicator for entailment, as it is strictly related to eq. (11).
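A document-level check of this kind can be sketched in a few lines; the version below estimates the two probabilities in eq. (11) from document counts, with all numbers invented for illustration.

def entailed_by_cooccurrence(docs_with_vt, docs_with_both, docs_with_vh, total_docs):
    # p(vh | vt) estimated over documents containing vt; p(vh) over all documents.
    p_vh_given_vt = docs_with_both / docs_with_vt
    p_vh = docs_with_vh / total_docs
    return p_vh_given_vt > p_vh

# Toy counts: documents mentioning "win", "play", or both.
print(entailed_by_cooccurrence(docs_with_vt=4000, docs_with_both=1500,
                               docs_with_vh=30000, total_docs=1000000))
# True: play is far more frequent in documents that also contain win.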
5 Conclusions
In this paper, we presented a survey of the methods proposed to extract verb relations from corpora. We tried to demonstrate that these standard methods can be divided into two classes: methods using the Harris distributional hypothesis and methods based on point-wise assertions. As ontology learning from texts is not proposing a new challenge to natural language processing (NLP) but is rather pushing the scalability of these methods, we believe that reviewing these traditional methods can help in building better models for this task.
References
1. Jacquemin, C.: Spotting and Discovering Terms through Natural Language Processing. Massachusetts Institute of Technology, Cambridge, Massachusetts, USA (2001)
2. MUC-7: Proceedings of the Seventh Message Understanding Conference (MUC-7). Columbia, MD, Morgan Kaufmann (1997)
3. Pazienza, M.T., ed.: Information Extraction. A Multidisciplinary Approach to an Emerging Information Technology. Number 1299 in LNAI. Springer-Verlag, Heidelberg, Germany (1997)
4. Morin, E.: Extraction de liens sémantiques entre termes à partir de corpus de textes techniques. PhD thesis, Université de Nantes, Faculté des Sciences et des Techniques (1999)
5. Harris, Z.: Distributional structure. In Katz, J.J., Fodor, J.A., eds.: The Philosophy of Linguistics, New York, Oxford University Press (1964)
6. Robison, H.R.: Computer-detectable semantic structures. Information Storage and Retrieval 6 (1970) 273–288
7. Miller, G.A.: WordNet: A lexical database for English. Communications of the ACM 38 (1995) 39–41
8. Glickman, O., Dagan, I.: Identifying lexical paraphrases from a single corpus: A case study for verbs. In: Proceedings of the International Conference Recent Advances in Natural Language Processing (RANLP-2003), Borovets, Bulgaria (2003)
9. Lin, D., Pantel, P.: DIRT – discovery of inference rules from text. In: Proc. of the ACM Conference on Knowledge Discovery and Data Mining (KDD-01), San Francisco, CA (2001)
10. Church, K.W., Hanks, P.: Word association norms, mutual information and lexicography. In: Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics (ACL), Vancouver, Canada (1989)
11. Hearst, M.A.: Automatic acquisition of hyponyms from large text corpora. In: Proceedings of the 15th International Conference on Computational Linguistics (CoLing-92), Nantes, France (1992)
12. Ravichandran, D., Hovy, E.: Learning surface text patterns for a question answering system. In: Proceedings of the 40th ACL Meeting, Philadelphia, Pennsylvania (2002)
13. Szpektor, I., Tanev, H., Dagan, I., Coppola, B.: Scaling web-based acquisition of entailment relations. In: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain (2004)
14. Zanzotto, F.M., Pazienza, M.T., Pennacchiotti, M.: Discovering entailment relations using textual entailment patterns. In: Proc. of the ACL Workshop on Empirical Modeling of Semantic Equivalence and Entailment, Ann Arbor, Michigan, Association for Computational Linguistics (2005) 13–18
15. Chklovski, T., Pantel, P.: VerbOCEAN: Mining the web for fine-grained semantic verb relations. In: Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing, Barcelona, Spain (2004)
16. Glickman, O., Dagan, I., Koppel, M.: Web based probabilistic textual entailment. In: Proceedings of the 1st Pascal Challenge Workshop, Southampton, UK (2005)
17. Resnik, P.: Selection and Information: A Class-Based Approach to Lexical Relationships. PhD thesis, Department of Computer and Information Science, University of Pennsylvania (1993)
Modelling Knowledge with ZDoc for the Purposes of Information Retrieval
H. Zinglé
Université de Nice (F), 98, boulevard Edouard Herriot – BP 3209, 06204 Nice Cedex 3, France
[email protected]
Abstract. The paper presents a tool designed for the purpose of information retrieval. It is based on a syntactic-semantic approach to natural language and exploits the formalism of conceptual graphs for representing the elements of information related to a domain. ZDoc was developed in order to build the lexical, semantic and pragmatic resources used to match graphs on documents. It is thus possible either to locate a given piece of information or to list the elements of information contained in a document or in a document flow.
ZDoc is an NLP tool developed in the area of information retrieval on the basis of the syntactic-semantic approach underlying the linguistic workbench ZStation [7][10], from which it is derived. In short, the ZStation is an attempt to facilitate large-scale development of linguistic resources (dictionaries, ontologies, morphologic and syntagmatic grammars) with a special interest in the interpretation of the meaning of utterances, rather than in the calculation of their formal linguistic structure. Different application topics were also explored in the ZStation, such as word sense disambiguation, interlingual transfer and document retrieval using linguistic and ontological knowledge. Despite its interest for research in linguistics and linguistic engineering, in particular with regard to the modelling and testing of linguistic knowledge, the ZStation proves to be too complex for operational information retrieval. Like the earlier experiment with ZTermino [8], also derived from the ZStation with the intention of helping terminology specialists, ZDoc was designed to give people working in document retrieval on technical and scientific texts access to advanced technology while sparing them an investment in complicated questions of linguistics and knowledge representation.

The underlying principle of ZDoc is that knowledge related to a specific domain (business, for example) can be encoded schematically thanks to information schemes. An information scheme is a set of concepts and relations between concepts which are relevant for this domain. In comparison to document retrieval based on keywords (with or without a thesaurus), information retrieval using information schemes proves to be far more precise. If a document contains the sequence the new company buys computers, it may be accessed through the non-ordered combinations new ~ company, new ~ buy, new ~ computer, company ~ buy, company ~ computer and computer ~ buy, although only three combinations are meaningful. It also seems obvious that queries using the couples company ~ buy and computer ~ buy would
theoretically select information about companies selling computers as well as about companies buying computers, and why not about computers buying or selling companies. One may consider that linguistic analysis could be of great help in such cases. Nevertheless, examples like to buy a company and to buy a computer point out that these sequences have different informative values for an economist, even if both may be reduced to X buys Y on a linguistic level. In a similar way, costs grow and sales grow differ in meaning, as the information is considered negative in the first case and positive in the second one. In addition to these remarks, we also have to take into account that a given piece of information may be expressed in different ways: to raise rating is equivalent to to lift rating, and to keep rating has the same meaning as to maintain rating.

Conceptual graphs [5][6] appear to be a well-suited formalism for representing information schemes. ZDoc uses the technology developed in the ZStation [10] in a simpler way. In short, a conceptual graph is a combination of conceptual nodes (noted in square brackets) connected together by relations (noted in round brackets). A conceptual node is either a concept or a typed concept (or possibly a graph). The symbol ':' is used to associate a concept with the class it belongs to, or to quantify a concept. The information that the dollar rises by 2 percent may be represented by the following graph:

CG-1
[variation : rise]
 ├─(expr)─► [currency : dollar]
 └─(val)─► [percent : @2]

The codification [variation : rise] and [currency : dollar] express respectively that [rise] is a kind of [variation] and that [dollar] is a kind of [currency]. The codification [percent : @2] indicates that 2 is the estimated value of the concept [percent] and represents an extension of the codification of quantification (see footnote 1). To raise rating vs. to lift rating will be represented by the same graph CG-2:

CG-2
[process : make]
 └─(obj)─► [variation : positive]
             └─(expr)─► [rating]

as well as to keep rating vs. to maintain rating by CG-3 below:

CG-3
[process : make]
 └─(obj)─► [variation : null]
             └─(expr)─► [rating]
1. In the standard theory of conceptual graphs, [company] or [company : *] indicates a generic concept (i.e. any company), [company : #1234] or [company : XLtd] an instance or a named instance of the class company (i.e. a given company), [company : {XLtd,YLtd,ZLtd}] an extensive set of companies, [company : {*}] a generic set of companies, [company : {*}@3] a generic subset of 3 companies, [company : {XLtd,YLtd,*}] a generic subset of companies including XLtd and YLtd (XLtd, YLtd and other companies), and [company : {XLtd,YLtd,*}@5] a generic subset of 5 companies including XLtd and YLtd.
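For readers who prefer code to notation, the following sketch shows one possible in-memory representation of such graphs; the class names, fields and rendering format are inventions of this illustration, not part of ZDoc.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Concept:
    ctype: str                       # e.g. "variation", "currency"
    referent: Optional[str] = None   # e.g. "rise", "dollar", "@2"
    relations: List["Relation"] = field(default_factory=list)

@dataclass
class Relation:
    name: str        # e.g. "expr", "val", "obj"
    target: "Concept"

# CG-1: "the dollar rises by 2 percent"
cg1 = Concept("variation", "rise", [
    Relation("expr", Concept("currency", "dollar")),
    Relation("val", Concept("percent", "@2")),
])

def render(node, indent=0):
    # Pretty-print a graph in a bracketed style close to the one used above.
    ref = f" : {node.referent}" if node.referent else ""
    lines = [" " * indent + f"[{node.ctype}{ref}]"]
    for rel in node.relations:
        lines.append(" " * (indent + 2) + f"({rel.name}) ->")
        lines.append(render(rel.target, indent + 6))
    return "\n".join(lines)

print(render(cg1))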
It is clear that the representation of information based on conceptual graphs is not language dependent. For example, CG-2 applies equally to the French combination le gouvernement a relevé les taux and to the English the government raised rating. Therefore we consider that language-dependent knowledge (LDK) has to be clearly distinguished from interlinguistic knowledge (ILK) and that knowledge modelling must be achieved on the basis of domain expertise. Paraphrase is usually a good way to put conceptual nodes and relations into evidence and to avoid confusion between words and underlying concepts.

ZDoc is programmed in PDC-Visual Prolog as a graphic interface with two fundamental actions: 1) the creation of the resources used by the software and 2) the extraction of information, which may apply either to a given document or to a set of documents (document flow). A single interface is provided for the creation and updating of the resources needed for information retrieval, i.e. lexical, semantic and pragmatic resources. Users may browse through the items belonging to each resource and switch from one resource field to another. Experience shows that linguists using this kind of tool need to have an overall approach, as the formalization is split over three areas which have to be kept separate.

Lexical resources are built up as a set of lexical units belonging to a specific knowledge domain. For each lexical unit it is necessary, in the current version, to indicate its lemma, the associated concept and the associated morphosyntactic variables. If one form is associated with different lemmas or concepts, all configurations have to be defined separately. Experience shows that this approach is simpler for users than the use of a morphologic grammar based on specifications as in the ZStation (see footnote 2), as the number of forms is generally reduced. In the new version, which is currently under construction, users only have to input the lemma and its morphosyntactic category and to select the morphologic properties from different paradigms generated by the system. A crucial point here is the selection of the lexical units, as we have to take polysemy into account for the calculation of conceptual graphs. Contrary to what is widely believed, polysemy is not an exception in scientific and technical information. For example, H. Ledouble [4:214] noted that in economic documents from the press agency Reuters the concept of variation of an <economic object> may be expressed in various ways depending on whether the variation concerns an upward or a downward movement:
2. The ZStation uses morphologic prototypes which make it possible to generate all correct inflected forms of a lexical unit with regard to the associated prototype.
Pragmatic resources deal essentially with the definition of information schemes and with the linguistic structures which express them. In other words, it is the area where lexical and semantic knowledge are put in relation so that information takes shape. A dialog box allows users to enter a conceptual graph (with an optional paraphrase in natural language) and its related linguistic representation in natural language. If a conceptual graph may be expressed in different ways, all the related linguistic formulas have to be created. Linguistic structures are matched on graphs through unification over constraints on concepts, lemmas, morphosyntactic variables and values. An important aspect of the calculation of graphs is that we apply here the notion of syntactic-semantic analysis developed in the ZStation [10]. This process does not consist in deriving a semantic structure from the syntactic one, but in directly calculating the relations between concepts within the graph by verifying all the specified formal constraints.

Corpus analysis is very helpful here for determining the different linguistic structures that express a given information scheme. Ledouble [4] used ZText for this purpose, which makes it possible to list all associations for a word (for example all verbs in relation with a word expressing an economic object). An alternative solution is to use the indexing functions of the ZStation [10:101], specifically the conceptual indexation of corpora using predefined concepts. It is thus possible to obtain directly an index and concordances, for example for the concept [currency].

The created resources may be exploited either to search for a given piece of information or to extract information from utterances or from documents. In the second case the analysis process produces a list of information elements represented either by conceptual graphs or by paraphrases in natural language. Interesting applications of ZDoc were proposed in the analysis of French [1] and English [4] economic documents furnished by the press agency Reuters. In particular, Cadel and Ledouble [1][2][3] proposed a common ontology of conceptual classes (economic agent, products, operations, movements ...) and information schemes (expressing variations in turnover, expressing trading, expressing influence ...). This double experiment illustrates, in particular, how intralinguistic knowledge (lexical resources) may be used in combination with interlinguistic knowledge (domain-related ontology) by means of a pragmatic interface matching conceptual graphs onto linguistic structures.
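As a purely illustrative sketch of the kind of pragmatic resource described above (none of these names or structures come from ZDoc itself), the snippet below maps several surface formulas onto a single information scheme, so that to raise rating and to lift rating yield the same result.

SCHEMES = {
    "CG-2: make -> (obj) variation:positive -> (expr) rating": ["raise rating", "lift rating"],
    "CG-3: make -> (obj) variation:null -> (expr) rating": ["keep rating", "maintain rating"],
}

SURFACE_TO_SCHEME = {phrase: scheme
                     for scheme, phrases in SCHEMES.items()
                     for phrase in phrases}

def extract_schemes(text):
    # Return the information schemes whose surface formulas occur in the text.
    lowered = text.lower()
    return [scheme for phrase, scheme in SURFACE_TO_SCHEME.items() if phrase in lowered]

print(extract_schemes("The government decided to lift rating despite the risks."))
# ['CG-2: make -> (obj) variation:positive -> (expr) rating']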
References
1. Cadel P.: Utilisation de connaissances linguistiques dans le cadre du traitement documentaire, analyse syntactico-sémantique de documents économiques. Thèse, Université de Nice (2000)
2. Cadel P., Ledouble H.: Extraction of concepts and multilingual information schemes from French and English economic documents. In: Proceedings of LREC 2000, 1-3 June, Athens (2000) 983-986
3. Ledouble H., Cadel P., Kraif O.: Analysis of combinations between economic objects and types of variation in French and English documents. In: Proceedings of COMLEX 2000, 22-23 Sept., Kato Achaia (2000) 51-54
4. Ledouble H.: Contribution à la traduction de documents économiques anglais, importance de l'extraction et de la représentation du sens. Thèse, Université de Nice (2002)
5. Sowa J.: Conceptual Structures: Information Processing in Mind and Machine. Addison-Wesley (1984)
6. Sowa J.: Knowledge Representation: Logical, Philosophical, and Computational Foundations. Brooks/Cole Publishing Co., Pacific Grove, CA (2000)
7. Zinglé H.: The ZStation workbench and the modelling of linguistic knowledge. In: Current Issues in Mathematical Linguistics, Elsevier-NHLS (1994) 423-432
8. Zinglé H.: Acquisition et traitement de données terminologiques avec ZTermino. In: Travaux du LILLA, n°2, Nice (1997) 7-22
9. Zinglé H.: ZTEXT, un outil pour l'analyse de corpus. In: Travaux du LILLA, n°3, Nice (1998) 69-78
10. Zinglé H.: La modélisation des langues naturelles, aspects théoriques et pratiques. Travaux du LILLA (numéro spécial), Université de Nice (1999)
Partially Ordered Template-Based Matching Algorithm for Financial Time Series
Yin Tang
Lab of E-Commerce, Management College, Jinan Univ., GZ 510632, P.R. China
[email protected]

Abstract. Based on definitions of 1st- and 2nd-order atomic patterns of time series, this paper deduces the n-th order atomic pattern, and the partially ordered relationships among these patterns are discussed. The framework enables a more refined comparison between sequences, based on which we propose the Template-Based Matching algorithm. The experimental results verify its distinct advantages over some similar and classical approaches in both accuracy and performance.

Index Terms: time series, pattern recognition, case-based reasoning, partial order, lattice.
1 Introduction
The futures market is a market of volatility and opportunism, affected by a great many factors like climate, national policy, supply and demand, the economic cycle and currency policy, which place enormous difficulties on extrapolation. The work in this paper tries to address the problem in a pattern-recognition way, under the general assumption that once the same situation happens, the pattern in a financial time series will re-occur. Many approaches to financial time series extrapolation have been proposed, most of which are based on mathematical models or modern transform techniques. Classical approaches to time series include Moving Average [1][2], Exponential Smoothing [3], Box-Jenkins's ARMA and ARIMA [4], Ge and Smyth's Deformable Markov Template [5], Bayesian models [6][7] and some fuzzy set methods [8][9][10]. With the rapid development of computing technology, time series modeling has moved away from purely mathematical approaches and turned to dimensionality reduction and shape modeling: for instance, Discrete Fourier Transformation [11][12][13][14][15]; Dynamic Time Warping [16][17], which enables the comparison between series of different lengths by dynamically computing the minimum cost of a path; and Landmark Similarity [18], which proposes a new data representation that maintains a set of minimal features of the sequence and greatly reduces storage and computation. Branko Pecar's APRE [19] is a similar attempt to extrapolate time series in a pattern-recognition way. As for distance measurement, Euclidean Distance treats the sequences as points in n-dimensional space and then computes their distance, with a complexity of O(n); Goldin and Kanellakis proposed a normalization method [20] which normalizes the mean and the variance in order to facilitate the comparison.
However, these methods have several disadvantages. Some are time-consuming. Some models are either too rough or too rigid and unable to fully reflect market activities. Some of their accounts of similarity measurement are unreasonable or left implicit. A time series consists of a series of variables, each of which stands in a Positive, Negative or Zero relationship with the preceding one. Branko Pecar's APRE [19] uses "P", "N", "Z" to represent the numerical relationship between the current variable and the preceding one, and on this basis performs sequence searching and matching, where the matching criteria include aggregated statistics such as the mean and standard deviation [2]. The approach is relatively rigid: two sequences may be very much alike, yet a tiny difference in a single variable will result in a totally different computation and, further, a totally different conclusion.

Apart from classical modeling approaches, which depend on the assumption that the data follow a certain distribution or a certain kind of mathematical model, the idea we propose in this paper, named Template-Based Matching (TBM), utilizes a set of atomic patterns to match the sequence, based on the fact that similarity is partially ordered. We offer a measuring specification for the partially ordered similarity of series, and provide an algorithm for time series mining and extrapolation. The algorithm works as a reasoning machine and allows predicting a time series based on cases that have already occurred.

The rest of the paper is organized as follows: Section 2 discusses how to define 2nd-order atomic patterns based on 1st-order ones. Section 3 provides an algebraic proof, followed by a distance computing algorithm for sequences, together with a simple example. Section 4 discusses a complete solution for time series searching and extrapolation based on the TBM algorithm. The experimental results are presented in Section 5. We conclude in Section 6.
2 Definition of Atomic Pattern

2.1 1st-Order Atomic Pattern
Definition 1. 1st-order atomic pattern. Let X'{x'i} be the difference series of the time series X{xi}, and define the sequence M{mi} as:

mi = "P" if x'i > 0,  "N" if x'i < 0,  "Z" if x'i = 0, respectively.    (1)
Here, “P”, “N” and “Z” are named 1st-order atomic patterns, and M{mi} is called the sequence of 1st-order basic patterns. As we have seen, 1st-order atomic patterns are simple and rudimentary; they represent the raw series only roughly, with a loss of inherent information that is important in similarity measuring. Not only should the overall shape of a time series be considered: other features like maximum points, peaks and repeated patterns, in which a lot of information is implicated, should be included as well. In our work, we describe not only the relative increase between two variables, but also the acceleration, namely the angle formed by the former, current and latter observations. So, the design should follow these principles: (1) able to
distinguish increase and decrease; (2) able to distinguish positive and negative acceleration in increasing and decreasing; (3) able to distinguish the angle. 2.2 2nd-Order Atomic Pattern Let X’’{x’’i} be the difference sequence of X’{x’i} , namely the 2nd order difference of X{xi}. Sequence M’{m’i} is defined as follows(table 1): Table 1. Definition of 2nd –Order Atomic patterns m'’i
Condition '
'
Explanation
x’’i >0
accelerated increasing
PP+
E
x i >0, x i+1 >0,
PP-
C
x'i >0, x'i+1 >0, x’’i <0
decelerated increasing
PN
A
x'i >0, x'i+1 <0
increasing + decreasing
NN-
K
x'i <0, x'i+1 <0, x’’i >0 '
'
NN+
I
x i <0, x i+1 <0,
NP
V
x'i <0, x'i+1 >0 '
'
x’’i <0
decelerated decreasing accelerated decreasing decreasing +increasing
x’’i =0
increasing
PP0
D
x i >0, x i+1 >0,
NN0
J
x'i <0, x'i+1 <0, x’’i =0
decreasing
ZZ
O
x'i =0, x'i+1 =0
zero+zero
ZP
F
x'i =0, x'i+1 >0
zero + increasing
ZN
H
x'i =0, x'i+1 <0
zero +decreasing
PZ
B
x'i >0, x'i+1 =0
increasing +zero
NZ
L
x'i <0, x'i+1=0
decreasing + zero
Here, m’i is named 2nd-order atomic pattern, M’{m’i} is the series of 2nd order atomic pattern. Please note that m’i(PP+, PP-, NN+, NN-) is for convenient presentation in paper work, which enables explicit association to a specific figure. In programming code, it should be a single symbol, as shown in the second column of the table. In addition, “Z” and “ZZ” are practically rare, they are filled when xi -xi-1 is less than a specified small value. These patterns can bring in good enough result in simple pattern recognition. However, sometime they can not satisfy the requirement of specified applications. As Figure 1 shows, “PZ” and “PN”, “PN” and “ZN” seem to have little discrepancy. When it comes to “PZ” and “NN-”, we have more and more difference. It will cause problem if we neglect these kinds of situation, that is, among these 13 2nd-order pattern, partial
1062
Y. Tang
order similarity relationship, rather than linear, ordinal one like “1<2 and 2<3, then 1<3”, is mattering.
3 Constructing Higher Order Pattern 3.1 Algebraic Proof To provide a solid infrastructure for combination of 3rd and higher order patterns, we start from the definition of 1st order patterns. Let H be the set of 2nd-order atomic patterns, p is a relation on the set. For ∀ x , y ∈ H , x p y is defined as that the included angle of pattern x is less than or equal to the included angle of pattern y. The goal is, hopefully, to prove p is a partially ordered relation on set H, then to further define the operators on set H. the H has its upper bound x ∨ y , the largest included angle of x, y, and its lower bound x ∧ y , the smallest included angle of x, y. Where any subset has a least upper bound and a greatest lower bound, there is a complete lattice, whose product generates complete lattice too. Thus, combining the patterns to generate higher order patterns order by order, more and more refined result will be achieved. For the sake of convenience, we replace these patterns with numerical value, {1,-1, 0,.9,-.9} as shown in Fig. 1. The proof is completely equivalent: (1) DEFINING TWO COMPLETE LATTICES Suppose a set S={-1,-.9,0,.9,1}:
∩ ,∪ ), where p means “greater than or equal to”(≥); For any x,y∈S, x ∪ y = min(x, y), x ∩ y = max(x, y); Complete lattice L (S, p , ∪ , ∩ ), where p means “less than or equal to” (≤); For any x,y∈S, x ∪ y = max(x, y),x ∩ y = min(x, y); Complete lattice L1(S,
p 2,
2
2
2
2
2
2
1
1
1
1
1
1
(2) DEFINING PRODUCT L1×L2 L1×L2=( S×S, p ×, × ×), where(x1, x2)
∩ ,∪
x1≥y1 and x2≤ y2.
p ×(y1,y2) <=>x1 p 1y1 and x2 p 2y2<=>
∪ 和∩ : For any (x , x ), (y , y ) ∈S×S,(x , x )∪ (y , y )=(min(x , y ), max(x , y )),(x , x ) ∩ (y , y )=(max(x , y ),min(x , y )).
Definition 2
2
×
1
×
2
1
×
1
2
2
1
1
2
1
1
2
2
×
1
2
1
1
2
The product of the complete lattices generates complete lattice, therefore L1×L2 is complete lattice. (3) PROOF:
p T, T is still a complete lattice (T, p T,
p × in L1×L2 will be replaced in T with
∩ ,∪ ). T
T
Partially Ordered Template-Based Matching Algorithm for Financial Time Series
ZP (0,1)
1063
NZ NP(-1,1)
NN-
PP+ (.9,1)
NN0
ZZ
PP0
(0,0)
PPNN+
PZ
ZN PN (1,-1)
Fig. 1. Partially Ordered Relation among 2nd-order Atomic patterns
p T is a relation on set T. Reflexivity: for any x T obviously x p T x; Anti-symmetry: if x p T y, namely x p × y, and x ≠ y, then it is impossible for y p T x. Transitivity: if x p T y, y p T z, then x p T z. because x p T y, y p T z, therefore x p × y, y p × z, => x p × z, => x p T z. So,
∈,
(4)PROOF,
∈{-1,-0.9,0,0.9,1} q = min(x , ,x , …, x ) ∈{-1,-0.9,0,0.9,1} p= max(y1, ,y2, …, yn)
∈
1
2
n
If (q,p) T, then (p,q) is the least upper bound of w; If not, suppose q = .9, p = -0.9, iteratively observe (.9,0), (.9,.9), (.9,1); (-1,.9), (-1,0), (-1,.9), (-1,1), see if any of the
1064
Y. Tang
above belongs to T. The first one belongs to T must be the least upper bound of w; Likewise, w has its greatest lower bound. So,
∈T:
∪ : Let q =min(x , x ), p = max(y , y ). If (q,p) T
1
2
1
2
T, then (x1, y1) T
(x2, y2) =(q,p); Otherwise, replace any “-0.9” with “-1” in (q,p), then (x1, y1)
∪ (x , y ) = T
2
2
(q,p);
∩ : Let q = max(x , x ) , p = min(y , y ). If (q,p) ∈T, then (x , y ) ∩ (x , y ) =(q,p); Otherwise, replace any “-0.9” with “-1” in (q,p), then (x , y ) ∩ (x ,
Binary Operator T
2
T
1
2
1
2
2
1
1
1
1
T
2
y2) = (q,p) Thus, the relations among the 2nd-order and higher order atomic patterns will be built up. The building-up of the higher order atomic patterns needs the features of complete lattice of lower order ones. Yet, in the application of a specified order, only the feature of partially ordered set is required. 3.2 Distance Computing Algorithm After building up the similarity measurement of the patterns, we set up a model for sequence matching with assembled patterns. Table 2. Time Series X, Y and their 2nd-order Patterns xi 1130
x'i
x’’i
m’i
yi 1137.5
y' i
y’’i
n’i
σi
1112 1099 1115 1115 1095 1082 1095.5 1083 1081 1081.5 1056.5 1044.5 1044
-18 -13 16 0 -20 -13 13.5 -12.5 -2 0.5 -25 -12 -0.5
5 29 -16 -20 7 26.5 -26 10.5 2.5 -25.5 13 11.5
NNNP PZ ZN NNNNPN NNNP PN NNNN-
1114.5 1102.5 1104.5 1115 1112.5 1092.5 1114 1077.5 1079 1102 1069 1054 1042.5
-23 -12 2 10.5 -2.5 -20 21.5 -36.5 1.5 23 -33 -15 -11.5
11 14 8.5 -13 -17.5 41.5 -58 38 21.5 -56 18 3.5
NNNP PP+ PN NN+ NP PN NP PP+ PN NNNN-
0 0 3 1 2 2 0 0 1 0 0 0
We define the distance of two atomic patterns as total cost of the path from one to the other in the Haas Map, e.g. the distance from “PN” to “PP+” is 4, the distance from “NN+” and “PP-” is unmeasurable, designated to 6(the maximum distance in the Haas Map). Based on the distances, the similarity of two sequences is calculated according to the formula given below:
Partially Ordered Template-Based Matching Algorithm for Financial Time Series
1065
n −1
σ =
∑σ i=2
i
(2)
( n − 2) * 6
Here, σi is the distance of each corresponding pattern pair of two sequences (0≤σi≤6), n is the length of the sequence. Apparently, 0≤σ≤1, when σ=0, two sequences have the same shape; when σ=1, they are totally different. Practically, since there are just 13 2nd-order patterns, it is not necessary to dynamically calculate their partially ordered distance. Nevertheless, we can fill the path costs of these patterns into an array or a list (similarity table). The algorithm is shown below:
Input:
sequence
X,
sequence
Y,
similarity table S Output: σ – the distance of sequence (1) m = Differnce(X) , m’ = Differnce(m) (2) n = Differnce(Y) , n’ = Differnce(n) (3) q = 0 (4) for i =2 to length-1 (5) p = Lookup(S, m’i ,n’i) (6) q = q + p (7) next i (8) return σ = q/((l-2)*6)
3.3 A Simple Case Sequence X,Y and their 2nd-order pattern are shown in Table 2, where σi at the last column is the distance from xi, yi. σ = 0.125, a conclusion can be made that from the undulating figure, sequence X, Y are fairly similar(Figure 2). To make further conclusion, Euclidean Distance can be employed as another criteria.
Fig. 2. Undulating Figure of Sequence X , Y
1066
Y. Tang
3.4 3rd-Order Patterns To step further, we can compound the 2nd-order patterns to higher order, and generate 169 3rd-order patterns, which can further refine the similarity assessment and achieve higher accuracy.
Fig. 3. Extrapolation for Contract 26
∪ ∈
∪
In section 2.2, with the complete lattice L1(T, p T, T, ∩T) and L2(T, p T, ∩T, T), we define L(T×T, p , , ∩), for any (x1, y1),(x2,y2) T×T, there is (x1, x2) p (y1, y2) <=>x1 p Ty1 and x2 p Ty2. L, named 3rd-order atomic pattern, is also a complete lattice, whose similarity can still be measured by their partial order relations. Due to the strict time schedule, we just introduced 2nd-order patterns into the extrapolation experiment so far, which brought rather fine result. Promisingly, more accurate result will be carried out if 3rd-order patterns are employed.
∪
4 Template-Based Matching Algorithm 4.1 Time Series Mining Given time series case base, (1)Pre-processing: decompose the time series with the techniques like differencing, moving average, etc., into long term trend, periodical fluctuation, seasonal fluctuation, random item, and finally have stationary time series; (2)Pattern Transforming: transform the time sequence into sequence of 2nd-order patterns; (3)Pattern Mining: set the length and the interval of the sliding window; transverse the sequence, and then match the sub-sequences in every sliding window; index the similar patterns, whose distances are under a specified threshold, into case base. 4.2 Time Series Matching Given a target time series T, (1) Transform T to sequence of 2nd-order patterns, to generate M, a series of 2nd-order patterns;
Partially Ordered Template-Based Matching Algorithm for Financial Time Series
1067
(2) Set the maximum length lmax and the interval of the sliding window; transverse the sequence; (3) The loop starts from 2 to lmax, match each sub-sequence in the sliding window with the source case in time series case base, find out the most familiar one. 4.3 Time Series Extrapolating After retrieving the time series target cases T’, extrapolate the target sequence to a specified length, we will have the result S. (1) start from the end date of the T’, to retrieve the latter part of the sequence, named S’, the length required is specified by user; (2) Calculate the mean and standard deviation of sequence T’ and T, the ratio of means and standard deviations will be rendered as the adaptors that transforms S’ to S. (3)Apply the adaptors on S’, to have the result S.
5 Experimental Results and Discussion The experimental dataset is from Chinese futures market from Aug 1996 to Aug 1998. The series consists of 400 observation points, and the first 300 points were used as basis for the initial case. Although most of the comparison methods used does not require a training period, to maintain consistency, the forecasts were produced for all the methods staring with the period 301, ending at the period 400. Error of Relative Mean Average Deviation is measured, shown as follows:
RMAD
⎛ n ⎞ ⎜∑ di − di'⎟ i =1 ⎠ = ⎝ n( X ⋅ Y )
2
(3)
where n is the length of forecasts(100 points), X and Y are the average of real values(100 points) respectively, and di, di’ stand for the extrapolated value and real value of these 100 points. The results rendered are given in Table 3. 5.1 Comparative Evaluation We choose ARMA, DES and APRE as comparative experiment. The value of the smoothing constant was set to 0.5 for alpha in DES. The autocorrelation and partial autocorrelation analysis showed that the time series follows ARIMA (2,1,2). We show about 100 days extrapolation result of LME COPPER 03 after the Reference Date, based on daily data of about 300 out of 400 days before the Reference Date (Figure 3). From the comparative experiment, TBM has twice accuracy as APRE, and outperforms ARMA as well. The error’s stable standard deviation also testifies its stability. Due to paper limitation, a part of result for one of the time series (Contract 26) is displayed below:
1068
Y. Tang
From the residual ACF and PACF of Contract 26, the residual is random series, which testifies the validity of TBM. The algorithm outperforms classical approaches with relatively high accuracy and stability. In addition, the experimental result also shows that, with the increase of sequence length, the accuracy stably increases. Table 3. Extrapolation Accuracy with DES, ARIMA, APRE RMAD
DES
ARIMA
APRE
TBM
TS1 TS2 TS3 TS4 TS5 TS6
0.0182350 0.0120593 0.0125862 0.0094224 0.0157452 0.0200363
0.0123326 0.0092322 0.0094235 0.0075823 0.0125874 0.0178742
0.013768 0.009229 0.013449 0.008181 0.015637 0.017143
0.007688 0.006079 0.007895 0.004972 0.009672 0.009575
TS7
0.0100252
0.0073871
0.008984
0.005034
TS8
0.0014047
0.0102546
0.013196
0.008491
TS9
0.0151426
0.0125871
0.015170
0.00873
TS10
0.0142575
0.0098547
0.012412
0.007974
Aver.
0.0128914
0.010912
0.012717
0.007611
Std. Dev.
2.4766E-0 5
8.62E-06
8.27706E -06
2.63E-06
Fig. 4. Residual ACF and PACF for Contract 26 TBM Extrapolating
Partially Ordered Template-Based Matching Algorithm for Financial Time Series
1069
5.2 Limitations The study has several limitations. (1) Our research is based on the assumption that, in financial time series the same shape will reoccur when the same condition happens. There is no solid theoretical basis that makes sure the trend of the target time series case can be the basis of forecasting the source time series. the assumption of reoccurence is best applicable in complete market enviornment, and may be not appropriate for non-complete ones. (2) the work is under an assumption that financial time sereis in a same futures market are totally independent, and have no interrelationship. We haven’t take into account the inter-actional effect among the time series in the financial market. It is well-known that the assumption is not sufficient in real market. (3)The experimental results support our solution of financial time series well, however, the algorithm has not yet applied on other kinds of time series, which needs further observation.
6 Conclusion In this paper, based on atomic patterns like “P”,”N”,”Z” (which are defined as 1st order atomic patterns), the idea defines binary operation among these 1st order atomic patterns, to further generate 13 2nd-order patterns, and then 169 3rd-order patterns. We examine the partial relationship among these patterns. They are assembled to be a template, with which we match the sequence during similarity mining. According to the partial relationship, matching will probably be more discrepancy-tolerance, but still reasonable. This idea can take into account some segments that are not neglectable. The higher the order is, the more accurate, without any loss of matchable sequences. This paper offers the definition of 1st-order, 2nd-order and higher order atomic pattern within an integrated theoretical framework, based on which we propose an n-th order atomic pattern template-based matching algorithm (TBM). A solution for time series extrapolation is discussed. The experimental result has verified its distinct advantages over some similar and classical approaches both in accuracy and performance. This research makes several contributions: (1) the theoretical model enables combination of atomic patterns of specified order. This model can be used as a general framework in similarity assessment in many areas; (2) a fast, simple algorithm is proposed, which facilitates time series extrapolating, with high accuracy and stability. In sum, our algorithm improves the accuracy of time series extrapolating, especially in futures market, with a reasonable speed. The partially ordered similarity framework can be widely used in many similarity measurement occasions.
References [1] N. Crato and B.K. Ray (1996) Model selection and forecasting for long-range dependent processes. Journal of Forecasting, 15,p107-125. [2] Christos Faloutsos, M Ranganathan, Yannis Manolopoulos. Fast Subsequence Matching in Time-Series Database. In Proceedings 1994 ACM SIGMOD Conference, Minneapolis, 1994.
1070
Y. Tang
[3] C.C. Holt. Forecasting Seasonal and Trends by Exponentially Weighted Moving Averages, Carnegie Institute of Technology, Pittsburgh, Pennsylvania, 1957. [4] Box E.P. George & Gwilym M. Jenkins: Time Series Analysis: Forecasting and Control, Holden Day, San Francisco,1970. [5] Ge and Smyth. Deformable Markov model templates for time-series pattern matching [J]. KDD 2000: 81-90. [6] M. West and J. Harrison (1997) Basyesian Forecasting and Dynamic Models, 2nd edn. NewYork: Springer. [7] Jensen V. Finn: An Introduction to Bayesian Networks[M], UCL Press, London, 1996. [8] Altrock von Constantin: Fuzzy Logic and Neuro Fuzzy Applications Explained, Prentice Hall, Englewood Cliffs, 1995. [9] Tsoukalas H. Lefteri & Uhrig E. Robert: Fuzzy and Neural Approaches in Engineering, John Wiley, New York, 1997. [10] Zadeh A. Lotfi, Fu King-Sun, Tanaka Kokichi & Shimura Masamichi (eds): Fuzzy Sets and their Applications to Cognitive and Decision Processes, Academic Press, New York, 1975. [11] Eamonn J. Keogh, Michael J. Pazzani. "A simple dimensionality reduction technique for fast similarity search in large time series databases." In the Fourth Pacific- Asia Conference on Knowledge Discovery and Data Mining. Kyoto, Japan, 2000. [12] Eamonn J. Keogh, Michael J. Pazzani: An Indexing Scheme for Fast Similarity Search in Large Time Series Databases. SSDBM 1999: 56-67. 1998. [13] Eamonn J. Keogh, Kaushik Chakrabarti, Michael J. Pazzani, Sharad Mehrotra: Dimensionality Reduction for Fast Similarity Search in Large Time Series Databases. Knowl. Inf. Syst. 3(3): 263-286 (2001). 2000. [14] R. Agrawal, K I Lin, IIS Sawhney. Shim K. Fast Similarity Search in the Presence of Noise, Scaling, and translation in Time Series Database. Proc 21st Int’l Conf. VLDB, Sept, 1995, 490-500. [15] R. Agrawal, C. Faloutsos, A.Swami: Efficient Similarity Search in Sequence Database. Proc 4th Int’l Conf. Foundations of Data Organization and Algorithms, Oct, 1993, 69-84 [16] Eamonn J. Keogh, Michael J. Pazzani. Scaling up Dynamic Time Warping for Datamining Applications, Knowledge Discovery and Data Mining, in ACM 2000 1-58 [17] Berndt and Clifford. Using Dynamic Time Warping to Find Patterns in Time Series. KDD Workshop 1994: 359-370. [18] Chang-Shing Perng, Haixun Wang, Sylvia R. Zhang, D. Stott Parker. Landmarks- A New Model for Similarity-Based Pattern Querying in Time Series Databases. 16th International Conference on Data Engineering ICDE'2000: 33-42, 2000 [19] Branko Pecar. Case-based Algorithm for Pattern Recognition and Extrapolation [C]. The 7th UK Case-Based Reasoning Workshop, In 22nd SGAI Int’l Conf. on Knowledge Based Systems and Applied Artificial Intelligence, Cambridge, 10-12 December 2002 [20] Goldin, Kanellakis. On Similarity Queries for Time Series Data: Constraint Specification and Implementation. CP 1995, pp137-153. [21] Davood Rafiei,Alberto O. Mendelson. Querying Time Series Data Based on Similarity. In IEEE TRANS. ON KNOWLEDGE AND DATA ENG., VOL. 12, NO.5, SEP/OCT 2000. [22] Betty B. Xia. Similarity Search in Time Series Data Sets, M.Sc. thesis, Computing Science, Simon Fraser University, December 1997.
Model and Algebra for Genetic Information of Data* Deyou Tang1,2, Jianqing Xi1,**, and Yubin Guo1 1
School of Computer Science and Engineering, South China University of Technology, Guangzhou, 510641, China 2 Department of Computer Science & Technology, Hunan Polytechnic University, Zhuzhou, 412008, China [email protected], [email protected], [email protected]
Abstract. The genetic information of data is the evolution information about data in its lifecycle, and it performs heredity and mutation with the diffusion of data. Managing the genetic information of data has significance to the information discrimination, auditing, tracing and lookup. This paper presents Data Genome Model (DGM) and its algebra to manage the genetic information of data in its lifecycle according to the gene of life. A structure named data genome is used to carry the genetic information of data and querying algebras are defined to manipulate the genetic information.
1 Introduction Under the environment of complex networks, data is exchanged frequently. In this process, huge amounts of information about the evolution are created. E.g. when data is published, information hiding behind this operation includes the publisher, publishing time, etc. This information has great significance to the incremental utilization of data. Suppose some unhealthy pictures are disseminated in a pure p2p network such as Gnutella[1]. To punish the badman who issues the resource initially, related department has to take long time and much human resource to identify the source step by step. The problem is that no evolution information can be retrieved from the data itself. However, if each item shared in Gnutella has some adjunctive information about the evolution of data such as publisher, propagation path, etc, intelligence agent can do their job through analyzing this information, thus saves much expense. Unfortunately, current data models usually focus on the content or the external characteristic of data, and little attention has been paid to the evolution of data. Metadata are the data about data in applications. But metadata only expresses the static properties of data, and it doesn’t describe the evolution of data. The concept of “temporal” is used in temporal database to describe history information about data [2]. TXPath model [3] extended the XPath data model to model the history of semi-structured data. The GEM model [4] is a graphical semi-structured *
This work is supported by the China Ministry of Education (Grant NoD4109029), Guangdong High-Tech Program (G03B2040770), and Guangdong Natural Science Foundation (B6480598). ** Correspondence author. M. Ali and R. Dapoigny (Eds.): IEA/AIE 2006, LNAI 4031, pp. 1071 – 1079, 2006. © Springer-Verlag Berlin Heidelberg 2006
1072
D. Tang, J. Xi, and Y. Guo
temporal data model. These temporal data models use the history information of data effectively, but they don’t express the relevancy of different data. Adoc[5] is a new programming model targeted to collaborative business process integration applications. The concept of “smart distance” was proposed to interweave various dynamic and heterogeneous elements of information system on the base of Adoc[6]. ABO was presented to model the lifecycle of the business entity [7]. However, those methods don’t model the evolution of data. Information Lifecycle Management (ILM) aims at improving system resource utilization and maximizing information value automatically for fixed-content reference data [8][9]. Unfortunately, ILM doesn’t take into account the value of evolution information created in data lifecycle. In this paper, we present Data Genome Model (DGM) to manage the information created in the evolution of data, which is called genetic information. A series of concepts like data gene, data gene sequence and data genome are introduced to carry the genetic information hierarchically. Data genome algebras are presented to query the genetic information stored in data genomes.
2 Data Genome Model In biology, the genetic information of life determines the variety and lineage of species, and a genome is the complete set of genetic information that an organism possesses. The genetic information of parents transmits to offspring along with the evolution of life, and different species have different genetic information. In the evolution of data, information about the evolutionary process are created, including the creator, creation time and descriptive keywords when data is created, the publisher, publishing time and receivers when data is published, the correlation with other data. This kind of information closely correlates with data, and it is created when data is created and deleted when data is deleted. At the same time, it changes with the evolution of data and is different for different phase in the lifecycle of data. Furthermore, it will be transmitted to other data partially or fully in the evolutionary process. As information created in the evolution of data has the above features, we think data has the genetic information too [10]. In this paper, we call this information the genetic information of data and put forward a concept called data genome to carry the genetic information of data with genome in biology as reference. Suppose A is an attribute name of data or operation, and V is its value. We use the concept of data gene, data gene sequence and data genome to carry genetic information for data hierachically. Data gene carries a piece of genetic information for data and describes some characteristics of data at the given condition. Data gene sequence carries the genetic information of a “segment” in the evolution of data. Data genome is “all-sided” to represent the evolution of data and carries genetic information for the whole data. Definition 1 (Data gene). A data gene fragment F is a 2-tuple: F = ( A,V ) , data gene is the positive closure of F: D = F + = ( A,V ) + . Data gene fragment is the minimal unit of genetic information, and the selection of attributes and their domain depends on the situation of applications. Data gene can
Model and Algebra for Genetic Information of Data
1073
describe any characteristics of data, for example, it may be a description of browse operation or the metadata at the given time. For any data, a whole impression about data at any moment is encoded in the leading data gene. Definition 2 (Leading data gene). A leading data gene carries the instant information about data as a whole, denoted as Da. Da is created when data is created, it holds the global information about data like identifier, title, creator, and so on. Definition 3 (Data gene sequence). Data gene sequence is a 2-tuple, S=<$sid, D+>, where $sid is the identifier of the sequence, D+ is the positive closure of data genes that describes the same characteristics of data. Usually, we need many data gene sequences to describe the genetic information of data perfectly, and each data gene sequence encodes part of genetic information. The most important data gene sequence is the main data gene sequence. Definition 4 (Main data gene sequence). A main data gene sequence Sm is a data gene sequence, which describes the global characteristics of data and its evolution. Sm includes the leading data gene and data genes carrying the history of those data gene fragments in the leading data gene. Definition 5 (Data genome). Data genome DG is a 3-tuple: < $ gid , S + , DG′* > , where $gid is the identifier of data genome, S+ is a positive closure of data gene sequence, DG′ is a component data genome of DG and DG′* is a Klein closure of DG′ . In definition 5, $gid must be unique globally. S+ has at least one data gene sequence, Sm, and each sequence in it must have different identifer. The component data genome set DG′* carries genetic information for partial data retrieved from other data sources. Component data genome DG′ is a data genome mutated from the data genome of correlated data. In addition, data gene sequences in DG′ have main data gene sequence only and the main data gene sequence has a leading data gene only, so do the component data genomes in DG′ . Furthermore, for each data genome in DG′* , there is a sequence in S+ to record the correlation, whose identifier is equal to the identifier of the component data genome. Figure 1 is a referenced implementation of data genome for the text file. It is composed of six data gene sequences and two component data genomes. Sequence $gid1 (Herein, each sequence mentioned belongs to data genome dg) is composed of a leading data gene and two ordinary data genes. From the leading data gene in sequence $gid1, we can get a whole impression of Test. The other two data genes in sequence $gid1 denote the history of the corresponding data gene fragments in the leading data gene. Sequence $sid3 carries the genetic information about the history of being edited for data. Data genes in sequence $sid6 indicate that Tom forwarded this data to Mike and Jane on 06/18/2005 and Mike forwarded this data to Jim on 06/19/2005. Data gene in sequence $sid7 indicates that Jim browsed this data on 06/20/2005. Sequence $gid2 is in accordance with component data genome dg2. The only data gene in sequence $gid2 denotes that Tom correlated the component data genome to data genome $gid01 on 06/15/2005. Data gene sequence $sid5 carries the information of operating on the
Fig. 1. The structure of data gene, data gene sequence and data genome. (Dashed lines explain the structure of each data gene; for simplicity, the data genes and the $sid of each sequence in each data genome are displayed as a table, with different background colors distinguishing them.)
Component data genome dg2 is a sub data genome of the data genome of the data Test0, which means that part of the genetic information of Test0 is transmitted to Test.

Proposition 1. Each main data gene sequence has a unique leading data gene. The leading data gene carries genetic information about the data resembling metadata. Although all component genomes have their own leading data genes, these are local; only one leading data gene belongs to the data as a whole. Usually, this gene has the most complex structure of any data gene in a data genome.

Proposition 2. Each data genome has a unique main data gene sequence. The main data gene sequence records the global information of the data and its evolution. By comparison, the main data gene sequence of each component data genome carries the genetic information of only part of the data.

Proposition 3. Each data item has a unique data genome to carry its genetic information, and the data genome can be represented as a tree with the data genome as the root, the data gene sequences as leaf nodes and the component data genomes as branch nodes. The simplest data genome has only two nodes, the root and a leaf, when represented as a tree. Once data correlates to other data, a data genome is inserted into the component data genome set. Since each data genome has a unique main data gene sequence, we also use the identifier of the data genome to identify the main data gene sequence.

The definitions and propositions presented above give a framework for managing the adjunct information of data created in its evolution. Since the type of data may vary
from structured (e.g., relational data) to semi-structured (e.g., XML, code collections) to unstructured (e.g., word-processing documents), we leave the modeling of the behaviour of each type of data to users and provide the framework only.
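For concreteness, the structures of Definitions 1-5 could be represented roughly as follows. This is an illustrative sketch only, not part of the DGM proposal; all class and field names are our own.

```python
# Illustrative sketch of Definitions 1-5; names are assumptions, not from the paper.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class DataGene:
    """A data gene: a non-empty set of (attribute, value) fragments."""
    fragments: Dict[str, str]  # attribute name -> value

@dataclass
class DataGeneSequence:
    """A data gene sequence: an identifier plus one or more data genes."""
    sid: str
    genes: List[DataGene] = field(default_factory=list)

@dataclass
class DataGenome:
    """A data genome: identifier, data gene sequences (including the main one),
    and component data genomes mutated from correlated data."""
    gid: str
    sequences: List[DataGeneSequence] = field(default_factory=list)
    components: List["DataGenome"] = field(default_factory=list)

# Example: a genome whose main sequence holds the leading data gene of Fig. 1.
leading = DataGene({"Identifier": "001", "Title": "Test", "Creator": "Tom"})
main_seq = DataGeneSequence(sid="$gid1", genes=[leading])
dg = DataGenome(gid="$gid1", sequences=[main_seq])
```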
3 Data Genome Algebra Operations

The data genome algebra includes eight operators: union, intersection, difference, select, project, iterate, filter and join, which in turn are the basis for designing a genetic information query language. As the data gene is the basic operational unit in DGM, the data genome algebra operations are implemented on the basis of the algebra operations for data genes, which include union, intersection, difference and project.

We first introduce some symbols used in the remainder of this paper. Let G be the set of data genes, with g, g_i, g_j ∈ G. Let DG be the set of data genomes, with dg, dg_i, dg_j, dg′ ∈ DG. ID(s) denotes the identifier of data gene sequence s, and T(s) denotes the set of data genes of s. S(dg) denotes the set of data gene sequences in dg, Sm(dg) denotes the main data gene sequence of dg, and ψ(dg) denotes the component data genome set of dg. A(g) denotes the attribute set of data gene g, and μ_a(f) denotes the attribute of data gene fragment f.

Definition 6 (Algebra operations for data gene).

Union, ∪:         g_i ∪ g_j = { f | f ∈ g_i ∨ f ∈ g_j }
Intersection, ∩:  g_i ∩ g_j = { f | f ∈ g_i ∧ f ∈ g_j }
Difference, −:    g_i − g_j = { f | f ∈ g_i ∧ f ∉ g_j }
Project, π:       π_{A′}(g) = { f | f ∈ g ∧ μ_a(f) ∈ A′ }
Obviously, the result of union, intersection and difference is a set of data gene fragments. Besides the algebra operations for data genes, the following two operations are also needed as the basis of the algebra operations for data genomes.

Definition 7 (Set operations of the main data gene sequence). Suppose s_i and s_j are different main data gene sequences, and let I(s) denote the leading data gene of the main data gene sequence s. Then

s′ = s_i ∗ s_j = ⟨$sid, T(s′)⟩, where ∗ ∈ {∪, ∩, −} and
  I(s′) = I(s_i) ∗ I(s_j)
  T(s′) = ((T(s_i) − {I(s_i)}) ∗ (T(s_j) − {I(s_j)})) ∪ {I(s′)}
Definition 8 (Concatenation of data gene sequences). Suppose s_i and s_j are data gene sequences and f is the concatenation condition. Then

s′ = s_i ∞_f s_j = ⟨$sid, T(s′)⟩, where T(s′) = ⋃ { g_i ∪ g_j | g_i ∈ T(s_i) ∧ g_j ∈ T(s_j) ∧ f(g_i, g_j) = 1 }
The set operations on the main data gene sequence are defined for the set operations of data genomes, while the concatenation operation is defined for the join operation of data genomes.

Definition 9 (Algebra operations for data genome). Let s′ = Sm(dg′), s_i = Sm(dg_i) and s_j = Sm(dg_j).

Union, ∪: dg′ = dg_i ∪ dg_j = ⟨$gid′, S(dg′), ψ(dg′)⟩, where
  s′ = s_i ∪ s_j
  S(dg′) = (S(dg_i) − {s_i}) ∪ (S(dg_j) − {s_j}) ∪ {s′}
  ψ(dg′) = ψ(dg_i) ∪ ψ(dg_j)

Intersection, ∩: dg′ = dg_i ∩ dg_j = ⟨$gid′, S(dg′), ψ(dg′)⟩, where
  s′ = s_i ∩ s_j
  S(dg′) = ((S(dg_i) − {s_i}) ∩ (S(dg_j) − {s_j})) ∪ {s′}
  ψ(dg′) = ψ(dg_i) ∩ ψ(dg_j)

Difference, −: dg′ = dg_i − dg_j = ⟨$gid′, S(dg′), ψ(dg′)⟩, where
  s′ = s_i − s_j
  S(dg′) = ((S(dg_i) − {s_i}) − (S(dg_j) − {s_j})) ∪ {s′}
  ψ(dg′) = ψ(dg_i) − ψ(dg_j)

Select, δ: δ_sc(dg) = dg′ = ⟨$gid′, S(dg′), ∅⟩, where S(dg′) = { s | s ∈ S(dg) ∧ sc(s) = true }

Project, π: π_{A′}(dg) = dg′ = ⟨$gid′, S(dg′), ∅⟩, where S(dg′) = { s′ | s′ = { π_{A′}(g) | g ∈ s ∧ A′ ⊆ A(g) } ∧ s ∈ S(dg) }

Iterate, D: D_{a[max-iteration]}(dg) = dg′ = ⟨$gid′, {s′}, ∅⟩, where
  T(s′) = { g | g ∈ ⋃ s ∧ s ∈ S(dg) ∧ a ∈ A(g) } ∪ ⋃ { S(D_{a[max-iteration−1]}(dg″)) | dg″ ∈ ψ(dg) }

Filter, ε: ε_f(dg) = ⋃ { dg″ | dg″ ∈ ψ(dg) ∧ f(dg″) = 1 }

Join, ⋈: dg′ = dg_i ⋈_f dg_j = ⟨$gid′, S(dg′), ψ(dg′)⟩, where
  S(dg′) = ⋃ { s_1 ∞_f s_2 | s_1 ∈ S(dg_i) ∧ s_2 ∈ S(dg_j) }
  ψ(dg′) = ψ(dg_i) ∪ ψ(dg_j)
The result of each operator above is a temporary data genome. The operator δ selects, from the data gene sequence set of dg, the data gene sequences that satisfy the condition sc. Generally, we use this operator to select one data gene sequence from the data genome by specifying an appropriate condition. If no condition is specified, the operation returns a data genome without any component data genome. The operator π retains, for each data gene sequence in S(dg), the data genes whose attribute set is a superset of the given attribute set, removes the redundant attributes and eliminates repeated data genes. If no attribute set is specified, this operator returns a null data genome. The operator D recursively traverses the specified attribute set a in the data genome. If no iteration limit is specified, the operation traverses the whole data genome; if the iteration limit is zero, the temporary data genome has one data gene sequence consisting of the data genes in S(dg) with the given attributes. The operator ε selects component data genomes from the data genome according to the given filter condition f. If no condition is specified, this operator returns the union of all component data genomes in ψ(dg). In the join operation, f is a formula specifying the join predicate, given in conjunctive form. Usually it has the form c_1 ∧ c_2 ∧ ... ∧ c_n, where each c_i (i = 1..n) has the generic form A_1 ⊗ A_2, with A_1 and A_2 being attributes belonging to dg_i and dg_j respectively and ⊗ a comparison operator. If ⊗ is "=", the operation is an equi-join.

Suppose the data genome structure for documents is defined as the data genome in Figure 1. We now give some examples of queries.

Example 1. To query the edit history in 2005, including the operator and operation time:
π_{Editor, Action, Date}(δ_{Date>12/31/2004 ∧ Date<01/01/2006}(dg))

Example 2. To list all data sources related to Test:

π_{Title}(D_{Identifier, Title}(dg))

Example 3. To list the creator and creation time of the first correlated data of Test:

π_{Creator, Time}(ε_{$gid=$gid2}(dg))

Example 4. To list the correlated data common to both:

π_{Title}(D_{Identifier, Title}(dg_1)) ∩ π_{Title}(D_{Identifier, Title}(dg_2))
Example 5. To list the edit history of T1 (suppose its data genome is dg_1) in which the operator is the creator of T2 (suppose its data genome is dg_2):

π_{Editor, Action, Date}(dg_1 ⋈_{dg_1.Editor = dg_2.Creator} dg_2)
Definition 9 presents the data genome algebra to support the querying of genetic information. In fact, we can organize data genomes into a data genome base and perform data mining on it. For example, if we organize data genomes with the same data identifier retrieved from a P2P network into a data genome base, we can mine the number of contributors, track the distribution, trace the propagation path of each copy, and so on.
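As a further illustration, the select (δ) and project (π) operators of Definition 9 could be realised over the structure sketch given earlier in Section 2 roughly as follows; the helper names and the simplified handling of component data genomes are assumptions, not the authors' implementation.

```python
# Hypothetical realisation of select (delta) and project (pi) over the
# DataGenome sketch above; component data genomes are ignored for brevity.

def select(dg, sc):
    """delta_sc(dg): keep the data gene sequences satisfying predicate sc."""
    kept = [s for s in dg.sequences if sc(s)]
    return DataGenome(gid=dg.gid + "_tmp", sequences=kept, components=[])

def project(dg, attrs):
    """pi_A'(dg): keep genes whose attribute set contains attrs, restricted
    to those attributes, with duplicate genes removed."""
    attrs = set(attrs)
    new_seqs = []
    for s in dg.sequences:
        genes, seen = [], set()
        for g in s.genes:
            if attrs <= set(g.fragments):
                key = tuple(sorted((a, g.fragments[a]) for a in attrs))
                if key not in seen:
                    seen.add(key)
                    genes.append(DataGene({a: g.fragments[a] for a in attrs}))
        new_seqs.append(DataGeneSequence(sid=s.sid, genes=genes))
    return DataGenome(gid=dg.gid + "_tmp", sequences=new_seqs, components=[])

# Roughly in the spirit of Example 1: editors and dates from the edit-history
# sequence (here assumed to be the sequence with identifier "$sid3").
# edits = project(select(dg, lambda s: s.sid == "$sid3"), ["Editor", "Date"])
```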
4 Usages and Applications of DGM

The motivation of DGM is to manage the information created in the evolution of data. Due to space limits, this paper cannot elaborate on the manipulation of data genomes or the simulation of their evolution alongside the evolution of data. The details will be reported in other papers; here we give only a brief introduction to its usages and list some applications.

A data genome can be attached to any type of data and applied in any information system, provided the information system supports DGM. For example, when we create a text file in an editor, the editor should create a data genome for the file accordingly. Once the editor saves the data, a data gene carrying the operation information is inserted into the corresponding data genome (such as D4 in Figure 1). Moreover, if the file (suppose its data genome is dg) correlates to another file (suppose its data genome is dg′), the editor should mutate dg′ into a component data genome of dg and insert a new data gene sequence, whose identifier equals the identifier of the main data gene sequence of dg′, into the data gene sequence set of dg to express the correlation (such as dg2 and $gid2 in Figure 1).

DGM can be used in information tracking, data mining, information integration, the semantic web, and so on. Taking information tracking as an example, we can find the source and the propagation path of data instantly using the algebra defined in the previous section or special tools based on the data genome. For instance, we can use π_{Forwarder, Acceptor}(dg) to extract the forwarders and their acceptors in Figure 1 and construct a propagation graph from the result, as shown in Figure 2. From Figure 2, we know that Tom is the creator of the file Test, that he transmits the file to Mike and Jane, and that Mike subsequently transmits the file to Jim.
Fig. 2. A propagation graph constructed by Jim
5 Conclusions and Future Works

This paper presents DGM and its algebra to model and manipulate the evolution of data. The research on DGM is just under way, and no parallel work has been found yet. The model may not be perfect, and much work remains to improve it. Currently, we are working on the programming interfaces of the data genome and implementing a file exchange system that uses DGM as the basis for tracking the lineage of the different files appearing in the system. The programming interfaces of the data genome include manipulation interfaces and query interfaces. The manipulation interfaces include creating a data genome for any type of data, evolving a data genome to synchronize with the evolution of the data, and so on. Although we present the algebra of the data genome in this paper, we believe a friendly programming interface is also necessary. In addition, our work can be extended in several ways. First, as we have only given a conceptual model for the genetic information of data so far, we plan to design a storage model for DGM in the near future. Second, since the number of data gene sequences in a data genome differs for different types of data such as text files, applications and databases, we need to study the metadata of the data genome for each type of data. Third, beyond the query language, we need to provide tools to mine the value of genetic information, such as an information tracking system based on DGM and an information retrieval system with data genomes. Moreover, extending DGM to support services is also a prospective direction.
References

1. Gnutella. http://gnutella.wego.com
2. Tansel, A., Clifford, J., Gadia, S., Jajodia, S., Segev, A., Snodgrass, R.T. (eds.): Temporal Databases: Theory, Design, and Implementation. Benjamin/Cummings (1994)
3. http://www.w3.org/TR/xpath20/
4. Combi, C., Oliboni, B., Quintarelli, E.: A Graph-Based Data Model to Represent Transaction Time in Semistructured Data. DEXA 2004: 559-568
5. Kumaran, S., Nandi, P., Heath, T., Bhaskaran, K., Das, R.: ADoc-Oriented Programming. SAINT 2003: 334-343
6. Ye, Y.M., Nandi, P., Kumaran, S.: Smart Distance for Information Systems: The Concept. IEEE Computational Intelligence Bulletin 2(1) (2003) 25-30
7. Nandi, P., Kumaran, S.: Adaptive Business Objects - A New Component Model for Business Integration. ICEIS (3) (2005) 179-188
8. Reiner, D., Press, G., Lenaghan, M., Barta, D., Urmston, R.: Information Lifecycle Management: The EMC Perspective. Proceedings of the 20th International Conference on Data Engineering (2004) 804-807
9. Chen, Y.: Information Valuation for Information Lifecycle Management. Proceedings of the Second International Conference on Autonomic Computing (ICAC'05) (2005) 135-146
10. Xi, J.Q., Tang, D.Y., Guo, Y.B.: Data Gene: The Genetic Information Carrier of Data. Computer Engineering (in Chinese, in press)
Forecasting Intermittent Demand by Fuzzy Support Vector Machines

Yukun Bao, Hua Zou, and Zhitao Liu

Department of Management Science and Information System, School of Management, Huazhong University of Science and Technology, Wuhan 430074, China
[email protected]
Abstract. Intermittent demand appears at random, with many time periods having no demand, which is probably the biggest challenge in the repair and overhaul industry. Exponential smoothing is commonly used when dealing with this kind of demand, and improved methods based on it, such as the Croston method, have been studied. This paper proposes a novel method to forecast intermittent parts demand based on fuzzy support vector machines (FSVM) in regression. Details on data clustering, performance criteria design and kernel function selection are presented, and an experimental result is given to show the method's validity.
1 Introduction
A fundamental aspect of supply chain management is accurate demand forecasting. We address the problem of forecasting intermittent (or irregular) demand. Intermittent demand appears at random, with many time periods having no demand [1]. Moreover, when a demand occurs, the request is sometimes for more than a single unit. Items with intermittent demand include service (spare) parts and high-priced capital goods, such as heavy machinery. Such items are often described as "slow moving". Demand that is intermittent is often also "lumpy", meaning that there is great variability among the nonzero values. An example of the difference between intermittent demand data and product demand data that is normal, or "smooth", is illustrated in the tables below:

Table 1. Intermittent Demand Data
Month   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
Demand  0  0 19  0  0  0  4 18 17  0  0  0  0  0  3  0  0

Table 2. Normal, Smooth Demand Data
Month   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
Demand 17 20 18 25 30 68 70 41 32 35 66 26 23 25 25 28 36
Intermittent demand creates significant problems in the manufacturing and supply environment as far as forecasting and inventory control are concerned. It is not only the variability of the demand size, but also the variability of the demand pattern, that makes intermittent demand so difficult to forecast [2]. The literature, which includes a relatively small number of proposed forecasting solutions to this demand uncertainty problem, can be found in [3-6]. Single exponential smoothing and the Croston method are the most frequently used methods for forecasting low and intermittent demand [4,6]. Croston [4] proposed a method that builds demand estimates taking into account both the demand size and the interval between demand incidences. Despite the theoretical superiority of such an estimation procedure, empirical evidence suggests modest gains in performance when compared with simpler forecasting techniques; some evidence even suggests losses in performance. On the other hand, Bartezzaghi et al. [3] found in their experimental simulation that EWMA appears applicable only with low levels of lumpiness. Willemain et al. [6] concluded that the Croston method is significantly superior to exponential smoothing under intermittent demand conditions. Recently, support vector machines (SVMs) were developed by Vapnik and his co-workers [7,8]. With the introduction of Vapnik's ε-insensitive loss function, SVMs have been extended to solve non-linear regression estimation problems, and they have been shown to exhibit excellent performance [7,8]. Since SVMs are formulated for two-class problems, some input points may not be exactly assigned to one of the two classes. Some points are more important and should be fully assigned to one class so that the SVM can separate them more correctly, while some data points corrupted by noise are less meaningful and should be discarded; the standard SVM lacks this ability. To solve this problem, fuzzy support vector machines (FSVM) apply a fuzzy membership to each input point of the SVM, so that different input points can make different contributions to the learning of the decision surface; this enhances the SVM by reducing the effect of outliers and noise in the data points [9-11]. Our research focuses on the application of FSVM in regression as a new attempt at a forecasting method for intermittent demand. The results of the experiment indicate that FSVM is effective in improving the accuracy of intermittent demand forecasting compared with the Croston method, which has been the most widely used method for intermittent demand forecasting. This paper consists of five sections. Section 2 reviews the most widely used approaches for forecasting intermittent demand and indicates their limitations and the direction of further improvement. The general principles of FSVM and its application in regression are presented in Section 3, together with the general procedure for applying it. Section 4 presents an experiment concerned with the detailed procedure of employing FSVM in regression, involving data set selection, data preprocessing and clustering, kernel function selection and so on. Conclusions and hints for further research are included in the last section.
2 Reviews on Forecasting Intermittent Demand Methods
Generally, efforts on forecasting intermittent demand fall into two categories: one is to estimate the demand distribution, and the other is time series forecasting.

2.1 Demand Distribution Estimation
The inventory control method proposed here relies on the estimation of the distribution of demand for low demand parts. It is necessary to use demand distributions rather than expected values of demand because the intermittent patterns characteristic of low demand parts require a probabilistic approach to forecasting that can be incorporated into an inventory management program. Using the demand distributions, it is possible to manage the inventory to maximize readiness given a fixed inventory budget. Other optimization goals are possible as well. The chief obstacle to reliable demand distribution estimation is the paucity of historical data available for any typical set of low demand parts. Demand typically occurs in small integer numbers of parts. Some parts may have only 3 or 4 non-zero demands among a large number of zero demand periods. This amount of data is not sufficient to construct a robust probability demand distribution. If a probabilistic model of the demand is available, such as a Weibull or Poisson model, then it is possible to estimate the demand distribution directly from the model. If an empirical estimate of the demand distribution must be derived, through bootstrapping or other means, it is necessary to group the data in a way that generates enough non-zero data points to produce robust demand distribution estimates.

2.2 Croston Method
The Croston method falls into the time series forecasting category and is the most widely used method; it can be described as follows. Let Y_t be the demand occurring during time period t and X_t be the indicator variable for non-zero demand periods; i.e., X_t = 1 when demand occurs at time period t and X_t = 0 when no demand occurs. Furthermore, let j_t be the number of periods with nonzero demand during the interval [0, t], i.e., j_t = Σ_{i=1}^{t} X_i, so that j_t is the index of the non-zero demand. For ease of notation, we will usually drop the subscript t on j. Then we let Y_j* represent the size of the j-th non-zero demand and Q_j the inter-arrival time between Y_{j−1}* and Y_j*. Using this notation, we can write Y_t = X_t Y_j*. The Croston method separately forecasts the non-zero demand size and the inter-arrival time between successive demands using simple exponential smoothing (SES), with forecasts being updated only after demand occurrences. Let Z_j and P_j be the forecasts of the (j + 1)-th demand size and inter-arrival time respectively, based on data up to demand j. Then the Croston method gives
Z_j = (1 − α) Z_{j−1} + α Y_j*                                   (1)

P_j = (1 − α) P_{j−1} + α Q_j                                    (2)

The smoothing parameter α takes values between 0 and 1 and is assumed to be the same for both Y_j* and Q_j. Let l = j_n denote the last period of demand. Then the mean demand rate, which is used as the h-step ahead forecast of the demand at time n + h, is estimated by the ratio

Ŷ_{n+h} = Z_l / P_l                                              (3)
The implicit assumptions of the Croston method are that (1) the distribution of non-zero demand sizes Y_j* is iid normal; (2) the distribution of inter-arrival times Q_j is iid geometric; and (3) demand sizes Y_j* and inter-arrival times Q_j are mutually independent. These assumptions are clearly incorrect, as the assumption of iid data would result in using the simple mean as the forecast, rather than simple exponential smoothing, for both processes. This is the basic reason why further corrections and modifications of the Croston method have been proposed.
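For concreteness, the update rules of Eqs. (1)-(3) can be sketched as follows; this is an illustrative transcription, not the original implementation, and the function and variable names are ours.

```python
# Minimal sketch of the Croston method, Eqs. (1)-(3).

def croston_forecast(demand, alpha=0.1):
    """Return the one-step-ahead mean demand rate Z_l / P_l.

    demand: list of per-period demands (zeros allowed).
    alpha:  smoothing parameter, 0 < alpha < 1.
    """
    z = None       # smoothed non-zero demand size, Z_j
    p = None       # smoothed inter-arrival time, P_j
    q = 1          # periods since the last non-zero demand, Q_j
    for y in demand:
        if y > 0:
            if z is None:                         # initialise on first demand
                z, p = float(y), float(q)
            else:
                z = (1 - alpha) * z + alpha * y   # Eq. (1)
                p = (1 - alpha) * p + alpha * q   # Eq. (2)
            q = 1
        else:
            q += 1
    return None if z is None else z / p           # Eq. (3)

# The intermittent series of Table 1:
print(croston_forecast([0, 0, 19, 0, 0, 0, 4, 18, 17, 0, 0, 0, 0, 0, 3, 0, 0]))
```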
3 FSVM for Forecasting
In this section we briefly review the idea and formulation of FSVM for the regression problem. In regression, the effects of the training points differ: some training points are more important than others, so we attach a fuzzy membership 0 ≤ s_i ≤ 1 to each training point x_i. This fuzzy membership s_i can be regarded as the attitude of the corresponding training point toward the mapping function, and the value (1 − s_i) can be regarded as the degree of meaninglessness. As a result, the traditional SVM is extended to the FSVM. Given a set S of data points with associated fuzzy memberships

S = { (x_1, y_1, s_1), ..., (x_n, y_n, s_n) }                                   (4)

where x_i ∈ R^n is the input vector, y_i ∈ R is the desired value, and σ ≤ s_i ≤ 1 is a fuzzy membership with i = 1, ..., n and sufficiently small σ > 0, the FSVM regression solves the optimization problem

min_{ω, b, ξ, ξ*}  C Σ_{i=1}^{n} s_i (ξ_i + ξ_i*) + (1/2) ω^T ω
subject to  y_i − ω^T φ(x_i) − b ≤ ε + ξ_i*,  ξ_i* ≥ 0
            ω^T φ(x_i) + b − y_i ≤ ε + ξ_i,   ξ_i ≥ 0
            i = 1, ..., n                                                       (5)

where φ maps the input space x into a high-dimensional feature space and ξ_i is the upper training error (ξ_i* the lower), subject to the ε-insensitive tube

|y − (ω^T φ(x) + b)| ≤ ε                                                        (6)

The parameters that control the regression performance are the cost of error C, the width of the tube ε, the mapping function φ(x) and the fuzzy memberships s_i. Usually, it is more convenient to solve the dual of Eq. (5) by introducing Lagrange multipliers α_i, α_i*, which leads to a solution of the form

f(x, α_i, α_i*) = Σ_{i=1}^{n} (α_i − α_i*) K(x, x_i) + b,   0 ≤ α_i, α_i* ≤ s_i C        (7)

In Eq. (7), α_i and α_i* satisfy the equalities α_i · α_i* = 0, α_i ≥ 0 and α_i* ≥ 0 for i = 1, 2, ..., n, and they are obtained by maximizing the dual function, which has the following form:

R(α_i, α_i*) = Σ_{i=1}^{n} y_i (α_i − α_i*) − ε Σ_{i=1}^{n} (α_i + α_i*)
               − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} (α_i − α_i*)(α_j − α_j*) K(x_i, x_j)       (8)

with the constraints

Σ_{i=1}^{n} (α_i − α_i*) = 0,   0 ≤ α_i ≤ C,  0 ≤ α_i* ≤ C,  i = 1, 2, ..., n             (9)
Based on the Karush-Kuhn-Tucker (KKT) conditions of quadratic programming, only a certain number of the coefficients (α_i − α_i*) in Eq. (7) assume nonzero values. The data points associated with them have approximation errors equal to or larger than ε and are referred to as support vectors. Generally, the larger the ε, the fewer the support vectors and thus the sparser the representation of the solution. K(x_i, x_j) is defined as the kernel function. The value of the kernel is equal to the inner product of the two vectors x_i and x_j in the feature space, that is, K(x_i, x_j) = φ(x_i) · φ(x_j). Any function satisfying Mercer's condition can be used as the kernel function. From the implementation point of view, training SVMs is equivalent to solving a linearly constrained quadratic programming (QP) problem with a number of variables twice that of the training data points. Generally speaking, the application of FSVM for forecasting follows these procedures: (1) transform the data to the format of an FSVM and conduct simple scaling on the data; (2) generate the fuzzy memberships; (3) choose the kernel function; (4) use cross-validation to find the best parameters C and γ; (5) use the best parameters to train the whole training set; (6) test.
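One way to approximate the weighted problem of Eq. (5) with off-the-shelf tools is to pass the fuzzy memberships s_i as per-sample weights to a standard ε-SVR, which rescales C to s_i C for each point. The sketch below rests on that assumption and is not the authors' implementation; the toy data are hypothetical.

```python
# Hedged sketch: approximating FSVM regression (Eq. 5) with scikit-learn's
# epsilon-SVR, where sample_weight scales C per training point.
import numpy as np
from sklearn.svm import SVR

def fsvm_regression(X, y, s, C=10.0, gamma=0.1, epsilon=0.1):
    model = SVR(kernel="rbf", C=C, gamma=gamma, epsilon=epsilon)
    model.fit(X, y, sample_weight=s)   # s_i acts as the fuzzy membership
    return model

# Toy usage: later observations receive larger memberships (see Section 4.3).
X = np.arange(20, dtype=float).reshape(-1, 1)
y = np.sin(X).ravel()
s = np.linspace(0.1, 1.0, len(X))      # sigma = 0.1 ... 1.0
model = fsvm_regression(X, y, s)
print(model.predict(X[-3:]))
```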
4 Experimental Setting

4.1 Data Sets
Forecasting and inventory management for intermittent demand parts is particularly problematic because of the large number of low demand parts that must be considered. As an experimental setting, we consider 5,000 unique non-repairable spare parts for the Daya Bay nuclear station in China; over half of those parts have been ordered 10 times or less in the last ten years. While many of these low demand parts are important for the safe operation of the nuclear reactor, it is simply uneconomical to stock enough spares to guarantee that every low demand part will be available when needed.

4.2 Clustering for Data Preprocessing
Clustering is the process of grouping parts with similar demand patterns. There are several methods to cluster data, among which agglomerative hierarchical clustering and c-means clustering are two typical ones. We have found that demand patterns can be robustly clustered by using cumulative demand patterns, as the cumulative demand pattern avoids the problems caused by the intermittent pattern of incremental demand. Figure 1 shows one of the prototype cumulative demand patterns obtained after clustering 4063 low demand spare parts into 10 clusters. Prototype patterns represent the typical demand pattern for each cluster. The cluster size ranges from 34 parts to 241 parts. Also plotted are the 25th and 75th percentiles of demand gathered from the cumulative demand of the individual parts in each cluster. Clustering was accomplished using a fuzzy c-means (FCM) clustering routine [12]. The generalized objective function, subject to the usual fuzzy c-partition constraints [13], is

min J_m(U, V) = Σ_{i=1}^{c} Σ_{k=1}^{n} (μ_ik)^m ||x_k − v_i||^2                (10)
During our experiment, one of the problems associated with clustering was the difficulty in determining the number of clusters, c. Various validity measures have been proposed to determine the optimal number of clusters in order to address this inherent drawback of FCM [13]. In our experiment, the optimal number of terms is defined as the one with the lowest mean squared error (MSE). The least-MSE measure is also used to identify the most appropriate form of the membership functions. In summary, the application procedure of the FCM has the following steps: (1) choose c (2 ≤ c ≤ n) and m (1 < m < ∞) and initialize the membership matrix; (2) read in the data set and find the maximum and minimum values; (3) calculate the cluster centers, but force the two clusters with the largest and smallest values to take the maximum and minimum domain values; (4) update the membership matrix; (5) compute the change of each value in the membership matrix and determine whether the maximum change is smaller than the threshold value chosen to stop the iterative process (set at 0.02 throughout this study); if not, return to step 3; (6) redistribute erroneous membership values to the other two more appropriate terms in proportion to their current membership values.
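A minimal NumPy sketch of the FCM iteration behind Eq. (10) and steps (1)-(6) is given below; it omits the domain-forcing and redistribution steps (3) and (6) and is illustrative only.

```python
# Minimal fuzzy c-means sketch following Eq. (10); not the authors' routine.
import numpy as np

def fuzzy_c_means(X, c=10, m=2.0, tol=0.02, max_iter=100, seed=0):
    """X: (n_samples, n_features) array of cumulative demand patterns."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    U = rng.random((c, n))
    U /= U.sum(axis=0)                               # fuzzy c-partition constraint
    for _ in range(max_iter):
        Um = U ** m
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)  # cluster centers
        d = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2) + 1e-12
        U_new = 1.0 / (d ** (2.0 / (m - 1.0)))
        U_new /= U_new.sum(axis=0)                    # membership update
        if np.abs(U_new - U).max() < tol:             # stop threshold (0.02 here)
            U = U_new
            break
        U = U_new
    return V, U
```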
Fig. 1. Cumulative demand pattern of Cluster 1 (average and 25th/75th percentiles of spare parts quantity over 1998-2002)
4.3 Defining Fuzzy Membership
It is easy to choose an appropriate fuzzy membership. First, we choose σ > 0 as the lower bound of the fuzzy membership. Second, we make the fuzzy membership s_i a function of the time t_i:

s_i = f(t_i)                                                                  (11)

We let the last point x_n be the most important and choose s_n = f(t_n) = 1, and the first point x_1 be the least important and choose s_1 = f(t_1) = σ. If we want the fuzzy membership to be a linear function of time, we can select

s_i = f(t_i) = α t_i + b = ((1 − σ)/(t_n − t_1)) t_i + (t_n σ − t_1)/(t_n − t_1)        (12)

If we want the fuzzy membership to be a quadratic function of time, we can select

s_i = f(t_i) = α (t_i − b)^2 + c = (1 − σ) ((t_i − t_1)/(t_n − t_1))^2 + σ              (13)
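Eqs. (12) and (13) translate directly into the following sketch (function names are ours):

```python
# Linear (Eq. 12) and quadratic (Eq. 13) fuzzy memberships over time indices.
def linear_membership(t, t1, tn, sigma=0.1):
    return (1 - sigma) / (tn - t1) * t + (tn * sigma - t1) / (tn - t1)

def quadratic_membership(t, t1, tn, sigma=0.1):
    return (1 - sigma) * ((t - t1) / (tn - t1)) ** 2 + sigma

# s_1 = sigma at the first point, s_n = 1 at the last point.
print(linear_membership(1, 1, 20), linear_membership(20, 1, 20))
```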
4.4 Kernel Function Parameters Selection
We use the general RBF kernel as the kernel function. The RBF kernel nonlinearly maps samples into a higher dimensional space, so, unlike the linear kernel, it can handle the case where the relation between class labels and attributes is nonlinear. Furthermore, the linear kernel is a special case of the RBF kernel, as [14] shows that the linear kernel with a penalty parameter C̃ has the same performance as the RBF kernel with some parameters (C, γ). In addition, the sigmoid kernel behaves
like the RBF kernel for certain parameters [15]. The second reason is the number of hyper-parameters, which influences the complexity of model selection: the polynomial kernel has more hyper-parameters than the RBF kernel. Finally, the RBF kernel has fewer numerical difficulties. One key point is that 0 < K_ij ≤ 1, in contrast to polynomial kernels, whose values may go to infinity (γ x_i^T x_j + r > 1) or to zero (γ x_i^T x_j + r < 1) when the degree is large. There are two parameters when using RBF kernels: C and γ. It is not known beforehand which C and γ are best for a given problem; consequently, some kind of model selection (parameter search) must be done. The goal is to identify good (C, γ) so that the model can accurately predict unknown data (i.e., testing data). Note that it may not be useful to achieve high training accuracy (i.e., a model that accurately predicts training data whose values are already known). Therefore, a common approach is to separate the training data into two parts, one of which is considered unknown when training the model; the prediction accuracy on this set then more precisely reflects the performance on unseen data. An improved version of this procedure is cross-validation. We use a grid search on C and γ with cross-validation: pairs of (C, γ) are tried and the one with the best cross-validation accuracy is picked. We found that trying exponentially growing sequences of C and γ is a practical method to identify good parameters (for example, C = 2^{−5}, 2^{−3}, ..., 2^{15}; γ = 2^{−15}, 2^{−13}, ..., 2^{3}).
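A minimal version of that grid search, assuming a scikit-learn SVR as the underlying regressor and the parameter ranges quoted above, might look like this (illustrative only; the training arrays are placeholders):

```python
# Exponential grid search over (C, gamma) with cross-validation; a sketch only.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

param_grid = {
    "C": 2.0 ** np.arange(-5, 16, 2),        # 2^-5, 2^-3, ..., 2^15
    "gamma": 2.0 ** np.arange(-15, 4, 2),    # 2^-15, 2^-13, ..., 2^3
}
search = GridSearchCV(SVR(kernel="rbf"), param_grid,
                      scoring="neg_mean_squared_error", cv=5)
# search.fit(X_train, y_train)   # X_train, y_train: clustered demand series
# print(search.best_params_)
```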
4.5 Performance Criteria
The prediction performance is evaluated using the normalized mean squared error (NMSE). The NMSE measures the deviation between the actual and predicted values; the smaller the NMSE, the closer the predicted time series values are to the actual values. The NMSE of the test set is calculated as follows:

NMSE = (1 / (δ^2 n)) Σ_{i=1}^{n} (y_i − ŷ_i)^2                                 (14)

δ^2 = (1 / (n − 1)) Σ_{i=1}^{n} (y_i − ȳ)^2                                    (15)
where n represents the total number of data points in the test set, ŷ_i represents the predicted value and ȳ denotes the mean of the actual output values. Table 3 shows the NMSE values of the different kernel functions compared with the Croston method and identifies the best prediction method for our numerical case.
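Eqs. (14)-(15) amount to the following sketch (names are ours):

```python
# NMSE as defined in Eqs. (14)-(15).
import numpy as np

def nmse(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    n = len(y_true)
    delta2 = ((y_true - y_true.mean()) ** 2).sum() / (n - 1)   # Eq. (15)
    return ((y_true - y_pred) ** 2).sum() / (delta2 * n)       # Eq. (14)
```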
4.6 Experimental Results
Taking Cluster 1 as an example again, Figure 2 shows the experimental results, comparing the actual data with the forecasts of the Croston method and FSVM regression. Summing over all clusters, the accuracy of the FSVM regression method is 11.6% higher than that of the Croston method, measured by the standard deviation.
Table 3. NMSE values of comparative methods

Method             NMSE
FSVMs RBF          0.3720
FSVMs Linear       0.5945
FSVMs Polynomial   0.6054
Croston            0.5730
Fig. 2. Forecasting results comparison of Cluster 1 (demand over weeks: actual data, Croston and FSVM-RBF forecasts)
5 Conclusions
The use of FSVM for forecasting intermittent demand is studied in this paper. The study concludes that FSVM provides a promising alternative for forecasting intermittent demand. Further research should, however, address extremely volatile data, since data fluctuation may affect the performance of this method; in fact, we obtained confusing experimental results at the very beginning, before applying data clustering. Another direction for further research is the prior knowledge used in training the sample and determining the function parameters. That is to say, parameter selection is unconstrained but affects the performance considerably, so a good parameter selection method is worthy of further research.
Acknowledgements This research is granted by National Science Foundation of China, No.70401015 and Hubei Provincial Key Social Science Research Center of Information Management.
References

1. Silver, E.A.: Operations research in inventory management: A review and critique. Operations Research 29 (1981) 628-645
2. Syntetos, A.A., Boylan, J.E.: On the bias of intermittent demand estimates. International Journal of Production Economics 71 (2001) 457-466
3. Bartezzaghi, E., Verganti, R., Zotteri, G.: A simulation framework for forecasting uncertain lumpy demand. International Journal of Production Economics 59 (1999) 499-510
4. Croston, J.D.: Forecasting and stock control for intermittent demands. Operational Research Quarterly 23(3) (1972) 289-303
5. Rao, A.V.: A comment on: forecasting and stock control for intermittent demands. Operational Research Quarterly 24(4) (1973) 639-640
6. Willemain, T.R., Smart, C.N., Shockor, J.H., DeSautels, P.A.: Forecasting intermittent demand in manufacturing: a comparative evaluation of Croston's method. International Journal of Forecasting 10(4) (1994) 529-538
7. Vapnik, V.N., Golowich, S.E., Smola, A.J.: Support vector method for function approximation, regression estimation, and signal processing. Advances in Neural Information Processing Systems 9 (1996) 281-287
8. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer, New York (1995)
9. Abe, S., Inoue, T.: Fuzzy support vector machines for multi-class problems. Proceedings of the Tenth European Symposium on Artificial Neural Networks (ESANN 2002) 113-118
10. Lin, C.-F., Wang, S.-D.: Fuzzy support vector machines. IEEE Transactions on Neural Networks 13(2) (2002) 464-471
11. Hong, D.H., Hwang, C.: Support vector fuzzy regression machines. Fuzzy Sets and Systems 138 (2003) 271-281
12. Medasani, S., Kim, J., Krishnapuram, R.: An overview of membership function generation techniques for pattern recognition. International Journal of Approximate Reasoning 19(2) (1998) 391-417
13. Pal, N.R., Bezdek, J.C.: On cluster validity for the fuzzy c-means model. IEEE Transactions on Fuzzy Systems 3(3) (1995) 370-379
14. Haykin, S.: Neural Networks: A Comprehensive Foundation. Prentice Hall, Englewood Cliffs, NJ (1999)
15. Zhang, G.Q., Hu, M.Y.: Neural network forecasting of the British Pound/US Dollar exchange rate. Omega 26(4) (1998) 495-506
Nonlinear Discrete System Stabilisation by an Evolutionary Neural Network

Wasan Srikasam, Nachol Chaiyaratana, and Suwat Kuntanapreeda

Research and Development Center for Intelligent Systems, King Mongkut's Institute of Technology North Bangkok, 1518 Piboolsongkram Road, Bangsue, Bangkok 10800, Thailand
sir [email protected], [email protected], [email protected]
Abstract. This paper presents the application of an evolutionary neural network controller to a stabilisation problem involving an inverted pendulum. The resulting closed-loop discrete system is guaranteed to be asymptotically stable. The process of training the neural network controller can be treated as a constrained optimisation problem where the equality constraint is derived from the Lyapunov stability criteria. The decision variables in this investigation are made up of the connection weights in the neural network, a positive definite matrix required for the Lyapunov function and matrices for the stability constraint, while the objective value is calculated from the closed-loop system performance. The optimisation technique chosen for the task is a variant of genetic algorithms called a cooperative coevolutionary genetic algorithm (CCGA). Two control strategies are explored: model-reference control and optimal control. In the model-reference control, the simulation results indicate that the tracking performance of the system stabilised by the evolutionary neural network is superior to that of a system controlled by a neural network trained via a neural network emulator. In addition, the system stabilised by the evolutionary neural network requires a level of energy comparable to that of the system that uses a linear quadratic regulator for optimal control. This confirms the usefulness of the CCGA in nonlinear discrete system stabilisation applications.
1 Introduction
Neural network controllers have been successfully applied to various nonlinear systems in the past decades. Since the goals of designing a closed-loop system must always include stability assurance, the need to guarantee the stability of neural network control systems is inevitable. The study into the stabilisation capability of neural network controllers can be divided into three main categories. In the first category, a modification of learning algorithms is made where the search for appropriate connection weights is limited to the space at which the stability can be guaranteed. In the second category, a neural network is used to approximate the behaviour of the nonlinear plant where the network is subsequently used to generate control effort according to a predefined control law or
a stability condition. Finally, in the third category an additional stability condition is introduced where the learning process can be treated as constrained optimisation. The modification of learning algorithms, which leads to the solution search in the feasible connection weight space only, was first introduced by Sadegh [1] where the stability condition is integrated directly into the weight adaptation algorithm. The concept proposed by Sadegh [1] is subsequently adapted for robotic system control [2]. In addition to the controller role, neural networks have also been used as both observer and controller in a robotic system [3] where a Lyapunov stability condition is utilised during the weight space restriction. Furthermore, the work in this category has also been extended to include the control of discrete systems [4, 5]. In the case where it is desirable to utilise an explicit control law or a stability condition, the training of a neural network to mimic the behaviour of the nonlinear plant is usually carried out. Since the neural network would certainly have a well-defined mathematical structure, the trained neural network can be regulated in a way that a specific control effort can be produced as the network output. The control effort can be generated according to a linear control law that leads to feedback linearisation [6, 7], a control Lyapunov function [8], or an input-to-state stability condition [9]. Another technique, which has also received much attention, is to introduce stability conditions as optimisation constraints during the search for appropriate connection weights. Tanaka [10] has shown that such a technique can be successfully applied in the case that the system can be explained in terms of linear differential inclusions. In later works, Suykens et al. [11] have proposed a stability constraint, which can be applied to any nonlinear systems, for use with recurrent network controllers. In contrast, Kuntanapreeda and Fullmer [12] have proposed a stability constraint for hermitian systems that are controlled by feedforward networks while Ekachaiworasin and Kuntanapreeda [13, 14] have extended the work to cover nonhermitian systems. From an optimisation viewpoint, the training of a neural network controller usually involves a search in connection weight space where the search objective is defined according to a specific system performance. Although the use of iterative learning algorithms has been successfully applied to the early works mentioned above, it has also been proven that evolutionary computation algorithms are stronger candidates for optimising neural networks [15]. Nonetheless, the application of evolutionary computation for optimising a neural network controller, while also satisfying stability constraints, has rarely been studied. In this paper, a cooperative coevolutionary genetic algorithm or CCGA [16] will be considered for the optimisation of the connection weights in a neural network controller. Specifically, the stability constraint for nonlinear discrete systems developed by Ekachaiworasin and Kuntanapreeda [14] will act as the optimisation constraint. This condition is chosen for this paper since it can be easily integrated into any optimisation techniques.
The organisation of this paper is as follows. In section 2, the architecture of the neural network and the associated stability condition will be explained. In section 3, the description of the CCGA will be given. The application of the CCGA to the task of connection weight optimisation and the performance of neural network controller will be discussed in section 4. Finally, the conclusions will be drawn in section 5.
2 Neural Network and Stability Condition
The architecture of the neural network controller chosen for this investigation is a multilayer perceptron. The network has one hidden layer; the activation function in the hidden nodes is a hyperbolic tangent function while that in the output nodes is a linear function. The schematic diagram of the neural network controller is given in Fig. 1. From Fig. 1, it can be seen that the neurons do not have connection weights representing thresholds, which means that the input to an activation function is simply a weighted sum of the neuron inputs. In Fig. 1, w1_ij represents an element of the connection weight matrix W1 and reflects the connection weight between the i-th neuron in the hidden layer and the j-th network input. Similarly, w2_ij denotes the connection weight in W2 linking the i-th output neuron and the j-th hidden neuron. Ekachaiworasin and Kuntanapreeda [14] have proven that the network in Fig. 1 can be used as a full-state feedback controller where the associated stability condition, derived from the Lyapunov stability criteria, for the closed-loop discrete system can be described by

C1 = (A + aBW2W1)^T P (A + aBW2W1) − P + qq^T + I = 0                 (1)

C2 = (A + aBW2W1)^T P B W2 + W1^T + qΓ^T = 0                          (2)

C3 = W2^T B^T P B W2 + ΓΓ^T − 2I = 0                                   (3)
where A and B respectively denote the system and input matrices of the plant model, which has been linearised around the equilibrium state. In the stability condition, P is a positive definite symmetric matrix for the Lyapunov function V = x^T P x, where x represents the state vector, while I, a, q and Γ are an identity matrix, a constant, a constant matrix and a constant square matrix, respectively. In this paper, a is preset to 0.5. With this condition, the system is guaranteed to be asymptotically stable. Two types of control strategies are explored in this paper: model-reference control and optimal control. In model-reference control, the closed-loop behaviour of the system is shaped such that the performances of the system and the reference model are similar. Generally, an error cost function, derived from the difference between the performance of the system and that of the reference model, indicates how the feedback controller should be adjusted. In this investigation, the squared tracking error between the system output and a linear reference model output is the indicator. In contrast, the cost function for the optimal control is calculated from the energy utilised by the system during the execution of the control task.
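For illustration, the left-hand sides of Eqs. (1)-(3) can be evaluated numerically as follows; the matrix shapes (and hence the identity sizes) are inferred from the text, and this is a sketch rather than the authors' code.

```python
# Residuals of the stability conditions (1)-(3); illustrative only.
# Assumed shapes: A (n x n), B (n x m), W1 (h x n), W2 (m x h),
# P (n x n), q (n x h), Gamma (h x h).
import numpy as np

def stability_residuals(A, B, W1, W2, P, q, Gamma, a=0.5):
    n = A.shape[0]
    h = W2.shape[1]
    Acl = A + a * B @ W2 @ W1                                   # closed-loop matrix
    C1 = Acl.T @ P @ Acl - P + q @ q.T + np.eye(n)              # Eq. (1)
    C2 = Acl.T @ P @ B @ W2 + W1.T + q @ Gamma.T                # Eq. (2)
    C3 = W2.T @ B.T @ P @ B @ W2 + Gamma @ Gamma.T - 2 * np.eye(h)  # Eq. (3)
    return C1, C2, C3
```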
Nonlinear Discrete System Stabilisation by an Evolutionary Neural Network
x1 x2
xn
w111
w211
1
1093
u1
2
m n
w1hn
h
um
w2mh
Fig. 1. Architecture of the neural network controller
Here, the energy is calculated from a quadratic function, which can be described by

J_e = Σ_{k=0}^{∞} ( x^T(k) Q x(k) + r u^2(k) )                         (4)
where Je is the energy cost, Q is the weighting matrix for the state x and r is the weighting factor for the control effort u. Both cost functions explained will be parts of fitness functions during the evolution of the neural network controller using a cooperative coevolutionary genetic algorithm.
3 Cooperative Coevolutionary Genetic Algorithm
The cooperative coevolutionary genetic algorithm (CCGA) was first introduced by Potter and De Jong [16]. The CCGA functions by introducing an explicit notion of modularity to the optimisation process. This is done in order to provide reasonable opportunity for complex solutions to evolve in the form of interacting coadapted subcomponents. In brief, the CCGA explores the search space by utilising a population, which contains a number of species or subpopulations. In contrast to other types of genetic algorithm, each species in the CCGA represents a variable or a part of the problem, which requires to be optimised. A combination of an individual from each species will lead to a complete solution to the problem where the fitness value of the complete solution can then be identified. This value of fitness will be assigned to the individual of interest that participates in the formation of the solution. After the fitness values have been assigned to all individuals, the evolution of each species is then commenced using a standard genetic algorithm. The CCGA has been successfully used in a number of problem domains including multivariable function optimisation [16], sequential rule evolution [17], design of a cascade neural network [18] and container loading optimisation [19]. A comprehensive description of the CCGA and the summary of its applications can be found in Potter and De Jong [20].
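In outline, the cooperative coevolution loop can be sketched as follows; this is an illustrative skeleton under our own naming, not the implementation used in the paper, and the genetic operators are left as a placeholder.

```python
# Skeleton of a cooperative coevolutionary loop; illustrative only.
import random

def ccga(species_inits, evaluate, generations=300, pop_size=100):
    """species_inits: one initialiser per decision variable (species), each
    returning a random individual. evaluate: maps a complete solution
    (one individual per species) to a fitness to be minimised."""
    pops = [[init() for _ in range(pop_size)] for init in species_inits]
    best = [random.choice(pop) for pop in pops]          # current representatives
    for _ in range(generations):
        for k, pop in enumerate(pops):
            scored = []
            for ind in pop:
                solution = best[:k] + [ind] + best[k + 1:]   # collaborate
                scored.append((evaluate(solution), ind))
            scored.sort(key=lambda t: t[0])
            best[k] = scored[0][1]                       # elitist representative
            # (selection, crossover and mutation of `pop` would go here)
    return best
```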
4 Results and Discussions
The capability of the neural network controller for nonlinear control is demonstrated in this section. Specifically, an inverted pendulum system is investigated in the case studies. This system is chosen because it is a regulator control system with only two states, and a performance comparison against previously published results can be made easily. The continuous-time model of the inverted pendulum is given by

ẋ1 = x2
ẋ2 = −x2 + 2 sin x1 + 10u                                              (5)

where x1 is the angular position, x2 is the angular velocity and u is the control effort. The model is discretised with a sampling interval of 0.05 second. The initial condition of the plant is [x1(0) x2(0)]^T = [1 0]^T, while the desired final condition is [x1(kf) x2(kf)]^T = [0 0]^T, where kf denotes the final time index. As mentioned earlier in Section 2, two control strategies are explored: model-reference control and optimal control. Since the performance indices for both strategies are described in terms of cost functions, the search objective of the CCGA can be defined as

f = α1 Jc + α2 (Σ_i Σ_j c1_ij^2)^{1/2} + α3 (Σ_i Σ_j c2_ij^2)^{1/2} + α4 (Σ_i Σ_j c3_ij^2)^{1/2}
    + α5 [ (Σ_i Σ_j c1_ij^2)^{1/2} − (Σ_i Σ_j c2_ij^2)^{1/2} ]
    + α6 [ (Σ_i Σ_j c2_ij^2)^{1/2} − (Σ_i Σ_j c3_ij^2)^{1/2} ]
    + α7 [ (Σ_i Σ_j c1_ij^2)^{1/2} − (Σ_i Σ_j c3_ij^2)^{1/2} ]                         (6)
where f is the minimisation objective, Jc is the control cost which can be either the squared tracking error or the energy usage, c1ij , c2ij and c3ij are elements from the stability constraint matrices C1 , C2 and C3 as given in equations (1), (2) and (3), respectively and αi is a weighting factor. It can be clearly seen that the search objective is a weighted-sum between the control cost and the constraint. A desired neural network controller would force the plant to perform according to the strategy employed while the stability condition is also satisfied. The parameter settings for the multilayer perceptron and the CCGA are given in Table 1. From Table 1, it can be seen that the decision variables for the evolution of the neural network cover both connection weights and variables
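As an illustration of how the control cost Jc might be evaluated for the optimal-control case, the pendulum of Eq. (5) can be simulated in closed loop with the tanh network of Fig. 1. The sketch below assumes a simple forward-Euler discretisation at the 0.05 s sampling interval, which is our assumption rather than the discretisation used in the paper.

```python
# Sketch: closed-loop energy cost (Eq. 4) for the discretised pendulum (Eq. 5).
import numpy as np

def energy_cost(W1, W2, x0=(1.0, 0.0), Ts=0.05, steps=400, r=0.3):
    """W1: (4, 2) hidden weights, W2: (1, 4) output weights, per Table 1."""
    Q = np.eye(2)
    x = np.array(x0, dtype=float)
    J = 0.0
    for _ in range(steps):
        u = (W2 @ np.tanh(W1 @ x)).item()        # network of Fig. 1
        J += x @ Q @ x + r * u * u               # Eq. (4)
        x1, x2 = x
        xdot = np.array([x2, -x2 + 2 * np.sin(x1) + 10 * u])   # Eq. (5)
        x = x + Ts * xdot                        # assumed forward-Euler step
    return J
```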
Table 1. Parameter settings for the multilayer perceptron and the CCGA

Parameter                                  Setting and Value
Multilayer Perceptron
  Number of input nodes                    2 (Number of feedback states)
  Number of hidden nodes                   4
  Number of output nodes                   1
CCGA
  Decision variables                       W1 (w1_ij ∈ [−2, 2]), W2 (w2_ij ∈ [−2, 2]), q (q_ij ∈ [−2, 2]), Γ (γ_ij ∈ [−2, 2]) and P (p_ij ∈ [0, 50])
  Number of variables (Number of species)  39
  Chromosome representation                Binary chromosome
  Chromosome length of a species member    20
  Fitness scaling method                   Linear scaling
  Selection method                         Stochastic universal sampling
  Crossover method                         Uniform crossover with probability = 0.8
  Mutation method                          Bit-flip mutation with probability = 0.3
  Subpopulation size                       100
  Number of elitist individuals            1
  Number of generations                    300
  Number of repeated runs                  10
from the stability condition. In this work, each decision variable is treated as a species in the CCGA search. In addition, the evolution of the neural network is repeated ten times for each case study where the best result is selected for detailed analysis. For the purpose of comparison, the performance of the evolutionary neural network for model-reference control will be compared with that of the neural network controller reported in Ekachaiworasin and Kuntanapreeda [14]. In the work by Ekachaiworasin and Kuntanapreeda [14], the model-reference control strategy is also explored. However, an additional neural network emulator is required for the controller training. In brief, a neural network is first trained to emulate the behaviour of the plant. During the training of another neural network for use as the controller, the tracking error between the closed-loop system output and the reference model output is then propagated backward through the neural network emulator where the resulting propagated signal provides an adjustment to the connection weights in the controller. Both evolutionary neural network and neural network controller reported in Ekachaiworasin and Kuntanapreeda [14] are designed such that the resulting closed-loop system behaviour is similar to that of the reference model, which in this study is a linear second order system with the damping ratio = 0.8 and the natural frequency = 3 rad/s. In contrast to the case study involving model-reference control, the performance of the evolutionary neural network for optimal control will be compared with that of a linear quadratic regulator (LQR). The regulator is designed according to the linearised discrete model of the plant. The comparison will be carried out
Fig. 2. State trajectory from model-reference control: (a) angular position x1 (rad), (b) angular velocity x2 (rad/s); reference model, E&K NN and evolutionary NN over time (sec)
via the monitoring of the energy during the control task execution. The energy cost function is formulated such that the matrix Q is set to an identity matrix while the parameter r is set to 0.3 for both neural network evolution and linear regulator design. The performances of the evolutionary neural networks and the techniques selected for comparison for the model-reference control and optimal control are displayed in Figs. 2 and 3, respectively. In addition, the corresponding control cost and constraint are also displayed in Table 2. Firstly, consider the results from the first case study where the strategy employed is model-reference control. It can be clearly seen that the evolutionary neural network is superior to the neural network controller reported in Ekachaiworasin and Kuntanapreeda [14]. The tracking performance of the closed-loop system with the evolutionary neural network is very close to that of the reference model. On the other hand, a significant deviation from the desired response can be observed in the system using the previously reported neural network controller. The numerical results in Table 2 also support this observation where improvements in both squared tracking error and constraint objective are noticeable. These improvements stem from the fact that the use of a neural network emulator, which is previously required during the training of the neural network controller, is no longer necessary in the current implementation. In the early work by Ekachaiworasin and Kuntanapreeda [14], the controller training capability is limited by the ability of the emulator at mimicking the behaviour of the plant. If the emulator cannot truly represent the plant, it would be impossible for the subsequent closed-loop system to behave like the reference model. Since the application of the CCGA as the training algorithm leads to a direct usage of the difference between the closedloop system and the reference model as the training guidance, the improvements reported are hence not surprising. The results from the optimal control are now considered. Figure 3 indicates that the state trajectory from the system with the evolutionary neural network
Fig. 3. State trajectory from optimal control: (a) angular position x1 (rad), (b) angular velocity x2 (rad/s); LQR and evolutionary NN over time (sec)

Table 2. Control cost and constraint

Strategy          Controller        Cost (Jc)   (Σij c1_ij^2)^{1/2}   (Σij c2_ij^2)^{1/2}   (Σij c3_ij^2)^{1/2}
Model-Reference   E&K NN            5.136447    0.735667              0.704628              0.265072
Model-Reference   Evolutionary NN   0.026669    0.000005              0.000003              0.000006
Optimal Control   LQR               20.329002   −                     −                     −
Optimal Control   Evolutionary NN   21.704000   0.000029              0.000029              0.000033
5 Conclusions

In this paper, the application of neural network controllers to the stabilisation of a nonlinear discrete system is introduced. The Lyapunov-based stability condition imposed leads to the identification of the feasible connection weight space. As a result, the training of neural networks can be treated as constrained optimisation, where the proposed technique for the task is a cooperative coevolutionary genetic algorithm (CCGA). Two control strategies, namely model-reference control and optimal control, are investigated, where an inverted pendulum system is utilised in both case studies. The results from the model-reference control
indicate that a significant improvement over the previous work by Ekachaiworasin and Kuntanapreeda [14] in terms of the tracking performance and the stability constraint handling is obtained. Furthermore, the results from the optimal control reveal that the energy usage from the system with the evolutionary neural network controller is similar to that from the system with a linear quadratic regulator. This clearly illustrates the suitability of the CCGA to the task.
References 1. Sadegh, N.: A perceptron network for functional identification and control of nonlinear systems. IEEE Transactions on Neural Networks 4(6) (1993) 982–988 2. Fierro, R., Lewis, F.L.: Control of a nonholonomic mobile robot using neural networks. IEEE Transactions on Neural Networks 9(4) (1998) 589–600 3. Kim, Y.H., Lewis, F.L.: Neural network output feedback control of robot manipulators. IEEE Transactions on Robotics and Automation 15(2) (1999) 301–309 4. Jagannathan, S.: Control of a class of nonlinear discrete-time systems using multilayer neural networks. IEEE Transactions on Neural Networks 12(5) (2001) 1113– 1120 5. Zhu, Q., Guo, L.: Stable adaptive neurocontrol for nonlinear discrete-time systems. IEEE Transactions on Neural Networks 15(3) (2004) 653–662 6. He, S., Reif, K., Unbehauen, R.: A neural approach for control of nonlinear systems with feedback linearization. IEEE Transactions on Neural Networks 9(6) (1998) 1409–1421 7. Nam, K.: Stabilization of feedback linearizable systems using a radial basis function network. IEEE Transactions on Automatic Control 44(5) (1999) 1026–1031 8. Kosmatopoulos, E.B.: Universal stabilization using control Lyapunov functions, adaptive derivative feedback, and neural network approximators. IEEE Transactions on Systems, Man, and Cybernetics–Part B: Cybernetics 28(3) (1998) 472–477 9. Sanchez, E.N., Perez, J.P.: Input-to-state stabilization of dynamic neural networks. IEEE Transactions on Systems, Man, and Cybernetics–Part A: Systems and Humans 33(4) (2003) 532–536 10. Tanaka, K.: An approach to stability criteria of neural-network control systems. IEEE Transactions on Neural Networks 7(3) (1996) 629–642 11. Suykens, J.A.K., Vandewalle, J., De Moor, B.: Lur’e systems with multilayer perceptron and recurrent neural networks: Absolute stability and dissipativity. IEEE Transactions on Automatic Control 44(4) (1999) 770–774 12. Kuntanapreeda, S., Fullmer, R.R.: A training rule which guarantees finite-region stability for a class of closed-loop neural-network control systems. IEEE Transactions on Neural Networks 7(3) (1996) 745–751 13. Ekachaiworasin, R., Kuntanapreeda, S.: A training rule which guarantees finiteregion stability of neural network closed-loop control: An extension to nonhermitian systems. In: Proceedings of the 2000 IEEE-INNS-ENNS International Joint Conference on Neural Networks, Como, Italy (2000) 325–330 14. Ekachaiworasin, R., Kuntanapreeda, S.: Stabilizing neural network controllers for sampled-data nonlinear systems. In: Mastorakis, N.E. (ed.): Problems in Modern Applied Mathematics. World Scientific and Engineering Society Press, Vouliagmeni, Greece (2000) 282–286
15. Yao, X.: Evolving artificial neural networks. Proceedings of the IEEE 87(9) (1999) 1423–1447 16. Potter, M.A., De Jong, K.A.: A cooperative coevolutionary approach to function optimization. In: Davidor, Y., Schwefel, H.-P., M¨ anner, R. (eds.): Parallel Problem Solving from Nature–PPSN III. Lecture Notes in Computer Science, Vol. 866. Springer-Verlag, Berlin, Germany (1994) 249–257 17. De Jong, K.A., Potter, M.A.: Evolving complex structures via cooperative coevolution. In: Proceedings of the Fourth Annual Conference on Evolutionary Programming, San Diego, CA (1995) 307–318 18. Potter, M.A., De Jong, K.A.: Evolving neural networks with collaborative species. In: Proceedings of the 1995 Summer Computer Simulation Conference, Ottawa, Canada (1995) 340–345 19. Pimpawat, C., Chaiyaratana, N.: Three-dimensional container loading using a cooperative co-evolutionary genetic algorithm. Applied Artificial Intelligence 18(7) (2004) 581–601 20. Potter, M.A., De Jong, K.A.: Cooperative coevolution: An architecture for evolving coadapted subcomponents. Evolutionary Computation 8(1) (2000) 1–29
Genetic Algorithm for Inventory Lot-Sizing with Supplier Selection Under Fuzzy Demand and Costs

Jafar Rezaei and Mansoor Davoodi

1 Vali-e-Asr University of Rafsanjan, Department of Industrial Management, Rafsanjan, Kerman, Iran, [email protected]
2 Vali-e-Asr University of Rafsanjan, Department of Computer Sciences, Rafsanjan, Kerman, Iran
Abstract. In this paper a multi-period inventory lot-sizing scenario, with multiple products and multiple suppliers, is solved with a Real Parameter Genetic Algorithm. We assume that the demand of multiple discrete products is known, though not exactly, over a planning horizon; that the transaction cost is supplier-dependent but does not depend on the variety or quantity of products involved; that the holding cost is product-dependent; that there are no capacity restrictions; and that no backlogging is allowed. Because of uncertainties in demand and inventory costs, we consider the demand and all costs as fuzzy numbers. The problem is formulated as a fuzzy mixed integer program, converted to an equivalent crisp decision-making problem, and solved with a Real Parameter Genetic Algorithm. Finally, a numerical example is provided to illustrate the solution procedure. The results determine what products to order in what quantities from which suppliers in which periods.
1 Introduction Traditional optimization methods can be classified into two distinct groups: direct and gradient-based methods. In direct search methods, only objective function ( f (x ) ) and constraint values ( g j (x) ) are used to guide the search strategy, whereas gradient-based methods use the first and/or second-order derivatives of the objective function and/or constraints to guide the search process. Since derivative information is not used, the direct search methods are usually slow, requiring many function evaluations for convergence. For the same reason, they can also be applied to many problems without a major change of the algorithm. On the other hand, gradient-based methods quickly converge to an optimal solution, but are not efficient in nondifferentiable or discontinuous problems. In addition, there are some common difficulties with most of the traditional direct and gradient-based techniques: − The convergence to an optimal solution depends on the chosen initial solution. − Most algorithms tend to get stuck to a suboptimal solution. − An algorithm efficient in solving one optimization problem may not be efficient in solving a different optimization problem. M. Ali and R. Dapoigny (Eds.): IEA/AIE 2006, LNAI 4031, pp. 1100 – 1110, 2006. © Springer-Verlag Berlin Heidelberg 2006
− Algorithms are not efficient in handling problems having discrete variables. − Algorithms cannot be efficiently used on a parallel machine. Because of the nonlinearities and complex interactions among problem variables often exist in real world optimization problems, the search space may have more than one optimal solutions, of which most are undesired locally optimal solutions having inferior objective function values. When solving these problems, if traditional methods get attracted to any of these locally optimal solutions, there is no escape [1]. To overcome these difficulties, recently Genetic Algorithms (GAs) are used as optimization techniques for decision-making problems. These (c.f. Davis [2], Goldberg [3] and Michalewicz [4]) are adaptive computational procedures modeled on the mechanics of natural genetic systems. They exploit the historical information to speculate on new offspring with expected improved performance (c.f. Goldberg [3] and Pal et al. [5]). These are executed iteratively on a set of coded solutions (called population) with three operators, selection, crossover and mutation. One iteration of these three operators is known as a generation in the parlance of GAs. Since a GA works simultaneously on a set of coded solutions, it has very little chance to get struck at local optima. Here, the resolution of the possible search space is increased by operating on possible solutions and not on the solutions themselves. Further, this search space need not be continuous. Recently, GAs have been applied in different areas like neural network (c.f. Pal et al. [5] ), scheduling (c.f. Davis [6] ), numerical optimization (c.f.Michalewicz and Janikow [7]), pattern recognition (c.f.Gelsema [8] ), inventory classification (Guvenier [9]), fishery modeling (Mardel et. al. [10] ), multi criteria flow shop scheduling (Bagchi [11] ), inventory control ( Mondal and Maiti [12]) etc. Tang and Wang [13], [14] developed an interactive approach for the production problems with quadratic form of fuzzy objective and fuzzy resource constraints and found a family of solutions with acceptable membership degree for the said problem via a GA with mutation along the weighted gradient direction. They also solved a manufacturing problem formulated as a fuzzy non-linear programming problem with penalty coefficients via GA. But, till now, just a few works (c.f. Mondal and Maiti [12], Rezaei and Davoodi [15]) have applied it in the field of inventory control and supply chain management. With the advent of supply chain management, much attention is now devoted to supplier selection. Rosenthal et al. [16] study a purchasing problem where suppliers offer discounts when a “bundle” of products is bought from them, and one needs to select suppliers for multiple products. A mixed integer programming formulation is presented. Chaudhry et al. [17] consider vendor selection under quality, delivery and capacity constraints and price-break regimes. Ganeshan [18] has presented a model for determining lot sizes that involves multiple suppliers while considering multiple retailers, and consequent demand on a warehouse. Kasilingam and Lee [19] have incorporated the fixed cost of establishing a vendor in a single-period model that includes demand uncertainties and quality considerations in the selection of vendors. In a similar vein, Jayaraman et al. 
[20] have proposed a supplier selection model that considers quality (in terms of the proportion of defectives supplied by a supplier), production capacity (this limits the order placed on a supplier), lead-time, and storage capacity limits. This is also a single-period model that attaches a fixed cost to dealing
with a supplier. Jayaraman et al. [20] formulate a mixed integer linear programming model to solve the problem. Basnet and Leung [21] have presented a multi-period supplier selection model that considers demand of multiple products in multiple periods: one or more supplier could be selected in each of these periods for the purchase of products. Alternatively, products could be carried forward to a future period, incurring holding costs. Every time an order is placed on a supplier, a transaction cost is also incurred. In an inventory system, we consider inventory costs which may be imprecise. On the other hand, the demand may be vaguely defined in a different way. Hence, there is a scope of formulating the inventory control problems with imprecise parameters, resources and objective. Till now, only a few fuzzy inventory models (c.f. Kacprzyk and Staniewski [22], Park [22], Petrovic [24]), Vujosevic et. al [25], Chang et. al [26], Lee et. al [27], Yao et. al [28], Yao et. al [29], Yao et. al [30], Gen et. al [31], Fatemi Ghomi and Rezaei [32] are available in the literature. In this paper based on Basnet and Leung [21] work a multi-period inventory lot sizing scenario, where there are multiple products and multiple suppliers, is solved with a Real Parameter Genetic Algorithm while demand and all costs are considered as fuzzy numbers.
2 Genetic Algorithm Approach

2.1 Chromosome Syntax Representation
In the multi-product multi-supplier problem, the number of parameters involved in the optimization is large. Traditional binary vectors used to represent the chromosome are not effective in such a large-scale dimension. To overcome this difficulty, a real-number representation is therefore used.

2.2 Initial Population Production
The population initialization technique used in the GA approach is a random real-number initialization that creates a starting GA population filled with randomly generated real-number strings. The population size is set to not less than twice the length of the chromosome vector.

2.3 Evaluation and Fitness
The evaluation function plays the same role in a GA as the environment plays in natural evolution. The objective function value is considered as the evaluation rule for every individual. Sometimes the best and the worst chromosomes produce almost the same numbers of offspring in the next population, which causes premature convergence; in this case, the effect of natural selection is not obvious. A linear normalization, which converts the evaluations of chromosomes into fitness values, is used here to overcome this problem.
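A minimal sketch of the representation and fitness handling described above; the bounds format and the normalization range are assumptions, not the authors' settings:

```python
import random

def init_population(chrom_len, bounds, pop_size=None):
    """Random real-number initialization; population size >= 2 * chromosome length."""
    pop_size = pop_size or 2 * chrom_len
    return [[random.uniform(lo, hi) for (lo, hi) in bounds] for _ in range(pop_size)]

def linear_normalization(evaluations, worst_fit=1.0, best_fit=100.0):
    """Convert raw objective evaluations (minimization assumed) into linearly
    spaced fitness values so the best and worst chromosomes do not reproduce
    at almost the same rate."""
    order = sorted(range(len(evaluations)), key=lambda k: evaluations[k])
    step = (best_fit - worst_fit) / max(len(evaluations) - 1, 1)
    fitness = [0.0] * len(evaluations)
    for rank, idx in enumerate(reversed(order)):   # worst gets worst_fit, best gets best_fit
        fitness[idx] = worst_fit + rank * step
    return fitness
```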
2.4 Parent Selection and Genetic Operators
The purpose of parent selection in a GA is to offer additional reproductive chances to those population members that are the fittest. One commonly used technique, roulette-wheel selection, is used for the proposed GA. The genetic operators used are the crossover operator and the mutation operator. Crossover is an extremely important component of a GA, which makes it different from all other conventional optimization algorithms. It recombines the parents and produces offspring as the new individuals. In the proposed GA, linear crossover is used. Mutation reintroduces diversity into the population when a gene has accidentally taken on a single value everywhere. This prevents the solution from converging to some local optimal solution, so that the global optimal solution can be reached. The random mutation method is used here.

2.5 Termination Criteria and Genetic Parameters Selection
Once the same solution has been continuously obtained by the GA approach for 10 reproduction trials, or the generation count exceeds the preset number of trials stated in advance, the termination criteria have been reached. The population size is set at 100. The crossover operator is applied with a crossover rate of 90%. The mutation rate is inversely related to the standard deviation of a generation and is not more than 10% throughout the evolution.
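The selection, crossover and mutation operators named above can be sketched as follows; since the paper gives no formulas, the linear-crossover coefficients and the per-gene mutation probability are common textbook choices, not the authors' exact ones:

```python
import random

def roulette_select(population, fitness):
    """Roulette-wheel selection: pick one parent with probability proportional to fitness."""
    total = sum(fitness)
    pick = random.uniform(0.0, total)
    acc = 0.0
    for individual, fit in zip(population, fitness):
        acc += fit
        if acc >= pick:
            return individual
    return population[-1]

def linear_crossover(p1, p2):
    """Linear crossover: offspring are linear combinations of the two parents."""
    return [[a * x + b * y for x, y in zip(p1, p2)]
            for a, b in ((0.5, 0.5), (1.5, -0.5), (-0.5, 1.5))]

def random_mutation(chrom, bounds, pm=0.1):
    """Random (uniform) mutation: each gene is reset within its bounds with probability pm."""
    return [random.uniform(lo, hi) if random.random() < pm else g
            for g, (lo, hi) in zip(chrom, bounds)]
```

These operators would be applied inside the usual generational loop together with the elitism and termination rules described above.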
3 Formulation and Notation

This paper is built upon Basnet and Leung's [21] work. To avoid any possible confusion, the notation and assumptions utilized in Basnet and Leung are employed in this paper:

D_it : demand of product i in period t
P_ij : purchase price of product i from supplier j
H_i : holding cost of product i per period
O_j : transaction cost for supplier j
X_ijt : number of product i ordered from supplier j in period t
Y_jt : 1 if an order is placed on supplier j in time period t, 0 otherwise
R_it : inventory of product i, carried over from period t to period t + 1
1. Transaction cost is supplier-dependent, but does not depend on the variety and quantity of products involved.
2. There are no capacity restrictions and no backlogging is allowed.
3. Demand of multiple discrete products is known over a planning horizon.
4. Holding cost is product-dependent.
3.3 Formulation

The total cost (TC) = purchase cost of the products + transaction cost for the suppliers + holding cost for remaining inventory in each period:

TC = \sum_{i} \sum_{j} \sum_{t} P_{ij} X_{ijt} + \sum_{j} \sum_{t} O_{j} Y_{jt} + \sum_{i} \sum_{t} H_{i} \Big( \sum_{k=1}^{t} \sum_{j} X_{ijk} - \sum_{k=1}^{t} D_{ik} \Big)   (1)

There are two restrictions for this objective function:

1) All requirements must be filled in the period in which they occur; shortage or backordering is not allowed:

R_{it} = \sum_{k=1}^{t} \sum_{j} X_{ijk} - \sum_{k=1}^{t} D_{ik} \ge 0   for all i and t   (2)

2) There is no order without charging an appropriate transaction cost:

\Big( \sum_{k=t}^{T} D_{ik} \Big) Y_{jt} - X_{ijt} \ge 0   for all i, j and t   (3)
The problem is to find the number of product i ordered from supplier j in period t so as to minimize the total cost function subject to the restrictions and boundary conditions:

Min (TC) = \sum_{i} \sum_{j} \sum_{t} P_{ij} X_{ijt} + \sum_{j} \sum_{t} O_{j} Y_{jt} + \sum_{i} \sum_{t} H_{i} \Big( \sum_{k=1}^{t} \sum_{j} X_{ijk} - \sum_{k=1}^{t} D_{ik} \Big)   (4)

s.t.

R_{it} \equiv \sum_{k=1}^{t} \sum_{j} X_{ijk} - \sum_{k=1}^{t} D_{ik} \ge 0   for all i and t,   (5)

\Big( \sum_{k=t}^{T} D_{ik} \Big) Y_{jt} - X_{ijt} \ge 0   for all i, j and t,   (6)

Y_{jt} = 0 or 1   for all j and t,   (7)

X_{ijt} \ge 0   for all i, j and t.   (8)
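As an illustration of Eqs. (4)-(8), the following sketch evaluates the total cost and checks the two restrictions for a given set of decision variables; the nested-dictionary data layout and the function names are assumptions made for readability:

```python
def total_cost(X, Y, D, P, H, O, products, suppliers, periods):
    """Eq. (4): purchase cost + transaction cost + holding cost of carried-over inventory.
    periods is assumed to be range(T), i.e. 0-indexed."""
    purchase = sum(P[i][j] * X[i][j][t] for i in products for j in suppliers for t in periods)
    transaction = sum(O[j] * Y[j][t] for j in suppliers for t in periods)
    holding = sum(H[i] * (sum(X[i][j][k] for j in suppliers for k in periods[:t + 1])
                          - sum(D[i][k] for k in periods[:t + 1]))
                  for i in products for t in periods)
    return purchase + transaction + holding

def feasible(X, Y, D, products, suppliers, periods):
    """Eqs. (5)-(6): no backlogging, and no order without its transaction cost being charged."""
    for i in products:
        for t in periods:
            carried = (sum(X[i][j][k] for j in suppliers for k in periods[:t + 1])
                       - sum(D[i][k] for k in periods[:t + 1]))
            if carried < 0:
                return False
            for j in suppliers:
                if sum(D[i][k] for k in periods[t:]) * Y[j][t] - X[i][j][t] < 0:
                    return False
    return True
```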
When the above objective goal, the demand of product i in period t, the purchase price of product i from supplier j, the holding cost of product i per period and the transaction cost for supplier j become fuzzy, the said problem (4)-(8) is transformed to:

Min (\tilde{TC}) = \sum_{i} \sum_{j} \sum_{t} \tilde{P}_{ij} X_{ijt} + \sum_{j} \sum_{t} \tilde{O}_{j} Y_{jt} + \sum_{i} \sum_{t} \tilde{H}_{i} \Big( \sum_{k=1}^{t} \sum_{j} X_{ijk} - \sum_{k=1}^{t} \tilde{D}_{ik} \Big)   (9)

s.t.

\tilde{R}_{it} \equiv \sum_{k=1}^{t} \sum_{j} X_{ijk} - \sum_{k=1}^{t} \tilde{D}_{ik} \ge 0   for all i and t,   (10)

\Big( \sum_{k=t}^{T} \tilde{D}_{ik} \Big) Y_{jt} - X_{ijt} \ge 0   for all i, j and t,   (11)

Y_{jt} = 0 or 1   for all j and t,   (12)

X_{ijt} \ge 0   for all i, j and t.   (13)
Following Zimmermann [33], the above fuzzy model reduces to:

Max \alpha   (14)

s.t.

\sum_{i} \sum_{j} \sum_{t} \big( P_{ij} - (1-\alpha) Pp_{ij} \big) X_{ijt} + \sum_{j} \sum_{t} \big( O_{j} - (1-\alpha) Po_{j} \big) Y_{jt} + \sum_{i} \sum_{t} \big( H_{i} - (1-\alpha) Ph_{i} \big) \Big( \sum_{k=1}^{t} \sum_{j} X_{ijk} - \sum_{k=1}^{t} \big( D_{ik} - (1-\alpha) Pd_{ik} \big) \Big) \le TC_{0} + (1-\alpha) Ptc_{0},   (15)

\sum_{k=1}^{t} \sum_{j} X_{ijk} - \sum_{k=1}^{t} \big( D_{ik} - (1-\alpha) Pd_{ik} \big) \ge 0   for all i and t,   (16)

\Big( \sum_{k=t}^{T} \big( D_{ik} - (1-\alpha) Pd_{ik} \big) \Big) Y_{jt} - X_{ijt} \ge 0   for all i, j and t,   (17)

Y_{jt} = 0 or 1   for all j and t,   (18)

X_{ijt} \ge 0   for all i, j and t,   (19)

where Pp_ij, Po_j, Ph_i, Pd_ik and Ptc_0 are the tolerances of P_ij, O_j, H_i, D_ik and TC_0 respectively, and TC_0 is the aspiration level of total cost.
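In a GA fitness evaluation built on the crisp model (14)-(19), each candidate solution carries its own α, every fuzzy parameter is tightened by (1 − α) times its tolerance, and α itself is maximized subject to feasibility (a penalized value can be returned for infeasible chromosomes). A hedged sketch of the tolerance adjustment and of constraint (16) is given below; the helper names and data layout are assumptions:

```python
def adj(value, tolerance, alpha):
    """Tolerance-adjusted parameter used throughout (15)-(17): v - (1 - alpha) * tol."""
    return value - (1.0 - alpha) * tolerance

def demand_feasible(X, D, Pd, alpha, products, suppliers, periods):
    """Constraint (16): cumulative orders must cover the alpha-adjusted cumulative demand.
    periods is assumed to be range(T), i.e. 0-indexed."""
    for i in products:
        for t in periods:
            ordered = sum(X[i][j][k] for j in suppliers for k in periods[:t + 1])
            required = sum(adj(D[i][k], Pd[i][k], alpha) for k in periods[:t + 1])
            if ordered < required:
                return False
    return True
```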
4 Numerical Example

In this section we solve a numerical example of the above fuzzy model using a Real Parameter Genetic Algorithm. We consider a scenario with four products over a planning horizon of five periods, whose demand requirements are given in Table 1.

Table 1. Demands of four products over a planning horizon of five periods. In each cell, the left number indicates D_ik and the right number indicates Pd_ik.

Product   Period 1   Period 2   Period 3   Period 4   Period 5
A         (50,10)    (45,8)     (52,7)     (43,8)     (47,8)
B         (35,5)     (40,5)     (38,6)     (35,5)     (42,8)
C         (55,6)     (50,8)     (44,10)    (50,8)     (60,10)
D         (42,7)     (44,5)     (40,6)     (43,8)     (40,5)
There are three suppliers X, Y, Z and their prices are shown in Table 2.

Table 2. Price of four products by each of three suppliers X, Y, Z. In each cell, the left number indicates P_ij and the right number indicates Pp_ij.

Product     X          Y          Z
A           (100,8)    (105,10)   (95,6)
B           (80,5)     (80,8)     (78,5)
C           (95,6)     (100,8)    (94,10)
D           (75,5)     (74,5)     (70,6)
Table 3. Transaction cost of three suppliers X, Y, Z. In each cell, the left number indicates O_j and the right number indicates Po_j.

Supplier            X          Y          Z
Transaction cost    (500,35)   (550,20)   (480,30)
Table 4. Holding cost of four products A, B, C and D. In each cell, the left number indicates H_i and the right number indicates Ph_i.

Product         A        B         C         D
Holding cost    (5,1)    (4,0.5)   (5,0.8)   (3.8,0.4)
Also, we considered TC_0 and Ptc_0 as 80000 and 8000 respectively.

4.1 Upper Limits for Decision Variables in GAs

The upper limits for the decision variables X_ijt are obtained from equation (17) as:

X_{ijt} \le \sum_{k=t}^{T} \big( D_{ik} - (1-\alpha) Pd_{ik} \big) Y_{jt}   (20)

For the upper limits, we have considered Y_jt = 1 and \alpha = 1, therefore we have:

X_{ijt} \le \sum_{k=t}^{T} \big( D_{ik} + Pd_{ik} \big)   (21)
The results of applying the proposed method are shown in Table 5.

Table 5. Results for the model by GA. Only the decision variables X_ijt that take a nonzero value in at least one iteration are listed; all other X_ijt are zero in every iteration.

Variable   Iter. 90    Iter. 110   Iter. 150   Iter. 180   Iter. 190   Iter. 200
α          0.716       0.7585      0.772       0.802       0.8035      0.82
TC         82266.57    81542.63    81823.62    81547.44    81469.64    81326.81
X111       40          40          40          40          40          40
X113       53          53          53          53          53          53
X122       23          23          23          23          23          23
X125       3           3           1.5         3           3           3
X131       51          51          51          51          51          51
X132       42          42          42          42          42          42
X134       18          18          18          18          18          18
X211       14          14          14          2.8         3.36        3.36
X213       32          32          44.8        44.8        44.8        44.8
X222       26          26          26          26          26          26
X225       9           9           9           8.1         8.1         4.86
X231       50          50          50          50          50          50
X232       21          21          21          21          25.2        25.2
X234       31          31          31          31          31          31
X311       15          15          15          15          15          15
X313       22          22          22          22          22          22
X322       53          53          53          53          53          53
X325       38          38          30.4        30.4        30.4        30.4
X331       53          53          53          53          53          53
X332       39          39          39          39          39          39
X334       40          40          40          40          40          40
X411       52          52          52          52          52          52
X413       36          32.4        32.4        32.4        32.4        32.4
X422       9           1.8         1.8         2.5         2.5         3.02
X425       8           8           8           8           8           8
X431       18          18          18          18          18          18
X432       57          57          57          57          57          57
X434       33          33          33          33          31          33
In the fuzzy method, only one optimum solution is obtained, whereas the GA gives the decision maker a series of alternatives along with the optimum one. In all cases, an exact optimum solution is not always wanted. As the model has been formulated with vague parameters, the decision maker may choose the solution that suits him best with respect to resources, which would otherwise have to be augmented with difficulty.
5 Conclusions In this paper a multi-period inventory lot sizing scenario, where there are multiple products and multiple suppliers, is solved with a Real Parameter Genetic Algorithm. The proposed method does not require unrealistic assumption about the exact parameter for demand and costs. Therefore, in this paper the cost components and demand are considered as triangular fuzzy numbers. The model has been formulated as Fuzzy Mixed Integer Programming problem and then converted to equivalent crisp decision making problem and solved by a Real Parameter Genetic Algorithm. This methodology can be extended without or with some modifications to the complicated inventory and supply chain models, i.e. models with deterioration, discount, variable replenishment, etc. formulated in crisp, fuzzy or fuzzy-stochastic environments.
References 1. Deb, K.: Multi Objective Optimization Using Evolutionary Algorithms, first edn. John Wiley & Sons, LTD England (2001) 2. Davis, L. (Ed.): Genetic Algorithms and Simulated Annealing. Pitman Publishing, London (1987) 3. Goldberg, D. E.: Genetic Algorithms: Search, Optimization and Machine Learning. Addison Wesley, Massachusetts (1989)
4. Michalewicz, Z.: Genetic Algorithms + Data Structures = Evolution Programs. 3rd edn. Springer-Verlag, Berlin Heidelberg New York (1996) 5. Pal, S. K., De, S., Ghosh, A.: Designing Hopfield Type Networks Using Genetic Algorithms and Its Comparison with Simulated Annealing. Int. J. Patterns Recognition and Artificial Intelligence 11(1997) 447–461 6. Davis, L. (Ed.).: Handbook of Genetic Algorithms. Van Nostrand Reinhold New York (1991) 7. Michalewicz, Z., Janikow, C. Z.: Genetic Algorithms for Numerical Optimization. Statistics and Computing 1 (1991) 75–91 8. Gelsema, E. (Ed.).: Special Issue on Genetic Algorithms, Pattern Recognition Letters, 16 (8) (1995) 9. Guvenir, H. A.: A Genetic Algorithm for Multicriteria Inventory Classification. In Pearson, D. W., Steele, N. C. and Albrecht, R. F (Eds): Artificial Neural Nets and Genetic Algorithms, Proceedings of the Int. Conference, France, Springer-Verlag, Wien, ( 1995) 6-9 10. Mardle, S., Pasco, S., Tamiz, M.: An Investigation of Genetic Algorithm for the Optimization of Multi Objective Fisheries Bioeconomic Models. Technical report 136, Portsmouth, UK: Center for the Economics and Management of Aquatic Resources, University of Portsmouth (1998) 11. Bagchi, T.: Multi Objective Scheduling by Genetic Algorithm. Kluwer Academic Publishers Boston (1999) 12. Mondal, S., Maiti, M.: Multi-item Fuzzy EOQ Models Using Genetic Algorithm. Computers & Industrial Engineering 44 (2002) 105–117 13. Tang, J., Wang, D.: An Interactive Approach Based on a Genetic Algorithm for a Type of Quadratic Programming Problems with Fuzzy Objective and Resources. Computers and Operations Research, 24(5) (1997) 413–422 14. Tang, J., Wang, D.: A Non-Symmetric Model for Fuzzy non Linear Programming Problems with Penalty Coefficients. Computers and Operations Research, 24(8) (1997) 717–725 15. Rezaei, J., Davoodi, M.: Multi-item Fuzzy Inventory Model with Three Constraints: Genetic Algorithm Approach. In: Zhang, S., Jarvis, R. (eds.): The 18th Australian Joint Conference on Artificial Intelligence. Lecture Notes in Artificial Intelligence, Vol. 3809. Springer-Verlag, Berlin Heidelberg (2005) 1120–1125 16. Rosenthal, EC., Zydiak, JL., Chaudhry SS.: Vendor selection with bundling. Decision Sciences, 26 (1995) 35– 48 17. Chaudhry, SS., Frost, FG., Zydiak, JL.: Vendor selection with price breaks. European J. Operational Research, 70 (1993) 52– 66 18. Ganeshan, R.: Managing supply chain inventories: a multiple retailer, one warehouse, multiple supplier model. Int. J. Production Economics, 59 (1999)341–54 19. Kasilingam, RG., Lee, CP.: Selection of vendors—a mixed integer programming approach. Computers and Industrial Engineering, 31 (1996) 347–50 20. Jayaraman, V., Srivastava, R., Benton, WC.: Supplier selection and order quantity allocation: a comprehensive model. The J. Supply Chain Management, 35 (1999) 50–58 21. Basnet, Ch., Leung, J.M.Y.: Inventory lot-sizing with supplier selection. Computers & Operations Research, 32 (2005) 1–14 22. Kacprzyk, J., Staniewski, P.: Long Term Inventory Policy-Making through Fuzzy Decision Making Models. Fuzzy Sets and Systems 8 (1982) 117–132 23. Park, K. S.: Fuzzy Set theoretic Interpretation of Economic Order Quantity. IEEE Transactions on Systems, Man and Cybernetics 17 (1987) 1082–1087
24. Petrovic, D., Petrovic, R., Vujosevic, M.: Fuzzy Models for the Newsboy Problem. Int. J. Production Economics 45(1996) 435–441 25. Vujosevic, M., Petrovic, D., Petrovic, R.: EOQ Formula when Inventory Cost is Fuzzy. Production Economics 45 (1996)499–504 26. Chang, SC., Yao, J.S., Lee, H.M.: Economic Reorder Point for Fuzzy Backorder Quantity. European J. Operational Research 109 (1998)183–202 27. Lee, H.M., and Yao, J.Sh.: Economic Order Quantity in Fuzzy Sense for Inventory without Backorder Model. Fuzzy Sets and Systems 105(1999)13-31 28. Yao, J.Sh. Change, S.C.h., Su, J.Sh.: Fuzzy Inventory without Backorder for Fuzzy Order Quantity and Fuzzy Total Demand Quantity. Computers & Operations Research 27(2000)935-962 29. Yao, J.S.h., Lee, HM.: Fuzzy Inventory with Backorder for Fuzzy Order Quantity. Information Sciences 93(1996)283-319 30. Yao, J.Sh., Lee, HM.: Fuzzy Inventory with or without Backorder for Order Quantity with Trapezoid Fuzzy Number. Fuzzy Sets and Systems 105(1999)311-337 31. Gen, M., Tsujimura. Y., Zheng, D.: An Application of Fuzzy Set Theory to Inventory Control Models. Computers and industrial Engineering 33(1997)553-556 32. Fatemi Ghomi, S.M.T., Rezaei, J.: Development of a Fuzzy Inventory Control Model with a Real Application. Amirkabir 14(55-D)(2003)924-936 33. Zimmermann, H.J.: Description and Optimization of Fuzzy Systems, Int. J General Systems 2 (4) (1976) 209–215
A Self-tuning Emergency Model of Home Network Environment Huey-Ming Lee, Shih-Feng Liao, Tsang-Yean Lee, Mu-Hsiu Hsu, and Jin-Shieh Su Department of Information Management, Chinese Culture University 55, Hwa-Kung Road, Yang-Ming-San, Taipei (11114), Taiwan [email protected], [email protected]
Abstract. In this paper, we propose a self-tuning emergency model of the home network environment (SEMHNE). This model can not only tune the scaling factors and membership functions to fit the home network environment but also detect emergency events automatically. There are three modules in this model, namely the emergency report module (ERM), the renewable emergency rule base (RERB), and the evolutionary database (EDB). ERM determines the emergency situations by fuzzy inference and sends warning messages to the users. RERB provides rules to ERM for inference. EDB performs self-tuning by using a genetic algorithm and provides information to ERM for inference. Via this model, our home network environment will become more reliable and safe.
1 Introduction Since information appliances (IAs) have become available to all in recent years, there are more and more varied IA products appeared. IA plays an important role in home environment. In home network environment, an IA control mechanism can provide the fine control capability of IA devices. Lee and Huang [8] proposed an IA controlling model (IACM), which can control IA devices through home management broker. Lee et al. [7] came up with the idea of IAs intelligent agent model (IAIA), making home environments more comfortable and convenient. Lee et al. [11] proposed a fuzzy neural network model of information appliances with the functions of self-learning and fuzzy inference, it enables IAIA to maximize efficiency of IAs in a more humane way. Lee et al. [12] proposed an intelligent control model of information appliances (ICMAI) which can not only collect the related users’ information appliances preference messages automatically, but also generate the IA control rules by the fuzzy neural network learning. Lee et al. [9] proposed an adaptive exception process model of information appliances (AEPMIA) which can adjust the membership functions to make the reasoning more approximately. Lee et al. [10] proposed an emergency model of home network environment based on genetic algorithm. This model can not only adapt the home network environment by using genetic algorithm but also detect the emergency events automatically. If there is a mechanism which can do the active response of emergency, then we can prevent the serious accident in home network environment. M. Ali and R. Dapoigny (Eds.): IEA/AIE 2006, LNAI 4031, pp. 1111 – 1118, 2006. © Springer-Verlag Berlin Heidelberg 2006
In this study, we propose a self-tuning emergency model of the home network environment (SEMHNE). There are three modules in this model, namely the emergency report module (ERM), the renewable emergency rule base (RERB), and the evolutionary database (EDB). RERB provides rules to ERM for inference. ERM determines the emergency situations by fuzzy inference. EDB can adjust the database by using a genetic algorithm and provide information to ERM for inference. Via this model, our home network environment will become more reliable and safe.
2 SEMHNE

In this section, we present a self-tuning emergency model of information appliances (SEMHNE) under the supervision of IAIA [7], as shown in Fig. 1. There are three modules in SEMHNE, namely the emergency report module (ERM), the renewable emergency rule base (RERB), and the evolutionary database (EDB), as shown in Fig. 2.

[Figure: IAIA supervising Sensors, IACM, Devices, SEMHNE, Users and IACMIA.]
Fig. 1. SEMHNE (dotted line is normal event)

[Figure: SEMHNE comprising the Renewable Emergency Rule Base Module, the Emergency Report Module and the Evolutionary Database, connected to IACM, Devices, Users and IACMIA.]
Fig. 2. Architecture of SEMHNE (dotted line is normal event)
The functions of these modules are as follows:
♦ RERB: It includes the renewable rule base, which is provided by experts, and can provide rules to ERM for inference.
♦ ERM: After receiving the messages from RERB and EDB, ERM determines the emergency events by using fuzzy inference. Depending on the inference results, it will send a warning message to the users or let the event pass by.
♦ EDB: It can tune the scaling factors and membership functions in the database by using a genetic algorithm and provide the related messages to ERM for inference.
2.1 ERM

After receiving the messages from IACM, ERM determines the emergency events by fuzzy inference and sends the warning message to users. There are two components in this module, namely the fuzzy inference component (FIC) and the message report component (MRC). FIC reasons about the emergency events by fuzzy inference. If an emergency event occurs, MRC starts the devices up and sends warning messages to remote users. If the events are not emergent, FIC lets the messages pass by, as shown in Fig. 3.

[Figure: ERM, containing FIC and MRC, connected to IACM, Devices, RERB, EDB, Domain Experts, Users and IACMIA.]
Fig. 3. Architecture of ERM (dotted line is normal event)
2.2 RERB

There are two components in RERB, namely the update rules component (URC) and the emergency rule base (ERB). URC allows ERB to be updated by domain experts. ERB provides rules for ERM to reason about the emergency events, as shown in Fig. 4.

[Figure: RERB, containing URC and ERB, connected to ERM and Domain Experts.]
Fig. 4. Architecture of RERB
2.3 EDB

EDB comprises three components, namely the genetic tuning processor (GTP), the reasoning database (RDB) and the training database (TDB). When TDB receives the environment information from ERM, GTP can tune the scaling factors and membership functions in the RDB by using a genetic algorithm, as shown in Fig. 5.

[Figure: EDB, containing GTP, RDB and TDB, connected to ERM.]
Fig. 5. Architecture of EDB
2.3.1 Tuning Membership Function Process

In order to fit the real situation in the home network environment, the proposed model tunes the membership functions by using a genetic algorithm [1, 3]. Since the shape of the membership function is trapezoidal, the membership function is parameterized by the left, right and the two central points [3]; we encode the four points as a genotype in binary code, as shown in Fig. 6.

[Figure: a chromosome composed of gene1-gene4, binary strings encoding the four points S1-S4.]
Fig. 6. Initial population encoding approach
The trapezoidal fuzzy number is defined by four parameters (S1, S2, S3, S4), with S_k ∈ [S_k^l, S_k^r] for k = 1, 2, 3, 4 [3], where

S_i^l = S_{i-1}^r = (S_{i-1} + S_i) / 2   for i = 2, 3, 4,

S_1^l = S_1 − (S_2 − S_1) / 2,   S_4^r = S_4 + (S_4 − S_3) / 2.

Every gene is produced randomly from the respective interval of its parameter. Then, we can start the genetic operators to optimize the membership functions.
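An illustrative sketch of the encoding just described: each trapezoid point S_k is drawn at random from its interval [S_k^l, S_k^r] and stored as a fixed-length binary gene. The gene length and the linear decoding scheme are assumptions, since the paper does not state them:

```python
import random

BITS = 8  # assumed gene length

def intervals(S):
    """Variation intervals [S_k^l, S_k^r] for the four trapezoid points (S1, S2, S3, S4)."""
    S1, S2, S3, S4 = S
    left  = [S1 - (S2 - S1) / 2, (S1 + S2) / 2, (S2 + S3) / 2, (S3 + S4) / 2]
    right = [(S1 + S2) / 2, (S2 + S3) / 2, (S3 + S4) / 2, S4 + (S4 - S3) / 2]
    return list(zip(left, right))

def encode(value, lo, hi):
    """Map a real value in [lo, hi] to a BITS-bit binary gene."""
    level = round((value - lo) / (hi - lo) * (2 ** BITS - 1))
    return format(level, f"0{BITS}b")

def random_chromosome(S):
    """One chromosome = four genes, each drawn randomly from its own interval."""
    return "".join(encode(random.uniform(lo, hi), lo, hi) for lo, hi in intervals(S))
```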
2.3.2 Adjusting Scaling Factors Process

In order to obtain a suitable database, we can adjust the scaling factors by using a genetic algorithm [1-3]. The adjusting of scaling factors can be regarded as a global tuning procedure. The reasoning steps are as shown in Fig. 7.

[Figure: fuzzy inference loop with the rule base; inputs scaled by SFi1 and SFi2, fuzzification, reasoning, defuzzification, and output scaled by SFo3.]
Fig. 7. Fuzzy inference with scaling factors [6]
Nomenclature:
SFi1: input scaling factor of parameter 1.
SFi2: input scaling factor of parameter 2.
SFo3: output scaling factor of parameter 3.

We set the initial scaling function as Eq. (1):

f(x) = αx + β   (1)

The initial values are α = 1 and β = 0; the tuning then adjusts them to fit the real situation.
3 Model Implementation

In this section, we implement the model in order to demonstrate its efficacy.

3.1 Practical Environment

For ease of manipulation, cross-platform support, and remote-control capability, we adopted a Web server structure written with Java Server Pages (JSP) and Java Servlets; the Java 2 Platform, Standard Edition, v1.4.2 API Specification and MATLAB 7.0 are utilized for constructing the SEMHNE prototype. The above are run on a Pentium 4 1.6 GHz desktop powered by Windows Professional and Microsoft Access 2002.

3.2 Implementation

We take the membership function of temperature as an example. We set the parameters as follows:
♦ Population size: 30
♦ Probability of crossover: Pc = 0.6
♦ Probability of mutation: Pm = 0.3
♦ Selection: roulette wheel
♦ Elite: enabled

The membership function before tuning is shown in Fig. 8 and the tuned one is shown in Fig. 9.
Fig. 8. Membership function before tuning
Fig. 9. Membership function after tuning
In the next stage, we tune the scaling factors for a multiple-input single-output (MISO) system. The scaling factor tuning settings are shown in Table 1.

Table 1. MISO system

Parameter      Labels   Type
Parameter 1    3        input
Parameter 2    5        input
Parameter 3    5        output
♦ Population size: 200
♦ Probability of crossover: Pc = 0.8
♦ Probability of mutation: Pm = 0.1
♦ Selection: roulette wheel
♦ Elite: enabled

The tuning results are shown in Table 2.

Table 2. Scaling factors tuning result

Scaling factor    α (before training)   α (after training)   β (before training)   β (after training)
Parameter 1       1                     1.21                 0                     0.25
Parameter 2       1                     1.12                 0                     0.66
Parameter 3       1                     1.15                 0                     0.03

After tuning, our input and output scaling factors are more suitable for the real situation, and the reasoning is closer to the real case.
4 Conclusion

At present, more and more varied IA products are appearing in the home network environment. Therefore, an emergency process mechanism is highly significant for the home network environment. In this study we proposed a self-tuning emergency model of the home network environment (SEMHNE). This model can not only perform self-tuning to fit the home network environment but also detect emergency events automatically. Via this model, our home network environment will become more reliable and safe.
Acknowledgement This work was supported in part by the National Science Council, Republic of China, under grant NSC94-2745-M-034-008-URD.
References 1. Arslan, A., Kaya, M: Determination of fuzzy logic membership functions using genetic algorithm. Fuzzy sets and systems Vol. 118, (2001) 297-306 2. Cordón, O., Herrera, F.: A Three –Stage Evolutionary Process for Learning Descriptive and Approximate Fuzzy-Logic-Controller Knowledge Bases From Examples. International Journal of Approximate Reasoning, Vol.17 (1997) 369-407 3. Cordón, O., Herrera, F., Hoffmann, F., Magdalena, L.: Genetic Fuzzy Systems. World Scientific Publishing Co. (2001) 4. Herrera, F., Lozano, M., and Verdegay, J.L.: Tuning Fuzzy Logic Controller by Genetic Algorithm. International Journal of Approximate Reasoning, Vol.12 (1995) 299-315 5. Jung, C.-H., Ham, C.-S., Lee, K.-I.: A real-time self-tuning fuzzy controller through scaling factor adjustment for the steam generator of NPP. Fuzzy sets and systems Vol. 74 (1995) 53-60 6. Ju, M.-S., Yang, D.-L.: Design of adaptive fuzzy controls based on natural control laws. Fuzzy sets and systems Vol. 81 (1996) 191-204 7. Lee, H.-M., Chen, Y.-C., Chen, J.-J.: The Intelligent Agent Design of Information Appliance. JCIS, 2003, Proceeding of the 7th Join Conference on Information Sciences, Cary. NC. USA, (2003) 1681-1684 8. Lee, H.-M., Huang, J.-H.: The study of IA devices monitoring model, The sixth seminar of the research and practices of information management, (2002) 430-437 9. Lee, H.-M., Liao, S.-F., and Lee, S.-Y.: An Adaptive Exception Process Model of Information Appliances. WSEAS TRANSACTIONS on INFORMATION SCIENCE & APPLICATIONS, Issue 2, Vol. 1 (2004) 778-783 10. Lee, H.-M., Liao, S.-F., and Lee, S.-Y.: An Emergency Model of Home Network Environment Based on Genetic Algorithm. Knowledge-Based Intelligent Information & Engineering Systems, Lecture Note in Artificial Intelligent, Part II, Springer-Verlag, Berlin Heidelberg, LNAI 3682 (2005) 1245-1251 11. Lee, H.-M., Mao, C.-H., and Lee, S.-Y.: A Fuzzy Neural Network of Information Appliance. International Workshop on Fuzzy System & Innovation Computing 2004 (FIC2004), Kitakyushu, Japan, (2004)
12. Lee, H.-M., Mao, C.,-H., Lee, S.-Y.:Intelligent Control Model of Information Appliances. Negoita, M.-G, Howlett, R.-J., Jain, L.-C. (Eds.) Knowledge-Based Intelligent Information and Engineering Systems, Part III, Springer-Verlag, Berlin Heidelberg, LNAI 3215 (2004) 123-128 13. Michalewicz, Z.: Genetic Algorithms + Data Structures = Evolution Programs. 3rd edn. Springer-Verlag, Berlin Heidelberg New York (1996) 14. Randy, L.H., Sue, E.H.: Practical Genetic Algorithm. A Wiley-Interscience Publication (1998)
Simulation Studies of Two-Layer Hopfield Neural Networks for Automatic Wafer Defect Inspection Chuan-Yu Chang, Hung-Jen Wang, and Si-Yan Lin Department of Electronic Engineering, National Yunlin University of Science and Technology, Taiwan [email protected]
Abstract. The occurrence of defects on a wafer may reduce the yield ratio. The defective regions are usually identified through visual judgment with the aid of a scanning electron microscope. Dozens of people visually check wafers and hand-mark their defective regions, leading to a significant personnel cost. In addition, potential misjudgment may be introduced due to human fatigue. In this paper, a two-layer Hopfield neural network called the competitive Hopfield wafer-defect detection neural network (CHWDNN) is proposed to detect the defective regions of wafer images. The CHWDNN extends the one-layer 2D Hopfield neural network at the original image plane to a two-layer 3D Hopfield neural network with defect detection implemented on its third dimension. With the extended 3D architecture, the network is capable of incorporating a pixel's spatial information into the pixel-classifying procedure. The experimental results show that the CHWDNN successfully identifies the defective regions on wafer images with good performance.
1 Introduction Wafer defect inspection is an essential process before die packaging. The aim of the wafer defect inspection is to detect defective dies and discard them. In general, the inspection methods can be categorized into two types: post-sawing inspection and electric inspection. Electric inspection requires additional space for testing circuits and that way causes the increase of unit cost. With the help of a scanning electron microscope, the defective regions are identified through visual judgment and this is the post-sawing inspection method. Wafers are checked and the defective regions are hand-marked by engineers [1]. However, not only this process can result in considerable personnel costs, but also human fatigue may lead to potential misjudgment. A number of methods were developed for wafer defect inspection [1-4]. Tobin et al. [2] proposed applying image process techniques in co-ordination with database systems for the inspection of wafer defect. It compares the images of the tested dies with those of good ones to find out the defective dies. By this means, however, a variety of templates for different circuits and die images for further processing must be created and therefore loses the flexibility. In addition, incorrect templates may result in mistaken inspection results. Mital et al. [3] suggested an automatic wafer pattern inspection system. The inspection system was based on a knowledge database. The knowledge database regulates the geometry and aspects of the assorted regions of M. Ali and R. Dapoigny (Eds.): IEA/AIE 2006, LNAI 4031, pp. 1119 – 1126, 2006. © Springer-Verlag Berlin Heidelberg 2006
the wafer images. This implies that various patterns need to be collected. Zang et al. [4] develop an automatic post-sawing inspection making use of computer vision techniques. This method is limited to boundary defect problems such as die cracks. Besides, this approach would spend thirty minutes per wafer on inspection time. Su et al. [1] suggested a neural network approach for semiconductor wafer defect inspection. Three neural learning methods including back propagation (BP), radial basis function (RBF), and learning vector quantization (LVQ) networks, were proposed as well as tested. Nevertheless, all these three neural networks are based on supervised-learning. To be precise, in order to use these neural networks, patterns such as specific defect patterns, predetermined regions, and die boundaries should have be learned. Accordingly, the product flexibility is low. To achieve an efficient inspection, a minor border cut and complex orientation adjusting processes were utilized. These are rather complex and time-consuming processes. Recently, Chang et al. [5] proposed a two-layer Hopfield based neural network (CHEFNN) by including pixel’s surrounding spatial information into image edge detection. Therefore, inspired by the concept of CHEFNN, a specific two-layer competitive Hopfield neural network called the competitive Hopfield wafer-defect detection neural network (CHWDNN) is presented for detecting the defective regions of wafer image. With the extended 3-D architecture, the network is capable of incorporating a pixel’s spatial information into a pixel-classifying procedure. According to the variance of gray level and sharp irregularity, the CHWDNN classifies the pixel of wafer images into two classes (defect or not). The CHWDNN is zoomed-independent, orientation- independent, i.e. the proposed CHWDNN does not require the zoom parameters and patterns of good dies. In addition, during the inspection process, the proposed CHWDNN does not to adjust image positions and correct circuit templates. Experimental results show the effectiveness and efficiency of the proposed CHWDNN.
2 Two-Layer Competitive Hopfield Neural Network

Wafer defect detection can be considered as a pixel-classifying process that assigns pixels to defect points in accordance with their gray-level and spatial information. In this paper, we apply the two-layer Hopfield neural network architecture and propose a Competitive Hopfield Wafer-defect Detection Neural Network (CHWDNN) for automatic wafer defect inspection, which considers both the local gray-level variance and the neighbouring spatial information. The architecture of the CHWDNN is shown in Fig. 1. The network achieves a stable state when the energy of the Lyapunov function is minimized. The layers of the CHWDNN represent the state of each pixel, indicating whether the pixel is a defect point. A neuron V_{x,i,1} in layer 1 in a firing state indicates that the pixel located at (x,i) in the image is defined as a defect point, and a neuron V_{y,j,2} in layer 2 in a firing state indicates that the pixel located at (y,j) in the image is identified as a nondefect point. In order to ensure that the CHWDNN has the capability of dealing with spatial information in wafer defect detection, the energy function of the CHWDNN must satisfy the following conditions:
[Figure: the two-layer network, with neurons V_{x,i,1} in layer 1 (defect layer) and V_{x,i,2} in layer 2 (non-defect layer).]
Fig. 1. The architecture of the CHWDNN
First, in the nondefect layer, the normal regions are smoother than the boundaries of the defective regions, and the contrast of the normal regions is lower than that of the defect regions. Let g_{x,i} and g_{y,j} denote the gray levels of pixels (x,i) and (y,j), respectively. Therefore, assume that a nondefect pixel (x,i) with V_{x,i,2} = 1 and its surrounding nondefect pixels (y,j) within the neighborhood of (x,i) with V_{y,j,2} = 1 have the minimum Euclidean distance measure. This condition can be characterized as follows:

\sum_{x=1}^{N} \sum_{y=1}^{N} \sum_{i=1}^{N} \sum_{\substack{j=1 \\ (y,j) \neq (x,i)}}^{N} \Phi^{p}_{x,i}(y,j)\,\big(d_{x,i;y,j} + C^{n}_{y,j}\big)\, V_{x,i,2} V_{y,j,2}   (1)

where d_{x,i;y,j} is the normalized difference between g_{x,i} and g_{y,j}, defined by

d_{x,i;y,j} = \left[ \frac{g_{x,i} - g_{y,j}}{MAX(G)} \right]^{2},   (2)

where MAX(G) denotes the largest gray level of the wafer image. The function C^{n}_{y,j} denotes the contrast value within the n × n area centered at pixel (y,j), defined by

C^{n}_{y,j} = \sum_{s=0}^{MAX(G)} \sum_{t=0}^{MAX(G)} |s - t| \, M^{n}_{y,j}(s,t),   (3)

where M^{n}_{y,j} denotes the normalized gray-level co-occurrence matrix computed within an n × n area centered at pixel (y,j). The function \Phi^{p}_{x,i}(y,j) is a neighborhood function used to specify whether pixel (y,j) is located within a p × p window area centered at pixel (x,i). The function is defined as
\Phi^{p}_{x,i}(y,j) = \sum_{l=-p}^{p} \delta_{j,i+l} \sum_{m=-p}^{p} \delta_{y,x+m},   (4)

where \delta_{i,j} is the Kronecker delta function given by

\delta_{i,j} = \begin{cases} 1 & i = j \\ 0 & i \neq j, \end{cases}   (5)
With this definition, \Phi^{p}_{x,i}(y,j) gives the value 1 if (y,j) is located inside the window area centered at (x,i), and zero otherwise. In Eq. (1), V_{x,i,2} and V_{y,j,2} are used to restrict the local gray-level difference and the local contrast to be computed only for pixels labelled by the nondefect layer.
Second, in the non-defect layer, the normal regions tend to form a regular pattern and are relatively bright compared with the defective regions. To obtain more uniform and smooth non-defective regions, this can be represented as:

\sum_{x=1}^{N} \sum_{y=1}^{N} \sum_{i=1}^{N} \sum_{\substack{j=1 \\ (y,j) \neq (x,i)}}^{N} \Phi^{q}_{x,i}(y,j)\,(1 - \bar{g}_{x,i})\, V_{x,i,2} V_{y,j,2},   (6)
where \bar{g}_{x,i} is the normalized g_{x,i}, defined by

\bar{g}_{x,i} = \left[ \frac{g_{x,i}}{MAX(G)} \right],   (7)
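To make the quantities in Eqs. (2), (3) and (7) concrete, the sketch below computes the normalized gray-level difference, the local contrast from a normalized gray-level co-occurrence matrix, and the normalized gray level for one pair of pixels; the horizontal-adjacency pairing used to build the co-occurrence matrix and the function and variable names are assumptions, not taken from the paper:

```python
import numpy as np

def local_quantities(img, x, i, y, j, n=3, max_g=255):
    """Illustrative sketch of d_{x,i;y,j} (Eq. 2), C^n_{y,j} (Eq. 3) and g-bar_{x,i} (Eq. 7)."""
    # Eq. (2): normalized squared gray-level difference between pixels (x,i) and (y,j)
    d = ((float(img[x, i]) - float(img[y, j])) / max_g) ** 2

    # Normalized co-occurrence matrix M^n_{y,j} over the n x n window centred at (y,j),
    # built here from horizontally adjacent pixel pairs (an assumed pairing).
    half = n // 2
    win = img[max(y - half, 0):y + half + 1, max(j - half, 0):j + half + 1]
    M = np.zeros((max_g + 1, max_g + 1))
    for r in range(win.shape[0]):
        for c in range(win.shape[1] - 1):
            M[int(win[r, c]), int(win[r, c + 1])] += 1
    if M.sum() > 0:
        M /= M.sum()

    # Eq. (3): contrast C^n_{y,j} = sum_{s,t} |s - t| M^n_{y,j}(s,t)
    s_idx, t_idx = np.indices(M.shape)
    contrast = np.sum(np.abs(s_idx - t_idx) * M)

    # Eq. (7): normalized gray level of pixel (x,i)
    g_bar = float(img[x, i]) / max_g
    return d, contrast, g_bar
```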
Third, in the defect layer, it is easy to perceive that the defective regions tend to be dark and concentrated. Therefore, to obtain more complete and stronger defective regions, the criterion is defined by

\sum_{x=1}^{N} \sum_{y=1}^{N} \sum_{i=1}^{N} \sum_{j=1}^{N} \Phi^{r}_{x,i}(y,j)\,\big[ \bar{g}_{y,j} + (1 - C^{n}_{y,j}) \big]\, V_{x,i,1} V_{y,j,2},   (8)
In addition to the constraints mentioned above, the CHWDNN needs to satisfy the following two hard conditions to obtain correct defect detection results:
1. Each pixel can be assigned one and only one label (defect or not):

\sum_{z=1}^{2} V_{x,i,z} = 1.   (9)

2. All pixels need to be classified. Therefore, the sum over all classified pixels must be

\sum_{z=1}^{2} \sum_{x=1}^{N} \sum_{i=1}^{N} V_{x,i,z} = N^{2}.   (10)
The hard constraints can be completely removed by the winner-take-all rule as

V_{x,i,k} = \begin{cases} 1 & \text{if } V_{x,i,k} = \max\{V_{x,i,1}, V_{x,i,2}\} \\ 0 & \text{otherwise.} \end{cases}   (11)
Therefore, the objective function of the network for defect detection is obtained as

E = \frac{A}{2} \sum_{k=1}^{2} \sum_{z=1}^{2} \sum_{x=1}^{N} \sum_{y=1}^{N} \sum_{i=1}^{N} \sum_{\substack{j=1 \\ (y,j) \neq (x,i)}}^{N} \Big[ \Big( \frac{A}{2} \Phi^{p}_{x,i}(y,j)\,\big(d_{x,i;y,j} + C^{n}_{y,j}\big) + \frac{B}{2} \Phi^{q}_{x,i}(y,j)\,(1 - \bar{g}_{x,i}) \Big) \delta_{z,2}\,\delta_{k,2} + \frac{C}{2} \big( \bar{g}_{y,j} + (1 - C^{n}_{y,j}) \big) \delta_{z,2}\,\delta_{k,1}\,\Phi^{r}_{x,i}(y,j) \Big] V_{x,i,k} V_{y,j,z}.   (12)
Comparing the objective function of the CHWDNN in Eq. (12) with the Lyapunov function of the two-layer Hopfield network, the total input to neuron (x,i,k) is

Net_{x,i,k} = -\frac{1}{2} \sum_{z=1}^{2} \sum_{y=1}^{N} \sum_{j=1}^{N} \Big[ \Big( \frac{A}{2} \Phi^{p}_{x,i}(y,j)\,\big(d_{x,i;y,j} + C^{n}_{y,j}\big) + \frac{B}{2} \Phi^{q}_{x,i}(y,j)\,(1 - \bar{g}_{x,i}) \Big) \delta_{z,2}\,\delta_{k,2} + \frac{C}{2} \big( \bar{g}_{y,j} + (1 - C^{n}_{y,j}) \big) \delta_{z,2}\,\delta_{k,1}\,\Phi^{r}_{x,i}(y,j) \Big] V_{y,j,z}.   (13)
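The following is a minimal, illustrative sketch (not the authors' implementation) of how the network states can be relaxed: for each pixel the net inputs of the defect and non-defect neurons are accumulated from Eq. (13)-style terms over a neighbourhood, and the winner-take-all rule of Eq. (11) keeps exactly one layer firing per pixel. A single shared window size p and a synchronous sweep are simplifying assumptions.

```python
import numpy as np

def chwdnn_iterate(d, contrast, g_bar, A=1.0, B=1.0, C=1.0, p=1, iters=20):
    """Sketch of the CHWDNN relaxation, assuming the pairwise quantities d[x,i,y,j] (Eq. 2),
    contrast[y,j] (Eq. 3) and g_bar[x,i] (Eq. 7) are precomputed arrays."""
    N = g_bar.shape[0]
    V = np.zeros((N, N, 2))          # V[x,i,0] = defect layer, V[x,i,1] = non-defect layer
    V[:, :, 1] = 1.0                 # start with every pixel labelled non-defect
    for _ in range(iters):
        for x in range(N):
            for i in range(N):
                net = np.zeros(2)
                for y in range(max(x - p, 0), min(x + p + 1, N)):
                    for j in range(max(i - p, 0), min(i + p + 1, N)):
                        if (y, j) == (x, i):
                            continue
                        # Eq. (13)-style contributions, gated to non-defect neighbours (z = 2)
                        t_nondef = (A / 2) * (d[x, i, y, j] + contrast[y, j]) \
                                   + (B / 2) * (1 - g_bar[x, i])
                        t_def = (C / 2) * (g_bar[y, j] + (1 - contrast[y, j]))
                        net[1] += -0.5 * t_nondef * V[y, j, 1]   # input to non-defect neuron
                        net[0] += -0.5 * t_def * V[y, j, 1]      # input to defect neuron
                # Eq. (11): winner-take-all between the two layers
                V[x, i, :] = 0.0
                V[x, i, int(np.argmax(net))] = 1.0
    return V[:, :, 0]                # 1 marks pixels classified as defects
```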
3 Experimental Results

In order to verify that the proposed CHWDNN can precisely detect defective regions in wafer images, four types of zoomed wafer images (50x and 100x) were used for testing. The image size is 640 × 480 pixels, with each pixel having 256 gray levels. The experiments were implemented in C++ Builder on an Intel Pentium IV 2.8 GHz processor with 256 MB of system memory. Because of space limitations, only two cases of different wafer patterns are demonstrated in Figs. 2(a-b). To demonstrate that the proposed CHWDNN has a good capability of detecting defective regions, it is compared with the dynamic thresholding method [6], k-means clustering, and the CHNN method [7]. In the evaluations, the number of clusters is set to 4 for each clustering method to obtain the most appropriate detection results. The number of clusters is determined by trial and error. Figures 3-4 show the detection results for Fig. 2 by the various methods. Obviously, the contamination was detected correctly by the CHWDNN. Since the color of the bounding pad areas is similar to the contaminations, the other methods could not detect the defective regions correctly, and many over-detected regions can be observed.
Fig. 2. Two wafer images with defective regions
Fig. 3. Defect inspection results of Fig. 2(a) by (a) proposed CHWDNN, (b) CHNN, (c) k-means, and (d) dynamic thresholding
In order to evaluate the performance of the proposed CHWDNN, we use two measures defined in Eq. (14) [8]:

Sensitivity = \frac{N_{tp}}{N_{p}}   and   Specificity = \frac{N_{tn}}{N_{n}}   (14)

Let N_p be the total block number of positive ROIs (regions of interest) and N_n denote the total block number of negative ROIs. Let N_tp be the block number of ROIs that contain defective regions and are actually detected. In addition, N_fp is the block number of ROIs that contain no defective regions but are falsely detected. Similarly, the true negative number (N_tn) and false negative number (N_fn) can be defined by N_tn = N_n − N_fp and N_fn = N_p − N_tp, respectively.
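A small sketch of Eq. (14) computed over hand-marked and detected ROI blocks; the set-based bookkeeping is an assumption made for clarity:

```python
def block_metrics(positive_blocks, detected_blocks, all_blocks):
    """Sketch of Eq. (14); block sets are collections of (row, col) ROI indices."""
    positive = set(positive_blocks)          # hand-marked defective ROIs (N_p of them)
    detected = set(detected_blocks)          # ROIs flagged by the detector
    negative = set(all_blocks) - positive    # N_n = total ROIs - N_p

    n_tp = len(positive & detected)          # defective ROIs that were detected
    n_fp = len(detected - positive)          # clean ROIs that were falsely detected
    n_tn = len(negative) - n_fp              # N_tn = N_n - N_fp
    sensitivity = n_tp / len(positive) if positive else 1.0
    specificity = n_tn / len(negative) if negative else 1.0
    return sensitivity, specificity
```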
Fig. 4. Defect inspection results of Fig. 2(b) by (a) proposed CHWDNN, (b) CHNN, (c) k-means, and (d) dynamic thresholding
[Figure: sensitivity and specificity values plotted against block size (32x32, 16x16, 8x8, 4x4, 2x2, 1x1).]
Fig. 5. The sensitivity and specificity values obtained from different block sizes

Table 1. Quantitative comparison results (block size is 32 × 32)

            CHWDNN                      CHNN                        k-means                     Dynamic thresholding
            sensitivity   specificity   sensitivity   specificity   sensitivity   specificity   sensitivity   specificity
Fig.2(a)    1             0.997         1             0.0356        1             0.551         1             0.00315
Fig.2(b)    1             0.994         1             0.57          1             0.592         1             0.11824
For quantitative evaluation, both the positive ROIs and the detection result are divided into M × M blocks. One block represents one ROI. If the dimension of the wafer image is N × N, then the image can be divided into (N/M)² ROIs. During the measuring procedure, positive ROIs are hand-marked by an experienced testing engineer, and
1126
C.-Y. Chang, H.-J. Wang, and S.-Y. Lin
the detected ROIs are those identified by the proposed CHWDNN. Table 1 shows the quantitative evaluation of the wafer detection results. Due to the over-detection phenomenon, all the compared methods have a sensitivity value of 1. However, their corresponding specificity values are quite low. On the contrary, the average sensitivity and specificity over the two wafer images for the CHWDNN are 1 and 0.996, respectively. Figure 5 shows the sensitivity and specificity under different block sizes. Obviously, we can obtain high sensitivity and specificity values under larger block sizes, such as 32 × 32 and 16 × 16.
4 Conclusion For post-sawing wafer inspection, the defective regions are usually identified through visual judgment with the aid of a scanning electron microscope. Engineers check wafers visually and hand-mark the defective areas. Therefore, potential misjudgment may be introduced due to human fatigue. In addition, the process can incur significant personnel costs. In this paper, a two-layer Hopfield neural network called the competitive Hopfield wafer-defect detection neural network (CHWDNN) is presented for detecting the defective regions automatically. The CHWDNN is capable of incorporating pixel’s spatial information into a pixel-classifying procedure. According to the local gray level variance and the neighbor-spatial information, the CHWDNN classified the wafer images into two classes (defect or not). The experimental results show the CHWDNN successfully identifies the defective regions on wafers images with good performance.
References 1. Chao-To Su, Taho Yang, Chir-Mour Ke, “Neural-network approach for Semiconductor wafer post-sawing inspection,” IEEE Trans. Semi. Man. , vol. 15, No. 2, (2002) 260-266 2. Kenneth W. Tobin, Jr., Thomas P. Karnowski, Fred Lakhani, “Integrated applications of inspection data in the semiconductor manufacturing environment,” SPIE, Metrology-based Control for Micro-Manufacturing, vol. 4275, (2001) 31-40 3. Mital, D.P. Teoh, E.K. “Computer based wafer inspection system,” Proceeding of International Conference on Industrial Electronics, Control and Instrumentation, vol. 3, (1991) 2497-2503 4. J.M. Zang, R.M. Lin, and M.J. Wang, “The development of an automatic post-sawing inspection system using computer vision techniques,” Computers in Industry, vol. 30, (1999) 51-60 5. Chuan-Yu Chang and Pau-Choo Chung, “Two layer competitive based Hopfield neural network for medical image edge detection,” Optical Engineering , vol. 39, no. 3, (2000) 695-703 6. M. Cheriet, J. N. Said, and C. Y. Suen, “A recursive thresholding technique for image segmentation,” IEEE Trans. Image Processing, Vol. 7,(1998) 918-921 7. J. S. Lin, K. S. Cheng and C. W. Mao, “The application of competitive Hopfield neural network to medical image segmentation,” IEEE Trans. Med. Imag., vol. 15, (1996) 560-567 8. M. G. Penedo, M. J. Carreria, Antonio Mosquera, and D. Cabello, “Computer-Aided Diagnosis: A Neural-Network-Based Approach to Lung Nodule Detection”, IEEE Trans. Med. Imag., vol. 17, (1998) 872-880
Supporting Dynamic Supply Networks with Agent-Based Coalitions

Chattrakul Sombattheera and Aditya Ghose
Decision Systems Lab, School of IT and Computer Science, Faculty of Informatics, University of Wollongong, NSW 2500, Australia
{cs50, aditya}@uow.edu.au
Abstract. This work extends our previous work to offer richer support for collaboration across supply networks. The roles of agents are divided into three sectors at any point in time: buyers, sellers and LPs. Agents take two steps to form coalitions: i) agents in each sector sequentially form primary coalitions in order to increase bargaining power, selling capacity or service efficiency, and ii) agents form secondary coalitions across sectors in order to finalize the deal and deliver goods to buyers. We propose a negotiation protocol and a deliberation mechanism. The negotiation protocol allows thorough communication among agents within and across sectors. The deliberation mechanism allows agents to identify potential coalition members and attractive payoffs for them. We provide examples of how these help agents form coalitions successfully.
1 Introduction
The current business environment is characterized by large, complex supply networks that are often global in reach and that are highly adaptive, being frequently re-configured to respond to dynamic business contexts. Collaboration across supply networks is widely recognized as a key prerequisite for supply network efficiency. Collaboration can be any form of cooperation/coordination among firms (and the terms will be used interchangeably hereafter). The effective deployment of information technology and systems, especially the internet and the web, has made new modes of complex, dynamic yet effective collaboration possible across supply networks. Collaboration in supply networks can take various forms. Suppliers might collaborate to increase selling power or to aggregate capacity. Buyers might collaborate to increase buying power or to reduce logistics costs. LPs might collaborate to increase efficiency in services. Coalition formation studies the process by which rational agents cooperate to improve individual or social utility. The process involves i) deliberation, in order to decide with which agents they should cooperate, and ii) negotiation, in order to exchange the information agents need to make decisions on cooperation. Once a group of agents unanimously agree to cooperate, they have formed a coalition. Coalition formation research has its roots in cooperative game theory [1, 2], where the focus of research is how the coalition value, which
is the value agents in a coalition jointly create, can be distributed as payoffs among the coalition members. In this kind of setting, agents are self-interested, i.e., they seek individually maximal payoffs. They form coalitions when they see an opportunity to increase their individual payoffs. Coalition formation is ubiquitous in real-world settings and can take various forms; we discuss this in greater detail in Section 2. Coalition formation has been widely studied in supply networks research [3, 4, 5, 6, 7, 8, 9]. A common scenario is coalition formation among buyers, who form coalitions in order to increase their bargaining power and gain discounts. Such coalitions can be as simple as buying a single good or as complex as buying combinatorial goods [7]. Each agent attempts to seek the most likely formable coalitions and the most attractive payoffs for coalition members in order to reduce the complexity of negotiation [10]. These studies solely address the problem of how the discounts achieved from sellers can be distributed among buyers, without addressing the delivery cost of the goods. Another common scenario is in logistics domains, where agents are fully cooperative and commonly seek the maximal utility for the system [11, 8]. In this setting, agents cooperatively decide on the optimal solution for the system. The optimal solution can also be computed in a centralized fashion when each coalition value is known a priori [8]. We argue that there are missing links in recent coalition formation research in supply networks. Firstly, delivery cost is a serious issue when forming coalitions. Although it is quite common in practice that sellers provide free delivery for buyers located at neighboring locations, the geographic dispersion of buyers in a coalition can incur an additional delivery cost for those members who do not meet this condition. Furthermore, the goods may be sold without any free delivery service being offered at all. Buyers seriously need to take such costs into account while negotiating with others, or they will suffer a high total cost. Secondly, the problem of forming coalitions among sellers in response to a large order is an important issue but has been minimally studied [12]. A seller may not be able to cope with the requirement on its own; thus it needs to form a coalition to increase its selling power. This creates a new model of coalition formation between buyers and sellers that brings various issues into perspective. Lastly, the new model brings a more complex and challenging problem to which LPs need to react in order to increase their service efficiency. They can seek collaboration with others when they want to increase service or economic efficiency. Hence, collaboration across supply networks can be characterized as highly complex coalitions of buyer(s), seller(s) and LP(s). This research extends our previous work [13] to offer richer support for dealing with this complexity. In the previous work, buyers, sellers and LPs form primary coalitions simultaneously and then try to form secondary coalitions across sectors later. This makes the model static and tends to yield low performance: sellers and logistics providers may not be able to respond appropriately to the needs of buyers. In this work, buyers form primary coalitions and send their requests to sellers and LPs. Upon receiving the requests, sellers and LPs try to form
primary coalitions. Once formed, they can respond to buyers to form secondary coalitions. This makes the model more dynamic and reflects reality more closely. In what follows, we describe coalitions in dynamic supply networks. We explain the coalition setting in our work and how agents form primary coalitions, including negotiation among agents of the same sector and the deliberation mechanism. After that, we explain how agents form secondary coalitions across sectors. Lastly, we discuss related work, conclusions and future work.
2 Coalitions in Dynamic Supply Networks
A coalition is a group of agents who unanimously agree to cooperate, e.g., to buy, sell or deliver. Agents are divided according to their roles into buyers, sellers and LPs. The roles of buyer and seller are interchangeable, depending on the present activity: an agent can act as a seller in one transaction while becoming a buyer in another. Although it is easy to view LPs as sellers, the role of LPs is quite distinct from that of other sellers in that they sell delivery services. Furthermore, the effect of the LPs' role can be captured more easily. A simple example of coalition formation is a small market game [2], where a seller has a good on sale and two buyers are bidding for it. The seller has a value for the good in mind and is unwilling to accept offers below this price, referred to as the seller's reserve price. Each buyer also has a private evaluation of the worth of the good and will not pay more than this worth, called the buyer's reserve price. Buyers compete with each other by offering a price for the good to the seller. When a buyer offers a price that the other cannot compete with, that buyer can buy the good from the seller at that price. This can be viewed as the two having reached an agreement to cooperate, i.e., they have formed a coalition to trade the good. The value of the coalition is the difference between the buyer's and the seller's reserve prices. The payoffs for both agents are the differences between their reserve prices and the sale price. A simple coalition in supply networks is one where a buyer forms a coalition with a seller and an LP; a more complex coalition involves multiple agents in each sector. There are multiple distinct drivers for coalition formation. Several small buyers with similar needs may come together to obtain greater bargaining power. Sellers form coalitions with other sellers to aggregate selling power. In some settings, there are legal impediments to certain forms of such coalitions (specifically cartels) in the form of anti-trust laws; however, such coalitions are common for small sellers, such as in agricultural cooperatives (e.g. for micro-producers of dairy products). Sellers sometimes form coalitions to aggregate/augment capabilities. LPs may want to form coalitions to aggregate their service capacity or efficiency. We note that we have only listed the basic drivers; most real-life supply network coalitions tend to have more than one motivating factor driving their formation. Multiple factors play a role in coalition formation decisions. In the current work, we focus on quantitative factors, e.g., price, delivery cost and lead time. Most decisions typically involve trade-offs among several attributes. One may be willing to pay more for a shorter lead time and vice versa. One may,
in some settings, be willing to accept longer lead times for the purpose of getting higher quality. Several other instances of such trade-offs are common, but we do not list them all here. Although negotiation can be either bilateral or multilateral, we can in most instances reduce negotiation to the bilateral case without loss of generality. The agent that stands to benefit the most from forming a coalition usually acts as the coalition leader. Firstly, a coalition leader negotiates bilaterally with multiple agents of its own sector to establish a sectoral coalition; we call this type of coalition a primary coalition. An agent might find itself forming a singleton primary coalition because it cannot find suitable coalition members or because there is no need to do so. Buyers form primary coalitions first, and the coalition leaders send requests to sellers and LPs. Interested sellers and LPs then form primary coalitions among themselves. The coalition leaders then negotiate across sectors to establish a cross-sectoral coalition involving a coalition of buyer(s), seller(s) and LP(s); we call this type of coalition a secondary coalition. The secondary coalition is where the real trade is finalized. The buying leader collects money from its members and pays the selling leader and the LP leader, who in turn distribute it among their members. The coalitions then break up; members can form coalitions again later should they need to do so. We note that these negotiations must occur within stringent time constraints.
3 Coalition Formation

3.1 Setting
At a point in time, let B = {b1, b2, ..., bm}, S = {s1, s2, ..., sn} and L = {l1, l2, ..., lo} be the sets of agents acting as buyers, sellers and LPs, respectively. Buyers order goods, F = {f1, f2, ..., fp}, from sellers. LPs distribute the goods for them over a transportation network G = (V, E), where V is a set of vertices and E is a set of edges, each of which connects a pair of vertices. The locations of each buyer b, seller s and LP l are specified by the functions Lb : B → V, Ls : S → V and Ll : L → V, respectively. These three functions are accessible to all agents. Agents have individual goals and constraints. A goal is represented by the good that an agent may seek to buy or sell. Constraints are the time/deadline and the price at which they wish to conclude the transaction. The amount of a specific good required by a specific buyer is given by D : B × F → N+, under a budget determined by the function B : B → R+ and a due time T : B × F → N+ (here the output of the function denotes the number of time units, from some commonly agreed origin, at which the good is required). The amount of a certain good that a seller can produce is given by P : F × S → N+. The time it takes to produce a good is determined by R : F × S → N+. The cost of producing a single unit of a good is CP : F × S → R+. An LP is initialized with a load capacity L : L × F → N+ (thus L(l, g) denotes the quantity of good g that LP l could ship if good g were the only good l was shipping). The cost of delivery is CD : V × V → R+. The time of traveling is CT : V × V → N+.
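The setting above maps naturally onto a few plain record types. The sketch below is a minimal, hypothetical representation of the functions D, B, T, P, R, CP, L, CD and CT as per-agent and per-edge attributes; all class and field names are our own, not from the paper.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

Vertex = int
Good = str

@dataclass
class Buyer:
    name: str
    location: Vertex                                          # L_b(b)
    demand: Dict[Good, int] = field(default_factory=dict)     # D(b, f)
    due_time: Dict[Good, int] = field(default_factory=dict)   # T(b, f)
    budget: float = 0.0                                        # B(b)

@dataclass
class Seller:
    name: str
    location: Vertex                                           # L_s(s)
    capacity: Dict[Good, int] = field(default_factory=dict)    # P(f, s)
    lead_time: Dict[Good, int] = field(default_factory=dict)   # R(f, s)
    unit_cost: Dict[Good, float] = field(default_factory=dict) # CP(f, s)

@dataclass
class LogisticsProvider:
    name: str
    location: Vertex                                           # L_l(l)
    load: Dict[Good, int] = field(default_factory=dict)        # L(l, f)

@dataclass
class TransportNetwork:
    vertices: set
    delivery_cost: Dict[Tuple[Vertex, Vertex], float] = field(default_factory=dict)  # CD
    travel_time: Dict[Tuple[Vertex, Vertex], int] = field(default_factory=dict)      # CT

# Example instance (all numbers are invented):
net = TransportNetwork({0, 1, 2}, {(0, 1): 12.0, (1, 2): 7.5}, {(0, 1): 3, (1, 2): 2})
b1 = Buyer("b1", location=2, demand={"f1": 40}, due_time={"f1": 10}, budget=500.0)
```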
3.2 Forming Primary Coalitions
In what follows, we use the notation ⟨x, y, z⟩ to designate a collection of vectors of a certain kind. The notation ⟨f, q, t⟩, for example, denotes a collection of k requirement vectors ⟨fi, qi, ti⟩, where fi is a good an agent wants to buy, qi is the quantity of that good needed, and ti is the time by which the good is needed, for all 1 ≤ i ≤ k. A similar interpretation applies throughout the rest of the paper.

Buyer to Buyer Negotiation. A buyer who wants to buy goods and foresees an opportunity to gain a discount from buying a large quantity acts as a coalition leader by negotiating with other buyers; we call it the leading buyer. The following is the protocol for forming a coalition of buyers.

1. Every buyer b sends a message (b, ⟨f, q, t⟩, v, p), where b is the identifier of the agent, ⟨f, q, t⟩ is a collection of requirement vectors, v ∈ V is the location where the goods will be delivered, p is the price, and t is the due time by which it wants the goods to be delivered.
2. Based on its deliberation, any buyer can act as the leading buyer, b0, and initiate the process of forming a primary coalition of buyers (see the section Deliberation for Joining Primary Coalitions for more details).
3. Leading buyer b0 sends a message (msg, b0, ⟨f, q, t⟩, v, p, x), where msg is the message id, p is the price, and x is the expiry time, to the potential buyers in its current best potential coalition.
4. Any buyer who is satisfied with the offer sends a message (msg, b, ACK, x) back to b0 stating that it will wait for a confirmation until time x.
5. Leading buyer b0 sends the message (msg, b0, ACK) to inform all interested buyers that the new primary coalition has now been formed.
6. Leading buyer b0 sends the message B2S(msg, b0, ⟨f, q, t⟩, p) to sellers, where ⟨f, q, t⟩ is the order detail compiled from its members and t is the available time.
7. At the same time, the leading buyer sends the message B2L(id, b0, ⟨f, q, v, t⟩, ⟨f′, q′, v′, t′⟩, p) to LPs, where ⟨f, q, v, t⟩ and ⟨f′, q′, v′, t′⟩ specify the job description for picking up the goods from sellers and delivering them to buyers.

The leading buyer may find itself being a singleton coalition, i.e., it cannot reach agreement with other buyers or there is no need to do so at all. Once the coalition is formed, it can negotiate with sellers and LPs to form a secondary coalition by sending messages to them.

Seller to Seller Negotiation. Upon receiving the message from a leading buyer, interested sellers react. If the order is within its capability, a seller may form a singleton coalition on its own. If the order is beyond its capability, it can act as a coalition leader and try to form a coalition of sellers in order to increase its selling power; we call it the leading seller.

1. Interested sellers send a message (msg, s, ⟨f, q, p, r, v⟩), where msg is the message id sent by b0, p is specified by the function P, r is specified by the function R, and ⟨f, q, p, r, v⟩ is the list of capability vectors, to all other sellers.
2. Any seller can act as the leading seller, s0, and try to form a primary coalition with other sellers. It selects a number of sellers who might help increase its selling power (see the section Deliberation for Joining Primary Coalitions for more details).
3. Leading seller s0 sends a message (msg, s0, ⟨s, f, v⟩, x), where ⟨s, f, v⟩ is a collection of preferred capability vectors, to the selected sellers.
4. Any seller who is satisfied with the offer replies with the message (msg, s, ACK, x) and waits for the confirmation by time x.
5. Agent s0 confirms that a primary coalition has been formed with the message (msg, s0, ACK).

The leading seller may find itself being a singleton coalition because it cannot reach agreement with other sellers. It can then begin negotiation with the leading buyer to form a secondary coalition.

LP to LP Negotiation. Upon receiving the message from leading buyer b0, interested LPs react. Any interested LP may try to form a primary coalition with other LPs; we call such an LP the leading LP.

1. Every interested LP sends a message (msg, l, l, v), where msg is the message id sent by b0 and l is the load capacity specified by the function L, to all other LPs.
2. Any interested LP can act as the leading LP, l0, and try to form a primary coalition with other LPs. It selects a number of LPs who might help increase its capacity and service efficiency (see the section Deliberation for Joining Primary Coalitions for more details).
3. Leading LP l0 sends a message (msg, l0, v, x) to the selected agents, asking whether they would like to form a primary coalition.
4. Any LP who is satisfied with the offer replies with the message (msg, l, ACK, x) and waits for the confirmation by time x.
5. Agent l0 confirms to its primary coalition members with the message (msg, l0, ACK).

The leading LP may find itself being a singleton coalition. It can then negotiate with the leading buyer b0 on behalf of its members to form a secondary coalition.

Deliberation for Joining Primary Coalitions. Agents have a common view when forming primary coalitions: the Quality of Coalition (QoC). The QoC indicates the likelihood of achieving the goal of the coalition. The common goal of every agent is to maximize its payoff. Buyers need members who can help increase bargaining power. Sellers need members who can help increase selling power, which in turn helps maximize their payoffs. LPs need members who can help increase service efficiency, which in turn helps maximize their payoffs. Based on its status, an agent can create an ideal coalition, the one with the highest QoC. The agent creates a template, which is an ideal member of the ideal coalition. Each agent measures the suitability of another agent for the ideal coalition in terms of the distance between that agent and this template. It uses the distance
to rank other agents in a table, which is used to help select appropriate agents for forming high-QoC coalitions. Let A = {a1, a2, ..., am} be a set of attributes, and let B = {⟨L1, U1⟩, ⟨L2, U2⟩, ..., ⟨Lm, Um⟩} be the corresponding lower and upper bounds of the quantitative attributes a1, a2, ..., am. A template is a vector t = ⟨q1, q2, ..., qm, w1, w2, ..., wm⟩, where qj is the target value for attribute aj and wj is the weight attached to attribute aj. The agent a0 that wants to form a primary coalition uses the attributes passed over in the first step of the negotiation to measure the distance between other agents and its template; a smaller distance signifies greater suitability of the agent for the coalition. The distance is the sum, over the attributes, of the weighted difference between the template value and the agent's value: for each attribute, the contribution is 0 if the two values are equal, and otherwise it is the normalized difference between the corresponding values multiplied by the weight for that attribute. Let yj be the value of attribute j reported by the other agent; the distance between an agent and a template is

$$d = \sum_{j=1}^{m} \mathrm{Diff}(q_j, y_j)\, w_j,$$

where

$$\mathrm{Diff}(q_j, y_j) = \begin{cases} \infty & \text{if } y_j < L_j \text{ or } y_j > U_j,\\[4pt] \dfrac{q_j - y_j}{q_j - L_j} & \text{if } q_j > y_j \text{ and } y_j \ge L_j,\\[4pt] \dfrac{y_j - q_j}{U_j - q_j} & \text{if } q_j < y_j \text{ and } y_j \le U_j,\\[4pt] 0 & \text{if } q_j = y_j. \end{cases}$$
Let us consider an example of how this mechanism works. Suppose there are four attributes, A = {f, q, p, t}, describing the quality of a primary coalition of buyers. Let B = {⟨1, 10⟩, ⟨10, 20⟩, ⟨10, 30⟩, ⟨0, 100⟩} be the set of lower and upper bounds for f, q, p and t, respectively. Let T = ⟨1, 15, 20, 30, ∞, 10, 1, 5⟩ be the template of buyer b0, who is considering forming a primary coalition. Two buyers, b1 and b2, are being considered for invitation to join the coalition. Their attributes, obtained from the messages, are ⟨1, 13, 18, 28⟩ and ⟨1, 17, 23, 34⟩, respectively. The distance between agent b1 and template T is d1 = (0 ∗ ∞) + (2/5) ∗ 10 + (2/10) ∗ 1 + (2/30) ∗ 5 = 0 + 4 + 0.2 + 0.33 = 4.53. The distance between agent b2 and template T is d2 = (0 ∗ ∞) + (2/5) ∗ 10 + (3/10) ∗ 1 + (4/70) ∗ 5 = 0 + 4 + 0.3 + 0.29 = 4.59. Since 4.53 is less than 4.59, agent b1 is closer to the template than b2 and hence sits higher in the ranking table. Each agent uses its own ranking table to compute potential primary coalitions, those it might try to form, and keeps them in a list. The agent sets up a period of time for deliberation, i.e., for generating coalitions and adding them to the list. It computes, for each potential coalition, the aggregated distance, which is the sum
of all members' distances to the template. It then ranks those coalitions by aggregated distance in ascending order. For each potential coalition, the agent computes potential payoffs for the members. The agent then considers the potential coalitions one by one from the top of the list. For the best potential coalition being considered, the agent decides to be a leading agent if its potential payoff is the greatest; it then follows the primary protocol to negotiate with the others. If the agent's potential payoff is not the greatest, it waits for some time. Upon receiving an invitation message, it computes its potential payoff and compares it to the potential payoff in the second-best potential coalition in its own list. If the offered potential payoff is not lower, the agent accepts the offer; otherwise it rejects it. If the coalition cannot be formed, the next potential coalition is considered, and more potential coalitions are computed when needed.
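For concreteness, the following sketch implements the Diff measure and the weighted distance d for the worked example above, and ranks the two candidate buyers. The function names are our own, and a zero attribute difference is taken to contribute zero even when its weight is infinite, as in the example.

```python
import math

def diff(q, y, lower, upper):
    """Distance contribution of one attribute: q is the template value,
    y the candidate's value, [lower, upper] the attribute bounds."""
    if y < lower or y > upper:
        return math.inf
    if q == y:
        return 0.0
    if q > y:
        return (q - y) / (q - lower)
    return (y - q) / (upper - q)

def distance(template, weights, bounds, candidate):
    """Weighted distance between a candidate agent and the template;
    a zero difference contributes zero even if its weight is infinite."""
    total = 0.0
    for q, w, (lo, hi), y in zip(template, weights, bounds, candidate):
        d = diff(q, y, lo, hi)
        total += 0.0 if d == 0.0 else d * w
    return total

bounds   = [(1, 10), (10, 20), (10, 30), (0, 100)]
template = [1, 15, 20, 30]
weights  = [math.inf, 10, 1, 5]
candidates = {"b1": [1, 13, 18, 28], "b2": [1, 17, 23, 34]}

scores = {name: distance(template, weights, bounds, attrs) for name, attrs in candidates.items()}
print(scores)                          # b1 ~= 4.53, b2 ~= 4.59
print(sorted(scores, key=scores.get))  # ['b1', 'b2'] -- b1 ranks higher
```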
3.3 Secondary Coalitions
A secondary coalition is composed of three primary coalitions: a buyer primary coalition, a seller primary coalition and an LP primary coalition. The negotiation occurs among coalition leaders and within primary coalition members as an ongoing process. They try to agree on their payoffs and tasks (supplying goods for the sellers and delivering goods for the LPs). Agents who are not satisfied with the payoffs being offered may decide to deviate from their present primary coalitions and try to form new ones.

Buyers to Sellers and LPs. Firstly, the leading sellers and leading LPs try to agree on an offer for the corresponding buyer. The negotiation involves their payoffs and tasks.

1. Each member seller evaluates its present capability and sends its leading seller s0 a message S2S(msg, b, ⟨f, q, t⟩, p) bidding for the supply task.
2. Leading seller s0 may have to negotiate on goods, quantities, times and prices with member sellers with the message S2S(id, b, ⟨f, q, t⟩, p) until every member seller is satisfied and the total quantity, available time and price satisfy the request.
3. Leading seller s0 accumulates the bids from its member sellers and creates an offer. It then sends the message S2B(msg, s0, ⟨f, q, v, t⟩, p) back to b0.
4. Similarly, each member LP l bids for the task by sending the message L2L(id, ⟨f, q, v, t⟩, ⟨f′, q′, v′, t′⟩, p) to its leading LP l0.
5. Leading LP l0 may have to negotiate over quantity, pickup location, delivery location, pickup time, delivery time and price with its members.
6. Leading LP l0 accumulates all the bids from its members in order to create an offer for the delivery job.
7. It sends a message, i.e., a proposal, (id, l0, ⟨f, q, v, t⟩, p) back to b0.
8. Leading buyer b0 accumulates the proposals from the leading sellers and leading LPs and finds the best combination of a selling proposal and an LP proposal. It then creates a proposal for its members and sends the message (id, s0, ⟨f, q, v, t⟩, p) to the members.
9. Members who are satisfied with the proposal send a message (id, b, ACK) to b0.
10. Members who are not satisfied can negotiate over quantity, time and price by sending the message (id, b, ⟨f, q, v, t⟩, p) back to the leading agent.
11. Once all member buyers are satisfied, the leading buyer sends the message (id, b0, ACK) to the selected leading seller and leading LP.
12. The leading buyer, seller and LP send the message (id, a0, CONFIRM) to their coalition members, confirming that the secondary coalition has been formed.

The LPs pick up the goods from the sellers and deliver them to the buyers. Sellers and LPs get paid when the goods are delivered. The negotiation within and among primary coalitions can continue until the agents are satisfied or the time is over.
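A minimal sketch of how a leading buyer might accumulate the S2B and L2B proposals from steps 3 and 7 and pick the cheapest seller/LP combination (step 8) is shown below. The message fields and the selection rule (cheapest proposal per sector) are illustrative assumptions, not taken from the paper.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Message:
    kind: str       # "S2B", "L2B", "ACK", "CONFIRM", ...
    msg_id: int
    sender: str
    payload: Any    # requirement/capability vectors, prices, expiry times, ...

class LeadingBuyer:
    """Accumulates seller (S2B) and LP (L2B) proposals for one request
    and picks the cheapest combination (step 8 of the protocol)."""
    def __init__(self, name: str):
        self.name = name
        self.seller_offers: list[Message] = []
        self.lp_offers: list[Message] = []

    def receive(self, m: Message) -> None:
        if m.kind == "S2B":
            self.seller_offers.append(m)
        elif m.kind == "L2B":
            self.lp_offers.append(m)

    def best_combination(self) -> tuple[Message, Message]:
        best_seller = min(self.seller_offers, key=lambda m: m.payload["price"])
        best_lp = min(self.lp_offers, key=lambda m: m.payload["price"])
        return best_seller, best_lp

b0 = LeadingBuyer("b0")
b0.receive(Message("S2B", 1, "s0", {"price": 150, "goods": {"f1": 40}}))
b0.receive(Message("S2B", 1, "s1", {"price": 170, "goods": {"f1": 40}}))
b0.receive(Message("L2B", 1, "l0", {"price": 55}))
seller, lp = b0.best_combination()
print(seller.sender, lp.sender)    # -> s0 l0
```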
3.4 Decision Mechanism
In our setting, each agent has an individual reserve price. For a buyer, the reserve price rb is the maximum price it is willing to pay for acquiring a bundle of goods, i.e., including the prices of the goods and the costs of delivery. For a seller, the reserve price rs is the cost cp of producing the goods plus the minimum profit it expects. For an LP, the reserve price rl is the estimated cost of operation plus the minimum profit it expects. Let B′ ⊆ B, S′ ⊆ S and L′ ⊆ L be the set of buyers in a primary coalition, the set of sellers in a primary coalition, and the set of LPs in a primary coalition, respectively. For a secondary coalition C = {B′, S′, L′}, the reserve price of B′ is rB = ∑b∈B′ rb, the reserve price of S′ is rS = ∑s∈S′ rs, and the reserve price of L′ is rL = ∑l∈L′ rl. The coalition value of the secondary coalition C is VC = rB − rS − rL, which is to be distributed among the agents. Let PS be the price the sellers charge the buyers and PL the price the LPs charge the buyers. The payoff for the buyers in B′ is UB = rB − PS − PL, the payoff for the sellers in S′ is US = PS − rS, and the payoff for the LPs in L′ is UL = PL − rL. The price each agent uses in negotiation may be higher than its reserve price. There are several solution concepts in coalition formation theory [1], e.g., the core, the kernel, etc. Knowing the bidding price of every agent in its primary coalition, the leading agent can use a fair division of the coalition payoff based on the Shapley value [1]. Agents in the primary coalition may find that the shares offered by the leading agent are below their own reserve prices; hence they may have to deviate from their present primary coalitions.
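The sector-level quantities defined above can be computed directly. The sketch below uses invented reserve and negotiated prices purely for illustration and checks that the three payoffs sum to the coalition value; it does not implement the Shapley-value division.

```python
def coalition_value(buyer_reserves, seller_reserves, lp_reserves):
    """Surplus created by a secondary coalition: the buyers' combined reserve
    price minus the sellers' and LPs' combined reserve prices."""
    return sum(buyer_reserves) - sum(seller_reserves) - sum(lp_reserves)

def sector_payoffs(buyer_reserves, seller_reserves, lp_reserves, price_sellers, price_lps):
    """Sector-level payoffs U_B, U_S, U_L for negotiated prices P_S and P_L."""
    u_b = sum(buyer_reserves) - price_sellers - price_lps
    u_s = price_sellers - sum(seller_reserves)
    u_l = price_lps - sum(lp_reserves)
    return u_b, u_s, u_l

# Invented numbers: three buyers, two sellers, one LP.
buyers, sellers, lps = [100, 80, 60], [70, 60], [40]
u_b, u_s, u_l = sector_payoffs(buyers, sellers, lps, price_sellers=150, price_lps=55)
# The three payoffs always sum to the coalition value.
assert abs((u_b + u_s + u_l) - coalition_value(buyers, sellers, lps)) < 1e-9
print(u_b, u_s, u_l)   # 35 20 15
```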
4 Related Work
Coalition formation is an active research area in multi-agent systems. In addition to analyzing stability and payoff distribution, researchers are paying increasing attention to the formation mechanism, i.e., the negotiation protocols among agents and the decision making, or reasoning, of individual agents. Previous work has addressed parts of our proposed framework, as follows. Tsvetovat et al. [9] studied the formation of buyer coalitions in a university environment where students buy their textbooks in groups. Li et al. [7] addressed combinatorial auctions in coalition formation: buyers submit their requests to a mediator who allocates the goods from sellers to them. Goldman et al. [5] searched for a strategy where sellers
selected the most profitable deal while buyers looked for the most satisfactory sellers. Kraus et al. [4] proposed a compromise strategy for distributing profits among sellers in order to form coalitions quickly; they found that agents who are willing to give away some of their profits actually earn more. Breban et al. [3] addressed coalition formation based on trust among agents; trust is used as a mechanism to enforce agents' commitment to the jobs they take on as part of the coalition. Hyodo et al. [14] addressed optimal coalitions of buyers and sellers located at distributed sites, deploying a genetic algorithm to search for the optimal share of the discount. Klusch et al. [6] addressed problems in static coalition formation and proposed an algorithm for mobile agents to form coalitions in dynamic environments; agents can form overlapping coalitions, but stability is low since agents are allowed to leave their coalitions at any time. Sandholm et al. [8] studied coalition formation among trucks in dispatch centers. As far as we are aware, there is no work in coalition formation that combines buying and delivery.
5 Conclusion and Future Work
This paper proposes a model of agent-based coalitions in dynamic supply networks. In our setting, agents take two steps: i) agents in each sector form loosely-coupled coalitions in order to decrease the complexity of the negotiation, and ii) agents form coalitions across sectors in order to deliver goods to end customers. We propose a framework involving a negotiation protocol and a decision mechanism. The negotiation protocol allows thorough communication, i.e., buyers to buyers, buyers to sellers, sellers to sellers, buyers to LPs, and LPs to LPs. Since one of the important issues in supply chains is to minimize logistics costs, the aim of our future work is to find a mechanism that helps buyers manage and control the logistics process rather than leaving it to sellers. We believe that with an efficient negotiation protocol and decision-making mechanism, logistics costs can be reduced.
References

1. Kahan, J.P., Rapoport, A.: Theories of Coalition Formation. Lawrence Erlbaum Associates, Hillsdale, New Jersey (1984)
2. Neumann, J.V., Morgenstern, O.: Theory of Games and Economic Behaviour. Princeton University Press, Princeton, New Jersey (1953; 1963 printing)
3. Breban, S., Vassileva, J.: A coalition formation mechanism based on inter-agent trust relationships. In: Proceedings of the First International Joint Conference on Autonomous Agents and Multiagent Systems, ACM Press (2002) 306-307
4. Kraus, S., Shehory, O., Taase, G.: The advantages of compromising in coalition formation with incomplete information. In: Proc. of AAMAS-2004 (2004)
5. Goldman, C.V., Kraus, S., Shehory, O.: Agent strategies: for sellers to satisfy purchase-orders, for buyers to select sellers. In: Proceedings of the 10th European Workshop on Modeling Autonomous Agents in a Multi-Agent World (MAAMAW '01), Annecy, France (2001)
6. Klusch, M., Gerber, A.: Dynamic coalition formation among rational agents. IEEE Intelligent Systems 17 (2002) 42-47
7. Li, C., Sycara, K.: Algorithms for combinatorial coalition formation and payoff division in an electronic marketplace. Technical Report CMU-RI-TR-01-33, Robotics Institute, Carnegie Mellon University, Pittsburgh, PA (2001)
8. Sandholm, T., Lesser, V.: Coalition formation among bounded rational agents. 14th International Joint Conference on Artificial Intelligence (1995) 662-669
9. Tsvetovat, M., Sycara, K.: Customer coalitions in the electronic marketplace. In: ACM 2000 (2000)
10. Shehory, O., Kraus, S.: Methods for task allocation via agent coalition formation. Artif. Intell. 101 (1998) 165-200
11. Sandholm, T., Larson, K., Andersson, M., Shehory, O., Tohmé, F.: Coalition structure generation with worst case guarantees. Artif. Intell. 111 (1999) 209-238
12. Klusch, M., Contreras, J., Wu, F., Shehory, O.: Coalition formation in a power transmission planning environment. In: Proc. 2nd International Conference on Practical Applications of Multi-Agent Systems (1997)
13. Sombattheera, C., Ghose, A.: Agent-based coalitions in dynamic supply chains. In: Proceedings of the Ninth Pacific Asia Conference on Information Systems (2005)
14. Hyodo, M., Matsuo, T., Ito, T.: An optimal coalition formation algorithm for electronic group buying. In: Proc. of 2003 SICE Annual Conference (SICE 2003) (2003)
Reducing Transportation Costs in Distribution Networks

Xi Li1, Andrew Lim1, Zhaowei Miao2, and Brian Rodrigues3

1 Department of IELM, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong
2 School of Computer Science & Engineering, South China University of Technology, Guang Dong, P.R. China
3 Lee Kong Chian School of Business, Singapore Management University, Singapore 178899
Abstract. In this work, we model the conflicting objectives in a distribution network, i.e., the trade-offs between transportation costs, inventory and time scheduling requirements. Transportation costs include time costs, truck setup costs, and the number of trucks used. The model is formulated as an integer program and a two-stage metaheuristic solution approach is provided. Computational experiments are conducted to test the algorithms. Keywords: genetic algorithm, heuristic search, inventory, transportation scheduling.
1 Introduction
As companies seek more profitable supply chain management, there has been a desire to optimize distribution networks to reduce logistics costs. This includes finding the best locations for facilities, minimizing inventory, and minimizing transportation costs. In this work, we focus on reducing transportation costs in distribution networks with the added objective that transshipment centers achieve the quick reshipment goal of crossdocks. In particular, we study trucking consolidation, since trucking accounts for 83 percent of freight transportation in the US alone (Wilson 2002). Among companies that have benefited from trucking consolidation, Nabisco Inc., for example, has used consolidation to reduce transportation costs by half, to bring down inventory levels and to improve delivery (Quinn 1997, Gümüs and Bookbinder 2004). The model we study has its roots in the classical transshipment problem, which is concerned with the shipping quantities and routes to be taken through transshipment centers. The problem has been studied in the context of network flow (Hoppe and Tardos 1995) to find the shortest transshipment time. For the general n-location transshipment model, Robinson (1990), who developed a large LP by discretizing demand, provided heuristic solutions, while Tayur (1995) used a gradient-based approach for the problem, also after discretizing demand.
Although these studies considered inventory and transshipment costs, they did not address the time constraints present in the transshipment process, for example, constraints imposed by transportation schedules or time window constraints at supply and demand nodes. When time is a critical factor, distribution through crossdocks, which have become synonymous with rapid consolidation and processing, has been studied. Napolitano (2000) has described manufacturing, transportation, distribution, retail and opportunistic crossdocking, all of which have the common features of consolidation and short cycle times made possible by known pick-up and delivery times. Physical operations which reduce labor costs in LTL crossdocking have been studied by Bartholdi et al. (2000), and similar work has been carried out by Lim et al. (2005). The model we study is a global optimization problem which identifies the best choices among the conflicting objectives in distribution network logistics, i.e., the trade-offs between transportation costs, inventory and time scheduling requirements at all stages of the supply chain. Transportation costs include time costs, truck setup costs, and the number of trucks used. The model is formulated as an integer program, for which we provide a solution approach called the two-stage metaheuristic algorithm. Computational experiments are conducted to test the algorithms, and comparisons are made with exact solutions where these are available.
2 The Model
In this article, any retailer or cluster of retailers at a location supplied directly from the manufacturers or through transshipment centers is referred to as a "customer" (cf. Gümüs and Bookbinder 2004). We refer to transshipment centers and crossdocks by the collective term "crossdock". We assume that each customer is at a predetermined location k ∈ Δ ≡ {1, ..., m}, and demands dk units of a single product which can be shipped from any manufacturer situated at i ∈ Σ ≡ {1, ..., n}, either directly or through a crossdock j ∈ X ≡ {1, ..., l}. The value dk represents the total demand of the retailer or cluster of retailers represented by k. Each i ∈ Σ has a supply capacity of si units of the product, which can only be shipped (released) in the time window [b_i^r, e_i^r] (i ∈ Σ) and must be delivered (accepted) within the time window [b_k^a, e_k^a] (k ∈ Δ). Also, total supply can meet demand, i.e., ∑i∈Σ si ≥ ∑k∈Δ dk, and, at each crossdock, a holding cost, given by hj (j ∈ X) per unit per unit time, takes effect. This implicitly allows shipments to be delayed at crossdocks. Figure 1 illustrates the problem for the case of multiple manufacturers, crossdocks and customers. A truck delivering to customers arrives at a location, following which it can be routed to other locations in the cluster, if required. The delivery routing plan to each retailer in the cluster can be solved separately from the model. This may be preferable since, for example, local trucking vendors provide smaller truck load capacities for short-haul delivery and no warehousing is required. Alternatively, the set of customers can be expanded to include part of or all cluster points. For each truck, we assume variable setup costs which are determined by, for example, location, vendors, and the type of freight carried.
Fig. 1. Distribution network with crossdocks and time windows
In the model, we take the time horizon to be discrete (see Hoppe and Tardos 1995) and given by the non-negative integers, so that shipment times can be represented as integers. The following parameters are used:

dk : demand quantity of customer k (k ∈ Δ)
si : available quantity at supplier i (i ∈ Σ)
p : setup cost for each truck on arc (i, k) (i ∈ Σ, k ∈ Δ)
p′ : setup cost for each truck on arc (i, j) (i ∈ Σ, j ∈ X)
p′′ : setup cost for each truck on arc (j, k) (j ∈ X, k ∈ Δ)
hj : inventory holding cost per unit product per unit time at crossdock j (j ∈ X)
dik : total shipping time on arc (i, k) (i ∈ Σ, k ∈ Δ)
dij : total shipping time on arc (i, j) (i ∈ Σ, j ∈ X)
djk : total shipping time on arc (j, k) (j ∈ X, k ∈ Δ)
cik : shipping cost per unit product per unit time on segment (i, k) (i ∈ Σ, k ∈ Δ)
cij : shipping cost per unit product per unit time on segment (i, j) (i ∈ Σ, j ∈ X)
cjk : shipping cost per unit product per unit time on segment (j, k) (j ∈ X, k ∈ Δ)
Ti : set of feasible shipping time points from supplier i, i.e., Ti = {b_i^r, b_i^r + 1, ..., e_i^r} (i ∈ Σ)
Tjk : set of feasible shipping time points from crossdock j to demand point k, i.e., Tjk = {b_k^a − djk, b_k^a − djk + 1, ..., e_k^a − djk} (j ∈ X, k ∈ Δ)
Tj : set of feasible time points at which shipments can arrive at or depart from crossdock j, i.e., Tj = {min_{i∈Σ}{e_i^r + dij}, min_{i∈Σ}{e_i^r + dij} + 1, ..., max_{i∈Σ}{b_i^r + dij}} (j ∈ X)
Q : capacity of each truck.

The decision variables required for the model are:

xikt : quantity of product shipped on arc (i, k) at time t (i ∈ Σ, k ∈ Δ, t ∈ Ti)
xijt : quantity of product shipped on arc (i, j) at time t (i ∈ Σ, j ∈ X, t ∈ Ti)
xjkt : quantity of product shipped on arc (j, k) at time t (j ∈ X, k ∈ Δ, t ∈ Tjk)
vikt : number of trucks used on arc (i, k) at time t (i ∈ Σ, k ∈ Δ, t ∈ Ti)
vijt : number of trucks used on arc (i, j) at time t (i ∈ Σ, j ∈ X, t ∈ Ti)
vjkt : number of trucks used on arc (j, k) at time t (j ∈ X, k ∈ Δ, t ∈ Tjk)
Ijt : quantity of inventory at crossdock j at time t (j ∈ X, t ∈ Tj)
The objective in the shipping consolidation problem (SCP) is to minimize the total cost, including transportation and inventory costs, while satisfying time window constraints. Transportation costs include time costs, truck setup costs and the number of trucks deployed. With the above notation, the SCP can be given by the following integer program:

$$\min \sum_{i\in\Sigma}\sum_{k\in\Delta}\sum_{t\in T_i}\left(p\,v_{ikt}+c_{ik}d_{ik}x_{ikt}\right)+\sum_{i\in\Sigma}\sum_{j\in X}\sum_{t\in T_i}\left(p'v_{ijt}+c_{ij}d_{ij}x_{ijt}\right)+\sum_{j\in X}\sum_{k\in\Delta}\sum_{t\in T_{jk}}\left(p''v_{jkt}+c_{jk}d_{jk}x_{jkt}\right)+\sum_{j\in X}\sum_{t\in T_j}h_j I_{jt} \qquad (1)$$

s.t.

$$\sum_{k\in\Delta}\sum_{t\in T_i}x_{ikt}+\sum_{j\in X}\sum_{t\in T_i}x_{ijt}\le s_i \qquad (i\in\Sigma) \qquad (2)$$

$$\sum_{i\in\Sigma}\sum_{t\in T_i}x_{ikt}+\sum_{j\in X}\sum_{t\in T_{jk}}x_{jkt}= d_k \qquad (k\in\Delta) \qquad (3)$$

$$I_{j,t-1}+\sum_{i\in\Sigma}x_{ij,\,t-d_{ij}}-\sum_{k\in\Delta}x_{jkt}=I_{jt} \qquad (j\in X,\; t\in T_j) \qquad (4)$$

$$I_{j,t}=0 \qquad (j\in X,\; t=\min\{t\in T_j\}-1) \qquad (5)$$

$$I_{j,t}=0 \qquad (j\in X,\; t=\max\{t\in T_j\}) \qquad (6)$$

$$Q(v_{ikt}-1)<x_{ikt}\le Q\,v_{ikt} \qquad (i\in\Sigma,\; k\in\Delta,\; t\in T_i) \qquad (7)$$

$$Q(v_{ijt}-1)<x_{ijt}\le Q\,v_{ijt} \qquad (i\in\Sigma,\; j\in X,\; t\in T_i) \qquad (8)$$

$$Q(v_{jkt}-1)<x_{jkt}\le Q\,v_{jkt} \qquad (j\in X,\; k\in\Delta,\; t\in T_{jk}) \qquad (9)$$

All decision variables are nonnegative integers. (10)
In the objective function, the first term gives the direct transportation costs from manufacturers to customers, including all truck setup costs and time costs; the second and third terms are similar to the first, and represent costs between manufacturers and crossdocks and between crossdocks and customers, respectively. The last term represents the total holding cost. Constraints (2) ensure that the total quantity of the product shipped from each manufacturer is no greater than the available supply. Similarly, constraints (3) ensure that the total quantity of product received meets demand. Constraints (4) require that, for each crossdock, the inventory at time t is equal to the inventory held over at time t − 1 plus the total quantity received at time t minus the quantity shipped out at time t. Constraints (5) and (6) are
initial and terminal conditions, respectively, on the inventory level at each crossdock. Constraints (7), (8) and (9) ensure that the number of trucks used on any route is the minimum required. This problem is NP-hard in the strong sense (Miao 2005). In view of the computational complexity of the problem, we provide a heuristic approach to its solution in the next section.
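For readers who want to experiment with the formulation, the following is a minimal sketch of the SCP as an integer program in PuLP. It uses tiny placeholder data, a single shipping time and cost for every arc, and drops the release/acceptance time windows for brevity; the strict lower bounds in constraints (7)-(9) are omitted because minimizing the setup costs already keeps each truck count at its minimum.

```python
import pulp

# Placeholder sets and parameters (not data from the paper).
Sigma, X, Delta = ["s1", "s2"], ["c1"], ["d1", "d2"]   # manufacturers, crossdocks, customers
T = list(range(6))                                      # discrete time horizon
Q, p, p1, p2 = 20, 1000, 1000, 1000                     # truck capacity, setup costs p, p', p''
c, d_time, h = 5, 2, 10                                 # unit shipping cost, shipping time, holding cost
supply = {"s1": 40, "s2": 30}
demand = {"d1": 35, "d2": 25}

m = pulp.LpProblem("SCP", pulp.LpMinimize)
x_ik = pulp.LpVariable.dicts("x_ik", (Sigma, Delta, T), lowBound=0, cat="Integer")
x_ij = pulp.LpVariable.dicts("x_ij", (Sigma, X, T), lowBound=0, cat="Integer")
x_jk = pulp.LpVariable.dicts("x_jk", (X, Delta, T), lowBound=0, cat="Integer")
v_ik = pulp.LpVariable.dicts("v_ik", (Sigma, Delta, T), lowBound=0, cat="Integer")
v_ij = pulp.LpVariable.dicts("v_ij", (Sigma, X, T), lowBound=0, cat="Integer")
v_jk = pulp.LpVariable.dicts("v_jk", (X, Delta, T), lowBound=0, cat="Integer")
I = pulp.LpVariable.dicts("I", (X, T), lowBound=0, cat="Integer")

# Objective (1): setup costs + time-weighted shipping costs + holding cost.
m += (pulp.lpSum(p * v_ik[i][k][t] + c * d_time * x_ik[i][k][t] for i in Sigma for k in Delta for t in T)
      + pulp.lpSum(p1 * v_ij[i][j][t] + c * d_time * x_ij[i][j][t] for i in Sigma for j in X for t in T)
      + pulp.lpSum(p2 * v_jk[j][k][t] + c * d_time * x_jk[j][k][t] for j in X for k in Delta for t in T)
      + pulp.lpSum(h * I[j][t] for j in X for t in T))

for i in Sigma:   # (2) supply
    m += pulp.lpSum(x_ik[i][k][t] for k in Delta for t in T) + \
         pulp.lpSum(x_ij[i][j][t] for j in X for t in T) <= supply[i]
for k in Delta:   # (3) demand
    m += pulp.lpSum(x_ik[i][k][t] for i in Sigma for t in T) + \
         pulp.lpSum(x_jk[j][k][t] for j in X for t in T) == demand[k]
for j in X:       # (4)-(6): crossdock inventory balance with empty start/end stock
    for t in T:
        inflow = pulp.lpSum(x_ij[i][j][t - d_time] for i in Sigma) if t - d_time >= 0 else 0
        prev = I[j][t - 1] if t > 0 else 0
        m += prev + inflow - pulp.lpSum(x_jk[j][k][t] for k in Delta) == I[j][t]
    m += I[j][T[-1]] == 0
# (7)-(9): each truck carries at most Q units.
for i in Sigma:
    for k in Delta:
        for t in T:
            m += x_ik[i][k][t] <= Q * v_ik[i][k][t]
    for j in X:
        for t in T:
            m += x_ij[i][j][t] <= Q * v_ij[i][j][t]
for j in X:
    for k in Delta:
        for t in T:
            m += x_jk[j][k][t] <= Q * v_jk[j][k][t]

m.solve(pulp.PULP_CBC_CMD(msg=False))
print(pulp.LpStatus[m.status], pulp.value(m.objective))
```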
3 A Two-Stage Heuristic
In attempting to find solutions to the SCP, we need to decide the quantities to be shipped, the shipment times and the routes to be taken. Once TL quantities are decided, LTL shipments need to be consolidated at the manufacturers and crossdocks to reduce truck setup costs. The solution approach we take has two components: (1) a full truck load (TL) Plan, and (2) a less than truck load (LTL) Plan. These are used in a two-stage heuristic algorithm. In the first, initialization, stage, a solution is found which consists of a TL Plan and an LTL Plan; the TL Plan is found using a network flow algorithm. In the second, iteration, stage, a metaheuristic is used to improve the LTL Plan.

Stage 1: Initialization. A TL Plan is developed, which is used to obtain an initial LTL Plan.

(1) TL Plan. A transportation plan is found by solving a network flow problem determined as follows. Split the demand at each customer into a TL part and an LTL part, and remove the LTL part (the remainders) from the demand. Next, form the network G with |Σ| supply nodes and |Δ| customer nodes, without time windows or crossdocks, where the demand is the TL demand obtained in the preceding step and the supply is unchanged. Each arc in G is taken to be the least-cost route available for a TL shipment from a supply node to a demand node which does not violate the time window constraints; each arc represents a direct route from a manufacturer to a customer, or a route through a crossdock. A greedy method is used to allocate full truck load flow in G to generate the TL Plan. The TL Plan cannot guarantee that all demand is satisfied; we leave unsatisfied demand to be satisfied by the LTL Plan.
Fig. 2. Example - least cost routes
Fig. 3. Network G constructed from Example
To illustrate this, Figure 2 shows a simple example of least-cost routes from manufacturers to customers, and Figure 3 shows the network G constructed from this example. Assume the manufacturer nodes S1, S2, S3 have s1 = 103, s2 = 10, s3 = 28, the customer nodes D1, D2, D3 have d1 = 25, d2 = 27, d3 = 82, and the truck capacity is Q = 20. In G, the demand becomes d1 = 20, d2 = 20, d3 = 80. Assume that the transportation costs are such that C11 < C12 < C13 < C33 < · · · (Cik = cost from Si to Dk on the least-cost route in Figure 2). Using a greedy method, the resulting transportation plan is to use one truck from S1 to D1, one truck from S1 to D2, three trucks from S1 to D3, and one truck from S3 to D3. Hence, the TL Plan for this example can be represented by: {(S1, Crossdock1, D1, 20), (S1, Directly, D2, 20), (S1, Crossdock1, D3, 60), (S3, Directly, D3, 20)}. Once the TL Plan is decided, it is not altered. Next, we construct an initial LTL Plan.

(2) LTL Plan. With a TL Plan in hand, the quantities remaining at supply and demand nodes can be determined. Using the example above, the remaining supply and demand are s1 = 3, s2 = 10, s3 = 8, and d1 = 5, d2 = 7, d3 = 2. Take a lane to be an ordered pair of supply and demand nodes and a route on a lane to be a crossdock through which shipments pass (possibly empty). Each lane can have multiple routes, i.e., crossdocks through which the product can transit. Take quantity to be the number of units of the product shipped on a lane, and let t1 be the time the product leaves a supply node and t2 the time the product leaves the crossdock, if it passes through one. To overcome the difficulty of time window constraints, the LTL Plan is generated lane by lane, i.e., the LTL Plan is represented by {(laneik, routeik, quantityik, t1ik, t2ik) : i ∈ Σ, k ∈ Δ}. For example, the element ((1, 2), 1, 13, 7, 12) obtained in an LTL Plan indicates that we should ship 13 units of the product from supply node 1 to demand node 2 through crossdock 1, where the time the product leaves supply node 1 is 7 and the time it leaves crossdock 1 is 12. Since the TL Plan is fixed, the initial solution depends only on the LTL Plan. It is generated through the following three steps.

Step 1. Preallocate: A plan is found greedily as follows. Sort supply and demand in decreasing order of the remaining quantities and then allocate quantities to be shipped from the largest demand to the smallest demand. In the example, s1 = 3, s2 = 10, s3 = 8 and d1 = 5, d2 = 7, d3 = 2; hence s2 > s3 > s1 and d2 > d1 > d3. Selecting the largest supply and demand, i.e., s2 = 10 and d2 = 7, we first satisfy D2 with S2, and then satisfy the second largest demand d1 = 5, so that D1 is satisfied with the remaining amount 3 at S2 and with the second largest supply at S3. We continue in this way until all the demand is satisfied. Under the assumption that supply is not less than demand, this can always be done and all lanes allocated with flow.

Step 2. Generate a Timetable: After preallocation, once routes are found for each lane, a timetable is constructed taking into account the time window
constraints. To avoid holding costs, feasible time intervals at each crossdock for each route of each lane are maintained, i.e., for a route of lane (Si, Dk), the time the product can remain at crossdock j is calculated according to the time windows at Si and Dk and the shipping times from Si to j and from j to Dk.

Step 3. Consolidate: Once a timetable has been constructed, consolidation is carried out at each supply point and crossdock according to the routes and timetable available, and the times t1 and t2 are determined to complete the plan and obtain a feasible initial solution.
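The TL/LTL split and the greedy preallocation of Step 1 can be sketched as follows; the function names are our own, and the example data are the remaining supplies and demands from the text (it assumes total supply is at least total demand, as in the paper).

```python
def split_tl_ltl(demand, Q):
    """Split each customer's demand into a full-truck-load part (multiples of Q)
    and the LTL remainder."""
    tl = {k: (d // Q) * Q for k, d in demand.items()}
    ltl = {k: d - tl[k] for k, d in demand.items()}
    return tl, ltl

def preallocate(supply, demand):
    """Step 1 (Preallocate): sort suppliers and customers by remaining quantity,
    then fill demands largest-first from the sorted suppliers.
    Assumes total supply >= total demand, as in the paper."""
    suppliers = sorted(supply.items(), key=lambda kv: kv[1], reverse=True)
    customers = sorted(demand.items(), key=lambda kv: kv[1], reverse=True)
    remaining = dict(suppliers)
    plan, s_idx = [], 0
    for customer, need in customers:
        while need > 0:
            name = suppliers[s_idx][0]
            if remaining[name] == 0:
                s_idx += 1
                continue
            q = min(remaining[name], need)
            plan.append((name, customer, q))   # ship q units on lane (supplier, customer)
            remaining[name] -= q
            need -= q
    return plan

print(split_tl_ltl({"D1": 25, "D2": 27, "D3": 82}, Q=20))
# ({'D1': 20, 'D2': 20, 'D3': 80}, {'D1': 5, 'D2': 7, 'D3': 2})
print(preallocate({"S1": 3, "S2": 10, "S3": 8}, {"D1": 5, "D2": 7, "D3": 2}))
# [('S2', 'D2', 7), ('S2', 'D1', 3), ('S3', 'D1', 2), ('S3', 'D3', 2)]
```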
Fig. 4. Consolidation

Figure 4 shows routes and quantities to be shipped in a plan. In this case, the flows from S1 to D1 and from S1 to D2 can be consolidated at S1, if the time window constraints are not violated, so that only one truck is needed to transport the product from S1 to Crossdock 1. Similarly, the flows from S2 to D2 and from S3 to D2 can be consolidated at Crossdock 2, and only one truck is needed to transport the product from Crossdock 2 to D2. There may be a holding cost if the products do not arrive at Crossdock 2 simultaneously, since the shipment arriving earlier has to wait for the later shipment. The flow from S3 to D3 does not transit any crossdock.
Stage 2: Iteration. Once an initial LTL Plan is found, it can be improved iteratively using a metaheuristic. We chose a genetic algorithm (GA) (Glover and Kochenberger 2003) since it is representative of the techniques available; it is used to refine the LTL Plan.

Using the GA. Once a TL Plan is generated, a GA can be used to improve the LTL Plan:

Step 1. Each solution is mapped into a sequence S = (a1, a2, . . . , an), where n is the number of customers and S is a permutation of 1, 2, . . . , n. The sequence is a customer priority ordering for choosing suppliers. To begin, sequences are generated randomly.

Step 2. Selection: the population is randomly divided into M/2 disjoint pairs, where M is the size of the population. Each pair of individuals is used in the crossover step.

Step 3. Crossover: each gene of an offspring is taken from one of its parents with equal probability. Let the parents be P1 = (a1, a2, . . . , an) and P2 = (b1, b2, . . . , bn); then a temporary offspring of P1 and P2 is given by Ctemp = (rand(a1, b1),
rand(a2, b2), . . . , rand(an, bn)) with Prob(rand(a1, b1) = a1) = Prob(rand(a1, b1) = b1) = 0.5. After crossover, Ctemp is normalized into a permutation of 1, 2, . . . , n by sorting the priorities and re-assigning values from 1 to n, giving an offspring C. For example, Ctemp = (3, 1, 2, 1) becomes C = (4, 2, 3, 1) or C = (4, 1, 3, 2) after normalization.

Step 4. Mutation: a new individual is produced from the offspring by a 2-swap neighborhood move with a given probability. For example, C = (4, 1, 3, 2) mutates to C = (2, 1, 3, 4), where the priorities of customer 1 and customer 4 are interchanged.

Step 5. Population update: the new individual is inserted into the population, replacing the one with the highest cost.
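A compact sketch of the crossover, normalization and 2-swap mutation steps is given below. The tie-breaking rule used in normalization is one admissible choice; the paper allows either of the two normalizations shown in its example.

```python
import random

def normalize(priorities):
    """Turn an arbitrary priority vector into a permutation of 1..n,
    so that a higher raw priority maps to a higher rank (ties broken by position)."""
    order = sorted(range(len(priorities)), key=lambda i: (priorities[i], -i))
    ranks = [0] * len(priorities)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks

def crossover(p1, p2):
    """Uniform crossover: each gene comes from either parent with probability 0.5,
    then the temporary child is normalized back into a permutation."""
    temp = [random.choice(pair) for pair in zip(p1, p2)]
    return normalize(temp)

def mutate(child, prob=0.1):
    """2-swap mutation: with probability prob, exchange two customers' priorities."""
    child = list(child)
    if random.random() < prob:
        i, j = random.sample(range(len(child)), 2)
        child[i], child[j] = child[j], child[i]
    return child

print(normalize([3, 1, 2, 1]))   # [4, 2, 3, 1] with this tie-breaking rule
```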
4 Computational Experiments
All the algorithms were coded in Java and executed on an Intel Pentium(R) 2.8 GHz machine with 1024 MB of memory. For comparison, we used CPLEX 8.0 to solve the integer program for the SCP given in Section 2. The test generation process, the parameter settings of the various methods, and detailed computational results are described in the following sections.
4.1 Test Data Generation
The test cases were generated with three sets of parameters, each of which was randomly generated from its base value param using the uniform distribution Unif[(4/5)param, (6/5)param].

(1) (p, c, h) — (setup cost, transportation cost, inventory holding cost). By setting the ratio p/c, normal setup cost (1000/5), high setup cost (2000/5), and very high setup cost (3000/5) test cases were generated; by setting the ratio p/h, low holding cost (1000/50) and high holding cost (1000/300) cases were generated.

(2) (q, cap) — (quantity of demand, truck capacity). By setting the ratio q/cap, we generated normal demand quantity (30/20), high demand quantity (50/20), and very high demand quantity (70/20) cases. Since, in the heuristic, the TL Plan is generated by a network flow algorithm, the performance of the heuristic was tested by setting the TL parts to be as small as possible (less than two truck loads) so that the LTL costs contribute more to the total cost.

(3) (t, d) — time windows and the distances between manufacturers, crossdocks and customers. By setting the ratio t/d, we generated normal time window (40/30) and narrow time window (20/30) cases. If the time interval used is too wide, the time window constraints become unnecessary.
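A hypothetical generator for such parameters might look as follows; the base values shown are the "normal" settings quoted above, and whether the truck capacity itself is also perturbed is our own assumption (it is kept fixed here).

```python
import random

def unif_around(param, rng=random):
    """Draw a parameter value from Unif[(4/5)*param, (6/5)*param]."""
    return rng.uniform(0.8 * param, 1.2 * param)

# Base values for the 'normal' setting taken from the ratios quoted above
# (p/c = 1000/5, p/h = 1000/50, q/cap = 30/20, t/d = 40/30).
p   = unif_around(1000)   # truck setup cost
c   = unif_around(5)      # unit-time transportation cost
h   = unif_around(50)     # inventory holding cost
q   = unif_around(30)     # demand quantity
cap = 20                  # truck capacity (kept fixed here)
t   = unif_around(40)     # time window width
d   = unif_around(30)     # distance / shipping time scale
print(round(p), round(c), round(h), round(q), cap, round(t), round(d))
```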
4.2 Results from the Experiments
A test set s-d-cd-attr indicates the number of manufacturers (s), the number of customers (d), the number of crossdocks (cd), and the attributes of the parameters (attr). The basic attribute is normal (n), which means that all three parameter sets are at the normal level described above.
Table 1. Result of general cases

Test set    NF+GA    T_NF+GA (s)    CPLEX    T_CPLEX (s)
10-20-3-n   227002   4              226127   600
10-40-3-n   456922   8              471217   600
10-50-3-n   571103   10             592163   600
10-60-4-n   691034   16             722602   600
10-80-4-n   919470   22             957030   1200
20-20-4-n   241805   10             233264   600
20-40-4-n   450842   20             468353   600
20-50-4-n   578840   29             598409   600
20-60-4-n   694650   33             721179   1200
The two-stage heuristic is denoted NF+GA (network flow with GA). Each test set contains 10 test cases, and the values in the tables are the average cost over the 10 test cases in each set. T_NF+GA and T_CPLEX denote running times in seconds. From Table 1, we see that as the number of customers increases, the performance of NF+GA becomes better than that of CPLEX, and NF+GA requires much less time than CPLEX.

Using Different Parameters

Setup cost and holding cost. Table 2 gives results for the effect of high (hs) and very high (es) setup costs. The table shows that the heuristic performs better than CPLEX on both the high and the very high setup cost sets. Table 3 shows that the heuristic provides superior solutions compared with CPLEX in almost all packages with high holding cost (h).

Narrow time windows. Table 3 also shows that the heuristic outperforms CPLEX in all test packages with narrow time windows (ntw).

Table 2. Result of high setup cost cases

Test set     NF+GA     CPLEX
10-40-3-hs   559446    569273
10-50-3-hs   754346    765658
10-60-4-hs   859917    900284
10-80-4-hs   1247090   1294442
10-40-3-es   773072    805238
10-50-3-es   955919    988054
10-60-4-es   1176252   1214280
10-80-4-es   1551440   1625256
Table 3. Result of high holding cost and narrow time window cases

Test set      NF+GA    CPLEX
10-20-3-h     224270   222071
10-40-3-h     462588   477714
10-50-3-h     569797   596800
10-60-4-h     690903   721774
10-20-3-ntw   218593   221497
10-40-3-ntw   446453   454824
10-50-3-ntw   562514   584205
10-60-4-ntw   674903   696020
Table 4. Result of large demand cases

Test set     NF+GA     CPLEX
10-20-3-lq   527552    545709
10-40-3-lq   986667    1020072
10-50-3-lq   1277606   1329626
10-60-4-lq   1558303   1642329
10-20-3-eq   774117    788267
10-40-3-eq   1542854   1604314
10-50-3-eq   1929691   2024923
10-60-4-eq   2323319   2500889
Large and very large demand. Table 4 shows the results for the large demand (lq) and very large demand (eq) cases. From the table we can see that CPLEX performs unstably as the demand quantity increases: the costs obtained by CPLEX are significantly larger than those found by the heuristic. As with the other test packages, the two-stage heuristic performed better than CPLEX for almost all the test sets.
5 Conclusion and Future Work
In this work, we studied a new shipment consolidation problem in distribution networks which include crossdocks, with manufacturer and customer time windows. A single product can be shipped directly or through crossdocks, and shipments are planned to reduce LTL shipments by consolidation at manufacturers and crossdocks. The problem is formulated as an integer program, which is NP-hard in the strong sense. A two-stage heuristic algorithm, which combines a network flow algorithm and a metaheuristic, was developed to solve the problem. We used a squeaky wheel optimization heuristic and a genetic algorithm to consolidate LTL shipments. The idea behind the heuristic is to split the demand into TL and LTL components; the TL component is solved as a network flow problem. In solving the LTL problem, the key point is to represent lanes from manufacturers to customers which are feasible with respect to the time window constraints; this allows the time constraints to be omitted, and the problem to be reduced, in the next step of the solution procedure. Computational experiments were carried out to compare the solutions obtained by CPLEX and the two-stage heuristic. These showed that the heuristic was able to provide better results and required significantly less computing time, while being more stable than CPLEX. In future research, the more difficult multi-commodity consolidation problem may be addressed.
References

[1] Bartholdi, John J. III and Gue, K. R. (2000) Reducing labor costs in an LTL crossdocking terminal, Operations Research, 48(6) 823-832.
[2] Gümüs, M., Bookbinder, J. H. (2004) Cross-docking and its implication in location-distribution systems, Journal of Business Logistics, 25(2) 199-228.
[3] Hoppe, B. and Tardos, E. (1995) The quickest transshipment problem, In: Proceedings of the Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, San Francisco, California, United States.
[4] Lim, A., Miao, Z., Rodrigues, B. and Xu, Z. (2005) Transshipment through crossdocks with inventory and time windows, Naval Research Logistics Quarterly, to appear.
[5] Miao, Z. (2005) Optimization problems in a crossdocking distribution system, Ph.D. thesis, Hong Kong University of Science and Technology.
[6] Napolitano, M. (2000) Making the move to crossdocking, The WERC-sheet, Warehousing Education and Research Council.
[7] Quinn, F. J. (1997) The payoff! Logistics Management 36(12) 37-41.
[8] Robinson, L.W. (1990) Optimal and approximate policies in multiperiod, multilocation inventory models with transshipments, Operations Research 38 278-295.
[9] Tayur, S. (1995) Computing optimal stock levels for common components in an assembly system, Working paper, Carnegie-Mellon University, USA.
[10] Wilson, R.A. (2002) Transportation in America, 19th edition, Eno Transportation Foundation, Washington, USA.
Application of an Intuitive Novelty Metric for Jet Engine Condition Monitoring
David A. Clifton 1,2, Peter R. Bannister 1, and Lionel Tarassenko 1
1 Department of Engineering Science, Oxford University, UK {davidc, prb, lionel}@robots.ox.ac.uk
2 Oxford BioSignals Ltd., Magdalen Centre, Oxford Science Park, Oxford, OX4 4GA, UK
Abstract. Application of novelty detection to a new class of jet engine is considered within this paper, providing a worked example of the steps necessary for constructing a model of normality. Abnormal jet engine vibration signatures are automatically identified with respect to a training set of normal examples. Pre-processing steps suitable for this area of application are investigated. An intuitive metric for assigning novelty scores to patterns is introduced, with benefits for reducing model sensitivity to noise, and in pruning patterns from the model training set. Keywords: Novelty Detection, Jet Engine Application, Machine Learning.
1 Introduction
Novelty detection, defined to be the identification of departures from normal system behaviour, is appropriate for areas of application in which examples of normal behaviour greatly outnumber examples of abnormal behaviour. The identification of abnormal jet engine vibration data is one such example, in which the majority of available examples are “normal” [12, 3]. This paper describes the application of the shape analysis method of novelty detection [9, 5] to a new engine example, investigating the necessary pre-processing steps required to provide accurate classification of abnormal examples from a new class of modern jet engine. An intuitive novelty metric is introduced for assigning novelty scores to engine examples such that sensitivity of the model of normality to noise in example patterns may be reduced. The novelty metric is used as a basis for meaningful pruning of data from the training set such that a robust model of normality is constructed. Results from the application of this novelty detection process to the new class of jet engine considered within this investigation are presented, providing evidence of the suitability of the approach in the identification of abnormal engine vibration signatures.
2 Data Description
This investigation considers a modern civil jet engine, consisting of several rotating engine shafts, mounted concentrically. In normal operation, air enters the low pressure (“LP”) shaft for compression, and is passed to the intermediate pressure (“IP”) and high pressure (“HP”) shafts for further compression, before ultimately being mixed with aviation fuel for use within the combustion chamber [10]. Vibration energy at the frequency of the fundamental harmonic of vibration associated with each shaft is referred to as the tracked order of that shaft [4]. During product testing, engines of the class considered within this investigation perform a controlled two-minute acceleration from idle to maximum speed, followed by a two-minute deceleration back to idle. Vibration amplitude levels of the fundamental tracked orders corresponding to each shaft are recorded during these tests, from which the speed-based vibration signature a(s) is constructed for rotational speeds s of each engine shaft. If a(s) ≤ H for all speeds s, where H is a contractual vibration limit, then the engine may be released into service. The data set used in the investigation described in this article consists of 71 engine examples, initially divided into four sub-sets D1 . . . D4 according to their maximum vibration amplitude max{a(s)} compared with the contractual vibration limit H, as shown in Table 1. Note that sub-set D4 is formed from recordings of engines with seeded fault conditions, in which small masses are deliberately applied to the engine fan blades. In normal practice such weights are applied to fan (and sometimes turbine) blades in order to correct unbalance during rotation of those blades about the shaft to which they are connected; here they are applied to introduce an unbalance. Such unbalance is noted by engine manufacturers to result in very large vibration amplitude levels of the LP shaft, with approximately normal behaviour in other shafts. Novelty detection applied to the IP and HP shafts should therefore not identify examples from sub-set D4 as being novel.
Table 1. Data classified according to maximum vibration amplitude
Sub-set   |Dn|   Class membership criteria
D1        29     Examples for which max{a(s)} < 0.9H
D2        15     Examples for which 0.9H ≤ max{a(s)} ≤ H
D3        17     Examples for which H < max{a(s)}
D4        10     Examples in which weights are deliberately applied to the engine fan
3 Quantisation of Vibration Signatures
Increasing dimensionality of data requires exponentially increasing numbers of patterns within the data set used to construct a general model; this is termed the curse of dimensionality [1]. In order to avoid this problem, each signature is summarised by a shape vector x. This is performed by computing a weighted average of the vibration amplitude values a(s) over D = 10 speed sub-ranges [9]. The dth dimension of shape vector x, for d = 1 . . . D, is defined to be:

x^d = \int_{s_{\min}}^{s_{\max}} a(s)\, \omega_d(s)\, ds \qquad (1)

in which the vibration amplitude a(s) is integrated over the speed range s : [s_min, s_max], using weighting functions ω_d(s), for d : 1 . . . D. The weighting functions used are overlapping trapezoids, as shown in Figure 1.
Fig. 1. Example of the weighting functions ω_d, d : 1 . . . 10, used to create shape vectors (plotted against s, the percentage of total LP shaft speed)
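As an illustration of Eq. (1) applied to sampled data, the sketch below computes a shape vector by numerically integrating the signature against trapezoidal windows. The window placement, edge length and speed range are assumptions made for the example only; the exact windows used in this work may differ.

```python
import numpy as np

def shape_vector(speed, amp, D=10, s_min=30.0, s_max=100.0, ramp_frac=0.25):
    """Summarise a vibration signature a(s) as a D-dimensional shape vector,
    approximating Eq. (1) with trapezoidal weighting functions w_d(s)."""
    x = np.zeros(D)
    width = (s_max - s_min) / D
    ramp = ramp_frac * width                       # length of each sloping edge
    for d in range(D):
        lo = s_min + d * width - ramp              # windows overlap their neighbours
        hi = s_min + (d + 1) * width + ramp
        # trapezoid: 0 outside [lo, hi], 1 in the flat core, linear on the edges
        w = np.clip(np.minimum(speed - lo, hi - speed) / ramp, 0.0, 1.0)
        x[d] = np.trapz(amp * w, speed)            # integral of a(s) * w_d(s) ds
    return x
```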
4 Normalisation
In order to identify changes in the shape of 10-dimensional vectors derived from vibration signatures with respect to a population of normal patterns, regardless of absolute values of vibration amplitude, normalisation is applied. This pre-processing step removes dependence upon absolute amplitudes, whilst preserving information about the general shape of signatures. Five normalisation methods were investigated [2] for scaling the d : 1 . . . 10 dimensional shape vectors {x_1^d, x_2^d, . . . , x_I^d} constructed from vibration signatures. These normalisation methods are defined in Table 2. We define the component-wise normalisation function N(x_i) to be a transformation of the D elements within pattern x_i:

N(x_i) = \frac{x_i^d - \mu^d}{\sigma^d}, \quad d = 1 \ldots D \qquad (2)

where (μ, σ) are vectors of D elements, computed component-wise across all i = 1 . . . I patterns:

\mu^d = \frac{1}{I} \sum_{i=1}^{I} x_i^d \qquad (3)
Table 2. Normalisation methods applied to engine shape vectors
Method                                   Definition
Energy-based normalisation               Σ_d (x^d)^2 = 1
Amplitude normalisation                  Σ_d x^d = 1
Unit normalisation                       max{x^d} = 1
Zero-mean, unit-variance normalisation   μ(x^d) = 0, σ(x^d) = 1
Component-wise normalisation             see Eq. (2)

\sigma^d = \left( \frac{1}{I-1} \sum_{i=1}^{I} (x_i^d - \mu^d)^2 \right)^{1/2} \qquad (4)
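A minimal sketch of the component-wise normalisation of Eqs. (2)-(4), applied to an (I × D) array of shape vectors; the use of the sample standard deviation (I − 1 in the denominator) follows Eq. (4).

```python
import numpy as np

def component_wise_normalise(X):
    """Normalise each of the D components of the I shape vectors in X
    to zero mean and unit standard deviation (Eqs. (2)-(4))."""
    mu = X.mean(axis=0)                 # Eq. (3): one mean per component
    sigma = X.std(axis=0, ddof=1)       # Eq. (4): sample standard deviation
    return (X - mu) / sigma, mu, sigma

# Patterns from a test engine are normalised with the training-set mu and sigma:
# x_test_normalised = (x_test - mu) / sigma
```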
In order to examine the results of normalisation, the data set was visualised by projecting the set of 10-dimensional shape vectors into 2 dimensions. Topographic projection is a transformation that attempts to best preserve, in the projected space of lower dimensionality (latent space, R^q), distances between data in their original high-dimensional space (data space, R^d). The Sammon stress metric [11] is based upon the distance d_ij between points (x_i, x_j) in R^d, and the distance d*_ij between projected points (y_i, y_j) in R^q:

E_{sam} = \sum_{i=1}^{N} \sum_{j > i} (d_{ij} - d^*_{ij})^2 \qquad (5)
in which the distance measure is typically Euclidean. The NeuroScale model [6, 7] trains a radial basis function (RBF) neural network to perform the mapping from R^d to R^q in which E_sam is minimised; i.e. distances between points are best preserved after projection. A NeuroScale network was used for projecting shape vectors derived from the example data set described previously, with d = 10 inputs (corresponding to the number of elements in each shape vector) and q = 2 outputs (for 2-dimensional projection). The projection of the un-normalised vibration signatures from the HP shaft during acceleration manoeuvres, as generated by the NeuroScale network, is shown in Figure 2(a). A new NeuroScale mapping is generated for use with each normalisation scheme, by training an RBF network using data normalised by each method previously considered. The resulting projections from each NeuroScale network are shown in Figure 2(b)-(f). Note that the axes of NeuroScale projections have no units of measurement: they are arbitrary orthogonal axes that allow minimisation of E_sam. The figures show that component-wise normalisation provides an increase in the separation of “abnormal” patterns (sub-set D3), while the other four normalisation methods result in significant overlap between “normal” and “abnormal” patterns. Thus, component-wise normalisation is used in pre-processing the 10-dimensional shape vectors for this engine data set.
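The Sammon stress of Eq. (5) can be evaluated directly for any candidate projection; the sketch below does so for a given pair of high- and low-dimensional point sets (it does not train the NeuroScale RBF network itself, only scores its output).

```python
import numpy as np
from scipy.spatial.distance import pdist

def sammon_stress(X_high, Y_low):
    """Sum over all pairs i < j of (d_ij - d*_ij)^2, as in Eq. (5),
    with Euclidean distances in both spaces."""
    d = pdist(X_high)        # pairwise distances in data space R^d
    d_star = pdist(Y_low)    # pairwise distances in latent space R^q
    return float(np.sum((d - d_star) ** 2))
```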
Fig. 2. Projections of HP shaft accelerations: (a) un-normalised; (b) unit normalisation; (c) amplitude normalisation; (d) energy-based normalisation; (e) zero-mean, unit-variance normalisation; (f) component-wise normalisation. Data sub-sets {D1 . . . D4} are shown by {• × +}, respectively.
5 Modelling Normality
The batch k-means clustering algorithm [13] was used, as described in [8], to construct a model of normality from “normal” patterns (i.e. those from sub-sets D1, D2). In this method, the distribution of “normal” patterns is defined by K cluster centres C_k in R^10 space, each with an associated cluster width σ_k. A novelty score z(x) may be computed for shape vector x with respect to the K cluster centres:

z(x) = \min_{k=1}^{K} \frac{d(x, C_k)}{\sigma_k} \qquad (6)

where d(x, C_k) is the Euclidean distance. A threshold H_z is defined such that patterns with novelty scores z(x) ≥ H_z are classified as “abnormal”. The k-means model assigns individual patterns to the nearest cluster centre C_k. We define the population γ_k of cluster C_k to be:

\gamma_k = \{\, i : [1 \ldots I] \mid d(x_i, C_k) < d(x_i, C_j), \ \forall j : [1 \ldots K],\ j \neq k \,\} \qquad (7)

A pattern is said to be outside a cluster if it lies beyond that cluster's width σ_k. Thus, an oft-used first approximation of cluster membership is to set the width σ_k of cluster C_k according to:

\sigma_k = \max_{i \in \gamma_k} d(x_i, C_k) \qquad (8)

effectively placing K hyperspherical clusters into data space R^d, each corresponding to the membership region of a cluster. We propose a new definition of width σ_k:

\sigma_k = \left( \frac{1}{|\gamma_k|} \sum_{i \in \gamma_k} d(x_i, C_k)^2 \right)^{1/2} \qquad (9)

effectively setting width σ_k to be the standard deviation of intra-cluster distances for each cluster's population γ_k. Thus, the novelty score z(x) in (6) may be interpreted as the number of standard deviations that x lies from its nearest cluster centre.
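A compact sketch of the novelty score of Eq. (6) with the proposed width of Eq. (9); the cluster centres and the initial assignment are assumed to come from a standard k-means fit.

```python
import numpy as np

def cluster_widths(X, centres, assign):
    """Eq. (9): root-mean-square distance of each cluster's population
    gamma_k to its centre C_k."""
    return np.array([
        np.sqrt(np.mean(np.linalg.norm(X[assign == k] - c, axis=1) ** 2))
        for k, c in enumerate(centres)
    ])

def novelty_score(x, centres, widths):
    """Eq. (6): minimum distance-to-centre, in units of that cluster's width."""
    d = np.linalg.norm(centres - x, axis=1)
    return float(np.min(d / widths))

# A pattern is classified "abnormal" when novelty_score(x, centres, widths) >= H_z.
```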
5.1 Sensitivity to Noise
The use of σ_k from (8) causes cluster widths to be dependent on the location of a single pattern in R^d; i.e. that pattern from population γ_k which is furthest from centre C_k. Small variations in the placement of that pattern relative to C_k can result in considerable variations in σ_k, resulting in poor robustness to noisy data. Furthermore, the cluster width σ_k does not represent the distribution of its population of patterns γ_k.
Fig. 3. Cluster centre widths calculated using (a) Eq. (8) and (b) Eq. (9) with re-allocation. Training patterns and cluster centres are shown by {• +}, respectively. Cluster boundaries are shown by dashed lines. The outlying pattern is shown by a large •.
This sensitivity to single patterns can be avoided by:
i. placing cluster centres C_k using the k-means algorithm, as before;
ii. computing σ_k using (9);
iii. computing z(x) for all patterns using (6);
iv. re-allocating patterns to cluster populations γ_k according to minimum novelty score using:

\gamma_k = \left\{ i : [1 \ldots I] \,\middle|\, \frac{d(x_i, C_k)}{\sigma_k} < \frac{d(x_i, C_j)}{\sigma_j}, \ \forall j : [1 \ldots K],\ j \neq k \right\} \qquad (10)

v. re-computing the final σ_k using γ_k from the previous step.
This is illustrated in Figure 3, using example data drawn from two circular distributions in R^2, and a single outlying pattern. We refer to these two distributions as p1 and p2, appearing in the lower-left and upper-right of Figure 3, respectively. The single outlying pattern, indicated by a large •, lies between p1 and p2, and is slightly closer to the centre of the former. Figure 3(a) shows widths σ_k computed using (8). The outlying pattern has been assigned to C_1, i.e. the cluster centre placed by the k-means algorithm to represent data drawn from p1. Thus, σ_1 is placed to include all of its population γ_1, and does not closely represent the distribution of the data in γ_1. Figure 3(b) shows widths σ_k after the single outlying pattern has been allocated to γ_2, using (10). Thus, (C_1, σ_1) closely represents distribution p1, and (C_2, σ_2) has extended slightly to cover the single outlying pattern. This illustrates the improvement in model-fit achieved with (9), in which the resulting two cluster centres closely model the underlying distribution from which the example data were drawn. This is beneficial in application to engine vibration data, in which patterns (derived from engine vibration signatures) are subject to noise.
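The re-allocation procedure (steps i-v) can be expressed as below; the helper repeats the width computation of Eq. (9), and the re-allocation of Eq. (10) is simply an argmin over normalised distances. This is a sketch under the assumption that every cluster retains at least one member after re-allocation.

```python
import numpy as np

def widths_of(X, centres, assign):
    """Eq. (9) for a given assignment of patterns to cluster populations."""
    return np.array([
        np.sqrt(np.mean(np.linalg.norm(X[assign == k] - c, axis=1) ** 2))
        for k, c in enumerate(centres)
    ])

def widths_with_reallocation(X, centres):
    d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)  # (I, K) distances
    assign = np.argmin(d, axis=1)                 # step i: k-means populations gamma_k
    sigma = widths_of(X, centres, assign)         # step ii: Eq. (9)
    assign = np.argmin(d / sigma, axis=1)         # steps iii-iv: Eq. (10), min d/sigma
    return widths_of(X, centres, assign), assign  # step v: final widths
```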
5.2 Improved Pruning of Training Data
The use of (9) to compute σ_k results in novelty scores z(x), computed using (6), which may be interpreted as the number of standard deviations that x lies from its closest cluster centre, using the normalised distance d(x, C_k)/σ_k. This interpretation, allowed by the proposed definition of the novelty score z(x), permits initial pruning of patterns from the training set, in order to discard outliers that are unrepresentative of normality and should not be included in the training set. This can be achieved using:
i-v. as before, in 5.1;
vi. discarding all patterns x for which z(x) > H_z.
Thus, if, for example, H_z = 3.0, all patterns which lie more than three standard deviations from their nearest cluster centre (using normalised distance) are discarded. This corresponds to the intuitive notion of discarding outliers based on their novelty relative to the distribution of the remainder of the data.
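Step vi then reduces to a simple filter over the training set, sketched here with the same conventions as above.

```python
import numpy as np

def prune_training_set(X, centres, sigma, h_z=3.0):
    """Discard all training patterns whose novelty score z(x) exceeds H_z."""
    d = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)  # (I, K)
    z = np.min(d / sigma, axis=1)              # Eq. (6) for every pattern
    return X[z <= h_z]
```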
6 Results of Application to Example Engine Data
The results of novelty detection using models trained on patterns from each engine shaft may be visualised using NeuroScale projection from R^10 onto R^2. Results for the LP and IP shafts are presented.
6.1 LP Shaft Results
Engines that were deliberately unbalanced by attachment of weights to the fan exhibit extremely high vibration amplitude within the LP shaft, as discussed in Section 2. Figure 4 shows a NeuroScale projection of all patterns generated from LP shaft vibration signatures. Six patterns derived from unbalanced engines (i.e. drawn from sub-set D4) are contained within the LP shaft data set, all of which are correctly classified as “abnormal” by the novelty detection process.
Fig. 4. Projection of all LP shaft patterns. Patterns classified as “abnormal” are circled. Data sub-sets {D1 . . . D4} are shown by {• × ∗}, respectively.
Fig. 5. Projection of all IP shaft patterns. Patterns classified as “abnormal” are circled. Data sub-sets {D1 . . . D4} are shown by {• × ∗}, respectively.
The figure shows that they are clearly separated from the cluster formed by the normal training patterns, supporting the assumption that unbalanced patterns should appear highly abnormal in the LP shaft. Four of the five patterns that form the set of abnormal example patterns for the LP shaft (i.e. drawn from sub-set D3) are classified “abnormal” with respect to the model. As can be seen from the figure, these four patterns classified “abnormal” are clearly separated from the normal patterns. The remaining pattern of this abnormal class lies in the centre of the normal cluster of patterns. This engine was released into service following pass-off testing (source A), but exceeded the contractual vibration limit for that shaft, placing it within sub-set D3 (as defined in Section 2). This provides support for the claim that comparison of maximum vibration amplitude to a fixed contractual limit H does not provide a reliable indication of abnormality, and that this shape analysis technique can provide better discrimination between “normal” and “abnormal” engines.
6.2 IP Shaft Results
All abnormal patterns (from sub-set D3 ) derived from IP shaft vibration signatures are correctly classified by the novelty detection process, as shown in Figure 5. Obvious separation between “abnormal” and “normal” classes can be seen in the figure. A single pattern from the “normal” class is classified as “abnormal”. This pattern was pruned from the training set prior to training, due to its outlying position relative to the rest of the “normal” patterns. The figure shows that patterns from sub-set D4 are not classified as “abnormal” within the IP shaft, following the observation that changes in vibration levels caused by unbalance conditions are localised within LP shaft vibration signatures.
7 Conclusions
Results from the use of the novelty detection procedure described within this investigation showed that abnormal examples of jet engine vibration signatures could be automatically identified, with a very low false-positive rate. Appropriate pre-processing of vibration signatures was investigated, with component-wise normalisation proving most effective in increasing separation between normal and abnormal engine examples. The pruning method based on the novelty metric described in this investigation results in appropriate removal of patterns that are significantly novel with respect to the majority of training patterns. This allows a model of normality to be constructed that closely models vibration signatures taken from normal engines, such that abnormal engine examples are identified. Acknowledgements. The authors gratefully acknowledge the support of the EPSRC; Dennis King and Mark Walters of Rolls-Royce Plc.; and Nicholas McGrogan of Oxford BioSignals Ltd.
References 1. Bishop, C.: Neural networks for pattern recognition. Oxford University Press (1995) 2. Clifton, D.A.: Condition Monitoring of Gas-Turbine Engines. Department of Engineering Science, University of Oxford, Transfer Report (2005) 3. Hayton, P., Scholkopf, B., Tarassenko, L., Anuzis, P.: Support vector novelty detection applied to jet engine vibration spectra. NIPS Proceedings (2000) 4. Hayton, P., Utete, S., Tarassenko, L.: QUOTE Project Final Report. University of Oxford (2003) 5. Hayton, P., Utete, S., King, D., King, S., Anuzis, P., Tarassenko, L.: Static and dynamic novelty detection methods for jet engine health monitoring. Proc. of the Royal Society (2006) 6. Lowe, D., Tipping, M.E.: Feed-forward neural networks and topographic mappings for exploratory data analysis. Neural Computing and Applications (1996) 7. Nabney, I.T.: NETLAB, algorithms for pattern recognition. Springer (2002) 8. Nairac, A., Corbett-Clark, T.A., Ripley, R.M., Townsend, N.W., Tarassenko, L.: Choosing an appropriate model for novelty detection. IEE 5th International Conference on Artificial Neural Networks (1997) 9. Nairac, A., Townsend N., Carr, R., King, S., Cowley, P., Tarassenko, L.: A system for the analysis of jet engine vibration data. Integrated Computer-Aided Engineering (1999) 10. Rolls-Royce PLC: The jet engine. Renault Publishing (2005) 11. Sammon, JW.: A non-linear mapping for data structure analysis. IEEE Transactions on Computers (1969) 12. Tarassenko, L., Nairac, A., Townsend, N.: Novelty detection. Proc. of NCAF/NSYN (1996) 13. Tarassenko, L.: A guide to neural computing applications. Arnold (1998)
Determination of Storage Locations for Incoming Containers of Uncertain Weight
Jaeho Kang 1, Kwang Ryel Ryu 1, and Kap Hwan Kim 2
1 Department of Computer Engineering, Pusan National University, San 30, Jangjeon-dong, Kumjeong-gu, Busan, 609-735, Korea {jhkang, krryu}@pusan.ac.kr
2 Department of Industrial Engineering, Pusan National University, San 30, Jangjeon-dong, Kumjeong-gu, Busan, 609-735, Korea [email protected]
This work was supported by the Regional Research Centers Program (Research Center for Logistics Information Technology), granted by the Korean Ministry of Education Human Resources Development.
Abstract. In container terminals, heavier containers are loaded onto a ship before lighter ones to keep the ship balanced. To achieve efficient loading, terminal operators usually classify incoming export containers into a few weight groups and place containers belonging to the same weight group in the same stack. However, since the weight information available at the time of a container's arrival is only an estimate, a stack often includes containers belonging to different weight groups. This mix of weight groups necessitates extra crane work, or container re-handling, during the loading process. This paper employs a simulated annealing algorithm to derive a more effective stacking strategy for determining the storage locations of incoming containers of uncertain weight. It also presents a method of using machine learning to reduce occurrences of re-handling by increasing classification accuracy. Experimental results have shown that the proposed methods reduce the number of re-handlings more effectively than the traditional same-weight-group-stacking (SWGS) strategy.
1 Introduction
In container terminals, average ship turn-around time is the most important productivity criterion. Container loading and unloading are the two major services provided by terminals, and these services take the majority of available turn-around time. When loading containers, heavier containers must be brought from the yard and loaded onto a ship before the lighter containers are loaded to keep the ship balanced. This weight constraint makes the planning of container loading very difficult. Therefore, terminal operators usually use a more relaxed constraint by grouping containers into a limited number of weight groups and loading them onto the ship in decreasing order (from heaviest to lightest). However, a container is often misclassified because the weight information provided by the trucking company is not accurate. The true weights of containers
are obtained one or two days before the arrival of the ship; by this time, most of the containers destined for the ship have already been stacked. As a result, containers belonging to different weight groups may be piled together, causing extra work during loading. In this situation, a yard crane needs to move a lighter container that has been stacked on top of a heavier container to a temporary location. This temporary container move is a re-handling, and is the major cause of delay in loading. Therefore, the number of re-handlings should be minimized to improve terminal productivity. Although the traditional stacking strategy (SWGS) is known for causing a large number of re-handlings, terminals have regarded it as the best possible stacking strategy – whether the weight information is accurate or not. There has been little research concerning the re-handlings caused by the inaccuracy of weight information. In this paper, we propose a method of deriving a more effective stacking strategy through the application of a simulated annealing algorithm. Experimental results have shown that stacking strategies derived through our method could reduce the number of re-handlings more effectively than SWGS could. In addition, machine learning is shown to further reduce the number of re-handlings by increasing classification accuracy. Section 2 of the paper describes the re-handling problem in a terminal and reviews some related research. Section 3 demonstrates the method of estimating the number of re-handlings caused by inaccurate weight information. The implementation details of determining a more effective stacking strategy are presented in Section 4. In Section 5, the proposed method is tested and the results are summarized. Finally, Section 6 contains our conclusions and outlines topics for future research.
2 Container Re-handling Problem
The storage yard of a container terminal consists of multiple blocks, each having several bays. A bay is made up of several rows of container stacks of a certain maximum height, called tiers. Figure 1(a) presents a top view of a block, about half of which is occupied by the containers to be loaded onto ship ‘A.’ The containers with the same port of destination – indicated by ‘Hong Kong,’ ‘Singapore,’ and ‘Kobe’ – are stored at adjacent locations. Since a bay is only long enough to accommodate a 20-foot container, the 40-foot containers occupy two consecutive bays. In the figure, the containers belonging to the same weight group are stored in the same bays; this practice does not always occur in reality, but they must at least be stored in the same stack. In most container terminals, the containers are classified into three weight groups: light (L), medium (M), and heavy (H), as indicated in the figure. Figure 1(b) is a cross-sectional view of a block that corresponds to a 4-tier 6-row bay. Figure 2(a) depicts a bay accommodating only the containers classified to weight group M at the time of their arrival. Figure 2(b) shows the same bay with the containers re-classified based on the correct weight information obtained just before loading time.
Fig. 1. (a) Top view and (b) cross-sectional view of a block
It turns out that the eight containers indicated by thick boxes in the figure actually belong to weight groups other than M. The eight shaded containers are those that must be re-handled (i.e., they have to be moved to other locations at the time of loading to fetch out the heavier containers under them). Eight (the number of shaded containers) is the lower bound on the number of required re-handlings. In most cases, we can easily derive a container-loading plan with the required number of re-handlings equal to the lower bound: as more containers are fetched out, more spaces become available for the re-handled containers to be placed without causing any further re-handling. Figure 2(c) reveals the transition of bay status as the containers are fetched out, starting from the initial configuration of Fig. 2(b). In each configuration, the containers to be fetched out are indicated by thick boxes, and those that should be re-handled are shaded. In the sections that follow, we simplify our analysis by assuming that the estimated number of re-handlings for a given bay configuration is always equal to its lower bound.
Fig. 2. Status of a bay (a) before and (b) after obtaining the correct weight information, and (c) change of bay status as containers are fetched out for loading
Few studies have focused on reducing the number of re-handlings of export containers with inaccurate weight information. Kim et al. derived an optimal stacking strategy for export containers when the weight group classification was always correct.[1] Kim studied the re-handling of inbound containers and derived a formula for estimating the number of required re-handlings.[2] Castilho et al.
introduced a formula that calculated the number of re-handlings when a container was selected randomly for fetching out in an inbound yard.[3] Watanabe proposed an accessibility index as an indication of re-handling occurrences.[4] The last three studies estimated the number of re-handlings for inbound containers for which there was no ordering constraint. In contrast, this paper estimates the number of re-handlings to evaluate stacking strategies for export containers having uncertain weight information and some ordering constraints.
3 The Expected Number of Container Re-handlings
Given a stack of containers with the estimated weight groups assigned, we want to calculate the expected number of re-handlings needed to fetch out all the containers from that stack. This expected number later becomes the basis on which we evaluate the stacking results produced by the stacking strategies under investigation. One of the things we need to know to calculate the expected number of re-handlings is the probability distribution of the true weight groups, given an estimated weight group. To obtain this distribution, the record of containers exported over a certain five-month period at a real terminal was analyzed. Table 1 is the confusion matrix of weight grouping of all the 20-foot containers in our record. The numbers in the parentheses next to the group labels L, M, and H in the column and row headings are the proportions of the containers belonging to the respective group. The number in each cell is the proportion of the containers that belonged to a particular estimated weight group and a true weight group. For example, the 7.1% entry indicates that 7.1% of all the 20-foot containers had been classified to weight group M based on the estimated weight but turned out to belong to the true weight group L. The overall accuracies of weight grouping based on estimated weight were less than 70% for 20-foot containers and less than 65% for 40-foot containers. Now, we explain how we calculate the expected number of container re-handlings for a given stack. To simplify our example of calculating the number of re-handlings, assume that the maximum stack height is three. Since there are three different weight groups, there can be 3 × 3 × 3 = 27 different weight combinations (Fig. 3). If all the weight labels are correct, the shaded containers in the figure will be the ones that need to be re-handled. For example, the weight combination MLH (bottom-to-top order) requires a single re-handling when the bottom container M, placed under a lighter container L, is to be fetched out. Suppose we want to calculate the expected number of re-handlings for a stack of three containers with the estimated weight labels LLM.
Fig. 3. Twenty-seven possible weight combinations for a stack of height three
Table 1. The confusion matrix for 20-foot containers
                          True weight group
Estimated weight group    L (20.0%)   M (23.1%)   H (56.9%)
L (13.9%)                 11.1%       2.0%        0.9%
M (39.7%)                 7.1%        17.4%       15.3%
H (46.4%)                 1.8%        3.8%        40.8%
Then the conditional probability of the true weight combination of this stack being MLH, given the estimated weights LLM, is

p(MLH \mid LLM) = p(M \mid L) \cdot p(L \mid L) \cdot p(H \mid M). \qquad (1)
Since the weights of the containers are independent of one another, the conditional probability for a stack can be obtained by multiplying the conditional probabilities for the respective individual containers. From the confusion matrix of Table 1, we see that p(M|L) = 0.134, p(L|L) = 0.8, and p(H|M) = 0.383; thus, p(MLH|LLM) = 0.041. If n(xyz) denotes the number of required re-handlings for a stack with true weight labels x, y, and z, the expected number of re-handlings E[qrs] for a stack with estimated weight labels q, r, and s can be calculated by

E[qrs] = \sum_{x,y,z \in \{L,M,H\}} n(xyz) \cdot p(xyz \mid qrs) = \sum_{x,y,z \in \{L,M,H\}} n(xyz) \cdot p(x \mid q) \cdot p(y \mid r) \cdot p(z \mid s). \qquad (2)

The above equation can be easily extended to the general case of a stack with an arbitrary height. The lower bound on the expected number of re-handlings for loading all the containers in a bay can be obtained by summing up the expected numbers of re-handlings of each stack of the bay. This lower bound is used to evaluate the stacking results produced by the stacking strategies under investigation.
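For a stack of arbitrary height, Eq. (2) can be evaluated by enumerating the 3^height true-weight combinations. The sketch below does this; p_true_given_est is assumed to be the row-normalised confusion matrix (e.g. p_true_given_est["L"]["M"] = p(M|L)), and n_rehandles counts the lower bound used throughout this paper: a container must be re-handled if a strictly heavier container lies anywhere below it.

```python
from itertools import product

GROUPS = "LMH"
WEIGHT = {"L": 0, "M": 1, "H": 2}

def n_rehandles(true_stack):
    """Lower bound n(xyz...) for a stack given bottom-to-top true weight groups."""
    return sum(
        any(WEIGHT[below] > WEIGHT[true_stack[i]] for below in true_stack[:i])
        for i in range(len(true_stack))
    )

def expected_rehandles(est_stack, p_true_given_est):
    """Eq. (2): expected number of re-handlings for a stack with estimated labels."""
    total = 0.0
    for combo in product(GROUPS, repeat=len(est_stack)):
        p = 1.0
        for est, true in zip(est_stack, combo):
            p *= p_true_given_est[est][true]       # p(true | estimated)
        total += n_rehandles(combo) * p
    return total

# Example: expected_rehandles("LLM", p) reproduces the E[LLM] computation above.
```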
4 Deriving Good Stacking Strategies
In this section, we describe the representation of a stacking strategy and explain how we applied a simulated annealing algorithm [5] to the problem of finding a good stacking strategy. We also point out that we can further reduce the number of re-handlings if we increase the accuracy of the weight classification by applying machine learning.
4.1 Representation of a Stacking Strategy
Assuming the maximum stack height of three, there are thirteen possible stack types of different weight combinations (including an empty stack) on which a new container can be placed. Figure 4(a) shows all these stack types, each tagged with a unique index number referring to the particular type. We represent a stacking
rule by a permutation of these stack type indexes to indicate the precedence for placing a newly arriving container. Figure 4(b) depicts an example situation of a bay in which we want to select a stack on top of which we may place a new container. The current state of the bay is represented by a vector B = (2, 7, X, 10, 6, 4), where the numbers are the type indexes of the corresponding stacks and X is an indication that the corresponding stack is full. The figure also shows two different stacking rules s1 and s2 . According to rule s1 , the best place for a new container is p1 , which is on top of type 2 stack because type 1 stack (which is of higher precedence) cannot be found in the current state of the bay. Similarly, rule s2 selects p2 as the best place because it is on top of type 6 stack, which is currently the best type of stack available.
Fig. 4. (a) Thirteen possible 3-tier stack types on which there are empty slots for a newly arriving container to be placed, and (b) example of a bay and two stacking strategies: the state of the bay is (2, 7, X, 10, 6, 4), with s1 = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13) and s2 = (5, 9, 8, 6, 1, 3, 7, 10, 2, 13, 12, 11, 4).
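Given the bay state and a precedence rule, placing a container amounts to finding the most preferred stack type that is currently present in the bay. The sketch below reproduces the selection illustrated in Fig. 4(b); the type-index bookkeeping (updating a stack's type after a container is placed) is omitted.

```python
FULL = "X"          # marker for a stack that has reached the maximum height

def select_stack(bay_state, rule):
    """Return the position of the stack on which to place the new container,
    or None if every stack in the bay is full.

    bay_state: current type index of each stack, e.g. (2, 7, FULL, 10, 6, 4)
    rule:      a precedence rule, i.e. a permutation of the stack-type indexes
               ordered from most to least preferred
    """
    for preferred in rule:                    # scan stack types by preference
        for pos, t in enumerate(bay_state):
            if t == preferred:
                return pos
    return None

# With the example of Fig. 4(b):
s1 = (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13)
s2 = (5, 9, 8, 6, 1, 3, 7, 10, 2, 13, 12, 11, 4)
bay = (2, 7, FULL, 10, 6, 4)
# select_stack(bay, s1) -> 0  (the type-2 stack, position p1)
# select_stack(bay, s2) -> 4  (the type-6 stack, position p2)
```

A complete strategy (sL, sM, sH) simply applies the rule matching the estimated weight group of the arriving container.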
Since the preferred position for an incoming container can be different depending on its estimated weight group, a complete stacking strategy is a triple (sL, sM, sH), where se is the stacking rule applicable when the estimated weight group of an incoming container is e. There are more than six billion possible permutations of the thirteen stack types; thus, there are more than 10^29 possible stacking strategies. This number grows exponentially with stack height. Therefore, the problem of finding an optimal stacking strategy is intractable, so a heuristic search might be the only viable solution.
4.2 Searching for a Good Stacking Strategy
Our simulated annealing search starts from a strategy consisting of three randomly generated precedence rules, one for each weight group. Neighbors of the current strategy are generated by randomly picking a precedence rule and then swapping two randomly chosen type indexes. A strategy could, in principle, be evaluated by calculating the lower bound on the expected number of re-handlings for all possible bay configurations generated by that strategy. However, a strategy can be evaluated only by simulation because it is infeasible to enumerate all possible bay configurations. Our simulation procedure is as follows. First, a random sequence of containers incoming to a bay is generated according to the probability distribution of the estimated weight groups. The sequence should be long enough to fill up a whole bay. Then the stacking strategy under investigation is applied to place each container in sequence,
starting from an empty bay. (Note that, in reality, the sequence of incoming containers is not known ahead of time, although we generate the whole sequence for the simulation; when stacking a container, we do not know the type of the container to come in next.) Finally, once all the containers in the sequence are stacked, we calculate the lower bound on the expected number of re-handlings for that particular bay by applying the method described in the previous section. By repeating this simulation process with randomly generated new sequences a sufficient number of times, we can get a good estimate of the quality of the stacking strategy. In our experiment, we repeated these simulations ten thousand times per strategy.
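The search itself can be sketched as a standard simulated annealing loop. The neighbour move (swapping two type indexes in a randomly chosen rule), the cooling schedule T ← T − 0.001T and the termination criteria follow the description in this paper; the Metropolis-style acceptance rule is an assumption, since the acceptance criterion is not spelled out, and evaluate(strategy) stands for the simulation-based estimate described above.

```python
import math
import random

def neighbour(strategy):
    """Swap two randomly chosen type indexes within one randomly chosen rule."""
    rules = [list(r) for r in strategy]
    r = random.randrange(len(rules))
    i, j = random.sample(range(len(rules[r])), 2)
    rules[r][i], rules[r][j] = rules[r][j], rules[r][i]
    return tuple(tuple(rule) for rule in rules)

def anneal(initial, evaluate, t0=100.0, cool=0.001, max_evals=20000, max_stall=1000):
    current, cur_cost = initial, evaluate(initial)
    best, best_cost = current, cur_cost
    t, stall = t0, 0
    for _ in range(max_evals):
        cand = neighbour(current)
        cost = evaluate(cand)                        # expected re-handlings (simulated)
        if cost < cur_cost or random.random() < math.exp((cur_cost - cost) / t):
            current, cur_cost = cand, cost           # accept the move
        if cost < best_cost:
            best, best_cost, stall = cand, cost, 0
        else:
            stall += 1
            if stall >= max_stall:                   # no improvement for 1000 moves
                break
        t -= cool * t                                # T <- T - 0.001 * T
    return best, best_cost
```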
4.3 Improving the Accuracy of Weight Grouping Through Learning
Many re-handling situations could be avoided if the weight information given by the trucking companies at the time of container arrival were more accurate. Based on the conjecture that the types of weight errors made depend on the trucking company, we tried to capture the pattern by applying machine learning algorithms. For example, some trucking companies tend to overestimate, while others tend to underestimate, the weight of containers. The record obtained from a real container terminal is our data for learning. The containers of the first four months in that record are used for training, and the rest for testing. The attributes describing each container include the name of the trucking company, the estimated weight in tons (EWT), and a flag indicating whether it is a trans-shipment container. The class value of each container is its true weight group. We tried several machine learning algorithms, including the decision tree [6], naive Bayes, 1R, and PART. By using the estimated weight group given by the learned classifier, we were able to observe some reduction in the number of re-handlings. Detailed results of the experiments are given in the next section.
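A minimal sketch of this re-estimation step, using scikit-learn's decision tree in place of Weka's learner (an assumption, as are the column names and the depth limit): the classifier maps the booking attributes to a corrected weight group.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

def fit_weight_classifier(train: pd.DataFrame):
    """train has one row per container: CNAME (trucking company), EWT (estimated
    weight in tons), TRANSSHIP (flag) and TRUE_GROUP ('L'/'M'/'H') as the label."""
    X, y = train[["CNAME", "EWT", "TRANSSHIP"]], train["TRUE_GROUP"]
    pre = ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), ["CNAME", "TRANSSHIP"])],
        remainder="passthrough",                  # EWT passes through as numeric
    )
    model = Pipeline([("pre", pre), ("tree", DecisionTreeClassifier(max_depth=8))])
    return model.fit(X, y)

# corrected_group = fit_weight_classifier(train).predict(new_containers)
```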
5 Experimental Results
Of the collected record of containers exported over five months at a real container terminal, those for the first four months were used to learn classifiers for weight grouping and to derive stacking strategies. The rest were used to evaluate the derived strategies. Separate experiments were conducted for 20-foot and 40-foot containers, because they had different confusion matrices. For the learning of weight classification, the data mining tool Weka [7] was used. In an effort to improve the accuracy of weight grouping, we tried several machine learning algorithms to find a good classifier. Among the algorithms tried, decision tree learning showed the best performance. The accuracies of weight grouping by the decision tree classifier were about 5% higher than those of weight grouping without learning. Figure 5 illustrates part of a decision tree generated for 40-foot containers. The attribute CNAME is the name of a trucking company.
Fig. 5. Part of a decision tree generated for 40-foot containers
The decision tree reveals tendencies of either overestimating or underestimating the weights, depending on the company. For example, the left part of the tree indicates containers that actually belong to weight group M, although company CITLC estimated their weights to be between 8 and 9. Note that a 40-foot container of weight less than 10 is regularly classified to weight group L. By using this decision tree, we can compensate for the errors in estimating the weight group. Table 2 presents the results of experiments performed for a 6-tier 9-row bay, which is used in automated container terminals. We implemented and compared four different stacking strategies, including the one that we propose. SWGS refers to the traditional strategy of stacking the containers of the same weight group on the same stack. DP refers to the stacking strategy derived by dynamic programming, as described in [1]. Note that DP can derive an optimal strategy under the condition that the weight grouping is perfectly correct. Simple-rule positions a newly arriving container using a simple heuristic [8]. The heuristic selects a stack by considering only what the heaviest estimated weight group of the stack is. There are about 90,000 possible such heuristics. All these heuristics are generated and evaluated by simulation, and the best one is selected. Since our method uses a simulated annealing algorithm to search for a good strategy, the results differed from experiment to experiment. The numbers in the table are the averages of the results of five experiments. Our simulated annealing search was terminated when twenty thousand strategies had been evaluated or no improvement was observed for one thousand successive moves. The temperature T was initially set to 100.0. After each evaluation, the current temperature T was lowered by subtracting a small fraction of it (i.e., T ← T − 0.001 × T). The search usually took less than two hours on a Pentium PC with a 1.7GHz Xeon CPU. The table presents the relative frequencies of re-handlings incurred by the different strategies compared to the frequency achieved by SWGS without learning. In all cases, our proposed method showed the best performance. We often observed further reductions of re-handling frequencies when we used a learned decision tree to make a better estimate of the weight group. For the 20-foot containers, however, the performances of SWGS and DP became slightly poorer when a decision tree was used for weight classification.
Table 2. Relative frequencies of re-handlings of different strategies in reference to the frequency by SWGS (without learning) for a 6-tier 9-row bay
                     20-foot                            40-foot
Stacking Strategy    Without learning   Decision Tree   Without learning   Decision Tree
SWGS                 100.0%             100.5%          100.0%             91.4%
DP                   88.5%              89.6%           88.6%              80.2%
Simple rule          73.7%              68.8%           81.6%              75.8%
Proposed method      70.6%              68.9%           78.3%              72.1%
Although the overall accuracy of weight classification was improved by the decision tree, the learned decision tree tended to misclassify the real weight groups M and H as weight group L more often than the regular weight grouping rule without learning. Since the expected number of re-handlings of a stack full of group L containers is likely to be higher than that of a stack full of group M or H containers, the use of such a decision tree might have some negative effects. If we assume that only one container is misclassified and that misclassifications to other weight groups are equally likely, the expected number of re-handlings is 1 for stack LLL, 5/6 for MMM, and 2/3 for HHH. The strategies derived by our method somewhat overcame this weakness of the decision tree because the derivation of the strategies was based on the confusion matrix of the decision tree. Figure 6 reveals a strategy derived by our method without using the learned classifier for weight group estimation. The strategy was for stacking 20-foot containers in a 4-tier 6-row bay. The strategy consisted of three precedence rules, one for each of the three weight groups L, M, and H. Each precedence rule was represented by a sequence of stack types in the order of preference. We were interested in knowing what types of stacks had actually been frequently selected to place the incoming containers of each weight group. Therefore, the figure shows only those stack types that had been selected to place at least 3% of all the containers coming in during the simulation. For example, “8.0%” in the left part of the figure indicates that the number of containers of weight group L that were placed on the ground (i.e., an empty stack) was equal to 8.0% of all the containers that came in during the simulation. Overall, we observe that the containers of weight group L were preferred to be placed on empty stacks or on stacks having just one container of weight group L.
Fig. 6. Strategy derived for stacking 20-foot containers in a 4-tier 6-row bay
In contrast, the containers of weight group H were preferred to be placed on top of stacks of many heavy containers. A somewhat different strategy was derived for the 40-foot containers because the confusion matrix of 40-foot containers was different from that of 20-foot containers. It is noteworthy that yet different strategies were derived when we used the learned classifier for weight group estimation, again due to the different confusion matrices.
6 Conclusions
We have presented a method for deriving a strategy for stacking containers with uncertain weight information. Stacking incoming containers according to our strategy can significantly reduce the number of container re-handlings at the time of loading. Since re-handlings occur due to errors in the estimated weight groups of containers, we have devised a method that is sensitive to the probability distribution of the true weight group given an estimated weight group. We have implemented a simulated annealing algorithm to find a good stacking strategy within a reasonable time. It was confirmed that the number of re-handlings could be further reduced by improving the accuracy of weight grouping through the use of a learned classifier. The next step of our work will be to apply cost-sensitive learning to weight classification, since experimental results have shown that improvement of classification accuracy alone does not always lead to a reduction of re-handling counts for some stacking strategies.
References 1. Kim, K. H., Park, Y. M., Ryu, K. R.: Deriving Decision Rules to Locate Export Containers in Container Yard. European Journal of Operational Research 124 (2000) 89–101 2. Kim, K. H.: Evaluation of the Number of Rehandles in Container Yards. Computers and Industrial Engineering 32(4) (1997) 701–711 3. Castilho, B., Daganzo, C. F.: Handling Strategies for Import Containers at Marine Terminals. Transportation Research 27B(2) (1993) 151–166 4. Watanabe, I.: Characteristics and Analysis Method of Efficiencies of Container Terminal - An Approach to the Optimal Loading/Unloading Method. Container Age (1991) 36–47 5. Aarts, E., Korst, J.: Simulated Annealing, Local Search in Combinatorial Optimization. John Wiley & Sons (1997) 91-120 6. Quinlan, J. R.: C4.5: Programs for Machine Learning. San Mateo, CA, Morgan Kaufmann (1993) 7. Witten, I. H., Frank, E.: Data Mining - Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann Publisher (2000) 8. Kang, J., Ryu, K. R., Kim, K. H.: Generating a Slot Assignment Rule for Outbound Containers Having Imprecise Weight Information. Journal of Korean Navigation and Port Research 29(6) (2005) 573–581 (in Korean)
Fault Diagnostics in Electric Drives Using Machine Learning
Yi L. Murphey*, M. Abul Masrur*, and ZhiHang Chen
* Senior Member, IEEE.
Abstract. The electric motor and the power-electronics-based inverter are the major components in industrial and automotive electric drives. In this paper we present a fault diagnostics system, developed using machine learning technology, for detecting and locating multiple classes of faults in an electric drive. A machine learning algorithm has been developed to automatically select a set of representative operating points in the (torque, speed) domain, which in turn is sent to the simulated electric drive model to generate signals for the training of a diagnostic neural network, the “Fault Diagnostic Neural Network” (FDNN). We present our study on two different neural network systems and show that a well-designed hierarchical neural network system is robust in detecting and locating faults in electric drives. Keywords: intelligent systems, neural networks, machine learning.
1 Introduction
Fault diagnostics for internal combustion (IC) engine vehicles has been well investigated [1,4,10], but not to the same extent for electric or hybrid vehicles. There is, however, active research in electrical system diagnostics [2, 5, 6, 8, 9]. Rule-based expert systems and decision trees are the two traditional diagnostic techniques, but they have serious limitations. A rule-based system often has difficulties in dealing with novel faults and in acquiring complete knowledge to build a reliable rule-base, and it is system dependent. A decision tree can be very large for a complex system, and it is also system dependent, such that even small engineering changes can mean significant updates [2]. More recently, model-based approaches, fuzzy logic, artificial neural networks (ANN), and case-based reasoning (CBR) have become popular techniques for various fault diagnostics problems in electrical systems. Our research goes one step beyond those published works. Most of the existing diagnostic systems are built to detect a faulty condition against the normal condition. We present an advanced machine learning technology for the development of a robust diagnostic system that has the capability of detecting and locating multiple classes of faults in an electric drive operating at any valid (torque, speed) condition. The diagnostic system FDNN (Fault Diagnostic Neural Network) is trained with simulation data generated by a robust machine learning algorithm. The diagnostic results provided by FDNN can be used to make gracefully degradable [6, 8] operation of a faulty drive possible. Experiments were conducted on the
simulated data and the results show that the proposed diagnostic system is very effective in detecting multiple classes of faulty conditions of an inverter in an electric drive operating at any valid (torque, speed) point.
2 Electric Drive Fault Detection Using Signal Analysis and Machine Learning
Figure 1 illustrates our approach to fault diagnostics of an electric drive. “SIM_drive”, a simulation model of a field-oriented control electric drive with a power-electronics-based inverter and a 3-phase induction motor (see Figure 2), is developed and implemented using the Matlab-Simulink software. SIM_drive has the capability of simulating the normal operating condition of an electric drive as well as the faulty conditions of open and post-short-circuit faults in an inverter switch. The SIM_drive model operates at any selected (torque, speed) operating point under normal and various faulty conditions. Since, in the real world, an electric drive can operate at different (torque, speed) points, a diagnostic system should be trained to be robust throughout the (torque, speed) domain.
Fig. 1. Illustration of a model-based fault diagnostic system driven by machine learning
Fig. 2. A six-switch inverter in a 3-phase electric drive
A machine learning algorithm is developed to select representative operating points from the (torque, speed) domain for use by the SIM_drive model to generate training data. The details of SIM_drive can be found in [7]. In this paper we focus on the machine learning system and the fault diagnostic system. The objective of the machine learning approach is to train a diagnostic system on representative data so that it has the capability of performing accurate fault diagnostics in an electric drive that operates at any valid operating point. The intelligent system used in this research is a multi-class neural network system.
2.1 Multiple Class Fault Detection
We are attempting to develop a robust diagnostic system to detect the faulty cases in the electric drive system shown in Figure 2. The challenge in developing such robust diagnostic systems lies in the fact that it is easier to identify signatures of a faulty condition against the normal condition, whereas signal signatures of one fault against another are often quite subtle. We model fault diagnostics in an electric drive as a multi-class classification problem. Figure 3 illustrates the computational steps involved in the signal fault detection system, where the input consists of the input voltages Van, Vbn, Vcn to the motor, the currents Ia, Ib, Ic, and the motor electro-magnetic torque Te. The first computational step is to segment the signals and extract the signal features from each segment. The signal segments are then analyzed by an artificial neural network, which is trained on the signals generated by SIM_drive at the parameter points selected by the CP-Select algorithm, a machine learning algorithm. The major research contribution in this paper is the machine learning technology used to train a neural network that can robustly detect and locate faults inside an electric drive operated under any given valid condition.
Fig. 3. Major computational steps in a signal fault detection system
2.2 Signal Segmentation and Feature Extraction
Signal fault detection is performed on a segment-by-segment basis. All input signals are segmented using the same fixed-size segments, and two adjacent segments overlap by 1/3 of the segment width in order to maintain continuity of information flowing between segments. The basic frequency of the signals is over 80 Hz, and the sampling frequency is chosen to be 1000 Hz, which is sufficient for this purpose. We chose to use 16 samples in each segment, with an overlap of 5 samples between two adjacent segments. A signal of 3000 data samples is thus segmented into 272 segments. Each signal segment is represented by the following features:
Max: maximum magnitude of the signal within the segment
Min: minimum magnitude of the signal within the segment
Median: median of the signal within the segment
Mean: mean of the signal within the segment
Standard deviation: standard deviation of the signal segment
Zero-frequency component of the power spectrum of the segment
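As an illustration (the exact definitions of the standard deviation and of the zero-frequency term are assumptions), the six features of a segment and the 42-dimensional state vector can be computed as follows. With 16-sample segments and a 5-sample overlap, segments start every 11 samples, which yields the 272 segments per 3000-sample signal mentioned above.

```python
import numpy as np

SEG_LEN, OVERLAP = 16, 5
STRIDE = SEG_LEN - OVERLAP        # 11 samples between segment starts

def segment_features(seg):
    """Six features for one signal segment."""
    power = np.abs(np.fft.rfft(seg)) ** 2        # power spectrum of the segment
    return np.array([seg.max(), seg.min(), np.median(seg),
                     seg.mean(), seg.std(ddof=1), power[0]])

def state_vector(signals, start):
    """42-dimensional feature vector for one time period: six features from the
    aligned segment of each of the seven signals (Van, Vbn, Vcn, Ia, Ib, Ic, Te)."""
    return np.concatenate([segment_features(s[start:start + SEG_LEN]) for s in signals])
```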
The detection of signal faults within a time period requires one segment from each input signal, and each segment is represented by the 6 features listed above. Since we have 7 input signals (3 voltage signals, 3 current signals, and 1 torque signal), the combined feature vector representing a particular state of the electric drive at a particular time is a vector of 42 dimensions. The feature vector is the input to a neural network classifier that determines whether the 7 signals within this time period manifest any fault.
2.3 Smart Selection of Operation Parameters
In an electric drive system, the current and voltage signals behave differently under different operating conditions specified by torque and speed. The issue of smart selection of “control parameters” (also referred to as operating points) in the (torque, speed) domain is important for all electric drive diagnostic systems that are trained on simulated data. A diagnostic system trained on more representative data is more likely to perform better diagnostics on a real-world system under any operating condition. We developed a machine learning algorithm, CP-Select (Control Point-Select), for the generation of training data for a robust electric drive diagnostic system. The CP-Select algorithm automatically selects representative operating points in a given domain of the control parameter (CP) space to generate representative training data for a neural network system for fault diagnostics. The operating space for a drive system has two components, i.e. torque (Tq) and speed (Sp). The (Tq, Sp) pair selected by CP-Select is sent to SIM_drive, which, in turn, generates current and voltage signals, I and V, at all three phases at the given speed and torque point under normal and faulty conditions. Diagnostic features are extracted from these signals, and the feature vectors are used to train an ANN called FDNN; the performance of the FDNN is evaluated on a validation data set Tv. If the performance is satisfactory, the algorithm stops; otherwise, more operating points are selected. The most complicated component in the CP-Select algorithm is ASCP (Automatic Selection of Control Parameters). Initially Φ contains the rectangular space that includes all valid torque and speed points used by a real-world electric drive. As the process goes on, Φ contains all subspaces from which potential parameters can be selected. The ASCP algorithm repeatedly removes one parameter space from Φ at a time and performs an iterative process until Φ is empty or the performance of the trained FDNN is satisfactory. At each iteration ASCP selects three sets of points, and each set goes through a simulation, training and evaluation process.
The first set of points contains the four corner points and the center point of the current parameter space C_CP and is stored in P0. The points in P0 that have not been selected before are sent to SIM_drive to generate new training data. The newly generated training data are combined with the existing ones to form the current training data set, Tr. FDNN is trained on Tr and evaluated on the validation data set Tv. If the performance of FDNN on Tv is satisfactory, the process stops. Otherwise it goes on to select the second set of points, the interior points of the current parameter space C_CP. The same simulation, training, and evaluation steps are repeated on this set of points. If the performance of FDNN is satisfactory, the process stops. Otherwise the third set of parameters is selected, which contains the four center points on the four sides of C_CP. Again, the same simulation, training and evaluation process is applied to this set of parameters. If the performance of FDNN is satisfactory, the ASCP algorithm stops; otherwise, the current parameter space C_CP is evenly divided into four subspaces CP1, CP2, CP3 and CP4, which are appended to the parameter space set Φ. All the parameter spaces in Φ are sorted based on the performance of FDNN on the validation points in each space, and the entire process repeats.

2.4 Multi-class Fault Classification Using Artificial Neural Networks (ANN)
Neural networks have been successfully applied to a broad range of problems including engineering diagnosis, pattern classification, intelligent manufacturing, control problems and computer vision. Most of the research in neural networks has been in the development of learning and training algorithms for 2-class classifiers, i.e. classifiers with one output node that represents classes 0 and 1. However, fault diagnostics in an electric drive involves six classes of single switch faults and three classes of post-short-circuit faults. The most common architectures proposed for multi-class neural networks [11] are a single neural network with K output nodes, where K is the number of fault classes, and a system of binary neural networks combined with a posterior decision rule that integrates the results of the individual networks. A system of binary neural networks requires separate training of each neural network, and each trained neural network generates a decision boundary between one class and all others. The most noticeable limitation of this approach is that the decision boundaries generated by the different 2-class neural network classifiers can have overlapped or uncovered regions in the feature space [11]. For feature vectors that fall in an overlapped region of the feature space, more than one 2-class classifier can claim the input as its class, resulting in ambiguity, while feature vectors falling in regions that are not claimed by any network are rejected by all networks. As a result, the system may not generalize well. Another type of multi-class neural network system uses a single neural network with k output nodes, where k > 1. This architecture has the advantage of a simple training procedure, since only one neural network is trained for all m classes, where m > 2, and, if trained properly, a neural network system implemented in this architecture should reduce the ambiguity problem [11].
(a) A system of two neural networks for classifying the single switch and short circuit faults in a 3-phase electric drive: the signal features feed a 7-class neural network (normal, A, A', B, B', C, C' broken) and a 4-class neural network for post-short-circuit faults (normal, AA', BB', CC' broken), whose outputs are combined by a winner-take-all decision logic to produce the final detection result. (b) A 10-class single neural network for classifying all 10 classes of faults in a 3-phase electric drive (normal, A, A', B, B', C, C', AA', BB', CC' broken).
Fig. 4. Two neural network architectures developed for the fault classification in an electric drive
Based on the single neural network architecture, we implemented two different systems of neural networks, as illustrated in Figure 4, for the diagnosis of 10 classes of faults in an electric drive: one class represents the normal condition, six classes represent the six single switch faults, and the last three classes represent the three post-short-circuit faults. Figure 4(a) shows a structured diagnostic system consisting of two neural networks: one is trained to classify single switch faults, the other classifies the post-short-circuit faults, and a winner-take-all (WTA) [11] approach is used to integrate the results from the two neural networks. Figure 4(b) shows a single neural network trained to classify all 10 classes: normal, the six single switch faults, and the three post-short-circuit faults. One important issue in a multi-class neural network is how to encode the classes in the output nodes of the neural network. In both neural network
architectures, we chose to use a "one-hot spot" method described as follows. For a k-class classification problem, we need an output layer of k bits; each class is assigned a unique binary string (codeword) of length k with only 1 bit set to "1" and all other bits set to "0." For example, in a six-class classification problem, class 0 is assigned the codeword 000001, class 1 the codeword 000010, class 2 the codeword 000100, etc. The advantage of this encoding is that it gives enough tolerance among different classes. We use the back propagation learning algorithm to train all the neural networks.
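For concreteness, here is a small sketch of this "one-hot spot" encoding together with a winner-take-all style decoding of the network outputs; the class counts and the decoding rule shown are illustrative assumptions, not the authors' code.

    import numpy as np

    def one_hot(class_index: int, num_classes: int) -> np.ndarray:
        """'One-hot spot' codeword: length-k vector with a single 1.
        Bits are counted from the right, so class 1 of 6 reads as 000010."""
        code = np.zeros(num_classes, dtype=int)
        code[num_classes - 1 - class_index] = 1
        return code

    def decode(outputs: np.ndarray) -> int:
        """Pick the class whose output node is strongest (winner-take-all style)."""
        return len(outputs) - 1 - int(np.argmax(outputs))

    print("".join(map(str, one_hot(2, 6))))                    # 000100
    print(decode(np.array([0.1, 0.2, 0.1, 0.9, 0.05, 0.1])))   # 2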
(a) Randomly selected test and validation set, and the training data (Tr) selected by the CP-Select algorithm for classifying single switch faults. (b) Randomly selected test and validation parameters, and the training parameters selected by the CP-Select for classifying post-short-circuit faults. Both panels are plotted in the (torque, speed) space, with training (Tr), validation and test points marked.
Fig. 5. Randomly selected test and validation parameters, and the training parameters selected by the CP-Select for classifying single switch faults and post-short-circuit faults
Fig. 6. Randomly selected test and validation parameters, and the training parameters (Tr0–Tr3) selected by the CP-Select for the single neural network classification system, plotted in the (torque, speed) space
Fig. 7. Performances of two different neural network systems (the system of two neural networks versus the 10-class single neural network, compared on the normal class, the individual fault classes, and overall)
In order to evaluate these two neural network systems, we conducted the following experiments using simulated data. The structured multi-class neural network system contains two separately trained neural networks, both having 42 input nodes and 1 hidden layer with 20 hidden nodes. The neural network for single switch fault classification has 7 output nodes, which represent the normal class and the 6 faulty classes. The neural network for post-short-circuit classification has 4 output nodes, which represent the normal class and the 3 post-short-circuit classes. The randomly selected validation and test parameters, and the training parameters generated by the CP-Select algorithm for the six single switch faults, are shown in Figure 5(a); the parameters for the post-short-circuit faults are shown in Figure 5(b). The single-switch fault classification neural network was trained on the control parameters in Tr0, Tr1 and Tr2 generated by the CP-Select algorithm in 3 iterations, as described in Section 2.3. The post-short-circuit fault classification neural network was trained on the control parameters in Tr0 shown in Figure 5(b), which gave 100% correct performance on the validation data shown as square points; therefore the CP-Select algorithm stopped at the end of the first iteration. The four randomly selected test points are shown as triangles in Figure 5(a).
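To make the architecture concrete, the following is a minimal sketch of such a structured system using scikit-learn MLPs (42 inputs, one hidden layer of 20 nodes). The training arrays, the hyperparameters and the particular winner-take-all rule are assumptions for illustration, not the authors' implementation or the exact WTA rule of [11].

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    def train_structured_system(X_switch, y_switch, X_short, y_short):
        """Train the two networks of the structured system on 42-dimensional
        feature vectors: a 7-class net (normal + 6 single-switch faults) and a
        4-class net (normal + 3 post-short-circuit faults)."""
        net7 = MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000).fit(X_switch, y_switch)
        net4 = MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000).fit(X_short, y_short)
        return net7, net4

    def diagnose(net7, net4, x):
        """One plausible winner-take-all integration: report the class of the
        network whose strongest output node fires hardest."""
        p7 = net7.predict_proba([x])[0]
        p4 = net4.predict_proba([x])[0]
        if p7.max() >= p4.max():
            return ("single-switch net", net7.classes_[int(np.argmax(p7))])
        return ("post-short-circuit net", net4.classes_[int(np.argmax(p4))])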
For the 10-class single neural network system, the CP-Select algorithm generated the training points in four iterations, resulting in Tr0, Tr1, Tr2 and Tr3, which are shown in Figure 6 along with the validation points and test points. The 10-class neural network has 42 input nodes, 1 hidden layer with 20 hidden nodes, and 10 output nodes, where one node represents the normal class, 6 represent the single switch fault classes, and 3 represent the three post-short-circuit faulty classes. It is trained on the data generated by SIM_drive using the operating points in Tr0 ∪ Tr1 ∪ Tr2 ∪ Tr3. The test data for both diagnostic systems were the signals generated by SIM_drive from the same four randomly selected (torque, speed) points. A total of 11320 feature vectors were extracted from these signals, of which 6792 data samples contain the six single switch faults, 3396 contain the three post-short-circuit faults, and 1132 are normal. The performances of these two diagnostic systems on the test data set are shown in Figure 7. The structured diagnostic system correctly detected and located all 9 faulty classes and the normal class with 100% correct detection. The single neural network system correctly detected all 6 single switch faults with 100% accuracy, but achieved only 90% and 92% on test data from post-short-circuit faulty class 1 and class 2.
3 Summary and Conclusions
We have presented a diagnostic system driven by a machine learning algorithm for multi-class fault detection in an electric drive system with a three-phase induction motor. We presented the machine learning algorithm for the smart selection of vehicle operating points from the (torque, speed) space for use by the simulation model, SIM_drive, to generate representative training data, and a neural network classification system developed and trained on the signals generated at the representative operating conditions for the fault diagnostics of electric drive inverters. The intelligent diagnostic system trained with machine learning technology has been evaluated with experimental data. We used the test signals generated by SIM_drive that contain the normal class and 9 faulty classes. The results show that the proposed intelligent diagnostics approach is very effective in detecting multiple classes of faults in an electric drive inverter. The authors also investigated two different neural network architectures, a structured neural network system and a single neural network. Our finding is that the structured neural network system gives more accurate diagnostics on all 10 classes than the single neural network system.
Acknowledgment This work is supported in part by a research contract with the U. S. Army TACOM.
References
1. Crossman, J., Guo, H., Murphey, Y., and Cardillo, J., "Automotive Signal Fault Diagnostics: Part I: signal fault analysis, feature extraction, and quasi optimal signal selection," IEEE Transactions on Vehicular Technology, Vol. 52, No. 4, July 2003, pp. 1076–1098.
2. Fenton, W., McGinnity, T., and Maguire, L., "Fault diagnosis of electronic systems using intelligent techniques: a review," IEEE Trans. on Systems, Man, and Cybernetics – Pt. C, Vol. 31, No. 3, Aug 2000, pp. 269–281.
3. Huang, W. Y., and Lippmann, R. P., "Neural net and traditional classifiers," in D. Z. Anderson (Ed.), Neural Information Processing Systems, pp. 387–396, New York: American Institute of Physics.
4. Gertler, J., Costin, M., Fang, X., Kowalczuk, Z., Kunwer, M., and Monajemy, R., "Model Based Diagnosis for Automotive Engines – Algorithm Development and Testing on a Production Vehicle," IEEE Trans. on Control Systems Technology, Vol. 3, No. 1, pp. 61–69, 1995.
5. Kim, K., and Parlos, A. G., "Induction motor fault diagnosis based on neuropredictors and wavelet signal processing," IEEE/ASME Transactions on Mechatronics, Vol. 7, No. 2, pp. 201–219, June 2002.
6. Klima, J., "Analytical investigation of an induction motor drive under inverter fault mode operations," IEE Proc. – Electr. Power Appl., Vol. 150, No. 3, May 2003, pp. 255–262.
7. Masrur, M. A., Chen, Z., Zhang, B., Jia, H., and Murphey, Y., "Model-Based Fault Diagnosis in Electric Drives Using Artificial Neural Networks," to appear in the IEEE Trans. on Mechatronics, September 2005.
8. Masrur, M. A., Xu, X., and Liang, F., "Fault Isolation in an Induction Motor Control System," U.S. Patent 5469351, Nov 1995.
9. Moseler, O., and Isermann, R., "Application of Model-Based Fault Detection to a Brushless DC Motor," IEEE Trans. on Industrial Electronics, Vol. 47, No. 5, pp. 1015–1020, 2000.
10. Murphey, Y., Guo, H., Crossman, J., and Coleman, M., "Automotive Signal Diagnostics Using Wavelets and Machine Learning," IEEE Transactions on Vehicular Technology, November 2000.
11. Ou, G., Murphey, Y., and Feldkamp, L., "Multiclass Pattern Classification Using Neural Networks," International Conference on Pattern Recognition, Cambridge, UK, 2004.
An Integrated and Flexible Architecture for Planning and Scheduling Antonio Garrido, Eva Onaindía, and Ma. de Guadalupe García-Hernández Universidad Politécnica de Valencia Camino de Vera, s/n, 46071 - Valencia (Spain) {agarridot, onaindia, magarcia}@dsic.upv.es
Abstract. In this paper we propose an integrated architecture to cope with real-world planning and scheduling problems. Our approach consists of two specialised processes, one for planning and one for scheduling, which work together in a strongly-coupled manner to deal with complex problems that involve time and resource constraints. Besides describing the elements of this approach, we also present its functional behaviour together with an example that shows its applicability to real problems.
1
Introduction
The application of AI techniques to automated planning and scheduling is acquiring more and more relevance. The reason behind this is that AI techniques not only help automatise and simplify many tedious tasks but also provide very adequate mechanisms to improve the quality of plans. Nowadays, AI planning and scheduling techniques are able to deal with real-world problems that require handling duration of tasks, execution costs, resource usage, optimisation criteria, etc. All these ingredients make it necessary to use algorithms that provide action selection together with feasible resource assignments in a countless variety of problems, such as logistics management and distribution, space applications, autonomous vehicle navigation (planetary rovers and observation satellites), manufacturing systems, crisis management, etc. [1]. AI planning aims at determining which actions (plan) need to be executed to successfully achieve some goals, which usually entails a complex search process [1]. On the other hand, scheduling aims at determining when these actions must be executed and with which resources, which usually entails a constraint satisfaction process to validate all the problem constraints. Although many real-world problems include features of both planning and scheduling, these processes have traditionally been only loosely related to each other [2]. However, dealing with planning and scheduling in a disunited way does not seem sensible: selecting an action in a plan is usually conditioned by time constraints, resource availability and multiple criteria to be optimised. Consequently, planning and scheduling complement each other perfectly, which clearly motivates the challenge of designing flexible models to integrate planning and scheduling capabilities, particularly considering that they use similar techniques (search in graphs, use of heuristics, management and reasoning of constraints, etc.) [2, 3, 4].
This paper presents (what we understand as) a flexible architecture to integrate planning and scheduling, as a model that dynamically interleaves a planner and a scheduler. Hence, our main contributions are: i) the proposal of a general integrated architecture that combines two specialised processes for planning and scheduling; ii) the use of an action network that represents planning causal links and time+resource constraints; iii) the description (and a general algorithm) of the functional behaviour of this integrated architecture; and iv) an example that shows the applicability of this architecture to real-world problems.
2
Modelling Planning and Scheduling Problems: A Brief Review
Solving a planning and scheduling real-world problem in an automated way requires modelling some basic elements, such as:
1. Initial state, with all the information that is true at the beginning of the problem.
2. Goals, with all the facts that must be achieved at the end of the problem.
3. Actions, defined on the problem domain, that allow the goals to be achieved from the initial state.
4. Available resources, to execute the actions in the plan.
5. Temporal constraints, with additional constraints the plan must satisfy.
6. Metric function, as a multi-criteria function that needs to be optimised.
Although there is not a widely used approach for modelling problems with all these elements, from the planning point of view there exist some proposals to extend planning domain definition languages to deal with scheduling features, such as recent versions of PDDL [5]. For instance, in a planetary rovers problem¹ where rovers navigate a planet surface, finding samples (rocks, soil and images) and communicating them back to a lander, PDDL3 [5] allows us to model: i) actions with duration ((calibrate ?camera) takes 5'); ii) actions with local conditions and effects (calibrate requires the ?camera to be on board ?rover just at the start of the action); iii) actions with numeric conditions ((energy ?rover) needs to be ≥ 10) and effects ((increase (energy ?rover) 18)); iv) multi-criteria problem metrics to optimise (minimise the plan makespan and the energy used by rover0); and v) other constraints such as deadlines in (sub)goals (communicate needs to be performed before 30'), temporal constraints in the form of time windows (sun is available in waypoint1 from 7' to 27'), etc. However, there still exist some limitations to model hard constraints among actions (calibrate camera0 must be executed exactly 5' before calibrate camera1), finite persistence of action effects (the effect of having a camera calibrated only lasts for 10.5'), etc. Despite this, most features of real-world planning/scheduling problems can currently be modelled, at least from an enriched planning point of view [5].
¹ More information about this kind of problem can be found in the International Planning Competition at: http://ipc.icaps-conference.org/
3
Integrating Planning and Scheduling
When dealing with planning and scheduling problems, different approaches have traditionally been followed [1, 2, 4]; some of them extend planning to cope with scheduling capabilities, whereas others include the planning component into scheduling, by manually providing the set of activities and, usually, the order in which these activities must be applied. However, the solution does not necessarily imply embedding a planning process into a scheduling one, or vice versa. Our architecture provides a general and flexible model, where both planning and scheduling processes have an important role in the problem solving per se. It is important to note that while designing the architecture, our aim was to come up with a general model where each process is solved by using already existing techniques. There exist many planning and scheduling techniques that, properly combined, can be used for our purpose of creating a cooperative and highly-coupled integrated model of planning and scheduling. 3.1
Description of the Integrated Architecture
Fig. 1 depicts the structure of the integrated architecture for planning and scheduling. The integrated module requires as an input the model of the problem (domain+problem definition). Moreover, a plan (either provided by the user as a set of activities or by a planner) is used as an additional input, which is later
[Figure 1 shows the PDDL model of the problem (domain + problem definition for the rovers example), an input plan or set of activities (provided by a classical planner or by hand), the AN converter and the resulting action network, the planning process (a replanner for plan execution and reparation), the scheduling process (validation of temporal and resource constraints and temporal allocation), the integrated heuristics shared by both processes, and the final executable plan annotated with its makespan, cost and number of actions.]
Fig. 1. Structure of the integrated architecture for planning and scheduling
motivated. The integrated module contains an action network (AN) converter that transforms the input plan into a network that represents the plan with its actions, causal links and (temporal+resource) constraints. The planning process updates the action network to make it free of planning conflicts; i.e. it mainly works as a replanner, repairing (adding or deleting) actions in the network when they cannot be executed. The scheduling process mainly works as a validator of constraints and checks the feasibility of the action network and its constraints. However, the scheduler’s task is not only to validate the action network, but also to provide the planning process with advice about the conflicting actions and the resources involved in such conflicts. This way, the scheduler helps the planner repair the plan by means of integrated heuristics that allow both processes to take common agreement decisions. There exist two elements that are essential to establish the foundations of the integrated module: the former concerns the input of the module, whereas the latter concerns how the action network represents the plan, actions and their constraints. Given the domain and problem definition as the input, the integrated module could start from an empty plan, i.e. generating a plan from scratch. However, it does not seem sensible to tackle such a complex task from scratch when there are many state-of-the-art planners that generate plans very efficiently [6]; so it seems more reasonable to use a given initial plan as a basis. The underlying idea is to use a classical planner, as simple (in terms of expressivity and calculus) and efficient as possible. Note that this does not imply a loss of generality: a plan provided by any planner can be used as an input, and even a plan generated by hand. Also note that this plan does not need to be
[Figure 2 depicts timepoint nodes for the .on/.off of the actions calibrate, take_img, drop and sample_soil, connected by: time-usage edges labelled with the action durations ([5,5], [7,7], [1,1], [10,10]); energy-usage edges for rover0 ([-2,-2], [-1,-1], [-3,-3]); causal links (CL); and temporal constraints, e.g. sample_soil must finish within [0,50] of the time origin T0.]
Fig. 2. Example of an action network, which comprises a plan. Solid and dashed lines represent usage (increase/decrease) of resources in actions (time and energy of rover0, respectively); thick lines represent causal links; and dotted lines represent temporal constraints. T0 identifies the beginning of time.
An Integrated and Flexible Architecture for Planning and Scheduling
1183
executable since the replanner can repair it. This increases the opportunities to use efficient planners as a preliminary step to the integrated module. The action network (see Fig. 2) follows the philosophy of temporal constraint networks and (consumable) resource networks (TCN and rCN [7, 8], respectively) to represent actions and constraints on time and resources. Nodes represent timepoints where actions start/end (.on/.off, respectively). There are three types of edges, all of them labelled with an interval, that represent: i) usage of a resource (time is considered as a resource as well) between the .on/.off timepoints of the same action; ii) causal links between timepoints of actions; and iii) temporal constraints between timepoints. For instance, using the planetary rovers problem again, action (calibrate rover0 cam0 obj wp0) increases time by 5 units (the action duration) and decreases the energy of rover0 by 2 units (note that these values can also be defined by a min-max interval). Action (calibrate rover0 cam0 obj wp0).off has a causal link with (take_img rover0 wp0 obj1 cam0 low_res).on because the first action generates a proposition at its end that the second action needs over all its execution. Finally, temporal constraints make it possible to define additional constraints and deadlines between the execution of actions and the beginning of time: (sample_soil rover0 rover0st wp0) must finish before time 50. 3.2
Functional Behaviour of the Integrated Architecture
The general scheme with the functional behaviour of the integrated architecture is presented in Algorithm 1. The goal is to revise, both from the planning and scheduling points of view (rev_PLN and rev_SCH), all actions in the AN to find a conflict-free plan. Selecting one action at a time (step 5), there are two main (planning+scheduling) tasks to perform. One is solve_PLN_conflicts (step 8), i.e. when the plan is not executable due to planning conflicts. This type of conflict is basically caused by unsupported action preconditions or by mutex relations between the actions in the plan. Once the planning conflicts have been solved, the other task is solve_SCH_conflicts (step 14), i.e. when the plan is not executable due to time/resource conflicts.

1: rev_PLN ← IS   {IS is a fictitious action that represents the initial state}
2: rev_SCH ← ∅
3: while ∃ a ∈ AN | a ∉ rev_SCH do
4:   {Planning part; reasoning on causal links}
5:   a ← earliest action that can be planned (executed) in AN
6:   rev_PLN ← rev_PLN ∪ {a}
7:   if number_of_conflicts(rev_PLN) > 0 then
8:     solve_PLN_conflicts(rev_PLN)
9:   {Scheduling part; reasoning on time and resources}
10:  for all ai ∈ rev_PLN | ai ∉ rev_SCH do
11:    if ∃ a consistent allocation of ai in rev_SCH then
12:      rev_SCH ← rev_SCH ∪ {ai}
13:    else
14:      solve_SCH_conflicts(ai, R, quantity, time_point, Cannot_acts, Should_not_acts, C)
Algorithm 1. General scheme for integrating planning and scheduling
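Read procedurally, Algorithm 1 amounts to the following Python sketch. The action network interface and the conflict-detection/repair routines are placeholders passed in as callables (assumptions for illustration), not the authors' code.

    def integrate_planning_and_scheduling(an, initial_state,
                                          has_planning_conflicts, solve_pln_conflicts,
                                          consistent_allocation, solve_sch_conflicts):
        """Sketch of Algorithm 1: interleave planning and scheduling revisions
        until every action of the action network (an) is conflict-free."""
        rev_pln = {initial_state}        # IS: fictitious action for the initial state
        rev_sch = set()

        while any(a not in rev_sch for a in an.actions()):
            # Planning part: reasoning on causal links.
            a = an.earliest_plannable_action(rev_pln)
            rev_pln.add(a)
            if has_planning_conflicts(rev_pln):
                solve_pln_conflicts(rev_pln)                 # step 8

            # Scheduling part: reasoning on time and resources.
            for ai in list(rev_pln - rev_sch):
                ok, conflict_info = consistent_allocation(ai, rev_sch)
                if ok:
                    rev_sch.add(ai)
                else:
                    # conflict_info carries (R, quantity, time_point, Cannot_acts,
                    # Should_not_acts, C) as described in the text below.
                    solve_sch_conflicts(ai, *conflict_info)  # step 14
        return rev_sch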
[Figure 3 sketches how the planning process turns a call to solve_PLN_conflicts or solve_SCH_conflicts into a planning repair problem (flaw repair, possible deletion of actions from the action network, goals to achieve/maintain), how the scheduler passes the planner the conflict data (ai, R, quantity, time_point, Cannot_acts, Should_not_acts, C) from its consistent-allocation check, and how the scheduler evaluates the planner's alternative candidate plans P1–P5 and returns heuristic information indicating the best one.]
Fig. 3. Subtasks that are derived from steps 8 and 14 of Algorithm 1
In both cases, the planner performs the same tasks (see Fig. 3): it converts the original problem into a planning repair problem, which is later solved by applying classical planning techniques (flaw repair mechanisms, such as inserting actions to support unsatisfied preconditions). During the process of solving the planning problem, the planner eventually invokes the scheduler to obtain advice and choose the most convenient planning decision. That is, the scheduler helps the planner focus its search on those planning alternatives that better accomplish the time/resource objectives/constraints. The key points in this functional behaviour are the two calls (steps 8 and 14 of Algorithm 1) to the planning process and the subtasks derived from them (see Fig. 3):
1. solve_PLN_conflicts. In this case, the planner invokes itself in order to fix the planning conflict in rev_PLN. This is the task that requires less heuristic information, entailing fewer decision points, because it means transforming a planning conflict into a planning repair problem. Hence, the overall idea is how to solve a planning conflict from a planning perspective.
2. solve_SCH_conflicts. Unlike solve_PLN_conflicts, this is a harder task as it implies solving a scheduling conflict through planning. When the scheduler detects a constraint violation due to an oversubscribed resource R, it attempts to find a consistent allocation for the actions in rev_SCH. The ultimate goal is to find out when that resource is needed (at which time_point), how much is required (quantity) and who requires it (action ai). Additionally, the scheduler provides the planner with some helpful information on which actions cannot and should not be removed from the AN (Cannot_acts and Should_not_acts); for instance, a set of actions that must compulsorily be in
the plan, or that support the fulfillment of some time/resource constraints, or that correspond to a sequence of actions already scheduled, etc. Moreover, in order to guide the planning search as much as possible, the scheduler supplies an extra set of constraints (C) that the planner must respect when solving the problem; for instance, do not use more than n units of resource Rj when solving this conflict.
3. The decision criteria to come up with a planning repair problem. After receiving any of the two mentioned calls, the planner analyses the information provided and transforms it into its own terms, i.e. into a planning repair problem (see Fig. 3). The idea is to formulate a repair problem in terms of which subgoals are to be achieved and which ones are to be maintained in the AN. In order to deduce this information, the planner makes use of the information provided by the scheduler and/or its own planning tactics to decide whether it is convenient to delete some actions prior to calling the planning solver.
4. The advice provided by the scheduler to focus the planning search process. This stage corresponds to a classical planning solving task. The interesting point here is that the planner counts on the scheduler in order to decide which planning alternative to take. This way, the planner shows the scheduler the planning decision points (plans P1, P2, ..., P5) and the scheduler determines the best alternative (P2) from the time/resource perspective.
In conclusion, this model represents an open collaborative approach where planner and scheduler work together to repair any conflict that arises during the process of building up the plan, no matter the origin of the conflict. Both components work in a strongly-coupled manner, taking into account the information/advice provided by the other module (bidirectional communication). Obviously, the planner plays the central role in this collaborative work, though it is intelligently guided by the heuristic information provided by the scheduler. 3.3
Application Example
In this section we will use the rovers problem as an application example to show how our integrated model works. The goal is to have soil, rock and image data communicated. The initial plan, generated by any automated planner, is shown in Fig. 4-a and is subsequently converted into the AN of Fig. 4-b. However, the problem imposes these additional complex constraints: i) actions have different durations and consume energy of rover0; ii) the initial energy level of rover0 is 10; iii) sun is available to recharge only in waypoint1 and waypoint3; iv) the subgoal (have image objective1) cannot be achieved before time 20; v) the effect (calibrated camera0) only persists during 30 units; and vi) the goal (communicated image) has a deadline, so it must be achieved before time 40. Following Algorithm 1, actions are selected from the AN and revised by the planning and scheduling parts. While no planning nor scheduling conflicts arise, the sets rev_PLN and rev_SCH are updated. This process continues until revising (comm image wp0 wp1), which introduces a scheduling conflict in rev_SCH because the resource energy becomes oversubscribed: (comm image wp0 wp1) requires
6 units of energy, but only 4 (10-3-2-1) are available. Then, the scheduler executes solve_SCH_conflicts((comm image wp0 wp1), (energy rover0), 2 (6-4), (comm image wp0 wp1).on, {(sample soil wp0), (take image cam0 wp0)}, ∅, {(energy rover0) ≤ 10}). Note that Cannot_acts contains the actions that need to appear in the plan (they are the only way to achieve certain subgoals), whereas Should_not_acts is empty in this case. On the other hand, C indicates that the energy of rover0 cannot exceed 10 (since this is the initial value). The repair problem solver decides to maintain the same subgoals. Then, the planning solver tries to repair this conflict by inserting actions (recharge wp1) and (recharge wp3), thus creating several alternative plans that may be inserted at different timepoints (P1, P1', ..., P3, P3'), as seen in Fig. 4-b. Since there are several different alternatives, the planner communicates these choices to the scheduler and the scheduler verifies their consistency, informing the planner about the best choice to follow. After solving this conflict, the algorithm resumes by studying the next actions in the AN until no more actions need to be revised.
[(a) Initial plan: 0: (calibrate cam0 wp0); 1: (take_image cam0 wp0); 2: (sample_soil wp0); 3: (comm_image wp0 wp1); 4: (comm_soil wp0 wp1); 5: (drop); 6: (sample_rock wp0); 7: (comm_rock wp0 wp1). (b) The corresponding action network: .on/.off timepoints linked from IS (T0) by causal links, duration edges ([5,5], [7,7], [10,10], [15,15]) and energy decreases for rover0 ([-2,-2], [-1,-1], [-3,-3], [-6,-6]), together with the additional constraints calibrated cam0 [0,30], have_image obj1 [20,∞] and communicated_image obj1 [0,40]; the resource conflict on the energy of rover0 is marked, with candidate insertion points P1/P1', P2/P2', P3/P3' for (recharge wp1) and (recharge wp3).]
Fig. 4. (a) Initial plan given by an automated planner for the rovers application example; (b) Action network that includes the additional complex problem constraints
4
Discussion Through Related Work
From the AI point of view, there exist two immediate perspectives for solving planning and scheduling problems: a planner can be extended to handle time and resources (temporal planning approach) [1, 9]; or a scheduler can be extended to incorporate some dynamic action selection (planning capabilities embedded within scheduling) [4]. In both cases, the indistinguishable mixture of capabilities may increase the complexity of the overall approach, making big problems practically intractable. From the OR point of view, there exist very efficient algorithms to perform the task allocation [10], but modelling the problem and, particularly,
finding an initial plan as a set of actions becomes much more difficult. Consequently, the way in which planning and scheduling must be combined has been addressed as an interesting hot topic of research [2]. Although there have been some successful attempts at integrating planning and scheduling, such as HSTS [11] or Aspen [12], they are ad-hoc models designed for solving problems in a particular domain. This loss of generality makes it hard for them to demonstrate success in other domains. On the contrary, our integrated approach tries to be a general, flexible model that dynamically interleaves planning and scheduling processes (both playing a similar role) to tackle planning problems with a strong or weak component of resource/time management. We combine both processes, separating the implementations of planning and scheduling to achieve a bi-modular solver in which each component deals with the respective sub-problem as presented in [13]: the planner produces a plan while the scheduler enforces the time and resource constraints on this plan, but both work on the basis of the information/advice provided by the other component. This allows us to benefit from their separate expertise, such as plan execution and reparation, definition of timepoints, resource allocation, temporal constraint management, etc. Further, it also incorporates the use of common integrated heuristics, reasoning on an action network that combines a TCN with a rCN [7, 8], and an intelligent interaction between both processes.
5
Conclusions
Separating the planning component from the scheduling component when solving real-world problems has shown important drawbacks that are difficult to overcome unless an integrated architecture is used. The key issue about integration is how to level out the roles of both processes. In this paper we have proposed a flexible architecture that makes use of two specialised processes for planning and scheduling under a semi-distributed approach, in opposition to other completely isolated approaches. This way, our approach provides high flexibility to deal with more complex problems in a more efficient way, since the planning features of the problem are managed by a planner, while the scheduling features are managed by a scheduler. Further, both processes take advantage of common heuristics and of the advice provided by the other module. For instance, in situations where there exist different planning alternatives, the scheduler can help the planner select the best decision according to scheduling features, such as use of resources, fulfillment of temporal constraints, etc., which are per se difficult for a planner to manage. We certainly know there is still some work to be done before achieving a fully automated tool able to solve real-life problems (some decision points and improvements in the planning/scheduling interaction require further investigation). However, the open design of our architecture makes it possible to incorporate new modules that provide extended reasoning on additional aspects of the problem, such as dealing with soft constraints and preferences and including OR algorithms for scheduling, which is part of our current work.
Acknowledgments This work has been partially supported by the Spanish government project MCyT TIN2005-08945-C06-06 (FEDER).
References
1. Ghallab, M., Nau, D., Traverso, P.: Automated Planning. Theory and Practice. Morgan Kaufmann (2004)
2. Smith, D., Frank, J., Jónsson, A.: Bridging the gap between planning and scheduling. Knowledge Engineering Review 15(1) (2000) 47–83
3. Bartak, R.: Integrating planning into production scheduling: a formal view. In: Proc. of ICAPS-2004 Workshop on Integrating Planning Into Scheduling (2004)
4. Smith, S., Zimmerman, T.: Planning tactics within scheduling problems. In: Proc. ICAPS-2004 Workshop on Integrating Planning Into Scheduling (2004) 83–90
5. Gerevini, A., Long, D.: Plan constraints and preferences in PDDL3. Technical report, University of Brescia, Italy (2005)
6. Edelkamp, S., Hoffmann, J., Littman, M., Younes, H., eds.: Proc. of the International Planning Competition IPC-2004 (2004)
7. Dechter, R., Meiri, I., Pearl, J.: Temporal constraint networks. Artificial Intelligence 49 (1991) 61–95
8. Wallace, R., Freuder, E.: Supporting dispatchability in schedules with consumable resources. Journal of Scheduling 8(1) (2005) 7–23
9. Gerevini, A., Saetti, A., Serina, I., Toninelli, P.: Planning in PDDL2.2 domains with LPG-TD. In: Proc. ICAPS-2004 (2004) 33–34
10. Winston, W.: Operations Research: Applications and Algorithms (1994)
11. Muscettola, N.: HSTS: Integrating planning and scheduling. In Zweben, M., Fox, M., eds.: Intelligent Scheduling. Morgan Kaufmann, CA (1994) 169–212
12. Chien, S., et al.: ASPEN – automating space mission operations using automated planning and scheduling. In: Proc. SpaceOps 2000 (2000)
13. Pecora, F., Cesta, A.: Evaluating plans through restrictiveness and resource strength. In: Proc. WIPIS of AAAI-2005 (2005)
A Robust RFID-Based Method for Precise Indoor Positioning Andrew Lim1,2 and Kaicheng Zhang1 1
Department of Industrial Engineering and Logistics Management Hong Kong University of Science and Technology Clear Water Bay, Kowloon, Hong Kong {iealim, kczhang}@ust.hk 2 School of Computer Science & Engineering South China University of Technology, Guangdong, PR China
Abstract. A robust method for precise indoor positioning utilizing Radio Frequency Identification (RFID) technology is described. The indoor environment is partitioned into discrete locations, and a set of RFID tags are scattered in the environment. Sample reading patterns of RFID tags are collected at each location by an RFID reader. A pattern recognition and classification method is used, when human, vehicle, or other carrier moves around the environment with an RFID reader, to estimate its physical location based on the reading pattern. The resulting estimation error is within one meter. The method is adaptive to different tag distributions, reader ranges, tag reading defects, and other alterable physical constraints. Results from extensive experiments show that the described method is a robust and cost-saving solution to indoor positioning problems in logistics industry. A provisional US patent is granted on the described method. Keywords: Intelligent Systems, Systems for Real Life Applications, Indoor Positioning.
1
Introduction
Positioning and location identification are useful in a number of industrial fields. In the logistics and transportation domain, the real-time position of each cargo, vehicle carrier, and even human worker can be valuable information. There are growing needs for acquiring high-precision indoor location information at a low cost. 1.1
Background and Related Work
The Global Positioning System (GPS) has great value in determining the position of a user around the globe. However, for indoor applications, GPS has some problems. First, when it comes to indoor or urban outdoor areas, GPS is not a reliable solution due to the poor reception of satellite signals. In addition, indoor applications usually require a much higher precision than GPS can achieve, sometimes as high as within one meter.
Many of the existing Indoor Positioning Systems (IPS) [1] are based on Wireless Local Area Networks. Wireless access points are installed in each room and a portable receiving device can determine which room it is in when it receives a signal from the access points. This approach is useful for determining which room inside a building the tracked subject is in; however, the precision is not high enough for logistics applications, and there will also be problems if any of the access points fails to work. There are other technologies that provide more precise positioning: Wi-Fi, infrared, ultrasonic, or magnetic technologies have been employed in positioning applications such as [2], [3] and [4]. The use of these technologies comes at a high cost, and their performance may degrade in industrial environments. Radio Frequency Identification (RFID), a technology invented more than half a century ago, has recently received overwhelming attention, especially for its use in logistics applications for cargo tracking and identification. For a comprehensive understanding of RFID, refer to [5]. There have been notable efforts in employing RFID in positioning, both in outdoor applications [3] and indoor applications [6, 7]. 1.2
Indoor Positioning – An Industrial Example
Located at the International Container Terminal, the SunHing godown, owned by SunHing Group, is one of the largest distribution centers in Hong Kong. Its main operations include receiving goods from various manufacturing plants in South China, storing them inside its 11,000 sq ft warehouse, and picking the cargos and loading them into containers, which are then transshipped to various locations in Japan, Europe and the U.S. The receiving, picking and transferring of cargos within the SunHing warehouse are done by more than 20 forklifts. In peak seasons the operations are quite busy, and errors in placement or picking could be costly. SunHing believes that an indoor positioning system could be used on its forklifts to enhance its operational accuracy and efficiency. Usually within a warehouse, a cargo terminal, or a truck depot, we want to know the position of each vehicle. This position information enables the warehouse to have better performance measurement, dynamic job scheduling, and automatic information acquisition. Furthermore, the use of a high-precision indoor positioning system is not restricted to such logistics applications. It could also be used in high-precision local navigation systems or intelligent systems where location-awareness of the mobile agents is required.
2
Our Proposed System
2.1
“Mobile Reader Dispersed Tag” Approach
There exist RFID positioning systems that put tags on the mobile agent and place RFID readers at fixed locations [8, 9]. In this approach, when one tag is being
read by a reader, the system concludes that the mobile agent carrying the tag is inside the read range of that particular reader. This approach is practical for applications in which mobile agents move around well-partitioned and separated areas through some checkpoints. Readers are normally placed at such checkpoints to detect the movement of the agents. However, when the precision requirement is higher, or the number of checkpoints increases, this “mobile tag fixed reader” approach becomes impractical. Also, the unstable nature of wireless communication, especially of UHF radio wave communication in industrial environments [10], introduces a high error rate when the position information of the mobile agent is based only on the reading of one single tag. Thus, in contrast to the “mobile tag fixed reader” method, a “super-distributed tag infrastructure” [11] approach has been proposed. In this approach, a large number of passive tags are distributed in the environment, and an RFID reader is carried by each mobile agent. When an agent moves inside the environment, at each location one tag is read and its location determined. Successful applications of this method include [6] and [7]. However, the restriction that only one single tag may be read at any time is too harsh, and placing tags on the floor [6, 7] is also inconvenient or infeasible in some industrial applications. Therefore we propose a new “mobile reader dispersed tag” approach: a large number of tags are dispersed in the environment, allowing more than one tag to be read by a reader in the environment. A mobile agent is first placed at designated grid points, or “sample points”. “Sample readings” of which tags are read by the reader at each point are collected and stored. After that, when the agent is moving about in the environment, the real-time reading results are collected and compared against the sample readings, and the location of the agent is classified into one of the sample points with pattern recognition and classification techniques. 2.2
System Overview
Figure 1 illustrates the architecture of our proposed RFID-based indoor positioning system. The system basically contains the Dispersed Tags, the Mobile Reading and Processing Module, and the Backend Server. The dispersed tags are a set of fixed tags distributed in the environment, in this case on the ceiling of the indoor environment. The Mobile Reading and Processing Module is carried by the mobile agent, for instance a forklift. The module consists of an RFID antenna, an RFID reader, a PDA, and a wireless adapter. The RFID reader controls the antenna, which emits and receives UHF radio waves, and identifies the RFID tags inside its reading range. A PDA or a PC is connected to the RFID reader to control the reader and collect the tag reading results. The results merely record which tags were read in the most recent reading cycle; no additional information, such as signal strength or direction of the tag, is required. The processed reading results are then sent through the wireless network to the backend server, where classification algorithms convert the reading results into an estimate of the mobile agent's location. The backend information and control server can provide positioning service and control for more than one mobile agent, and can be further integrated into the warehouse management system.
Fig. 1. Architecture of proposed RFID-based indoor positioning system
2.3
Pattern Matching Formulation
We now formulate the positioning problem in the proposed system as a pattern matching process. First, a set R of m RFID tags, each identified by a 96-, 128-, or 256-bit string, is dispersed in the environment. Next, we partition the environment into n discrete grid points. The sample data we collect at the ith (1 ≤ i ≤ n) point is a ternary tuple Si = (xi, yi, oi), where xi is the x-coordinate of the ith sample point, yi is the y-coordinate of the ith sample point, and oi, a subset of R, is the set of RFID tag identification strings observed at the ith sample point. Finally, we define the input and output of the problem.
Input: A sample data set U collected at n points {S1, S2, ..., Sn}, and a set T of tags observed by a mobile agent at an unknown position, T ⊆ R.
Output: An integer j (1 ≤ j ≤ n), such that (xj, yj) is the estimated location for T.
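To make the formulation concrete, the sketch below shows one possible in-memory representation; the tag identifiers, coordinates and grid spacing are purely hypothetical.

    from typing import NamedTuple, Set, List

    class SamplePoint(NamedTuple):
        x: float            # x-coordinate of the sample point
        y: float            # y-coordinate of the sample point
        observed: Set[str]  # o_i: tag IDs read by the reader at (x, y)

    # Hypothetical sample data U = {S_1, ..., S_n} collected on a grid.
    U: List[SamplePoint] = [
        SamplePoint(0.0, 0.0, {"TAG001", "TAG002", "TAG017"}),
        SamplePoint(0.6, 0.0, {"TAG002", "TAG003", "TAG018"}),
        # ... one entry per grid point
    ]

    # An observation T made at an unknown position is just a set of tag IDs;
    # the positioning problem is to return the index j of the best-matching S_j.
    T: Set[str] = {"TAG002", "TAG003"}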
3
System Setup and Experiment Design
3.1
System Setup
We set up our proposed system in a quasi-industrial indoor environment in our research center. The effective area is a rectangle of size 4.2 m by 8.4 m. This area
Fig. 2. Floor Plan of Experiment Environment
is partitioned into 7 rows and 7 columns, thus creating 49 grids. The width and length of a grid are not equal; it is made this way so that one grid perfectly matches one piece of tile on the ceiling. Fig. 2 shows the floor plan of the test environment and the partition. Fig. 3(a) and (b) are photos of the floor and ceiling of the experiment area. A total of 176 RFID tags, arranged in 22 columns by 8 rows, are placed onto the ceiling. In Fig. 2, each tag is illustrated by a small square. Each rectangular tile on the ceiling has 4 tags at its vertices and another 4 on its two long sides. The tags used are 4" by 4" Symbol UHF Carton Tags, as shown in Fig. 4. A simple carrier is built for the experiment as a mobile agent, as shown in Fig. 5. One Symbol AR400 RFID Reader with power supply and two Symbol High Performance Antennas are installed on the carrier. Metal bars are attached to both the long and short sides of the carrier to assist in placing the carrier at precise locations. The height and angle of the antenna are adjustable. The reader is connected to a computer that collects the reading results at each location. 3.2
Experiment Design
The main purpose of the experiments is to examine the feasibility of the proposed method as an effective and practical approach for indoor positioning in logistics applications. Also, good algorithms for solving the formulated pattern matching problem are to be discovered. A number of factors related to the performance
Fig. 3. (a) Experimental Indoor Environment (b) The Ceiling
Fig. 4. A Tag Placed on the Ceiling
Fig. 5. Mobile Agent with Reader and Antenna
of the system are considered: height, angle, orientation and power level of the antenna, and the density of sample data collection (sampling rate). In total, 7 sets of sample data are collected, as summarized in Table 1. The standard sample data set U1 is collected at the center of each partitioned grid, thus having 49 sample points. The standard power level is 255, which is full power. The standard antenna angle is 0° with respect to the horizontal plane. The standard antenna height is 77 cm (antenna to ceiling 144.5 cm). The standard orientation of the carrier is facing east (long side of the antenna parallel to the long side of the rectangular area). U2 is collected with the standard configuration at both the center and the boundary of each grid, thus having 15×14 points. U3 through U6 are sample sets with variations in the power, angle, orientation, and height of the antenna, as described in Table 1. U7 is the union of U1 and U5, introducing another element ri, the orientation, into the tuple (xi, yi, oi). Note that, when collecting the readings at every point for each sample set, an additional 9 readings are collected. Together, the 10 readings at each point are used as the testing set of observations at that point with respect to its sample set, and this testing set is classified against a sample data set (normally itself, or a standard sample set) with a pattern matching algorithm to examine the accuracy of the classification or the effect of various alterations to the configuration.
Table 1. Sample Data Sets

Sample Data Sets        Descriptions                           Sample Points
U1 – 7×7 std            Standard                               7×7
U2 – 15×14 std          Standard                               15×14
U3 – 7×7 power192       Power level = 192 (half)               7×7
U4 – 7×7 slope11        11° slope of antenna                   7×7
U5 – 7×7 side           Carrier rotated 90° (facing north)     7×7
U6 – 7×7 height115      Raised antenna                         7×7
U7 – 7×7 std side mix   U1 ∪ U5                                7×7×2

4
Pattern Matching Algorithms
We have tested several different algorithms to solve the pattern matching problem formulated in 2.3. Each algorithm runs over all 6 sample data sets and the results are compared. We find that two algorithms – “Intersection over Union” and “Tag-to-Location Mapping Count” – provide sufficiently good results compared to the rest. They are described in this section and the experimental results are presented in Section 5. 4.1
“Intersection over Union” Algorithm
As in the formulation of the problem in 2.3, the algorithm takes in a sample data set U and an observation T, and outputs an integer j – the index of the matched pattern. Following are the steps to find such j:
1. For each Si ∈ U, compute the “similarity” between Si and T as follows:
   Sim(Si, T) = |Si ∩ T| / |Si ∪ T|    (1)
2. Choose j such that Sim(Sj, T) ≥ Sim(Si, T) for all 1 ≤ i ≤ n.
Equation (1) merely states that the “similarity” between the observation set and a sample set is the quotient of the number of their common elements over the number of elements in their union.
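A direct Python rendering of this classifier follows (a sketch only; the sample-point tuples mirror the formulation in Section 2.3 and are assumptions for illustration).

    from typing import List, Set, Tuple

    def iou_classify(samples: List[Tuple[float, float, Set[str]]],
                     observation: Set[str]) -> int:
        """Return the index j of the sample point whose observed tag set
        maximises |o_j intersect T| / |o_j union T| (Equation 1)."""
        best_j, best_sim = 0, -1.0
        for j, (_, _, o_j) in enumerate(samples):
            union = o_j | observation
            sim = len(o_j & observation) / len(union) if union else 0.0
            if sim > best_sim:
                best_j, best_sim = j, sim
        return best_j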
4.2
“Tag-to-Location Mapping Count” Algorithm
The “Tag-to-Location Mapping Count” algorithm matches an observation set T to a sample set Sj ∈ U in the following way:
1. Define the tag-to-location mapping, a mapping from a tag id string to a set of integers, which are the indices of the tag's associated sample points:
   TTL(s) = { p | s ∈ op, 1 ≤ p ≤ n }    (2)
2. Given T, for each sample point at (xi, yi), compute the mapping count score
   score(i, T) = Σs∈T [ 1 if i ∈ TTL(s), 0 otherwise ]    (3)
3. Choose j such that score(j, T) ≥ score(i, T) for all 1 ≤ i ≤ n.
Intuitively, (3) states that for each tag in an observation, we add one count to every sample point whose corresponding sample set contains the tag. The sample point that receives the most counts is considered the solution. Note, however, that this would be faulty when dealing with observations at boundary areas, as Fig. 6 explains. Thus we exclude the boundary points when evaluating the performance of the “Tag-to-Location Mapping Count” algorithm.
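The same algorithm, again as an illustrative Python sketch with hypothetical sample-point tuples:

    from collections import defaultdict
    from typing import Dict, List, Set, Tuple

    def build_ttl(samples: List[Tuple[float, float, Set[str]]]) -> Dict[str, Set[int]]:
        """TTL(s): map each tag ID to the indices of the sample points that observed it."""
        ttl: Dict[str, Set[int]] = defaultdict(set)
        for i, (_, _, o_i) in enumerate(samples):
            for tag in o_i:
                ttl[tag].add(i)
        return ttl

    def ttl_classify(samples, observation: Set[str]) -> int:
        """Return the index of the sample point with the highest mapping count (Equation 3)."""
        ttl = build_ttl(samples)
        score = [0] * len(samples)
        for tag in observation:
            for i in ttl.get(tag, ()):   # one count per sample point that lists this tag
                score[i] += 1
        return max(range(len(samples)), key=lambda i: score[i])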
Fig. 6. "Tag-to-Location Mapping Count" Algorithm Runs into Problems with Boundary Cases
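A minimal sketch of the mapping count of Eqs. (2)–(3), again with Python sets and our own function names (not code from the paper):

```python
from collections import defaultdict

def ttl_mapping(samples):
    """Eq. (2): map each tag id to the indices of sample points whose set contains it."""
    ttl = defaultdict(set)
    for i, s in enumerate(samples):
        for tag in s:
            ttl[tag].add(i)
    return ttl

def mapping_count_match(samples, observation):
    """Eq. (3): every observed tag votes for each sample point listing it;
    the point with the most votes is returned (0-based index)."""
    ttl = ttl_mapping(samples)
    scores = [0] * len(samples)
    for tag in observation:
        for i in ttl.get(tag, ()):
            scores[i] += 1
    return max(range(len(samples)), key=scores.__getitem__)
```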
5 Experimental Results

5.1 Comparison of Algorithms
To compare the effectiveness of the classification algorithms, the 10 readings collected at each point of the 7×7 sample data are used as observations and U1 is used as the sample data set. The resulting accuracy is presented in Table 2. Boundary points are excluded from the comparison; thus for rows 1–5 there are 5×5×10 = 250 test cases, and for row 6 there are 11×11×10 = 1210 test cases. We can see that both algorithms perform well on the U1 – 7×7 std self-validation test case, and each algorithm outperforms the other on some of the remaining test cases. We also observe that changing the power, orientation, or height of the antenna while using the standard sample data significantly reduces the positioning accuracy.

Table 2. Comparison of the Two Algorithms

Sample Data   Test Data   Test Points   "Intersection over Union"    "Tag-to-Location Mapping Count"
                                        Errors    Error Rate         Errors    Error Rate
U1            U1          250           3         1.2%               1         0.4%
U1            U3          250           113       45.2%              72        28.8%
U1            U4          250           5         2.0%               36        14.4%
U1            U5          250           81        32.4%              111       44.4%
U1            U6          250           143       57.2%              116       46.4%
U2            U2          1210          43        3.55%              72        5.95%

5.2 Positioning Accuracy of the Proposed Method
Table 3 presents the results of the "Intersection over Union" algorithm when each sample data set is matched against its own test data. For all the test cases, the proposed method provides >97% accuracy in determining the correct position. Furthermore, for the <3% error cases, the average error on x and y is within 0.50 unit with respect to the width and length of one tile, i.e., within 1 meter.

Table 3. Positioning Accuracy

Sample Data            Test Data   Errors   Test Points   Error Rate
U1 – 7×7 std           U1          11       490           2.24%
U3 – 7×7 power192      U3          2        490           0.41%
U4 – 7×7 slope11       U4          0        490           0.00%
U5 – 7×7 side          U5          5        490           1.02%
U6 – 7×7 height115     U6          2        490           0.41%
U2 – 15×14 std         U2          43       2100          2.05%

5.3 Discussion on Other Factors Related to Accuracy
Through experiments, we also verified that the proposed solution is able to determine the orientation of the carrier by adding an orientation attribute to the sample data tuple, generating sample data set U7. Using it as the sample data, we achieved a 2.45% error rate in the 7×7 std case and 1.43% in the 7×7 side case. From the experimental results, the following recommendations can be made to ensure high accuracy of the positioning system. First, the height and power level of the antenna should be consistent across the whole system; a small change in height or power may lead to a large variation in the observations and hurt the performance, as a comparison of the error rates in Table 2 shows. Second, a slight variation of the antenna angle with respect to the horizontal plane does not affect the accuracy much, as shown in row 3 of Table 2. This suggests that the method is practical for industrial applications where the carrier moves on an uneven floor at high speed with vibration. Finally, an increase in the sampling rate, as expected, generally produces higher accuracy (error rate reduced from 2.24% to 1.02% on 7×7 std when the sample data changed from 7×7 std to 15×14 std).
6 Conclusions and Future Work
In this paper, we identify the importance of the precise indoor positioning problem in industrial applications. We propose a new approach – "mobile reader, dispersed tags" – allowing multiple tags to be read at a time, to combat the RF-unfriendly environment of real industrial workplaces. We formulate the positioning problem as a pattern matching problem and propose two algorithms that are particularly effective. Extensive experiments are carried out and several factors in the system are examined. We conclude that our method achieves >97% positioning accuracy with <1 m error. We also propose a simple approach for identifying the orientation of the carrier in addition to its location, which adds to the practicality of the proposed system. The presented method has several advantages: it is more precise than existing WLAN-based indoor positioning systems; it is a low-cost solution, since no battery support is needed for the passive RFID tags and no information on RF signal strength is required; it is robust to environmental noise and tag failure; and it provides a way to determine the orientation of the carrier. Future work includes improving the current pattern matching algorithms, possibly with online learning and reasoning, solving the more general 360◦ orientation problem, and improving the experiment design to estimate the precision of the system on continuous points in the environment.
A Study of Optimal System for Multiple-Constraint Multiple-Container Packing Problems

Jin-Ling Lin1, Chir-Ho Chang2, and Jia-Yan Yang3

1 Department of Information Management, Shih-Hsin University, [email protected]
2 Dept. of Industrial Management, LungHwa University of Science and Technology, [email protected]
3 Department of Industrial Engineering & Management Information, Huafan University, [email protected]
Abstract. The proposed research focuses on multiple-container packing problems under multiple constraints. Space utilization, stability, load bearing, and the loading sequence of objects are considered in order to make the results more practicable. Clustering technology and a genetic algorithm are combined to solve the proposed problems. At the beginning, a clustering algorithm is applied to classify objects into different groups with varied characteristics, such as the dimensions of objects, the unloading sequence of objects, and the capacity of containers. Then a genetic algorithm combined with heuristic rules is used to pack the objects of each group into containers. Stable packing, space utilization, unhindered unloading, and load bearing limitations are the major considerations in this stage. A computer system based on the proposed algorithm was developed. Thousands of cases were simulated and analyzed to evaluate the performance of the proposed research and to prove its applicability in the real world.
1 Introduction

The proposed research stems from the fact that a single container is often unable to hold all objects, so single-container packing loses its practicability in the real world. Besides space utilization, it is important to consider constraints such as stability, load bearing, and the loading sequence of objects for real-world applications, but it is hard to find literature that discusses these factors together [1], [2], [3]. The K-means clustering algorithm is easy to implement and has many successful applications in grouping data [4], [5], [6], [7], [8]. Therefore, this paper first applies the K-means clustering algorithm to group objects with similar dimensions and unloading sequences [9]. Then a genetic algorithm, combined with heuristic rules, is used to pack the objects of each cluster [2].
2 Problem Definition

The proposed paper focuses on how to pack n unequal-sized cuboid-shaped objects into m containers, which can have different dimensions. The container packing problem that we deal with has multiple objectives combined with multiple constraints. Besides container cost and space utilization, we also take the efficiency of unloading, stability of packing, and load bearing into consideration. The inputs, limitations, and problem model are described as follows.

2.1 Inputs and Assumptions

The following information is necessary before using the proposed method to solve multiple-constraint multiple-container packing problems.

a) Objects: the dimensions, weight, load bearing limitation, and unloading sequence of each object to be packed.
b) Containers: the interior dimensions and cost of each container type available for packing.

In order to work with the proposed method, we put some assumptions on the objects and containers.

a) Objects
   o Objects to be packed are three-dimensional cuboids, whose dimensions can differ.
   o An object must keep its top upwards while being packed; in other words, objects can only be rotated from side to side.
   o When the center of gravity of an object is located on the plane formed by the objects bearing it, the object satisfies the stability condition.
b) Containers
   o Containers to be used are three-dimensional cuboids, whose dimensions can differ.
   o Only dry containers are considered: except for the exit, all walls of a container are solid and cannot be enlarged or shrunk.
   o There is only one exit per container; it occupies the full height-by-width plane, which is no longer than the length of the container.
2.2 Problem Model

The objectives of the multiple-container packing problem in this paper are to minimize container cost, maximize space utilization (equivalently, minimize free space), and minimize unloading overhead, under the constraints of stable packing and load bearing limitations. The corresponding mathematical model is shown in Table 1, and the notations used in the model are listed as follows:
Ci: capacity of container i
Ci,l, Ci,w, Ci,h: length, width, and height of container i
F: objective function value
M: the number of containers needed in order to pack the pre-specified objects
m: the number of different container types that can be used to pack objects
N: the number of objects to be packed
Ni: the number of type-i containers needed in order to pack the pre-specified objects
ni: the number of objects packed into container i
OCi: the volume of object i
OCi,j: the volume of object j packed into container i
OCi,l, OCi,w, OCi,h: length, width, and height of object i
Pi: cost of a type-i container
US(Oi,j): unloading sequence of object j in container i
Wc: weight of container cost
Wsu: weight of space utilization
Wu: weight of unloading overhead

Table 1. Mathematical model of the proposed problem
Minimize

F = Wc · 100 · ( Σ_{i=1}^{m} Ni Pi − optimal(C) ) / optimal(C)
  + Wsu · 100 · ( Σ_{i=1}^{M} Ci − Σ_{i=1}^{M} Σ_{j=1}^{ni} OCi,j ) / Σ_{i=1}^{M} Ci
  + Wu · 100 · ( Σ_{i=1}^{M} ( max_{j=1,…,ni} US(Oi,j) − min_{j=1,…,ni} US(Oi,j) + 1 − ni ) / ni ) / M

Subject to

1. M = Σ_{i=1}^{m} Ni
2. N = Σ_{i=1}^{M} ni
3. max(Cj,l, Cj,w) ≥ max(OCi,l, OCi,w), for j = 1,…,M and each i = 1,…,nj
4. min(Cj,l, Cj,w) ≥ min(OCi,l, OCi,w), for j = 1,…,M and each i = 1,…,nj
5. Cj,h ≥ max_{i=1,…,nj} OCi,h, for j = 1,…,M
6. Ci ≥ Σ_{j=1}^{ni} OCi,j, for i = 1,…,M
7. ∀ variables are integer and ≥ 1
8. Wc + Wsu + Wu = 1
The measures of container cost, space utilization, and loading overhead differ, so each is normalized. The first term denotes the ratio of the container cost to the optimal container cost, which is the solution of

minimize Σ_{i=1}^{m} Pi Ni
subject to Σ_{i=1}^{m} Ci Ni ≥ Σ_{i=1}^{N} OCi      (1)

The second term is the percentage of free space. The last term denotes the dispersion of the unloading sequence of the objects packed in the same container. Finally, the weight factors Wc, Wsu, and Wu are decided by users according to their needs in the real case.
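To make Formula (1) concrete, the following sketch (our own illustration, not code from the paper) computes optimal(C) as an unbounded covering problem over the container types, assuming volumes and capacities can be expressed in integer units:

```python
import math

def optimal_container_cost(costs, capacities, total_volume):
    """Eq. (1): minimize sum(P_i * N_i) s.t. sum(C_i * N_i) >= total object volume.

    Solved by dynamic programming over integer volume units (an assumption of
    this sketch); capacities must be positive integers.
    """
    V = int(math.ceil(total_volume))
    INF = float("inf")
    dp = [INF] * (V + 1)    # dp[v]: minimum cost to cover v remaining volume units
    dp[0] = 0.0
    for v in range(1, V + 1):
        for p, c in zip(costs, capacities):
            rest = max(0, v - c)          # one container of this type covers c units
            if dp[rest] + p < dp[v]:
                dp[v] = dp[rest] + p
    return dp[V]

# Example with three hypothetical container types
print(optimal_container_cost(costs=[100, 160, 210],
                             capacities=[50, 90, 120],
                             total_volume=300))
```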
3 Algorithm Analysis

The packing problem discussed in this paper is solved in two stages: a grouping procedure and a packing procedure. In the grouping procedure, the K-means method is applied to group objects into clusters according to the capacity of the containers, the dimensions of the objects, and the unloading sequence of the objects. Objects in the same group will be packed in the same container. Then, a genetic algorithm combined with heuristic rules is applied for single-container packing. The details of the grouping and packing algorithms are described in Section 3.1 and Section 3.2, respectively. The procedure of the proposed multiple-container packing is shown in Figure 1, where:

Doc: a list of related information of the objects to be packed; for each object it contains the dimensions, load bearing limitation, and unloading sequence
Dc: a list of related information of the different container types; for each container type it contains the cost and capacity
C: a list of grouping information after performing the K-means clustering
Cc: a list of container information for each group
kj: cluster j, which is a member of C
kc: the containers for the current working cluster

Fig. 1. The procedure of multiple-container packing

3.1 Grouping Algorithm

First, find the demanded number of each kind of container by using Formula (1), and set k for the K-means clustering to the number of demanded containers:
Initial k = Σ_{i=1}^{m} Ni,

where m is the number of different container types and Ni denotes the number of demanded containers of type i. Then perform the K-means clustering, which is guided by the capacity of the containers, the dimensions of the objects, and the unloading sequence of the objects. Let kj(OC) and kj(US) denote the initial points of cluster j, kj. kj(OC) is derived from the dimensions of the objects to be packed and is given by the length, width, and height components shown in Formula (2). kj(US) is obtained from the unloading sequence of the objects, as shown in Formula (3). N is the total number of objects to be packed; OCi,l, OCi,w, and OCi,h denote the length, width, and height of object i, respectively; and US(Oi) is the unloading sequence of object i.
Length:  min_{i=1,…,N} OCi,l + ( ( max_{i=1,…,N} OCi,l − min_{i=1,…,N} OCi,l ) / (k+1) ) × j,  for j = 1,…,k
Width:   min_{i=1,…,N} OCi,w + ( ( max_{i=1,…,N} OCi,w − min_{i=1,…,N} OCi,w ) / (k+1) ) × j,  for j = 1,…,k      (2)
Height:  min_{i=1,…,N} OCi,h + ( ( max_{i=1,…,N} OCi,h − min_{i=1,…,N} OCi,h ) / (k+1) ) × j,  for j = 1,…,k

kj(US) = ( max_{i=1,…,N} US(Oi) / (k+1) ) × j,  for j = 1,…,k      (3)
The degree of approximation between objects and the pivots is measured by two factors: the dimensions and the unloading sequence of the objects to be packed. The first is measured by the Euclidean distance between the dimensions of the objects. Let i and j denote different points for clustering, and let xi,l, xi,w, and xi,h denote the length, width, and height of point i. The formula for the degree of approximation of dimensions is shown in Formula (4); the degree of approximation of unloading sequence is measured by the difference of the unloading sequences.

d(i, j) = sqrt( (xi,l − xj,l)² + (xi,w − xj,w)² + (xi,h − xj,h)² )      (4)
Since the measures of dimensions and unloading sequence are on different scales, we normalize them and express them as percentages. Formula (5) shows the normalization, where OCSimilar_Oij denotes the degree of approximation of dimensions between object i and pivot j, and USSimilar_Oij represents the degree of approximation of unloading sequence between object i and pivot j. Wsu and Wu are the weights of each term and can be decided by users according to their needs.

Similar_Oij = Wsu × ( OCSimilar_Oij / Σ_{i=1}^{N} OCSimilar_Oij ) × 100 + Wu × ( USSimilar_Oij / Σ_{i=1}^{N} USSimilar_Oij ) × 100      (5)
Objects are assigned to clusters depending on their degree of approximation to the pivots of the clusters. Then the averages over the objects in the same cluster are used as the new pivot for the next generation, as shown in Formula (6). The clustering procedure is repeated until the stopping criterion, which is set to a predefined number of generations, has been reached.

kj = ( Σ_{i=1}^{nj} OCi,j / nj , Σ_{i=1}^{nj} US(Oi,j) / nj ),  for j = 1,…,k      (6)
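A rough sketch of one assignment pass of this clustering step, under our own reading of Formulas (4)–(5) (objects and pivots as plain dicts with hypothetical keys; the weighted measure is treated as a distance, so each object joins the pivot with the smallest value):

```python
import math

def dimension_distance(a, b):
    # Formula (4): Euclidean distance over (length, width, height)
    return math.sqrt((a["l"] - b["l"]) ** 2 +
                     (a["w"] - b["w"]) ** 2 +
                     (a["h"] - b["h"]) ** 2)

def assign_objects(objects, pivots, w_su=0.5, w_u=0.5):
    """One assignment pass: return, for each object, the index of its pivot."""
    n, k = len(objects), len(pivots)
    oc = [[dimension_distance(o, p) for p in pivots] for o in objects]
    us = [[abs(o["us"] - p["us"]) for p in pivots] for o in objects]
    sim = [[0.0] * k for _ in range(n)]
    for j in range(k):
        oc_sum = sum(oc[i][j] for i in range(n)) or 1.0   # normalizer of Formula (5)
        us_sum = sum(us[i][j] for i in range(n)) or 1.0
        for i in range(n):
            # Formula (5): weighted, normalized degree of approximation
            sim[i][j] = (w_su * oc[i][j] / oc_sum * 100.0 +
                         w_u * us[i][j] / us_sum * 100.0)
    # Each object joins the pivot with the smallest measure (both components are
    # distances here -- our interpretation of "degree of approximation").
    return [sim[i].index(min(sim[i])) for i in range(n)]
```

After each pass, new pivots would be recomputed as the per-cluster averages of Formula (6) and the loop repeated for the predefined number of generations.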
The procedure following the grouping is to assign a container to each cluster. In this procedure, we choose containers that not only have enough capacity but also minimize the container cost. First, the containers that have enough capacity to pack the objects of the particular cluster are selected as candidates. Then one of these candidate containers with the minimal cost is picked. If there is a feasible container to pack the objects of the specified cluster, a signal of success is returned; otherwise a signal of failure is returned. The procedure of choosing an appropriate container for a cluster is shown in Figure 2.

Fig. 2. The procedure of choosing appropriate container for a cluster

3.2 Packing Algorithm

The packing algorithm performs single-container packing for each cluster independently, right after the grouping procedure. The main algorithm follows previous work [2]. It uses a genetic algorithm combined with heuristic rules to do the single-container packing under the constraints of load bearing limitation and stable packing. Each object is assigned a unique number from 1 to n, where n is the number of objects to be packed, and the genetic algorithm uses these object numbers as the genes of individuals. Objects are then assigned to different layers by heuristic rules based on their unloading sequence. In the meantime, multiple-chromosome individuals are introduced into the genetic operations in order to perform crossover and mutation independently for different layers. Finally, heuristic rules based on the constraints of stability and load bearing are applied to transform a one-dimensional string into a three-dimensional packing layout [2]. If all the objects of a cluster cannot be packed into the specified container, the proposed method redistributes the containers for that cluster. The algorithm first looks for a larger-capacity container that can pack all the objects of the cluster; if one can be found, the container with minimum cost is chosen and the single-container packing is done again. If no appropriate container can pack all the objects of the particular cluster, the initial k value of the K-means is increased by 1, and the grouping and packing procedures start over for the new clusters.
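The container-selection step of Figure 2 can be sketched as follows (a simplified illustration with hypothetical field names; the per-dimension checks of the flowchart, e.g. on width, are reduced here to a single capacity test):

```python
def choose_container(cluster_objects, container_types):
    """Pick the cheapest container type whose capacity covers the whole cluster.

    cluster_objects: list of dicts with a "volume" key.
    container_types: list of dicts with "capacity" and "cost" keys.
    Returns the chosen container type, or None as the failure signal.
    """
    needed = sum(o["volume"] for o in cluster_objects)
    candidates = [c for c in container_types if c["capacity"] >= needed]
    if not candidates:
        return None                                   # signal of failure
    return min(candidates, key=lambda c: c["cost"])   # signal of success
```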
4 Simulation and Discussion

In order to evaluate the performance of the proposed method, a computer program was developed based on the proposed algorithm, and hundreds of testing cases, which
were generated randomly, were simulated. The number of objects in these testing cases is around 100, and the number of container types ranges from 4 to 12. From the simulation results, we found that the efficiency of most of these testing cases, calculated from Formula (1) and the proposed mathematical model in Section 2.2, is around 80%. The number of generations does not have much effect on the performance once it exceeds 5, as shown in Figure 3. Because the degree of approximation of the objects becomes strong after a few generations of the K-means algorithm and the number of objects in these testing cases is not large, it is reasonable that not many generations are needed to reach steady clustering results. Since we use three factors related to the objective function – the capacity of the containers, the unloading sequence of the objects, and the dimensions of the objects – as the measure of the degree of approximation during clustering, the packing achieves high efficiency.
Fig. 3. The relationship of objective and generation number of K-means clustering
From Figure 4, we observe that a better objective is obtained when the weight of container cost is set to 0.25, the weight of free space is set to 0.25, and the weight of loading overhead is set to 0.5. This result shows that the loading overhead can be satisfied more easily than the other two factors in the objective.
Fig. 4. The effects of weights for factors of objective function
Theoretically, the more container types there are, the higher the efficiency of space utilization, since with more types to choose from a more compact container can be found. From Figure 5, the packing algorithm shows higher space utilization when more container types are available for most of the testing cases, although the effect is not very pronounced.
Fig. 5. The relationship of space utilization and number of container type
To evaluate the effect of the dimensions and the number of objects, we designed 100 testing cases whose dimensions vary by ±10% around the bases (8, 5, 6), (10, 5, 6), and (12, 6, 7). The number of objects to be packed is set to 80, 120, and 160, respectively. When the number of objects and the dimensions of the objects are moderate, the simulation results show very high efficiency, better than the other settings, as shown in Figure 6. This is also reasonable: the smaller the dimensions of the objects, the more compact the packing; on the other hand, the more objects to be packed, the more dispersed their unloading sequence. So when the number and dimensions of the objects are moderate, the packing has higher efficiency.
Fig. 6. The effects of dimensions and number of objects to be packed
The last observation is the variation of dimensions among objects. First, use (9, 7, 8) as the dimension base of testing cases. Then generate objects based on the
dimension base with 10%, 20%, and 30% adjustment. The number of generations is set to 5 and the number of objects is set to 90. Figure 7 shows the simulation results of these testing cases. It is obvious that the smaller the variation percentage of the dimensions among objects, the better the performance, since compact packing is more difficult when the dimensions vary more among objects.
Fig. 7. The effects of dimension variation among objects
5 Conclusion

This research proposed an efficient algorithm to solve multiple-container packing problems with multiple constraints. We consider not only space utilization but also the load bearing limitation, unloading overhead, and stable packing. Besides the algorithm, a computer program based on the proposed algorithm was developed. The computer system is not only a simulation tool for performance analysis, but also a system that provides practical solutions for customer-designated multiple-container packing problems. Thousands of cases were simulated and analyzed to evaluate the performance of the proposed research and to prove its applicability in the real world. The adjustable weights of the factors of the objective function make the proposed method more practicable and useful for real-world applications.
References

[1] Lin, J.L., Chang, C.H.: Stability 3D container packing problems. The 5th Automation '98 and International Conference on Production Research (1998)
[2] Lin, J.L., Chung, Cheng-Hao: Solving Multiple-Constraint Container Packing by Heuristically Genetic Algorithm. The 8th Annual International Conference on Industrial Engineering – Theory, Applications and Practice, Nov. 10-12, 2003, Las Vegas, U.S.A. (2003)
[3] Verweij, B.: Multiple destination bin packing. ALCOM-IT Technical Report (1996)
[4] Bellot, P., El-Beze, M.: A clustering method for information retrieval. Technical Report IR-0199, Laboratoire d'Informatique d'Avignon (1999)
[5] Boley, D., Gini, M., Gross, R., Han, E.H., Hastings, K., Karypis, G., Kumar, V., Mobasher, B., Moore, J.: Partitioning-Based Clustering for Web Document Categorization. DSS Journal (1999)
[6] Dunham, M.: Data Mining: Introductory and Advanced Topics. Prentice Hall (2003)
[7] Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers (2001)
[8] Dhillon, I.S., Modha, D.S.: A Data-clustering Algorithm on Distributed Memory Multiprocessors. Proc. of Large-scale Parallel KDD Systems Workshop, ACM SIGKDD, August 1999 (also Large-Scale Parallel Data Mining, Lecture Notes in AI, Vol. 1759, pp. 245-260, 2000)
[9] Yang, Jia-Yan: A Study of Optimal System for Multiple-Constraint Multiple-Container Packing Problems. Master Thesis, Huafan University (2005)
Planning for Intra-block Remarshalling in a Container Terminal

Jaeho Kang1, Myung-Seob Oh1, Eun Yeong Ahn1, Kwang Ryel Ryu1, and Kap Hwan Kim2

1 Department of Computer Engineering, Pusan National University, San 30, Jangjeon-dong, Kumjeong-gu, Busan, 609-735, Korea, {jhkang, oms1226, tinyahn, krryu}@pusan.ac.kr
2 Department of Industrial Engineering, Pusan National University, San 30, Jangjeon-dong, Kumjeong-gu, Busan, 609-735, Korea, [email protected]
Abstract. Intra-block remarshalling in a container terminal refers to the task of rearranging export containers scattered around within a block into designated target bays of the same block. Since the containers must be loaded onto a ship following a predetermined order, the rearrangement should be done in a way that containers to be loaded earlier are placed on top of those to be loaded later to avoid re-handlings. To minimize time to complete a remarshalling task, re-handlings should be avoided during remarshalling. Moreover, when multiple yard cranes are used for remarshalling, interference between the cranes should be minimized. In this paper, we present a simulated annealing approach to the efficient finding of a good intra-block remarshalling plan, which is free from re-handlings at the time of loading as well as during remarshalling.
1 Introduction
In a container terminal, the efficiency of container loading onto a ship is highly dependent on the storage location of the containers in the yard. The containers must be loaded onto the ship in a predetermined loading sequence. Therefore, if the containers are located adjacently in the yard for successive loading, the travel distance of yard cranes and service time for container loading can be greatly reduced. However, gathering containers close to one another is not sufficient for efficient loading because the loading sequence must be followed. Gathering containers without considering the sequence may cause a considerable number of re-handlings at the time of loading because the container to be fetched out at each time is often found under others. In addition, time required for intra-block remarshalling must be minimized for efficient use of valuable resources such as yard cranes. In particular, when multiple yard cranes are used collaboratively for remarshalling, care must be taken to minimize crane interference.
This work was supported by the Regional Research Centers Program (Research Center for Logistics Information Technology), granted by the Korean Ministry of Education Human Resources Development.
For a remarshalling task, cranes have to move tens of containers while respecting the sequence of container loading and the interference between cranes. Therefore, the number of possible crane schedules is very large and finding the optimal one is intractable. Moreover, a crane schedule for remarshalling should be constructed in a reasonable time. In order to solve the scheduling problem efficiently, we applied simulated annealing [1]. Simulated annealing searches for a good partial order among the containers to be rearranged that requires less time for remarshalling and does not cause any re-handling during remarshalling and loading. A candidate solution expressed in the form of a partial order graph does not specify the number of cranes to be used or their detailed movements, but it can express a collection of possible re-handling-free total orders of container moves. Each candidate solution is used to construct a full crane schedule by a greedy evaluation heuristic with detailed crane simulation. The heuristic assigns container moves to each crane and determines an appropriate order of these moves under the constraint of the given partial order graph, in a way that minimizes the time for remarshalling while taking crane interference into account. The estimated time of the constructed full schedule is used as the quality of the partial order graph. Simulated annealing navigates the search space of partial order graphs using their evaluation results as a compass. The performance of the evaluation heuristic is adjustable by changing the number of container assignments looked ahead when selecting a container for the next assignment. Therefore, under a limited scheduling time, there are two possible directions for improving the quality of the schedule: trying more candidates or constructing a crane schedule with more container assignments looked ahead. Experimental results show that a better remarshalling schedule can be obtained by carefully balancing the two. The next section explains the remarshalling problem in detail and reviews related research. The simulated annealing approach and the evaluation heuristic are presented in Sections 3 and 4, respectively. In Section 5, the proposed method is tested and the results are summarized. Finally, in Section 6, we give our conclusions and some topics for further research.
2 Intra-block Remarshalling
Intra-block remarshalling is a task of moving a set of target containers to designated bays within a block. We call the bays in which the target containers are located before remarshalling ‘source bays’ and the empty bays into which the target containers are moved ‘target bays.’ In Fig. 1, there are two source bays S1 and S2 and two target bays TW and TG . In this figure, each bay in the block is three rows wide and three tiers high. A bay can contain up to nine containers. Figure 1 also shows cross-sectional views of the source and target bays. A rectangle drawn with a solid line in the source bays is a container. In this example, fourteen target containers need to be moved during remarshalling. The number in each container refers to its order of loading at its target bay and an alphabetic character denotes the target bay for the container.
Fig. 1. An example of remarshalling problem
Because a predetermined sequence of container loading in each target bay must be followed, if a container for later loading is moved and stacked on top of an earlier one, the upper container has to be moved to a temporary location during container loading. For example, in Fig. 1, if container 4G is moved and stacked on top of container 3G in one of the stacks in TG, then container 4G must be moved somewhere else before container 3G is fetched out for loading. This temporary movement of a container is called 're-handling.' Re-handlings incur extra work for the yard cranes and delay container loading, which is one of the most expensive and time-critical services of container terminals. Therefore, target containers should be moved into their target bays in a manner that avoids such re-handlings. A very simple way of scheduling cranes for remarshalling is to sequence the moves of target containers in the reverse order of container loading. For example, in Fig. 1, a yard crane could move container 7W first into its target bay TW, container 6W next, and so on. However, this approach may cause additional work for the yard cranes during remarshalling. For example, after moving container 7W, a yard crane has to move container 5G temporarily to some other location before moving container 6W; later in the remarshalling, container 5G will be moved again to be stacked into its proper target bay. These temporary movements of containers during remarshalling are also called re-handlings, and they delay the remarshalling work, which occupies valuable resources such as yard cranes. Therefore, this type of re-handling should also be avoided so that remarshalling finishes as early as possible. A re-handling-free schedule does not incur any re-handling either during remarshalling or during loading. For a schedule to be free of re-handlings, the following two main constraints must be satisfied.

Constraint-Re-handling-Free-In-Loading: After remarshalling, each container should be loaded onto the ship without re-handling.
Constraint-Re-handling-Free-In-Remarshalling: Each container must be moved from its source bay to its target bay without re-handling.

There has been little research on the issue of remarshalling. Kim and Bae presented a two-stage approach to planning remarshalling [2]. In the first stage, the containers to be moved and their target bays are determined so as to minimize the number of containers to be moved, considering the given ship profile. In the second stage, the sequence of container moves is determined to minimize crane movements. They assumed that remarshalling is
started and finished before the loading sequence is determined and a single crane is used for remarshalling. Moreover, they did not consider the specific storage slots for containers in the bays before and after moving. In contrast, our paper assumes that multiple cranes carry out remarshalling after the loading sequence is determined. Furthermore, we consider the specific storage slots of containers in the bays and interference between cranes. Some studies have developed strategies that determine a desirable stacking location for each container that comes into the yard. Dekker and Voogd described and compared various stacking strategies.[3] Each stacking strategy is evaluated by a few measures such as the number of re-handling occasions and workload distribution of yard cranes. Kim et al. derived a decision rule to locate export containers from an optimal stacking strategy to minimize the number of rehandlings in loading.[4] They assumed that containers are loaded onto a ship in the decreasing order of container weight group. These two studies tried to reduce the number of re-handlings by assigning the location of each container that comes into the yard more carefully. However, re-handlings in loading cannot be avoided completely for various reasons such as imprecise container weight information[5] and insufficient yard storage. Therefore, remarshalling may not be completely removed in most of the practical situations. Finally, there are studies on crane scheduling particularly when multiple cranes are used in a block. Ng presented a dynamic programming-based heuristic approach to crane scheduling to minimize the waiting time of trucks.[6] Dynamic programming is used to partition the block into ranges in which each crane can move; these ranges help crane interference to be resolved. However, these studies cannot be used to schedule cranes for remarshalling directly because a primitive task of remarshalling is moving a container from one place to another in the same block rather than giving a container to or taking a container from a truck. In addition, there are some order constraints among container moves imposed by the sequence of container loading. Therefore, a crane scheduling method that can efficiently consider the characteristics of remarshalling is required.
3 Using Simulated Annealing for Remarshalling
This section describes a method of deriving a partial order graph from the locations of containers before and after moving. The derived partial order graph is used as an initial solution of simulated annealing. An efficient method of generating valid neighbors from the current partial order graph is also presented in this section.

3.1 Determining Target Locations for Containers
In this sub-section, we present a heuristic method of determining target locations of containers under the two main constraints introduced in the previous section. Figure 2 shows a step-by-step example of the heuristic. In the initial state Fig. 2(a), containers 7W , 7G, 3G, 4G, 5G, and 5W are located on top of stacks
Fig. 2. Example of determining the target locations of containers
in their source bays. These six containers can be moved to their target bays without violating Constraint-Re-handling-Free-In-Remarshalling. From these containers, some that belong to the same target bay can be selected to be moved and stacked without breaking Constraint-Re-handling-Free-In-Loading. For example, if 7G, 4G, and 5G are selected, the order of moves for these containers should be 7G → 5G → 4G to satisfy Constraint-Re-handling-Free-In-Loading. Although there can be other container selections for a stack, each selection admits only one order of moving if the selected containers belong to the same target bay. Figure 3 shows the detailed procedure of container selection for filling a stack. By this procedure, 4G, 7G, and 3G are selected, and the order of moves is 7G → 4G → 3G. The target locations of these three containers are now fixed. After removing them from the source bays and placing them on a stack in their target bay, the state of the bays is as shown in Fig. 2(b). The container selection and placing process is repeated until the target locations of all containers are determined. In this example, the set of containers {7W, 5W, 4W} is the next selection, and their order is 7W → 5W → 4W. When the target locations of all containers are determined, we obtain a target configuration defined by the locations of the containers at the target bays.
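As a rough sketch (our own, with hypothetical data structures) of this stacking idea – containers destined for the same target bay are moved in decreasing loading order so that earlier-loaded containers end up on top:

```python
def order_for_stack(selected, loading_rank):
    """Order the selected containers of one target bay for stacking.

    selected: container ids chosen to fill one stack; loading_rank: dict mapping
    container id to its loading order in that bay (1 = loaded first).
    Containers loaded later must go to the bottom, so they are moved first.
    """
    return sorted(selected, key=lambda c: loading_rank[c], reverse=True)

# Example from Fig. 2: selecting 7G, 4G, 3G gives the move order 7G, 4G, 3G.
print(order_for_stack(["4G", "7G", "3G"], {"3G": 3, "4G": 4, "7G": 7}))
```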
3.2 Deriving Partial Order Graph from the Locations of Containers
1. Select the target bay t that has the largest number of target containers which can be moved without violating Constraint-Re-handling-Free-In-Remarshalling.
   1.1. For tie breaking, one of the target bays is chosen at random.
   1.2. Let Ct be the set of containers that satisfy Constraint-Re-handling-Free-In-Remarshalling for the selected target bay t.
2. From Ct, select a sufficient number of containers to fill one of the empty stacks of t using the following priority rules. (In Fig. 2, a maximum of three containers can be stacked in a row.)
   2.1. Select the container that has the largest number of target containers under it.
   2.2. If a tie is found, select the container that has the largest number of target containers for t under it.
   2.3. If a tie still exists, select one of these containers at random.

Fig. 3. Procedure for selecting containers to fill a stack

Figure 4(a) shows the source configuration defined by the initial positions of the containers at the source bays for the remarshalling problem of Fig. 1. The source configuration constrains the order of container moves satisfying Constraint-Re-handling-Free-In-Remarshalling: if container c1 is placed on top of another container c2 in the same stack of a source configuration, c1 should be moved to its target bay before c2. Similarly, a target configuration, which is defined by the
(a) Source configuration 7W
4G
(b) Target configuration
(c) Partial order graph
4W 1W
3G 1G
7W
1G 7G 3G 4G 5G 5W
5W 2W
4G 5G
7G
1W 2G 3W 2W 6W 6G
7W 3W 6W 7G 6G 2G
1G
1W
2G 3G 4G
3W
4W
2W 5G
5W
6W
6G
Fig. 4. Partial ordering of container moves: (a) source configuration, (b) target configuration, (c) partial order graph
3.3 Generating Neighborhood Solutions
A different partial order graph can be obtained by slightly modifying the current one. Figure 5 shows an example of generating neighbors from the partial order graph in Fig. 4(c). In the graph, two containers on different stacks of the same target bay are selected at random for swapping. After swapping the
Fig. 5. An example of generating neighbors from the partial order graph in Fig. 4(c)
selected containers, a new partial order graph can be generated. For example, in Fig. 5(a), containers 2G and 4G are selected, and a neighbor that is slightly different from the partial order graph of Fig. 4(c) is obtained. Note that some containers must be reordered within their stacks according to their loading turns to satisfy Constraint-Re-handling-Free-In-Loading. A single container can also be transferred to another non-full stack of the same target bay by swapping it with a virtual container, as shown in Fig. 5(b). A container swap does not always lead to a valid partial order graph: when a cycle exists in the modified graph, no re-handling-free total order can be derived from it. Another swap is tried on the original graph if the previous swap introduces a cycle. Cycle detection can be performed in time linear in the number of containers [7].
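The validity test for a candidate neighbor can be sketched as a standard topological-order check (our own illustration; the paper's linear-time stack-based method [7] would serve the same purpose):

```python
from collections import defaultdict, deque

def has_cycle(nodes, edges):
    """Return True if the precedence graph admits no re-handling-free total order.

    nodes must contain every container id appearing in edges.
    """
    adj, indeg = defaultdict(list), {v: 0 for v in nodes}
    for u, v in edges:
        adj[u].append(v)
        indeg[v] += 1
    queue = deque(v for v in nodes if indeg[v] == 0)
    seen = 0
    while queue:
        u = queue.popleft()
        seen += 1
        for v in adj[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return seen != len(nodes)   # some node was never freed -> a cycle exists

# A candidate swap is kept only if has_cycle(...) is False; otherwise another
# pair of containers is drawn from the current graph and swapped instead.
```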
4 Constructing a Schedule from Partial Order Graph
This section presents a detailed description of the evaluation heuristic. The evaluation heuristic constructs an executable crane schedule by assigning each container move to one of the cranes and by determining the order of the moves carried out by each crane. It also detects crane interference by simulation based on a state-transition-graph model. Figure 6 shows a snapshot of constructing a crane schedule using the heuristic. In this figure, some container moves have already been assigned and crane A is currently moving container 2W. A new container assignment is required for the idle crane B, which has just finished moving container 3W. Under the constraints of the given partial order graph, containers 1G and 6W are the two possible candidates for assignment to crane B. The heuristic simulates all possible container assignments and selects the best-looking one to extend the partially built crane schedule incrementally.
Fig. 6. An example of partially built crane schedule by the evaluation heuristic
The heuristic uses a depth-limited branch-and-bound search [8] to find the best-looking container. Figure 7 shows a tree constructed by the heuristic with a maximum depth of four. For each container, there are two possible scenarios of avoiding crane interference when two cranes are used for remarshalling and interference is detected by simulation. Interference between the cranes can be resolved by giving the priority of working to one of the two cranes: the crane that is given the priority works without waiting, whereas the other has to wait at the location nearest to the crane with priority in order to avoid interference. Therefore, a node in the tree represents both a crane priority and a container assignment; the crane that has priority in each node is underlined. A node is evaluated by the delay time, denoted to the right of the node, incurred during the work of moving the assigned container. In Fig. 7, container 1G is selected as the new container assigned to crane B, and crane A will take the priority in moving container 2W. The partially built crane schedule of Fig. 6 is expanded by adding the selected container 1G to crane B. Another tree search will then be performed for the next container assignment, this time for crane A, because crane A will finish its work of moving container 2W earlier than crane B. This process is repeated until all containers are assigned to one of the cranes. The maximum depth L of the depth-limited search is adjustable. Usually a better crane schedule can be obtained with a higher L; however, a higher L requires a longer evaluation time and results in fewer trials at the level of simulated annealing if the time for crane scheduling is limited. The results of the experiment show that a better remarshalling schedule can be obtained by adjusting the balance between trying more partial order graphs and evaluating each graph with more forward looking.
Fig. 7. Depth-limited branch-and-bound for assigning a container to idle crane B
5 Results of Experiment
For the experiment, we chose an environment with 33 bays in a block, each bay consisting of nine rows by six tiers. Two non-crossing cranes were used for remarshalling, and the cranes required at least a five-bay gap for suitable working clearance. Four empty target bays were used to stack 196 target containers. Initially, the target containers were randomly scattered within the block. Five different scenarios were used to test our approach on problems of various difficulties by changing the crowdedness of the target containers in the source bays. The time for searching for the remarshalling schedule was limited to 10 minutes. About 15,800 candidate solutions were evaluated within this time limit by the depth-limited branch-and-bound with L = 1, while about 14 candidates were evaluated with L = 8 in the same time limit. We also tried another local search algorithm, hill-climbing, which did not show better performance than simulated annealing. A 3.2 GHz Pentium PC was used for the experiments and each experiment was repeated ten times. Figure 8 shows the performance of the search with different L for each scenario. The graph depicts the relative estimated time of the generated remarshalling schedules compared to those generated by the search with L = 1. It is easily noticeable that a higher L does not always give a better remarshalling schedule when the running time is limited; there is a point of balance between the maximum depth of the evaluation and the number of evaluations for each scenario.
6 Conclusions
This paper presented a method of generating a remarshalling schedule for multiple cranes. A partial order graph is used to keep the schedule free of re-handlings, and simulated annealing is applied at the level of partial order graphs to minimize the time required for remarshalling. We also presented an evaluation heuristic for constructing a full crane schedule from a partial order graph.
Fig. 8. The relative estimated time of remarshalling with schedules generated by simulated annealing with various maximum depths of evaluation heuristics
Our proposed method can generate an efficient crane schedule in a reasonable time. The results of the experiment show that a better remarshalling schedule can be obtained by carefully adjusting the running time of the evaluation heuristic under a limited time of computation. For further research, we are interested in finding a smart way of generating neighbors by selecting containers to be swapped considering feedback information that can be obtained in the process of applying the evaluation heuristic.
References

1. Aarts, E., Korst, J.: Simulated Annealing. In: Local Search in Combinatorial Optimization. John Wiley & Sons (1997) 91–120
2. Kim, K. H., Bae, J.-W.: Re-Marshaling Export Containers. Computers and Industrial Engineering 35(3-4) (1998) 655–658
3. Dekker, R., Voogd, P.: Advanced methods for container stacking. Proceedings of the International Workshop on Intelligent Logistics Systems (2005) 3–29
4. Kim, K. H., Park, Y. M., Ryu, K. R.: Deriving decision rules to locate export containers in container yards. European Journal of Operational Research 124 (2000) 89–10
5. Kang, J., Ryu, K. R., Kim, K. H.: Determination of Storage Locations for Incoming Containers of Uncertain Weight. Proceedings of the 19th International Conference on Industrial, Engineering & Other Applications of Applied Intelligent Systems (2006) 1159–1168
6. Ng, W. C.: Crane scheduling in container yards with inter-crane interference. European Journal of Operational Research 164 (2005) 64–78
7. Nivasch, G.: Cycle Detection Using a Stack. Information Processing Letters 90(3) (2004) 135–140
8. Russell, S. J., Norvig, P.: Artificial Intelligence: A Modern Approach (second edition). Prentice Hall (2002)
Develop Acceleration Strategy and Estimation Mechanism for Multi-issue Negotiation

Hsin Rau and Chao-Wen Chen

Department of Industrial Engineering, Chung Yuan Christian University, Chungli, Taiwan 320, Republic of China
[email protected]
Abstract. In recent years, negotiation has become a powerful tool in electronic commerce. When two negotiation parties still have a lot of room to negotiate, little-by-little concession does not benefit the negotiation process. In order to improve negotiation efficiency, this study proposes a negotiation acceleration strategy to facilitate negotiation. In addition, this paper develops an estimation mechanism based on a regression technique to estimate the preference of the opponent, with which the joint utility of the negotiation can be maximized. Finally, an example is given to illustrate the proposed estimation mechanism.
1 Introduction

Negotiation is a process by means of which agents communicate and compromise to reach mutually beneficial agreements [8]. Recently, negotiation has become a powerful tool to aid electronic commerce. Regarding negotiation strategy, Jennings and Wooldridge [3] argued that a negotiation agent should increase its capability of intelligence and learning in order to know the opponent's negotiation attitude. Based on this idea, researchers considered that agents should be able to respond to an opponent's actions intelligently and proposed algorithms or rules to develop new negotiation strategies or tactics [4-10]. The factors in a negotiation strategy influencing the negotiation are mostly time, the opponent's concession attitude, the offer range, etc. However, little-by-little concession has no benefit when the two negotiating parties still have a lot of room to negotiate. In order to improve negotiation efficiency, this study extends the negotiation decision functions (NDFs) proposed by Faratin et al. [2] to develop an acceleration strategy that speeds up negotiation. In addition, a regression technique is used to estimate the opponent's preference. Moreover, this study uses linear programming to find the maximal joint utility in order to maximize satisfaction. Finally, an example is given to illustrate the proposed methodology. The rest of this paper is organized as follows. Section 2 briefly introduces negotiation decision functions. Section 3 develops the negotiation acceleration strategy. Section 4 discusses how to estimate the opponent's preference. Section 5 describes the methodology for maximizing utility. Section 6 gives an example for illustration. In the final section, a conclusion is provided.
2 Negotiation Decision Functions

This study extends the negotiation model proposed by Faratin et al. [2]. In their negotiation model, two parties negotiate over one issue or multiple issues, such as price, delivery time, quality, etc. Let x_j^{a→b}(t_n) be an offer proposed by agent a to agent b for negotiation issue j at time t_n, and let t_max^a be agent a's negotiation deadline. A negotiation thread is denoted by X_j^{a→b}(t_n), which is a sequence of alternating offers of the form (x_j^{a→b}(t_0), x_j^{b→a}(t_1), x_j^{a→b}(t_2), …). The negotiation is terminated by an acceptance or a withdrawal from either side. Agent a's response at t_n to agent b's offer sent at time t_{n−1} is defined as:

I^a(t_n, x_j^{b→a}(t_{n−1})) = reject, if t_n > t_max^a;
                             accept, if V^a(x_j^{b→a}(t_{n−1})) ≥ V^a(x_j^{a→b}(t_n));
                             x_j^{a→b}(t_n), otherwise      (1)
where x aj →b (t n ) is the counter offer from agent a to agent b when an offer x bj→a (t n −1 ) is not accepted by a. Agent i (a or b) has a scoring function, V ji : [ min ij , max ij ]→[0, 1], that gives the score (or utility) agent i assigns a value of issue j in the range of its acceptable values.
V ji is defined in Eqs.(2) and (3). w aj is agent a’s weight for issue j, which indicates the importance of issue j for agent a. J is the number of issues.
⎧⎪( x ij − min ij ) /(max ij − min ij ) V ji (x ij ) = ⎨ i i i i ⎪⎩(max j − x j ) /(max j − min j )
as x ij is an increasing function as x ij is a decreasing function
(2)
J
V a ( x) = ∑ w aj v aj ( x j )
(3)
j =1
The NDFs proposed by Faratin et al. [2] include three tactics: time-dependent, resource-dependent, and behavior-dependent. Moreover, we find that the negotiation strategies used mostly relate to time, so we use the time-dependent tactic in this study. The characteristic of the time-dependent tactic is that the agent gives an offer according to time with a concession parameter $\beta$, described by the concession function $\alpha_{j}^{a}(t)$ and the offers, as shown in Eqs. (4) and (5); Eq. (4) applies whether the offer value is decreasing or increasing over the negotiation.

$$
x_{j}^{a \to b}(t_n) = x_{j}^{a \to b}(t_0) + \alpha_{j}^{a}(t_n) \times \left( x_{j}^{a \to b}(t_{\max}^{a}) - x_{j}^{a \to b}(t_0) \right)
\tag{4}
$$

$$
\alpha_{j}^{a}(t) = k_{j}^{a} + (1 - k_{j}^{a}) \left( \frac{\min(t_n, t_{\max}^{a})}{t_{\max}^{a}} \right)^{1/\beta}
\tag{5}
$$
The time-dependent tactic is divided into three ranges according to the value of $\beta$:

(1) $\beta < 1$: the Boulware tactic, which insists on the offered value until the time is almost exhausted, whereupon it concedes up to the reservation value.
(2) $\beta > 1$: the Conceder tactic, in which the agent is willing to concede to its reservation value as quickly as possible.
(3) $\beta = 1$: the linear tactic, whose behavior lies between the above two tactics.

The details can be found in Faratin et al. [2].
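As a concrete illustration of the time-dependent tactic of Eqs. (4) and (5), the following Python sketch computes an agent's offers over time; the function names and the example values are our own illustrative choices, not ones taken from the paper.

```python
# Sketch of the time-dependent tactic of Eqs. (4) and (5); names and values are illustrative.

def alpha(t_n, t_max, k, beta):
    """Concession function of Eq. (5)."""
    return k + (1.0 - k) * (min(t_n, t_max) / t_max) ** (1.0 / beta)

def offer(t_n, x0, x_reserve, t_max, k, beta):
    """Offer of Eq. (4): move from the initial offer x0 toward the reservation value."""
    return x0 + alpha(t_n, t_max, k, beta) * (x_reserve - x0)

# Example: a seller conceding on price from 120 down to 60 over 25 time units (Boulware, beta = 0.3).
for t in (0, 5, 15, 25):
    print(t, round(offer(t, x0=120.0, x_reserve=60.0, t_max=25.0, k=0.1, beta=0.3), 2))
```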
3 Develop a Negotiation Acceleration Strategy

In NDFs, two negotiation agents might spend a lot of time while advancing little toward their negotiation settlement. In order to improve this situation, this study proposes a negotiation acceleration strategy for a single issue, as expressed in Eq. (6):

$$
\alpha_{j}^{a}(t_n) = \left[ \left( \alpha_{j}^{a}(t_{n-1}) + \mathrm{CTF}(t_n) \right) + \left( 1 - \left( \alpha_{j}^{a}(t_{n-1}) + \mathrm{CTF}(t_n) \right) \right) \times \gamma \times \mathrm{CAF}(t_n) \right]^{\frac{1}{\beta}}
\tag{6}
$$

where $\gamma \in [0,1]$. CTF is a concession time function, defined as the ratio of the remaining concession to the remaining time, as shown in Eq. (7):

$$
\mathrm{CTF}(t_n) = \frac{1 - \alpha_{j}^{a}(t_{n-1})}{t_{\max}^{a} - t_{n-1}}
\tag{7}
$$

CAF evaluates the distance between the agents' current offers (see Fig. 1) as compared with the distance between their initial offers, as shown in Eq. (8):

$$
\mathrm{CAF}(t_n) =
\begin{cases}
\dfrac{x_{j}^{a \to b}(t_{n-1}) - x_{j}^{b \to a}(t_{n-1})}{x_{j}^{a \to b}(t_0) - x_{j}^{b \to a}(t_0)} & \text{if } x_{j}^{a \to b}(t_n) \text{ decreasing} \\[2ex]
\dfrac{x_{j}^{b \to a}(t_{n-1}) - x_{j}^{a \to b}(t_{n-1})}{x_{j}^{b \to a}(t_0) - x_{j}^{a \to b}(t_0)} & \text{if } x_{j}^{a \to b}(t_n) \text{ increasing}
\end{cases}
\tag{8}
$$

Fig. 1. Illustration of concession acceleration function (CAF)
Fig. 2. Effects of weight and gamma: (a) effect of weight, (b) effect of γ (gamma)
$\mathrm{CAF} \in [-1, 1]$. When $x_{j}^{b \to a}(t_{n-1}) > x_{j}^{a \to b}(t_{n-1})$, $\mathrm{CAF} > 0$ indicates normal concession, whereas $\mathrm{CAF} < 0$ indicates over-concession. Sometimes an issue with $\mathrm{CAF} < 0$ has already reached its own settlement but has to wait for the other issues so that the overall settlement can be reached together. Parameter $\gamma$ in Eq. (6) indicates the agent's willingness to reach an agreement quickly. When $\gamma = 0$, $\alpha_{j}^{a}(t_n)$ becomes the time-dependent tactic (polynomial function) of the NDFs. As $\gamma$ increases, the agent enters the zone of agreement more quickly. When $\gamma = 1$, the agent hopes to reach a settlement immediately, and the offer is the reservation value ($x_{j}^{a \to b}(t_{\max}^{a})$ or $x_{j}^{b \to a}(t_{\max}^{b})$). For a multi-issue negotiation, we need to take the issue weights into consideration in the negotiation strategy. As shown in Eq. (9), we use an exponential form to express the effect of weight. Using $\beta = 0.3$ as an example, Fig. 2(a) illustrates the effect of weight on concession: as the weight increases, the agent concedes less. Fig. 2(b) illustrates the effect of $\gamma$ on CAF: CAF decreases more quickly as $\gamma$ increases.

$$
\alpha_{j}^{a}(t_n) = \left[ \left( \alpha_{j}^{a}(t_{n-1}) + \mathrm{CTF}(t_n) \right) + \left( 1 - \left( \alpha_{j}^{a}(t_{n-1}) + \mathrm{CTF}(t_n) \right) \right) \times \gamma^{\frac{1}{1 - w_{j}^{a}}} \times \mathrm{CAF}(t_n) \right]^{\frac{1}{\beta} \cdot \frac{1}{1 - w_{j}^{a}}},
\quad 0 \le w_{j}^{a} < 1, \; \gamma \in [0, 1]
\tag{9}
$$
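A minimal Python sketch of the acceleration strategy of Eqs. (6)-(9) is given below, under our reading of the reconstructed Eq. (9); the function names and the example values are ours, and the sketch assumes the previous concession level and both agents' previous offers are available.

```python
# Sketch of the acceleration strategy of Eqs. (6)-(9); names and example values are ours.

def ctf(alpha_prev, t_prev, t_max):
    """Concession time function, Eq. (7): remaining concession over remaining time."""
    return (1.0 - alpha_prev) / (t_max - t_prev)

def caf(own_prev, opp_prev, own_init, opp_init, decreasing):
    """Concession acceleration function, Eq. (8)."""
    if decreasing:
        return (own_prev - opp_prev) / (own_init - opp_init)
    return (opp_prev - own_prev) / (opp_init - own_init)

def accelerated_alpha(alpha_prev, t_prev, t_max, gamma, beta, w, caf_value):
    """Weighted acceleration of Eq. (9); with w = 0 it reduces to the single-issue Eq. (6)."""
    base = alpha_prev + ctf(alpha_prev, t_prev, t_max)
    boost = (1.0 - base) * gamma ** (1.0 / (1.0 - w)) * caf_value
    return (base + boost) ** ((1.0 / beta) * (1.0 / (1.0 - w)))

# Example: one update step for a decreasing (seller-side) price issue.
c = caf(own_prev=110.0, opp_prev=55.0, own_init=120.0, opp_init=40.0, decreasing=True)
print(accelerated_alpha(alpha_prev=0.2, t_prev=5, t_max=25, gamma=0.5, beta=0.3, w=0.2, caf_value=c))
```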
4 Estimate Opponent’s Preference 4.1 Estimate Reservation Value
In this study, we assume that the negotiation deadline of opponent agent is known. The steps to estimate opponent's reservation value are as follows: (1) Collect offer values of the opponent agent until an arranged negotiation thread. (2) Normalize the values in (1) with respect to the difference between initial offers. (3) Collect a number of recent values in (2).
(4) Use a polynomial of appropriate order to fit the scatter diagram with minimal error. (5) From (4), calculate the opponent's reservation value. Based on our experiments, we found that a third-order polynomial is suitable for the fitting in step (4), which requires at least six values to be collected in step (3).

4.2 Stop Conditions
During the process of reservation value estimation, we need to set conditions to stop the estimation. In this study, we set two stop conditions: (1) a number of consecutive estimation values fall within a predefined range; (2) the trend of the estimation values reverses direction (from decreasing to increasing, or vice versa) after an initial period of negotiation. According to our experiments, if four consecutive estimation values are within the defined range, we may stop. In the initial period of negotiation the estimation diverges considerably; however, after a period of negotiation the estimation converges. In our study, condition (2) is usually observed after one fifth of the negotiation deadline.

4.3 Estimate Weights
Once the opponent's reservation value is known, we can estimate the weight of each issue after a sequence of the opponent's offers. The procedure is as follows: (1) Collect the offer values of each issue of the opponent up to an arranged negotiation thread. (2) Normalize the values in (1) with respect to the difference between the opponent's initial offer and reservation value. (3) Plot scatter diagrams for every two issues. (4) Use an exponential equation to fit each scatter diagram with minimal error. (5) From (4), the ratio of the two weights for every two issues can be found. (6) From $\sum_j w_j = 1$, the weight of each issue can be obtained.
In step (5), for any two issues, we can derive simultaneous equations as shown in Eq. (10):
$$
\frac{1 - w_X^a}{1 - w_Y^a} = C_{X,Y}, \quad X = 1, \ldots, J, \; Y = 1, \ldots, J, \; X \ne Y
\tag{10}
$$
Consider an example in which a buyer estimates a seller's weights. Suppose that there are three issues and the seller's weights are 0.2, 0.5, and 0.3. According to Eq. (10), we have three sets of equations: issue one and issue two, issue one and issue three, and issue two and issue three. Taking set one as an example, we have $C_{1,2} = \frac{1 - w_1^a}{1 - w_2^a} = 1.6$.
The curve is fitted from the scatter diagram shown in Fig. 3. Then, from $C_{1,2}$, $C_{1,3}$, and $C_{2,3}$, we can find the weights $w_1$, $w_2$, and $w_3$.
Fig. 3. Concept of estimate weight
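As a rough illustration of the estimation steps in Sections 4.1 and 4.3, the following Python sketch fits a third-order polynomial to recent offers and recovers three weights from the ratios of Eq. (10); the normalization details, helper names, and numeric values are our own simplifications.

```python
# Sketch of the preference-estimation mechanism of Section 4; values and names are illustrative.
import numpy as np

def estimate_reservation(threads, offers, deadline):
    """Fit a third-order polynomial to the most recent offers (step (4) of Sec. 4.1)
    and evaluate it at the opponent's deadline to estimate the reservation value."""
    coeffs = np.polyfit(threads[-6:], offers[-6:], deg=3)   # at least six collected values
    return np.polyval(coeffs, deadline)

def weights_from_ratios(c12, c13):
    """Recover (w1, w2, w3) from the ratios C_{X,Y} = (1 - w_X) / (1 - w_Y) of Eq. (10)
    together with the constraint w1 + w2 + w3 = 1."""
    # (1 - w1) = c12 (1 - w2)  ->  w1 - c12 * w2 = 1 - c12, and similarly for c13.
    a = np.array([[1.0, -c12, 0.0],
                  [1.0, 0.0, -c13],
                  [1.0, 1.0, 1.0]])
    b = np.array([1.0 - c12, 1.0 - c13, 1.0])
    return np.linalg.solve(a, b)

print(weights_from_ratios(c12=1.6, c13=8.0 / 7.0))   # recovers the seller weights 0.2, 0.5, 0.3
```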
5 Utility Maximization

In order to increase the joint utility of the negotiation, this study uses the linear programming technique to maximize the opponent's utility (Z) while the utility of the self side is held fixed, as shown in Eq. (11). Before doing this, the opponent's weight for each issue must be known, which is why the estimation effort of the previous section is needed. The objective function in Eq. (11) is agent b's linear combination of weights and scoring functions as estimated by agent a. The constraints include agent a's linear combination of weights and scoring functions and the offer ranges. V is agent a's own score.

$$
\max Z =
\begin{cases}
\displaystyle\sum_{j} w_{j}^{b} \times \frac{x_{j}^{a \to b}(t_0) - x_{j}^{b \to a}(t_n)}{x_{j}^{b \to a}(t_0) - x_{j}^{b \to a}(t_{\max})} & \text{if } x_{j}^{b \to a}(t_n) \text{ decreasing} \\[2.5ex]
\displaystyle\sum_{j} w_{j}^{b} \times \frac{x_{j}^{a \to b}(t_n) - x_{j}^{b \to a}(t_0)}{x_{j}^{b \to a}(t_{\max}) - x_{j}^{b \to a}(t_0)} & \text{if } x_{j}^{b \to a}(t_n) \text{ increasing}
\end{cases}
$$

$$
\text{s.t.}\quad
\begin{cases}
\displaystyle\sum_{j} w_{j}^{a} \times \frac{x_{j}^{a \to b}(t_0) - x_{j}^{a \to b}(t_n)}{x_{j}^{a \to b}(t_0) - x_{j}^{a \to b}(t_{\max})} = V & \text{if } x_{j}^{a \to b}(t_n) \text{ decreasing} \\[2.5ex]
\displaystyle\sum_{j} w_{j}^{a} \times \frac{x_{j}^{a \to b}(t_n) - x_{j}^{a \to b}(t_0)}{x_{j}^{a \to b}(t_{\max}) - x_{j}^{a \to b}(t_0)} = V & \text{if } x_{j}^{a \to b}(t_n) \text{ increasing}
\end{cases}
$$

$$
\begin{cases}
x_{j}^{b \to a}(t_0) \le x_{j}^{a \to b}(t_n) \le x_{j}^{a \to b}(t_0), \quad j = 1, \ldots, J, & \text{if } x_{j}^{a \to b}(t_n) \text{ decreasing} \\
x_{j}^{a \to b}(t_0) \le x_{j}^{a \to b}(t_n) \le x_{j}^{b \to a}(t_0), \quad j = 1, \ldots, J, & \text{if } x_{j}^{a \to b}(t_n) \text{ increasing}
\end{cases}
\tag{11}
$$
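The following sketch illustrates the idea behind Eq. (11) with an off-the-shelf linear programming solver: it maximizes the opponent's estimated weighted score while holding the self side's score fixed and keeping each offer inside the overlap of both offer ranges. The weights and ranges follow Table 1, but the fixed score V, the simplified normalization, and the bound choice are our own assumptions.

```python
# Sketch of the utility-maximization idea behind Eq. (11); V, bounds, and normalization are simplified.
import numpy as np
from scipy.optimize import linprog

# Issue order: price, due date, quantity (ranges as in Table 1).
w_seller = np.array([0.2, 0.5, 0.3])            # self side (agent a)
w_buyer = np.array([0.4, 0.5, 0.1])             # opponent's estimated weights (agent b)
s_min, s_max = np.array([60.0, 5.0, 150.0]), np.array([120.0, 10.0, 250.0])   # seller: decreasing
b_min, b_max = np.array([40.0, 2.0, 100.0]), np.array([100.0, 7.0, 200.0])    # buyer: increasing
V = 0.7                                         # seller's own score to keep fixed (assumed)

# Maximize the buyer's weighted score <=> minimize its negation (constant offsets dropped).
c = -w_buyer / (b_max - b_min)

# Keep the seller's weighted score equal to V.
a_eq = (-w_seller / (s_max - s_min)).reshape(1, -1)
b_eq = np.array([V - np.sum(w_seller * s_max / (s_max - s_min))])

bounds = list(zip(np.maximum(s_min, b_min), np.minimum(s_max, b_max)))  # overlap of both ranges
res = linprog(c, A_eq=a_eq, b_eq=b_eq, bounds=bounds, method="highs")
print(res.x)   # offer maximizing the buyer's estimated utility at the fixed seller score V
```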
Table 1. Parameters used in the example

Issue      Parameter          Buyer        Seller
Price      Offer range        [40, 100]    [60, 120]
           Utility function   Increasing   Decreasing
Due date   Offer range        [2, 7]       [5, 10]
           Utility function   Increasing   Decreasing
Quantity   Offer range        [100, 200]   [150, 250]
           Utility function   Increasing   Decreasing
Fig. 4. The effect of utility maximization

Table 2. Comparison of the negotiation process with and without utility maximization

Thread   Buyer (I)   Seller (I)   Buyer (II)   Seller (II)
1        -0.54333    1            -0.54333     1
3        -0.54332    0.99999      -0.54332     0.99999
…        …           …            …            …
37       -0.29249    0.77481      -0.29249     0.77481
39       -0.21537    0.70444      0.02466      0.70444
41       -0.11865    0.61612      0.13065      0.61612
43       0.00177     0.50614      0.26262      0.50614
45       0.15064     0.37022
Let us use an experiment to discuss the benefit of using the utility maximization. The experiment parameters are shown in Table 1, with β = 0.3 and weights = [0.2, 0.5, 0.3] for the Seller and β = 1.0 and weights = [0.4, 0.5, 0.1] for the Buyer. Suppose that the Seller has estimated the Buyer's preference and only the Seller uses the utility maximization. In case I, Seller (I) does not use utility maximization; in case II, Seller (II) starts to use utility maximization when the negotiation thread equals 39. The negotiation result is shown in Fig. 4 and Table 2. Bold numbers shown
in Table 2 are the offers calculated from utility maximization. We can find that the Buyer’s utility becomes higher when the thread reaches 39 if the Seller uses the utility maximization (case II). Moreover, the utility maximization can also make an offer close to the efficient frontier (Pareto), as shown in Fig. 4.
6 Example Illustrations

An example is used here for discussion, with the parameters shown in Table 1, β = 0.5 and γ = 0.5 for the Buyer, and β = 1.0 and γ = 0.6 for the Seller. The negotiation deadline is set to 25 for both agents (thread = 50). In an environment without estimation of the opponent's preference, Fig. 5 shows the difference between a negotiation process without and with acceleration: the negotiation settles at threads 44 and 36 for the first and the second case, respectively. This result shows that the negotiation acceleration strategy proposed in this study improves the negotiation.
Fig. 5. Negotiation process (a) without and (b) with acceleration

Table 3. Estimation process of reservation value for each issue

Thread   Price      Due date   Quantity
14       100.0425   7.0021     198.8612
16       100.0338   7.0017     199.0692
18       100.0266   7.0013     199.2475
20                             199.3989
22                             199.5263
Continuing the previous example, but now with estimation of the opponent's preference, Table 3 shows the estimation process, which starts from thread 14. The Buyer's reservation values of price and due date are found to be 100.0266 and 7.0013 at thread 18, and the reservation value of quantity is found to be 199.5263 at thread 22. Compared with the actual reservation values (100, 7, 199), the estimates are very close. After all the Buyer's reservation values have been estimated, the weights can be estimated as (0.2, 0.3, 0.5), which are exactly the same as the actual values. Once the Buyer's preference
is known, we can start the process of utility maximization, which runs until the negotiation settles at thread 25. At this moment, the joint utility equals 0.866, which is much better than 0.561 for the case without utility maximization and is close to the best joint utility, 0.953. In Fig. 6 we can see the Buyer's utility jump from 0.124 to 0.494 on the Seller's side as the thread goes from 21 to 23, as also shown in Table 4.

Table 4. Negotiation process with estimation and utility maximization
Seller's offers:
Thread            1       3       13      15      17      19      21      23      25
Seller's utility  0.642   0.620   0.505   0.480   0.455   0.430   0.404   0.378   0.351
Buyer's utility   -0.107  -0.086  0.027   0.050   0.074   0.099   0.124   0.494   0.514

Buyer's offers:
Thread            2       4       –       14      16      18      20      22      24
Seller's utility  -0.225  -0.206  -0.099  -0.074  -0.049  -0.022  0.005   0.034   0.064
Buyer's utility   0.793   0.775   0.672   0.648   0.622   0.595   0.567   0.537   0.506
Fig. 6. Negotiation process with estimation and utility maximization
7 Conclusions

Negotiation has been applied to a wide range of industrial and business domains, and its electronic form is essential for electronic commerce. When two negotiation parties are still far away from a settlement, slow concession has no benefit to the negotiation process. In order to speed up the negotiation, this study proposed a negotiation acceleration strategy. In addition, to maximize the joint utility, we applied the linear programming technique to maximize the opponent's utility while the self side's utility is fixed. In this way, we can also encourage the opponent to accept the offer early. However, before the utility maximization can be
applied, we have to know the opponent's preference. This is why we proposed the estimation mechanism to estimate the reservation value and the weight of each issue of the opponent. We believe that this study provides several techniques and directions to improve the electronic negotiation process. This study only focuses on negotiation between two parties; an extension to multiple parties is left for future work.
Acknowledgements This work is supported in part by the National Science Council of Republic of China under the grant NSC 94-2213-E-033-006.
References 1. Choi, S.P.M., Liu, J., Chan, S.P.: A genetic agent-based negotiation system. Computer Networks. Vol. 37. 2 (2001) 195-204 2. Faratin, P., Sierra, C., Jennings, N.R.: Negotiation decision functions for autonomous agents. Robotics and Autonomous Systems. Vol. 24. 3 (1998) 159-182 3. Jennings, N.R., Wooldridge, M.: Software agents, IEE Review. Vol. 42. 1 (1996) 17-20 4. Li, J., Cao, Y.D.: Bayesian learning in bilateral multi-issue negotiation and its application in MAS-based electronic commerce. Proceedings of Intelligent Agent Technology. IEEE/WIC/ACM International Conference. (2004) 437-440. 5. Krovi, R., Graesser, A.C., Pracht, W.E.: Agent Behaviors in Virtual Negotiation Environments. IEEE Transactions on System, Man, and Cybernetics-Part C: Applications and Reviews. Vol. 29. 1 (1999) 15-25 6. Oliver, J.R.: On artificial agents for negotiation in electronic commerce. Proceedings of the Twenty-Ninth Hawaii International Conference on System Sciences. (1996) 337-346 7. Oprea, M.: An adaptive negotiation model for agent-based electronic commerce. Studies in informatics and Control. Vol. 11. 3 (2002) 217-279 8. Raiffa, H..: The art and science of negotiation. Harvard University Press. (1982) 9. Ren, Z.., Anumba, C. J.: Learning in multi-agent systems: a case study of construction claims negotiation. Advanced Engineering Informatics. Vol. 16. 4 (2002) 265-275. 10. Wang, L.M.., Huang, H.K.., Chai, Y.M.: A learning-based multistage negotiation model. Proceedings of International Conference on Machine Learning and Cybernetics. (2004) 140-145.
Least Squares Support Vector Machines for Bandwidth Reservation in Wireless IP Networks Jerzy Martyna Institute of Computer Science, Jagiellonian University, ul. Nawojki 11, 30-072 Cracow, Poland [email protected]

Abstract. Future mobile wireless Internet networks will be based on Internet Protocol (IP) version 6. In this paper we describe an approach for obtaining a bandwidth reservation mechanism in the process of handoff in wireless IP networks. In our approach, we adopt the least squares support vector machines (LS-SVM) method to minimize the probability of blocking a new call and the probability of dropping a call during the connection transfer between cells. Simulation experiments show that in environments with some types of traffic, a mechanism based on LS-SVM achieves good performance measures in bandwidth reservation.

Keywords: wireless networks, bandwidth reservation, least squares support vector machines.
1 Introduction

As the Internet has grown in size and diversity of applications, two trends have emerged to provide a new perspective. Firstly, to attain mobility, Internet services are made accessible through radio access networks. Secondly, mechanisms to achieve the quality of service (QoS) parameters for different traffic types, such as voice, audio, video, and multimedia, are under development. These traffic types have different characteristics and different QoS demands. Several mechanisms have been proposed for call admission control (CAC) in wireless cellular networks. One of the first solutions is the fractional guard channel policy [16], which determines the number of guard channels for handoffs by considering only the status of the local cell. In the paper by Naghshineh et al. [15], the total bandwidth required for the handoff and the existing connections is calculated under assumptions on the channel holding time and perfect knowledge of the handoff rate. In the approach presented in the paper by Levine [12], the shadow cluster scheme makes it possible to estimate future resource requirements. However, this method cannot determine the shadow cluster in real networks. In another scheme, given by Lu [14], an algorithm for indoor resource management was given. Nevertheless, this scheme does not estimate the channel holding time and therefore cannot be used for bandwidth reservation. In the paper by [3], user mobility is estimated on the basis of the history of handoffs observed in each cell. The scheme assumes that each mobile user will hand off to neighboring cells with equal probability within the mobility-estimate time window. This
assumption may not be accurate in general. In the paper by Liu [13], a hierarchical location-prediction algorithm was introduced. This algorithm is composed of an approximate pattern-matching mechanism that extracts regular movement patterns to estimate the global intercell direction, and of an extended self-learning Kalman filter that deals with random movements by tracking the intracell trajectory (local prediction). Nevertheless, the model postulated by Liu requires a large number of stored movement patterns of mobile users. Recently, the authors of the paper [11] formulated a handoff management scheme using simultaneous multiple bindings that reduces packet loss and generates negligible handoff delays in IP-based third-generation cellular systems. In the paper by Kuo [10], the use of the staying time, the available time, and the class of the mobile host for bandwidth reservation was postulated. However, the presented solution was idealistic, since the speed of the mobile host is difficult to detect accurately. On the other hand, Ei-Kadi [6] proposed a scheme in which some bandwidth is borrowed from multimedia connections to support new calls or handoff connections. The borrowed bandwidth is later returned to the original connections as soon as possible to satisfy the QoS requirements. Motivated by the above discussion, we propose a new bandwidth management scheme in this paper. It consists of a bandwidth reservation scheme for cellular systems and a support vector machines technique to reduce the call-blocking probability of real-time traffic. Our model of the cellular system is described by the cellular coverage area, in which the location can be easily determined using trigonometric relationships. By using a fuzzy logic controller for bandwidth estimation, we obtain dependencies which allow us to predict the demand for bandwidth in all neighbouring cells. We propose the LS-SVM method for bandwidth prediction in all the cells of the cellular IP network. We also study the accuracy of our LS-SVM method by comparing it with the fuzzy logic controller for bandwidth reservation in wireless IP networks. We recall that support vector machines (SVM), based on statistical learning theory, were introduced as a powerful new tool for data classification and function estimation by Vapnik [20, 21] and Cortes and Vapnik [4]. SVMs map input data into a high-dimensional feature space where they may become linearly separable. The SVM method was used, among others, by Gong [7] for multiuser detection in CDMA communications and by Hasegawa [8] for prediction of Internet traffic. In the paper by Chen [2], the call classification system for the AT&T natural dialog system was based on SVMs. Recently, many solutions based on SVM methods have been implemented in low-cost VLSI chips. The rest of the paper is organized as follows. In Section 2, we describe the proposed fuzzy logic model for bandwidth reservation. Section 3 introduces a support vector machine reservation scheme for problem solving. In Section 4, we give some simulation results, which compare the proposed method with the fixed reservation scheme.
2 Fuzzy Logic Controller for Bandwidth Reservation in Wireless IP Networks
This section describes the fuzzy logic model that will be used for bandwidth reservation in cellular wireless IP networks. Our cellular network model is described by the cellular coverage area, in which the cell frontier is defined as the maximum outreach of its base station (BS) transmission (see Fig. 1). Any traffic outside the frontier is of no direct interest to the base station, as it cannot service any of it. The hexagonal regions of each cell represent the region covered by the baseline pattern of coverage. The small rings represent the regions which can only be covered by their own base stations. The shadowed part in the middle represents the forbidden zone (no mobiles in it). For the location of a mobile terminal (MT) and movement prediction, the standard methods described, among others, in [1] can be used. When the MT moves from cell 2 to cell 1 and crosses into the shadowed part, BS1 will reserve the bandwidth and BS2 will release the used bandwidth (see Fig. 2). When the MT goes beyond the shadowed area, base station BS1 will not reserve the bandwidth and base station BS2 will release the used bandwidth (see Fig. 2). When the MT enters the most heavily shadowed area, the three base stations will reserve the bandwidth. The reservation of bandwidth depends on three factors, namely: - the probability that the MT will enter the shadowed area, - the probability that the MT will leave the current cell, - the speed of movement of the MT.
Fig. 1. Cell models used for bandwidth reservation

Fig. 2. Bandwidth reservation after location discovery of mobile terminal and its movement prediction
Fig. 3. Schematic diagram of the fuzzy logic controller (crisp input → fuzzifier → inference engine → defuzzifier → crisp output, supported by the fuzzy rule base and the membership functions)
We now present the calculation of the reserved bandwidth that takes all three factors into consideration. Let the MT be in the shadowed area and move to cell 1. The amount of bandwidth reserved for the $j$-th connection of the MT can be expressed as

$$
B^{(j)} = C \cdot B_{\max}^{(j)} \cdot p_m \cdot p_l, \quad j = 1, 2, \ldots, J
\tag{1}
$$

where $C$ is a scaling factor, $B_{\max}^{(j)}$ is the maximum bandwidth required for the $j$-th connection of the MT, $p_m$ is the probability that the MT will move into the shadowed area, $p_l$ is the probability that the MT will leave the current cell, and $J$ is the total number of current connections of the MT. The change of the bandwidth for the $j$-th connection at the current position of the MT in the shadowed area is given by

$$
\Delta B^{(j)} = C \cdot B_{\max}^{(j)} \cdot p_m \cdot p_l \cdot \Delta p_s, \quad j = 1, 2, \ldots, J
\tag{2}
$$

where $\Delta p_s$ is the speed of the mobile terminal in the direction of movement. The fuzzy logic controller [17] consists of the following components: a fuzzifier, an inference engine, a fuzzy rule base, membership functions, and a defuzzifier (see Fig. 3). The fuzzifier converts crisp input data into linguistic values which are used by the inference engine. The inference engine finds the associated fuzzy output with the help of the fuzzy rule base and the defined membership functions; the Mamdani method can be used as the inference engine. For an MT in movement, the fuzzy logic controller decides whether to reserve the bandwidth in the same cell or to hand the reservation over to the nearest BS. The input linguistic term set used to describe the bandwidth reservation in the same cell is defined as $T_{res}^{(c)} = \{\text{Low, Middle, High}\}$. The input term set for releasing the bandwidth in the same cell is $T_{rel}^{(c)} = \{\text{Little, Medium, Big}\}$. The term set of the output linguistic variable $S$ (the same cell) is defined as $V(S) = \{\text{Reject, Weakly Accept, Accept}\}$. Similarly, the term set of the output linguistic variable $N$ (nearest BS) is defined as $V(N) = \{\text{BS, Nearest BS}\}$. The Nearest BS value indicates the switching between the BS stations.
For the estimation of the output data we use the defuzzifier, which can be given by the ratio

$$
B^{(j)} = \frac{\int \mu(B_{out}^{(j)}) \, B_{out}^{(j)} \, dB_{out}^{(j)}}{\int \mu(B_{out}^{(j)}) \, dB_{out}^{(j)}}, \quad j = 1, 2, \ldots, J
\tag{3}
$$

where $\mu(B_{out}^{(j)})$ is the membership function of the aggregated output. The reserved bandwidth for the $j$-th connection in the target cell can be expressed as the product

$$
B_{res}^{(j)} = C_1 \cdot B_1^{(j)}, \quad j = 1, 2, \ldots, J
\tag{4}
$$

The total bandwidth reserved for the MT in the target cell is given by

$$
B_{res} = \sum_{j=1}^{J} B_{res}^{(j)}
\tag{5}
$$
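To make the bookkeeping of Eqs. (1), (2), (4) and (5) concrete, here is a minimal Python sketch; the probabilities, bandwidth figures, and scaling factors are invented for illustration and are not taken from the paper.

```python
# Sketch of the reserved-bandwidth calculation of Eqs. (1), (2), (4) and (5); all values are illustrative.

def reserved_bandwidth(b_max, p_move, p_leave, c=1.0):
    """Eq. (1): bandwidth reserved for one connection of the mobile terminal."""
    return c * b_max * p_move * p_leave

def bandwidth_change(b_max, p_move, p_leave, delta_speed, c=1.0):
    """Eq. (2): change of the reservation as the terminal's speed estimate changes."""
    return c * b_max * p_move * p_leave * delta_speed

# Two connections of one MT (e.g. a voice flow and a video flow), bandwidths in Mbps.
connections = [(0.064, 0.7, 0.6), (2.0, 0.7, 0.6)]          # (B_max, p_m, p_l)
per_connection = [reserved_bandwidth(b, pm, pl) for b, pm, pl in connections]

c1 = 0.9                                                    # scaling factor of Eq. (4), assumed
total = sum(c1 * b for b in per_connection)                 # Eqs. (4)-(5): total reservation in the target cell
print(per_connection, total)
```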
3 Least Squares Support Vector Machines Method for Bandwidth Reservation in Wireless IP Networks
In this section, we give an overview of the Least Squares SVM method for bandwidth reservation in wireless IP networks. Let the set $\{(x_i, y_i)\}_{i=1}^{N}$, $x_i \in \mathbb{R}^n$, $y_i \in \{-1, +1\}$, be the training data, where $x_i$ is the $i$-th input vector with the known binary target $y_i$. The original SVM classifier [20] satisfies the following conditions:

$$
\begin{cases}
w^T \phi(x_i) + b \ge +1 & \text{if } y_i = +1 \\
w^T \phi(x_i) + b \le -1 & \text{if } y_i = -1
\end{cases}
\tag{6}
$$

or, equivalently,

$$
y_i [w^T \phi(x_i) + b] - 1 \ge 0, \quad i = 1, 2, \ldots, N
\tag{7}
$$

where $\phi: \mathbb{R}^n \to \mathbb{R}^m$ is the feature map mapping the input space to a usually high-dimensional feature space. The data points are linearly separable by a hyperplane defined by the pair $(w \in \mathbb{R}^m, b \in \mathbb{R})$. Thus, the classification function is given by

$$
f(x) = \mathrm{sign}\{w^T \phi(x) + b\}
\tag{8}
$$

It is usually not necessary to evaluate the feature map explicitly; it is enough to use the kernel function in the original space, given by

$$
K(x_i, x_j) = \phi(x_i)^T \cdot \phi(x_j)
\tag{9}
$$

To allow violations of Eq. (7), slack variables $\xi_i$ are introduced such that

$$
y_i [w^T \phi(x_i) + b] \ge 1 - \xi_i, \quad \xi_i > 0, \quad i = 1, 2, \ldots, N
\tag{10}
$$
The following minimization problem is then considered:

$$
\min_{w, b, \xi} J(w, b, \xi) = \frac{1}{2} \| w \|^2 + C \sum_{i=1}^{N} \xi_i
$$
$$
\text{subject to} \quad y_i [w^T \phi(x_i) + b] \ge 1 - \xi_i, \quad \xi_i > 0, \quad i = 1, 2, \ldots, N, \quad C > 0
\tag{11}
$$
where $C$ is a positive constant parameter used to control the tradeoff between the training error and the margin. The dual of the system (11), obtained from the Karush-Kuhn-Tucker (KKT) conditions [9], leads to a well-known convex quadratic programming (QP) problem. The solution of the QP problem is slow for large vectors and is difficult to implement in an on-line adaptive form. Therefore, a modified version of the SVM, called the Least Squares SVM (LS-SVM), was proposed by Suykens et al. [18]. In the LS-SVM method, the following minimization problem is formulated:

$$
\min_{w, b, e} J(w, b, e) = \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{k=1}^{N} e_k^2
$$
$$
\text{subject to} \quad y_k [w^T \phi(x_k) + b] = 1 - e_k, \quad k = 1, 2, \ldots, N
\tag{12}
$$
The corresponding Lagrangian for Eq. (12) is given by

$$
L(w, b, e; \alpha) = J(w, b, e) - \sum_{k=1}^{N} \alpha_k \{ y_k [w^T \phi(x_k) + b] - 1 + e_k \}
\tag{13}
$$
where the $\alpha_k$ are the Lagrange multipliers. The optimality conditions lead to the following $(N+1) \times (N+1)$ linear system:

$$
\begin{bmatrix} 0 & Y^T \\ Y & \Omega^{*} + \gamma^{-1} I \end{bmatrix}
\begin{bmatrix} b \\ \alpha \end{bmatrix}
=
\begin{bmatrix} 0 \\ \vec{1} \end{bmatrix}
\tag{14}
$$

where $Z = [\phi(x_1)^T y_1, \ldots, \phi(x_N)^T y_N]$, $Y = [y_1, \ldots, y_N]$, $\vec{1} = [1, \ldots, 1]$, $\alpha = [\alpha_1, \ldots, \alpha_N]$, and $\Omega^{*} = Z Z^T$. Due to the application of Mercer's condition [18], there exists a mapping and an expansion

$$
\Omega^{*}_{kl} = y_k y_l \, \phi(x_k)^T \phi(x_l) = y_k y_l \, K(x_k, x_l)
\tag{15}
$$

Thus, the LS-SVM model for function estimation is given by

$$
y(x) = \sum_{k=1}^{N} \alpha_k y_k \, K(x, x_k) + b
\tag{16}
$$
where the parameters $\alpha_k$ and $b$ are obtained from the solution of Eqs. (14) and (15). The parameters $\alpha_k$, $b$ define the optimal weight vector for the bandwidth reservation when the MT crosses the shadowed areas, and the targets $y_k$ carry the corresponding reservation information. In comparison with the standard SVM method, the LS-SVM has lower computational complexity and memory requirements.
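As a concrete illustration of Eqs. (14)-(16), the following numpy sketch trains and applies an LS-SVM classifier; the RBF kernel, the toy data, and the hyperparameter values are our own assumptions rather than choices made in the paper.

```python
# Sketch of LS-SVM classification via the linear system of Eq. (14) and the model of Eq. (16).
# The RBF kernel, toy data, and hyperparameters are illustrative assumptions.
import numpy as np

def rbf_kernel(x_a, x_b, sigma=1.0):
    d2 = np.sum((x_a[:, None, :] - x_b[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def lssvm_train(x, y, gamma=10.0, sigma=1.0):
    """Solve the (N+1)x(N+1) system of Eq. (14) for the bias b and the multipliers alpha."""
    n = len(y)
    omega = np.outer(y, y) * rbf_kernel(x, x, sigma)          # Eq. (15)
    a = np.zeros((n + 1, n + 1))
    a[0, 1:] = y
    a[1:, 0] = y
    a[1:, 1:] = omega + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], np.ones(n)))
    sol = np.linalg.solve(a, rhs)
    return sol[0], sol[1:]                                    # b, alpha

def lssvm_predict(x_new, x, y, b, alpha, sigma=1.0):
    """Decision values of Eq. (16); the sign gives the predicted class."""
    return rbf_kernel(x_new, x, sigma) @ (alpha * y) + b

# Toy two-class data standing in for "reserve bandwidth" (+1) vs. "do not reserve" (-1).
x = np.array([[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
b, alpha = lssvm_train(x, y)
print(np.sign(lssvm_predict(np.array([[0.15, 0.15], [0.85, 0.85]]), x, y, b, alpha)))
```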
4 Simulation Results

In this section, we give some simulation results demonstrating the effectiveness of our approach. We compared the proposed LS-SVM bandwidth reservation method with the fuzzy logic controller (FLC) for bandwidth reservation. Our simulation was restricted to 36 cells in the cellular IP network, with one BS allocated in each cell. Additionally, we assumed that 100 Mbps were accessible in each cell. Each new call is an MT, and it will move to one of the six neighbouring cells with equal probability. We assumed that there were two classes of traffic flow, <640 bits, 32 000 bps, 50 ms> and <1280 bits, 64 000 bps, 100 ms>, where each triple defines the burst size of the flow, the average rate of the flow, and the end-to-end delay requirement of the class [5]. Each base station can service a traffic flow of either class. Figures 4 and 5 show the call-blocking probability and the call-dropping probability for the two schemes: the LS-SVM bandwidth reservation method and the fuzzy logic reservation controller. The depicted graphs illustrate that the proposed LS-SVM method can improve the call-blocking probability by ca. 30% for low call arrival rates and by ca. 100% for high call arrival rates. Analogously, the call-dropping probability is also improved: by ca. 10% for low call arrival rates and by ca. 30% for high call arrival rates.
Fig. 4. Call-blocking probability versus call arrival rate (first and second traffic classes, LS-SVM vs. FLC)
Fig. 5. Call-dropping probability versus call arrival rate (first and second traffic classes, LS-SVM vs. FLC)
It can also be seen that the bandwidth reservation method performs somewhat better for the light traffic flow stream than for the heavy traffic flow stream.
5 Conclusion

In this paper we have established a new scheme for bandwidth reservation in wireless IP networks. We have used the LS-SVM method to determine the required bandwidth in neighbouring cells. The obtained results show that the LS-SVM method is a good candidate for effectively improving the handoff call-blocking probability and the call-dropping probability. In future work, the investigated method will be compared with other methods of bandwidth reservation.
References 1. S. Basagni, M. Conti, S. Giordano, I. Stojmenovi´c, ”Mobile Ad Hoc Networking”, IEEE Press, John Wiley and Sons, Inc. (2004). 2. S. Chen, A.K. Samingan, L. Hanzo, ”Support Vector Machine Multiuser Receiver for DS-CDMA Signals in Multipath Channels”, IEEE Trans. on Neural Networks, Vol. 12, No. 3 (2001) 604 - 611. 3. S. Choi, K.G. Kin, ”Predictive and Adaptive Bandwidth Reservation for Handoffs in QoS-Sensitive Cellular Networks”, in: Proceedings of ACM SIGCOMM ’98, Vancouver (1998). 4. C. Cortes, V.N. Vapnik, ”Support Vector Networks”, Machine Learning, 20, (1995) 273 - 297. 5. S. Dixit, R. Prasad (Eds.), ”Wireless IP and Building the Mobile Internet”, Artech House, Boston, London (2003). 6. M. Ei-kadi, S. Olariu, H. Abdel-Wahab, ”Rate-Based Borrowing Scheme for QoS Provisioning in Multimedia Wireless Networks”, IEEE Trans. on Parallel and Distributed Systems, Vol. 13, No. 2 (2002) 156 - 166.
7. X. Gong, A. Kuh, ”Support Vector Machine for Multiuser Detection in CDMA Communications”, in: The 33rd Asilomar Conference on Signals, Systems, and Computers, Vol. 1 (1999) 680 - 684. 8. M. Hasegawa, G. Wu, M. Mizuno, ”Applications of Nonlinear Prediction Methods to the Internet Traffic”, in: The 2001 IEEE International Symposium on Circuits and Systems, Vol. 2 (2001) III-169 - III-172. 9. H. Kuhn, A. Tucker, ”Nonlinear Programming”, in: Proceedings of the 2nd Berkeley Symposium on Mathematical Statistics and Probabilistics, University of California Press (1951) 481 - 492. 10. G.-S. Kuo, P.-C. Ko, M.-L. Kuo, ”A Probabilistic Resource Estimation and SemiReservation Scheme for Flow-Oriented Multimedia Wireless Networks”, IEEE Communications Magazine (2001) 135 - 141. 11. J.-H. Lee, S.-U. Yoon, S.-K. Youm, C.-H. Kang, ”An Adaptive Resource Allocation Mechanism Including Fast and Reliable Handoff in IP-Based 3G Wireless Networks”, IEEE Personal Communications (2000) 42 - 47. 12. D.A. Levine, I.F. Akyildiz, M. Naghshineh, ”A Resource Estimation and Call Admission Algorithm for Wireless Multimedia Networks Using the Shadow Cluster Concept”, IEEE/ACM Trans. on Networking, Vol. 5, No. 1 (1997) 1 - 12. 13. T. Liu, P. Bahl, I. Chlamtac, ”Mobility Modeling, Location Tracking, and Trajectory Prediction in Wireless ATM Networks”, IEEE Journal on Selected Areas in Communications, Vol. 16, No. 6 (1998) 922 - 936. 14. S. Lu, V. Bharghavan, ”Adaptive Resource Management Algorithm for Indoor Mobile Computing Environment”, in: Proceedings of ACM SIGCOMM ’96, Stanford, CA, Sept. (1996). 15. M. Naghshineh, M. Schwartz, ”Distributed Call Admission Control in Mobile/Wireless Networks”, IEEE Journal on Selected Areas in Communications, Vol. 14, No. 4 (1996) 711 - 717. 16. R. Ramjee, R. Nagarajan, D. Towsley, ”On Optimal Call Admission Control in Cellular Networks”, Wireless Networks Journal, Vol. 3, No. 1 (1997) 29 - 41. 17. Q. Ren, G. Ramamurthy, ”A Real-Time Dynamic Connection Admission Controller Based on Traffic Modeling, Measurement and Fuzzy Logic Control”, IEEE Journal on Selected Areas in Communications, Vol. 18, No. 2 (2000) 184 - 196. 18. J.A.K. Suykens, T. van Gestel, J. de Brabanter, B. de Moor, J. Vandewalle, ”Least Squares Support Vector Machines”, World Scientific, New Jersey, London, Singapore, Hong Kong (2002). 19. V.N. Vapnik, ”The Nature of Statistical Learning Theory”, Springer-Verlag, Berlin, Heidelberg, New York (1995). 20. V.N. Vapnik, ”Statistical Learning Theory”, John Wiley and Sons, (1998). 21. V.N. Vapnik, ”The Support Vector Method of Function Estimation”, in: J.A.K. Suykens, J. Vandewolle (Eds.), ”Nonlinear Modeling: Advanced Black-box Techniques”, Kluwer Academic Publishers, Boston (1998) pp. 55 - 85.
An Ontology-Based Intelligent Agent for Respiratory Waveform Classification Chang-Shing Lee and Mei-Hui Wang Department of Computer Science and Information Engineering National University of Tainan, Tainan, Taiwan [email protected]
Abstract. This paper presents an ontology-based intelligent agent for respiratory waveform classification to help the medical staff judge the respiratory waveform from the ventilator. We present the manual construction tool (MCT), the respiratory waveform ontology (RWO), and the intelligent classification agent (ICA) to implement the classification of the respiratory waveform. The MCT allows the medical experts to construct the fuzzy numbers of respiratory waveforms and store them in the RWO. When the ICA receives an input respiratory waveform (IRW), it retrieves the fuzzy numbers from the RWO to carry out the classification task. Next, the ICA sends the classified results to the medical experts for confirmation and stores the classified results in the classified waveform repository (CWR). The experimental results show that our approach can classify the respiratory waveform effectively and efficiently.
1 Introduction

Research on ontologies has spread widely; ontologies have become critical components in knowledge management, the Semantic Web, business-to-business applications, and several other application areas [2]. For example, C. S. Lee et al. [8] proposed a fuzzy ontology application to news summarization, M. H. Burstein [4] presented a dynamic invocation of semantic web services using unfamiliar ontologies, A. Hameed et al. [6] presented an approach to acquire knowledge and construct multiple experts' ontologies in a uniform way, and R. Navigli et al. [9] presented an OntoLearn system for automated ontology learning that extracts relevant domain terms from a corpus of text, relates them to appropriate concepts in a general-purpose ontology, and detects taxonomic and other semantic relations among the concepts. C. S. Lee et al. [7] proposed an intelligent fuzzy agent for a meeting scheduling decision support system. Currently, for long-term mechanical ventilation (LMV) patients, the ventilator provides the physician with vital information, such as the pressure-time waveform, to assist in diagnosis. In addition, the medical staff often need to spend much time monitoring and making sense of a large amount of information about patients, such as medical history, medicines administered, and physiological variables. Physiological variables, however, are continuously generated over time and generally correspond to physio-pathological processes that require rapid responses. Therefore, P. Félix et al. [5] proposed the FTP (Fuzzy Temporal Profile) model to represent and reason
on information concerning the evolution of a physical parameter, and also studied the applicability of this model to the recognition of signal patterns. In addition, because of the development of both the healthcare sciences and information technologies, ICU units handle an ever-increasing amount of signals and parameters, so S. Barro et al. [1] proposed a patient supervision system in whose design fuzzy logic plays a central role. In this paper, we present an ontology-based intelligent agent for respiratory waveform classification. The experimental results show that our approach works efficiently and effectively for classifying the respiratory waveform. The remainder of this paper is organized as follows. In Section 2, we briefly present the structure of the respiratory waveform ontology. Section 3 introduces the structure of the ontology-based intelligent agent for respiratory waveform classification. The experimental results are presented in Section 4. Finally, the conclusions are drawn in Section 5.
2 The Structure of the Respiratory Waveform Ontology

In this section, we first briefly introduce the basic fuzzy notions for the fuzzy number. If a fuzzy set satisfies normality and convexity, then it is called a fuzzy number [5]. For example, a fuzzy number $\pi_C$ can be denoted by $C = (a, b, c, d)$, $a \le b \le c \le d$, where $[b, c]$ is the core, $\mathrm{core}(C) = \{v \in \mathbb{R} \mid \pi_C(v) = 1\}$, and $]a, d[$ is the support, $\mathrm{supp}(C) = \{v \in \mathbb{R} \mid \pi_C(v) > 0\}$ [10]. We call $a$, $b$, $c$, and $d$ the begin support, begin core, end core, and end support, respectively [10]. Now we give the definition of the fuzzy number used to generate the possibility distribution of the respiratory waveform slope. A fuzzy number $\tilde{M}$ is of LR-type if there exist reference functions $L$, $R$, and scalars $\alpha > 0$, $\beta > 0$ with

$$
\mu_{\tilde{M}}(x) =
\begin{cases}
L\!\left(\dfrac{m - x}{\alpha}\right) & x \le m \\[1.5ex]
R\!\left(\dfrac{x - m}{\beta}\right) & x \ge m
\end{cases}
\tag{1}
$$

where $m$, called the mean value of $\tilde{M}$, is a real number, and $\alpha$ and $\beta$ are called the left and right spreads, respectively. Symbolically, $\tilde{M}$ is denoted by $(m, \alpha, \beta)_{LR}$ [11].

Second, we introduce the concepts of the ventilator waveform. Fig. 1 shows the typical pressure-time diagram under volume-controlled, constant-flow ventilation. As shown in Fig. 1, the airway pressure depends on the alveolar pressure and the total of all airway resistances, and it can be affected by the resistance and compliance values specific to the ventilator and the lung. On inspiration, intra-thoracic volume is increased; this lowers intra-pleural pressure, making it more negative and causing the lungs to expand and the air to enter [3]. Consequently, at the beginning of the inspiration, the pressure between point A and point B increases dramatically on account of the resistances in the system. Then, after point B, the pressure increases in a
Fig. 1. Pressure-time diagram under volume-controlled, constant flow (significant points A-G, peak pressure, plateau pressure, inspiration and expiration times)
straight line until the peak pressure is reached at point C. At point C, the ventilator has applied the set tidal volume and no further flow is delivered; therefore, the pressure quickly falls to point D, the plateau pressure. Then, because of lung recruitment, the pressure drops slightly to point E. In addition, the slope of line A-D affects the static compliance, and it is a key feature when examining the
Fig. 2. The structure of the respiratory waveform ontology (domain layer, major category layer, minor category layer, and concept layer, the latter comprising the physiological parameters layer and the disease layer)
patients. On expiration, the muscles of the chest wall relax and the lungs return to their original size by elastic recoil, with the expulsion of air. As a result, the pressure falls exponentially to point F, the expiratory-end pressure (EEP) or baseline pressure. Finally, we introduce the structure of the respiratory waveform ontology, shown in Fig. 2. Comprising the domain layer, major category layer, minor category layer, and concept layer, the respiratory waveform ontology is an extended domain ontology based on [8]. The domain name is respiratory waveform ontology, and it consists of several major categories such as the pressure-time diagram, volume-time diagram, pressure-time loop, flow-volume loop, and so on. In the minor category layer, there are categories such as the respiratory time (RT) fuzzy numbers, respiratory airway pressure (RAP) fuzzy numbers, respiratory waveform slope (RWS) fuzzy numbers, and so on. The concept layer is divided into two sub-layers, the physiological parameters layer and the disease layer. The physiological parameters layer contains the parameters of the fuzzy numbers, such as the respiratory time, respiratory airway pressure, respiratory waveform slope, and so on. The disease layer associates these fuzzy numbers with diseases such as asthma, chronic bronchitis, pulmonary fibrosis, and so on.
3 The Structure of the Ontology-Based Intelligent Agent for Respiratory Waveform Classification

In this section, we briefly describe the functionality of the ontology-based intelligent agent for respiratory waveform classification, shown in Fig. 3. The manual construction tool (MCT), the respiratory waveform ontology (RWO), and the intelligent classification agent (ICA) are proposed to implement the classification of the respiratory waveform. The MCT allows the medical experts to construct the parameters of the standard respiratory waveform (SRW), such as the RT fuzzy numbers and RAP fuzzy
Fig. 3. The structure of the ontology-based intelligent agent for respiratory waveform classification
numbers; the respiratory waveform slope (RWS) fuzzy numbers are then constructed automatically. After the construction is completed, all of these parameters of the SRW are stored in the RWO. When the ICA receives an IRW, it carries out the classification of the respiratory waveform based on the information stored in the RWO. Next, the ICA sends the classified results to the medical experts for confirmation. If the classified results pass the experts' verification, they are stored in the classified waveform repository (CWR); otherwise, the IRW is sent back to the ICA to be processed again. Below is the algorithm for the ICA.

Algorithm for the Intelligent Classification Agent

Step 1: Retrieve the fuzzy numbers $\tilde{p}_A$, $\tilde{t}_A$, $\tilde{p}_B$, and $\tilde{t}_B$ from the RWO.
  Step 1.1: $\tilde{p}_A = (n_{p_A}, r_{p_A}, \delta_{p_A})_{LR}$
  Step 1.2: $\tilde{t}_A = (n_{t_A}, r_{t_A}, \delta_{t_A})_{LR}$
  Step 1.3: $\tilde{p}_B = (m_{p_B}, \alpha_{p_B}, \beta_{p_B})_{LR}$
  Step 1.4: $\tilde{t}_B = (n_{t_B}, r_{t_B}, \delta_{t_B})_{LR}$
Step 2: Generate $\tilde{s}_{AB} = (m_{s_{AB}}, \alpha_{s_{AB}}, \beta_{s_{AB}})_{LR}$.
  Step 2.1: Generate the positive slope $\tilde{s}_{AB}$, where $\Delta\tilde{p}$ is positive and $\Delta\tilde{t}$ is positive.
    Step 2.1.1: $m_{s_{AB}} = (m_{p_B} - n_{p_A}) / (m_{t_B} - n_{t_A})$
    Step 2.1.2: $\alpha_{s_{AB}} = (m_{p_B} - n_{p_A}) \times (\alpha_{t_B} + \delta_{t_A}) + (m_{t_B} - n_{t_A}) \times (\alpha_{p_B} + \delta_{p_A})$
    Step 2.1.3: $\beta_{s_{AB}} = (m_{p_B} - n_{p_A}) \times (\beta_{t_B} + \gamma_{t_A}) + (m_{t_B} - n_{t_A}) \times (\beta_{p_B} + \gamma_{p_A})$
  Step 2.2: Generate the negative slope $\tilde{s}_{AB}$, where $\Delta\tilde{p}$ is negative and $\Delta\tilde{t}$ is positive.
    Step 2.2.1: $m_{s_{AB}} = (m_{p_B} - n_{p_A}) / (m_{t_B} - n_{t_A})$
    Step 2.2.2: $\alpha_{s_{AB}} = (m_{t_B} - n_{t_A}) \times (\alpha_{p_B} + \delta_{p_A}) - (m_{p_B} - n_{p_A}) \times (\beta_{t_B} + \gamma_{t_A})$
    Step 2.2.3: $\beta_{s_{AB}} = (m_{t_B} - n_{t_A}) \times (\beta_{p_B} + \gamma_{p_A}) - (m_{p_B} - n_{p_A}) \times (\alpha_{t_B} + \delta_{t_A})$
Step 3: Retrieve the SRW from the RWO and the IRW from the ventilator.
Step 4: For k ← 1 to c  /* c denotes the number of SRWs $S_k$ */
  Step 4.1: For j ← 1 to n−1  /* n denotes the number of significant points */
    Step 4.1.1: $total\_\mu_{IRW \in Type\,S_k} = \sum_{j=1}^{n-1} \mu_{IRW \in Type\,S_k}(x_j)$, where $\mu_{IRW \in Type\,S_k}(x)$ denotes the membership degree with which the value $x$ of the IRW belongs to the $k$-th SRW $Type\,S_k$.
  Step 4.2: Normalize $total\_\mu_{IRW \in Type\,S_k}$ and save it to $\mu_{IRW \in Type\,S_k}(x)$:
    $\mu_{IRW \in Type\,S_k}(x) = total\_\mu_{IRW \in Type\,S_k} \,/\, \sum_{j=1}^{n} \mathrm{MAX}\{\mu_{IRW \in Type\,S_k}(x_j)\}$
Step 5: Run the matching task for the IRW and SRW Type1:
  If $(\mu_{IRW \in Type\,S_1}(x_{slope_{AB}}) > \sigma_s) \wedge (I_{j=1}^{n}[\mu_{\tilde{p}_j}(x_{p_j})] > \sigma_{p_1}) \wedge (I_{j=1}^{n}[\mu_{\tilde{t}_j}(x_{t_j})] > \sigma_{t_1})$
  Then $\mu_{f_1(A)}(y_1) = I[\mu_{IRW \in Type\,S_1}(x_{slope}),\, I_{j=1}^{n}[\mu_{\tilde{p}_j}(x_{p_j})],\, I_{j=1}^{n}[\mu_{\tilde{t}_j}(x_{t_j})]]$
Step 6: Run the matching task for the IRW and SRW Type2:
  If $(\mu_{IRW \in Type\,S_2}(x_{slope_{AB}}) > \sigma_s) \wedge (I_{j=1}^{n}[\mu_{\tilde{p}_j}(x_{p_j})] > \sigma_{p_2}) \wedge (I_{j=1}^{n}[\mu_{\tilde{t}_j}(x_{t_j})] > \sigma_{t_2})$
  Then $\mu_{f_2(A)}(y_2) = I[\mu_{IRW \in Type\,S_2}(x_{slope}),\, I_{j=1}^{n}[\mu_{\tilde{p}_j}(x_{p_j})],\, I_{j=1}^{n}[\mu_{\tilde{t}_j}(x_{t_j})]]$
Step 7: End.
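The following Python sketch illustrates, in a much simplified form, the matching idea of Steps 4-6: it compares the significant points of an input waveform against the two standard waveforms of Table 1 using triangular memberships and uses min() in place of the aggregation operator I. The spreads, thresholds, and the sample input waveform are our own assumptions, and the slope matching of Step 5 is omitted.

```python
# Simplified sketch of the ICA matching in Steps 4-6: triangular memberships around the
# significant points of an SRW, min() standing in for the aggregation operator I,
# and spreads/thresholds that we chose for illustration.
SRW_TYPES = {
    "Type1": [(0.2, 6), (0.3, 22), (1.1, 44), (1.2, 31), (1.9, 29), (2.2, 6), (4.4, 6)],
    "Type2": [(0.2, 6), (0.3, 22), (1.1, 60), (1.2, 31), (1.9, 29), (2.2, 6), (4.4, 6)],
}

def triangular(x, centre, spread):
    """Membership of x in a symmetric triangular fuzzy number (centre, spread, spread)_LR."""
    return max(0.0, 1.0 - abs(x - centre) / spread)

def match_degree(irw_points, srw_points, t_spread=0.3, p_spread=8.0, threshold=0.6):
    """Aggregate (by min) the time and pressure memberships of the input waveform's
    significant points against one standard waveform; apply a sigma-style threshold."""
    time_mu = min(triangular(t, ts, t_spread) for (t, _), (ts, _) in zip(irw_points, srw_points))
    pres_mu = min(triangular(p, ps, p_spread) for (_, p), (_, ps) in zip(irw_points, srw_points))
    degree = min(time_mu, pres_mu)
    return degree if degree > threshold else 0.0

irw = [(0.2, 7), (0.35, 20), (1.1, 46), (1.25, 30), (1.9, 28), (2.2, 6), (4.3, 7)]
scores = {name: match_degree(irw, pts) for name, pts in SRW_TYPES.items()}
print(max(scores, key=scores.get), scores)   # this IRW matches Type1 and is rejected for Type2
```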
4 Experimental Results

To test the performance of the ontology-based intelligent agent for respiratory waveform classification, we set up an experimental test environment at National University of Tainan. First, we select two types of respiratory waveforms, SRW Type1 and SRW Type2, denoting the characteristics of a typical normal waveform and a high-pressure waveform, respectively. Table 1 shows the parameters of SRW Type1 and SRW Type2 defined by the medical staff; all of the parameters of the significant points are stored in the RWO. Second, we design an interface that allows the medical experts to define or tune the RT fuzzy numbers and RAP fuzzy numbers for the respiratory waveform; the RWS fuzzy numbers are then constructed automatically. Fig. 4 shows the MCT for the SRW Type1. Finally, in order to evaluate the accuracy of the ICA, we choose precision and recall as our criteria. Fig. 5(a) and Fig. 5(b) show the precision and recall versus the number of

Table 1. The parameters of significant points for four types of the SRW
(Significant points are given as (Time: sec, Pressure: cmH2O).)

SRW      A          B          C          D          E          F         G
Type1    (0.2, 6)   (0.3, 22)  (1.1, 44)  (1.2, 31)  (1.9, 29)  (2.2, 6)  (4.4, 6)
Type2    (0.2, 6)   (0.3, 22)  (1.1, 60)  (1.2, 31)  (1.9, 29)  (2.2, 6)  (4.4, 6)
Fig. 4. The manual construction tool of the RT, RAP, and RWS fuzzy numbers for the SRW Type1 (panels (a), (b), and (c))
Fig. 5. The precision and recall of the IRW for the SRW Type1 and SRW Type2: (a) precision versus the number of IRW, (b) recall versus the number of IRW
IRWs, respectively, when the threshold $\sigma_{s_{AB}}$ is 0.75. We observe that the average precision and recall for SRW Type1 are about 64% and 96%, respectively, and the average precision and recall for SRW Type2 are about 68% and 96%, respectively. The experimental results show that our approach works effectively.
5 Conclusions

In this paper, we have presented an ontology-based intelligent agent, comprising the MCT, RWO, and ICA, for respiratory waveform classification to help the medical staff judge the respiratory waveform from the ventilator. From the experimental results, we observe that the proposed ontology-based intelligent agent classifies the respiratory waveform effectively and efficiently. In the future, we will extend our method to process more types of IRW and to construct the RWO semi-automatically.
Acknowledgement This work is partially supported by the National Science Council of Taiwan under the grant NSC94-2213-E-024-006, and the Service Web Technology Research Project of Institute for Information Industry and sponsored by MOEA, Taiwan.
References 1. S. Barro, R. Mar’in, F. Palacios, and R. Rui’z: Fuzzy logic in a patient supervision system, Artificial Intelligence in Medicine. Vol. 21. (2001) 193-199 2. C. Brewster, K. O’Hara, S. Fuller, Y. Wilks, E. Franconi, M. A. Musen. J. Ellman, and S. B. Shum: Knowledge representation with ontologies: the present and future. Vol. 19. (2004) 72-81 3. S. M. Burns: Working with respiratory waveforms: How to Use Bedside Graphics. AACN Clinical Issues. Vol. 14. (2003) 133-144 4. M. H, Burstein: Dynamic invocation of semantic web services that use unfamiliar ontologies. IEEE Intelligent Systems. Vol. 19. (2004) 67-73 5. P. Félix, S. Barro, and R. Marín: Fuzzy constraint networks for signal pattern recognition. Artificial Intelligence. Vol. 148. (2003) 103-140 6. A. Hameed, D. Sleeman, and A. Preece: Detecting mismatches among experts’ ontologies acquired through knowledge elicitation. Knowledge-Based Systems. Vol. 15. (2002) 265-273 7. C. S. Lee and C. Y. Pan: An intelligent fuzzy agent for meeting scheduling decision support system. Fuzzy Sets and Systems. Vol. 142. (2004) 467-488 8. C. S. Lee, Z. W. Jian, and L. K. Huang: A fuzzy ontology and its application to news summarization. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics. Vol. 35. (2005) 859-880 9. R. Navigli, P. Velardi, and A. Gangemi: Ontology learning and its application to automated terminology translation. IEE Intlligent Systems. Vol. 18. (2003) 22-31 10. J. Yen, and R. Langari: Fuzzy logic. Prentice-Hall, New Jersey (1999) 11. H. -J. Zimmermann: Fuzzy set theory and its applications. Kluwer, Boston (1991)
A New Inductive Learning Method for Multilabel Text Categorization Yu-Chuan Chang1, Shyi-Ming Chen2, and Churn-Jung Liau3 1
Department of Computer Science and Information Engineering National Taiwan University of Science and Technology Taipei, Taiwan, R.O.C. [email protected] 2 Department of Computer Science and Information Engineering National Taiwan University of Science and Technology Taipei, Taiwan, R.O.C. [email protected] 3 Institute of Information Science, Academia Sinica Taipei, Taiwan, R.O.C. [email protected]
Abstract. In this paper, we present a new inductive learning method for multilabel text categorization. The proposed method uses a mutual information measure to select terms and constructs document descriptor vectors for each category based on these terms. These document descriptor vectors form a document descriptor matrix. It also uses the document descriptor vectors to construct a document-similarity matrix based on the "cosine similarity measure". It then constructs a term-document relevance matrix by applying the inner product of the document descriptor matrix to the document similarity matrix. The proposed method infers the degree of relevance of the selected terms to construct the category descriptor vector of each category. Then, the relevance score between each category and a testing document is calculated by applying the inner product of its category descriptor vector to the document descriptor vector of the testing document. The maximum relevance score L is then chosen. If the relevance score between a category and the testing document divided by L is not less than a predefined threshold value λ between zero and one, then the document is classified into that category. We also compare the classification accuracy of the proposed method with that of the existing learning methods (i.e., Find Similar, Naïve Bayes, Bayes Nets and Decision Trees) in terms of the break-even point of micro-averaging for categorizing the "Reuters-21578 Aptè split" data set. The proposed method gets a higher average accuracy than the existing methods.
1 Introduction As the amount of information in the Internet is growing so rapidly, it is difficult for users to find desired information unless it is organized and managed well. Text categorization (TC) is a major research topic in machine learning (ML) and information retrieval (IR) to help users obtain desired information. A document can belong to a
single category, multiple categories, or not belong to any category. The goal of automatic text categorization is to utilize categorized training documents to construct text classifiers, which are then used to classify new documents into appropriate categories automatically. Several machine learning methods have been developed to deal with the text categorization problem, e.g., regression models [10], [17], nearest neighbor classification [16], [18], Bayesian probabilistic approaches [1], [12], [14], decision trees [1], [8], [12], inductive rule learning [1], [6], neural networks [18] and on-line learning [6]. In this paper, we present a new inductive learning method for multilabel text categorization. It uses the mutual information measure [11] for term selection and constructs document descriptor vectors for each category based on the selected terms. These document descriptor vectors form a document descriptor matrix and they are also used to construct a document-similarity matrix based on the cosine similarity measure [2]. It then constructs a term-document relevance matrix by applying the inner product of the document descriptor matrix to the document-similarity matrix. It infers the degree of relevance of the selected terms to construct the category descriptor vector of each category. The relevance score between each category and a testing document is calculated by applying the inner product of its category descriptor vector to the document descriptor vector of the testing document. The maximum relevance score L is then chosen. If the relevance score between a category and the testing document divided by L is not less than a predefined threshold value λ, where λ∈[0, 1], then the document is classified into that category. We also compare the classification accuracy of the proposed method with that of the existing learning methods (i.e., Find Similar [13], Decision Trees [5], Naïve Bayes and Bayes Nets [14]) in terms of the break-even point of micro-averaging for categorizing the "Reuters-21578 Aptè split 10 categories" data set [21]. The experimental results show that the proposed method outperforms the existing methods.
2 Preliminaries
Machine learning systems deal with categorization problems by representing samples in terms of features. In order to apply machine learning methods to text categorization, documents must therefore be transformed into feature representations. The vector space model [13] is widely used in information retrieval for the representation of documents. In the vector space model, a document is represented as a document descriptor vector of terms, and every element in the vector denotes the weight of a term with respect to the document. The learning methods presented in [10] and [17] calculate the tf × idf term weight for each term. The learning methods presented in [1], [8] and [12] use a binary weight 1 or 0 to represent each index term. The dimensionality of the term space is an important research topic in text categorization. With a high dimension, a classifier overfits the training samples, which may look good on the training data but is not feasible for classifying previously unseen testing samples. The purpose of term selection is to choose relevant terms for document indexing that yield the highest accuracy rates. The simplest term selection method [19] is based on the frequency of a term's occurrence in documents, where only terms that occur in the highest number of documents are retained. Other term
selection methods are based on information-theoretic functions, such as the DIA association factor [9], information gain measure [6], [12], chi-square measure [4], and mutual information measure [8], [19]. In recent years, an increasing number of categorization methods have applied the mutual information (MI) measure in term selection [3], [7]. The mutual information score between term ti and category c is defined by
MI(t_i, c) = p(t_i, c) log2( p(t_i, c) / (p(t_i) p(c)) ),    (1)
where p(ti, c) = NC(ti)/NC, p(ti) = N(ti)/N, p(c) = NC/N, NC(ti) denotes the number of occurrences of term ti in category c, NC denotes the number of occurrences of all terms in category c, N(ti) denotes the number of occurrences of term ti in the collection, and N denotes the number of occurrences of all terms in the collection. There are some rules for determining the threshold value in multilabel text categorization [15], [16]. The threshold value is used when a document may belong to multiple categories. In this paper, we use a variant of the rank-based thresholding (R-cut) measure [16] to assign text documents into categories. We use the following criterion for multilabel text categorization:

Score(c_i, d) / L ≥ λ,    (2)

where Score(ci, d) denotes the relevance score between category ci and document d, L = max_j Score(cj, d) denotes the maximum relevance score, and λ is a threshold value that controls the degree of multilabel categorization, with λ ∈ [0, 1]. If the relevance score between category ci and document d divided by L is not less than the threshold value λ, then the document d is classified into category ci. The lower the threshold value λ, the more categories a document may belong to. If λ = 0, then the document belongs to all categories. Thus, the threshold value λ provides some flexibility in dealing with the multilabel text categorization problem. In the following, we briefly review some classifiers [8], namely, Find Similar [13], Decision Trees [5], Naïve Bayes and Bayes Nets [14].
(1) Find Similar Classifier [13]: The Find Similar method is a variant of Rocchio's method for relevance feedback, which is often used to expand queries based on the user's relevance feedback. In text classification, Rocchio's method calculates the weight wt of a term t as follows:
w_t = α · w_t + β · ( Σ_{i∈pos} w_{t,i} / N_pos ) + γ · ( Σ_{i∈neg} w_{t,i} / N_neg ),    (3)
where wt denotes the weight of term t, Npos denotes the number of positive documents in the category, Nneg denotes the number of negative documents in the category, and α, β and γ are the adjusting parameters. The method finds the representative centroid of the positive documents of each category and classifies a new document by comparing it with the centroid of each category by using a specific similarity measure. In [8],
Dumais et al. let α = 0, β = 1 and γ = 0 and use the Jaccard similarity measure to calculate the degree of similarity between the centroid of each category and a testing document.
(2) Decision Trees Classifier [5]: A decision tree (DT) text classifier is a tree in which the internal nodes are labeled by terms, the branches are labeled by weights, and the leaves are labeled by categories. The classifier categorizes a testing document dj by recursively testing the weights of its terms (i.e., the internal nodes) until a leaf node is reached. The label of the leaf node is then assigned to dj. The main advantage of the decision tree method is that it is easily interpretable by humans.
(3) Naïve Bayes Classifier [12]: The Naïve Bayes (NB) classifier (also called the probabilistic classifier) is a popular approach for handling classification problems. It uses the joint probability of terms and categories to calculate the probability that the terms of a document belong to certain categories, and then applies Bayesian theory to calculate the probability of the document dj belonging to category ci:
P(d_j | c_i) = Π_{k=1}^{n} P(w_kj | c_i),    (4)

P(w_kj | c_i) = P(w_kj, c_i) / P(c_i),    (5)
where P(d_j | c_i) denotes the probability of document dj belonging to category ci, P(w_kj | c_i) denotes the probability of term tk of document dj belonging to category ci, and n denotes the number of terms belonging to document dj and category ci. The naïve part of the NB method is the assumption of term independence: this makes NB classifiers far more efficient than non-naïve Bayes methods because dependencies among terms do not need to be modeled.
(4) Bayes Nets Classifier [14]: In [14], Sahami utilizes a Bayesian network for classification, which relaxes the restrictive assumptions of the Naïve Bayes classifier. A 2-dependence Bayesian classifier allows each feature to be directly influenced by the appearance/non-appearance of at most two other features.
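To make the MI-based term selection of Eq. (1) concrete, the sketch below (ours, not the authors' code; the count estimates follow the definitions given after Eq. (1), and the simple corpus layout is an assumption for illustration) ranks terms per category and keeps the top K.

```python
import math
from collections import Counter

def mi_score(n_tc, n_c, n_t, n_all):
    """MI(t, c) of Eq. (1), using the count-based estimates defined after it."""
    if n_tc == 0:
        return 0.0
    p_tc = n_tc / n_c          # p(t, c) = N_C(t) / N_C
    p_t = n_t / n_all          # p(t)    = N(t) / N
    p_c = n_c / n_all          # p(c)    = N_C / N
    return p_tc * math.log2(p_tc / (p_t * p_c))

def select_terms(docs_by_category, k=300):
    """docs_by_category: {category: [list of token lists]}.
    Returns the top-k terms per category by MI score."""
    all_counts = Counter(t for docs in docs_by_category.values()
                         for doc in docs for t in doc)
    n_all = sum(all_counts.values())
    selected = {}
    for cat, docs in docs_by_category.items():
        cat_counts = Counter(t for doc in docs for t in doc)
        n_c = sum(cat_counts.values())
        scores = {t: mi_score(cat_counts[t], n_c, all_counts[t], n_all)
                  for t in cat_counts}
        selected[cat] = [t for t, _ in
                         sorted(scores.items(), key=lambda x: -x[1])[:k]]
    return selected
```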
3 A New Inductive Learning Method for Multilabel Text Categorization
In this section, we present a new inductive learning method for multilabel text categorization. The mutual information measure shown in Eq. (1) is used to select the top K terms that have the highest MI scores for a category. Assume there are N documents and K selected terms in category c. We use a K × N document descriptor matrix to represent the binary weights of the K selected terms in each document. A column in the document descriptor matrix is a document descriptor vector based on the K selected terms. For example, assume that there are 5 documents d1, d2, d3, d4, d5 and 4 selected terms t1, t2, t3, t4 in category c. Fig. 1 shows an example of a 4 × 5 document descriptor matrix A.
Each column of the document descriptor matrix A represents the document descriptor vector of a document. For example, from the second column of the document descriptor matrix A, we can see the document descriptor vector d2 of the document d2, where d2 = [1 0 1 0]^T. It indicates that the terms t1 and t3 appear in document d2 and the terms t2 and t4 do not appear in document d2.

           d1  d2  d3  d4  d5
    t1 [    1   1   0   1   0 ]
A = t2 [    1   0   1   0   0 ]
    t3 [    1   1   1   0   1 ]
    t4 [    1   0   1   0   1 ]

Fig. 1. Document descriptor matrix A
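For readers who want to experiment with the toy example, the snippet below simply writes down the matrix A of Fig. 1 in numpy and reads off the descriptor vector of d2; it adds nothing beyond the figure itself.

```python
import numpy as np

# Document descriptor matrix A of Fig. 1: rows are the selected terms
# t1..t4, columns are the documents d1..d5, entries are binary weights.
A = np.array([[1, 1, 0, 1, 0],   # t1
              [1, 0, 1, 0, 0],   # t2
              [1, 1, 1, 0, 1],   # t3
              [1, 0, 1, 0, 1]])  # t4

d2 = A[:, 1]                     # second column = descriptor vector of d2
print(d2)                        # -> [1 0 1 0]
```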
We use the “cosine similarity measure” [2] to construct a document-similarity matrix S, shown as follows:
S(i, j) = A(i) · A(j) / ( ||A(i)|| × ||A(j)|| ),    (6)
where the value of S(i, j) indicates the degree of similarity between document di and document dj, S(i, j) ∈ [0, 1], and A(i) and A(j) denote the ith and the jth column vectors of the document descriptor matrix A, respectively. For the document descriptor matrix A shown in Fig. 1, we can get its document-similarity matrix S, as shown in Fig. 2.

            d1      d2      d3      d4      d5
    d1 [     1     1/√2    √3/2     1/2    1/√2 ]
    d2 [   1/√2      1     1/√6    1/√2     1/2 ]
S = d3 [   √3/2    1/√6      1       0     2/√6 ]
    d4 [    1/2    1/√2      0       1       0  ]
    d5 [   1/√2     1/2    2/√6      0       1  ]

Fig. 2. Document-similarity matrix S
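As a check on Eq. (6) and Fig. 2, the following numpy sketch (ours, not part of the paper) computes the document-similarity matrix S column-wise from A; rounding its entries reproduces the values in Fig. 2, e.g. S(1, 2) = 1/√2 ≈ 0.71.

```python
import numpy as np

def cosine_similarity_matrix(A):
    """Eq. (6): S(i, j) = A(i).A(j) / (||A(i)|| * ||A(j)||),
    where A(i) is the i-th column (document descriptor vector) of A."""
    norms = np.linalg.norm(A, axis=0)          # column norms
    return (A.T @ A) / np.outer(norms, norms)

A = np.array([[1, 1, 0, 1, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 1, 0, 1],
              [1, 0, 1, 0, 1]], dtype=float)

S = cosine_similarity_matrix(A)
print(np.round(S, 2))   # matches Fig. 2, e.g. S[0, 1] = 0.71
```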
We can obtain the term-document relevance matrix R by applying the inner product of the document descriptor matrix A to the document-similarity matrix S, shown as follows:
R = A · S,    (7)
where the value of R(i, j) denotes the relevance degree of term ti with respect to document dj. Therefore, for the above example, we can get the term-document relevance matrix R, as shown in Fig. 3.
            d1     d2     d3     d4     d5
    t1 [   2.2    2.4    1.3    2.2    1.2 ]
R = t2 [   1.9    1.1    1.9    0.5    1.5 ]
    t3 [   3.3    2.6    3.1    1.2    3.0 ]
    t4 [   2.6    1.6    2.7    0.5    2.5 ]

Fig. 3. Term-document relevance matrix R
We use Eq. (8) to get the category descriptor vector v_c for category c,

v_c = R · 1,    (8)

where 1 = [1, 1, ..., 1]^T. Thus, for the above example, we can get v_c = R · 1 = R · [1 1 1 1 1]^T = [9.3 6.9 13.2 9.9]^T. Then, we use the weight-averaged method to normalize v_c. Thus, for the above example, v_c is normalized into [0.24 0.17 0.34 0.25]. Finally, we refine the weight v_ci of the ith term in the category descriptor vector v_c into w_ci to obtain the refined category descriptor vector w_c, where

w_ci = v_ci × log2( |C| / cf_i ),    (9)

w_ci denotes the refined weight of the ith term in the refined category descriptor vector w_c, |C| denotes the number of categories, and cf_i denotes the number of category descriptor vectors containing term ti. This refinement reduces the weights of the terms that appear in most of the categories and increases the weights of the terms that only appear in a few categories. Assume that the document descriptor vector of a testing document dnew is d_new. We can then apply the inner product to calculate the relevance score Score(c, dnew) of category c with respect to the testing document dnew as follows:

Score(c, dnew) = d_new · w_c.    (10)
We calculate the relevance score of each category with respect to dnew, rank these relevance scores, and then assign dnew to multiple categories according to Eq. (2). In other words, we choose the maximum relevance score L among them. If the relevance score between a category and the testing document divided by L is not less than a predefined threshold value λ, where λ∈[0, 1], then the document is classified into that category.
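The remaining training and classification steps, Eqs. (7)-(10) together with the thresholding rule of Eq. (2), can be summarized in a few lines. The sketch below is a simplified reading of the method, not the authors' Delphi implementation; in particular it assumes that all categories share one selected-term vocabulary so that the cf_i counts of Eq. (9) can be taken across category descriptor vectors of equal length.

```python
import numpy as np

def category_descriptor(A, S):
    """Eqs. (7)-(8): R = A.S, then v_c = R.1, normalized to sum to 1."""
    R = A @ S
    v = R @ np.ones(R.shape[1])
    return v / v.sum()

def refine(v_by_category):
    """Eq. (9): w_ci = v_ci * log2(|C| / cf_i), where cf_i is the number of
    category descriptor vectors in which term t_i has a non-zero weight."""
    V = np.array(list(v_by_category.values()))       # |C| x K matrix
    cf = np.count_nonzero(V > 0, axis=0)
    idf = np.log2(len(V) / np.maximum(cf, 1))
    return {c: v * idf for c, v in zip(v_by_category, V)}

def classify(d_new, w_by_category, lam=0.87):
    """Eqs. (10) and (2): score every category, then keep the categories whose
    score divided by the maximum score L is at least lambda."""
    scores = {c: float(d_new @ w) for c, w in w_by_category.items()}
    L = max(scores.values())
    return [c for c, s in scores.items() if L > 0 and s / L >= lam]
```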
4 Experimental Results
We have implemented the proposed multilabel text categorization method to classify the "Reuters-21578 Aptè split 10 categories" data set [21] using Delphi Version 5.0 on a Pentium 4 PC. The "Aptè split 10 categories" data set contains the 10 top-sized categories obtained from the "Reuters-21578 Aptè split" data set [20], where each category has at least 100 training documents for training a classifier. We chose the "Aptè split 10 categories" data set as our experimental data set because it accounts for 75% of the "Reuters-21578 Aptè split" data set. Table 1 shows the category names of "Aptè split 10 categories", the number of training samples for each category, and the number of testing samples for each category. In total, there are 6490 training documents and 2547 testing documents in the "Aptè split 10 categories" data set.

Table 1. The number of training and testing samples for each category of the "Aptè split 10 categories" data set

Category    Training samples    Testing samples
Earn              2877               1087
Acq               1650                719
Money-fx           538                179
Grain              433                149
Crude              389                189
Trade              369                118
Interest           347                131
Wheat              212                 71
Ship               197                 89
Corn               182                 56
Several evaluation criteria for dealing with classification problems have been used in text categorization [15], [16]. The most widely used measures are based on the definitions of precision and recall. If a sample is classified into a category, we call it "positive" with respect to that category. Otherwise, we call it "negative". In this paper, we use the following micro-averaging method [15] to evaluate the recall and the precision of the proposed method:

Recall = Σ_{i=1}^{|C|} TP_i / ( Σ_{i=1}^{|C|} TP_i + Σ_{i=1}^{|C|} FN_i ),    (11)

Precision = Σ_{i=1}^{|C|} TP_i / ( Σ_{i=1}^{|C|} TP_i + Σ_{i=1}^{|C|} FP_i ),    (12)
where TP_i denotes the number of positive samples correctly classified for category ci, FN_i denotes the number of positive samples incorrectly classified as negative for category ci, FP_i denotes the number of negative samples incorrectly classified as positive for category ci, and |C| denotes the number of categories. If the values of the precision and the recall of a classifier can be tuned to the same value, then that value is called the break-even point (BEP) of the system [12]. The BEP has been widely used in text categorization evaluations. If the values of the precision and the recall are not exactly equal, we use the average of the nearest precision and recall values as the BEP. Based on the mutual information measure [11] for term selection, we select the top 300 terms for training classifiers. Table 2 compares the break-even point of the proposed method with those of four existing learning methods [8], namely, Find Similar, Decision Trees, Naïve Bayes and Bayes Nets. From Table 2, we can see that the proposed method gets a higher average accuracy than the existing methods.

Table 2. Break-even performance for the Reuters-21578 Aptè split 10 categories
Category     Find Similar   Naïve Bayes   Bayes Nets   Decision Trees   The Proposed Method (λ = 0.87)
Earn             92.9 %         95.9 %        95.8 %        97.8 %              97.5 %
Acq              64.7 %         87.8 %        88.3 %        89.7 %              95.1 %
Money-fx         46.7 %         56.6 %        58.8 %        66.2 %              79.2 %
Grain            67.5 %         78.8 %        81.4 %        85.0 %              84.7 %
Crude            70.1 %         79.5 %        79.6 %        85.0 %              84.4 %
Trade            65.1 %         63.9 %        69.0 %        72.5 %              85.0 %
Interest         63.4 %         64.9 %        71.3 %        67.1 %              81.0 %
Ship             49.2 %         85.4 %        84.4 %        74.2 %              85.4 %
Wheat            68.9 %         69.7 %        82.7 %        92.5 %              79.8 %
Corn             48.2 %         65.3 %        76.4 %        91.8 %              78.2 %
Average          64.6 %         81.5 %        85.0 %        88.4 %              91.3 %
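For completeness, the micro-averaged measures of Eqs. (11)-(12) reduce to sums of per-category counts; the snippet below (with made-up counts, purely illustrative) shows the calculation and the averaged value used as the BEP when precision and recall are only approximately equal.

```python
def micro_averaged(tp, fn, fp):
    """Eqs. (11)-(12): micro-averaged recall and precision from per-category
    true-positive, false-negative and false-positive counts."""
    recall = sum(tp) / (sum(tp) + sum(fn))
    precision = sum(tp) / (sum(tp) + sum(fp))
    return recall, precision

# Hypothetical per-category counts, just to show the calculation.
tp, fn, fp = [90, 40], [10, 20], [15, 5]
r, p = micro_averaged(tp, fn, fp)
bep = (r + p) / 2   # taken as the break-even point when r and p are close
print(r, p, bep)
```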
5 Conclusions In this paper, we have presented a new inductive learning method for multilabel text categorization. The proposed method uses a mutual information measure for term selection and constructs document descriptor vectors for each category based on the selected terms. These document descriptor vectors form a document descriptor matrix and they are also used to construct a document-similarity matrix based on the cosine similarity measure. It then constructs a term-document relevance matrix by applying the inner product of the document descriptor matrix to the document-similarity matrix. The proposed method infers the degree of relevance of the selected terms to construct the category descriptor vector of each category. The relevance score between each category and a testing document is calculated by applying the inner product of
its category descriptor vector to the document descriptor vector of the testing document. The maximum relevance score L is then chosen. If the relevance score between a category and the testing document divided by L is not less than a predefined threshold value λ, where λ∈[0, 1], then the document is classified into that category. From the experimental results shown in Table 2, we can see that the proposed method gets a higher average accuracy than the existing methods.
Acknowledgements This work was supported in part by the National Science Council, Republic of China, under Grant NSC 94-2213-E-011-003.
References
[1] Aptè, C., Damerau, F.J., Weiss, S.M.: Automatic Learning of Decision Rules for Text Categorization. ACM Transactions on Information Systems 1 (1997) 233−251
[2] Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. ACM Press, New York (1999)
[3] Bekkerman, R., Ran, E.Y., Tishby, N., Winter, Y.: Distributional Word Clusters vs. Words for Text Categorization. Journal of Machine Learning Research 3 (2003) 1183−1208
[4] Caropreso, M.F., Matwin, S., Sebastiani, F.: A Learner-Independent Evaluation of the Usefulness of Statistical Phrases for Automated Text Categorization. In: Chin, A.G. (ed.): Text Databases and Document Management: Theory and Practice. Idea Group Publishing, Hershey, PA (2001) 78−102
[5] Chickering, D., Heckerman, D., Meek, C.: A Bayesian Approach for Learning Bayesian Networks with Local Structure. Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann, San Francisco, California (1997) 80−89
[6] Cohen, W.W., Singer, Y.: Context-Sensitive Learning Methods for Text Categorization. ACM Transactions on Information Systems 17 (1999) 141−173
[7] Dhillon, I.S., Mallela, S., Kumar, R.: A Divisive Information-Theoretic Feature Clustering Algorithm for Text Classification. Journal of Machine Learning Research 3 (2003) 1265−1287
[8] Dumais, S.T., Platt, J., Heckerman, D., Sahami, M.: Inductive Learning Algorithms and Representations for Text Categorization. Proceedings of CIKM-98, 7th ACM International Conference on Information and Knowledge Management. Bethesda, MD (1998) 148−155
[9] Fuhr, N., Buckley, C.: A Probabilistic Learning Approach for Document Indexing. ACM Transactions on Information Systems 9 (1991) 223−248
[10] Fuhr, N., Pfeifer, U.: Probabilistic Information Retrieval as Combination of Abstraction, Inductive Learning and Probabilistic Assumptions. ACM Transactions on Information Systems 12 (1994) 92−115
[11] Hankerson, D., Harris, G.A., Johnson, P.D., Jr.: Introduction to Information Theory and Data Compression. CRC Press, Boca Raton, Florida (1998)
[12] Lewis, D.D., Ringuette, M.: Comparison of Two Learning Algorithms for Text Categorization. Proceedings of the Third Annual Symposium on Document Analysis and Information Retrieval (1994) 81−93
[13] Rocchio, J.J.: Relevance Feedback in Information Retrieval. In: Salton, G. (ed.): The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice Hall, New Jersey (1971) 313−323
[14] Sahami, M.: Learning Limited Dependence Bayesian Classifiers. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining. AAAI Press (1996) 335−338
[15] Sebastiani, F.: Machine Learning in Automated Text Categorization. ACM Computing Surveys 34 (2002) 1−47
[16] Yang, Y.: An Evaluation of Statistical Approaches to Text Categorization. Information Retrieval 1 (1999) 69−90
[17] Yang, Y., Chute, C.G.: An Example-based Mapping Method for Text Categorization and Retrieval. ACM Transactions on Information Systems 12 (1994) 252−277
[18] Yang, Y., Liu, X.: A Re-examination of Text Categorization Methods. Proceedings of SIGIR-99, 22nd ACM International Conference on Research and Development in Information Retrieval. Berkeley, California (1999) 42−49
[19] Yang, Y., Pedersen, J.O.: A Comparative Study on Feature Selection in Text Categorization. Proceedings of ICML-97, 14th International Conference on Machine Learning. Nashville, TN (1997) 412−420
[20] Reuters-21578 Aptè split data set, http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html
[21] Reuters-21578 Aptè split 10 categories data set, http://ai-nlp.info.uniroma2.it/moschitti/corpora.htm
An Intelligent Customer Retention System

Bong-Horng Chu 1,3, Kai-Chung Hsiao 2, and Cheng-Seen Ho 2,4

1 Department of Electronic Engineering, National Taiwan University of Science and Technology, 43 Keelung Road Sec. 4, Taipei 106, Taiwan
[email protected]
2 Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology, 43 Keelung Road Sec. 4, Taipei 106, Taiwan
[email protected]
3 Telecommunication Laboratories, Chunghwa Telecom Co., Ltd., 11 Lane 74 Hsin-Yi Road Sec. 4, Taipei 106, Taiwan
[email protected]
4 Department of Electronic Engineering, Hwa Hsia Institute of Technology, 111 Gong Jhuan Road, Chung Ho, Taipei 235, Taiwan
[email protected]
Abstract. This paper proposes an intelligent system for handling the customer retention task, which is becoming increasingly important owing to keen competition among companies in many modern industries. Taking the wireless telecommunications industry as the research target, our system first learns an optimized churn predictive model from a historical services database with a decision-tree-based technique to support the prediction of the defection probability of customers. We then construct a retention policy model that maps clusters of churn attributes to retention policies structured in a retention ontology. The retention policy model supports the automatic proposal of suitable retention policies to retain a possible churner, provided that he or she is a valuable subscriber. Our experiment shows that the learned churn predictive model achieves around 85% accuracy in tenfold cross-validation, and a preliminary test on proposing suitable package plans shows that the retention policy model works as well as a commercial website. The fact that our system can automatically propose proper retention policies for possible churners according to their specific characteristics is new and important in customer retention study.
1 Introduction
To build customer loyalty and maximize profitability, intelligent techniques for predicting customer churning behaviors are becoming more than necessary in customer relationship management (CRM), especially in the highly competitive wireless telecommunications industry. A study by KPMG about customer defection in the UK reported that 44 percent of UK consumers changed at least one of their key product or service suppliers in 2004 [1]. Analysts have estimated that churning costs wireless service providers in North America and Europe more than four billion US dollars a year [2]. In addition, analysts agree that acquiring a new customer is many times more expensive than retaining an existing one.
Besides predicting customer defection, proposing appropriate retention policies to retain profitable churners is another significant issue. So far, most researchers have focused on analyzing customers' behavior and predicting who is likely to churn [3], [4], [5]. There has been, however, no specific research on how policies can be made to fit individual churners. Since the reasons for churn differ from person to person, so should the retention policies; by analyzing the specific characteristics of each churner, we should be able to provide a suitable policy to retain him or her. In this paper, we describe an intelligent system for customer retention that suggests appropriate retention policies for possible churners according to their specific characteristics. In addition to predicting churners, our system deals with customer retention by first constructing an ontology, which contains comprehensive retention policies of various incentives and recommendations to deal with the most likely causes of customer defection. It then digs out hidden correlations among the attribute values of the churners, trying to discover how retention policies are related to churn attribute clusters built upon the hidden correlations, which often reveal why a customer defects. This knowledge is then used to automatically propose retention policies for valuable churners. Note that the retention policy ontology we constructed not only supports the construction of specific retention policies but also helps general retention policy design and analysis.
2 Retention Policy Ontology and Historical Services Database
Our system runs in two modes: the learning mode and the application mode. In the learning mode, it learns a churn predictive model and constructs a retention policy model; in the application mode, it uses the above models to predict whether a customer defects and to propose proper retention policies to retain a potential, valuable churner. Before we further describe how each mode works, we first introduce the construction of our retention policy ontology and the structure of the exemplified historical customer services database.
2.1 Retention Policy Ontology
The retention policy ontology is the key component in our system [6], [7]. To develop the domain ontology, we surveyed a variety of CRM policies across many industries and academic papers, from which we analyzed the reasons for subscriber defection, the categories of retention policies, and the potential meaning of each policy [2], [8], [9], [10]. Based on these professional suggestions, we collected comprehensive retention policies for most types of possible churners. There are, however, two issues we have to cope with regarding these collected retention policies. First, they may conflict with one another: each policy has its own specific function, but a combination might not always benefit a churner because of, e.g., mutual exclusion. The second issue is the grouping of policies: policies with similar properties should be grouped into the same class to facilitate their usage. We can solve these two issues by constructing a retention policy ontology that completely categorizes all the retention policies into classes and clearly specifies the
conflicts between them. The ontology is designed to contain five main categories. The Money category defines the policies with respect to various fees, the Service category specifies the policies associated with value-added services the companies provide, the Goods category covers promotion plans that are associated with limited-edition souvenirs or discounted handsets, the Contact category lists promising channels for sustaining better communication with subscribers, and the Quality category contains the policies for improving quality of service, which involves the attitudes of customer service representatives, the efficiency of engineering problem-solving, and so forth. Fig. 1 illustrates the detailed subcategories and the corresponding retention policies of the Money category as an example. Note that we further partition the Money category into eight different policy clusters, each containing a number of retention policies. A "conflict" relationship is associated with the cluster of Monthly rental policies in the figure, which means no two policies of the cluster can be recommended to a possible churner at the same time. Also, an "alternative" relationship links the cluster of Message fees policies to that of Monthly rental policies. It means the policies in either cluster can be proposed for a possible churner, but not both can be chosen at a time. We used Protégé 3.0, which was developed at Stanford University [11], to build this ontology.
Fig. 1. Part of retention policy ontology about Money category
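Although the actual ontology is built in Protégé, its use at policy-proposal time can be pictured with a plain data structure. The sketch below is only an illustration with invented cluster and policy names: each cluster records its policies, whether it is a "conflict" cluster, and which clusters it is an "alternative" to, and a small check rejects policy sets that violate these relationships.

```python
# Illustrative stand-in for the ontology (not the authors' Protégé model).
policy_clusters = {
    "monthly_rental": {"policies": ["waive one month rental",
                                    "reduce monthly rental"],
                       "conflict": True,           # at most one per churner
                       "alternative_to": []},
    "message_fees":   {"policies": ["discount message fees"],
                       "conflict": False,
                       "alternative_to": ["monthly_rental"]},
}

def violates_relations(proposed, clusters=policy_clusters):
    """Return True if the proposed policy set breaks a conflict or
    alternative relationship defined in the cluster table."""
    used = {name: [p for p in proposed if p in c["policies"]]
            for name, c in clusters.items()}
    for name, c in clusters.items():
        if c["conflict"] and len(used[name]) > 1:
            return True      # two policies from the same "conflict" cluster
        if used[name] and any(used[a] for a in c["alternative_to"]):
            return True      # both sides of an "alternative" link were used
    return False
```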
2.2 Historical Services Database
The historical customer services database of a major telecommunication company in Taiwan is used as our example database. It spans six months and contains information about 63,367 subscribers, each described by hundreds of attributes. Among these, we identify 13 attributes that we consider most significant for our work. We enumerate the value range of each attribute as shown in Table 1. The following is an example subscriber record: Subscriber A: '15', '8', '3', '2', '4', '7', '1', '3', '0', '0', '1', '2', '2'. Note that some attributes in our database are numerical, including Tenure, Average invoice, and Average domestic invoice. They cannot be directly used by the decision
tree algorithm. We need to divide the range of each such numerical attribute into several intervals whose boundaries are determined according to the subscriber distribution: a cut point is chosen and a boundary is created when the slope of the subscriber distribution curve makes an apparent change.

Table 1. Attributes and their enumerated values in the historical services database

Zip code: 0 (Taipei city), 1 (Keelung), 2 (Taipei county), 3 (Ilan), 4 (Hsinchu city and country), 5 (Taoyuan), 6 (Miaoli), 7 (Taichung city), 8 (Taichung country), 9 (Changhua), 10 (Nantou), 11 (Chiai city and country), 12 (Yunlin), 13 (Tainan city), 14 (Tainan country), 15 (Kaohsiung city), 16 (Kaohsiung country), 17 (Penghu), 18 (Kinmen), 19 (Pingtung), 20 (Taitung), and 21 (Hualien).
Industry code: 0 (Public), 1 (Business), 2 (Manufacture), 3 (Finance), 4 (Communication), 5 (Department Store), 6 (Social Service), 7 (Farming and Fishing), 8 (Individual), and 9 (Other).
Service center: 0 (Site1), 1 (Site2), 2 (Site3), 3 (Site4), 4 (Site5), 5 (Site6), and 6 (Site7).
Tenure: 0 (0~4), 1 (5~13), 2 (14~25), 3 (26~37), 4 (38~61), and 5 (over 62 months).
Discount type: 0 (Government), 1 (Guaranty Free), 2 (Prepaid Card), 3 (Normal), 4 (Enterprise), 5 (Employee), 6 (Military), 7 (Official), 8 (Public Servant), 9 (Alliance), and 10 (Subordinate).
Dealer ID: the ID number (11 dealers).
Gender: 0 (Company), 1 (Male), and 2 (Female).
Package code: 0 (Standard), 1 (Economy), 2 (Prepay), 3 (Ultra Low Rate), 4 (Base Rate), 5 (Base Rate plus 100NT), 6 (Base Rate plus 400NT), 7 (Base Rate plus 800NT), 8 (Base Rate plus 1500NT), and 9 (Special Low Rate).
Stop-use: 0~9 times that customers actively ask the telecommunication company to temporarily suspend telephone service.
Re-use: 0~7 times of service reopening.
Disconnect: 0~2 times that the telecommunication company actively disconnects the subscriber's telephone call for some reason.
Average invoice: 0 (0~100), 1 (101~200), 2 (201~500), 3 (501~1000), 4 (1001~2000), 5 (2001~3000), and 6 (over 3001 NT dollars).
Average domestic invoice: 0 (0~100), 1 (101~250), 2 (251~400), 3 (401~600), 4 (601~1000), 5 (1001~1500), 6 (1501~2000), 7 (2001~3000), and 8 (over 3000 NT dollars).
3 The Learning Mode
Fig. 2 shows the flow diagram of the learning mode. The Churn Model Learner learns the churn predictive model, which can decide subscribers' loyalty from the historical services database.
Fig. 2. Flow diagram of the learning mode
The Churn Attribute Cluster Constructor discovers hidden correlations among the attribute values of the same database to form the churn attribute cluster model. Based upon this model and the retention policy ontology, the Policy Model Constructor is able to construct the retention policy model so as to recommend appropriate retention policies for possible churners.
3.1 Churn Model Learner
The Churn Model Learner runs the C4.5 algorithm [12], [13] on the historical services database to build a decision-tree-represented churn predictive model. Since C4.5 does not guarantee that the created decision tree is optimal, we use a greedy method [14] to optimize the churn predictive model by deleting one attribute at a time and checking whether the accuracy improves. This process repeats until no more improvement can be obtained. The remaining attributes are used as the features for churn prediction. Our experiment showed that C4.5 originally produced a decision tree that contained all 13 attributes listed in Table 1. The follow-up optimization process stopped at the fourth repetition, which produced no further improvement, so nothing more was removed. The previous three repetitions removed Zip code, Industry code, and Gender, respectively. The final churn predictive model thus contained only ten attributes. To our surprise, all the removed attributes belong to the demographic attributes. We therefore conjecture that demographic-related attributes may not be crucial in deciding subscriber defection in our exemplified database. Fig. 3 shows part of the constructed churn predictive model.
Fig. 3. Churn predictive model (partial)
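A minimal sketch of this greedy attribute-elimination loop is shown below. It is not the authors' code: scikit-learn's CART-style DecisionTreeClassifier stands in for C4.5, the attributes are assumed to be already encoded as integers as in Table 1, and tenfold cross-validated accuracy is the score being optimized.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def greedy_attribute_elimination(X, y, feature_names, cv=10):
    """One repetition = try dropping each remaining attribute and keep the
    single drop that most improves cross-validated accuracy; repeat until
    no drop helps (a sketch of the greedy optimization described above)."""
    keep = list(range(X.shape[1]))
    best = cross_val_score(DecisionTreeClassifier(random_state=0),
                           X[:, keep], y, cv=cv).mean()
    while len(keep) > 1:
        candidates = []
        for j in keep:
            trial = [k for k in keep if k != j]
            acc = cross_val_score(DecisionTreeClassifier(random_state=0),
                                  X[:, trial], y, cv=cv).mean()
            candidates.append((acc, j))
        acc, j = max(candidates)
        if acc <= best:                       # no further improvement: stop
            break
        best, keep = acc, [k for k in keep if k != j]
    return [feature_names[k] for k in keep], best
```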
3.2 Churn Attribute Cluster Constructor
Designing suitable retention policies for possible churners is not an easy task for humans. To do it automatically is even harder. One naïve idea is to provide possible churners with policies according to each single attribute value. It, however, may miss hidden but significant interactions among attribute values, which can play a key role in explaining why subscribers defect. Our first task is thus to discover how attributes are associated with one another by the association rule mining technique. From the
historical services database of all churners, we use the Apriori algorithm [15] to mine the association rules, which are in the form of: Tenure=0 Average invoice=2 2480 ==> Package code=4 2363 conf: (0.95). This sample rule means that there are 2,480 churners whose Tenure is 0~4 months (code 0) and whose Average invoice is 201~500 NT dollars (code 2), and among whom 2,363 churn because the Package code is 4. The confidence of this rule is 0.95. In our experiment, we discovered 12 association rules with the minimum support set to 0.1 and the minimum confidence set to 0.8. All of the attributes involved in a single rule have to be considered at the same time when deciding suitable retention policies. In other words, we need to group the antecedents and consequents of a rule into a cluster and define one or several retention policies for it accordingly. Fig. 4 shows the constructed churn attribute clusters corresponding to these rules. Note that some rules, like Rules 7 and 11, fall into the same group because they contain the same attributes after combination. In the figure, some clusters are isolated while others are related by the "isa" relationship and form a class hierarchy. The complete churn attribute cluster model was similarly constructed from 300 association rules, with support from 0.1 down to 0.01 and fixed confidence = 0.8, as partially shown in Fig. 5. This hierarchical model allows us to easily set corresponding policies for two groups that are related by "isa". Note that each cluster in the figure is marked with a support value, which is used for conflict resolution. The support values are created during the association rule mining. Specifically, we executed the Apriori algorithm ten times, starting from support = 0.1 down to support = 0.01 with a 0.01 decrement each time. A lower minimum support usually generates a greater number of large itemsets, which in turn causes more rules to be created. By comparing the current set of rules with those generated last time, we can associate the newly generated rules with the current minimum support value. This support-value-based distinction between the rules allows us to resolve any possible conflicts between two or more clusters of policies by preferring rules with higher supports during conflict resolution.
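The rule-mining step can be pictured with the simplified sketch below. It is not Weka's or the authors' Apriori implementation; it naively enumerates small itemsets over "attribute=value" items, but it produces rules of the same form as the sample rule above, and calling it repeatedly with the minimum support stepped from 0.1 down to 0.01 mimics the support tagging used for conflict resolution.

```python
from itertools import combinations

def mine_rules(churners, min_support=0.1, min_conf=0.8, max_len=3):
    """Mine 'attribute=value' association rules from churner records
    (a list of dicts). Itemsets are capped at max_len items, so this is a
    simplified sketch rather than a full Apriori implementation."""
    n = len(churners)
    transactions = [frozenset(f"{a}={v}" for a, v in rec.items())
                    for rec in churners]

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    items = sorted({i for t in transactions for i in t})
    rules = []
    for size in range(2, max_len + 1):
        for itemset in combinations(items, size):
            s = support(frozenset(itemset))
            if s < min_support:
                continue
            for k in range(1, size):
                for antecedent in combinations(itemset, k):
                    conf = s / support(frozenset(antecedent))
                    if conf >= min_conf:
                        consequent = tuple(i for i in itemset
                                           if i not in antecedent)
                        rules.append((antecedent, consequent, s, conf))
    return rules
```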
Fig. 4. Churn attribute clusters corresponding to 12 association rules
Fig. 5. Part of churn attribute cluster model
There is one issue left: some attribute values cannot be properly clustered with other attribute values. This brings in a closely related issue: none of the attribute values of a possible churner may match any pre-clustered group of the churn attribute cluster model. Our solution to both issues is to define proper nominal policies for each attribute value that cannot be properly pre-clustered, so that we can provide a nominal retention policy for a churner whose major cause of defection is difficult to find, even by an expert.
3.3 Policy Model Constructor
Fig. 6 shows how we construct the retention policy model. First, the attribute values analysis module analyzes the churn attribute cluster model and the attributes that do not appear in the churn attribute cluster model in order to derive any possible explanations. The mapping knowledge construction module then follows the explanations to construct the mappings between the retention policies and each churn attribute cluster or singleton attribute. The policy conflict eliminator finally removes all conflicts from the mappings according to the support values. We take Fig. 5 as an example to demonstrate how we construct the retention policy model, as shown in Table 2 for its churn attribute clusters.
Fig. 6. Retention policy model construction
Table 2. Construction of the retention policy model
Cluster 1 (Tenure=0, Package code=4).
Explanation: The duration of use is less than 4 months, and the package plan is Base Rate, which is a normal plan. So we recommend two policies.
Proposed retention policies: provide all kinds of discountable tickets regularly; announce the newest discountable plans regularly.

Cluster 2 (Zip code=15, Tenure=0, Package code=4).
Explanation: This cluster looks similar to Cluster 1 except that Zip code shows people living in Kaohsiung city tend to churn. We conjecture that the condition might happen because of the efficiency of trouble solving. So we add a new policy.
Proposed retention policies: solve the troubles reported by the subscribers immediately and report to them as soon as the troubles are cleared; provide all kinds of discountable tickets regularly; announce the newest discountable plans regularly.

Cluster 3 (Tenure=0, Average domestic invoice=0, Package code=4).
Explanation: The association between Average domestic invoice=0 and Package code=4 reveals no new information. So we still recommend policies according to the value of Tenure.
Proposed retention policies: provide all kinds of discountable tickets regularly; announce the newest discountable plans regularly.

Cluster 4 (Tenure=0, Average invoice=2, Package code=4).
Explanation: In this cluster, Average invoice (code=2) is higher than Package plan (code=4). We thus add a new policy of changing package plans.
Proposed retention policies: change package plan to "Base Rate Plus 100"; provide all kinds of discountable tickets regularly; announce the newest discountable plans regularly.

Cluster 5 (Service center=5, Tenure=0, Package code=4, Average domestic invoice=0).
Explanation: This cluster is the superclass of Cluster 3 with one more attribute value: Service center=5. It implies that some services might be unsatisfactory in service center 5. As a result, we add a new policy.
Proposed retention policies: enhance the problem-solving ability of the engineers; provide all kinds of discountable tickets regularly; announce the newest discountable plans regularly.

Cluster 6 (Service center=5, Tenure=0, Gender=1, Package code=4).
Explanation: This cluster is the superclass of Cluster 1 with two more attribute values: Service center=5 and Gender=1. Unfortunately, it seems that no new information can be discovered from these two attribute values. So we recommend the same policies as for Cluster 1.
Proposed retention policies: provide all kinds of discountable tickets regularly; announce the newest discountable plans regularly.

Cluster 7 (Tenure=0, Gender=1, Average invoice=2, Package code=4).
Explanation: This cluster is the superclass of Cluster 4 along with one more attribute value: Gender=1, which provides no specific information for policy making. We thus recommend the same policies as for Cluster 4.
Proposed retention policies: change package plan to "Base Rate Plus 100"; provide all kinds of discountable tickets regularly; announce the newest discountable plans regularly.

Cluster 8 (Tenure=0, Average invoice=2, Average domestic invoice=0, Package code=4).
Explanation: This cluster is the superclass of Cluster 4 with one more attribute: Average domestic invoice=0. Taking the values of Average invoice and Average domestic invoice into account, the ratio of the average monthly domestic fees to the average monthly fees is lower than 50 percent. We thus add a new policy.
Proposed retention policies: select an appropriate message package plan to suit personal usage; provide all kinds of discountable tickets regularly; announce the newest discountable plans regularly.
4 The Application Mode
Fig. 7 shows how the application mode works. The Churner Predictor uses the trained churn predictive model to decide the churn possibility of a given subscriber. If the subscriber tends to churn, his/her data will be sent to the Lifetime Customer Value Discriminator, which decides whether he/she is valuable enough to retain. The lifetime value of a possible churner is calculated by Eq. (1):

Lifetime_Customer_Value = (1 / Churn_Score) × Average_Invoice,    (1)
where Churn_Score is the churn rate predicted by the churn predictive model and Average_Invoice is the average monthly invoice of the subscriber. If the calculated value is higher than a default lifetime customer value, a suitable set of retention policies will be proposed by the Policy Proposer, which first allocates the churner to some n (n ≥ 0) churn attribute clusters according to his or her attribute values. The module then resorts to the retention policy model to retrieve and combine the corresponding retention policies. Note that the retention policy ontology is checked at the same time to see whether any conflicts exist among these policies. Only when this check is passed will the combined retention policies be proposed for the possible churner.
Fig. 7. Flow diagram of the application mode
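The application-mode flow of Fig. 7 can be summarized as in the sketch below. It is only an outline under assumed interfaces: churn_score is taken from the learned churn predictive model, retention_policy_model is a list of (attribute-value set, policy list) pairs like those in Table 2, and the lifetime-value cutoff is a placeholder rather than a value from the paper.

```python
def propose_retention(subscriber, churn_score, retention_policy_model,
                      nominal_policies, lcv_threshold=1000.0):
    """Application-mode sketch: Eq. (1) lifetime value, then cluster matching.
    subscriber = {"average_invoice": float, "attributes": {name: code}}."""
    lifetime_value = (1.0 / max(churn_score, 1e-6)) * subscriber["average_invoice"]
    if lifetime_value < lcv_threshold:
        return []                                   # not worth retaining

    items = {f"{a}={v}" for a, v in subscriber["attributes"].items()}
    policies = []
    for cluster_items, cluster_policies in retention_policy_model:
        if cluster_items <= items:                  # churner matches this cluster
            policies.extend(cluster_policies)
    if not policies:                                # fall back to nominal policies
        policies = [p for i in items for p in nominal_policies.get(i, [])]
    return policies
```

A final conflict check against the ontology (e.g. the violates_relations helper sketched in Section 2.1) would be applied before the combined policies are actually proposed.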
5 Evaluation
In Table 3 we show the results of tenfold cross-validation on the optimized churn predictive model (Fig. 3). Out of 63,367 subscribers, which contain 24,774 churn instances and 38,593 non-churn instances, 17,668 instances were correctly classified as positive (true positives), and 35,692 instances were correctly classified as negative (true negatives). There were 7,106 instances incorrectly classified as negative (false negatives) and 2,901 instances incorrectly classified as positive (false positives). On average, the accuracy of prediction is (17668 + 35692) / 63367 = 84.21%.
Table 3. Accuracy of the optimized churn predictive model

Actual class    Predicted +          Predicted −          Total
+               17,668 (27.88%)       7,106 (11.21%)      24,774
−                2,901 (4.57%)       35,692 (56.34%)      38,593
Total                                                     63,367
As for the evaluation of policy proposing, we found that a telecommunication company provides a website to help subscribers calculate which package plans fit them best [10], and it can be used to partially evaluate how well the automatic policy proposing mechanism in our system does. The input data of the website include the total call minutes per month, the intra-carrier call minutes per month, and the cross-carrier call minutes per month. After normalizing the definition of each package plan of the website to make it the same as ours, we randomly selected 10 churner instances from the historical services database for evaluation. The attribute values of the 10 instances are shown in Table 4. The results of the package plans proposed by the website and our system are shown in Table 5. It shows only case 9 results in different package plans. This divergence is because of the bad correspondence between the package plan selected by customer 9 and his calling minutes. Note that package code 0 stands for the "standard" payment
package, which has a fixed high monthly rental fee, from which the calling charge is not deductible, even though its per-minute calling charge is low. It is suitable for high-volume traffic customers. In that case, no divergence will happen and our system works well, e.g., cases 1, 7, and 8. A low-volume traffic user, e.g., case 9, however, will end up with a very high average invoice and causes divergence. This divergence cannot be tackled by our system for now, because the Policy Proposer currently only considers deductible package plans.

Table 4. The values of attributes of the testing churner data

ID   Package code   Monthly usage minutes   Within-carrier counts   Cross-carrier counts   Average invoice
1         0                 198                 147 (65%)                80 (35%)                4
2         6                 169                  84 (30%)               199 (70%)                4
3         8                 545                1164 (77%)               352 (23%)                6
4         1                 639                 106 (30%)               253 (70%)                5
5         4                 150                  76 (19%)               327 (81%)                4
6         5                  90                 278 (90%)                31 (10%)                2
7         0                 373                 188 (39%)               298 (61%)                5
8         0                 220                  58 (20%)               238 (80%)                4
9         0                  55                 164 (100%)                0 (0%)                 3
10        1                  42                  14 (26%)                40 (74%)                2
Table 5. Comparison of package plans proposed by our system and the website

Customer ID                                1   2   3   4   5   6   7   8   9   10
Package plan recommended by our system     7   7   8   8   7   5   8   7   6    4
Package plan recommended by the website    7   7   8   8   7   5   8   7   4    5
6 Conclusions
In this paper we described an intelligent customer retention system which works in two modes. In the learning mode, it learns about potential associations inside the historical services database to construct a churn predictive model. It also creates a churn attribute cluster model by exploiting the correlations among the attribute values of all churners in the historical services database. The churn attribute cluster model, with the support of the retention policy ontology, allows us to create a retention policy model that maps appropriate retention policies to the clusters. In the application mode, it uses the churn predictive model to calculate the churn probability of a given subscriber. If the subscriber is decided to be a potential churner, it goes one step further to evaluate whether he is valuable to retain. A valuable churner will invoke the system to propose proper retention policies according to the retention policy model. Our experimental results show that the churn predictive model, containing only ten attributes, can reach a high degree of prediction accuracy. A preliminary test on proposing package plans also shows that the system works as well as a commercial website. The attributes we used to construct the churn predictive model contain much information about the subscribers' background and usage behavior, which explains why we can decide customer defection objectively and correctly. More factors may also strongly affect the propensities of subscribers. For example, subscriber satisfaction can explain why poor services or inefficient problem solving make subscribers lose their confidence in the carriers. Frequently checking the complaints of subscribers and figuring out what subscribers really want are
indispensable to promote the brand image. Subscriber alliance is another example: the defection of a subscriber to another carrier is likely to cause a "snow-ball effect", which means that he/she may influence his/her business alliances to follow. A more intricate investigation of the complicated relationships between subscribers is necessary to prevent this from happening.
Acknowledgement This paper was partly supported by the National Science Council, Taiwan, R.O.C, under grant NSC 94-2213-E-228-006.
References
1. Oceansblue: Accurate Customer Churn Prediction with Information Inventor. Available at: http://www.oceansblue.co.uk/resources/iichurn1.pdf
2. Mozer, M.C., Wolniewicz, R., Grimes, D.B.: Predicting Subscriber's Dissatisfaction and Improving Retention in the Wireless Telecommunications Industry. IEEE Transactions on Neural Networks, Vol. 11, 3 (2000) 690-696
3. Wei, C., Chiu, I.: Turning Telecommunications Call Details to Churn Prediction: A Data Mining Approach. Expert Systems with Applications, Vol. 23, 2 (2002) 103-112
4. Au, W.H., Chan, K.C.C., Yao, X.: A Novel Evolutionary Data Mining Algorithm With Applications to Churn Prediction. IEEE Transactions on Evolutionary Computation, Vol. 7, 6 (2003) 532-545
5. Yan, L., Wolniewicz, R.H., Dodier, R.: Predicting Customer Behavior in Telecommunications. IEEE Intelligent Systems, Vol. 19, 2 (2004) 50-58
6. Uschold, M., Gruninger, M.: Ontologies: Principles, Methods and Applications. The Knowledge Engineering Review, Vol. 11, 2 (1996) 93-136
7. Fensel, D.: Ontologies: A Silver Bullet for Knowledge Management and Electronic Commerce. Springer (2001)
8. Gerpott, T.J., Rambs, W., Schindler, A.: Customer Retention, Loyalty, and Satisfaction in the German Mobile Cellular Telecommunications Market. Telecommunications Policy, Vol. 25 (2001) 249-269
9. Kim, H.S., Yoon, C.H.: Determinants of Subscriber Churn and Customer Loyalty in the Korean Mobile Telephony Market. Telecommunications Policy, Vol. 28, Issue 9-10 (2004) 751-765
10. Chunghwa Telecom: Best Package Plan Calculation. Available at: http://www.cht.com.tw
11. Stanford Medical Informatics: The Protégé Ontology Editor and Knowledge Acquisition System. Available at: http://protege.stanford.edu
12. Quinlan, J.R.: Induction of Decision Trees. Machine Learning, Vol. 1 (1983) 81-106
13. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA (1993)
14. Dehne, F., Eavis, T., Rau-Chaplin, A.: Computing Partial Data Cubes for Parallel Data Warehousing Applications. In: Proc. of Euro PVM/MPI 01, Santorini, Greece (2001)
15. Agrawal, R., Srikant, R.: Fast Algorithms for Mining Association Rules. In: Proc. of the 20th International Conference on Very Large Databases (1994) 487-499
Software Diagnosis Using Fuzzified Attribute Base on Modified MEPA

Jr-Shian Chen 1,2 and Ching-Hsue Cheng 1

1 Department of Information Management, National Yunlin University of Science and Technology, 123, Section 3, University Road, Touliu, Yunlin 640, Taiwan
[email protected]
2 Department of Computer Science and Information Management, HUNGKUANG University, No. 34, Chung-Chie Road, Shalu, Taichung 433, Taiwan
[email protected]
Abstract. Currently, there are many data preprocessing methods, such as data discretization, data cleaning, data integration and transformation, and data reduction. Concept hierarchies are a form of data discretization that can be used for data preprocessing. Discrete data are usually more compact and can be processed more quickly than continuous data. We therefore propose a data discretization method, a modified minimize entropy principle approach, to fuzzify attributes and then build the classification tree. For verification, two NASA software projects, KC2 and JM1, are applied to illustrate our proposed method. We establish a prototype system to discretize data from these projects. The error rates and numbers of rules show that the proposed approaches are both better than other methods.
1 Introduction
Defective software modules cause software failures, increase development and maintenance costs, and decrease customer satisfaction. Effective defect prediction models can help developers focus quality assurance activities on defect-prone modules and thus improve software quality by using resources more efficiently. Real-world databases are highly susceptible to noisy, missing, and inconsistent data. Noise is a random error or variance in a measured variable [1]. When decision trees are built, many of the branches may reflect noisy or outlier data. Therefore, data preprocessing steps are very important. There are many methods for data preprocessing. Concept hierarchies are a form of data discretization that can be used for data preprocessing. Data discretization has many advantages: data can be reduced and simplified, and discrete features are usually more compact, shorter and more accurate than continuous ones [2]. We therefore propose a modified minimize entropy principle approach to fuzzify attributes and then build the classification tree. This paper is organized as follows: the related work is described in Section 2; Section 3 is devoted to the modified minimize entropy principle approach and presents our proposed classifier method; experimental results are shown in Section 4; the final section presents conclusions.
2 Related Work
In this section, the related work covers data preprocessing, classification decision trees, and the minimize entropy principle approach.
2.1 Software Diagnosis
Software diagnosis is becoming an important field in software engineering and applications. Defective software modules cause software failures, increase development and maintenance costs, and decrease customer satisfaction. Hence, effective defect diagnosis models can help developers focus quality assurance activities on defect-prone modules and thus improve software quality by using resources more efficiently [3].
2.2 Data Preprocess
In data mining algorithms, the data preprocessing steps are very important. Preprocessing data affects the quality and the efficiency of the data mining algorithms. There are a number of data preprocessing techniques, for example, data cleaning [4], data integration and transformation, and data reduction [5][6]. Concept hierarchies [7] for numeric attributes can be constructed automatically based on data distribution analysis. Discretization maps similar values into one discrete bin, which can improve predictions by reducing the search space, reducing noise, and pointing to important data characteristics. Discretization includes unsupervised and supervised approaches. Unsupervised approaches divide the original feature value range into a few equal-length or equal-data-frequency intervals. Supervised methods maximize a measure involving the predicted variable, e.g. entropy or the chi-square statistic [2]. Entropy-based discretization can reduce the data size. Unlike the other methods, entropy-based discretization uses class information, which makes it more likely that the interval boundaries are defined in places that help classification accuracy.
2.3 Classification Decision Tree ID3 and C4.5
Classification is an important data mining technique. Several classification models have been proposed, e.g. statistical based, distance based, neural network based and decision tree based [8]. A decision tree is a flow-chart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and leaf nodes represent classes or class distributions [1]. ID3 [9] is a decision tree algorithm that is based on information theory. The basic strategy used by ID3 is to choose splitting attributes with the highest information gain. The concept used to quantify information is called entropy. Entropy measures the amount of information in an attribute. Assume a collection set S with c outcome classes; then the entropy is defined as
H(S) = Σ ( −p_i log2 p_i ),    (1)

where p_i is the proportion of S belonging to class i.
The information gain Gain(S, A) of example set S on attribute A is defined as

Gain(S, A) = H(S) − Σ_v ( |S_v| / |S| ) H(S_v),    (2)

where v is a value of A, S_v is the subset of S for which A has value v, |S_v| is the number of elements in S_v, and |S| is the number of elements in S. C4.5 is a software extension of the basic ID3 algorithm designed by Quinlan [10] to address the following issues not dealt with by ID3: avoiding overfitting the data, determining how deeply to grow a decision tree, reduced error pruning, rule post-pruning, handling continuous attributes, choosing an appropriate attribute selection measure, handling training data with missing attribute values, handling attributes with differing costs, and improving computational efficiency.
2.4 Minimize Entropy Principle Approach (MEPA)
A key goal of entropy minimization analysis is to determine the quantity of information in a given data set. The entropy of a probability distribution is a measure of the uncertainty of the distribution [11]. To subdivide the data into membership functions, the thresholds between classes of data must be established. A threshold line can be determined with an entropy minimization screening method, and the segmentation process then starts, first into two classes. Repeated partitioning with threshold value calculations then allows us to partition the data set into a number of fuzzy sets [12]. Assume that a threshold value is being sought for a sample in the range between x1 and x2. An entropy equation is written for the regions [x1, x] and [x, x2]; denote the first region p and the second region q. The entropy for each value of x is expressed as [13]:
S ( x ) = p ( x ) S p ( x ) + q( x ) S q ( x )
(3)
Where
S p ( x ) = −[ p1 ( x ) ln p1 ( x ) + p2 ( x ) ln p2 ( x )] S q ( x ) = −[ q1 ( x ) ln q1 ( x ) + q2 ( x ) ln q2 ( x )]
(4)
And pk(x) and qk(x) = conditional probabilities that the class k sample is in the region [x1, x1+x] and [x1+x, x2], respectively. p(x) and q(x) = probabilities that all samples are in the region [x1, x1+x] and [x1+x, x2], respectively. p(x) + q(x) = 1
(5)
A value of x that gives the minimum entropy is the optimum threshold value. The entropy estimates of pk(x) and qk(x), p(x) and q(x), are calculated as follows: [12]
Software Diagnosis Using Fuzzified Attribute Base on Modified MEPA
1273
pk ( x ) =
nk ( x ) + 1 n( x ) + 1
(6)
qk ( x ) =
N k ( x) + 1 N ( x) + 1
(7)
n( x ) n
(8)
q(x) = 1- p(x)
(9)
p( x ) =
where nk(x) = number of class k samples located in [x1, x1+x] n(x) = the total number of samples located in [x1, x1+x] Nk(x) = number of class k samples located in [x1+x, x2] N(x) = the total number of samples located in [x1+x, x2] n = total number of samples in [x1, x2]. 1 0.5 PRI 1
SEC1
PRI
SEC2
1
TER1
SEC1
TER2
PRI
TER3
SEC2
TER4
Fig. 1. Partitioning process of Minimize entropy principle approach
Figure 1 shows the partitioning process. While moving x in the region [x1, x2], we calculate the values of entropy for each position of x. The value of x in the region that holds the minimum entropy is called the primary threshold (PRI) value. With repeating this process, secondary threshold values can be determined which denote as SEC1 and SEC2. To develop seven partitions we need tertiary threshold values which denote as TER1, TER2, TER3 and TER4. [8]
1274
J.-S. Chen and C.-H. Cheng
3 Proposed Modified MEPA Approach to Fuzzify Attribute In this section, an approach is described. New approaches are proposed to improve the accuracy rate and reduce number of rules in C4.5 decision tree. The well-known dataset is giving in table 1, that is used to explain these approaches and our proposed method is described as follows: Table 1. A dataset for play game [8] page 18
outlook
temperature
humidity
Windy
play
sunny
85
85
FALSE
no
sunny
80
90
TRUE
no
overcast
83
86
FALSE
yes
rainy
70
96
FALSE
yes
rainy
68
80
FALSE
yes
rainy
65
70
TRUE
no
overcast
64
65
TRUE
yes
sunny
72
95
FALSE
no
sunny
69
70
FALSE
yes
rainy
75
80
FALSE
yes
sunny
75
70
TRUE
yes
overcast
72
90
TRUE
yes
overcast
81
75
FALSE
yes
rainy
71
91
TRUE
no
Step 1: Partition quantitative attributes and calculates the threshold value (PRI, SEC1, and SEC2). Entropy value of each data is computed by the entropy equation proposed by Christensen [13] described above. Table 2 shows that when x = 82.5, the entropy value S(x) is the smallest, so the PRI is 82.5. By repeating this procedure to subdivide the data, the thresholds can be obtained, as shown in Table 3. Step 2: Build membership function. Using the thresholds from step 1 as the midpoint of triangular fuzzy number, but we modify the first and last membership functions to be trapezoid. When the attribute value is below to SEC1 then the membership degree is equal to 1. That is the same to when attribute value is greater then SEC2. The membership function of minimize entropy principle approach can be established which is shown in Table 3 and Figure 2.
Software Diagnosis Using Fuzzified Attribute Base on Modified MEPA
1275
Step 3: Defuzzify the quantity data to determine its linguistic value. According to the membership function in step 2, the membership degree of each data is calculated to determine its linguistic value. (see Table 4). Step 4: Use the C4.5 decision tree classifier to build the model on the dataset. Use the linguistic values build the classification decision tree. Table 2. Example of entropy value calculation for attribute “humidity”
x P1(x) P2(x) Q1(x) Q2(x) P(x) Q(x)
75 + 80 = 77.5 2 1+1 = 0.333 5 +1 4 +1 = 0.833 5 +1 4 +1 = 0.500 9 +1 5 +1 = 0.600 9 +1 5 = 0.357 14 9 = 0.643 14
Sp(x) Sq(x) S(x)
80 + 85 = 82.5 2 1+1 = 0.250 7 +1 6 +1 = 0.875 7 +1 4 +1 = 0.625 7 +1 3+1 = 0.500 7 +1 7 = 0.500 14 7 = 0.500 14
0.518 0.653 0.605
85 + 86 = 85.5 2 2 +1 = 0.333 8+1 6 +1 = 0.778 8 +1 3+1 = 0.571 6 +1 3+1 = 0.571 6 +1 8 = 0.571 14 6 = 0.429 14
0.463 0.640 0.552
0.562 0.640 0.595
1.2
1
0.8
0.6
0.4
0.2
0 60
70
80
90
Fig. 2. Membership function of humidity
100
1276
J.-S. Chen and C.-H. Cheng Table 3. Thresholds of attributes
min 64 65
temperature humidity
sec1 66.5 72.5
pri 84 82.5
Sec2 85 95.5
max 85 96
Table 4. Fuzzified attribute base on modified MEPA
temperature
Linguistic value
humidity
Linguistic value
85
high
85
med
80
med
90
high
83
med
86
med
70
low
96
high
68
low
80
med
65
low
70
low
64
low
65
low
72
low
95
high
69
low
70
low
75
low
80
med
75
low
70
low
72
low
90
high
81
med
75
low
71
low
91
high
4 Case Verification Our empirical case study used data sets from two NASA software projects, labeled JM1 and KC2 [14]. JM1 is written in “C” and is a real-time predictive ground system: Uses simulations to generate predictions. The original dataset JM1 consisted of 10,885 software modules. We remove tuples that have 5 null values and 2 questionable values. The JM1 2102 modules had software faults and the remaining 8776 had no faults. KC2 is the data from “C++” functions using for scientific data processing. The original dataset KC2 consisted of 520 software modules, of which 106 modules had software faults and the remaining 414 had no faults. JM1 and KC2 projects have same with 21 attributes for software measurements. The attributes are list in Table 5. From the section 3’s statements establish a prototype system. The system exploit by the program language Delphi7.0. And then use the system to discrete data and use the linguistic values build the classification decision. The C4.5 classifier used in our studies is Weka open software [15].
Software Diagnosis Using Fuzzified Attribute Base on Modified MEPA
1277
The entropy value of each data is computed by equation (3)-(9). By repeating this procedure to partition the data, the thresholds can be obtained, as shown in Table 6. The proposed approaches are applied to build the classified decision tree on the dataset KC2 and JM1. The C4.5 method is also applied to compare with the proposed approaches. From Table 7, The Error rate and number of rules show that the proposed approaches are better than other methods. Table 5. Attribute information
no. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
Attribute loc v(g) ev(g) iv(g) n v l d i e b t lOCode lOComment lOBlank lOCodeAndComment uniq_Op uniq_Opnd total_Op total_Opnd branchCount
Attribute information numeric % McCabe's line count of code numeric % McCabe "cyclomatic complexity" numeric % McCabe "essential complexity" numeric % McCabe "design complexity" numeric % Halstead total operators + operands numeric % Halstead "volume" numeric % Halstead "program length" numeric % Halstead "difficulty" numeric % Halstead "intelligence" numeric % Halstead "effort" numeric % Halstead numeric % Halstead's time estimator numeric % Halstead's line count numeric % Halstead's count of lines of comments numeric % Halstead's count of blank lines numeric numeric % unique operators numeric % unique operands numeric % total operators numeric % total operands numeric % of the flow graph
Table 6. Thresholds of attributes in KC2 & JM1 KC2 no. min
JM1
sec1
pri
sec2
max min
sec1
pri
sec2
max
1 11.50
38.50
93.50
3442
1
1
7.50
31.50
100.50
1275
2
1
1.50
6.50
15.50
180
1
1.50
8.50
15.50
470
3
1
3.50
4.50
7.50
125
1
3.50
6.50
12.50
165
4
1
2.50
3.50
8.50
143
1
1.50
4.50
9.50
402
5
1 26.50
72.50
305.50
3982
0
0.50
102.50
295.50
8441
6
0 109.21 354.28 1733.15
33814.56
0
1.00
699.33
7
0
2
0
0.01
0.06
0.01
0.12
0.34
2001.57 80843.08 0.16
1
1278
J.-S. Chen and C.-H. Cheng Table 6. (continued)
8
0
2.93
9.09
24.73
103.53
0
0.50
19.52
26.83
418.2
9
0 24.55
24.73
47.14
415.06
0
1.55
37.29
103.82
569.78
10
0 435.20 3600.08 35401.11 2147483.64
0
4.00 14248.47 41260.17 31079782
11
0
11.27
0
0.16
0.24
12
0 24.18 200.01 1966.73 153047.01
0
0.22
791.59
13
0
7.50
23.50
83.00
1107
0
0.50
26.50
67.50
2824
14
0
0.50
1.50
11.00
44
0
0.50
4.50
20.50
344
15
0
1.50
6.50
7.50
121
0
1.50
5.50
11.50
447
16
0
0.50
1.50
3.00
11
0
0.50
1.50
6.50
108
17
1
6.50
11.50
16.50
47
0
0.50
17.50
21.50
411
18
0
9.50
16.50
34.50
325
0
0.50
23.50
55.50
1026
19
1 14.50
80.50
148.00
2469
0
0.50
83.50
198.50
5420
20
0 11.50
49.50
118.50
1513
0
0.50
63.50
326.50
3021
21
1
12.00
26.00
361
1
2.00
11.50
29.50
826
0.04
2.00
0.13
0.58
0.67
26.95
2292.24 1726655
Table 7. Comparison with the proposed approaches and other model
Approaches Dataset Error rate Number of Rules Size of tree
Rforest[16]
C4.5
KC2 23.08% NA1
JM1 NA2 NA2
KC2 18.08% 26
JM1 20.45% 284
NA1
NA2
51
567
Proposed Approach KC2 JM1 17.69% 18.87% 17 169 25
253
Note: The NA1 denotes no given the answer in this method. The NA2 denotes that the method has not use this dataset to experiment.
5 Conclusions In this paper, we have proposed a new method to classify tree base on fuzzified attribute of modified MEPA. From Table 7, The Error rate and number of rules show that the proposed approaches are better than other methods. We can find that the proposed method has 2 main advantages as follows. 1. Reduce the size of tree and number of rules. 2. Improve the classification accuracy without removing the noise instance.
Software Diagnosis Using Fuzzified Attribute Base on Modified MEPA
1279
Future work could focus on the number of membership function used MEPA method, which could be fixed at two, three, and seven …etc. We can consider another flexibility method to build the membership function. That will leave as an area for future research.
Reference 1. J. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, San Francisco(2001) 2. H. Liu, F. Hussain, C. Tan, and M. Dash. Discretization: An enabling technique. Data Mining and Knowledge Discovery, Vol. 6, 4(2002), 3. Norman Fenton and Shari Pfleeger. Software Metrics - A Rigorous and Practical Approach. Chapmann & Hall, London, (1997) 4. Vijayshankar Raman, Joe Hellerstein, Potter's wheel: An interactive data cleaning system. VLDB(2001) ,381-390, Roma, Italy. 5. U. Fayyad, G. P. Shapiro, and P. Smyth, The KDD process for extracting useful knowledge from volumes of data, Communications of the ACM, vol. 39(1996), 27-34 6. S. Mitra, S.K. Pal, p. Mitra, Data mining in soft computing framework: A survey, IEEE Trans. Neural Networks Vol. 13, 1 (2002) 3-14 7. Cai, Y., Cercone, N., and Han, J., Knowledge discovery in databases: an attribute-oriented approach, VLDB (1992) 547-559. 8. MH Dunham, Data Mining: Introductory and Advanced Topics, Prentice Hall, Upper Saddle River, NJ (2003) 9. J. Ross Quinlan, Induction of decision trees. Machine Learning, 1 (1986) 81-106 10. J. Ross Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, San Mateo, CA (1993) 11. Yager, R. and Filev, D., Template-based fuzzy system modeling, Intelligent and Fuzzy Sys., Vol. 2 (1994) 39-54. 12. Ross, T.J., Fuzzy logic with engineering applications, International edition, McGraw-Hill, USA. (2000) 13. Christensen, R., Entropy minimax sourcebook, Entropy Ltd., Lincoln, MA. (1980) 14. Sayyad Shirabad, J., Menzies, T.J. The PROMISE Repository of Software Engineering Databases. School of Information Technology and Engineering, University of Ottawa, Canada. Available: http://promise.site.uottawa.ca/SERepository(2005) 15. Ian H. Witten, Eibe Frank, Data Mining: Practical machine learning tools and techniques, 2nd Edition, Morgan Kaufmann, San Francisco (2005) 16. Taghi M. Khoshgoftaar, Naeem Seliya, Kehan Gao, Detecting noisy instances with the rule-based classification model, Intelligent Data Analysis, Vol.9, 4 (2005), 347-364
New Methods for Text Categorization Based on a New Feature Selection Method and a New Similarity Measure Between Documents Li-Wei Lee and Shyi-Ming Chen Department of Computer Science and Information Engineering National Taiwan University of Science and Technology Taipei, Taiwan, R.O.C. [email protected]
Abstract. In this paper, we present a new feature selection method based on document frequencies and statistical values. We also present a new similarity measure to calculate the degree of similarity between documents. Based on the proposed feature selection method and the proposed similarity measure between documents, we present three methods for dealing with the Reuters-21578 top 10 categories text categorization. The proposed methods get higher performance for dealing with the Reuters-21578 top 10 categories text categorization than that of the method presented in [4].
1 Introduction Text categorization is a task of classifying text documents into a predefined number of categories based on classification patterns [3], [22]. The terms appearing in documents are treated as features. One major difficulty in text categorization is the large dimension of the feature space. Therefore, we hope to reduce the dimension of the feature space to get a higher performance for dealing with the text categorization problem. One method for text categorization is based on the feature selection method [3], [4], [21]. Some results from the previous researches show that the semantic feature selection approach affects the performance of text categorization [3], [21]. Some feature selection methods have been presented to deal with a wide range of text categorization tasks, such as Chi-Square test [1], [2], [6], [20], Information Gain (IG) [1], [9], [10], [12], [15], and Mutual Information (MI) [2], [5], [8], [12], [13]. In this paper, we present a new feature selection method based on document frequencies and statistical values to select useful features. We also present a new similarity measure between documents. Based on the proposed feature selection method and the proposed similarity measure between documents, we present three methods for dealing with the Reuters-26578 top 10 categories text categorization. The proposed methods can get higher performance for dealing with the Reuters-26578 top 10 categories text categorization than that of the method presented in [4]. The rest of this paper is organized as follows. In Section 2, we briefly review the previous research for text categorization. In Section 3, we present a new feature selection M. Ali and R. Dapoigny (Eds.): IEA/AIE 2006, LNAI 4031, pp. 1280 – 1289, 2006. © Springer-Verlag Berlin Heidelberg 2006
New Methods for Text Categorization Based on a New Feature Selection Method
1281
method based on document frequencies and statistical values for text categorization. In Section 4, we present a new similarity measure between documents. Based on the proposed feature selection method and the proposed similarity measure between documents, we present three methods to deal with the Reuters-26578 top 10 categories text categorization. In Section 5, we show the experimental results. The conclusions are discussed in Section 6.
2 Preliminaries In the vector space model [18], documents are usually represented by feature vectors of terms. The task of preprocessing consists of transforming capital letters into lowercased letters and removing stop words (such as “a”, “an”, “the”, etc.), where words are stemmed by applying the Porter algorithm [17]. The acquired document-term matrix is then transformed into TF-IDF (Term Frequency-Inverse Document Frequency) weights which are normalized by document lengths [19]. After the feature selection process, the dimension of the feature space is reduced and useful features are obtained. Therefore, the feature selection process is a very important task, and it can affect the performance of text categorization. Assume that F consists of n features f1, f2, …, fn and assume that S consists of m features s1, s2, …, sm, where S is a subset of F. The goal of the feature selection process is to choose an optimal subset S of F for text categorization. There are many statistic measures for dealing with the task of feature selection, e.g., Chi-Square [1], [2], [6], [20], Information Gain [1], [9], [10], [12], [15], and Mutual Information [2], [5], [8], [12], [13]. Among these measures, the mutual information measure is the most commonly used measure. It also has a better performance for dealing with the task of feature selection. In the following, we briefly review some feature selection measures, shown as follows: (1) Chi-Square test [2]: Fix a term t, let the class labels be 0 and 1. Let ki,0 denote the number of documents in class i not containing term t and let ki,1 denote the number of documents in class i containing term t. This gives us a 2 × 2 contingency matrix: It
C
0
1
0
k00
k01
1
k10
k11
where C and It denote Boolean random variables and k lm denotes the number of observations, where C ∈ {0,1} and I t ∈ {0,1} . Let n = k00 + k01 + k10 + k11. We can estimate the marginal distribution as follows: Pr(C = 0) = (k00 + k01)/n, Pr(C = 1) = (k10 + k11)/n, Pr(It = 0) = (k00 + k10)/n, Pr(It = 1) = (k01 + k11)/n.
1282
The χ χ
2
L.-W. Lee and S.-M. Chen
2
= ∑ l,m
test is shown as follows: (k lm − n Pr(C = l) Pr(I t = m))
2
n Pr(c = l) Pr(I t = m)
=
n(k11k 00 − k10k 01)
2
(k11 + k10 )(k 01 + k 00 )(k11 + k 01)(k10 + k 00 )
(1)
.
2 The larger the value of χ , the lower is our belief that the independence assumption is upheld by the observed data. In [2], Chakrabarti pointed out that for feature selec2 tion, it is adequate to sort terms in decreasing order of their χ values, train several classifiers with a varying number of features, and stopping at the point of maximum 2 accuracy (see [2], pp. 139). The larger the value of χ , the higher the priority to choose term t. For more details, please refer to [2]. (2) Information Gain Measure [23], [24]: For the binary document model and two 2 classes (the same as the case of the χ test), the Information Gain (IG) of term t with respect to the two classes can be written as follows: 1 1 1 IG(t ) = − ∑ P(ci ) log P(ci ) − ( − P(t ) ∑ P(ci | t ) log P(ci | t ) − P(t ) ∑ P(ci | t ) log P(ci | t )), i =0 i =0 i =0
P(c0 ) =
where P (t ) =
k 00 + k 01 k 00 + k 01 + k10 + k11
k 00 + k10 k 00 + k 01 + k10 + k11
P ( c1 | t ) =
k10 k 00 + k10
k 01 + k11
P (t ) =
,
, P(c0 | t ) =
P ( c1 ) =
,
k 00 + k 01 + k10 + k11
k 01 k 01 + k11
,
and P ( c1
| t) =
(2)
k10 + k11 k 00 + k 01 + k10 + k11 P (c 0 | t ) =
, k11
k 00 k 00 + k10
,
,
.
k 01 + k11
The larger the value of IG (t ) , the higher the priority to choose term t. For more details, please refer to [23] and [24]. (3) Mutual Information Measure [2]: For the binary document model and two classes 2 (the same as the case of the χ test), the Mutual Information (MI) of term t with respect to the two classes can be written as follows: MI ( I t , C ) =
∑ l , m∈{0,1}
k l ,m n
log
k l,m / n ( k l , 0 + k l ,1 )( k 0, m + k1,m ) / n
2
,
(3)
where n = k00 + k01 + k10 + k11. The larger the value of MI ( I t , C ) , the higher the priority to choose term t. For more details, please refer to [2].
New Methods for Text Categorization Based on a New Feature Selection Method
1283
3 A New Feature Selection Method Based on Statistical Values and Document Frequencies In this section, we present a new feature selection method based on statistical values and document frequencies. Let X and Y be two different classes of documents. The mean values μ X ,t and μ Y ,t of X and Y are 1 / | X | ( ∑ X x t ) and 1 / | Y | ( ∑Y y t ) , respectively, where xt denotes the TFIDF of term t in class X and yt denotes the TFIDF of term t in class Y. Furthermore, the variances σ X and σ Y of X and Y are 1 / | X | ∑ X ( x t − μ X ,t )
2
and
1 / | Y | ∑Y ( y t − μ Y ,t )
2
, respectively. Let |X| denote the
number of documents in the class X and let |Y| denote the number of documents in the class Y. Here, we consider the effect of document frequencies and variances for feature selection. Let DF(xt) denote the document frequencies of term xt in the X class and let DF(yt) denote the document frequency of term yt in the Y class. The proposed feature selection method is as follows: S (t ) =
( DF ( x t ) /|X| − DF ( y t ) /|Y|) 2
2
(1/|X|) ∑ X ( x t − μ X,t ) + (1/|Y|) ∑Y ( y t − μY,t )
2
.
(4)
The larger the value of S(t), the higher the priority to choose term t.
4 New Methods for Text Classification Based on the Proposed Similarity Measure and the k-NN Approach Many learning-based approaches have been presented to deal with the task of text categorization, e.g., the k-NN approach [7], [22], [24], support vector machines [5], [14], [24], Naïve Bayes approaches [5], [10], [24], and neural networks [16], [24]. In this paper, we present three classification methods based on the k-NN approach [7], [22], [24] to classify the Reuters-21578 top 10 categories data set. The k-NN classifier uses the k-nearest training documents with respect to a testing document to calculate the likelihood of categories. The document-document similarity measure used in the k-NN classifier is the most important part for text categorization. Most previous k-NN classifiers use the cosine similarity measure in the vector space v v model. The cosine similarity measure cos(a, b) , for measuring the degree of similarity between documents a and b is as follows: v v a⋅b v v cos(a, b) = v v , (5) a ⋅ b v v v v where cos(a, b) ∈ [0,1] , a and b denote the vector representation of the documents a v v and b, respectively. The larger the value of cos(a, b) , the higher the similarity between the documents a and b.
1284
L.-W. Lee and S.-M. Chen
The term weight wij calculated by TFIDF normalized by document lengths is the most commonly-used method [19] for the cosine similarity measure in the vector space model, where TFIDF ( t i ,d j )
wij =
,
(6)
T
∑ (TFIDF ( t ,d )) 2 k j
k = 1
wij denotes the weight of term i in document j, |T| the total number of terms, and Number of Documents TFIDF(t i ,d j ) = (Term Frequencyof Term t i in d ) × log( ). j Document Frequencyof Term t i
A comparison of the three proposed methods with Dona’s method [4] is shown in Table 1. Table 1. A comparison of the three proposed methods with Dona’s method Methods
Feature selection Term weight Similarity measure Classifier
Dona’s method [4] Mutual Information Measure [2, pp. 139-141] N/A N/A Naïve Bayes [24]
The first proposed method Mutual Information Measure [2, pp. 139-141] Formula (6) Formula (5) k-NN [24]
The second proposed method Mutual Information Measure [2, pp. 139-141] Boolean Formula (7) k-NN [24]
The third proposed method Formula (4) Boolean Formula (7) k-NN [24]
In the following, we summarize the three proposed methods as follows: (A) The First Proposed Method for Text Categorization: Step 1: Select a predefined number of features based on the Mutual Information (MI) Measure [2] shown in formula (3) to reduce the number of features of each document. Step 2: Given a testing document, calculate the term weight of the testing document by using formula (6). Find its k-nearest documents among the training documents by using formula (5). Step 3: The testing document belonging to the category has the largest summing weight. (B) The Second Proposed Method for Text Categorization: Step 1: Select a predefined number of features based on the Mutual Information (MI) Measure [2] shown in formula (3) to reduce the number of features of each document. Step 2: Given a testing document, find its k-nearest documents among the training documents by using the proposed document-document similarity measure described as follows. We use the Boolean method for document representation. Each term weight is either 0 or 1, where 0 means that the term is not appearing and 1 means that it is appearing in the document. Let M(d1, d2) denote the number of terms appearing in documents D1 and D2, simultaneously. The proposed similarity measure to calculate the degree of similarity Similarity(d1, d2) between documents is shown as follows:
New Methods for Text Categorization Based on a New Feature Selection Method
M ( d1 , d 2 )
Similarity ( d 1 , d 2 ) =
,
1285
(7)
d1 × d 2
where |d1| denotes the number of terms in document d1 and |d2| denotes the number of terms in document d2. Calculate the likelihood of the testing document belonging to each category by summing the weights of its k-nearest documents belonging to the category. For example, assume that there are 3-nearest training documents d1, d2, and d3 of testing document d4 as shown in Fig. 1. Assume that the degree of similarity between document d1 and the testing documents d4 is w1, the degree of the similarity between document d2 and the testing document d4 is w2, and the degree of similarity between document d3 and the testing document d4 is w3. Then, the summing weights of the documents d1 and d2 belonging to Category 1 are equal to w1 + w2, the weight of d3 belonging to Category 2 is w3, if (w1 + w2) < w3, then we let the testing document d4 belong to Category 2.
Category 1 d1
w1
d2 w2
d4 w3 d3 Category 2
Fig. 1. The 3-nearest training documents of testing document d4
Step 3: The testing document belonging to the category has the largest summing weight. (C) The Third Proposed Method for Text Categorization: Step 1: Select a predefined number of features based on the proposed feature selection method shown in formula (4) to reduce the number of features of each document. Step 2: The same as Step 2 of the second proposed method. Step 3: The testing document belonging to the category has the largest summing weight.
5 Experimental Results In our experiment, we use the Reuters-21578 “top 10 categories” data set [4], [25] shown in Table 2 for dealing with the text categorization.
1286
L.-W. Lee and S.-M. Chen Table 2. Top 10 categories of the Reuters-21578 data set [4], [25] Category Earn Acq Money-fx Grain Crude Trade Interest Ship Wheat Corn Total
Number of Training Documents 2877 1650 538 433 389 368 347 197 212 181 7769
Number of Testing Documents 1083 719 179 149 189 117 131 89 71 56 3019
We have implemented the proposed method by using MATLAB version 6.5 on a Pentium 4 PC. We use the Microaveraged F1 [2] for evaluating the performance of the proposed methods. Precision and Recall are defined as follows [2]: Precision =
number of documents retrieved that are relevant
,
(8)
total numberof documents that are retrieved
Recall =
number of documents retrieved that are relevant
.
(9)
total number of documents that are relevant
The relationship between the precision and the recall is characterized by a graph called the precision-recall curve. The F1 measure combines the precision and the recall defined as follows [2]: F1 =
2 × Precision × Recall Precision + Recall
(10)
.
For multiple categories, the precision, the recall, the microaveraged precision, the microaveraged recall, and the microaveraged F1 are calculated based on the global contingency matrix shown in Table 3, where pi =
ri =
ai a i + ci
ai a i + bi
microaveraged precision =
,
(11)
,
(12)
A A+C
=
∑ik=1 a i ∑ik=1 (a i + c i )
,
(13)
New Methods for Text Categorization Based on a New Feature Selection Method
microaveraged recall =
microaveraged F1 =
A A+B
=
∑ ik=1 a i ∑ ik=1 (a i + b i )
1287
(14)
,
2 × microaverage_precision × microaverage_recall micoraverage_precision + microaverage_recall
(15)
.
Table 3. The Contingency Matrix for a Set of Categories [2] Category C = {c1, c2, … c|c|}
Predicted “YES” c A = ∑ ai i |c| C = ∑ ci i
Actual class “YES”
Actual class “NO”
Predicted “NO” c B = ∑ bi i c D = ∑ di i
In our experiment, we use the proposed three methods (i.e. the first proposed method, the second proposed method and the third proposed method) shown in Table 1 to compare their performance for text categorization with the performance of the method presented in [4]. To compare the proposed three methods with the Dona’s Method [4], the microaveraged F1 is considered for comparing the system performances of different methods for text categorization. Table 4 shows the results of the Retuers-21578 top 10 categories text categorization of the methods. The experimental results show that the proposed three methods (i.e. the first proposed method, the second proposed method and the third proposed method) get higher performances than that of Dona’s method [4]. Table 4. A comparison of the performance of categorizing Reuters-21578 top 10 categories data set for different methods Methods F1 measure
Dona’s method [4]
The first proposed method
The second proposed method
The third proposed method
Categories Earn Acq Money-fx Grain Crude Trade Interest Ship Wheat Corn Microaveraged F1
98.04 96.67 76.54 57.47 79.43 85.60 73.38 68.75 48.39 44.02 74.06
96.82 89.2 73.13 57.41 70.7 65.63 60.97 84.85 39.27 42.95 81.99
97.76 95.43 73.5 60 73.68 70.72 60.76 85.71 44.25 38.71 84.6
97.54 95.56 73.12 59.67 74.42 72.44 65.25 86.75 47.01 37.57 84.73
6 Conclusions In this paper, we have presented a new feature selection method based on document frequencies and statistical values. We also have presented a new similarity measure to
1288
L.-W. Lee and S.-M. Chen
calculate the degree of similarity between documents. Based on the proposed feature selection method and the proposed similarity measures between documents, we also have presented three methods to deal with the categorization of the Reuters-21578 top 10 categories data set. The experimental results show that the proposed three methods get higher performance for text categorization than the method presented in [4].
Acknowledgements The authors would like to thank Professor Yuh-Jye Lee for his help during this research. This work was supported in part by the National Science Council, Republic of China, under grant NSC 94-2213-E-011-003.
References 1. Caropreso, M. F., Matwin, S., Sebastiani, F.: A Learner-Independent Evaluation of the Usefulness of Statistical Phrases for Automated Text Categorization. In: A. G. Chin, eds. Text Databases and Document Management: Theory and Practice, Idea Group Publishing, Hershey, PA (2001) 78–102 2. Chakrabarti, S.: Mining the Web. New York: Morgan Kaufmann (2003) 137–144 3. Chua, S., K, N.: Semantic Feature Selection Using WordNet. Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence (2004) 4. Doan, S.: An Efficient Feature Selection Using Multi-Criteria in Text Categorization. Proceedings of the IEEE Fourth International Conference on Hybrid Intelligent Systems (2004) 5. Dumais, S. T., Plant, J., Heckerman, D., Sahami, M.: Inductive Learning Algorithms and Representations for Text Categorization. Proceedings of the 7th ACM International Conference on Information and Knowledge Management (1998) 148–155 6. Galavotti, L., Sebastiani, F., Simi, M.: Experiments on the Use of Feature Selection and Negative Evidence in Automated Text Categorization. Proceedings of the 4th European Conference on Research and Advanced Technology for Digital Libraries (2000) 59–68 7. Lam, W., Ho, C. Y.: Using a Generalized Instance Set for Automatic Text Categorization. Proceedings of SIGIR-98 the 21st ACM International Conference on Research and Development in Information Retrieval (1998) 195–202 8. Larkey, L. S., Croft, W. B.: Combining Classifiers in Text Categorization. Proceedings of the 19th ACM International Conference on Research and Development in Information Retrieval (1996) 289–297 9. Larkey, L. S.: Automatic Essay Grading Using Text Categorization Techniques. Proceedings of the 21st ACM International Conference on Research and Development in Information Retrieval (1998) 90–95 10. Lewis, D. D.: An evaluation of phrasal and clustered representations on a text categorization task. Proceedings of the 15th ACM International Conference on Research and Development in Information Retrieval (1992) 37–50 11. Lewis, D. D.: Representation and Learning in Information Retrieval. Ph.D. Dissertation, Department of Computer Science, University of Massachusetts, Amherst, MA (1992) 12. Lewis, D. D.: and Ringuette, M., A Comparison of Two Learning Algorithms for Text Categorization. Proceedings of the 3rd Annual Symposium on Document Analysis and Information Retrieval (1994) 81–93
New Methods for Text Categorization Based on a New Feature Selection Method
1289
13. Li, Y. H., Jain, A. K.: Classification of Text Documents, Computer Journal Vol. 41, No. 8 (1998) 537–546 14. Li, H., Yamanishi, K.: Text Classification Using ESC-Based Stochastic Decision Lists. Proceedings of the 8th ACM International Conference on Information and Knowledge Management (1999) 122–130 15. Mladenic, D.: Feature Subset Selection in Text Learning. Proceedings of the 10th European Conference on Machine Learning (1998) 95–100 16. Ng, H. T., Goh, W. B., Low, K. L.: Feature Selection, Perceptron Learning, and a Usability Case Study for Text Categorization. Proceedings of the 20th ACM International Conference on Research and Development in Information Retrieval (1997) 67–73 17. Porter, M. F.: An Algorithm for Suffic Stripping Program. Vol. 14, No. 3 (1980) 130–137 18. Salton, G., Wong, A., Yang, C.: A Vector Space Model for Automatic Indexing. Communications of the ACM, Vol. 18, No. 11 (1975) 613–620 19. Sebastiani, F.: Machine Learning in Automated Text Categorization. ACM Computing Survey, Vol. 34, No. 1 (2002) 1–47 20. Sebastiani, F., Sperduti, A., Valdambrini, N.: An Improved Boosting Algorithm and its Application to Automated Text Categorization. Proceedings of the 9th ACM International Conference on Information and Knowledge Management (2000) 78–85 21. Shima, K., Todoriki, M., Suzuki, A.: SVM-Based Feature Selection of Latent Semantic Features. Pattern Recognition Letters 25 (2004) 1051–1057 22. Yang, Y.: An Evaluation of Statistical Approaches to Text Categorization. Information Retrieval Journal, Vol. 1, No. 1–2, (1999) 69–90 23. Yang, Y. Pedersen, J.: A Comparative Study on Feature Selection in Text Categorization. Proceedings of the 14th International Conference on Machine Learning (1997) 412–420 24. Yang, Y., Liu, X.: A Re-examination of Text Categorization Methods. Proceedings of the SIGIR-99 22nd ACM International Conference on Research and Development in Information Retrieval, Berkeley, CA (1999) 42–49 25. Reuter-21578 Apte Split Data Set, http://kdd.ics.uci.edu/data-bases/reuter21578/ reuter221578.html
Using Positive Region to Reduce the Computational Complexity of Discernibility Matrix Method Feng Honghai1,2, Zhao Shuo1, Liu Baoyan3, He LiYun3, Yang Bingru2, and Li Yueli1 1
Hebei Agricultural University, 071001 Baoding, China [email protected] 2 University of Science and Technology Beijing, 100083 Beijing, China 3 China Academy of Traditional Chinese Medicine, 100700 Beijing, China
Abstract. Rough set discernibility matrix method is a valid method to attribute reduction. However, it is a NP-hard problem. Up until now, though some methods have been proposed to improve this problem, the case is not improved well. We find that the idea of discernibility matrix can be used to not only the whole data but also partial data. So we present a new algorithm to reduce the computational complexity. Firstly, select a condition attribute C that holds the largest measure of γ(C, D) in which the decision attribute D depends on C. Secondly, with the examples in the non-positive region, build a discernibility matrix to create attribute reduction. Thirdly, combine the attributes generated in the above two steps into the attribute reduction set. Additionally, we give a proof of the rationality of our method. The larger the positive region is; the more the complexity is reduced. Four Experimental results indicate that the computational complexity is reduced by 67%, 83%, 41%, and 30% respectively and the reduced attribute sets are the same as the standard discernibility matrix method.
1 Introduction The rough set theory introduced by Pawlak [1] provides a systematic framework to study the problems arising from imprecise and insufficient knowledge. In a real world information system there may be hundreds of attributes, many of which may be irrelevant to the decision-making. So, knowledge reduction is one of the most important problems in rough set theory. Because an information system may usually have more than one reduction, we always hope to obtain the set of the most concise rules. Unfortunately, it has been shown that finding the minimal reduct of an information system is a NP-hard problem [2]. Therefore, heuristic methods that explore a reduced search space are commonly used for attribute reduction. In these kinds of methods, the measure of significance of every attribute is analyzed, and regard it as a heuristic information in order to decrease search space. WEI has given uncertainty measures in probabilistic rough set using fuzzy entropy of the fuzzy set [3]. Some authors have introduced information-theoretic measures of uncertainty for rough sets [4]. In addition, some authors apply genetic algorithm to find the suboptimal set [5] and use approximation techniques [6] to obtain relative reduct [7], finding relative core and relative knowledge reduction. M. Ali and R. Dapoigny (Eds.): IEA/AIE 2006, LNAI 4031, pp. 1290 – 1298, 2006. © Springer-Verlag Berlin Heidelberg 2006
Using Positive Region to Reduce the Computational Complexity
1291
2 Concepts of Rough Set Suppose given two finite, non-empty sets U and A, where U is the universe, and A – a set of attributes. With every attribute a∈A we associate a set Va, of its values, called the domain of a. Any subset B of A determines a binary relation I (B) on U, which will be called an indiscernibility relation, and is defined as follows: xI(B)y if and only if a(x) = a(y) for every a∈A, where a(x) denotes the value of attribute a for element x. Obviously I(B) is an equivalence relation. The family of all equivalence classes of I(B), i.e., partition determined by B, will be denoted by U/IND(B); an equivalence class of IND(B), i.e., block of the partition U/ IND(B), containing x will be denoted by B(x). If (x, y) belongs to IND(B) we will say that x and y are B-indiscernible. Equivalence classes of the relation IND(B) (or blocks of the partition U/IND(B)) are referred to as B-elementary sets. In the rough set approach the elementary sets are the basic building blocks (concepts) of our knowledge about reality. The indiscernibility relation will be used next to define approximations, basic concepts of rough set theory. Approximations can be defined as follows:
B∗ ( X ) = {x ∈ U : B( x ) ⊆ X } , B ∗ ( X ) = {x ∈ U : B( x ) ∩ X ≠ ∅}, assigning to every subset X of the universe U two sets B*(X) and B*(X) called the Blower and the B-upper approximation of X, respectively. The set
BN B ( X ) = B∗ ( X ) − B∗ ( X ) will be referred to as the B-boundary region of X. Sometimes we distinguish in an information table two classes of attributes, called condition denoted by C and decision attributes denoted by D. The coefficient γ(C, D) expresses the ratio of all elements of the universe, which can be properly classified to blocks of the partition U/D, employing attributes C.
γ (C , D) =
| POSC ( D) | , where POSC ( D) = |U |
UC ( X )
∗ X ∈U / I ( D )
The expression POSC(D), called a positive region of the partition U/IND(D) with respect to C, is the set of all elements of U that can be uniquely classified to blocks of the partition U/IND(D), by means of C. By a discernibility matrix of B ⊆ A denoted M (B) a n × n matrix is defined as
(cij ) = {a ∈ B : ( xi ) ≠ a ( x j )} for i, j = 1,2,K, n . Let us assign to each attribute a ∈ B a binary Boolean variable
a , and let
Σδ ( x, y ) denote Boolean sum of all Boolean variables assigned to the set of attributes δ ( x, y ) . Then the discernibility function can be defined by the formula f ( B) =
∏{Σδ ( x, y) : ( x, y) ∈U
( x , y )∈U
2
2
and
δ ( x, y ) ≠ ∅} .
1292
F. Honghai et al.
The following property establishes the relationship between disjunctive normal form of the function f(B) and the set of all reducts of B. All constituents in the minimal disjunctive normal form of the function f(B) are all reducts of B. In order to compute the value core and value reducts for x we can also use the discernibility matrix as defined before and the discernibility function, which must be slightly modified:
f x ( B) = ∏{Σδ ( x, y ) : y ∈ U and δ ( x, y ) ≠ ∅} . y∈U
The D-core is the set of all single element entries of the discernibility matrix MD(C), i.e.
CORED (C ) = {a ∈ C : cij = (a ), for some i, j} .
3 Algorithm (1) For (C =0; C
| POSC ( D) | to calculate γ(C, D)} // D is the decision |U |
attribute. (2) Sort γ(C, D) by size. (3) Select a condition attribute C that holds the largest measure of γ(C, D) (4) With examples in the non-positive region, build another discernibility matrix to induce the other attributes in the attribute reduction set. (5) Combine the attributes generated in the above two steps into the attribute reduction set. Below we give a proof of the rationality of the algorithm. Suppose that condition attribute C holds the largest measure of γ(C, D) with which the decision attribute D depends on C, and there are n examples in the positive region of partition of the decision attribute D with respect to the condition attribute C. Suppose that U/IND (C) = {A1, A2, L Am}, A1 = {a1, a2, L an}, U/IND (D) = {D1, D2, L DL}, for A1, A2, L , Am, only A1 ⊆ D1. Then we can hold POS C (D) = A1. The components of the discernibility matrix are denoted by DM. Every component of the discernibility matrix that discerns the examples in A1 is denoted by DM (A1)i which is the combination of condition attributes that discern a1, a2, L , an from the other examples. We can affirm that DM (A1)i must contain the attribute C, that is, C ∈ DM (A1)I
So
∏
DM (A1)i = C
Namely, the result of the Boolean multiplication of the DM (A1)i is C .
Using Positive Region to Reduce the Computational Complexity
1293
Then with the condition attribute C, a1, a2, L an can be discerned from the other examples, and with the other condition attributes the other examples can be discerned from each other. Hence for the examples a1, a2, L an, we can use only the attribute C and the positive knowledge to induce the classification rule, and for the other examples we can build a discernibility matrix to induce the classification rules. So the size of the discernibility matrix is reduced. For example, Table 1 is the information system of 4 hand-written Chinese characters [8], and Table 2 is the discernibility matrix of the system. Obviously, POS A (Y ) = {7, 8}, POS B (Y ) = {3, 4, 5}, POS C (Y ) = {5, 6}, so the largest positive region of the partition U/ IND (Y) is with respect to B, that is, {3, 4, 5} is the largest positive region of the partition U/ IND (Y). In the discernibility matrix, the constituents with * are the ones that differentiate examples 3, 4, and 5 from the other examples. Table 1. The information table of 4 hand-written Chinese characters after discretization
U 1 2 3 4 5 6 7 8
A 17 18 17 17 18 18 15 16
B 8 8 9 9 7 8 8 8
C 5 5 5 5 6 6 5 5
D 4 3 5 3 4 4 4 5
E 2 2 2 2 2 3 3 3
F 1 1 2 1 1 2 2 2
Y 1 1 2 2 3 3 4 4
Table 2. Discernibility matrix of the Table 1 1 1 2 3 4 5 6 7 8
BDF* BD* ABC* ACEF AEF ADEF
2
ABDF * AB* BCD* CDEF ADEF ADEF
3
ABCDF* ABCDE* ABDE* ABE*
4
5
ABCDEF * ABCDEF* ABDEF* ABCEF* ABDEF* ABCDEF*
6
AC ACD
We can find that B is the only element in all the constituents with * generated by comparing the examples 3, 4, and 5 with the others, that is, after using Boolean sum and Boolean multiplication to these constituents with * the result is attribute B. In other words, we can use the other examples 1, 2, 6, 7, and 8 to build another discernibility matrix to generate the other constituents with no * to induce the other attributes that will be in the attribute reduction set.
1294
F. Honghai et al.
Table 3. The results of the Table 2 after removing the components that contain attribute B
1 2 3 4 5 6 7 8
1
2
3
4
5
6
* * * ACEF AEF ADEF
* * * CDEF ADEF ADEF
* * * *
* * * *
* *
AC ACD
From Table 3, we can see that all the components with * relate to examples 3, 4, and 5, that is, with the condition attribute B, we can discern the examples 3, 4, and 5 from the other examples and for the other examples we can build a discernibility matrix to get the corresponding attribute reduction. We can also conclude that the condition attribute B is included in the attribute reduction set determinately. The reduced condition attribute set contains the least amount of the condition attributes among the all reduced attribute sets generated with standard discernibility matrix method, since the attribute B holds the largest positive region of partition of the decision attribute. Because attribute B holds the largest positive region of decision attribute, the classification accuracy should be the highest among the all the attribute reduction set generated with standard discernibility matrix attribute reduction method.
4 Hand Written Chinese Character Recognition Experiments In Table 1, the values of 1, 2, 3, and 4 of attribute Y denote the above 4 hand-written Chinese characters respectively. The features of the Chinese characters are generated with SVD and among all the features, A, B, C, D, E, and F are the ones whose values are not the same after equal-width based discretization.
Fig. 1. 4 hand-written Chinese characters
With standard discernibility matrix method, the comparison times between different examples with different decision attribute values are (2*6+ 2*4+2*2)*6=144, namely, examples 1, and 2 should be compared with examples 3, 4, 5, 6, 7, and 8 respectively, following this comparison step, examples 3, and 4 should be compared with 5, 6, 7, and 8 respectively, and examples 5, and 6 should be compared with 7, and 8 respectively.
Using Positive Region to Reduce the Computational Complexity
1295
Table 4. Comparison results of computational complexity of Table 1
Standard Improved discernibility matrix method discernibility matrix method Amount of the examples 2 2 with Y=1 amount of the examples 2 0 with Y=2 amount of the examples 2 1 with Y=3 amount of the examples 2 2 with Y=4 amount of the attributes 6 6 times for comparing examples (2*6+ 2*4+2*2)*6 (2*3+1*2)*6 Ratio (2*3+1*2) / (2*6+ 2*4+2*2)=8/24=0.33/1
5 Micronutrients Experiments The SARS data are the experimental results of micronutrients that are essential in minute amounts for the proper growth and metabolism of human beings. Among them, examples 31~60 are the results of SARS patients and 61~90 are the results of healthy human beings. Table 5. Left examples after discretization
U 31 32 34 39 41 42 43 47 48 49 50 52 53
1 2 3 4 5 6 7 C U 1 2 3 4 5 6 7 C U 1 2 3 1 1 1 0 1 1 1 0 54 0 0 1 0 0 1 1 0 69 1 1 1 2 1 1 0 1 1 1 0 55 1 0 1 0 1 1 1 0 70 1 2 2 1 1 1 0 1 2 1 0 56 0 1 1 0 0 1 1 0 71 1 1 2 1 1 0 0 0 1 1 0 59 0 0 0 0 0 2 1 0 72 2 1 1 0 1 1 0 0 2 2 0 60 1 2 1 0 1 1 1 0 73 1 1 2 2 1 1 0 1 2 1 0 61 2 1 1 2 2 0 1 1 74 2 1 2 2 1 1 0 2 1 1 0 62 1 1 1 1 1 1 1 1 76 2 1 1 0 1 0 0 0 1 1 0 63 1 1 1 2 1 0 1 1 78 1 1 2 2 1 0 2 1 1 1 0 64 2 1 2 2 1 1 1 1 79 2 2 1 1 1 1 1 1 1 2 0 65 1 1 1 2 1 1 1 1 85 2 1 1 0 1 1 0 0 2 1 0 66 1 1 1 2 1 0 0 1 86 2 1 1 1 1 2 0 1 1 1 0 67 1 1 1 2 2 1 1 1 87 2 1 1 1 1 1 1 1 0 1 0 68 1 1 1 1 1 1 0 1 88 2 1 2
4 2 2 1 2 1 2 2 2 2 2 2 2 2
5 2 2 1 2 2 2 1 2 2 1 2 2 2
6 1 2 1 1 1 1 1 1 1 0 1 1 1
7 2 1 1 1 1 1 1 1 1 0 0 1 1
C 1 1 1 1 1 1 1 1 1 1 1 1 1
Attributes “1”, “2”, “3”, “4”, “5”, “6”, “7” denote micronutrient Zn, Cu, Fe, Ca, Mg, K and Na respectively, and decision attribute “C” denotes the class “SARS” and “healthy”. VC = {0,1} , where “0” denotes “SARS”, “1” denotes “healthy”. After discretization, some examples become a repeat. The amount of the total examples is reduced from 60 to 39.
1296
F. Honghai et al.
POSCa (C ) = {31,32,34,39,41,42,43,47,50,52,54,55,56,59,60} , POSFe (C ) = {39,47,48,59} , POS Na (C ) = {66,68,85,86} Obviously, the largest positive region of the partition U/C is with respect to Ca. With the improved discernibility matrix method, because the examples 31, 32, 34, 39, 41, 42, 43, 47, 50, 52, 54, 55, 56, 59, and 60 are not involved in the comparing step, only three examples 48, 49, and 53 are involved in the comparing step. Table 6. Comparing results of computational complexity of micronutrients experiments
Standard Improved discernibility matrix method discernibility matrix method amount of the examples 18 3 with C=0 amount of the examples 21 21 with C=1 amount of the attributes 7 7 times for comparing examples 18*21*7 3*21*7 Ratio 3*21*7/18*21*7=1/6=0.17/1 With the standard discernibility matrix method, attribute 4 (Ca) is involved in the attribute reduction set, and “Ca=0 → C=0” is involved in the rule set. So are involved in the improved algorithm. Namely, the rule set is the same.
6 Clinical SARS Data Experiments From the clinical SARS data set of 501 patients obtained from Beijing in 2003, we find the following rules: (1) T<=37.6˚C →1. (2) 37.7˚C <=T<=38.6˚C →1 or 2. (3) Table 7. Comparing results of computational complexity of clinical SARS data experiments
Training set
Original training set Improved method and standard method T<= 37.6˚C 37.7<=T<=38.6˚C
Total amount of examples 501 Class 1 355 Class 2 135 Class 3 11 Comparing Times (355*135+355*11+135*11)*47 Total times Ratio
= 53315*47 53315 *47
26 26
0
122 92 30
T>=38.7˚C 353 237 105 11
92*30 *47 (237*105+237*11+105*11)*47
=2760*47 =28647*47 31407*47 31407/53315=0.59/1
Using Positive Region to Reduce the Computational Complexity
1297
T>=38.7˚C →1, 2 or 3 (1, 2, and 3 means that the state of illness is slight, serious, and critical respectively). Where T denotes the highest body temperature.
7 Forest Cover Type (UCI) Data Set Experiments The forest cover type for 30 x 30 meter cells obtained from US Forest Service (USFS) Region 2 Resource Information System (RIS) data. The number of instances (observations) is 581012, the number of attributes is 54, the forest cover type (Cover_Type) designate 7 types (1 to 7). We only take the 44 binary variables into account. From the data set we find that: “Soil Type 29” =0 → “Cover_Type” = 1, 2, 3, 4, 5, 6, or 7. “Soil Type 29” =1 → “Cover_Type” = 1, 2, 5, or 7. So we can divide the whole data set into two parts: “Soil Type 29” =0 and “Soil Type 29” =1. Table 8. Comparing results of computational complexity of Forest Cover Type (UCI) Data Set
Training set
Original training set and standard method
Improved method “Soil Type 29” =0
Total amount of examples 581012
465766
Class 1 Class 2 Class 3 Class 4 Class 5 Class 6 Class 7
169928 211903 35754 2747 8362 17367 19705
Comparing Times Total times Ratio
211840 283301 35754 2747 9493 17367 20510 A
B
“Soil Type 29” =1 115246 41912 71398
1131 805 C
(B+C)/A=0.704/1
Where A=211840*(283301+35754+ 2747+9493+17367+20510)+288301*(35754+ 2747+9493+17367+20510)+35754*(2747+9493+17367+20510)+ 2747*(9493+17367 + 20510) + 9493*(17367+20510)+ 17367*20510=105170508790. B= 169928*(211903+35754+2747+8362+17367+19705)+ 211903*(35754+2747+ 8362+17367+19705)+35754*(2747+8362+17367+19705)+2747*(8362+17367+19705) + 8362*(17367+19705)+ 17367*19705=70812033350. C= 41912*(71398+1131+805)+ 71398*(1131+805)+ 1131*805=3212711591. So the total comparing times ratio is (B+C)/A=0.704/1, namely, the computational complexity is reduced by 30%. And the attribute reduction results are the same as the standard discernibility matrix method theoretically.
1298
F. Honghai et al.
8 Conclusions and Discussions (1) From the four experimental results, with the new algorithm the computational complexities are reduced by 41%, 83%, 67%, and 30% respectively, which shows that the new algorithm is general in reducing the standard discernibility matrix method. (2) The new algorithm suits the case that the partition of the decision attribute has a larger positive region to a condition attribute. The larger the positive region of the partition of the decision attribute to a condition attribute is, the more the complexity is reduced. That is the reason why the above reduction rates have such a large variance of more than 50%. Of course, the method is not suited to the case where there is not any attribute that holds a positive region of partition of the decision attribute. (3) The idea of discernibility matrix can be used to not only the whole data but also partial data. (4) The selected attribute to which the partition U/IND(D) has the largest positive region must be in one of the attribute reduction set generated by the standard discernibility matrix based. (5) Because the selected attribute holds the largest positive region of decision attribute: (a) The reduced condition attribute set contains the least amount of the condition attributes among all the reduced attribute sets generated with standard discernibility matrix method. (b) The amount of the classification rules is the least. (c) The classification accuracy should be the highest among the all the attribute reduction set generated with standard discernibility matrix attribute reduction method.
References 1. Pawlak Z.: Rough sets. Int. J. of Computer and Information Science. 1l (1982): 341-356 2. Wong S K, Ziarko W.:On optimal decision rules in decision tables. Bulletin of Polish Academy of Sciences. 33(1985) 357-362 3. Duntscb D, Gediga G.: Uncertainly measures of rough set prediction. Artificial Intelligence. 106(1998) 109-137 4. Beaubouef T, Petry F E and &ora G.: Information-theoretic measures of uncertainty for rough sets and rough relational databases. Information Science. 109(1998) 185-195 5. Wroblewski, J.: Ensembles of classifiers based on approximate reducts. Fundamenta Informaticae, IOS Press, Netherlands. 47(2001) 351-60 6. Wroblewski, J. Covering with reducts-a fast algorithm for rule generaten. Rough Sets and Current Trends in Computing. First International Conference, RSCTC'98. Proceedings, (1998) 402-7 7. Xiao, Jian-Mei. New rough set approach to knowledge reduction in decision table. Proceedings of 2004 International Conference on Machine Learning and Cybernetics, 4 (2004) 2208-2211 8. Fan Jin-song, Fang Ting-jian. Rough set and SVM based pattern classification method. Pattern Recognition and Artificial Intelligence. 13(2000) 419-423.
A Novel Mining Algorithm for Periodic Clustering Sequential Patterns Che-Lun Hung1, Don-Lin Yang2, Yeh-Ching Chung1, and Ming-Chuan Hung2 1
Department of Computer Science, Tsing Hua University, 101, Section 2 Kuang Fu Road, Hsinchu, Taiwan 30013 [email protected], [email protected] 2 Department of Information Engineering and Computer Science, Feng Chia University, 100, Wenhwa Rd., Seatwen, Taichung, Taiwan 40724 {dlyang, mchong}@fcu.edu.tw
Abstract. In knowledge discovery, data mining of time series data has many important applications. Especially, sequential patterns and periodic patterns, which evolved from the association rule, have been applied in many useful practices. This paper presents another useful concept, the periodic clustering sequential (PCS) pattern, which uses clustering to mine valuable information from temporal or serially ordered data in a period of time. For example, one can cluster patients according to symptoms of the illness under study, but this may just result in several clusters with specific symptoms for analyzing the distribution of patients. Adding time series analysis to the above investigation, we can examine the distribution of patients over the same or different seasons. For policymakers, the PCS pattern is more useful than traditional clustering result and provides a more effective support of decision-making.
1 Introduction With the tremendous growth of data volume and demanding need of knowledge management, data mining has recently become an important research topic and is receiving substantial interest from the areas of both academia and industry. Many data mining technologies have been developed in these past few years, such as the association rule, clustering, and classification, etc. The method of clustering is unsupervised and is based on the similarity of attributes to determine if they are elements of the same cluster [1,2,3,4]. Although clustering is used in many application domains, its time attribute has not been well-researched [5,6,7]. In fact, data attributes are less discussed in data mining, but the consideration of time and time-ordered correlation are becoming more and more popular, especially in the association rule [8,9,10]. There are two popular patterns of time series in data mining, the sequential pattern and periodic pattern [11,12,13,14]. Sequential patterning is a very important technology in temporal databases and is applied in science and finance applications [15,16]. Periodic patterning is equally important and more accurate in analyzing time variations than traditional sequential patterning. Periodic patterns have time slices, but sequential patterns do not. In this M. Ali and R. Dapoigny (Eds.): IEA/AIE 2006, LNAI 4031, pp. 1299 – 1308, 2006. © Springer-Verlag Berlin Heidelberg 2006
paper, we present the mining of periodic clustering sequential (PCS) patterns. It is different from the aforementioned patterns because the PCS pattern is obtained by clustering, whereas the two traditional patterns are derived from the association rule. The PCS pattern takes advantage of both the sequential and periodic patterns. We focus on clustering over varying times, rather than on the technology of clustering. The goal is to provide more efficient applications of clustering in time series. The challenge is how to use this new concept, which combines the sequential pattern and the periodic pattern using clustering, in a fast and effective manner. The PCS pattern has two properties, sequential and periodic attributes. In general, the element of time is not considered in clustering because defining the time factor is very difficult. Another reason to disregard time is that it is hard to mine efficiently when many data attributes and the time dimension are considered together. However, for the PCS pattern, clusters are made according to the sequence of periods. For example, the attribute values of a cluster that has the highest survival rate may change with the season. This will provide better diagnosis support for doctors and help control the symptoms of patients during each season to increase the survival rate.
2 Periodic Clustering Sequential Pattern

Given a time unit ti ∈ T, i = 0, …, n, and a time series T, we denote the size of T as L, i.e., L = Size(T) = |T| = n+1, and a time segment Tseg = [ti, ti+1]. A cycle is formed throughout the whole time series being measured, and there exist identical clusters with the same period length in the different time segments. The period length is the size of the time segment when the cycle is being formed. A PCS pattern is based on the cycles with the same period length.

Definition 1 (a cycle). Given a time series T whose size is L, if the period length ι is a positive integer (ι ∈ Z+), 0 < ι ≤ L, then ∃τ, 0 ≤ τ ≤ L/ι. If ∃s ∈ Z+ such that 0 ≤ s < ι, there are identical clusters at the (ι·τ + s)th time unit in the time series T. This is called a cycle, denoted by Ć = (ι, s, c), where ι is the period length, s is the cyclic position within the period length, and c is the cluster in the cycle. When the period length is known, the cycle can be denoted by Ć = (s, c).
Example 1. After clustering a dataset, we have a time series whose sequence is abdabcabfabc. Each letter of the alphabet is the index of a cluster after clustering and is mapped into the position of a Time Cube (see Section 3.1). Note that every first position of the sequence repeats the letter "a" starting at offset 0. There is a cycle that can be denoted as Ć1 = (3, 0, a). At the same time, there are two other cycles, Ć2 = (3, 1, b) and Ć3 = (6, 5, c).
Definition 2 (a PCS pattern). Given a time series T with size L and the period length ι, 0 < ι ≤ L. If there are v (v > 0) cycles, a PCS pattern P is denoted by (ι, v, Ĉ), where Ĉ = {(s, c) ∣ Ć = (ι, s, c) for ∀s, 0 ≤ s < ι} contains the v cycles. That is, the combination of all single cycles forms a PCS pattern.
Example 2. Following the time sequence in Example 1, there are two cycles (3, 0, a) and (3, 1, b), denoted by a** and b** respectively. According to Definition 2, the patterns a** and b** can be denoted as Ĉ = {(0, a), (1, b)}, and the combination of (3, 0, a) and (3, 1, b) forms a PCS pattern, denoted by (3, 2, {(0, a), (1, b)}), or ab*. In a time series pattern, every period length has the same cycle. If a cycle Ć = (ι, s, c) appears in the time series T whose size is L, then the number of occurrences of the cycle Ć is (L/ι). Notice that the PCS pattern defined above refers to the perfect PCS pattern in a time series, in the sense that all of the corresponding time units in the series are similar. This is the ideal case. In practice, most applications have noise. Therefore, we need a parameter to indicate the degree of confidence of the PCS pattern.
Definition 3. Suppose there is a cycle Ć = (ι, s, c) in a time series T of size L. The confidence F of the cycle is the number of times the cycle has appeared. The ideal confidence F is (L/ι). Notice that a perfect cycle in a time series given in Definition 1 implies F = 1. However, a perfect PCS pattern is less common, so we need to define the confidence for full and partial periodic patterns. Usually, a user may provide a minimum confidence threshold, min_conf, to indicate the minimum strength of the PCS pattern to be mined. If a PCS pattern is formed, its confidence satisfies F ≥ min_conf × (L/ι), where L is the size of the time series and ι is the period length.
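The following is a minimal Python sketch of Definitions 1–3 (our own illustration, not the authors' code): given a sequence of cluster indices and a period length, it lists the cycles whose occurrence count passes the min_conf × (L/ι) threshold.

```python
# A minimal sketch of the cycle/confidence test; names are ours, not the paper's.
from collections import Counter

def find_cycles(sequence, period, min_conf):
    """Return {(offset, cluster): confidence ratio} for cycles meeting min_conf."""
    L = len(sequence)
    ideal = L / period                      # ideal confidence F = L / period
    cycles = {}
    for s in range(period):                 # cyclic position 0 <= s < period
        counts = Counter(sequence[s::period])
        cluster, hits = counts.most_common(1)[0]
        if hits >= min_conf * ideal:        # threshold min_conf * (L / period)
            cycles[(s, cluster)] = hits / ideal
    return cycles

# Example 1's sequence: 'a' repeats at every first position of a length-3 period.
print(find_cycles("abdabcabfabc", period=3, min_conf=0.75))
# -> {(0, 'a'): 1.0, (1, 'b'): 1.0}
```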
3 An Efficient Mining Algorithm of the PCS Pattern

The main step of the PCS algorithm is to perform clustering with time. In order to mine PCS patterns, we construct a Time Cube first. Then, in accordance with the time attribute, which is mapped to a Time-index, we put the clusters into the Time Cube. In this section, we discuss how to construct a Time Cube in our algorithm and how to mine the PCS pattern. The Time Cube is used to create the Periodic Table for discovering candidate PCS patterns.

3.1 Construction of a Time Cube

Time Cubes are used to store clusters after clustering with respect to time. A Time Cube has three dimensions: Cluster-index, Time-index, and Time-period. The goal of using a Time Cube is to produce a Periodic Table, and then verify whether the candidate patterns derived from the Periodic Table are PCS patterns. The Time-period dimension is used to present data in different time periods (e.g., the summer quarter, or year 2001); a Time-index is used to indicate the data in each time unit of a period (e.g., the first quarter of year 2001); the Cluster-index discriminates all clusters in every Time-index. Using this structure can save time in verifying the discovered PCS patterns.
Figure 1 shows a Time Cube. It depicts the relationship between time-indices and time-periods. In the next section, we will show how cluster-indices are used to find PCS patterns. Here we use an example to explain how to construct the Time Cube of Figure 1. Example 3: Assume a stock database contains business information for some companies from 1970-1998, and there are five attributes: stock type, time, the highest stock value of the season, the lowest stock value of the season, and the quantity of trade during the season. Suppose the min_conf is 0.75. The Time Cube shown in Figure 1 is a clustering with respect to the time attribute. We use the Time-index to cluster data in separate datasets. The first one is the first season of the first year; then we do the same step for the second season of the first year, and so on. This process aims at clustering data in different time segments that are independent of each other. When the periodic index becomes the second year, an additional process starts that compares the cluster of the first season of the first year with that of the first season of the second year. If these two seasons have similar results, the first season of the second year is stored under the same Cluster-index as the first season of the first year. If not, it is kept under a different Cluster-index. Assume there exist three clusters in the first season of the first year, and the cluster-indices are a, b, and c, respectively. We compare the three clusters of the first season of the second year with the clusters of the first season of the first year. If all the clusters are different, the clusters of the first season of the second year will be saved with new cluster indices d, e, and f. If two second-year clusters are the same as a and b, the cluster indices used in the second year would be a, b, and d. Each dice represents the result of a season in each year after clustering. If it has no cluster in the Cluster-index, it is denoted by 0 in the Time Cube. Otherwise, it is denoted as * in the Time Cube and has at least a cluster with a Cluster-index.
Fig. 1. A time cube
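A hedged sketch (ours, not the authors' implementation) of the Cluster-index assignment just described: clusters of each period are compared with the clusters already seen at the same Time-index, reusing an index when they match and assigning a new letter otherwise. The `similar(c1, c2)` predicate is a placeholder for whatever cluster-comparison criterion the analyst chooses.

```python
# Sketch of filling a Time Cube with Cluster-index letters (Example 3 logic).
import string

def build_time_cube(clusters_by_period, similar):
    """clusters_by_period[p][t] = list of cluster descriptions for period p, time-index t.
    Returns cube[p][t] = list of cluster-index letters ('a', 'b', ...)."""
    next_label = iter(string.ascii_lowercase)
    known = {}                                   # time-index -> [(label, cluster), ...]
    cube = []
    for period in clusters_by_period:
        slice_labels = []
        for t, clusters in enumerate(period):
            labels = []
            for c in clusters:
                seen = known.setdefault(t, [])
                match = next((lab for lab, old in seen if similar(c, old)), None)
                if match is None:                # unmatched cluster gets a new index
                    match = next(next_label)
                    seen.append((match, c))
                labels.append(match)
            slice_labels.append(labels)
        cube.append(slice_labels)
    return cube
```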
3.2 Construction of a Periodic Table

A Periodic Table is derived from a Time Cube. The table records the number of clusters in the corresponding Cluster-index, indicating that the clusters in each Cluster-index have the same characteristics. The construction of a Periodic Table is shown in Figure 2.
Fig. 2. A Periodic Table (rows: Cluster-index a, b, c, d, …; columns: Time-index, 1st-4th season; each cell records the number of clusters with that Cluster-index in that season)
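A minimal sketch (ours) of this compression step: the Periodic Table simply counts, for every (Time-index, Cluster-index) cell, how many time-periods of the cube contain a cluster with that index. It assumes the nested-list cube layout produced in the sketch above.

```python
# Compress a Time Cube into a Periodic Table of counts.
from collections import defaultdict

def periodic_table(cube):
    table = defaultdict(int)                 # (time_index, cluster_index) -> count
    for period in cube:                      # e.g., one entry per year
        for t, labels in enumerate(period):  # e.g., one entry per season
            for label in set(labels):
                table[(t, label)] += 1
    return table
```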
3.3 Mining PCS Patterns

Adopting the principle of the Apriori algorithm [17], we develop the process of finding PCS patterns. First, one-cycle PCS patterns are mined based on the occurrence of patterns in the Periodic Table. A one-cycle PCS pattern comes from Definition 2, P = (ι, v, Ĉ), where v = 1 and the period length ι is the size of the Time-index, such that a one-cycle PCS pattern is formed if the number of occurrences of a pattern P is greater than min_conf. According to Apriori, for k ≥ 1, the (k+1)-cycle PCS pattern is derived from the k-cycle PCS pattern, i.e., a (k+1)-cycle is a candidate pattern if all of its k-cycle subsets are k-cycle patterns. All the (k+1)-cycle candidates can be verified by scanning the Time Cube once. Mining the relationship of cluster sequences is the target of the PCS algorithm. The mining process is as follows:
Step 1: Clustering with the temporal attribute. The first step is to cluster all valid datasets. One can use the K-mode algorithm or any other appropriate algorithm to cluster a dataset based on the time segment. In this paper, we do not illustrate the process of clustering; the algorithm selection is based on the attributes of the datasets.

Step 2: Building a Time Cube. This step puts all clusters from Step 1 into a Time Cube with corresponding Time-indices and Time-periods. After the first round, the clusters of the nth time-period will be compared with the clusters of the (n-1)th time-period. If there are discrepancies, different cluster-indices will be used for the unmatched clusters of the nth time-period.
Step 3: Construction of a Periodic Table. The Time Cube is compressed into a Periodic Table according to the Time-period and the number of clusters that have the same Cluster-indices on the corresponding Time-indices.

Step 4: Mining the PCS pattern.

P1 = One-cyclePCSpattern();              /* Find frequent one-cycles in the Periodic Table */
for (i = 2; i < ι && Pi-1 ≠ {}; i++) {
    PCSCi = CreateCandidatePatterns(i);  /* Create all i-cycle candidate patterns in the Periodic Table */
    Pi = VerifyPattern(PCSCi);           /* Verify all i-cycle candidate patterns to find frequent PCS patterns in the Time Cube */
}
Return all P;                            /* Return all frequent PCS patterns */
The three functions used in the algorithm of Step 4 are explained below.

(1) One-cyclePCSpattern:

For i = 0 to Time-Period length - 1 {
    For j = 0 to Time-Index - 1 {
        For k = 'a' to Cluster-Index
            Ć(Time-Index, j, k)++;
    }
    if Count of Ć(Time-Index, j, k) >= min_conf
        Ć(Time-Index, j, k) is saved in temp;
}
Return temp;

This function finds frequent 1-cycles in the Periodic Table. A 1-cycle can be denoted as Ć = (ι, s, c) with a confidence no less than min_conf, where ι is the period length (the size of the Time-index), s is the cyclic position within the period length, and c is the cluster in the cycle. The confidence is based on the number of clusters in the Periodic Table with a given Cluster-index.

Example 4. According to Figure 1 and Figure 2, suppose min_conf = 0.75 and the size of the time-period = 4. Any cell with a count no less than 4 × 0.75 = 3 will pass the test in the Periodic Table. In Figure 2, three 1-cycles, (Time-index = 1, Cluster-index = b), (Time-index = 2, Cluster-index = b), and (Time-index = 3, Cluster-index = c), pass the cycle test. That is, Ć0 = (4, 1, b), Ć1 = (4, 2, b), and Ć2 = (4, 3, c), where Ći denotes the ith Ć. The output from the algorithm is P1 = {PCSC01 = (4, 1, {(1,b)}), PCSC11 = (4, 1, {(2,b)}), PCSC21 = (4, 1, {(3,c)})}, where PCSC is the candidate PCS pattern and PCSCji denotes the jth candidate pattern in the ith cycle.
(2) CreateCandidatePatterns:
For j = 0 to Pi-Count - 1 {
    For l = 1 to Pi-Count
        PCSCki = Pji join Pli;
    if every element of PCSCki exists in Pi-1
        k++;
    else
        delete PCSCki;
}
return all PCSCki;

As in the Apriori algorithm, if a k-cycle is frequent (i.e., it passes min_conf), all of its m-cycles for m < k must be frequent as well. Using this property to create a candidate PCS pattern, denoted by PCSCi where i ≥ 1, we can form a set of (i-1)-cycle PCS patterns Pi-1. This procedure includes two phases: a join phase and a prune phase. The join phase is completed over the (i-1)-cycle PCS pattern set Pi-1 to form an i-cycle PCSC pattern set PCSCi. The prune phase discards the PCSC patterns in PCSCi containing some (i-1)-cycle sub-PCS-patterns that are not in Pi-1. See Example 5.
Example 5. We continue using the 1-cycle PCS pattern, P1, obtained in Example 4. The join phase of P1 and P1 yields 2-cycle PCSC patterns PCSC02 = (4, 2, {(1,b), (2,b)}), PCSC12 = (4, 2, {(2,b), (3,c)}), and PCSC22 = (4, 2, {(1,b), (3,c)}). Suppose the 2-cycle PCS patterns found after verification are P02 = PCSC12 and P12 = PCSC22. The only candidate for the 3-cycle PCS pattern formed from P02 and P12 is PCSC03 = (4, 3, {(1,b), (2,b), (3,c)}). This PCSC pattern is pruned because a sub-pattern of PCSC03, (4, 2, {(1,b), (2,b)}), does not belong to the set of 2-cycle PCS patterns.

(3) VerifyPattern:
For j = 0 to PCSCi-Count - 1 {
    For i = 0 to Time-Period length - 1 {
        if PCSCji appears in each time-slice of the Time Cube
            Count++;
    }
    if Count >= min_conf
        PCSCji is saved in temp;
}
return temp;

VerifyPattern checks the Time Cube to verify whether or not the patterns posted in the i-cycle PCSC pattern set are frequent. After deriving the i-cycle PCSC patterns, we verify whether such candidates form real i-cycle patterns in the Time Cube. This is done by checking if the number of simultaneous occurrences of the i-cycles in a
PCSC pattern in the time-period is not less than the min_conf. When a pattern is confirmed, all of its sub-patterns are eliminated from the pattern lists of fewer cycles.
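The following Python sketch (ours, not the authors' code) puts Step 4 and the three helper functions together in an Apriori-style loop. It assumes the dictionary Periodic Table and nested-list Time Cube layouts used in the earlier sketches; a pattern is represented as a frozenset of (time_index, cluster_index) cycles.

```python
# Hedged sketch of the PCS mining loop (Step 4 plus the three helper functions).
from itertools import combinations

def one_cycle_patterns(table, n_periods, min_conf):
    """Frequent 1-cycles: Periodic Table cells with count >= min_conf * n_periods."""
    return {frozenset([cell]) for cell, count in table.items()
            if count >= min_conf * n_periods}

def create_candidates(frequent, k):
    """Join (k-1)-cycle patterns; prune candidates with an infrequent (k-1)-subset."""
    candidates = set()
    for p, q in combinations(frequent, 2):
        joined = p | q
        if len(joined) == k and all(
                frozenset(sub) in frequent for sub in combinations(joined, k - 1)):
            candidates.add(joined)
    return candidates

def verify(candidates, cube, min_conf):
    """Keep candidates whose cycles co-occur in enough time-periods of the Time Cube."""
    verified = set()
    for cand in candidates:
        hits = sum(1 for period in cube
                   if all(c in period[t] for t, c in cand))
        if hits >= min_conf * len(cube):
            verified.add(cand)
    return verified

def mine_pcs(table, cube, min_conf):
    patterns, frequent, k = [], one_cycle_patterns(table, len(cube), min_conf), 2
    while frequent:
        patterns.append(frequent)
        frequent = verify(create_candidates(frequent, k), cube, min_conf)
        k += 1
    return patterns
```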
4 Experimental Results

Since stock research is a good target for time series mining [18], a stock database is used for our experiment. The stock database contains stock exchange information of many companies from 1970-1998, containing approximately 120,000 transactions. After pre-processing the source data, we apply our algorithm to mine PCS patterns for providing decision-making support. We use the K-mode algorithm in the experiment. First, we transform numeric data to categorical data as shown in Figure 4 by using the transform rules in Figure 5. The fields of the transformed data in Figure 4 (from left to right) are stock id, change rate, open price, highest price, lowest price, trade volume, year, month and week. The date is transformed by using the first week of 1996 as the base. Hence the first week in the first month of 1996 is denoted as "1, 1, 1". The second week is "1, 1, 2".

"2016,2,2,2,2,1,1,1,1"
"2207,7,6,6,6,1,1,1,1"
"2014,8,2,2,2,4,1,1,1"
"2006,2,3,3,3,3,1,1,1"
"2012,1,2,2,2,2,1,1,1"

Fig. 4. Transformed data
      Change rate (Average)   Open price (Average)   Highest price in a month   Lowest price in a month   Trade volume
1     0%~ -1%                 0~10                   0~10                       0~10                      0~500
2     -1%~ -2%                10~20                  10~20                      10~20                     500~1000
3     -2%~ -3%                20~30                  20~30                      20~30                     1000~1500
4     -3%~ -4%                30~40                  30~40                      30~40                     1500~2000
5     -4%~ -5%                40~50                  40~50                      40~50                     2000~2500
6     -5%~ -7%                50~60                  50~60                      50~60                     2500~3000
:     :                       :                      :                          :                         :
12    5%~ 7%                  110~                   110~                       110~                      5500~

Fig. 5. The transformation rules (Price unit is NT$, Volume unit is 1,000)
Here, the min_conf is set to 0.75. So any pattern with a count number not less than 28 × 0.75 = 21 (the database contains twenty-eight years of data, i.e., 112 seasons, and the period length is four because one year has four seasons) will pass the test. In the PCS discovery process, clusters satisfying the min_conf are in two segments, the first season and the second season. The number of clusters in the first season is twelve and the number is ten in the second season. The final result is shown in Figure 6.
Item   First season   Second season   Third season   Fourth season
1      7,5,5,5,1      8,2,2,2,2       X              X
2      7,6,6,6,2      3,4,4,4,1       X              X
3      3,4,4,4,1      2,8,2,2,2       X              X
4      4,5,5,5,6      2,2,2,2,8       X              X
5      8,5,5,5,1      7,6,6,6,2       X              X

Fig. 6. The result of 1-cycle PCS patterns
Item   First season   Second season
1      7,5,5,5,1      8,2,2,2,2
2      8,5,5,5,1      3,4,4,4,1
3      3,4,4,4,1      3,4,4,4,1
4      1,5,5,5,1      7,6,6,6,2

Fig. 7. The result of 2-cycle PCS patterns
For the 2-cycle PCS patterns, it can be seen from Item 1 that when cluster {7,5,5,5,1} occurs in the first season, cluster {8,2,2,2,2} is in the second season. Using the transformation rules of Figure 5, categorical data can be converted back to numeric data for the resultant rule. Using the PCS pattern can really provide better decision support for investors and related decision makers.
5 Conclusion and Future Work

In this paper, we present a new concept of time series, the Periodic Clustering Sequential pattern, to illustrate time-related clustering. We develop two mechanisms, the Time Cube and the Periodic Table, and use the principle of Apriori to make the candidate pattern generation process more efficient. Although temporal data mining has been applied in many applications, such as the periodic and sequential patterns, the PCS pattern can provide further analysis of time-related clusters. Presently, we apply the PCS pattern as well as the simplified CS pattern to mine medical information. The attributes come from patients' profiles and examination records. We hope that it can help doctors diagnose patients more effectively. However, more patient data are needed from long-term observations to cluster according to various time periods (such as seasons). In addition, being able to select relevant attributes to illustrate the relationship of seasonal illnesses can benefit even more people.

Acknowledgements. This research was partially supported by the National Science Council, Taiwan, under grant number NSC94-2218-E-007-057.
References
1. K. Alsabti, S. Ranka, V. Singh, An Efficient K-Means Clustering Algorithm, PPS/SPDP Workshop on High Performance Data Mining, 1997.
2. K. Alsabti, S. Ranka, V. Singh, An Efficient Space-Partitioning Based Algorithm for the K-Means Clustering, 1997.
3. Zhexue Huang, Clustering Large Data Sets With Mixed Numeric and Categorical Values, In Proceedings of The First Pacific-Asia Conference on Knowledge Discovery and Data Mining, Singapore, World Scientific, 1997.
4. Zhexue Huang, A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining, SIGMOD'97 Data Mining Workshop.
5. Keogh, E., Lin, J., Truppel, W., Clustering of time series subsequences is meaningless: implications for previous and future research. In Data Mining, 2003. ICDM 2003. Third IEEE International Conference, pp. 115-122.
6. Wencai Liu, Yu Luo, Applications of clustering data mining in customer analysis in department store, in Services Systems and Services Management, 2005. Proceedings of ICSSSM '05. 2005 International Conference, pp. 1042-1046.
7. Chun-Hao Chen, Tzung-Pei Hong, Tseng, V.S., Analyzing time-series data by fuzzy data-mining technique. In Granular Computing, 2005 IEEE International Conference, pp. 112-117.
8. Ozden, B., Ramaswamy, S., and Silberschatz, A., Cyclic Association Rules. In Proc. of 1998 Int. Conf. Data Engineering (ICDE'98), pp. 412-421, 1998.
9. Chen, X., Petrounias, I. and Heathfield, H., Discovering Temporal Association Rules in Temporal Databases, In Proc. International Workshop on Issues and Applications of Database Technology (IADT'98), pp. 312-319, 1998.
10. Ye, S. and Keane, J.A., Mining Association Rules in Temporal Databases, In Proc. International Conference on Systems, Man and Cybernetics, IEEE, New York, pp. 2803-2808, 1998.
11. H. Mannila, H. Toivonen, and I. Verkamo, Discovery of Frequent Episodes in Event Sequences, DMKD Journal, 1(3): pp. 259-289, 1997.
12. T. Oates, A Family of Algorithms for Finding Temporal Structure in Data, In 6th Workshop on AI and Statistics, 1997.
13. Han, J., Dong, G. and Yin, Y., Efficient Mining of Partial Periodic Patterns in Time Series Database. In Proc. Fifteenth International Conference on Data Engineering, Sydney, Australia. IEEE Computer Society, pp. 106-115, 1999.
14. S. Parthasarathy, M. J. Zaki, M. Ogihara, S. Dwarkadas, Incremental and Interactive Sequence Mining, ACM, CIKM'99, 1999.
15. Patel, P., Keogh, E., Lin, J., Lonardi, S., Mining motifs in massive time series databases, In Data Mining, 2002. ICDM 2002. Proceedings. 2002 IEEE International Conference, pp. 370-377.
16. Povinelli, R.J., Xin Feng, A new temporal pattern identification method for characterization and prediction of complex time series events, In Knowledge and Data Engineering, IEEE Transactions, Vol. 15, No. 2, pp. 339-352, 2003.
17. R. Srikant and R. Agrawal, Mining Sequential Patterns: Generalizations and Performance Improvements, In 5th EDBT, 1996.
18. Xiao-Ye Wang, Zheng-Ou Wang, Stock market time series data mining based on regularized neural network and rough set, In Machine Learning and Cybernetics, 2002. Proceedings. 2002 International Conference, pp. 315-318.
Text Mining with Application to Engineering Diagnostics

Liping Huang and Yi Lu Murphey

Department of ECE, 4901 Evergreen Road, Dearborn, Michigan, USA
{yilu, lilyping}@umich.edu
Abstract. Our research focuses on text document mining with an application to engineering diagnostics. In the automotive industry, auto problem descriptions are often used as the first step of a diagnostic process that maps the problem descriptions to diagnostic categories such as engine, transmission, electrical, brake, etc. This mapping of problem descriptions to diagnostic categories is currently being done manually by mechanics who perform this task largely based on their memory and experience, which usually leads to lengthy repair processes, less accurate diagnostics and unnecessary part replacement. This paper presents our research in applying text mining technology to the automatic mapping of problem descriptions to the correct diagnostic categories. We present our results through the study of a number of important issues relating to text document classification, including term weighting schemes, LSA and similarity functions. A text document categorization system is presented, and it has been tested on a large test dataset collected from auto dealers. Its system performance is very satisfactory. Keywords: data mining and knowledge, intelligent systems.
1 Introduction

As computers become more powerful and data storage devices become more plentiful, the amount of information in digital form has dramatically increased. Many organizations have been collecting large amounts of information during daily operations, but most of them are unable to extract useful information from the data to improve their operations due to the lack of automatic tools for data mining. In the automotive industry there is abundant information available in human language description form that contains valuable vehicle diagnostic knowledge, marketing information, consumer evaluation or satisfaction of certain vehicle models, styles, accessories, etc. We are particularly interested in applying text mining to engineering diagnostics. In the automotive industry, several thousand auto problems per week are reported to various dealership shops. In a typical diagnostic process, problems are first described, in casual natural language, by either customers or maintenance technicians. Based on the description, the vehicle problem is assigned a specific diagnostic code. Each diagnostic code represents a problem category of vehicles, which is used to guide the diagnostic process in search for the true causes of the problems, and helps in solving automotive problems as well. Problem descriptions are then mapped to diagnostic
categories such as engine, transmission, electrical, etc. The diagnostic code is then used to guide the further diagnosis of the true causes of the vehicle problem within the category. This mapping of problem descriptions to diagnostic categories is currently being done manually by mechanics who perform this task largely based on their memory and experience, which usually leads to lengthy repair processes, less accurate diagnostics and unnecessary part replacement. Hence, new techniques are in high demand to correctly and automatically map problem descriptions to problem categories. Our research is focused on applying text mining technology to automatically categorizing text documents with applications to engineering fault diagnostics. When a customer states an automotive problem, the description, such as "there is a noise heard in the engine and the engine runs rough", may be mapped to different diagnostic codes according to the mechanics' experience, which usually point in various directions for the repair procedure. The text descriptions of vehicle problems pose another challenge: data are often ill-structured, text descriptions often do not follow English grammar, and they often contain many self-invented acronyms and shorthand descriptions. One example is given below. Customer 1: "WENT ON A SALES ROAD TEST WITH CUST, VEHICLE WOULDNOT START," Customer 2: "CHECK CAR WONT START," Customer 3: "CK BATTERY HARD TO START." These three descriptions are expected to be mapped to the same diagnostic category, whose definition typically is: ENGINE WOULD NOT START. From this example we can see a number of problems including synonymy: there are many different ways to describe the same vehicle problem. It is another important task for us to look for the most appropriate code based on multiple symptoms. In addition, we are dealing with other problems such as poor wording and self-invented acronyms. Polysemy is also a problem; for example, the word "light" could mean a number of things in vehicle problem descriptions. Finally, there are thousands or more automotive problems reported to the mechanical department every week, and manual mapping by mechanics is time-consuming and error prone. When new codes are generated and added to the diagnostic system with vehicle development, it requires a longer time for the mechanics to accommodate the system update. In this paper, we present an automatic text categorization system, DKE&TC (Diagnostic Knowledge Extraction and Text Classification), developed based on the vector space model. DKE&TC has two components: DKE is a component that has functions for learning diagnostic categorization knowledge from a corpus of diagnostic text documents; the TC component then applies the learnt knowledge to categorizing a customer's description of vehicle problems to diagnostic codes. We also present our study of a number of important issues relating to text document classification, including term weighting schemes, LSA and similarity functions. The proposed system has been trained on more than 200,000 documents over more than 50 diagnostic categories and tested over 6,000 documents.
2 DKE&TC: A System for Automatic Learning of Diagnostic Knowledge from Text Documents

In automotive diagnostics, much like other engineering diagnostics, vehicle problems are organized into different categories. We are trying to develop a system, DKE&TC, that automatically maps a problem description, q, to the correct diagnostic categories. The DKE&TC consists of two stages (see Figure 1): LFT (Learning From Text) and DTC (Diagnostic Text Classification). The LFT system is the machine learning stage that attempts to extract diagnostic knowledge from text documents. It consists of two major components within our research interest: a Term-Category-Weight (TCW) matrix and the matrix obtained through the well-known Latent Semantic Analysis (LSA), which uses SVD to obtain a weighted matrix. The output from the LFT system is a matrix representing document categories. The second stage transforms the input problem description q into a document vector and finds the best matched diagnostic categories using the matrices generated by the LFT system.

Fig. 1. System architecture of Diagnostic Knowledge Extraction and Text Classification (DKE&TC). (LFT: training data, training module, TCW matrix, SVD module; DTC: input query, query vector generating module, weighted query vector, similarity function, top K matched categories.)

2.1 Learning from Text (LFT)

The Learning from Text (LFT) component is developed based on the vector space model (VSM) [1,10,11,12]. It is one of the classic models in the area of information retrieval and text mining. In the vector space model, text objects are modeled as elements of a vector space. That is, documents and user queries are described as sets of representative keywords called index terms. An index term is simply a word or a token which can help in understanding the main theme of the document. After index
terms are identified and extracted from the document set, they need to be weighted according to their importance for summarizing the contents of a document. There are different ways to compute the index term weights [3]. The most commonly used weight scheme is tf-idf (term frequency-inverse document frequency). However, depending on the document collection, the results of different weight schemes are not quite consistent. After the documents and the query are represented as vectors in the vector space, the documents are ranked with respect to the query according to a similarity measure. LFT consists of two major computational steps: document indexing and constructing an effective term by diagnostic category matrix. Let us assume a collection of diagnostic text documents is to be classified into N categories (C1, C2, …, CN), and we have training documents Tr1, Tr2, …, TrN, where Tri contains the training documents belonging to category i, i = 1, …, N. The following subsections describe the two computational steps.

2.1.1 Document Indexing Process

First we generate an index term list T_L, where each term ti ∈ T_L is extracted from all documents in Tr, where Tr = Tr1 ∪ … ∪ TrN. The T_L is then processed by the following steps:
Stopping word (function word) removal: Since stopping words are unlikely to be useful for document matching, they are removed from T_L. Examples of stopping words: the, from, to, ….

Stemming words: Morphological variants of the same lexical item are merged into a single root. For example, leak, leaks, leaking and leaked are all stemmed to leak.

Selecting frequently occurring words: Select a document threshold d_th and a word threshold w_th. A word w is frequent if and only if it occurs in more than d_th documents and more than w_th times in all documents in Tr. Infrequent words are removed from T_L.

2.1.2 Constructing an Effective Term-Category-Weight (TCW) Matrix

In text document categorization, the term by document category matrix is considered one of the most important steps with respect to system performance. The entries of the matrix are usually weighted according to terms' importance for a particular document and for the whole collection [2, 3]. In the proposed system, the TCW matrix is constructed as follows. The TCW matrix A has dimensions M by N, where M is the number of terms in T_L and N is the number of diagnostic categories. The entries of A, $a_{i,j} = tf_{ij} \cdot G_i$, are the multiplication of local and global weights defined as follows: $tf_{ij}$ is the frequency of the ith term in T_L occurring in the training data Trj for diagnostic category j, and $G_i$ is the inverse document frequency defined as

$G_i = \log_2\left[\frac{ndocs}{df_i}\right] + 1$,

where ndocs is the number of documents in the entire training data Tr, and $df_i$, the document frequency of term i, is defined as the total number of documents in Tr that contain term i.
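A minimal Python sketch (ours, simplified; stemming omitted and the stop-word list illustrative only) of this indexing and TCW construction:

```python
# Build a term list and a Term-Category-Weight matrix with tf_ij * (log2(ndocs/df_i) + 1).
import math
from collections import Counter

STOP_WORDS = {"the", "a", "an", "from", "to", "and", "of"}   # illustrative subset

def build_tcw(docs_by_category, d_th=1, w_th=1):
    all_docs = [d for docs in docs_by_category for d in docs]
    ndocs = len(all_docs)
    df, tf_total = Counter(), Counter()              # document and total term frequencies
    for doc in all_docs:
        tokens = [w for w in doc.lower().split() if w not in STOP_WORDS]
        tf_total.update(tokens)
        df.update(set(tokens))
    terms = sorted(t for t in df if df[t] > d_th and tf_total[t] > w_th)
    A = [[0.0] * len(docs_by_category) for _ in terms]
    for j, docs in enumerate(docs_by_category):      # one column per diagnostic category
        tf_j = Counter(w for doc in docs for w in doc.lower().split())
        for i, t in enumerate(terms):
            G_i = math.log2(ndocs / df[t]) + 1
            A[i][j] = tf_j[t] * G_i
    return terms, A
```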
The local weight is used to evaluate the importance of each term in each category. However, sometimes less frequently used terms can be more critical in distinguishing one category from another. For example, in a document "check fuel leak, check and advise", the term "leak" is probably more meaningful than "check" although "check" occurs more often than "leak". The global weight is therefore introduced to reflect the overall importance of the index term in the whole document collection. The idea is that a term that occurs rarely should have a high weight and frequently occurring terms should be weighted low. Taken to the extreme, stop words such as "the", "an", "a", which appear in almost all documents, should ideally receive a global weight of 0. The column vectors in the TCW matrix A can be interpreted as representative vectors for the diagnostic categories. Another important type of term by diagnostic category matrix is constructed using the latent semantic indexing (LSI) method [4, 5, 6], which is a variant of the vector space model. LSI is a statistical model of word usage that permits comparisons of the semantic similarity between pieces of textual information. While assuming the existence of some underlying or "latent" semantic structure according to the overall term usage pattern that is partially obscured by variability in term choice [5], LSA was originally designed to improve the effectiveness of information retrieval methods by performing text retrieval based on the derived "semantic" content of words in a query as opposed to performing direct word matching. This approach avoids some of the problems of synonymy and polysemy. A truncated singular value decomposition (SVD) is used to build a lower dimensional space which is associated with this "latent" semantic structure. Each document and query is mapped into this space. Relevance judgments for the user's query are also performed in this space. In our model, we decomposed a weighted term frequency matrix A into the product of three matrices [3, 4]:

$A = U \Sigma V^T$, where $U^T U = V^T V = I_n$ and $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_n)$, $\sigma_i > 0$ for $1 \le i \le r$, $\sigma_j = 0$ for $j \ge r+1$.

Matrices U and V contain the left and right singular vectors of A, respectively, and the diagonal matrix Σ contains the singular values of A. If only the k largest singular values of Σ are kept along with their corresponding columns in the matrices U and V, we obtain new matrices Uk, Vk with reduced dimension. A rank-k approximation to A is constructed with the following formula:

$A \approx A_k = U_k \Sigma_k V_k^T$.

Ak is the unique matrix of rank k that is closest in the least squares sense to A.

2.2 Diagnostic Text Classification (DTC)

The DTC component has the following two major computational steps: formulating a query document vector based on the input problem description, and determining the similarity between the query document and the diagnostic categories. An input problem description, d, is first transformed into a pseudo-document within the reduced term-category space. Namely, d is transformed to q, which is a term vector with the same length M as T_L, $q = (q_1, \ldots, q_M)^T$, where $q_i$ is the frequency of the ith term of T_L occurring in the query document d, i = 1, …, M. The decision on the diagnostic
category of q is made based on the similarity between q and each column vector of the matrix A. Let the column vectors of A be $a_j = (a_{1j}, \ldots, a_{Mj})^T$, j = 1, …, N. We have investigated three different similarity measures between the query vector q and the TCW matrix A. The first similarity function of the query vector and a diagnostic category is measured by the cosine measure [7], denoted as follows:

$r_j^c(q, a_j) = \frac{q \cdot a_j}{\|q\|\,\|a_j\|} = \frac{\sum_{i=1}^{M} q_i a_{ij}}{\sqrt{\sum_{i=1}^{M} q_i^2}\,\sqrt{\sum_{i=1}^{M} a_{ij}^2}}$
Pearson's correlation coefficient is another measure of the extent to which two vectors are related [8]. The formula for Pearson's correlation coefficient takes on many forms. A commonly used formula is shown below:

$r_j^p(q, a_j) = \frac{M\sum_{i=1}^{M} q_i a_{ij} - \sum_{i=1}^{M} q_i \sum_{i=1}^{M} a_{ij}}{\sqrt{\left[M\sum_{i=1}^{M} q_i^2 - \left(\sum_{i=1}^{M} q_i\right)^2\right]\left[M\sum_{i=1}^{M} a_{ij}^2 - \left(\sum_{i=1}^{M} a_{ij}\right)^2\right]}}$
The third measure we investigated in our system is the Spearman rank correlation coefficient. It provides a measure of how closely two sets of rankings agree with each other. The Spearman rank correlation coefficient is a special case of the Pearson correlation coefficient in which the data are converted to ranks before calculating the coefficient. It can be used when the distribution of the data makes the latter undesirable or misleading [9]. The formula for Spearman's rank correlation coefficient is:

$r_j^s = 1 - \frac{6\sum_{i=1}^{M} d_i^2}{M(M^2 - 1)}$

where $d = (d_1, \ldots, d_M)^T$ is the difference vector between the ranks of corresponding values of q and $a_j$.
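The following hedged Python sketch shows the DTC stage described above: the query is turned into a term-frequency vector and the TCW columns are ranked by one of the three similarity measures. The `terms` and `A` inputs are assumed to come from the TCW sketch earlier; scipy is used for Pearson and Spearman purely for brevity (our choice, not something the paper specifies).

```python
# Rank diagnostic categories for a problem description using cosine, Pearson or Spearman.
import math
from collections import Counter
from scipy.stats import pearsonr, spearmanr

def query_vector(description, terms):
    counts = Counter(description.lower().split())
    return [counts[t] for t in terms]

def cosine(q, a):
    dot = sum(x * y for x, y in zip(q, a))
    norm = math.sqrt(sum(x * x for x in q)) * math.sqrt(sum(y * y for y in a))
    return dot / norm if norm else 0.0

def top_categories(description, terms, A, measure="cosine", k=3):
    q = query_vector(description, terms)
    columns = list(zip(*A))                       # one column vector per category
    if measure == "cosine":
        scores = [cosine(q, list(a)) for a in columns]
    elif measure == "pearson":
        scores = [pearsonr(q, a)[0] for a in columns]
    else:                                         # Spearman: Pearson on ranks
        scores = [spearmanr(q, a).correlation for a in columns]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
```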
3 Experiment Results

The proposed DKE&TC system has been fully implemented. The LFT program used a training data set of 200,000 documents that describe 54 different categories of auto problems to construct a TCW matrix A with dimensions 3972x54. Based on practical considerations, for each input query document d, the DKE&TC finds the three diagnostic categories that best match d. The performance measure is the
percentage of the queries in the test set whose true categories are contained in the matched categories returned by DKE&TC. The performances of the DKE&TC system and the SVD system on the test set of 6,000 queries are shown in Figure 2(a). The retrieval performance of SVD is heavily affected by its value of rank K. We explored the effects of varied K values used in SVD and compared them with the DKE&TC's performance. It appears that the best performance was achieved with K equal to 14. However, the DKE&TC system outperformed the best SVD system by more than 11%.
Fig. 2. Performance evaluation. (a) Performance of the proposed DKE&TC system vs. SVD with various K values (% of accuracy vs. K value). (b) Comparison of DKE&TC with various weight schemes and similarity functions (cosine, Pearson, Spearman).
We also compared the performance of the DKE&TC system with other weighting schemes commonly used in text document categorization, and the results are shown in Figure 2(b) along with the performances of the DKE&TC systems that used three different similarity functions: cosine, Pearson, and Spearman. The definitions of these weight schemes are given in Table 1, in which tfij is the frequency of term i occurring in document category j, dfi is the total number of document categories in the training data that contain term i, gfi is the global frequency at which term i occurs in the entire training data, and ndocs is the total number of document categories in the entire collection. In addition, we define:
$p_{ij} = \frac{tf_{ij}}{gf_i}$, and $B_{ij} = \begin{cases} 0 & \text{if } tf_{ij} = 0 \\ 1 & \text{if } tf_{ij} > 0 \end{cases}$

Figure 3 shows that the DKE&TC systems that used the Pearson and cosine similarity functions gave the best performances. Table 2 shows three examples of document queries and the diagnostic categories classified by the DKE&TC system. The first column lists the three examples of input queries and the right column lists the three top matched diagnostic categories for each input query. The correct categories are highlighted.
Table 1. Definitions of various weight schemes

entropy:      $tf_{ij} \cdot \left(1 - \sum_j \frac{p_{ij}\log(p_{ij})}{\log(ndocs)}\right)$
Gfidf:        $tf_{ij} \cdot \frac{gf_i}{df_i}$
Normal:       $tf_{ij} \cdot \frac{1}{\sqrt{\sum_j tf_{ij}^2}}$
B-idf:        $B_{ij} \cdot \left(\log\left[\frac{ndocs}{df_i}\right] + 1\right)$
B-normal:     $B_{ij} \cdot \frac{1}{\sqrt{\sum_j tf_{ij}^2}}$
Bg-idf:       $\log(tf_{ij}+1) \cdot \left(\log\left[\frac{ndocs}{df_i}\right] + 1\right)$
Log-entropy:  $\log(tf_{ij}+1) \cdot \left(1 - \sum_j \frac{p_{ij}\log(p_{ij})}{\log(ndocs)}\right)$
log-Gfidf:    $\log(tf_{ij}+1) \cdot \frac{gf_i}{df_i}$
log-norm:     $\log(tf_{ij}+1) \cdot \frac{1}{\sqrt{\sum_j tf_{ij}^2}}$
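To make a few of the schemes in Table 1 concrete, here is a short Python sketch of our own (not from the paper) computing three of the weightings for a single term's row of category frequencies; the function names and the use of the natural logarithm are our assumptions.

```python
# Illustrative weight-scheme computations for one term's frequencies across categories.
import math

def entropy_weight(tf_row):
    """entropy scheme: tf_ij * (1 - sum_j p_ij*log(p_ij) / log(ndocs))."""
    ndocs, gf = len(tf_row), sum(tf_row)
    ent = sum((t / gf) * math.log(t / gf) for t in tf_row if t > 0)
    return [t * (1 - ent / math.log(ndocs)) for t in tf_row]

def gfidf_weight(tf_row):
    """Gfidf scheme: tf_ij * gf_i / df_i."""
    gf, df = sum(tf_row), sum(1 for t in tf_row if t > 0)
    return [t * gf / df for t in tf_row]

def log_norm_weight(tf_row):
    """log-norm scheme: log(tf_ij + 1) / sqrt(sum_j tf_ij^2)."""
    norm = math.sqrt(sum(t * t for t in tf_row))
    return [math.log(t + 1) / norm for t in tf_row]

print(entropy_weight([3, 0, 1]))   # one term's frequencies across three categories
```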
Table 2. Examples of matched diagnostic categories classified by the DKE&TC

Input query: CUST STATES VEHICLE OVERHEAT NO LOSS OF COOLANT BUT HAD BURNING COOLANT SMELL.
Three top matched diagnostic categories: C1: COOLANT LEAK; C2: ENGINE OVERHEATS/RADIATOR TROUBLES; C3: UNUSUAL EXHAUST SYSTEM ODOR

Input query: CUSTOMER STATES CHECK FOR OIL LEAK BETWEEN TRANS. AND ENGINE
Three top matched diagnostic categories: C1: ENGINE LEAKS OIL; C2: UNDETERMINED ENGINE LEAK; C3: TRANSMISSION/CLUTCH FLUID LEAKS

Input query: SAYS GETTING SQUEAK NOISE FROM BELT OR BEARING TYPE
Three top matched diagnostic categories: C1: ENGINE BELT BREAKING/SLIPPING/SQUEALING; C2: ENGINE BELT SLIPPING/SQUEALING; C3: ENGINE BELT OFF/FRAYED/COMING APART/BROKEN
4 Conclusion

In this paper, we presented a text document mining system, DKE&TC, developed for the automotive engineering diagnostic application. We presented our research results in a number of important areas, including various weighting schemes and similarity functions. Our experiment results show that the proposed system, DKE&TC, outperformed the well-known LSA model and many other weight schemes. We also compared the performance of the DKE&TC system with diagnostic engineers on a small data set of 100 queries. The results showed that the DKE&TC system outperformed its human counterpart.
Acknowledgment. This work is supported in part by a CEEP grant from the College of Engineering and Computer Science at the University of Michigan-Dearborn.
References
1. Wong, S.K.M., Raghavan, V.V.: A Critical Analysis of Vector Space Model for Information Retrieval. Journal of the American Society for Information Science, 1986. 37(2): p. 279-287.
2. Salton, G., Buckley, C.: Term weighting approaches in automatic text retrieval. Information Processing and Management, 1988. 24(5): p. 513-523.
3. Dumais, S.T.: Enhancing performance in latent semantic indexing (LSI) retrieval. Technical Memorandum, Bellcore, 1990.
4. Berry, M.W., Dumais, S.T., O'Brien, G.W.: Using linear algebra for intelligent information retrieval. SIAM Review, 1995. 37(4): p. 573-595.
5. Deerwester, S., Dumais, S., Furnas, G., Landauer, T., Harshman, R.: Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 1990. 41: p. 391-407.
6. Landauer, T.K., Foltz, P.W., Laham, D.: Introduction to latent semantic analysis. Discourse Processes, 1998(25): p. 259-284.
7. Berry, M.W., Drmac, Z., Jessup, E.R.: Matrices, Vector Spaces, and Information Retrieval. SIAM Review, 1999. 41(2): p. 335-362.
8. Edwards, A.L.: The Correlation Coefficient. San Francisco, CA: W. H. Freeman, 1976.
9. Lehmann, E.L., D'Abrera, H.J.M.: Nonparametrics: Statistical Methods Based on Ranks, rev. ed. Englewood Cliffs, NJ: Prentice-Hall, 1998.
10. Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. Addison-Wesley, 1999.
11. Salton, G., McGill, M.: Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
12. Cardoso-Cachopo, A., Oliveira, A.L.: An Empirical Comparison of Text Categorization Methods. In SPIRE 2003, LNCS, 2003.
Handling Incomplete Categorical Data for Supervised Learning*

Been-Chian Chien¹, Cheng-Feng Lu², and Steen J. Hsu³

¹ Department of Computer Science and Information Engineering, National University of Tainan, 33, Sec. 2, Su-Lin St., Tainan 70005, Taiwan, R.O.C. [email protected]
² Department of Information Engineering, I-Shou University, Kaohsiung 840, Taiwan, R.O.C. [email protected]
³ Department of Information Management, Ming Hsin University of Science and Technology, 1 Hsin-Hsing Road, Hsin-Fong, Hsin-Chu, Taiwan 304, R.O.C. [email protected]
Abstract. Classification is an important research topic in knowledge discovery. Most research on classification assumes that a complete dataset is given as the training dataset and that the test data contain all attribute values without missing values. Unfortunately, incomplete data usually exist in real-world applications. In this paper, we propose new handling schemes for learning classification models from incomplete categorical data. Three methods based on rough set theory are developed and discussed for handling incomplete training data. Experiments were conducted and the results were compared with previous methods using several well-known classification models to evaluate the performance of the proposed handling schemes.
1 Introduction

The classification problem is a task of supervised learning that consists of two phases. The learning phase is to learn a classification model from a set of training data with a predefined class for each datum. The classification phase is to classify unknown cases into one of the predefined classes using the learned classification model. Typically, the training data are assumed to be collected completely, without loss or errors inside. Reliable classification models thus can be learned from the training data and represented in the form of classification rules, decision trees, or mathematical functions. However, incomplete data occur frequently in real-world applications. An incomplete dataset is a dataset that contains at least one missing value, as in Table 1. The symbols '?' in the cases x5, x6, x9 denote missing values. Incomplete data may be caused by a broken machine or by values mistakenly erased by a person. Generally, incomplete training data will degrade the learning quality of classification models.

* This work was supported in part by the National Science Council of Taiwan, R. O. C., under contract NSC94-2213-E-024-004.
Table 1. A simple incomplete dataset

No.   outlook    temperature   humidity   play
x1    sunny      L             H          no
x2    ?          H             H          no
x3    overcast   M             ?          yes
x4    rainy      M             H          yes
In this paper, we investigate the problem of handling incomplete categorical data contained in training datasets for supervised learning. Since incomplete samples do not provide perfect information for the training process, most traditional supervised learning methods cannot deal with incomplete data directly and generate inaccurate classifiers from an incomplete dataset. If the incomplete training data can be handled well, an effective classification model can be learned. Many methods for dealing with missing values in classification have been proposed in the past decades. Previous research on handling incomplete data can be generally divided into two strategies: (1) Ignore the data with missing values inside the dataset [7][10][13][18][21]. (2) Fill in the missing value with an appropriate alternative [5][8][12][14][17][23]. To learn an effective classification model from incomplete categorical data, we try to overcome the problem of missing values in attributes based on rough set theory. During the learning phase of classification, three methods are proposed to handle the problem of incomplete categorical data based on rough membership functions. The proposed methods transform categorical data into numerical form and replace the missing values with different granular processes based on rough membership. To evaluate the performance of the proposed handling methods, several well-known classification models, such as GPIZ, SVM, NB and NBTree, were selected for testing and compared with C4.5, which handles missing values using information gain. The tested datasets include real and artificial incomplete datasets generated from the UCI Machine Learning repository [1]. The experimental results demonstrate that the proposed methods based on rough membership have different effects on different classification models. The remainder of this paper is organized as follows: Section 2 reviews rough set theory and the related research on incomplete data. In Section 3, we present the proposed schemes for handling incomplete data. Section 4 describes the experiments and demonstrates the performance of the proposed methods combined with different classification models. Finally, conclusions and future work are presented.
2 Related Work

2.1 Rough Set Theory

Rough set theory was proposed by Pawlak in 1982 [19][20]; it is an extension of set theory and is well suited to dealing with incompleteness and uncertainty. The main advantages of rough set theory are that it does not need any additional information about data and provides a powerful foundation to reveal important structures in data.
Much research has broadly and successfully applied rough set theory to discover knowledge from incomplete databases [8][10][18][23]. The idea of rough sets is based on the establishment of equivalence classes on a given dataset U, called a universe, and supports two approximations called the lower approximation and the upper approximation. We give the definitions of rough set theory as follows. Let S = (U, T) denote an information table, where U is a non-empty finite set of objects and T = A ∪ C, where A is the set of condition attributes and C is the decision attribute. We then define an equivalence relation RA(B) on S as RA(B) = {(x, y) | x, y ∈ U, ∀Aj ∈ B, Aj(x) = Aj(y)}. We say that objects x and y are indiscernible if the equivalence relation RA(B) is satisfied on the set U, for all Aj ∈ A and each B ⊆ A. Let [x]B denote an equivalence class of S on RA(B) and [U]B denote the set of all equivalence classes [x]B for x ∈ U. That is, [x]B = {y | x RA(B) y, ∀x, y ∈ U} and [U]B = {[x]B | x ∈ U}. Then, the lower approximation and upper approximation for B on concept X, denoted B*(X) and B*(X), are defined as B*(X) = {x | x ∈ U, [x]B ⊆ X} and B*(X) = {x | x ∈ U and [x]B ∩ X ≠ ∅}, respectively. For a given concept X ⊆ U, a rough membership function of X on the set of attributes B is defined as

$\mu_B^X(x) = \frac{|[x]_B \cap X|}{|[x]_B|}$          (1)
where |[x]B ∩ X| denotes the cardinality of the set [x]B ∩ X. The rough membership value μBX(x) can be interpreted as the conditional probability that an object x belongs to X, given that the object belongs to [x]B, where μBX(x) ∈ [0, 1].

2.2 Review of Incomplete Data Handling

The techniques of supervised learning on incomplete datasets are more difficult than on complete datasets. Designing an effective learning algorithm that is able to deal with incomplete datasets is a challenge to researchers. Much previous research was generally motivated by two strategies: ignoring and repairing. The first strategy is to ignore the whole record with unknown values or just ignore the missing attribute values. The related research is enumerated as follows: (1) Ignore samples or attributes [7]. (2) Ignore the missing attribute values [10][13][18][21][24]. The second strategy is to use a mechanism of reparation to transform an incomplete dataset into a complete dataset. The related techniques for handling incomplete categorical data are as follows: (1) Fill in the missing value manually. (2) Concept most common attribute value [17]. (3) Assign all possible values of the attribute [8][9]. (4) Use a rough set approach to repair missing values [14][15][23].
The preprocessing of incomplete data is important and necessary for learning a high-accuracy classification model.
3 Handling Incomplete Data Based on Rough Membership In this section, we introduce three handling methods for incomplete data based on rough membership. The first method is called fill in another value (FIAV). The second and third methods are fill in all possible value with class (FIAP-class) and fill in all possible value with minimum (FIAP-min), respectively. All the three methods first employ the transformation function based on rough membership of equation (1) for transform categorical data into numerical data. The purpose of the transformation is to transform the categorical attributes into a set of numerical attributes. The transformation function is described as follows: Given a dataset U having n conditional attributes A1, A2, ... , An and a decision attribute C. Let Di be the domain of Ai, for 1 ≤ i ≤ n, and C = {C1, C2, ... , CK}, where K is the number of predefined classes. For each object xj ∈ U, xj = (vj1, vj2, ... , vjn), where vji∈Di stands for the value on attribute Ai of the object xj. The idea of transformation is to transform the original categorical attribute Ai into K numerical attributes Ãi. Let Ãi = (Ai1, Ai2, ... , AiK), where K is the number of predefined classes C and the domains of Ai1, Ai2, ... , AiK are in [0, 1]. The method of transformation is based on the rough membership function of equation (1). For an object xj ∈ U, xj = (vj1, vj2, ... , vjn), vji∈Di, the vji will be transformed into (wjk, wj(k+1), ... , wj(k+K-1) ), wik ∈ [0, 1] and wjk = μ CAi1 ( x j ) , wj(k+1) = μ CAi2 ( x j ) , ... , wj(k+K-1) = μ AC ( x j ) , where K
i
μ CA ( x j ) = k
i
| [ x j ] A ∩ [ x j ]C | i
k
| [ x j ]A |
, if vji∈[xj]Ai and xj∈U.
(2)
i
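A minimal Python sketch (ours, not the authors' code) of the rough membership in Eq. (2): the fraction of objects sharing xj's value on attribute Ai that also share xj's class.

```python
# Rough membership of x_j's value on attribute A_i with respect to class C_k.
def rough_membership(data, j, i, k_class):
    """data is a list of (attribute_values_tuple, class_label) pairs."""
    v = data[j][0][i]                                   # value of x_j on attribute A_i
    block = [row for row in data if row[0][i] == v]     # equivalence class [x_j]_{A_i}
    hits = [row for row in block if row[1] == k_class]  # ... that also lie in class C_k
    return len(hits) / len(block)

# Table 1 example: attributes (outlook, temperature, humidity), decision 'play'.
data = [(("sunny", "L", "H"), "no"), (("?", "H", "H"), "no"),
        (("overcast", "M", "?"), "yes"), (("rainy", "M", "H"), "yes")]
print(rough_membership(data, 0, 2, "yes"))   # humidity of x1: 1/3, i.e. 0.33
```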
After the transformation, the new dataset U′ with attributes Ã = {Ã1, Ã2, …, Ãn} is obtained, and an object xj ∈ U, xj = (vj1, vj2, …, vjn), will be transformed into yj, yj = (wj1, wj2, …, wjn′), where n′ = q × K and q is the number of categorical attributes. The proposed handling methods are presented in detail as follows.

FIAV. This method does what its name describes: the missing values are filled with an alternate value. For example, we can use the value 'unknown' to replace the missing values. Then, the values in categorical attributes are transformed into numerical values. We thus ignore what the real values in the missing fields should be and instead consider the distribution of the 'unknown' value on the decision attribute. We give an example to illustrate this method using the incomplete dataset shown in Table 1.

Example 1: In Table 1, there are three categorical attributes, outlook, temperature and humidity, and the decision attribute is play. The FIAV method uses the value 'unknown' to replace the symbol '?' as a new attribute value. The symbolic categories in the three attributes become: outlook: {sunny, overcast, rainy, unknown}, temperature: {L, M, H} and humidity: {L, H, unknown}. The number of numerical
attributes is n′ = 3 × 2 = 6. Based on the above equivalence classes, the values in case x1 after transformation are as follows: $w_{11} = \mu_{A_1}^{C_1}(x_1) = 0$, $w_{12} = \mu_{A_1}^{C_2}(x_1) = 1$, $w_{13} = \mu_{A_2}^{C_1}(x_1) = 0$, $w_{14} = \mu_{A_2}^{C_2}(x_1) = 1$, $w_{15} = \mu_{A_3}^{C_1}(x_1) = 0.33$, $w_{16} = \mu_{A_3}^{C_2}(x_1) = 0.67$. So, y1 = (w11, w12, w13, w14, w15, w16) = (0, 1, 0, 1, 0.33, 0.67). The new dataset U′ is generated and shown in Table 2.

Table 2. The results of the transformation of Table 1 using FIAV

No.   outlook (wj1, wj2)   temperature (wj3, wj4)   humidity (wj5, wj6)   play (Ci)
y1    0      1             0      1                 0.33   0.67           no
y2    0      1             0      1                 0.33   0.67           no
y3    1      0             1      0                 1      0              yes
y4    1      0             1      0                 0.33   0.67           yes
FIAP-class. This method assumes that a missing value can be any possible value of the attribute. Hence, all possible rough membership values have to be calculated by replacing the missing value with all possible values of that attribute. After all possible rough membership values are derived, the value with the maximum $\mu_{A_i}^{C_k}(x_j)$ is picked as the alternate for the missing value of object xj on attribute Ai, where Ck is the class of object xj. We give Example 2 to illustrate this method in more detail.
Example 2: In Table 1, the missing values are first replaced by all possible values according to their attributes, e.g. the symbolic categories in the three attributes are: outlook: {sunny, overcast, rainy}, temperature: {L, M, H} and humidity: {L, H}. The object x2 in attribute outlook is replaced by sunny, overcast and rainy. The missing value in the object x2 will have three sets of possible transformation values corresponding to sunny, overcast and rainy in attribute outlook, as follows:
If outlook = sunny in case x2, the values after transformation are $w_{21} = \mu_{A_1}^{C_1}(x_2) = 0$, $w_{22} = \mu_{A_1}^{C_2}(x_2) = 1$.

If outlook = overcast in case x2, the values after transformation are $w_{21} = \mu_{A_1}^{C_1}(x_2) = 0.5$, $w_{22} = \mu_{A_1}^{C_2}(x_2) = 0.5$.

If outlook = rainy in case x2, the values after transformation are $w_{21} = \mu_{A_1}^{C_1}(x_2) = 0.5$, $w_{22} = \mu_{A_1}^{C_2}(x_2) = 0.5$.

Since the class label of the object x2 is C2, i.e. play = no, we select the set of rough membership values with the maximum $\mu_{A_1}^{C_2}(x_2)$ among the three sets (i.e. w21 = 0 and w22 = 1) as the alternate for the missing value on the attribute outlook. The other missing values can also be produced by the same procedure and are shown in Table 3.
Table 3. The results of the transformation of Table 1 using FIAP-class

No.   outlook         temperature     humidity        play
      wj1    wj2      wj3    wj4      wj5    wj6      Ci
y1    0      1        0      1        0.33   0.67     no
y2    0      1        0      1        0.33   0.67     no
y3    1      0        1      0        0.50   0.50     yes
y4    1      0        1      0        0.33   0.67     yes
FIAP-min
This method is similar to the FIAP-class method except for the selection of the rough membership values. FIAP-min selects the minimum of μ_{Ai}^{Ck}(xj) over the same class as the alternate, instead of the maximum used in the FIAP-class method. We define the minimum value λjk as follows:

λjk = min_{vji ∈ Di} μ_{Ai = vji}^{Ck}(xj).    (3)
Then we use λjk to replace the missing value wjk. We give Example 3 to show the transformation of FIAP-min on Table 1.
Example 3: For the missing value of object x2 in Table 1, we first obtain the three sets of transformation values corresponding to sunny, overcast and rainy in the attribute outlook, as in Example 2. Then, we have λ21 = min{0, 0.5, 0.5} = 0 and λ22 = min{1, 0.5, 0.5} = 0.5. Therefore, we replace w21 with 0 and w22 with 0.5. The same procedure is applied to the other missing values in Table 1 and the result is shown in Table 4.

Table 4. The results of the transformation of Table 1 using FIAP-min

No.   outlook         temperature     humidity        play
      wj1    wj2      wj3    wj4      wj5    wj6      Ci
y1    0      1        0      1        0.33   0.67     no
y2    0      0.5      0      1        0.33   0.67     no
y3    1      0        1      0        0.50   0        yes
y4    1      0        1      0        0.33   0.67     yes
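The selection rules of FIAP-class and FIAP-min can be sketched as follows (our illustration, with hypothetical helper names): for a missing entry we compute, for every candidate value, the rough membership of the object in each class; FIAP-class then keeps the whole membership vector of the candidate that maximises membership in the object's own class, while FIAP-min takes the column-wise minimum as in Eq. (3).

def candidate_membership_pairs(rows, attr, obj, class_attr, classes):
    # Rough membership values of `obj` for each possible value of `attr`,
    # computed as if the missing entry took that value.
    domain = sorted({r[attr] for r in rows if r[attr] != "?"})
    pairs = {}
    for v in domain:
        block = [r for r in rows if r is obj or r[attr] == v]   # treat obj as if it had value v
        pairs[v] = {c: sum(r[class_attr] == c for r in block) / len(block) for c in classes}
    return pairs

def fiap_class(pairs, own_class):
    # keep the membership vector of the candidate value with maximal own-class membership
    return max(pairs.values(), key=lambda p: p[own_class])

def fiap_min(pairs, classes):
    # Eq. (3): take, for each class, the minimum membership over all candidate values
    return {c: min(p[c] for p in pairs.values()) for c in classes}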
4 Experiments and Comparison
The experiments were conducted on a PC with a 3.4 GHz CPU and 256 MB of RAM. To understand the effect of the proposed methods for supervised learning on incomplete data, we artificially generated incomplete datasets from datasets selected from the UCI Machine Learning Repository [1]. The related information of the selected datasets is summarized in Table 5. The selected datasets were modified to generate the incomplete datasets by randomly setting a specified percentage of attribute values to null. However, the generation of missing values must still satisfy the following constraints: C1: Each original case retains at least one attribute value. C2: Each attribute has at least one value left.
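A minimal sketch of this missing-value injection protocol is given below; the function name and the random policy are our assumptions, not the authors' exact generator.

import random

def make_incomplete(data, missing_rate, rng=random.Random(0)):
    # Blank out roughly `missing_rate` of the attribute values (marked '?'),
    # while keeping every case with at least one known value (C1) and
    # every attribute with at least one known value (C2).
    n, m = len(data), len(data[0])
    target = int(round(missing_rate * n * m))
    cells = [(i, j) for i in range(n) for j in range(m)]
    rng.shuffle(cells)
    out = [list(row) for row in data]
    removed = 0
    for i, j in cells:
        if removed >= target:
            break
        row_known = sum(v != "?" for v in out[i])
        col_known = sum(out[r][j] != "?" for r in range(n))
        if row_known > 1 and col_known > 1:        # respect C1 and C2
            out[i][j] = "?"
            removed += 1
    return out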
Table 5. The selected complete datasets

Datasets      Categorical attributes   Numerical attributes   Number of objects   Number of classes
led7          7                        0                      3200                10
lymph         18                       0                      148                 4
tic-tac-toe   9                        0                      958                 2
The performance of the classification scheme is evaluated by the average classification error rate of 10-fold cross-validation over 10 runs, where each run works on a newly generated incomplete dataset. In addition to the three proposed methods, we also compare the handling of incomplete categorical data with the concept most common attribute value (CMCAV) method in [17]. The selected classification models include the GP-based classifier GPIZ [3], the statistics-based classifiers Naïve Bayes (NB) [6] and NBTree [16], support vector machines (SVM) [11], and the decision-tree classifier C4.5 [21]. Note that in the FIAV experiment C4.5 was also tested on the incomplete data directly, because C4.5 has its own mechanism for handling missing values. The experimental results on classification error rates are listed in Table 6 to Table 8. Table 6 gives the classification results of the five classifiers and the four incomplete-data handling methods on the led7 dataset. Since led7 contains many objects, we can test missing rates of up to 30%. On this dataset the NB-based classifiers have a better classification rate than the other classifiers under the FIAV and CMCAV methods, and the SVM classifier has the best performance on average no matter which handling method is used. Generally, missing data do not influence the classification rate very much on this dataset. Table 7 gives the classification results of the five classifiers and the four handling methods on the lymph dataset. This dataset has only 148 objects belonging to four classes, so only a maximum missing rate of 10% can be considered. On this dataset, we found that the FIAP-class method is suitable for the non-NB-based classifiers GPIZ, C4.5 and SVM. The CMCAV method provides especially good classification rates for the NB-based classifiers; for NB and NBTree, the error rates on the data with missing values are even lower than those without missing values. Table 8 gives the classification results of the five classifiers and the four handling methods on the tic-tac-toe dataset. This dataset contains 958 objects and only two classes, and the maximum missing rate can again be raised to 30%. As the missing rate increases, the classification error rate increases. We found that the NB-based classifiers perform very badly on this dataset no matter which handling method is used, the SVM classifier is fair in this case, and the GPIZ classifier is as good as C4.5. From the experimental results, it is clear that no single combination of classifier and handling method works well on all data. Generally, the FIAV method is not an ideal handling method for most of the tested classifiers. The FIAP-class and FIAP-min methods achieve better classification accuracy than the other methods with the GPIZ, C4.5 and SVM classification models, and FIAP-class outperforms FIAP-min.
Table 6. The results of classification using incomplete led7 dataset (error rate: Ave. ± S.D.)

FIAV
Missing rate   GPIZ          C4.5          SVM           NB            NBTree
0%             27.8 ±0.9     27.1 ±0.0     26.6 ±0.0     26.8 ±0.0     26.8 ±0.0
5%             27.6 ±1.3     27.0 ±0.3     26.4 ±0.5     27.0 ±0.4     26.9 ±0.5
10%            27.3 ±1.5     27.3 ±0.9     26.4 ±0.7     26.5 ±0.7     26.9 ±0.8
20%            27.9 ±1.9     28.0 ±1.2     26.4 ±0.9     26.7 ±1.0     26.5 ±0.9
30%            28.5 ±2.1     29.2 ±1.4     27.2 ±1.7     26.9 ±1.5     26.9 ±1.7

FIAP-class
5%             27.5 ±0.9     26.8 ±0.2     26.5 ±0.4     27.3 ±0.5     27.2 ±0.5
10%            27.6 ±1.6     26.9 ±0.7     26.6 ±0.9     27.3 ±0.6     27.2 ±0.6
20%            27.3 ±2.2     27.1 ±1.3     26.5 ±1.2     27.1 ±1.3     26.6 ±1.1
30%            28.0 ±2.1     27.5 ±1.4     26.9 ±2.3     27.5 ±1.7     27.5 ±2.1

FIAP-min
5%             27.9 ±1.1     27.1 ±0.4     26.6 ±0.5     27.2 ±0.3     26.9 ±0.4
10%            28.2 ±1.7     27.3 ±0.7     26.9 ±1.1     27.2 ±0.6     27.0 ±0.8
20%            28.5 ±2.1     28.0 ±1.4     26.8 ±1.5     26.7 ±1.0     26.5 ±0.9
30%            29.3 ±2.4     29.6 ±1.6     27.2 ±1.9     27.1 ±2.0     26.9 ±1.7

Concept most common attribute value (CMCAV)
5%             27.9 ±0.6     26.7 ±0.5     26.6 ±0.5     27.0 ±0.4     26.9 ±0.4
10%            28.0 ±1.4     26.9 ±0.7     26.6 ±0.9     27.0 ±0.6     26.8 ±0.7
20%            28.6 ±1.6     27.0 ±1.4     26.8 ±1.2     26.8 ±1.0     26.7 ±1.1
30%            29.9 ±2.6     27.6 ±1.1     27.4 ±1.6     27.2 ±1.4     26.8 ±1.4
Table 7. The results of classification using incomplete lymph dataset (error rate: Ave. ± S.D.)

FIAV
Missing rate   GPIZ          C4.5          SVM           NB            NBTree
0%             17.9 ±1.3     22.9 ±0.0     17.5 ±0.0     16.9 ±0.0     14.8 ±0.0
5%             19.6 ±2.8     22.5 ±5.2     17.9 ±3.4     16.2 ±1.0     25.4 ±3.2
10%            20.3 ±4.4     27.3 ±6.4     15.3 ±3.6     13.8 ±6.0     17.2 ±7.7

FIAP-class
5%             16.9 ±2.1     19.5 ±4.8     13.6 ±1.2     18.8 ±0.8     25.1 ±1.6
10%            17.6 ±3.9     14.5 ±5.9     14.7 ±2.1     15.9 ±5.3     26.6 ±8.1

FIAP-min
5%             17.3 ±3.6     23.3 ±6.2     13.7 ±2.4     16.3 ±1.5     24.3 ±5.4
10%            18.5 ±6.2     23.2 ±7.2     14.7 ±2.1     14.2 ±7.1     18.5 ±9.3

Concept most common attribute value (CMCAV)
5%             20.3 ±2.5     19.6 ±2.3     17.3 ±2.4     12.8 ±2.7     13.4 ±2.8
10%            21.2 ±4.0     24.3 ±4.6     18.6 ±3.8     16.9 ±8.2     19.8 ±8.1
However, these two methods are not suitable for the NB-based classifiers. The CMCAV method is clearly not a good handling method for the GPIZ and SVM classification models; nevertheless, it provides a good repair for the NB-based classifiers. Among the four handling methods, combining the GPIZ, C4.5 or SVM classifier with FIAP-class yields good models for supervised learning. The previous CMCAV method gives fair classification accuracy when combined with the NB and NBTree classification models.
Generally, the former models (FIAP-class + GPIZ, C4.5 or SVM) perform better than the latter (CMCAV + NB or NBTree). Further, the GPIZ classifier presents a stable classification rate on average when FIAP-class is used as the handling method for incomplete data.
5 Conclusion
Supervised learning of an effective classification model from incomplete data is more difficult than learning one from complete data. In this paper, we propose three new methods, FIAV, FIAP-class and FIAP-min, based on rough membership to handle the problem of incomplete categorical data. Our approach differs from previous methods that fill in plausible categorical values for the missing data. The novelty of the proposed methods is to transform the categorical data into numerical data and to estimate the membership of the missing values in each class; the membership values are then used in place of the missing values. The experimental results demonstrate that the proposed method FIAP-class performs well with GPIZ, SVM and C4.5, but poorly with the NB-based classifiers NB and NBTree. On the other hand, we found that NB and NBTree perform well with the CMCAV method. We also find it interesting that the error rates may decrease when the missing rates exceed 10% or 20% on some datasets; we will continue to analyze and try to explain this behaviour using more test data in the future.

Table 8. The results of classification using incomplete tic-tac-toe dataset (error rate: Ave. ± S.D.)
FIAV
Missing rate   GPIZ          C4.5          SVM           NB            NBTree
0%             5.9 ±0.7      2.60 ±0.0     19.6 ±0.0     30.4 ±0.0     16.0 ±0.0
5%             7.5 ±2.1      7.69 ±1.7     19.9 ±1.4     36.4 ±1.4     27.7 ±2.9
10%            7.4 ±2.7      8.08 ±1.3     16.7 ±2.3     35.1 ±1.7     26.6 ±4.0
20%            10.7 ±4.4     10.6 ±4.1     18.9 ±4.4     33.9 ±4.1     26.9 ±4.3
30%            13.6 ±6.3     13.7 ±5.9     21.4 ±6.3     33.2 ±7.4     31.0 ±6.1

FIAP-class
5%             6.9 ±2.3      6.69 ±1.1     17.0 ±1.7     36.4 ±1.3     28.2 ±1.9
10%            7.3 ±2.8      6.92 ±1.4     16.2 ±2.6     35.0 ±1.8     28.3 ±2.1
20%            9.4 ±4.6      10.2 ±3.5     16.2 ±3.1     34.6 ±3.5     27.0 ±5.0
30%            15.2 ±5.5     16.2 ±6.0     23.2 ±7.8     34.9 ±7.9     31.3 ±5.3

FIAP-min
5%             7.0 ±1.2      8.41 ±1.5     17.2 ±1.7     36.0 ±1.2     24.0 ±3.9
10%            8.2 ±3.0      8.06 ±1.7     15.2 ±2.5     34.8 ±1.9     24.8 ±2.8
20%            13.7 ±3.5     10.9 ±2.7     18.1 ±3.6     33.6 ±3.8     25.7 ±5.8
30%            17.1 ±6.7     15.1 ±6.3     21.8 ±7.9     32.5 ±7.6     31.8 ±3.8

Concept most common attribute value (CMCAV)
5%             7.0 ±1.9      6.89 ±1.6     16.3 ±1.2     34.1 ±1.1     23.3 ±1.4
10%            7.4 ±3.2      8.42 ±1.3     15.8 ±2.6     32.7 ±2.3     19.7 ±2.8
20%            16.5 ±4.5     14.5 ±4.1     22.2 ±2.6     31.7 ±4.9     24.1 ±3.8
30%            18.7 ±9.0     18.9 ±8.0     27.1 ±8.9     31.2 ±8.7     25.0 ±8.7
References 1. Blake, C., Keogh E., Merz, C. J.: UCI repository of machine learning database. http://www.ics.uci.edu/~mlearn/MLRepository.html, Irvine, University of California, Department of Information and Computer Science (1998) 2. Chang, C. C., Lin, C. J.: LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm 3. Chien, B. C., Lin, J. Y., Yang, W. P.: Learning effective classifiers with z-value measure based on genetic programming. Pattern Recognition, 37 (2004) 1957-1972 4. Chien, B. C., Yang, J. H., Lin, W. Y.: Generating effective classifiers with supervised learning of genetic programming. Proceedings of the 5th International Conference on Data Warehousing and Knowledge Discovery (2003) 192-201 5. Dempster, P., Laird, N. M., Rubin, D. B.: Maximum-likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B39 (1977) 1-38 6. Duda, R. O., Hart, P. E.: Pattern Classification and Scene Analysis. New York: Wiley, John and Sons Incorporated Publishers (1973) 7. Friedman, J. H.: A recursive partitioning decision rule for non-parametric classification. IEEE Transactions on Computer Science (1977) 404-408 8. Grzymala-Busse, J. W.: On the unknown attribute values in learning from examples. Proceedings of the ISMIS-91, 6th International Symposium on Methodologies for Intelligent Systems, Lecture Notes in Artificial Intelligence, Vol. 542, Springer-Verlag, Berlin Heidelberg New York (1991) 368-377 9. Grzymala-Busse, J. W., Hu, M.: A comparison of several approaches to missing attribute values in data mining. Proceedings of the Second International Conference on Rough Sets and Current Trends in Computing (2000) 378-385 10. Grzymala-Busse, J. W.: Rough set strategies to data with missing attribute values. Proceedings of the Workshop on Foundations and New Directions in Data Mining, associated with the third IEEE International Conference on Data Mining (2003) 56-63 11. Gunn, S. R.: Support vector machines for classification and regression. Technical Report, School of Electronics and Computer Science University of Southampton (1998) 12. Han, J., Kamber, M.: Data Mining: Concept and Techniques. Morgan Kaufmann publishers, (2001) 13. Hathaway, R. J., Bezdek, J. C.: Fuzzy c-means clustering of incomplete data, IEEE Transactions on Systems, Man, and Cybernetics-part B: Cybernetics 31(5) (2001) 14. Hong, T. P., Tseng, L. H., Chien, B. C.: Learning fuzzy rules from incomplete numerical data by rough sets. Proceedings of the 2002 IEEE International Conference on Fuzzy Systems (2002) 1438-1443 15. Hong, T. P., Tseng, L. H., Wang, S.-L.: Learning rules from incomplete training examples by rough sets. Expert Systems with Applications 22 (2002) 285-293 16. Kohavi, R.: Scaling up the accuracy of naïve-bayes classifiers: a decision-tree hybrid. Knowledge Discovery & Data Mining, Cambridge/Menlo Park: AAAI Press/MIT Press Publishers (1996) 202-207 17. Koninenko, I., Bratko, K., Roskar,E.: Experiments in automatic learning of medical diagnostic rules. Technical Report, Jozef Stenfan Institute, Ljubljana (1984) 18. Kryszkiewicz, M.: Rough set approach to incomplete information systems. Information Science 112 (1998) 39-49 19. Pawlak, Z.,: Rough sets. International Journal of Computer and Information Sciences 11 (1982) 341-356
20. Pawlak Z., A. Skowron: Rough membership functions, in: R.R. Yager and M. Fedrizzi and J. Kacprzyk (Eds.), Advances in the Dempster-Shafer Theory of Evidence (1994) 251-271 21. Quinlan, J. R.: C4.5: Programs for Machine Learning. San Mateo, California, Morgan Kaufmann Publishers (1993) 22. Singleton, A.: Genetic Programming with C++. http://www.byte.com/art/9402/sec10/art1.htm, Byte (1994) 171-176 23. Slowinski, R., Stefanowski, J.: Handling various types of uncertainty in the rough set approach. Proceedings of the International Workshop on Rough Sets and Knowledge Discovery (1993) 366-376 24. Stefanowski, J., Tsoukias, A.: On the extension of rough sets under incomplete information. Proceeding of the 7th International Workshop on New Directions in Rough Sets, Data Mining, and Granular-Soft Computing (1999) 73-81 25. Witten, H., Frank, E.: Data Mining: Practical machine learning tools with Java implementations. Morgan Kaufmann, San Francisco (2000)
Mining Multiple-Level Association Rules Under the Maximum Constraint of Multiple Minimum Supports Yeong-Chyi Lee1, Tzung-Pei Hong2,*, and Tien-Chin Wang3 1
Department of Information Engineering, I-Shou University, Kaohsiung, 84008, Taiwan [email protected] 2 Department of Electrical Engineering, National University of Kaohsiung, Kaohsiung, 811, Taiwan [email protected] 3 Department of Information Management, I-Shou University, Kaohsiung, 84008, Taiwan [email protected]
Abstract. In this paper, we propose a multiple-level mining algorithm for discovering association rules from a transaction database with multiple supports of items. Items may have different minimum supports and taxonomic relationships, and the maximum-itemset minimum-taxonomy support constraint is adopted in finding large itemsets. That is, the minimum support for an itemset is set as the maximum of the minimum supports of the items contained in the itemset, while the minimum support of the item at a higher taxonomic concept is set as the minimum of the minimum supports of the items belonging to it. Under the constraint, the characteristic of downward-closure is kept, such that the original Apriori algorithm can easily be extended to find large itemsets. The proposed algorithm adopts a top-down progressively deepening approach to derive large itemsets. An example is also given to demonstrate that the proposed mining algorithm can proceed in a simple and effective way.
1 Introduction Finding association rules in transaction databases is most commonly seen in data mining. The mined knowledge about the items tending to be purchased together can be passed to managers as a good reference in planning store layout and market policy. Agrawal and his co-workers proposed several mining algorithms based on the concept of large itemsets to find association rules from transaction data [1-4]. Most of the previous approaches set a single minimum support threshold for all the items [3][5][6][7]. In real applications, different items may have different criteria to judge their importance. The support requirements should thus vary with different items. Moreover, setting the minimum support for mining association rules is a dilemma. If it is set too high, many possible rules may be pruned away; on the other hand, if it is set too low, many uninteresting rules may be generated. Liu et al. thus proposed an approach for mining association rules with non-uniform minimum support values [10]. Their approach allowed users to specify different minimum supports to different *
Corresponding author.
items. Wang et al. proposed a mining approach, which allowed the minimum support value of an itemset to be any function of the minimum support values of items contained in the itemset [11]. In the past, we proposed a simple and efficient algorithm based on the Apriori approach to generate large itemsets under the maximum constraints of multiple minimum supports [12][13]. Furthermore, taxonomic relationships among items often appear in real applications. For example, wheat bread and white bread are two kinds of bread. Bread is thus a higher level of concept than wheat bread or white bread. Meanwhile, the association rule “bread → milk” may be more general to decision makers than the rule “wheat bread → juice milk”. Discovering association rules at different levels may thus provide more information than that only at a single level [8][9]. In this paper, we thus propose a multiple-level mining algorithm for discovering association rules from a transaction database with multiple minimum-supports of items. It is an extension of our previous approach with taxonomy being considered. Each item is first given a predefined support threshold. The maximum-itemset minimum-taxonomy support constraint is then adopted in finding large itemsets. That is, the minimum support for an itemset is set as the maximum of the minimum supports of the items contained in the itemset, while the minimum support of the item at a higher taxonomic concept is set as the minimum of the minimum supports of the items belonging to it. This is quite consistent with the mathematical concepts of union and intersection. Itemsets can be thought of as item intersection in transactions, and higher-level items as item union. The algorithm then adopts a top-down progressively deepening approach to derive large itemsets. The remaining parts of this paper are organized as follows. Some related mining algorithms are reviewed in Section 2. The proposed algorithm for mining multiple-level association rules under the maximum-itemset minimum-taxonomy support constraint of multiple minimum supports is described in Section 3. An example to illustrate the proposed algorithm is given in Section 4. Conclusion and discussion are given in Section 5.
2 Review of Related Mining Algorithms
Some related research on mining multiple-level association rules and mining association rules with multiple minimum supports is reviewed in this section.
2.1 Mining Multiple-Level Association Rules
Previous studies on data mining focused on finding association rules at the single-concept level. However, mining multiple-concept-level rules may lead to the discovery of more general and important knowledge from data. Relevant data item taxonomies are usually predefined in real-world applications and can be represented using hierarchy trees. Terminal nodes on the trees represent actual items appearing in transactions; internal nodes represent classes or concepts formed by lower-level nodes. A simple example is given in Figure 1.
Fig. 1. Taxonomy for the example (root: Food; level-1 categories such as Milk and Bread; level-2 flavors/types such as Chocolate, Apple, White and Wheat; level-3 brands such as Dairyland, Foremost, OldMills and Wonder)
In Figure 1, the root node is at level 0, the internal nodes representing categories (such as “Milk”) are at level 1, the internal nodes representing flavors (such as “Chocolate”) are at level 2, and the terminal nodes representing brands (such as “Foremost”) are at level 3. Only terminal nodes appear in transactions. Han and Fu proposed a method for finding association rules at multiple levels [8]. Nodes in predefined taxonomies are first encoded by sequences of numbers and the symbol "*" according to their positions in the hierarchy tree. For example, the internal node "Milk" in Figure 1 will be represented as 1**, the internal node "Chocolate" as 11*, and the terminal node "Dairyland" as 111. A top-down progressively deepening search approach is then used to mine the rules out.
2.2 Mining Association Rules with Multiple Minimum Supports
In the conventional approaches to mining association rules, the minimum supports for all the items or itemsets to be large are set at a single value. However, in real applications, different items may have different criteria to judge their importance. Liu et al. [10] thus proposed an approach for mining association rules with non-uniform minimum support values. Their approach allowed users to specify different minimum supports for different items. The minimum support value of an itemset is defined as the lowest of the minimum supports among the items in the itemset. This assignment is, however, not always suitable for application requirements. As mentioned above, the minimum support of an item means that the occurrence frequency of the item must be larger than or equal to it for the item to be considered in the next mining steps. If the support of an item is smaller than the support threshold, this item is not worth considering. When the minimum support value of an itemset is defined as the lowest of the minimum supports of the items in it, the itemset may be large, but items included in it may be small. In this case, it is doubtful whether this itemset is worth considering. It is thus reasonable in some sense that the occurrence frequency of an interesting itemset must be larger than the maximum of the minimum supports of the items contained in it. Wang et al. [11] then generalized the above idea and allowed the minimum support value of an itemset to be any function of the minimum support values of items
contained in the itemset. They proposed a bin-oriented, non-uniform support constraint. Items were grouped into disjoint sets called bins, and items within the same bin were regarded as non-distinguishable with respect to the specification of a minimum support. Although their approach is flexible in assigning the minimum supports to itemsets, the mining algorithm is a little complex due to its generality. As mentioned above, it is meaningful in some applications to assign the minimum support of an itemset as the maximum of the minimum supports of the items contained in the itemset. Although Wang et al.’s approach can solve this kind of problems, the time complexity is high. In our previous work, a simple algorithm based on the Apriori approach was proposed to find the large-itemsets and association rules under the maximum constraint of multiple minimum supports [12][13]. The proposed algorithm is easy and efficient when compared to Wang et al.’s under the maximum constraint. Below, we will propose an efficient algorithm based on Han’s mining approach and our previous approach for multiple-level items under the maximum-itemset minimum-taxonomy support constraint to generate the large itemsets level by level. Some pruning can also be easily done to save the computation time.
3 The Proposed Algorithm
In the proposed algorithm, items may have different minimum supports and taxonomic relationships, and the maximum-itemset minimum-taxonomy support constraint is adopted in finding large itemsets. That is, the minimum support for an itemset is set as the maximum of the minimum supports of the items contained in the itemset, and the minimum support for an item at a higher taxonomic concept is set as the minimum of the minimum supports of the items belonging to it. Under the constraint, the characteristic of downward-closure is kept, such that the original Apriori algorithm can be easily extended to find large itemsets. The proposed mining algorithm first encodes items (nodes) in a given taxonomy as Han and Fu's approach did [8]. It first handles the items at level 1. The minimum support of each item at level 1 is set as the minimum of the minimum supports of the items belonging to it. It then finds all large 1-itemsets L1 from the given transactions by comparing the support of each item with its minimum support. After that, candidate 2-itemsets C2 can be formed from L1. Note that the supports of all the large 1-itemsets comprising each candidate 2-itemset must be larger than or equal to the maximum of their minimum supports. This feature provides a good pruning effect before the database is scanned for finding large 2-itemsets. The proposed algorithm then finds all large 2-itemsets L2 for the given transactions by comparing the support of each candidate 2-itemset with the maximum of the minimum supports of the items contained in it. The same procedure is repeated until all large itemsets have been found. The algorithm then finds large itemsets at the next level, until all levels have been processed. The details of the proposed mining algorithm under the maximum-itemset minimum-taxonomy support constraint are described below.
The mining algorithm for multiple-level association rules under the maximum-itemset minimum-taxonomy support constraint of multiple minimum supports:
INPUT: A body of n transaction data D, a set of m items, each with a predefined minimum support value, a predefined taxonomy of items in D, a set of membership functions, and a minimum confidence value λ.
OUTPUT: A set of multiple-level association rules under the maximum constraint of multiple minimum supports.
STEP 1: Encode the predefined taxonomy using a sequence of numbers and the symbol “*”, with the l-th number representing the branch number of a certain item at level l.
STEP 2: Translate the item names in the transaction data according to the encoding scheme.
STEP 3: Set k = 1, where k is used to store the level number currently being processed.
STEP 4: Group the items with the same first k digits in each transaction. Use the encoded name to represent the group, and retain only one item when there are more than two items in the same group in a transaction. Denote the j-th group at level k as gjk, where j = 1 to mk and mk is the number of groups at level k.
STEP 5: Calculate the count cjk of each group gjk at level k as its occurring number in the transaction data set D. The support sjk of gjk can then be derived as sjk = cjk / n.
STEP 6: Check whether the support sjk of gjk is larger than or equal to the threshold τjk that is the minimum of minimum supports of items contained in it. If the support of a group gjk satisfies the above condition, put gjk in the set of large 1-itemsets (L1k) at level k. That is,
L1k = { gjk | sjk ≥ τjk, 1 ≤ j ≤ mk }; otherwise, remove all the items in the group from the transactions D.
STEP 7: If L1k is null, then set k = k + 1 and go to STEP 4; otherwise, do the next step.
STEP 8: Set r = 2, where r is used to represent the number of items stored in the current large itemsets.
STEP 9: Generate the candidate set Crk from Lr−1k in a way similar to that in the Apriori algorithm [3]. That is, the algorithm first joins pairs of itemsets in Lr−1k that have r − 2 items in common and differ in the remaining one. In addition, it differs from the Apriori algorithm in that the supports of all the large (r−1)-itemsets comprising a candidate r-itemset I must be larger than or equal to the maximum (denoted as mI) of the minimum supports of these large (r−1)-itemsets. Store in Crk all the itemsets satisfying the above conditions and having all their sub-(r−1)-itemsets in Lr−1k.
STEP 10: If the candidate set Crk is null, set k = k + 1 and go to STEP 4; otherwise, do the next step.
STEP 11: For each newly formed candidate r-itemset I with items (I1, I2, ..., Ir) in Crk, do the following substeps:
(a) Calculate the count cIk of each candidate r-itemset I at level k as its occurring number in the transaction data set. The support of I can then be derived as sI = cIk / n.
(b) Check whether the support sI of each candidate r-itemset is larger than or equal to τI, which is the maximum of the minimum supports of the items contained in it. If I satisfies the above condition, put it in the set of large r-itemsets (Lrk) at level k. That is, Lrk = { I | sI ≥ τI }.
STEP 12: If Lrk is null and k has reached the level number of the taxonomy, then do the next step; otherwise, if Lrk is null, then set k = k + 1 and go to STEP 4; otherwise, set r = r + 1 and go to STEP 9.
STEP 13: Construct the association rules for each large q-itemset I with items (I1, I2, ..., Iq), q ≥ 2, by the following substeps:
(a) Form all possible association rules as follows:
I1 ∧ ... ∧ Ir−1 ∧ Ir+1 ∧ ... ∧ Iq → Ir, r = 1 to q.
(b) Calculate the confidence values of all association rules by:
sI / sI1∧...∧Ir−1∧Ir+1∧...∧Iq .
STEP 14: Output the rules with confidence values larger than or equal to the predefined confidence value λ.
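A condensed Python sketch of STEPs 4–12 for a single taxonomy level is given below. It is our illustration, not the authors' implementation: leaves_of, the transaction encoding and the simplified join are assumptions, and the removal of items under small higher-level groups (used in the example of Section 4) is omitted for brevity.

from itertools import combinations

def mine_level(transactions, minsup, leaves_of, k, n_levels):
    # transactions: list of sets of encoded leaf items (e.g. {"111", "221"});
    # minsup: leaf code -> predefined minimum support; leaves_of(group) -> leaf codes it covers.
    n = len(transactions)
    grouped = [{code[:k] + "*" * (n_levels - k) for code in t} for t in transactions]

    def support(itemset):
        return sum(itemset <= t for t in grouped) / n

    def tau(group):                         # minimum-taxonomy threshold of a single group
        return min(minsup[leaf] for leaf in leaves_of(group))

    groups = {g for t in grouped for g in t}
    large = [{frozenset([g]) for g in groups if support({g}) >= tau(g)}]
    while large[-1]:
        prev, cand = large[-1], set()
        for a, b in combinations(prev, 2):  # simplified Apriori-style join and subset check
            u = a | b
            if len(u) == len(a) + 1 and all(frozenset(s) in prev for s in combinations(u, len(a))):
                cand.add(u)
        # maximum-itemset constraint: an itemset's threshold is the max of its groups' thresholds
        large.append({c for c in cand if support(c) >= max(tau(g) for g in c)})
    return [level for level in large if level]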
4 An Example In this section, a simple example is given to demonstrate the proposed mining algorithm, which generates a set of taxonomic association rules from a given transaction dataset with multiple minimum supports. Assume the transaction dataset includes the ten transactions shown in Table 1. Each transaction consists of two parts, TID and Items. The field TID is used for identifying transactions and the field Items lists the items purchased at a transaction. For example, there are three items, plain milk, tea and coke, purchased at transaction T1. Assume the predefined taxonomy among the items is shown in Figure 2. All the items fall into three main classes: foods, drinks and alcohol. Foods can be further classified into bread and cookies. There are two kinds of bread, white bread and wheat bread. The other nodes can be explained in the same manner.
Table 1. The ten transactions in this example

TID   Items
T1    plain milk, tea, coke
T2    chocolate biscuits, plain milk, tea
T3    white bread, chocolate biscuits, red wine, blended whiskey
T4    wheat bread, chocolate biscuits, plain milk, tea
T5    chocolate biscuits, pure malt whiskey, white wine
T6    white bread, juice milk, tea, coke
T7    chocolate biscuits, plain milk, juice milk, tea
T8    soda cookies, juice milk, coke
T9    soda cookies, coke, blended whiskey
T10   soda cookies, plain milk, tea

Fig. 2. The predefined taxonomy in this example (root Goods, with Foods divided into Bread {white bread, wheat bread} and Cookies {soda cookies, chocolate biscuits}, Soft drinks divided into Milk {plain milk, juice milk} and Beverages {tea, coke}, and Alcohol divided into Wine {white wine, red wine} and Whiskey {blended whiskey, pure malt whiskey})
Also assume that the predefined minimum support values of items are given in Table 2 and the minimum confidence value is set at 0.85. The proposed mining algorithm for finding association rules with the predefined multiple minimum supports and multiple-level item taxonomy under the maximum-itemset minimum-taxonomy support constraint proceeds as follows.

Table 2. The predefined minimum support values for items

Item                 Minsupport      Item                  Minsupport
White bread          0.4             Tea                   0.5
Wheat bread          0.4             Coke                  0.7
Soda cookies         0.4             Red wine              0.4
Chocolate biscuits   0.7             White wine            0.5
Plain milk           0.6             Blended whiskey       0.6
Juice milk           0.4             Pure malt whiskey     0.4
First of all, each item name is encoded using the predefined taxonomy. The results are shown in Table 3. All transactions shown in Table 1 are encoded using the above encoding scheme. Since k = 1, where k is used to store the level number currently being processed, all the items in the transactions are first grouped at level one. For example, the items 112 and 122 are grouped into 1** since their first digits are both 1.

Table 3. Codes of the item names

Item name      Code      Item name            Code
Foods          1**       Soda cookies         121
Drinks         2**       Chocolate biscuits   122
Alcohol        3**       Plain milk           211
Bread          11*       Juice milk           212
Cookies        12*       Tea                  221
Milk           21*       Coke                 222
Beverages      22*       Red wine             311
Wine           31*       White wine           312
Whiskey        32*       Blended whiskey      321
White bread    111       Pure malt whiskey    322
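As a small illustration of this encoding (our sketch, using the codes from Table 3), grouping an encoded transaction at level k simply keeps the first k digits of each code:

codes = {"Plain milk": "211", "Juice milk": "212", "Tea": "221", "Coke": "222"}  # excerpt of Table 3

def group_at_level(transaction, k, code_len=3):
    # keep the first k digits of each code and pad with '*' (e.g. '211' -> '2**' at level 1)
    return {code[:k] + "*" * (code_len - k) for code in transaction}

# T1 = {plain milk, tea, coke} is encoded as {'211', '221', '222'} and becomes {'2**'} at level 1
print(group_at_level({"211", "221", "222"}, 1))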
The count and support of each item occurring in the ten transformed transactions are then calculated. Take 1** as an example. The count of 1** is obtained as 9. Its support value is then calculated as 0.9 (=9/10). The support value of each item at level 1 is compared with its threshold, which is the minimum of minimum supports of items contained in it. Since 1** includes the four items, 111, 112, 121 and 122, its minimum support is then calculated as min(0.4, 0.4, 0.4, 0.5) according to the minimum supports given in Table 2. The minimum support of 1** is thus 0.4. Since the support value of 1** (=0.9) is larger than its minimum support, 1** is put into the large 1-itemset L11 at level 1. In this example, the set of large 1-itemsets at level 1 is: L11 = {{1**}, {2**}}. All the items belonging to 3** (311, 312, 321 and 322) are then removed from the original transaction data set. The candidate set C21 is generated from L11 as {1**, 2**}. The support of {1**, 2**} can then be derived as 0.7. Since the support value of {1**, 2**} is larger than the maximum (0.4) of the minimum supports of 1** (0.3) and 2** (0.4), it is thus large at level 1 and is put into the set of large 2-itemsets L21. In this example, L21 = {1**, 2**}. The mining processes are then executed for level 2 and level 3. When large itemsets are generated through the process, candidate itemsets may be pruned in the following three cases: Case 1: The item which belongs to a higher-level small item does not satisfy its own minimum support and can be removed from the original transactions due to the minimum-taxonomy constraint. For example, the items, 111 and 112, belonging to 11*, are pruned since the support of 11* is smaller than its threshold. The original transaction data set is then reformed as a new one without 111 and 112. Case 2: An itemset is pruned if any of its subset is not in the set of large itemsets. For example, the 2-itemset {11*, 12*} is pruned since 11* is not a large item.
Case 3: An itemset is pruned if any of the support values of the items in the itemset is smaller than the maximum of the minimum supports of the items contained in it. For example, the two items 122 and 221 at level 3 have support values 0.8 and 0.6, which are respectively larger than their own minimum supports; 122 and 221 are thus large 1-itemsets for level 3. The 2-itemset {122, 221} formed from them, however, is not likely to be a large itemset, since the support value (0.6) of item 221 is smaller than the maximum (0.7) of the minimum supports of these two items. The 2-itemset {122, 221} is thus pruned. Since the level of the given taxonomy is 3, the association-rule deriving process is then executed. The association rules for each large q-itemset, q ≥ 2, are constructed by the following substeps: (a) all possible association rules are formed; (b) the confidence factors of the above association rules are calculated. Take the possible association rule “If 12*, then 21*” as an example. The confidence value for this rule is calculated as: s12*∪21* / s12* = 0.5 / 0.8 = 0.625.
The confidence values of the above association rules are compared with the predefined confidence threshold λ. Assume the confidence λ is set at 0.85 in this example. The following five association rules are thus output: 1. If 1**, then 2**, with confidence 1.0; 2. If 2**, then 1**, with confidence 0.875; 3. If 21*, then 22*, with confidence 1.0; 4. If 22*, then 21*, with confidence 0.875; 5. If 21* and 12*, then 22*, with confidence 1.0. The proposed algorithm can thus find the large itemsets level by level without backtracking.
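The rule-construction step (STEPs 13–14) can be sketched as follows; the support numbers in the usage lines are hypothetical and only illustrate the confidence computation.

def rule_confidences(itemset_support, large_itemsets, min_conf=0.85):
    # For every large itemset of size >= 2, emit the single-consequent rules whose
    # confidence (support of the whole itemset over support of the antecedent)
    # reaches min_conf. itemset_support maps frozensets to support values.
    rules = []
    for itemset in large_itemsets:
        if len(itemset) < 2:
            continue
        for consequent in itemset:
            antecedent = itemset - {consequent}
            conf = itemset_support[itemset] / itemset_support[antecedent]
            if conf >= min_conf:
                rules.append((antecedent, consequent, round(conf, 3)))
    return rules

# Hypothetical supports, for illustration only:
supports = {frozenset({"21*"}): 0.8, frozenset({"22*"}): 0.7, frozenset({"21*", "22*"}): 0.7}
print(rule_confidences(supports, list(supports)))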
5 Conclusion
Using different criteria to judge the importance of different items and managing taxonomic relationships among items are two issues that usually occur in real mining applications. In this paper, we have proposed a simple and efficient mining algorithm to handle these two issues. In the proposed algorithm, items may have different minimum supports and taxonomic relationships, and the maximum-itemset minimum-taxonomy support constraint is adopted: the minimum support for an itemset is set as the maximum of the minimum supports of the items contained in the itemset, while the minimum support of the item at a higher taxonomic concept is set as the minimum of the minimum supports of the items belonging to it. The rationale for using the maximum constraint has been explained, and this constraint may be suitable for some mining domains. Under the constraint, the characteristic of downward-closure can be easily kept, such that the original Apriori algorithm can be easily extended to finding large itemsets. The proposed algorithm can thus generate large itemsets from multiple-level items level by level and then derive association rules. Some pruning heuristics have also been adopted in this paper.
Due to the minimum-taxonomy constraint, an item whose higher-level item does not satisfy its minimum support is removed from the original transactions. An itemset in which the support value of any item is smaller than the maximum of the minimum supports of the items is also pruned. These heuristics make the proposed algorithm work more efficiently.
Acknowledgement This research was supported by the National Science Council of the Republic of China under contract NSC 94-2213-E-390-005.
References 1. R. Agrawal, T. Imielinksi and A. Swami, “Mining association rules between sets of items in large database,“ The ACM SIGMOD Conference, pp. 207-216, Washington DC, USA, 1993. 2. R. Agrawal, T. Imielinksi and A. Swami, “Database mining: a performance perspective,” IEEE Transactions on Knowledge and Data Engineering, Vol. 5, No. 6, pp. 914-925, 1993. 3. R. Agrawal and R. Srikant, “Fast algorithm for mining association rules,” The International Conference on Very Large Data Bases, pp. 487-499, 1994. 4. R. Agrawal and R. Srikant, ”Mining sequential patterns,” The Eleventh IEEE International Conference on Data Engineering, pp. 3-14, 1995. 5. S. Brin, R. Motwani, J. D. Ullman, and S. Tsur, “Dynamic itemset counting and implication rules for market-basket data,“ in Proceedings of the 1997 ACM-SIGMOD International Conference in Management of Data, pp. 207-216, 1997. 6. J. S. Park, M. S. Chen, and P. S. Yu, “An Effective hash-based algorithm for mining association rules,” in Proceedings of the 1995 ACM-SIGMOD International Conference in Management of Data, pp. 175-186, 1995. 7. A. Savasere, E. Omiecinski, and S. Navathe, “An efficient algorithm for mining association rules in large databases,” in Proceedings of the 21st International Conference in Very Large Data Bases (VLDB’95), pp. 432-443, 1995. 8. J. Han and Y. Fu, “Discovery of multiple-level association rules from large databases,” in Proceeding of the 21st International Conference on Very Large Data Bases, pp. 420-431, 1995. 9. R. Srikant and R. Agrawal, “Mining generalized association rules,” in Proceeding of the 21st International Conference on Very Large Data Bases, pp. 407-419, 1995. 10. B. Liu, W. Hsu, and Y. Ma, “Mining association rules with multiple minimum supports,” in Proceedings of the 1999 International Conference on Knowledge Discovery and Data Mining, pp. 337-341, 1999. 11. K. Wang, Y. H and J. Han, “Mining frequent itemsets using support constraints,” in Proceedings of the 26th International Conference on Very Large Data Bases, pp. 43-52, 2000. 12. Y. C. Lee, T. P. Hong and W. Y. Lin, "Mining fuzzy association rules with multiple minimum supports using maximum constraints", The Eighth International Conference on Knowledge-Based Intelligent Information and Engineering Systems, 2004, Lecture Notes in Computer Science, Vol. 3214, pp. 1283-1290, 2004. 13. Y. C. Lee, T. P. Hong and W. Y. Lin, "Mining association rules with multiple minimum supports using maximum constraints," International Journal of Approximate Reasoning, Vol. 40, No. 1, pp. 44-54, 2005.
A Measure for Data Set Editing by Ordered Projections Jesús S. Aguilar-Ruiz, Juan A. Nepomuceno, Norberto Díaz-Díaz, and Isabel Nepomuceno Bioinformatics Group of Seville, Pablo de Olavide University and University of Seville, Spain [email protected], {janepo, ndiaz, isabel}@lsi.us.es
Abstract. In this paper we study a measure, named the weakness of an example, which allows us to establish the importance of an example when finding representative patterns for the data set editing problem. Our approach consists in reducing the database size without losing information, using the patterns by ordered projections algorithm. The idea is to relax the reduction factor with a new parameter, λ, removing all examples of the database whose weakness satisfies a condition over this λ. We study how to establish this new parameter. Our experiments have been carried out using all databases from the UCI Repository, and they show that a size reduction is possible in complex databases without a noticeable increase in the error rate.
1
Introduction
Data mining algorithms must work with databases with tens of attributes and thousands of examples when they are used to solve real, specific problems. This kind of database contains much more information than the standard databases, most of them of small size, which are usually used for testing data mining techniques. A lot of time and memory are necessary to accomplish the final tests on these real databases. Methodologies based on axis-parallel classifiers provide decision rules that are easy for humans to understand, and they are very useful for experts interested in extracting knowledge from databases; they are the most common among the methodologies used by data mining researchers. If we want to apply one of these tools, such as C4.5 or k-NN [9], to solve a real problem with a huge amount of data, we should use some method to decrease the computational cost of applying these algorithms. Database preprocessing techniques are used to reduce the number of examples or attributes as a way of decreasing the size of the database with which we are working. There are two different types of preprocessing techniques: editing (reducing the number of examples by eliminating some of them, finding representative patterns or calculating prototypes) and feature selection (eliminating non-relevant attributes).
Editing methods are related to nearest neighbour (NN) techniques [4]. For example, in [5] Hart proposed to include in the set of prototypes those examples whose classification by the nearest neighbour technique is wrong; in this way, every member of the main set is closer to a member of the subset of prototypes of the same class than to a member of a different class in this subset. In [2] a variant of the previous method is proposed. In [15], Wilson suggests eliminating the examples which are incorrectly classified by the k-NN algorithm; the works of [13] and [11] follow the same idea. Other variants are based on Voronoi diagrams [7], for example Gabriel neighbours (two examples are Gabriel neighbours if their diametrical sphere does not contain any other example) or relative neighbours [14] (two examples p and q are relative neighbours if, for every example x in the set, d(p, q) < max{d(p, x), d(x, q)} holds). In all the previous methods the distances between examples must be calculated, so that, if we are working with n examples with m attributes, the first methods take Θ(mn2) time, the method proposed in [11] takes Θ(mn2 + n3), and the methods based on Voronoi diagrams take Θ(mn3). In this paper we work along the line proposed by Aguilar, Riquelme and Toro [1], where a first version of an editing method by the ordered projection technique was introduced. This algorithm works well with continuous attributes. In [10], a second and more elaborate version of this algorithm is proposed; it works simultaneously with continuous and discrete (i.e., nominal) attributes and it preserves the properties of the initial approach. Working with NN-based techniques implies introducing some initial parameters and defining a distance to calculate the proximity between the different examples of the database. This new method based on the ordered projection technique does not need to define any distance and works with each attribute independently, as we will see in the next section. The most important characteristics of this approach to editing, in addition to the absence of distance calculations, are the considerable reduction of the number of examples, the lower computational cost Θ(mn log n) and the conservation of the decision boundaries (especially interesting for applying classifiers based on axis-parallel decision rules). We are interested in a measure, the weakness of an example, which helps us to determine the importance of an example as a decision boundary: more weakness implies less relevance. We propose a relaxation of the projection approach that eliminates those examples whose weakness is larger than a threshold, using a new parameter, λ, in the algorithm. At present some authors think that editing methods are rather old-fashioned because, with today's standard technology (even though today's data sets are larger), it is not clear whether it is worthwhile to spend the pre-processing time to perform editing. That is why methods which embed approaches for (simultaneous) feature selection (or editing) and classification, such as SVMs [8], are being used. We are interested in studying how to relax the projection approach to the editing problem in order to combine this new measure with the parameter of a similar approach to feature selection (eliminating non-relevant attributes), see [12]. A good method (as a new theory of measure for preprocessing techniques) to find a threshold which reduces the number of examples and the number of attributes without losing information in huge databases would be a great achievement.
In this paper, we show that in more complicated databases we can relax the reduction factor, eliminating those examples whose weakness satisfies a condition over λ. We have dealt with two different databases from the UCI repository [3] (University of California at Irvine), the heart-statlog database and the ionosphere database. The k-NN (for k = 1) and C4.5 classifiers have been used to classify our databases before and after applying our editing method POPλ (patterns by ordered projections). The condition over the weakness of each example has been relaxed gradually in order to study the importance of this measure and the goodness of our method when applied to algorithms based on axis-parallel classifiers. After having determined the threshold using the λ parameter, we study it over the different databases of the UCI repository with continuous attributes [3]. A ten-fold cross-validation has been used for each database.
2
Description of the Algorithm
A database with m attributes and n examples can be seen as a space with m dimensions, where each example takes a value in the range of each attribute and has an associated class. Each attribute represents an axis of this space, containing n points or objects, and each example carries the label of its corresponding class. For example, if our database has two attributes, we are in a two-dimensional space (the attributes are represented by the x and y axes, respectively); see Figure 1. As we said in the previous section, our method does not need to define any distance to compare the different examples; we work with the projection of each example onto each axis of the space. The main idea of the algorithm is the following: if the dimension of the space is d, then in order to delimit a region (in the context of the example, region means hyper-rectangle, although our algorithm will work with any hyperplane) of this space containing examples of the same class, we need only 2d examples, which define the borders of this region; for example, in order to define a square in R2 we only need four points. Thus, if we have more than 2d examples of the same class in the region, we can eliminate the rest, which lie inside. Our objective will be to eliminate the examples which are not on the boundaries of the region. The way of finding whether an example is inner to a region is to study whether it is inner to each corresponding interval in the projection of the region onto the different axes of the space. An ordered projected sequence is the sequence formed by the projection of the space onto one particular axis, i.e., a particular attribute. A partition is a subsequence formed from one ordered projected sequence which keeps the projection ordered. We define the weakness of an example as the number of times that the example is not a border in a partition (i.e., it is inner to a partition), over all partitions obtained from the ordered projected sequences of each attribute.
Fig. 1. A two-dimensional database with twelve examples
We call irrelevant examples those examples whose weakness is equal to the number of attributes. In order to illustrate the method, we have designed a simple two-dimensional labelled database. This database is depicted in Figure 1, picture 1, and it contains twelve examples from two different classes: N (numbers) and L (letters). An optimal classifier would obtain two rules, one for the examples labelled with numbers and one for those labelled with letters, see picture 2 in Figure 1, with overlapping rules. Such a classifier must be hierarchical because it produces overlapping rules. This is not the case for an axis-parallel classifier, which does not produce overlapping rules. For example, C4.5 and many other similar classifiers would produce situations like those shown in pictures 2, 3, 4 and 5 of Figure 1. The aim of our algorithm is to build regions containing all the examples of the same class and to eliminate those examples which are not necessary to define the regions, that is, those examples which are not on the borders of the regions. If we consider the situation depicted in picture 7 of Figure 1, each region contains only examples of the same class in a maximal way. The projection of the examples onto the abscissa axis, for the first attribute, produces four ordered sequences {N, L, N, L} corresponding to {[5, 2, 6, 4, 1], [b], [3], [f, c, d, e, a]}. Similarly, the projection onto the ordinate axis produces the sequences {N, L, N, L} formed by the examples {[1, 2, 6, 5, 3], [d], [4], [f, c, b, a, e]}. Each sequence represents a rectangular region as a possible solution of a classifier, and the initial and final examples of the sequence (if it has only one example, that example is simultaneously the initial and the final one) represent the lower and upper values for each coordinate of this rectangle.
In this situation, 5 and 1 are borders for the first attribute. According to this figure, the weakness of each example would be 0 for examples ‘1’, ‘3’ and ‘f’; 1 for ‘4’, ‘d’, ‘e’, ‘5’, ‘b’ and ‘a’; and 2 for examples ‘2’, ‘6’ and ‘c’. These last examples have weakness equal to the dimension, therefore they are not necessary to define the subregions: they are irrelevant examples. So, they are removed from the database, see picture 8 in Figure 1.
2.1
Algorithm
Given the database E, let n and n′ be the initial and the final number of examples (n ≥ n′), let m be the number of attributes and let λ ∈ R be the initial parameter used to relax the weakness measure. The POPλ algorithm (algorithm for patterns by ordered projections) is the following:
................................................................................
Procedure POPλ (in: (En×m, λ), out: En′×m)
  for each example ei ∈ E, i ∈ {1, . . . , n}
    weakness(ei) := 0
  end for
  for each attribute aj, j ∈ {1, . . . , m}
    Ej := QuickSort(Ej, aj) in increasing order
    Ej := ReSort(Ej)
    for each example ei ∈ Ej, i ∈ {1, . . . , n}
      if ei is not a border
        weakness(ei) := weakness(ei) + 1
      end if
    end for
  end for
  for each example ei ∈ E, i ∈ {1, . . . , n}
    if weakness(ei) ≥ m · λ
      remove ei from E
    end if
  end for
end POPλ
................................................................................
The computational cost of POP is Θ(mn log n). This cost is much lower than that of other algorithms proposed in the literature, normally Θ(mn2). The algorithm constructs the ordered projected sequence over each axis of the space (attribute) and calculates the weakness of each example. The values of the projections need to be sorted when working with each attribute. We use the QuickSort algorithm [6], and a second sort, which we call ReSort, is applied in order to create regions containing examples of the same class in a maximal way. The examples sharing the same value for an attribute are not necessarily nearer to those examples that have the same class but another value.
Fig. 2. Applying POPλ over two different databases. The ordinate axis shows the percentage of retention; the abscissa axis shows different values of the λ parameter.
The solution to that problem consists of re-sorting the interval containing repeated values. The heuristic is applied to obtain the least number of class changes. Therefore, the algorithm sorts by value and, in case of equality, by class (the ReSort step). When working with a database with nominal attributes, another, more elaborate version of this kind of algorithm could be considered: discrete attributes do not need to be sorted, and the weakness of all the examples is increased except for the one example with the least weakness obtained over the continuous attributes. In this first approach we are not interested in databases with nominal attributes. Finally, the examples satisfying the condition over the λ parameter are eliminated from the database. This parameter permits us to control the level of reduction of our editing method.
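A runnable Python sketch of POPλ, following the pseudocode above, is given below; the border test (an example is a border if it starts or ends a maximal same-class run of the re-sorted projection) and the simplified tie-breaking by class in place of the full ReSort heuristic are our interpretation, not the authors' code.

def pop_lambda(examples, classes, lam):
    # examples: list of numeric attribute vectors; classes: parallel list of labels.
    # Returns the indices of the retained examples.
    n, m = len(examples), len(examples[0])
    weakness = [0] * n
    for j in range(m):
        # sort by value and, on ties, by class (simplified ReSort)
        order = sorted(range(n), key=lambda i: (examples[i][j], classes[i]))
        # split the ordered projection into maximal same-class partitions
        runs, start = [], 0
        for pos in range(1, n + 1):
            if pos == n or classes[order[pos]] != classes[order[start]]:
                runs.append(order[start:pos])
                start = pos
        for run in runs:
            for i in run[1:-1]:          # inner examples of a partition are not borders
                weakness[i] += 1
    return [i for i in range(n) if weakness[i] < m * lam]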
3
Experiments
Our tests have been carried out on two different databases, the heart-statlog database and the ionosphere database, both obtained from the UCI repository [3]. The main objective is to compare the performance of our editing method when the λ parameter is modified. We have a measure for each example, the weakness of the example, which determines its importance as a decision boundary, and we relate this measure to the parameter of our editing algorithm. Our objective is to study how to establish a threshold for eliminating examples from the database, i.e., we want to determine the parameter. A ten-fold cross-validation is performed by dividing the database into ten parts and taking blocks of nine parts as our training set, with the remaining one as the test set. We apply our reduction method to the training set and then, after having applied the corresponding classification algorithm, we use the test set to validate the process. This operation is repeated ten times, each time with a different one of the ten subsets we have built.
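The validation protocol can be sketched as follows, reusing the pop_lambda function from the previous sketch; train_and_test stands for whichever classifier (C4.5, 1-NN) is plugged in and is an assumed callback, not part of the original method.

import random

def cv_error_with_editing(examples, classes, lam, train_and_test, folds=10, seed=0):
    # Split into `folds` parts, edit only the training part with POP-lambda,
    # then evaluate on the untouched test part; return the average error rate.
    idx = list(range(len(examples)))
    random.Random(seed).shuffle(idx)
    errors = []
    for f in range(folds):
        test = set(idx[f::folds])
        tr = [i for i in idx if i not in test]
        keep = pop_lambda([examples[i] for i in tr], [classes[i] for i in tr], lam)
        tr_kept = [tr[i] for i in keep]
        errors.append(train_and_test([examples[i] for i in tr_kept],
                                     [classes[i] for i in tr_kept],
                                     [examples[i] for i in test],
                                     [classes[i] for i in test]))
    return sum(errors) / folds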
Table 1. Computational cost in seconds (CCS) and error rate (ER, average ± standard deviation) for the C4.5 and k-NN (with k = 1) algorithms over the different databases obtained with POPλ

                 C4.5                                         k-NN
                 Heart-Statlog        Ionosphere              Heart-Statlog        Ionosphere
                 CCS   ER ±σ          CCS   ER ±σ             CCS   ER ±σ          CCS   ER ±σ
Original         0.08  21.7 ±6.6      0.16  14.3 ±7.9         0.09  24.6 ±8.1      0.06  15.4 ±8.2
POPλ=1           0.08  24.0 ±6.0      0.18  14.3 ±7.9         0.05  24.8 ±8.6      0.05  16.3 ±8.8
POPλ=0.95        0.13  20.8 ±10.8     0.15  15.4 ±8.7         0.05  28.4 ±5.9      0.05  16.8 ±9.3
POPλ=0.90        0.08  20.8 ±10.8     0.14  14.8 ±7.7         0.05  28.4 ±5.9      0.05  17.1 ±9.4
POPλ=0.85        0.05  22.1 ±6.7      0.14  12.0 ±7.2         0.03  24.9 ±11.4     0.06  17.1 ±9.4
POPλ=0.80        0.03  27.4 ±11.4     0.12  14.3 ±6.8         0.02  39.4 ±8.8      0.05  16.3 ±8.9
POPλ=0.75        0.04  37.4 ±11.4     0.09  17.7 ±8.1         0.01  39.4 ±8.8      0.05  15.4 ±8.2
POPλ=0.70        0.03  40.3 ±14.8     0.08  25.1 ±14.5        0.01  45.4 ±9.5      0.04  15.7 ±8.5
POPλ=0.65        0.02  42.0 ±14.2     0.06  30.2 ±11.7        0.00  41.3 ±9.5      0.04  19.7 ±11.5
POPλ=0.60        0.02  42.0 ±14.2     0.04  47.2 ±13.7        0.00  41.3 ±9.5      0.04  24.8 ±11.2
POPλ=0.55        0.01  43.0 ±15.8     0.04  58.3 ±16.6        0.00  46.6 ±6.5      0.03  44.1 ±9.5
POPλ=0.50        0.01  9.3* ±18.5     0.04  53.4 ±17.0        0.00  42.9 ±7.1      0.03  51.2 ±7.9
We apply the POPλ algorithm with λ ∈ {1, 0.95, 0.9, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6, 0.55, 0.5}. Figure 2 shows the percentage of retention of our method for the different values of λ (100% means that the database has not been reduced at all)¹. We notice that both curves are increasing; this is because the algorithm imposes the condition weakness(e_i) ≥ m · λ, so for each new value of the parameter we remove both the examples that newly satisfy the condition and those removed before. If we had used = instead of ≥, the graph would not be increasing and the process would not be cumulative. If a group of examples with the same weakness is removed for a value of λ, it will still be removed when the parameter decreases. We are interested in relaxing the condition on λ in order to remove examples from the database gradually. A possible threshold could be established by inspecting the graphs, but we must verify that no knowledge is lost for the classification phase. The results of the classifications using the C4.5 and k-NN techniques are shown in Table 1. We present the CCS, the computational cost in seconds, and the ER, the error rate, for the original database and the different reduced databases. ER is the classification error obtained when the test set validates the model constructed by applying the different techniques to the different databases. Values are the average and standard deviation over the complete 10-fold cross-validation (the ten experiments obtained for each value). We observe that the computational cost decreases as the λ value decreases. Thus, if we find a λ value lower than 1 that loses no information, we reduce the database size and the computational cost at the same time.
¹ Percentages of retention for each value of λ are the average of the ten experiments carried out in the 10-fold CV.
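The retention percentages plotted in Figure 2 can be reproduced from a single weakness vector, as in the short sketch below (ours, reusing the weakness array computed inside the earlier pop_lambda sketch); because the kept set {e_i : weakness(e_i) < m · λ} only grows with λ, the resulting curve is necessarily monotone.
................................................................................
# Percentage of retained examples as a function of lambda, computed once per
# weakness vector; removal is cumulative as lambda decreases.
import numpy as np

def retention_curve(weakness, m,
                    lambdas=(1.0, 0.95, 0.9, 0.85, 0.8, 0.75,
                             0.7, 0.65, 0.6, 0.55, 0.5)):
    w = np.asarray(weakness)
    return {lam: 100.0 * np.mean(w < m * lam) for lam in lambdas}
................................................................................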
The purpose is to study the relevance of POPλ as a reduction method for the different values of λ. The best situation would be to obtain the λ that produces the greatest reduction of the database together with the lowest error rate when the classification methods are applied. We observe a possible threshold between λ = 0.85 and λ = 0.80: for λ = 0.80 examples are removed from the database dramatically and the error rate clearly increases. We therefore have a good candidate for our threshold, but we must verify it and study how to establish the value of the parameter.
In order to confirm the suitability of this value, Table 2 studies the situation for these two values over all the databases from the UCI repository with continuous attributes [3]. We show PR, the percentage of the original training set retained by the reduction algorithm (so 100 − PR is the percentage of examples removed); values are the average over the complete 10-fold cross-validation process. We must consider how the error changes when the database is reduced considerably. Our aim is to ascertain the value of λ that reduces the database the most without losing information. For example, for Heart-Statlog the error rate of the original database using C4.5 is 21.7 ± 6.6, whereas after applying POPλ=0.85 (Table 2) the error is only 22.1 ± 6.7. That is, we have managed to remove 68.8% of the examples (100 − PR) without losing knowledge. If we take the same database and configuration but use k-NN, similar behaviour is observed. In general, for λ = 0.85, the databases are reduced to 61.8% of their original size while the error rate increases from 18.6 ± 4.8 to 24.8 ± 5.8 using C4.5, and from 15.4 ± 4.6 to 23.1 ± 5.0 using 1-NN. We have drawn in bold in both tables the data that are relevant according to Student's t statistical distribution. Looking at the results obtained with λ = 0.8, although the databases are reduced dramatically, the error rate increases notably. These experiments show how to establish an appropriate value of the λ parameter so that the POPλ algorithm reduces the database as far as possible while preserving the knowledge of the original database. In summary, with λ values lower than 1 a larger reduction of the database size is possible without losing information, but this reduction is limited to λ = 0.85. We have established a method to find a threshold that relaxes the reduction factor of the algorithm for finding representative patterns for data set editing, and we have verified its suitability on all the databases of the UCI repository with only continuous attributes.
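As a small worked check of the figures quoted above, using our reading of the Heart-Statlog row of Table 2 for C4.5 at λ = 0.85:
    examples removed = 100 − PR = 100 − 31.2 = 68.8%
    change in error  = 22.1 − 21.7 = 0.4, well within σ ≈ 6.6.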
4 Conclusions
We have defined a new parameter, λ, which helps us to remove from a database all the examples that satisfy the condition weakness ≥ (number of attributes) · λ. We have therefore established a threshold, based on a measure computed for each example, in order to reduce the number of examples in a database. After analyzing our approach on several databases from the UCI repository, we conclude that it is possible to reduce the database size by up to 40% without losing any knowledge. Furthermore, the computational cost decreases, because examples whose weakness is smaller than the number of attributes may also be removed.
Table 2. Error rate (ER, average ± standard deviation) for the C4.5 and k-NN (k = 1) algorithms over the databases from the UCI repository with continuous attributes. Every database is considered before and after applying POPλ, with λ = 0.85 in the first table and λ = 0.80 in the second. PR is the percentage of the original training set retained by the reduction algorithm (100 − PR is the percentage of examples removed).
λ = 0.85
                        C4.5                            k-NN
                        Original     POPλ=0.85          Original     POPλ=0.85
Data Base               ER ±σ        ER ±σ       PR     ER ±σ        ER ±σ       PR
Heart-Statlog           21.7 ±6.6    22.1 ±6.7   31.2   24.6 ±8.1    24.9 ±11.4  31.2
Ionosphere              14.3 ±7.9    12.0 ±7.2   78.2   15.4 ±8.2    17.1 ±9.4   78.2
Balance-Scale           25.4 ±7.2    66.0 ±17.7  10.8   20.2 ±5.6    54.0 ±13.1  10.8
Breast-W                 5.2 ±2.7    17.0 ±9.9    5.2    5.2 ±4.7    23.2 ±7.8    5.2
Bupa                    15.4 ±0.0    42.0 ±0.0   42.0    0.0 ±0.0    22.6 ±0.0   42.0
Diabetes                26.4 ±7.3    28.1 ±7.7   55.3   29.6 ±4.0    35.0 ±5.8   55.3
Glass                   51.3 ±19     51.4 ±19    97.6   39.8 ±12     39.8 ±12    97.6
Iris                     2.0 ±0.0    12.7 ±0.0   26.0    0.0 ±0.0     3.3 ±0.0   26.0
Lung-Cancer             15.6 ±0.0     9.4 ±0.0   96.0    0.0 ±0.0     3.1 ±0.0   96.0
Page-Blocks              1.5 ±0.0     9.3 ±0.0   13.0    0.3 ±5.6    23.5 ±3.6   13.0
Pima-Indians-Diabetes   15.9 ±0.0    23.6 ±0.0   58.0    0.0 ±0.0    11.9 ±1.8   58.0
Segment                  3.3 ±1.3     3.8 ±1.5   92.0    3.2 ±1.6     3.2 ±1.6   92.0
Sonar                   45.5 ±19.3   45.5 ±19.3 100.0   49.4 ±16.1   49.4 ±16.1 100.0
Vehicle                 27.8 ±3.4    28.2 ±4.6   88.1   31.9 ±4.6    31.8 ±4.4   88.1
Waveform-5000           25.1 ±1.5    25.0 ±1.8   97.1   26.8 ±1.5    26.8 ±1.6   97.1
Wine-5000                1.1 ±0.0     1.1 ±0.0   98.0    0.0 ±0.0     0.0 ±0.0   98.0
Average                 18.6 ±4.8    24.8 ±5.8   61.8   15.4 ±4.6    23.1 ±5.0   61.8

λ = 0.80
                        C4.5                            k-NN
                        Original     POPλ=0.80          Original     POPλ=0.80
Data Base               ER ±σ        ER ±σ       PR     ER ±σ        ER ±σ       PR
Heart-Statlog           21.7 ±6.6    27.4 ±11.4  11.7   24.6 ±8.1    39.4 ±8.8   11.7
Ionosphere              14.3 ±7.9    14.3 ±6.8   62.5   15.4 ±8.2    16.3 ±8.9   62.5
Balance-Scale           25.4 ±7.2     6.0 ±17.8  10.8   20.2 ±5.6    54.0 ±13.2  10.8
Breast-W                 5.2 ±2.7    50.4 ±29     1.3    5.2 ±4.7    34.9 ±9.0    5.2
Bupa                    15.4 ±0.0    42.0 ±0.0   42.0    0.0 ±0.0    22.6 ±0.0   42.0
Diabetes                26.4 ±7.3    49.5 ±8.2   26.4   29.6 ±4.0    47.0 ±4.1   26.4
Glass                   51.3 ±19     52.6 ±17.5  94.2   39.8 ±12.0   39.8 ±12    94.2
Iris                     2.0 ±0.0    12.7 ±0.0   26.0    0.0 ±0.0     3.3 ±0.0   26.0
Lung-Cancer             15.6 ±0.0    12.5 ±0.0   90.0    0.0 ±0.0     9.4 ±0.0   90.0
Page-Blocks              1.5 ±0.0     9.3 ±0.0   13.0    0.3 ±5.6    23.6 ±0.0   13.0
Pima-Indians-Diabetes   15.9 ±0.0    48.8 ±0.0   28.0    0.0 ±0.0    32.2 ±0.0   13.0
Segment                  3.3 ±1.4     4.1 ±1.5   87.0    3.2 ±1.6     3.7 ±1.5   87.0
Sonar                   45.5 ±19.3   45.5 ±19.3 100.0   49.4 ±16.1   49.4 ±16.1 100.0
Vehicle                 27.8 ±3.4    29.0 ±6.4   79.6   31.9 ±4.6    32.8 ±5.3   79.6
Waveform-5000           25.1 ±1.5    25.1 ±1.6   92.6   26.8 ±1.5    26.9 ±1.6   92.6
Wine-5000                1.1 ±0.0     1.1 ±0.0   96.6    0.0 ±0.0     0.0 ±0.0   96.6
Average                 18.6 ±4.8    30.6 ±7.5   53.9   15.4 ±4.6    27.2 ±5.0   53.9
In spite of having introduced a new parameter and of dealing with an editing problem (some authors consider editing rather old-fashioned, since with today's standard technologies it is affordable to spend the pre-processing time), this paper opens a way to treat the preprocessing problem, both editing and feature selection, as a problem of choosing two parameters. As future work, we propose combining the POPλ algorithm with the SOAP algorithm [12]; this would yield an algorithm that preprocesses a database using two parameters in order to remove both examples and attributes.
References
1. Aguilar, J.S.; Riquelme, J.C.; Toro, M.: Data set editing by ordered projection. In: Proceedings of the 14th European Conference on Artificial Intelligence (ECAI'00), Berlin, Germany (2000), pp. 251-255.
2. Aha, D.W.; Kibler, D.; Albert, M.K.: Instance-based learning algorithms. Machine Learning 6 (1991), pp. 37-66.
3. Blake, C.; Merz, E.K.: UCI repository of machine learning databases (1998).
4. Cover, T.; Hart, P.: Nearest neighbor pattern classification. IEEE Transactions on Information Theory IT-13 (1) (1967), pp. 21-27.
5. Hart, P.: The condensed nearest neighbor rule. IEEE Transactions on Information Theory 14 (3) (1968), pp. 515-516.
6. Hoare, C.A.R.: Quicksort. The Computer Journal 5 (1) (1962), pp. 10-15.
7. Klee, V.: On the complexity of d-dimensional Voronoi diagrams. Archiv der Mathematik 34 (1980), pp. 75-80.
8. Neumann, Julia; Schnörr, Christoph; Steidl, Gabriele: SVM-based feature selection by direct objective minimisation. In: Pattern Recognition, Proceedings of the 26th DAGM Symposium, LNCS, Springer, August (2004).
9. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA (1993).
10. Riquelme, José C.; Aguilar-Ruiz, Jesús S.; Toro, Miguel: Finding representative patterns with ordered projections. Pattern Recognition 36 (2003), pp. 1009-1018.
11. Ritter, G.; Woodruff, H.; Lowry, S.; Isenhour, T.: An algorithm for a selective nearest neighbor decision rule. IEEE Transactions on Information Theory 21 (6) (1975), pp. 665-669.
12. Ruiz, R.; Riquelme, José C.; Aguilar-Ruiz, Jesús S.: NLC: A Measure Based on Projections. In: 14th International Conference on Database and Expert Systems Applications (DEXA 2003), Lecture Notes in Computer Science, Springer-Verlag, Prague, Czech Republic, 1-5 September (2003).
13. Tomek, I.: An experiment with the edited nearest-neighbor rule. IEEE Transactions on Systems, Man, and Cybernetics 6 (6) (1976), pp. 448-452.
14. Toussaint, G.T.: The relative neighborhood graph of a finite planar set. Pattern Recognition 12 (4) (1980), pp. 261-268.
15. Wilson, D.R.; Martinez, T.R.: Improved heterogeneous distance functions. Journal of Artificial Intelligence Research 6 (1) (1997), pp. 1-34.
Author Index
Abbassene, Ali 800 Abril, Montserrat 400 Adnan, Muhaimenul 363 Aguilar-Ruiz, Jesús S. 1339 Agulló, Loreine 917 Ahn, Eun Yeong 1211 Ahn, Tae-Chon 472 Aioanei, Daniel 691 Alcalá, Rafael 452 Alcalá-Fdez, Jesús 452 Alhajj, Reda 363 Amarger, Véronique 155 An, Zeng 353 Angerer, Bibiane 600 Antunes, Ana 908 Arredondo V., Tomás 462, 917 Avarias A., Jorge 917 Bacauskiene, Marija 701 Bannister, Peter R. 1149 Bao, Yukun 1080 Baoyan, Liu 1290 Barber, Federico 400 Barker, Ken 363 Barker, Nathan 962 Barlatier, Patrick 14 Basile, Teresa Maria Altomare 721 Batouche, Mohamed 800, 809 Belghith, Khaled 838 Berlanga, Francisco José 452 Bhattacharya, Anindya 943 Biba, Marenglen 721 Bingru, Yang 1290 Bø, Ketil 554 Boberg, Jorma 610 Bosin, Andrea 790 Brennan, Jane 898 Brézillon, Patrick 137, 146 Briand, Henri 312 Calderón B., Felipe 917 Cameron, Ian T. 70 Candel C., Diego 917
Carloni, Olivier 590 Carson-Berndsen, Julie 674, 691 Casanovas, Pompeu 1000 Casellas, Núria 1000 Cesta, Amedeo 410, 421 Chaiyaratana, Nachol 1090 Chang, Chir-Ho 1200 Chang, Chuan-Yu 1119 Chang, Yeon-Pung 760 Chang, Yu-Chuan 1249 Chau, K.W. 111, 548 Che, Oscar 711 Cheepala, Satish 972 Chelloug, Samia 809 Chen, Chao-Wen 1221 Chen, Jiah-Shing 197 Chen, Jr-Shian 1270 Chen, Jungan 859 Chen, Lijuan 711 Chen, Peter P. 750 Chen, Rong 639 Chen, Shifu 510 Chen, Shyi-Ming 432, 442, 1249, 1280 Chen, Stephen 44 Chen, Xi 54 Chen, Yi-Wei 760 Chen, ZhiHang 1169 Cheng, Ching-Hsue 478, 1270 Cheung, Yee Chung 127 Chien, Been-Chian 1318 Chohra, Amine 155 Choi, Chang-Ho 177 Choi, Nam-Sup 177 Choi, Sang-Kyu 177 Chu, Bong-Horng 1259 Chu, Ming-Hui 760 Chung, Paul 54 Chung, Paul Wai Hing 127 Chung, Yeh-Ching 1299 Clifford, John 972 Clifton, David A. 1149 Cordova H., Macarena 917 Cortellessa, Gabriella 421
Dai, Yun 927 Damásio, Carlos V. 650 Dan, Pan 353 Dapoigny, Richard 14 Davoodi, Mansoor 1100 De, Rajat K. 943 Demazeau, Yves 731 Despres, Sylvie 1014 Dessì, Nicoletta 790 Di Mauro, Nicola 629, 721 Diallo, Gayo 1024 Dias, Fernando Morgado 908 Díaz-Díaz, Norberto 1339 Dombrovskaia, Lioubov 917 Dourgnon-Hanoune, Anne 583 Ekinci, Murat 500 Esposito, Floriana 629, 721 Fanizzi, Nicola 629 Felfernig, Alexander 869 Feng, Jun 117 Ferilli, Stefano 721 Foulloy, Laurent 14 Fratini, Simone 421 Frenz, Christopher M. 935 Freund, Wolfgang 462 Gacto, María José 452 García-Hernández, Ma. de Guadalupe 1179 Garrido, Antonio 1179 Garza Castañón, Luis E. 520, 530 Gentil, Sylviane 2 Ghose, Aditya K. 780, 1127 Giunchiglia, Fausto 1 Gómez, Luis 917 Gonzalez, Avelino J. 137 Groza, Adrian 91 Gu, Mingyang 554 Guan, Ying 780 Guillet, Fabrice 312 Guo, Yubin 1071 Han, Chang-Wook 238 Hangos, Katalin M. 70 Hasgul, Servet 393 Hendtlass, Tim 292 Hennig, Sascha 332 Herrera, Francisco 452
Ho, Cheng-Seen 1259 Hong, Dong Kwon 879 Hong, Tzung-Pei 1329 Honghai, Feng 1290 Hou, Jia-Leh 197 Hsiao, Kai-Chung 1259 Hsu, Mu-Hsiu 1111 Hsu, Steen J. 1318 Huang, Liping 1309 Hung, Che-Lun 1299 Hung, Ming-Chuan 1299 Hur, Gi T. 488 Huynh, Xuan-Hiep 312 Iannone, Luigi 629 Imada, Miyuki 322 Ingolotti, Laura 400 Islier, A. Attila 741 Jakulin, Aleks 1000 Jamont, Jean-Paul 101 Jang, Min-Soo 540 Jannach, Dietmar 166, 819 Jędrzejowicz, Joanna 24 Jędrzejowicz, Piotr 24 Ji, Se-Jin 770 Jian-bin, He 353 Jorge, Rui D. 650 Kabanza, Froduald 838 Kanaoui, Nadia 155 Kang, Jaeho 1159, 1211 Kang, Yuan 760 Kanokphara, Supphanat 674, 691 Kim, Dongwon 830 Kim, Jung H. 488 Kim, Kap Hwan 1159, 1211 Kim, Kweon Yang 879 Kim, Kyoung Min 177 Kim, Sang-Jun 540 Kim, Sang-Woon 668 Kim, Sun Yong 322 Kim, Yong-Ha 177 Kim, Yong-Guk 540 Komatani, Kazunori 207 Kong, Ying 44 Krishna, Aneesh 780 Kuntanapreeda, Suwat 1090 Kuster, Jürgen 166
Kuwahara, Hiroyuki 962 Kwon, Jung-Woo 770 Lakner, Rozália 70 Laowattana, Djitt 60 Latorre R., Valeria 917 Leclère, Michel 590 Lee, Buhm 177 Lee, Chang-Shing 1240 Lee, Huey-Ming 1111 Lee, Jeong-Eom 540 Lee, Jia-Chien 682 Lee, Jimmy A. 898 Lee, Lee-Min 682 Lee, Seok-Joo 540 Lee, S.H. 889 Lee, Tsang-Yean 1111 Lee, Yeong-Chyi 1329 Letia, Ioan Alfred 91 Li, Xi 1138 Li, Yanzhi 272 Liang, Feng 859 Liao, Shih-Feng 1111 Liau, Churn-Jung 1249 Liegl, Johannes 819 Lim, Andrew 262, 272, 282, 711, 1138, 1189 Lin, Jin-Ling 1200 Lin, Si-Yan 1119 Lin, Ya-Tai 218 Lin, Yan-Xia 927 Liu, Alan 750 Liu, Ying 342 Liu, Yuan-Liang 760 Liu, Zhitao 1080 LiYun, He 1290 Loganantharaj, Raja 972 Loh, Han Tong 342 Lova, Antonio 400 Lu, Cheng-Feng 1318 Lu, Min 373 Lv, Pin 373 Ma, Hong 262, 272 Macek, Jan 674 Madani, Kurosh 155 Maneewarn, Thavida 60 Martin, Trevor 12 Martyna, Jerzy 1231
Masrur, Abul M. 1169 Melaye, Dimitri 731 Meliopoulos, Sakis A. 177 Mellal, Nacima 14 Meshoul, Souham 800, 809 Miao, Zhaowei 262, 1138 Mitra, Debasis 953 Montes de Oca, Saúl 520 Morales-Menéndez, Rubén 520 Moser, Irene 292 Mota, Alexandre Manuel 908 Mugnier, Marie-Laure 590 Mukai, Naoto 117 Mukkamala, Srinivas 619 Muñoz, Cesar 462 Muñoz R., Freddy 917 Murphey, Yi L. 1169 Murphey, Yi Lu 1309 Musliu, Nysret 302 Myers, Chris 962 Nakadai, Kazuhiro 207 Nakano, Mikio 207 Navarro, Nicolas 462 Németh, Erzsébet 70 Nepomuceno, Isabel 1339 Nepomuceno, Juan A. 1339 Nilsson, Carl-Magnus 701 Nkambou, Roger 838, 848 Noda, Jugo 573 Nolazco-Flores, Juan A. 530 Occello, Michel 101 Oddi, Angelo 421 Ogata, Tetsuya 207 Oh, Myung-Seob 1211 Oh, Soo-Hwan 668 Oh, Sung-Kwun 472 Ohta, Masakatsu 322 Okuno, Hiroshi G. 207 Onaindía, Eva 383, 1179 Ozcelik, Feristah 741 Ozkan, Metin 393 Pahikkala, Tapio 610 Park, Gwi-Tae 540, 830 Park, Ho-Sung 472 Park, Jong-Hee 770 Park, Jung-Il 238 Park, Min Chul 540
Park, Se Young 879 Park, Taejin 228 Parlaktuna, Osman 393 Pazienza, Maria Teresa 990, 1042 Peischl, Bernhard 660 Pelle, Josephine 848 Pennacchiotti, Marco 1042 Pereira, Luís Moniz 81 Pérez Reigosa, MariCarmen 530 Pes, Barbara 790 Policella, Nicola 410, 421 Potter, W.D. 244 Qi-lun, Zheng 353 Quirós, Fernando 462 Randall, Marcus 254 Rasconi, Riccardo 410 Rau, Hsin 1221 Rezaei, Jafar 1100 Ritthipravat, Panrasee 60 Roche, Christophe 583, 1034 Rodrigues, Brian 1138 Ryu, Kwang Ryel 228, 1159, 1211 Saga, Ryosuke 573 Sălăgean, Ana 127 Salakoski, Tapio 610 Salaün, Patrick 583 Salido, Miguel Angel 400 Samant, Gandhali 953 Santana, Pedro 81 Sapena, Oscar 383 Saricicek, Inci 393 Schuster, Alfons 187 Seeger P., Michael 917 Sengupta, Kuntal 953 Shao, Jinyan 34 Shen, Lixiang 342 Shr, Arthur M.D. 750 Shuo, Zhao 1290 Sie, Shun-hong 982 Simonet, Ana 1024 Simonet, Michel 1024 Sombattheera, Chattrakul 780, 1127 Soomro, Safeeullah 660 Spotton Visano, Brenda 44 Srikasam, Wasan 1090 Stellato, Armando 990 Su, Jin-Shieh 1111
Sung, Andrew H. 619 Szulman, Sylvie 1014 Takeda, Ryu 207 Tang, Deyou 1071 Tang, Yin 1059 Tarassenko, Lionel 1149 Tormos, Pilar 400 Tseng, Lin-Yu 218 Tsivtsivadze, Evgeni 610 Tsuji, Hiroshi 573 Tsujino, Hiroshi 207 Tuohy, Daniel R. 244 Valin, Jean-Marc 207 Vallbé, Joan-Josep 1000 Verikas, Antanas 701 Verrons, Marie-Hélène 731 Vieira, José 908 Viswanathan, M. 889 Wang, Chih-Huang 432 Wang, Hsing-Wen 564 Wang, Hui-Yu 442 Wang, Hung-Jen 1119 Wang, Jia-Wen 478 Wang, Mei-Hui 1240 Wang, Tien-Chin 1329 Wang, Zhenyu 711 Watanabe, Toyohide 117 Wotawa, Franz 600, 639, 660 Wu, C.L. 111 Wurst, Michael 332 Wyatt, Jeremy 60 Xi, Jianqing 1071 Xu, Dennis 619
Yamamoto, Shun’ichi 207 Yang, Dongyong 859 Yang, Don-Lin 1299 Yang, Jia-Yan 1200 Yang, Y.K. 889 Yang, Yubin 510 Yap, Ivan 342 Yeh, Jian-hua 982 Yong-quan, Yu 353 Yoon, Sung H. 488
Yu, Junzhi 34 Yueli, Li 1290 Zanzotto, Fabio Massimo 1042 Zhang, Jin-fang 373 Zhang, Kaicheng 1189
Zhang, Ren 927 Zhang, Yao 510 Zhu, Wenbin 282 Zinglé, Henri 1053 Zou, Hua 1080