Distributed Prediction and Hierarchical Knowledge Discovery by ARTMAP Neural Networks Gail A. Carpenter Department of Cognitive and Neural Systems, Boston University 677 Beacon Street, Boston, MA 02215 USA
[email protected]
Abstract

Adaptive Resonance Theory (ART) neural networks model real-time prediction, search, learning, and recognition. ART networks function both as models of human cognitive information processing [1,2,3] and as neural systems for technology transfer [4]. A neural computation central to both the scientific and the technological analyses is the ART matching rule [5], which models the interaction between top-down expectation and bottom-up input, thereby creating a focus of attention which, in turn, determines the nature of coded memories. Sites of early and ongoing transfer of ART-based technologies include industrial venues such as the Boeing Corporation [6] and government venues such as MIT Lincoln Laboratory [7]. A recent report on industrial uses of neural networks [8] states: “[The] Boeing … Neural Information Retrieval System is probably still the largest-scale manufacturing application of neural networks. It uses [ART] to cluster binary templates of aeroplane parts in a complex hierarchical network that covers over 100,000 items, grouped into thousands of self-organised clusters. Claimed savings in manufacturing costs are in millions of dollars per annum.” At Lincoln Lab, a team led by Waxman developed an image mining system which incorporates several models of vision and recognition developed in the Boston University Department of Cognitive and Neural Systems (BU/CNS). Over the years a dozen CNS graduates (Aguilar, Baloch, Baxter, Bomberger, Cunningham, Fay, Gove, Ivey, Mehanian, Ross, Rubin, Streilein) have contributed to this effort, which is now located at Alphatech, Inc. Customers for BU/CNS neural network technologies have attributed their selection of ART over alternative systems to the model's defining design principles. In listing the advantages of its THOT® technology, for example, American Heuristics Corporation (AHC) cites several characteristic computational capabilities of this family of neural models, including fast on-line (one-pass) learning, “vigilant” detection of novel patterns, retention of rare patterns, improvement with experience, “weights [which] are understandable in real world terms,” and scalability (www.heuristics.com). Design principles derived from scientific analyses and design constraints imposed by targeted applications have jointly guided the development of many variants of the basic networks, including fuzzy ARTMAP [9], ART-EMAP [10], ARTMAP-IC [11],
Gaussian ARTMAP [12], and distributed ARTMAP [3,13]. Comparative analysis of these systems has led to the identification of a default ARTMAP network, which features simplicity of design and robust performance in many application domains [4,14]. Selection of one particular ARTMAP algorithm is intended to facilitate ongoing technology transfer. The default ARTMAP algorithm outlines a procedure for labeling an arbitrary number of output classes in a supervised learning problem. A critical aspect of this algorithm is the distributed nature of its internal code representation, which produces continuous-valued test set predictions distributed across output classes. The character of their code representations, distributed vs. winner-take-all, is, in fact, a primary factor differentiating various ARTMAP networks. The original models [9,15] employ winner-take-all coding during training and testing, as do many subsequent variations and the majority of ART systems that have been transferred to technology. ARTMAP variants with winner-take-all (WTA) coding and discrete target class predictions have, however, shown consistent deficits in labeling accuracy and post-processing adjustment capabilities. The talk will describe a recent application that relies on distributed code representations to exploit the ARTMAP capacity for one-to-many learning, which has enabled the development of self-organizing expert systems for multi-level object grouping, information fusion, and discovery of hierarchical knowledge structures. A pilot study has demonstrated the network's ability to infer multi-level fused relationships among groups of objects in an image, without any supervised labeling of these relationships, thereby pointing to new methodologies for self-organizing knowledge discovery.
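As an illustration of the coding distinction discussed above, the sketch below contrasts winner-take-all and distributed test-set predictions over a small set of committed category nodes. It uses the standard fuzzy ART choice function as the activation measure, but it is only a toy reading of the idea, not the default ARTMAP algorithm of [14]; the weights, class assignments and parameter values are invented for the example.

```python
import numpy as np

def category_activations(I, weights, alpha=0.001):
    """Fuzzy ART choice values T_j = |I ^ w_j| / (alpha + |w_j|),
    where ^ is the component-wise minimum (fuzzy AND)."""
    fuzzy_and = np.minimum(I, weights)            # shape (n_categories, n_features)
    return fuzzy_and.sum(axis=1) / (alpha + weights.sum(axis=1))

def predict_wta(I, weights, category_class):
    """Winner-take-all coding: the single most active category casts the
    whole vote, yielding a discrete class label."""
    T = category_activations(I, weights)
    return int(category_class[np.argmax(T)])

def predict_distributed(I, weights, category_class, n_classes):
    """Distributed coding: activation is spread over categories and accumulated
    per output class, yielding continuous-valued class predictions."""
    T = category_activations(I, weights)
    y = T / T.sum()                               # normalised distributed code
    scores = np.zeros(n_classes)
    for yj, cj in zip(y, category_class):
        scores[cj] += yj
    return scores

# toy example: three committed categories mapped to two output classes
weights = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
category_class = np.array([0, 1, 1])
I = np.array([0.6, 0.4])
print(predict_wta(I, weights, category_class))                       # a single label
print(predict_distributed(I, weights, category_class, n_classes=2))  # class scores
```

The distributed scores can be thresholded, re-ranked or otherwise post-processed, which is the kind of adjustment that a hard winner-take-all label does not allow.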
References

[1] S. Grossberg, “The link between brain, learning, attention, and consciousness,” Consciousness and Cognition, vol. 8, pp. 1-44, 1999, ftp://cns-ftp.bu.edu/pub/diana/Gro.concog98.ps.gz
[2] S. Grossberg, “How does the cerebral cortex work? Development, learning, attention, and 3D vision by laminar circuits of visual cortex,” Behavioral and Cognitive Neuroscience Reviews, in press, 2003, http://www.cns.bu.edu/Profiles/Grossberg/Gro2003BCNR.pdf
[3] G.A. Carpenter, “Distributed learning, recognition, and prediction by ART and ARTMAP neural networks,” Neural Networks, vol. 10, pp. 1473-1494, 1997, http://cns.bu.edu/~gail/115_dART_NN_1997_.pdf
[4] O. Parsons and G.A. Carpenter, “ARTMAP neural networks for information fusion and data mining: map production and target recognition methodologies,” Neural Networks, vol. 16, 2003, http://cns.bu.edu/~gail/ARTMAP_map_2003_.pdf
[5] G.A. Carpenter and S. Grossberg, “A massively parallel architecture for a self-organizing neural pattern recognition machine,” Computer Vision, Graphics, and Image Processing, vol. 37, pp. 54-115, 1987.
[6] T.P. Caudell, S.D.G. Smith, R. Escobedo, and M. Anderson, “NIRS: Large scale ART 1 neural architectures for engineering design retrieval,” Neural Networks, vol. 7, pp. 1339-1350, 1994, http://cns.bu.edu/~gail/NIRS_Caudell_1994_.pdf
[7] W. Streilein, A. Waxman, W. Ross, F. Liu, M. Braun, D. Fay, P. Harmon, and C.H. Read, “Fused multi-sensor image mining for feature foundation data,” in Proceedings of the 3rd International Conference on Information Fusion, Paris, vol. I, 2000.
[8] P. Lisboa, “Industrial use of safety-related artificial neural networks,” Contract Research Report 327/2001, Liverpool John Moores University, 2001, http://www.hse.gov.uk/research/crr_pdf/2001/crr01327.pdf
[9] G.A. Carpenter, S. Grossberg, N. Markuzon, J.H. Reynolds, and D.B. Rosen, “Fuzzy ARTMAP: A neural network architecture for incremental supervised learning of analog multidimensional maps,” IEEE Transactions on Neural Networks, vol. 3, pp. 698-713, 1992, http://cns.bu.edu/~gail/070_Fuzzy_ARTMAP_1992_.pdf
[10] G.A. Carpenter and W.D. Ross, “ART-EMAP: A neural network architecture for object recognition by evidence accumulation,” IEEE Transactions on Neural Networks, vol. 6, pp. 805-818, 1995, http://cns.bu.edu/~gail/097_ART-EMAP_1995_.pdf
[11] G.A. Carpenter and N. Markuzon, “ARTMAP-IC and medical diagnosis: Instance counting and inconsistent cases,” Neural Networks, vol. 11, pp. 323-336, 1998, http://cns.bu.edu/~gail/117_ARTMAP-IC_1998_.pdf
[12] J.R. Williamson, “Gaussian ARTMAP: A neural network for fast incremental learning of noisy multidimensional maps,” Neural Networks, vol. 9, pp. 881-897, 1998, http://cns.bu.edu/~gail/G-ART_Williamson_1998_.pdf
[13] G.A. Carpenter, B.L. Milenova, and B.W. Noeske, “Distributed ARTMAP: A neural network for fast distributed supervised learning,” Neural Networks, vol. 11, pp. 793-813, 1998, http://cns.bu.edu/~gail/120_dARTMAP_1998_.pdf
[14] G.A. Carpenter, “Default ARTMAP,” in Proceedings of the International Joint Conference on Neural Networks (IJCNN'03), 2003, http://cns.bu.edu/~gail/Default_ARTMAP_2003_.pdf
[15] G.A. Carpenter, S. Grossberg, and J.H. Reynolds, “ARTMAP: Supervised real-time learning and classification of nonstationary data by a self-organizing neural network,” Neural Networks, vol. 4, pp. 565-588, 1991, http://cns.bu.edu/~gail/054_ARTMAP_1991_.pdf
Biography Gail Carpenter (http://cns.bu.edu/~gail/) obtained her graduate training in mathematics at the University of Wisconsin (PhD, 1974) and taught at MIT and Northeastern University before moving to Boston University, where she is a professor
in the departments of cognitive and neural systems (CNS) and mathematics. She is director of the CNS Technology Lab and CNS director of graduate studies; serves on the editorial boards of Brain Research, IEEE Transactions on Neural Networks, Neural Computation, Neural Networks, and Neural Processing Letters; has been elected to successive three-year terms on the Board of Governors of the International Neural Network Society (INNS) since its founding in 1987; and was elected member-at-large of the Council of the American Mathematical Society (1996-1999). She has received the INNS Gabor Award and the Slovak Artificial Intelligence Society Award. She regularly serves as an organizer and program committee member for international conferences and workshops, and has delivered many plenary and invited addresses. Together with Stephen Grossberg and their students and colleagues, Professor Carpenter has developed the Adaptive Resonance Theory (ART) family of neural networks for fast learning, pattern recognition, and prediction, including both unsupervised (ART 1, ART 2, ART 2-A, ART 3, fuzzy ART, distributed ART) and supervised (ARTMAP, fuzzy ARTMAP, ART-EMAP, ARTMAP-IC, ARTMAP-FTR, distributed ARTMAP, default ARTMAP) systems. These ART models have been used for a wide range of applications, such as remote sensing, medical diagnosis, automatic target recognition, mobile robots, and database management. Other research topics include the development, computational analysis, and application of neural models of vision, nerve impulse generation, synaptic transmission, and circadian rhythms.
The Brain's Cognitive Dynamics: The Link between Learning, Attention, Recognition, and Consciousness Stephen Grossberg Center for Adaptive Systems and Department of Cognitive and Neural Systems Boston University, 677 Beacon Street, Boston, MA 02215
[email protected] http://www.cns.bu.edu/Profiles/Grossberg
Abstract

The processes whereby our brains continue to learn about a changing world in a stable fashion throughout life are proposed to lead to conscious experiences. These processes include the learning of top-down expectations, the matching of these expectations against bottom-up data, the focusing of attention upon the expected clusters of information, and the development of resonant states between bottom-up and top-down processes as they reach a predictive and attentive consensus between what is expected and what is there in the outside world. It is suggested that all conscious states in the brain are resonant states, and that these resonant states trigger learning of sensory and cognitive representations when they amplify and synchronize distributed neural signals that are bound by the resonance. Thus, processes of learning, intention, attention, synchronization, and consciousness are intimately bound up together. The name Adaptive Resonance Theory, or ART, summarizes the predicted link between these processes. Illustrative psychophysical and neurobiological data have been explained and quantitatively simulated using these concepts in the areas of early vision, visual object recognition, auditory streaming, and speech perception, among others. It is noted how these mechanisms seem to be realized by known laminar circuits of the visual cortex. In particular, they seem to be operative at all levels of the visual system. Indeed, the mammalian neocortex, which is the seat of higher biological intelligence in all modalities, exhibits a remarkably uniform laminar architecture, with six characteristic layers and sublamina. These known laminar ART, or LAMINART, models illustrate the emerging paradigm of Laminar Computing which is attempting to answer the fundamental question: How does laminar computing give rise to biological intelligence? These laminar circuits also illustrate the fact that, in a rapidly growing number of examples, an individual model can quantitatively simulate the recorded dynamics of identified neurons in anatomically characterized circuits and the behaviors that they control. In this precise sense, the classical Mind/Body problem is starting to get solved. It is further noted that many parallel processing streams of the brain often compute properties that are complementary to each other, much as a lock fits a key or the pieces of a puzzle fit together. Hierarchical and parallel interactions within and between these processing streams can overcome their complementary deficiencies by generating emergent properties that compute complete information about a prescribed
form of intelligent behavior. This emerging paradigm of Complementary Computing is proposed to be a better paradigm for understanding biological intelligence than various previous proposals, such as the postulate of independent modules that are specialized to carry out prescribed intelligent tasks. Complementary computing is illustrated by the fact that sensory and cognitive processing in the What processing stream of the brain, which passes through cortical areas V1-V2-V4-IT on the way to prefrontal cortex, obeys top-down matching and learning laws that are often complementary to those used for spatial and motor processing in the brain's Where/How processing stream, which passes through cortical areas V1-MT-MST-PPC on the way to prefrontal cortex. These complementary properties enable sensory and cognitive representations to maintain their stability as we learn more about the world, while allowing spatial and motor representations to forget learned maps and gains that are no longer appropriate as our bodies develop and grow from infanthood to adulthood. Procedural memories are proposed to be unconscious because the inhibitory matching process that supports their spatial and motor processes cannot lead to resonance. Because ART principles and mechanisms clarify how incremental learning can occur autonomously without a loss of stability under both unsupervised and supervised conditions in response to a rapidly changing world, algorithms based on ART have been used in a wide range of applications in science and technology.
Biography Stephen Grossberg is Wang Professor of Cognitive and Neural Systems and Professor of Mathematics, Psychology, and Biomedical Engineering at Boston University. He is the founder and Director of the Center for Adaptive Systems, founder and Chairman of the Department of Cognitive and Neural Systems, founder and first President of the International Neural Network Society (INNS), and founder and co-editor-in-chief of Neural Networks, the official journal of INNS, the European Neural Network Society (ENNS), and the Japanese Neural Network Society (JNNS). Grossberg has served as an editor of many other journals, including Journal of Cognitive Neuroscience, Behavioral and Brain Sciences, Cognitive Brain Research, Cognitive Science, Adaptive Behavior, Neural Computation, Journal of Mathematical Psychology, Nonlinear Analysis, IEEE Expert, and IEEE Transactions on Neural Networks. He was general chairman of the IEEE First International Conference on Neural Networks and played a key role in organizing the first annual INNS conference. Both conferences have since fused into the International Joint Conference on Neural Networks (IJCNN), the largest conference on biological and technological neural network research in the world. His lecture series at MIT Lincoln Laboratory on neural network technology was instrumental in motivating the laboratory to initiate the national DARPA Study on Neural Networks. He has received a number of awards, including the 1991 IEEE Neural Network Pioneer award, the 1992 INNS Leadership Award, the 1992 Thinking Technology Award of the Boston Computer Society, the 2000 Information Science Award of the Association for Intelligent Machinery, the 2002 Charles River Laboratories prize of the Society for Behavioral Toxicology, and the 2003 INNS Helmholtz award. He was elected a fellow of the American
Psychological Association in 1994, a fellow of the Society of Experimental Psychologists in 1996, and a fellow of the American Psychological Society in 2002. He and his colleagues have pioneered and developed a number of the fundamental principles, mechanisms, and architectures that form the foundation for contemporary neural network research, particularly those which enable individuals to adapt autonomously in real-time to unexpected environmental changes. Such models have been used both to analyse and predict interdisciplinary data about mind and brain, and to suggest novel architectures for technological applications. Grossberg received his graduate training at Stanford University and Rockefeller University, and was a Professor at MIT before assuming his present position at Boston University. The references below are core modeling references from the work of Grossberg and his colleagues for neural models of working memory and short-term memory, learning and long-term memory, expectation, attention, resonance, synchronization, recognition, categorization, memory search, hypothesis testing, and consciousness in vision, visual object recognition, audition, speech, cognition, and cognitive-emotional interactions. Some articles since 1997 can be downloaded from http://www.cns.bu.edu/Profiles/Grossberg
References

[1] Baloch, A.A. and Grossberg, S. (1997). A neural model of high-level motion processing: Line motion and formotion dynamics. Vision Research, 37, 3037-3059.
[2] Banquet, J-P. and Grossberg, S. (1987). Probing cognitive processes through the structure of event-related potentials during learning: An experimental and theoretical analysis. Applied Optics, 26, 4931-4946. Reprinted in Carpenter, G.A. and Grossberg, S. (1991). Pattern Recognition by Self-Organizing Neural Networks. Cambridge, MA: MIT Press.
[3] Boardman, I., Grossberg, S., Myers, C., and Cohen, M.A. (1999). Neural dynamics of perceptual order and context effects for variable-rate speech syllables. Perception and Psychophysics, 61, 1477-1500.
[4] Bradski, G. and Grossberg, S. (1995). Fast-learning VIEWNET architectures for recognizing three-dimensional objects from multiple two-dimensional views. Neural Networks, 8, 1053-1080.
[5] Bradski, G., Carpenter, G.A., and Grossberg, S. (1992). Working memory networks for learning temporal order with application to three-dimensional visual object recognition. Neural Computation, 4, 270-286.
[6] Bradski, G., Carpenter, G.A., and Grossberg, S. (1994). STORE working memory networks for storage and recall of arbitrary temporal sequences. Biological Cybernetics, 71, 469-480.
[7] Carpenter, G.A. (1989). Neural network models for pattern recognition and associative memory. Neural Networks, 2, 243-257. Reprinted in Carpenter, G.A. and Grossberg, S. (1991). Pattern Recognition by Self-Organizing Neural Networks. Cambridge, MA: MIT Press.
[8] Carpenter, G.A. (1997). Distributed learning, recognition, and prediction by ART and ARTMAP neural networks. Neural Networks, 10, 1473-1494.
[9] Carpenter, G.A. and Grossberg, S. (1987). A massively parallel architecture for a self-organizing neural pattern recognition machine. Computer Vision, Graphics, and Image Processing, 37, 54-115. Reprinted in Grossberg, S. (1988). Neural Networks and Natural Intelligence. Cambridge, MA: MIT Press.
[10] Carpenter, G.A. and Grossberg, S. (1987). ART 2: Self-organization of stable category recognition codes for analog input patterns. Applied Optics, 26, 4919-4930. Reprinted in Carpenter, G.A. and Grossberg, S. (1991). Pattern Recognition by Self-Organizing Neural Networks. Cambridge, MA: MIT Press.
[11] Carpenter, G.A. and Grossberg, S. (1990). ART 3: Hierarchical search using chemical transmitters in self-organizing pattern recognition architectures. Neural Networks, 3, 129-152. Reprinted in Carpenter, G.A. and Grossberg, S. (1991). Pattern Recognition by Self-Organizing Neural Networks. Cambridge, MA: MIT Press.
[12] Carpenter, G.A. and Grossberg, S. (1991). Pattern Recognition by Self-Organizing Neural Networks. Cambridge, MA: MIT Press.
[13] Carpenter, G.A. and Grossberg, S. (1992). A self-organizing neural network for supervised learning, recognition, and prediction. IEEE Communications Magazine, September, 38-49.
[14] Carpenter, G.A. and Grossberg, S. (1993). Normal and amnesic learning, recognition, and memory by a neural model of cortico-hippocampal interactions. Trends in Neurosciences, 16, 131-137.
[15] Carpenter, G.A., Grossberg, S., Markuzon, N., Reynolds, J.H., and Rosen, D.B. (1992). Fuzzy ARTMAP: A neural network architecture for incremental supervised learning of analog multidimensional maps. IEEE Transactions on Neural Networks, 3, 698-713.
[16] Carpenter, G.A., Grossberg, S., and Reynolds, J.H. (1991). ARTMAP: Supervised real-time learning and classification of nonstationary data by a self-organizing neural network. Neural Networks, 4, 565-588. Reprinted in Carpenter, G.A. and Grossberg, S. (1991). Pattern Recognition by Self-Organizing Neural Networks. Cambridge, MA: MIT Press.
[17] Carpenter, G.A., Grossberg, S., and Reynolds, J.H. (1995). A fuzzy ARTMAP nonparametric probability estimator for nonstationary pattern recognition problems. IEEE Transactions on Neural Networks, 6, 1330-1336.
[18] Carpenter, G.A., Grossberg, S., and Rosen, D.B. (1991). ART 2-A: An adaptive resonance algorithm for rapid category learning and recognition. Neural Networks, 4, 493-504.
[19] Chey, J., Grossberg, S., and Mingolla, E. (1997). Neural dynamics of motion grouping: From aperture ambiguity to object speed and direction. Journal of the Optical Society of America A, 14, 2570-2594.
[20] Cohen, M.A. and Grossberg, S. (1983). Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Transactions on Systems, Man, and Cybernetics, 13, 815-826. Reprinted in S. Grossberg (1987). The Adaptive Brain, Vol. I. Amsterdam: Elsevier Science.
[21] Cohen, M.A. and Grossberg, S. (1986). Neural dynamics of speech and language coding: Developmental programs, perceptual grouping, and competition for short term memory. Human Neurobiology, 5, 1-22. Reprinted in S. Grossberg (1987). The Adaptive Brain, Vol. II. Amsterdam: Elsevier Science.
[22] Cohen, M.A. and Grossberg, S. (1987). Masking fields: A massively parallel neural architecture for learning, recognizing, and predicting multiple groupings of patterned data. Applied Optics, 26, 1866-1891. Reprinted in Grossberg, S. (1988). Neural Networks and Natural Intelligence. Cambridge, MA: MIT Press.
[23] Cohen, M.A., Grossberg, S. and Stork, D.G. (1988). Speech perception and production by a self-organizing neural network. In Evolution, Learning, Cognition, and Advanced Architectures (Y.C. Lee, Ed.). Singapore: World Scientific. Reprinted in Carpenter, G.A. and Grossberg, S. (1991). Pattern Recognition by Self-Organizing Neural Networks. Cambridge, MA: MIT Press.
[24] Ellias, S.A. and Grossberg, S. (1975). Pattern formation, contrast control, and oscillations in the short-term memory of shunting on-center off-surround networks. Biological Cybernetics, 20, 69-98.
[25] Gove, A., Grossberg, S., and Mingolla, E. (1995). Brightness perception, illusory contours, and corticogeniculate feedback. Visual Neuroscience, 12, 1027-1052.
[26] Granger, E., Rubin, M., Grossberg, S., and Lavoie, P. (2001). A what-and-where fusion neural network for recognition and tracking of multiple radar emitters. Neural Networks, 14, 325-344.
[27] Grossberg, S. (1969). On learning and energy-entropy dependence in recurrent and nonrecurrent signed networks. Journal of Statistical Physics, 1, 319-350.
[28] Grossberg, S. (1971). Pavlovian pattern learning by nonlinear neural networks. Proceedings of the National Academy of Sciences, 68, 828-831.
[29] Grossberg, S. (1972). Pattern learning by functional-differential neural networks with arbitrary path weights. In Delay and functional-differential equations and their applications (K. Schmitt, Ed.). New York: Academic Press.
[30] Grossberg, S. (1973). Contour enhancement, short term memory, and constancies in reverberating neural networks. Studies in Applied Mathematics, LII, 213-257. Reprinted in Grossberg, S. (1982). Studies of Mind and Brain. New York: Kluwer/Reidel.
[31] Grossberg, S. (1974). Classical and instrumental learning by neural networks. In Progress in Theoretical Biology, Vol. 3 (R. Rosen and F. Snell, Eds.), pp. 51-141. New York: Academic Press.
[32] Grossberg, S. (1976). Adaptive pattern classification and universal recoding, I: Parallel development and coding of neural feature detectors. Biological Cybernetics, 23, 121-134. Reprinted in Grossberg, S. (1982). Studies of Mind and Brain. New York: Kluwer/Reidel.
[33] Grossberg, S. (1976). Adaptive pattern classification and universal recoding, II: Feedback, expectation, olfaction, and illusions. Biological Cybernetics, 23, 187-202.
[34] Grossberg, S. (1977). Pattern formation by the global limits of a nonlinear competitive interaction in n dimensions. Journal of Mathematical Biology, 4, 237-256.
[35] Grossberg, S. (1978). A theory of human memory: Self-organization and performance of sensory-motor codes, maps and plans. In Progress in Theoretical Biology, Vol. 5 (R. Rosen and F. Snell, Eds.), pp. 233-374. New York: Academic Press. Reprinted in Grossberg, S. (1982). Studies of Mind and Brain. New York: Kluwer/Reidel.
[36] Grossberg, S. (1978). Behavioral contrast in short term memory: serial binary memory models or parallel continuous memory models. Journal of Mathematical Psychology, 17, 199-219. Reprinted in Grossberg, S. (1982). Studies of Mind and Brain. New York: Kluwer/Reidel.
[37] Grossberg, S. (1978). Competition, decision, and consensus. Journal of Mathematical Analysis and Applications, 66, 470-493. Reprinted in Grossberg, S. (1982). Studies of Mind and Brain. New York: Kluwer/Reidel.
[38] Grossberg, S. (1978). Decisions, patterns, and oscillations in nonlinear competitive systems with applications to Volterra-Lotka systems. Journal of Theoretical Biology, 73, 101-130.
[39] Grossberg, S. (1980). Biological competition: Decision rules, pattern formation, and oscillations. Proceedings of the National Academy of Sciences, 77, 2338-2342.
[40] Grossberg, S. (1980). How does a brain build a cognitive code? Psychological Review, 87, 1-51. Reprinted in Grossberg, S. (1982). Studies of Mind and Brain. New York: Kluwer/Reidel.
[41] Grossberg, S. (1982). Studies of Mind and Brain: Neural principles of learning, perception, development, cognition, and motor control. New York: Kluwer/Reidel.
[42] Grossberg, S. (1982). Associative and competitive principles of learning and development: The temporal unfolding and stability of STM and LTM patterns. In Competition and Cooperation in Neural Nets (S. Amari and M. Arbib, Eds.). Lecture Notes in Biomathematics, 45, 295-341. New York: Springer-Verlag. Reprinted in S. Grossberg (1987). The Adaptive Brain, Vol. I. Amsterdam: Elsevier Science.
[43] Grossberg, S. (1982). Processing of expected and unexpected events during conditioning and attention: A psychophysiological theory. Psychological Review, 89, 529-572.
[44] Grossberg, S. (1984). Some normal and abnormal behavioral syndromes due to transmitter gating of opponent processes. Biological Psychiatry, 19, 1075-1118.
[45] Grossberg, S. (1984). Some psychophysiological and pharmacological correlates of a developmental, cognitive, and motivational theory. In Brain and Information: Event Related Potentials, 425, 58-151 (R. Karrer, J. Cohen, and P. Tueting, Eds.). New York Academy of Sciences. Reprinted in S. Grossberg (1987). The Adaptive Brain, Vol. I. Amsterdam: Elsevier Science.
[46] Grossberg, S. (1984). Unitization, automaticity, temporal order, and word recognition. Cognition and Brain Theory, 7, 263-283.
[47] Grossberg, S. (1987). Competitive learning: From interactive activation to adaptive resonance. Cognitive Science, 11, 23-63. Reprinted in Grossberg, S. (1988). Neural Networks and Natural Intelligence. Cambridge, MA: MIT Press.
[48] Grossberg, S. (1987). The Adaptive Brain, Vols. I and II. Amsterdam: Elsevier Science.
[49] Grossberg, S. (1988). Neural Networks and Natural Intelligence. Cambridge, MA: MIT Press.
[50] Grossberg, S. (1988). Nonlinear neural networks: Principles, mechanisms, and architectures. Neural Networks, 1, 17-61. Reprinted in Carpenter, G.A. and Grossberg, S. (1991). Pattern Recognition by Self-Organizing Neural Networks. Cambridge, MA: MIT Press.
[51] Grossberg, S. (1995). The attentive brain. American Scientist, 83, 438-449.
[52] Grossberg, S. (1999). How does the cerebral cortex work: Learning, attention, and grouping by the laminar circuits of visual cortex. Spatial Vision, 12, 163-186.
[53] Grossberg, S. (1999). The link between brain learning, attention, and consciousness. Consciousness and Cognition, 8, 1-44.
[54] Grossberg, S. (1999). Pitch-based streaming in auditory perception. In Musical Networks: Parallel Distributed Perception and Performance (N. Griffith and P. Todd, Eds.). Cambridge, MA: MIT Press, pp. 117-140.
[55] Grossberg, S. (2000). The complementary brain: Unifying brain dynamics and modularity. Trends in Cognitive Sciences, 233-246.
[56] Grossberg, S. (2000). The imbalanced brain: From normal behavior to schizophrenia. Biological Psychiatry, 81-98.
[57] Grossberg, S. (2000). How hallucinations may arise from brain mechanisms of learning, attention, and volition. Journal of the International Neuropsychological Society, 6, 583-592.
[58] Grossberg, S., Boardman, I., and Cohen, M.A. (1997). Neural dynamics of variable-rate speech categorization. Journal of Experimental Psychology: Human Perception and Performance, 23, 481-503.
[59] Grossberg, S. and Grunewald, A. (1997). Cortical synchronization and perceptual framing. Journal of Cognitive Neuroscience, 9, 117-132.
[60] Grossberg, S. and Howe, P.D.L. (2002). A laminar cortical model of stereopsis and three-dimensional surface perception. Vision Research, in press.
[61] Grossberg, S. and Levine, D. (1976). Some developmental and attentional biases in the contrast enhancement and short-term memory of recurrent neural networks. Journal of Theoretical Biology, 53, 341-380.
[62] Grossberg, S. and Levine, D. (1987). Neural dynamics of attentionally modulated Pavlovian conditioning: Blocking, interstimulus interval, and secondary conditioning. Applied Optics, 26, 5015-5030. Reprinted in Grossberg, S. (1988). Neural Networks and Natural Intelligence. Cambridge, MA: MIT Press.
[63] Grossberg, S. and Merrill, J.W.L. (1992). A neural network model of adaptively timed reinforcement learning and hippocampal dynamics. Cognitive Brain Research, 1, 3-38.
[64] Grossberg, S. and Merrill, J.W.L. (1996). The hippocampus and cerebellum in adaptively timed learning, recognition, and movement. Journal of Cognitive Neuroscience, 8, 257-277.
[65] Grossberg, S., Mingolla, E., and Ross, W.D. (1994). A neural theory of attentive visual search: Interactions of boundary, surface, spatial, and object representations. Psychological Review, 101, 470-489.
[66] Grossberg, S., Mingolla, E., and Ross, W.D. (1997). Visual brain and visual perception: How does the cortex do perceptual grouping? Trends in Neurosciences, 20, 106-111.
[67] Grossberg, S., Mingolla, E., and Viswanathan, L. (2001). Neural dynamics of motion integration and segmentation within and across apertures. Vision Research, 41, 2521-2553.
[68] Grossberg, S. and Myers, C. (2000). The resonant dynamics of speech perception: Interword integration and duration-dependent backward effects. Psychological Review, 107, 735-767.
[69] Grossberg, S. and Raizada, R.D.S. (2000). Contrast-sensitive perceptual grouping and object-based attention in the laminar circuits of primary visual cortex. Vision Research, 40, 1413-1432.
[70] Grossberg, S. and Schmajuk, N.A. Neural dynamics of attentionally modulated Pavlovian conditioning: Conditioned reinforcement, inhibition, and opponent processing. Psychobiology, 15, 195-240. Reprinted in Grossberg, S. (1988). Neural Networks and Natural Intelligence. Cambridge, MA: MIT Press.
[71] Grossberg, S. and Somers, D. (1991). Synchronized oscillations during cooperative feature linking in a cortical model of visual perception. Neural Networks, 4, 453-466.
[72] Grossberg, S. and Stone, G. (1986). Neural dynamics of attention switching and temporal order information in short term memory. Memory and Cognition, 14, 451-468. Reprinted in Grossberg, S. (1988). Neural Networks and Natural Intelligence. Cambridge, MA: MIT Press.
[73] Grossberg, S. and Stone, G. (1986). Neural dynamics of word recognition and recall: Attentional priming, learning, and resonance. Psychological Review, 93, 46-74. Reprinted in S. Grossberg (1987). The Adaptive Brain, Vol. II. Amsterdam: Elsevier Science.
[74] Grossberg, S. and Williamson, J.R. (1999). A self-organizing neural system for learning to recognize textured scenes. Vision Research, 39, 1385-1406.
[75] Grossberg, S. and Williamson, J.R. (2001). A neural model of how horizontal and interlaminar connections of visual cortex develop into adult circuits that carry out perceptual grouping and learning. Cerebral Cortex, 11, 37-58.
[76] Grunewald, A. and Grossberg, S. (1998). Self-organization of binocular disparity tuning by reciprocal corticogeniculate interactions. Journal of Cognitive Neuroscience, 10, 100-215.
[77] Olson, S.J. and Grossberg, S. (1998). A neural network model for the development of simple and complex cell receptive fields within cortical maps of orientation and ocular dominance. Neural Networks, 11, 189-208.
[78] Raizada, R.D.S. and Grossberg, S. (2001). Context-sensitive binding by the laminar circuits of V1 and V2: A unified model of perceptual grouping, attention, and orientation contrast. Visual Cognition, 8, 431-466.
[79] Raizada, R. and Grossberg, S. (2003). Towards a Theory of the Laminar Architecture of Cerebral Cortex: Computational Clues from the Visual System. Cerebral Cortex, 13, 100-113.
Adaptive Data Based Modelling and Estimation with Application to Real Time Vehicular Collision Avoidance Chris J. Harris Department of Electronics and Computer Science, University of Southampton Highfield, Southampton SO17 1BJ, UK
[email protected]
Abstract

The majority of control and estimation algorithms are based upon linear time-invariant models of the process, yet many dynamic processes are nonlinear, stochastic and non-stationary. In this presentation an online data-based modelling and estimation approach is described which produces parsimonious dynamic models that are transparent and appropriate for control and estimation applications. These models are linear in the adjustable parameters – hence provable, real-time and transparent – but exponential in the input space dimension. Several approaches are introduced, including automatic structure algorithms to reduce the inherent curse of dimensionality of the approach. The resultant algorithms can be interpreted in rule-based form and therefore offer considerable transparency to the user as to the underlying dynamics; equally, the user can control the resultant rule base during learning. These algorithms will be applied to (a) helicopter flight control, (b) auto-car driving and (c) multiple ship guidance and control.
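As a rough illustration of the "linear in the adjustable parameters" idea, and of why a full lattice of basis functions is exponential in the input space dimension, the sketch below fits a Gaussian RBF model by regularised least squares. It is not the structure-selection algorithm of the cited work; the lattice construction, basis width and regularisation value are illustrative assumptions.

```python
import itertools
import numpy as np

def lattice_centres(n_per_axis, dim):
    """A full lattice of centres: n_per_axis ** dim basis functions, which is
    the curse of dimensionality that automatic structure algorithms try to tame."""
    axis = np.linspace(0.0, 1.0, n_per_axis)
    return np.array(list(itertools.product(axis, repeat=dim)))

def rbf_design_matrix(X, centres, width):
    """Each column is one fixed Gaussian basis function; the model y = Phi(x) @ w
    is nonlinear in x but linear in the adjustable weights w."""
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * width ** 2))

def fit_regularised(X, y, centres, width, lam=1e-3):
    """Regularised least squares: a closed-form, provable estimate of the weights."""
    Phi = rbf_design_matrix(X, centres, width)
    A = Phi.T @ Phi + lam * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ y)

# toy 2-D regression problem
rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))
y = np.sin(2 * np.pi * X[:, 0]) + X[:, 1] + 0.05 * rng.normal(size=200)
centres = lattice_centres(n_per_axis=5, dim=2)   # 25 bases here; 5**10 in 10-D
w = fit_regularised(X, y, centres, width=0.3)
print(w.shape)                                   # one adjustable weight per basis
```

Because each weight multiplies one local basis function, the fitted model can be read as a set of local rules, which is the transparency property mentioned above.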
References
[1] Harris C.J., Hong X., Gan Q. Adaptive Modelling, Estimation and Fusion from Data. Springer Verlag, Berlin (2002)
[2] Chen S., Hong D., Harris C.J. Sparse multioutput RBF network construction using combined locally regularised OLS & D-optimality. IEE Proc. CTA, Vol. 150 (2) (March 2002) pp. 139-146
[3] Hong X., Harris C.J., Chen S. Robust neurofuzzy model knowledge extraction. To appear, Trans. IEEE SMC (2003)
Biography Professor Harris has degrees from the Universities of Southampton, Leicester and Oxford. He is a Fellow of the Royal Academy of Engineering, Honorary Professor at University of Hong Kong; Holder of the 2001 Faraday Medal and the 1998 IEE Senior Achievement Medal for research into Nonlinear Signal Processing. Author of
7 research books in nonlinear adaptive systems and control theory and over 300 learned papers. Currently he is Emeritus Professor of Computational Intelligence at Southampton University and Director of the UK – MOD Defence Technology Centre in Data and Information Fusion – a £10M a year initiative involving three companies and eight leading UK Universities, supporting over 70 researchers.
Creating a Smart Virtual Personality Nadia Magnenat-Thalmann MIRALab - University of Geneva 24, Rue General-Dufour,1211 Geneva, Switzerland
[email protected] www.miralab.ch
As people become more and more dependent on using the computer for a variety of tasks, providing interfaces that are intelligent and easy to interact with has become an important issue in computer science. Current research in computer graphics, artificial intelligence and cognitive psychology aims to give computers a more human face, so that interacting with computers becomes more like interacting with humans. With the emergence of 3D graphics, we are now able to create very believable 3D characters that can move and talk. However, the behaviour that is expressed by such characters is far from believable in a lot of systems. We feel that these characters lack individuality. This talk addresses some of the aspects of creating 3D virtual characters that have a form of individuality, driven by state-of-the-art personality and emotion models (see Figure 1). 3D characters will be personalized not only on an expressive level (for example, generating emotion expressions on a 3D face), but also on an interactive level (response generation that is coherent with the personality and emotional state of the character) and a perceptive level (having an emotional reaction to the user and her/his environment). In order to create a smart virtual personality, we need to concentrate on several different research topics. Firstly, the simulation of personality and emotional state requires extensive psychological research into how real humans act/react emotionally to events in their surroundings, as well as an investigation into which independent factors cause humans to act/react in a certain way (also called the personality). Once this has been determined, we need to investigate whether or not such models are suitable for computer simulations, and if so, we have to define a concrete form of these notions and how they interact. Secondly, the response generation mechanism used by a virtual character needs to take personality and emotions into account. It is crucial to find generic constructs for linking personality and emotions with a response generation system (which can be anything from rule-based pattern matching systems to full-scale logical reasoning engines). And finally, the expressive virtual character should have speech capabilities and face and body animation. The animation should be controlled by high-level parameters such as facial expressions and gestures. Furthermore, the bodies and faces of virtual characters should be easily exchangeable and animation sequences should be character-independent (an example of this is face and body representation and animation using the MPEG-4 standard).
Fig. 1. 3D virtual character overview. From the user’s behaviour, the system determines the impact on its emotional state. This information is then used (together with semantic information of the user’s behaviour) to generate an appropriate response, which is expressed through a 3D virtual character linked with a text-to-speech system
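The sketch below is a deliberately simplified, hypothetical reading of the pipeline in Fig. 1: an appraisal of the user's behaviour updates an emotional state as a function of personality, and both the reply and the facial-expression tag are then chosen to be coherent with that state. The trait names, update rule and thresholds are invented for illustration and are not the personality and emotion models discussed in the talk.

```python
from dataclasses import dataclass, field

@dataclass
class VirtualCharacter:
    """Toy model of the three levels: perceptive (appraise an event),
    interactive (generate a response) and expressive (pick an expression)."""
    personality: dict                   # e.g. {"neuroticism": 0.3, "extraversion": 0.9}
    emotion: dict = field(default_factory=lambda: {"joy": 0.0, "anger": 0.0})

    def perceive(self, event_valence):
        # perceptive level: the emotional impact of the user's behaviour,
        # modulated by a personality trait (illustrative update rule)
        gain = 0.5 + 0.5 * self.personality.get("neuroticism", 0.5)
        if event_valence >= 0:
            self.emotion["joy"] = min(1.0, self.emotion["joy"] + gain * event_valence)
        else:
            self.emotion["anger"] = min(1.0, self.emotion["anger"] - gain * event_valence)

    def respond(self, user_utterance):
        # interactive + expressive levels: reply text and expression tag
        # are both conditioned on the current emotional state
        if self.emotion["anger"] > 0.5:
            return "I'd rather not talk about that.", "frown"
        return f"Tell me more about {user_utterance}!", "smile"

agent = VirtualCharacter(personality={"neuroticism": 0.3, "extraversion": 0.9})
agent.perceive(event_valence=-0.8)
print(agent.respond("the weather"))
```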
Biography Prof. Nadia Magnenat-Thalmann has pioneered research into virtual humans over the last 20 years. She obtained several Bachelor’s and Master’s degrees in various disciplines and a PhD in Quantum Physics from the University of Geneva. From 1977 to 1989, she was a Professor at the University of Montreal in Canada. She moved to the University of Geneva in 1989, where she founded MIRALab. She has received several scientific and artistic awards for her work in Canada and in Europe. In 1997, she was elected to the Swiss Academy of Technical Sciences, and more recently she was nominated as a Swiss personality who has contributed to the advance of science in the 150-year-history CD-ROM produced by the Swiss Confederation Parliament (Bern, Switzerland, 1998). She has been invited to give hundreds of lectures on various topics, all related to virtual humans. Author and co-author of a large number of research papers and books, she has directed and produced several films and real-time mixed reality shows, among the latest of which are CYBERDANCE (1998), FASHION DREAMS (1999) and the UTOPIANS (2001). She is editor-in-chief of the Visual Computer Journal published by Springer Verlag and editor of several other research journals.
Intelligent Navigation on the Mobile Internet
Barry Smyth
Smart Media Institute, University College Dublin, Dublin, Ireland
ChangingWorlds, South County Business Park, Leopardstown Road, Dublin, Ireland
[email protected]

1 Summary
For many users the Mobile Internet means accessing information services through their mobile handsets - accessing mobile portals via WAP phones, for example. In this context, the Mobile Internet represents both a dramatic step forward and a significant step backward from an information access standpoint. While it offers users greater access to valuable information and services “on the move”, mobile handsets are hardly the ideal access device in terms of their screen-size and input capabilities. As a result, mobile portal users are often frustrated by how difficult it is to quickly access the right information at the right time. A mobile portal is typically structured as a hierarchical set of navigation (menu) pages leading to distinct content pages (see Fig. 1). An ideal portal should present a user with relevant content without the need for spurious navigation. However, the reality is far from ideal. Studies highlight that while the average user expects to be able to access content within 30 seconds, the reality is closer to 150 seconds [1]. WAP navigation effort can be modelled as click-distance ([5, 4]) - the number of menu selections and scrolls needed to locate a content item. Our studies indicate that, ideally, content should be positioned no more than 10-12 ‘clicks’ from the portal home page. However, we have also found that many portals suffer from average click-distances (home page to content items) in excess of 20 [3] (see Fig. 1). Personalization research seeks to develop techniques for learning and exploiting user preferences to deliver the right content to the right user at the right time (see [2]). We have shown how personalization techniques for adapting the navigation structure of a portal can reduce click-distance and thus radically reduce navigation effort and improve portal usability ([5]). Personalization technology has led to the development of the ClixSmart Navigator™ product-suite, developed by ChangingWorlds Ltd. (www.changingworlds.com). ClixSmart Navigator has been deployed successfully with some of Europe’s leading mobile operators. The result: users are able to locate information and services more efficiently through their mobile handsets and this in turn has led to significant increases in portal usage. In fact, live-user studies, encompassing thousands of users, indicate usage increases in excess of 30% and dramatic improvements in the user’s online experience ([5, 4]).
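As a concrete reading of the click-distance measure, the sketch below counts one select per menu page visited plus one scroll for every step down the option list along a navigation path. The counting convention and the example path are illustrative and may differ in detail from the model in [5, 4].

```python
def click_distance(path_positions):
    """Click-distance = selects + scrolls along a navigation path.
    path_positions[i] is the 1-based position of the chosen option on the
    i-th menu page, assuming one scroll is needed per step down the list."""
    selects = len(path_positions)
    scrolls = sum(p - 1 for p in path_positions)
    return selects + scrolls

def promote(menu, usage):
    """A simple personalization step: reorder a menu page so that the most
    frequently selected options float to the top, shrinking future scrolls."""
    return sorted(menu, key=lambda option: usage.get(option, 0), reverse=True)

# an illustrative path to a cinema page, in the spirit of Fig. 1
print(click_distance([4, 4, 1, 5, 3]))   # 5 selects + 12 scrolls = 17 clicks

menu = ["News and Weather", "Sport", "Business", "Entertainment"]
usage = {"Entertainment": 12, "Sport": 3}
print(promote(menu, usage))              # Entertainment now costs no scrolls
```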
[Figure: a sequence of five WAP menu screens, each with Options and Back soft keys, showing the navigation path from the portal home page ([News and Weather] [Sport] [Business] [Entertainment]) through the Entertainment menu ([TV/Video] [Horoscopes] [Lottery Results] [Cinema]), the Cinema menu ([Cinema Times] [Cinema Booking] [Flix]), a county list ([Derry] [Donegal] [Down] [Dublin]) and a cinema list ([Meeting House Square] [Omniplex] [Ormonde]).]

Fig. 1. To access their local cinema (the Ormonde) this user must engage in an extended sequence of navigation actions; 16 clicks in total (selects and scrolls) are needed in this example
2 Biography
Prof. Barry Smyth is the Digital Chair of Computer Science at University College Dublin, Ireland. His research interests cover many aspects of artificial intelligence, case-based reasoning and personalization. Barry's research has led to the development of ChangingWorlds Ltd, which develops AI-based portal software for mobile operators, including Vodafone and O2. He has published widely and received a number of international awards for his basic and applied research, including best paper awards at the International Joint Conference on Artificial Intelligence (IJCAI) and the European Conference on Artificial Intelligence - Prestigious Applications of Intelligent Systems (ECAI-PAIS).
References
[1] M. Ramsey and J. Nielsen. The WAP Usability Report. Nielsen Norman Group, 2000. 17
[2] D. Reiken. Special issue on personalization. Communications of the ACM, 43(8), 2000. 17
[3] B. Smyth. The Plight of the Mobile Navigator. MobileMetrix, 2002. 17
[4] B. Smyth and P. Cotter. Personalized Adaptive Navigation for Mobile Portals. In Proceedings of the 15th European Conference on Artificial Intelligence - Prestigious Applications of Artificial Intelligence. IOS Press, 2002. 17
[5] B. Smyth and P. Cotter. The Plight of the Navigator: Solving the Navigation Problem for Wireless Portals. In Proceedings of the 2nd International Conference on Adaptive Hypermedia and Adaptive Web-Based Systems (AH’02), pages 328–337. Springer-Verlag, 2002. 17
The Evolution of Evolutionary Computation Xin Yao School of Computer Science, The University of Birmingham Edgbaston, Birmingham B15 2TT, UK
[email protected] http://www.cercia.ac.uk
Abstract. Evolutionary computation has enjoyed a tremendous growth for at least a decade in both its theoretical foundations and industrial applications. Its scope has gone far beyond binary string optimisation using a simple genetic algorithm. Many research topics in evolutionary computation nowadays are not necessarily “genetic” or “evolutionary” in any biological sense. This talk will describe some recent research efforts in addressing several fundamental as well as more applied issues in evolutionary computation. Links with traditional computer science and artificial intelligence will be explored whenever appropriate.
Evolutionary Algorithms as Generate-and-Test

Evolutionary algorithms (EAs) can be regarded as population-based stochastic generate-and-test [1, 2]. The advantage of formulating EAs as a generate-and-test algorithm is that the relationships between EAs and other search algorithms, such as simulated annealing (SA), tabu search (TS), hill-climbing, etc., can be made clearer and thus easier to explore and understand. Under the framework of generate-and-test, different search algorithms investigated in artificial intelligence, operations research, computer science, and evolutionary computation can be unified.

Computational Time Complexity of Evolutionary Algorithms

Most work in evolutionary computation has been experimental. Although there have been many reported results of EAs solving difficult optimisation problems, the theoretical results on EAs' average computation time have been few. It is unclear theoretically what and where the real power of EAs is. It is also unclear theoretically what role a population plays in EAs. Some recent work has started tackling several fundamental issues in evolutionary computation, such as the conditions under which an EA will exhibit polynomial/exponential time behaviours [3, 4], the conditions under which a population can make a difference in terms of complexity classes [5], and the analytical tools and frameworks that facilitate the analysis of an EA's average computation time [6].

Two Heads Are Better than One

Although one of the key features of evolutionary computation is a population, most work did not actually exploit this. We can show through a number of examples that exploiting population
information, rather than just the best individual, can lead to many benefits in evolutionary learning, e.g., improved generalisation ability [7, 8, 9] and better fault-tolerance [10].
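A minimal sketch of the population-based stochastic generate-and-test view described earlier: "generate" proposes new candidates from the current population and "test" scores them and selects the survivors. With a population of one and a temperature-dependent acceptance rule the same loop collapses to simulated annealing or hill-climbing. The one-max problem and all parameter choices below are illustrative.

```python
import random

def generate(population, mutate):
    """Generate: propose new candidate solutions from the current population."""
    return [mutate(x) for x in population]

def test(population, candidates, fitness, pop_size):
    """Test: evaluate everything and keep the best pop_size solutions."""
    pool = population + candidates
    return sorted(pool, key=fitness, reverse=True)[:pop_size]

def generate_and_test(init_pop, mutate, fitness, generations=200):
    pop = list(init_pop)
    for _ in range(generations):
        pop = test(pop, generate(pop, mutate), fitness, len(pop))
    return max(pop, key=fitness)

# toy one-max problem on 20-bit strings
def flip_one(bits):
    i = random.randrange(len(bits))
    return bits[:i] + (1 - bits[i],) + bits[i + 1:]

init = [tuple(random.randint(0, 1) for _ in range(20)) for _ in range(10)]
best = generate_and_test(init, flip_one, fitness=sum)
print(sum(best))   # number of ones in the best string found
```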
References [1] X. Yao, “An overview of evolutionary computation,” Chinese Journal of Advanced Software Research (Allerton Press, Inc., New York, NY 10011), vol. 3, no. 1, pp. 12–29, 1996. 19 [2] X. Yao, ed., Evolutionary Computation: Theory and Applications. Singapore: World Scientific Publishing Co., 1999. 19 [3] J. He and X. Yao, “Drift analysis and average time complexity of evolutionary algorithms,” Artificial Intelligence, vol. 127, pp. 57–85, March 2001. 19 [4] J. He and X. Yao, “Erratum to: Drift analysis and average time complexity of evolutionary algorithms: [artificial intelligence 127 (2001) 57-85],” Artificial Intelligence, vol. 140, pp. 245–248, September 2002. 19 [5] J. He and X. Yao, “From an individual to a population: An analysis of the first hitting time of population-based evolutionary algorithms,” IEEE Transactions on Evolutionary Computation, vol. 6, pp. 495–511, October 2002. 19 [6] J. He and X. Yao, “Towards an analytic framework for analysing the computation time of evolutionary algorithms,” Artificial Intelligence, vol. 145, pp. 59–97, April 2003. 19 [7] P. J. Darwen and X. Yao, “Speciation as automatic categorical modularization,” IEEE Transactions on Evolutionary Computation, vol. 1, no. 2, pp. 101–108, 1997. 20 [8] X. Yao and Y. Liu, “Making use of population information in evolutionary artificial neural networks,” IEEE Trans. on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 28, no. 3, pp. 417–425, 1998. 20 [9] Y. Liu, X. Yao, and T. Higuchi, “Evolutionary ensembles with negative correlation learning,” IEEE Transactions on Evolutionary Computation, vol. 4, pp. 380–387, November 2000. 20 [10] T. Schnier and X. Yao, “Using negative correlation to evolve fault-tolerant circuits,” in Proceedings of the 5th International Conference on Evolvable Systems (ICES-2003), LNCS 2606, Springer, Germany, March 2003. 20
Speaker Bio Xin Yao is a professor of computer science and Director of the Centre of Excellence for Research in Computational Intelligence and Applications (CERCIA) at the University of Birmingham, UK. He is also a visiting professor of the University College, the University of New South Wales, the Australian Defence Force Academy, Canberra, the University of Science and Technology of China, Hefei, the Nanjing University of Aeronautics and Astronautics, Nanjing, and the Northeast Normal University, Changchun. He is an IEEE fellow, the editor in chief of IEEE Transactions on Evolutionary Computation, an associate editor or an editorial board member of five other international journals, and the chair of IEEE NNS Technical Committee on Evolutionary Computation. He is the recipient of the 2001 IEEE Donald G. Fink Prize Paper Award and has given more than 20 invited keynote/plenary speeches at various conferences. His major
research interests include evolutionary computation, neural network ensembles, global optimization, computational time complexity and data mining.
A Unified Model Maintains Knowledge Base Integrity John Debenham University of Technology, Sydney, Faculty of Information Technology, PO Box 123, NSW 2007, Australia
[email protected]
Abstract. A knowledge base is maintained by modifying its conceptual model and by using those modifications to specify changes to its implementation. The maintenance problem is to determine which parts of that model should be checked for correctness in response to a change in the application. The maintenance problem is not computable for first-order knowledge bases. Two things in the conceptual model are joined by a maintenance link if a modification to one of them means that the other must be checked for correctness, and so possibly modified, if consistency of the model is to be preserved. In a unified conceptual model for first-order knowledge bases the data and knowledge are modelled formally in a uniform way. A characterisation is given of four different kinds of maintenance links in a unified conceptual model. Two of these four kinds of maintenance links can be removed by transforming the conceptual model. In this way the maintenance problem is simplified.
1 Introduction
Maintenance links join two things in the conceptual model if a modification to one of them means that the other must be checked for correctness, and so possibly modified, if consistency of that model is to be preserved. If that other thing requires modification then the links from it to yet other things must be followed, and so on until things are reached that do not require modification. If node A is linked to node B which is linked to node C then nodes A and C are indirectly linked. In a coherent knowledge base everything is indirectly linked to everything else. A good conceptual model for maintenance will have a low density of maintenance links [1]. The set of maintenance links should be minimal in that none may be removed. Informally, one conceptual model is “better” than another if it leads to less checking for correctness. The aim of this work is to generate a good conceptual model. A classification into four classes is given here of the maintenance links for conceptual models expressed in the unified [2] knowledge representation. Methods are given for removing two of these classes of link, so reducing the density of maintenance links. Approaches to the maintenance of knowledge bases are principally of two types [3]. First, approaches that take the knowledge base as presented and then try to control the maintenance process [4]. Second, approaches that engineer a model of the knowledge base so that it is in a form that is inherently easy to maintain [5] [6]. The approach described here is of the second type.
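The following sketch, over an invented graph and an invented "needs modification" test, shows how maintenance links drive re-checking: starting from a modified node, every linked thing must be checked for correctness, and links are followed onward only from things that themselves turn out to need modification, which is how indirect links such as A-C arise from A-B and B-C.

```python
from collections import deque

def items_to_check(links, modified, needs_modification):
    """Follow maintenance links from a modified node and return everything
    that must be checked for correctness."""
    visited = {modified}
    to_check = set()
    frontier = deque([modified])
    while frontier:
        node = frontier.popleft()
        for neighbour in links.get(node, ()):
            if neighbour in visited:
                continue
            visited.add(neighbour)
            to_check.add(neighbour)
            if needs_modification(neighbour):   # only then keep propagating
                frontier.append(neighbour)
    return to_check

# illustrative maintenance-link graph: A-B and B-C, so A and C are indirectly linked
links = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}
print(items_to_check(links, "A", needs_modification=lambda n: n == "B"))
# -> {'B', 'C'}: B is checked and modified, so C must be checked as well
```

A low density of maintenance links keeps the set returned by such a traversal, and hence the amount of checking, small.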
The terms data, information and knowledge are used here in the following sense. The data things in an application are the fundamental, indivisible things. Data things can be represented as simple constants or variables. If an association between things cannot be defined as a succinct, computable rule then it is an implicit association. Otherwise it is an explicit association. An information thing in an application is an implicit association between data things. Information things can be represented as tuples or relations. A knowledge thing in an application is an explicit association between information and/or data things. Knowledge can be represented either as programs in an imperative language or as rules in a declarative language.
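The following is a tiny illustration of the three kinds of thing, using the running mark-up example; the concrete Python renderings (constants, a dictionary for the relation, a function for the rule) are only one possible reading and are not the item formalism introduced in the next section.

```python
# data things: fundamental, indivisible -- simple constants or variables
mark_up = 1.3
parts = ["p1", "p2"]

# an information thing: an implicit association between data things,
# representable as the tuples of a relation (here, part/cost-price)
cost_price = {"p1": 100.0, "p2": 250.0}

# a knowledge thing: an explicit, computable association between information
# and data things -- "the sale price of parts is the cost price marked up
# by a universal mark-up factor"
def sale_price(part):
    return mark_up * cost_price[part]

print({p: sale_price(p) for p in parts})
```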
2 Conceptual Model
Items are a formalism for describing all data, information and knowledge things in an application [7]. Items incorporate two classes of constraints, and a single rule of decomposition is specified for items. The key to this unified representation is the way in which the “meaning” of an item, called its semantics, is specified. The semantics of an item is a function that recognises the members of the “value set” of that item. The value set of an item will change in time τ, but the item's semantics should remain constant. The value set of an information item at a certain time τ is the set of tuples that are associated with a relational implementation of that item at that time. Knowledge items have value sets too. Consider the rule “the sale price of parts is the cost price marked up by a universal mark-up factor”; this rule is represented by the item named [part/sale-price, part/cost-price, mark-up] with a value set of corresponding quintuples. The idea of defining the semantics of items as recognising functions for the members of their value set extends to complex, recursive knowledge items. An item is a named triple A[S_A, V_A, C_A] with item name A, where S_A is called the item semantics of A, V_A is called the item value constraints of A and C_A is called the item set constraints of A. The item semantics, S_A, is a λ-calculus expression that recognises the members of the value set of item A. The expression for an item's semantics may contain the semantics of other items {A_1,..., A_n}, called that item's components:

λy_1^1 ... y_{m_1}^1 ... y_1^n ... y_{m_n}^n • [ S_{A_1}(y_1^1,...,y_{m_1}^1) ∧ ... ∧ S_{A_n}(y_1^n,...,y_{m_n}^n) ∧ J(y_1^1,...,y_{m_1}^1,...,y_{m_n}^n) ]•

The item value constraints, V_A, is a λ-calculus expression:

λy_1^1 ... y_{m_1}^1 ... y_1^n ... y_{m_n}^n • [ V_{A_1}(y_1^1,...,y_{m_1}^1) ∧ ... ∧ V_{A_n}(y_1^n,...,y_{m_n}^n) ∧ K(y_1^1,...,y_{m_1}^1,...,y_{m_n}^n) ]•

that should be satisfied by the members of the value set of item A as they change in time; so if a tuple satisfies S_A then it should satisfy V_A [8]. The expression for an item's value constraints contains the value constraints of that item's components. The item set constraints, C_A, is an expression of the form:

C_{A_1} ∧ C_{A_2} ∧ ... ∧ C_{A_n} ∧ (L)_A
where L is a logical combination of:
• Card lies in some numerical range;
• Uni(A_i) for some i, 1 ≤ i ≤ n, and
• Can(A_i, X) for some i, 1 ≤ i ≤ n, where X is a non-empty subset of {A_1, ..., A_n} − {A_i};
subscripted with the name of the item A. "Uni(a)" means that "all members of the value set of item a must be in this association". "Can(b, A)" means that "the value set of the set of items A functionally determines the value set of item b". "Card" means "the number of things in the value set". The subscripts indicate the item's components to which that set constraint applies.

For example, each part may be associated with a cost-price subject to the "value constraint" that parts whose part-number is less than 1,999 should be associated with a cost price of no more than $300. The information item named part/cost-price then is:

  part/cost-price[ λxy•[ S_part(x) ∧ S_cost-price(y) ∧ costs(x, y) ]•,
                   λxy•[ V_part(x) ∧ V_cost-price(y) ∧ ((x < 1999) → (y ≤ 300)) ]•,
                   C_part ∧ C_cost-price ∧ (Uni(part) ∧ Can(cost-price, {part}))_part/cost-price ]

Rules, or knowledge, can also be defined as items, although it is neater to define knowledge items using "objects". "Objects" are item building operators. The knowledge item [part/sale-price, part/cost-price, mark-up], which means "the sale price of parts is the cost price marked up by a uniform mark-up factor", is:

  [part/sale-price, part/cost-price, mark-up][
    λx1x2y1y2z•[ (S_part/sale-price(x1, x2) ∧ S_part/cost-price(y1, y2) ∧ S_mark-up(z)) ∧ ((x1 = y1) → (x2 = z × y2)) ]•,
    λx1x2y1y2z•[ (V_part/sale-price(x1, x2) ∧ V_part/cost-price(y1, y2) ∧ V_mark-up(z)) ∧ ((x1 = y1) → (x2 > y2)) ]•,
    C_[part/sale-price, part/cost-price, mark-up] ]

Two different items can share common knowledge and so can lead to a profusion of maintenance links. This problem can be avoided by using objects. An n-adic object is an operator that maps n given items into another item for some value of n. Further, the definition of each object will presume that the set of items to which that object may be applied are of a specific "type". The type of an m-adic item is determined both by whether it is a data item, an information item or a knowledge item and by the value of m. The type is denoted respectively by D^m, I^m and K^m. Items may also have unspecified, or free, type, which is denoted by X^m. The formal definition of an object is similar to that of an item. An object named A is a typed triple A[E, F, G] where E is a typed expression called the semantics of A, F is a typed expression called the value constraints of A and G is a typed expression called the set constraints of A. For example, the part/cost-price item can be built from the items part and cost-price using the costs operator:

  part/cost-price = costs(part, cost-price)

  costs[ λP:X¹ Q:X¹ • λxy•[ S_P(x) ∧ S_Q(y) ∧ costs(x, y) ]••,
         λP:X¹ Q:X¹ • λxy•[ V_P(x) ∧ V_Q(y) ∧ ((1000 < x < 1999) → (y ≤ 300)) ]••,
         λP:X¹ Q:X¹ • [ C_P ∧ C_Q ∧ (Uni(P) ∧ Can(Q, {P}))_n(costs,P,Q) ]• ]

where n(costs, P, Q) is the name of the item costs(P, Q). Objects also represent value constraints and set constraints in a uniform way. A decomposition operation for objects is defined in [2].
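To make the triple structure concrete, the following sketch shows one possible, entirely hypothetical encoding of the part/cost-price item as a named triple of recogniser functions; the tuple shapes and the stand-in predicates are assumptions, not part of the unified representation itself.

```python
# Hypothetical sketch of an item as a named triple [semantics, value constraints,
# set constraints]; the predicates below stand in for S_part, S_cost-price, etc.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Item:
    name: str
    semantics: Callable          # recognises members of the value set
    value_constraints: Callable  # must hold for every member of the value set
    set_constraints: str         # kept symbolic in this sketch

is_part = lambda x: isinstance(x, int)              # stand-in for S_part
is_price = lambda y: isinstance(y, (int, float))    # stand-in for S_cost-price

part_cost_price = Item(
    name="part/cost-price",
    semantics=lambda x, y: is_part(x) and is_price(y),          # costs(x, y) kept implicit
    value_constraints=lambda x, y: (not x < 1999) or y <= 300,  # (x < 1999) -> (y <= 300)
    set_constraints="C_part ∧ C_cost-price ∧ (Uni(part) ∧ Can(cost-price, {part}))",
)

# A tuple of the value set must satisfy the semantics and the value constraints:
print(part_cost_price.semantics(1234, 250), part_cost_price.value_constraints(1234, 250))  # True True
print(part_cost_price.value_constraints(1234, 350))                                        # False
```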
A conceptual model consists of a set of items and a set of maintenance links. The items are constructed by applying a set of object operators to a set of fundamental items called the basis. The maintenance links join two items if modification to one of them necessarily means that the other item has at least to be checked for correctness if consistency is to be preserved. Item join provides the basis for item decomposition [7]. Given items A and B, the item with name A ⊗_E B is called the join of A and B on E, where E is a set of components common to both A and B. Using the rule of composition ⊗, knowledge items, information items and data items may be joined with one another regardless of type. For example, the knowledge item:

  [cost-price, tax][ λxy•[ S_cost-price(x) ∧ S_tax(y) ∧ x = y × 0.05 ]•,
                     λxy•[ V_cost-price(x) ∧ V_tax(y) ∧ x < y ]•,
                     C_[cost-price, tax] ]

can be joined with the information item part/cost-price on the set {cost-price} to give the information item part/cost-price/tax. In other words:

  [cost-price, tax] ⊗_{cost-price} part/cost-price =
    part/cost-price/tax[ λxyz•[ S_part(x) ∧ S_cost-price(x) ∧ S_tax(y) ∧ costs(x, y) ∧ z = y × 0.05 ]•,
                         λxyz•[ V_part(x) ∧ V_cost-price(x) ∧ V_tax(y) ∧ ((1000 < x < 1999) → (0 …
In this way items may be joined together to form more complex items. The ⊗ operator also forms the basis of a theory of decomposition in which each item is replaced by a set of simpler items. An item I is decomposable into the set of items D = {I_1, I_2, ..., I_n} if: (i) each I_i has non-trivial semantics, and (ii) I = I_1 ⊗ I_2 ⊗ ... ⊗ I_n, where each join is monotonic; that is, each term in this composition contributes at least one component to I. If item I is decomposable then it will not necessarily have a unique decomposition. The ⊗ operator is applied to objects in a similar way [7]. The rule of decomposition is: "Given a conceptual model, discard any items and objects which are decomposable". For example, this rule requires that the item part/cost-price/tax should be discarded in favour of the two items [cost-price, tax] and part/cost-price.
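Viewed over value sets, the join on a common component behaves much like a relational join; the sketch below is a loose illustration with invented tuples (the items themselves operate on recogniser functions, not stored tables).

```python
# Loose illustration (invented data): joining value sets on the shared cost-price component.
# part/cost-price tuples are (part, cost_price); [cost-price, tax] tuples are (cost_price, tax).
part_cost_price = {(1234, 250.0), (1500, 300.0)}
cost_price_tax = {(250.0, 12.5), (300.0, 15.0)}      # invented (cost-price, tax) pairs

def join_on_cost_price(info_tuples, knowledge_tuples):
    """Natural-join-like combination on the shared cost-price component."""
    return {(p, c, t) for (p, c) in info_tuples
                      for (c2, t) in knowledge_tuples if c == c2}

print(sorted(join_on_cost_price(part_cost_price, cost_price_tax)))
# [(1234, 250.0, 12.5), (1500, 300.0, 15.0)] -- tuples of the derived item part/cost-price/tax
```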
3 Maintenance Links
A maintenance link joins two items in the conceptual model if modification of one item means that the other item must be checked for correctness, and maybe modified, if the consistency of the conceptual model is to be preserved [9]. The number of maintenance links can be very large. So maintenance links can only form the basis of a practical approach to knowledge base maintenance if there is some way of reducing their density on the conceptual model. For example, given two items A and B, where both are n-adic items with semantics S_A and S_B respectively, if π is a permutation such that:
  (∀ x_1 x_2 ... x_n)[ S_A(x_1, x_2, ..., x_n) → S_B(π(x_1, x_2, ..., x_n)) ]

then item B is a sub-item of item A. These two items should be joined with a maintenance link. If A and B are both data items then B is a sub-type of A. Suppose that:

  X = E(D), where D = C(A, B)        (1)

for items X, D, A and B and objects E and C. Item X is a sub-item of item D. Object E has the effect of extracting a subset of the value set of item D to form the value set of item X. Item D is formed from items A and B using object C. Introduce two new objects F and J. Suppose that object F, when applied to item A, extracts the same subset of item A's value set as E extracted from the "left-side" (i.e. the "A-side") of D. Likewise J extracts the same subset of B's value set as E extracted from D. Then:

  X = C(G, K), where G = F(A) and K = J(B)        (2)

so G is a sub-item of A, and K is a sub-item of B. The form (2) differs from (1) in that the sub-item maintenance links have been moved one layer closer to the data item layer, and object C has moved one layer away from the data item layer. Using this method repeatedly, sub-item maintenance links between non-data items are reduced to sub-type links between data items.

It is shown now that there are four kinds of maintenance link in a conceptual model built using the unified knowledge representation. Consider two items A and B, and suppose that their semantics S_A and S_B have the form:

  S_A = λ y^1_1 ... y^1_{m_1} ... y^p_{m_p} • [ S_{A_1}(y^1_1, ..., y^1_{m_1}) ∧ ... ∧ S_{A_p}(y^p_1, ..., y^p_{m_p}) ∧ J(y^1_1, ..., y^1_{m_1}, ..., y^p_{m_p}) ] •
  S_B = λ y^1_1 ... y^1_{n_1} ... y^q_{n_q} • [ S_{B_1}(y^1_1, ..., y^1_{n_1}) ∧ ... ∧ S_{B_q}(y^q_1, ..., y^q_{n_q}) ∧ K(y^1_1, ..., y^1_{n_1}, ..., y^q_{n_q}) ] •

S_A contains (p + 1) terms and S_B contains (q + 1) terms. Let µ be a maximal sub-expression of S_{A ⊗ B} such that:

  both S_A → µ and S_B → µ        (a)

where µ has the form:

  λ y^1_1 ... y^1_{d_1} ... y^r_{d_r} • [ S_{C_1}(y^1_1, ..., y^1_{d_1}) ∧ ... ∧ S_{C_r}(y^r_1, ..., y^r_{d_r}) ∧ L(y^1_1, ..., y^1_{d_1}, ..., y^r_{d_r}) ] •

If µ is empty, i.e. 'false', then the semantics of A and B are independent. If µ is non-empty then the semantics of A and B have something in common and A and B should be joined with a maintenance link [10]. Now examine µ to see why A and B should be joined. If µ is non-empty and if both A and B are items in the basis then:

  A and B are a pair of basis items with logically dependent semantics        (b)

If µ is non-empty and if A is not in the basis then there are three cases. First, if:

  S_A ≡ S_B ≡ µ        (c)

then items A and B are equivalent and should be joined with an equivalence link. Second, if (c) does not hold and:

  either S_A ≡ µ or S_B ≡ µ        (d)

then either A is a sub-item of B, or B is a sub-item of A, and these two items should be joined with a sub-item link. Third, if (c) and (d) do not hold, then let ∆ be a minimal sub-expression of S_A such that ∆ → µ. Then:

  either S_{A_j}(y^j_1, ..., y^j_{m_j}) is a term of ∆, for some j        (e)
  or J(y^1_1, ..., y^1_{m_1}, ..., y^p_{m_p}) is a term of ∆        (f)

Both (e) and (f) may hold. If (e) holds then items A and B share one or more component items, to which they should each be joined with a component link. If (f) holds then items A and B may be constructed with two object operators whose respective semantics are logically dependent. Suppose that item A was constructed by object operator C; then the semantics of C will imply:

  F = λ Q_1:X^{i_1} Q_2:X^{i_2} ... Q_j:X^{i_j} • λ y^1_1 ... y^1_{d_1} ... y^r_{d_r} • [ S_{P_1}(y^1_1, ..., y^1_{d_1}) ∧ ... ∧ S_{P_r}(y^r_1, ..., y^r_{d_r}) ∧ L(y^1_1, ..., y^1_{d_1}, ..., y^r_{d_r}) ] •

where the Q_i's take care of any possible duplication in the P_j's. Let E be the object E[ F, T, Á ]; then C is a sub-object of E; that is, there exists a non-tautological object F such that:

  C ≈_w E ⊗_M F        (g)

for some set M, where the join is not necessarily monotonic. Items A and B are weakly equivalent, written A ≈_w B, if there exists a permutation π such that:

  (∀ x_1 x_2 ... x_n)[ S_A(x_1, x_2, ..., x_n) ≡ S_B(π(x_1, x_2, ..., x_n)) ]

where the x_i are the n_i variables associated with the i'th component of A. If A is a sub-item of B and if B is a sub-item of A then items A and B are weakly equivalent. If (g) holds then the maintenance links are of three different kinds. If the join in (g) is monotonic then (g) states that C may be decomposed into E and F. If the join in (g) is not monotonic then (g) states that either C ≈_w E or C ≈_w F. So, if the join in (g) is not monotonic then either E will be weakly equivalent to C, or C will be a sub-object of E.

It has been shown above that sub-item links between non-data items may be reduced to sub-type links between data items. So if:
• the semantics of the items in the basis are all logically independent;
• all equivalent items and objects have been removed by re-naming, and
• sub-item links between non-data items have been reduced to sub-type links between data items
then the maintenance links will be between nodes marked with:
• a data item that is a sub-type of the data item marked on another node; these are called the sub-type links;
• an item and the nodes marked with that item's components; these are called the component links, and
• an item constructed by a decomposable object and nodes constructed with that object's decomposition; these are called the duplicate links.
If the objects employed to construct the conceptual model have been decomposed then the only maintenance links remaining will be the sub-type links and the component links. The sub-type links and the component links cannot be removed from the conceptual model. Unfortunately, decomposable objects, and so too duplicate links, are hard to detect [11]. Suppose that objects A and B are decomposable as follows:

  A ≈_w E ⊗_M F
  B ≈_w E ⊗_M G

Then objects A and B should both be linked to object E. If the decompositions of A and B have not been identified then object E may not have been identified, and the implicit link between objects A and B may not be identified.
4 Conclusions
Potential maintenance hazards caused by one chunk of knowledge being hidden within another that plague rule-based formalisms have been partially avoided by using a unified representation based on items. Items make it difficult to analyse the structure of a whole application. To make the structure clear, ‘objects’ are introduced as item building operators. Objects are an abstraction representing the structure of knowledge. The use of objects to build items enables the hidden links in the knowledge to be identified and made explicit. A single decomposition operation for objects enables some of these links to be removed thus simplifying maintenance.
References
1. Mayol, E. and Teniente, E. (1999). "Addressing Efficiency Issues During the Process of Integrity Maintenance", in proceedings Tenth International Conference DEXA99, Florence, September 1999, pp. 270-281.
2. Debenham, J.K. "Validity of First-Order Knowledge Bases", in proceedings 14th International FLAIRS Conference FLAIRS-2001, Key West, Florida, 21-23 May 2001, pp. 217-221.
3. Katsuno, H. and Mendelzon, A.O. "On the Difference between Updating a Knowledge Base and Revising It", in proceedings Second International Conference on Principles of Knowledge Representation and Reasoning, KR'91, Morgan Kaufmann, 1991.
4. Barr, V. "Applying Reliability Engineering to Expert Systems", in proceedings 12th International FLAIRS Conference, Florida, May 1999, pp. 494-498.
5. Jantke, K.P. and Herrmann, J. "Lattices of Knowledge in Intelligent Systems Validation", in proceedings 12th International FLAIRS Conference, Florida, May 1999, pp. 499-505.
6. Darwiche, A. (1999). "Compiling Knowledge into Decomposable Negation Normal Form", in proceedings International Joint Conference on Artificial Intelligence, IJCAI'99, Stockholm, Sweden, August 1999, pp. 284-289.
7. Debenham, J.K. "Knowledge Engineering", Springer-Verlag, 1998.
8. Johnson, G. and Santos, E. (2000). "Generalizing Knowledge Representation Rules for Acquiring and Validating Uncertain Knowledge", in proceedings 13th International FLAIRS Conference, Florida, May 2000, pp. 186-191.
9. Ramirez, J. and de Antonio, A. (2000). "Semantic Verification of Rule-Based Systems with Arithmetic Constraints", in proceedings 11th International Conference DEXA2000, London, September 2000, pp. 437-446.
10. Baral, C. and Zhang, Y. "On the Semantics of Knowledge Update", in proceedings Seventeenth International Joint Conference on Artificial Intelligence (IJCAI-01), Seattle, Washington, August 2001, pp. 97-102.
11. Roger, M., Simonet, A. and Simonet, M. (2002). "Bringing together Description Logics and Database in an Object Oriented Model", in proceedings Thirteenth International Conference DEXA 2002, Aix-en-Provence, September 2002, pp. 504-513.
Formal Argumentation Frameworks for the Extended Generalized Annotated Logic Programs Takehisa Takahashi, Yuichi Umeda, and Hajime Sawamura Department of Information Engineering and Graduate School of Science and Technology Niigata University 8050, Ninocho, Ikarashi, Niigata, 950-2181 Japan {takehisa,umeda,sawamura}@cs.ie.niigata-u.ac.jp http://www.cs.ie.niigata-u.ac.jp/
Abstract. Argumentation is an important form and way of interaction which is considered one of the most essential issues for agent systems as well as for humans. So far, a number of argumentation models have been proposed, in particular for the extended logic programs as knowledge representation. In this paper, we further pursue, along the same lines, the basic argumentation framework for much more expressive logic programs called the extended generalized annotated logic programs (EGAP). We provide the semantics and dialectical proof theory for it, prove soundness and completeness as well as the equivalence to the well-founded semantics, and then extend the basic argumentation framework to multi-agent argumentation. Argument examples are described in illustration of these results.
1 Introduction
Argumentation is a ubiquitous form and way of dialogue in human society. Recently, in the fields of artificial intelligence and computer science, there has been a growing interest in argumentation as an effective means of interaction for intelligent systems [1], as a general framework for relating nonmonotonic logics of different styles [3], and so on. So far, a number of argumentation frameworks have been proposed [1]. The underlying frameworks are mainly built using the normal logic programs or the extended logic programs (ELP) as knowledge representation languages (e.g., [6]). In this paper, we are concerned with how arguments should be made under incomplete, uncertain, inconsistent or vague knowledge bases. For this, we pursue a more versatile argumentation framework using much more expressive logic programs called the extended generalized annotated logic programs (EGAP), which have two kinds of negation (default and explicit [3]) and are founded on many-valuedness. In Section 2, we define the well-founded semantics for EGAP under the general semantics for the generalized annotated logic programs [4]. In Section 3,
we prepare an abstract argumentation framework whose soundness and completeness are proved. This section is a preparation for the following Sections 4 and 5. Section 4 describes the basic argumentation framework, which deals with argumentation within a single knowledge base. We give the definitions of arguments and the attack relation which are proper to EGAP, and prove the equivalence of the argumentation semantics of the basic argumentation framework to the well-founded semantics of Section 2. In Section 5, we further develop the basic argumentation framework into one for multi-agents, in which a new attack relation is introduced so as to reflect a skeptical view of agents toward other knowledge bases.
2 Language and Semantics
We define generalized annotated logic programs with default negation (EGAP) as follows.

Definition 1. (Extended Generalized Annotated Logic Programs [4, 8]). An extended generalized annotated logic program (EGAP) is a set of ground instances of rules of the form

  A_0:µ_0 ← B_1:µ_1 & ... & B_n:µ_n & not(B_{n+1}:µ_{n+1}) & ... & not(B_m:µ_m)

where not is the default negation symbol, A_0:µ_0 and B_i:µ_i (1 ≤ i ≤ m) are annotated atoms, and not(B_i:µ_i) (n+1 ≤ i ≤ m) are annotated default atoms.¹ An EGAP with no annotated default atom coincides with a generalized annotated logic program (GAP [4]).

In [4], the general and restricted semantics for GAP were given. The well-founded semantics (WFS) for EGAP based on the restricted semantics of GAP was studied in [8] and [2]. In this section, we provide WFS for EGAP based on the general semantics of GAP, along the lines of the result of [8].² We assume a set of all ideals of a complete lattice (T, ≤) of truth values (denoted by I(T)), and define an interpretation of an EGAP P to be a mapping from the Herbrand base (H_P) of P to I(T) [4]. Let I be an interpretation; then satisfaction is defined such that I |= A:µ iff µ ∈ I(A), and I |= not(A:µ) iff I ⊭ A:µ (refer to [4] for the remaining part). The monotonic and continuous operator T_P mapping interpretations into interpretations was defined for the fixpoint semantics in [4].

¹ The explicit negation "¬" is implicitly handled in this paper, so that it is defined by ¬: T → T and ¬A:µ = A:¬µ.
² In [2], the coherence principle (¬A:µ → not(A:µ)) was introduced, but we do not take it into account in this paper, since we think that the coherence principle is not necessarily adequate to our purpose.

Definition 2. (Γ_P Operator). Let P be an EGAP, and I be an interpretation. The transformation of P to a GAP by I is defined as follows:

  P_I = { A_0:µ_0 ← B_1:µ_1 & ... & B_n:µ_n  |  A_0:µ_0 ← B_1:µ_1 & ... & B_n:µ_n & not(B_{n+1}:µ_{n+1}) & ... & not(B_m:µ_m) ∈ P, and I ⊭ B_{n+1}:µ_{n+1} and ... and I ⊭ B_m:µ_m }.

Then, the function Γ_P mapping interpretations into interpretations is defined such that Γ_P(I) = lfp T_{P_I}.

Γ_P is anti-monotonic, but Γ_P Γ_P, which applies Γ_P twice, is monotonic (this can be proved in a similar way to [8]). Hence it has a least fixpoint. We write simply Γ for Γ_P when P is obvious.

Definition 3. (Well-Founded Semantics). Let P be an EGAP, and M be the least fixpoint of ΓΓ. Then the well-founded model of P is defined by:
  − WFS(P) |= A:µ iff µ ∈ M(A);
  − WFS(P) |= not(A:µ) iff µ ∉ (Γ(M))(A).

The least fixpoint of ΓΓ can be found by iteration as follows:
  I_0 = ∆
  I_α = ΓΓ(I_{α−1}) for successor ordinal α
  I_λ = ⋃_{α<λ} I_α for limit ordinal λ
where ∆ is the least interpretation, i.e. for all atoms A, ∆(A) = ∅ (the empty ideal). Then there exists a least ordinal λ_0 such that I_{λ_0} is the least fixpoint of ΓΓ.

Example 1. Suppose the complete lattice T of truth values is FOUR = ({⊥, t, f, ⊤}, {⊥ < t, ⊥ < f, t < ⊤, f < ⊤}), and consider the following EGAP P.

  arise(unbalance):t ← large(1N):t & is(OA, D_level):t.
  arise(unbalance):f ← not(direction(1N, horizontal):t).
  large(1N):t ← .
  is(OA, D_level):t ← .
  direction(1N, vertical):t ← .

This program is a simple example of the rules of failure diagnosis for rotating machinery. If M is the least fixpoint of ΓΓ, then M(arise(unbalance)) = {⊥, t, f, ⊤}, and hence ∀µ ∈ T, WFS(P) |= arise(unbalance):µ. WFS(P) |= arise(unbalance):⊤ says that the result of the diagnosis about unbalance shows inconsistency. In systems which treat uncertain human knowledge, for example expert systems, it is very convenient for systems to be able to directly represent inconsistent information.
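As an illustration only (this is not the authors' implementation), the following Python sketch applies Definitions 2 and 3 to the program of Example 1 over FOUR, representing an interpretation as a map from atoms to ideals and iterating ΓΓ from the least interpretation; the atom names and the rule encoding are assumptions.

```python
# Sketch (not the authors' code) of Definitions 2-3 for Example 1 over FOUR.
# Interpretations map atoms to ideals of FOUR (downward-closed, closed under joins);
# the well-founded model is the least fixpoint of Gamma.Gamma, found by iteration.
BOT, T, F, TOP = "bot", "t", "f", "top"
ELEMS = (BOT, T, F, TOP)

def leq(a, b):                       # order of FOUR: bot < t, f < top; t, f incomparable
    return a == b or a == BOT or b == TOP

def join(a, b):
    if leq(a, b): return b
    if leq(b, a): return a
    return TOP                       # t join f = top

def ideal_close(s):                  # downward closure + closure under finite joins
    s, changed = set(s), True
    while changed:
        changed = False
        for a in list(s):
            for x in ELEMS:
                if leq(x, a) and x not in s:
                    s.add(x); changed = True
            for b in list(s):
                if join(a, b) not in s:
                    s.add(join(a, b)); changed = True
    return frozenset(s)

# Rules encoded as (head atom, head annotation, positive body, default body).
P = [("arise_unbalance", T, [("large_1N", T), ("is_OA_Dlevel", T)], []),
     ("arise_unbalance", F, [], [("direction_1N_horizontal", T)]),
     ("large_1N", T, [], []),
     ("is_OA_Dlevel", T, [], []),
     ("direction_1N_vertical", T, [], [])]
ATOMS = {r[0] for r in P} | {a for r in P for a, _ in r[2] + r[3]}

def sat(I, atom, mu):                # I |= atom:mu  iff  mu lies in the ideal I(atom)
    return mu in I[atom]

def lfp_TP(gap):                     # least fixpoint of T_P for a GAP
    I, changed = {a: frozenset() for a in ATOMS}, True
    while changed:
        changed = False
        for head, mu, body, _ in gap:
            if all(sat(I, a, m) for a, m in body):
                new = ideal_close(I[head] | {mu})
                if new != I[head]:
                    I[head] = new; changed = True
    return I

def gamma(I):                        # Definition 2: reduce P by I, then lfp of T_{P_I}
    return lfp_TP([r for r in P if not any(sat(I, a, m) for a, m in r[3])])

I = {a: frozenset() for a in ATOMS}  # Delta, the least interpretation (empty ideals)
while True:                          # iterate Gamma.Gamma up to its least fixpoint
    nxt = gamma(gamma(I))
    if nxt == I:
        break
    I = nxt
print(sorted(I["arise_unbalance"]))  # all of FOUR: ['bot', 'f', 't', 'top']
```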
3 Abstract Argumentation
Following [3, 6, 7], we introduce an abstract argumentation framework in which neither the arguments nor the attack relations on them are specified in a concrete manner. This section prepares for the argumentation frameworks described in the succeeding sections.
3.1 Acceptable and Justified Arguments
We define the argumentation semantics for EGAP as the least fixpoint of the function which collects all acceptable arguments.

Definition 4. (Attack Relation). Let Args be an abstract argument set. An attack relation x on Args is a binary relation on Args, i.e., x ⊆ Args².

Definition 5. (x/y-Acceptable and Justified Argument). Let x and y be attack relations on Args. Suppose Arg_1 ∈ Args and S ⊆ Args. Then Arg_1 is x/y-acceptable wrt. S if for every Arg_2 ∈ Args such that (Arg_2, Arg_1) ∈ x there exists Arg_3 ∈ S such that (Arg_3, Arg_2) ∈ y. The function F_{Args,x/y} mapping from P(Args) to P(Args) is defined by F_{Args,x/y}(S) = {Arg ∈ Args | Arg is x/y-acceptable wrt. S}. Since F_{Args,x/y} is monotonic, it has a least fixpoint, denoted by J_{Args,x/y} (we write simply J_{x/y} when Args is obvious). An argument Arg is x/y-justified if Arg ∈ J_{x/y}; an argument is x/y-overruled if it is attacked by an x/y-justified argument; and an argument is x/y-defensible if it is neither x/y-justified nor x/y-overruled.

A least fixpoint of F_{x/y} can be constructed by the following iterative method [3]:
  J⁰_{x/y} = ∅
  J^α_{x/y} = F_{x/y}(J^{α−1}_{x/y}) for successor ordinal α
  J^λ_{x/y} = ⋃_{α<λ} J^α_{x/y} for limit ordinal λ
Then there exists a least ordinal λ_0 such that F_{x/y}(J^{λ_0}_{x/y}) = J^{λ_0}_{x/y} = J_{x/y}.
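A minimal sketch (not from the paper) of Definition 5 for a finite argument set, computing J_{Args,x/y} by iterating F_{x/y} from the empty set; the argument names and the toy attack relation below are invented.

```python
# Sketch of Definition 5: compute the x/y-justified arguments as the least
# fixpoint of F_{Args,x/y} for a finite argument set.
def acceptable(arg, S, args, x, y):
    """arg is x/y-acceptable wrt S if every x-attacker of arg is y-attacked by some member of S."""
    return all(any((a3, a2) in y for a3 in S)
               for a2 in args if (a2, arg) in x)

def justified(args, x, y):
    """Least fixpoint of F_{x/y}, reached by iteration from the empty set."""
    S = set()
    while True:
        nxt = {a for a in args if acceptable(a, S, args, x, y)}
        if nxt == S:
            return S
        S = nxt

# Toy usage: arguments named by strings; x = y = an undercut-style relation.
args = {"A", "B", "C"}
attack = {("B", "A"), ("C", "B")}            # B attacks A, C attacks B
print(justified(args, attack, attack))       # prints {'A', 'C'} (set order may vary)
```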
3.2 Dialectical Proof Theory

Justified arguments can be dialectically determined from a set of arguments by the dialectical proof theory. We show the sound and complete dialectical proof theory for the abstract argumentation semantics J_{Args,x/y}.

Definition 6. (x/y-Dialogue and x/y-Dialogue Tree). An x/y-dialogue is a finite nonempty sequence of moves move_i = (Player_i, Arg_i), (i ≥ 1) such that
1. Player_i = P (Proponent) iff i is odd; and Player_i = O (Opponent) iff i is even.
2. If Player_i = Player_j = P (i ≠ j) then Arg_i ≠ Arg_j.
3. If Player_i = P (i ≥ 3) then (Arg_i, Arg_{i−1}) ∈ y; and if Player_i = O (i ≥ 2) then (Arg_i, Arg_{i−1}) ∈ x.
An x/y-dialogue tree is a tree of moves such that every branch is an x/y-dialogue, and for all moves move_i = (P, Arg_i), the children of move_i are all those moves (O, Arg_j) such that (Arg_j, Arg_i) ∈ x.
Definition 7. (Provably x/y-Justified). An x/y-dialogue D is a winning x/y-dialogue iff the termination of D is a move of the proponent. An x/y-dialogue tree T is a winning x/y-dialogue tree iff every branch of T is a winning x/y-dialogue. An argument Arg is a provably x/y-justified argument iff there exists a winning x/y-dialogue tree with Arg as its root.

Theorem 1. Let Args be an abstract argument set. Assume the attack relation on Args is finite. Then Arg ∈ Args is provably x/y-justified iff Arg is x/y-justified. (Refer to [9] for the proof.)
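A naive recursive sketch (our own assumption, not the authors' procedure) of the search for a winning x/y-dialogue tree; `used` records the proponent's own moves along the current branch, as condition 2 of Definition 6 requires.

```python
# Naive sketch: does a winning x/y-dialogue tree rooted at (P, arg) exist?
def provably_justified(arg, args, x, y, used=frozenset()):
    used = used | {arg}                                   # proponent has now played arg on this branch
    for attacker in (a for a in args if (a, arg) in x):   # every possible opponent move
        if not any((defender, attacker) in y and defender not in used
                   and provably_justified(defender, args, x, y, used)
                   for defender in args):                 # some winning proponent reply
            return False
    return True

args = {"A", "B", "C"}
attack = {("B", "A"), ("C", "B")}
print(provably_justified("A", args, attack, attack))      # True, matching the fixpoint result
```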
4 Basic Argumentation
We describe our basic argumentation (BA) framework for EGAP, based on the abstract argumentation framework.

4.1 Arguments
We extend the notion of reductant which is described in GAP [4].

Definition 8. (Reductant and Minimal Reductant). Suppose P is an EGAP, and C_i (1 ≤ i ≤ k) are rules in P of the form:

  A:ρ_i ← B^i_1:µ^i_1 & ... & B^i_{n_i}:µ^i_{n_i} & not(B^i_{n_i+1}:µ^i_{n_i+1}) & ... & not(B^i_{m_i}:µ^i_{m_i})

For simplicity, we write Body^i_{1...n_i} for B^i_1:µ^i_1 & ... & B^i_{n_i}:µ^i_{n_i} and not(Body^i_{n_i+1...m_i}) for not(B^i_{n_i+1}:µ^i_{n_i+1}) & ... & not(B^i_{m_i}:µ^i_{m_i}). Let ρ = ⊔{ρ_1, ..., ρ_k}. Then

  A:ρ ← Body^1_{1...n_1} & ... & Body^k_{1...n_k} & not(Body^1_{n_1+1...m_1}) & ... & not(Body^k_{n_k+1...m_k})

is a reductant of P. If there exists no non-empty proper subset S ⊂ {ρ_1, ..., ρ_k} such that ρ = ⊔S, then it is a minimal reductant.

Definition 9. (Arguments). Let P be an EGAP. An argument in P is a finite sequence Arg = [r_1, ..., r_n] of minimal reductants in P such that:
1. for every i (1 ≤ i ≤ n) and for every annotated atom A_j:µ_j in the body of r_i, there exists a minimal reductant r_k such that A_j:µ_k (µ_k ≥ µ_j, n ≥ k > i) is the head of r_k;
2. there exists no proper subsequence of [r_1, ..., r_n] which meets the first condition and includes the minimal reductant r_1.
A subargument of Arg is a subsequence of Arg which is an argument. The heads of the minimal reductants in Arg are called conclusions of Arg, and the annotated default atoms in the bodies of the minimal reductants in Arg are called assumptions of Arg. We write concl(Arg) for the set of conclusions and assm(Arg) for the set of assumptions of Arg. We denote the set of all arguments in P by Args_P.
Example 2. Suppose T = FOUR and P is an EGAP such that

  P = { recommend(I, X):t ← famous(X):t & not(disreputable(X):t),
        recommend(I, X):f ← expensive(X):t,
        recommend(I, X):⊥ ← read(I, X):f,
        famous(book A):t ←,
        expensive(book A):t ←,
        read(I, book A):f ← }
where X is an object variable.³ Then,

  Arg = [ recommend(I, book A):⊤ ← famous(book A):t & expensive(book A):t & not(disreputable(book A):t),
          famous(book A):t ←,
          expensive(book A):t ← ]

is an argument which says that the book A has both a recommendable and an unrecommendable side. To make an argument with the conclusion recommend(I, book A):⊤, we need the minimal reductant with that head in Arg. Reductants which are not minimal are unnecessary (or redundant) in the sense that we can construct arguments with the same conclusion only from minimal reductants, and they occasionally tend to be irrelevant. For instance,

  recommend(I, book A):t ← famous(book A):t & read(I, book A):f & not(disreputable(book A):t)

is a reductant which is not minimal. It says "I recommend the book A because it is famous, I have never read it, and have never heard a disrepute of it". read(I, book A):f may not be relevant as a reason for asserting recommend(I, book A):t. Thus the introduction of minimal reductants has the role of excluding redundant arguments with excessive grounds.

4.2 Attack Relation for BA
The undercut and the rebut are typical attack relations employed very often in argument systems [6, 7]. The undercut invalidates assumptions of arguments, and the rebut contradicts conclusions of arguments. However, in EGAP there is no conflict among conclusions, by the paraconsistency of EGAP. To see this, let us consider T = [0, 1] and the EGAP P = {p:0.3 ←, p:0.6 ←}. Then the ideal assigned to p is [0, 0.6] in the model of P, and we have both p:0.3 and p:0.6 as logical consequences. Therefore, the two arguments Args_P = {[p:0.3 ←], [p:0.6 ←]} cannot be viewed as being in conflict with each other within one program. Therefore, we take into account only the undercut in BA.

Definition 10. (Undercut). Arg_1 undercuts Arg_2 iff there exist A:µ_1 ∈ concl(Arg_1) and not(A:µ_2) ∈ assm(Arg_2) such that µ_1 ≥ µ_2.

Note that the undercut relation is not always one-directional; that is, it can happen that Arg_1 undercuts Arg_2 and vice versa. P is called finitary if the attack relation on Args_P is finite.

³ We assume that a rule containing (object or annotation) variables represents any ground instance of it.
In BA, we use undercut/undercut-acceptable (u/u-acceptable), and define the BA semantics for an EGAP P to be J_{Args_P,u/u} (J_P for short). u/u-justified arguments can be dialectically determined by the dialectical proof theory specified in Subsection 3.2. We omit the notation u/u when BA is obvious.

4.3 Equivalence of the BA Semantics to WFS
In ELP, the equivalence between the argumentation semantics J_{u/a} (a is undercut or rebut) and the well-founded semantics with explicit negation (WFSX) was shown in [7]. Similarly, in EGAP, we can show that J_{u/u} coincides with WFS.

Definition 11. (BA Semantics). Let P be an EGAP. Then,
1. BA(P) |= A:µ iff there exists a justified argument Arg ∈ J_P such that for some ρ ≥ µ, A:ρ ∈ concl(Arg);
2. BA(P) |= not(A:µ) iff for every argument Arg ∈ Args_P, if there exists ρ ≥ µ such that A:ρ ∈ concl(Arg), then Arg is overruled.

Theorem 2. Let P be an EGAP. Then,
– WFS(P) |= A:µ iff BA(P) |= A:µ;
– WFS(P) |= not(A:µ) iff BA(P) |= not(A:µ).
(Refer to [9] for the proof.)

Example 3. Consider T = FOUR, and let P be the following EGAP.

  P = { a:f ← not(b:f),   a:t ← not(b:t),   b:t ← not(a:t),   b:f ← }.

Then,
  I_0           = {a → ∅, b → ∅}
  Γ(I_0)        = {a → {⊥, t, f, ⊤}, b → {⊥, t, f, ⊤}}
  I_1 = ΓΓ(I_0) = {a → ∅, b → {⊥, f}}
  Γ(I_1)        = {a → {⊥, t}, b → {⊥, t, f, ⊤}}
  I_2 = ΓΓ(I_1) = {a → ∅, b → {⊥, f}} = I_1

WFS(P) |= b:⊥, b:f, not(a:f), not(a:⊤).

  Args_P = { Arg_1: [a:f ← not(b:f)],   Arg_2: [a:t ← not(b:t)],
             Arg_3: [b:t ← not(a:t)],   Arg_4: [b:f ←],
             Arg_5: [a:⊤ ← not(b:f) & not(b:t)],   Arg_6: [b:⊤ ← not(a:t)] }

J_P = J¹_P = {Arg_4}; the overruled arguments are {Arg_1, Arg_5}. BA(P) |= b:⊥, b:f, not(a:f), not(a:⊤). Thus, BA(P) coincides with WFS(P).
5 Multi-agent Argumentation
In this section, we describe the multi-agent argumentation (MAA) framework for distributed knowledge bases. As in [5], we identify each distributed EGAP with an agent having an individual view of the world, and arguments with views of agents.

Definition 12. (Multi-agent System). Let each KB_i (1 ≤ i ≤ n) be an EGAP. Then the set KB = {KB_1, ..., KB_n} is a multi-agent system (MAS).

Our MAA is an argumentation framework in which agents argue with each other about what each agent believes to be right (i.e., the justified arguments for each agent in terms of BA). It is, however, not simply a distributed version of BA, since each agent usually has a different recognition even of the same assertion from other agents, and hence there can be a new attack relation, differently from BA. Let us take up an example. Suppose agent A has the confidence factors 0.5 and 0.8 about the solution of some problem, depending on different grounds. If agent B argues about it with a confidence factor 0.6, then agent B could rebut agent A for two reasons. One is that agent A's overall confidence is 0.8 (a maximum) according to the BA semantics, and the other is that agent A has too excessive a confidence in the view of B. The latter amounts to assuming a sort of skepticism for the rebut relation. Note that agent A cannot rebut agent B, since agent A's overall confidence subsumes agent B's; otherwise agent A would fall into self-contradiction. Taking these considerations into account, we give the following definitions.

Definition 13. (Maximal Arguments). Let Args be a set of arguments, and Arg be an argument in Args. We define the set of conclusions whose annotations are maximal as follows: max_concl(Arg) = {A:µ ∈ concl(Arg) | for any ρ, if A:ρ ∈ concl(Arg), then µ ≥ ρ}. Then Arg is called a maximal argument (m-argument) if for all A:µ ∈ max_concl(Arg), there is no Arg′ ∈ Args such that for some ρ > µ, A:ρ ∈ max_concl(Arg′).

Definition 14. (Rebut and Defeat).
– Arg_1 rebuts Arg_2 iff there exist A:µ_1 ∈ max_concl(Arg_1) and A:µ_2 ∈ max_concl(Arg_2) such that µ_1 ≱ µ_2.
– Arg_1 defeats Arg_2 iff Arg_1 undercuts Arg_2, or Arg_1 rebuts Arg_2 and Arg_2 does not undercut Arg_1.

Suppose KB = {KB_1, ..., KB_n} is a MAS. We denote the set of every m-argument in J_{KB_i} (i.e., the set J_{Args_{KB_i},u/u} of justified arguments for KB_i in BA) by mJ_{KB_i}, and define Args_KB = ⋃ mJ_{KB_i}. KB is called finitary if the attack relation on Args_KB is finite. Then we define the MAA semantics for EGAP by J_{Args_KB,d/u} (J_KB for short, with d = defeat and u = undercut). J_KB can also be dialectically determined by the dialectical proof theory shown in Subsection 3.2. We omit the notation d/u when MAA is obvious. Finally, we define the semantics of MAA similarly to BA.
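A small sketch (our own reading, not the authors' code) of Definitions 13-14 over the lattice [0,1]² used in Example 4 below; the rebut test follows the discussion above, i.e. Arg_1 rebuts Arg_2 on a shared atom exactly when Arg_1's maximal annotation does not subsume Arg_2's. The argument container and the annotation encoding are assumptions.

```python
# Sketch of max_concl, rebut and defeat over [0,1]^2 (componentwise order).
def leq(mu, rho):                       # (a1, b1) <= (a2, b2) componentwise
    return mu[0] <= rho[0] and mu[1] <= rho[1]

def max_concl(concl):
    """Conclusions A:mu whose annotation is maximal among the argument's conclusions on A."""
    return {(a, mu) for (a, mu) in concl
            if all(leq(rho, mu) for (a2, rho) in concl if a2 == a)}

def rebuts(arg1, arg2):
    return any(a1 == a2 and not leq(mu2, mu1)          # mu1 does not subsume mu2
               for (a1, mu1) in max_concl(arg1["concl"])
               for (a2, mu2) in max_concl(arg2["concl"]))

def undercuts(arg1, arg2):
    return any(a1 == a2 and leq(mu2, mu1)
               for (a1, mu1) in arg1["concl"] for (a2, mu2) in arg2["assm"])

def defeats(arg1, arg2):
    return undercuts(arg1, arg2) or (rebuts(arg1, arg2) and not undercuts(arg2, arg1))

# Toy m-arguments echoing Example 4 (annotations only, bodies omitted):
arg12 = {"concl": {("agree_death", (0.8, 0.1))}, "assm": {("desire_family_death", (0.0, 0.6))}}
arg33 = {"concl": {("agree_death", (0.9, 1.0)), ("desire_family_death", (0.9, 1.0))}, "assm": set()}
print(defeats(arg33, arg12), defeats(arg12, arg33))   # True False: arg33 undercuts arg12 back
```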
Fig. 1. Relation among arguments of Example 4, where 'u' stands for undercut and 'r' for rebut. The m-arguments framed in a thick line are the justified arguments.
Definition 15. (MAA Semantics). Let KB be a MAS.
1. MAA(KB) |= A:µ iff there exists a justified argument Arg ∈ J_KB such that for some ρ ≥ µ, A:ρ ∈ concl(Arg);
2. MAA(KB) |= not(A:µ) iff for every argument Arg ∈ Args_KB, if there exists ρ ≥ µ such that A:ρ ∈ concl(Arg), then Arg is overruled.

Example 4. We give an example of MAA about the pros and cons of the death penalty of murderers. Suppose a complete lattice T = [0, 1]², where (µ_1, ρ_1) ≤ (µ_2, ρ_2) iff µ_1 ≤ µ_2 and ρ_1 ≤ ρ_2, and in (µ, ρ) ∈ T, µ and ρ represent the degrees of affirmation and negation respectively. Let KB = {KB_1, KB_2, KB_3} be a MAS, where each KB_i is a knowledge base such that:

  KB_1 = { agree(death):(0.8 × X, 0.5 × Y) ← hate(family, murderer):(X, Y) & not(desire(family, death):(0.0, 0.6)),
           hate(family, murderer):(1.0, 0.2) ← }
  KB_2 = { agree(death):(0.5 × X, 0.5 × Y) ← atone(death, guilt):(X, Y),
           atone(death, guilt):(0.2, 0.8) ← not(remorse(dead):(1.0, 0.0)) }
  KB_3 = { agree(death):(X, Y) ← desire(family, death):(X, Y),
           desire(family, death):(0.0, 1.0) ← not(assuage(death, family):(0.7, 0.0)),
           desire(family, death):(X, 0.3) ← hate(family, murderer):(X, 0.0),
           hate(family, murderer):(0.9, 0.2) ← }

where X and Y are annotation variables. Figure 1 shows every possible m-argument and every possible attack relation among them. The justified arguments are J¹_KB = {Arg_21, Arg_31, Arg_32} and J²_KB = J_KB = {Arg_21, Arg_22, Arg_31, Arg_32}.
An interpretation in the restricted semantics [4] is a mapping from the Herbrand base to T. The least interpretation ∆_r is the interpretation assigning every atom the least element (⊥), and hence ∆_r |= A:⊥ for every atom A. This means that an arbitrary interpretation satisfies any annotated atom with annotation ⊥. Therefore, to construct a BA that coincides with WFS under the restricted semantics, we would have to assume that there exists a rule A:⊥ ← for every atom A in any EGAP. With this assumption, let us consider Example 4 again. It is only in KB_2 that there exists explicit knowledge about atone(death, guilt). Arg_21 would then be rebutted by an argument made from atone(death, guilt):(0, 0) implicitly included in KB_1 and KB_3. Likewise, Arg_31 and Arg_32 are rebutted in the same manner. The only justified arguments would be ones of the form [A:⊥ ←] for any atom A. We think that rebuttal based on such unawareness is too skeptical. The general semantics allows us to distinguish explicit knowledge from implicit knowledge by the existence of the empty ideal. Thus, we would say the general semantics is more relevant than the restricted one.
6 Concluding Remark
The main contribution of this paper is that we have given the formalization of two argumentation frameworks such that the argumentation semantics for BA coincides with the well-founded semantics for EGAP, under the general semantics of GAP. We have found that in MAA, derived from BA, the general semantics is more adequate than the restricted semantics employed in [8, 2]. We conclude our paper with the remark that our argumentation frameworks with the expressive EGAP are more suitable for arguments with incomplete, inconsistent, or vague knowledge (for example, even with empirical and epistemic knowledge states such as "it is and is not", or "it neither is nor is not"), and are flexible enough for further extensions such as more versatile attack relations, cooperation, and learning through argumentation.
References
[1] C. I. Chesnevar, A. G. Maguitman and R. P. Loui. Logical models of argument. ACM Computing Surveys, 32(4): 337-383, 2000.
[2] C. V. Damasio, L. M. Pereira and T. Swift. Coherent Well-founded Annotated Logic Programs. Proc. of the 5th Int. Conference on Logic Programming and Nonmonotonic Reasoning, pp. 262-276, 1999.
[3] P. M. Dung. An argumentation semantics for logic programming with explicit negation. Proc. of the 10th Int. Conference on Logic Programming, pp. 616-630, 1993.
[4] M. Kifer and V. S. Subrahmanian. Theory of generalized annotated logic programming and its applications. J. of Logic Programming, 12: 335-397, 1992.
[5] I. Mora, J. J. Alferes and M. Schroeder. Argumentation and Cooperation for Distributed Extended Logic Programs. Working notes of the Workshop on Nonmonotonic Reasoning, 1998.
[6] H. Prakken and G. Sartor. Argument-based Extended Logic Programming with Defeasible Priorities. J. of Applied Non-Classical Logics, 7(1): 25-75, 1997.
[7] R. Schweimeier and M. Schroeder. Well-founded argumentation semantics for extended logic programming. Proc. of Int. Workshop on Non-monotonic Reasoning, 2002.
[8] V. S. Subrahmanian. Amalgamating Knowledge Bases. ACM Transactions on Database Systems, 19(2): 291-331, 1994.
[9] T. Takahashi, Y. Umeda and H. Sawamura. Formal Argumentation Frameworks for the Extended Generalized Annotated Logic Programs. Technical Report, http://www.cs.ie.niigata-u.ac.jp/~takehisa/, 2003.
Specifying and Validating Reactive Systems with CommonKADS Methodology Maamar El-Amine Hamri, Claudia Frydman, and Lucile Torres Laboratoire des Sciences de l'Information et des Systèmes (LSIS) Campus Universitaire de St-Jérôme Avenue Escadrille Normandie-Niemen, 13397 Marseille cedex 20 France {amine.hamri,claudia.frydman,lucile.torres}@lsis.org http://www.lsis.org
Abstract. We present an extension of the CommonKADS methodology that is relevant to the specification and validation of reactive system behavior. CommonKADS is considered one of the best-known methodologies for specifying and designing knowledge systems. It provides a language for describing system behavior which is suitable for transformational systems. We introduce into this language the elements useful for event-driven behavior specification. The subsequent translation of these elements into the statecharts formalism allows us to validate the reactive behavior of the system by simulation.
1 Introduction
A reactive system is designed as a set of tasks that cooperate in realizing a complex behavior. It is an open system that reacts to external events by answering with actions. Like any system, the specification of a reactive system requires three points of view: structural, functional and behavioral [1]. The behavior of reactive systems is really different from that of transformational systems and requires a different approach to be specified [2]. Using the CommonKADS methodology [3] for system specification gives some well-known advantages, such as distinguishing specification from design or associating generic models with tasks. However, the methodology does not provide any concept for specifying the behavior of a reactive system, since constraints related to time and events cannot be introduced in the task layer of the CommonKADS expertise model [4]. We propose to extend the CommonKADS methodology by adding the concepts the expert needs to specify the behavior of a reactive system and to validate this behavior by simulation. These concepts are based on the statecharts formalism, which is suitable for the behavioral specification of reactive systems.
2 CommonKADS Task Behavior
In the CommonKADS knowledge model, task knowledge describes a hierarchical organization of tasks. So any task is structurally defined by the set of its component subtasks, and the task behavior is described at each abstraction level by specifying the running order of the components of the task. The hierarchical decomposition principle reduces system complexity, makes it possible to study the system at several abstraction levels, and makes specification changes easier. Moreover, a hierarchical specification of behavior leads to tasks that are free from their utilization domain and reusable. When describing the behavior of a task consists in ordering its subtasks in time, what is being defined is the internal control of the task, and the task must be autonomous. Such a task is viewed as a function transforming inputs into outputs, and relations between tasks are input/output relations. In the case of a reactive system, the control is external, driven by external events whose occurrences induce state changes of the system. The behavior of a reactive system is really the set of allowed sequences of events, conditions and actions with some timing constraints [2]. CommonKADS does not provide the means for specifying such an external control. We propose to complete the CommonKADS task description so that tasks can represent activities or states of the system (instead of functions) and their behavior is given in terms of events whose occurrences change the system state or trigger activities.
3 Reactive Behavior Specification
The reactive behavior is described by means of triplets (evt, task1, task2), where evt is an event name and the tasks task1 and task2 have the same abstraction level (they are components of the same parent task). This is interpreted as follows: if task1 is running and evt happens, then task1 is interrupted and task2 is triggered. We introduce the initial task concept so that a task triggers at least one of its components (the initial one). The initial task may be explicitly shown by a specification of the kind (init, task), where init is the name of the predefined event that makes task the initial one. Otherwise, the first subtask given in the component list is the initial task by default. Finally, the delay concept of a task expresses the maximum life of the task. It makes it possible for the task to be deactivated if no event occurs during the delay considered. The delay of a task is expressed as a positive floating-point number. Relations are not limited to tasks from one abstraction level: a relation (evt, task1, task2) can also be specified for tasks that are not at the same abstraction level, or that do not have the same parent task. We propose to describe reactive behavior in the control structure of the task in the CommonKADS knowledge model, according to the following elementary grammatical rules:
  Reactive_Behavior := Orthogonal_Behavior | Orthogonal_Behavior Reactive_Behavior
                     | Delay Orthogonal_Behavior | Delay Orthogonal_Behavior Reactive_Behavior
  Delay := Positive_Float_Number
  Orthogonal_Behavior := '(' 'init' ',' Id_Task ')' Triplet_Set
  Triplet_Set := '(' Id_Event ',' Id_Task ',' Id_Task ')'
               | '(' Id_Event ',' Id_Task ',' Id_Task ')' Triplet_Set
  Id_Event := Identifier
  Id_Task := Identifier

Let us consider the example of the Citizens Quartz Multi-Alarm III watch [2]. This watch has a main display area and four smaller ones, a two-tone beeper, and four control buttons denoted here a, b, c and d. It can display the time or the date (day of month, month, day of week); it has a chime (beep on the hour if enabled), two independent alarms and a stopwatch (with lap and regular display modes). Time display is the default task. The time and the date display tasks are linked by the d button; time shows again after two minutes in date display. Pressing the a button displays sequentially alarm1, alarm2, chime, stopwatch and returns to time. Here, we specify just a part of the watch without refinement of the tasks. The components and the behavior of the Display task are defined as follows:

  Task Display
  Decomposition: Time, Date, Alarm1, Alarm2, Chime, Stopwatch
  Control structure: (init, Time), (d, Time, Date), (d, Date, Time), (2, Date, Time),
                     (a, Time, Alarm1), (a, Alarm1, Alarm2), (a, Alarm2, Chime),
                     (a, Chime, Stopwatch), (a, Stopwatch, Time)
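As an illustration only (not part of the methodology's tooling), the Display control structure can be encoded directly as triplets and queried for the task triggered by an event; encoding (init, Time) with a None source is our own convention.

```python
# Illustrative encoding (our own convention) of the Display task's control structure
# as (event, from_task, to_task) triplets; numeric events stand for delays.
DISPLAY_CONTROL = [("init", None, "Time"),
                   ("d", "Time", "Date"), ("d", "Date", "Time"), (2, "Date", "Time"),
                   ("a", "Time", "Alarm1"), ("a", "Alarm1", "Alarm2"),
                   ("a", "Alarm2", "Chime"), ("a", "Chime", "Stopwatch"),
                   ("a", "Stopwatch", "Time")]

def next_task(current, event, control=DISPLAY_CONTROL):
    """Task triggered when `event` occurs while `current` is running (unchanged if none matches)."""
    for evt, src, dst in control:
        if evt == event and src == current:
            return dst
    return current

print(next_task("Time", "a"))   # -> Alarm1
print(next_task("Date", 2))     # -> Time (the two-minute delay brings time back)
```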
4 Operationalizing into Statecharts
The statechart formalism defined by David Harel [2] constitutes a standard for the specification of reactive systems. This formalism supports hierarchical descriptions and has an operative semantics that makes it possible to validate the specification by simulation [5]. A CommonKADS task can be interpreted as an entity provided with inputs/outputs, active or not according to event occurrences. With this point of view, we can translate the reactive behavior of tasks described in terms of our elementary language into statecharts, by using the equivalences below:

  Task layer of a CommonKADS knowledge model  |  Statechart
  a task                                      |  a state with the associated task as activity
  an initial task                             |  a default state
  an event in a triplet                       |  a transition labeled with the event, between two states
  a subtask                                   |  a substate
  the delay of a task                         |  a generic event: timeout
Fig. 1. Statecharts for the reactive behavior of the Display task
The translation rules we defined allow the reactive behavior specification to be automatically translated into the statechart formalism. Figure 1 gives the statecharts obtained for the Display task once translated.
5 Simulation
Statecharts are discrete event models. This kind of model is the most used for reactive system specification. In discrete event models, time is continuous and an event is an instantaneous value change of one or more descriptive variables. Events happen at dates called event occurring dates, represented by positive floating-point numbers. A way to implement a discrete event model simulation consists in using a set of discrete descriptive variables, a scheduler (the chronological list of events or messages) and a global clock giving the current simulation date (that is, the occurring date of the event being treated). Once these elements are given, the simulation algorithm is:

  while scheduler not empty do
  begin
    extract the event of the nearest future: (e, s, d) (that is, the event at the head of the scheduler);
    current_date := d (the current date becomes the occurring date of the event);
    make the state change induced by the event;
    create the future timeout or init events due to this state change and store them in the scheduler in chronological order
  end

The timeout events must only be treated when the time spent in the state is equal to the life of this state (that is, the delay of the task corresponding to the state). If an external event occurs before this delay, the timeout event must be suppressed from the scheduler. If several events can occur simultaneously, treating these events at one go makes the system pass from the current state to a new state. This new state cannot be
always determined in a unique way. This happens with events leading the system from a common state to different states (in the case of the conflicting transitions mentioned by [5]), when these events occur simultaneously. In the statechart of the Display task, the transitions labeled with events a and d are conflicting transitions while the Time task is active. Consequently, we choose to ignore simultaneous events and we suppose that at most one event occurs at any given date. Finally, we consider that statecharts modeling nondeterministic behavior constitute underspecified models that must be corrected before simulation. We now give the results of the simulation of the Display task for the following scheduler:
Actions
0 (initial)
1
2
3
4
5 6 7
extracting the first event : (init,time,50) current_date = 50 changing the state : activating the time task extracting the first event : (d,nil,60) current_date = 60 changing the state : activating the date task adding (time_out,date,62) in the scheduler extracting the first event : (time_out,date,62) current_date = 62 changing the state : activating the time task extracting the first event : (d,nil,65) current_date = 65 changing the state : activating the date task adding (time_out,date,67) in the scheduler extracting the first event : (time_out,date,67) current_date = 67 changing the state : activating the time task extracting the first event : (a,nil,70) current_date = 70 changing the state : activating the alarm1 task extracting the first event : (a,nil,72) current_date = 72 changing the state : activating the alarm2 task
Scheduler init,time,50 d,nil,60 d,nil,65 a,nil,70 a,nil,72 d,nil,60 d,nil,65 a,nil,70 a,nil,72 time_out,date,62 d,nil,65 a,nil,70 a,nil,72 d,nil,65 a,nil,70 a,nil,72 time_out,date,67 a,nil,70 a,nil,72 a,nil,70 a,nil,72 a,nil,72
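The same run can be reproduced with a few lines of code; the sketch below is a loose re-implementation of the algorithm of Section 5 for the Display statechart (the event encoding and the cancellation of stale timeouts are our own simplifications, not the tool described in the paper).

```python
# Loose sketch of the discrete-event loop above, applied to the Display statechart.
import heapq

DELAY = {"Date": 2}                       # task delays (timeout after 2 time units)
TRANSITIONS = {("Time", "d"): "Date", ("Date", "d"): "Time", ("Date", "time_out"): "Time",
               ("Time", "a"): "Alarm1", ("Alarm1", "a"): "Alarm2", ("Alarm2", "a"): "Chime",
               ("Chime", "a"): "Stopwatch", ("Stopwatch", "a"): "Time"}

def simulate(events, initial="Time"):
    scheduler = list(events)              # chronological list of (date, event_name)
    heapq.heapify(scheduler)
    state, pending_timeout = None, None
    while scheduler:
        date, name = heapq.heappop(scheduler)            # event of the nearest future
        if name == "time_out" and date != pending_timeout:
            continue                                     # stale timeout: an earlier event left the state
        clock = date                                     # current_date = occurring date of the event
        state = initial if name == "init" else TRANSITIONS.get((state, name), state)
        pending_timeout = None
        if state in DELAY:                               # entering a delayed state schedules its timeout
            pending_timeout = clock + DELAY[state]
            heapq.heappush(scheduler, (pending_timeout, "time_out"))
        print(f"t={clock:g}: activating {state}")

simulate([(50, "init"), (60, "d"), (65, "d"), (70, "a"), (72, "a")])
# t=50 Time, t=60 Date, t=62 Time, t=65 Date, t=67 Time, t=70 Alarm1, t=72 Alarm2
```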
6 Conclusion
We provide experts with a simple language for specifying the behavior of a reactive system in terms of events and of tasks to be activated or interrupted. We have introduced translation rules to transform such specifications automatically into statecharts. This work has led us to develop software allowing experts to operationalize, in the form of statecharts, the behavioral part of a specification established with the CommonKADS methodology, and to simulate this behavior. We now propose to enrich the concepts and the associated tools by defining conditions on events and communication between tasks.
References
[1] Frydman, C. and Torres, L. (2000). Vérification et Validation de Modèles CommonKADS, Revue d'Intelligence Artificielle.
[2] Harel, D. (1987). Statecharts: A Visual Formalism for Complex Systems. Science of Computer Programming, 8, pp. 231-274.
[3] Schreiber, G., Akkermans, H., Anjewierden, A., De Hoog, R., Shadbolt, N., Van De Velde, W. and Wielinga, B. (1999). Knowledge Engineering and Management: The CommonKADS Methodology, MIT Press, London, England.
[4] Torres, L., Frydman, C. and Garrido de Ceita, A. (2001). Adding Event-driven Control in CommonKADS Knowledge Model, KES-2001, 5th International Conference on Knowledge-Based Intelligent Information Engineering Systems & Allied Technologies, Osaka and Nara, Japan, September 2001.
[5] Harel, D. and Naamad, A. (1996). The Statemate Semantics of Statecharts. The Weizmann Institute of Science, Rehovot, Israel, October 1995.
TMMT: Tool Supporting Knowledge Modelling Guy Camilleri, Jean-Luc Soubie, and Joseph Zalaket IRIT CCI-CSC, Université Paul Sabatier, 118 route de Narbonne, 31062 Toulouse Cedex 4, France {camiller,soubie,zalaket}@irit.fr
Abstract. In this paper we propose a new tool for model design and maintenance. This tool was built to design the models of cooperative knowledge-based systems. We first present the specifications of such a tool: it must produce easily executable models and meta-models, and it has to be easy to use, with the possibility of handling drafts of models. We then present the characteristics of TMMT (Task Method Modelling Tool) that satisfy these specifications. We finally give some elements of methodology for building conceptual models with TMMT.
Introduction

One might think today that the modelling of knowledge has reached a stage of development that no longer calls for research on tools supporting modelling. In addition, the design of systems embedding knowledge has increasingly developed towards the use of the task/method paradigm as primitives for representing reasoning. However, one has to note the under-utilization of the functionalities of these tools and the difficulty of inserting them into the design activity when that activity is carried out within traditional software development environments, especially those based on UML. The main characteristic of cooperative systems containing knowledge (see Soubie [1]) is the multiplicity of the models which constitute them. Thus, the model of the application can be compared to traditional conceptual models, but the model of cooperation, although of comparable nature, has as the model of its domain the tasks of the preceding model. In the same manner, the model of communication operates on itself for the implementation of dialogue strategies. It was thus a question of designing a tool assisting the construction and the operationalization of models that are also able to operate on themselves (meta-models), implementing recursivity and reflexivity. One specification of the tool was ease of handling of the primitives and readability of the produced models. Thus, TMMT is provided with a graphical interface from which it is possible to build and to modify the model in progress. Two formats of internal representation of the models allow better portability.
1 The Task/Method Paradigm, Methods and Tools for Knowledge Modelling
The reasoning seen as a task is the result of research on knowledge acquisition and modelling in the Eighties. At that period, the concept of expert knowledge, based on sets of heuristic knowledge of an expert, yielded its place to the modelling of the knowledge of a domain. The effort was put on methodologies for the acquisition (the term elicitation was very often used) of knowledge coming from one or more experts, but also from other sources, such as documents. The principle guiding these approaches was the construction of a complete set of structured pieces of knowledge allowing the resolution of problems in a particular field of competence. The task constituted the container of a set of pieces of knowledge solving one problem or a set of problems. In this sense, it constituted a granule of knowledge at the level where it described the resolution. Knowledge was then described as far as possible independently from the formal representation which would be used thereafter to implement it. Methodologies of knowledge acquisition were based on a top-down analysis of the problem solving, resulting in a hierarchical decomposition of the whole task into sub-tasks which could be alternatives, describing various methods for the realization of the same task. These principles are the basis of the European project KADS, followed by the CommonKADS methodology, which constitutes the reference in this field. Until 2000, the date of publication of the handbook "Knowledge Engineering and Management - The CommonKADS Methodology" [2], few methodologies were based on completely executable models, since the support of modelling induced constraints from which one wanted to free oneself. One will however see with MACAO II [3] an attempt to facilitate the operationalization of conceptual models. In the same manner, DSTM [4], with its language ZOLA, allowed the use of executable task and method primitives. However, it seems that the movement towards modelling supports aiming to build easily operationalizable models is not as powerful, whereas work on the constitution of ontologies [5] on the one hand and on the re-use of components of problem-solving models on the other has developed.

First of all, ease of use is provided by a graphical and declarative representation of the knowledge implemented. Indeed, this characteristic allows at the same time a more significant involvement of the carriers of knowledge in the system design, but also facilitates the maintenance of the produced models through a fast re-appropriation of the model a long time after its first construction. Then, a capacity to carry out syntactic and/or structural checks: for example, checking the illegal re-use of names, the consistency of the preconditions of a task, or detecting the lack of a sub-task or loops within a hierarchical decomposition of tasks. This capacity is a corollary of the possibility of automating thereafter the operationalization of the produced models. Lastly, it is necessary that the produced models can be handled simply by the processing tools, not only in their operational form, but also in their declarative form as a representation of the task.
Fig. 1. Truck world domain
We present in the second part TMMT (Task-Method Modelling Tool), a knowledge modelling tool answering the preceding specifications, as well as a methodological approach to modelling with this kind of system.
2 Modelling Primitives of TMMT
In this section we define some modelling primitives to which the execution engine used in knowledge-based systems, the plan recognition processes used in plan-based approaches to discourse (see Camilleri [6]), and action-based planners are applicable. Execution and planning are two distinct activities requiring different models. A major difference between execution and planning is that planning builds hypothetical states of the world, so it is possible to go back (or cancel a task application), while execution directly modifies the state of the world, so the achievement of a task cannot be ignored even if it is not desired. Moreover, in planning it is necessary to isolate the salient objects of the world and to clone them; these two operations are not required for the execution process. The main objective of our modelling primitives is to allow the representation of models that can be handled by the execution, plan recognition, and planning processes.
2.1 Domain Model
Definition 1 A domain model describes the objects of the world handled (directly or indirectly) by the reasoning model. A generic definition of high-level modelling primitives for the domain model seems difficult, because it is too dependent on the application (cf. [7]). We therefore chose to use the object modelling concepts (inheritance, polymorphism, etc.) adopted from object-oriented languages. For example, the object model (described in UML) in Fig. 1 constitutes a fragment of a possible domain model under TMMT for the "Truck" world.
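Fig. 1 itself is not reproduced here. Purely as an illustration of the kind of domain objects the later constraint expressions rely on (this sketch is not the authors' model), the "Truck" world could be encoded in Java as follows; the method names getLocation, setLocation and canLoad appear in the paper's examples, while the fields and the capacity test are assumptions:

// Hypothetical Java rendering of a fragment of the "Truck" world domain model.
// Only getLocation/setLocation/canLoad are attested by the paper's constraint
// examples; the fields and the capacity test are illustrative assumptions.
class City {
    private final String name;
    City(String name) { this.name = name; }
    @Override public String toString() { return name; }
}

class Package {
    private City location;
    Package(City location) { this.location = location; }
    City getLocation() { return location; }
    void setLocation(City c) { location = c; }
}

class Truck {
    private City location;
    private int freeSlots;  // assumed capacity attribute
    Truck(City location, int freeSlots) { this.location = location; this.freeSlots = freeSlots; }
    City getLocation() { return location; }
    void setLocation(City c) { location = c; }
    // Suggested by the precondition 0:canLoad(1:) of the "transport" method below.
    boolean canLoad(Package p) { return freeSlots > 0 && location.equals(p.getLocation()); }
}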
2.2 Reasoning Model
Definition 2 The reasoning model describes how a task can be performed.
All the world objects and their relevant relations handled by the reasoning model must be described in the domain model. Therefore, a domain model can only be defined from the reasoning model. In the same way, the tasks described in the reasoning model handle the domain objects, and thus they depend on the domain model. The knowledge modelled in a conceptual model often results from a parallel construction of the domain model and the reasoning model. The Task/Method paradigm has been chosen to represent the reasoning model. The following modelling primitives are defined to support planning and execution processes.

Definition 3 A task is a transition between two world state families. A task is defined as follows:

Name       Task name
Par        Typed list of parameters handled by the task
Objective  Goal state of the task
Methods    List of methods achieving the task
The field Name specifies the name of the task. The parameter list Par represents the set of world objects (described in the domain model) handled by the task. The Objective describes the goal of the task in state form. This Objective field can appear redundant; however, a goal can be expressed in natural language in two different ways, by a verb (corresponding to the name of the task) or by a state (the Objective of the task). All the methods defined during modelling are recorded in the Methods list of the task. As an example of a task model in our paradigm, the "transport" task is defined as follows:

( Name: transport; Par: 0:Truck; 1:Package; 2:City; Objective: 1:getLocation().equals(2:); Methods: m )
Definition 4 A method describes one way (at only one level of abstraction) of performing a task. A method is characterized by the following fields:

Heading    Task achieved by the method
App-cond   Applicability conditions
Prec       Preconditions which must be satisfied to be able to apply the method
Effects    Effects generated by the successful application of the method
Control    Achievement order of the sub-tasks
Sub-task   Sub-task set
The action performed by the method is indicated in the Heading. The applicability conditions (on the parameters of the action) are used to constrain the instantiation of the method. For example, in the task transport (presented previously), the location of the package 1:getLocation() and the destination city 2: must be different; this constraint is modelled in the applicability conditions as 1:getLocation().equals(2:).equals(false). The preconditions are the conditions which must
be satisfied to apply the method. The difference between preconditions and applicability conditions is that an agent may act to satisfy the preconditions in order to apply the method, whereas if an applicability condition is not satisfied it will not try to satisfy it (cf. Carberry [8]). The effects are caused by the application of the method. The task objective necessarily belongs to the effects; therefore the effects of all of a task's methods contain the task objective. All the methods performing a task must thus generate a state of the world containing the objective; nevertheless their other effects can differ. The execution order of the sub-tasks is described in the Control field, and the sub-tasks are recorded in the Sub-task field. For example, the following method carries out the task "transport":

( Heading: transport(Truck, Package, City); App-cond: 1:getLocation().equals(2:).equals(false); Prec: 0:getLocation().equals(1:getLocation()); 0:canLoad(1:); Effects: 1:setLocation(2:); 0:setLocation(2:); Control: Load(0:,1:); Move(0:,2:); Sub-task: {Load; Move} )
Definition 5 A terminal task is a directly executable task. Its execution does not require decomposition (or planning). Terminal tasks are tasks having only one method, whose Sub-task field is empty and whose Control can point to executable code.
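To fix ideas, the task and method primitives above could be mirrored by object structures along the following lines. This is only a sketch added for illustration: the field names follow the two tables above, while the Java types chosen for them (and the Condition and ControlSpec interfaces) are assumptions.

// Hypothetical object encoding of the TMMT task/method primitives.
// Field names mirror the tables above; the types are illustrative only.
import java.util.List;

interface Condition {    // e.g. 1:getLocation().equals(2:)
    boolean holds(List<Object> parameters);
}

interface ControlSpec {  // sequence, If...Then, While over the sub-tasks (see Sect. 2.4)
}

class TaskModel {
    String name;                 // Name
    List<Class<?>> par;          // Par: typed list of parameters
    Condition objective;         // Objective: goal state of the task
    List<MethodModel> methods;   // Methods: methods achieving the task

    // A terminal task has a single method whose sub-task set is empty (Definition 5).
    boolean isTerminal() {
        return methods.size() == 1 && methods.get(0).subTasks.isEmpty();
    }
}

class MethodModel {
    TaskModel heading;           // Heading: task achieved by the method
    List<Condition> appCond;     // App-cond: applicability conditions
    List<Condition> prec;        // Prec: preconditions
    List<Runnable> effects;      // Effects of a successful application
    ControlSpec control;         // Control: achievement order of the sub-tasks
    List<TaskModel> subTasks;    // Sub-task set
}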
2.3 The Language of Constraint
To be used with TMMT, a domain model must be an object model. The constraints (preconditions, applicability conditions, effects, objective, etc.) of the tasks and the methods correspond to object methods of the modelled domain. The domain objects handled by a task are specified in its parameter list. Therefore, a task cannot reach an object that is not part of (or related to an object of) its parameter list. However, it is possible to access a great number of world objects through the object methods of the parameter list. The choice of these accessor methods is often guided by the construction of the task model. In the previous example, 0:getLocation().equals(1:getLocation()), the number 0 corresponds to parameter 0 of the parameter list (Truck, Package, City); 0 thus denotes an object of type (class) Truck.
2.4 The Language of Control
The Control field of a Method allows the description of the execution order of the method's sub-tasks. The currently possible controls are: the sequence ";", the conditionals "If ... Then" and "If ... Then ... Else", and the loop "While".
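As an aside added here for illustration only (the paper gives no implementation of the control language), these constructs could be captured by a small abstract syntax such as:

// Hypothetical abstract syntax for the TMMT control constructs listed above.
import java.util.List;
import java.util.function.BooleanSupplier;

interface ControlExpr { }

// "t1; t2; ...; tn" -- execute the named sub-tasks in the given order
record Sequence(List<String> subTaskNames) implements ControlExpr { }

// "If c Then e1 [Else e2]" -- elseBranch may be null when there is no Else branch
record IfThenElse(BooleanSupplier condition, ControlExpr thenBranch, ControlExpr elseBranch)
        implements ControlExpr { }

// "While c do e"
record WhileLoop(BooleanSupplier condition, ControlExpr body) implements ControlExpr { }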
2.5 TEE (Task Execution Engine)
TEE is an execution engine that launches the execution of a task. The main algorithm of this engine can be roughly described in the following way:
start(Task t) {
  if (t instanceof TerminalTask)
    startTerminal(t);
  else {
    // choose the method to apply
    found = false;
    for all methods m of t, while not found, do {
      if (m.getCondApp() is true) {
        found = true;
        addMethodNode(t);
        startMethod(m, found);
      }
    }
    if (!found) error("cannot execute");
  }
}
The procedure start(t) launches the execution of the task t. The task t must be instantiated to be carried out. If the task t is a terminal task, the engine launches the execution of the code attached to this task (startTerminal(t)). If the task t is not terminal, the engine chooses a method of the current task: it traverses all methods, and if a method is applicable, it applies it (startMethod(m, found)). The procedure startMethod(m, found) performs a control analysis and launches the procedure start(ti) on the sub-tasks of the method m. In the case where a sub-task has no applicable method, the procedure startMethod(m, found) assigns false to the found variable, which triggers the search for another method. This property is significant because a method can be applicable (its preconditions are true) at one abstraction level and no longer applicable at a lower level (which constitutes the principal interest of this kind of hierarchical modelling). As the tasks are represented in object form, a reasoning model can be regarded as a domain model by another reasoning model (a meta-model). Moreover, the Control of the methods (sequence, If ... Then, etc.) is also modelled in the form of (terminal) tasks. A (meta-)model of tasks can thus handle (build dynamically, modify, etc.) another model of tasks. Under TMMT, an execution meta-model was built; the execution of this model launches the associated task execution. Therefore, the execution engine TEE can itself modify the execution process through a meta-model. Moreover, this meta-model is a recursive model.
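The backtracking role of the found flag can be made concrete with the following self-contained Java sketch. It is a strong simplification added for illustration and is not the authors' implementation: the control analysis is reduced to a plain sequence, and a boolean return value plays the role of found.

// Minimal, hypothetical sketch of a TEE-like execution engine.
// A task is terminal when it carries executable code; otherwise the engine
// tries its methods in turn and reports failure upwards when a sub-task
// has no applicable method, so that another method can be tried.
import java.util.List;
import java.util.function.BooleanSupplier;

final class TeeSketch {

    record Method(BooleanSupplier appCond, BooleanSupplier prec, List<Task> subTasks) { }

    record Task(String name, Runnable code, List<Method> methods) {
        boolean isTerminal() { return code != null; }
    }

    /** Launches the execution of task t; returns false if no method could be applied. */
    static boolean start(Task t) {
        if (t.isTerminal()) {                 // startTerminal(t)
            t.code().run();
            return true;
        }
        for (Method m : t.methods()) {        // choose a method to apply
            if (m.appCond().getAsBoolean() && m.prec().getAsBoolean()) {
                if (startMethod(m)) return true;   // found
            }
        }
        System.err.println("cannot execute " + t.name());
        return false;                         // lets the caller try another method
    }

    /** Control analysis reduced to a plain sequence over the sub-tasks. */
    static boolean startMethod(Method m) {
        for (Task sub : m.subTasks()) {
            if (!start(sub)) return false;    // a sub-task had no applicable method
        }
        return true;
    }

    public static void main(String[] args) {
        Task load = new Task("Load", () -> System.out.println("loading"), List.of());
        Task move = new Task("Move", () -> System.out.println("moving"), List.of());
        Task transport = new Task("transport", null,
                List.of(new Method(() -> true, () -> true, List.of(load, move))));
        start(transport);
    }
}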
3 Methodological Aspects
As practice has shown in the majority of knowledge modelling cases, this activity cannot be considered a sequential process. Indeed, the object structure of the model constrains the representation of the characteristics of the tasks (preconditions, applicability conditions, inputs and outputs), and reciprocally, so does any reorganization of the control contained in the methods of the tasks, or of the decomposition itself. This is why it first seems preferable to build a rather rough domain model, from which a first modelling of the tasks can be carried out. The description of the method (or methods), of the preconditions and of the applicability conditions reveals new objects or properties which must be integrated into the domain model. The domain model and the task model are then built iteratively, until completion.
Since TMMT is intended for the construction of models at various levels, a modification at one level induces modifications at the others. For example, calling the cooperation into question can involve a modification of the granularity of the tasks, forcing new characteristics to be introduced into the objects of the domain. The modelling methodology for cooperative knowledge-based systems is described in the life cycle of these systems suggested in [1]; it is this methodology which is at the base of the specifications of TMMT. Models can be experimentally validated on the most important (nominal) cases with hierarchical AI planners. However, it also appears possible to validate some pieces of the models formally.
Conclusion

There is a need for methodologies and tools to build flexible and maintainable knowledge-based systems able to contain several types of models. TMMT meets this need by proposing an easy-to-use framework for designing, in a flexible way, models which are subject to rigorous syntactic checking and which are easily operationalizable. The form of the built models facilitates their use by plan recognition and plan generation processes, which are useful in situations not explicitly foreseen and in which the modelled knowledge can bring help. TMMT provides an environment for developing a reasoning model based on a domain model described in the object paradigm. It appeared difficult to us to define modelling primitives for the domain: the nature, objectives and constraints (performance, robustness, etc.) of the system are very dependent on the application. As cooperative knowledge-based systems are information processing systems, we chose to use the basic primitives of object programming, allowing adapted domain modelling tools to be developed according to the application. With TMMT, it is possible to describe meta-models (models handling a reasoning model or a domain model). The domain and reasoning models are both object models; consequently, it is possible to consider cross-references between these two models. This type of link must be managed by all the generic processes handling the knowledge models. Moreover, TMMT makes it possible to describe recursive models. Recursive models are often essential in the description of meta-models (as in the model of the execution engine presented above). TMMT is currently used in an aeronautical project to build models for a cooperative knowledge-based system aiming to help maintenance operators during their activity.
References
[1] Soubie, J.L.: Coopération et systèmes à base de connaissances. Master's thesis, habilitation à diriger des recherches, Université de Toulouse III, Toulouse (1996)
[2] Schreiber, G., Akkermans, H., Anjewierden, A., de Hoog, R., Shadbolt, N., de Velde, W.V., Wielinga, B.: Knowledge Engineering and Management - The CommonKADS Methodology. The MIT Press, Cambridge, Massachusetts, London, England (2000)
[3] Matta, N., Aussenac-Gilles, N.: Expliciter Une Méthode de Résolution de Problèmes Avec MACAO : Problèmes Méthodologiques. Aussenac-Gilles, N., Laublet, P. et Reynaud, C., Eds. Cepadues Editions, Toulouse, France (1996)
[4] Trichet, F., Tchounikine, P.: DSTM: A framework to operationalize and refine a problem-solving method modeled in terms of tasks and methods. International Journal of Expert Systems With Applications 16 (1999) 105-120
[5] Sowa, J.F.: Ontology, metadata, and semiotics. In: B. Ganter & G. W. Mineau, eds., Conceptual Structures: Logical, Linguistic, and Computational Issues, Lecture Notes in AI 1867, Springer-Verlag (2000) 55-81
[6] Camilleri, G.: A generic formal plan recognition theory. In: IEEE International Conference on Information, Intelligence and Systems ICIIS'99 (1999) 540-547
[7] Isténes, Z.: Zola: a language to operationalise conceptual models of reasoning. Journal of Computing and Information 2 (1996) 689-706
[8] Carberry, S.: Modeling the user's plans and goals. Computational Linguistics 14 (1988) 23-37
State-Based Planning with Numerical Knowledge Joseph Zalaket and Guy Camilleri IRIT CCI-CSC, Université Paul Sabatier 118 route de Narbonne, 31062 Toulouse Cedex 4, France {zalaket,camiller}@irit.fr
Abstract. The inability of the STRIPS [4] encoding to support numbers in planning has motivated the extension of the PDDL language to PDDL2.1 [5], which allows numerical state variables. Many planning systems, such as Metric-FF [8] and SAPA [3], treat numbers in planning, but only resources and time are represented as numbers in their planning problems. In contrast, some real-world problems require an extensive numerical representation. Therefore, we present in this paper a new planning framework allowing the application of arithmetic functions for numerical knowledge update. We propose a new action representation that incorporates numerical conditions and effects, which allows the application of symbolical planning techniques over numerical knowledge. As an application of our planning framework, we show the implementation of a numerical version of the well-known FF [7] planning system, able to solve domains containing symbolical and/or numerical knowledge.
1 Introduction
In recent years the development of efficient automated planning algorithms such as GRAPHPLAN [1], FF [7], HSP and HSP-r [2] has enhanced the planning process and made it applicable to a large number of symbolical domains. But problems closer to the real world often contain numerical knowledge. Many recent planners, such as [6] and [8], have been extended to support numerical handling, but these planners are often interested in treating time and resources added to the classical planning domains as an attached aspect. In this article we suggest not separating the numerical aspect from the symbolical one. This integration of numerical and symbolical aspects within a planning problem gives us the possibility of sharing the results already obtained by symbolical planning with numerical planning. We propose an extension of the STRIPS [4] language to represent numerical objects and numerical effects, in such a way that numerical knowledge can be seen and handled like symbolical knowledge. This integration within the STRIPS language will allow planning to solve new types of problems, for example a planning problem consisting of producing some final products according to what is available in the stock as raw material products. The domain objects of this problem can be viewed as a set of object categories (products) (product r1, product r2, product F1, ...). Suppose we want
to produce an item of final product F1, which requires the use of 300 items of raw material product r1 out of 5000 existing items in the stock. In classical STRIPS modeling we would have to create 5000 different constant symbols to represent the items of product r1, and using 300 of these items can be done only one by one in the best case. So the classical STRIPS language is not suited to modeling this kind of problem. Indeed, an extension of STRIPS to integrate numerical knowledge modeling seems to be crucial. In our paradigm the objects of this example can be represented by numbers (a number for each product), which yields a simpler problem resolution. Also in our paradigm, we have the advantage of representing symmetrical domain objects by a number (e.g. the ferry domain, Table 1) instead of multiple constant objects. An extension of the STRIPS language to support numbers is detailed in Section 2. To illustrate our approach, we present in Section 3 a numerical version of the FF [7] planning system, which we call NFF (Numerical Fast Forward). The difference between NFF and the numerical version of FF introduced by Hoffmann, Metric-FF [8], is that NFF treats numerical knowledge in the same way as symbolical knowledge, whereas Metric-FF treats numerical knowledge as a separate part of the symbolical planning problem. Some empirical results are shown in Section 4, before the conclusion of this paper.
2 Extending STRIPS to Support Numbers
In the following we use a subset of first-order logic to introduce an extension of the STRIPS language to deal with numerical knowledge.

Definition 1 The language L = (V, C, F, P) is defined over sets of variable symbols V, constant symbols C, function symbols F, and predicate symbols P = Pp ∪ Pt in the following way:
1. Variable symbols x ∈ V are terms.
2. Constant symbols c ∈ C are terms.
3. If f ∈ F is a function symbol with arity n and t1, ..., tn are terms, then f(t1, ..., tn) is a term.
4. All terms are obtained by applying the above rules (1, 2 and 3) a finite number of times.
5. If p ∈ Pp is a predicate symbol with arity m and t1, ..., tm are terms then p(t1, ..., tm) is an atom.
6. If p ∈ Pt is a predicate symbol with arity l and t1, ..., tl are terms then p(t1, ..., tl) is an atom. Pt = {<, <=, =, ≠, >=, >} ∪ D, where the first set is the set of comparator predicates and D is the set of the definition domains. These predicates are associated with a domain theory T.
7. If p1(...), p2(...), ..., pk(...) are atoms then the set {p1(...), p2(...), ..., pk(...)} represents the conjunction of atoms, which is the formula p1(...) ∧ p2(...) ∧ ... ∧ pk(...).
Remarks:
– All the language sets V, C, F and P can be infinite.
– All constants c ∈ C are typed; that means there exists a predicate p ∈ D such that T ⊢ p(c).

Definition 2 An action α (or an operator) represents a state transition and is defined in a STRIPS style by:
– Param(α), the list of parameters, specified by a list of variable symbols;
– Pre(α), Add(α) and Del(α), respectively the precondition list, the add list and the delete list. Each of these three lists is represented by a set of atoms.
Table 1. The ferry problem, which consists of transporting a number n of cars from a place r1 to a place r2, can be modeled by introducing the cars as a number instead of constant symbols, in the following way:

Load(R,X)
  Prec: Rive(R), Integer(X), >(X,0), emptyF, nbC(R,X), atF(R)
  Add:  fullF, nbC(R,-(X,1))
  Del:  emptyF, nbC(R,X)

UnLoad(R,X)
  Prec: Rive(R), Integer(X), fullF, nbC(R,X), atF(R)
  Add:  emptyF, nbC(R,+(X,1))
  Del:  fullF, nbC(R,X)

Move(R1,R2)
  Prec: Rive(R1), Rive(R2), ≠(R1,R2), atF(R1)
  Add:  atF(R2)
  Del:  atF(R1)
Note: As in STRIPS, all variables are present in the parameter list. Therefore, numerical variables (like X) should also be present in the parameter list. In this example, the numerical parameter X represents the number of cars in a place R. The functions + and - are used to carry out calculations of the numerical terms. Definition 3 A world state is described by a set of atoms without function symbols or variable symbols. Definition 4 Substitution. The following two particular substitutions are used:
The first substitution, noted σ, corresponds to substituting variable symbols by constant symbols and function terms without variable symbols by constant symbols. The second substitution, noted θ, is such that θ(X) is a substitution of X where:
– if X = {p1(...); ...; pn(...)} is a set of atoms without variable symbols then θ(X) = {θ(p1(...)); ...; θ(pn(...))};
– if X = P(t1, ..., tn) is an atom without variable symbols then θ(X) = P(θ(t1), ..., θ(tn));
– if X = t is a term without variable symbols then
  • if t is a constant symbol then t = c and θ(t) = c;
  • if t = f(c1, ..., cn) with c1, ..., cn constant symbols then there exists a constant symbol c such that {c/f(c1, ..., cn)} ⊆ θ and θ(f(c1, ..., cn)) = c;
  • (recursive part) if t = f(t1, ..., tn) with t1, ..., tn terms without variable symbols then θ(t) = θ(f(t1, ..., tn)) = θ(f(θ(t1), ..., θ(tn))).
Remarks:
– These two substitutions are used to replace variables by constants and to evaluate the functions (by replacing functions by constants).
– As all terms in the presented language L are finite, the recursive definition of θ stops at some point.
– The substitution θ(f(c1, ..., cn)) corresponds to the application of the function f. For example, if θ(+(1,2)) = 3 then in the domain theory T the result of 1 + 2 is 3.
– If an infinite definition domain is used for the numerical variables then the set C can be infinite.
– In the domain theory T, a function must only return a constant value and must not have any side effects.

Definition 5 An action α can be instantiated by two substitutions σ and θ (defined as above) iff for all p(t1, ..., tn) ∈ Pre(α) with p ∈ Pt, T ⊢ θ(σ(p(t1, ..., tn))). In the ferry example (Table 1), Load(R,X) can be instantiated by σ = {r1/R; 1/X} and {0/-(1,1)} ⊆ θ because the theory T uses the classical interpretation of the comparator >; then θ(σ(>(X,0))) = >(1,0), θ(σ(Rive(R))) = Rive(r1), θ(σ(Integer(X))) = Integer(1) and T ⊢ Rive(r1) ∧ Integer(1) ∧ >(1,0).

Definition 6 a is a ground action of α iff there exist two substitutions θ and σ such that α can be instantiated by θ and σ. a is defined by:
– Param(a) = σ(Param(α));
– Prec(a) = {p(t1, ..., tn) such that there exists p'(t'1, ..., t'n) ∈ Pre(α), p' ∈ Pp and p(t1, ..., tn) = θ(σ(p'(t'1, ..., t'n)))};
– Add(a) = {p(t1, ..., tn) such that there exists p'(t'1, ..., t'n) ∈ Add(α), p' ∈ Pp and p(t1, ..., tn) = θ(σ(p'(t'1, ..., t'n)))};
– Del(a) = {p(t1, ..., tn) such that there exists p'(t'1, ..., t'n) ∈ Del(α), p' ∈ Pp and p(t1, ..., tn) = θ(σ(p'(t'1, ..., t'n)))}.
Remark: Each variable symbol and each function symbol is replaced by the application of θ ∘ σ. Therefore, the generated ground actions are identical to the ground actions generated in a pure STRIPS paradigm; they do not contain any atom of Pt.
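As a small illustration added here (not taken from the paper), the following Java fragment mimics what θ ∘ σ does to Load(R,X) from Table 1 with σ = {r1/R, 1/X}: the Pt atoms are checked and dropped, the function term -(X,1) is evaluated, and a pure STRIPS-like ground action remains. The string encoding of atoms is an assumption.

// Hypothetical sketch: applying theta after sigma to Load(R,X) from Table 1
// with sigma = {r1/R, 1/X}. Pt atoms (Rive, Integer, >) are checked and then
// dropped, the function term -(X,1) is evaluated by theta, and the remaining
// Pp atoms form the pure STRIPS-like ground action Load(r1,1).
import java.util.List;

final class GroundingSketch {

    // theta on a binary arithmetic function term, e.g. theta("-", 1, 1) == 0
    static int theta(String f, int a, int b) {
        return switch (f) {
            case "+" -> a + b;
            case "-" -> a - b;
            default  -> throw new IllegalArgumentException("unknown function " + f);
        };
    }

    public static void main(String[] args) {
        String r = "r1";  // sigma: r1/R
        int x = 1;        // sigma: 1/X

        // Instantiation check on the Pt atoms (Definition 5); they are then dropped.
        if (!(x > 0)) throw new IllegalStateException(">(X,0) is not satisfied");

        List<String> prec = List.of("emptyF", "nbC(" + r + "," + x + ")", "atF(" + r + ")");
        List<String> add  = List.of("fullF",  "nbC(" + r + "," + theta("-", x, 1) + ")");
        List<String> del  = List.of("emptyF", "nbC(" + r + "," + x + ")");

        System.out.println("Load(" + r + "," + x + ")");
        System.out.println("  Prec: " + prec + "\n  Add:  " + add + "\n  Del:  " + del);
    }
}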
3 Numerical Planning
Usually, planners generate the ground actions from the domains of the action variables. In our framework, numerical knowledge can lead to an infinite number of ground actions. For example, the action Load(r1,1) defined below is a ground action of the Load(R,X) action:

Load(r1,1)
  Prec: emptyF, nbC(r1,1), atF(r1)
  Add:  fullF, nbC(r1,0)
  Del:  emptyF, nbC(r1,1)
3.1 Planning Graph
We present an approach based on the relaxation of the planning problem through the building of a planning graph for heuristic calculation. In this numerical paradigm, the generation of ground actions has to be completed progressively during the building of the graph. The idea is to instantiate actions from the current level during the graph building, in a forward pass. In this way, the use of bijective (invertible) functions is not required. The resulting ground actions are identical to pure STRIPS ground actions, which makes it possible to use symbolical algorithms based on planning graph building, like GRAPHPLAN [1], FF [7] and HSP [2], to solve numerical problems. However, search completeness cannot be guaranteed, as the use of functions and numbers makes the planning process undecidable. To avoid an infinite expansion during the planning process, we give the modeler the possibility of adding a lower limit and an upper limit for each numerical type, depending on the planning problem to be solved. For example, in the ferry domain of Table 1, if the problem consists of transferring 50 cars from place r1 to place r2 and the initial state contains 50 cars in place r1 (nbC(r1,50)) and 0 cars in place r2 (nbC(r2,0)), the interval of lower and upper limits for the integer variable X in the Load and UnLoad action definitions is [0, 50]. Thus, extra preconditions could be added to the precondition lists: <(X,50) for the Load(R,X) action, and >(X,0), <(X,50) for the UnLoad(R,X) action.
3.2 Numerical Fast Forward (NFF)
In contrast to FF, the ground actions in NFF are not computed at the beginning from the definition domains of the variables.
Fig. 1. The relaxed planning graph of the problem P
The NFF algorithm can be roughly described in the following way:
1. Only the symbolical part of the actions is instantiated.
2. The planning graph is built by completing the action instantiation (numerical part), and then by applying the instantiated actions.
3. The relaxed plan is extracted from the planning graph; the length of this relaxed plan constitutes the heuristic value.
4. The heuristic value previously calculated is used to guide the search of an algorithm close to FF's Hill Climbing algorithm.
The heuristic calculation (steps 2, 3 and 4) is done for each state in the main algorithm.

Relaxed Planning Graph Building. Let us consider the following planning problem P = (A, I, G), where the action set is A = {Move; Load; UnLoad} (see Table 1), the initial state is I = {atF(r1); emptyF; nbC(r1,1); nbC(r2,0)} and the goal is G = {nbC(r2,1); atF(r1)}. The relaxed planning graph of the problem P is described in Figure 1. The graph starts at level 0 with the initial state I. The actions Move(r1,r2) and Load(r1,1) are applicable at this level. The partially instantiated action Load(r1,X) is completed by σ = {r1/R; 1/X} because {nbC(r1,1); atF(r1); emptyF} ⊆ Level 0 and T ⊢ >(1,0). Moreover, θ, according to the domain theory T, replaces -(1,1) by 0, that is {0/-(1,1)} ⊆ θ. The action Load(r1,1) is defined by Pre(Load(r1,1)) = {emptyF; nbC(r1,1); atF(r1)}, Add(Load(r1,1)) = {fullF; nbC(r1,0)} and Del(Load(r1,1)) = {emptyF; nbC(r1,1)}. The two actions Move(r1,r2) and Load(r1,1) are applied and their add lists are added to level 1.

Remark: In the relaxed graph building algorithm, we only defined the σ substitution. It seems reasonable that the θ substitution be common to all the actions in a planning problem. In our implementation, all functional calculations are carried out in the interpretation domain. The action completion algorithm is described in Algorithm 1.
Algorithm 1 Action completion algorithm

for all p(...) ∈ Pre(α) with p ∈ Pp do
  for all predicate symbols p' in the current level for which p = p', and for all c'i ∈ Param(p') and ci ∈ Param(p) such that c'i = ci do
    for all variable terms Xi ∈ Param(p) and ci ∈ Param(p') do
      add {ci/Xi} to the set S of candidate substitutions σ
    end for
  end for
end for
for all σ ∈ S do
  if α can be instantiated by σ and θ then
    generate the corresponding ground action
  end if
end for
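To make steps 2) and 3) more concrete, here is a self-contained Java sketch, added for illustration only, of a much simplified relaxed expansion: actions are assumed to be already ground, atoms are plain strings, delete lists are ignored, and the level at which the goals first appear is used as a crude stand-in for the relaxed-plan length (the real NFF completes action instantiation level by level, as in Algorithm 1, and extracts an actual relaxed plan).

// Much simplified sketch of relaxed planning-graph expansion:
// delete lists are ignored and the number of levels needed to reach the
// goals is used as a crude stand-in for the relaxed-plan heuristic.
// Atoms are plain strings and actions are assumed to be ground already.
import java.util.*;

final class RelaxedGraphSketch {

    record GroundAction(String name, Set<String> pre, Set<String> add) { }

    /** Returns the first level at which all goals appear, or -1 if unreachable. */
    static int relaxedLevel(Set<String> init, Set<String> goals, List<GroundAction> actions) {
        Set<String> level = new HashSet<>(init);
        for (int depth = 0; ; depth++) {
            if (level.containsAll(goals)) return depth;
            Set<String> next = new HashSet<>(level);
            for (GroundAction a : actions) {
                if (level.containsAll(a.pre())) next.addAll(a.add());  // apply add list only
            }
            if (next.equals(level)) return -1;  // fix point reached without the goals
            level = next;
        }
    }

    public static void main(String[] args) {
        // Ferry problem P from the text: one car to move from r1 to r2.
        List<GroundAction> actions = List.of(
            new GroundAction("Load(r1,1)",   Set.of("emptyF", "nbC(r1,1)", "atF(r1)"),
                                             Set.of("fullF", "nbC(r1,0)")),
            new GroundAction("Move(r1,r2)",  Set.of("atF(r1)"), Set.of("atF(r2)")),
            new GroundAction("Move(r2,r1)",  Set.of("atF(r2)"), Set.of("atF(r1)")),
            new GroundAction("UnLoad(r2,0)", Set.of("fullF", "nbC(r2,0)", "atF(r2)"),
                                             Set.of("emptyF", "nbC(r2,1)")));
        Set<String> init  = Set.of("atF(r1)", "emptyF", "nbC(r1,1)", "nbC(r2,0)");
        Set<String> goals = Set.of("nbC(r2,1)", "atF(r1)");
        System.out.println("relaxed level = " + relaxedLevel(init, goals, actions));
    }
}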
4 Empirical Results
We have implemented NFF in Java. Our main objective was to test the feasibility of certain types of numerical problems, and little effort was put into code optimization. The machine used for the tests is an Intel Celeron Pentium III with 256 MB of RAM.

The water jug problem consists of 3 jugs j1, j2 and j3 where Capacity(j1,36), Capacity(j2,45), Capacity(j3,54). The initial state is: Volume(j1,16), Volume(j2,27), Volume(j3,34). The goal is: Volume(j1,25), Volume(j2,0), Volume(j3,52) -> time = 8.36 s.

The ferry domain: the initial state is atF(r1), emptyF, nbC(r1,50), nbC(r2,0). For the goals: nbC(r2,5) -> time = 0.571 s; nbC(r2,10) -> time = 1.390 s; nbC(r2,30) -> time = 21.937 s.
5 Conclusion
We have presented a STRIPS extension to support the definition of numerical and symbolical domains. We have proposed a way of instantiating actions progressively during the planning process in order to reduce ground action generation for numerical domains. The main goal of introducing numbers into planning in the presented way is to allow the definition of domains closer to the real world. In our framework, world objects can be retrieved from numerical variables by function evaluation, instead of only being constant symbols as in pure STRIPS. We have not addressed resource optimization (time or metrics) in this work, but we believe this capability can easily be added to (or retrieved from) our planning framework.
References
[1] Avrim L. Blum and Merrick L. Furst. Fast planning through planning graph analysis. Proceedings of the 14th International Joint Conference on AI (IJCAI-95), pages 1636-1642, 1995.
[2] B. Bonet and H. Geffner. Planning as heuristic search. Artificial Intelligence, 129:5-33, 2001.
[3] Minh B. Do and Subbarao Kambhampati. Sapa: A domain-independent heuristic metric temporal planner. European Conference on Planning, 2001.
[4] R. E. Fikes and N. Nilsson. STRIPS: A new approach to the application of theorem proving to problem solving. Artificial Intelligence, 2:189-208, 1971.
[5] M. Fox and D. Long. PDDL2.1: An extension to PDDL for expressing temporal planning domains. AIPS, 2002.
[6] P. Haslum and H. Geffner. Heuristic planning with time and resources. Proc. IJCAI-01 Workshop on Planning with Resources, 2001.
[7] J. Hoffmann. FF: The fast-forward planning system. AI Magazine, 22:57-62, 2001.
[8] J. Hoffmann. Extending FF to numerical state variables. In Proceedings of the 15th European Conference on Artificial Intelligence, Lyon, France, 2002.
KAMET II: An Extended Knowledge-Acquisition Methodology* Osvaldo Cairó and Julio César Alvarez Instituto Tecnológico Autónomo de México (ITAM) Department of Computer Science, Río Hondo 1, Mexico City, Mexico
[email protected] [email protected]
Abstract. The Knowledge-Acquisition (KA) community needs more effective ways to elicit knowledge in different environments. Methodologies like CommonKADS [8], MIKE [1] and VITAL [6] are able to produce knowledge models using their respective Conceptual Modeling Languages (CML). However, sharing and reuse are nowadays a must-have in knowledge engineering (KE) methodologies and domain-specific KA tools, in order to permit Knowledge-Based System (KBS) developers to work faster with better results and to give them the chance to produce and utilize reusable Open Knowledge Base Connectivity (OKBC)-constrained models. This paper presents the KAMET II¹ Methodology, the diagnosis-specialized version of KAMET [2,3], as an alternative for creating knowledge-intensive systems while attacking KE-specific risks. We describe here one of the most important characteristics of KAMET II, which is the use of Protégé 2000 for implementing its CML models through ontologies.
1 Introduction
The KAMET [2,3] methodology was born during the last third of the nineties. The life cycle of KAMET's first version consisted mainly of two phases. The first one analyzes and acquires the knowledge from the different sources involved in the system. The second one models and processes this knowledge. The mechanism proposed in KAMET for KA was based on progressive improvements of models, which allowed knowledge to be refined and its optimal structure to be obtained in an incremental fashion. KAMET was inspired by two basic ideas: the spiral model proposed by Boehm and the essence of cooperative processes. Both ideas are closely related to the concept of risk reduction.

* This project has been funded by CONACyT, as project number 33038-A, and Asociación Mexicana de Cultura, A.C.
¹ KAMET II is a project that is being carried out in collaboration with the SWI Group at Amsterdam University and Universidad Politécnica de Madrid.
Fig. 1. KAMET II's Methodological Pyramid
Fig. 1 shows the Methodological Pyramid of KAMET II. The KAMET II life cycle represents the methodology itself; it is the starting point for reaching the goal. Auxiliary tools provide automated or semi-automated support for the process and the methods. A KBS Model represents a detailed description of a KBS, in the form of models. This has led KAMET to successful applications in the medicine, telecommunications and human resources areas.

The integration of diagnosis specialization with the ideas of the KAMET methodology converges in the KAMET II methodology, a methodology specialized in the diagnosis area that focuses on KA from multiple sources. Although KAMET provides a framework for managing the knowledge modeling process in a systematic and disciplined way, the manner of achieving additional organizational objectives such as predictability, consistency and improvement in quality and productivity is still not well defined within the KAMET Life-Cycle Model (LCM). KAMET II supports most aspects of a KBS development project, including organizational analysis, KA from multiple sources, conceptual modeling and user interaction.

We present in this section a brief description of Project Management (PM) in KAMET II. We want a PM activity that helps us measure as much as possible, with the purpose of making better plans and reaching commitments. We believe that the best way to deal with these problems is to analyze not only the "what" and the "how" of a software system, but also "why" we are using it. This is best done by starting the KA process with the early requirements, where one analyzes the domain within which the system will operate, studies how the use of the system will modify the environment, and progressively refines this analysis down to the actual implementation of single knowledge modules. The purpose of KAMET II PM is the elimination of project mismanagement and its consequences in KA projects [4, 5]. This new approach intends to build reliable KBSs that fulfill customer and organizational quality expectations by providing the project manager with effective knowledge and PM techniques. The purpose of this paper is not to present the PM dimension in detail, but it is important to emphasize this new key piece of KAMET II.
2 Diagnosis within KAMET II
Diagnosis differs from classification in the sense that the desired output is a malfunction of the system. In diagnosis the underlying knowledge typically contains knowledge about system behavior, such as a causal model. The output of diagnosis can take many forms: it can be a faulty component, a faulty state, or a causal chain. Diagnosis tasks are frequently encountered in the area of technical and medical systems.
Fig. 2. KAMET II's Environment
KAMET II tries to mitigate the difficulty of constructing solutions for diagnosis problems with a complete Problem-Solving Method (PSM) library. This is achieved by replacing the elaboration of a solution from scratch with the tailoring and refinement of PSMs with predefined tasks. KAMET II will have as part of its assets a PSM library for technical diagnosis, so that diagnosis task modeling can be carried out by selecting a subset of assumptions and roles from this library. The objective is to turn KAMET II into a methodology that concentrates as much experience as possible in the form of a PSM library, so that whenever a diagnosis problem needs to be treated, the steps to tackle it are always available. Fig. 2 shows how the PSM Diagnosis Library is the base for the creation of knowledge models by KAMET II. The philosophy of KAMET II is sharing and reuse. Not only the PSMs are reusable KE artifacts in KAMET II, but the knowledge models as well, as shown next.
3 KAMET II Models and Protégé 2000 Ontologies
Constructing knowledge systems is viewed as a modeling activity for developing structured knowledge and reasoning models. To ensure well-formed models, the use of some KE methodology is crucial. Additionally, reusing models can significantly reduce the time and cost of building new applications [2]. The goal is to have shareable knowledge, by encoding domain knowledge using a standard vocabulary based on the ontology. The KAMET II CML is presented after a discussion of the potential problems in KA. Pure rule representations, as well as object-modeling languages, data dictionaries, entity-relationship diagrams, and other methods, are no longer considered sufficient either for the purpose of system construction or for that of knowledge representation. Knowledge is too rich to be represented with notations like the Unified Modeling Language (UML); this requires stronger modeling facilities. A knowledge modeling method should provide a rich vocabulary in which the expertise can be expressed in an appropriate way. Knowledge and reasoning should be modeled in such a way that models can be exploited in a very flexible fashion [2]. The KAMET II CML has three levels of abstraction. The first corresponds to structural constructors and components. The second level corresponds to nodes and composition rules. The third level corresponds to the global model [3].
Fig. 3. KAMET II implemented in a Protégé 2000 ontology
In the following lines, it is described how the Protégé 2000 model can be used to implement KAMET II knowledge models visually, by means of the Protégé 2000 frame-based class model (or metaclass model, although it is not necessary for KAMET II as it is for CommonKADS [9]) and the Diagram_Entity implementation. The structural constructors and structural components of the KAMET II CML are mapped into Protégé 2000 in the following way. An abstract subclass Construct of the abstract class :THING is created as a superclass of the Problem, Classification and Subdivision classes that implement the corresponding concepts in the CML. The same is done for the structural components, creating an abstract subclass Component as the superclass of Symptom, Antecedent, Solution, Time, Value, Inaccuracy, Process, Formula and Examination. However, with the purpose of creating diagrammatic representations of knowledge models, composition rules need to be subclasses of the abstract class Connector in order to be used in a KAMETIIDiagram instance. [9] should be consulted for a complete description of the Protégé 2000 knowledge model and its flexibility of adaptation to other knowledge models like KAMET II's. Nevertheless, the importance of adapting KAMET II to Protégé 2000 lies in the potential for visual modeling available in the tool. Diagrams are one way to accomplish this goal. A diagram is, visually, a set of nodes and connectors that join the nodes. Underlying these, there are instances of classes. Nodes map to domain objects and connectors map to instances of Connector (a class the diagram widget uses to represent important relationships). When users see a diagram, they see a high-level summary of a large number of instances, without seeing many of the details of any particular instance. That is why a Diagram Widget is provided; this widget has the representational power mentioned above. The mapping permits the use of an automated aid to construct KAMET II models.

Fig. 3 (left) shows KAMET II in a Protégé 2000 ontology [9]. Fig. 3 (right) presents the Diagram Widget for KAMET II, which is nothing but a special kind of form [9] used to obtain instances of the classes (or concepts) involved in diagnosis-specific problems. This is the way to edit KAMET II models. As is visible in the left hierarchy, all KAMET II diagrams are instances of the KAMETIIDiagram class, which is a direct subclass of the Network class. Fig. 3 (right) shows the structural components and constructors and the composition rules. Developers used to the original graphical notation [2] will not have any problem getting used to the symbols implemented in Protégé 2000. It is important to mention that inaccuracies are represented as entities and not as probabilistic slots, so better intrinsic notation and representation are necessary to obtain better working models. Probabilistic networks [10] need a more elaborate analysis before they can be expressed in these graphical terms.

Fig. 4 shows a simple model in the KAMET II CML modeled in Protégé 2000. The model expresses that the problem P1 can occur due to two different situations. In the first one, the model expresses that if the symptoms S1 and S2 are known to be true then we can deduce that the problem P1 is true with probability 0.70. In the second one, the model shows that if symptoms S1 and S5 are observed then we can conclude that the problem is P1 with probability 0.70. On the other hand, we can deduce that the problem P3 is true with probability 0.40 if symptoms S1 and S4 are known to be true. Finally, we can reach the conclusion that the problem is P2 with probability 0.90 if problems P1 and P3 and the symptom S3 are observed.
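Purely as an illustrative aside added here (it is not the Protégé API and not part of the original paper), the class mapping described at the beginning of this section can be pictured as the following skeleton, in which every name comes from the text while the Java encoding itself is an assumption:

// Illustrative skeleton of the KAMET II hierarchy described in the text; in the
// actual tool these are Protégé 2000 frame classes under :THING, not Java classes.
abstract class Construct { }              // structural constructors
class Problem extends Construct { }
class Classification extends Construct { }
class Subdivision extends Construct { }

abstract class Component { }              // structural components
class Symptom extends Component { }
class Antecedent extends Component { }
class Solution extends Component { }
class Time extends Component { }
class Value extends Component { }
class Inaccuracy extends Component { }
class Process extends Component { }
class Formula extends Component { }
class Examination extends Component { }

abstract class Connector { }              // composition rules used inside diagrams

class Network { }
class KAMETIIDiagram extends Network { }  // every KAMET II diagram is an instance of this class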
Fig. 4. Electrical Diagnosis modeled in Protégé 2000
Fig. 4 shows an example implemented in Protégé 2000 using the facilities the tool provides for diagram construction with the Diagram Widget in the Instances tab.
4 Conclusions
In this paper we have described various aspects of the new KE methodology KAMET II. We discussed the reasons why KAMET II is a complete diagnosis methodology, through its PSM Diagnosis Library, for the procedural knowledge part of KBS construction. Not only simple-cause problems but also a great variety of diagnosis problems can be solved using a PSM Library. It was also shown how the modeling phase of declarative knowledge in KAMET II can be carried out using the Protégé 2000 automated tool through ontologies. The purpose was to map the KAMET II CML models to the Protégé 2000 frame-based model in order to provide sharing and reuse to knowledge engineers. These objectives can be achieved in the tool thanks to the API it provides for reusing the knowledge stored in the knowledge representation schemes it facilitates, and to the OKBC compliance of the tool. The PM dimension of the methodology, intended to diminish KE-specific risks in KBS development, was also presented.
References
[1] Angele, J., Fensel, D., and Studer, R.: Developing Knowledge-Based Systems with MIKE. Journal of Automated Software Engineering.
[2] Cairó, O.: A Comprehensive Methodology for Knowledge Acquisition from Multiple Knowledge Sources. Expert Systems with Applications, 14(1998), 116.
[3] Cairó, O.: The KAMET Methodology: Content, Usage and Knowledge Modeling. In Gaines, B. and Musen, M., editors, Proceedings of the 11th Banff Knowledge Acquisition for Knowledge-Based Systems Workshop, pages 1-20. Department of Computer Science, University of Calgary, SRGD Publications.
[4] Cairó, O., Barreiro, J., and Solsona, F.: Software Methodologies at Risk. In Fensel, D. and Studer, R., editors, 11th European Workshop on Knowledge Acquisition, Modeling and Management, volume 1621 of LNAI, pages 319-324. Springer Verlag.
[5] Cairó, O., Barreiro, J., and Solsona, F.: Risks Inside-out. In Cairó, O., Sucar, L. and Cantu, F., editors, MICAI 2000: Advances in Artificial Intelligence, volume 1793 of LNAI, pages 426-435. Springer Verlag.
[6] Domingue, J., Motta, E., and Watt, S.: The Emerging VITAL Workbench.
[7] Medsker, L., Tan, M., and Turban, E.: Knowledge Acquisition from Multiple Experts: Problems and Issues. Expert Systems with Applications, 9(1), 35-40.
[8] Schreiber, G., and Akkermans, H.: Knowledge Engineering and Management: the CommonKADS Methodology. MIT Press, Cambridge, Massachusetts, 1999.
[9] Schreiber, G., Crubézy, M., and Musen, M.: A Case Study in Using Protégé-2000 as a Tool for CommonKADS. In Dieng, R. and Corby, O., editors, 12th International Conference, EKAW 2000, Juan-les-Pins, France.
[10] van der Gaag, L., and Helsper, E.: Experiences with Modeling Issues in Building Probabilistic Networks. In Gómez-Pérez, A. and Benjamins, R., editors, 13th International Conference, EKAW 2002, volume 2473 of LNAI, pages 21-26. Springer Verlag.
Automated Knowledge Acquisition by Relevant Reasoning Based on Strong Relevant Logic* Jingde Cheng Department of Information and Computer Sciences, Saitama University Saitama, 338-8570, Japan
[email protected]
Abstract. Almost all existing methodologies and automated tools for knowledge acquisition are somehow based on classical mathematical logic or its various classical conservative extensions. This paper proposes a new approach to knowledge acquisition problem: automated knowledge acquisition by relevant reasoning based on strong relevant logic. The paper points out why any of the classical mathematical logic, its various classical conservative extensions, and traditional relevant logics is not a suitable logical basis for knowledge acquisition, shows that strong relevant logic is a more hopeful candidate for the purpose, and establishes a conceptional foundation for automated knowledge acquisition by relevant reasoning based on strong relevant logic.
1 Introduction
From the viewpoint of knowledge engineering, knowledge acquisition is the purposive modeling process of discovering and learning knowledge about some particular subject from one or more knowledge sources, and then abstracting, formalizing, representing, and transferring the knowledge in some explicit and formal forms suitable for computation on computer systems [10]. Automated knowledge acquisition is concerned with the execution of computer programs that assist in knowledge acquisition. As Sestito and Dillon pointed out: the main difficulty in extracting knowledge from experts is that they themselves have trouble expressing or formalizing their knowledge. Experts also have a problem in describing their knowledge in terms that are precise, complete, and consistent enough for use in a computer program. This difficulty stems from the inherent nature of knowledge that constitutes human expertise. Such knowledge is often subconscious and may be approximate, incomplete, and inconsistent [15]. Therefore, the intrinsically characteristic task in knowledge acquisition is discovery rather than justification, and the task has to be performed under the condition that working with approximate, incomplete, and inconsistent knowledge is the rule rather than the exception. As a result, if we want to establish a sound methodology for knowledge acquisition in knowledge engineering practice, we have to consider the issue of how to discover new knowledge from one or more knowledge sources where approximateness, incompleteness, and inconsistency are present to some degree.

Until now, almost all existing methodologies and automated tools for knowledge acquisition are somehow based on classical mathematical logic (CML for short) or its various classical conservative extensions. This approach, however, may be suitable for searching for and describing a formal proof of a previously specified statement, but not necessarily for forming a new concept and discovering a new statement, because the aim, nature, and role of CML are descriptive and non-predictive rather than prescriptive and predictive.

This paper proposes a new approach to the knowledge acquisition problem: knowledge acquisition by relevant reasoning based on strong relevant logic. The paper points out why none of classical mathematical logic, its various classical conservative extensions, and traditional relevant logics is a suitable logical basis for knowledge acquisition, shows that strong relevant logic is a more hopeful candidate for the purpose, and establishes a conceptional foundation for automated knowledge acquisition by relevant reasoning based on strong relevant logic.

* This work is supported in part by The Ministry of Education, Culture, Sports, Science and Technology of Japan under Grant-in-Aid for Exploratory Research No. 09878061 and Grant-in-Aid for Scientific Research (B) No. 11480079.
2 Reasoning and Proving
Reasoning is the process of drawing new conclusions from given premises, which are already known facts or previously assumed hypotheses (Note that how to define the notion of “new” formally and satisfactorily is still a difficult open problem until now). Therefore, reasoning is intrinsically ampliative, i.e., it has the function of enlarging or extending some things, or adding to what is already known or assumed. In general, a reasoning consists of a number of arguments (or inferences) in some order. An argument (or inference) is a set of declarative sentences consisting of one or more sentences as its premises, which contain the evidence, and one sentence as its conclusion. In an argument, a claim is being made that there is some sort of evidential relation between its premises and its conclusion: the conclusion is supposed to follow from the premises, or equivalently, the premises are supposed to entail the conclusion. Therefore, the correctness of an argument is a matter of the connection between its premises and its conclusion, and concerns the strength of the relation between them (Note that the correctness of an argument depends neither on whether the premises are really true or not, nor on whether the conclusion is really true or not). Thus, there are some fundamental questions: What is the criterion by which one can decide whether the conclusion of an argument or a reasoning really does follow from its premises or not? Is there the only one criterion, or are there many criteria? If there are many criteria, what are the intrinsic differences between them? It is logic that deals with the validity of argument and reasoning in general.
A logically valid reasoning is a reasoning such that its arguments are justified based on some logical validity criterion provided by a logic system in order to obtain correct conclusions (Note that here the term “correct” does not necessarily mean “true.”). Today, there are so many different logic systems motivated by various philosophical considerations. As a result, a reasoning may be valid on one logical validity criterion but invalid on another. For example, the classical account of validity, which is one of fundamental principles and assumptions underlying CML and its various conservative extensions, is defined in terms of truth-preservation (in some certain sense of truth) as: an argument is valid if and only if it is impossible for all its premises to be true while its conclusion is false. Therefore, a classically valid reasoning must be truth-preserving. On the other hand, for any correct argument in scientific reasoning as well as our everyday reasoning, its premises must somehow be relevant to its conclusion, and vice versa. The relevant account of validity is defined in terms of relevance as: for an argument to be valid there must be some connection of meaning, i.e., some relevance, between its premises and its conclusion. Obviously, the relevance between the premises and conclusion of an argument is not accounted for by the classical logical validity criterion, and therefore, a classically valid reasoning is not necessarily relevant. Proving is the process of finding a justification for an explicitly specified statement from given premises, which are already known facts or previously assumed hypotheses. A proof is a description of a found justification. A logically valid proving is a proving such that it is justified based on some logical validity criterion provided by a logic system in order to obtain a correct proof. The most intrinsic difference between reasoning and proving is that the former is intrinsically prescriptive and predictive while the latter is intrinsically descriptive and non-predictive. The purpose of reasoning is to find some new conclusion previously unknown or unrecognized, while the purpose of proving is to find a justification for some specified statement previously given. Proving has an explicitly given target as its goal while reasoning does not. Unfortunately, until now, many studies in Computer Science and Artificial Intelligence disciplines still confuse proving with reasoning. Discovery is the process to find out or bring to light of that which was previously unknown. For any discovery, both the discovered thing and its truth must be unknown before the completion of discovery process. Since reasoning is the only way to draw new conclusions from given premises, there is no discovery process that does not invoke reasoning. As we have mentioned, the intrinsically characteristic task in knowledge acquisition is discovery rather than justification. Therefore, the task is concerning reasoning rather than proving. Below, let us consider the problem what logic system can satisfactorily underlie knowledge discovery.
3 The Notion of Conditional as the Heart of Logic
What is logic? Logic is a special discipline which is considered to be the basis for all other sciences, and therefore, it is a science prior to all others, which contains the
ideas and principles underlying all sciences [9, 16]. Logic deals with what entails what or what follows from what, and aims at determining which are the correct conclusions of a given set of premises, i.e., to determine which arguments are valid. Therefore, the most essential and central concept in logic is the logical consequence relation that relates a given set of premises to those conclusions which validly follow from the premises. In general, a formal logic system L consists of a formal language, called the object language and denoted by F(L), which is the set of all well-formed formulas of L, and a logical consequence relation, denoted by meta-linguistic symbol |−L, such that for P ⊆ F(L) and c ∈ F(L), P |−L c means that within the framework of L, c is a valid conclusion of premises P, i.e., c validly follows from P. For a formal logic system (F(L), |−L), a logical theorem t is a formula of L such that φ |−L t where φ is the empty set. We use Th(L) to denote the set of all logical theorems of L. Th(L) is completely determined by the logical consequence relation |−L. According to the representation of the logical consequence relation of a logic, the logic can be represented as a Hilbert style formal system, a Gentzen natural deduction system, a Gentzen sequent calculus system, or another type of formal system. A formal logic system L is said to be explosive if and only if {A, ¬A} |−L B for any two different formulas A and B; L is said to be paraconsistent if and only if it is not explosive.

Let (F(L), |−L) be a formal logic system and P ⊆ F(L) be a non-empty set of sentences (i.e., closed well-formed formulas). A formal theory with premises P based on L, called an L-theory with premises P and denoted by TL(P), is defined as TL(P) =df Th(L) ∪ ThLe(P), and ThLe(P) =df {et | P |−L et and et ∉ Th(L)}, where Th(L) and ThLe(P) are called the logical part and the empirical part of the formal theory, respectively, and any element of ThLe(P) is called an empirical theorem of the formal theory. A formal theory TL(P) is said to be directly inconsistent if and only if there exists a formula A of L such that both A ∈ P and ¬A ∈ P hold. A formal theory TL(P) is said to be indirectly inconsistent if and only if it is not directly inconsistent but there exists a formula A of L such that both A ∈ TL(P) and ¬A ∈ TL(P). A formal theory TL(P) is said to be consistent if and only if it is neither directly inconsistent nor indirectly inconsistent. A formal theory TL(P) is said to be explosive if and only if A ∈ TL(P) for an arbitrary formula A of L; TL(P) is said to be paraconsistent if and only if it is not explosive. An explosive formal theory is not useful at all. Therefore, any meaningful formal theory should be paraconsistent. Note that if a formal logic system L is explosive, then any directly or indirectly inconsistent L-theory TL(P) must be explosive.

In the literature of mathematical, natural, social, and human sciences, it is probably difficult, if not impossible, to find a sentence form that is more generally used for describing various definitions, propositions, and theorems than the sentence form of "if ... then ...". In logic, a sentence in the form of "if ... then ..." is usually called a conditional proposition or simply conditional, which states that there exists a relation of sufficient condition between the "if" part and the "then" part of the sentence.
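As a concrete aside added here for illustration (it is not part of the original text), the explosion property defined above, namely that an inconsistent premise set classically entails any formula, can be checked in one line of Lean 4:

-- From A and ¬A, any B follows: the explosion that trivializes
-- directly or indirectly inconsistent theories under classical logic.
example (A B : Prop) (hA : A) (hnA : ¬A) : B := absurd hA hnA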
Scientists always use conditionals in their descriptions of various definitions, propositions, and theorems to connect a concept, fact, situation or conclusion to its sufficient conditions. The major work of almost all scientists is to discover some sufficient condition relations between various phenomena, data, and laws in their research
fields. Indeed, Russell said as early as 1903: "Pure Mathematics is the class of all propositions of the form 'p implies q,' where p and q are propositions containing one or more variables, the same in the two propositions, and neither p nor q contains any constants except logical constants" [14].

In general, a conditional has two parts, connected by the connective "if ... then ..." and called the antecedent and the consequent of that conditional, respectively. The truth of a conditional depends not only on the truth of its antecedent and consequent but also, and more essentially, on a necessarily relevant and conditional relation between them. The notion of conditional plays the most essential role in reasoning because any reasoning form must invoke it; therefore, it has historically always been the most important subject studied in logic and is regarded as the heart of logic [1]. In fact, the notion of conditional has been discussed since the age of ancient Greece. For example, the extensional truth-functional definition of material implication was given by Philo of Megara in about 400 B.C. [11, 16].

When we study and use logic, the notion of conditional may appear in both the object logic (i.e., the logic we are studying) and the meta-logic (i.e., the logic we are using to study the object logic). In the object logic, there usually is a connective in its formal language to represent the notion of conditional, and the notion of conditional, usually represented by a meta-linguistic symbol, is also used for representing a logical consequence relation in its proof theory or model theory. In the meta-logic, on the other hand, the notion of conditional, usually in the form of natural language, is used for defining various meta-notions and describing various meta-theorems about the object logic.

From the viewpoint of the object logic, there are two classes of conditionals: empirical conditionals and logical conditionals. For a logic, a conditional is called an empirical conditional of that logic if its truth-value, in the sense of that logic, depends on the contents of its antecedent and consequent and therefore cannot be determined only by its abstract form (i.e., from the viewpoint of that logic, the relevant relation between the antecedent and the consequent of that conditional is regarded as empirical); a conditional is called a logical conditional of that logic if its truth-value, in the sense of that logic, depends only on its abstract form and not on the contents of its antecedent and consequent, and is therefore considered to be universally true or false (i.e., from the viewpoint of that logic, the relevant relation between the antecedent and the consequent of that conditional is regarded as logical). A logical conditional that is considered to be universally true, in the sense of that logic, is also called an entailment of that logic. Indeed, the most intrinsic difference between various logic systems lies in which class of conditionals they regard as entailments, as Diaz pointed out: "The problem in modern logic can best be put as follows: can we give an explanation of those conditionals that represent an entailment relation?" [7]
4
The Logical Basis for Automated Knowledge Acquisition
Any science is established on the basis of some fundamental principles and assumptions, such that removing one of them, or replacing it by a new one, will have a great influence on the contents of the science and may even lead to the creation of a completely new branch of the science. CML was established in order to provide formal languages for describing the structures with which mathematicians work, and the methods of proof available to them; its principal aim is a precise and adequate understanding of the notion of mathematical proof. Given its mathematical method, it must be descriptive rather than prescriptive, and its description must be idealized. CML was established on the basis of a number of fundamental assumptions. Some of the assumptions concerning our subject are as follows:
The classical abstraction: The only properties of a proposition that matter to logic are its form and its truth-value.
The Fregean assumption: The truth-value of a proposition is determined by its form and the truth-values of its constituents.
The classical account of validity: An argument is valid if and only if it is impossible for all its premises to be true while its conclusion is false.
The principle of bivalence: There are exactly two truth-values, TRUE and FALSE. Every declarative sentence has one or other, but not both, of these truth-values.
The classical account of validity is the logical validity criterion of CML, by which one can decide whether or not the conclusion of an argument or a reasoning really does follow from its premises in the framework of CML. However, since the relevance between the premises and the conclusion of an argument is not accounted for by the classical validity criterion of CML, a reasoning based on CML is not necessarily relevant, i.e., its conclusion may not be relevant at all, in the sense of meaning and context, to its premises. In other words, in the framework of CML, even if a reasoning is classically valid, the relevance between its premises and its conclusion cannot be guaranteed. Note that this proposition is also true in the framework of any classical conservative extension or non-classical alternative of CML where the classical account of validity is adopted as the logical validity criterion. On the other hand, taking the above assumptions into account, in CML the notion of conditional, which is intrinsically intensional and not truth-functional, is represented by the truth-functional, extensional notion of material implication (denoted by → in this paper), defined as A→B =df ¬(A∧¬B) or A→B =df ¬A∨B. This definition of material implication, with the inference rule of Modus Ponens for material implication (from A and A→B to infer B), adequately satisfies the truth-preserving requirement of CML, i.e., the conclusion of a classically valid reasoning based on CML must be true (in the sense of CML) if all premises of the reasoning are true (in the sense of CML). This requirement is basic and adequate for CML to be used as a formal description tool by mathematicians. However, the material implication is intrinsically different from the notion of conditional in meaning (semantics). It is no more than an extensional truth-function of its antecedent and consequent and does not require that there be a necessarily relevant
and conditional relation between its antecedent and consequent, i.e., the truth-value of the formula A→B depends only on the truth-values of A and B, even though there may be no necessarily relevant and conditional relation between A and B. It is this intrinsic difference in meaning between the notion of material implication and the notion of conditional that leads to the well-known "implicational paradox problem" in CML. The problem is that if one regards material implication as the notion of conditional and regards every logical theorem of CML as an entailment or valid reasoning form, then a great number of logical axioms and logical theorems of CML, such as A→(B→A), B→(¬A∨A), ¬A→(A→B), (¬A∧A)→B, (A→B)∨(¬A→B), (A→B)∨(A→¬B), (A→B)∨(B→A), ((A∧B)→C)→((A→C)∨(B→C)), and so on, present paradoxical properties and have therefore been referred to in the literature as "implicational paradoxes" [1].

Because all implicational paradoxes are logical theorems of any CML-theory TCML(P), a conclusion of a reasoning from a set P of premises based on CML cannot be directly accepted as a correct conclusion in the sense of conditional, even if each of the given premises is regarded as true and the conclusion can be regarded as true in the sense of material implication. For example, from any given premise A ∈ P we can infer B→A, C→A, ..., where B, C, ... are arbitrary formulas, by using the logical axiom A→(B→A) of CML and Modus Ponens for material implication, i.e., B→A ∈ TCML(P), C→A ∈ TCML(P), ... for any A ∈ TCML(P). However, from the viewpoint of scientific reasoning as well as our everyday reasoning, these inferences cannot be regarded as valid in the sense of conditional, because there may be no necessarily relevant and conditional relation between B, C, ... and A, and therefore we cannot say "if B then A," "if C then A," and so on. Obviously, no scientist ever did, or ever will, reason in such a way in scientific discovery. This means that, from the viewpoint of conditional or entailment, the truth-preserving property of reasoning based on CML is meaningless. Note that any classical conservative extension or non-classical alternative of CML where the notion of conditional is directly or indirectly represented by material implication has problems similar to those of CML. Consequently, in the framework of CML and its various classical or non-classical conservative extensions, even if a reasoning is classically valid, neither the necessary relevance between its premises and conclusion nor the truth of its conclusion in the sense of conditional can be guaranteed.

If we regard reasoning as the process of drawing new conclusions from given premises, then any meaningful reasoning should be ampliative, not circular and/or tautological, i.e., the truth of the conclusion should be recognized after the completion of the reasoning process and should not be invoked in deciding the truth of the premises. As an example, let us consider the most typical human logical reasoning form, Modus Ponens. The natural language representation of Modus Ponens may be "if A holds then B holds; now A holds; therefore B holds." When we reason using Modus Ponens, what do we know? We know "if A holds then B holds" and "A holds." Before the reasoning is performed, we do not know whether or not "B holds"; if we already knew, we would not need to reason at all. Therefore, Modus Ponens should be ampliative, not circular and/or tautological.
How, then, can we come to know "B holds" by using Modus Ponens? Indeed, by using Modus Ponens, we can
know "B holds," which is unknown until the reasoning is performed, for the following reasons: (i) "A holds," (ii) "there is no case such that A holds but B does not hold," and (iii) we know (ii) without investigating either "whether A holds or not" or "whether B holds or not." Note that the Wright-Geach-Smiley criterion for entailment (see below) corresponds to (ii) and (iii) above. From this example, we can see that the key point in ampliative and non-circular reasoning is the primitive and intensional relevance between the antecedent and consequent of a conditional. Because the material implication in CML is an extensional truth-function of its antecedent and consequent and does not require the existence of a necessarily relevant relation between them, a reasoning based on this logic must be circular and/or tautological. For example, Modus Ponens for material implication is usually represented in CML as "from A and A→B to infer B." According to the extensional truth-functional semantics of material implication, if we know "A is true" but do not know the truth-value of B, then we cannot decide the truth-value of "A→B." In order to know the truth-value of B using Modus Ponens for material implication, we would have to know the truth-value of B before the reasoning is performed! Obviously, Modus Ponens for material implication is circular and/or tautological if it is used as a reasoning form, and therefore it is not a natural representation of Modus Ponens.

Moreover, in general, our knowledge about a domain or a scientific discipline may be incomplete and inconsistent in many ways, i.e., it gives us no evidence for deciding the truth of either a proposition or its negation, and it directly or indirectly includes some contradictions. Therefore, reasoning with incomplete (and sometimes inconsistent) information and/or knowledge is the rule rather than the exception in our everyday real-life situations and in almost all scientific disciplines. Also, even if our knowledge about a domain or scientific discipline seems to be consistent at present, we may in the future find a new fact or rule that is inconsistent with our known knowledge, i.e., we find a contradiction. In such cases, we neither doubt the "logic" we use in our everyday logical thinking nor reason out just anything from the contradictions; rather, we conclude that something is wrong in our knowledge and investigate the causes of the contradictions, i.e., we do reason under inconsistency in order to detect and remove the causes of the contradictions. Indeed, in scientific research, the detection and explanation of an inconsistency between a new fact and known knowledge often leads to the formation of new concepts or the discovery of new principles. How to reason with inconsistent knowledge is therefore an important issue in scientific discovery and theory formation. For a paraconsistent logic with Modus Ponens as an inference rule, paraconsistency requires that the logic does not have "(¬A∧A)→B" as a logical theorem, where "A" and "B" are any two different formulas and "→" is the relation of implication used in Modus Ponens. If a logic is not paraconsistent, then infinitely many propositions (even the negations of the logical theorems of the logic) may be reasoned out, based on the logic, from a set of premises that directly or indirectly includes a contradiction. However, CML assumes that all the information is on the table before any deduction is performed.
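To see concretely why truth-functional semantics admits such theorems, a brute-force truth-table check (a minimal sketch added for illustration, not part of the original apparatus) confirms that formulas like A→(B→A) and (¬A∧A)→B are classical tautologies even though their antecedents and consequents need share no content:

```python
from itertools import product

# Propositional formulas as nested tuples: ("var", name), ("not", f),
# ("and", f, g), ("or", f, g), ("imp", f, g)  -- material implication.
def evaluate(formula, valuation):
    op = formula[0]
    if op == "var":
        return valuation[formula[1]]
    if op == "not":
        return not evaluate(formula[1], valuation)
    if op == "and":
        return evaluate(formula[1], valuation) and evaluate(formula[2], valuation)
    if op == "or":
        return evaluate(formula[1], valuation) or evaluate(formula[2], valuation)
    if op == "imp":  # A -> B is defined truth-functionally as (not A) or B
        return (not evaluate(formula[1], valuation)) or evaluate(formula[2], valuation)
    raise ValueError(op)

def variables(formula, acc=None):
    acc = set() if acc is None else acc
    if formula[0] == "var":
        acc.add(formula[1])
    else:
        for sub in formula[1:]:
            variables(sub, acc)
    return acc

def is_tautology(formula):
    names = sorted(variables(formula))
    return all(evaluate(formula, dict(zip(names, values)))
               for values in product([True, False], repeat=len(names)))

A, B = ("var", "A"), ("var", "B")
paradoxes = {
    "A->(B->A)":       ("imp", A, ("imp", B, A)),
    "(~A & A) -> B":   ("imp", ("and", ("not", A), A), B),
    "(A->B) v (B->A)": ("or", ("imp", A, B), ("imp", B, A)),
}
for name, f in paradoxes.items():
    print(name, "is a classical tautology:", is_tautology(f))
```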
Moreover, it is well known that CML is explosive but not paraconsistent, and therefore, any directly or indirectly inconsistent CML-theory TCML(P) must be explosive. This is because CML uses Modus Ponens for material implication as its inference rule, and has “(¬A∧A)→B”
as a logical theorem, which is, in fact, the most typical implicational paradox. Thus, reasoning under inconsistency is impossible within the framework of CML.

Through the above discussion, we have seen that a reasoning based on CML is not necessarily relevant, that the classical truth-preserving property of a reasoning based on CML is meaningless in the sense of conditional, that a reasoning based on CML must be circular and/or tautological rather than ampliative, and that reasoning under inconsistency is impossible within the framework of CML. These facts also hold for the classical and non-classical conservative extensions of CML. What they tell us is that CML and its various classical or non-classical conservative extensions are not a suitable logical basis for automated knowledge acquisition.

Traditional relevant logics were constructed during the 1950s in order to find a mathematically satisfactory way of grasping the elusive notion of relevance of antecedent to consequent in conditionals, and to obtain a notion of implication which is free from the so-called "paradoxes" of material and strict implication [1, 2, 8, 12, 13]. Some major traditional relevant logic systems are "system E of entailment", "system R of relevant implication", and "system T of ticket entailment". A major feature of these relevant logics is that they have a primitive intensional connective to represent the notion of conditional, and their logical theorems include no implicational paradoxes. Von Wright, Geach, and Smiley suggested some informal criteria for the notion of entailment, the so-called "Wright-Geach-Smiley criterion" for entailment: "A entails B, if and only if, by means of logic, it is possible to come to know the truth of A→B without coming to know the falsehood of A or the truth of B" [1]. However, it is still not known exactly how to formally interpret such epistemological phrases as "coming to know" and "getting to know" in the context of logic. Anderson and Belnap proposed variable-sharing as a necessary but not sufficient formal condition for the relevance between the antecedent and consequent of an entailment. The underlying principle of these relevant logics is the relevance principle: for any entailment provable in E, R, or T, its antecedent and consequent must share a sentential variable. Variable-sharing is a formal notion designed to reflect the idea that there must be a meaning-connection between the antecedent and consequent of an entailment [1, 2, 12, 13]. It is this relevance principle that excludes the implicational paradoxes from the logical axioms and theorems of relevant logics.

However, although the traditional relevant logics reject the implicational paradoxes, there still exist some logical axioms or theorems in these logics which are not so natural in the sense of conditional. Such logical axioms or theorems are, for instance, (A∧B)⇒A, (A∧B)⇒B, (A⇒B)⇒((A∧C)⇒B), A⇒(A∨B), B⇒(A∨B), (A⇒B)⇒(A⇒(B∨C)) and so on, where ⇒ denotes the primitive intensional connective of the logics representing the notion of conditional. The present author has named these logical axioms or theorems 'conjunction-implicational paradoxes' and 'disjunction-implicational paradoxes' [3, 4, 6]. For example, from any given premise A⇒B we can infer (A∧C)⇒B, (A∧C∧D)⇒B, and so on, by using the logical theorem (A⇒B)⇒((A∧C)⇒B) of T, E, and R and Modus Ponens for the conditional.
However, from the viewpoint of scientific reasoning as well as our everyday reasoning, these inferences cannot be regarded as valid in the sense of conditional because there may be no necessarily relevant and conditional relation between C, D, ... and B and therefore we cannot say ‘if A and C then B', ‘if A and C and D then B', and so on.
In order to establish a satisfactory logical calculus of the conditional to underlie relevant reasoning, the present author has proposed some strong relevant logics (or strong relevance logics), named Rc, Ec, and Tc [3, 4, 6]. These logics require that the premises of an argument represented by a conditional include no unnecessary and needless conjuncts and that the conclusion of that argument include no unnecessary and needless disjuncts. As modifications of the traditional relevant logics R, E, and T, the strong relevant logics Rc, Ec, and Tc reject all conjunction-implicational paradoxes and disjunction-implicational paradoxes in R, E, and T, respectively. What underlies the strong relevant logics is the strong relevance principle: if A is a theorem of Rc, Ec, or Tc, then every sentential variable in A occurs at least once as an antecedent part and at least once as a consequent part. Since the strong relevant logics are free not only of implicational paradoxes but also of conjunction-implicational and disjunction-implicational paradoxes, in the framework of strong relevant logics, if a reasoning is valid, then both the relevance between its premises and its conclusion and the validity of its conclusion in the sense of conditional can be guaranteed in a certain sense of strong relevance. The strong relevant logics are promising candidates for the fundamental logic to satisfactorily underlie automated knowledge acquisition. First, the strong relevant logics can certainly underlie relevant reasoning in a certain sense of strong relevance. Second, a reasoning based on the strong relevant logics is truth-preserving in the sense of the primitive intensional semantics of the conditional. Third, since the Wright-Geach-Smiley criterion for entailment is accounted for by the notion of conditional, a reasoning based on the strong relevant logics is ampliative and not circular and/or tautological. Finally, the strong relevant logics can certainly underlie paraconsistent reasoning.
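The strong relevance principle just stated is purely syntactic, so it can be checked mechanically. The following is an illustrative sketch, not the author's implementation, assuming the usual polarity-based reading of "antecedent part" and "consequent part"; formulas are encoded as nested tuples, and the cited paradoxes (A∧B)⇒A and (A⇒B)⇒((A∧C)⇒B) are correctly flagged as violating the principle:

```python
# Formulas: ("var", p), ("not", f), ("and", f, g), ("or", f, g), ("ent", f, g),
# where "ent" stands for the intensional conditional =>.
def collect_parts(formula, as_consequent, antecedent_vars, consequent_vars):
    op = formula[0]
    if op == "var":
        (consequent_vars if as_consequent else antecedent_vars).add(formula[1])
    elif op == "not":
        # Negation flips antecedent/consequent position.
        collect_parts(formula[1], not as_consequent, antecedent_vars, consequent_vars)
    elif op in ("and", "or"):
        # Conjunction and disjunction preserve position.
        collect_parts(formula[1], as_consequent, antecedent_vars, consequent_vars)
        collect_parts(formula[2], as_consequent, antecedent_vars, consequent_vars)
    elif op == "ent":
        # The antecedent of a consequent-part conditional becomes an antecedent
        # part (and vice versa); the consequent keeps the conditional's position.
        collect_parts(formula[1], not as_consequent, antecedent_vars, consequent_vars)
        collect_parts(formula[2], as_consequent, antecedent_vars, consequent_vars)
    else:
        raise ValueError(op)

def satisfies_strong_relevance(formula):
    ant, con = set(), set()
    collect_parts(formula, True, ant, con)
    return (ant | con) == (ant & con)   # every variable occurs in both roles

A, B, C = ("var", "A"), ("var", "B"), ("var", "C")
print(satisfies_strong_relevance(("ent", A, A)))              # True
print(satisfies_strong_relevance(("ent", ("and", A, B), A)))  # False: B only occurs as an antecedent part
print(satisfies_strong_relevance(
    ("ent", ("ent", A, B), ("ent", ("and", A, C), B))))       # False: C only occurs as an antecedent part
```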
5
Automated Knowledge Acquisition by Relevant Reasoning
Knowledge in various domains is often represented in the form of conditionals. From the viewpoint of the logical calculus of the conditional, the problem of automated knowledge acquisition can be regarded as the problem of automated (empirical) theorem finding from one or more knowledge sources. Since a formal theory TL(P) is generally an infinite set of formulas, even when the set of premises P is finite, we have to find some method to limit the range of candidates for "new and interesting theorems" to a finite set of formulas. The strategy the present author has adopted is to sacrifice the completeness of knowledge representation and reasoning in order to obtain a finite set of candidates. This is based on the author's conjecture that almost all "new and interesting theorems" of a formal theory can be deduced from the premises of that theory in finitely many inference steps involving a finite number of low-degree entailments. Let (F(L), |−L) be a formal logic system and k a natural number. The kth degree fragment of L, denoted by Thk(L), is a set of logical theorems of L which is inductively defined as follows (in terms of a Hilbert-style formal system): (1) if A is a jth (j ≤ k) degree formula and an axiom of L, then A ∈ Thk(L); (2) if A is a jth (j ≤ k) degree formula which is the result of applying an inference rule of L to some
members of Thk(L), then A ∈ Thk(L); and (3) nothing else is a member of Thk(L), i.e., only the formulas obtained by repeated applications of (1) and (2) are members of Thk(L). Obviously, the definition of the kth degree fragment of logic L is constructive. Note that the kth degree fragment of L does not necessarily include all kth degree logical theorems of L, because it is possible that deductions of some kth degree logical theorems of L must invoke logical theorems whose degrees are higher than k. On the other hand, the following obviously holds: Th0(L) ⊂ Th1(L) ⊂ ... ⊂ Thk−1(L) ⊂ Thk(L) ⊂ Thk+1(L) ⊂ ...

Let (F(L), |−L) be a formal logic system, P ⊂ F(L), and k and j two natural numbers. A formula A is said to be jth-degree-deducible from P based on Thk(L) if and only if there is a finite sequence of formulas f1, ..., fn such that fn = A and, for all i (i ≤ n), (1) fi ∈ Thk(L), or (2) fi ∈ P and the degree of fi is not higher than j, or (3) fi, whose degree is not higher than j, is the result of applying an inference rule to some members fj1, ..., fjm (j1, ..., jm < i) of the sequence. If P ≠ φ, then the set of all formulas which are jth-degree-deducible from P based on Thk(L) is called the jth degree fragment of the formal theory with premises P based on Thk(L), denoted by TjThk(L)(P). A formula is said to be jth-degree-deductive from P based on Thk(L) if and only if it is jth-degree-deducible from P based on Thk(L) but not (j−1)th-degree-deducible from P based on Thk(L). Note that in the above definitions we do not require j ≤ k. The notion of jth-degree-deductive can be used as a metric to measure the difficulty of deducing an empirical theorem from given premises P based on logic L; the difficulty is relative to the complexity of the problem being investigated as well as to the strength of the underlying logic L.

Based on the above definitions, we have the following result. Let TSRL(P) be a formal theory where SRL is a strong relevant logic, and let k and j be two natural numbers. If P is finite, then TjThk(SRL)(P) must be finite. This is also true even if TjThk(SRL)(P) is directly or indirectly inconsistent. This means that there exists a fixed point P′ such that P ⊆ P′ and TjThk(SRL)(P′) = P′, even if TjThk(SRL)(P) is directly or indirectly inconsistent. The above result about SRL does not hold for paradoxical logics such as CML and its various classical or non-classical conservative extensions, nor for the traditional (weak) relevant logics T, E, and R, because these logics accept implicational, conjunction-implicational, or disjunction-implicational paradoxes as logical theorems.

Let TSRL(P) be a formal theory where SRL is a strong relevant logic and k a natural number. SRL is said to be kth-degree-complete for TSRL(P) if and only if for any empirical theorem et of ThSRLe(P) there is a finite natural number j such that et is jth-degree-deducible from P based on Thk(SRL), i.e., all empirical theorems of ThSRLe(P) are somehow deducible from P based on Thk(SRL). Taking SRL, a strong relevant logic, as the fundamental logic, and constructing, say, the 3rd degree fragment of SRL in advance, for any given premises P we can find the fixed point TjTh3(SRL)(P). Since SRL is free of implicational, conjunction-implicational, and disjunction-implicational paradoxes, we can obtain finitely many meaningful empirical theorems as candidates for "new and interesting theorems" of TjTh3(SRL)(P). Moreover, if SRL is 3rd-degree-complete for TSRL(P),
then we can obtain all candidates for "new and interesting theorems" of TSRL(P). This is also true even if TSRL(P) is inconsistent. Of course, SRL may not be 3rd-degree-complete for TSRL(P); in this case, a fragment of SRL whose degree is higher than 3 must be used in order to find more empirical theorems. We have thus established a conceptual foundation for empirical theorem finding within the framework of strong relevant logic. The next problem is how to develop programs that find empirical theorems automatically. Since no backward and/or refutation deduction system can serve as an autonomous reasoning mechanism to form and/or discover completely new things, what we need is an autonomous forward reasoning system. We are developing an automated forward deduction system for general-purpose entailment calculus, named EnCal, which provides its users with the following major facilities [5]. For a logic L, which may be a propositional logic, a first-order predicate logic, or a second-order predicate logic, a non-empty set P of formulas as premises, a natural number k (usually k < 5), and a natural number j, all specified by the user, EnCal can perform the following tasks: (1) reason out all logical theorem schemata of the kth degree fragment of L; (2) verify whether or not a formula is a logical theorem of the kth degree fragment of L and, if so, give the proof; (3) reason out all empirical theorems of the jth degree fragment of the formal theory with premises P based on Thk(L); and (4) verify whether or not a formula is an empirical theorem of the jth degree fragment of the formal theory with premises P based on Thk(L) and, if so, give the proof. Now, for one or more given knowledge sources, we can first represent the explicitly known knowledge as logical formulas, then regard this set of formulas as the set of premises of a formal theory for the subject under investigation, and finally use relevant reasoning based on strong relevant logic and EnCal to find new knowledge from the premises.
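A degree-bounded forward deduction of the kind EnCal performs can be pictured as a simple saturation loop. The sketch below is only an outline under the definitions above, not the actual EnCal implementation: the helper names are hypothetical, the rule set is limited to Modus Ponens for the conditional, and "degree" is taken to be the nesting depth of the conditional connective.

```python
# Formulas as nested tuples; ("ent", A, B) is the conditional A => B.
def degree(formula):
    """Nesting depth of the conditional connective."""
    if formula[0] == "var":
        return 0
    if formula[0] == "ent":
        return 1 + max(degree(formula[1]), degree(formula[2]))
    return max(degree(sub) for sub in formula[1:])

def forward_deduce(premises, logical_theorems, max_degree_j, max_rounds=10):
    """Saturate premises + kth-degree logical theorems under Modus Ponens,
    keeping only derived formulas whose degree does not exceed j."""
    known = set(premises) | set(logical_theorems)
    for _ in range(max_rounds):                      # crude termination bound
        new = set()
        for f in known:
            if f[0] == "ent" and f[1] in known:      # Modus Ponens: from A=>B and A, infer B
                conclusion = f[2]
                if degree(conclusion) <= max_degree_j and conclusion not in known:
                    new.add(conclusion)
        if not new:
            break                                    # fixed point reached
        known |= new
    return known - set(logical_theorems)             # empirical theorems (plus premises)

A, B, C = ("var", "A"), ("var", "B"), ("var", "C")
premises = {A, ("ent", A, B), ("ent", B, C)}
print(forward_deduce(premises, logical_theorems=set(), max_degree_j=1))
# derives B and C in addition to the premises
```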
References
[1] Anderson, A.R., Belnap Jr., N.D.: Entailment: The Logic of Relevance and Necessity, Vol. I. Princeton University Press, Princeton (1975)
[2] Anderson, A.R., Belnap Jr., N.D., Dunn, J.M.: Entailment: The Logic of Relevance and Necessity, Vol. II. Princeton University Press, Princeton (1992)
[3] Cheng, J.: Logical Tool of Knowledge Engineering: Using Entailment Logic rather than Mathematical Logic. Proc. ACM 19th Annual Computer Science Conference. ACM Press, New York (1991) 228-238
[4] Cheng, J.: The Fundamental Role of Entailment in Knowledge Representation and Reasoning. Journal of Computing and Information, Vol. 2, No. 1, Special Issue: Proceedings of the 8th International Conference of Computing and Information. Waterloo (1996) 853-873
[5] Cheng, J.: EnCal: An Automated Forward Deduction System for General-Purpose Entailment Calculus. In: Terashima, N., Altman, E. (eds.): Advanced IT Tools, Proc. IFIP World Conference on IT Tools, IFIP 96 - 14th World Computer Congress. Chapman & Hall, London (1996) 507-514
[6] Cheng, J.: A Strong Relevant Logic Model of Epistemic Processes in Scientific Discovery. In: Kawaguchi, E., Kangassalo, H., Jaakkola, H., Hamid, I.A. (eds.): Information Modelling and Knowledge Bases XI. IOS Press, Amsterdam (2000) 136-159
[7] Diaz, M.R.: Topics in the Logic of Relevance. Philosophia Verlag, München (1981)
[8] Dunn, J.M., Restall, G.: Relevance Logic. In: Gabbay, D., Guenthner, F. (eds.): Handbook of Philosophical Logic, 2nd Edition, Vol. 6. Kluwer Academic, Dordrecht (2002) 1-128
[9] Gödel, K.: Russell's Mathematical Logic. In: Schilpp (ed.): The Philosophy of Bertrand Russell. Open Court Publishing Company, Chicago (1944)
[10] Hayes-Roth, F., Waterman, D.A., Lenat, D.B. (eds.): Building Expert Systems. Addison-Wesley, Boston (1983)
[11] Kneale, W., Kneale, M.: The Development of Logic. Oxford University Press, Oxford (1962)
[12] Mares, E.D., Meyer, R.K.: Relevant Logics. In: Goble, L. (ed.): The Blackwell Guide to Philosophical Logic. Blackwell, Oxford (2001) 280-308
[13] Read, S.: Relevant Logic: A Philosophical Examination of Inference. Basil Blackwell, Oxford (1988)
[14] Russell, B.: The Principles of Mathematics. 2nd edition. Cambridge University Press, Cambridge (1903, 1938). Norton Paperback edition. Norton, New York (1996)
[15] Sestito, S., Dillon, T.S.: Automated Knowledge Acquisition. Prentice Hall, Upper Saddle River (1994)
[16] Tarski, A.: Introduction to Logic and to the Methodology of the Deductive Sciences. 4th edition, Revised. Oxford University Press, Oxford (1941, 1946, 1965, 1994)
CONCEPTOOL: Intelligent Support to the Management of Domain Knowledge Ernesto Compatangelo and Helmut Meisel Department of Computing Science, University of Aberdeen AB24 3UE Scotland, UK {compatan,hmeisel}@csd.abdn.ac.uk
Abstract. We have developed ConcepTool, a system which supports the modelling and the analysis of expressive domain knowledge and thus of different kinds of application ontologies. The knowledge model of ConcepTool includes semantic constructors (e.g. own slots, enumerations, whole-part links, synonyms) separately available in most frame-based, conceptual, and lexical models. Domain knowledge analysis in ConcepTool explicitly takes into account the differences between distinct categories of concepts such as classes, associations, and processes. Moreover, our system uses lexical and heuristic inferences in conjunction with logical deductions based on subsumption in order to improve the quantity and the quality of analysis results. ConcepTool can also perform approximate reasoning by ignoring in a controlled way selected parts of the expressive power of the domain knowledge under analysis.
1
Context, Background and Motivations
Knowledge technologies are based on the concept of a "lifecycle" that encompasses knowledge acquisition, modelling, retrieval, maintenance, sharing, and reuse. Domain knowledge is currently a major application area for the lifecycle-oriented approach to knowledge management [15]. This area includes most ontologies, which are formal shared specifications of domain knowledge concepts [6]. Domain knowledge is continuously created and modified over time by diverse actors. The effective management of this knowledge thus requires its continuous Verification and Validation (V&V). Verification analyses whether the modelling intentions are met and thus whether the intended meaning matches the actual one. Validation analyses whether there is consistency, and thus lack of contradictions. In practice, sizeable domain knowledge bodies cannot be properly verified and validated (i.e. they cannot be semantically analysed) without computer-based support. However, semantic analysis requires computers to understand the structure and the lexicon of this knowledge, thus making its implications explicit (verification) and detecting inconsistencies in it (validation). In recent years, Description Logics (DLs) were proposed as a family of languages for representing domain knowledge in a way that enables automated reasoning about it [1]. However, DL reasoners (e.g. FaCT [10], (Power)Loom [12],
CLASSIC [3], RACER [9]) and, consequently, the analysis tools based on them (e.g. i•COM [8], OilEd [2]) do not fully support domain knowledge V&V. There are two main reasons why DL-based systems do not fully support the verification and the validation of domain knowledge (and application ontologies): – They do not really operate at the conceptual level, as DL reasoning algorithms fail to distinguish between different deductive mechanisms for diverse categories of concepts. For instance, the differences between hierarchies of classes and hierarchies of functions are not taken into account [4]. – They do not complement taxonomic deductions with other logical, lexical and heuristic inferences in order to maximise conceptual analysis. Therefore, they do not detect further (potential) correlations between concepts, attributes, associative links, and partonomic links [5]. We have developed ConcepTool, the Intelligent Knowledge Management Environment (IKME) described in this paper, in order to overcome the above drawbacks of current DL-based tools for domain knowledge analysis.
2
The Knowledge Model of CONCEPTOOL
Existing modelling-oriented environments aim to provide an adequate expressive power to represent and combine a rich variety of knowledge. The modelling-oriented focus on expressiveness is not influenced by the potential undecidability or by the potential complexity of semantic analysis. Conversely, existing analysis-oriented environments aim to provide an adequate set of specialised reasoning services within a decidable, complete and computationally tractable framework. However, the current analysis-oriented focus on worst-case decidability, completeness and tractability actually limits the allowed knowledge expressiveness. Research on ontologies has highlighted that expressive domain knowledge models are needed to enable sharing and reuse in distributed heterogeneous contexts [7]. This implies allowing incomplete (approximate) deductions whenever decidable, complete and tractable reasoning is not theoretically or practically possible [14]. ConcepTool provides a "reasonable framework" for domain knowledge which supports expressive modelling and "heterogeneous" analysis. This framework is based on a modular approach that (i) expresses domain concepts by composing basic semantic constructors and (ii) combines diverse kinds of specialised inferences about concepts. The constructors that characterise the knowledge model of ConcepTool are described below, while the heterogeneous approach to reasoning used in our system is discussed in the following section. In devising the knowledge model (and the expressiveness) of ConcepTool, we focused on the following aspects, which enhance its semantic interoperability:
– Allow the analyst to specify domain knowledge at the "conceptual level", using different concept types such as classes, associations, functions, goals.
– Include constructors used in frame-based, conceptual, and lexical knowledge models (e.g. own slots, enumerations, whole-part links, synonyms).
– Introduce the notion of individual, which generalises the notion of instance to denote an element that is not necessarily a member of an explicitly defined concept in the considered domain.
We describe the ConcepTool knowledge model using the resource ontology in Figure 1, which integrates semantic constructors from frame-based, DL-based, conceptual, and lexical models. This ontology includes three kinds of elements, namely concepts (i.e. class and individual frames), roles (i.e. slots) and properties of inheritable concept-role links (i.e. facets of inheritable slots). All elements have non-inheritable properties (own slots). Concepts also have inheritable properties (template slots), which constrain them to other concepts as specified by facets. Concepts can be interpreted either as individuals or as sets of individuals (i.e. classes, associations, functions, etc.). Roles are interpreted as sets of binary links. Concept-role links (slot attachments) are interpreted as constraints on inheritable concept properties (e.g. attributes, parts, associative links, global axioms). Concept and role constructors (slots and facets) are interpreted according to the standard set-theoretic semantics used in description logics [1]. Non-inheritable concept properties (own slots) include, among others, the following constructors; the first three of these are not applicable to individuals.
– The concept type specifies whether the concept is a class, an association, a process, or an individual belonging to one of these three categories.
– The concept definition states whether the concept is totally or only partially specified by its description. In the latter case, the concept is a generic subset of the (dummy) set fully specified by the description.
– The instantiability states whether the concept can have any instances.
– An enumeration explicitly defines a concept as the set characterised by the listed individuals. Note that concepts defined in this way (i.e. starting from the empty set, bottom, and adding individuals) cannot be compared with concepts defined starting from the universe (Top or Thing) and adding constraints. In other words, no subsumption relationship can be derived between concepts belonging to these two complementary groups.
Each labelled field defines a (key name, value) pair that can be used to introduce a new own property such as author, creation date, or version. All these user-definable properties can be used for different knowledge management purposes (e.g. versioning, annotations, retrieval). Inheritable concept properties (template slots) include the following constructors. Attributes are individual properties shared by all the individuals that belong to the same set concept. Parts are either individual components in structured individuals or set concept components in structured set concepts. Associative links are bi-directional connections between different types of concepts (e.g. classes and associations). Global axioms are mutual constraints between set concepts (e.g. disjointness or full coverage in terms of some subclasses). Non-inheritable role properties include inverse roles, used to define bi-directional associative links, and synonym roles, used to specify property name aliases.
Restrictions on inheritable concept-role links (i.e. slot facets) include the following constructors. Minimum and maximum cardinalities constrain the multiplicity of a property (i.e. its lower and upper number of allowed values). Fillers constrain all individuals belonging to a concept to share the same value(s) for a property. The domain of an attribute constrains all its values to be selected from the class (or combination of classes) specified as domain. A domain enumeration explicitly lists all the allowed values (individuals) in the domain.
Fig. 1. The ConcepTool resource ontology.
(a) Resource ontology overview: Concepts (frames), Roles (slot frames) and Concept-Role Links (frame-slot attachments), connected by 0:N links; roles carry role properties (slots on slot frames), local restrictions on roles (facets) and global restrictions on roles.
(b) Concepts. Non-inheritable properties (own slots): Concept Type, 1:1 {Class | ...}; Name, 1:1 String; Definition, 1:1 {Partial | Total}; Superconcepts, 0:M Concept; Instances, 0:M Concept Instance; Instantiability, 1:1 {Concrete | Abstract}; Enumeration, 0:M Individual; Labelled Fields, 0:M (Key, Value) Pair. Inheritable properties (template slots): Attributes, 0:M Concept Attribute; Parts, 0:M Concept Part; Associative Links, 0:M Concept; Global Axioms, 0:M Assertion.
(c) Concept-role links. Restrictions on inheritable properties (facets): Min Cardinality, 1:1 Integer; Max Cardinality, 1:1 Integer; Fillers, 0:M Individual; Domain, 0:1 Class (and|or Class)*; Domain Enumeration, 0:M Individual.
(d) Roles. Non-inheritable properties (own slots): Name, 1:1 String; Superroles, 0:M Global Role; Inverse Role, 0:1 Global Role; Synonym Roles, 0:M Global Role; Labelled Fields, 0:M (Key, Value) Pair.
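As a rough illustration of this knowledge model, the following sketch encodes concepts, roles, and facet-restricted concept-role links as plain data structures. It is a hypothetical rendering for exposition only; the class and field names are not the actual ConcepTool API.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Role:
    # Non-inheritable role properties (own slots)
    name: str
    superroles: list[str] = field(default_factory=list)
    inverse_role: Optional[str] = None
    synonym_roles: list[str] = field(default_factory=list)

@dataclass
class ConceptRoleLink:
    # Restrictions on an inheritable concept-role link (facets)
    role: str
    min_cardinality: int = 0
    max_cardinality: Optional[int] = None      # None = unbounded
    fillers: list[str] = field(default_factory=list)
    domain: Optional[str] = None               # class (or boolean combination of classes)
    domain_enumeration: list[str] = field(default_factory=list)

@dataclass
class Concept:
    # Non-inheritable properties (own slots)
    name: str
    concept_type: str = "Class"                # Class | Association | Process | Individual
    definition: str = "Partial"                # Partial | Total
    superconcepts: list[str] = field(default_factory=list)
    instantiability: str = "Concrete"          # Concrete | Abstract
    enumeration: list[str] = field(default_factory=list)
    labelled_fields: dict[str, str] = field(default_factory=dict)
    # Inheritable properties (template slots)
    attributes: list[ConceptRoleLink] = field(default_factory=list)
    parts: list[str] = field(default_factory=list)
    associative_links: list[str] = field(default_factory=list)
    global_axioms: list[str] = field(default_factory=list)

person = Concept(name="Person",
                 attributes=[ConceptRoleLink(role="surname", min_cardinality=1,
                                             max_cardinality=1, domain="String")])
```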
Individuals in ConcepTool either explicitly or implicitly commit to the structure of some set concepts. Individuals directly belonging to a set concept are denoted as concept instances, while those belonging to any of its subsets are denoted as concept elements. This interpretation of individuals extends the one in DL-based and in frame-based approaches.
3
Knowledge Management Services in CONCEPTOOL
In devising the services to be provided by ConcepTool, we focused on a wide range of inferences and management support functionalities for very expressive domain knowledge. Our previous research highlighted the benefits gained (i) by combining different kinds of inferences and/or (ii) by including incomplete deductions and/or (iii) by reducing the expressiveness of the knowledge subject to analysis [5, 11]. We have incorporated all these features in ConcepTool, which currently combines functionalities from three different levels: – At the epistemological level, deductions from DL engines are used as the fundamental logic-based analysis services. These deductions are considered as a kind of reference ontology that can be compared with the domain knowledge under analysis. However, DL engines do not correctly compute subsumption (and thus the analysis results based on it) between concepts with “non-standard” hierarchical rules, such as associations (relationships) and functions. Therefore, conceptual emulators [4] are used to transform DL deductions into the result expected at the conceptual level whenever needed. – At the conceptual level, deductions include (i) the generation of different concept hierarchies (one for each concept type), (ii) the identification of inconsistent concepts, and (iii) the explicitation of new concept constraints and properties which are not explicitly stated in a domain knowledge body. The possibility of setting a specific level of expressive power to be used in conceptual reasoning allows the generation of further (and / or different) analysis results. This is performed by selecting those semantic constructors (e.g. attribute multiplicities, domains, or fillers) that can be ignored while reasoning with a domain knowledge body. The reduction of the expressive power subject to analysis is particularly useful for aligning and articulating domain knowledge during the early stages of ontology sharing and reuse [5]. – At the lexical level, deductions include lexical subsumption, synonymity, antonymity, and partonymity provided by lexical databases such as WordNet [13]. These deductions are complemented by a set of lexical heuristic inferences based on string containment. Inferential results from the three levels are presented to the analyst who can decide which ones to accept and which ones to reject. Both accepted and rejected inferences can be incorporated into the knowledge body under analysis by adding new concept constraints or by removing existing ones. The different kinds of epistemological, conceptual and lexical inferences can be performed in any user-defined sequence in order to maximise analysis results.
In other words, the user (i) decides which inferential services to invoke, (ii) compares the results of each invoked service, and (iii) decides which step to perform next. However, (partial) results from a specific inferential service (e.g. a constraint satisfaction check) could be automatically passed to another inferential service (e.g. a subsumption computation algorithm). In this way, a reasoning system could derive results about expressive knowledge that cannot be completely interpreted and processed by each single inferential service alone. The combination of different kinds of inferences has been performed with a prototypical version of ConcepTool which dynamically fuses taxonomic deductions from a DL engine with inferences from a constraint satisfaction reasoner [11]. In this approach, constraints on (numeric) concrete domains are transformed into a hierarchy of concrete domain classes which are then used to compute subsumption on the whole domain knowledge base. Preliminary results showed how this "inference fusion" mechanism can successfully extend the automated reasoning capabilities of a DL engine. Since the inferences provided by the constraint satisfaction reasoner are by no means complete, this same approach also shows that incomplete deductions are nonetheless useful for deriving meaningful (although potentially incomplete) analysis results. ConcepTool also provides an environment where different domain knowledge bodies can share the lexical namespace of their concept properties (attributes, associative links, partonomic links). Synonyms (e.g. last name and surname) are first proposed by lexical databases and then validated by ConcepTool on the basis of their mutual structural features. Full synonymity between two properties with lexical synonymity is rejected if their structural features (e.g. super-properties, inverse) are different. This approach, where distinct domain knowledge bases can share the same project space, and possibly the namespace associated to the project, explicitly supports both cooperative domain knowledge development and the alignment of existing domain knowledge bodies.
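The synonym-validation step just described can be pictured as a two-stage filter: a lexical proposal followed by a structural veto. The sketch below is illustrative only; the predicate names are hypothetical and the lexical-database lookup is stubbed out.

```python
# Two properties are accepted as full synonyms only if (1) a lexical resource
# proposes them as synonyms and (2) their structural features do not clash.
def lexically_synonymous(name_a: str, name_b: str) -> bool:
    # Stub: in ConcepTool this information would come from a lexical database
    # such as WordNet; here we hard-code one example pair.
    return {name_a.lower(), name_b.lower()} == {"last name", "surname"}

def structurally_compatible(prop_a: dict, prop_b: dict) -> bool:
    # Reject full synonymity if super-properties or inverse roles differ.
    return (set(prop_a.get("superroles", [])) == set(prop_b.get("superroles", []))
            and prop_a.get("inverse") == prop_b.get("inverse"))

def full_synonyms(prop_a: dict, prop_b: dict) -> bool:
    return (lexically_synonymous(prop_a["name"], prop_b["name"])
            and structurally_compatible(prop_a, prop_b))

p1 = {"name": "last name", "superroles": ["name"], "inverse": None}
p2 = {"name": "surname", "superroles": ["name"], "inverse": None}
print(full_synonyms(p1, p2))   # True: lexical synonyms with matching structure
```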
4
Discussion and Future Work
ConcepTool introduces three major novelties with respect to existing intelligent knowledge management environments. Firstly, its expressive knowledge model contains semantic components, such as individuals, which are neither jointly nor separately available in other models. Secondly, ConcepTool performs deductions at the conceptual level, thus providing correct analysis results for non-standard concepts like associations and functions. It also allows selected properties and restrictions to be ignored while reasoning. Moreover, it complements conceptual deductions with both lexical and heuristic inferences of different kinds. Thirdly, the ConcepTool interface explicitly supports cooperative domain knowledge development by providing an environment where new knowledge bodies can be directly linked to (and later decoupled from) existing ontologies or other reference knowledge bases.
The modelling and analysis functionalities provided in ConcepTool have been successfully tested using a number of sizeable domain ontologies. A preliminary version that demonstrates some relevant features of our system can be downloaded from http://www.csd.abdn.ac.uk/research/IKM/ConcepTool/. Future releases are expected to include (i) functionalities to import from / export to frame-based and DL-based environments using XML and RDF(S), (ii) a more general version of the articulation functionality already available in a previous version of ConcepTool that uses an entity-relationship knowledge model [5], and (iii) inference fusion for approximate reasoning with complex constraints [11].
Acknowledgements This work is supported by the British Engineering & Physical Sciences Research Council (EPSRC) under grants GR/R10127/01 and GR/N15764 (IRC in AKT).
References
[1] F. Baader et al., editors. The Description Logic Handbook. Cambridge University Press, 2003.
[2] S. Bechhofer et al. OilEd: a Reason-able Ontology Editor for the Semantic Web. In Proc. of the Joint German/Austrian Conf. on Artificial Intelligence (KI'2001), number 2174 in Lecture Notes in Computer Science, pages 396–408. Springer-Verlag, 2001.
[3] R. J. Brachman et al. Reducing CLASSIC to practice: knowledge representation meets reality. Artificial Intelligence, 114:203–237, 1999.
[4] E. Compatangelo, F. M. Donini, and G. Rumolo. Engineering of KR-based support systems for conceptual modelling and analysis. In Information Modelling and Knowledge Bases X, pages 115–131. IOS Press, 1999. Revised version of a paper published in the Proc. of the 8th European-Japanese Conf. on Information Modelling and Knowledge Bases.
[5] E. Compatangelo and H. Meisel. EER-ConcepTool: a "reasonable" environment for schema and ontology sharing. In Proc. of the 14th IEEE Intl. Conf. on Tools with Artificial Intelligence (ICTAI'2002), pages 527–534, 2002.
[6] A. Duineveld et al. Wondertools? A comparative study of ontological engineering tools. In Proc. of the 12th Banff Knowledge Acquisition for Knowledge-based Systems Workshop (KAW'99), 1999.
[7] R. Fikes and A. Farquhar. Distributed Repositories of Highly Expressive Reusable Ontologies. Intelligent Systems, 14(2):73–79, 1999.
[8] E. Franconi and G. Ng. The i•com Tool for Intelligent Conceptual Modelling. In Proc. of the 7th Intl. Workshop on Knowledge Representation meets Databases (KRDB'00), 2000.
[9] V. Haarslev and R. Möller. RACER System Description. In Proc. of the Intl. Joint Conf. on Automated Reasoning (IJCAR'2001), pages 701–706, 2001.
[10] I. Horrocks. The FaCT system. In Proc. of the Intl. Conf. on Automated Reasoning with Analytic Tableaux and Related Methods (Tableaux'98), number 1397 in Lecture Notes in Artificial Intelligence, pages 307–312. Springer-Verlag, 1998.
[11] B. Hu, I. Arana, and E. Compatangelo. Facilitating taxonomic reasoning with inference fusion. In Knowledge Based Systems, to appear. Revised version of the paper published in the Proc. of the 22nd Intl. Conf. of the British Computer Society's Specialist Group on Artificial Intelligence (ES'2002).
[12] R. MacGregor. A description classifier for the predicate calculus. In Proc. of the 12th Nat. Conf. on Artificial Intelligence (AAAI'94), pages 213–220, 1994.
[13] G. A. Miller. WordNet: a Lexical Database for English. Comm. of the ACM, 38(11):39–41, 1995.
[14] H. Stuckenschmidt, F. van Harmelen, and P. Groot. Approximate terminological reasoning for the semantic web. In Proc. of the 1st Intl. Semantic Web Conference (ISWC'2002), 2002.
[15] The Advanced Knowledge Technologies (AKT) Consortium. The AKT Manifesto, Sept. 2001. http://www.aktors.org/publications
Combining Revision Production Rules and Description Logics Chan Le Duc and Nhan Le Thanh Laboratoire I3S, Université de Nice - Sophia Antipolis, France
[email protected] [email protected]
Abstract. Knowledge in every application area is heterogeneous and evolves over time. Therefore, a combination of several formalisms, on which revision operations are defined, is required for knowledge representation in such a hybrid system. In this paper, we introduce several revision policies for DL-based knowledge bases and show how revision operators are computed for the DL language ALE. From this, we define revision production rules, whose consequent is a revision operator. Such rules allow us to represent context rules in the translation between DL-based ontologies. Finally, we introduce the formal semantics of the formalism combining revision production rules and Description Logics (ALE).
1
Introduction
Description Logics can be used as a formalism for the design of ontologies of an application domain. In order to reduce the size of ontologies, different ontologies for different subdomains or user profiles are derived from a shared common ontology. The shared concepts in the common ontology can be redefined in a derived ontology with the aim of fitting an adaptable context. This is necessary to determine a more precise meaning of a shared concept in the context of the current users. The redefinition can be performed at run-time by means of context rules. The antecedent of a context rule is a predicate which describes a context condition, and its consequent is an operation which translates one concept definition into another. For this reason, we need another formalism to capture the semantics of such ontologies, namely specific production rules, called revision production rules, whose consequent is an operation allowing a shared concept definition to be changed. Previous hybrid languages, for example CARIN [4], combine Horn rules and Description Logics to obtain a formalism whose expressiveness is significantly improved. However, it seems that these languages cannot capture context rules, since the consequent of a Horn rule is still a predicate. On the other hand, since the definition of a concept in the knowledge base can be modified by the application of a revision production rule, revision operations must be defined on the DL-based terminology. Previous works on how to define revision operations have concentrated on the operations TELL (to add a concept description fragment to a concept definition), FORGET (to delete a concept description fragment
from the concept definition), and on simple languages which do not permit existential restrictions [1]. Moreover, the operation FORGET requires that a deleted fragment be part of the concept definition to be revised. Thus, the main contributions of this paper are, first, to propose several revision policies, some of which require the definition of a new operation, namely PROPAG, in order to conserve subsumption relationships, and, second, to extend revision operations to the language ALE. This extension will allow us to define the semantics of a formalism combining revision production rules and the language ALE.
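To fix intuitions before the formal development, here is a toy sketch of what a revision production rule looks like: a context predicate (antecedent) paired with a revision operation (consequent). The names are hypothetical, and concept definitions are deliberately simplified to sets of conjuncts.

```python
from dataclasses import dataclass
from typing import Callable

# A concept definition is represented here as a set of conjuncts (simple concepts).
Ontology = dict[str, set[str]]

@dataclass
class RevisionRule:
    condition: Callable[[dict], bool]      # context predicate (antecedent)
    operator: str                          # "TELL" or "FORGET" (consequent)
    concept: str
    expr: str

def apply_rule(rule: RevisionRule, context: dict, ontology: Ontology) -> None:
    if not rule.condition(context):
        return
    if rule.operator == "TELL":
        ontology[rule.concept].add(rule.expr)        # C := C ⊓ Expr
    elif rule.operator == "FORGET":
        ontology[rule.concept].discard(rule.expr)    # remove the conjunct if present

# A derived ontology specialises the shared concept "Document" for expert users.
ontology: Ontology = {"Document": {"Resource"}}
rule = RevisionRule(condition=lambda ctx: ctx.get("profile") == "expert",
                    operator="TELL", concept="Document", expr="hasTechnicalAnnex")
apply_rule(rule, {"profile": "expert"}, ontology)
print(ontology["Document"])   # {'Resource', 'hasTechnicalAnnex'}
```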
2
Knowledge Base Revision and Revision Policies
2.1 Preliminaries
Revision Operators. Informally, TELL is interpreted as an operation which adds knowledge to a knowledge base ∆, whereas FORGET deletes knowledge from the knowledge base ∆. If we use a DL language that allows concept descriptions to be normalized and concept definitions to be rewritten as conjunctions of simple concepts, the revision operators can be written more formally as follows:
– TELL(C, Expr) := C ⊓ Expr
– FORGET(C ⊓ Expr, Expr) := C
where C is the concept to be revised and Expr is a simple concept. In order to have reasonable definitions of the revision operators, it is paramount that several constraints be respected. However, if all of these constraints are to be respected, certain problems are inevitable. In the following, we attempt to identify some of the important constraints for the revision operators.
Criteria for Revision. The majority of these criteria are proposed and discussed in [1]. We only recall and reformulate some of them for the purposes of this paper.
1. Adequacy of the Request Language. The request language should be interpretable in the formalism the KB uses.
2. Closure. Any revision operation should lead to a state of the KB representable by the formalism used.
3. Success. This criterion requires that any revision operation must be successful, i.e., after a TELL the new knowledge (Expr) must be derivable from the KB, and after a FORGET the deleted knowledge (Expr) must no longer be derivable from the KB. More formally, there exists an interpretation I and an individual aI of the domain OI such that aI ∈ CI and aI ∉ (TELL(C, Expr))I. Similarly, there exists an interpretation I and an individual aI of the domain OI such that aI ∉ CI and aI ∈ (FORGET(C, Expr))I.
4. Minimal Change. This criterion requires that revision operations cause a minimal change of the KB. Obviously, this needs a definition of the distance between two KBs. Therefore, a pragmatic consideration is necessary in each real application.
5. Subsumption Conservation. The terminological subsumption hierarchy of the KB should be conserved by any revision operation.
A priori, criteria 4 and 5 are difficult to guarantee simultaneously, since conserving subsumption may force changes to a large part of the hierarchy.
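Before turning to the policies, the two operators and the success criterion can be made concrete over a toy representation of a definition as a set of conjuncts. This is an illustrative sketch only, not the ALE algorithms developed in the paper.

```python
# Concept definitions as frozensets of conjuncts.
def tell(definition: frozenset, expr: str) -> frozenset:
    return definition | {expr}                       # TELL(C, Expr) := C ⊓ Expr

def minus(definition: frozenset, expr: str) -> frozenset:
    return definition - {expr}                       # MINUS: delete Expr from C

def forget(definition: frozenset, expr: str) -> frozenset:
    return minus(definition, expr)                   # FORGET(C ⊓ Expr, Expr) := C

def subsumed_by(c: frozenset, d: frozenset) -> bool:
    """C ⊑ D in this toy model: every conjunct of D appears in C."""
    return d <= c

C = frozenset({"Person", "Employed"})
C_told = tell(C, "hasDegree")
# Success for TELL: the operation is not vacuous, i.e. TELL(C, Expr) is strictly
# more specific than C (some model has an individual in C but not in the result).
print(subsumed_by(C_told, C) and not subsumed_by(C, C_told))   # True

C_forgot = forget(C, "Employed")
# Success for FORGET: the result is strictly more general than C.
print(subsumed_by(C, C_forgot) and not subsumed_by(C_forgot, C))  # True
```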
2.2 Revision Policies
In the previous subsection, we introduced the general principles for revision of a DL-based knowledge base. In certain cases there may exist a principle that we cannot entirely adhere to. We now present pragmatic approaches in which some of these criteria are respected while the others are not. In order to facilitate the choice of policy made by the user, we attempt to identify their advantages and disadvantages. The first policy takes criterion 5 (Subsumption Conservation) into account, whereas the second policy allows us to change the subsumption hierarchy. The third policy can be considered as a compromise between the first and the second.
Conservative Policy. This policy allows us to modify the definition of a concept C (a concept description) by adding a fragment to, or deleting a fragment from, the definition. Concept C and all concepts defined via C then have a new definition. However, the modification of C should not violate any subsumption relationship; otherwise, no operation is to be performed. By this we mean that if A ⊑ B holds before the modification for some concepts A, B in the Tbox, then A ⊑ B still holds after the modification. This requirement is in accordance with the viewpoint in which the subsumption relationships are considered invariant knowledge of the KB. This condition is clearly compatible with the object-oriented database model, provided that the hierarchical structure of classes is persistent, although class definitions may be changed. In this model, each time a class definition is changed, all existing instances of the old class must remain instances of the new class. In fact, the evolution of our understanding of the world leads to more precise definitions. In other words, we need to change the definition of a concept C while all subsumption relationships and all assertions C(a) in the Abox remain valid. This can lead to some incompatibilities between the considered criteria for definitions of inferences in the open world. Indeed, suppose that the revision operator TELL is defined as follows: TELL(C, Expr) := C ⊓ Expr. According to the assertion conservation proposed by this policy, if CI(aI) holds then (TELL(C, Expr))I(aI) holds as well, for every interpretation I. This means that C ⊑ TELL(C, Expr). Hence C ≡ TELL(C, Expr), since TELL(C, Expr) ⊑ C holds for every interpretation I. This contradicts criterion 3 (success). In order to avoid this incompatibility, the addition of an epistemic operator can be required [5].
We now try to provide a framework for the revision operators. Suppose that the Tbox is acyclic and does not contain general inclusions. Since the conservation of subsumption relationships is required in this policy, a modification of C may have to be propagated throughout the Tbox. In the following, we investigate this propagation
for each operator TELL and FORGET. The definitions of TELL, FORGET and PROPAG with respect to criteria 3 and 4 (success and minimal change) for a concrete language are investigated in the next section.

TELL(C, Expr). Obviously, this operation makes a concept D more specific where D contains concept C in its definition. On the other hand, concepts not containing concept C may also have to be modified because of the conservation of the subsumption relationships. The unfolded definition of a D containing C is written as D̃(C) and the unfolded definition of a D not containing C is written as D̃(C̄). D̃(C) and D̃(C̄) can be obtained by replacing all concept names by their definitions except concept C. Let E be a direct super concept of F in the hierarchy. There are two cases: i) if E = Ẽ(C̄), then the relationship F ⊑ E is automatically conserved, since E is not changed and F is either not changed or becomes more specific; ii) if either F = F̃(C̄) and E = Ẽ(C), or F = F̃(C) and E = Ẽ(C), we need a new operator, namely PROPAG TELL, that allows us to compute a new definition of F such that the subsumption relationship is conserved. Operator PROPAG TELL should be defined with respect to criterion 4 in some way.

FORGET(C, Expr). This operator is, in fact, computed via another operator, namely MINUS, that can be interpreted as a deletion of Expr from C. In the case where C = C' ⊓ Expr, we must have MINUS(C, Expr) = C'. As with TELL, the modification made by FORGET may be propagated throughout the Tbox.

Revision of Literal Definition

This revision policy was proposed in [1]. The main idea of this policy is to allow us to modify the definition of a concept without taking subsumption relationships into account. In addition, it requires revising the Abox with respect to the new Tbox, because a modification in the Tbox may cause inconsistencies in the Abox. This policy is convenient for the case in which we need to correct a misunderstanding in the terminology design. The individuals of a "bad" definition then belong to a better definition. In fact, in this approach concept definitions are considered as literal knowledge, while subsumption relationships are derived from these definitions. As a result, a reclassification of the concept hierarchy can be required after a revision. Note that this policy does not apply operator PROPAG, since subsumption conservation is omitted. Operators TELL(C, Expr) and FORGET(C, Expr) are defined by means of the operator MINUS described above.

Revision on Abox. An important task in this policy is to revise the Abox when the Tbox is changed. All assertions D(a) where D is defined directly or indirectly via C must be revised. It is clear that operator FORGET does not cause any inconsistency in the Abox, since if F(C)(a) holds, where F(C) is a concept description via C, then FORGET(F(C), Expr)(a) holds as well. In contrast, there may be an inconsistency in the Abox after TELL(C, Expr). In this case, the revision on the Abox requires an inference which is based on the added part Expr. In particular, D(a) holds for some concept D = F(C) = C iff Expr(a) holds.
Mixed Policy

As presented above, this policy, namely the mixed policy, can be considered as a compromise between the two policies already presented. It allows us to modify the definition of a concept while respecting all subsumption relationships. Moreover, following a revision on the Tbox, a revision on the Abox is necessary. Technically, this policy does not require any additional operator in comparison with the two policies above.
3 Revision Operators for Language ALE

3.1 Preliminaries
Let NC be a set of primitive concepts and NR be a set of primitive roles. Language ALE uses the following constructors to build concept descriptions: conjunction (C ⊓ D), value restriction (∀r.C), existential restriction (∃r.C), primitive negation (¬P), and the concepts TOP (⊤) and BOTTOM (⊥). Let ∆ be a non-empty set of individuals. Let .I be a function that maps each primitive concept P ∈ NC to PI ⊆ ∆ and each primitive role r ∈ NR to rI ⊆ ∆ × ∆. The semantics of a concept description is inductively defined from the interpretation I = (∆, .I) in the table below.

Syntax            Semantics
⊤                 ∆
⊥                 ∅
C ⊓ D             CI ∩ DI
∀r.C, r ∈ NR      {x ∈ ∆ | ∀y: (x, y) ∈ rI → y ∈ CI}
∃r.C, r ∈ NR      {x ∈ ∆ | ∃y: (x, y) ∈ rI ∧ y ∈ CI}
¬P, P ∈ NC        ∆ \ PI
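As an illustration of the table above, the sketch below evaluates the extension of an ALE concept description over a small finite interpretation. The nested-tuple concept syntax and the dictionaries for primitive concepts and roles are our own assumptions for the example, not part of the paper.

# Sketch: extensions of ALE concepts over a finite interpretation (∆, ·I).
# Concepts are nested tuples: ("top",), ("bot",), ("prim", "P"), ("not", "P"),
# ("and", C, D), ("all", "r", C), ("some", "r", C).

def ext(concept, delta, prim, roles):
    tag = concept[0]
    if tag == "top":
        return set(delta)
    if tag == "bot":
        return set()
    if tag == "prim":
        return set(prim.get(concept[1], set()))
    if tag == "not":                      # primitive negation ¬P
        return set(delta) - prim.get(concept[1], set())
    if tag == "and":                      # C ⊓ D
        return ext(concept[1], delta, prim, roles) & ext(concept[2], delta, prim, roles)
    if tag == "all":                      # ∀r.C
        r, c = roles.get(concept[1], set()), ext(concept[2], delta, prim, roles)
        return {x for x in delta if all(y in c for (a, y) in r if a == x)}
    if tag == "some":                     # ∃r.C
        r, c = roles.get(concept[1], set()), ext(concept[2], delta, prim, roles)
        return {x for x in delta if any(y in c for (a, y) in r if a == x)}
    raise ValueError(tag)

delta = {"a", "b", "c"}
prim = {"Q": {"c"}}
roles = {"r": {("a", "b"), ("b", "c")}}
# ∃r.∃r.Q — its extension here is exactly {"a"}
print(ext(("some", "r", ("some", "r", ("prim", "Q"))), delta, prim, roles))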
Least Common Subsumer (LCS) [8]

Subsumption. Let C, D be concept descriptions. D subsumes C, written C ⊑ D, iff CI ⊆ DI for all interpretations I.

Least Common Subsumer. Let C1, C2 be concept descriptions in a DL language. A concept description C is a least common subsumer of C1, C2 (for short lcs(C1, C2)) iff i) Ci ⊑ C for all i, and ii) C is the least concept description which has this property, i.e. if C' is a concept description such that Ci ⊑ C' for all i, then C ⊑ C'. According to [8], the lcs of two or more descriptions always exists for language ALE, and the authors propose an exponential algorithm for computing it.
Language ALE and Structural Subsumption Characterizing The idea underlying the structural characterization of subsumption proposed in [8] is to transform concept descriptions into a normal form. This normal form allows one to build a description tree the nodes of which are labeled by sets of primitive concepts. The edges of this tree correspond to value restrictions, namely ∀r-edge and existential restrictions, namely r-edge. Hence, the subsumption relationship of two concept descriptions can be translated into the existence of a homomorphism between the two corresponding description trees. The following theorem results from this idea. Theorem for structural subsumption characterization [8] Let C, D be ALEconcept descriptions. Then, C D if and only if there exists a homomorphism to GC . from GD Simple Concept Descriptions in Tbox In a Tbox, a concept definition can depend on other concept definitions. This dependence must be conserved by revision operations. For this reason, we cannot define simple concept as a conjunct in unfolded form. In what following, we propose a simple concept definition which fits the idea above. An ALE-concept description is called simple if it does not contain any conjunction on the top-level. A concept description is called normal if it is a conjunction of simple concept descriptions. We denote SC as a set of simple concept descriptions of concept C in a TBox. Note that a simple concept can be considered part of the meaning of a concept. As a consequence, the sets of simple concepts of the equivalent concepts can be different. A discussion in detail of this subject is given in [1]. 3.2
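The description trees referred to above can be represented directly in code. The sketch below builds, from the tuple syntax of the previous sketch, a tree whose nodes carry sets of primitive concepts and whose edges are labelled ∃r or ∀r; the homomorphism test that actually characterizes subsumption is not implemented here, since the full ALE algorithm of [8] requires additional normalization.

# Sketch: ALE description trees G_C — nodes labelled with sets of (possibly
# negated) primitive concepts, edges labelled ("some", r) or ("all", r).

class Node:
    def __init__(self):
        self.label = set()       # primitive concepts at this node
        self.edges = []          # list of ((kind, role), child Node)

def build_tree(concept, node=None):
    node = node or Node()
    tag = concept[0]
    if tag == "prim":
        node.label.add(concept[1])
    elif tag == "not":
        node.label.add("¬" + concept[1])
    elif tag == "and":
        build_tree(concept[1], node)
        build_tree(concept[2], node)
    elif tag in ("some", "all"):
        child = build_tree(concept[2])
        node.edges.append(((tag, concept[1]), child))
    return node

def show(node, depth=0):
    print("  " * depth + str(sorted(node.label)))
    for (kind, role), child in node.edges:
        print("  " * depth + f"--{kind} {role}-->")
        show(child, depth + 1)

# F := ∃r.∃r.(Q ⊓ R), the concept F of Example 1 below
f = ("some", "r", ("some", "r", ("and", ("prim", "Q"), ("prim", "R"))))
show(build_tree(f))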
3.2 Operator TELL
Operator TELL. Let C1 ⊓ ... ⊓ Cn be the normal form of C and Expr be a simple concept. The operator TELL is defined as follows:
If C ⊑ Expr then TELL(C, Expr) := C
Otherwise, TELL(C, Expr) := C ⊓ Expr and SD = SC ∪ {Expr}
If the knowledge to be added is derivable from C before the execution of TELL, i.e. C ⊑ Expr, nothing more remains to be done and the KB is not changed. Otherwise, we need to guarantee that the added knowledge is derivable from C after the execution of TELL. In fact, since TELL(C, Expr) = C ⊓ Expr, we have TELL(C, Expr) ⊑ Expr. Moreover, if C ⋢ Expr, this definition of operator TELL satisfies criterion 3 (success). Criterion 4 (minimal modification) is semantically satisfied, since C ⊓ Expr is the most general concept description which is subsumed by both C and Expr.

Operator PROPAG TELL. We will use the notions presented in Section 2.2. In addition, we denote by E(C, D) and Ẽ(C, D) the original and unfolded forms in which each occurrence of C is replaced by D. Let E be a direct super concept of a concept
F in the hierarchy. Operator PROPAG TELL(E, F, Expr) will compute a new concept description of F while taking into account criterion 4 (minimal change) and F ⊑ E.
Assume that E = Ẽ(C). We denote by S^0_E(C) the set of all sE ∈ SE that depend on C, denoted sE(C), and by S^1_E(C̄) the set of all sE ∈ SE that do not depend on C, denoted sE(C̄). We have:
PROPAG TELL(E, F, Expr) := F̃(C, C ⊓ Expr) ⊓ (⊓_{sE ∈ S^0_E(C)} sE(C, C ⊓ Expr))
As a consequence, PROPAG TELL(E, F, Expr) ⊑ E. Note that PROPAG TELL(E, F, Expr) is the most general concept description which is subsumed by both F̃(C, C ⊓ Expr) and Ẽ(C, C ⊓ Expr).
Example 1. Let T1 = { C := ∃r.Q, E := ∃r.C, F := ∃r.∃r.(Q ⊓ R) } and Expr := ∃r.P where P, Q, R ∈ NC, r ∈ NR.
We have F ⊑ E and E = Ẽ(C), F = F̃(C̄); SE = {∃r.C}, SF = {∃r.∃r.(Q ⊓ R)}.
Note that F̃(C, C ⊓ Expr) ⋢ Ẽ(C, C ⊓ Expr), since Ẽ(C, C ⊓ Expr) = ∃r.(∃r.Q ⊓ ∃r.P) and F̃(C, C ⊓ Expr) = ∃r.∃r.(Q ⊓ R).
We have PROPAG TELL(E, F, Expr) = ∃r.∃r.(Q ⊓ R) ⊓ ∃r.(C ⊓ ∃r.P) and S_{PROPAG TELL(E,F,Expr)} = {∃r.∃r.(Q ⊓ R), ∃r.(C ⊓ ∃r.P)}.
3.3 Operator FORGET
The operator FORGET(C, Expr) defined here allows Expr ∉ SC. This extension is necessary because in some cases we need to forget a term Expr from C where C ⊑ Expr and Expr ∉ SC; such knowledge can be derived from C.

Operator MINUS between Two ALE-Description Trees. Let C, D be propagated normal ALE-concept descriptions with C ⊑ D. MINUS(C, D) is defined as the remainder of the description tree GC after the deletion of the image ϕ(GD) in GC for all homomorphisms ϕ from GD to GC. More formally, let GC = (NC, EC, n0, lC) and GD = (ND, ED, m0, lD). MINUS(C, D) = (NC, EC, n0, l′C) where l′C(ϕ(m)) = lC(ϕ(m)) \ lD(m) for all m ∈ ND and all ϕ.

Example 2. Let E, F be the concept descriptions of Example 1. We have MINUS(F, E) = ∃r.∃r.R.

Proposition 3.1. Let C be a propagated normal ALE-concept description, D be a propagated normal ALE-simple concept description and D = D1 ⊓ ... ⊓ Dn. Moreover, C ⊑ D. If C = X ⊓ D and X ⋢ Di for all i, then MINUS(C, D) ≡ X. A proof of Proposition 3.1 can be found in [9].
We denote by S^0_{C−D} the set of all sC ∈ SC such that sC ⋢ Di for all i, and by S^1_{C−D} the set of the remainders of the sC ∉ S^0_{C−D} after the operation MINUS. Proposition 3.1 guarantees that the sC ∈ S^0_{C−D} are not changed. We have S_{MINUS(C,D)} = S^0_{C−D} ∪ S^1_{C−D}.
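A full implementation of MINUS works on description trees and homomorphisms, as defined above. The sketch below covers only the syntactic special case singled out by Proposition 3.1 (C = X ⊓ D with X ⋢ Di), where deleting D amounts to removing its conjuncts from the set of simple concepts of C; the subsumption test is a stub, so this is an illustration under stated assumptions rather than the paper's algorithm.

# Sketch of the special case of Proposition 3.1: when C = X ⊓ D and no simple
# concept of X is subsumed by a conjunct of D, MINUS(C, D) is just X.
# `subsumes(d, s)` (s ⊑ d) is a stub for a real ALE subsumption test.

def subsumes(d: str, s: str) -> bool:
    # Placeholder: syntactic equality only; a real test would use the
    # structural characterization via description-tree homomorphisms.
    return d == s

def minus_simple_case(simple_of_c, conjuncts_of_d):
    """Keep the simple concepts of C not subsumed by any conjunct of D
    (the set S^0_{C-D} of the text, under the stubbed subsumption test)."""
    return {s for s in simple_of_c
            if not any(subsumes(d, s) for d in conjuncts_of_d)}

S_C = {"∃r.∃r.(Q ⊓ R)", "∃r.S"}        # C = ∃r.∃r.(Q ⊓ R) ⊓ ∃r.S
D = {"∃r.S"}
print(minus_simple_case(S_C, D))        # {"∃r.∃r.(Q ⊓ R)"}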
We can now define operator FORGET. Let C1 ⊓ ... ⊓ Cn be the normal form of C and Expr be a simple concept. Operator FORGET is defined as follows:
If C ⋢ Expr then FORGET(C, Expr) := C
Otherwise, FORGET(C, Expr) := MINUS(C, Expr) and S_{FORGET(C,Expr)} = S^0_{C−Expr} ∪ S^1_{C−Expr}
If C ⋢ Expr, we may leave out the fragment Expr, as it does not contribute to the definition of C. In this case, there is nothing to do, since the knowledge to be forgotten is not derivable from C. In fact, there exists an individual a such that aI ∈ CI and aI ∉ ExprI, where I is an interpretation. By that we mean that the knowledge Expr is not derivable from concept C following the execution of operation FORGET. Therefore, the operator defined in this way satisfies criterion 3 (success). If C ⊑ Expr, operator MINUS ensures the unique existence of the resulting concept description. Hence, operator FORGET is well defined. Also, operator MINUS guarantees criteria 3 and 4 for operator FORGET owing to Proposition 3.1 [9]. Note that if C = X ⊓ Expr then FORGET(C, Expr) = X.

Operator PROPAG FORGET. Let E be a direct super concept of a concept F in the hierarchy. Operator PROPAG FORGET(E, F, Expr) will compute a new concept description of E while taking into account criterion 4 (minimal change). Computing the lcs is therefore necessary for this aim. Assume that F = F̃(C). We define:
PROPAG FORGET(E, F, Expr) := lcs{Ẽ(C, MINUS(C, Expr)), F̃(C, MINUS(C, Expr))}
Computed in this way, PROPAG FORGET(E, F, Expr) is the least concept description which subsumes both Ẽ(C, MINUS(C, Expr)) and F̃(C, MINUS(C, Expr)).
We now compute S_{PROPAG FORGET(E,F,Expr)} according to the definition above. For brevity, we denote E1 = Ẽ(C, MINUS(C, Expr)) and E2 = PROPAG FORGET(E, F, Expr). Since E1 ⊑ E2, we can define the intersection of all images ϕ(GE2) in GE1: K = ∩ϕ ϕ(GE2), where ϕ ranges over the homomorphisms from GE2 to GE1. We denote by S^0_{E1} the set of all sE1 ∈ SE1 such that G_{sE1} ⊆ K, and by S^1_{E2} the set of all sE2 ∈ SE2 such that ϕ(G_{sE2}) ⊈ K for some homomorphism ϕ from GE2 to GE1. We define SE := S^0_{E1} ∪ S^1_{E2}. It is not difficult to check that ⊓_{s ∈ SE} s ≡ PROPAG FORGET(E, F, Expr).
Example 3. Let T1 = { C := ∃r.(P ⊓ Q) ⊓ ∀r.R, E := ∃r.S ⊓ ∃r.∃r.(P ⊓ Q), F := ∃r.S ⊓ ∃r.C } and Expr := ∃r.P where P, Q, R, S ∈ NC, r ∈ NR.
We have F ⊑ E and F = F̃(C), E = Ẽ(C̄); SE = {∃r.S, ∃r.∃r.(P ⊓ Q)}, SF = {∃r.S, ∃r.C}, MINUS(C, Expr) = ∃r.Q ⊓ ∀r.R.
Note that F̃(C, MINUS(C, Expr)) ⋢ Ẽ(C, MINUS(C, Expr)), since F̃(C, MINUS(C, Expr)) = ∃r.S ⊓ ∃r.(∃r.Q ⊓ ∀r.R) and Ẽ(C, MINUS(C, Expr)) = ∃r.S ⊓ ∃r.∃r.(P ⊓ Q).
We have PROPAG FORGET(E, F, Expr) = ∃r.S ⊓ ∃r.∃r.Q; K = {∃r.S ⊓ ∃r.∃r.Q}; S^0_{E1} = {∃r.S, ∃r.∃r.Q} and S^1_{E2} = ∅.
As a consequence, S_{PROPAG FORGET(E,F,Expr)} = {∃r.S, ∃r.∃r.Q}.

3.4 Revision on Abox
As described in the previous subsections, a concept can be considered as a conjunction of simple concepts, and the revision operators on the Tbox add or delete terms to/from these sets. Let E be a modified concept definition in the Tbox. Let SE and S′E be the sets of simple concepts of E before and after the revision of the Tbox, respectively. Let I be a model of the KB. For each assertion EI(aI), revision of the Abox performs a check of all assertions sI(aI) where s ∈ S′E \ SE. If some sI(aI) is not verified, EI(aI) becomes inconsistent and is deleted from the Abox. Otherwise, if every sI(aI) is verified, the assertion EI(aI) is conserved in the Abox.
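The Abox revision step described above can be sketched as follows. Here `holds(s, a, abox)` stands for the instance check s(a); it is a stub, since a real system would delegate it to an ALE reasoner, and the data structures are invented for the example.

# Sketch: revising the Abox after a Tbox revision of concept E.
# S_old / S_new are the sets of simple concepts of E before and after revision.

def holds(simple_concept, individual, abox):
    # Stub for the instance check simple_concept(individual).
    return (simple_concept, individual) in abox

def revise_abox(abox, concept_name, S_old, S_new):
    """Keep E(a) only if every newly added simple concept holds for a."""
    added = S_new - S_old
    revised = set()
    for (concept, individual) in abox:
        if concept == concept_name and not all(
                holds(s, individual, abox) for s in added):
            continue                      # E(a) became inconsistent: drop it
        revised.add((concept, individual))
    return revised

abox = {("E", "a"), ("∃r.P", "a"), ("E", "b")}
print(revise_abox(abox, "E", S_old={"∃r.Q"}, S_new={"∃r.Q", "∃r.P"}))
# keeps ("E", "a") because ∃r.P(a) is asserted, drops ("E", "b")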
4 Revision Production Rules and Description Logics

4.1 Revision Production Rules
The rules defined below are a specific form of production rule, namely revision production rules. In fact, the consequent of these rules is simply a revision operator on a terminology. The general form of these rules in a KB ∆ = (∆T, ∆A, ∆PR) is given as follows:
r ∈ ∆PR, r : p ⇒ TELL(Concept, Expr) or r : p ⇒ FORGET(Concept, Expr)
where p is a predicate formed from assertions in ∆T with the constructors p1(X̄1) ∧ ... ∧ pn(X̄n), and the X̄i are tuples of variables or constants. The predicates p1, ..., pn are either concept names, role names or ordinary predicates. The revision operators TELL and FORGET have been defined in Section 3. An example which illustrates how to exploit these rules can be found in [9].
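The following sketch shows one way such rules might be applied, assuming the TELL and FORGET operators of Section 3 are available as functions (trivial list-based stand-ins are used here) and that rule antecedents are predicates evaluated over a set of ground facts; all names and data are our own illustration, not the paper's.

# Sketch: firing revision production rules p ⇒ TELL/FORGET over a tiny KB.
# A rule is (antecedent, operation, concept_name, expr).

def tell(definition, expr):
    return definition if expr in definition else definition + [expr]

def forget(definition, expr):
    return [s for s in definition if s != expr]

def fire_rules(tbox, facts, rules):
    """tbox maps concept names to lists of simple concepts (conjunctions)."""
    for antecedent, op, concept, expr in rules:
        if antecedent(facts):
            tbox[concept] = op(tbox[concept], expr)
    return tbox

tbox = {"Customer": ["Person"]}
facts = {("LargeOrder", "c1")}
rules = [
    (lambda f: ("LargeOrder", "c1") in f, tell, "Customer", "∃buys.Product"),
]
print(fire_rules(tbox, facts, rules))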
4.2 Semantics of the Combining Formalism
We can now define the semantics of the formalism combining revision production rules and the description logic ALE, based on the semantics of the revision operators. We define the semantics of the formalism under the assumption that the propagation operators are not taken into account; it is not difficult to extend the following definition to a KB with these operators. Let ∆ be a KB composed of three components, ∆ = (∆T, ∆A, ∆PR). An interpretation I is a pair (O, .I) where O is a non-empty set and .I is a function that maps every constant a in ∆ to an object aI ∈ O. An interpretation I is a model of ∆ if it is a model of each component of ∆. Models of the terminological component ∆T are defined as usual in description logics. An interpretation I is a model of a ground fact p(ā) in ∆A if āI ∈ pI.
An interpretation I is a model of r : p ⇒ TELL(C, Expr) if, whenever the function .I maps the variables of r to the domain O such that (X̄i)I ∈ piI for every atom of the antecedent of r, then
1. if aI ∈ (CI ∩ ExprI) then aI ∈ (TELL(C, Expr))I
2. if aI ∈ CI and aI ∉ ExprI then aI ∉ (TELL(C, Expr))I
An interpretation I is a model of r : p ⇒ FORGET(C, Expr) if, whenever the function .I maps the variables of r to the domain O such that (X̄i)I ∈ piI for every atom of the antecedent of r, then
if aI ∈ CI then aI ∈ (FORGET(C, Expr))I
5 Conclusion and Future Work
We have introduced several revision policies and a framework for revision operators on DL-based knowledge bases. In the first policy, the conservation of assertions in the Abox despite modifications of concept definitions in the Tbox focuses our attention on inferences with the epistemic operator in a closed world. However, the interaction between the epistemic operator and the revision operators was not investigated in this paper. Next, we showed how to compute the revision operators for language ALE. A deeper study of the complexity of these operators is necessary for a possible improvement in their computation. Another question arises concerning the extension of the revision operators to languages allowing for disjunction (ALC). This extension requires a structural characterization of subsumption for these languages. On the other hand, if we allow Horn rules as a third component, which interactions will take place in the combining formalism? In this case, the obtained formalism will be an extension of language CARIN [4]. Hence, an extension of the inference services of CARIN to the combining formalism deserves investigation.
References
[1] B. Nebel. Reasoning and Revision in Hybrid Representation Systems. Thesis, 1990.
[2] B. Nebel. Terminological reasoning is inherently intractable. Artificial Intelligence, 1990.
[3] M. Schmidt-Schauß, G. Smolka. Attributive concept descriptions with complements. Artificial Intelligence, 1991.
[4] A. Y. Levy and M. C. Rousset. Combining Horn Rules and Description Logics in CARIN. Artificial Intelligence, Vol. 104, 1998.
[5] F. M. Donini, M. Lenzerini, D. Nardi, W. Nutt, A. Schaerf. An Epistemic Operator for Description Logics. Artificial Intelligence, Vol. 100, 1998.
[6] F. M. Donini, M. Lenzerini, D. Nardi, A. Schaerf. Reasoning in Description Logics. CSLI Publications, 1997.
[7] R. Küsters. Non-Standard Inferences in Description Logics. Thesis, Springer, 2001.
[8] F. Baader, R. Küsters and R. Molitor. Computing least common subsumers in Description Logics with existential restrictions. Proc. of the 16th Int. Joint Conf. on Artificial Intelligence (IJCAI-99).
[9] C. Le Duc, N. Le Thanh. Combining Revision Production Rules and Description Logics. Technical Report, 2003. See http://www.i3s.unice.fr/~cleduc.
Knowledge Support for Modeling and Simulation

Michal Ševčenko

Czech Technical University in Prague, Faculty of Electrical Engineering, Department of Computer Science
[email protected]
Abstract. This paper presents intermediate results of the KSMSA project (Knowledge Support for Modeling and Simulation Systems). The goal of the project is to apply artificial intelligence and knowledge engineering techniques to provide support for users engaged in modeling and simulation tasks. Currently we are focused on offering search services on top of a knowledge base, supported by a domain ontology.
1 Introduction
System modeling and simulation is an important part of an engineer's work. It is an important tool for, e.g., analyzing the behavior of systems, predicting the behavior of new systems, or diagnosing discrepancies in faulty systems. However, modeling and simulation tasks are highly knowledge-intensive. They require a lot of knowledge directly needed to perform modeling and simulation tasks, as well as a lot of background knowledge, such as mathematics, physics, etc. A great deal of this knowledge is tacit, hard to express, and thus hard to transfer. The knowledge often has a heuristic nature, which complicates the implementation of such knowledge directly in computer systems. We propose to design a knowledge-based system supporting modeling and simulation tasks. At present, we are focused on offering search services for the users. The prerequisite to our system is the development of a formal model of concepts in the domain of modeling and simulation, or of engineering in general. Such a model is commonly referred to as an ontology. Ontologies basically contain a taxonomy of concepts related to each other using axioms. Axioms are statements that are generally recognized to be true, represented as logical formulas. Ontologies help us to represent in computer memory the knowledge normally possessed only by humans, and to perform simple reasoning steps that resemble the reasoning of humans. Ontologies also help to structure large knowledge bases, and provide a means to search such knowledge bases with significantly better precision than conventional search engines. An ontology can be accompanied by a lexicon, which provides a mapping between concepts in the ontology and natural language. This allows users to refer to concepts in the ontology using plain English words or phrases, or to automatically process natural language texts and to estimate their logical content. We have developed a prototype of an ontology and lexicon for our domain.
The new generation of search engines is based on the concept of metadata. Metadata are information associated with the documents being indexed and searched by search engines, but invisible to the user when a document is normally displayed. They are used to focus the search, thus improving its precision. Metadata that refer to the document as a whole include the name of the document author, the date of the last modification, the application domain, bindings to other documents, etc. This kind of metadata provides additional information about the document that is not explicitly specified in its content. However, metadata can also be associated with parts of documents, such as sections, paragraphs, sentences or even words, often for the purpose of associating them with certain semantics, i.e. meaning. For instance, the string ‘John Smith' can be associated with metadata stating that this string expresses the name of some person. Such metadata in general do not provide additional information to humans, as they understand the meaning of words from the context. However, they help to compensate for the lack of intelligence of indexing and search engines that do not have such a capability. Of course, ontologies and metadata bear on each other. Ontologies provide a means of describing the structure of metadata, so that metadata can be authored in a uniform way. A resource (document) is said to commit to an ontology if its metadata are consistent with that ontology. An indexing engine can then search a group of documents committing to one ontology, and it can provide users with means for focusing their searches according to concepts in that ontology. This scenario demonstrates the need for standardized ontologies that are committed to by many resources, so that ontology-based search engines can index all of them.
2 Our Approach
We propose to design an ontology-based indexing and search engine for engineering applications, and for modeling and simulation in particular. To do this, we need to design an ontology for this domain. This section describes the most fundamental design goals of our search engine; the next section presents an overview of the content of our prototype ontology. The architecture of our knowledge system is quite conventional. It consists of the following components:

– knowledge base
– knowledge base index
– ontology and lexicon
– user's interface (search service)
– administrator's interface
The knowledge base is a collection of documents on the Internet available through the HTTP protocol. These are mostly HTML documents, containing either unrestricted natural language text relevant to our domain (such as electronic textbooks), or semi-structured documents (collection of solved examples,
catalogs of system components, model libraries, etc.). Documents in the knowledge base may be provided by different authors. Since the knowledge in these documents is expressed implicitly, it is the user's responsibility to extract and interpret this knowledge. Our system merely retrieves documents that it considers useful to the user. This is done with the help of the metadata stored in the knowledge base index. The knowledge base index is a central repository that stores metadata associated with documents in the knowledge base. It is used by the search engine to determine documents relevant to users' queries. For semi-structured documents in the knowledge base, metadata have a database-like nature. For example, documents in a collection of solved examples are annotated with attributes like application domain, physical domain, modeling technique, or the simulation engine used. Documents containing unrestricted text may also be annotated with keywords structured according to concepts in the ontology. Since the index is connected to the ontology, which defines mutual couplings between concepts, the search engine can perform simple reasoning steps during searches, such as generalizations, specializations, and natural-language disambiguations. The design decision that the knowledge base index is centralized is pragmatic. A centralized index is easier to maintain and is more accessible for the search engine that needs it for its operation. One might think that the metadata would be better supplied by the authors of the documents and stored directly with these documents. This is the idea behind the initiative called the Semantic Web. However, today's practice is that only a few documents on the web are annotated with metadata (at least in the engineering domain), as web-document authors are conservative and are not willing to invest extra effort into the development of their web pages. Our centralized approach allows us to annotate documents developed by third parties. The user's interface allows users to express their queries and submit them to the search engine. We propose two kinds of search interfaces. For searching for semi-structured documents, an administrator can design an ad-hoc form for entering the values of attributes. This enables users to focus on the attributes that are relevant to the group of documents they are interested in. The second kind of interface enables users to search according to keywords. This service is very similar to conventional keyword-based search engines, but thanks to its exploitation of the ontology and lexicon, it should exhibit better precision and recall. Keywords entered by the user are looked up in the lexicon and related to the respective concepts in the ontology. In the case of ambiguity (for polysemous keywords), users are prompted to resolve the ambiguity by selecting a subset of meanings (ontology concepts) corresponding to the keyword. The lexicon can also contain multiple words (synonyms) for a single ontology concept, thus reducing the risk of document loss during the search due to the use of synonymous words when asking for concepts. These simple techniques partially reduce the impact of the ambiguity of natural language and ensure better quality of search (a code sketch of this keyword-to-concept lookup is given at the end of this section). The final component of our knowledge server is the administrator's interface. It is an integrated environment for maintaining the ontology, lexicon and
knowledge base index. It contains an ontology and lexicon browser and editor, a document annotation tool, and analytical tools supporting the maintenance and evolution of the knowledge server.
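As a rough illustration of the keyword-based search interface described above, the sketch below maps user keywords to ontology concepts through a lexicon, lets a user resolve ambiguities, and then retrieves documents whose metadata mention the selected concepts. All names and data here are invented for the example; they are not the KSMSA ontology or index.

# Sketch: lexicon-backed keyword search over a small document index.
# lexicon: word -> set of ontology concepts; index: concept -> set of doc ids.

lexicon = {
    "model": {"EngineeringModel", "FashionModel"},      # polysemous word
    "bond graph": {"BondGraph"},
    "diagram": {"BlockDiagram"},
}
index = {
    "EngineeringModel": {"doc1", "doc3"},
    "BondGraph": {"doc2", "doc3"},
    "BlockDiagram": {"doc4"},
}

def resolve(keyword):
    """Return candidate concepts; more than one means the user must choose."""
    return lexicon.get(keyword.lower(), set())

def search(keywords, chosen=None):
    chosen = chosen or {}
    concepts = set()
    for kw in keywords:
        candidates = resolve(kw)
        if len(candidates) > 1 and kw in chosen:
            candidates = {chosen[kw]}        # user disambiguated the keyword
        concepts |= candidates
    hits = [doc for c in concepts for doc in index.get(c, set())]
    return sorted(set(hits))

print(search(["model", "bond graph"], chosen={"model": "EngineeringModel"}))
# ['doc1', 'doc2', 'doc3']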
3 The Ontology
This section presents a brief overview of the ontology that forms the core of our knowledge system. Our ontology is based on a general-purpose, or upper, ontology called the Suggested Upper Merged Ontology (SUMO) [1]. This implies that our ontology is represented in the Knowledge Interchange Format (KIF). One of the unique features of SUMO is that it is mapped to a large general-purpose lexicon, WordNet [2]. This enables the user to refer to concepts in the SUMO ontology using plain English words or phrases. We started to create our domain ontology on top of SUMO, and added corresponding terms to the lexicon by embedding them in the ontology using special predicates. Unlike other engineering ontologies, like EngMath [3] or PhysSys [4], our ontology is intended to represent basic common-sense concepts from the domain, rather than to formally represent engineering models. Our knowledge base should represent the knowledge about modeling, not the models themselves. Our ontology is divided into several sections. The first section describes a taxonomy of models and modeling methodologies that engineers use to describe modeled systems, e.g. differential equations, block diagrams, multipole diagrams, bond graphs, etc. Although these concepts could be defined quite formally, we decided to provide only a basic categorization accompanied by loose couplings providing informal common-sense navigation among the concepts. The second section describes a taxonomy of engineering components (devices) and processes, i.e. entities that are subject to modeling and simulation. The taxonomy in the current version of our ontology is only a snapshot of the terminology used in engineering, but should provide a solid base for extension by domain experts. The third section of the ontology contains a taxonomy of typical engineering tasks (problems) and methods that can be used to solve them. The last section describes structured metadata that can be associated with individual documents in the knowledge base.
4 Conclusions and Future Directions
In this paper, we have presented our approach to the design of the knowledge system supporting modeling and simulation tasks. We presented only the general idea, more technical details can be found in [5]. We have implemented a first prototype of the whole system, which can be accessed from the project homepage [6]. The system includes all mentioned components, i.e. the search interface, the knowledge base index, the ontology and lexicon, and the administration tool. In the following paragraphs we propose some future work that should be done within the project.
Evolution of the Ontology, Lexicon and Knowledge Base. Although the current version of the ontology and the knowledge base can already demonstrate some concepts of our approach, it is clear that these resources must be much bigger, both to be useful in practice and to prove that the approach is correct. We must thus dedicate some time to the evolution of these resources—we must add new documents to the knowledge base and amend the ontology and lexicon accordingly. This work should also provide useful feedback for improving our tools for knowledge base maintenance. Enlarging the ontology and knowledge base will also raise some questions about performance and scalability.

Improving the Functionality of Interfaces. The administrator's interface and user's interface should be improved. The administration tool in particular offers wide scope for improvement. We plan to add some analytical tools based on information extraction and statistical methods, which should help to evolve and validate the knowledge base. We also plan to develop a tool that will automatically evaluate the performance of our knowledge system based on data recorded by the search interface, i.e. the questions asked by users and the answers provided by the knowledge system.

Formalizing the Task Knowledge. Besides the static knowledge implicitly contained in the knowledge base, we propose to explicitly formalize the task knowledge required to perform certain classes of typical tasks taking place in the modeling and simulation domain. This task knowledge should provide more direct support for the users than mere access to static knowledge.

Integration with the Existing Environment. Another important goal is to integrate our prototypical tools into an existing modeling and simulation environment, DYNAST, which we developed within another project. We will seek new user-interface and human-computer interaction paradigms that can directly exploit the ontology and knowledge structures developed within this project.
References
[1] I. Niles and A. Pease, “Toward a standard upper ontology,” in Proceedings of the 2nd International Conference on Formal Ontology in Information Systems (FOIS-2001), 2001.
[2] “WordNet homepage,” http://www.cogsci.princeton.edu/~wn/.
[3] T. R. Gruber and R. G. Olsen, “An ontology for engineering mathematics,” in Fourth International Conference on Principles of Knowledge Representation and Reasoning, Bonn, Germany, 1994.
[4] W. N. Borst, Construction of Engineering Ontologies for Knowledge Sharing and Reuse, Ph.D. thesis, University of Twente, September 1997.
[5] Michal Ševčenko, “Intelligent user support for modeling and simulation of dynamic systems,” Tech. Rep., Czech Technical University in Prague, January 2003, Postgraduate Study Report DC-PSR-2003-01.
[6] “Knowledge support for modeling and simulation (KSMSA) project homepage,” http://virtual.cvut.cz/ksmsa/index.html.
Expert System for Simulating and Predicting Sleep and Alertness Patterns Udo Trutschel, Rainer Guttkuhn, Anneke Heitmann, Acacia Aguirre, and Martin Moore-Ede Circadian Technologies, Inc. 24 Hartwell Avenue, Lexington, MA 02421,
[email protected]
Abstract. The Expert System “Circadian Alertness Simulator” predicts employee sleep and alertness patterns in 24/7-work environments using rules of complex interactions between human physiology, operational demands and environmental conditions. The system can be tailored to the specific biological characteristics of individuals as well as to specific characteristics of groups of individuals (e.g., transportation employees). This adaptation capability of the system is achieved through a built-in learning module, which allows transferring information from actual data on sleep-wake-work patterns and individual sleep characteristics into an internal knowledge database. The expert system can be used as a fatigue management tool for minimizing the detrimental biological effects of 24-hour operations by providing feedback and advice for work scheduling and/or individualspecific lifestyle training. Consequently, it can help reduce accident risk and human-related costs of the 24/7 economy.
1 Introduction
Expert systems are proven tools for solving complex problems in many areas of everyday life, including fault analysis, medical diagnostics, scheduling, system planning, economic predictions, stock investments, sales predictions, management decisions and many more. Despite the numerous applications, expert systems that allow the prediction of human behaviour are relatively rare. One important application in this area is the prediction of sleep and alertness patterns of individuals and/or industry-specific employee groups exposed to 24/7 operations. The last 40 years have seen the evolution of a non-stop 24/7 (24 hours a day, 7 days a week) world driven by the need to increase productivity, enhance customer service and utilize expensive equipment around the clock. However, humans are biologically designed to sleep at night. Millions of years of biological evolution led to the development of an internal biological clock in the brain (the suprachiasmatic nucleus), which generates the human body's circadian (approximately 24-hour) rhythms of sleep and wakefulness, and synchronizes them with day and night using
daylight as a resetting tool (Moore-Ede 1982). In this non-stop 24/7 world, people are constantly exposed to mismatches between the biological clock and the work demands, leading to compromised sleep and alertness, decreased performance and accidents, and an increase in human error-related costs. Circadian Technologies has developed the Expert System “Circadian Alertness Simulator” (CAS) that predicts employee sleep and alertness patterns in a 24/7 work environment. This expert system is based on complex rules and interactions between human physiology (e.g., light sensitivity, sleep pressure, circadian sleep and alertness processes), operational demands (e.g., work schedules) and environmental conditions (e.g., light effects). The system can be tailored to the specific biological characteristics of individuals as well as to specific characteristics of groups of individuals (e.g., specific employee populations). This adaptation capability of the system is achieved through a built-in learning module, which applies the system rules and algorithms to actual data on sleep-wake-work patterns and individual characteristics in a training process and stores the results into an internal knowledge database. The Expert System CAS is used as a fatigue management tool for minimizing the detrimental biological effects of 24-hour operations by providing advice for work scheduling or individual-specific lifestyle training. It can therefore help reduce the human-related costs of the 24/7 economy.
2 Expert System CAS – Structure and Learning Capabilities
The Expert System CAS is based on a general concept with model functions, specific rules and free parameters for the storage of information about light sensitivity, the circadian and homeostatic components of sleep/wake behavior and alertness, and other individual features of a person, such as chronotype (morningness/eveningness), habitual sleep duration (long/short sleeper), napping capabilities and sleep flexibility (the individual capability to cope with mismatches between the internal clock and sleeping-time constraints). The knowledge base of the system represents a complex interaction of model functions, free parameters and rules. The most important features are described below. A two-process model of sleep and wake regulation assumes an interaction of homeostatic components (modeled by exponential functions and two free parameters) and circadian components (modeled by multiple sine functions and ten free parameters) based on the following rules: (1) If ‘the homeostatic factor reaches the upper circadian threshold' then ‘the system switches from sleep to wake', and (2) If ‘the homeostatic factor reaches the lower circadian threshold' then ‘the system switches from wake to sleep', provided that sleep is allowed; otherwise the system always switches to sleep below the lower threshold as soon as sleep is permitted (Daan et al. 1984). The characteristics of the circadian component depend on the complex interaction of biological rhythms, individual light sensitivity (modeled by the Phase Response Curve (PRC)) and environmental light exposure, which is strongly correlated with the 24-hour daily cycle; this can be summarized by the next two rules: (3) If ‘light is applied before the minimum of the circadian component' then ‘the phase of the biological clock is delayed', and (4) If ‘light is applied after the minimum of the circadian component' then ‘the phase of the biological clock is advanced' (Moore-Ede et al.
1982). As a consequence, the circadian curve is shifted to a new position over the next day, modifying all future sleep/wake behaviour. As a further consequence, the sleep/wake behaviour determines the timing and intensity of light exposure, expressed by the following rule: (5) If ‘the state of a person is sleep' then ‘the light exposure is blocked', leading to an unchanged phase of the circadian component. The expert knowledge for the sleep and alertness prediction is obtained through a training process, in which the expert system learns from actual data on individual sleep personality, sleep/wake activity and light exposure (Fig. 1).

Fig. 1. Training procedure for the Expert System CAS: sleep personality, wake/sleep and light-exposure data from external databases of shiftworkers are fed to the inference engine, which minimizes a target function and stores the result in the knowledge base; the output interface provides sleep times and alertness levels.
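The threshold rules (1)–(2) described above can be illustrated with a very small simulation of the two-process interaction. The parameter values, time step and sinusoidal circadian thresholds below are arbitrary placeholders, not the calibrated functions of the CAS knowledge base, and the light-driven phase shifts of rules (3)–(5) are omitted.

# Toy sketch of rules (1)-(2): a homeostatic factor recovers exponentially
# during sleep and is depleted while awake; the state switches when it crosses
# sinusoidal circadian thresholds.  All constants are placeholders.
import math

def circadian_thresholds(t_hours):
    phase = 2 * math.pi * (t_hours % 24) / 24
    return 0.75 + 0.10 * math.sin(phase), 0.25 + 0.10 * math.sin(phase)

def simulate(hours=72, dt=0.1, sleep_allowed=lambda t: True):
    h, asleep, trace = 0.6, False, []
    t = 0.0
    while t < hours:
        upper, lower = circadian_thresholds(t)
        if asleep:
            h += (1.0 - h) * 0.15 * dt        # factor recovers during sleep
            if h >= upper:
                asleep = False                # rule (1): switch from sleep to wake
        else:
            h -= h * 0.05 * dt                # factor is depleted while awake
            if h <= lower and sleep_allowed(t):
                asleep = True                 # rule (2): switch from wake to sleep
        trace.append((round(t, 1), "sleep" if asleep else "wake"))
        t += dt
    return trace

trace = simulate()
print(sum(1 for _, s in trace if s == "sleep") / len(trace))  # fraction asleep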
The target of the training is to determine the free parameters that result in the best match between actual and simulated sleep and alertness data. After the training is complete, a specific data set is stored in the knowledge database of the Expert System CAS and can be used to simulate and predict sleep and alertness patterns for given specific work–rest schedules (Fig. 2).
Fig. 2. Sleep and alertness predictions of the trained Expert System CAS: the input interface takes the person/group ID, wake/work data and light exposure; the inference engine, drawing on the knowledge base, outputs sleep/alertness patterns, a Fatigue Risk Score and risk times.
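The training step of Fig. 1 amounts to fitting the free parameters so that simulated sleep episodes match recorded ones. A generic sketch of that idea is shown below, with an invented two-parameter "simulator" and a simple disagreement loss minimized by a crude grid search; the actual CAS target function and parameter set are not published in the paper.

# Sketch: fit free model parameters by minimizing disagreement between
# simulated and recorded sleep/wake states (1 = asleep).  The two-parameter
# simulator below is a stand-in, not the CAS model.
import numpy as np

hours = np.arange(0, 72, 0.5)
recorded = ((hours % 24) < 8).astype(float)          # pretend data: sleep 00-08

def simulate(onset, duration, t):
    return (((t - onset) % 24) < duration).astype(float)

def loss(onset, duration):
    return np.mean((simulate(onset, duration, hours) - recorded) ** 2)

# crude grid search over the two free parameters
best = min(((loss(o, d), o, d)
            for o in np.arange(0, 24, 0.5)
            for d in np.arange(4, 12, 0.5)),
           key=lambda x: x[0])
print(best)        # loss 0.0 at onset 0.0, duration 8.0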
The simulated and predicted sleep behavior of a person has direct consequences for the alertness level as a function of time, expressed in this paper by two easy-to-understand overall output parameters: (1) the Fatigue Risk Score (0 = lowest fatigue, 100 = highest fatigue), which captures the cumulative fatigue risk, and (2) the percentage of work time below a critical alertness threshold.
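Given a simulated alertness time series sampled over the work periods of a schedule, the two output parameters just described can be computed roughly as follows; the 0–100 scaling and the critical threshold value are placeholders of ours, since the paper does not publish the CAS formulas.

# Sketch: the two summary outputs from an alertness time series sampled during
# work periods.  Alertness is assumed in [0, 1]; the threshold and scaling are
# illustrative placeholders, not the CAS definitions.

def fatigue_outputs(alertness_at_work, critical=0.3):
    n = len(alertness_at_work)
    if n == 0:
        return 0.0, 0.0
    mean_fatigue = sum(1.0 - a for a in alertness_at_work) / n
    fatigue_risk_score = 100.0 * mean_fatigue          # 0 = lowest, 100 = highest
    pct_below = 100.0 * sum(a < critical for a in alertness_at_work) / n
    return fatigue_risk_score, pct_below

work_alertness = [0.8, 0.6, 0.4, 0.25, 0.2, 0.5]
print(fatigue_outputs(work_alertness))   # (≈54.2, ≈33.3)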
3 “Short-Lark” and “Long-Owl” Test Cases for Shift Scheduling
To illustrate the capabilities of the Expert System CAS, two artificial people were simulated in a shiftwork environment, working an 8-hour backward-rotating shift schedule (7 working days, 3 shift start times). They are called “Short-Lark” (an extreme morning type with a short habitual sleep time) and “Long-Owl” (an extreme evening type with a long habitual sleep time). Without any external constraints such as work or other blocked times for sleep, the “Short-Lark” sleeps every day from 2200 to 0400 and the “Long-Owl” sleeps every day from 0100 to 1100. The expert system was trained with both sleep patterns, and the results were stored in the knowledge database as person-specific input vectors. The predicted sleep and alertness behavior of our test subjects when working the 8-hour backward-rotating shift schedule, under the assumption that sleep is forbidden during the blue-marked working times, is shown in Fig. 3 and Fig. 4.
Fig. 3. Predicted sleep pattern for “Short-Lark” working 8-hour shift schedule
It is clear from Fig. 3 that the evening and especially the night shift interfere with the normal sleep behavior (black bars) of the “Short-Lark,” creating conflicts with the internal biological clock. This should have severe consequences for the alertness level
during work times (blue bars) as expressed by a high overall Fatigue Risk Score of 72, and 37.4% work performed at extreme risk times.
Fig. 4. Predicted sleep pattern for “Long-Owl” working 8-hour shift schedule
The sleep pattern of the “Long-Owl” is completely different (Fig. 4). Here the work schedule interferes with sleep (black bars) only during the day shift, and the disturbed night sleep is recovered immediately after the shift. Therefore, the conflict with the internal biological clock is only modest and reasonable alertness levels during work times are maintained, expressed by a lower overall Fatigue Risk Score of 57 and only 12.3% of work at extreme risk times.
4 Group-Specific Application in the Transportation Industry
Fatigue-related safety concerns are particularly relevant for safety-critical occupations, such as vehicle operators in the rail, road transport or bus industry. Accidents caused by driver fatigue are often severe, as the drowsy driver may not take evasive action (i.e. brake or steer) to avoid a potential collision. In recognition of this, the U.S. Department of Transportation has identified human fatigue as the “Number One” safety problem, with a cost in excess of $12 billion per year. To address this need, Circadian Technologies, Inc., in partnership with the rail and trucking industry, has over the past ten years collected work/rest/sleep and alertness data covering over 10,000 days of rail and truck operators in actual operating conditions. Using the training process depicted in Fig. 1, an Expert System “CAS-Transportation” was created and applied to a specific trucking operation. “CAS-Transportation” was used to simulate
sleep and alertness behavior and predict fatigue risk from the work–rest data of truck drivers. The results were correlated with the actual accident rates and costs of a 500-truck road transport operation. Using “CAS-Transportation” as a feedback tool, non-experts (managers, dispatchers) were provided with the relative risk of accidents due to driver fatigue for any planned sequence of driving and resting hours, and were asked to make adjustments to the working hours to minimize the overall Fatigue Risk Score. As a result of the intervention, the mean Fatigue Risk Score of the group was significantly reduced from 46.8 to 28.9 (dark line in Fig. 5a, 5b). The reduction of the Fatigue Risk Score was associated with a reduction in the frequency and severity of accidents (Moore-Ede et al., 2003). The total number of truck accidents dropped by 23.3% and the average cost per accident was significantly reduced (by 65.8%).
Fig. 5a. Distribution of the Fatigue Risk Score (x-axis: Fatigue Index, 0–100; y-axis: frequency in percent) before feedback from the Expert System CAS-Transportation

Fig. 5b. Distribution of the Fatigue Risk Score (x-axis: Fatigue Index, 0–100; y-axis: frequency in percent) after feedback from the Expert System CAS-Transportation (p < 0.0001)
Using the feedback from the Expert System CAS in a risk-informed, performancebased safety program gives managers and dispatchers incentives to address some of the most important causes of driver fatigue, and therefore of fatigue-related highway accidents. This risk-informed, performance-based approach to fatigue minimization enables non-experts (managers and dispatchers) to make safety-conscious operational decisions while having sufficient flexibility to balance the specific business needs of their operation.
5 Conclusion
The Expert System CAS is able to acquire knowledge about individual sleep personalities or industry-specific groups in order to predict Fatigue Risk Scores and/or alertness patterns for given shift schedule options or real-world work/wake data. Missing sleep patterns can be replaced by simulated sleep patterns with considerable certainty. From a database of many shift schedules, acceptable schedule options can be selected by exposing these options to different sleep personalities or industry-specific groups and predicting the possible fatigue risks. In addition, the simulation capabilities of the Expert System CAS support the decision process during schedule design. The examples (“Short-Lark”, “Long-Owl”, “Transportation Employees”) presented in this paper, as well as the shift schedule design process in general, can be considered as forward-chaining applications of the Expert System CAS. Shift schedule design starts with the operational requirements, investigates the large number of possible solutions for shift durations, shift starting times, direction of rotation and sequence of days ‘on' and days ‘off', and makes the schedule selection based on the fatigue-risk and/or alertness criteria specified by the user. In addition, the Expert System CAS can be used in the investigation of a fatigue-related accident. This specific application represents an example of a backward-chaining approach, in which fatigue is assumed as a probable cause of an accident and the Expert System CAS attempts to find supporting evidence to verify this assumption.
References
[1] Daan S., Beersma D.G., Borbely A. Timing of human sleep: recovery process gated by a circadian pacemaker. Am. J. Physiology 1984; 246: R161–R178.
[2] Moore-Ede M., Sulzman F., Fuller C. The Clocks That Time Us. Harvard University Press, Cambridge, 1982.
[3] Moore-Ede M., Heitmann A., Dean C., Guttkuhn R., Aguirre A., Trutschel U. Circadian Alertness Simulator for Fatigue Risk Assessment in Transportation: Application to Reduce Frequency and Severity of Truck Accidents. Aviation, Space and Environmental Medicine 2003, in press.
Two Expert Diagnosis Systems for SMEs: From Database-Only Technologies to the Unavoidable Addition of AI Techniques Sylvain Delisle1 and Josée St-Pierre2 Institut de recherche sur les PME Laboratoire de recherche sur la performance des entreprises Université du Québec à Trois Rivières 1 Département de mathématiques et d'informatique 2 Département des sciences de la gestion C.P. 500, Trois-Rivières, Québec, Canada, G9A 5H7 Phone: 1-819-376-5011 + 3832 Fax: 1-819-376-5185 {sylvain_delisle,josee_st-pierre}@uqtr.ca www.uqtr.ca/{~delisle, dsge}
Abstract. In this application-oriented paper, we describe two expert diagnosis systems we have developed for SMEs. Both systems are fully implemented and operational, and both have been put to use on data from actual SMEs. Although both systems are packed with knowledge and expertise on SMEs, neither has been implemented with AI techniques. We explain why and how both systems relate to knowledgebased and expert systems. We also identify aspects of both systems that will benefit from the addition of AI techniques in future developments.
1 Expertise for Small and Medium-Sized Enterprises (SMEs)
The work we describe here takes place within the context of the Research Institute for SMEs—the Institute's core mission (www.uqtr.ca/inrpme/anglais/index.html) is to support fundamental and applied research that fosters the advancement of knowledge on SMEs and contributes to their development. The specific lab in which we have conducted the research projects we refer to in this paper is the LaRePE (LAboratoire de REcherche sur la Performance des Entreprises: www.uqtr.ca/inrpme/larepe/). This lab is mainly concerned with the development of scientific expertise on the study and modeling of SMEs' performance, including a variety of interrelated subjects such as finance, management, information systems, production, technology, etc. The vast majority of research projects carried out at the LaRePE involve both theoretical and practical aspects, often necessitating in-field studies with SMEs. As a result, our research projects always attempt to provide practical solutions to real problems confronting SMEs.
In this application-oriented paper we briefly describe two expert diagnosis systems we have developed for SMEs. Both can be considered decision support systems—see [15] and [18]. The first is the PDG system [5]: benchmarking software that evaluates production and management activities, and the results of these activities in terms of productivity, profitability, vulnerability and efficiency. The second is the eRisC system [6]: software that helps identify, measure and manage the main risk factors that could compromise the success of SME development projects. Both systems are fully implemented and operational. Moreover, both have been put to use on data from actual SMEs. What is of particular interest here, especially from a knowledge-based systems perspective, is the fact that although both the PDG and the eRisC systems are packed with knowledge and expertise on SMEs, neither has been implemented with Artificial Intelligence (AI) techniques. However, if one looks at them without paying attention to how they have been implemented, they qualify as “black-box” diagnostic expert systems. In the following sections, we provide further details on both systems and how they relate to knowledge-based and expert systems. We also identify aspects of both systems that could benefit from the addition of AI techniques in future developments.
2 The PDG System: SME Performance Diagnostic

2.1 An Overview of the PDG System
The PDG system evaluates an SME from an external perspective and on a comparative basis in order to produce a diagnosis of its performance and potential, complemented with relevant recommendations. Although we usually refer to the PDG system as a diagnostic system, it is in fact a hybrid diagnostic-recommendation system, as it not only identifies the evaluated SME's weaknesses but also makes suggestions on how to address these weaknesses in order to improve the SME's performance. An extensive questionnaire is used to collect relevant information items on the SME to be evaluated. Data extracted from the questionnaire are computerized and fed into the PDG system. The latter performs an evaluation in approximately 3 minutes by contrasting the particular SME with an appropriate group of SMEs for which we have already collected relevant data. The PDG's output is a detailed report in which 28 management practices (concerning human resources management, production systems and organization, market development activities, accounting, finance and control tools), 20 results indicators and 22 general information items are evaluated, leading to 14 recommendations on short-term actions the evaluated SME could undertake to improve its overall performance. As shown in Figure 1, the PDG expert diagnosis system is connected to an Oracle database which collects all the relevant data for benchmarking purposes—the PDG also uses the SAS statistics package, plus Microsoft Excel for various calculations and the generation of the final output report. The PDG reports are constantly monitored by a team of multidisciplinary human experts in order to ensure that the recommendations are valuable for the entrepreneurs. This validation phase, which always takes place before the report is sent to the SME, is an occasion to make further
improvements to the PDG system, whenever appropriate. It is also a valuable means for the human experts to update their own expertise on SMEs. Figure 1 also shows that an intermediary partner is part of the process in order to guarantee confidentiality: nobody in our lab knows which companies the data are associated with.

Fig. 1. The PDG system: evaluation of SMEs, from an external perspective and on a comparative basis, in order to produce a diagnosis of their performance and potential. The entrepreneur (SME) sends the questionnaire and financial data through an intermediary partner to the lab; the PDG expert diagnosis system stores data and results in an Oracle database, exchanges information and expertise with a multidisciplinary team of human experts, and returns the report along the same path.
The current version of the PDG system has been in production for 2 years. So far, we have produced more than 600 reports and accumulated in the database the evaluation results of approximately 400 different manufacturing SMEs. A recent study was made of 307 Canadian manufacturing SMEs that have used the PDG report, including 49 that have done so more than once. Our results show that the PDG's expert benchmarking evaluation allows these organisations to improve their operational performance, confirming the usefulness of benchmarking but also the value of the recommendations included in the PDG report concerning short-term actions to improve management practices [17].
2.2 Some Details on the PDG System
The PDG's expertise is located in two main components: the questionnaire and the benchmarking results interpretation module—in terms of implementation, the PDG uses an Oracle database, the SAS statistical package, and Microsoft Excel. The first version of the questionnaire was developed by a multidisciplinary team of researchers in the following domains: business strategy, human resources, information systems, industrial engineering, logistics, marketing, economics, and finance. The questionnaire development team was faced with two important challenges that quickly became crucial goals: 1) find a common language (a shared ontology) that would allow researchers to understand each other and, at the same time, would be accessible to entrepreneurs when answering the questionnaire, and 2) identify long-term performance indicators for SMEs, as well as problem indicators, while keeping contents to a minimum since in-depth evaluation was not adequate.
The team was able to meet these two goals by assigning a “knowledge integrator” role to the project leader. During the 15-month period of its development, the questionnaire was tested with entrepreneurs in order to ensure that it was easy to understand both in terms of a) contents and question formulation, and b) report layout and information visualization. All texts were written with a clear pedagogical emphasis, since the subject matter was not all that trivial and the intended readership was quite varied and heterogeneous. Several prototypes were presented to entrepreneurs, and they showed a marked interest in graphics and colours. Below, Figure 2 shows a typical page of the 10-page report produced by the PDG system.
Fig. 2. An excerpt from a typical report produced by the PDG system. The evaluated SME's performance is benchmarked against that of a reference group
The researchers' expertise was invaluable in identifying the vital information that would allow the PDG system to rapidly produce a general diagnosis of any manufacturing SME. The diagnosis also needed to be reliable and complete, while being
comprehensible by typical entrepreneurs, as we pointed out before. This was pioneering research work for the whole team. Indeed, other SME diagnosis systems are generally financial and based on valid quantitative data. The knowledge integrator mentioned above played an important part in this information engineering and integration process. Each expert had to identify practices, systems, or tools that had to be implemented in a manufacturing SME to ensure a certain level of performance. Then, performance indicators had to be defined in order to measure to what extent these individual practices, systems, or tools were correctly implemented and allowed the enterprise to meet specific goals; the relationship between practices and results is a distinguishing characteristic of the PDG system. Next, every selected performance indicator was assigned a relative weight by the expert and the knowledge integrator. This weight is used to position the enterprise being diagnosed with regard to its reference group, thus allowing the production of relevant comments and recommendations. The weight is also used to produce a global evaluation that is displayed in a synoptic table. Contrary to many performance diagnostic tools in which the enterprise's information is compared to norms and standards (e.g. [11]), the PDG system evaluates an enterprise relative to a reference group selected by the entrepreneur. Research conducted at our institute seriously questions the use of norms and standards: it appears to be dubious for SMEs, as they simply are too heterogeneous to support the definition of reliable norms and standards.
Performance indicators are implemented as variables in the PDG system, more precisely in its database and in the benchmarking results interpretation module (within the report production module). These variables are defined in terms of three categories: 1) binary variables, which are associated with yes/no questions; 2) scale variables, which are associated with the relative ranking of the enterprise along a 1 to 4 or a 1 to 5 scale, depending on the question; and 3) continuous (numerical) variables, which are associated with numerical figures such as the export rate or the training budget. Since variables come in different types, they must also be processed differently at the statistical level, notably when computing the reference group used for benchmarking purposes. In order to characterize the reference group with a single value, a central tendency measure that is representative of the reference group's set of observations is used. Depending on the variable category and its statistical distribution, means, medians, or percentages are used in the benchmarking computations.
Table 1 (next page) shows an example of how the evaluated enterprise's results are ranked and associated with codes that are next used to produce the various graphics in the benchmarking report. The resulting codes (see CODE in Table 1) indicate the evaluated enterprise's benchmarking result for every performance indicator. They are then used by the report generation module to produce the benchmarking output report, which contains many graphical representations, as well as comments and recommendations. The codes are used to assign colours to the enterprise, while the reference group is always associated with the colour beige. For instance, if the enterprise performs better than its reference group, CODE = 4 means the colour is forest green; in the opposite situation, CODE = 4 would mean the colour is red. Other colours with other meanings are yellow, salmon, and olive green. Figure 2 above illustrates what these coloured graphics look like (although they appear only in black and white here).
Table 1. Some aspects of the representation of expertise within the PDG system with performance indicators implemented as variables. This table shows three (3) variables: one scale variable (participative management), one binary (remuneration plan), and one continuous numerical (fabrication cost). Legend: SME = variable value for the evaluated enterprise; MEA = mean value of the variable in the reference group; RG = reference group; MED = median value of the variable in the reference group; CODE = resulting code for the evaluated enterprise
Scale variable (example: participative management):
  if SME >= 1.25 x MEA, then CODE = 4
  if SME >= 1.10 x MEA, then CODE = 3
  if SME >= 1.00 x MEA, then CODE = 2
  if SME >= 0.90 x MEA, then CODE = 1
  if SME >= 0.75 x MEA, then CODE = 0

Binary variable (example: remuneration plan):
  if SME = 1 and 10% of RG = 1, then CODE = 4
  if SME = 1 and 25% of RG = 1, then CODE = 3
  if SME = 1 and 50% of RG = 1, then CODE = 2
  if SME = 1 and 75% of RG = 1, then CODE = 1
  if SME = 1 and 90% of RG = 1, then CODE = 0

Continuous (numerical) variable (example: fabrication cost):
  if SME >= 1.25 x MED, then CODE = 4
  if SME >= 1.10 x MED, then CODE = 3
  if SME >= 1.00 x MED, then CODE = 2
  if SME >= 0.90 x MED, then CODE = 1
  if SME >= 0.75 x MED, then CODE = 0
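To make the coding scheme of Table 1 concrete, the following sketch shows how such benchmarking codes could be computed; the function names, the treatment of values below all thresholds, and the exact banding of the binary case are assumptions for illustration, not the PDG's actual implementation.

```python
# Illustrative sketch of the benchmarking coding of Table 1 (not the actual PDG code).
# Scale variables compare the SME to the reference-group mean (MEA), continuous
# variables to the median (MED), binary variables to the share of the reference
# group (RG) that has the practice in place.

def code_for_ratio(ratio):
    """Map the SME/reference ratio to a benchmarking code, following Table 1."""
    if ratio >= 1.25:
        return 4
    if ratio >= 1.10:
        return 3
    if ratio >= 1.00:
        return 2
    if ratio >= 0.90:
        return 1
    if ratio >= 0.75:
        return 0
    return None  # below all thresholds; Table 1 does not specify this case

def benchmark_code(kind, sme_value, reference_group):
    if kind == "scale":
        mea = sum(reference_group) / len(reference_group)
        return code_for_ratio(sme_value / mea)
    if kind == "continuous":
        ordered = sorted(reference_group)
        med = ordered[len(ordered) // 2]        # simple median approximation
        return code_for_ratio(sme_value / med)
    if kind == "binary":
        if sme_value != 1:
            return None                          # Table 1 only lists the SME = 1 cases
        share = sum(1 for v in reference_group if v == 1) / len(reference_group)
        # The rarer the practice in the reference group, the better the code
        # (interpretation of the "x% of RG = 1" rows of Table 1).
        for threshold, code in [(0.90, 0), (0.75, 1), (0.50, 2), (0.25, 3), (0.10, 4)]:
            if share >= threshold:
                return code
        return 4
    raise ValueError("unknown variable kind")

# Example: a scale variable where the SME scores 3.6 against a small reference group
print(benchmark_code("scale", 3.6, [2.5, 3.0, 2.8, 3.1]))
```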
3  The eRisC System: Risk Assessment of SME Development Projects

3.1  An Overview of the eRisC System
SMEs often experience difficulties accessing financing to support their activities in general, and their R&D and innovation activities in particular; see [1], [4], [8], and [9]. Establishing the risk levels of innovation activities can be quite complex, and there is no formalized tool to help financial analysts assess them and correctly implement compensation and financing terms that will satisfy both lenders and entrepreneurs. This situation creates a lot of pressure on the cash resources of innovating SMEs. Based on our team's experience with SMEs and expertise in the assessment of risk, and thanks to the contribution of several experts who constantly deal with SME development projects, we have developed a state-of-the-art Web-based software system called eRisC (see Figure 3 next page). The eRisC (https://oraprdnt.uqtr.uquebec.ca/erisc/index.jsp) expert diagnosis system identifies, measures, and helps manage the main risk factors that could compromise the success of SME development projects, including expansion, export and innovation projects, each of which is the object of a separate section of the software. An extensive, dynamic, Web-based questionnaire is used to collect relevant information items on the SME expansion project to be evaluated.
Fig. 3. The eRisC system: a Web-based software system that helps identify, measure and manage the main risk factors involved in SME development projects
The contents of the questionnaire are based on an extensive review of the literature, in which we identified over 200 risk factors acting upon the success of SME development projects. For example, factors associated with the export activity are export experience, commitment/planning, target market, product, distribution channel, shipping, and contractual/financial aspects. These seven elements are broken down into 21 sub-elements involving between 58 and 216 questions; the number of questions ranges from 59 to 93 for an expansion project, from 58 to 149 for an export project, and from 86 to 216 for an innovation project. Data extracted from the questionnaire are fed into an elaborate knowledge-intensive algorithm that computes risk levels and identifies the main risk elements associated with the evaluated project. As shown in Figure 3, the eRisC expert diagnosis system is connected to an Oracle database which collects all the relevant data. Since eRisC was developed after the PDG system, it benefited from the most recent Web-based technologies (e.g. Oracle Java) and was designed right from the start as a fully automated system. More precisely, contrary to the PDG reports, there is no need to constantly monitor eRisC's output reports, hence the dotted arrows on the right-hand side of Figure 3 above. eRisC was developed for and validated by entrepreneurs, economic agents, lenders and investors, to identify the main risk factors of SME development projects in order to improve their success rates and facilitate their financing. As of now, various organizations are starting to put eRisC to use in real-life situations, allowing us to collect valuable information in eRisC's database on SME projects and their associated risk assessments. We have a group of 30 users, from various organizations and domains, who currently use eRisC for real-life projects and who provide us with useful feedback for marketing purposes.
3.2  Some Details on the eRisC System
eRisC's contents were developed by combining various sources of information, knowledge and expertise: the literature on business failure factors and on project management, our colleagues' expertise on SMEs, and invaluable information from various agents dealing with these issues on a day-to-day basis, such as lenders, investors, entrepreneurs, economic advisors and experts. Based on this abundant information, we first assembled a long list of potential risk factors that could disturb or significantly influence the development of SME projects. In a second phase, we had to reduce the original list of risk factors, which was simply too long to be considered in its entirety in real-life practical situations. In order to do that, we considered the relative importance and influence of risk factors on the failure of evaluated projects. Once this pruning was completed, and after we ensured that we had not discarded important factors, the remaining key factors were grouped into meaningful generic categories. We then developed sets of questions and subquestions that would support the measurement of the actual risk level of a project. This also allowed us to add a risk management dimension to our tool, by inviting the user to identify with greater precision the facets that could compromise the success of the project, thus allowing better control through the implementation of appropriate corrective measures. A relatively complex weight system was also developed in order to associate a quantitative measure with individual risk elements, to rank these elements, and to compute a global risk rating for the evaluated project; see Figure 4 below.
Fig. 4. An excerpt from the expansion project questionnaire. The only acceptable answers to questions are YES, NO, NOT APPLICABLE, DON'T KNOW
In a third and final phase, the contents of eRisC were validated with many potential users, and their feedback was taken into consideration to adjust several aspects such as question formulation, term definition, confidentiality of information, etc. At this point, the tool was still "on paper", as an extensive questionnaire (grid), and had not been implemented yet. So an important design decision had to be made at the very beginning of the implementation phase: how to convert the large, static, on-paper questionnaire into a form adequate for implementation in the eRisC software? As we examined various possibilities, we gradually came to look at it more and more as an interactive and dynamic document. In this dynamic perspective, the questionnaire would be adaptable to the user's needs for the specific project at hand. In a sense, the questionnaire is at the meeting point of three complementary dimensions: the risk evaluation model as defined by domain experts, the user's perspective as a domain practitioner, and the computerized rendering of the previous two dimensions. Moreover, from a down-to-earth, practical viewpoint, users would only be interested in the resulting software if it proved to be quick, user-friendly, and better than their current non-automated tool.
Fig. 5. Risk Assessment Results Produced by eRisC
With regard to the technological architecture, eRisC is based on the standard 3-tiered Web architecture, for which we selected Microsoft's Internet Explorer (Web browser) for the client side, the Tomcat Web server for the middleware, and, for the data server, the Oracle database server (Oracle Internet Application Server 8.1.7 Enterprise Edition) running on a Unix platform available at our University in a secured environment. All programming was done with JSP (JavaServer Pages) and JavaScript. A great advantage of the 3-tiered model is that it supports dynamic Web applications
in which the contents of the Web pages shown on the user's (client's) Web browser are computed "on the fly", i.e. dynamically, by the Web server from the information it fetches from the database server in response to the user's (client's) request. The five (5) main steps of processing involved in a project risk evaluation with eRisC are: 1) dynamic creation of the questionnaire, according to the initial options selected by the user; 2) project evaluation (question answering: see Figure 4) by the user; 3) saving of the data (user's answers) to the database; 4) computation of results; and 5) presentation of results in an online and printable report. Once phases 1 to 3 are completed, after some 30 minutes on average, eRisC only takes a minute or so to produce the final results, all of this taking place online. The final results include a numerical value representing the risk rating (a relative evaluation between 0 and 100) for the specific SME project just evaluated, combined with the identification of at least the five most important risk factors (to optionally perform risk mitigation) within the questionnaire's sections used to perform the evaluation, plus a graphical (pie) representation showing the risk associated with each section and its respective weight in the computation of the global project risk rating; see Figure 5 on the previous page and Figure 6 below. The user can change these weights to adjust the evaluation according to the project's characteristics, or to better reflect her/his personal view on risk evaluation. These "personal" weights can also be saved by eRisC in the user's account so that the software can reuse them the next time around.
Fig. 6. Mitigation Report and Risk Assessment Simulation in eRisC
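As a rough illustration of the computation described above, the sketch below aggregates section-level risks into a global 0-100 rating with user-adjustable section weights and lists the top risk factors as mitigation candidates; the data structures, the plain weighted average, and all numbers are assumptions made for this example, not eRisC's actual algorithm.

```python
# Hypothetical sketch of a global risk rating computed from section risks and
# user-adjustable weights (eRisC's more elaborate algorithm is not reproduced here).

def global_risk_rating(section_risks, section_weights):
    """section_risks: {section: risk in [0, 100]}; section_weights: {section: weight > 0}."""
    total_weight = sum(section_weights[s] for s in section_risks)
    weighted = sum(section_risks[s] * section_weights[s] for s in section_risks)
    return weighted / total_weight

def top_risk_factors(factor_risks, n=5):
    """Return at least the n most important risk factors, highest risk first."""
    return sorted(factor_risks.items(), key=lambda item: item[1], reverse=True)[:n]

# Example with made-up sections of an export project
risks = {"export experience": 70, "target market": 55, "distribution channel": 40,
         "shipping": 25, "contractual/financial": 60}
weights = {"export experience": 2.0, "target market": 1.5, "distribution channel": 1.0,
           "shipping": 1.0, "contractual/financial": 1.5}
print(round(global_risk_rating(risks, weights)))  # global rating on the 0-100 scale
print(top_risk_factors(risks))                    # candidates for risk mitigation
```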
When sufficient data have been accumulated in eRisC's database, it will be possible to establish statistically-based weight models for every type of user. Amongst various possibilities, this will allow entrepreneurs to evaluate their projects with the weights used by bankers, helping them to better understand the bankers' viewpoint when they ask for financing assistance. Finally, mitigation elements are associated with many of the risk factors listed in eRisC's output report. Typically associated with the most important risk factors, these mitigation elements suggest ways to reduce the risk rating just computed. The user can even re-compute the risk level under the hypothesis that the selected mitigation elements have been put in place, in order to assess the impact they may have on the project's global risk level. A new graphic is then produced showing a comparison of the risk levels before and after the mitigation process.
4  Conclusion: AI-less Intelligent Decision Support Systems
A good deal of multi-domain expertise and informal knowledge engineering was invested into the design of the PDG and eRisC expert diagnosis systems. In fact, at the early stage of the PDG project, which was developed before eRisC, it was even hoped that an expert-system approach would apply naturally to the task we were facing. Using an expert system shell, a prototype expert system was in fact developed for a subset of the PDG system dealing only with human resources. However, reality turned out to be much more difficult than anticipated. In particular, the knowledge acquisition, knowledge modelling, and knowledge validation/verification phases ([7], [12], [11], [16], [3]) were too demanding given our resource constraints, especially in a multidisciplinary domain such as that of SMEs, for which little formalized knowledge exists. Indeed, many people were involved, all of them from various specialization fields (e.g. management, marketing, accounting, finance, human resources, engineering, technical, information technology, etc.) and with various backgrounds (researchers, graduate students, research professionals and, of course, entrepreneurs). One of the main difficulties that hindered the development of the PDG as an expert system was the continuous change both the questionnaire and the benchmarking report were undergoing during the first three years of the project. So while the research team was trying to develop a multidisciplinary model of SME performance evaluation, users' needs had to be considered, software development had to be carried out, and evaluation reports had to be produced for participating SMEs. This turned out to be a rather complicated situation. The prototype expert system mentioned above was developed in parallel with the current version, although only for the subset dealing with human resources; see [10] and [19] for examples of expert systems in finance. The project leader's knowledge engineer role was very difficult, since several experts from different domains were involved and the extraction and fusion of these various fields of expertise had never been done before. Despite the experts' valuable experience, knowledge, and good will, they had never been part of a similar project before. The modelling of such rich, complex, and vast information, especially for SMEs, was an entirely new challenge both scientifically and technically. Indeed, because of their heterogeneous nature, and contrary to large enterprises, SMEs are
much more difficult to model and evaluate. For instance, the implementation of certain management practices may be necessary and usual for traditional manufacturing enterprises, but completely inappropriate for a small enterprise subcontracting for a large company or a prime contractor. These important considerations and difficulties, not to mention the consequences they had on the project's schedule and budget, led to the abandonment of the expert system after the development of a simple prototype. As to the eRisC system, since it was another multi-domain, multi-expert project, and thanks to our prior experience with the PDG system, it was quickly decided to stay away from AI-related approaches and techniques. During the development of eRisC's questionnaires, we realized how risk experts always tended to model risk assessment from their own perspective and from their own personal knowledge, as reported in the literature. This is why we built our risk assessment model from many sources, thanks to a comprehensive review of the literature and the availability of several experts, in order to ensure we ended up with an exhaustive list of risk-determining factors for SME projects. For instance, here are the main different perspectives (see e.g. [14]):
• Bankers and lenders care mostly about financial aspects and tend to neglect qualitative dimensions that indicate whether the enterprise can solve problems and meet challenges in risky projects.
• Entrepreneurs do not realize that their involvement in the project can in fact constitute a major risk from their partners' viewpoint.
• Economic consultants and advisors have a specialized background that may prevent them from having a global perspective on the project.
Obviously, it is the fusion of all these diverse and complementary sources of expertise that would have been used to develop the knowledge base of an expert-system version of the current eRisC system. However, this was simply impossible given the timetable and resources available to us. Of course, this does not mean that AI tools were inappropriate for those two projects. As a research team involved in an applied project, we made a rational decision based on our experience with a smaller-scale experiment (i.e. the PDG prototype expert system on human resources), on our time and budget constraints, and on the well-documented fact that multi-domain, multi-expert knowledge acquisition and modelling constitutes a great challenge. Yet another factor that had a great influence on our design decisions was the fact that both projects started out on paper as questionnaires, which led naturally to database building and the use of database-related software development. Thus, both the PDG and eRisC systems ended up as knowledge-packed systems built on database technology. However, as we briefly discuss in Section 5 below, we are now at a stage where we plan the addition of AI-related techniques and tools. The current versions of the PDG and eRisC systems, although not implemented with AI techniques, e.g. a knowledge base of rules and facts, an inference engine, etc. (see, e.g., [13], [18]), qualify as "black-box" expert diagnosis systems. These unique systems are based on knowledge, information and algorithms that allow them to produce outputs that only a human expert, or in fact several human experts in different domains, would be able to produce in terms of diagnosis and recommendation quality. These reports contain mostly coloured diagrams and simple explanations that are formulated in plain English (or French) so that SME entrepreneurs can easily understand them.
The PDG is the only one of the two systems that can be said to use some relatively old AI techniques. Indeed, the comments produced in the output report are generated via a template-based approach, an early technique used in natural language processing.
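To give an idea of what such template-based generation can look like, here is a minimal sketch; the sentence frame and the code-to-phrase mapping are invented for illustration and are not the PDG's actual templates.

```python
# Minimal illustration of template-based comment generation (invented templates,
# not the PDG's actual ones). A phrase is selected from the benchmarking code and
# substituted into a fixed sentence frame.

PHRASES = {
    4: "well above",
    3: "above",
    2: "in line with",
    1: "slightly below",
    0: "well below",
}

TEMPLATE = ("For {indicator}, your enterprise is {comparison} its reference group; "
            "{recommendation}")

def generate_comment(indicator, code, recommendation):
    return TEMPLATE.format(indicator=indicator,
                           comparison=PHRASES[code],
                           recommendation=recommendation)

print(generate_comment("participative management", 1,
                       "consider involving employees more in operational decisions."))
```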
5  Future Work: Bringing Back AI Techniques into the Picture
The PDG and the eRisC systems are now at a stage where we can reconsider the introduction of AI techniques in new developments. The main justification for this is the need to eliminate human intervention while preserving high-quality outputs, based on rare, highly-skilled knowledge and expertise. We have started to develop new modules that will further increase the intelligence features of both systems. Here is a short, non-exhaustive list accompanied with brief explanations:

• Development of data warehouses and data mining algorithms to facilitate statistical processing of data and extend knowledge extraction capabilities. Such extracted knowledge will be useful to improve the systems' meta-knowledge level, which could be used in the systems' explanations for instance, and also to broaden human experts' domain knowledge. This phase is already in progress.
• The huge number of database attributes and statistical variables manipulated in both systems is overwhelming. A conceptual taxonomy, coupled with an elaborate data dictionary, has now become a necessary addition. For instance, the researcher should be able to find out quickly with what concepts a particular attribute (or variable) is associated, to what computations or results it is related, and so on. This phase has recently begun.
• Development of an expert system to eliminate the need for any human intervention in the PDG system. Currently, a human expert must revise all reports before they are sent to the SME. Most of the time, only minor adjustments are required. The knowledge used to perform this final revision takes into consideration individual results produced in various parts of the benchmarking report and analyzes potential consequences of interrelationships between them, in order to ensure that the conclusions and recommendations for the evaluated SME are both valid and coherent. This phase is part of our future work.
• Augmenting the current systems with case-based reasoning and related machine learning algorithms. In several aspects of both systems, evaluation of the problem at hand could be facilitated if it were possible to establish relationships with similar problems (cases) already solved before. Determining the problems' salient features to support this approach would also offer good potential to lessen the users' burden during the initial data collection phase. This phase is part of our future work.
• Studying the potential of agent technology to reengineer some elements of both systems, especially from a decision support system perspective [2]. This could be especially interesting for the modelling and implementation of distributed sources of expertise that contribute to decision processing. For example, in the PDG system each source of expertise in the performance evaluation of an SME could be associated with a distinct agent controlling and managing its own knowledge base. Interaction and coordination between these agents would be crucial aspects of a PDG system based on a community of cooperative problem-solving agents.
References

[1] Beaudoin R. and J. St-Pierre (1999). "Le financement de l'innovation chez les PME", Working paper for Développement Économique Canada, 39 pages. Available: http://www.DEC-CED.gc.ca/fr/2-1.htm
[2] Bui T. and J. Lee (1999). "An Agent-Based Framework for Building Decision Support Systems", Decision Support Systems, 25, 225-237.
[3] Caulier P. and B. Houriez (2001). "L'approche mixte expérimentée (modélisation des connaissances métiers)", L'Informatique Professionnelle, 32(195), juin-juillet, 30-37.
[4] Chapman R.L., C.E. O'Mara, S. Ronchi and M. Corso (2001). "Continuous Product Innovation: A Comparison of Key Elements across Different Contingency Sets", Measuring Business Excellence, 5(3), 16-23.
[5] Delisle S. and J. St-Pierre (2003). "An Expert Diagnosis System for the Benchmarking of SMEs' Performance", First International Conference on Performance Measures, Benchmarking and Best Practices in the New Economy (Business Excellence '03), Guimaraes (Portugal), 10-13 June 2003, to appear.
[6] Delisle S. and J. St-Pierre (2003). "SME Projects: A Software for the Identification, Assessment and Management of Risks", 48th World Conference of the International Council for Small Business (ICSB-2003), Belfast (Ireland), 15-18 June 2003, to appear.
[7] Fensel D. and F. Van Harmelen (1994). "A Comparison of Languages which Operationalize and Formalize KADS Models of Expertise", The Knowledge Engineering Review, 9(2), 105-146.
[8] Freel M.S. (2000). "Barriers to Product Innovation in Small Manufacturing Firms", International Small Business Journal, 18(2), 60-80.
[9] Menkveld A.J. and A.R. Thurik (1999). "Firm Size and Efficiency in Innovation: Reply", Small Business Economics, 12, 97-101.
[10] Nedovic L. and V. Devedzic (2002). "Expert Systems in Finance - A Cross-Section of the Field", Expert Systems with Applications, 23, 49-66.
[11] Matsatsinis N.F., M. Doumpos and C. Zopounidis (1997). "Knowledge Acquisition and Representation for Expert Systems in the Field of Financial Analysis", Expert Systems with Applications, 12(2), 247-262.
[12] Rouge A., J.Y. Lapicque, F. Brossier and Y. Lozinguez (1995). "Validation and Verification of KADS Data and Domain Knowledge", Expert Systems with Applications, 8(3), 333-341.
[13] Santos J., Z. Vale and C. Ramos (2002). "On the Verification of an Expert System: Practical Issues", Lecture Notes in Artificial Intelligence #2358, 414-424.
[14] Sarasvathy D.K., H.A. Simon and L. Lave (1998). "Perceiving and Managing Business Risks: Differences Between Entrepreneurs and Bankers", Journal of Economic Behavior and Organization, 33, 207-225.
[15] Shim J.P., M. Warkentin, J.F. Courtney, D.J. Power, R. Sharda and C. Carlsson (2002). "Past, Present, and Future of Decision Support Technology", Decision Support Systems, 33, 111-126.
[16] Sierra-Alonso A. (2000). "Definition of a General Conceptualization Method for the Expert Knowledge", Lecture Notes in Artificial Intelligence #1793, 458-469.
[17] St-Pierre J., L. Raymond and E. Andriambeloson (2002). "Performance Effects of the Adoption of Benchmarking and Best Practices in Manufacturing SMEs", Small Business and Enterprise Development Conference, The University of Nottingham.
[18] Turban E. and J.E. Aronson (2001). Decision Support Systems and Intelligent Systems, Prentice Hall.
[19] Wagner W.P., J. Otto and Q.B. Chung (2002). "Knowledge Acquisition for Expert Systems in Accounting and Financial Problem Solving", Knowledge-Based Systems, 15, 439-447.
Using Conceptual Decision Model in a Case Study

Miki Sirola
Helsinki University of Technology
Laboratory of Computer and Information Science
P.O. Box 5400, FIN-02015 HUT, Finland
[email protected]
Abstract. Decision making is mostly based on decision concepts and decision models built into decision support systems. The type of decision problem determines the application. This paper presents a case study analysed with a conceptual decision model that utilises rule-based methodologies, numerical algorithms and procedures, statistical methodologies including distributions, and visual support. The selection of the decision concepts used is based on case-specific needs. Fine-tuning of the model is done during the construction of the computer application and the analysis of the case examples. A kind of decision table is built, including pre-filtered decision options and carefully chosen decision attributes. Each attribute is weighted, the decision table values are given, and finally a total score is calculated. This is done with a multi-step procedure including various elements. The computer application is built on the G2 platform. The case example, choice of career, is analysed in detail. The developed prototype should be considered mostly as an advisory tool in decision making. More important than the numerical result of the analysis is to learn about the decision problem. Evaluation expertise is needed in the development process. The model constructed is a kind of completed multi-criteria decision analysis concept. This paper is also an example of using a theoretical methodology to solve a practical problem.
1  Introduction
There are aspects of decision making that need special attention. Various decision concepts have been composed and many kinds of decision models have been built to provide decision support systems with the best possible aid. The decision maker has to be able to integrate all the valuable information available and distill a good enough decision in each particular decision case. Visual support is often very valuable for the decision maker. Visual support can of course mean several things. Visualisation of process data is a basic example of giving the decision maker information that can be valuable. Visualisation of process variables begins with simple plots and time series, and may evolve into more and more complicated forms.
Support methodologies such as decision tables, decision trees, flow diagrams, and rule-based methods ought to be mentioned. Calculation algorithms, e.g. for optimisation, are often needed as well. Selection criteria formation and decision option generation are also important parts of the decision process, used systematically for example in the multi-criteria decision analysis methodology. Statistical methodologies, distributions, object models, agents and fuzzy models are also introduced as parts of decision models. Simulation for tracking purposes and prediction should not be forgotten either. In system-level applications several of these concepts are needed and utilised further. Decision types can be divided into long-term decisions and short-term decisions. The number of objectives, possible uncertainty, time dependence, etc. also affect the decision problem perspective. Sometimes the decisions need to be made on-line and sometimes off-line. All these factors need to be kept in mind when the problem-solving methodology is chosen. Comparing risk and cost is a common methodology in decision making. Choosing the preferences among competing priorities is an important point. The cumulative quality function [1] and chained paired comparisons [2] are examples of more specific methodologies used in decision making. Measures in decision making are also interesting. Deterministic decision making has its own measures, mostly based on value and utility theory, while stochastic decision making uses statistical measures such as distributions. Decision concepts have been reviewed in reference [3]. Although decision making is applied in many areas, the literature seems to concentrate on economics and production planning. There is more variation in the methodologies used. For instance, the decision analysis approach and knowledge-based technologies are commonly used concepts. Decision making in handling sequences, resource allocation, network design, sorting problems and classification are examples of problem types studied in detail. These issues are also discussed in references [4], [5] and [6]. The author has dealt with other decision analysis case examples e.g. in references [7], [8] and [9]. In the conceptual decision model presented in this paper, only carefully considered features have been used. Some of the earlier presented techniques have been selected into this model. This model utilises rule-based methodologies, numerical algorithms and procedures, statistical methodologies including e.g. distributions, and visual support. Rule-based methodologies are used for instance in the preliminary elimination of decision options, algorithms and procedures e.g. in the calculation of weight coefficients, and statistical distributions in evaluating the values in a kind of decision table. This utilisation is explained in more detail in the next chapter about the decision concept and model. The selection of the decision concepts used in the model is based on case-specific needs. By analysing real decision problems, the most suitable features and methodologies have been taken into use. If some feature is found to be unnecessary in the decision process, it has been left out of the final decision model. Missing features have also been added along the way. The concept of the model is first planned on a rough level, and then fine-tuned during the examination of the case examples. The type of the decision problem determines the application. In a decision situation there often exist alternatives.
Sometimes the alternatives appear in frequent sets. In each decision situation there is a certain history that can be more or less known.
Statistical support, production of solutions, filtering and selection are all needed at certain steps. Situation-based assessment is very important, for instance in a case such as checkers. This game, played on a draughts board, may be analysed using different area sizes or numbers of elements. The number of possible combinations increases very fast with the size. Other possible cases worth mentioning are e.g. the selection of a career, a buying decision (a car is the classical example), or the optimisation of a travel route. The selection of a career may include such attributes as inclination, interest, economy, risk, etc. The buying decision has such possible attributes as price, technical qualities, colour, model, etc. In this paper the case choice of career is analysed in more detail. The computer application of this case example has been built with the G2 expert system shell.
2  Decision Concept of the Model
The conceptual decision model is presented here first on a rough level, and then in more and more detail as we get closer to the examined case itself. It was found that rule-based methods, algorithms and procedures, statistical methods such as distributions, and visual support are the most suitable methodologies and give the most desired features for the decision model in question. The whole concept in use consists of these elements just mentioned. In the model, a kind of decision table is built, including decision options and decision attributes (see Figure 1). The decision attributes can also be called decision criteria. Each decision attribute has a weight coefficient, and each decision option can be valued with regard to each attribute. Thus far the table is quite similar to the one used in the quantitative analysis of multi-criteria decision analysis. In fact, the formula calculating the final numerical result is also exactly the same:

    v_{tot}(a_i) = \sum_{j=1}^{n} w_j \, v_j(a_i)    (1)
where a_i is decision alternative i, w_j is the weight coefficient of criterion j, and v_j(a_i) is the scaled score of decision alternative i with respect to criterion j. The decision problem takes shape during the analysis process. First the decision attributes are defined for the case in question. Then the decision options are created. The decision options are filtered with a rule-based examination, and only the most suitable ones are selected for the final analysis. A similar procedure can also be applied to the decision attributes. The weight coefficients are calculated with an algorithm based on pairwise comparisons and step-by-step adjustment through all attributes in the final analysis. This procedure is adjusted for each case separately. The use of statistical distributions comes into play when the values in the decision table are given. Some attributes are valued, with regard to their corresponding decision option, by using information in statistical tables. Historical data is one of the elements used in constructing these tables. This method and its realisation are explained in more detail in the next chapter about the computer application.
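A minimal sketch of this scoring step, assuming the options have already been filtered and the weight coefficients already produced by the pairwise-comparison procedure, is given below; the variable names and the example values are illustrative only.

```python
# Minimal sketch of the total score of formula (1): v_tot(a_i) = sum_j w_j * v_j(a_i).
# Assumes the decision options have already been filtered and the weight
# coefficients already computed; all numbers below are made up for illustration.

def total_score(option_scores, weights):
    """option_scores: {criterion: scaled score of one option}; weights: {criterion: w_j}."""
    return sum(weights[c] * option_scores[c] for c in weights)

weights = {"inclination": 0.35, "interest": 0.30, "economy": 0.25, "risk": 0.10}
options = {
    "career1": {"inclination": 3, "interest": 5, "economy": 3, "risk": 2},
    "career2": {"inclination": 5, "interest": 3, "economy": 4, "risk": 2},
    "career3": {"inclination": 2, "interest": 2, "economy": 2, "risk": 4},
}
ranking = sorted(options, key=lambda o: total_score(options[o], weights), reverse=True)
print(ranking)  # options ordered by v_tot (illustrative data, not the case study values)
```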
The decision table also helps the decision maker to perceive the decision problem visually. It is of course possible to use many other kinds of visualisation methodologies in addition, visualisations that help the decision maker picture the decision problem better than just pure numbers in a decision table. But even the decision problem shaped into a decision table already gives great help in understanding the decision problem better.
3  Computer Application
The computer application is realised in the G2 expert system shell. The presented realisation scheme (see Figure 1) is for the case choice of career, but it can easily be generalised to any of the cases mentioned in this paper. Some key features of the realisation are also documented here. The decision table itself is realised as a freeform table. The changing values come either from arrays such as choice of career (options), value of attribute (criteria) and weight of attribute (weight coefficients), or from some of the functions explained later on (numerical values in the decision table). The result is calculated with formula (1), as mentioned in Chapter 2. The array values are also calculated as outputs of procedures or functions. An object class career has been defined, with four subclasses: inclination, interest, economy and risk. The career candidates such as mathematician, natural scientist, linguist, lawyer, economist and trucker are defined as instances of the object class career. The rule base (not seen in the figure) takes care, for instance, of the filtering of the initial decision options. Minimum limit values are defined for the weight coefficients of the attributes, and based on these comparisons the final decision options are selected. As already mentioned, there are procedures and functions calculating the weight coefficients and the numerical values in the decision table. The procedures and functions calculating the numerical values in the decision table also utilise accessory tables including information from statistical distributions about the attributes in the decision table. The realisation scheme of the calculation of the weight coefficients and numerical table values is not described in every detail in this paper.
4  Case Example about the Choice of Career
In this case the choice of career of more than thirty persons is analysed. The selection process is more or less the same for each person, or at least so similar that only one such example is discussed in detail. No statistical analysis is done for the whole random sample. The qualities of the decision model and the concept used are considered more important, and these features are best highlighted through an illustrative example. The decision table for person number 17 is seen in Figure 1. For person number 17 the decision option filtering gives three options: mathematician, lawyer and trucker. So in the decision table of Figure 1, career1 is mathematician, career2 is lawyer, and career3 is trucker. The decision attributes are
inclination, interest, economy and risk. Here inclination means the genetic feasibility of the person for such a career. Interest means the subjective willingness to choose the career. Economy means the statistical income of each career type, and risk the statistical possibility of getting unemployed in each profession.
Fig. 1. A window of the G2 application where the case choice of career is analysed for person number 17
The inputs to the procedure calculating the weight coefficients include such things as each person's subjective evaluation and a kind of general importance. Pairwise comparisons are done for each pair of attributes, and the final values are calculated step by step. As already mentioned, the attributes economy and risk are valued by using tables including information from statistical distributions, while the other attributes are valued by different means. The statistical distributions are taken from public databases. A subjective measure is used in valuing the attribute interest, while the attribute inclination is valued by a combined technique including subjective, statistical and quality-type measures. Person number 17 seems to have the most inclination for the career of lawyer, and the most interest in the career of mathematician. The statistics show the best income for the career of lawyer, and the smallest risk for the career of trucker. Note that the attribute risk is on an inverse scale, so a big number means a small risk (which is considered to be a good quality), while all other attributes are on a normal scale (a big number means high inclination, interest or economy). The scale of the values in the decision table is from 0 (worst possible) to 5 (best possible).
The numerical result shows clearly that person number 17 should choose the option lawyer. The second best choice would be mathematician, and clearly last comes trucker. This result is so clear that sensitivity analysis is not needed for this person. In many cases the sensitivity analysis shows the weak points of the analysis by making clear how different parameters affect the final result. It must be noted, though, that the attribute risk was given a very small weight. Still, many people consider this attribute rather important in this context. In this case the inclination has been given rather high emphasis. The importance of the attributes economy and interest lies somewhere in between. This tool should be considered as some kind of advisory tool, and by no means as an absolute reference for the final choice. Although with person number 17 two of the attributes clearly point to the choice of the lawyer career, there are still two important attributes that disagree with this opinion. For instance, many people think that one should follow the voice of interest in such things as choosing a career. Still, this kind of analysis can be very informative for the decision maker in finding out the other, often more hidden motives for such choices.
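As an illustration of the kind of sensitivity analysis mentioned above, the sketch below recomputes the ranking while one weight is varied; all values are invented for illustration and do not reproduce person number 17's actual data.

```python
# Simple one-at-a-time sensitivity check: vary one attribute weight and see whether
# the best option changes. All numbers here are invented, not the case study values.

def total_score(option_scores, weights):
    return sum(weights[c] * option_scores[c] for c in weights)

def best_option(options, weights):
    return max(options, key=lambda o: total_score(options[o], weights))

options = {
    "mathematician": {"inclination": 3, "interest": 5, "economy": 3, "risk": 2},
    "lawyer":        {"inclination": 5, "interest": 3, "economy": 4, "risk": 2},
    "trucker":       {"inclination": 2, "interest": 2, "economy": 2, "risk": 4},
}
base_weights = {"inclination": 0.35, "interest": 0.30, "economy": 0.25, "risk": 0.10}

for risk_weight in (0.10, 0.20, 0.30, 0.40):
    weights = dict(base_weights, risk=risk_weight)
    print(f"risk weight {risk_weight:.2f} -> best option: {best_option(options, weights)}")
```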
5  Discussion
The conceptual decision model has been built by first collecting the desired decision concept, combining existing decision support methodology in a new way. Then the decision model has been built by iterating the details while the computer application is being built and the case examples analysed. The computer application is programmed on the G2 platform. The case example of choice of career is analysed in more detail. Other case examples have also been used in the development of the computer application, and also in the fine-tuning of the concept and the model itself. The developed prototype should be considered mostly as an advisory tool in decision making. The features have been selected with thorough care, and therefore the desired qualities are mostly found in the final application. The realisation also puts some restrictions on the possibilities, of course, but in general I think that we can be rather satisfied with the results achieved. The decision model includes rule-based methodologies, numerical algorithms and procedures, statistical methodologies such as distributions, and visual support. The use of these methodologies in the model has been explained in more detail in the previous chapters. The role of visual support is not very remarkable in this application, although it can generally be considered quite important and also an essential part of the concept. There are many possibilities for making errors, for instance in calculating the weight coefficients and in producing the values in the decision table, but mostly the sensitivity analysis helps to find the weak points in the analysis. Plausible comparison of different attributes is also a problem. By giving the weight coefficients, an attempt is made to put the attributes into a kind of order of importance, but justifying the comparison itself is often problematic. For instance, the comparison of risk and cost is not always considered correct by common opinion. The numerical result of the analysis is not the most important result achieved. The most important result is to find out what the most important decisive factors are in the
whole decision making process, and therefore to learn more about the problem itself. A better understanding of the problem already helps in finding a good solution, even if the decision maker does not agree with the numerical result given by the tool. The decisive factors and their order of importance are found from the decision attributes (criteria) of the formulated decision problem. Sometimes the revelation of hidden motives helps the decision maker. This often happens during the long procedure of working with the decision problem, from the beginning of the problem formulation to the final analysis and even documentation. Better understanding through learning during the decision process is one of the key issues that this paper is trying to present. Evaluation expertise is also very important to include in the solution procedure of the decision problem. Otherwise we are just playing with random numbers without a real connection to the decision itself that we are trying to solve. This paper is an example of using a theoretical methodology to solve a practical problem. The model constructed is a kind of completed multi-criteria decision analysis (MCDA) concept. The decision table itself is very similar, but the difference is in the way the different components and values of the table are produced. In the ordinary MCDA method more handwork is done, while this system is automated rather far. The case about the choice of career is a typical one-time decision. The case of buying a new car is very similar in this regard. On the other hand, the game of checkers is a completely different type of decision problem. In that case rather similar decision situations come up repeatedly, and the role of historical data, retrieval and prediction becomes more important. The case of optimising a travel route is again a different, third type of problem. Only the first one of these problems was analysed in this paper, although the other cases have also been calculated with the tool; they were only very briefly introduced here. Reporting a large amount of analysis results would be impossible and not appropriate in a short paper, and this has also allowed a better focus on the chosen particular problem. With a different type of decision problem the methodology used would change in many parts. Such a case could be the topic of another paper. In this paper the methodology used has been reflected through a single application in order to point out the findings in detail. Although many persons were analysed in the choice of career case example experiment, no statistical analysis for the whole random sample was made. Concentrating more on this issue is one clear future need. As a tool capable of handling rather large amounts of data in a moderate time has been built, a natural way to proceed is to widen the scope in this direction. In the analysis of the other persons choosing their career, numerous variations were found in the results, and also in different phases of the analysis. Because the procedure is similar from the methodological point of view, reporting this part of the analysis has been omitted. The realisation platform for the computer application is the G2 expert system shell. This environment is quite suitable for this kind of purpose. G2 is very strong in heuristics, and not so good in numerical calculation. Although both of them are needed in this application, the need for very heavy numerical calculation is not essential. As a kind of combinatorial methodology has been developed, other
platforms could also be considered. G2 was just a natural choice for making the first prototype to test the ideas presented in this paper. I owe an acknowledgement to the Laboratory of Automation Technology of our university for the possibility of using the G2 expert system shell as the platform of the computer application, which made the analysis of the case examples possible.
References

[1] Zopounidis C., M. Doumpos. Stock evaluation using preference disaggregation methodology. Decision Sciences. Atlanta (1999)
[2] Ra J. Chainwise paired comparisons. Decision Sciences. Atlanta (1999)
[3] Sirola M. Decision concepts. To be published.
[4] Ashmos D., Duchon D., McDaniel R. Participation in strategic decision making: the role of organizational predispositions and issue interpretation. Decision Sciences. Atlanta (1998)
[5] Santhanam R., Elam J. A survey of knowledge-based systems research in decision sciences (1980-1995). The Journal of the Operational Research Society. Oxford (1998)
[6] Rummel J. An empirical investigation of costs in batching decisions. Decision Sciences. Atlanta (2000)
[7] Sirola M. Computerized decision support systems in failure and maintenance management of safety critical processes. VTT Publications 397. Espoo, Finland (1999)
[8] Laakso K., Sirola M., Holmberg J. Decision modelling for maintenance and safety. International Journal of Condition Monitoring and Diagnostic Engineering Management. Birmingham (1999) Vol. 2, No. 3, ISSN 1363-7681, pp. 13-17
[9] Sirola M. Applying decision analysis method in process control problem in accident management situation. Proceedings of XIV International Conference on System Science. Wroclaw, Poland (2001)
Automated Knowledge Acquisition Based on Unsupervised Neural Network and Expert System Paradigms

Nazar Elfadil and Dino Isa
Division of Engineering, University of Nottingham in Malaysia
No. 2 Jln Conlay, 50450 Kuala Lumpur, Malaysia
Tel: 60321408137, Fax: 603-27159979
[email protected]
Abstract. This paper presents an approach to automated knowledge acquisition using Kohonen self-organizing maps and k-means clustering. For the sake of illustrating the overall system architecture and validating it, a data set representing world animals has been used as the training data set. The verification of the produced knowledge base was done by using a conventional expert system.
1  Problem Background
In our daily life, we cannot avoid making decisions. Decision-making may be defined as reaching a conclusion or determination upon a problem at hand. However, in recent years, the problems to be solved have become more complex. Consequently, knowledge-based decision-making systems have been developed to aid us in solving complex problems. Nevertheless, the knowledge base itself has become the bottleneck, as it is the part of the system that is still developed manually. As the size and complexity of the problems increase, and experts become scarce, the manual extraction of knowledge becomes very difficult. Hence, it is imperative that the task of knowledge acquisition be automated. The demand for automated knowledge acquisition systems has increased dramatically. Previous approaches to automated knowledge acquisition are based on decision trees, progressive rule generation, and supervised neural networks [1]. All the above-mentioned approaches are supervised learning methods, requiring training examples combined with their target output values. In real-world cases, target data are not always known and available to provide to the system before training on the data set starts [2]. This paper is organized as follows: Section 2 presents an overview of automated knowledge acquisition. Section 3 demonstrates an illustrative application. Finally, Section 4 presents the conclusion and future work.
2  Automated Knowledge Acquisition
The paper proposes an automated knowledge acquisition method in which knowledge (connectionist) is extracted from data that have been classified by a Kohonen self-organizing map (KSOM) neural network. This knowledge (at this stage) is of the intermediate-level concept rule hierarchy. The final concept rule hierarchy is generated by applying a rule generation algorithm that is aided by an expert system inference engine. The resulting knowledge (symbolic) may be used in the construction of the symbolic knowledge base of an expert system. The proposal is rationalized from the realization that most complex real-world problems are not solvable using just symbolic or just adaptive processing alone. However, not all problems are suitable for neural expert system integration [3]. The most suitable ones are those that seek to classify inputs into a small number of groups. Figure 1 illustrates the top-level architecture that integrates the neural network and the expert system.
Fig. 1. Neural Expert system architecture
3  Illustrative Application
To illustrate the details of the various tasks in the proposed method, a simple and illustrative case study will be used. The system has been trained with the animal data set illustrated in Table 1. Each pattern consists of 11 attributes, and the whole animal data set represents 11 kinds of animals, as shown in Table 1. The training data are composed of 1023 examples.
Table 1. Portion of animal data set input data

Pattern  | Is: Small Medium Big | Has: 2 legs 4 legs Hair Feathers | Likes to: Hunt Run Fly Swim
Dove     |      1      0     0  |        1      0     0      1     |            0    0   1    0
Duck     |      1      0     0  |        1      0     0      1     |            0    0   0    1
Goose    |      1      0     0  |        1      0     0      1     |            0    0   1    1
Hawk     |      1      0     0  |        1      0     0      1     |            1    0   1    0
Cat      |      1      0     0  |        0      1     1      0     |            1    0   0    0
Eagle    |      0      1     0  |        1      0     0      1     |            1    0   1    0
Fox      |      0      1     0  |        0      1     1      0     |            1    0   0    0
Dog      |      0      1     0  |        0      1     1      0     |            0    1   0    0
Wolf     |      0      1     0  |        0      1     1      0     |            1    1   0    0
Lion     |      0      0     1  |        0      1     1      0     |            1    1   0    0
Horse    |      0      0     1  |        0      1     1      0     |            0    1   0    1
Step 1: initialize the weights to small random values and set the initial neighborhood to be large.
Step 2: stimulate the net with a given input vector.
Step 3: calculate the Euclidean distance between the input and each output node, and select the output node j for which D(j) = \sum_i (w_{ij} - x_i)^2 is a minimum.
Step 4: update the weights for the selected node and the nodes within its neighborhood: w_{ij}(new) = w_{ij}(old) + \alpha (x_i - w_{ij}(old)).
Step 5: repeat from Step 2 unless the stopping condition is met.

Fig. 2. KSOM learning algorithm
The data-preprocessing phase consists of normalization and initialization of the input data. In normalization, the objective is to ensure that no data item dominates over the rest of the input data vectors [4]. The initialization involves three tasks, namely weight initialization, topology initialization, and neighborhood initialization. The hexagonal lattice type is chosen as the map topology in the animal data set case study. The number of output nodes is chosen through comprehensive trials [5]. The weights of the neural network are initialized by either linear or random initialization. The random initialization technique is chosen here. For the neighborhood function, Gaussian or bubble functions are the suitable choices [3]. The bubble function is considered the simpler but adequate one, and it is applied here.
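A compact NumPy sketch of the learning loop of Fig. 2 is given below; for brevity it uses a rectangular grid instead of the hexagonal lattice, and the map size, learning-rate and radius schedules are illustrative choices, not the settings used in the reported experiments.

```python
# Sketch of the KSOM learning loop of Fig. 2 with a rectangular grid and a
# "bubble" neighbourhood. Map size, learning rate and radius schedules are
# illustrative assumptions, not the reported experimental settings.
import numpy as np

def train_ksom(data, rows=10, cols=10, epochs=50, alpha0=0.5, radius0=5.0):
    rng = np.random.default_rng(0)
    n_inputs = data.shape[1]
    weights = rng.random((rows, cols, n_inputs))             # Step 1: small random weights
    grid = np.dstack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"))

    for epoch in range(epochs):
        alpha = alpha0 * (1.0 - epoch / epochs)              # shrinking learning rate
        radius = max(1.0, radius0 * (1.0 - epoch / epochs))  # shrinking neighbourhood
        for x in data:                                        # Step 2: present an input vector
            dist = ((weights - x) ** 2).sum(axis=2)           # Step 3: D(j) for every node
            winner = np.unravel_index(dist.argmin(), dist.shape)
            # Step 4: bubble neighbourhood update around the winning node
            in_bubble = np.abs(grid - np.array(winner)).max(axis=2) <= radius
            weights[in_bubble] += alpha * (x - weights[in_bubble])
    return weights                                            # Step 5: stop after all epochs

# Usage with binary attribute vectors such as the animal patterns of Table 1
animals = np.array([[1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0],   # dove
                    [0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1]])  # horse
codebook = train_ksom(animals)
```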
Table 2. Portion of KSOM output

Output nodes     Input nodes
(Row, Column)   Q1   Q2   Q3   Q4   Q5   Q6   Q7   Q8   Q9   Q10  Q11
(1,1)            0   1.0   0    0   1.0  1.0   0   1.0   0   1.0  1.0
(1,2)            0   1.0   0    0   1.0  1.0   0   0.9   0   0.9  1.0
(1,3)            0   1.0   0    0   1.0  1.0   0   0.9   0   0.9  1.0
(1,4)            0   1.0   0    0   1.0  1.0   0   0.5   0   0.5  1.0
(1,5)            0   1.0   0    0   1.0  1.0   0   0.1   0   0.1  0.9
The machine learning and clustering phase is composed of the KSOM learning and K-means clustering algorithms. The proposed method employs unsupervised learning, which is the key contribution of this research. Figure 2 outlines the KSOM learning algorithm, and Table 2 illustrates a portion of the result of the KSOM training session. Figures 3 and 4 show the visual representation of the KSOM output and the K-means output, respectively. Techniques for selecting the neighborhood range and the learning rate are well defined in [5].

The next step is to classify the updated weights explicitly; this process is referred to as clustering. The terminology comes from the appearance of an incoming sequence of feature vectors which arrange themselves into clusters: groups of points that are closer to each other and to their own centers than to other groups. When a feature vector is input to the system, its distance to the existing cluster representatives is determined, and it is either assigned to the cluster with minimal distance or taken to be the representative of a new cluster. The required clustering is carried out by the modified K-means algorithm, which self-organizes its input data to create clusters [5]. Figure 4 shows a visual representation of the clustering session output, where each gray shade represents a certain cluster. The next step is to find the codebook vector and the indices for each cluster; these data contain the weights that distinguish and characterize each cluster.

In this stage of the knowledge acquisition process, a set of symbolic rules that map the input nodes into output nodes (with respect to each cluster) is extracted. The antecedents of the rules that define these concepts consist of contributory and inhibitory input weights. In a KSOM network, each output node is connected to every input node, with the strength of interconnection reflected in the associated weight vector. The larger the weight associated with a link, the greater the contribution of the corresponding input node to the output node; the input with the largest weight link makes the largest contribution [4]. To distinguish contributory inputs from inhibitory inputs, the weights are binarized: a real-valued weight is converted to 1 if contributory and to 0 if inhibitory. There are two approaches to do this, namely the threshold and breakpoint techniques [4]. The threshold technique has been chosen here, with the threshold set at 50% (i.e., values below 0.5 are treated as 0 and values above 0.5 as 1). This choice of threshold makes misclassification in either direction equally likely. More information concerning this selection is found in [5].
Fig. 3. Output of Kohonen NN learning
Fig. 4. Output of modified Kmeans clustering
The final set of antecedents in each cluster usually contains some duplicated patterns; this redundancy is removed. The antecedents can then be mapped symbolically to each cluster to obtain the rules for each cluster. The symbolic rule extraction algorithm is an inductive learning procedure; the algorithm, provided in Figure 5, is self-explanatory.
Fig. 5. Symbolic rule extraction algorithm
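The following Python sketch illustrates the kind of rule extraction described above: codebook weights are binarized at the 50% threshold, contributory attributes form the antecedents, and duplicate antecedent patterns within each cluster are removed. The attribute names and the function interface are illustrative assumptions, not the algorithm of Figure 5 verbatim.

```python
import numpy as np

# Illustrative attribute names matching the animal data set of Table 1
ATTRIBUTES = ["Small", "Medium", "Big", "2_legs", "4_legs", "Hair",
              "Feathers", "Hunt", "Run", "Fly", "Swim"]

def extract_rules(codebook, node_cluster, threshold=0.5):
    """Sketch of the rule-extraction step.

    codebook     : (n_nodes, n_attributes) KSOM codebook weight vectors
    node_cluster : (n_nodes,) cluster index assigned to each node by K-means
    Returns a dict mapping each cluster to its set of antecedent tuples.
    """
    # 1 = contributory, 0 = inhibitory (50% threshold technique)
    binary = (codebook > threshold).astype(int)
    rules = {}
    for cluster in np.unique(node_cluster):
        antecedents = set()                       # the set removes duplicates
        for row in binary[node_cluster == cluster]:
            ante = tuple(a for a, v in zip(ATTRIBUTES, row) if v == 1)
            antecedents.add(ante)
        rules[int(cluster)] = antecedents
    return rules
```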
The system provides rules that recognize the various animal types. Given an input pattern from the animal data, the corresponding cluster can now be recognized using this rule base. To make the rule antecedents more comprehensible, the inhibitory parts were removed, as shown in Table 3. To test and evaluate the final symbolic rules, an expert system was developed in the C language, and the rules were evaluated using its inference engine.

Table 3. Elaborated extracted symbolic rule base
Rule No.  Antecedents parts                                   Conclusion
1         (Big) & (4_legs) & (hair) & (hunt) & (run)          Lion
2         (Big) & (4_legs) & (hair) & (run) & (swim)          Horse
3         (Small) & (2_legs) & (feathers) & (swim)            Duck
4         (Small) & (2_legs) & (feathers) & (fly) & (swim)    Goose
5         (Small) & (4_legs) & (hair) & (hunt)                Cat
6         (Small) & (2_legs) & (feathers) & (fly)             Dove
7         (Medium) & (4_legs) & (hair) & (run)                Dog
8         (Small) & (2_legs) & (feathers) & (hunt) & (fly)    Hawk
9         (Medium) & (4_legs) & (hair) & (hunt)               Fox
10        (Medium) & (4_legs) & (hair) & (hunt) & (run)       Wolf
11        (Medium) & (2_legs) & (feathers) & (hunt) & (fly)   Eagle

4 Conclusion
The animal data set case study shows that the proposed automated knowledge acquisition method can successfully extract knowledge in the form of production rules from a numerical data set representing the salient features of the problem domain. This study has demonstrated that symbolic knowledge extraction can be performed using unsupervised-learning KSOM neural networks, where no target output vectors are available during training. The system is able to learn from examples via the neural network section, and the extracted knowledge can form the knowledge base of an expert system, from which explanations may be provided. Large, noisy, and incomplete data sets can be handled. The system demonstrates the viability of integrating neural networks and expert systems to solve real-world problems.
References

[1] T. S. Dillon, S. Sestito, M. Witten, M. Suing: "Automated Knowledge Acquisition Using Unsupervised Learning". Proceedings of the Second IEEE Workshop on Emerging Technologies and Factory Automation (EFTA'93), Cairns, Sept. 1993, pp. 119-128.
[2] M. S. Kurzyn: "Expert Systems and Neural Networks: A Comparison". IEEE Expert, pp. 222-223, 1993.
[3] Sabrina Sestito: Automated Knowledge Acquisition. Prentice Hall, Australia, 1994.
[4] N. Elfadil, M. Khalil, S. M. Nor, S. Hussein: "Kohonen Self-Organizing Maps & Expert System for Disk Network Performance Prediction". Journal of Systems Analysis Modeling & Simulation (SAMS), Vol. 42, pp. 1025-1043, 2002.
[5] N. Elfadil, M. Khalil, S. M. Nor, S. Hussein: "Machine Learning: The Automation of Knowledge Acquisition Using Kohonen Self-Organizing Maps Neural Networks". Malaysian Journal of Computer Science, Vol. 14, No. 1, pp. 68-82, June 2001.
Selective-Learning-Rate Approach for Stock Market Prediction by Simple Recurrent Neural Networks

Kazuhiro Kohara

Chiba Institute of Technology, 2-17-1, Tsudanuma, Narashino, Chiba 275-0016, Japan
[email protected]
Abstract. We have investigated selective learning techniques for improving the ability of back-propagation neural networks to predict large changes. The prediction of daily stock prices was taken as an example of a noisy real-world problem. We previously proposed the selective-presentation and selective-learning-rate approaches and applied them to feed-forward neural networks. This paper applies the selective-learning-rate approach to three types of simple recurrent neural networks. We evaluated their performances through experimental stock-price prediction. Using the selective-learning-rate approach, the networks learned the large changes well, and the profit per trade improved for all of the simple recurrent neural networks.
1 Introduction
Prediction using back-propagation neural networks [1] has been extensively investigated (e.g., [2-5]), and various attempts have been made to apply neural networks to financial market prediction (e.g., [6-18]), electricity load forecasting (e.g., [19, 20]) and other areas (e.g., flour price prediction [21]). In the usual approach, all training data are presented to a neural network equally (i.e., presented in each cycle), and the learning rates are equal for all the training data, independently of the size of the changes in the prediction-target time series. Generally, the ability to predict large changes is more important than the ability to predict small changes, as we noted in our previous paper [16]. When all training data are presented equally with an equal learning rate, the neural network learns the small and large changes equally well, so it cannot learn the large changes more effectively.

We have investigated selective learning techniques for improving the ability of a neural network to predict large changes. We previously proposed the selective-presentation and selective-learning-rate approaches and applied them to feed-forward neural networks [16, 17]. In the selective-presentation approach, the training data corresponding to large changes in the prediction-target time series are presented more often. In the selective-learning-rate approach, the learning rate for training data corresponding to small changes is reduced. This paper applies the selective-learning-rate approach to three types of simple recurrent neural networks. We evaluate their performances, using the prediction of daily stock prices as a noisy real-world problem.
2 Selective-Learning-Rate Approach
To allow neural networks to learn about large changes in prediction-target time series more effectively, we separate the training data into large-change data (L-data) and small-change data (S-data). L-data (S-data) have next-day changes that are larger (smaller) than a preset value. In the selective-learning-rate approach [17], all training data are presented in every cycle; however, the learning rate of the back-propagation training algorithm for S-data is reduced compared with that for L-data. The outline of the approach is as follows (a code sketch is given after the list).

Selective-Learning-Rate Approach:
1. Separate the training data into L-data and S-data.
2. Train back-propagation networks with a lower learning rate for the S-data than for the L-data.
3. Stop network learning at the point satisfying a certain stopping criterion.
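A minimal Python sketch of one training cycle under this scheme is given below. The `net.backprop_step(x, t, lr)` interface and the parameter names are assumptions made for illustration; the two learning-rate values follow the experiments described later (0.7 for L-data, 0.14 for S-data).

```python
def selective_learning_rate_epoch(net, X, y, threshold, lr_large=0.7, lr_small=0.14):
    """One training cycle of the selective-learning-rate approach (a sketch).

    X, y      : training inputs and next-day target changes
    threshold : preset value separating L-data (large changes) from S-data
    `net` is assumed to expose a single back-propagation update
    net.backprop_step(x, t, lr); this interface is illustrative only.
    """
    for x, t in zip(X, y):
        # all examples are presented in every cycle,
        # but S-data are trained with a reduced learning rate
        lr = lr_large if abs(t) > threshold else lr_small
        net.backprop_step(x, t, lr)
```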
3 Simple Recurrent Neural Prediction Model
We considered the following types of knowledge for predicting Tokyo stock prices. These types of knowledge involve numerical economic indicators.

1. If interest rates decrease, stock prices tend to increase, and vice versa.
2. If the dollar-to-yen exchange rate decreases, stock prices tend to decrease, and vice versa.
3. If the price of crude oil increases, stock prices tend to decrease, and vice versa.
We used the following five indicators as inputs to the neural network in the same way as in our previous work [16, 17].

• TOPIX: the chief Tokyo stock exchange price index
• EXCHANGE: the dollar-to-yen exchange rate (yen/dollar)
• INTEREST: an interest rate (3-month CD, new issue, offered rates) (%)
• OIL: the price of crude oil (dollars/barrel)
• NY: New York Dow-Jones average of the closing prices of 30 industrial stocks (dollars)
TOPIX was the prediction target. EXCHANGE, INTEREST and OIL were chosen based on the knowledge of numerical economic indicators. The Dow-Jones average was used because Tokyo stock market prices are often influenced by New York exchange prices. The information for the five indicators was obtained from the Nihon Keizai Shinbun (a Japanese financial newspaper).
The daily changes in these five indicators (e.g., ∆TOPIX(t) = TOPIX(t) − TOPIX(t−1)) were input into the neural networks, and the next-day change in TOPIX was presented to the neural network as the desired output. The back-propagation algorithm [1] was used to train the network. All the daily-change data were scaled to the interval [0.1, 0.9]. We considered three types of simple recurrent neural networks (SRN); their structures are shown in Figures 1, 2 and 3. The SRN-1 is similar to the recurrent network proposed by Elman [22], where the input C(t) to the context layer at time t is the output of the hidden layer at time t−1, H(t−1): C(t) = H(t−1). In the SRN-1 training, C(t) was initialized to 0 when t = 0. The structure of the SRN-1 was 25-20-1 (the 25 includes 5 nodes in the input layer and 20 in the context layer).
Fig. 1. SRN-1
The SRN-2 is similar to the recurrent network proposed by Jordan [23], where the input C(t) to the context layer at time t is the output of the output layer at time t−1, O(t−1): C(t) = O(t−1). In the SRN-2 training, C(t) was initialized to 0 when t = 0. The structure of the SRN-2 was 6-6-1 (the 6 includes 5 nodes in the input layer and 1 in the context layer). The SRN-3 is our original network. In the SRN-3, the input C(t) to the context layer at time t is the prediction error at time t−1: C(t) = ∆TOPIX(t−1) − O(t−1). In the SRN-3 training, C(t) was initialized to 0 when t = 0. The structure of the SRN-3 was also 6-6-1 (the 6 includes 5 nodes in the input layer and 1 in the context layer).
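The three context-layer update rules can be summarised in a short sketch; the function below is illustrative only and assumes the context is fed back once per time step, initialized to zero at t = 0 as in the paper.

```python
def context_input(variant, h_prev, o_prev, target_prev):
    """Context-layer input C(t) for the three SRN variants (a sketch).

    h_prev      : hidden-layer outputs H(t-1)   -> SRN-1 (Elman-style)
    o_prev      : output-layer output O(t-1)    -> SRN-2 (Jordan-style)
    target_prev : actual change dTOPIX(t-1)     -> SRN-3 feeds back the error
    """
    if variant == "SRN-1":
        return h_prev                      # C(t) = H(t-1)
    if variant == "SRN-2":
        return o_prev                      # C(t) = O(t-1)
    if variant == "SRN-3":
        return target_prev - o_prev        # C(t) = dTOPIX(t-1) - O(t-1)
    raise ValueError(variant)
```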
Fig. 2. SRN-2
Fig. 3. SRN-3
4 Evaluation through Experimental Stock-Price Prediction

4.1 Experiments
We used data from a total of 409 days (from August 1, 1989 to March 31, 1991): 300 days for training, 109 days for making predictions.
In Experiments 1, 3 and 5, all training data were presented to the SRN-1, SRN-2, and SRN-3, respectively, in each cycle with an equal learning rate (ε = 0.7). In Experiments 2, 4 and 6, the learning rate for the S-data was reduced to 20% of that for the L-data (i.e., ε = 0.7 for the L-data and ε = 0.14 for the S-data) in the SRN-1, SRN-2, and SRN-3 training, respectively. Here, the large-change threshold was 14.78 points (about US$1.40), which was the median (the 50% point) of the absolute values of the TOPIX daily changes in the training data. In each experiment, network learning was stopped at 3000 learning cycles. The momentum parameter α was 0.7. All the weights and biases in the neural network were initialized randomly between -0.3 and 0.3.

When a large change in TOPIX was predicted, we calculated “Profit” as follows: when the predicted direction was the same as the actual direction, the daily change in TOPIX was earned, and when it was different, the daily change in TOPIX was lost. This calculation of profit corresponds to the following experimental TOPIX trading system. A buy (sell) order is issued when the predicted next-day up (down) move in TOPIX is larger than a preset value which corresponds to a large change. When a buy (sell) order is issued, the system buys (sells) TOPIX shares at the current price and subsequently sells (buys) them back at the next-day price. Transaction costs on the trades were ignored in calculating the profit. The more accurately a large change is predicted, the larger the profit is.
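The profit calculation described above can be sketched as follows; the function name and the thresholding of the predicted change are illustrative assumptions consistent with the trading rule in the text.

```python
def simulated_profit(predicted, actual, prediction_threshold):
    """Sketch of the profit calculation used in the experiments.

    predicted, actual : sequences of predicted and actual next-day TOPIX changes
    A trade is made only when the size of the predicted change exceeds the
    threshold; the actual change is earned when the predicted direction is
    right and lost otherwise.  Transaction costs are ignored, as in the paper.
    """
    profit, trades = 0.0, 0
    for p, a in zip(predicted, actual):
        if abs(p) <= prediction_threshold:
            continue                       # no large change predicted -> no trade
        trades += 1
        profit += abs(a) if p * a > 0 else -abs(a)
    per_trade = profit / trades if trades else 0.0
    return profit, trades, per_trade
```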
4.2 Results
In each experiment the neural network was run ten times on the same training data with different initial weights, and the average was taken. The first experimental results are shown in Table 1, where a change larger than 14.78 points (the 50% point) in TOPIX was predicted (i.e., the “prediction threshold” equals 14.78) and “Profit” was calculated according to the above trading method. When the prediction threshold was comparatively low, the number of trades was too large; in actual stock trading, the larger the number of trades, the higher the transaction costs.

Table 1. Experimental results 1 (prediction threshold = 14.78)
                        SRN-1                 SRN-2                 SRN-3
                   Ex. 1     Ex. 2        Ex. 3     Ex. 4        Ex. 5     Ex. 6
                   equal     selective    equal     selective    equal     selective
Profit              466       768          486       781          367       760
Number of trades     24.2      45.0         24.4      44.4         17.4      36.7
Profit per trade     19.2      17.0         19.9      17.5         21.1      20.7
The second experimental results are shown in Table 2, where a change larger than 31.04 points (the 75% point) in TOPIX was predicted and “Profit” was calculated in the same way. When the prediction threshold became high, the number of trades became small. In the equal-learning-rate approach (Experiments 1, 3 and 5), the networks could not learn the large changes well and the output values of the networks could not reach a high level, so the number of trades was very small. In our selective-learning-rate approach (Experiments 2, 4 and 6), the networks learned the large changes well and the number of trades became larger than with the equal-learning-rate approach. Using the selective-learning-rate approach, the number of trades became large and the profit per trade was improved for all three types of simple recurrent neural networks. The SRN-3 achieved the best profit per trade in these experiments.

Table 2. Experimental results 2 (prediction threshold = 31.04)
                        SRN-1                 SRN-2                 SRN-3
                   Ex. 1     Ex. 2        Ex. 3     Ex. 4        Ex. 5     Ex. 6
                   equal     selective    equal     selective    equal     selective
Profit               27       256            8       180           20       161
Number of trades      1.4      10.5          0.5       6.9          0.8       5.2
Profit per trade     19.2      24.3         16.0      26.0         25.0      31.0

5 Conclusion
We investigated selective learning techniques for stock market forecasting by neural networks. We applied our selective-learning-rate approach, in which the learning rate for training data corresponding to small changes is reduced, to three types of simple recurrent neural networks. The results of several experiments on stock-price prediction showed that, using the selective-learning-rate approach, the networks learned the large changes well and the profit per trade improved for all simple recurrent neural networks. Next, we will apply these techniques to other real-world forecasting problems. We also plan to develop a forecasting method that integrates statistical analysis with neural networks.
References

[1] Rumelhart D, Hinton G, Williams R. Learning internal representations by error propagation. In: Rumelhart D, McClelland J and the PDP Research Group (eds). Parallel Distributed Processing 1986; 1. MIT Press, Cambridge, MA
[2] Weigend A, Gershenfeld N (eds). Time Series Prediction: Forecasting the Future and Understanding the Past. Addison-Wesley, Reading, MA, 1993
[3] Vemuri V, Rogers R (eds). Artificial Neural Networks: Forecasting Time Series. IEEE Press, Los Alamitos, CA, 1994
[4] Pham D, Liu X. Neural Networks for Identification, Prediction and Control. Springer, 1995
[5] Kil D, Shin F. Pattern Recognition and Prediction with Applications to Signal Characterization. American Institute of Physics Press, 1996
[6] Azoff E. Neural Network Time Series Forecasting of Financial Markets. John Wiley and Sons, West Sussex, 1994
[7] Refenes A, Azema-Barac M. Neural network applications in financial asset management. Neural Comput & Applic 1994; 2(1): 13-39
[8] Goonatilake S, Treleaven P (eds). Intelligent Systems for Finance and Business. John Wiley and Sons, 1995
[9] White H. Economic prediction using neural networks: the case of IBM daily stock return. In: Proc of Int Conf Neural Networks 1988; II-451-II-458. San Diego, CA
[10] Baba N, Kozaki M. An intelligent forecasting system of stock price using neural networks. In: Proc of Int Conf Neural Networks 1992; I-371-I-377. Singapore
[11] Dutta S, Shekhar S. Bond rating: a non-conservative application of neural networks. In: Proc of Int Conf Neural Networks 1988; II-443-II-450. San Diego, CA
[12] Freisleben B. Stock market prediction with backpropagation networks. In: Belli F, Rademacher J (eds). Lecture Notes in Computer Science 604, pp. 451-460, Springer-Verlag, Heidelberg, 1992
[13] Kamijo K, Tanigawa T. Stock price pattern recognition - a recurrent neural network approach. In: Proc of Int Conf Neural Networks 1990; I-215-I-221. San Diego, CA
[14] Kimoto T, Asakawa K, Yoda M, Takeoka M. Stock market prediction with modular neural networks. In: Proc of Int Conf Neural Networks 1990; I-1-I-6. San Diego, CA
[15] Tang Z, Almeida C, Fishwick P. Time series forecasting using neural networks vs. Box-Jenkins methodology. Simulation 1991; 57(5): 303-310
[16] Kohara K, Fukuhara Y, Nakamura Y. Selective presentation learning for neural network forecasting of stock markets. Neural Comput & Applic 1996; 4(3): 143-148
[17] Kohara K, Fukuhara Y, Nakamura Y. Selectively intensive learning to improve large-change prediction by neural networks. In: Proc of Int Conf Engineering Applications of Neural Networks 1996; 463-466. London
[18] Kohara K. Neural networks for economic forecasting problems. In: Leondes CT (ed). Expert Systems - The Technology of Knowledge Management and Decision Making for the 21st Century. Academic Press, 2002
[19] Park D, El-Sharkawi M, Marks II R, Atlas L, Damborg M. Electric load forecasting using an artificial neural network. IEEE Trans Power Syst 1991; 6(2): 442-449
[20] Caire P, Hatabian G, Muller C. Progress in forecasting by neural networks. In: Proc of Int Conf Neural Networks 1992; II-540-II-545. Baltimore, MD
[21] Chakraborty K, Mehrotra K, Mohan C, Ranka S. Forecasting the behavior of multivariate time series using neural networks. Neural Networks 1992; 5: 961-970
[22] Elman JL. Finding structure in time. CRL Tech. Rep. 8801, Center for Research in Language, Univ. of California, San Diego, 1988
[23] Jordan MI. Attractor dynamics and parallelism in a connectionist sequential machine. In: Proc 8th Annual Conf Cognitive Science Society 1986; 531-546. Erlbaum
A Neural-Network Technique for Recognition of Filaments in Solar Images

V.V. Zharkova 1 and V. Schetinin 2

1 Department of Cybernetics, University of Bradford, BD7 1DP, UK
[email protected]
2 Department of Computer Science, University of Exeter, EX4 4QF, UK
[email protected]
Abstract. We describe a new neural-network technique developed for automated recognition of solar filaments visible in hydrogen H-alpha line full disk spectroheliograms. This technique deploys an artificial neural network (ANN) with one input neuron, one output neuron and two hidden neurons associated either with the filament or with the background pixels in a given fragment. The ANN learns to recognize a filament depicted on a local background from a single image fragment labelled manually. The trained neural network properly recognized filaments in the testing image fragments depicted on backgrounds with various brightness caused by atmospheric distortions. Using a parabolic activation function, this technique was extended to the recognition of multiple solar filaments occasionally appearing in selected fragments.
1 Introduction
Solar images observed from the ground and from space-based observatories in various wavelengths are digitised and stored in different catalogues, which are to be unified under grid technology. Robust techniques including limb fitting, removal of geometrical distortion, centre position and size standardisation and intensity normalisation were developed to put the Hα and Ca K line full disk images taken at the Meudon Observatory (France) into a standardised form [1]. There is a growing interest in widespread ground-based daily observations of the full solar disk in the hydrogen Hα line, which can provide important information on long-term solar activity variations over months or years. The project European Grid of Solar Observations [2] was designed to deal with the automated detection of various features associated with solar activity, such as sunspots, active regions and filaments, or solar prominences.

Filaments are the projections on the solar disk of prominences seen as very bright and large-scale features on the solar limb [1]. Their location and shape do not change very much for a long time and, hence, their lifetime is likely to be much longer than one solar rotation. However, there are visible changes in the filament elongation, position with respect to an active region and magnetic field configuration. For this reason the automated detection of solar filaments is a very important task to tackle in order to understand the physics of prominence formation, support and disruption. Quite a few techniques have been explored for different levels of feature detection, such as rough detection with a mosaic threshold technique [3] and image segmentation and region growing techniques [4] - [7].

Artificial Neural Networks (ANNs) [8], [9] applied to the filament recognition problem commonly require a representative set of image data available for training. The training data have to represent image fragments depicting filaments under the different conditions in which the ANN has to solve the recognition problem. For this reason the number of training examples must be large; however, the image fragments are still taken from the solar catalogues manually. In this paper we describe a new neural-network technique which is able to learn to recognize filaments from a few image fragments labelled manually. This technique deploys an artificial neural network (ANN) with one input neuron, one output neuron and two hidden neurons associated either with the filament or with the background pixels in a given fragment. Despite the difference in backgrounds in the selected fragments of solar images containing filaments, the trained network properly recognized all the filaments in these fragments. Using a parabolic activation function, this technique was extended to recognize multiple solar filaments occasionally appearing in some fragments.
2 A Recognition Problem
First, let us introduce the image data as an n×m matrix X = {xij}, i = 1, …, n, j = 1, …, m, consisting of pixels whose brightness ranges between 1 and 255. This matrix depicts a filament, which is no more than a dark elongated feature observable on the solar surface against a high background brightness. A given pixel xij ∈ X may then belong to a filament region, class Ω0, or to a non-filament region, class Ω1. Note that the brightness of the non-filament region varies over the solar surface. Dealing with images, we can make the realistic assumption that the influence of neighbouring pixels on the central pixel xij is restricted to k elements. Using this assumption we can easily define a window, a k×k matrix P(i,j), with central pixel xij and its k nearest neighbours. The background of the filament elements is assumed to be additive to xij, which allows us to evaluate it and subtract it from the brightness values of all the elements of matrix P. We can now define a background function u = ϕ(X; i, j), which reflects the total contribution of the background elements to the pixel xij. The parameters of this function can be estimated from the image data X. In order to learn the background function ϕ and then decide whether a pixel x is a filament element or not, we can use image fragments whose pixels are manually labelled and assigned either to class Ω0 or class Ω1. A natural way to do this is to use a neural network which is able to learn to recognize filaments from a small number of labelled image fragments.
3 The Neural-Network Technique for Filament Recognition
As stated in the Introduction, this technique deploys an artificial neural network with one input neuron, one output neuron and two hidden neurons associated either with the filament or with the background pixels in a given fragment. In general, neural networks perform a threshold technique of pattern recognition using one hidden neuron. In the case of solar images, however, it turns out that the background on which the darker filaments occur varies from image to image. In order to make filament detection independent of the seeing conditions, this technique has to take into account the variability of background elements in images. Hence, the idea behind the proposed method is to use the additional information on the contribution of the variable background elements, which is represented by the function u = ϕ(X; i, j). This function, as we assume, can be learnt from image data. One possible way to estimate the function ϕ is to approximate its values for each pixel xij of a given image X. For filament recognition we can use either a parabolic or a linear approximation of this function. The first type is suitable for small image fragments, whereas the second one is used for relatively large fragments of the solar surface. Below we describe our neural-network technique exploiting the latter type of approximation.

For image processing, our algorithm exploits a standard sliding-window technique in which a given image matrix X is transformed into a column matrix Z consisting of q = (n – k + 1)(m – k + 1) columns z(1), …, z(q). Each column z presents the r pixels taken from the matrix P, where r = k². The pixels x11, x12, …, x1k, …, xk1, xk2, …, xkk of this matrix are placed in the columns of the matrix Z so that the central element of P is located in the (r + 1)/2 position of the column z. Let us now introduce a feed-forward ANN consisting of two hidden neurons and one output neuron, as depicted in Fig. 1. The first hidden neuron is fed by the r elements of column vector z(j). The second hidden neuron evaluates the value uj of the background for the jth vector z(j). The output neuron makes a decision, yj ∈ {0, 1}, on the central pixel in the column vector z(j).
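A minimal Python sketch of this sliding-window transformation is given below; the function name and the use of NumPy are assumptions for illustration.

```python
import numpy as np

def image_to_columns(X, k):
    """Sketch of the sliding-window transformation described above.

    X : (n, m) image matrix of pixel brightnesses
    k : window size; each k x k window P(i, j) becomes one column of Z
    Returns Z with r = k*k rows and q = (n - k + 1)(m - k + 1) columns;
    for odd k the window's central pixel sits at row (r + 1)/2 - 1
    (0-based indexing).
    """
    n, m = X.shape
    cols = []
    for i in range(n - k + 1):
        for j in range(m - k + 1):
            cols.append(X[i:i + k, j:j + k].ravel())
    return np.array(cols).T            # shape (k*k, q)
```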
Fig. 1. The feed-forward neural network consisting of the two hidden and one output neurons
Assuming that the first hidden neuron is fed by the r elements of the column vector z, its output s is calculated as follows:

sj = f1(w0(1), w(1); z(j)), j = 1, …, q,    (1)

where w0(1), w(1), and f1 are the bias term, weight vector and activation function of the neuron, respectively. The activity of the second hidden neuron is proportional to the brightness of the background and can be described by the formula

uj = f2(w0(2), w(2); j), j = 1, …, q.    (2)

The bias term w0(2) and weight vector w(2) of this neuron are updated so that the output u becomes an estimate of the background component contributed to the pixels of the jth column z(j). The parameters w0(2) and w(2) can be learnt from the image data Z. Taking into account the outputs of the hidden neurons, the output neuron makes a final decision, yj ∈ (0, 1), for each column vector z(j) as follows:

yj = f3(w0(3), w(3); sj, uj), j = 1, …, q.    (3)

Depending on the activities of the hidden neurons, the output neuron assigns the central pixel of the column z(j) either to the class Ω0 or Ω1.
4 A Training Algorithm
In order to train our feed-forward ANN depicted in Fig. 1, we can use the back-propagation algorithms, which provide a global solution. These algorithms require recalculating the output sj for all q columns of matrix Z in each training epoch. However, there are also local solutions in which the hidden and output neurons are trained separately. Providing an acceptable accuracy, the local solutions can be found much more easily than the global ones. First, we need to fit the weight vector of the second hidden, or “background”, neuron that evaluates the contribution of the background elements. Fig. 2 depicts an example where the top left plot shows the image matrix X presenting a filament on an unknown background and the top right plot reveals the corresponding filament elements depicted in black. The two bottom plots in Fig. 2 show the outputs s (the left plot) and the weighted sum of s and u (the right one) plotted versus the columns of matrix Z. From the top left plot we see that the brightness of the background varies from the lowest level at the bottom left corner to the highest at the top right corner. Such variations of the background increase the output value u calculated over the q columns of matrix Z; see the increasing curve depicted in the bottom left plot. This plot shows that the background component changes over j = 1, …, q and can be fitted by a parabola. Based on this finding we can define a parabolic activation function of the “background” neuron as
(4)
152
V.V. Zharkova and V. Schetinin
1(a)
1(b)
2000
4
1500
2
1000
0
500
-2
0
0
5000 Y
10000
-4
0
5000 F
10000
Fig. 2. The example of image matrix X depicting the filament on the unknown background
The weight coefficients w0(2), w1(2), and w2(2) of this neuron can be fitted to the image data Z so that the squared error e between the outputs uj and sj becomes minimal:

e = Σj (uj – sj)² = Σj (w0(2) + w1(2) j + w2(2) j² – sj)² → min, j = 1, …, q.    (5)
The desired weight coefficients can be found with the least-squares method. Using a recursive learning algorithm [10], we can improve the estimates of these coefficients owing to its robustness to non-Gaussian noise in image data. Thus the “background” neuron can be trained to evaluate the background component u. In the bottom right plot of Fig. 2 we present the normalized values of sj, which are no longer affected by the background component. The recognized filament elements are shown in the top right plot of Fig. 2. By comparing the left and right top plots in Fig. 2 we can see that the second hidden neuron has successfully learnt to evaluate the background component from the given image data Z.

Before training the output neuron, the weights of the first hidden neuron have to be found. For this neuron a local solution is achieved by setting its coefficients equal to 1. If it is necessary to improve the recognition accuracy, one can update these weights by using the back-propagation algorithm. After defining the weights for both hidden neurons, it is possible to train the output neuron, which makes decisions of 0 or 1. Let us re-write the output yj of this neuron as follows:

yj = 0, if w1 sj + w2 uj < w0, and yj = 1, if w1 sj + w2 uj ≥ w0.    (6)

Then the weight coefficients w0, w1, and w2 can be fitted in such a way that the recognition error e is minimal:

e = Σi |yi – ti| → min, i = 1, …, h,    (7)

where |⋅| denotes the modulus operator, ti ∈ (0, 1) is the ith element of a target vector t and h is the number of its components, namely the training examples. In order to minimize the error e one can apply any supervised learning method, for example the perceptron learning rule [8], [9].
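Under the local-solution scheme described above, the background fit of Eqs. (4)-(5) and the perceptron-style training of the output neuron of Eqs. (6)-(7) might be sketched as follows. The least-squares fit via `numpy.polyfit` stands in for the recursive algorithm of [10], and the initial weights and learning rate are illustrative assumptions.

```python
import numpy as np

def fit_background(s):
    """Least-squares fit of the parabolic background u_j of Eqs. (4)-(5)."""
    j = np.arange(len(s))
    coeffs = np.polyfit(j, s, deg=2)            # returns [w2, w1, w0]
    return np.polyval(coeffs, j)                # u_j for every column

def train_output_neuron(s, u, targets, epochs=100, eta=0.01):
    """Perceptron-style training of the output neuron of Eqs. (6)-(7) (a sketch).

    s, u    : first and second hidden-neuron outputs for the labelled fragment
    targets : manual labels t_i in {0, 1}
    Returns (w0, w1, w2) for the decision y = 1 if w1*s + w2*u >= w0 else 0.
    """
    w0, w1, w2 = 0.0, 1.0, -1.0                 # illustrative initial values
    for _ in range(epochs):
        for si, ui, ti in zip(s, u, targets):
            yi = 1 if w1 * si + w2 * ui >= w0 else 0
            err = ti - yi
            w1 += eta * err * si
            w2 += eta * err * ui
            w0 -= eta * err                     # threshold moves opposite to the error
    return w0, w1, w2
```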
5 Results and Discussion
The neural-network technique with two hidden neurons, one input neuron and one output neuron described above was applied to the recognition of dark filaments in solar images. The full disk solar images obtained at the Meudon Observatory (France) during the period of March-April 2002 were considered for the identification [1]. The fragments with filaments were picked from the Meudon full disk images taken on various dates and for regions on the solar disk with different brightness and inhomogeneity in the solar atmosphere. There were 55 fragments selected depicting filaments on various backgrounds; one of them was used for training the ANN, and the remaining 54 were used for testing the trained ANN. Visually comparing the resulting and original images, we can conclude that our algorithm recognised these testing filaments well. Using a parabolic approximation we can now recognize large image fragments which may contain multiple filaments, as depicted in the left plot in Fig. 3.
Fig. 3. The recognition of the multiple filaments
The recognized filament elements, depicted here in black, are shown in the right plot. A visual comparison of the resulting and original images confirms that the proposed algorithm has recognized all four filaments, closely matching their location and shape.
6 Conclusions
The automated recognition of filaments in solar disk images is still a difficult problem because of the variable background and inhomogeneities in the solar atmosphere. The proposed neural-network technique, with two hidden neurons responsible for filament and background pixel values, can learn the recognition rules from a single image depicting a solar filament fragmented and labelled visually. The recognition rule has been successfully tested on the 54 other image fragments depicting filaments on various backgrounds. Despite the background differences, the trained neural network properly recognized both single and multiple filaments present in the testing image fragments. Therefore, the proposed neural-network technique can be effectively used for the automated recognition of filaments in solar images.
References

[1] Zharkova, V.V., Ipson, S.S., Zharkov, S.I., Benkhalil, A., Aboudarham, J., Bentley, R.D.: A full disk image standardisation of the synoptic solar observations at the Meudon Observatory. Solar Physics (2002) accepted
[2] Bentley, R.D. et al.: The European grid of solar observations. Proceedings of the 2nd Solar Cycle and Space Weather Euro-Conference, Vico Equense, Italy (2001) 603
[3] Qahwaji, R., Green, R.: Detection of closed regions in digital images. The International Journal of Computers and Their Applications 8(4) (2001) 202-207
[4] Bader, D.A., Jaja, J., Harwood, D., Davis, L.S.: Parallel algorithms for image enhancement and segmentation by region growing with experimental study. The IEEE Proceedings of IPPS'96 (1996) 414
[5] Turmon, M., Pap, J., Mukhtar, S.: Automatically finding solar active regions using SOHO/MDI photograms and magnetograms (2001)
[6] Turmon, M., Mukhtar, S., Pap, J.: Bayesian inference for identifying solar active regions (2001)
[7] Gao, J., Zhou, M., Wang, H.: A threshold and region growing method for filament disappearance area detection in solar images. The Conference on Information Science and Systems, The Johns Hopkins University (2001)
[8] Bishop, C.M.: Neural Networks for Pattern Recognition. Oxford University Press (1995)
[9] Nabney, I.T.: NETLAB: Algorithms for Pattern Recognition. Springer-Verlag (1995)
[10] Schetinin, V.: A Learning Algorithm for Evolving Cascade Neural Networks. Neural Processing Letters, Kluwer, 1 (2003)
Learning Multi-class Neural-Network Models from Electroencephalograms

Vitaly Schetinin 1, Joachim Schult 2, Burkhart Scheidt 2, and Valery Kuriakin 3

1 Department of Computer Science, Harrison Building, University of Exeter, EX4 4QF, UK
[email protected]
2 Friedrich-Schiller-University of Jena, Ernst-Abbe-Platz 4, 07740 Jena, Germany
[email protected]
3 Intel Russian Research Center, N. Novgorod, Russia
Abstract. We describe a new algorithm for learning multi-class neuralnetwork models from large-scale clinical electroencephalograms (EEGs). This algorithm trains hidden neurons separately to classify all the pairs of classes. To find best pairwise classifiers, our algorithm searches for input variables which are relevant to the classification problem. Despite patient variability and heavily overlapping classes, a 16-class model learnt from EEGs of 65 sleeping newborns correctly classified 80.8% of the training and 80.1% of the testing examples. Additionally, the neural-network model provides a probabilistic interpretation of decisions.
1 Introduction
Learning classification models from electroencephalograms (EEGs) is still a complex problem [1] - [7] for the following reasons: first, the EEGs are strongly non-stationary signals which depend on the individual Background Brain Activity (BBA) of patients; second, the EEGs are corrupted by noise and muscular artifacts; third, a given set of EEG features may contain features which are irrelevant to the classification problem and may seriously hurt the classification results; and fourth, clinical EEGs are large-scale data recorded over several hours, so the learning time becomes crucial.

In general, multi-class problems can be solved by using one-against-all binary classification techniques [8]. However, a natural way to induce multi-class concepts from real data is to use Decision Tree (DT) techniques [9] - [12], which exploit a greedy heuristic or hill-climbing strategy to find input variables which efficiently split the training data into classes. To induce linear concepts, multivariate or oblique DTs have been suggested which exploit threshold logical units or so-called perceptrons [13] - [16]. Such multivariate DTs, known also as Linear Machines (LMs), are able to classify linearly separable examples. Using the algorithms [8], [13] - [15], the LMs can also learn to classify non-linearly separable examples. However, such DT methods applied to inducing multi-class concepts from real large-scale data become impractical due to the large computations required [15], [16].

Another approach to multiple classification is based on pairwise classification [17]. The basic idea behind this method is to transform a q-class problem into q(q - 1)/2 binary problems, one for each pair of classes. In this case the binary decision problems are presented by fewer training examples and the decision boundaries may be considerably simpler than in the case of one-against-all binary classification.

In this paper we describe a new algorithm for learning multi-class neural-network models from large-scale clinical EEGs. This algorithm trains hidden neurons separately to classify all the pairs of classes. To find the best pairwise classifiers, our algorithm searches for input variables which are relevant to the classification problem. Additionally, the neural-network model provides a probabilistic interpretation of decisions. In the next section we define the classification problem and describe the EEG data. In Section 3 we describe the neural-network model and the algorithm for learning pairwise classification. In Section 4 we compare our technique with standard data mining techniques on the clinical EEGs, and finally we discuss the results.
2 A Classification Problem
In order to recognize some brain pathologies of newborns whose prenatal ages range between 35 and 51 weeks, clinicians analyze their EEGs recorded during sleep and then evaluate an EEG-based index of brain maturation [4] - [7]. In the pathological cases, the EEG index does not match the prenatal maturation. Following [6], [7], we use the EEGs recorded from healthy newborns and define this problem as a multi-class one, i.e., a 16-class concept: all the EEGs of healthy newborns should be classified properly to their prenatal ages, but the pathological cases should not.

To build up such a multi-class concept, we used the 65 EEGs of healthy newborns recorded via the standard electrodes C3 and C4. Following [4], [5], [6], these records were segmented and represented by 72 spectral and statistical features calculated for each 10-sec segment over 6 frequency bands: sub-delta (0-1.5 Hz), delta (1.5-3.5 Hz), theta (3.5-7.5 Hz), alpha (7.5-13.5 Hz), beta 1 (13.5-19.5 Hz), and beta 2 (19.5-25 Hz). Additional features were calculated as the spectral variances. An EEG expert manually deleted the artifacts from these EEGs and then assigned the normal segments to the 16 classes according to the ages of the newborns. The total number of labeled EEG segments was 59069.

Analyzing the EEGs, we can see how heavily they depend on the BBA of the newborns. Formally, we can define the BBA as the sum of spectral powers over all the frequency bands. As an example, Fig. 1 depicts the BBA calculated for two newborns aged 49 weeks. We can see that the BBA, depicted as a dark line, varies chaotically during sleep and causes variations of the EEG which significantly alter the class boundaries.
Clearly, we can calculate the BBA beforehand and then subtract it from all the EEG features. Using this pre-processing technique we can remove the chaotic oscillations from the EEGs and expect to improve the classification accuracy.
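A minimal sketch of this BBA-subtraction pre-processing is given below, assuming the spectral features of each segment are stored row-wise and that the columns belonging to each frequency band are known; both assumptions are illustrative.

```python
import numpy as np

def subtract_bba(features, band_slices):
    """Sketch of the BBA pre-processing step described above.

    features    : (n_segments, n_features) array of spectral EEG features
    band_slices : list of column slices, one per frequency band
    The BBA of a segment is taken as the sum of its spectral powers over all
    bands; it is then subtracted from the band features of that segment.
    """
    bba = sum(features[:, s].sum(axis=1) for s in band_slices)   # (n_segments,)
    cleaned = features.copy()
    for s in band_slices:
        cleaned[:, s] -= bba[:, None]
    return cleaned
```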
Fig. 1. EEG segments of two sleeping newborns aged 49 weeks
Below we describe the neural-network technique we developed to learn a multi-class concept from the EEGs.
3 The Neural-Network Model and Learning Algorithm
The idea behind our method of multiple classification is to train the hidden neurons of the neural network separately and then combine them to approximate the dividing surfaces. These hidden neurons learn to divide the examples from each pair of classes. For q classes, therefore, we need to learn q(q - 1)/2 binary classifiers. The hidden neurons that deal with one class are combined into one group, so that the number of groups corresponds to the number of classes. The hidden neurons combined into one group approximate the dividing surfaces for the corresponding class.

Let fi/j be the threshold activation function of a hidden neuron which learns to divide the examples x of the ith and jth classes Ωi and Ωj, respectively. The output y of the hidden neuron is

y = fi/j(x) = 1, ∀ x ∈ Ωi, and y = fi/j(x) = –1, ∀ x ∈ Ωj.    (1)

Assume a q = 3 classification problem with overlapping classes Ω1, Ω2 and Ω3 centered at C1, C2, and C3, as Fig. 2(a) depicts. The number of hidden neurons for this example is equal to 3. In Fig. 2(a) the lines f1/2, f1/3 and f2/3 depict the hyperplanes of the hidden neurons trained to divide the examples of the three pairs of classes, which are (1) Ω1 and Ω2, (2) Ω1 and Ω3, and (3) Ω2 and Ω3.
Fig. 2. The dividing surfaces (a) g1, g2, and g3, and the neural network (b) for q = 3 classes
Combining these hidden neurons into q = 3 groups, we build up the new hyperplanes g1, g2, and g3. The first one, g1, is a superposition of the hidden neurons f1/2 and f1/3, i.e., g1 = f1/2 + f1/3. The second and third hyperplanes are g2 = f2/3 – f1/2 and g3 = – f1/3 – f2/3, respectively. For hyperplane g1, the outputs f1/2 and f1/3 are summed with weights +1 because both give positive outputs on the examples of class Ω1. Correspondingly, for hyperplane g3, the outputs f1/3 and f2/3 are summed with weights –1 because they give negative outputs on the examples of class Ω3. Fig. 2(b) depicts for this example a neural network structure consisting of three hidden neurons f1/2, f1/3, and f2/3 and three output neurons g1, g2, and g3. The weight vectors of the hidden neurons here are w1, w2, and w3. These hidden neurons are connected to the output neurons with weights equal to (+1, +1), (–1, +1) and (–1, –1), respectively.

In the general case of q > 2 classes, the neural network consists of q output neurons g1, …, gq and q(q – 1)/2 hidden neurons f1/2, …, fi/j, …, fq-1/q, where i < j = 2, …, q. Each output neuron gi is connected to (q – 1) hidden neurons, which are partitioned into two groups. The first group consists of the hidden neurons fi/k for which k > i; the second group consists of the hidden neurons fk/i for which k < i. The weights of the output neuron gi connected to the hidden neurons fi/k and fk/i are equal to +1 and –1, respectively. As some EEG features may be irrelevant to the binary classification problems, for learning the hidden neurons we use a bottom-up search strategy which selects the features providing the best classification accuracy [11]. Below we discuss the experimental results of our neural-network technique applied to the real multi-class problem.
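The combination rule described above can be sketched as follows; the `pairwise` dictionary and its `predict` interface are illustrative assumptions, and classes are indexed from 0.

```python
import numpy as np

def multiclass_decision(x, pairwise, q):
    """Sketch of combining the pairwise hidden neurons into class outputs g_i.

    pairwise : dict mapping (i, j), with i < j, to a trained classifier whose
               predict(x) returns +1 for class i and -1 for class j
               (this interface is an illustrative assumption)
    q        : number of classes
    Each g_i sums f_{i/k} with weight +1 for k > i and f_{k/i} with weight -1
    for k < i; the predicted class is the one maximizing g_i.
    """
    g = np.zeros(q)
    for (i, j), clf in pairwise.items():        # i < j
        f = clf.predict(x)                      # +1 -> class i, -1 -> class j
        g[i] += f
        g[j] -= f
    return int(np.argmax(g)), g
```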
4 Experiments and Results
To learn the 16-class concept from the 65 EEG records, we used the neural network model described above. For training and testing we used 39399 and 19670 EEG segments, respectively. Correspondingly, for the q = 16 class problem, the neural network includes q(q – 1)/2 = 120 hidden neurons, or binary classifiers, with the threshold activation function (1).
The testing errors of the binary classifiers varied from 0 to 15%, as depicted in Fig. 3(a). The learnt classifiers exploit different sets of EEG features; the number of these features varies between 7 and 58, as depicted in Fig. 3(b). Our method correctly classified 80.8% of the training and 80.1% of the testing examples taken from the 65 EEG records. Summing all the segments belonging to one EEG record, we can improve the classification accuracy up to 89.2% and 87.7% of the 65 EEG records for training and testing, respectively.
Fig. 3. The testing errors (a) and the number of features (b) for each of 120 binary classifiers
In Fig. 4 we depict the outputs of our model summed over all the testing EEG segments of two patients who belong to the second and third age groups, respectively. In both cases, most of the segments were correctly classified. In addition, by summing the outputs over all the testing EEG segments, we may provide a probabilistic interpretation of the decisions. For example, we assign the patients to classes 2 and 3 with probabilities 0.92 and 0.58, respectively, calculated as the proportions of correctly classified segments. These probabilities give us additional information about the confidence of the decisions.

We compared the performance of our neural-network technique with standard data mining techniques on the same EEG data. First, we tried to train a standard feed-forward neural network consisting of q = 16 output neurons and a predefined number of hidden neurons and input nodes. The number of hidden neurons was varied between 5 and 20, and the number of input nodes between 72 and 12 by using standard principal component analysis. Note that in our experiments we could not use more than 20 hidden neurons because even the fast Levenberg-Marquardt learning algorithm provided by MATLAB required an enormous computational effort. Because of the large number of training examples and classes, we could also not use more than 25% of the training data and, as a result, the trained neural network correctly classified less than 60% of the testing examples.
Fig. 4. Probabilistic interpretation of decisions for two patients
Second, we trained q = 16 binary classifiers to distinguish each class against the others. We defined the same activation function for these classifiers and trained them on the whole data set; however, the classification accuracy was 72% on the testing examples. Third, we induced a decision tree consisting of (q – 1) = 15 binary decision trees trained to split (q – 1) subsets of the EEG data. That is, the first classifier learnt to divide the examples taken from classes Ω1, …, Ω8 and Ω9, …, Ω16; the second classifier learnt to divide the examples taken from classes Ω1, …, Ω4 and Ω5, …, Ω8, and so on. However, the classification accuracy on the testing data was 65%.
5 Conclusion
For learning multi-class concepts from large-scale, heavily overlapping EEG data, we developed a neural-network technique and learning algorithm. Our neural network consists of hidden neurons which perform the binary classification for each pair of classes. The hidden neurons are trained separately and their outputs are then combined in order to perform the multiple classification. This technique has been used to learn a 16-class concept from 65 EEG records represented by 72 features, some of which were irrelevant. Having compared our technique with other classification methods on the same EEG data, we found that it gives better classification accuracy within an acceptable learning time. Thus, we conclude that the new technique we developed for learning multi-class neural-network models performs well on clinical EEGs. We believe that this technique may also be used to solve other large-scale multi-class problems presenting many irrelevant features.
Acknowledgments The research has been supported by the University of Jena (Germany). The authors are grateful to Frank Pasemann for enlightening discussions, Joachim Frenzel from the University of Jena for the clinical EEG records, and to Jonathan Fieldsend from the University of Exeter (UK) for useful comments.
References

[1] Riddington, E., Ifeachor, E., Allen, N., Hudson, N., Mapps, D.: A Fuzzy Expert System for EEG Interpretation. Neural Networks and Expert Systems in Medicine and Healthcare, University of Plymouth (1994) 291-302
[2] Anderson, C., Devulapalli, S., Stolz, E.: Determining Mental State from EEG Signals Using Neural Networks. Scientific Programming 4 (1995) 71-183
[3] Galicki, M., Witte, H., Dörschel, J., Doering, A., Eiselt, M., Grießbach, G.: Common Optimization of Adaptive Preprocessing Units and a Neural Network During the Learning: Application in EEG Pattern Recognition. J. Neural Networks 10 (1997) 1153-1163
[4] Breidbach, O., Holthausen, K., Scheidt, B., Frenzel, J.: Analysis of EEG Data Room in Sudden Infant Death Risk Patients. J. Theory Bioscience 117 (1998) 377-392
[5] Holthausen, K., Breidbach, O., Schiedt, B., Frenzel, J.: Clinical Relevance of Age Dependent EEG Signatures in Detection of Neonates at High Risk of Apnea. J. Neuroscience Letter 268 (1999) 123-126
[6] Holthausen, K., Breidbach, O., Scheidt, B., Schult, J., Queregasser, J.: Brain Dismaturity as Index for Detection of Infants Considered to be at Risk for Apnea. J. Theory Bioscience 118 (1999) 189-198
[7] Wackermann, J., Matousek, M.: From the EEG Age to Rational Scale of Brain Electric Maturation. J. Electroencephalogram Clinical Neurophysiology 107 (1998) 415-421
[8] Duda, R.O., Hart, P.E.: Pattern Classification and Scene Analysis. Wiley and Sons, New York (1973)
[9] Quinlan, J.: Induction of Decision Trees. J. Machine Learning 1 (1986) 81-106
[10] Cios, K., Liu, N.: A Machine Learning Method for Generation of Neural Network Architecture. J. IEEE Transactions on Neural Networks 3 (1992) 280-291
[11] Galant, S.: Neural Network Learning and Expert Systems. MIT Press, Cambridge, MA (1993)
[12] Kononenko, I., Šimec, E.: Induction of Decision Trees Using RELIEFF. In: Kruse, R., Viertl, R., Della Riccia, G. (eds.): CISM Lecture Notes, Springer Verlag (1994)
[13] Utgoff, P., Brodley, C.: Linear Machine Decision Trees. COINS Technical Report 91-10, University of Massachusetts, Amherst, MA (1991)
[14] Brodley, C., Utgoff, P.: Multivariate Decision Trees. COINS Technical Report 92-82, University of Massachusetts, Amherst, MA (1992)
[15] Frean, M.: A Thermal Perceptron Learning Rule. J. Neural Computation 4 (1992)
[16] Murthy, S., Kasif, S., Salzberg, S.: A System for Induction of Oblique Decision Trees. J. Artificial Intelligence Research 2 (1994) 1-33
[17] Hastie, T., Tibshirani, R.: Classification by Pairwise Coupling. Advances in Neural Information Processing Systems 10 (NIPS-97) (1998) 507-513
Establishing Safety Criteria for Artificial Neural Networks

Zeshan Kurd and Tim Kelly

Department of Computer Science, University of York, York, YO10 5DD, UK
{zeshan.kurd,tim.kelly}@cs.york.ac.uk
Abstract. Artificial neural networks are employed in many areas of industry such as medicine and defence. There are many techniques that aim to improve the performance of neural networks for safety-critical systems. However, there is a complete absence of analytical certification methods for neural network paradigms. Consequently, their role in safety-critical applications, if any, is typically restricted to advisory systems. It is therefore desirable to enable neural networks for highly-dependable roles. This paper defines safety criteria which, if enforced, would contribute to justifying the safety of neural networks. The criteria are a set of safety requirements for the behaviour of neural networks. The paper also highlights the challenge of maintaining performance, in terms of adaptability and generalisation, whilst providing acceptable safety arguments.
1 Introduction
Typical uses of ANNs (Artificial Neural Networks) in safety-critical systems include areas such as medical systems, industrial process & control and defence. An extensive review of many areas of ANN use in safety-related applications has been provided by a recent U.K. HSE (Health & Safety Executive) report [1]. However, the roles for these systems have been restricted to advisory, since no convincing safety arguments has yet been produced. They are typical examples of achieving improved performance without providing sufficient safety assurance. There are many techniques for traditional multi-layered perceptrons [2, 3] which aim to improve generalisation performance. However, these performance-related techniques do not provide acceptable forms of safety arguments. Moreover, any proposed ANN model that attempts to satisfy safety arguments must carefully ensure that measures incorporated do not diminish the advantages of the ANN. Many of the existing approaches which claim to contribute to safety-critical applications focus more upon improving generalisation performance and not generating suitable safety arguments. One particular type of ANN model employs diversity [2] which is an ensemble of ANNs where each is devised by different methods to encapsulate as much of the target function as possible. Results demonstrate improvement in V. Palade, R.J. Howlett, and L.C. Jain (Eds.): KES 2003, LNAI 2773, pp. 163-169, 2003. Springer-Verlag Berlin Heidelberg 2003
generalisation performance over some test set but lack the ability to analyse and determine the function performed by each member of the ensemble. Other techniques such as the validation of ANNs and error bars [3] may help deal with uncertainty in network outputs. However, the ANN is still viewed as a black-box and lacks satisfactory analytical processes to determine the overall behaviour. Finally, many development lifecycles [4] for ANNs lack provision for analytical processes such as hazard analysis and focus more upon black-box approaches. However, work performed on fuzzy-ANNs [5] attempts to approach development based upon refining partial specifications.
2
The Problem
The criticality of safety-critical software can be defined as directly or indirectly contributing to the occurrence of a hazardous system state [6]. A hazard is a situation which is potentially dangerous to man, society or the environment [7]. When developing safety-critical software, there is a set of requirements which must be enforced and which are formally defined in many safety standards [8]. One of the main tools for determining requirements is the use of a safety case to encapsulate all safety arguments for the software. A safety case is defined in Defence Standard 00-55 [8] as: "The software safety case shall present a well-organised and reasoned justification based on objective evidence, that the software does or will satisfy the safety aspects of the Statement of Technical Requirements and the Software Requirements specification." Some of the main components in formulating safety requirements are hazard analysis and mitigation. Function Failure Analysis (FFA) [9] is a predictive technique to distinguish and refine safety requirements. It revolves around system functions and deals with identifying failures to provide a function, identifying incorrect function outputs, analysing the effects of failures and determining actions to improve the design. Another technique is the Software Hazard Operability Study (SHAZOP) [10], which uses 'guide words' for qualitative analysis to identify hazards not previously considered. It attempts to analyse all variations of the system based on these guide words and can uncover causes, consequences, indications and recommendations for particular identified hazards. Arguing the satisfaction of the safety requirements is divided into analytical arguments and arguments generated from testing. This is defined in the widely accepted Defence Standard 00-55 [8]. Arguments from testing (such as white-box or black-box) can be generated more easily than analytical arguments, since, unlike for neural networks, the behaviour of the software is fully described and available for analysis. However, generating analytical arguments is more problematic. Some of these problems are mainly associated with providing sufficient evidence that all hazards have been identified and mitigated. There are several safety processes performed during the life-cycle of safety-critical systems [11]. Preliminary Hazard Identification (PHI) is the first step in the lifecycle;
it forms the backbone of all following processes. It aims to identify, manage and control all potential hazards in the proposed system. Risk Analysis and Functional Hazard Analysis (FHA) analyses the severity and probability of potential accidents for each identified hazard. Preliminary System Safety Assessment (PSSA) [9] ensures that the proposed design will refine and adhere to safety requirements and helps guide the design process. System Safety Analysis (SSA) demonstrates through evidence that safety requirements have been met. It uses inductive and deductive techniques to examine the completed design and implementation. Finally, the Safety Case [11] generated throughout development delivers a comprehensible and defensible argument that the system is acceptably safe to use in a given context. Any potential safety case must overcome problems associated with typical neural networks. Some typical problems concern ANN structure and topology, which are factors that may influence the generalisation performance of the ANN. Another problem lies in determining the training and test sets, which must represent the desired function using a limited number of samples. Dealing with noise during training is also problematic, to ensure that the network does not deviate from the required target function. Other issues related to the learning or training process may involve forgetting of data, particularly when using the back-propagation algorithm. This can lead to poor generalisation and long training times. Furthermore, training may settle in local minima instead of the global minimum, and deciding upon appropriate stopping points for the training process is difficult. This can be aided by cross-validation [12], but that relies heavily on test sets. One of the key problems is the inability to analyse the network such that a white-box view of the behaviour can be presented. This contributes to the need for using test sets to determine generalisation performance as an overall error, and to the lack of explanation mechanisms for network outputs. The inability to analyse also makes it difficult to identify and control potential hazards in the system and to provide assurance that a given set of requirements is met.
3
Safety Criteria for Neural Networks
We have attempted to establish safety criteria for ANNs (Artificial Neural Networks) that define minimum behavioural properties which must be enforced for safety-critical contexts. By defining requirements from a high-level perspective, the criteria are intended to apply to most types of neural networks. Fig. 1 illustrates the safety criteria in the form of Goal Structuring Notation (GSN) [11], which is commonly used for composing safety case patterns. Each symbol in Fig. 1 has a distinct meaning and is a subset of the notation used in GSN. The boxes illustrate goals or sub-goals which need to be fulfilled. Rounded boxes denote the context in which the corresponding goal is stated. The rhomboid represents strategies to achieve goals. Diamonds underneath goals symbolise that the goal requires further development leading to supporting arguments and evidence.
Fig. 1. Preliminary Safety Criteria for Artificial Neural Networks (goal structure recovered from the diagram):
G1 (top goal): Neural network is acceptably safe to perform a specified function within the safety-critical context. Contexts: C1 (Neural network model definition), C2 (use of the network in a safety-critical context must ensure specific requirements are met), C3 ('acceptably safe' will be determined by the satisfaction of safety criteria).
S1 (strategy): Argument over key safety criteria.
G2: Pattern matching functions for the neural network have been correctly mapped. Context C4: the function may partially or completely satisfy the target function.
G3: Observable behaviour of the neural network must be predictable and repeatable. Context C5: known and unknown inputs.
G4: The neural network tolerates faults in its inputs. Context C6: a fault is classified as an input that lies outside the specified input set.
G5: The neural network does not create hazardous outputs. Context C7: a hazardous output is defined as an output outside a specified set or target function.
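For readers who wish to manipulate the criteria programmatically, the goal structure of Fig. 1 can be captured as plain data. The encoding below is purely illustrative: the node texts come from the figure, while the dictionary layout and field names are our own and are not part of GSN or of any particular tool.

```python
# Illustrative only: a minimal encoding of the goal structure in Fig. 1 as plain
# Python data, e.g. for feeding a safety-case tool. Node texts follow the figure;
# the dictionary layout itself is our own sketch, not part of GSN.
safety_criteria = {
    "G1": {"kind": "goal",
           "text": "Neural network is acceptably safe to perform a specified "
                   "function within the safety-critical context",
           "context": ["C1", "C2", "C3"], "strategy": "S1"},
    "S1": {"kind": "strategy", "text": "Argument over key safety criteria",
           "supports": "G1", "subgoals": ["G2", "G3", "G4", "G5"]},
    "G2": {"kind": "goal", "text": "Pattern matching functions correctly mapped",
           "context": ["C4"]},
    "G3": {"kind": "goal", "text": "Observable behaviour predictable and repeatable",
           "context": ["C5"]},
    "G4": {"kind": "goal", "text": "Tolerates faults in its inputs",
           "context": ["C6"]},
    "G5": {"kind": "goal", "text": "Does not create hazardous outputs",
           "context": ["C7"]},
}

# Each undeveloped sub-goal (diamond in GSN) still needs supporting arguments and evidence.
undeveloped = [k for k, v in safety_criteria.items()
               if v["kind"] == "goal" and k != "G1"]
print(undeveloped)  # ['G2', 'G3', 'G4', 'G5']
```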
The upper part of the safety criteria consists of a top goal, a strategy and other contextual information. The goal G1, if achieved, allows the neural network to be used in safety-critical applications. C1 requires that a specific ANN model is defined, such as the multi-layered perceptron or other models. C2 intends for the ANN to be used when conventional software or other systems cannot provide the desired advantages. C3 highlights that 'acceptably safe' is related to product- and process-based arguments and will rely heavily on sub-goals. The strategy S1 will attempt to generate safety arguments from the sub-goals (which form the criteria) to fulfil G1. The goals G2 to G5 present the following criteria: Criterion G2 ensures that the function performed by the network represents the target or desired function. The function represented by the network may be considered as input-output mappings, and the term 'correct' refers to the target function. As expressed in C4, the network may also represent a subset of the target function. This is a more realistic condition, provided all hazards can be identified and mitigated for that subset. It may also avoid concerns about attempting to solve a problem where analysis may not be able to determine whether the totality of the target function is achieved. Previous work on dealing with and refining partial specifications for neural networks [5] may apply. However, additional methods to analyse specifications in terms of performance and safety (existence of potential hazards) may be necessary. Forms of sub-goals or strategies for arguing G2 may involve using analytical methods such as decompositional approaches [13], which attempt to extract behaviour
by analysing the intrinsic structure of the ANN, such as each neuron or weight. This will help analyse the function performed by the ANN and help present a white-box view of the network. Techniques to achieve this may involve determining suitable ANN architectures whereby a meaningful mapping exists from each network parameter to some functional description. On the other hand, pedagogical approaches involve determining the function by analysing outputs for input patterns, as in sensitivity analysis [14]. This methodology, however, maintains a black-box perspective and will not be enough to provide satisfactory arguments for G2. Overall, approaches must attempt to overcome problems associated with the ability of the ANN to explain outputs and generalisation behaviour. Criterion G3 will provide assurance that safety is maintained during ANN learning. The 'observable behaviour' of the neuron means the input and output mappings that take place regardless of the weights stored on each connection. The ANN must be predictable given examples learnt during training. 'Repeatable' ensures that any previous valid mapping or output does not become flawed during learning. The forgetting of previously learnt samples must not occur given adaptation to a changing environment. Possible forms of arguments may be associated with functional properties. This may involve providing assurance that learning maintains safety by abiding by some set of behavioural constraints identified through processes such as hazard analysis. Criterion G4 ensures the ANN is robust and safe under all input conditions. An assumption is made that the ANN might be exposed to training samples that do not represent the desired function. Achievement of this goal is optional, as it could be considered specific to the application context. The network must either detect these inputs and suppress them, or ensure a safe state with respect to the output and weight configuration. Other possible forms of argument and solutions may involve 'gating networks' [15], which receive data within a specific validation area of the input space. Criterion G5 is based upon arguments similar to G2. However, this goal focuses solely on the network output. This will result in a robust ANN which, through training and utilisation, ensures that the output is not hazardous regardless of the integrity of the input. For example, output monitors or bounds might be used as possible solutions. Other possible forms of arguments may include derivatives of 'guarded networks' [15] that receive all data but are monitored to ensure that behaviour does not deviate too much from expectations. The above criteria focus on product-based arguments in contrast to process-based arguments commonly used for conventional software [16]. This follows an approach which attempts to provide arguments based upon the product rather than a set of processes that have been routinely carried out [16]. However, process- and product-based arguments may be closely coupled for justifying the safety of ANNs. For example, methods such as developing ANNs using formal languages [17] might reduce faults incorporated during implementation. However, the role of analytical tools is highly important and will involve hazard analysis for identifying potential hazards. Some factors which prevent this type of analysis can be demonstrated in typical monolithic neural networks. These scatter their functional behaviour around their
weights in an incomprehensible fashion, resulting in black-box views and pedagogical approaches. Refining various properties of the network, such as structural and topological factors (for example through modular ANNs [18]), may help overcome this problem. It may help improve the relationship between intrinsic ANN parameters and meaningful descriptions of functional behaviour. This can enable analysis in terms of domain-specific base functions associated with the application domain [19]. Other safety processes often used for determining the existence of potential hazards may require adaptation for compatibility with neural network paradigms.
4
Performance vs. Safety Trade-off
One of the main challenges when attempting to satisfy the criteria is constructing suitable ANN models. The aim is to maintain feasibility by providing acceptable performance (such as generalisation) whilst generating acceptable safety assurance. Certain attributes of the model may need to be limited or constrained so that particular arguments are possible. However, this could lead to over-constraining the performance of the ANN. A typical example is permitting learning during development but not whilst deployed. This may provide safety arguments about the behaviour of the network without their being invalidated by future changes to the implementation. The aim is to generate acceptable safety arguments providing assurance comparable to that achieved with conventional software.
5
Conclusion
To justify the use of ANNs within safety-critical applications will require the development of a safety case. For high-criticality applications, this safety case will require arguments of correct behaviour based both upon analysis and upon test. Previous approaches to justifying ANNs have focussed predominantly on (test-based) arguments of high performance. In this paper we have presented the key criteria for establishing an acceptable safety case for ANNs. Detailed analysis of potential supporting arguments is beyond the scope of this paper. However, we have discussed how it will be necessary to constrain learning and other factors in order to provide analytical arguments and evidence. The criteria also highlight the need for adapting current safety processes (hazard analysis) and devising suitable ANN models that can accommodate them. The criteria can be considered as a benchmark for assessing the safety of artificial neural networks for highly-dependable roles.
References
[1] Lisboa, P.: Industrial Use of Safety-Related Artificial Neural Networks. Health & Safety Executive 327 (2001)
[2] Sharkey, A.J.C., Sharkey, N.E.: Combining Diverse Neural Nets. Computer Science, University of Sheffield, Sheffield, UK (1997)
[3] Nabney, I., et al.: Practical Assessment of Neural Network Applications. Aston University & Lloyd's Register, UK (2000)
[4] Rodvold, D.M.: A Software Development Process Model for Artificial Neural Networks in Critical Applications. In: Proceedings of the 1999 International Conference on Neural Networks (IJCNN'99), Washington D.C. (1999)
[5] Wen, W., Callahan, J., Napolitano, M.: Towards Developing Verifiable Neural Network Controller. Department of Aerospace Engineering, NASA/WVU Software Research Laboratory, West Virginia University, Morgantown, WV (1996)
[6] Leveson, N.: Safeware: System Safety and Computers. Addison-Wesley (1995)
[7] Villemeur, A.: Reliability, Availability, Maintainability and Safety Assessment. Vol. 1. John Wiley & Sons (1992)
[8] MoD: Defence Standard 00-55: Requirements for Safety Related Software in Defence Equipment. UK Ministry of Defence (1996)
[9] SAE: ARP 4761: Guidelines and Methods for Conducting the Safety Assessment Process on Civil Airborne Systems and Equipment. The Society for Automotive Engineers (1996)
[10] MoD: Interim Defence Standard 00-58 Issue 1: HAZOP Studies on Systems Containing Programmable Electronics. UK Ministry of Defence (1996)
[11] Kelly, T.P.: Arguing Safety – A Systematic Approach to Managing Safety Cases. Department of Computer Science, University of York, York, UK (1998)
[12] Kearns, M.: A Bound on the Error of Cross Validation Using the Approximation and Estimation Rates, with Consequences for the Training-Test Split. AT&T Bell Laboratories
[13] Andrews, R., Diederich, J., Tickle, A.: A Survey and Critique of Techniques for Extracting Rules from Trained Artificial Neural Networks. Neurocomputing Research Centre, Queensland University of Technology (1995)
[14] Kilimasaukas, C.C.: Neural Nets Tell Why. Dr Dobb's (April 1991) 16-24
[15] Venema, R.S.: Aspects of an Integrated Neural Prediction System. Rijksuniversiteit Groningen, Groningen, Netherlands (1999)
[16] Weaver, R.A., McDermid, J.A., Kelly, T.P.: Software Safety Arguments: Towards a Systematic Categorisation of Evidence. In: International System Safety Conference, Denver, CO (2002)
[17] Dorffner, G., Wiklicky, H., Prem, E.: Formal Neural Network Specification and Its Implications on Standardisation. Computer Standards and Interfaces 16 (1994) 205-219
[18] Osherson, D.N., Weinstein, S., Stoli, M.: Modular Learning. Computational Neuroscience, Cambridge, MA (1990) 369-377
[19] Zwaag, B.J., Slump, K.: Process Identification Through Modular Neural Networks and Rule Extraction. In: 5th International FLINS Conference, Gent, Belgium. World Scientific (2002)
Neural Chaos Scheme of Perceptual Conflicts
Haruhiko Nishimura¹, Natsuki Nagao², and Nobuyuki Matsui³
¹ Hyogo University of Education, 942-1 Yashiro-cho, Hyogo 673-1494, Japan, [email protected]
² Kobe College of Liberal Arts, 1-50 Meinan-cho, Akashi-shi, Hyogo 673-0001, Japan, [email protected]
³ Himeji Institute of Technology, 2167 Shosya, Himeji-shi, Hyogo 671-2201, Japan, [email protected]
Abstract. Multistable perception is perception in which two (or more) interpretations of the same ambiguous image alternate while an observer looks at it. Perception undergoes involuntary and random-like change. The question arises whether the apparent randomness of alternation is real (that is, due to a stochastic process) or whether some underlying deterministic structure exists. Upon this motivation, we have examined the spatially coherent temporal behaviors of a multistable perception model based on the chaotic neural network from the viewpoint of a bottom-up high-dimensional approach. In this paper, we focus on dynamic processes in a simple (minimal) system which consists of three neurons, aiming at a further understanding of the deterministic mechanism.
1
Introduction
Multistable perception is perception in which two (or more) interpretations of the same ambiguous image alternate spontaneously while an observer looks at it. Figure-ground, perspective (depth) and semantic ambiguities are well known (for an overview see, for example, [1] and [2]). When we view the Necker cube, which is a classic example of perspective alternation, a part of the figure is perceived either as the front or the back of a cube, and our perception switches between the two different interpretations. In this circumstance the external stimulus is kept constant, but perception undergoes involuntary and random-like change. The measurements have been quantified in psychophysical experiments, and it has become evident that the frequency distribution of the time intervals T(n) spent on each percept (Fig. 1) is approximately Gamma distributed [3, 4].

Fig. 1. A schematic example of switching behavior: the time axis t is divided into successive intervals T(1), T(2), T(3), ... spent on each percept.

The Gestalt psychologist Wolfgang Köhler [5] claimed that perceptual alternation occurs owing to different sets of neurons getting "tired" of firing after they have done so for a long time. The underlying theoretical assumptions are that: 1. different interpretations are represented by different patterns of neural activity, 2. perception corresponds to whichever pattern is most active in the brain at the time, 3. neural fatigue causes different patterns of activation to dominate at different times. Upon this neural fatigue hypothesis, the synergetic Ditzinger-Haken model [6, 7] was proposed and is known to be able to reproduce the Gamma distribution of T well by subjecting the time-dependent attention parameters to stochastic forces (white noises). However, the question arises whether the apparent randomness of alternation is real (that is, due to a stochastic process) or whether some underlying deterministic structure exists. Until now diverse types of chaos have been confirmed at several hierarchical levels in real neural systems, from single cells to cortical networks (e.g. ionic channels, spike trains from cells, EEG) [8]. This suggests that artificial neural networks based on the McCulloch-Pitts neuron model should be re-examined and re-developed. Chaos may play an essential role in the extended frame of the Hopfield neural network beyond equilibrium point attractors alone. Following this idea, we have examined the spatially coherent temporal behaviors of a multistable perception model based on the chaotic neural network [9, 10] from the viewpoint of a bottom-up high-dimensional approach. In this paper, we focus on dynamic processes in a simple (minimal) system which consists of three neurons, aiming at a further understanding of the deterministic mechanism.
2
Model and Method
The chaotic neural network (CNN) composed of N chaotic neurons is described as [11, 12]

X_i(t+1) = f\bigl(\eta_i(t+1) + \zeta_i(t+1)\bigr)   (1)

\eta_i(t+1) = \sum_{j=1}^{N} w_{ij} \sum_{d=0}^{t} k_f^{\,d}\, X_j(t-d)   (2)

\zeta_i(t+1) = -\alpha \sum_{d=0}^{t} k_r^{\,d}\, X_i(t-d) - \theta_i   (3)

where X_i is the output of neuron i (-1 ≤ X_i ≤ 1), w_{ij} the synaptic weight from neuron j to neuron i, θ_i the threshold of neuron i, k_f (k_r) the decay factor for the feedback (refractoriness) (0 ≤ k_f, k_r < 1), α the refractory scaling parameter, and f the output function defined by f(y) = tanh(y/2ε) with steepness parameter ε. Owing to the exponentially decaying form of the past influence, Eqs. (2) and (3) can be reduced to

\eta_i(t+1) = k_f\, \eta_i(t) + \sum_{j=1}^{N} w_{ij}\, X_j(t)   (4)

\zeta_i(t+1) = k_r\, \zeta_i(t) - \alpha\, X_i(t) + a   (5)

where a is a temporal constant, a ≡ -θ_i (1 - k_r). All neurons are updated in parallel, that is, synchronously. The network corresponds to the conventional discrete-time Hopfield network when α = k_f = k_r = 0 (the Hopfield network point, HNP). Under an external stimulus, Eq. (1) becomes

X_i(t+1) = f\bigl(\eta_i(t+1) + \zeta_i(t+1) + \sigma_i\bigr)   (6)

where {σ_i} is the effective term due to the external stimulus. The two competitive interpretations are stored in the network as minima of the energy map

E = -\frac{1}{2} \sum_{ij} w_{ij}\, X_i\, X_j   (7)

at HNP. The conceptual picture of our model is shown in Fig. 2. Under the external stimulus {σ_i}, chaotic activities arise on the neural network and cause transitions between the stable states of HNP. This situation corresponds to dynamic multistable perception.
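As a concrete illustration, the reduced update rule can be written in a few lines. The following sketch is our own rendering of Eqs. (4)-(6) and of the energy map (7), not the authors' code; the example weight matrix anticipates the three-neuron network of the next section.

```python
# A minimal sketch of the reduced CNN update, Eqs. (4)-(6), and the energy of Eq. (7).
import numpy as np

def cnn_step(X, eta, zeta, W, sigma, kf=0.5, kr=0.8, alpha=0.46, a=0.0, eps=0.015):
    """One synchronous update of all neurons."""
    eta_new = kf * eta + W @ X                       # Eq. (4): decaying feedback term
    zeta_new = kr * zeta - alpha * X + a             # Eq. (5): decaying refractoriness term
    X_new = np.tanh((eta_new + zeta_new + sigma) / (2.0 * eps))  # Eq. (6), f(y) = tanh(y/2eps)
    return X_new, eta_new, zeta_new

def energy(X, W):
    """Energy map of Eq. (7), used at the Hopfield network point."""
    return -0.5 * X @ W @ X

# Example: three neurons with zero stimulus, starting from a random state.
rng = np.random.default_rng(0)
W = np.array([[0.0, -2.0, 2.0], [-2.0, 0.0, -2.0], [2.0, -2.0, 0.0]]) / 3.0
X, eta, zeta = rng.uniform(-1, 1, 3), np.zeros(3), np.zeros(3)
for _ in range(100):
    X, eta, zeta = cnn_step(X, eta, zeta, W, np.zeros(3))
print(X, energy(X, W))
```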
3
Simulations and Results
To carry out computational experiments, we consider a minimal network which consists of three neurons (N = 3). Of the 2^3 possible states, two interpretation patterns {ξ_i^{11}} = (1, -1, 1) and {ξ_i^{12}} = (-1, 1, -1) are chosen, and the corresponding bistable landscape is made on the network dynamics by

\{w_{ij}\} = \frac{1}{3}\begin{pmatrix} 0 & -2 & 2 \\ -2 & 0 & -2 \\ 2 & -2 & 0 \end{pmatrix}.

In the case of HNP, the stored patterns {ξ_i^{11}} and {ξ_i^{12}} are stable (with E = -2) and the remaining 6 patterns are all unstable (with E = 2/3). (-1, -1, 1), (1, 1, 1) and (1, -1, -1) converge to {ξ_i^{11}}, and (1, 1, -1), (-1, -1, -1) and (-1, 1, 1) converge to {ξ_i^{12}}, as shown in Fig. 3.
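A quick numerical check (ours, not part of the paper) confirms these energies and basins of attraction; the tanh output function with a small steepness ε stands in for the limiter at HNP.

```python
# Check that the two stored interpretations have E = -2, the other six corners have
# E = 2/3, and that iterating the HNP dynamics (alpha = kf = kr = 0) drives every
# corner to one of the two stored patterns, as stated for Fig. 3.
import itertools
import numpy as np

W = np.array([[0.0, -2.0, 2.0],
              [-2.0, 0.0, -2.0],
              [2.0, -2.0, 0.0]]) / 3.0
eps = 0.015

def energy(X):
    return -0.5 * X @ W @ X

for corner in itertools.product([-1.0, 1.0], repeat=3):
    X = np.array(corner)
    for _ in range(20):                      # HNP update: X <- f(W X), f(y) = tanh(y/2eps)
        X = np.tanh(W @ X / (2.0 * eps))
    print(corner, "E =", round(energy(np.array(corner)), 3), "->", np.round(X))
```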
Fig. 2. Conceptual picture illustrating state transitions induced by chaotic activity: the energy map E{X} under the stimulus {σ_i} = s{ξ_i^1}, with the ambiguous pattern {ξ_i^1} = (1, 1, 0) and the two stable interpretations {ξ_i^{11}} = (1, -1, 1) and {ξ_i^{12}} = (-1, 1, -1).

Fig. 3. Pattern states for N = 3 neurons corresponding to the ambiguous figure {ξ_i^1} and its interpretations {ξ_i^{11}} and {ξ_i^{12}}, and the flow of the network at HNP.
Figure 4 shows a time series evolution of the CNN (k_f = 0.5, k_r = 0.8, α = 0.46, a = 0, ε = 0.015) under the stimulus {σ_i} = 0.59{ξ_i^1}. Here,

m^{11}(t) = \frac{1}{N}\sum_{i=1}^{N} \xi_i^{11}\, X_i(t)   (8)

is called the overlap of the network state {X_i} with the interpretation pattern {ξ_i^{11}}. A switching phenomenon between {ξ_i^{11}} (m^{11} = 1.0) and {ξ_i^{12}} (m^{11} = -1.0) can be observed. Bursts of switching are interspersed with prolonged periods during which {X_i} trembles near {ξ_i^{11}} or {ξ_i^{12}}. Evaluating the maximum Lyapunov exponent to be positive (λ_1 = 0.30), we find that the network is dynamically in chaos. Figure 5 shows the return map of the active potential h_i(t) = η_i(t) + ζ_i(t) + σ_i of one neuron (#3). In the cases λ_1 < 0, such switching phenomena do not arise. From the 2×10^5 iteration data (until t = 2×10^5) of Fig. 4, we get 1545 events T(1), ..., T(1545) of staying near one of the two interpretations, {ξ_i^{12}}. They have various persistent durations which seem to follow a random time course. From the evaluation of the autocorrelation function for T(n), C(k) = <T(n+k)T(n)> - <T(n+k)><T(n)> (here <> means an average over time), we find the lack of even short-term correlations (-0.06 < C(k)/C(0) < 0.06 for all k). This indicates that successive durations T(n) are independent.
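The measurements just described can be reproduced schematically as follows. This is a sketch under our own assumptions: the ambiguous pattern {ξ_i^1} = (1, 1, 0) is read off Fig. 3, the attractor-proximity threshold 0.9 and the random initial state are arbitrary choices, and whether chaotic switching actually appears depends on these details.

```python
# Run the CNN under the constant stimulus, record the overlap m11(t) of Eq. (8),
# extract the persistent durations T(n) and check the duration autocorrelation C(k).
import numpy as np

W = np.array([[0.0, -2.0, 2.0], [-2.0, 0.0, -2.0], [2.0, -2.0, 0.0]]) / 3.0
xi11 = np.array([1.0, -1.0, 1.0])
xi1 = np.array([1.0, 1.0, 0.0])                  # ambiguous pattern, as read from Fig. 3
sigma = 0.59 * xi1
kf, kr, alpha, a, eps = 0.5, 0.8, 0.46, 0.0, 0.015

rng = np.random.default_rng(0)
X, eta, zeta = rng.uniform(-1, 1, 3), np.zeros(3), np.zeros(3)
m11 = np.empty(200000)
for t in range(m11.size):
    eta = kf * eta + W @ X                       # Eq. (4)
    zeta = kr * zeta - alpha * X + a             # Eq. (5)
    X = np.tanh((eta + zeta + sigma) / (2 * eps))  # Eq. (6)
    m11[t] = xi11 @ X / 3.0                      # Eq. (8): overlap m11(t)

idx = np.flatnonzero(np.abs(m11) > 0.9)          # time steps spent near either attractor
lab = np.sign(m11[idx])
switch_t = idx[1:][np.diff(lab) != 0]            # times at which the percept flips
T = np.diff(switch_t)                            # persistent durations T(n)

def C(k):                                        # duration autocorrelation, as in the text
    return np.mean(T[k:] * T[:len(T) - k]) - np.mean(T[k:]) * np.mean(T[:len(T) - k])

if len(T) > 10:
    print(len(T), [round(C(k) / C(0), 3) for k in range(1, 6)])
```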
Fig. 4. Time series of the overlap with {ξ_i^{11}} and of the energy map under the stimulus {ξ_i^1}.

Fig. 5. Return map of the active potential h_i of neuron #3 for the data up to t = 5000. The solid line traces its behavior for a typical T term, t = 3019 to 3137.

Fig. 6. Frequency distribution of the persistent durations T(n) and the corresponding Gamma distribution.
The frequency of occurrence of T is plotted for the 1545 events in Fig. 6. The distribution is well fitted by the Gamma distribution

G(\tilde{T}) = \frac{b^{n}\, \tilde{T}^{\,n-1}\, e^{-b\tilde{T}}}{\Gamma(n)}   (9)

with b = 2.03, n = 8.73 (χ² = 0.0078, r = 0.97), where Γ(n) is the Euler Gamma function. \tilde{T} is the normalized duration T/15; a 15-step interval is applied here to determine the relative frequencies.
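For reference, the fitted density of Eq. (9) can be tabulated over the same 15-step bins as Fig. 6; the snippet below simply evaluates Eq. (9) with the quoted parameters.

```python
# Evaluate the Gamma density of Eq. (9) with the fitted parameters b = 2.03, n = 8.73.
import numpy as np
from math import gamma

def G(T_tilde, b=2.03, n=8.73):
    """Gamma density of the normalized duration T~ = T / 15, Eq. (9)."""
    return b**n * T_tilde**(n - 1) * np.exp(-b * T_tilde) / gamma(n)

T = np.arange(30, 301, 15)          # duration bins as in Fig. 6 (each bin has width 15)
print(np.round(G(T / 15.0), 4))
```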
Fig. 7. Conceptual diagram of cortico-cortical interaction within the cortex: neuron assemblies in a lower and a higher cortical area coupled in a ping-pong style matching process.

Fig. 8. Frequency distribution of the persistent durations T(n) and the corresponding Gamma distribution, under the background cortical rhythm with bs = 0.005, for T_I = 50, 75 and 100.
The results are in good agreement with the characteristics of the psychophysical experiments [3, 4]. It is found that aperiodic spontaneous switching does not necessitate a stochastic description as in the synergetic D-H model [6, 7]. In our model, the neural fatigue effect proposed by Köhler [5] is considered to be supported by the neuronal refractoriness α, and its fluctuation originates in the intrinsic chaotic dynamics of the network. The perceptual switching in binocular rivalry shares many features with multistable perception, and might be the outcome of a general (common) neural mechanism. A ping-pong style matching process based on the interaction between the lower and higher cortical areas [13], as in Fig. 7, may serve as a candidate for this mechanism. According to the input signal from the lower level, the higher level feeds back a suitable candidate among the stored templates to the lower level. If the lower area cannot get a good match, the process starts over again and lasts until a suitable interpretation is found. The principal circuitry among the areas via the cortico-cortical fibers is known to be uniform over the cortex. As a background rhythm from the ping-pong style matching process, we introduce the following sinusoidal stimulus changing between the bistable patterns {ξ_i^{11}} and {ξ_i^{12}}: S_1(t) = bs · cos(2πt/T_0), S_2(t) = -bs · cos(2πt/T_0), and S_3(t) = bs · cos(2πt/T_0), where T_0 = 2T_I is the period and bs is the background strength. As we can see from the three cases T_I = 50, 75, and 100 in Fig. 8, the mean value of T is well controlled by the cortical matching period.
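The background rhythm is easy to add to the constant stimulus; a minimal sketch follows, with T_I and bs as in the text and everything else our own arrangement.

```python
# Sinusoidal background rhythm swinging between the two stored patterns, added on
# top of the constant ambiguous stimulus used in the simulations above.
import numpy as np

def background(t, TI=50, bs=0.005):
    """S(t) = bs*cos(2*pi*t/T0) on neurons 1 and 3, -bs*cos(2*pi*t/T0) on neuron 2."""
    T0 = 2 * TI
    c = bs * np.cos(2 * np.pi * t / T0)
    return np.array([c, -c, c])

sigma_total = 0.59 * np.array([1.0, 1.0, 0.0]) + background(0)
print(sigma_total)
```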
4
Conclusion
We have shown that deterministic neural chaos leads to perceptual alternations as responses to ambiguous stimuli in the chaotic neural network. Its emergence is based on a simple process in a realistic bottom-up framework. Our demonstration suggests a functional usefulness of chaotic activity in perceptual systems even at higher cognitive levels. The perceptual alternation appears to be an inherent feature built into the chaotic neuron assembly. It will be interesting to study the brain with experimental techniques (e.g., fMRI) under circumstances where the perceptual alternation is occurring.
References
[1] Attneave, F.: Multistability in Perception. Scientific American 225 (1971) 62-71
[2] Kruse, P., Stadler, M. (eds.): Ambiguity in Mind and Nature. Springer-Verlag (1995)
[3] Borsellino, A., Marco, A.D., Allazatta, A., Rinsei, S., Bartolini, B.: Reversal Time Distribution in the Perception of Visual Ambiguous Stimuli. Kybernetik 10 (1972) 139-144
[4] Borsellino, A., Carlini, F., Riani, M., Tuccio, M.T., Marco, A.D., Penengo, P., Trabucco, A.: Effects of Visual Angle on Perspective Reversal for Ambiguous Patterns. Perception 11 (1982) 263-273
[5] Köhler, W.: Dynamics in Psychology. Liveright, New York (1940)
[6] Ditzinger, T., Haken, H.: Oscillations in the Perception of Ambiguous Patterns: A Model Based on Synergetics. Biological Cybernetics 61 (1989) 279-287
[7] Ditzinger, T., Haken, H.: The Impact of Fluctuations on the Recognition of Ambiguous Patterns. Biological Cybernetics 63 (1990) 453-456
[8] Arbib, M.A.: The Handbook of Brain Theory and Neural Networks. MIT Press (1995)
[9] Nishimura, H., Nagao, N., Matsui, N.: A Perception Model of Ambiguous Figures Based on the Neural Chaos. In: Kasabov, N., et al. (eds.): Progress in Connectionist-Based Information Systems. Vol. 1. Springer-Verlag (1997) 89-92
[10] Nagao, N., Nishimura, H., Matsui, N.: A Neural Chaos Model of Multistable Perception. Neural Processing Letters 12 (2000) 267-276
[11] Aihara, K., Takabe, T., Toyoda, M.: Chaotic Neural Networks. Phys. Lett. A 144 (1990) 333-340
[12] Nishimura, H., Katada, N., Fujita, Y.: Dynamic Learning and Retrieving Scheme Based on Chaotic Neuron Model. In: Nakamura, R., et al. (eds.): Complexity and Diversity. Springer-Verlag (1997) 64-66
[13] Mumford, D.: On the Computational Architecture of the Neocortex II. The Role of Cortico-Cortical Loops. Biol. Cybern. 66 (1992) 241-251
Learning of SAINNs from Covariance Function: Historical Learning Paolo Crippa and Claudio Turchetti Dipartimento di Elettronica ed Automatica, Università Politecnica delle Marche, Via Brecce Bianche, I-60131 Ancona, Italy {pcrippa, turchetti}@ea.univpm.it
Abstract. In this paper the learning capabilities of a class of neural networks named Stochastic Approximate Identity Neural Networks (SAINNs) have been analyzed. In particular these networks are able to approximate a large class of stochastic processes from the knowledge of their covariance function.
1
Introduction
One attractive property of neural networks is their capability of approximating stochastic processes or, more generally, input-output transformations of stochastic processes. Here we consider a wide class of nonstationary processes for which a canonical representation can be defined from the Karhunen-Loève theorem [1]. Such a representation constitutes a model that can be used for the definition of stochastic neural networks based on the Approximate Identity (AI) functions, whose approximating properties are widely studied in [3, 2]. The aim of this work is to show that these stochastic neural networks are able to approximate a given stochastic process belonging to this class, provided only its covariance function is known. This property has been called historical learning.
2
Approximation of Stochastic Processes by SAINNs
Let us consider a stochastic process (SP) admitting the canonical representation

\varphi(t) = \int_{\Lambda} \Phi(t,\lambda)\, \eta(d\lambda)   (1)

where η(dλ) is a stochastic measure and Φ(t, λ) is a family of complete real-valued functions of λ ∈ Λ, depending on the parameter t ∈ T, such that

B(s,t) = \int_{\Lambda} \Phi(t,\lambda)\, \Phi(s,\lambda)\, F(d\lambda) \quad \text{with} \quad F(\Delta\lambda) = E\{|\eta(\Delta\lambda)|^2\}.   (2)

In this Section we want to show that a SP ϕ(t) of this kind can be approximated in mean square by neural networks of the kind

\psi(t) = \sum_{m} \psi_m\, u_m(t)   (3)
where ψ_m are random variables and u_m(t) are AI functions [2]. These networks are named Stochastic Approximate Identity Neural Networks (SAINNs). For this purpose, let the SP ϕ(t) admit the canonical representation (1), where Φ(t, λ) is L²-summable, i.e. \int_T \int_{\Lambda} |\Phi(t,\lambda)|^2\, F(d\lambda)\, dt < +\infty. Let us consider the function

\Psi(t,\lambda) = \sum_{m} a_m\, u_m(t)\, u_m(\lambda)   (4)

where u_m(·) are AI functions. As has been shown in [2], the set of functions Ψ(t, λ) is dense in L²(Λ × T). Thus, for every ε > 0 it results d(Ψ(t,λ), Φ(t,λ)) < ε for all t ∈ T, λ ∈ Λ, where d(·,·) is the usual distance. As the stochastic measure η(Δ) establishes a correspondence U between L²(Λ) and L²(ϕ), a process ψ(t) corresponds through U to the function Ψ(t, λ) ∈ L²(Λ), with canonical representation given by

\psi(t) = \int_{\Lambda} \Psi(t,\lambda)\, \eta(d\lambda).   (5)

Due to the isometry property of U the following equality holds: E\{|\varphi(t)-\psi(t)|^2\} = \int_{\Lambda} |\Phi(t,\lambda)-\Psi(t,\lambda)|^2\, F(d\lambda) for all t ∈ T. By taking advantage of the density property of the functions Ψ(t, λ) we have \int_T \int_{\Lambda} |\Phi(t,\lambda)-\Psi(t,\lambda)|^2\, F(d\lambda)\, dt \to 0, which implies E^2(t) = \int_{\Lambda} |\Phi(t,\lambda)-\Psi(t,\lambda)|^2\, F(d\lambda) \to 0 for all t ∈ T except on a set of null measure, and thus also E\{|\varphi(t)-\psi(t)|^2\} \to 0. From what is shown above we conclude that the SP ψ(t) defined by (5) is able to approximate in mean square the given process ϕ(t) [4]. We can also write

\psi(t) = \int_{\Lambda} \sum_{m} a_m\, u_m(t)\, u_m(\lambda)\, \eta(d\lambda) = \sum_{m} a_m\, u_m(t) \int_{\Lambda} u_m(\lambda)\, \eta(d\lambda).   (6)

By defining the random variables

\psi_m = a_m \int_{\Lambda} u_m(\lambda)\, \eta(d\lambda),   (7)

we finally demonstrate that (3) holds. Eq. (3) can be viewed as a stochastic neural network expressed as a linear combination of AI functions u_m, with coefficients ψ_m that are random variables whose statistics depend on the process ϕ(t) to be approximated. The similarity of (3) to the Karhunen-Loève representation is clear, although in the K.L. expansion the functions of time are eigenfunctions dependent on the covariance function, while in (3) they are AI functions.
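To make the construction concrete, the following sketch evaluates one realization of Eq. (3). It is not the authors' code: the Gaussian-bump family used for u_m is only a stand-in for the AI functions of [2], and the Gaussian coefficients ψ_m are an arbitrary illustrative choice.

```python
# Evaluate one realization of a SAINN, Eq. (3): psi(t) = sum_m psi_m u_m(t).
import numpy as np

def ai_basis(t, centers, width):
    """u_m(t): Gaussian bumps centered on a grid (an assumed AI-style family)."""
    return np.exp(-((t[:, None] - centers[None, :]) / width) ** 2)

rng = np.random.default_rng(1)
M = 20
centers = np.linspace(0.0, 1.0, M)
t = np.linspace(0.0, 1.0, 200)
psi_m = rng.normal(0.0, 1.0 / np.sqrt(M), size=M)    # random coefficients psi_m
psi_t = ai_basis(t, centers, width=0.08) @ psi_m      # one realization of psi(t)
print(psi_t.shape)
```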
3
Learning of SAINNs from Covariance Function
The process ψ(t) in (3) defines a stochastic neural network that, on the basis of what was proven previously, is able to approximate any given process ϕ(t) belonging to the class of SPs (nonstationary in general) admitting a canonical form. As the process ψ(t) depends on the function \Psi(t,\lambda) = \sum_m a_m u_m(t) u_m(\lambda), which approximates the function Φ(t, λ), it is essential to know Φ(t, λ) in order to perform the learning of the neural network. To this end it is convenient to treat the case of finite T separately from the case of infinite T.

Finite T. In this case the set Λ is countable, as follows from Karhunen-Loève theory [1, 4], and thus the approximating process ψ(t) reduces to

\psi(t) = \sum_{m} \psi_m\, u_m(t) \quad \text{with} \quad \psi_m = a_m \sum_{j} u_m(\lambda_j)\, \eta(\lambda_j).   (8)

Moreover we can write (8) as

\psi(t) = \sum_{j} \sum_{m} a_m\, u_m(t)\, u_m(\lambda_j)\, \eta(\lambda_j).   (9)

Karhunen-Loève theory establishes that ϕ(t) can be expanded as

\varphi(t) = \sum_{j} \Phi(t,\lambda_j)\, \eta(\lambda_j).   (10)

Comparing (9) and (10), and assuming the approximation

\Phi(t,\lambda_j) \approx \sum_{m} a_m\, u_m(t)\, u_m(\lambda_j),   (11)

we then have

\varphi(t) \approx \psi(t).   (12)
More rigorously,

E\{|\varphi(t)-\psi(t)|^2\} = E\Bigl\{\Bigl|\sum_{j}\Bigl(\Phi(t,\lambda_j) - \sum_{m} a_m u_m(t) u_m(\lambda_j)\Bigr)\eta(\lambda_j)\Bigr|^2\Bigr\}
 = \sum_{j}\Bigl|\Phi(t,\lambda_j) - \sum_{m} a_m u_m(t) u_m(\lambda_j)\Bigr|^2 E\{|\eta(\lambda_j)|^2\}
 = \sum_{j}\Bigl|\Phi(t,\lambda_j) - \sum_{m} a_m u_m(t) u_m(\lambda_j)\Bigr|^2 \lambda_j^2 \to 0,   (13)

where the orthogonality of the stochastic measure η(λ_j) and the approximating property of the AINNs have been used. In the case of finite T the functions Φ(t, λ_j) are known, since they are the eigenfunctions of the integral operator defined by the covariance function B(s, t). Therefore (13) can be viewed as a learning relationship of the stochastic neural network. The other learning relationship is given by

\eta(\lambda_j) = \int_{T} \varphi(t)\, \Phi(t,\lambda_j)\, dt.   (14)
Thus, in conclusion, in the case of finite T, from the knowledge of the covariance function B(s, t) (or an estimate of it) we can define the stochastic neural network

\psi(t) = \sum_{m} \psi_m\, u_m(t), \qquad \psi_m = a_m \sum_{j} u_m(\lambda_j)\, \eta(\lambda_j)   (15)

approximating the process ϕ(t), through the relationships

i)\ \Bigl|\Phi(t,\lambda_j) - \sum_{m} a_m u_m(t) u_m(\lambda_j)\Bigr|^2 \to 0, \qquad ii)\ \eta(\lambda_j) = \int_{T} \varphi(t)\, \Phi(t,\lambda_j)\, dt.   (16)
It is worth noting that when ϕ(t) is Gaussian, the η(λ_j) are independent Gaussian random variables. This case is of particular interest for applications because the statistical behavior of η(λ_j) is completely specified by the variance E{|η(λ_j)|²}, and thus by the eigenvalues λ_j. As an example of the complexity of the approximating problem, let us consider a stationary process ϕ(t) with covariance function

B(s,t) = a\, \exp(-a\,|t-s|)   (17)

on the interval [0, T]. The eigenvalue equation in this case is

\int_{0}^{T} a\, \exp(-a\,|t-s|)\, \Phi(s)\, ds = \lambda\, \Phi(t).   (18)

In order to solve this integral equation it is useful to rewrite the integral as a convolution over the entire real axis (-∞, +∞),

h(t) * \Phi(t) = \lambda\, \Phi(t) \quad \text{with} \quad h(t) = a\, \exp(-a\,|t|),   (19)

constraining (19) to the boundary conditions derived from (18) at the boundaries of the time interval [0, T], so that (19) is identical to (18). By Fourier transform, (19) becomes

H(j\omega) \cdot U(j\omega) = \lambda\, U(j\omega)   (20)

where H(jω), U(jω) are the Fourier transforms of h(t), Φ(t) respectively. Since

H(j\omega) = \frac{2a^2}{a^2 + \omega^2},   (21)

eq. (20) reduces to

\frac{2a^2}{a^2 + \omega^2}\, U(j\omega) = \lambda\, U(j\omega)   (22)

and the eigenvalues λ are given by

\lambda = \frac{2a^2}{a^2 + \omega^2}.   (23)
Applying the inverse Fourier transform to (22), a second-order differential equation results:

\Phi''(t) + \frac{2-\lambda}{\lambda}\, a^2\, \Phi(t) = 0.   (24)

Solving this equation requires two boundary conditions on Φ(t) and its derivative Φ'(t), which can easily be obtained from (18): \Phi(T)/\Phi'(T) = -1/a and \Phi(0)/\Phi'(0) = 1/a. The general solution of (24) is given by

\Phi(t) = A \sin(\omega t + \alpha)   (25)

and the above boundary conditions take the form tan(ωT + α) = -ω/a and tan α = ω/a, respectively. After some manipulations these boundary conditions reduce to the single equation

\tan(\omega_k T) = \frac{-2a\,\omega_k}{a^2 - \omega_k^2}   (26)

where the index k indicates that a countable set of values ω_k satisfying (26) exists. Eq. (23) establishes that a corresponding set of values λ_k exists, given by

\lambda_k = \frac{2a^2}{a^2 + \omega_k^2}, \qquad k = 1, 2, \ldots   (27)

Thus all the eigenfunctions are expressed as

\Phi_k(t) = A_k \sin(\omega_k t + \alpha_k)   (28)

where the constants A_k, α_k are determined from the orthonormalization conditions

\int_{0}^{T} \Phi_k(t)\, \Phi_l(t)\, dt = \delta_{kl}.   (29)

As far as the stochastic neural network is concerned, (16) applies in this case, with Φ given by (28).
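The finite-T example can be checked numerically. The sketch below (our own) finds the first frequencies ω_k from the boundary conditions, written in the equivalent pole-free form (a² − ω²) sin(ωT) + 2aω cos(ωT) = 0, evaluates λ_k from Eq. (27), and compares them with the eigenvalues of the discretized covariance operator of Eq. (18).

```python
# Eigenvalues of the exponential-covariance operator: transcendental equation vs.
# direct discretization of Eq. (18).
import numpy as np
from scipy.optimize import brentq

a, T, K = 1.0, 10.0, 6

def g(w):                                    # root function equivalent to Eq. (26)
    return (a**2 - w**2) * np.sin(w * T) + 2 * a * w * np.cos(w * T)

grid = np.linspace(1e-6, 5.0, 20000)
sign_change = np.flatnonzero(np.diff(np.sign(g(grid))) != 0)[:K]
omegas = np.array([brentq(g, grid[i], grid[i + 1]) for i in sign_change])
lam = 2 * a**2 / (a**2 + omegas**2)          # Eq. (27)

# Cross-check: eigenvalues of the discretized integral operator of Eq. (18).
s = np.linspace(0.0, T, 1000)
h = s[1] - s[0]
Bmat = a * np.exp(-a * np.abs(s[:, None] - s[None, :]))
num = np.sort(np.linalg.eigvalsh(Bmat * h))[::-1][:K]
print(np.round(np.sort(lam)[::-1], 4))
print(np.round(num, 4))
```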
Infinite T. In this case the set Λ is not countable, and the problem of determining a complete set of functions Φ(t, λ) falls within the theory of operators whose eigenvalues form a noncountable set. Since this subject is outside the scope of this work, we restrict the treatment of this case to a specific class of random processes. Let us assume

\Phi(t,\lambda) = a(t)\, \exp(i\lambda t)   (30)

where a(t) is an arbitrary function (such that, for every λ ∈ Λ, Φ(t, λ) is L²-integrable); then the set of functions {Φ(t, λ)}, obtained by varying the parameter t ∈ T, is a complete family in L²(F). Any linear combination of Φ(t, λ) can be written as

\sum_{k} b_k\, \Phi(t_k,\lambda) = \sum_{k} b_k\, a(t_k)\, \exp(i\lambda t_k) = \sum_{k} c_k\, \exp(i\lambda t_k).   (31)
From the theory of the Fourier integral it is known that the set {exp(iλt)} is complete in L²(F), meaning that linear combinations of this kind generate the entire space L²(F). As a consequence the set {\sum_k b_k \Phi(t_k,\lambda)} has the same property of generating the space L²(F), which is equivalent to saying that, given any function f(λ) ∈ L²(F), it results f(\lambda) = \lim_{N\to\infty} \sum_{k=1}^{N} b_k\, \Phi(t_k,\lambda). The covariance function of the process ϕ(t) can be written as

B(s,t) = \int_{\Lambda} \Phi(t,\lambda)\, \Phi(s,\lambda)\, F(d\lambda) = a(t)\, a(s) \int_{\Lambda} \exp[i\lambda(t-s)]\, F(d\lambda)   (32)

and the canonical representation (1) reduces to

\varphi(t) = a(t)\, \zeta(t)   (33)

where

\zeta(t) = \int_{\Lambda} \exp(i\lambda t)\, \eta(d\lambda)   (34)

is a stationary SP, since the canonical representation holds for it. Thus we conclude that the canonical representation (1), with Φ(t, λ) given by (30), is valid for the class of nonstationary processes which can be expressed in the form ϕ(t) = a(t) ζ(t), with a(t) an arbitrary function and ζ(t) a stationary process. From (1) it can be shown that the stochastic measure η(Δλ), with Δλ = [λ_1, λ_2), can be written as

\eta(\Delta\lambda) = \int_{T} \varphi(t)\, \xi_{\Delta\lambda}(t)\, dt   (35)
where ξ_Δλ(t) is defined through the integral equation

\int_{T} \xi_{\Delta\lambda}(t)\, \Phi(t,\lambda)\, dt = \chi_{\Delta\lambda}(\lambda) \quad \text{with} \quad \chi_{\Delta\lambda}(\lambda) = \begin{cases} 1 & \text{if } \lambda \in \Delta\lambda \\ 0 & \text{if } \lambda \notin \Delta\lambda \end{cases}.   (36)

As far as the inversion formula (35) is concerned, by inserting Φ(t, λ), given by (30), into (36) and performing the inverse Fourier transform, it results

\xi_{\Delta\lambda}(t)\, a(t) = \frac{\exp(-i\lambda_2 t) - \exp(-i\lambda_1 t)}{-2\pi i t}.   (37)

Finally, from (35) and (37) we have

\eta(\Delta\lambda) = \int_{T} \zeta(t)\, a(t)\, \xi_{\Delta\lambda}(t)\, dt = \int_{T} \frac{\exp(-i\lambda_2 t) - \exp(-i\lambda_1 t)}{-2\pi i t}\, \zeta(t)\, dt.   (38)

Eq. (38) establishes that the stochastic measure η(Δλ) depends only on the stationary component ζ(t) of the process ϕ(t) = a(t) ζ(t). The measure F(Δλ) can be derived from the general equation (35):

F(\Delta\lambda) = E\{|\eta(\Delta\lambda)|^2\} = \int_{T}\int_{T} \xi_{\Delta\lambda}(t)\, \xi_{\Delta\lambda}(s)\, B(s,t)\, dt\, ds
 = \int_{T}\int_{T}\int_{\Lambda} \xi_{\Delta\lambda}(t)\, \xi_{\Delta\lambda}(s)\, a(t)\, a(s)\, \exp[i\lambda(t-s)]\, F(d\lambda)\, dt\, ds   (39)

which, by taking (37) into account, reduces to

F(\Delta\lambda) = \int_{T}\int_{T} \frac{\exp(-i\lambda_2 t) - \exp(-i\lambda_1 t)}{-2\pi i t}\, \frac{\exp(-i\lambda_2 s) - \exp(-i\lambda_1 s)}{-2\pi i s}\, B_{\zeta}(t-s)\, dt\, ds   (40)

where B_ζ(t-s) is the covariance function of ζ(t) and is defined by

B(s,t) = a(t)\, a(s)\, B_{\zeta}(t-s).   (41)
Coming back to the problem of approximating the SP ϕ(t) = a(t) ζ(t), it is equivalent to approximating the resulting Φ(t, λ) = a(t) e^{iλt} by the AINN, i.e.

a(t)\, e^{i\lambda t} \approx \sum_{m} a_m\, u_m(t)\, u_m(\lambda).   (42)

Indeed the mean square error between the SP ϕ(t) and the approximating process

\psi(t) = \int_{\Lambda} \Psi(t,\lambda)\, \eta(d\lambda) \quad \text{with} \quad \Psi(t,\lambda) = \sum_{m} a_m\, u_m(t)\, u_m(\lambda)   (43)

is given by

E\{|\varphi(t)-\psi(t)|^2\} = E\Bigl\{\Bigl|\int_{\Lambda} \Phi(t,\lambda)\, \eta(d\lambda) - \int_{\Lambda} \Psi(t,\lambda)\, \eta(d\lambda)\Bigr|^2\Bigr\}
 = \int_{\Lambda} |\Phi(t,\lambda) - \Psi(t,\lambda)|^2\, F(d\lambda) = \int_{\Lambda} \Bigl|a(t)\, e^{i\lambda t} - \sum_{m} a_m u_m(t) u_m(\lambda)\Bigr|^2 F(d\lambda) \to 0.   (44)

The stochastic neural network is thus defined by (44), with η(Δλ) given by (38). If the process ϕ(t) is Gaussian, the random variables η(Δλ) are independent and completely specified by the variance E{|η(Δλ)|²} = F(Δλ), so that B_ζ(t-s) completely characterizes the process, as follows from (40) and (41).
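The structure of this class of processes is easy to illustrate by simulation. The following sketch (ours) draws sample paths of ϕ(t) = a(t) ζ(t) for an arbitrary modulation a(t) and an exponentially correlated ζ(t), and checks that the sample covariance reproduces Eq. (41).

```python
# Monte-Carlo illustration of the infinite-T model class: phi(t) = a(t) * zeta(t)
# with stationary zeta, so that B(s,t) = a(t) a(s) B_zeta(t - s) as in Eq. (41).
import numpy as np

rng = np.random.default_rng(2)
t = np.linspace(0.0, 5.0, 200)
a_t = 1.0 + 0.5 * np.sin(2 * np.pi * t / 5.0)           # an arbitrary modulation a(t)
B_zeta = np.exp(-np.abs(t[:, None] - t[None, :]))        # stationary covariance of zeta
L = np.linalg.cholesky(B_zeta + 1e-10 * np.eye(len(t)))

zeta = L @ rng.standard_normal((len(t), 5000))           # zero-mean Gaussian zeta samples
phi = a_t[:, None] * zeta                                # phi(t) = a(t) zeta(t)

B_emp = phi @ phi.T / phi.shape[1]                       # sample covariance of phi
B_theory = np.outer(a_t, a_t) * B_zeta                   # Eq. (41)
print(np.max(np.abs(B_emp - B_theory)))                  # small up to Monte-Carlo error
```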
4
Conclusions
In this work historical learning, i.e. the capability of SAINNs to approximate a wide class of nonstationary stochastic processes from their covariance function, has been demonstrated. In particular, the two cases where the 'time' t belongs to a finite or an infinite set T have been analyzed separately.
References
[1] Karhunen, K.: Über lineare Methoden in der Wahrscheinlichkeitsrechnung. Ann. Acad. Sci. Fennicae, Ser. A. Math. Phys. 37 (1947) 3-79
[2] Turchetti, C., Conti, M., Crippa, P., Orcioni, S.: On the Approximation of Stochastic Processes by Approximate Identity Neural Networks. IEEE Trans. Neural Networks 9(6) (1998) 1069-1085
[3] Belli, M.R., Conti, M., Crippa, P., Turchetti, C.: Artificial Neural Networks as Approximators of Stochastic Processes. Neural Networks 12(4-5) (1999) 647-658
[4] Doob, J.L.: Stochastic Processes. J. Wiley & Sons, New York, USA (1990)
Use of the Kolmogorov's Superposition Theorem and Cubic Splines for Efficient Neural-Network Modeling Boris Igelnik Pegasus Technologies, Inc. 5970 Heisley Road, Suite 300, Mentor OH 44060, USA
[email protected]
Abstract. In this article an innovative neural-network architecture, called the Kolmogorov's Spline Network (KSN) and based on the Kolmogorov's Superposition Theorem and cubic splines, is proposed and elucidated. The main result is the Theorem giving the bound on the approximation error and the number of adjustable parameters, which favorably compares KSN with other one-hidden layer feed-forward neural-network architectures. The sketch of the proof is presented. The implementation of the KSN is discussed.
1
Introduction
The Kolmogorov's Superposition Theorem (KST) gives a general and very parsimonious representation of a multivariate continuous function through superposition and addition of univariate functions [1]. The KST states that any function f ∈ C(I^d) has the following exact representation on I^d:

f(x) = \sum_{n=1}^{2d+1} g\Bigl(\sum_{i=1}^{d} \lambda_i\, \psi_{ni}(x_i)\Bigr)   (1)
with some continuous univariate function g depending on f, while the univariate functions ψ_n and constants λ_i are independent of f. Hecht-Nielsen [2] was the first to recognize that the KST could be utilized in neural network computing. Using Sprecher's early enhancement of the KST [3] he proved that the Kolmogorov superpositions could be interpreted as a four-layer feedforward neural network. Girosi and Poggio [4] pointed out that the KST is irrelevant to neural-network computing, because of the high complexity of computing the functions g and ψ_n from a finite set of data. However, Kurkova [5] noticed that in the Kolmogorov proof of the KST the fixed number of basis functions, 2d + 1, can be replaced by a variable N, and the task of function representation can be replaced by the task of function approximation. She also demonstrated [6] how to approximate Hecht-Nielsen's network by a traditional neural network.
Numerical implementation of the Kolmogorov superpositions was analyzed in [7]-[10]. In particular, Sprecher [9], [10] derived elegant numerical algorithms for implementing both the internal and the external univariate functions in (1). The common starting point of the works [2], [3], [5]-[10] is the structure of the Kolmogorov superpositions. Then the main effort is made: to represent the Kolmogorov superpositions as a neural network [2], [3]; to approximate them by a neural network [5], [6]; or to directly approximate the external and internal univariate functions in the Kolmogorov representation by a numerical algorithm [7]-[10]. In our view this is an attempt to preserve the efficiency of the Kolmogorov theorem in representing a multivariate continuous function in its practical implementation. Especially attractive is the idea of preserving the universal character of the internal functions. If implemented with reasonable complexity of the suggested algorithms, this feature could make a breakthrough in building efficient approximations. However, since estimates of the complexity of these algorithms are not available so far, the arguments of Girosi and Poggio [4] are not yet refuted. The approach adopted in this paper is different. The starting point is function approximation from a finite set of data by a neural net of the following type:

f_{N,W}(x) = \sum_{n=1}^{N} a_n\, g_n\Bigl(\sum_{i=1}^{d} \psi_i(x_i, w_{ni})\Bigr), \qquad W = \{a_n, w_{ni}\}.   (2)
The function f to be approximated belongs to the class Φ(I^d) of continuously differentiable functions with bounded gradient, which is wide enough for applications. We are looking for a qualitative improvement of the approximation

f \approx f_{N,W}, \qquad f \in \Phi(I^d)   (3)

using some ideas of the KST proof. In particular, we see from the Kolmogorov proof that it is important to vary, depending on the data, the shape of the external univariate function g, in contrast to traditional neural networks with fixed-shape basis functions. The use of variable-shape basis functions in neural networks is described in [11]-[15]. In this paper an innovative neural-network architecture, the KSN, is introduced. Its distinctive features are that it is obtained from (1) by replacing the fixed number of basis functions, 2d + 1, by a variable N, and by replacing both the external function and the internal functions ψ_n by cubic spline functions s(·, γ_n) and s(·, γ_ni), respectively. The use of cubic splines allows the shape of the basis functions in the KSN to be varied by adjusting the spline parameters γ_n and γ_ni. Thus the KSN, f^s_{N,W}, is defined as follows:

f^s_{N,W}(x) = \sum_{n=1}^{N} s\Bigl(\sum_{i=1}^{d} \lambda_i\, s(x_i, \gamma_{ni}),\, \gamma_n\Bigr)   (4)
where λ_1, ..., λ_d > 0 are rationally independent numbers [16] satisfying the condition \sum_{i=1}^{d} \lambda_i \le 1. These numbers are not adjusted on the data and are independent of the application. In Section 2 we give a sketch of the proof of the main result of this paper. It states that the rate of convergence of the approximation error to zero as N → ∞ is significantly higher for the KSN than for the corresponding traditional neural networks described by equation (2). Simultaneously we prove that the complexity of the KSN approximation as N → ∞ tends to infinity significantly more slowly than the complexity of the traditional neural networks, where the complexity is defined as the number of adjustable parameters needed to achieve the same approximation error for both types of neural networks. This main result is the justification for introducing the KSN. Thus, utilizing some of the ideas of the KST proof, a significant gain, both in accelerating the rate of convergence of the approximation error to zero and in reducing the complexity of the approximation algorithm, can be achieved by the KSN. Sections 3 and 4 describe the implementation of the KSN and some future work in this direction.
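A forward pass of the KSN defined by (4) can be sketched as follows. This is an illustration, not the implementation of [23]: the splines are parameterized by their knot values on a uniform grid (standing in for the shape parameters γ), the λ_i are simply set to 1/d, and exact rational independence is not enforced in floating point.

```python
# Sketch of a KSN forward pass, Eq. (4), with cubic-spline inner and outer functions.
import numpy as np
from scipy.interpolate import CubicSpline

def ksn_forward(x, inner_knots, outer_knots, lam, grid):
    """x: (batch, d) inputs in [0,1]^d; inner_knots: (N, d, K); outer_knots: (N, K)."""
    N = outer_knots.shape[0]
    out = np.zeros(x.shape[0])
    for n in range(N):
        z = np.zeros(x.shape[0])
        for i in range(x.shape[1]):
            z += lam[i] * CubicSpline(grid, inner_knots[n, i])(x[:, i])  # s(x_i, gamma_ni)
        out += CubicSpline(grid, outer_knots[n])(np.clip(z, 0.0, 1.0))   # s(., gamma_n)
    return out

d, N, K = 2, 5, 9
grid = np.linspace(0.0, 1.0, K)
rng = np.random.default_rng(3)
lam = np.full(d, 1.0 / d)                      # positive weights with sum <= 1
y = ksn_forward(rng.uniform(size=(100, d)),
                rng.uniform(size=(N, d, K)), rng.uniform(size=(N, K)), lam, grid)
print(y.shape)
```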
2
Main Result
The main result is contained in the following theorem.

Theorem. For any function f ∈ Φ(I^d) and any natural N there exists a KSN defined by equation (4), with cubic spline univariate functions g^s_n, ψ^s_{ni} (these notations are used in this section instead of s(·, γ_n), s(·, γ_ni) in formula (4)) defined on [0, 1] and rationally independent numbers λ_1 > 0, ..., λ_d > 0, \sum_{i=1}^{d} \lambda_i \le 1, such that

\|f - f^s_{N,W}\| = O(1/N),   (5)

where

\|f - f^s_{N,W}\| = \max_{x \in I^d} |f(x) - f^s_{N,W}(x)|.   (6)

The number of network parameters, P, satisfies

P = O(N^{3/2}).   (7)
This statement compares the KSN favorably with the one-hidden-layer feedforward networks currently in use. Most such existing networks provide an estimate of the approximation error [17]-[22] of the form

\|f - f_{N,W}\| = O(1/\sqrt{N}).   (8)
Denote by N_* and P_* the number of basis functions and the number of adjustable parameters for a network currently in use with the same approximation error as the KSN with N basis functions. Comparison of (5) and (8) shows that N_* = O(N^2) and P_* = O(N^2). Therefore P_* >> P for large values of N, which confirms the significant advantage of the KSN in complexity.
The detailed proof of the theorem is contained in [23]. The proof is rather technical and lengthy; in particular, large efforts were spent on proving that the conditions imposed on the function f are sufficient for the justification of equations (5) and (7). In this paper we concentrate on a description of the major steps of the proof. The first step of the proof follows the arguments of the first step of the KST proof, except for two things: it considers a variable number of basis functions, N, and it defines the internal functions ψ_i in formula (2) differently. In the latter case the piecewise linear functions of the Kolmogorov construction are replaced by almost everywhere piecewise linear functions, with the connections between adjacent linear parts made from ninth-degree spline interpolants on arbitrarily small intervals. As a result of this first step a network f_{N,W}, described by equation (2), is obtained, satisfying

\|f - f_{N,W}\| = O(1/N).   (9)
The univariate functions g_n and ψ_i in the first step of the proof have uncontrolled complexity. That is why they are replaced in the second step by their cubic spline interpolants g^s_n and ψ^s_{ni}. The number of spline knots, M, depends on N and is chosen so that the spline network f^s_{N,W}, defined as

f^s_{N,W}(x) = \sum_{n=1}^{N} g^s_n\Bigl(\sum_{i=1}^{d} \lambda_i\, \psi^s_{ni}(x_i, \gamma_{ni}),\, \gamma_n\Bigr),   (10)
approximates f with approximation error and complexity satisfying equations (5) and (7). The functions g_n and ψ_i defined in the first step of the proof must be four times continuously differentiable in order to be replaced by their cubic spline interpolants g^s_n and ψ^s_{ni}, respectively, with an estimated approximation error. That is why ninth-degree spline interpolants are used in the construction of g_n and ψ_i.
3
Implementation of the KSN
In the implementation of the KSN we use the Ensemble Approach (EA) [24], [25]. The EA contains two optimization procedures, Recursive Linear Regression (RLR) [26] and Adaptive Stochastic Optimization (ASO), and one module for the specific neural net architecture (NNA). Both RLR and ASO operate on the values of the T × N design matrices, defined for N = 1, ..., N_max as

\begin{pmatrix} \Gamma_1(x^{(1)}, w_{1N}) & \cdots & \Gamma_N(x^{(1)}, w_{NN}) \\ \vdots & \ddots & \vdots \\ \Gamma_1(x^{(T)}, w_{1N}) & \cdots & \Gamma_N(x^{(T)}, w_{NN}) \end{pmatrix}   (11)

and on the matrix of the target output values y^{(1)}, ..., y^{(T)}, where
\Gamma_n(x^{(t)}, w_{nN}) = g^s_n\Bigl(\sum_{i=1}^{d} \lambda_i\, \psi^s_{ni}(x^{(t)}_i, \gamma_{ni}),\, \gamma_n\Bigr), \qquad n = 1, ..., N, \quad t = 1, ..., T,

and T is the number of points in the training set. Since these matrices are numerical, the operation of ASO and RLR is independent of the specifics of the NNA. The NNA for spline functions is defined in detail in [23]. The use of the EA can be supplemented by the use of the Ensemble Multi-Net (EMN) [27]-[29] in order to increase the generalization capability of the KSN (it can be used as well for training other predictors with the same purpose). The main features of the EMN are as follows. First an ensemble of nets {f_{n,W_N,E_n}}, n = 1, ..., N, having N_0 basis functions and trained on different subsets E_n, n = 1, ..., N, of the training set E, is created. Then these nets are combined into one net, F_{N,W,E}, trained (if needed) on the set E. We prefer the method of linear combining

F_{N,W,E}(x) = \sum_{n} a_n\, f_{n,W^*_N,E_n}(x)   (12)

where f_{n,W^*_N,E_n}, n = 1, ..., N, are the trained nets. The parameters a_n, n = 1, ..., N, are the only adjustable parameters on the right side of equation (12). The training sets are chosen so as to minimize the possibility of linear dependences among the basis functions f_{n,W^*_N,E_n}, n = 1, ..., N. One of the methods used for creating the training sets is called bagging [29]. For each (x, y) ∈ E the pair (x, y + N(0, σ̃)) is included in E_n with probability 1/|E|, where N(0, σ̃) is a Gaussian random variable with zero mean and standard deviation σ̃ equal to the estimate of the standard deviation of the net output noise. Inclusions in E_n are independent of inclusions in the other sets E_m, m ≠ n. The number of basis functions, N_0, in each net f_{n,W^*_N,E_n}, n = 1, ..., N, is small [27], [28] compared to N. It has been proven [29], both theoretically and experimentally, that this method may lead to a significant decrease of the generalization error. Given the relatively small size of the nets f_{n,W^*_N,E_n}, n = 1, ..., N, the EMN can be called "parsimonious cross-validation".
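The EMN combination step can be illustrated compactly. In the sketch below (our own), small random-feature ridge regressors stand in for the member nets f_{n,W*,E_n}, ordinary bootstrap resampling with added output noise stands in for the inclusion rule described above, and only the combining weights a_n of Eq. (12) are fitted at the end.

```python
# Sketch of the EMN idea: bagged small member models, then linear combining (Eq. 12).
import numpy as np

rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, (300, 3))
y = np.sin(X @ np.array([1.0, -2.0, 0.5])) + 0.05 * rng.standard_normal(300)

def fit_member(Xs, ys, n_basis=10, lam=1e-2):
    """A stand-in for a small member net: random tanh features + ridge regression."""
    Wrand = rng.normal(size=(Xs.shape[1], n_basis))
    P = np.tanh(Xs @ Wrand)
    w = np.linalg.solve(P.T @ P + lam * np.eye(n_basis), P.T @ ys)
    return lambda Z, Wr=Wrand, w=w: np.tanh(Z @ Wr) @ w

sigma_est = 0.05                                    # assumed estimate of output-noise std
members = []
for n in range(8):                                  # bagging: noisy resampled subsets E_n
    idx = rng.integers(0, len(X), len(X))
    members.append(fit_member(X[idx], y[idx] + sigma_est * rng.standard_normal(len(X))))

F = np.column_stack([m(X) for m in members])        # member predictions on E
a_n, *_ = np.linalg.lstsq(F, y, rcond=None)         # Eq. (12): fit only combining weights
print(np.round(a_n, 3))
```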
4
Conclusion and Future Work
In this article we have laid down the justification for a new, highly adaptive modeling architecture, the KSN, and developed methods for its implementation and learning. Several theoretical issues remain for future work. In particular, we are interested in deriving a bound on the estimation error [2] of the KSN and in finding methods for automatically choosing the intervals for the shape parameters of the KSN. No less important is the experimental work, in which not only the practical advantages and disadvantages of the KSN as a modeling tool will be checked, but also its coupling with preprocessing and post-processing tools.
References

[1] Kolmogorov, A.N.: On the Representation of Continuous Functions of Many Variables by Superposition of Continuous Functions of One Variable and Addition. Dokl. Akad. Nauk SSSR 114 (1957) 953-956; Trans. Amer. Math. Soc. 2 (28) (1963) 55-59
[2] Hecht-Nielsen, R.: Counter Propagation Networks. Proc. IEEE Int. Conf. Neural Networks. 2 (1987) 19-32
[3] Sprecher, D.A.: On the Structure of Continuous Functions of Several Variables. Trans. Amer. Math. Soc. 115 (1965) 340-355
[4] Girosi, F., Poggio, T.: Representation Properties of Networks: Kolmogorov's Theorem Is Irrelevant. Neur. Comp. 1 (1989) 465-469
[5] Kurkova, V.: Kolmogorov's Theorem Is Relevant. Neur. Comp. 3 (1991) 617-622
[6] Kurkova, V.: Kolmogorov's Theorem and Multilayer Neural Networks. Neural Networks. 5 (1992) 501-506
[7] Nakamura, M., Mines, R., and Kreinovich, V.: Guaranteed Intervals for Kolmogorov's Theorem (and their Possible Relations to Neural Networks). Interval Comp. 3 (1993) 183-199
[8] Nees, M.: Approximative Versions of Kolmogorov's Superposition Theorem, Proved Constructively. J. Comp. Appl. Math. 54 (1994) 239-250
[9] Sprecher, D.A.: A Numerical Implementation of Kolmogorov's Superpositions. Neural Networks. 9 (1996) 765-772
[10] Sprecher, D.A.: A Numerical Implementation of Kolmogorov's Superpositions II. Neural Networks. 10 (1997) 447-457
[11] Guarnieri, S., Piazza, F., and Uncini, A.: Multilayer Neural Networks with Adaptive Spline-based Activation Function. Proc. Int. Neural Network Soc. Annu. Meet. (1995) 1695-1699
[12] Vecci, L., Piazza, F., and Uncini, A.: Learning and Generalization Capabilities of Adaptive Spline Activation Function Neural Networks. Neural Networks. 11 (1998) 259-270
[13] Uncini, A., Vecci, L., Campolucci, P., and Piazza, F.: Complex-valued Neural Networks with Adaptive Spline Activation Function for Digital Radio Links Nonlinear Equalization. IEEE Trans. Signal Proc. 47 (1999) 505-514
[14] Guarnieri, S., Piazza, F., and Uncini, A.: Multilayer Feedforward Networks with Adaptive Spline Activation Function. IEEE Trans. Neural Networks. 10 (1999) 672-683
[15] Igelnik, B.: Some New Adaptive Architectures for Learning, Generalization, and Visualization of Multivariate Data. In: Sincak, P., Vascak, J. (eds.): Quo Vadis Computational Intelligence? New Trends and Approaches in Computational Intelligence. Physica-Verlag, Heidelberg (2000) 63-78
[16] Shidlovskii, A.V.: Transcendental Numbers. Walter de Gruyter, Berlin (1989)
[17] Barron, A.R.: Universal Approximation Bounds for Superpositions of a Sigmoidal Function. IEEE Trans. Inform. Theory. 39 (1993) 930-945
[18] Breiman, L.: Hinging Hyperplanes for Regression, Classification, and Function Approximation. IEEE Trans. Inform. Theory. 39 (1993) 999-1013
[19] Jones, L.K.: Good Weights and Hyperbolic Kernels for Neural Networks, Projection Pursuit, and Pattern Classification: Fourier Strategies for Extracting Information from High-dimensional Data. IEEE Trans. Inform. Theory. 40 (1994) 439-454
[20] Makovoz, Y.: Random Approximants and Neural Networks. Jour. Approx. Theory. 85 (1996) 98-109
[21] Scarcelli, F. and Tsoi, A.C.: Universal Approximation Using Feedforward Neural Networks: a Survey of Some Existing Methods and Some New Results. Neural Networks. 11 (1998) 15-37
[22] Townsend, N.W. and Tarassenko, L.: Estimation of Error Bounds for Neural-Network Function Approximators. IEEE Trans. Neural Networks. 10 (1999) 217-230
[23] Igelnik, B., Parikh, N.: Kolmogorov's Spline Network. IEEE Trans. Neural Networks. (2003) Accepted for publication
[24] Igelnik, B., Pao, Y.-H., LeClair, S.R., and Chen, C.Y.: The Ensemble Approach to Neural Net Training and Generalization. IEEE Trans. Neural Networks. 10 (1999) 19-30
[25] Igelnik, B., Tabib-Azar, M., and LeClair, S.R.: A Net with Complex Coefficients. IEEE Trans. Neural Networks. 12 (2001) 236-249
[26] Albert, A.: Regression and the Moore-Penrose Pseudoinverse. Academic Press, New York (1972)
[27] Schapire, R.E.: The Strength of Weak Learnability. Machine Learning. 5 (1990) 197-227
[28] Ji, S. and Ma, S.: Combinations of Weak Classifiers. IEEE Trans. Neural Networks. 8 (1997) 32-42
[29] Breiman, L.: Combining Predictors. In: Sharkey, A.J.C. (ed.): Combining Artificial Neural Nets. Ensemble and Modular Nets. Springer, London (1999) 31-48
The Influence of Prior Knowledge and Related Experience on Generalisation Performance in Connectionist Networks

F.M. Richardson 1,2, N. Davey 1, L. Peters 1, D.J. Done 2, and S.H. Anthony 2

1 Department of Computer Science, University of Hertfordshire, College Lane, Hatfield, Hertfordshire, AL10 9AB, UK
{F.1.Richardson, N.Davey, L.Peters}@herts.ac.uk
2 Department of Psychology, University of Hertfordshire, College Lane, Hatfield, Hertfordshire, AL10 9AB, UK
{D.J.Done, S.H.1.Anthony}@herts.ac.uk
Abstract. The work outlined in this paper explores the influence of prior knowledge and related experience (held in the form of weights) on the generalisation performance of connectionist models. Networks were trained on simple classification and associated tasks. Results regarding the transfer of related experience between networks trained using back-propagation and recurrent networks performing sequence production are reported. In terms of prior knowledge, results demonstrate that experienced networks produced their most pronounced generalisation performance advantage over naïve networks when a specific point of difficulty during learning was identified and an incremental training strategy applied at this point. Interestingly, the second set of results showed that knowledge learnt in one task could be used to facilitate learning of a different but related task. However, in the third experiment, when the network architecture was changed, prior knowledge did not provide any advantage and, when learning was expanded, performance was even found to deteriorate.
1
Introduction
Some complex tasks are difficult for neural networks to learn. In such circumstances an incremental learning approach, which places initial restrictions on the network in terms of memory or complexity of training data, has been shown to improve learning [1], [2], [3]. However, the purpose of the majority of networks is not simply to learn the training data but to generalise to unseen data. Therefore, it can be expected, but not assumed, that using incremental learning may also improve generalisation performance. The work reported in this paper extends upon the original work of Elman [3], in which networks trained incrementally showed a dramatic improvement in learning. In this paper three different ways of breaking down the complexity of the task are
investigated, with specific reference to generalisation performance. In the first experiment an incremental training regime, in which the training set is gradually increased in size, is evaluated against the standard method of presenting the complete training set. In the second experiment the hypothesis that knowledge held in the weights of a network from one task might be useful to a network learning a related task is explored. The third experiment completes the investigation by determining whether knowledge transfer between networks such as those in the second experiment may prove beneficial to learning across different types of networks performing related tasks. The use of well-known learning rules (back-propagation and recurrent learning) allows this work to complement the earlier work mentioned above, providing a more complete picture of learning and generalisation performance.
2
Experiment One: Investigating Prior Knowledge
The aim of this experiment was to compare the generalisation performance of networks trained using incremental learning, in which the input to the network is staged (experienced networks), with equivalent networks initialised with random weights (naïve networks). Networks were trained with the task of classifying static depictions of four simple line-types, as seen in Figure 1. Classification of the line-type was to be made irrespective of its size and location on the input array. In order to accomplish this task the network must learn the spatial relationship of units in the input and then give the appropriate output in the form of the activation of a single classification unit on the output layer.
Fig. 1. Shows the four basic line-types which the network was required to classify: horizontal, vertical, diagonal acute, and diagonal obtuse
2.1
Network Architecture
A simple feed-forward network was used, consisting of an input layer of 49 units arranged as a 7x7 grid, fully connected to a hidden layer of 8 units, which was in turn connected to an output layer of 4 units, each unit representing a single line-type (see Figure 2).

2.2
Training and Testing of Naïve Networks
A full set of patterns for all simple line-types of lengths ranging from 3 to 7 units, for all locations upon the input grid, was randomly allocated to one of two equal-sized training and testing sets (160 patterns per set). Three batches of networks (each consisting of 10 runs) were initialised with random weights (naïve networks). The first batch of networks was trained to classify two different line-types (horizontal and vertical lines), the second three, and the third all four different line-types. All
networks were trained using back-propagation with the same learning rate (0.25) and momentum (0.9) to the point where minimal generalisation error was reached. At this stage the number of epochs that each network had taken to reach the stop-criterion of minimal generalisation error was noted. These networks formed the basis of comparison with experienced networks.

2.3
Training and Testing of Experienced Networks
These networks were trained using the same parameters as those used for naïve networks. The training and testing of the four line-types was divided into three increments, the first increment consisting of two line-types (horizontal and vertical lines), with an additional line-type being added at each subsequent increment. The network progressed from one increment to the next upon attaining minimal generalisation error for the current increment. At this point the weights of the network were saved and then used as the starting point for learning in the following increment; patterns for the additional line-type were added, along with an additional output unit (the weights for the additional unit were randomly initialised).
Fig. 2. Shows the network architecture of the 7x7 classification network (input layer, hidden layer, output layer). The network consisted of an input layer with units arranged as a grid, a hidden layer, and an output layer consisting of a number of classification units. Given a static visual depiction of a line of any length as input, the network was required to classify the input accordingly by activating the corresponding unit on the output layer. In the example shown the network is given the input of a vertical line with a length of 3 units, which is classified as a vertical line-type
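A minimal sketch of the weight-transfer step described above, assuming a plain NumPy representation of the 49-8-k feed-forward network: all learned weights are carried over unchanged, and only the output row for the newly added classification unit is randomly initialised. The layer sizes follow the paper; the initialisation scale is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_net(n_in=49, n_hidden=8, n_out=2):
    """Naive network: random weights throughout (7x7 input grid, 8 hidden units)."""
    return {"W1": rng.normal(0, 0.1, (n_hidden, n_in)),
            "b1": np.zeros(n_hidden),
            "W2": rng.normal(0, 0.1, (n_out, n_hidden)),
            "b2": np.zeros(n_out)}

def add_output_unit(net):
    """Experienced network for the next increment: keep all learned weights,
    append one randomly initialised output unit for the new line-type."""
    W2 = np.vstack([net["W2"], rng.normal(0, 0.1, (1, net["W2"].shape[1]))])
    b2 = np.append(net["b2"], 0.0)
    return {"W1": net["W1"], "b1": net["b1"], "W2": W2, "b2": b2}

net = init_net(n_out=2)          # increment 1: horizontal and vertical lines
# ... train with back-propagation to minimal generalisation error ...
net = add_output_unit(net)       # increment 2: add the third line-type
net = add_output_unit(net)       # increment 3: add the fourth line-type
print(net["W2"].shape)           # (4, 8)
```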
2.4
Results
Generalisation performance was assessed in terms of the number of output errors and poor activations produced by each type of network. Outputs were considered errors if activation of the target unit was less than 0.50, or if a non-target unit had an activation of 0.50 or higher. Poor activations were marginally correct classifications (activation between 0.50 and 0.60 on the target unit, and/or activation of 0.40-0.49 on non-target units) and were used to give a more detailed indicator of the level of classification success.
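These scoring criteria translate directly into a small routine; the sketch below assumes one activation vector per test pattern and a single target unit, which is how the thresholds read, though the authors' exact bookkeeping is not given.

```python
import numpy as np

def score_output(activations, target_index):
    """Classify one output vector as 'error', 'poor', or 'correct'
    using the 0.50 / 0.50-0.60 / 0.40-0.49 thresholds described in the text."""
    activations = np.asarray(activations, dtype=float)
    target = activations[target_index]
    non_targets = np.delete(activations, target_index)

    if target < 0.50 or np.any(non_targets >= 0.50):
        return "error"
    if target <= 0.60 or np.any((non_targets >= 0.40) & (non_targets <= 0.49)):
        return "poor"
    return "correct"

print(score_output([0.55, 0.10, 0.05, 0.02], 0))  # poor
print(score_output([0.90, 0.10, 0.05, 0.02], 0))  # correct
print(score_output([0.45, 0.60, 0.05, 0.02], 0))  # error
```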
The generalisation performance of naïve networks trained on two line-types was good, with networks classifying on average 89.84% of previously unseen patterns correctly. However, performance decreased as the number of different line-types in the training set increased, with classification performance for all four line-types dropping to an average of 79.45%. In comparison, experienced networks proved marginally better, with an average of 81.76%. Further comparisons between naïve and experienced networks revealed that a naïve network trained upon three line-types produced a better generalisation performance than an experienced network at the same stage. It seemed that the level of task difficulty increased between the learning of three and four line-types. So the weights from naïve networks trained with three line-types were used as a starting point for training further networks upon four line-types, resulting in a two-stage incremental strategy. This training regime resulted in a further improvement in generalisation performance, with an average of 84.48%. A comparison of the results for the three different training strategies implemented can be seen in Figure 3.
Fig. 3. Shows a comparison of generalisation performance between networks trained using the three different strategies. It can be seen that naïve networks produced the lowest generalisation success. Of the experienced networks, those trained using the two-stage strategy produced the best generalisation performance
3
Experiment Two: Investigating Related Experience I
The aim of this experiment was to determine whether the knowledge acquired by networks trained to classify line-types in Experiment One would aid the generalisation performance of networks trained upon a different but related task. In the new task, networks were given the same type of input patterns as those used for the classification network, but were required to produce a displacement vector as output. This displacement vector contained information as to the length of the line and the direction in which the line would be drawn if produced (see Figure 4). It was hypothesised that the knowledge held in the weights from the previous network would aid learning in the related task because the task of determining how to
interpret the input layer had already been solved by the previous set of weights. The divisions between line-types created upon output in the classification task were also relevant to the related task, in that the same line-types shared activation properties upon the output layer; for example, all vertical lines are the result of activation upon the y-axis.

3.1
Network Architecture
All networks consisted of the same input layer as used in Experiment One, a hidden layer of 12 units and an output layer of 26 units.

Fig. 4. Shows the static input to the network of a vertical line of a length of five units and the desired output of the network. Output is encoded in terms of movement from the starting point along the x and y-axes. This form of thermometer encoding preserves isometry and is position invariant [3]
3.2
Training and Testing
All networks were trained and tested in the same manner and using the same parameters as those used in Experiment One. Naïve networks were initialised with a random set of weights. For experienced networks, weights from the networks producing the best generalisation performance in Experiment One were loaded. Additional connections required for these networks (due to an increase in the number of hidden units) were initialised randomly. Generalisation performance was assessed.
Results
Naïve networks reached an average generalisation performance of 59.59%. Experienced networks were substantially better, with an average generalisation performance of 70.42%. This result, as seen in Figure 5, clearly demonstrates the advantage of related knowledge about lines and the spatial relationship of units on the input grid in the production of displacement vectors.
4
Experiment Three: Investigating Related Experience II
The aim of this experiment was to examine whether weights from a classification task could be used to aid learning in a recurrent network required to carry out an extension of the static task. Initially, a standard feed-forward network (as shown in Figure 2) was trained to classify line-types of simple two-line shapes. Following this a recurrent
network [5] was trained to carry out this task in addition to generating the sequence in which the line-segments for each shape would be produced if drawn (as shown in Figure 6).
Fig. 5. Shows a comparison of generalisation performance between naïve and experienced networks. It is clear that the experienced networks produced the best generalisation performance
Fig. 6. Shows an example of the input and output for the sequential shape classification task (static and sequential output components at time steps t1 and t2). The input was a simple shape composed of two lines of different types. The output consisted of two components: the static, identifying the line-types, and the sequential, generating the order of production. Networks used in Experiment One were trained to give the static output of the task. Weights from these networks were then used by recurrent networks to produce both the static and sequential outputs as shown above
4.1
Training and Testing
Both naïve and experienced networks were trained on an initial sequence production task, using simple shapes composed of diagonal line-types only. Following this, the initial task was extended for both networks to include shapes composed of horizontal and vertical line-types.

4.2
Results
The generalisation performance of naïve and experienced networks for the initial and extended task was assessed. Initially, comparison showed no notable difference
between naïve and experienced networks. However, performance deteriorated for experienced networks with the addition of the extended task. This drop in performance was attributed to a reduction in performance upon the initial task.
Fig. 7. Shows a comparison of generalisation performance between naïve and experienced networks, performing the initial sequence production task followed by the extended task. It can be seen that performance for the two types of networks for the initial task is relatively equal. However, for the extended task, performance of the experienced networks is poor in comparison to naïve networks trained with random weights
5
Discussion
The experiments conducted have provided useful insights into how prior knowledge and related experience may be used to improve the generalisation performance of connectionist networks. Firstly, it has been demonstrated that incremental learning was of notable benefit. Secondly, selecting the point at which learning becomes difficult as the time to increment the training set produces a further advantage. Thirdly, an interesting result is that knowledge learnt in one task can be used to facilitate learning in a different but related task. Finally, the exploration into whether knowledge can transfer and aid learning between networks of different architectures has a less clear outcome; prior knowledge was successfully transferred, but performance was found to deteriorate as the network attempted to expand its learning. Further work involves exploring methods by which knowledge transfer between static and recurrent networks may prove beneficial in both learning and generalisation performance.
References

[1] Altmann, G.T.M.: Learning and Development in Connectionist Learning. Cognition (2002) 85, B43-B50
[2] Clarke, A.: Representational Trajectories in Connectionist Learning. Minds and Machines (1994) 4, 317-322
[3] Elman, J.L.: Learning and Development in Neural Networks: the Importance of Starting Small. Cognition (1993) 48, 71-99
[4] Richardson, F.M., Davey, N., Peters, L., Done, D.J., Anthony, S.H.: Connectionist Models Investigating Representations Formed in the Sequential Generation of Characters. Proceedings of the 10th European Symposium on Artificial Neural Networks. D-side publications, Belgium (2002) 83-88
[5] Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning Internal Representations by Error Propagation. In: Parallel Distributed Processing. Vol. 1, Chapter 8. MIT Press, Cambridge (1986)
Urinary Bladder Tumor Grade Diagnosis Using On-line Trained Neural Networks

D.K. Tasoulis 1,2, P. Spyridonos 3, N.G. Pavlidis 1,2, D. Cavouras 4, P. Ravazoula 5, G. Nikiforidis 3, and M.N. Vrahatis 1,2

1 Department of Mathematics, University of Patras, GR-26110 Patras, Greece
[email protected]
2 University of Patras Artificial Intelligence Research Center (UPAIRC)
3 Computer Laboratory, School of Medicine, University of Patras, GR-26110 Patras, Greece
4 Department of Medical Instrumentation Technology, TEI of Athens, Ag. Spyridonos Street, Aigaleo, GR-12210 Athens, Greece
5 Department of Pathology, University Hospital, GR-26110 Patras, Greece
Abstract. This paper extends the line of research that considers the application of Artificial Neural Networks (ANNs) as an automated system for the assignment of tumor grade. One hundred twenty-nine cases were classified according to the WHO grading system by experienced pathologists into three classes: Grade I, Grade II and Grade III. 36 morphological and textural cell nuclei features represented each case. These features were used as input to the ANN classifier, which was trained using a novel stochastic training algorithm, namely the Adaptive Stochastic On-Line method. The resulting automated classification system achieved classification accuracies of 90%, 94.9% and 97.3% for tumors of Grade I, II and III, respectively.
1
Introduction
Bladder cancer is the fifth most common type of cancer. Superficial transitional cell carcinoma (TCC) is the most frequent histological type of bladder cancer [13]. Currently, these tumors are assessed using a grading system based on a variety of histopathological characteristics. Tumor grade, which is determined by the pathologist from tissue biopsy, is associated with tumor aggressiveness. The most widely accepted grading system is the WHO (World Health Organization) system, which stratifies TCCs into three categories: tumors of Grade I, Grade II and Grade III. Grade I tumors are not associated with invasion or metastasis but present a risk for the development of recurrent lesions. Grade II carcinomas are associated with low risk of further progression, yet they frequently recur. Grade III tumors are characterized by a much higher risk of progression
and also a high risk of association with disease invasion [7]. Although histological grade has been recognized as one of the most powerful predictors of the biological behavior of tumors and significantly affects patients' management, it suffers from low inter- and intra-observer reproducibility due to the subjectivity inherent to visual observation [12]. Digital image analysis techniques and classification systems constitute alternative means to perform tumor grading in a less subjective manner. Numerous research groups have proposed quantitative assessments to address this problem. Choi et al. [3] have developed an automatic grading system using texture features on a large region of interest, covering a typical area in the histological section. The texture-based system produced an overall accuracy of 84.3% in assessing tumor grade. In a different study [6], researchers employed tissue architectural features and classified tumors with an accuracy of 73%. More recent studies have focused on the analysis of cell nuclei characteristics to perform tumor grade classification, with success rates that do not significantly exceed 80% [2, 17]. In this study, we present a methodology which considerably improves the level of diagnostic accuracy in assigning tumor grade. The method is based on the application of an ANN as a classifier system. The input data for the ANN describe a number of nuclear morphological and textural features that were obtained through an automatic image processing analysis technique. It is worth noting that the prognostic and diagnostic value of these features has been confirmed [4].
2
Materials and Methods
129 tissue sections (slides) from 129 patients (cases) with superficial TCC were retrieved from the archives of the Department of Pathology of Patras University Hospital in Greece. Tissue sections were routinely stained with Haematoxylin-Eosin. All cases were reviewed independently by the experts to safeguard reproducibility. Cases were classified following the WHO grading system as follows: thirty-three cases as Grade I, fifty-nine as Grade II and thirty-seven as Grade III. Images from tissue specimens were captured using a light microscopy imaging system. The method of digitalization and cell nuclei segmentation for analysis has been described in previous work [17]. Finally, from each case 36 features were estimated: 18 features were used to describe information concerning nuclear size and shape distribution. The rest were textural features that encoded the chromatin distribution of the cell nucleus [17]. These features were used as input to the ANN classifier.
3
Artificial Neural Networks
Back Propagation Neural Networks (BPNNs) are the most popular artificial neural network models. The efficient supervised training of BPNNs is a subject of considerable ongoing research and numerous algorithms have been proposed to this end. Supervised training amounts to the global minimization of the network learning error.
Applications of supervised learning can be divided into two categories: stochastic (also called on-line) and batch (also called off-line) learning. Batch training can be viewed as the minimization of the error function E. This minimization corresponds to updating the weights by epoch and, to be successful, it requires a sequence of weight iterates {w^k}, k = 0, 1, 2, ..., where k indicates epochs, which converges to a minimizer w*. In on-line training, network weights are updated after the presentation of each training pattern. This corresponds to the minimization of the instantaneous error of the network E(p) for each pattern p individually. On-line training may be chosen for a learning task either because of the very large (or even redundant) training set or because we want to model a gradually time-varying system. Moreover, it helps escaping local minima. Given the inherent efficiency of stochastic gradient descent, various schemes have been proposed recently [1, 18, 19]. Unfortunately, on-line training suffers from several drawbacks, such as sensitivity to learning parameters [16]. Another disadvantage is that most advanced optimization methods, such as conjugate gradient, variable metric, simulated annealing etc., rely on a fixed error surface, and thus it is difficult to apply them to on-line training [16]. Regarding the topology of the network, it has been proven [5, 20] that standard feedforward networks with a single hidden layer can approximate any continuous function uniformly on any compact set and any measurable function to any desired degree of accuracy. This implies that any lack of success in applications must arise from inadequate learning, an insufficient number of hidden units or the lack of a deterministic relationship between inputs and targets. Keeping these theoretical results in mind, we restrict the network topology of the ANNs used in this study to one hidden layer.
4
Training Method
For the purpose of training the neural networks, an on-line stochastic method was employed. For recently proposed on-line training methods, as well as their application in medical problems, see [8, 9, 10, 11, 14]. This method uses a learning rate adaptation scheme that exploits gradient-related information from the previous patterns. The algorithm is described in [9] and is based on the stochastic gradient descent proposed in [1]. The basic algorithmic scheme is exhibited in Table 1. As pointed out in Step 4, the algorithm adapts the learning rate using the dot product of the gradients from the previous two patterns. This algorithm produced both fast and stable learning in all the experiments we performed, and very good generalization results.
5
Results and Discussion
To measure the ANN efficiency the dataset was randomly permuted five times. Each time it was split into a training set and a test set. The training set contained about 2/3 of the original dataset from each class. For each permutation the
Table 1. Stochastic On-Line Training with adaptive stepsize

0: Initialize the weights w^0, the stepsize η^0, and the meta-stepsize K.
1: Repeat for each input pattern p:
2:   Calculate E(w^p) and then ∂E(w^p).
3:   Update the weights: w^{p+1} = w^p − η^p ∂E(w^p).
4:   Calculate the stepsize to be used with the next pattern p+1: η^{p+1} = η^p + K ⟨∂E(w^{p−1}), ∂E(w^p)⟩.
5: Until the termination condition is met.
6: Return the final weights w*.
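A sketch of the Table 1 update rule, assuming a squared pattern error and, for brevity, a single linear output unit in place of a full BPNN; the inner product in the stepsize update follows the dot-product description in the text, while the initial stepsize, the meta-stepsize K and the small positivity clamp are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)

def grad_E(w, x, t):
    """Gradient of the instantaneous error E_p = 0.5 * (t - w.x)^2 for a linear unit.
    (A BPNN would backpropagate here; a linear unit keeps the sketch short.)"""
    return -(t - w @ x) * x

def online_train(X, targets, eta=0.01, K=1e-6, cycles=200):
    w = rng.normal(0.0, 0.1, X.shape[1])
    g_prev = np.zeros_like(w)
    for _ in range(cycles):
        for x, t in zip(X, targets):
            g = grad_E(w, x, t)
            w = w - eta * g                           # step 3: weight update
            eta = max(eta + K * (g_prev @ g), 1e-6)   # step 4: adaptive stepsize (clamped positive)
            g_prev = g
    return w, eta

X = rng.normal(size=(30, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
w, eta = online_train(X, X @ w_true)
print(np.round(w, 3), round(eta, 6))
```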
Table 2. Accuracy of Grade I, II and III for various topologies

Topology | Grade I | Grade II | Grade III
36-1-3   | 18.18%  | 100%     | 100%
36-2-3   | 81.81%  | 100%     | 100%
36-5-3   | 90.90%  | 100%     | 100%
36-16-3  | 81.81%  | 100%     | 100%
network was trained with the Stochastic On-Line method with adaptive stepsize discussed previously. Two terminating conditions were used: the maximum number of cycles over the entire training set was set to 100, and the correct classification of all the training patterns. Additionally, the Leave-One-Out (LOO) method [15] was employed to validate the ANN classification accuracy. According to this method, the ANN is trained with the training set including all patterns except one. The excluded pattern is used to assess the classification ability of the network. This process is repeated for all the patterns available and results are recorded in the form of a truth table. The software used for this task was developed under the Linux operating system using the C++ programming language and the gcc version 2.96 compiler. A great number of different ANN topologies (number of nodes in the hidden layer) were tested for the grade classification task. Some of these tests are exhibited in Table 2. Best results were obtained using the topology 36-5-3. Table 3 illustrates analytically the ANN performance for each crossover permutation. The ANN exhibited high classification accuracy for each grade category. It is worth noting that Grade I tumors were differentiated successfully from Grade III tumors. In four out of five crossovers neither Grade I to III nor Grade III to I errors occurred. As can be seen from Table 3, in one permutation only 1 case of Grade I was misclassified as Grade III. From a clinical point of view, it is important to distinguish low-grade tumors, which can generally be treated
Table 3. Crossover Results for the ANNs

Crossover I
Histological finding | Grade I | Grade II | Grade III | Accuracy (%)
GRADE I   | 10 | 1  | 0  | 90.9
GRADE II  | 0  | 19 | 1  | 95
GRADE III | 0  | 1  | 12 | 92.3
Overall Accuracy: 93.2

Crossover II
Histological finding | Grade I | Grade II | Grade III | Accuracy (%)
GRADE I   | 10 | 1  | 0  | 90.9
GRADE II  | 0  | 20 | 0  | 100
GRADE III | 0  | 2  | 11 | 84.62
Overall Accuracy: 93.2

Crossover III
Histological finding | Grade I | Grade II | Grade III | Accuracy (%)
GRADE I   | 10 | 1  | 0  | 90.9
GRADE II  | 0  | 19 | 1  | 95
GRADE III | 0  | 1  | 12 | 92.3
Overall Accuracy: 93.2

Crossover IV
Histological finding | Grade I | Grade II | Grade III | Accuracy (%)
GRADE I   | 10 | 0  | 1  | 90.9
GRADE II  | 0  | 19 | 1  | 95
GRADE III | 0  | 0  | 13 | 100
Overall Accuracy: 95.4

Crossover V
Histological finding | Grade I | Grade II | Grade III | Accuracy (%)
GRADE I   | 10 | 1  | 0  | 90.9
GRADE II  | 0  | 18 | 2  | 90
GRADE III | 0  | 0  | 13 | 100
Overall Accuracy: 93.2
conservatively, in contrast to high-grade tumors. The latter often require a more aggressive therapy because of a high risk of cancer progression. Results could also be interpreted in terms of specificity and sensitivity. That is, specificity is the percentage of Grade I tumors correctly classified and sensitivity is the percentage of Grade III tumors correctly classified. ANN grade classification safeguarded high sensitivity, which is of vital importance for the patients' treatment course, retaining
Table 4. Leave-One-Out Results for the ANNs

Histological finding | Grade I | Grade II | Grade III | Accuracy (%)
GRADE I   | 30 | 1  | 2  | 90
GRADE II  | 0  | 56 | 3  | 94.9
GRADE III | 0  | 1  | 36 | 97.3
Overall Accuracy: 94.06
at the same time high specificity. Another important outcome is that the intermediate Grade II tumors were distinguished with high confidence from Grades I and III. This would be particularly helpful for pathologists who encounter difficulties in assessing Grade II tumors, since some of them fall into the gray zone bordering on either Grade I or Grade III, and the decision is subject to the judgment of the pathologist. The simplicity and efficiency of the training method enabled us to verify the ANN classification accuracy by employing the LOO method (the whole procedure required 46 seconds to complete on an Athlon CPU running at 1500 MHz). It is well known that this method is optimal for testing the performance of a classifier when small data sets are available, but this testing procedure is computationally expensive when used with conventional training methods. Classification results employing the LOO method are shown in Table 4. The consistency of the system in terms of high sensitivity (no Grade III to Grade I error occurred) was verified. In [3], a texture-based system produced an overall accuracy of 84.3% in assessing tumor grade. In a different study [6], researchers employed tissue architectural features and classified tumors with an accuracy of 73%. More recent studies have focused on the analysis of cell nuclei characteristics to perform tumor grade classification, with success rates that do not significantly exceed 80% [2, 17]. The ANN methodology proposed in this paper improved significantly the tumor grade assessment, with success rates of 90%, 94.9%, and 97.3% for Grade I, II and III respectively.
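The per-class rates and the specificity/sensitivity figures discussed above can be recomputed directly from the LOO truth table of Table 4; the short check below follows the paper's own definitions (specificity on Grade I, sensitivity on Grade III) and reproduces the quoted values up to rounding.

```python
import numpy as np

# LOO truth table (Table 4): rows = histological finding, columns = ANN classification.
confusion = np.array([[30, 1, 2],    # Grade I
                      [0, 56, 3],    # Grade II
                      [0, 1, 36]])   # Grade III

per_class = confusion.diagonal() / confusion.sum(axis=1)   # approx. 0.909, 0.949, 0.973

# As defined in the text: specificity = fraction of Grade I correctly classified,
# sensitivity = fraction of Grade III correctly classified.
specificity, sensitivity = per_class[0], per_class[2]
grade3_as_grade1 = confusion[2, 0]                         # 0: no Grade III -> Grade I errors

print("per-class accuracy (%):", np.round(per_class * 100, 1))
print("specificity: %.1f%%  sensitivity: %.1f%%  Grade III->I errors: %d"
      % (100 * specificity, 100 * sensitivity, grade3_as_grade1))
```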
6
Conclusions
In this study an ANN was designed to improve the automatic characterization of TCCs employing nuclear features. The ANN exhibited high performance in correctly classifying tumors into three categories, utilizing all the available diagnostic information carried by nuclear size, shape and texture descriptors. The proposed ANN can be considered an efficient and robust classification engine, able to generalize in making decisions about complex input data and improving significantly the diagnostic accuracy. The present study extends previous work in terms of the features used and reinforces the belief that objective measurements of nuclear morphometry and texture offer a potential solution for the accurate
characterization of tumor aggressiveness. The novelty of this paper resides in the results obtained, since they are the highest reported in the literature. Since most Grade I tumors are considered to be of good prognosis, while Grade III is associated with bad prognosis, the 0% misclassification of Grade III tumors as Grade I gives the proposed methodology an advantage as part of a fully automated computer-aided diagnosis system.
References [1] L. B. Almeida, T. Langlois, L. D. Amaral, and A. Plankhov. Parameter adaption in stohastic optimization. On-Line Learning in Neural Networks, pages 111–134, 1998. 201 [2] N. Belacel and M. R. Boulassel. Multicriteria fuzzy assignment method: a useful tool to assist medical diagnosis. Artificial Intelligence in Medicine, 21:201–207, 2001. 200, 204 [3] H.-K. Choi, J. Vasko, E. Bengtsson, T. Jarkrans, U. Malmstrom, K. Wester, and C. Busch. Grading of transitional cell bladder carcinoma by texture analysis of histological sections. Analytical Cellular Pathology, 6:327–343, 1994. 200, 204 [4] C. De Prez, Y. De Launoit, R. Kiss, M. Petein, J.-L. Pasteels, and A. Verhest. Computerized morphonuclear cell image analysis of malignant disease in bladder tissues. Journal of Urology, 143:694–699, 1990. 200 [5] K. Hornik. Multilayer feedforward networks are universal approximators. Neural Networks, 2:359–366, 1989. 201 [6] T. Jarkrans, J. Vasko, E. Bengtsson, H.-K. Choi, U. Malmstrom, K. Wester, and C. Busch. Grading of transitional cell bladder carcinoma by image analysis of histological sections. Analytical Cellular Pathology, 18:135–158, 1995. 200, 204 [7] I. E. Jonathan, B. A. Mahul, R.R Victor, and F.K Mostofi. (transitional cell) neoplasms of the urinary bladder. The American Journal of Surgical Pathology, 22(12):1435–1448, 1998. 200 [8] G. D. Magoulas, V. P. Plagianakos, and M. N. Vrahatis. Global learning rate adaptation in on–line neural network training. In Proceedings of the Second International Symposium in Neural Computation May 23–26, Berlin, Germany, 2000. 201 [9] G. D. Magoulas, V. P. Plagianakos, and M. N. Vrahatis. Adaptive stepsize algorithms for on-line training of neural networks. Nonlinear Analysis, T. M. A., 47(5):3425–3430, 2001. 201 [10] G. D. Magoulas, V. P. Plagianakos, and M. N. Vrahatis. Hybrid methods using evolutionary algorithms for on–line training. In INNS–IEEE International Joint Conference on Neural Networks (IJCNN), July 14–19, Washington, D. C., U. S. A., volume 3, pages 2218–2223, 2001. 201 [11] G. D. Magoulas, V. P. Plagianakos, and M. N. Vrahatis. Improved neural networkbased interpretation of colonoscopy images through on-line learning and evolution. In D. Dounias and D. A. Linkens, editors, European Network of Excellence on Intelligent Technologies for Smart Adaptive Systems, pages 38–43. 2001. 201 [12] E. Ooms, W. Anderson, C. Alons, M. Boon, and R. Veldhuizen. Analysis of the performance of pathologists in grading of bladder tumors. Human Pathology, 14:140–143, 1983. 200 [13] S. L. Parker, T. Tony, S. Bolden, and P. A. Wingo. Cancer statistics. Cancer Statistics 1997. CA Cancer J Clin, 47(5):5–27, 1997. 199
[14] V. P. Plagianakos, G. D. Magoulas, and M. N. Vrahatis. Tumor detection in colonoscopic images using hybrid methods for on–line neural network training. In G. M. Papadourakis, editor, Neural Networks and Expert Systems in Medicine and Healthcare (NNESMED), pages 59–64. Technological Educational Institute of Crete, Heraklion, 2001. 201 [15] S. J. Raudys and A. K. Jain. Small sample size effects in statistical pattern recognition: Recommendations for practitioners. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 252–264, 1991. 202 [16] N. N. Schraudolf. Gain adaptation in stochastic gradient descend. 1999. 201 [17] P. Spyridonos, D. Cavouras, P. Ravazoula, and G. Nikiforidis. Neural network based segmentation and classification system for the automatic grading of histological sections of urinary bladder carcinoma. Analytical and Quantitative Cytology and Histology, 24:317–324, 2002. 200, 204 [18] R. S. Suton. Adapting bias by gradient descent: an incremental version of deltabar-delta. In Proc. 10th National Conference on Artificial Intelligence, pages 171–176. MIT Press, 1992. 201 [19] R. S. Suton. Online learning with random representations. In Proc. 10th International Conference on Machine Learning, pages 314–321. Morgan Kaufmann, 1993. 201 [20] H. White. Connectionist nonparametric regression: Multilayer feedforward networks can learn arbitrary mappings. Neural Networks, 3:535–549, 1990. 201
Invariants and Fuzzy Logic

Germano Resconi and Chiara Ratti

Dipartimento di Matematica, Università cattolica, via Trieste 17, 25128 Brescia, Italy
[email protected]
Abstract. In this paper, with the meta-theory based on modal logic, we study the possibility of building invariants in fuzzy logic. Every fuzzy expression is compensated by other expressions so as to obtain a global expression whose value is always true. The global expression is called an invariant because it always assumes the value true even if the single components of the expression are fuzzy and assume values between false and true. Key words: Fuzzy logic, meta-theory, modal logic, invariant, compensation, tautology.
1
Introduction
In classical modal logic, given a proposition p and a world w, p can be true or false. In the meta-theory, we consider several worlds: p is evaluated on a set of worlds understood as a single indivisible entity. If the same set of worlds yields different evaluations, uncertainty arises. In classical logic the evaluation of a tautology is always true, but if the evaluation is done on a set of worlds according to the meta-theory, the same expression may fail to be true. Such a fuzzy expression can be compensated by other expressions. We obtain a global expression whose value is always true. It is an invariant.
2
Meta-theory Based on Modal Logic
In a series of papers (Klir et al., 1994, 1995; Resconi et al., 1992, 1993, 1994, 1996) a meta-theory was developed. The new approach is based on the Kripke model of modal logic. A Kripke model is given by the structure ⟨W, R, V⟩. Resconi et al. (1992-1996) suggested adjoining a function Ψ : W → R, where R is the set of real numbers assigned to the worlds W, in order to obtain the new model

S1 = ⟨W, R, V, Ψ⟩.    (1)

That is, for every world there is an associated real number that is assigned to it. With the model S1 we can build the hierarchical meta-theory, in which we can calculate the expression for the membership function of truthood in fuzzy set theory to verify a given sentence via a computational method based
on {1, 0} values corresponding to the truth values {T, F} assigned to a given sentence as the response of a world, a person, or a sensor, etc. At this point, we should ask: "what are the linkages between the concepts of a population of observers, a population of possible worlds and the algebra of fuzzy subsets of a well defined universe of discourse?" To restate this question, we need to point out that fuzzy sets were introduced for the representation of imprecision in natural languages. However, imprecision means that a word representing an "entity" (temperature, velocity) cannot have a crisp logic evaluation. The meaning of a word in a proposition may usually be evaluated in different ways for different assessments of an entity by different agents, i.e. worlds. An important principle is: "we cannot separate the assessments of an entity without some loss of property in the representation of that entity itself". Different and in some cases conflicting evaluations for the same proposition may come up for the same entity. The presence of conflicting properties within the entity itself is the principal source of the imprecision in the meaning representation of the entity. For example, suppose the entity is a particular temperature of a room, and we ask for the property cold, when we have no instrument to measure that property. The meaning of the entity, "temperature", is composed of assessments that are the opinions of a population of observers that evaluate the predicate "cold". Without the population of observers, and their assessments, we cannot have the entity "temperature" and the predicate "cold". When we move from crisp sets to fuzzy sets, we move from atomic elements, i.e., individual assessments, to non-atomic entities, i.e., aggregate assessments. In an abstract way, the population of the assessments of the entity becomes the population of the worlds with which we associate the set of propositions with crisp logic evaluation.

Perception-based data embedded in a context generate imprecise properties. Imprecise properties are generated by the individual generalization to the other worlds of the evaluation of the properties themselves. For example, when four persons give four different evaluations and each person thinks that his or her evaluation is the right one and the others are wrong, we have in the four worlds a superposition of the individual evaluations, and this generates uncertainty or fuzziness. With perception-based data, as assessments of an entity in context, we evaluate the properties of an entity. The evaluations can be conflicting. In such cases, the world population assessment, composed of individual assessments, gives the context structure of the perception-based data. If we know only the name of a person (entity), we cannot know if s/he is "old". Additional observation-based information that s/he is married, is a manager, and that s/he plays with toys are conflicting pieces of information for the proposition that associates her/him with the predicate "old", etc. The aim of this paper is to show that with a model of perception-based imprecision generated by worlds, i.e., agents, we can on the one hand simplify the definitions of the operations in fuzzy logic and on the other hand expose and explain deeper issues embedded within fuzzy theory.

Consider a sentence such as: "John is tall", where x ∈ X is a person, "John", in a population of people, X, and "tall" is a linguistic term of a linguistic
variable, the height of people. Let the meta-linguistic expression that represents the proposition "John is tall" be written in fuzzy set canonical form as follows:

p_A(x) ::= "x ∈ X isr A",    (2)

where "isr" means "x ∈ X is in relation to a fuzzy information granule A", and p_A(x) is the proposition that relates x to A. Next consider that the imprecise sentence "John is tall" can be modelled by different entities, i.e., different measures of the physical height h. At each height h, let us associate the opinions, based on their perceptions, of a set of observers or sensors that give an evaluation for the sentence "John is tall". Any observer or sensor can be modelled in an abstract way by a world w_k ∈ W, where W is the set of all possible worlds that compose the indivisible entity associated with a particular h, and k is the index, 1, 2, 3, ..., n, associated with any assessment of the entity given by the population of the observers. It should be noted that the world, i.e., the person or the sensor, does not say anything about the qualification, i.e., descriptive gradation, of "John's being tall", but just verifies on the basis of a valuation schema. With these preliminaries, we can then write for short: p_A(x) such that V(p_A(x), w_k) evaluates to T, true. Next let us assign η(w_k, x, A) = 1 if V(p_A(x), w_k) = T, or η(w_k, x, A) = 0 if V(p_A(x), w_k) = F. With this background, we next define the membership expression of truthood of a given atomic sentence in a finite set of worlds w_k ∈ W as follows:

\[
\mu_{P_A}(x) = \frac{\sum_k \eta(w_k, x, A)}{|W(x)|} = \frac{|\{\text{worlds where } p_A(x) \text{ is true}\}|}{|W(x)|}. \qquad (3)
\]

In equation (3), for any value of the variable x we associate the set of worlds W(x) that represent the entity with the opinions, based on their perceptions, of the population of the observers; p_A represents the proposition of the atomic sentence "John is tall", for x = "John" and A = "tall"; and |W| is the cardinality of the set of worlds in our domain of concern. Recall once again that these worlds, w_k ∈ W, may be agents, sensors, persons, etc. Let us define the subset of worlds W_A = {w_k ∈ W | V(p_A(x), w_k) = T}. We can write expression (3) as follows:

\[
\mu_{P_A}(x) = \frac{|W_A|}{|W|}, \qquad (4)
\]

with the understanding that W_A represents the subset of the worlds W where the valuation of p_A(x) is "true" in the Kripke sense. For the special case where the relation R in the Kripke model is w_k R w_k, at any world w_k only itself is accessible (any world w_k is isolated from the others), and the membership expression is computed as the value of Ψ in S1 stated in (1) above. It is computed by the expression Ψ = 1/|W| for any (single) world w in W. Thus we can write expression (3) as follows:

\[
\mu_{P_A}(x) = \sum_k \eta(w_k)\, \Psi_k. \qquad (5)
\]
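A small sketch of equations (3)-(5), assuming the crisp valuations V(p_A(x), w_k) are available as booleans for the worlds W(x) of one entity; the five-world vector is an illustrative assessment.

```python
# Evaluations V(p_A(x), w_k) for the worlds W(x) associated with one entity x.
evaluations = [True, True, False, False, False]   # illustrative assessment of "John is tall"

eta = [1 if v else 0 for v in evaluations]        # eta(w_k, x, A) in (3)
mu = sum(eta) / len(evaluations)                  # |W_A| / |W|, equations (3)-(4)

# Equation (5) with isolated worlds: Psi_k = 1/|W| for every world.
psi = [1.0 / len(evaluations)] * len(evaluations)
mu_weighted = sum(e * p for e, p in zip(eta, psi))

print(mu, mu_weighted)    # both 0.4
```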
3
Transformation by External Agent
Given a set of worlds with a parallel evaluation of the truth value in every world, we can use an operator Γ to change the logic value of the propositions in the worlds. In modal logic we have the possibility (♦) and necessity (□) unary operators that change the logic value of a proposition p. A proposition p in a world w is necessarily true, or □p is true, when p is true in all accessible worlds. A proposition p in a world w is possibly true, or ♦p is true, when p is true in at least one world accessible from w. Alongside these operators, we adjoin a new operator Γ. A proposition p in a world is true under the agent's action, or Γp is true, when there is an agent under whose action p is true. The action of the agent can be neutral, so that in the expression Γp the proposition p does not change its logic value. The action of the agent can also change the logic value of p to false. With the agent operator Γ we can generate any type of t-norm or t-conorm and also non-classical complement operators.

Example 6. Consider five worlds (w1, w2, w3, w4, w5) and two propositions p and q. Let the logic evaluation of p in the five worlds be

V(p) = (True, True, False, False, False)    (7)

and

V(q) = (False, True, True, True, False).    (8)
For (7) and (8) we have µ_p = 2/5 and µ_q = 3/5. The evaluation is V(p OR q) = (True, True, True, True, False), where we use the classical logic operation OR for every world. So we have that

µ_{p OR q} = 4/5 > max(2/5, 3/5) = 3/5,    (9)

but when we introduce an external agent whose action changes the truth values of p in this way, V(Γp) = (False, True, True, False, False), we have that

µ_{Γp OR q} = 3/5 = max(2/5, 3/5) = 3/5.    (10)
The agent has transformed the general fuzzy logic operation OR into the Zadeh max rule for the same operation OR. In conclusion, the agent Γ can move from one fuzzy logic operation to another with the same membership function values for p and q. We know that, given the proposition p and the membership function µ_p, the Zadeh complement is µ_cp = 1 − µ_p. In this case we have that µ_{p AND cp} = min(µ_p, 1 − µ_p), which in general is different from zero, the false value. We know that in Zadeh fuzzy logic the absurd condition is not always false.

Example 11. When we use the world image for the uncertainty, we have evaluation (7) and the evaluation of its complement:
V(cp) = (False, False, True, True, True); we have that V(p AND cp) = (False, False, False, False, False); then V(p AND cp) is always false. In this case the absurd condition is always false. But when an agent changes the positions of the truth values of the expression cp, in this way, if V(Γcp) = (True, True, True, False, False), we have that V(p AND Γcp) = (True, True, False, False, False) and

µ_{p AND Γcp} = min(2/5, 3/5) = 2/5,    (12)
which is different from false. We remark that V(Γcp) = V(cΓp). To have a uniform language with classical logic we write cp = ¬p.
4
Invariants in Fuzzy Logic
We know that in classical logic we have the simple tautology T = p ∨ ¬p
(13)
which is always true for every logic value of p. When we evaluate the previous expression by the worlds and we introduce an external agent that uses the operator Γ to change the logic value of p, we obtain the fuzzy expression of T, denoted T_F:

T_F = p ∨ Γ¬p = p ∨ ¬Γp.    (14)
Because in (14) we introduce a new variable, Γp, expression (14) is not formally equal to (13), so it can also assume a false value when the variable Γp is different from the variable p. In fact, (14) can be written, for every world, in this way:

T_F = Γp → p,    (15)
which is false when Γp is true and p is false. So a tautology in classical logic is not a tautology in fuzzy logic. In fact, with the Zadeh operations we have

µ_{p ∨ cp} = max(µ_p, 1 − µ_p) ≤ 1.    (16)
To find an invariant inside fuzzy logic, we change the variables in expression (13) in this way:

p → p ∨ ¬Γp.    (17)

Because expression (13) is a tautology, it is always true for every substitution of the variable. So we have that

T = (p ∨ ¬Γp) ∨ ¬(p ∨ ¬Γp)    (18)

is always true.
So T can be separated into two parts: one is the expression (14) and the other is the compensation part. In conclusion, by the De Morgan rule we have

(p ∨ ¬Γp) ∨ (¬p ∧ Γp).    (19)
Example 20. When we use the evaluation in Example 11, we have V(p) = (True, True, False, False, False), V(¬Γp) = (True, True, True, False, False), V(Γp) = (False, False, False, True, True), V(¬p) = (False, False, True, True, True). So we have V(p ∨ ¬Γp) = (True, True, True, False, False) and V(¬p ∧ Γp) = (False, False, False, True, True). When we come back to the original fuzzy logic with the Zadeh operations we obtain

max[µ_p, (1 − µ_p)] + min[(1 − µ_p), µ_p] = 1,    (21)
where max[µ_p, (1 − µ_p)] is associated with the expression V(p ∨ ¬Γp) and min[(1 − µ_p), µ_p] with V(¬p ∧ Γp). The term min[(1 − µ_p), µ_p] is the compensation term. We can extend the same process to other tautologies and in this way find new invariants in fuzzy logic.
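The invariant (21) can be checked mechanically. The sketch below re-uses the five-world evaluations of Examples 6 and 20 (the particular agent action Γp is the assumption made there), evaluates the two parts of (19) world by world, and confirms that their Zadeh memberships sum to 1.

```python
# Five-world evaluations from Examples 6 and 20.
V_p       = [True, True, False, False, False]
V_Gamma_p = [False, False, False, True, True]    # one particular agent action on p

V_not_Gamma_p = [not v for v in V_Gamma_p]
V_not_p       = [not v for v in V_p]

part1 = [a or b  for a, b in zip(V_p, V_not_Gamma_p)]   # p OR not-Gamma-p
part2 = [a and b for a, b in zip(V_not_p, V_Gamma_p)]   # not-p AND Gamma-p (compensation)

mu = lambda vec: sum(vec) / len(vec)
mu_p = mu(V_p)

print(part1, part2)
print(mu(part1), "=", max(mu_p, 1 - mu_p))       # 0.6 = 0.6
print(mu(part2), "=", min(1 - mu_p, mu_p))       # 0.4 = 0.4
print(mu(part1) + mu(part2))                     # 1.0 -> the invariant (21)
```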
5
Conclusion
In fuzzy logic, given a proposition p, we have a fractional evaluation of p. Then we can study t-norms, t-conorms and complements as general operations. In classical logic we have general properties of the operations, which are always true for every evaluation (tautologies). In fuzzy logic, with invariants, we want to introduce quantities that do not change if the single evaluation of p changes. In this way, invariants give a general formalization to fuzzy operations. The idea is to build a formal structure for fuzzy operations and to obtain a global expression that is always true even if every component is fuzzy.
References

[1] G. Resconi, G.J. Klir and U. St. Clair: Hierarchical Uncertainty Metatheory Based Upon Modal Logic. Int. J. of General Systems, Vol. 21, pp. 23-50, 1992.
[2] G. Resconi and I.B. Turksen: Canonical Forms of Fuzzy Truthoods by Meta-theory Based Upon Modal Logic. Information Sciences 131 (2001) pp. 157-194.
[3] G. Resconi, T. Murai: Field Theory and Modal Logic by Semantic Field to Make Uncertainty Emerge from Information. Int. J. General Systems, 2000.
[4] G. Klir and Bo Yuan: Fuzzy Sets and Fuzzy Logic. Prentice Hall, 1995.
[5] T. Murai, M. Nakata and M. Shimbo: Ambiguity, Inconsistency, and Possible-Worlds: A New Logical Approach. Ninth Conference on Intelligence Technologies in Human-Related Sciences, Leon, Spain, pp. 177-184, 1996.
[6] L.A. Zadeh: Fuzzy Sets. Information and Control, Vol. 8, pp. 87-100, 1965.
Newsvendor Problems Based on Possibility Theory

Peijun Guo

Faculty of Economics, Kagawa University, Takamatsu, Kagawa 760-8523, Japan
[email protected]
Abstract. In this paper, the market uncertainty of new products with short life cycles is characterized by possibility distributions. Two possibilistic models for the newsvendor problem are proposed, one based on an optimistic criterion and the other on a pessimistic criterion, to reflect different preferences for risk in such one-shot decision problems. These models are very different from the conventional newsvendor problem based on a probability distribution, in which maximizing expected utility is the goal.
1
Introduction
The newsvendor problem, also known as the newsboy or single-period problem, is a well-known inventory management problem. In general, the newsvendor problem has the following characteristics. Prior to the season, the buyer must decide how many units of goods to purchase. The procurement lead-time tends to be quite long relative to the selling season, so the buyer cannot observe demand prior to placing the order. Due to the long lead-time, there is no opportunity to replenish inventory once the season has begun. Excess products cannot be sold (or only at a trivial price) after the season. As is well known, the newsvendor problem derives its name from the common problem faced by a person selling newspapers on the street. Interest in this problem has increased over the past 40 years, partially because of the increased dominance of service industries, for which the newsboy problem is very applicable in both retailing and service organizations. Also, the reduction in product life cycles makes the newsboy problem more relevant. Many extensions have been made in the last decade, such as different objectives and utility functions, different supplier pricing policies, and different newsvendor pricing policies [11,17]. Almost all of these extensions have been made in the probabilistic framework, that is, the uncertainty of demand and supply is characterized by a probability distribution, and the objective is to maximize the expected profit or a probability measure of achieving a target profit. However, some papers [3,8,9,10,13,14,15,18] have dealt with inventory problems using fuzzy set theory. There are few papers dealing with the uncertainty in newsvendor problems by fuzzy methods. Buckley [1] used a possibility distribution to represent a decision-maker's linguistic expression of demand, such as "good" and "not good" etc., and introduced a fuzzy goal to express the decision-maker's satisfaction. The order quantity was obtained
based on possibility and necessity measures, so that the possibility of not achieving the fuzzy goal was sufficiently low and the possibility of achieving it was sufficiently high. Ishii et al. [6] investigated a fuzzy newsboy problem in which the shortage cost was vague and expressed as a fuzzy number; an optimal order quantity was obtained by a fuzzy maximal order. Petrovic et al. [16] gave a fuzzy newsboy model where uncertain demand was represented by a group of fuzzy sets and inventory cost by a fuzzy number, and a defuzzification method was used to obtain an optimal order quantity. Kao et al. [7] obtained the optimal order quantity minimizing the fuzzy cost by comparing the areas of fuzzy numbers. Li et al. [12] proposed two models: in one the demand was probabilistic while the cost components were fuzzy, and in the other the costs were deterministic but the demand was fuzzy; the profit was maximized by ordering fuzzy numbers with respect to their total integral values. Guo et al. [4,5] proposed some new newsboy models with possibilistic information. In this paper, new newsboy models are proposed, with emphasis on new products with short life cycles, such as fashion goods and seasonal presents, for which no data are available for statistical analysis to predict the coming demand. The uncertainty of demand is characterized by a possibility distribution, whose possibility degrees capture the potential of the market as determined by the matching degree between customer needs and product features, obtainable from a detailed market investigation before the selling season. In other words, uncertain demand is not described by linguistic labels such as "good" or "better" that only reflect the subjective judgment of the decision-maker. It should also be noted that the newsvendor problem is a typical one-shot decision problem, so that maximizing an expected profit or a probability measure seems less meaningful; here the optimal order is determined by maximizing the optimistic and pessimistic values of the order quantity, instead of maximizing a mean value as in probabilistic models or ranking fuzzy numbers as in fuzzy models. Finally, a general model is given in which the possibility distributions and utility functions are not restricted to specified functional forms, so that a general analysis and general conclusions can be made.
2 Newsvendor Model Based on Possibility Theory
Consider a retailer who sells a short-life-cycle, single-period new product. The retailer orders q units before the season at the unit wholesale price W. When demand x is observed, the retailer sells units (limited by the supply q and the demand x) at unit revenue R, with R > W. Any excess units can be salvaged at the unit salvage price S_o, with W > S_o. In case of shortage, the lost-chance price is S_u. The profit function of the retailer is as follows:

r(x, q) = R x + S_o (q - x) - W q,  if x < q;
r(x, q) = (R - W) q - S_u (x - q),  if x >= q.   (1)
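The piecewise profit (1) translates directly into code. A minimal sketch follows; the parameter names mirror the text, and the function is an illustration rather than part of the paper's models.

```python
def profit(x, q, R, W, S_o, S_u):
    """Retailer profit r(x, q) for demand x and order quantity q, eq. (1).

    R: unit revenue, W: unit wholesale price,
    S_o: unit salvage price for excess units (W > S_o),
    S_u: unit lost-chance price in case of shortage.
    """
    if x < q:                                  # over-ordering: salvage the excess
        return R * x + S_o * (q - x) - W * q
    return (R - W) * q - S_u * (x - q)         # shortage: lost-chance penalty
```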
Fig. 1. The possibility distribution of demand
The plausible information about demand x is represented by a possibility distribution. The possibility distribution of x, denoted pi_D(x), is defined by a continuous function

pi_D : [d_l, d_u] -> [0, 1]   (2)

with some d_c in [d_l, d_u] such that pi_D(d_c) = 1, pi_D(d_l) = 0 and pi_D(d_u) = 0; pi_D increases on [d_l, d_c] and decreases on [d_c, d_u]. Here d_l and d_u are the lower and upper bounds of demand, respectively, and d_c is the most possible amount of demand, as shown in Fig. 1. Because demand lies inside the interval [d_l, d_u], a reasonable supply should also lie inside this region. The highest profit of the retailer is r_u = (R - W) d_u, obtained when the retailer orders the maximum d_u and the demand is also the largest value d_u. The lowest profit is r_l = (d_l R + (d_u - d_l) S_o - d_u W) ∧ (d_l R - (d_u - d_l) S_u - d_l W), the minimum of two cases: the retailer orders the most but the demand is the lowest, or the retailer orders the least but the demand is the highest. Without loss of generality, the assumption W >= S_o + S_u is made, which leads to r_l = d_l R + (d_u - d_l) S_o - d_u W.

Definition 1. The utility function of the retailer is a continuous, strictly increasing function of the profit r,

U : [r_l, r_u] -> [0, 1]   (3)

where U(r_l) = 0 and U(r_u) = 1. Equation (3) gives a general form of the decision-maker's utility function, in which the utility of the lowest profit is 0 and the utility of the highest profit is 1.

Definition 2. The optimistic value of supply q, denoted V_o(q), is defined as follows:
Fig. 2. The optimistic value of supply q
V_o(q) = max_x min(pi_D(x), U(r(x, q)))   (4)

It can be seen that V_o(q) is similar to the possibility measure of a fuzzy event, as illustrated by Fig. 2, where U(r(x, q)) plays the role of the fuzzy membership function.

Definition 3. The pessimistic value of supply q, denoted V_p(q), is defined as follows:

V_p(q) = min_x max(1 - pi_D(x), U(r(x, q)))   (5)

It can be seen that V_p(q) is similar to the necessity measure of a fuzzy event, as illustrated by Fig. 3, where U(r(x, q)) again plays the role of the fuzzy membership function. The retailer should choose an optimal order quantity that maximizes V_o(q) or V_p(q), that is,

q_o* = arg max_q V_o(q)   (6)

q_p* = arg max_q V_p(q)   (7)

where q_o* and q_p* are called the optimistic and pessimistic order quantities, respectively. An order quantity q is given a high evaluation by the optimistic criterion (4) if it can lead to a high utility with a high possibility; conversely, it is given a low evaluation by the pessimistic criterion (5) if it can lead to a low utility with a high possibility. The possibility-theory-based approach thus makes decisions that balance plausibility and satisfaction. The optimistic and pessimistic criteria were initially proposed by Yager [20] and Whalen [19], respectively, and have been axiomatized in the style of Savage by Dubois, Prade and Sabbadin [2].
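Criteria (4)-(7) can be evaluated numerically. The sketch below assumes a triangular possibility distribution and a linear utility; all numbers are illustrative (not taken from the paper), and the grid search only stands in for an exact optimization.

```python
import numpy as np

# Illustrative parameters (hypothetical); the assumption W >= S_o + S_u holds.
R, W, S_o, S_u = 10.0, 6.0, 2.0, 3.0
d_l, d_c, d_u = 100.0, 150.0, 220.0

def r(x, q):  # profit function (1)
    return R*x + S_o*(q - x) - W*q if x < q else (R - W)*q - S_u*(x - q)

pi_D = lambda x: np.interp(x, [d_l, d_c, d_u], [0.0, 1.0, 0.0])   # triangular possibility
r_u = (R - W) * d_u
r_l = d_l*R + (d_u - d_l)*S_o - d_u*W
U = lambda v: (v - r_l) / (r_u - r_l)                             # linear utility (3)

def V_o(q, xs):   # optimistic value, eq. (4)
    return max(min(pi_D(x), U(r(x, q))) for x in xs)

def V_p(q, xs):   # pessimistic value, eq. (5)
    return min(max(1.0 - pi_D(x), U(r(x, q))) for x in xs)

xs = np.linspace(d_l, d_u, 400)
qs = np.linspace(d_l, d_u, 400)
q_opt  = max(qs, key=lambda q: V_o(q, xs))   # optimistic order quantity, eq. (6)
q_pess = max(qs, key=lambda q: V_p(q, xs))   # pessimistic order quantity, eq. (7)
```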
Fig. 3. The pessimistic value of supply q
Lemma 1. In the case pi_D(q) >= U(r(q, q)),

arg max_{q in [d_l, d_u]} ( max_{x in [d_l, d_u]} min(pi_D(x), U(r(x, q))) ) = q_o*,

where q_o* is the solution of the equation pi_D(q) = U(r(q, q)) with q > d_c.

Proof. Considering (1), max_x U(r(x, q)) = U(r(q, q)) holds. Then min(pi_D(x), U(r(x, q))) <= U(r(x, q)) <= U(r(q, q)) holds for any x in [d_l, d_u]. The condition pi_D(q) >= U(r(q, q)) makes min(pi_D(q), U(r(q, q))) = U(r(q, q)) hold, so that V_o(q) = max_{x in [d_l, d_u]} min(pi_D(x), U(r(x, q))) = U(r(q, q)). Because U(r(q, q)) is a strictly increasing function of q, maximizing V_o(q) pushes q upward, so q > d_c holds. For q in [d_c, d_u], increasing q makes U(r(q, q)) increase and pi_D(q) decrease. Taking into account the condition pi_D(q) >= U(r(q, q)), V_o(q) reaches its maximum when pi_D(q) = U(r(q, q)) holds; the optimal q is denoted q_o*.

Lemma 2. In the case pi_D(q) < U(r(q, q)), max_{x in [d_l, d_u]} min(pi_D(x), U(r(x, q))) < U(r(q_o*, q_o*)) holds for any q < d_c.

Lemma 3. pi_D(q) >= U(r(q, q)) holds for d_c < q < q_o*.

Lemma 4. In the case pi_D(q) < U(r(q, q)), max_{x in [d_l, d_u]} min(pi_D(x), U(r(x, q))) < U(r(q_o*, q_o*)) holds for any q > q_o*.
Theorem 1. q_o* is the solution of the following equation:

pi_D(q) = U((R - W) q),  where q in [d_c, d_u].   (8)
Proof. The relation between pi_D(x) and U(r(x, q)) can be divided into three cases: Case 1, pi_D(q) >= U(r(q, q)); Case 2, pi_D(q) < U(r(q, q)) and q < d_c; and Case 3, pi_D(q) < U(r(q, q)) and q > d_c. Note that the case pi_D(q) < U(r(q, q)) with q = d_c cannot occur, because 1 = pi_D(d_c) > U(r(d_c, d_c)). It is straightforward from Lemmas 1-4 that the optimal order quantity is the solution of the equation U(r(q, q)) = pi_D(q) with q in [d_c, d_u]. Considering (1), pi_D(q) = U((R - W) q) is obtained. Equation (8) can easily be solved by the Newton method under the condition q in [d_c, d_u]. Theorem 1 shows that the optimal order quantity obtained from the optimistic criterion depends only on the revenue and the wholesale price, which means that the retailer is fully confident of selling whatever he orders.

Theorem 2. The pessimistic order q_p* is the solution of the following equation:

U(r(d_pl*, q)) = U(r(d_pu*, q))   (9)

where d_pl* and d_pu* are the horizontal coordinates of the intersections of U(r(x, q)) and 1 - pi_D(x) within [d_l, min(q, d_c)] and [max(q, d_c), d_u], respectively.
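Equation (8) is a one-dimensional root-finding problem. A bisection sketch is shown below (the Newton method mentioned in the proof would work equally well); it assumes the same triangular pi_D and linear U as in the earlier illustrative sketch and is not part of the paper itself.

```python
def solve_optimistic_order(pi_D, U, R, W, d_c, d_u, tol=1e-6):
    """Solve pi_D(q) = U((R - W) * q) on [d_c, d_u] by bisection, eq. (8).

    On [d_c, d_u] the left side decreases from 1 to 0 while the right side
    increases, so g changes sign exactly once under these assumptions.
    """
    g = lambda q: pi_D(q) - U((R - W) * q)
    lo, hi = d_c, d_u
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Example call, reusing the names defined in the earlier sketch:
# q_star = solve_optimistic_order(pi_D, U, R, W, d_c, d_u)
```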
3 Conclusions
In this paper, the uncertainty of the market for new products with a short life cycle is characterized by possibility degrees that capture the potential of the market as determined by the matching degree between customer needs and product features. Two possibilistic models for the newsvendor problem are proposed, one based on an optimistic criterion and the other on a pessimistic criterion, to reflect different risk preferences in such a one-shot decision problem; both differ from the conventional newsboy problem of maximizing expected utility. The optimal order quantity based on the optimistic criterion depends only on the revenue and the wholesale price, which means that the retailer is fully confident of selling what he has ordered. On the contrary, the optimal order quantity based on the pessimistic criterion requires all the market information, reflecting a more conservative attitude. Because the newsvendor problem is a typical one-shot decision problem and the uncertainty of the coming market is easily and reasonably characterized by a possibility distribution, the proposed models can be expected to support suitable decisions for the newsvendor problem.
References
[1] Buckley, J.J., Possibility and necessity in optimization, Fuzzy Sets and Systems 25 (1988) 1-13.
[2] Dubois, D., Prade, H. and Sabbadin, R., Decision-theoretic foundations of possibility theory, European Journal of Operational Research 128 (2001) 459-478.
[3] Gen, M., Tsujimura, Y. and Zheng, D., An application of fuzzy set theory to inventory control models, Computers Ind. Engng 33 (1997) 553-556.
[4] Guo, P., The newsboy problem based on possibility theory, Proceedings of the 2001 Fall National Conference of the Operations Research Society of Japan (2001) 96-97.
[5] Guo, P. and Chen, Y., Possibility approach to newsboy problem, Proceedings of the First International Conference on Electronic Business (2001) 385-386.
[6] Ishii, H. and Konno, T., A stochastic inventory problem with fuzzy shortage cost, European Journal of Operational Research 106 (1998) 90-94.
[7] Kao, C. and Hsu, W., A single-period inventory model with fuzzy demand, Computers and Mathematics with Applications 43 (2002) 841-848.
[8] Katagiri, H. and Ishii, H., Some inventory problems with fuzzy shortage cost, Fuzzy Sets and Systems 111 (2000) 87-97.
[9] Katagiri, H. and Ishii, H., Fuzzy inventory problem for perishable commodities, European Journal of Operational Research 138 (2002) 545-553.
[10] Lee, H. and Yao, J., Economic order quantity in fuzzy sense for inventory without backorder model, Fuzzy Sets and Systems 105 (1999) 13-31.
[11] Lippman, S.A. and McCardle, K.F., The competitive newsboy, Operations Research 45 (1997) 54-65.
[12] Li, L., Kabadi, S.N. and Nair, K.P.K., Fuzzy models for single-period inventory problem, Fuzzy Sets and Systems (in press).
[13] Lin, D. and Yao, J., Fuzzy economic production for inventory, Fuzzy Sets and Systems 111 (2000) 465-495.
[14] Kacpryzk, P. and Staniewski, P., Long-term inventory policy-making through fuzzy decision-making models, Fuzzy Sets and Systems 8 (1982) 117-132.
[15] Park, K.S., Fuzzy set theoretic interpretation of economic order quantity, IEEE Trans. Systems Man Cybernet. SMC 17(6) (1996) 1082-1084.
[16] Petrovic, D., Petrovic, R. and Vujosevic, M., Fuzzy model for newsboy problem, Internat. J. Prod. Econom. 45 (1996) 435-441.
[17] Porteus, E.L., Stochastic Inventory Theory, in: Heyman, D.P. and Sobel, M.J. (eds.), Handbooks in OR and MS, Vol. 2, Elsevier Science Publishers, 605-652.
[18] Roy, T.K. and Maiti, M., A fuzzy EOQ model with demand-dependent unit cost under limited storage capacity, European Journal of Operational Research 99 (1997) 425-432.
[19] Whalen, T., Decision making under uncertainty with various assumptions about available information, IEEE Transactions on Systems, Man and Cybernetics 14 (1984) 888-900.
[20] Yager, R.R., Possibilistic decision making, IEEE Transactions on Systems, Man and Cybernetics 9 (1979) 388-392.
Uncertainty Management in Rule Based Systems Application to Maneuvers Recognition T. Benouhiba, and J. M. Nigro Université de Technologie de Troyes 12 rue Marie Curie, BP 2060 10010 Troyes Cedex, France {toufik.benouhiba, nigro}@utt.fr
Abstract. In this paper we study uncertainty management in expert systems. This task is very important, especially when noisy data are used, as in the CASSICE project presented below; a classical expert system would generally produce mediocre results in this case. We investigate uncertainty management using the Dempster-Shafer theory of evidence and discuss the benefits and disadvantages of this approach by comparing the results obtained with those of an expert system based on fuzzy logic. Keywords: Expert system, evidence theory, uncertain reasoning.
1 Introduction
The expert system technique is a widely used approach in Artificial Intelligence; it can easily model expert knowledge in a well-known field. Expert systems have been used successfully in diagnostics, failure detection, recognition, etc. Nevertheless, they suffer from several problems. On the one hand, they are closed systems, i.e., they cannot react positively when the environment changes. On the other hand, they can neither perform uncertain reasoning nor handle noisy or vague data. Since the first versions of MYCIN [16], many theoretical works have studied uncertainty in expert systems, using for example probabilities [7] and fuzzy logic [18]. However, another promising method has not been as well studied: the theory of evidence. It offers a mathematical framework for subjectivity and total ignorance but does not manage uncertainty directly. In this paper we study the application of this theory to expert systems by combining it with some concepts of fuzzy logic, and we present a real application of this approach to the recognition of driving maneuvers within the framework of the CASSICE project [12]. This paper is organized as follows. Section 2 briefly describes the CASSICE project. Section 3 presents the Dempster-Shafer theory of evidence. Section 4 shows how this theory can be used to perform uncertain reasoning in expert systems. Section 5 presents an application example of the proposed approach in the CASSICE project and discusses the results obtained by comparing them with those of another expert system based on fuzzy logic. Finally, we give the limitations and perspectives of this work.
2 The CASSICE Project
The CASSICE project aims to help psychologists understand a driver's behavior in order to model it. The final goal is to improve the comfort and safety of the driver. The concrete objective of CASSICE is to build a computer system that indexes all real driving situations, allowing psychologists to efficiently search for driving situations in a given context via a multi-criteria search interface. The project uses a car equipped with a set of proprioceptive sensors and cameras. Until now, the data used in the recognition come from a simulator. Table 1 shows the data types used.

Table 1. Simulator data types
Data    Signification
Time    Clock (second)
Acc     Acceleration of EV compared to TV (TV: target vehicle; EV: experimental vehicle)
Phi     EV front wheels angle
Rd      EV distance from the right border
Rg      EV distance from the left border
Teta    Angle with TV
V       Speed of EV compared to TV
X, Y    Position of TV compared to EV (x and y axes)
It should be noted that two classical expert systems have already been developed. The first one, called IDRES [13], performs an exhaustive recognition of states (a maneuver is considered to be composed of a succession of several states); a maneuver is recognized when all the states that compose it are detected in the right order. The second system, called DRSC [11], considers a maneuver as an automaton; a maneuver is recognized when the sensor data enable the recognition process to pass from the initial state to the last state of the automaton. In this paper we limit the recognition to the overtaking maneuver, which we consider to be composed of ten successive states: wait to overtake, overtaking intention, beginning of the lane change to the left, crossing the left discontinuous line, end of the lane change to the left, passing, end of passing, beginning of the lane change to the right, crossing the right discontinuous line, and end of the lane change to the right.
3 Uncertainty Management

3.1 Evidence Theory Elements
This theory can be regarded as a general extension of the Bayesian theory. It can deal with total ignorance and with the subjectivity of experts by combining many points of view; the last point is the most important, because it enables us to manage several data uncertainties at the same time. The evidence theory originates in the work of
Dempster on the theory of probabilities with upper and lower bounds [2]. It was concretely formulated by his student Shafer in 1976 in the book "A Mathematical Theory of Evidence" [15]. Let Ω be a finite set of hypotheses. An evidence mass m is a function defined on 2^Ω such that Σ_{A ⊆ Ω} m(A) = 1; m(A) represents the exact belief in A. Any subset of Ω for which the evidence mass is non-zero is called a focal element. If Ω = {a, b}, then m({a, b}) measures the fact that one is indifferent between a and b; in other words, it is an ignorance measure. This theory is a generalization of probability theory: if all focal elements are singleton sets, then this measure is a probability [16]. Many other measures are defined in this theory, but the most used are the belief (bel) and plausibility (pl) measures:

bel(A) = Σ_{B ⊆ A} m(B),   pl(A) = Σ_{A ∩ B ≠ ∅} m(B)   (1)

The belief represents the confidence that the truth lies in A or in any subset of A, whereas the plausibility represents the extent to which we fail to disbelieve A. These two measures are, in fact, the lower and upper bounds of the probability of A. The Dempster combination operator is one of the most important elements of the evidence theory. It makes it possible to combine several evidence masses (several points of view) about the same evidence. Let m1 and m2 be two evidence masses; the combination of the two points of view gives:

m1 ⊕ m2 (C) = 0, if C = ∅;
m1 ⊕ m2 (C) = ( Σ_{A ∩ B = C} m1(A) m2(B) ) / ( 1 − Σ_{A ∩ B = ∅} m1(A) m2(B) ), otherwise.   (2)
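Dempster's rule (2) is straightforward to implement over a finite frame. The sketch below is a generic illustration (not the system described in the paper), with focal elements represented as frozensets.

```python
from itertools import product

def dempster_combine(m1, m2):
    """Dempster's rule of combination, eq. (2).

    m1, m2: dicts mapping focal elements (frozensets) to masses summing to 1.
    Returns the combined mass; raises if the two masses are totally conflicting.
    """
    combined, conflict = {}, 0.0
    for (a, ma), (b, mb) in product(m1.items(), m2.items()):
        c = a & b
        if c:
            combined[c] = combined.get(c, 0.0) + ma * mb
        else:
            conflict += ma * mb                      # mass sent to the empty set
    if conflict >= 1.0:
        raise ValueError("total conflict: combination undefined")
    return {c: v / (1.0 - conflict) for c, v in combined.items()}

def belief(m, A):        # bel(A): sum of m(B) over B subset of A, eq. (1)
    return sum(v for B, v in m.items() if B <= A)

def plausibility(m, A):  # pl(A): sum of m(B) over B intersecting A, eq. (1)
    return sum(v for B, v in m.items() if B & A)

# Example on the frame {a, b}: two sources about the same evidence
m1 = {frozenset({'a'}): 0.6, frozenset({'a', 'b'}): 0.4}
m2 = {frozenset({'a'}): 0.5, frozenset({'b'}): 0.2, frozenset({'a', 'b'}): 0.3}
m12 = dempster_combine(m1, m2)
```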
3.2 Adaptation of the Evidence Theory to Expert Systems
The evidence theory has been successfully used in many fields, such as medical imaging [4], defect classification by imaging [6], robot navigation systems [10][17] and decision making [2][3]. In general, this theory has excelled in all fields related to data fusion. However, relatively little work has been done to apply it to expert systems, probably because of the combination operator: it cannot directly combine evidence masses associated with different variables. Some theoretical works have tried to resolve this problem; in this paper we use the method presented in [14]. In the following formulae, A and B are two distinct facts, m1 is the evidence mass associated with A, m2 is the evidence mass associated with B, and m is the evidence mass associated with the fact (A and B):

m(A ∩ B) = m1(A) · m2(B)   (3)

m(¬(A ∩ B)) = m1(¬A) · m2(B) + m1(A) · m2(¬B) + m1(¬A) · m2(¬B) + m1(¬A) · m2(B ∪ ¬B) + m1(A ∪ ¬A) · m2(¬B)   (4)

m((A ∩ B) ∪ ¬(A ∩ B)) = m1(A) · m2(B ∪ ¬B) + m1(A ∪ ¬A) · m2(B) + m1(A ∪ ¬A) · m2(B ∪ ¬B)   (5)
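Equations (3)-(5) distribute the nine cross-products of the two marginal masses over "true", "false" and "unknown" for the conjunction. A small sketch of this reading follows; the dictionary keys and the example numbers are illustrative assumptions, not taken from the paper.

```python
def conjunction_mass(m1, m2):
    """Mass of the fact (A and B) from the masses of A and of B, eqs. (3)-(5).

    m1, m2: dicts with keys 'T', 'F', 'U' for "true", "false" and
    "unknown" (i.e. A or not-A), each summing to 1.
    """
    t = m1['T'] * m2['T']                                          # eq. (3)
    f = (m1['F'] * m2['T'] + m1['T'] * m2['F'] + m1['F'] * m2['F']
         + m1['F'] * m2['U'] + m1['U'] * m2['F'])                  # eq. (4)
    u = m1['T'] * m2['U'] + m1['U'] * m2['T'] + m1['U'] * m2['U']  # eq. (5)
    return {'T': t, 'F': f, 'U': u}

# Example: A is quite certainly true, B is uncertain
mA = {'T': 0.8, 'F': 0.1, 'U': 0.1}
mB = {'T': 0.5, 'F': 0.2, 'U': 0.3}
mAB = conjunction_mass(mA, mB)   # the three masses sum to 1
```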
4 Uncertain Reasoning

4.1 Evidence Masses Generating
In order to generate evidence masses, we consider each input datum as an intuitionistic number (see intuitionistic fuzzy sets [1]). The input datum gives the center of this number, while the choice of its parameters depends on the scale and the variations of the data. A condition is a conjunction of simple conditions, each comparing a datum either with a crisp value or with another datum, where the comparison operator belongs to the set {=, >, <, ≥, ≤, ≠}. The second form can easily be transformed into the first by using operations on intuitionistic numbers. We associate with every pair (value, operator) an intuitionistic set that has nearly the same form as an intuitionistic number; e.g., for the operator "= n" the belonging function is µ_{n,op} = exp(−(x − n)² / α²) whereas the non-belonging function is ν_{n,op} = 1 − exp(−(x − n)² / β²), with α ≤ β. To compute evidence masses, we use the Zadeh combination principle:

m(True) = m(T) = sup inf(µ, µ_{n,op}),   m(False) = m(F) = inf max(ν, ν_{n,op})   (6)

where µ and ν are, respectively, the belonging and non-belonging functions of the intuitionistic number to be compared. The computed evidence mass is then combined with another one if the input datum is given with an evidence mass m' such that m'(T) < 1.
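One possible numerical reading of eq. (6) is sketched below. The grid, the spread parameters and the assignment of the leftover mass to ignorance are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np

def evidence_mass(value, n, alpha_x=1.0, beta_x=1.5, alpha_c=1.0, beta_c=1.5):
    """Evidence mass of the simple condition "x = n", eq. (6).

    The datum is an intuitionistic number centred on `value`; the condition
    corresponds to the pair (n, '='). alpha/beta control the spread, with
    alpha <= beta; all numeric choices here are illustrative.
    """
    xs = np.linspace(min(value, n) - 5 * beta_x, max(value, n) + 5 * beta_x, 2000)
    mu_x = np.exp(-((xs - value) ** 2) / alpha_x ** 2)        # belonging of the datum
    nu_x = 1.0 - np.exp(-((xs - value) ** 2) / beta_x ** 2)   # non-belonging of the datum
    mu_c = np.exp(-((xs - n) ** 2) / alpha_c ** 2)            # belonging of "= n"
    nu_c = 1.0 - np.exp(-((xs - n) ** 2) / beta_c ** 2)       # non-belonging of "= n"
    m_T = np.max(np.minimum(mu_x, mu_c))   # m(True)  = sup inf(mu, mu_{n,op})
    m_F = np.min(np.maximum(nu_x, nu_c))   # m(False) = inf max(nu, nu_{n,op})
    m_U = max(0.0, 1.0 - m_T - m_F)        # leftover mass treated as ignorance
    return {'T': float(m_T), 'F': float(m_F), 'U': float(m_U)}
```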
4.2 Inference System
When the evidence masses of all simple conditions have been computed, we can calculate the evidence mass of the conjunction using the Dempster operator: if S = A1 ∧ A2 ∧ ... ∧ An, then m_S = ⊕_i m_{Ai}. The inference is made by applying the modus ponens rule and comparing m_S(T) with a threshold θ: if m_S(T) ≥ θ, the rule is activated and all facts inserted by this rule are associated with the evidence mass m_S. In this paper we
use the CLIPS tool [5], which does not offer any direct means of managing uncertainty; in particular, it offers very few possibilities for performing calculations in the left-hand side of a rule. To solve this problem, each premise that contains a comparison between data is transferred to the right-hand side of the rule, where we can then decide whether the original rule should be fired. Concretely, the classical rule on the left is rewritten as shown on the right:

Rule: Wait to overtake {sup, inf and val are parameters}
  If (y < sup) and (y > inf) and (x < val)
  Then State = «Wait to overtake»

Rule: Wait to overtake {sup, inf, val and threshold are parameters}
  If (true)
  Then E = M(y < sup) ⊕ M(y > inf) ⊕ M(x < val)
       If E(T) > threshold then State = («Wait to overtake», E)

However, another problem arises: useless rules can be fired. In reality, this problem cannot be avoided because of the overhead introduced by the uncertainty management, but it can be partially alleviated by using a threshold. Even with a dedicated tool for writing our expert system this problem would remain, but the code of the system would be more readable and easier to maintain.
5 Experience and Results

In order to perform driving maneuver recognition, a few considerations have to be mentioned. The parameters are chosen according to the scale and the variations of the data. The flexibility of the system depends on the choice of thresholds: if the threshold is large, the system behaves like a classical expert system; if, on the contrary, it is too small, the search tends to become random.

5.1 Results
The developed system enables us to trace the graphs of the variation of evidence masses for a given state through time (see paragraph 2). For example, figure 1 shows the graphs of the variations of the evidence masses associated to the “true” value (m(T)) for the two states: “overtaking intention” (continuous line) and “crossing the left discontinuous line” (dotted line):
Fig. 1. Variations of the evidence masses of two states
Figure 2 shows now the graph of the variations of the evidence mass associated to “true” for the whole maneuver:
Fig. 2. Evidence mass variations for the overtaking maneuver
Numbers on the graph refer to the rank of the recognized state within the maneuver. The evidence mass associated with the "true" value is a decreasing function that sometimes drops abruptly to zero. This can be explained by the fact that we are then in a total-ignorance situation, i.e., we cannot decide which state is occurring (unknown state). We also remark that some states do not appear on the graph, because these states are not indispensable to the maneuver recognition.

5.2 A Recognition System Based on Fuzzy Logic
We developed a second expert system based on fuzzy logic in order to better evaluate the performance of the first one. Here we used a specific tool, FUZZYCLIPS [8], an extension of CLIPS that manages uncertainty using fuzzy logic. In this system the input data are transformed into fuzzy numbers [9]; this conversion uses several parameters, like the first system. Figure 3 shows the variations of the certainty factors (CF) of the two states "overtaking intention" (continuous line) and "crossing the left discontinuous line" (dotted line), while Figure 4 shows the variations of the CF of the whole maneuver:
Fig. 3. The CF variations of two states
Fig. 4. The CF variations of the maneuver
The cuts in Figure 4 can be explained by the fact that the system cannot find the current state, i.e., these are total-ignorance situations. Fuzzy logic cannot manage such situations because it has no measure with which to quantify ignorance.

5.3 Discussion
The results obtained by the two systems are roughly similar. Both can handle imprecise data, but each uses its own method: the first system uses intuitionistic fuzzy sets, which enable it to generate evidence masses, while the second uses vague predicates and possibility and necessity calculations. In addition, thanks to the evidence theory, the first system can perform data fusion: for example, if a rule's right-hand side inserts numeric data about a given variable into the facts base, a first firing of the rule inserts the data while a second firing merges the new data with the data already inserted. Both systems let us see when the car gets ready to enter or leave a given state, which makes it easier to understand the progress of the maneuver over time. The rules of the second system are easier to write because of the use of vague predicates. In the first system, rules are neither easy to read nor easy to design, because CLIPS does not directly support uncertainty management. In addition, transferring
premises from the left-hand side to the right-hand side means that many useless rules may be fired. The first problem can be solved by designing a specific tool, by analogy with FUZZYCLIPS, that hides all the computational details. The second problem cannot be solved; it is due to the overhead introduced by the uncertainty management. However, because FUZZYCLIPS keeps control of the uncertainty, the second system is not very flexible. The first system does not suffer from this limitation: its flexibility can easily be modified by changing the parameters α_i and β_i of the intuitionistic numbers, and it offers several types of measures that help us better understand the progress of the maneuver.
6 Conclusion and Perspectives
In this paper we have studied two ways of managing uncertainty in expert systems by applying them to a driving maneuver recognition application (the CASSICE project). Both systems recognize driving maneuvers from data coming from several sensors; only the overtaking maneuver has been considered here. The first system is based on the Dempster-Shafer theory of evidence, combined with some fuzzy logic concepts in order to manage uncertainty in the rule-based system. The second system relies on fuzzy set theory but does not support total ignorance. The results obtained by the two systems are roughly similar: both can detect the moments when the driver gets ready to pass from one state to another. However, the first system allows a better understanding of the maneuver progress thanks to its additional measures. For example, when the ignorance mass is equal to 1, we can deduce that the system could not recognize the current state and, consequently, that the rule set does not contain all the necessary knowledge. There are two perspectives for this work. First, the developed systems should be generalized to recognize other maneuvers; one can imagine a maneuvers database containing the descriptions of all maneuvers. Second, a specific tool for the evidence theory could be developed, by analogy with FUZZYCLIPS, in order to facilitate the design of such expert systems.
References
[1] Atanassov, K.: Intuitionistic fuzzy sets. Fuzzy Sets and Systems 20: 87-96, 1986.
[2] Beynon, M., Cosker, D., Marshall, D.: An expert system for multi-criteria decision making using Dempster-Shafer theory. Expert Systems with Applications 20, p. 357-367, 2001.
[3] Beynon, M., Curry, B., Morgan, P.: The Dempster-Shafer theory of evidence: an alternative approach to multi-criteria decision modeling. The International Journal of Management Science, p. 37-50, 2000.
[4] Bloch, I.: Some aspects of Dempster-Shafer evidence theory for classification of multi-modality medical images taking partial volume effect into account. Pattern Recognition Letters 17, p. 905-919, 1996.
[5] CLIPS: www.ghg.net/clips/download/documentation
[6] Dongping, Z., Conners, T., Schmoldt, D., Araman, P.: A prototype vision system for analyzing CT imagery of hardwood logs. IEEE Transactions on Systems, Man and Cybernetics: B 26(4), p. 522-532, 1996.
[7] Fagin, R., Halpern, J.Y.: Reasoning about Knowledge and Probability. Proceedings of the Second Conference on Theoretical Aspects of Reasoning about Knowledge, Morgan Kaufmann, p. 277-293, 1988.
[8] FuzzyCLIPS: ai.iit.nrc.ca/cgi-bin/FuzzyCLIPS_log
[9] Heilpern, S.: Representation and application of fuzzy numbers. Fuzzy Sets and Systems 91, p. 259-268, 1997.
[10] Kak, A., Andress, K., Lopez-Abadia, C., Carol, M., Lewis, R.: Hierarchical evidence accumulation in the PSEIKI system. Uncertainty in Artificial Intelligence vol. 5, North-Holland, 1990.
[11] Loriette, S., Nigro, J.M., Jarkass, I.: Rule-Based Approaches for the Recognition of Driving Maneuvers. ISTA'2000 (International Conference on Advances in Intelligent Systems: Theory and Applications), Canberra (Australia), Volume 59 of the series Frontiers in Artificial Intelligence, IOS Press, 2-4 February 2000.
[12] Nigro, J.M., Loriette-Rougegrez, S., Rombaut, M., Jarkass, I.: Driving situation recognition in the CASSICE project - Towards an uncertainty management. ITSC 2000, 71-76, October 2000.
[13] Nigro, J.M., Loriette, S.: Characterization of driving situation. MS'99 (International Conference on Modeling and Simulation), May 17-19, p. 287-297.
[14] Nigro, J.M., Loriette-Rougegrez, S., Rombaut, M.: Driving situation recognition with uncertainty management and rule-based systems. Engineering Applications of Artificial Intelligence Journal, Volume 15, Issue 3-4, pp. 217-228, June-August 2002.
[15] Shafer, G.: A Mathematical Theory of Evidence. Princeton University Press, 1976.
[16] Shortliffe, E.H., Buchanan, B.: A model of inexact reasoning in medicine. Mathematical Biosciences 23: 351-379, 1975.
[17] Wang, J., Wu, Y.: Detection for mobile robot navigation based on multisensor fusion. Proceedings of the SPIE, The International Society for Optical Engineering vol. 2591, p. 182-192, 1995.
[18] Zadeh, L.A.: Fuzzy sets. Information and Control 8, p. 338-353, 1965.
Fuzzy Coefficients and Fuzzy Preference Relations in Models of Decision Making
Petr Ekel¹, Efim Galperin², Reinaldo Palhares³, Claudio Campos¹, and Marina Silva¹
¹ Graduate Program in Electrical Engineering, Pontifical Catholic University of Minas Gerais, Ave. Dom José Gaspar, 500, 30535-610, Belo Horizonte, MG, Brazil, [email protected] [email protected] [email protected]
² Department of Mathematics, University of Quebec at Montreal, C.P. 8888, Succ. Centre-Ville, Montreal, Quebec, Canada H3C 3P8, [email protected]
³ Department of Electronics Engineering, Federal University of Minas Gerais, Ave. Antônio Carlos, 6627, 31270-010, Belo Horizonte, MG, Brazil, [email protected]
Abstract. Analysis of < X, R > models is considered as part of a general approach to solving a wide class of optimization problems with fuzzy coefficients. This approach consists in formulating and solving one and the same problem within the framework of interrelated models to maximally cut off dominated alternatives. The subsequent contraction of the decision uncertainty region is based on reduction of the problem to multiobjective decision making in a fuzzy environment with applying techniques based on fuzzy preference relations. The results of the paper are of a universal character and are already being used to solve power engineering problems.
1 Introduction
Investigations of recent years show the benefits of applying fuzzy set theory to deal with various types of uncertainty, particularly for optimization problems, where there are advantages of a fundamental nature (the possibility of validly obtaining less "cautious" solutions) as well as of a computational character [1]. The uncertainty of goals is a notable kind of uncertainty related to the multiobjective character of many optimization problems. It is possible to distinguish two types of problems that call for a multiobjective approach [2]: (a) problems in which the consequences of a solution cannot be estimated on the basis of a single criterion, so that a vector of criteria must be analyzed, and (b) problems that may be solved on the basis of a single criterion but whose unique solutions are not achieved because the uncertainty of information produces so-called decision uncertainty regions, and the application of additional criteria can serve as a means to reduce these regions [3]. According to this, two classes of models (so-called < X, M > and < X, R > models) may be constructed. When analyzing < X, M > models, a vector of objective functions is considered for simultaneous optimization. The lack of clarity in the concept of "optimal solution" is the fundamental methodological difficulty in analyzing < X, M > models. When applying the Bellman-Zadeh approach [4], this concept is defined with reasonable validity because the maximum degree of implementing the goals serves as the optimality criterion. This conforms to the principle of guaranteed result and provides a constructive line for obtaining harmonious solutions [1, 5]. Some specific questions of using the approach are discussed in [1, 2]. Taking this into account, the paper considers primarily < X, R > models.
2 Optimization Problems with Fuzzy Coefficients
Many problems related to complex system design and control may be formulated as follows:

maximize F̃(x_1, ..., x_n),   (1)

subject to the constraints

g̃_j(x_1, ..., x_n) ⊆ B̃_j,  j = 1, ..., m,   (2)

where the objective function (1) and constraints (2) include fuzzy coefficients, as indicated by the ~ symbol. The following problem can also be defined:

minimize F̃(x_1, ..., x_n),   (3)

subject to the same constraints (2). An approach [3] to handle constraints such as (2) involves replacing each of them by a finite set of nonfuzzy constraints. Depending on the essence of the problem, it is possible to convert constraints (2) to constraints

g_j(x_1, ..., x_n) ≤ b_j,  j = 1, ..., d ≥ m,   (4)

or to constraints

g_j(x_1, ..., x_n) ≥ b_j,  j = 1, ..., d ≥ m.   (5)

Problems with fuzzy coefficients only in the objective functions can be solved by modifying traditional methods [2, 3]. For example, the algorithms [1, 3] for solving the discrete problems (1), (4) and (3), (5) are based on modifying the methods of [6]. In their use, the need arises to compare alternatives on the basis of relative fuzzy values of the objective function. This may be done by applying the methods classified in [7]. One of their groups is related to building fuzzy
preference relations, which provide the most justified way to compare alternatives [8]. In this connection, it is necessary to single out the choice function, or fuzzy number ranking index, based on the conception of a generalized preference relation [9]. If the membership functions corresponding to the values F_1 and F_2 of the objective function to be maximized are µ(f_1) and µ(f_2), the quantities

η{µ(f_1), µ(f_2)} = sup_{f_1, f_2 ∈ F, f_1 ≥ f_2} min{µ(f_1), µ(f_2)},   (6)

and

η{µ(f_2), µ(f_1)} = sup_{f_1, f_2 ∈ F, f_2 ≥ f_1} min{µ(f_1), µ(f_2)},   (7)

are the degrees of preference µ(f_1) ≽ µ(f_2) and µ(f_2) ≽ µ(f_1), respectively. Applying (6) and (7), it is possible to judge the preference of any of the compared alternatives. However, if the membership functions µ(f_1) and µ(f_2) correspond to so-called flat fuzzy numbers, the use of (6) and (7) can lead to η{µ(f_1), µ(f_2)} = η{µ(f_2), µ(f_1)}. In such situations the algorithms do not allow one to obtain a unique solution. This also occurs with other modifications of optimization methods when the uncertainty and relative stability of optimal solutions produce decision uncertainty regions. In this connection other choice functions (for example, [10, 11]) may be used. However, these functions occasionally result in choices that appear intuitively inconsistent, and their use does not settle the question of building an order on a set of fuzzy numbers [3]. A better validated and more natural approach is associated with the transition to a multiobjective choice of alternatives. The application of additional criteria can serve as a convincing means to contract the decision uncertainty region.
3 Multiobjective Choice and Fuzzy Preference Relations
First of all, it should be noted that a considerable contraction of the decision uncertainty region may be obtained by formulating and solving one and the same problem within the framework of mutually interrelated models: (a) maximizing (1) while satisfying constraints (4) interpreted as convex down, and (b) minimizing (3) while satisfying constraints (5) interpreted as convex up [1, 3]. Assume we are given a set X of alternatives (from the decision uncertainty region) that are to be examined by q criteria of quantitative and/or qualitative nature in order to make an alternative choice. The problem is presented as < X, R >, where R = {R_1, ..., R_q} is a vector of fuzzy preference relations. Therefore

R_p = [X × X, µ_{Rp}(X_k, X_l)],  p = 1, ..., q,  X_k, X_l ∈ X,   (8)

where µ_{Rp}(X_k, X_l) is a membership function of the fuzzy preference relation. The availability of fuzzy or linguistic estimates F_p(X_k), p = 1, ..., q, X_k ∈ X, with the membership functions µ[f_p(X_k)], p = 1, ..., q, X_k ∈ X, permits one to build R_p, p = 1, ..., q, on the basis of relations similar to (6) and (7).
If R is a single preference relation, it can be put in correspondence with a strict fuzzy preference relation R^S = R \ R^{-1} [9] with the membership function

µ^S_R(X_k, X_l) = max{µ_R(X_k, X_l) − µ_R(X_l, X_k), 0}.   (9)

Since µ^S_R(X_l, X_k) describes a fuzzy set of alternatives that are strictly dominated by X_l, its complement 1 − µ^S_R(X_l, X_k) gives the set of nondominated alternatives. To choose all such alternatives, it is enough to find the intersection of all 1 − µ^S_R(X_l, X_k), X_k ∈ X, over all X_l ∈ X:

µ^ND_R(X_k) = inf_{X_l ∈ X} [1 − µ^S_R(X_l, X_k)] = 1 − sup_{X_l ∈ X} µ^S_R(X_l, X_k).   (10)

Because µ^ND_R(X_k) is the degree of nondominance, it is natural to choose

X^ND = {X_k | X_k ∈ X, µ^ND_R(X_k) = sup_{X_k ∈ X} µ^ND_R(X_k)}.   (11)
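Equations (9)-(11) amount to a couple of matrix operations. The NumPy sketch below is only an illustration; as input it uses relation (26) from the illustrative example of Section 4, for which it returns the nondominance degrees (1, 0.938, 0.716) quoted there.

```python
import numpy as np

def nondominance(R):
    """Degrees of nondominance from a fuzzy preference relation, eqs. (9)-(10).

    R[k, l] = mu_R(X_k, X_l). Returns mu_ND(X_k) = 1 - sup_l mu_RS(X_l, X_k).
    """
    RS = np.maximum(R - R.T, 0.0)     # strict preference relation, eq. (9)
    return 1.0 - RS.max(axis=0)       # eq. (10)

# Relation (26) from the illustrative example in Section 4
R = np.array([[1.0,   1.0,   0.909],
              [0.938, 1.0,   0.909],
              [0.625, 0.912, 1.0  ]])
mu_nd = nondominance(R)                                   # approx. (1, 0.938, 0.716)
best = np.flatnonzero(np.isclose(mu_nd, mu_nd.max()))     # eq. (11): the argmax set
```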
If we have a vector fuzzy preference relation, expressions (9)-(11) can serve as the basis for a lexicographic procedure of step-by-step introduction of the criteria. It generates a sequence X^1, X^2, ..., X^q with X ⊇ X^1 ⊇ X^2 ⊇ ... ⊇ X^q, using the following expressions:

µ^p(X_k) = 1 − sup_{X_l ∈ X^{p−1}} µ^S_{Rp}(X_l, X_k),  p = 1, ..., q,   (12)

X^p = {X^p_k | X^p_k ∈ X^{p−1}, µ^p(X^p_k) = sup_{X_k ∈ X^{p−1}} µ^p(X_k)},   (13)
constructed on the basis of (10) and (11), respectively. If a uniquely determined order is difficult to build, it is possible to apply another procedure. In particular, the expressions (9)-(11) are applicable if we take R = ∩_{p=1}^{q} R_p, i.e.,

µ_R(X_k, X_l) = min_{1 ≤ p ≤ q} µ_{Rp}(X_k, X_l),  X_k, X_l ∈ X.   (14)

The use of this procedure leads to a set that fulfils the role of the Pareto set [12]. If necessary, its contraction is possible using the convolution

µ_T(X_k, X_l) = Σ_{p=1}^{q} λ_p µ_{Rp}(X_k, X_l),  X_k, X_l ∈ X,   (15)
where λ_p > 0, p = 1, ..., q, are importance factors normalized so that Σ_{p=1}^{q} λ_p = 1. The construction of µ_T(X_k, X_l), X_k, X_l ∈ X, allows one to obtain the membership function µ^ND_T(X_k) of the subset of nondominated alternatives using an expression similar to (9). Its intersection with µ^ND_R(X_k), defined as

µ'(X_k) = min{µ^ND_R(X_k), µ^ND_T(X_k)},  X_k ∈ X,   (16)

provides us with

X' = {X_k | X_k ∈ X, µ'(X_k) = sup_{X_k ∈ X} µ'(X_k)}.   (17)
Finally, it is possible to apply a third procedure, based on presenting (9) as

µ^ND_{Rp}(X_k) = 1 − sup_{X_l ∈ X} µ^S_{Rp}(X_l, X_k),  p = 1, ..., q,   (18)

to build the membership functions of the subsets of nondominated alternatives for all preference relations. Since the membership functions (18) play a role identical to that of the membership functions replacing the objective functions in < X, M > models [1, 2], it is possible to build

µ'(X_k) = min_{1 ≤ p ≤ q} µ^ND_{Rp}(X_k)   (19)

to obtain X' in accordance with (17). If it is necessary to differentiate the importance of the different fuzzy preference relations, it is possible to transform (19) as follows:

µ'(X_k) = min_{1 ≤ p ≤ q} [µ^ND_{Rp}(X_k)]^{λ_p}.   (20)
The use of (20) does not require the normalization of λ_p, p = 1, ..., q. It is natural that the second procedure may lead to solutions different from those obtained with the first procedure. Moreover, solutions based on the second procedure and on the third procedure (which is preferable from a substantive point of view), although they share a generic basis, may at times also differ. It should be stressed that the possibility of obtaining different solutions is natural, and the choice of the approach is the prerogative of the decision maker. All procedures have been implemented within the framework of a decision-making system, DMFE (developed in C++ in the Borland Builder environment). Its flexibility and user-friendly interaction with the decision maker make it possible to use DMFE for solving complex problems of multiobjective choice of alternatives with criteria of quantitative and qualitative nature.
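The second and third procedures can be sketched as follows. The function names, the handling of the weights and the helper for (9)-(10) are mine; applied to relations (21)-(23) of Section 4, third_procedure returns the index set {X1, X2}, in line with the example.

```python
import numpy as np

def nondominated_degrees(R):
    """mu_ND from a single fuzzy preference relation, eqs. (9)-(10)."""
    return 1.0 - np.maximum(R - R.T, 0.0).max(axis=0)

def second_procedure(relations, weights):
    """Intersection (14) refined by the weighted convolution (15)-(17)."""
    R_min = np.minimum.reduce(relations)                  # eq. (14)
    R_T = sum(w * R for w, R in zip(weights, relations))  # eq. (15)
    mu = np.minimum(nondominated_degrees(R_min),
                    nondominated_degrees(R_T))            # eq. (16)
    return np.flatnonzero(np.isclose(mu, mu.max()))       # eq. (17)

def third_procedure(relations, weights=None):
    """Aggregation of per-criterion nondominance degrees, eqs. (18)-(20)."""
    mus = np.array([nondominated_degrees(R) for R in relations])  # eq. (18)
    if weights is not None:
        mus = mus ** np.asarray(weights)[:, None]                 # eq. (20)
    mu = mus.min(axis=0)                                          # eq. (19)
    return np.flatnonzero(np.isclose(mu, mu.max()))
```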
4 Illustrative Example
The results described above have found applications in solving power engineering problems. As an example, we dwell on using the multiobjective choice of alternatives for substation planning. Without discussing substantial considerations (they are given in [13]), the problem consists in comparing three alternatives, which could not be distinguished from the point of view of their total costs, using criteria “Alternative Costs”, “Flexibility of Development”, and “Damage to Agriculture”. The fuzzy preference relations corresponding to these criteria,
built with the use of (6) and (7) on the basis of the alternative membership functions presented in [13], are the following:

µ_{R1}(X_k, X_l) = [[1, 1, 1], [1, 1, 1], [1, 0.912, 1]],   (21)

µ_{R2}(X_k, X_l) = [[1, 1, 0.909], [1, 1, 0.909], [1, 1, 1]],   (22)

and

µ_{R3}(X_k, X_l) = [[1, 1, 1], [0.938, 1, 1], [0.625, 0.938, 1]].   (23)
Consider the application of the first approach when the criteria are arranged, for example, in the order p = 1, p = 2, p = 3. Using (21), it is possible to form the strict fuzzy preference relation

µ^S_{R1}(X_k, X_l) = [[0, 0, 0], [0, 0, 0.088], [0, 0, 0]].   (24)

Following (12) and (13), we obtain on the basis of (24)

µ^1(X_k) = (1, 1, 0.912)   (25)

and X^1 = {X_1, X_2}. From (22), restricted to X^1, we have µ^2(X_k) = (1, 1), which leads to X^2 = {X_1, X_2}. Finally, from (23) we obtain µ^3(X_k) = (1, 0.938), providing us with X^3 = {X_1}.

Consider the application of the second approach. As a result of the intersection of (21), (22), and (23), we obtain

µ_R(X_k, X_l) = [[1, 1, 0.909], [0.938, 1, 0.909], [0.625, 0.912, 1]].   (26)
It permits one to construct

µ^S_R(X_k, X_l) = [[0, 0.062, 0.284], [0, 0, 0], [0, 0.003, 0]].   (27)
Following (11), we obtain µ^ND_R(X_k) = (1, 0.938, 0.716) and X' = {X_1}.

Let us consider the use of the third approach. The membership function of the subset of nondominated alternatives for R_1 is given by (25). Using (22) and (23), we obtain µ^ND_{R2}(X_k) = (0.909, 0.909, 1) and µ^ND_{R3}(X_k) = (1, 0.938, 0.625), respectively. As a result of their intersection, we have X' = {X_1, X_2}. Thus, the use of the first and second approaches leads to choosing the first alternative. The third approach casts away the third alternative but cannot distinguish the first and second alternatives on the basis of the information reflected by the fuzzy preference relations (21)-(23).
5 Conclusion
Two classes of problems that need the application of a multiobjective approach have been classified. According to this, < X, M > and < X, R > models may be constructed. The use of < X, R > models is based on applying a general approach to solving optimization problems with fuzzy coefficients, which consists in analyzing one and the same problem within the framework of mutually interrelated models. The contraction of the obtained decision uncertainty region is based on reducing the problem to a multiobjective choice of alternatives, applying procedures based on fuzzy preference relations. The results of the paper are of a universal character and can be applied to the design and control of systems and processes of different natures, as well as to the enhancement of intelligent decision support systems. They are already being used to solve power engineering problems.
References
[1] Ekel, P.Ya.: Fuzzy Sets and Models of Decision Making. Int. J. Comp. Math. Appl. 44 (2002) 863-875.
[2] Ekel, P.Ya.: Methods of Decision Making in Fuzzy Environment and Their Applications. Nonlin. Anal. 47 (2001) 979-990.
[3] Ekel, P., Pedrycz, W., Schinzinger, R.: A General Approach to Solving a Wide Class of Fuzzy Optimization Problems. Fuzzy Sets Syst. 97 (1998) 49-66.
[4] Bellman, R., Zadeh, L.A.: Decision Making in a Fuzzy Environment. Manage. Sci. 17 (1970) 141-164.
[5] Ekel, P.Ya., Galperin, E.A.: Box-Triangular Multiobjective Linear Programs for Resource Allocation with Application to Load Management and Energy Market Problems. Math. Comp. Mod., to appear.
[6] Zorin, V.V., Ekel, P.Ya.: Discrete-Optimization Methods for Electrical Supply Systems. Power Eng. 18 (1980) 19-30.
[7] Chen, S.J., Hwang, C.L.: Fuzzy Multiple Attribute Decision Making: Methods and Applications. Springer-Verlag, Berlin Heidelberg New York (1992).
[8] Horiuchi, K., Tamura, N.: VSOP Fuzzy Numbers and Their Fuzzy Ordering. Fuzzy Sets Syst. 93 (1998) 197-210.
[9] Orlovsky, S.A.: Decision Making with a Fuzzy Preference Relation. Fuzzy Sets Syst. 1 (1978) 155-167.
[10] Fodor, J., Roubens, M.: Fuzzy Preference Modelling and Multicriteria Decision Support. Kluwer, Boston Dordrecht London (1994).
[11] Lee-Kwang, H.: A Method for Ranking Fuzzy Numbers and Its Application to Decision-Making. IEEE Trans. Fuzzy Syst. 7 (1999) 677-685.
[12] Orlovsky, S.A.: Problems of Decision Making with Fuzzy Information. Nauka, Moscow (1981), in Russian.
[13] Ekel, P.Ya., Terra, L.D.B., Junges, M.F.D.: Methods of Multicriteria Decision Making in Fuzzy Environment and Their Applications to Power System Problems. Proceedings of the 13th Power Systems Computation Conference, Trondheim (1999) 755-761.
Mining Fuzzy Rules for a Traffic Information System Alexandre G. Evsukoff and Nelson F. F. Ebecken COPPE/Federal University of Rio de Janeiro P.O.Box 68506, 21945-970 Rio de Janeiro RJ, Brazil {Evsukoff,Nelson}@ntt.ufrj.br
Abstract. This work presents a fuzzy system for pattern recognition in a real application: the selection of traffic information messages to be displayed in Variable Message Signs located at the main routes of the city of Rio de Janeiro. In this application, flow and occupancy rate data is used to fit human operators' evaluation of traffic condition, which is currently done from images of strategically located cameras. The fuzzy rule-base mining is presented considering the symbolic relationships between linguistic terms describing variables and classes. The application presents three classifiers built from data.
1 Introduction
This work presents the development of a fuzzy system for the selection of messages to be displayed on Variable Message Signs (VMS). Such devices were initially conceived to provide drivers with information about traffic incidents, environmental problems, special events, etc. [1], [4]. However, due to the increasing levels of congestion in big cities, VMS have been used mainly to display current traffic conditions. In the city of Rio de Janeiro, the VMS are systematically used to inform drivers of the traffic conditions on the main routes located downstream of each panel. Operators in the traffic control centre analyse images from strategically located cameras and classify the conditions on each route into three categories: fluid, dense and slow. The classification is not standardised and depends on the operator's evaluation, which can differ for similar situations. On the other hand, many streets in Rio are equipped with inductive loop detectors. These sensors provide flow and occupancy rates, whose data are used for traffic light planning but not to support VMS operation. Fuzzy reasoning techniques are a key to human-friendly computerised devices, allowing the symbolic generalisation of large amounts of data by fuzzy sets and providing linguistic interpretability [2], [3], [5]. The application described in this work uses flow and occupancy data to mine fuzzy rules that support the operators' evaluation in the selection of the messages to be displayed on the VMS. The following section introduces the current approach. The third section describes the mining of fuzzy rules for the fuzzy model. The fourth section presents and discusses the results achieved by three classifiers built with this methodology. Finally, some concluding remarks are presented.
2 Fuzzy Systems for Pattern Recognition
Consider a pattern recognition problem where observations are described as an N-dimensional vector x in a feature space X^N and classes are represented by the set C = {C_1, ..., C_m}. The solution consists in assigning a class label C_j ∈ C to an observation x(t) ∈ X^N. The problem of designing a fuzzy system for pattern recognition is to build a classifier that correctly executes the mapping X^N → C. Each input variable x(t) ∈ X is described using ordered linguistic terms in a descriptor set A = {A_1, ..., A_n}. The meaning of each term A_i ∈ A is given by a fuzzy set. The collection of fuzzy sets used to describe the input variable forms a fuzzy partition of the input variable domain. For a given input x(t) ∈ X, the membership vector (or fuzzification vector) u(t) is computed by the fuzzy sets in the fuzzy partition of the input variable domain as:

u(t) = ( µ_{A1}(x(t)), ..., µ_{An}(x(t)) ).   (1)
In pattern recognition applications, the rules' conclusions are the classes in the set C = {C_1, ..., C_m}. The fuzzy rule base thus has to relate input linguistic terms A_i ∈ A to the classes C_j ∈ C, in rules such as:

if x(t) is A_i then class is C_j with cf = ϕ_ij   (2)
where ϕ_ij ∈ [0, 1] is a confidence factor (cf) that represents the rule certainty. The confidence factor weights for all rules define the symbolic fuzzy relation Φ, defined on the Cartesian product A × C. A value µ_Φ(A_i, C_j) = ϕ_ij represents how much the term A_i is related to the class C_j in the model described by the rule base. A value ϕ_ij > 0 means that the rule (i, j) occurs in the rule base with the confidence factor ϕ_ij. The rule base can be represented by the matrix Φ = [ϕ_ij], whose lines i = 1...n are related to the terms in the input variable descriptor set A and whose columns j = 1...m are related to the classes in the set C. The output of the fuzzy system is the class membership vector (or fuzzy model output vector) y(t) = ( µ_{C1}(x(t)), ..., µ_{Cm}(x(t)) ), where µ_{Cj}(x(t)) is the output membership value of the input x(t) to the class C_j. The class membership vector y(t) can also be seen as a fuzzy set defined on the class set C. It is computed from the input membership vector u(t) and the rule base weights in Φ using the fuzzy relational composition:

y(t) = u(t) ∘ Φ   (3)
Adopting strong and normalised fuzzy partitions, and using the sum-product composition operator for the fuzzy inference, the class membership vector y(t) is easily computed as a standard vector-matrix product:

y(t) = u(t) · Φ.   (4)
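A minimal sketch of this single-variable pipeline is shown below: a strong triangular partition for the fuzzification (1), the composition (4) as a matrix product, and the "maximum" decision rule. The term prototypes and the weight matrix are hypothetical, not the ones mined from the traffic data.

```python
import numpy as np

def fuzzify(x, centres):
    """Membership vector u(t) on a strong triangular partition, eq. (1).

    `centres` are the ordered prototypes of the linguistic terms; the
    memberships of neighbouring terms sum to one.
    """
    u = np.zeros(len(centres))
    if x <= centres[0]:
        u[0] = 1.0
    elif x >= centres[-1]:
        u[-1] = 1.0
    else:
        i = np.searchsorted(centres, x) - 1
        w = (x - centres[i]) / (centres[i + 1] - centres[i])
        u[i], u[i + 1] = 1.0 - w, w
    return u

# Illustrative example: three terms for "flow", three classes (fluid, dense, slow)
centres = np.array([200.0, 800.0, 1400.0])      # hypothetical term prototypes
Phi = np.array([[0.9, 0.1, 0.0],                # hypothetical rule weights
                [0.2, 0.7, 0.1],
                [0.0, 0.3, 0.7]])
u = fuzzify(650.0, centres)
y = u @ Phi                                     # eq. (4): y(t) = u(t) . Phi
predicted_class = int(np.argmax(y))             # "maximum" decision rule
```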
When two or more variables occur in the antecedent of the rules, the input variables are the components of the feature vector x ∈ X^N and the rules are written as:

if x(t) is B_i then class is C_j with cf = ϕ_ij   (5)

Each term B_i in the multi-variable descriptor set B = {B_1, ..., B_M} represents a combination of terms in the input variables' descriptor sets. It is a symbolic term used for computation purposes only and does not necessarily need to have a linguistic interpretation. All combinations must be considered, in such a way that the model is complete, i.e. it produces an output for whatever input values. The multi-variable fuzzy model described by a set of rules (5) is analogous to the single-variable fuzzy model described by rules of type (2), and the model output y(t) = ( µ_{C1}(x(t)), ..., µ_{Cm}(x(t)) ) is computed as:

y(t) = w(t) · Φ   (6)
The combination of terms in multiple-antecedent rules of type (5) implies an exponential growth of the number of rules in the rule base. For problems with many variables, reasonable results [2] can be obtained by aggregating partial conclusions computed by single-antecedent rules of type (2). The final conclusion is the membership vector y(t) = ( µ_{C1}(x(t)), ..., µ_{Cm}(x(t)) ), which is computed by the aggregation of the partial conclusions y_i(t), as in multi-criteria decision-making [6]. Each component µ_{Cj}(x(t)) is computed as:

µ_{Cj}(x(t)) = H( µ_{Cj}(x_1(t)), ..., µ_{Cj}(x_N(t)) )   (7)
where H : [0,1]^N → [0,1] is an aggregation operator. The best aggregation operator must be chosen according to the semantics of the application. Generally, a conjunctive operator such as the "minimum" or the "product" gives good results when all partial conclusions must agree, while a weighted operator like OWA may be used to express some compromise between partial conclusions. The final decision is computed by a decision rule. The most usual decision rule is the "maximum" rule, in which the class with the greatest membership value is chosen. Nevertheless, other decision rules can be used, including risk analysis and reject or ambiguity distances. When all variables are used in single-variable rules such as (2) and their outputs are aggregated as in (7), the fuzzy classifier behaves as a standard Bayesian classifier. The
current approach is flexible enough so that some partial conclusions can be computed from the combination of two or three variables in multi-variable rules (5). An aggregation operator computes a final conclusion from partial conclusions obtained from all sub-models. A decision rule (the "maximum" rule) computes the final class as shown in Fig. 1.
Fig. 1. The fuzzy system approach for pattern recognition
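As an illustration of the scheme in Fig. 1, the following minimal Python/NumPy sketch (not part of the original system; function names, membership function shapes and the rule weight matrices Φ_k are illustrative assumptions) fuzzifies each feature with a strong triangular partition, computes the partial conclusions y_k = u_k · Φ_k as in eq. (4), aggregates them with the "minimum" operator of eq. (7), and applies the maximum decision rule.

import numpy as np

def tri_memberships(x, centres):
    # Strong triangular partition: membership of a scalar x to each fuzzy term.
    u = np.zeros(len(centres))
    for i, b in enumerate(centres):
        a = centres[i - 1] if i > 0 else b          # saturate at the edges
        c = centres[i + 1] if i < len(centres) - 1 else b
        if x <= b:
            u[i] = 1.0 if a == b else max(0.0, (x - a) / (b - a))
        else:
            u[i] = 1.0 if b == c else max(0.0, (c - x) / (c - b))
    return u

def classify(x, centres_per_feature, Phi_per_feature):
    # Partial conclusions y_k = u_k . Phi_k (eq. (4)), aggregated by "minimum" (eq. (7)),
    # followed by the maximum decision rule.
    partial = []
    for xk, centres, Phi in zip(x, centres_per_feature, Phi_per_feature):
        u_k = tri_memberships(xk, centres)
        partial.append(u_k @ Phi)
    y = np.min(np.vstack(partial), axis=0)
    return int(np.argmax(y)), y

A call such as classify([flow, occupancy], [flow_centres, occ_centres], [Phi_flow, Phi_occ]) would return the predicted class index together with the class membership vector y(t).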
The rule base is the core of the model described by the fuzzy system and its determination from data or human knowledge is described in the next section.
3
Fuzzy Rule Base Mining
Consider a labelled data set T, where each sample is a pair (x(t), v(t)), of which x(t) is the vector containing the feature values and v(t) is the vector containing the assigned membership values of x(t) to each class. For each fuzzy sub-model k, the rule base weights matrix Φ_k is computed by the minimisation of the rule output error in each sub-model, defined as:
J = (1/N) Σ_{t=1..N} (y_k(t) − v(t))²    (8)
where y_k(t) = u_k(t) · Φ_k if the sub-model k is described by single-antecedent rules as in (2), or y_k(t) = w_k(t) · Φ_k if the sub-model is described by multiple-antecedent rules as in (5). The rule base weights matrix Φ_k is computed as the solution of the linear system:
U_k Φ_k = V    (9)
where the matrix U_k is the fuzzification matrix, whose rows are the input membership vectors of the samples. Each row of the matrix U_k can be the single-variable input membership vector u_k(t) or the multi-variable input membership vector w_k(t), depending on whether the sub-model k is described by single or multiple antecedent rules. The
matrix V stores, in each row, the class membership vector for the corresponding sample. Equation (9) can be solved by any standard numerical method. Fuzzy rule base weights are computed for each sub-model and used to compute partial conclusions on the class descriptions. Final conclusions are computed by aggregation of the partial conclusions of each sub-model. Heuristic and experimental knowledge can be combined as different sub-models, and their outputs can be aggregated to compute the final conclusions.
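A minimal NumPy sketch of this estimation step, assuming a least-squares reading of eq. (9) (the clipping to [0, 1] is an added assumption so that the weights keep their confidence-factor interpretation, not something stated above):

import numpy as np

def mine_rule_weights(U_k, V):
    # Least-squares solution of U_k . Phi_k = V (eq. (9)); rows of U_k are input
    # membership vectors, rows of V the assigned class membership vectors.
    Phi_k, *_ = np.linalg.lstsq(U_k, V, rcond=None)
    # Clip to [0, 1] (assumption) so the weights can still be read as confidence factors.
    return np.clip(Phi_k, 0.0, 1.0)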
4
Application
The application described in this work uses flow and occupancy rate data to automatically select messages to support the operators' evaluation. One of the data sets used in this application is shown in Fig. 2. The interpolation of the data, represented by the solid line, shows a typical relationship between flow and density. Density is not directly measured, but it can be approximated by the occupancy rate. For low occupancy rate and flow values, the traffic condition is classified as clearly "fluid", since the road is being operated under its capacity. As flow increases, the occupancy rate also increases until a saturation point, represented by the limit of the road capacity, where the traffic is classified as "dense". Beyond this limit, the effects of congestion reduce the traffic flow, and thus the mean velocity, while the occupancy rate still increases. In this situation, the traffic condition is classified as "slow".
Fig. 2. A training data set
Operators classify traffic conditions visually from camera images. Thus, for some close data values, the classification can be completely different, due to the subjectivity of the operators' evaluation. The methodology presented above allowed different classifiers to be built; three of them are discussed in this section. The first classifier is obtained considering the flow/occupancy ratio as the input variable. Using this analytical relation, the rule base design becomes trivial, since a low mean velocity implies slow traffic, a medium mean velocity implies intense traffic
and a high velocity implies fluid traffic. The membership functions were centred on the mean values of the flow/occupancy ratio for each class in the training set. The output class membership vector was computed simply by fuzzification.
The operators agreed that flow can be described by three terms, while occupancy is better described by four symbolic terms. The fuzzy meanings were defined by equally spaced fuzzy sets with triangular membership functions. No optimization of the fuzzy set locations was performed. The second classifier was obtained by considering each feature separately as a sub-model with single-antecedent rules like (2). The rule base weights were computed as in (9) for each feature. The partial conclusions were aggregated by the t-norm "minimum" operator. The third classifier considers all the combinations of terms in the variables' descriptor sets in two-variable antecedent rules like (5). The fourth classifier is a standard Bayes classifier assuming normal probability distributions.
The collection of the data set was not a simple task. As the operators have many occupations other than monitoring traffic conditions, each data sample (acquired at a rate of one sample per 15 minutes) was included in the data set only if the operator had just changed the message in the VMS. One data set was collected for each route. As the routes differ in capacity, a different classifier was also developed for each route. The data sets were divided into a training data set and a test data set. The classification error rates in the test data set are shown in Table 2. The large error rates in the third route are due to inconsistent data in the training set.
Table 2. Classification Results
Classifier   Route 1   Route 2   Route 3
#1           34.78     27.83     47.83
#2           31.87     31.19     46.09
#3           35.16     23.85     43.48
#4           32.96     31.19     44.34
In the real VMS application, many detectors are available for each route. In this case, the solution provided by each data collection station should be considered as an independent partial conclusion, and the final decision could be obtained by a weighted aggregation of all data collection stations in each route, in a data fusion scheme.
5
Conclusions
This work has presented a fuzzy system approach to represent knowledge elicited from experts and mined from a data set. The main advantage of this approach is its flexibility to combine single or multiple-variable fuzzy rules and to derive fuzzy rules from heuristic and/or experimental knowledge. In the application, three classifiers were discussed for a simple but real application of a traffic information system. It has been shown that the operator evaluation from visual inspection is very subjective and corresponds only roughly to the measured data. The classification results reflect the lack of standardization in the operators' evaluation. The results obtained with the fuzzy classifiers (#1 to #3) were more or less similar
to the results of the Bayes classifier (#4), although the rule bases in the fuzzy classifiers are readable by domain experts. The rule bases obtained in the data-driven classifiers (classifiers #2 and #3) correspond roughly to the operators' expectation of how flow and occupancy are related to the traffic condition, as expressed in the heuristic classifier (classifier #1). An automatic procedure to support the operators' evaluation could help to standardize the messages to be displayed in the VMS, allowing a better understanding of traffic messages by the drivers. The parameters of the fuzzy partitions for each variable are important design issues that can improve classification rates and have not been considered in this work. They are left for further development of the symbolic approach presented in this work.
Acknowledgements The authors are grateful for the kind co-operation of the traffic control operators and to CET-Rio, which provided the data for this work.
References
[1] Dudek, C. L. (1991). Guidelines on the Use of Changeable Message Signs – Summary Report. Publication No. FHWA-TS-91-002. U.S. Department of Transportation, Federal Highway Administration.
[2] Evsukoff, A., A. C. S. Branco and S. Gentil (1997). A knowledge acquisition method for fuzzy expert systems in diagnosis problems. Proc. 6th IEEE International Conference on Fuzzy Systems – FUZZ-IEEE'97, Barcelona.
[3] Iserman, R. (1998). On fuzzy logic applications for automatic control, supervision and fault diagnosis. IEEE Trans. on Systems, Man and Cybernetics – Part A: Systems and Humans, 28 (2), pp. 221-234.
[4] Sethi, V. and N. Bhandari (1994). Arterial incident detection using fixed detector and probe vehicle data. Transportation Research.
[5] Zadeh, L. (1996). Fuzzy logic = computing with words. IEEE Trans. on Fuzzy Systems, 4 (2), pp. 103-111.
[6] Zimmermann, H.-J. (1996). Fuzzy Set Theory and its Applications. Kluwer.
Possibilistic Hierarchical Fuzzy Model Paulo Salgado Universidade de Trás-os-Montes e Alto Douro 5000-911 Vila Real, Portugal [email protected]
Abstract. This paper presents a Fuzzy Clustering of Fuzzy Rules Algorithm (FCFRA) with dancing cones that allows the automatic organisation of the sets of fuzzy IF … THEN rules of one fuzzy system in a Hierarchical Prioritised Structure (HPS). The algorithm belongs to a new methodology for organizing linguistic information, SLIM (Separation of Linguistic Information Methodology), and is based on the concept of relevance of rules. The proposed FCFRA algorithm has been successfully applied to the clustering of an image.
1
Introduction
A fuzzy reasoning model is considered as a set of rules in IF-THEN form that describe the input-output relations of a complex system. We use rules to describe this relation instead of classical function approximation techniques mainly because of the transparency of the resulting fuzzy model. When building fuzzy systems from experts or automatically from data, we need procedures that divide the input space into fuzzy granules. These granules are the building blocks for the fuzzy rules. To keep interpretability we usually require that the fuzzy sets be specified in local regions. If this requirement is not fulfilled, many rules must be applied and aggregated simultaneously, so that the final result becomes more difficult to grasp. Moreover, aiming for a high approximation quality, we tend to use a large number of rules. In such cases, it is no longer possible to interpret the fuzzy system. The main objective of this article is to show that automated modelling techniques can be used to obtain not only accurate, but also transparent rule-based models from system measurements. This is obtained by organizing the flat fuzzy system f(x) into a set of n fuzzy sub-systems f1(x), f2(x), ..., fn(x), each corresponding to a readable and interpretable fuzzy system that may contain information related to particular aspects of the system f(x). This objective can be reached by using an algorithm that implements Fuzzy Clustering of Fuzzy Rules (FCFRA) with dancing cones [1]. The proposed algorithm allows the grouping of a set of rules into c subgroups (clusters) of similar rules, producing an HPS representation of the fuzzy systems. The paper is organized as follows. A brief introduction of hierarchical HPS fuzzy systems is made and the concept of relevance is reviewed in section 2.
In section 3 the FCFRA strategy is proposed. An example is presented in section 4. Finally, the main conclusions are outlined in section 5.
2
The Hierarchical Prioritized Structure
The SLIM methodology organizes a fuzzy system f(x) as a set of n fuzzy systems f1(x), f2(x), …, fn(x). A clustering algorithm is used in this work to implement the separation of information among the various subsystems [2]. The HPS structure, depicted in figure 1, allows the prioritization of the rules by using a hierarchical representation, as defined by Yager [3]. If i < j, the rules in level i have a higher priority than those in level j. Consider a system with i levels, i = 1, …, n−1, each level with M_i rules:
I.  If U is A_ij and V̂_{i−1} is low, then V_i is B_ij
II. V_i is V̂_{i−1}
Rule I is activated if two conditions are satisfied: U is A_ij and V̂_{i−1} is low. V̂_{i−1}, which is the maximum value of the output membership function of V_{i−1}, may be interpreted as a measure of the satisfaction of the rules in the previous levels. If these rules are relevant, i.e. V̂_{i−1} is not low, the information conveyed by the rules of level i will not be used. On the other hand, if the rules in the previous levels are not relevant, i.e. V̂_{i−1} is low, this information is used. Rule II states that the output of level i is the union of the output of the previous level with the output of level i.
Fig. 1. One Hierarchical Prioritized Structure (HPS)
The fuzzy system of figure 1 is a universal function approximator. With an appropriate number of rules it can describe any nonlinear relationship. The output of a generic level i is given by the expression [4]:
G_i = [(1 − α_{i−1}) ∧ (∪_{l=1..M_i} F_i^l)] ∪ G_{i−1}    (1)
where F_i^l = A_i^l(x*) ∧ B_i^l, l = 1, …, M_i, is the output membership function of rule l in level i. The coefficient α_i translates the relevance of the set of rules in level i (for level 0: α_0 = 0, G_0 = ∅). Equation (1) can be interpreted as an aggregation operation of the relevance for the hierarchical structure. For the characterization of the relative importance of sets of rules in the modelling process, it is essential to define a relevance function. Depending on the context where the relevance is to be measured, different metrics may be defined [5].
Definition 1. Consider ℑ a set of rules from U into V, covering the region S = U × V in the product space. Any relevance function must be of the form
ℜ_S : P(ℑ) → [0, 1]    (2)
where P(ℑ) is the power set of ℑ. As a generalization of equation (1), we propose a new definition of relevance for the HPS structure [5].
Definition 2. (Relevance of the fuzzy system up to level i) Let S_i be the input-output region covered by the set of rules of level i. The relevance of the set of rules in level i is defined as:
α_i = S( α_{i−1}, T( (1 − α_{i−1}), ℜ_{S_i} ) )    (3)
where ℜ_{S_i} = S({ ℜ_{S_i}(F_i^l), l = 1, …, M_i }), and S and T are, respectively, S-norm and T-norm operations. ℜ_{S_i}(F_i^l) represents the relevance of rule l in level i.
Using the product implication rule, and considering B_i^l a centroid of amplitude δ_i^l centred at y = ȳ, then
ℜ_{S_i}(F_i^l) = A_i^l(x*) · δ_i^l    (4)
When the relevance of a level i is 1, the relevance of all the levels below it is null. The relevance of the set of rules up to level i, as in Definition 2, is here simplified by using the following expression:
α_i = α_{i−1} + ℜ_{S_i}   if α_{i−1} + ℜ_{S_i} ≤ 1;   α_i = 1   otherwise    (5)
where ℜ_{S_i} = Σ_{l=1..M_i} ℜ_{S_i}(F_i^l).
In the next section we present a new method to cluster the fuzzy rules of a fuzzy system. The result of this process corresponds to transforming the original fuzzy system (a flat structure) into an HPS fuzzy model.
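A minimal Python sketch of the simplified relevance accumulation of eqs. (4)-(5) (names are illustrative; the rule activations A_i^l(x*) and centroid amplitudes δ_i^l are assumed to be given per level):

def level_relevances(rule_activations, deltas):
    # rule_activations[i][l] = A_i^l(x*), deltas[i][l] = delta_i^l for level i.
    # Accumulate alpha_i = min(1, alpha_{i-1} + sum_l R_Si(F_i^l)) as in eq. (5).
    alpha = 0.0                       # alpha_0 = 0
    alphas = []
    for A_level, d_level in zip(rule_activations, deltas):
        R_Si = sum(a * d for a, d in zip(A_level, d_level))   # eq. (4), summed over rules
        alpha = min(1.0, alpha + R_Si)                        # eq. (5)
        alphas.append(alpha)
    return alphas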
3
The Possibilistic Clustering Algorithm of Fuzzy Rules
The objective of the fuzzy clustering partition is the separation of a set of fuzzy rules ℑ = {R_1, R_2, ..., R_M} into c clusters, according to a "similarity" criterion, finding the optimal cluster centres, V, and the partition matrix, U. Each value u_ik represents the membership degree of the kth rule, R_k, to the ith cluster, A_i, and obeys simultaneously equations (6) and (7):
0 < Σ_{k=1..n} u_ik < n,   ∀i ∈ {1, 2, …, c}    (6)
and
Σ_{i=1..c} Σ_{l=1..M} ℜ^l(x_k) · u_il · α_i = 1,   ∀x_k ∈ S    (7)
Equation (7) can be interpreted as the sum of the products of the relevance of rule l at the point x_k with the term u_il·α_i, the degree to which rule l contributes to cluster i. This last term reflects the joint contribution of rule l to the ith hierarchical system, u_il, with the relevance of the hierarchical level i, α_i. Thus, for the Fuzzy Clustering of Fuzzy Rules Algorithm, FCFRA, the objective is to find U = [u_il] and V = [v_1, v_2, …, v_c], with v_i ∈ R^p, such that
J(U, V) = Σ_{k=1..n} Σ_{i=1..c} Σ_{l=1..M} [ (u_il · ℜ^l(x_k))^m · ‖x_k − v_i‖²_A + η_i (1 − u_il)^m ]    (8)
is minimized, with a weighting constant m > 1, under the constraint of equation (7). Here ‖·‖_A is an inner product norm, ‖x‖²_A = x^T A x. The parameter η_i is fixed for each cluster with a membership distance of ½. It can be shown that the following algorithm may lead the pair (U*, V*) to a minimum. The models specified by the objective function (8) were minimized using alternating optimization. The results can be expressed in the following algorithm:
Possibilistic Fuzzy Clustering Algorithm of Fuzzy Rules – P-FCAFR
Step 1: For a set of points X = {x_1, ..., x_n}, with x_i ∈ S, and a set of rules ℑ = {R_1, R_2, ..., R_M} with relevances ℜ_l(x_k), l = 1, …, M, fix c, 2 ≤ c < n, and initialize U^(0) ∈ M_cf.
Step 2: On the rth iteration, with r = 0, 1, 2, ..., compute the c mean vectors:
v_i^(r) = [ Σ_{l=1..M} (u_il^(r))^m · Σ_{k=1..np} (ℜ_l(x_k))^m · x_k ] / [ Σ_{l=1..M} (u_il^(r))^m · Σ_{k=1..np} (ℜ_l(x_k))^m ],   where u_il^(r) = [U^(r)]_il, i = 1, ..., c    (9)
Step 3: Compute the new partition matrix U^(r+1) using the expression:
u_il^(r+1) = 1 / Σ_{j=1..c} [ Σ_{k=1..np} (ℜ_l(x_k))^m · ‖x_k − v_i^(r)‖²_A / η_i ]^(1/(m−1)),   with 1 ≤ i ≤ c, 1 ≤ l ≤ M    (10)
In this algorithm, we interpret the values of {u_i(x_k)} (discrete support) as observations from the extended membership function µ_i(l): µ_i(l) = u_i(l), l = 1, …, M (continuous support).
Step 4: Compare U^(r) with U^(r+1): if ‖U^(r+1) − U^(r)‖ < ε then the process ends; otherwise let r = r+1 and go to step 2. ε is a small real positive constant.
The application of the P-FCAFR algorithm to fuzzy system rules with a singleton fuzzifier, product inference and centroid defuzzifier results in a fuzzy system with HPS structure, i.e., of the form:
f(x) = [ Σ_{i=1..c} α_i Σ_{l=1..M} θ^l · µ^l(x) · u_il ] / [ Σ_{i=1..c} α_i Σ_{l=1..M} µ^l(x) · u_il ]    (11)
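The following NumPy sketch illustrates the alternating optimisation of steps 1-4 and the HPS output of eq. (11). It is not the paper's implementation: the membership update uses the classical possibilistic form u = 1/(1 + (D/η)^(1/(m−1))) as a stand-in for the update of eq. (10), a single scalar η is assumed, and the array layouts and names are illustrative.

import numpy as np

def p_fcafr(X, rel, c, m=2.0, eta=1.0, n_iter=100, tol=1e-4, seed=0):
    # X[k] are the points, rel[l, k] is the relevance of rule l at point x_k.
    rng = np.random.default_rng(seed)
    M, n = rel.shape
    U = rng.random((c, M))                                   # step 1: initial partition
    relm = rel ** m
    for _ in range(n_iter):
        w = (U ** m)[:, :, None] * relm[None, :, :]          # (u_il R_l(x_k))^m
        V = (w @ X).sum(axis=1) / w.sum(axis=(1, 2))[:, None]        # eq. (9)
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(-1)          # ||x_k - v_i||^2
        D = np.einsum('lk,ik->il', relm, d2)                         # sum_k R_l^m d_ik^2
        U_new = 1.0 / (1.0 + (D / eta) ** (1.0 / (m - 1)))           # possibilistic update
        if np.abs(U_new - U).max() < tol:                            # step 4
            return U_new, V
        U = U_new
    return U, V

def hps_output(mu_x, theta, U, alphas):
    # HPS output of eq. (11); mu_x[l] = mu^l(x) rule activations, theta[l] rule centres.
    num = sum(a * np.sum(theta * mu_x * U[i]) for i, a in enumerate(alphas))
    den = sum(a * np.sum(mu_x * U[i]) for i, a in enumerate(alphas))
    return num / den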
If the rules describe a region S instead of a set of points, and the relevance membership function of the rules is symmetrical, equation (10) is reformulated as:
µ_il^(r+1) = 1 / Σ_{j=1..c} ( ‖x_l − v_i^(r)‖²_A / η_i )^(1/(m−1))    (12)
where x_l is the centre of rule l. The shapes of these membership functions are determined by the iterative update procedure above, so as to minimize the objective function (8). However, the user might prefer membership function shapes that are considered more useful for a given application, even though the objective function is then abandoned. The P-FCAFR algorithm and the Alternating Cluster Estimation of fuzzy rules [1][6], with Cauchy membership functions, were used in the identification of the Abington Cross.
4
Experimental Results
The Abington Cross image (Fig. 2a) was used to illustrate the proposed approach. It is a critical image for obtaining the cross clustering due to the high amount of noise. The image is a 256×256 pixel, grey-level image. The intensity dynamic range has been normalized to the [0, 1] interval. The image is a set of points in R³ with coordinates (x, y, I), where I(x, y) is the intensity of the image at point (x, y). I = 0 corresponds to the white colour, and I = 1 corresponds to the black colour. The aim is to partition the fuzzy system into five clusters. In the first step, the system is modelled as a set of rules, using the nearest neighbourhood method [7]. The
resulting system after the learning process has 1400 fuzzy rules. The output of the system at this stage is shown in Fig. 2b.
Fig. 2. Grey-scale image with the shape of the Abington Cross: a) original image; b) fuzzy image
The second step consists of the segmentation of the image into 5 clusters (with m = 2), each representing a fuzzy system in an HPS structure, using the P-FCAFR algorithm presented in section 3. Four of them represent the background (each representing the background rectangle in one corner of the image) and one represents the area containing the white cross. The degree of membership of each rule to each rule cluster has been measured using the weighted distance of the centre of the rule to the centre of the cluster, η. Figs. 3a-3e show the individual output responses of each hierarchical fuzzy model. The original image can be described as their aggregation (equation (11)).
5
Conclusions
In this work, the mathematical foundations for the possibilistic fuzzy clustering of fuzzy rules were presented, in which the relevance concept has a significant importance. Based on this concept, it is possible to build a fuzzy clustering algorithm of fuzzy rules, which is naturally a generalization of possibilistic fuzzy clustering algorithms. Many other (crisp) clustering algorithms can now be applied to fuzzy rules. The P-FCAFR Algorithm organizes the rules of fuzzy systems in the HPS structure, based on the SLIM methodology, where the partition matrix can be interpreted as containing the values of the relevance of the sets of rules in each cluster. The proposed clustering algorithm has demonstrated the ability to deal with an image strongly corrupted by noise, the Abington Cross image. The test image was first identified using the nearest neighbourhood method. The resulting fuzzy rules were successfully grouped into 5 clusters using the proposed P-FCAFR Algorithm. Each resulting cluster represents one meaningful region of the image. This corroborates the proposed approach.
Fig. 3. Images representing the output result of the five background clusters
References
[1] Hoppner, F., et al.: Fuzzy Cluster Analysis: Methods for Classification, Data Analysis and Image Recognition. Wiley (1999).
[2] Salgado, P.: Clustering and hierarchization of fuzzy systems. Submitted to Soft Computing Journal.
[3] Yager, R.: On a Hierarchical Structure for Fuzzy Modeling and Control. IEEE Trans. on Syst., Man, and Cybernetics, 23 (1993) 1189-1197.
[4] Yager, R.: On the Construction of Hierarchical Fuzzy Systems Models. IEEE Trans. on Syst., Man, and Cyber. – Part C, 28 (1998) 55-66.
[5] Salgado, P.: Relevance of the fuzzy sets and fuzzy systems. In: Systematic Organization of Information in Fuzzy Logic, NATO Advanced Studies Series, IOS Press. In publication.
[6] Runkler, T. A., Bezdek, C.: Alternating Cluster Estimation: A New Tool for Clustering and Function Approximation. IEEE Trans. on Fuzzy Syst., 7 (1999) 377-393.
[7] Wang, L.-X.: Adaptive Fuzzy Systems and Control: Design and Stability Analysis. Prentice Hall, Englewood Cliffs, NJ (1994).
Fuzzy Knowledge Based Guidance in the Homing Missiles Mustafa Resa Becan and Ahmet Kuzucu System Dynamics and Control Unit, Mechanical Engineering Faculty Istanbul Technical University, Turkey
Abstract. A fuzzy knowledge based tail pursuit guidance scheme is proposed as an alternative to the conventional methods. The purpose of this method is to perform the tracking and interception using a simplified guidance law based on a fuzzy controller. Noise effects in homing missile guidance must be reduced to obtain accurate control. In this work, the developed fuzzy tail pursuit control algorithm reduces the noise effects. Then, a knowledge based algorithm is derived from the fuzzy control behavior to obtain a simplified guidance law. Simulation results show that the new algorithm leads to satisfactory performance for the purpose of this work.
1
Introduction
Tail pursuit is one of the most common conventional guidance methods in the homing missile area. However, homing missile guidance involves uncertain parameters of the target maneuver, and the target behavior is observed through noisy measurements. In these cases, conventional guidance methods may not be sufficient to achieve tracking and interception. Fuzzy control has suitable properties to overcome such difficulties. Fuzzy controllers have been used in systems whose variables have fuzzy characteristics or for which deterministic mathematical models are difficult or impossible to obtain [1], [2]. Recently developed neuro-fuzzy techniques serve as possible approaches for nonlinear flight control problems [3-5]. However, only a limited number of papers have addressed the issue of fuzzy missile guidance design [6-8]. Fuzzy tail pursuit guidance is designed considering the noise originating from thermal and radar detection sources at the system input. This noise affects the guidance system entirely. Fuzzy tail pursuit (FTP) guidance is applied to a homing missile with known aerodynamic coefficients and their variations. The purpose of this paper is not only to perform the FTP application but also to search for a simplified knowledge based guidance law inspired by the fuzzy control results. The objective of this research is to obtain a rule based, low cost guidance control free from the computational complexities that a full fuzzy controller requires.
2
Definition of the Guidance Control System
Figure 1 shows the interception geometry for the missile model and it presents some symbols used in this study. Missile motions are constrained within the horizontal X-Y plane. State variables from Fig.1 are the flight path angle γ , missile velocity Vm, attitude angle θ, angle of attack α , pitch rate q, missile and target positions in the inertial space Xm, Ym, XT, YT and the target angle β. The missile is represented by an aerodynamic model with known coefficients.
Fig. 1. Interception Geometry
2.1
Conventional Tail Pursuit (CTP) Design
The line of sight (LOS) is the line connecting the missile and the target. The purpose of the tail pursuit guidance is to keep the flight path angle γ equal to the line of sight angle σ. This goal leads to the design of a closed loop control around the LOS angle.
Fig. 2. Block Diagram of CTP
Fig. 2 shows the simple block diagram of the CTP. The controller inputs are the error e and the first derivative of the error ė. The noise effect is symbolized as w, and σ̇ is the variation of the LOS angle. The CTP output can be written in PD form as shown below:
u = K_p·e + K_v·ė    (1)
where Kp and Kv are the control coefficients chosen for a desired system performance.
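A trivial Python sketch of the PD guidance command of eq. (1); the definition of the error as e = σ − γ follows the text, while the gain values and signal names are placeholders:

def ctp_command(sigma, gamma, sigma_dot, gamma_dot, Kp, Kv):
    # Conventional tail-pursuit command of eq. (1): track the LOS angle with a PD law.
    e = sigma - gamma                 # LOS-angle tracking error
    e_dot = sigma_dot - gamma_dot
    return Kp * e + Kv * e_dot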
3
Fuzzy Tail Pursuit (FTP) Design
The target position measurement is not precise, and some noise is superposed on it. This particularity makes fuzzy tail pursuit an alternative to the conventional version. The noise is modeled as a Gaussian density function in this study. The input and output variables of the fuzzy controller are linguistic variables taking linguistic values. The input linguistic variables are the error (e) and the change of error (ė), and the linguistic output variable is the control signal u. The linguistic variables are expressed by linguistic sets. Each of these variables is assumed to take seven linguistic sets, defined as negative big (NB), negative medium (NM), negative small (NS), zero (ZE), positive small (PS), positive medium (PM), positive big (PB). The triangular membership functions (MF) for the error input are scaled in synchronization with the Gaussian density function distribution shown in Fig. 3. This approach is preferred in order to match the membership function distribution with the noisy measurement distribution. The universe of discourse of the MF (e/emax) was chosen in regular form at first, but satisfactory results could not be obtained. It was then changed by trial and error to provide accordance with the density function, and this gave the best results. Consequently, the most reasonable way to design the MF was determined as presented in Fig. 3.
Fig. 3. Density Function and MF Design
A set of fuzzy PD rules has been applied in a real-time application [9]. Table 1 shows the fuzzy PD rule set following this approach.
Table 1. Rule Table of Fuzzy Tail Pursuit
Minimum (Mamdani type) inference is used to obtain the best possible conclusions [8]. This type of inference allows easy and effective computation and is appropriate for real time control applications. The outputs of the linguistic rules are fuzzy, but the guidance command must be crisp. Therefore, the outputs of the linguistic rules must be defuzzified before sending them to the plant. The crisp control action is calculated here using the center of gravity (COG) defuzzification method.
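A hedged Python sketch of this inference step: triangular memberships over the seven sets, minimum (Mamdani) inference over the 7×7 PD rule table, and COG defuzzification over singleton output centres. The set centres, the rule table contents (Table 1 is not reproduced here) and the singleton simplification of the output sets are all assumptions.

def tri(x, a, b, c):
    # Triangular membership with feet a, c and peak b (saturating when a == b or b == c).
    if x <= b:
        return 1.0 if a == b else max(0.0, (x - a) / (b - a))
    return 1.0 if b == c else max(0.0, (c - x) / (c - b))

def memberships(x, centres):
    # Membership of a normalised input to the seven sets NB ... PB.
    mu = []
    for i, b in enumerate(centres):
        a = centres[i - 1] if i > 0 else b
        c = centres[i + 1] if i < len(centres) - 1 else b
        mu.append(tri(x, a, b, c))
    return mu

def ftp_command(e, e_dot, e_centres, de_centres, rule_table, out_centres):
    # Minimum (Mamdani) inference over the 7x7 PD rule table, then centre-of-gravity
    # defuzzification; rule_table[i][j] is the index of the output set for (e_i, de_j).
    mu_e, mu_de = memberships(e, e_centres), memberships(e_dot, de_centres)
    num = den = 0.0
    for i in range(7):
        for j in range(7):
            w = min(mu_e[i], mu_de[j])               # rule firing strength
            num += w * out_centres[rule_table[i][j]]
            den += w
    return num / den if den > 0.0 else 0.0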
4
Control Applications
The CTP and FTP strategies are applied in PD form to a homing missile model. The aerodynamic coefficients are evaluated from tabulated experimental values in the simulation environment. The fuzzy control coefficients are scaled according to the system limitations. Fig. 4 displays the interception trajectory and the LOS performance of the application. The results show that the FTP performs the interception without time delay.
Fig. 4. FTP Behaviors
Fig. 5. Comparison of the Control Behaviors
The control behaviors obtained by CTP and FTP are shown in Fig. 5. The PD controller coefficients have been determined considering a second order system model with a natural frequency of wn = 6 Hz and a damping ratio of ζ = 0.7 in the CTP application. It is clear that better noise filtering performance may be obtained using prefilters or a more sophisticated PD type controller design, so the comparison may seem unfair. Still, the improvement in noise filtering of the fuzzy controller is impressive. Taking into account the distribution of measurements instead of an instantaneous measurement during fuzzification, and the use of the COG method in defuzzification, both of which have some 'averaging' property, considerably reduces the propagation of noise to the control variable behavior. On the other hand, it is clear that a pure theoretical derivative effect in the CTP scheme will amplify the noise effect if suitable filtering is not added to the controller for the frequency band under consideration.
5
Derivation of Knowledge Based Control
Homing missiles are launched only once. Therefore, low cost controllers are needed in such applications. A knowledge based algorithm is developed using the fuzzy results to obtain a guidance controller less demanding than the full fuzzy controller. Our goal is to obtain a guidance scheme related to the fuzzy tail pursuit, a new algorithm requiring less sophisticated computing power. Fig. 6 shows the block diagram of the proposed guidance control system.
Fig. 6. Knowledge Based Control Block Diagram
The fuzzy surface of the fuzzy tail pursuit controller, obtained through off-line simulation studies, is shown in Fig. 7. The membership functions for the knowledge based control are rearranged using this surface. Table 2 presents the rules rewritten using the FTP control behavior of Fig. 5. The control values are crisp and range from –3 to +3 to match the control variable variation in Fig. 5. Fig. 8 shows that the derived guidance law performed the interception within 7.55 s using the proposed knowledge based discontinuous control. This value is very close to the interception time obtained with the fuzzy tail pursuit guidance.
Fig. 7. Fuzzy Surface Between the Input-Output Values
Table 2. Knowledge Based Rules
Fig. 8. Knowledge Based Guidance Behaviors
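A minimal Python sketch of the resulting knowledge based (discontinuous) guidance step: classify (e, ė) into one of the 7×7 regions derived from the fuzzy surface and return the stored crisp command in −3…+3. The bin boundaries and the table contents stand in for Table 2 and Fig. 7 and are placeholders.

import bisect

def kb_command(e, e_dot, e_bounds, de_bounds, crisp_table):
    # Knowledge-based guidance: a simple region lookup instead of full fuzzy inference.
    # e_bounds and de_bounds are 6 increasing thresholds splitting each input into 7 bins;
    # crisp_table[i][j] holds the stored crisp command (-3 ... +3) for that region.
    i = bisect.bisect_right(e_bounds, e)
    j = bisect.bisect_right(de_bounds, e_dot)
    return crisp_table[i][j]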
6
Conclusion
A fuzzy tail pursuit guidance scheme for a homing missile has been presented in this work. The simulation results showed that the proposed method achieves a relatively noise free guidance without interception time delay, using the filtering property of the fuzzy control. Then, a knowledge based guidance law was derived from the fuzzy tail pursuit results to obtain a simple discontinuous control. This work shows that the guidance can be performed using a knowledge based algorithm, because the line of sight behavior is quite satisfactory. The results encourage future studies on fuzzy tail pursuit and knowledge based control in the homing missile guidance area.
References
[1] Lee, C.C., "Fuzzy Logic in Control Systems: Fuzzy Logic Controller: Part I & Part II", IEEE Trans. Syst. Man and Cybernetics, 20, 404-435, 1990
[2] Williams, T., "Fuzzy Logic Simplifies Complex Control Problems", Computer Design, 90-102, 1991
[3] Huang, C., Tylock, J., Engel, S., and Whitson, J., "Comparison of Neural Network Based, Fuzzy Logic Based, and Numerical Nonlinear Inverse Flight Control", Proceedings of the AIAA Guidance, Navigation and Control Conference, AIAA, Washington DC, 1994, pp. 922-929
[4] Geng, Z.J., and McCullough, C.L., "Missile Control Using Fuzzy Cerebellar Model Arithmetic Computer Neural Networks", Journal of Guidance, Control and Dynamics, Vol. 20, No. 3, 1997, pp. 557-565
[5] Lin, C.L., and Chen, Y.Y., "Design of an Advanced Guidance Law Against High Speed Attacking Target", Proceedings of the National Science Council, Part A, Vol. 23, No. 1, 1999, pp. 60-74
[6] Mishra, K., Sarma, I.G., and Swamy, K.N., "Performance Evaluation of the Fuzzy Logic Based Homing Guidance Schemes", Journal of Guidance, Control and Dynamics, Vol. 17, No. 6, 1994, pp. 1389-1391
[7] Gonsalves, P.G., and Çağlayan, A.K., "Fuzzy Logic PID Controller for Missile Terminal Guidance", Proceedings of the 1995 IEEE International Symposium on Intelligent Control, IEEE Publications, Piscataway, NJ, 1995, pp. 377-382
[8] Lin, C.L., and Chen, Y.Y., "Design of Fuzzy Logic Guidance Law Against High Speed Target", Journal of Guidance, Control and Dynamics, Vol. 23, No. 1, January-February 2000, pp. 17-25
[9] Hoang, L.H., and Maher, H., "Control of a Direct-Drive DC Motor by Fuzzy Logic", IEEE Industry Applications Society, Vol. VI, 1993, pp. 732-738
Evolutionary Design of Rule Changing Cellular Automata Hitoshi Kanoh1 and Yun Wu2 1
Institute of Information Sciences and Electronics, University of Tsukuba Tsukuba Ibaraki 305-8573, Japan [email protected] 2 Graduate School of Systems and Information Engineering University of Tsukuba, Tsukuba Ibaraki 305-8573, Japan [email protected]
Abstract. The difficulty of designing the transition rules of cellular automata to perform a particular problem has severely limited their applications. In this paper we propose a new programming method for cellular computers using genetic algorithms. We consider a pair of rules and the number of rule iterations as a step in the computer program. The present method is meant to reduce the complexity of a given problem by dividing the problem into smaller ones and assigning a distinct rule to each. Experimental results using density classification and synchronization problems show that our method is more efficient than a conventional one.
1
Introduction
Recently, evolutionary computations on parallel computers have gained attention as a method of designing complex systems [1]. In particular, parallel computers based on cellular automata (CAs [2]), that is, cellular computers, have the advantages of vastly parallel, highly local connections and simple processors, and have attracted increased research interest [3]. However, the difficulty of designing CA transition rules to perform a particular task has severely limited their applications [4]. The evolutionary design of CA rules has been studied in detail by the EVCA group [5]. A genetic algorithm (GA) was used to evolve CAs. In their study, a CA performs a computation in the sense that the input to the computation is encoded as the initial states of the cells, the output is decoded from the configuration reached at some later time step, and the intermediate steps that transform the input into the output are taken as the steps of the computation. Sipper [6] has studied a cellular programming algorithm for non-uniform CAs, in which each cell may contain a different rule. In that study, programming means the coevolution of neighboring, non-uniform CA rules with parallel genetic operations. In this paper we propose a new programming method for cellular computers using genetic algorithms. We consider a pair of rules and the number of rule iterations as a
step in the computer program, whereas the EVCA group considers an intermediate step of transformation as a step in the computation. The present method is meant to reduce the complexity of a given problem by dividing the problem into smaller ones and assigning a distinct rule to each one. Experimental results using density classification and synchronization problems prove that our method is more efficient than a conventional method.
2
Present Status of Research
2.1
Cellular Automata
In this paper we address one-dimensional CAs that consist of a one-dimensional lattice of N cells. Each cell can take one of k possible states. The state of each cell at a given time depends only on its own state at the previous time step and the states of its nearby neighbors at the previous time step, according to a transition rule R. A neighborhood consists of a cell and its r neighbors on either side. A major factor in CAs is how far one cell is from another. The CA rules can be expressed as a rule table that lists each possible neighborhood with its output bit, that is, the update value of the central cell of the neighborhood. Figure 1 shows an example of a rule table when k=2 and r=1. We regard the output bits "11101000" as a binary number. This number is converted to a decimal, which is 232, and we will denote the rule in Fig. 1 as rule 232. Here we describe CAs with a periodic boundary condition. The behavior of one-dimensional CAs is usually displayed by space-time diagrams in which the horizontal axis depicts the configuration at a certain time t and the vertical axis depicts successive time steps. The term "configuration" refers to the collection of local states over the entire lattice, and S(t) denotes a configuration at time t.
Fig. 1. An example of a rule table (k=2, r=1)
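A minimal Python sketch of one synchronous update under this rule-number convention (the most significant output bit corresponds to the all-1s neighborhood, as in Fig. 1); the function name is illustrative:

def ca_step(config, rule, r=1):
    # One synchronous update of a k=2 CA with a periodic boundary; `rule` is the decimal
    # rule number, whose binary digits give the output bit for each neighborhood.
    n = len(config)
    new = []
    for i in range(n):
        idx = 0
        for j in range(-r, r + 1):                 # read the neighborhood left to right
            idx = (idx << 1) | config[(i + j) % n]
        new.append((rule >> idx) & 1)              # look up the output bit
    return new

# Example: rule 232 ("11101000") is the local majority rule for r = 1, so
# ca_step([1, 0, 1, 0, 0], 232) returns [0, 1, 0, 0, 0].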
2.2
Computational Tasks for CAs
We used the density classification and the synchronization tasks as benchmark problems [4, 7]. The goal of the density classification task is to find a transition rule that decides whether or not the initial configuration S(0) contains a majority of 1s. Here ρ(t) denotes the density of 1s in the configuration at time t. If ρ(0) > 0.5, then within M time steps the CA should go to the fixed-point configuration of all 1s (ρ(t) = 1 for t ≥ M); otherwise, within M time steps it should produce the fixed-point configuration of all 0s (ρ(t) = 0 for t ≥ M). The value of the constant M depends on the task. The second task is one of synchronization: given any initial configuration S(0), the CA should reach, within M time steps, a final configuration that oscillates between all 0s and all 1s in successive time steps.
2.3
Previous Work
The evolutionary design of CA rules has been studied by the EVCA group [3, 4] in detail. A genetic algorithm (GA) was used to evolve CAs for the two computational tasks. That GA was shown to have discovered rules that gave rise to sophisticated emergent computational strategies. Sipper [6] has studied a cellular programming algorithm for 2-state non-uniform CAs, in which each cell may contain a different rule. The evolution of rules is here performed by applying crossover and mutation. He showed that this method is better than uniform (ordinary) CAs with a standard GA for the two tasks. Meanwhile, Land and Belew [8] proved that the perfect two-state rule for performing the density classification task does not exist. However, Fuks [9] showed that a pair of human written rules performs the task perfectly. The first rule is iterated t1 times, and the resulting configuration of CAs is processed by another rule iterated t2 times. Fuks points out that this could be accomplished as an “assembly line” process. Other researchers [10] have developed a real-world application for modeling virtual cities that uses rule-changing, two-dimensional CAs to lay out the buildings and a GA to produce the time series of changes in the cities. The GA searches the sequence of transition rules to generate the virtual city required by users. This may be the only application using rule changing CAs.
3
Proposed Method
3.1
Computation Based on Rule Changing CAs
In this paper, a CA in which an applied rule changes with time is called a rule changing CA, and a pair of rules and the number of rule iterations can be considered as a step in the computer program. The computation based on the rule changing CA can thus operate as follows:
Step 1: The input to the computation is encoded as the initial configuration S(0).
Step 2: Apply rule R1 to S(0) M1 times; ……; apply rule Rn to S(M1+…+Mn-1) Mn times.
Step 3: The output is decoded from the final configuration S(M1+…+Mn).
In this case, n is a parameter that depends on the task, and rule Ri and the number of rule iterations Mi (i=1, …, n) can be obtained by the evolutionary algorithm described in the next section. The present method is meant to reduce the complexity of a given task by dividing the task into n smaller tasks and assigning (Ri, Mi) to the i-th task.
3.2
Evolutionary Design
Each chromosome in the population represents a candidate set of Ri and Mi as shown in Fig. 2. The algorithm of the proposed method is shown in Fig. 3.
R1 | M1 | ……… | Rn | Mn
Fig. 2. Chromosome of the proposed method (Ri: rule, Mi: the number of rule iteration)

procedure main
  initialize and evaluate population P(0);
  for g = 1 to the upper bound of generations {
    apply genetic operators to P(g) and create a temporary population P'(g);
    evaluate P'(g);
    get a new population P(g+1) by using P(g) and P'(g);
  }

procedure evaluate
  for i = 1 to the population size {
    operate CA with the rules on the i-th individual;
    calculate fitness of the i-th individual;
  }

Fig. 3. Algorithm of the proposed method (procedure main and procedure evaluate)
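A hedged Python sketch of procedure evaluate for the density classification task: run the rule-changing CA (R1 applied M1 times, then R2 applied M2 times, and so on) on random initial configurations and return the fraction classified correctly. The helper ca_step repeats the convention sketched in section 2.1, and the uniform-density test generation follows the fitness description of section 3.3; names and defaults are illustrative.

import random

def ca_step(config, rule, r):
    # One synchronous CA update (k=2, periodic boundary); same convention as in section 2.1.
    n = len(config)
    out = []
    for i in range(n):
        idx = 0
        for j in range(-r, r + 1):
            idx = (idx << 1) | config[(i + j) % n]
        out.append((rule >> idx) & 1)
    return out

def density_fitness(chromosome, r, N=149, n_tests=100, seed=0):
    # chromosome = [(R1, M1), (R2, M2), ...]; fraction of random initial configurations
    # (densities drawn uniformly over [0, 1]) that end in the correct fixed point.
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_tests):
        rho0 = rng.random()
        config = [1 if rng.random() < rho0 else 0 for _ in range(N)]
        target = 1 if 2 * sum(config) > N else 0
        for rule, steps in chromosome:
            for _ in range(steps):
                config = ca_step(config, rule, r)
        correct += all(cell == target for cell in config)
    return correct / n_tests

For example, density_fitness([(184, 124), (232, 25)], r=1) evaluates the rule pair recovered in Table 1.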
3.3
Application
In this section we describe an application that uses the CA with two rules for density classification and synchronization tasks.
Chromosome Encoding
In the case of the CA with two rules, a chromosome can generally be expressed by the left part in Fig. 4. The right part in Fig. 4 shows an example of the chromosome when k=2 and r=1. Each rule and the number of iterations are expressed by the output bits of the rule table and a nonnegative integer, respectively.
R1 | M1 | R2 | M2 = 1 1 1 0 1 0 0 0 | 45 | 0 1 0 0 1 0 0 1 | 25
Fig. 4. Example of the chromosome of the proposed method for the CA with two rules
Fitness Calculation
The fitness of an individual in the population is the fraction of the Ntest initial configurations in which the individual produced the correct final configurations. Here the initial configurations are uniformly distributed over ρ(0) ∈ [0.0, 1.0].
Genetic Operations
First, generate Npop individuals as an initial population. Then the following operations are repeated for Ngen generations.
Step 1: Generate a new set of Ntest initial configurations, and calculate the fitness on this set for each individual in the population.
Step 2: The individuals are ranked in order of fitness, and the top Nelit elite individuals are copied to the next generation without modification.
Step 3: The remaining (Npop − Nelit) individuals for the next generation are formed by single-point crossovers between the individual randomly selected from Nelit
elites and the individual selected from the whole population by roulette wheel selection. Step 4: The rules R1 and R2 on the offspring from crossover are each mutated at exactly two randomly chosen positions. The numbers of iterations M1 and M2 are mutated with probability 0.5 by substituting random numbers less than an upper bound.
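A short Python sketch of the step-4 mutation for the two-rule chromosome; the representation of rules as integers and the handling of the iteration counts are assumptions, and the experimental constraint M1 + M2 = 149 used in section 4 would have to be enforced separately.

import random

def mutate(individual, rule_bits, m_upper, rng=random):
    # Step 4: each rule is mutated at exactly two randomly chosen (distinct) bit positions,
    # and each iteration count is replaced, with probability 0.5, by a random number
    # below the upper bound.
    (r1, m1), (r2, m2) = individual
    for pos in rng.sample(range(rule_bits), 2):
        r1 ^= 1 << pos
    for pos in rng.sample(range(rule_bits), 2):
        r2 ^= 1 << pos
    if rng.random() < 0.5:
        m1 = rng.randrange(m_upper)
    if rng.random() < 0.5:
        m2 = rng.randrange(m_upper)
    return [(r1, m1), (r2, m2)]

# For r = 3 rules, rule_bits = 2 ** 7 = 128.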
4
Experiments
4.1
Experimental Methods
The experiments were conducted under the following conditions: the number of states k=2, the size of the lattice N=149, and M1+M2=149 for the CA; the population size Npop=100, the upper bound on generations Ngen=100, the number of elites Nelit=20, and the number of initial configurations for testing Ntest=100 for the GA.
4.2
Density Classification Task (r=1)
Table 1 shows the best rules in the population at the last generation. The obtained rules agree with the human written rules shown by Fuks [9], which can perform this task perfectly. Table 1. Rules obtained by the genetic algorithm for the density classification task (r=1, k=2, N=149)
               R1    M1    R2    M2
Experiment-1   184   124   232   25
Experiment-2   226    73   232   76
4.3
Density Classification Task (r=3)
Figure 5 shows the comparison of the proposed method with the conventional method that uses only one rule as the chromosome [5]. Each point in the figure is the average of ten trials. It is seen from Fig. 5 that the fitness of the former at the last generation (0.98) is higher than that of the latter (0.91). Table 2 shows the best rules obtained by the proposed method, and Fig. 6 shows examples of the space-time diagrams for these rules. The rules in Table 2 have been converted to hexadecimal. Table 2. The best rules obtained by the genetic algorithm for the density classification task (r=3, k=2, N=149)
R1 01000100111C0000 000004013F7FDFFB
M1 101
R2 97E6EFF6E8806448 4808406070040000
M2 48
Fig. 5. The best fitness at each generation for density classification tasks (r=3, k=2, N=149)
Fig. 6. Examples of space-time diagrams for a density classification task (r=3, k=2, N=149)
4.4
Synchronization Task (r=3)
Figure 7 shows the results of the same comparison as in Fig. 5 for the synchronization task. The fitness of the proposed method reaches 1.0 by the 3rd generation, whereas the conventional method requires more than 20 generations.
Fig. 7. The best fitness at each generation for the synchronization task (r=3, k=2, N=149)
5
Conclusions
In this paper we proposed a new programming method of cellular computers. The experiments were conducted using a rule changing CA with two rules. In a forthcoming study, the performance of a CA with more than two rules will be examined.
References
[1] Alba, E. and Tomassini, M.: Parallelism and Evolutionary Algorithms. IEEE Transactions on Evolutionary Computation, Vol. 6, No. 5 (2002) 443-462.
[2] Wolfram, S.: A New Kind of Science. Wolfram Media Inc. (2002).
[3] Sipper, M.: The Emergence of Cellular Computing. IEEE Computer, Vol. 32, No. 7 (1999).
[4] Mitchell, M., Crutchfield, J., and Das, R.: Evolving Cellular Automata with Genetic Algorithms: A Review of Recent Work. Proceedings of the First International Conference on Evolutionary Computation and Its Applications (1996).
[5] Mitchell, M., Crutchfield, J. P., and Hraber, P.: Evolving Cellular Automata to Perform Computations: Mechanisms and Impediments. Physica D 75 (1994) 361-391.
[6] Sipper, M.: Evolution of Parallel Cellular Machines: The Cellular Programming Approach. Lecture Notes in Computer Science, Vol. 1194, Springer-Verlag (1997).
[7] Das, R., Crutchfield, J., Mitchell, M., and Hanson, J.: Evolving Globally Synchronized Cellular Automata. Proceedings of the 6th ICGA (1995) 336-343.
[8] Land, M. and Belew, R.: No Two-State CA for Density Classification Exists. Physical Review Letters 74 (1995) 5148.
[9] Fuks, H.: Solution of the Density Classification Problem with Two Cellular Automata Rules. Physical Review E, 55(3) (1997) 2081-2084.
[10] Kato, N., Okuno, T., Suzuki, R., and Kanoh, H.: Modeling Virtual Cities Based on Interaction between Cells. IEEE International Conference on Systems, Man, and Cybernetics (2000) 143-148.
Dynamic Control of the Browsing-Exploitation Ratio for Iterative Optimisations L. Baumes1, P. Jouve1, D. Farrusseng2, M. Lengliz1, N. Nicoloyannis1, and C. Mirodatos1 1
Institut de Recherche sur la Catalyse-CNRS 2, Avenue Albert Einstein - F-69626 Villeurbanne Cedex France 2 Laboratoire Equipe de Recherche en Ingénierie des Connaissances Univ. Lumière Lyon2 -5, Avenue Pierre Mendès-France 69676 Bron Cedex France [email protected]
Abstract. A new iterative optimisation method based on an evolutionary strategy is proposed. The algorithm, which proceeds on a binary search space, combines a Genetic Algorithm and a knowledge extraction engine that is used to monitor the optimisation process. In addition to the boosted convergence to the optima in a fixed number of iterations, this method makes it possible to generate knowledge in the form of association rules. The new algorithm is applied for the first time to the design of heterogeneous solids in the frame of a combinatorial program of catalyst development.
1
Introduction
Today, the Genetic Algorithm (GA) approach is a standard and powerful iterative optimisation concept, mainly due to its unique adaptation skills in highly diverse contexts. In a sequential manner, it consists of designing populations of individuals that exhibit superior fitness, generation after generation. As the process proceeds, the population focuses on the optima of the parameter space. The optimisation process is over when criteria on the objective function are achieved. Very recently, this approach has been successfully applied to the optimisation of advanced materials (e.g. catalysts) for petro-chemistry [1,2]. In this frame, the population consists of around 50 catalysts that are prepared and tested by automated working stations within a few days [3]. In addition to the demand for quick optimisation, the total number of iterations was imposed. In this case, convergence of GAs may not be reached due to the "time" constraint. In turn, we face a particular problem: "Knowing that the number of iterations is low and fixed a priori, can efficient optimisations be carried out?" Thus, our strategy is to "act" on the stochastic character of the genetic operators in order to guide the GA, without any modification of its algorithmic structure. It becomes clear that the relevance of the population design must be closely linked to the
remaining time (i.e. the number of iterations left). Indeed, the Browsing-Exploitation ratio should be monitored according to the number of iterations left, which allows the algorithm to get the best out of all the iterations at its disposal. Apart from Case Based Reasoning [4] and Supervised Learning methods [5], only a very limited number of works deal with this purpose. This work presents a learning method based on the extraction of equivalence classes, which accumulates strategic information on the whole previously explored search space and enables the dynamic control of the Browsing/Exploitation ratio.
2
Context of the Catalyst Developments and Optimisation Requirements
Because industrial equipment is designed to prepare and test collections of catalysts (called libraries), iterative optimisation, and more specifically the evolutionary approach, is very well suited. The goal is to optimise the properties of the catalysts (product quality). The optimisation cycle consists of designing a new population of individuals (catalysts) in light of the testing of previous individuals. In the context of heterogeneous catalysis, there exist neither strong general theories nor empirical models that allow a catalyst formulation to be optimised for new specifications and requirements in a straightforward manner. Moreover, the probable behaviour of individuals cannot be estimated by simulation. In this study, we face the problem that the number of iterations is fixed a priori, and fixed at a low value. It is not the purpose of this study to seek efficient algorithms that reduce the number of iterations. In addition, since the synthesis and testing of catalysts takes half a day to a few days, the calculation time is not a limiting requirement. The context can be depicted as an industrial design process (Figure 1).
Fig. 1. Iterative Design Process
From a general point of view, the data analysis and decision step, which follows each population test, provides new guidelines for the improvement of individuals and also updates previous guidelines that have already been considered. Thus, knowledge about the individuals (catalysts) and their properties (yields) is created, updated in the course of the screening, and re-injected into the design process. In turn, each population design benefits from previous experiments. The knowledge mainly
arises from the discovery of relationships between catalyst variables and the respective yields. In addition, the discovery of relationships between the properties of the variables and the yields can occasionally provide very useful guidelines. Indeed, it allows passing from categorical variables, such as the elements of the Mendeleev periodic table, to continuous variables, such as their respective electronegativity values. When this additional information is introduced at the starting point, the optimisation process can be boosted because of the wider knowledge accumulated. Because of the very limited number of iterations, usual optimisation methods such as GAs would not be efficient to address this issue. Furthermore, GAs do not take into account the information linked to the catalysts (variable properties), so the knowledge that would be extracted in the course of the screening could not be re-injected into the optimisation process. In addition, the stochastic feature of the genetic operators can become a major problem when the number of iterations is very limited. In the industrial process, the browsing of the search space must be promoted at the early stage of the screening, whereas exploitation must take more and more priority as the end of the optimisation process approaches. The genetic operators work simultaneously on both browsing and exploitation. However, this Browsing-Exploitation ratio cannot be directly and dynamically controlled by the usual GA in the course of the screening.
3
A New Hybrid Algorithm for Iterative Optimisation
In order to extract and manage information in the optimisation cycle, we have developed a new approach that consists of hybridising a GA with a Knowledge Based System (KBS), which extracts empirical association rules from the whole set of previously tested catalysts and re-injects this strategic information to guide the optimisation process. The guidance is performed by monitoring the Browsing/Exploitation ratio with the KBS, which makes it possible to take the number of iterations into account dynamically.
3.1
AG and Knowledge-Based System Search Spaces
The algorithm operates with all kinds of GAs. The GA search space can consider ordered and non-ordered qualitative variables, which are either pure binary variables (presence or absence of elements), discretized continuous variables coded by a set of binary values (element concentration), or nominal variables coded by a set of binary values (Figure 2). The KBS search space includes the GA search space plus additional information consisting of inherent features of the GA variables, such as the pore size or the elemental composition of a discrete type of zeolite.
Table 1. GA and KBS representation

GA search space (preparation variables)              Extended search space (additional variables)
    Zeolite                        Cation/method  …  Cation charge   O-ring    Electronegativity                    …
    KL  ZSM-5  NaY  NaX  Na-beta   Li/imp  Li/ie     "+1"  "+2"      10  12    0.79-0.82  0.89-0.95  0.98-1  1.31
1   0   1      0    0    0         1       0      …  1     0         1   0     0          0          1       0     …
2   0   0      0    1    0         0       1      …  1     0         0   1     0          0          1       0     …
3   1   0      0    0    1         0       0      …  0     1         1   1     1          1          0       0     …
3.2
The Knowledge-Based System
The mechanism of the KBS cannot be extensively depicted in this extended abstract; details can be found elsewhere [6-8]. The KBS search space is binned into different zones that form a lattice. The space zone extraction consists in obtaining equivalence classes by the maximum square algorithm. Each of these zones is characterized by two coefficients, B and E, which characterize the Browsing and Exploitation levels, respectively. The relevance R of an individual x that lies in a zone Z_i is calculated as a function of B and E. We define the following:
R(h_1(x), h_2(x), t, µ)   with h = (f ∘ g),
h_1 = f_1(B(Z_i)),  h_2 = f_2(E(Z_i))   and   B(Z_i) = g_1(x, t, µ),  E(Z_i) = g_2(x, t, µ)
For example, we can define B, E and R as below:
B(Z_i) = (number of individuals x previously tested in Z_i) / card(Z_i)
E(Z_i) = mean fitness of the individuals x previously tested in Z_i
R(x_k) = Σ_{∀Z_i, x_k ∈ Z_i} E(Z_i) × (1 − B(Z_i))
In this example R is actually the Browsing-Exploitation ratio; other, more complex functions can be used. The total number of iterations (µ) can be injected here. The Browsing coefficient function makes it possible to estimate the "real" diversity of an individual with respect to the occupancy rates of all the classified zones, and quantifies its contribution to the extent of browsing of the total search space. The second coefficient, E, which evaluates the performances, takes into account the results of all individuals x in the same classification zones (for example their mean value). This second function quantifies the relevance of an individual for improving exploitation. In order to obtain a unique selection criterion, a third coefficient R(h1(x), h2(x), t), which combines the two former ones and the number of remaining individuals still to be tested t, is defined. It makes it possible to evaluate the global relevance of an individual with respect to its interest for both browsing and exploitation. In this manner, the relevance of the individuals constituting each new generation can be increased and controlled as a function of time.
3.3
3.3 Hybridization Mechanism
The individuals that exhibit the highest relevance constitute the next generation. Within the traditional framework, a GA takes as input a population of size K and proposes a new population of identical size (Fig. 2). In order to increase the number of individuals with high relevance R(x) in the output population (size K), one forces the GA to propose µK individuals (µ>1, µ∈IN), among which the K individuals showing the highest relevance R are selected. The choice of the various functions f1, f2, g1, g2 and R is of major importance and completely defines the manner of controlling the optimisation process.
Fig. 2. Mechanism of the GA-KBS coupling: starting from an initial population, the GA proposes µ × K individuals, the KBS selects K of them for evaluation, and the cycle repeats until the end criterion yields the final population of K individuals
4 Tests and Discussions
A validation of the algorithm has been carried out on the following benchmark, which simulates a real case of industrial optimisation:
• GA space: ∀i, {x1,…,x6} ∈ ℜ+; xi ∈ [0..1], binary coding on 4 bits.
• KBS space: xi, min(xi≠0), max(xi) in {[0..0.25[, [0.25..0.5[, [0.5..0.75[, [0.75..1[}; x1, x4 and (x1 & x4) presence or absence; ratio = max/min in {[0..1], ]1..2], ]2..4], ]4..8], ]8..16]}; number of elements (NbE) in {1; 2; 3; 4; 5; 6}.
• Fitness function: (4Σ(x1+x2+x3) + 2Σ(x4+x5) − x6) × promotor × factor × feasible, with: if (x1 or x4) > 0.15 then promotor = 0 else 1; if NbE < 4 then factor = NbE else (7−NbE); if ratio > 10 then feasible = 0 else 1 (a sketch of this computation is given below the list).
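As a concrete reading of the benchmark above, here is a minimal Python sketch of the fitness function. The interpretation of NbE as the number of non-zero components and of ratio as max/min of the non-zero components is an assumption, and all names are illustrative.

def benchmark_fitness(x):
    # x = (x1, ..., x6), each in [0, 1]
    x1, x2, x3, x4, x5, x6 = x
    base = 4 * (x1 + x2 + x3) + 2 * (x4 + x5) - x6
    # If (x1 or x4) > 0.15 then promotor = 0 else 1
    promotor = 0 if (x1 > 0.15 or x4 > 0.15) else 1
    # NbE: assumed here to be the number of non-zero components
    nbe = sum(1 for v in x if v > 0)
    factor = nbe if nbe < 4 else (7 - nbe)
    # ratio: assumed max/min over the non-zero components
    nonzero = [v for v in x if v > 0]
    ratio = max(nonzero) / min(nonzero) if nonzero else 0.0
    feasible = 0 if ratio > 10 else 1
    return base * promotor * factor * feasible

print(benchmark_fitness([0.1, 0.9, 0.8, 0.1, 0.3, 0.2]))  # example candidate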
A conventional GA is used for comparison purposes, with the following features: one-point crossover (number of parents: 2), roulette-wheel selection, a chromosome size of 24 (6*4) and a generation size (K) of 20. For the GA-KBS algorithm two studies have been carried out: one with µ1 fixed at 10 (Fig. 3) and one with µ2 = 2, increasing by one unit every 10 generations.
Fig. 3. Test results (fitness vs. generation): SGA (left), GA-KBS with µ=10 (middle), zoom on both (right)
In Fig. 3, the curves represent the mean value of the maxima found over 100 runs; the areas around the curves show the dispersion. The hybrid algorithm gives much better results even though its parameters are not optimised. We noticed that the algorithm performs better with µ2: the KBS cannot learn 'good' rules from the few data available at the beginning (low µ), but needs many different chromosomes (high µ) in order to choose, whereas GA generations often converge to local optima. Indeed, the KBS decreases the dispersion of the GA runs and increases the fitness results. The SGA converges to local optima and does not reach the global maximum (72) until much later. At the same time, the KBS tries to keep diversity in the generations. Therefore higher mutation and crossover rates are required in order to obtain individuals that belong to different KBS zones.
References
[1] Wolf, D.; Buyevskaya, O. V.; Baerns, M. Appl. Catal., A 2000, 200, 63-77.
[2] Corma, A.; Serra, J. M.; Chica, A. In Principles and methods for accelerated catalyst design and testing; Derouane, E. G., Parmon, V., Lemos, F., Ribeiro, F. R., Eds.; Kluwer Academic Publishers: Dordrecht, NL, 2002, pp 153-172.
[3] Farrusseng, D.; Baumes, L.; Vauthey, I.; Hayaud, C.; Denton, P.; Mirodatos, C. In Principles and methods for accelerated catalyst design and testing; Derouane, E. G., Parmon, V., Lemos, F., Ribeiro, F. R., Eds.; Kluwer Academic Publishers: Dordrecht, NL, 2002, pp 101-124.
[4] Ravisé, C.; Sebag, M.; Schoenauer, M. In Artificial Evolution; Alliot, J. M., Lutton, E., Ronald, Schoenauer, M., Snyers, D., Eds.; Springer-Verlag, 1996, pp 100-119.
[5] Rasheed, K.; Hirsh, H. In The Seventh International Conference on Genetic Algorithms (ICGA'97), 1997.
[6] Norris, E. M. Revue Roumaine de Mathématiques Pures et Appliquées 1978, 23, 243-250.
[7] Agrawal, R.; Imielinski, T.; Swami, A. In Proceedings of the ACM SIGMOD Conference, 1993.
[8] Belkhiter, N.; Bourhfi, C.; Gammoudi, M. M.; Jaoua, A.; Le Thanh, N.; Reguig, M. INFOR 1994, 32, 33-53.
Intelligent Motion Generator for Mobile Robot by Automatic Constructed Action Knowledge-Base Using GA
Hirokazu Watabe, Chikara Hirooka, and Tsukasa Kawaoka
Dept. of Knowledge Engineering & Computer Sciences, Doshisha University, Kyo-Tanabe, Kyoto, 610-0394, Japan
Abstract. Intelligent robots that can become partners to humans need the ability to act naturally. For the actions of such a robot, it is very important to recognize the environment and move naturally. In this paper, a learning algorithm that constructs knowledge of action for achieving the tasks given to a mobile robot is proposed. Conventionally, the knowledge of action for a robot has mostly been constructed by humans, but the more complex the information about the environment, the actions, and the task becomes, the more trial and error must be repeated. Therefore, the robot must learn and construct the knowledge of action by itself. In this study, the action to achieve a task in an environment is generated by a genetic algorithm, and it is shown that, by repeatedly extracting knowledge of action, the construction of an Action Knowledge-Base covering a task in any situation is possible.
1 Introduction
In the field of mobile robot control, many approaches have been tried to generate a robot control program or strategy given only the objective task [2][3][4][5][6][7][8]. Such a problem is not so difficult to solve if the size of the problem is small or the number of states of the robot is small. It becomes, however, very difficult when the size of the problem or the number of states of the robot is big. It is assumed that a mobile robot has some sensors and performs some actions. The robot can act properly, in some environment, if all action rules are given. An action rule is a kind of IF-THEN rule: the IF part is one of the states of the sensors and the THEN part is one of the actions that the robot can perform. The problem treated in this paper is to decide all action rules automatically. If the number of states of the sensors is small and the number of possible actions is also small, the problem is not so difficult. But if these numbers are relatively big, the problem becomes difficult to solve because the size of the solution space is proportional to the combination of those numbers. The problem becomes too difficult to solve if the number of states of the sensors is infinite. Originally, the number of states of the sensors is infinite since the
value of each sensor is a continuous variable. However, in conventional methods, each sensor variable must be quantized, and the number of states of the sensors must be finite and as small as possible. This constraint makes it hard to solve practical problems. In this paper, a learning method is proposed to construct the action knowledge base, which consists of all proper action rules to achieve the task, and by which the robot can act properly in any situation different from the learning situation. The learning method is basically constructed with a genetic algorithm (GA) [1]. The GA automatically generates action rules in the learning environments. To evaluate the proposed method, a mobile robot simulator is built and the expected results are demonstrated.
Fig. 1. Road Like Environment

Fig. 2. Mobile Robot
2 Target Environment and Mobile Robot
2.1 Assumed Environment
In this paper, the environment for the robot is a road-like one, shown in Figure 1. That is, there are two kinds of places: places where the robot can move and places where it cannot, separated by some kind of wall. In this environment, the robot can generate subgoals, which are intermediate points from the start point to the goal point, using its sonar sensors. Each subgoal has a position and a direction.
Fig. 3. Subgoal generator
2.2 Definition of Mobile Robot
In this paper, a Pioneer 2 type mobile robot is assumed. Figure 2 shows the structure of the robot. It is assumed that the robot recognizes the following five parameters (the unit length is the distance between the left and right wheels of the robot):
1. The distance to the subgoal: d (0 <= d, d is a floating-point number)
2. The angle between the robot direction and the direction from the robot to the subgoal: a (-180 degrees < a <= +180 degrees, a is a floating-point number)
3. The angle between the robot direction and the objective direction: b (-180 degrees < b <= +180 degrees, b is a floating-point number)
4. The speed of the left wheel: L (-10 <= L <= +10, L is an integer)
5. The speed of the right wheel: R (-10 <= R <= +10, R is an integer)
It is also assumed that the robot outputs the following two parameters:
1. Acceleration of the left wheel: X (-3 <= X <= +3, X is an integer)
2. Acceleration of the right wheel: Y (-3 <= Y <= +3, Y is an integer)
Therefore, the action rule of this robot has the following form:
IF (d, a, b, L, R) THEN (X, Y)
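As an illustration of this rule format, a minimal Python sketch of an action rule as a state/action pair is given below; the representation is hypothetical, not the authors' implementation.

from dataclasses import dataclass

@dataclass
class ActionRule:
    # IF part: perceived state (d, a, b, L, R)
    d: float   # distance to the subgoal
    a: float   # angle to the subgoal direction, in degrees
    b: float   # angle to the objective direction, in degrees
    L: int     # left wheel speed
    R: int     # right wheel speed
    # THEN part: wheel accelerations (X, Y)
    X: int
    Y: int

# Example rule: IF (2.5, 30.0, -10.0, 4, 4) THEN (1, -1)
rule = ActionRule(2.5, 30.0, -10.0, 4, 4, 1, -1)
print(rule)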
Fig. 4. Initial state

Fig. 5. Genotype

Fig. 6. Best actions
3 Planning and Moving Algorithm
The robot makes a plan and moves by the following algorithm:
1. Generate one subgoal on the road by observing the environment around the robot using sonar sensors (Fig. 3).
2. Move to the subgoal by using the Action Knowledge-Base (AKB).
3. The moving task is completed if the robot reaches the final goal; otherwise, go to step 1.
4 Learning Method
To realize the proposed algorithm, the AKB is required. The AKB is a knowledge base that holds action rules; each action rule is an IF-THEN type rule as described in Section 2.2. In this section, the method to generate the proper action rules automatically is described.
4.1 Learning Model
The learning process is performed in the mobile robot simulator. In this paper, the given task is to reach the subgoal (target place and target direction) as fast as possible with minimum energy. The following is an outline of the learning process:
1. Set the target place and target direction in the virtual environment.
2. Set the initial states of the robot: position, orientation, and the speeds of the left and right wheels.
3. Apply GA and obtain the best individual (consecutive actions) for the given initial states of the robot.
4. Extract action rules from the best individual (consecutive actions) and store these action rules in the AKB.
5. Repeat steps 1 to 4 until enough action rules are collected.
4.2 Applying GA
In this learning model, GA is applied to obtain one sequence of consecutive actions that takes the robot from an initial state to the objective state (position and direction). The following algorithm obtains such a sequence of actions:
1. Set the initial state of the robot (d1, a1, b1, L1, R1) at random (Fig. 4).
2. Generate N chromosomes at random (Fig. 5). Each chromosome encodes m consecutive actions of the robot. For example, the chromosome (3,0, -2,1, 0,3) represents three steps of actions: the left wheel is accelerated by 3 in the first step, the left wheel is accelerated by -2 and the right wheel by 1 in the second step, and the right wheel is accelerated by 3 in the third step.
3. Perform a simple GA process using an elite strategy, tournament selection of size 2, two-point crossover and mutation. The fitness function is the following (a small sketch of its computation is given after this list):
   fitness = w1*T + w2*D + w3*B + w4*E
   where T is the time to reach the subgoal if the robot reaches it within m steps, otherwise T = m; D is the distance between the robot and the subgoal at the final step when the robot did not reach the target within m steps; B is the difference between the robot direction and the subgoal direction at the m-th step; E is the total energy from the initial state to the final state (it is assumed that the robot uses unit energy for 3 accelerations of each wheel); and each wi is a weight parameter.
4. Obtain the best chromosome after the GA process (Fig. 6).
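The sketch below mirrors the weighted-sum form of the fitness function; the weights, the rollout that produces T, D, B and E, and whether the GA minimises this sum or uses negative weights are not specified here and are treated as assumptions.

def robot_fitness(T, D, B, E, w=(1.0, 1.0, 1.0, 1.0)):
    # fitness = w1*T + w2*D + w3*B + w4*E
    # T: steps used (or m if the subgoal was not reached)
    # D: final distance to the subgoal, B: final heading error,
    # E: total energy spent on accelerations.
    w1, w2, w3, w4 = w
    return w1 * T + w2 * D + w3 * B + w4 * E

# Example: reached the subgoal in 12 steps with small residual errors.
print(robot_fitness(T=12, D=0.0, B=5.0, E=8.0))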
4.3 Extract Action Rules
After a GA process is finished, some action rules are generated from the best chromosome of that GA process. Figure 7 illustrates the extraction of action rules. The robot is set in the initial state, and IF (the initial state) THEN (the action of the first step of the best chromosome) becomes one of the action rules, i.e. IF (d1, a1, b1, L1, R1) THEN (X1, Y1). This action rule is stored in the AKB. After the robot performs the action (X1, Y1), the state of the robot changes to (d2, a2, b2, L2, R2), and the next action rule is IF (d2, a2, b2, L2, R2) THEN (X2, Y2). By repeating this process, a maximum of m action rules are obtained.
Fig. 7. Extract action rules
5 Experimental Results
To evaluate the proposed method, computer experiments are performed.

Table 1. GA parameters
Maximum generation: 500
Population size: 50
Length of chromosome: 40
Selection: Tournament (size 2)
Elite: 2%
Crossover: 2-point
Crossover rate: 60%
Mutation rate: 10% per gene
Fig. 8. GA result (fitness)

Fig. 9. GA result (motion)
5.1 GA's Performance
The given task is to reach the subgoal (place and orientation) with minimum time and energy. Table 1 shows the GA parameters used in the learning stages. Figure 8 shows an example of the GA's performance as a fitness graph, and Figure 9 shows the robot motion of the best individual of the same example.
5.2 Performance of Learning Method
To evaluate the proposed learning method, the following experiment is performed.
1. Execute the learning stage 1,500 times and obtain action rules only when the motion of the best individual was successful. In each learning stage, one subgoal was placed in the environment and the robot was set in an initial state chosen at random. The total number of obtained action rules was about 50,000.
2. After the learning stage has been repeated, the evaluation stage is performed. In the evaluation stage, 100 initial states of the robot are generated at random, and the robot acts using the action knowledge base constructed in the learning stage. In each step the robot uses the action rule that is most similar to the current situation (a sketch of this lookup is given below). The success rate was 85%.
These results show that the proposed method is efficient and that about 1,500 learning runs are enough for this task.
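The rule-selection step ("the action rule most similar to the current situation") can be sketched as a nearest-neighbour lookup over the AKB; the distance metric below is an illustrative choice, not the one used by the authors.

import math

def most_similar_rule(akb, state):
    # akb: list of (if_state, action) pairs, where if_state = (d, a, b, L, R)
    # state: the robot's current (d, a, b, L, R)
    def distance(s1, s2):
        return math.sqrt(sum((u - v) ** 2 for u, v in zip(s1, s2)))
    best = min(akb, key=lambda rule: distance(rule[0], state))
    return best[1]  # the (X, Y) action of the closest stored rule

akb = [((2.0, 15.0, 0.0, 3, 3), (1, 0)),
       ((5.0, -40.0, 10.0, 2, 5), (-1, 2))]
print(most_similar_rule(akb, (2.2, 10.0, 1.0, 3, 3)))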
6 Conclusion
In this paper, a learning method was proposed to construct the action knowledge base by which the robot can act properly to achieve the given tasks in any situation. The learning method is basically constructed with GA. However, when GA is applied directly to the problem, the robot does not act properly in every situation, since GA can only be applied to fixed situations. Therefore, GA was used to obtain the action rules to achieve
each given task, and the action knowledge base was constructed as the collection of these action rules. Experimental results show that the robot can act properly in almost any situation using the action knowledge base, provided it contains enough proper action rules.
Acknowledgements This work was supported with the Aid of Doshisha University's Research Promotion Fund.
References
[1] D. E. Goldberg: Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley (1989)
[2] T. Ito, H. Iba and M. Kimura: Robustness of Robot Programs Generated by Genetic Programming, Proc. of GP-96 (1996) 321-326
[3] P. Maes: Behaviour-Based Artificial Intelligence, Proc. of the Second International Conference on Simulation of Adaptive Behaviour (SAB-92), MIT Press (1993) 2-10
[4] T. Minato and M. Asada: Environmental Change Adaptation for Mobile Robot Navigation, Proc. of IEEE/RSJ International Conference on Intelligent Robots and Systems (1998) 1859-1864
[5] C. W. Reynolds: Evolution of Obstacle Avoidance Behaviour: Using Noise to Promote Robust Solutions, Advances in Genetic Programming, Vol. 1, chapter 10, MIT Press (1994) 221-241
[6] K. Terada, H. Takeda and T. Nishida: An Acquisition of the Relation between Vision and Action using Self-Organizing Map and Reinforcement Learning, Proc. of Second International Conference on Knowledge-Based Intelligent Electronic Systems (KES98), Vol. 1, IEEE (1998) 429-434
[7] S. Yamada: Evolutional Learning of Behaviours for Action-Based Environment Modelling, Journal of Japanese Society for Artificial Intelligence, Vol. 14, No. 5 (1999) 870-878
[8] H. Watabe and T. Kawaoka: Automatic Generation of Behaviours for Mobile Robot by GA with Automatically Generated Action Rule-Base, Proc. of IEEE International Conference on Industrial Electronics, Control and Instrumentation (IECON-2000) (2000) 1668-1674
Population-Based Approach to Multiprocessor Task Scheduling in Multistage Hybrid Flowshops
Joanna Jędrzejowicz1 and Piotr Jędrzejowicz2
1 Institute of Mathematics, Gdańsk University, Wita Stwosza 57, 80-952 Gdańsk, Poland, [email protected]
2 Department of Information Systems, Gdynia Maritime University, Morska 83, 81-225 Gdynia, Poland, [email protected]
Abstract. The paper considers multiprocessor task scheduling in multistage hybrid flowshops. To solve the above problem a population based approach is suggested. The population learning algorithm based on several local search procedures has been proposed and implemented. The algorithm has been evaluated by means of a computational experiment in which 160 benchmark instances have been solved and compared with the available upper bounds. It has been possible to improve 45% of previously known upper bounds.
1 Introduction
In this paper multiprocessor task scheduling in multistage hybrid flowshops is considered. Multiprocessor task scheduling has received growing attention in recent years. Already in the early eighties it was observed that there exist scheduling problems where some tasks have to be processed on more than one processor at a time [1], [2], [4]. During the execution of these multiprocessor tasks, communication among processors working on the same task is implicitly hidden in a "black box" denoting an assignment of this task to a subset of processors during some time interval. Multiprocessor task scheduling in multistage hybrid flowshops has numerous practical applications including fault-tolerant computing, multiprocessor computations, design of operating systems for parallel environments, traffic control in restricted areas, and many others. Unfortunately the general multiprocessor flowshop problem of minimizing schedule length (makespan) is NP-hard [5]. It should also be noted that even in the classical case of flow shop scheduling, only a few particular cases are efficiently solvable [3]. In view of the complexity of the considered scheduling problem it is unlikely that efficient exact algorithms for solving it could ever be found. Hence, various approximation algorithms have been studied and proposed. For example, the well-known metaheuristics tabu search and genetic algorithms were proposed in [10] and [12], respectively. This paper proposes applying a relatively new
technique, the population learning algorithm, to obtain solutions to multistage hybrid flowshop problems. It is shown that the approach is effective and promising; moreover, in numerous instances it has been possible to improve previously announced upper bounds for benchmark problems.
2 Problem Formulation
The discussed problem involves scheduling n jobs composed of tasks in a hybrid flowshop with m stages. All jobs have the same processing order through the machines, i.e. a job is composed of an ordered list of multiprocessor tasks where the i-th task of each job is processed at the i-th flowshop stage (the number of tasks within a job corresponds exactly to the number of flowshop stages). The processing order of tasks flowing through stages is the same for all jobs. At each stage i, i = 1, . . . , m, there are mi identical parallel processors available. For processing at stage i, task i, being a part of job j, j = 1, . . . , n, requires size_{i,j} processors simultaneously. That is, size_{i,j} processors assigned to task i at stage i start processing the task of job j simultaneously and continue doing so for a period of time equal to the processing time requirement of this task, denoted p_{i,j}. Each subset of available processors can process only the task assigned to it at a time. The processors do not break down. All jobs are ready at the beginning of the scheduling period. Preemption of tasks is not allowed. The objective is to minimize the makespan, i.e. the completion time of the last scheduled task in the last stage.
3 Population Learning Algorithm
Population learning algorithm [7] is a population-based method inspired by analogy to a phenomenon of social education processes in which a diminishing number of individuals enters more and more advanced learning stages. In PLA an individual represents a coded solution of the considered problem. Initially, a number of individuals, known as the initial population, is randomly generated. Once the initial population has been generated, individuals enter the first learning stage. It involves applying some, possibly basic and elementary, improvement schemes. These can be based on some local search procedures. The improved individuals are then evaluated and better ones pass to subsequent stages. A strategy of selecting better or more promising individuals must be defined and duly applied. In the following stages the whole cycle is repeated. Individuals are subject to improvement and learning, either individually or through information exchange, and the selected ones are again promoted to a higher stage with the remaining ones dropped-out from the process. In the final stage the remaining individuals are reviewed and the best represents a solution to the problem at hand. At different stages of the process, different improvement schemes and learning procedures are applied. These gradually become more and more sophisticated and time consuming as there are less and less individuals to be taught.
Although PLA shares some features with evolutionary algorithms and greedy randomised adaptive search procedures (GRASP), introduced in [6], it clearly has its own distinctive characteristics. Main features of the PLA as compared with evolutionary algorithms are shown in Table 1.
Table 1. Population learning versus evolutionary algorithms
Underlying phenomena: PLA - social education and learning processes; EA - natural evolution
Processing unit: PLA - population of solutions; EA - population of solutions
Population size: PLA - decreasing; EA - constant
Iterative search mode: PLA - selection; EA - generation replacement
Intensification: PLA - variety of learning and improvement procedures; EA - some implementations allow for intensification through embedded local search procedures
Information exchange: PLA - different schemes including crossover are allowed, but none is obligatory; EA - information exchange based on crossover mechanisms is a must
Diversification: PLA - mostly achieved by setting a massive initial population size, but additional schemes including mutation can be used; EA - mutation is the primary mechanism for diversification
PLA also shares some features with other population-based methods. In fact the idea of refining the population of solutions during subsequent computation stages is common to PLA and memetic algorithms [9]. The latter, however, assume a constant population size and a single local search procedure and rely to a greater extent on typical genetic/evolutionary algorithm operators. There are also some similarities between PLA and cultural algorithms, where an inheritance process operates at the micro- and macro-evolutionary levels [11].
4 PLA Implementation
To solve the multiprocessor task scheduling in a multistage hybrid flowshop problem, the population learning algorithm denoted as PLA-MPFS has been implemented. It is an extension of the algorithm proposed originally in [8]. The algorithm makes use of different learning and improvement procedures, which, in turn, are based on the neighbourhood structures shown in Table 2. In what follows π = (π(1), . . . , π(n)) denotes an individual (a permutation of tasks) and g(π) its fitness function. The fitness function of an individual is calculated according to the following algorithm:
FITNESS:
begin
  for i = 1 to m do
    for j = 1 to n do
      allocate task π(j) at stage i to the required number of processors, scheduling it as early as feasible;
    end for
  end for
  output := finishing time of task number π(n) at stage m
end
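The following Python sketch is one possible reading of the FITNESS procedure for a permutation of jobs. The per-processor free-time bookkeeping is an illustrative greedy simplification, not the authors' scheduler, and all names are assumptions.

def makespan(perm, p, size, machines):
    # perm: order of jobs (list of job indices)
    # p[i][j]: processing time of job j's task at stage i
    # size[i][j]: number of processors job j needs at stage i
    # machines[i]: number of identical processors at stage i
    m = len(machines)
    free = [[0.0] * machines[i] for i in range(m)]   # per-processor free times
    job_ready = [0.0] * len(p[0])                    # completion of previous stage per job
    finish_last = 0.0
    for i in range(m):
        for j in perm:
            need = size[i][j]
            free[i].sort()
            # earliest moment at which `need` processors are simultaneously free
            start = max(job_ready[j], free[i][need - 1])
            end = start + p[i][j]
            for k in range(need):
                free[i][k] = end
            job_ready[j] = end
            if i == m - 1:
                finish_last = max(finish_last, end)
    return finish_last

# Two stages, two jobs, two processors per stage (toy instance)
p = [[3, 2], [4, 1]]
size = [[1, 2], [2, 1]]
machines = [2, 2]
print(makespan([0, 1], p, size, machines))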
Table 2. Neighbourhood structures
N1(x): Move - exchange of two consecutive tasks in x; Neighbourhood space - all possible exchanges
N2(x): Move - exchange of two non-consecutive tasks in x; Neighbourhood space - all possible exchanges
N3(x): Move - finding the optimal order of four consecutive tasks in x by enumeration; Neighbourhood space - all four-tuples of consecutive tasks
All learning and improvement procedures operate on the population of individuals P and perform local search algorithms based on the neighbourhood structures shown in Table 2.

LEARN(i, P):
  for each individual x in P do
    Local_search(i)
  end for

Local search algorithms are defined in Table 3. Within the discussed implementation a simple procedure SELECT(P, s) is used, where s is the percentage of the best individuals from the current population promoted to higher stages. For any population P, let LEARNING(P) stand for the following:

LEARNING(P):
  for i = 1 to 5 do
    LEARN(i, P)
    SELECT(P, s)
  end for

The last learning procedure, LEARN_REC(x), is recursive and operates on overlapping partitions of the individual x, which form a "small" population of individuals P'. To such a population LEARNING(P') is applied. Now, the structure of the implemented population learning algorithm can be shown as:
PLA-MPFS:
begin
  generate randomly initial_population
  P := initial_population
  LEARNING(P)
  for each individual x in P do
    LEARN_REC(x)
  end for
  output the best individual from P
end
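To make the diminishing-population idea concrete, a minimal Python sketch of the LEARNING cascade is given below. The local search procedures and the fitness g are placeholders, and larger g is assumed to be better (for makespan minimisation one would negate g); this is not the authors' implementation.

def learning(population, g, local_searches, s=0.5):
    # LEARNING(P): apply increasingly sophisticated local searches while
    # shrinking the population (SELECT keeps the best fraction s).
    for search in local_searches:                    # stages 1..5 in the paper
        population = [search(x) for x in population]   # LEARN(i, P)
        population.sort(key=g, reverse=True)            # best first (g maximised)
        keep = max(1, int(len(population) * s))         # SELECT(P, s)
        population = population[:keep]
    return population

# Toy usage: individuals are numbers, each "local search" nudges them upward.
pop = [1.0, 3.0, 2.0, 5.0]
print(learning(pop, g=lambda x: x, local_searches=[lambda x: x + 0.1] * 5))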
Table 3. Local search algorithms used in the PLA-MPFS
1. Perform all moves from the neighbourhood structure N1(x); accept moves improving g(x); stop when no further improvements of g(x) are possible.
2. Mutate x producing x' (the mutation procedure is selected randomly from the two available ones: the two-point random exchange or the rotation of all chromosomes between two random points); perform all moves from the neighbourhood structure N1(x'); accept moves improving g(x'); stop when no further improvements of g(x') are possible.
3. Repeat k' times (k' is a parameter set at the fine-tuning phase; in the reported experiment k' = 3 * initial population size): generate offspring y and y' by a single-point crossover from x and a random x'; perform all moves from the neighbourhood structures N1(y) and N1(y'); accept moves improving g(y) and g(y'); stop when no further improvements of g(y) and g(y') are possible; adjust P by replacing x and x' with the two best from {x, x', y, y'}.
4. Perform all moves from the neighbourhood structure N2(x); accept moves improving g(x); stop when no further improvements of g(x) are possible.
5. Perform all moves from the neighbourhood structure N3(x); accept moves improving g(x); stop when no further improvements of g(x) are possible.
5 Computational Experiment
The proposed algorithm has been evaluated experimentally. The experiment has involved 160 instances of multiprocessor task scheduling in a multistage hybrid flowshop. The dataset includes 80 instances with 50 jobs each and 80 instances with 100 jobs each. Each group of 80 problem instances is further partitioned into four subgroups of 20 instances each, with 2, 5, 8 and 10 stages, respectively. Processor availability at various stages and processor requirements per task varied between 1 and 10. All problem instances as well as their currently best upper bounds have been obtained through OR-LIBRARY (http://www.ms.ic.ac.uk/info.html). The experiment has been carried out on a PC with a Pentium III 850 MHz processor. The initial population size for the experiment has been set to 200 and the selection factor s to 0.5. Mean relative error (MRE), standard deviations of errors and mean computation times calculated after a single run of the PLA-MPFS are shown in Table 4. Negative values of MRE show how much it has been possible to improve upper bounds within the respective clusters of the benchmark dataset.

Table 4. MRE, standard deviation of errors and mean computation time (columns: number of stages)

Measure                         Jobs     2         5         8        10
MRE (%)                          50    -0.9814   -0.4036   -0.3367    0.3712
                                100    -0.6127   -0.3180   -0.2535   -0.4540
St. deviation of errors (%)      50     0.0174    0.2971    0.4104    0.5992
                                100     0.0116    0.2253    0.3042    0.2122
Mean computation time (min.)     50     1.20      3.35      5.76     10.40
                                100     3.55      8.80     14.15     27.42
Application of the PLA-MPFS has resulted in improving the total of the upper bounds of all considered cases by 0.34%. Out of 160 instances solved it has been possible to improve currently known upper bounds in 73 instances, which is more than 45% of all considered instances. The results of the reported experiments are available at http://manta.univ.gda.pl/~jj/mpfs.txt.
6 Conclusions
The results of the experiments allow the following conclusions to be drawn:
– Population learning algorithms can be considered an effective metaheuristic for solving problems of multiprocessor task scheduling in multistage hybrid flowshops. The reported experiment has contributed to improving 73 upper bounds out of 160 solved benchmark instances.
– The quality of results obtained by applying PLA, in terms of both the standard deviation of error and the mean relative error produced in a single run, is quite satisfactory.
– Whereas the quality of solutions obtained using PLA algorithms is satisfactory, there is still a lot of room for improvement with respect to the computational effort required.
– Future research should concentrate on parallel PLA schemes with a view to increasing the quality of results, decreasing computation time, or achieving both goals simultaneously.
– Critical for the successful application of the PLA approach is the fine-tuning phase, where a proper balance between computation time and solution quality has to be found. Since there are several parameters which can be controlled by the user (population size, selection criterion, number of iterations for the embedded local search procedures, etc.), further research is needed to identify interactions between these parameters and to establish a methodology for setting their values.
– The experiment has confirmed both the value and the strength of the population-based approach, which remains one of the most effective computational intelligence techniques.
References
[1] Błażewicz J., M. Drozdowski, J. Węglarz: Scheduling independent 2-processor tasks to minimize schedule length, Information Processing Letters, 18 (1984) 267–273
[2] Błażewicz J., M. Drozdowski, J. Węglarz: Scheduling multiprocessor tasks to minimize schedule length, IEEE Transactions on Computers, C-35 (1986) 81–96
[3] Błażewicz J., K. H. Ecker, E. Pesch, G. Schmidt, J. Węglarz: Scheduling Computer and Manufacturing Processes, Springer, Berlin (1996)
[4] Lloyd E. L.: Concurrent task systems, Operations Research, 29 (1981) 189–201
[5] Brucker P., B. Kramer: Shop Scheduling Problems with multiprocessor tasks on dedicated processors, Annals of Operations Research, 50 (1995) 13–27
[6] Feo T. A., M. G. C. Resende: A probabilistic heuristic for computationally difficult set covering problems, Operations Research Letters, 8 (1989) 706–712
[7] Jędrzejowicz P.: Social Learning Algorithm as a Tool for Solving Some Difficult Scheduling Problems, Foundation of Computing and Decision Sciences, 24 (1999) 51–66
[8] Jędrzejowicz J., P. Jędrzejowicz: Permutation Scheduling Using Population Learning Algorithm, in: Knowledge-based Intelligent Information Engineering Systems and Allied Technologies, E. Damiani et al. (Eds.), IOS Press, Amsterdam (2002) 93–97
[9] Moscato P.: Memetic Algorithms: A short introduction. In: D. Corne, M. Dorigo, F. Glover (Eds.), New Ideas in Optimization, McGraw-Hill, New York (1999) 219–234
[10] Oğuz, C., Y. Zinder, V. Ha Do, A. Janiak, M. Lichtenstein: Hybrid flow-shop scheduling problems with multiprocessor task systems, Working paper, The Hong Kong Polytechnic University, Hong Kong SAR (2001)
[11] Reynolds R. G.: An Introduction to Cultural Algorithms. In: A. V. Sebald, L. J. Fogel (Eds.), Proceedings of the Third Annual Conference on Evolutionary Programming, World Scientific, River Edge (1994) 131–139
[12] Sivrikaya-Şerifoğlu, F., G. Ulusoy: A genetic algorithm for multiprocessor task scheduling in multistage hybrid flowshops, Working paper, Abant Izzet Baysal University, Bolu (2001)
A New Paradigm of Optimisation by Using Artificial Immune Reactions
M. Köster1, A. Grauel1, G. Klene1, and H. Convey2
1 University of Applied Sciences – South Westphalia, Soest Campus, Lübecker Ring 2, D-59494 Soest, Germany, {koester,grauel,klene}@coin.soest.fh-swf.de
2 Bolton Institute of Higher Education, Deane Campus, Bolton, BL3 5AB, England, [email protected]
Abstract. This paper presents an implementation of an artificial immune system in order to realise multi-modal function optimisation. The main paradigms of an artificial immune system are clonal selection and affinity maturation. Therefore these paradigms are reviewed briefly. The optimisation is based on the opt-aiNet algorithm. This algorithm is described theoretically. Furthermore, the Griewank function has been used to test the algorithm with different parameters. The simulation results are discussed.
1 Introduction
The architecture and mechanisms of the human immune system can be explained by several theories and mathematical models [1, 2]. The immune system consists of a complex set of cells and molecules that protect the body against infections. The features of an immune system such as learning, memory, and adaptation are currently being used in the development of Artificial Immune Systems (AIS). Besides the classical procedures and the computational approaches, we propose AIS as a useful optimisation strategy. Publications in this relatively new research field show that the concept is interesting for, e.g., data mining [3], pattern recognition [4] and optimisation [5]. It is still unclear what characterises an AIS, but the principles of clonal selection and affinity maturation can be found in most of the works related to AIS. These principles are reviewed in Section 2. Based on clonal selection and affinity maturation, de Castro and Timmis developed the opt-aiNet algorithm [6], which is suitable for multimodal optimisation problems. Section 3 discusses the concept of this algorithm. In order to test the performance of the algorithm, it has been applied to the Griewank function [7]. The results for this test function are described in Section 3.2.
2 Artificial Immune Systems
Compared to other well-established computational intelligence paradigms such as artificial neural networks, evolutionary algorithms and fuzzy systems, the field
of artificial immune systems is still a new paradigm. The two major principles used in artificial immune systems are clonal selection and affinity maturation.
2.1 Clonal Selection and Affinity Maturation
When an antigen (pathogen) enters the human body, the B-cells (B-lymphocytes) of the human immune system respond by secreting antibodies. Antibodies are molecules which are able to recognize and bind to a certain part of the antigen called the epitope. Each antibody can only bind to one type of epitope, whereas an antigen can have more than one epitope; thus, the antigen can be recognized by several antibodies. Once an antibody of a B-cell has recognized the epitope of an antigen, the antigen stimulates the B-cell to proliferate (clone) and maturate into a plasma cell or a memory cell. Plasma cells are the most active antibody secretors. Memory cells are the cells having high affinity with the antigens. They circulate through the blood, lymph and tissues for a faster response to previously recognized antigens. As those B-cells that recognize an antigen improve their affinity to the selective antigen, this principle is called affinity maturation. And since these B-cells also proliferate, while those that do not recognize an antigen waste away, this principle is called clonal selection [8]. It is similar to the natural selection found in evolution theory except that it occurs on a rapid time scale, within the order of days.
3 Implementation and Test of the opt-aiNet Algorithm
Based on the principles mentioned in Section 2, de Castro and Timmis developed the opt-aiNet algorithm [6] to realise multimodal function optimisation. Like evolutionary algorithms, the opt-aiNet algorithm is an adaptive optimisation technique: with each iteration the solution gradually improves until the local or the global optimum is found. The algorithm is also a stochastic optimisation method and thus enables the exploration of the fitness landscape. It is capable of locating not only the global optimum but also local optima, which might also be of interest.
3.1 Algorithm
The components used in the algorithm can be summarised as follows:
Network cell: A network cell c is one feasible solution to the optimisation problem. Each cell can be represented by a real-valued vector c ∈ IR^n.
Population: The population P_t = {c_0, ..., c_{|P_t|-1}} is the set of all current network cells at time t.
Fitness: The fitness of a cell describes the quality of the solution. In the case of finding the minimum or the maximum, the fitness value can be derived from the objective function itself. F_t is the average fitness of the population P_t at time t.
Affinity: The affinity measure is given by the Euclidean distance between two cells.
Clone: A clone is an identical copy of a parent network cell.
The opt-aiNet algorithm can be summarised as follows:
(1)  t = 0
(2)  P_0 = randomly initialise population with n_p cells
(3)  F_0 = (1/|P_0|) · Σ_{i=0..|P_0|-1} fitness(c_i)
(4)  while not abort do
(5)    repeat
(6)      t = t + 1
(7)      P_t = P_{t-1}
(8)      for i ∈ {0, ..., |P_t| - 1} do
(9)        for j ∈ {1, ..., n_c} do
(10)         mc = hypermutate(c_i, fitness(c_i))
(11)         if fitness(mc) > fitness(c_i) then
(12)           c_i = mc
(13)         end if
(14)       end for
(15)     end for
(16)     F_t = (1/|P_t|) · Σ_{i=0..|P_t|-1} fitness(c_i)
(17)   until F_t < F_{t-1} + σ_f
(18)   i = 0
(19)   P_mem = {c_0}
(20)   for j ∈ {0, ..., |P_t| - 1} do
(21)     k = 0
(22)     repeat
(23)       if affinity(c_j, c_k) < σ_s
(24)         if fitness(c_j) > fitness(c_k)
(25)           replace c_k in P_mem by c_j
(26)         end if
(27)       else
(28)         k = k + 1
(29)       end if
(30)     until (k > i) or (affinity(c_j, c_k) < σ_s)
(31)     if k > i
(32)       P_mem = P_mem ∪ {c_j}
(33)       i = i + 1
(34)     end if
(35)   end for
(36)   P_t = P_mem
(37)   add (|P_mem| * d) randomly generated cells to P_t
(38) end while
In this algorithm the optima can be treated as the antigens and the network cells as the B-cells of the immune system. The fitness of the network cell is a measure of the affinity to the antigen (optimum). The algorithm combines a local search in the steps (5)-(17) and a global search in the steps (18)-(35). In the local search the fitness of the currently
available network cells improves gradually until the degree of improvement is less than the predefined fitness threshold σ_f. The affinity mutation in step (10) is realized by the following equation:

mc = c + (N / β) · e^(−f_n(c))    (1)

In the above equation mc is the mutated network cell, c is the parent network cell, N is a Gaussian random number of zero mean and standard deviation σ = 1, β is a decay parameter, and f_n(c) is the normalized fitness value of c, which is defined by the following equation:

f_n(c) = (f(c) − min_{c'∈P} f(c')) / (max_{c'∈P} f(c') − min_{c'∈P} f(c'))    (2)

where f(c) is the fitness value.
Since the mutation is based on the fitness and thus on the affinity of the parent network cell, it describes the affinity maturation process. If the fitness of a mutated clone is better than that of the parent cell, the parent cell is replaced by this clone in steps (11)-(13) of the algorithm. This mechanism describes the clonal selection in the local search. The suppression mechanism in steps (24)-(26) describes the clonal selection in the global search.
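A minimal Python sketch of the affinity mutation of Equations (1)-(2) is given below. It assumes the fitness values of the whole population are available for normalisation and applies the Gaussian term per component, which is an assumption; names are illustrative.

import math
import random

def normalized_fitness(f_c, all_fitness):
    # Equation (2): scale the fitness of c into [0, 1] over the population.
    lo, hi = min(all_fitness), max(all_fitness)
    return 0.0 if hi == lo else (f_c - lo) / (hi - lo)

def hypermutate(cell, f_c, all_fitness, beta=10.0):
    # Equation (1): mc = c + (N / beta) * exp(-fn(c)),
    # with N ~ Gaussian(0, 1) drawn here for each component.
    alpha = math.exp(-normalized_fitness(f_c, all_fitness)) / beta
    return [x + alpha * random.gauss(0.0, 1.0) for x in cell]

print(hypermutate([0.5, -1.2], f_c=3.0, all_fitness=[1.0, 3.0, 7.0]))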
3.2 Simulation Results
The opt-aiNet algorithm has been applied to the Griewank function, which is defined as follows:

f(x) = 1 + Σ_{i=1..n} x_i² / 4000 − Π_{i=1..n} cos(x_i / √i);   x = (x_1, ..., x_n) ∈ IR^n    (3)

The Griewank function is a widely used multimodal test function with a very rugged landscape and a large number of local optima. In the following, the dimension of the function is set to n = 2. For the local search the number of clones is set to n_c = 20 and the fitness threshold to σ_f = 0.01. The factor of newcomers is set to d = 5. The algorithm is aborted after 200 generations. To find the global maximum of the Griewank function in an input range of x_1, x_2 ∈ [−100, 100], the suppression threshold can be set to σ_s = 100 and the mutation decay to β = 2. The corresponding result is shown in Figure 1. Only the best network cell survives in each iteration, as the suppression area covers nearly the whole input space. Since the decay β is small, the possible range of the affinity mutation, defined by Equation (1), is large and therefore the network cells do not get stuck in a local optimum. The algorithm was able to locate the global optimum (x = (0,0)) in most of the test runs after approximately 116 generations. In a second example the input range of the Griewank function is set to x_1, x_2 ∈ [−10, 10]. The factor of newcomers is set to d = 0.8. By setting the suppression threshold to σ_s = 0.5 and the mutation decay to β = 10,
the algorithm discovers not only the global optimum but also local optima, as shown in Figure 2. After approximately 83 generations the algorithm was able to locate all 17 optima, including the global optimum. In this example the higher mutation decay β reduces the possible range of the affinity mutation. Therefore, the network cells converge towards their nearest local optimum.

Fig. 1. The Griewank function, defined by Equation (3), with an input range of x1, x2 ∈ [−100, 100]

Fig. 2. The Griewank function, defined by Equation (3), with an input range of x1, x2 ∈ [−10, 10]
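For reference, a direct Python transcription of the Griewank test function of Equation (3) is given below, under the standard sum-and-product form assumed in the reconstruction above.

import math

def griewank(x):
    # f(x) = 1 + sum(x_i^2 / 4000) - prod(cos(x_i / sqrt(i)))
    s = sum(xi ** 2 / 4000.0 for xi in x)
    prod = 1.0
    for i, xi in enumerate(x, start=1):
        prod *= math.cos(xi / math.sqrt(i))
    return 1.0 + s - prod

print(griewank([0.0, 0.0]))   # 0.0 at the origin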
4 Conclusion
The presented AIS is a new computational intelligence method for optimisation based on biological metaphors. We have tested this method on different multi-modal function optimisation problems. Its main advantages from an engineering perspective are the ease of handling and the simplicity of the clonal selection principle and the affinity maturation. In contrast to Evolutionary Algorithms (EAs), no encoding of cells (individuals) is required, the cardinality of the population is automatically determined by the suppression mechanism, and exploitation is combined with exploration of the fitness landscape while keeping the population well stabilised. The convergence speed for functions with many extrema can be a strong limitation to the practical application of artificial neural networks (ANN).
References
[1] N. K. Jerne: Towards a network theory of the immune system. In: Ann. Immunol. (Inst. Pasteur), Vol. 125C (1974) 373–379
[2] A. S. Perelson: Theoretical Immunology, Volume 1-2. Addison Wesley, Reading, MA, USA (1988)
[3] T. Knight and J. Timmis: A Multi-Layered Immune Inspired Approach to Data Mining. In: A. Lotfi, J. Garibaldi, and R. John, eds.: Proceedings of the 4th International Conference on Recent Advances in Soft Computing, Nottingham, UK (2002) 266–271
[4] L. N. de Castro and J. Timmis: Artificial Immune Systems: A Novel Approach to Pattern Recognition. In: L. Alonso, J. Corchado and C. Fyfe, eds.: Artificial Neural Networks in Pattern Recognition. University of Paisley (2002) 67–84
[5] D. Dasgupta, ed.: Artificial Immune Systems and Their Applications. Springer-Verlag, Berlin (1999)
[6] L. N. de Castro and J. Timmis: An Artificial Immune Network for Multimodal Function Optimization. In: D. B. Fogel, M. A. El-Sharkawi, Xin Yao, G. Greenwood, H. Iba, P. Marrow, and M. Shackleton, eds.: Proceedings of the 2002 Congress on Evolutionary Computation CEC2002, IEEE Press (2002) 699–704
[7] D. Whitley, K. Mathias, S. Rana, and J. Dzubera: Building Better Test Functions. In: Larry Eshelman, ed.: Proceedings of the Sixth International Conference on Genetic Algorithms, San Francisco, CA, Morgan Kaufmann (1995) 239–246
[8] F. M. Burnet: Clonal Selection and After. In: G. I. Bell, A. S. Perelson, and G. H. Pimbley Jr., eds.: Theoretical Immunology, Marcel Dekker Inc. (1978) 63–85
Multi-objective Genetic Programming Optimization of Decision Trees for Classifying Medical Data
Ernest Muthomi Mugambi1 and Andrew Hunter2
1 Science Lab, Computer Science Dept., Sunderland University, South Rd., Durham DH1 3LE, [email protected]
2 Durham University
Abstract. Although there has been considerable study in the area of trading off accuracy and comprehensibility of decision tree models, the bulk of the methods dwell on sacrificing comprehensibility for the sake of accuracy, or fine-tuning the balance between comprehensibility and accuracy. Invariably, the level of trade-off is decided a priori. It is possible for such decisions to be made a posteriori, which means the induction process does not discriminate against any of the objectives. In this paper, we present such a method, which uses multi-objective Genetic Programming to optimize decision tree models. We have used this method to build decision tree models from diabetes data in a bid to investigate its capability to trade off comprehensibility and performance.
1 Introduction
Decision trees have traditionally been univariate (also known as axis-parallel) in nature, which implies that they use splits based on a single attribute at each internal node. Even though several methods have been developed for constructing multivariate trees, this body of work is not well known [1]. Most of the work on multivariate splits considers linear (oblique) trees [10]. The problem of finding an optimal linear split is known to be intractable [11], hence the need to find good albeit suboptimal linear splits. While there exist methods for finding good linear splits, ranging from linear discriminant analysis [12] and hill-climbing search [13] to perceptron training [14], among others, the use of evolutionary algorithms, which are known to be powerful global search mechanisms, is little known. In this paper, we use a Genetic Programming algorithm to optimize polynomial (non-linear) decision structures. For many practical tasks, the trees produced by tree-generation algorithms are not comprehensible to users due to their size and complexity. It is desirable that the classifier "provide insight and understanding into the predictive structure of the data" [16] as well as explanations of its individual predictions [15]; in medical data mining such characteristics are a prerequisite for model acceptability. It can be argued that the incomprehensibility of some models is caused by the
model induction process being primarily based on predictive accuracy or performance. To address this concern, we use a multi-objective Genetic Programming algorithm to optimise decision trees for both classification performance and comprehensibility, without discriminating against either. There is a lack of a proper empirically tested theory on comprehensibility; the comprehensibility ideas used are based on some limited studies in the literature [17][18].
2 Decision Trees
Decision trees are a way to represent underlying data hierarchically by recursively partitioning the data. The decision tree formalism has been found to be intuitively appealing and comprehensible, because the way decision trees perform classification, by a sequence of tests, is easy for a domain expert to understand. For this reason, among others, decision trees have found their way into many a data mining researcher's or practitioner's tool box. For a comprehensive survey on decision trees, refer to [1]. One way to classify decision trees is by looking at the way they split data. The most common type of decision tree is referred to as univariate or axis-parallel. This type of decision tree carries out tests on a single variable at each non-leaf node; its mode of splitting the data is equivalent to using axis-parallel hyperplanes in the feature space. Another class of decision trees tests a linear combination of features at each internal node; these are referred to as multivariate linear or oblique trees, because the tests are geometrically equivalent to hyperplanes at an oblique orientation to the axes of the feature space. The last and least known decision tree type is the non-linear multivariate decision tree. Although a linear tree can often describe the data well, studies [3] have shown that non-linear trees are more accurate. Ittner and Schlosser [2] conceived a non-linear multivariate decision tree that produces partitioning in the form of curved hypersurfaces of the second degree. Their method is based on the combination of primitive features (constructive induction) in the form of polynomials, which are induced by a linear (oblique) tree generation algorithm. We introduce a novel multivariate non-linear decision tree that is optimized by use of Genetic Programming. The optimization is aimed at maximizing the performance and improving the comprehensibility of the decision tree in tandem.
3 Structure and Comprehensibility
There are three types of operators that dictate the structure of the decision tree. Base operators comprise multiplication (*) and addition (+) operators, which are used to combine several terminal functions or variables into a polynomial function. Fuzzy operators are fuzzy membership functions (sigmoid functions and bell functions) used to squash continuous variables. For example, we could have a Polish-notation polynomial function of the form (* (+ x1 x2) (φ x3)), where φ represents a sigmoid membership function. Intermediate operators are novel high-level operators that are used to partition the feature space. Model comprehensibility is determined by syntactic and semantic simplicity:
– Syntactic simplicity: the size of a pattern or the number of interacting terms in a model, also known as complexity. For a polynomial model it is determined by the number of nodes in the decision tree.
– Semantic simplicity: the type and order of operators used in the decision trees. The model is easier to comprehend when the intermediate operators are at the top, followed by the base operators in the middle and, lastly, the fuzzy operators at the leaf nodes.
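To illustrate the operator typing described above, the following Python sketch evaluates a small prefix-notation expression such as (* (+ x1 x2) (sig x3)), where sig is a sigmoid squashing function. The representation is illustrative, not the authors' GP encoding.

import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def evaluate(expr, x):
    # expr is a nested tuple in prefix form, e.g. ('*', ('+', 'x1', 'x2'), ('sig', 'x3'))
    if isinstance(expr, str):          # terminal: a variable name
        return x[expr]
    op, *args = expr
    vals = [evaluate(a, x) for a in args]
    if op == '+':
        return sum(vals)
    if op == '*':
        out = 1.0
        for v in vals:
            out *= v
        return out
    if op == 'sig':                    # fuzzy squashing operator
        return sigmoid(vals[0])
    raise ValueError("unknown operator: " + str(op))

tree = ('*', ('+', 'x1', 'x2'), ('sig', 'x3'))
print(evaluate(tree, {'x1': 0.5, 'x2': 1.5, 'x3': -0.2}))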
4 Genetic Programming (GP)
Genetic Programming is a global, random search method that evolves tree-like programs. It is suitable for solving NP-hard computational problems such as those that arise in non-linear decision tree modeling. GP has been successful in areas such as symbolic regression, pattern recognition and concept learning, among others. GP offers a very natural way to represent the decision tree structure, and it uses several operators, such as crossover and mutation, to optimize the expression tree.
5 Coefficient Optimization
While GP is good at optimizing the structure of the tree, it is known to be deficient in the capability to optimize the coefficients in the expression tree [8]. We have incorporated a Quasi-Newton optimization technique to augment the power of the GP coefficient optimization. This technique uses an error propagation algorithm that efficiently calculates the gradient of the error function with respect to the coefficients embedded in the GP expression tree.
6 Multi-objective Optimization (MOO)
Multi-objective optimization is defined as finding "a vector of decision variables which satisfies constraints and optimizes a vector function whose elements represent the objective functions". Most decision trees use classification performance as the objective of the induction process. While performance is an important objective, other characteristics such as tree size and comprehensibility might not be any less important. MOO therefore offers a suitable way to optimize the decision tree without discriminating against any of the objectives, by using the Pareto optimality concept. According to this concept, a feasible solution A is better than B if A is better than B in at least one objective and not worse off in any of the remaining objectives. A set of solutions where none is better than any other is known as a non-dominated set, and if such a set dominates all other solutions it is referred to as the Pareto front. MOO algorithms [7][19] attempt to find this Pareto front. There are three main objectives we are interested in: tree size, classification performance and the degree of the polynomial. Tree size is a straightforward measure of the size of the decision tree, which we aim to minimize. It is accepted that
smaller decision trees are more comprehensible and have better generalization capabilities. The second objective, which we aim to maximize, is classification performance, measured using the Receiver Operating Characteristic curve. This is a standard technique, widely used in medical applications, which characterises the performance of a classifier as the decision threshold is altered, by displaying the sensitivity1 versus 1-specificity2. Finally, the degree of the polynomial structure determines how the feature space is partitioned. Second-degree polynomials describe quadratic hypersurfaces, which can be used to model complex non-linear behaviour. Therefore the third objective is to use second-degree polynomial terms where appropriate, minimizing the use of an excessive number of linear terms.
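The Pareto dominance test used for the multi-objective selection can be sketched as follows; each solution is assumed to be scored by a tuple of objectives to be maximised (objectives to be minimised, such as tree size, are negated first), and the example values are illustrative only.

def dominates(a, b):
    # a dominates b if a is at least as good in every objective
    # and strictly better in at least one (all objectives maximised here).
    return all(ai >= bi for ai, bi in zip(a, b)) and any(ai > bi for ai, bi in zip(a, b))

def non_dominated(solutions):
    # Return the subset of solutions not dominated by any other solution.
    return [s for i, s in enumerate(solutions)
            if not any(dominates(o, s) for j, o in enumerate(solutions) if j != i)]

# objectives: (ROC performance, -tree size, fraction of 2nd-degree terms)
scores = [(0.82, -11, 0.4), (0.79, -5, 0.5), (0.82, -15, 0.2)]
print(non_dominated(scores))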
7 Multi-objective Genetic Programming Algorithm (MOGP)
The algorithm commences with the initialisation of the decision tree structures in the form of Genetic Programming expression trees. The trees can be strongly typed to ensure that only specific orders of operators are allowed, in line with the need to develop perspicuous tree structures. Although the generation of terminals and non-terminals is purely random, there are strict controls to ensure that the expression trees are valid. During initialization the training set is set to half of the data set, the cross-validated test set to the remaining data, and the terminal set to the variables contained in the data set. After initialization the initial population is evaluated and the elite set, which is made up of the "best" individuals of that population, is constituted. MOGP is an elitist evolutionary algorithm, but great caution is taken to ensure that the elite set does not have undue influence on the ability of the algorithm to explore diverse feature spaces. The selection of the elite is based on three objectives: the performance on the Receiver Operating Characteristic curve, the size of the decision tree and, lastly, the percentage of second-degree polynomial functions. The Pareto optimality concept is used for the multi-objective evaluation after the coefficient optimization of the expression tree. The mating pool is made up of individuals from the population and the elite set, chosen by binary tournament selection, where the winner is determined by fitness sharing. Reproduction is carried out by the mating pool using crossover and mutation. Crossover and mutation operations are strictly controlled to ensure that they produce valid offspring.
8
Experiments and Results
We conducted ten experiments using the algorithm. Control parameters for the runs are given in Table 1. The data set comprises 2304 records of patients attending a diabetic clinic. Each patient corresponds to one record in the data set.
Table 1. MOGP Settings
Population               100
Non-dominated set size    25
Tournament size            2
Dominance group size      10
Generations              200
Crossover rate           0.3
Mutation rate            0.3
Each record contains readings taken from the patients and their medical history. The 29 patient attributes constitute factors that are associated with diabetes and the type of complications the patients may have suffered in the period they have had the disease. The aim of this classification exercise is to build decision tree models that can predict which patients' diabetic status is likely to deteriorate, leading to complications normally associated with the disease. The training set is made up of 1300 cases while the test set constitutes the remaining 1004 cases. Figure 1 shows the ROC curves of some of the decision tree models found in the elite set, which comprises the non-dominated models in the population. The complexity of the elite set models ranges from 1 to 39. To assess the efficacy of the algorithm, we compared the results with a Radial Basis Function neural network using all 29 variables. The Radial Basis Function network is a powerful but
(Figure 1 panels: ROC curves for models of complexity c = 1, 3, 5, 11, 17, 21, 27, 35, 37 and 39; key: MOGP vs. RBF.)
Fig. 1. ROC Performance of MOGP models compared with RBF models for different complexities
poorly comprehensible technique and hence forms a good basis for comparison with our model. It is evident from the ROC curves in Figure 1 that the low complexity models (1-35) perform as well as the RBF network. As expected, the higher complexity models perform very poorly - it is known that overly complex models have poor generalization capability. The results also show that our algorithm is capable of producing a wide range of model complexities with good performance. In addition, very comprehensible (small and compact) models such as 0.111x2 ∗ 0.34x18 and 0.02x2 ∗ 2.3x23 + 1.2x6 (see the first and second ROC curves in Figure 1), which perform as well as an RBF network, are easily obtained.
9
Conclusion
We have introduced and described a method for inducing decision tree models without determining the level of trade-off between comprehensibility and ROC curve performance beforehand. The decision tree models are optimised using a multi-objective Genetic Programming technique. The coefficients within the polynomial decision tree models are optimised using an efficient error propagation algorithm. Future work will involve comparing the performance of our decision tree framework mainly with other known linear and non-linear decision tree induction algorithms.
References
[1] S.K. Murthy. Automatic construction of decision trees from data: a multi-disciplinary survey. Kluwer Academic Publishers, Boston, 1-49, 1998.
[2] A. Ittner and M. Schlosser. Discovery of relevant new features by generating non-linear decision trees. Proc. of 2nd International Conference on Knowledge Discovery and Data Mining, 108-113. AAAI Press, Menlo Park, CA, Portland, Oregon, USA, 1996.
[3] A. Ittner. Non-linear decision trees NDT. International Conference on Machine Learning, 1996.
[4] N. Nikolaev and V. Slavov. Concepts of inductive genetic programming. EuroGP: First European Workshop on Genetic Programming, Lecture Notes in Computer Science, LNCS 1391, Springer, Berlin, 1998, pp. 49-59.
[5] M.C.J. Bot and W.B. Langdon. Application of genetic programming to induction of linear classification trees. EuroGP 2000: European Conference, Edinburgh, Scotland, UK, April 2000.
[6] L.A. Breslow and D.W. Aha. Simplifying decision trees: a survey. Navy Center for Applied Research in Knowledge Engineering Review Technical Report, 1998.
[7] C. Emmanouilidis. Evolutionary multi-objective feature selection and ROC analysis with application to industrial machinery fault diagnosis. Evolutionary Methods for Design, Optimization and Control, Barcelona, 2002.
[8] A. Hunter. Expression Inference - genetic symbolic classification integrated with non-linear coefficient optimization. AISC 2002: 117-127.
[9] J. Hanley and B. McNeill. The meaning and use of the area under a receiver operator characteristic curve. Diagn. Radiology, 143, 29-36, 1982.
[10] A. Ittner, J. Zeidler, R. Rossius, W. Dilger and M. Schlosser. Feature space partitioning by non-linear and fuzzy decision trees. Chemnitz University of Technology, Department of Computer Science, Chemnitz, 1997.
[11] L. Hyafil and R.L. Rivest. Constructing optimal binary decision trees is NP-complete. Information Processing Letters, 5(1):15-17, 1976.
[12] W. Loh and N. Vanichsetakul. Tree structured classification via generalized discriminant analysis. J. of the American Statistical Association, 83(403):715-728, 1988.
[13] S. Murthy, S. Kasif, S. Salzberg and R. Beigel. A system for induction of oblique decision trees. J. of Artificial Intelligence Research, 2:1-33, August 1994.
[14] S.E. Hampson and D.J. Volper. Linear function neurons: structure and training. Biological Cybernetics, 53(4):203-217, 1986.
[15] D. Michie. Inducing knowledge from data; First Principles. Unpublished manuscript for a talk given at the Seventh International Conference on Machine Learning, Austin, Texas.
[16] L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone. Classification and regression trees. Belmont, CA: Wadsworth International Group.
[17] C. Pena-Reyes and M. Sipper. Fuzzy CoCo: balancing accuracy and interpretability of fuzzy models by means of coevolution. Logic Systems Laboratory, Swiss Federal Institute of Technology in Lausanne, CH-1015 Lausanne, Switzerland.
[18] N. Lavrac. Selected techniques for data mining in medicine. Artificial Intelligence in Medicine 16 (1999) 3-23.
[19] A. Hunter. Using multiobjective genetic programming to infer logistic polynomial regression models. 15th European Conference on Artificial Intelligence, Lyon, France, 2002.
A Study of the Compression Method for a Reference Character Dictionary Used for On-line Character Recognition Jungpil Shin Department of Computer Software, University of Aizu Aizu-Wakamatsu City, Fukushima, 965-8580, Japan tel.: +81 (242) 37-2704; fax: +81 (242) 37-2731 [email protected]
Abstract. This paper reports on a compression method for a reference character dictionary used for on-line Chinese handwriting character recognition, which is based on pattern matching between input and reference patterns. First, one reference pattern for each category is generated from the training data. Second, the reference dictionary is optimized by using the Generalized Learning Vector Quantization (GLVQ) algorithm. Third, the reference dictionaries are compressed to approximately one tenth of the original data using Vector Quantization with the K-means clustering algorithm. Then, the compressed reference dictionaries are again optimized by the GLVQ algorithm. Experimental results for Chinese character recognition show that the dictionary can be successfully compressed without decreasing the recognition rate, and the calculation time of distance between strokes can be reduced by a factor of approximately five.
1
Introduction
On-line recognition of handwritten cursive characters is a key issue in state-ofthe-art character recognition research [8, 16], and extensive research has been conducted to accommodate the variation seen in stroke order, stroke number and stroke deformation [7, 9, 10, 11, 12, 13, 14, 15]. The recognition framework is usually based on pattern matching between the input and a reference pattern. However, the accuracy of character recognition is greatly influenced by the quality of the reference characters. In the case of Chinese characters, their reference dictionary consists of approximately one thousand different characters. To realize high quality recognition on a small microprocessor with built-in memory, for example a Personal Digital Assistant (PDA), the computational resources and recognition time required must be reduced. In addition, each character change by those who write the reference characters needs to be optimized. The aim of this paper is the compression and quality improvement of reference characters. This aim can be realized by making a smaller dictionary
V. Palade, R.J. Howlett, and L.C. Jain (Eds.): KES 2003, LNAI 2773, pp. 300–309, 2003. c Springer-Verlag Berlin Heidelberg 2003
based on stroke elements, not on character elements. First, one reference pattern for each category is generated from the training data. Second, the reference dictionary is optimized using the Generalized Learning Vector Quantization (GLVQ) algorithm [1],[2]. Third, the reference dictionary is compressed considerably compared with the original data using the Vector Quantization with K-means Clustering algorithm. Then, the GLVQ algorithm again optimizes the compressed reference dictionaries. In this paper, a suitable compression rate is investigated, because a trade-off exists between the recognition rate and recognition time/compression rate. Experimental results for Chinese character recognition show that the dictionary can be successfully compressed without decreasing the recognition rate, and the calculation time of distance between strokes can be considerably reduced.
2
Reference Pattern Generation System
Figure 1 shows the block diagram of the compression algorithm employed for the reference dictionary. First, each stroke of a preliminary reference pattern is generated from the training data by storing the average values of the loci of feature points in each stroke. One reference pattern is made for each category. Second, the vectors of the preliminary reference strokes are corrected to optimal vectors by repeated learning using the GLVQ algorithm. Third, using the K-means clustering algorithm, the radicals composed of strokes having similar position and shape are combined to form a single compressed radical. Fourth, for all the strokes, the compression process is performed and a new reference dictionary is generated using the K-means clustering algorithm. Fifth, each compressed stroke in the new reference dictionary is again optimized using the GLVQ algorithm. Then, the final reference character patterns are created.
3
Average Pattern Generation
The experimental data consists of 2965 categories of Chinese characters, i.e., specified characters in the first level of the Japanese Industry Standard (JIS) code set, written by 90 university students (a total of 258603 characters having up to 26 strokes each). Students were directed to write cursively in a normal manner. The data were obtained using a stylus pen on a Liquid Crystal Display (LCD) tablet. The input character is transformed into a planar 256x256 mesh by preprocessing. The steps are redundancy elimination, smoothing, size normalization,
Fig. 1. Generation flow of the reference dictionary: Average Pattern Generation → Optimization by GLVQ → Compression by Radical Units → Compression by Stroke Units → Optimization by GLVQ
Fig. 2. Examples of (a) input pattern, and (b) preliminary reference pattern
and feature point extraction. As feature information, the x-y coordinates and the movement direction vector between one point and the next are extracted from the character data. Preliminary reference patterns are made as follows: First, the non-connected strokes are extracted by rearranging the strokes in the correct stroke order. Then, each stroke of the preliminary reference pattern is generated from the training data by storing the average values of the loci of the feature points in each stroke. One reference pattern is generated for each category. The generated reference patterns are divided stroke by stroke and numbered sequentially. The numbers are memorized in a code table and the actual recognition process is performed by referring to the table. The average of an original stroke that is referred to in the code table is denoted by C a , and the total number of C a is 32398. Typical examples of input and reference patterns of ” ” are shown in Figure 2. Automatic searching for the correct stroke-correspondence between input pattern and reference pattern is not trivial, because the analysis of stroke order and stroke connection should be performed simultaneously with high accuracy. First, using the method of Cube Search [14, 15], correct stroke-correspondences between input pattern and reference pattern are searched for automatically by backtracking. This algorithm has a novel advantage that enables an efficient search for the optimal stroke-correspondence in spite of: (i) stroke order variation, (ii) stroke number variation due to stroke connection, and (iii) exceptional user-generated stroke deformations. Some of the wrong stroke-correspondences are manually converted into the appropriate ones by observation of these characters. These errors are due to strokes written in substantially different positions.
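To illustrate the averaging step only (our sketch, not the paper's implementation), the following resamples each writer's stroke to a fixed number of feature points and averages the loci across writers; the fixed point count and the linear resampling are our assumptions.

    import numpy as np

    def resample(stroke, n_points=16):
        # stroke: array of (x, y) pen coordinates; resample to n_points by
        # linear interpolation along cumulative arc length.
        stroke = np.asarray(stroke, dtype=float)
        seg = np.linalg.norm(np.diff(stroke, axis=0), axis=1)
        s = np.concatenate(([0.0], np.cumsum(seg)))
        t = np.linspace(0.0, s[-1], n_points)
        x = np.interp(t, s, stroke[:, 0])
        y = np.interp(t, s, stroke[:, 1])
        return np.stack([x, y], axis=1)

    def average_stroke(samples, n_points=16):
        # samples: the same stroke of the same character written by many writers.
        return np.mean([resample(s, n_points) for s in samples], axis=0)

    writers = [[(0, 0), (40, 5), (90, 10)], [(2, 1), (50, 4), (95, 12)]]
    print(average_stroke(writers, n_points=4))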
4
Generalized Learning Vector Quantization
The Generalized Learning Vector Quantization (GLVQ) algorithm [1, 2] is based on the Learning Vector Quantization (LVQ) algorithm proposed by Kohonen [5,
Fig. 3. Sample vectors to match among feature points after changes using GLVQ
6], and it has been shown to be superior to LVQ. The purpose of these methods is to correct a reference vector to the optimal vector by repeating learning. We employ the GLVQ algorithm for reference pattern optimization. Briefly, one reference vector near the input vector v and belonging to the same category as v is set to ω1 , and another reference vector near v belonging to a different category from v is set to ω2 . Each distance is set to d1 and d2 , respectively. The two vectors ω1 and ω2 are changed using Eqs. (1) and (2), respectively. That is, a similar reference pattern approaches an input pattern and a different pattern deviates from an input pattern. Figure 3 shows intuitively the vector changes from the original vector.

ω1 ← ω1 + α (∂f/∂µ) d2/(d1 + d2)² (v − ω1)    (1)

ω2 ← ω2 − α (∂f/∂µ) d1/(d1 + d2)² (v − ω2)    (2)
where α is a small positive constant that controls the vector movement. In this paper, ∂f/∂µ = 4F(µ, t){1 − F(µ, t)} was used in the experiments, where t is the learning time, F(µ, t) is the sigmoid function 1/(1 + e^(−µt)), and µ = (d1 − d2)/(d1 + d2). The optimized reference stroke by GLVQ is denoted by C g . Examples of C g and its code number are shown in Figure 4.
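For concreteness, a small Python sketch of one GLVQ update in the form of Eqs. (1) and (2) above (ours, not the paper's code); the Euclidean distance, the toy reference set and the parameter values are our assumptions, not the paper's settings.

    import numpy as np

    def glvq_update(refs, labels, v, y, t, alpha=0.05):
        # refs: (m, d) reference vectors; labels: their categories
        # v, y: input vector and its category; t: learning time
        d = np.linalg.norm(refs - v, axis=1)
        same = np.where(labels == y)[0]
        diff = np.where(labels != y)[0]
        i1 = same[np.argmin(d[same])]       # nearest reference of the same class (w1)
        i2 = diff[np.argmin(d[diff])]       # nearest reference of a different class (w2)
        d1, d2 = d[i1], d[i2]
        mu = (d1 - d2) / (d1 + d2)
        F = 1.0 / (1.0 + np.exp(-mu * t))   # sigmoid F(mu, t)
        dfdmu = 4.0 * F * (1.0 - F)
        refs[i1] += alpha * dfdmu * d2 / (d1 + d2) ** 2 * (v - refs[i1])   # Eq. (1)
        refs[i2] -= alpha * dfdmu * d1 / (d1 + d2) ** 2 * (v - refs[i2])   # Eq. (2)
        return refs

    refs = np.array([[0.0, 0.0], [1.0, 1.0]])
    labels = np.array([0, 1])
    glvq_update(refs, labels, v=np.array([0.2, 0.1]), y=0, t=1.0)
    print(refs)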
5
Compressing Reference Characters
Using the preliminary reference pattern optimized by GLVQ, we generate a new compressed reference pattern using the K-means clustering algorithm as follows:
5.1
Compression by Radical Units
The compressed VQ code for radical units, the so-called ”Hen” and ”Tsukuri”, which is denoted by C r , is generated from C g using the K-means clustering algorithm, which performs on those radicals with the same number of strokes.
Fig. 4. Examples of stroke C g
Fig. 5. Number of strokes in radical
The goal is to make C r substitute for similarly shaped strokes in C g by using the average of the strokes having the same position and shape. This process reduces misrecognitions that occur when a different stroke is chosen in spite of the radical having the same stroke position and shape. Furthermore, the next clustering stage works more effectively. Figure 5 shows the number of strokes in the radical and the number of its variations. Examples of a reference character with radical ” ” and its compressed strokes C r are shown in Figure 6.
Fig. 6. Examples of (a) reference character with the same radical and (b) its compressed strokes C r
Fig. 7. Examples of original and compressed reference characters
5.2
Compression by Stroke Units
The compressed VQ code for each stroke unit, which is denoted by C s , is again generated from C r using the K-means clustering algorithm. The compressed size of the reference pattern is determined by the threshold value that determines whether a stroke belongs in the same cluster as other strokes. It is expected that the size of a dictionary can be reduced, without decreasing the rate of recognition, by compressing the reference dictionary using vector quantization. Moreover, the computational effort to calculate the distance between the input stroke and a reference stroke can be reduced dramatically as follows. First, the distance values calculated are immediately saved in the table. Then, when using the same value again, it is read from a table and is used without calculation. A suitable compression rate should be investigated because a trade-off exists between recognition rate and recognition time/compression rate. Figure 7 shows the examples of the original reference characters and the compressed reference characters, which are recreated by the VQ code. In the case of Ref 1 of Figure 7, the number of strokes was compressed from 32398 to 7372, a compression rate of 22.8%. A maximum of 96 strokes were compressed into one stroke. Figure 8 shows the number of strokes compressed to one stroke in the case of Ref 3 . Although the reference characters were compressed, compared with the original characters, characters from Ref 1 to Ref 3 did not lose shape significantly. The actual compression rate is shown in Table 1. Using DP matching, the distance between an input stroke and a reference stroke in the K-means algorithm for radical and stroke units is calculated by the weighted sum of (1) the distance between x-y coordinate sequences and (2) the distance between directional vector sequences. An asymmetric DP equation is employed for DP matching.
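A compact Python sketch of the stroke-unit clustering idea (ours, not the paper's code): strokes are represented as fixed-length feature vectors and grouped with a plain K-means, and a cache avoids recomputing distances between the same pair of codes. The feature encoding, the number of clusters and the convergence test are our assumptions.

    import numpy as np

    def kmeans(strokes, k, iters=50, seed=0):
        # strokes: (n, d) fixed-length stroke feature vectors (e.g. resampled x-y loci)
        rng = np.random.default_rng(seed)
        centers = strokes[rng.choice(len(strokes), size=k, replace=False)]
        for _ in range(iters):
            # assign each stroke to the nearest compressed code (VQ codeword)
            d = np.linalg.norm(strokes[:, None, :] - centers[None, :, :], axis=2)
            assign = d.argmin(axis=1)
            new = np.array([strokes[assign == j].mean(axis=0) if np.any(assign == j)
                            else centers[j] for j in range(k)])
            if np.allclose(new, centers):
                break
            centers = new
        return centers, assign

    # distance cache: once a code pair has been compared, reuse the value
    _cache = {}
    def cached_distance(i, j, codes):
        key = (min(i, j), max(i, j))
        if key not in _cache:
            _cache[key] = float(np.linalg.norm(codes[i] - codes[j]))
        return _cache[key]

    codes, assign = kmeans(np.random.default_rng(1).normal(size=(200, 32)), k=20)
    print(len(codes), cached_distance(0, 1, codes))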
Fig. 8. Compression result

Table 1. Total stroke number and compression rate
ref. set   total number of str.   compression rate   maximum str. no. in one category
Ref 0      32398                  -                  -
Ref 1      7372                   22.8 %             96
Ref 2      5657                   17.5 %             123
Ref 3      3148                   9.7 %              186
Ref 4      1899                   5.9 %              305
Ref 5      1203                   3.7 %              413
Ref 6      783                    2.4 %              494
Ref 7      551                    1.7 %              753
Ref 8      317                    1.0 %              1310
5.3
Generalized Learning Vector Quantization
Each compressed stroke of C s is again optimized using the GLVQ algorithm, and is denoted by C m . Then, the final reference character patterns are created. Using this compressed reference dictionary of stroke C m with small size, it is expected that the recognition time will be reduced greatly without decreasing the recognition rate compared with the non-compressed dictionary.
6
Recognition Experiment
The usefulness of the presented method was demonstrated by recognition experiments performed on a PC, a Pentium III 700 MHz processor. Training data consisted of 2965 Chinese character categories, used as the investigation data in Sec. 3. Another 20 writers provided test data for the recognition experiment in
Fig. 9. Recognition rates for each stroke number

Table 2. Recognition rate (%) for using strokes (a) C g , (b) C s , and (c) C m ; (d) the improvement (%) between C s and C m
ref. set   (a)    (b)    (c)    (d)
Ref 0      99.0   -      -      -
Ref 1      -      98.2   98.9   +0.7
Ref 2      -      98.1   98.9   +0.8
Ref 3      -      97.9   98.8   +0.9
Ref 4      -      97.2   98.1   +0.9
Ref 5      -      97.0   98.0   +1.0
Ref 6      -      96.5   97.4   +0.9
Ref 7      -      95.8   96.9   +1.1
Ref 8      -      94.2   95.6   +1.4
a similar fashion to the training data. The test data consisted of the same character categories as the training data and totalled 48631 characters, limited to characters written with the correct number of strokes. Recognition experiments using C a , C g , C s , and C m as reference patterns were conducted. For the recognition algorithm, we used the method of Cube Search [14, 15]. Table 2 summarizes the recognition experiment results. A considerable improvement in recognition rate is achieved by moving from stroke C s to C m . Result (b) shows that the data were compressed without decreasing the recognition rate. Result (c), compared with the VQ references, shows that the average recognition rate increased by about 1%, and the recognition rates from Ref 1 to Ref 3 hardly
changed compared with Ref 0 . Therefore, Ref 3 is the most suitable as the reference characters. The stroke distance calculation time in case of Ref 3 can be reduced by a factor of approximately five, from 1.23 sec to 0.25 sec. Moreover, the recognition rate of each stroke was investigated and the result is shown in Figure 9. When the compression rate becomes high, in the case of few stroke characters, the recognition rate tends to fall. For this research, a program, as shown in Figure 10, that can select and display Chinese characters having reference strokes included in the same category was written using the C++ language.
Fig. 10. An application that displays VQ character
7
Conclusion
We have investigated a compression method for a reference character dictionary used for on-line Chinese handwriting character recognition. By a combination of the K-means clustering algorithm and the GLVQ algorithm, the reference dictionary was successfully compressed to one tenth of the original data size without a decrease in the recognition rate. The time used to calculate distance between strokes can be reduced to approximately one fifth. Future work includes applying the proposed method to a reference dictionary in which one category of reference stroke consists of multiple strokes rather than a single stroke. We expect that a more precise recognition algorithm can be found.
References
[1] A. Sato, K. Yamada, “Character Recognition using Generalized Learning Vector Quantization”, Technical Report of IEICE, PRU95-219, pp. 23-30, March 1996.
[2] A. Sato, K. Yamada, “Generalized Learning Vector Quantization”, Advances in Neural Information Processing Systems 8, The MIT Press, 1996.
[3] A. Kitadai, M. Nakagawa, “A Learning Algorithm for Structural Pattern Representation used in On-Line Handwritten Character Recognition”, Technical Report of IEICE PRMU2000-209 (2001-03), pp. 61-68.
[4] Mu-King Tsay, Keh-Hwa Shyu, Pao-Chung Chang, “Feature Transformation with Generalized Learning Vector Quantization for Hand-Written Chinese Character Recognition”, IEICE Trans. Vol. E82-D, No. 3, pp. 687-692, March 1999.
[5] T. Kohonen, “Self-Organization and Associative Memory”, 3rd ed., Springer-Verlag, 1989.
[6] T. Kohonen, “LVQ PAK Version 3.1 - The Learning Vector Quantization Program Package”, LVQ Programming Team of the Helsinki University of Technology, 1995.
[7] M. Nakagawa, “Non-keyboard Input of Japanese Text - On-Line Recognition of Handwritten Characters as the Most Hopeful Approach”, IPSJ Trans., Japan, vol. 13, no. 1, pp. 15-34, April 1990.
[8] C. C. Tappert, C. Y. Suen, and T. Wakahara, “The State of the Art in On-Line Handwriting Recognition”, IEEE Trans. Pattern Anal. Machine Intell., vol. 12, no. 8, pp. 787-808, Aug. 1990.
[9] M. Nakagawa and K. Akiyama, “A Linear-time Elastic Matching for Stroke Number Free Recognition of On-Line Handwritten Characters”, Proc. 4th Int. Workshop on Frontiers in Handwriting Recognition, pp. 48-56, Dec. 1994.
[10] A. Hsieh, K. Fan, and T. Fan, “Bipartite Weighted Matching for On-Line Handwritten Chinese Character Recognition”, Pattern Recognition, Vol. 28, No. 2, pp. 143-151, 1995.
[11] T. Wakahara, A. Suzuki, N. Nakajima, S. Miyahara, and K. Odaka, “Stroke-Number and Stroke-Order Free On-Line Kanji Character Recognition as One-to-One Stroke Correspondence Problem”, IEICE Trans. Inf. & Syst., vol. E79-D, no. 5, pp. 529-534, May 1996.
[12] T. Uchiyama, N. Sonehara, and Y. Tokunaga, “On-Line Handwritten Character Recognition Based on Non-Euclidean Distance”, IEICE Trans. Inf. & Syst., vol. J80-D-II, No. 10, pp. 2705-2712, Oct. 1997 (in Japanese).
[13] T. Wakahara and K. Odaka, “On-Line Cursive Kanji Character Recognition Using Stroke-Based Affine Transformation”, IEEE Trans. Pattern Anal. Machine Intell., vol. 19, no. 12, pp. 1381-1385, Dec. 1997.
[14] J. Shin, M. M. Ali, Y. Katayama, and H. Sakoe, “Stroke Order Free On-Line Character Recognition Algorithm Using Inter-Stroke Information”, IEICE Trans. Inf. & Syst., vol. J82-D-II, No. 3, pp. 382-389, Mar. 1999 (in Japanese).
[15] J. Shin and H. Sakoe, “Stroke Correspondence Search Method for Stroke-Order and Stroke-Number Free On-Line Character Recognition - Multilayer Cube Search”, IEICE Trans. Inf. & Syst., vol. J82-D-II, No. 2, pp. 230-239, Feb. 1999 (in Japanese).
[16] R. Plamondon, D. P. Lopresti, L. R. B. Schomaker, and R. Srihari, “On-line handwriting recognition”, In: J. G. Webster (Ed.), Wiley Encyclopedia of Electrical & Electronics Engineering, 123-146, New York, 1999.
Mercer Kernels and 1-Cohomology of Certain Semi-simple Lie Groups Bernd-Jürgen Falkowski University of Applied Sciences Stralsund Zur Schwedenschanze 15, D-18435 Stralsund, Germany [email protected]
Abstract. For the construction of support vector machines so-called Mercer kernels are of considerable importance. Since the conditions of Mercer's theorem are hard to verify some mathematical results arising from semi-simple Lie groups are collected here to provide concrete examples of Mercer kernels for the real line. Besides an interesting connection to Furstenberg's theory of noncommuting random products comes to light. These results have, in essence, been known for quite some time but are rather technical in nature. Hence a concise treatment is offered here to make them accessible to Neural Network researchers.
1
Introduction
Recently support vector machines, cf. e.g. [Sch], have received much attention. In this context Mercer kernels, cf. e.g. [Cri], p. 35 for Mercer's theorem, which are important building blocks of such machines, have frequently been used. Thus in [Fal2] some mathematical results providing a complete description of all Mercer kernels possessing certain invariance properties under a group action were collected. These were used to derive two out of three basic types of kernels, cf. e.g. [Hay], p. 333, and one less well-known kernel type in a systematic fashion. However, apart from these and some construction rules, cf. e.g. [Cri], pp. 42-44, very little else appears to be known about such kernels amongst Neural Network researchers. This is all the more surprising since the conditions of Mercer's theorem are not easily verifiable in general. Hence it seems worthwhile to point out some results for semi-simple Lie groups that also have, in essence, been known for quite some time, cf. [Fur], [Gan], [Erv], to provide further concrete examples of Mercer kernels. Besides an interesting connection to Furstenberg's theory of noncommuting random products and associated laws of large numbers comes to light. Note that due to lack of space and the very technical nature of the subject only sketch proofs will be given. For technical details and further background information we refer to [Fur], [Gan], [Hel], [Erv].
V. Palade, R.J. Howlett, and L.C. Jain (Eds.): KES 2003, LNAI 2773, pp. 310-316, 2003. Springer-Verlag Berlin Heidelberg 2003
2
Some Basic Definitions and Results
In this section some basic definitions and results from [Fal2] are going to be recalled for the convenience of the reader. Definition 1: Let G be a topological group and let g → Ug be a weakly continuous unitary representation of G in a Hilbert space H (i.e. in particular a homomorphism). A map δ: G → H is called a first order cocycle (or just a cocycle) associated with U if Ug δ(h) = δ(gh) - δ(g)
∀g, h ∈ G
Example 1: Suppose that v is a fixed vector in H. Then a trivial cocycle (coboundary) associated with U may be defined by setting δ(g) := Ug v − v.
In [Fal2] it was shown how Mercer kernels enjoying certain invariance properties may be constructed from positive definite functions. Moreover the logarithms of such functions (so-called conditionally positive definite functions) were described in terms of the 1-cohomology there. The (relevant part of the) basic result is:
Theorem 1: If δ is a cocycle satisfying δ(e) = 0, then the function ψ(g) = −c<δ(g), δ(g)> is normalized conditionally positive definite for every c > 0 and hence ϕ(g) := exp[ψ(g)] is positive definite in the usual sense. Mercer kernels K may be obtained from these functions, for example by setting
K(g1, g2) := ϕ(g2⁻¹ g1)   or   K(g1, g2) := ψ(g2⁻¹ g1) − ψ(g2⁻¹) − ψ(g1).
Proof: See example 2, lemma 1.2, and theorem 3.2 in [Fal2].
Unfortunately the 1-cohomology for topological groups is by no means easy to calculate explicitly in general, although some results are available for ℜn, cf. [Fal2]. However, strangely enough some known facts about non-commuting random products for semi-simple Lie groups prove helpful to obtain very concrete results.
3
Semi-simple Lie Groups with Finite Centre
Note first of all that in theorem 1 above an explicit knowledge of δ may not be required. Indeed, suppose that G is a semi-simple Lie group with finite centre. Let G = KAN, with K compact, A abelian, and N nilpotent, be its Iwasawa decomposition, cf. e.g. [Hel] for the technical details. Suppose further that a nontrivial cocycle of G associated with U satisfies δ(k) = 0 ∀k ∈ K. Then it follows that U cannot have a non-trivial vector invariant under K, cf. [Erv], p. 88. From this fact one obtains:
Lemma 1: The function ψ given in theorem 1 satisfies the functional equation
∫_K ψ(g1 k g2) dk = ψ(g1) + ψ(g2)   ∀ g1, g2 ∈ G    (*)
where dk denotes the Haar measure on K.
Proof: See [Erv], p. 90.
The equation (*) above is quite remarkable since it appears in many different contexts, cf. e.g. [Fur] for non-commuting random products and laws of large numbers, or [Gan], [Fal1] for Levy-Schoenberg Kernels. If it has a unique non-negative solution and a non-trivial cocycle is known to exist, then the negative of its solution is conditionally positive definite. In [Fur] non-negative solutions of (*) are described as follows: Let G = KAN as above, and let MAN, where M is the centralizer of A in K, be a minimal parabolic subgroup, cf. e.g. [Fur] for the technical details. Consider B = KAN/MAN. Then K acts transitively on this homogeneous space and in fact there exists a unique K-invariant probability measure m on B, cf. [Fur]. In order to describe the solutions of (*) a further definition of [Fur] is needed.
Definition 2: If G acts transitively on a topological space X, then an A-cocycle on X is a real-valued continuous function ρ on G×X satisfying
(i) ρ(g1g2, x) = ρ(g1, g2x) + ρ(g2, x)   ∀ g1, g2 ∈ G, x ∈ X
(ii) ρ(k, x) = 0   ∀ k ∈ K
Then the following theorem holds, cf. [Fur], [Fal1]:
Theorem 2: The function ψρ defined by
ψρ(g) := ∫_B ρ(g, x) dm(x)
is a non-negative solution of (*). Note that the ψρ thus constructed is biinvariant with respect to K, i.e. satisfies the condition ψρ(k1gk2) = ψρ(g) ∀ k1, k2 ∈ K, ∀ g ∈ G, which is rather useful since any element in G may be written as g = k1ak2 with k1, k2 ∈ K, a ∈ A in view of the Cartan decomposition of G, cf. [Hel]. Hence it will suffice to compute ψρ on A.
4
Concrete Examples (SOe(n;1) and SU(n;1))
As examples the real and complex Lorentz groups will be treated in this section (actually in the real case only the connected component of the identity will be dealt with).
Definition 3: SOe(n;1) (SU(n;1)) are real (complex) (n+1)×(n+1) matrix groups which leave the real (hermitian) quadratic forms
Σ_{i=0}^{n−1} xi² − xn²   ( Σ_{i=0}^{n−1} |xi|² − |xn|² )
invariant.
The homogeneous G-space B = KAN/MAN in either case turns out to be the (real (complex)) Sphere S^(n−1), on which an element g := [gij], 0 ≤ i, j ≤ n, of G acts by fractional linear transformations, i.e.
(gx)i := ( Σ_{k=0}^{n−1} gik xk + gin ) / ( Σ_{k=0}^{n−1} gnk xk + gnn ),
where Σ_{i=0}^{n−1} xi² = 1 ( Σ_{i=0}^{n−1} |xi|² = 1 ) and x := (x0, x1, …, xn−1). It is in fact known, cf. [Fur], that there is, up to scalar multiples, precisely one A-cocycle in either case, which is described in:
Lemma 2: Let g ∈ SOe(n;1) (SU(n;1)) be given by g := [gij] for 0 ≤ i, j ≤ n. Then
ρ(g, x) := log | Σ_{k=0}^{n−1} gnk xk + gnn |,
where x := (x0, x1, …, xn−1), is an A-cocycle.
Proof: Computation (note here that the invariance conditions given in definition 3 of course impose certain restrictions on the gij which have not been given here but must be used in the computation). Q.E.D.
Next note that the abelian group A appearing in the Cartan decomposition is given by the special elements g := [gij] for 0 ≤ i, j ≤ n satisfying gij = δij (Kronecker delta) for 0 ≤ i, j ≤ n−2 and, for some t ∈ ℜ, gn−1n−1 = gnn = cosh t, gn−1n = gnn−1 = sinh t, all other gij being zero. Hence A is isomorphic in either case to the group of hyperbolic rotations in the plane or, equivalently, to (ℜ, +). If one denotes a typical element of A by a(t) in this case, then ρ(a(t), x) = log|xn−1 sinh t + cosh t|. From this one then immediately obtains the following theorem:
Theorem 3: The negative of the function ψρ(t) defined by
ψρ(t) := ∫_0^π log(sinh t cos u + cosh t) sin^(n−2) u du
is conditionally positive definite on the real line for every n ≥ 2.
Proof: By theorem 2, lemma 2, and the remarks above the negative of the function fρ(t) originally defined by
fρ(t) := ∫_{S^(n−1)} ρ(a(t), x) dm(x)
on SOe(n;1) is conditionally positive definite on the real line for every n. But from this one immediately obtains the expression given in theorem 3 (modulo a constant) on transforming to polar coordinates, cf. e.g. [Mag]. Q.E.D.
Note that the integral for n = 3 may be evaluated and one thus obtains an interesting explicit result.
Corollary to Theorem 3: The negative of the function ψρ(t) defined by ψρ(t) := t coth t − 1 is conditionally positive definite on the real line. Hence the kernels K1 and K2 defined by
K1(t1, t2) := exp[1 − (t1 − t2) coth(t1 − t2)]
respectively
K2(t1, t2) := −(t1 − t2) coth(t1 − t2) + t2 coth t2 + t1 coth t1 − 1
are Mercer kernels on the real line.
Proof: Note that ∫_0^π log(sinh t cos u + cosh t) sin u du = 2(t coth t − 1), discard the factor 2 (clearly any non-negative constant factor could appear in the kernels without altering the fact that they are Mercer kernels) and apply theorem 1. Q.E.D.
Remark: If one is willing to invest a little more effort, then the case n = 2 can be evaluated explicitly as well, cf. e.g. [Gan] for a similar calculation. This computation has not been performed here since the result is similar to the case discussed in theorem 4 below. In the complex case the calculations are slightly more complicated and thus only an interesting special case will be treated here. For SU(1;1) one obtains the following result:
Theorem 4: The negative of the function ψρ(t) defined by ψρ(t) := log(cosh t) is conditionally positive definite on the real line. Hence the kernels K3 and K4 defined by
K3(t1, t2) := 1/cosh(t1 − t2)
respectively
K4(t1, t2) := log[(cosh t1 · cosh t2)/cosh(t1 − t2)]
are Mercer kernels on the real line.
Proof: For SU(1;1) an invariant measure on the complex unit circle is given by (1/2πi) dx0/x0 where |x0| = 1. Hence
(1/2πi) ∫_{S¹} log |sinh t x0 + cosh t| dx0/x0 = log(cosh t)
by the Cauchy residue theorem, cf. e.g. [Ahl]. Again apply theorem 1 to obtain the result.
Q.E.D.
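A quick numerical illustration (ours, not part of the paper): the explicit kernel K1 from the corollary can be implemented directly and its Gram matrix checked for positive semi-definiteness on random points; the sample points, the tolerance and the handling of the removable singularity at t = 0 are our choices.

    import numpy as np

    def t_coth_t(t):
        # t*coth(t) has a removable singularity at t = 0 with limit 1
        t = np.asarray(t, dtype=float)
        return np.where(np.abs(t) < 1e-12, 1.0,
                        t / np.tanh(np.where(t == 0, 1.0, t)))

    def K1(t1, t2):
        # Corollary to Theorem 3: K1(t1, t2) = exp[1 - (t1 - t2) coth(t1 - t2)]
        return np.exp(1.0 - t_coth_t(t1 - t2))

    rng = np.random.default_rng(0)
    pts = rng.uniform(-3.0, 3.0, size=40)
    G = K1(pts[:, None], pts[None, :])       # Gram matrix on the sample
    print(np.linalg.eigvalsh(G).min())       # smallest eigenvalue, expected to be >= -1e-10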
5
Concluding Remarks
The functions described in theorem 2 describe the Gaussian part of a noncommutative analogue of the Levy-Khintchin formula, cf. e.g. [Gan]. Indeed, if one wishes to consider more general Mercer kernels, then this is perfectly possible, see e.g. [Fal1] for a more thorough discussion of the situation, which would have exceeded the framework of this paper. However, there is also an interesting connection to a law of large numbers which should not be overlooked. If {Xn} is a sequence of independent identically distributed random variables with values in a multiplicative group of positive real numbers, then the law of large numbers can be stated as follows: (1/n)log(X1X2 … Xn) → E(logXn), where E denotes the expectation value. The function ψ(t):= log t is characterized up to a constant multiple by the equation log (t1t2) = log t1 + log t2. However, a semi-simple Lie group does not admit any non-trivial homomorphisms and hence it would be pointless to look for an exact analogue of this function. The relevant analogue is the functional equation (*) given above and using solutions of this functional equation a law of large numbers can indeed be proved in the non-commutative situation prevailing in semisimple Lie groups as discussed above, for details see e.g. [Fur]. Since only Mercer kernels have been described above it may be worthwhile to remind the reader that from these the relevant feature spaces may be constructed as Hilbert spaces. In this case the Mercer kernels are used to define the inner product (that is to say they are reproducing kernels). Alternatively they may be used to define positive integral operators where the feature maps then arise as eigenfunctions belonging to positive eigenvalues. For a more detailed discussion of these aspects the reader is referred to [Sch] chapter 1. Finally, from a practical point of view, for example the very explicit result in the corollary to theorem 3 seems noteworthy, since even for the hyperbolic tangent kernel the positive definiteness criterion is satisfied only for some parameter values, cf. e.g. [Hay], p. 333.
References
[Ahl] Ahlfors, L.V.: Complex Analysis. Second Edition, (1966)
[Cri] Cristianini, N.; Shawe-Taylor, J.: An Introduction to Support Vector Machines and other Kernel-Based Learning Methods. Cambridge University Press, (2000)
[Erv] Erven, J.; Falkowski, B.-J.: Low Order Cohomology and Applications. Springer Lecture Notes in Mathematics, Vol. 877, (1981)
[Fal1] Falkowski, B.-J.: Levy-Schoenberg Kernels on Riemannian Symmetric Spaces of Non-Compact Type. In: Probability Measures on Groups, Ed. H. Heyer, Springer Lecture Notes in Mathematics, Vol. 1210, (1986)
[Fal2] Falkowski, B.-J.: Mercer Kernels and 1-Cohomology. In: Proc. of the 5th Intl. Conference on Knowledge Based Intelligent Engineering Systems & Allied Technologies (KES 2001), Eds. N. Baba, R.J. Howlett, L.C. Jain, IOS Press, (2001)
[Fur] Furstenberg, H.: Noncommuting Random Products. Trans. of the AMS, Vol. 108, (1962)
[Gan] Gangolli, R.: Positive Definite Kernels on Homogeneous Spaces. In: Ann. Inst. H. Poincare B, Vol. 3, (1967)
[Hay] Haykin, S.: Neural Networks, a Comprehensive Foundation. Prentice-Hall, (1999)
[Hel] Helgason, S.: Differential Geometry and Symmetric Spaces. Academic Press, New York, (1963)
[Mag] Magnus, W.; Oberhettinger, F.: Formeln und Sätze für die speziellen Funktionen der Mathematischen Physik. Springer, (1948)
[Sch] Schölkopf, B.; Burges, J.C.; Smola, A.J. (Eds.): Advances in Kernel Methods, Support Vector Learning. MIT Press, (1999)
On-line Profit Sharing Works Efficiently Tohgoroh Matsui1 , Nobuhiro Inuzuka2 , and Hirohisa Seki2 1
2
Department of Industrial Administration, Faculty of Science and Technology Tokyo University of Science, 2641 Yamazaki, Noda-shi, Chiba 278-8510, Japan [email protected] Department of Computer Science and Engineering, Graduate School of Engineering Nagoya Institute of Technology Gokiso-cho, Showa-ku, Nagoya 466-8555, Japan {inuzuka@elcom,seki@ics}.nitech.ac.jp
Abstract. Reinforcement learning constructs knowledge containing state-to-action decision rules from agent’s experiences. Most of reinforcement learning methods are action-value estimation methods which estimate the true values of state-action pairs and derive the optimal policy from the value estimates. However, these methods have a serious drawback that they stray when the values for the “opposite” actions, such as moving left and moving right, are equal. This paper describes the basic mechanism of on-line profit-sharing (OnPS) which is an actionpreference learning method. The main contribution of this paper is to show the equivalence of off-line and on-line in profit sharing. We also show a typical benchmark example for comparison between OnPS and Q-learning.
1
Introduction
Reinforcement learning constructs knowledge containing state-to-action decision rules from agent’s experiences. Most of reinforcement learning methods are based on action-value function which is defined as the expected return starting from state-action pairs, and called action-value estimation methods [5]. Action-value estimation methods estimate the true values of state-action pairs and derive the optimal policy from the value estimates. However, these methods have a serious drawback that they stray when the values for the “opposite” actions, such as moving left and moving right, are equal. This paper describes the basic mechanism of on-line profit-sharing (OnPS). Profit sharing [2, 3] is an action-preference learning method, which maintains preferences for each action instead of action-value estimates. Profit sharing1 has an advantage of the number of parameters to be set, because it only needs a discount rate parameter. Moreover, Arai et al. [1] have shown that profit sharing outperforms Q-learning in a multi-agent domain. 1
Note that profit sharing sometimes converges to a locally optimal solution like EM algorithm or neural networks.
V. Palade, R.J. Howlett, and L.C. Jain (Eds.): KES 2003, LNAI 2773, pp. 317–324, 2003. c Springer-Verlag Berlin Heidelberg 2003
Profit sharing is usually an off-line updating method, which does not change its knowledge until the end of a task. One drawback is that ordinary profit sharing requires unbounded memory space to store all of the selected stateaction pairs during the task. Another disadvantage is that intermediate rewards until the end of the task are considered to be zero. Although OnPS is one of the profit-sharing methods, it can be implemented with bounded memory space and use intermediate rewards during the task for updating its knowledge. The main contribution of this paper is to show the equivalence of off-line and on-line in profit sharing. We also give the result of a typical benchmark example for comparison between OnPS and Q-learning [6].
2
Profit Sharing
Profit sharing [2, 3] considers the standard episodic reinforcement learning framework, in which an agent interacts with an environment [5]. In profit sharing, state-action pairs in a single episode of T time steps, s0, a0, r1, s1, a1, r2, · · · , rT, sT, with states st ∈ S, actions at ∈ A, and rewards rt ∈ ℝ, are reinforced at the end of the episode. We consider the initial state, s0, to be given arbitrarily. In episodic reinforcement learning tasks, each episode ends in a special state called the terminal state. Profit sharing learns the action-preferences, P, of state-action pairs. After an episode is completed, P is updated by

P(st, at) ← P(st, at) + f(t, rT, T),    (1)
for each st ∈ S, at ∈ A in the episode, where f is a function called credit assignment. The rationality theorem [4] guarantees that unnecessary state-action pairs, that is, pairs that forever make up a loop, are always given smaller increments than the necessary pairs, if the credit-assignment function satisfies

f(t, rT, T) > L Σ_{i=0}^{t−1} f(i, rT, T),    (2)
where L is the maximum number of pairs, except for unnecessary ones, in a state. In many cases, a function f (t, rT , T ) = γ T −t−1 rT ,
(3)
is used, where γ is a parameter, also called the discount rate. This function satisfies Equation (2) with γ ≤ 1/ max |A(s)| ≤ 1/L. We also use this function in this paper. The shape of the function is shown in Figure 1. The increments for P (st , at ) decrease as the state-action pair, st and at , was visited earlier. Profit
Fig. 1. A credit-assignment function in the form of Equation (3), in profit sharing. The increments of P (st , at ) decrease as the difference between t and T increases
sharing chooses action a with probability in proportion to the action preferences at state s:

Pr(at = a | st = s) = P(s, a) / Σ_{a′∈A(s)} P(s, a′).

This is called weighted-roulette. Figure 2 shows the algorithm for the profit-sharing method.
Initialize, for all s ∈ S, a ∈ A(s): P(s, a) ← a small constant
Repeat (for each episode):
    Initialize s
    Repeat (for each step of episode):
        Choose a from s using weighted-roulette derived from P
        Take action a; observe reward, r, and next state, s′
        s ← s′
    until s is terminal
    For each pair st, at appearing in the episode:
        P(st, at) ← P(st, at) + f(t, rT, T)
Fig. 2. Off-line profit-sharing algorithm. The function f is a credit-assignment function
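For readers who prefer code, a minimal Python sketch of the off-line update in Figure 2 (ours, not the authors' implementation): the episode is buffered and reinforced at the end with the credit-assignment function of Equation (3). The dictionary-based storage and the initial preference value are our assumptions.

    import random
    from collections import defaultdict

    GAMMA = 0.25                      # discount rate; must satisfy gamma <= 1/max|A(s)|
    P = defaultdict(lambda: 0.25)     # action preferences P(s, a)

    def weighted_roulette(s, actions):
        prefs = [P[(s, a)] for a in actions]
        return random.choices(actions, weights=prefs, k=1)[0]

    def offline_update(episode, r_T):
        # episode: list of (s_t, a_t) pairs, t = 0 .. T-1; r_T: final reward
        T = len(episode)
        for t, (s, a) in enumerate(episode):
            P[(s, a)] += GAMMA ** (T - t - 1) * r_T    # Eq. (3): f(t, r_T, T)

    # toy usage: a recorded three-step episode that ended with reward 1
    offline_update([("s0", "right"), ("s1", "right"), ("s2", "up")], r_T=1.0)
    print(P[("s2", "up")], P[("s1", "right")])   # later pairs receive larger credit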
3
On-line Profit-Sharing Method
The ordinary profit-sharing method has the problem that its memory and computational requirements increase without bound, because the method must store
Fig. 3. The credit trace for a state-action pair. It decays by γ on each step and is incremented by one when the state-action pair is visited in a step, in accordance with Equation (4)
all the experienced states, the selected actions, and the final reward. That is, it requires s0, a0, s1, a1, · · · , sT−1, aT−1, rT to calculate the updates of all of P for state-action pairs st and at in an episode. On the other hand, incremental methods compute the (k + 1)th values from the kth values, so that it is not necessary to keep earlier state-action pairs. Instead, our incremental implementation method requires additional but bounded memory for credit traces. Credit traces, c, maintain credit assignments for each state-action pair. They are based on eligibility traces [5], which are one of the basic mechanisms of reinforcement learning. We apply this technique for implementing the profit-sharing method incrementally. We call this version of the profit-sharing method the on-line profit-sharing method (OnPS) because updates are done at each step during an episode, as soon as the increment is computed. Because the number of elements in the credit trace, c, is equal to that in the preference, P, the size of required memory is bounded. In each step, the credit traces for all states decay by γ, while the trace for the state-action pair visited on the step is incremented by one:

ct(s, a) = γ ct−1(s, a) + 1   if s = st and a = at;
ct(s, a) = γ ct−1(s, a)       otherwise,    (4)

for all non-terminal states s ∈ S, a ∈ A. The credit trace accumulates each time the state-action pair is visited, then decays gradually while the state-action pair is not visited, as illustrated in Figure 3. The increments are given as follows:

∆Pt(s, a) = rt+1 ct(s, a)   for all s, a.    (5)

Figure 4 shows the algorithm for OnPS.
Initialize, for all s ∈ S, a ∈ A(s): P(s, a) ← a small constant
Repeat (for each episode):
    Initialize s
    c(s, a) ← 0, for all s, a
    Repeat (for each step of episode):
        Choose a from s using weighted-roulette derived from P
        c(s, a) ← c(s, a) + 1
        Take action a; observe reward, r, and next state, s′
        For all s, a:
            P(s, a) ← P(s, a) + r c(s, a)
            c(s, a) ← γ c(s, a)
        s ← s′
    until s is terminal
Fig. 4. On-line profit-sharing (OnPS) algorithm
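A corresponding Python sketch of the OnPS update in Figure 4 (ours): preferences and credit traces are stored per state-action pair and every trace is decayed after each reward is distributed. The data structures and parameter values are illustrative assumptions.

    import random
    from collections import defaultdict

    class OnPS:
        def __init__(self, gamma=0.25, init_pref=0.25):
            self.gamma = gamma
            self.P = defaultdict(lambda: init_pref)   # action preferences
            self.c = defaultdict(float)               # credit traces

        def start_episode(self):
            self.c.clear()

        def choose(self, s, actions):
            w = [self.P[(s, a)] for a in actions]     # weighted roulette
            return random.choices(actions, weights=w, k=1)[0]

        def step(self, s, a, r):
            # called once per step, after taking action a in state s and observing r
            self.c[(s, a)] += 1.0
            for key in list(self.c):
                self.P[key] += r * self.c[key]        # Eq. (5)
                self.c[key] *= self.gamma             # decay, Eq. (4)

    agent = OnPS()
    agent.start_episode()
    a = agent.choose("s0", ["up", "down"])
    agent.step("s0", a, r=0.0)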
4
Equivalence of Off-line and On-line Updating in Profit-Sharing Methods
In this section, we show that the sum of all the updates is exactly the same for the off-line and on-line profit-sharing methods over an episode.
Theorem 1. If the rewards are zero until the system reaches one of the goal states and Equation (3) is used as the credit-assignment function, then the off-line updating given in Equation (1) is equivalent to the on-line updating given in Equation (5).
Proof. First, an accumulating eligibility trace can be written non-recursively as

ct(s, a) = Σ_{k=0}^{t} γ^(t−k) I_{s sk} I_{a ak},

where sk and ak are states and actions that appeared in the episode respectively, and I_{xy} is an identity-indicator function, equal to one if x = y and equal to zero otherwise. The update of a step in the on-line profit-sharing method is Equation (5). Thus, the sum of all the updates can be written as

∆P(s, a) = Σ_{t=0}^{T−1} rt+1 ct(s, a) = Σ_{t=0}^{T−1} rt+1 Σ_{k=0}^{t} γ^(t−k) I_{s sk} I_{a ak}.    (6)

On the other hand, the sum of updates in the off-line profit-sharing method can be written using the identity-indicator functions as

∆P(s, a) = Σ_{t=0}^{T−1} f(t, rT, T) I_{s st} I_{a at}.
Fig. 5. 11×11 grid world example. S is the initial state, and G’s are goal states. G’s and F’s are terminal states
Using Equation (3) as the credit-assignment function, f, this can be transformed to

∆P(s, a) = Σ_{t=0}^{T−1} γ^(T−t−1) rT I_{s st} I_{a at} = rT Σ_{t=0}^{T−1} γ^(T−t−1) I_{s st} I_{a at}.    (7)

Now, we rewrite the right-hand side of Equation (6). Because all the intermediate rewards until the end of the episode are zero, r1 = r2 = · · · = rT−1 = 0, only the term for t = T − 1 remains:

∆P(s, a) = Σ_{t=0}^{T−1} rt+1 Σ_{k=0}^{t} γ^(t−k) I_{s sk} I_{a ak}
         = r1 Σ_{k=0}^{0} γ^(0−k) I_{s sk} I_{a ak} + · · · + rT Σ_{k=0}^{T−1} γ^((T−1)−k) I_{s sk} I_{a ak}
         = rT Σ_{k=0}^{T−1} γ^(T−k−1) I_{s sk} I_{a ak}.
Therefore, the sums of all the updates in the off-line and on-line profit-sharing methods are the same, if we use Equation (3) as the credit-assignment function and all of the intermediate rewards until the end of the episode are zero. ✷
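As a sanity check of Theorem 1 (our sketch, not part of the paper), one can compute the off-line updates of Equation (1) with the credit function (3) and the on-line updates accumulated through credit traces on the same randomly generated episode and confirm that they coincide; the episode generator and tolerance are assumptions.

    import random
    from collections import defaultdict

    def offline_deltas(episode, r_T, gamma):
        T = len(episode)
        d = defaultdict(float)
        for t, sa in enumerate(episode):
            d[sa] += gamma ** (T - t - 1) * r_T          # Eq. (1) with f from Eq. (3)
        return d

    def online_deltas(episode, r_T, gamma):
        d, c = defaultdict(float), defaultdict(float)
        T = len(episode)
        for t, sa in enumerate(episode):
            c[sa] += 1.0
            r = r_T if t == T - 1 else 0.0               # intermediate rewards are zero
            for key in list(c):
                d[key] += r * c[key]                     # Eq. (5)
                c[key] *= gamma                          # Eq. (4)
        return d

    random.seed(1)
    episode = [(random.randrange(4), random.randrange(2)) for _ in range(10)]
    a, b = offline_deltas(episode, 1.0, 0.25), online_deltas(episode, 1.0, 0.25)
    print(all(abs(a[k] - b[k]) < 1e-12 for k in set(a) | set(b)))   # expected: True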
5
An Example
We used the 11×11 grid world shown in Figure 5. The MDP is deterministic and has 4 actions: moving up, down, left, or right. If the agent bumps into a wall, it remains in the same state. The four corner states are terminal. The agent receives a reward of 1 for the actions entering the bottom-right and upper-left corners. All other rewards are 0, including for entering the other two corners. The initial state is in the center.
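A self-contained Python sketch of this environment (ours, for illustration): states are (row, col) pairs, moves that hit a wall leave the state unchanged, the four corners are terminal, and only the upper-left and bottom-right corners pay a reward of 1.

    SIZE = 11
    MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    CORNERS = {(0, 0), (0, SIZE - 1), (SIZE - 1, 0), (SIZE - 1, SIZE - 1)}
    GOALS = {(0, 0), (SIZE - 1, SIZE - 1)}          # rewarded terminal corners
    START = (SIZE // 2, SIZE // 2)

    def step(state, action):
        # returns (next_state, reward, terminal)
        dr, dc = MOVES[action]
        r = min(max(state[0] + dr, 0), SIZE - 1)    # bumping into a wall keeps the agent in place
        c = min(max(state[1] + dc, 0), SIZE - 1)
        nxt = (r, c)
        reward = 1.0 if nxt in GOALS else 0.0
        return nxt, reward, nxt in CORNERS

    s = START
    s, rwd, done = step(s, "left")
    print(s, rwd, done)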
Fig. 6. The performances of greedy policy (left) and either weighted-roulette or softmax policy which are used during the learning (right) in the 11×11 environment. The optimal performance is 0.1
We have implemented OnPS with the initial preferences set to 1/|A| = 0.25 and γ = 1/|A| = 0.25, and Q-learning using Gibbs-distribution softmax action-selection with α = 0.01, γ = 0.9 and τ = 0.2. The results are shown in Figure 6. For each evaluation 101 trials are run with a maximum step cutoff at 101 steps. We ran a series of 30 runs and show the average. The left panel shows the performance of the greedy policy derived from the same knowledge at each total time step. The right panel shows the performance comparison of the probabilistic policies, that is, weighted-roulette and softmax, which are used during the learning. The results indicate that OnPS outperforms Q-learning at accomplishing the task while they learn. Since the true values for moving in all directions are the same at the center, action-value estimation methods stray as the estimation approaches the true value. Therefore, the performance of Q-learning was poor when using the stochastic policy.
6
Discussion and Conclusion
The Monte Carlo control method (MC-control) [5] is the closest method to OnPS. MC-control acquires the average return per selection for each state-action pair in order to estimate action-values. The essential difference is that MC-control is an action-value estimation method, while OnPS is an action-preference learning method. Hence, MC-control suffers from equally valued actions when it explores. We proved the equivalence of off-line and on-line updating in profit sharing when all of the intermediate rewards are zero. Profit sharing can accomplish the task even during learning, although the acquired knowledge is not guaranteed to be optimal. Nevertheless, OnPS worked well in several benchmark problems [5]. OnPS is superior to the off-line profit-sharing method from the viewpoint of required memory space. Furthermore, OnPS can deal with intermediate rewards during
the task, though the details of the effects have not been examined yet. This theoretical analysis remains as future work.
References
[1] Sachiyo Arai, Katia Sycara, and Terry R. Payne. Experience-based reinforcement learning to acquire effective behavior in a multi-agent domain. In R. Mizoguchi and J. Slaney, editors, Proceedings of The 6th Pacific Rim International Conference on Artificial Intelligence (PRICAI-2000), volume 1886 of Lecture Notes in Artificial Intelligence, pages 125-135. Springer-Verlag, 2000.
[2] John J. Grefenstette. Credit assignment in rule discovery systems based on genetic algorithms. Machine Learning, 3:225-245, 1988.
[3] John H. Holland. Escaping brittleness: The possibilities of general-purpose learning algorithms applied to parallel rule-based systems. In R. S. Michalski, J. G. Carbonell, and T. M. Mitchell, editors, Machine Learning: An Artificial Intelligence Approach, volume 2. Morgan Kaufmann Publishers, 1986.
[4] Kazuteru Miyazaki, Masayuki Yamamura, and Shigenobu Kobayashi. On the rationality of profit sharing in reinforcement learning. In Proceedings of The 3rd International Conference on Fuzzy Logic, Neural Nets and Soft Computing, pages 285-288. Fuzzy Logic Systems Institute, 1994.
[5] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, 1998.
[6] Christopher J. C. H. Watkins and Peter Dayan. Technical note: Q-learning. Machine Learning, 8(3/4):279-292, 1992.
Fast Feature Ranking Algorithm Roberto Ruiz, José C. Riquelme, and Jesús S. Aguilar-Ruiz Departamento de Lenguajes y Sistemas, Universidad de Sevilla. Avda. Reina Mercedes S/N. 41012 Sevilla, España {rruiz,riquelme,aguilar}@lsi.us.es Abstract. Attribute selection techniques for supervised learning, used in the preprocessing phase to emphasize the most relevant attributes, allow classification models to be simpler and easier to understand. The algorithm has some interesting characteristics: lower computational cost (O(m n log n), with m attributes and n examples in the data set) with respect to other typical algorithms, due to the absence of distance and statistical calculations; and applicability to any labelled data set, that is to say, it can contain continuous and discrete variables with no need for transformation. In order to test the relevance of the new feature selection algorithm, we compare the results induced by several classifiers before and after applying the feature selection algorithms.
1
Introduction
It is advisable to apply preprocessing techniques to the database to reduce the number of attributes or the number of examples, in such a way as to decrease the computational time cost. These preprocessing techniques are fundamentally oriented to one of two goals: feature selection (eliminating non-relevant attributes) and editing (reducing the number of examples by eliminating some of them or calculating prototypes [1]). Our algorithm belongs to the first group. Feature selection methods can be grouped into two categories from the point of view of a method's output. One category ranks features according to some evaluation criterion; the other chooses a minimum set of features that satisfies an evaluation criterion. In this paper we present a new feature ranking algorithm by means of projections, and the hypothesis on which the heuristic is based is: ”place the best attributes with the smallest number of label changes (NLC)”.
2 Feature Evaluation
2.1 Description
To describe the algorithm we will use the well-known IRIS data set, because its two-dimensional projections are easy to interpret.
This work has been supported by the Spanish Research Agency CICYT under grant TIC2001-1143-C03-02.
Fig. 1. Representation of attributes (a) Sepalwidth-Sepallength and (b) Sepalwidth-Petalwidth
Three projections of IRIS have been made as two-dimensional graphs. In Figure 1(a) it can be observed that projecting the examples onto the abscissa or the ordinate axis yields hardly any interval in which a single class is the majority; only the Sepallength intervals [4.3,4.8] for the Setosa class and [7.1,8.0] for Virginica can be identified. In Figure 1(b), no clear intervals appear for the Sepalwidth attribute on the ordinate axis either. Nevertheless, for the Petalwidth attribute it is possible to identify several intervals where the class is unique: [0,0.6] for Setosa, [1.0,1.3] for Versicolor and [1.8,2.5] for Virginica. SOAP is based on this principle: counting the label changes produced when traversing the projections of the examples in each dimension. If the attributes are placed in ascending order of NLC, we obtain a list that defines the selection priority, from greater to smaller importance. Finally, to choose an advisable number of features, we define a reduction factor, RF, used to take the subset of attributes formed by the first elements of this list. Before formally presenting the algorithm, we explain the main idea in more detail. Consider the situation depicted in Figure 1(b): the projection of the examples onto the abscissa axis produces an ordered sequence of intervals (some of them may be a single point), each assigned a single label or a set of labels: [0,0.6] Se, [1.0,1.3] Ve, [1.4,1.4] Ve-Vi, [1.5,1.5] Ve-Vi, [1.6,1.6] Ve-Vi, [1.7,1.7] Ve-Vi, [1.8,1.8] Ve-Vi, [1.9,2.5] Vi. Applying the same idea to the projection onto the ordinate axis gives the partition: Ve, R, R, Ve, R, R, R, R, R, R, R, R, R, R, Se, R, Se, R, Se, where R is a combination of two or three labels. We obtain almost one subsequence of the same value with different classes for each value of the ordered projection; that is to say, the projection onto the ordinate axis provides much less information than the one onto the abscissa axis. In intervals with multiple labels we consider the worst case, namely the maximum number of label changes possible for the same value. The number of label changes obtained by the algorithm in the projection of each dimension is: Petalwidth 16, Petallength 19, Sepallength 87 and Sepalwidth 120. In this way, we obtain a ranking of the best attributes from the point of view of classification.
Table 1. Main Algorithm
Input: E training (N examples, M attributes)
Output: E reduced (N examples, K attributes)
for each attribute Ai ∈ 1..M
    QuickSort(E, i)
    NLCi ← NumberChanges(E, i)
NLC attribute ranking
Select the k first
Table 2. NumberChanges function
Input: E training (N examples, M attributes), i
Output: number of label changes
for each example ej ∈ E with j in 1..N
    if att(u[j], i) ∈ subsequence of the same value
        changes = changes + ChangesSameValue()
    else if lab(u[j]) <> lastLabel
        changes = changes + 1
return(changes)
This result agrees with what is common knowledge in data mining: the width and length of the petals are more important than the measurements related to the sepals.
2.2 Algorithm
The algorithm is very simple and fast; see Table 1. It can operate with continuous and discrete variables as well as with databases that have two or multiple classes. To sort the examples in ascending order for each attribute, the QuickSort algorithm [5] is used, which is O(n log n) on average. Once the examples are ordered by an attribute, we can count the label changes throughout the ordered projected sequence. NumberChanges, in Table 2, considers whether we are dealing with distinct values of an attribute or with a subsequence of the same value (a situation that can arise with both continuous and discrete variables). In the first case, it compares the present label with that of the following value, whereas in the second case, where the subsequence has the same value, it counts as many label changes as are possible (function ChangesSameValue). After applying QuickSort, we may have repeated values with the same or different classes. For this reason, the algorithm first sorts by value and, in
case of equality, it looks for the worst of all possible cases (function ChangesSameValue). Consider the situation depicted in Figure 2(a): the examples sharing the same value for an attribute happen to be ordered by class, and two label changes are obtained. Another run of the algorithm might encounter a different ordering, with a different number of label changes. The solution to this problem consists of counting the worst case, i.e. the maximum number of label changes within the interval containing repeated values. In this way, the ChangesSameValue method produces the output shown in Figure 2(b), seven changes. This can be obtained at low cost by counting the elements of each class. ChangesSameValue stores the relative frequency of each class within the interval, and the following holds:

changes = (nelem − rfi) × 2  if rfi > nelem/2,  otherwise changes = nelem − 1   (1)
Here rfi is the relative frequency of class i within the interval, with i in {1,...,k}, and nelem is the number of elements within the interval. In Figure 2(a) we can observe a subsequence of the same value with eight elements: three of class A, four of class B and one of class C. Applying formula (1), no relative frequency is greater than half of the elements, so the maximum number of label changes is nelem − 1, that is, seven, as Figure 2(b) verifies. Ranking algorithms produce a ranked list according to the evaluation criterion applied, and they need an external parameter to take the subset of attributes formed by the first features of that list. Such a parameter produces different results with different data sets. Therefore, in order to establish the number of attributes in each case, we normalize the scores of the ranked list to [0,1], i.e. the score of the first attribute of the list is 1 and that of the last attribute is 0, and then we select the attributes whose score exceeds a parameter named the Reduction Factor (RF). We do not perform a special analysis for each data set.
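As an illustration of the procedure in Tables 1 and 2, the following Python sketch computes the NLC of each attribute, applying formula (1) to runs of repeated values; it is only one plausible reading of the counting rule, and names such as soap_rank are ours, not from the paper.

from collections import Counter

def label_changes(values, labels):
    # Count label changes along one attribute's sorted projection, taking the
    # worst case (formula (1)) inside runs of equal attribute values.
    order = sorted(range(len(values)), key=lambda i: values[i])
    changes, last_label, i = 0, None, 0
    while i < len(order):
        j = i
        while j < len(order) and values[order[j]] == values[order[i]]:
            j += 1                                   # run of equal values
        run_labels = [labels[k] for k in order[i:j]]
        if j - i == 1:
            if last_label is not None and run_labels[0] != last_label:
                changes += 1
            last_label = run_labels[0]
        else:
            nelem = j - i
            rf_max = max(Counter(run_labels).values())
            # worst-case changes inside the run, formula (1)
            changes += (nelem - rf_max) * 2 if rf_max > nelem / 2 else nelem - 1
            last_label = None      # assumption: the run resets the comparison
        i = j
    return changes

def soap_rank(X, y):
    # Return attribute indices ordered from fewest to most label changes.
    m = len(X[0])
    nlc = [label_changes([row[f] for row in X], y) for f in range(m)]
    return sorted(range(m), key=lambda f: nlc[f]), nlc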
3 Experiments
In order to compare the effectiveness of SOAP as a feature selector for common machine learning algorithms, experiments were performed using sixteen
Fig. 2. Subsequence of the same value (a) two changes (b) seven changes
Table 3. Data sets, number of selected features, the percentage of the original features retained, and time in milliseconds

Data Set   Inst.  Atts  Cl.    SOAP Atts (%) t-ms    CFS Atts (%) t-ms     RLF Atts (%) t-ms
autos      205    25    7      2.9 (11.8)   15       5.3 (21.3)   50       10.9 (43.7)   403
breast-c   286    9     2      1.5 (16.7)   4        4.1 (45.9)   6        3.7 (41.6)    174
breast-w   699    9     2      5.2 (57.6)   6        9.0 (99.7)   35       8.1 (89.4)    1670
diabetes   768    8     2      2.8 (34.9)   6        3.1 (38.9)   39       0.0 (0.0)     1779
glass2     163    9     2      3.2 (35.7)   2        4.0 (43.9)   9        0.3 (3.6)     96
heart-c    303    13    5      6.3 (48.2)   6        6.4 (49.1)   10       6.9 (53.4)    368
heart-st   270    13    2      5.4 (41.8)   4        6.3 (48.2)   12       6.3 (48.2)    365
hepatit.   155    19    2      2.6 (13.6)   4        8.7 (45.6)   9        13.3 (70.0)   135
horse-c.   368    27    2      2.3 (8.6)    16       2.0 (7.4)    43       2.3 (8.6)     941
hypothy.   3772   29    4      1.7 (5.7)    180      1.0 (3.4)    281      5.2 (18.0)    94991
iris       150    4     3      2.0 (50.0)   3        1.9 (48.3)   3        4.0 (100.0)   44
labor      57     16    2      4.3 (27.0)   1        3.3 (20.8)   3        8.8 (55.3)    21
lymph      148    18    4      1.8 (9.9)    3        8.9 (49.2)   7        11.8 (65.8)   109
sick       3772   29    2      1.0 (3.4)    120      1.0 (3.4)    252      7.1 (24.5)    93539
sonar      208    60    2      3.0 (5.0)    21       17.8 (29.7)  90       3.9 (6.5)     920
vote       435    16    2      1.6 (10.0)   9        1.0 (6.3)    4        15.5 (96.9)   651
Average                            (23.7)                (35.1)                 (45.3)
standard data sets from the UCI repository [4]. The data sets and their characteristics are summarized in Table 3. The percentage of correct classifications with C4.5, averaged over ten ten-fold cross-validation runs, was calculated for each algorithm-data set combination before and after feature selection by SOAP (RF 0.75), CFS and ReliefF (threshold 0.05). For each train-test split, the dimensionality was reduced by each feature selector before being passed to the learning algorithms. The same folds were used for each feature selector-learning scheme combination. To perform the experiments with CFS and ReliefF we used the Weka (Waikato Environment for Knowledge Analysis) implementation. Table 3 shows the average number of features selected and the percentage of the original features retained. SOAP is an especially selective algorithm compared with CFS and RLF. If SOAP and CFS are compared, only in one data set (labor) is the number of features selected by SOAP significantly greater than that selected by CFS; in six data sets there are no significant differences, and in nine the number of features is significantly smaller than for CFS. Compared to RLF, only in glass2 and diabetes does SOAP retain more attributes in the reduction process (a threshold of 0.05 is not sufficient there). It can be seen that SOAP retained 23.7% of the attributes on average. Table 4 shows the results for attribute selection with C4.5 and compares the size (number of nodes) of the trees produced by each attribute selection scheme
http://www.cs.waikato.ac.nz/˜ml
Table 4. Result of attribute selection with C4.5: accuracy and size of trees. ◦, • statistically significant improvement or degradation (p=0.05)

Data Set      C4.5 Acc.  Size     SOAP Acc.  Size     CFS Acc.   Size     RLF Acc.   Size
autos          82.54     63.32    73.37 •    45.84    74.54 •    55.66    74.15 •    85.74
breast-c       74.37     12.34    70.24 •     6.61    72.90      18.94    70.42 •    11.31
breast-w       95.01     24.96    94.64      21.28    95.02      24.68    95.02      24.68
diabetes       74.64     42.06    74.14       7.78    74.36      14.68    65.10 •     1.00
glass2         78.71     24.00    78.96      14.88    79.82      14.06    53.50 •     1.70
heart-c        76.83     43.87    77.06      34.02    77.16      29.35    79.60 ◦    28.72
heart-stat     78.11     34.58    80.67 ◦    19.50    80.63 ◦    23.84    82.33 ◦    14.78
hepatitis      78.97     17.06    80.19       5.62    81.68 ◦     8.68    80.45      11.26
horse-c.OR.    66.30      1.00    66.30       1.00    66.30       1.00    66.28       1.36
hypothyroid    99.54     27.84    95.02 •     4.30    96.64 •     5.90    93.52 •    12.52
iris           94.27      8.18    94.40       8.12    94.13       7.98    94.40       8.16
labor          80.70      6.93    78.25       3.76    80.35       6.44    80.00       5.88
lymph          77.36     28.05    72.84 •     7.34    75.95      20.32    74.66      24.10
sick           98.66     49.02    93.88 •     1.00    96.32 •     5.00    93.88 •     1.00
sonar          74.28     27.98    70.05 •     7.00    74.38      28.18    70.19 •     9.74
vote           96.53     10.64    95.63 •     3.00    95.63 •     3.00    96.53      10.64
Average        82.93     26.36    80.98      11.94    82.24      16.73    79.38      15.79
against the size of the trees produced by C4.5 with no attribute selection. Smaller trees are preferred because they are easier to interpret, although accuracy is generally degraded. The table shows how often each method performs significantly better (denoted by ◦) or worse (denoted by •) than C4.5 with no feature selection (columns 2 and 3). Throughout, we speak of results being significantly different if the difference is statistically significant at the 5% level according to a paired two-sided t test, where each pair of points consists of the estimates obtained in one of the ten ten-fold cross-validation runs before and after feature selection. For SOAP, feature selection degrades performance on seven data sets, improves it on one and leaves it unchanged on eight. The reason the algorithm is not as accurate is the number of attributes selected: fewer than three features, and on five of these seven data sets less than 10% of the original features. The results are similar to those of ReliefF and slightly worse than those provided by CFS. Analysing the data sets in which SOAP loses to CFS, we can observe breast-c, lymph and sonar, where the number of features selected by SOAP is about 25% of that selected by CFS (breast-c 4.1 for CFS vs. 1.5 for SOAP, lymph 8.9 vs. 1.8, sonar 17.8 vs. 3.0). Nevertheless, the accuracy reduction is small: breast-c 72.9 (CFS) vs. 70.24 (SOAP), lymph 75.95 vs. 72.84, sonar 74.38 vs. 70.05. It is also interesting to compare the speed of the attribute selection techniques. We measured the time taken in milliseconds to select the final subset of attributes; SOAP is an algorithm with a very short computation time. The results shown
in Table 3 confirm the expectations. SOAP takes roughly 400 milliseconds to reduce the 16 data sets, whereas CFS takes 853 milliseconds and RLF more than 3 minutes. In general, SOAP is faster than the other methods and its cost is independent of the number of classes. It can also be observed that ReliefF is affected very negatively by the number of instances in the data set, as can be seen for "hypothyroid" and "sick". Even if these two data sets are excluded, SOAP is more than 3 times faster than CFS and more than 75 times faster than ReliefF.
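The significance claims above rest on a paired two-sided t test at the 5% level over the ten ten-fold cross-validation runs. A generic sketch of that comparison (not the authors' code) is:

from scipy import stats

def significantly_different(acc_before, acc_after, alpha=0.05):
    # acc_before / acc_after: accuracy of each cross-validation run, paired.
    t_stat, p_value = stats.ttest_rel(acc_before, acc_after)
    return p_value < alpha, t_stat, p_value

# Example with made-up numbers:
# significantly_different([0.82, 0.83, 0.81], [0.80, 0.79, 0.81])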
4 Conclusions
In this paper we have presented a deterministic attribute selection algorithm. It is a very efficient and simple method for use in the preprocessing phase, and it produces a considerable reduction in the number of attributes in comparison with other techniques. It needs neither distance nor statistical calculations, which can be very costly in time (correlation, information gain, etc.), and its computational cost, O(m n log n), is lower than that of other methods.
References
[1] Aguilar-Ruiz, Jesús S., Riquelme, José C. and Toro, Miguel. Data set editing by ordered projection. Intelligent Data Analysis Journal, Vol. 5, No. 5, pp. 1-13, IOS Press (2001). 325
[2] Almuallim, H. and Dietterich, T. G. Learning boolean concepts in the presence of many irrelevant features. Artificial Intelligence, 69(1-2):279-305 (1994).
[3] Blake, C. L. and Merz, C. J. UCI Repository of machine learning databases (1998).
[4] Hall, M. A. Correlation-based feature selection for machine learning. PhD thesis, Department of Computer Science, University of Waikato, Hamilton, New Zealand (1998).
[5] Hoare, C. A. R. Quicksort. Computer Journal, 5(1):10-15 (1962). 327
[6] Kira, K. and Rendell, L. A practical approach to feature selection. In Proceedings of the Ninth International Conference on Machine Learning, pp. 249-256, Morgan Kaufmann (1992).
[7] Kohavi, R. and John, G. H. Wrappers for feature subset selection. Artificial Intelligence, 97, 273-324 (1997).
[8] Kononenko, I. Estimating attributes: Analysis and extensions of RELIEF. In Proceedings of the Seventh European Conference on Machine Learning, pp. 171-182, Springer-Verlag (1994).
[9] Quinlan, J. C4.5: Programs for Machine Learning. Morgan Kaufmann (1993).
[10] Robnik-Šikonja, M. and Kononenko, I. An adaptation of Relief for attribute estimation in regression. In Proceedings of the Fourteenth International Conference on Machine Learning, pp. 296-304, Morgan Kaufmann (1997).
[11] Setiono, R. and Liu, H. A probabilistic approach to feature selection - a filter solution. In Proceedings of the International Conference on Machine Learning, pp. 319-327 (1996).
This is a rough measure. Obtaining true cpu time from within a Java program is quite difficult.
Visual Clustering with Artificial Ants Colonies
Nicolas Labroche, Nicolas Monmarché, and Gilles Venturini
Laboratoire d'Informatique de l'Université de Tours, École Polytechnique de l'Université de Tours - Département Informatique, 64 avenue Jean Portalis, 37200 Tours, France
{labroche,monmarche,venturini}@univ-tours.fr
http://www.antsearch.univ-tours.fr/
Abstract. In this paper, we propose a new model of the chemical recognition system of ants to solve the unsupervised clustering problem. The colonial closure mechanism allows ants to discriminate between nestmates and intruders by means of a colonial odour that is shared by every nestmate. In our model we associate each object of the data set with the odour of an artificial ant. Each odour is defined as a real vector with two components, which can be represented in a 2D space of odours. Our method simulates meetings between ants according to pre-established behavioural rules, to ensure the convergence of similar odours (i.e. similar objects) to the same portion of the 2D space. This provides the expected partition of the objects. We test our method against other well-known clustering methods and show that it can perform well. Furthermore, our approach can handle every type of data (from numerical to symbolic attributes, provided that an adapted similarity measure exists) and allows one to visualize the dynamic creation of the nests. We plan to use this algorithm as a basis for a more sophisticated interactive clustering tool.
1 Introduction
Computer scientists have successfully used biomimetic approaches to devise well-performing heuristics. For example, Dorigo et al. modelled the pheromone trails of real ants to solve optimisation problems in the well-known Ant Colony Optimization heuristic (ACO [1, 2]). Genetic algorithms have been applied to optimisation and clustering problems [3, 4]. The collective behaviours of ants have also inspired researchers to solve the graph partitioning problem with co-evolving species [5] and the unsupervised clustering problem [6, 7]. In these studies, researchers model the ability of real ants to sort their brood: artificial ants move on a 2D discrete grid on which objects are randomly placed, and they may pick up, carry and drop one or more objects according to given probabilities. After a given time, these artificial ants build groups of similar objects that provide the final partition. In [8], the authors present AntClust, an ant-based clustering algorithm that relies on a chemical recognition model of ants. In this work, each artificial ant is associated with an object by means of its cuticular odour. This odour is representative of the nest to which each ant belongs and is coded with a single value. This value may change according to behavioural
rules and meetings with other ants, until each ant finds the nest that best fits the object it represents. In this paper, we propose an alternative model of this chemical recognition system. Our goal is to see whether the previous discrete odour model imposes limitations on the method. Furthermore, we want to be able to visualize the building of the nests dynamically; in that case, an expert could interact with the system while the results are displayed. The paper is organised as follows: Section 2 briefly describes the main principles of the chemical recognition system of ants and introduces the new model of an artificial ant (the parameters and the behavioural rules for the meetings). Section 3 presents the Visual AntClust algorithm and discusses its parameter settings. Section 4 describes the clustering methods that we use to evaluate Visual AntClust and the data sets used as benchmarks. Finally, Section 5 concludes and introduces future developments of our work.
2 The Model
2.1 The Chemical Recognition System of Real Ants
Like many other social insects (such as termites, wasps and bees), ants have developed a mechanism of colonial closure that allows them to discriminate between nestmates and intruders. Each ant generates its own cuticular odour, called the label. This label is partly defined by the genome of the ant and partly by chemical substances extracted from its environment (mainly nest materials and food). According to the "Gestalt odour theory" [9, 10], the continuous chemical exchanges between nestmates lead to the establishment of a colonial odour that is recognised by all the members of the colony.
2.2 Parameters
The main idea of Visual AntClust is to associate one object of the data set with the genome of one artificial ant. The genome allows ants to generate their own odour and is used to learn or update their recognition template. This template is used in every meeting to decide whether both ants should accept each other or not. For one ant a, we define the following parameters. The label labela is a vector in the 2D space of odours. Each of its components reflects the chemical concentration of one compound in the label of the artificial ant and is defined in [0, 1]. As no assumption can be made concerning the final partition, the label vectors are randomly initialised for every ant. The genome genomea corresponds to an object of the data set and cannot change during the algorithm. The recognition template ta is used by each ant during meetings to assess the similarity between genomes (i.e. between two objects). ta is learned during a fixed number of iterations (set experimentally to 50). During this period, the ant a encounters other randomly chosen ants and estimates the similarity
between their genomes. At the end, the ant a uses the mean and maximal similarity values mean(Sim(a, ·)) and max(Sim(a, ·)) observed during this period to set its template threshold ta as follows:

ta = ( max(Sim(a, ·)) + mean(Sim(a, ·)) ) / 2   (1)
The satisfaction estimator sa is set to 0 at the beginning of the algorithm and is increased each time the artificial ant a meets and accepts another ant in a local portion of the 2D space of odours. sa estimates, for each ant, whether it is well placed in the 2D odour space. When it increases, the ant limits the area of the 2D space in which it accepts other ants, which helps the algorithm to converge.
2.3 Meeting Algorithm
We detail hereafter the meeting process between two ants i and j. This mechanism allows ants to exchange and mutually reinforce their labels if they are close enough and if their genomes are similar enough.

Meeting(Ant i, Ant j)
(1) D ← compute the Euclidean distance between Labeli and Labelj
(2) if D ≤ (1 − max(si, sj)) then
(3)   if there is acceptance between ants i and j then
(4)     increase si and sj because ants i and j are well placed
(5)     compute Ri and Rj, the odour exchange rates between ant i and ant j
(6)     update Labeli(j) according to Ri(j) and si(j)
(7)   endif
(8) endif
The algorithm first determines whether both ants' labels are similar enough to allow the meeting. It is known that real ants prefer to reinforce their labels with nestmates that have similar odours. We also consider that an ant should have a need to exchange chemical substances in order to meet another ant; we use the satisfaction estimators si and sj to evaluate this need (line 2). An artificial ant i that is well placed in the odour space (si close to 1) already accepts most of its neighbours, which means that the object corresponding to ant i is already assigned to a good cluster. In the opposite case, if si is close to 0, either ant i is alone in a portion of the odour space, or it is in a region where the neighbours are not similar enough (i.e. the object is misclassified). In this case, the ant needs to encounter other ants to increase its chances of belonging to a good nest. If an ant matching the previous criteria is found, there must be acceptance between the ants (line 3): the similarity between the genomes of the two ants i and j must be greater than or equal to both ants' template thresholds ti and tj, as shown in equation (2):

Acceptance(Anti, Antj) ← Sim(Anti, Antj) ≥ max(tAnti, tAntj)   (2)
Finally, if both ants are equally satisfied, they mutually reinforce their labels according to an impregnation rate R, computed as follows for an ant i that accepts another ant j:

Ri ← Sim(i, j) − ti   (3)

This ensures the convergence of the algorithm by preventing ants that are well placed from updating their label. Thus, only the less satisfied ants can change their label as a consequence of a meeting, which improves the efficiency of the algorithm.
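The meeting procedure and equations (2)-(3) can be summarised in the following Python sketch; the data structures, the satisfaction increment of 0.1 and the way a label is moved toward the partner's label are our assumptions, since the paper does not spell out the update.

import math

def meeting(ant_i, ant_j, sim):
    # ant_* are dicts with 'label' (2D list), 'template', 'satisfaction', 'genome';
    # sim(g1, g2) is the domain similarity measure in [0, 1].
    d = math.dist(ant_i["label"], ant_j["label"])
    if d > 1.0 - max(ant_i["satisfaction"], ant_j["satisfaction"]):
        return                                            # step (2) gate
    s = sim(ant_i["genome"], ant_j["genome"])
    if s < max(ant_i["template"], ant_j["template"]):     # Eq. (2) acceptance
        return
    for ant in (ant_i, ant_j):                            # step (4), assumed increment
        ant["satisfaction"] = min(1.0, ant["satisfaction"] + 0.1)
    old_i, old_j = list(ant_i["label"]), list(ant_j["label"])
    for ant, other in ((ant_i, old_j), (ant_j, old_i)):
        r = s - ant["template"]                           # Eq. (3), r >= 0 here
        # assumed update: move the label toward the partner's label by a fraction r
        ant["label"] = [la + r * (lb - la) for la, lb in zip(ant["label"], other)]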
3 Visual AntClust

Visual AntClust(data set with N objects, Niter)
(1) Initialise N ants according to the parameters described in Section 2.2
(2) while Niter iterations are not reached do
(3)   Draw the N ants in the 2D odour space
(4)   for c ← 1 to N do
(5)     Meeting(i, j) where i and j are randomly chosen ants
(6)   endfor
(7) endwhile
(8) Compute Dmax, the maximal distance in odour space that allows an ant to find a nestmate
(9) Group in the same nest all the ants within a local perimeter of value Dmax
(10) Delete the nests n that are too small (size(n) < 0.5 × mean nest size)
(11) Re-assign the ants that no longer have a nest to the nest of the most similar ant
Our algorithm depends on two main parameters, namely the number of global iterations Niter and the maximal distance Dmax under which a nestmate may be found. The first parameter, Niter, has been set experimentally to 2,500 iterations. Some data sets need more or fewer iterations, but for the time being we have not developed a heuristic to determine the number of iterations automatically. Ultimately, our goal is to integrate this algorithm into an interactive framework in which an expert will be able to start and stop the visual clustering process and manipulate the groups. The second parameter, Dmax, is computed for each ant i from an estimate of the mean distance mean(D(i, ·)) and the minimal distance min(D(i, ·)) (Euclidean distances) between its label and the labels of the other ants, as follows:

Dmax ← (1 − β) mean(D(i, ·)) + β min(D(i, ·))   (4)

β has been set experimentally to 0.95 to obtain the best results.
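Steps (8)-(11) of the algorithm, together with equation (4), might be rendered as below; the union-find grouping used to implement "group all the ants within a local perimeter" is our assumption, not a detail taken from the paper.

import math

def build_nests(labels, beta=0.95):
    n = len(labels)
    # Eq. (4): per-ant threshold from mean and minimal label distances
    def dmax(i):
        dists = [math.dist(labels[i], labels[j]) for j in range(n) if j != i]
        return (1 - beta) * (sum(dists) / len(dists)) + beta * min(dists)
    thresholds = [dmax(i) for i in range(n)]
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    # assumed rule: link two ants when their labels lie within both thresholds
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(labels[i], labels[j]) <= min(thresholds[i], thresholds[j]):
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]   # nest id per ant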
4 Experiments and Results
4.1 Experiments
We apply a traditional k-Means approach because of its time complexity Θ(N ), N being the number of objects to cluster. The algorithm inputs an initial
Table 1. Main characteristics of the data sets

Data set  Art1  Art2  Art3  Art4  Art5  Art6  Iris  Glass  Pima  Soybean  Thyroid
N         400   1000  1100  200   900   400   150   214    798   47       215
M         2     2     2     2     2     8     4     9      8     35       5
K         4     2     4     2     9     4     3     7      2     4        3
partition of k clusters. This partition is then refined at each iteration by assigning each object to its nearest cluster center; the cluster centers are updated after each object assignment until the algorithm converges, i.e. the method stops when the intra-class inertia becomes stable. In our experiments, we randomly generate the initial partition with the expected number of clusters to get the best results with this approach. We also compare our new approach to AntClust, another ant-based clustering algorithm that also relies on the chemical recognition system of ants and that is described in [8]. In this method the label of each ant is coded by a single value that reflects the nest the ant belongs to; the algorithm is able to find good partitions without any assumption concerning the number of expected clusters and can treat every type of data. We use several numerical data sets to evaluate the contribution of our method: real data sets such as Iris, Pima, Soybean, Thyroid and Glass, and artificial data sets called Art1 to Art6, generated according to distinct laws (Gaussian and uniform) and with distinct difficulties. The number of objects N, the number of attributes M and the number of clusters K of each data set are given in Table 1. We use a clustering error measure that evaluates the difference between the final partition output by each method and the expected one, by comparing each pair of objects and verifying each time whether they are clustered similarly or not in both partitions.
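The pairwise clustering-error measure described above can be sketched as follows (our formulation of the pair-counting idea):

from itertools import combinations

def clustering_error(found, expected):
    # found / expected: cluster label of each object in the two partitions.
    disagreements, pairs = 0, 0
    for i, j in combinations(range(len(found)), 2):
        same_found = found[i] == found[j]
        same_expected = expected[i] == expected[j]
        disagreements += same_found != same_expected
        pairs += 1
    return disagreements / pairs

# clustering_error([0, 0, 1, 1], [0, 0, 0, 1]) -> 3/6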
4.2 Results
Table 2 shows that, even with its simple heuristic to create the partition from the 2D space of odours, Visual AntClust is able to perform well. For some data sets, such as Art1, Art4, Art6, Iris and Glass, it slightly outperforms AntClust, although it generally shows a greater variability in its performance. Furthermore, Visual AntClust is even better than the k-Means approach in two cases, Art6 and Soybean, where its clustering error is equal to 0. We notice also that Visual AntClust does not manage to estimate the expected number of clusters for Pima and Thyroid. This may mean that the parameters of the algorithm are not well suited to these data sets, or that the heuristic that finds the nests is not accurate enough in these cases to find a good partition. It is important to notice that Visual AntClust is slower than AntClust and, of course, k-Means, because at each iteration it has to draw the on-going partition of the data set.
Fig. 1. Four steps of Visual AntClust in the 2D-space of odours to build the final partition for Art1: beginning of the algorithm (0 iterations), convergence (about 100 iterations), cluster apparition (about 500 iterations) and cluster definition (about 1000 iterations)
Figure 1 represents four successive steps of Visual AntClust as it dynamically builds the final partition.
5 Conclusion and Perspectives
In this paper, we have introduced a new ant-based clustering algorithm named Visual AntClust. It associates one object of the data set with the genome of one artificial ant and simulates meetings between ants to build the expected partition dynamically. The method is compared to the k-Means approach and to AntClust, another ant-based clustering method, over artificial and real data sets. We show that Visual AntClust generally performs well and sometimes very well. Furthermore, it can treat every type of data, unlike k-Means which is limited to numerical data, and it allows one to visualize the partition as it is constructed. In the near future, we will try to develop a method to automate the parameter setting. We will also use Visual AntClust as a basis for an interactive clustering and data visualization tool.
Table 2. Mean number of clusters (# clusters) and mean clustering error (Clustering Error) with their standard deviations ([std]) for each data set and each method, computed over 50 runs

                       # clusters                                 Clustering Error
Data set  K-Means       AntClust      VAntClust      K-Means      AntClust     VAntClust
Art1      3.98 [0.14]   4.70 [0.95]   4.66 [1.04]    0.11 [0.00]  0.22 [0.03]  0.17 [0.07]
Art2      2.00 [0.00]   2.30 [0.51]   3.76 [2.36]    0.04 [0.00]  0.07 [0.02]  0.14 [0.06]
Art3      3.84 [0.37]   2.72 [0.88]   3.54 [2.11]    0.22 [0.02]  0.15 [0.02]  0.23 [0.07]
Art4      2.00 [0.00]   4.18 [0.83]   2.16 [0.37]    0.00 [0.00]  0.23 [0.05]  0.03 [0.04]
Art5      8.10 [0.75]   6.74 [1.66]   4.5 [1.69]     0.09 [0.02]  0.26 [0.02]  0.29 [0.12]
Art6      4.00 [0.00]   4.06 [0.24]   3.98 [0.14]    0.01 [0.04]  0.05 [0.01]  0.00 [0.02]
Iris      2.96 [0.20]   2.82 [0.75]   2.28 [0.49]    0.14 [0.03]  0.22 [0.01]  0.19 [0.06]
Glass     6.88 [0.32]   5.90 [1.23]   6.2 [1.16]     0.32 [0.01]  0.36 [0.02]  0.35 [0.03]
Pima      2.00 [0.00]   10.66 [2.33]  19.38 [7.68]   0.44 [0.00]  0.46 [0.01]  0.49 [0.01]
Soybean   3.96 [0.20]   4.16 [0.55]   4.00 [0.00]    0.09 [0.08]  0.07 [0.04]  0.00 [0.00]
Thyroid   3.00 [0.00]   4.62 [0.90]   11.66 [3.44]   0.18 [0.00]  0.16 [0.03]  0.36 [0.07]
References
[1] A. Colorni, M. Dorigo, and V. Maniezzo, "Distributed optimization by ant colonies," in Proceedings of the First European Conference on Artificial Life (F. Varela and P. Bourgine, eds.), pp. 134-142, MIT Press, Cambridge, 1991. 332
[2] E. Bonabeau, M. Dorigo, and G. Theraulaz, From Natural to Artificial Swarm Intelligence. New York: Oxford University Press, 1999. 332
[3] Y. Chiou and L. W. Lan, "Genetic clustering algorithms," European Journal of Operational Research, no. 135, pp. 413-427, 2001. 332
[4] L. Y. Tseng and S. B. Yang, "A genetic approach to the automatic clustering problem," Pattern Recognition, no. 34, pp. 415-424, 2001. 332
[5] P. Kuntz and D. Snyers, "Emergent colonization and graph partitioning," in Cliff et al. [11], pp. 494-500. 332
[6] E. Lumer and B. Faieta, "Diversity and adaptation in populations of clustering ants," in Cliff et al. [11], pp. 501-508. 332
[7] N. Monmarché, M. Slimane, and G. Venturini, "On improving clustering in numerical databases with artificial ants," in Lecture Notes in Artificial Intelligence (D. Floreano, J. Nicoud, and F. Mondada, eds.), Swiss Federal Institute of Technology, Lausanne, Switzerland, pp. 626-635, 1999. 332
[8] N. Labroche, N. Monmarché, and G. Venturini, "A new clustering algorithm based on the chemical recognition system of ants," in Proc. of the 15th European Conference on Artificial Intelligence (ECAI 2002), Lyon, France, pp. 345-349, 2002. 332, 336
[9] N. Carlin and B. Hölldobler, "The kin recognition system of carpenter ants (Camponotus spp.). I. Hierarchical cues in small colonies," Behav Ecol Sociobiol, vol. 19, pp. 123-134, 1986. 333
[10] B. Hölldobler and E. Wilson, The Ants. Springer Verlag, Berlin, Germany, 1990. 333
[11] D. Cliff, P. Husbands, J. Meyer, and S. Wilson, eds., Third International Conference on Simulation of Adaptive Behavior: From Animals to Animats 3, MIT Press, Cambridge, Massachusetts, 1994. 338
Maximizing Benefit of Classifications Using Feature Intervals
Nazlı İkizler and H. Altay Güvenir
Bilkent University, Department of Computer Engineering, 06533 Ankara, Turkey
{inazli,guvenir}@cs.bilkent.edu.tr
Abstract. There is a great need for classification methods that can properly handle asymmetric cost and benefit constraints of classifications. In this study, we aim to emphasize the importance of classification benefits by means of a new classification algorithm, Benefit-Maximizing classifier with Feature Intervals (BMFI) that uses feature projection based knowledge representation. Empirical results show that BMFI has promising performance compared to recent cost-sensitive algorithms in terms of the benefit gained.
1 Introduction
Classical machine learning applications try to reduce the quantity of errors and usually ignore their quality. However, in real-world applications the nature of an error is crucial, and the benefit of a correct classification may not be the same for all cases. Cost-sensitive classification research addresses this imperfection and evaluates the effects of predictions rather than simply measuring predictive accuracy. By incorporating cost (and benefit) knowledge into the classification process, the effectiveness of an algorithm in real-world situations can be evaluated more rationally. In this study, we concentrate on the costs of misclassifications and try to minimize those costs by maximizing the total benefit gained during classification. Within this framework, we propose a new cost-sensitive classification technique, called Benefit-Maximizing classifier with Feature Intervals (BMFI for short), that uses the predictive power of the feature projection method previously proposed in [6]. In BMFI, the voting procedure has been changed to impose the cost-sensitivity property, and generalization techniques are implemented to avoid overfitting and to eliminate redundancy. BMFI has been tested over several benchmark datasets and a number of real-world datasets that we have compiled. The rest of the paper is organized as follows. In Section 2, the benefit maximization problem is addressed. Section 3 gives the algorithmic description of BMFI along with the details of the feature intervals concept, the voting methods and the generalizations. Experimental evaluation of BMFI is presented in Section 4. Finally, Section 5 reviews the results and presents future research directions on the subject.
2 Benefit Maximization Problem
Recent research in machine learning has used the terminology of costs when dealing with misclassifications. However, those studies mostly ignore the fact that correct classifications may have different interpretations: besides implying no cost, accurate labeling of instances may entail indisputable gains. Elkan points out the importance of these gains [3]. He states that doing accounting in terms of benefits is generally preferable because there is a natural baseline from which all benefits can be measured, and thus it is much easier to avoid mistakes. The benefit concept is more appropriate to real-world situations, since the net flow of gain is more accurately denoted by the benefits attained. If a prediction is profitable from the decision agent's point of view, its benefit is said to be positive; otherwise it is negative, which is the same as the cost of a wrong decision. To incorporate this natural knowledge of benefits into cost-sensitive learning, we use benefit matrices. B = [bij] is an n × m benefit matrix of domain D if n equals the number of prediction labels, m equals the number of possible class labels in D, and the bij's are such that

bij ≥ 0 if i = j,  and  bij < bii otherwise.   (1)

Here, bij represents the benefit of classifying an instance of true class j as class i. The structure of the benefit matrix is similar to that of the cost matrix, with the extension that entries can have either positive or negative values. In addition, the diagonal elements should be non-negative, ensuring that correct classifications can never have negative benefits. Given a benefit matrix B, the optimal prediction for an instance x is the class i that maximizes the expected benefit (EB), that is

EB(x, i) = Σj P(j|x) × bij   (2)
where P(j|x) is the probability that x has true class j. The total expected benefit of the classifier model M over the whole test data is

EBM = Σx maxi∈C EB(x, i) = Σx maxi∈C Σj P(j|x) × bij   (3)
where C is the set of possible class labels in the domain.
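A minimal sketch of the decision rule in equation (2), assuming the benefit matrix layout defined above (B[i][j] is the benefit of predicting i when the true class is j); the function name and the example matrix are ours:

import numpy as np

def best_prediction(p_given_x, B):
    # p_given_x: vector of P(j|x) over true classes; B: n x m benefit matrix.
    expected = B @ p_given_x           # EB(x, i) = sum_j P(j|x) * b_ij
    return int(np.argmax(expected)), expected

# Example: a binary problem where catching class 1 is worth more.
B = np.array([[1.0, -5.0],
              [-1.0, 10.0]])
pred, eb = best_prediction(np.array([0.7, 0.3]), B)   # pred == 1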
3 Benefit Maximization with Feature Intervals
As shown in [6], feature-projection-based classification is a fast and accurate method, and the rules it learns are easy for humans to verify. For this reason, we have chosen to extend its predictive power to incorporate benefit knowledge. In a particular classification problem, given a training dataset with p features, an instance x can be thought of as a point in a p-dimensional space with an associated class label xc. It is represented as a vector of nominal or
train(TrainingSet, BenefitMatrix)
begin
  for each feature f
    sort(f, TrainingSet)
    i_list ← make_point_intervals(f, TrainingSet)
    for each interval i in i_list
      vote_i(c) ← voting_method(i, f, BenefitMatrix)
    if f is linear
      i_list ← generalize(i_list, BenefitMatrix)
end.
Fig. 1. Training phase of BMFI

linear feature values together with its associated class, i.e., <x1, x2, ..., xp, xc>. Here, xf represents the value of the f-th feature of the instance x. If we consider each feature separately and take x's projection onto each feature dimension, then we can represent x by the combination of its feature projections. The training process of the BMFI algorithm is given in Fig. 1. In the beginning, for each feature f, all training instances are sorted with respect to their value for f. This sort operation is identical to forming the projections of the training instances for each feature f. A point interval is constructed for each projection. Initially, the lower and upper bounds of the interval are equal to the f value of the corresponding training instance. If the f value of a training instance is unknown, it is simply ignored. If there are several point intervals with the same f value, they are combined into a single point interval by adding the class counts. At the end of point interval construction, the vote for each class label is determined by using one of two voting methods. The first one is the voting method of the CFI algorithm [5], called VM1 in our context. VM1 can be formulated as follows:

VM1(c, I) = Nc / classCount(c)   (4)
where Nc is the number of instances that belong to class c in interval I and classCount(c) is the total number of instances of class c in the entire training set. This voting method favors the prediction of the minority class in proportion to its occurrence in the interval. The second voting method, called VM2, is based on the optimal prediction approximation given by (2) and makes direct use of the benefit matrix. VM2 casts votes for class c in interval I as

VM2(c, I) = Σk∈C bck × P(k|I)   (5)
where P(k|I) is the estimated probability that an instance falling into interval I will have the true class k, calculated as

P(k|I) = Nk / classCount(k)   (6)
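The two voting rules in equations (4)-(6) amount to the following sketch, assuming the per-interval class counts and the training-set class totals are available; names are ours, not from the paper:

def vm1(counts, class_count):
    # counts[c]: N_c inside the interval; class_count[c]: total of class c.
    return {c: counts.get(c, 0) / class_count[c] for c in class_count}

def vm2(counts, class_count, B, classes):
    # B[c][k]: benefit of predicting c when the true class is k (Eq. (5)).
    p = {k: counts.get(k, 0) / class_count[k] for k in classes}   # Eq. (6)
    return {c: sum(B[c][k] * p[k] for k in classes) for c in classes}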
generalize(interval_list)
begin
  I ← first interval in interval_list
  while I is not empty do
    I' ← interval after I
    I'' ← interval after I'
    if merge_condition(I, I', I'') is true
      merge I' (and/or) I'' into I
    else
      I ← I'
end.
Fig. 2. Generalization of intervals step in BMFI
After the initial assignment of votes, for linear features, intervals are generalized to form range intervals in order to eliminate redundancy and avoid overfitting. The generalization process is illustrated in Fig. 2. Here, merge_condition() is a comparison function that evaluates the relative properties of each interval and returns true if a sufficient level of similarity between neighboring intervals is reached. Besides adding more prediction power to the algorithm, proper generalization reduces the number of intervals and thereby decreases the classification time. In this work, we have experimented with three interval generalization methods. The first one, called SF (same frequent), joins two consecutive intervals if the most frequently occurring class of both is the same. The second method, SB (same beneficial), joins two consecutive intervals if they have the same beneficial class. A class c is the beneficial class of an interval i iff, for all j ∈ C with j ≠ c, Σx∈i B(x, c) ≥ Σx∈i B(x, j). If the beneficial classes of two consecutive intervals are the same, it can be more profitable to unite them into a single interval. The third method, HC (high confidence), combines three consecutive intervals into a single one when the middle interval has less confidence in its votes than the other two; the confidence of an interval is measured as the difference between the votes of the most beneficial class and the second most beneficial class.

Table 1. List of evaluated cost-sensitive algorithms
Name     Description
MetaNB   MetaCost on Naive Bayes
MetaJ48  MetaCost on J4.8
C1NB     CostSensitiveClassifier with reweighting on Naive Bayes
C2NB     CostSensitiveClassifier with direct minimization on Naive Bayes
C1J48    CostSensitiveClassifier with reweighting on J4.8
C2J48    CostSensitiveClassifier with direct minimization on J4.8
C1VFI    CostSensitiveClassifier with reweighting on VFI
C2VFI    CostSensitiveClassifier with direct minimization on VFI
classify(q)
begin
  for each class c
    v_c ← 0
  for each feature f
    if q_f is known
      I ← search_interval(f, q_f)
      for each class c
        v_c ← v_c + interval_vote(c, I)
  prediction ← argmax_c(v_c)
end.
Fig. 3. Classification step in BMFI
The classification process of the BMFI algorithm is given in Fig. 3. The choice of voting method depends on the characteristics of the domain. Based on our empirical results, we propose using VM1 voting together with the SF, SB and HC techniques when the correct classification of the minority class is more beneficial than that of the other classes. Conversely, when the benefit matrix is not correlated with the class distribution, VM2 can be employed together with SB and HC to boost the benefit performance. The experimental results presented in Sect. 4 are obtained using this general rule of thumb.
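For illustration, the classification step of Fig. 3 could be written in Python as below, assuming intervals[f] holds the (lower bound, upper bound, votes) triples built during training; this is a sketch of the idea, not the authors' implementation:

def classify(q, intervals, classes):
    totals = {c: 0.0 for c in classes}
    for f, qf in enumerate(q):
        if qf is None:                       # unknown feature value: skip
            continue
        for low, high, votes in intervals[f]:
            if low <= qf <= high:            # search_interval(f, q_f)
                for c in classes:
                    totals[c] += votes.get(c, 0.0)
                break
    return max(totals, key=totals.get)       # argmax over accumulated votes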
4 Experimental Results
For evaluation purposes, we have used benchmark datasets from UCI ML Repository [1]. These data sets do not have predefined benefit matrices, so we formed their benefit matrices in the following manner. In binary datasets, one class is assumed to be more important to predict correctly than the other by a constant benefit ratio, b. We have tested our algorithm by using five different b values that are 2, 5, 10, 20, 50. Note that when b is equal to 1, the problem reduces to the classical classification problem. Further, we have compiled four new datasets. Their benefit matrices have been defined by experts of each domain. For more information about the datasets and benefit matrices the reader is referred to [7]. We have compared BMFI with MetaCost [2] and CostSensitiveClassifier of Weka [4] on well-known base classifiers which are Naive Bayesian Classifier, C4.5 decision tree learner and VFI [6]. Table 1 lists these algorithms with their base classifiers (J4.8 is Weka’s implementation of C4.5 in Java). MetaCost is a wrapper algorithm that takes a base classifier and makes it sensitive to costs of classification [2]. It operates with a bagging logic beneath and learns multiple classifiers on multiple bootstrap replicates of the training set. MetaCost has become a benchmark for comparing cost-sensitive algorithms. In addition to MetaCost, we have compared our algorithm with two cost sensitive classifiers provided in Weka. The first method uses reweighting of training
Table 2. Comparative evaluation of BMFI with wrapper cost-sensitive algorithms. The entries are benefit-per-instance values. Best results are shown in bold

Domain          MetaNB  MetaJ48  C1NB  C2NB  C1J48  C2J48  C1VFI  C2VFI  BMFI
breast-cancer    4.0     4.0     3.8   4.0   3.9    3.7    3.7    2.8    3.9 (VM1)
pima-diabetes    2.8     3.0     2.8   2.7   2.9    2.5   -1.5    2.8    2.7 (VM1)
ionosphere       5.7     6.1     6.5   6.0   6.5    5.7    6.4    6.1    6.5 (VM2)
liver disorders  5.3     5.2     5.3   5.4   5.4    4.4    4.3    5.3    5.4 (VM2)
sonar            3.3     4.5     4.6   4.0   4.6    3.3    0.0    4.0    4.9 (VM2)
bank-loans      -0.8    -0.9    -0.4  -0.6   0.1   -0.5   -1.2   -2.8   -0.1 (VM1)
bankruptcy       7.8     7.7     7.5   7.4   7.5    7.3    7.7    7.8    7.9 (VM1)
dermatology      7.5     7.5     7.2   7.5   7.2    7.3    6.9    5.6    7.4 (VM2)
lesion           8.7     8.9     7.8   9.0   7.8    7.7    6.4    4.0    9.0 (VM1)
instances and the second makes direct cost-minimization based on probability distributions [8]. We call these two classifiers C1 and C2, respectively. Experimental results are presented in Table 2; in this table, the results on the binary datasets are benefit-per-instance values for b = 10, and all results are recorded using 10-fold cross-validation. As the results demonstrate, the BMFI algorithm is very successful in most of the domains and comparable to the other algorithms in all of them. In the ionosphere, liver, sonar, bankruptcy and lesion domains, BMFI attains the maximum benefit-per-instance value; in the remaining datasets its performance is high and comparable to the other algorithms. We have observed that the benefit achieved is, as expected, highly dependent on the nature of the domain, i.e., the benefit matrix, the distribution of classes, etc. In addition, it is worth noting that BMFI outperforms the cost-sensitive versions of its base classifier VFI (C1VFI and C2VFI). This observation suggests that using benefit knowledge inside the algorithm itself is more effective than wrapping a meta-stage around it to transform it into a cost-sensitive classifier. On binary datasets, we observed that the success of BMFI increases as the benefit ratio increases. This is an important highlight of BMFI and is mostly due to its high sensitivity to the benefit of classifications. This aspect of BMFI is illustrated with the results for the pima-diabetes dataset given in Table 3.
Table 3. Benefit-per-instance values for the pima-diabetes dataset with different benefit ratios. Best results are shown in bold

b    MetaNB  MetaJ48  C1NB  C2NB  C1J48  C2J48  C1VFI  C2VFI  BMFI
2     0.5     0.6     0.7   0.6   0.6    0.6    0.0    0.0    0.5
5     1.2     1.2     1.5   1.3   1.2    1.2   -0.5    1.1    1.2
10    2.8     2.8     3.0   2.7   2.9    2.5   -1.5    2.8    2.7
20    5.8     5.8     6.2   6.1   6.1    5.6   -3.3    6.3    6.3
50   16.6    16.2    16.2  16.6  16.3   14.7   -9.0   16.7   16.8
5 Conclusions and Future Work
In this study, we have focused on the problem of making predictions when the outcomes have different benefits associated with them. We have implemented a new algorithm, BMFI, that uses the predictive power of the feature intervals concept to maximize the total benefit of classifications. We make direct use of the benefit matrix information provided to the algorithm in tuning the predictions so that the resulting benefit gain is maximized. BMFI has been compared to MetaCost and to two other cost-sensitive classification algorithms provided in Weka; these generic algorithms were wrapped around NBC, C4.5 and VFI. The results show that BMFI is very effective in maximizing the benefit-per-instance values, and it is more successful in domains where the prediction of a certain class is particularly important. The empirical results we obtained also show that using benefit information directly in the algorithm itself is more effective than using a meta-stage around the base classifier. In the benefit maximization problem, we have observed that individual characteristics of the datasets influence the results significantly, due to the extreme correlation between cost-sensitivity and class distributions. As future work, feature-dependent domains can be explored in depth and the feature-dependency aspect of BMFI can be improved. Benefit maximization can also be extended to include feature costs, and feature selection mechanisms that are sensitive to the individual costs of features can be utilized.
References
[1] Blake, C. L. and Merz, C. J.: UCI Repository of machine learning databases. University of California, Irvine, Department of Information and Computer Sciences (1998) http://www.ics.uci.edu/~mlearn/MLRepository.html 343
[2] Domingos, P.: MetaCost: A general method for making classifiers cost-sensitive. In: Proceedings of the International Conference on Knowledge Discovery and Data Mining, San Diego, CA (1999) 155-164 343
[3] Elkan, C.: The foundations of cost-sensitive learning. In: Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (2001) 340
[4] Frank, E. et al.: Weka 3 - Data Mining with Open Source Machine Learning Software in Java. The University of Waikato (2000) http://www.cs.waikato.ac.nz/~ml/weka 343
[5] Güvenir, H. A.: Detection of abnormal ECG recordings using feature intervals. In: Proceedings of the Tenth Turkish Symposium on Artificial Intelligence and Neural Networks (2001) 265-274 341
[6] Güvenir, H. A., Demiröz, G. and İlter, N.: Learning differential diagnosis of erythemato-squamous diseases using voting feature intervals. Artificial Intelligence in Medicine, Vol. 13(3) (1998) 147-165 339, 340, 343
[7] İkizler, N.: Benefit Maximizing Classification Using Feature Intervals. Technical Report BU-CE-0208, Bilkent University (2002) 343
[8] Ting, K. M.: An instance-weighting method to induce cost-sensitive trees. IEEE Transactions on Knowledge and Data Engineering, Vol. 14(3) (2002) 659-665 344
Parameter Estimation for Bayesian Classification of Multispectral Data
Refaat M. Mohamed and Aly A. Farag
Computer Vision and Image Processing Laboratory, University of Louisville, Louisville, KY 40292, USA
{refaat,farag}@cvip.uofl.edu
www.cvip.uofl.edu
Abstract. In this paper, we present two algorithms for estimating the parameters of a Bayes classifier for remote sensing multispectral data. The first algorithm uses Support Vector Machines (SVM) as a multi-dimensional density estimator. This algorithm is supervised in the sense that it needs the number of classes and a training sample for each class to be specified in advance. The second algorithm employs the Expectation Maximization (EM) algorithm, in an unsupervised way, to estimate the number of classes and the parameters of each class in the data set. A performance comparison of the presented algorithms shows that the SVM-based classifier outperforms the Gaussian-based and Parzen-window-based classifiers. We also show that the EM-based classifier provides results comparable to the Gaussian-based and Parzen-window-based classifiers while being an unsupervised algorithm. Keywords: Bayes classification, density estimation, support vector machines (SVM), expectation maximization (EM), multispectral data.
1 Introduction
The Bayes classifier constitutes the basic setup for a large category of classifiers. To implement the Bayes classifier, two main parameters have to be addressed [1]: the first is the estimation of the class-conditional probability for each class, and the second is the prior probability of each class. In this paper, we address both of these parameters. The Support Vector Machines (SVM) approach was developed to solve the classification problem, but it has recently been extended to regression problems [2]. SVM have been shown to perform well for density estimation, where the probability distribution function of the feature vector x is inferred from a random sample Ð; SVM represent Ð by a small number of support vectors and the associated kernels [3]. This paper employs SVM as a density estimator for multi-dimensional feature spaces and uses this estimate in a Bayes classifier. SVM as a density estimator work
in a supervised setup; the number of classes and a design data sample for each class need to be available. There are a number of unsupervised algorithms for estimating the parameters of a Bayes classifier, e.g. the well-known k-means algorithm [4], but k-means requires knowledge of the number of classes. In this paper, we propose a fully unsupervised approach for estimating the Bayes classifier parameters. The Expectation Maximization (EM) algorithm is a general method for finding the maximum-likelihood estimate (MLE) of the parameters of an underlying distribution from a given data set when the data is incomplete or has missing values. The proposed approach incorporates the EM algorithm, with some generalizations, to estimate the number of classes, the parameters of each class and the prior probability of each class. The two algorithms introduced in this paper are therefore: Algorithm 1, SVM as a density estimator for a Bayes classifier; and Algorithm 2, the EM algorithm for estimating the number of classes, the class-conditional probability parameters and the prior probabilities for Bayesian classification of multispectral data. We describe the details of the algorithms and their implementation for the classification of seven-dimensional multispectral Landsat data in a Bayes classifier setup. We evaluate the performance of the two algorithms with respect to the following classical approaches: Gaussian-based and Parzen-window-based classifiers. The proposed algorithms show high performance against these algorithms.
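Both algorithms ultimately feed a standard Bayes decision rule, which combines a class-conditional density estimate with a class prior and predicts the class with the largest product; a generic sketch follows (the density estimators themselves are the subject of the following sections, and the function names here are ours):

import numpy as np

def bayes_classify(x, class_densities, priors):
    # class_densities: list of callables estimating p(x | class c); priors: P(c).
    scores = [p(x) * prior for p, prior in zip(class_densities, priors)]
    return int(np.argmax(scores))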
2 Support Vector Machines for Density Estimation
Support Vector Machines (SVM) were developed by Vapnik [5] and are gaining popularity due to many attractive features and promising empirical performance. The formulation embodies the Structural Risk Minimization (SRM) principle, which has been shown to be superior to the traditional Empirical Risk Minimization (ERM) principle employed in conventional learning algorithms (e.g. neural networks). SRM minimizes an upper bound on the generalization error, as opposed to ERM, which minimizes the error on the training data. It is this difference which makes SVM more attractive in statistical learning applications. In this section, we present a brief outline of the SVM approach for density estimation.
2.1 The Density Estimation Problem
The probability density function p(x) of the random vector x is a nonnegative quantity which is defined by

F(x) = ∫_{−∞}^{x} p(x′) dx′
(1)
where F (x) is the cumulative distribution function. Hence, in order to estimate the probability density, we need to solve the integral equation:
348
Refaat M Mohamed and Aly A Farag
x ∫ p (x ′, α ) dx ′ = F (x) −∞
(2)
on a given set of densities p( x, α ) , where the integration in (2) is a vector integration, and α is the set of parameters to be determined. The estimation problem in (2) can be regarded as solving the linear operator equation: A p ( x) = F ( x)
(3)
where the operator A is a one-to-one mapping for the elements p(x) of the Hilbert space E1 into elements F (x) of the Hilbert space E 2 . 2.2
Support Vector Machines
In support vector machines, we look for a solution of the density estimation problem in the form:

p(x) = \sum_{k=1}^{N} \beta_k K(x, x_k)    (4)
where N is the size of the training sample, K(x, x_k) is the kernel that defines a Reproducing Kernel Hilbert Space (RKHS), and the β_k's are the parameters to be estimated using the SVM method. The density estimation problem in (3) is known to be an ill-posed problem, see [6]. One solution for ill-posed problems is to introduce a semi-continuous, positive functional (Λ(p(x)) ≤ c; c > 0) in a RKHS. Also, we define p(x) as a trade-off between Λ(p(x)) and ||A p(x) − F_N(x)||. An example of such a method is that of Phillips [6]:

\min_{p(x)} \Lambda(p(x))    (5)

such that:

\| A p(x) - F_N(x) \| < \varepsilon_N, \quad \varepsilon_N > 0, \; \varepsilon_N \to 0

or simply,

\left| F_N(x) - \int_{-\infty}^{x} p(t) \, dt \right|_{x = x_k} = \varepsilon_k    (6)
where F_N(x) is an estimate of F(x) which is characterized by an estimation error ε_k for each instant x_k, see [6]. Using the properties of the RKHS, (5) can be rewritten as:

\Lambda(p(x)) = (p(x), p(x))_H = \left( \sum_{k=1}^{N} \beta_k K(x, x_k), \; \sum_{j=1}^{N} \beta_j K(x, x_j) \right)_H = \sum_{k=1}^{N} \sum_{j=1}^{N} \beta_k \beta_j K(x_k, x_j)    (7)
Therefore, to solve the estimation problem in (3) we minimize the functional:

W(\beta) = \Lambda(p(x)) = \sum_{k=1}^{N} \sum_{j=1}^{N} \beta_k \beta_j K(x_k, x_j)    (8)
subject to the constraints (6), and:

\beta_k \ge 0, \quad \sum_{j=1}^{N} \beta_j = 1    (9)
where 1 ≤ k ≤ N. The constraints in (9) are imposed to obtain the solution in the form of a mixture of densities. The only remaining issue for the SVM description is the choice of the functional form for the kernel. To obtain a solution in the form of a mixture of densities, we choose a nonnegative kernel which satisfies the following conditions:

K_\gamma(x, x_k) = a(\gamma) K\!\left(\frac{x - x_k}{\gamma}\right); \quad a(\gamma) \int K\!\left(\frac{x - x_k}{\gamma}\right) dx = 1; \quad K(0) = 1    (10)
Following is a listing of the implementation steps for the SVM as a density estimator.

Algorithm 1. SVM density estimator
Step 1: Get the random sample D from the density function to be estimated.
Step 2: Use D to construct F_N(x) and the ε_k's.
Step 3: Select a kernel function that satisfies (10).
Step 4: Set up the objective function (8) for the optimization algorithm.
Step 5: Set up the constraints (6) and (9).
Step 6: Apply the optimization algorithm.
Step 7: The nonzero parameters (β_k's) and their corresponding data vectors are considered as the support vectors.
Step 8: Calculate the density function value from (4) corresponding to a feature vector x.
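The following Python sketch illustrates Algorithm 1 for a one-dimensional sample, assuming a Gaussian kernel (which satisfies (10)) and a heuristic tolerance ε_k of order 1/sqrt(N); these choices, the SLSQP solver, and all function names are our assumptions for illustration, not prescriptions from the paper.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def svm_density_1d(sample, gamma=0.5, eps=None):
    """Sketch of Algorithm 1 in 1-D: minimize the quadratic functional (8)
    subject to the CDF-matching constraints (6) and the mixture
    constraints (9)."""
    x = np.sort(np.asarray(sample, dtype=float))
    N = len(x)
    F_N = np.arange(1, N + 1) / N              # empirical CDF at the sample points
    eps = 1.0 / np.sqrt(N) if eps is None else eps    # assumed tolerance
    K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / gamma) ** 2)   # Gram matrix, K(0)=1
    Phi = norm.cdf((x[:, None] - x[None, :]) / gamma)  # integral of the normalized kernel

    objective = lambda b: b @ K @ b                                  # functional (8)
    constraints = [
        {"type": "eq", "fun": lambda b: b.sum() - 1.0},              # constraint (9)
        {"type": "ineq", "fun": lambda b: eps - np.abs(Phi @ b - F_N)},  # constraint (6)
    ]
    res = minimize(objective, np.full(N, 1.0 / N), bounds=[(0, None)] * N,
                   constraints=constraints, method="SLSQP")
    beta = res.x
    a_gamma = 1.0 / (gamma * np.sqrt(2.0 * np.pi))   # a(gamma) from condition (10)

    def density(t):
        """Density value from (4), using the normalized kernel K_gamma."""
        t = np.atleast_1d(t)
        k = np.exp(-0.5 * ((t[:, None] - x[None, :]) / gamma) ** 2)
        return a_gamma * (k * beta[None, :]).sum(axis=1)

    support = np.flatnonzero(beta > 1e-6)            # support vectors (Step 7)
    return beta, support, density

In practice a dedicated quadratic programming solver would be used; SLSQP is chosen here only to keep the sketch self-contained.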
3 Unsupervised Density Estimation Using the EM Algorithm
The EM algorithm has been proposed as an iterative optimization algorithm for finding maximum likelihood estimates (e.g., [7][8]). In this paper, we use the EM algorithm to estimate the parameters of a mixture of densities. For an observed data set X of size N, the probabilistic model is defined as:

p(x \mid \Theta) = \sum_{j=1}^{M} \alpha_j \, p_j(x \mid \theta_j)    (11)

where:
M: number of different components "classes" in the mixture,
α_j: weight of component j in the mixture, with \sum_{j=1}^{M} \alpha_j = 1,
θ_j: parameters indexing the density of component j in the mixture,
Θ = (α_j's, θ_j's): parameters indexing the mixture density.
The likelihood for the mixture density can be written as:

l(\Theta \mid X) = \prod_{i=1}^{N} p(x_i \mid \Theta)    (12)
The optimization problem can be significantly simplified if we assume that the data set X is incomplete and there is an unobserved data set w = \{w_i\}_{i=1}^{N}, each component of which is associated with a component of X. The values of w determine which density in the mixture (i.e. which class) generated the particular data instance. Thus, we have w_i = k, with w_i \in \{1, 2, ..., M\}, if sample i of X is generated by mixture component k. If w is known, (12) can be written as:

\log(l(\Theta \mid X, w)) = \log(P(X, w \mid \Theta)) = \sum_{i=1}^{N} \log\big( P(x_i \mid w_i) P(w) \big) = \sum_{i=1}^{N} \log\big( \alpha_{w_i} \, p_{w_i}(x_i \mid \theta_{w_i}) \big)    (13)
The parameter set Θ = (α_j's, θ_j's) can be obtained if the components of w are known. In the case where w is unknown, we can still proceed by assuming w to be a random vector. In the remainder of this section, we develop an expression for the distribution of the unobserved data w. We start by guessing initial values Θ^c of the mixture density parameters, i.e. we assume \Theta^c = (\alpha_j^c, \theta_j^c;\; j = 1...M). Using Bayes rule, the conditional probability density for the value w_i, the "class of site i", is written as:

p(w_i \mid x_i, \Theta^c) = \frac{\alpha_{w_i}^c \, p_{w_i}(x_i \mid \theta_{w_i}^c)}{p(x_i \mid \Theta^c)} = \frac{\alpha_{w_i}^c \, p_{w_i}(x_i \mid \theta_{w_i}^c)}{\sum_{k=1}^{M} \alpha_k^c \, p_k(x_i \mid \theta_k^c)}    (14)
Then, the joint density for w can be written as:

p(w \mid X, \Theta^c) = \prod_{i=1}^{N} p(w_i \mid x_i, \Theta^c)    (15)
3.1 Maximizing the Conditional Expectation
The EM algorithm first finds the expected value Q(\Theta, \Theta^c) of the log-likelihood of the complete data (the observed data X and the unobserved data w) with respect to the unknown data w. The pieces of information which are known here are the observed data set X and the current parameter estimates Θ^c. Thus, we can write:

Q(\Theta, \Theta^c) = E\big[ \log p(X, w \mid \Theta) \mid X, \Theta^c \big]    (16)
where Θ are the parameters that we optimize in order to maximize the conditional expectation Q. In (16), X and Θ^c are constants, Θ is a regular variable that we try to adjust, and w is a random variable governed by the density function defined in (15). Thus, using (13) and (15), the conditional expectation in (16) can be rewritten as:

Q(\Theta, \Theta^c) = \int \log p(X, w \mid \Theta) \, p(w \mid X, \Theta^c) \, dw = \sum_{w} \log\big( l(\Theta \mid X, w) \big) \, p(w \mid X, \Theta^c)    (17)

where the sum runs over all possible values of the unobserved data w.
With some mathematical simplification for (17), the conditional expectation can be rewritten as:

Q(\Theta, \Theta^c) = \sum_{j=1}^{M} \sum_{i=1}^{N} \log\big( \alpha_j \, p_j(x_i \mid \theta_j) \big) \, p(j \mid x_i, \Theta^c) = \sum_{j=1}^{M} \sum_{i=1}^{N} \log(\alpha_j) \, p(j \mid x_i, \Theta^c) + \sum_{j=1}^{M} \sum_{i=1}^{N} \log\big( p_j(x_i \mid \theta_j) \big) \, p(j \mid x_i, \Theta^c)    (18)
In (18), the term containing α_j and the term containing θ_j are independent. Hence, these terms can be maximized independently, which significantly simplifies the optimization problem.
3.2 Parameter Estimation
In this section we present the implementation steps for estimating the different parameters of the density function underlying the distribution of the classes in a data set. The density is assumed to be a mixture of densities. Thus, the parameters to be estimated are the number of components in the mixture, the mixing coefficients, and the parameters of each density component. As stated before, the first term of the conditional expectation in (18) can be maximized independently to obtain the mixing coefficients α_j. Maximizing the first term in (18) results in the value of a mixing coefficient as:
\alpha_j = \frac{1}{N} \sum_{i=1}^{N} p(j \mid x_i, \Theta^c)    (19)
We stated before that we use the EM algorithm under the assumption that we set a parametric form for the density of each class. Theoretically, the algorithm can be
applied to any form of the class densities. For computational purposes, we assume the class densities to be multivariate Gaussian. Under this assumption, the parameters are the mean vector and the covariance matrix, i.e. \theta_j = (\mu_j, \Sigma_j). Under the normality assumption, the second term of (18) becomes:

\sum_{j=1}^{M} \sum_{i=1}^{N} \log\big( p_j(x_i \mid \theta_j) \big) \, p(j \mid x_i, \Theta^c) = \sum_{j=1}^{M} \sum_{i=1}^{N} \left( -\frac{1}{2} \log(|\Sigma_j|) - \frac{1}{2} (x_i - \mu_j)^T \Sigma_j^{-1} (x_i - \mu_j) \right) p(j \mid x_i, \Theta^c)    (20)
Maximizing (20) with respect to μ and Σ gives:

\mu_j = \frac{\sum_{i=1}^{N} x_i \, p(j \mid x_i, \Theta^c)}{\sum_{i=1}^{N} p(j \mid x_i, \Theta^c)} \quad \text{and} \quad \Sigma_j = \frac{\sum_{i=1}^{N} p(j \mid x_i, \Theta^c) (x_i - \mu_j)(x_i - \mu_j)^T}{\sum_{i=1}^{N} p(j \mid x_i, \Theta^c)}    (21)
The essence of the proposed algorithm is that we can automatically estimate the number of classes M in the scene. The optimum value of M is the value at which Q is maximal. This is done simply by assuming a number of classes M, optimizing (16) to obtain the other parameters, and at the same time calculating the final optimized value of Q. Then, we change M and redo the optimization process. This iterative procedure continues until a maximum value of Q is obtained. The value of M at which Q is maximal is the optimum number of classes defined in the data set.

Algorithm 2.
Step 1: Set a value for M.
Step 2: Define an initial estimate for Θ.
Step 3: Compute the value of Q from (18) using these parameters.
Step 4: Compute the new values of Θ from (19) and (21).
Step 5: Compute the new value of Q.
Step 6: Repeat steps 4 and 5 until an acceptably small difference between subsequent values of Q is obtained.
Step 7: Increase M by one and iterate from step 2 until Q is maximal.
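A compact Python sketch of Algorithm 2 is given below, assuming multivariate Gaussian components as in the text; the initialization scheme, the small covariance regularization term, and the function names are our illustrative assumptions rather than details from the paper.

import numpy as np
from scipy.stats import multivariate_normal

def em_mixture(X, M, n_iter=200, tol=1e-4, seed=0):
    """EM updates (14), (19) and (21) for a fixed number of components M.
    Returns the estimated parameters and the final value of Q from (18)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    N, d = X.shape
    alpha = np.full(M, 1.0 / M)
    mu = X[rng.choice(N, size=M, replace=False)]
    sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(M)])
    Q_old = -np.inf
    for _ in range(n_iter):
        # E-step: responsibilities p(j | x_i, Theta^c), equation (14).
        dens = np.column_stack([alpha[j] * multivariate_normal.pdf(X, mu[j], sigma[j])
                                for j in range(M)])
        resp = dens / dens.sum(axis=1, keepdims=True)
        # Conditional expectation Q at the current parameters, equation (18).
        Q = np.sum(resp * np.log(dens + 1e-300))
        if abs(Q - Q_old) < tol:
            break
        Q_old = Q
        # M-step: mixing coefficients (19), means and covariances (21).
        Nj = resp.sum(axis=0)
        alpha = Nj / N
        mu = (resp.T @ X) / Nj[:, None]
        for j in range(M):
            diff = X - mu[j]
            sigma[j] = (resp[:, j, None] * diff).T @ diff / Nj[j] + 1e-6 * np.eye(d)
    return alpha, mu, sigma, Q

def estimate_number_of_classes(X, M_max=10):
    """Step 7 of Algorithm 2: sweep M and keep the value at which Q is maximal."""
    results = {M: em_mixture(X, M)[3] for M in range(1, M_max + 1)}
    return max(results, key=results.get)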
4 Experimental Results
In this section we present the classification results of using the above two algorithms as density estimators in the design of a Bayesian classifier. For the SVM approach, a 7-D Gaussian-like kernel function has been used. The experiments were carried out
using real data acquired from the Landsat Thematic Mapper (TM) for two different data sets, as discussed below. The proposed algorithms are evaluated against Gaussian-based and Parzen window-based algorithms. Details of these algorithms are given elsewhere (e.g. [1]).
4.1 Golden Gate Bay Area
The first part of the experiments is carried out using real multispectral data (7 bands) for the Golden Gate Bay area of the city of San Francisco, California, USA. Five classes are defined on this image: Trees, Streets (light urban), Water, Buildings (dense urban), and Earth. The available ground truth data for this data set includes about 1000 points per class, which is approximately 1% of the image scene. The classification results, shown in Table 1 and Fig. 1, show that the SVM-based classifier outperforms the Gaussian-based and Parzen window-based algorithms, which indicates that the SVM works well as a density estimator in multi-dimensional feature spaces. For the second algorithm, Fig. 2 shows the evolution of the conditional expectation with the number of components assumed in the area. The figure shows that the maximum value of Q occurs at M = 5, which is the nominal value. The parameters estimated using the proposed algorithm are then used in a Bayes classifier. Table 1 shows that the unsupervised classifier provides comparable results, which reflects its applicability to the problem.
Fig. 1. Classified Images for the Golden Gate Area. From upper left to lower right, original, Gaussian-based, Parzen Window-based, SVM-based, and EM-based Bayes classifier
4.2 Agricultural Area
This part of the experiments was carried out using real multispectral data for an agricultural area (Fig. 3). It is known that this area contains: Background, Corn, Soybean, Wheat, Oats, Alfalfa, Clover, Hay/Grassland, and Unknown. The ground
truth data for that area is available. Again, the results in Fig. 3 and Table 2 support the same conclusions as those drawn for the Golden Gate Bay area data set.
Fig. 2. Evolution of the conditional expectation with the number of classes; left: Golden Gate, right: Agricultural area (actual value is Q × 10^4)
Fig. 3. Classified Images for the Agricultural Area. From upper left to lower right, original, Gaussian-based, Parzen Window-based, SVM-based, and EM-based Bayes classifier
5 Conclusion and Future Work
This paper presented two algorithms for estimating the parameters of a Bayes classifier for multispectral data classification. The first is a supervised algorithm which uses the SVM as an estimator for the class conditional probabilities. The second is unsupervised and assumes the density distribution of a class to be a mixture of different densities. The algorithm is used to estimate the number of components of this mixture (the number of classes), the parameters of each component, and the proportion of each component in the mixture (which is regarded as the prior probability of the class). The results show that the SVM is capable of modeling density functions well in high dimensions. The results also illustrate that the unsupervised algorithm performs well on the data sets used in the experiments. For future work, we will study methods for enhancing the SVM performance and for automatically choosing its parameters. For the unsupervised algorithm, we will relax the normality assumption on the densities. We will also adapt the algorithm to estimate the distribution of the observed data set assuming nonparametric forms for the densities of the classes in the data.
References
[1] A. Farag, R. M. Mohamed and H. Mahdi, "Experiments in Image Classification and Data Fusion," Proceedings of the 5th International Conference on Information Fusion, Annapolis, MD, Vol. 1, pp. 299-308, July 2002.
[2] V. Vapnik, S. Golowich and A. Smola, "Support Vector Method for Function Approximation, Regression Estimation, and Signal Processing," Neural Information Processing Systems, Vol. 9, MIT Press, Cambridge, MA, 1997.
[3] V. Vapnik and S. Mukherjee, "Support Vector Method for Multivariate Density Estimation," Neural Information Processing Systems 1999, Vol. 12, MIT Press, 2000.
[4] R. O. Duda et al., Pattern Classification, John Wiley & Sons, 2nd edition, 2001.
[5] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, 2nd edition, 2000.
[6] A. Farag and R. Mohamed, "Classification of Multispectral Data Using Support Vector Machines Approach for Density Estimation," Proceedings of the 7th INES 2003, Assiut, Egypt, March 4-6, 2003.
[7] D. Phillips, "A Technique for the Numerical Solution of Certain Integral Equations of the First Kind," Journal of the Association for Computing Machinery, 9:84-97, 1962.
[8] G. A. Rempala and R. A. Derrig, "Modeling Hidden Exposures in Claim Severity via the EM Algorithm," TR-MATH, Mathematics Department, University of Louisville, December 2002.
Goal Programming Approaches to Support Vector Machines
Hirotaka Nakayama(1), Yeboon Yun(2), Takeshi Asada(3), and Min Yoon(4)
(1) Konan University, Kobe 658-8501, Japan
(2) Kagawa University, Kagawa 761-0396, Japan
(3) Osaka University, Osaka 565-0871, Japan
(4) Yonsei University, Seoul 120-749, Republic of Korea
Abstract. Support vector machines (SVMs) are gaining much popularity as effective methods in machine learning. In pattern classification problems with two class sets, their basic idea is to find a maximal margin separating hyperplane which gives the greatest separation between the classes in a high dimensional feature space. However, the idea of maximal margin separation is not quite new: in the 1960s the multi-surface method (MSM) was suggested by Mangasarian. In the 1980s, linear classifiers using goal programming were developed extensively. This paper considers SVMs from a viewpoint of goal programming, and proposes a new method based on the total margin instead of the shortest distance between the learning data and the separating hyperplane.
1 Introduction
Recently, Support Vector Machines (SVMs, in short) have been gaining much popularity for machine learning. One of the main features of SVMs is that they are kernel-based linear classifiers with maximal margin on the feature space. The idea of maximal margin in linear classifiers is intuitive, and its reasoning in connection with perceptrons was given in the early 1960s (e.g., Novikoff [6]). The maximal margin is effectively applied to discrimination analysis using mathematical programming, e.g., the MSM (Multi-Surface Method) by Mangasarian [4]. Later, linear classifiers with maximal margin were formulated as linear goal programming problems, and were studied extensively from the 1980s to the beginning of the 1990s. The pioneering work was given by Freed and Glover [2], and a good survey can be found in Erenguc and Koehler [1]. This paper discusses the goal programming approach to SVMs, and shows a close relationship between the so-called ν-SVM and our suggested total margin method.
2 Goal Programming Approach to Linear Discrimination Analysis
Let X be a space of conditional attributes. For binary classification problems, the value of +1 or −1 is assigned to each data point x_i according to its class A or
B. The aim of machine learning is to predict which class newly observed data belong to by virtue of a discrimination function f(x) constructed on the basis of the given data set (x_i, y_i) (i = 1, ..., l), where y_i = +1 or −1. Namely,

f(x) \ge 0 \Rightarrow x \in A
f(x) < 0 \Rightarrow x \in B

When the discrimination function is linear, namely if f(x) = w^T x + b, this is referred to as a linear classifier. In this case, the discrimination boundary is given by the hyperplane defined by w^T x + b = 0. In 1981, Freed and Glover suggested obtaining a hyperplane separating two classes with as few misclassified data as possible by using goal programming [2]. Let ξ_i denote the exterior deviation, which is the deviation from the hyperplane of a point x_i that is improperly classified. Similarly, let η_i denote the interior deviation, which is the deviation from the hyperplane of a point x_i that is properly classified. Some of the main objectives in this approach are as follows:

i) Minimize the maximum exterior deviation (decrease errors as much as possible)
ii) Maximize the minimum interior deviation
iii) Maximize the weighted sum of interior deviations
iv) Minimize the weighted sum of exterior deviations

Of course, some of these can be combined. Note that objective ii) corresponds to maximizing the shortest distance between the learning data and the hyperplane. Although many models have been suggested, the one considering iii) and iv) above may be given by the following linear goal programming problem. Letting y_i = +1 for i ∈ I_A and y_i = −1 for i ∈ I_B:

(GP)    minimize    \sum_{i=1}^{l} (h_i \xi_i - k_i \eta_i)
        subject to  y_i (x_i^T w + b) = \eta_i - \xi_i,
                    \xi_i, \eta_i \ge 0, \quad i = 1, ..., l.
Here, h_i and k_i are positive constants. In order for ξ_i and η_i to have the meaning of the exterior deviation and the interior deviation respectively, the condition ξ_i η_i = 0 must hold for every i = 1, ..., l.

Lemma 1. If h_i > k_i for i = 1, ..., l, then we have ξ_i η_i = 0 for every i = 1, ..., l at the solution to (GP).

Proof. Easy due to Lemma 7.3.1 of [7].

It should be noted that the above formulation may yield unacceptable solutions such as w = 0 or an unbounded solution.
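To make the formulation concrete, the following Python sketch sets (GP) up as a linear program with scipy.optimize.linprog. The weights h and k, the use of Glover's normalization (condition (1) later in this section) as one way to exclude w = 0, and all names are our illustrative choices; whether the resulting LP is bounded still depends on the data and the weights.

import numpy as np
from scipy.optimize import linprog

def gp_linear_classifier(X, y, h=2.0, k=1.0):
    """(GP): minimize sum(h*xi_i - k*eta_i) subject to
    y_i (x_i^T w + b) = eta_i - xi_i and xi_i, eta_i >= 0.
    Decision vector: [w (n), b (1), xi (l), eta (l)]."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    l, n = X.shape
    c = np.concatenate([np.zeros(n + 1), h * np.ones(l), -k * np.ones(l)])
    # y_i (x_i^T w + b) + xi_i - eta_i = 0
    A_eq = np.hstack([y[:, None] * X, y[:, None], np.eye(l), -np.eye(l)])
    b_eq = np.zeros(l)
    # Glover-type normalization (-l_A * sum_B x_i + l_B * sum_A x_i)^T w = 1.
    in_A, in_B = y == 1, y == -1
    g = -in_A.sum() * X[in_B].sum(axis=0) + in_B.sum() * X[in_A].sum(axis=0)
    A_eq = np.vstack([A_eq, np.concatenate([g, np.zeros(1 + 2 * l)])])
    b_eq = np.append(b_eq, 1.0)
    bounds = [(None, None)] * (n + 1) + [(0, None)] * (2 * l)
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
    if res.status != 0:
        raise RuntimeError("LP did not solve (possibly unbounded): " + res.message)
    w, b = res.x[:n], res.x[n]
    return w, b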
Example. Let z_1 = (−1, 1), z_2 = (0, 2) ∈ A and z_3 = (1, −1), z_4 = (0, −2) ∈ B. The constraint functions of (GP) are given by

z_1: w_1(−1) + w_2(1) + b = η_1 − ξ_1
z_2: w_1(0) + w_2(2) + b = η_2 − ξ_2
z_3: w_1(1) + w_2(−1) + b = η_3 − ξ_3
z_4: w_1(0) + w_2(−2) + b = η_4 − ξ_4
Here, it is clear that ξ = 0 at the optimal solution. The constraints include η_i added on the right hand side. Note that the feasible region in this formulation moves to the north-west as η_i increases. Maximizing η_i yields an unbounded optimal solution unless further constraints on w are added. In the goal programming approach to linear classifiers, therefore, some appropriate normality condition must be imposed on w in order to provide a bounded optimal solution. One such normality condition is ||w|| = 1. If the classification problem is linearly separable, then using the normalization ||w|| = 1, the separating hyperplane H: w^T x + b = 0 with maximal margin can be given by

(GP_1)  maximize    η
        subject to  y_i (x_i^T w + b) \ge \eta, \quad i = 1, ..., l,
                    ||w|| = 1.
However, this normality condition makes the problem a nonlinear optimization problem. Note that the following SVM formulation, with an objective function minimizing ||w||, avoids this unboundedness handily. Instead of maximizing the minimum interior deviation as in (GP_1) stated above, we use the following equivalent formulation with the normalization w^T z + b = ±1 at the points with minimum interior deviation:

(SVM)   minimize    ||w||
        subject to  y_i (w^T z_i + b) \ge 1, \quad i = 1, ..., l,

where y_i is +1 or −1 depending on the class of z_i. Several kinds of norm are possible. When ||w||_2 is used, the problem reduces to quadratic programming, while the problem with ||w||_1 or ||w||_∞ reduces to linear programming (see, e.g., Mangasarian [5]). For the above example, we have the following conditions in the SVM formulation:

z_1: w_1(−1) + w_2(1) + b \ge 1
z_2: w_1(0) + w_2(2) + b \ge 1
z_3: w_1(1) + w_2(−1) + b \le −1
z_4: w_1(0) + w_2(−2) + b \le −1

Since it is clear that the optimal hyperplane has b = 0, the constraint functions for z_3 and z_4 are identical to those for z_1 and z_2. The feasible region in the
(w_1, w_2)-plane is given by w_2 \ge w_1 + 1 and w_2 \ge 1/2. Minimizing the objective function of the SVM yields the optimal solution (w_1, w_2) = (−1/2, 1/2) for the QP formulation. Similarly, for the LP formulation we obtain a solution on the line segment \{w_2 = w_1 + 1\} \cap \{-1/2 \le w_1 \le 0\}, depending on the initial solution. On the other hand, Glover [3] shows the following necessary and sufficient condition for avoiding unacceptable solutions:

\left( -l_A \sum_{i \in I_B} x_i + l_B \sum_{i \in I_A} x_i \right)^T w = 1    (1)
where l_A and l_B denote the number of data points in categories A and B, respectively. Geometrically, the normalization (1) means that the distance between the two hyperplanes passing through the centers of the data for A and B is scaled by l_A l_B. Taking into account that η_i/||w|| represents the margin of a correctly classified data point x_i from the hyperplane w^T x + b = 0, a larger value of η_i and a smaller value of ||w|| are more desirable in order to maximize the margin. On the other hand, since ξ_i/||w|| stands for the margin of misclassified data, the value of ξ_i should be minimized. Methods considering all of the ξ_i/||w|| and η_i/||w|| are referred to as total margin methods. Now, we have the following formulation for obtaining a linear classifier with maximal total margin:
(GP_2)  minimize    ||w|| + \sum_{i=1}^{l} (h_i \xi_i - k_i \eta_i)
        subject to  y_i (x_i^T w + b) = \eta_i - \xi_i, \quad i = 1, ..., l,
                    \xi_i, \eta_i \ge 0, \quad i = 1, ..., l.
If we maximize the smallest margin for correctly classified data instead of the sum of all margins, the formulation reduces to the well-known ν-SVM by setting \eta_i \equiv \rho, h_i = 1/l, and k_i = \nu:

(ν-SVM) minimize    ||w|| - \nu\rho + \frac{1}{l} \sum_{i=1}^{l} \xi_i
        subject to  y_i (x_i^T w + b) \ge \rho - \xi_i, \quad i = 1, ..., l,
                    \xi_i \ge 0, \quad i = 1, ..., l, \quad \rho \ge 0.

In the following, we use a simplified formulation of (GP_2) as follows:
(GP_3)  minimize    \frac{1}{2} ||w||_2^2 + C_1 \sum_{i=1}^{l} \xi_i - C_2 \sum_{i=1}^{l} \eta_i
        subject to  y_i (x_i^T w + b) \ge \eta_i - \xi_i, \quad i = 1, ..., l,
                    \xi_i, \eta_i \ge 0, \quad i = 1, ..., l.
Note that the equality constraints in (GP_2) are changed into inequality constraints in (GP_3), which yield an equivalent solution. When the data set is not linearly separable, we can consider linear classifiers on the feature space mapped from the original data space by some nonlinear map Φ, in a similar fashion to SVMs. Considering the dual problem to (GP_3), we obtain a simple formulation by using a kernel K(x_i, x_j) with the property K(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle as follows:

(GP_3-SVM)  minimize    \frac{1}{2} \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j K(x_i, x_j)
            subject to  \sum_{i=1}^{l} \alpha_i y_i = 0,
                        C_2 \le \alpha_i \le C_1, \quad i = 1, ..., l.
w=
l
yi α∗i xi .
i=1
n+ +n− l 1 ∗ b = yi αi K xi , xj , (n+ − n− ) − n+ + n− j=1 i=1 ∗
where n+ is the number of xj with C2 < α∗j < C1 and yj = +1 and n− is the number of xj with C2 < α∗j < C1 and yj = −1. Likewise, similar formulations following i) and/or iv) of the main objectives of goal programming are possible: l 1 ||w||22 + C1 σ − C2 minimize ηi (GP4 ) 2 i=1 subject to yi xTi w + b ηi − σ, i = 1, . . . , l, (GP5 )
minimize subject to
σ, ηi 0, i = 1, . . . , l. 1 ||w||2 + C1 σ − C2 ρ 2 2 yi xTi w + b ρ − σ, i = 1, . . . , l, σ, ρ 0.
GP_4 and GP_5 can be reformulated using kernels in a similar fashion to GP_3-SVM:

(GP_4-SVM)  minimize    \frac{1}{2} \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j K(x_i, x_j)
            subject to  \sum_{i=1}^{l} \alpha_i y_i = 0,
                        \sum_{i=1}^{l} \alpha_i \le C_1,
                        C_2 \le \alpha_i, \quad i = 1, ..., l.
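The (GP_3-SVM) dual above is a standard box-constrained quadratic program. The sketch below solves a small instance with SciPy's SLSQP routine and recovers b* as described in the text; the RBF kernel, the values of C1, C2 and gamma, and all function names are our illustrative assumptions.

import numpy as np
from scipy.optimize import minimize

def rbf_kernel(X1, X2, gamma=0.5):
    """Gaussian (RBF) kernel matrix between the rows of X1 and X2."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * d2)

def gp3_svm_fit(X, y, C1=1.0, C2=0.01, gamma=0.5):
    """(GP3-SVM) dual: minimize 1/2 sum y_i y_j a_i a_j K(x_i, x_j)
    subject to sum a_i y_i = 0 and C2 <= a_i <= C1."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    l = len(y)
    K = rbf_kernel(X, X, gamma)
    Q = (y[:, None] * y[None, :]) * K
    res = minimize(lambda a: 0.5 * a @ Q @ a,
                   np.full(l, 0.5 * (C1 + C2)),
                   bounds=[(C2, C1)] * l,
                   constraints=[{"type": "eq", "fun": lambda a: a @ y}],
                   method="SLSQP")
    alpha = res.x
    # Points with C2 < alpha_j < C1 define n_+ and n_- for recovering b*.
    on_margin = (alpha > C2 + 1e-6) & (alpha < C1 - 1e-6)
    n_plus = int(np.sum(on_margin & (y == 1)))
    n_minus = int(np.sum(on_margin & (y == -1)))
    b = ((n_plus - n_minus)
         - np.sum(K[:, on_margin].T @ (y * alpha))) / max(n_plus + n_minus, 1)
    decision = lambda Xt: rbf_kernel(np.asarray(Xt, dtype=float), X, gamma) @ (y * alpha) + b
    return alpha, b, decision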
Table 1. Classification rate of various SVMs (Liver - 90% training data)

       SVM(hard)  SVM(soft)  ν-SVM    GP3 SVM  GP4 SVM  GP5 SVM  GP
AVE    -          67.57      68.38    71.11    63.53    59.08    69.12
STDV   -          8.74       7.10     8.59     8.10     6.79     6.55
TIME   -          221        109.96   192.14   1115     93.83    2.29
Table 2. Classification rate of various SVMs (Liver - 70% training data)

       SVM(hard)  SVM(soft)  ν-SVM    GP3 SVM  GP4 SVM  GP5 SVM  GP
AVE    66.51      65.32      67.35    68.33    65.05    55.48    68.16
STDV   2.43       4.83       4.2      4.68     6.39     5.28     2.94
TIME   20.2       161.81     28.57    61.33    109.83   19.98    1.23
Table 3. Classification rate of various SVMs (Cleveland - 90% training data)

       SVM(hard)  SVM(soft)  ν-SVM    GP3 SVM  GP4 SVM  GP5 SVM  GP
AVE    74.67      80.0       80.59    79.89    71.67    72.81    80.67
STDV   6.86       4.88       4.88     5.85     5.63     5.63     5.33
TIME   17.57      35.06      22.4     36.04    6.86     6.86     2.25
(GP_5-SVM)  minimize    \frac{1}{2} \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j K(x_i, x_j)
            subject to  \sum_{i=1}^{l} \alpha_i y_i = 0,
                        C_2 \le \sum_{i=1}^{l} \alpha_i \le C_1,
                        0 \le \alpha_i, \quad i = 1, ..., l.
3 Numerical Example
We compare several SVMs on well-known benchmark problems: 1) BUPA liver disorders (345 instances with 7 attributes), and 2) Cleveland heart-disease (303 instances with 14 attributes) (http://www.ics.uci.edu/~mlearn/MLSummary.html).
Table 4. Classification rate of various SVMs (Cleveland - 70% training data)

       SVM(hard)  SVM(soft)  ν-SVM    GP3 SVM  GP4 SVM  GP5 SVM  GP
AVE    75.6       81.07      81.97    81.03    79.01    75.06    83.19
STDV   6.27       3.29       3.95     2.88     2.58     6.08     2.78
TIME   6.6        12.64      9.18     14.41    12.51    5.63     1.16
We made 10 trials with randomly selected training data of 90% and 70% of the original data set and test data of 10% and 20%, respectively. Tables 1-4 show the results. We used a computer with an Athlon 1800+ CPU. As can be seen in Table 1, SVM (hard margin) cannot produce any solution for the 90% training data of the BUPA liver disorders problem. This seems to be because the problem is not linearly separable in a strict sense, even in the feature space. On the other hand, the linear classifier of GP provides a good result. Therefore, we may conclude that the problem is nearly linearly separable, but not in a strict sense. The data around the boundary may be affected by noise, and this explains why nonlinear decision boundaries do not necessarily yield good classification ability.
4 Concluding Remarks
SVMs were discussed from the viewpoint of goal programming. Since the model (GP) is a linear program, its computation time is extremely short. However, GP sometimes produces undesirable solutions. It is seen through our experiments that the stated normalization cannot get rid of this difficulty completely. SVMs with LP formulations are of course possible. We shall report numerical experiments with LP-based SVMs on another occasion.
References
[1] Erenguc, S. S. and Koehler, G. J. (1990) Survey of Mathematical Programming Models and Experimental Results for Linear Discriminant Analysis, Managerial and Decision Economics, 11, 215-225
[2] Freed, N. and Glover, F. (1981) Simple but Powerful Goal Programming Models for Discriminant Problems, European J. of Operational Research, 7, 44-60
[3] Glover, F. (1990) Improved Linear Programming Models for Discriminant Analysis, Decision Sciences, 21, 771-785
[4] Mangasarian, O. L. (1968) Multisurface Method of Pattern Separation, IEEE Transactions on Information Theory, IT-14, 801-807
[5] Mangasarian, O. L. (2000) Generalized Support Vector Machines, in Advances in Large Margin Classifiers, A. Smola, P. Bartlett, B. Schölkopf, and D. Schuurmans (eds.), MIT Press, Cambridge, pp. 135-146
[6] Novikoff, A. B. (1962) On Convergence Proofs on Perceptrons, Symposium on the Mathematical Theory of Automata, 12, 615-622
[7] Sawaragi, Y., Nakayama, H. and Tanino, T. (1994) Theory of Multiobjective Optimization, Academic Press
[8] Schölkopf, B. and Smola, A. S. (1998) New Support Vector Algorithms, NeuroCOLT2 Technical Report Series NC2-TR-1998-031
Asymmetric Triangular Fuzzy Sets for Classification Models
J. F. Baldwin and Sachin B. Karale*
A.I. Group, Department of Engineering Mathematics, University of Bristol, Bristol BS8 1TR, UK
{Jim.Baldwin,S.B.Karale}@bristol.ac.uk
Abstract. Decision trees have already proved to be important in solving classification problems in various fields of application in the real world. The ID3 algorithm by Quinlan is one of the well-known methods for forming a classification tree. Baldwin introduced probabilistic fuzzy decision trees in which fuzzy partitions are used to discretize continuous feature universes. Here, we introduce a way of fuzzy partitioning which yields asymmetric triangular fuzzy sets for the mass assignment approach to fuzzy logic. In this paper we show with examples that the use of asymmetric and unevenly spaced triangular fuzzy sets reduces the number of fuzzy sets and also increases the efficiency of the probabilistic fuzzy decision tree.
Keywords. Decision trees, fuzzy decision trees, fuzzy sets, mass assignments, ID3 algorithm
1 Introduction
ID3 is a classic decision tree generation algorithm for classification problems. It was later extended to C4.5 to allow continuous variables with the use of crisp partitioning [1]. If attribute A is partitioned into two sets, one having values A > α and the other A ≤ α, for some parameter α, then small changes in attribute value can result in inappropriate changes to the assigned class. So most of the time the ID3 algorithm becomes handicapped in the presence of continuous attributes. There are many more partitioning techniques available to deal with continuous attributes [2,3]. Fuzzy ID3 extends ID3 by allowing the values of attributes to be fuzzy sets. This allows discretization of continuous variables using fuzzy partitions. The use of fuzzy sets to partition the universe of continuous attributes gives much more significant results [2]. Baldwin's mass assignment theory is used to translate attribute values into a probability distribution over the fuzzy partitions. Fuzzy decision rules are less sensitive to small changes in attribute values near partition boundaries and thus help to obtain considerably good results from the ID3 algorithm.
* Author to whom correspondence should be addressed.
A fuzzy partition in the context of this paper is a set of triangular fuzzy sets such that for any attribute value the sum of membership of the fuzzy sets is 1. The use of symmetric, evenly spaced triangular fuzzy sets generates a decision tree, which normally gives good classification on training sets but for test data sets it may tend to give erroneous results, depending on the data set used as training set. This can be avoided by using asymmetric, unevenly spaced triangular fuzzy sets. This is discussed in the following sections and supported with examples.
2 Symmetric and Asymmetric Triangular Fuzzy Partitioning
The need to partition the universe has introduced a new problem into decision tree induction. The use of asymmetric, unevenly spaced triangular fuzzy sets can help to improve the results compared to those obtained with symmetric, evenly spaced triangular fuzzy sets. Asymmetric fuzzy sets give improved or similar results depending on the circumstances.
2.1 Symmetric, Evenly Spaced Triangular Fuzzy Sets
Symmetric, evenly spaced triangular fuzzy partitioning is very popular and has been used for various applications in a number of research areas. If the fuzzy sets {f_1, ..., f_n} form a fuzzy partition then we have

\sum_{i=1}^{n} prob(f_i) = 1    (1)

where prob(f_i) is the membership function of 'f_i' with respect to a given distribution over the universe [4]. When forming the fuzzy partitions, the total universe is evenly divided into a number of symmetric triangular partitions, decided by the user, which generally look like the fuzzy partitions shown in Fig. 1.
Fig. 1. Symmetric, evenly spaced triangular fuzzy sets
Depending on the training data set available, this may result in forming one or more fuzzy partitions which have no data points associated with them. In this case, if there is a data point in the test data set which lies in such a fuzzy partition, then obviously
the model gives equal probability towards each class for that data point, and thus there is the possibility of erroneous results.
2.2 Asymmetric, Unevenly Spaced Triangular Fuzzy Sets
As in the symmetric case, the fuzzy sets {f_1, ..., f_n} form a fuzzy partition with the probability distribution given by formula (1). However, the fuzzy partitions formed in this method depend on the training data set available. The training data set is divided so that each fuzzy partition has almost the same number of data points associated with it. This results in asymmetric fuzzy sets, which look like the fuzzy sets shown in Fig. 2.
Fig. 2. Asymmetric, unevenly spaced triangular fuzzy sets
This minimises the possibility of developing rules which give equal probability prob(f_i) towards each class for data points in the test data set, and helps to determine the class to which a data point belongs. In turn, it increases the average accuracy of the decision model. If the training data has data points clustered at a few points of the universe of the attribute instead of scattered all over the universe, or if there are gaps in the universe of the attribute without any data points, this modification also helps to minimize the number of fuzzy sets required and/or indicates the maximum number of fuzzy sets possible (i.e. in such cases the algorithm provides the exact number of fuzzy sets required).
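A minimal Python sketch of this data-driven placement is given below, assuming that the triangular fuzzy-set centres are placed at equally spaced quantiles of the training values so that neighbouring partitions hold roughly equal numbers of points; the function names and the quantile choice are illustrative assumptions, not details from the paper.

import numpy as np

def asymmetric_centres(values, n_sets):
    """Centres of n_sets triangular fuzzy sets placed at data quantiles,
    so each partition covers roughly the same number of training points.
    Assumes the quantiles are distinct (no heavy ties in the data)."""
    return np.quantile(np.asarray(values, dtype=float),
                       np.linspace(0.0, 1.0, n_sets))

def memberships(x, centres):
    """Triangular memberships that sum to 1: each value belongs to at most
    the two neighbouring fuzzy sets."""
    m = np.zeros(len(centres))
    if x <= centres[0]:
        m[0] = 1.0
    elif x >= centres[-1]:
        m[-1] = 1.0
    else:
        j = np.searchsorted(centres, x) - 1       # centres[j] <= x < centres[j+1]
        t = (x - centres[j]) / (centres[j + 1] - centres[j])
        m[j], m[j + 1] = 1.0 - t, t
    return m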
3 Defuzzification
After having formed a decision tree, we can obtain probabilities for each data point towards each class of the target attribute. It is then another important task to defuzzify these probabilities and obtain the class to which the data point belongs. Here, we use the value at the maximum membership of the fuzzy set (i.e. the point of the fuzzy set at which data points show 100% dependency towards that fuzzy set) for defuzzification. This can easily be justified with an example, as below.
Let us assume we have an attribute with universe from 0 to 80 and five fuzzy sets on this attribute as shown in Fig. 2. If we have a data point having value 30 on this attribute then after fuzzification we get membership of 70% on the second fuzzy set and 30% membership on the third fuzzy set as shown in Fig. 3.
Fig. 3. Example diagram for defuzzification
If we defuzzify this point using the values at the maximum membership of the fuzzy sets, then we get:

• Second fuzzy set: 15 (value at maximum membership) × 0.7 = 10.50, and
• Third fuzzy set: 65 (value at maximum membership) × 0.3 = 19.50

i.e. we get back the total 10.50 + 19.50 = 30.00, which is exactly the value of the point before fuzzification. Thus the final output value is given by

V = \sum_{i=1}^{n} \alpha_i v_i    (2)

where 'n' is the number of fuzzy sets, each associated with probability 'α_i', and 'v_i' is the value at maximum membership of each fuzzy set. If, on the other hand, any other value were used for defuzzification, for example the mean value of the fuzzy set, the value obtained after defuzzification would not be the same as the value before fuzzification.
4 Results and Discussion
4.1 Example 1
Consider an example having six attributes, one of which is a categorical, non-continuous attribute with four classes, and five of which are non-categorical, continuous attributes. The universes of all continuous attributes are shown in Table 1.
Table 1. Universes for continuous attributes

Non-categorical attribute   Universe
1                           10 - 100
2                           0 - 550
3                           3 - 15
4                           0 - 800
5                           0 - 1000
A training data set of 500 data points is generated such that the first and the fourth non-categorical attributes have a gap in their universes where no data point is available. For this example the gap is adjusted so that, with symmetric fuzzy partitioning, there is one fuzzy set on each of the first and the fourth attributes without any data point belonging to it. For example, Fig. 4 shows the symmetric triangular partitioning for attribute one, where no data point is associated with the third fuzzy set.
Fig. 4. Symmetric fuzzy partitions on the first attribute
Fig. 5. Asymmetric fuzzy partitions on the first attribute
With asymmetric fuzzy sets, the partitioning is such that each fuzzy partition has almost the same number of data points, as shown in Fig. 5. The results obtained for both the symmetric and asymmetric approaches are shown in Table 2a. A test data set of 500 data points is generated in which the data points are scattered all over the universes of the attributes rather than clustered. We can see from Table 2a that with asymmetric fuzzy sets the percentage accuracy is significantly higher than with symmetric fuzzy sets for both training and test data sets. In this problem, the use of four symmetric fuzzy sets avoids forming empty fuzzy partitions and gives higher accuracy than the use of five fuzzy sets. But the accuracy still cannot be improved up to that obtained with asymmetric fuzzy sets.
Table 2a. Comparison for use of symmetric and asymmetric fuzzy sets

            Symmetric Triangular Fuzzy Sets        Asymmetric Triangular Fuzzy Sets
Fuzzy Sets  %Accuracy on       %Accuracy on        %Accuracy on       %Accuracy on
            Training Data Set  Test Data Set       Training Data Set  Test Data Set
4           54.56              67.87               83.60              74.20
5           55.45              47.51               85.37              71.75
6           55.91              51.83               88.84              75.90
7           65.17              36.87               91.35              77.12

4.2 Example 2
In this example too, a training data set of 500 data points with five non-categorical attributes, with universes as shown in Table 1, and one categorical attribute with four classes is used. This time the training data set is generated so that the gap of the first example is reduced to 50% on the third fuzzy partition of the first attribute, as shown in Fig. 6. Even with the reduced gap, we can see from Table 2b that the results obtained using asymmetric fuzzy sets are better than those obtained using symmetric fuzzy sets. The test data set used in this case is the same as that used in Example 1. Fig. 7 shows the fuzzy partitioning using asymmetric fuzzy sets with the reduced gap.
Fig. 6. Symmetric fuzzy partitions on the first attribute with reduced gap
Fig. 7. Asymmetric fuzzy partitions on the first attribute with reduced gap
Table 2b. Comparison for use of symmetric and asymmetric fuzzy sets with reduced gap

            Symmetric Triangular Fuzzy Sets        Asymmetric Triangular Fuzzy Sets
Fuzzy Sets  %Accuracy on       %Accuracy on        %Accuracy on       %Accuracy on
            Training Data Set  Test Data Set       Training Data Set  Test Data Set
4           63.93              69.68               83.10              82.77
5           57.25              66.58               84.64              86.11
6           77.09              81.93               87.03              83.86
7           62.66              69.08               86.92              84.13

5 Conclusions
The new approach to the Fuzzy ID3 algorithm generates a decision tree which can give the same or improved results compared to the use of symmetric fuzzy sets in the Fuzzy ID3 algorithm for real-world classification problems. The use of asymmetric, unevenly spaced triangular fuzzy sets makes Fuzzy ID3 more efficient and increases the reliability of the algorithm for a variety of data. This approach to fuzzy partitioning for mass assignment ID3 can improve the results either by reducing the number of fuzzy sets or by increasing the accuracy, depending on the problem.
References
[1] Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo (1993)
[2] Baldwin, J.F., Lawry, J., Martin, T.P.: A Mass Assignment Based ID3 Algorithm for Decision Tree Induction. International Journal of Intelligent Systems, 12 (1997) 523-552
[3] Fayyad, U.M., Irani, K.B.: On the Handling of Continuous-Valued Attributes in Decision Tree Generation. Machine Learning, 8 (1992) 87-102
[4] Baldwin, J.F., Martin, T.P., Pilsworth, B.W.: FRIL - Fuzzy and Evidential Reasoning in A.I. Research Studies Press, Wiley, New York (1995)
Machine Learning to Detect Intrusion Strategies
Steve Moyle and John Heasman
Oxford University Computing Laboratory, Wolfson Building, Parks Road, Oxford, OX1 3QD, England
Abstract. Intrusion detection is the identification of potential breaches in computer security policy. The objective of an attacker is often to gain access to a system that they are not authorized to use. The attacker achieves this by exploiting a (known) software vulnerability by sending the system a particular input. Current intrusion detection systems examine input for syntactic signatures of known intrusions. This work demonstrates that logic programming is a suitable formalism for specifying the semantics of attacks. Logic programs can then be used as a means of detecting attacks in previously unseen inputs. Furthermore the machine learning approach provided by Inductive Logic Programming can be used to induce detection clauses from examples of attacks. Experiments of learning ten different attack strategies to exploit one particular vulnerability demonstrate that accurate detection rules can be generated from very few attack examples.
1 Introduction
The importance of detecting intrusions in computer-based systems continues to grow. In early 2003 the Slammer worm crippled the Internet, taking only 10 minutes to spread across the world. It contained a simple, fast scanner for finding vulnerable machines, packed into a worm with a total size of only 376 bytes. Fortunately the worm did not contain a malicious payload. Intrusion detection is the automatic identification of potential breaches in computer security policy. The objective of an attacker is often to gain access to a system that they are not authorized to use. The attacker achieves this by exploiting a (known) software vulnerability by sending the system a particular input. In this work the buffer overflow vulnerability is studied, and its technical basis described (in section 1). This is followed by the description of a semantic intrusion detection framework in section 2. Section 3 describes an experiment using Inductive Logic Programming techniques to automatically produce intrusion detection rules from examples of intrusions. These results are discussed in the final section of the paper.
1.1 The Buffer Overflow Attack
Buffer overflow attacks are a type of security vulnerability relating to incomplete input validation. The circumstances in which a buffer overflow can occur are well
understood and wholly avoidable if the software is developed according to one of many secure programming checklists [3, ch. 23].¹ From a technical perspective the concepts behind exploiting a buffer overflow vulnerability are simple but the implementation details are subtle. They are discussed thoroughly from the view of the attacker in the seminal paper "Smashing the Stack for Fun and Profit" [1]. Sufficient background and implementation details are reproduced here so that the reader may understand how such attacks can be detected.

Buffer overflow attacks exploit a lack of bounds checking on a particular user-supplied field within a computer software service that performs some transaction. The attacker supplies an overly large input transaction to the service. In the course of processing the transaction, the service copies the field into a buffer of a fixed size.² As a result, memory located immediately after the buffer is overwritten with the remainder of the attacker's input. The memory that is overwritten may have held other data structures, the contents of registers, or, assuming the system is executing a procedure, the return address from which program execution should continue once the procedure completes. It is the return address that is of most interest to the attacker. The function that copies data from one location to another without regard for the data size is referred to in the sequel as simply the dangerous function.

Before the service executes the procedure containing the function call that leads to the overflow, the return address is pushed onto the stack so that program execution can continue from the correct place once the procedure is complete. Information about the processor's registers is then pushed onto the stack so that the original state can be restored. Space for variables local to the procedure is then allocated. The attacker aims to overwrite the return address, thus controlling the flow of the program.

A malicious transaction input of a buffer overflow attack consists of several components as shown in Fig. 1. The dangerous function call takes the attacker's input and overwrites the intended buffer in the following order: 1) the space allocated for the procedure's local variables and the register save area, and 2) the return address. The purpose of the sequence of idle instructions at the beginning of the exploit code stream is to idle the processor. Idle instructions are used to reduce the uncertainty of the attacker having to guess the correct return address, which must be hard coded in the attack stream. By including a sequence of idle instructions at the start of the exploit code, the return address need only be set to an address within this sequence, which greatly improves the chances of an exploit succeeding. This sequence of idle instructions is often referred to as a NOP sledge. NOP is the opcode that represents a no operation instruction within the CPU. The attacker aims to set the return address to a location within the idle sequence so that his exploit code is executed correctly. He supplies multiple copies of his desired return address to compensate for the uncertainty of where
¹ This does not demean the target of this work: although preventable, buffer overflow vulnerabilities are still prolific in computer software!
² The size of the buffer is smaller than that necessary to store the entire input provided by the attacker.
Fig. 1. Components of a malicious transaction containing a buffer overflow attack: arbitrary data, a sequence of idle instructions, exploit code, and multiple copies of the return address
the actual return address is located. Once the procedure has completed, values from the (overwritten) register save area are popped off the stack back into system registers and program execution continues from the new return address, now executing the attacker's code. For the attacker to correctly build the malicious field within the transaction, he must know some basic properties of the target system, including the following:

– The processor architecture of the target host, as machine code instructions have different values in different processors.
– The operating system the target is running, as each operating system has a different kernel and different methods of changing between user and kernel mode.
– The stack start address for each process, which is fixed for each operating system/kernel.
– The approximate location of the stack pointer. This is a guess at how many bytes have been pushed onto the stack prior to the attack.

This information together with the stack start address allows the attacker to correctly calculate a return address that points back into the idle sequence at the beginning of the exploit code. The work described in the sequel is limited to the discussion of buffer overflow exploits for the Intel x86 family of processors running the Linux operating system. These are only pragmatic constraints. The work is easily extended to different processor architectures, operating systems, and some vulnerabilities.
1.2 An IDS for Recognizing Buffer Overflow Attacks
An attacker must provide a stream of bytes that, when executed after being stored in an overflowed buffer environment, will gain him control of the CPU, which will, in turn, perform some task for him (say, execute a shell on a Unix system). When crafting a string of bytes the attacker is faced with many alternatives, given the constraints that he must work within (e.g. the byte stream must be approximately the size of the vulnerable buffer). These alternatives are both at the strategic and at the implementation level. At the strategic level, the attacker must decide in what order the tasks will be performed by the attack code. For example, will the address of the command to the shell be calculated and stored before the string that represents the shell type to be called? Having decided on the components that make up the strategy, at the implementation level the attacker has numerous alternative encodings to choose from. Which register, for example, should the null byte be stored in? It
is clear that for any given attack strategy, there are vast numbers of alternative attack byte code streams that will, when executed, enact the strategy. This also means that simple syntax-checking intrusion detection systems can easily be thwarted, even by attackers using the same strategy that the syntactic IDS has been programmed to detect. A semantic Intrusion Detection System (IDS) was developed as a Prolog program in [4], which was used as the basis of this work. The IDS program includes a dis-assembler that encodes the semantics of Intel x86 class processor byte codes. The IDS also includes a byte stream generator that utilizes the same dis-assembler components. This byte stream generator produces a stream of bytes by producing a randomized implementation of the chosen intrusion strategy. This stream of bytes is then compiled into a dummy executable which simulates the process of the byte stream being called. This simply emulates the situation of the buffer having already been overflowed, and not the process of overflowing it per se. Each attack byte stream is verified in its ability to open a shell (and hence its success for an attacker).
2 Inductive Logic Programming
Inductive Logic Programming (ILP) [6] is a field of machine learning concerned with learning theories in the language of Prolog programs. ILP has demonstrated that such a powerful description language (First Order Predicate Calculus) can be applied to a large range of domains including natural language induction [2], problems in bio-informatics [7], and robot programming [5]. This work is closest to the natural language work in that it is concerned with inducing context-dependent parsers from examples of "sentences". The general setting for ILP is as follows. Consider an existing partial theory (e.g. a Prolog program) B (known as background knowledge), from which one is not able to derive (or predict) the truth of an externally observed fact. Typically these facts are known as examples³, E. This situation is often presented as B ⊭ E⁺. The objective of the ILP system is to produce an extra Prolog program H, known as a theory, that, when combined with the background knowledge, enables all the examples to be predicted. It is this new theory that is the knowledge discovered by the ILP system:

B ∪ H ⊨ E

In the context of the semantic intrusion detection system, the basic assembler of the CPU architecture and the general semantic parser program, B_IDS, make up the background knowledge. The examples are the byte streams known to produce a successful attack. Consider the following single example e₁⁺, where the attack is represented as a list of bytes in Prolog.

³ In general, the examples in ILP can be considered as either positive examples E⁺ or negative examples E⁻ such that E = E⁺ ∪ E⁻ and E⁺ ∩ E⁻ = ∅.
e₁⁺ = attack([0x99,0x90,0x90,0xf8,0x9b,0x9b,0xeb,0x1c,0x5b,0x89,0x5b,0x08,0xba,0xff,0xff,
              0xff,0xff,0x42,0x88,0x53,0x07,0x89,0x53,0x0c,0x31,0xc0,0xb0,0x0b,0x8d,0x4b,0x08,0x8d,
              0x53,0x0c,0xcd,0x80,0xe8,0xdf,0xff,0xff,0xff,0x2f,0x62,0x69,0x6e,0x2f,0x73,0x68]).
Such an example is a single instance of the SSFFP buffer overflow exploit. It can be described (and parsed) by the high-level strategy shown below, written in Prolog, which covers semantic variations in the exploit code taken from "Smashing the Stack for Fun and Profit" [1].

attack(A):- consume_idles(A,B), consume_load(B,C,Addr,Len),
            consume_place_addr(C,D,Addr,Len), consume_null_long(D,E,Null,Addr),
            consume_null_operation(E,F,Len,Null,Addr),
            consume_set_sys_call(F,G,Null,0x0b,Addr),
            consume_params(G,H,Addr,Null,Len), consume_system_call(H,[]).

Consider, for example, the predicate consume_idles(A,B), which transforms a sequence of bytes A into a suffix sequence of bytes B by removing the contiguous idle instructions from the front of the sequence. The above attack predicate is clearly much more powerful than a syntactic detection system in that it is capable of detecting the whole class of byte streams for that particular strategy. However, it does not detect other attack strategies that have different structural properties. A simple solution is to add more rules that cover variations in the exploit code structure, but this requires substantial human effort. If the expert is faced with new real-world examples that he believes can be expressed using the background information that is already encoded, he might ask "is there a way of automating this process?". This is the motivation for the following section, which demonstrates that inducing updates to the semantic rule set is possible.
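As a rough illustration of what such a "consume" predicate does at the byte level, the following Python sketch strips a leading idle sequence from a byte stream. The particular set of single-byte idling opcodes is our assumption (chosen to match the prefix of e₁⁺ above); the real IDS performs this with a full x86 dis-assembler rather than a fixed opcode set.

# Single-byte x86 instructions that do no useful work for the exploit and
# are therefore usable in a NOP sledge (illustrative subset only):
# 0x90 = NOP, 0x99 = CDQ, 0x9B = FWAIT, 0xF8 = CLC.
IDLE_OPCODES = {0x90, 0x99, 0x9b, 0xf8}

def consume_idles(stream):
    """Python analogue of consume_idles/2: return the suffix of `stream`
    after the leading contiguous idle instructions, plus the sledge length."""
    i = 0
    while i < len(stream) and stream[i] in IDLE_OPCODES:
        i += 1
    return stream[i:], i

# For the example e1+ above, the first six bytes (0x99,0x90,0x90,0xf8,0x9b,0x9b)
# are consumed and the suffix starts at the jump opcode 0xeb.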
3 An Experiment to Induce Semantic Protection Rules
This section describes controlled experiments to (re-)learn semantic intrusion detection rules using the ILP system ALEPH [9]. The motivation for the experiment can be informally stated as testing the hypothesis that ILP techniques can recover reasonably accurate rules describing attack strategies. The basic method was to use correct and complete Prolog programs to randomly generate sequences of bytes that successfully exploit a buffer overflow.⁴ Such byte sequences are considered as positive examples, E⁺, of attacks. The byte sequence generating predicates were then removed from the Prolog program, leaving a common background program to be used as background knowledge, B_IDS, to ALEPH. This background program was then used, along with positive examples of attack byte sequences, to induce rules similar to the byte sequence generating predicates.
3.1 Results
The learning curves for the recovery of each of the ten intrusion strategies are presented in Figure 2. Here, accuracy is measured as the proportion of test set 4
⁴ All such generated sequences, when supplied as part of a transaction with a vulnerable service, enabled the attacker to open a shell.
[Plot for Fig. 2: average accuracy (%) on hold-out sets, from 40 to 100, versus the cardinality of the training set on a logarithmic scale (3, 9, 27, 81), with one curve per intrusion strategy.]
Fig. 2. Learning curves for the recovery of ten different intrusion strategies from examples. Each line on the graph represents a different intrusion attack strategy (i.e. of the form attack(A) :- consume. . . ). Each point on the graph represents the average accuracy of 10 randomly chosen training and test sets
examples which are detected by the recovered rule. The proportion of the test set examples which are not detected by the recovered rule is the intrusion detection false negative rate. It can be seen from Figure 2 that, on average, for the attack strategies studied, relatively few examples of attacks are required to recover rules that detect a high proportion of attacks. In fact, for all ten studied attack strategies, only ten positive attack examples were required to recover rules that were 100% accurate on the test set. This finding is discussed in the following section.
4 Discussion and Conclusions
The graphs in Figure 2 demonstrate that it is possible to induce semantic detection rules for buffer overflow attacks that are 100% accurate. Furthermore, these results can be obtained using, on average, a low number of examples for the induction. The need for few examples can be attributed to the relatively small size of the hypothesis space, which was a result of the high input and output connectedness of the background predicates. The average estimated size of the hypothesis space for the recovery of these attack strategies is 797 clauses, which is comparable to the classic ILP test-bed problem, the Trains problem. Srinivasan provides estimates of the hypothesis space for the Trains problem and other common ILP data sets in [8].
Even the rules with lower accuracies can be understood by a domain expert and have some intrusion detection capability. This result demonstrates that ILP was well suited to the intrusion problem studied. It may well be suited to real-world intrusion detection problems, provided the background theory is sufficient. These experiments indicate that the background information was highly accurate and relevant. The importance of relevance has been studied by Srinivasan and King [10]. This application domain is very "crisp" and well defined: in a particular context, a byte op-code always performs a particular function. For example, when the byte 0x9B is executed on an Intel x86 processor it always triggers the FWAIT function. Such a well-defined and cleanly specified domain is well suited to being represented in a logic program and is susceptible to ILP. Contrast this with a biological domain, where the activity of a particular system varies continuously with respect to many inputs – for example, the concentration of certain chemicals. The work described has shown that it is possible to encode the operational semantics of attack transactions in a logical framework, which can then be used as background knowledge for an ILP system. Sophisticated attack strategies can then be represented in such a framework. In studying the particular application of a buffer overflow attack, it has been shown that ILP techniques can be used to learn rules that detect – with low false negative rates – attack strategies from relatively few examples of attacks. Furthermore, only positive examples of attacks are necessary for the induction of attack strategy rules.
References
[1] "Aleph One". Smashing The Stack For Fun And Profit. Phrack 49, 1996.
[2] J. Cussens. Part-of-speech tagging using Progol. In Inductive Logic Programming: Proceedings of the 7th International Workshop (ILP-97), pp. 93-108, Prague, 1997. Springer.
[3] S. Garfinkel and G. Spafford. Practical UNIX and Internet Security. O'Reilly and Associates, Sebastopol, 1996.
[4] S. A. Moyle and J. Heasman. Applying ILP to the Learning of Intrusion Strategies. Technical Report PRG-RR-02-03, Oxford University Computing Laboratory, University of Oxford, 2002.
[5] S. Moyle. Using Theory Completion to Learn a Robot Navigation Control Program. In S. Matwin, editor, Proceedings of the 12th International Workshop on Inductive Logic Programming, 2002.
[6] S. Muggleton. Inverse Entailment and Progol. New Generation Computing, 13(3-4):245–286, 1995.
[7] S. H. Muggleton, C. H. Bryant, A. Srinivasan, A. Whittaker, S. Topp, and C. Rawlings. Are grammatical representations useful for learning from biological sequence data? – a case study. Computational Biology, 8(5):493-522, 2001.
[8] A. Srinivasan. A study of two probabilistic methods for searching large spaces with ILP. Technical Report PRG-TR-16-00, Oxford University Computing Laboratory, University of Oxford, 2000.
[9] A. Srinivasan. The Aleph Manual. http://web.comlab.ox.ac.uk/oucl/research/areas/machlearn/Aleph/, 2001.
[10] A. Srinivasan and R. D. King. An empirical study of the use of relevance information in Inductive Logic Programming. Technical Report PRG-RR-01-19, Oxford University Computing Laboratory, University of Oxford, 2001.
On the Benchmarking of Multiobjective Optimization Algorithm
Mario Köppen
Fraunhofer IPK, Dept. Security and Inspection Technologies, Pascalstr. 8-9, 10587 Berlin, Germany
[email protected]
Abstract. The "No Free Lunch" (NFL) theorems state that on average each algorithm has the same performance, when no a priori knowledge of the single-objective cost function f is assumed. This paper extends the NFL theorems to the case of multi-objective optimization. Furthermore, it is shown that even in cases of a priori knowledge, when the performance measure is related to the set of extrema points sampled so far, the NFL theorems still hold. However, a procedure for obtaining function-dependent algorithm performance can be constructed, the so-called tournament performance, which yields different performance measures for different multi-objective algorithms.
1 Introduction
The "No Free Lunch" (NFL) theorems state the equal average performance of any optimization algorithm, when measured against the set of all possible cost functions and when no domain knowledge of the cost function is assumed [3]. Usually, the NFL theorem is considered in the context of algorithm design; in particular, it became well known in the field of evolutionary computation. However, the NFL theorem also has other facets, one of which is the major concern of this paper: the NFL theorem can also be seen as stating the impossibility of obtaining a concise mathematical definition of algorithm performance. In this context, this paper considers multi-objective optimization and how the NFL theorems apply in this field. After recalling some basic definitions of multi-objective optimization in section 2, especially the concept of the Pareto front, the standard NFL theorem is proven for the multi-objective case in section 3. Then, the proof is extended in section 4 to the case where sampling of extrema is also involved in the performance measure, proving that there is no gain in using such a measure. Only the case where two algorithms are compared directly gives rise to the so-called tournament performance and a heuristic procedure for measuring algorithm performance. This will be presented in section 5.
2 Basic Definitions
In multi-objective optimization, the optimization goal is given by more than one objective to be made extremal [1]. Formally, given a domain as a subset of $R^n$, there are assigned $m$ functions $f_1(x_1,\ldots,x_n), \ldots, f_m(x_1,\ldots,x_n)$. Usually, there is not a single optimum but rather the so-called Pareto set of non-dominated solutions. For two vectors $a$ and $b$ it is said that $a$ (Pareto-)dominates $b$ when each component of $a$ is less than or equal to the corresponding component of $b$, and at least one component is smaller:

$$a >_D b \;\longleftrightarrow\; \forall i\,(a_i \le b_i) \,\wedge\, \exists k\,(a_k < b_k). \tag{1}$$
Note that, in a similar manner, Pareto dominance can be defined with respect to the ">" relation. The subset of all vectors of a set $M$ of vectors which are not dominated by any other vector of $M$ is the Pareto set (also Pareto front) $PF$. The Pareto set of univariate data (single objective) contains just the maximum of the data. The task of a multi-objective optimization algorithm is to sample points of the Pareto front. A second instance (often called the decision maker) is needed to further select from the Pareto front.
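The following small Python sketch (ours, not from the paper) implements the dominance relation of Eq. (1) and the resulting Pareto-front extraction; it is only meant to make the definition concrete.

```python
def dominates(a, b):
    """a >_D b iff every component of a is <= that of b and at least one is strictly smaller."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(vectors):
    """Return the subset of vectors not dominated by any other vector."""
    return [v for v in vectors if not any(dominates(w, v) for w in vectors if w != v)]

# Example with Y = {0,1}^2: the front of {(0,1), (1,0), (1,1)} is {(0,1), (1,0)}.
print(pareto_front([(0, 1), (1, 0), (1, 1)]))   # [(0, 1), (1, 0)]
```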
3 NFL-Theorem for Multi-objective Optimization Algorithms
A slight modification extends the proof of the single-objective NFL theorems given in [2] to the multi-objective case. Let $X$ be a finite set and $Y$ a set of $k$ finite domains $Y_i$ with $i = 1, \ldots, k$. Then we consider the set of all sets of $k$ cost functions $f_i: X \to Y_i$ with $i = 1, \ldots, k$, or $f: X \to Y$ for simplicity. Let $m$ be a non-negative integer $< |X|$. Define $d_m$ as a set $\{(d_m^x(i),\, d_m^y(i) = f(d_m^x(i)))\}$, $i = 1, \ldots, m$, where $d_m^x(i) \in X$ for all $i$ and $d_m^x(i) \neq d_m^x(j)$ for all $i \neq j$. Now consider a deterministic search algorithm $a$ which assigns to every possible $d_m$ an element of $X \setminus d_m^x$ (see Fig. 1):

$$d_{m+1}^x(m+1) = a[d_m] \in X \setminus \{d_m^x\}. \tag{2}$$
Fig. 1. A deterministic algorithm derives the next sampling point $d_{m+1}^x(m+1)$ from the outcome of the foregoing sampling $d_m$
Define $Y(f, m, a)$ to be the sequence of $m$ $Y$ values produced by $m$ successive applications of the algorithm $a$ to $f$. Let $\delta(\cdot,\cdot)$ be the Kronecker delta function that equals 1 if its arguments are identical, 0 otherwise. Then the following holds:

Lemma 1. For any algorithm $a$ and any $d_m^y$,

$$\sum_f \delta(d_m^y, Y(f, m, a)) = \prod_{i=1}^{k} |Y_i|^{|X|-m}.$$
Proof. Consider all cost functions $f_+$ for which $\delta(d_m^y, Y(f_+, m, a))$ takes the value 1; they are determined, step by step, by the sequence $d_m^y$:

i) $f_+(a(\emptyset)) = d_m^y(1)$
ii) $f_+(a[d_m(1)]) = d_m^y(2)$
iii) $f_+(a[d_m(1), d_m(2)]) = d_m^y(3)$
...

where $d_m(j) \equiv (d_m^x(j), d_m^y(j))$. So the value of $f_+$ is fixed for exactly $m$ distinct elements from $X$. For the remaining $|X| - m$ elements from $X$, the corresponding value of $f_+$ can be assigned freely. Hence, out of the $\prod_i |Y_i|^{|X|}$ separate $f$, exactly $\prod_i |Y_i|^{|X|-m}$ will result in a summand of 1 and all others will be 0.

Then, we can continue with the proof of the NFL theorem in the multi-objective case. Take any performance measure $c(\cdot)$, mapping sets $d_m^y$ to real numbers.

Theorem 1. For any two deterministic algorithms $a$ and $b$, any performance value $K \in R$, and any $c(\cdot)$,

$$\sum_f \delta(K, c(Y(f, m, a))) = \sum_f \delta(K, c(Y(f, m, b))).$$
Proof. Since more than one $d_m^y$ may give the same value of the performance measure $K$, for each $K$ the l.h.s. is expanded over all those possibilities:

$$\sum_f \delta(K, c(Y(f, m, a))) = \sum_{f,\, d_m^y \in Y^m} \delta(K, c(d_m^y))\, \delta(d_m^y, Y(f, m, a)) \tag{3}$$
$$= \sum_{d_m^y \in Y^m} \delta(K, c(d_m^y)) \sum_f \delta(d_m^y, Y(f, m, a))$$
$$= \sum_{d_m^y \in Y^m} \delta(K, c(d_m^y)) \prod_{i=1}^{k} |Y_i|^{|X|-m} \quad \text{(by Lemma 1)}$$
$$= \prod_{i=1}^{k} |Y_i|^{|X|-m} \sum_{d_m^y \in Y^m} \delta(K, c(d_m^y)) \tag{4}$$

The last expression does not depend on $a$ but only on the definition of $c(\cdot)$.
4 Benchmarking Measures
The formal proof of the NFL theorems assumes no a priori knowledge of the function $f$. This can easily be seen in the proof of Theorem 1, when the expansion over $f$ is made (line 3 of the proof): it is implicitly assumed that the performance measure $c(\cdot)$ does not depend on $f$. Performance measures that do depend on $f$, and for which Theorem 1 does not hold, can easily be constructed (e.g. one derived from the requirement to scan the $(x, y)$ pairs in a given order). The assumption is reasonable for evaluating an algorithm $a$: domain knowledge of $f$ could result in an algorithm $a$ somehow designed to show increased performance on some benchmark problems. However, the common procedure for evaluating algorithms is to apply them to a set of so-called "benchmark problems." This also holds in the multi-objective case. For a benchmark function $f$, analytic properties (especially the extrema points) are usually given in advance. In [1], an extensive suite of such benchmark problems is proposed, in order to gain understanding of the abilities of multi-objective optimization algorithms. So, for each benchmark problem, a description of the Pareto front of the problem is provided. The task given to a multi-objective optimization algorithm is to sample as many points from the Pareto front as possible. To state it again: clearly, such a performance measure is related to a priori knowledge of $f$ itself. The NFL theorems as given in Theorem 1 do not cover this case. However, in the following, it will be shown that the NFL theorems apply even in such a case. This is based on the following lemma:

Lemma 2. For any algorithm $a$ it holds that $\bigcup_f \{a \circ f\}\big|_Y = Y^{|X|}$.
A given algorithm $a$ applied to any $f$ gives a sequence of values from $Y$. The union of all those sequences will be the set of all possible sequences of $|X|$ elements chosen from $Y$. Or, in other words: each algorithm, applied to all possible $f$, will give a permutation of the set of all possible sequences, with each sequence appearing exactly once.

Proof. Assume that for two functions $f_1$ and $f_2$ algorithm $a$ gives the same sequence of $y$-values $(y_1, y_2, \ldots, y_{|X|})$. This also means that the two corresponding $x$-margins are permutations of $X$. Via induction we show that $f_1 = f_2$ follows.

Base step. Since we are considering deterministic algorithms, the choice of the first element $x_1$ is fixed for an algorithm $a$ (all further choices of $x$ values are functions of the foregoing samplings). So, both $f_1$ and $f_2$ map $x_1$ to $y_1$.

Induction step. Assume $f_1(x_i) = f_2(x_i)$ for $i = 1, \ldots, k$ (and $k < |X|$). Then, according to eq. (2), algorithm $a$ will compute the same $d_{k+1}^x(k+1)$, since this computation only depends on the sequence $d_k$, which is equal for $f_1$ and $f_2$ by assumption. Since the $y$-margins are also equal at position $(k+1)$, for both $f_1$ and $f_2$ the element $x_{k+1} = d_{k+1}^x(k+1)$ is mapped onto $y_{k+1}$.
Table 1. Performance measure Pareto sampling after two steps in the example case Y = {0,1}^2 and |X| = 3. The table enumerates all 64 functions f (columns y1 y2 y3 giving f(a), f(b), f(c)), the corresponding Pareto front PF, and the number c(2) of Pareto-front elements already sampled after two steps. The block sums of c(2) for y1 = 00, 01, 10, 11 are 16, 16, 16 and 11, respectively, giving an average performance of 59/64 ≈ 0.92.
This completes the proof. It has to be noted that not every permutation of $Y^{|X|}$ can be realised by an algorithm (as can easily be seen from the fact that there are many more permutations than possible algorithm specifications). Following this lemma, all performance calculations that are independent of the ordering of the elements of $Y^{|X|}$ will give the same average performance, independent of $a$. Sampling of Pareto front elements after $m$ algorithm steps is an example of such a measure. For illustration, Table 1 gives these computations for the simple case $Y = \{0,1\} \times \{0,1\}$ and $X = \{a, b, c\}$. Table 1 displays all possible functions $f: X \to Y$ and the corresponding Pareto set $PF$. The column $c(2)$ shows the number of Pareto set elements that have already been sampled after two steps. The computation of the average performance does not depend on the order in which the functions are listed, thus each algorithm $a$ will have the same average performance $c_{av}(2) = 59/64$. A remark on the single-objective case: in the foregoing discussion, the multi-objectivity of $f$ was not referenced explicitly. Hence, the discussion also holds for the "single-objective" version, in which an algorithm is judged by its ability to find extrema points within a fixed number of steps. The NFL theorems also apply to this case.
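The enumeration behind Table 1 is small enough to check by brute force. The sketch below (an illustration, not part of the paper) averages the Pareto-sampling performance c(2) over all 64 cost functions for two hypothetical fixed sampling orders; both come out to 59/64, as the argument above predicts.

```python
from itertools import product

X = ["a", "b", "c"]
Y = list(product([0, 1], repeat=2))

def dominates(u, v):
    return all(x <= y for x, y in zip(u, v)) and any(x < y for x, y in zip(u, v))

def pareto(values):
    return {v for v in values if not any(dominates(w, v) for w in values if w != v)}

def avg_c2(sample_order):
    """Average number of Pareto-front elements sampled after two steps."""
    total = 0
    for f_vals in product(Y, repeat=len(X)):          # all 64 functions f: X -> Y
        f = dict(zip(X, f_vals))
        front = pareto(set(f.values()))
        sampled = {f[x] for x in sample_order[:2]}
        total += len(front & sampled)
    return total / len(Y) ** len(X)

print(avg_c2(["a", "b", "c"]), avg_c2(["a", "c", "b"]))   # both 0.921875 = 59/64
```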
5 Tournament Performance
Among the possible function-dependent performance measures, one should be pointed out in the rest of this paper. To obtain "position dependence" of the measure on a single function $f$, the value obtained by applying a base algorithm $A$ is taken. Algorithm $a$ now runs competitively against $A$.
Table 2. Tournament performance of algorithm a after two steps. The table lists, for the 16 functions with f(a) = 01 on which A and a behave differently, the values f(a), f(b), f(c), the Pareto fronts PF(d^y_2(A,f)) and PF(d^y_2(a,f)), their difference PF(d^y_2(A,f)) \ PF(d^y_2(a,f)), and the resulting measure |.|; the total is −4.
In such a case, the NFL theorem does not hold. To see this, it is sufficient to provide a counterexample. First, we define the difference of two Pareto sets $PF_a$ and $PF_b$ as the set $PF_a \setminus PF_b$, in which all elements of $PF_a$ are removed that are dominated by any element of $PF_b$. Let $Y = \{0,1\}^2$ and $X = \{a, b, c\}$, as in the foregoing example. Algorithm A is as follows:

Algorithm A: Sample the $d_m^x$ in the following order: a, b, c.

And let algorithm a be:

Algorithm a: Take a as the first choice. If $f(a) = \{0,1\}$ then select c as the next point, otherwise b.

Table 2 shows the essential part of all possible functions $f$, on which algorithms A and a behave differently. For "measuring" the performance of a at step $m$, we compute the size of the Pareto set difference

$$|PF(d_m^y(A, f)) \setminus PF(d_m^y(a, f))| \tag{5}$$

and take the average over all possible $f$ as the average performance. For functions that do not start with $f(a) = \{0,1\}$, both algorithms are identical, so in these cases the measure is 0. For the functions mapping the $x$-value a onto $\{0,1\}$, we obtain a total of $-4$ (Table 2). Now take other algorithms:

Algorithm b: Take a as the first choice. If $f(a) = \{1,0\}$ then select c as the next point, otherwise b.

Algorithm c: Take a as the first choice. If $f(a) = \{1,1\}$ then select c as the next point, otherwise b.
For b and c, a similar computation gives totals of $-4$ and $-5$, respectively. In this sense, the "strongest" algorithm (i.e., in comparison to A) is to sample in the order a, c, b, with a performance of $-13$. It should be noted that this performance measure is also applicable to the single-objective case. However, more studies on this measure have to be performed. Based on this, a heuristic procedure to measure the performance of a multi-objective optimization algorithm a might look as follows (a sketch is given after the list):

1. Let algorithm a run for k evaluations of the cost function f and take the set M1 of non-dominated points from Y obtained by the algorithm.
2. Select k random domain values of X and compute the Pareto set M2 of the corresponding Y values.
3. Compute the set M3 of elements of M2 that are not dominated by any element of M1.

The relation of |M1| to |M3| gives a measure of how algorithm a performs against random search.
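A rough sketch of the three-step heuristic above follows. The interface of `algorithm_a` (assumed to return the list of domain points it sampled) and the use of the same budget k for the random baseline are our assumptions.

```python
import random

def dominates(u, v):
    return all(x <= y for x, y in zip(u, v)) and any(x < y for x, y in zip(u, v))

def pareto(points):
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

def heuristic_performance(algorithm_a, f, domain, k, seed=0):
    rng = random.Random(seed)
    m1 = pareto([f(x) for x in algorithm_a(f, domain, k)])        # step 1: a's front
    m2 = pareto([f(rng.choice(domain)) for _ in range(k)])        # step 2: random-search front
    m3 = [p for p in m2 if not any(dominates(q, p) for q in m1)]  # step 3: random points surviving a
    return len(m1), len(m3)   # the relation of |M1| to |M3| is the reported measure
```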
References
[1] Carlos A. Coello Coello, David A. Van Veldhuizen, and Gary B. Lamont. Evolutionary Algorithms for Solving Multi-Objective Problems. Kluwer Academic Publishers, 2002.
[2] Mario Köppen, David H. Wolpert, and William G. Macready. "Remarks on a recent paper on the 'no free lunch' theorems." IEEE Transactions on Evolutionary Computation, vol. 5, no. 3, pp. 295–296, 2001.
[3] David H. Wolpert and William G. Macready. "No free lunch theorems for optimization." IEEE Transactions on Evolutionary Computation, vol. 1, no. 1, pp. 67–82, 1997.
Multicategory Incremental Proximal Support Vector Classifiers
Amund Tveit and Magnus Lie Hetland
Department of Computer and Information Science, Norwegian University of Science and Technology, N-7491 Trondheim, Norway
{amundt,mlh}@idi.ntnu.no
Abstract. Support Vector Machines (SVMs) are an efficient data mining approach for classification, clustering and time series analysis. In recent years, a tremendous growth in the amount of data gathered has changed the focus of SVM classifier algorithms from providing accurate results to enabling incremental (and decremental) learning with new data (or unlearning old data) without the need for computationally costly retraining with the old data. In this paper we propose an efficient algorithm for multicategory classification with the incremental proximal SVM introduced by Fung and Mangasarian.
1 Introduction
Support Vector Machines (SVMs) are an efficient data mining approach for classification, clustering and time series analysis [1, 2, 3]. In recent years, a tremendous growth in the amount of data gathered (for example, in e-commerce and intrusion detection systems) has changed the focus of SVM classifier algorithms from providing accurate results to enabling incremental (and decremental) learning with new data (or unlearning old data) without the need for computationally costly retraining with the old data. Fung and Mangasarian [4] introduced the Incremental and Decremental Linear Proximal Support Vector Machine (PSVM) for binary classification and showed that it could be trained extremely efficiently, with one billion examples (500 increments of two million examples) in two hours and twenty-six minutes on relatively low-end hardware (400 MHz Pentium II). In this paper we propose an efficient algorithm based on memoization, in order to support Multicategory Classification for the Incremental PSVM.
2 Background Theory
The standard binary SVM classification problem with soft margin (allowing some errors) is shown visually in Fig. 1(a) and as a constrained quadratic programming problem in (1). Intuitively, the problem is to maximize the margin between the solid planes and at the same time permit as few errors as possible, errors being positive class points on the negative side (of the solid line) or vice versa.
Fig. 1. SVM and PSVM: (a) standard SVM classifier; (b) proximal SVM classifier
$$\min_{(w,\gamma,y)\in R^{n+1+m}} \left\{ \nu e^\top y + \tfrac{1}{2}\, w^\top w \right\} \tag{1}$$
$$\text{s.t.}\quad D(Aw - e\gamma) + y \ge e, \qquad y \ge 0$$
$$A \in R^{m\times n},\quad D \in \{-1,+1\}^{m\times 1},\quad e = 1^{m\times 1}$$
Fung and Mangasarian [5] replaced the inequality constraint in (1) with an equality constraint. This changed the binary classification problem, because the points in Fig. 1(b) are no longer bounded by the planes, but are clustered around them. By solving the equation for y and inserting the result into the expression to be minimized, one gets the following unconstrained optimization problem:

$$\min_{(w,\gamma)\in R^{n+1}} f(w,\gamma) = \frac{\nu}{2}\,\|D(Aw - e\gamma) - e\|^2 + \frac{1}{2}\,(w^\top w + \gamma^2) \tag{2}$$

Setting $\nabla f = \left(\frac{\partial f}{\partial w}, \frac{\partial f}{\partial \gamma}\right) = 0$ one gets:

$$X = \begin{bmatrix} w \\ \gamma \end{bmatrix} = \left(\frac{I}{\nu} + E^\top E\right)^{-1} E^\top De = \underbrace{\begin{bmatrix} A^\top A + \frac{I}{\nu} & -A^\top e \\ -e^\top A & \frac{1}{\nu} + m \end{bmatrix}^{-1}}_{\mathcal{A}^{-1}} \underbrace{\begin{bmatrix} A^\top De \\ -e^\top De \end{bmatrix}}_{B} \tag{3}$$

$$E = [A \;\; {-e}], \quad E \in R^{m\times(n+1)}$$

Fung and Mangasarian [6] later showed that (3) can be rewritten to handle increments $(E^i, d^i)$ and decrements $(E^d, d^d)$, as shown in (4). This decremental approach is based on time windows.
$$X = \begin{bmatrix} w \\ \gamma \end{bmatrix} = \left(\frac{I}{\nu} + E^\top E + (E^i)^\top E^i - (E^d)^\top E^d\right)^{-1} \left(E^\top d + (E^i)^\top d^i - (E^d)^\top d^d\right), \tag{4}$$

where $d = De$.
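The following NumPy sketch shows one way to realise the incremental part of update (4) under the reading E = [A, −e] and d = De; only an (n+1)×(n+1) matrix and an (n+1)-vector are carried between chunks. Class and variable names are ours, the decrement term is omitted, and this is not the authors' implementation.

```python
import numpy as np

class IncrementalPSVM:
    def __init__(self, n_features, nu=1.0):
        self.nu = nu
        self.M = np.eye(n_features + 1) / nu    # accumulates I/nu + E'E
        self.b = np.zeros(n_features + 1)       # accumulates E'd

    def increment(self, A_inc, y_inc):
        """Add a chunk: A_inc is m x n, y_inc holds labels in {-1, +1}."""
        E = np.hstack([A_inc, -np.ones((A_inc.shape[0], 1))])
        self.M += E.T @ E
        self.b += E.T @ y_inc
        return self

    def solve(self):
        wg = np.linalg.solve(self.M, self.b)     # X = (I/nu + E'E)^(-1) E'd
        return wg[:-1], wg[-1]                   # (w, gamma)

# Usage: clf = IncrementalPSVM(n); clf.increment(A1, y1).increment(A2, y2); w, g = clf.solve()
```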
3 Incremental Proximal SVM for Multiple Classes
In the multicategorical classification case, the (incremental) class label vector $d^i$ consists of $m_i$ numeric labels in the range $\{0, \ldots, c-1\}$, where $c$ is the number of classes, as shown in (5).

$$X = \begin{bmatrix} w_0 & \cdots & w_{c-1} \\ \gamma_0 & \cdots & \gamma_{c-1} \end{bmatrix} = \mathcal{A}^{-1} B \tag{5}$$

3.1 The Naive Approach
In order to apply the proximal SVM classifier in a "one-against-the-rest" manner, the class labels must be transformed into vectors with +1 for the positive class and −1 for the rest of the classes, that is, $\Theta(c\,m_i)$ operations in total, and later $\Theta(c\,m_i\,n)$ operations for calculating $(E^i)^\top d$ for each class. The latter (column) vectors are collected in a matrix $B \in R^{(n+1)\times c}$. Because the training features represented by $E^i$ are the same for all the classes, it is enough to calculate $\mathcal{A} \in R^{(n+1)\times(n+1)}$ once, giving $\Theta(m_i(n+1)^2 + (n+1)^2)$ operations for calculating $(E^i)^\top E^i$ and adding it to $\frac{I}{\nu} + E^\top E$. The specifics are shown in Algorithm 1.

Theorem 1. The running time complexity of Algorithm 1 is $\Theta(c\,m_{inc}\,n)$.

Proof. The conditional statement in lines 3–7 takes $\Theta(1)$ time and is performed $m_{inc}$ times (inner loop, lines 2–8) per iteration of classId (outer loop, lines 1–10). Calculation of the matrix-vector product B[classId, ] in line 9 takes $\Theta((n+1)\,m_{inc})$ per iteration of classId. This gives a total running time of $\Theta(c \cdot (m_{inc} + m_{inc}(n+1))) = \Theta(c\,m_{inc}\,n)$.
Algorithm 1 calcB_Naive(E_inc, d_inc)
Require: E_inc ∈ R^{m_inc×(n+1)}, d_inc ∈ {0, ..., c−1}^{m_inc} and n, m_inc ∈ N
Ensure: B ∈ R^{(n+1)×c}
1:  for all classId in {0, ..., c−1} do
2:    for all idx in {0, ..., m_inc − 1} do
3:      if d_inc[idx] = classId then
4:        d_classId[idx, ] = +1
5:      else
6:        d_classId[idx, ] = −1
7:      end if
8:    end for
9:    B[classId, ] = E_inc^T d_classId
10: end for
11: return B
3.2 The Memoization Approach
The d_classId vectors, c in all (line 3 of Algorithm 1), are likely to be unbalanced, that is, to have many more −1 values than +1 values. Moreover, if there are more than two classes present in the increment d^i, the vectors will share at least one index position where the value is −1. With several classes present in the increment d^i, the matrix-vector products (E^i)^T d_classId therefore perform duplicate calculations whenever two or more d_classId vectors have −1 values in the same position. The basic idea of the memoization approach (Algorithm 3) is to calculate only the +1 positions for each vector d_classId: first create a vector F = −(E^i)^T e (a vector with the negated sums of E^i's columns, equivalent to multiplying (E^i)^T with a vector filled with −1) and then calculate the d_classId products from F, switching a −1 to a +1 by adding the corresponding row vector of E^i twice whenever the row in d_classId is equal to +1. In order to do this efficiently, an index of d^i for each class ID has to be created (Algorithm 2).

Algorithm 2 buildClassMap(d_inc)
Require: d_inc ∈ {0, ..., c−1}^{m_inc} and m_inc ∈ N
1: classMap = array of length c containing empty lists
2: for all idx = 0 to m_inc − 1 do
3:   append idx to classMap[d_inc[idx, ]]
4: end for
5: return classMap
Theorem 2. The running time complexity of Algorithm 2 is Θ(minc ).
Proof. Appending idx to the tail of a linked list takes Θ(1) time, and the lookup of classMap[d_inc[idx, ]] in the directly addressable arrays classMap and d_inc also takes Θ(1) time, giving a total for line 3 of Θ(1) time per iteration of idx. idx is iterated m_inc times, giving a total of Θ(m_inc) time.
Algorithm 3 calcB_Memo(E_inc, d_inc, E_inc^T E_inc)
Require: E_inc ∈ R^{m_inc×(n+1)}, d_inc ∈ {0, ..., c−1}^{m_inc} and n, m_inc ∈ N
Ensure: B ∈ R^{(n+1)×c}, F ∈ R^{(n+1)}
1: classMap = buildClassMap(d_inc)
2: for all classId in {0, ..., c−1} do
3:   B[classId, ] = E_inc^T E_inc[n]
4:   for all idx in classMap[classId, ] do
5:     B[idx, classId, ] = B[idx, classId, ] + 2 · Σ_{j=0}^{n} E_inc[idx, j]
6:   end for
7: end for
8: return B
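As a sketch of the memoization idea as we read it (our own indexing, not a transcription of the pseudo-code above): every class column starts from F = −(E^i)^T e, and each example of that class then flips its −1 contribution to +1 by adding its row of E^i twice. The NumPy comparison below also includes the naive variant and checks Corollary 1 on random data.

```python
import numpy as np

def calc_B_naive(E_inc, d_inc, c):
    """One-against-the-rest targets built explicitly."""
    B = np.empty((E_inc.shape[1], c))
    for class_id in range(c):
        d_class = np.where(d_inc == class_id, 1.0, -1.0)
        B[:, class_id] = E_inc.T @ d_class
    return B

def calc_B_memo(E_inc, d_inc, c):
    F = -E_inc.sum(axis=0)                       # (E^i)^T times a vector of -1s
    B = np.tile(F[:, None], (1, c))              # every class column starts at F
    for idx, class_id in enumerate(d_inc):       # classMap made implicit by enumerate
        B[:, class_id] += 2.0 * E_inc[idx]       # switch this example's -1 to +1
    return B

rng = np.random.default_rng(0)
E = np.hstack([rng.normal(size=(8, 3)), -np.ones((8, 1))])
d = rng.integers(0, 4, size=8)
assert np.allclose(calc_B_naive(E, d, 4), calc_B_memo(E, d, 4))   # both give the same B
```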
Theorem 3. The running time complexity of Algorithm 3 is Θ(n(c + m_inc)).

Proof. Calculation of classMap (line 1) takes Θ(m_inc) time (by Theorem 2). Line 3 takes Θ(n + 1) time per iteration of classId, giving a total of Θ(c(n + 1)). Because classMap provides a complete indexing of the class labels in d_inc (|∪_{u=0}^{c−1} classMap[u]| = m_inc), and because there are no repeated occurrences of idx for different classIds (∩_{u=0}^{c−1} classMap[u] = ∅), line 5 will run a total of m_inc times. This gives a total running time of Θ(m_inc + (n + 1)m_inc + c(n + 1) + m_inc) = Θ(n(c + m_inc)).
Corollary 1. Algorithms 1 and 3 calculate the same B if provided with the same input.
4 Empirical Results
In order to test and compare the computational performance of the incremental multicategory proximal SVMs with the naive and lazy algorithms, we have used three main types of data: 1. Forest cover type, 580012 training examples, 7 classes and 54 features (from UCI KDD Archive [7])
2. Synthetic datasets with a large number of classes (up to 1000 classes) and 30 features
3. Synthetic dataset with a large number of examples (10 million), 10 features and 10 classes

The results for the first two data sets are shown in Fig. 2; the average time from tenfold cross-validation is used. For the third data set, the average classifier training times were 18.62 s and 30.64 s with the lazy and naive algorithm, respectively (training time for 9 million examples, testing on 1 million). The incremental multicategory proximal SVM was implemented in C++ using the CLapack and ATLAS libraries. The tests were run on an Athlon 1.53 GHz PC with 1 GB RAM running Red Hat Linux 2.4.18.

[Figure 2 plots training time in seconds for the naive and lazy algorithms: (a) runtime vs. number of examples (cover type); (b) runtime vs. number of classes (synthetic).]
Fig. 2. Computational performance: training time
Acknowledgements We would like to thank Professor Mihhail Matskin and Professor Arne Halaas. This work is supported by the Norwegian Research Council in the framework of the Distributed Information Technology Systems (DITS) program, and the ElComAg project.
5 Conclusion and Future Work
We have introduced the multiclass incremental proximal SVM and shown a computational improvement for training it, which works particularly well for classification problems with a large number of classes. Another contribution is the implementation of the system (available on request). Future work includes applying the algorithm to demanding incremental classification problems, for example, web page prediction based on analysis of click streams, or automatic text categorization. Algorithmic improvements that need to be done include: (1) developing balancing mechanisms (in order to give hints for pivot elements to the applied linear system solver, for reduction of numeric errors), (2) adding support for decay coefficients for efficient decremental unlearning, (3) investigating the appropriateness of parallelized incremental proximal SVMs, and (4) strengthening the implementation with support for a tuning set and kernels, as well as one-against-one classifiers.
References
[1] Burbidge, R., Buxton, B. F.: An introduction to support vector machines for data mining. In Sheppee, M., ed.: Keynote Papers, Young OR12, University of Nottingham, Operational Research Society (2001) 3–15
[2] Huang, J., Shao, X., Wechsler, H.: Face pose discrimination using support vector machines (SVM). In: Proceedings of the 14th Int'l Conf. on Pattern Recognition (ICPR'98), IEEE (1998) 154–156
[3] Muller, K. R., Smola, A. J., Ratsch, G., Scholkopf, B., Kohlmorgen, J., Vapnik, V.: Predicting time series with support vector machines. In: ICANN (1997) 999–1004
[4] Fung, G., Mangasarian, O. L.: Incremental support vector machine classification. In Grossman, R., Mannila, H., Motwani, R., eds.: Proceedings of the Second SIAM International Conference on Data Mining, SIAM (2002) 247–260
[5] Fung, G., Mangasarian, O. L.: Multicategory Proximal Support Vector Classifiers. Submitted to Machine Learning Journal (2001)
[6] Schwefel, H. P., Wegener, I., Weinert, K., eds.: 8. Natural Computing. In: Advances in Computational Intelligence: Theory and Practice. Springer-Verlag (2002)
[7] Hettich, S., Bay, S. D.: The UCI KDD archive. http://kdd.ics.uci.edu (1999)
Arc Consistency for Dynamic CSPs
Malek Mouhoub
Department of Computer Science, University of Regina, 3737 Waskana Parkway, Regina SK, Canada, S4S 0A2
[email protected]
Abstract. Constraint Satisfaction Problems (CSPs) are a fundamental concept used in many real world applications such as interpreting a visual image, laying out a silicon chip, frequency assignment, scheduling, planning and molecular biology. A main challenge when designing a CSP-based system is the ability to deal with constraints in a dynamic and evolving environment. We then talk about online CSP-based systems capable of reacting, in an efficient way, to any new external information during the constraint resolution process. We propose in this paper a new algorithm capable of dealing with dynamic constraints at the arc consistency level of the resolution process. More precisely, we present a new dynamic arc consistency algorithm that has a better compromise between time and space than the algorithms proposed in the literature, in addition to the simplicity of its implementation. Experimental tests on randomly generated CSPs demonstrate the efficiency of our algorithm in dealing with large problems in a dynamic environment.

Keywords: Constraint Satisfaction, Search, Dynamic Arc Consistency
1 Introduction
Constraint Satisfaction Problems (CSPs) [1, 2] are a fundamental concept used in many real world applications such as interpreting a visual image, laying out a silicon chip, frequency assignment, scheduling, planning and molecular biology. This motivates the scientific community from artificial intelligence, operations research and discrete mathematics to develop different techniques to tackle problems of this kind. These techniques became more popular after they were incorporated into constraint programming languages [3]. A main challenge when designing a CSP-based system is the ability to deal with constraints in a dynamic and evolving environment. We then talk about online CSP-based systems capable of reacting, in an efficient way, to any new external information during the constraint resolution process.

A CSP involves a list of variables defined on finite domains of values and a list of relations restricting the values that the variables can take. If the relations are binary we talk about a binary CSP. Solving a CSP consists of finding an assignment of values to each variable such that all relations (or constraints) are satisfied. A CSP is known to be an NP-hard problem: looking for a possible solution to a CSP requires a backtrack search algorithm of exponential time complexity (note that some CSPs can be solved in polynomial time; for example, if the constraint graph corresponding to the CSP has no loops, then the CSP can be solved in O(nd^2), where n is the number of variables of the problem and d the domain size of the variables). To overcome this difficulty in practice, local consistency techniques are used in a pre-processing phase to reduce the size of the search space before the backtrack search procedure. A k-consistency algorithm removes all inconsistencies involving all subsets of k variables. The k-consistency problem is polynomial in time, O(N^k), where N is the number of variables. A k-consistency algorithm does not solve the constraint satisfaction problem, but simplifies it. Due to the incompleteness of constraint propagation, in the general case search is necessary to solve a CSP, even to check whether a single solution exists. When k = 2 we talk about arc consistency. An arc consistency algorithm transforms the network of constraints into an equivalent and simpler one by removing, from the domain of each variable, values that cannot belong to any global solution.

We propose in this paper a new technique capable of dealing with dynamic constraints at the arc consistency level. More precisely, we present a new dynamic arc consistency algorithm that has a better compromise between time and space than the algorithms proposed in the literature [4, 5, 6], in addition to the simplicity of its implementation. In order to evaluate the time and memory costs of the algorithm we propose, experimental tests on randomly generated CSPs have been performed. The results demonstrate the efficiency of our method in dealing with large dynamic CSPs.

The rest of the paper is organized as follows. In the next section we present the dynamic arc consistency algorithms proposed in the literature. Our dynamic arc consistency algorithm is then presented in section 3, together with a theoretical comparison with the algorithms proposed in the literature. The experimental part of our work is presented in section 4. Finally, concluding remarks and possible perspectives are listed in section 5.
2 Dynamic Arc-Consistency Algorithms

2.1 Arc Consistency Algorithms
The key AC algorithm, AC-3, was developed by Mackworth [1] over twenty years ago and remains one of the easiest to implement and understand today. There have been many attempts to beat its worst case time complexity of O(ed^3), and though in theory these other algorithms (namely AC-4 [7], AC-6 [8] and AC-7 [9]) have better worst case time complexities, they are harder to implement. In fact, the AC-4 algorithm fares worse in average time complexity than the AC-3 algorithm [10]. It was not until recently that Zhang and Yap [11] proposed
an improvement directly derived from the AC-3 algorithm, their algorithm AC-3.1 (another arc consistency algorithm, called AC-2001 and based on the same idea as AC-3.1, was proposed by Bessière and Régin [12]; we have chosen AC-3.1 for the simplicity of its implementation). The worst case time complexity of AC-3 is bounded by O(ed^3) [13], where e is the number of constraints and d is the domain size of the variables. This complexity depends mainly on the way arc consistency is enforced for each arc of the constraint graph. Indeed, if any time a given arc (i, j) is revised, a support for each value from the domain of i is searched from scratch in the domain of j, then the worst case time complexity of AC-3 is O(ed^3). Instead of a search from scratch, Zhang and Yap [11] proposed a new view that allows the search to resume from the point where it stopped in the previous revision of (i, j). By doing so, the worst case time complexity of AC-3 is achieved in O(ed^2).
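The compact Python sketch below illustrates the idea of resuming support search (it is our own simplified rendering, not AC-3.1 itself, and it does not reproduce the exact bookkeeping needed for the O(ed^2) bound): each revision of an arc (i, j) remembers, per value of i, where the last support was found in j's domain, so the next revision starts there instead of from scratch.

```python
from collections import deque

def ac3_resume(domains, constraints):
    """domains: {var: list of values}; constraints: {(i, j): predicate(vi, vj)} for both directions."""
    last = {}                                        # resume points: (i, a, j) -> index in domains[j]
    def has_support(i, a, j):
        start = last.get((i, a, j), 0)
        for idx in list(range(start, len(domains[j]))) + list(range(0, start)):
            if domains[j][idx] is not None and constraints[(i, j)](a, domains[j][idx]):
                last[(i, a, j)] = idx
                return True
        return False
    queue = deque(constraints.keys())
    while queue:
        i, j = queue.popleft()
        revised = False
        for pos, a in enumerate(domains[i]):
            if a is not None and not has_support(i, a, j):
                domains[i][pos] = None               # delete the unsupported value (keep indices stable)
                revised = True
        if revised:
            queue.extend((k, i) for (k, l) in constraints if l == i and k != j)
    return {v: [x for x in dom if x is not None] for v, dom in domains.items()}

# Example: x < y with domains {1, 2, 3} leaves x in {1, 2} and y in {2, 3}.
doms = {"x": [1, 2, 3], "y": [1, 2, 3]}
cons = {("x", "y"): lambda a, b: a < b, ("y", "x"): lambda a, b: b < a}
print(ac3_resume(doms, cons))   # {'x': [1, 2], 'y': [2, 3]}
```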
2.2 Dynamic Arc Consistency Algorithms
The arc-consistency algorithms we have seen in the previous section can easily be adapted to update the variable domains incrementally when adding a new constraint. This simply consists of performing arc consistency between the variables sharing the new constraint and propagating the change to the rest of the constraint network. However, the way an arc consistency algorithm has to proceed with constraint relaxation is more complex. Indeed, when a constraint is retracted the algorithm should be able to put back those values removed because of the relaxed constraint and propagate this change to the entire graph. Thus, traditional arc consistency algorithms have to be modified so that they are able to find those values which need to be restored any time a constraint is relaxed.

Bessière has proposed DnAC-4 [4], which is an adaptation of AC-4 [7] to deal with constraint relaxations. This algorithm stores a justification for each deleted value. These justifications are then used to determine the set of values that have been removed because of the relaxed constraint, and so it can process relaxations incrementally. DnAC-4 inherits the bad time and space complexity of AC-4. Indeed, compared to AC-3 for example, AC-4 has a bad average time complexity [10]. The worst-case space complexity of DnAC-4 is O(ed^2 + nd) (e, d and n are respectively the number of constraints, the domain size of the variables and the number of variables). To work around the drawback of AC-4 while keeping an optimal worst case complexity, Bessière has proposed AC-6 [8]. Debruyne has then proposed DnAC-6, adapting the idea of AC-6 to dynamic CSPs by using justifications similar to those of DnAC-4 [5]. While keeping an optimal worst case time complexity (O(ed^2)), DnAC-6 has a lower space requirement (O(ed + nd)) than DnAC-4. To solve the problem of space complexity, Neveu and Berlandier proposed AC|DC [6]. AC|DC is based on AC-3 and does not require data structures for storing justifications. Thus, it has a very good space complexity (O(e + nd)) but is less efficient in time than DnAC-4. Indeed, with its O(ed^3) worst case time complexity, it is not the algorithm of choice for large dynamic CSPs.

Our goal here is to develop an algorithm that has a better compromise between running time and memory space than the above three algorithms. More precisely, our ambition is to have an algorithm with the O(ed^2) worst case time complexity of DnAC-6 but without the need for complex and space-expensive data structures to store the justifications. We have therefore decided to adapt the new algorithm proposed by Zhang and Yap [11] in order to deal with constraint relaxations. The details of the new dynamic arc consistency algorithm we propose, which we call AC-3.1|DC, are presented in the next section. The basic idea was to integrate AC-3.1 into the AC|DC algorithm, since that algorithm was based on AC-3. The problem with the AC|DC algorithm was that it relied solely on the AC-3 algorithm and did not keep support lists like DnAC-4 or DnAC-6, causing the restriction and relaxation of a constraint to be fairly time consuming. This is also the reason for its worst case time complexity of O(ed^3). If AC-3.1 is integrated into the AC|DC algorithm, then in theory the worst case time complexity for a constraint restriction should be O(ed^2). In addition to this, the worst case space complexity remains the same as that of the original AC|DC algorithm, O(e + nd). The more interesting question is whether this algorithm's time complexity can remain the same during retractions.

Following the same idea as AC|DC, the way our AC-3.1|DC algorithm deals with relaxations is as follows (see the pseudo-code of the algorithm in Figure 1). For any retracted constraint (k, m) between the variables k and m, we perform the following three phases:

1. An estimation (over-estimation) of the set of values that have been removed because of the constraint (k, m) is first determined by looking for the values removed from the domains of k and m that have no support on (k, m). Indeed, those values already suppressed from the domain of k (resp. m) which do have a support on (k, m) do not need to be put back, since they have been suppressed because of another constraint. This phase is handled by the procedure Propose. The over-estimated values are put in the array propagate_list[k] (resp. propagate_list[m]).
2. The above set is then propagated to the other variables. In this phase, for each value (k, a) (resp. (m, b)) added to the domain of k (resp. m), we look for those values removed from the domains of the variables adjacent to k (resp. m) that are supported by (k, a) (resp. (m, b)). These values are then propagated to the adjacent variables. The array propagate_list contains the list of values to be propagated for each variable. After we propagate the values in propagate_list[i] of a given variable i, these values are removed from the array propagate_list and added to the array restore_list in order to be added later to the domain of variable i. This way we avoid propagating the values more than once.
3. Finally, a filtering procedure (the function Filter) based on AC-3.1 is performed to remove from the estimated set the values which are not arc-consistent with respect to the relaxed problem.
3 AC-3.1|DC
The worst case time complexity of the first phase is O(d^2). AC-3.1 is applied in the third phase, and thus the complexity is O(ed^2). Since the values in propagate_list are propagated only once, the complexity of the second phase is also O(ed^2). Thus the overall complexity of the relaxation is O(ed^2). In terms of space complexity, the arrays propagate_list and restore_list require O(nd). AC-3.1 requires an array storing the resume point for each variable value (in order to have O(ed^2) time complexity); the space required by this array is O(nd) as well. If we add to this the O(e + nd) space requirement of the traditional AC-3 algorithm, the overall space requirement is O(e + nd) as well. Compared to the three dynamic arc consistency algorithms mentioned in the previous section, ours has, in theory, a better compromise between time and space costs, as illustrated by Table 1.
Function Relax(k, m)
1. propagate_list ← nil
2. Remove (k, m) from the set of constraints
3. Propose(k, m, propagate_list)
4. Propose(m, k, propagate_list)
5. restore_list ← nil
6. Propagate(k, m, propagate_list, restore_list)
7. Filter(restore_list)
8. for all i ∈ V do
9.   domain_i ← domain_i ∪ restore_list[i]

Function Propose(i, j, propagate_list)
1. for all values a ∈ dom[i] − D[i] do
2.   support ← false
3.   for all b ∈ D[j] do
4.     if ((i,a),(j,b)) is satisfied by (i,j) then
5.       support ← true
6.       exit
7.   if support = false then
8.     propagate_list[i] ← propagate_list[i] ∪ {a}

Function Propagate(k, m, propagate_list, restore_list)
1.  L ← {k, m}
2.  while L ≠ nil do
3.    i ← pop(L)
4.    for all j such that (i,j) ∈ the set of constraints do
5.      S ← nil
6.      for all b ∈ dom[j] − (D[j] ∪ restore_list[j] ∪ propagate_list[j]) do
7.        for all a ∈ propagate_list[i] do
8.          if ((i,a),(j,b)) is satisfied by (i,j) then
9.            S ← S ∪ {b}
10.           exit
11.     if S ≠ nil then
12.       L ← L ∪ {j}
13.       propagate_list[j] ← propagate_list[j] ∪ S
14.   restore_list[i] ← restore_list[i] ∪ propagate_list[i]
15.   propagate_list[i] ← nil
Fig. 1. Pseudo code of the dynamic arc consistency algorithm
Table 1. Comparison in terms of time and memory costs of the four algorithms
                    DnAC-4         DnAC-6        AC|DC       AC-3.1|DC
Space complexity    O(ed^2 + nd)   O(ed + nd)    O(e + nd)   O(e + nd)
Time complexity     O(ed^2)        O(ed^2)       O(ed^3)     O(ed^2)
4 Experimentation
Theoretical comparison of the four dynamic arc consistency algorithms shows that AC-3.1|DC has a better compromise between time and space costs than the other three algorithms. In order to see whether the same conclusion holds in practice, we have performed comparative tests on randomly generated dynamic CSPs. The criteria used to compare the algorithms are the running time and the memory space required by each algorithm to achieve arc consistency. The experiments were performed on a Sun Sparc 10 and all procedures are coded in C/C++. Given n the number of variables and d the domain size, each CSP instance is randomly obtained by generating n sets of d natural numbers. The n(n−1)/2 constraints are then picked randomly from a set of arithmetic relations {=, ≠, <, ≤, >, ≥, ...}. The generated CSPs are characterized by their tightness, which can be measured, as shown in [14], as the fraction of all possible pairs of values from the domains of two variables that are not allowed by the constraint. Figure 2 shows the time performance of each arc consistency algorithm when achieving arc consistency in a dynamic environment, obtained as follows. Starting from a CSP having n = 100 variables, d = 50 and 0 constraints, restrictions are done by adding relations from the random CSP until a complete graph (number of constraints = n(n−1)/2) is obtained. Afterwards, relaxations are performed until the graph is 50% constrained (number of constraints = n(n−1)/4). These tests are performed for various degrees of tightness, to determine whether one type of problem (over-constrained, middle-constrained or under-constrained) favored any of the algorithms.
[Figure 2 plots running time (sec) against constraint tightness (0 to 0.9) for DnAC-4, DnAC-6, AC|DC and AC-3.1|DC.]
Fig. 2. Comparative tests of the dynamic arc-consistency algorithms
Table 2. Comparative results in terms of memory cost
n            500    500    500    500    500    500    500    500    500    500    500
d            100     90     80     70     60     50     40     30     20     10      5
DnAC-6     343MB  336MB  278MB  243MB  203MB  160MB  130MB   98MB   67MB   37MB   23MB
AC-3.1|DC   55MB   51MB   45MB   42MB   36MB   32MB   26MB   16MB   16MB   12MB   10MB
As can easily be seen, the results obtained by AC-3.1|DC are better than those of AC|DC and DnAC-4 in all cases. The AC-3.1|DC algorithm is also comparable to, if not better than, DnAC-6 (which has the best running time of the three dynamic arc consistency algorithms from the literature), as can be seen in Figure 2. Table 2 shows the comparative results of DnAC-6 and AC-3.1|DC in terms of memory space. The tests were performed on randomly generated CSPs in the same way as the previous ones. As can easily be seen, AC-3.1|DC requires much less memory space than DnAC-6, especially for large problems with large domain sizes.
5 Conclusion and Future Work
In this paper we have presented a new algorithm for maintaining the arc consistency of a CSP in a dynamic environment. Theoretical and experimental comparison of our algorithm with those proposed in the literature demonstrates that our algorithm has a better compromise between time and memory costs. In the near future we are looking at integrating our dynamic arc consistency algorithm into the backtrack search phase in order to handle the addition and relaxation of constraints during search. For instance, if a value from a variable domain is deleted during the backtrack search, would it be worthwhile to use a dynamic arc consistency algorithm to determine its effect, or would it be more costly than just continuing with the backtrack search?
References
[1] A. K. Mackworth. Consistency in networks of relations. Artificial Intelligence, 8:99–118, 1977.
[2] R. M. Haralick and G. L. Elliott. Increasing tree search efficiency for Constraint Satisfaction Problems. Artificial Intelligence, 14:263–313, 1980.
[3] P. Van Hentenryck. Constraint Satisfaction in Logic Programming. The MIT Press, 1989.
[4] C. Bessière. Arc-consistency in dynamic constraint satisfaction problems. In AAAI'91, pages 221–226, Anaheim, CA, 1991.
[5] R. Debruyne. Les algorithmes d'arc-consistance dans les CSP dynamiques. Revue d'Intelligence Artificielle, 9:239–267, 1995.
[6] B. Neveu and P. Berlandier. Maintaining arc consistency through constraint retraction. In ICTAI'94, pages 426–431, 1994.
[7] R. Mohr and T. Henderson. Arc and path consistency revisited. Artificial Intelligence, 28:225–233, 1986.
[8] C. Bessière. Arc-consistency and arc-consistency again. Artificial Intelligence, 65:179–190, 1994.
[9] C. Bessière, E. Freuder, and J. C. Régin. Using inference to reduce arc consistency computation. In IJCAI'95, pages 592–598, Montréal, Canada, 1995.
[10] R. J. Wallace. Why AC-3 is almost always better than AC-4 for establishing arc consistency in CSPs. In IJCAI'93, pages 239–245, Chambery, France, 1993.
[11] Yuanlin Zhang and Roland H. C. Yap. Making AC-3 an optimal algorithm. In Seventeenth International Joint Conference on Artificial Intelligence (IJCAI'01), pages 316–321, Seattle, WA, 2001.
[12] C. Bessière and J. C. Régin. Refining the basic constraint propagation algorithm. In Seventeenth International Joint Conference on Artificial Intelligence (IJCAI'01), pages 309–315, Seattle, WA, 2001.
[13] A. K. Mackworth and E. Freuder. The complexity of some polynomial network-consistency algorithms for constraint satisfaction problems. Artificial Intelligence, 25:65–74, 1985.
[14] D. Sabin and E. C. Freuder. Contradicting conventional wisdom in constraint satisfaction. In Proc. 11th ECAI, pages 125–129, Amsterdam, Holland, 1994.
Determination of Decision Boundaries for Online Signature Verification
Masahiro Tanaka (1), Yumi Ishino (2), Hironori Shimada (2), Takashi Inoue (2), and Andrzej Bargiela (3)
(1) Department of Information Science and Systems Engineering, Konan University, Kobe 658-8501, Japan
m [email protected], http://cj.is.konan-u.ac.jp/~tanaka-e/
(2) R & D Center, Glory Ltd., Himeji 670-8567, Japan
{y ishino,shimada,inoue}@dev.glory.co.jp
(3) Department of Computing and Mathematics, The Nottingham Trent University, Burton St, Nottingham NG1 4BU, UK
[email protected]
Abstract. We are developing methods for online (or dynamical) signature verification. Our method is first to move the test signature to the sample signature so that the DP matching can be done well, and then compare the pen information along the matched points of the signatures. However, it is not easy to determine how to use the several elements of discrepancy between them. In this paper, we propose a verification method based on the discrimination by RBF network.
1 Introduction
We are developing methods for online signature verification. The data is a multidimensional time series obtained by a tablet and an electronic pen ([1, 4, 3]). The information available in our system includes the x and y coordinates, the pressure of the pen, and the azimuth and altitude of the pen. Hence the data can be seen as a sequence of vectors. Our method is first to modify the test signature to the template signature so that the DP matching can be done well, and then to compare the pen information along the matched points of the signatures.

One of the most intuitive methods for verification or recognition of handwritten characters or signatures is to extract the corresponding parts of two drawings and then compare them. For this purpose, we use only the coordinates of the drawings and neglect the other elements. We developed this method as a preprocessing tool before the main verification procedure [5, 6, 7]. However, it is not easy to determine how to use the several elements of difference between them. In this paper, we propose a verification method based on discrimination by a neural network.
2 Data Categories
Suppose a q-dimensional time-series vector z ∈ R^q is available from the tablet at a fixed sampling interval, whose elements include the x and y coordinates, pressure, azimuth and altitude of the pen. We will use k as the discrete time index. For a signature verification problem, the following three kinds of data category should be prepared.

– Template. The template signature is the one registered by the true person. We could use more than two template signatures, but we use one template here for simplicity of treatment. Let S ∈ R^{q×n_s} represent it.
– Training signatures. We need both genuine signatures and forgeries. It is possible to build a model with only genuine signatures and establish a probability density function model, but the discrimination ability is low; it is better to use both positive and negative cases. The more training signatures, the higher the accuracy. Let T_1, T_2, ..., T_m be the genuine signatures, and let F_{ij}, j = 0, ..., m(i); i = 1, ..., n be the forgeries, where i is the ID of the person and j is the file number.
– Test signatures. We also need both genuine signatures and forgeries for checking the verification model.

If we need to authenticate n signers, we need to prepare n sets of the above.
3 Matching Method
By considering the diversity of genuine signatures as a partial drift of the speed during the signing procedure, it is appropriate to apply DP matching. Here we use only x and y, which are elements of z, for matching the two vectors. One sequence is the x, y elements of S; we denote it as $\bar{S} = [p(1), \cdots, p(n_s)] \in R^{2\times n_s}$. Another sequence to compare with it is a 2-dimensional sequence denoted here as $Q = [q(1), \cdots, q(n_q)]$. For this problem, DP matching alone did not work sufficiently well; however, we considered the matching to be particularly important for verification. So, we proposed the matching method in [5], which is to apply DP matching and the coordinate modification iteratively. Next we summarise the DP matching and coordinate modification procedures.

3.1 DP Matching
DP matching as used here is a method to derive the corresponding points that give the minimal cost for two vector sequences, and the following recursive form is well known:

$$D(i,j) = \min \begin{cases} D(i-1, j) + d(i,j) \\ D(i-1, j-1) + 2\,d(i,j) \\ D(i, j-1) + d(i,j) \end{cases} \tag{1}$$

where $d(i,j)$ is the distance between $p(i)$ and $q(j)$, which is usually the Euclidean distance.
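A direct transcription of recursion (1) follows (our code; the boundary conditions and the traced-back alignment are our choices, since the text does not fix them).

```python
import numpy as np

def dp_match(P, Q):
    """P, Q: lists of 2-D points; returns total cost and one optimal list of matched index pairs."""
    d = lambda p, q: float(np.hypot(p[0] - q[0], p[1] - q[1]))   # Euclidean distance
    n, m = len(P), len(Q)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = d(P[i - 1], Q[j - 1])
            D[i, j] = min(D[i - 1, j] + cost,            # D(i-1, j) + d(i, j)
                          D[i - 1, j - 1] + 2 * cost,    # D(i-1, j-1) + 2 d(i, j)
                          D[i, j - 1] + cost)            # D(i, j-1) + d(i, j)
    # trace back one optimal alignment
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        cost = d(P[i - 1], Q[j - 1])
        moves = {(i - 1, j): D[i - 1, j] + cost,
                 (i - 1, j - 1): D[i - 1, j - 1] + 2 * cost,
                 (i, j - 1): D[i, j - 1] + cost}
        i, j = min(moves, key=moves.get)
    return D[n, m], path[::-1]

cost, pairs = dp_match([(0, 0), (1, 1), (2, 2)], [(0, 0), (2, 2)])
print(cost, pairs)
```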
3.2 Coordinate Modification
This is a procedure for moving the points of the test signature towards the sample signature. The model is the following:

$$x'(k) = a(k)x(k) + b(k) \tag{2}$$
$$y'(k) = a(k)y(k) + c(k) \tag{3}$$
where $k$ denotes the time index, $x(k)$ and $y(k)$ are the elements of $p(k)$, $x'(k)$ and $y'(k)$ are the transformed values which are taken as the elements of $q(k)$, and $a(k)$, $b(k)$, $c(k)$ are time-variant transformation parameters. Now we define the parameter vector $\theta$ as $\theta(k) = [a(k)\; b(k)\; c(k)]^\top$. If the parameters are allowed to change slowly and independently, the model can be written as follows:

$$\theta(k+1) = \theta(k) + w(k) \tag{4}$$

where $w(k)$ is a zero-mean random vector with covariance matrix $Q = \mathrm{diag}(q_1, q_2, q_3)$. If the diagonal elements $q_1, q_2, q_3$ are small, it means the elements of $\theta(k)$ are allowed to change only slowly. Now we fix the template data $p(k)$. Then it is expected that the test data transformed by (2)-(3) are close to the template data $p(k)$. We can write this assertion by the following equation obtained from equations (2) and (3):

$$z(k) = H(k)\theta(k) + v(k) \tag{5}$$

where

$$z(k) = \begin{bmatrix} x'(k) \\ y'(k) \end{bmatrix}$$

is the test data, and

$$H(k) = \begin{bmatrix} x(k) & 1 & 0 \\ y(k) & 0 & 1 \end{bmatrix} \tag{6}$$
includes the elements of the model data. Here $v(k)$ is a 2-dimensional random vector independent of $w$, with zero mean and covariance $R$; it absorbs the unmodelled portion of the observation data. Based on equations (4) and (5), we can estimate the time-variant parameter $\theta(k)$ by using linear estimation theory. If we had to estimate the parameters on-line, we would use a Kalman filter or a fixed-lag smoother. Here, however, the data must be matched by using DP matching, which requires the data to be processed off-line. So we can work in an off-line manner, and the fixed-interval smoother yields the best result for off-line processing. Hence we use the fixed-interval smoother.
4 Verification Method
We have constructed a signature verification method based on the matching of signatures. Here we describe the method in detail.
4.1 Normalisation of Training Data
We have found experimentally that data normalisation is very useful before matching. Let p be the original 2-D signal, and let R be the covariance matrix of p. By orthogonalisation we have RV = VΛ, where V is an orthonormal matrix and Λ is a diagonal matrix. Since V is orthonormal, we have V^{-1} = V' and V'RV = Λ. Now, let p̄ be the mean of p. Also define r̄ = V'p̄ and r = V'p − r̄. Then we have

  E[rr'] = V'E[(p − p̄)(p − p̄)']V = Λ.

If we want data whose standard deviation along the horizontal axis is 1000, we can normalise it by

  r = (1000/√λ1)(V'p − r̄),

and thus

  E[rr'] = diag(10^6, (λ2/λ1) × 10^6).
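A minimal sketch of the normalisation above, assuming the signal is given as a 2 × n array of x, y samples; the function name and the ordering of the eigenvalues are our own choices, not the authors' code.

```python
import numpy as np

def normalise_signal(p, target_std=1000.0):
    """Whiten-and-scale normalisation of a 2 x n coordinate signal p
    (Section 4.1): rotate by V' and scale so that the standard deviation
    along the principal axis equals `target_std`.
    """
    p = np.asarray(p, dtype=float)            # shape (2, n)
    p_mean = p.mean(axis=1, keepdims=True)
    R = np.cov(p)                             # 2 x 2 covariance matrix
    eigvals, V = np.linalg.eigh(R)            # R V = V Lambda
    order = np.argsort(eigvals)[::-1]         # principal axis first
    eigvals, V = eigvals[order], V[:, order]
    r = V.T @ (p - p_mean)                    # rotate into the principal axes
    return (target_std / np.sqrt(eigvals[0])) * r
```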
4.2 Matching Data
Here we apply the iterative procedure of DP matching and coordinate modification of the data as described in Section 3.
4.3 Extracting Feature Vectors
After matching two signatures s and ti, we can use the squared distances between the matched elements to form the feature vector

  d(i) = [d1(i) d2(i) · · · d7(i)],  i = 1, ..., m or m(j).   (7)

The above criteria are common to the genuine signatures and the forgeries. Each element in (7) is the mean square difference between the template signature and the training signature. Each of the criteria has the following meaning. Suppose (i, j) is an index pair after matching for one of the vectors. Then we trace back the original positions (i', j') and calculate the velocity by

  vel = √[(x(i'−1) − x(i'))² + (y(j'−1) − y(j'))²] + √[(x(i') − x(i'+1))² + (y(j') − y(j'+1))²].   (8)
Table 1. Elements of Criteria

  element number   meaning
  1                length of the data
  2                modified coordinate x after matching
  3                modified coordinate y after matching
  4                pressure of matched points
  5                angle of matched points
  6                direction of matched points
  7                velocity of matched points
This trace back is necessary because the matched index does not necessarily proceed smoothly and therefore does not show the actual velocity of the signature. The meanings of the other elements are straightforward, so we omit their detailed explanation.
4.4 Model Building by RBF Network
We now build the verification model. A training vector d(i) is 7-dimensional, and the output o(i) is 1 (for genuine) or 0 (for forgery). We have m cases with o(i) = 1 and m(1) + · · · + m(n) cases with o(i) = 0. However, the scales of the criteria vary a lot, so further normalisation is needed. Hence we normalised the criteria as follows:
1. Find the maximum maxj among the j-th elements of the training vectors, and
2. Divide the j-th element of every vector by maxj.
This maxj is used again in the verification phase. Let us denote the training data by d(k), k = 1, ..., N. By using all the samples as kernels of the function, we have the model

  f(x) = Σ_{k=1}^{N} θk exp(−‖x − dk‖² / σ²)   (9)
for a general x ∈ R^7. This can be rewritten as

  Y = Mθ   (10)

where Y ∈ R^N, M ∈ R^{N×N}, and θ ∈ R^N. Using the training data dk, k = 1, ..., N, we obtain the N × N matrix M, whose rank is N, so we have a unique parameter θ. It is clear that equation (9) then reproduces the exact output values for the training data dk, k = 1, ..., N. Various kinds of neural networks can adapt to nonlinear, complicated boundary problems. However, we applied the RBF network for a special reason. It has
a universal approximation property, like the multi-layer perceptron. However, the RBF network has another property that is useful for pattern recognition of this kind: due to its functional form, it intrinsically produces an output near zero if the input vector is not similar to any of the training data. Other recognisers, such as the multi-layer perceptron, do not have this property.
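A minimal sketch of the exact-interpolation RBF model of equations (9)-(10), assuming the Gaussian width σ is chosen by the user (its value is not given in this section) and that the kernel matrix M is nonsingular, as stated above. Illustrative only, not the authors' implementation.

```python
import numpy as np

def fit_rbf(D, y, sigma):
    """Exact-interpolation RBF model of equation (9): every training vector
    d_k is a kernel centre, so M is N x N and theta solves Y = M theta (10).
    D: (N, 7) normalised feature vectors; y: (N,) target outputs (0 or 1).
    """
    D = np.asarray(D, dtype=float)
    sq = ((D[:, None, :] - D[None, :, :]) ** 2).sum(-1)   # squared distances
    M = np.exp(-sq / sigma ** 2)
    theta = np.linalg.solve(M, np.asarray(y, dtype=float))
    return theta

def rbf_output(x, D, theta, sigma):
    """Evaluate f(x) for a new 7-dimensional feature vector x."""
    sq = ((np.asarray(D, dtype=float) - np.asarray(x, dtype=float)) ** 2).sum(-1)
    return float(np.exp(-sq / sigma ** 2) @ theta)
```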
4.5 Verification
We can use the model to verify whether new data is genuine or a forgery. When we receive the data, we process it as follows.
1. Normalise the original data.
2. Normalise the feature vector.
3. Input the feature vector to the recogniser.
5 Experimental Results
Fig. 1 shows the ranges of each criterion. The sub-figures are numbered from the top-left to the top-right and then to the second row. Sub-figure 1 shows the values of d1(i), i = 1, ..., 9; the same holds for sub-figure 2 with the values d2(i), i = 1, ..., 9, and so on. The first bar corresponds to the genuine signature. Using each criterion alone, it is almost impossible to divide the space clearly. However, by using the RBF network we have obtained the good separation shown in Fig. 2. The first 15 cases are genuine signatures and all the others are forgeries. If we put the threshold at 0.4, the errors of the first kind are 3 out of 15 and the errors of the second kind are 3 out of 135. This is a fairly good result.
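The thresholding described above can be expressed in a few lines; the sketch below simply counts the two kinds of errors for a vector of RBF outputs ordered as in Fig. 2 (genuine samples first). The threshold of 0.4 follows the text; everything else is our own illustration.

```python
import numpy as np

def error_rates(outputs, n_genuine, threshold=0.4):
    """Errors of the first kind (genuine rejected) and of the second kind
    (forgery accepted), assuming the first `n_genuine` outputs are genuine
    and the rest are forgeries.
    """
    outputs = np.asarray(outputs, dtype=float)
    genuine, forgery = outputs[:n_genuine], outputs[n_genuine:]
    first_kind = int((genuine < threshold).sum())     # false rejections
    second_kind = int((forgery >= threshold).sum())   # false acceptances
    return first_kind, second_kind
```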
Fig. 1. Ranges of the criteria
Fig. 2. Output of RBF network for test data
6 Conclusions
The performance of the algorithm should improve if the matching procedure is applied more times. We also have to apply the algorithm with other persons' data as the template. A further comparison is left for future work.
References
[1] L. L. Lee, T. Berger and E. Aviczer, "Reliable On-line Human Signature Verification Systems," IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 18, pp. 643-647, 1996.
[2] M. L. Minsky and S. A. Papert, Perceptrons – Expanded Edition, MIT Press, 1969.
[3] V. S. Nalwa, "Automatic On-line Signature Verification," Proceedings of the IEEE, Vol. 85, pp. 215-239, 1997.
[4] R. Plamondon and S. N. Srihari, "On-line and Off-line Handwriting Recognition: A Comprehensive Survey," IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 22, pp. 63-84, 2000.
[5] M. Tanaka, Y. Ishino, H. Shimada and T. Inoue, "DP Matching Using Kalman Filter as Pre-processing in On-line Signature Verification," 8th International Workshop on Frontiers in Handwriting Recognition, pp. 502-507, Niagara-on-the-Lake, Ontario, Canada, August 6-8, 2002.
[6] M. Tanaka, Y. Ishino, H. Shimada and T. Inoue, "Dynamical Scaling in Online Hand-written Characters' Matching," 9th International Conference on Neural Information Processing, Singapore, November 19-22, 5 pages (CD-ROM), 2002.
[7] M. Tanaka, Y. Ishino, H. Shimada, T. Inoue and A. Bargiela, "Analysis of Iterative Procedure of Matching Two Drawings by DP Matching and Estimation of Time-Variant Transformation Parameters," The 34th International Symposium on Stochastic Systems Theory and Its Applications, accepted.
On the Accuracy of Rotation Invariant Wavelet-Based Moments Applied to Recognize Traditional Thai Musical Instruments Sittisak Rodtook and Stanislav Makhanov Information Technology Program, Sirindhorn International Institute of Technology, Thammasat University, Pathumthani 12121, Thailand {sittisak,makhanov}@siit.tu.ac.th
Abstract. Rotation invariant moments constitute an important technique applicable to a wide variety of pattern recognition tasks. However, although the moment descriptors are invariant with regard to spatial transformations, in practice the spatial transformations themselves affect the invariance. This phenomenon jeopardizes the quality of pattern recognition. Therefore, this paper presents an experimental analysis of the accuracy and the efficiency of discrimination under the impact of rotation. We evaluate experimentally the behavior of the noise induced by rotation versus the most popular wavelet-based basis functions. As an example, we consider a particular but interesting case: traditional Thai musical instruments. Finally, we present a semi-heuristic pre-computing technique to construct a set of descriptors suitable for discrimination under the impact of spatial transformations.
1 Introduction
It has been very well documented that the performance of pattern recognition may critically depend on whether the employed descriptors are invariant with respect to spatial transformations. A popular class of invariant shape descriptors is based on moment techniques, first introduced by Hu [1]. However, a dramatic increase in complexity with increasing order makes Hu's moments impractical. Shortly after Hu's paper, a variety of invariant moment-based techniques designed to recognize moving objects were proposed and analyzed [2]-[6]. The major developments of the rotational invariant moment-based methods are the orthogonal Zernike, orthogonal Fourier-Mellin and complex moments. Finally, Shen [3] introduced the wavelet moments and showed that the multi-resolution property of wavelets makes it possible to construct adaptable moment descriptors best suited for a particular set of objects. Sensitivity of the moment descriptors to image noise has been repeatedly mentioned in the literature. An interesting consequence of this is that the moment descriptors are invariant only when they are computed from the ideal analog images. Even in the absence of noise induced by physical devices, there
always exists noise due to the spatial transformations. Therefore, although the moment descriptors are invariant with regard to spatial transformations, in practice the spatial transformations themselves affect the invariance. This phenomenon could seriously jeopardize the quality of pattern recognition. Therefore, this paper considers the accuracy and discriminative properties of wavelet-based moment descriptors under the impact of rotation. We perform experiments on an interesting case, Thai musical instruments; the impact of the rotation transforms is profound due to the elongated shape of the instruments. We analyze the range of the errors and the accuracy. Finally, we present a combination of a variance-based procedure with a semi-heuristic pre-computing technique to construct appropriate wavelet-based descriptors.
2 Rotationally Invariant Wavelet-Based Moment
A general rotation invariant moment M_Order^Type with respect to a moment function F_Order^Type(r, θ), defined in polar coordinates (with the origin at the centroid of the object), is given by

  M_Order^Type = ∫_0^{2π} ∫_0^1 f(r, θ) F_Order^Type(r, θ) r dr dθ.

We assume that F_Order^Type(r, θ) = R_Ω^Type(r) G_α(θ), where R_Ω^Type(r) denotes the radial moment function and G_α(θ) the angular moment function. Note that the angular function defined by G_α(θ) = e^{iαθ} provides rotational invariance since it leads to the circular Fourier transform. Therefore |M_Order^Type| = |MR_Order^Type| [3], where MR_Order^Type is a moment of the object image rotated by the angle φ. In words, rotation of the object affects the phase but not the magnitude.
In order to compute the rotation invariant moments of a given image, we first calculate the centroid of the object image. Next, we define the polar coordinates and map the object onto the parametric unit circle. Finally, we represent the integral by

  M_Order^Type = ∫_0^{2π} ∫_0^1 f(r, θ) F_Order^Type(r, θ) r dr dθ = ∫_0^1 R_Ω^Type(r) S_α(r) r dr,

where S_α(r) = ∫_0^{2π} f(r, θ) G_α(θ) dθ. The continuous image function is obtained by standard bilinear interpolation applied to the discrete image. Finally, not only |M_Order^Type| is rotation invariant; ∫_0^1 R_Ω^Type(r) |S_α(r)| r dr and |∫_0^1 R_Ω^Type(r) S_α(r) r dr| are rotation invariant as well.
It is plain that, from the viewpoint of functional analysis, each object is represented by an infinite and unique set of rotational invariants. In other words, if the set R_Ω^Type(r) constitutes a basis in L2[0, 1], then S_α(r) can be represented with a prescribed accuracy. However, in practice we always have a finite set of moment descriptors affected by noise. Wavelets, being well localized in time and frequency, are an efficient tool to construct appropriate moment descriptors. Consider a wavelet-based radius function given by R_{m,n}(r) = (1/√m) ψ((r − n)/m), where ψ(r) is the mother wavelet, m the dilation parameter (the scale index) and n the shifting parameter. Unser et al. show the usefulness of the biorthogonal spline wavelet transform for texture analysis [7]. In this paper we apply the cubic B-spline to construct the appropriate moment descriptors. In the case of a B-spline the mother wavelet is given by

  ψ(r) = (4 a^{k+1} / √(2π(k+1))) σ_w cos(2π f0 (2r − 1)) exp(−(2r − 1)² / (2 σ_w² (k+1))),

where k = 3, a = 0.7, f0 = 0.41 and σ_w² = 0.56. The basis functions are given by

  ψ_{m,n}(r) = 2^{m/2} ψ(2^m r − 0.5 n).
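To make the definitions concrete, the following Python sketch evaluates |M_{m,n,α}| by simple Riemann-sum quadrature, using the B-spline mother wavelet as reconstructed above. The sampling resolution, the quadrature rule and the interface f_polar(r, θ) (a user-supplied continuous image function on the unit disk, e.g. a bilinear interpolant of the binarised silhouette) are our own assumptions, not the authors' implementation.

```python
import numpy as np

def spline_mother_wavelet(r, k=3, a=0.7, f0=0.41, sw2=0.56):
    """Cubic B-spline (Gaussian-approximation) mother wavelet, with the
    constants quoted in the text."""
    return (4.0 * a ** (k + 1) / np.sqrt(2.0 * np.pi * (k + 1))) * np.sqrt(sw2) \
        * np.cos(2.0 * np.pi * f0 * (2.0 * r - 1.0)) \
        * np.exp(-(2.0 * r - 1.0) ** 2 / (2.0 * sw2 * (k + 1)))

def wavelet_moment(f_polar, m, n, alpha, n_r=128, n_t=256):
    """|M_{m,n,alpha}| for an image given as f_polar(r, theta) on the unit
    disk centred at the object centroid."""
    r = (np.arange(n_r) + 0.5) / n_r              # radial sample points in (0, 1)
    t = 2.0 * np.pi * np.arange(n_t) / n_t        # angular sample points
    dr, dt = 1.0 / n_r, 2.0 * np.pi / n_t
    F = np.array([[f_polar(ri, ti) for ti in t] for ri in r])
    S = (F * np.exp(1j * alpha * t)[None, :]).sum(axis=1) * dt   # S_alpha(r)
    psi = 2.0 ** (m / 2.0) * spline_mother_wavelet(2.0 ** m * r - 0.5 * n)
    return abs((psi * S * r).sum() * dr)
```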
3 Errors of the Moment Descriptors Applied to Thai Musical Instruments
We analyze the accuracy of the wavelet-based moments as applied to the Thai musical instrument images in the presence of the geometric errors induced by the rotational transforms and the subsequent binarization. Photographs of the instruments are rotated through 360° with an increment of 5° (see Fig. 1) using Adobe Photoshop. In order to eliminate the accumulation of errors due to multiple re-sampling, each rotation has been performed by rotating the original photograph corresponding to 0°. Fig. 2 shows a typical impact of rotation in the case of the spline wavelet moment. We evaluate the accuracy by measuring the standard deviation of the normalized spline moment descriptor. The error varies from 0.0 to 27.81%, with the maximum error produced by the "SUENG" rotated by 50° for |M0,1,5|. The wavelets make it possible to control not only the spatial position of the basis function but the frequency range as well. However, without an appropriate adaptation, the rotation may drastically affect a wavelet descriptor. We analyze the accuracy of the spline wavelet descriptors |Mm,n,α|, where α is the angular order. The comparison of the accuracy versus the position and the main frequency of the wavelet basis function shows that the maximum error is only 7.21% for a 130° rotated "SAW OU" with |M1,2,5|. Note that the extrema of ψ1,2(r) and the extrema of |S5(r)| almost coincide, which results in a large numerical value of the moment. Furthermore, Fig. 3 indicates the best wavelet for particular frequencies. Fig. 4 shows why the wavelets ψ1,4(r), ψ2,1(r), ψ2,4(r) and ψ3,1(r) produce poor results: the noise either has been substantially magnified at various positions
Fig. 1. Gray-scale photos and Silhouettes of the Thai musical instruments (a). ”SUENG”, Lute. (b). ”SAW SAM SAI”, Fiddle. (c). ”PEE CHAWA”, Oboe. (d). ”PEE NOKE”, Pipe. (e). ”SAW DUANG”, Fiddle. (f). ”SAW OU”, Fiddle
by ψ2,1(r) and ψ2,4(r), or has been "washed out" along with the peak itself by ψ1,4(r) or ψ3,1(r). In other words, although ψ1,4(r) and ψ3,1(r) eliminate the noise, they "wash out" information about the object as well. Such functions are easily detected by the energy threshold (see, for instance, [9]). Unfortunately, the rotation noise in the frequency domain often appears at low frequencies, which also characterize the signal. The rotated object produces a shifted version of S_α(r); in other words, the noise often "replicates" S_α(r). That is why it is difficult to construct a conventional filter in this case.

Fig. 2. Impact of rotation of |S5(r)|
Table 1. The standard deviation calculated for the normalized spline wavelet moment magnitude |Mm,n,5| corresponding to "SAW OU"

  m     n=0      n=1      n=2      n=3      n=4
  0     0.0800   0.0255   0.0221   0.0231   0.0225
  1     0.0438   0.0213   0.0201   0.0455   0.0679
  2     0.0450   0.0714   0.0270   0.0230   0.0681
  3     0.1093   0.0526   0.1053   0.1009   0.1534
Fig. 3. ψm,n(r) versus |S5(r)|
Fig. 4. Rotation noise. (a) |S5(r)| (solid line: the original "SAW OU"; dash-dotted line: 335° rotated "SAW OU"). (b)-(f) Wavelets versus |S5(r)|. (g)-(k) Normalized |ψ(r)S5(r)r|. (b),(g) ψ1,2(r); (c),(h) ψ1,4(r); (d),(i) ψ2,1(r); (e),(j) ψ2,4(r); (f),(k) ψ3,1(r)
4 Pre-Computing Techniques to Classify the Wavelet-Based Descriptors Applied to the Thai Musical Instruments
Our experiments reveal that it is difficult to find a single moment descriptor that provides a representation which is both accurate and discriminative. The most accurate moment descriptors may be different for different instruments, and it is not always possible to find one small set of wavelet basis functions suitable to represent all the angular moments. Note that in the case of dissimilar, unsymmetrical objects, a moment descriptor suitable for discrimination can be derived with α = 0. However, if the objects have similar shapes, the zero angular order is not sufficient. Moreover, it is difficult to decide which angular order is the most representative, since different angular orders magnify different frequencies of the noise. We therefore find the best discriminative moment descriptor for each good angular order and construct an appropriate vector of wavelet-based moment descriptors (such as (|M_{m1,n1,q1}|, |M_{m2,n2,q2}|, ..., |M_{mk,nk,qk}|)). Furthermore, given the magnitudes of the wavelet-based moments, we apply the variance-based classification technique (Otsu's algorithm) introduced in [8], which uses the least inner-class / largest inter-class variance ratio as the discriminative measure of the descriptors.
Fig. 5. (a)-(b) |FT(S5(r))| (†: the original "SAW OU"; ∞: 335° rotated "SAW OU"). (c) |FT(N)|, N: the spatial noise in the frequency domain

Next, we propose a concatenation of the above variance-based procedure with the pre-computing technique as follows (a small numerical sketch of the criterion in step 1 is given after the list):
1. Pre-compute an appropriate set of angular orders α = {q1, q2, ..., qn} by considering the least circular Fourier transform square error for each ql,

   ε_{ql} = (1/(M N̄)) Σ_{i=1}^{M} Σ_{j=1}^{N̄} ( Σ_{k=1}^{N} diff_{i,j}(r_k)² Δr ),  Δr = 1/N,  r_k = kΔr,  k = 0, 1, 2, ..., N,

   where diff_{i,j}(r_k) = S_{ql}(r_k)_{i,Original} − S_{ql}(r_k)_{i,j}. Here N denotes the number of points employed for numerical integration, N̄ the number of rotations and M the number of instruments.
2. Select a set of wavelet basis functions ψ_{m,n}(r). The functions must belong to a basis in L2[0, 1].
3. Check the following condition: if |S_α(r)| is large at r, then at least one of the basis functions ψ must be large at r. The condition could be replaced by ∪_i {ψ_i > ε} = [0, 1] for some ε.
4. For each angular order and for each musical instrument, find the set of moment descriptors having the best normalized standard deviation.
5. Threshold the wavelet-based moment descriptors by the energy [9].
6. Collect the best descriptors.
7. Apply the variance-based techniques to the set of descriptors for each angular order.
These techniques, applied to the case of the Thai musical instruments, make it possible to decrease the computational time by 10-15%. The discriminative measures in Table 2 demonstrate that the B-spline wavelet moment descriptors are well separated. The discriminative measure averages of the three best discriminative descriptors |M1,2,5|, |M1,0,5| and |M1,3,5| at α = 5 are 0.0726, 0.0915 and 0.1163, respectively. The energy threshold is 2.0.
5 Conclusion
Rotations produce noise which could substantially affect the quality of descriptors and their discriminative properties. Wavelets constitute a suitable class
Table 2. The appropriate spline wavelet descriptors for discrimination and discriminative measures at α = 5

            SUENG             SAM SAI           CHA WA            NOKE              DUANG
  SAM SAI   |M1,0,5| 0.0039
  CHA WA    |M1,0,5| 0.0035   |M0,1,5| 0.0149
  NOKE      |M1,2,5| 0.0013   |M1,2,5| 0.0039   |M1,2,5| 0.0043
  DUANG     |M0,1,5| 0.0031   |M1,3,5| 0.0054   |M1,3,5| 0.0039   |M1,0,5| 0.0054
  OU        |M1,0,5| 0.0043   |M1,3,5| 0.0121   |M0,3,5| 0.0064   |M1,2,5| 0.0023   |M1,2,5| 0.0030
of basis functions to perform recognition under the impact of rotations. The proposed algorithm, based on the standard deviation and the energy combined with the variance-based procedure, makes it possible to efficiently construct a set of descriptors suitable for discrimination under the impact of rotation. The best discrimination properties for the set of Thai musical instruments are displayed by {|M0,1,1|, |M1,2,3|, |M1,2,5|, |M0,1,6|, |M1,1,7|}, with the appropriate angular orders α = {1, 3, 5, 6, 7}.
References
[1] Hu, M. K.: Visual pattern recognition by moment invariants. IRE Trans. Inform. Theory, 8 (1962) 179-187
[2] Liao, S. X., Pawlak, M.: On the Accuracy of Zernike Moments for Image Analysis. IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 20, 12 (1998) 1358-1364
[3] Shen, D., Ip, H. H.: Discriminative Wavelet Shape Descriptors for Recognition of 2-D Patterns. Pattern Recognition, 32 (1999) 151-165
[4] Kan, C., Srinath, M. D.: Orthogonal Fourier-Mellin moments with centroid bounding circle scaling for invariant character recognition. 1st IEEE Conf. on Electro Information Technology, Chicago, Illinois (2000)
[5] Kan, C., Srinath, M. D.: Invariant Character Recognition with Zernike and Orthogonal Fourier-Mellin Moments. Pattern Recognition, 35 (2002) 143-154
[6] Flusser, J.: On the inverse problem of rotation moment invariants. Pattern Recognition, 35 (2002) 3015-3017
[7] Unser, M., Aldroubi, A., Eden, M.: A family of polynomial spline wavelet transforms. Signal Process., 30 (1993) 141-162
[8] Otsu, N.: A Threshold Selection Method from Gray Level Histograms. IEEE Trans. on Systems, Man, and Cybernetics, SMC-9 (1985) 377-393
[9] Thuillard, M.: Wavelets in Soft Computing. Series in Robotics and Intelligent Systems, 25, World Scientific, London, 84-85
A Multi-agent System for Knowledge Management in Software Maintenance Aurora Vizcaino1, Jesús Favela 2, and Mario Piattini1 1
Escuela Superior de Informática, Universidad de Castilla-La Mancha {avizcain,mpiattini}@inf-cr.uclm.es 2 CICESE, Ensenada, México [email protected]
Abstract. Knowledge management has become an important topic as organisations wish to take advantage of the information that they produce and that can be brought to bear on present decisions. This work describes a system to manage the information and knowledge generated during the software maintenance process, which consumes a large part of the software lifecycle costs. The architecture of the system is formed from a set of agent communities, each of which is in charge of managing a specific type of knowledge. The agents can learn from previous experience and share their knowledge with other agents or communities. Keywords: Multi-agent systems, Knowledge Management, Software Maintenance
1 Introduction
Knowledge is a crucial resource for organizations: it allows them to fulfil their mission and become more competitive. For this reason, companies are currently researching techniques and methods to manage their knowledge systematically. In fact, nearly 80% of companies worldwide have some knowledge management efforts under way [5]. Organizations have different types of knowledge that are often related to each other and which must be managed in a consistent way. For instance, software engineering involves the integration of various knowledge sources that are constantly changing. The management of this knowledge and how it can be applied to software development and maintenance efforts has received little attention from the software engineering research community so far [3]. Tools and techniques are necessary to capture and process knowledge in order to facilitate subsequent development and maintenance efforts. This is particularly true for software maintenance, a knowledge intensive activity that depends on information generated over long periods of time and by large numbers of people, many of whom may no longer be in the organisation. This paper presents a multi-agent system (KM-MANTIS) in charge of managing
the knowledge that is produced during software maintenance. The rest of this paper is organized as follows: Section 2 presents the motivation for using a Knowledge Management (KM) system to support software maintenance; Section 3 describes the implementation of KM-MANTIS; finally, conclusions are presented in Section 4.
2 Advantages of Using a KM System in Software Maintenance
Software maintenance consumes a large part of overall lifecycle costs [2, 8], and the incapacity to change software quickly and reliably causes organizations to lose business opportunities. Thus, in recent years we have seen an important increase in research directed towards addressing these issues. On the other hand, software maintenance is a knowledge intensive activity. This knowledge comes not only from the expertise of the professionals involved in the process; it is also intrinsic to the product being maintained, to the reasons that motivate maintenance (new requirements, user complaints, etc.), and to the processes, methodologies and tools used in the organization. Moreover, the diverse types of knowledge are produced at different stages of the maintenance process (MP). Software maintenance activities are generally undertaken by groups of people, and each person has partial information that is necessary to other members of the group. If the knowledge exists only in the software engineers and there is no system in charge of transferring this tacit knowledge (contained in the employees) to explicit knowledge (stored on paper, in files, etc.), then when an employee leaves the organization a significant part of the intellectual capital goes with him/her. Another well-known issue that complicates the MP is the scarce documentation related to a specific software system; even if detailed documentation was produced when the original system was developed, it is seldom updated as the system evolves. For example, legacy software written by other units often has little or no documentation describing its features. By using a KM system, the diverse kinds of knowledge generated may be stored and shared. Moreover, new knowledge can be produced, obtaining maximum benefit from the current information. By reusing information and producing relevant knowledge, the high costs associated with software maintenance could also be decreased [4]. Another advantage of KM systems is that they help employees build a shared vision, since the same codification is used and misunderstandings in staff communications may be avoided. Several studies have shown that a shared vision may hold together a loosely coupled system and promote the integration of an entire organisation [7].
3 A Multi-agent System to Manage Knowledge in Software Maintenance
The issues explained above motivated us to design a KM system to capture, manage, and disseminate knowledge in a software maintenance organisation, thus increasing
the workers' expertise, the organisation's knowledge and its competitiveness, while decreasing the costs associated with the software MP. KM-MANTIS is a multi-agent system where different types of agents manage the diverse types of information generated during the software maintenance process (SMP). Agents interchange data and take advantage of the information and experience acquired by other agents. In order to foster the interchange of information, the system uses an open format, XMI (XML Metadata Interchange) [6], to store data and metadata. This is an important advantage of the system, since data and metadata defined with other tools that support XMI can also be managed by KM-MANTIS; it also facilitates the interchange of information between agents, since they all use the same information representation.
3.1 The KM-MANTIS Architecture
The system is formed of a set of agent communities which manage different types of knowledge. There are several reasons why agents are suitable for managing knowledge. First of all, agents are proactive: they decide to act when they find it necessary to do so. Moreover, agents can manage both distributed and local information. This is an important feature, since software maintenance information is generated by different sources and often from different places. A related aspect is that agents may cooperate and interchange information: each agent can share its knowledge with others or ask them for advice, benefiting from the other agents' experience. Therefore, there is reuse and knowledge management in the architecture of the system itself. Another important issue is that agents can learn from their own experience. Consequently, the system is expected to become more efficient with time, since the agents learn from their previous mistakes and successes. On the other hand, each agent may utilize different reasoning techniques depending on the situation. For instance, an agent can use an ID3 algorithm to learn from previous experiences and use case-based reasoning to advise a client on how to solve a problem. The rationale for designing KM-MANTIS with several communities is that during the software MP different types of information are produced, each with its own specific features. The types of information identified were: information related to the products to be maintained; information related to the activities to be performed in order to maintain the products; and the peopleware involved during software maintenance [10]. Therefore, KM-MANTIS has three communities: the "products community", the "activities community" and the "peopleware community". In what follows we describe each community in more detail. Products Community: This community manages the information related to the products to be maintained. Since each product has its own features and follows a specific evolution, this community has one agent per product. The agents have information about the initial requirements, changes made to the product, and metrics that evaluate features related to the maintainability of the product (this
information is obtained from different documents such as modification requests (see Figure 1), perfective, corrective or preventive actions performed, or product measurements). Therefore, the agents monitor the product's evolution in order to have up-to-date information about it at each moment. Each time an agent detects that information about its product is being introduced or modified in KM-MANTIS (the agent detects this when the identification number of the application that it represents is introduced or displayed in the interface of KM-MANTIS), the agent analyses the new information, comparing it with that previously held in order to detect inconsistencies, or checking the differences and storing the relevant information so as to stay up to date. Information relevant to each product (data) is stored in an XMI repository. The XMI repository also stores rules (knowledge) produced by the agents through induction and decision-tree-based algorithms. The decision to use XMI documents based on the MOF (Meta Object Facility) standard makes it possible for agents to have access to the different levels of information and knowledge that they need to process and classify their information and the queries that they receive. Activities Community: Each new change demanded implies performing one or more activities. This community, which has one agent per activity, is in charge of managing the knowledge related to the different activities, including the methods, techniques and resources used to perform an activity.
Fig. 1. KM-MANTIS Interface
Activities agents can also obtain new knowledge from their own experience or by being taught. For instance, an activity agent learns which person usually carries out a specific activity or which techniques are most often used to perform an activity. Furthermore, activities agents use case-based reasoning techniques in order to detect whether a similar change was previously requested under analogous circumstances. When this is the case, the agent informs the users how that problem was previously solved, taking advantage of the organization's experience. Peopleware Community: Three profiles of people can be clearly differentiated in the MP [9]: the maintenance engineer, the customer and the user. The peopleware community has three types of agent, one per profile. One agent is in charge of the information related to the staff (maintainers): the staff agent. Another manages information related to the clients (customers) and is called the client agent. The last one is in charge of the users and is termed the user agent. The staff agent monitors the personal data of the employees, the activities they have worked on, and the products they have maintained. Of course, the agent also has current information about each member of the staff, so it knows where each person is working at each moment. The agent can utilise the information that it has to generate new knowledge; for instance, the staff agent calculates statistics to estimate the performance of an employee. The client agent stores the information of each client, their requirements (including the initial requirements, if they are available) and their requests. The client agent also tries to gather new knowledge: for instance, it tries to guess future requirements based on previous requirements, or it estimates the costs of changes that the client wants to make, warning him, for instance, of the high costs associated with a specific change request. The user agent is in charge of knowing the requirements of the users of each product, their background, and also their complaints and comments about the products. New knowledge can also be generated from this information, for example by testing to what degree the users' characteristics influence the maintenance of the product. Each type of agent has a database containing the information that it needs. In this case there is no community repository because there is no data common to the three types of agents.
3.2 Implementation Considerations
In order to manage the XML documents, different middleware alternatives were studied: object-relational databases such as Oracle 9 and Microsoft SQL Server 2002, and Tamino, which has been designed specifically for XML documents. Finally Tamino, a Software AG product, was chosen, because KM-MANTIS needs to store a huge amount of XML documents and manage them efficiently, and the fact that an object-relational database needs to translate XML to tables and vice versa considerably reduces its efficiency. On the other hand, the platform chosen for creating the multi-agent system is JADE [1], which is FIPA compliant; the agent communication language is FIPA ACL. Agents interchange information in order to take advantage of the knowledge that
others have, so the architecture itself performs reuse of information and knowledge.
4 Conclusions
Software maintenance is one of the most important stages of the software life cycle. This process consumes considerable time, effort and cost, and it generates a huge amount of different kinds of knowledge that must be suitably managed. This paper describes a multi-agent system in charge of managing this knowledge in order to improve the MP. The system has different types of agents in order to deal with the different types of information produced during the SMP. Agents generate new knowledge and take advantage of the organization's experience. In order to facilitate the management of data, metadata and knowledge, XMI repositories have been used. XMI uses the MOF standard, which enables the description of information at different levels of abstraction.
Acknowledgements This work is partially supported by the TAMANSI project (grant number PBC-02001) financed by the Consejería de Ciencia y Tecnología of the Junta de Comunidades de Castilla-La Mancha.
References
[1] Bellifemine, F., Poggi, A., and Rimassa, G. (2001). Developing multi-agent systems with a FIPA-compliant agent framework. Software Practice & Experience, 31: 103-128.
[2] Bennett, K.H., and Rajlich, V.T. (2000). Software Maintenance and Evolution: a Roadmap. In Finkelstein, A. (Ed.), The Future of Software Engineering, ICSE 2000, June 4-11, Limerick, Ireland, pp 75-87.
[3] Henninger, S., and Schlabach, J. (2001). A Tool for Managing Software Development Knowledge. 3rd International Conf. on Product Focused Software Process Improvement, PROFES 2001, Lecture Notes in Computer Science, Kaiserslautern, Germany, pp 182-195.
[4] de Looff, L.A., Information Systems Outsourcing Decision Making: a Managerial Approach. Hershey, PA: Idea Group Publishing, 1990.
[5] Mesenzani, M., Schael, T., and Alblino, S. (2002). Multimedia platform to support knowledge processes anywhere and anytime. In Knowledge-Based Intelligent Information Engineering Systems & Allied Technologies (KES 2002), Damiani, E., Howlett, R.J., Jain, L.C., Ichalkaranje, N. (Eds.), pp 1434-1435.
[6] OMG: XML Metadata Interchange (XMI), v. 1.1, Nov. 2000.
[7] Orton, J.D., and Weick, K.E. (1990). Loosely coupled systems: A reconceptualization. Academy of Management Review, 15(2), pp 203-223.
[8] Pigoski, T.M. (1997). Practical Software Maintenance. Best Practices for Managing Your Investment. John Wiley & Sons, USA, 1997.
[9] Polo, M., Piattini, M., Ruiz, F. and Calero, C. (1999). Roles in the Maintenance Process. ACM Software Engineering Notes, vol. 24, no. 4, 84-86.
[10] Vizcaíno, A., Ruíz, F., Favela, J., and Piattini, M. (2002). A Multi-Agent Architecture for Knowledge Management in Software Maintenance. In Proceedings of the International Workshop on Practical Applications of Agents and Multiagent Systems (IWPAAMS'02), Salamanca, Spain, 23-25 October, 39-52.
D²G²A: A Distributed Double Guided Genetic Algorithm for Max_CSPs Sadok Bouamama, Boutheina Jlifi, and Khaled Ghédira SOI²E (ex URIASIS) SOI²E/ISG/Université Tunis, B. 204, Département d'Informatique 41 rue de la liberté, 2000 cité Bouchoucha. Tunisie [email protected] [email protected] [email protected]
Abstract. Inspired by the distributed guided genetic algorithm (DGGA), D²G²A is a new multi-agent approach which addresses Maximal Constraint Satisfaction Problems (Max-CSPs). On the one hand, GA efficiency provides good solution quality for Max_CSPs; on the other hand, the approach benefits from multi-agent principles that reduce the GA's temporal complexity. In addition, the approach is enhanced by a new parameter called the guidance operator, which allows not only diversification but also escape from local optima. D²G²A and DGGA have been applied to a number of randomly generated Max_CSPs. In order to show the advantages of D²G²A, an experimental comparison is provided. The guidance operator is also studied experimentally in order to determine its best value. Keywords: Maximal constraint satisfaction problems, multi-agent systems, genetic algorithms, min-conflict heuristic, guidance operator
1 Introduction
The CSP formalism consists of variables associated with domains and constraints involving subsets of these variables. A CSP solution is an instantiation of all variables with values from their respective domains; the instantiation must satisfy all constraints. A CSP solution, as defined above, is costly to obtain and does not necessarily exist for every problem. In such cases, one had better search for an instantiation of all variables that satisfies the maximal number of constraints. Such problems, called Maximal CSPs and referred to as Max_CSPs, make up the framework of this paper. Max_CSPs have been dealt with by complete or incomplete methods. The former are able to provide an optimal solution; unfortunately, the combinatorial explosion thwarts this advantage. The latter, such as Genetic Algorithms [4], have the property of avoiding the trap of local optima, but they sacrifice completeness for efficiency. There is another incomplete but distributed method
known as Distributed Simulated Annealing (DSA), which has been successfully applied to Max_CSPs [2]. As DSA outperforms centralized Simulated Annealing in terms of optimality and quality, the same idea was adopted for Centralized Genetic Algorithms (CGAs), which are especially known to be expensive; the result was the Distributed Guided Genetic Algorithm (DGGA) for Max_CSPs [3]. This paper aims to enhance the DGGA in order to escape from local optima and to obtain better search diversification. The paper is organized as follows: the next section presents the Distributed Double Guided Genetic Algorithm, namely its context and motivations, basic concepts and global dynamic; the third section details both the experimental design and the results; finally, concluding remarks and possible extensions of this work are proposed.
2 Distributed Double Guided Genetic Algorithm

2.1 Context and Motivations
The DGGA cannot be considered a classic GA. In fact, in classic GAs the mutation aims to diversify the population and thus to avoid its degeneration [4]. In the DGGA, the mutation operator is used improperly, since it is treated as a betterment operator for the considered chromosome. However, if a gene value is absent from the population, there is no means to obtain it by the cross-over process. Thus it is sometimes necessary to have a random mutation in order to generate the possibly missing gene values. The DGGA approach is a local search method, and the first known improvement mechanism of local search is the diversification of the search process in order to escape from local optima [6]. No doubt the simplest mechanism to diversify the search is to introduce a noise component into the process: the search process executes a random movement with probability p and follows the normal process with probability 1−p [5]. In figure 1, an example of a local optimum's attraction basin is introduced; in a maximization case, S2, which is a local maximum, is better than S1. Passing through S1 from S2 is considered a solution destruction, but it gives the search process more chance to reach S3, the global optimum. For all these reasons, the newly proposed approach is a DGGA enhanced by a randomness-providing operator, called the guidance probability Pguid. Thus the D²G²A possesses, in addition to the cross-over and mutation operators, the generation number and the initial-population size, a guidance operator. This approach conserves the same basic principles and the same agent structure used in the DGGA approach [3].
Fig. 1. An example of attraction basin of local optima
1. m ← getMsg (mailBox)
2. case m of
3. optimization-process (sub-population):
4.   apply-behavior (sub-population)
5. take-into-account (chromosome):
6.   population-pool ← population-pool ∪ {chromosome}
7. inform-new-agent (Speciesnvc):
8.   list-acquaintances ← list-acquaintances ∪ {Speciesnvc}
9. stop-process: stop-behavior

Fig. 2. Message processing relative to Speciesnvc
2.2 Global Dynamic
The Interface agent randomly generates the initial population and then partitions it into sub-populations according to their specificities. It then creates Species agents, to which it assigns the corresponding sub-populations, and asks these Species agents to perform their optimization processes (figure 2 line 3). Before starting its optimization process, i.e. its behavior (figure 3), each Species agent Speciesn initializes all templates corresponding to its chromosomes (figure 3 line 3). After that it carries out its genetic process on its initial sub-population, i.e. the sub-population that the Interface agent associated to it at the beginning. This process, which will be detailed in the following subsection, returns a sub-population "pop" (figure 3 line 4) that has been submitted to the crossing and mutating steps only once, i.e. corresponding to one generation. For each chromosome of pop, Speciesn computes the number of violated constraints "nvc" (figure 3 line 6). Consequently, two cases may occur. The first one corresponds to a chromosome violating the same number of constraints as its parents; in this case, the chromosome replaces one of the latter, randomly chosen (figure 3 line 8). The second case is that this number nvc is different from n, i.e. the specificity of the corresponding Speciesn; the chromosome is then sent to another agent Speciesnvc (figure 3 line 10) if such an agent already exists, otherwise it is sent to the Interface agent (figure 3 line 11). The latter creates a new agent having nvc as specificity and transmits the quoted chromosome to it. Whenever a new Species agent is created, the Interface agent informs all the other agents about this creation (figure 2 line 7) and then asks the new Species agent to perform its optimization process (figure 2 line 3). Note that message processing is given priority: whenever an agent receives a message, it stops its behavior, saves the context, updates its local knowledge, and restores the context before resuming its behavior. If no Species agent has found a chromosome violating zero constraints by the end of its behavior, the agents successively transmit one of their randomly chosen chromosomes, linked to its specificity, to the Interface agent. The latter determines and displays the best chromosome, namely the one which violates the minimal number of constraints. Here we describe the syntax used in the figures: sendMsg (sender, receiver, 'message'): 'message' is sent by "sender" to "receiver"; getMsg (mailBox): retrieves the first message in mailBox.
Apply-behavior (initial-population)
1. init-local-knowledge
2. for i := 1 to number-of-generations do
3.   template-updating (initial-population)
4.   pop ← genetic-process (initial-population)
5.   for each chromosome in pop do
6.     nvc ← compute-violated-constraints (chromosome)
7.     if (nvc = n)
8.     then replace-by (chromosome)
9.     else if exist-ag (Speciesnvc)
10.    then sendMsg (Speciesn, Speciesnvc, 'take-into-account (chromosome)')
11.    else sendMsg (Speciesn, Interface, 'create-agent (chromosome)')
12. sendMsg (Speciesn, Interface, 'result (one-chromosome, specificity)')
Fig. 3. Behavior relative to Speciesn
genetic-process
1. mating-pool ← matching (population-pool)
2. template-updating (mating-pool)
3. offspring-pool-crossed ← crossing (mating-pool)
4. offspring-pool-mutated ← mutating (offspring-pool-crossed)
5. return offspring-pool-mutated
Fig. 4. Genetic process
Crossing (mating-pool)
  if (mating-pool size < 2) then return mating-pool
  for each pair in mating-pool do
    if (random [0,1] < Pcross)
    then offspring ← cross-over (first-pair, second-pair)
         nvc ← compute-violated-constraints (offspring)
         if (nvc = 0)
         then sendMsg (Speciesn, Interface, 'Stop-process (offspring)')
         else offspring-pool ← offspring-pool ∪ {offspring}
  return offspring-pool

Fig. 5. Crossing process relative to Speciesn

cross-over (chromosomei1, chromosomei2)
1. for j := 1 to size (chromosomei1) do
2.   sum ← templatei1,j + templatei2,j
3.   if (random-integer [0, sum − 1] < templatei1,j)
4.   then genei3,j ← genei2,j
5.   else genei3,j ← genei1,j
6. return chromosomei3

Fig. 6. Cross-over operator
Guided_Mutation (chromosomei)
1. min-conflict-heuristic (chromosomei)
2. return chromosomei

Fig. 7. Guided Mutation relative to chromosomei

Random_Mutation (chromosomei)
1. choose randomly a genei,j
2. choose randomly a value vi in domain of genei,j
3. value(genei,j) ← vi
4. return chromosomei

Fig. 8. Random Mutation relative to chromosomei

mutating (offspring-pool)
1. for each chromosome in offspring-pool do
2.   if (random [0,1] < Pmut)
3.     if (random [0,1] < Pguid)
4.     then Guided_Mutation (chromosomei)
5.     else Random_Mutation (chromosomei)
6.   nvc* ← violated_constraints_number (chromosomei)
7.   if (nvc* = 0)
8.   then sendMsg (Speciesn, Interface, 'stop-process (chromosomei)')
9.   else offspring-pool-mutated ← offspring-pool-mutated ∪ {chromosomei}
10. return offspring-pool-mutated

Fig. 9. New mutating sub-process relative to Speciesn

min-conflict-heuristic (chromosomei)
1. δi,j ← max (templatei)   /* δi,j is associated to genei,j, which is in turn associated to the variable vj */
2. nvc* ← nc   /* nc is the total number of constraints */
3. for each value in domain of vj do
4.   nvc ← compute-violated-constraint (value)
5.   if (nvc < nvc*)
6.   then nvc* ← nvc
7.        value* ← value
8. value (genei,j) ← value*
9. update (templatei)
10. return nvc*

Fig. 10. Min-conflict-heuristic relative to chromosomei
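To illustrate how the guidance probability steers the mutating sub-process of Fig. 9, here is a small Python sketch operating on a single chromosome. It is not the authors' ACTALK code: the CSP is represented directly as a dictionary of forbidden value pairs, and a per-gene conflict count stands in for the template values used by the min-conflict heuristic of Fig. 10.

```python
import random

def violations(chromosome, constraints):
    """Number of violated binary constraints; `constraints` maps a pair of
    variable indices (i, j) to a set of forbidden value pairs."""
    return sum(1 for (i, j), forbidden in constraints.items()
               if (chromosome[i], chromosome[j]) in forbidden)

def var_conflicts(chromosome, constraints, v):
    """Conflicts involving variable v (our stand-in for the template value)."""
    return sum(1 for (i, j), forbidden in constraints.items()
               if v in (i, j) and (chromosome[i], chromosome[j]) in forbidden)

def double_guided_mutation(chromosome, domains, constraints, p_mut, p_guid):
    """One application of the mutating sub-process to a single chromosome:
    with probability p_mut mutate; the mutation is guided (min-conflict)
    with probability p_guid and random otherwise."""
    c = list(chromosome)
    if random.random() < p_mut:
        if random.random() < p_guid:
            # guided mutation: most conflicting variable gets the value
            # that minimises the number of violated constraints
            v = max(range(len(c)), key=lambda i: var_conflicts(c, constraints, i))
            best = min(domains[v],
                       key=lambda val: violations(c[:v] + [val] + c[v + 1:], constraints))
            c[v] = best
        else:
            # random mutation: random variable, random value from its domain
            v = random.randrange(len(c))
            c[v] = random.choice(domains[v])
    return c, violations(c, constraints)
```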
3 Experimentation
The goal of our experimentation is to compare two distributed implementations: the first is the Distributed Double Guided Genetic Algorithm (D²G²A), whereas the second is the Distributed Guided Genetic Algorithm
(DGGA). The implementations have been done with ACTALK [1], a concurrent object language implemented on top of the object-oriented language Smalltalk-80.
3.1 Experimental Design
Our experiments are performed on randomly generated binary CSP samples. The generation is guided by the classical CSP parameters: number of variables (n), domain size (d), constraint density p (a number between 0 and 100% indicating the ratio of the number of effective constraints of the problem to the number of all possible constraints, i.e. a complete constraint graph) and constraint tightness q (a number between 0 and 100% indicating the ratio of the number of value pairs forbidden (not allowed) by the constraint to the size of the domain cross product). As numerical values, we use n = 20 and d = 20. Having chosen the values 0.1, 0.3, 0.5, 0.7, 0.9 for the parameters p and q, we obtain 25 density-tightness combinations. For each combination, we randomly generate 30 examples; therefore, we have 750 examples. Moreover, considering the random aspect of genetic algorithms, we have performed 10 runs per example and taken the average without considering outliers. For each density-tightness combination, we also take the average over the 30 generated examples. Regarding the GA parameters, all implementations use a number of generations (NG) equal to 10, an initial-population size equal to 1000, a cross-over probability equal to 0.5, a mutation probability equal to 0.2 and random replacement. The performance is assessed by the two following measures:
– Run time: the CPU time required for solving a problem instance.
– Satisfaction: the number of satisfied constraints.
The first one shows the complexity whereas the second reveals the quality. In order to have a quick and clear comparison of the relative performance of the two approaches, we compute ratios of DGGA and D²G²A performance using the run time and the satisfaction, as follows:
– CPU-ratio = DGGA-Run-time / D²G²A-Run-time
– Satisfaction-ratio = D²G²A-Satisfaction / DGGA-Satisfaction
Thus, DGGA performance is the numerator when measuring the CPU-time ratio, and the denominator when measuring the satisfaction ratio. Hence, any number greater than 1 indicates superior performance by D²G²A.
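For reference, random instances of the (n, d, p, q) model described above can be generated with a few lines of Python. This is an illustrative sketch, not the generator used by the authors; the constraint representation matches the mutation sketch given earlier, so the two can be combined for quick experiments.

```python
import itertools
import random

def generate_binary_csp(n=20, d=20, p=0.5, q=0.5, seed=None):
    """Random binary CSP following the classical (n, d, p, q) model:
    each of the n(n-1)/2 possible constraints is kept with probability p
    (density), and each kept constraint forbids a fraction q of the d*d
    value pairs (tightness).  Returns domains and {(i, j): forbidden_pairs}.
    """
    rng = random.Random(seed)
    domains = [list(range(d)) for _ in range(n)]
    all_pairs = [(a, b) for a in range(d) for b in range(d)]
    constraints = {}
    for i, j in itertools.combinations(range(n), 2):
        if rng.random() < p:
            n_forbidden = int(round(q * d * d))
            constraints[(i, j)] = set(rng.sample(all_pairs, n_forbidden))
    return domains, constraints
```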
3.2 Experimental Results
Figure 11 shows the performance ratios, from which we draw the following results:
– From the CPU-time point of view, D²G²A requires less time for the over-constrained and most strongly tight set of examples. For the remaining sets of examples the CPU-time ratio is always over 1; on average, this ratio is equal to 2.014.
– From the satisfaction point of view, D²G²A always finds more than or the same satisfaction as DGGA. It finds about 1.23 times more for the most strongly constrained and most weakly tight set of problems. The satisfaction ratio average is about 1.05.
Note that the experiments do not show a clear dependency between NG and the evolution of either the CPU-time or the satisfaction ratio [3].
3.3 Guidance Operator Study
Attention is next focused on the study of the guidance probability in order to determine its best value. To accomplish this task, we have assembled the CPU-time averages and the satisfaction averages for different values of Pguid. Both plots in figure 12 show that the best value of Pguid is about 0.5. This value can be explained by the fact that not only random mutations are important but also guided ones: the guided mutating sub-process provides guidance and so helps the algorithm to reach optima, while the random one helps it to escape from the attraction basins of local optima.
4 Conclusion and Perspective
We have developed a new approach called D²G²A, which is a distributed guided genetic algorithm enhanced by a new parameter called the guidance probability. Compared to the DGGA, our approach has been experimentally shown to be always better in terms of the number of satisfied constraints and CPU time.
Fig. 11. CPU-time ratio and Satisfaction ratio
Fig. 12. CPU-time and Satisfaction relative to different values of Pguid
The improvement is due to diversification, which increases the algorithm's convergence by escaping from the attraction basins of local optima, and to guidance, which helps the algorithm to attain optima. Thus our approach gives the optimization process more chance to visit the whole search space. We have come to this conclusion thanks to the proposed mutation sub-process, which is sometimes random, aiming to diversify the search process, and sometimes guided, in order to increase the number of satisfied constraints. The guidance operator has been tested too; its best value has been found to be about 0.5. No doubt further refinement of this approach would allow its performance to be improved. Further work could focus on applying this approach to other hard problems such as valued CSPs and CSOPs.
References
[1] BRIOT J.P., "Actalk: a testbed for classifying and designing actor languages in the Smalltalk-80 environment", Proceedings of the European Conference on Object-Oriented Programming (ECOOP'89), British Computer Society Workshop Series, July 1989, Cambridge University Press.
[2] GHÉDIRA K., "A Distributed Approach to Partial Constraint Satisfaction Problems", Lecture Notes in Artificial Intelligence, Distributed Software Agents and Applications, number 1069, 1996, John W. Perram and J.P. Muller (Eds.), Springer Verlag, Berlin Heidelberg.
[3] GHÉDIRA K. & JLIFI B., "A Distributed Guided Genetic Algorithm for Max_CSPs", Journal of Sciences and Technologies of Information (RSTI), Artificial Intelligence Series (RIA), volume 16, No. 3, 2002.
[4] GOLDBERG D.E., Genetic Algorithms in Search, Optimization, and Machine Learning, Reading, Mass., Addison-Wesley, 1989.
[5] SCHIEX T., FARGIER H. & VERFAILLIE G., "Valued constraint satisfaction problems: hard and easy problems", Proceedings of the 14th IJCAI, Montreal, Canada, August 1995.
[6] TSANG E.P.K., WANG C.J., DAVENPORT A., VOUDOURIS C., LAU T.L., "A family of stochastic methods for Constraint Satisfaction and Optimization", University of Essex, Colchester, UK, November 1999.
Modelling Intelligent Agents for Organisational Memories Alvaro E. Arenas and Gareth Barrera-Sanabria Laboratorio de Cómputo Especializado Universidad Autónoma de Bucaramanga Calle 48 No 39 - 234, Bucaramanga, Colombia {aearenas,gbarrera}@unab.edu.co
Abstract. In this paper we study the modelling of intelligent agents to manage organisational memories. We apply the MAS-CommonKADS methodology to the development of an agent-based knowledge management system applied to project administration in a research centre. Particular emphasis is placed on the development of the expertise model, where knowledge is expressed as concepts of the CommonKADS Conceptual Model Language. These concepts are used to generate annotated XML-documents, facilitating their search and retrieval. Keywords: Agent-Oriented Software Engineering; Knowledge Management; Organisational Memory; MAS-CommonKADS
1 Introduction
An organisational memory is an explicit, disembodied and persistent representation of crucial knowledge and information in an organisation. This type of memory is used to facilitate access, sharing and reuse by members of the organisation for individual or collective tasks [3]. In this paper we describe the development of a multi-agent system for managing a document-based organisational memory distributed through the Web using the MAS-CommonKADS methodology [5]. MAS-CommonKADS is a general purpose multi-agent analysis and design methodology that extends the CommonKADS [10] design method by combining techniques from object-oriented methodologies and protocol engineering. It has been successfully applied to the optimisation of industrial systems [6], the automation of travel assistants [1] and the development of e-commerce applications [2], among others. In this paper particular emphasis is placed on the development of the expertise model, where knowledge is expressed as concepts of the CommonKADS Conceptual Model Language. These concepts are used to generate annotated XML documents, which are central elements of the organisational memory, thus facilitating their search and retrieval.
2 Applying MAS-CommonKADS to the Development of Organisational Memories
The concepts we develop here are applicable to any organisation made of distributed teams of people who work together online. In particular, we have selected as our case study a research centre made of teams working on particular areas. A team consists of researchers who belong to particular research projects. The organisational memory is made of electronic documents distributed throughout the Intranet of that centre. The MAS-CommonKADS methodology starts with a conceptualisation phase where an elicitation task is carried out, aiming to obtain a general description of the problem by following a user-centred approach based on use cases [9]. We have identified three human actors for our system: Document-Author, End-User and Administrator. A Document-Author is a person working in a research project who is designated for updating information about such a project. An End-User is any person who makes requests to the system; requests could be related to a team, a project or a particular researcher. For instance, a typical request is 'Find the deliverables of the project named ISOA'. Finally, an Administrator is the person in charge of maintaining the organisational memory. The outcome of this phase is a description of the different actors and their use cases.
2.1 The Agent Model
The agent model specifies the characteristics of an agent, and plays the role of a reference point for the other models. An agent is defined as any entity (human or software) capable of carrying out an activity. The identification of agents is based on the actors and use-case diagrams generated in the conceptualisation. We have identified four classes of agents (a sketch of this hierarchy is given after the list):
• User Agent: software agent corresponding to an interface agent that helps to carry out the actions associated with human actors; it has three subclasses, one for each actor: Author Agent, End-User Agent and Administrator Agent.
• Search Agent: software agent responsible for searching information in the organisational memory; it has three subclasses, one for each kind of search: Search Team Agent, Search Project Agent and Search Researcher Agent.
• Update Agent: software agent responsible for updating information in the organisational memory; it has three subclasses, one for each kind of update: Update Team Agent, Update Project Agent and Update Researcher Agent.
• Create Agent: software agent responsible for creating new information in the organisational memory; it has three subclasses, one for each possible creation: Create Team Agent, Create Project Agent and Create Researcher Agent.
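A minimal Python sketch of the agent hierarchy just listed; the paper describes the hierarchy conceptually (and implements it in Java), so all class names here are simply a transliteration of that description rather than the authors' code.

```python
class Agent:
    """Base class for every agent in the system."""
    def __init__(self, name):
        self.name = name

class UserAgent(Agent):
    """Interface agent carrying out the actions of a human actor."""

class AuthorAgent(UserAgent): pass
class EndUserAgent(UserAgent): pass
class AdministratorAgent(UserAgent): pass

class SearchAgent(Agent):
    """Searches information in the organisational memory."""

class SearchTeamAgent(SearchAgent): pass
class SearchProjectAgent(SearchAgent): pass
class SearchResearcherAgent(SearchAgent): pass

class UpdateAgent(Agent):
    """Updates information in the organisational memory."""

class UpdateTeamAgent(UpdateAgent): pass
class UpdateProjectAgent(UpdateAgent): pass
class UpdateResearcherAgent(UpdateAgent): pass

class CreateAgent(Agent):
    """Creates new information in the organisational memory."""

class CreateTeamAgent(CreateAgent): pass
class CreateProjectAgent(CreateAgent): pass
class CreateResearcherAgent(CreateAgent): pass
```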
2.2 The Task Model
The task model describes the tasks that an agent can carry out. Tasks are decomposed according to a top-down approach. We use UML activity diagrams to represent the activity flow of the tasks and textual templates to describe each activity. For instance, the Search Agent carries out tasks such as requesting information from other agents, organising search criteria, selecting information and notifying its status to other agents. Table 1 shows the textual template of the task Organise Search Criteria.
Table 1. Textual Template for Organise Search Criteria Task
Task: Organise Search Criteria
Purpose: Classify the search criteria
Description: Once the criteria are received, they are classified and the information to which they will be applied is identified
Input: General criteria
Output: Criteria classified
Precondition: Existence of criteria
Ingredient: Pattern of criteria classification
2.3 The Organisation Model
This model aims to specify the structural relationships between human and/or software agents, and the relationships with the environment. It includes two submodels: the human organisation model and the multi-agent organisation model. The human organisation model describes the current situation of the organisation. It includes a description of different constituents of the organisation such as structure, functions, context and processes. By way of illustration, the process constituent for our case study includes the reception of information about new teams, projects or researchers; once the information is validated, it is registered in databases and new pages are generated in the organisation's Intranet; such information can be updated by modifying the corresponding records; finally, the information is offered to the different users. The multi-agent organisation model describes the structural organisation of the multi-agent system, the inheritance relationships between agents and their structural constituent. The structural organisation and the inheritance relationships are derived directly from the agent model. The structural constituent of the multi-agent system specifies the geographic distribution of the agents and the organisational memory. Figure 1 presents an abstract diagram of the structural constituent of our case study.
Fig. 1. Abstract Diagram of the Structural Constituent of the MAS
2.4 The Coordination and Communication Models
The coordination model shows the dynamic relationships between the agents. This model begins with the identification of the conversations between agents, where use cases again play an important role. At this level, every conversation consists of just one single interaction and the possible answer, which are described by means of textual templates. Next, the data exchanged in each interaction are modelled by specifying speech acts and synchronisation, and all this information is collected in the form of sequence diagrams. Finally, interactions are analysed in order to determine their complexity. We do not detail this phase due to lack of space. The communication model includes interactions between human agents and other agents. We use templates similar to those of the coordination model, but taking into consideration human factors such as facilities for understanding recommendations given by the system.
2.5 The Expertise Model
The expertise model describes the reasoning capabilities of the agents needed to carry out specific tasks and achieve goals. It consists of the development of the domain knowledge, inference knowledge and task knowledge. The domain knowledge represents the declarative knowledge of the problem, modelled as concepts, properties, expressions and relationships using the Conceptual Modelling Language (CML) [10]. In our problem, we have identified concepts such as User, Project, Team, Profile and Product, and properties such as the description of the products generated by a project or the description of the profile of a user. Table 2 presents a fragment of the concept definition of our system. Such a description constitutes the general ontology of the system; it is an essential component for generating the XML-annotated documents that make up the organisational memory.
Table 2. Fragment of the Concept Definition of the System
Concept: User. Description: Metaconcept that represents the type of users in the system. It does not have specific properties.
Concept: External User. Description: Subconcept of concept User. This concept represents the external users of the system, i.e. people from other organisations who want to consult our system.
Concept: Employee. Description: Subconcept of concept User that details the general information of the employees of the organisation. Properties: Identity: String; Min = 8; Max = 15. Name: String; Max = 100. Login: String; Min = 6; Max = 8. Type: Integer; Max = 2. ...
The inference knowledge describes the inference steps performed for solving a task. To do so, we use inference diagrams [10]. An inference diagram consists of three elements: entities, representing external sources of information that are necessary to perform the inference, which are denoted by boxes; inferences, denoted by ovals; and flows of information between entities and inferences, denoted by arrows. Typical processes that require inference steps include validating the privileges of a user or the search of information in the organisational memory. Figure 2 shows the inference diagram for validating the privileges of a user.
Fig. 2. Inference Diagram for Validating the Privileges of a User
The task knowledge identifies the exact knowledge required by each task and determines its most suitable solving method. To do so, each task is decomposed into simpler subtasks, from which we extract relevant knowledge to structure the organisational memory.
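The paper's own implementation generates the XML-annotated documents from the CML concepts in Java; purely as a language-neutral illustration of that generation step, the Python sketch below turns a concept definition such as the Employee concept of Table 2 into an annotated XML document. The element names and property encoding are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment of the domain knowledge: the Employee concept of Table 2.
employee_concept = {
    "name": "Employee",
    "subconcept_of": "User",
    "properties": {"Identity": "String", "Name": "String",
                   "Login": "String", "Type": "Integer"},
}

def concept_to_xml(concept, values):
    """Generate an XML-annotated document for one instance of a concept."""
    root = ET.Element("concept", name=concept["name"],
                      subconceptOf=concept.get("subconcept_of", ""))
    for prop, prop_type in concept["properties"].items():
        elem = ET.SubElement(root, "property", name=prop, type=prop_type)
        elem.text = str(values.get(prop, ""))
    return ET.ElementTree(root)

doc = concept_to_xml(employee_concept,
                     {"Identity": "12345678", "Name": "Ana Diaz",
                      "Login": "adiaz", "Type": "1"})
doc.write("employee_adiaz.xml", encoding="utf-8", xml_declaration=True)
```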
2.6 The Design Model
This model aims to structure the components of the system. It consists of three submodels: the agent network, the agent design and the platform submodels. The agent network design model describes the network facilities (naming services, security, encryption), knowledge facilities (ontology servers, knowledge representation translators) and the coordination facilities (protocol servers, group management facilities, resource allocation) within the target system. In our system we do not use any facilitator agent, so this model is not defined. The agent design model determines the most suitable architecture for each agent. To do so, we use a template for each agent type that describes the functionalities of each subsystem of such an agent type. Finally, in the platform design model, the software and hardware platform for each agent are selected and described. We have selected Java as the programming language for the implementation of the agent subsystems. The documents that make up the organisational memory are represented as XML documents. We use JDOM [4], an API for accessing, manipulating, and outputting XML data from Java code, to program the search and retrieval of documents. Lastly, Java Server Pages (JSP) is used for implementing the system's dynamic pages.
3 Implementation
The implementation of our system consisted of three phases: creation of a document-based organisational memory, development of user interfaces and construction of the multi-agent system. The organisational memory consists of XML-annotated documents that can be updated and consulted by the users according to their privileges. In the current implementation, such documents are generated straightforwardly from the concepts and relationships of the domain knowledge. The structure of the user interfaces is derived directly from the communication model and developed with JSP. In the current version of the system, each agent type is implemented as a Java class. We are implementing a new version with the aid of the AgentBuilder tool.
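The retrieval step described above is programmed with JDOM from Java code; as a hedged, language-neutral illustration only, the sketch below performs an equivalent search over the XML-annotated documents in Python, assuming the element layout used in the earlier generation sketch.

```python
import glob
import xml.etree.ElementTree as ET

def search_memory(directory, concept_name, prop_name, value):
    """Return the files in the organisational memory whose document instantiates
    `concept_name` and whose property `prop_name` matches `value`."""
    hits = []
    for path in glob.glob(f"{directory}/*.xml"):
        root = ET.parse(path).getroot()
        if root.get("name") != concept_name:
            continue
        for prop in root.findall("property"):
            if prop.get("name") == prop_name and (prop.text or "").strip() == value:
                hits.append(path)
                break
    return hits

# e.g. the request 'Find the deliverables of the project named ISOA' could start with:
# search_memory("memory", "Project", "Name", "ISOA")
```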
4 Conclusions and Future Work
The agent-based approach to system development offers a natural means of conceptualising, designing and building distributed systems. The successful practice of this approach requires robust methodologies for agent-oriented software engineering. This paper applies MAS-CommonKADS, a methodology for agent-based system development, to the development of a knowledge management system for administering a document-based organisational memory distributed throughout the Web. We have found the application of MAS-CommonKADS useful in this type of application. First, the organisational model was crucial in the specification of the relationship between system and society, which is a central aspect in this type of application. In our application, it was useful for defining the structure of the information within the Intranet of the organisation. Second, the knowledge submodel of the expertise model was the basis for creating the organisational memory. From such concepts we derive XML-annotated documents that make up the organisational memory. Several projects have studied ontologies for both knowledge management and Web search. The OSIRIX project also proposes the use of XML-annotated documents to build an organisational memory [8]. They develop a tool that allows users to translate a corporate ontology into a document type definition, provided that the ontology is represented in the CommonKADS Conceptual Modelling Language. Although we share similar objectives, our work has focused on the methodological aspects of developing this kind of system. Klein et al. study the relation between ontologies and XML for data exchange [7]. They have devised a procedure for translating an OIL ontology into a specific XML schema. Future work will focus on the automation of the generation of the XML documents, and the inclusion of domain-specific ontologies in the system.
Acknowledgements. The authors are grateful to José Pérez-Alcázar and Juan Carlos García-Ojeda for valuable comments and suggestions. This work was partially funded by the Colombian Research Council (Colciencias-BID).
References
[1] A. E. Arenas and G. Barrera-Sanabria. Applying the MAS-CommonKADS Methodology to the Flights Reservation Problem: Integrating Coordination and Expertise. Frontiers of Artificial Intelligence and Applications Series, 80:3-12, 2002.
[2] A. E. Arenas, N. Casas, and D. Quintanilla. Integrating a Consumer Buying Behaviour Model into the Development of Agent-Based E-Commerce Systems. In IIWAS 2002, The Fourth International Conference on Information Integration and Web-Based Applications and Services. SCS, European Publishing House, 2002.
[3] R. Dieng. Knowledge Management and the Internet. IEEE Intelligent Systems, 15(4):14-17, 2000.
[4] E. R. Harold. Processing XML with Java. Addison-Wesley, 2002.
[5] C. A. Iglesias, M. Garijo, J. Centeno-Gonzalez, and J. R. Velasco. Analysis and Design of Multiagent Systems using MAS-CommonKADS. In Agent Theories, Architectures, and Languages, Lecture Notes in Artificial Intelligence, pages 313-327, 1997.
[6] C. A. Iglesias, M. Garijo, M. González, and J. R. Velasco. A Fuzzy-Neural Multiagent System for Optimisation of a Roll-Mill Application. In IEA/AIE, Lecture Notes in Artificial Intelligence, pages 596-605, 1998.
[7] M. Klein, D. Fensel, F. van Harmelen, and I. Horrocks. The Relation between Ontologies and XML Schemas. Electronic Transactions on Artificial Intelligence, 6(4), 2001.
[8] A. Rabarijoana, R. Dieng, O. Corby, and R. Ouaddari. Building and Searching an XML-Based Corporate Memory. IEEE Intelligent Systems, 15(3):56-63, 2000.
[9] J. Rumbaugh, I. Jacobson, and G. Booch. The Unified Modelling Language Reference Manual. Addison-Wesley, 1999.
[10] G. Schreiber, H. Akkermans, A. Anjewierden, R. de Hoog, N. Shadbolt, W. Van de Velde, and B. Wielinga. Knowledge Engineering and Management: The CommonKADS Methodology. The MIT Press, 2000.
A Study on the Multi-agent Approach to Large Complex Systems Huaglory Tianfield School of Computing and Mathematical Sciences Glasgow Caledonian University, 70 Cowcaddens Road, Glasgow G4 0BA, UK [email protected]
Abstract. This paper explores the multi-agent approach to large complex systems. Firstly, system complexity research is briefly reviewed. Then, self-organization of multi-agent systems is analyzed. Thirdly, complexity approximation via multi-agent systems is presented. Finally, distinctive features of multi-agent approach are summarized.
1 Introduction
Large complex systems (LCS) are found in abundance. Practically all disciplines, ranging from natural sciences to engineering, and from management/decision/behavior sciences to social sciences, have encountered various LCS. Natural and man-made instances of LCS include, for example, biological body systems, biological society systems, environmental and weather systems, space/universe systems, machine systems, traffic/transport systems, and human organizational/societal (e.g., political, economic, military) systems, to mention just a few. Complexity research and its applications have received very broad attention from different disciplines. In particular, the different models employed have largely determined the approaches to LCS, as depicted in Table 1. An LCS may have hundreds or even thousands of heterogeneous components, between which are complicated interactions ranging from primitive data exchanges to natural language communications. The information flows and patterns of dynamics in an LCS are intricate. Generally LCS have three intrinsic attributes, i.e. perception-decision link, multiple gradation, and nesting [1-3]. As emergent behaviors dominate the system, the study of the behaviors of LCS is very hard. Given that

System Geometry ∆∇= components + interaction between components   (1)

where ∆∇= denotes "packaging" and "de-packaging", system complexity is much more attributable to the interactions between components than to the components themselves. The architecture study is very hard: apparently there is certainly a hierarchy in an LCS, but it is unclear how many levels there should be and how a level can be stratified. And the integration of LCS is very hard; fundamentally there is even no effective approach.
Table 1. Models in complexity research
Model | Originating area | Characteristics of complexity
Statistical models | Thermodynamics | Entropy, complexity measure, dissipative structure, synergetics, self-organization
Deterministic models | Non-linear science | Chaos, fractal geometry, bifurcation, catastrophe, hypercycle
Biological species models | Biology | Evolution, learning, emergence
Game theoretical models | Operations research | Game, dynamic game, co-ordination
Artificial life, adaptive models | Cybernetics, adaptation | Co-evolution, cellular automata, replicator, self-organized criticality, edge of chaos
Time-series dynamic models | System science and engineering | Time series dynamics, black-box quantitative dynamics, system controllability, systems observability
Macroscopic dynamic models | Economics and sociology, systems research | Hierarchy, self-organizing, adaptation, architecture of complexity, systems dynamics
Multi-agent models | Distributed artificial intelligence | To be explored
All of these make it rather difficult to investigate the basic problems of LCS, including complexity mechanism, modeling, architecture, self-organization, evolution, performance evaluation, system development, etc. Intelligent agents and multi-agent systems have been widely accepted as a type of effective coarse-granularity metaphor for perception, modeling and decision making, particularly in systems where humans are integrated [4, 5]. They will be very effective in dealing with the heterogeneous nature of the components of LCS, as intelligent agents and their emergent interaction are ubiquitous modeling metaphors whether the real-world systems are hardware or software, machines or humans, low or high level.
2 Self-Organization of Multi-agent Systems
2.1 Geometry of Multi-agent Systems
A prominent advantage of agentization is that it enables a separation of social behavior (problem solving, decision making and reasoning) at the social communication level from individual behavior (routine data and knowledge processing, and problem solving) at the autonomy level. Due to such a separation of the two levels, every time a social communication is conducted, the corresponding agents have to be associated beforehand. Therefore, the social communication level and the autonomy level, along with agent-task association, predominate the geometry of a multi-agent system on the basis of its infrastructure, as depicted in Fig. 1. The infrastructure of a multi-agent system includes agents, agent platform(s), agent management, and an agent communication language.
Fig. 1. Geometry of multi-agent system (ACL: agent communication language)
Agent-task association is dynamic. The dynamics can be as long as a complete round of problem solving, and can also be as short as a round of social communication, or even a piece of social communication (the exchange of one phrase). Social communication is emergent. It is a purely case-by-case process; it is not predictable beforehand and only appears in the situation. Moreover, social communication is at the knowledge level. Agents themselves have sufficient abilities. The purpose of social communication is not solving problems, but initiating, activating, triggering, invoking, inducing, and prompting the abilities of individual agents. The essential decision making, however, relies upon the corresponding agents, not the social communication between agents.
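A minimal sketch of the two-level separation just described, with assumed names that are not from the paper: agents carry their own abilities at the autonomy level, register with a platform (infrastructure), and agent-task association is established dynamically, case by case, just before social communication takes place.

```python
class Agent:
    """Autonomy level: routine data/knowledge processing and local problem solving."""
    def __init__(self, name, abilities):
        self.name, self.abilities = name, set(abilities)

    def solve(self, task):
        return f"{self.name} solved {task}"

class AgentPlatform:
    """Infrastructure: registration plus dynamic agent-task association."""
    def __init__(self):
        self.registry = []

    def register(self, agent):
        self.registry.append(agent)

    def associate(self, task):
        # Association is decided case by case, just before communication happens.
        return [a for a in self.registry if task in a.abilities]

platform = AgentPlatform()
platform.register(Agent("A1", {"diagnose", "plan"}))
platform.register(Agent("A2", {"plan"}))
# Social communication level: only the associated agents are addressed for this task.
for agent in platform.associate("plan"):
    print(agent.solve("plan"))
```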
2.2 Uncertainty and Self-Exploration of Multi-agent Problem Solving Process
The multi-agent problem solving process involves three aspects:
• Agent-task association: which agents are associated, which are not, and the alternation and mixture of association and non-association in problem solving;
• Social communication: organizationally constrained and non-constrained social communication, and their alternation and mixture in problem solving;
• Progression of problem solving: what tasks or intermediate solutions are achieved, how they are aggregated, and what the further tasks are to achieve the complete solution.
The autonomy of agents, the dynamic property of agent-task association, and the emergent and knowledge-level properties of social communication between agents make the multi-agent problem solving process uncertain and self-exploring. The multi-agent problem solving process is self-exploring: at the beginning, the solving process is unknown; with the progression of problem solving, the process is made clearer and clearer.
In redundant multi-agent systems, the agents to be associated vary uncertainly from one round or even one piece of social communication to another. Overlapping between the abilities of individual agents in a society forms the basis of the uncertainty of the multi-agent problem solving process. In non-redundant multi-agent systems, agents are functionally fixed under a given organizational structure, social communication is organizationally constrained, and there is no provision of competition or alternation between agents. The total problem solving is just a mechanical additive aggregation of the abilities of individual agents over an uncertain process. This appears as if the problem solving of individual agents forms a Markov chain:

Total task = PS(Agent-1) +^{t_1} PS(Agent-2) +^{t_2} \cdots +^{t_N} PS(Agent-N)   (2)

where PS is an abbreviation of problem solving, and +^{t} denotes a mechanical aggregation of results from the problem solving of individual agents which takes place at asynchronous instants. However, the social communication in non-redundant multi-agent systems still varies in time, content and/or quality from one round or even one piece to another. So, even in this case the multi-agent problem solving process is uncertain. For instance, within an enterprise, even provided that there is an adequate organizational structure between functional departments, from one time to another, functional departments may have failures in delivery dates and qualities, depending on how cooperation and/or negotiations are made between functional departments. Therefore, in both redundant and non-redundant multi-agent systems, the multi-agent problem solving process is uncertain.
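A toy simulation of Equation (2): in a redundant society, which agent contributes at each asynchronous instant varies from run to run, so the same total task is reached through an uncertain, self-exploring process. The agent names, abilities and the random tie-breaking rule are illustrative assumptions, not from the paper.

```python
import random

def solve_total_task(subtasks, agents):
    """Aggregate the problem solving of individual agents over asynchronous instants.

    `agents` maps an agent name to the set of subtasks it can solve; overlapping
    abilities make the association (and hence the process) uncertain."""
    trace = []
    for t, subtask in enumerate(subtasks, start=1):
        candidates = [name for name, abilities in agents.items() if subtask in abilities]
        chosen = random.choice(candidates)         # case-by-case agent-task association
        trace.append((t, chosen, subtask))         # PS(chosen agent) at instant t
    return trace

agents = {"Agent-1": {"a", "b"}, "Agent-2": {"b", "c"}, "Agent-3": {"c", "a"}}
print(solve_total_task(["a", "b", "c"], agents))   # the trace differs between runs
```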
3 Approximating Complexity through Flexible Links of Simplicities
3.1 Flexibility of Multi-agent Modeling of Real-World Systems
Consider

A system ≜ ((Agent-i)_N)^T (⇔_{ij}^t)_{N×N} ((Agent-j)_N)   (3)

where parentheses denote a vector or matrix, superscript T refers to the transposition of a vector, and ⇔_{ij}^t denotes social communication from agent j to agent i at asynchronous instant t. It can refer to a pattern, discourse, or phrase of social communication, or all of these. Due to the dynamic property of agent-task association, the number of agents to be associated, N, varies from one round, or even one piece of social communication, to another. The basic idea of multi-agent based modeling is packaging various real-world LCS into multi-agent systems, as depicted in Fig. 2. Thus Equation (1) becomes

Large Complex System Geometry ∆∇= Agents + Interactions between agents   (4)
Fig. 2. Multi-agent systems as metaphor of various real-world LCS
Then, multi-agent modeling involves specifications of packaging, agents, the agent communication language, and the agent platform. This apparently requires a specification language. Multi-agent packaging has many prominent features, e.g.:
• a unified systems engineering process for LCS: architecture, analysis, design, construction, performance evaluation, evolution, etc.;
• unified concepts, measures, theory, methods, etc. of LCS;
• unified studies on different types of real-world LCS.
Given a fixed infrastructure, the self-organization of multi-agent systems makes it possible that a multi-agent system can adapt itself for a wide variety of real-world systems.
3.2 Multi-agent Based System Development
The principal impact of the multi-agent paradigm upon system development is the transition from an individual focus to an interaction focus. High-level social behavior may result from a group of low-level individuals. Developing systems by the multi-agent paradigm amounts to developing the corresponding multi-agent systems. To develop a multi-agent system, there are two parts to complete: first, to develop the infrastructure of the multi-agent system, including agents, agent platform(s), agent management, and agent communication language, and then to develop the agent-task association and the social communication. For the former, the development of the infrastructure of a multi-agent system is just similar to that of any traditional system. Traditional paradigms of system development, including analysis/design methods and life-cycle models, are applicable.
Table 2. Multi-agent approach versus traditional complexity research
Scope. Conventional complexity research: non-linearity, chaos, self-organization, quantum mechanics, thermodynamics, etc. Multi-agent approach: very natural to incorporate humans/computers as basic components of the systems, and able to deal with the largeness and architecture of systems.
Destination. Conventional complexity research: basically physical mechanism oriented, i.e., trying to discover the primary physics of non-linear and thermodynamic complex systems. Multi-agent approach: not physical mechanism oriented, but black-box systematic behavior oriented.
Highlight. Conventional complexity research: more on the microscopic/quantum level, ignoring the effective incorporation of humans/computers in the systems, and incapable of dealing with the genuine largeness and structural complexity of a system. Multi-agent approach: macroscopic, meta-level, i.e., using coarse-granularity (pseudo-)computational entities (i.e., agents) as the primary elements of the characterisation and modeling of LCS; the power is not in the primary elements themselves, but in the dynamic, uncertain, emergent, knowledge-level interactions between these primary elements.
For the latter, strictly speaking, the agent-task association and the social communication are not something that can be developed, only influenced. In such a circumstance, traditional paradigms are generally inapplicable. Actually, there is no apparent life-cycle concept: developers are interactively working with the multi-agent system. In order to exert influence, extra requirements are posed upon the development of the infrastructure of the multi-agent system. That is, when the infrastructure of a multi-agent system is developed, consideration should already be given to how it can be influenced later on. Essentially, the geometry of a multi-agent system should have been designed before the development of its infrastructure. By means of the multi-agent paradigm, the infrastructure of multi-agent systems can be very easily inherited from one system to another. This prominent provision greatly facilitates the reusability and evolvability of systems.
4 Summary
Advantages of the self-organization of multi-agent systems include:
• System modeling is greatly alleviated. Given a fixed infrastructure, the multi-agent system can adapt itself for many varieties of real-world systems;
• Completeness and organic links between system perspectives are readily guaranteed;
• The reusability and evolvability of systems are thoroughly enhanced. Provided with the infrastructure of a multi-agent system, new systems are rapidly available, with a considerably shorter time to market;
• Late user requirements can be readily resolved and adapted. Actually, users are interactively working with the multi-agent system, and can change (more exactly, influence) the behavior of the multi-agent system.
Disadvantages of multi-agent systems may be:
• The multi-agent problem solving process is uncertain and self-exploring. This leaves the problem solving process nontransparent and untraceable to users;
• While the efficiency of multi-agent problem solving can be influenced by incorporating heuristics into the multi-agent system, the speed of problem solving is uncontrollable.
The traditional models/methods for LCS are relatively dependent upon the originating area, and lack a unified research process for systems from a variety of different domains. Multi-agent systems provide a unified approach to the analysis, design, construction, performance, evolution, etc. of LCS. A comparison is made in Table 2.
References
[1] Tianfield, H.: Formalized analysis of structural characteristics of large complex systems. IEEE Trans. on Systems, Man, and Cybernetics, Part A, 31 (2001) 559-572
[2] Tianfield, H.: Structuring of large-scale complex hybrid systems: From illustrative analysis toward modeling. J. of Intelligent and Robotic Systems: Theory and Applications, 30 (2001) 179-208
[3] Tianfield, H.: An innovative tutorial on large complex systems. Artificial Intelligence Review, 17 (2002) 141-165
[4] Tianfield, H.: Agentization and coordination. Int. J. of Software Engineering and Knowledge Engineering, 11 (2001) 1-5
[5] Tianfield, H.: Towards advanced applications of agent technology. Int. J. of Knowledge-based Intelligent Engineering Systems, 5 (2001) 258-259
Multi-layered Distributed Agent Ontology for Soft Computing Systems Rajiv Khosla School of Business La Trobe University, Melbourne, Victoria – 3086, Australia [email protected]
Abstract. In this paper we outline a multi-layered distributed agent ontology that provides system modeling support to practitioners and problem solvers at four levels, namely, the distributed, tool (technology), optimization and task levels respectively. We describe the emergent characteristics of the architecture from an architectural viewpoint. We also outline how the ontology facilitates the development of human-centered soft computing systems. The ontology has been applied in a range of areas including alarm processing, web mining, image processing, sales recruitment and medical diagnosis. We also outline the definition of agents in different layers of the ontology.
1 Introduction
Soft computing agents today are being applied in a range of areas including image processing, engineering, process control, data mining, the internet and others. In the process of applying soft computing agents to complex real world problems, three phenomena have emerged. Firstly, the application of soft computing agents in distributed environments has resulted in a merger of techniques from the soft computing area with those of distributed artificial intelligence. Secondly, in an effort to improve the quality of soft computing solutions, researchers have been combining and fusing technologies. This has resulted in hybrid configurations of soft computing technologies. Finally, from a practitioner's perspective, as the soft computing technologies have moved out of laboratories into the real world, the need for knowledge or task level practitioner-centered technology independent constructs has emerged. In this paper we describe four levels of intelligent soft computing agent design which correspond to the three phenomena described above. These four levels or layers are the distributed, tool, optimization and problem solving or task layer. Distributed level support is provided, among other aspects, for fetching and depositing data across different machines in a distributed environment. Tool support is provided in terms of applying various soft and hard computing technologies like fuzzy logic, neural networks, genetic algorithms, and knowledge based systems. Optimization level support is provided in terms of hybrid configurations of soft computing technologies for designing and developing optimum models. The optimization level assists in improving the quality of solution of the technologies in the tool layer and the range of tasks covered by them. Finally, task level support is provided in terms of modelling users' tasks and problem solving models in a technology independent manner. The layered architecture is also motivated by the human-centred approach and criteria outlined in the 1997 NSF workshop on human-centered systems and by the consistent problem solving structures/strategies employed by practitioners while designing solutions to complex problems or situations. The paper is organised as follows. Section 2 describes work done in the process of developing the ontology. Section 3 outlines the multi-layered agent ontology for soft computing systems and some aspects of the emergent behavior or characteristics of the multi-layered ontology. Section 4 concludes the paper.
2 Background
In this section we first look at the evolution of intelligent technologies along two dimensions, namely, quality of solution and range of tasks. Then we outline some of the existing work on problem solving ontologies.
2.1 Evolution of Intelligent Technologies
The four most commonly used intelligent technologies are symbolic knowledge based systems (e.g. expert systems), artificial neural networks, fuzzy systems and genetic algorithms [3, 5, 8]. The computational and practical issues associated with intelligent technologies have led researchers to start hybridizing various technologies in order to overcome their limitations. However, the evolution of hybrid systems is not only an outcome of the practical problems encountered by these intelligent methodologies but is also an outcome of the deliberative, fuzzy, reactive, self-organizing and evolutionary aspects of the human information processing system [2]. Hybrid systems can be grouped into three classes, namely, fusion systems, transformation systems and combination systems [3, 6, 7]. These classes, along with individual technologies, are shown in Fig. 1 along two dimensions, namely, quality of solution and range of tasks. In fusion systems, the representation and/or information processing features of technology A are fused into the representation structure of another technology B. From a practical viewpoint, this augmentation can be seen as a way by which a technology addresses its weaknesses and exploits its existing strengths to solve a particular real-world problem. Transformation systems are used to transform one form of representation into another. They are used to alleviate the knowledge acquisition problem. For example, neural nets are used for transforming numerical/continuous data into symbolic rules which can then be used by a symbolic knowledge based system for further processing. Combination systems involve explicit hybridization. Instead of fusion, they model the different levels of information processing and intelligence by using technologies that best model a particular level. These systems involve a modular arrangement of two or more technologies to solve real-world problems.
Fig. 1. Technologies, Hybrid Configurations, Quality of Solution and Range of Tasks
However, these hybrid architectures also suffer from some drawbacks. These drawbacks can be explained in terms of the quality of solution and the range of tasks covered, as shown in Fig. 1. Fusion and transformation architectures on their own do not capture all aspects of human cognition related to problem solving. For example, fusion architectures result in the conversion of explicit knowledge into implicit knowledge, and as a result lose the declarative aspects of problem solving. Thus, they are restricted in terms of the range of tasks covered by them. Transformation architectures with a bottom-up strategy get into problems with increasing task complexity. Therefore the quality of solution suffers when there is heavy overlap between variables, where the rules are very complicated, the quality of data is poor, or data is noisy. Combination architectures cover a range of tasks because of their inherent flexibility in terms of the selection of two or more technologies. However, because of the lack of (or minimal) knowledge transfer among different modules, the quality of solution suffers for the very reasons the fusion and transformation architectures are used. It is useful to associate these architectures in a manner that maximizes the quality as well as the range of tasks that can be covered. This class of systems is called associative systems, as shown in Fig. 1. As may be apparent from Fig. 1, associative systems consider various technologies and their hybrid configurations as technological primitives that are used to accomplish tasks. The selection of these technological primitives is contingent upon the satisfaction of task constraints. In summary, it can be seen from the discussion in this section that associative systems represent an evolution from a technology-centered approach to a task-centered approach.
2.2 Strengths and Weaknesses of Existing Problem Solving Ontologies
In order to pursue the task-centered approach, one tends to look into the work done in the area of problem solving ontologies. Research on problem solving ontologies or knowledge-use level architectures has largely been done in artificial intelligence. The research at the other end of the spectrum (e.g., radical connectionism, soft computing) is based more on understanding the nature of human or animal behavior than on developing ontologies for dealing with complex real world problems on the web as well as in conventional applications. A discussion of the strengths and some limitations of existing problem solving ontologies can be found in [7, 8]. Some of the limitations include a lack of modelling constructs for soft computing applications, human-centeredness, context, response time, external or perceptual representations and some others. Firstly, especially with the advent of the internet and the web, human (or user) centeredness has become an important issue (NSF workshop on human-centered systems, 1997; Takagi [10]). Secondly, from a cognitive science viewpoint, distributed cognition (which among other aspects involves consideration of external and internal representations for task modelling) has emerged as a system modelling paradigm, as compared to the traditional cognitive science approach based on internal representations only. Thirdly, human-centered problem solving among other aspects involves context, focus on practitioner goals and tasks, human evaluation of tasks modelled by technological artifacts, flexibility, adaptability and learning. Fourthly, unlike knowledge based systems, soft computing systems, because of their imprecise and approximate nature, require the modelling of constructs like optimization. Finally, as outlined in the last section, there is a range of hard and soft computing technologies and their hybrid configurations which lend flexibility to the problem solver, as against trying to force fit a particular technology or method onto a system design (as has traditionally been done). These aspects can be considered as pragmatic constraints. Most existing approaches do not meet one or more of these pragmatic constraints. Besides, from a soft computing perspective, most existing approaches do not facilitate component based modelling at the optimization, task and technology levels respectively.
3 Distributed Multi-layered Agent Ontology for Soft Computing Systems
The multi-layered agent ontology is shown in Fig. 2. It is derived from the integration of the characteristics of intelligent artifacts like fuzzy logic, neural networks and genetic algorithms, agents, objects and a distributed process model with a problem solving ontology model [6, 7]. It consists of five layers. The object layer defines the data architecture or structural content of an application. The tool or technology agent layer defines the constructs for various intelligent and soft computing tools. The optimization agent layer defines constructs for fusion, combination and transformation technologies which are used for optimizing the quality of solution (e.g., accuracy). The distributed agent layer provides distributed communication and processing support, for example for fetching and depositing data across different machines. Finally, the problem solving ontology (task) agent layer defines the constructs related to the problem solving agents, namely, preprocessing, decomposition, control, decision and postprocessing; this layer models practitioners' tasks in the domain under study, and employs the services of the other four layers for accomplishing various tasks. Some generic goals and tasks associated with each problem solving agent are shown in Table 1. The five layers facilitate a component based approach to agent based software design.
Fig. 2. Multi-Layered Distributed Agent Ontology for Soft Computing
Table 1. Some Goals and Tasks of Problem Solving Agents
Preprocessing. Goal: improve data quality. Some tasks: noise filtering; input conditioning.
Decomposition. Goal: restrict the context of the input from the environment at the global level by defining a set of orthogonal concepts; reduce the complexity and enhance the overall reliability of the computer-based artifact. Some tasks: define orthogonal concepts.
Control. Goal: determine decision selection knowledge constructs within an orthogonal concept for the problem under study. Some tasks: define decision level concepts within each orthogonal concept as identified by users; determine conflict resolution rules.
Decision. Goal: provide decision instance results in a user defined decision concept. Some tasks: define the decision instance of interest to the user.
Postprocessing. Goal: establish outcomes as desired outcomes. Some tasks: concept validation; decision instance result validation.
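The phase agents of Table 1 can be read as a component pipeline. The Python sketch below chains preprocessing, decomposition, control, decision and postprocessing agents over an input record; the class names mirror the table, but the trivial bodies are only illustrative assumptions, not the authors' implementation.

```python
class PhaseAgent:
    def run(self, data):
        raise NotImplementedError

class PreprocessingAgent(PhaseAgent):
    def run(self, data):                      # noise filtering / input conditioning
        return {k: v for k, v in data.items() if v is not None}

class DecompositionAgent(PhaseAgent):
    def run(self, data):                      # define an orthogonal concept
        return {"concept": "default", "features": data}

class ControlAgent(PhaseAgent):
    def run(self, data):                      # select a decision-level concept
        data["decision_concept"] = "classify"
        return data

class DecisionAgent(PhaseAgent):
    def run(self, data):                      # produce a decision instance result
        data["label"] = "class-A" if len(data["features"]) > 2 else "class-B"
        return data

class PostprocessingAgent(PhaseAgent):
    def run(self, data):                      # validate the decision instance result
        data["validated"] = data["label"] in {"class-A", "class-B"}
        return data

pipeline = [PreprocessingAgent(), DecompositionAgent(), ControlAgent(),
            DecisionAgent(), PostprocessingAgent()]
record = {"x1": 0.3, "x2": None, "x3": 1.2, "x4": 0.7}
for agent in pipeline:
    record = agent.run(record)
print(record)
```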
Fig. 3. Decision Paths Taken by a Medical Practitioner and Agent Sequence (decision paths such as diagnosis to treatment, diagnosis to symptoms and treatment, and symptoms to diagnosis to treatment, with the corresponding sequences of problem solving agents)
Fig. 4. Optimization Level Modelling of an Image Processing Application (an Intelligent Control Agent and a Neural Network Agent coordinating water immersion segmentation, moment invariant transformation and mathematical morphology segmentation agents)
3.1 Some Emerging Characteristics of Soft Computing Agent Ontology
In this section we establish the human-centeredness of the multi-layered agent ontology from two perspectives. Firstly, the multi-layered agent ontology facilitates human involvement, evaluation and feedback at three levels, namely, the task, optimization and technology levels respectively. The constructs used at the task level are based on consistent problem solving structures employed by users in developing solutions to complex problems. Thus the five problem solving agents facilitate the mapping of users' tasks in a systematic and scalable fashion [7]. Further, the component based nature of the five problem solving agents allows them to be used in different problem solving contexts, as shown in Fig. 3. These sequences assist in modeling different problem solving or situational contexts. The optimization level allows an existing technology based solution to be optimized based on human feedback. For example, in an unstained mammalian cell image processing application we employ a neural network (as shown in Fig. 4) for the prediction and optimization of the segmentation quality of images segmented by the morphological and water immersion agents. At the technology level, human evaluation and feedback are modelled by asking the user to define the search parameters or fitness function to be used by a soft computing agent like a GA to create the correct combination of shirt design [7]. Further, the generic tasks of the five problem solving agents have been grounded in the human-centered criteria outlined in the 1997 NSF workshop on human-centered systems. These criteria state that a) human-centered research and design is problem driven rather than driven by logic, theory or any particular technology; b) human-centered research and design focuses on practitioners' goals and tasks rather than system developers' goals and tasks; and c) human-centered research and design is context bound. The context criterion relates to the social/organizational context, the representational context (where people use perceptual as well as internal representations to solve problems) and the problem solving context. The generic tasks employed by the five problem solving agents are based on consistent problem solving structures used by practitioners solving complex problems in engineering, process control, image processing, management and other areas. The component based nature of the problem solving agents shown in Fig. 2 enables them to be used in different sequences and different problem solving contexts. Further, the representation (external and internal representations) and the social and organizational context (not described in this paper) have also been modelled [6, 7]. Additionally, the five layers of the agent ontology lead to component based distributed (collaboration and competition) soft computing system design. The ontology also provides flexibility of technologies, learning and adaptation, and different forms of knowledge (fuzzy, rule based and distributed pattern based) to be used for modelling component based software design.
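As a hedged illustration of technology-level human evaluation in the spirit of interactive evolutionary computation, the sketch below runs a plain genetic loop in which fitness is supplied by a user-rating function (standing in for real human feedback on, e.g., candidate shirt-design combinations). The function and parameter names are assumptions for the example, not the system described above.

```python
import random

def interactive_ga(population, crossover, mutate, user_rating, generations=10):
    """Genetic search in which fitness comes from human evaluation.

    `user_rating(candidate)` stands for the user's score of a candidate design;
    everything else is an ordinary generational GA loop."""
    for _ in range(generations):
        scored = sorted(population, key=user_rating, reverse=True)
        parents = scored[: max(2, len(scored) // 2)]          # keep the best half
        children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                    for _ in range(len(population) - len(parents))]
        population = parents + children
    return max(population, key=user_rating)

# Demo with a simulated preference for high values (stands in for real feedback).
pop = [[random.randint(0, 9) for _ in range(4)] for _ in range(8)]
cross = lambda a, b: [random.choice(pair) for pair in zip(a, b)]
mut = lambda c: [g if random.random() > 0.2 else random.randint(0, 9) for g in c]
print(interactive_ga(pop, cross, mut, user_rating=sum))
```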
4 Conclusion
In this paper we have outlined a multi-layered distributed agent ontology for developing soft computing applications. The ontology provides modelling support at four levels, namely, the task level, optimization level, technology level, and distributed processing level. Further, it takes into consideration pragmatic constraints like human-centeredness, context, distributed cognition, and flexibility of technologies. The ontology has been applied in a range of areas including web mining, image processing, e-commerce, alarm processing, medical diagnosis and others.
References
[1] Bezdek, J.C.: Pattern Recognition with Fuzzy Objective Function Algorithms. Advanced Applications in Pattern Recognition, Plenum Press, USA, 1981.
[2] Bezdek, J.C.: What is Computational Intelligence? In: Computational Intelligence: Imitating Life, Eds. Robert Marks-II et al., IEEE Press, New York, 1994.
[3] Chiaberage, M., Bene, G.D., Pascoli, S.D., Lazzerini, B., Maggiore, A.: Mixing fuzzy, neural & genetic algorithms in integrated design environment for intelligent controllers. 1995 IEEE Int. Conf. on SMC, Vol. 4, pp. 2988-93, 1995.
[4] Khosla, R., Dillon, T.: Learning Knowledge and Strategy of a Generic Neuro-Expert System Architecture in Alarm Processing. IEEE Transactions on Power Systems, Vol. 12, No. 12, pp. 1610-18, Dec. 1997.
[5] Khosla, R., Dillon, T.: Engineering Intelligent Hybrid Multi-Agent Systems. Kluwer Academic Publishers, MA, USA, August 1997.
[6] Khosla, R., Sethi, I., Damiani, E.: Intelligent Multimedia Multi-Agent Systems - A Human-Centered Approach. Kluwer Academic Publishers, MA, USA, October 2000.
[7] Khosla, R., Damiani, E., Grosky, W.: Human-Centered E-Business. Kluwer Academic Publishers, MA, USA, April 2003.
[8] Goldberg, D.E.: Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading, MA, 1989.
[9] Takagi, H.K.: Interactive Evolutionary Computation: Fusion of the Capabilities of EC Optimization and Human Evaluation. Proceedings of the IEEE, Vol. 89, No. 9, September 2001.
[10] Zhang, J., Norman, D.A.: Distributed Cognitive Tasks. Cognitive Science, pp. 84-120, 1994.
A Model for Personality and Emotion Simulation Arjan Egges, Sumedha Kshirsagar, and Nadia Magnenat-Thalmann MIRALab - University of Geneva 24, Rue General-Dufour,1211 Geneva, Switzerland {egges,sumedha,thalmann}@miralab.unige.ch http://www.miralab.ch
Abstract. This paper describes a generic model for personality, mood and emotion simulation for conversational virtual humans. We present a generic model for describing and updating the parameters related to emotional behaviour. Also, this paper explores how existing theories for appraisal can be integrated into the framework. Finally we describe a prototype system that uses the described models in combination with a dialogue system and a talking head with synchronised speech and facial expressions.
1 Introduction
With the emergence of 3D graphics, we are now able to create very believable 3D characters that can move and talk. However, an important part often missing in this picture is the definition of the force that drives these characters: the individuality. In this paper we explore the structure of this entity as well as its link with perception, dialogue and expression. In emotion simulation research so far, appraisal is popularly done by a system based on the OCC model [1]. This model specifies how events, agents and objects from the universe are appraised according to an individual’s goals, standards and attitudes. These three (partly domain-dependent) parameters determine the ‘personality’ of the individual. More recent research indicates that personality can be modelled in a more abstract, domain-independent way [2, 3]. In this paper we will investigate the relationship between such personality models and the OCC model. The effect of personality and emotion on agent behaviour has been researched quite a lot, whether it concerns a general influence on behaviour [4] or a more traditional planning-based method [5]. Various rule based models [6] and probabilistic models [7] have been reported in the past. How behaviour should be influenced by personality and emotion depends on the system that is used, and it is out of the scope of this paper. The effect that personality and emotion have on expression (speech intonation, facial expressions, etc.) is partly dealt with by Kshirsagar et al. in [8], which describes a system that simulates personalized facial animation with speech and expressions, modulated through mood. There have been very few researchers who have tried to simulate mood [9, 10].
Fig. 1. Overview of the emotional state and personality in an intelligent agent framework
Figure 1 depicts how we view the role of personality and emotion as a glue between perception, dialogue and expression. Perceptive data is interpreted on an emotional level by an appraisal model. This results in an emotion influence that determines, together with the personality, what the new emotional state and mood will be. An intelligent agent uses the emotional state, mood and the personality to create behaviour. This paper is organized as follows. Section 2 and Section 3 present the generic personality, mood and emotion model. In Section 4, we investigate the relationship between the OCC model and personality. Finally, we present an example of how to use our model with an expressive virtual character¹.
2 A Generic Personality and Emotion Model
The intention of this section is to introduce the concepts that we will use in our model. In the next section we will explain how these concepts interact. We will first start by defining a small scenario. Julie is standing outside. She has to carry a heavy box upstairs. A passing man offers to help her carry the box upstairs. Julie's personality has a big influence on her perception and on her behaviour. If she has an extravert personality, she will be happy that someone offers her some help. If she has a highly introvert and/or neurotic personality, she will feel fear and distress and she will respond differently. As someone is never 100% extravert or 100% neurotic, each personality factor will have a weight in determining how something is perceived and what decisions are being taken. Surely, Julie's behaviour will not only be based on emotional concepts, but also on intellectual concepts. A dialogue system or intelligent agent will require concrete representations of concepts such as personality, mood and emotional state so that it can decide what behaviour it will portray [4, 14]. As such, we need to define exactly what we mean by personality, mood, and emotion before we can simulate emotional perception, behaviour and expression.
¹ Please read Egges et al. [11] and Kshirsagar et al. [12, 13] for more details.
2.1 Basic Definitions
(1)
Emotional state has a similar structure as personality, but it changes over time. The emotional state is a set of emotions that have a certain intensity. For example, the OCC model defines 22 emotions, while Ekman [15] defines 6 emotions for facial expression classification. We define the emotional state et as an m-dimensional vector, where all m emotion intensities are represented by a value in the interval [0, 1]. A value of 0 corresponds to an absence of the emotion; a value of 1 corresponds to a maximum intensity of the emotion. This vector is given as follows:

eTt = (β1, . . . , βm), ∀i ∈ [1, m] : βi ∈ [0, 1]   if t > 0
eTt = 0                                             if t = 0    (2)
Furthermore, we define an emotional state history ωt that contains all emotional states until et, thus:

ωt = e0, e1, . . . , et    (3)
An extended version of the PE model, the PME model, is given by including mood. We now define the individual It as a triple (p, mt, et), where mt represents the mood at a time t. We define mood as a rather static state of being, that is less static than personality and less fluent than emotions [8]. Mood can be one-dimensional (being in a good or a bad mood) or perhaps multi-dimensional (feeling in love, being paranoid). Whether or not it is justified from a psychological perspective to have a multi-dimensional mood is not in the scope of this paper. However, to increase generality, we will provide for a possibility of having multiple mood dimensions. We define a mood dimension as a value in the interval [−1, 1]. Supposing that there are k mood dimensions, the mood can be described as follows:

mTt = (γ1, . . . , γk), ∀i ∈ [1, k] : γi ∈ [−1, 1]   if t > 0
mTt = 0                                               if t = 0    (4)

Just like for the emotional state, there is also a history of mood, σt, that contains the moods m0 until mt:

σt = m0, m1, . . . , mt    (5)
3 Emotional State and Mood Updating
From perceptive input, an appraisal model such as OCC will obtain emotional information. This information is then used to update the mood and emotional state. We define the emotional information as a desired change in emotion intensity for each emotion, defined by a value in the interval [0, 1]. The emotion information vector a (or emotion influence) contains the desired change of intensity for each of the m emotions:

aT = (δ1, . . . , δm), ∀i ∈ [1, m] : δi ∈ [0, 1]    (6)
The emotional state can then be updated using a function Ψe(p, ωt, a). This function calculates, based on the personality p, the current emotional state history ωt and the emotion influence a, the change of the emotional state. A second part of the emotion update is done by another function, Ωe(p, ωt), that represents internal change (such as a decay of the emotional state). Given these two components, the new emotional state et+1 can be calculated as follows:

et+1 = et + Ψe(p, ωt, a) + Ωe(p, ωt)    (7)
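Section 5 notes that the update mechanisms are implemented with a linear approach using simple matrix computations; as an illustration only, the sketch below instantiates Eq. (7) with an assumed linear Ψe (a personality-dependent gain applied to the emotion influence) and an assumed exponential decay for Ωe. The weight matrix and decay constant are placeholders, not the authors' values.

```python
import numpy as np

def psi_e(p, omega, a, W=None):
    """Assumed linear external update: scale the emotion influence a by a
    personality-dependent gain. W maps the n personality factors to one gain
    per emotion; it is a placeholder, not the authors' matrix."""
    if W is None:
        W = np.ones((len(a), len(p))) / len(p)   # uniform influence of factors
    gain = W @ p                                  # one gain value per emotion
    return gain * a

def omega_e(p, omega, decay=0.1):
    """Assumed internal update: exponential decay of the current state."""
    return -decay * omega[-1]

def update_emotional_state(p, omega, a):
    """One step of Eq. (7): e_{t+1} = e_t + Psi_e(p, w_t, a) + Omega_e(p, w_t)."""
    e_next = omega[-1] + psi_e(p, omega, a) + omega_e(p, omega)
    return np.clip(e_next, 0.0, 1.0)              # intensities stay in [0, 1]

# Example: 5 OCEAN factors, 3 emotions, a joy-like influence on the first emotion.
p = np.array([0.6, 0.5, 0.9, 0.7, 0.2])           # hypothetical personality
omega = [np.zeros(3)]                             # e_0 = 0
a = np.array([0.58, 0.0, 0.0])                    # emotion influence (cf. |JO58|)
omega.append(update_emotional_state(p, omega, a))
print(omega[-1])
```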
In the PME model (which includes the mood), the update process changes slightly. When an emotion influence has to be processed, the update now happens in two steps: the first step updates the mood; the second step updates the emotional state. The mood is updated by a function Ψm(p, ωt, σt, a) that calculates the mood change, based on the personality, the emotional state history, the mood history and the emotion influence. The mood is internally updated using a function Ωm(p, ωt, σt). Thus the new mood mt+1 can be calculated as follows:

mt+1 = mt + Ψm(p, ωt, σt, a) + Ωm(p, ωt, σt)    (8)
The emotional state can then be updated by an extended function Ψe that also takes into account the mood history and the new mood. The internal emotion update, which now also takes mood into account, is defined as Ωe(p, ωt, σt+1). The new emotion update function is given as follows:

et+1 = et + Ψe(p, ωt, σt+1, a) + Ωe(p, ωt, σt+1)    (9)
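A minimal sketch of the two-step PME update of Eqs. (8)-(9), under the same caveat as above: the concrete forms of Ψm, Ωm, Ψe, Ωe and all gain/decay constants are assumptions chosen only to show the mood-then-emotion ordering, not the authors' implementation.

```python
import numpy as np

def pme_step(p, m_hist, e_hist, a,
             mood_gain=0.3, mood_decay=0.05, emo_gain=0.5, emo_decay=0.1):
    """One PME step: update the mood first (Eq. 8), then the emotional state
    using the new mood (Eq. 9). All constants are illustrative placeholders."""
    m_t, e_t = m_hist[-1], e_hist[-1]
    # Eq. (8): assumed Psi_m pushes the mood toward the mean influence,
    # assumed Omega_m slowly decays the mood toward neutral (0).
    m_next = np.clip(m_t + mood_gain * a.mean() - mood_decay * m_t, -1.0, 1.0)
    m_hist.append(m_next)
    # Eq. (9): assumed Psi_e scales the influence by the personality mean and
    # by the new mood (amplify when positive, damp when negative);
    # assumed Omega_e is a plain decay of the emotional state.
    gain = emo_gain * (0.5 + 0.5 * p.mean()) * (1.0 + 0.5 * float(m_next.mean()))
    e_next = np.clip(e_t + gain * a - emo_decay * e_t, 0.0, 1.0)
    e_hist.append(e_next)
    return m_next, e_next

# One mood dimension, three emotions, a joy-like influence:
m_hist, e_hist = [np.zeros(1)], [np.zeros(3)]
pme_step(np.array([0.6, 0.5, 0.9, 0.7, 0.2]), m_hist, e_hist,
         np.array([0.58, 0.0, 0.0]))
print(m_hist[-1], e_hist[-1])
```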
4 Personality, Emotion and the OCC Model of Appraisal
As OCC is the most widely accepted appraisal model, we will present some thoughts about how to integrate personality models with OCC. OCC uses goals, standards and attitudes. These three notions are for a large part domain dependent. As multi-dimensional personality models are domain-independent, we need to define the relationship between this kind of personality model and the personality model as it is used in OCC. We choose an approach where we assume that the goals, standards and attitudes in the OCC model are fully defined depending on the domain. Our personality model then serves as a selection criterion that indicates what and how many goals, standards and attitudes fit with the personality. Because the OCEAN model is widely accepted, we will use this model to illustrate our approach. For an overview of the relationship between OCEAN and the goals, standards, and attitudes, see Table 1. The intensity of each personality factor will determine the final effect on the goals, standards and attitudes.
5 Application
5.1 Overview
In order to demonstrate the model in a conversational context, we have built a conversational agent represented by a talking head. The update mechanisms of emotions and mood are implemented using a linear approach (using simple matrix representations and computations). The center of the application is a dialogue system, using Finite State Machines, that generates different responses
Table 1. Relationship between OCEAN and OCC parameters (columns: Goals, Standards, Attitudes)
Openness: Standards: making a shift of standards in new situations; Attitudes: attitude towards new elements
Conscientiousness: Goals: abandoning and adopting goals, determination
Extraversion: Goals: willingness to communicate
Agreeableness: Goals: abandoning and adopting goals in favor of others; Standards: compromising standards in favor of others; Attitudes: adaptiveness to other people
Neuroticism: (no entry)
based on the personality, mood and emotional state. The personality and emotion model is implemented using the 5 factors of the OCEAN model of personality, one mood dimension (a good-bad mood axis) and the 22 emotions from OCC plus 2 additional emotions (disgust and surprise) to have maximum flexibility of facial expression. A visual front-end uses the dialogue output to generate speech and facial expressions. The dialogue system annotates its outputs with emotional information. An example of such a response is (the tag consists of a joy emotion percentage of 58): |JO58|Thanks! I like you too!
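As a small illustration of how such annotated responses could be consumed by a front-end, the sketch below splits a response into (emotion, intensity) tags and plain spoken text. The regular expression and the code-to-emotion mapping are assumptions; only the "JO" (joy) code and the percentage convention are taken from the example above.

```python
import re

# Only the JO (joy) code appears in the paper's example; other codes would be added here.
EMOTION_CODES = {"JO": "joy"}

TAG = re.compile(r"\|([A-Z]{2})(\d{1,3})\|")

def parse_response(response):
    """Split '|JO58|Thanks! I like you too!' into emotion tags and spoken text."""
    tags = [(EMOTION_CODES.get(code, code), int(value) / 100.0)
            for code, value in TAG.findall(response)]
    text = TAG.sub("", response).strip()
    return tags, text

print(parse_response("|JO58|Thanks! I like you too!"))
# ([('joy', 0.58)], 'Thanks! I like you too!')
```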
5.2 Visual Front-End
Our visual front-end comprises a 3D talking head capable of rendering speech and facial expressions in synchrony with synthetic speech. The facial animation system interprets the emotional tags in the responses, generates lip movements for the speech and blends the appropriate expressions for rendering in real-time with lip synchronization2. Facial dynamics are considered during the expression change, and appropriate temporal transition functions are selected for facial animation. We use MPEG-4 Facial Animation Parameters as low level facial deformation parameters [12]. However, for defining the visemes and expressions, we use the Principal Components (PCs) [13]. The PCs are derived from the statistical analysis of the facial motion data and reflect independent facial movements observed during fluent speech. The use of PCs facilitates realistic speech animation, especially when blended with various facial expressions. We use available text-to-speech (TTS) software that provides phonemes with temporal information. The co-articulation rules are applied based on the algorithm of Cohen et al. [16]. The expressions are embedded in the text in terms of tags 2
For more details on expressive speech animation, see Kshirsagar et al. [13].
Fig. 2. Julie’s behaviour as an extravert (a) and an introvert (b) personality
and associated intensities. An attack-sustain-decay-release type of envelope is applied for these expressions and they are blended with the previously calculated co-articulated phoneme trajectories. Periodic eye-blinks and minor head movements are applied to the face for increased believability. Periodic display of facial expression is also incorporated, which depends on the recent expression displayed with the speech, as well as the mood of the character.
5.3 Example Interaction
As an example, we have developed a small interaction system that simulates Julie's behaviour. We have performed these simulations with different personalities, which gives different results in the interaction and the expressions that the face is showing. The interaction that takes place for an extravert personality (90% extravert) is shown in Figure 2(a). An introvert personality (5% extravert) significantly changes the interaction (see Figure 2(b)).
6 Conclusions and Future Work
In this paper we have presented a basic framework for personality and emotion simulation. Subsequently we have shown how this framework can be integrated with an application such as an expressive MPEG-4 talking head with speech synthesis. Our future work will focus on user studies to validate the personality
and emotion model that is used. We will also explore the possibility of having multiple mood dimensions. Furthermore we will explore how personality and emotion are linked to body behaviour and what computational methods are required to simulate this link.
Acknowledgment This research has been funded through the European Project MUHCI (HPRN-CT-2000-00111) by the Swiss Federal Office for Education and Science (OFES).
References
[1] Ortony, A., Clore, G.L., Collins, A.: The Cognitive Structure of Emotions. Cambridge University Press (1988)
[2] Eysenck, H.J.: Biological dimensions of personality. In Pervin, L.A., ed.: Handbook of personality: Theory and research. New York: Guilford (1990) 244–276
[3] Costa, P.T., McCrae, R.R.: Normal personality assessment in clinical practice: The NEO personality inventory. Psychological Assessment (1992) 5–13
[4] Marsella, S., Gratch, J.: A step towards irrationality: Using emotion to change belief. In: Proceedings of the 1st International Joint Conference on Autonomous Agents and Multi-Agent Systems, Bologna, Italy (2002)
[5] Johns, M., Silverman, B.G.: How emotions and personality effect the utility of alternative decisions: a terrorist target selection case study. In: Tenth Conference on Computer Generated Forces and Behavioral Representation (2001)
[6] André, E., Klesen, M., Gebhard, P., Allen, S., Rist, T.: Integrating models of personality and emotions into lifelike characters. In: Proceedings International Workshop on Affect in Interactions. Towards a New Generation of Interfaces (1999)
[7] Ball, G., Breese, J.: Emotion and personality in a conversational character. In: Proceedings of the Workshop on Embodied Conversational Characters (1998) 83–84 and 119–121
[8] Kshirsagar, S., Magnenat-Thalmann, N.: A multilayer personality model. In: Proceedings of 2nd International Symposium on Smart Graphics, ACM Press (2002) 107–115
[9] Velásquez, J.: Modeling emotions and other motivations in synthetic agents. In: Proceedings of AAAI-97, MIT Press (1997) 10–15
[10] Moffat, D.: Personality parameters and programs. In: Lecture Notes in Artificial Intelligence: Creating Personalities for Synthetic Actors: Towards Autonomous Personality Agents (1995)
[11] Egges, A., Kshirsagar, S., Zhang, X., Magnenat-Thalmann, N.: Emotional communication with virtual humans. In: The 9th International Conference on Multimedia Modelling (2003)
[12] Kshirsagar, S., Garchery, S., Magnenat-Thalmann, N.: Feature Point Based Mesh Deformation Applied to MPEG-4 Facial Animation. In: Deformable Avatars. Kluwer Academic Publishers (2001) 33–43
[13] Kshirsagar, S., Molet, T., Magnenat-Thalmann, N.: Principal components of expressive speech animation. In: Proceedings Computer Graphics International (2001) 59–69
[14] Elliott, C.D.: The affective reasoner: a process model of emotions in a multi-agent system. PhD thesis, Northwestern University, Evanston, Illinois (1992)
[15] Ekman, P.: Emotion in the Human Face. Cambridge University Press, New York (1982)
[16] Cohen, M.M., Massaro, D.: Modelling co-articulation in synthetic visual speech. Springer-Verlag (1993) 139–156
Using Loose and Tight Bounds to Mine Frequent Itemsets Lei Jia, Jun Yao, and Renqing Pei School of Mechatronics and Automation Shanghai University, Shanghai, 200072, China {jialei7829,junyao0529,prq44}@hotmail.com
Abstract. Mining frequent itemsets forms a core operation in many data mining problems. The operation, however, is data intensive and produces a large output, and the database has to be scanned many times. In this paper, we propose to use loose and tight bounds to mine frequent itemsets. We use loose bounds to remove the candidate itemsets whose support cannot satisfy the preset threshold. Then, we check whether we can determine the frequency of the remaining candidate itemsets with the tight bounds. For the itemsets that cannot be decided this way, we scan the database. Using this new method, we can decrease not only the number of candidate frequent itemsets that have to be tested, but also the number of database scans.
1 Introduction
Data mining aims to efficiently discover interesting rules from large collections of data. Mining frequent itemsets forms a core operation in many data mining problems. The frequent itemset problem is stated as follows. Assume we have a finite set of items I. A transaction is a subset of I, together with a unique identifier. A database D is a finite set of transactions. A subset of I is called an itemset. The support of an itemset equals the fraction of the transactions in D that contain it. If the support is above the preset threshold, we call it a frequent itemset. Many researchers have devoted themselves to designing efficient structures or algorithms [1,2,3,4,5,6,7,8] to mine frequent itemsets. In these algorithms, a lot of time is often spent in scanning the database. If we can optimize the scanning process, we will improve the efficiency of the mining dramatically. In this paper, unlike the existing algorithms, we propose to use loose and tight bounds to mine the frequent itemsets. We use loose bounds to remove the candidate itemsets whose support cannot satisfy the preset threshold. Then, we check whether we can determine the frequency of the remaining candidate itemsets with the tight bounds. For the itemsets that cannot be decided this way, we scan the database. In Section 2, we introduce loose bounds and tight bounds. In Section 3, we show how to use loose and tight bounds to mine frequent itemsets with the LTB algorithm (loose
and tight bounds based algorithm). We present the experimental results in Section 4, and Section 5 gives a brief conclusion.
2 Loose Bounds and Tight Bounds
2.1 Loose Bounds
Suppose we have two itemsets A, B and their supports sup(A), sup(B). Minsup is the preset support threshold. According to the Apriori trick, if at least one of them is not frequent, we need not consider AB. If A and B are both frequent, we have to generate and test the candidate frequent itemset AB. Not only do we generate it, but we also scan the database to determine its frequency. We know sup(AB) ∈ [max(0, sup(A)+sup(B)−1), min(sup(A), sup(B))]. Because we have already obtained sup(A) and sup(B) at the previous level, we can easily deduce that AB is a frequent itemset if max(0, sup(A)+sup(B)−1) ≥ minsup. In the same way, if min(sup(A), sup(B)) < minsup, we conclude it is not a frequent itemset. We call this bound a loose bound. We define a symbol ∇ to denote the cross union between the itemsets A and B. Given a database D and s1 and s2 as the minimum and maximum support levels, we define Fs as the itemsets with frequency above s, and Fs1,s2 as the frequent itemsets occurring in at least s1, but less than s2, percent of the transactions in D. We then have Fs1 \ Fs2 ⊆ Fs1,s2, and also Fs1,s2 ∪ Fs2 ⊆ Fs1. Now we can apply the definition to the loose bound. We let ∆s = s1 + s2 − 1; if ∆s > 0,
then Fs1,s2 ∇ Fs3,s4 ⊆ F∆s,min(s2,s4). For example, suppose sup(A)=60%, sup(B)=70%, sup(C)=10%, sup(D)=30% and minsup=20%. We can easily get 30% ≤ sup(AB) ≤ 60%, 0 ≤ sup(AC) ≤ 10%, and 0 ≤ sup(AD) ≤ 30%. So we conclude that itemset AB must be a frequent itemset, AC cannot be a frequent itemset, and we do not know whether AD is frequent or not. In the existing algorithms, we would have to scan the database for itemsets like AB and AD. But now, we can use the following tight bounds to determine the frequency of itemsets like AB.
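A minimal sketch of the loose-bound test described above: given the supports of A and B from the previous level, classify the candidate AB as surely frequent, surely infrequent, or undecided (to be resolved by the tight bounds or a database scan). The function name is ours; it is an illustration of the rule, not code from the paper.

```python
def loose_bound_status(sup_a, sup_b, minsup):
    """Loose bound: max(0, sup(A)+sup(B)-1) <= sup(AB) <= min(sup(A), sup(B))."""
    lower = max(0.0, sup_a + sup_b - 1.0)
    upper = min(sup_a, sup_b)
    if lower >= minsup:
        return "frequent"        # e.g. A=60%, B=70%, minsup=20% -> lower bound 30%
    if upper < minsup:
        return "infrequent"      # e.g. A=60%, C=10% -> upper bound 10%
    return "unknown"             # e.g. A=60%, D=30% -> bounds [0%, 30%], check further

print(loose_bound_status(0.6, 0.7, 0.2))  # frequent
print(loose_bound_status(0.6, 0.1, 0.2))  # infrequent
print(loose_bound_status(0.6, 0.3, 0.2))  # unknown
```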
2.2 Tight Bounds
We find that, using the tight bounds [9,10], the number of database scans and the number of itemsets that have to be counted can be reduced significantly. Consider the following example. Suppose we know sup(A)=sup(B)=sup(C)=2/3, sup(AB)=sup(AC)=sup(BC)=1/3 and minsup=1/3. We cannot judge whether the itemset ABC is frequent or not with the idea used in Apriori-like algorithms. However, we can calculate its frequency with the following inequalities.
We get the following results:
So we conclude that sup(ABC)=0. We find that we can determine the frequency without scanning the database. This method is called tight bounds. We define tx as the frequency of the transactions that contain exactly the itemset x. For example, tAB means the frequency of transactions in the database that contain exactly AB, and it is independent of tA and tB. Then sup(A)=tA+tAB+tAC+…+tABC+tABD+…+tI. Then we have the following equalities.
We know that each tx is at least zero. So we can solve the above equalities recursively from (1) to (N) to get the following solutions.
We find that if we know the supports of all the subsets of an itemset I, we can accurately calculate the lower and upper limits of its support with the above solutions. If the lower and upper limits are the same, we can determine its frequency without scanning the database. We prune the supersets of I with the downward-closed property (Apriori trick). When we cannot decide the support because the lower and upper limits of the tight bounds are not the same, we have to scan the database for those itemsets. We also scan the database for the itemsets that the loose bounds could not decide, like AD in the last subsection.
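The displayed (in)equalities of this subsection did not survive extraction, but the worked example can be reproduced with the inclusion-exclusion ("non-derivable itemset") bounds of [9,10]. The brute-force sketch below derives sup(ABC)=0 from the given subset supports; it illustrates the idea and is not the authors' exact derivation.

```python
from fractions import Fraction
from itertools import combinations

def deduction_bounds(items, sup):
    """Tight bounds on sup(items) deduced from the supports of all proper
    subsets (dict: frozenset -> support), following the idea of [9,10]."""
    I = frozenset(items)
    lower, upper = 0, 1
    for k in range(len(I)):                       # every proper subset X of I
        for X in map(frozenset, combinations(sorted(I), k)):
            bound = 0
            for j in range(k, len(I)):            # every J with X <= J < I
                for J in map(frozenset, combinations(sorted(I), j)):
                    if X <= J:
                        sign = 1 if (len(I) - len(J)) % 2 == 1 else -1
                        bound += sign * sup[J]
            if (len(I) - len(X)) % 2 == 1:
                upper = min(upper, bound)         # odd |I \ X| gives an upper bound
            else:
                lower = max(lower, bound)         # even |I \ X| gives a lower bound
    return lower, upper

# Example from the text: sup(A)=sup(B)=sup(C)=2/3, sup(AB)=sup(AC)=sup(BC)=1/3.
sup = {frozenset(): Fraction(1)}
sup.update({frozenset(x): Fraction(2, 3) for x in "ABC"})
sup.update({frozenset(p): Fraction(1, 3) for p in combinations("ABC", 2)})
print(deduction_bounds("ABC", sup))   # lower == upper == 0, so sup(ABC) = 0
```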
3 LTB Algorithm (Loose and Tight Bounds Based Algorithm)
Through the above analysis, we find that if we make full use of the loose and tight bounds, we can decrease the number of itemsets that need to be tested and the number of times we have to scan the database. So we propose the LTB algorithm in this section.
LTB algorithm:
Input: database D, minsup and t (the highest level at which we use the loose bounds);
Output: frequent itemsets L;
1) L = ∅;
2) L1 = {frequent 1-itemsets};
3) L1 = order L1 according to their frequency in ascending order;
4) T[1] = get_filtered(D, L1);
5) for (k=2; Lk-1 ≠ ∅; k++)
6)   Ck = apriori_gen(Lk-1);
7)   if (k ≤ t)
8)     use the loose bounds to get L' whose support is above minsup;
9)     L'' = Ck - L';
10)    calculate the support of itemsets M1 (M1 ⊆ L') with the tight bounds;
11)    if the lower and upper limits are the same
12)      M2 = L' - M1;
13)      Lk = Lk ∪ M1;
14)    end;
15)    scan the database T[1] for the itemsets x (x ∈ L'' ∪ M2);
16)    if sup(x) ≥ minsup
17)      Lk = Lk ∪ x;
18)  else
19)    calculate the support of itemsets in Ck with the tight bounds;
20)    if the lower and upper limits of M3 (M3 ⊆ Ck) are the same
21)      M4 = Ck - M3;
22)      Lk = Lk ∪ M3;
23)    end;
24)    scan the database T[1] for the itemsets y in M4;
25)    if sup(y) ≥ minsup
26)      Lk = Lk ∪ y;
27)  end
28) end
Answer L = ∪k Lk;
The algorithm can be divided into three parts. From line 1 to line 4, we find the frequent itemsets at level 1 and use them to filter the database D into T[1]. If the database has infrequent 1-itemsets, this filtering can shrink the database significantly.
From line 5 to line 17, we use the loose and tight bounds to mine the frequent itemsets below level t, where t is the preset highest level at which we use the loose bounds. We utilize the apriori_gen function of the traditional Apriori algorithm to make full use of the downward-closed property and prune the search space. We use the loose bounds to select the itemsets whose frequency needs to be calculated with the tight bounds. Some of these have the same lower and upper limits, so we can determine their supports. We scan the database for the others, and for the itemsets that the loose bounds could not decide. The last part is from line 18 to the end. In this part, because the loose bounds are too loose, we only calculate the supports with the tight bounds. We scan the database again if we cannot determine the support this way. Using this algorithm, we can use the loose and tight bounds to mine frequent itemsets efficiently.
4 Experiment
To study the effectiveness and efficiency of the algorithm proposed in the above section, we implemented it in VB and tested it on a 1GHz Pentium PC with 128 megabytes of main memory. The test consists of a set of synthetic transaction databases generated using a randomized itemset generation algorithm similar to the algorithm described in [2]. Table 1 shows the databases used, in which N is the average size of the itemsets and T is the size of the item set we randomly choose from.
Table 1.
Database  N  T  Size of transactions
1         2  5  100K
2         3  5  100K
Fig. 1. Database1, N=2, MinSup=20% (running time in ms of Apriori and LTB vs. size of transactions in K)
Fig. 2. Database2, N=3, MinSup=20% (running time in ms of Apriori and LTB vs. size of transactions in K)
Fig. 3. Database1, N=2, MinSup=40% (running time in ms of Apriori and LTB vs. size of transactions in K)
Fig. 4. Database2, N=3, MinSup=40% (running time in ms of Apriori and LTB vs. size of transactions in K)
We compare our LTB algorithm with the classic Apriori algorithm [2]. Figure 1 and Figure 2 show the running time of the two approaches as a function of the number of transactions in database1 and database2, where the support threshold MinSup is 20%. In the same way, we obtain Figure 3 and Figure 4 when we change MinSup to 40%. When N=2, the LTB algorithm is about 6% (MinSup=20%) or 3% (MinSup=40%) less efficient than the Apriori algorithm. The reason is that we cannot find itemsets that satisfy the tight bounds, so we have to scan the database for all of them, and the time spent in calculating the tight bounds is wasted. When N=3, the LTB approach is about 20% (MinSup=20%) or 12% (MinSup=40%) more efficient, because we can determine the support of some itemsets without scanning the database. In practice, there are many candidate frequent itemsets and the database is very large, so we can save more time if we use the LTB algorithm.
5 Conclusion
We propose a new method, loose and tight bounds, to mine frequent itemsets in this paper. The algorithm we propose is called the LTB algorithm (loose bounds and tight bounds based algorithm). The loose bounds help us narrow the search space, and the tight bounds allow us to obtain the frequency of some itemsets without scanning the database. We also use the ordering and the Apriori trick in the algorithm to make the mining process efficient. The detailed experiments demonstrate the effectiveness of our method.
References
[1] R. Agrawal, T. Imielinski and A. Swami. Mining association rules between sets of items in large databases. SIGMOD (1993) 207–216
[2] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB (1994) 487–499
[3] J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. VLDB (1995) 420–431
[4] R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item constraints. In Proc. 1997 Int. Conf. Knowledge Discovery and Data Mining (KDD'97) 67–73
[5] J. Pei, J. Han, and L.V.S. Lakshmanan. Mining frequent itemsets with convertible constraints. In ICDE'01, 323–332
[6] J. Han, J. Pei and Y. Yin. Mining frequent patterns without candidate generation. In SIGMOD'00, 1–12
[7] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Efficient mining of association rules using closed itemset lattices. Information Systems (1999) 25–46
[8] J.-F. Boulicaut and A. Bykowski. Frequent closures as a concise representation for binary data mining. In Proceedings of the Fourth Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'00) 62–73
[9] T. Calders. Deducing bounds on the frequency of itemsets. In EDBT Workshop DTDM Database Techniques in Data Mining, 2002
[10] T. Calders and B. Goethals. Mining all non-derivable frequent itemsets. Proceedings of the 6th European Conference on Principles of Data Mining and Knowledge Discovery, 2002
Mining Association Rules with Frequent Closed Itemsets Lattice Lei Jia, Jun Yao, and Renqing Pei School of Mechatronics and Automation Shanghai University, Shanghai, 200072, China {jialei7829,junyao0529,prq44}@hotmail.com
Abstract. One of the most important tasks in the field of data mining is the problem of finding association rules. In the past few years, frequent closed itemsets mining has been introduced. It is a condensed representation of the data and generates a small set of rules without information loss. In this paper, based on the theory of Galois connections, we introduce a new framework called the frequent closed itemsets lattice. Compared with the traditional itemsets lattice, it is simpler and only contains the itemsets that can be used to generate association rules. Using this framework, we obtain the support of frequent itemsets and mine association rules directly. We also extend it to the fuzzy frequent closed itemsets lattice, which is more efficient at the expense of precision.
1 Introduction
Data mining aims to extract previously unknown and potentially useful information from large databases. The problem of mining association rules [1,2] has been the subject of numerous studies. However, the traditional methods generate too many frequent itemsets: if the size of the largest itemset is N, the candidate frequent itemset space has size 2^N. It is a complex and nontrivial task. In the past few years, based on the theory of Formal Concept Analysis, a new technology, frequent closed itemsets mining, has been introduced [3,4,5,6,7]. It extracts a particular subset of the frequent itemsets that can regenerate the whole set without information loss. In this paper, we introduce a new framework, called the Frequent Closed Itemsets Lattice. We use an Apriori-style algorithm to build the framework. It makes full use of the theory of Galois connections and only contains the itemsets that can be used to form association rules. We also extend the frequent closed itemsets lattice to the fuzzy frequent closed itemsets lattice, which is more efficient at the expense of information loss. In Section 2, we present the theoretical basis of the frequent closed itemsets lattice. In Section 3, we introduce the Apriori-style FCIL algorithm to build the new framework. In Section 4, we discuss how to generate the informative association rules under the
framework. In section 5, we extend the framework to fuzzy frequent closed itemsets lattice. We present our experiments in section 6 and section 7 is a brief conclusion.
2 Theoretical Basis
In this section, we define the data mining context, Galois connection, Galois closure operator, closed itemsets and the frequent closed itemsets lattice [8].
Definition 1 (Data mining context). A data mining context is a triple D = (Γ, τ, R). The rows are the transactions Γ and the columns are the items τ. R ⊆ Γ × τ is a relation between the transactions and the items.
Definition 2 (Galois connection). Let D = (Γ, τ, R) be a data mining context. For T ⊆ Γ and I ⊆ τ, we define:
p(T): 2^Γ → 2^τ, p(T) = {i ∈ τ | ∀t ∈ T, (t,i) ∈ R}
q(I): 2^τ → 2^Γ, q(I) = {t ∈ Γ | ∀i ∈ I, (t,i) ∈ R}
Because 2^Γ and 2^τ are the power sets of Γ and τ, the pair (p, q) forms a Galois connection. The connection has the following properties:
I1 ⊆ I2 → q(I1) ⊇ q(I2)
T1 ⊆ T2 → p(T1) ⊇ p(T2)
T ⊆ q(I) → I ⊆ p(T)
Definition 3 (Galois closure operator). Given the Galois connection, if we choose h(I) = p(q(I)) and g(T) = q(p(T)), then for I, I1, I2 ⊆ τ and T, T1, T2 ⊆ Γ we have the following properties:
Extension: I ⊆ h(I), T ⊆ g(T)
Idempotency: h(h(I)) = h(I), g(g(T)) = g(T)
Monotonicity: I1 ⊆ I2 → h(I1) ⊆ h(I2), T1 ⊆ T2 → g(T1) ⊆ g(T2)
Definition 4 (Closed itemsets and frequent closed itemsets). The closure of an itemset I is the maximal superset of I which has the same support as I. A closed itemset C is an itemset that is equal to its closure. If its support is higher than the minimum support, a closed itemset is called a frequent closed itemset.
Definition 5 (Frequent closed itemsets lattice). We use all the closed itemsets C' to build a complete lattice (C', ≤) called the frequent closed itemsets lattice.
The lattice has the following properties: 1) The closure lattice is simpler than the frequent itemsets lattice; it is composed of the frequent closed itemsets. 2) All sub-itemsets of a frequent closed itemset in the closure lattice are frequent (anti-monotonicity property). 3) We can get the association rules directly from the lattice, as shown in Section 4.
3 Building Frequent Closed Itemsets Lattice with Apriori-Style FCIL Algorithm
Considering that the frequent closed itemsets lattice has the anti-monotonicity property, we build the framework with an Apriori-style algorithm. Suppose we have the 6 transactions in Table 1. The second column gives the items contained in each transaction and the class label is in the third column. Now, we start to build the lattice. Generally, itemsets are sorted in lexicographic order, but we find that if we put items in a specific order, it improves the efficiency of the process. We use the support-based order introduced in paper [7]. We also sort the frequent 1-items in support-based ascending order and use the same Itemset-Transaction structure. Table 1.
Tid  Item   Class
1    ACTW   1
2    CDW    2
3    ACTW   1
4    ACDW   1
5    ACDTW  2
6    CDT    2
Fig. 1. Frequent closed itemsets lattice
The level-wise algorithms like Apriori [2] have been proved to be among the most efficient algorithms in the field of association rule mining. So we propose the Apriori-style FCIL algorithm (Frequent Closed Itemsets Lattice algorithm).
The FCIL algorithm:
Input: T[1] (in ascending order), MinSup the minimum support threshold, and function t(x) giving the transactions that contain x.
Output: frequent closed itemsets lattice
1) L1 = {large 1-itemsets X1 × t(X1) in ascending order};
2) T[2] = order T[1] according to L1;
3) construct the lattice of level 1;
4) for (k=2; Lk-1 ≠ ∅; k++)
5)   Ck = apriori_gen(Lk-1) where Lk = Lk-1 ∪ Lk-1, t(Lk) = t(Lk-1) ∩ t(Lk-1);
6)   forall transactions t ∈ T[2]
7)     Ct = subset(Ck, t);
8)     forall candidates c ∈ Ct do
9)       c.count++;
10)    end
11)  end
12)  Lk = {c ∈ Ck | c.count ≥ MinSup};
13)  if t(x) = t(y) where x ∈ Lk, y ∈ Lk-1
14)    add the item to the lattice and connect them;
15)  else add the item to the lattice;
16)  end
17) end
return the frequent closed itemsets lattice
Using the FCIL algorithm, we can easily read the frequent closed itemset collections from the lattice. Each {} below contains the frequent itemsets that have the same closure, and the transactions that contain these itemsets are shown in ( ). They are {D, DC (2456)}, {DW, DWC (245)}, {T, TC (1356)}, {TA, TW, TAW, TAC, TWC (135)}, {A, AW, AC, AWC (1345)}, {W, WC (12345)} and {C (123456)}.
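The closure groups listed above can be checked directly against Table 1 with a brute-force implementation of the Galois operators of Section 2 (one mapping sends an itemset to its tidset, and the closure is the intersection of the supporting transactions). This is a naive verification sketch, not the FCIL algorithm itself.

```python
from itertools import combinations

# Table 1 (tid -> items); the class labels are not needed for the closures.
db = {1: "ACTW", 2: "CDW", 3: "ACTW", 4: "ACDW", 5: "ACDTW", 6: "CDT"}

def tids(itemset):
    """q(I): transactions containing every item of the itemset."""
    return frozenset(t for t, items in db.items() if set(itemset) <= set(items))

def closure(itemset):
    """h(I) = p(q(I)): intersection of the supporting transactions."""
    supporting = [set(db[t]) for t in tids(itemset)]
    return frozenset(set.intersection(*supporting)) if supporting else frozenset()

# Group all itemsets (over the items that occur) by their closure and tidset.
items = sorted(set("".join(db.values())))
groups = {}
for k in range(1, len(items) + 1):
    for cand in combinations(items, k):
        if tids(cand):
            groups.setdefault((closure(cand), tids(cand)), []).append("".join(cand))

for (closed, ts), members in sorted(groups.items(), key=lambda g: sorted(g[0][1])):
    print("".join(sorted(closed)), sorted(ts), members)
# e.g. the group with tids {2,4,5,6} contains D and DC, whose closure is CD.
```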
4 Forming Association Rules
After building the frequent closed itemsets lattice, we try to form association rules from the structure directly. Association rules can be divided into two classes. For a rule I1 → I2−I1, if both I1 and I2 are in the same closure, the confidence of the rule is 1; these are the so-called exact rules. If I1 and I2 are in different closures, the rule may belong to the approximate rules, and we should compare its confidence with the preset confidence threshold to determine whether it is one of our targeted association rules. We use ⇒ and → to denote exact and approximate rules respectively,
and we also present the confidence of an approximate rule in brackets behind the rule. Using the following inference technology, we elicit the informative rules directly. 1) The informative rules are those from which we can infer the other rules. We should pay attention to the following cases: i. If the antecedents of two rules are the same, the one with the largest consequent is the informative one. ii. If the conjunctions of the antecedent and consequent of two rules are the same, the rule with the smallest antecedent is the informative rule. 2) According to the Guigues-Duquenne basis and the Luxenburger basis [8], we can prune some rules. (1) Guigues-Duquenne basis: i. X ⇒ Y, W ⇒ Z entail XW ⇒ YZ. ii. X ⇒ Y, Y ⇒ Z entail X ⇒ Z. We can also get: X → Y, Y ⇒ Z entail X → Z. (2) Luxenburger basis: i. The association rule X → Y has the same support and confidence as the rule close(X) → close(Y), where close(X) means the closed itemset of X. ii. For any three closed itemsets I1, I2 and I3 such that I1 ⊆ I2 ⊆ I3, the confidence of the rule I1 → I3 is equal to the product of the confidences of the rules I1 → I2 and I2 → I3, and its support is equal to the support of the rule I2 → I3. (3) When two rules are equipotent (the support of the antecedent and of the conjunction of the antecedent and consequent of the two rules are the same), we can delete one according to the ascending order. We get the informative approximate rules and exact rules respectively, then compare them with the above principles again to get the final results. According to the transactions in Table 1, we get the final informative rules: C → W (5/6), W → AC (0.8), D ⇒ C, T ⇒ C and TW ⇒ AC. Note that, unlike the traditional algorithms, we do not generate a rule like W → A and then prune it when we find W → AC. Based on the frequent closed itemsets lattice and the above inference technology, we get W → AC directly (the rule is determined by the smallest itemset in one closure together with the largest itemset in another closure). We are sure it will cover the information of rules like W → A. Through analyzing the example, we find that if we combine the frequent closed itemsets lattice and the inference technology, we can mine association rules efficiently.
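As a small check of the rules quoted above, the sketch below derives the rule confidences directly from tidsets of the toy database (an exact rule holds when the antecedent's supporting transactions are all contained in the consequent's). It only verifies the quoted numbers; the informative-rule pruning itself is not implemented here.

```python
db = {1: "ACTW", 2: "CDW", 3: "ACTW", 4: "ACDW", 5: "ACDTW", 6: "CDT"}

def tids(itemset):
    return {t for t, items in db.items() if set(itemset) <= set(items)}

def confidence(antecedent, consequent):
    """conf(X -> Y) = sup(X u Y) / sup(X), computed on tidsets."""
    return len(tids(antecedent) & tids(consequent)) / len(tids(antecedent))

print(confidence("C", "W"))    # 5/6 ~ 0.83, the approximate rule C -> W
print(confidence("W", "AC"))   # 0.8, the approximate rule W -> AC
print(confidence("D", "C"))    # 1.0, i.e. the exact rule D => C
print(confidence("TW", "AC"))  # 1.0, i.e. the exact rule TW => AC
```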
5 Fuzzy Frequent Closed Itemsets Lattice
In reality, the more correlated the data is, the more association rules we will find. In such cases, the collection of frequent closed itemsets is much more compact than the collection of frequent itemsets. However, we often find the following situation: itemsets A and B
both occur in almost the same transactions, with small exceptions. Case 1: |A|=|B|, meaning they are at the same level. Case 2: |A|=|B|−n, meaning A is a subset of B. In the above definition, if frequent itemsets have the same support (occur in the same transactions), they are in the same frequent closed itemset group. The frequent closed itemsets are a concise representation. However, in many applications, it costs a lot of time and memory. So we propose to loosen the concise representation to an ε-adequate representation; we can gain efficiency at the expense of precision. We introduce the notion of fuzzy frequent closed itemsets. Formally, if an itemset X occurs in t transactions of the database, we say that the itemset Y is in the same fuzzy frequent closed itemset group as X if the difference between sup(X) and sup(X ∪ {Y}) is less than the threshold δ. If δ = 0, the fuzzy frequent closed itemsets degenerate to frequent closed itemsets. The fuzzy frequent closed itemsets form the fuzzy frequent closed itemsets lattice. In the FCIL algorithm, if we change step 13 to step 13': if |t(x) − t(y)| ≤ δ where x ∈ Lk-1, y ∈ Lk, we can construct the fuzzy frequent closed itemsets lattice as well.
6 Experiment
To study the effectiveness and efficiency of the algorithm proposed in the above section, we implemented it in Basic and tested it on a 1GHz Pentium PC with 128 megabytes of main memory. The test consists of a set of synthetic transaction databases generated using a randomized itemset generation algorithm similar to the algorithm described in [2]. The average size of the itemsets N is 3, the size of the item set T we randomly choose from is 5, and the size of the database is 10K. The minimum support threshold is 70%. Using a traditional algorithm like Apriori [2], we have 35 connections if we want to build the frequent itemsets lattice. When we use the FCIL algorithm to build the frequent closed itemsets lattice, the number reduces to 17. After building the lattice, we begin to mine the association rules. With the Apriori algorithm, we first generate candidate rules and then use the inference technology to prune the useless rules. If we have already built the frequent closed itemsets lattice, we can directly induce the exact association rules and the approximate association rules respectively; we only use the inference technology to find whether some exact rules can be covered by some approximate rules. The time used to create all candidate rules and prune uninformative ones can thus be reduced. The experiment result is summarized in Table 2. Because the time spent in finding the frequent itemsets is the same, we only show the time used to induce the association rules in the time column. Clearly, in practice, we will save more time when N increases. Table 2.
Algorithm  Connections  Candidate Rules  (Exact+Approximation) Rules  Result Rules  Time (ms)
Apriori    35           43               -                            3             13
FCIL       17           -                7+5                          3             2
7 Conclusion
Association rule mining has been extensively studied since its introduction and has been used in many applications. However, we often run into trouble because we have to handle a large number of candidate frequent itemsets. In the past few years, frequent closed itemsets mining has been introduced; compared with traditional frequent itemsets mining, it generates a small set of rules without information loss. In this paper, we introduce a new framework called the frequent closed itemsets lattice to mine association rules. Compared with the traditional itemsets lattice, the framework is simpler and only contains the itemsets we need to form association rules. Under this framework, we get the support of the frequent itemsets and mine association rules directly. We also extend the structure to the fuzzy frequent closed itemsets lattice, which is more efficient at the expense of precision. Finally, through the experiment, we demonstrate the effectiveness of our method.
References
[1] R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD 93, 207–216
[2] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB 94, 487–499
[3] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. In 7th Intl. Conf. on Database Theory, 1999
[4] D. Cristofor, L. Cristofor, and D. Simovici. Galois connection and data mining. Journal of Universal Computer Science (2000) 60–73
[5] J. Pei, J. Han, and R. Mao. Closet: An efficient algorithm for mining frequent closed itemsets. In SIGMOD Int'l Workshop on Data Mining and Knowledge Discovery, 2000
[6] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Efficient mining of association rules using closed itemset lattices. Information Systems (1999) 25–46
[7] M. Zaki. Generating non-redundant association rules. In Proceedings of the 6th ACM-SIGKDD International Conference on Knowledge Discovery and Data Mining (2000) 34–43
[8] B.A. Davey and H.A. Priestley. Introduction to Lattices and Order. Cambridge University Press (1994)
Mining Generalized Closed Frequent Itemsets of Generalized Association Rules Kritsada Sriphaew and Thanaruk Theeramunkong Sirindhorn International Institute of Technology, Thammasat University 131 Moo 5, Tiwanont Rd., Bangkadi, Muang, Pathumthani, 12000, Thailand {kong,ping}@siit.tu.ac.th
Abstract. In the area of knowledge discovery in databases, generalized association rule mining is an extension of traditional association rule mining, given a database and a taxonomy over the items in the database. More intuitive and informative knowledge can be discovered. In this work, we propose a novel approach of generalized closed itemsets. A smaller set of generalized closed itemsets can be the representative of a larger set of generalized itemsets. We also present an algorithm, called cSET, to mine only a small set of generalized closed frequent itemsets following some constraints and conditional properties. By a number of experiments, the cSET algorithm outperforms the traditional approaches of mining generalized frequent itemsets by an order of magnitude when the database is dense, especially in real datasets, and the minimum support is low.
1 Introduction
The task of association rule mining (ARM) is one important topic in the area of knowledge discovery in databases (KDD). ARM focuses on finding a set of all subsets of items (called itemsets) that frequently occur in database records or transactions, and then extracting the rules representing how a subset of items influences the presence of another subset [1]. However, the rules may not provide informative knowledge in the database, since they are limited to the granularity of the items. For this purpose, generalized association rule mining (GARM) was developed, using the information of a pre-defined taxonomy over the items. The taxonomy may classify products (or items) by brands, groups, categories, and so forth. Given a taxonomy where only leaf items are present in the database, more intuitive and informative rules (called generalized association rules) can be mined from the database. Each rule contains a set of items from any level of the taxonomy. In the past, most previous works focused on efficiently finding all generalized frequent itemsets. As an early intensive work, Srikant et al. [2] proposed five algorithms that apply the horizontal database format and a breadth-first search strategy like the Apriori algorithm. These algorithms waste a lot of time in scanning the database multiple times. A more recent algorithm, Prutax, was proposed in [3] by
applying a vertical database format to reduce the time needed for database scanning. Nevertheless, a limitation of this work is the cost of checking, using a hash tree, whether the ancestor itemsets are frequent or not. There exists a slightly different task dealing with multiple minimum supports, as shown in [4, 5, 6]. A parallel algorithm was proposed in [7]. Recent applications of GARM are shown in [8, 9]. Our efficient approach to mine all generalized frequent itemsets is presented in [10]. Furthermore, to improve the time complexity of the mining process, the concept of closed itemsets has been proposed in [11, 12, 13]. The main idea of these approaches is to find only a small set of closed frequent itemsets, which is the representative of a large set of frequent itemsets. This technique helps us reduce the computational time. Thus, we intend to apply this traditional concept to deal with the generalized itemsets in GARM. In this work, we propose a novel concept of generalized closed itemsets, and present an efficient algorithm, named cSET, to mine only generalized closed frequent itemsets.
2 Problem Definitions
A generalized association rule can be formally stated as follows. Let I = {A, B, C, D, E, U, V, W} be a set of distinct items, and T = {1, 2, 3, 4, 5, 6} be a set of transaction identifiers (tids). The database can be viewed in two formats, i.e. the horizontal format shown in Fig. 1A and the vertical format shown in Fig. 1B. Fig. 1C shows the taxonomy, a directed acyclic graph on the items. An edge in the taxonomy represents an is-a relationship. V is called an ancestor item of U, C, A and B. A is called a descendant item of U and V. Note that only leaf items of the taxonomy are present in the original database. Intuitively, the database can be extended to contain the ancestor items by adding, for each ancestor item, a record whose tidset is given by the union of its children, as shown in Fig. 1D. A set IG ⊆ I is called a generalized itemset (GI) when IG is a set of items in which no item is an ancestor of another. The support of IG, denoted by σ(IG), is defined as the percentage of transactions in which IG occurs as a subset, relative to the total number of transactions. Only a GI whose support is greater than or equal to a user-specified minimum support (minsup) is called a generalized frequent itemset (GFI). A rule is an implication of the form R: I1 → I2, where I1, I2 ⊆ I, I1 ∩ I2 = ∅, I1 ∪ I2 is a GFI, and no item in I2 is an ancestor of any item in I1. The confidence of a rule, defined as σ(I1 ∪ I2)/σ(I1), is the conditional probability that a transaction contains I2, given that it contains I1. The rule is called a generalized association rule (GAR) if its confidence is greater than or equal to a user-specified minimum confidence (minconf). The task of GARM can be divided into two steps, i.e. 1) finding all GFIs and 2) generating the GARs. The second step is straightforward while the first step takes intensive computational time. We try to improve the first step by exploiting the concept of closed itemsets in GARM, and find only a small set of generalized closed itemsets to reduce the computational time.
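The extension of the database with ancestor items (Fig. 1D) can be sketched as follows for a vertical (item -> tidset) database and a taxonomy given as child -> parents edges. The leaf tidsets below are a hypothetical stand-in, since the concrete contents of Fig. 1 are not reproduced in the text; only the taxonomy edges (V above U, C, A, B; U above A, B; W above D, E) follow the description.

```python
def extend_with_ancestors(vertical_db, parents):
    """Add each ancestor item; its tidset is the union of its children's tidsets."""
    extended = {item: set(tids) for item, tids in vertical_db.items()}
    changed = True
    while changed:                      # propagate upward until a fixpoint
        changed = False
        for child, tids in list(extended.items()):
            for parent in parents.get(child, ()):
                before = len(extended.setdefault(parent, set()))
                extended[parent] |= tids
                changed |= len(extended[parent]) != before
    return extended

# Hypothetical leaf-level vertical database and taxonomy edges (child -> parents):
leaves = {"A": {1, 3, 4}, "B": {2, 5}, "C": {1, 2, 6}, "D": {3, 5}, "E": {4, 6}}
parents = {"A": ["U"], "B": ["U"], "U": ["V"], "C": ["V"], "D": ["W"], "E": ["W"]}
print(extend_with_ancestors(leaves, parents))
# V's tidset is the union over its descendant leaves A, B and C.
```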
Fig. 1. Databases and Taxonomy
3 Generalized Closed Itemset (GCI)
In this section, the concept of GCI is defined by adapting the traditional concept of closed itemsets in ARM [11, 12, 13]. We show that a small set of generalized closed frequent itemsets is sufficient to be the representative of a large set of GFIs.
3.1 Generalized Closed Itemset Concept
Definition 1 (Galois connection). Let the binary relation δ ⊆ I × T be the extended database. For arbitrary x ∈ I and y ∈ T, we write xδy when x is related to y in the database. Let X ⊆ I and Y ⊆ T. Then the mapping functions
t: P(I) → P(T), t(X) = {y ∈ T | ∀x ∈ X, xδy}
i: P(T) → P(I), i(Y) = {x ∈ I | ∀y ∈ Y, xδy}
define a Galois connection between the power set of I (P(I)) and the power set of T (P(T)). The following properties hold for all X, X1, X2 ⊆ I and Y, Y1, Y2 ⊆ T:
1. X1 ⊆ X2 =⇒ t(X1) ⊇ t(X2)
2. Y1 ⊆ Y2 =⇒ i(Y1) ⊇ i(Y2)
3. X ⊆ i(t(X)) and Y ⊆ t(i(Y))
Definition 2 (Generalized closure). Let X ⊆ I and Y ⊆ T. The compositions of the two mappings, gcit: P(I) → P(I) and gcti: P(T) → P(T), are the generalized closure operators on itemsets and tidsets respectively, where gcit(X) = i ◦ t(X) = i(t(X)) and gcti(Y) = t ◦ i(Y) = t(i(Y)).
Definition 3 (Generalized closed itemset and tidset). X is called a generalized closed itemset (GCI) when X = gcit(X), and Y is called a generalized closed tidset (GCT) when Y = gcti(Y).
Fig. 2. Galois Lattice of Concepts
For X ⊆ I and Y ⊆ T, the generalized closure operators gcit and gcti satisfy the following properties (Galois property): 1. Y ⊆ gcti(Y). 2. X ⊆ gcit(X). 3. gcit(gcit(X)) = gcit(X), and gcti(gcti(Y)) = gcti(Y). For any GCI X, there exists a corresponding GCT Y with the property that Y = t(X) and X = i(Y). Such a GCI and GCT pair X × Y is called a concept. All possible concepts form a Galois lattice of concepts as shown in Fig. 2.
3.2 Generalized Closed Frequent Itemsets (GCFIs)
The support of a concept X × Y is the size of the GCT (i.e. |Y|). A GCI is frequent when its support is greater than or equal to minsup. Lemma 1. For any generalized itemset X, its support is equal to the support of its generalized closure (σ(X) = σ(gcit(X))). Proof. Given X, its support is σ(X) = |t(X)|/|T|, and the support of its generalized closure is σ(gcit(X)) = |t(gcit(X))|/|T|. To prove the lemma, we have to show that t(X) = t(gcit(X)). Since gcti is the generalized closure operator, it satisfies the first property that t(X) ⊆ gcti(t(X)) = t(i(t(X))) = t(gcit(X)). Thus t(X) ⊆ t(gcit(X)). On the other hand, gcit(X) is the GI that is the maximal superset of X with the same support as X. Then X ⊆ gcit(X), and t(X) ⊇ t(gcit(X)) due to the Galois property [11]. We can conclude that t(X) = t(gcit(X)). Implicitly, the lemma states that all GFIs can be uniquely determined by the GCFIs, since the support of any GI is equal to that of its generalized closure. In the worst case, the number of GCFIs is equal to the number of GFIs, but typically it is much smaller. In the previous example, there are 10 GCIs, which are the representatives of a large number of GIs, as shown in Fig. 2. With minsup=50%, only 7 concepts (in bold font) are GCFIs.
4 Algorithm
4.1 cSET Algorithm
In our previous work [10], all GFIs can be enumerated by applying two constraints, i.e. subset-superset and parent-child, on GIs for pruning. We propose an algorithm called the cSET algorithm, which specifies the order of set enumeration by using these two constraints and the generalized closures to generate only GCIs. The two constraints state that only descendant and superset itemsets of GFIs should be considered in the enumeration process. For generating only GCFIs, the following conditional properties must be checked when generating the child itemsets by joining X1 × t(X1) with X2 × t(X2):
1. If t(X1) = t(X2), then (1) replace X1 and the children under X1 with X1 ∪ X2, (2) generate taxonomy-based child itemsets of X1 ∪ X2, and (3) remove X2 (if any).
2. If t(X1) ⊂ t(X2), then (1) replace X1 with X1 ∪ X2 and (2) generate taxonomy-based child itemsets of X1 ∪ X2.
3. If t(X1) ⊃ t(X2), then (1) generate a join-based child itemset of X1 with X1 ∪ X2, (2) add X1 ∪ X2 to the hash table, and (3) remove X2 (if any).
4. If t(X1) ≠ t(X2) and t(X1 ∪ X2) is not contained in the hash table, then generate a join-based child itemset of X1 with X1 ∪ X2.
Using the given example in Fig. 1 with minsup=50%, the cSET algorithm starts with an empty set. Then, we add all frequent items in the second level of the taxonomy, that is, the items V and W, and form the second level of the tree shown in Fig. 3. Each itemset has to generate two kinds of child itemsets, i.e. taxonomy-based and join-based itemsets, respectively. We first generate a taxonomy-based itemset by joining the last item in the itemset with its child according to the taxonomy. One taxonomy-based itemset of V is VU. The first property holds for VU, which results in replacing V with VU and then generating VUA and VUB. The second taxonomy-based itemset is joined with the current itemset (VU), which produces VUC. Again, the first property holds for VUC, which results in replacing VU and the children in the tree under VU with VUC. Next, the join-based child itemset of V, VW, is generated. The third property holds for VW, which results in removing W and then generating VW under V. In the same manner, the process recursively continues until no new GCFIs are generated. Finally, a complete itemset tree is constructed without excessive checking cost, as shown in Fig. 3. All remaining itemsets in Fig. 3, except the crossed-out ones, are GCFIs.
4.2 Pseudo-code Description
The formal pseudo-code of cSET, extended from SET in [10], is shown below. The main procedure is cSET-MAIN, and a function called cSET-EXTEND creates a subtree following the proposed set enumeration. cSET-EXTEND is executed recursively to create all itemsets under the root itemsets. The NewChild function
Fig. 3. Finding GCFIs using cSET with minsup=50%
creates a child itemset. For instance, NewChild(V,U) creates a child itemset VU of a parent itemset V, and adds the new child to a hash table. The GenTaxChild function returns the taxonomy-based child itemsets of a GI. Lines 8-11 generate the join-based child itemsets. The function cSET-PROPERTY checks the four conditional properties of GCIs and performs the corresponding operations on the generated itemset. Following the cSET algorithm, we obtain a tree of all GCFIs.
cSET-MAIN(Database, Taxonomy, minsup):
1. Root = Null Tree                          //Root node of set enumeration
2. NewChild(Root, GFIs from second level of taxonomy)
3. cSET-EXTEND(Root)
cSET-EXTEND(Father)
4. For each Fi in Father.Child
5.   C = GenTaxChild(Fi)                     //Gen taxonomy-based child itemset
6.   If supp(C) ≥ minsup then
7.     cSET-PROPERTY(Nodes, C)
8.   For j = i+1 to |Father.Child|           //Gen join-based child itemset
9.     C = Fi ∪ Fj
10.    If supp(C) ≥ minsup then
11.      cSET-PROPERTY(Nodes, C)
12.  If Fi.Child ≠ NULL then cSET-EXTEND(Fi)
cSET-PROPERTY(Node, C)
13. if t(Fi) = t(Fj) and Child(Fi) = ∅ then       //Prop.1
14.   Remove(Fj); Replace(Fi) with C
15. else if t(Fi) ⊂ t(Fj) and Child(Fi) = ∅ then  //Prop.2
16.   Replace(Fi) with C
17. else if t(Fi) ⊃ t(Fj) then                    //Prop.3
18.   Remove(Fj); if !Hash(t(C)) then NewChild(Fi, C)
19. else if !Hash(t(C)) then NewChild(Fi, C)      //Prop.4
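The four conditional properties reduce to comparisons between the tidsets of the two itemsets being joined, much like in closed-itemset miners for flat data. The sketch below only classifies a candidate join into the corresponding property; the tree surgery (replacing nodes, generating taxonomy-based children, hash-table bookkeeping) performed by cSET-PROPERTY is deliberately left out, and the example tidsets are made up.

```python
def cset_property(t1, t2, hash_of_tidsets):
    """Return which of the four cSET conditional properties applies to the
    join of X1 (tidset t1) and X2 (tidset t2). Tree updates are omitted."""
    t_join = t1 & t2                          # t(X1 u X2) = t(X1) n t(X2)
    if t1 == t2:
        return 1                              # merge X2 into X1, extend X1 u X2
    if t1 < t2:
        return 2                              # replace X1 by X1 u X2
    if t1 > t2:
        return 3                              # new child of X1, remove X2
    if frozenset(t_join) not in hash_of_tidsets:
        return 4                              # genuinely new join-based child
    return None                               # duplicate closure, prune

seen = set()
print(cset_property({1, 3, 5}, {1, 3, 5}, seen))      # 1
print(cset_property({1, 3}, {1, 3, 5}, seen))         # 2
print(cset_property({1, 3, 5}, {1, 3}, seen))         # 3
print(cset_property({1, 2}, {2, 3}, seen))            # 4
```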
5 Experimental Results
Since the novel concept of GCIs has not appeared in any previous research, there are no existing algorithms for finding GCFIs. In our experiment, the cSET algorithm is therefore evaluated by comparing it with a current efficient algorithm for mining GFIs, i.e. the SET algorithm [10]. All algorithms are coded in the C language, and the experiment was done on a 1.7GHz Pentium IV with 640Mb of main memory running Windows 2000. Synthetic and real datasets are used in our experiment. The synthetic datasets are automatically generated by a generator tool from IBM Almaden with slightly modified default values. Two real datasets from the UC Irvine Machine Learning Database Repository, i.e. mushroom and chess, are used with our own generated taxonomies. The original items are contained in the leaf level of the taxonomy. Table 1 shows the comparison of using SET and cSET to enumerate all GFIs and GCFIs, respectively. In the real datasets, the number of GCFIs is much smaller than that of GFIs. With the same dataset, the ratio of the number of GFIs to that of GCFIs typically increases when we lower minsup. The higher the ratio is, the more time reduction is gained. The ratio can grow up to around 7,915 times, which results in a reduction of running time of around 3,878 times. Note that in the synthetic datasets, the number of GFIs is only slightly different from the number of GCFIs. This indicates that the real datasets are dense but the synthetic datasets are sparse. This makes it possible to reduce more computational time by using cSET in real situations.
Table 1. Number of itemsets and Execution Time (GFIs vs. GCFIs)
6 Conclusion and Further Research
A large number of generalized frequent itemsets may cause high computational time. Instead of mining all generalized frequent itemsets, we can mine only a small set of generalized closed frequent itemsets and thereby reduce the computational time. We proposed an algorithm, named cSET, which applies some constraints and conditional properties to efficiently enumerate only generalized closed frequent itemsets. The advantage of cSET becomes more dominant when the minimum support is low and/or the dataset is dense. This approach makes it possible to mine data in real situations. In further research, we intend to propose a method to extract only a set of important rules from these generalized closed frequent itemsets.
Acknowledgement This paper has been supported by Thailand Research Fund (TRF), and NECTEC under project number NT-B-06-4C-13-508.
References
[1] Agrawal, R., Imielinski, T., Swami, A.N.: Mining association rules between sets of items in large databases. In Buneman, P., Jajodia, S., eds.: Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, Washington, D.C. (1993) 207–216
[2] Srikant, R., Agrawal, R.: Mining generalized association rules. Future Generation Computer Systems 13 (1997) 161–180
[3] Hipp, J., Myka, A., Wirth, R., Güntzer, U.: A new algorithm for faster mining of generalized association rules. In: Proceedings of the 2nd European Conference on Principles of Data Mining and Knowledge Discovery (PKDD '98), Nantes, France (1998) 74–82
[4] Chung, F., Lui, C.: A post-analysis framework for mining generalized association rules with multiple minimum supports (2000). Workshop Notes of KDD'2000 Workshop on Post-Processing in Machine Learning and Data Mining
[5] Han, J., Fu, Y.: Mining multiple-level association rules in large databases. Knowledge and Data Engineering 11 (1999) 798–804
[6] Lui, C.L., Chung, F.L.: Discovery of generalized association rules with multiple minimum supports. In: Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery (PKDD '2000), Lyon, France (2000) 510–515
[7] Shintani, T., Kitsuregawa, M.: Parallel mining algorithms for generalized association rules with classification hierarchy. In: Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data (1998) 25–36
[8] Michail, A.: Data mining library reuse patterns using generalized association rules. In: International Conference on Software Engineering (2000) 167–176
[9] Hwang, S.Y., Lim, E.P.: A data mining approach to new library book recommendations. In: Lecture Notes in Computer Science ICADL 2002, Singapore (2002) 229–240
[10] Sriphaew, K., Theeramunkong, T.: A new method for finding generalized frequent itemsets in generalized association rule mining. In: Corradi, A., Daneshmand, M. (eds.): Proc. of the Seventh International Symposium on Computers and Communications, Taormina-Giardini Naxos, Italy (2002) 1040–1045
[11] Pasquier, N., Bastide, Y., Taouil, R., Lakhal, L.: Discovering frequent closed itemsets for association rules. Lecture Notes in Computer Science 1540 (1999) 398–416
[12] Pasquier, N., Bastide, Y., Taouil, R., Lakhal, L.: Efficient mining of association rules using closed itemset lattices. Information Systems 24 (1999) 25–46
[13] Zaki, M. J., Hsiao, C. J.: CHARM: An efficient algorithm for closed itemset mining. In: Grossman, R., Han, J., Kumar, V., Mannila, H., Motwani, R. (eds.): Proceedings of the Second SIAM International Conference on Data Mining, Arlington, VA (2002)
Qualitative Point Sequential Patterns
Aomar Osmani
LIPN - UMR CNRS 7030, Université de Paris 13, 99 avenue J.-B. Clément, 93430 Villetaneuse, France
[email protected]
Abstract. We have introduced [Osm03] a general model for representing and reasoning about STCSPs (Sequential Temporal Constraint Satisfaction Problems) to deal with patterns in data mining and in applications that generate large quantities of data used to understand or explain given situations (diagnosis of dynamic systems, alarm monitoring, event prediction, etc.). One important issue in sequence reasoning concerns the recognition problem. This paper presents the STCSP model with qualitative point primitives using a frequency evaluation function. It gives the problem formalization and the usual problems concerned with this kind of approach, and it proposes some algorithms to deal with sequences.
1 Introduction
Sequence generation, and more generally reasoning about sequences, is a problem arising in many disciplines including artificial intelligence, databases, cognitive sciences, and engineering [Se02]. Sequence reasoning, also called sequence mining, is the discovery of sets of characteristics shared through time by a great number of objects ordered in time [Zak01]. Various problems are concerned with reasoning about sequences, i.e. sequence prediction, sequence generation, and sequence recognition. The most considered one is the task of discovering all frequent sequences in large databases. It is a quite challenging problem; for instance, the search space is extremely large: with m variables there are O(m^k) potentially frequent sequences of length at most k. Many techniques have been proposed to mine temporal databases for frequently occurring sequences [Zak01, JBS99, Lin92, Se02, aSD94, BT96].
2 Definitions
Let us consider the set E = {e1, ..., en} of qualitative points and their possible relationships {<, =, >}. Let (e1, ..., em) be a vector of observations or events, such that ∀i ∈ {1..m}, ei ∈ E, and let R be a matrix of binary relations on E.
Definition 1. A sequential pattern P is defined as a couple ((e1, ..., en), R) such that ∀i, i ∈ {1..n − 1}, ei {<, =} ei+1.
For instance, consider the following situation: when you install this software, you see on the screen the message m1, then m2, and finally the messages m1 and m3 in the same window (at the same time). The pattern P corresponding to this situation is represented by the vector (m1, m2, m1, m3) and the matrix R = ((=, <, <, <), (<, =, <, <), (<, <, =, =), (<, <, =, =)). Using Property 1 proved in [Osm03], the pattern P may be represented by ((m1, m2, m1, m3), (<, <, =)).
Definition 2. (Subsequence) A partially ordered set (poset) S is a subsequence of a poset P with partial order relation ⪯, denoted S ⊆ P, if
– there exists an injective application φ from S into P such that if x ⪯ y in S then φ(x) ⪯ φ(y) in P;
– each object of S and P is represented by a pair (identifier, time), and for all x = (ix, tx), φ(x) = (ix, φ(tx)).
Example 1. [(a, c), (<, <)] ⊆ [(a, b, c, a), (<, =, <, <)]
In several applications all variable instances are important, and masking instances is not authorized. For that, we define the contiguous subsequence: S is a contiguous subsequence of P, denoted S ⊑ P, if S ⊆ P and, for all x, y ∈ S such that x ⪯ y, if there is no z ∈ S such that x ⪯ z ⪯ y then there is no t ∈ P such that φ(x) ⪯ t ⪯ φ(y). In the case where the relation is a set ({α1, ..., αn}), the last condition becomes: for x αi y, if there is no z ∈ S such that x αj z αk y then there is no t ∈ P such that φ(x) αl t αm φ(y), for all i, j, k, l, m ∈ {1..n} (gaps in objects and gaps in relations are prohibited).
In sequential pattern reasoning, data are given by large sequences of variable instances, in theory of infinite length. For most applications it is not necessary, or not possible, to observe all data at the same time. If we admit that data are organized on a time line, then we make the assumption that, when advancing on the time line from the past to the future, at each time point we can only observe events (variable instances) in a finite window. For instance, if we consider the monitoring task of dynamic systems, the operators observe alarms on the screen and can only make deductions about sequences of finite length. The maximal length is called the observation window.
Definition 3. The observation window Wq on the sequence S is the number of variable instances of S observable simultaneously.
Example 2. Let us consider the pattern ((m1, m2, m1, m3), (<, <, =)). If we assume that no masking is authorized, then (1) if Wq = 2 the possible observations are {(m1, m2), (m2, m1), (m2, m3), (m1, m3)}; (2) if Wq = 3 the possible observations are {(m1, m2, m1), (m1, m2, m3), (m2, m1, m3)}.
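As a small illustration of the subsequence notion of Definition 2 (a simplified sketch that ignores the relation matrices and the "=" relation, not the author's implementation), the following Java fragment checks whether an ordered event list embeds into another one with gaps allowed:

import java.util.Arrays;
import java.util.List;

public class SubsequenceCheck {
    // Returns true if 'pattern' occurs in 'sequence' as an order-preserving
    // subsequence (gaps allowed); simultaneity ("=") is not modelled here.
    static boolean isSubsequence(List<String> pattern, List<String> sequence) {
        int i = 0;                       // position in the pattern
        for (String event : sequence) {
            if (i < pattern.size() && pattern.get(i).equals(event)) i++;
        }
        return i == pattern.size();
    }

    public static void main(String[] args) {
        // Example 1 above: (a, c) is a subsequence of (a, b, c, a).
        System.out.println(isSubsequence(Arrays.asList("a", "c"),
                                         Arrays.asList("a", "b", "c", "a"))); // true
        System.out.println(isSubsequence(Arrays.asList("c", "b"),
                                         Arrays.asList("a", "b", "c", "a"))); // false
    }
}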
3 Qualitative Points Model
A Sequential Temporal Constraint Satisfaction Problem (STCSP) is defined [Osm03] as a 5-tuple M = <O, L, f, C, S>,
where O describes the considered objects (qualitative points), L defines the sequence language (the Vilain and Kautz point algebra [KV86]), f defines the evaluation function (in the considered model we treat only the frequency evaluation function), C is the set of constraints applied to the considered functions (C is used to select the significant patterns: a fixed frequency threshold is the main constraint used), and S is the solver (evaluation algorithm). f(M, M1) defines the frequency of the subsequence M1 in the sequence M. For instance:
– f((a1, a2, a1, a2), a1) = 2;
– f((a1, a2, a1, a2), (a1, a2)) = 3 if we accept redundancy, and
– f((a1, a2, a1, a2), (a1, a2)) = 2 otherwise.
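A minimal sketch of the two readings of the frequency function is given below (an assumption-laden illustration, not the model's solver: it ignores the relation matrix R, and it treats "without redundancy" as greedy element-disjoint matching):

import java.util.Arrays;
import java.util.List;

public class Frequency {
    // Number of distinct order-preserving embeddings of 'pattern' in 'sequence'
    // (occurrences may share elements, i.e. "with redundancy").
    static long countWithRedundancy(List<String> pattern, List<String> sequence) {
        long[] ways = new long[pattern.size() + 1];
        ways[0] = 1;
        for (String event : sequence) {
            for (int j = pattern.size(); j >= 1; j--) {
                if (pattern.get(j - 1).equals(event)) ways[j] += ways[j - 1];
            }
        }
        return ways[pattern.size()];
    }

    // Greedy count of element-disjoint occurrences ("without redundancy"):
    // repeatedly match the pattern left to right on unused elements.
    static int countWithoutRedundancy(List<String> pattern, List<String> sequence) {
        boolean[] used = new boolean[sequence.size()];
        int copies = 0;
        while (true) {
            int j = 0;
            int[] taken = new int[pattern.size()];
            for (int i = 0; i < sequence.size() && j < pattern.size(); i++) {
                if (!used[i] && sequence.get(i).equals(pattern.get(j))) taken[j++] = i;
            }
            if (j < pattern.size()) return copies;
            for (int k = 0; k < pattern.size(); k++) used[taken[k]] = true;
            copies++;
        }
    }

    public static void main(String[] args) {
        List<String> m = Arrays.asList("a1", "a2", "a1", "a2");
        List<String> m1 = Arrays.asList("a1", "a2");
        System.out.println(countWithRedundancy(m1, m));    // 3, as in the example above
        System.out.println(countWithoutRedundancy(m1, m)); // 2, as in the example above
    }
}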
3.1 Constraint Network
A qualitative point STCSP network of M = <O, L, f, C, S> is a multi-valued graph G = (N, V) where each node n ∈ N represents an object of O and V is a set of couples (time point relation, position).
Example 3. Let us consider the following temporal pattern [(a1, a2, a3, a3, a4), (=, <, <, =)]. Figure 1 gives the corresponding STCSP network. To simplify the representation we may omit the relationship on the arcs: an oriented arc corresponds to the relation "<", and a non-oriented arc refers to the relation "=" (see Figure 1(b)).
3.2 Qualitative Point STCSP Algorithms
The type of the selected algorithms depends on the considered application and the kind of problem to solve. If we consider a deterministic environment, the main studied problems are [Se02]:
Fig. 1. The qualitative point STCSP network
– Prediction problem: given a pattern [(a1, ..., an), R], we want to predict the object an+j, j > 0, such that (∀i ∈ 1..n) ai < an+j. When j = 1, we make predictions based only on the immediately preceding object;
– Sequence generation: this task has the same formulation as the prediction problem, but considers the following situation: given a pattern [(a1, ..., an), R], we want to generate an object an+j such that (∀i ∈ 1..n) ai < an+j;
– Sequence recognition: this problem consists in verifying the consistency of a given pattern [(a1, ..., an), R]. The response states whether the pattern is valid or not with respect to a set of a priori defined criteria.
To solve these problems we must first solve some generic problems such as: (1) checking the θ-consistency of a given pattern, (2) checking the θn-consistency of the network, (3) finding all solutions, and (4) finding a minimal set of solutions. For specific algorithms, let us consider the real problem of telecommunication network monitoring. The telecommunication network equipment generates infinite sequences of alarms.1 These alarms arrive at the supervision centre. One problem considered here is the detection of "regular" sequences of alarms received by the supervision centre when a known breakdown situation has happened.
Definition 4. A pattern p = [(a1, ..., an), R] is θ-consistent in the network G if there exist at least θ instances "without redundancy"2 of the pattern p in G.
Example 4. In the network G = [(a, b, a, b), (<, <, <)], the pattern p = [(a, b), (<)] is 2-consistent: (a, b, a, b) and (a, b, a, b). But, if we consider redundancy, there are three instances of p in G: (a, b, a, b), (a, b, a, b), and (a, b, a, b).
Definition 5. A network G is θ-consistent if there is at least one θ-consistent pattern p = [(a1, ..., an), R] in the network G. G is θn-consistent if there exists at least one θ-consistent pattern p of length n in G.3
Example 5. In the network G = [(a, b, a, b), (<, <, <)], the pattern p = [(a, b), (<)] is 2-consistent, but it is not 3-consistent. However, if we consider redundancy, there are three instances of p in G.
Checking the θ-Consistency Algorithm. A temporal pattern s is θ-consistent on the network G if there are at least θ instances of s in G such that the intersection between instances of s is empty.
1 To simplify the problem, we define an alarm as an event in the network received by the supervision centre, identified by the alarm name, the equipment identifier, and the time when the event happened.
2 Each label of each arc of G appears in at most one instance of the pattern p.
3 The pattern length is the number of object instances in the pattern.
Algorithm 1. Checking the θ-consistency of a given pattern
Input: the global network G = [(a1, ..., am), (r1,2, ..., rm−1,m)]; the pattern s = [(b1, ..., bn), (r1,2, ..., rn−1,n)]
Output: is s θ-consistent?
begin
1  result = 0;
2  Ga = AllNextGen(G);                    // for all values of each variable, compute the possible successors
3  currentNodeG = s1;                     // such that aj = s1; start the generation from the head of s
4  treeSolutionsRoot = VirtualRootNode;   // treeSolutions holds the tree of potential solutions
5  treeSolutions.add(virtualNode, s1);
6  currentSolution = s1;
7  currentPositionInGa = ai;
8  while ((result ≠ θ) and (currentNodeG has a next node)) do
9    for all next nodes nd of currentPositionInGa do
10     if ((currentNodeInGa, nd) = (si, si+1))
11       treeSolutions.add(si, si+1);
12     if (i = n)
13       result++;
14       for (j = n downto 1) do
15         treeSolutions.delete(sj);
16         if (the node (sj−1, s1) exists) treeSolutions.add(virtualRoot, s1);
17         treeSolutions.delete(sj, a) for all a ≠ s1;
18  if (result = θ) return (s is θ-consistent in G);     // end
19  if (currentNodeG has no next node)
20    return (s is θ-inconsistent in G);                  // end
EndAlgorithm
The algorithm AllNextGen(G) uses the point algebra composition operation to generate, for each value of every variable in the STCSP, the possible successors.
Example 6. Let us consider the following network: G = [(a1, a2, a3, a3, a4), (=, <, <, =)]. After the application of the AllNextGen() algorithm, the resulting network is: Ga = [(a1, a2, a3, a3, a4), (r1,2 = r3,4 = {=}, r3,3 = {<}, r2,3 = {=}, r1,3 = r1,4 = r2,4 = {<})] (see Figure 2).
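The composition operation itself is the standard one of the Vilain-Kautz point algebra; the sketch below is only an illustration of that table (not the author's AllNextGen code):

import java.util.EnumSet;

public class PointAlgebra {
    enum Rel { LT, EQ, GT }

    // Composition of two basic point relations: given x R1 y and y R2 z,
    // returns the set of possible relations between x and z.
    static EnumSet<Rel> compose(Rel r1, Rel r2) {
        if (r1 == Rel.EQ) return EnumSet.of(r2);
        if (r2 == Rel.EQ) return EnumSet.of(r1);
        if (r1 == r2) return EnumSet.of(r1);   // < o < = <,  > o > = >
        return EnumSet.allOf(Rel.class);       // < o > (or > o <) is unconstrained
    }

    public static void main(String[] args) {
        System.out.println(compose(Rel.LT, Rel.EQ)); // [LT]
        System.out.println(compose(Rel.LT, Rel.GT)); // [LT, EQ, GT]
    }
}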
Finding θ-Solutions Algorithm. This algorithm generates all possible θ-solutions of the network. We propose an incremental algorithm which first computes arc θ-solutions, then path θ-solutions, then θ-solutions of length 4, etc.
Fig. 2. (a) Example of an STCSP network; (b) application of the algorithm AllNextGen() to (a)
Definition 6. A network G is arc θ-consistent if there is at least one arc with θ labels.
Definition 7. A network G is path θ-consistent if there is at least one sequence s of length 3 such that s is θ-consistent in G.
Definition 8. A network G is n θ-consistent if there is at least one sequence s of length n such that s is θ-consistent in G.
The proposed finding-θ-solutions algorithm uses the following properties: (1) the trivial one: if the temporal pattern s is θ-consistent then all subpatterns of s are also θ-consistent; (2) if we have all θ-consistent solutions of length n, then solutions of length n+1 may be obtained by matching sequences of length n sharing the same prefix subsequence of length (n−1). The same property is applied in the Apriori algorithm proposed in [AS95].
Algorithm 2. Finding θ-solutions
Input: network G
Output: all θ-solutions
begin
1  solutions = ∅; stepSolutions = 0; l = 1;
2  for all arcs in G: if (θ ≤ nbLabels of (si, sj, l)) then
3    solutions = solutions ∪ (si, sj, l); stepSolutions++;
4  while (stepSolutions ≠ 0) do
5    stepSolutions = 0; l++;
6    for all couples (si, sj) such that si = ((a1, ..., al−1, a), (r1,1, ..., r1,l−1, ra)) and sj = ((a1, ..., al−1, b), (r1,1, ..., r1,l−1, rb))
7      check the θ-consistency of the sequences s = ((a1, ..., al−1, a, b), (r1,1, ..., r1,l−1, ra, r)) such that r ∈ {<, =, >}
8      for each case, if s is θ-consistent then
9        solutions = solutions ∪ {s};
10       stepSolutions++;
EndAlgorithm
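Property (2) above is the prefix-join step that Algorithm 2 performs in line 6; a stripped-down sketch of that candidate generation is shown below (an illustration under simplifying assumptions: relations are omitted and only the object lists are joined, so every candidate still has to pass the θ-consistency check):

import java.util.*;

public class PrefixJoin {
    // Join two length-n sequences sharing the same (n-1)-prefix into
    // candidate sequences of length n+1.
    static List<List<String>> joinOnPrefix(List<List<String>> solutions) {
        List<List<String>> candidates = new ArrayList<>();
        for (List<String> s1 : solutions) {
            for (List<String> s2 : solutions) {
                if (s1 == s2) continue;
                List<String> p1 = s1.subList(0, s1.size() - 1);
                List<String> p2 = s2.subList(0, s2.size() - 1);
                if (p1.equals(p2)) {
                    List<String> candidate = new ArrayList<>(s1);
                    candidate.add(s2.get(s2.size() - 1));
                    candidates.add(candidate);   // still subject to the θ-consistency check
                }
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        List<List<String>> level2 = Arrays.asList(
            Arrays.asList("a", "b"), Arrays.asList("a", "c"));
        System.out.println(joinOnPrefix(level2)); // [[a, b, c], [a, c, b]]
    }
}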
Checking the Minimal θ-Consistency Algorithm. The previous algorithm for checking θ-consistency may be used with the following operation: after line (9) we add solutions = solutions \ {si, sj}, since both of them may be generated from the solution s.
Checking the θn-Consistency Algorithm. A network G is θn-consistent if there exists a θ-consistent sequence s of length n. One trivial algorithm to solve this problem is to use the previous finding-θ-solutions algorithm; it is possible to propose a more efficient algorithm. We can also use the finding-θ-solutions algorithm and stop the computation when the value of l reaches n. For the contextual problems, let us consider the telecommunication networks presented before. It is a prediction problem. We can distinguish two sequential pattern problems: (1) supervised rules generation and (2) rules discovery.
Supervised Rules Generation. Let F = {f1, ..., fn} be a set of predefined situations (breakdown situations in our considered application),4 |w| the observation window width, and S all observations on the system. For each i, i ∈ {1..n}, there exists a set of sequences s = ((e1, ..., em), R), m < |w|, such that if we observe s then fi has happened. The problem is, considering continuous observation of the system: when is it possible to predict elements of F, and which elements can we predict at each moment?
Algorithm 3. Supervised rules generation
Input: network G
Output: all θ-solutions
begin
1  solutions = ∅; stepSolutions = 0; l = 1;
2  for all arcs in G: if (θ ≤ nbLabels of (si, sj, l)) then
3    solutions = solutions ∪ (si, sj, l); stepSolutions++;
4  while (stepSolutions ≠ 0) do
5    stepSolutions = 0; l++;
6    for all couples (si, sj) such that si = ((a1, ..., al−1, a), (r1,1, ..., r1,l−1, ra)) and sj = ((a1, ..., al−1, b), (r1,1, ..., r1,l−1, rb))
7      check the θ-consistency of the sequences s = ((a1, ..., al−1, a, b), (r1,1, ..., r1,l−1, ra, r)) such that r ∈ {<, =, >}
8      for each case, if s is θ-consistent then
9        solutions = solutions ∪ {s};
10       stepSolutions++;
EndAlgorithm
4 fi may be an element of E.
Rules Discovery. Let E′ be a subset of E, |w| the observation window width, and S all observations on the system. The problem is, considering continuous observation of the system: which elements of E′ can we predict at each moment?
4 Conclusion
This paper presents a formulation of sequential pattern reasoning using the constraint satisfaction problem formalism. A classification of the main kinds of temporal primitives is presented. A model of qualitative point STCSPs is introduced together with a model of constraint network representation. In this paper, we consider the sequence pattern problem with a frequency evaluation function. Some problems and associated resolution algorithms are proposed. This work is still in progress and raises many questions, especially how to manage constraints and which kind of knowledge we can explore when we consider systems where the values of variables change over time.
References
[AS95] R. Agrawal and R. Srikant. Mining sequential patterns. In: International Conference on Data Engineering (ICDE), Taipei, Taiwan, 1995. Expanded version available as IBM Research Report RJ9910, October 1994.
[aSD94] N. Lavrac and S. Dzeroski. Inductive Logic Programming. Ellis Horwood, New York, 1994.
[BT96] D. Bertsekas and J. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996.
[JBS99] S. D. Johnson, D. S. Battisti, and E. S. Sarachik. Empirically derived Markov models and prediction of tropical sea surface temperature anomalies. Journal of Climate, 1999.
[KV86] H. Kautz and M. Vilain. Constraint propagation algorithms for temporal reasoning. In: Proceedings AAAI-86, pages 377–382, 1986.
[Lin92] L. Lin. Self-improving reactive agents based on reinforcement learning, planning, and teaching. Machine Learning, 8:293–321, 1992.
[Osm03] A. Osmani. STCSP: A representation model for sequential patterns. In: Foundations and Applications of Spatio-Temporal Reasoning, March 2003.
[Se02] R. Sun and C. L. Giles, editors. Introduction to Sequence Learning. 2002.
[Zak01] M. J. Zaki. SPADE: An efficient algorithm for mining frequent sequences. Machine Learning, 42:31–60, 2001.
Visualization and Evaluation Support of Knowledge Discovery through the Predictive Model Markup Language
Dietrich Wettschereck1, Alípio Jorge2, and Steve Moyle3
1 The Robert Gordon University, School of Computing, St. Andrew Street, Aberdeen, AB25 1HG, UK, [email protected]
2 LIACC - University of Porto, Rua do Campo Alegre, 823, 4150 Porto, Portugal, [email protected]
3 Oxford University Computing Laboratory, Wolfson Building, Parks Road, Oxford, OX1 3QD, UK, [email protected]
Abstract. The emerging standard for the platform- and system-independent representation of data mining models PMML (Predictive Model Markup Language) is currently supported by a number of knowledge discovery support engines. The primary purpose of the PMML standard is to separate model generation from model storage in order to enable users to view, post-process, and utilize data mining models independently of the tool that generated the model. In this paper two systems, called VizWiz and PEAR, are described. These software packages allow for the visualization and evaluation of data mining models that are specified in PMML. They can be viewed as decision support systems, since they enable non-expert users of data mining results to interactively inspect and evaluate these results.
1 Introduction
The Predictive Model Markup Language (PMML, [1]) industrial standard for the representation of data mining models has gained enough momentum to motivate the development of a new generation of decision support tools that utilize this standard, while avoiding the complexity of current knowledge discovery support engines (KDDSE). Such tools typically do not include the actual analysis methods, but concentrate on one or more of the other steps in the CRISP-DM [2] process. Such a separation of the actual model generation step from model handling enables users to view, post-process and utilize data mining models independently of the KDDSE that generated the model. The current PMML standard (version 2.0) supports the following types of models: decision trees, neural networks, center and density based clusters, general and polynomial regression, Naive Bayes, and association and sequence rules.
Neither propositional nor first-order classification rule models are currently supported. We, therefore, also developed a non-standard extension to the PMML model for association rules to define (first-order) classification and regression rules as well as subgroups [4]. Further information about PMML and the Data Mining Group (DMG) can be obtained at the DMG.org web site [1]. A software architecture that integrates several key data mining and decision support techniques is proposed in [3]. The architecture aims at providing a modular knowledge discovery and decision support system for geographically dispersed groups that collaborate in different modes, depending on the complexity of the project and the resources available. The integration of models takes place in this architecture at the representation level and is based on PMML. At this point the visualization and evaluation of data models is a key issue. This issue is addressed in this paper. A further motivation for using PMML in this architecture is its key relevance in centralized model evaluation [5]. Only a standardized model representation language will allow for rapid, practical, reliable, and reproducible model handling in decision making. The paper is organized as follows: two software systems for the visualization and evaluation of data mining models are described in Section 2 in some detail. First, a visualization tool (VizWiz) for PMML models that is independent of a specific KDDSE is presented; and second, a post-processing methodology and an optimized tool (PEAR) for browsing and visualizing large sets of association rules. Section 3 concludes the paper with a discussion on how these tools can be utilized as decision support tools.
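Because PMML documents are plain XML, a viewer or evaluator can be written against any standard XML parser. The sketch below reads the support and confidence of association rules with the JDK's DOM API; the file name is a placeholder, and the element and attribute names assume the PMML AssociationModel layout mentioned above rather than the exact schema used by the tools described here:

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.*;
import java.io.File;

public class PmmlRuleReader {
    public static void main(String[] args) throws Exception {
        // "model.pmml" is a hypothetical file name.
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(new File("model.pmml"));
        NodeList rules = doc.getElementsByTagName("AssociationRule");
        for (int i = 0; i < rules.getLength(); i++) {
            Element rule = (Element) rules.item(i);
            double support = Double.parseDouble(rule.getAttribute("support"));
            double confidence = Double.parseDouble(rule.getAttribute("confidence"));
            // A viewer such as VizWiz or PEAR could filter or color the rule
            // according to these two values.
            System.out.printf("rule %d: support=%.3f confidence=%.3f%n",
                    i, support, confidence);
        }
    }
}

The point of the separation is visible even in this toy reader: nothing in it depends on the engine that produced the model.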
2 Model Visualization and Evaluation
Data visualization methods have been a part of statistics and data analysis research for many years. This research has concentrated primarily on plotting one or more independent variables against a dependent variable in support of explorative data analysis. The visualization of analysis results, however, only recently gained some attention with the proliferation of data mining and decision support techniques. This recent interest was spawned by the often overwhelming number and the complexity of data mining results. Within the framework described in the previous section, visualization serves four purposes: (1) to better illustrate the model to the end user, (2) to enable the comparison of models, (3) to increase model acceptance, and (4) to enable model editing and support for "what-if" questions. The ultimate goal of visualization is to support the user in making better decisions. Below, we describe two software systems that visualize data mining models which are specified in PMML.
2.1 VizWiz - A KDDSE-Independent PMML Visualizer
Within the SolEuNet project we developed a software component called VizWiz that visualizes PMML data mining models. The software includes visualizations for decision and regression trees, for subgroups, for propositional and first-order
rules, and for association rules (see the figures below). It is a Java implementation that can be run as an application or as an Applet.1 It allows for viewing of models by users that either do not have access to the actual KDDSE, want to avoid the overhead of starting a KDDSE, or want to present their results via the internet. The original visualization wizard has been extended with model editing and model evaluation facilities (see [3], Chapter 5). Hence, the philosophy of VizWiz is to support all steps within the CRISP-DM [2] process that follow the actual modeling step. The remainder of this section shows several sample visualizations for the currently supported model types.
Fig. 1. A visualization of a decision tree
Fig. 2. A VizWiz visualization of a set of association rules
Figure 1 shows a decision tree for the well-known Play/Don't Play domain [6]. The tree is not fully expanded and is normally shown in color, where differently colored bars in the nodes denote the number of instances from each class contained in that node. The user can browse through the tree and open or close sub-trees as needed, the visualization can be changed from bar charts to pie charts, and more detailed views can be shown for selected nodes. Figure 2 shows an interactive visualization for association rules. Confidence and support values are displayed for each rule. The bar above each rule graphically displays these two numbers, where bar length represents the support and bar color represents its confidence (from red denoting low confidence to green denoting high confidence). Sliders at the bottom of the display (not shown) allow rules to be filtered out that lie outside selected minimal and maximal confidence and support values.
1 For a demonstration see: http://soleunet.ijs.si/website/other/pmml.html.
Fig. 3. A visualization of propositional rules generated from the Cleveland Heart Disease domain
An extension to the PMML model for association rules allows the definition of propositional and first-order rules (see [7, 8] for details). Figure 3 displays a set of propositional rules in the Cleveland Heart Disease domain [9]. The visualization is very similar to that of a set of multivariate decision trees: each tree displays the set of rules that predict a certain class. The leaves list the conditions that must be satisfied such that a certain rule fires. The display mode can be changed between bar charts and pie charts, as shown for the second-to-last rule. Figure 4 displays a hand-coded set of first-order rules that predict whether an actor is likely to reappear in a Star Trek Enterprise episode or not. The rules for each class are summarized by the bars at the top of the figure. The left bar shows the number of instances correctly covered, and the right bar the number of exceptions covered by the rule. The figure shows that rules may contain literals with bound or unbound variables, and that it is possible to replace Prolog-like notation with pretty text. Figure 5 displays a set of subgroups discovered by Midos [4] in a multi-relational medical application. The size of each subgroup is shown, how it compares to the entire population, and the distribution of the target values within each subgroup. Experience gained from working with non-technical end users has shown that a pie chart visualization is more appealing to these users because they closely resemble business charts. Pie charts, however, often mislead the
perception of the user due to difficulties with relating the size of pie slices to actual values. Hence, alternative visualizations are possible (see, for example, [10]).
Fig. 4. A visualization of three first-order classification rules
Fig. 5. Visualization of selected subgroups learned from a multi-relational task
VizWiz has also been extended with model evaluation facilities that allow the user to evaluate entire data sets against given models and to plot the evaluation results on a ROC curve [11]. It is also possible to select single records, in which case VizWiz highlights the rule or tree node of the model that classifies the instance; finally, a rule or tree node can be selected and all instances covered by this partial model will be highlighted (Figure 6). This combination of model visualization and model evaluation offers an essential utility for collaborative data mining. The model evaluation features of VizWiz enable the user to interactively evaluate partial or entire data mining models on selected data sets or data records. The aim of such an evaluation phase would be to increase the end-user's understanding of and confidence in the model. This is particularly valuable when the model was produced by someone else. The integration of VizWiz with the ROC curve viewer permits the comparison and selection of models based on their predictive power. This integrated tool can, therefore, be seen as a powerful decision support tool for users of data mining results and as an effective presentation tool for producers of data mining results.
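The quantities plotted on such a ROC curve are simple to compute from an evaluated data set; the following sketch (an illustration only, not VizWiz code) derives the false-positive and true-positive rates that form a model's ROC point:

public class RocPoint {
    // True-positive and false-positive rates for one model on one labelled
    // evaluation set; these are the coordinates of the model's point on a ROC plot.
    static double[] rocPoint(boolean[] actual, boolean[] predicted) {
        int tp = 0, fp = 0, fn = 0, tn = 0;
        for (int i = 0; i < actual.length; i++) {
            if (predicted[i] && actual[i]) tp++;
            else if (predicted[i] && !actual[i]) fp++;
            else if (!predicted[i] && actual[i]) fn++;
            else tn++;
        }
        return new double[] { (double) fp / (fp + tn), (double) tp / (tp + fn) };
    }

    public static void main(String[] args) {
        boolean[] actual    = { true, true, false, false, true };
        boolean[] predicted = { true, false, false, true, true };
        double[] p = rocPoint(actual, predicted);
        System.out.println("FPR=" + p[0] + " TPR=" + p[1]); // FPR=0.5 TPR=0.666...
    }
}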
2.2 Post-processing Association Rules
PEAR is a Post-processing environment for association rules [12] that reads association models represented in PMML and allows the exploration of large sets of rules through rule set selection and visualization. PEAR allows the user to browse a large set of rules to ultimately find the subset of interesting/actionable rules. The central idea is that a large set of rules can be visited like a web site. PEAR is implemented as a set of dynamic web pages that can be accessed with an ordinary web browser. After loading a PMML file, an initial selection of the rules can be seen (Figure 7). This starting set is the index page of the whole set. To get to another subset of rules, the user can choose one of a list of intuitive operators (or change minimal support or minimal confidence). As an example, suppose we want to study the behavior of the users of one web site, by analyzing association rules obtained from web access data. In this case, one transaction corresponds to the set of page categories visited by one registered user. Rules indicate association between preferred page categories.
Fig. 6. The evaluation of a selected data record (highlighted in table on the left). The path in the decision tree that leads to the leaf classifying this data record is marked in bold
Fig. 7. Subset of association rules as shown by PEAR
Table 1. Some of the initial rules (Sup, Conf)
Economics-and-Finance <= Population-and-Social-Conditions & Industry-and-Energy & External-Commerce (0.038, 0.94)
Commerce-Tourism-and-Services <= Economics-and-Finance & Industry-and-Energy & General-Statistics (0.036, 0.93)
Industry-and-Energy <= Economics-and-Finance & Commerce-Tourism-and-Services & General-Statistics (0.043, 0.77)
Environment-and-Territory <= Population-and-Social-Conditions & Industry-and-Energy & General-Statistics (0.043, 0.77)
General-Statistics <= Commerce-Tourism-and-Services & Industry-and-Energy & Environment-and-Territory (0.040, 0.73)
External-Commerce <= Economics-and-Finance & Industry-and-Energy & General-Statistics (0.036, 0.62)
Agriculture-and-Fishing <= Commerce-Tourism-and-Services & Environment-and-Territory & General-Statistics (0.043, 0.51)
Table 2. Applying the consequent generalization operator (Sup, Conf)
Environment-and-Territory <= Population-and-Social-Conditions & Industry-and-Energy & General-Statistics (0.043, 0.77)
Environment-and-Territory <= Population-and-Social-Conditions & Industry-and-Energy (0.130, 0.41)
Environment-and-Territory <= Population-and-Social-Conditions & General-Statistics (0.100, 0.63)
Environment-and-Territory <= Industry-and-Energy & General-Statistics (0.048, 0.77)
Environment-and-Territory <= General-Statistics (0.140, 0.54)
After rule generation, the PMML model is loaded. An initial page shows the 30 rules with highest support. Other possibilities for the initial page are the 30 rules with highest confidence or a set of rules involving different items to guarantee variability. Table 1 shows some of the rules in the initial page. The user then finds the rule on “Environment and Territory” relevant for structuring the categories on the site. By applying the consequent generalization operator to this rule, a new page with a subset of rules appears (Table 2). This operator results in the rules with the same antecedent but a more general consequent. From here, we can see that “Population and Social Conditions” is not relevantly associated to “Environment and Territory”. The user can now, for example, look into rules with “Population and Social Conditions” by applying the focus on the antecedent operator (results not shown here) to see what the main associations to this item are. For each page, the user can also select a graphical visualization that summarizes the set of rules on the page. Currently, the available visualizations are confidence/support plot and confidence/support histograms (Figure 8). The charts produced are interactive and indicate the rule that corresponds to the point under the mouse. After being loaded, rules are internally stored in a relational database, allowing the implementation of the operators as SQL queries. The charts are generated dynamically as Scalable Vector Graphics (SVG) pages, an XML based language.
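Since the rules are stored relationally, a browsing operator reduces to a parameterized SQL query. The JDBC sketch below shows one such operator, a "focus on an item in the antecedent", against a hypothetical schema rules(id, antecedent, consequent, sup, conf); the JDBC URL, the schema, and the query are assumptions for illustration and are not PEAR's actual implementation:

import java.sql.*;

public class RuleBrowser {
    // Hypothetical schema: rules(id, antecedent, consequent, sup, conf).
    static void focusOnAntecedent(Connection con, String item) throws SQLException {
        String sql = "SELECT antecedent, consequent, sup, conf FROM rules "
                   + "WHERE antecedent LIKE ? ORDER BY conf DESC";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setString(1, "%" + item + "%");
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    System.out.printf("%s <= %s (sup=%.3f, conf=%.3f)%n",
                            rs.getString("consequent"), rs.getString("antecedent"),
                            rs.getDouble("sup"), rs.getDouble("conf"));
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        // Placeholder connection string; any relational database holding the rules will do.
        try (Connection con = DriverManager.getConnection("jdbc:hsqldb:mem:rules", "sa", "")) {
            focusOnAntecedent(con, "Population-and-Social-Conditions");
        }
    }
}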
3 Discussion
The visualization tools presented are simple, yet powerful, tools that can function as dissemination tools for data mining results. Their simplicity ensures that non-KDD users can operate the tools and interpret the results obtained by a data mining expert. Java/HTML technology ensures that platform issues are secondary, and that results could even be part of online content management or workgroup support systems. The philosophy behind VizWiz and PEAR differs in the sense that VizWiz attempts to cover as many different model types as
possible, while PEAR is a highly efficient and intuitive tool for one specific type of model.
Fig. 8. PEAR plot of confidence vs. support values for a set of association rules
Acknowledgements This work has been supported in part by the EU funded project Data Mining and Decision Support for Business Competitiveness: A European Virtual Enterprise (IST-1999-11495). Some of the visualizations presented here were developed by A. and G. Andrienko, AIS, FhG, Sankt Augustin, Germany. A. Jorge has also been supported by the POSI/2001/Class Project sponsored by Fundação Ciência e Tecnologia, FEDER e Programa de Financiamento Plurianual de Unidades de I & D.
References
[1] Data Mining Group, PMML specification, see http://www.dmg.org
[2] Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C. and Wirth, R. (2000). CRISP-DM 1.0: step-by-step data mining guide.
[3] Mladenic, D., Lavrac, N., Bohanec, M., Moyle, S. (editors): Data Mining and Decision Support: Integration and Collaboration, Kluwer Publishers, to appear.
[4] Wrobel, S. (1997). An algorithm for multi-relational discovery of subgroups. Proc. First European Symposium on Principles of Data Mining and Knowledge Discovery, 78–87, Springer.
[5] Blockeel, H., Moyle, S. (2002). Centralized model evaluation for collaborative data mining. In M. Grobelnik, D. Mladenic, M. Bohanec, and M. Gams, editors, Proceedings A of the 5th International Multi-Conference Information Society 2002: Data Mining and Data Warehousing/Intelligent Systems, pages 100–103. Jozef Stefan Institute, Ljubljana, Slovenia.
[6] Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, USA.
[7] Wettschereck, D., Müller, S. (2001). Exchanging data mining models with the predictive model markup language. In Proc. of the ECML/PKDD-01 Workshop on Integration of Data Mining, Decision Support and Meta-Learning, pp. 55–66.
[8] Wettschereck, D. (2002). A KDDSE-independent PMML Visualizer. In Proc. of IDDM-02, workshop on Integration Aspects of Decision Support and Data Mining, (Eds.) Bohanec, M., Mladenic, D., Lavrac, N., associated to the conferences ECML/PKDD.
[9] Blake, C., Keogh, E., Merz, C. J. (1999). UCI repository of Machine Learning databases (machine-readable data repository). Irvine, CA: Department of Information and Computer Science, University of California at Irvine. http://www.cs.uci.edu/~mlearn/MLRepository.html
[10] Gamberger, D., Lavrač, N., Wettschereck, D. (2002). Subgroup Visualization: A Method and Application in Population Screening. ECAI 2002 Workshop on Intelligent Data Analysis in Medicine and Pharmacology.
[11] Provost, F., Fawcett, T. (2001). Robust Classification for Imprecise Environments. Machine Learning 42(3): 203–231.
[12] Jorge, A., Poças, J., Azevedo, P. (2002). Post-processing operators for browsing large sets of association rules. In Proceedings of Discovery Science 02, Luebeck, Germany, LNCS 2534, Eds. Steffen Lange, Ken Satoh, Carl H. Smith, Springer-Verlag.
Detecting Patterns of Fraudulent Behavior in Forensic Accounting
Boris Kovalerchuk1 and Evgenii Vityaev2
1 Dept. of Computer Science, Central Washington University, Ellensburg, WA 98926, USA, [email protected]
2 Institute of Mathematics, Russian Academy of Sciences, Novosibirsk, 630090, Russia, [email protected]
Abstract. Often evidence from a single case does not reveal any suspicious patterns to aid investigations in forensic accounting and other forensic fields. In contrast, correlation of sets of evidence from several cases with suitable background knowledge may reveal suspicious patterns. Link Discovery (LD) has recently emerged as a promising new area for such tasks. Currently LD mostly relies on deterministic graphical techniques. Other relevant techniques are Bayesian probabilistic and causal networks. These techniques need further development to handle rare events. This paper combines firstorder logic (FOL) and probabilistic semantic inference (PSI) to address this challenge. Previous research has shown this approach is computationally efficient and complete for statistically significant patterns. This paper shows that a modified method can be successful for discovering rare patterns. The method is illustrated with an example of discovery of suspicious patterns.
1 Introduction
Forensic accounting is a field that deals with possible illegal and fraudulent financial transactions [3]. One current focus in this field is the analysis of funding mechanisms for terrorism where clean money (e.g., charity money) and laundered money are both used [1] for a variety of activities including acquisition and production of weapons and their precursors. In contrast, traditional illegal businesses and drug trafficking make dirty money appear clean [1]. There are many indicators of possible suspicious (abnormal) transactions in traditional illegal business. These include (1) the use of several related and/or unrelated accounts before money is moved offshore, (2) a lack of account holder concern with commissions and fees [2], (3) correspondent banking transactions to offshore shell banks [2], (4) transferor insolvency after the transfer or insolvency at the time of transfer, (5) wire transfers to new places [4], (6) transactions without identifiable business purposes, and (7) transfers for less than reasonably equivalent value [5].
Some of these indicators can be easily implemented as simple flags in software. However, indicators such as wire transfers to new places produce a large number of 'false positive' suspicious transactions. Thus, the goal is to develop more sophisticated mechanisms based on interrelations among many indicators. To meet these challenges, link analysis software for forensic accountants, attorneys and fraud examiners such as NetMap, Analyst's Notebook and others [4-7] have been and are being developed. Here we concentrate on fraudulent activities that are closely related to terrorism, such as transactions without identifiable business purposes. The problem is that often an individual transaction does not reveal that it has no identifiable business purpose or that it was done for no reasonably equivalent value. Thus, we develop a technique that searches for suspicious patterns in the form of more complex combinations of transactions and other evidence using background knowledge. The specific tasks in automated forensic accounting related to transaction monitoring systems are the identification of suspicious and unusual electronic transactions and the reduction in the number of 'false positive' suspicious transactions by using inexpensive, simple rule-based systems, customer profiling, statistical techniques, neural networks, fuzzy logic and genetic algorithms [1]. This paper combines the advantages of first-order logic (FOL) and probabilistic semantic inference (PSI) [8] for these tasks. We discover the following transaction patterns from ordinary or distributed databases that are related to terrorism and other illegal activities:
• a normal pattern (NP) – a Manufacturer Buys a Precursor & Sells the Result of manufacturing (MBPSR);
• a suspicious (abnormal) pattern (SP) – a Manufacturer Buys a Precursor & Sells the same Precursor (MBPSP);
• a suspicious pattern (SP) – a Trading Co. Buys a Precursor and Sells the same Precursor Cheaper (TBPSPC);
• a normal pattern (NP) – a Conglomerate Buys a Precursor & Sells the Result of manufacturing (CBPSR).
2 Example
Consider the following example. Table 1 contains transactions with the attributes seller, buyer, item sold, amount, cost and date, and Table 2 describes the types of companies and items sold.
Table 1. Transaction records
Record ID  Seller  Buyer  Item sold  Amount  Cost  Date
1          Aaa     Ttt    Td         1t      $100  03/05/99
2          Bbb     Ccc    Td         2t      $100  04/06/98
3          Ttt     Qqq    Td         1t      $100  05/05/99
4          Qqq     Ccc    Pd         1.5t    $100  05/05/99
5          Ccc     Ddd    Td         2.0t    $200  08/18/98
6          Ddd     Ccc    Pd         3.0t    $400  09/18/98
We assemble a new Table 3 from Tables 1 and 2 to look for suspicious patterns. For instance, row 1 in Table 3 is a combination of row 1 from Table 1 and rows 1 and 4 from Table 2 that contain the types of companies and items. Table 3 does not indicate suspicious patterns immediately, but we can generate pairs of records from Table 3 that can be mapped to the patterns listed above using a pattern-matching algorithm A. The algorithm A analyzes pairs of records in Table 3. For simplicity, we can assume that a new table with 18 attributes is formed to represent pairs of records from Table 3. Each record in Table 3 contains nine attributes.
Table 2. Company types and item types
Record ID  Company name  Company type (seller/buyer)
1          Aaa           Trading
2          Bbb           Unknown
3          Ccc           Trading
4          Ttt           Manufacturing
5          Ddd           Manufacturing
6          Qqq           Conglomerate
Item  Item type
Td    Precursor
Pd    Product
Rd    Precursor
Pp    In process
Table 3. Combined data records
Record ID  Seller  Seller type    Buyer  Buyer type     Item sold  Item type  Amount  Price  Date
1          Aaa     Trading        Ttt    Manufacturing  Td         Precursor  1t      $100   03/05/99
2          Bbb     Unknown        Ccc    Trading        Td         Precursor  2t      $100   04/06/98
3          Ttt     Manufacturing  Qqq    Conglomerate   Td         Precursor  1t      $100   05/05/99
4          Qqq     Conglomerate   Ccc    Trading        Pd         Product    1.5t    $100   06/23/99
5          Ccc     Trading        Ddd    Manufacturing  Td         Precursor  2.0t    $200   08/18/98
6          Ddd     Manufacturing  Ccc    Trading        Pd         Product    3.0t    $400   09/18/98
$400 09/18/98
3.0t
Thus, we map pairs of records in Table 3 into patterns: A(#5,#6)=MBPSR, that is a pair of records #5 and #6 from Table 3 indicates a normal pattern -- a manufacturer bought a precursor and sold product ; In contrast, two other pairs indicate suspicious patterns: A(#1,#3)= MBPSP, that is a manufacturer bought a precursor and sold the same precursor; A(#2,#5)= TBPSPC, that is a trading company bought a precursor and sold the same precursor cheaper.
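A toy version of this mapping is sketched below. It is only an illustration of the idea behind algorithm A: the field names are assumptions, the price comparison is one possible reading of "sold cheaper", and no claim is made that this is the authors' implementation.

public class PatternMatcher {
    // Simplified combined record in the spirit of Table 3.
    static class Tx {
        final String seller, sellerType, buyer, buyerType, item, itemType;
        final double price;
        Tx(String seller, String sellerType, String buyer, String buyerType,
           String item, String itemType, double price) {
            this.seller = seller; this.sellerType = sellerType; this.buyer = buyer;
            this.buyerType = buyerType; this.item = item; this.itemType = itemType;
            this.price = price;
        }
    }

    // Classifies a pair (r1, r2) in which the buyer of r1 later appears as the seller of r2.
    static String match(Tx r1, Tx r2) {
        if (!r1.buyer.equals(r2.seller)) return "not comparable";
        boolean boughtPrecursor = r1.itemType.equalsIgnoreCase("precursor");
        if (r1.buyerType.equalsIgnoreCase("manufacturing") && boughtPrecursor) {
            if (r2.itemType.equalsIgnoreCase("product")) return "MBPSR (normal)";
            if (r2.item.equalsIgnoreCase(r1.item)) return "MBPSP (suspicious)";
        }
        if (r1.buyerType.equalsIgnoreCase("trading") && boughtPrecursor
                && r2.item.equalsIgnoreCase(r1.item) && r2.price < r1.price) {
            return "TBPSPC (suspicious)";
        }
        return "other";
    }

    public static void main(String[] args) {
        Tx r1 = new Tx("Aaa", "Trading", "Ttt", "Manufacturing", "Td", "Precursor", 100);
        Tx r3 = new Tx("Ttt", "Manufacturing", "Qqq", "Conglomerate", "Td", "Precursor", 100);
        System.out.println(match(r1, r3)); // MBPSP (suspicious), as for the pair (#1, #3) above
    }
}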
Now let us assume that we have a database of 10^5 transactions as in Table 1. Then Table 3 will contain all pairs of them, i.e., about 5·10^9. Statistical computations can reveal a distribution of these pairs into patterns as shown in Table 4.
Table 4. Statistical characteristics
Pattern
Type
Frequency, %
Approximate number of cases
MBPSR    normal       55     0.55·5·10^9
MBPSP    suspicious   0.1    100
CBPSR    normal       44.7   0.44·5·10^9
TBPSPC   suspicious   0.2    200
Thus, we have 300 suspicious transactions. This is 0.3% of the total number of transactions and about 6·10^-6 % of the total number of pairs analyzed. It shows that finding such transactions is like finding a needle in a haystack. The automatic generation of pattern/hypothesis descriptions is a major challenge. This includes generating the MBPSP and TBPSPC descriptions automatically. We do not assume that we already know that MBPSP and TBPSPC are suspicious. One can ask: "Why do we need to discover these definitions (rules) automatically?" A manual way can work if the number of types of suspicious patterns is small and an expert is available. For multistage money-laundering transactions, this is difficult to accomplish manually. It is possible that many laundering transactions were processed before money went offshore or was used for illegal purposes. Our approach to identify suspicious patterns is to discover highly probable patterns and then negate them. We suppose that a highly probable pattern should be normal. In more formal terms, the main hypothesis (MH) is: if Q is a highly probable pattern (>0.9) then Q constitutes a normal pattern and not(Q) can constitute a suspicious (abnormal) pattern.
Table 5 outlines an algorithm based on this hypothesis to find suspicious patterns. The algorithm is based on first-order logic and probabilistic semantic inference [8]. To minimize computations we randomly generate a representative part of all possible pairs of records such as shown in Table 4. Then an algorithm finds highly probable (P>T) Horn clauses. Next, these clauses are negated as described in Table 5. After that, a full search of records in the database is performed to find records that satisfy the negated clauses. According to our main hypothesis (MH) this set of records will contain suspicious records and the search for "red flag" transactions will be significantly narrowed. Use of the property of monotonicity is another tool we use to minimize computations. The idea is based on a simple observation: if A1&A2&…&An-1⇒ B represents a suspicious pattern then A1&A2&…&An-1&An⇒ B is suspicious too. Thus, one does not need to test the clause A1&A2&…&An-1&An⇒ B if A1&A2&…&An-1⇒ B is already satisfied.
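The monotonicity observation translates directly into a pruning test; the sketch below is a schematic illustration of that test (predicate names and data structures are assumptions, not the authors' implementation):

import java.util.*;

public class MonotonicityPruning {
    // If the if-part P already yields a suspicious pattern, any clause whose
    // if-part is a superset of P need not be tested.
    static boolean canSkip(Set<String> candidateIfPart,
                           List<Set<String>> alreadySuspiciousIfParts) {
        for (Set<String> p : alreadySuspiciousIfParts) {
            if (candidateIfPart.containsAll(p)) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        List<Set<String>> suspicious = new ArrayList<>();
        suspicious.add(new HashSet<>(Arrays.asList("A1", "A2")));
        System.out.println(canSkip(new HashSet<>(Arrays.asList("A1", "A2", "A3")), suspicious)); // true
        System.out.println(canSkip(new HashSet<>(Arrays.asList("A1", "A3")), suspicious));       // false
    }
}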
Table 5. Algorithm steps for finding suspicious patterns based on the main hypothesis (MH)
1. Discover patterns in the database, such as MBPSR, in the form MBP ⇒ SR, that is, as a Horn clause A1&A2&…&An-1 ⇒ An (see [8] for mathematical detail).
   1.1. Generate a set of predicates Q = {Q1, Q2, …, Qm} and first-order logic sentences A1, A2, …, An based on those predicates. For instance, Q1 and A1 could be defined as follows: Q1(x) = 1 ⇔ x is a trading company, and A1(a) = Q1(a) & Q1(b), where a and b are companies.
   1.2. Compute the probability P that the pattern A1&A2&…&An-1 ⇒ An is true on a given database. This probability is computed as the conditional probability of the conclusion An under the assumption that the if-part A1&A2&…&An-1 is true, that is, P(An/A1&A2&…&An-1) = N(A1&A2&…&An-1&An)/N(A1&A2&…&An-1), where N(A1&A2&…&An-1&An) is the number of A1&A2&…&An-1&An cases and N(A1&A2&…&An-1) is the number of A1&A2&…&An-1 cases.
   1.3. Compare P(A1&A2&…&An-1 ⇒ An) with a threshold T, say T = 0.9. If P(A1&A2&…&An-1 ⇒ An) > T then the database is "normal". A user can select another value of the threshold T, e.g., T = 0.98. If P(MBP ⇒ SR) = 0.998, then the DB is normal for 0.98 too.
   1.4. Test the statistical significance of P(A1&A2&…&An-1 ⇒ An). We use the Fisher criterion [8] to test statistical significance.
2. Negate patterns. If the database is "normal" (P(A1&A2&…&An-1 ⇒ An) > T) and A1&A2&…&An-1 ⇒ An is statistically significant, then negate A1&A2&…&An-1 ⇒ An to produce a negated pattern ┐(A1&A2&…&An-1 ⇒ An).
3. Compute the probability of the negated pattern, P(A1&A2&…&An-1 ⇒ ┐An) = 1 − P(A1&A2&…&An-1 ⇒ An). In the example above, it is 1 − 0.998 = 0.002.
4. Analyze the database records that satisfy A1&A2&…&An-1 & ┐An for possible false alarms. Really suspicious records satisfy the property A1&A2&…&An-1 & ┐An, but normal records can also satisfy this property.
3 Hypothesis Testing
One of the technical aims of this paper is to design tests and simulation experiments for this thesis. We designed two test experiments:
1. Test 1: Generate a relatively large Table 4 that includes a few suspicious records MBPSP and TBPSPC. Run a data-mining algorithm (MMDR [8]) to discover as many highly probable patterns as possible. Check that patterns MBPSR and CBPSR are among them. Negate MBPSR and CBPSR to produce patterns MBPSP and TBPSPC. Run patterns MBPSP and TBPSPC to find all suspicious records consistent with them.
2. Test 2: Check that other highly probable patterns found are normal; check that their negations are suspicious patterns (or contain suspicious patterns).
A positive result of Test 1 will confirm our hypothesis (statement) for MBPSR and CBPSR and their negations. Test 2 will confirm our statement for a wider set of patterns. In this paper we report results of conducting Test 1. The word "can" is the most important in our statement/hypothesis. If the majority of not(Q) patterns are consistent with an informal and intuitive concept of suspicious pattern then this hypothesis will be valid. If only a few of the not(Q) rules (patterns) are intuitively suspicious then the hypothesis will not be of much use even if it is formally valid. A method for Test 1 contains several steps:
• Create a Horn clause: MBP ⇒ SR.
• Compute the probability that MBP ⇒ SR is true on a given database. The probability P(MBP ⇒ SR) is computed as a conditional probability P(SR/MBP) = N(SR/MBP)/N(MBP), where N(SR/MBP) is the number of MBPSR cases and N(MBP) is the number of MBP cases.
• Compare P(MBP ⇒ SR) with 0.9. If P(MBP ⇒ SR) > 0.9 then the database is "normal". For instance, P(SR/MBP) can be 0.998.
• Test the statistical significance of P(MBP ⇒ SR). We use the Fisher criterion [8] to test statistical significance.
• If the database is "normal" (P(MBP ⇒ SR) > T = 0.9) and P(MBP ⇒ SR) is statistically significant, then negate MBP ⇒ SR to produce ┐(MBP ⇒ SR). The threshold T can have another value too.
• Compute the probability P(┐(MBP ⇒ SR)) as P(MBP ⇒ ┐SR) = P(┐SR/MBP) = 1 − P(MBP ⇒ SR). In the example above it is 1 − 0.998 = 0.002.
• Analyze the database records that satisfy MBP and ┐SR. For instance, really suspicious MBPSP records satisfy the property MBP and ┐SR, but other records can also satisfy this property. For instance, MBPBP records (a manufacturer bought a precursor twice) can be less suspicious than MBPSP.
Thus, if the probability P(SR/MBP) is high (0.9892) and statistically significant then a normal pattern MBPSR is discovered. Then suspicious cases are among the cases where MBP is true but the conclusion SR is not true. We collect these cases and analyze the actual content of the then-part of the clause MBP =>SR. The set ┐SR can contain a variety of entities. Some of them can be very legitimate cases. Therefore, this approach does not guarantee that we find only suspicious cases, but the method narrows the search to a much smaller set of records. In the example above the search is narrowed to 0.2% of the total cases.
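Read schematically, the whole Test 1 pipeline is "estimate P(then | if) from counts, and if the pattern is sufficiently normal, flag the records that satisfy the if-part but violate the conclusion". The sketch below illustrates exactly that reading on toy data; the predicates, thresholds and record encoding are assumptions for illustration only:

import java.util.*;
import java.util.function.Predicate;

public class NegatedPatternFilter {
    // P(then | if) is estimated as N(if & then) / N(if); records satisfying
    // "if" but not "then" are the candidates to inspect by hand.
    static <R> List<R> candidates(List<R> records, Predicate<R> ifPart,
                                  Predicate<R> thenPart, double threshold) {
        long nIf = records.stream().filter(ifPart).count();
        long nBoth = records.stream().filter(ifPart.and(thenPart)).count();
        double p = nIf == 0 ? 0.0 : (double) nBoth / nIf;
        if (p <= threshold) return Collections.emptyList();  // pattern not "normal" enough to negate
        List<R> out = new ArrayList<>();
        for (R r : records) if (ifPart.test(r) && !thenPart.test(r)) out.add(r);
        return out;
    }

    public static void main(String[] args) {
        // Toy records: "MBP+SR" behaves normally, "MBP-only" violates the conclusion.
        List<String> db = new ArrayList<>(Collections.nCopies(98, "MBP+SR"));
        db.add("MBP-only");
        db.add("MBP-only");
        List<String> flagged =
            candidates(db, r -> r.startsWith("MBP"), r -> r.endsWith("SR"), 0.9);
        System.out.println(flagged.size() + " record(s) to inspect"); // 2
    }
}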
4
Experiment
We generated two synthesized databases with attributes shown in Table 4. The first one does not have suspicious records MBPSP and TBPSPC. A second database contains few such records. Using a Machine Method for Discovery Regularities (MMDR) [8] we were able to discover MBPSR and CBPSR normal patterns in both databases.
Table 6. Database with suspicious cases: probability P(A1&A2&…&An-1 ⇒ An)
Pattern                                   Without suspicious cases   With suspicious cases
Normal pattern MBP ⇒ SR                   > 0.95                     > 0.9
Negated pattern ┐(MBP ⇒ SR), MBP ⇒ ┐SR    < 0.05                     < 0.1
Normal pattern CBP ⇒ SR                   > 0.95                     > 0.9
Negated pattern ┐(CBP ⇒ SR), CBP ⇒ ┐SR    < 0.05                     < 0.05
The MMDR method worked without any advance information that these patterns are in the data. In the database without suspicious cases, the negated patterns MBP ⇒ ┐SR and CBP ⇒ ┐SR contain cases that are not suspicious. For instance, MBP ⇒ BP, that is, a manufacturer that already bought precursors (transaction record 1) bought them again (transaction record 2). The difference in probabilities for MBP ⇒ ┐SR in the two databases points out actually suspicious cases. In our computational experiments, the total number of regularities found is 41. The number of triples of companies (i.e., pairs of transactions) captured by regularities is 1531 out of the total of 2772 triples generated in the experiment. Table 7 depicts some statistically significant regularities found. The attributes New_Buyer__type and New_Item_type belong to the second record in a pair of records (R1, R2). Individual records are depicted in Table 3.
Table 7. Computational experiment: examples of discovered regularities
1. IF Seller_type = Manufacturing AND Buyer__type = Manufacturing THEN New_Item_type = product (frequency 72 / (6 + 72) = 0.923077)
2. IF Seller_type = Manufacturing AND New_Buyer__type = Manufacturing THEN New_Item_type = product (frequency 72 / (6 + 72) = 0.923077)
3. IF Seller_type = Manufacturing AND Item_type = precursor THEN New_Item_type = product (frequency 152 / (59 + 152) = 0.720379)
4. IF Seller_type = Manufacturing AND Price_Compare = 1 AND New_Buyer__type = Trading THEN New_Item_type = product (frequency 47 / (2 + 47) = 0.959184)
5. IF Seller_type = Manufacturing AND Price_Compare = 1 AND Item_type = precursor THEN New_Item_type = product (frequency 79 / (5 + 79) = 0.940476)
5
Conclusion
The method outlined in this paper advances pattern discovery methods that deal with complex (non-numeric) evidence and involve structured objects, text, and data in a variety of discrete and continuous scales (nominal, order, absolute and so on). The paper shows a potential application of the technique to forensic accounting. The technique combines first-order logic (FOL) and probabilistic semantic inference (PSI). The approach has been illustrated with an example of discovery of suspicious patterns in forensic accounting.
References [1] [2]
[3] [4] [5] [6] [7] [8]
Prentice, M., Forensic Services - tracking terrorist networks,2002, Ernst & Young LLP, UK, http://www.ey.com/global/gcr.nsf/UK/Forensic_Services_tracking_terrorist_networks Don Vangel and Al James Terrorist Financing: Cleaning Up a Dirty Business, the issue of Ernst & Young's financial services quarterly, Spring 2002. http://www.ey.com/GLOBAL/content.nsf/International/Issues_&_Perspectives _-Library_-Terrorist_Financing_Cleaning_Up_a_Dirty_Business IRS forensic accounting by TPI, 2002, http://www.tpirsrelief.com/forensic_accounting.htm Chabrow, E. Tracking The Terrorists, Information week, Jan. 14, 2002, http://www.tpirsrelief.com/forensic_accounting.htm How Forensic Accountants Support Fraud Litigation, 2002, http://www.fraudinformation.com/forensic_accountants.htm i2 Applications-Fraud Investigation Techniques, http://www.i2.co.uk/applications/fraud.html Evett, IW., Jackson, G. Lambert, JA , McCrossan, S. The impact of the principles of evidence interpretation on the structure and content of statements. Science & Justice 2000; 40: 233–239 Kovalerchuk, B., Vityaev, E., Data Mining in Finance: Advances in Relational and Hybrid Methods, Kluwer, 2000
SPIN!—An Enterprise Architecture for Spatial Data Mining Michael May and Alexandr Savinov Fraunhofer Institute for Autonomous Intelligent Systems Schloss Birlinghoven, Sankt-Augustin, D-53754 Germany {michael.may,alexandr.savinov}@ais.fraunhofer.de
Abstract. The rapidly expanding market for Spatial Data Mining systems and technologies is driven by pressure from the public sector, environmental agencies and industry to provide innovative solutions to a wide range of different problems. The main objective of the described spatial data mining platform is to provide an open, highly extensible, ntier system architecture based on Java 2 Platform, Enterprise Edition (J2EE). The data mining functionality is distributed among (i) Java client application for visualization and workspace management, (ii) application server with Enterprise Java Bean (EJB) container for running data mining algorithms and workspace management, and (iii) spatial database for storing data and spatial query execution.
1
Introduction
Data mining is the partially automated search for hidden patterns in typically large and multi-dimensional databases. It draws on results in machine learning, statistics and database theory [7]. Data mining methods have been packaged in data mining platforms, which are software environments providing support for the application of one or more data-mining algorithms. So far, Data Mining and Geographic Information Systems (GIS) have existed as two separate technologies, each with its own methods, traditions and approaches to visualization and data analysis. Recently, the task of integrating these two technologies has attracted considerable attention [3,8,9], especially as various public and private sector organizations possessing huge databases with thematic and geographically referenced data began to realize the huge potential of the information hidden there. As a response to this demand, a prototype has been developed [1,2] which demonstrates the potential of combining data mining and GIS. This initial prototype encouraged the development of the SPIN! system [4,11,12], the overall objective of which is to develop a spatial data mining platform by integrating state-of-the-art Geographic Information System (GIS) and data mining functionality in a closely coupled, open and extensible system architecture. This paper describes an open, extensible architecture for spatial data mining, which pays special attention to such features as scalability, security, multi-user access, robustness, platform independence and adherence to standards.
It integrates Geographic Information System functionality for interactive visual data exploration and Data Mining functionality specially adapted for spatial data. The system is built on the Java 2 Enterprise Edition (J2EE) architecture and in particular uses Enterprise Java Bean (EJB) technology for implementing remote object functionality. The flexibility and scalability of the J2EE platform have made it the platform of choice for building multi-tiered enterprise applications, so using it as the basis for a spatial data mining platform in the SPIN! project [4] is a natural extension. EJB is a server-side component architecture which cleanly separates the "business logic" (the analysis tools, in our case) from server issues, shielding the method developers from many technicalities involved in client-server programming. This choice allows us to meet the requirements often found in business applications, e.g. security, scalability and platform independence, in a principled manner. The system is tightly integrated with a relational database and can serve as a data access and transformation tool for spatial and non-spatial data. Analysis tools can be integrated either as stand-alone modules or, more tightly, by distributing the analysis functionality between the database and the core algorithm. In particular, the spatial database is used to execute complex spatial queries generated by the analysis algorithms. The final system integrates several data mining methods adapted to the analysis of spatial data, e.g., multi-relational subgroup discovery, rule induction and spatial cluster analysis, and combines them with rich interactive functionality for visual data exploration, thus offering an integrated distributed environment for spatial data analysis.
2 N-tier EJB-Based Architecture
The general SPIN! architecture is shown in Fig. 1. It is an n-tier client/server architecture based on Enterprise Java Beans for the server-side components. A major advantage of using Enterprise Java Beans is that such tasks as controlling and maintaining user access rights, handling multi-user access, pooling of database connections, caching, handling persistency and transaction management are delegated to the EJB container. The architecture has the following major subsystems: a client, an application server with one or more EJB containers, one or more database servers and, optionally, compute servers. The SPIN! client is a standalone Java application. It always creates one server-side representative in the form of a session bean, the methods of which are accessed through the corresponding remote reference via the Java RMI or CORBA IIOP protocol. The client session bean executes various server-side tasks on behalf of the client. In particular, it may load/save workspace objects from/in its persistent state. The client is based on a component connectivity framework, which is implemented in Java as the connectivity library (CoCon). The idea is that the workspace consists of components, each of which is considered a storage for a set of parameters and pieces of functionality (e.g., algorithms). The system functionality is determined by the set of available components.
Fig. 1. SPIN! platform architecture. Main components are a Java-based client, an Enterprise Java Beans Container and one or more databases serving spatial and non-spatial data. (The diagram shows algorithm session beans, client and workspace entity beans and persistent objects on the server side, and workspace, visual and algorithm components on the client side, linked via JDBC connections and RMI/IIOP references.)
The workspace components and the connections among them can be edited in two views: a tree view and a graph view. In the tree view, components from the system repository can be added to the workspace (Fig. 2, top left). User connections among workspace components can be established in the Connection Editor dialog. A more user-friendly way of editing workspaces is through a Clementine-style workspace graph view, which shows both components and their user connections (Fig. 2, bottom left). In this view, components can be added by selecting them from the system tool bar and connected by drawing arrows between graph nodes. It is also important that components can be arranged within views into visually expressive diagrams. The application server is an Enterprise Java Bean container. It manages the client workspace, analysis tasks, data access and persistency. There may be more than one simultaneously running container on one or more servers so that, e.g., different algorithms and other tasks can be executed on different computers under different restrictions. The SPIN! system uses an EJB container for making workspaces persistent in the database and for remote computations. For the first task, the client creates a special session bean, which is responsible on the server side for workspace persistence and access. In particular, if the client needs to load or save a workspace, it delegates this task to this session bean. The client creates one remote object for each analysis task to be run, so that data is transferred directly from the database to the algorithm. After the analysis is finished, its result is transferred to the client for visualization. User data are stored in the primary data storage, which is a relational database system (it may be on the same machine as the application server). There may be one or more optional secondary databases. In addition, data can be loaded from other sources – databases, ASCII files in the file system or Excel files. It is important that, for remote computations in the application server, data is transferred directly to the remote algorithm, bypassing the client. Only a set of components (a subgraph of the workspace) is transferred between the application server and the client.
Fig. 2. SPIN! client. The workspace consists of interconnected components such as database connections, database queries, data mining algorithms, analysis results and spatial object visualizers
3 Remote Algorithm Management
The developed architecture assumes that all algorithms are executed on compute servers. For each running algorithm a separate session bean is created which implements high-level methods for controlling its behavior, in particular starting/stopping the execution, getting/setting parameters, setting the data to process, and getting the result. The session bean is then responsible for implementing these methods. There are several ways this can be done.
A clean and very convenient but in some cases inefficient approach is to implement the complete algorithm in Java directly within the corresponding EJB, loading all data via JDBC into the workspace. A second approach divides the labor between the EJB container and the relational database. We have implemented a multi-relational spatial subgroup-mining algorithm [6] that does most of the analysis work (especially the spatial analysis) directly in the database. The EJB part retrieves summary statistics, manages hypotheses and controls the search. A third approach consists of implementing computationally intensive methods in native code wrapped into a shared library by means of the Java Native Interface (JNI). A rule induction algorithm based on finding the largest empty intervals in the data
[13,14] has been implemented in this way, namely as a dynamically linked library whose functions are called from the algorithm EJB. A fourth option is that the algorithm session bean directly calls an external executable module. This approach has been used to run the SPADA algorithm [10]. Finally, other remote objects (e.g. CORBA objects) can be used to execute the task.
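Whichever implementation strategy is chosen, the client sees a running algorithm only through the high-level methods of its session bean. The sketch below shows what such a remote interface could look like in EJB 2.x style; the interface name, method names and types are hypothetical illustrations, not the actual SPIN! interfaces, and the code assumes the standard javax.ejb API is available.

```java
import java.rmi.RemoteException;
import javax.ejb.EJBObject;

// Sketch of a remote interface for an algorithm session bean (EJB 2.x style).
// All names and types are illustrative assumptions, not the SPIN! implementation.
public interface AlgorithmRemote extends EJBObject {
    void setParameters(java.util.Map parameters) throws RemoteException; // algorithm settings
    void setDataDescription(String query) throws RemoteException;        // where/how to fetch data
    void start() throws RemoteException;                                  // launch the computation
    void stop() throws RemoteException;                                   // abort the computation
    boolean isFinished() throws RemoteException;                          // polled by the client
    java.io.Serializable getResult() throws RemoteException;              // result component
}
```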
The algorithm parameters are formed in the client and transferred to the algorithm EJB as a workspace component before the execution. In particular, the data to be processed by the algorithm has to be specified. It is important that only a data description is specified and that the complete data set is not transferred. In other words, the algorithm EJB gets information about where and how to obtain the data and what kind of restrictions to use. Thus, when the algorithm starts, the data is retrieved directly by the algorithm EJB rather than passed through the client. For example, assume that we need to find interesting subgroups in spatially referenced data [6]. The data is characterised by both thematic attributes, e.g., population, and spatial attributes, e.g., proximity to a highway or the percentage of forests in the area. The data to be analysed is specified in the corresponding component, where we can choose tables, columns, join and restriction conditions, including spatial operators supported by the underlying database system. The algorithm component is connected to the data component and the subgroup pattern component. The algorithm component creates a remote algorithm object in the EJB container as a session bean and transfers to it all necessary components, such as the data description. The remote object (EJB) starts the computations while its local counterpart periodically checks its state until the process is finished. During the computations, the remote object retrieves the data, analyses it and stores the result in the result component. Note that each client may start several local and remote analysis algorithms simultaneously; for each of them a separate thread is created. Once interesting subgroups have been discovered and stored in a component, they can be visualised in a special view, which provides a list of all subgroups with all parameters as well as a two-dimensional chart where each subgroup is represented by one point according to its coverage and strength. Additionally, the data analysed by the subgroup discovery data mining algorithm can be viewed in a geographic information system and analysed by visual analysis methods.
4 Workspace Management
One task which is very important in a distributed environment is workspace management. During one session, the user loads one workspace from some central storage into the client and works with it. When the work is finished, the workspace is stored back into its initial or a new location. There are several alternatives for how a persistent workspace can be implemented: (i) the whole workspace is serialized and stored in one object such as a local/remote file or a database record, or (ii) the workspace components and connections are stored separately in different database records. The first approach is much simpler, but it makes it difficult to share workspaces. The second approach allows us to treat workspace components as individual objects even within persistent storage, i.e., the whole workspace graph structure is represented in the storage.
Fig. 3. The workspace is a graph where nodes are components and edges are connections between them. All workspaces are stored in a database, and retrieving a workspace means finding its component and connection objects. The persistent workspace management functionality is implemented as a session bean, which manipulates two types of entity beans: workspace components and workspace connections. (The diagram shows the graph persistence manager mapping the local (sub)graph in the client and the global graph in the database onto a component table and a connection table.)
We implemented both approaches, and in both cases the workspace is represented as a special graph object, i.e., a set of its nodes (workspace components) and a set of its edges (workspace connections). The graphs can be created from existing run-time workspace objects by specifying constraints on their nodes and connections. For example, when loading and storing workspaces, view connections are ignored. Then the selected subgraph is passed to the persistence manager. If it needs to be stored as one object, then the whole graph is serialized. Otherwise, individual node and edge objects are serialized. We used XML for serialization, i.e., any object state is represented as an XML text. The functionality of remote workspace management is implemented by a special session bean. This EJB has functions for loading and storing workspaces. If the workspace is stored as a set of its constituents, then the session bean uses entity beans which correspond to the workspace components. The state of such workspaces is stored in two tables: one for nodes and one for edges. There exist two classes of entity beans which are used to manipulate these two tables. The workspace management architecture for this case is shown in Fig. 3.
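The sketch below illustrates the serialization idea in its simplest form: a workspace graph is flattened into node and edge records and written out as XML text. The component names and the string-based XML writing are assumptions for illustration only; the SPIN! system itself delegates persistence to the session and entity beans described above.

```java
import java.util.List;

// Sketch: serialize a workspace graph (components = nodes, connections = edges) to XML.
// One <node> or <edge> element per record mirrors the "two tables" storage option;
// names and structure are illustrative, not the SPIN! persistence manager.
public class WorkspaceXml {
    record Node(String id, String type) {}
    record Edge(String from, String to) {}

    static String toXml(List<Node> nodes, List<Edge> edges) {
        StringBuilder sb = new StringBuilder("<workspace>\n");
        for (Node n : nodes)
            sb.append("  <node id=\"").append(n.id()).append("\" type=\"").append(n.type()).append("\"/>\n");
        for (Edge e : edges)
            sb.append("  <edge from=\"").append(e.from()).append("\" to=\"").append(e.to()).append("\"/>\n");
        return sb.append("</workspace>").toString();
    }

    public static void main(String[] args) {
        List<Node> nodes = List.of(
            new Node("db1", "DatabaseConnection"),
            new Node("alg1", "SubgroupDiscovery"),
            new Node("res1", "SubgroupPattern"));
        List<Edge> edges = List.of(new Edge("db1", "alg1"), new Edge("alg1", "res1"));
        System.out.println(toXml(nodes, edges));
    }
}
```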
5 Conclusion
We have described the general architecture of the SPIN! spatial data mining platform. It integrates GIS and data mining algorithms that have been adapted to spatial data. The choice of J2EE technology allows us to meet requirements such as security, scalability,
and platform independence in a principled manner. The system is tightly integrated with an RDBMS and can serve as a data access and transformation tool for spatial and non-spatial data. The client has been implemented in Java using Swing for its visual interface. JBoss 3.0 [5] has been used as the application server, and an Oracle 9i database has been used for spatial data and workspace storage. In the future it would be interesting to add the following features to this architecture: persistent algorithms that run without a client, a web interface to the data mining algorithms via a conventional browser, data mining functionality exposed as web services via the XML-based SOAP protocol, and shared workspaces whose components can belong to more than one workspace.
Acknowledgement Work on this paper has been partially funded by the European Commission under IST-1999-10536-SPIN!
References
[1] Andrienko, N., Andrienko, G., Savinov, A., Wettschereck, D., "Descartes and Kepler for Spatial Data Mining", ERCIM News, No. 40, January 2000, 44-45.
[2] Andrienko, N., Andrienko, G., Savinov, A., Voss, H., Wettschereck, D., "Exploratory Analysis of Spatial Data Using Interactive Maps and Data Mining", Cartography and Geographic Information Science 28(3), July 2001, 151-165.
[3] Ester, M., Frommelt, A., Kriegel, H.P., Sander, J., "Spatial Data Mining: Database Primitives, Algorithms and Efficient DBMS Support", Data Mining and Knowledge Discovery, 1999.
[4] European IST SPIN!-project web site, http://www.ccg.leeds.ac.uk/spin/.
[5] JBoss Application Server, www.jboss.org.
[6] Klösgen, W., May, M., "Spatial Subgroup Mining Integrated in an Object-Relational Spatial Database", PKDD 2002, Helsinki, Finland, August 2002, 275-286.
[7] Klösgen, W., Zytkow, J. (eds.), Handbook of Data Mining and Knowledge Discovery, Oxford University Press, 2002.
[8] Koperski, K., Adhikary, J., Han, J., "Spatial Data Mining: Progress and Challenges", Technical Report, Vancouver, Canada, 1996.
[9] Koperski, K., Han, J., "GeoMiner: A System Prototype for Spatial Mining", Proceedings ACM-SIGMOD, Arizona, 1997.
[10] Lisi, F.A., Malerba, D., "SPADA: A Spatial Association Discovery System", in A. Zanasi, C.A. Brebbia, N.F.F. Ebecken and P. Melli (eds.), Data Mining III, Series: Management Information Systems, Vol. 6, 157-166, WIT Press, 2002.
[11] May, M., "Spatial Knowledge Discovery: The SPIN! System", in Fullerton, K. (ed.), Proceedings of the 6th EC-GIS Workshop, Lyon, 28-30 June, European Commission, JRC, Ispra.
[12] May, M., Savinov, A., "An Integrated Platform for Spatial Data Mining and Interactive Visual Analysis", Data Mining 2002, Third International Conference on Data Mining Methods and Databases for Engineering, Finance and Other Fields, 25-27 September 2002, Bologna, Italy, 51-60.
[13] Savinov, A., "Mining Interesting Possibilistic Set-Valued Rules", in Da Ruan and Etienne E. Kerre (eds.), Fuzzy If-Then Rules in Computational Intelligence: Theory and Applications, Kluwer, 2000, 107-133.
[14] Savinov, A., "Mining Spatial Rules by Finding Empty Intervals in Data", Proc. of the 7th International Conference on Knowledge-Based Intelligent Information & Engineering Systems (KES'03), 3-5 September 2003, University of Oxford, United Kingdom (accepted).
The Role of Discretization Parameters in Sequence Rule Evolution Magnus Lie Hetland1 and Pål Sætrom2 1
Norwegian University of Science and Technology Dept. of Computer and Information Science Sem Sælands vei 9, NO–7491 Trondheim, Norway [email protected] 2 Interagon AS, Medisinsk-teknisk senter, NO–7489 Trondheim, Norway [email protected]
Abstract. As raw data become available in ever-increasing amounts, there is a need for automated methods that extract comprehensible knowledge from the data. In our previous work we have applied evolutionary algorithms to the problem of mining predictive rules from time series. In this paper we investigate the effect of discretization on the predictive power of the evolved rules. We compare the effects of using simple model selection based on validation performance, majority vote ensembles, and naive Bayesian combination of classifiers.
1 Introduction
As raw data become available in ever-increasing amounts, there is a need for automated methods that extract comprehensible knowledge from the data. The process of knowledge discovery and data mining has been the subject of much research interest in recent years, and one recent subfield is that of sequence mining. In our previous work [3] we have applied evolutionary algorithms to the problem of mining predictive rules from time series. Our method involves discretizing the time series, in order to be able to evaluate our rules on them (which, in turn, allows us to use genetic programming to find such rules). This process of discretization discards some information about the time series, and the parameters chosen (such as the width of the discretization window) will determine which features are available for the mining algorithm. In this paper we investigate the effect of discretization on the predictive power of the evolved rules. We evolve rules for different discretization parameter settings and design predictors using the following methods:
1. We test rules evolved for different settings on a validation set. We then take the rule with the best performance to be our predictor.
2. We select the best rule for each of five discretization resolutions and combine them to form a predictor ensemble by simple majority vote.
3. We select the best rule for each of five discretization resolutions and combine them by means of a naive Bayesian model.

Majority vote ensembles are a simple way of combining several predictors to achieve increased predictive power. To make a prediction, all the component predictors make their predictions; a simple majority vote is used to decide which prediction "wins". The naive Bayesian predictor is a simple probabilistic model that assumes conditional independence among the predictor variables. Even though this assumption may be too strong in many cases, there is much empirical evidence suggesting that the method is quite robust.
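As a minimal illustration of the voting step (not the authors' implementation), binary predictions coded as +1 for "up" and -1 for "down" can be combined as follows; how ties are resolved is a design choice left open here.

```java
// Sketch: combine binary predictions (+1 = "up", -1 = "down") by simple majority vote.
public class MajorityVote {
    static int vote(int[] predictions) {
        int sum = 0;
        for (int p : predictions) sum += p;   // each rule contributes +1 or -1
        return sum >= 0 ? +1 : -1;            // ties resolved as "up" here (a choice, not from the paper)
    }

    public static void main(String[] args) {
        // Five rules, one per discretization resolution, predicting for one time step.
        int[] preds = { +1, -1, +1, +1, -1 };
        System.out.println(vote(preds) == 1 ? "up" : "down");   // prints "up" (3 of 5 votes)
    }
}
```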
2 Method
In this section we first describe the general approach, as developed in [3], as well as the extensions introduced here, that is, combining predictors for several resolutions. Finally, the discretization method is discussed.

2.1 The General Mining Approach
The basic method works by using an evolutionary algorithm (EA) to develop rules that optimize some score (or fitness) function, which describes what qualities we are looking for. Three components are of central importance: the rule format, the mechanism for rule evaluation, and the fitness function. For the purposes of this discussion, we only consider one rule format: the rule consequent is some given event that may occur at given times, possibly extrinsic to the series itself, while the antecedent (the condition) is a pattern in a very expressive pattern language called IQL [5]. This language permits, among other things, such constructs as string matching with errors, regular expressions, shifts (latency), and Boolean combinations of expressions. In our EA system, these expressions are represented as syntax trees, in the standard manner. In order to calculate the fitness of each rule, we need to know at which positions in the data the antecedent is true. To ascertain this, we use the antecedent as a query in an information retrieval system, the Pattern Matching Chip (PMC) [4], for which our pattern language was designed. Such a chip is able to process about 100 MB/s. There are many possible fitness functions that measure the precision, degree of recall, or interestingness of a query or rule. For the relatively straightforward prediction task undertaken here, simple correlation is quite adequate. We use a supervised learning scheme, in which we mark the positions where we want the rule to make a prediction, and then, for each rule, calculate the correlation between its behaviour and the desired responses. This can be calculated from the confusion matrix of the rule (true and false positives and negatives) [3]. It is worth mentioning that even though we focus mainly on predictive power here, one of the strengths of the method is the transparency of its rule format. Rather than a black-box predictor, the user receives a rule in a human-readable language. This can be an advantage in data mining contexts.
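For a binary rule and binary target, the correlation between the rule's firing pattern and the desired responses can be computed directly from the confusion matrix; this quantity is the Matthews correlation coefficient, sketched below. The exact fitness formula used in [3] may differ in detail, so the code is an illustration of the idea rather than the authors' implementation, and the example counts are invented.

```java
// Sketch: correlation between a binary rule's hits and the desired responses,
// computed from the confusion matrix (the Matthews correlation coefficient).
public class RuleCorrelation {
    static double correlation(long tp, long fp, long tn, long fn) {
        double denom = Math.sqrt((double) (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn));
        return denom == 0 ? 0.0 : (tp * tn - fp * fn) / denom;
    }

    public static void main(String[] args) {
        // Example counts: 80 true positives, 20 false positives, 870 true negatives, 30 false negatives.
        System.out.printf("fitness = %.4f%n", correlation(80, 20, 870, 30));
    }
}
```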
2.2 Selecting and Combining Predictors
We want to investigate the effect of the parameters of the discretization process, or, more specifically, of the resolution and the alphabet size, on the predictive power of the developed rules. The specific meaning of resolution and alphabet size is given in Sect. 2.3; informally, the resolution refers to the width of each (overlapping) discretized segment of the time series, while the alphabet size is simply the cardinality of the alphabet used in the resulting strings. As mentioned in the introduction, we will compare three methods of exploiting the variability introduced by these parameters: model selection through cross-validation, majority vote ensembles, and naive Bayesian combination of predictors. The rules that are developed by our system have their performance tested on a validation set (done through a form of k-fold cross-validation). In the simple case, a single resolution and alphabet size is used, and the rule that has the best performance on the validation set is selected and tested on a separate test set. Instead of this simple approach, we use several resolutions and alphabet sizes, run the simple process for each parameter setting, and choose the rule (and, implicitly, the parameter setting) that has the highest validation rating. This way we can discover the resolution and alphabet size that best suit a given time series, without having to arbitrarily set these beforehand. To demonstrate the effect of this procedure, we also select the rule (and parameter setting) that has the lowest validation score, as a baseline. The next method is the majority vote ensemble. For this, we run the basic process for each parameter setting, and for each resolution we choose the rule (and, implicitly, the alphabet size) that has the best validation performance. We then combine these rules to form a single predictor, through voting. In other words, when we observe a time series (for example, the test set) we discretize it at all the resolutions, with the given alphabet sizes (possibly incrementally), and run each rule in its own resolution. The prediction ("up" or "down") that is prevalent (simple majority) among the rules is taken to be the decision of the combined predictor. Ensembles are described in more detail in [1]. The basic idea behind them is that if each of a set of classifiers (or predictors) is correct with a probability p ≥ 0.5 and the classifiers are generally not wrong at the same time (that is, they are somewhat independent), then a majority vote will increase the probability of a correct classification. (Because of this requirement, rules with lower than 50% validation accuracy were excluded from the ensembles.) More sophisticated ensemble methods exist; see [1] for details. The naive Bayesian predictor is a simple probabilistic model that assumes conditional independence among the predictor variables. Assume that we have a family {Xi} of predictor variables, and one dependent (predicted) variable Y. Assume that P(Y = y) is the a priori probability of a given value y being observed for Y, and P(Xi = xi | Y = y) is the conditional probability of
observing xi for Xi, given that Y = y. Then, the naive Bayesian model states that our predicted class should be

$$\zeta(x) = \arg\max_y \, P(Y = y) \prod_i P(X_i = x_i \mid Y = y). \qquad (1)$$
The independence assumption may in many cases be quite strong, but the naive Bayesian model can be surprisingly robust, and, as shown in [2], may even be optimal in cases where the independence assumption is violated.
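A direct transcription of (1) for the two-class up/down case, working with log-probabilities for numerical stability, might look as follows; the probability tables are invented for illustration and would in practice be estimated from the validation data.

```java
// Sketch: naive Bayesian combination of rule predictions, following Eq. (1).
// priors[y] = P(Y = y); cond[i][x][y] = P(X_i = x | Y = y). Classes and rule outputs
// are coded 0 = "down", 1 = "up"; all numbers below are illustrative.
public class NaiveBayesCombiner {
    static int predict(double[] priors, double[][][] cond, int[] x) {
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int y = 0; y < priors.length; y++) {
            double score = Math.log(priors[y]);
            for (int i = 0; i < x.length; i++)
                score += Math.log(cond[i][x[i]][y]);   // sum of logs instead of a product
            if (score > bestScore) { bestScore = score; best = y; }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] priors = { 0.5, 0.5 };
        double[][][] cond = {                       // two rules, each with two possible outputs
            { { 0.7, 0.3 }, { 0.3, 0.7 } },          // rule 0: P(X_0 = x | Y = y)
            { { 0.6, 0.4 }, { 0.4, 0.6 } } };        // rule 1: P(X_1 = x | Y = y)
        int[] observed = { 1, 1 };                   // both rules predicted "up"
        System.out.println(predict(priors, cond, observed) == 1 ? "up" : "down");
    }
}
```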
2.3 Feature Extraction and Discretization
Many features may potentially be extracted from time series; for the task of prediction, we need to extract features in a way that lets us access the features of a sequence prefix. A natural approach is to divide the sequence into (overlapping) windows of width w, and to extract features from each of these. This discretization approach is described in the following. Our basic discretization process is a simple one, used, for example, in [7]. It works as follows. A time series $x = (x_i)_{i=1}^{n}$, where $x_i \in \mathbb{R}$, is discretized by the following steps:

1. Use a sliding window of width k to create a sequence w of m overlapping vectors. More formally: create a sequence of windows $w = (w_i)_{i=1}^{m}$, where $w_i = (x_j)_{j=l(i)}^{r(i)}$, such that $l(1) = 1$, $r(m) = n$, $r(i) - l(i) = k - 1$ for $1 \le i \le m$, and $l(i) = l(i-1) + 1$ for $1 < i \le m$. The window width is also referred to as the resolution.
2. For some feature function $f : \mathbb{R}^k \to \mathbb{R}$, create a feature sequence $f = (f(w_i))_{i=1}^{m}$.
3. Sort f and divide it into a segments of equal length. We refer to a as the alphabet size.
4. Use the limit elements in the sorted version of f to classify each element in f, creating a new sequence s where $s_i$ is the number of the interval in which $f_i$ is found.

After this discretization process, we have a sequence s, where $s_i$ is an integer such that $1 \le s_i \le a$, and each integer occurs the same number of times in s. For convenience, we map these integers to lowercase letters (so that for a = 3 our alphabet is a...c). This discretization is performed on the training set, and the resulting limits are used to classify the features of the test set. (This means that there is no guarantee that each character is represented an equal number of times in the test set.) Many feature functions f are possible, such as average value or signal-to-noise ratio. Since we are interested in the upward and downward movement of the series, we have chosen to use the slope of the line fitted to the points in each window through linear regression.
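The following sketch implements the four steps above with the regression-slope feature and equal-frequency binning. It is an illustration under the stated definitions, not the authors' code; in particular, it computes the bin limits from the same series it discretizes, whereas in the paper's setting the limits would come from the training part and be reused for the test part.

```java
import java.util.Arrays;

// Sketch of the discretization in Sect. 2.3: sliding windows of width k, a regression-slope
// feature per window, and equal-frequency binning into an alphabet of size a.
public class Discretizer {
    // Slope of the least-squares line fitted to (0, w[0]), (1, w[1]), ..., (k-1, w[k-1]).
    static double slope(double[] w) {
        int k = w.length;
        double xMean = (k - 1) / 2.0, yMean = 0;
        for (double v : w) yMean += v;
        yMean /= k;
        double num = 0, den = 0;
        for (int j = 0; j < k; j++) {
            num += (j - xMean) * (w[j] - yMean);
            den += (j - xMean) * (j - xMean);
        }
        return num / den;
    }

    // Returns a string over 'a', 'b', ... of length n - k + 1.
    static String discretize(double[] x, int k, int a) {
        int m = x.length - k + 1;
        double[] f = new double[m];
        for (int i = 0; i < m; i++) f[i] = slope(Arrays.copyOfRange(x, i, i + k));

        double[] sorted = f.clone();               // equal-frequency bin limits from the sorted features
        Arrays.sort(sorted);
        StringBuilder s = new StringBuilder();
        for (double v : f) {
            int bin = 0;
            while (bin < a - 1 && v > sorted[(bin + 1) * m / a - 1]) bin++;
            s.append((char) ('a' + bin));
        }
        return s.toString();
    }

    public static void main(String[] args) {
        double[] series = { 1, 2, 4, 3, 5, 7, 6, 8, 9, 8 };
        System.out.println(discretize(series, 3, 3));   // e.g. a string over {a, b, c}
    }
}
```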
Fig. 1. Performance comparison (predictive accuracy of the worst single classifier, the best single classifier, the majority vote ensemble, and the Bayesian combination on the Random Walk, ECG, Earthquake, Sunspots, Network, and Exchange data sets)
3 Experiments
In our experiments we use six data sets, available from the UCR Time Series Data Mining Archive [6] (the first five) and the Federal Reserve Board [8] (the last data set):

Random Walk. A standard random walk, where each value is equal to the previous one, plus a random number.
ECG. ECG measurements from several subjects, concatenated.
Earthquake. Earthquake-related seismic data.
Sunspots. Monthly mean sunspot numbers from 1749 until 1990. The series is periodic, with one period of eleven years, and one of twenty-seven days.
Network. Packet round trip time delay for packets sent from UCR to CMU.
Exchange Rates. Daily exchange rate of Pound Sterling to US Dollars.

For each of the data sets, the target prediction was when the series moved upward, that is, when the next value was greater than the current. It is to be expected that good predictors can be found for the ECG data (as it is quite regular and periodic), whereas the random walk data, and, to some extent, the exchange rate data, function as baselines for comparison. Finding predictors for random data would clearly indicate that our experiments were flawed, and finding good predictors for exchange rate data would also be surprising, as this is considered a quite difficult (if at all possible) task. For each of the data sets we performed experiments with alphabet sizes 2 (the minimum non-trivial case), 5, 10, and 20, as well as window sizes (resolutions) 2, 4, 8, 16, and 32. This was done with tenfold cross-validation (full cross-validation was not used, due to the temporal nature of the data; the validation was constrained so that the elements of the training data occurred at an earlier time than the elements of the testing set) and early
Fig. 2. Accuracy as a function of window size and alphabet (one panel per data set: Random Walk, ECG, Earthquake, Sunspots, Network, Exchange; window sizes from 2 to 32, alphabets from a-b up to a-t)
stopping; that is, rules were selected based on their performance on a separate validation set before they were tested. As described in Sect. 2.2, results were used to select the alphabet size that gave the best single-predictor performance for each resolution, and these rules were then used in constructing ensembles and Bayesian classifiers. The rules in the ensembles were developed individually, that is, only their individual performance was used in calculating their fitness. The performance of the ensemble was then calculated by performing a simple majority vote among the rules for each position. The results are summarized in Fig. 1. The percentages (predictive accuracy) are averaged over the ten folds. For the ECG, sunspot, network, and earthquake data sets, the difference between the worst single classifier and the best single classifier is statistically significant (p < 0.01 with Fisher’s
exact test), while the differences between the best single classifier, the ensemble, and the Bayesian classifier are not statistically significant (p > 0.05 for all except the difference between the best single classifier and the Bayesian combination for the Network data, where p = 0.049). For the random and exchange rate data sets there are no significant differences, as expected. Figure 2 shows how rule accuracy is related to window and alphabet size. For the random and exchange rate data, no clear trend is discernible. Although there are clearly problem specific differences, for the ECG, sunspot, and network data, there seems to be a rough inverse relationship between window size and accuracy, regardless of alphabet size. For the earthquake data, a large alphabet makes up for poor resolution, giving peak performance for the two largest window sizes and the most fine-grained alphabet.
4 Discussion
In this paper we have examined the role of discretization when evolving time series predictor rules. We used three main techniques to improve the basic evolution presented in [3]: Model selection based on validation performance, majority vote ensembles, and naive Bayesian classifiers. Prior to our empirical study, we expected the ensembles to outperform the simple selection, and the Bayesian classifiers to outperform both of the other methods. As it turns out, on our data, there was no statistically significant difference between the three methods, even though the difference between the result they produced and that produced by unfavorable parameter settings (discretization resolution and alphabet size) was highly significant. This leads to the conclusion that, given its simplicity, plain model selection may well be the preferred method. Our experiments also showed that the relationship between discretization resolution, alphabet size, and prediction accuracy is highly problem dependent, which means that no discretization parameters can be found that work equally well in all cases.
References
[1] Thomas G. Dietterich. Ensemble methods in machine learning. Lecture Notes in Computer Science, 1857:1–15, 2000.
[2] Pedro Domingos and Michael J. Pazzani. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29(2–3):103–130, 1997.
[3] Magnus Lie Hetland and Pål Sætrom. Temporal rule discovery using genetic programming and specialized hardware. In Proc. of the 4th Int. Conf. on Recent Advances in Soft Computing, 2002.
[4] Interagon AS. Digital processing device. PCT/NO99/00308, Apr 2000.
[5] Interagon AS. The Interagon query language: a reference guide. http://www.interagon.com/pub/whitepapers/IQL.reference-latest.pdf, Sep 2002.
[6] E. Keogh and T. Folias. The UCR time series data mining archive. http://www.cs.ucr.edu/~eamonn/TSDMA, Sep 2002.
[7] E. Keogh, S. Lonardi, and W. Chiu. Finding surprising patterns in a time series database in linear time and space. In Proc. of the 8th ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining, pages 550–556, 2002.
[8] Federal Reserve Statistical Release. Foreign exchange rates 1971–2002. http://www.federalreserve.gov/Releases/H10/Hist, Oct 2002.
Feature Extraction for Classification in Knowledge Discovery Systems Mykola Pechenizkiy1, Seppo Puuronen1, and Alexey Tsymbal2 1
University of Jyväskylä Department of Computer Science and Information Systems, P.O. Box 35, FIN-40351, University of Jyväskylä, Finland {mpechen, sepi}@cs.jyu.fi 2 Trinity College Dublin Department of Computer Science College Green, Dublin 2, Ireland [email protected]
Abstract. Dimensionality reduction is a very important step in the data mining process. In this paper, we consider feature extraction for classification tasks as a technique to overcome problems occurring because of “the curse of dimensionality”. We consider three different eigenvector-based feature extraction approaches for classification. The summary of obtained results concerning the accuracy of classification schemes is presented and the issue of search for the most appropriate feature extraction method for a given data set is considered. A decision support system to aid in the integration of the feature extraction and classification processes is proposed. The goals and requirements set for the decision support system and its basic structure are defined. The means of knowledge acquisition needed to build up the proposed system are considered.
1 Introduction
Data mining applies data analysis and discovery algorithms to discover information from vast amounts of data. A typical data-mining task is to predict an unknown value of some attribute of a new instance when the values of the other attributes of the new instance are known and a collection of instances with known values of all the attributes is given. In many applications, the data which is the subject of analysis and processing in data mining is multidimensional, and presented by a number of features. The so-called "curse of dimensionality" pertinent to many learning algorithms denotes the drastic rise of computational complexity and classification error with data having a large number of dimensions [2]. Hence, the dimensionality of the feature space is often reduced before classification is undertaken. Feature extraction (FE) is one of the dimensionality reduction techniques. FE extracts a subset of new features from the original feature set by means of some
functional mapping, keeping as much information in the data as possible [5]. Conventional Principal Component Analysis (PCA) is one of the most commonly used feature extraction techniques. PCA extracts the axes on which the data shows the highest variability [7]. There exist many variations of the PCA that use local and/or non-linear processing to improve dimensionality reduction, though they generally do not use class information [9]. In our research, besides the PCA, we also consider two eigenvector-based approaches that use the within- and between-class covariance matrices and thus do take into account the class information. We analyse them with respect to the task of classification with regard to the learning algorithm being used and to the dynamic integration of classifiers (DIC). During the last years data mining has evolved from less sophisticated first-generation techniques to today's cutting-edge ones. Currently there is a growing need for next-generation data mining systems to manage knowledge discovery applications. These systems should be able to discover knowledge by combining several available techniques, and provide a more automatic environment, or an application envelope, surrounding this highly sophisticated data mining engine [4]. In this paper we consider a decision support system (DSS) approach that is based on the methodology used in expert systems (ES). The approach combines feature extraction techniques with different classification tasks. The main goal of such a system is to automate as far as possible the selection of the most suitable feature extraction approach for a certain classification task on a given data set according to a set of criteria. In the next sections we consider the feature extraction process for classification and present the summary of achieved results. Then we consider a decision support system that integrates the feature extraction and classification processes, describing its goals, requirements, structure, and the ways of knowledge acquisition. As a summary, the obtained preliminary results are discussed and the focus of further research is described.
2 Eigenvector-Based Feature Extraction
Generally, feature extraction for classification can be seen as a search process among all possible transformations of the feature set for the best one, which preserves class separability as much as possible in the space with the lowest possible dimensionality [5]. In other words, we are interested in finding a projection w:

$$y = w^T x \qquad (1)$$

where y is a $p' \times 1$ transformed data point (presented using p' features), w is a $p \times p'$ transformation matrix, and x is a $p \times 1$ original data point (presented using p features). In [10] it was shown that the conventional PCA transforms the original set of features into a smaller subset of linear combinations that account for most of the variance of the original data set. Although it is the most popular feature extraction technique, it has a serious drawback, namely the conventional PCA gives high
weights to features with higher variabilities irrespective of whether they are useful for classification or not. This may give rise to the situation where the chosen principal component corresponds to the attribute with the highest variability but has no discriminating power. A usual approach to overcome the above problem is to use some class separability criterion [1], e.g. the criteria defined in Fisher linear discriminant analysis and based on the family of functions of scatter matrices:

$$J(w) = \frac{w^T S_B w}{w^T S_W w} \qquad (2)$$
where SB is the between-class covariance matrix that shows the scatter of the expected vectors around the mixture mean, and SW is the within-class covariance, that shows the scatter of samples around their respective class expected vectors. A number of other criteria were proposed in [5]. Both parametric and nonparametric approaches optimize the criterion (2) by using the simultaneous diagonalization algorithm [5]. In [11] we analyzed the task of eigenvector-based feature extraction for classification in general; a 3NN classifier was used as an example. The experiments were conducted on 21 data sets from the UCI machine learning repository [3]. The experimental results supported our expectations. Classification without feature extraction produced clearly the worst results. This shows the so-called “curse of dimensionality” with the considered data sets and the classifier supporting the necessity to apply some kind of feature extraction in that context. In the experiments, the conventional PCA was the worst feature extraction technique on average. The nonparametric technique was only slightly better than the parametric one on average. However, the nonparametric technique performed much better on categorical data. Still, it is necessary to note that each feature extraction technique was significantly worse than all the other techniques at least on a single data set. Thus, among the tested techniques there does not exist “the overall best” one for classification with regard to all given data sets.
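For the two-class case, the direction maximizing criterion (2) has the well-known closed form w ∝ S_W^{-1}(m_1 - m_2), where m_1 and m_2 are the class means. The sketch below computes this direction for two-dimensional toy data without any library support; it illustrates the criterion only and is not the parametric or nonparametric extraction code evaluated in [11], and the data points are invented.

```java
// Sketch: Fisher discriminant direction for two classes in two dimensions,
// w ∝ Sw^{-1}(m1 - m2), which maximizes criterion (2) in the two-class case.
public class FisherDirection {
    static double[] mean(double[][] X) {
        double[] m = new double[2];
        for (double[] x : X) { m[0] += x[0]; m[1] += x[1]; }
        m[0] /= X.length; m[1] /= X.length;
        return m;
    }

    // Accumulate the within-class scatter of one class into Sw.
    static void addScatter(double[][] X, double[] m, double[][] Sw) {
        for (double[] x : X) {
            double d0 = x[0] - m[0], d1 = x[1] - m[1];
            Sw[0][0] += d0 * d0; Sw[0][1] += d0 * d1;
            Sw[1][0] += d1 * d0; Sw[1][1] += d1 * d1;
        }
    }

    public static void main(String[] args) {
        double[][] class1 = { { 4, 2 }, { 2, 4 }, { 2, 3 }, { 3, 6 }, { 4, 4 } };
        double[][] class2 = { { 9, 10 }, { 6, 8 }, { 9, 5 }, { 8, 7 }, { 10, 8 } };
        double[] m1 = mean(class1), m2 = mean(class2);

        double[][] Sw = new double[2][2];
        addScatter(class1, m1, Sw);
        addScatter(class2, m2, Sw);

        // w = Sw^{-1} (m1 - m2), using the closed-form inverse of a 2x2 matrix.
        double det = Sw[0][0] * Sw[1][1] - Sw[0][1] * Sw[1][0];
        double dm0 = m1[0] - m2[0], dm1 = m1[1] - m2[1];
        double w0 = ( Sw[1][1] * dm0 - Sw[0][1] * dm1) / det;
        double w1 = (-Sw[1][0] * dm0 + Sw[0][0] * dm1) / det;
        System.out.printf("w = (%.4f, %.4f)%n", w0, w1);
    }
}
```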
3 Managing Feature Extraction and Classification Processes
Currently, as far as we know, there is no feature extraction technique that would be the best for all data sets in the classification task. Thus the adaptive selection of the most suitable feature extraction technique for a given data set needs further research. Currently, there does not exist canonical knowledge, a perfect mathematical model, or any relevant tool to select the best extraction technique. Instead, a volume of accumulated empirical findings, some trends, and some dependencies have been discovered. We consider the possibility of taking advantage of the discovered knowledge by developing a decision support system based on the methodology of expert system design [6] in order to help to manage the data mining process. The main goal of the system is to recommend the best-suited feature extraction method and a classifier for a given data set. Achieving this goal produces a great benefit because it might be
possible to reach the performance of the wrapper type approach by using the filter approach. In the wrapper type approach, the interaction between the feature selection process and the construction of the classification model is applied, and parameter tuning for every stage and for every method is needed. In the filter approach, the evaluation process is independent from the learning algorithm, and the methods and their parameters are selected according to a certain set of criteria in advance. However, the additional goal of predicting the model's output performance also requires further consideration. The "heart" of the system is the Knowledge Base (KB), which contains a set of facts about the domain area and a set of rules in a symbolic form describing the logical relations between a concrete classification problem and recommendations about the best-suited model for that problem. The Vocabulary of the KB contains the lists of terms that include feature extraction methods and their input parameters, classifiers and their input and output parameters, and three types of data set characteristics: simple measures such as the number of instances, the number of attributes, and the number of classes; statistical measures such as the departure from normality, correlation within attributes, and the proportion of total variation explained by the first k canonical discriminants; and information-theoretic measures such as the noisiness of attributes, the number of irrelevant attributes, and the mutual information of class and attribute. Filling in the knowledge base is among the most challenging tasks related to the development of the DSS. There are two potential sources of knowledge to be discovered for the proposed system. The first is the background theory of the feature extraction and classification methods, and the second is the set of field experiments. The theoretical knowledge can be formulated and represented by an expert in the area of specific feature extraction methods and classification schemes. Generally, it is possible to categorise the facts and rules that will be present in the Knowledge Base. The categorisation can be done according to the way the knowledge has been obtained – from the analysis of experimental results or from the domain theory. Another categorisation criterion is the level of confidence of a rule. The expert may be sure of a certain fact or may merely hypothesize about another fact. In a similar way, a rule that has just been generated from the analysis of results of experiments on artificially generated data sets but has never been verified on real-world data sets deserves less confidence than a rule that has been verified on a number of real-world problems. In addition to these "trust" criteria, the categorisation of the rules makes it possible to adapt the system to a concrete researcher's needs and preferences by giving higher weights to the rules that actually are the ones of the user.
4 Knowledge Acquisition from the Experiments
Generally, the knowledge base is a dynamic part of the decision support system that can be supplemented and updated through the knowledge acquisition and knowledge refinement processes [6]. A potential contribution of knowledge to be included in the KB might be found by discovering a number of criteria from experiments conducted on artificially generated data sets with pre-defined characteristics. The results of the experiments can be
530
Mykola Pechenizkiy et al.
examined by looking at the dependencies between the characteristics of a data set in general and the characteristics of every local partition of the instance space in particular. Further, the type and parameters of the feature extraction approach best suited for the data set will help to define a set of criteria that can be applied for the generation of rules in the KB. The results of our preliminary experiments support that approach. The artificially generated data sets were manipulated by changing the amount of irrelevant attributes, the level of noise in the relevant attributes, the ratio of correlation among the attributes, and the normality of the distributions of classes. In the experiments, supervised feature extraction (both the parametric and nonparametric approaches) performed better than the conventional PCA when noise was introduced to the data sets. A similar trend was found when the artificial data sets contained missing values. The finding was supported by the results of experiments on the LED17, Monk-3 and Voting UCI data sets (Table 1), which are known to contain irrelevant attributes, noise in the attributes, and plenty of missing values. Thus, this criterion can be included in the KB to be used to give preference to supervised methods when noise or missing values exist in a data set. Nonparametric feature extraction essentially outperforms the parametric approach on data sets which include significant nonnormal class distributions and are not easy to learn. This initial knowledge about the nature of the parametric and nonparametric approaches and the results on artificial data sets were supported by the results of experiments on the Monk-1 and Monk-2 UCI data sets (Table 1).
Table 1. Accuracy results of the experiments
Dataset   PCA    Par    NPar   Plain
LED17     .395   .493   .467   .378
MONK-1    .767   .687   .952   .758
MONK-2    .717   .654   .962   .504
MONK-3    .939   .990   .990   .843
Voting    .923   .949   .946   .921

5 Discussions
So far we have not found a simple correlation-based criterion to separate the situations in which a feature extraction technique would be beneficial for the classification. Nevertheless, we found that there exists a trend between the correlation ratio in a data set and the threshold level used in every feature extraction method to address the amount of variation in the data set explained by the selected extracted features. This finding helps in the selection of the initial threshold value as a starting point in the search for the optimal threshold value. However, further research and experiments are required to check these findings. One of our further goals is to make the knowledge acquisition process semi-automatic by deriving new rules and updating the old ones based on the analysis of results obtained during self-run experimenting. This process will include generating artificial data sets with known characteristics (simple,
statistical and information-theoretic measures); running the experiments on the generated artificial data sets; deriving dependencies and defining criteria from the obtained results and updating the knowledge base; and validating the constructed theory with a set of experiments on real-world data sets, and reporting on the success or failure of certain rules. We consider a decision tree learning algorithm as a means of automatic rule extraction for the knowledge base. Decision tree learning is one of the most widely used inductive learning methods [12]. A decision tree is represented as a set of nodes and arcs. Each node contains a feature (an attribute) and each arc leaving the node is labelled with a particular value (or range of values) for that feature. Together, a node and the arcs leaving it represent a decision about the path an example follows when being classified by the tree. Given a set of training examples, a decision tree is induced in a "top-down" fashion by repeatedly dividing up the examples according to their values for a particular feature. In this context, the data set characteristics mentioned above and a classification model's outputs, which include accuracy, sensitivity, specificity, time complexity and so on, represent the instance space, and the combination of a feature extraction method's and a classification model's names with their parameter values represents the class labels. By means of analysing the tree branches it is possible to generate "if-then" rules for the knowledge base. A rule reflects a certain relationship between meta-data-set characteristics and a combination of a feature extraction method and a classification model.
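A minimal sketch of turning tree branches into "if-then" rules follows. The node structure, the meta-level feature names (noisiness, missing_values) and the recommended methods are assumptions chosen to echo the findings above; they are not output of the actual system.

```java
// Sketch: emit one "if-then" rule per branch of a (hand-built) decision tree over
// meta-level data set characteristics. Feature names and thresholds are illustrative only.
public class TreeToRules {
    static class Node {
        String test;       // e.g. "noisiness > 0.2" for an internal node, null for a leaf
        String label;      // recommended method (leaf only)
        Node yes, no;
        Node(String test, Node yes, Node no) { this.test = test; this.yes = yes; this.no = no; }
        Node(String label) { this.label = label; }
    }

    static void emitRules(Node n, String condition) {
        if (n.test == null) {                         // leaf: print the accumulated rule
            System.out.println("IF " + condition + " THEN recommend " + n.label);
            return;
        }
        String prefix = condition.isEmpty() ? "" : condition + " AND ";
        emitRules(n.yes, prefix + n.test);
        emitRules(n.no, prefix + "NOT(" + n.test + ")");
    }

    public static void main(String[] args) {
        Node tree = new Node("noisiness > 0.2",
            new Node("missing_values > 0", new Node("nonparametric FE"), new Node("parametric FE")),
            new Node("conventional PCA"));
        emitRules(tree, "");
    }
}
```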
6 Conclusions
Feature extraction is one of the dimensionality reduction techniques that are often used to cope with the problems caused by the "curse of dimensionality". In this paper we considered three eigenvector-based feature extraction approaches, which were applied to different classification problems. We presented a summary of results that shows a high level of complexity in the dependencies between the data set characteristics and the data mining process. There is no feature extraction method that would be the most suitable for all classification tasks. Due to the fact that there is no well-grounded strong theory that would help us to build up an automated system for such feature extraction method selection, a decision support system that would accumulate separate facts, trends, and dependencies between the data characteristics and output parameters of classification schemes performed in the spaces of extracted features was proposed. We considered the goals of such a system, the basic ideas that define its structure, and the methodology of knowledge acquisition and validation. The Knowledge Base is the basis for the intelligence of the decision support system. That is why we recognised the problem of discovering rules from experiments on artificially generated data sets with known predefined simple, statistical, and information-theoretic measures, and of validating those rules on benchmark data sets, as a prior research focus in this area. It should be noticed that the proposed approach has a serious limitation. Namely, the drawbacks can be expressed in terms of the fragmentariness and incoherence (disconnectedness) of the components of knowledge to be produced. And we
532
Mykola Pechenizkiy et al.
definitely do not claim the completeness of our decision support system. Rather, certain constraints and assumptions about the domain area were made, and limited sets of feature extraction methods, classifiers and data set characteristics were considered in order to guarantee the desired level of confidence in the system when solving a bounded set of problems.
Acknowledgments This research is partly supported by the COMAS Graduate School of the University of Jyväskylä, Finland and Science Foundation, Ireland. We would like to thank the UCI ML repository of databases, domain theories and data generators for the data sets, and the MLC++ library for the source code used in this study.
References
[1] Aivazyan, S.A.: Applied Statistics: Classification and Dimension Reduction. Finance and Statistics, Moscow, 1989.
[2] Bellman, R.: Adaptive Control Processes: A Guided Tour. Princeton University Press, 1961.
[3] Blake, C.L., Merz, C.J.: UCI Repository of Machine Learning Databases [http://www.ics.uci.edu/~mlearn/MLRepository.html]. Dept. of Information and Computer Science, University of California, Irvine CA, 1998.
[4] Fayyad, U.M.: Data Mining and Knowledge Discovery: Making Sense Out of Data. IEEE Expert, Vol. 11, No. 5, Oct. 1996, pp. 20-25.
[5] Fukunaga, K.: Introduction to Statistical Pattern Recognition. Academic Press, London, 1991.
[6] Jackson, P.: Introduction to Expert Systems, 3rd Edn. Harlow, England: Addison Wesley Longman, 1999.
[7] Jolliffe, I.T.: Principal Component Analysis. Springer, New York, NY, 1986.
[8] Kohavi, R., Sommerfield, D., Dougherty, J.: Data mining using MLC++: a machine learning library in C++. Tools with Artificial Intelligence, IEEE CS Press, 234-245, 1996.
[9] Liu, H.: Feature Extraction, Construction and Selection: A Data Mining Perspective. ISBN 0-7923-8196-3, Kluwer Academic Publishers, 1998.
[10] Oza, N.C., Tumer, K.: Dimensionality Reduction Through Classifier Ensembles. Technical Report NASA-ARC-IC-1999-124, Computational Sciences Division, NASA Ames Research Center, Moffett Field, CA, 1999.
[11] Tsymbal, A., Puuronen, S., Pechenizkiy, M., Baumgarten, M., Patterson, D.: Eigenvector-based feature extraction for classification. In: Proc. 15th Int. FLAIRS Conference on Artificial Intelligence, Pensacola, FL, USA, AAAI Press, 354-358, 2002.
[12] Quinlan, J.R.: C4.5 Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann, 1993.
Adaptive Per-application Load Balancing with Neuron-Fuzzy to Support Quality of Service for Voice over IP in the Internet Sanon Chimmanee, Komwut Wipusitwarakun, and Suwan Runggeratigul Sirindhorn International Institute of Technology (SIIT) Thammasat University, Thailand. {Sanon,komwut,suwan}@siit.tu.ac.th [email protected]
Abstract. This paper presents per-application load balancing with adaptive feedback control to optimize both the QoS requirement of the VoIP application (first target) and effective link usage (second target) over the Internet. To forward Internet applications over coupled routes while respecting the QoS of VoIP, a perceptron neural net is used to classify applications with similar properties into one of two groups. The perceptron, however, has limitations, such as its inability to optimize for several targets. This paper therefore combines the perceptron with a fuzzy control approach to remove this limitation and enhance its capability.
1 Introduction
Nowadays, real-time connectionless applications such as Voice over IP (VoIP) are becoming increasingly popular on the Internet. However, the TCP/IP protocols were originally developed without the idea of supporting quality of service (QoS) for real-time applications. Load balancing is one possible method for improving the QoS of the VoIP application over the Internet since a) it allows the system to exploit the availability of network resources to increase throughput and reduce the end-to-end delay of the system, which results in a better QoS [1], and b) it can provide additional bandwidth to alleviate the degrading bandwidth performance caused by the network unreliability of the Internet [2]. Although the existing load balancing tools, per-packet and per-destination [1],[2], are already implemented on routers, they cannot support both QoS and equal usage of links simultaneously. In addition, they cannot selectively control the QoS of specific real-time applications such as VoIP traffic, because packets of different applications, such as FTP trains and VoIP streams, may be distributed to the same link. Traditionally, the time-delay problem of delay-sensitive applications arises in IP networks when their data streams queue behind large data transfers such as FTP trains [2]. In other words, these tools were developed without considering the properties of the network applications, and hence introduce a low QoS for the VoIP application. This has
led [1],[2] to present ARLB, which can "selectively" control the QoS for the VoIP stream with a static rule approach. However, the Internet traffic pattern is uncertain, which results in unequal link usage. Therefore, the challenge in designing load balancing tools is how to optimize both the QoS requirements and the link usage. The proposed method, referred to as Adaptive Per-Application Load Balancing (APALB), uses the concept of application classification to enable the system to selectively control the QoS, and then tries to optimize the link usage and maintain the QoS by using feedback control. In other words, APALB is an extended version of ARLB (the per-application load balancing) with an adaptive approach. The merit of the new version is that applications can belong to either of the two groups dynamically, which makes it possible to optimize both desired targets at the same time. The perceptron is one of the simple neural nets widely used to classify whether an input belongs to one of two classes [3],[7], while ADALINE and back-propagation are generalizations of the Least Mean Square (LMS) algorithm [7]. We found that if the perceptron is used to classify applications into one of two groups, the first target, supporting QoS for VoIP, can be achieved. However, the second target, effective link usage, cannot be achieved since a) the perceptron does not try to minimize an error between output and target as ADALINE does [7], and b) the original perceptron was not developed for adaptive classification or for two desired targets [7]. Therefore, a fuzzy controller [6] is used to train the neural net so that the perceptron can minimize the error and support the desired targets. Based on simulation results, the proposed APALB with the neuron-fuzzy approach [4],[5] can reduce the time-delay of VoIP packets, decrease the unequal usage of links, and increase the network throughput when compared to the existing load balancing tools.
2 Adaptive Per-application Load Balancing Algorithm
APALB classifies each application as belonging to either Group 1 (G1) or Group 2 (G2) by using the proposed application metric (AM) to evaluate the cost of each application. The applications of G1 and G2 are then forwarded to different links. Since the Internet traffic pattern is uncertain, the criterion for evaluating cost should be adjusted to support the targets. A low cost means that the application's current properties, e.g., packet size, are similar to VoIP and it can therefore be a member of G1. A high cost means that its current properties differ considerably from VoIP and it therefore has to be a member of G2. APALB consists of two major parts: the initial weight state with the perceptron and the adaptive weight state with the neuron-fuzzy. In the initial state, G1 contains only the VoIP application and G2 contains the other applications. This part trains the perceptron to evaluate the AM formula in Eq. (1), producing the "weights for the initial state". After training is done, the AM cost of all VoIP patterns is less than or equal to zero, while the AM cost of the other applications is higher than zero, so they belong to G2. This means that VoIP obtains all the network resources of G1, since G1 contains only VoIP. In the adaptive state, the fuzzy controller adjusts the weights so that other applications can also be in G1; therefore G1 may contain both VoIP and other applications. This process, called "weights for the adaptive state", results in optimizing both the usage of network resources and the QoS for VoIP.
Fig. 1. A single-layer perceptron neural net: inputs X1/A, X2/B, X3/C (the application's properties) are weighted by W1, W2, W3 and, together with a threshold, feed the output unit Z, which classifies whether the application belongs to one of the two groups.
2.1 Initial Weight State with the Perceptron
The initial weights of the perceptron are computed in advance; the results are displayed in Subsection 4.1. The application metric (AM) uses the application's properties, including packet size, traffic volume, and RTP, in the form of the perceptron as
AM = W1(X1/A) + W2(X2/B) + W3(X3/C)
(1)
where X1 is the average packet size (bytes) and X2 is the average traffic volume in packets per second (pps). If the application is carried by the Real-time Transport Protocol (RTP), X3 is 1; otherwise X3 is 2. A is the packet size coefficient, B is the traffic volume coefficient, C is the RTP coefficient, W1 is the packet size weight, W2 is the traffic volume weight, and W3 is the RTP weight. From Table 4, the maximum average packet size and traffic volume of VoIP are 70 bytes and 50 pps, respectively. Since the characteristics of VoIP are used as the criteria for classifying applications into two groups, the cost of each element of Eq. (1), such as X1, is evaluated relative to the VoIP properties by assigning A = 70, B = 50, and C = 1. This approach allows us to evaluate the cost of each application based on the characteristics of VoIP. The perceptron is then trained with 600 different Internet application patterns to adjust the weights for classifying the VoIP application into G1 and the other applications into G2. It should be noted that the system uses the traditional port number in IP packets to identify the application type of each packet. The single-node perceptron shown in Fig. 1 computes a weighted sum of the input elements, subtracts the threshold, and passes the result through a hard-limit function [3],[7]. Z_in is the net input to the output unit Z (Eq. (2)), and the transfer function is the hard-limit function f_h (Eq. (3)).
Update weights: Wi(new) = Wi(old) + ∆Wi
(4)
Update threshold: θm (new) = θm (old) + ∆θm
(5)
where θm is the threshold.
2.2 Adaptive Weight State with Neuron-Fuzzy
The neuron-fuzzy approach consists of two mechanisms: the perceptron and the fuzzy control that trains it. After the fuzzy control has trained the perceptron, the perceptron classifies applications as belonging to G1 or G2 dynamically. Its diagram is shown in Fig. 2.
2.2.1 Perceptron Neural Net. The criterion for the adaptive application classification can be expressed as Eq. (6), where Z_in is the net input to output unit Z. The transfer function is the same as in Eq. (3). The errors (unequal usage of links and high time-delay for VoIP) are reduced most rapidly by adjusting the weights according to the following rule:
Update weights: Wi(new) = Wi(old) + WF
(7)
where WF is the weight update produced by the training of the fuzzy controller. The weights obtained in the initial state are used as the initial values of the weights for the adaptive state, and the threshold is fixed at 3.
2.2.2 Fuzzy Control for Training the Perceptron: The fuzzy controller contains three layers: an input layer, a hidden layer, and an output layer. The error is determined from the input information, and the controller then produces as output the weight update (WF) that adjusts the weight connections of the perceptron in Eq. (7).
A. Input Layer: the usage of links and the delay are the two main input parameters.
A.1) The usage-of-links parameter consists of two elements. The first is the Usage of links Error, ∆EUS(t), which is the difference between the usage of link1 and link2, denoted by US1(t) and US2(t), respectively.
∆EUS(t) = US1(t) –US2(t)
(8)
The other is the Change-in-link-usage error, ∆EUS(t)′, which can be defined as
∆EUS(t)′ = d(∆EUS(t))/dt
(9)
Fig. 2. Block diagram of APALB, containing two major parts: the perceptron and the fuzzy control, together known as the neuron-fuzzy. YUS is the link usage output, YD is the delay output, RUS is the link usage target (reference), RD is the delay target (reference), ∆US is the link usage error, ∆D is the delay error, FD is the delay force for maintaining the QoS for VoIP (first target), FUS is the link usage force for balancing the usage of link1 and link2 (second target), and WF is the weight update that trains the perceptron, which results in optimizing both the end-to-end delay for VoIP and the effective usage of links.
A.2) The delay parameter consists of two elements. The first is the Delay Error, ∆ED(t), which is the difference between the current delay D(t) and θDelay, and can be defined as
∆ED(t) = D(t) -θDelay
(10)
where θDelay is a threshold for starting the adaptive control of the system. If the delay value is longer than 100 ms, the system must start to adjust itself with feedback control to maintain the QoS for VoIP; as a result, θDelay is set to 100 ms. D(t) is the current time-delay of VoIP. The other element is the Change-in-error of the Delay Error, ∆ED(t)′, which can be defined as

∆ED(t)′ = d(∆ED(t))/dt    (11)

B. Hidden Layer: There are two main elements. The first is the usage-of-links force, FUS, a force derived from ∆EUS(t) and ∆EUS(t)′. The other is the delay force, FD, a force derived from ∆ED(t) and ∆ED(t)′.
C. Output Layer: The weight force, WF, is a force derived from FUS and FD.
D. Rule-Base for Fuzzy Control:

Table 1. Rule table for FUS ("Error" ∆EUS in rows, "Change-in-link-usage error" ∆EUS′ in columns)

              ∆EUS′:  -2   -1    0    1    2
  ∆EUS = -2           -2   -2   -2    1    0
  ∆EUS = -1           -2   -2   -1    0    1
  ∆EUS =  0           -2   -1    0    1    1
  ∆EUS =  1           -1    0    1    2    2
  ∆EUS =  2            0    1    2    2    2
Table 2. Rule table for FD ("Error" ∆ED in rows, "Change-in-delay error" ∆ED′ in columns)

              ∆ED′:   -2   -1    0    1    2
   ∆ED = 0             0    0    0    1    2
   ∆ED = 1             0    0    1    2    2
   ∆ED = 2             0    1    2    2    2
Table 3. Rule table for WF (FUS in rows, FD in columns)

               FD:     0    1    2
   FUS = -2           -2   -1    2
   FUS = -1           -1    0    2
   FUS =  0            0    1    2
   FUS =  1            1    2    2
   FUS =  2            2    2    2
E. Membership Functions:
Fig. 3. Membership functions for ∆EUS(t), ∆EUS(t)′, FUS(t), ∆ED(t), ∆ED(t)′, FD(t), and WF(t)
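To illustrate how the rule base of Tables 1-3 produces the weight update WF, the following Python sketch performs the crisp table lookups. This is only an illustration, not the authors' controller: the membership functions of Fig. 3 are not reproduced, the inputs are assumed to be already quantized to the linguistic levels used in the tables (-2..2 for the link-usage quantities, 0..2 for the delay error), and all names are my own.

    # Crisp lookup of the rule tables (illustrative sketch only).
    FUS_TABLE = {  # rows: error dEUS, columns: change-in-error dEUS'
        -2: [-2, -2, -2,  1,  0],
        -1: [-2, -2, -1,  0,  1],
         0: [-2, -1,  0,  1,  1],
         1: [-1,  0,  1,  2,  2],
         2: [ 0,  1,  2,  2,  2],
    }
    FD_TABLE = {   # rows: error dED (0..2), columns: change-in-error dED'
         0: [0, 0, 0, 1, 2],
         1: [0, 0, 1, 2, 2],
         2: [0, 1, 2, 2, 2],
    }
    WF_TABLE = {   # rows: FUS, columns: FD (0..2)
        -2: [-2, -1, 2],
        -1: [-1,  0, 2],
         0: [ 0,  1, 2],
         1: [ 1,  2, 2],
         2: [ 2,  2, 2],
    }

    def col(level, low=-2):
        return level - low  # map a linguistic level to a column index

    def weight_force(d_eus, d_eus_dot, d_ed, d_ed_dot):
        f_us = FUS_TABLE[d_eus][col(d_eus_dot)]   # Table 1
        f_d = FD_TABLE[d_ed][col(d_ed_dot)]       # Table 2
        return WF_TABLE[f_us][col(f_d, low=0)]    # Table 3

    # Example: balanced links but a rising VoIP delay yields a positive weight update
    print(weight_force(d_eus=0, d_eus_dot=0, d_ed=2, d_ed_dot=1))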
3 Simulation Model
3.1 Traffic Load Model
We obtained a total of 600 different Internet application patterns based on actual traffic measured on our intranet, summarized in Table 4. The Internet application patterns were recorded in a database, which is used for training the
perceptron and the simulator. We have built a simulator that generates the Internet traffic load by randomizing the traffic volume and packet sizes of applications drawn from the database.

Table 4. Summary of average values for the Internet application patterns, based on TCP/IP layer 3

                   Packet Size (bytes)               Traffic Volume (pps)
  Application   Avg. Max  Avg. Mean  Avg. Min    Avg. Max  Avg. Mean  Avg. Min
  VoIP              70       66.6       66           50       31.4       10
  Telnet           154      109         79            8.5       4.9       1
  DNS              180      126        102           14         7.4       1.9
  SMTP             668      446        375            8.2       3.6       1.6
  Http             670      503        164           15         5.2       1
  FTP              919      752        597            6         4.4       2.7

3.2 Network Configuration Models
The initial value of the network delay of each route is randomized within the range of 40 to 100 ms. In addition, the initial value of the access bandwidth (Internet bandwidth) of each route is randomized within the range of 45 to 256 kbps. These values may then increase, remain constant, or decrease every 10 seconds, by adding a random value within the range of -10 to +10 percent of their initial values.
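As a rough illustration of this network model, the following Python sketch randomizes and perturbs the per-route delay and access bandwidth as described above. The uniform sampling and the exact update step are assumptions; the paper does not state its random distributions.

    import random

    class Route:
        def __init__(self):
            self.delay_ms = random.uniform(40, 100)        # initial delay, 40-100 ms
            self.bandwidth_kbps = random.uniform(45, 256)  # initial bandwidth, 45-256 kbps
            self._d0, self._b0 = self.delay_ms, self.bandwidth_kbps

        def step(self):  # called every simulated 10 seconds
            self.delay_ms += random.uniform(-0.1, 0.1) * self._d0
            self.bandwidth_kbps += random.uniform(-0.1, 0.1) * self._b0

    routes = [Route(), Route()]   # the two links balanced by APALB
    for _ in range(6):            # one simulated minute
        for r in routes:
            r.step()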
4 Experimental Results
4.1 Training the Perceptron for the Initial Weight State
In the initial state, we use MatLab release 12 as the perceptron tool for finding the initial-state weights W1, W2, W3, and the threshold. The initial values of W1, W2, W3, and the threshold are set to 0. The targets are given as follows: the AM cost of VoIP is less than or equal to zero, and the AM cost of the other applications is higher than zero. After training (weight adjustment) with the 600 different Internet application patterns, the resulting values of W1, W2, W3, and the threshold are 1.1271, -3.69, 1, and 3, respectively.
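As a concrete illustration (not the authors' code), the following Python sketch applies Eq. (1) with these trained values to the measured application profiles of Table 4 and assigns groups using the rule stated in Section 2 (an AM cost of at most zero gives G1). The exact hard-limit coding of Eqs. (2)-(3) is not reproduced here.

    A, B, C = 70.0, 50.0, 1.0          # VoIP-based coefficients from Section 2.1
    W = (1.1271, -3.69, 1.0)           # trained weights reported above
    THETA = 3.0                        # reported threshold; not used in this simple grouping rule

    def am_cost(pkt_size, pps, is_rtp):
        x = (pkt_size / A, pps / B, (1.0 if is_rtp else 2.0) / C)
        return sum(w * xi for w, xi in zip(W, x))   # Eq. (1)

    def group(pkt_size, pps, is_rtp):
        return "G1" if am_cost(pkt_size, pps, is_rtp) <= 0 else "G2"

    print(group(66.6, 31.4, True))    # mean VoIP profile from Table 4 -> G1
    print(group(752.0, 4.4, False))   # mean FTP profile from Table 4  -> G2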
Fig. 4. Network Configuration Model for experimental simulation
4.2 Performance Evaluation
4.2.1 Performance Evaluation on Unequal Usage and Network Throughput: APALB, ARLB, and per-destination introduce 22%, 30%, and 58% unequal usage of links, respectively. In addition, APALB, ARLB, and per-destination achieve 86%, 74%, and 66% of the network throughput, respectively. We see that as the unequal link usage decreases, the throughput increases. In other words, APALB can increase the network throughput compared to per-destination and ARLB by up to 20% and 12%, respectively. However, its unequal link usage is higher and its throughput lower than per-packet, since APALB has to optimize not only the usage of links but also the QoS.
4.2.2 Performance Evaluation on QoS: APALB reduces the number of packets that take more than 150 ms compared to per-packet and per-destination by up to 80% and 72%, respectively. In addition, it decreases the average end-to-end delay compared to per-packet and per-destination by up to 34% and 22%, respectively. This means that APALB can offer better voice quality. However, the QoS performance of APALB and ARLB is nearly the same.
5 Conclusion
APALB uses the concept of classification and then tries to optimize link usage while maintaining the QoS for VoIP. Unfortunately, the perceptron alone is only capable of classification. Therefore, fuzzy control is introduced to enhance the capability of the perceptron so that it can support optimization of the desired targets. APALB performs better than existing load balancing tools because it is a novel adaptive per-application load balancing scheme with intelligent feedback control.
References
[1] S. Chimmanee, K. Wipusitwarakun and S. Runggeratigul: Load Balancing for Zone Routing Protocol to Support QoS in Ad Hoc Network, Proc. ITC-CSCC 2002, vol. 3, pp. 1685-1688, 16-19 July 2002.
[2] S. Chimmanee, K. Wipusitwarakun, P. Termsinsuwan and Y. Gando: Application Routing Load Balancing (ARLB) to Support QoS for VoIP Application over VPN Environment, Proc. IEEE NCA 2001, pp. 94-99, 11-13 Feb. 2002.
[3] A. Ali and R. Tervo: Traffic Identification Using Artificial Neural Network, Canadian Conference on Electrical and Computer Engineering 2001, vol. 1, pp. 667-671.
[4] Lazzerini, L. M. Reyneri and M. Chiaberge: Neuron-Fuzzy Approach to Hybrid Intelligent Control, IEEE Transactions on Industry Applications, vol. 35, no. 2, March/April 1999.
[5] L. C. Jain: Intelligent Adaptive Control: Industrial Applications, CRC Press LLC, 1999.
[6] K. M. Passino and S. Yurkovich: Fuzzy Control, Addison Wesley Longman, Inc., 1998.
[7] L. Fausett: Fundamentals of Neural Networks: Architectures, Algorithms, and Applications, Prentice-Hall, Inc., New Jersey, 1994.
Hybrid Intelligent Production Simulator by GA and Its Individual Expression Hidehiko Yamamoto and Etsuo Marui Faculty of Engineering, Gifu University, 1-1 Yanagido, Gifu, Japan [email protected]
Abstract: This paper focuses on the problem of one-by-one product input into a Flexible Transfer Line (FTL) and describes an algorithm for a hybrid intelligent off-line production simulator connected to a Genetic Algorithm (GA) system in order to search for better solutions. A new individual expression, crossover operations, and mutation operations in the GA, together with application examples, are also described. The developed hybrid simulator, which has a wide solution search space, can solve the production leveling problem.
1 Introduction
Because of the variety of users' tastes, FMS (Flexible Manufacturing Systems) and FTL (Flexible Transfer Lines), in which a single production line manufactures a variety of products, have been developed and put to work [1]. In such a production line, it is important to decide which parts are input into the line in order to fit the timing of users' needs and to increase productivity. Currently, a change from batch production to one-by-one production, which inputs parts into a production line one at a time, can be seen. For example, there are the one-by-one input method according to production ratio and the method with constraints on appearance ratios, such as "the assembly interval of a sun-roof car is every ten", which comes from engineers' experience. Although these methods realize one-by-one production, they share the problem that they do not search for better solutions. This paper focuses on the problem of one-by-one parts input into an FTL and describes research on a hybrid intelligent off-line production simulator connected to a Genetic Algorithm (GA) system in order to search for better solutions. This paper also describes methods for inputting new individuals and keeping diversity in the GA.
2 One-by-One Production System
An FTL, in which a single production line manufactures a variety of parts according to a decided production ratio, is expressed as an automatic production line including some
machine tools and some automatic conveyors connecting them. One-by-one production means the production style in which each variety of parts is input one by one under the constraint of keeping the production ratio. In order to satisfy this constraint, the production style needs production equality, so that large differences in the input numbers of the different varieties of parts do not arise within a limited time. Methods for such production leveling have been developed [2]. These methods, however, still have the problem that they ignore many other feasible solutions for the input sequence.
3 Hybrid Intelligent Simulator and Its Recurring Individual
3.1 Recurring Individual
In order to solve the problem of the conventional production equality methods, this research obtains a wide solution search space by connecting a production simulator with a GA system, i.e., a hybrid intelligent production simulator. A new individual expression that keeps the production ratio in the simulator is proposed in this paper. The individuals generated in the algorithm of the production simulator correspond to the input information of the production simulator. For example, when the varieties of parts are A, B, and C, the sequence A,B,C,A,B,C,… or A,A,C,B,A,C,B,… corresponds to an individual. The conventional method of generating an individual is to randomly select a gene and assign it to a locus [3]. Because one-by-one production has the constraint of keeping the goal production ratio, the ratio of input parts must not be ignored when generating individuals. In order to give this equality to a GA problem, it is important that the percentage of parts assigned as genes expressing the input sequence does not differ much from the goal production ratio. The conventional method of generating individuals does not consider the ratio of each gene in an individual. There is another problem: if the simulation time is very long, the number of input parts also becomes very large, and if the conventional individual expression, a sort of belt-type chromosome [4], is used, the individual length will become thousands or millions. In order to solve these problems, using the characteristics of production equality, a new individual expression that keeps the gene ratio and generates a short individual length is used. If the variety of products is uniform during a limited time, the production equality can be regarded as being maintained during that time. If a constant individual length is programmed and the equality holds within the individual length, the individual is regarded as containing a constant percentage of each gene. Production equality is realized by the following algorithm.
[Primitive Individuals Generative Algorithm]
STEP 1: Based on the gene ratio for v kinds of genes, r(1) : r(2) : … : r(v), the probability P(v) of assigning each gene to a locus is calculated as follows:
P(v) = r(v) / (r(1) + r(2) + … + r(v))    (1)
STEP 2: Using the assignment probability P(v), select a gene from among the v kinds of genes with the roulette strategy. The selected gene is denoted by "i".
STEP 3: Randomly select an integer constant "a", and determine the value q(v), corresponding to the quantity of each gene in an individual, and the individual length δ with the following equations:

q(v) = a × r(v)    (2)

δ = q(1) + q(2) + … + q(v)    (3)
STEP 4: Let Sum(v) denote the cumulative number of assigned genes, and give it the initial value

Sum(v) = 0    (4)

STEP 5: Apply the following rule to the cumulative value Sum(i) of gene "i" and its gene quantity q(i):
[if] Sum(i) ≠ q(i) [then] continue to STEP 6
[else] regard the assignment of gene i as invalid and go to STEP 9 in order to proceed with another gene assignment.
STEP 6: Take gene "i" as the gene to assign and place it on the leftmost vacant locus.
STEP 7: Update the cumulative value of gene "i" with the following equations:

newSum(i) = Sum(i) + 1    (5)

Sum(i) = newSum(i)    (6)
STEP 8: By comparing the individual length δ with the current cumulative values Sum(v) of each gene, make the finish judgment of this algorithm with the following rule:
[if] δ = Sum(1) + Sum(2) + … + Sum(v)
[then] regard the acquired chromosome as an individual and finish this algorithm.
[else] continue to STEP 9.
STEP 9: Using the roulette strategy with the assignment probability P, select a gene from among the v kinds of genes, denote it by "i", and go back to STEP 5.
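A minimal Python sketch of the Primitive Individuals Generative Algorithm is given below (an illustration, not the authors' implementation; function and variable names are my own). It builds one individual whose gene counts match the goal production ratio exactly.

    import random

    def generate_individual(ratio, a=1):
        """STEPs 1-9: build one recurring individual from a gene ratio.
        ratio: dict mapping gene (part type) -> integer ratio r(v)."""
        total = sum(ratio.values())
        p = {g: r / total for g, r in ratio.items()}          # Eq. (1)
        q = {g: a * r for g, r in ratio.items()}              # Eq. (2)
        delta = sum(q.values())                               # Eq. (3)
        count = {g: 0 for g in ratio}                         # Eq. (4)

        def roulette():
            x, acc = random.random(), 0.0
            for g, pg in p.items():
                acc += pg
                if x <= acc:
                    return g
            return g

        individual = []
        while len(individual) < delta:                        # STEP 8
            g = roulette()                                    # STEP 2 / STEP 9
            if count[g] != q[g]:                              # STEP 5
                individual.append(g)                          # STEP 6
                count[g] += 1                                 # STEP 7 (Eqs. (5)-(6))
        return individual

    # Example: three part types with goal ratio 3 : 2 : 1 give an individual of length 6
    print(generate_individual({"A": 3, "B": 2, "C": 1}))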
Fig. 1. Recurring individual example
In this way, the algorithm generates an individual of adequate length without changing the gene ratio it contains. The individual generated by the algorithm is expressed as a ring-type individual, as shown in Figure 1. In the ring-type individual, the first cycle starts at the locus to which the gene selected in STEP 2 of the algorithm is assigned, regarded as tentative locus number "1", and ends at tentative locus number "δ", obtained by repeatedly adding 1 to the right around the ring. Tentative locus number 1 is then regarded as tentative locus number "δ+1", corresponding to the start of the second cycle. By recurring this operation in the same manner, a single individual is expressed. This individual is called a recurring individual.
3.2 Production Simulator and New Individuals Inputs
In the hybrid simulator, the production simulator and the GA system are connected by the idea that the GA system generates part of the input information used to operate the production simulator [5]. The relation is realized by the following algorithm.
[Algorithm Connecting the Production Simulator and GA]
STEP 1: Generate a population of n individuals corresponding to randomly selected genes.
STEP 2: Carry out the production simulations with the input information corresponding to each individual, and calculate each individual's fitness from the simulation results.
STEP 3: Apply the elitist strategy, sending the e individuals with the highest fitness to the next generation.
STEP 4: Select a pair of individuals with a probability based on fitness and apply the crossover operation to them.
STEP 5: Select an individual with the mutation probability (Pm); the selected individual undergoes the mutation operation.
STEP 6: If the number of individuals generated in STEP 4 and STEP 5 equals (n-e), continue to STEP 7. If not, return to STEP 4.
STEP 7: Take the n individuals (the e obtained in STEP 3 plus the (n-e) obtained in STEP 6) as the next generation's population, which finishes one cycle of the GA operation.
STEP 8: If the finish condition is not satisfied, generate f new individuals, delete the f individuals with the smallest fitness among the n individuals of the population, put the f new individuals into the population, and return to STEP 2. If the condition is satisfied, the algorithm is finished.
In STEP 8 of the above algorithm, new individuals are occasionally input into the population, and because of this the diversity can be kept.
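The following Python sketch outlines the GA loop of STEPs 1-8 (illustrative only; it reuses the generate_individual function from the earlier sketch). The fitness callable stands in for running the production simulator, and the crossover and mutation operators are simplified, count-preserving stand-ins rather than the paper's recurring-individual operators.

    import random

    def evolve(ratio, fitness, n=50, e=2, f=10, pm=0.01, generations=60):
        def crossover(a, b):
            # reorder a random segment of parent a by the gene order in parent b,
            # so the child keeps the same gene counts (production ratio)
            i, j = sorted(random.sample(range(len(a)), 2))
            seg = sorted(a[i:j], key=b.index)
            return a[:i] + seg + a[j:]

        def mutate(ind):
            i, j = random.sample(range(len(ind)), 2)
            ind[i], ind[j] = ind[j], ind[i]                   # swap keeps gene counts

        pop = [generate_individual(ratio) for _ in range(n)]  # STEP 1
        for gen in range(1, generations + 1):
            pop.sort(key=fitness, reverse=True)               # STEP 2
            nxt = [list(ind) for ind in pop[:e]]              # STEP 3 (elites)
            while len(nxt) < n:                               # STEPs 4-6
                a, b = random.sample(pop[: n // 2], 2)
                child = crossover(a, b)
                if random.random() < pm:
                    mutate(child)                             # STEP 5
                nxt.append(child)
            pop = nxt                                         # STEP 7
            if gen % 5 == 0:                                  # STEP 8: new individuals
                pop.sort(key=fitness, reverse=True)
                pop[-f:] = [generate_individual(ratio) for _ in range(f)]
        return max(pop, key=fitness)

    # Toy stand-in fitness: reward sequences that avoid inputting the same part twice in a row
    best = evolve({"A": 3, "B": 2, "C": 1},
                  fitness=lambda ind: -sum(x == y for x, y in zip(ind, ind[1:])))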
4 Application Examples
4.1 Simulations for Productivity
The developed algorithm is applied to an FTL example. The FTL is an assembly line with 8 stations (st1, st2, … , st8) and assembles 10 products. Products are input into station st1 one by one. The assembly times of the different products differ from station to station. The goal cycle time of this line is 12 seconds. The production ratio of the 10 products is P1 : P2 : … : P10 = 9 : 6 : 7 : 6 : 8 : 7 : 3 : 2 : 4 : 1. The FTL has glitches (failures) that happen randomly every 120 seconds; their repair time is constant at 20 seconds. The conditions of the GA system are as follows: population size = 50, mutation probability = 1%, elite individuals = 2, integer constant (a) = 1, individual length (δ) = 53, optional constant (ε) = 865, and optional integers S1 = 5 and S2 = 5. As the finishing condition of STEP 8 of the algorithm in Section 3, reaching generation 60 is adopted.
Fig. 2. Fitness curves
Several simulations with different random number series were carried out. Figure 2 shows one of the results. In the figure, a thin curve shows the average fitness among the 50 individuals and a bold curve shows the maximum fitness among the 50. Fitness convergence appears from about generation 35. Similar results were obtained with other random number series. The resulting production ratio was P1 : P2 : … : P10 = 8.505 : 5.902 : 6.854 : 5.902 : 7.829 : 6.854 : 2.927 : 1.951 : 3.902 : 1, which is very similar to the goal production ratio. When the conventional belt-type chromosome was used, the program stopped because a crossover satisfying the production ratio could not be carried out.
4.2 Simulations for Diversity
In general, a GA avoids falling into local solutions by expanding the solution space through mutations. This research carries out the simulations with the production simulator connected to the GA, which means that not only the calculations of many GA evolutions but also the calculations of the production simulations are needed. In particular, the production simulation for each individual takes considerable time, so if a large population size is adopted, a tremendous amount of total simulation time is needed. It is therefore necessary to adopt a small population size, and we adopt 50 individuals per generation. If the population size is small, however, the diversity can disappear. In order to solve this problem, we adopt the method of occasionally inputting new individuals into the population: every 5 generations, 10 individuals are input as new individuals to keep the diversity. Figure 2 also shows the fitness curves of the simulation results. The bold dotted curve shows the maximum fitness without the diversity idea. Around generation 38, the curve saturates. Examining the individuals in the population, it is found that the chromosome sequences are very close to each other, the diversity disappears, and the simulations head straight to one local solution. The bold solid curve shows the maximum fitness with the diversity idea, i.e., with new individuals input. The thin solid curve shows the average fitness. It can be seen that the average fitness curve sometimes drops because of inputting 5 new individuals every 5 generations. Because of this, the solution space is always kept wide, and the best fitness curve (bold solid curve) saturates at a higher position.
5 Conclusions
This paper describes the development of a hybrid intelligent off-line production simulator connected to a GA system in order to realize one-by-one production for a variety of products. The production simulator includes recurring individuals and their crossover and mutation operations, which do not change the resulting production ratio of the product varieties. The developed production simulator was applied to an FTL model that assembles a variety of products. As a result, the acquired production ratio was very similar to the goal production ratio. The developed production simulator can be used when starting a production plan for an FTL that manufactures a variety of products. It is also ascertained that better solutions are acquired because new individuals are input to keep the diversity.
References
[1] Monden, Y.: Toyota Production System, Institute of Industrial Engineers, Atlanta, GA (1983).
[2] Hitomi, K.: Production Management, Corona Pub. (in Japanese).
[3] Goldberg, D. E.: Genetic Algorithms in Search, Optimization and Machine Learning, Addison-Wesley (1989).
[4] Holland, J. H.: Adaptation in Natural and Artificial Systems, Univ. of Michigan Press (1975).
[5] Yamamoto, H.: Simulator of Flexible Transfer Line including Genetic Algorithm and its Applications to Buffer Capacity Decision, Proc. of the 3rd IFAC Workshop on Intelligent Manufacturing Systems, pp. 127-132 (1995).
On the Design of an Artificial Life Simulator Dara Curran and Colm O’Riordan Dept. of Information Technology National University of Ireland, Galway, Ireland Abstract. This paper describes the design of an artificial life simulator. The simulator uses a genetic algorithm to evolve a population of neural networks to solve a presented set of problems. The simulator has been designed to facilitate experimentation in combining different forms of learning (evolutionary algorithms and neural networks). We present results obtained in simulations examining the effect of individual life-time learning on the population’s performance as a whole.
1 Introduction
Genetic algorithms have long been used as an efficient approach to solution optimisation [1]. Based on a Darwinian evolutionary scheme, potential solutions are mapped to genetic codes which, in the case of the canonical genetic algorithm, are represented by bit patterns. Each of these solutions is tested for validity and a portion are selected to be combined to create the next generation of solutions. Using this mechanism in an iterative manner, the approach has been shown to solve a variety of problems [2]. Artificial neural networks are a method of machine learning based on the biological structure of the nervous systems of living organisms. A neural network is composed of nodes and interconnecting links with associated adjustable weights which are modified to alter a network’s response to outside stimuli. Through a process of training, a neural network’s error can be iteratively reduced to improve the network’s accuracy. The focus of this paper is on the design and development of an artificial life simulator which combines both genetic algorithm and neural network techniques. An initial set of experiments is also presented, which examines the relationship between life-time learning and increased population fitness. The next section discusses in more detail some limitations of genetic algorithms and neural networks when used in isolation and presents some of the successful work which has used a combination of the two approaches. Section 3 presents the simulator’s architecture, including the encoding mechanism employed and Section 4 outlines the initial experiments undertaken with the simulator.
2 Related Work
While the individual use of genetic algorithms and neural networks has been shown to be successful in the past, there are limitations associated with both
approaches. The learning algorithm employed by neural network implementations is frequently based on back propagation or a similar gradient descent technique employed to alter weighting values. These algorithms are liable to become trapped in local maxima and, in addition, it is difficult to foresee an appropriate neural network architecture to solve a given problem [3], [4]. Genetic algorithms may also become trapped in local maxima and, furthermore, they require that potential solutions to a given problem be mapped to genetic codes. Not all problems are readily converted to such a scheme [1]. The combination of neural networks and genetic algorithms originally stemmed from the desire to generate neural network architectures in an automated fashion using genetic algorithms [5], [6]. The advantage of this approach is that neural networks can be selected according to a variety of criteria such as number of nodes, links and overall accuracy. The neural network component, on the other hand, provides computational functionality at an individual level within the genetic algorithm's population. The combination of genetic algorithms and neural networks has since been proven successful in a variety of problem domains ranging from the study of language evolution [7] to games [8].

Fig. 1. Simulator Architecture
3 Simulator Architecture
The architecture of the simulator is based on a hierarchical model (Fig. 1). Data propagates from the simulator's interface down to the simulator's lowest level. The neural network and genetic algorithm layers generate results which are then fed back up to the simulator's highest level.
3.1 Command Interpreter Layer
The command interpreter is used to receive input from users in order to set all variables used by the simulator. The interpreter supports the use of scripts, allowing users to store parameter information for any number of experiments.
3.2 Neural Network Layer
The neural network layer generates a number of networks using variables set by the command interpreter and initialises these in a random fashion. Once initialisation is complete, the network layer contains a number of neural networks ready to be trained or tested. Training. Several algorithms exist to alter the network's response to input so as to arrive at the correct output. In this system, the back propagation algorithm is used. Error reduction is achieved by altering the value of the weights associated with links connecting the nodes of the network. Each exposure to input and subsequent weight altering is known as a training cycle. Testing. Testing allows the simulator to ascertain how well each network solves a given problem. The output for each network is used by the selection process in the genetic algorithm layer.
3.3 Encoding and Decoding Layers
To perform genetic algorithm tasks, the neural network structures must be converted into gene codes upon which the genetic algorithm will perform its operations. This conversion is carried out by the encoding layer. Once the genetic algorithm has generated the next generation, the decoding layer converts each gene code back to a neural network structure to be trained and tested. The encoding and decoding layers follow the scheme outlined in Section 3.5.
3.4 Genetic Algorithm Layer
The genetic algorithm layer is responsible for the creation of successive generations and employs three operators: selection, crossover and mutation. Selection. The selection process employed uses linear based fitness ranking to assign scores to each individual in the population and roulette wheel selection to generate the intermediate population. Crossover. As a result of the chosen encoding scheme, crossover does not operate at the bit level as this could result in the generation of invalid gene codes. Therefore, the crossover points are restricted to specific intervals—only genes indicating the end of a complete node or link value can be used as a crossover point. Two-point crossover is employed in this implementation. Once crossover points are selected, the gene portions are swapped. The connections within each portion remain intact, but it is necessary to adjust the connections on either side of the portion to successfully integrate it into the existing gene code. This
is achieved by using node labels for each node in the network. These labels are used to identify individual nodes and to indicate the location of interconnections. Once the portion is inserted, all interconnecting links within the whole gene code are examined. If any links are now pointing to non-existing nodes, they are modified to point to the nearest labelled node. Mutation. The mutation operator introduces additional noise into the genetic algorithm process thereby allowing potentially useful and unexplored regions of the problem space to be probed. The mutation operator usually functions by making alterations on the gene code itself, most typically by altering specific values randomly selected from the entire gene code. In this implementation, weight mutation is employed. The operator takes a weight value and modifies it according to a random percentage in the range -200% to +200%.
3.5 Encoding and Decoding Schemes
Before the encoding and decoding layers can perform their respective tasks, it is necessary to arrive at a suitable encoding scheme. Many schemes were considered in preparation of these experiments. These included Connectionist Encoding [6], Node Based Encoding [9], Graph Based Encoding [10], Layer Based Encoding [11], Marker Based Encoding [8], Matrix Re-writing [12],[13], Cellular Encoding [14], Weight-based encoding [3],[4] and Architecture encoding [15]. In choosing a scheme, attention was paid to flexibility, efficiency and scalability. The scheme chosen is based on Marker Based Encoding, which allows any number of nodes and interconnecting links for each network, giving a large number of possible neural network permutations. Marker based encoding represents neural network elements (nodes and links) in a binary string. Each element is separated by a marker to allow the decoding mechanism to distinguish between the different types of element and therefore deduce interconnections [12], [13]. In this implementation, a marker is given for every node in a network. Following the node marker, the node's details are stored in sequential order on the bit string. This includes the node's label and its threshold value. Immediately following the node's details is another marker which indicates the start of one or more node-weight pairs. Each of these pairs indicates a back connection from the node to other nodes in the network along with the connection's weight value. Once the last connection has been encoded, the scheme places an end marker to indicate the end of the node's encoding. The scheme has several advantages over others: - Nodes can be encoded in any particular order, as their role within the network is determined by their interconnecting links. - The network structures may grow without restriction: any number of nodes can be encoded along with their interconnections.
- Links between nodes can cross layer boundaries. For instance, a node in the input layer may link directly to a node in the output layer, even if there are many layers between the two. - The system encodes individual weighting values as real numbers, which eliminates the ‘flattening’ of the learned weighting values which can occur when real number values are forced into fixed bit-size number values. The decoding mechanism must take the gene codes and generate neural network data structures ready to be trained or tested. Any decoding mechanism employed must be robust and tolerate imperfect gene codes. A number of anomalies may occur following crossover: - Data may occasionally appear between an end marker or a start marker. In such a circumstance the decoder ignores the data as it is no longer retrievable. - It is possible that extra start or end markers may be present within a node definition. In such a case, two choices are possible: either the new marker and its contents are ignored, or the previous section is ignored and the new one is taken as valid. The current implementation follows the latter approach. - A start marker may have no corresponding end marker or vice-versa. In such a situation, the decoder ignores the entire section of the gene code.
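The following Python sketch illustrates the flavour of marker-based encoding and tolerant decoding described above. It is not the authors' format: symbolic markers and Python lists are used instead of binary strings, weights are kept as plain real numbers, and the anomaly handling is simplified.

    START, CONN, END = "START", "CONN", "END"   # symbolic markers (assumed)

    def encode_node(label, threshold, back_connections):
        """back_connections: list of (source_label, weight) pairs."""
        gene = [START, label, threshold, CONN]
        for src, weight in back_connections:
            gene += [src, weight]                # weights kept as real numbers
        gene.append(END)
        return gene

    def decode(gene_code):
        """Tolerant decoder: skips malformed sections, as described above."""
        nodes, i = {}, 0
        while i < len(gene_code):
            if gene_code[i] != START:
                i += 1                           # ignore stray data between markers
                continue
            try:
                end = gene_code.index(END, i)
            except ValueError:
                break                            # start marker with no end: drop section
            body = gene_code[i + 1:end]
            if CONN in body:
                c = body.index(CONN)
                label, threshold = body[0], body[1]
                pairs = body[c + 1:]
                nodes[label] = (threshold, list(zip(pairs[0::2], pairs[1::2])))
            i = end + 1
        return nodes

    gene = encode_node(3, 1.0, [(1, 0.42), (2, -1.3)]) + encode_node(4, 0.5, [(3, 0.9)])
    print(decode(gene))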
4 Experiments
The problem set employed for these experiments was 5-bit parity. Each network was exposed to 5-bit patterns and trained to determine the parity of each pattern. The number of training iterations was varied between 0, 10 and 100 iterations. The crossover rate was set to 75% and the mutation rate to 2%. Three experiments were carried out in total with 500 networks in each generation for 600 generations. The general aim of these experiments was twofold: to demonstrate the validity of the simulator and to ascertain how much the training or learning process affects each population's fitness. The experimental results are shown in Fig. 2. No Training. When the genetic algorithm is used in isolation without the help of the learning process, the population's fitness shows very little improvement in the first 200 generations. There is then a slight increase in fitness leading to further stagnation at around the 0.3 fitness level. The genetic algorithm alone is unable to generate a successful population for this problem set. 10 Training Iterations. The addition of training shows that even a modest increase in the population's individual learning capability enables the simulator to achieve very high levels of fitness. The fitness level ascends steadily to 0.85 before leveling out at nearly 0.9. At this level of fitness most individuals in the population are capable of solving all 32 patterns in the 5-bit parity problem.
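For reference, the 5-bit parity problem set can be generated as in the short Python sketch below; the coding of the target (1 for an odd number of set bits) is an assumption, since the exact target convention is not stated.

    from itertools import product

    parity_data = [(bits, sum(bits) % 2) for bits in product((0, 1), repeat=5)]
    print(len(parity_data))        # 32 patterns
    print(parity_data[:3])         # ((0,0,0,0,0), 0), ((0,0,0,0,1), 1), ((0,0,0,1,0), 1)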
Fig. 2. Population Fitness For 0-100 Training cycles (fitness, 0.2-1.0, plotted against generations, 0-600, for the No Training, Training x 10, and Training x 100 runs)
100 Training Iterations. Once the networks receive more training, the advantages of the training process become obvious. The population’s fitness increases along a steep curve before jumping 0.3 points in 100 generations. The curve then levels out at around 0.95, the highest level of fitness attained in this experiment set.
5 Conclusion
The results achieved with the simulator indicate that population learning alone is not capable of solving problems of a certain difficulty. Once lifetime learning is introduced, the training process guides the population towards very high levels of fitness. The fact that the population is capable of achieving such high levels of fitness from little training (in the case of the 10 Training Iterations experiment) suggests that this approach should be capable of solving more complex problems.
6 Acknowledgements
This research is funded by the Irish Research Council for Science, Engineering and Technology.
References
[1] J. H. Holland. Adaptation in Natural and Artificial Systems. Ann Arbor, MI: The University of Michigan Press, 1975.
[2] D. E. Goldberg. Genetic Algorithms in Search, Optimization and Machine Learning. Reading, MA: Addison-Wesley, 1989.
[3] R. S. Sutton. Two problems with backpropagation and other steepest-descent learning procedures for networks. In Proc. of 8th Annual Conf. of the Cognitive Science Society, pages 823-831, 1986.
[4] J. F. Kolen and J. B. Pollack. Back propagation is sensitive to initial conditions. In Richard P. Lippmann, John E. Moody, and David S. Touretzky, editors, Advances in Neural Information Processing Systems, volume 3, pages 860-867. Morgan Kaufmann Publishers, Inc., 1991.
[5] P. J. Angeline, G. M. Saunders, and J. P. Pollack. An evolutionary algorithm that constructs recurrent neural networks. IEEE Transactions on Neural Networks, 5(1):54-65, January 1994.
[6] R. K. Belew, J. McInerney, and N. N. Schraudolph. Evolving networks: Using the genetic algorithm with connectionist learning. In Christopher G. Langton, Charles Taylor, J. Doyne Farmer, and Steen Rasmussen, editors, Artificial Life II, pages 511-547. Addison-Wesley, Redwood City, CA, 1992.
[7] B. MacLennan. Synthetic ethology: An approach to the study of communication. In Artificial Life II: The Second Workshop on the Synthesis and Simulation of Living Systems, Santa Fe Institute Studies in the Sciences of Complexity, pages 631-635, 1992.
[8] D. Moriarty and R. Miikkulainen. Discovering complex othello strategies through evolutionary neural networks. Connection Science, 7(3-4):195-209, 1995.
[9] D. W. White. GANNet: A genetic algorithm for searching topology and weight spaces in neural network design. PhD thesis, University of Maryland College Park, 1994.
[10] J. C. F. Pujol and R. Poli. Efficient evolution of asymmetric recurrent neural networks using a two-dimensional representation. In Proceedings of the First European Workshop on Genetic Programming (EUROGP), pages 130-141, 1998.
[11] M. Mandischer. Representation and evolution of neural networks. In R. F. Albrecht, C. R. Reeves, and N. C. Steele, editors, Artificial Neural Nets and Genetic Algorithms: Proceedings of the International Conference at Innsbruck, Austria, pages 643-649. Springer, Wien and New York, 1993.
[12] H. Kitano. Designing neural networks using genetic algorithm with graph generation system. Complex Systems, 4:461-476, 1990.
[13] P. M. Todd, G. F. Miller, and S. U. Hedge. Designing neural networks using genetic algorithms. In Proceedings of the Third International Conference on Genetic Algorithms and Their Applications, pages 379-384, 1989.
[14] F. Gruau. Neural Network Synthesis using Cellular Encoding and the Genetic Algorithm. PhD thesis, Centre d'etude nucleaire de Grenoble, Ecole Normale Superieure de Lyon, France, 1994.
[15] J. R. Koza and J. P. Rice. Genetic generation of both the weights and architecture for a neural network. In International Joint Conference on Neural Networks, IJCNN-91, volume II, pages 397-404, Washington State Convention and Trade Center, Seattle, WA, USA, 8-12 1991. IEEE Computer Society Press.
3-Dimensional Object Recognition by Evolutional RBF Network Hideki Matsuda1, Yasue Mitsukura2, Minoru Fukumi2, and Norio Akamatsu2 1 Graduate
School of Engineering, University of Tokushima, 2-1 Minami-josanjima, Tokushima, Japan [email protected] 2 Faculty of Engineering, University of Tokushima, 2-1 Minami-josanjima, Tokushima, Japan {mitsu,fukumi,akamatsu}@is.tokushima-u.ac.jp
Abstract. This paper addresses the recognition of 3-dimensional objects using an evolutional RBF network. The proposed RBF network prepares four RBFs for each hidden layer unit and selects among them based on the Euclidean distance between the input image and each RBF center. This structure is invariant to 2-dimensional rotations by multiples of 90 degrees; the remaining rotational invariance is achieved by the RBF network itself. The number, form, and arrangement of the RBFs in the hidden layer units are determined using a real-coded GA. Computer simulations show that object recognition can be performed with this method.
1 Introduction
With the development of computer technology, various robots, such as humanoid robots and pet-type robots, have been developed, some of which have visual systems. However, the visual system a robot uses to recognize objects has not yet reached the same level of pattern recognition as a human. A human who has seen a known object from one angle can recognize what the object is from any angle; a robot, in contrast, cannot recognize the object except from the angle it was trained on. Overcoming this weakness is indispensable for further improvement of robotic visual systems. To overcome it, research on 3-dimensional object recognition has been carried out in recent years. One such technique recognizes an object by restoring its 3-dimensional form from multi-viewpoint pictures. However, object restoration takes a huge amount of calculation time and usually needs a large amount of training data, so we think it is unsuitable for recognizing an object from a picture photographed at an unfamiliar angle. A technique that recognizes an object by a different method, without building its 3-dimensional form, is desirable. There are also reports of recognition by techniques that build the geometric relations between the training pictures using a neural network.
Based on the above considerations, we propose a technique for 3-dimensional object recognition using a Radial Basis Function (RBF) network into which a rotation-invariant structure is introduced. The RBF network is known to have the advantages that an input image can be localized and that its interpolation performance is excellent. The rotation-invariant structure prepares four RBFs for each hidden layer unit of the RBF network, calculates the Euclidean distance between the input image and the center position of each RBF, and uses the RBF with the nearest Euclidean distance. With such a structure, rotated objects can be trained at the same time, so it is not necessary to add new hidden layer units, and this reduction in the number of hidden layer units is expected to improve the training speed. Calculation time and recognition capability are influenced by the arrangement, form, and input area of the RBFs. The network can be trained completely if as many RBFs as training data are prepared, but when the amount of training data becomes huge, the number of RBFs increases, the calculation time becomes long, and the approach becomes less practical. Therefore, in this paper a Genetic Algorithm (GA) is used to find the number, form, and arrangement of the RBFs.
2 RBF Network
The output functions of the RBF network are given by

y_j = exp( -(x - a_j)^T (x - a_j) / (2 σ_j^2) )    (1)

z_k = w_k^T y    (2)
where x is the input vector of the input layer, z_k (k=1,2, …, n) is the output value of each output layer unit, a_j (j=1,2, …, m) is the center position of the j-th RBF, σ_j (j=1,2, …, m) is the range of the j-th RBF, and w_k (k=1,2, …, n) is the weight vector connecting the hidden layer to output unit k. In this paper, a Gaussian function is used as the RBF.
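A minimal NumPy sketch of the forward pass in Eqs. (1)-(2) is given below (an illustration, not the authors' code); the shapes assume the 32 x 32 gray-scale inputs and the three output classes used later in the experiments.

    import numpy as np

    def rbf_forward(x, centers, sigmas, W):
        d2 = np.sum((centers - x) ** 2, axis=1)      # (x - a_j)^T (x - a_j)
        y = np.exp(-d2 / (2.0 * sigmas ** 2))        # Eq. (1)
        return W @ y                                 # Eq. (2): z_k = w_k^T y

    rng = np.random.default_rng(0)
    x = rng.random(32 * 32)                          # dummy input image
    centers = rng.random((12, 32 * 32))              # 12 hidden units (as in Table 2)
    sigmas = rng.uniform(0.5, 2.0, 12)
    W = rng.standard_normal((3, 12))                 # 3 output classes (3 objects)
    print(rbf_forward(x, centers, sigmas, W))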
3 Recognition System
3.1 Network Form
The system proposed in this paper has a network structure as shown in Fig. 1. The brightness values of an input image are the input signals to the input layer. The output value of each output layer unit indicates the possibility that the input is the corresponding object. Four RBFs are prepared for each hidden layer unit, each with its own center position information, as shown in Fig. 2. The four center positions are constructed from the position information expressed by one arrangement, rotated by 0, 90, 180, and 270 degrees. For every input, each hidden layer unit
finds the Euclidean distance between the input image and the center position of each of its four RBFs, and uses only the RBF with the nearest distance. With such a structure, the pictures of Fig. 3 (b), (c), and (d) can also be trained at the same time by training the picture of Fig. 3 (a). Because the output layer of the RBF network is a linear filter model, the error of each unit is corrected using the Normalized LMS algorithm, Eq. (3).
Fig. 1. Network structure
The normalized LMS algorithm is given by
w_(k+1) = w_k + µ ε_k y_k / ||y_k||    (3)
where µ is a step size, ε k (k=1,2, …, n) is the error between the output signal of each output layer unit and a teacher signal.
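The rotation-invariant hidden unit described above can be sketched as follows (illustrative Python, not the authors' implementation): each hidden unit keeps one center image and its three 90-degree rotations, and responds with the closest of the four.

    import numpy as np

    def rotation_invariant_activation(image, center, sigma):
        rotations = [np.rot90(center, k) for k in range(4)]       # 0, 90, 180, 270 deg
        d2 = [np.sum((image - r) ** 2) for r in rotations]        # Euclidean distances
        return np.exp(-min(d2) / (2.0 * sigma ** 2))              # use the nearest RBF

    rng = np.random.default_rng(1)
    center = rng.random((32, 32))
    image = np.rot90(center, 3) + 0.01 * rng.random((32, 32))     # nearly a 270-deg view
    print(rotation_invariant_activation(image, center, sigma=1.0))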
Fig. 2. The center positions of the four RBFs in the rotation-invariant structure
Fig. 3. Examples of input images
3.2 Input Data
A single-viewpoint picture taken from only one direction is used as the input image of a subject object. As photography conditions, the background is made uniform and the distance between the subject object and the camera is fixed for all subject objects. The photograph is taken from one direction toward the subject object; with the viewpoint kept as it is, the subject object is rotated in place by a fixed angle and photographed again, and this operation is repeated. The pictures are converted into gray-scale images, and the object's center of gravity is moved to the center of the picture in all pictures. The brightness at each coordinate of a picture is used as an input value.
3.3 GA
As mentioned above, the position and form of the RBFs and the number of hidden layer units are important in an RBF network. These values are determined using a real-coded GA. The chromosome structure is shown in Fig. 4.
Fig. 4. The chromosome structure
where N is the number of hidden layer units used in the RBF network, a is the center position vector of an RBF, and σ is the range of an RBF. The following operations are performed using this chromosome structure.
STEP 1: Individuals are generated using random numbers.
STEP 2: Each individual is applied to an RBF network and training cycles are performed on the training data. After training, the acquired least-square error is taken as the degree of adaptation of each individual.
STEP 3: The individual with the best degree of adaptation is selected from among the individuals and saved as the elite.
STEP 4: The GA search stops if the degree of adaptation of the elite individual exceeds a fixed value. If this condition is not fulfilled, proceed to STEP 5.
STEP 5: Generation of the next generation's individuals, crossover, and mutation are performed.
STEP 6: If the number of generations does not exceed a fixed value, proceed to STEP 1 with the elite saved.
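A rough Python sketch of this real-coded GA loop is given below. It is illustrative only: the fitness callable stands in for training the RBF network and returning its least-square error, and the blend crossover and Gaussian mutation of σ are assumed operators, since the paper does not specify them; parameter values are likewise illustrative.

    import numpy as np

    rng = np.random.default_rng(0)
    DIM, MAX_UNITS = 32 * 32, 24

    def random_chromosome():
        n = rng.integers(1, MAX_UNITS + 1)
        return {"N": n, "a": rng.random((n, DIM)), "sigma": rng.uniform(0.1, 2.0, n)}

    def fitness(chrom, train_error):          # smaller training error = better
        return -train_error(chrom)

    def evolve(train_error, pop_size=20, generations=500, target=-1e-3):
        pop = [random_chromosome() for _ in range(pop_size)]            # STEP 1
        elite = max(pop, key=lambda c: fitness(c, train_error))         # STEPs 2-3
        for _ in range(generations):                                    # STEP 6
            if fitness(elite, train_error) > target:                    # STEP 4
                break
            children = []
            for _ in range(pop_size - 1):                               # STEP 5
                i, j = rng.choice(len(pop), 2, replace=False)
                p, q = pop[i], pop[j]
                n = min(p["N"], q["N"])
                child = {"N": n,
                         "a": 0.5 * (p["a"][:n] + q["a"][:n]),           # blend crossover
                         "sigma": 0.5 * (p["sigma"][:n] + q["sigma"][:n])}
                child["sigma"] += 0.05 * rng.standard_normal(n)          # Gaussian mutation
                child["sigma"] = np.abs(child["sigma"]) + 1e-3
                children.append(child)
            pop = [elite] + children
            elite = max(pop, key=lambda c: fitness(c, train_error))
        return elite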
Fig. 5. The subject objects: i) KENDAMA, ii) stapler, iii) cup
Fig. 6. Examples of teacher images (a-d)
Fig. 7. Examples of test images (e-g)
3.4 Computer Simulations
Experiments using the RBF network that introduced the rotation invariant structure are carried out. As a comparison experiment, the experiment using the usual RBF network that has one RBF in the each hidden layer unit, is conducted. As a subject object, three objects as shown in Fig.5 are used for the experiment. Pictures of the object rotated by 45 degree as shown in Fig.6, are used in training. Pictures of the object rotated by 22.5 degree as shown in Fig.7, are used for testing. For study of the proposed network, 24 teacher data is used. For the usual RBF network, the pictures which rotated dimensional every 90 degree are added to teacher data. Therefore, it learns using 96 teacher data. Both networks use the picture of 24 sheets as test data. Picture size is set to 32 × 32 pixel. The parameter of a setup of both networks is shown in Table 1. By GA, the number of the hidden layer units determines both networks. The number of the hidden layer units, form and center position of RBF are determined using GA. Table 1. The parameter of a setup of networks
Parameter                      The proposed network   The usual RBF network
Input layer units              1024                   1024
Hidden layer units             24                     96
Output layer units             3                      3
Maximum training iterations    1000                   1000
Training step size             0.1                    0.1
3.5 Simulation Results
Fig. 8. The transition of the proposed network
Fig. 9. The transition of RBF network
As the experimental result of the GA processing, the transition of the fitness (degree of adaptation) over 500 generations is shown in Figs. 8 and 9. The proposed technique was able to find optimal parameters more quickly than the technique using the usual RBF network. The recognition rates and the numbers of hidden-layer units are shown in Table 2; the proposed network was also learned with fewer hidden-layer units. In recognition accuracy, both techniques recognized the objects correctly about 80 percent of the time on average. As a result, the present approach requires fewer RBF units than a conventional RBF network.
4 Conclusions
In this paper, objects rotated by a fixed angle have been recognized by training an RBF network based on the Euclidean distance. By introducing the rotation-invariant structure into the RBF network, the present method reduces the number of hidden-layer units and trains faster than the usual RBF network structure. At present, this structure accepts only rotation in one fixed direction of the object, so experiments with rotation in other directions are also needed. Moreover, since the GA determines the RBF parameters, a long training time is required. Handling all rotations of an object and a faster search for the RBF parameters are left as future work.
Table 2. Experimental results of each technique
Method                   Correct answers   Hidden layer units
The proposed network     21/24             12
The usual RBF network    20/24             79
Soft Computing-Based Design and Control for Vehicle Health Monitoring Preeti Bajaj1 and Avinash Keskar2 1
Electronics Department G. H. Raisoni College of Engineering CRPF Gate 3, Digdoh Hills, Hingna Road, Nagpur, India- 440016 Phone (R: 91-712-2562782, M: 91-9822237613) [email protected] 2 Department of Electronics and Computer Sciences Visvesvarya National Institute of Technology Nagpur-440011, Phone (R: 91-712-2225060, M: 91-9823047697) [email protected]
Abstract. The study of vehicle health monitoring plays a very important role in deciding whether a journey can be completed successfully. Work has already been done on pre-trip plan assistance with vehicle health monitoring using simple and hybrid fuzzy controllers. Conventional fuzzy logic controllers are knowledge-based systems that incorporate human knowledge into their knowledge base through fuzzy rules and fuzzy membership functions. They are limited in application, as the logic rules and membership functions have to be preset with expert knowledge and have a great influence on the performance of the fuzzy logic controller. In the present work, a genetic fuzzy system is proposed to improve the learning of the logic rules. The resulting hybrid system is highly adaptive and trained through performance feedback, and therefore more sophisticated, with a larger number of adaptive parameters. The work demonstrates the benefit of the methodology by comparing the calculation of the safest travel distance, based on vehicle health monitoring, using a hybrid fuzzy controller and a genetic fuzzy controller.
1 Introduction
In recent years, increased efforts have been centered on developing intelligent vehicles that can provide in-vehicle information to motorists, including systems that increase driver comfort. In the field of transport studies, fuzzy logic control can be applied to transportation planning, including trip generation, distribution, modal split, route choice, vehicle health monitoring, and many other problems. Genetic programming has recently been demonstrated to be a vital approach to learning fuzzy logic rules for transportation. The study of vehicle health monitoring plays a very important role in deciding whether a journey can be completed successfully. Work has already been done on pre-trip plan assistance with simple fuzzy and hybrid fuzzy systems [1,2]. In the present work, a GA-fuzzy system is proposed to improve the learning of logic rules; it extracts information from multiple fuzzy knowledge sources and combines it into a single knowledge base. The work elaborates a fuzzy system in which the membership-function shapes and types and the fuzzy rule set, including the number of rules, are evolved using a genetic algorithm, while the genetic operators are adapted by a fuzzy system. The resulting hybrid system is highly adaptive and trained through performance feedback, and therefore more sophisticated, with a larger number of adaptive parameters. The benefits of the methodology are illustrated by calculating the safest distance traveled by a vehicle depending on the status of various inputs.
2 Vehicle Health and Vehicle Maintenance
The success of completing a desired journey depends on various factors, which can be categorized as journey classification, journey type, journey condition, vehicle condition, vehicle maintenance, vehicle characteristics, and the proficiency of the driver. The vehicle health is the current condition of the vehicle based on sensor outputs for the air pressure in the tyres, coolant level, engine oil level, brake oil level, battery condition, fuel level, tyre age, last servicing report, part replacement, age of the car, and so on. A detailed history of servicing and of the life of the different vehicle parts is maintained according to the characteristics of the vehicle. Since all these factors are measured by sensors and their minimum requirements vary with the make, type of journey, load, etc., a fuzzy controller is chosen to judge the vehicle health and to calculate the safest distance that can be traveled with the current sensor outputs. For simplicity, the adaptiveness of the controller is tested with four dominant inputs: coolant, engine oil, brake oil, and fuel.
3 Conventional Fuzzy Logic Controller
Like a typical PID system, a conventional fuzzy system usually takes the form of an iteratively adjusted model. In such a system, the input values are normalized and converted to fuzzy representations, the model's rule base is executed to produce a consequent fuzzy region for each solution variable, and the consequent regions are defuzzified to find the expected value of each solution variable using the centroid method. An open-loop fuzzy logic controller is designed. The membership functions assigned to the inputs and outputs of the input Mamdani fuzzy inference system are shown in Table 1, and samples of the input and output membership functions for the hybrid fuzzy controller are shown in Figs. 1 and 2, respectively. The inputs are fuzzified and applied to the inference engine, which fires the appropriate rules and gives the defuzzified output for the safest distance to be traveled.
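For illustration, a minimal from-scratch sketch of the fuzzify / rule-firing / centroid-defuzzification cycle described above is given below; the membership functions and the two rules are simplified assumptions, not the paper's actual rule base or Matlab implementation.

```python
import numpy as np

def trimf(x, a, b, c):
    """Triangular membership function with feet a, c and peak b."""
    return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

def centroid(x, mu):
    """Centroid defuzzification of an aggregated output fuzzy region."""
    return float(np.sum(x * mu) / (np.sum(mu) + 1e-12))

# Output universe: safest travelling distance, 0-300 km
dist = np.linspace(0.0, 300.0, 301)
dist_short = trimf(dist, 0.0, 50.0, 150.0)
dist_large = trimf(dist, 150.0, 250.0, 300.0)

def safest_distance(fuel):
    """Toy two-rule Mamdani controller on a single input (fuel level, litres)."""
    fuel_less = trimf(fuel, 0.0, 5.0, 15.0)       # fuzzification of the crisp input
    fuel_full = trimf(fuel, 15.0, 30.0, 35.0)
    # Fire the rules, clip each consequent by its strength, aggregate with max
    region = np.maximum(np.minimum(fuel_less, dist_short),
                        np.minimum(fuel_full, dist_large))
    return centroid(dist, region)

print(safest_distance(5.0), safest_distance(30.0))
```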
Table 1. Input and output membership mapping in fuzzy
In/Out label                      Membership function and levels              Range
Coolant                           Trapezoidal; dangerous, safe, almost full   0-10
Engine oil                        Triangular; dangerous, safe, almost full    0-10
Brake oil                         Triangular; dangerous, safe, almost full    0-10
Fuel                              Trapezoidal; less, moderate, full           0-35
Safest traveling distance (km)    Trapezoidal; short, moderate, large         0-300
Fig. 1. Hybrid Membership Function for Fuel
4 Design of Genetic Fuzzy Controller (GFLC)
In the current problem, a genetic algorithm is applied to the fuzzy controller: it adapts to the car simulator (real vehicle) and generates the rule base of the fuzzy logic controller while keeping the membership functions unchanged. The GFLC is implemented in Matlab using the 'Fuzzy' and 'Geatbx' toolboxes, with the functions described below. The gene length is taken as
4 × 4 × 4 × 4 = 256. The mileage of the vehicle is assumed constant at 19.0 km/l. It took 41 generations to give the best output. The GA tries to minimize ObjVal in each iteration, thereby minimizing the error between the outputs of the simulator and the FLC [3,4]. The authors have used the Pittsburgh approach for tuning the rule base [5], in which each chromosome represents the entire rule base or knowledge base.
Fig. 2. Membership function for Safe Traveling Distance
Car Simulator. The car simulator function simulates the vehicle to which the fuzzy logic controller is to be adapted. sim_datagen.m is a function that saves computation time for the simulator output: it generates sim.mat, which contains the simulator output for various inputs, so outputs can be obtained without running carsim.
Initialize Geatbx Toolbox. The function geamain.m is the main driver of the GEA Toolbox, where all parameters are resolved and the appropriate functions are called. Promain.m is the main program [6,7], a front-end module used to start the genetic optimization and to initialize parameters such as the migration rate and crossover rate.
Initialize Population. Initialization starts with random values in the population. Chrome_encode.m encodes the chromosome: it takes the rule set of the input Mamdani fuzzy inference system and converts it into a 256-gene initial chromosome.
Calculate Fitness for Each Chromosome in the Population (fitness.m). This file implements the fitness function, which is proportional to the performance of the function being optimized; its value usually ranges from 0 to 1. The rules generated by the GA are evaluated with random inputs and compared with the output of the car simulator. GEATbx tries to minimize the value of the fitness function ObjVal, where ObjVal = |distance recommended by the car simulator − distance recommended by the fuzzy system|.
Reproduction. This comprises forming a new population, usually with the same number of chromosomes, by selecting from the current population on the basis of fitness values. The method used is roulette-wheel selection, which assigns each individual a portion of the wheel proportional to its fitness, so members of the new population tend to come from individuals with higher fitness [8,9].
Crossover. This is the process of exchanging portions of the strings of two 'parent' chromosomes. The probability of crossover is often in the range 0.65-0.80.
Mutation. This consists of changing an element's value at random, often with a constant probability for each element in the population.
Rule Generation. The function Rulegen.m decodes a chromosome to generate a rule set: it takes the chromosome produced by the genetic toolbox and outputs a new fuzzy system.
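A compact sketch of these operators on the 256-gene integer chromosome (gene values 0-3, matching the 4x4x4x4 rule table) is shown below. It is an illustration under stated assumptions, not the GEATbx implementation; `flc_distance` stands in for decoding the chromosome with Rulegen and running the resulting FLC.

```python
import numpy as np

rng = np.random.default_rng(1)
N_GENES, N_RULE_VALUES = 256, 4   # 4*4*4*4 rule table, gene values 0..3

def roulette_select(pop, fitness):
    """Roulette-wheel selection for a minimisation objective (ObjVal)."""
    weights = 1.0 / (fitness + 1e-9)          # smaller ObjVal -> larger slice
    probs = weights / weights.sum()
    idx = rng.choice(len(pop), size=len(pop), p=probs)
    return pop[idx]

def crossover(pop, rate=0.7):
    """Single-point crossover between consecutive pairs of parents."""
    children = pop.copy()
    for i in range(0, len(pop) - 1, 2):
        if rng.random() < rate:
            point = rng.integers(1, N_GENES)
            children[i, point:], children[i + 1, point:] = (
                pop[i + 1, point:].copy(), pop[i, point:].copy())
    return children

def mutate(pop, rate=0.01):
    """Randomly reset genes with a small constant probability."""
    mask = rng.random(pop.shape) < rate
    pop[mask] = rng.integers(0, N_RULE_VALUES, size=mask.sum())
    return pop

def objval(chromosome, simulator_distance, flc_distance):
    """ObjVal = |distance from the car simulator - distance from the fuzzy system|."""
    return abs(simulator_distance - flc_distance(chromosome))
```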
5 Results
The input Mamdani model is provided with a knowledge base of only 13 rules. A sample of the rules:
1. If (engine-Oil is mf1) then (safest-distance is mf1) (1)
2. If (brake-oil is mf1) then (safest-distance is mf1) (1)
3. If (Fuel is less) then (safest-distance is mf1) (1)
A sample of the 29 rules generated by the adaptive GA-fuzzy controller is as follows:
1. If (Fuel is moderate) then (safest-distance is mf1) (1)
2. If (Fuel is high) then (safest-distance is mf1) (1)
3. If (brake-oil is mf1) and (Fuel is less) then (safest-distance is mf1) (1)
4. If (brake-oil is mf1) and (Fuel is high) then (safest-distance is mf1) (1)
5. If (engine-Oil is mf1) and (Fuel is less) then (safest-distance is mf1) (1)
6. If (coolant is mf1) and (brake-oil is mf3) and (Fuel is high) then (safest-distance is mf1) (1)
7. If (coolant is mf1) and (engine-Oil is mf1) and (brake-oil is mf2) then (safest-distance is mf1) (1)
8. If (coolant is mf2) and (engine-Oil is mf3) and (brake-oil is mf2) and (Fuel is high) then (safest-distance is mf2) (1)
9. If (coolant is mf2) and (engine-Oil is mf3) and (brake-oil is mf3) and (Fuel is high) then (safest-distance is mf3) (1)
10. If (coolant is mf3) and (Fuel is less) then (safest-distance is mf1) (1)
11. If (coolant is mf3) and (engine-Oil is mf2) and (brake-oil is mf3) and (Fuel is less) then (safest-distance is mf3) (1)
The sample of the adaptation output stored in the text file gives the following information:
Number of variables: 256
Boundaries of variables: 0 3
Evolutionary algorithm parameters: subpopulations = 5
Individuals = 50 30 20 20 10 (at start, per subpopulation)
Termination: 1: max. generations = 100; 4: running mean = 0
Variable format: 2 (integer values, phenotype == genotype)
Output: results on screen every 1 generation; graphical display of results every 5 generations
Method = 111111, Style = 614143
File name = Rule_Adapt.txt
End of optimization: running mean (55 generations; 38.43 CPU minutes / 38.43 elapsed minutes)
Best objective value: 3.1746 in generation 41
Figure 3 shows the performance characteristics of the hybrid and genetic fuzzy controllers.
Fig. 3. Performance Characteristics for simulator, FLC, GFLC
6 Summary and Conclusion
This paper demonstrates an approach to vehicle health monitoring. A GA is applied to discover a fuzzy controller capable of determining the safest distance allowed by the vehicle. The performance of the best evolved FLC was comparable to that of the conventional FLC and of the simulator. The GA evolved a good and reasonable set of rules for an FLC that demonstrated satisfactory responsiveness to various initial conditions while requiring minimal human intervention. The evolved FLC was shown to be robust to sensor noise and to changes in the membership functions. Such a balance between learned responsiveness and explicit human knowledge makes the system robust, extensible, and suitable for solving a variety of problems.
References
[1] P. Bajaj and A. G. Keskar, "Smart Pre-Trip Plan Assistance System Using Vehicle Health Monitoring", ITSC-02, Fifth IEEE International Conference on Intelligent Transportation Systems, Singapore, 3-6 Sept. 2002.
[2] P. Bajaj and A. G. Keskar, "Smart Pre-Trip Plan Assistance System with Hybrid Fuzzy Algorithm", Hong Kong Transportation Association.
[3] L. Magdalena, "Adapting the Gain of FLC with Genetic Algorithm", ETS Ingenieros de Telecomunicacion, Universidad, Spain.
[4] G. P. Bansal et al., "Genetic Algorithm Based Fuzzy Expert System", paper no. 224-B, IETE Journal, pp. 111-118, India.
[5] Yu-Chiun et al., "Genetic Fuzzy Controllers", ITSC-02, Fifth IEEE International Conference on Intelligent Transportation Systems, Singapore, 3-6 Sept. 2002.
[6] www.geatbx.com
[7] www.mathtools.net/MATLAB/Genetic_algorithms
[8] www.genetic-programming.org
[9] www.geatbx.com/links/ea_java.html
An Adaptive Novelty Detection Approach to Low Level Analysis of Images Corrupted by Mixed Noise Alexander N. Dolia1, Martin Lages1, and Ata Kaban2 1 Department of Psychology, University of Glasgow, 58 Hillhead Street, Glasgow G12 8QB, United Kingdom {a.dolia,m.lages}@psy.gla.ac.uk 2 Computer Science Department, University of Birmingham, Edgbaston Birmingham B15 2TT, United Kingdom [email protected] http://www.cs.bham.ac.uk/~axk/
Abstract. We propose a new adaptive novelty-detection-based algorithm for the primary local recognition of images corrupted by multiplicative/additive and impulse noise. The purpose of primary local recognition, or low-level analysis such as segmentation, small-object and outlier detection, is to provide a representation that could be used, e.g., in context-based classification or nonlinear denoising techniques. The method is based on the estimation of the mixing parameters (priors) of probabilistic mixture models within a small sliding window. A novelty score is defined from the mixing parameters, and this is used to determine the class of each image patch with the aid of a lookup table. Numerical simulations demonstrate that the proposed method improves upon previously employed techniques for the same task. In addition, its computational demand is clearly lower than that of some recently applied techniques such as expert systems or neural networks.
1 Introduction
A number of methods of novelty detection have been proposed using nonparametric [1,2], semi-parametric and parametric [2,3] statistical approaches, support vector techniques [4,5] and neural networks [6]. The main idea behind these approaches is to estimate either the unconditional probability density function or directly the support of the data distribution from a training set. Then, based on this information, it is tested to what extent new data fit the model by calculating some measure or score of novelty. If the score exceeds
some previously set threshold, the corresponding test samples are considered as outliers or novelty [1-6]. Note, however, that in practice the data can be inhomogeneous, so the location of the support of the data distribution may change. In these cases the support of the data is a local feature, and the global support becomes meaningless. The approach proposed in this paper provides a solution by adapting the parameters of the model to small space-time domains. This is achieved by local analyses of the image with small sliding windows and by updating a parameter of the novelty model itself. We also show real-world examples where the proposed method is useful for low-level analysis of images, such as segmentation, small-object and outlier detection. The paper is organized as follows. The image models are described in Section 2. A new adaptive novelty detection method is proposed in Section 3. Simulation results are presented in Section 4, and conclusions are provided in Section 5.
2 Models of Noisy Images
Two main noise models have been considered in the image processing literature: a) an image obtained by Ka-band side-look aperture radar (SLAR) is corrupted by a mixture of Gaussian multiplicative noise and outliers [7]; b) an image is influenced by a mixture of Gaussian additive noise and outliers. The main difference between these two models is that in the first case the noise is multiplicative (data-dependent) while in the second case the noise is additive (signal-independent) [7,8].
Let the index n denote both the index of the pixel under consideration and the location of the central pixel in the scanning window. With probability $P_{im}$ of the occurrence of an outlier $A_n$, the model of a SLAR image $x_n$ corrupted by outliers can be written as $x_n = A_n$; in the other cases $x_n$ is corrupted by signal-dependent noise, $x_n = I_n (1 + \sigma \xi_n)$ [7,8,9], where $I_n \in \{\mu_1, \mu_2, ..., \mu_J\}$ denotes the value of the n-th pixel of the noise-free image, $\xi_n$ denotes standard Gaussian noise, $\sigma^2$ is the relative variance of the multiplicative noise with mean equal to 1, $\mu_1, \mu_2, ..., \mu_J$ denote the different intensity levels, and J is the number of different intensity levels in the noise-free image.
The multiplicative noise can easily be transformed to additive noise using an appropriate homomorphic transformation, e.g., the natural logarithm [8,9]. The transformed image model can then be written as $y_n = s I^h_n + \sigma_s \xi_n$, where $\sigma_s^2 = s^2 \sigma^2$, $I^h_n \in \{\mu^h_1, \mu^h_2, ..., \mu^h_J\}$ and $\mu^h_j = \ln \mu_j$. Here $y_n$ and $I^h_n$ are the n-th pixels of $x_n$ and $I_n$ after the homomorphic transform, respectively, and s is a scaling parameter; in our experiments we set s = 46 for convenience (in this case $x_n, y_n \in [0, 255]$). In the above approximation we have used the Taylor expansion of the natural logarithm, $\ln(1 + x) = \sum_{k=1}^{\infty} (-1)^{k+1} x^k / k$ for $|x| < 1$, truncated to the linear term.
Observe that after the homomorphic transform the probability density function $p(y_n)$ of the image $y_n$ can be approximated by a mixture of Gaussians with shared variance $\sigma_s^2$, mixing parameters $\pi_j$, and means $\mu^h_j$:
$$ p(y_n) = \sum_{j=1}^{J} \pi_j \, N(y_n; \mu^h_j, \sigma_s^2), \quad 0 < \pi_j \le 1, \quad \sum_{j=1}^{J} \pi_j = 1 \; [10]. $$
Thus we now have a model with 2J + 1 unknown parameters ($\pi_j$, $\mu^h_j$ and $\sigma_s^2$), which could be estimated using a gradient-like method or the Expectation-Maximization (EM) algorithm [10]. However, this still does not provide any information about the local changes of intensity (what Marr called the 'primal sketch' [11]). The next section develops a method for this purpose.
3 Locally Adaptive Novelty Detection Method
Our algorithm is based on the following intuitive idea. If an observer is trained to see just one homogeneous region at a time, with a given light intensity or texture, then it is unusual for him to see: 1) two large homogeneous regions (an edge) at a time; or 2) small objects or outliers against a homogeneous background region. Similar to [9,12,13], we consider six classes for primary local recognition: 1) homogeneous region (H); 2) edge neighborhood between large objects or two homogeneous regions (E); 3) neighborhood of a spike (outlier), meaning that there are between one and three spikes located elsewhere than in the central pixel (NS); 4) spike in the central pixel of the scanning window (S); 5) the central pixel belongs to a small-sized object, characterized by compactness of its pixels as well as by homogeneity of their values (O); 6) neighborhood of a small-sized object (NO).
Because the scanning window should be small enough to ensure locality of the analysis (e.g., 5x5 pixels), and also to keep our model as simple as possible [4], it is reasonable to assume that no more than two Gaussian components appear in the scanning window. Therefore, we model the image fragment in each small scanning window as a mixture of two Gaussians with mixing parameters $\pi_i^n$ and $\pi_j^n$ ($\pi_i^n + \pi_j^n = 1$, $0 \le \pi_i^n, \pi_j^n \le 1$) for the i-th and j-th components of the mixture distribution:
$$ p(y_n) = \pi_i^n N(y_n; \mu_i^h, \sigma_s^2) + \pi_j^n N(y_n; \mu_j^h, \sigma_s^2), \quad i \ne j, \quad i, j \in \{1, 2, ..., J\}. \qquad (1) $$
The model (1) has only 5 free parameters but it is necessary to estimate these parameters for each location of the scanning window. Therefore, for the entire image we end up having as many models as pixels in the image. Because we employ small
scanning windows and because the image is contaminated by outliers, it would be inefficient to apply a statistical estimation procedure such as the EM algorithm.
Obviously, for $i \ne j$, knowing one of the mixing parameters we can find the other one ($\pi_i^n = 1 - \pi_j^n$), so a small reduction in the number of free parameters is possible. Note that there are three distinct possibilities: 1) $\pi_i^n < \pi_j^n$; 2) $\pi_i^n > \pi_j^n$; 3) $\pi_i^n = \pi_j^n$. Therefore, we can always find a Gaussian component whose mixing parameter is not less than the other one. All we need to do then is to estimate $\pi_{\max}^n = \max\{\pi_i^n, \pi_j^n\}$ for each location of the scanning window.
In order to estimate $\pi_{\max}^n$ we employ the well-known 3-sigma rule, which states that with probability 0.997 a random value y lies within the interval $[\mu - 3\sigma_s, \mu + 3\sigma_s]$ if it belongs to the Gaussian distribution with mean $\mu$ and standard deviation $\sigma_s$, i.e., $P(|y - \mu| \ge 3\sigma_s) \le 0.003$ [14]. We introduce a new notation here: let $N_\mu^n$ denote the number of pixels of the scanning window in the neighbourhood of $y_n$ that lie inside the interval $[\mu - 3\sigma_s, \mu + 3\sigma_s]$. For grey-scale images with 255 grey levels, $\mu \in [0, 255]$.
In summary, the proposed algorithm consists of two steps: 1) calculate $\pi_{\max}^n$, whose estimate is the maximal value of $N_\mu^n$, given the location of the scanning window and an estimate of $\sigma_s^2$, divided by the number of points L in the scanning window (e.g., if the scanning window is 5x5 then L = 25 and $\pi_{\max}^n = \max_\mu N_\mu^n / 25$); 2) based on the estimate of $\pi_{\max}^n$ (or $\max_\mu N_\mu^n$, or the novelty score $N_{score} = 1 - \pi_{\max}^n$) and on whether the central pixel lies inside the selected interval $[\mu - 3\sigma_s, \mu + 3\sigma_s]$ corresponding to $\max_\mu N_\mu^n$, look up the class from a table (e.g., see Table 1).
The method is locally adaptive because an estimate of the dominant mixing parameter of the mixture model, $\pi_{\max}^n$, is calculated for each position of the small scanning window rather than for the entire image. For a 5x5 scanning window with L = 25, the first row of Table 1 corresponds to $\max_\mu N_\mu^n = 25$, the second to $\max_\mu N_\mu^n \in \{23, 24\}$, and the third to $\max_\mu N_\mu^n = 22$. Similar tables were proposed for the classification of noise-free images in order to obtain a target image, using a genetic algorithm to correct misclassifications (see Tables 1 and 2 in [13]). Thus, if the image fragment in the scanning window is not novel, we get a novelty score $N_{score} \approx 0$; if the fragment is not just one homogeneous region, or is unusual to the observer, then $N_{score} > 0$. In practice $N_{score}$ is never exactly equal to one, because even if pixels are uniformly distributed in [0, 255], $\max_\mu N_\mu^n$ is at least 1 ($1 \le \max_\mu N_\mu^n \le L$ and $0 \le N_{score} \le 1 - 1/L$).
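An interpretation of this two-step procedure as code (a sketch, not the authors' implementation; scanning the candidate means µ in steps of 4 grey levels is an assumption) is given below.

```python
import numpy as np

def novelty_score_map(img, sigma_s, window=5, mu_step=4):
    """Estimate N_score = 1 - pi_max for every pixel of a grey-scale image.

    For each scanning-window position, counts how many window pixels fall in
    [mu - 3*sigma_s, mu + 3*sigma_s] for candidate means mu in [0, 255],
    takes the maximum count N_mu and divides by the window size L.
    Also returns whether the central pixel lies in the winning interval.
    """
    h, w = img.shape
    half, L = window // 2, window * window
    mus = np.arange(0, 256, mu_step, dtype=float)
    score = np.zeros((h, w))
    central_inside = np.zeros((h, w), dtype=bool)
    for r in range(half, h - half):
        for c in range(half, w - half):
            patch = img[r - half:r + half + 1, c - half:c + half + 1].astype(float)
            counts = np.array([np.sum(np.abs(patch - mu) <= 3 * sigma_s) for mu in mus])
            best = counts.argmax()
            score[r, c] = 1.0 - counts[best] / L
            central_inside[r, c] = abs(float(img[r, c]) - mus[best]) <= 3 * sigma_s
    return score, central_inside
```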
Fig. 1. Illustrations of primary image recognition for the artificial image: a) the artificial Ka-band SLAR image; b) novelty score mapping; c) central pixel mapping; d) classification mapping
4 Experiments
In this section, we demonstrate the performance of the proposed algorithm on both an artificial image corrupted by Gaussian multiplicative noise (Fig. 1,a) and a real Ka-band SLAR image (Fig. 2,a). We start with a noise-free image containing small and large objects with intensity levels from the set {10, 15, 20, 80, 120, 160}, i.e., six different positive and negative contrasts, on a background of 40. This image was then corrupted by multiplicative (data-dependent) noise with mean equal to 1 and relative variance $\sigma^2 = 0.003$ (see Fig. 1,a).
Table 1. Example of a novelty score table used to select the class based on $N_{score}$
$N_{score}$   Central pixel inside $[\mu - 3\sigma_s, \mu + 3\sigma_s]$   Central pixel outside $[\mu - 3\sigma_s, \mu + 3\sigma_s]$
0-0.04        H                                                           H
0.04-0.08     NS                                                          S
0.08-0.16     NS                                                          O
0.16-0.32     NO                                                          O
0.32-0.64     E                                                           E
0.64-1.0      O                                                           NO
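The class lookup of Table 1 can be written directly as a small function; treating the score bands as half-open intervals is an assumption of this sketch.

```python
def classify_patch(n_score, central_inside):
    """Class lookup following Table 1 (5x5 window)."""
    bands = [(0.04, 'H',  'H'),
             (0.08, 'NS', 'S'),
             (0.16, 'NS', 'O'),
             (0.32, 'NO', 'O'),
             (0.64, 'E',  'E'),
             (1.01, 'O',  'NO')]
    for upper, inside_cls, outside_cls in bands:
        if n_score < upper:
            return inside_cls if central_inside else outside_cls
    return 'H'
```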
After applying the homomorphic transform to the noisy test image and setting the scaling factor (s = 46), the variance of the additive noise becomes $\sigma_s^2 \approx s^2 \sigma^2 = 6.348$. We use this setting to compare three different methods: 1) supervised primary local recognition based on a radial basis function neural network (NN) with 10 inputs (the features are the bins of modified histograms of the image in the small scanning window, see [8] for details), 50 nodes in the hidden layer, and 6 outputs (6 classes); we refer to this method as NN-Hi; 2) a similar NN whose inputs are reduced to 6 local statistical parameters calculated for each position of the scanning window (see [8,9]); this method is referred to as NN-SP; 3) our new method, referred to as ND. An equal window size of 5x5 is used for all three methods. Results are shown in Table 2. The analysis of this table highlights that the proposed method produced superior recognition performance for all the main classes (H, E, O, NO) when compared to NN-Hi and NN-SP. In addition, our simulations demonstrate that the superiority of the proposed method also holds for images with contrasts different from those used in the training of NN-Hi and NN-SP. Fig. 1,b shows the novelty scores $N_{score}$ for the noisy test image: black pixels belong to homogeneous regions (not novel) while white pixels are edges or highly novel. Fig. 1,c shows that the information about whether the central pixel belongs to the interval $[\mu - 3\sigma_s, \mu + 3\sigma_s]$ is useful for classifying patches such as S and O. Fig. 1,d shows how the information in Fig. 1,b,c can be combined to recognize all classes (different colours represent different classes). Fig. 2 depicts results for a real Ka-band SLAR image, where Fig. 2,b,c,d correspond to Fig. 1,b,c,d, respectively. The scanning window size was 5x5 and the relative variance was 0.004; these values were chosen by a human expert analysing the homogeneous regions of the real image (see Fig. 2,a). Visual inspection also demonstrates that the proposed method shows encouraging performance and is able to distinguish between important components of the image such as edges, small objects, outliers, and homogeneous regions.
Table 2. Results of correct image classifications by NN-Hi, NN-SP and ND
Method    H         E         O         NO
ND        99.94 %   99.99 %   99.99 %   99.99 %
NN-Hi     99.5 %    93.6 %    99.6 %    90.6 %
NN-SP     99.1 %    91.8 %    94.7 %    85.8 %
Fig. 2. Real radar image processing: a) real Ka-band SLAR image; b) novelty score mapping; c) central pixel mapping; d) classification mapping
5 Conclusions
In this paper we have proposed a locally-adaptive novelty detection method for the primary local analysis in image data corrupted by mixed noise. The recognition results have outperformed those obtained with an RBF classifier on both artificial images and real Ka-band SLAR images.
References
[1] Bishop, C.: Novelty Detection and Neural Network Validation. Proc. IEE Conference on Vision and Image Signal Processing (1994) 217-222
[2] Nairac, A., Townsend, N., Carr, R., King, S., Cowley, P., Tarassenko, L.: A System for the Analysis of Jet Engine Vibration Data. Integrated Computer Aided Engineering 6 (1999) 53-65
[3] Roberts, S.J.: Novelty Detection Using Extreme Value Statistics. IEE Proceedings on Vision, Image and Signal Processing 146(3) (1999) 124-129
[4] Schölkopf, B., Williamson, R., Smola, A., Taylor, J.S., Platt, J.: Support Vector Method for Novelty Detection. In: Solla, S.A., Leen, T.K., Muller, K.R. (eds.): Neural Information Processing Systems (2000) 582-588
[5] Campbell, C., Bennett, K.P.: A Linear Programming Approach to Novelty Detection. Advances in Neural Information Processing Systems 14, MIT Press, Cambridge, MA (2001)
[6] Ypma, A., Duin, R.P.W.: Novelty Detection Using Self-organising Maps. Progress in Connectionist-Based Information Systems 2 (1998) 1322-1325
[7] Melnik, V.P.: Nonlinear Locally Adaptive Techniques for Image Filtering and Restoration in Mixed Noise Environments. Thesis for the degree of Doctor of Technology, Tampere University of Technology, Tampere, Finland (2000)
[8] Dolia, A.N., Lukin, V.V., Zelensky, A.A., Astola, J.T., Anagnostopoulos, C.: Neural Networks for Local Recognition of Images with Mixed Noise. In: Nasrabadi, N.M., Katsaggelos, A.K. (eds.): Applications of Artificial Neural Networks in Image Processing VI, Proc. SPIE 4305 (2001) 108-118
[9] Dolia, A.N., Burian, A., Lukin, V.V., Rusu, C., Kurekin, A.A., Zelensky, A.A.: Neural Network Application to Primary Local Recognition and Nonlinear Adaptive Filtering of Images. Proc. 6th IEEE International Conference on Electronics, Circuits and Systems, Pafos, Cyprus 2 (1999) 847-850
[10] Bishop, C.: Neural Networks for Pattern Recognition. Oxford University Press, Oxford (1995)
[11] Marr, D.: Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. Freeman, San Francisco (1982)
[12] Lukin, V.V., Ponomarenko, N.N., Astola, J.T., Saarinen, K.: Algorithms of Image Nonlinear Adaptive Filtering Using Fragment Recognition by Expert System. Proc. IS&T/SPIE Symp. on Electronic Imaging: Science and Technology, San Jose, CA, USA 2318 (1996) 114-125
[13] Niemistö, A., Lukin, V., Shmulevich, I., Yli-Harja, O., Dolia, A.: A Training-Based Optimization Framework for Misclassification Correction. Proc. 12th Scandinavian Conference on Image Analysis, Bergen, Norway (2001) 691-698
[14] Bendat, J.S., Piersol, A.G.: Random Data: Analysis and Measurement Procedures. Wiley Series in Probability and Statistics, John Wiley & Sons, New York, 3rd edn. (2000)
The Image Recognition System by Using the FA and SNN Seiji Ito1 , Yasue Mitsukura2 , Minoru Fukumi3 , Norio Akamatsu3 , and Sigeru Omatu1 1
Osaka Prefecture University 1-1 Gakuen-chou, Sakai, Osaka 599-8531, Japan [email protected] [email protected] 2 Okayama University 3-1 Tsushima Okayama 700-8530 Japan [email protected] 3 The University of Tokushima 2-1 Minami-Josanjima Tokushima, 770-8506 Japan {fukumi,akamatsu}@is.tokushima-u.ac.jp
Abstract. It is difficult to obtain only the images we want on the web, because enormous amounts of data exist there. Present image retrieval systems rely on keyword search, which requires keywords to be attached to images; assigning such keywords is therefore very important and also difficult. In this paper, keywords in images are analyzed using factor analysis and a sandglass-type neural network (SNN) for image search. As preprocessing, the target images are segmented by the maximin-distance algorithm and small regions are integrated into neighboring regions, so that each image is segmented into a few regions. After this preprocessing, the keywords in the images are analyzed using factor analysis and an SNN. The image data are mapped to 2-dimensional data by these two methods and plotted on a graph, and images are recognized using this graph.
1 Introduction
Recently, enormous amounts of image data have come to exist on the web through the rapid development of the Internet, which makes it difficult to obtain only the images we want. To address this problem, image search systems exist, for example the Google search system [1]. However, such systems do not necessarily return only the images we want, because they are basically based on file-name search. Image search systems can be divided into three kinds: keyword search, similarity-based image retrieval, and browsing search [2]. However, each has problems. The keyword search method relies on a database. In similarity-based image retrieval, we obtain images
Fig. 1. Process of preprocessing an image
by drawing the images we want; however, this requires many tasks from the user. The browsing search method is interactive, but it takes time to search for images. Therefore, we focus on keyword search, which is very simple and fast, although it is difficult to obtain keywords for images. In this paper, we propose recognizing images automatically; by recognizing images, we can expect to add keywords automatically. First, as preprocessing, the target images are segmented by the maximin-distance algorithm and small regions are integrated into neighboring regions, so that each image is segmented into a few regions. After this preprocessing, keywords in the images are analyzed using factor analysis and a sandglass-type neural network (SNN) for image search. The image data are mapped to 2-dimensional data by these two methods and plotted on a graph. After the keywords are analyzed in detail, images are recognized using this graph.
2 Image Preprocessing
In this section, the image preprocessing for the image recognition system is introduced. Fig. 1 shows the preprocessing flow of the proposed method. Images can be segmented roughly into regions by this preprocessing, which is explained in detail below.
2.1 Median Filtering
In the first step, impulse noise is removed using median filtering. By removing noise, images can be divided into coarser regions than the original images. Median filtering replaces each pixel with the median density of its local region, obtained by ranking the densities in order of value.
2.2 Maximin-Distance Algorithm
Image pixels with similar values after median filtering are classified into the same cluster using a maximin-distance algorithm (MDA) [5]. Other clustering methods exist, for example k-means, but k-means requires the number of clusters k to be decided in advance, and since k changes from picture to picture it is difficult to decide it automatically. The maximin-distance algorithm, roughly speaking, searches for the point farthest from the existing clusters; if this distance is greater than a threshold, the point becomes the center of a new cluster. The detailed algorithm is as follows:
1. A cluster with $x_1$ as its center is created.
2. $D_i = \min_j d(x_i, \bar{Z}_j)$ is calculated for each $x_i$, where $\bar{Z}_j$ is the center of cluster $Z_j$ and the distance between pixels X and Y is
   $$ d(X, Y) = \sqrt{(X_r - Y_r)^2 + (X_g - Y_g)^2 + (X_b - Y_b)^2}, \qquad (1) $$
   where $X_r, X_g, X_b$ are the RGB values of pixel X and $Y_r, Y_g, Y_b$ those of pixel Y.
3. $l = \max_i D_i$ is calculated, and the corresponding index i is set as k.
4. If $l / MAX > r$, then $x_k$ is set as the center of a new cluster and the algorithm returns to step 2, where $MAX = \max_{i,j} d(\bar{Z}_i, \bar{Z}_j)$ and r is a parameter.
5. If $l / MAX \le r$, the algorithm terminates.
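A sketch of steps 1-5 in code is given below (an illustration, not the authors' implementation); how MAX is handled before two cluster centers exist is an assumption here.

```python
import numpy as np

def maximin_clusters(pixels, r=0.3):
    """Maximin-distance clustering of RGB pixels (steps 1-5 above).

    pixels : (n, 3) float array of RGB values
    r      : threshold parameter on l / MAX
    """
    centers = [pixels[0]]                                   # step 1
    while True:
        d = np.linalg.norm(pixels[:, None, :] - np.array(centers)[None, :, :], axis=2)
        D = d.min(axis=1)                                   # step 2: distance to nearest center
        k = int(D.argmax())                                 # step 3: farthest pixel
        if len(centers) < 2:
            max_cc = D[k]                                   # bootstrap before a pair of centers exists
        else:
            cc = np.array(centers)
            max_cc = np.linalg.norm(cc[:, None] - cc[None, :], axis=2).max()
        if D[k] / (max_cc + 1e-12) > r:                     # step 4: add a new cluster center
            centers.append(pixels[k])
        else:                                               # step 5: finished
            break
    labels = np.linalg.norm(
        pixels[:, None, :] - np.array(centers)[None, :, :], axis=2).argmin(axis=1)
    return np.array(centers), labels
```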
2.3 Integration of Small Regions
Many small regions remain after applying the MDA. In this case, the image is divided into larger regions by integrating the small regions [6]. In this study, a small region is defined as one occupying 0.5% or less of the picture area. The integration algorithm is summarized as follows:
1. The target pixel is defined as the maximum outline pixel of the small region.
2. The region labels of the 8-neighborhood of the target pixel are collected.
3. Among the collected labels, the most frequent region is selected.
4. The target region is integrated into the selected region.
5. This procedure is repeated for all small regions.
An original image is shown in Fig. 2(a), and the image after this preprocessing is shown in Fig. 2(b). Good regions can be obtained by this process.
Fig. 2. Example of preprocessing: (a) an original image; (b) the preprocessed image
3 Factor Analysis
After regions are obtained by the image preprocessing, correlations among the feature parameters are extracted for every keyword using factor analysis [4]. In this section, we outline factor analysis. The factor analysis model is
$$ Z = FA + UD \qquad (2) $$
Here Z is the data matrix, which contains the measured data; F and U are not measured but assumed. F is the common factor score (or factor score) matrix, U is the unique factor score matrix, and A is the factor loading matrix. Factor analysis means estimating A, because A cannot be obtained directly from equation (2). The factor analysis procedure is as follows:
1. Generate the data matrix Z.
2. Calculate the correlation matrix R.
3. Calculate the matrix R*.
4. Extract the factors.
5. Rotate the factors.
In this paper, R* is calculated by estimating the diagonal components; each diagonal component is set to the maximum correlation coefficient of the corresponding variable. The factors are extracted by the centroid method, and the factor rotation uses the varimax method.
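As a small illustration of the R* step only (the centroid extraction and varimax rotation are not shown), the reduced correlation matrix can be built as follows; taking the absolute value of the off-diagonal correlations is an assumption of this sketch.

```python
import numpy as np

def reduced_correlation_matrix(Z):
    """Build R* from the data matrix Z (samples x features): the correlation
    matrix with each diagonal element replaced by the largest off-diagonal
    correlation of that variable, as described above."""
    R = np.corrcoef(Z, rowvar=False)
    R_star = R.copy()
    off = R - np.eye(len(R))                 # zero the diagonal
    np.fill_diagonal(R_star, np.abs(off).max(axis=1))
    return R_star
```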
4 Sandglass-Type Neural Network (SNN)
The 2-dimensional data obtained by compressing the region data are analyzed to see how each keyword is distributed in the 2-dimensional graph. By compressing multiple features into two dimensions, the compressed data can be plotted on a 2-dimensional graph and analyzed visually. The SNN [7] is used for this compression. The SNN is a hierarchical neural network; when the input and output patterns are the same, compressed data can be obtained in the hidden layer. The SNN structure used in this paper is shown in Fig. 3.
Fig. 3. The structure of SNN
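A minimal sketch of such an auto-associative (sandglass) network with a 2-unit bottleneck, trained by plain gradient descent, is shown below; the single-bottleneck architecture, tanh activation, and training details are assumptions of this sketch rather than the network actually used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_snn(X, n_hidden=2, epochs=2000, lr=0.01):
    """Train a minimal sandglass (auto-associative) network X -> 2 units -> X
    by gradient descent on the squared reconstruction error, and return the
    2-dimensional codes from the bottleneck layer."""
    n, d = X.shape
    W1 = rng.normal(0, 0.1, (d, n_hidden)); b1 = np.zeros(n_hidden)
    W2 = rng.normal(0, 0.1, (n_hidden, d)); b2 = np.zeros(d)
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)            # bottleneck activations (the 2-D code)
        Y = H @ W2 + b2                     # linear reconstruction
        E = Y - X                           # target is the input itself
        dW2 = H.T @ E / n; db2 = E.mean(axis=0)
        dH = (E @ W2.T) * (1 - H ** 2)      # backprop through tanh
        dW1 = X.T @ dH / n; db1 = dH.mean(axis=0)
        W2 -= lr * dW2; b2 -= lr * db2
        W1 -= lr * dW1; b1 -= lr * db1
    return np.tanh(X @ W1 + b1)             # 2-D points to plot
```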
5 Computer Simulation
40 images are used as learning data. From these 40 images, 140 regions are obtained by applying the image preprocessing with some manual work. Important features are found by factor analysis; the selected features are shown in Table 1. From this table, it is found that the brightness I (of the HSI color system) is the most important feature parameter. In general, the brightness I is regarded as the most important parameter in image research, so these results confirm that the brightness I is the most important information in the various scene images. The selected features are compressed to 2-dimensional data by the SNN, and the compressed data are plotted in a 2-dimensional coordinate system, shown in Fig. 4. From Fig. 4, it is found that all keywords are separated in this coordinate system. Images are recognized using Fig. 4 as follows:
1. The center point of every keyword is decided.
2. The regions of unknown images are compressed by the SNN.
3. The distances between the centers and the compressed points are calculated.
4. The shortest of the calculated distances is found.
5. The region is recognized as the keyword with the shortest distance.
The distance is the Euclidean distance; the distance between a center point $(x_c, y_c)$ and a point $(x, y)$ is
$$ d = \sqrt{(x_c - x)^2 + (y_c - y)^2}. \qquad (3) $$
Thus, regions are recognized. The recognition results are as follows: 74.4% accuracy is obtained with 7 keywords (sky blue, cloud, sandy beach, mountain, sun, tree, and snow), 62.7% with 8 keywords (adding rock), 42.9% with 9 keywords (adding grassland), and 38.3% with 10 keywords (adding sea). Therefore, the proposed method is effective,
Table 1. The result of factor analysis: the feature parameters (average and variance of R, G, B, H, S, and I) selected for each keyword
because the recognition accuracy with 7 keywords is 74.4%. Generally, the keywords "sea" and "sky blue" are difficult to separate. In this paper, sea is never mistaken for sky blue, and there are 3 misrecognitions in which sky blue is mistaken for sea; therefore, relatively good results are obtained. The recognition accuracy with 10 keywords is not so high. The reason is thought to be as follows: among the keyword distributions in Fig. 4, some are wide and some are narrow, and narrow distributions are more concentrated than wide ones, so recognition errors arise from the difference in distribution areas among keywords. As a solution, the number of sample data should be made the same for every keyword. Roughly speaking, the more recognition targets are used, the lower the recognition accuracy becomes. In future work, separation of many keywords will be achieved.
Fig. 4. The distribution of the compressed data in the 2-dimensional coordinate system
6 Conclusions
In this paper, we analyzed scene images for an image search system. First, we apply image preprocessing (median filtering, the maximin-distance algorithm, and integration of small regions) for image segmentation. Next, we use factor analysis to extract important feature parameters. Finally, we use an SNN to compress these parameters to 2-dimensional data, which are plotted, and images are recognized using this graph. A recognition accuracy of 74.4% is obtained with 7 keywords; therefore, the proposed method is effective. In future work, a higher recognition accuracy will be pursued with an improved system, and new feature parameters, such as shape, will be extracted. However, the present preprocessing does not always produce accurate region segmentation, so more accurate segmentation methods are necessary. Furthermore, a system that adds keywords automatically will be constructed.
References
[1] http://www.google.com/
[2] M. Mukunoki, M. Minoh, and K. Ikeda, "A Retrieval Method of Outdoor Scenes Using Object Sketch and an Automatic Index Generation Method", IEICE Trans. D-II, Vol. J79-D-II, No. 6, pp. 1025-1033 (1996), in Japanese.
[3] J. Hertz, A. Krogh, and R. G. Palmer, "Introduction to the Theory of Neural Computation", Addison-Wesley (1991).
[4] S. Shiba, "Factor Analysis Method", University of Tokyo Press (1972), in Japanese.
[5] M. Nagao, "Multimedia Information Science", Iwanami Shoten, in Japanese.
[6] S. Sakaida, Y. Shishikui, Y. Tanaka, and I. Yuyama, "An Image Segmentation Method by the Region Integration Using the Initial Dependence of the K-Means Algorithm", IEICE Japan, No. 2, pp. 311-322 (1998), in Japanese.
[7] E. Watanabe and K. Mori, "A Compression Method for Color Images by Using Multi-Layered Neural Networks", IEICE Japan, No. 9, pp. 2131-2139 (2001), in Japanese.
License Plate Detection Using Hereditary Threshold Determine Method Seiki Yoshimori1 , Yasue Mitsukura2 , Minoru Fukumi1 , and Norio Akamatsu1 1
University of Tokushima, 770-8506 2-1 Minami-Josanjima Tokushima, Japan {minow,fukumi,akamatsu}@is.tokushima-u.ac.jp 2 Okayama University, 700-8530 3-1-1 Tsushima Okayama, Japan [email protected]
Abstract. License plate recognition is very important in an automobile society, and within it plate detection is crucial because it strongly influences the subsequent number recognition. It is, however, very difficult, because the background and the body color of cars are often similar to that of the license plate. In this paper, we propose a new threshold determination method for various backgrounds using a real-coded genetic algorithm (RGA). Using the RGA, the most likely plate colors are decided under various lighting conditions. First, the average brightness Y values of images are calculated. Next, the relationship between the Y value and the most likely plate-color thresholds (upper and lower bounds) obtained by the RGA is used to estimate a thresholds function with the recursive least squares (RLS) algorithm. Finally, in order to show the effectiveness of the proposed method, we show simulation examples using real images.
1 Introduction
Much work on number plate detection with digital computers has been done, with applications to car counting and to the recognition of car number plates. For this purpose, there are methods based on the color histogram; however, these methods cannot cope with changes of background color (lighting conditions). It is very difficult to extract a specific object from an image using nothing but color information. People recognize color information using various color characteristics, and objects are normally not monochrome except those made so intentionally. Good results can be obtained with a threshold obtained by binarization of a given image; however, the results are not so good when several images are used, because of the variation in lighting, orientation, etc. [1]-[3]. In this paper, the best thresholds for every image are found automatically using an RGA. With this method, the thresholds (upper and lower bounds) of a number plate are determined by the RGA from unknown images that differ in brightness and car type. Next, the relation between these thresholds (upper and lower bounds) and the brightness of the car region is turned into a function using a least-squares method. For plate detection, the thresholds are calculated from this function, so the time needed to determine the thresholds is saved and the computation needed for detection also decreases. Moreover, because the thresholds are determined from the brightness of the car region, the method is also effective against changes of lighting conditions. The images used for the simulations in this paper were taken with a digital camcorder.
2 Image Processing
In this paper, images of cars photographed while driving on an outdoor road with a digital camcorder are used. The photography conditions are shown in Table 1.
2.1 Car Area Extraction
Ten consecutive images per car are used for the simulation. First, the difference between two consecutive images is taken. Next, the portions with large changes in this difference are accumulated into histograms along the vertical and horizontal directions [4], and the portion exceeding a threshold in both histograms is extracted as the car region. This situation is shown in Fig. 1. In addition, portions that contain no car and portions in which the plate is too small are removed beforehand at the extraction stage.
2.2 Color Systems
To obtain optical features from the color information of color images, the attributes of a color system are used. Examples of color systems are RGB, YCrCb, YIQ, etc. In this paper, the YCrCb color system is used, which best reflects the color perception of human beings. In particular, the value Y is used to capture the light intensity. The transform from RGB is as follows.
Table 1. Photography conditions
Shutter speed        1/30 sec
Picture size         length 340 pixels, width 240 pixels
Picture format       24-bit color
Environment          fine day (daytime), rainy day (night)
Photography time     14:00-15:00 (daytime), 18:00-19:00 (night)
Fig. 1. Car domain extraction: (a) the original image; (b) the difference result; (c) the extraction result
$$ \begin{pmatrix} Y \\ C_r \\ C_b \end{pmatrix} = \begin{pmatrix} 0.29900 & 0.58700 & 0.11400 \\ 0.50000 & -0.41869 & -0.08131 \\ -0.16874 & -0.33126 & 0.50000 \end{pmatrix} \begin{pmatrix} R \\ G \\ B \end{pmatrix} $$
where Y is the brightness and Cr, Cb are the color (chrominance) components. Note that only Y is used in this paper. The method using thresholds can recognize objects at high speed.
2.3 Thresholds
How to choose the thresholds for plate detection is a difficult problem, because the thresholds change with the lighting conditions, and the detection accuracy changes significantly with the threshold values. In this paper, appropriate thresholds for plate detection are found by the RGA under the condition of an unknown Y value, which is calculated by the above equation. Sample plate regions are extracted from the car regions by visual inspection; an example of a car region and the extracted plate region is shown in Fig. 2. These images are used when the thresholds are calculated.
2.4 Genetic Algorithms
In the problem dealt with in this paper, the thresholds for plate detection have to be changed for each image, and the plate position must be detected reliably even for images whose brightness differs. However, when the thresholds (upper and lower bounds) change, the results after processing may differ greatly. In this paper, a genetic algorithm with real-valued
Fig. 2. A sample of a plate extraction result
coding (RGA) is adopted as the optimization algorithm. For continuous (real-valued) optimization problems, an RGA can efficiently pass on the features of parent individuals, so a more efficient search is possible than with a GA that uses bit strings for the chromosome [5].
Chromosomes. In this paper, since the parameters are the thresholds (upper and lower bounds), the chromosome takes the form shown in Fig. 3.
Fitness Functions. In this paper, the degree of adaptation (fitness) was set up as follows:
if $RECOG_{PLATE} < 0.20$:  $fitness = RECOG_{PLATE} / RECOG_{CAR}$
else:  $fitness = RECOG_{PLATE} \times B_{RESULT} \times G_{RESULT} \times R_{RESULT}$
where, for each colour channel, if $(B_{MAX} - B_{MIN})$, $(G_{MAX} - G_{MIN})$, $(R_{MAX} - R_{MIN}) > 80.0$:
$B_{RESULT}, G_{RESULT}, R_{RESULT} = 1.05 / (MAX - MIN)^{0.01}$
otherwise ($< 80.0$), the lower bounds are first replaced by $(B_{MIN}, G_{MIN}, R_{MIN}) + 1.0$ and the upper bounds by $(B_{MAX}, G_{MAX}, R_{MAX}) - 3.0$, and then
$B_{RESULT}, G_{RESULT}, R_{RESULT} = 1.05 / (MAX - MIN)^{0.01}$
$RECOG_{PLATE}$: the ratio of the number of pixels that lie within the threshold (upper and lower) bounds determined by the GA to the total number of pixels in the plate region.
$RECOG_{CAR}$: the ratio of the number of pixels that lie within the threshold (upper and lower) bounds determined by the GA to all the pixels in the sample car region.
$B_{MIN}, G_{MIN}, R_{MIN}$: the minimum threshold values of B, G, and R determined by the GA.
$B_{MAX}, G_{MAX}, R_{MAX}$: the maximum threshold values of B, G, and R determined by the GA.
With this fitness, chromosomes that give a high recognition rate with as narrow a threshold range as possible are preferred; therefore, the proposed method can be expected to give the desired result.
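Since the fitness equation is partly garbled in the source, the following sketch is an interpretation of it rather than a faithful reproduction; in particular, the per-channel range correction is an assumption.

```python
def channel_result(t_min, t_max):
    """Per-channel factor rewarding narrow threshold ranges (interpretation)."""
    width = t_max - t_min
    if width <= 80.0:
        t_min, t_max = t_min + 1.0, t_max - 3.0   # adjusted bounds for narrow ranges
        width = max(t_max - t_min, 1e-6)
    return 1.05 / (width ** 0.01)

def fitness(recog_plate, recog_car, bounds):
    """bounds = {'B': (min, max), 'G': (min, max), 'R': (min, max)}."""
    if recog_plate < 0.20:
        return recog_plate / recog_car
    result = recog_plate
    for t_min, t_max in bounds.values():
        result *= channel_result(t_min, t_max)
    return result
```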
Fig. 3. A sample of chromosome
Fig. 4. The estimated threshold lines obtained by the RLS algorithm (B channel, upper and lower bounds)
2.5 Thresholds Function
The relation between the thresholds (upper and lower bounds) and the average brightness in the car region is used to calculate the thresholds function (upper and lower bounds); the resulting threshold lines are shown in Fig. 4. The values calculated from this function are robust to differences in lighting conditions and car color, because they are calculated from the brightness of the whole car region. Moreover, the thresholds for plate detection in an unknown image are determined from the calculated function; that is, the threshold values used to detect the plate position in an image are determined simply by calculating the brightness of the image. Therefore, the amount of computation needed for detection can also be reduced.
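A sketch of how such a thresholds function could be evaluated is given below; the linear form t = aY + b per bound and per channel follows the description above, while the coefficient values shown are purely hypothetical placeholders.

```python
def plate_thresholds(mean_y, coeffs):
    """coeffs[channel] = ((a_lo, b_lo), (a_hi, b_hi)) for lines t = a * Y + b."""
    bounds = {}
    for channel, ((a_lo, b_lo), (a_hi, b_hi)) in coeffs.items():
        bounds[channel] = (a_lo * mean_y + b_lo, a_hi * mean_y + b_hi)
    return bounds

# Hypothetical coefficients for illustration only
example_coeffs = {'B': ((0.8, 10.0), (0.9, 60.0)),
                  'G': ((0.8, 12.0), (0.9, 65.0)),
                  'R': ((0.7, 15.0), (0.9, 70.0))}
print(plate_thresholds(120.0, example_coeffs))
```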
2.6 Symmetric Property
If a region is extracted using only the thresholds function, regions whose colors fall within thresholds similar to those of a plate may also be selected. In this case, the question is which region should finally be chosen as the plate region. One possibility is to choose the region with the most pixels within the threshold range; however, this increases the chance of misrecognizing a region with plate-like thresholds. In this paper, therefore, a degree of symmetry is proposed and used to select the region:
$$ Symmetry = \frac{Sim}{Pixel \times 2} $$
Sim: the total similarity of the corresponding pairs of points (based on the difference of their pixel values)
Pixel: the number of pixels in half of the chosen region
With this method, the area chosen as a candidate plate region is first divided equally in two, vertically or horizontally. Next, the intensity values of the pixels that are symmetric with respect to the dividing line in the two halves are compared. Finally, the symmetry of the region is measured by dividing the accumulated similarity by half the number of pixels of the whole region.
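A possible reading of this symmetry measure in code is sketched below; interpreting Sim as an accumulated per-pixel similarity in [0, 1] and normalising by the number of compared pixels are assumptions.

```python
import numpy as np

def symmetry_score(region):
    """Left-right symmetry of a candidate plate region (grey-scale array).

    Mirrors the right half onto the left half and accumulates a per-pixel
    similarity, normalised by the number of compared pixels.
    """
    region = region.astype(float)
    h, w = region.shape
    left = region[:, : w // 2]
    right = np.fliplr(region[:, w - w // 2:])
    sim = 1.0 - np.abs(left - right) / 255.0     # 1 = identical, 0 = opposite
    return float(sim.sum()) / left.size
```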
3
Flow of Processing
The process in the case of actual plate detection is shown below. 3.1
Preprocessing
First, car-region images that differ in brightness and car color are prepared, and these images are converted into binary images using the difference between frames. Using each binary image, histograms of the numbers of pixels in the vertical and horizontal directions are computed, and the region exceeding a threshold in these histograms is determined to be the car region. Next, a plate region is extracted from the car region. The upper and lower threshold bounds for each image are calculated by the real-coded GA using the plate-region image and the car-region image. Finally, the thresholds function is calculated from the relation between the average brightness of the car region and the thresholds obtained by the RGA. 3.2
Detection Processing
First, an input image is converted into a binary image using the difference between frames. Vertical and horizontal pixel histograms are computed from the binary image, and in both histograms the region exceeding the thresholds is determined to be the car region. Next, the average brightness of the determined car region is calculated, and the R, G, and B thresholds (upper and lower bounds) are determined from the thresholds function using this average. The number of pixels in the car region that lie within the threshold bounds is counted, and vertical and horizontal histograms are computed from those pixels. Using these histograms, the region exceeding a threshold is determined to be the plate region. When two or more candidate regions exist, the degree of symmetry of each candidate is used in addition to the recognition rate, and the candidate most consistent with a plate region is selected. Finally, an exact position is determined by template matching inside the determined region [6], [7].
4
Computer Simulations
In this section, in order to show the effectiveness of the proposed method, computer simulations are carried out using 100 car-area images.
Table 2. The parameters used in GA

  The number of individuals              200
  The number of generations             1000
  The number of elite preservation        20
  Occurrence probability of crossover   0.60
  The mutation rate                     0.07
Table 3. Simulation results

  Method            Rate of detection (%)   Processing time (sec)
  Method1                   57.0                    0.23
  Method2                   82.0                    0.16
  Method3                   85.0                    0.17
  Method4                   91.0                    0.18
  Method4 (night)           70.2                    0.18

  CPU: Pentium III (500 MHz)
  Method1: template matching only
  Method2: threshold value function
  Method3: template matching and threshold value function
  Method4: template matching, threshold value function, and symmetric property
4.1
Simulation Conditions
For the computer simulations, the parts that do not contain a car, namely the upper part of the original image and the area outside the center line, are deleted. The number of individuals, the number of generations, the crossover probability, and the mutation rate used in the RGA are shown in Table 2. 4.2
Simulation Results
Detection is counted as successful when more than 70% of the plate area is contained in the resulting rectangle. The detection results are shown in Table 3, and examples of the simulation results are shown in Fig. 5. For the successful cases, the plate area is extracted correctly. Moreover, we confirmed that detection is possible even for plates that were difficult for previous methods because the car color is similar to the plate color. Examining the failure cases, some plates were mistaken for a headlight: the color of the plate region and the color of the headlight are similar, so their thresholds are similar, and their sizes are also similar. Moreover, since a cover was
Fig. 5. Simulation results
attached to the plate portion of some cars, the brightness of the plate portion became extremely low, which also caused incorrect detection.
5
Conclusions
In this paper, we proposed a new threshold determination method for various backgrounds using the RGA. The conventional threshold detection method is not robust to changes in the background: once the thresholds are fixed, the detection accuracy degrades when the background color changes. In the proposed method, the thresholds are adapted by means of the thresholds function. Therefore, even if the background changes (for example, between a rainy day and daytime), suitable thresholds are obtained using the RGA, and good detection accuracy is achieved. In order to show the effectiveness of the proposed threshold determination method using the RGA, computer simulations were carried out.
References
[1] T. Naitou, T. Tukada, K. Yamada, and S. Yamamoto, "License Plate Recognition Method for Passing Vehicles with Robust Sensing Device against Varied Illumination Condition," IEICE Tech. Report (D-II), vol. J81, no. 9, pp. 2019-2026, 1998.
[2] H. Fujiyoshi, T. Umezeki, T. Imamura, and T. Kaneda, "Area Extraction of the Licence Plate Using Artificial Neural Network," IEICE Tech. Report (D-II), vol. J80, no. 6, pp. 1627-1634, 1997.
[3] K. Tanabe, H. Kawashima, E. Marubayashi, T. Nakanishi, A. Shio, and S. Ohtsuka, "Car License Plate Extraction Based on Character Alignment Model," IEICE Tech. Report (D-II), vol. J81, no. 10, pp. 2280-2287, 1998.
[4] T. Hada and T. Miyake, "Tracking of a Moving Object with Occlusion by Using Active Vision System," IEICE Tech. Report (D-II), vol. J84, no. 1, pp. 93-101, 2001.
[5] M. Haseyama, M. Kumagai, and T. Miyamoto, "A Genetic Algorithm Based Picture Segmentation Method," IEICE Tech. Report (D-II), vol. J82, no. 11, pp. 1903-1911, 1999.
[6] S. Muramatsu, Y. Otsuka, Y. Kobayashi, and E. Shimizu, "Strategy of High Speed Template Matching and Its Optimization by Using GA," IEICE Tech. Report (D-II), vol. J83, no. 6, pp. 1487-1497, 2000.
[7] M. Ikeda, S. Yoshida, K. Nakashima, N. Hamada, and H. Yoda, "High Speed Template Matching by Monotonized Normalized Correlation," IEICE Tech. Report (D-II), vol. J83, no. 9, pp. 1861-1869, 2000.
Recognition from EMG Signals by an Evolutional Method and Non-negative Matrix Factorization Yuuki Yazama1 , Yasue Mitsukura2 , Minoru Fukumi1 , and Norio Akamatsu1 1
University of Tokushima 2-1 Minami-josanjima Tokushima 770-8506 Japan Telephone + 81 -88-656-7488 {rabbit,fukumi,akamatsu}@is.tokushima-u.ac.jp 2 Okayama University 3-1-1 Tsushima Okayama Japan [email protected]
Abstract. In this paper, we propose a method for rejecting noise from signals acquired from multiple channels of electromyograph (EMG) recordings. The EMG signals are acquired with four electrodes. The 4-channel EMG signals are decomposed into two matrices using Non-negative Matrix Factorization (NMF), and noise rejection is performed by applying a filter obtained by a GA to the decomposed matrix. After noise rejection, the EMG signals are reconstructed, and the reconstructed signals are recognized. EMG signals for 7 wrist operations are measured. We show the effectiveness of this method by means of computer simulations.
1
Introduction
EMG is an electrical recording of muscle activity measured from the surface of the skin. It is possible to control a manipulator or a prosthetic hand by using EMG as an input signal [1]-[3]. Facial expressions and gestures are used as natural interfaces between humans and machines or computers, and systems using EMG, which can be used even more easily as an interface, have been studied since the 1960s. The myoelectric upper-limb prosthesis using EMG plays a significant role in rehabilitation medicine and welfare as equipment that realizes motions as natural as those of the human hand [4]. However, the biological signal is measured with multiple electrodes and has strong nonlinearity, so various neural networks (NN) with nonlinear separation capability are used for its recognition [5]. Moreover, the biological signal includes signals emitted from various parts of the body, so research on noise rejection using ICA or various filters has been advanced. Noise rejection makes it possible to suppress unnecessary spectral components and to strengthen the feature of each signal. Noise rejection with a good filter leads to
accuracy improvement in a recognition system and opens the way for feature analysis. On the other hand, NMF is a feature extraction technique that imitates a function of human vision [6][7]. It is thought that noise rejection can be performed while preserving the visual features by using NMF. This work proposes a method for rejecting noise from 4-channel EMG signals using non-negative matrix factorization (NMF). NMF decomposes a non-negative matrix into the product of non-negative matrices. By applying a mask generated by a GA, the elements of the matrix used to reconstruct the EMG signal are chosen. We carry out experiments with the reconstructed EMG signals and an NN and show the effectiveness of the present approach.
2
Electromyograph
One purpose of biological signal processing is to clarify, through the analysis of biological signals, the advanced information-processing mechanism that a living body has, and to reflect it in artificial systems. Another purpose is to objectify and quantify medical diagnoses that depend on a doctor's experience. There are many biological signals (electroencephalogram (EEG), electrocardiogram (ECG), and electromyogram (EMG)). The biological signal used in this paper is the EMG signal evoked by wrist operations, obtained by recording the activity potential of skeletal muscle from outside the cells. 2.1
EMG Pattern and Preprocessing
We focus on 7 operations that are considered basic wrist motion types: neutral, top, bottom, left, right, outward rotation, and inward rotation. The EMG patterns are recorded during these wrist motions, which are regarded as standard motions reflecting a subject's simple operations. Data are collected from two male subjects; 30 sets are measured, each of which is composed of the 7 operations. The EMG signals are measured with surface electrodes. A dry-type surface electrode (no electrolysis cream) is adopted to reduce the subject's discomfort and with practical applications such as PDA operation devices in mind. The EMG pattern, obtained by electrically amplifying the feeble signal emitted from the living body, is recorded as time-series data. Four electrodes are placed around the wrist (see Figure 1). An example of an EMG pattern is shown in Figure 2. The EMG patterns are measured at a sampling frequency of 20.48 kHz for 2.048 seconds on all electrodes. The raw EMG signals have to be preprocessed and reduced to a small number of attributes in order to achieve real-time processing and reduce the learning cost. We apply the FFT with a 256-point Hamming window to the raw EMG pattern to transform it into the frequency domain. The EMG signal after the 10,000th point is considered stationary, so frequency components are produced by shifting the 256-point FFT window by 128 points at a time after the 10,000th point. The feature vector is computed as the average of the frequency components over ten such windows.
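The preprocessing just described can be sketched as follows. The window length (256), shift (128), starting point (10,000th sample), Hamming window, and averaging over ten frames come from the text; the array layout and function name are our own.

import numpy as np

def emg_feature_vector(raw: np.ndarray, start: int = 10_000,
                       win: int = 256, shift: int = 128, frames: int = 10) -> np.ndarray:
    """Average amplitude spectrum over `frames` Hamming-windowed FFTs.

    raw: 1-D EMG signal of one channel (sampled at 20.48 kHz in the paper).
    """
    window = np.hamming(win)
    spectra = []
    for k in range(frames):
        seg = raw[start + k * shift : start + k * shift + win]
        spectra.append(np.abs(np.fft.rfft(seg * window)))
    return np.mean(spectra, axis=0)

# Example on a synthetic 2.048 s signal (each of the 4 channels would be processed this way).
fs = 20480
t = np.arange(int(2.048 * fs)) / fs
signal = np.sin(2 * np.pi * 120 * t) + 0.1 * np.random.randn(t.size)
print(emg_feature_vector(signal).shape)  # (129,) frequency bins for a 256-point FFT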
Fig. 1. Wrist with 4 electrodes
3
Non-negative Matrix Factorization
NMF is a technique for decomposing a non-negative matrix V into non-negative matrices W and H:

V \approx WH.    (1)

The rank r of the approximate matrix WH is generally chosen so that

(n + m)\, r < n\, m,    (2)

so that WH can be regarded as a compressed version of the original n \times m matrix V. V is a linear combination of the columns of W weighted by the elements of H. NMF performs the matrix decomposition under the non-negativity constraints described above; therefore, the obtained factor matrices express the original matrix by linear combinations involving only addition, without subtraction. This means that the whole matrix can be expressed by specific parts, which reflects human intuition. 3.1

Update Rules

NMF approximates the matrix V by the product of two matrices. As a measure of the approximation, either the distance between the two matrices or the divergence between them is used. First, we describe the technique that updates WH by minimizing the distance between the two matrices V and WH. The approximate matrix WH is updated under the following rules:

\bar{H}_{ij} = H_{ij}\, \frac{(W^{T} V)_{ij}}{(W^{T} W H)_{ij}},    (3)

\bar{W}_{ij} = W_{ij}\, \frac{(V H^{T})_{ij}}{(W H H^{T})_{ij}},    (4)
where \bar{H} and \bar{W} are the updated factor matrices. In the iterative computation, \bar{H} is substituted for H and \bar{W} for W at each step. This updating rule uses the Euclidean distance between the two matrices, shown in equation (5); the matrix WH that approximates the original matrix V is obtained by repeating the updates until the objective function converges:

F = \sum_{i}\sum_{j} \left( V_{ij} - (WH)_{ij} \right)^{2}.    (5)

Secondly, we describe the technique that updates WH by minimizing the divergence between V and WH. This technique updates the approximate matrix WH using the rules

\bar{H}_{ij} = H_{ij} \sum_{k} W_{ki}\, \frac{(V)_{kj}}{(WH)_{kj}},    (6)

\hat{W}_{ij} = W_{ij} \sum_{k} \frac{(V)_{ik}}{(WH)_{ik}}\, H_{jk},    (7)

\bar{W}_{ij} = \frac{\hat{W}_{ij}}{\sum_{k} \hat{W}_{kj}}.    (8)

The updates are applied repeatedly in the same way as for the distance-based rule. This updating rule obtains WH by repeated application so that the criterion function shown in equation (9) converges to a local maximum:

F = \sum_{i}\sum_{j} \left( V_{ij} \log\left((WH)_{ij}\right) - (WH)_{ij} \right).    (9)

This criterion function uses the Kullback-Leibler divergence as the approximation measure.
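The multiplicative updates can be written compactly in matrix form. The sketch below implements the Euclidean-distance rule of equations (3)-(5) (the divergence rule of (6)-(9) is analogous); it is a generic NMF illustration under these standard update rules, not the authors' code, and the rank and iteration count are arbitrary choices.

import numpy as np

def nmf_euclidean(V: np.ndarray, r: int, iters: int = 200, eps: float = 1e-9):
    """Factor V (n x m, non-negative) into W (n x r) and H (r x m) by the
    multiplicative updates that minimize ||V - WH||^2."""
    n, m = V.shape
    rng = np.random.default_rng(0)
    W = rng.random((n, r)) + eps
    H = rng.random((r, m)) + eps
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # equation (3)
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # equation (4)
    return W, H

# Example: a small non-negative matrix approximated with rank 2.
V = np.abs(np.random.default_rng(1).random((8, 6)))
W, H = nmf_euclidean(V, r=2)
print(np.linalg.norm(V - W @ H))  # Frobenius reconstruction error (cf. equation (5))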
4
The Recognition System of the EMG Signal
The EMG signal is reconstructed using NMF as preprocessing for the EMG recognition system, and recognition is then performed using the reconstructed signal. In order to reconstruct it, however, elements of the matrix have to be selected, so a GA determines the elements used to reconstruct the EMG signal. The elements are selected from the matrix H obtained by NMF. The matrix W is considered to be a basis representing the features of the data, and the matrix H is considered to contain the coefficients for restoring the original data using W. Therefore, noise is suppressed by selecting only the required coefficients among the elements of H for reconstruction. Furthermore, it is thought that the features are emphasized. 4.1
Genetic Algorithm
Genetic Algorithm (GA) is a method modeled on the evolutionary process. It is an optimization technique that computes better solutions by genetically modifying
Fig. 2. Crossover operation
Fig. 3. Mutation operation
two or more candidate solutions. Research that grasps the correlation among two or more objects using a GA has been reported [8]; by mapping the objects into Euclidean space, pattern recognition methods on Euclidean space can be applied [8], enabling the analysis of objects from a new point of view. We propose a method for extracting features and characteristics from the many acquired features using a GA. Unnecessary elements are included in the matrix H, so the required elements of H are chosen by the GA. The chromosome used in the GA is based on binary coding: the element corresponding to a code "0" is not used, and the element corresponding to a code "1" is used. The chromosome is a two-dimensional model. The genetic operators, crossover and mutation, are as follows. Crossover specifies the range to be crossed and exchanges the specified portions (see Fig. 2). Mutation also specifies a range and reverses the bits of the binary coding within it (see Fig. 3). The deeply colored portion in each figure is the range of crossover or mutation. The fitness function of each chromosome is defined as the product of the distance from the original signal and the average value of the elements used, where x is the original signal, \hat{x} is the reconstructed signal, i is the sensor index, and j is the data index. The distance from the original signal is defined as

D = \sum_{i}\sum_{j} \left( x_{ij} - \hat{x}_{ij} \right)^{2}.    (10)

The average value of the elements used is defined as

U = \frac{\sum H_{ij}}{n},    (11)
Table 1. Recognition accuracy

               N      T      U      R      L      RI     RO     Total
  Subject1    88.1%  40.8%  53.8%  63.0%  75.5%  57.3%  74.0%  64.6%
  Subject1 *  85.3%  64.8%  65.8%  72.8%  92.3%  67.0%  86.0%  76.2%
  Subject2    81.3%  51.5%  69.1%  65.0%  57.3%  47.8%  50.5%  60.4%
  Subject2 *  75.5%  72.3%  89.8%  81.1%  75.3%  64.1%  53.1%  73.0%
  Subject3    55.8%  32.6%  66.6%  62.6%  63.1%  60.8%  51.8%  56.2%
  Subject3 *  58.0%  56.0%  85.0%  75.5%  81.6%  74.0%  49.6%  68.5%
where the sum in (11) runs only over the elements of H that are used. The fitness function is defined as

F = D \cdot U.    (12)

The average value of the elements used is included because elements of H with large values are considered to have little adverse influence on the reconstructed signal. Thus, it is thought that a waveform close to the original signal can be generated while the unnecessary elements are deleted, and that the noise can be removed while the features of the EMG signal are preserved.
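As an illustration only, the fitness of a binary mask over H might be evaluated as below. The NMF factors W and H and the mask shape are placeholders, and whether the GA minimizes or maximizes this product is not spelled out in the text.

import numpy as np

def mask_fitness(V: np.ndarray, W: np.ndarray, H: np.ndarray,
                 mask: np.ndarray) -> float:
    """Fitness F = D * U for a binary mask over H (equations (10)-(12)).

    V    : original signals, one row per sensor (x_ij in the text)
    W, H : NMF factors of V
    mask : binary array with H's shape; 1 = the element of H is used
    """
    H_used = H * mask                       # keep only the selected coefficients
    V_hat = W @ H_used                      # reconstructed signal x^_ij
    D = np.sum((V - V_hat) ** 2)            # equation (10): distance to the original
    used = H[mask.astype(bool)]
    U = used.mean() if used.size else 0.0   # equation (11): mean of the used elements
    return D * U                            # equation (12)

# Toy example with random factors and a random mask.
rng = np.random.default_rng(0)
V = rng.random((4, 50)); W = rng.random((4, 3)); H = rng.random((3, 50))
mask = (rng.random(H.shape) > 0.5).astype(int)
print(mask_fitness(V, W, H, mask))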
4.2 Result
The preprocessing for the recognition system, namely noise rejection and feature emphasis, is performed by NMF and the GA, and the recognition experiment is then carried out using the resulting signals. The computer simulations use data obtained from 3 subjects: 30 sets are measured for three male subjects, each set consisting of the 7 operations. The frequency band used is from 40 Hz to 1000 Hz. The recognition experiment is performed with an NN whose numbers of input, hidden, and output layer units are 100, 10, and 7, respectively. The simulation results show the average recognition accuracy over 30 runs of the NN. In addition, a comparison simulation of the NN using the original EMG signals is carried out. The average recognition accuracies obtained are shown in Table 1. The recognition accuracy is improved by about 10% for all 3 subjects. It is thought that, by operating on the matrix H obtained by NMF, the features of the EMG signal are emphasized and the influence of signals from other body parts is reduced.
5
Conclusion
NMF is a feature extraction technique that imitates a function of vision in the brain. A filter for reconstructing the signal is created using NMF and a GA; the signal is then generated using this filter, and the recognition
experiment is carried out using the reconstructed signal. We verified the effectiveness of waveform sharpening using NMF by computer simulations. As future work, although this technique is proposed for noise rejection, we do not yet understand the generated signal in detail. It is possible that features for generating the original waveform appear not only in the matrix W used as a basis but also in the elements of the matrix H. Therefore, the processing of the signal waveform by NMF and the GA will be analyzed. Moreover, it is thought that the important frequency bands can be identified from the matrix H obtained by NMF and the GA. We will verify this by conducting recognition experiments using a small number of features obtained by specifying the important frequency bands.
References
[1] D. Nishikawa, W. Yu, H. Yokoi, and Y. Kakazu. On-Line Learning Method for EMG Prosthetic Hand Controlling. IEICE, D-II, Vol. J82, No. 9, pp. 1510-1519, September 1999, in Japanese.
[2] D. Nishikawa, H. Yokoi, and Y. Kakazu. Design of motion-recognizer using electromyogram. In Robotics and Mechatronics Division, June 1998, in Japanese.
[3] D. Nishikawa, W. Yu, H. Yokoi, and Y. Kakazu. On-Line Supervising Mechanism for Learning Data in Surface Electromyogram Motion Classifiers. IEICE, D-II, Vol. J84, No. 12, pp. 2634-2643, December 2001.
[4] Yuichiro Kawamura, Masako Nozawa, Haruki Imai, and Yoshiyuki Sankai. Stand-up Motion Control for Humanoid using Human's Motion Pattern and EMG. Academic lecture at the 19th Robotics Society of Japan, 2001, in Japanese.
[5] M. Vuskovic and S. Du. Classification of Prehensile EMG Patterns With Simplified Fuzzy ARTMAP Networks. IJCNN, Honolulu, Hawaii, May 2002.
[6] D. Lee and H. Seung. Algorithms for non-negative matrix factorization. NIPS 2000, 2000.
[7] D. Lee and H. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, Vol. 401, pp. 788-791, 1999.
[8] Shin'ichiro Omachi, Hiroko Yokoyama, and Hirotomo Aso. Subject Arrangement in the Euclidean Space Using Algorithm. IEICE, D-II, Vol. J82-D-II, No. 12, pp. 2195-2202, 1999, in Japanese.
A Feature Extraction Method for Personal Identification System Hironori Takimoto1 , Yasue Mitsukura2 , Minoru Fukumi1 , and Norio Akamatsu1 1
The University of Tokushima 2-1 Minami-Josanjima Tokushima 770-8506 Japan {taki,fukumi,akamatsu}@is.tokushima-u.ac.jp 2 Okayama University 3-1-1 Tsushima Okayama 700-8530lJapan [email protected]
Abstract. Recently, many studies of individual identification methods using biometrics have been carried out. In particular, personal identification using faces is attractive because it requires no physical contact. However, when the number of registrants of a system increases, the recognition accuracy of the system inevitably degrades. Therefore, in order to improve the recognition accuracy, it is necessary to extract the feature areas that are effective for recognition. In this paper, we analyze and examine the individual features in a face using the GA and the SPCA. By removing the areas that are not valuable, we expect the recognition accuracy to become higher. In order to show the effectiveness of the proposed method, we present computer simulations using real images.
1
Introduction
Recently, many studies of personal identification techniques using biometrics have been carried out [1, 2]. Biometrics requires neither memorizing passwords nor carrying cards, and only a registrant is accepted. In particular, the face is always exposed to society, so it carries less psychological burden than other physical features. There are several personal identification methods that use frontal face images [3, 4]: one is the extraction of feature points using edge detection, another is template matching, and so on. In the feature-point extraction method, the shapes of the parts that constitute a face and the individual differences in their positions are used for recognition. In the pattern matching technique, a face is regarded as a two-dimensional pattern of gray values. However, this becomes a redundant representation by a vector with a huge number of dimensions, and it is influenced by changes in the environment [4].
On the other hand, principal component analysis (PCA) is used to obtain a feature vector for discrimination [4, 5]. In the scheme of PCA, however, it is not easy to compute the eigenvectors of a large matrix when the calculation cost required for real-time processing is considered. Therefore, the simple principal component analysis (SPCA) method has been proposed so that PCA can be approximated at high speed [6, 7], and the technique of calculating the feature vector of a face pattern by SPCA is effective for recognition of frontal face patterns. Meanwhile, the deterioration of recognition accuracy as the number of registrants increases is a very important problem. In order to improve the accuracy, it is necessary to extract the feature areas effective for recognition. In this paper, we analyze and examine the individual features in a face using the genetic algorithm (GA) and the SPCA. Conventionally, it is said that the feature regions of a face are the eyebrows, eyes, nose, and mouth. However, these regions are not always the same and may change with the registrants of the system. Thus, by removing the areas that are not valuable, we expect the recognition accuracy to become higher. In order to show the effectiveness of the proposed method, we present computer simulations using real images.
2
SPCA
The SPCA has been proposed so that PCA can achieve high-speed processing. It has been confirmed to be effective for information compression of handwritten digits and for dimensionality reduction in a vector space model [7]. The SPCA produces approximate solutions without calculating a variance-covariance matrix. The SPCA algorithm is given as follows:
1. Collect an n-dimensional data set V = {v_1, v_2, ..., v_m}.
2. The vectors X = {x_1, x_2, ..., x_m}, obtained by subtracting the average vector (the center of gravity of the set) from each v_i, are used as input vectors.
3. The column vector a_1 is defined as the connection weights between the inputs and the output, and is used to approximate the first eigenvector. The output is given by

y_1 = a_1^{T} x_j.    (1)

4. Using equations (2) and (3), iterative calculation is carried out from an arbitrary vector given as an initial value. As a result, the vector approaches the direction of the first eigenvector:

a_1^{k+1} = \frac{\sum_j \Phi_1(y_1, x_j)}{\left\| \sum_j \Phi_1(y_1, x_j) \right\|},    (2)

y_1 = (a_1^{k})^{T} x_j,    (3)

where \Phi_1 is a threshold function given by

\Phi_1(y_1, x_j) = \begin{cases} +x_j & y_1 \ge 0 \\ -x_j & \text{otherwise.} \end{cases}    (4)

5. Using equation (5), we remove the first principal component from the data set in order to find the next principal component:

x_i = x_i - (a_1^{T} x_i)\, a_1.    (5)

Subsequent principal components are obtained by substituting the deflated data and a new weight vector into equations (2) and (3) and performing the iterative calculation once again.
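The iteration in equations (1)-(5) can be sketched as below. This is a generic simple-PCA loop in the spirit of the reconstruction above; the fixed iteration count and the random initialization are our own choices.

import numpy as np

def spca(V: np.ndarray, n_components: int, iters: int = 50) -> np.ndarray:
    """Approximate the leading principal directions without a covariance matrix.

    V: data matrix with one sample per row. Returns one unit vector per row,
    following equations (1)-(5).
    """
    X = V - V.mean(axis=0)                 # step 2: center the data
    components = []
    for _ in range(n_components):
        a = np.random.default_rng(0).normal(size=X.shape[1])
        a /= np.linalg.norm(a)
        for _ in range(iters):
            y = X @ a                                    # equations (1)/(3)
            phi = np.where(y[:, None] >= 0, X, -X)       # equation (4)
            s = phi.sum(axis=0)
            a = s / np.linalg.norm(s)                    # equation (2)
        components.append(a)
        X = X - np.outer(X @ a, a)                       # equation (5): deflation
    return np.array(components)

# Example: the first direction of an elongated 2-D cloud is close to (1, 0).
rng = np.random.default_rng(1)
data = rng.normal(size=(500, 2)) * np.array([5.0, 0.5])
print(spca(data, n_components=1))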
3
Face Image Data
In this paper, we use images of 100 registrants (72 adult men and 28 adult women), with 6 face images per person (600 face images in total). Moreover, we use images of 252 un-registrants (113 adult men and 139 adult women), with one face image per person (252 face images in total). The subjects are of various generations and races, and none wears glasses. Each subject looks at the lens of the camera with a natural expression. The background is restricted, but the lighting is not. In this paper, the face images need to be normalized in order to recognize individuals. The normalization is based on both eyes, for the following reasons: first, normalization with respect to rotation and size is easier with the eyes than with the lips, nose, or ears; second, many methods for extracting the eye regions have already been proposed [8, 9]. Therefore, using the eyes for normalization of the face image is efficient. First, the original image (420 x 560 pixels, 24-bit color) is converted to a gray-scale image (Fig. 1(a)), and a median filter is applied to remove noise. Next, the center positions of both eyes are extracted. The line segment joining both eyes is then rotated so that it becomes horizontal, and the distance between the eyes is scaled to 40 pixels. Moreover, in order to diminish the influence of hair and clothes, the image is cropped as shown in Fig. 1(c). Finally, in order to ease the influence of photometric conditions, a gray-scale transformation is performed. In this way, the image in Fig. 1(a) is normalized as shown in Fig. 1(b).
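A sketch of the eye-based normalization, assuming the two eye centers have already been detected (the paper cites [8, 9] for that step). Only the 40-pixel eye distance comes from the text; the crop margins and output size are illustrative assumptions, and OpenCV is used purely for convenience.

import cv2
import numpy as np

def normalize_face(gray: np.ndarray, left_eye, right_eye,
                   eye_dist: int = 40, out_size=(60, 80)) -> np.ndarray:
    """Rotate so the eye line is horizontal, scale the eye distance to 40 px,
    crop around the eyes, and stretch the gray levels (cf. Fig. 1)."""
    (lx, ly), (rx, ry) = left_eye, right_eye
    angle = np.degrees(np.arctan2(ry - ly, rx - lx))      # rotation to horizontal
    scale = eye_dist / np.hypot(rx - lx, ry - ly)         # make eye distance 40 px
    center = ((lx + rx) / 2.0, (ly + ry) / 2.0)
    M = cv2.getRotationMatrix2D(center, angle, scale)
    aligned = cv2.warpAffine(gray, M, (gray.shape[1], gray.shape[0]))
    # Crop a window around the eye midpoint (margins are assumptions, not from the paper).
    cx, cy = int(center[0]), int(center[1])
    w, h = out_size
    crop = aligned[max(cy - h // 3, 0): cy + 2 * h // 3,
                   max(cx - w // 2, 0): cx + w // 2]
    # Gray-scale transformation to ease photometric differences.
    return cv2.normalize(crop, None, 0, 255, cv2.NORM_MINMAX)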
4
Individual Feature Extraction
In this paper, the region segments where individual features appear are extracted using the GA and the SPCA and are then used for recognition. Good recognition accuracy is obtained by removing regions that are not valuable for recognition. Furthermore, the computational cost required for recognition can be reduced.
Fig. 1. The normalization of a face image: (a) the original image, (b) the normalized image, (c) the outline of the normalization
4.1
Region Segment and Genetic Coding
As shown in Fig. 2(a), the image is segmented into regions; each block is a 4x4-pixel square. The mosaic pattern in Fig. 2(b) is used when performing SPCA for the fitness function of the GA: each segmented region is sampled by its average gray-scale value in order to achieve high-speed processing and dimensionality reduction. The chromosomes of the GA are binary-coded and expressed as a two-dimensional arrangement. Each gene corresponds one-to-one with a segmented region, and when a gene's value is 1, the corresponding region is used for recognition. 4.2
Fitness Function
The fitness function is shown in equation (6). The numerator means that the fitness value becomes high if the difference between individuals in the selected regions is large: Sab is the total variance in the selected regions, and Dim is the total contribution rate up to the 5th component when SPCA is performed on the selected regions. The denominator means that the fitness value becomes high if the number of selected regions is small. By using this fitness function, we can obtain small-sized inputs and high recognition accuracy.
Fig. 2. The region segment and mosaic pattern: (a) the region segment, (b) the mosaic pattern
In this paper, the elite preservation method and the roulette selection method are adopted for selecting individuals. The proportion of individuals preserved as the elite is about 10%. Moreover, new individuals are created by choosing the 40% of individuals with the best fitness values from the whole population and applying the GA operations (selection, crossover, mutation).

Fitness = \frac{Sab \cdot (1 - Dim)}{Pix / Pix_{max}}    (6)
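A sketch of how the region-selection fitness of equation (6) might be evaluated for one chromosome. Sab is taken here as the total variance of the selected blocks across training faces and Dim as the cumulative contribution rate of the first five components (ordinary PCA stands in for SPCA); the exact computations are not fully specified in the text, and the names are our own.

import numpy as np

def region_fitness(mosaic: np.ndarray, chromosome: np.ndarray) -> float:
    """Equation (6): Fitness = Sab * (1 - Dim) / (Pix / Pixmax).

    mosaic     : faces x blocks matrix of average gray values (Fig. 2(b))
    chromosome : binary vector, 1 = the corresponding block is used
    """
    used = chromosome.astype(bool)
    pix, pix_max = used.sum(), used.size
    if pix == 0:
        return 0.0
    selected = mosaic[:, used]
    sab = selected.var(axis=0).sum()                        # spread between individuals
    eigvals = np.linalg.eigvalsh(np.cov(selected, rowvar=False))
    dim = np.sort(eigvals)[::-1][:5].sum() / eigvals.sum()  # first-5 contribution rate
    return sab * (1.0 - dim) / (pix / pix_max)

# Toy example: 20 faces, 100 blocks, about half of the blocks selected.
rng = np.random.default_rng(0)
faces = rng.random((20, 100))
genes = (rng.random(100) > 0.5).astype(int)
print(region_fitness(faces, genes))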
4.3 Feature Extraction Simulation
The individual feature extraction is performed for 100 registrants. The experimental conditions are shown in Table 1. As the result, the average of the ten individuals with the best fitness values is shown in Fig. 3(b); darker regions in this figure were chosen by more individuals. Because the face image is normalized based on both eyes, the selected regions contain the eye region more than the mouth region. From these results, it is confirmed that the proposed method using the GA captures the characteristics of faces; therefore, the GA is effective for obtaining individual characteristics.
5
Personal Identification Simulation
In order to show the effectiveness of the proposed method, we carry out computer simulations using real images. The recognition process is as follows, and the details of each step are described below.
step 1  Normalization of the face image
step 2  Dimensionality reduction of the image pattern by SPCA
step 3  Calculation of the similarity (cos θ) between each eigenface and the face image
step 4  Learning and recognition by NN

5.1
Neural Networks
In this paper, the NN is used for a classification and is a three-layered type. It is learned by using the back-propagation (BP) method. The number of input layer
Table 1. The experimental conditions of the GA

  Generation      500
  Individual       50
  Bit of gene     400
  Mutation rate  0.05
Fig. 3. The result of feature extraction: (a) the sample face image, (b) the result image
units is the same as the number of similarities (one for each eigenvector, up to a total contribution rate of 90% or more). The number of output layer units is the same as the number of registrants of the system. If the output of one output unit exceeds a threshold, the input is accepted as the registrant corresponding to that unit. If the outputs of two or more units exceed the threshold, or if no unit's output exceeds the threshold, the input is rejected. 5.2
Evaluation Simulation
In this paper, we perform simulations for the patterns shown in Table 2, which lists the number of pixels used, the rate of pixels used, and the number of principal components calculated by SPCA for each case. Case 1 uses the whole normalized face image, and the regions of cases 2-4 are determined from the result of the GA. The False Rejection Rate (FRR) and the False Acceptance Rate (FAR) are shown in Fig. 4. From the FRR results, little difference is seen among the cases when the threshold is comparatively low. However, when the threshold is comparatively large, the accuracies of cases 2 and 3 are better than that of case 1. Therefore, the proposed method is effective when the threshold is large, that is, in situations where un-registrants must not be accepted. The accuracy of case 4, however, is worse than that of case 1, which suggests a harmful effect when the amount of information is reduced too much. Moreover, the equal error rate (EER) and the corresponding threshold are shown in Table 3. The EER is the error rate at the threshold where the FAR and FRR curves cross, and it is often used to verify the effectiveness of a recognition system [10].
Table 2. The simulation patterns

  Case      Used pixels   Used rate (%)   Input
  case 1       6400          100.0          50
  case 2       4528           70.7          50
  case 3       3552           55.5          49
  case 4       2672           41.7          47

Table 3. The EER and threshold

  Case      EER (%)    θ0
  case 1     2.15     0.57
  case 2     1.33     0.58
  case 3     1.55     0.52
  case 4     2.75     0.54
Fig. 4. The simulation result of recognition: (a) the false rejection rate and (b) the false acceptance rate for cases 1-4, plotted against the threshold

In terms of the EER, the accuracy of case 2 is the best of the four patterns, and the threshold of case 2 is the highest. Thus, the proposed method is effective in severe security environments that require a high threshold. In conclusion, the proposed method is effective and obtains the individual characteristics needed for recognition.
6
Conclusions
In this paper, a feature extraction method for personal identification was proposed using the SPCA and the GA. The effectiveness of the proposed method was verified by comparison with the case in which the whole face image is used. As a result, we obtained an EER of 1.33%.
References
[1] Y. Yamazaki and N. Komatsu: "A Feature Extraction Method for Personal Identification System Based on Individual Characteristics", IEICE Trans. D-II, Vol. J79, No. 5, pp. 373-380, (1996), in Japanese.
[2] A. K. Jain, L. Hong, and S. Pankanti: "Biometric identification", Commun. ACM, vol. 43, no. 2, pp. 91-98, (2000).
[3] R. Chellappa, C. L. Wilson, and S. Sirohey: "Human and machine recognition of faces: A survey", Proc. IEEE, vol. 83, no. 5, pp. 705-740, (1995).
[4] S. Akamatsu: "Computer Recognition of Human Face", IEICE Trans. D-II, Vol. J80, No. 8, pp. 2031-2046, (1997), in Japanese.
[5] O. Delloye, M. Kaneko, and H. Harashima: "Handling of Facial Features Using Face Space", IEICE Trans., Vol. J80-A, No. 8, pp. 1332-1336, (1997), in Japanese.
[6] H. Takimoto, Y. Mitsukura, M. Fukumi, and N. Akamatsu: "A Design of Face Detection System Using the GA and the Simple PCA", Proc. of ICONIP'02, Vol. 4, pp. 2069-2073, November (2002).
[7] M. Partridge and R. Calvo: "Fast dimensionality reduction and simple PCA", IDA, Vol. 2(3), pp. 292-298, (1989).
[8] S. Kawato and N. Tetsutani: "Circle-Frequency Filter and its Application", Proc. Int. Workshop on Advanced Image Technology, pp. 217-222, Feb. (2001).
[9] T. Kawaguchi, D. Hikada, and M. Rizon: "Detection of the eyes from human faces by hough transform and separability filter", Proc. of ICIP 2000, pp. 49-52, (2000).
[10] S. Furui: "Digital Sound Processing", TOKAI UNIVERSITY PRESS, Tokyo, (1985).
A Feature Extraction of the EEG Using the Factor Analysis and Neural Networks Shin-ichi Ito1 , Yasue Mitsukura2 , Minoru Fukumi1 , and Norio Akamatsu1 1
University of Tokushima 2-1, Minami-Josanjima, Tokushima, 770-8506, Japan {itwo,fukumi,akamatsu}@is.tokushima-u.ac.jp 2 Okayama University 3-1-1, Tsushima-naka, Okayama, 700-8530, Japan [email protected]
Abstract. It is well known that the EEG has personal characteristics, but few studies have taken these personal characteristics into consideration. Among the analyzed frequency components of the EEG, some contain significant characteristics and some do not, and the combination of informative components differs from person to person. We regard these combinations as the personal characteristic frequency components of the EEG. In this paper, an EEG analysis method using the GA, the FA, and the NN is proposed. The GA is used for selecting the personal characteristic frequency components, the FA for extracting the characteristic data of the EEG, and the NN for estimating the extracted characteristic data. Finally, in order to show the effectiveness of the proposed method, computer simulations classifying the EEG patterns are carried out. The EEG patterns are 4 conditions: listening to Rock music, Schmaltzy Japanese ballad music, Healing music, and Classical music. The result without the personal characteristic frequency components gave over 80% accuracy, and the result with the personal characteristic frequency components gave over 95% accuracy. These experimental results show the effectiveness of the proposed method.
1
Introduction
Recently, research on electroencephalogram (EEG) interfaces has been carried out, because the EEG has the potential to realize an interface that can be operated without special knowledge or technology. The EEG is the electrical activity of the brain recorded from the scalp. It is a time-series signal that changes with internal factors, such as a person's thinking and condition, and with outside stimuli such as light and sound [1]-[4]. Moreover, the EEG is a time-series signal in which more than one factor is intricately intertwined, and it differs depending on the measurement points. Therefore, taking
account of these problems, both the EEG analysis method and the measurement point of the EEG must be considered. In this paper, taking the EEG interface into account, we propose a method focused on the following four points. First, the EEG is analyzed using the information of a single measurement point, because we assume that using several measurement points is unsuitable for an EEG interface [4],[5]. Second, Genetic Algorithms (GA) are used for the EEG analysis. Among the frequency components of the EEG, some contain significant characteristics and others contain much unnecessary information, and the combination of informative components differs from person to person. We regard the frequency components that contain significant characteristics as the personal characteristic frequency components, and the GA is used to specify them; feature selection methods using a GA have already been proposed and their validity reported [6]-[8]. Third, Factor Analysis (FA) is used for the EEG analysis. Because the EEG is a time-series signal in which more than one factor is intricately intertwined and which contains much noise, it carries information that is difficult to obtain from direct observation. Therefore, taking account of the correlation of the time transition of each frequency spectrum of the EEG, the latent structure that explains that correlation is analyzed using the FA, and the characteristic data of the EEG are extracted in this way. Finally, Neural Networks (NN) are used for the EEG analysis. An NN can express the nonlinear relations in which more than one factor is intricately intertwined, and it is used for estimating the extracted characteristic data of the EEG. In other words, taking the EEG interface into account, we propose an EEG analysis method that uses the GA, the FA, and the NN. In order to show the effectiveness of the proposed method, computer simulations classifying the EEG patterns are carried out. The EEG patterns are 4 conditions: listening to Rock music, Schmaltzy Japanese ballad music, Healing music, and Classical music.
2
Measurement of the EEG
The final goal of our research is to construct an EEG control system based on music. This system uses the physiological and mental effects that a music stimulus gives to a human. There is a causal relationship between the EEG and music; for instance, alpha waves appear when listening to Classical music, and beta waves appear when listening to Rock music. The system automatically adjusts the output of the music according to this causal relationship, and the EEG is controlled by listening to the adjusted music. In constructing this system, both EEG analysis and music analysis are indispensable, and the most important part is the EEG analysis. Therefore, the EEG analysis is carried out in this paper.
2.1
The EEG Patterns
In this paper, as basic research toward constructing the EEG control system, several genres of music are classified from the EEG. The EEG patterns are 4 conditions: listening to Rock music, Schmaltzy Japanese ballad music, Healing music, and Classical music. We used a questionnaire to decide the EEG patterns so that classifying the genres of music becomes easier; the questionnaire was given to 20 people, men and women in their twenties. 2.2
Measurement Conditions of the EEG
The subjects were 5 people: 4 men (average age 22.3 years) and 1 woman (age 23 years). A simple electroencephalograph, made by the Brain Function Research & Development Center in Japan, was used; it can measure the EEG in a practical environment. The measurement point is electrode position FP1 (the left frontal lobe) in the international 10/20 system. The data are processed by FFT, and frequency analysis up to 24 Hz at intervals of 1 Hz is carried out by the attached analysis software. As the measurement condition, the EEG is measured in a laboratory with some noise, and the subjects wear a sensor band and headphones. The measurement time is four minutes for each condition. The frequency components used as data are from 4 Hz to 22 Hz. Fig. 1 shows the measurement point of the EEG.
3
The Procedure of the Proposed Method
In this paper, we propose an EEG analysis method that uses the GA, the FA, and the NN. The GA is used for specifying the personal characteristic frequency components, the FA for extracting the characteristic data of the EEG, and the NN for estimating the extracted characteristic data. Fig. 1 shows the flowchart of the EEG analysis method, which proceeds as follows. First, individuals with random chromosomes are formed as an initial population. Each gene of a chromosome takes the value "0" or "1" and corresponds to one frequency component (4-22 Hz). When a gene is "1", the time-series data of the allocated frequency component are used for the FA; when it is "0", that component is not used. Second, the characteristic data of the EEG are extracted by the FA. The FA is a statistical method in which the information carried by correlated variates (frequency components) is summarized in a small number of latent factors; it can perform denoising, dimensional compression, and analysis of the latent structure behind the correlated variates. In this paper, the factor model assumes uncorrelated common factors, because we assume that a common factor responding to a particular stimulus can be identified when the extracted common factors are uncorrelated [9],[10]. Moreover, taking
account of the measurement conditions of the EEG, we assume that the first common factor represents "the EEG changing by the music." The common factors are extracted by the principal factor method. In this paper, the characteristic data of the EEG are the data of the first factor loadings. Third, the characteristic data of the EEG are estimated by the NN. A 3-layer NN is used for learning the characteristic data of the EEG and for the EEG pattern classification; the back-propagation (BP) method is used for learning, and the leave-one-out cross-validation (LOOCV) method is used for testing. Finally, the fitness of each individual is calculated using the result (recognition accuracy) of the NN. When the established termination conditions are not satisfied, the next generation of individuals is generated by genetic manipulations. The genetic manipulations are selection, crossover, mutation, multiplication, and logical operations, which are used to avoid being trapped in a local solution. When the termination conditions are satisfied, the personal characteristic frequency components are those allocated a "1" in the chromosome of the individual with the highest fitness value.

Fig. 1. Flowchart of the proposed EEG analysis method (initialization, factor analysis, and fitness evaluation, followed by elite preservation, crossover, mutation, and multiplication; the loop ends when the fitness reaches 0.99 or after 50 generations)

3.1
The Fitness Value
First, the characteristic data are learned by the NN, and the mean squared error after learning is calculated on the learning data. Second, the recognition
accuracy of the individual is calculated using the NN. Finally, the fitness of the individual is calculated. The fitness value is defined as follows:

fitness = \left( Recog - \left( \frac{\mathrm{Number\ of\ 1's}}{NumData \times Length} \right)^{2} \right) - Err

Err : the mean squared error
Recog : the result of the NN (the recognition accuracy)
Length : the length of the gene
NumData : the number of test data patterns
Number of 1's : the number of 1's in the gene

Fig. 2. Genetic manipulation based on the logical operations (example chromosomes illustrating the genetic manipulation point, the conjunction, and the disjunction)
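A direct transcription of the fitness formula above; the NN training and LOOCV that produce Recog and Err are outside this sketch, and the grouping of the penalty term follows our reading of the garbled equation.

def eeg_fitness(recog: float, err: float, ones: int,
                num_data: int, length: int = 19) -> float:
    """fitness = (Recog - (Number_of_1s / (NumData * Length))^2) - Err."""
    penalty = (ones / (num_data * length)) ** 2   # favors chromosomes with few 1's
    return (recog - penalty) - err

# Example: 95% LOOCV accuracy, small squared error, 8 of 19 components selected,
# 20 test patterns (all values are illustrative).
print(eeg_fitness(recog=0.95, err=0.02, ones=8, num_data=20, length=19))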
3.2 Genetic Manipulations
The genetic manipulations are selection, crossover, mutation, multiplication, and the logical operations. Selection uses the elite preservation strategy. Crossover is one-point crossover combined with the logical operations shown in Fig. 2. Multiplication also uses the logical operations. The logical operations are as follows. First, the conjunction (Elite conjunction) of the individuals selected as the elite is calculated. Second, the conjunction (Non-elite conjunction) of the individuals not selected as the elite is calculated. Finally, each gene (0 or 1) of the Non-elite conjunction is inverted, and the conjunction and the disjunction of the Elite conjunction and the inverted Non-elite conjunction are calculated.
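The elite/non-elite logical operations can be illustrated with boolean arrays as below; how the resulting conjunction and disjunction chromosomes are inserted back into the population is not detailed in the text.

import numpy as np

def logical_offspring(elite: np.ndarray, non_elite: np.ndarray):
    """Return the two chromosomes produced by the logical operations.

    elite, non_elite: binary matrices (individuals x genes).
    """
    elite_and = np.logical_and.reduce(elite.astype(bool), axis=0)       # elite conjunction
    non_elite_and = np.logical_and.reduce(non_elite.astype(bool), axis=0)
    inverted = ~non_elite_and                                           # invert each gene
    conj = np.logical_and(elite_and, inverted).astype(int)
    disj = np.logical_or(elite_and, inverted).astype(int)
    return conj, disj

# Example with 19-gene chromosomes.
rng = np.random.default_rng(0)
elites = (rng.random((5, 19)) > 0.4).astype(int)
others = (rng.random((45, 19)) > 0.6).astype(int)
print(logical_offspring(elites, others))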
4
Computer Simulations
In this simulation, the EEG pattern is classified. The EEG pattern is 4 conditions, which are listening to Rock music, Schmaltzy Japanese ballad music,
Table 1. The parameters of the GA

  The number of generations    50
  The number of individuals    50
  The length of chromosome     19

Table 2. The parameters of the NN

  The maximum number of input layer units     19
  The number of hidden layer units              5
  The number of output layer units              4
  The number of learning iterations         10000
  The step size                              0.01
Healing music, and Classical music. The subjects were 5 people: 4 men (average age 22.3 years) and 1 woman (age 23 years). The parameters of the GA are shown in Table 1, and the parameters of the NN are shown in Table 2; the step size and the number of hidden layer units were determined empirically. The result is compared with the case of using all frequency components (4-22 Hz). The recognition accuracies for the case of using the personal characteristic frequency components and for the case of using all frequency components (4-22 Hz) are shown in Fig. 3, and the selected personal frequency components are shown in Table 3. The case using all frequency components (4-22 Hz) gave over 80% accuracy, which suggests that extracting the characteristic data of the EEG by the FA is significant for EEG analysis. The case using the personal frequency components gave over 95% accuracy, which suggests that specifying the personal frequency components is significant for EEG analysis.
5
Conclusions
Taking into account the use of an EEG interface, a simple band-type electroencephalograph is used for measuring the EEG. We propose an EEG analysis
Table 3. Frequency components selected by the GA

  subject1   6, 8, 10, 12, 19-21 Hz
  subject2   4, 6-8, 10, 17, 18 Hz
  subject3   6, 8, 10-12, 22 Hz
  subject4   6-8, 12, 17, 18, 20 Hz
  subject5   5, 6, 8, 12-14, 16, 17, 21 Hz
Fig. 3. The recognition accuracy in the case of using the personal characteristic frequency components and in the case of using the frequency components (4-22 Hz)
method that uses the GA, the FA, and the NN. The GA is used for specifying the personal characteristic frequency components, the FA for extracting the characteristic data of the EEG, and the NN for estimating the extracted characteristic data. In order to show the effectiveness of the proposed method, computer simulations classifying the EEG patterns were carried out for 4 conditions: listening to Rock music, Schmaltzy Japanese ballad music, Healing music, and Classical music. The result without the personal characteristic frequency components gave over 80% accuracy, and the result with the personal characteristic frequency components gave over 95% accuracy, which shows the effectiveness of the proposed method. In this paper, we used a questionnaire to decide the EEG patterns so that classifying them would be easy. Future studies will focus on increasing the number of EEG patterns (music genres) to be classified. Moreover, we will analyze the music itself.
References
[1] O. Fukuda, T. Tsuji, and M. Kaneko. Pattern Classification of a Time-Series EEG Signal Using a Neural Network, IEICE (D-II), vol. J80-D-II, No. 7, pp. 1896-1903, (1997)
[2] H. Tanaka and H. Ide. Intention Transmitting by the Single-Trial MRCP Analysis, T. IEE Japan, vol. 122-C, No. 5, (2002)
[3] S. Yamada. Improvement and Evolution of an EEG Keyboard Input Speed, IEICE (A), vol. J79-A, No. 2, pp. 329-336, (1996)
[4] T. Shimada, T. Shina, and Y. Saito. Auto-Detection of Characteristics of Sleep EEG Integrating Multi Channel Information by Neural Networks and Fuzzy Rule, IEICE (D-II), vol. J81-D-II, No. 7, pp. 1689-1698, (1998)
[5] S. Tasaki, T. Igasaki, N. Murayama, and H. Koga. Relationship between Biological Signals and Subjective Estimation While Humans Listen to Sounds, T. IEE Japan, vol. 122-C, No. 9, pp. 1632-1638, (2002)
[6] S. Ito, Y. Mitsukura, M. Fukumi, and N. Akamatsu. Neuro Rainfall Forecast with Data Mining by Real-Coded Genetical Preprocessing, T. IEE Japan, Vol. 123-C, No. 4, pp. 817-822, (2003)
[7] M. Fukumi and S. Omatsu. Designing an Architecture of a Neural Network for Coin Recognition by a Genetic Algorithm, T. IEE Japan, vol. 113-D, No. 12, pp. 1403-1409, (1993)
[8] M. Fukumi and S. Omatsu. Designing a neural network by a genetic algorithm with partial fitness, Proc. of Int. Conf. Neural Networks, pp. 1834-1838, (1995)
[9] L. R. Tuckey and R. C. MacCallum. Exploratory Factor Analysis, (1997)
[10] A. L. Comrey and H. B. Lee. A First Course in Factor Analysis
A Neural Network Approach to Color Image Classification Masayuki Shinmoto1, Yasue Mitsukura2, Minoru Fukumi2 and Norio Akamatsu2 1 Graduate
School of Engineering University of Tokushima 2-1 Minami-josanjima-cho, Tokushima, 770-8506, Japan [email protected] 2 Faculty of Engineering, The University of Tokushima 2-1 Minami-josanjima-cho, Tokushima, 770-8506, Japan {mitsu,fukumi,akamatsu}@is.tokushima-u.ac.jp
This paper presents a method for image classification by neural networks using characteristic data extracted from images. In order to extract the characteristic data, image pixels are clustered in the YCrCb 3-dimensional color space and processed by labeling to select domains. The information extracted from the domains forms the characteristic data (color information, position information, and area information) of the image. Further characteristic data extracted by the Wavelet transform are added, and a comparative experiment is conducted. Finally, the validity of this technique is verified by means of computer simulations.
1
Introduction
In recent years, the Internet has spread rapidly and is widely used by various people. A great deal of information from different sources has accumulated on the Internet, and there is an enormous quantity of images on the World Wide Web. However, those images are stored on various Web sites, and there are many kinds of image types, so it is difficult and time-consuming to search for an image that a person requires [1]. A search system for image files is therefore needed. Image search systems are already in practical use: Google (http://www.google.co.jp/) offers an image search system, but it searches Web pages and filenames that include a target word and retrieves the image files found from them. The search results depend on textual keyword information, so such a method cannot find target images if the filename or caption does not include the keywords. In general, an image carries plural pieces of information: when we see a certain picture, some people feel that it includes a mountain, while others feel that it includes the sky. Since filenames and captions are given by someone, they reflect his or her subjectivity. Therefore, a text-based image search technique does not work effectively in many cases. Image classification into categories is important as preprocess-
ing for keyword extraction, because keywords can be extracted easily from a group of similar images. In this study, images are therefore divided into categories for keyword extraction: characteristic data are extracted directly from an image, and classification into rough categories is carried out by neural networks using these data. The images used in this paper are scenery images (Fig. 1), person images (Fig. 2), building images (Fig. 3), illustration images (Fig. 4), and flower images (Fig. 5).
2
Characteristic Data
2.1
YCrCb Transforms
First, a target image expressed in the RGB color system is transformed into the YCrCb color system. Y represents intensity; Cr and Cb represent color differences. Fig. 6 shows the result of the YCrCb transform applied to the sample image in Fig. 1.
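The paper does not give the conversion coefficients; the sketch below uses the common ITU-R BT.601 full-range YCrCb definition as an assumption.

import numpy as np

def rgb_to_ycrcb(rgb: np.ndarray) -> np.ndarray:
    """Convert an H x W x 3 RGB image (0-255) to Y, Cr, Cb planes (BT.601)."""
    r = rgb[..., 0].astype(float)
    g = rgb[..., 1].astype(float)
    b = rgb[..., 2].astype(float)
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cr = 0.5 * r - 0.4187 * g - 0.0813 * b + 128.0
    cb = -0.1687 * r - 0.3313 * g + 0.5 * b + 128.0
    return np.stack([y, cr, cb], axis=-1)

# Example: a single orange pixel.
print(rgb_to_ycrcb(np.array([[[255, 128, 0]]], dtype=np.uint8)))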
Fig. 1. A sample of scenery images
Fig. 2. A sample of person images
Fig. 3. A sample of building images
Fig. 4. A sample of illustration images
Fig. 5. A sample of flower images
Fig. 6. Result of the YCrCb transform (parameters Y, Cr, and Cb)
Fig. 7. A result obtained using the K-means method
2.2
K-means Method
To quantize the color information, the pixels distributed in the YCrCb 3-dimensional space are classified into several clusters. This paper uses the K-means method for clustering [3][4][5]. Fig. 7 shows a result obtained by the K-means method for the image in Fig. 1.
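A minimal K-means quantization of the pixels in YCrCb space; the number of clusters is not stated in the excerpt, so k is a free parameter here, and the implementation is a generic K-means loop rather than the authors' code.

import numpy as np

def kmeans_quantize(ycrcb: np.ndarray, k: int = 5, iters: int = 20) -> np.ndarray:
    """Cluster pixels in YCrCb space and return per-pixel cluster labels (H x W)."""
    pixels = ycrcb.reshape(-1, 3).astype(float)
    rng = np.random.default_rng(0)
    centers = pixels[rng.choice(len(pixels), size=k, replace=False)]
    for _ in range(iters):
        # Assign each pixel to its nearest cluster center.
        d = np.linalg.norm(pixels[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute centers (keep the old one if a cluster becomes empty).
        for j in range(k):
            if np.any(labels == j):
                centers[j] = pixels[labels == j].mean(axis=0)
    return labels.reshape(ycrcb.shape[:2])

# Example: quantize a random 32 x 32 "image" into 5 color clusters.
labels = kmeans_quantize(np.random.default_rng(1).random((32, 32, 3)) * 255, k=5)
print(np.bincount(labels.ravel()))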
Labeling
The quantized image is processed by labeling to divide the pixels into domains with the same value. In this paper, the three domains with the largest areas among those obtained by labeling are selected (Fig. 8), and characteristic data are extracted from them. The number of characteristic data used in the experiments is fifteen: the average values of R, G, B, Y, Cr, and Cb in a domain, the standard deviations of R, G, B, Y, Cr, and Cb in a domain, the center of gravity (x and y coordinates), and the area of the domain.
Fig. 8. A result of labeling: the largest, second largest, and third largest areas
Fig. 9. Results obtained by Wavelet transform
Fig. 10. Distribution of frequency
2.4
Wavelet Transforms
The image is also processed by the Wavelet transform to detect edges (Fig. 9). In signal-processing terms, the Wavelet transform is equivalent to applying a low-pass filter and a high-pass filter: the image is decomposed into a low-frequency component and high-frequency components (Fig. 10). The low-frequency band (LL) corresponds to a coarse version of the image, while the high-frequency bands (LH, HL, HH) correspond to edges. The edge characteristic data extracted from the high-frequency bands, in the regions corresponding to the selected domains, are the average and the standard deviation of the coefficients.
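A minimal sketch of a one-level 2-D wavelet decomposition and the resulting edge features is given below. A Haar filter pair is assumed here for brevity; the paper does not state which wavelet it uses.

import numpy as np

def haar_dwt2(gray):
    """One level of a 2-D Haar wavelet transform of a grayscale image with even
    dimensions. Returns the coarse LL band and the detail bands LH, HL, HH."""
    g = gray.astype(np.float64)
    lo = (g[:, 0::2] + g[:, 1::2]) / 2.0   # row-wise low-pass
    hi = (g[:, 0::2] - g[:, 1::2]) / 2.0   # row-wise high-pass
    ll = (lo[0::2, :] + lo[1::2, :]) / 2.0
    lh = (lo[0::2, :] - lo[1::2, :]) / 2.0
    hl = (hi[0::2, :] + hi[1::2, :]) / 2.0
    hh = (hi[0::2, :] - hi[1::2, :]) / 2.0
    return ll, lh, hl, hh

def edge_features(lh, hl, hh, mask_small):
    """Mean and standard deviation of the high-frequency coefficients inside a
    domain mask downsampled to the sub-band resolution."""
    detail = np.abs(np.stack([lh, hl, hh]))[:, mask_small]
    return detail.mean(), detail.std()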
3
Experiment
3.1
Experimental Conditions
The classification experiments are performed using the characteristic data. A neural network, as shown in Fig. 11, is used in the experiments; the learning algorithm is the back-propagation method [7], and the characteristic data are fed to the input units. Seventeen characteristic data values are extracted from each domain: the averages of R, G, B, Y, Cr and Cb, the standard deviations of R, G, B, Y, Cr and Cb, the x and y coordinates, the area of the domain, and the average and standard deviation of the high-frequency coefficients. Five experiments are carried out, each with a different set of categories:
1. Scenery, Person, Building, Illustration, Flower
2. Scenery, Building, Illustration, Flower
3. Illustration, Others (Scenery, Building, Flower)
4. Scenery, Building, Flower
5. (Scenery and Flower), Building
The number of output units corresponds to the number of categories in each experimental condition.
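For illustration only, a back-propagation classifier of this kind could be set up as in the sketch below. The concatenation of the per-domain features into 51 inputs, the hidden-layer size and the training parameters are assumptions; they are not taken from the paper.

from sklearn.neural_network import MLPClassifier

# 17 features per domain x 3 domains = 51 inputs per image (an assumption about
# how the per-domain features are concatenated); one output per category.
clf = MLPClassifier(hidden_layer_sizes=(20,),   # hidden-layer size is illustrative
                    activation='logistic',
                    solver='sgd',               # plain gradient-descent training
                    learning_rate_init=0.1,
                    max_iter=2000,
                    random_state=0)
# X_train: (n_images, 51) feature matrix; y_train: category labels such as
# 'scenery', 'person', 'building', 'illustration', 'flower'.
# clf.fit(X_train, y_train)
# predictions = clf.predict(X_test)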
Fig. 11. A neural network used in computer simulations
3.2
Experimental Result
Table 1 shows the result of experiment 1. The total accuracy is more than 70%. In particular, the accuracy for illustration images is high; the reason is that the illustration images used in this study do not contain a wide range of colors, so the standard deviation of the color information is an effective feature in this case. The accuracies for person images and flower images, however, are below 60%. Since person images can instead be handled by skin-color detection, experiment 2 evaluates the four-category case (Table 2). In Table 2, each category's accuracy and the total accuracy are improved, but the accuracy for flower images is still lower than the others, because some flower images have characteristic data similar to that extracted from scenery images. Experiment 3 classifies the images into two categories (Table 3); the accuracy for the Others category is nearly 100%. The illustration images that are misclassified in experiments 1 to 3 resemble one another, so these errors come from a kind of image that differs completely from the other illustration images; Table 3 also shows that illustration images can be classified accurately. Tables 4 and 5 show the results of the experiments that do not use illustration images; the accuracy is improved in both tables, so this approach is effective for classification.
Table 1. Result of experiment 1
Table 2. Result of experiment 2
Table 3. Result of experiment 3
Table 4. Result of experiment 4
Table 5. Result of experiment 5
Scenery and Flower: 91.45%
Building: 84.20%
Total: 89.03%
4
Conclusion
In this paper, characteristic data is extracted from images and fed to neural networks in order to classify the images. In the classification experiments, an accuracy of eighty percent or more is achieved. In future work, the number of images used for classification will be increased and the change in classification rate will be examined; in addition, techniques that extract characteristic data from small domains will be studied. This research was partially supported by the Telecommunications Advancement Foundation and a Grant-in-Aid for Scientific Research (C) (No. 1368044) from MEXT.
References
[1] James Z. Wang, Integrated Region-Based Image Retrieval, Kluwer Academic Publishers, 2001
[2] Mikio Takagi, Picture Analysis Handbook, University of Tokyo Press, 1991 (in Japanese)
[3] Sadaaki Miyamoto, A Guide to Cluster Analysis, Morikita Publishing, 1999 (in Japanese)
[4] Takeshi Agui and Tomoharu Nagao, Processing and Recognition of a Picture, Syokodo, 1992 (in Japanese)
[5] Hisoshi Ozaki and Keiji Taniguchi, Image Processing, Kyoritsu Publishing Company, 1988 (in Japanese)
[6] Hideyuki Tanaka and Japan Industry Engineering Center, A Guide to Computer Image Processing, Soken Publishing Company, 1985 (in Japanese)
[7] J. Dayhoff, Neural Network Architecture, Van Nostrand Reinhold, 1990
Recognition of EMG Signal Patterns by Neural Networks Yuji Matsumura1, Yasue Mitsukura2, Minoru Fukumi2 Norio Akamatsu2, and Fumiaki Takeda3 1
Graduate School of Engineering University of Tokushima Tokushima, 770-8506, Japan [email protected] 2 Faculty of Engineering University of Tokushima Tokushima, 770-8506, Japan {mitsu,fukumi,akamatsu}@is.tokushima-u.ac.jp 3 Graduate School of Engineering Kochi University of Technology Kochi, 782-8502, Japan [email protected]
Abstract. This paper attempts to recognize EMG signals by using neural networks. Dry electrodes are attached to the wrist and the EMG is measured. The EMG signals are classified into seven categories (neutral, up, down, bend to right, bend to left, twist to inside, twist to outside) by a neural network that learns FFT spectra. Moreover, we structure the network to improve its performance. Computer simulations show that our approach is effective for classifying the EMG signals.
1
Introduction
Conventionally, pointing devices such as the mouse, the data glove, the data suit and motion capture have been used to measure quantities such as position, direction, and force. However, these devices are not comfortable or easy to use because of their form and weight: a mouse is relatively large, and such tools can be too heavy to carry. Recently, information terminals such as mobile phones have come into wide use, and radio-communication standards such as Bluetooth have been established. As a result, it has become possible to combine devices and realize various interfaces; for example, a mobile phone can be switched to silent mode or a music player's volume easily adjusted through such a link. Nevertheless, a device that can perform the various operations of portable equipment and control networks (a "total operation device" for short) has not yet been provided. Therefore,
we investigate the ElectroMyoGram (EMG) [1], a signal generated by the living body during voluntary movement. A wristwatch-type device is preferable from the viewpoint of operability. To construct a recognition system for EMG patterns, we therefore analyze EMG values measured at the wrist and perform recognition experiments using the Fast Fourier Transform (FFT) [2] and a neural network (NN) [3].
2
EMG Recognition System
A flowchart of the EMG pattern recognition system proposed in this paper is shown in Fig. 1. The proposed system is composed of an input part, a signal processing part, and a learning-evaluation part. First, time-series EMG data is measured by electrodes in the input part. Second, this data is amplified and A/D conversion is performed in the signal processing part. Next, the amplified data is converted to Fourier power spectra. Finally, the various data are evaluated in the learning-evaluation part. Fig. 2 shows the outside of the proposed system.
2.1
Input Part
There are two kinds of electrodes for measuring EMG: insertion-type and surface-type. An insertion-type electrode can measure the EMG of a specific muscle because it is inserted directly into the muscle; however, it is unpleasant for the subject and carries a risk of infection. A surface-type electrode, by contrast, is comfortable for the subject and involves no risk of infection or pain, although it is harder to separate individual actions and motor control because the measured EMG is a mixture of potentials from many muscle fibers. In this paper we use surface-type electrodes as the input device, because we want to recognize many behaviors from various muscles. There are two types of surface electrodes, wet and dry; from the viewpoint of ease of use, we adopt the dry type. Fig. 3 illustrates the device for measuring EMG at the wrist. We measure 7 patterns of wrist behavior using this device: neutral, up, down, bend to right, bend to left, twist to inside, and twist to outside. Fig. 4 shows these patterns of wrist movement.
Fig. 1. A flowchart of the proposed system
Fig. 2. The outside of the proposed system
Fig. 3. The device of measuring
Fig. 4. The patterns of the wrist: (a) Neutral, (b) Up, (c) Down, (d) Bend to right, (e) Bend to left, (f) Twist to inside, (g) Twist to outside
2.2
Signal Processing Part
We measure the EMG with the surface-type electrodes. It has been reported that the main frequency content of the EMG lies between a few Hz and 2 kHz [4],[5]. We extract the frequency components between 70 Hz and 2 kHz to avoid interference from the 60 Hz commercial power line; the amplifier thus reduces the influence of electromagnetic noise. The specification of the amplifier is shown in Table 1.
Table 1. The specification of the amplifier
Low Cut-off Frequency: 70 Hz (6th-order filter)
High Cut-off Frequency: 2 kHz (6th-order filter)
Gain: 2000 times (65 dB)
Output Voltage: ±5 V
Input Voltage: ±2.5 mV
Fig. 5. A method to yield FFT spectra
2.3
Data Transform Part
The EMG acquisition program yields 20,480 time-series samples per channel over 2,048 milliseconds. The learning data are obtained by performing a 1,024-point FFT on each channel, starting from the 10,000th data point and sliding the data window by 512 points at a time. Fig. 5 shows how the learning data are produced.
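A minimal sketch of this sliding-window FFT feature extraction is shown below; the use of a one-sided power spectrum is an assumption, as the paper does not state the exact spectral quantity fed to the network.

import numpy as np

def emg_fft_features(channel, start=10000, n_fft=1024, hop=512):
    """Sliding-window FFT power spectra from one EMG channel, following the
    description above: 1,024-point FFTs starting at the 10,000th sample, with
    the window advanced by 512 samples."""
    spectra = []
    pos = start
    while pos + n_fft <= len(channel):
        window = channel[pos:pos + n_fft]
        power = np.abs(np.fft.rfft(window)) ** 2   # one-sided power spectrum
        spectra.append(power)
        pos += hop
    return np.asarray(spectra)

# With 10 kHz sampling, the bins below 1 kHz (about the first 102 of the 513
# rfft bins) correspond to the frequency range later used as network inputs.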
2.4
Learning-Evaluation Part
The NN is trained with the FFT data and evaluated in the learning-evaluation part. In this paper, three methods are used for learning and evaluation: the Back-Propagation (BP) network, the Learning Vector Quantization (LVQ) network, and the BP&LVQ network.
(1) The BP network. The BP network is a three-layered feed-forward network composed of an input layer, a middle layer, and an output layer. Because the error between the output and the teacher signal is propagated back through the weights w in the order output layer, middle layer, input layer, it is called the Back-Propagation method.
(2) The LVQ network. The LVQ network is obtained by adapting Kohonen's Self-Organizing Map (SOM) [6] to supervised learning with teacher data. Vector quantization performs data compression by approximating the data with representative vectors. The LVQ algorithm is summarized as follows:
STEP 1: Random values are set as the initial weights.
STEP 2: An input vector X = (x_1, x_2, ..., x_n) is presented to the input layer.
STEP 3: The distance between the weights of the j-th neuron and the input vector is calculated as follows:
d_j = \sum_{i=1}^{N} (x_i - w_{ji})^2    (1)
STEP 4: The weight vector w_{ji} with the smallest d_j is selected and is updated according to the following equation:
w_{ji}(k+1) = w_{ji}(k) + \alpha (x_i - w_{ji}(k))  if the recognition result is correct,
w_{ji}(k+1) = w_{ji}(k) - \alpha (x_i - w_{ji}(k))  if the recognition result is false,    (2)
where \alpha denotes the learning rate and k is a time index.
Through this operation, the Bayes discrimination border, which is theoretically optimal for pattern classification, is formed; the method is therefore useful for many pattern recognition problems [7]. The LVQ network basically has three variants, called LVQ1, LVQ2 and LVQ3; in this paper LVQ1 and LVQ3 are used.
(3) The BP&LVQ network. BP has been applied to many pattern recognition problems but requires a long training time, whereas LVQ trains quickly but has been applied to fewer recognition problems. We therefore propose BP&LVQ, which combines the strong points of both networks by structuring the network. The BP&LVQ algorithm is as follows (Fig. 6 shows the construction of the BP&LVQ network):
STEP 1: Neutral and twist-to-outside are regarded as the same pattern because of their resemblance.
STEP 2: 6-pattern learning (up, down, left, right, inside, neutral & outside) is performed using BP.
STEP 3: Concurrently, 2-pattern learning (neutral, outside) is performed using LVQ.
Fig. 6. BP&LVQ network
STEP 4: Once both networks have been trained, 6-pattern recognition (the major division) is performed by the BP network.
STEP 5: Finally, for the data that the BP network assigns to the neutral & outside class, 2-pattern recognition (the minor division) is performed by the LVQ network.
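A hedged sketch of the LVQ1 update of equations (1)-(2) and of the two-stage BP&LVQ decision is given below. The interface of the BP network (a predict() method returning one of the six major-class names) is an assumption for illustration only.

import numpy as np

def lvq1_train(X, y, codebook, codebook_labels, alpha=0.05, epochs=1000):
    """LVQ1 training: move the nearest codebook vector toward the sample if its
    label matches, away otherwise (equations (1)-(2))."""
    codebook = codebook.astype(float)
    for _ in range(epochs):
        for x, label in zip(X, y):
            d = ((codebook - x) ** 2).sum(axis=1)   # squared distances d_j
            j = int(d.argmin())                     # winning prototype
            sign = 1.0 if codebook_labels[j] == label else -1.0
            codebook[j] += sign * alpha * (x - codebook[j])
    return codebook

def bp_lvq_predict(x, bp_net, lvq_codebook, lvq_labels):
    """Two-stage BP&LVQ decision: BP performs the major 6-class division; if it
    answers 'neutral & outside', LVQ makes the minor neutral-vs-outside split."""
    major = bp_net.predict([x])[0]        # assumed interface, not from the paper
    if major != 'neutral_outside':
        return major
    d = ((lvq_codebook - x) ** 2).sum(axis=1)
    return lvq_labels[int(d.argmin())]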
3
Recognition Experiments
In this paper, we performed experiments with three subjects (A: male, age 30; B: male, age 22; C: male, age 21). Each subject holds his right arm bent at 90 degrees and grasps his hand lightly; in this posture the wrist behaviors are neutral, up, down, bend to right, bend to left, twist to inside, and twist to outside. While the subjects hold this posture, we measure the EMG. The 7 patterns of wrist behavior are recognized by the NN. We regard one repetition of the 7 patterns as one set; 10 of the 30 recorded sets are used as learning data and the remaining 20 sets as evaluation data. We use the frequency components below 1 kHz as the input units of the NN, because we analyzed the frequency components with the BPWF network in advance [8].
3.1
Experimental Conditions
Table 2 shows the experimental conditions of the BP network, Table 3 those of the LVQ network, and Table 4 those of the BP&LVQ network. In the tables, CL means Competition Layer, IL means Input Layer, ML means Middle Layer, and OL means Output Layer.
Table 2. Experimental conditions of BP
No. of units on IL: 404
No. of units on ML: 20
No. of units on OL: 7
No. of Learning Data: 70
No. of Evaluation Data: 140
Maximum Learning No.: 20,000
Table 3. Experimental conditions of LVQ
No. of units on IL: 404
No. of units on CL: 49 (7×7)
Category of Class: 7
No. of Learning Data: 70
No. of Evaluation Data: 140
Learning No. of the LVQ1: 1,000
Learning No. of the LVQ3: 1,000
Table 4. Experimental conditions of BP&LVQ
BP Part (Major Classification)
No. of units on IL: 404
No. of units on ML: 20
No. of units on OL: 6
No. of Learning Data: 70
No. of Evaluation Data: 140
Maximum Learning No.: 20,000
LVQ Part (Minor Classification)
No. of units on CL: 4 (2×2)
Category of Class: 2
No. of Learning Data: 20
No. of Evaluation Data: Variable
Learning No. of the LVQ1: 1,000
Learning No. of the LVQ3: 1,000
3.2
Experimental Results
Table 5 shows the experimental results. According to Table 5, the recognition rate of BP&LVQ, which combines both methods, is higher than that of BP or LVQ alone for subjects A and C. We therefore consider that structuring the network is one way of improving recognition accuracy. For subject B, however, the BP&LVQ result and the BP result are almost the same; we believe this is because noise may have been mixed into part of subject B's learning and evaluation data, making the data ambiguous.
Table 5. Experimental results, recognition accuracy (%)
Pattern  | Subject A: BP   LVQ   BP&LVQ | Subject B: BP   LVQ   BP&LVQ | Subject C: BP   LVQ   BP&LVQ
Up       |            77.7 90.0  90.0   |            67.3 65.0  79.7   |            56.7 93.3  74.7
Right    |            81.7 75.0  83.3   |            82.7 61.7  86.0   |            56.7 56.7  62.3
Down     |            84.0 76.7  85.7   |            97.0 83.3  95.3   |            87.0 55.0  88.7
Left     |            94.3 81.7  97.3   |            85.3 98.3  76.3   |            86.0 85.0  81.0
Neutral  |            75.0 96.7  95.0   |            82.3 83.3  63.7   |            63.3 68.3  96.7
Outside  |            73.0 78.3  70.3   |            73.0 70.0  78.7   |            74.7 41.7  77.0
Inside   |            74.7 28.3  60.0   |            50.3 30.0  68.0   |            85.0 26.7  84.0
Total    |            80.1 74.7  83.1   |            76.9 70.2  78.2   |            72.9 61.0  80.6
4
Conclusions
In this paper, we have examined EMG signals, which are generated in the living body, for the recognition of wrist behavior, and we have demonstrated recognition systems based on neural networks. According to the experimental results, the EMG recognition system using the BP&LVQ network, which combines the strong points of the BP and LVQ networks, gives the best recognition accuracy. In the next step we intend to adopt Independent Component Analysis (ICA) and a fuzzy network to remove noise, and to increase the number of subjects in order to build a more robust system. This research was partially supported by the Tateishi Foundation and a Grant-in-Aid for Scientific Research (B) (No. 15300073) from MEXT.
References
[1] K. Hashimoto, T. Endo: "How to Inspect Vital Function", Japan Publishing Service, 1983 (in Japanese)
[2] J. W. Cooley, J. W. Tukey: "An Algorithm for the Machine Calculation of Complex Fourier Series", 19, pp. 297-301, 1965
[3] J. Hertz, A. Krogh, R. G. Palmer: "Introduction to the Theory of Neural Computation", Addison-Wesley, The Advanced Book Program, Redwood City, 1991
[4] H. G. Vaughan, Jr., L. D. Costa, L. Gilden and H. Schimmel: "Identification of Sensory and Motor Components of Cerebral Activity in Simple Reaction Time Tasks", Proc. 73rd Conf. Amer. Psychol. Ass., 1, pp. 179-180, 1965
[5] C. J. De Luca: "Physiology and Mathematics of Myoelectric Signals", IEEE Trans. Biomed. Eng., 26, pp. 313-325, 1979
[6] T. Kohonen: "Self-Organization and Associative Memory", Springer-Verlag, Berlin Heidelberg New York Tokyo, 1984
[7] T. Kohonen: "Self-Organizing Maps", Springer-Verlag, Berlin Heidelberg New York Tokyo, 1995
[8] Y. Matsumura, M. Fukumi and N. Akamatsu: "Recognition of EMG Signal Patterns by Neural Network", ICONIP'02 Workshop Proceedings, 2002
True Smile Recognition Using Neural Networks and Simple PCA Miyoko Nakano1 , Yasue Mitsukura2 , Minoru Fukumi2 , Norio Akamatsu2 , and Fumiko Yasukata1 1
Faculty of Nursing Fukuoka Prefectural University 4395,Ida,Tagawa, Fukuoka 825-8585 Japan {mnakano, yasukata}@fukuoka-pu.ac.jp 2 Department of Information Science and Intelligent University of Tokushima 2-1,Minami-Jyosanjima,Tokushima 770-8506 Japan {mitsu,fukumi,akamatsu}@is.tokushima-u.ac.jp
Abstract. Recently, the eigenface method based on principal component analysis (PCA) has become popular in the field of facial expression recognition. In this study, in order to achieve high-speed PCA, simple principal component analysis (SPCA) is applied to compress the dimensionality of the portions that constitute a face. Using neural networks (NN), the difference in the value of cos θ between true and false (plastic) smiles is identified and the true smile is discriminated. Finally, to show the effectiveness of the proposed face classification method for true or false smiles, computer simulations are carried out with real images.
1
Introduction
In previous studies, many researchers have proposed methods for the recognition of facial expressions [1]-[4]. Most of them, however, classify basic facial expressions (e.g. anger, sadness, happiness and so on) that are made intentionally. This study instead focuses on the true smiling face, with application to man-machine interface systems in mind. The eigenface method is popular in this research field [1],[3],[4]. In PCA, however, it is not easy to compute the eigenvectors of a large matrix when the cost of calculation for time-varying processing is considered; the SPCA method has therefore been proposed to make the PCA calculation faster [5]. Neural networks (NN) are massively parallel systems that are particularly good at pattern recognition problems [6]. In this study, NN is therefore used to classify true and false smiles, because it avoids the difficulty of determining threshold values in many dimensions. First, dimensionality compression by SPCA is carried out for the three portions that constitute a face: the right eye, the left eye, and the mouth and nose. Second, a value of cos θ is calculated from the eigenvector and the gray-scale image vector of each image pattern. By using NN, cos θ between true and false
smiles is verified and true smile is discriminated. Finally, simulation results show that this approach is effective for true smile discrimination.
2
Simple Principal Component Analysis (SPCA)
In this section, the SPCA algorithm [5] is applied to a set of vectors V = {v_1, v_2, ..., v_m}. First, in order to move the center of gravity of this vector set to the origin, the average vector v̂ is subtracted from all vectors. The new vector set X = {x_1, x_2, ..., x_m} is used as the input vectors, computed as follows:
x_i = v_i - v̂,   v̂ = (1/m) \sum_{j=1}^{m} v_j    (1)
Then the column vector a_n is defined as the connection weights between the inputs x_j and the output y_n. The n-th weight vector a_n is used to approximate the n-th eigenvector α_n. The output is given by:
y_n = a_n^T x_j    (2)
This output is positive if an input vector x_j lies in the same half-plane as a_n and negative if it lies on the opposite side, as shown in Fig. 1. Next, a threshold function, equation (3), is applied to the output of equation (2):
Φ(y_n, x_j) = x_j if y_n ≥ 0, and 0 otherwise.    (3)
Several threshold functions of this kind are possible; this study uses equation (3). Using this function, repetitive calculation is carried out starting from an arbitrary vector a_n given as an initial value, so that the vector a_n approaches the direction of α_n.
Fig. 1. The simple PCA algorithm
Initially a_n can be set to an arbitrary vector. The sum of Φ(y_n, x_j) over all samples x_j is calculated, where y_n is computed with respect to the old estimate a_n^k. This sum is then normalized to give the new estimate:
a_n^{k+1} = \sum_j Φ(y_n, x_j) / || \sum_j Φ(y_n, x_j) ||,   y_n = (a_n^k)^T x_j    (4)
where k denotes the repetition count. This calculation moves a_n toward the center of the distribution, so a_n^k converges to the direction of the n-th eigenvector α_n as equation (4) is repeated; the n-th principal component is obtained by this operation. Next, to find further principal components, the effect of the n-th principal component is removed from the data set. Based on Schmidt orthogonalization, this is done by subtracting from each vector x_i the component parallel to the principal component, leaving only the orthogonal component:
x_i' = x_i - (a_n^T x_i) a_n    (5)
By repeating the same operation, the principal components are obtained in order of decreasing contribution rate. The SPCA algorithm is illustrated geometrically in Fig. 1.
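A minimal sketch of the SPCA iteration described by equations (1)-(5), together with the cos θ feature of equation (6), is given below. The iteration count and random initialization are illustrative assumptions.

import numpy as np

def simple_pca(V, n_components, n_iter=50, seed=0):
    """Simple PCA (SPCA): subtract the mean, repeatedly average the samples on
    the positive side of the current direction and renormalise, then deflate
    and repeat for the next component (equations (1)-(5))."""
    rng = np.random.default_rng(seed)
    X = V - V.mean(axis=0)                    # equation (1)
    components = []
    for _ in range(n_components):
        a = rng.normal(size=X.shape[1])
        a /= np.linalg.norm(a)
        for _ in range(n_iter):
            y = X @ a                         # outputs y_n = a^T x_j, eq. (2)
            s = X[y >= 0].sum(axis=0)         # threshold function, eq. (3)
            norm = np.linalg.norm(s)
            if norm == 0:
                break
            a = s / norm                      # update, eq. (4)
        components.append(a)
        X = X - np.outer(X @ a, a)            # deflation, eq. (5)
    return np.array(components)

# cos(theta) feature for a portion image vector x and component a_n, as in eq. (6):
# cos_theta = (a_n @ x) / (np.linalg.norm(a_n) * np.linalg.norm(x))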
3
Image Data
The photography conditions are fixed: the same lighting, subjects, and camera position. As shown in Fig. 2, image data for four expressions, namely natural smile (smile), false (plastic) smile (false), neutral expressionless face (neutral) and one other expression (comparison), are first taken from 16 subjects. The fourth expression, with the eyes closed and the mouth slightly open, is included in order to take the influence of blinking on the system into account. In Fig. 2, the false smiling face is a forced smile: the image of a smile acquired when the subject does not want to laugh. Next, in order to discriminate other expressions from smile and false, six expressions ("anger", "disgust", "fear", "happiness", "sadness", "surprise") are taken from 19 subjects. Recognition of the true smiling face is performed using these face images. It is said that much of the information in a face lies in the regions of the eyes, nose, and mouth, even though they occupy only a small part of the face; these areas are therefore extracted from the face images as data (Fig. 3).
4
Computer Simulations
In order to show the effectiveness of the proposed method, it was applied to true smile classification using real images. When only a small amount of data is available for computer
Fig. 2. The samples of data
simulations, the leave-one-out method of cross-validation is often used to test with sufficient accuracy. With N subjects, the values of cos θ from (N - 1) subjects' images are used as training data and the remaining subject's images are used as test data; the test subject is changed one by one and N simulations are performed.
Fig. 3. The procedure of extracting each portion
Table 1. The classification result obtained by cross validation
facial expressions: true smile 87.5%, false smile 93.8%; average 90.6%
4.1
Calculation of the Value of cos θ
The value of cos θ is calculated from an eigenvector and the gray-scale image vector of each picture pattern; it is the cosine of the angle between the two vectors:
cos θ = (a_n, x_j) / (||a_n|| · ||x_j||)    (6)
where a_n is the eigenvector and x_j is the gray-scale image vector of each picture pattern. The feature cos θ gives better recognition accuracy than the plain inner product.
4.2
Structure of NN
The NN model used for classification in this paper is a three-layered network, trained with the BP method. All eigenvectors used to produce the inputs have a cumulative contribution of 80.0% or more. The values of cos θ are fed into the input units, so the number of units in the input layer is 26 (for each eye image the 1st-7th eigenvectors are used, and for the mouth-and-nose image the 1st-12th eigenvectors are used). The hidden layer has ten units and the output layer has one unit; teacher signals of 1, 0 correspond to a true smile and 0, 1 to a false smile.
4.3
True Smile Recognition Using SPCA
The result is shown in Table 1. The average classification accuracy is 90.6%, which shows that smiles can be classified by the NN. In detail, the average classification rate for "true" is 87.5% and that for "false" is 93.8%.
4.4
Consideration of the "Blink"
On the other hand, when application to video processing is considered, it is important to discriminate the true smiling face from other facial expressions such as a blink, in which the eyes are closed and the mouth is slightly open; the smiling faces shown in Fig. 2 resemble a blink. The next simulations are therefore performed to examine the influence of blinking. The values of cos θ from 18 subjects' images (smile and comparison) are used as training data and the remaining
Table 2. The classification result in consideration of the blink
facial expressions: true smile 89.5%, comparison 97.4%; average 92.1%
subject’s images are used for test data. Test data was changed one by one and 18 simulations were performed. This result is shown in Table 2. The rate of an average classification is 92.1%, which shows the classification of smiles can be done by using NN. In the details, the average rate of accurate classification of ”true“is 89.5%, and the average rate of accurate classification of ”comparison“ is 97.4%. From these results, it is rare to recognize expression which closes both eyes and opened a mouth lightly to be a true smile. Moreover, a false smile is similar to a true smile than a ”blink“. Therefore, it can decide that the truth classification of a smiling face is difficult. 4.5
Classification with the Various Expressions
Human beings show a large variety of expressions. In this experiment, therefore, classification of a natural smile against other expressions is performed, again using the NN. When 1, 0 is produced in the output layer, the NN recognizes a smile; when 0, 1 is produced, a false smile; and when 0, 0 is produced, another expression. This simulation is done in the same way as in Section 4.3, using the data of 19 subjects. The expressions treated as "others" are the 7 expressions "anger", "disgust", "fear", "happiness", "sadness", "surprise", and "neutral expression". The classification result is shown in Table 3. The average classification rate was 87.1%. It was difficult to distinguish expressions such as "disgust" and "fear" from the other expressions.
5
Conclusion
An accurate recognition method for true smiles is proposed in this study, and its validity is verified by computer simulations. In this method, SPCA, which is effective for information compression, is used for dimensionality compression of the portions that constitute a face, and the eigenvectors produced by SPCA are used as feature vectors. NN is then used in order to avoid the difficulty of determining threshold values in many dimensions. By using NN, the value
Table 3. The classification result of smile
facial expressions: true smile 94.7%, false smile 86.8%, others 80.4%; average 87.1%
of cos θ between true and false smiles is identified and used to recognize the true smile by feeding it to the NN as input data. From the results, the overall classification rate using NN was 90.0%; the experiment taking the blink into consideration gave 92.1%, and 87.1% when the other expressions were included. From these results, we conclude that the classification accuracy of the proposed method is better than that of conventional methods.
References
[1] S. Akamatsu: "Computer Recognition of Human Face", IEICE Trans., Vol. D-II-J80, No. 8, pp. 2031-2046, 1997 (in Japanese)
[2] H. Ohta, H. Saji, and H. Nakatani: "Recognition of Facial Expressions Using Muscle-Based Feature Models", IEICE Trans., Vol. D-II-J82, No. 7, pp. 1129-1139, 1999 (in Japanese)
[3] O. Delloye, M. Kaneko, and H. Harashima: "Operation of the Facial Feature Using Face Space", IEICE Trans., Vol. A-J80, No. 8, pp. 1332-1336, 1997 (in Japanese)
[4] K. Matuno, C. Lee, and S. Tsuji: "Recognition of Facial Expressions Using Potential Net and KL Expansion", IEICE Trans., Vol. D-II-J77, No. 8, pp. 1591-1600, 1994 (in Japanese)
[5] M. Partridge and R. Calvo: "Fast Dimensionality Reduction and Simple PCA", IDA, Vol. 2 (3), pp. 292-298, 1997
[6] S. Y. Kung: Digital Neural Networks, PTR Prentice-Hall, 1993
Estimation of Sea Ice Movement Using Locally Parallel Matching Model Kong Xiangwei, Chen Feishi, Liu Zhiyuan, and Zong Shaoxiang Dept. of Electronic Engineering, Dalian Univ. of Technology Dalian 116024, China [email protected] [email protected] [email protected]
Abstract. In order to estimate the velocity of sea ice movement, an effective method adapted to the characteristics of sea ice images is adopted: suitable candidate points are found using an interest operator, a locally parallel matching model is constructed to analyze the disparity between sequential images, and, after post-processing, the velocity is estimated with an orientation-choosing and maximum-probability decision method. The algorithm has shown good performance in practice for estimating the velocity of sea ice in the Bohai Sea.
1
Introduction
Sea ice is very harmful to marine transport and production, so the study of sea ice prediction and the corresponding countermeasures is important. Sea ice inspection is an important part of sea ice management, and one way to carry it out is from an icebreaker. The manual inspection of sea ice movement from an icebreaker, which is widely used at present, is not accurate enough and cannot operate continuously for long periods. In this paper we adopt a set of methods to estimate the velocity in real time by processing the digital video of the sea ice movement captured by a camera mounted on the icebreaker. To estimate the speed of the sea ice from the video, we find the most similar areas in a pair of frames and obtain the velocity from the difference of the areas' positions. Our research therefore focuses on a matching method that runs in real time, is stable, and suits sea ice images. Image matching approaches include template matching, object matching, dynamic pattern matching, and so on. Because matching on points of interest offers low computation, high efficiency and good precision, it is widely used; in this paper a candidate point matching method is adopted. Candidate point matching must solve two problems. The first is how to select the candidate points: not all points can be used, since
the brightness in some regions may not be distinctive, and some points that are visible in the current frame are invisible in the next. We must therefore take points that can be distinguished from their neighborhoods; an interest operator [2] is effective for choosing such candidate points, and we describe it in detail in Section 2. The second problem is how to confirm which matches are correct. Many studies adopt correlation matching or minimum-squared-error matching as the similarity measure: a region with distinctive features in the current frame is chosen as a template, and a full search is performed in the next frame to find the most similar region. This approach has two disadvantages: (1) because of noise and error, the correct match is sometimes not the one with maximum similarity, and (2) a full search requires heavy computation. One way to improve it is to increase the sensitivity of the similarity measure: Levine, O'Handley and Yagi [3] adopt an adaptive correlation window whose size depends on the variance of the region surrounding the point, and Mori, Kidode and Asada [4] use a Gaussian-weighted correlation window to reduce the errors due to distortion at the extremities. Several strategies have also been proposed for limiting the search area: Nevatia [5] uses a series of progressive views to constrain the disparity to small values, and another approach is to use a coarse search to locate the matching points approximately and then a fine search to locate them more accurately. However, all of the above methods assume that the camera is fixed; if the camera moves, the movement orientation and the shape of the characteristic window change, and eliminating this effect requires heavy additional computation. In this article, a locally parallel, globally serial model based on a relaxation labeling algorithm [1] is adopted to compute the disparity between images and obtain the velocity of the sea ice movement. It does not need precise information about the camera, and changes of the observed region and of the candidate points' direction of motion caused by camera rotation do not affect the result. Section 3 describes this model in detail. After obtaining the matched point pairs in consecutive frames, we estimate the velocity of the sea ice movement with an orientation-choosing and maximum-probability method, presented in Section 4. Section 5 gives the experimental results and conclusion.
2
Choosing Candidate Points
To inspect the sea ice movement, we mounted a camera on an icebreaker to obtain the video sequence. Sea ice video has the following characteristics: (1) the intensity within a view is smooth, and (2) the grey level of the images changes only slightly. In order to achieve a good matching result, we try to select points that are the centers of highly variable areas, with high variance in all four directions; this is achieved with an interest operator [2].
First, choose a window (here the window size is 5×5) and compute the difference values of all pixels along four directions according to equation (1).
(1)
Then the sums of the squares of the differences along the four directions over the selected area are obtained by equation (2):
S_x = \sum_{(i,j) \in SelectedArea} \nabla_x f(i,j)^2,   S_y = \sum_{(i,j) \in SelectedArea} \nabla_y f(i,j)^2,
S_{45} = \sum_{(i,j) \in SelectedArea} \nabla_{45} f(i,j)^2,   S_{135} = \sum_{(i,j) \in SelectedArea} \nabla_{135} f(i,j)^2    (2)
Among the four above direction measures, choose the minimum one as the initial value of interest operator of the center point.
I_0(i,j) = min{ S_x, S_y, S_{45}, S_{135} }    (3)
After computing the initial values of all pixels in one frame, the final value of the interest operator is obtained by the following rule:
I_n(i,j) = 1 if I_0(i,j) > I_0(m,n) for all (m,n) \in N_8(i,j) with (m,n) \neq (i,j), and I_n(i,j) = 0 otherwise.    (4)
Choosing Rule: only a point (i, j) whose final value I_n(i,j) equals one can be used as a candidate point for matching. In fact, the grey level of sea ice images changes only slightly, so a point can be a local maximum and still have a small initial value; using such points would lead to erroneous final results. To avoid this, we set a threshold on the initial value, chosen as the maximum initial value computed on a sea ice background image. Fig. 1(a) and (b) are two adjacent images; Fig. 1(c) and (d) show the final results with the candidate points displayed.
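A hedged sketch of the interest operator is given below. The exact directional difference of equation (1) is not recoverable from the text, so adjacent-pixel differences along the horizontal, vertical and two diagonal directions are assumed.

import numpy as np

def interest_operator(gray, win=5, threshold=0.0):
    """Moravec-style interest operator: minimum of four directional sums of
    squared differences over a window (eqs. (2)-(3)), followed by a local-maximum
    test over the 8-neighborhood and a threshold (eq. (4))."""
    g = gray.astype(np.float64)
    diffs = [
        (g[:, :-1] - g[:, 1:]) ** 2,      # horizontal (assumed form of eq. (1))
        (g[:-1, :] - g[1:, :]) ** 2,      # vertical
        (g[:-1, :-1] - g[1:, 1:]) ** 2,   # 45-degree diagonal
        (g[:-1, 1:] - g[1:, :-1]) ** 2,   # 135-degree diagonal
    ]
    h, w = g.shape
    r = win // 2
    I0 = np.zeros((h, w))
    for i in range(r, h - r):
        for j in range(r, w - r):
            sums = [d[i - r:i + r, j - r:j + r].sum() for d in diffs]
            I0[i, j] = min(sums)
    candidates = []
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            nb = I0[i - 1:i + 2, j - 1:j + 2].copy()
            nb[1, 1] = -np.inf
            if I0[i, j] > nb.max() and I0[i, j] > threshold:
                candidates.append((i, j))
    return candidates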
Fig. 1. Candidate points for two adjacent images: (a) Image 1, (b) Image 2, (c) Candidate points for image 1, (d) Candidate points for image 2
3
A Locally Parallel Model for Matching Candidate Points
After finding the two sets of candidate points, we construct a set of matches. Because some points may be occluded, shadowed, or otherwise not visible in the next image, an initial set of possible matches is constructed by pairing each candidate point from the current image with each candidate from the next image. We regard the set of matches as a collection of "nodes" {a_i}. Associated with each node a_i are its coordinates (x_i, y_i) in image 1 and a set of labels L_i representing the disparities that may be assigned to the point. Each label in L_i is either a disparity vector (l_x, l_y) with l_x \in [-r, r], l_y \in [-r, r] (r is the maximum detectable disparity) or a distinguished label l* denoting "undefined disparity". Each possible match is assigned an initial likelihood using (5), (6), (7) and (8):
w_i(l) = 1 / (1 + c \cdot s_i(l)),   l \neq l*    (5)
p_i^0(l*) = 1 - max_{l \neq l*} w_i(l)    (6)
p_i^0(l) = p_i(l | i) \cdot (1 - p_i^0(l*)),   l \neq l*    (7)
p_i(l | i) = w_i(l) / \sum_{l' \neq l*} w_i(l')    (8)
s_i(l) is the sum of the squares of the differences between a small window from image 1 centered on (x_i, y_i) and a window from image 2 centered on (x_i + l_x, y_i + l_y), and c is a constant. After constructing the initial set of possible matches, we iteratively refine these likelihoods using (9), (10) and (11):
q_i^k(l) = \sum_{j: a_j near a_i, j \neq i} [ \sum_{l': ||l - l'|| \leq \Theta} p_j^k(l') ],   l \neq l*    (9)
\hat{p}_i^{k+1}(l) = p_i^k(l) \cdot (A + B \cdot q_i^k(l)),  l \neq l*;   \hat{p}_i^{k+1}(l*) = p_i^k(l*)    (10)
p_i^{k+1}(l) = \hat{p}_i^{k+1}(l) / \sum_{l' \in L_i} \hat{p}_i^{k+1}(l')    (11)
(13)
the probability of the label l * is affected only by the normalization (11). If ^ k +l
∑ pi l ≠l *
^ k
(l ) < ∑ pi (l ) l ≠l *
(14)
Estimation of Sea Ice Movement Using Locally Parallel Matching Model
643
the probability that ai has label l * increases. Otherwise ^ k +1
∑p l ≠l *
i
^ k
(l ) ≥ ∑ pi (l )
(15)
l ≠l *
it decreases or remains the same. This iteration procedure is repeated a fixed number of times. Matching Rule: if P_i(l) \geq T, the pair of points is considered matched; if P_i(l) < T, the pair is discarded.
T is the threshold for estimating the likelihood. Some nodes may remain ambiguous, with several potential matches retaining nonzero probabilities.
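A compact sketch of one relaxation step following equations (9)-(13) is given below; the max-norm reading of ||l - l'|| and the data layout are assumptions made for illustration.

import numpy as np

def relax_iteration(p, coords, labels, A=0.6, B=3.0, theta=1.0, R=15):
    """One relaxation-labeling step. p has shape (n_nodes, n_labels + 1); the
    last column holds the 'undefined' label l*. labels is an (n_labels, 2)
    array of disparity vectors; coords holds (x, y) per node."""
    n, m = p.shape[0], labels.shape[0]
    q = np.zeros((n, m))
    for i in range(n):
        near = [j for j in range(n) if j != i
                and max(abs(coords[i][0] - coords[j][0]),
                        abs(coords[i][1] - coords[j][1])) <= R]      # eq. (13)
        for li in range(m):
            consistent = np.abs(labels - labels[li]).max(axis=1) <= theta  # eq. (12)
            q[i, li] = sum(p[j, :m][consistent].sum() for j in near)       # eq. (9)
    p_new = p.copy()
    p_new[:, :m] = p[:, :m] * (A + B * q)          # eq. (10); the l* column is unchanged
    p_new /= p_new.sum(axis=1, keepdims=True)      # normalisation, eq. (11)
    return p_new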
4
Estimation of the Velocity of the Sea Ice Movement
Because the grey level of sea ice images changes only slightly, invalid matches sometimes occur. We therefore use an orientation-choosing and maximum-probability decision method to estimate the velocity of the sea ice, as follows. After matching each pair of neighboring images, we obtain a set of matching pairs:
A_i : {(x_1^i, y_1^i), (x_2^i, y_2^i)}    (16)
where A_i is the i-th matching pair of the two neighboring images, (x_1^i, y_1^i) are the coordinates of the i-th matched point in the first image, and (x_2^i, y_2^i) are the coordinates of the i-th matched point in the next image. The velocity vector of each matching pair is computed as:
V_i = [(x_2^i - x_1^i), (y_2^i - y_1^i)]^T    (17)
Fig. 2. Matching results
According to the angle of V_i, each matching pair is allocated to one of eight orientation subsets (each subset represents one orientation):
Subset(j) = { A_m },   j = 1, 2, ..., 8,   A_m \subset A    (18)
Here, Subset(j) is one of the eight orientation subsets; the eight orientations are up, down, left, right, left-up, left-down, right-up, and right-down. The probability that the velocity orientation of the sea ice belongs to Subset(j) is defined as:
P(j) = Number(j) / \sum_{k=1}^{8} Number(k)    (19)
Here Number( j ) is the number of the elements in Subset ( j ) . Find the maximum probability,
P_max = max_j P(j),   j = 1, 2, ..., 8    (20)
Assume that the maximum probability corresponds to Subset(n); all the elements of Subset(n) are then used to estimate the velocity of the sea ice. The final velocity estimate V is:
V = ( \sum_{A_m \subset Subset(n)} V_m ) / Number(n)    (21)
where A_m ranges over all the matching pairs belonging to Subset(n) and V_m is the velocity vector of the matching pair A_m.
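The whole velocity-estimation step (equations (16)-(21)) can be sketched as follows; the exact angular boundaries of the eight subsets are an assumption, since the paper does not define them explicitly.

import numpy as np

def estimate_velocity(matches):
    """Orientation-choosing / maximum-probability velocity estimate.
    matches is a list of ((x1, y1), (x2, y2)) matched point pairs."""
    vectors = np.array([[x2 - x1, y2 - y1] for (x1, y1), (x2, y2) in matches],
                       dtype=float)                           # eq. (17)
    angles = np.arctan2(vectors[:, 1], vectors[:, 0])         # direction of each V_i
    # quantise directions into 8 sectors of 45 degrees each (eq. (18))
    subset = ((angles + np.pi / 8) // (np.pi / 4)).astype(int) % 8
    counts = np.bincount(subset, minlength=8)
    n = counts.argmax()                                       # eqs. (19)-(20)
    members = vectors[subset == n]
    return members.mean(axis=0)                               # eq. (21)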
5
Results and Conclusion
According to the characteristics of sea ice images, we adopt a locally parallel matching model to estimate the velocity of the sea ice. First, an interest operator is used to choose the candidate points; then a locally parallel model is used to match neighboring sea ice images; finally, the maximum-probability subset is selected to estimate the velocity. An important property of this matching algorithm is that it works for any mode of disparity and does not require precise information about camera orientation, position or distortion, and the maximum-probability-subset method eliminates the effect of some invalid matches. Using a camera mounted on the icebreaker and a general-purpose image capture board, we have applied the above system to estimate the velocity of sea ice in the Bohai Sea of north China. In practice, the parameters in the above expressions are selected as follows: in Section 3 the window size for matching is 5×5 and the constant c in
equation (5) equals 0.001; the constants A and B in equation (10) are 0.6 and 3; in equations (12) and (13), Θ = 1 and R = 15; the iteration procedure is repeated eight times; and in the matching rule the threshold is T = 0.5. These methods have proved effective and stable in practice for estimating the velocity of sea ice, and the precision of the velocity estimates meets the demands of the application.
References
[1] S. T. Barnard, W. B. Thompson, "Disparity analysis of images", IEEE Transactions on PAMI, Vol. PAMI-2, pp. 333-340, July 1980
[2] H. P. Moravec, "Towards automatic visual obstacle avoidance," in Proc. 5th Int. Joint Conf. Artificial Intell., Cambridge, MA, Aug. 1977, p. 584
[3] M. D. Levine, D. A. O'Handley, and G. M. Yagi, "Computer determination of depth maps," Comput. Graphics Image Processing, Vol. 2, pp. 131-150, 1973
[4] K. Mori, M. Kidode, and H. Asada, "An iterative prediction and correction method for automatic stereo-comparison," Comput. Graphics Image Processing, Vol. 2, pp. 393-401, 1973
[5] R. Nevatia, "Depth measurement by motion stereo," Comput. Graphics Image Processing, Vol. 5, pp. 203-214, 1976
[6] N. Evans, "Glacier surface motion computation from digital image sequences", to appear in IEEE Transactions on Geoscience and Remote Sensing, 2000
[7] Zhang Yujin, Image Engineering: Image Understanding and Computer Vision, Tsinghua Press, 2000
Evaluation of a Combined Wavelet and a Combined Principal Component Analysis Classification System for BCG Diagnostic Problem Xinsheng Yu1, Dejun Gong2, Siren Li2, and Yongping Xu2 Marine Geology College, Ocean University of China Qingdao 266003, P. R. China 2 Institute of Oceanology, Chinese Acadamy of Sciences Qingdao 266071, P. R. China 1
Abstract. Heart disease is one of the main causes of death in the developed countries. Over several decades, a variety of electronic and computer technologies have been developed to assist clinical practice in cardiac performance monitoring and heart disease diagnosis. Among these methods, Ballistocardiography (BCG) has the interesting feature that no electrodes need to be attached to the body during the measurement; it therefore has potential for assessing a patient's heart condition at home. In this paper, two neural-network-based BCG signal classification models are compared. One system uses principal component analysis (PCA) and the other a discrete wavelet transform to reduce the input dimensionality. The results indicate that the combined wavelet transform and neural network has more reliable performance than the combined PCA and neural network system. Moreover, the wavelet transform requires no prior knowledge of the statistical distribution of the data samples, and the computational complexity and training time are reduced.
1
Introduction
Traditionally, physicians in hospitals interpret the characteristics of measured records and calculate relevant parameters to determine whether or not the heart shows signs of cardiac disease. Recently, advances in computer and electronic technology have provided a basis for automatic cardiac performance monitoring and heart disease diagnosis, assisting clinical practice by saving diagnostic time. Together with artificial intelligence research, these technologies have also created entirely new applications in detecting the risk of heart disease [4]. However, the demands of practical health care require that these technologies not be limited to hospital environments, but be able to detect the vital signs of heart disease during daily life under unrestricted conditions. For example, it would be helpful for both doctors and patients if the preliminary heart condition could be monitored regularly at home before deciding whether it is necessary to visit
hospital for further assessment and treatment; time and transport expenses could thus be saved. These requirements have driven innovation in portable systems, reducing size, weight and power consumption to the point where such instruments are available for routine monitoring and diagnosis at home. BCG offers a potential way to monitor heart condition in the home because no electrodes have to be attached to the body during the measurements. In the past several years, the BCG has been used in a variety of clinical studies such as prognosis, monitoring, screening, physical conditioning, stress tests, evaluation of therapy, cardiovascular surgery and clinical conditions [1]-[4]. It would therefore be valuable if the BCG could be used as a non-invasive tool for predicting which subjects are at risk of heart attack, helping clinicians to establish early treatment. Although studies over the last few decades have improved BCG measurement technology, relatively little work has been done on improving BCG processing and analysis with modern signal processing methods, and there is little work on computer-assisted analysis systems capable of operating in real time, especially systems incorporating artificial intelligence methods for BCG pattern classification. Moreover, no implementations of decision-making algorithms in a portable system for real-world application have been reported in the literature [5]. In this paper, two BCG classification models are evaluated: one combines PCA with a three-layer neural network, and the other uses the wavelet transform to decompose the original signal into a series of views for the neural network classifier. The performance and computational efficiency of the two models are compared, and it is shown that the combined wavelet transform and back-propagation neural network performs more reliably than the combined PCA and neural network system.
2
BCG Signal Feature Selection
Classifying the whole BCG waveform can be computationally intensive for a neural network classifier. A high input dimensionality forces the use of a network with a large number of free parameters, which in turn requires a large training sample to estimate them, and the time involved in training such a network is considerable. Moreover, the presence of irrelevant feature combinations can obscure the impact of the more discriminating inputs. A smaller input means a smaller neural network structure, saving time and easing system implementation for real-world applications.
BCG Signal and Preprocessing
The BCG data set used in this study was provided by the medical Informatics Unit of the Cambridge University. The signal was sampled at 100 samples/second from the seat of the chair in a clinical trial. During the BCG recording, the reference ECG signals were recorded simultaneously from the arms of the chair at the same sampling rate. The R-wave of the ECG signal is used to identify each BCG cycle. The BCG
648
Xinsheng Yu et al.
signal are filtered and then averaged to reduce the background noise. Each averaged BCG signal was then normalized into a standard 80 points. Altogether, there are 58 normal subjects, 275 mild hypertension subjects and 6 subjects who died suddenly due to myocardial infarction within 24 months of the BCG recording. Figure 1 shows the standard length BCG signal of normal and hypertension subjects.
1
Fig. 1. BCG of normal (A) and hypertension (B)
2.2
Using PCA for Feature Extraction
Principal components are usually derived by numerically computing the eigenvector of the variance matrix. Those eigenvectors with largest eigenvalue (variance) are used as features onto which the data are projected [7],[8]. This method may be considered optimal, in the sense that mean square linear reconstruction error is minimised. For a given data set X = ( X1 , X 2 ,... X N ) with a zero mean, a simple approach to PCA is to compute the simple variance matrix
C = (1/N) \sum_{i=1}^{N} X_i X_i^T    (1)
The next step is to use any of the existing standard eigenvector analysing algorithms to find the m largest eigenvalues and their corresponding eigenvectors of the data covariance matrix:
C\varphi = \lambda\varphi    (2)
where λ is an eigenvalue and φ is its corresponding eigenvector. The m principal components of n-dimensional data are the m orthogonal directions in n-dimensional space that capture the greatest variation in the data; they are given by the eigenvectors φ. In this study, it is found that 10 principal components are able to retain the information required by the neural classifier.
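A minimal sketch of this feature extraction, assuming the 80-point BCG frames are stored one per row, is given below.

import numpy as np

def pca_features(X, n_components=10):
    """PCA feature extraction as in equations (1)-(2): covariance of the
    zero-mean BCG frames, eigen-decomposition, and projection onto the
    leading eigenvectors (10 components are used in this study)."""
    Xc = X - X.mean(axis=0)                 # zero-mean data, one 80-point BCG per row
    C = (Xc.T @ Xc) / Xc.shape[0]           # covariance matrix, eq. (1)
    eigvals, eigvecs = np.linalg.eigh(C)    # solves C phi = lambda phi, eq. (2)
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]          # m leading eigenvectors
    return Xc @ components                  # projected features for the classifier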
2.3
BCG Signal Analysis Using Discrete Wavelet Transform
In the BCG signal classification, most of the information needed for classification is in the BCG shape and related parameters, such as the wave amplitudes of H, I, and J,
the slopes from the baseline to H, H-J, and I-J, and the times of H-J and I-J as well as H-I and H-J as a percentage of the heart period [6]. Such features depend on the degree of resolution: at a coarse resolution, information about the large features of the complex waves is represented, while at a fine resolution, information about more localised features of individual waves appears. By discarding selected detail, the information required for classification can be obtained without much signal distortion. Not every resolution is equally important for the classification phase, so it is possible to choose a suitable resolution and discard the remaining components. Using the wavelet transform, the raw BCG data can be projected into a lower-dimensional space and decomposed into a series of sub-signals at different scales; the coarse components, which still carry the significant shape information, can be presented to the neural network classifier. One useful property of wavelets is that the defining coefficients of a wavelet system can be adapted to a given problem. In his original paper [9], Mallat developed a specific family of wavelets that is good for representing polynomial behaviour; as there is no prior reason to prefer one wavelet function over another for BCG analysis, the filter coefficients derived by Mallat [9] are adopted for this study. Signal reconstruction is not required here, so only low-pass filtering is used to approximate the original BCG signal. The fast discrete wavelet transform algorithm is implemented in C for the BCG signal decomposition. The original data frame has 80 sample points, and at each successive resolution level the pyramidal algorithm compresses the data by a factor of 2. At the first level, the N-dimensional signal f(x), taken as C_0, is decomposed into two bands: a coarse component C_{j,n} and a detail component D_{j,n}, each with N/2 samples. At the next level, only the low-pass output is split further. Signals at different resolutions carry information about different features; in this case the successive scales have 40, 20, and 10 samples. According to our experimental results, the 20 coarse components give relatively good classification performance.
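The pyramidal low-pass decomposition can be sketched as follows. A Haar low-pass pair is assumed here for brevity; as noted above, the paper itself adopts Mallat's filter coefficients.

import numpy as np

def coarse_approximation(signal, levels=2):
    """Pyramidal low-pass decomposition: each level halves the signal, taking an
    80-point BCG frame to 40 and then to the 20 coarse coefficients used as
    classifier inputs."""
    c = np.asarray(signal, dtype=float)
    for _ in range(levels):
        c = (c[0::2] + c[1::2]) / np.sqrt(2.0)   # low-pass filter + downsample
    return c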
3
BCG Classification
The neural network model used in this study is a variation of the back-propagation neural network with a single hidden layer; the network parameters were defined empirically. The whole BCG data set is partitioned randomly into the two data sets shown in Table 1. One data set is used for training the neural classifier and the other for testing; the training and testing sets are then swapped and the procedure repeated. The same partitioning of the data set was used in all runs of this investigation, and the generalization capability of the neural classifier was evaluated with the same strategy.
Table 1. Training and Testing Data Sets
Subjects                     Training Data Set   Testing Data Set
Normal (N)                          30                  28
Hypertension (H)                   111                 164
Risk of Heart Attack (RHA)           3                   3
4
Results
4.1
Recognition Performance
After training the neural network, the weights were fixed and the network was evaluated. The results of using data set 1 as the training set for the combined PCA and neural classifier are shown in Table 2, and Table 3 shows the classification results when data set 2 is used as the training set. The results for the combined wavelet transform and neural network classification system are shown in Tables 4 and 5, respectively.
Table 2. Classification Results Using 10 Principal Components
Class   Training Data (N, H, RHA)    Testing Data (N, H, RHA)
N       30, 0, 0                     22, 6, 0
H       0, 111, 0                    3, 160, 1
RHA     0, 1, 2                      1, 2, 0
Overall Performance: 99.31% (training), 93.33% (testing)
Table 3. Classification Results of Swapped Data Sets Using 10 Principal Components
                      Training Data            Testing Data
Class                 N     H     RHA          N     H     RHA
N                     28    0     0            25    5     0
H                     1     163   0            2     105   4
RHA                   0     1     2            0     2     1
Overall Performance         98.97%                   90.97%
Table 4. Classification Results Using Combined Wavelet Transform and Neural Classifier
                      Training Data            Testing Data
Class                 N     H     RHA          N     H     RHA
N                     28    0     0            25    2     1
H                     0     111   0            2     159   3
RHA                   0     1     2            0     1     2
Overall Performance         99.3%                    95.38%
Table 5. Classification Results of Using Swapped Data Sets
                      Training Data            Testing Data
Class                 N     H     RHA          N     H     RHA
N                     28    0     0            27    3     0
H                     0     164   0            0     107   4
RHA                   0     1     2            0     1     2
Overall Performance         99.49%                   94.44%

4.2
Computational Complexity
For real-time applications, the reduction of computational complexity is of greatest concern for the engineering design. Table 6 shows a comparison of the forward-pass computational complexity of the two combined classification systems. It shows that the number of operations required by the combined wavelet transform system is similar to that of the combined PCA network with 10 outputs. However, the memory storage required by the combined wavelet transform and neural classifier system is much smaller. Moreover, unlike the PCA method, no training procedure is required to generate the features for the classifier inputs. This provides a great advantage for real-time implementation. The comparison of performance and computational efficiency suggests that the combined wavelet transform and neural network system is suitable for real-time implementation.
5
Conclusion
A study was made to determine the usefulness of a combination of different types of features for BCG classification. The simulation results using the partition training scheme show that the wavelet transform with neural network system has better performance than the combined PCA and neural network approach. The simulation results also suggest that the improvement in system performance is constrained by the limited number of data samples. The results indicate that the number of data samples and the balance of the class patterns in the samples have a significant influence on the generalization capability of the combined PCA and neural network system. Hoffbeck and Landgrebe [11] have suggested that statistical methods need a large training data set to provide reliable statistical estimation, for example, to accurately estimate a sample covariance matrix. Although constructing a compactly supported mother wavelet involves complex mathematics, the calculation of the wavelet transform is relatively simple. The wavelet transform can be implemented as finite impulse response (FIR) filters [10]. It has been demonstrated that the wavelet transform has a modest computational complexity. From the classification performance and real-time implementation points of view, the multiresolution wavelet is promising for on-line dimensionality reduction in BCG analysis in terms of performance, storage and operations per decision.
Table 6. Comparison of the Computation Complexity for BCG Classification
             Combined PCA Method                      Combined Wavelet Method
             Multiplication  Addition  Storage        Multiplication  Addition  Storage
Feature      800             808       808            720             660       12
Classifier   115             115       115            99              99        99
Total        915             923       923            819             759       111
References
[1] I. Starr and F.C. Wood, "Twenty-year studies with the ballistocardiograph, the relation between the amplitude of the first record of 'healthy' adults and eventual mortality and morbidity from heart disease," Circulation, Vol. 23, pp. 714-732, 1961.
[2] C.E. Kiessling, "Preliminary appraisal of the prognostic value of ballistocardiography," Bibl. Cardiol., Vol. 26, pp. 292-295, 1970.
[3] T.N. Lynn and S. Wolf, "The prognostic significance of the ballistocardiogram in ischemic heart disease," Am. Heart J., Vol. 88, pp. 277-280, 1974.
[4] R.A. Marineli, D.G. Penney, W.A. Marineli, and F.A. Baciewicz, "Rotary motion in the heart and blood vessels: A review," Journal of Applied Cardiology, Vol. 6, pp. 421-431, 1991.
[5] X. Yu, "Ballistocardiogram classifier prototyping with CPLDs," Electronic Engineering, Vol. 68, No. 834, pp. 39-40, 1996.
[6] W.K. Harrison, S.A. Talbot and Baltimore, "Discrimination of the Quantitative Ultralow-frequency Ballistocardiogram in Coronary Heart Disease," American Heart Journal, Vol. 74, pp. 80-87, 1967.
[7] I.T. Jolliffe, "Principal Component Analysis," Springer-Verlag, 1986.
[8] E. Oja, H. Ogawa, and J. Wangviwattana, "Principal component analysis by homogeneous neural networks, Part I & Part II: The Weighted Subspace Criterion," IEICE Trans. Inf. & Syst., Vol. E75-D, No. 3, pp. 366-381, 1992.
[9] S.G. Mallat, "A theory for multiresolution signal decomposition: The wavelet representation," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 11, No. 7, pp. 674-693, 1989.
[10] O. Rioul and M. Vetterli, "Wavelets and signal processing," IEEE Signal Processing Magazine, Vol. 8, pp. 14-38, 1991.
[11] J.P. Hoffbeck and D.A. Landgrebe, "Covariance matrix estimation and classification with limited training data," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 18, No. 7, pp. 763-767, 1996.
Learning from the Expert: Improving Boundary Definitions in Biomedical Imagery Stewart Crawford-Hines Colorado State University Fort Collins, Colorado USA [email protected]
Abstract. Defining the boundaries of regions of interest in biomedical imagery has remained a difficult real-world problem in image processing. Experience with fully automated techniques has shown that it is usually quicker to manually delineate a boundary rather than correct the errors of the automation. Semi-automated, user-guided techniques such as Intelligent Scissors and Active Contour Models have proven more promising, since an expert guides the process. This paper will report and compare some recent results of another user-guided system, the Expert's Tracing Assistant, a system which learns a boundary definition from an expert, and then assists in the boundary tracing task. The learned boundary definition better reproduces expert behavior, since it does not rely on the a priori edge-definition assumptions of the other models.
1
Background
The system discussed in this paper provides a computer-aided assist to human experts in boundary tracing tasks, through the application of machine learning techniques to the definition of structural image boundaries. Large imagery sets will usually have a repetition and redundancy on which machine learning techniques can capitalize. A small subset of the imagery can be processed by a human expert, and this base can then be used by a system to learn the expert's behavior. The system can then semiautomate the task with this knowledge. The biomedical domain is a rich source of large, repetitive image sets. For example, in a computed tomographic (CT) scan, cross-sectional images are generated in parallel planes typically separated by millimeters. At a 2mm separation between image planes, approximately 75 images would be generated in imaging the complete brain. Images such as this, generated along parallel planes, are called sectional imagery. Such sectional imagery abounds in medical practice: X-ray, MRI, PET, confocal imagery, electron microscopy, ultrasound, and cryosection (freezing and slicing) technologies all produce series of parallel-plane 2D images. Generating a three dimensional polygonal model of a structure from sectional imagery requires bounding the structure across the whole image set. Currently, the V. Palade, R.J. Howlett, and L.C. Jain (Eds.): KES 2003, LNAI 2773, pp. 653-659, 2003. Springer-Verlag Berlin Heidelberg 2003
reference standard for high-quality outlining tasks is an expert's delineation of the region. The state-of-the-practice is that this is done manually, which is a repetitive, tedious, and error-prone process for a human. There has been much research directed toward the automatic edge detection and segmentation of images, from which the extraction of a boundary outline could then proceed. Systems based on this work have run into two significant problems: (1) the cost of user adjustments when the system runs into troublesome areas often exceeds the cost of manually tracing the structure from the start, and (2) the a priori assumptions implicit in these methods impact the ability of the system to match the expert's performance on boundary definition tasks where an expert's judgement is called into play. The system discussed herein, the Expert's Tracing Assistant (ETA), provides viable assistance for such tracing tasks which has proven beneficial in several regards:
• the specialist's time can be significantly reduced;
• errors brought on by the tedium of tracing similar boundaries over scores of similar images can be reduced; and
• the automated tracing is not subject to human variability and is thus reproducible and more consistent across images.
The thrust of this work is to learn the boundary definitions of structures in imagery, as defined by an expert, and to then assist the expert when they need to define such boundaries in large image sets. The basic methodology of this Expert's Tracing Assistant is:
1) in one image, an expert traces a boundary for the region of interest;
2) that trace is used in the supervised training of a neural network;
3) the trained network replicates the expert's trace on similar images;
4) the expert overrides the learned component when it goes astray.
The details of the neural network architecture and training regimen for this system have been previously documented by Crawford-Hines & Anderson [1]. This paper compares these learned boundaries to the semi-automated boundary definition methods of Intelligent Scissors by Barrett & Mortensen [2] , and Active Contour Models begun by Kass, Witkin, & Terzopoulos [3].
2
Methods for Boundary Tracing
To understand the relative merits of learning boundary contours, the Expert's Tracing Assistant (ETA) was studied in comparison to other user-guided methods representing the current best state-of-the-practice for boundary delineation. The techniques of Active Contour Models (ACM) and Intelligent Scissors (IS) were chosen for comparison to ETA because they have been brought into practice, they have been studied and refined in the literature, and they represent benchmarks against which other novel methods are being compared. The ground truth for the comparison of these methods is an expert's manual tracing of a structure's boundary in an image.
The structures chosen for comparison were taken from the imagery of the Visible Human dataset, from the National Library of Medicine [4], which is also becoming a benchmark set upon which many image processing and visualization methods are being exercised. Several structures were selected as representative cross-sections. For each, the IS, ACM, and ETA methods were used to define the structure's boundary and an expert manually delineated the boundary in two independent trials. Figure 1 shows three structures to be compared. The leg bone and skin are shown clearly, without much confusion. The leg muscle is fairly typical, surrounded by highly contrasting fatty tissue. However sometimes only a thin channel of tissue separates one muscle from the next. IS, also known as Live-Wire, is a user-guided method. With an initial mouse click, the user places a starting point on a boundary of interest; the system then follows the edges in an image to define a path from that initial control point to the cursor's current screen location. As the cursor moves, this path is updated in real time and appears to be a wire snapping around on the edges in an image, hence the terminology "live wire" for this tool. ACM, another user-guided methodology, uses an energy minimizing spline, that is initialized close to a structure of interest and then settles into an energy minima over multiple iterations. The energy function is defined so these minima correspond to boundaries of interest in the imagery. Since the initial contour is closed, the final result will always be a closed, continuous curve.
Fig. 1. A transverse image of the leg, highlighting the femur (bone), the biceps femoris (muscle), and the skin
3
Comparing Boundaries
For a ground truth in this comparison, an expert was asked to manually trace the structures. The expert traced the structures twice, generating two independent contours for each structure. This permits a basic measure of the variation within the expert's manual tracings to be quantified. It might be argued that this ground truth is not really a truth, it is only one user's subjective judgement of a structural boundary. But the expert user brings outside knowledge to bear on the problem and is dealing with more than simple pixel values when delineating a boundary. And for a system to be useful and acceptable as an assistant to an expert, it should replicate what the expert is attempting to do, rather than do what is dictated by some set of a priori assumptions over which the expert has no input or control. The boundary definitions are to be quantitatively compared to each other and to the ground truth of the expert. The boundaries produced by each of these methods are basically sets of ordered points which can be connected by lines or curves or splines to create a visually continuous bound around a region. To compare two boundaries, A and B, we first connect the points of B in a piecewise linear curve, and measure the minimum distance from each point in A to the curve of B. We then flip the process around, and measure from each point in B to the piecewise linear curve of A. The collection of these measures is called a difference set. Figure 2 illustrates several visualizations of this difference set. The first graph, in the upper half of the figure, plots the distance between the curves (on the vertical axis) as a function of position on the curve (on the horizontal axis). In this example, there are perhaps three places where the curves are significantly more than one pixel apart from each other, shown in the plot by the six excursions of the graph over the distance of 1.0 (remembering the plot measures A to B and B to A, thus typically excursions show up twice). If the goal is to have the curves within one pixel of each other, this indicates that there are three places where operator intervention would be required to adjust the curves so as to meet that objective. The lower left of Figure 2 is a simple histogram of the distance set, with the number of occurrences on the vertical axis and distance bins on the horizontal. The lower right is an empirical cumulative distribution function (CDF) over the distance set. The vertical axis measures the fraction of occurrences that are within a tolerance specified on the horizontal axis. The CDF allows quantification of the inter-curve distances by selecting a tolerance such as one pixel and stating, "The curves are within one pixel of each other 86% of the time" or by selecting a percentile such as 90% and stating, "The curves are within 1.1 pixels of each other 90% of the time".
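The symmetric point-to-curve distance set and the derived CDF statistics described above can be sketched as follows (a minimal NumPy version; boundaries are assumed to be ordered point lists, and the toy curves at the bottom are made up for illustration):

```python
import numpy as np

def point_to_segment(p, a, b):
    """Distance from point p to the segment a-b."""
    ab, ap = b - a, p - a
    denom = float(np.dot(ab, ab))
    t = 0.0 if denom == 0 else float(np.clip(np.dot(ap, ab) / denom, 0.0, 1.0))
    return float(np.linalg.norm(p - (a + t * ab)))

def distances_to_curve(points, curve):
    """Minimum distance from each point to a piecewise-linear curve."""
    segments = list(zip(curve[:-1], curve[1:]))
    return np.array([min(point_to_segment(p, a, b) for a, b in segments)
                     for p in points])

def difference_set(A, B):
    """Measure A to curve(B) and B to curve(A), then pool the distances."""
    return np.concatenate([distances_to_curve(A, B), distances_to_curve(B, A)])

if __name__ == "__main__":
    A = np.array([[0, 0.0], [1, 0.2], [2, 0.1], [3, 0.0]])
    B = np.array([[0, 0.1], [1.5, 0.3], [3, 0.1]])
    d = difference_set(A, B)
    print("fraction within 1 pixel :", np.mean(d <= 1.0))   # CDF value at tol = 1
    print("90th percentile distance:", np.percentile(d, 90))
```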
Fig. 2. Three visualizations of the distance set
4
Key Results
The three structures of Figure 1 typify the range of results found so far across many images. The expert manually outlined each structure on two independent trials. The expert's first boundary is used as the Ground Truth (GT), while the second manually traced boundary (M2T) is used to provide a measure of intra-expert variation, i.e., the inherent variation a user shows in tracing a boundary at different times. Given there exists variation within what an expert might manually trace, a good boundary delineation method needn't exactly match any specific expert's boundary definition, but it should be within the range of that expert's variance. Looking at the left-hand side of Figure 3, this is exactly what is observed. The black curve illustrates the CDF of M2T compared to the Ground Truth. Note the three methods are roughly comparable, all close to the bound of the black CDF. The right-hand side of the figure shows the performance for the muscle. The performance is consistently worse overall. Figure 4 shows a detail of the five boundaries superimposed on the original image; the expert's traces are in black, the semi-automated traces in white. In the lower-left of the muscle, however, there is no consistency of definition; even the expert made different traces at different times. All did equally poorly. Figure 5 shows the results for the leg skin. Here the performance difference is dramatic between ETA (far left) and the IS and ACM methods (to the right). Figure 6 illustrates what is happening in this situation. The expert judgement of the skin boundary places the boundary slightly inside what a more classically-defined boundary would be; note that both IS and ACM are agreeing on where the boundary lies, and a priori this appears to be a sensible boundary to draw. In this case, however, the body was encased in a gel before freezing, and the expert is accounting for both gel effects and the image pre-processing in locating the actual skin boundary. The expert is consistent in this judgement, and the ETA system has learned this behavior and replicated it.
Fig. 3. CDFs of M2T (black) and IS, ACM, and ETA (grey) for the bone (left) and the muscle (right) from Figure 1
Fig. 4. Detail of the five boundaries
Fig. 5. Results for the leg's skin: CDFs for, from left to right: ETA, M2T, ACM, and IS
Fig. 6. In this detail of the skin, the expert (in black) has traced consistently to the inside of what would be classically considered the image's edge; ETA (white) follows the expert's lead, while IS and ACM follow more traditional edge definitions.
The range of results in Figures 3 and 5 typify what has been seen so far: the learned boundary was either consistent with the classically defined IS and ACM methods, or it did better when expert judgement was called into play.
References
[1] S. Crawford-Hines & C. Anderson, "Learning Expert Delineations in Biomedical Image Segmentation", ANNIE 2000 - Artificial Neural Networks In Engineering, November 2000.
[2] E.N. Mortensen & W.A. Barrett, "Interactive Segmentation with Intelligent Scissors", Graphical Models and Image Processing, v.60, 1998, pp. 349-384.
[3] M. Kass, A. Witkin, D. Terzopoulos, "Snakes: Active Contour Models", First International Conference on Computer Vision, 1987, pp. 259-268.
[4] National Library of Medicine, as of Jan 2001: http://www.nlm.nih.gov/research/visible
Low Complexity Functions for Stationary Independent Component Mixtures K. Chinnasarn1 , C. Lursinsap1 , and V. Palade2 1
AVIC Center, Department of Mathematics, Faculty of Science Chulalongkorn University, Bangkok, 10330, Thailand [email protected] [email protected] 2 Oxford University Computing Laboratory Parks Road, Oxford, OX1 3QD, UK [email protected]
Abstract. A low complexity activation function and an online sub-block learning method for non-gaussian mixtures are presented in this paper. The paper deals with independent component analysis with mutual information as a cost function. First, we propose a low complexity activation function for non-gaussian mixtures, and then an online sub-block learning method for stationary mixtures is introduced. The size of the sub-blocks is larger than the maximal frequency Fmax of the principal component of the original signals. Experimental results showed that the proposed activation function and the online sub-block learning method are more efficient in terms of computational complexity as well as in terms of learning ability. Keywords: Blind signal separation, independent component analysis, mutual information, unsupervised neural networks
1
Introduction
In "independent component mixture" or "non-gaussian mixture" problems, the source signals are mutually independent and no information about the mixture environment is available. The recovering system receives unknown mixed signals from the receivers, such as microphones or sensors. In order to recover the source signals, some unsupervised intelligent learning systems are needed. We used, in our approach, a combined learning system, including unsupervised neural networks, principal component analysis, and independent component analysis using mutual information with natural gradient. The learning rule is based on a Mutual Information (MI) using the Kullback-Leibler distance measure:

KL(p(y) || p(ỹ)) = ∫ p(y) log [ p(y) / p(ỹ) ] dy   (1)
This work is fully supported by a scholarship from the Ministry of University Affairs of Thailand.
where p(y) is the true probability density function of y, and p(ỹ) is the estimated probability density function of y. For low computational time, we propose some approximation functions of 2nd order for non-gaussian mixtures. In our previous paper [3], we presented only the batch mode learning, which needs a computational complexity of at least O(K³). In this paper, we propose an online sub-block learning method which requires a complexity of at most O(K·k²), where k < K. The paper is organized as follows: Section 2 describes some background of the ICA problem; Section 3 proposes some low complexity approximations of the activation functions; the experimental design and results are presented in sections 4 and 5, respectively; Section 6 presents analytical considerations on complexity; some conclusions are drawn in section 7.
2
Independent Component Analysis Problem
An Independent Component Analysis (ICA) problem is an adequate approach because, usually, the sources si are statistically independent from each other. A very well-known practical example of an ICA application is the cocktail party problem. This problem assumes there are some people talking simultaneously in the room, which is provided with some microphones for receiving what they are talking about. Herein, we assume that there are n people and m microphones, as illustrated in figure 1 (for n = m = 3). Each microphone Micj gives a recorded time signal, denoted as xj(t), where 1 ≤ j ≤ m and t is an index of time. Each of these recorded signals is a linear combination of the original signals si (1 ≤ i ≤ n) using the mixing matrix A, as given below:

xj(t) = Σ_{i=1..n} aji si(t)   (2)
where aji, 1 ≤ j, i ≤ n, are the weighted sum parameters, which depend on the distance between the microphones and the speakers [7]. If the sources si are near to the receivers Micj, the mixing matrix A is similar to a diagonal matrix, a special case of the ICA problem. In the case of a mixture in a diagonal environment, it is said that there is no mixing occurrence between the sources si, because an original signal is only scaled and/or permuted by the diagonal mixing matrix. Commonly, the elements of the mixing matrix A and the probability density function of si(t) are unknown in advance. The only basic assumption of the cocktail party problem is that all of the sources si(t) are independent and identically distributed (iid). The basic background of ICA was presented in [1] [4] [7]. The objective of an ICA problem is to recover the source signals s̃ = y = Wx from an observed signal x = As, where each component of the recovered signals yi(t) is iid. The equation for transforming the mixed signals is the following:

s̃ = y = Wx = WAs = A⁻¹As = Is = s   (3)
Fig. 1. The cocktail party problem with 3 speakers and 3 receivers
Equation (3) shows that the full rank de-mixing matrix W is needed for recovering the mixed signal xi. There are many successful algorithms for finding the de-mixing matrix W, such as minimization of mutual information [1], infomax and maximum likelihood estimation [2], projection pursuit [7], etc. Herein, we prefer the mutual information with natural gradient (MING) algorithm, which was proposed by Amari et al. [1] in 1996. Lee et al. [9] and Chinnasarn and Lursinsap in [3] added a momentum term β∆Wt for speeding up the convergence. The learning equation is described below:

Wt+1 = Wt + η[I − φ(y)yT]Wt + β∆Wt   (4)

where η is the learning rate, β is the momentum rate, I is an identity matrix, t is the time index, φi(y) = ∂p(yi)/∂yi is the nonlinear activation function, which depends on the probability density function of the source signals si, and y = Wx. Let's consider that the activation functions for de-mixing of the sub-gaussian and the super-gaussian distributions are φ(y) = y³ and φ(y) = tanh(αy), respectively. In this paper, we are looking for some activation functions of lower complexity than the functions given above.
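One update of equation (4) might be written as in the following NumPy sketch (the averaging of the outer product over the block and the default parameter values are our own simplifying assumptions; the tanh activation corresponds to the super-gaussian case described above):

```python
import numpy as np

def ming_step(W, x_block, eta=0.05, beta=0.0005, dW_prev=None, alpha=2.0):
    """One natural-gradient update of the de-mixing matrix W (equation (4)).

    x_block : (n_channels, n_samples) block of observed (whitened) signals.
    Uses phi(y) = tanh(alpha * y), i.e. a super-gaussian mixture is assumed.
    """
    y = W @ x_block
    phi = np.tanh(alpha * y)
    n_samples = x_block.shape[1]
    I = np.eye(W.shape[0])
    grad = (I - (phi @ y.T) / n_samples) @ W   # natural gradient direction
    dW = eta * grad
    if dW_prev is not None:
        dW += beta * dW_prev                   # momentum term beta * dW_t
    return W + dW, dW
```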
3
Low Complexity Activation Functions
In the ICA problem, at most one gaussian channel is allowed, because a linear transformation of two gaussian variables is again gaussian [7]. The non-gaussian signals can be classified into the super-gaussian and the sub-gaussian distributions. A super-gaussian signal has a sharply peaked probability density function (pdf) with large tails. On the other hand, sub-gaussian signals have a flat pdf. As we described in the previous section, the nonlinear activation function φi(y) in equation (4) is determined by the sources' distribution. In this paper, we use the kurtosis [4, 5] for selecting an appropriate activation function.
Kurtosis(s) = E[s⁴] / (E[s²])² − 3   (5)
where Kurtosis(s) values are negative, zero, and positive for the sub-gaussianity, the gaussianity, and the super-gaussianity, respectively.
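Equation (5) and the resulting choice of activation can be computed directly as below (a sketch; sample moments replace the expectations, and the signals are assumed already centred as in the preprocessing step of section 4):

```python
import numpy as np

def kurtosis(s):
    """Normalised kurtosis of equation (5), estimated from sample moments."""
    s = np.asarray(s, dtype=float)
    s = s - s.mean()
    return np.mean(s**4) / np.mean(s**2)**2 - 3.0

def source_type(s):
    """Negative kurtosis -> sub-gaussian, positive -> super-gaussian."""
    return "sub-gaussian" if kurtosis(s) < 0 else "super-gaussian"

print(kurtosis(np.random.uniform(-1, 1, 100_000)))   # about -1.2 (sub-gaussian)
print(kurtosis(np.random.laplace(size=100_000)))     # about +3   (super-gaussian)
```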
3.1
Approximation for Super-Gaussianity
In [8], Kwan presented the KTLF (Kwan Tanh-Like activation Function), which is a 2nd order function. This function is an approximation of the tanh(2y) function. He divided the approximation curve into 3 regions, which are the upper bound y(t) ≥ L, the nonlinear logistic tanh-like area −L < y(t) < L, and the lower bound y(t) ≤ −L. All regions are described below:

φ(y) =  1,           (y ≥ L)
        y(γ − θy),   (0 ≤ y < L)
        y(γ + θy),   (−L < y < 0)
        −1,          (y ≤ −L)        (6)
The shape of the KTLF curve is controlled by γ = 2/L and θ = 1/L². The approximation function given in equation (6) corresponds to the tanh(2y) function. Consequently, the term α/2 is needed for controlling y, and we also suggest L = 1. Then the modified equation can be rewritten as follows:

φ(y) =  1,           (y ≥ 1)
        ý(2 − ý),    (0 ≤ y < 1)
        ý(2 + ý),    (−1 < y < 0)
        −1,          (y ≤ −1)        (7)
where ý = αy/2 and α is the upper peak of the derivative of the activation function. Figure 2 shows tanh(αy), its approximation (the dashed line) and their derivatives. From the figure we can conclude that the fraction α/2 fits all tanh(αy).
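Equation (7) needs only comparisons, one multiplication and one addition per sample, which is the point of the approximation. The sketch below is one way to write it (α = 2, i.e. an approximation of tanh(2y), so the conditions on y and ý coincide; the test values are ours):

```python
import numpy as np

def ktlf(y, alpha=2.0):
    """2nd-order approximation of tanh(alpha*y) from equation (7), with L = 1."""
    v = alpha * np.asarray(y, dtype=float) / 2.0      # y-acute = alpha * y / 2
    quad = v * (2.0 - np.abs(v))                      # v(2 - v) and v(2 + v) branches
    return np.where(np.abs(v) >= 1.0, np.sign(v), quad)

y = np.linspace(-2.5, 2.5, 11)
print(np.round(ktlf(y), 3))
print(np.round(np.tanh(2 * y), 3))                    # reference curve tanh(2y)
```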
Fig. 2. (a) An activation function φ(y) = tanh(2y), its approximation from equation (7) and (b) their derivatives
3.2
Approximation for Sub-gaussianity
In this subsection, we propose a 2nd order approximation of φ(y) = y¹¹ and φ(y) = y³, presented by Amari et al. [1] and Cichocki et al. [4], respectively.
Fig. 3. (a) Graphical representation of 11th , 3rd , and 2nd order activation function and (b) their derivatives
Given the graphical representation of the sub-gaussian activation functions illustrated in figure 3, it can be seen that the sub-gaussian activation functions can be separated into 2 regions: the positive and the negative regions. For de-mixing the sub-gaussian distribution, we propose the bisection paraboloid function given in equation (8), which is a good approximation of the previously reported functions in the literature.

φ(y) =  +y², (y ≥ 0)
        −y², (y < 0)        (8)

4
Experimental Design
For de-mixing efficiency, the signal separation procedure is divided into two sub-procedures: the preprocessing and the blind source separation procedure.
4.1
Preprocessing
For speeding up the procedure of signal transforming, some preprocessing steps are required. We used 2 steps of preprocessing as described in [7], which are centering and prewhitening. The first step is the centering. For an ICA learning procedure, an input signal si, and also xi, must be zero mean, E[si] = 0. This is an important assumption because the zero-mean signal causes the covariance E[(si − mi)(sj − mj)T] to equal the correlation si sjT, where sjT is the transpose of sj and mi is a channel average or statistical mean. Consequently, we can calculate eigenvalues and eigenvectors from both the covariance and the correlation. If the input signal is nonzero mean, a constant such as its mean must be subtracted from it.
The second step is prewhitening. This step will decorrelate the existing correlation among the observed channels xi. Sometimes, the convergence speed will be increased by the prewhitening step. Principal Component Analysis (PCA) is a practical and well-known algorithm in multivariate statistical data analysis. It has many useful properties, including data dimensionality reduction, principal and minor feature selection, and data classification, etc. In this paper, without loss of generality, we assume that all sources are principal components. Then the PCA is designed as a decorrelating filter. The PCA procedure is the projection x̃ = VTx, where VT is the transpose of the eigenvector matrix of the covariance of the observed signals E[xxT]. After the projection of x, the covariance becomes E[x̃x̃T] = I, where I is an identity matrix.
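A minimal sketch of these two preprocessing steps is given below; the eigenvalue scaling shown is one common way to obtain an identity covariance after projection and is our assumption rather than a detail taken from the paper:

```python
import numpy as np

def preprocess(x):
    """Centering followed by PCA prewhitening.

    x : (n_channels, n_samples) matrix of observed signals.
    Returns the whitened signals x_tilde and the whitening matrix V.
    """
    x = x - x.mean(axis=1, keepdims=True)            # centering: zero-mean channels
    cov = (x @ x.T) / x.shape[1]                     # covariance E[x x^T]
    eigval, eigvec = np.linalg.eigh(cov)
    V = np.diag(1.0 / np.sqrt(eigval)) @ eigvec.T    # scaled eigenvector projection
    x_tilde = V @ x                                  # E[x_tilde x_tilde^T] ~ I
    return x_tilde, V
```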
4.2
Blind Source Separating Processing
In this subsection, we describe an online sub-block ICA learning algorithm. We used an unsupervised multi-layer feed-forward neural network for de-mixing the non-gaussian channels. Our network learning method is a combination of the online and the batch learning techniques. Unknown signals xi(k0 : k0 + k) are fed into the input layer, where k0 is the start index of the sub-block, k ≥ Fmax is the length of the sub-block, and Fmax is the value of the maximal frequency in the time domain of the principal source si. Output signals yi(k0 : k0 + k) are produced by yi(k0 : k0 + k) = Wxi(k0 : k0 + k), where W is called the de-mixing matrix. If the output channels yi(k0 : k0 + k) depend on each other, the natural gradient descent in equation (4) again updates the de-mixing matrix W, and the output signals yi(k0 : k0 + k) are produced repeatedly until they become independent. The increase of the convergence speed of the online sub-block method is proved by the following theorem. Theorem 1. ICA online sub-block learning converges faster than batch learning. Proof. Consider K as the total time index of the signal and k as the time index of each sub-block, where k < K. The learning equation (4) can be rewritten as follows:

Wt+1 = Wt + η[I − φ(y)yT]Wt + β∆Wt   (9)
The computational complexity of equation (9) depends on the correlation φ(y)yT, where yT is the transpose of y. For the batch learning with time index K, the complexity of (9) is O(K³). On the other hand, for the online sub-block learning, we have K/k sub-blocks. The computational complexity of equation (9) is then (K/k)·O(k³) = O(K·k²). It holds that O(K·k²) < O(K³), where k < K. Hence, the ICA online sub-block learning method is faster than the batch learning method.
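The overall online sub-block procedure, including the weight inheritance between consecutive blocks discussed in section 5, might look like the following sketch (the fixed number of sweeps per block and the learning-rate decay are simplified placeholders for the Kullback-Leibler convergence test actually used):

```python
import numpy as np

def subblock_ica(x, phi, k, eta=0.05, beta=0.0005, sweeps=50):
    """Online sub-block natural-gradient ICA (simplified sketch).

    x   : (n, K) whitened observations
    phi : activation, e.g. np.tanh or lambda y: np.sign(y) * y**2
    k   : sub-block length (k >= Fmax)
    """
    n, K = x.shape
    W = np.eye(n)                        # inherited from one sub-block to the next
    dW = np.zeros_like(W)
    I = np.eye(n)
    for start in range(0, K - k + 1, k):
        block = x[:, start:start + k]
        for _ in range(sweeps):          # iterate until outputs become independent
            y = W @ block
            dW = eta * (I - (phi(y) @ y.T) / k) @ W + beta * dW
            W = W + dW
        eta /= 1.005                     # decaying learning rate
    return W
```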
5
Experimental Results
Some simulations have been made on both super-gaussian and sub-gaussian signals, which contained 191,258 data points for each channel. The super-gaussian data sets were downloaded from http://sound.media.mit.edu/ica-bench/, and consist of 3 sound sources. For sub-gaussianity, we simulated our algorithm using the following three synthesized signals:
1. s1(t) = 0.1 sin(400t) cos(30t)
2. s2(t) = 0.01 sign[sin(500t + 9 cos(40t))]
3. s3(t) = uniform noise in range [−0.05, 0.05]
A mixing matrix A is randomly generated. An initial de-mixing matrix W is set to the transpose of an eigenvector matrix of the covariance of an observed signal x. As presented in [3], the convergence criterion should be set so that the Kullback-Leibler divergence distance is less than 0.00001 (∆KL ≤ 0.00001). We have run 10 simulations using a variable learning rate with initial value 0.05 and a momentum rate of 0.01η. Anyway, both the learning rate value and the momentum rate value can be arbitrarily set over the range (0..1]. At each learning iteration, the learning rate was decreased by a factor of 1.005 (η = η/1.005). Some of the experimental results were presented in our previous paper [3]. For improving the learning performance, we determined the relationship between the online sub-blocks. The final de-mixing matrix W of sub-block j is set as the initial de-mixing matrix of sub-block j + 1, where 1 ≤ j ≤ Tb, Tb is the total number of blocks K/k, k ≥ Fmax = 20,000 Hz, and K is the total time index. The weight inheritance will maintain the output channels over unknown mixture environments. Figure 4(a) and 4(b) display the original sources and recovered sources of the sub-gaussian and the super-gaussian distributions, respectively. As an algorithmic performance measurement, we used the performance correlation index, derived from the performance index proposed by Amari et al. in [1]. In practice, the performance index can be replaced with the performance correlation index.
E = Σ_{i=1..N} ( Σ_{j=1..N} |cij| / max_k |cik| − 1 ) + Σ_{j=1..N} ( Σ_{i=1..N} |cij| / max_k |ckj| − 1 )   (10)
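Equation (10) translates directly into the following sketch (the example matrices are arbitrary illustrations):

```python
import numpy as np

def performance_index(C):
    """Performance correlation index E of equation (10) for a square matrix C."""
    C = np.abs(np.asarray(C, dtype=float))
    row_term = (C / C.max(axis=1, keepdims=True)).sum(axis=1) - 1.0
    col_term = (C / C.max(axis=0, keepdims=True)).sum(axis=0) - 1.0
    return float(row_term.sum() + col_term.sum())

print(performance_index(np.eye(3)))                                   # 0 for perfect separation
print(performance_index([[1, .3, 0], [.2, 1, .1], [0, .4, 1]]))       # > 0 for residual mixing
```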
The matrix C = φ(y)yT is close to the identity matrix when the signal yi and yj are mutually uncorrelated or linearly independent. The results are averaged over the 10 simulations. Figure 5 illustrates the performance correlation index on a logarithmic scale, during the learning process, using our proposed activation functions in section 3, and the typical activation functions for the non-gaussian mixtures from section 2. Figure 5(a) corresponds to the mixture of super-gaussian signals. The curve of tanh(.) was matched with its approximation function, and it can be seen that they converge over the saturated region with the same speed. Figure 5(b) corresponds to the mixture of sub-gaussian distribution. The proposed activation function φ(y) = ±y2 converges faster than
x 10
(b)
φ(y) = y³ in the beginning, and slows down when the outputs yi are close to the saturated region. The derivative of the function φ(y) = ±y² is slower than the derivative of the function φ(y) = y³ when y ≥ ±0.667; see the slope of each function in figure 3.
Fig. 4. The mixing and de-mixing of (a) sub-gaussianity and (b) super-gaussianity
6
Analytical Considerations on Complexity
For the mixture of the super-gaussian signals, the unknown source signals can be recovered by tanh(αy) and its approximation given in equation (7). Considering the same input vector, both activation functions produce a similar output vector, because the curve of the approximation was matched to the curve of tanh(αy), as illustrated in figure 2. Hence, they required the same number of epochs for recovering the source signals, as shown in figure 5(a). But an approximation function requires fewer computational micro-operations per instruction than tanh(αy), and is more suitable for hardware implementation. Regarding the recovery of the mixture of sub-gaussian signals, the curve of φ(y) = ±y² did not exactly match either y³ or y¹¹, but they produced the same results with different convergence speeds, as shown in figure 5(b). The lower-order activation function needs a smaller memory representation during the running process. Also, φ(y) = ±y² requires only XOR and complement micro-operations per instruction.
2000
(b)
Fig. 5. Performance correlation index for the separation of (a) the supergaussian mixtures and (b) the sub-gaussian mixtures (averaged on 10 simulations)
7
Conclusions
In this paper, we presented a low complexity framework for an independent component analysis problem. We proposed a 2nd order approximation used for the demixing of the super-gaussian and sub-gaussian signals. Moreover, the number of the multiplication operations required by the separating equation was reduced by using an online sub-block learning method. The proposed activation functions and the online sub-block learning algorithm are efficient methods for demixing the non-gaussian mixtures, with respect to the convergence speed and learning abilities.
References
[1] S.-I. Amari, A. Cichocki, and H.H. Yang. A New Learning Algorithm for Blind Signal Separation, MIT Press, pp. 757-763, 1996.
[2] J.-F. Cardoso. Infomax and Maximum Likelihood for Blind Source Separation, IEEE Signal Processing Letters, Vol. 4, No. 4, pp. 112-114, 1997.
[3] K. Chinnasarn and C. Lursinsap. Effects of Learning Parameters on Independent Component Analysis Learning Procedure, Proceedings of the 2nd International Conference on Intelligent Technologies, Bangkok/Thailand, pp. 312-316, 2001.
[4] A. Cichocki and S.-I. Amari. Adaptive Blind Signal and Image Processing: Learning Algorithms and Applications, John Wiley & Sons, Ltd., 2002.
[5] M. Girolami. An Alternative Perspective on Adaptive Independent Component Analysis Algorithms, Neural Computation, Vol. 10, No. 8, pp. 201-215, 1998.
[6] S. Haykin. Neural Networks: A Comprehensive Foundation, 2nd ed., Prentice Hall, 1999.
[7] A. Hyvarinen and E. Oja. Independent Component Analysis: algorithms and applications, Neural Networks, Vol. 13, pp. 411-430, 2000.
[8] H.K. Kwan. Simple Sigmoid-like activation function suitable for digital hardware implementation, Electronics Letters, Vol. 28, No. 15, pp. 1379-1380, 1992.
[9] T.-W. Lee, M. Girolami and T.J. Sejnowski. Independent Component Analysis Using an Extended Infomax Algorithm for Mixed Sub-Gaussian and Super-Gaussian Sources, Neural Computation, Vol. 11, No. 2, pp. 409-433, 1999.
Knowledge-Based Automatic Components Placement for Single-Layer PCB Layout Tristan Pannérec Laboratoire d'Informatique de Paris VI Case 169, 4 place Jussieu, 75005 Paris, France [email protected]
Abstract. Printed Circuit Board layout involves the skillful placement of the required components in order to facilitate the routing process and minimize or satisfy size constraints. In this paper, we present the application to this problem of a general system, which is able to combine knowledge-based and search-based solving and which has a meta-level layer to control its reasoning. Results have shown that, with the given knowledge, the system can produce good solutions in a short time.
1
Introduction
The design of a PCB (Printed Circuit Board) for a product goes through three main stages. The first one is the designing of the logical scheme, which defines the components used (part list) and their interconnections (net list). The second one is the layout of the PCB with the definition of positions for all components and tracks, which link the components together. Generally, these two steps (components placement and routing) are carried out sequentially. The last stage is the industrial production, where the PCB are obtained by chemical processes and components are soldered. In this paper, we are interested in the components placement stage. The problem is to find a position on the board and a direction for each component in such a way that the components do not overlap and that the placement allows a good routing result. A good routing result means that all wires are taken on and that tracks are shortest. Depending on the nets, the length of the tracks can be of varying importance. For example, unprotected alimentation wires or wires between quartz and microcontrollers must absolutely be short for physical reasons (the quality of the product will depend on these lengths). A secondary objective during the component placement step can be to minimize the surface of the PCB if no predefined form is given. While the routing step is automatically done for a long time in most CAD tools, fully automated components placement is a more recent feature. Most applications thus allow “interactive” placement, where the user can instantly see the consequence of placing a component in a given position, but few can achieve good automatic placement, especially for single-layer PCB, where the problem is much harder. V. Palade, R.J. Howlett, and L.C. Jain (Eds.): KES 2003, LNAI 2773, pp. 669-675, 2003. Springer-Verlag Berlin Heidelberg 2003
In this paper we report on the use of our MARECHAL framework for solving this complex problem of automatic component placement for a single-layer PCB (the method works of course for multi-layer PCB as well). The rest of this paper is organized as follows: the second section presents some related work. Then, the MARECHAL framework is described briefly in section 3. Section 4 is dedicated to the knowledge we have given to this framework and section 5 presents the results we have achieved. Some final conclusions are drawn in section 6.
2
Related Work
The placement of components is a well-known problem, which has been extensively studied. Research has mainly focused on the VLSI design. The methods used for VLSI layout can be divided into two major classes [1]: constructive methods and iterative improvement methods. The first group (also referred to as global or relative placement) builds a global placement derived from the net-cell connections. In this class, we can find the force-directed (analytical) methods [3] and the min-cut (partitioning-based) methods [2]. The iterative methods start with an initial random placement and try to improve it by applying local changes, like reflecting a cell or swapping two cells. Deterministic methods or randomized methods (either with Simulated Annealing [9] or Genetic Algorithm [11]) belong to this class. Hybrid methods have also been investigated [4] but are limited to the use of a constructive method to produce the initial placement for an iterative method. In our approach, the two processes are more integrated. Little academic work addresses specifically the PCB framework, although the problem is broadly different from the VLSI one. There are not hundreds of thousands of cells, which have all the same shape and have to be organized in predefined rows. Instead, we have from ten to a few hundred components, with very variable sizes and numbers of pins (from 2 to 30 or more). Components cannot be modeled as points as in the common VLSI methods and the objective function has to be more precise. In the VLSI layout, objective functions use the semi-perimeter method [10] to estimate wiring length and density estimation to avoid crossing between wires. In a single layer PCB, these approximations are not precise enough and more complex models have to be considered, as we will see in the next section. But these more complex models prevent the use of classical constructive methods and lead to more CPUconsuming objective functions, which also prevent the use of classical iterative approaches (we have for example tested a genetic algorithm with our objective function but, even after 2 hours of computation, the function was not correctly optimized and the solution was of no interest). We use an objective function based on the minimum spanning tree model (MST) for each net (cf. Fig. 1 and Fig. 4). On the basis of these trees, the evaluation function is the sum of two terms. The first term is the weighted sum of all segment lengths (the weights depend on the wire type). The second term represents the crossing value: when two segments cross, a value is added to the evaluation function depending on the a priori possibilities to resolve the crossing and proportionally to the avoiding path if it can be easily computed (cf. Fig. 2). The resulting objective function takes a long time to compute but is more precise.
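The wire-length part of such an objective function can be sketched as follows, using a minimum spanning tree per net built with Prim's algorithm over Euclidean pin-to-pin distances; the crossing-penalty term and the per-wire-type weights described above are reduced here to a simple weight parameter, and the example net is made up:

```python
import numpy as np

def mst_length(pins):
    """Total edge length of a Euclidean minimum spanning tree (Prim's algorithm)."""
    pins = np.asarray(pins, dtype=float)
    n = len(pins)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = np.linalg.norm(pins - pins[0], axis=1)   # best distance from each pin to the tree
    total = 0.0
    for _ in range(n - 1):
        j = int(np.argmin(np.where(in_tree, np.inf, best)))
        total += best[j]
        in_tree[j] = True
        best = np.minimum(best, np.linalg.norm(pins - pins[j], axis=1))
    return total

def placement_cost(nets, weights):
    """Weighted sum of MST lengths over all nets (crossing term omitted)."""
    return sum(w * mst_length(p) for p, w in zip(nets, weights))

nets = [np.array([[0, 0], [2, 1], [1, 3], [4, 2], [3, 0]])]   # one hypothetical 5-pin net
print(placement_cost(nets, weights=[1.0]))
```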
Fig. 1. Example of a MST for a 5-pins net
Fig. 2. Example of importance of crossing
3
The Marechal System
The MARECHAL system is a general problem solver, which uses domain-dependent knowledge. For the moment, it is limited to optimisation problems where the goal is to coordinate numerous concurrent decisions to optimise a given criterion (e.g. the evaluation function defined in the previous section). Besides automatic component placement, it is also currently applied to a complex semi-concurrent game [7] and to timetabling. The approach materialized in the MARECHAL framework is first of all based on three principles. The first one is the integration of knowledge-based solving with search-based solving [6]. The system uses domain-dependent knowledge to select the best a priori choices, but it can also question these choices to explore the search space. The second principle is to use abstract reasoning levels to limit the search to coherent solutions. The problem is thus decomposed into sub-problems (thanks to domain-specific knowledge) and for each decomposition level, the system can make abstract choices (choices about intentions, stage orders…). The third principle is to use auto-observation to control the solving process. By means of domain-dependent and domain-independent meta-level rules, the system can analyse its own reasoning and determine how to improve the current solution [5]. Instead of randomly trying changes to the current solution, it can thus limit the search to interesting improvement possibilities. These three important features of our approach allow dealing with complex problems where pure search methods cannot be applied (because of huge
search space and irregular time-consuming objective functions) and where good specific heuristics cannot be found for all possible instances of the problem. The MARECHAL system materializes this approach by providing a kind of interpreter for a specific language used to define the domain-specific knowledge. It is built on a meta-level architecture: the first part is the basic system, which is able to construct solutions, and the second part is the meta level, which is responsible for managing and monitoring the solving process by observing the first part [8]. To achieve that, a bi-directional communication mechanism is used: on the one hand, the basic system sends notifications and traces about what it is doing and what has happened (a sub-problem has been successfully solved…) and on the other hand, the supervision layer sends orders or recommendations, possibly with new constraints or objectives, to precisely drive the resolving process.
4
Knowledge
In this section, we describe in more detail the application of the MARECHAL system to component placement, with a particular emphasis on the knowledge given to solve this problem. The knowledge allows decomposing the problem (decomposition expertise), to construct a solution for each sub-problem (construction expertise) and to intelligently explore the solution space by improving the solution already constructed (improvement expertise). 4.1
Decomposition and Construction Knowledge
As regards the other applications, the decomposition knowledge for the PCB layout problem is simple. The main problem is decomposed into two stages. First, the system tries to generate macro-components by looking for repeated patterns in the circuit. Components belonging to a macro-component are then placed locally and replaced by a single new component. This stage is especially useful for large circuits. The second stage is the main placement stage. If we have n components, n calls to the “Single Component Placement” (SCP) sub-problem are made, according to the best a priori order. Heuristics used to define this order are based on the number of connections and the size of the components (big components with many connections are set up first). The SCP sub-problem first chooses an already placed "brother" component and tries to place the new component close to this brother by scanning the local area (the last point is the most complex one, because it poses difficult perception problems). To find a ‘brother” component, the heuristics emphasis on the components which share many connections. Heuristics used to choose the concrete positions are based on simulation of the local “rat nest” that will be computed for the component to be placed. 4.2
Improvement Knowledge
With the above knowledge, the system is thus able to construct a solution but, as the choices are done a priori (next component to consider, position of the components…),
Knowledge-Based Automatic Components Placement for Single-Layer PCB Layout
673
they can be proved wrong after some attempts, when a future component cannot be correctly placed. That is why a priori choices have to be questioned by means of improvement tests, which will allow a partial exploring of the solution space. Two types of improvement are tested by the system for the automatic component placing application. The first type are called “immediate tests” because they are executed as soon as the possibilities are discovered, without waiting for a full solution. For example, if a component Cx cannot be placed next to Cy (but where it would be well placed) because of the presence of Cz, the position of Cz is immediately questioned before placing any other component: the system tries to place Cz in another place that will allow to place Cx next to cy. For example in Fig. 3, RL3 is placed first, but then U6 cannot be placed close to pre-placed component B4, although a short connection is required between U6 and B4. The position of RL3 is thus put into question to obtain the layout on the right, which is better. Then, the placement process go on with a new component. A total of seven rules produce improvement possibilities of this type.
Fig. 3. Example of immediate improvement
The second type of improvement acts after having constructed a full solution. When all components have been placed, the system tries to modify the worst placed component or to minimize the size of the circuit by generating new form constraints. The process stops after a fixed amount of time and the best-constructed solution is sent to the CAD program.
5
Results
Our system has been tested on several circuits involving from 20 to 50 components and with one to three minutes to produce each solution. Two examples of a solution produced by the system for simple circuits are given in Fig. 4. The figure displays the associated MST: the bold lines represent connections that have to be especially minimized. In all cases, the solutions produced by the systems had a better value than those produced manually by a domain expert, in terms of the evaluation function presented in section 3 (cf. Fig. 5). Of course, this function is not perfect, as only a full routing process can determine if the placement is really good.
674
Tristan Pannérec
Fig. 4. Examples of solution produced by the system
Objective function
MARECHAL
human expert
160000 140000 120000 100000 80000 60000 40000 20000 0 #1
#2
#3
#4
#5
#6
Circuits
Fig. 5. Comparison of objective function values obtained by our system and human experts (the values are to be minimized)
That's why the solutions have also been evaluated by a domain expert, who was asked to build a final solution (with the tracks) from the solution produced by the system. It appeared that only minor changes were necessary to obtain a valid solution. In addition, these manual routing processes did not use more "straps" (bridges to allow crossings) than manual routing operated on the human placements.
6
Conclusion
In this paper, we have presented the application of the MARECHAL system to automatic component placement for single-layer PCB layout. Thanks to the meta-
level solving control and the combination of knowledge-based and search-based approaches, the system is able to produce good solutions in a short time, compared to human solutions. As the single-layer problem is more difficult than the multi-layer problem, the knowledge could be easily adapted to deal with multi-layer PCB design. More constraints, such as component connection order, could also easily be added. Future work will first consist in improving the current knowledge to further minimize the necessary human intervention. A more important direction is to implement an automatic router with the MARECHAL framework and to connect the two modules. Thus, the router will serve as an evaluation function for the placement module, which will be able to modify component positions according to the difficulties encountered in the routing process. The system will then be able to produce a fully routed solution from scratch and without any human intervention.
References
[1] H.A.Y. Etawil: Convex optimization and utility theory: new trends in VLSI circuit layout. PhD Thesis, University of Waterloo, Canada (1999).
[2] D.J.H. Huang & A.B. Kahng: Partitioning-based standard-cell global placement with an exact objective. Inter. Symp. on Physical Design (ISPD) (1997), 18-25.
[3] F. Johanes, J.M. Kleinhaus, G. Sigl & K. Antereich: Gordian: VLSI placement by quadratic programming and slicing optimization. IEEE Trans. on CAD, 10(3) (1991), 356-365.
[4] A. Kennings: Cell placement using constructive and iterative methods. PhD Thesis, University of Waterloo, Canada (1997).
[5] T. Pannérec: Using Meta-level Knowledge to Improve Solutions in Coordination Problems. Proc. 21st SGES Conf. on Knowledge Based Systems and Applied Artificial Intelligence, Springer, Cambridge (2001), 215-228.
[6] T. Pannérec: An Example of Integrating Knowledge-based and Search-based Approaches to Solve Optimisation Problems. Proc. of the 1st European STarting A.I. Researchers Symp. (STAIRS 02), pp. 21-22, Lyon, France (2002).
[7] T. Pannérec: Coordinating Agent Movements in a Semi-Concurrent Turn-Based Game of Strategy. Proceedings of the 3rd Int. Conf. on Intelligent Games and Simulation (GameOn 02), pp. 139-143, Harrow, England (2002).
[8] J. Pitrat: An intelligent system must and can observe his own behavior. Cognitiva 90, Elsevier Science Publishers (1991), 119-128.
[9] C. Sechen: VLSI placement and global routing using simulated annealing. Kluwer Academic Publishers (1988).
[10] C. Sechen & A. Sangiovanni: The TimberWolf 3.2: A new standard cell placement and global routing package. Proc. 23rd DAC, IEEE/ACM, Las Vegas (1986), 408-416.
[11] K. Shahookar & P. Mazumder: A genetic approach to standard cell placement using meta-genetic parameter optimization. IEEE Trans. on Computers, 9(5) (1990), 500-511.
Knowledge-Based Hydraulic Model Calibration
Jean-Philippe Vidal¹, Sabine Moisan², and Jean-Baptiste Faure¹
¹ Cemagref, Hydrology-Hydraulics Research Unit, Lyon, France, {vidal,faure}@lyon.cemagref.fr
² INRIA Sophia-Antipolis, Orion Project, France, [email protected]
Abstract. Model calibration is an essential step in physical process modelling. This paper describes an approach for model calibration support that combines heuristics and optimisation methods. In our approach, knowledge-based techniques have been used to complement standard numerical modelling ones in order to help end-users of simulation codes. We have both identified the knowledge involved in the calibration task and developed a prototype for calibration support dedicated to river hydraulics. We intend to rely on a generic platform to implement artificial intelligence tools dedicated to this task.
1
Introduction
Simulation codes are scientific and technical programs that build numerical models of physical systems, especially environmental ones. We are interested in river hydraulics, where simulation codes are based on the discretisation of simplified fluid mechanics equations (the de Saint-Venant equations) that model streamflows. These codes have been evolving over the last forty years from basic numerical solvers to efficient and user-friendly hydroinformatics tools [1]. When a numerical model is built for a river reach and its corresponding hydraulic phenomena (e.g., flood propagation), the model must be as representative as possible of physical reality. To this end, some numerical and empirical parameters must be adjusted to make numerical results match observed data. This activity, called model calibration, can be considered as a task in the artificial intelligence sense. This task has a predominant role in good modelling practice in hydraulics [2] and in water-related domains [3]. Users of simulation codes currently carry out model calibration either by relying on their modelling experience or by resorting to an optimisation code. The most widely used method is trial-and-error, which consists of running the code, analysing its outcomes, modifying parameters and restarting the whole process until a satisfactory match between field observations and model results has been reached. Reference books give few indications on how to carry out this highly subjective task efficiently [4], and experienced modellers follow heuristic rules to modify parameters. The task thus requires not only a high degree of expertise to analyse model results but also fast computers to perform numerous time-consuming simulation runs.
In order to overcome these difficulties, many automatic calibration methods have been developed over the last thirty years [5]. These methods rely on three main elements: an objective function that measures the discrepancy between observations and numerical results, an optimisation algorithm that adjusts parameters to reduce the value of the function, and a convergence criterion that tests its current value. The major drawback of this kind of calibration lies in the equifinality problem, as defined by Beven [6], whereby the same result may be achieved by different parameter sets. Consequently, the algorithm may end up in a local minimum of the objective function and thus lead to an unrealistic model. Artificial intelligence techniques have recently been used to improve these automatic methods, with the use of genetic algorithms, either alone [7] or combined with case-based reasoning systems [8]. Our objective is to propose a more symbolic approach for hydraulic model calibration support. The aim is to make hydroinformatics expert knowledge available to end-users of simulation codes; a knowledge-based system encapsulating the expertise of developers and experienced engineers can guide the calibration task. Unlike Picard [9], we focus on the operational use of simulation codes rather than on their internal contents. The paper first presents the cognitive modelling analysis of the calibration task and of the hydroinformatics domain, then outlines the techniques used for the development of a knowledge-based system dedicated to model calibration.
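As an illustration of these three elements, the following minimal sketch wires an objective function, a one-dimensional optimisation algorithm and a convergence test around a simulation run; run_model, k_bounds and the RMSE objective are hypothetical stand-ins, not part of the authors' system.

```python
import numpy as np

def objective(observed, simulated):
    # Discrepancy between field observations and model results
    # (root-mean-square error is one common choice).
    return float(np.sqrt(np.mean((np.asarray(observed) - np.asarray(simulated)) ** 2)))

def calibrate(run_model, observed, k_bounds, tol=1e-3, max_iter=50):
    """Golden-section search on a single roughness coefficient k.

    run_model(k) is a stand-in for one run of the simulation code that
    returns the computed water levels at the observation points.
    """
    inv_phi = (np.sqrt(5.0) - 1.0) / 2.0
    lo, hi = k_bounds
    f = lambda k: objective(observed, run_model(k))
    for _ in range(max_iter):
        a = hi - inv_phi * (hi - lo)
        b = lo + inv_phi * (hi - lo)
        lo, hi = (lo, b) if f(a) < f(b) else (a, hi)
        if hi - lo < tol:                     # convergence criterion
            break
    return 0.5 * (lo + hi)
```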
2
Descriptive Knowledge Modelling
From a cognitive modelling perspective, we have first identified and organised the knowledge in computational hydraulics that may influence the model calibration task. Our approach was to divide this knowledge into generic knowledge, corresponding to concepts associated with numerical simulation of physical processes in general, and domain knowledge, corresponding to concepts specific to open-channel flow simulation. Concerning generic computational knowledge, we defined a model as a composition of data, parameters, and a simulation code (see Fig. 1). Following Amdisen [10], we distinguished static data from dynamic data. Static data are components of the model; they depend only on the studied site (e.g., in our application domain, topography and hydraulic structures). Dynamic data represent inputs and outputs of the simulation code; they include observed data (e.g., water levels measured during a flood) and corresponding computed data. Concerning domain knowledge, we extracted from river hydraulics concepts a hierarchy of bidimensional graphical objects which instantiate the previous meta-classes of data. Figure 2 shows a partial view of this hierarchy of classes and displays objects commonly used in 1-D unsteady flow modelling. For instance, CrossSection in Fig. 2 derives from StaticData in Fig. 1 and DischargeHydrograph may derive from any subclass of DynamicData.
Fig. 1. UML class diagram of generic computational knowledge (classes Data, StaticData, DynamicData, InputData, OutputData, CalibrationInputData, PredictionInputData, CalibrationOutputData, SimulationOutputData, SimulationCode, Parameters and Model, with their associations)
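As a rough Python transcription of the meta-classes of Fig. 1 (attribute names and the example values are illustrative, not taken from the authors' implementation):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class StaticData:                 # depends only on the studied site
    description: str

@dataclass
class DynamicData:                # inputs/outputs of the simulation code
    description: str
    values: List[float] = field(default_factory=list)

@dataclass
class Parameters:
    roughness: float = 0.03       # e.g. a single riverbed roughness coefficient

@dataclass
class Model:
    static_data: List[StaticData]
    parameters: Parameters
    simulation_code: str          # e.g. the name of the hydraulic solver

cross_section = StaticData("CrossSection: topography of one river section")
floodmarks = DynamicData("CalibrationOutputData: observed maximum flood levels")
model = Model([cross_section], Parameters(), "1-D de Saint-Venant solver")
```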
Fig. 2. Simplified UML class diagram of graphical objects (GraphicalObject, Curve and Point, with subclasses such as WaterSurfaceProfile, RatingCurve, StageHydrograph, DischargeHydrograph, CrossSection, WaterLevel, Discharge, GaugingPoint, GroundPoint and Floodmark)
Fig. 3. UML activity diagram of problem solving in modelling (direct problem: preprocess, simulate, determine; inverse problem: dispatch, define, initialise, simulate, compare, evaluate)
3
Operative Knowledge Modelling
The second category of knowledge concerns the problem solving strategy of the calibration task. Modellers of physical processes may face two situations [11]: solving a direct problem, which means computing the output of a model given an input, or solving an inverse problem, which means finding the model that transforms a given input into a given output. The first situation occurs during productive use of simulation codes. In hydraulics, the second situation amounts to determining the model parameters, since hydraulic modelling follows deterministic laws, and corresponds in fact to a calibration task. Figure 3 shows the relation between these two situations (calibration must of course be performed before using simulation codes for prediction) and emphasises the first-level decomposition of both modelling problems. We determined the problem solving strategy of this task essentially from interviews of experts, since the literature on the calibration task is very scarce. We identified six main steps, listed in Fig. 3 and detailed below for a typical case of unsteady flow calibration of a river reach model (a schematic sketch of the resulting loop follows the list):
1. dispatch: Available dynamic data are dispatched among calibration input and output data. Boundary conditions compose the core of the calibration input data (e.g., a flood discharge hydrograph determined at the upstream cross-section and the corresponding stage hydrograph at the downstream cross-section). Field observations such as floodmarks (maximum flood water levels) measured along the reach constitute the calibration output data;
2. define: Generally, only one model parameter is defined at first: here, a single riverbed roughness coefficient for the whole reach;
3. initialise: Default parameter value(s) are set, often from literature tables;
4. simulate: The code is run to produce simulation output data, usually in the form of water-surface profiles for each time step;
5. compare: Simulation output data are compared to calibration output data, e.g., by putting maximum envelopes of water-surface profiles and floodmarks side by side;
6. evaluate: The model error is judged; an unsatisfactory calibration leads either to adjusting the coefficient value (back to step 3) or to redefining the model parameters, e.g., to take a spatially distributed roughness coefficient into account (back to step 2).
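A minimal Python skeleton of this six-step loop is given below; every argument is a hypothetical stand-in for the expert knowledge and the simulation code, not an implementation of the authors' engine.

```python
def calibrate_reach(simulate, compare, dynamic_data,
                    parameter_definitions, candidate_values, tolerance):
    """Skeleton of the six-step strategy described above.

    `simulate` wraps one run of the simulation code, `parameter_definitions`
    is an ordered list of parameter definitions (e.g. a single roughness
    coefficient, then a spatially distributed one), and
    `candidate_values(defn)` yields values to try, starting from the
    literature-table default and then the heuristic adjustments.
    """
    cal_input, cal_output = dynamic_data                        # 1. dispatch
    for definition in parameter_definitions:                    # 2. define
        for values in candidate_values(definition):             # 3. initialise / adjust
            computed = simulate(definition, values, cal_input)  # 4. simulate
            error = compare(computed, cal_output)               # 5. compare
            if error <= tolerance:                              # 6. evaluate
                return definition, values                       # calibrated model
    return None  # calibration unsatisfactory for all definitions
```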
4
Prototype for Calibration Support
We intend to rely on a platform for knowledge-based system design that has been developed at Inria and applied to program supervision [12]. Program supervision is a task that emulates expert decisions in the skilled use of programs, such as their choice and organisation, their parameter tuning, or the evaluation of their results. Like calibration, this task involves activities such as initialising parameters, running codes and evaluating results. Both tasks also use very similar concepts, such as the common notion of parameters. We plan to take advantage of these similarities to reuse, with slight modifications, both an existing engine and the attached knowledge description language, named Yakl. The previous sections provide us with a conceptual view of the calibration task, which serves as a basis to determine the required modifications. From this conceptual view it is possible to derive modifications in both the language, to describe the task concepts, and the inference engine, to manipulate them. For instance, while initialise and evaluate are existing reasoning steps in the program supervision task, dispatch is a new one that should be added for the calibration task. In parallel to this conceptual view, we also conducted an experiment to implement a calibration knowledge-based system using the program supervision language and engine. This experiment allowed us to refine the specifications of new tools dedicated to the calibration task. The platform will support the implementation of these tools by extension of the program supervision ones. Two concepts provided by the program supervision task have proven useful for our purpose: operators and criteria. Operators correspond to actions, either simple ones (e.g., run a code) or composite ones, which are combinations of operators (sequence, parallel or alternative decompositions) that solve abstract processing steps. Operators have arguments corresponding to the inputs/outputs of actions. Both kinds of operators also include various criteria composed of production rules, capturing the know-how of experts on the way to perform actions. For instance, in program supervision, criteria express how to initialise the input arguments of programs (initialisation criteria), to evaluate the quality of their results (assessment criteria), to modify parameter values of the current operator (adjustment criteria), or to transmit failure information in case of poor results (repair criteria). Additional criteria are related to composite operators (for choices among sub-operators or optional applications of sub-operators in a sequence). This range of criteria mostly suits our needs.
Initialisation criteria:
Rule {
    name determine base value for gravel
    comment "Determination of base value for gravel roughness coefficient"
    Let obs an observation
    If obs.bed material == gravel
    Then nb.lower := 0.028, nb.upper := 0.035
}

Choice criteria:
Rule {
    name select flood dynamics operator
    comment "Selection of operator for flood dynamics modelling"
    Let s a study
    If s.phenomena == flood dynamics
    Then use operator of characteristic unsteady flow
}

Fig. 4. Examples of criteria and attached rules in Yakl syntax
Figure 4 shows examples of rules that belong to our prototype knowledge-based calibration system. The first rule is connected to the initialise step and expresses a conversion from an observation of the bed material nature to an appropriate range of numerical values for the roughness coefficient. The second rule is related to the simulate step: it helps to select an appropriate sub-operator for flood dynamics modelling, in line with current research on model selection [13].
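Purely as an illustration of how such criteria behave at run time, the mock-up below re-expresses the two rules of Fig. 4 as Python data; the field names mirror the Yakl listing and are not a real API of the Inria platform.

```python
# Hypothetical Python mock-up of the two criteria shown in Fig. 4.
initialisation_rules = [
    {"if": lambda obs: obs["bed_material"] == "gravel",
     "then": {"nb_lower": 0.028, "nb_upper": 0.035}},
]
choice_rules = [
    {"if": lambda study: study["phenomena"] == "flood_dynamics",
     "then": "unsteady_flow_operator"},
]

def fire(rules, facts):
    """Return the conclusions of all rules whose condition matches the facts."""
    return [rule["then"] for rule in rules if rule["if"](facts)]

print(fire(initialisation_rules, {"bed_material": "gravel"}))
# -> [{'nb_lower': 0.028, 'nb_upper': 0.035}]
```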
5
Conclusion and Perspectives
We have now completed an important phase, namely the specification of artificial intelligence tools dedicated to the calibration task, with a focus on its application to hydraulics. Our approach is based on a conceptual view of the task and on a prototype knowledge-based system. A generic platform will support the implementation of these tools by extending the program supervision ones. The specifications of the new calibration engine, following the problem solving method in Fig. 3, have been completed and the implementation is under way. We reuse the notions of operators and rules from program supervision for the simulate, initialise and compare subtasks of calibration. Checking the resulting system with real data will allow us not only to test calibration results but also to improve the encapsulated knowledge, which should mimic an experienced modeller's reasoning in all situations. We intend to use existing river models (e.g., on the Ardèche river) that have been built with different simulation codes (Mage, Hec-Ras). We will compare the problem solving processes with and without support: since simulation code use is difficult to grasp, we expect that a knowledge layer will reduce the time spent by end-users without impairing the quality of the results.
The presented approach allows experts to share and transmit their knowledge and favours an optimal use of simulation codes. The complementarity between numerical and symbolic components makes the resulting system highly flexible and adaptable. Data flow processing is thus automated while taking into account the specificities of each case study. Our approach appears to be a relevant alternative to current automatic calibration methods, since it includes knowledge regarding both physical phenomena and numerical settings, and may thus avoid equifinality pitfalls. As already mentioned, this approach, while focusing on the hydraulic domain, is general and can be applied with minimal modifications to any physical process modelling that requires calibration.
References
[1] Faure, J.B., Farissier, P., Bonnet, G.: A Toolbox for Free-Surface Hydraulics. In: Proceedings of the Fourth International Conference on Hydroinformatics, Cedar Rapids, Iowa, Iowa Institute of Hydraulic Research (2000)
[2] Abbott, M.B., Babovic, V.M., Cunge, J.A.: Towards the Hydraulics of the Hydroinformatics Era. Journal of Hydraulic Research 39 (2001) 339-349
[3] Scholten, H., van Waveren, R.H., Groot, S., van Geer, F.C., Wösten, J.H.M., Koeze, R.D., Noort, J.J.: Good Modelling Practice in Water Management. In: Proceedings of the Fourth International Conference on Hydroinformatics, Cedar Rapids, Iowa, Iowa Institute of Hydraulic Research (2000)
[4] Cunge, J.A., Holly, F.M., Verwey, A.: Practical Aspects of Computational River Hydraulics. Volume 3 of Monographs and Surveys in Water Resources Engineering. Pitman, London, UK (1980)
[5] Khatibi, R.H., Williams, J.J.R., Wormleaton, P.R.: Identification Problem of Open-Channel Friction Parameters. Journal of Hydraulic Engineering 123 (1997) 1078-1088
[6] Beven, K.J.: Prophecy, Reality and Uncertainty in Distributed Hydrological Modeling. Advances in Water Resources 16 (1993) 41-51
[7] Chau, K.W.: Calibration of Flow and Water Quality Modeling Using Genetic Algorithm. In McKay, B., Slaney, J.K., eds.: Proceedings of the 15th Australian Joint Conference on Artificial Intelligence. Volume 2557 of Lecture Notes in Artificial Intelligence, Canberra, Australia, Springer (2002) 720
[8] Passone, S., Chung, P.W.H., Nassehi, V.: The Use of a Genetic Algorithm in the Calibration of Estuary Models. In van Harmelen, F., ed.: ECAI 2002 - Proceedings of the Fifteenth European Conference on Artificial Intelligence. Volume 77 of Frontiers in Artificial Intelligence and Applications, Lyon, France, IOS Press (2002) 183-187
[9] Picard, S., Ermine, J.L., Scheurer, B.: Knowledge Management for Large Scientific Software. In: Proceedings of the Second International Conference on the Practical Application of Knowledge Management PAKeM'99, London, UK, The Practical Application Company (1999) 93-114
[10] Amdisen, L.K.: An Architecture for Hydroinformatic Systems Based on Rational Reasoning. Journal of Hydraulic Research 32 (1994) 183-194
[11] Hornung, U.: Mathematical Aspects of Inverse Problems, Model Calibration, and Parameter Identification. The Science of the Total Environment 183 (1996) 17-23
[12] Thonnat, M., Moisan, S., Crubézy, M.: Experience in Integrating Image Processing Programs. In Christensen, H.I., ed.: Proceedings of the International Conference on Computer Vision Systems, ICVS'99. Volume 1542 of Lecture Notes in Computer Science, Las Palmas, Spain, Springer (1999) 200-215
[13] Chau, K.W.: Manipulation of Numerical Coastal Flow and Water Quality Models. Environmental Modelling and Software 18 (2003) 99-108
Using Artificial Neural Networks for Combustion Interferometry
Victor Abrukov, Vitaly Schetinin, and Pavel Deltsov
Chuvash State University, Moskovsky prosp. 15, 428015 Cheboksary, Russia
[email protected]
Abstract. We describe an application of artificial neural networks (ANNs) to combustion interferometry. Using an ANN, we calculate the integral and local characteristics of a flame from an incomplete set of features characterizing its interferometric images. The method performs these calculations faster than standard analytical approaches. In the future, it can be used in automated systems to control and diagnose combustion processes.
1
Introduction
Interferometry has wide possibilities in combustion research. It allows the simultaneous determination of local characteristics, such as the temperature or density field in a flame, as well as integral characteristics, such as the mass of the flame, the Archimedean force acting upon it, and the quantity of heat in it [1-3]. Other integral characteristics that can be extracted from interferometric images include the non-stationary mass burning rate and heat release power during ignition, the heat release rate, the force of the powder, the change of the mechanical impulse of a non-stationary gas flow, the mechanical impulse of the arising flow, and the profile of the heat release rate in the stationary burning wave [1, 3]. However, to determine these characteristics it is first necessary to measure a large set of discrete values of the phase difference distribution over the interferogram plane, S(x,y). The results often have a subjective nature and depend greatly on the user's experience in interferometric measurement. Moreover, measuring many values of S(x,y) is labour-intensive. These circumstances limit the use of interferometry in the quantitative analysis of combustion. We note also that, from the point of view of interferometric requirements, the mentioned characteristics of a flame can be determined only under ideal experimental conditions; under many real experimental conditions it is impossible to measure S(x,y) accurately and thus to exploit the full possibilities of interferometry. In any case, after the accurate measurement of many values of S(x,y), it is necessary either to integrate S(x,y) over the interferogram plane (a direct task, for the determination of integral characteristics) or to differentiate S(x,y) (an inverse task, for the determination of local characteristics). As one can see, the full realization of the possibilities of interferometry requires solving problems that are far from simple and not always solvable. In this paper we describe methods of applying artificial neural networks (ANNs) to problems of combustion interferometry under real experimental conditions. The main problem addressed in our work is how to determine the integral and local characteristics of a flame without measuring the complete set of data about the features of the interferometric image, first of all without measuring a large number of values of S(x,y): either by means of an incomplete set of values of S(x,y) (for the determination of local characteristics) or even without them (for the determination of integral characteristics). This problem is important both for combustion research under real experimental conditions and for the creation of automated systems for the diagnostics and control of combustion processes.
2
The Artificial Neural Network Method
Using ANNs we can solve problems that have no algorithmic solutions. ANNs are applicable when problems are multivariate and the relationships between variables are unknown. ANN methods can induce realistic models of complex systems from experimental data. In special cases, ANNs are able to induce relations between variables that can be represented in a knowledge-based form [4].
2.1
Integral Characteristics of Flame. Direct Task
The integral characteristics that we consider here are the mass of the flame, m, the Archimedean lifting force acting on the flame, Fa, and the quantity of heat in the flame, H. We use the following geometrical parameters of the interferometric images: the maximal height, h, and width, w, of the image, its square (area), s, and its perimeter, e. We consider these parameters as an incomplete set of features characterizing the interferometric images because, in the usual case, the complete set of values of S(x,y) would have to be measured. In our work, we used the feed-forward ANN Neural Networks Wizard 1.7 (NNW) developed by BaseGroup Labs [4]. For training the NNW, we used various combinations of the above-listed features together with the integral characteristics. The integral characteristics to be calculated are assigned as the target variables. Under these assumptions we train the NNW on the input data to calculate the desired integral characteristics of flames. The number of hidden neurons was 8 and the learning rate was 0.1. Learning stops when the error is less than 0.01. Table 1 presents six sets of the integral characteristics (m, Fa, H) determined by the usual numerical method for six interferometric images of a flame, as well as the six corresponding sets of geometrical parameters of the images (h, w, s, e).
686
Victor Abrukov et al.
Table 1. Sets of values of the features characterizing interferometric images (images geometrical parameters) and integral characteristics of flames
No.   w, cm (10^-2)   h, cm (10^-2)   S, cm2   e, cm   m, g (10^-3)   Fa, dynes   H, J
1     65              65              0.32     1.99    0.08           0.08        0.024
2     88              85              0.62     2.77    0.22           0.22        0.070
3     108             147             1.26     4.08    0.52           0.56        0.175
4     130             222             2.25     5.61    1.25           1.06        0.324
5     142             274             2.98     6.68    1.90           1.42        0.430
6     151             341             3.91     8.11    2.67           1.90        0.575
The geometrical parameters served as input data and the integral characteristics (m, Fa, H) as output data. Five sets were used for training the ANN and set No. 4 was used for testing it. We used 12 different combinations of the geometrical parameters for each set of integral characteristics. One of the test results is shown in Fig. 1. The horizontal line in Fig. 1 is the target value. The vertical columns correspond to the calculated outputs of the ANN, each column corresponding to a particular set of geometrical parameters; for example, the label hse means that the height h, the square s and the perimeter e were used as input data. This example shows that the ANN can calculate the integral characteristics of a flame successfully. Analysing this result, we can see that the accuracy of the calculation depends on the combination of features. For example, if the ANN uses a combination of the height and perimeter of the image, the error is minimal; conversely, the error is large if the width and height of the image are combined. A more detailed analysis shows that combinations of features which include the width and square of the image give a large error, whereas a smaller error is achieved with combinations including the height and perimeter of the image. We therefore conclude that the height and perimeter of the image are more essential for the calculation of the integral characteristics of a flame than its width and square. Fig. 2 depicts the significance of the geometrical parameters for the calculation of the integral characteristics: the first column corresponds to the average error of the calculations for combinations that include the width of the image, the second column to the average error for combinations that include the height, and so on.
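As an illustration of this training set-up, the sketch below fits a small feed-forward network (one hidden layer of 8 neurons and a learning rate of 0.1, as quoted above) to the data of Table 1, holding out set No. 4; the use of scikit-learn and the remaining hyper-parameters are this sketch's own choices, not the NNW configuration.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Data from Table 1: w, h (10^-2 cm), S (cm^2), e (cm) -> m (10^-3 g), Fa (dyn), H (J)
X = np.array([[65, 65, 0.32, 1.99], [88, 85, 0.62, 2.77], [108, 147, 1.26, 4.08],
              [130, 222, 2.25, 5.61], [142, 274, 2.98, 6.68], [151, 341, 3.91, 8.11]])
Y = np.array([[0.08, 0.08, 0.024], [0.22, 0.22, 0.070], [0.52, 0.56, 0.175],
              [1.25, 1.06, 0.324], [1.90, 1.42, 0.430], [2.67, 1.90, 0.575]])

X = X / X.max(axis=0)              # normalisation xi = zi / max(zi), as in the paper
test = 3                           # set No. 4 held out for testing
train = [i for i in range(6) if i != test]

net = MLPRegressor(hidden_layer_sizes=(8,), learning_rate_init=0.1,
                   solver="adam", max_iter=5000, random_state=0)
net.fit(X[train], Y[train])
print("predicted m, Fa, H for set 4:", net.predict(X[test:test + 1])[0])
print("target    m, Fa, H for set 4:", Y[test])
```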
Fig. 1. Results of the calculation of the quantity of heat (J) for different combinations of input features (e, h, he, w, wh, whe, hs, hse, s, se, whs, whse)
Fig. 2. Errors of calculation for the geometrical parameters and integral characteristics of flames (average errors of approximately 8.3% for combinations including w, 6.8% for s, and 4.3-4.4% for h and e)
Fig. 2 shows that using the height and perimeter of the images improves the accuracy by a factor of about 1.5 compared with using their square and by a factor of about 2 compared with using their width. Our results show that, in order to calculate the integral characteristics of a flame, the following combinations of geometrical parameters of the images should be used: e, h, he, whe, hs. In some cases we are interested in an analytical representation of the relations induced from experimental data representing the inputs and outputs of real systems. Below we describe an analytical model of the flame mass for the ignition of a powder by laser radiation:

m = 0.0534 + 1.5017 x1 + 2.3405 x2 - 2.5898 x3 - 0.3031 x4,   (1)

where x1, ..., x4 are the normalized width, height, square and perimeter of the flame image, respectively, xi = zi / max(zi), zi being the current value and max(zi) the maximum value of a variable. This ANN model has no hidden neurons; a standard least-squares error method was used to create it. The coefficients in Equation (1) can obviously be interpreted as the contributions of the variables x1, ..., x4 to the value of m. This equation is an example of an interferometric diagnostics model.
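A model of the form of Eq. (1), i.e. a network without hidden neurons, can be fitted by ordinary least squares; the sketch below is a generic illustration and does not reproduce the authors' training data.

```python
import numpy as np

def fit_linear_model(X, y):
    """Fit m = c0 + c1*x1 + ... + c4*x4 by ordinary least squares.
    X: rows of normalised (w, h, s, e); y: corresponding flame masses."""
    A = np.column_stack([np.ones(len(X)), X])        # intercept + x1..x4
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coeffs                                    # [c0, c1, c2, c3, c4]

def predict_mass(coeffs, x):
    return coeffs[0] + np.dot(coeffs[1:], x)
```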
2.2
The Local Characteristics of Flame. Optical Inverse Problem
In this section we apply Neural Networks Wizard 1.7 (NNW) to an inverse problem of optics and demonstrate an example of the calculation of a refractive index distribution in a flame. We consider the determination of the values of an integrand on the basis of incomplete values of the integral, and examine the Abel integral equation for the case of cylindrical symmetry. The main problem is how to solve the inverse problem using an incomplete set of features representing the phase difference distribution function in the interferogram plane: only values of the integral are used, from which all values of the integrand, i.e. the refractive index distribution and hence the temperature distribution in the flame, are to be calculated. To solve this problem we write the dimensionless Abel equation:

S(p) = 2 ∫[0, √(1-p²)] (n0 - n(r)) dz
Victor Abrukov et al.
where z2 + p2 = r2, z is a ray's path in the object, p is an aim distance (0 < p < 1), and r is a variable radius, see Fig.3. p 1 p
r 1
z
Fig. 3. Geometrical interpretation
Using this equation, we can then calculate the integrals S(p) from different integrands of form nо – n(r) = 1 + ar – br2, where a and b are the given constants. In total we used seven different integrands that reflect a real distribution of refractive index distribution in flames. The training data were as follows. For various r, values nо – n(r) were calculated. Then the integrals of each of these seven integrands were calculated. For example, for the function nо – n(r) = 1 + 4r – 5r2, we can write the following expression 1 + 1− p 2 8 10 S(p ) = 1 − p 2 ⋅ − p 2 + 2p 2 ⋅ ln 3 3 1− 1− p2 .
Here values of S(p) are calculated for different p. The input data for training the NNW were values of S(p), p and r. Values of integrand corresponding to each radius are the target values. In total, we used about 700 sets of these parameters. The number of hidden layers was 5, the number of neurons in them was 8, the learning rate was 0.1, the condition when the maximum errors during both learning and training is less than 0.01 were choused as the stopping conditions. The results of testing the NNW for an integrand nо – n (r) =1 + 0.5r – 1.5r2 are represented in Fig. 4.
Fig. 4. Relative errors of ANN integrand calculation
Relative errors of the integrand calculated in the interval of p and r between 0.1 and 0.9 were lower than 6%. We have also executed the complete LV1 (leave-one-out)
procedure in this computational experiment. The average errors over all seven tests were in the narrow range from 5% to 6%. A more complete train-test experiment using different initial weights of the neural networks will be carried out in the near future. We also explored the robustness of the calculation to errors in the input data: the calculation error increases by 3% per 2% of noise added to the input data, which indicates good stability. In total, the results show that the ANN can solve the inverse problem with acceptable accuracy in the case of cylindrical symmetry. The determination of a single integral value (and of its change over time in the case of non-stationary objects) can be automated; it is therefore possible to use a cheap ANN (microprocessor) embedded into automated control systems. Further perspectives of this work concern the implementation of neural networks for solving the inverse optical problems of other optical methods, as well as solving practical problems of the control of combustion processes.
3
Conclusions
The ANN applied to combustion interferometry performs well. Despite an incomplete set of features characterizing the flame images, we can solve both the direct and the inverse optics problems required for the diagnostics of combustion processes. The main advantages of the ANN application are:
1. It becomes possible to calculate the distribution of local thermodynamic characteristics, including the density of separate components, by measuring at a single point of the signal registration plane, for example in the case of the laser-diode technique [5].
2. The ANN method does not require additional real experiments for solving inverse problems: an ANN model for an inverse problem can be obtained from a knowledge base created with fairly simple numerical calculations, and the same approach can be applied to the direct task (determination of the integral characteristics of flames and of other objects, including industrial ones).
3. The method can be extended to other optical methods for which the data are integrated along a line of registration.
4. For diagnostics purposes, the ANN can be applied to the analysis of optical images and signals of flows in various types of engine, in particular in the very promising pulse detonation engine [5].
So, we conclude that the ANN method allows us to extend significantly the possibilities of optical diagnostics techniques and to improve control systems for combustion processes as well as for other industrial processes.
References
[1] Abrukov, V.S., Ilyin, S.V., Maltsev, V.M., Andreev, I.V.: Proc. VSJ-SPIE Int. Conference, AB076 (1998). http://www.vsj.or.jp/vsjspie/
[2] Abrukov, V.S., Andreev, I.V., Kocheev, I.G.: J. Chemical Physics 5 (2001) 611 (in Russian)
[3] Abrukov, V.S., Andreev, I.V., Deltsov, P.V.: Optical Methods for Data Processing in Heat and Fluid Flow. Professional Engineering Publishing, London (2002) 247
[4] The BaseGroup Laboratory, http://www.basegroup.ru
[5] Sanders, S.T., Jenkins, T.P., Baldwin, J.A., et al.: Control of Detonation Processes. Moscow, Elex-KM Publisher (2000) 156-159
An Operating and Diagnostic Knowledge-Based System for Wire EDM
Samy Ebeid¹, Raouf Fahmy², and Sameh Habib²
¹ Faculty of Engineering, Ain Shams University, 11517 Cairo, Egypt, [email protected]
² Shoubra Faculty of Engineering, Zagazig University, Egypt
Abstract. The selection of machining parameters and machine settings for wire electrical discharge machining (WEDM) depends mainly on both technologies and experience provided by machine tool manufacturers. The present work designs a knowledge-based system (KBS) to select the optimal process parameters settings and diagnose the machining conditions for WEDM. Moreover, the present results supply users of WEDM with beneficial data avoiding further experimentation. Various materials are tested and sample results are presented for improving the WEDM performance.
1
Introduction
Wire electrical discharge machining (WEDM), shown in Fig. 1, is a special form of electrical discharge machining in which the electrode is a continuously moving conductive wire. The mechanism of material removal involves the complex erosion effect of electric sparks generated by a pulsating direct-current power supply. With WEDM technology, complicated, difficult-to-machine shapes can be cut. The high degree of accuracy obtainable and the fine surface finishes make WEDM valuable [1]. One of the serious problems of WEDM is wire breakage during the machining process. Wire rupture increases the machining time, decreases accuracy and reduces the quality of the machined surface [2,3,4]. WEDM is a thermal process in which the electrodes experience intense local heat in the vicinity of the ionized channel. While a high erosion rate of the workpiece is a requirement, removal of the wire material leads to its rupture. Wire vibration is another important problem and leads to machining inaccuracy [5,6,7]. There is a great demand to improve machining accuracy in WEDM as it is applied to the manufacture of intricate components. The term knowledge engineering has appeared quite recently and is now commonly used to describe the process of building knowledge-based systems [8,9]. The knowledge base is obviously the heart of any KBS. The process of collecting and structuring knowledge in a problem domain is called data acquisition. As the user consults the expert system, answers and results build up in the working memory. In order to make use of the expertise contained in the knowledge base, the
system must possess an inference engine, which can scan facts and rules and provide answers to the queries put to it by the user. Shells are the most widely available tools for programming knowledge, and using a shell can speed up the development of a KBS. Snoeys et al. [10] constructed a KBS for the WEDM process. Three main modules were constructed, namely: work preparation, operator assistance and fault diagnosis, and process control. The software was built in the PROLOG language. They concluded that the knowledge-based system improved the performance of the WEDM process. Sadegh Amalink et al. [11] introduced an intelligent knowledge-based system for evaluating wire electro-erosion dissolution machining in a concurrent engineering environment using the NEXPERT shell. The KBS was used to identify the conditions needed to machine a range of design features, including various hole shapes with required surface roughnesses.
Fig. 1. Principle of WEDM
2
Experimental Work
The present experiments have been performed on a high-precision 5-axis CNC WED machine commercially known as Robofil 290, manufactured by Charmilles Technologies Corporation. The Robofil 290 allows the operator to choose input parameters according to the material, the height of the workpiece and the wire material from a manual supplied by the manufacturer. The machine uses wire diameters ranging from 0.1 to 0.3 mm; the wire used in the present tests is hard brass of 0.25 mm diameter, at a wire travelling speed between 8 and 12 m/min. To construct the knowledge-based system, the experimental work was planned for four categories of materials. The first is alloy steel 2417 (55 Ni Cr Mo V7), while the second category includes two types of cast iron. The third category is the aluminum alloys (Al-Mg-Cu and Al-Si-Mg 6061). Conductive composites are the fourth category, of which seven types are selected (see Table 1).
3
Knowledge-Based System Design
The proposed WEDM knowledge-based system is based on the production rule system of representing knowledge. The system has been built in “CLIPS” shell
software. CLIPS is a productive development and delivery expert system tool which provides a complete environment for the construction of rule- and/or object-based expert systems [12]. In a CLIPS program, knowledge is represented by (defrule) statements instead of (IF...THEN) statements. A number of such (defrule) rules, applied in forward-chaining mode, have been built into the proposed KBS. The general layout of the architecture of the proposed system for the WEDM process is depicted in Fig. 2. Four main modules are designed, namely: machining parameters, WEDM machine settings, problem diagnosis, and machine technology database.

Table 1. Types of composite materials
No.   Al   Cu   Gr   SiC   SiO2   Al2O3   Zn
1     -    R    10   -     -      -       -
2     R    -    4    10    -      -       -
3     R    -    4    15    -      -       -
4     R    -    4    30    -      -       -
5     R    -    4    -     10     -       -
6     R    -    -    -     -      24      -
7     5    -    -    -     -      -       R
Values are in %. (R = remainder)

3.1 System Modules
3.1.1 Machining Parameters Module
This module includes the required workpiece information such as the material to be machined, its height and the desired final machining quality. The model is based upon the following:

CS = Vf * H   (1)

MRR = (dw + 2 Sb) * Vf * H   (2)

where CS = cutting speed (mm²/min), dw = wire diameter (mm), Sb = spark gap (mm), Vf = machining feed rate (mm/min), H = workpiece height (mm), and MRR = material removal rate (mm³/min).

3.1.2 WEDM Machine Settings Module
The machine settings module includes the required parameters for operating the machine, such as pulse frequency, average machining voltage, injection flushing flow rate, wire speed, wire tension and spark gap. The equations of the above two modules are presented in a general polynomial form given by:

P = a H^4 + b H^3 + c H^2 + d H + e   (3)

where P denotes any of the KBS outputs, such as cutting speed, MRR, etc. The constants a to e have been calculated according to the type of material, the height of the workpiece and the roughness value.
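As a minimal illustration of how these relations could be evaluated inside such a module, the sketch below implements Eqs. (1)-(3); the numerical values in the example call are assumptions, and the polynomial coefficients themselves are stored in the KBS and are not given in the paper.

```python
def cutting_speed(feed_rate_mm_min, height_mm):
    """CS = Vf * H  (mm^2/min), Eq. (1)."""
    return feed_rate_mm_min * height_mm

def material_removal_rate(wire_dia_mm, spark_gap_mm, feed_rate_mm_min, height_mm):
    """MRR = (dw + 2*Sb) * Vf * H  (mm^3/min), Eq. (2)."""
    return (wire_dia_mm + 2.0 * spark_gap_mm) * feed_rate_mm_min * height_mm

def setting_polynomial(coeffs, height_mm):
    """P = a*H^4 + b*H^3 + c*H^2 + d*H + e, Eq. (3); coeffs = (a, b, c, d, e)
    are the material- and roughness-dependent constants stored in the KBS."""
    a, b, c, d, e = coeffs
    H = height_mm
    return a * H**4 + b * H**3 + c * H**2 + d * H + e

# Example with the wire used in the tests (0.25 mm diameter) and assumed
# feed rate, spark gap and workpiece height:
print(cutting_speed(2.0, 50.0))                       # mm^2/min
print(material_removal_rate(0.25, 0.03, 2.0, 50.0))   # mm^3/min
```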
3.1.3 Problem Diagnosis Module
The problems and faults occurring during the WEDM process are very critical when dealing with intricate and accurate workpieces. Consequently, faults and problems must be solved quickly in order to minimize wire rupture and dead times. The program suggests the possible faults that cause these problems and recommends to the operator the possible actions that must be taken. The correction of these faults might be a simple repair or a change of some machine settings such as feed rate, wire tension, voltage, etc.
3.1.4 Machine Technology Database Module
This module contains a database for recording and organizing the heuristic knowledge of the WEDM process. The machining technology data have been stored in the form of facts. The reasoning mechanism of the proposed KBS proceeds in the same manner as that of the operator in order to determine the machine settings. The required information on workpiece material type, height and required surface roughness is acquired through keyboard input by the operator. On the basis of these data, the KBS then searches through the technology database and determines the recommended WED machine settings. The flow chart of the KBS program is shown in Fig. 3. The knowledge acquisition in the present work has been gathered from:
1. Published data of WEDM manuals concerning machining conditions for some specific materials such as steel, copper, aluminum, tungsten carbide and fine graphite.
2. Experimental tests: the system has been built according to the number of workpiece materials = 12 (refer to the experimental work section), the number of surface roughness grades = 2 (ranging between 2.2 and 4 µm, depending on each specific material), and the number of workpiece heights for each material = 5 (namely 10, 30, 50, 70 and 90 mm).
Fig. 2. Architecture of KBS.
Fig. 3. Flow chart of KBS
4
Results and Discussion
Figure 4 shows a sample result, taken from the above twelve materials, for the variation of the cutting speed with workpiece height for two grades of surface roughness, namely Ra = 3.2 and Ra = 2.2 µm, for alloy steel 2417. The results show that the cutting speed increases until it reaches its maximum value at 50 mm thickness and then begins to decrease slightly for both grades of roughness. The results also indicate that the cutting speed values for Ra = 3.2 µm are higher than those for Ra = 2.2 µm. Nevertheless, the work of Levy and Maggi [13] did not show any dependence of roughness on wire feed rate. Figures 5 and 6 show the variation of the material removal rate with cutting speed for alloy steel 2417 and Al 6061, respectively, for two grades of roughness. Both charts indicate a linear increase of the MRR with cutting speed. Figure 5 shows that the optimal zone for machining steel 2417 lies between about 40-70 mm²/min (Ra = 3.2 µm) and 20-40 mm²/min (Ra = 2.2 µm) for workpiece thicknesses ranging from 10 to 90 mm, whereas for Al 6061 this zone shows higher values, between 120-180 mm²/min (Ra = 4 µm) and 50-90 mm²/min (Ra = 2.2 µm), for thickness values from 10 to 60 mm.
The variation of the spark gap with cutting speed for both alloy steel 2417 and Al 6061 is shown in Fig. 7 for Ra = 3.2 and 4 µm, respectively. The spark gap increases directly with cutting speed, thus leading to a wider gap. This result is in accordance with the work of Luo [14] for cutting speeds up to 200 mm²/min. It was observed during the tests that the variation of the spark gap with cutting speed at Ra = 2.2 µm was nearly negligible for both steel 2417 and Al 6061.
Fig. 4. Variation of cutting speed with workpiece thickness
Fig. 5. Variation of MRR with cutting speed for steel 2417
Fig. 6. Variation of MRR with cutting speed for Al 6061
Fig. 7. Variation of spark gap with cutting speed
5
Conclusions
The present results of the KBS supply users of WEDM with useful data, avoiding further experimentation. The designed system enables operation and diagnosis of the WEDM process. The system proves to be reliable and powerful, as it allows fast retrieval of information and easy modification or appending of data. Sample results for alloy steel 2417 and Al 6061, out of the twelve tested materials, are presented in the form of charts to help WEDM users improve the performance of the process.
References
[1] Lindberg, R.A.: "Processes and Materials of Manufacture", Allyn and Bacon, Boston (1990) 796-802
[2] Rajurkar, K.P., Wang, W.M., McGeough, J.A.: "WEDM identification and adaptive control for variable height components", CIRP Ann. of Manuf. Tech., 43, 1 (1994) 199-202
[3] Liao, Y.S., Chu, Y.Y., Yan, M.T.: "Study of wire breaking process and monitoring of WEDM", Int. J. Mach. Tools and Manuf., 37, 4 (1997) 555-567
[4] Yan, M.T., Liao, Y.S.: "Monitoring and self-learning fuzzy control for wire rupture prevention in wire electrical discharge machining", Int. J. of Mach. Tools and Manuf., 36, 3 (1996) 339-353
[5] Dauw, D.F., Beltrami, I.: "High-precision wire-EDM by online wire positioning control", CIRP Ann. of Manuf. Tech., 43, 1 (1994) 193-196
[6] Beltrami, I., Bertholds, A., Dauw, D.: "A simplified post process for wire cut EDM", J. of Mat. Proc. Tech., 58, 4 (1996) 385-389
[7] Mohri, N., Yamada, H., Furutani, K., Narikiyo, T., Magara, T.: "System identification of wire electrical discharge machining", CIRP Ann. of Manuf. Tech., 47, 1 (1998) 173-176
[8] Smith, P.: "An Introduction to Knowledge Engineering", Int. Thomson Computer Press, London, UK (1996)
[9] Dym, C.L., Levitt, R.E.: "Knowledge-Based Systems in Engineering", McGraw-Hill, New York, USA (1991)
[10] Snoeys, R., Dekeyser, W., Tricarico, C.: "Knowledge based system for wire EDM", CIRP Ann. of Manuf. Tech., 37, 1 (1988) 197-202
[11] Sadegh Amalink, M., El-Hofy, H.A., McGeough, J.A.: "An intelligent knowledge-based system for wire-electro-erosion dissolution in a concurrent engineering environment", J. of Mat. Proc. Tech., 79 (1998) 155-162
[12] CLIPS user guide, www.ghgcorp.com/clips (1999)
[13] Levy, G.N., Maggi, F.: "WED machinability comparison of different steel grades", Ann. CIRP 39, 1 (1990) 183-185
[14] Luo, Y.F.: "An energy-distribution strategy in fast-cutting wire EDM", J. of Mat. Proc. Tech., 55, 3-4 (1995) 380-390
The Application of Fuzzy Reasoning System in Monitoring EDM
Xiao Zhiyun and Ling Shih-Fu
Department of Manufacturing and Production Engineering, Nanyang Technological University, Singapore 639798
Abstract. EDM (electrical discharge machining) is a very complicated and stochastic process. It is very difficult to monitor its working conditions effectively because adequate knowledge of the discharge mechanism is lacking. This paper proposes a new method to monitor this process. In this method, the electrical impedance between the electrode and the workpiece is taken as the monitoring signal. By analyzing this signal and using a fuzzy reasoning system as classifier, sparks and arcs are differentiated effectively, which is difficult with other conventional monitoring methods. The proposed method first partitions the collected voltage and current trains into separate pulses using the continuous wavelet transform, then applies the Hilbert transform and calculates the electrical impedance of each pulse. Features are then extracted from this signal to form a feature vector. Finally, a fuzzy logic reasoning system is developed to classify the pulses as sparks or arcs.
1
Introduction
Discharge pulses in EDM are commonly classified into five kinds: open, spark, arc, short-circuit and off [1]. Open, short-circuit and off stages can be distinguished easily thanks to their distinctive characteristics, but it is very difficult to distinguish sparks and arcs effectively because they are quite similar in many respects. Traditionally, the signals used to monitor EDM include the magnitude of the voltage and current, the high-frequency components of the discharge, the delay time of the discharge pulses and the radio frequency during discharge. Although these signals have been widely used in the literature, the results are not good enough, especially for differentiating sparks and arcs [2-3]. Electrical impedance is an inherent characteristic of an electrical system: it does not change with the input signals, and its change reflects changes of the electrical system itself. Though EDM is a complicated process, it can still be regarded approximately as an electrical system, because the discharge is the result of the applied voltage. As the discharge is so complicated and knowledge of the discharge mechanism is limited, it is quite difficult to find an accurate mathematical model for sparks and arcs. Fuzzy reasoning systems have advantages in processing fuzzy and uncertain information and have many successful applications in the engineering field.
This paper proposes a new monitoring method for EDM: by deriving a new monitoring signal and extracting features from it, a fuzzy reasoning system is used to classify the pulses into sparks and arcs correctly. In Section 2, the wavelet transform and the partition of the discharge process are studied. In Section 3, the definition and calculation of the electrical impedance signal are given. In Section 4, the extraction of features from the impedance and voltage signals is introduced. In Section 5, a Takagi-Sugeno fuzzy reasoning system is developed to differentiate sparks and arcs. The last section gives the conclusions.
2
Signal Segmentation
The signal segmentation method we developed partitions the voltage and current of the whole discharge process into time slices representing separate discharge pulses. Theoretically, between every two pulses there is an off stage with no voltage and no current, during which the dielectric fluid is de-ionized and prepared for the next discharge. We can regard this off stage as the edge of a discharge pulse, a kind of singularity that commonly appears both in one-dimensional signals and in two-dimensional images. Until recently, the Fourier transform was the main mathematical tool for analyzing singularities. The Fourier transform is global and provides a description of the overall regularity of signals, but it is not well adapted to finding the location and spatial distribution of singularities. The wavelet transform is a remarkable mathematical tool for analyzing singularities, including edges: by decomposing signals into elementary building blocks that are well localized both in space and in frequency, the wavelet transform can characterize the local regularity of signals and, further, detect singularities effectively. Significant studies related to this topic have been carried out by Mallat, Hwang and Zhong [4][5]. The continuous wavelet transform of a function is defined as

W_f(a, b) = |a|^(-1/2) ∫ f(t) h((t - b)/a) dt   (1)

where f(t) is the function to be transformed, a is the dilation (scale), b is the translation, and h(·) is the basic (mother) wavelet.
Wavelet analysis provides a windowing technique with variable-sized regions, in contrast to other methods of signal analysis: it employs longer time intervals where more precise low-frequency information is required and shorter regions for high-frequency information. This allows wavelet analysis to capture aspects like trends, breakdown points, discontinuities in higher derivatives and self-similarity that are missed by other signal analysis techniques. Since singularities carry high-frequency information, the low-scale CWT coefficients can be used to locate them accurately. The calculation of the wavelet transform is a key step in applications. A fast algorithm for computing W_s f(x) for detecting edges and reconstructing the signal can
be found in [6] when ψ(x) is a dyadic wavelet as defined in that article. However, in edge detection the reconstruction of signals is not required; therefore the choice of the wavelet function is not restricted to the conditions presented in [6], and many wavelets other than dyadic ones can be utilized. In fact, almost all general integral wavelets suit this particular application. Figure 1 shows the voltage and current signals from the experiment. The experimental settings are: open voltage 80 V, peak current 1.0 A, on-time 2.4 µs, off-time 0.8 µs, and a sampling frequency of 10 MHz. Figure 2 shows the segmentation result; the second part of this figure shows the CWT coefficients at scale 5. It shows that pulses 1 to 6 can be extracted from the original signal accurately.
Fig. 1. Original voltage and current
Fig. 2. Segmentation result
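A possible realisation of this segmentation step is sketched below with a hand-rolled Mexican-hat CWT; the wavelet choice, the scale-5 setting and the simple thresholding of the coefficient magnitude (rather than tracking modulus maxima as in [4][5]) are assumptions of this sketch, not details taken from the paper.

```python
import numpy as np

def mexican_hat(n, scale):
    """Mexican-hat (Ricker) wavelet sampled on n points for a given scale."""
    t = np.arange(n) - (n - 1) / 2.0
    x = t / scale
    return (1.0 - x ** 2) * np.exp(-x ** 2 / 2.0)

def cwt_row(signal, scale):
    """One row of the continuous wavelet transform (a single scale),
    computed by direct convolution."""
    w = mexican_hat(min(10 * int(scale) + 1, len(signal)), scale)
    return np.convolve(signal, w / np.sqrt(scale), mode="same")

def split_pulses(voltage, scale=5, threshold=None):
    """Return (start, end) index pairs of the separate discharge pulses.
    Off stages (no voltage, no current) show up as near-zero low-scale
    coefficients; `threshold` is a tuning parameter of this sketch, and the
    record is assumed to start in an off stage."""
    coeffs = np.abs(cwt_row(voltage, scale))
    if threshold is None:
        threshold = 0.1 * coeffs.max()
    active = coeffs > threshold
    edges = np.flatnonzero(np.diff(active.astype(int)))
    starts = edges[::2] + 1
    ends = edges[1::2] + 1
    return list(zip(starts, ends))
```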
3
Electrical Impedance of Discharge
Electrical impedance is a very important property of an electrical system. It is widely used to study stationary processes and is usually treated as a frequency-dependent quantity. For an electrical system, the impedance is calculated as voltage over current in the frequency domain: the ratio of the voltage and current amplitudes gives the absolute value of the impedance at the given frequency, and the phase shift between the current and the voltage gives its phase. The electrical impedance of an electrical process is defined as:

Z = V / I   (2)
The difficulty in calculating the impedance in the time domain is that the sensed voltage and current are AC signals and cannot be divided directly. To avoid this problem, we first convert the sensed voltage and current from their real representations into the corresponding complex representations; the converted signals are called analytic signals. The electrical impedance can then be calculated by dividing the analytic voltage by the analytic current. The electrical impedance can be expressed in complex form, with a real part and an imaginary part: the real part represents the resistance and the imaginary part represents the capacitance and inductance.
Z(t) = R(t) + jX(t)   (3)

It can also be expressed in magnitude and phase representation:

Z(t) = V(t)/I(t) = |Z(t)| e^(jφ(t)),   |Z(t)| = sqrt(R(t)² + X(t)²)   (4)

φ(t) = arctan(X(t)/R(t))   (5)
The commonly used transform is the FFT, but its result is the electrical impedance at a given frequency. In order to obtain the electrical impedance in the time domain and study its variation with time, we use the Hilbert transform to convert the measured voltage and current into their corresponding analytic signals and then calculate the electrical impedance. The Hilbert transform of a signal x(t) and its inverse are defined as:

x̂(t) = H[x(t)] = (1/π) ∫_{-∞}^{+∞} x(s)/(t - s) ds   (6)

x(t) = H⁻¹[x̂(t)] = -(1/π) ∫_{-∞}^{+∞} x̂(s)/(t - s) ds   (7)
This shows that the Hilbert transform and its inverse are defined using the kernel Φ(t, s) = 1/[π(s - t)] and the conjugate kernel Ψ(t, s) = 1/[π(t - s)]; that is, the kernels differ only by sign. Here the variable s is a time variable. As a result, the Hilbert transform of a function of time is another function of time with a different shape. Accordingly, the analytic signals corresponding to v(t) and i(t) are, respectively,

V(t) = v(t) + j v̂(t) = a(t) exp[j(ωt + φ1)]   (8)

I(t) = i(t) + j î(t) = b(t) exp[j(ωt + φ2)]   (9)
In the above two equations, v(t) and i(t) are the real signals, while v̂(t) and î(t) are their Hilbert transforms. After this transformation into complex form, the following calculation becomes possible for evaluating the electrical impedance:
Z(t) = |V(t)| / |I(t)| · e^(j(φ1 - φ2))   (10)
Figure 5 shows the electrical impedance of each pulse calculated using the Hilbert transform. In the next step, we extract features from these signals, which will be used as the inputs of the fuzzy classifier.
Fig. 5. Magnitude of electrical impedance
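A minimal sketch of this calculation using SciPy's analytic-signal routine is given below; the function names are illustrative, and the handling of near-zero current during off stages is left out for brevity.

```python
import numpy as np
from scipy.signal import hilbert

def pulse_impedance(v, i):
    """Analytic-signal impedance of one discharge pulse (Eqs. (8)-(10)).
    Returns the instantaneous magnitude |Z(t)| and phase of Z(t);
    assumes the current is nonzero within the pulse."""
    V = hilbert(np.asarray(v, dtype=float))   # v(t) + j*v_hat(t)
    I = hilbert(np.asarray(i, dtype=float))   # i(t) + j*i_hat(t)
    Z = V / I
    return np.abs(Z), np.angle(Z)

def average_impedance(v, i):
    mag, _ = pulse_impedance(v, i)
    return float(mag.mean())
```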
4
Features Extraction
In this step, we extract features from the electrical impedance and voltage signals and determine the decisive features for classification. We define a feature to be any property of the impedance or voltage signal within a pulse that is useful for describing the normal or abnormal behaviour of this pulse. Based on knowledge of electrical discharge machining and signal processing, we have found the following features to be useful in determining whether a pulse is a spark or an arc (a sketch assembling them into a feature vector follows the list):
- Voltage average: as the breakdown voltage is lower and the delay time shorter in an arc than in a spark, the average voltage of an arc should be smaller than that of a spark.
- Current average: because the dielectric is not fully de-ionized or the debris between the electrode and the workpiece is not flushed away effectively, the average current of an arc should be larger than that of a spark.
- Pulse duration: as arcs are easy to start, the shorter the duration, the more likely the pulse is an arc.
- Average impedance: this is the most important feature for differentiating sparks and arcs, because an arc is believed to occur at the same spot as the previous pulse, where the dielectric between the electrode and the workpiece has not been fully de-ionized.
- Delay time: extracted from the voltage signal, which can be obtained easily when partitioning the voltage signal using the low-scale coefficients; a spark normally shows a delay time before the discharge starts, whereas an arc has no observable delay time.
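The following short sketch assembles these five quantities into the feature vector used as input to the fuzzy classifier; the argument names and the separately measured delay time are assumptions of this sketch.

```python
import numpy as np

def pulse_features(v, i, z_mag, dt_us, delay_us):
    """Five-element feature vector for one segmented pulse.
    z_mag is the impedance magnitude from the previous step; dt_us is the
    sampling interval; delay_us is the measured ignition delay."""
    return np.array([
        np.mean(v),          # average voltage (V)
        np.mean(i),          # average current (A)
        np.mean(z_mag),      # average impedance (ohm)
        len(v) * dt_us,      # pulse duration (us)
        delay_us,            # delay time (us)
    ])
```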
Table 1 lists the values of these features for pulse 1 to pulse 6. In the next section, we use a fuzzy reasoning system, taking all the features as inputs, to differentiate sparks and arcs.

Table 1. Features extracted from pulse 1 to pulse 6
                   P1       P2       P3      P4      P5      P6
Average V (V)      61.448   74.098   24.733  28.256  22.027  25.638
Average I (A)       0.181    0.157    0.350   0.287   0.290   0.291
Average Im (Ω)    409.06   467.09    61.31   91.65   58.92   47.11
Duration (µs)      11.8      4.2      3.4     4.2     4.2     4.2
Delay time (µs)     7.2      3.2      0.6     0.9     0.8     1.6

5  Fuzzy Reasoning System
Fuzzy systems and neural networks have been widely and successfully used in monitoring and controlling EDM [8-9]. The theory of fuzzy logic is aimed at the development of a set of concepts and techniques for dealing with sources of uncertainty, imprecision and incompleteness. The nature of fuzzy rules and the relationship between fuzzy sets of different shapes provide a powerful capability for incrementally modeling a system whose complexity makes traditional expert systems, mathematical models, and statistical approaches very difficult to apply. The most challenging problem in differentiating sparks and arcs is that many characteristics of these two kinds of pulses are similar, and the knowledge about how to differentiate them is incomplete and vague due to the complexity of the discharge phenomena. This uncertainty leads us to seek a solution using a fuzzy logic reasoning method.

Fuzzy reasoning is performed within the context of a fuzzy system model, which consists of control variables, solution variables, fuzzy sets, proposition statements, and the underlying control mechanisms that tie all these together into a cohesive reasoning environment. The fuzzy rules can be completely characterized by a set of control variables X = {x1, x2, ..., xn} and solution variables y1, y2, ..., yk. In our application, we have five control variables corresponding to the extracted features and one solution variable representing the discharge state: a value of 1 indicates a normal discharge while a value of 0 indicates an abnormal discharge. Each control variable xi is associated with a set of fuzzy terms Σi = {α1^i, α2^i, ..., αpi^i}, and the solution variable has its own fuzzy terms. Each fuzzy variable is associated with a set of fuzzy membership functions corresponding to the fuzzy terms of the variable. A fuzzy membership function of a control variable can be interpreted as a control surface that responds to a set of expected data points. The fuzzy membership functions associated with a fuzzy variable can be collectively defined by a set of critical parameters that uniquely describe the characteristics of the membership functions, and the characteristics of an inference engine are largely affected by these critical parameters.
The Takagi-Sugeno fuzzy reasoning system was first introduced in 1985 [10]. It is similar to the Mamdani method in many respects; in fact the first two parts of the fuzzy inference process, fuzzifying the inputs and applying the fuzzy operator, are exactly the same. The main difference between Mamdani-type and Sugeno-type fuzzy inference is that the output membership functions of the Sugeno type are only linear or constant. A typical fuzzy rule in a zero-order Sugeno fuzzy model has the form

if x is A and y is B then z = k

where A and B are fuzzy sets in the antecedent, while k is a crisply defined constant in the consequent. When the output of each rule is a constant like this, the similarity with Mamdani's method is striking: the only distinctions are that all output membership functions are singleton spikes, and that the implication and aggregation methods are fixed and cannot be edited. The implication method is simply multiplication, and the aggregation operator just includes all of the singletons. In our application, as the fuzzy reasoning system has only one output (whether a pulse is a spark or an arc), it is more convenient to use a Sugeno-type reasoning system than the conventionally used Mamdani system.

The fuzzy reasoning system we constructed is shown in Figure 6. It has five inputs corresponding to the five features we obtained; every input has three fuzzy terms: low, medium and high. There is only one output, the discharge state, which has two values: zero corresponding to arc and one corresponding to spark. Table 2 gives the fuzzy reasoning results: pulse 1 and pulse 2 were classified as sparks, and pulse 3 to pulse 6 were classified as arcs.
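As an illustration only (not the authors' implementation), a zero-order Sugeno classifier of the kind described above can be sketched as follows; the triangular term parameters and the two example rules are invented placeholders, since the actual term limits and rule base were established experimentally.

def trimf(x, a, b, c):
    """Triangular membership value of x for the fuzzy term (a, b, c)."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def sugeno_discharge_state(features, rules, terms):
    """Zero-order Sugeno inference: weighted average of the rule constants.

    rules : list of (antecedent, k); antecedent maps a feature name to a
            fuzzy term, k is the crisp consequent (1 = spark, 0 = arc).
    terms : maps (feature name, term) to the (a, b, c) triangle parameters.
    """
    num = den = 0.0
    for antecedent, k in rules:
        w = 1.0
        for name, term in antecedent.items():
            # implication by multiplication, as in the Sugeno scheme above
            w *= trimf(features[name], *terms[(name, term)])
        num += w * k
        den += w
    return num / den if den else 0.0

# Placeholder fragments of a rule base (illustrative values only):
terms = {("avg_impedance", "high"): (100.0, 300.0, 500.0),
         ("avg_impedance", "low"):  (0.0, 50.0, 120.0),
         ("delay_time_us", "high"): (1.0, 5.0, 10.0),
         ("delay_time_us", "low"):  (-1.0, 0.0, 1.5)}
rules = [({"avg_impedance": "high", "delay_time_us": "high"}, 1.0),   # spark
         ({"avg_impedance": "low",  "delay_time_us": "low"},  0.0)]   # arc
# sugeno_discharge_state({"avg_impedance": 409.1, "delay_time_us": 7.2}, rules, terms)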
Fig. 6. Fuzzy reasoning system (inputs: average voltage, average current, average impedance, delay time, pulse duration; output: discharge state, 0 = arc, 1 = spark)

Table 2. Fuzzy reasoning results for pulse 1 to pulse 6
Pulses             1         2         3       4       5       6
Reasoning result   1(spark)  1(spark)  0(arc)  0(arc)  0(arc)  0(arc)

6  Conclusions
This paper proposed a new monitoring method for the EDM process. By analyzing the electrical impedance signal, extracting features from this signal, and using a fuzzy reasoning system as the classifier, the method can differentiate sparks and arcs effectively. The proposed method has the following advantages: (1) the off stage can be detected easily when segmenting the voltage and current signals; (2) each pulse is processed individually, so the monitoring result is precise and can be quantified, rather than giving only a varying trend as when frequency is used as the monitoring signal; (3) the method can effectively differentiate sparks and arcs; (4) the monitoring signal is an inherent characteristic of the EDM system, so the monitoring result is more credible; (5) the monitoring system is easier to implement, and its cost is lower because only the voltage and current across the discharge gap need to be collected.
References
[1] Summary specifications of pulse analyzers for spark-erosion machines. CIRP Scientific Technical Committee E, 1979.
[2] Weck, M., Dehmer, J.M.: Analysis and adaptive control of EDM sinking process using the ignition delay time and fall time as parameter. Annals of the CIRP, Vol. 41/1/1992.
[3] Ahmed, M.S.: Radio frequency based adaptive control for electrical discharge texturing. EDM Digest, Sept./Oct. 1987, pp. 8-10.
[4] Mallat, S., Hwang, W.L.: Singularity detection and processing with wavelets. IEEE Trans. on Information Theory, 38: 617-643, 1992.
[5] Mallat, S., Zhong, S.: Characterization of signals from multiscale edges. IEEE Trans. on Pattern Analysis and Machine Intelligence, 14(7): 710-732, 1992.
[6] Mallat, S.: Multiresolution approximations and wavelet orthonormal bases of L2(R). Trans. Amer. Math. Soc., 315: 69-87, 1989.
[7] Hahn, S.L.: Hilbert Transform in Signal Processing. Artech House, 1996.
[8] Tarng, Y.S., Tseng, C.M., Chung, L.K.: A fuzzy pulse discriminating system for electrical discharge machining. International Journal of Machine Tools and Manufacture, Vol. 37, No. 4, pp. 511-522, 1997.
[9] Kao, J.Y., Tarng, Y.S.: A neural network approach for the on-line monitoring of the electrical discharge machining process. Journal of Materials Processing Technology, 69 (1997) 112-119.
[10] Sugeno, M.: Industrial Applications of Fuzzy Control. Elsevier Science Pub. Co., 1985.
Knowledge Representation for Structure and Function of Electronic Circuits Takushi Tanaka Department of Computer Science and Engineering, Fukuoka Institute of Technology 3-30-1 Wajiro-Higashi Higashi-ku, Fukuoka 811-0295, Japan [email protected]
Abstract. Electronic circuits are designed as a hierarchical structure of functional blocks. Each functional block is decomposed into sub-functional blocks until the individual parts are reached. In order to formalize knowledge of these circuit structures and functions, we have developed a type of logic grammar called Extended-DCSG. The logic grammar not only defines the syntactic structures of circuits but also defines relationships between structures and their functions. The logic grammar, when converted into a logic program, parses circuits and derives their functions.
1  Introduction
Electronic circuits are designed as a hierarchical structure of functional blocks. Each functional block consists of sub-functional blocks. Each sub-functional block is also decomposed into sub-sub-functional blocks until the individual parts are reached. In other words, each functional block, even a resistor, has a special goal within its containing functional block. Each circuit as a final product can itself also be viewed as a functional block designed for a special goal for users. As these hierarchical structures are analogous to the syntactic structures of language, we developed a grammatical method for knowledge representation of electronic circuits [1]. In that study, each circuit was viewed as a sentence and its elements as words. Structures of functional blocks were defined by a logic grammar called DCSG (Definite Clause Set Grammar) [2]. DCSG is a DCG [3]-like logic grammar developed for analyzing word-order free languages. A set of grammar rules, when converted into Prolog clauses, forms a logic program which executes top-down parsing. Thus knowledge of circuit structures was successfully represented by DCSG, but knowledge of circuit functions could not be represented. In this study, we extend DCSG by introducing semantic terms into grammar rules. The semantic terms define relationships between syntactic structures and their meanings. By regarding functions as the meaning of structures, we can represent relationships between the functions and structures of electronic circuits.
2  DCSG

2.1  Word-Order Free Language
A word-order free language L(G’) is defined by modifying the definition of a formal grammar[2]. We define a context-free word-order free grammar G’ to be a quadruple < VN , VT , P, S > where: VN is a finite set of non-terminal symbols, VT is a finite set of terminal symbols, P is a finite set of grammar rules of the form: A −→ B1 , B2 , ..., Bn . (n ≥ 1) A ∈ VN , Bi ∈ VN ∪ VT (i = 1, ..., n) and S(∈ VN ) is the starting symbol. The above grammar rule means rewriting a symbol A not with the string of symbols “B1 , B2 , ..., Bn ”, but with the set of symbols {B1 , B2 , ..., Bn }. A sentence in the language L(G’) is a set of terminal symbols which is derived from S by successive application of grammar rules. Here the sentence is a multi-set which admits multiple occurrences of elements taken from VT . Each non-terminal symbol used to derive a sentence can be viewed as a name given to a subset of the multi-set. 2.2
DCSG Conversion
The general form of the conversion procedure from a grammar rule A −→ B1 , B2 , ..., Bn .
(1)
to a Prolog clause is: subset(A, S0 , Sn ) : − subset(B1 , S0 , S1 ), subset(B2 , S1 , S2 ), ... subset(Bn , Sn−1 , Sn ).
(1)’
Here, all symbols in the grammar rule are assumed to be a non-terminal symbol. If “[Bi ]”(1 ≤ i ≤ n) is found in the right hand side of grammar rules, where “Bi ” is assumed to be a terminal symbol, then “member(Bi , Si−1 , Si )” is used instead of “subset(Bi , Si−1 , Si )” in the conversion. The arguments S0 , S1 , ..., Sn in (1)’ are multi-sets of VT , represented as lists of elements. The predicate “subset” is used to refer to a subset of an object set which is given as the second argument, while the first argument is a name of its subset. The third argument is a complementary set which is the remainder of the second argument less the first; e.g. “subset(A, S0, Sn )” states that “A” is a subset of S0 and that Sn is the remainder. The predicate “member” is defined by the following Prolog clauses: member(M, [M |X], X). member(M, [A|X], [A|Y ]) : − member(M, X, Y ).
(2)
The predicate “member” has three arguments. The first is an element of a set. The second is the whole set. The third is the complementary set of the first.
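As an informal aside (not part of the original paper), the nondeterministic behaviour of "member" can be mimicked in Python by a generator that yields, for every occurrence of an element, the complementary set left after removing it:

def member(m, s):
    """Yield the complementary set for every occurrence of m in the list s.

    Mirrors the Prolog clauses above: on backtracking, member/3 enumerates
    each position where m occurs and returns the remainder of the multiset.
    """
    for k, x in enumerate(s):
        if x == m:
            yield s[:k] + s[k + 1:]

# Example: the two ways of removing 'b' from the multiset ['a', 'b', 'c', 'b']
# list(member('b', ['a', 'b', 'c', 'b']))  ->  [['a', 'c', 'b'], ['a', 'b', 'c']]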
3  Extended-DCSG
In order to define relationships between syntactic structures and their meanings, we introduce semantic terms into grammar rules. The semantic terms can be placed on both sides of a grammar rule. Differently from DCSG, we do not distinguish terminal symbols from non-terminal symbols in grammar rules, so that we can treat the meaning of words in the same way as the meaning of structures.

3.1  Semantic Term in Left-Hand Side
The semantic terms are placed in curly brackets as: A, {F1 , F2 , ..., Fm } −→ B1 , B2 , ..., Bn .
(3)
This grammar rule can be read that the symbol A with meaning {F1 , F2 , ..., Fm } consists of the syntactic structure B1 , B2 , ..., Bn . This rule is converted into a Prolog clause in Extended-DCSG as: ss(A, S0 , Sn , E0 , [F1 , F2 , ..., Fm |En ]) : − ss(B1 , S0 , S1 , E0 , E1 ), ss(B2 , S1 , S2 , E1 , E2 ), ... , ss(Bn , Sn−1 , Sn , En−1 , En ). (3)’ As the conversion is different from DCSG, we use predicate “ss” instead of “subset”. When the rule is used in parsing, the goal ss(A, S0 , Sn , E0 , E) is executed, where the variable S0 is substituted by an object set and the variable E0 is substituted by an empty set. The subsets “B1 , B2 , ..., Bn ” are successively identified in the object set S0 . After all of these subsets are identified, the remainder of these subsets (complementary set) is substituted into Sn . While, semantic information of B1 is added with E0 and substituted into E1 , semantic information of B2 is added with E1 and substituted into E2 ,..., and semantic information of Bn is added with En−1 and substituted into En . Finally, semantic information {F1 , F2 , ..., Fm } which is the meaning associated with symbol A is added, and whole semantic informations are substituted into E. 3.2
Semantic Term in Right-Hand Side
Semantic terms in right-hand side define semantic conditions for the grammar rule. For example, the following rule (4) is converted into the Prolog clause (4)’. A −→ B1 , {C1 , C2 }, B2 .
(4)
ss(A, S0, S2, E0, E2) :− ss(B1, S0, S1, E0, E1), member(C1, E1, _), member(C2, E1, _), ss(B2, S1, S2, E1, E2).
(4)’
When the clause (4)' is used in parsing, the conditions C1 and C2 are tested to check whether the semantic information E1 satisfies them after the symbol B1 has been identified. If this succeeds, the parsing process goes on to identify the symbol B2.
3.3  Terminal Symbol
Both terminal and non-terminal symbols are converted with the predicate “ss”. Only difference is that terminal symbols are defined by rules which do not have right-hand side to rewrite. The terminal symbol A with meaning {F1 , F2 , ..., Fm } is written as (5). A, {F1 , F2 , ..., Fm }.
(5)
This rule is converted into the following clause (5)’ in Extended DCSG. ss(A, S0 , S1 , E0 , [F1 , F2 , ..., Fm |E0 ]) : −member(A, S0 , S1 ).
(5)’
That is, when the rule (5) is used in parsing, the terminal symbol A is searched for in the object set S0. If it is found, the complementary set is returned in S1, and the semantic term {F1, F2, ..., Fm} associated with A is added to the current semantic information E0 to make the fifth argument of "ss". Thus, we can treat the meaning of words in the same manner as syntactic structures.
4  Knowledge Representation

4.1  Representation of Circuits
In the previous study [1], the circuit in Figure 1 is represented as the word-order free sentence (6). The compound term "resistor(r1, 2, 10)" is a word of the sentence; it denotes a resistor named r1 connecting node 2 and node 10. The word "npnTr(q1, 3, 5, 6)" denotes an NPN-transistor named q1 with the base connected to node 3, the emitter to node 5, and the collector to node 6.

[ resistor(r1, 2, 10), resistor(r2, 9, 1), npnTr(q1, 3, 5, 6), npnTr(q2, 4, 5, 7),
  npnTr(q3, 10, 1, 5), npnTr(q4, 10, 1, 10), npnTr(q5, 10, 1, 8), npnTr(q6, 8, 9, 2),
  ... , terminal(t4, 9), terminal(t5, 1) ]                                        (6)

4.2  Grammar Rules
The rule (7) defines an NPN-transistor in the active state as a terminal symbol. Its semantic term defines the relationships of voltages and currents in that state. Each compound term such as "gt(voltage(C, E), vst)" is a logical sentence about circuit functions; here, "gt(voltage(C, E), vst)" states that the voltage between C and E is greater than the collector saturation voltage. Similar rules are defined for the other states of a transistor.

npnTr(Q, B, E, C),
  { state(Q, active), gt(voltage(C, E), vst), equ(voltage(B, E), vbe),
    gt(current(B, Q), 0), gt(current(C, Q), 0),
    cause(voltage(B, E), current(B, Q)), cause(current(B, Q), current(C, Q)) }.
(7)
Fig. 1. Operational Amplifier cd42

The rule (8), which was originally introduced to refer to a resistor as a non-polar element [1], defines the causalities of voltage and current in the resistor.

res(R, A, B),
  { cause(voltage(A, B), current(A, R)), cause(current(A, R), voltage(A, B)) } −→
  ( resistor(R, A, B); resistor(R, B, A) ).
(8)
The rule (9) defines the simple voltage regulator in Figure 2. The semantic term on the left-hand side is the function of this structure: the circuit named vreg(D, R) controls the voltage between Out and Com. The right-hand side consists of the syntactic structures of the diode D and the resistor R, and a semantic term which specifies an electrical condition on the diode D.

vbeReg(vreg(D, R), Vp, Com, Out),
  { control(vreg(D, R), voltage(Out, Com)) } −→
  dtr(D, Out, Com), { state(D, conductive) },
  res(R, Vp, Out).                                                                (9)

Fig. 2. Voltage regulator
Fig. 3. Current Sink
The rule (10) is defined for the current sink in Figure 3. The semantic term on the left-hand side is the function of this structure; the right-hand side is the condition for this circuit. The disjunction ";" specifies either a structural condition or a semantic condition, which becomes useful when analyzing context-dependent circuits [1]. The next line specifies the connections of the transistor Q, and requires that the transistor Q be in the active state.

cSink(sink(VR, Q), In, Com),
  { control(sink(VR, Q), current(In, Q)) } −→
  ( vbeReg(VR, _, Com, B); { control(VR, voltage(B, Com)) } ),
  npnTr(Q, B, Com, In), { state(Q, active) }.                                    (10)

The rule (11) defines the active load (current mirror) in Figure 4. Functional blocks such as the common-emitter, emitter-follower, and differential amplifier are also defined as grammar rules.

activeLoad(al(D, Q), Ref, Vp, Ld),
  { control(al(D, Q), current(Q, Ld)),
    cause(current(al(D, Q), Ref), current(Q, Ld)),
    equ(current(Q, Ld), current(al(D, Q), Ref)) } −→
  dtr(D, Vp, Ref), { state(D, conductive) },
  pnpTr(Q, Ref, Vp, Ld), { state(Q, active) }.                                   (11)

Fig. 4. Active Load
5  Deriving Circuit Functions
The grammar rules in the previous section are converted into Prolog clauses. Using these clauses, the goal (12) parses the circuit in Figure 1. The identified circuit is substituted into the variable X. The first argument of the composite term is a name given to the circuit, and represents the syntactic structure as shown in Figure 5. The circuit functions which consist of more than 100 logical sentences are substituted into the variable Y as meanings of the circuit structure. ? − cd42(Circuit), ss(X, Circuit, [ ], [ ], Y ).
(12)
X = opAmp(opAmp(sdAmp(ecup(q1, q2), al(pdtr(q8), q7), sink(vreg(dtr(q4), r1), q3)), pnpCE(q9, sink(vreg(dtr(q4), r1), q5)), npnEF (q6, r2)), 3,4,9,2,1)
(13)
Y = [ input(opAmp(...), voltage(3, 4)), output(opAmp(...), voltage(9, 1)),
      cause(voltage(3, 4), voltage(9, 1)),
      enable(sdAmp(...), amplify(opAmp(...), differential_inputs)),
      enable(pnpCE(...), high(voltage_gain(opAmp(...)))),
      enable(npnEF(...), low(output_impedance(opAmp(...)))),
      input(npnEF(...), voltage(8, 1)), output(npnEF(...), voltage(9, 1)),
      cause(voltage(8, 1), voltage(9, 1)),
      high(input_impedance(npnEF(...))), low(output_impedance(npnEF(...))),
      equ(voltage_gain(npnEF(...)), 1), state(q6, active), ... ]                 (14)
6  Conclusion
We have introduced semantic terms into grammar rules to extend DCSG. The semantic terms define relationships between syntactic structures and their meanings. We also changed the notation for terminal symbols: differently from DCSG, terminal symbols are not distinguished from non-terminal symbols, but are defined by grammar rules lacking a right-hand side. This enables us to treat the meaning of words in the same way as the meaning of structures. That is, Extended-DCSG defines not only the surface structure of a language, but also the meanings hidden beneath the surface.
Fig. 5. Parse tree of cd42
We assumed circuit functions to be the meaning of circuit structures, and extended circuit grammars to define relationships between structure and function. The grammar rules were translated into Prolog clauses, and structures and functions were derived from the given circuits. We now have a method of encoding knowledge about electronic circuits; if we encode more knowledge, more circuits can be parsed. Currently, our system outputs circuit structures and functions as logical sentences. In order to improve readability for the user, we are developing a natural language interface. Circuit simulators such as SPICE [5] derive voltages and currents from given circuits; such a simulator replaces measurements on experimental circuits with computation. Our system, in contrast, derives structures and functions from given circuits, as a simulator of understanding. It helps the user to understand how circuits work in design, analysis, and trouble-shooting. These two different kinds of system will work in a complementary way.
References
[1] Tanaka, T.: Parsing Circuit Topology in a Logic Grammar. IEEE Trans. Knowledge and Data Eng., Vol. 5, No. 2, pp. 225-239, 1993.
[2] Tanaka, T.: Definite Clause Set Grammars: A Formalism for Problem Solving. J. Logic Programming, Vol. 10, pp. 1-17, 1991.
[3] Pereira, F.C.N., Warren, D.H.D.: Definite Clause Grammars for Language Analysis. Artificial Intelligence, Vol. 13, pp. 231-278, 1980.
[4] Tanaka, T., Bartenstein, O.: DCSG-Converters in Yacc/Lex and Prolog. Proc. 12th International Conference on Applications of Prolog, pp. 44-49, 1999.
[5] Tuinenga, P.W.: SPICE - A Guide to Circuit Simulation & Analysis Using PSpice. Prentice Hall, 1988.
A Web-Based System for Transformer Design J.G. Breslin and W.G. Hurley Power Electronics Research Centre, National University of Ireland Galway, Ireland {john.breslin,ger.hurley}@nuigalway.ie http://perc.nuigalway.ie/
Abstract. Despite the recent use of computer software to aid in the design of power supply components such as transformers and inductors, there has been little work done on investigating the usefulness of a web-based environment for the design of these magnetic components. Such an environment would offer many advantages, including the potential to share and view previous designs easily along with platform/OS independence. This paper presents a web-based transformer design system whereby users can create new optimised transformer designs and collaborate or comment on previous designs through a shared information space.
1  Introduction
Despite the recent use of computer software to aid in the design of magnetic components [3], to date there has been little work done on investigating the usefulness of a web-based environment for magnetic component design. Such an environment would offer many advantages, including the potential to share and view previous designs easily along with platform and operating system independence. This paper proposes a web-based prototype for magnetic component design. The system is based on an existing transformer design methodology [1], and is implemented using a web programming language, PHP. Magnetic component material information and designs created by students and instructors are stored in a MySQL database on the system server. It will therefore be possible for users of this system to collaborate and comment on previous designs through a shared information space. We will begin with a summary of existing methods for transformer design, followed by details of the system design, and finally an overview of the web-based implementation.
2  Related Research and Other Systems

2.1  Transformer Design Methodologies
The basic area product methodology [4] often results in designs that are not optimal in terms of either losses or size, since the method is orientated towards low-frequency transformers. A revised arbitrary waveform methodology was proposed [1] that allows designs at both low and high frequencies, and is suitable for integration with high-frequency winding resistance minimisation techniques [2]. The selection of the core is optimised in this methodology to minimise both the core and winding losses. The design process can take different paths depending on whether a flux density constraint is violated or not. If the flux density exceeds a set limit (the saturation value for a particular core material), it is reset to be equal to this limit and one path is used to calculate the core size; otherwise the initial flux density is used and a different path is taken to find the core size.
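As a hedged sketch of the branching just described (not the authors' code), the two paths can be expressed as follows; optimum_flux_density and area_product stand in for the formulas of [1], which are not reproduced in this paper, and the cores list stands in for the catalogue data discussed later.

def choose_core(spec, cores, b_sat, optimum_flux_density, area_product):
    """Sketch of the core-selection branch described above.

    spec                 : dict of design specifications (voltage, current, ...)
    cores                : list of dicts with at least a 'core_ap' key, standing
                           in for the cores catalogue used later in the system
    b_sat                : saturation flux density of the chosen core material
    optimum_flux_density : callable implementing the optimisation of [1]
    area_product         : callable giving the required area product for a
                           specification and working flux density (the two
                           calculation paths are compressed into one callable)
    """
    b_opt = optimum_flux_density(spec)
    # If the optimum flux density violates the saturation constraint, clamp it
    # to the limit (alternative path); otherwise keep the optimum value.
    b_work = min(b_opt, b_sat)
    ap_required = area_product(spec, b_work)
    # Smallest catalogue core whose area product meets the requirement
    # (mirrors the "core_ap >= ... ORDER BY core_ap LIMIT 1" query in Section 4).
    candidates = [c for c in cores if c["core_ap"] >= ap_required]
    if not candidates:
        return b_work, None
    return b_work, min(candidates, key=lambda c: c["core_ap"])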
2.2  Previous Windows-Based Packages
Many magnetic design companies have used computer spreadsheets to satisfy their own design needs and requirements. They thus tend to be solely linked to a company and its products, and remain unpublished as their content is only of interest to their direct users and competitors. Some of the limitations of these spreadsheets include: difficulties in incorporating previous designer expertise due to limited decision or case statements; non-conformity with professional database formats used by manufacturers; problems with implementing most optimisation routines due to slow iterative capabilities and a reduced set of mathematical functions; basic spreadsheet input mechanisms that lack the features possible with a customised GUI. A computer aided design package has previously been developed [3] for the Windows environment based on the arbitrary waveform methodology [1]. This package allows the design of transformers by both novice and expert users for multiple application types through “try-it-and-see” design iterations. As well as incorporated design knowledge and high frequency winding optimisation, the system allows access to customisable or pre-stored transformer geometries which were usually only available by consulting catalogues. Other systems such as [6] provide more detailed information on winding layouts and SPICE models, but lack the high frequency proximity and skin effects details of [3]. Another Windows-based package has recently been released [7] which allows the comparison of designs based on different parameters. Some companies have also advertised web-based selection of magnetic components [8]; however these tend to be online spreadsheets and therefore suffer from the problems previously mentioned.
3  System Design
All of the information and associated programs of the system reside on a web server, and any user can access the system using a web browser through the appropriate URL. The inputs to the system are in the form of specifications such as desired voltage, current, frequency, etc. The output from the system is an optimised design for the specifications given. Fig. 1 shows the steps taken in creating and managing a design. The system is comprised of the following elements: a relational database management system (RDBMS); a web-based graphical user interface (GUI); optimisation techniques; a repository of knowledge; and a shared information space. We will now describe these in some more detail.

Fig. 1. Flow chart of design steps (enter specifications and choose materials; check the optimum flux density against the saturation value; calculate the area product and select a core geometry; calculate turns information; select or optimise the winding geometries; predict losses, efficiency and leakage inductance)
3.1  Relational Database Management System
In transformer design, huge amounts of core and winding data have to be managed for effective exchange of information between various design stages. The magnetic designer should not get lost in the data management process as their concentration on the real design problem may be affected. A RDBMS can be used to save designers from having to search through books with manufacturer's data or from manipulating data themselves with lower level external programs. All transformer core and winding data is accessible using the sophisticated database storage and retrieval capabilities of a relational database engine incorporated into the design application. The database contains the following tables of data: designs (where each saved design is identified by a unique design_id); designers (users of the system); areas (consisting of parent areas which contain child areas for related sets of designs); comments (each relating to a particular design_id); cores and windings (either read-only items added by an administrator or modifiable items created by a particular designer); and tables for core and winding materials, shapes, manufacturers and types.

3.2  Web-Based Graphical User Interface
In our system, we require over 250 HTML objects and form controls for interaction with the user; these include text boxes for both inputs and calculated outputs, labels describing each text box, radio buttons and checkboxes for selection of discrete or Boolean variables, option group menus, graphs of waveforms, etc. Proper categorisation and presentation of data in stages is our solution to the problem of organising this data in a meaningful way, whereby images identify links to the distinct steps in the design process, and only information related to a particular step is shown at any given time. Some of the main GUI features incorporated in the system are: numbered and boxed areas for entry of related data in "sub-steps"; input and output boxes colour-coded according to whether data entry is complete, required, or just not permitted; scrollbars for viewing large tables of data in small areas; and pop-up message windows for recommendations and errors.

3.3  Optimisation Techniques
The performance level of an engineering design is a very important criterion in evaluating the design. Optimisation techniques based on mathematical routines provide the magnetic designer with robust analytical tools, which help them in their quest for a better design. The merits of a design are judged on the basis of a criterion called the measure of merit function (or the figure of merit if only a single measure exists). Methods for optimising AC resistance for foil windings [2] and total transformer loss [1] are implemented in the web-based system; these variables are our measures of merit.

3.4  Repository of Knowledge
A "repository of knowledge" is incorporated into our system, to allow a program design problem to be supplemented by rules of thumb and other designer experience. In the early design stages, the designer generates the functional requirements of the transformer to be designed, and the expertise of previous users can play a very important role at this stage. For example, on entry of an incompatible combination of transformer specifications, the designer will be notified by a message informing them that a design error is imminent. The system will also suggest recommended "expert" values for certain variables. Although such a system is useful for novices, it can also be used by experts who may already know of certain recommended values and who want to save time setting them up in the first place.

3.5  Shared Information Space
The system allows collaboration between users working on a design through a shared information space, with features similar to those of a discussion forum. Designs are filed in folders, where each folder may be accessed by a restricted set of users. To accommodate this, user and group permissions are managed through an administration panel. Access to design folders is controlled by specifying either the users or the groups that have permission to view and add designs to that folder. Users can comment on each design, and can also send private messages to other users.
4  Implementation and Testing
A popular combination for creating data-driven web sites is the PHP language with a MySQL database, and this was chosen for the implementation of the system. PHP is a server-side scripting language that allows code to be embedded within an HTML page. The web server processes the PHP script and returns regular HTML back to the browser. MySQL is a RDBMS that can be run as a small, compact database server and is ideal for small to medium-sized applications. MySQL also supports standard ANSI SQL statements. The PHP / MySQL combination is cross-platform; this allowed the development of the system in Windows while the server runs on a stable BSD Unix platform. A typical PHP / MySQL interaction in the system is as follows. After initial calculations based on the design specifications to find the optimum core size (as mentioned in section 2.1 and detailed in [1]), a suitable core geometry is obtained from the cores database table using the statement: $suitable_core_array = mysql_query("SELECT * FROM cores WHERE (core_ap >= $optimum_ap AND corematerial_id = $chosen_corematerial_id AND coretype_id = $chosen_coretype_id) ORDER BY core_ap LIMIT 1"); where the names of calculated and user-entered values are prefixed by the dollar symbol ($), and fields in the database table have no prefix symbol. Fig. 2 shows the user interface, with each of the design steps clearly marked at the top and the current step highlighted (i.e. “Specifications”). An area at the bottom of the screen is available for designer comments. The underlying methodologies have previously been tested by both the authors [1] and external institutions [5]. Design examples carried out using the system produce identical results to those calculated manually. The system is being tested as a computer-aided instruction tool for an undergraduate engineering class.
Fig. 2. Screenshot of web-based system
5  Conclusions and Future Work
With the current trend towards miniaturisation in power converters, the magnetics designer should now expect accurate computer aided techniques that will allow the design of any magnetic component while incorporating existing techniques in the area of web-based collaboration. This paper presented a web-based transformer design system based on a previous methodology. This system is an improvement on previous automated systems because: previous designer expertise and optimisation routines are incorporated into the design method; database integration avoids the need for consultation of catalogues; and a user-friendly interface, with advanced input mechanisms, allows for collaboration among users where designs can be shared and analysed. This system can be further developed for more transformer applications, and with a revised methodology the system could also be updated to incorporate inductors and integrated planar magnetics.
References
[1] Hurley, W.G., Wölfle, W., Breslin, J.G.: Optimized Transformer Design: Inclusive of High-Frequency Effects. IEEE Trans. on Power Electronics, Vol. 13, No. 4 (1998) 651-659
[2] Hurley, W.G., Gath, E., Breslin, J.G.: Optimizing the AC Resistance of Multilayer Transformer Windings with Arbitrary Current Waveforms. IEEE Trans. on Power Electronics, Vol. 15, No. 2 (2000) 369-376
[3] Hurley, W.G., Breslin, J.G.: Computer Aided High Frequency Transformer Design Using an Optimized Methodology. IEEE Proc. of COMPEL 2000 (2000) 277-280
[4] McLyman, W.T.: Transformer and Inductor Design Handbook. Marcel Dekker, New York (1978)
[5] Lavers, J.D., Bolborici, V.: Loss Comparison in the Design of High Frequency Inductors and Transformers. IEEE Transactions on Magnetics, Vol. 35, No. 5 (1999) 3541-3543
[6] Intusoft Magnetics Designer: http://www.intusoft.com/mag.htm
[7] Ansoft PExprt: http://www.ansoft.com/products/em/pexprt/
[8] Cooper Electronic Technologies press release for Coiltronics Versa Pac: Web-Based Selection for Transformers and Inductors. http://www.electronicstalk.com/news/hnt/hnt105.html and http://www.cooperet.com/products_magnetics.asp
A Fuzzy Control System for a Small Gasoline Engine S.H. Lee, R.J. Howlett, and S.D. Walters Intelligent Systems & Signal Processing Laboratories Engineering Research Centre, University of Brighton Moulsecoomb, Brighton, BN2 4GJ, UK {S.H.Lee,R.J.Howlett,S.D.Walters}@Brighton.ac.uk
Abstract. Small spark-ignition gasoline-fuelled internal-combustion engines can be found all over the world performing in a variety of roles, including power generation, agricultural applications and motive power for small boats. To attain low cost, these engines are typically air-cooled, use simple carburettors to regulate the fuel supply and employ magneto ignition systems. Electronic control, of the sort found in automotive engines, has seldom proved cost-effective for use with small engines. The future trend towards engines that have low levels of polluting exhaust emissions will make electronic control necessary, even for small engines where previously this has not been economic. This paper describes a fuzzy control system applied to a small engine to achieve regulation of the fuel injection system. The system determines the amount of fuel required from a fuzzy algorithm that uses the engine speed and manifold air pressure as input values. A major advantage of the fuzzy control technique is that it is relatively simple to set up and optimise compared to the labour-intensive process necessary when the conventional "mapped" engine control method is used. Experimental results are presented to show that a considerable improvement in fuel regulation was achieved compared to the original carburettor-based engine configuration, with associated improvements in emissions. It is also demonstrated that the system produces improved output power and torque curves compared to those achieved when the original mechanical fuel regulation system was used.

1  Introduction
Electronic control of the air-fuel ratio (AFR) and ignition timing of a spark ignition (SI) engine is an effective way to achieve improved combustion efficiency and performance, as well as the reduction of exhaust emissions. The AFR essentially sets the operating point of the engine, and in conjunction with the ignition timing angle, determines the output power and the resulting levels of emissions. In an engine with electronic control, the amount of fuel that is supplied to the engine is controlled by an engine control unit (ECU). This is a micro-processor based
system that controls the frequency and width of the control pulse supplied to the fuel injector. The AFR is important in the combustion and calibration processes. If there is too much fuel, not all of it is burnt, causing high fuel consumption and increased emissions of HC and CO. Too little fuel can result in overheating and engine damage such as burnt exhaust valves. Conventional ECUs use three-dimensional mappings (3-D maps), in the form of look-up tables, to represent the non-linear behaviour of the engine in real-time [1]. A modern vehicle ECU can contain up to 50 or more of these maps to realise complex functions. In addition the engine will be equipped with a wide range of sensors to gather input data for the control system. A major disadvantage of the look-up table representation is the time taken to determine the values it should contain for optimal engine operation; a process known as calibration of the ECU. These 3-D maps are typically manually calibrated, or tuned, by a skilled technician using an engine dynamometer to obtain desired levels of power, emissions and efficiency. The calibration process is an iterative one that requires many cycles of engine measurements and is very time consuming. Techniques that reduce the time and effort required for the calibration process are of considerable interest to engine manufacturers. This is especially the case where the engine is a small capacity non-automotive engine. These engines are particularly price sensitive and any additional cost, including the cost of extended calibration procedures, is likely to make the engine un-economic to manufacture. For similar economic reasons, any control strategy intended for application to a small engine has to be achievable using only a small number of low-cost sensors. This paper makes a useful contribution to research in this area by proposing a technique for the rapid calibration of a small-engine ECU, requiring only sensors for speed and manifold air pressure.
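For contrast with the fuzzy approach introduced in the next section, a conventional mapped controller reads its output by interpolating in a calibrated table. The following generic sketch (not taken from any particular ECU) shows bilinear interpolation in a speed/load look-up table; every entry of such a table must be found during the calibration process described above.

import numpy as np

def lookup_2d(speed, load, speed_axis, load_axis, table):
    """Bilinear interpolation in a calibrated 2-D map (e.g. FPW or ignition angle).

    speed_axis, load_axis : 1-D arrays of breakpoints in ascending order
    table                 : 2-D array, table[i, j] calibrated at
                            (speed_axis[i], load_axis[j])
    """
    i = np.clip(np.searchsorted(speed_axis, speed) - 1, 0, len(speed_axis) - 2)
    j = np.clip(np.searchsorted(load_axis, load) - 1, 0, len(load_axis) - 2)
    ts = (speed - speed_axis[i]) / (speed_axis[i + 1] - speed_axis[i])
    tl = (load - load_axis[j]) / (load_axis[j + 1] - load_axis[j])
    ts, tl = np.clip(ts, 0.0, 1.0), np.clip(tl, 0.0, 1.0)
    # Weighted sum of the four surrounding calibration points
    return ((1 - ts) * (1 - tl) * table[i, j] + ts * (1 - tl) * table[i + 1, j]
            + (1 - ts) * tl * table[i, j + 1] + ts * tl * table[i + 1, j + 1])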
2  Fuzzy Control
Fuzzy logic is a ‘soft computing' technique, which mimics the ability of the human mind to learn and make rational decisions in an uncertain and imprecise environment [2]. Fuzzy control has the potential to decrease the time and effort required in the calibration of engine control systems by easily and conveniently replacing the 3-D maps used in conventional ECUs. Fuzzy logic provides a practicable way to understand and manually influence the mapping behaviour. In general, a fuzzy system contains three main components, the fuzzification, the rule base and the defuzzification. The fuzzification is used to transform the so-called crisp values of the input variables into fuzzy membership values. Afterwards, these membership values are processed within the rule base using conditional ‘if-then' statements. The outputs of the rules are summed and defuzzified into a crisp analogue output value. The effects of variations in the parameters of a Fuzzy Control System (FCS) can be readily understood and this facilitates optimisation of the system. The system inputs, which in this case are the engine speed and the throttle angle, are called linguistic variables, whereas ‘large' and ‘very large' are linguistic values which are characterised by the membership function. Following the evaluation of the rules, the defuzzification transforms the fuzzy membership values into a crisp output
value, for example, the fuel pulse width. The complexity of a fuzzy logic system with a fixed input-output structure is determined by the number of membership functions used for the fuzzification and defuzzification and by the number of inference levels. The advantage of fuzzy methods in the application of engine control over conventional 3-D mappings is the relatively small number of parameters needed to describe the equivalent 3-D map using a fuzzy logic representation. The time needed in tuning a FCS compared to the same equivalent level of 3-D map look-up control can be significantly reduced.
3  The Fuzzy Control System

3.1  Feedforward Control
The aim of the control strategy was to govern the value of AFR in the engine, keeping it at a desired optimal value and minimising the influence of changes in speed and load. Figure 1 shows the block diagram of the test system. Engine load was estimated indirectly by measurement of the inlet manifold air pressure (MAP). The parameters and rule-base contents of the fuzzy control system were determined during test-rig trials and implanted as a system reference into the control unit. The details of the creation of such a system for this experiment are explained in the next section of the paper. The minor drawback of this feedforward control is the lack of feedback information; factors such as wear and spark plug deterioration will detract from the optimum fuel injection quantity in what is still effectively an open-loop system. Feedback control of AFR is often provided in automotive engines, but this is seldom economic on small engines. A suitable model was created to predict throttle position by using the MAP and the engine rotating speed. The feedforward fuzzy control scheme was used in order to reduce deviations in the lambda-value λ, where λ is an alternative way of expressing AFR (λ = 1.0 for an AFR of approximately 14.7:1, the value for complete combustion of gasoline). The scheme also has the benefit of reducing the sensitivity of the system to disturbances which enter the system outside the control loop. This fuzzy model offers the possibility of identifying a single multi-input single-output non-linear model covering a range of operating points [3].
Fig. 1. Block diagram for feedforward and fuzzy logic control scheme
3.2  Experimental Arrangement
The experimental fuzzy control algorithm was implemented using a test facility based on a Bosch Suffolk single-cylinder engine having a capacity of 98 cc. The engine parameters are summarised in Table 1. The engine had a single camshaft and sidevalve arrangement, and was capable of generating manufacturer- listed peak power and torque outputs of 1.11kW at 3000 revolutions per minute (RPM) and 3.74Nm at 2100 RPM respectively. Load was applied to the engine via a DC dynamometer with a four-quadrant speed controller. A strain gauge load cell system was incorporated and a frequency-to-voltage converter was used to provide speed information. The dynamometer was capable of either motoring or absorbing power from the engine. A PC-based data acquisition system utilising an Advantech PCL818HD analogue-to-digital converter (ADC) card was used. Various sensors were provided to measure the engine operating parameters: speed, torque, manifold vacuum pressure, temperatures, AFR, etc. The ignition system used was the standard fitment magneto. A modification was made to the air-induction system in order to accommodate a fuel injector as well as the original carburettor. Thus, the engine could be conveniently switched so as to use the carburettor or the fuel injection system. The fuel injector electronic system consisted of a programmable counter/interval timer (Intel 82C54) which generates a pulse of the required length, feeding an automotive specification Darlington-configuration power transistor, driving the fuel injector solenoid. The fuel pulse width (FPW) governed the quantity of fuel injected into the engine. Table 1. Basic Engine specification
Bore (mm)              57.15
Stroke (mm)            38.1
Compression ratio      5.2 : 1
Capacity               98 cc
Valve arrangement      Sidevalve
Carburettor            Fixed jet
Ignition system        Flywheel magneto

3.3  Engine Load Estimation
In a spark-ignition engine the induction manifold pressure varies with engine speed and throttle opening according to a non-linear mapping. Figure 2 shows the 3-D relationship between these operating parameters for the Bosch Suffolk engine. By measuring these two variables, the engine load/throttle position can be determined. A conventional look-up table can be used, although in the case of this work fuzzy logic was used to represent the non-linear relationship between functions. An optical sensor was used for speed measurement, and a low-cost pressure sensor was applied to measure the MAP. These formed the major control inputs to the fuzzy control loop.
Fig. 2. Variation of MAP with speed and throttle opening
3.4  Fuzzy Control Algorithm
The fuzzy control system was devised using a Fuzzy Development Environment (FDE) which was the outcome of a linked piece of work. The FDE is an MS Windows-based application that consists of a Fuzzy Set Editor and Fuzzy Rule Editor. Fuzzy sets, membership functions and rule sets for this project were all created, and modified where required, using the FDE. Parameters derived from the FDE, specific to the particular set-up devised, were transferred to an execution module, known as the Fuzzy Inference Kernel (FIK). The FIK was a module programmed in C++ code. To make it possible to embed the FIK directly into an ECU, the code was compiled to .obj format, and incorporated into the rest of the control code by the linker.
Fig. 3. Air-fuel ratio fuzzy control loop
The fuzzy control loop illustrated in Figure 3 was implemented in order to optimise the AFR. To determine the effectiveness of the control loop, the AFR was monitored using a commercial instrument, known as an Horiba Lambda Checker. The engine speed was determined by an optical sensor while the MAP was measured by a pressure sensor located in the intake manifold. These instruments sampled individual parameters and through the medium of signal conditioning circuitry provided analogue output voltage levels proportional to their magnitude. These were converted to digital form and the crisp digital signals were then applied to a fuzzy algorithm implemented in the C programming language on a PC. The crisp output from the algorithm was the width of the pulse applied to the fuel injector (the FPW).
Fig. 4. Fuzzy input set – engine speed
The fuzzy sets shown in Figures 4 and 5 were used in the fuzzy controller. The engine speed fuzzy set used three trapezoidal membership functions for the classes Low, Medium and High. The MAP fuzzy set consisted of four trapezoidal membership functions for the classes Very Low, Low, High and Very High. Experimental adjustment of the limits of the membership classes enabled the response of the control kernel to be tailored to the physical characteristics of the engine.
Fig. 5. Fuzzy input set – vacuum pressure
The contents of the rule-base underwent experimental refinement as part of the calibration process. The final set of rules contained in the rule-base is shown in Figure 6.
Fig. 6. The fuzzy rule base
The fuzzified values for the outputs of the rules were classified into membership sets similar to the input values. An output membership function of output singletons, illustrated in Figure 7, was used. This was defuzzified to a crisp value of FPW.
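For illustration only (this is not the FDE/FIK code used in the project), trapezoidal fuzzification of speed and MAP followed by defuzzification over output singletons might be sketched as below; every breakpoint, singleton value, the min operator chosen for the rule AND, and the example rule are assumptions, since the real values were tuned on the test rig.

def trapmf(x, a, b, c, d):
    """Trapezoidal membership: rises a->b, flat b->c, falls c->d."""
    if x <= a or x >= d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x <= c:
        return 1.0
    return (d - x) / (d - c)

# Placeholder input sets in the shape described for Figs. 4 and 5
SPEED = {"low": (0, 0, 1800, 2200), "medium": (1800, 2200, 2400, 2700),
         "high": (2400, 2700, 4000, 4000)}
MAP_VAC = {"very_low": (0, 0, 10, 20), "low": (10, 20, 35, 45),
           "high": (35, 45, 60, 70), "very_high": (60, 70, 100, 100)}
FPW_SINGLETON = {"short": 2.0, "medium": 4.0, "long": 6.0}   # ms, invented

def fuel_pulse_width(speed_rpm, map_vac, rules):
    """rules: list of ((speed_class, map_class), fpw_class) pairs, as in Fig. 6."""
    num = den = 0.0
    for (s_cls, m_cls), out_cls in rules:
        w = min(trapmf(speed_rpm, *SPEED[s_cls]),
                trapmf(map_vac, *MAP_VAC[m_cls]))     # AND of the two inputs
        num += w * FPW_SINGLETON[out_cls]             # weight each output singleton
        den += w
    return num / den if den else 0.0                  # crisp FPW in ms

# Invented rule fragment, purely to show the data shape expected by fuel_pulse_width:
# fuel_pulse_width(2100, 40.0, [(("low", "very_high"), "short"), (("medium", "low"), "long")])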
Fig. 7. Fuzzy output set – FPW (ms)
3.5  The Mapping
Engine control typically requires a two-dimensional plane of steady state operating points with engine speed along the horizontal axis and throttle position along the vertical axis. The control surface in Figure 8 shows the crisp value of FPW at different combinations of speed and vacuum pressure using FCS. Each of these intersection points indicates the differing requirement for fuel, which is determined by the design of fuzzy sets and membership functions. The control surface acts as a means of determining the FPW needed for each combination of speed and MAP value.
Fig. 8. Three-dimensional FCS map
4  Experimental Results
The performance of the engine running with the FCS was experimentally compared with that of the engine running using the conventional mechanical fuel regulation and delivery system. A monitoring sub-routine was created to capture performance data, under conventional operation and using the FCS, under the experimental conditions described in Table 2. The experimental evaluation was carried out using a combination of six speed settings and five values of Throttle Position Setting (TPS) as illustrated in Table 2. Values of engine torque and power were recorded for each combination of speed and TPS.
The results are presented graphically in Figures 9 to 16 and discussed in Sections 4.1 and 4.2. Table 2. Experimental conditions
Engine speed (RPM)       1800, 2000, 2200, 2400, 2600, 2700
Throttle Position (%)    0, 25, 50, 75, 100

4.1  Power and Torque
Figures 9 to 12 illustrate that the power produced by the engine with the FCS exhibited an increase of between 2% and 21% with an average of approximately 12% compared with the original mechanical fuel delivery system. A corresponding improvement in output torque also resulted from the use of the fuel injection system with the FCS compared to when the original fuel delivery system was used. Figures 13 to 16 show that the mean torque exhibited an increase of between 2% and 20% with an overall average of 12%. These increases in engine performance are partly due to the improvement in charge preparation achieved by the fuel injection process; the improvement in fuel metering also results in improved combustion efficiency hence increased engine power.
Fig. 9. Engine power when TPS=25%
Fig. 10. Engine power when TPS=50%
Fig. 11. Engine power when TPS=75%
Fig. 12. Engine power when TPS=100%
Fig. 13. Engine torque when TPS=25%
Fig. 14. Engine torque when TPS=50%
Fig. 15. Engine torque when TPS=75%
Fig. 16. Engine torque when TPS=100%

4.2  Air-Fuel Ratio
The AFR was monitored, over a range of speeds and load conditions, using both the original fuel delivery system and the fuzzy-controlled fuel-injection system, to comparatively evaluate the variation in AFR that occurred. The control objective was to stabilise the AFR such that λ = 1.0 was achieved under all engine operating conditions. Figures 17 and 18 illustrate how the value of λ varied with different combinations of speed and throttle position using the original fuel regulation system and the fuzzy-controlled fuel-injection system, respectively. Figure 17 shows that wide variations in λ occurred when the original fuel regulation system was used, this being due to non-linearities in the characteristic of the carburettor. This resulted in an excessively rich mixture at small throttle openings and an excessively weak mixture when the throttle opening was large. The large variations in λ suggested poor combustion efficiency and higher, harmful, exhaust emissions. An improved and refined contour was found to occur when the FCS was employed. Reasonable regulation of λ was achieved, the value being maintained between 0.8 and 1.0 in approximately 90% of the experimental operating region. Exceptions occurred in two extreme conditions:

•  high engine speed with very small throttle opening; and
•  low engine speed with throttle wide open.
Neither of these conditions is likely to occur frequently in normal engine operation. There were a number of limitations in the mechanical and electronic components of the fuel-injection system which adversely affected the stabilisation of the AFR. Firstly, the fuel injector was one that was conveniently available for the experiment, but it was too large for the size of the engine, making it difficult to effect small changes in the amount of fuel delivered. Secondly, the resolution of the counter that determined the fuel pulse width was too coarse, again causing difficulty in making fine adjustments to the quantity of fuel delivered. Finally, the chamber where the fuel injector was installed and the inlet manifold were not optimised for fuel injection. Even with such a non-optimal system, it was possible to adjust the parameters of the fuzzy control system conveniently and quickly to produce a close-to-optimal solution.
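For orientation only (this is not the authors' implementation), the short Python sketch below computes λ from a measured air-fuel ratio and shows how a coarse timer resolution quantises the commanded injector pulse width; the stoichiometric AFR of 14.7 and the 64 µs tick are assumed values.

    # Illustrative sketch: lambda from a measured AFR, and quantisation of the
    # requested injector pulse width by a coarse counter resolution.
    AFR_STOICH = 14.7          # assumed stoichiometric AFR for gasoline
    COUNTER_TICK_US = 64.0     # assumed timer resolution (microseconds per count)

    def lambda_value(measured_afr: float) -> float:
        """lambda = measured AFR / stoichiometric AFR; lambda = 1.0 is the target."""
        return measured_afr / AFR_STOICH

    def quantised_pulse_width(requested_us: float) -> float:
        """Round the requested pulse width to the nearest counter tick."""
        return round(requested_us / COUNTER_TICK_US) * COUNTER_TICK_US

    print(lambda_value(13.2))             # rich mixture -> lambda < 1
    print(quantised_pulse_width(2450.0))  # forced onto a multiple of 64 us

A coarse tick, as in the second function, is exactly what prevents fine adjustment of the delivered fuel quantity.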
Fig. 17. Variation in lambda with original fuel regulation system
Fig. 18. Variation in lambda with fuzzy-controlled fuel-injection system

5 Conclusion
This paper has introduced an improved technique for the computer control of the fuel supply of a small internal combustion engine. The technique provides significant time savings in ECU calibration and improved performance. The fuzzy logic control scheme eliminates the requirement for skilful and time-consuming calibration of the conventional three-dimensional map while at the same time achieving good fuel regulation, leading to improved control of polluting exhaust emissions.
It was demonstrated that the entire tuning process, including the set-up of the membership functions and the derivation of the rule-base, took as little as one hour to deliver results. This was significantly faster than comparable manual calibration of the equivalent mapped control, and faster times could be achieved with experience and practice. Laboratory tests showed that the fuzzy-controlled fuel-injection system achieved increased engine power and torque over that obtained with mechanical fuel delivery. In addition, it was shown that the system was capable of maintaining the variation of λ within a narrow range, leading to reduced emissions. Areas where further work will have significant impact include the development of further fuzzy sets that refine the control strategy by including ignition timing, cold-start enrichment, etc.
Faults Diagnosis through Genetic Matching Pursuit* Dan Stefanoiu and Florin Ionescu FH-University of Applied Sciences in Konstanz, Dept. of Mechatronics Brauneggerstraße 55, 78462 Konstanz, Germany Tel.: + 49 07 531 206-415,-289; Fax: + 49 07 531 206-294 {Danstef,Ionescu}@fh-konstanz.de
Abstract. Signals that carry information regarding the existing defects or possible failures of a system are sometimes difficult to analyze because of various corrupting noises. Such signals are usually acquired in difficult conditions, far from the place where the defects are located and/or within a noisy environment. Detecting and diagnosing the defects therefore requires quite sophisticated methods that are able to distinguish between the noises encoding the defects and other parasitic signals, all mixed together in an unknown way. Such a method is introduced in this paper. The method combines time-frequency-scale analysis of signals with a genetic algorithm.
1 Introduction
The problem of fault detection and diagnosis (fdd) using signals provided by a monitored system is approached in this paper through a hybrid combination of a modern Signal Processing (SP) method and an optimization strategy issued from the field of Evolutionary Computing (EC). The signals to analyze are mechanical vibrations provided by bearings in service, acquired in difficult conditions: far from the location of the tested bearing and without a synchronization signal. The resulting vibrations are therefore corrupted by interference and environmental noises (the SNR is quite small) and the main rotation frequency can only be estimated. Moreover, the rotation speed varies during the measurements, because of load and power supply fluctuations. In general, it is difficult to extract the defect information from such signals by simple methods. The method that will be succinctly described in this paper (an in-depth presentation is given in [4]) relies on the Matching Pursuit Algorithm (MPA) introduced in [2]. The MPA has a number of properties that can be exploited in order to perform fdd on noisy vibrations. The most important property is the capacity for denoising. By denoising, the useful component is separated from the noise component of the analyzed signal with a controlled accuracy. Another useful characteristic is the distribution of the denoised signal energy over the time-frequency (tf) plane, which in general reveals the signal
* Research developed with the support of the Alexander von Humboldt Foundation (www.avh.de).
features better than the classical spectrum. Vibrations affected by defects and noises are non-stationary signals that require tf analysis rather than classical spectral methods.
2 Matching Pursuit in a tfs Dictionary
The approach presented in [2] has been generalized in the sense that the tf dictionary of waveforms is replaced here by a time-frequency-scale (tfs) dictionary. SP applications proved that the use of time, frequency and scale together can lead to better results than when only time and frequency or only time and scale are employed. Hence, the dictionary is generated by applying scaling with scale factor s_0, time shifting with step u_0 and harmonic modulation with pulsation \omega_0 on a basic signal g referred to as the mother waveform (mw). The mw can be selected from a large class of known signals, but its nature should be related to the analyzed signals. In case of vibrations, the unit energy Gauss function has been preferred:

    g(t) \stackrel{\mathrm{def}}{=} A\, e^{-(t-t_0)^2/(2\sigma^2)}, \quad \forall t \in \mathbb{R},    (1)
where A = 1/\sqrt[4]{\pi\sigma^2} is the magnitude, \sigma > 0 is the sharpness and t_0 \in \mathbb{R} is the central instant. It is well known that the Gauss mw (1) has a time support length of 6\sigma and a frequency support length of 6/\sigma, because of the Uncertainty Principle (UP). The supports measure the tf localization. Starting from the selected mw, the dictionary consists of the following discrete atoms:

    g_{[m,n,k]}[l] \stackrel{\mathrm{def}}{=} s_0^{-m/2}\, e^{-j\,\omega_0 s_0^{-m} k l T_s}\, g\big(s_0^{-m}(l T_s - n u_0)\big),    (2)
where \forall l \in A is the normalized instant, T_s is the sampling period of the vibration, m \in \{0, \dots, M_s-1\} is the scale index, n \in \{1-N, \dots, N-1\} is the time shifting index, and k \in \{0, \dots, K_m-1\} is the harmonic modulation index. The ranges of variation of the indexes can be derived by tuning the dictionary on the vibration data v. Thus, the number of scales M_s, as well as the number of frequency sub-bands per scale K_m (variable because of the UP), are determined by the vibration bandwidth set by pre-filtering. The number of time shifting steps depends on the vibration data length N. Also, the central instant t_0 is naturally set to (N-1)T_s/2, which involves \sigma = t_0/3 (the support of the mw extends over the data support). The operators applied on the mw are defined by s_0 = 1/2, u_0 = T_s and \omega_0 = 2 \ln 2 / \sigma (see [4]). Note that the atoms (2) are not necessarily orthogonal to each other. The tfs dictionary, denoted by D[g], is redundant and the spectra of every two adjacent atoms overlap. It generates a subspace D[g] of finite energy signals, as shown in Fig. 1(a). Vibration
v may not belong to D[g], but, if projected onto D[g], the useful signal v_D is obtained. The residual signal \Delta v is orthogonal to D[g] and corresponds to the unwanted noises. Fig. 1(a) illustrates in fact the signal denoising principle. Since the atoms (2) are redundant, v_D cannot easily be expressed. Moreover, v_D is an infinite linear combination of atoms from D[g]. Thus, it can only be estimated with a controlled accuracy.
Fig. 1. (a) Principle of denoising within a tfs dictionary. (b) Principle of GA elitist strategy
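To make the dictionary construction concrete, a rough Python sketch of a Gaussian mother waveform and of one discrete tfs atom in the spirit of equations (1)-(2) is given below; it is not the authors' code, and the sampling period, index values and the final renormalisation are illustrative assumptions.

    import numpy as np

    def make_gauss_mw(N, Ts):
        """Continuous unit-energy Gaussian mother waveform of eq. (1)."""
        t0 = (N - 1) * Ts / 2.0
        sigma = t0 / 3.0
        A = (np.pi * sigma**2) ** (-0.25)
        return (lambda t: A * np.exp(-(t - t0) ** 2 / (2.0 * sigma**2))), sigma

    def tfs_atom(N, Ts, m, n, k, s0=0.5):
        """Discrete atom g_[m,n,k][l] of eq. (2): scaling, time shifting, modulation."""
        g, sigma = make_gauss_mw(N, Ts)
        u0 = Ts
        w0 = 2 * np.log(2) / sigma   # assumed reading of the parameter given in the text
        l = np.arange(N)
        atom = s0 ** (-m / 2.0) * np.exp(-1j * w0 * s0 ** (-m) * k * l * Ts) \
               * g(s0 ** (-m) * (l * Ts - n * u0))
        return atom / np.linalg.norm(atom)   # renormalise after discretisation (added here)

    atom = tfs_atom(N=1024, Ts=1.0 / 20000, m=2, n=100, k=5)   # assumed parameter values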
The MPA is concerned with the estimation of the useful signal v_D, using the concept of the best matching atom (bma). When projecting the signal onto the dictionary, the bma is the atom whose resulting projection coefficient has maximum magnitude. Within the MPA, the useful signal is estimated by approximating the residual through the recursive process:
    \Delta^{q+1} x \equiv \Delta^q x - \langle \Delta^q x, g_{[m_q,n_q,k_q]} \rangle\, g_{[m_q,n_q,k_q]}, \quad \forall q \ge 0.    (3)
The approximation process (3) starts with the vibration signal \Delta^0 x \equiv v as the first (coarsest) estimation of the residual. The corresponding bma g_{[m_0,n_0,k_0]} is found and the projected signal is subtracted from the current residual. The newly resulting residual \Delta^1 x (which refines the estimation of the noisy part) then searches for its bma in D[g] and, after finding it, a finer residual estimation \Delta^2 x is produced, etc. The iterations stop when the residual energy falls below a threshold set a priori, i.e. after Q bmas have been found. The useful signal is then estimated by:

    x_D \cong \sum_{q=0}^{Q-1} \langle \Delta^q x, g_{[m_q,n_q,k_q]} \rangle\, g_{[m_q,n_q,k_q]}.    (4)
Note that, in (4), the projection coefficients come from the successive residuals and not only from the initial vibration data. Moreover, a remarkable energy conservation property has been proved in [2]:
    \| \Delta^q x \|^2 \equiv \big| \langle \Delta^q x, g_{[m_q,n_q,k_q]} \rangle \big|^2 + \| \Delta^{q+1} x \|^2, \quad \forall q \ge 0,    (5)
even though the atoms are not necessarily orthogonal. Actually, the iterations in (3) can be stopped thanks to (5), which shows that the energy of the useful signal increases, while the residual energy decreases, with every newly extracted bma.
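A minimal sketch of the recursion (3)-(5) follows, assuming a precomputed list of unit-norm atoms and an exhaustive inner search for the bma (the genetic search of Section 3 replaces that inner search); the stopping ratio and iteration cap are assumed values, and the final real part is only meaningful for real-valued input data.

    import numpy as np

    def matching_pursuit(x, atoms, stop_energy_ratio=0.05, max_iter=200):
        """Greedy MPA: iteratively subtract the best matching atom (bma) from the residual.

        atoms: list of unit-norm (possibly complex) arrays with the same length as x.
        Returns the estimated useful signal and the list of (atom index, coefficient)."""
        residual = x.astype(complex)
        x_useful = np.zeros_like(residual)
        e0 = np.linalg.norm(x) ** 2
        picked = []
        for _ in range(max_iter):
            coeffs = [np.vdot(a, residual) for a in atoms]   # <residual, atom>
            best = int(np.argmax(np.abs(coeffs)))            # bma = max |coefficient|
            c = coeffs[best]
            x_useful += c * atoms[best]                      # accumulate as in eq. (4)
            residual -= c * atoms[best]                      # recursion of eq. (3)
            picked.append((best, c))
            if np.linalg.norm(residual) ** 2 < stop_energy_ratio * e0:  # stop via eq. (5)
                break
        return x_useful.real, picked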
3 A Genetic Matching Pursuit Algorithm
Finding the bma corresponding to the current residual \Delta^q x means solving the following non-linear maximization problem:

    \max_{m,n,k} \Big| \sum_{l \in A} \Delta^q x[l]\, g^{*}_{[m,n,k]}[l] \Big|,    (6)
where a^{*} is the complex conjugate of a. The sum in (6) is actually finite, because the supports of the residual and of the atoms are finite. Although the search space D[g] is finite, it usually includes a huge number of atoms to test. Exhaustive search is inefficient. Gradient-based optimization methods are also impractical, because the cost function is extremely irregular and changes in every iteration step. A promising approach is to use an optimization technique coming from EC. The combination of MPA and evolutionary techniques is a very recent idea that has been used in the analysis of satellite images [1]. In this framework the MPA is joined to a Genetic Algorithm (GA) [3]. The symbiosis between the two algorithms resulted in a Genetic Matching Pursuit Algorithm (GMPA). The GA was designed according to the MPA characteristics. Dictionary atoms are uniquely located in the tfs plane by 3 indexes: scaling, time shifting and harmonic modulation. In general M_s \ll 2N + K_0 - 1, i.e. the number of scales is small (up to 10, for a bandwidth of about 10 kHz, which is enough for vibration analysis). Therefore, the GA will be in charge of finding the bmas on every scale. The fittest of them will subsequently be selected by a simple exhaustive search. The generic chromosome is represented by 2 successive binary genes \gamma_{n,k} = [\gamma_n | \gamma_k], one (\gamma_n) for the time shifting indexes and another one (\gamma_k) for the harmonic modulation indexes. The length of \gamma_n is practically constant among scales, but the length of \gamma_k decreases as the scale index increases, because the number of frequency sub-bands K_m decreases (due to the UP). Varying the chromosome length minimizes the probability of obtaining invalid gene values after applying the genetic operators. At every scale m \in \{0, \dots, M_s-1\}, a population of chromosomes has to be used in the optimization. Thus, the GA operates with M_s populations in parallel. It is well known that parallelism can significantly increase the convergence speed. By working with several populations in parallel, the intrinsic degree of parallelism (specific to any GA) is increased even more and thus a higher convergence speed is expected. The fitness function is defined according to the maximization problem (6):
    f_q[m, \gamma_{n,k}] \stackrel{\mathrm{def}}{=} \Big| \sum_{l \in A} \Delta^q x[l]\, g^{*}_{[m,n,k]}[l] \Big|,    (7)
where m \in \{0, \dots, M_s-1\}, n \in \{1-N, \dots, N-1\} and k \in \{0, \dots, K_m-1\}. Definition (7) shows that the fitness function changes in every step of the recursive process (3). The next GA setting regards the genetic operations to be applied: first crossover, then mutation and finally inversion, with corresponding probabilities (P_c, P_m and P_i, respectively). Note that crossover is applied only on genes of the same length. The use of probabilities in the genetic operators relies on Baker's simple procedure that simulates the roulette game ([3], [4]). In order to construct a new population from an existing one, an elitist strategy was adopted. The strategy is elitist in a controlled proportion denoted by P_e \in [0,1). For example, 10% of the current fittest chromosomes are inherited by the next population, whereas the remaining 90% of vacant places have to be taken by genetically modified chromosomes. Subsequent populations have to have the same number of chromosomes, denoted by P. The elitist strategy for constructing the next population, starting from the current one, is illustrated in Fig. 1(b) above. Thus, the fittest P_e P chromosomes of the current population \Pi_t are directly inherited by the next population \Pi_{t+1}. They are the elite. (Obviously, after filling all places of \Pi_{t+1}, the elite can change.) The remaining (P - P_e P) positions have to be filled by genetically modified chromosomes. A temporary population \Delta_t, referred to as the \Delta-population, is constructed for this purpose and includes all chromosomes apt to reproduce. Any fit chromosome that can have offspring must transition through this mating pool, find a mate and produce offspring. The number of chromosomes in \Delta_t equals the number of vacant positions in \Pi_{t+1}. Finally, the next population (as well as the current one) must include only different chromosomes. Offspring are also selected by an elitist approach, as follows: for crossover, the fittest 2 chromosomes among the 2 parents and their 4 offspring are selected (recall there are 2 genes); for mutation and inversion, the fittest chromosome between each parent and its offspring is selected. This approach is slightly different from the usual one in a GA, where the offspring replace their parents regardless of their fitness. The elitist approach has the advantage of keeping the chromosomes most apt to reproduce in the populations, so that the chance of skipping the optimum is reduced. The drawback is that the elite may stop the evolution by dominating the population across many generations. In order to attenuate this drawback, a number of places in \Pi_{t+1} are left free, to be filled by randomly chosen chromosomes (that must all be different). This “refreshing” technique allows the outstanding chromosomes to be combined with less fit ones and avoids stagnation of the evolution. The free places are simply reserved by imposing the following natural restriction: only a maximum number of genetic operations M_{go} is allowed in order to complete the population \Pi_{t+1}. If after
M_{go} operations the next population is incomplete, the remaining places are free to be taken by randomly chosen chromosomes. How can the \Delta-population be generated? Unlike \Pi_t and \Pi_{t+1} (which must include only different chromosomes), the (P - P_e P) places of \Delta_t are filled according to the representativity of the chromosomes in \Pi_t (i.e. to their reproduction capacity/chance). Thus, the most representative chromosome will take (together with its clones) the maximum number of places. The representativity can be estimated numerically as a function R of the chromosomes \gamma_{n,k} that belong to the current population \Pi_t. A generalized Baker's procedure [4] can be used to construct the \Delta-population. Representativity can be quantified by using the fitness. Several techniques referred to as selection methods have been introduced in the literature [3]. In this context, a method that adapts the selection process to the current population has been preferred: Boltzmann annealing [4]. With the settings above, the GMPA is completely designed and can be implemented.
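The sketch below is one possible reading of the elitist population construction with the "refreshing" free places; it is not the authors' implementation. Chromosome encoding, fitness and the genetic operators are passed in as abstract callables, chromosomes are assumed hashable (e.g. bit strings), and the parameter values are assumptions.

    import random

    def next_population(current, fitness, genetic_op, random_chrom, Pe=0.1, max_ops=500):
        """Build Pi_{t+1}: copy the elite, add genetically modified chromosomes drawn from a
        fitness-proportional mating pool, then fill any free places with random chromosomes."""
        P = len(current)
        ranked = sorted(current, key=fitness, reverse=True)
        nxt = list(dict.fromkeys(ranked[:int(Pe * P)]))       # elite, kept distinct
        weights = [max(fitness(c), 1e-12) for c in current]   # representativity via fitness
        mating_pool = random.choices(current, weights=weights, k=P - len(nxt))
        ops = 0
        while len(nxt) < P and ops < max_ops:
            parent = random.choice(mating_pool)
            child = genetic_op(parent, mating_pool)           # crossover / mutation / inversion
            ops += 1
            best = max([parent, child], key=fitness)          # elitist offspring selection
            if best not in nxt:
                nxt.append(best)
        while len(nxt) < P:                                   # "refreshing": free places
            c = random_chrom()
            if c not in nxt:
                nxt.append(c)
        return nxt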
4 Simulation Results
Data were provided by bearings like the one illustrated in Fig. 2(a).
Fig. 2. (a) Bearing . (b), (c) Vibration data and spectra
The standard bearing (defect free) has been labeled by . Another 3 bearings with the same geometry have been used to provide vibration data that encode defects, as follows: (with a crack located on the inner race), (with wear of the outer race) and <M3850609> (with cavities on the inner and outer races). Fig. 2(a) also shows the value of the sampling rate (\nu_s = 20 kHz), the nominal estimated rotation speed of the shaft (44.3 Hz) and the natural frequencies of the bearing in decreasing order. The resulting vibration data have small SNR, especially in the case of defect-encoding vibrations (under 6 dB, i.e. more than 30% of the energy
consists of undesirable noise). A band-pass filter of 0.5-10 kHz was applied in order to remove some noises and the main rotation harmonics up to order 10. Fig. 2 displays data and spectra for bearings (b) and (c). The spectra of the other 2 bearings are quasi-identical to (c) and thus the defects cannot be discriminated. After projecting the vibrations on the tfs dictionary, the distributions of bmas over the tfs plane have been drawn, as depicted in Fig. 3, for bearings and . The tf planes have been horizontally stacked, in order of scales, with time on the vertical axis. On the horizontal axis, the frequency band is split into a variable number of sub-bands, depending on the scale index (from 0 to 8). Atoms are located in rectangles of variable size, depending on scale, because the frequency resolution varies along scales (a larger scale index involves bigger rectangles). Also, the rectangles overlap in frequency, because the atom spectra overlap. The fitness value (in dB) of every bma is represented by a color (or gray level) according to the scale on the right side of the images. It starts with light colors (blue or gray) for small values and ends with dark colors (red or black) for large values. Thanks to the absence of defect-encoding noise, the bmas in the left window are almost uniformly distributed over the tfs plane, without groups of high-fitness atoms concentrated in a specific zone. One can remark that the noise (scales #7 & #8) is located at low frequency, although some more high-frequency atoms appeared in the end. The distribution corresponding to the bearing with inner race defects (right-side image) reveals a higher energy concentration on scale #0 (with more than 12% of the initial energy) within a small number of bmas. The 2 abnormal groups of bmas are located around sub-band #571, which gives a frequency of about 10.569 kHz, i.e. approximately 47 × BPFI (“BPFI” stands for Ball Pass Frequency on the Inner race). Thus, the defect is quite clearly decoded. Moreover, the noise is now located at high frequencies (scale #8). For the other 2 bearings, the defects are correctly diagnosed as well [4], even though all 3 “defective” spectra are practically indistinguishable from each other. The defect severity has also been estimated, even in the case of multiple defects. The fdd results have been confirmed by dismounting the bearings.
Fig. 3. Distributions of atoms over tfs plane: (left) and (right)
5 Conclusion
Processing of noisy signals usually requires methods with a high degree of complexity. Some such methods lead to greedy procedures, similar to the Matching Pursuit Algorithm presented in this paper. The complexity of the method described above stems from a non-linear optimization problem that cannot be solved efficiently by means of classical, gradient-based techniques. Therefore, an evolutionary approach using Genetic Algorithms has been proposed. The resulting Genetic Matching Pursuit Algorithm exhibited interesting features, such as denoising, extraction of the useful signal with a desired SNR, and decoding of some information related to fdd. Moreover, its performance can be improved by operating with different mother waveforms, adapted to the nature of the analyzed signal. The algorithm can be used for a larger class of one-dimensional signals than vibrations, in a framework where a method issued from Signal Processing and a strategy relying on Evolutionary Computing proved that these two fields can work together in a promising alliance.
References
[1] Figueras i Ventura R.M., Vandergheynst P., Frossard P.: Evolutionary Multiresolution Matching Pursuit and Its Relations with the Human Visual System. Proceedings of EUSIPCO 2002, Toulouse, France, Vol. II, 395-398 (September 2002)
[2] Mallat S., Zhang Z.: Matching Pursuits with Time-Frequency Dictionaries. IEEE Transactions on Signal Processing, Vol. 41, No. 12, 3397-3415 (December 1993)
[3] Mitchell M.: An Introduction to Genetic Algorithms. The MIT Press, Cambridge, Massachusetts, USA (1995)
[4] Stefanoiu D., Ionescu F.: Vibration Faults Diagnosis by Using Time-Frequency Dictionaries. Research Report AvH-FHKN-0301, Konstanz, Germany (January 2003)
A Fuzzy Classification Solution for Fault Diagnosis of Valve Actuators C. D. Bocănială1, J. Sa da Costa2, and R. Louro2 1
University “Dunărea de Jos” of Galati, Computer Science and Engineering Dept. Domneasca 47, Galati 6200, Romania [email protected] 2 Technical University of Lisbon, Instituto Superior Tecnico, Dept. of Mechanical Engineering, GCAR/IDMEC Avenida Rovisco Pais, Lisboa 1096, Portugal [email protected] [email protected]
Abstract. This paper proposes a fuzzy classification solution for fault diagnosis of valve actuators. The belongingness of the current state of the system to the normal and/or a faulty state is described with the help of fuzzy sets. The theoretical aspects of the classifier are presented. Then, the case study – the DAMADICS benchmark flow control valve – is briefly introduced, together with the method used to generate the data for designing and testing the classifier. Finally, the simulation results are discussed.
1 Introduction
Fault diagnosis is a suitable application field for classification methods, as its main purpose is to achieve an optimal mapping of the current state of the monitored system into a prespecified set of behaviors, normal and faulty [6]. Fault diagnosis is performed in two main stages: fault detection, which indicates if a fault occurred in the system, and fault isolation, which identifies the specific fault that affected the system [8]. Classification methods are usually used to perform fault isolation. Recent research in the field largely proposes classification methods based on soft computing methodologies: neural networks [8], fuzzy reasoning [4],[5] and combinations of them, neuro-fuzzy systems [2],[7]. This paper proposes a fuzzy classification solution for fault diagnosis of valve actuators. Diagnosis is performed in one single stage. Based on the raw sensor measurements, the classifier identifies the behavior of the system. The classifier uses fuzzy sets to describe the belongingness of the current state of the system to a certain class of system behavior (normal or faulty). The class of behavior with the largest degree of belongingness represents the current behavior of the system. The structure of the paper is as follows. Section 2 presents the theoretical aspects of the classifier. Section 3 introduces the case study valve and details the method used to
generate the data for designing and testing the classifier. The last section, Section 4, describes the simulation results obtained using MATLAB. The paper ends with some conclusions and further work considerations.
2 The Fuzzy Classifier
Cluster analysis studies methods for splitting a group of objects into a number of subgroups on the basis of a chosen measure of similarity. The similarity of objects within a subgroup is larger than the similarity of objects belonging to different subgroups. Usually, a subgroup is considered a classical set, which means that an object either belongs to the set or not. This description lacks nuances. In contrast, an object can be assigned to a fuzzy set with a varying degree of membership from 0 to 1. Baker [1] proposes a cluster model that builds an optimal decomposition of the given group of objects into a collection of fuzzy sets. If u and v belong to the analyzed group of objects, the similarity between them, s(u,v), is measured via a dissimilarity measure, d(u,v) = 1 - s(u,v). The dissimilarity measure is expressed using a distance function h_\beta that maps the distance between u and v, \delta(u,v), into the interval [0, 1] (Eq. 1). If \delta(u,v) decreases towards 0, h_\beta(\delta(u,v)) also decreases towards 0 and [1 - h_\beta(\delta(u,v))] increases towards 1. If \delta(u,v) increases towards \beta or is larger than \beta, h_\beta(\delta(u,v)) increases towards 1 or, respectively, is 1. In the latter case, [1 - h_\beta(\delta(u,v))] decreases towards 0 or, respectively, is 0. These facts justify the choice of h_\beta as the dissimilarity measure.
    h_\beta(\delta(u,v)) = \begin{cases} \delta(u,v)/\beta, & \text{for } \delta(u,v) \le \beta \\ 1, & \text{otherwise} \end{cases}    (1)
The fuzzy classifier proposed in this paper uses ideas from the previously mentioned cluster model and is described in the following. In order to design and test the classifier, the set of all available data for the problem to be solved is split into three distinct subsets: the reference pattern set, the parameter tuning set, and the test set. The elements in the reference pattern set are grouped according to their membership of the prespecified classes. That is, all elements in the reference pattern set belonging to a specific class are in the same subgroup. Let the obtained partition be C = {C_i}, i = 1, ..., n, where n is the number of prespecified classes. If u is the input of the classifier, the subset affinity measure defined in [1] can be used to express its average similarity to a specific subset C_i (Eq. 2). The notation n_i stands for the cardinality of C_i.
    r(u, C_i) = 1 - \frac{1}{n_i} \sum_{v \in C_i} h_\beta(\delta(u,v)).    (2)
Given the subset affinity measure, a fuzzy membership function can describe the degree of belongingness of the input u to the subset C_i (Eq. 3) [1]. The notations n_i and n stand for the cardinality of C_i and, respectively, the cardinality of C.
    f_i(u) = \frac{n_i - \sum_{v \in E_i} h_\beta(\delta(u,v))}{n - \sum_{v \in E} h_\beta(\delta(u,v))}.    (3)
The classification task is performed taking into account the fuzzy membership function values of the input u for all subsets C_i. Recalling that the subset C_i represents all elements in the reference pattern set belonging to the i-th prespecified class, the input u is classified as a member of the class whose corresponding fuzzy membership value is the largest (Eq. 4). In case of ties, the vector to be classified is rejected.
    u \in C_{\max} \Leftrightarrow f_{\max}(u) = \max_{i=1,\dots,m} f_i(u).    (4)
Baker [1] uses only one value of the \beta parameter. In this paper, each class has an associated distance function h_\beta that receives a dedicated value of the parameter \beta. That is, the classifier has as many parameters as the number of prespecified classes. This modification improves the performance. The parameters of the classifier are tuned by increasing the performance of the classifier on the parameter tuning subset towards an optimal value. The algorithm for tuning the parameters must search an n-dimensional space for the parameter vector (\beta_1, ..., \beta_n) that offers optimal performance when applying the classifier on the parameter tuning set. Genetic Algorithms can be used to perform this search. Each individual of the population contains n strings corresponding to the n parameters of the classifier. The genetic operators are applied on pairs of strings corresponding to the same parameter. Consecutive populations are produced applying elitism (the six fittest individuals survive in the next generation), followed by reproduction, crossover and mutation.
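A compact sketch of the classifier defined by equations (1)-(4), with one \beta per class as proposed here, might look as follows; the Euclidean distance and the rejection return value (None) are assumptions not fixed by the paper.

    import numpy as np

    class FuzzyClassifier:
        def __init__(self, reference, betas):
            """reference: dict class_label -> 2-D array of reference patterns (one per row);
            betas: dict class_label -> beta parameter dedicated to that class."""
            self.reference = reference
            self.betas = betas
            self.n = sum(len(v) for v in reference.values())

        def _h(self, d, beta):
            return np.minimum(d / beta, 1.0)                         # eq. (1)

        def membership(self, u):
            sums = {c: self._h(np.linalg.norm(p - u, axis=1), self.betas[c]).sum()
                    for c, p in self.reference.items()}
            total = sum(sums.values())
            return {c: (len(self.reference[c]) - sums[c]) / (self.n - total)  # eq. (3)
                    for c in self.reference}

        def classify(self, u):
            f = self.membership(u)
            best = max(f, key=f.get)
            ties = [c for c, v in f.items() if np.isclose(v, f[best])]
            return None if len(ties) > 1 else best                   # eq. (4); reject ties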
3 Case Study
The DAMADICS benchmark flow control valve was chosen as the case study for this method. More information on the DAMADICS benchmark is available via the web site [3]. The valve has the purpose of supplying water to a steam-generating boiler. It has three main parts: a valve body, a spring-and-diaphragm actuator and a positioner [9]. The valve body is the equipment that sets the flow through the pipe system. The flow is proportional to the minimum flow area inside the valve, which, in its turn, is proportional to the position of a rod. The spring-and-diaphragm actuator determines the position of this rod. The spring-and-diaphragm actuator is composed of a rod, which at one end is connected to the valve body; the other end has a plate, which is placed inside an
airtight chamber. The plate is connected to the walls of the chamber by a flexible diaphragm. This assembly is supported by a spring. The position of the rod is proportional to the pressure inside the chamber, which is determined by the positioner. The positioner is basically a control element. It receives three signals: a measurement of the position of the rod (x), a reference signal for the position of the rod (CV) and a pneumatic signal from a compressed air circuit in the plant. The positioner returns an airflow signal, which is determined by a classic feedback control loop of the rod position. The airflow signal changes the pressure inside the chamber. Two of the several sensors that are applied to the valve, the sensor for measuring the position of the rod (x) and the sensor for measuring the water flow through the valve (F), provide variables that contain information relative to the faults. The valve is subject to a total of 19 faults that affect all of its components. For the work detailed in this paper only 5 of these faults were considered. A detailed description of the selected faults follows. Sometimes the movement of the rod is limited by an external mechanical event. If this happens the rod will not be able to move above a certain position. This situation is known as fault F1.
Fig. 1. Effects of the faults on the position of the rod
There are large pressure differences inside a valve. Under certain conditions, if the water pressure drops below the vapor pressure, the water may undergo a phase change from liquid to gas. If this happens the flow will be governed by the laws of compressible flow, one of which states that if the water vapor reaches the speed of sound, a further increase in the pressure difference across the valve will not lead to an increase in flow. Therefore there is a limit to the flow. The effects of this fault will also be visible on x because it is dependent on the flow. During the normal operation of the valve this phenomenon is unlikely to occur; however, sometimes there are
temperature increases which allow the occurrence of this fault. This situation is known as fault F2.
Fig. 2. Effects of the faults on the flow
During the normal operation of the system the upstream and downstream water pressures remain in a given range. However, due to a pump failure or to a process disturbance, the values of these pressures may change and leave the above-mentioned range. If this happens the value of the flow will deviate from the normal behavior, with some effects on the rod position as well. This situation is known as fault F3. The valve is placed in a pipe circuit that has a circuit parallel to the valve. This parallel circuit is intended to replace the valve without cutting the water feed to the boiler. It has a gate valve that is closed during normal operation. However, due to a fault in the valve or to employee mishandling, this gate valve may be opened. This increases the flow of water to the boiler. This situation is known as fault F4. The flow sensor is subject to fault occurrence due to a wiring failure or to an electronic fault. If this happens it will cause the flow measurements to be biased. This situation is known as fault F5. The changes that each fault causes to the normal behavior can be seen in Fig. 1 and Fig. 2. The figures show that the effects of all the faults, with the exception of F3, remain constant from the moment in which they start. The effects of fault F3 are visible only temporarily on x and F. There is a small effect that endures in F, very similar to the effects of F4. This behavior makes fault F3 difficult to classify correctly.
4 Simulation Results
There are no available data related to the operation of the system while a fault is occurring, due to the economic costs that this would imply. Therefore there is the need to develop a program that models the system and to artificially introduce faults into this model in order to obtain data representative of the faulty behavior. The valve was extensively modeled using first principles. A MATLAB/SIMULINK program was then developed on the basis of this model [9]. For the simulation of the effects of the faults, the inputs to the program were taken from the available data from the real plant. This makes the data containing the fault effects obtained through simulation more realistic. The development of FDI systems based on these data is more difficult because they contain both dynamic and static behavior. As noted in Section 3, two measurements provide the best distinction among the considered classes of behavior: rod position (x) and flow through the valve (F). The first-order difference of x (dx) can be added to improve the discrimination between different faults. The input of the classifier at time instance t is the 3-tuple (x_t, F_t, dx_t) that represents the values of the three features at that time instance. The five selected faults were simulated for 20 values of fault strength, uniformly distributed between 5% and 100%, and different conditions for the reference signal (CV). These settings approximate very well all possible faulty situations involving the five selected faults. The simulation data were used as follows: 5% for the reference pattern set, 2.5% for the parameter tuning set, and the remaining 92.5% for the test set. The GA mentioned in Section 2 reached the maximum value of the objective function in 23 generations. The time necessary for producing a successive generation was 50 minutes on an Intel Pentium 4 (1.70 GHz CPU, 256 MB RAM). The performance of the classifier on the test set is shown in Table 1. The classifier correctly detects and identifies the faults F1, F2, F4 and F5 within 2 seconds of their appearance in the system. The exception is fault F3, which needs about 3 seconds for detection and 3 seconds for identification. These results are explained by the fact that F3 is mainly distinguishable from the normal state only on F (see Fig. 1 and Fig. 2).
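As a hedged illustration of the feature construction and data split described above (the exact sampling scheme used by the authors is not reproduced, and a simple contiguous split is assumed), one could write:

    import numpy as np

    def build_features(x, F):
        """Form the 3-tuples (x_t, F_t, dx_t), with dx_t the first-order difference of x."""
        dx = np.diff(x, prepend=x[0])
        return np.column_stack([x, F, dx])

    def split(data, ref=0.05, tune=0.025):
        """5% reference patterns, 2.5% parameter tuning, the remaining 92.5% test."""
        n = len(data)
        i1, i2 = int(ref * n), int((ref + tune) * n)
        return data[:i1], data[i1:i2], data[i2:]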
5 Summary
The paper introduced a fuzzy classification solution for fault diagnosis of valve actuators. The complexity of the problem is given by two facts. First, the data used are driven by real data and they cover very well all possible faulty situations. Second, the fault F3 is mainly visible in the flow measurement (F) and poorly visible in the other measurements, x and dx. These facts put a special emphasis on the good performance of the classifier. Further research needs to be carried out on the performance of the classifier when adding other faults to the five faults selected in this paper.
Table 1. The confusion matrix for the test data. The main diagonal contains the percentage of well-classified data per class
         N      F1     F2     F3     F4     F5
N      98.71   0.18   0.18   4.66   0.18   0.18
F1      0.14  99.82   0.00   0.79   0.00   0.00
F2      0.00   0.00  99.82   0.00   0.00   0.00
F3      0.29   0.00   0.00  90.74   0.24   0.00
F4      0.00   0.00   0.00   0.00   2.78  99.82
F5      0.00   0.00   0.00   0.00   0.18   0.00
References
[1] Baker, E.: Cluster Analysis by Optimal Decomposition of Induced Fuzzy Sets. PhD thesis, Delft University of Technology, Holland (1978)
[2] Calado, J. M. F., Sá da Costa, J. M. G.: An expert system coupled with a hierarchical structure of fuzzy neural networks for fault diagnosis. International Journal of Applied Mathematics and Computer Science 9(3) (1999) 667-688
[3] EC FP5 Research Training Network DAMADICS: Development and Application of Methods for Actuator Diagnosis in Industrial Control Systems (http://www.eng.hull.ac.uk/research/control/damadics1.htm)
[4] Frank, P. M.: Analytical and qualitative model-based fault diagnosis – a survey and some new results. European Journal of Control 2 (1996) 6-28
[5] Koscielny, J. M., Syfert, M., Bartys, M.: Fuzzy-logic fault isolation in large-scale systems. International Journal of Applied Mathematics and Computer Science 9(3) (1999) 637-652
[6] Leonhardt, S., Ayoubi, M.: Methods of Fault Diagnosis. Control Eng. Practice 5(5) (1997) 683-692
[7] Palade, V., Patton, R. J., Uppal, F. J., Quevedo, J., Daley, S.: Fault diagnosis of an industrial gas turbine using neuro-fuzzy methods. Preprints of the 15th IFAC World Congress, Barcelona, Spain (2002)
[8] Patton, R. J., Lopez-Toribio, C. J., Uppal, F. J.: Artificial intelligence approaches to fault diagnosis for dynamic systems. International Journal of Applied Mathematics and Computer Science 9(3) (1999) 471-518
[9] Sá da Costa, J., Louro, R.: Modelling and Simulation of an Industrial Actuator Valve for Fault Diagnosis Benchmark. Proceedings of the Fourth International Symposium on Mathematical Modelling, Vienna (2003) 1057-1066
Deep and Shallow Knowledge in Fault Diagnosis Viorel Ariton “Danubius” University from Galati Lunca Siretului no.3, 6200 – Galati, Romania [email protected]
Abstract. Diagnostic reasoning is fundamentally different from the reasoning used in modelling or control: the latter is deductive (from causes to effects) while the former is abductive (from effects to causes). Fault diagnosis in real complex systems is difficult due to multiple effects-to-causes relations and to various running contexts. In deterministic approaches, deep knowledge is used to find “explanations” for effects in the target system (impractical when the modelling burden grows); in soft-computing approaches, shallow knowledge from experiments is used to link effects to causes (unrealistic for running real installations). The paper proposes a way to combine shallow knowledge and deep knowledge on conductive flow systems at faults, and offers a general approach for diagnostic problem solving.
1 Introduction
Fault diagnosis is basically abductive reasoning: from manifestations, as effects, one must infer faults, as causes. The target system's utilities consist of some wanted outputs and also appear as effects. Faults are causes that bring a utility far from its desired values; they are unexpected and usually not included in the cause-effect model of the target system's operation. Such causes are related to the physical installation (state and age of components, accidents), to human operators (technological discipline, maintenance) or to the environment (season, working conditions, other installations). From a human point of view, the target system goes in normal running from known causes to expected effects, while in faulty running it goes from unknown causes to unexpected effects. Diagnostic reasoning – abductive reasoning in general – involves an open space of causes and, possibly, an open space of effects. Effects-to-causes mapping is based on incomplete knowledge about the relations between causes and effects. Deterministic approaches seem inappropriate for diagnostic problem solving: they assume quantitative and precise values as well as crisp causal relations (deep knowledge), hence closed sets of parameters and finite value ranges. Soft-computing approaches, fuzzy logic and artificial neural networks (ANN), seem better suited, being able to deal with incomplete knowledge and imprecise data. Usually, they require a large amount of data from practice or from experiments (shallow knowledge), acquired over a long time; if obtained by provoking faults in running installations, they look unrealistic. Mixing deep knowledge (compact and general) with shallow
knowledge (diverse and specific) will bring an advantage. The paper develops considerations on deep and shallow knowledge in fault diagnosis for the class of Conductive Flow Systems (CFSs), and proposes a methodology to mix them and to apply techniques to detect and isolate faults. It unifies deep and shallow knowledge representation by assuming qualitative aspects of the real system's structure and behaviour, similar to the way human diagnosticians proceed. Faults in Conductive Flow Systems spread effects from the place they occur (as primary effects) throughout the entire system (as secondary effects). Using deep knowledge on flow conduction laws and on the CFS's structure of components, it is possible to reject secondary effects and use only primary effects directly, in a recognizing scheme, for fault isolation. The deep and shallow knowledge mixture for fault diagnosis briefly refers to:
a) qualitative means-end and bond-graph models of components and of the whole target CFS regarding normal behaviour and structure;
b) generic anomalies defined for the faulty behaviour of modules and components;
c) patterns of qualitative deviations of flow variables in bond graph junctions for each generic anomaly;
d) generic types of primary effects and their links to faults at components;
e) recognizing schemes and architecture of the diagnosis application.
Items a to c represent theoretical premises for the way the fault diagnosis proceeds in d, e; deep and shallow knowledge appear in a, b and c, d respectively.
2 Qualitative Abstraction on Structure and Behaviour
Means-end models refer to the goals (ends) of a target system and the functions it performs, as means to reach the ends. For example, the MFM means-end modelling approach [4] considers a set of flow functions for low-level operations of components (e.g. valve open – transport flow function, valve shut – barrier flow function, etc.). A network of such flow functions performs an end (utility) and corresponds to a module.

2.1 Qualitative Means-End Modelling of Conductive Flow Systems
However, for the qualitative approach proposed and for fault diagnosis it is not necessary to work with detailed flow functions (transport, barrier, etc.) but with some qualitative flow functions. [6] refers to three orthogonal operational facets of a technical system (see Table 1), suited to the three qualitative flow functions proposed further below.

Table 1. Functional orthogonal facets of real-world components

Concept     Activity          Aspect
Process     Transformation    Matter
Flow        Transportation    Location
Store       Preservation      Time
Let us consider for each concept in Table 1 a generic flow function:
• flow processing function (fpf) – a chemical or physical transformation of the piece of flux (to an end, or utility);
• flow transport function (ftf) – a change of spatial location of the piece of flux (by pipes, conveyors, etc.);
• flow store function (fsf) – a time delay of the piece of flux, by accumulation of mass or energy in some storing or inertial components.
A real component may perform one or more generic flow functions; faults affect each of them in specific ways. A component that performs a unique flow function is a primary component (PC). It is the final location for fault isolation, so modules and components will be “split” into as many primary components as necessary. The set of all primary components defines the “granularity of the diagnosis”. A module (M) is a structure of components (as means), the set of all its flow functions completing a certain end. The abstraction above is useful for modelling CFSs' qualitative behaviour. When faults occur, the physical component's function(s) get disturbed in specific ways (see below). Fault isolation is performed down to components performing only one generic flow function.
2.2 Bond Graph Modelling and Qualitative Means-End Approach
CFSs involve the transport of matter and energy as flows along given paths. The bond graph modelling approach [3] represents Kirchhoff's laws with explicit components and structures. It allows modularization of the CFS model – originally applied only to the whole system. The notions used in the paper are:
• effort and flow: extensive/intensive power variables (pressure-like and flow-rate-like);
• bond graph passive components: R (resistance-like), storage C (capacity), inertial component I (inductance-like);
• bond graph active components (power conversion): T transformer, G gyrator.
It is worth noting that bond graph components correspond to primary flow functions (and primary components, PC): R to the flow transport function (ftf), C and I to the flow store function (fsf), T and G to flow processing (fpf). To each type of Kirchhoff's law there corresponds a bond graph junction:
• 0-junction (node): the flow rate is the common power variable, the effort the summed one;
• 1-junction (loop): the effort is the common power variable and the flow the summed one.
In the presented approach, the bond graph model of the target CFS is not meant for normative running (as usually encountered) but only to obtain the modular structure of modules and components necessary for isolating faults, as presented below.
2.3 Deep Knowledge on the Target CFS Structure
Knowledge on components, on the structure of components and on their functions (flow functions), along with hybrid modelling by bond graphs and modules' activities [5], represents deep knowledge on the target CFS. A module M corresponds to a bond graph junction of components; the entire CFS corresponds to a junction of modules. The target CFS requires hierarchical decomposition: from the CFS to block units, then modules, and finally PCs [1]. Because bond graph modelling has no means to represent the relative positions of modules (or components) along flow paths or inside junctions, let us introduce the up-stream weak order between units in each junction of the target CFS [1]. Up-stream/down-stream relations between modules are crucial for the transport anomaly identification procedure (see 4.1).

2.4 Imprecise and Incomplete Knowledge on Manifestations
Human diagnosticians evaluate deviations of observed variables in an imprecise manner; they deal with incomplete knowledge and with the (normal) drift of values in the installation's real running. So, a value is “normal” (no) when it falls inside a given range; it is “too high” (hi) or “too low” (lo) when it falls in a neighbouring range. The hi, lo deviations represent qualitative values of an observed variable. Common representations for such qualitative values are fuzzy attributes (ranges with imprecise limits); hi and lo are fuzzy attributes [2]. Fuzzy representation has another advantage: from continuous values one may obtain discrete ones (attributes). The human diagnostician deals with discrete pieces of knowledge when linking effects to faults and then mapping and recognizing them (as proposed in 4.2). Attributes are, in fact, primary effects and, along with the mapping, represent shallow knowledge. Fuzzy attributes of continuous variables (hi, lo) and intrinsic specific attributes of other discrete variables (muddy, noisy, etc.) are manifestations directly linked to faults. Let us denote the manifestations related to flow conduction by FMAN (deviations of flow and effort variables, which propagate), and the manifestations from other variables by MAN (which do not propagate).
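As a generic illustration, not taken from the paper, the lo/no/hi fuzzy attributes of a continuous variable could be computed with simple ramp memberships as sketched below; the range limits and spread are assumed values.

    def fuzzify(value, lo_limit, hi_limit, spread):
        """Return fuzzy degrees of the qualitative attributes lo, no, hi for one variable.

        lo_limit/hi_limit bound the 'normal' range; spread widens the imprecise limits."""
        def ramp(x):                        # clip to [0, 1]
            return max(0.0, min(1.0, x))
        lo = ramp((lo_limit - value) / spread + 0.5)
        hi = ramp((value - hi_limit) / spread + 0.5)
        no = ramp(1.0 - lo - hi)
        return {"lo": lo, "no": no, "hi": hi}

    print(fuzzify(7.2, lo_limit=5.0, hi_limit=7.0, spread=1.0))   # slightly too high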
3 Faulty Behaviour Modelling
Faulty generic components exhibit anomalies. To each generic flow function a generic anomaly is attached, as follows:
a) Process anomaly (AnoP) – a deviation from the normal value (too high or too low) of an end-variable; it refers to the transformations the flow undergoes.
b) (Flow) Transport anomaly (AnoT) – changes of flow variables or of the inner structure of a component relative to flow transport along flow paths.
c) Store anomaly (AnoS) – a deviation from the normal value of the delay specific to a storing (capacitor-like) or inertial (inductance-like) component.
Works on fault diagnosis deal with concepts such as “leakage” or “obstruction”, but no complete set of transport anomalies has been defined. Such a set is presented below [1]:
a) Obstruction – consists in a change (increase) of the transport resistance-like parameter of a component, without flow path modification (e.g. a clogged pipe).
b) Tunnelling – consists in a change (decrease) of the transport resistance-like parameter, without flow path modification (e.g. a broken-through pipe).
c) Leakage – consists in a structure change (balance low) of the flow transport, involving flow path modification.
d) Infiltration – consists in a structure change (balance high) of the flow transport, with flow path modification.
Transport anomalies are orthogonal in pairs (obstruction to tunnelling and leakage to infiltration), each pair being orthogonal to the other. A fault causes a unique transport anomaly that appears at the respective component and module. Actually, transport anomalies are primary effects that may enter the recognizing scheme for fault isolation. Transport anomalies refer to FMAN (deep knowledge), while for MAN the faulty behaviour modelling is a faults-to-effects mapping (shallow knowledge).
4 Fault Detection and Isolation
In order to isolate faults efficiently, recognizing schemes have to use only primary effects, which are linked to faults. Secondary effects appear at non-faulty components and each is specific to a transport anomaly, but they are irrelevant information for fault isolation. Separating primary effects from the set of all effects means rejecting secondary effects.

4.1 Fault Detection and Isolation Procedure
Transport anomalies are also primary effects, as they appear only at the faulty module or component. Fault detection consists in anomaly detection. Fault isolation at module level is similar: transport anomaly detection. Fault isolation inside a module is based on the causal relations between faults and primary effects. It is possible to generate patterns of qualitative deviations, in order to identify a transport anomaly at a (faulty) module, depending on the junction type and on the up-stream/down-stream relations relative to the other modules in the junction [2].

Table 2. Patterns of qualitative deviations of the power variables at components' inputs, for each transport anomaly occurring at a faulty component
Transport anomaly   | 1-junction, faulty component located: up-stream / down-stream | 0-junction, fault located down-stream: up-stream / down-stream | 0-junction, fault located up-stream: up-stream / down-stream
Obstruction (Ob)    | lo-hi / lo-lo | lo-hi / hi-hi | hi-lo / lo-lo
Tunnelling (Tu)     | hi-lo / hi-hi | hi-lo / lo-lo | lo-hi / hi-hi
Infiltration (In)   | lo-hi / hi-hi | lo-hi / hi-hi | lo-hi / hi-hi
Leakage (Le)        | hi-lo / lo-lo | hi-lo / lo-lo | hi-lo / lo-lo
Faulty module isolation proceeds by finding the patterns of Table 2 at the inputs of neighbouring modules in each bond graph junction in which they appear.

4.2 Architecture of ANN Blocks for Fault Diagnosis
Human experts have good knowledge of the faulty behaviour of components and modules but poor knowledge of the entire system's faulty behaviour. The proposed approach tries to cope with both: faulty module isolation is based on deep knowledge (flow transport anomalies), while fault isolation inside a module is based on the expert's shallow knowledge (the mapping between primary effects and faults). To each module correspond two artificial neural network (ANN) blocks (see Fig. 1): one for module isolation (detecting patterns from Table 2) and one for component isolation (the Fault Isolation Block), which recognizes faults from patterns of primary effects expressed as fuzzy attributes lo, hi of observed variables. Since bond graphs are modular and junctions are independent, each module has its own neural block for module isolation inside its junction (the Module Isolation Block). The Fault Isolation Block neural network maps non-flow-related manifestations (MAN) and flow-related manifestations (FMAN – see 2.4) to the faults of each PCi.
Fig. 1. Architecture of ANN blocks for isolation of faulty module and component
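A schematic sketch of how the two blocks could cooperate per module is given below; all names (junction_inputs, man_vars, fman_vars) and the two trained-network callables are hypothetical, since the paper does not specify a software interface.

    def diagnose(module, observations, module_isolation_nn, fault_isolation_nn):
        """Two-stage diagnosis for one module: first decide from the qualitative deviation
        pattern at its junction (cf. Table 2) whether the module itself is faulty, then map
        its MAN/FMAN attributes to a fault of one primary component."""
        pattern = [observations[v] for v in module.junction_inputs]   # lo/no/hi attributes
        anomaly = module_isolation_nn(pattern)     # one of Ob/Tu/In/Le, or None
        if anomaly is None:
            return None                            # the fault lies in another module
        attributes = [observations[v] for v in module.man_vars + module.fman_vars]
        return fault_isolation_nn(attributes)      # fault label of a primary component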
5 Case Study on a Simple Hydraulic Installation
Fault diagnosis was carried out for a simple hydraulic installation in a rolling mill plant, comprising three modules: Supply Unit (pump, tank and pressure valve), Hydraulic Brake (control valve, brake cylinder) and Conveyor (control valve, self, the conveyor cylinder). For the 20 faults of the 8 components considered, manifestations come from sensors for FMAN (2 flow-rate and 4 pressure variables) and for MAN (5 temperature variables, 8 binary variables – cylinders' left/right ends, open/shut valves – and 10 operator-observed variables, i.e. noise and oil mud). The software architecture exhibits 6 backpropagation ANN blocks – 2 per module.
754
Viorel Ariton
Conveyor
Hydraulic Brake F=20
F=200
J'1
Drossel 66%
Ctrl. Valve 1 Pressure Valve
J1"
Ctrl. Valve 2
J"0 J'0 J1'''
Pump
Oil Tank
Fig. 2. Simple hydraulic installation under diagnosis
Fault diagnosis was performed with two approaches: (1) using only shallow knowledge from experiments, and (2) using deep and shallow knowledge, as proposed. The fault recognition rates in the two experiments were 58% and 92%, respectively. The results are not surprising, since the second approach involves much more information on the target system and subsumes the first. The qualitative abstraction of structure and behaviour offers premises for rapid prototyping of diagnosis applications, each meant for a particular CFS (each installation in real life). Building installation-specific diagnosis applications is of great importance, since each installation is actually unique: the structure may contain the same modules as other installations, but the parameters vary, and the behaviour may depend on location, environment, operators, age or provider of components. A CAKE (Computer Aided Knowledge Elicitation) tool was built to replace the knowledge engineer in assisting the human diagnostician with target system analysis; a CASE tool then automatically builds the fault diagnosis application for the given real installation. The fault diagnosis system represents and mixes both types of knowledge in a single application (built automatically by cascading the CAKE tool with the CASE tool).
6 Conclusion
Fault diagnosis of real complex flow conduction systems often relies on shallow knowledge from experts. It comes from their practice on faults of the functional units (modules), each meant for an end of the process. The faulty behaviour of an entire complex system is quite complicated and hardly known. The paper proposes a way to combine deep knowledge – on the structure of modules and on secondary effects (propagated by conduction) – with shallow knowledge – the faults-to-primary-effects mapping (from practice or experiments). As a result, many irrelevant relations from effects to faults disappear, and fault diagnosis becomes manageable and suited to connectionist implementation. Fuzzy representation turns continuous values into discrete ones, hence all data representation is discrete – the same way the human diagnostician uses information – and is also suited to neural network recognition for fault isolation.
Learning Translation Templates for Closely Related Languages Kemal Altintas* and Halil Altay Güvenir Department of Computer Engineering, Bilkent University Bilkent, 06800 Ankara Turkey [email protected] [email protected]
Abstract. Many researchers have worked on example-based machine translation and different techniques have been investigated in the area. In the literature, a method of using translation templates learned from bilingual example pairs has been proposed. This paper investigates the possibility of applying the same idea to close languages, where word order is preserved. In addition to applying the original algorithm to example pairs, we argue that the similarities between the translated sentences can always be learned as atomic translations. Since the word order is almost always preserved, there is no need for any previous knowledge to identify the corresponding differences. The paper concludes that applying this method to close languages may improve the performance of the system.
1 Introduction
Machine translation has been an interesting area of research since the invention of computers. Many researchers have worked on this subject and developed different methods. Currently, there are many commercial and operational systems and the performances of the machine translation systems are best when the languages are close to each other [2]. There are two main approaches in corpus-based machine translation: statistical methods and example based methods. All corpus-based methods require the presence of a bilingual corpus in hand. The necessary translation rules and lexicons are automatically derived from this corpus. Example based methods in machine translation use previously translated examples to form a “translation memory” for the translation process [3]. There are three main components of example-based machine translation (EBMT): matching fragments against a database of real examples, identifying the corresponding translation fragments and recombining these to give the target text [7]. *
* Currently affiliated with Information and Computer Science Department, the University of California, Irvine.
A detailed review of example-based machine translation systems can be found in [9]. The idea of learning generalized translation templates for machine translation was investigated by Cicekli and Güvenir [5]. They proposed a method for learning translation templates from bilingual translation examples. Their system is based on analyzing similarities and differences between two translation example pairs. There is no linguistic analysis involved in the method and the system depends entirely on string matching. The authors claim that the method is language independent and they show that it works for Turkish and English, which are two virtually unrelated languages. The principal idea of the translation template learning framework as presented in [5] is based on a heuristic to infer the correspondences between the patterns in the source and target languages from two given translation pairs. The similarities between the source language sentences are identified and assumed to correspond to the similar parts in the target language. Likewise, the differences in the source language sentences should correspond to the differences in the target language sentence pair. The system they present identifies the similarities and differences between source and target language pairs and learns generalized translation rules from these examples. In this paper, we investigate the possibility of applying the same idea to closely related languages by using the corresponding translated sentences themselves instead of using two examples. We take Turkish and Crimean Tatar as the example closely related language pair, and we believe that the idea can be developed and applied to other close language pairs. The rest of the paper is organized as follows: the next section introduces the concept of translation template, and Section 3 gives the details of the learning process, comparing it with the method proposed in [5]. Section 4 discusses some weak points of the approach we present here, and the last section summarizes the ideas and concludes the paper.
2 Translation Templates
A translation template is a generalized translation exemplar pair in which some components are generalized by replacing them with variables in both sentences. Consider the following example:
X1 +Verb+Pos+Past+A1sg <=> Y1 +Verb+Pos+Past+A1sg
gel <=> kel
The left-hand side (first) part in this example, and in the following examples throughout the paper, refers to Turkish and the right-hand side (second) part refers to Crimean Tatar. The first template means that whenever the sequence "+Verb+Pos+Past+A1sg" follows any sequence that can be put in place of the variable X1, it can be translated into "+Verb+Pos+Past+A1sg" provided that it follows another sequence Y1, which is the translation of X1. In other words, after learning this rule, we can translate a sentence ending in "+Verb+Pos+Past+A1sg" provided that the beginning of the sentence can also be translated using the previously
learned rules. The second template is an atomic template, which can be read as: "gel" (come) in Turkish always corresponds to "kel" in Crimean Tatar. Since Turkish and all other Turkic languages are agglutinative, using the surface form (actual spelling) of the words may not be helpful. For example, the Turkish word "geliyoruz" (we are coming) corresponds to "kelemiz" in Crimean Tatar, and the two do not show much similarity at first sight. However, if we morphologically analyze the two words we get:
geliyoruz: gel+Verb+Pos+Prog1+A1pl
kelemiz: kel+Verb+Pos+Prog1+A1pl
The two analyses are similar except for the roots. Thus, using the morphological analyses of the two words may help us to learn many more rules. For the morphological analysis of Turkish, we used the analyzer developed by Oflazer [8]. For the Crimean Tatar part, we used the analyzer described in [1].
3 Learning Translation Templates
Close languages such as Turkish and Crimean Tatar share most parts of their grammars and vocabularies. The word order in close languages is most of the time the same, and even the ambiguities are preserved [6: p.807]. The first phase of the translation template learning algorithm is identifying the similarities and differences between the two sentences. A similarity is a non-empty sequence of common items in both sentences; actually, a similarity is an exact matching between sub-strings of the sentences. A difference is the opposite of a similarity: it is a non-common sequence of characters between the two sentences. In other words, a difference is what is not a similarity. The following translation pair gives the similarities (the shared morphological tags):
geliyoruz: gel+Verb+Pos+Prog1+A1pl
kelemiz: kel+Verb+Pos+Prog1+A1pl
A matching sequence between the sentences is a sequence of similarities and differences with the following properties:
• A similarity is followed by a difference and a difference is followed by a similarity.
• Two consecutive similarities or two consecutive differences cannot occur in a match sequence.
• If a terminal occurs in a similarity, it cannot occur in a difference.
• If a terminal occurs in a difference in one language, it cannot occur in a difference in the other language.
• A terminal occurring in both sentences must appear exactly n times, where n >= 1. If a terminal occurs more than once in both sentences, its ith occurrence in both sentences must end up in the same similarity of their minimal match sequence.
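As an illustration of this first phase, the following sketch (plain Python, not the authors' code; it uses the standard difflib module and ignores the terminal-occurrence constraints listed above) extracts the alternating similarities and differences between two token sequences.

from difflib import SequenceMatcher

def match_sequence(src_tokens, tgt_tokens):
    """Return a list of ('similarity'|'difference', src_part, tgt_part) blocks."""
    blocks = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, src_tokens, tgt_tokens).get_opcodes():
        kind = "similarity" if tag == "equal" else "difference"
        blocks.append((kind, src_tokens[i1:i2], tgt_tokens[j1:j2]))
    return blocks

# Morphological analyses of "geldim" (Turkish) and "keldim" (Crimean Tatar).
src = ["gel", "+Verb", "+Pos", "+Past", "+A1sg"]
tgt = ["kel", "+Verb", "+Pos", "+Past", "+A1sg"]
for kind, s, t in match_sequence(src, tgt):
    print(kind, s, t)
# difference ['gel'] ['kel']
# similarity ['+Verb', '+Pos', '+Past', '+A1sg'] ['+Verb', '+Pos', '+Past', '+A1sg']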
If these rules are satisfied, then there is a unique match for the sentences, or there is no match. The details of the algorithm that finds the similarities and differences between the two sentences are explained in [4]. Once the similarities and the differences are identified, the system replaces the differences with variables to construct a translation template. If there is no difference between the sentences, so that the match is composed of only a single similarity, then it is learned as an atomic template. In many cases, Turkish words and their Crimean Tatar counterparts are the same. For example, both the surface and lexical forms of the words "ev = ev+Noun+A3sg+Pnon+Nom" (house) and "bildim = bil+Verb+Pos+Past+A1sg" (I knew) are the same in Turkish and Crimean Tatar. For "ev", the following translation template is learned:
ev+Noun+A3sg+Pnon+Nom <=> ev+Noun+A3sg+Pnon+Nom
Although [5] does not discuss matching pairs with a single similarity, this case exists between close languages and can be learned. It is always possible that a variable in the template may have to be replaced with a noun like the one above. Consider the sentence "ev aldım = ev+Noun+A3sg+Pnon+Nom al+Verb+Pos+Past+A1sg" (I bought a house). If we have a template like:
X1 al+Verb+Pos+Past+A1sg <=> Y1 al+Verb+Pos+Past+A1sg
we can easily replace X1 with "ev+Noun+A3sg+Pnon+Nom" for the translation. If the matching sequence is composed of a single similarity and a single difference, then the difference is replaced with a variable and the similarity is preserved. Also, the differences and the similarities are learned as separate atomic templates. For the word pair
geldim = gel+Verb+Pos+Past+A1sg (I came)
keldim = kel+Verb+Pos+Past+A1sg
the following templates are learned:
X1 +Verb+Pos+Past+A1sg <=> Y1 +Verb+Pos+Past+A1sg
+Verb+Pos+Past+A1sg <=> +Verb+Pos+Past+A1sg
gel <=> kel
When the similarities are at the beginning, the same rule applies: the differences at the end are replaced with variables, and the similarities and differences are learned as separate atomic templates. When there are two similarities surrounding a single difference in the sentences, the difference is replaced with a variable and the differences and the similarities are learned as separate templates. For the sentence pair "eve geldim = ev+Noun+A3sg+Pnon+Dat gel+Verb+Pos+Past+A1sg" (I came home) and "evge keldim = ev+Noun+A3sg+Pnon+Dat kel+Verb+Pos+Past+A1sg" the following rules are learned:
ev+Noun+A3sg+Pnon+Dat X1 +Verb+Pos+Past+A1sg <=> ev+Noun+A3sg+Pnon+Dat Y1 +Verb+Pos+Past+A1sg
gel <=> kel
ev+Noun+A3sg+Pnon+Dat <=> ev+Noun+A3sg+Pnon+Dat
+Verb+Pos+Past+A1sg <=> +Verb+Pos+Past+A1sg
For the cases where there is more than one difference, the system should learn templates only if at least all but one of the differences have previously learned correspondences. Consider the following sentence pair:
okula geldim (I came to school)
okul+Noun+A3sg+Pnon+Dat gel+Verb+Pos+Past+A1sg
mektepke keldim
mektep+Noun+A3sg+Pnon+Dat kel+Verb+Pos+Past+A1sg
According to [5], the system should not learn anything if it does not know whether "okul" (school) is really the translation of "mektep" (school) or of "kel" (come). Actually, it is possible to learn rules without requiring that we know the corresponding differences. The algorithm proposed in [5] requires that at least all but one of the difference correspondences are known. That algorithm is a general method for learning and the system is language independent; the experiments were done for Turkish and English, where the word order is clearly different. Thus, for the general system, it might be necessary to verify that all but one of the differences have corresponding translations in hand. However, for close language pairs such as Turkish and Crimean Tatar, the word order is almost always preserved in the translation. Thus, if we know that our example translations are fully correct, we can learn the following templates without requiring any preconditions:
X1 +Noun+A3sg+Pnon+Dat X2 +Verb+Pos+Past+A1sg <=> Y1 +Noun+A3sg+Pnon+Dat Y2 +Verb+Pos+Past+A1sg
okul <=> mektep
+Noun+A3sg+Pnon+Dat <=> +Noun+A3sg+Pnon+Dat
gel <=> kel
+Verb+Pos+Past+A1sg <=> +Verb+Pos+Past+A1sg
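A minimal sketch of this idea follows (illustrative Python only, not the authors' implementation; it assumes the analyses are token lists and that word order is preserved, as argued above): each difference is replaced by a variable to form the generalized template, and every similarity and every aligned difference is emitted as an atomic template.

from difflib import SequenceMatcher

def learn_templates(src_tokens, tgt_tokens):
    """Learn one generalized template plus atomic templates from one aligned pair."""
    lhs, rhs, atomic = [], [], []
    var = 0
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, src_tokens, tgt_tokens).get_opcodes():
        if tag == "equal":
            chunk = src_tokens[i1:i2]
            lhs += chunk
            rhs += chunk
            atomic.append((chunk, chunk))                           # similarity <=> similarity
        else:
            var += 1
            lhs.append(f"X{var}")
            rhs.append(f"Y{var}")
            atomic.append((src_tokens[i1:i2], tgt_tokens[j1:j2]))   # aligned difference
    return (lhs, rhs), atomic

src = "okul +Noun+A3sg+Pnon+Dat gel +Verb+Pos+Past+A1sg".split()
tgt = "mektep +Noun+A3sg+Pnon+Dat kel +Verb+Pos+Past+A1sg".split()
template, atomics = learn_templates(src, tgt)
print(" ".join(template[0]), "<=>", " ".join(template[1]))
# X1 +Noun+A3sg+Pnon+Dat X2 +Verb+Pos+Past+A1sg <=> Y1 +Noun+A3sg+Pnon+Dat Y2 +Verb+Pos+Past+A1sg
for s, t in atomics:
    print(" ".join(s), "<=>", " ".join(t))
# okul <=> mektep, +Noun+A3sg+Pnon+Dat <=> +Noun+A3sg+Pnon+Dat, gel <=> kel, +Verb+Pos+Past+A1sg <=> +Verb+Pos+Past+A1sg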
4 Discussions
There are cases where the idea is not applicable. Consider the following phrases:
bildiğim yer (the place where I know)
bil+Verb+Pos^DB+Adj+PastPart+P1sg yer+Noun+A3sg+Pnon+Nom
bilgen yerim
bil+Verb+Pos^DB+Adj+PastPart+Pnon yer+Noun+A3Sg+P1sg+Nom
The difference between the two sentences is that the possessive marker in Turkish follows the past participle morpheme affixed to the verb, whereas the possessive marker in Crimean Tatar follows the noun in this clause. Any translation program in such a case should identify that this is an adjectival clause formed with the past participle and should move the possessive marker that comes after the verb to its place after the noun. The current algorithm cannot deal with such a case, regardless of whether we have any prior information or not. Since the differences between the two sentences are only the possessive markers, we cannot have prior information like
P1sg <=> Pnon
which is totally wrong. However, the approach which uses example pairs is much safer in this case and can identify a template for it:
Turkish:
bildiğim yer (the place that I know)
bil+Verb+Pos^DB+Adj+PastPart+P1sg yer+Noun+A3sg+Pnon+Nom
bildiğim ev (the house that I know)
bil+Verb+Pos^DB+Adj+PastPart+P1sg ev+Noun+A3Sg+Pnon+Nom
Crimean Tatar:
bilgen yerim (the place that I know)
bil+Verb+Pos^DB+Adj+PastPart+Pnon yer+Noun+A3sg+P1sg+Nom
bilgen evim (the house that I know)
bil+Verb+Pos^DB+Adj+PastPart+Pnon ev+Noun+A3Sg+P1sg+Nom
From these two examples, we can derive the template:
bil+Verb+Pos^DB+Adj+PastPart+P1sg X1 +Noun+A3sg+Pnon+Nom <=> bil+Verb+Pos^DB+Adj+PastPart+Pnon Y1 +Noun+A3Sg+P1sg+Nom
However, this is an exceptional case, and the overwhelming majority of cases can be covered with the approach that we presented in this paper.
5 Conclusion
Corpus-based approaches in language processing have attracted increasing interest. Example-based machine translation is also considered an alternative to traditional rule-based methods, with its capability to learn the necessary linguistic and semantic knowledge from the translation examples. Cicekli and Güvenir [5] proposed a method to learn translation templates from bilingual translation examples. They also showed that the method is applicable to Turkish and English, which are two unrelated languages having completely different
characteristics. Their method requires two similar translation example pairs to derive a template. Further, they require that the similarities and differences be identified and that the corresponding translations for almost all of the differences be known in order to derive a template from the given example pair. In this paper, we extended their approach to closely related languages and, taking Turkish and Crimean Tatar as an example, we investigated the possibility of using the translated sentences themselves, instead of a pair of sentences, to derive rules. The first observation for close languages is that it is possible to have cases where the two sentences are exactly the same in both languages, so the pair can be learned as an atomic template. Secondly, similarities can always be learned as atomic templates regardless of the number of differences between the sentences: since the word and morpheme order is usually preserved in close languages, a similarity is always a correspondence between the languages. Finally, we saw that in most cases there is no need to know any explicit correspondences between the differences in order to derive templates. Cicekli and Güvenir require that if there are n > 1 differences between sentences, at least n-1 of the correspondences must be known. However, for close languages, since the word order is preserved, there is usually no need to enforce any preconditions, provided that the translations are correct.
References
[1] Altintas, K., Cicekli, I., "A Morphological Analyser for Crimean Tatar", In Proceedings of the 10th Turkish Artificial Intelligence and Neural Network Conference (TAINN 2001), North Cyprus, 2001.
[2] Appleby, S., Prol, M. P., "Multilingual World Wide Web", BT Technology Journal Millennium Edition, Vol. 18, No. 1, 1999.
[3] Carl, M., "Recent research in the field of example-based machine translation", Computational Linguistics and Intelligent Text Processing, LNCS 2004, pp. 195-196, 2001.
[4] Cicekli, I., "Similarities and Differences", In Proceedings of SCI 2000, pp. 331-337, Orlando, FL, July 2000.
[5] Cicekli, I. and Güvenir, H. A., "Learning Translation Templates from Bilingual Translation Examples", Applied Intelligence, Vol. 15, No. 1, pp. 57-76, 2001.
[6] Jurafsky, D., Martin, J. H., Speech and Language Processing, Prentice Hall, 2000.
[7] Nagao, M., "A Framework of a Mechanical Translation Between Japanese and English by Analogy Principle", In Artificial and Human Intelligence, Amsterdam, 1984.
[8] Oflazer, K., "Two-level Description of Turkish Morphology", Literary and Linguistic Computing, Vol. 9, No. 2, 1994.
[9] Somers, H., "Review Article: Example-based Machine Translation", Machine Translation, Vol. 14, pp. 113-157, 1999.
Implementation of an Arabic Morphological Analyzer within Constraint Logic Programming Framework Hamza Zidoum Department of Computer Science, SQU University PoBox36, Al Khod PC 132 Oman [email protected]
Abstract. This paper presents an Arabic Morphological Analyzer and its implementation in clp(FD), a constraint logic programming language. The Morphological Analyzer (MA) is a component of an architecture which can process unrestricted text from a source such as the Internet. The morphological analyzer uses a constraint-based model to represent the morphological rules for verbs and nouns, a matching algorithm to isolate the affixes and the root of a given word-form, and a linguistic knowledge base consisting of lists of markers. The morphological rules fall into two categories: the regular morphological rules of the Arabic grammar and the exception rules that represent the language exceptions. clp(FD) is particularly suitable for the implementation of our system thanks to its double reasoning: symbolic reasoning expresses the logic properties of the problem and facilitates the implementation of the linguistic knowledge base and heuristics, while constraint satisfaction reasoning on finite domains uses constraint propagation to keep the search space manageable.
1 Introduction
A morphological analysis module is inherent to the architecture of any system that is intended to allow a user to query a collection of documents, process them, and extract salient information, as for instance systems that handle Arabic texts and retrieve information expressed in the Arabic language over the Internet [11]. In this paper we present the implementation of an Arabic morphological analyzer that is a component of an architecture that can process unrestricted text, within a CLP framework [12]. Constraint Programming is particularly suitable for the implementation of our system thanks to its double reasoning: symbolic reasoning expresses the logic properties of the problem and facilitates the implementation of the linguistic knowledge base and heuristics, while constraint satisfaction reasoning on finite domains uses constraint propagation to keep the search space manageable. Constraints present an overwhelming advantage: declarativity. Constraints describe what the solution is and leave the question of how to solve them to the underlying solvers. A typical constraint-based system has a two-level architecture
consisting of (1) a programming module, i.e. a symbolic reasoning module that expresses the logic properties of the problem, and (2) a constraint module that provides a computational domain (reals, booleans, sets, finite domains, etc.) together with reasoning about the properties of the constraints, such as satisfiability and reduction algorithms, known as solvers. Constraints thus reduce the gap between the high-level description of a problem and the implemented code. The objective is to process a text in order to facilitate its use by a wide range of further applications, e.g. text summarization, translation, etc. The system is intended to process unrestricted text; hence, the criteria of robustness and efficiency are critical and highly desirable. To fulfill these criteria, we made the following choices:
1. avoid the use of a dictionary, as is the case in classical morphological analyzers. Indeed, the coverage of such a tool is limited to a given domain and cannot cope with unrestricted texts from dynamic information sources such as the Internet;
2. deal with unvoweled texts, since most Arabic texts available on the Internet are written in modern Arabic, which usually does not use diacritical marks.
In order to implement the morphological analyzer, we used the contextual exploration method [1]. It consists of scanning a given linguistic marker and its context (surrounding tokens in a given text), looking for linguistic clues that guide the system to make the suitable decision. In our case, the method scans an input token and tries to find the required affixes in order to associate the root form and the corresponding morpho-syntactic information. Arabic is known to be a highly inflectional language, and its well-known pattern model based on CV (Consonant, Vowel) analysis has widely been used to build computational morphological models [2, 3, 4]. During the last decade, an increasing interest in implementing Arabic morphological analyzers has been noticeable [5, 6, 7, 8, 9, 10, 12]. Almost all systems developed, in industry as well as in research, make use of a dictionary to perform morphological analysis. For instance, the Xerox Centre in France developed an Arabic morphological analyzer [7] using the finite-state technique. The system uses an Arabic dictionary containing 4930 roots that are combined with patterns (an average of 18 patterns per root). This system analyses words that may include full diacritics, partial diacritics, or no diacritics; if no diacritics are present, it returns all the possible analyses of the word. The remaining sections of this paper are organized as follows. Section 2 describes the system design and its architecture. Section 3 is dedicated to the description of the regular rules. The matching algorithm is described in Section 4. Finally, Section 5 concludes the paper with future directions to extend this work.
2 System Design and Implementation
The morphological analyzer finds all word root forms and associates the morphosyntactic information to the tokens.
Fig. 1. The morphological analyzer components
Within the text representation, a token includes the following fields:
1. Name, which contains the name of the token. Every token is assigned a name that allows the system to identify it.
2. Value, which is the word-form as it appears in the text before any processing.
3. Root, which stores the root of the word computed during morphological analysis.
4. Category, Tense, Number, Gender and Person, which are the same fields as in the regular rules.
5. Rule applied, which stores the identifier of the rule that has been fired to analyse the token.
6. Sentence, a reference to the object "Sentence" containing the token in the text. This is a relationship that holds between the token and its corresponding sentence.
7. Order, which stores the rank of the token in sequence from the beginning of the text. It is used to compare the sequential order of tokens in the text.
8. Positions, which correspond to the offset positions of the token in the text. They are used to highlight a relevant token when displaying the results to the user.
9. Format, the associated format (boldface, italics, …) applied to the token in the text.
The morphological analyzer (Fig. 1) includes two kinds of rules: regular morphological rules and exception rules that represent the morphological exceptions. Lists of exceptions contain all the markers that do not fall under the regular rules category. When analysing input tokens, the matching algorithm attempts to match the affixes of the token against a regular rule. If it does not succeed, it attempts to apply an exception rule by looking into the exception lists. In view of a rapid-prototyping software implementation strategy, only the regular rule specifications have been considered for implementation. After a test phase, the implementation of the exceptions is underway. We used the Constraint Logic Programming language clp(FD) [13] for building the Arabic morphological analyzer in order to benefit from several advantages. CLP is particularly suited for rapid-prototyping software implementation because it provides a high level of abstraction. The reason lies in the neat separation between the declarative model (how to express the problem to be solved) and the operational model (how the problem is actually solved). A typical constraint-based system has a two-level architecture consisting of a programming module, i.e. a symbolic reasoning module that expresses the logic properties of the problem and facilitates the implementation of the linguistic knowledge base and heuristics, and a constraint satisfaction module, whose reasoning on (finite) domains uses constraint propagation to keep the search space manageable [12].
3 Regular Rules
Regular Arabic verb and noun forms have a fixed pattern of the form "prefix+root+suffix"; thus they can be handled by automatic procedures, since the identification of the affixes is enough to extract the root form and associate the morpho-syntactic information. A regular rule models a spelling rule for adding affixes. The structure of a regular rule consists of nine fields that can be grouped into three classes: i) Name and Class identify the object in the system; ii) Prefix and Suffix store the prefix and suffix that are attached to a given token; iii) Category, Tense, Number, Gender, and Person store the morpho-syntactic information inferred from a token. For instance, consider the Arabic word "يكتبون" (in the active mode: 'they write'), which is composed of the three following morphemes: the root of the verb, "كتب" (/ktb/, the notion of writing), the prefix "يـ", which denotes both the present tense and the third person, and the suffix "ون", which denotes the masculine and the plural. The rule that analyses this word is represented in Fig. 2 (a). In Fig. 2 (b) the token is shown before matching, and in Fig. 2 (c) the token attributes are updated:
(a) The regular rule:
Name: V28; Class: regular-rule; Prefix: يـ; Suffix: ون; Category: verb; Tense: present; Number: plural; Gender: masculine; Person: third.
(b) The token before matching:
Name: T1123; Class: Token; Value: يكتبون; Root: –; Category: –; Tense: –; Number: –; Gender: –; Person: –; Rule-applied: –; Sentence: S052; Order: 1123; Positions: (3382, 3388); Format: –.
(c) The token updated after matching:
Name: T1123; Class: Token; Value: يكتبون; Root: كتب; Category: verb; Tense: present; Number: plural; Gender: masculine; Person: third; Rule-applied: V28; Sentence: S052; Order: 1123; Positions: (3382, 3388); Format: –.
Fig. 2. Matching regular rules
The structure of the regular-rule class is detailed below:
1. Name: uniquely identifies a rule
2. Class: the class of the object
3. Prefix: a sequence of characters at the beginning of a token
4. Suffix: a sequence of characters at the end of a token
5. Category: the part of speech to which the token belongs. It can hold two possible values: verb or noun
6. Tense: the tense associated to the token in the case of a verb
7. Number: the cardinality of the token: singular, dual or plural
8. Gender: the gender associated to the token: either masculine or feminine
9. Person: valid only for verbs; it represents either the first, second or third person.
Thus, the rules are represented as predicates which implement the attributes discussed above:

regularRule(p_refix, s_uffix, c_at, t_ens, n_umber, g_ender, p_erson, n_ame, c_lass) :-
    p_refix = 'يـ',
    s_uffix = "ون",
    c_at = verb,
    t_ens = present,
    n_umber = plural,
    g_ender = masculin,
    p_erson = third,
    n_ame = v28,
    c_lass = regular_rule.

4 Matching Algorithm
The extracted tokens from the source text are represented through the following fact base, whose distinctive arguments are the value of the token itself and its corresponding root, as inferred by the algorithm:

token(name, value, root, cat, tense, number, gender, person, format, ruleApplied, sentence, order, position).

The aim of the token-to-rule matching algorithm, implemented in the predicate tokenToRule/1 (shown below), is to fetch the rule that extracts the root of a given token t and, consequently, to associate the morpho-syntactic information to the token. The rule is identified if it matches a given pair of suffix and prefix. Thus, the affixes are first extracted from the token's value v through getSuffix/3 and getPrefix/3, and then the matching operation is performed by regularRule/9.

tokenToRule(t) :-
    t = token(_, v, _, _, _, _, _, _, _, r_ule, _, _, _),
    tokenToRule(v, r_ule, p_refix, s_uffix, 3, 1).

tokenToRule(v, r_ule, p_refix, s_uffix, i, j) :-
    i < 0, r_ule = null.
tokenToRule(v, r_ule, p_refix, s_uffix, i, j) :-
    length(v) - i - j <= 1, r_ule = null.
tokenToRule(v, r_ule, p_refix, s_uffix, i, j) :-
    getPrefix(v, i, p_refix),
    i = i - 1,
    i >= 0,
    length(v) - i - j > 1,
    tokenToRule(v, r_ule, p_refix, s_uffix, i, j).
tokenToRule(v, r_ule, p_refix, s_uffix, i, j) :-
    j < 0, r_ule = null.
tokenToRule(v, r_ule, p_refix, s_uffix, i, j) :-
    length(v) - i - j <= 1, r_ule = null.
tokenToRule(v, r_ule, p_refix, s_uffix, i, j) :-
    getSuffix(v, j, s_uffix),
    regularRule(p_refix, s_uffix, c_at, t_ens, n_umber, g_ender, p_erson, n_ame, c_lass),
    r_ule = n_ame,
    j = j - 1,
    tokenToRule(v, r_ule, p_refix, s_uffix, i, j).
Note that the length of a prefix for regular rules is at most one character, and the length of a suffix is limited to three characters. The matching algorithm gives priority to the longest affixes first. Hence, the counters for affix extraction are initialized to 1 and 3 respectively, and are gradually decremented as long as no rule matches and the root still contains at least 2 characters (constraint length(v) - i - j > 1), due to the fact that for regular words no root is less than two characters long:

getPrefix(v, i, p_refix) :-
    getPrefix1(v, i, [], p_refix).
getPrefix1(v, 0, p_refix, p_refix).
getPrefix1([c|v1], i, p_refix, p_refix1) :-
    getPrefix1(v1, i - 1, [c|p_refix], p_refix1).

getSuffix(v, i, s_uffix) :-
    reverse(v, v1),
    getSuffix1(v1, i, [], s_uffix).
getSuffix1(v, 0, s_uffix, s_uffix1) :-
    reverse(s_uffix, s_uffix1).
getSuffix1([c|v1], i, s_uffix, s_uffix1) :-
    getSuffix1(v1, i - 1, [c|s_uffix], s_uffix1).
5 Conclusion
A morphological analyzer is one of the essential components of any natural language processing architecture. The morphological analyzer presented in this paper is implemented within a CLP framework and is composed of three main components: a linguistic knowledge base comprising the regular and irregular morphological rules of the Arabic grammar, a set of linguistic lists of markers containing the exceptions handled by the irregular rules, and a matching algorithm that matches the tokens to the rules. The complete implementation of the system is underway. In a first phase, we have considered only the regular rules for implementation. Defining a strategy to match the regular and irregular rules, and the extension of the linguistic lists of markers, are the future directions of this project.
References
[1] J-P. Desclès, Langages applicatifs, Langues naturelles et Cognition, Hermès, Paris, 1990.
[2] A. Arrajihi, The Application of Morphology, Dar Al Maarefa Al Jameeya, Alexandria, 1973. (in Arabic)
[3] F. Qabawah, Morphology of Nouns and Verbs, Al Maaref Edition, 2nd edition, Beyruth, 1994. (in Arabic)
[4] G. A. Kiraz, "Arabic Computational Morphology in the West", In Proceedings of the 6th International Conference and Exhibition on Multi-lingual Computing, Cambridge, 1998.
[5] B. Saliba and A. Al Dannan, "Automatic Morphological Analysis of Arabic: A Study of Content Word Analysis", In Proceedings of the Kuwait Computer Conference, Kuwait, March 3-5, 1989.
[6] M. Smets, "Paradigmatic Treatment of Arabic Morphology", In Workshop on Computational Approaches to Semitic Languages, COLING-ACL98, August 16, Montreal, 1998.
[7] K. Beesley, "Arabic Morphological Analysis on the Internet", In Proceedings of the International Conference on Multi-Lingual Computing (Arabic & English), Cambridge G.B., 17-18 April, 1998.
[8] R. Alshalabi and M. Evens, "A Computational Morphology System for Arabic", In Workshop on Computational Approaches to Semitic Languages, COLING-ACL98, August 16, Montreal, 1998.
[9] R. Zajac and M. Casper, "The Temple Web Translator", 1997. Available at: http://www.crl.nmsu.edu/Research/Projects/tide/papers/twt.aaai97.html
[10] T. A. El-Sadany and M. A. Hashish, "An Arabic Morphological System", IBM Systems Journal, Vol. 28, No. 4, pp. 600-612, 1989.
[11] J. Berri, H. Zidoum, Y. Attif, "Web-based Arabic Morphological Analyser", A. Gelbukh (Ed.): CICLing 2001, pp. 389-400, Springer-Verlag, 2001.
[12] K. Marriott and P. Stuckey, Programming with Constraints: An Introduction, MIT Press, 1998.
[13] P. Codognet and D. Diaz, "Compiling Constraints in clp(FD)", Journal of Logic Programming, 1996:27:1-199.
Efficient Automatic Correction of Misspelled Arabic Words Based on Contextual Information Chiraz Ben Othmane Zribi and Mohammed Ben Ahmed RIADI Research Laboratory, University of La Manouba, National School of Computer Sciences, 2010 La Manouba, Tunisia [email protected] [email protected]
Abstract. We address in this paper a new method aiming to reduce the number of proposals given by automatic Arabic spelling correction tools. We suggest the use of the error's context in order to eliminate some correction candidates. The context consists of nearby words and can be extended to all words in the text. We present the experiments we performed to validate some hypotheses we made, then detail the method itself, and finally describe the experimental protocol we used to evaluate the method's efficiency. Our method was tested on a corpus containing genuine errors and has yielded good results: the average number of proposals has been reduced by about 75% (from 16.8 to 3.98 proposals on average).
1 Introduction
Existing spelling correctors are semi-automatic, i.e. they assist the user by proposing a set of candidates close to the erroneous word. For instance, word processors proceed by asking the user to choose the correct word among the automatically computed proposals. We aim to reduce the number of proposals provided by a spelling corrector. Two major concerns have motivated our interest in this problem. First, some applications need a totally automatic spelling corrector [1]. Second, we have noticed that the number of candidates given by correctors for the Arabic language is too large compared to what correctors give for other languages, namely English or French; in such conditions, Arabic correctors may even be useless. The idea is to eliminate the less probable candidates rather than to make the corrector generate fewer candidates. The method we propose makes use of the error's context: it considers the nearby words as well as all words in the text. In this paper we start by studying to what extent the problem is relevant in English, French and Arabic. Then, after a general presentation of the spelling corrector we used, we propose an initial assessment of it based on genuine errors. Finally, we present our method, followed by an evaluation protocol measuring its efficiency.
2 Characteristics of the Arabic Language
Depending on the language, the difficulties encountered when automatically correcting misspelled written forms will not be the same. The correction methods applied to languages such as English cannot be applied to agglutinative languages such as Arabic [2]. As concerns Arabic, apart from vowel marks and the agglutination of affixes (articles, prepositions, conjunctions, pronouns) to the forms, we noticed, thanks to an experiment described in [3], that "the Arabic words are lexically very close". According to this experiment, the average number of lexically close forms(1) is 3 for English and 3.5 for French. For Arabic without vowel marks, this number is 26.5. Arabic words would thus be much closer to one another than French and English words. This proximity of Arabic words has a double consequence: firstly on error detection, where words that are recognized as correct can in fact hide an error; secondly on error correction, where the number of proposals for an erroneous form is liable to be excessively high. One could estimate a priori that an average of 27 forms will be proposed for the correction of each error.
3 General Presentation of Our Spelling Checker
Our method of checking and correcting Arabic words is mainly based on the use of a dictionary. This dictionary contains inflected forms with vowel marks (1 600 000 entries)(2), such as the forms for (pencil) and (pencils), together with linguistic information. Because of the agglutination of prefixes (articles, prepositions, conjunctions) and suffixes (linked pronouns) to the stems (inflected forms), the dictionary is not sufficient for recognizing the words as they are represented in Arabic texts (e.g. the form for 'with their pencils'). The ambiguities of decomposing words make it difficult to recognize the inflected forms and the affixes. So the dictionary is accompanied by algorithms allowing the morphological analysis of textual forms. Apart from the dictionary of inflected forms, this morphological analyzer makes use of a small dictionary that contains all affixes (90 entries) and applies a set of rules for the search of all possible partitions into prefix, stem and suffix. These same dictionaries and grammar are also used for detecting and correcting errors. Detection of errors occurs during the morphological analysis. Correction is carried out by means of an improved, "tolerant" version of the morphological analyzer.
(1) Words that are lexically close: words that differ by one single editing error (substitution, addition, deletion or inversion).
(2) The dictionary contains 577 546 forms without vowel marks.
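For illustration only, the sketch below (plain Python, not the system described here; it uses a toy word list in Latin transliteration and a toy alphabet instead of the real 577 546-form Arabic dictionary and tolerant morphological analysis) generates the lexically close correction candidates of a form, i.e. all dictionary forms one editing operation (substitution, addition, deletion, inversion) away from it.

def one_edit_neighbours(word, alphabet):
    """All strings one editing operation away: substitution, addition, deletion, inversion."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {l + r[1:] for l, r in splits if r}
    inversions = {l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1}
    substitutions = {l + c + r[1:] for l, r in splits if r for c in alphabet}
    additions = {l + c + r for l, r in splits for c in alphabet}
    return deletes | inversions | substitutions | additions

def correction_candidates(erroneous, dictionary, alphabet):
    """Dictionary forms lexically close to the erroneous form."""
    return sorted(one_edit_neighbours(erroneous, alphabet) & dictionary)

# Toy example with hypothetical transliterated forms.
dictionary = {"ktb", "katb", "kataba", "kutub", "qlb"}
print(correction_candidates("ktub", dictionary, alphabet="abkqltu"))
# -> ['ktb', 'kutub']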
4 Initial Evaluation of the Spelling Checker
To evaluate our correction system, we took into account the following measures:
• Coverage: the percentage of errors for which the corrector is not silent, i.e. makes proposals.
• Accuracy: the percentage of errors having the correct word among the proposals.
• Ambiguity: the percentage of errors for which the corrector gives more than one proposal.
• Proposal: the average number of correction proposals per erroneous word.
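These four measures can be computed directly from the corrector's output. The following sketch (illustrative Python with a hypothetical data format, not part of the system) assumes that for each detected error we have the list of proposals and the intended correct word.

def evaluate(results):
    """results: list of (proposals, correct_word) pairs, one per detected error."""
    n = len(results)
    covered = sum(1 for proposals, _ in results if proposals)
    accurate = sum(1 for proposals, correct in results if correct in proposals)
    ambiguous = sum(1 for proposals, _ in results if len(proposals) > 1)
    return {
        "coverage": covered / n,                            # at least one proposal
        "accuracy": accurate / n,                           # correct word among the proposals
        "ambiguity": ambiguous / n,                         # more than one proposal
        "proposal": sum(len(p) for p, _ in results) / n,    # average number of proposals
    }

print(evaluate([(["kataba", "kutub"], "kutub"), (["qalb"], "qalb"), ([], "kitab")]))
# coverage 0.67, accuracy 0.67, ambiguity 0.33, proposal 1.0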
4.1 Experiment
Our first experiment bore on genuine errors. To this aim we took three texts (amounting to approximately 5000 forms) from the same field and containing 151 erroneous forms that come under one of the four editing operations. The invocation of our correction system on these texts gave the following results:

Table 1. Initial evaluation of the spelling checker
Coverage  Accuracy  Ambiguity  Proposal
100%      100%      78.8%      12.5 forms

4.2 Comments
• The coverage and accuracy rates are 100%. This can be explained by the fact that we took into account only the errors whose solution was recognized by our morphological analyzer. The other errors (mostly neologisms) were excluded because we were concerned with the evaluation of the corrector, not the analyzer.
• The ambiguity rate is very high: more than 78% of the errors have more than one proposal for their correction.
• Although the average number of proposals is lower than the previously estimated average (27 forms), it remains high compared to other languages. In English, for example, the average number of proposals is 5.55 for artificial errors and 3.4 for real errors [4]. These results are expected, considering that words in Arabic are lexically close.
Comments The coverage and accuracy rates are of 100%. This can be explained by the fact that we token into account only the errors the solution of which was recognized by our morphological analyzer. The other errors (mostly neologisms) were excluded because we were concerned with the evaluation of the corrector not the analyzer. The ambiguity rate is very high: more than 78% of the errors show more than one proposal in their correction. Although the average number of proposals is lower than the previously planned average (27 forms), it remains high if one compares it to other languages. In English for example, the average number of proposals is 5.55 for artificial errors and 3.4 for real errors [4].These results are expected, considering that the words in Arabic are lexically close.
Proposed Method
Our correction system assists the user by giving him a set of proposals that are close to the erroneous word.
We use the following notation:
We: an erroneous word
Wc: the correction of We
P = {p1, ..., pn}: the set of proposals for the correction of We
Wctxt = {w-k, ..., w-1, w1, ..., wk}: the set of words surrounding the erroneous word We in the text (considering a window of size k)
In order to develop an entirely automatic correction, we would like to reduce the set P to a singleton corresponding to the correct word Wc, thus giving Card(P) = 1 with Wc ∈ P. We shall try to minimize the number of proposals as much as possible by eliminating the least probable ones. The use of the context, on which our method is based, occurs in two cases: firstly, considering only the words near the error; secondly, considering the whole set of words in the text containing the error.
5.1 Words in Context
The assumption is that each proposal pi has a certain lexical affinity with the words of the context of the erroneous word we wish to correct. Consequently, to rank the proposals and eliminate the least probable ones, we examine the context and choose the proposals that are closest to the words of the context. To do so, we opted for a statistical method consisting in calculating, for each proposal, the probability of being the correct solution with respect to the words surrounding the error in the text. Only the proposals with a probability deemed acceptable are kept; the others are eliminated. For each proposal we calculate p(pi\Wctxt), the probability that pi is the correct solution, knowing that the erroneous word We is surrounded by the context Wctxt. Calculating this probability directly is not easy; it would take far too much data to generate reasonable estimates for such sequences directly from a corpus. Instead we use the probability p(Wctxt\pi), applying Bayes' inversion rule:

p(pi\Wctxt) = p(Wctxt\pi) × p(pi) / p(Wctxt)    (1)

Since we are searching for the proposals with the highest value of p(pi\Wctxt), we only need to calculate the value p(Wctxt\pi) × p(pi): the probability p(Wctxt) is the same for all the proposals (the context is the same), so it has no effect on the ranking. Supposing that the presence of a word in a context does not depend on the presence of the other words in this same context, we can make the following approximation, as was already shown by [5]:

p(Wctxt\pi) = ∏ (j = -k,...,k) p(wj\pi)
All things considered, we calculate for each proposal pi:

∏ (j = -k,...,k) p(wj\pi) × p(pi)

where:
p(wj\pi) = (number of occurrences of wj with pi) / (number of occurrences of pi)
p(pi) = (number of occurrences of pi) / (total number of words)
Experiment. Our experiment is carried out in two stages: a training stage, during which probabilities are collected for the proposals, and a testing stage, which consists in using these probabilities to choose among the proposals.
Training stage. This stage consists in creating a dictionary of co-occurrences from a training corpus. The entries in this dictionary are the proposals given by our correction system for the errors it detected in the text. With each entry we store its probability of appearance in the training corpus, p(pi), and all of its co-occurring words with their probability of co-occurrence, p(wj\pi).
Test stage. This stage consists in running the correction system on a test text and accessing the dictionary of co-occurrences for each proposal in order to calculate the probability p(pi\Wctxt). Only the proposals with a probability deemed sufficient (> 0.3 in our case, the training corpus not being very big) are kept.
Results. Training text: the corpus previously used containing the real errors (5 000 words). Testing text: a portion of the corpus (1 763 words, 61 of which are erroneous).

Table 2. Evaluation of the spelling checker: Context words
               Coverage  Accuracy  Ambiguity  Proposal
Initially      100%      100%      88.52%     16.8 forms
Context words  100%      93.44%    72.13%     10.33 forms
The use of the words in context allowed the number of proposals to be reduced by approximately 40%. However, accuracy decreased: in 6.6% of cases the correct solution was not among the proposals.
5.2 Words in the Text
The idea for this experiment was conceived after some counts carried out on the previously used corpus, which contained real errors. These counts showed us the following:
• A stem (textual form without affixes) appears 5.6 times on average.
• A lemma (canonical form) appears 6.3 times on average.
This leads us to the following deductions:
• In a text, words tend to repeat themselves.
• For the canonical forms the frequency is higher than for the stems. This is easy to understand, because when one repeats a word one can vary the gender, the number, the tense, etc., according to the context in which it occurs.
Starting from the idea that the words in a text tend to repeat themselves, one could reasonably expect that the corrections of the erroneous words in a text can be found in the text itself. Consequently, the search for proposals for the correction of an erroneous word shall from now on be done with the help of dictionaries built from the words of the text containing the errors, instead of the two general Arabic dictionaries that we previously used, i.e. the dictionary of inflected forms and the dictionary of affixes. Two experiments were carried out to this aim: the first bore on the use of the dictionary of the text's stems, the second on the use of the dictionary of all the inflected forms of the text's stems.
Experiment 1: Dictionary of the text's stems. The construction of the dictionaries required for this experiment went through the following stages:
1. Morphological analysis of the testing text. We obtain the morphological units of the text decomposed into Prefix / Stem / Suffix.
2. Access, with all the stems of the text, to the general dictionary (577 546 forms without vowel marks), so as to obtain a dictionary of 1 025 forms.
3. Access, with all the affixes of the text, to the general dictionary of affixes (71 forms without vowel marks), so as to obtain a dictionary of 33 forms.
The correction of the text by using these two dictionaries gives the following results:

Table 3. Evaluation of the spelling checker: Dictionary of the text's stems
Coverage  Accuracy  Ambiguity  Proposal
73.77%    97.61%    35.55%     2.36 forms
This table shows that the ambiguity rate decreased by more than half. The average number of proposals also clearly decreased: it went down from 16.8 forms to 2.4 forms. This means that we succeeded in decreasing the number of proposals, but we also lost at the level of coverage and accuracy.
Experiment 2: Dictionary of the text's inflected forms. The second experiment is like the first one, except that instead of the dictionary of the text's stems we used a dictionary of all the inflected forms of the text's stems. The correction of the testing text using this dictionary, together with the dictionary of affixes built in the previous experiment, gives the following results:

Table 4. Evaluation of the spelling checker: Dictionary of the text's inflected forms
Coverage  Accuracy  Ambiguity  Proposal
86.75%    92%       58%        4.88 forms
We notice that the use of the dictionary of stems reduces the number of proposals to a greater extent than the use of the dictionary of inflected forms. However, the latter gives better results in terms of coverage. As for accuracy, it decreases when using the dictionary of inflected forms. Indeed, since the average number of proposals furnished by the corrector using the dictionary of inflected forms is higher, the likelihood of having proposals without the correct solution among them is also higher (since there is more choice among the proposals).
6 Final Assessment
As a final experiment we aimed to combine both previous ideas: words in the text and words in context. In the first combination, the search for proposals was carried out in the dictionary of inflected forms of the words in the text. Each proposal was given a probability measuring its proximity to the context of the erroneous word it corrects, and the less plausible proposals were eliminated. We thus obtained an average of 2.68 proposals and a coverage rate of 82%. The second combination, which we prefer, looks for proposals in the general dictionary and then gives them a contextual probability. The proposals that belong to the dictionary of inflected forms of the words in the text are weighted by 0.8 and the others by 0.2 (according to the experiment with the dictionary of inflected forms, the correct solution comes from that dictionary in 80% of cases). We then proceed in the same manner as before, keeping only the most probable proposals. The average number of proposals obtained in this case is 3.98, with a coverage rate of 100% and an accuracy of 88.52%.

Table 5. Final evaluation of the spelling checker
               Coverage  Accuracy  Ambiguity  Proposal
Combination 1  81.97%    86%       46%        2.68 forms
Combination 2  100%      88.52%    62.29%     3.98 forms
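One way to read the second combination is sketched below (illustrative Python; the 0.8/0.2 weights come from the text, while the function names and data are assumed): the prior weight of each proposal depends on whether its inflected form occurs in the text being corrected, and is multiplied into the contextual score before the most probable proposals are kept.

def combined_ranking(proposals, context_score, text_forms):
    """Rank proposals: contextual score times 0.8 if the form occurs in the text, else 0.2."""
    weighted = {
        p: context_score(p) * (0.8 if p in text_forms else 0.2)
        for p in proposals
    }
    return sorted(weighted, key=weighted.get, reverse=True)

# Toy usage with a made-up contextual score.
scores = {"qalam": 0.4, "alam": 0.5, "kalam": 0.1}
print(combined_ranking(scores, scores.get, text_forms={"qalam"}))
# -> ['qalam', 'alam', 'kalam']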
7 Conclusion
In this work, we have been interested in reducing the number of candidates given by an automatic Arabic spelling checker. The method we have developed is based on using the lexical context of the erroneous word. Although we have considerably reduced the number of proposals, we think that our method can be improved, mainly by making use of the syntactic context as well as other contextual information. In fact, we manually measured the role this information could play if it were used. We noticed that syntactic constraints alone would be able to reduce the number of proposals by 40%, and this without taking into account the failures of an automatic syntactic analyzer.
References
[1] Kukich, K.: Techniques for automatically correcting words in text. ACM Computing Surveys, Vol. 24, No. 4 (1992), pp. 377-439.
[2] Oflazer, K.: Spelling correction in agglutinative languages. In Proceedings of the 4th ACL Conference on Applied Natural Language Processing, Stuttgart, Germany (1994).
[3] Ben Othmane Zribi, C. and Zribi, A.: Algorithmes pour la correction orthographique en arabe. TALN'99, 6ème Conférence sur le Traitement Automatique des Langues Naturelles, Corse (1999).
[4] Agirre, E., Gojenola, K., Sarasola, K., Voutilainen, A.: Towards a single proposal in spelling correction. In Proceedings of COLING-ACL 98, Montréal (1998).
[5] Gale, W., Church, K. W., Yarowsky, D.: Discrimination decisions for 100,000-dimensional spaces. In Current Issues in Computational Linguistics (1994), pp. 429-450.
A Socially Supported Knowledge-Belief System and Its Application to a Dialogue Parser Naoko Matsumoto and Akifumi Tokosumi Department of Value and Decision Science Tokyo Institute of Technology 2-12-1 Ookayama, Meguro-ku, Tokyo, Japan 152-8552 {matsun,akt}@valdes.titech.ac.jp http://www.valdes.titech.ac.jp/~matsun/
Abstract. This paper proposes a dynamically-managed knowledge-belief system which can represent continuously-changing beliefs about the world. Each belief in the system has a component reflecting the strength of support from other agents. The system is capable of adapting to a contextual situation due to the continuous revision of belief strengths through interaction with others. As a paradigmatic application of the proposed socially supported knowledge-belief system, a dialogue parser was designed and implemented in Common Lisp CLOS. The parser, equipped with a belief management system, outputs (a) speaker intention, (b) conveyed meaning, and (c) hearer emotion. Implications of our treatment of intention, meaning, and affective content within an integrated framework are discussed.
1 Introduction
Belief, defined in this paper as a knowledge structure with a degree of subjective confidence, can play a crucial role in cognitive processes. The context reference problem in natural language processing is one area where beliefs can dramatically reduce the level of complexity. For instance, in natural settings, contrary to the standard view within pragmatics and computational pragmatics, many utterances can be interpreted with little or no reference to the situations in which the utterances are embedded. The appropriate interpretation in a language user of the utterance "Your paper is excellent" from possible candidate interpretations, such as "The speaker thinks my newspaper is good to read," or "The speaker thinks my thesis is great and is praising my effort," will depend on the user's belief states. If, at a given time t, the dominant belief held by the user is that paper = {thesis}, then the lexical problem of disambiguating whether paper = {newspaper, thesis,...} simply disappears. The present paper proposes a belief system that can reflect the beliefs of socially distributed agents, including the speaker and the hearer. Our treatment of the belief system also addresses another important issue: the communication of mental states. In most non-cynical situations, the utterance "Your
paper is excellent” will evoke a positive emotion within the hearer. While an important function of the utterance would be to convey the epistemic content of the word excellent (of high quality), another important function would be the transfer of affective content, by eliciting a state of happiness in the hearer. Our belief system treats both of these epistemic and affective functions within an integrated framework. In the first half of the paper, we focus on the design and implementation of a dialogue parser equipped with a dynamically-managed belief system, and, then, turn to discuss how the parser incorporates the affective meaning of utterances.
2 The Proper Treatment of Context
Despite the importance attached to it, concepts of context are generally far too simplistic. One of the most neglected problems in pragmatics-related research is how to identify the proper context. Although most pragmatic theorists regard context as being essential for utterance interpretation (e.g., [1]) it is usually treated as simply the obvious and given information. While computational pragmatics tends to see the problem more realistically and, for instance, deconstructs it into a plan-recognition problem (e.g., [2]) context (sometimes in the form of a goal) is generally defined in terms of data structures. With its cognitive stance, Relevance Theory [3] takes a more satisfactory approach to the problem asserting that assumption is the cue to utterance interpretation. For example, in the right context, on hearing the utterance “It is raining,” by constructing the assumption “if it rains, the party will be postponed,” the hearer can move beyond the mere literal interpretation of the utterance (describing the weather) to an appropriate interpretation of the utterance's deeper significance. The theory claims that assumptions are cognitively constructed to have the maximum relevance. We agree with this view, but would point out that the cognitive computation for the construction of assumptions has yet to be fully described. Sharing a similar motivation with Relevance Theory, we propose a new idea— meaning supported by other language users in the community—to solve the difficulties associated with the context identification problem. In our position, context is not a data structure or existing information, but is rather a mechanism of support that exists between the cognitive systems of members of a linguistic community. With this support mechanism, largely-stable states of language meaning (with some degree of ambiguity) are achieved and the mutual transfer of ideas (as well as emotions) is made possible. We will detail the proposed knowledge-belief system in the following sections.
3 The Socially Supported Knowledge-Belief System
The Socially Supported Knowledge-Belief System (SS-KBS) is a computational model of utterance comprehension which incorporates the dynamically revisable belief system of a language user. The belief system models the user's linguistic and world knowledge with a component representing the degree of support from other
users in the language community. A key concept in this system is the socially supported knowledge-belief (sskb), so named to emphasize its social character. As a hearer model, the task of the SS-KBS is to identify the intention of the speaker embedded in an utterance, using its sskb database. When the speaker's intention is identified, the SS-KBS incorporates the intention within the knowledge-belief structure of the hearer. Thus, the hearer model builds its own knowledge-belief structure in the form of sskbs.
In the SS-KBS, each knowledge-belief (sskb) has a value representing the strength of support from others. The knowledge-belief that has the highest level of support is taken as the most likely interpretation of a message, with the ranking order of a knowledge-belief changing dynamically through interaction with others. The current SS-KBS is designed as an utterance parser. Fig. 1 shows the general architecture of the SS-KBS model.
Fig. 1. General architecture of the SS-KBS
The SS-KBS has its origins in Web searching research where, in a typical search task, no information about the context of the search or about the searcher is available, yet a search engine is expected to function as if it knows all the contextual information surrounding the target words [4,5,6]. The main similarity between the SS-KBS and search engines is between the weighted ordering of utterance meanings (likely meaning first) by the system and the presumed ordering of found URLs (useful site first) by search engines. Support from others (other sites in the case of Web search) is the key idea in both cases.
3.1 Implementation of the SS-KBS as a Dialogue Parser
In the SS-KBS, an agent's knowledge-belief reflects the level of support from other agents obtained through verbal interaction. Our first natural choice for an application of this system is a dialogue parser, called the Socially Supported Knowledge-Belief
Parser (SS-KBP), which has been implemented in the Common Lisp Object System (CLOS). With this system it is possible to interpret utterances without context and to produce utterances based on the dynamically revised database of knowledge-beliefs.
3.2 Internal Structure of a Socially Supported Knowledge-Belief (sskb)
We designed the SS-KBP as a word-based parser, because this architecture can seamlessly accommodate various types of data structures. The SS-KBP processes a sequence of input words using its word knowledge, which consists of three types of knowledge: (a) grammatical knowledge, (b) semantic knowledge, and (c) discourse knowledge (Table 1). Each knowledge type is represented as a daemon, a unit of program code executable in a condition-satisfaction mechanism [7]. The grammatical knowledge controls the behavior of the current word according to its syntactic category. The semantic knowledge deals with the maintenance of the sskb database derived from the current word. An sskb has a data structure similar to the deep cases developed in generative semantics. Although the original notion of deep case was only applied to verbs, we have extended the notion to nouns, adjectives, and adverbs. For instance, in an sskb, the noun paper has an evaluation slot, with a value of either good or bad. The sskbs for nouns are given evaluation slots in the SS-KBP, as we believe that evaluations, in addition to epistemic content, are essential for all nominal concepts in ordinary dialogue communication. The discourse knowledge accommodates information about speakers and the functions of utterances.
Table 1. Knowledge categorization for the SS-KBP
Grammatical knowledge
  - specifying the category of the current word (ex. "paper" --- noun); implemented as [category daemon] and [predictable category daemon]
Semantic knowledge
  - extracting the meaning: extracting the first sskb of the current word (ex. "paper" --- thesis (sskb)); [knowledge belief daemon]
  - changing the belief structure: controlling the knowledge-belief structure with a new input belief; [knowledge belief management daemon]
  - predicting the following word: predicting the deep case of the current word (ex1. "buy" (verb) --- "Tom" as agent (sskb); ex2. "flower" (noun) --- "beautiful" as evaluation (sskb)); [predictable word daemon]
Discourse knowledge (inferring the speaker's intention)
  - inferring the user of the word (ex. "pavement" --- British (sskb)); [speaker information daemon]
  - inferring the utterance function (ex. "Can you help me?" --- "request" (sskb)); [utterance function daemon]
The SS-KBP parses each input word using the sskbs connected to the word. As the proposed parser is strictly a word-based parser, it can analyze incomplete utterances. An example of a knowledge-belief implementation in the parser can be seen in the following fragment of code:
;;; Class object for 'paper.  The word "paper" belongs to "commonnoun".
(defclass paper (commonnoun)
  ;; the category is commonnoun
  ((Cate :reader Cate :initform 'commonnoun)
   ;; the word's sskbs: the first sskb is 'thesis, the second sskb is 'newspaper
   (sskb :accessor sskb :initform '((7 thesis) (5 newspaper)))
   ;; the predictable verbs
   (predAct :accessor predAct :initform '((4 read) (3 write)))
   ;; the user information of the word
   (userInfo :accessor userInfo :initform '((4 professor) (2 father)))
   ;; the evaluation; if the word has been evaluated, the slot is filled
   (EvalInducer :accessor EvalInducer :initform '((3 beautiful) (1 useful)))
   ;; the utterance function
   (uttFunc :accessor uttFunc :initform '((6 order) (5 encourage)))))
The weighting of an sskb in the SS-KBP is carried out through the constant revision of its rank order. The parser adjusts the rank order of the sskbs for each word whenever an interaction with another speaker occurs.
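As an illustration of this rank-order revision, the following fragment is our own sketch rather than the SS-KBP implementation; the function name reinforce-sskb and the default increment of 1 are assumptions. It updates an sskb list of (strength meaning) pairs when a meaning receives fresh support and re-sorts it so that the most strongly supported belief comes first.
(defun reinforce-sskb (sskb meaning &optional (amount 1))
  ;; Return a new sskb list in which MEANING's support is raised by AMOUNT,
  ;; sorted so that the most strongly supported belief comes first.
  (let ((updated (if (find meaning sskb :key #'second)
                     (mapcar (lambda (entry)
                               (if (eql (second entry) meaning)
                                   (list (+ (first entry) amount) (second entry))
                                   entry))
                             sskb)
                     (cons (list amount meaning) sskb))))
    (sort (copy-list updated) #'> :key #'first)))

;; Hearing "paper" used once more in the sense of thesis:
;; (reinforce-sskb '((7 thesis) (5 newspaper)) 'thesis)  =>  ((8 THESIS) (5 NEWSPAPER))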
3.3 Inferring the Speaker's Intention
In the SS-KBP, the speaker's intension is inferred by identifying the pragmatic functions of an utterance, which are kept in the sskbs. As utterances usually have two or more words, the parser employs priority rules for the task. For instance, when an utterance has an adjective, its utterance function is normally selected as the final choice of the pragmatic functions for the utterance. In case of the utterance “Your paper is excellent,” pragmatic functions associated to the words “your” and “paper” are superseded by a pragmatic function “praise” retrieved from a sskb of the word “excellent.” Because of these rules, it is always possible for the SS-KBP to infer the speaker's intention without reference to the context. 3.4
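A minimal sketch of such a priority rule is given below; it is our own illustration rather than the SS-KBP code, and the triple-based word analysis it assumes is hypothetical.
(defun infer-intention (word-analyses)
  ;; WORD-ANALYSES is a list of (word category utterance-function) triples.
  ;; The utterance function contributed by an adjective, when present,
  ;; supersedes the functions of the other words.
  (let ((adjective (find 'adjective word-analyses :key #'second)))
    (third (or adjective (first (last word-analyses))))))

;; (infer-intention '((your pronoun nil) (paper noun order)
;;                    (is verb nil) (excellent adjective praise)))
;; => PRAISE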
3.4 Emotion Elicitation
Emotion elicitation is one of the major characteristics of the SS-KBP. Ordinary models of utterance interpretation do not include the emotional responses of the hearer, although many utterances elicit emotional reactions. We believe that emotional reactions, as well as meaning/intention identification, are an inherent function of utterance exchange, and that the process is best captured as socially supported belief-knowledge processing. In the SS-KBP, emotions are evoked by the sskbs activated by input words. When the SS-KBP determines the final utterance function as the speaker's intention, it extracts the associated emotions from the utterance function. In Fig. 2, the SS-KBP
obtains the utterance function “praise” from the utterance “Your paper is excellent”; the SS-KBP then searches the emotion knowledge for the word “praise” and extracts the emotion “happy.”
Fig. 2. Emotion Elicitation in the SS-KBP (the final intention “praise” elicits the emotions happy, glad, joyful)
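The emotion-production step sketched in Fig. 2 can be illustrated with a toy lookup of the following kind; this is our own example, and the table entries other than the one for "praise" are invented.
(defparameter *emotion-knowledge*
  ;; each entry is an utterance function followed by the emotions it elicits
  '((praise happy glad joyful)
    (order tense)            ; hypothetical entry
    (encourage motivated)))  ; hypothetical entry

(defun elicit-emotions (utterance-function)
  (rest (assoc utterance-function *emotion-knowledge*)))

;; (elicit-emotions 'praise) => (HAPPY GLAD JOYFUL)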
4 Future Directions
The present SS-KBP is able to interpret the speaker's intentions and is also capable of producing the hearer's emotional responses. For a more realistic dialogue model, however, at least two more factors may be necessary: (a) a function to confirm the speaker's intentions, and (b) a function to identify the speaker. The confirmation function would provide the sskbs with a more accurately tuned rank order and would thus improve the reliability of utterance interpretation. This function may also lead to the generation of emotions toward speakers. When an interpretation of an utterance is confirmed in terms of the level of match to speaker expectations, the model could express the emotion of “satisfaction” towards the interpretation results and might recognize the speaker as having a belief structure similar to its own. The speaker identification function could aid the interpretation of ironic expressions. Once the model recognizes that speaker A has a very similar belief structure to its own, it becomes easier to infer speaker A's intentions. If, however, speaker A's utterance were then to convey a different belief structure, the model could interpret the utterance as either irony or a mistake.
5 Conclusions
The proposed SS-KBS model can explain how the hearer's knowledge-belief system works through interaction with others. The system reflects the intentions of the other
agent at each conversation turn, with each knowledge-belief having a rank based on the support from others. We have designed and examined the SS-KBS model in the following pragmatics-related tasks: (a) a context-free utterance interpretation parser, (b) a synchronizing mechanism between hearer and speaker knowledge-belief systems, and (c) a dynamically changing model of hearer emotion.
References
[1] Grice, H. P.: Logic and Conversation. In: Cole, P., Morgan, J. L. (eds.): Syntax and Semantics, vol. 3: Speech Acts. Academic Press, New York (1975) 45-58
[2] Cohen, P. R., Levesque, H. J.: Persistence, Intention and Commitment. In: Cohen, P. R., Morgan, J., Pollack, M. (eds.): Intentions in Communication. MIT Press, Cambridge, Mass. (1990) 33-70
[3] Sperber, D., Wilson, D.: Relevance: Communication and Cognition. Harvard Univ. Press, Cambridge, Mass. (1986)
[4] Matsumoto, N., Anbo, T., Uchida, S., Tokosumi, A.: The Model of Interpretation of Utterances Using a Socially Supported Belief Structure. Proceedings of the 19th Annual Meeting of the Japanese Cognitive Science Society. (In Japanese). (2002) 94-95
[5] Matsumoto, N., Anbo, T., Tokosumi, A.: Interpretation of Utterances Using a Socially Supported Belief System. Proceedings of the 10th Meeting of the Japanese Association of Sociolinguistic Science. (In Japanese). (2002) 185-190
[6] Matsumoto, N., Tokosumi, A.: Pragmatic Disambiguation with a Belief Revising System. 4th International Conference on Cognitive Science ICCS/ASCS-2003. (To appear)
[7] Tokosumi, A.: Integration of Multi-Layered Linguistic Knowledge on a Daemon-Based Parser. In: Masanao Toda (ed.): Cognitive Approaches to Social Interaction Processes. Department of Behavioral Science, Hokkaido University, Sapporo, Japan. (In Japanese with LISP code). (1986) Chapter 10
Knowledge-Based Question Answering
Fabio Rinaldi¹, James Dowdall¹, Michael Hess¹, Diego Mollá², Rolf Schwitter², and Kaarel Kaljurand¹
¹ Institute of Computational Linguistics, University of Zürich, Zürich, Switzerland {rinaldi,dowdall,hess,kalju}@cl.unizh.ch
² Centre for Language Technology, Macquarie University, Sydney, Australia {diego,rolfs}@ics.mq.edu.au
Abstract. Large amounts of technical documentation are available in machine-readable form, yet there is a lack of effective ways to access them. In this paper we propose an approach based on linguistic techniques, geared towards the creation of a domain-specific Knowledge Base, starting from the available technical documentation. We then discuss an effective way to access the information encoded in the Knowledge Base. Given a user question phrased in natural language, the system is capable of retrieving the encoded semantic information that most closely matches the user input and of presenting it by highlighting the textual elements that were used to deduce it.
1 Introduction
In this article, we present a real-world Knowledge-Based Question Answering¹ system (ExtrAns), specifically designed for technical domains. ExtrAns uses a combination of robust natural language processing technology and dedicated terminology processing to create a domain-specific Knowledge Base containing a semantic representation of the propositional content of the documents. Knowing what forms the terminology of a domain and understanding the relations between the terms is vital for the answer extraction task.
Specific research in the area of Question Answering has been promoted in the last few years in particular by the Question Answering track of the Text REtrieval Conference (TREC-QA) competitions [17]. As these competitions are based on large volumes of text, the competing systems cannot afford to perform resource-consuming tasks and therefore usually resort to a relatively shallow text analysis. Very few systems have tried to do more than skim the surface of the text, and in fact many authors have observed the tendency of the TREC systems to converge to a sort of common architecture (exemplified by [1]). The TREC-QA competitions focus on open-domain systems, i.e. systems that can (potentially) answer any generic question. In contrast, a question answering system working on a technical domain can take advantage of the formatting and style conventions
¹ The term Answer Extraction is used as equivalent to Question Answering in this paper.
in the text and can make use of the specific domain-dependent terminology; besides, it does not need to handle very large volumes of text.² We have found that terminology plays a pivotal role in technical domains and that complex multiword terms quickly become a thorn in the side of computational accuracy and efficiency if not treated in an adequate way.
A domain where extensive work has been done using approaches comparable to those that we present here is the medical domain. The Unified Medical Language System (UMLS)³ makes use of hyponymy and lexical synonymy to organize the terms. It collects terminologies from differing sub-domains in a metathesaurus of concepts. The PubMed⁴ system uses the UMLS to relate metathesaurus concepts to a controlled vocabulary used to index the abstracts. This allows efficient retrieval of abstracts from medical journals, but it requires a complex, predefined semantic network of primitive types and their relations. However, [2] criticizes the UMLS for the inconsistencies and subjective bias imposed on the relations by discovering such links manually.
Our research group has been working in the area of Question Answering for a few years. The domain selected as the initial target of our activity was that of the Unix man pages [11]. Later, we targeted different domains and larger document sets. In particular we focused on the Aircraft Maintenance Manual (AMM) of the Airbus A320 [15]. However, the size of the SGML-based AMM (120Mb) is still much smaller than the corpus used in TREC-QA and makes the use of sophisticated NLP techniques possible. Recently, we have decided to embark upon a new experiment in Answer Extraction, using the Linux HOWTOs as a new target domain. As the documents are open-source, it will be possible to make our results widely available, using a web interface similar to that created for the Unix manpages.⁵
The remainder of this paper is organized around section 2, which describes the operations adopted for structuring the terminology, and section 3, which describes the role of a Knowledge Base in our Question Answering system.
2 Creating a Terminological Knowledge Base
Ideally, terminology should avoid lexical ambiguity, denoting a single object or concept with a unique term. More often than not, the level of standardization needed to achieve this ideal is impractical. With the authors of complex technical documentation spread across borders and languages, regulating terms becomes increasingly difficult as innovation and change expand the domain and its associated terminology. This fluidity of the terminology not only increases the number of terms but also results in multiple ways of referring to the same domain object.
² The size of the domains we have dealt with (hundreds of megabytes) is one order of magnitude inferior to that of the TREC collections (a few gigabytes).
³ http://www.nlm.nih.gov/research/umls/
⁴ http://www.ncbi.nlm.nih.gov/pubmed/
⁵ http://www.cl.unizh.ch/extrans/
Fig. 1. A sample of the AMM Terminological Knowledge Base. Nodes include: 1 cover strip; 2 cargo door; 3 compartment door; 4 enclosure door; 5 functional test / operational check; 7 electrical cable / electrical line; 8 stainless steel cover strip; 9 cargo compartment door / cargo compartment doors / cargo-compartment door; 10 cockpit door; 11 door functional test; 12 fastner strip / attachment strip
Terminological variation has been well investigated for expanding existing term sets or producing domain representations [3, 6]. Before the domain terminology can be structured, the terms need to be extracted from the documents; details of this process can be found in [4, 14]. We then organize the terminology of the domain in an internal component called the Terminological Knowledge Base (TermKB) [13]. In plainer terms, we could describe it simply as a computational thesaurus for the domain, organized around synonymy and hyponymy and stored in a database. The ExtrAns TermKB identifies strict synonymy as well as three weaker synonymy relations [5] and exploits the endocentric nature of the terms to construct a hyponymy hierarchy, an example of which can be seen in Fig. 1. We make use of simple pattern matching techniques to determine lexical hyponymy and some strict synonymy; more complex processing is used to map the immediate hyperonymy and synonymy relations in WordNet onto the terminology. To this end we have adapted the terminology extraction tool FASTR [7]. Using a PATR-II phrase structure formalism in conjunction with the CELEX morphological database and WordNet semantic relations, variations between two terms are identified.
Hyponymy. Two types of hyponymy are defined: modifier addition, producing lexical hyponymy, and WordNet hyponymy, translated from WordNet onto the term set. As additional modifiers naturally form a more specific term, lexical hyponymy is easily determined. Term A is a lexical hyponym of term B if: A has more tokens than B; the tokens of B keep their order in A; and A and B have the same head.⁶ The head of a term is the rightmost non-symbol token (i.e. a word), which can be determined from the part-of-speech tags. This relation is exemplified in Fig. 1 between nodes 1 and 8. It permits multiple hyperonyms, as 9 is a hyponym of both 2 and 3.
⁶ This is simply a reflection of the compounding process involved in creating more specific (longer) terms from more generic (shorter) terms.
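The lexical hyponymy test just described can be written down directly. The following is our own sketch, not the ExtrAns code, and it assumes terms are represented as lists of lowercase token strings.
(defun subsequence-p (sub seq)
  ;; True if the elements of SUB appear in SEQ in the same relative order.
  (cond ((null sub) t)
        ((null seq) nil)
        ((string= (first sub) (first seq)) (subsequence-p (rest sub) (rest seq)))
        (t (subsequence-p sub (rest seq)))))

(defun lexical-hyponym-p (a b)
  ;; Term A is a lexical hyponym of term B if A has more tokens than B,
  ;; B's tokens keep their order in A, and A and B share the same head
  ;; (rightmost token).
  (and (> (length a) (length b))
       (string= (first (last a)) (first (last b)))
       (subsequence-p b a)))

;; (lexical-hyponym-p '("cargo" "compartment" "door") '("cargo" "door")) => T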
WordNet hyponymy is defined between terms linked through the immediate hyperonym relation in WordNet. The dashed branches in Fig. 1 represent a link through modifier hyponymy, where the terms share a common head and the modifiers are related as immediate hyperonyms in WordNet. Nodes 3 and 4 are both hyperonyms of 10. Similarly, “floor covering” is a kind of “surface protection”, as “surface” is an immediate hyperonym of “floor” and “protection” is an immediate hyperonym of “covering”. Mapping a hierarchical relation onto terms in this manner is fine when the hyponymy relation holds in the same direction, i.e. from the modifiers of t1 to the modifiers of t2 and from the head of t1 to the head of t2. Unfortunately, it is less clear what the relation between t1, “signal tone”, and t2, “warning sound”, can be characterized as. This type of uncertainty makes the exploitation of such links difficult.
Synonymy. Four relations make up synsets, the organizing unit of the TermKB. These are graded from strict synonymy to the weakest useful relation. Simple variations in punctuation 9, acronym use or orthography produce strict synonymy. Morpho-syntactic processes such as head inversion also identify this relation: cargo compartment doors → door of the cargo compartment. Translating WordNet's synsets onto the terminology defines the three remaining synonymy relations: head synonymy 7, modifier synonymy 12, and both 5.
Automatically discovering these relations across the 6032 terms from the AMM produces 2770 synsets with 1176 lexical hyponymy links and 643 WordNet hyponymy links. Through manual validation of 500 synsets, 1.2% were determined to contain an inappropriate term. A similar examination of 500 lexical hyponymy links identified them all as valid.⁷ However, out of 500 WordNet hyponymy links more than 35% were invalid. By excluding the WordNet hyponymy relation we obtain an accurate knowledge base of terms (TermKB), whose organizing element is the synset, with synsets also related through lexical hyponymy.
3 Question Answering in Technical Domains
In this section we briefly describe the linguistic processing performed in the ExtrAns system; extended details can be found in [15]. An initial phase of syntactic analysis, based on the Link Grammar parser [16], is followed by a transformation of the dependency-based syntactic structures generated by the parser into a semantic representation based on Minimal Logical Forms, or MLFs [11]. As the name suggests, the MLF of a sentence does not attempt to encode the full semantics of the sentence. Currently the MLFs encode the semantic dependencies between the open-class words of the sentences (nouns, verbs, adjectives, and adverbs) plus prepositional phrases. The notation used has been designed to incrementally incorporate additional information if needed. Thus, other modules of the NLP system can add new information without having to remove old
⁷ This result might look surprising, but it is probably due to the fact that all terminology has been previously manually verified; see also footnote 6.
information. This has been achieved by using flat expressions and by using underspecification whenever necessary [10]. An added value of introducing flat logical forms is that it is possible to find approximate answers when no exact answers are found, as we will see below.
We have chosen a computationally intensive approach, which allows a deeper linguistic analysis to be performed, at the cost of higher processing time. Such costs are negligible in the case of a single sentence (like a user query) but become rapidly impractical in the case of the analysis of a large document set. The approach we take is to analyse all the documents in an off-line stage (see Fig. 2) and store a representation of their contents (the MLFs) in a Knowledge Base. In an on-line phase (see Fig. 3), the MLF which results from the analysis of the user query is matched in the KB against the stored representations, locating those MLFs that best answer the query. At this point the system can locate in the original documents the sentences from which the MLFs were generated.
Fig. 2. Creating the KB (offline)
One of the most serious problems that we have encountered in processing technical documentation is the syntactic ambiguity generated by multi-word units, in particular technical terms. Any generic parser, unless developed specifically for the domain at hand, will have serious problems dealing with those multi-words. On the one hand, it is likely that they contain tokens that do not correspond to any word in the parser's lexicon; on the other, their syntactic structure is highly ambiguous (alternative internal structures, as well as possible undesired combinations with neighbouring tokens). In fact, it is possible to show that, when all the terminology of the domain is available, a much more efficient approach is to pack the multi-word units into single lexical tokens prior to syntactic analysis [4]. In our case, such an approach brings a reduction in the complexity of parsing of almost 50%.
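A rough sketch of this packing step is given below; it is our own illustration, not the ExtrAns implementation, and the underscore convention for packed tokens is invented.
(defun pack-terms (tokens terms)
  ;; TOKENS is a list of word strings; TERMS is a list of known multi-word
  ;; terms, each a list of strings, assumed to be listed longest first.
  (if (null tokens)
      nil
      (let ((term (find-if (lambda (term)
                             (and (<= (length term) (length tokens))
                                  (equal term (subseq tokens 0 (length term)))))
                           terms)))
        (if term
            (cons (format nil "~{~a~^_~}" term)
                  (pack-terms (subseq tokens (length term)) terms))
            (cons (first tokens) (pack-terms (rest tokens) terms))))))

;; (pack-terms '("the" "cargo" "compartment" "door" "opens")
;;             '(("cargo" "compartment" "door") ("cargo" "door")))
;; => ("the" "cargo_compartment_door" "opens")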
Fig. 3. Exploiting the Terminological KB and Document KB (online): the query is filtered against the Term KB, analysed syntactically and semantically, and semantically matched against the document logical forms in the Document KB to locate the answers in the document
During the analysis of documents and queries, if a term belonging to a synset is identified, it is replaced by its synset identifier, which then allows retrieval using any other term in the same synset. This amounts to an implicit ‘terminological normalization' for the domain, where the synset identifier can be taken as a reference to the ‘concept' that each of the terms in the synset describes [8]. In this way any term contained in a user query is automatically mapped to all its variants.
When an answer cannot be located with the approach described so far, the system is capable of ‘relaxing' the query, gradually expanding the set of acceptable answers. A first step consists of including hyponyms and hyperonyms of the terms in the query. If the query extended with this ontological information fails to find an exact answer, the system returns the sentence (or set of sentences) whose MLF is semantically closest to the MLF of the question. Semantic closeness is measured here in terms of overlap of logical forms; the use of flat expressions for the MLFs allows a quick computation of this overlap after unifying the variables of the question with those of the answer candidate. The current algorithm for approximate matching compares pairs of MLF predicates and returns 0 or 1 on the basis of whether the predicates unify or not. An alternative that is worth exploring is the use of ontological information to compute a measure based on the ontological distance between words, i.e. by exploiting their shared information content [12].
The expressivity of the MLF can further be expanded through the use of meaning postulates of the type: “If x is included in y, then x is in y”. This ensures that the query “Where is the temperature bulb?” will still find the answer “A temperature bulb is included in the auxiliary generator”. It should be clear that this approach towards inferences has, so far, only the scope of a small experiment; a large-scale extension would mean dealing with problems such as domain-specific inferences, contradictory knowledge, inference cycles and the more general problem of knowledge acquisition. In fact such an approach would require a domain Ontology, or even more general World-Knowledge.⁸
While the approach described so far is capable of dealing with all the variations in terminology previously identified in the offline stage, the user might come up with a new variant of an existing term, not previously seen. The approach that we take to solve this problem is to filter queries (using FASTR, see Fig. 3) for these specific term variations. In this way the need for a query to contain a known term is removed. For example, the subject of the query “Where is the equipment for generating electricity?” is related through synonymy to the synset of electrical generation equipment, providing the vital link into the TermKB.
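As a concrete and much simplified illustration of the approximate matching step described above, the sketch below scores an answer candidate by counting how many of the question's flat MLF predicates it matches. The predicate representation, the '?'-variable convention, and the treatment of variables as plain wildcards are our assumptions, not the ExtrAns algorithm.
(defun variablep (x)
  ;; Variables are assumed to be symbols whose names start with '?'.
  (and (symbolp x) (plusp (length (symbol-name x)))
       (char= (char (symbol-name x) 0) #\?)))

(defun predicate-match-p (p q)
  ;; Two flat predicates match when they have the same arity and agree on
  ;; every non-variable argument (variables act as wildcards here).
  (and (= (length p) (length q))
       (every (lambda (a b) (or (variablep a) (variablep b) (eql a b))) p q)))

(defun mlf-overlap (question-mlf candidate-mlf)
  ;; Number of question predicates matched by some candidate predicate.
  (count-if (lambda (p) (some (lambda (q) (predicate-match-p p q)) candidate-mlf))
            question-mlf))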
⁸ Unfortunately, a comprehensive and easy-to-use repository of World Knowledge is still not available, despite some commendable efforts in that direction [9].
Fig. 4. Example of interaction with the system
4 Conclusion
In this paper we have described the automatic creation and exploitation of a knowledge base in a Question Answering system. Traditional techniques from Natural Language Processing have been combined with novel ways of exploiting the structure inherent in the terminology of a given domain. The resulting Knowledge Base can be used to ease the information access bottleneck of technical manuals.
References
[1] Steven Abney, Michael Collins, and Amit Singhal. Answer extraction. In Sergei Nirenburg, editor, Proc. 6th Applied Natural Language Processing Conference, pages 296–301, Seattle, WA, 2000. Morgan Kaufmann.
[2] J. J. Cimino. Knowledge-based terminology management in medicine. In Didier Bourigault, Christian Jacquemin, and Marie-Claude L'Homme, editors, Recent Advances in Computational Terminology, pages 111–126. John Benjamins Publishing Company, 2001.
[3] B. Daille, B. Habert, C. Jacquemin, and J. Royauté. Empirical observation of term variations and principles for their description. Terminology, 3(2):197–258, 1996.
[4] James Dowdall, Michael Hess, Neeme Kahusk, Kaarel Kaljurand, Mare Koit, Fabio Rinaldi, and Kadri Vider. Technical terminology as a critical resource. In International Conference on Language Resources and Evaluations (LREC-2002), Las Palmas, 29–31 May 2002.⁸
[5] Thierry Hamon and Adeline Nazarenko. Detection of synonymy links between terms: Experiment and results. In Didier Bourigault, Christian Jacquemin, and Marie-Claude L'Homme, editors, Recent Advances in Computational Terminology, pages 185–208. John Benjamins Publishing Company, 2001.
[6] Fidelia Ibekwe-SanJuan and Cyrille Dubois. Can Syntactic Variations Highlight Semantic Links Between Domain Topics? In Proceedings of the 6th International Conference on Terminology and Knowledge Engineering (TKE02), pages 57–64, Nancy, August 2002.
[7] Christian Jacquemin. Spotting and Discovering Terms through Natural Language Processing. MIT Press, 2001.
[8] Kyo Kageura. The Dynamics of Terminology: A descriptive theory of term formation and terminological growth. Terminology and Lexicography Research and Practice. John Benjamins Publishing, 2002.
[9] D. B. Lenat. Cyc: A large-scale investment in knowledge infrastructure. Communications of the ACM, 11, 1995.
[10] Diego Mollá. Ontologically promiscuous flat logical forms for NLP. In Harry Bunt, Ielka van der Sluis, and Elias Thijsse, editors, Proceedings of IWCS-4, pages 249–265, 2001.
[11] Diego Mollá, Rolf Schwitter, Michael Hess, and Rachel Fournier. ExtrAns, an answer extraction system. T.A.L., special issue on Information Retrieval oriented Natural Language Processing, 2000.⁸
[12] Philip Resnik. Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. Journal of Artificial Intelligence Research, 11:95–130, 1998.
[13] Fabio Rinaldi, James Dowdall, Michael Hess, Kaarel Kaljurand, and Magnus Karlsson. The Role of Technical Terminology in Question Answering. In Proceedings of TIA-2003, Terminologie et Intelligence Artificielle, pages 156–165, Strasbourg, April 2003.⁸
[14] Fabio Rinaldi, James Dowdall, Michael Hess, Kaarel Kaljurand, Mare Koit, Kadri Vider, and Neeme Kahusk. Terminology as Knowledge in Answer Extraction. In Proceedings of the 6th International Conference on Terminology and Knowledge Engineering (TKE02), pages 107–113, Nancy, 28–30 August 2002.⁸
[15] Fabio Rinaldi, James Dowdall, Michael Hess, Diego Mollá, and Rolf Schwitter. Towards Answer Extraction: an application to Technical Domains. In ECAI2002, European Conference on Artificial Intelligence, Lyon, 21–26 July 2002.⁸
[16] Daniel D. Sleator and Davy Temperley. Parsing English with a link grammar. In Proc. Third International Workshop on Parsing Technologies, pages 277–292, 1993.
[17] Ellen M. Voorhees. The TREC question answering track. Natural Language Engineering, 7(4):361–378, 2001.
⁸ Available at http://www.cl.unizh.ch/CLpublications.html
Knowledge-Based System Method for the Unitarization of Meaningful Augmentation in Horizontal Transliteration of Hanman Characters
Huaglory Tianfield
School of Computing and Mathematical Sciences, Glasgow Caledonian University, 70 Cowcaddens Road, Glasgow, G4 0BA, UK
[email protected]
Abstract. The basic idea of differentiable horizontal transliteration of single Hanman characters is to equate the horizontal transliteration of a Hanman character with its toneless Zhongcentrish phonetic letters appended with a suffixal word drawn from the meaningful augmentation. This paper presents a knowledge-based system method to unitarize the choice among all the meanings associated with the Hanman character.
1 Introduction
All the languages on the globe fall into two streams, i.e. plain-Hanman character string based languages and knowledge-condensed Hanman character based languages [1]. Knowledge-condensed Hanman character based languages include Zhongcentrish (used in the mainland, Taiwan, Hong Kong and Macao of Zhongcentre, Singapore, etc.), many other oriental languages (Japanese, Korean, Vietnamese, etc.), Maya, used by the Mayans in South America, and ancient Egyptian in Africa. Hanman characters are the main characters forming Zhongcentrish and many other oriental languages such as Korean, Japanese, Vietnamese, etc. as well. Hanman characters are the only knowledge-condensed characters in wide use in modern society; Maya is only used in research at present. Plain-Hanman character string based languages include Latin, English, German, Spanish, French, Greek, Portuguese, Russian, Swedish, etc. Their main characters are the letters of the alphabets, and thus these languages are simply called letter string based ones.
Horizontal transliteration is a mechanical literal transliteration from Hanman characters into letter strings. Horizontal transliteration of Hanman characters actually represents the most profound issue of the mutual transformation and comprehension between Hanman character based languages (and, more generally, cultures) and plain-Hanman character string based languages (cultures). The differentiable horizontal
transliteration of knowledge-condensed Hanman characters can serve as such a mechanical bridge.
The conventional way of horizontally transliterating Hanman characters is by toneless Zhongcentrish phonetic letters (ZPL). The problem with ZPL horizontal transliteration is that it is practically impossible for a viewer to recover the original Hanman character from the ZPL horizontal transliteration unless the viewer has been told the original Hanman character in advance or has sufficient a priori knowledge about the context around the original Hanman character. Toneless ZPL has only 7% differentiability and is very unsatisfactory and incompetent because of the great homonym rate of Hanman characters.
This paper proposes an unconventional approach to the differentiable horizontal transliteration of single Hanman characters under contextless circumstances. The basic idea is to equate the horizontal transliteration of a Hanman character with its toneless ZPL appended with a suffixal word drawn from the meaningful augmentation of the Hanman character [2]. As a Hanman character normally has more than one associative meaning, the key is how to determine a unitary suffixal word for the horizontal transliteration.
2 Zhongcentrish Phonemics and Phonetics of Hanman Characters
The architectonics of Hanman characters contains some phonemics, e.g. homonyms. Actually, the phonetics of most Hanman characters is conveyed by opposite-syllable forming: the sound initiator phoneme of the Hanman character is marked by another known / simpler Hanman character which has the same sound initiator phoneme, and the rhyme phoneme of the Hanman character is marked by another known / simpler Hanman character which has the same rhyme phoneme. In this sense, phonetics and phonemics are inherent in Zhongcentrish, independent of the inception and existence of ZPL.
Latin transcription of Zhongcentrish phonetics began in the early 16th century and eventually yielded over 50 different systems. On February 21, 1958, Zhongcentre adopted the Hanman character based language Phonetic Scheme, known as ZPL, to replace the Wade-Giles and Lessing transcription systems [3]. ZPL uses a modified Roman alphabet to phonetically spell the proper pronunciation of Hanman characters. The sounds of ZPL strings only roughly correspond to the English pronunciation of the ZPL strings, and thus ZPL is hetero-pronounceable to English. Since its inception, ZPL has become a generally recognized standard for Zhongcentrish phonetics throughout most of the world: it is a standard of the International Standards Organization (ISO) and of the United Nations, it is recognized by the U.S. government, much of the world's media, and (since 1999) the region of Taiwan, and it is the standard prescribed by the Zhongcentrish law on the national language and Hanman characters.
The original Zhongcentrish phonetic transcription, including symbols for sound initiator phonemes, rhyme phonemes and tones, is by radicals, as depicted in the first rows of Fig. 1.
Fig. 1. Zhongcentrish phonetic transcription: (a) Zhongcentrish phonetic sound initiator phonemes (21): symbols versus letters; (b) Zhongcentrish (simple or compound) rhyme phonemes (35): symbols versus letters; (c) symbols of the five tones (5)
In addition to the 21 basic sound initiator phonemes, there are three expanded sound initiator phonemes, i.e. the "nil" sound initiator phoneme, / w / and / y /; the latter two are phonemically the same as / u / and / i /. In addition to the 35 rhyme phonemes, there are four infrequently used rhyme phonemes, i.e. the "nil" rhyme phoneme, / ê / (umlauted / e /), / er /, / m /, and / ng / (η). Superficially, there are 24 x 39 = 936 mechanical combinations. However, many of these mechanical combinations are either unfeasible or unrealistic. Actually, of the 936 mechanical combinations, only 417 pronunciations are used for Hanman characters: there are 417 strings of ZPL in the Zhongcentrish dictionary. All the Hanman characters can be written in ZPL. There are more than 10 thousand Hanman characters in Zhongcentrish, of which 6 thousand are in frequent usage. Conservatively, counting 6,000 Hanman characters, the ratios below can be obtained. The theoretical average toneless homonym ratio (TLHR) of Zhongcentrish is
TLHR = 6,000 / 417 ≈ 15 / 1    (1)
which means that one distinctive string of ZPL has to serve fifteen Hanman characters on average. The theoretical average toned homonym ratio (THR) of Zhongcentrish is
THR = 6,000 / 417 / 4 ≈ 4 / 1    (2)
which means that one distinctive toned string of ZPL has to serve four Hanman characters on average. Although five different tones are defined to refine and differentiate Zhongcentrish phonetics, there still exist too many homonyms in Zhongcentrish.
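The two ratios can be reproduced directly from the figures quoted above; the computation is trivial and is shown only for concreteness.
(defun homonym-ratio (n-characters n-zpl-strings &optional (n-tones 1))
  ;; Average number of Hanman characters served by one (toned) ZPL string.
  (/ n-characters (* n-zpl-strings n-tones)))

;; (float (homonym-ratio 6000 417))   => approximately 14.4 (toneless, roughly 15/1)
;; (float (homonym-ratio 6000 417 4)) => approximately 3.6  (toned, roughly 4/1)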
3 Differentiable Horizontal Transliteration (DHT)
Comprehension needs common context. That is, the ontology of the viewer of a proposition needs to be the same as (or to intersect sufficiently with) the ontology of the creator of the proposition. In oral communication (conversation, talk, speech, ...), under contextless circumstances, after pronouncing a word the speaker has to augment the saying with some context for the listener, so as to convey the meaning accurately. Such augmented context usually includes the meaning of the Hanman character itself, whose pronunciation has just been presented, or the meaning of common words composed with the Hanman character. Such context augmentation is also widely used in letter string based languages under contextless circumstances: for instance, the speaker tells the listener, by spelling a word letter by letter, some acronym, post code, name, etc. For example, U…N, U for "university", N for "nature"; R…Z, R for "run", Z for "zero".
Here a new method, named DHT, is proposed for the horizontal transliteration of single Hanman characters under contextless circumstances. It is differentiable, and thus the horizontal transliteration can be uniquely and accurately reverted to the original Hanman character.
3.1 Rationale of Horizontal Transliteration
The meanings of a Hanman character are used to augment the context. The reason a different Hanman character was created is that there was a different meaning to be conveyed. Therefore, there is a one-to-one correspondence between the architectonics of a Hanman character and the meanings expressed by it.
Dispersivity set of a Hanman character ::= all meanings the Hanman character expresses or associates to express    (3)
Dispersivity set of a letter-string word ::= all meanings the word denotes    (4)
Usually, a dispersivity set has more than one element, and the dispersivity set of the Hanman character is not equal to the dispersivity set of the translation word. In horizontal transliteration, only one major intersection between the dispersivity set of the Hanman character and the dispersivity set of the translation word is used. This is depicted in Fig. 2.
Fig. 2. Dispersivity sets of Hanman character and translation words and their intersection. HM: heuristic meaning, LM: letter-string meanings
3.2 Two-Tier Letter String for DHT
Horizontal transliteration is a concatenation of two sections of letter strings, as depicted in Fig. 3, and is written with an initial capital.
DHT ::= ZPL + English meaning word ::= prefix concatenated with a suffixal word    (5)
Prefix = ZPL of the Hanman character plus a tone letter    (6)
To be natural, a horizontal transliteration must contain ZPL in the first place, because ZPL is the standard. The prefix gives the DHT its phonetic naturalness.
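Purely as an illustration of formula (5), a DHT string could be assembled as below; the example prefix "shi" and suffixal word "matter" are hypothetical values, not taken from this paper.
(defun make-dht (prefix suffixal-word)
  ;; Concatenate the ZPL-based prefix with the chosen suffixal word and
  ;; capitalise only the initial letter, as required above.
  (string-capitalize (concatenate 'string prefix suffixal-word) :end 1))

;; (make-dht "shi" "matter") => "Shimatter"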
3.3 Suffixal Word
The suffixal word is taken from the translation words of the meanings of the Hanman character. Normally there is more than one meaning associated with a Hanman character, and the problem is how to find a unique suffixal word among all the meaningfully augmented words. Here a knowledge-based system method is presented for the unitarization of the meaningfully augmented words of a Hanman character. Finding a unique suffixal word, i.e. the unitarization of meaningfully augmented words, for a Hanman character can be formulated as a knowledge-based system. A knowledge-based system can be modeled as a triangle pulled at its three angles by an inference engine, a knowledge base and a database, respectively. Problem solving is a process of utilizing knowledge upon data under the guidance of the inference engine, as depicted in Fig. 4.
Fig. 3. Horizontal transliteration of a Hanman character consists of two sections of letter strings: the toneless ZPL derived from its phonetics, and a translation word selected from its meanings by a rule-based system
Fig. 4. Knowledge-based system for finding a unique suffixal word for a Hanman character: an inference engine, a rule base, and libraries (Z-E dictionary and E-Z dictionary). Z-E: Zhongcentrish-English, E-Z: English-Zhongcentrish
The suffixal word is chosen according to the rules and inference rules below, so that it is simple and commonly used and its first syllable remains unmixable even when concatenated to the prefix. It is assumed that for each Hanman character a number of libraries of translation words are available, as depicted in Table 1.
Table 1. Classified libraries of all the meaningful augmentations associated with a Hanman character
Library 1: translation words of the Hanman character of itself
Library 2: translation words of the most commonly used Hanman phrases that are started with the single Hanman character
Library 3: if there is no translation word, use the core word that most frequently appears in the translation phrases of the Hanman character
Library 4: if there is no translation word and if the translation phrase of the Hanman character is composed of two words, concatenate the phrase as a word
Library 5: translation words of the general concepts to which the expressions of the single Hanman character belong
Library 6: translation words of less commonly used Hanman phrases that are started with the single Hanman character
All the translation words should be meaningful in themselves and should not be context-dependent and/or prescriptive words such as names of places, countries, people, dynasties, etc.
Rule 01: The first syllable of the suffixal word remains unmixable even when the suffixal word is concatenated to the prefix.
Rule 02: The suffixal word has the most meaningful accuracy to what the Hanman character most commonly expresses.
Rule 03: The suffixal word has the same property as the Hanman character, e.g. both are verbs, nouns, adjectives, adverbs, or prepositions, etc.
Rule 04: The translation word of the Hanman character has fewer interpretations than other translation words.
Rule 05: The suffixal word has the minimum number of syllables, regardless of parts of speech (noun, verb, adjective, adverb, preposition, exclamation) and specific expressions of the word.
Rule 06: The suffixal word is the most commonly used.
Rule 07: The suffixal word is a noun.
Rule 08: The suffixal word has the minimum number of letters among its homorooted genealogy, regardless of parts of speech (noun, verb, adjective, adverb, preposition, exclamation) and specific expressions of the word.
Rule 09: The suffixal word is commendatory or neutral rather than derogatory.
Rule 10: In case Rule 01 cannot be satisfied, place a syllable-dividing marker (') (apostrophe) between the prefix and the suffixal word when they are concatenated.
Inference rule 1: No rules are applied to Library i if Library (i-1) is nonempty, i = 1, 2, ...
Inference rule 2: Rule 1 is mandatory, has absolute paramount importance and should always be applied first of all.
Inference rule 3: Always apply rules to the least numbered library of translation words. End if successful. Only if failed on all the less numbered libraries, turn to other libraries of translation words.
Inference rule 4: The less numbered a rule, the higher priority the rule has.
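A minimal sketch of how the libraries and rules interact is given below. It is our own reading of Inference rules 1-4, not the author's implementation; the unmixability test (Rule 01) and the ranking of the remaining preference rules are passed in as hypothetical functions.
(defun choose-suffixal-word (prefix libraries unmixable-p rank)
  ;; LIBRARIES: ordered list of candidate-word lists (Library 1 first).
  ;; UNMIXABLE-P: predicate on PREFIX and WORD implementing Rule 01.
  ;; RANK: scoring function encoding the remaining preference rules
  ;; (a lower score means a more preferred word).
  (dolist (library libraries nil)
    (let ((candidates (remove-if-not (lambda (word) (funcall unmixable-p prefix word))
                                     library)))
      (when candidates
        (return (first (sort (copy-list candidates) #'< :key rank)))))))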
References
[1] Tianfield, H.: Computational comparative study of English words and Hanman character words. Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics (IEEE SMC'2001), Tucson, Arizona, USA, October 7-10, 2001, 1723-1728
[2] Tianfield, H.: Differentiable horizontal transliteration of single hieroglyphic Hanman characters under contextless circumstance. Proceedings of the 1st International Symposium on Multi-Agents and Mobile Agents in Virtual Organizations and E-Commerce (MAMA'00), December 11-13, 2000, Wollongong, Australia, 7 pages
[3] Wen Zi Gan Ge Chu Ban She (ed.): Han (man) Yu (language) Pin (spell) Yin (sound) Fang'An (scheme). Beijing: Wen Zi Gan Ge Chu Ban She, 1958. It was passed as ISO 7098 in 1982
An Expressive Efficient Representation: Bridging a Gap between NLP and KR
Jana Z. Sukkarieh
University of Oxford, Oxford, England
[email protected]
Abstract. We do not know how humans reason, whether they reason using natural language (NL) or not, and we are not interested in proving or disproving such a proposition. Nonetheless, a very expressive, transparent medium in which humans communicate, state their problems and justify how they solve these problems is NL. Hence, we wished to use NL as a Knowledge Representation (KR) in NL knowledge-based (KB) systems. However, NL is full of ambiguities, and there are syntactic and semantic processing complexities associated with NL. Hence, we consider a quasi-NL KR with a tractable inference relation. We believe that such a representation bridges the gap between an expressive semantic representation (SR) sought by the Natural Language Processing (NLP) community and an efficient KR sought by the KR community. In addition to being a KR, we use the quasi-NL language as an SR for a subset of English that it defines. Also, it supports a general-purpose, domain-independent inference component, which is, according to semanticists, all that it takes to test a semantic theory in any NLP system. This paper gives only a flavour of this quasi-NL KR and its capabilities (for a detailed study see [14]).
1 Introduction
The main objection against a semantic representation or a knowledge representation is that they need experts to understand them¹. Non-experts communicate via a natural language (usually) and more or less understand each other while performing a lot of reasoning. Nevertheless, for a long time the KR community dismissed the idea that NL can be a KR, because NL can be very ambiguous and there are syntactic and semantic processing complexities associated with NL. Recently, researchers have started looking at this issue again. Possibly, this has to do with the NL Processing community making some progress in terms of processing and handling ambiguity, and the KR community realising that a lot of knowledge is already ‘coded' in NL and that one should reconsider the way expressivity and ambiguity are handled in order to make some advances on this
NLP: Natural Language Processing. KR: Knowledge Representation.
¹ Though expert systems tend now to use more friendly representations, the latter still need experts.
front. We have chosen one of these KRs, namely [8], as a starting point to build a KR system that takes as input a simplified NL which this KR defines. One of the interesting things about this system is that we are exploring a novel meaning representation for English that no one has incorporated in an NLP system before, and building a system that touches upon most of the areas of NLP (even if it is restricted). We extended the basic notion and its proof theory to allow the deductive inferences that semanticists agree are the best test for any NLP semantic capacity (see the FraCas project²). Before we started our work, no one had built an automatic system that sanctions the FraCas deductive inferences. By extending the KR we extended the subset of English that this KR defines. In the following section we give a description of the NL-like KR, McLogic (related articles: [16], [15]). In Sections 3 and 4 we give an idea of the inference task and of the controlled English that McLogic defines, respectively (see details in [14] and [13], respectively). We conclude by summarising and emphasising the practical implications of our work.
2 NL-Like KR: McLogic
McLogic is an extension of a knowledge representation defined by McAllester et al. [8], [9]. Other NL-like knowledge representations are described in [5], [10] and [7], [6], [11], but McAllester et al. have a tractable inference relation. For McLogic, NL-like means that it is easy to read and has a natural-looking ‘word' order³. The basic form of McLogic, which we call McLogic0, and some extensions built on top of it are presented next.
2.1 The Original Framework: McLogic0
McLogic0 has two logical notions, called class expressions and formulae. In the following, we define the syntax of McLogic0 and, along with it, consider examples and their corresponding denotations. For the sake of clarity, we write V(e) to mean the denotation of e.
Building Blocks. Table 1 summarises the syntax of McLogic0. The building unit for sentences or formulae in McLogic0 is a class expression denoting a set. First, the constant symbols Pooh, Piglet, Chris-Robin are class expressions that denote singleton sets consisting of the entities Pooh, Piglet and Chris-Robin, respectively. Second, monadic predicate symbols like bear, pig, hurt, cry and laugh are all monadic class expressions that denote sets of entities that are bears, pigs, or are hurt, that cry or laugh, respectively. Third, there are expressions of the form (R (some s)) and (R (every s)), where R is a binary relation and s is a class expression, such as (climb (some tree)) and (sting (every bear)), where
² http://www.cogsci.ed.ac.uk/~fracas/
³ With the exception of Lambda expressions, which are not very English, as you will see below!
V((climb (some tree))) = {y | ∃x. x ∈ tree ∧ climb(y, x)} and V((sting (every bear))) = {y | ∀x. x ∈ bear → sting(y, x)}. Fourth, a class expression can be a variable symbol that denotes a singleton set, that is, it varies over entities in the domain. Finally, a class expression can be a lambda expression of the form λx.φ(x), where φ(x) is a formula and x is a variable symbol, and V(λx.φ(x)) = {d | φ(d) is true}. An example of a lambda expression will be given in the next section.
Table 1. The syntax of McLogic0. R is a binary relation symbol, s and w are class expressions, x is a variable and φ(x) is a formula.
Class Expressions -- Example:
  a constant symbol -- c, Pooh
  a monadic predicate symbol -- student
  (R (some s)) -- (climb (some tree))
  (R (every s)) -- (climb (every tree))
  a variable symbol -- x
  λx.φ(x) -- λx.(x (likes (Pooh)))
Well-Formed Formulae -- Example:
  (every s w) -- (every athlete energetic)
  (some s w) -- (some athlete energetic)
  negation of a formula -- not (some athlete energetic)
  Boolean combinations of formulae -- (every athlete energetic) and (every athlete healthy)
Well-Formed Sentences. A sentence in McLogic0 is either:
– an atomic formula of the form (every s w) or (some s w), where s and w are class expressions, for example, (every athlete energetic),
– or a negation of these forms, for example, not (every athlete energetic),
– or a boolean combination of formulae, such as (every athlete healthy) and (some researcher healthy).
The meaning of each sentence is a condition on its truth value: (every s w) is true if and only if (iff) V(s) ⊂ V(w); (some s w) is true iff V(s) ∩ V(w) ≠ ∅. The negation and the Boolean combination of formulae have the usual meaning of logical formulae. Note that since variables and constants denote singleton sets, the formulae (some Pooh cry) and (every Pooh cry) are semantically equivalent. For a more natural representation, we can abbreviate this representation to (Pooh cry).
Now that we have defined formulae, we can give an example of a lambda expression: ζ = λx.(some/every x (read (some book))), which can be written as ζ = λx.(x (read (some book))), where V(ζ) = {d | (d (read (some book))) is true}.
The above is an informal exposition of the meaning of the constituents. Before we go on to describe the extensions, it is important to note that McLogic0 is “related to a large family of knowledge representation languages known as concept languages or frame description languages (FDLs)... However, there does not appear to be any simple relationship between the expressive power ... [of McLogic0] and previously studied FDLs” [9]. Consider, for example, translating the formula (every W (R (some C))) into a formula involving an expression of the form ∀R.C⁴: you will find that there seems to be no way to do that.
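To make the denotations above concrete, the following toy model-checker (our own illustration, unrelated to McLogic's proof theory) evaluates the two relational class expressions and the two atomic formula types over a finite model in which class expressions denote lists of entities and a binary relation R is a list of (x y) pairs meaning R(x, y).
(defun eval-some-r (r s)
  ;; V((R (some s))): entities y such that R(y, x) holds for at least one x in S.
  (remove-duplicates
   (mapcar #'first
           (remove-if-not (lambda (pair) (member (second pair) s)) r))))

(defun eval-every-r (r s universe)
  ;; V((R (every s))): entities y in UNIVERSE with R(y, x) for every x in S.
  (remove-if-not
   (lambda (y) (every (lambda (x) (member (list y x) r :test #'equal)) s))
   universe))

(defun formula-every-p (s w) (subsetp s w))              ; (every s w): V(s) is contained in V(w)
(defun formula-some-p (s w) (and (intersection s w) t))  ; (some s w): the denotations intersect

;; Example: "(some Pooh (climb (some tree)))"
;; (formula-some-p '(pooh) (eval-some-r '((pooh oak) (piglet acorn)) '(oak))) => T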
2.2 A Richer McLogic
As we did earlier with McLogic0, we provide the syntax of the extension together with some examples of formulae and their associated meanings. The extensions that are needed in this paper are summarised in Table 2. The main syntactic and semantic innovation over McLogic0 is the cardinal class expression, which allows quantifiers other than 'every' and 'some' to be symbolised, as will be made clear later.
⁴ Recall that an object x is a member of the class expression ∀R.C if, for every y such that the relation R holds between x and y, the individual y is in the set denoted by C.
Table 2. The syntax of the extensions required. R is a binary relation symbol (R −1 is the inverse of a binary relation), R3 is a 3-ary relation and R3−1 is an inverse of a 3-ary relation, s and t are class expressions, Q1 and Q2 can be symbols like every, some, at most ten, most , and so on, Mod is a function that takes a class expression, s, and returns a subset of s.
Class Expressions and Examples:
  s + t: happy + man
  s$t: client$representative
  s#Mod: drive#fast
  ¬(s): ¬(man)
  (R3(Q1 s) (Q2 t)): (give(some student) (some book))
  (R−1(Q1 s)): (borrow−1(some student))
  (R3−1(Q1 s) (Q2 t)): (give−1(some book) (some librarian))
  Q2 ∗ s: more than one ∗ man

Formulae and Examples:
  (N ∗ s t): (more than one ∗ man snore)
2.3 Syntax of McLogic − McLogic0, with Example Denotations
The two logical notions, namely class expressions and formulae, are the same as in McLogic0. In addition to these two, we introduce the notion of a function symbol.
Definition 1. If f is an n-place function symbol and t1, ..., tn are class expressions, then f(t1, ..., tn) is a class expression.
There are special function symbols in this language, namely the unary symbol ¬ and the 2-ary operators +, $ and ∗. We first introduce the three operators +, $ and ¬. The operator ∗ will be introduced after we define a cardinal class expression.
Definition 2. Given two class expressions s and t:
– s + t is a class expression.
– s$t is a class expression.
– ¬(s) is a class expression.
s + t is defined to be the intersection of the two sets denoted by s and t, and s$t is defined to be the union of the two sets denoted by s and t. Moreover, ¬(s) is defined to be the complement of the set denoted by s.
Example 1. Since man and (eat(some apple)) are class expressions, then man + (eat(some apple)), man$(eat(some apple)), and ¬(man) are class expressions. They denote the sets V(man) ∩ V((eat(some apple))), V(man) ∪ V((eat(some apple))), and (V(man))c (the complement of the set denoting man), respectively.
It is important to mention the special unary functions that take a class expression and return a subset of that class expression. Hence, if we let mod func be one of these functions, then mod func(s1) = s2, where s1 is a class expression and V(s2) is a subset of V(s1). We write s1#mod func for mod func(s1). For example, if f is such a function, then V(f(drive)) ⊆ V(drive), V(f(man)) ⊆ V(man), and so on.
Now, we introduce cardinal class expressions. We group symbols like most, less than two, more than one, one, two, three, four, ... under the name cardinal class expressions.
Definition 3. A cardinal class expression, N, represents a positive integer N and denotes the set V(N) = {X | X is a set of N objects}.
Note that a cardinal class expression does not have a meaning independently of entities (in some specified domain). This is motivated by the way one introduces an (abstract) number to a child: it is always associated with objects. Having defined a cardinal class expression, the operator ∗ can be defined:
Definition 4. Given a cardinal class expression N and a class expression s that is not a cardinal class expression, N ∗ s is defined to be V(N) ∩ P(V(s)), where P(V(s)) is the power set of the denotation of s. In other words, N ∗ s is interpreted as {X | X ⊆ V(s) ∧ |X| = N}.
The operation ∗ has the same meaning as +, that is, intersection between sets. However, the new operator is introduced to emphasize the fact that ∗ defines an intersection between sets of sets of entities and not sets of entities.
Example 2. Given the cardinal class expression ten and the non-cardinal class expression book, we can form ten ∗ book, which denotes the set of all sets that consist of ten books.
Introducing the operator ∗ and the class expression N ∗ s allows us to introduce the class expression (R(N ∗ s)), where R is a binary relation or the inverse of a binary relation. For the sake of presentation, we define the inverse of a binary relation:
Definition 5. Given a binary relation R, the inverse R−1 of R is defined as follows: d R d′ iff d′ R−1 d.
Examples are (borrow(ten ∗ book)), (buy(more than one ∗ mug)) or (buy−1(john)). To stay close to NL, we can use bought-by for buy−1 and so on. Being class expressions, the last three examples denote sets: a set of entities that borrow(ed) ten books, a set of entities that buy (bought) more than one mug, and a set of entities that were bought by john. Hence, the definition:
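A small sketch (with invented denotations) of how the operators above behave as set operations, and of N ∗ s as the collection of N-element subsets of V(s):

```python
from itertools import combinations

# Toy denotations (invented for illustration only).
domain = {"a", "b", "c", "d"}
V_man = {"a", "b", "c"}
V_happy = {"a", "b"}

happy_plus_man = V_happy & V_man      # s + t : intersection
man_union_happy = V_man | V_happy     # s$t   : union
not_man = domain - V_man              # ¬(s)  : complement with respect to the domain

# two * man : V(two) intersected with P(V(man)), i.e. all two-element subsets of V(man)
two_star_man = {frozenset(pair) for pair in combinations(V_man, 2)}

def n_star_formula(n, s, t):
    """(N * s t): some N-element subset of V(s) is also a subset of V(t)."""
    return any(set(subset) <= t for subset in combinations(s, n))

print(two_star_man)
print(n_star_formula(2, V_man, V_happy))   # 'two men are happy' -> True
```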
Definition 6. A class expression of the form (R(N ∗ s)) denotes the set {x | ∃y ∈ N ∗ s. ∀i. (i ∈ y ←→ x R i)}.
Moreover, the introduction of ∗ and N ∗ s allows the introduction of a formula of the form (N ∗ s t).
Example 3. Given the class expressions two, man, old and (borrow(more than one ∗ book)), some of the formulae we can form are: (two ∗ man old), (two ∗ old man), (two ∗ man (borrow(more than one ∗ book))). We want these formulae to be true if and only if 'two men are old', 'two old (entities) are men' and 'two men borrow(ed) more than one book', respectively. The truth conditions for a formula of the form (N ∗ s t) are given as follows:
Definition 7. (N ∗ s t) is true if and only if some element of N ∗ s is a subset of t. In other words, there exists a set X ⊆ s that has N elements such that X ⊆ t; for short, N s's are t's.
The last extension in this paper, namely a class expression with a 3-ary relation, is introduced in what follows:
Definition 8. A class expression with a 3-ary relation is of the form:
– (R(some s) (every t)),
– (R(some s) (some t)),
– (R(every s) (every t)), or
– (R(every s) (some t)), . . .
In a more general way: (R(Q1 s) (Q2 t)), (R(Q3 ∗ s) (Q4 ∗ t)), (R(Q1 s) (Q4 ∗ t)), (R(Q3 ∗ s) (Q2 t)), where Qi for i = 1, 2 is either some or every, Qi for i = 3, 4 is a cardinal class expression, R is a 3-ary relation or an inverse of a 3-ary relation, and s and t are non-cardinal class expressions. Examples of such class expressions are: (give(mary) (some book)), (hand(some student) (some parcel)), (send(more than one ∗ flower) (two ∗ teacher)). It is natural that we make these class expressions denote entities that 'give mary some book', that 'hand(ed) some student some parcel' and that 'send more than one flower to two teachers', respectively. In particular, we define the following:
Different choices of quantifiers give different meanings:
V((R(some s) (every t))) = {y | ∃x ∈ s. ∀z. z ∈ t −→ ⟨y, z, x⟩ ∈ R}.
V((R(some s) (some t))) = {y | ∃x ∈ s. ∃z ∈ t. ⟨y, z, x⟩ ∈ R}.
V((R(every s) (every t))) = {y | ∀x. x ∈ s −→ ∀z. z ∈ t −→ ⟨y, z, x⟩ ∈ R}.
V((R(every s) (some t))) = {y | ∀x. x ∈ s −→ ∃z. z ∈ t ∧ ⟨y, z, x⟩ ∈ R}.
To allow for the change of quantifiers, we consider the form (R(Q1 s) (Q2 t)), whose denotation is {y | ∃X. X ⊆ s ∧ ∃Z. Z ⊆ t such that ∀x. ∀z. x ∈ X ∧ z ∈ Z −→ ⟨y, z, x⟩ ∈ R}, where the cardinalities of X and Z depend on Q1 and Q2 respectively. Basically, we are just saying that (R(Q1 s) (Q2 t)) is the set of elements that relate, through R, to elements of s and elements of t, where the number of elements of s and t concerned depends on the quantifiers Q1 and Q2, respectively. For completeness, we define the inverse of a 3-ary relation and give an example:
Definition 9. Given a 3-ary relation R, the inverse R−1 of R is defined as: ⟨Arg1, Arg2, Arg3⟩ ∈ R iff ⟨Arg2, Arg3, Arg1⟩ ∈ R−1.
For example, V((give−1(some student) (john))) is {y | ∃x ∈ student. ⟨y, john, x⟩ ∈ give−1} = {y | ∃x ∈ student. ⟨x, y, john⟩ ∈ give}.
We have described above the syntax of the extension of McLogic0 and gave example denotations. As we said earlier, McLogic0 together with the extensions will be called McLogic in this paper. For a formal semantics of McLogic, consult [14]. Having described the logic, we go on to describe the inferences McLogic sanctions.
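The denotations above can be computed directly over a finite toy relation. The sketch below mirrors the definitions, with triples written ⟨y, z, x⟩ where x is drawn from s and z from t; the relation and the sets are invented for illustration.

```python
# Toy 3-ary relation: triples <y, z, x>, with x ranging over s and z over t.
R = {("sue", "p1", "mary"), ("tom", "p1", "mary"), ("tom", "p2", "ann")}
s = {"mary", "ann"}      # first-argument class
t = {"p1", "p2"}         # second-argument class
subjects = {y for (y, _, _) in R}

# V((R (some s) (some t))) = {y | exists x in s, exists z in t with <y,z,x> in R}
some_some = {y for (y, z, x) in R if x in s and z in t}

# V((R (every s) (some t))) = {y | for every x in s there is some z in t with <y,z,x> in R}
every_some = {y for y in subjects if all(any((y, z, x) in R for z in t) for x in s)}

print(some_some)     # {'sue', 'tom'}
print(every_some)    # {'tom'}
```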
3 The Inference Task McLogic Sanctions
We are concerned with NL inferences, but not with implicatures or suppositions. Moreover, we do not deal with defeasible reasoning, abductive or inductive reasoning and so on. In our work [14], we focus on deductive (valid) inferences that depend on the properties of NL constructs: entailments from an utterance U, or from several utterances Ui, that seem "natural", in other words, entailments that people do draw when they hear U. For example, in the following, D1 is deduced from Scenario S1:
– S1: (1) a. some cat sat on some mat. b. The cat has whiskers.
⇓
D1: some cat exists, some mat exists, some cat sat on some mat, some cat has whiskers (cat1 has whiskers), some whiskers exist.
⇓
whiskers sat on some mat
We define a structurally-based inference to be one that depends on the specific semantic properties of the syntactic categories of sentences in NL. For example:
– S2: most cats are feline animals.
  ⇓ D2: most cats are feline, most cats are animals.
– S3: Smith and Jones signed the contract.
  ⇓ D3: Smith signed the contract.
D2 depends on the monotonicity properties of generalised quantifiers, and D3 on those of conjoined noun phrases, among other classes that the FraCaS suite deals with.
3.1 Inference Set
The proof theory is specified with a set of deductive inference rules together with their contrapositives. The original inference set, which corresponds to McLogic0, consists of 32 inference rules. We added rules that correspond to the extensions and that are induced by the structurally-based inferences under consideration. In the following, we list some of the rules (the numbers are not in sequence, so as to give an idea of the rules added). We assume these rules are clear, and the soundness of each rule can easily be shown against the formal semantics of the logic (consult [14] for a formal semantics). Axioms are listed on their own; derived rules are written as premises ⊢ conclusion.

(7) ⊢ (every C C)
(19) ⊢ (every C C$W)
(20) ⊢ (every W C$W)
(21) ⊢ (every C+W C)
(22) ⊢ (every C+W W)
(38) ⊢ (every C#W C)
(10) (some C exists) ⊢ (some C C)
(11) (some C W) ⊢ (some C exists)
(12) (some C W) ⊢ (some W C)
(13) (every C W), (every W Z) ⊢ (every C Z)
(14) (every C Z), (some C W) ⊢ (some Z W)
(15) (some C W), (at most one C) ⊢ (every C W)
(16) (every C W), (at most one W) ⊢ (at most one C)
(17) (not (some C exists)) ⊢ (at most one C)
(18) (not (every C W)) ⊢ (some C exists)
(23) (every C Z), (every W Z) ⊢ (every C$W Z)
(24) (every Z C), (every Z W) ⊢ (every Z C+W)
(25) (some C Z) ⊢ (some C+Z exists)
(26) (not (some S exists)) ⊢ (every T (R(every S)))
(27) (every C W$Z), (not (some C W)) ⊢ (every C Z)
(28) (every C W$Z), (not (some C Z)) ⊢ (every C W)
(29) (every C W) ⊢ (every (R(some C)) (R(some W)))
(30) (every C W) ⊢ (every (R(every W)) (R(every C)))
(31) (some C W) ⊢ (every (R(every C)) (R(some W)))
(32) (some (R(some C)) exists) ⊢ (some C exists)
(35) (more than one ∗ C W) ⊢ (more than one ∗ C+W exists)
(36) (more than one ∗ C+W exists) ⊢ (more than one ∗ C W)
(37) (N1 ∗ C W), (every C Z) ⊢ (N1 ∗ Z W)
(39) (N1 ∗ C W) ⊢ (N1 ∗ W C)
(40) (N3 ∗ C W), (every C Z) ⊢ (N3 ∗ C W+Z)
(41) (D W Z), (every Z C) ⊢ (D W C)
(42) (at most N ∗ C W) ⊢ (at most N ∗ C W#Z)
(43) (at most N ∗ C W) ⊢ (at most N ∗ C+Z W)
(44) (at most N ∗ C (R(some W))) ⊢ (at most N ∗ C (R(some Z+W)))
(45) (at most N ∗ C W) ⊢ (at most N ∗ C Z+W)
(46) (more than one ∗ C W) ⊢ (some C W)
(47) (N4 ∗ C Z) ⊢ (N4 ∗ C$W Z)
In the above, C, W and Z are class expressions. D is a representation of a determiner that is monotonically increasing on its second argument. N1 belongs to {at least N, N, more than one}, N3 belongs to {most, more than one, some(sg), at least N} and N4 to {more than one, N, at most N, at least N}, where N is a cardinal. In the following, we show a few examples of structurally-based inferences, licensed by the monotonicity properties of GQs, that are sanctioned by the above inference rules. It is enough, in these examples, to consider the representation without knowing how the translation is done or which English constituent is translated to what. I leave showing how McLogic is used as a typed SR to a different occasion. The fact that McLogic is NL-like makes the representation and the proofs in this suite easy to follow.
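To make the rule format concrete, here is a small sketch (not the author's implementation) that encodes McLogic formulae as nested tuples and applies two of the rules listed above.

```python
# Formulae are encoded as tuples, e.g. ('every', 'teacher', 'run').

def rule_13(f1, f2):
    """(every C W), (every W Z) |- (every C Z): transitivity."""
    if f1[0] == "every" and f2[0] == "every" and f1[2] == f2[1]:
        return ("every", f1[1], f2[2])

def rule_14(f1, f2):
    """(every C Z), (some C W) |- (some Z W)."""
    if f1[0] == "every" and f2[0] == "some" and f1[1] == f2[1]:
        return ("some", f1[2], f2[2])

# First step of Argument 7 below: (every teacher run#yesterday), (every run#yesterday run)
print(rule_13(("every", "teacher", "run#yesterday"), ("every", "run#yesterday", "run")))
# -> ('every', 'teacher', 'run')

print(rule_14(("every", "cat", "animal"), ("some", "cat", "whiskered")))
# -> ('some', 'animal', 'whiskered')
```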
3.2 Illustrative Proofs
Arguments 1, 3, 4 and 5 are selected from the FraCaS test suite [1]. The FraCaS test suite is the best benchmark we could find to develop a general inference component: "The test suite is a basis for a useful and theory/system-independent semantic tool" [1]. From the illustrative proofs it can be seen what we mean by not using higher-order constructs and by using what we call combinators, like "sine" and "cosine", instead. 'some' is monotonically increasing on its second argument, as shown in arguments 1 and 2.
Argument 1 IF Some mammals are four-legged animals THEN Some mammals are four-legged.
1. (more than one ∗ mammal four-legged+animal)       [premise]
2. (more than one ∗ four-legged+animal mammal)       [1, rule 39]
3. (every four-legged+animal four-legged)            [rule 21]
4. (more than one ∗ four-legged mammal)              [2, 3, rule 37]
5. (more than one ∗ mammal four-legged)              [4, rule 39]
Argument 2 IF some man gives Mary an expensive car THEN some man gives Mary a car.
To deal with 3-ary relations, we need to augment rules like (29), (30) and (31). For this particular case, rule (29) is needed; hence we add the rules in Table 3. Now we can provide the proof of Argument 2:
Table 3. Augmenting rule 29 to cover 3-ary relations with 'some' and 'every' only. C, W, Q are class expressions (not cardinal ones nor time class expressions).
(3-ary 1) (every C W) ⊢ (every (R(some Q) (some C)) (R(some Q) (some W)))
(3-ary 2) (every C W) ⊢ (every (R(every Q) (some C)) (R(every Q) (some W)))
(3-ary 3) (every C W) ⊢ (every (R(some C) (some Q)) (R(some W) (some Q)))
(3-ary 4) (every C W) ⊢ (every (R(some C) (every Q)) (R(some W) (every Q)))
SINCE (every expensive + car car ) THEN (every (give(Mary) (some expensive + car )) (give(Mary) (some car ))) using rule 3-ary 1. MOREOVER, SINCE (some man (give(Mary) (some expensive + car ))) THEN (some (give(Mary) (some expensive + car )) man) using rule 12. THE TWO CONCLUSIONS IMPLY (some (give(Mary) (some car )) man) using rule 14. THE LAST CONCLUSION IMPLIES (some man (give(Mary) (some car ))).
The determiner 'some' is monotonically increasing on its first argument. Using rule 14 together with (every irish+delegate delegate) justifies the required conclusion in Argument 3.
Argument 3 IF Some Irish delegate snores, THEN Some delegate snores.
'every' is monotonically decreasing on its first argument, as in the following:
Argument 4 IF every resident of the North American continent travels freely within Europe AND every canadian resident is a resident of the North American continent THEN every canadian resident travels freely within Europe.
The transitivity rule (rule 13) proves the conclusion of Argument 4. 'At most N' is monotonically decreasing on its second argument, as shown in Argument 5. Rule 42 proves this property for 'at most ten':
Argument 5 IF At most ten commissioners drive, THEN At most ten commissioners drive slowly.
The inference rules that McLogic is equipped with sanction all the examples in the FraCaS test suite that are licensed by the properties of GQs; here we have only included some of them. The following examples are given in [4], and we show that their validity is easily established by our proof system.
Argument 6 IF no teacher ran THEN no tall teacher ran yesterday.
Alternatively, in McLogic, IF not (some teacher run) THEN not (some tall+teacher run#yesterday). We have (every tall+teacher teacher) using rule 22. This together with not(some teacher run) implies not(some run tall+teacher). Moreover, (every run#yesterday run) using rule 38. Using rule 14, the last two results justify not(some run#yesterday tall+teacher). The contrapositive of rule 12 gives the required result.
Argument 7 IF every teacher ran yesterday THEN every tall teacher ran.
Alternatively, in McLogic, IF (every teacher run#yesterday) THEN (every tall+teacher run). Using the transitivity rule 13 on the premise together with (every run#yesterday run) yields (every teacher run). Again, the transitivity rule together with (every tall+teacher teacher) justifies the conclusion.
Argument 8 IF every student smiled AND no student who smiled walked THEN no student walked.
Since (every student student) (rule 7) and (every student smile), then (every student student+smile) using rule 24. Using a contrapositive of rule 14, the last result together with not(some student+smile walk) justifies not(some student walk).
Fyodorov et al. [4] use similar rules: rule 7 is what they call the reflexivity rule, and rule 24 is what they call the conjunction rule. We do not have monotonicity rules as such, because monotonicity can be proved through other rules. The last example they have is the following:
Argument 9 IF exactly four tall boys walked AND at most four boys walked THEN exactly four boys walked.
Assuming 'exactly C Nom VP' to be semantically equivalent to the conjunction of 'at least C Nom VP' and 'at most C Nom VP', where C is a cardinal, the argument reduces to showing that 'at least four boys walked'. Given (at least four ∗ tall+boy walk) and (every tall+boy boy), then (at least four ∗ boy walk) using rule 37.
In all the examples above, combining 'determiners' other than 'some' and 'every' is minimal. For example, the argument
Argument 10 IF some managers own at least two black cars, THEN some managers own at least two cars.
Table 4. Again, rules similar to rule 29 that cover 3-ary relations, but with 'some', 'every' and cardinal class expressions. C, W, Q are class expressions (not cardinal ones nor time class expressions) and N is a cardinal class expression.
(3-ary 5) (every C W) ⊢ (every (R(some Q) (N∗C)) (R(some Q) (N∗W)))
(3-ary 6) (every C W) ⊢ (every (R(every Q) (N∗C)) (R(every Q) (N∗W)))
(3-ary 7) (every C W) ⊢ (every (R(N∗Q) (N∗C)) (R(N∗Q) (N∗W)))
(3-ary 8) (every C W) ⊢ (every (R(N∗C) (some Q)) (R(N∗W) (some Q)))
(3-ary 9) (every C W) ⊢ (every (R(N∗C) (every Q)) (R(N∗W) (every Q)))
(3-ary 10) (every C W) ⊢ (every (R(N∗C) (N∗Q)) (R(N∗W) (N∗Q)))
is true, but the proof won't go through unless we introduce another rule, like (29) but with other 'determiners', namely: (every P Q) ⊢ (every (R(N∗P)) (R(N∗Q))). This, as well, makes it necessary to account for different 'determiners' in the case of a 3-ary relation. The rules required are given in Table 4 above. Similar augmentations should occur for rules 30 and 31, whether for a binary relation accounting for 'determiners' other than 'some' and 'every' or for a 3-ary relation.
It is important to say that these inferences are licensed independently of the different scope possibilities of quantifiers. For instance, a property of 'every' licenses the inference 'every man likes a woman' from the sentence, S, 'every man likes a beautiful woman', independently of any interpretation given to S.
We have described the logic and the inferences it allows. In the next section, for the purpose of completeness only, we describe very briefly how the logic and the inferences fit into the picture of defining a computer-processable controlled English.
4 CLIP: The Controlled English McLogic Defines
Definition 10. CLIP is a sublanguage of English with the following properties:
– It is syntactically and semantically processable by a computer.
– Each sentence in CLIP has a well-formed translation in McLogic.
– The ambiguities in a sentence are controlled in such a way that the interpretation of that sentence allows the inferences required in FraCaS D16 [1].
– The vocabulary is controlled only as far as the syntactic category.
The word CLIP implicitly 'clips' a part of the 'whole', that is, a dialect or sublanguage rather than full English. Here is an example:
Calvin: Susie understands some comic books. Many comic books deal with serious issues. All superheroes face tough social dilemmas. It is not true that a comic book is an escapist fantasy. Every comic book is a sophisticated social critique. Hobbes: Most comic books are incredibly stupid. Every character conveys a spoken or graphic ethical message to the reader before some evil spirit wins and rules.
McLogic0 is the basic building block for CLIP. To start with, an English sentence belongs to CLIP if, and only if, it has a well-formed translation in McLogic0. Further, we extended McLogic0 to account for more English constituents, motivated by the structurally-based inferences in the FraCaS test suite. Inferences, with their corresponding properties, premises and conclusions, add to the expressivity of the dialect. To emphasize this idea, we can take a kind of recursive view:
Base Case: McLogic0
Recursive Step: McLogicn depends on McLogicn−1
However, it is not an accumulative one-way hierarchy of languages, since CLIP and the reasoning task motivate the extension.
5 Conclusion
We believe that the NLP community and the KR community seek common goals, namely, representing knowledge and reasoning about such knowledge. Nonetheless, the two communities have difficult-to-meet desiderata: expressivity and taking information in context into account for a semantic representation, and efficient reasoning for a KR. Finding one representation that is capable of both is still a challenge. We argue that an NL-like KR may help close that gap. Trying to achieve our aim of an NL-based formal KR led to an interesting inquiry into (not in order of importance):
– a novel meaning representation for English;
– a KR in relation to the following task: given English utterances U1, ..., Un and an English utterance C, a machine has to decide whether C follows from U1 ∧ ... ∧ Un;
– a way for an NL expert Knowledge-Based System (KBS) to provide a clear justification for the line of reasoning used to draw its conclusions.
An experimental system has been developed. The system has been tested (as a guideline for its development) on two substantial examples: the first is a well-known test case for theorem provers ('Schubert's Steamroller') [12] and the second is a well-known example from the Z programme specification literature
('Wing's library problem' [2]). Further, our aim was a general reasoning component that could handle a test suite consisting of a set of structurally-based deductions; that is, deductions licensed by specific properties of English constituents, made independently of the domain. As we mentioned earlier, as a guideline for structurally-based inferences we have used the FraCaS test suite. Hence, we extended, in a formal manner, McLogic0 and accordingly the KR system that takes a restricted but still powerful sublanguage of English as input. As far as we know, no one has previously provided a deductive computational engine covering all the types of inference illustrated or listed in the FraCaS test suite, or incorporated McLogic0 (or even an extension of it) into an NLP system. The practical value of the study can be seen, at least, in the following ways:
1. For a knowledge engineer, having an NL-like (i.e. transparent) KR makes it easier to debug any KB reasoning system.
2. Since McLogic's inference set is domain-independent and aims to be rich enough to be a test for the semantic capacity of any NLP system, McLogic could be used in an advanced question-answering system in any domain.
References
[1] Cooper, R., Crouch, D., van Eijck, J., Fox, C., van Genabith, J., Jaspars, J., Kamp, H., Milward, D., Pinkal, M., Poesio, M., Pulman, S.: The FraCaS Consortium, Deliverable D16. With additional contributions from Briscoe, T., Maier, H. and Konrad, K. (1996)
[2] Diller, A.: Z: An Introduction to Formal Methods. Wing's Library Problem. John Wiley and Sons Ltd. (1994)
[3] Dowty, D. R., Wall, R. E., Peters, S.: Introduction to Montague Semantics. D. Reidel Publishing Co., Boston (1981)
[4] Fyodorov, Y., Winter, Y., Francez, N.: A Natural Logic Inference System. Inference in Computational Semantics (2000)
[5] Hwang, C. H., Schubert, L. K.: Episodic Logic: A Comprehensive, Natural Representation for Language Understanding. Minds and Machines 3(4) (1993) 381–419
[6] Iwanska, L. M.: Natural Language Processing and Knowledge Representation. Natural Language is a Powerful Knowledge Representation System. MIT Press (2000) 7–64
[7] Iwanska, L. M.: Natural Language Is a Representational Language. Knowledge Representation Systems Based on Natural Language. AAAI Press (1996) 44–70
[8] McAllester, D., Givan, R.: Natural Language Syntax and First-Order Inference. Artificial Intelligence 56 (1992) 1–20
[9] McAllester, D., Givan, R., Shalaby, S.: Natural Language Based Inference Procedures Applied to Schubert's Steamroller. AAAI (1991)
[10] Schubert, L. K., Hwang, C. H.: Natural Language Processing and Knowledge Representation. Episodic Logic Meets Little Red Riding Hood: A Comprehensive Natural Representation for Language Understanding. AAAI Press; Cambridge, Mass.: MIT (2000) 111–174
[11] Shapiro, S. C.: Natural Language Processing and Knowledge Representation. SNePS: A Logic for Natural Language Understanding and Commonsense Reasoning. AAAI Press; Cambridge, Mass.: MIT (2000) 175–195
[12] Stickel, M. E.: Schubert's Steamroller Problem: Formulations and Solutions. Journal of Automated Reasoning 2 (1986) 89–101
[13] Sukkarieh, J. Z.: Mind Your Language! Controlled Language for Inference Purposes. To appear in EAMT/CLAW03, Dublin, Ireland (2003)
[14] Sukkarieh, J. Z.: Natural Language for Knowledge Representation. University of Cambridge. http://www.clp.ox.ac.uk/people/staff/jana/jana.htm (2001)
[15] Sukkarieh, J. Z.: Quasi-NL Knowledge Representation for Structurally-Based Inferences. In: Blackburn, P., Kolhase, M. (eds.) Proceedings of the Third International Workshop on Inference in Computational Semantics, Siena, Italy (2001)
[16] Sukkarieh, J. Z., Pulman, S. G.: Computer Processable English and McLogic. In: Bunt, H. et al. (eds.) Proceedings of the Third International Workshop on Computational Semantics, Tilburg, The Netherlands (1999)
Connecting Word Clusters to Represent Concepts with Application to Web Searching
Arnav Khare
Information Technology Dept., Institute of Engineering & Technology, DAVV, Khandwa Road, Indore, MP, India
[email protected]
Abstract. The need for a search technique which will help a computer understand the user and his requirements has long been felt. This paper proposes a new technique for doing so. It proceeds by first clustering the English-language words into clusters of similar meaning, and then connecting those clusters according to their observed relationships and co-occurrences in web pages. These known relationships between word clusters are used to enhance the user's query and, in effect, 'understand' it. This process results in answers of more value to the user. The procedure does not suffer from the problems faced by many of the presently used techniques.
Keywords: Knowledge Representation, Machine Learning.
1 Introduction
As the size of the Internet, and with it the information on it, has grown, the task of finding the right place for the information you want has become more and more daunting. Search engines have attempted to solve this problem using a variety of approaches, which are outlined in the Related Research section. In this paper, a new approach to understanding the user is suggested, which will result in more appropriate results to his queries. This technique takes a connectionist approach. There is considerable evidence that similar techniques are also applied in the human mind: a thing, meaning or concept is linked to another concept based on their co-occurrence. If two things happen together several times, they become linked to each other. This is known in psychology as 'reinforced learning'. If two events happen one after another many times, this gives rise to the concept of causation. This connectionist theory has been applied in neural networks, but at the physical level of neurons. The proposed method works at a higher level, in which concepts are connected to each other to represent meaning.
2 Related Research
The initial approach to text and information retrieval, known as ad hoc retrieval, was to consider a document as a bag of words. Each such bag (document) was remembered for the words that it contained. When a user asked a query (which is a string of words), the system matched the words in the query with the words in stored documents. To improve the results, some metrics were applied to rank them in order of relevance, quality, popularity, etc. Brin and Page [3] discuss how they implemented this method in Google. Many improvements were tried in various meta-search engines like MetaCrawler [4], SavvySearch [5], etc. But there were many problems associated with this method:
1. The number of documents found by this method was very large (typically in the thousands for a large-scale search engine).
2. A word generally has multiple meanings, senses and connotations. This technique retrieved documents carrying all the meanings of a word. For example, if a query contains the word 'tiger', the engine would retrieve pages which contain references both to 'the Bengal tiger' and to 'Tiger Woods'.
To tackle these problems, approaches from artificial intelligence and machine learning had to be adopted. Sahami [6] gives an in-depth discussion of these approaches. The newer ranking methods, such as TF-IDF weighting [8], usually weigh each query term according to its rarity in the collection (often referred to as the inverse document frequency, or IDF) and then multiply this weight by the frequency of the corresponding query term in an individual document (referred to as the term frequency, or TF) to get an overall measure of that document's similarity to the user's query. The premise underlying such measures is that word frequency statistics provide a good measure for capturing a document's relevance to a user's information need (as articulated in his query). A discussion of various ranking methods can be found in [7]. [9] and [12] also give a detailed discussion of building systems with ad hoc retrieval, as well as additional information on document ranking measures. The history of the development of SMART, one of the historically most important retrieval systems based on TF-IDF, is given in [11]. But this approach too looks for the frequency of a term in a document without considering its meaning.
Vector space models ([12] and [8]) have also been used for document retrieval. According to this approach, a document is represented by a multi-dimensional vector, in which there is a dimension for each category. The nearness of two documents, or of a document and a query, can be found using the scalar product of their two vectors. But this approach has problems: there can only be a limited number of dimensions, and it is unclear which dimensions should be added. Also, if a new dimension has to be added, forming new vectors for all the previously analyzed documents is an unnecessary overhead.
Two sentences may mean the same thing even if they use different words, but the previously discussed techniques do not address this issue. Some methods, like Latent Semantic Indexing [17], try to solve it. The word category map method can also be used for this purpose. Many attempts have also been made at gathering the contextual knowledge behind a user's query. Lawrence [13] and Brezillon [23] discuss the use of context in web search. McCarthy [14] has tried to formalize context. The Intellizap system [15] and SearchPad [16] try to capture the context of a query from the background document, i.e. the document which the user was reading before he asked the query.
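For illustration, a minimal sketch of the TF-IDF idea described above (the documents and query are invented; real systems add normalisation and smoothing):

```python
import math
from collections import Counter

def tfidf_scores(query_terms, documents):
    """Score each document as the sum over query terms of TF(term, doc) * IDF(term)."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))               # document frequency of each term
    scores = []
    for doc in documents:
        tf = Counter(doc)
        score = 0.0
        for term in query_terms:
            if doc_freq[term]:
                idf = math.log(n_docs / doc_freq[term])   # rarer terms weigh more
                score += tf[term] * idf
        scores.append(score)
    return scores

docs = [["tiger", "bengal", "forest"], ["tiger", "woods", "golf"], ["golf", "course"]]
print(tfidf_scores(["tiger", "golf"], docs))
```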
Recent research has focused on the use of Natural Language Processing in text retrieval. The use of lexical databases like WordNet ([1] and [2]) has been under consideration for a while. Richardson and Smeaton [18] and Smeaton and Quigley [19] tried to use WordNet in information retrieval. It has been shown [21] that the use of index words with their WordNet synset or sense improves information retrieval performance. Research is underway on using the Universal Networking Language (UNL) to create document vectors. UNL (explained in [20]) represents a document in the form of a semantic graph. Bhattacharya and Choudhary [22] have used UNL for text clustering.
3 Word Clusters
A thought carrying a meaning may be conveyed by one person to another using any of a number of words which have, if not the same, then similar meanings. Such a set of words (synonyms) refers to approximately the same meaning. Let us refer to this set of similar words as a 'cluster'. A word may fall into a number of clusters, depending on the number of different meanings it has. This clustering can be done using available electronic dictionaries, thesauri, or corpora (e.g. WordNet [1]), or by using automated clustering methods. According to its role in a sentence, a word may be classified as a noun, verb or adjective (overlooking other minor syntactic categories). Corresponding to these, there will be three kinds of word clusters. Among them, noun clusters have a special characteristic: inheritance of properties of one class by another. For example, the cluster {dog, canine} will inherit properties from the cluster {mammal}, which will inherit from {animal, creature, ...}. Thus, the clusters of nouns can be organized into hierarchies of clusters. These hierarchies can be viewed in the form of inverted trees.
Fig. 1. A network representation of semantic relations between nouns. Hyponymy refers to the 'is-a' relationship. Antonymy refers to the relationship between nouns with opposing meanings. Meronymy refers to containment, the 'has-a' relationship. [2]
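A sketch of how such clusters could be obtained from WordNet using the NLTK interface (assuming the WordNet corpus has been downloaded); synsets play the role of clusters and hypernym chains give the noun hierarchy:

```python
from nltk.corpus import wordnet as wn   # requires: nltk.download('wordnet')

# Each synset is one candidate 'cluster' of words sharing a meaning.
for syn in wn.synsets("tiger"):
    print(syn.name(), syn.lemma_names())

# The noun hierarchy: walk up the hypernym ('is-a') chain, e.g. dog -> canine -> ... -> animal.
dog = wn.synset("dog.n.01")
print([s.name() for s in dog.hypernym_paths()[0]])
```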
Yet some words will still be left unclustered. They may be names, slang and new words that have not made their way to universal knowledge or acceptability. Their lexicon has to be maintained separately.
4 Forming a Network of Clusters
Once the various clusters of words which represent meanings have been defined, we can add knowledge by analyzing how these meanings are related. There is no better place to learn relations between meanings than the Internet. When the engine visits a page, it follows this procedure:
1. Replace those words which have unambiguous meanings with the cluster number of that word.
2. For the words that have multiple meanings, look at the clusters near that word. Match those clusters with the clusters near each of the word's meanings, i.e. if a word W has three meanings A, B and C, then look at the clusters in the neighborhood of W in the document and match them with the clusters in the neighborhood of clusters A, B and C. The first match will be the closest meaning, and thus the correct sense of W will be identified. Now, instead of a sequence of words, we have a sequence of meanings represented by clusters.
3. We will now remember a page for the meanings it has. A site is represented by a sequence of clusters. We can do an inverse mapping in which, for every cluster, the sites containing a word in that cluster are stored.
4. The sequence of clusters in a document can be used to determine the relationships of clusters with each other. A technique can be adopted whereby the n clusters in the neighborhood of a cluster are taken to be related to that cluster; any suitable value of n may be taken. A link is created between clusters which are within distance n of each other in a document. If the link is already present, it is strengthened by increasing its weight by 1. Using this, given a cluster, the most probable related clusters can be found. Thus, if a sufficiently large number of occurrences of a cluster have been seen, the m (a suitable value) most probable next clusters will be the ones that are actually meaningfully related to that cluster. This linked network resembles a weighted graph and can be represented as a table (a sketch of this linking step is given after the list).
Thus a network of related clusters has been created. This network acts as a knowledge base in which knowledge is represented as links between word meanings.
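A sketch (assuming a page has already been mapped to a sequence of cluster identifiers) of the linking step described in item 4, where clusters within n positions of each other are connected and link weights count co-occurrences:

```python
from collections import defaultdict

def update_cluster_graph(graph, cluster_sequence, n=3):
    """Add or strengthen links between clusters that co-occur within a window of n positions."""
    for i, c in enumerate(cluster_sequence):
        for j in range(i + 1, min(i + 1 + n, len(cluster_sequence))):
            d = cluster_sequence[j]
            if c != d:
                graph[c][d] += 1
                graph[d][c] += 1
    return graph

# graph[c][d] is the weight of the link between clusters c and d (a table, as noted above).
graph = defaultdict(lambda: defaultdict(int))
update_cluster_graph(graph, ["C_tiger", "C_forest", "C_india", "C_wildlife"])
update_cluster_graph(graph, ["C_tiger", "C_wildlife", "C_conservation"])
print(dict(graph["C_tiger"]))
```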
Fig. 2. An example network of clusters linked to each other. Darker lines indicate links with higher weights
5 Querying the Network of Word Clusters
When a user query is presented, the following procedure can be used to find the sites that are related to the words in the query:
1. Get the clusters corresponding to the words in the query.
2. Find a path between the query clusters in the network. The path will consist of a number of other clusters. This path represents the sequence of meaning that connects the query words together. These additional clusters will enhance the original query; they correspond to the meaning of the user query as the engine has 'understood' it.
3. Get the list of sites that correspond to these clusters. Out of these sites, rank those which carry the original words of the query higher. The resulting list can again be ranked according to any metric mentioned earlier, such as nearness of the query words in the document, popularity of the document, relevance of the query words within the document (i.e. heading, bold, italic, underline, etc.), or any other suitable metric.
4. Present the results to the user.
Finding the Path between Clusters
The path-finding procedure depends on the number of word clusters in the query (a sketch of this search is given after the list below).
• If the number of clusters in the query is one: give the sites corresponding to that word and its cluster members.
• If the number of clusters is two or more: do the following for a fixed number of iterations. For each cluster in the query, get the list of related clusters, with their probabilities. If a cluster is a noun cluster in a hierarchy, get the list of clusters connected to its parent as well, because a child noun inherits the properties of its parent. New probabilities for each cluster in the list are calculated with respect to the original query cluster. Compare the lists at each iteration to look for a common cluster between them. If a common cluster is found, then a link between those two query clusters has been found; save that path, but continue until the paths between all the clusters in the query have been found or the fixed number of iterations has completed. If no common cluster is found, then for the list of each cluster, go to the cluster with the next highest probability. This is similar to 'best-first search', widely applied in artificial intelligence. If, after the fixed number of iterations, only some of the connections have been found, take the intermediate clusters of those connections and treat the rest of the query clusters as single clusters, as in the one-cluster case above.
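A simplified sketch of the search just described, using a single best-first expansion from one query cluster towards another (the paper's procedure expands lists from both clusters in parallel and looks for a common member; the graph format follows the earlier co-occurrence sketch):

```python
import heapq

def find_path(graph, start, goal, max_iters=1000):
    """Best-first search over the weighted cluster graph; edge weights are turned into
    relative probabilities, and the most probable partial path is expanded first."""
    frontier = [(-1.0, start, [start])]          # (negative path probability, cluster, path)
    visited = set()
    for _ in range(max_iters):
        if not frontier:
            break
        neg_prob, cluster, path = heapq.heappop(frontier)
        if cluster == goal:
            return path                          # intermediate clusters 'explain' the query
        if cluster in visited:
            continue
        visited.add(cluster)
        total = sum(graph[cluster].values()) or 1
        for nxt, weight in graph[cluster].items():
            if nxt not in visited:
                heapq.heappush(frontier, (neg_prob * (weight / total), nxt, path + [nxt]))
    return None
```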
Fig. 3. The path between two query clusters (darker) is found when a cluster common to each other's lists is found. The intermediate clusters are used to 'understand' the query
6 Evaluation
The evaluation of the proposed technique is underway, and the results should be available soon. Analysis of the technique shows that the worst case for this technique is when no path is found between the query words. Even in this case, the meaning of each query word would have been identified (by knowing the cluster to which it belongs, and thus disambiguating it), even if the meaning of the whole query has not been comprehended. Hence, even in its worst case, this procedure will give better results than earlier techniques, which cannot differentiate between different uses of a word.
Though this technique uses quantitative methods to learn links between words, it will give better results because it has background knowledge about the user's query, which is the key to understanding its meaning. It is this huge amount of background knowledge about the world that differentiates human understanding from computer understanding.
7 Conclusion
As was seen in the last section, the new technique promises to deliver much better results compared to currently used methods. It overcomes many of the deficiencies that afflict the methods outlined earlier. It is also an intuitive approach, which relates concepts that occur together to each other, and it gives a new way to approach computer understanding. Further work can be done on deciding the number of iterations that is suitable while searching the network, and on the manner in which the noun hierarchy may be used more efficiently. Finding better methods of linking words than simply their co-occurrence in a document would also help. Finally, finding out how this network may be enhanced to produce human-like logic is an interesting problem. Such a network could in the future be applied to other tasks such as Natural Language Processing, and Image Processing & Recognition, in which visual features of an object may also be stored.
References
[1] WordNet Lexical Database. http://www.cogsci.princeton.edu/~wn/
[2] George A. Miller, Richard Beckwith, Christiane Fellbaum, et al. Introduction to WordNet: An On-line Lexical Database. 1993.
[3] Sergey Brin, Lawrence Page. The Anatomy of a Large Scale Hypertextual Web Search Engine. In Proceedings of the 7th International World Wide Web Conference, pages 107–117, Brisbane, Australia, April 1998. Elsevier Science.
[4] Selberg and Etzioni. The MetaCrawler Architecture for Resource Aggregation on the Web. IEEE Expert, 12(1), pages 8–14, 1997.
[5] A. E. Howe and D. Dreilinger. SavvySearch: A Meta-Search Engine that Learns which Search Engines to Query. 1997.
[6] M. Sahami. Using Machine Learning to Improve Information Access. PhD dissertation, Stanford Univ., December 1998.
[7] D. Harmon. Ranking algorithms. In Information Retrieval: Data Structures and Algorithms, W. B. Frakes and R. Baeza-Yates, Eds. Prentice Hall, 1992, pages 363–392.
[8] G. Salton and C. Buckley. Term Weighting Approaches in Automatic Text Retrieval. Information Processing and Management 24(5) (1988), pages 513–523.
[9] G. Salton and M. J. McGill. Introduction to Modern Information Retrieval. McGraw-Hill Book Company, 1983.
[10] G. Salton, A. Wong and C. S. Yang. A vector space model for automatic indexing. Communications of the ACM 18 (1975), 613–620.
[11] G. Salton. The SMART Information Retrieval System. Prentice Hall, Englewood Cliffs, NJ, 1975.
[12] C. J. van Rijsbergen. Information Retrieval. Butterworths, 1979.
[13] Steve Lawrence. Context in Web Search. IEEE Data Engineering Bulletin, Volume 23, Number 3, pp. 25–32, 2000.
[14] J. McCarthy. Notes on formalizing context. Proceedings of the 13th IJCAI, Vol. 1, pp. 555–560, 1993.
[15] L. Finkelstein, et al. Placing Search in Context: The Concept Revisited. 10th World Wide Web Conference, May 2-5, 2001, Hong Kong.
[16] K. Bharat. SearchPad: Explicit Capture of Search Context to Support Web Search. In Proceedings of the 9th International World Wide Web Conference, WWW9, Amsterdam, May 2000.
[17] Scott Deerwester, Susan T. Dumais, George W. Furnas, Thomas K. Landauer, Richard Harshman. Indexing by Latent Semantic Analysis. Journal of the American Society of Information Science, 1990.
[18] R. Richardson and A. Smeaton. Using WordNet in a Knowledge-Based Approach to Information Retrieval. In Proceedings of the BCS-IRSG Colloquium, Crewe.
[19] A. Smeaton and A. Quigley. Experiments on using semantic distances between words in image caption retrieval. In Proceedings of the 19th International Conference on Research and Development in IR, 1996.
[20] H. Uchida, M. Zhu, T. Della Senta. UNL: A Gift for a Millennium. The United Nations University, 1995.
[21] Julio Gonzalo, Felisa Verdejo, Irina Chugur, Juan Cigarran. Indexing with WordNet synsets can improve Text Retrieval. Proceedings of the COLING/ACL '98 Workshop on Usage of WordNet for NLP, Montreal, 1998.
[22] P. Bhattacharya and B. Choudhary. Text Clustering using Semantics.
[23] P. Brezillon. Context in Problem Solving: A Survey.
[24] Sven Martin, Jorg Liermann, Hermann Ney. Algorithms for Bigram and Trigram Word Clustering.
[25] Akira Ushioda. Hierarchical clustering of words and application to NLP tasks. In Proceedings of the Fourth Workshop on Very Large Corpora, 1996.
A Framework for Integrating Deep and Shallow Semantic Structures in Text Mining
Nigel Collier, Koichi Takeuchi, Ai Kawazoe, Tony Mullen, and Tuangthong Wattarujeekrit
National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan
{collier,koichi,zoeai,mullen,tuangthong}@nii.ac.jp
Abstract. Recent work in knowledge representation undertaken as part of the Semantic Web initiative has enabled a common infrastructure (Resource Description Framework (RDF) and RDF Schema) for sharing knowledge of ontologies and instances. In this paper we present a framework for combining the shallow levels of semantic description commonly used in MUC-style information extraction with the deeper semantic structures available in such ontologies. The framework is implemented within the PIA project software called Ontology Forge. Ontology Forge offers a server-based hosting environment for ontologies, a server-side information extraction system for reducing the effort of writing annotations and a many-featured ontology/annotation editor. We discuss the knowledge framework, some features of the system and summarize results from extended named entity experiments designed to capture instances in texts using support vector machine software.
1 Introduction
Recent work in knowledge representation undertaken as part of the Semantic Web initiative [1] has enabled Resource Description Framework (RDF) Schema, RDF(S) [2], to become a common metadata standard for sharing knowledge on the World Wide Web. This allows for the explicit description of concepts, properties and relations in an ontology, which can then be referenced online to determine the validity of conforming documents. The use of ontologies allows for a deep semantic description in each domain where a group of people share a common view on the structure of knowledge. However, it still does not solve the knowledge acquisition problem, i.e. how to acquire ontologies and how to instantiate them with instances of the concepts. Instantiation will be a major effort and needs support tools if the Semantic Web is to expand and fulfill its expected role. This is where we consider that information extraction (IE) has an important part to play. IE systems are now well developed for capturing low-level semantics inside texts, such as named entities and coreference (identity) expressions. IE, however, does not offer a sufficiently formal framework for specifying relations between concepts, assuming for example that named entities are instances of a disjoint set of concepts with no explicit relations. Without such formalization it is difficult to consider the implications of concept class relations when applying named entity extraction outside of simple domains. To look at this another way, by making the deep semantics explicit we offer a way to sufficiently abstract IE to make it portable between domains. Our objective in Ontology Forge is to join the two technologies so that human experts can create taxonomies and axioms (the ontologies) and, by providing a small set of annotated examples, machine learning can take over the role of instance capturing through information extraction technology.
As a motivating example we consider information access to research results contained in scientific journals and papers. The recent proliferation of scientific results available on the Web means that scientists and other experts now have far greater access than ever before to the latest experimental results. While a few of these results are collected and organized in databases, most remain in free text form. Efficient tools such as the commercial search engines have been available for some years to help in the document location task, but we still do not have effective tools to help in locating facts inside documents. Such tools require a level of understanding far beyond simple keyword spotting and for this reason have taken longer to develop. Access to information is therefore now limited by the time the expert spends reading the whole text to find the key result before it can be judged for relevance and synthesized into the expert's understanding of his/her domain of knowledge. By explicitly describing the concepts and relations inside the scientific domain and then annotating a few example texts, we expect that machine learning will enable new instances to be found in unseen documents, allowing computer software to have an intelligent understanding of the contents, thereby aiding the expert in locating the information he/she requires.
The focus of this paper is on three topics: firstly, to summarize our system, called Ontology Forge, which allows ontologies and annotations to be created cooperatively within a domain community; secondly, to present a metadata scheme for annotations that can provide linkage between an ontology and a text; and thirdly, to present results from our work on instance capturing as an extended named entity task [5] using machine learning. We motivate our discussion with examples taken from the domain of functional genomics.
2 The Ontology Forge System
The Ontology Forge design encourages collaborative participation in an ontology project by enabling all communication to take place over the Internet using a standard Web-based interface. After an initial domain group is formed, we encourage group members to divide labour according to levels of expertise. This is shown in the following list of the basic steps required to create and maintain a domain ontology.
Domain Setup: a representative for a domain (the Domain Manager) will apply to set up an ontology project that is hosted on the Ontology Forge server.
Community Setup: the Domain Manager will establish the domain group according to levels of interest and competence, i.e. Experts and Users.
Ontology Engineering: ontologies are constructed in private by Domain Experts through discussion.
Ontology Publication: when the Domain Manager decides that a version of the ontology is ready to be released for public access, he/she copies it into a public area on the server.
Annotation: when a public ontology is available, Domain Users can take this and annotate Web-based documents according to the ontology. Annotations can be uploaded to the server and stored in the private server area for sharing and discussion with other domain members.
Annotation Publication: the Domain Manager copies private annotations into the public server area.
Training: when enough annotations are available, the Domain Manager can ask the information extraction system, called PIA-Core, to learn how to annotate unseen domain texts.
Improvement Cycle: annotation and training then proceed in a cycle of improvement, so that Domain Users correct output from the information extraction system and in turn submit these new documents to become part of the pool of documents used in training. Accuracy gradually improves as the number of annotations increases, and the amount of work for annotators is gradually reduced.
The overall system architecture for Ontology Forge on the server side includes a Web server, a database management system and an application server which is primarily responsible for information extraction. On the client side we have an ontology editor and annotation tool called Open Ontology Forge (OOF!).
2.1 Annotation Meta-data
In this section we will focus on the specific characteristics of our annotation scheme. The root class in an Ontology Forge ontology is the annotation, defined in RDF Schema as a name space and held on the server at a URI. The user does not need to be explicitly aware of this, and all classes that are defined in OOF will inherit linkage and tracking properties from the parent, so that when instances are declared as annotations in a base Web document, the instance will have many of the property values entered automatically by OOF. Basically, the user is free to create any classes that help define knowledge in their domain, within the limits of RDF Schema. We now briefly describe the properties of the root annotation class:
context: Relates an Annotation to the resource to which the Annotation applies; takes an XPointer value.
ontology id: Relates an Annotation to the ontology and class to which the annotation applies.
conventional form: The conventional form of the annotation (if applicable) as described in the PIA Annotation Guidelines.
identity id: Used for creating coreference chains between annotations where the annotations have identity of reference. From a linguistic perspective this basically assumes a symmetric view of coreference, where all coreference occurrences are considered to be equal [12].
constituents: Used to capture constituency relations between annotations if required.
orphan: Takes only Boolean values corresponding to 'yes' and 'no'. After the annotation is created, if it is later detected that the annotation can no longer be linked to its correct position in doc id, then this value will be set to 'yes', indicating that the linkage (in context) needs correcting.
author: The name of the person, software or organization most responsible for creating the Annotation. It is defined by the Dublin Core [8] 'creator' element.
created: The date and time on which the Annotation was created. It is defined by the Dublin Core 'date' element.
modified: The date and time on which the Annotation was modified. It is defined by the Dublin Core 'date' element.
sure: Takes only Boolean values corresponding to 'yes I am sure' and 'no I am not sure' about the assignment of this annotation. Used primarily in post-annotation processing.
comment: A comment that the annotator wishes to add to this annotation, possibly used to explain an unsure annotation. It is defined by the Dublin Core 'description' element.
Due to the lack of defined data types in RDF we make use of several data types in the Dublin Core name space for defining annotation property values. In OOF we also allow users to use a rich set of data type values using both the Dublin Core and XML Schema name spaces for integers, floats, etc. Users can also define their own enumerated types using OOF. A selected view of an annotation can be seen in Figure 1, showing linkage into a Web document and part of the ontology hierarchy. One point to note is the implementation of coreference relations between JAK1 and this protein. This helps to maximize information about this object in the text.
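A sketch of what one annotation instance might look like when built with the rdflib library; the namespace URIs, identifiers and values are invented purely for illustration and do not reproduce the actual PIA schema.

```python
from rdflib import Graph, Namespace, Literal, URIRef
from rdflib.namespace import DC, XSD

OF = Namespace("http://example.org/ontology-forge#")        # hypothetical annotation namespace
GEN = Namespace("http://example.org/genomics-ontology#")    # hypothetical domain ontology

g = Graph()
ann = URIRef("http://example.org/annotations/ann-001")

g.add((ann, OF.context, Literal("paper42.html#xpointer(/html/body/p[3]/span[1])")))
g.add((ann, OF.ontology_id, GEN.Protein))
g.add((ann, OF.conventional_form, Literal("JAK1")))
g.add((ann, OF.identity_id, Literal("coref-chain-7")))       # ties 'JAK1' and 'this protein'
g.add((ann, OF.orphan, Literal(False)))
g.add((ann, OF.sure, Literal(True)))
g.add((ann, DC.creator, Literal("annotator-01")))            # 'author' via Dublin Core creator
g.add((ann, DC.date, Literal("2003-05-01", datatype=XSD.date)))

print(g.serialize(format="turtle"))
```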
3 Named Entities as Instances
Named entity (NE) extraction is now firmly established as a core technology for understanding low level semantics of texts. NE was formalized in MUC6 [7] as the lowest level task and since then several methodologies have been widely explored: heuristics-based, using rules written by human experts after inspecting examples; supervised learning using labelled training examples; and non-supervised learning methods; as well as combinations of supervised and nonsupervised methods.
Fig. 1. Overview of an annotation
NE's main role has been to identify expressions such as the names of people, places and organizations, as well as date and time expressions. Such expressions are hard to analyze using traditional natural language processing (NLP) because they belong to the open class of expressions, i.e. there is an infinite variety and new expressions are constantly being invented. Applying NE to scientific and technical domains requires us to consider two important extensions to the technology. The first is to consider how to capture types, i.e. instances of conceptual classes, as well as individuals. The second is to place those classes in an explicitly defined ontology, i.e. to clarify and define the semantic relations between the classes. To distinguish between traditional NE and extended NE, we refer to the latter as NE+. Examples of NE+ classes include a person's name, a protein name, a chemical formula or a computer product code. All of these may be valid candidates for tagging, depending on whether they are contained in the ontology. Considering types versus individuals, there are several issues that may make NE+ more challenging than NE. The most important is the number of variants of NE+ expressions due to graphical, morphological, shallow syntactic and discourse variations, for example the use of head sharing combined with embedded abbreviations in unliganded (apo)- and liganded (holo)-LBD. Such expressions will require syntactic analysis beyond simple noun phrase chunking if they are to be successfully captured. NE+ expressions may also require richer
contextual evidence than is needed for regular NEs - for example knowledge of the head noun or the predicate-argument structure.
4 Instance Capturing
In order to evaluate the ability of machine learning to capture NE+ expressions we investigated a model based on support vector machines (SVMs). SVMs, like other inductive-learning approaches, take as input a set of training examples (given as binary-valued feature vectors) and find a classification function that maps them to a class. There are several points about SVM models that are worth summarizing here. The first is that SVMs are known to handle large feature sets robustly and to develop models that maximize their generalizability by sometimes deliberately misclassifying some of the training data so that the margin between the other training points is maximized [6]. Secondly, although training is generally slow, the resulting model is usually small and runs quickly, as only the support vectors need to be retained. Formally, we can consider the purpose of the SVM to be to estimate a classification function f : χ → {±1} using training examples from χ × {±1} so that the error on unseen examples is minimized. The classification function returns +1 if the test data is a member of the class, and −1 if it is not. Although SVMs learn what are essentially linear decision functions, the effectiveness of the strategy is ensured by mapping the input patterns χ to a feature space Γ using a nonlinear mapping function Φ : χ → Γ. Since the algorithm is well described in the literature cited earlier, we will from now on focus our description on the features we used for training. Due to the nature of the SVM as a binary classifier, it is necessary in a multi-class task to consider the strategy for combining several classifiers. In our experiments with Tiny SVM the strategy used was one-against-one rather than one-against-the-rest: for example, if we have four classes A, B, C and D, then Tiny SVM builds pairwise classifiers for (1) A against B, (2) A against C, (3) A against D, (4) B against C, (5) B against D and (6) C against D. The winning class is the one which obtains the most votes from the pairwise classifiers. The kernel function k : χ × χ → R mentioned above basically defines the feature space Γ by computing the inner product of pairs of data points. For x ∈ χ we explored the simple polynomial function k(xi, xj) = (xi · xj + 1)^d. We implemented our method using the Tiny SVM package from NAIST (available from http://cl.aist-nara.ac.jp/~taku-ku/software/TinySVM/), which is an implementation of Vladimir Vapnik's SVM combined with an optimization algorithm [10]. The multi-class model is built up by combining binary classifiers and then applying majority voting.
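As an illustration only, the fragment below sketches this setup with scikit-learn's SVC rather than the Tiny SVM package used in the paper; the toy feature vectors and class labels are invented for illustration. SVC applies the same polynomial kernel k(xi, xj) = (xi · xj + 1)^d and combines binary classifiers one-against-one with majority voting.

```python
import numpy as np
from sklearn.svm import SVC

# Toy binary feature vectors (e.g. surface-word / orthographic features) with NE+ class labels.
X = np.array([[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0], [0, 0, 1, 1]])
y = np.array(["PROTEIN", "DNA", "PROTEIN", "SOURCE.ct"])

# Polynomial kernel (gamma * <xi, xj> + coef0)^degree = (xi . xj + 1)^2 with these settings;
# multi-class decisions are made by one-against-one voting internally.
clf = SVC(kernel="poly", degree=2, gamma=1.0, coef0=1.0)
clf.fit(X, y)
print(clf.predict(np.array([[1, 0, 1, 1]])))
```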
4.1 Feature Selection
In our implementation each training pattern is given as a vector which represents certain lexical features. All models use a surface word, an orthographic feature [4]
and previous class assignments, but our experiments with part-of-speech (POS) features [3] showed that POS features actually inhibited performance on the molecular biology data set which we present below. This is probably because the POS tagger was trained on news texts. Therefore POS features are used only for the MUC-6 news data set, where we show a comparison with and without these features. The vector form of the window includes information about the position of each word. In the experiments we report below we use feature vectors consisting of differing amounts of 'context', obtained by varying the window around the focus word which is to be classified into one of the NE+ classes. The full window of context considered in these experiments is ±3 about the focus word.
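As an illustration only, the sketch below builds a windowed feature representation of the kind described above. The orthographic categories shown are assumptions (the actual feature set follows [4]) and the previous-class features are omitted for brevity.

```python
def window_features(tokens, i, size=2):
    """Position-tagged surface and orthographic features for the focus token at index i,
    using a -size..+size context window."""
    def ortho(w):
        # Simplified orthographic classes; the real inventory is richer.
        if w.isupper():
            return "ALLCAPS"
        if w[0].isupper():
            return "InitCap"
        if any(ch.isdigit() for ch in w):
            return "HasDigit"
        return "lower"

    feats = {}
    for offset in range(-size, size + 1):
        j = i + offset
        if 0 <= j < len(tokens):
            feats[f"word[{offset}]"] = tokens[j].lower()
            feats[f"ortho[{offset}]"] = ortho(tokens[j])
    return feats

# Example: features for "Tat" in a tokenized sentence.
print(window_features(["The", "Tat", "protein", "of", "HIV-1"], 1))
```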
4.2 Data Sets
The first data set we used in our experiments is representative of the type of data that we expect to be produced by the Ontology Forge system. Bio1 consists of 100 MEDLINE abstracts (23586 words) in the domain of molecular biology, annotated for the names of genes and gene products [14]. The taxonomic structure of the classes is almost flat, except for the SOURCE class, which denotes a variety of locations where genetic activity can occur. This provides a good first stage for analysis, as it allows us to explore a minimum of relational structures and at the same time look at named entities in a technical domain. A breakdown of examples by class is shown in Table 1. An example from one MEDLINE abstract in the Bio1 data set is shown in Figure 2. As a comparison with the NE task we used a second data set (MUC-6), the collection of 60 executive succession texts (24617 words) from MUC-6. Details are shown in Table 2.
The [Tat protein]protein of [human immunodeficiency virus type 1]source.vi ([HIV1]source.vi ) is essential for productive infection and is a potential target for antiviral therapy. [Tat]protein , a potent activator of [HIV-1]source.vi gene expression, serves to greatly increase the rate of transcription directed by the viral promoter. This induction, which seems to be an important component in the progression of acquired immune deficiency syndrome (AIDS), may be due to increased transcriptional initiation, increased transcriptional elongation, or a combination of these processes. Much attention has been focused on the interaction of [Tat]protein with a specific RNA target termed [TAR]RNA ([transactivation responsive]RNA ) which is present in the leader sequence of all [HIV-1]source.vi mRNAs. This interaction is believed to be an important component of the mechanism of transactivation. In this report we demonstrate that in certain [CNS-derived cells]source.ct [Tat]protein is capable of activating [HIV-1]source.vi through a [TAR]RNA -independent pathway.
Fig. 2. Example MEDLINE abstract marked up for NE+ expressions
Table 1. NE+ classes used in Bio1 with the number of word tokens for each class

Class       #     Example                       Description
PROTEIN     2125  JAK kinase                    proteins, protein groups, families, complexes and substructures
DNA         358   IL-2 promoter                 DNAs, DNA groups, regions and genes
RNA         30    TAR                           RNAs, RNA groups, regions and genes
SOURCE.cl   93    leukemic T cell line Kit225   cell line
SOURCE.ct   417   human T lymphocytes           cell type
SOURCE.mo   21    Schizosaccharomyces pombe     mono-organism
SOURCE.mu   64    mice                          multi-organism
SOURCE.vi   90    HIV-1                         viruses
SOURCE.sl   77    membrane                      sublocation
SOURCE.ti   37    central nervous system        tissue
5 Results
Results are given as the commonly used Fβ=1 score [15], defined as Fβ=1 = (2PR)/(P + R), where P denotes precision and R recall. P is the ratio of the number of correctly found NE chunks to the number of found NE chunks, and R is the ratio of the number of correctly found NE chunks to the number of true NE chunks. Table 3 shows the overall F-score for the two collections, calculated using 10-fold cross validation on the total test collection. The table highlights the role that the context window plays in achieving performance. From our results it is clear that -2+2 offers the best overall context window for the feature sets that we explored, in both the Bio1 and MUC-6 collections. One result that is clear from Table 3 is the effect of POS features: for Bio1, POS features had a negative effect, whereas for MUC-6 they are overall very beneficial. We attribute this to the POS tagger having been trained on news-related texts; the MUC-6 texts share a strong vocabulary overlap with this training data, and consequently POS tagging is very accurate on them.
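For reference, a minimal sketch of the chunk-level Fβ=1 computation described above; the counts used in the example call are invented for illustration only.

```python
def f1_chunk_score(found, correct_found, true_chunks):
    """F(beta=1) from chunk counts: found = chunks proposed by the system,
    correct_found = proposed chunks that are correct, true_chunks = chunks in the gold annotation."""
    precision = correct_found / found
    recall = correct_found / true_chunks
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts, for illustration only.
print(round(f1_chunk_score(found=950, correct_found=700, true_chunks=1000), 4))
```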
Table 2. Markup classes used in MUC-6 with the number of word tokens for each class label

Class         #
DATE          542
LOCATION      390
ORGANIZATION  1783
MONEY         423
PERCENT       108
PERSON        838
TIME          3
Table 3. Overall F-scores for each of the learning methods on the two test sets using 10-fold cross validation on all data. † Results for models using surface word and orthographic features but no part of speech features; ‡ Results for models using surface word, orthographic and part of speech features

            Window
Data set    -3+3    -3+2    -2+2    -1+1    -1+0
Bio1†       71.78   71.69   72.12   72.13   65.65
MUC-6†      73.21   73.04   74.10   72.96   65.94
MUC-6‡      74.66   75.13   75.66   74.92   68.83
On the other hand, the vocabulary in the Bio1 texts is far removed from anything the POS tagger was trained on, and tagging accuracy drops. In our analysis of the results we identified several types of error. The first, and perhaps most serious, was caused by local syntactic ambiguities such as head sharing in 39-kD SH2, SH3 domain, which should have been classed as a single PROTEIN but which the SVM split into two PROTEIN expressions, SH2 and SH3 domain. In particular, the ambiguous use of the hyphen, e.g. 14E1 single-chain (sc) Fv antibody, and of parentheses, e.g. scFv (14E1), seemed to cause the SVM some difficulty. It is likely that the limited feature information we gave to the SVM was the cause of this, and that it can be improved on using grammatical features such as the head noun or main verb. A second, minor type of error seemed to be the result of inconsistencies in the annotation scheme for Bio1.
6 Related Work
There were several starting points for our work. Perhaps the most influential was the Annotea project [11], which provided an annotation hosting service based on RDF Schema. Annotations in Annotea can be considered a kind of electronic 'Post-It' for Web pages, i.e. a remark that relates to a URI. Annotea has several features which our design has adopted, including: (a) the use of XLink/XPointer linkage from annotations to the place in the document where a text occurs; (b) the use of a generic RDF schema as the knowledge model, enhanced with types from Dublin Core. However, there are several major differences between our work and Annotea, including our focus on aiding annotation through the use of information extraction, various levels of support for cooperative development of annotations and ontologies, and the use of domain groups with explicit roles and access rights for members. Also influential in our development have been Protégé-2000 [13] and other ontology editors such as Ont-O-Mat [9], which provide many of the same basic features as OOF. While Protégé-2000 provides a rich environment for creating ontologies, it provides limited support for large-scale instance capturing and
collaborative development. Although Protégé-2000 can be extended to provide server-based storage, this is not an inherent part of its design. Ont-O-Mat and OntoEdit from the University of Karlsruhe provide similar functionality to OOF but do not link instances back to the text, nor do they have OOF's built-in design focus on collaborative development and information extraction.
7 Conclusion and Future Plans
In this paper we have presented an annotation scheme that links concepts in ontologies with instances in texts using a system that conforms to RDF(S). The software that implements this scheme, called Ontology Forge, is now being completed and is undergoing user trials. There are many benefits of linking IE to deep semantic descriptions, including abstraction of knowledge and portability for IE, and automated instance capturing for the Semantic Web. While we have emphasized the use of RDF(S) in Ontology Forge, we recognize that it has inherent weaknesses due to a lack of support for reasoning and inference as well as constraints on consistency checking. The current knowledge model used in OOF offers a subset of RDF(S), e.g. we forbid multi-class membership of instances, and we are now in the process of reviewing this so that OOF will be capable of export to other RDF(S)-based languages with stronger type inferencing, such as DAML+OIL, which we aim to support in the future. We have focused in this paper on linking named entities to deep structures formalized in ontologies. A key point for our future investigation is how we can exploit the knowledge provided by concept relations between entities and by properties of entities, which will be of most value to end users and downstream applications.
References
[1] T. Berners-Lee, M. Fischetti, and M. Dertouzos. Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web. Harper, San Francisco, September 1999. ISBN: 0062515861.
[2] D. Brickley and R. V. Guha. Resource Description Framework (RDF) schema specification 1.0, W3C candidate recommendation. http://www.w3.org/TR/2000/CR-rdf-schema-20000327, 27th March 2000.
[3] E. Brill. A simple rule-based part of speech tagger. In Third Conference on Applied Natural Language Processing, Association for Computational Linguistics, Trento, Italy, pages 152–155, 31st March – 3rd April 1992.
[4] N. Collier, C. Nobata, and J. Tsujii. Extracting the names of genes and gene products with a hidden Markov model. In Proceedings of the 18th International Conference on Computational Linguistics (COLING'2000), Saarbrucken, Germany, July 31st – August 4th 2000.
[5] N. Collier, K. Takeuchi, C. Nobata, J. Fukumoto, and N. Ogata. Progress on multi-lingual named entity annotation guidelines using RDF(S). In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC'2002), Las Palmas, Spain, pages 2074–2081, May 27th – June 2nd 2002.
[6] C. Cortes and V. Vapnik. Support-vector networks. Machine Learning, 20:273–297, November 1995.
[7] DARPA. Information Extraction Task Definition, Columbia, MD, USA, November 1995. Morgan Kaufmann.
[8] Dublin core metadata element set, version 1.1: Reference description. Technical Report, Dublin Core Metadata Initiative, http://purl.org/DC/documents/rec-dces-19990702.htm, 1999.
[9] S. Handschuh, S. Staab, and A. Maedche. CREAM - creating relational metadata with a component-based, ontology-driven annotation framework. In First International Conference on Knowledge Capture (K-CAP 2001), Victoria, B.C., Canada, 21–23 October 2001.
[10] T. Joachims. Making large-scale SVM learning practical. In B. Scholkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning. MIT Press, 1999.
[11] J. Kahan, M. R. Koivunen, E. Prud'Hommeaux, and R. R. Swick. Annotea: an open RDF infrastructure for shared web annotations. In the Tenth International World Wide Web Conference (WWW10), pages 623–630, May 1–5 2000.
[12] A. Kawazoe and N. Collier. An ontologically-motivated scheme for coreference. In Proceedings of the International Workshop on Semantic Web Foundations and Application Technologies (SWFAT), Nara, Japan, March 12th 2003.
[13] N. F. Noy, M. Sintek, S. Decker, M. Crubezy, R. W. Fergerson, and M. A. Musen. Creating semantic web contents with Protégé-2000. IEEE Intelligent Systems, 16(2):60–71, 2001.
[14] Y. Tateishi, T. Ohta, N. Collier, C. Nobata, K. Ibushi, and J. Tsujii. Building an annotated corpus in the molecular-biology domain. In COLING'2000 Workshop on Semantic Annotation and Intelligent Content, Luxemburg, 5th–6th August 2000.
[15] C. J. van Rijsbergen. Information Retrieval. Butterworths, London, 1979.
Fuzzy Methods for Knowledge Discovery from Multilingual Text

Rowena Chau and Chung-Hsing Yeh
School of Business Systems, Faculty of Information Technology
Monash University, Clayton, Victoria 3800, Australia
{Rowena.Chau,ChungHsing.Yeh}@infotech.monash.edu.au
Abstract. Enabling navigation among a network of inter-related concepts associating conceptually relevant multilingual documents constitutes the fundamental support for global knowledge discovery. This requirement of organising multilingual documents by concepts makes the goal of supporting global knowledge discovery a concept-based multilingual text categorisation task. In this paper, intelligent methods for enabling concept-based multilingual text categorisation using fuzzy techniques are proposed. First, a universal concept space, encapsulating the semantic knowledge of the relationship between all multilingual terms and concepts, is generated using a fuzzy multilingual term clustering algorithm based on fuzzy c-means. Second, a fuzzy multilingual text classifier that applies the multilingual semantic knowledge for concept-based multilingual text categorisation is developed using the fuzzy k-nearest neighbour classification technique. By using the multilingual text categorisation result as a browseable document directory, concept navigation among a multilingual document collection is facilitated.
1 Introduction
The rapid expansion of the World Wide Web throughout the globe means electronically accessible information is now available in an ever-increasing number of languages. In a multilingual environment, one important motive of information seeking is global knowledge discovery. Global knowledge discovery is significant when a user wishes to gain an overview of a certain subject area covered by a multilingual document collection before exploring it. In such a situation, concept navigation is required. The basic idea of concept navigation is to provide the user with a browseable concept space that gives a fair indication of the conceptual distribution of all multilingual documents over the domain. This requirement of organising multilingual documents by concepts makes the goal of supporting global knowledge discovery a concept-based multilingual text categorisation task.
Text categorisation is the classification problem of deciding whether a document belongs to a set of pre-specified classes of documents. In a monolingual environment, text categorisation is carried out within the framework of the vector space model [7]. In the vector space model, documents are represented as feature vectors in a multidimensional space defined by a set of terms occurring in the document collection. To categorise documents, this set of terms becomes the set of features of the classification problem. Documents represented by similar feature vectors belong to the same class. However, multilingual text presents a unique challenge to text categorisation, due to the feature incompatibility problem caused by the vocabulary mismatch across languages. Different languages use different sets of terms to express a set of universal concepts. Hence, documents in different languages are represented by different sets of features in separate feature spaces. This language-specific representation makes multilingual texts incomparable. Text categorisation methods that rely on shared terms (features) therefore will not work for multilingual text categorisation. To overcome this problem, a universal feature space in which all multilingual text can be represented in a language-independent way is necessary. Towards this end, a corpus-based strategy aimed at unifying all existing feature spaces by discovering a new set of language-independent semantic features is proposed. The basic idea is: given a new universal feature space defined by a set of language-independent concepts, multilingual text can then be uniformly characterised. Consequently, multilingual text categorisation can also take place in a language-independent way. In what follows, the framework of the corpus-based approach to discovering multilingual semantic knowledge for enabling concept-based multilingual text categorisation is introduced in Section 2. Then, a fuzzy multilingual term clustering method for semantic knowledge discovery using fuzzy c-means is presented in Section 3. This is followed by a discussion of the application of the multilingual semantic knowledge to developing a fuzzy multilingual text classifier using the fuzzy k-nearest neighbour algorithm in Section 4. Finally, concluding remarks are given in Section 5.
2 The Corpus-Based Approach
The framework of the corpus-based approach to discovering multilingual semantic knowledge for enabling concept-based multilingual text categorization is depicted in Figure 1. First, using the co-occurrence statistics of a set of multilingual terms extracted from a parallel corpus, a fuzzy clustering algorithm known as fuzzy c-means [1] is applied to group semantically related multilingual terms into concepts. By analyzing the co-occurrence statistics, multilingual terms are sorted into multilingual term clusters (concepts) such that terms belonging to any one cluster (concept) are as similar as possible while terms of different clusters (concepts) are as dissimilar as possible. As such, a concept space acting as a universal feature space for all languages is formed. Second, referring to the concept space as a set of pre-defined concepts, each multilingual document, characterized by a set of index terms, is then mapped to the concepts to which it belongs according to its conceptual content. For this purpose, a fuzzy multilingual text classifier is developed based on a fuzzy classification method known as the fuzzy k-nearest neighbour algorithm [4].
Fig. 1. A corpus-based approach to enabling concept-based multilingual text categorization (diagram: a parallel corpus in two languages A and B feeds a multilingual term space of terms in A and B; multilingual term clustering with fuzzy c-means produces a concept space; multilingual text categorization with the fuzzy k-nearest neighbour algorithm then maps the multilingual document space onto a document directory associating each concept with ranked documents)
Finally, the membership values resulting from fuzzy classification are used to produce a ranked list of documents. By associating each concept with a list of multilingual documents ranked in descending order of conceptual relevance, the resulting document classification provides a contextual overview of the document collection. Thus, it can be used as a browseable document directory in user-machine interaction. By enabling navigation among inter-related concepts associating conceptually relevant multilingual documents, global knowledge discovery is facilitated.
3 Discovering Multilingual Semantic Knowledge
A fuzzy clustering algorithm known as fuzzy c-means is applied to discover the multilingual semantic knowledge by generating a universal concept space from a set of multilingual terms. This concept space is formed by grouping semantically related multilingual terms into concepts, thus revealing the multilingual semantic knowledge of the relationship between all multilingual terms and concepts. As concepts tend to overlap in meaning, a crisp clustering algorithm like k-means, which generates a partition in which each term is assigned to exactly one cluster, is inadequate for representing the real data structure. In this respect, fuzzy clustering methods such as fuzzy c-means [1], which allow objects (terms) to be assigned to more than one cluster with different membership values, are preferred. With the application of fuzzy c-means, the resulting fuzzy multilingual term clusters (concepts), which are overlapping, provide a more realistic representation of the intrinsic semantic relationships among the multilingual terms. To realize this idea, a set of multilingual terms, which are the objects to be clustered, is first extracted from a parallel corpus of N parallel documents. A parallel corpus is a collection of documents containing identical text written in multiple languages. Statistical analysis of parallel corpora has been suggested as a potential means of providing an effective lexical information basis for multilingual text management [2].
This has already been successfully applied in research for building translation models for multilingual text retrieval [6]. Based on the hypothesis that semantically related multilingual terms representing similar concepts tend to co-occur with similar inter- and intra-document frequencies within a parallel corpus, fuzzy c-means is applied to sort a set of multilingual terms into clusters (concepts) such that terms belonging to any one cluster (concept) are as similar as possible while terms of different clusters (concepts) are as dissimilar as possible in terms of the concepts they represent. To apply FCM for multilingual term clustering, each term is represented as an input vector of N features, where each of the N parallel documents is regarded as an input feature and each feature value is the frequency of that term in the nth parallel document. The Fuzzy Multilingual Term Clustering Algorithm is as follows:

1. Randomly initialize the membership values $\mu_{ik}$ of the $K$ multilingual terms $x_k$ to each of the $c$ concepts (clusters), for $i = 1,\ldots,c$ and $k = 1,\ldots,K$, such that:

   $$\sum_{i=1}^{c} \mu_{ik} = 1, \quad \forall k = 1,\ldots,K \qquad (1)$$

   and

   $$\mu_{ik} \in [0,1], \quad \forall i = 1,\ldots,c;\ \forall k = 1,\ldots,K \qquad (2)$$

2. Calculate the concept prototypes (cluster centers) $v_i$ using these membership values $\mu_{ik}$:

   $$v_i = \frac{\sum_{k=1}^{K} (\mu_{ik})^m \, x_k}{\sum_{k=1}^{K} (\mu_{ik})^m}, \quad \forall i = 1,\ldots,c \qquad (3)$$

3. Calculate the new membership values $\mu_{ik}^{new}$ using these cluster centers $v_i$:

   $$\mu_{ik}^{new} = \frac{1}{\sum_{j=1}^{c} \left( \frac{\| v_i - x_k \|}{\| v_j - x_k \|} \right)^{\frac{2}{m-1}}}, \quad \forall i = 1,\ldots,c;\ \forall k = 1,\ldots,K \qquad (4)$$

4. If $\| \mu^{new} - \mu \| > \varepsilon$, let $\mu = \mu^{new}$ and go to step 2. Otherwise, stop.
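As an illustration only, the following is a minimal NumPy sketch of the update loop defined by equations (1)-(4); it assumes the multilingual terms are given as rows of a term-by-parallel-document frequency matrix, and all parameter values are illustrative.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, eps=1e-4, max_iter=100, seed=0):
    """Cluster the rows of X (terms x parallel-document frequencies) into c fuzzy clusters."""
    rng = np.random.default_rng(seed)
    K = X.shape[0]
    # Step 1: random memberships; each column (term) sums to 1 over the c clusters.
    U = rng.random((c, K))
    U /= U.sum(axis=0, keepdims=True)
    for _ in range(max_iter):
        # Step 2: concept prototypes as membership-weighted means of the term vectors.
        W = U ** m
        V = (W @ X) / W.sum(axis=1, keepdims=True)
        # Step 3: new memberships from relative distances to the prototypes.
        dist = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2) + 1e-12
        U_new = dist ** (-2.0 / (m - 1.0))
        U_new /= U_new.sum(axis=0, keepdims=True)
        # Step 4: stop when the memberships no longer change appreciably.
        if np.abs(U_new - U).max() < eps:
            return U_new, V
        U = U_new
    return U, V

# Toy example: 6 terms described by their frequencies in 4 parallel documents, 2 concepts.
U, V = fuzzy_c_means(np.array([[3, 0, 2, 0], [2, 0, 3, 0], [0, 4, 0, 1],
                               [0, 3, 0, 2], [1, 0, 1, 0], [0, 1, 0, 1]], dtype=float), c=2)
print(U.round(2))
```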
As a result of clustering, every multilingual term is now assigned to various clusters (concepts) with various membership values. As such, a universal concept space, which is a fuzzy partition of the multilingual term space, revealing the multilingual
semantic knowledge of the relationship between all multilingual terms and concepts, is now available for multilingual text categorization.
4 Enabling Multilingual Text Categorization
A multilingual text categorization system that facilitates concept-based document browsing must be capable of mapping every document in every language to its relevant concepts. Documents in different languages describe similar concepts using significantly diverse terms. As such, semantic knowledge of the relationship between multilingual terms and concepts is essential for effective concept-based multilingual text categorization. In this section, a fuzzy multilingual text classifier, which applies the multilingual semantic knowledge discovered in the previous section to categorizing multilingual documents by concepts, is developed based on the fuzzy k-nearest neighbor classification technique. Text categorization, which has been widely studied in the field of information retrieval, is based on the cluster hypothesis [8], which states that documents having similar contents are also relevant to the same concept. To accomplish text categorization, the crisp k-nearest neighbor (k-NN) algorithm [3] has been widely applied [5][9]. For deciding whether an unclassified document d should be classified under category c, the k-nearest neighbor algorithm looks at whether the k pre-classified sample documents most similar to d have also been classified under c. One problem encountered in using the k-NN algorithm for text categorization is that, when concepts overlap, a pre-classified document that actually falls into both concepts with different strengths of membership is not given different weights to reflect its unequal importance in deciding the class membership of a new document in the two concepts. Another problem is that the result of document classification using the k-NN algorithm is binary: a document is classified as either belonging or not belonging to a concept, and once a document is assigned to a concept there is no indication of its strength of membership in that concept. This is too arbitrary, since documents usually relate to different concepts with different relevance weightings. To overcome these problems, the fuzzy k-nearest neighbor algorithm, which gives a class membership degree to a new object instead of assigning it to a specific class, is more appropriate. In the fuzzy k-NN algorithm, the criteria for the assignment of a membership degree to a new object are the closeness of the new object to its nearest neighbors and the strength of membership of these neighbors in the corresponding classes. The advantages lie both in the avoidance of an arbitrary assignment and in the support of a degree of relevance in the resulting classification. Fuzzy multilingual text categorization can be seen as the task of determining a membership value µi(dj) ∈ [0,1] for document dj with respect to concept ci, for each entry of the C×D decision matrix, where C = {c1,…,cm} is a set of concepts and D = {d1,…,dn} is a set of multilingual documents to be classified. However, before the fuzzy k-nearest neighbor decision rule can be applied, decisions regarding the set of pre-classified documents and the value of the parameter k must be made. For many operation-oriented text categorization tasks such as document routing and information filtering, a set of pre-classified documents determined by the user or
the operation is always necessary, so that the text classifier can learn from these examples to automate the classification task. However, for multilingual text categorization, a set of pre-classified documents may not be necessary. This is because classification of multilingual text by concepts is a concept-oriented decision: it is made on the basis of a document's conceptual relevance to a concept, not on how a similar document was previously classified during a sample operation. As long as the conceptual contexts of both concepts and documents are well represented, a decision on the conceptual classification of a multilingual document can be made. In fact, given the result of the fuzzy multilingual term clustering in the previous stage, the concept memberships of all multilingual terms are already known. Interpreting each term as a document containing a single term, a virtual set of pre-classified multilingual documents is readily available. Given the concept membership of every multilingual term, the class membership values of every single-term document in the corresponding concepts are also known. For fuzzy multilingual text categorization, the conceptual specifications provided by the result of fuzzy multilingual term clustering are considered reasonably sufficient and relevant for supporting the decision. Classifying a multilingual document using the fuzzy k-nearest neighbor algorithm also involves determining a threshold k, indicating how many neighboring documents have to be considered for computing µi(dj). In our multilingual document classification problem, the nearest neighbors to an unclassified document with k index terms will be the k single-term virtual documents, each of which contains one of the unclassified document's k index terms. This is based on the assumption that a single-term document should contain at least one index term of another document to be considered related or conceptually close. As a result, by applying the fuzzy k-nearest neighbor algorithm for the development of our fuzzy multilingual text classifier, the classification decision for an unclassified document with k index terms is implemented as a function of its distance from its k single-term neighboring documents (each containing one of the k index terms) and the membership degrees of these k neighboring documents in the corresponding concepts. The Fuzzy Multilingual Text Classifier is as follows:
1. Determine the k neighboring documents for document dj.
2. Compute µi(dj) using:

   $$\mu_i(d_j) = \frac{\sum_{s=1}^{k} \mu_i(d_s)\,\dfrac{1}{\| d_j - d_s \|^{2/(m-1)}}}{\sum_{s=1}^{k} \dfrac{1}{\| d_j - d_s \|^{2/(m-1)}}}, \quad \forall i = 1,\ldots,m \qquad (5)$$

where µi(ds) is the membership degree of the sth nearest neighboring sample document ds in concept ci, and m is the weight determining each neighbor's contribution to µi(dj). When m is 2, the contribution of each neighboring document is weighted by the reciprocal of its distance from the document being classified.
As m increases, the neighbors are more evenly weighted, and their relative distances from the document being classified have less effect. As m approaches 1, the closer neighbors are weighted far more heavily than those farther away, which has the effect of reducing the number of documents that contribute to the membership value of the document being classified. Usually, m = 2 is chosen. The results of this computation are used to produce a ranked list of the documents classified to a particular concept, with the most relevant one appearing at the top. When every concept in the concept space is associated with a ranked list of relevant documents, multilingual text categorization is complete. Using the concept space as a browseable document directory for user-machine interaction in multilingual information seeking, the user can now explore the whole multilingual document collection by referring to any concept of interest. As a result, global knowledge discovery, which aims at scanning conceptually correlated documents in multiple languages in order to gain a better understanding of a certain area, is facilitated.
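For illustration, a minimal NumPy sketch of the decision rule in equation (5); the array layout and the m = 2 default are assumptions, and the neighbors' concept memberships would come from the single-term virtual documents described above.

```python
import numpy as np

def fuzzy_knn_membership(d_j, neighbors, neighbor_memberships, m=2.0):
    """Equation (5): membership of document d_j in each concept.

    d_j                  -- feature vector of the document being classified
    neighbors            -- (k, n_features) array holding the k neighboring sample documents d_s
    neighbor_memberships -- (k, n_concepts) array holding mu_i(d_s) for each neighbor and concept
    """
    # Inverse-distance weights 1 / ||d_j - d_s||^(2/(m-1)); epsilon guards against zero distance.
    dist = np.linalg.norm(neighbors - d_j, axis=1) + 1e-12
    w = dist ** (-2.0 / (m - 1.0))
    # Weighted average of the neighbors' concept memberships, one value per concept.
    return (w @ neighbor_memberships) / w.sum()
```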
5 Conclusion
In this paper, a corpus-based approach to multilingual semantic knowledge discovery for enabling concept-based multilingual text categorization is proposed. The key to its effectiveness is the discovery of the multilingual semantic knowledge with the formation of a universal concept space that overcomes the feature incompatibility problem by facilitating representation of documents in all languages in a common semantic framework. By enabling navigation among sets of conceptually related multilingual documents, global knowledge discovery is facilitated. This concept-based multilingual information browsing is particularly important to users who need to stay competent by keeping track of the global knowledge development of a certain subject domain regardless of language.
References
[1] Bezdek, J. C.: Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum Press, New York (1981)
[2] Croft, W. B., Broglio, J. and Fujii, H.: Applications of Multilingual Text Retrieval. In: Proceedings of the 29th Annual Hawaii International Conference on System Sciences, 5 (1996) 98-107
[3] Duda, R. O. and Hart, P. E.: Pattern Classification and Scene Analysis. Wiley, New York (1973)
[4] Keller, J. M., Gray, M. R. and Givens, J. A.: A Fuzzy k-Nearest Neighbor Algorithm. IEEE Transactions on Systems, Man and Cybernetics SMC-15(4) (1985) 580-585
[5] Lam, W. and Ho, C. Y.: Using a Generalized Instance Set for Automatic Text Categorization. In: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York (1998) 81-89
[6] Littman, M. L., Dumais, S. T. and Landauer, T. K.: Automatic Cross-Language Information Retrieval Using Latent Semantic Indexing. In: Grefenstette, G. (ed.): Cross-Language Information Retrieval. Kluwer Academic Publishers, Boston (1998)
[7] Salton, G.: Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, Reading, MA (1989)
[8] Van Rijsbergen, C. J.: Information Retrieval. Butterworths, London (1979)
[9] Yang, Y.: Expert Network: Effective and Efficient Learning from Human Decisions in Text Categorization and Retrieval. In: Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, New York (1994) 13-22
Automatic Extraction of Keywords from Abstracts

Yaakov HaCohen-Kerner
Department of Computer Science, Jerusalem College of Technology (Machon Lev)
21 Havaad Haleumi St., P.O.B. 16031, 91160 Jerusalem, Israel
[email protected]
Abstract. The rapid increase of online information is hard to handle. Summaries such as abstracts help to reduce this problem. Keywords, which can be regarded as very short summaries, may help even more. Filtering documents by using keywords may save precious time while searching. However, most documents do not include keywords. In this paper we present a model that extracts keywords from abstracts and titles. This model has been implemented in a prototype system. We have tested our model on a set of abstracts of academic papers containing keywords composed by their authors. Results show that keywords extracted from abstracts and titles may be a primary tool for researchers.
1 Introduction
With the explosive growth of online information, more and more people depend on summaries. People do not have time to read everything; they prefer to read summaries such as abstracts rather than the entire text before they decide whether they would like to read the whole text or not. Keywords, regarded as very short summaries, can be of great help. Filtering documents by using keywords can save precious time while searching. In this study, we present an implemented model that is capable of extracting keywords from abstracts and titles. We have tested our model on a set of 80 abstracts of academic papers containing keywords composed by the authors. We compare the proposed keywords to the authors' keywords and analyze the results. This paper is arranged as follows. Section 2 gives background concerning text summarization and extraction of keywords. Section 3 describes the proposed model. Section 4 presents the experiments we have carried out. Section 5 discusses the results. Section 6 summarizes the research and proposes future directions. In the Appendix we present statistics concerning the documents that were tested.
2 Background
2.1 Text Summarization
“Text summarization is the process of distilling the most important information from a source (or sources) to produce an abridged version for a particular user (or users) and task (or tasks)” [3]. Basic and classical articles in text summarization appear in Advances in Automatic Text Summarization [3]. A literature survey on information extraction and text summarization is given by Zechner [7]. In general, the process of automatic text summarization is divided into three stages: (1) analysis of the given text, (2) summarization of the text, and (3) presentation of the summary in a suitable output form. Titles, abstracts and keywords are the most common summaries in academic papers. Usually, the title, the abstract and the keywords are the first, second, and third parts of an academic paper, respectively. The title usually describes the main issue discussed in the study, and the abstract presents the reader with a short description of the background, the study and its results. A keyword is either a single word (unigram), e.g. 'learning', or a collocation, which means a group of two or more words representing an important concept, e.g. 'machine learning', 'natural language processing'. Retrieving collocations from text was examined by Smadja [5], and automatic extraction of collocations was examined by Kita et al. [1]. Many academic conferences require that each paper include a title page containing the paper's title, its abstract and a set of keywords describing the paper's topics. The keywords provide general information about the contents of the document and can be seen as an additional kind of document abstraction. The keyword concept is also widely used throughout the Internet as a common method for searching with various search engines. Due to the simplicity of keywords, text search engines are able to handle huge volumes of free-text documents. Some examples of commercial websites specializing in keywords are: (1) www.wordtracker.com, which provides statistics on the most popular keywords people use and how many competing sites use them, and in addition helps find all keyword combinations that bear any relation to your business or service; and (2) www.singlescan.co.uk, which offers to register a website to one or more unique keywords (such as domain names); these keywords enable the website to be submitted into strategic positions on the search engines.
2.2 Extraction of Keywords
The basic idea of keyword extraction for a given article is to build a list of words and collocations sorted in descending order, according to their frequency, while filtering general terms and normalizing similar terms (e.g. “similar” and “similarity”). The filtering is done by using a stop-list of closed-class words such as articles, prepositions and pronouns. The most frequent terms are selected as keywords since we assume that the author repeats important words as he advances and elaborates.
Examples of systems that applied this method for abstract-creation are described in Luhn [2] and Zechner [6]. Luhn's system is probably the first practical extraction system. Luhn presents a simple technique that uses term frequencies to weight sentences, which are then extracted to form an abstract. Luhn's system does not extract keywords. However, he suggests generating index terms for information retrieval. In his paper, Luhn presents only two running-examples. Zechner's system does not extract keywords. It rather generates text abstracts from newspaper articles by selecting the “most relevant” sentences and combining them in text order. This system relies on a general, purely statistical principle, i.e., on the notion of “relevance”, as it is defined in terms of the combination of tf*idf weights of words in a sentence. Experiments he carried out achieved precision/recall values of 0.46/0.55 for abstracts of six sentences and of 0.41/0.74 for abstracts of ten sentences.
3 The Model
Our plan is composed of the following steps: (1) implementing the basic idea for extraction of keywords mentioned in the previous section, (2) testing it on abstracts that contain keywords composed by the authors, and (3) comparing the extracted keywords with the given keywords and analyzing the results. The implementation is as follows: the system chooses the n most highly weighted words or collocations as its proposed keywords. The value of n has been set at 9 for two reasons: (1) the number 9 is accepted as the maximal number of items that an average person is able to remember without apparent effort, according to the cognitive rule called "7±2", which means that an average person is capable of remembering approximately between 5 and 9 information items over a relatively short term [4]; (2) 9 is the largest number of keywords given by the authors of the abstracts we have selected to work with. These 9 extracted keywords will be compared to the keywords composed by the authors. A detailed analysis will identify full matches, partial matches and failures. Fig. 1 describes our algorithm in detail. In order to reduce the size of the word weight matrices mentioned in step 1 of the algorithm, we transform each word to lower case. In step 2, weights are calculated by counting full matches and partial matches. A full match is a repetition of the same word, including changes such as singular/plural, abbreviations, or first letter in lower case/upper case. A partial match between two different words is defined when the first five letters of both words are the same, because in such a case we assume that the words have a common radical. All other pairs of words are regarded as failures. Such a definition, on the one hand, usually identifies close words like nouns, verbs, adjectives, and adverbs. On the other hand, it does not allow most non-similar words to be regarded as partial matches.
For each abstract from the database:

1. Create word weight matrices for all unigrams, 2-grams and 3-grams. Approximately 450 high-frequency closed-class words (e.g.: we, this, and, when, in, usually, also, do, near) are excluded via a stop-list.
2. Compute the weights of all unigrams, 2-grams and 3-grams by counting full and partial appearances (see the definitions of full and partial matches above).
3. Sort each one of these kinds of x-grams in descending order, merge them and select the n highest-weighted groups of words as the keywords proposed by the system.
4. Analyze the results: compare the proposed keywords to the keywords composed by the author and report on the full matches, partial matches and failures of the system.

Fig. 1. The algorithm
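As an illustration only, the sketch below implements a simplified version of steps 1-3: frequency counting of stop-filtered 1- to 3-grams with the five-letter-prefix partial-match normalization. The stop-list excerpt is an assumption, sentence-importance weighting is omitted, and the returned keywords are prefix-normalized forms rather than the original surface strings.

```python
import re
from collections import Counter

STOP_WORDS = {"we", "this", "and", "when", "in", "usually", "also", "do", "near"}  # small excerpt of the ~450-word stop-list

def candidate_grams(text, max_n=3):
    # Lower-case tokens with stop words removed, then all 1- to 3-grams.
    words = [w.lower() for w in re.findall(r"[A-Za-z][A-Za-z-]*", text) if w.lower() not in STOP_WORDS]
    for n in range(1, max_n + 1):
        for i in range(len(words) - n + 1):
            yield tuple(words[i:i + n])

def normalize(gram):
    # Partial-match normalization: words sharing the same first five letters are merged.
    return tuple(w[:5] for w in gram)

def extract_keywords(title, abstract, top_n=9):
    counts = Counter(normalize(g) for g in candidate_grams(title + " " + abstract))
    return [" ".join(g) for g, _ in counts.most_common(top_n)]

print(extract_keywords("Automatic Extraction of Keywords from Abstracts",
                       "We present a model that extracts keywords from abstracts and titles."))
```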
A positive example for this definition is the following: all 8 of these words are regarded as partial matches because they share the 5-letter prefix "analy": the nouns "analysis", "analyst" and "analyzer", the verb "analyze", the adjectives "analytic", "analytical" and "analyzable", and the adverb "analytically". A negative example is the following: the 8 words "confection", "confab", "confectioner", "confidence", "confess", "configure", "confinement" and "confederacy" are not regarded as partial matches because they have in common only the 4-letter prefix "conf". As stated in Section 2.2, the basic idea in automatic finding of keywords is that the most frequent terms are selected as keywords, since we assume that the author repeats important words as he advances and elaborates. An additional criterion taken into account in many summarizing programs is the importance of the sentence from which the term is taken. There are various methods to identify the importance of sentences, e.g.: (1) the location of sentences (position in text, position in paragraph, titles, …), (2) sentences which include important cue words and phrases like "problem", "assumption", "our investigation", "results", "conclusions", "in summary", (3) analysis of the text based on relationships between sentences, e.g. logical relations and syntactic relations, and (4) the structure and format of the document, e.g. document outline and narrative structure. In our study, we use these two basic features (frequency and importance of sentences) for extracting keywords. In addition, we take into account the length of the terms being tested as potential keywords. A statistical analysis of the distribution of the word lengths of authors' keywords in our database shows that 2-grams are the most frequent (189 out of 332, about 57%), then unigrams (102 out of 332, about 31%), then 3-grams (34 out of 332, about 10%) and finally 4-grams (7 out of 332, about 2%).
Table 1. Summarization of general results

              Tested keywords   Full matches   Partial matches   Failures
Number of     332               77             128               127
Percentage    100%              23.19%         38.56%            38.25%
4 Experiments
We have constructed a database containing 80 documents that describe academic papers. Each document includes the title of the paper, its abstract, and a list of keywords composed by the author. Most of the documents are taken from publications and technical reports available at http://helen.csi.brandeis.edu/papers/long.html/. Table 1 shows the general results.
5 Discussion
Previous systems use similar techniques not for keyword extraction but rather for abstract creation. Luhn's system does not present general results. Zechner's system achieves precision/recall values of 0.46/0.55 for abstracts of six sentences and 0.41/0.74 for abstracts of ten sentences. Our system achieves a rate of 0.62 for finding partial and full matches. Our results are slightly better; however, the systems are not directly comparable because they deal with different tasks and with different kinds of text documents. Full matches were found at a rate of 23.19%. Partial matches or better were found at a rate of 61.75%. At first sight, the rate of the system's success appears unimpressive. However, we claim that these are rather satisfying results due to some interesting findings: (1) 86.25% of the abstracts do not include (in an exact form) at least one of their own keywords; (2) 55.12% of the keywords chosen by the authors appear (in an exact form) neither in the title nor in the abstract. Full statistics are given in the Appendix. The main cause of these findings is the authors themselves: most of the abstracts do not include (in an exact form) at least one of their own keywords, and most of the keywords chosen by the authors appear (in an exact form) neither in the title nor in the abstract. These results might show that at least some of the authors have problems either in defining their titles and abstracts or in choosing their keywords. In such circumstances our results are quite good.
6 Summary and Future Work
As far as we know, no existing program is able to extract keywords from abstracts and titles. Results show that keywords extracted from abstracts and titles may be a primary tool for researchers. Future directions for research are: (1) using different learning techniques, which might improve the results of the extraction, (2) selecting an optimal set of initial weights, and (3) elaborating the model for extracting and learning keywords from entire articles.
Acknowledgements Thanks to Ari Cirota and two anonymous referees for many valuable comments on earlier versions of this paper.
References
[1] Kita, K., Kato, Y., Omoto, T. and Yano, Y.: Automatically Extracting Collocations from Corpora for Language Learning. In: Proceedings of the International Conference on Teaching and Language Corpora. Reprinted in: Wilson, A. and McEnery, A. (eds.): UCREL Technical Papers Volume 4 (Special Issue), Corpora in Language Education and Research, A Selection of Papers from TALC 94. Dept. of Linguistics, Lancaster University, England (1994) 53-64
[2] Luhn, H. P.: The Automatic Creation of Literature Abstracts. IBM Journal of Research and Development (1959) 159-165. Reprinted in: Mani, I. and Maybury, M. (eds.): Advances in Automatic Text Summarization. MIT Press, Cambridge, MA (1999)
[3] Mani, I. and Maybury, M. T.: Introduction. In: Mani, I. and Maybury, M. (eds.): Advances in Automatic Text Summarization. MIT Press, Cambridge, MA (1999)
[4] Miller, G. A.: The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information. Psychological Review 63 (1956) 81-97
[5] Smadja, F.: Retrieving Collocations from Text. Computational Linguistics 19(1) (1993) 143-177
[6] Zechner, K.: Fast Generation of Abstracts from General Domain Text Corpora by Extracting Relevant Sentences. In: Proceedings of the 16th International Conference on Computational Linguistics (1996) 986-989
[7] Zechner, K.: A Literature Survey on Information Extraction and Text Summarization. Term Paper, Carnegie Mellon University (1997)
Appendix: Statistics Concerning the Tested Documents

There are 80 abstracts.
There is 1 abstract with 1 keyword.
There are 12 abstracts with 2 keywords.
There are 19 abstracts with 3 keywords.
There are 22 abstracts with 4 keywords.
There are 12 abstracts with 5 keywords.
There are 4 abstracts with 6 keywords.
There are 4 abstracts with 7 keywords.
There are 4 abstracts with 8 keywords.
There are 2 abstracts with 9 keywords.
There are 332 keywords in all abstracts.
The average number of keywords in a single abstract is 4.15.
There are 102 keywords with a length of 1 word.
There are 189 keywords with a length of 2 words.
There are 34 keywords with a length of 3 words.
There are 7 keywords with a length of 4 words.
The average length in words of a single keyword is 1.84.
There are 0 abstracts with 1 sentence.
There are 0 abstracts with 2 sentences.
There is 1 abstract with 3 sentences.
There are 4 abstracts with 4 sentences.
There are 15 abstracts with 5 sentences.
There are 15 abstracts with 6 sentences.
There are 17 abstracts with 7 sentences.
There are 10 abstracts with 8 sentences.
There are 13 abstracts with 9 sentences.
There are 5 abstracts with 10 sentences.
The average length in sentences of a single abstract is 6.88.
The number of abstracts which do not include 1 of their own keywords is 18.
The number of abstracts which do not include 2 of their own keywords is 18.
The number of abstracts which do not include 3 of their own keywords is 14.
The number of abstracts which do not include 4 of their own keywords is 11.
The number of abstracts which do not include 5 of their own keywords is 6.
The number of abstracts which do not include 6 of their own keywords is 1.
The number of abstracts which do not include 7 of their own keywords is 1.
The total number of authors' keywords that do not appear in their own abstracts is 183.
That is, 55.12% of the authors' keywords do not appear in their own abstracts.
There are 33 abstracts which do not include one of their own keywords with a length of 1 word.
There are 59 abstracts which do not include one of their own keywords with a length of 2 words.
There are 16 abstracts which do not include one of their own keywords with a length of 3 words.
There are 7 abstracts which do not include one of their own keywords with a length of 4 words.
There are 69 abstracts which do not include (in an exact form) at least one of their own keywords.
That is, 86.25% of the abstracts do not include (in an exact form) at least one of their own keywords.
Term-length Normalization for Centroid-based Text Categorization
Verayuth Lertnattee and Thanaruk Theeramunkong
Information Technology Program, Sirindhorn International Institute of Technology
Thammasat University, Pathumthani, 12121, Thailand
{[email protected], [email protected], [email protected]}
Abstract. Centroid-based categorization is one of the most popular algorithms in text classification. Normalization is an important factor in improving the performance of a centroid-based classifier when documents in a text collection have quite different sizes. In the past, normalization involved only document-length or class-length normalization. In this paper, we propose a new type of normalization called term-length normalization, which considers term distribution in a class. The performance of this normalization is investigated in three environments of a standard centroid-based classifier (TFIDF): (1) without class-length normalization, (2) with cosine class-length normalization and (3) with summing weight normalization. The results suggest that our term-length normalization is useful for improving classification accuracy in all cases.
1 Introduction
With the fast growth of online text information, there has been an extreme need to organize relevant information in text documents. Automatic text categorization (also known as text classification) has become a significant tool for utilizing text documents efficiently and effectively. In the past, a variety of classification models were developed in different schemes, such as probabilistic models (i.e., Bayesian classification) [1, 2], regression models [3], example-based models (e.g., k-nearest neighbor) [3], linear models [4, 5], support vector machines (SVM) [3, 6] and so on. Among these methods, a variant of linear models called the centroid-based model is attractive since it requires relatively less computation than other methods in both the learning and classification stages. Despite the lower computation time, centroid-based methods were shown in several works, including those in [4, 7, 8], to achieve relatively high classification accuracy. The classification performance of the model strongly depends on the weighting method applied in the model. In this paper, a new type of normalization, so-called term-length normalization, is investigated in text classification. Using various data sets, the performance is investigated in three environments of a standard centroid-based classifier (TFIDF): (1) without class-length normalization, (2) with cosine class-length normalization and (3) with summing weight normalization. In the rest of this paper, section 2 presents centroid-based text categorization. Section 3 describes the concept of normalization in text categorization. The proposed term-length normalization
is given in section 4. The data sets and experimental settings are described in section 5. In section 6, experimental results using four data sets are given. A conclusion and future work are given in section 7.
2 Centroid-based Text Categorization
In centroid-based text categorization, a document (or a class) is represented by a vector using a vector space model with a bag of words (BOW) [9, 10]. The simplest and most popular representation applies term frequency (tf) and inverse document frequency (idf) in the form of tfidf. In a vector space model, given a set of documents D = {d1, d2, ..., dN}, a document dj is represented by a vector dj = {w1j, w2j, ..., wmj} = {tf1j·idf1, tf2j·idf2, ..., tfmj·idfm}, where wij is the weight assigned to a term ti in the document. In this definition, tfij is the term frequency of the term ti in the document dj and idfi is the inverse document frequency, defined as log(N/dfi). The idf can be applied to eliminate the impact of frequent terms that exist in almost all documents. Here, N is the total number of documents in the collection and dfi is the number of documents which contain the term ti. Besides term weighting, normalization is another important factor in representing a document or a class. The details of normalization are described in the next section. The class prototype ck is obtained by summing up all document vectors in Ck and then normalizing the result by its size. The formal description of a class prototype is ck = Σ_{dj∈Ck} dj / |Σ_{dj∈Ck} dj|, where Ck = {dj | dj is a document belonging to the class ck}. The simple term weighting is tf·idf, where tf (written \overline{tf} in the equations below) is the average class term frequency of the term. The formal description of this average is \overline{tf}_{ik} = Σ_{dj∈Ck} tfijk / |Ck|, where |Ck| is the number of documents in the class ck. The term weighting described above can also be applied to a query or a test document; in general, the term weighting for a query is tf·idf. Once a class prototype vector and a query vector have been constructed, the similarity between the two vectors can be calculated. The most popular measure is the cosine distance, which can be computed as the dot product between the two vectors. Therefore, the test document is assigned to the class whose prototype vector is the most similar to the vector of the test document.
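The centroid-based scheme just described can be condensed into a short sketch. This is an illustrative reconstruction rather than the authors' implementation: the function names, the dictionary-based sparse vectors, and the simple preprocessing are choices of ours, made only to show how the pieces fit together.

```python
import math
from collections import Counter, defaultdict

def tf_idf_vectors(docs):
    """docs: list of token lists. Returns per-document tf*idf vectors and the idf map."""
    N = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))                     # document frequency of each term
    idf = {t: math.log(N / df[t]) for t in df}
    vectors = []
    for tokens in docs:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * idf[t] for t in tf})
    return vectors, idf

def class_prototypes(vectors, labels):
    """Sum the document vectors of each class, then normalize by the vector length."""
    sums = defaultdict(lambda: defaultdict(float))
    for vec, label in zip(vectors, labels):
        for t, w in vec.items():
            sums[label][t] += w
    prototypes = {}
    for label, vec in sums.items():
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        prototypes[label] = {t: w / norm for t, w in vec.items()}
    return prototypes

def classify(query_vec, prototypes):
    """Assign the class whose prototype has the largest dot product with the query."""
    def dot(a, b):
        return sum(w * b.get(t, 0.0) for t, w in a.items())
    return max(prototypes, key=lambda label: dot(query_vec, prototypes[label]))
```

A test document or query is vectorised with the same idf values and passed to classify; since the prototypes are length-normalized, the cosine similarity reduces to the dot product.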
3 Normalization Methods
In the past, most normalization methods were based on document-length normalization, which reduces the advantage of a long document over a short one. A longer document may include terms with higher term frequency and more unique terms in its representation, which increases the similarity and the chances of retrieval of longer documents in preference over shorter ones. To solve this issue, all relevant documents should normally be treated as equally important for classification or retrieval, especially by way of normalization. As the most radical approach, document-length normalization is incorporated into the term weighting formula to equalize the length of document vectors. Although
there have been several types of document-length normalization proposed [10], cosine normalization [4, 8] is the most commonly used. It can solve the problem of overweighting due to both terms with higher frequency and more unique terms. Cosine normalization is done by dividing all elements in a vector by the length of the vector, √(Σi wi²), where wi is the weight of the term ti before normalization. In the centroid-based method, two types of normalization have been used: (1) normalizing the documents in a certain class and then merging the documents into a class vector, and (2) merging the documents into a class vector and then normalizing the class vector. The latter type of normalization is used in our experiments. We call the process of normalizing the class vector class-length normalization, which strongly depends on the term weighting of the class vector before normalization. In this paper, we propose a novel method, so-called term-length normalization, that adjusts the term weighting to be more effective and powerful before following with class-length normalization.
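The two class-length normalizations compared later in the experiments can be sketched as follows. The paper does not spell out the summing-weight formula at this point, so dividing by the sum of the weights is our reading of it, and the function names are ours.

```python
import math

def cosine_normalize(vec):
    """Divide every weight by the Euclidean length of the class vector."""
    length = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / length for t, w in vec.items()}

def summing_weight_normalize(vec):
    """Divide every weight by the sum of the weights (our reading of 'summing weight')."""
    total = sum(vec.values()) or 1.0
    return {t: w / total for t, w in vec.items()}
```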
4 Term-length Normalization
Our concept is based on the fact that terms which have high discriminating power should have some of the following properties. They should occur frequently in a certain class; this corresponds to tf in a centroid-based method. Given the same tf, we consider terms that are distributed consistently over the documents of a class to be more important than those that occur in only a few documents with high frequency. To achieve this goal, we propose a general term length in the form of the root mean term frequency calculated from all documents in the class, tfl_ik(n). The tfl_ik(n) of a term ti in a class ck at a degree n is defined as:
    tfl_{ik}(n) = \sqrt[n]{\frac{\sum_j tf_{ijk}^n}{|C_k|}}        (1)
Term-length normalization is performed by taking tf/tfl(n), as shown below:
    tf'_{ik}(n) = \frac{\overline{tf}_{ik}}{tfl_{ik}(n)} = \frac{\sum_j tf_{ijk}}{\sqrt[n]{|C_k|^{n-1} \sum_j tf_{ijk}^n}}        (2)

The tf'_ik(n) can also be viewed as a kind of term distribution. Its value is in the range [1/\sqrt[n]{|C_k|^{n-1}}, 1]. The minimum value is obtained when a term occurs in only one document of the class. The maximum value is obtained when a term occurs equally in all documents. The value n is the degree of the utilization of term distribution: the higher n is, the more important the utilization of term distribution becomes. In our concept, the term weighting depends on tf, tf' and idf, so the weighting skeleton can be defined as:
    w_{ik} = (\overline{tf}_{ik})^x \, (tf'_{ik}(n))^y \, (idf_i)^z        (3)
From equation (3), the values x, y and z are the powers which represent the level of contribution of each factor to the term weighting, respectively. In this paper, we focus on x and y and set z to the standard value (z = 1). When x = 0, the term weighting depends only on tf' and idf. When x > 0, the tf has a positive effect on the term weighting. On the other hand, when x < 0, the tf has a negative effect on the term weighting. In our preliminary experiments, we found that tf should have a positive effect on the term weighting, that is, the value of x should be equal to or greater than zero. For example, Figure 1 illustrates three kinds of classes, each of which holds ten documents. Each number shows the occurrences of a certain term in a document. In Figure 1(a), the term appears in only two documents. In Figure 1(b), the term appears in eight documents of the class but the tf of the term is the same as in the case of Figure 1(a), i.e., 1.40. Intuitively, the term in Figure 1(b) should be more important than the term in Figure 1(a). Focusing only on tf, we cannot grasp this difference. However, if we consider the tf' (tf/tfl(n)), we can observe that case (b) obtains a higher value than case (a) does. For a more complex situation, the term distribution pattern in Figure 1(c) is similar to the pattern in Figure 1(b), but Figure 1(c) has a higher tf. In this case, merely tf' is not enough for representing the importance level of the term; we need to consider tf as well. Furthermore, when the value of n is higher, the value of tf'(n) is lower, so the term weighting is more sensitive to the term distribution pattern, i.e., from tf'(2) to tf'(3). In conclusion, we introduced a kind of term distribution, tf/tfl(n), that contributes to the term weighting along with the traditional tf and idf.
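A minimal sketch of equations (1)-(3), assuming the per-document frequencies of a term inside one class are available as a plain list; the variable and function names are ours. The last lines reproduce case (a) of Fig. 1 below.

```python
def tfl(tf_counts, n):
    """Term length of degree n: n-th root of the mean of tf^n over the class documents."""
    return (sum(tf ** n for tf in tf_counts) / len(tf_counts)) ** (1.0 / n)

def tf_prime(tf_counts, n):
    """tf'(n) = mean term frequency divided by the term length of degree n."""
    mean_tf = sum(tf_counts) / len(tf_counts)
    t = tfl(tf_counts, n)
    return mean_tf / t if t else 0.0

def term_weight(tf_counts, idf, n, x=1.0, y=1.0, z=1.0):
    """Weighting skeleton of equation (3): (mean tf)^x * tf'(n)^y * idf^z."""
    mean_tf = sum(tf_counts) / len(tf_counts)
    return (mean_tf ** x) * (tf_prime(tf_counts, n) ** y) * (idf ** z)

# Case (a) of Fig. 1: the term occurs 8 and 6 times in two of the ten documents.
case_a = [0, 0, 8, 0, 0, 6, 0, 0, 0, 0]
print(round(sum(case_a) / len(case_a), 2),   # 1.4
      round(tfl(case_a, 2), 2),              # 3.16
      round(tf_prime(case_a, 2), 2))         # 0.44
```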
Fig. 1. Three different cases of a term occurring in a class of ten documents (each number in the original figure is the term's frequency in one document):
(a) the term occurs 8 times in one document and 6 times in another, and not at all in the other eight: tf=1.40, tfl(2)=3.16, tf'(2)=0.44, tfl(3)=4.18, tf'(3)=0.34
(b) the term occurs twice in six documents, once in two documents, and not at all in two documents: tf=1.40, tfl(2)=1.61, tf'(2)=0.87, tfl(3)=1.71, tf'(3)=0.82
(c) the term occurs four times in six documents, twice in two documents, and not at all in two documents: tf=2.80, tfl(2)=3.22, tf'(2)=0.87, tfl(3)=3.42, tf'(3)=0.82

5 Data sets and Experimental Settings
Four data sets are used in the experiments: (1) Drug Information (DI), (2) Newsgroups (News), (3) WebKB1 and (4) WebKB2. The first data set, DI, is a set of web pages collected from www.rxlist.com. It includes 4,480 English web pages
with 7 classes: adverse drug reaction, clinical pharmacology, description, indications, overdose, patient information, and warning. The second data set, Newsgroups, contains 20,000 documents. The articles are grouped into 20 different UseNet discussion groups. The third and fourth data sets are constructed from WebKB. These web pages were collected from the computer science departments of four universities, with some additional pages from other universities. In our experiment, we use the four most popular classes, student, faculty, course and project, as our third data set, called WebKB1. The total number of web pages is 4,199. Alternatively, this collection can be rearranged into five classes by university (WebKB2): cornell, texas, washington, wisconsin and misc. All headers are eliminated from the documents in News, and all HTML tags are omitted from the documents in DI, WebKB1 and WebKB2. For all data sets, a stop word list is applied to remove common words, such as a, by, he and so on, from the documents. The following two experiments are performed. In the first experiment, term-length normalization is applied with equal powers of tf and tfl. After the simple term weighting is modified by term-length normalization, the final class vectors are constructed without class-length normalization and with the two types of class-length normalization, cosine and summing weight, and compared to the same methods without term-length normalization. In the second experiment, we adjust three different powers of tf and tfl. In all experiments, a data set is split into two parts: 90% for the training set and 10% for the test set (10-fold cross validation). As the performance indicator, classification accuracy (%) is applied. It is defined as the ratio of the number of documents assigned their correct classes to the total number of documents in the test set.
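The evaluation protocol (10-fold cross validation with accuracy in per cent) can be expressed as a small harness. The interface below is our own placeholder: train_and_classify stands for any of the weighting and normalization variants described above and is assumed to return a prediction function.

```python
import random

def ten_fold_accuracy(documents, labels, train_and_classify, seed=0):
    """Average accuracy over ten 90%/10% splits (correct assignments / test size)."""
    indices = list(range(len(documents)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::10] for i in range(10)]
    accuracies = []
    for fold in folds:
        test = set(fold)
        train_idx = [i for i in indices if i not in test]
        predict = train_and_classify([documents[i] for i in train_idx],
                                     [labels[i] for i in train_idx])
        correct = sum(1 for i in fold if predict(documents[i]) == labels[i])
        accuracies.append(100.0 * correct / len(fold))
    return sum(accuracies) / len(accuracies)
```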
6 Experimental Results
6.1 Effect of Term-length Normalization Based on Only tf' and idf

In the first experiment, the powers of tf and tfl are equal. Two types of term-length normalization, tfl(2) and tfl(3), were applied to the term weighting. As a baseline, a standard centroid-based classifier (tf·idf) without term-length normalization (N) is used. We investigate two kinds of term weighting, tf'(2)·idf (T2) and tf'(3)·idf (T3); the final class vectors were constructed without class-length normalization (CN0) and with the two types of class-length normalization: cosine normalization (CNC) and summing weight normalization (CNW). The result is shown in Table 1. The highest accuracy in each data set and on the average of the four data sets is marked with an asterisk. When the power of tf equals that of tfl, tf' is constructed and tf is no longer available to express the effect of term frequency. The tf' concerns the term distribution among the documents in a class. Using tf' instead of tf can improve accuracy on all data sets when no class-length normalization is applied, especially in News. It also enhances the effect of cosine normalization in DI, News and WebKB1. For summing weight normalization, it can improve performance only in News. The tf'(3) performs better than the tf'(2) on average accuracy. In the next experiment, we turn on the effect of tf in combination with tf' and idf.
Table 1. Effect of term-length normalization when the powers of tf and tfl are equal

Class   DI                      News                    WebKB1                  WebKB2                  Average
Norm.   N      T2     T3       N      T2     T3       N      T2     T3       N      T2     T3       N      T2     T3
CN0     66.99  74.84  75.71    69.13  82.05  79.93    53.32  66.16  66.73    40.27  51.01  26.41    57.43  68.52  62.20
CNC     91.67  92.08  95.33*   74.76  82.46* 82.09    77.71  75.30  78.85*   88.76* 61.59  79.04    83.23  77.86  83.83*
CNW     93.06  66.83  78.95    73.63  76.40  75.29    73.02  59.37  70.35    33.36  22.72  22.82    68.27  56.33  61.85
6.2 Effect of Term-length Normalization Based on tf, tf' and idf

In this experiment, three different patterns were applied to set the power of tf higher than the power of tfl, turning on the effect of tf in combination with tf' and idf. For tfl(2), the three patterns are: (1) tf·idf/√tfl(2), which equals (tf)^0.5 · tf'(2)^0.5 · idf (TA); (2) (tf)^1.5 · idf/tfl(2), which equals (tf)^0.5 · tf'(2) · idf (TB); and (3) (tf)^2 · idf/tfl(2), which equals tf · tf'(2) · idf (TC). The same patterns of term weighting were also applied to tfl(3). The result is shown in Table 2. The highest accuracy in each data set and on the average of the four data sets is marked with an asterisk.
Table 2. Effect of term-length normalization when the power of tf is higher than tfl(2) and tfl(3)

Term    Class   DI                      News                    WebKB1                  WebKB2                  Average
Norm.   Norm.   TA     TB     TC       TA     TB     TC       TA     TB     TC       TA     TB     TC       TA     TB     TC
tfl(2)  CN0     70.11  71.03  68.62    78.28  74.79  65.22    61.66  63.68  53.56    44.92  28.01  23.53    63.74  59.38  52.73
tfl(2)  CNC     95.09  96.88  96.45    80.67* 79.60  69.83    80.42  82.42  80.45    89.26  89.97  90.45*   86.36  87.22* 84.29
tfl(2)  CNW     90.76  95.20  96.41    78.72  77.69  71.59    72.23  80.42  80.35    23.60  29.51  70.56    66.33  70.70  79.73
tfl(3)  CN0     70.56  71.70  69.02    77.70  72.55  63.55    62.68  64.90  54.58    30.94  22.98  22.17    60.47  58.03  52.33
tfl(3)  CNC     96.16  97.08* 96.43    80.39  78.33  68.36    81.23  82.81* 80.23    88.74  89.47  89.28    86.63  86.92  83.58
tfl(3)  CNW     92.77  96.23  96.61    78.55  76.14  70.45    75.54  82.21  81.04    23.74  43.27  70.30    67.65  74.46  79.60
According to the results in Table 2, when the powers of tf are higher than those of tfl, almost all methods outperform the corresponding methods in the previous experiment, except in News. This suggests that term distribution is very valuable for classifying documents in News. Term-length normalization can improve classification accuracy with all types of class-length normalization, especially cosine normalization. From the average classification accuracy, the maximum result is obtained when using a power of tf of 1.5 and a power of tfl(2) of 1.0, followed by cosine normalization. The tfl(2) works better on News and WebKB2, while the tfl(3) works better on DI and WebKB1. The highest result in each data set is gained from a different combination of tf and tf'(n). The suitable combination of tf and tf'(n) depends on the individual characteristics of the data sets.
7 Conclusion and Future Work
This paper showed that term-length normalization is useful in centroid-based text categorization. It considers the term distribution in a class: terms that appear in several documents of a class are more important than those that appear in a few documents with higher term frequencies. The distributions were used to represent the discriminating power of each term and then to weight that term. Adjusting the powers of tf and tfl and the level n in a suitable proportion is a key factor in improving the accuracy. From the experiments, the results suggest that term-length normalization can improve classification accuracy and works well with all methods of class-length normalization. For our future work, we plan to evaluate term-length normalization with other types of class-length normalization.
Acknowledgement. This work has been supported by National Electronics and Computer Technology Center (NECTEC), project number NT-B-06-4C-13-508.

References
1. McCallum, A., Rosenfeld, R., Mitchell, T., Ng, A.Y.: Improving text classification by shrinkage in a hierarchy of classes. In: Proc. 15th International Conf. on Machine Learning, Morgan Kaufmann, San Francisco, CA (1998) 359-367
2. Nigam, K., McCallum, A.K., Thrun, S., Mitchell, T.M.: Learning to classify text from labeled and unlabeled documents. In: Proceedings of AAAI-98, 15th Conference of the American Association for Artificial Intelligence, Madison, US, AAAI Press, Menlo Park, US (1998) 792-799
3. Yang, Y., Liu, X.: A re-examination of text categorization methods. In: 22nd Annual International SIGIR, Berkley (1999) 42-49
4. Chuang, W.T., Tiyyagura, A., Yang, J., Giuffrida, G.: A fast algorithm for hierarchical text classification. In: Data Warehousing and Knowledge Discovery (2000) 409-418
5. Joachims, T.: A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In Fisher, D.H., ed.: Proceedings of ICML-97, 14th International Conference on Machine Learning, Nashville, US, Morgan Kaufmann Publishers, San Francisco, US (1997) 143-151
6. Joachims, T.: Text categorization with support vector machines: learning with many relevant features. In Nédellec, C., Rouveirol, C., eds.: Proceedings of ECML-98, 10th European Conference on Machine Learning. Number 1398, Chemnitz, DE, Springer Verlag, Heidelberg, DE (1998) 137-142
7. Han, E.H., Karypis, G.: Centroid-based document classification: Analysis and experimental results. In: Principles of Data Mining and Knowledge Discovery (2000) 424-431
8. Lertnattee, V., Theeramunkong, T.: Improving centroid-based text classification using term-distribution-based weighting and feature selection. In: Proceedings of INTECH-01, 2nd International Conference on Intelligent Technologies, Bangkok, Thailand (2001) 349-355
9. Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Information Processing and Management 24 (1988) 513-523
10. Singhal, A., Salton, G., Buckley, C.: Length normalization in degraded text collections. Technical Report TR95-1507 (1995)
Recommendation System Based on the Discovery of Meaningful Categorical Clusters

Nicolas Durand1, Luigi Lancieri1, and Bruno Crémilleux2

1 France Telecom R&D, 42 rue des Coutures - 14066 Caen Cédex 4 - France
{nicola.durand,luigi.lancieri}@francetelecom.com
2 GREYC CNRS-UMR 6072, Université de Caen, Campus Côte de Nacre - 14032 Caen Cédex - France
[email protected]
Abstract. We propose in this paper a recommendation system based on a new method of clusters discovery which allows a user to be present in several clusters in order to capture his different centres of interest. Our system takes advantage of content-based and collaborative recommendation approaches. The system is evaluated by using proxy server logs, and encouraging results were obtained.
1 Introduction
The search for relevant information on the World Wide Web is still a challenge. Even if the indexing methods get more efficient, search engines stay passive agents and do not take into account the context of the users. Our approach suggests an active solution based on the recommendation of documents. We propose in this paper a hybrid recommendation system (collaboration via content). Our method provides recommendations based on the content of the documents, and also recommendations based on collaboration. Recommendation systems can be installed on the user's computer (like an agent recommending web pages during navigation), or on particular web sites, platforms or portals. Our approach can be applied to systems where users' consultations can be recorded (for example: a proxy server, a restricted web site or a portal). Our system is based on recent KDD (Knowledge Discovery in Databases) methods. In [2], we defined a new method of clusters discovery from frequent closed itemsets. In this paper, we create a recommendation system by taking advantage of this method. We form clusters of users having common centres of interest, by using the keywords of the consulted documents. Our method relates to content-based filtering (the identification of common keywords) but uses some techniques of collaborative filtering (i.e. clustering of users). Moreover, we discover a set of clusters and not a strict clustering (i.e. a partition) like the recommendation systems based on clustering. This means that our approach enables a user
to be in several clusters. We can retrieve a user with several kinds of queries corresponding to his different centres of interest. Our system is autonomous; it does not need the intervention of users. Indeed, we use logs (consultations of documents for several users), which are a good source of information to indicate what the users want [13]. We can perform both content-based recommendations and collaboration-based recommendations. For a new document arriving in the system, we can recommend it to users by comparing the keywords of the document to the clusters. We can also recommend to a user of a cluster documents that other users of the cluster have consulted (collaborative approach). The rest of the paper is organized as follows: in the next section, we present related work. Then, we detail our method of clusters discovery (called Ecclat). In Section 4, we explain our recommendation system based on the discovered clusters. Then, we describe our experiments and give some results. We conclude in Section 6.
2 Related Work
Recommendation systems are assimilated to information filtering systems because the ideas and the methods are very close. There are two types of filtering: content-based filtering and collaborative filtering. Content-based filtering identifies and provides relevant information to users on the basis of the similarity between the information and the profiles. We can quote the Syskill & Webert system [12], which produces a Bayesian classifier from a learning database containing web pages scored by the user. The classifier is then used to establish whether a page can interest the user. SiteHelper [9] recommends only the documents of a web site; it uses the feedback of the user. Letizia [6] is a client-side agent which searches for web pages similar to the previously consulted or bookmarked ones. WebWatcher [4] uses proxy server logs to make recommendations. Mobasher et al. [7] propose a recommendation system based on the clustering of web pages from web server logs. This system determines the URLs which can interest the user by matching the URLs of the user's current session with the clusters. Collaborative filtering finds relevant users who have similar profiles, and provides the documents they like to each other. Rather than the similarity between documents and profiles, this method measures the similarity between profiles. Tapestry [3] and GroupLens [5] allow users to comment on Netnews documents, and to get the ones recommended by the others. Amalthaea [8] is an agent which allows the user to create and modify his profile. In these systems, users must specify their profiles. Among the autonomous collaborative approaches, there are some methods based on clustering and on associations of items. Wilson et al. [14] use the frequent associations containing two items (TV programs) in order to determine the similarity between two user profiles. There are also some hybrid approaches. Pazzani [11] showed by some experiments that hybrid systems use more information, and provide more precise recommendations. Pazzani talks about collaboration via content, because
the profile of each user is based on the content, and is used to detect the similarity among the users. Fab [1] implements this idea in a similar way. In Fab, some agents (one per user) collect documents and put them in a central repository (to take advantage of potential overlaps between users' interests) in order to recommend them to users. We can also cite OTS [15], which allows a set of users to consult papers provided by a publication server. The users are grouped according to their profiles. These profiles are defined and based on the content of the papers. Contrary to OTS, our system can provide recommendations on documents not yet consulted by users, and our method of cluster discovery does not use defined profiles.
3 Clusters Discovery with Ecclat
We have developed a clustering method (named Ecclat [2]) for the discovery of interesting clusters in web mining applications, i.e. clusters with possible overlapping of elements. For instance, we would like to retrieve a user (or a page) from several kinds of queries corresponding to several centres of interest (or several points of view). Another characteristic of Ecclat is its ability to tackle large data bases described by categorical data. The approach used by Ecclat is quite different from usual clustering techniques. Unlike existing techniques, Ecclat does not use a global measure of similarity between elements but is based on an evaluation measure of a cluster. The number of clusters is not set in advance. In the following discussion, each data record is called a transaction (a user) and is described by items (the consulted keywords). Ecclat discovers the frequent closed itemsets [10] (seen as potential clusters), evaluates them and selects some of them. An itemset X is frequent if the number of transactions which contain X is at least the frequency threshold (called minfr) set by the user. X is a closed itemset if its frequency decreases whenever any item is added to it. A closed itemset checks an important property for clustering: it gathers a maximal set of items shared by a maximal number of transactions. In other words, this allows us to capture the maximum amount of similarity. These two points (the capture of the maximum amount of similarity and the frequency) are the basis of our approach for the selection of meaningful clusters. Ecclat selects the most interesting clusters by using a cluster evaluation measure. All computations and interpretations are detailed in [2]. The cluster evaluation measure is composed of two measures: homogeneity and concentration. With the homogeneity value, we want to favour clusters having many items shared by many transactions (a relevant cluster has to be as homogeneous as possible and should gather "enough" transactions). The concentration measure limits the overlapping of transactions between clusters. Finally, we define the interestingness of a cluster as the average of its homogeneity and concentration. Ecclat uses the interestingness to select clusters. An innovative feature of Ecclat is its ability to produce a clustering with a minimum overlapping between clusters (which we call "approximate clustering") or a set of clusters with a slight overlapping. This functionality depends on the value of a parameter
called M. M is an integer corresponding to a number of transactions not yet classified that must be classified by a newly selected cluster. The algorithm performs as follows. The cluster having the highest interestingness is selected. Then, as long as there are transactions to classify (i.e. which do not belong to any selected cluster) and some clusters are left, we select the cluster having the highest interestingness and containing at least M transactions not yet classified. The number of clusters is established by the selection algorithm, and is bound to the M value. Let n be the number of transactions; if M is equal to 1, we have at worst (n − minfr + 1) clusters. In practice, this does not happen. If we increase the M value, the number of clusters decreases. We are close to a partition of the transactions when M is near minfr.
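The selection loop just described can be written down compactly. The sketch below is our reading of the algorithm, not the published Ecclat code: clusters are represented only by the sets of transactions they cover, and the interestingness scores (the average of homogeneity and concentration, detailed in [2]) are taken as given input.

```python
def select_clusters(transactions, candidates, interestingness, M):
    """Greedy cluster selection following the description above.

    transactions: set of all transaction ids.
    candidates: list of frozensets of transaction ids (supports of the frequent
        closed itemsets seen as potential clusters).
    interestingness: dict mapping each candidate frozenset to its score.
    M: minimum number of not-yet-classified transactions a new cluster must cover.
    """
    remaining = sorted(candidates, key=lambda c: interestingness[c], reverse=True)
    selected, unclassified = [], set(transactions)
    while unclassified and remaining:
        # highest-interestingness cluster covering at least M unclassified transactions
        pick = next((c for c in remaining if len(c & unclassified) >= M), None)
        if pick is None:
            break
        selected.append(pick)
        unclassified -= pick
        remaining.remove(pick)
    return selected
```

Setting M = 1 favours overlapping clusters, while pushing M towards minfr approaches a partition, as noted above.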
4 Recommendation System
In this section, we present the basis of our recommendation system. It is composed of an off-line process (clusters discovery with Ecclat) and an on-line process producing recommendations. The on-line process computes a score between a new document and each of the discovered clusters. For a document and a cluster, if the score is greater than a threshold, then the document is recommended to the users of the cluster. We can also use the collaboration and recommend the documents that the users of a cluster have consulted to the other users of the cluster. For the moment, we concentrate on the first type of recommendation. The score between a document and a cluster is computed as follows. Let D be a document and KD be the set of its keywords. Let Ci be a cluster; Ci is composed of a set of keywords KCi and a set of users UCi. We compute the covering rate:

    CR(D, Ci) = (|KD ∩ KCi| / |KD|) × 100
Let mincr be the minimum threshold of the covering rate. If CR(D, Ci) ≥ mincr, then we recommend the document D to the users UCi. Let us take an example: a document KD = {fishing hunting england nature river rod}, and the following clusters:

– KC1 = {fishing hunting internet java}, CR=33%.
– KC2 = {fishing england}, CR=33%.
– KC3 = {fishing hunting england internet java programming}, CR=50%.
– KC4 = {fishing}, CR=16%.
– KC5 = {internet java}, CR=0%.
In this example, we have the following order: C3 > C1, C2 > C4, and C5 is discarded. Let us remark that the measure used (CR) is adapted to the problem, because in a cluster, keywords can refer to different topics. For instance, if a set of users are interested in fishing and programming, it is possible to
have a corresponding cluster like C3 . This point does not have to influence the covering rate. For this reason, we select this measure which depends on the common keywords between the document and the cluster, and on the number of the keywords of the document. CR does not depend on the number or the composition of the keywords set of the cluster. The other classical measures like Jaccard, Dice, Cosine, are not adapted to our problem. The possible mixing of topics does not influence the recommendations, but mincr does not have to be too high, because the number of keywords for a cluster is free, and for a document, it is fixed. Another remark, if a user is very interested in C++ and if he is the only one, we do not detect this. We take into account the common interests shared by the group.
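The covering-rate test is simple to implement. The sketch below is illustrative only; the flat keyword sets and the (keywords, users) pair per cluster are our own data-structure choices.

```python
def covering_rate(doc_keywords, cluster_keywords):
    """CR(D, Ci) = |KD intersect KCi| / |KD| * 100."""
    kd = set(doc_keywords)
    return 100.0 * len(kd & set(cluster_keywords)) / len(kd) if kd else 0.0

def recommend(doc_keywords, clusters, mincr):
    """Return the users of every cluster whose covering rate reaches mincr.

    clusters: list of (cluster_keywords, cluster_users) pairs.
    """
    recommended = set()
    for keywords, users in clusters:
        if covering_rate(doc_keywords, keywords) >= mincr:
            recommended.update(users)
    return recommended

# The running example: KD = {fishing hunting england nature river rod}
kd = {"fishing", "hunting", "england", "nature", "river", "rod"}
c3 = {"fishing", "hunting", "england", "internet", "java", "programming"}
print(round(covering_rate(kd, c3)))  # 50
```

With mincr set to 20, as in the experiments of Section 5, clusters C1, C2 and C3 of the example would trigger a recommendation, while C4 and C5 would not.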
5 Experimentation
In order to evaluate the recommendations, we used proxy server logs coming from France Telecom R&D. This data contains 147 users and 8,727 items. Items are keywords of the HTML pages browsed by the 147 users of a proxy-cache over a period of 1 month. 24,278 pages were viewed. For every page, we extracted a maximum of 10 keywords with an extractor (developed at France Telecom R&D) based on the frequency of significant words. Let L be the proxy server log. For a document D in L, we determine the users interested in D (noted UsersR(D)) by using the previously discovered clusters. Then, we check, by using the logs, whether the users who have consulted the document (noted Users(D)) are present in UsersR(D). Let us remark that we do not use a web server where the sets of documents and of keywords are known and relatively stable over time. For a proxy server, the set of documents and especially the set of keywords can be totally different between two periods. So we used the same period to discover the clusters and the recommendations for a first evaluation without human feedback. We use the following measures to evaluate the results:

    failure(D) = |Users(D) − UsersR(D)| / |Users(D)|

    r_hit(D) = |UsersR(D) ∩ Users(D)| / |UsersR(D)|
The failure rate evaluates the percentage of users who consulted a document that has not been recommended to them. The r_hit value (recommendation hit) measures the percentage of users indicated in the recommendations of a document who really consulted it. We set minfr to 10%. This corresponds to a minimal number of 14 users per cluster. The number of frequent closed itemsets is 454,043. We set M to 1 in order to capture the maximum of different centres of interest (overlapping between clusters); we find 45 clusters (the average number of users per cluster is 21).
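The two evaluation measures above are straightforward to compute from the log; the sketch below is our own set-based formulation.

```python
def failure(users_consulted, users_recommended):
    """|Users(D) - UsersR(D)| / |Users(D)|: consulters the system failed to reach."""
    consulted = set(users_consulted)
    return len(consulted - set(users_recommended)) / len(consulted) if consulted else 0.0

def r_hit(users_consulted, users_recommended):
    """|UsersR(D) & Users(D)| / |UsersR(D)|: recommended users who really consulted D."""
    recommended = set(users_recommended)
    return len(recommended & set(users_consulted)) / len(recommended) if recommended else 0.0
```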
Fig. 1. Distribution of the documents according to the failure rate, mincr=20%

Fig. 2. r_hit value according to the rank of the documents, mincr=20%
Let us note that here our aim is not to study the impact of the parameters; this has already been done in [2]. The choice of the mincr value is not easy. The mincr value especially influences the number of recommended documents: the higher the mincr value is, the lower the number of recommended documents is. Too many recommendations make the system impractical. We need a compromise between the number of recommended documents and, as we could guess, the quality of the system. For the evaluation, we did not really deliver recommendations to users; we just evaluated the accuracy of our recommendations. So we used a relatively low value of mincr in order to have a lot of recommendations. We set mincr to 20%. The system recommended 11,948 documents (49.2% of the total). In Figure 1, we remark that 80% of the documents (among the 11,948) are well recommended, and we have only 16% of failure. We ranked the documents according to the r_hit values and obtained Figure 2. We can deduce from the r_hit measure that the number of users who are in the results but have not consulted the document is not null. We found more users; maybe they would have been interested, but we cannot verify it. It would be necessary to have human feedback.
6 Conclusion
We have presented a recommendation system based on the discovery of meaningful clusters of users according to the content of their consulted documents. Our method of clusters discovery captures the various centres of interest of the users thanks to the possibility of having a user in several clusters, and thus of retrieving him with several kinds of queries. We provided recommendations of documents using the discovered clusters. We evaluated our method on proxy server logs (not usually done in this application), and we obtained good results, which is encouraging for other experiments (with human feedback) and the development of our system. In future works, we will evaluate the second type
of possible recommendations i.e. based on the collaboration. We will also look for an incremental version of Ecclat in order to propose a system in pseudo real-time.
References

[1] M. Balabanovic. An Adaptive Web Page Recommendation Service. In the 1st International Conference on Autonomous Agents, pages 378-385, Marina del Rey, CA, USA, February 1997.
[2] N. Durand and B. Crémilleux. ECCLAT: a New Approach of Clusters Discovery in Categorical Data. In the 22nd Int. Conf. on Knowledge Based Systems and Applied Artificial Intelligence (ES'02), pages 177-190, Cambridge, UK, December 2002.
[3] D. Goldberg, D. Nichols, B.M. Oki, and D. Terry. Using Collaborative Filtering to Weave an Information Tapestry. Communication of the ACM, 35(12):61-70, 1992.
[4] T. Joachims, D. Freitag, and T. Mitchell. WebWatcher: A Tour Guide for the World Wide Web. In the 15th Int. Joint Conference on Artificial Intelligence (IJCAI'97), pages 770-775, Nagoya, Japan, August 1997.
[5] J.A. Konstan, B.N. Miller, D. Maltz, J.L. Herlocker, L.R. Gordon, and J. Riedl. GroupLens: Applying Collaborative Filtering to Usenet News. Communication of the ACM, 40(3):77-87, March 1997.
[6] H. Lieberman. Letizia: An Agent that Assists Web Browsing. In the Fourteenth International Joint Conference on Artificial Intelligence (IJCAI'95), pages 924-929, Montréal, Québec, Canada, August 1995.
[7] B. Mobasher, R. Cooley, and J. Srivastava. Creating Adaptive Web Sites through Usage-Based Clustering of URLs. In IEEE Knowledge and Data Engineering Exchange Workshop (KDEX99), Chicago, November 1999.
[8] A. Moukas. Amalthaea: Information Discovery and Filtering Using a Multi-Agent Evolving Ecosystem. International Journal of Applied Artificial Intelligence, 11(5):437-457, 1997.
[9] D.S.W. Ngu and X. Wu. SiteHelper: A Localized Agent that Helps Incremental Exploration of the World Wide Web. In the 6th International World Wide Web Conference, pages 691-700, Santa Clara, CA, 1997.
[10] N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Efficient Mining of Association Rules Using Closed Itemset Lattices. Information Systems, 24(1):25-46, Elsevier, 1999.
[11] M. Pazzani. A Framework for Collaborative, Content-Based and Demographic Filtering. Artificial Intelligence Review, 13(5):393-408, 1999.
[12] M. Pazzani, J. Muramatsu, and D. Billsus. Syskill & Webert: Identifying interesting web sites. In Proceedings of the 13th National Conference on Artificial Intelligence, pages 54-61, Portland, Oregon, 1996.
[13] M. Spiliopoulou. Web Usage Mining for Web site Evaluation. Com. of the ACM, 43(8):127-134, August 2000.
[14] D. Wilson, B. Smyth, and D. O'Sullivan. Improving Collaborative Personalized TV Services. In the 22nd Int. Conf. on Knowledge Based Systems and Applied Artificial Intelligence (ES'02), pages 265-278, Cambridge, UK, December 2002.
[15] Y.H. Wu, Y.C. Chen, and A.L.P. Chen. Enabling Personalized Recommendation on the Web Based on User Interests and Behaviors. In the 11th Int. Workshop on Research Issues in Data Engineering (RIDE-DM 2001), pages 17-24, Heidelberg, Germany, April 2001.
A Formal Framework for Combining Evidence in an Information Retrieval Domain

Josephine Griffith and Colm O'Riordan
Dept. of Information Technology, National University of Ireland, Galway, Ireland
Abstract. This paper presents a formal framework for the combination of multiple sources of evidence in an information retrieval domain. Previous approaches which have included additional information and evidence have primarily done so in an ad-hoc manner. In the proposed framework, collaborative and content information regarding both the document data and the user data is formally specified. Furthermore, the notion of user sessions is included in the framework. A sample instantiation of the framework is provided.
1 Introduction
Information retrieval is a well-established field, wherein a relatively large body of well-accepted methods and models exists. Formal models of information retrieval have been proposed by Baeza-Yates and Ribeiro-Neto [1], Dominich [6] and van Rijsbergen [14]. These are useful in the development, analysis and comparison of approaches and systems. Various changes in the information retrieval paradigm have occurred, including, among others, the move towards browsing and navigation-based interaction rather than the more traditional batch mode interaction; the use of multiple retrieval models; and the use of multiple representations of the available information. Recent trends, particularly in the web-search arena, indicate that users will usually combine the process of querying with browsing, providing feedback and query reformulation until the information need is satisfied (or until the user stops searching). Although many sources of evidence are usually available, many information retrieval models do not incorporate numerous sources of evidence (except for some ad-hoc approaches and several approaches using inference networks [13]). It has been shown that different retrieval models will retrieve different documents and that multiple representations of the same object (e.g. different query representations) can provide a better understanding of notions of relevance [5]. Within the field of information retrieval, the sub-field of collaborative filtering has emerged. Many parallels and similarities exist between the two approaches: in each process the goal is to provide timely and relevant information to users.
However, different evidence is used in these two processes to identify this relevant information. Collaborative filtering models predominately consider social or collaborative data only [10]. In various situations, collaborative filtering, on its own, is not adequate, e.g. when no ratings exist for an item; when new users join a system; or when a user is not similar to any other users in the dataset. One way to overcome these limitations is to utilise any available content information. However, formal approaches to integrate content information and collaborative information have not, to date, been fully investigated. In traditional approaches, a single mathematical model is typically used in isolation. Additions, which are not expressible in that model, are added as ad-hoc extensions. Basu et al. claim that “there are many factors which may influence a person in making choices, and ideally, one would like to model as many of these factors as possible in a recommendation system”[3]. Although the semantics and stages of information retrieval and collaborative filtering have been studied in detail in isolation, we believe there is a need to unify the varying paradigms into a single complete framework. Currently, a myriad of ad-hoc systems exist which obscures accurate comparisons of systems and approaches. Producing a formal framework to combine different sources of evidence would be beneficial as it would allow the categorisation of existing and future systems; would allow comparison at system design level; and would provide a blue-print from which to build one’s own designs and systems. This paper presents a formal framework for information retrieval systems. The framework caters for the incorporation of traditionally disjoint approaches (collaborative filtering and information retrieval) and moves towards accommodating recent trends in the information retrieval paradigm. Section 2 gives a brief overview of information retrieval and collaborative filtering. Section 3 details the proposed framework which combines collaborative and content information. Section 4 presents an instantiation of the framework and conclusions are presented in Section 5.
2 Related Approaches: Information Retrieval and Collaborative Filtering
The aim of information retrieval and filtering is to return a set of documents to a user based on a ranked similarity between a user’s information need (represented by a query) and a set of documents. Baeza-Yates and Ribeiro-Neto [1] proposed the following representation of an information retrieval model: [D, Q, F , R(qi , dj )] where D is a set of document representations; Q is a set of query representations; F is a framework for modelling these representations (D and Q) and the relationship between D and Q; and R(qi , dj ) is a ranking function which associates a real number with qi and dj such that an ordering is defined among
the documents with respect to the query qi . Well known instantiations include the Boolean model and the vector space model. Numerous other instantiations have been explored. While useful, the Baeza-Yates and Ribeiro-Neto model does not address other important aspects of the information retrieval cycle. These include the notions of feedback, a browsing paradigm, and higher level relationships (between documents and between users). Collaborative filtering produces recommendations for some active user using the ratings of other users (these ratings can be explicitly or implicitly gathered) where these users have similar preferences to the active user. Typical or “traditional” collaborative filtering algorithms use standard statistical measures to calculate the similarity between users (e.g. Spearman correlation, Pearson correlation, etc.) [10]. Other approaches have also been used [8]. Within the information retrieval and filtering domains, approaches have been developed for combining different types of content and combining results from different information retrieval systems [5] and recently, for combining content and web link information [9]. Approaches combining multiple sources of relevance feedback information have also been investigated [12]. Several authors suggest methods for combining content with collaborative information, including [2], [3], [4], [11].
3 Proposed Framework
To model extra information, the model proposed by Baeza-Yates and Ribeiro-Neto is first slightly modified and then mapped to the collaborative filtering case. The two models (content and collaborative) are then combined.

3.1 Collaborative Filtering Model
A model for collaborative filtering can be defined by: < U, I, P, M, R(I, u) > where U is a set of users; I is a set of items; P is a matrix of dimension |U| × |I| containing ratings by users U for items I, where a value in the matrix is referenced by pui; M is the collaborative filtering model used (e.g. correlation methods; probabilistic models; machine learning approaches); and R(I, u) returns a ranking of a set of items1 based on P, given user u.

1 This differs from R as described in the previous section, where R returns a real number for a document/query pair. In our model R returns a ranking of the entire document set with respect to a query.
3.2 An Information Retrieval and Collaborative Filtering Framework
Combining both the information retrieval and collaborative filtering models, a framework is provided where the following can be specified:

< U, I, P, A, V, Q, R(I, u, q), G >

with the terms U, I, P as defined previously in the collaborative filtering model. In addition, the following are required:

– A: a list of attributes for each item in I, i.e. [a1, a2, ..., an].
– V: a matrix of dimension |I| × |A| containing the associated values for each attribute a of each item in I. These are not necessarily atomic and may be of arbitrary complexity.
– Q: a set of user queries where a query q is defined as a list of weighted values for attributes, i.e. [(val1, w1), (val2, w2), ...].
– R(I, u, q): specifies a ranking of the items I with respect to (a) the similarity to q, and (b) the evidence in P for user u.
– G: models the components and the relationships between them.

Based on the explicit information (i.e. U, I, P, A, V, Q), a number of functions can be defined. Let b = {1, ..., u} correspond to the number of users in U; b = {1, ..., i} correspond to the number of items in I; and b = {1, ..., n} correspond to the number of attributes in A. Then:

1. h1: P × b → P(U), a mapping of P, for a user u, onto the power set of U, i.e. a group of users in the same neighbourhood as u. This corresponds to traditional memory-based collaborative filtering approaches [10]; a sketch is given below.
2. h2: P × b → P(I), a mapping of P, for an item i, onto the power set of I, i.e. a group of items which are similar to an item i. This corresponds to item-item based collaborative filtering approaches as proposed in [7].
3. h3: V × b → P(I), a mapping of V, for an item i, onto the power set of I, i.e. a cluster of items which are similar to an item i. Standard clustering approaches can be used.
4. h4: V × b → P(A), a mapping of V, for an attribute a, onto the power set of A, i.e. a cluster of attributes which are similar to some a. Such evidence has not been traditionally used in information retrieval, but has some parallels in data mining.
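The sketch below illustrates one possible realisation of h1: given the rating matrix P, it returns the neighbourhood of a user via Pearson correlation. The correlation measure and the threshold are our choices, made for illustration; the framework deliberately leaves them open.

```python
import math

def pearson(ratings_a, ratings_b):
    """Pearson correlation over the items both users have rated (None = unrated)."""
    common = [i for i, (a, b) in enumerate(zip(ratings_a, ratings_b))
              if a is not None and b is not None]
    if len(common) < 2:
        return 0.0
    a = [ratings_a[i] for i in common]
    b = [ratings_b[i] for i in common]
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return num / den if den else 0.0

def h1(P, u, threshold=0.5):
    """Map user index u onto the subset of U whose rating rows correlate with u's."""
    return {v for v in range(len(P)) if v != u and pearson(P[u], P[v]) >= threshold}
```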
3.3 Extending the Framework to Include Further Evidence
Other information also needs to be captured, including the concepts of feedback and session histories, giving:

< U, I, P, A, V, Q, R(I, u, q), G, fb, Ses, History >
Without concerning ourselves with details of how feedback information is gathered (whether explicitly, implicitly or both), we can see feedback as providing a mapping from one state, s, to another, i.e. fb: s → s′, where a state s is defined as an instantiation of the information in the system. This typically involves changing the values and weights of attributes in the query q, any value in P, and any higher level relationships affected by the feedback2. For the purpose of this paper, we view feedback as changing a subset of the values in q and P to give q′ and P′. A session, Ses, can be defined as a sequence of such mappings such that the state following fb_t does not radically differ from the state following fb_{t+1}, for all t. This necessitates having some threshold, τ, on the measure of similarity between successive states. A history, History, of sessions is maintained per user such that the end of one session can be distinguished from the start of a new session, i.e. the similarity between the final state of one session and the initial state of the next session is lower than the aforementioned threshold τ. Of course, a user should also be able to indicate that a new session is beginning. Given fb, Ses, and History, other higher level relationships could be derived. In particular one could identify user sessions; identify frequent information needs per user; and identify groups of similar users based on queries, behaviour etc. Again, there exist many approaches to define, implement and use this information (e.g. data mining past behaviours). Many possible instantiations of the given framework exist, which specify how the components are defined and combined. We will now consider a possible instantiation to deal with a well understood domain.

2 In many systems less information is used, but one could envision other sources of information being used.
4 Sample Instantiation
We provide a possible instantiation of the given framework. Consider, for example, a movie recommender domain with the following being considered:

– U is a set of users, seeking recommendations on movies.
– I is a set of items (movies), represented by some identifier.
– P is the collaborative filtering matrix with ratings by the system or user, for items in I for users in U.
– A is a list of attributes [a1, a2, ..., an] associated with the items in I, for example: [title, year, director, actors, description, ...].
– V is a matrix with associated values for each attribute a of each item i. The value of attribute a for item i is referred to by via.
– Q is the set of user queries where a query q is a list of weighted values for attributes, i.e. [(val1, w1), (val2, w2), ...], where all weights are real numbers in the range [0, 1].
– R(I, u, q) returns a ranking of the items in I with respect to their similarity to some query q and also based on evidence in P.
– G models the components and the relationships between them.
– Feedback, session information and history are not used in this current instantiation.

There exist many means to calculate R(I, u, q). One reasonably intuitive approach is to find the similarity between the attribute values of I and the attribute values in the user query, q, and also to find, if available, the associated collaborative filtering value for each of the items. Constants (α and β) can be used which allow the relative importance of the two approaches to be specified3, giving:

    R(I, u, q) = (α/(α+β)) · sim_content(I, q) + (β/(β+α)) · C(P)

where sim_content(I, q) is the content-based approach (information and/or data retrieval) and C(P) is the collaborative-based approach using P. The function sim_content(I, q) returns a similarity-based ranking where, for each i in I, the similarity to the query q is defined as:

    Σ_{j=1..n} (sim(vij, valj) × wj) / Σ_{j=1..n} wj
where n is the number of attributes and wj is the user-assigned weighting of the j-th attribute in q (the default is 1 if the user does not specify a weighting, indicating that all attributes are equally important). For each attribute in the query, sim(vij, valj) returns a number indicating the similarity between vij and valj. This can be calculated using an approach suitable to the domain, e.g. a Boolean approach or Euclidean distance. The similarities should be normalised to be in the range [0, 1]. A collaborative filtering module, C(P), will produce values for items in I for a user, u, based on the prior ratings of that user u and the ratings of similar users. Any standard collaborative filtering approach can be used. Again, the range should be constrained appropriately.

3 Note that if α = β, then the content and collaborative approaches have equal importance; otherwise, one has greater importance than the other.
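The combined ranking of this instantiation can be sketched as below. This is only an illustration under our own data-structure assumptions: the collaborative module C(P) is abstracted as a dictionary of precomputed predictions for the active user, and the Boolean attribute similarity is just one of the options the text mentions.

```python
def similarity_content(item_values, query, sim):
    """Weighted attribute similarity: sum(sim(v_ij, val_j) * w_j) / sum(w_j)."""
    weighted = sum(sim(item_values.get(attr), val) * w for attr, (val, w) in query.items())
    total = sum(w for _, w in query.values())
    return weighted / total if total else 0.0

def rank_items(items, query, cf_scores, sim, alpha=1.0, beta=1.0):
    """R(I, u, q): blend the content score with the collaborative score C(P) for user u.

    items: dict item_id -> {attribute: value}; query: dict attribute -> (value, weight);
    cf_scores: dict item_id -> collaborative prediction in [0, 1] for the active user.
    """
    a, b = alpha / (alpha + beta), beta / (alpha + beta)
    scored = {i: a * similarity_content(vals, query, sim) + b * cf_scores.get(i, 0.0)
              for i, vals in items.items()}
    return sorted(scored, key=scored.get, reverse=True)

# One possible attribute similarity: the Boolean approach mentioned above.
boolean_sim = lambda v, val: 1.0 if v == val else 0.0
```

With α = β the two sources of evidence weigh equally, matching footnote 3.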
5 Conclusions

Within information retrieval, there exist many approaches to providing relevant information in a timely manner (e.g. content filtering, collaborative filtering, etc.). Recent changes in the information retrieval paradigm indicate that users intersperse query formulation, feedback and browsing in the search for relevant information.
Although formal models and frameworks exist, there has been a lack of frameworks to formally capture the different approaches and include recent changes in the paradigm. In this paper we provide such a framework which captures content-based information, collaborative-based information, and notions of feedback and user sessions. We argue that such a framework allows for the comparison of various approaches and provides a blue-print for system design and implementation. We also provide a sample instantiation. Future work will involve more detailed modelling of possible instantiations making use of all components in the framework as well as some of the higher level relationships.
References

[1] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. ACM Press, 1999.
[2] M. Balabanovic and Y. Shoham. Fab: Content-based, collaborative recommendation. Communications of the ACM, 40(3):66-72, 1997.
[3] C. Basu, H. Hirsh, and W. Cohen. Recommendation as classification: Using social and content-based information in recommendation. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98), pages 714-721, 1998.
[4] M. Claypool, A. Gokhale, T. Miranda, P. Murnikov, D. Netes, and M. Sartin. Combining content-based and collaborative filters in an online newspaper. In Online Proceedings of the ACM SIGIR '99 Workshop on Recommender Systems: Algorithms and Evaluation, University of California, Berkeley, 1999.
[5] W. B. Croft. Combining approaches to information retrieval. Kluwer Academic Publishers, 2000.
[6] S. Dominich. Mathematical Foundations of Information Retrieval. Kluwer Academic Publishers, 2001.
[7] D. Fisk. An application of social filtering to movie recommendation. In H.S. Nwana and N. Azarmi, editors, Software Agents and Soft Computing. Springer, 1997.
[8] J. Griffith and C. O'Riordan. Non-traditional collaborative filtering techniques. Technical report, Dept. of Information Technology, NUI, Galway, Ireland, 2002.
[9] R. Jin and S. Dumais. Probabilistic combination of content and links. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 402-403, 2001.
[10] P. Resnick, N. Iacovou, M. Suchak, P. Bergstrom, and J. Riedl. GroupLens: An open architecture for collaborative filtering of netnews. In Proceedings of ACM 1994 Conference on CSCW, pages 175-186, Chapel Hill, 1994.
[11] C. O'Riordan and H. Sorensen. Multi-agent based collaborative filtering. In M. Klusch et al., editor, Cooperative Information Agents 99, Lecture Notes in Artificial Intelligence, 1999.
[12] I. Ruthven and M. Lalmas. Selective relevance feedback using term characteristics. In Proceedings of the 3rd International Conference on Conceptions of Library Information Science, CoLIS 3, 1999.
[13] H.R. Turtle and W.B. Croft. Evaluation of an inference network-based retrieval model. ACM Trans. on Info. Systems, 3, 1991.
[14] C.J. van Rijsbergen. Towards an information logic. In ACM SIGIR Conference on Research and Development in Information Retrieval, 1989.
Managing Articles Sent to the Conference Organizer

Yousef Abuzir
Salfeet Study Center, Al-Quds Open University, Salfeet, Palestine
[email protected], [email protected]
Abstract. The flood of articles or emails to the conference/session organizer is an interesting problem. It becomes important to correctly organize these electronic documents or emails by their similarity. A thesaurus-based classification system can be used to simplify this task. This system provides the functions to index and retrieve a collection of electronic documents and to classify them based on a thesaurus. Classifying these electronic articles or emails gives the organizer the opportunity to select and send the right articles to the reviewers, taking into account their profiles or interests. By automatically indexing the articles based on a thesaurus, the system can easily select relevant articles according to user profiles and send an email message containing the articles to the reviewer.
1 Introduction
Organizing and delivering electronic articles by their similarity and classifying them is subject to restrictions: the job has to be done fast, for instance when managing the flow of articles coming in to the organizer of a conference or the chairman of a session. A thesaurus-based classification system can simplify this task. This system provides the functions to index and retrieve a collection of electronic documents and to classify them based on a thesaurus. The thesaurus is used not only for indexing and retrieving messages, but also for classifying them. By automatically indexing the email messages and/or the electronic articles using a thesaurus, the conference organizer can easily locate the related articles or messages and find their topic. The assignment of submitted manuscripts to reviewers is a common task in the scientific community and is an important part of the duties of journal editors, conference program chairs, and research councils. For conference submission, however, the reviews and review assignments must be completed under severe time pressure, with a very large number of submissions arriving near the announced deadline, making it difficult to plan the review assignments much in advance. These dual problems of large volume and limited time make the assignment of submitted manuscripts to reviewers a complicated job that has traditionally been handled by a single person (or at most a few people) under quite stressful conditions. Also, manual review assignment is only possible if the person doing the assignment
knows all the members of the review committee and their respective areas of expertise. As some conferences grow in scope with respect to the number of submissions and reviewers as well as the number of subdomains of their fields, it is desirable to develop automated means of assigning the submitted manuscripts to the appropriate members of the review committee. In this paper we suggest an application involving the classification of the articles or emails sent by authors to the conference organizer. For this application to be realized it was necessary to develop the Database and Expert Systems Applications (DEXA) thesaurus, which is used for indexing and classification of these electronic documents. In this paper, we describe an approach to extract and structure terms from the web pages of the conference to construct a domain-independent thesaurus. The tool ThesWB [1], [2] is used to construct a thesaurus from the HTML pages. Web documents contain rich knowledge that describes the documents' content. These pages can be viewed as a body of text containing two fundamentally different types of data: the contents and the tags. We analyzed the nature of the web content and metadata in relation to the requirements for thesaurus construction. After creating the thesaurus, its performance was tested with our toolkit TDOCS [3], [4]. The system was used to classify a sample of electronic documents and emails in a cache directory containing electronic articles related to the conference. However, due to the sheer number of articles published, it is a time-consuming task to select the most interesting ones for a reviewer. Therefore, a method of article categorization is useful for obtaining relevant information quickly. We have been researching methods of automated electronic article classification into predefined categories for the purpose of developing an article-delivering-service system, which sends the right articles to those reviewers who are interested in a particular topic. The contents of this paper are divided into six sections. In Section 2 a general introduction and a description of user profiling are presented. Section 3, the thesaurus technology, describes how the thesaurus has been created to support the classification and user profiling functions. Section 4 describes the basic functionality of the article classification and delivering system. In Section 5, an experiment and evaluation, we test the classification and delivering functions using an example collection of articles. The last section presents the conclusions.
2
General Overview
With this system, conference organizers can get support for archiving, indexing, and classifying these articles into different classes or topics based on a thesaurus (Fig. 1). The process is initiated by an author writing a paper and submitting it over the Internet by email to the conference organizer. The conference organizer receives the paper and uses the system to index and classify it based on the thesaurus. Once the paper has been classified, it is mapped to the interests of the reviewers (the user profiles of the reviewers) so that it can be sent to an appropriate reviewer. Once the paper is complete, there needs to be a mechanism for submitting it to the system. This is accomplished by sending it as an e-mail attachment. The e-mail should follow a special structure that contains the abstract and a list of keywords related to the paper being submitted.
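For illustration, a submission message with such a structure could be picked apart with a few lines of standard-library code. This is only a sketch: the "Abstract:" and "Keywords:" markers, the sender address and the message text are invented conventions, not ones prescribed by the system described here.

```python
from email import message_from_string

# A hypothetical submission message following the assumed structure.
raw = """From: author@example.org
Subject: Paper submission
Content-Type: text/plain

Abstract: A thesaurus-based approach to classifying conference submissions.
Keywords: thesaurus, classification, information retrieval
"""

msg = message_from_string(raw)
body = msg.get_payload()

# Pull the assumed 'Abstract:' and 'Keywords:' fields out of the body.
fields = {}
for line in body.splitlines():
    if ":" in line:
        key, _, value = line.partition(":")
        fields[key.strip().lower()] = value.strip()

keywords = [kw.strip() for kw in fields.get("keywords", "").split(",") if kw.strip()]
print(msg["Subject"], keywords)
# Paper submission ['thesaurus', 'classification', 'information retrieval']
```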
(Figure: Author, Organizer, TDOCS System, Reviewer, Conference Proceedings)
Fig. 1. Articles flow in the system
When a paper is received by email, it is tested to ensure that it conforms to the required document structure. An email is sent to the author, indicating that the paper has been received. If the paper is accepted by the system, then an email is sent to appropriate reviewers asking if they can review it. (Appropriate reviewers may be nominated by the system based on keywords, or perhaps on reference citations.) At this stage the review process starts. The system should provide the organizer with the necessary tools to track the papers currently under review. This should include the paper name and author, the names of the reviewers, when the paper was sent for review, the reviewer's summary opinion when it is available, and links to detailed notes. The organizer will receive the review. Based on the review, the organizer will decide to either accept the paper or reject it.
2.1
User Profiles
A user profile is a collection of information that describes a user [5]. It may be defined as a set of keywords which describe the information in which the user is interested. With a user profile the user can set certain criteria of preference and ask for articles on specific topics. In our case the profiling is done from user-defined criteria, by collecting keywords from the reviewer's email messages sent to the conference. With the use of classification techniques based on the thesaurus, the selection of articles adapts to the reviewer's needs and interests according to his/her profile.
2.2
User Profile Creation
The user profile of a reviewer is constructed as shown in Fig. 2. First, the email address of the sender and the other fields are extracted from the email header lines. The body of the email message is then extracted and parsed by the TDOCS system. As a result of the indexing process, all keywords from the message and the concepts derived from the thesaurus are stored in a database that reflects the user's interests. The thesaurus is thus used to create the user profile from received email messages. These messages include many common terms that appear in the thesaurus, and these terms reflect the interests of the reviewers.
(Figure: email messages are indexed by TDOCS against the DEXA DB; the resulting keyword and concept DBs feed the reviewer's User Profile DB, with ThesWB used for thesaurus maintenance)
Fig. 2. User-Profile creation from the email message of reviewer
3
Constructing the DEXA Thesaurus
In order to turn Information Retrieval systems into more useful tools for both the professional and the general user, one usually tends to enrich them with more intelligence by integrating information structures, such as thesauri. There are three approaches to constructing a thesaurus. The first approach, designing a thesaurus from a document collection, is a standard one [6], [7]. By applying statistical [6] or linguistic [8], [9] procedures we can identify important terms as well as their significant relationships. The second approach is merging existing thesauri [10]. The third, automatic, approach is based on tools from expert systems [11]. Our experiment to construct the Database and Expert Systems Applications (DEXA) Thesaurus is based on selecting web pages that are well representative of the domain. The web pages we selected were a sample of pages related to calls for papers. We start by parsing those web pages using ThesWB [1], [2]. The parsing process generates a list of terms represented in a hierarchical structure. An HTML document can be viewed as a structure of different nodes and can be parsed into a tree. The tree structure is constructed by parsing the tags and the corresponding content. Structural tags can be used to recover the layout structure of the HTML page. For example, the tree structure consists of a Head node and a Body node. The Head is further divided into TITLE, META NAME="KEYWORD", etc. The Body node has lower sub-levels defined by further tags. During the parsing process, ThesWB applies text extraction rules for each type of tag; the extraction rule for each tag is applied until all tags have been extracted. The second step is to eliminate noisy terms and relationships between terms from the list. The list in Fig. 3 shows a sample of the new list after removing the noisy terms. Later, we convert this list to the thesaurus using the ThesWB Converter Tool.
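To make the tag-tree idea concrete, the following minimal sketch (not the actual ThesWB implementation) shows how nested HTML headings and list items could be turned into parent-child term pairs using Python's standard html.parser. The mapping from tag names to hierarchy levels is an assumption made purely for illustration.

```python
from html.parser import HTMLParser

# Tags assumed (for illustration) to signal increasingly specific terms.
LEVEL_TAGS = {"h1": 1, "h2": 2, "h3": 3, "li": 4}

class TermTreeParser(HTMLParser):
    """Collect (broader_term, narrower_term) pairs from nested tags."""
    def __init__(self):
        super().__init__()
        self.stack = []            # [(level, term), ...] path from the root
        self.current_level = None
        self.pairs = []            # extracted hierarchical relationships

    def handle_starttag(self, tag, attrs):
        if tag in LEVEL_TAGS:
            self.current_level = LEVEL_TAGS[tag]

    def handle_data(self, data):
        term = data.strip()
        if self.current_level is None or not term:
            return
        # Drop siblings and deeper nodes; keep only broader terms on the stack.
        while self.stack and self.stack[-1][0] >= self.current_level:
            self.stack.pop()
        if self.stack:
            self.pairs.append((self.stack[-1][1], term))
        self.stack.append((self.current_level, term))
        self.current_level = None

page = "<h1>Expert Systems</h1><h2>Knowledge Bases</h2><li>Rule Bases</li>"
parser = TermTreeParser()
parser.feed(page)
print(parser.pairs)
# [('Expert Systems', 'Knowledge Bases'), ('Knowledge Bases', 'Rule Bases')]
```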
Fig. 3. A sample of terms having hierarchical relationship extracted by ThesWB from links in http://www.dexa.org/
The Database and Expert Systems Applications (DEXA) Thesaurus has been designed primarily for indexing and classifying articles sent by email to the reviewers of a conference for evaluation. The thesaurus provides a core terminology in the field of Computer Science. The draft version contains 141 terms.
4
Article Management System
In this paper we use the thesaurus to classify the articles into the concepts of a subject hierarchy. In our application, all articles are classified into concepts. Our classification approach uses the DEXA Thesaurus as a reference thesaurus. Each article is automatically classified into the best matching concepts in the DEXA Thesaurus. The TDOCS system assigns a weight to each concept, and this weight can be used as a selection criterion for the best matching concepts for an article. The system compares the articles to a list of keywords that describe a reviewer in order to classify the articles as relevant or irrelevant (Fig. 4). The system then uses the user profile to nominate an appropriate reviewer for each article.
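As a rough illustration of this weighting step (the concrete TDOCS scoring scheme is not given in the paper), the sketch below counts how often each thesaurus concept, via its indicator terms, appears in an article and ranks concepts by that weight. The toy thesaurus and the counting rule are assumptions made for the example.

```python
# Minimal sketch of thesaurus-based classification with a toy thesaurus that
# maps each concept to the terms indicating it (concept name plus narrower terms).
THESAURUS = {
    "databases": ["database", "sql", "query processing"],
    "expert systems": ["expert system", "knowledge base", "inference engine"],
    "information retrieval": ["information retrieval", "indexing", "thesaurus"],
}

def concept_weights(article_text: str) -> dict:
    """Weight each concept by the number of indicator-term occurrences."""
    text = article_text.lower()
    weights = {}
    for concept, terms in THESAURUS.items():
        score = sum(text.count(term) for term in terms)
        if score > 0:
            weights[concept] = score
    return weights

def best_concepts(article_text: str, top_n: int = 2) -> list:
    """Return the top_n best-matching concepts for an article."""
    ranked = sorted(concept_weights(article_text).items(),
                    key=lambda item: item[1], reverse=True)
    return [concept for concept, _ in ranked[:top_n]]

abstract = ("We present an inference engine over a knowledge base "
            "with thesaurus-supported indexing of documents.")
print(best_concepts(abstract))  # e.g. ['expert systems', 'information retrieval']
```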
5
An Experiment and Evaluation
To classify articles, a thesaurus is used; the thesaurus can reflect the interests of a reviewer as well as the main topic of an article. The thesaurus is used not only for indexing and retrieving messages, but also for classifying articles [12]. We evaluated this classification method using the TDOCS Toolkit and a collection of articles. The collection includes abstracts and keywords related to the database and expert system domain in general. TDOCS parses the articles and indexes them using the thesaurus. Thereafter the organizer can use the Document Search environment to retrieve the related articles.
(Figure: articles are indexed by TDOCS against the DEXA DB into index, keyword and concept DBs; the Delivery System matches them to the reviewer's User Profile DB)
Fig. 4. An overview of the Conference articles Delivering-Service
The task of the delivery system is to get a collection of articles delivered to a user. The ultimate goal of the system is to select the articles that best reflect the reviewer's interests. We used TDOCS as a tool to index these articles. The result of the indexing can be used to classify the articles according to the main root terms or other concepts that reflect the different topics. The reviewer's interests are matched against these database results to select the articles that reflect his/her interests. We used a VC++ API application to map the interests of each user to the indexing result and obtain the articles that best reflect his/her interests. The system then delivers these articles by electronic mail to that reviewer. The proposed approach has already been put into practice. A sample of 20 articles was automatically indexed using the DEXA Thesaurus, and the results were manually evaluated. The test results showed that a good indexing quality was achieved. We took the articles from a cache directory. The batch process to index all the articles takes about 15 seconds; it takes about 0.75 seconds to classify each article, compared to the 1-2 minutes a human indexer needs. The system provides fully automatic creation of structured user profiles. It explores ways of incorporating users' interests into the parsing process to improve the results. The user profiles are structured as a concept hierarchy. The profiles are shown to converge and to reflect the actual interests of the reviewer.
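The matching step itself can be as simple as an overlap score between the concepts in a reviewer's profile and the concepts assigned to an article. The sketch below is an assumption for illustration, not the paper's VC++ implementation; it ranks reviewers for one article by that overlap.

```python
# Minimal sketch of mapping reviewer profiles to indexed articles.
# Profiles and article concepts are assumed to be sets of thesaurus concepts.
reviewer_profiles = {
    "reviewer_a": {"expert systems", "knowledge discovery"},
    "reviewer_b": {"information retrieval", "thesauri", "classification"},
}

def rank_reviewers(article_concepts: set, profiles: dict) -> list:
    """Rank reviewers by how many of their profile concepts the article covers."""
    scores = {name: len(article_concepts & interests)
              for name, interests in profiles.items()}
    return sorted(scores, key=scores.get, reverse=True)

article_concepts = {"information retrieval", "classification", "expert systems"}
ranking = rank_reviewers(article_concepts, reviewer_profiles)
print(ranking[0])  # 'reviewer_b' -- the best-matching reviewer for delivery
```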
6
Conclusions and Future Work
The increasing number of articles sent to conferences presents a rich area which can benefit immensely from an automatic classification approach. We present an approach to automated article classification through a thesaurus for the purpose of developing an article delivering-service system. This paper describes an experimental trial to test the feasibility of a thesaurus-based article classification and distribution system. The system seeks to identify keywords
and concepts that characterize articles and to classify these articles into pre-defined categories based on the hierarchical structure of the thesaurus. The explicit interests of a reviewer in his/her profile enable the system to predict articles of interest. The experimental results of our approach show that the use of a thesaurus contributes to improved accuracy and illustrate the improvements offered by the classification method. The DEXA Thesaurus is useful and effective for indexing and retrieval of electronic articles. Concept hierarchies in the DEXA Thesaurus were used to capture user profiles and to classify article content, and our experiment supports this use. To summarize, automatic article classification is an important problem nowadays. This paper proposes a thesaurus-based approach to classify and distribute articles. The experimental results indicate accurate results.
References
[1] Abuzir, Y. and Vandamme, F., ThesWB: Work Bench Tool for Automatic Thesaurus Construction, in Proceedings of the STarting Artificial Intelligence Researchers Symposium (STAIRS 2002), Lyon, France, July 22-23, 2002.
[2] Abuzir, Y. and Vandamme, F., ThesWB: A Tool for Thesaurus Construction from HTML Documents, in Workshop on Text Mining held in conjunction with the 6th Pacific-Asia Conference on Knowledge Discovery and Data Mining, Taipei, Taiwan, May 6, 2002.
[3] Abuzir, Y. and Vandamme, F., Automatic E-mail Classification Based on Thesaurus, in Proceedings of the Twentieth IASTED International Conference Applied Informatics (AI 2002), Innsbruck, Austria, 2002.
[4] Abuzir, Y., Vervenne, D., Kaczmarski, P. and Vandamme, F., TDOCS Thesauri for Concept-Based Document Retrieval, R&D Report BIKIT, BIKIT-LAE, University of Ghent, 1999.
[5] Jovanovic, D., A Survey of Internet Oriented Information Systems Based on Customer Profile and Customer Behavior, SSGRR 2001, L'Aquila, Italy, Aug. 06-12, 2001.
[6] Salton, G. and McGill, M.J., Introduction to Modern Information Retrieval, McGraw-Hill, New York, 1983.
[7] Crouch, C.J., An approach to the automatic construction of global thesauri, Information Processing & Management, 26(5): 629-640, 1990.
[8] Grefenstette, G., Use of syntactic context to produce term association lists for text retrieval, in SIGIR'92, pp. 89-97, 1992.
[9] Ruge, G., Experiments on linguistically based term associations, in RIAO'91, pp. 528-545, 1991.
[10] Sintichakis, M. and Constantopoulos, P., A Method for Monolingual Thesauri Merging, in Proc. 20th International Conference on Research and Development in Information Retrieval, ACM SIGIR, Philadelphia, PA, USA, July 1997.
[11] Güntzer, U., Jüttner, G., Seegmüller, G. and Sarre, F., Automatic Thesaurus Construction by Machine Learning from Retrieval Sessions, Information Processing & Management, Vol. 25, No. 3, pp. 265-273, 1989.
[12] Abuzir Y., Van Vosselen N., Gierts S., Kaczmarski P. and Vandamme F., MARIND Thesaurus for Indexing and Classifying Documents in the Field of Marine Transportation, accepted in MEET/ MARIND 2002, Oct. 6-1, Varna Bulgaria, 2002.
Information Retrieval Using Deep Natural Language Processing
Rossitza Setchi1, Qiao Tang1, and Lixin Cheng2
1 Systems Engineering Division, Cardiff University, Cardiff, UK, CF24 0YF
{Setchi,TangQ}@cf.ac.uk, http://www.mec.cf.ac.uk/i2s/
2 Automation Institute, Technology Center, Baosteel, Shanghai, China, 201900
[email protected], http://www.baosteel.com
Abstract. This paper addresses some problems of conventional information retrieval (IR) systems by suggesting an approach to information retrieval that uses deep natural language processing (NLP). The proposed client-side IR system employs the Head-Driven Phrase Structure Grammar (HPSG) formalism and uses Attribute-Value Matrices (AVMs) for information storage, representation and communication. The paper describes the architecture and the main processes of the system. The initial experimental results following the implementation of the HPSG processor show that the extraction of semantic information using the HPSG formalism is feasible.
1
Introduction
Studies indicate that the majority of web users find information by using traditional keyword-based search engines. Despite the enormous success of these search engines, however, they often fail to provide accurate, up-to-date, and personalised information. For example, traditional search engines can help in identifying entities of interest but they fail in determining the underlying concepts or the relationships between these entities [1]. Personalization is still a challenging task due to the lack of sufficient computational power needed to analyze the query history of individual users, identify frequently used concepts or knowledge domains, and re-rank the query results accordingly. Deep linguistic analysis of document collections is not performed for a similar reason, as it is much slower than conventional crawling and indexing [1]. These and other issues have motivated intensive research in the area of personalized information retrieval (IR) and filtering. A promising solution is offered by the emerging semantic web technologies that attempt to express the information contained on the web in a formal way. However, the transformation of billions of web
pages written in various natural languages into machine-readable form is a colossal task at present. In addition, the range of metadata formats used and the inconsistent ontologies employed are still major obstacles. The aim of this work is to address these issues by developing a system for personalized information retrieval that uses natural language processing (NLP) techniques. The rest of the paper is organized as follows. Section 2 provides background information on the NLP technique employed in this research. Section 3 introduces the proposed IR system by focusing on its architecture, main processes, implementation and testing. Finally, Section 4 provides conclusions and directions for further research.
2
Background
The NLP formalism used in this research is Head-Driven Phrase Structure Grammar (HPSG). It is a constraint-based, lexicalist grammar formalism for natural language processing with a logical grounding in typed feature structures [2]. Recently, HPSG has been used in building a large-scale English grammar environment [3]. The HPSG formalism comprises principles, grammar rules, lexical rules, and a lexicon. The principles are global rules that all syntactically well-formed sentences obey, while the grammar rules are applied to specific grammar structures. The lexical rules are employed when deriving words from lexemes. Finally, the lexicon contains lexical knowledge. The HPSG formalism is represented using Attribute-Value Matrices (AVMs), which are notations for describing feature structures [4]. In addition, concept maps are used as tools for organizing and representing knowledge. They are effective means for presenting concepts and expressing complex meaning [5].
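To give a flavour of how AVMs behave computationally, the following sketch (an illustration, not the system's actual Java implementation) represents feature structures as nested Python dictionaries and unifies them, failing on conflicting atomic values. Real HPSG feature structures are typed and allow structure sharing, which this simplified sketch omits.

```python
# Minimal sketch of attribute-value matrices (AVMs) as nested dicts,
# with unification as the core operation of an HPSG-style formalism.
class UnificationError(Exception):
    pass

def unify(avm1, avm2):
    """Recursively merge two feature structures; conflicting atoms fail."""
    if isinstance(avm1, dict) and isinstance(avm2, dict):
        result = dict(avm1)
        for feature, value in avm2.items():
            result[feature] = unify(result[feature], value) if feature in result else value
        return result
    if avm1 == avm2:
        return avm1
    raise UnificationError(f"cannot unify {avm1!r} with {avm2!r}")

# A lexical entry and a grammar constraint expressed as AVMs (invented example).
lexical = {"HEAD": {"POS": "verb", "FORM": "finite"}, "SUBJ": {"CASE": "nom"}}
constraint = {"HEAD": {"FORM": "finite"}, "SUBJ": {"AGR": {"NUM": "sg"}}}

print(unify(lexical, constraint))
# {'HEAD': {'POS': 'verb', 'FORM': 'finite'}, 'SUBJ': {'CASE': 'nom', 'AGR': {'NUM': 'sg'}}}
```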
3
An Information Retrieval System Using HPSG
3.1
Technical Approach
The idea underlying this research is to utilize deep NLP techniques for information retrieval. Instead of using keyword-based algorithms for shallow natural language processing, a deep grammar analysis is employed to analyse sentences retrieved from the web and identify main entities such as noun phrases, verb phrases and modifiers. These entities are revealed in the context of the sentences, which makes it possible for the relationships between these entities to be identified and analyzed. The noun phrases and verb phrases are then generalized into conceptual knowledge that is indexed and later used to improve the information retrieval process. In addition to analysing the content of the retrieved web documents, it is equally important to accurately identify the purpose of the query made by the end user, i.e. the context behind his/her search for information. The approach adopted in this work is to extract the main concepts contained in the query and combine them with information from a knowledge base (KB). The dynamics of the user's ad-hoc and long-term information needs is encapsulated by adding new concepts and relationships from the queries to a user model. Finally, once the purpose of the query is identified, the content is retrieved from the web and analyzed using AVMs, and the result is represented via a concept map.
(Figure: User Queries, Internet, Intelligent Agent, Concept Representation, Web Spiders, Concept KB, HPSG Processor, AVM Engine, User Profile DB, Personalised KB, Language KB)
Fig. 1. Architecture of the proposed information retrieval system
3.2
System Architecture and Main Processes
The architecture of the proposed system is shown in Fig. 1. The system uses an HPSG processor for parsing user queries and web pages retrieved from the web. The web spiders are agents that retrieve and filter web pages. The intelligent agent is involved in the processing of the user queries and in information personalization. The concept representation module provides graphical presentation of the retrieved information using concept maps. The concept KB contains background conceptual knowledge. The personalized KB includes previously retrieved and indexed items that might be needed in future queries. The language KB consists of systematised morphological, syntactical, semantic and pragmatic knowledge, specific for the English language. The user profile database (DB) is for storing the details of previous user's queries. The AVM engine is the AVM format wrapper of the HPSG processor. Two processes are described below. The first process concerns the user interaction with the system, namely, the way his/her query is processed. The second process relates to the retrieval of web content that is relevant to the user query. Processing user queries (Fig. 2). A user query is first processed by the intelligent agent that uses the HPSG processor to interpret the user query. The HPSG processor uses the language KB to analyze the basic grammar and semantics of the query. When the interpretation of the query in AVM form is obtained, the intelligent agent extracts the important features from it by combining information from three sources. These are the user profile DB, which reports previous user preferences that are relevant to the current query, the personalized KB that may contain relevant knowledge from previous retrieval tasks, and the language KB. After analyzing this input, the intelligent agent sends a query command to the web spiders. The user profile database is updated by storing the query in AVM form. The user preferences are modified accordingly, by comparing the main concept features in the AVM result with the
concept KB. This assures that the user profile is maintained in a dynamic manner that captures and reflects the user's short-term and long-term interests. Retrieving web content (Fig. 3). The process starts when the web spiders receive the query from the intelligent agent. The web spiders retrieve related web content and forward it to the HPSG processor to parse the text. Then, the user profile database and the language KB are employed to analyse the content. When the web spiders receive the analysed result from the HPSG processor, they evaluate it using information from the user profile database, and choose whether to store the retrieved information or discard it. If the information is found to be of relevance, the result in AVM form is added to the personalised KB. Finally, the concept representation module provides a graphical interpretation of the AVM result for the user.
3.3
Implementation and Testing
The HPSG processor was built in Java using 53 classes, in about 4,000 lines of code. It has two sub-modules: a unification parser and an HPSG formalism module. A part of its UML class diagram is illustrated in Fig. 4. The algorithms employed are adapted from [6-8]. The design and the implementation of the HPSG processor are described in more detail in [9]. The HPSG processor was tested on a P4 1 GHz, 256 MB memory, Windows NT 4.0 workstation. An experiment with 40 English phrases and sentences was conducted. In this experiment, the language KB included about 50 lexemes, and the grammar contained 8 ID schemas (grammar rules). An example of parsing a simple question is illustrated in the Annex. The experimental results [9] showed that the parsing time increases proportionally to the number of words in a sentence. Therefore, the parsing of a text with 50 sentences (400 words) would require approximately 12 seconds. This time will need to be greatly reduced through optimizing the parsing algorithm and introducing elements of learning into it.
(Figure: User Queries, Intelligent Agent, Web Spiders, User Profile DB, Concept KB, HPSG Processor, AVM Engine, Language KB, Personalised KB)
Fig. 2. Processing a user query
(Figure: Internet, Web Spiders, HPSG Processor, AVM Engine, Concept KB, User Profile DB, Language KB, Personalised KB, Concept Representation)
Fig. 3. Retrieving web content
The web spiders are implemented in 10 Java classes in about 3,700 lines of code with multithreading support. The module is configured to start exploring the web from any web address, for example portals or search engines. In the latter case, the spiders automatically submit keywords as instructed by the intelligent agent. The conducted experiment showed that the web spiders could retrieve approximately 1,000 pages per hour. The IR system developed uses an electronic lexical database, WordNet 1.6 [10], to retrieve lexical knowledge, e.g. part of speech, hyponyms, hypernyms and synonyms. The user interface is implemented with 12 classes in about 2,400 lines of code; it uses an HTML rendering engine.
(Figure: classes Grammar, Uniparser, Sign, and Entry with their multiplicities)
Fig. 4. UML class diagram of the HPSG processor's main components
4
Conclusions and Further Work
A client-side information retrieval system using a deep natural language grammar and semantic analysis processor is proposed in this work. It uses HPSG algorithms and AVMs to extract concept relationships and semantic meaning. The initial experiments show the feasibility of the proposed approach. Further work includes a large-scale performance evaluation of the system, which will involve extending the current language KB and conducting experiments with a number of search engines.
References
[1] Chakrabarti, S., Mining the Web: Analysis of Hypertext and Semi Structured Data. Morgan Kaufmann Publishers (2002).
[2] Carpenter, B., The Logic of Typed Feature Structures. Cambridge University Press, New York (1992).
[3] Copestake, A., Implementing Typed Feature Structure Grammars. University of Chicago Press, Stanford, California (2002).
[4] Sag, I.A. and Wasow, T., Syntactic Theory: A Formal Introduction. Cambridge University Press, Stanford, California (1999).
[5] Novak, J.D. and Gowin, D.B., Learning How to Learn. Cambridge University Press, New York (1984).
[6] Keselj, V., Java Parser for HPSGs: Why and How. Proceedings of the Conference of the Pacific Association for Computational Linguistics, PACLING'99, Waterloo, Ontario, Canada (1999).
[7] Kasami, T., An Efficient Recognition and Syntax Algorithm for Context-free Languages. Air Force Cambridge Research Laboratory, Bedford, Massachusetts, US (1965).
[8] Younger, D.H., Recognition of Context-free Languages in Time n^3. Information and Control, Vol. 10 (2), 189-208 (1967).
[9] Chen, L., Internet Information Retrieval Using Natural Language Processing Technology. MSc Thesis, Cardiff University, Cardiff (2002).
[10] Fellbaum, C., WordNet: An Electronic Lexical Database. MIT Press, Cambridge (1998).
Annex
Fig. A1. Parsing of the sentence “what did Napoleon give Elisa”. (1) and (2) show that the sentence is an interrogative question. (3), (4), (5) and (6) show that “what” needs to be answered and it has to be a nominative component. (7), (8), (9) and (10) illustrate that the subject of the sentence is “Napoleon”, the main verb is “give”, “Elisa” is the indirect object complement of the transitive word “give”, and the nominative component that the question asks is a direct object complement of “give”. Therefore, basic information such as subject, verb, objects, and what the question asked is, can be drawn from the parsing result. Further algorithms can be applied to match this semantic information with the information stored in the knowledge base, e.g. “In 1809, Napoleon gave his sister, Elisa, the Grand Duchy of Tuscany ”
Ontology of Domain Modeling in Web Based Adaptive Learning System Pinde Chen and Kedong Li School of Education Information Technology South China Normal University 510631 Guangdong , China {pinde,likd}@scnu.edu.cn
Abstract: The domain model, which embodies the logical relations of teaching material and the related instruction strategies, is an important component of an Adaptive Learning System (ALS). It is the basis for user modeling and the inference engine. In this paper, the ontology of the domain model in the Web-based Intelligent Adaptive Learning System (WBIALS) and its formal presentation are discussed in detail.
1
Introduction
An adaptive learning supporting environment is a crucial research issue in distance education. Since the mid-nineties, the number of research papers and reports in this area has been increasing fast, especially on Adaptive Hypermedia Systems [1]. By using AI (Artificial Intelligence) in hypermedia, the system can understand the user and the application domain, and then customize learning material for the user and direct him through the learning process. De Bra [2] gives an AHS reference model named AHAM. In this model, an AHS is composed of four components: the Domain Model (DM), the User Model (UM), the Teaching Model or Adaptation Model (TM or AM), and the Adaptive Engine (AE). The DM is a main component of an AHS. The system should understand the domain and know the status and requirements of the user so that it can adapt to the user. The DM involves knowledge representation and its organization. The UM involves the representation and maintenance of user information. Based on the DM and UM, the AE performs operations and adapts to the user. Ontology is originally a philosophical concept concerning the essence of existence. In recent years, however, it has been used in computer science and has come to play an increasingly important role in AI, computational linguistics and database theory. So far, ontology still has neither a consistent definition nor a fixed application domain. Gruber [3] at Stanford University characterized an ontology as a precise description of a conceptualization, a view accepted by many researchers. Ontologies are used to describe the essence of things; their main purpose is to represent implicit information precisely so that it can be reused and shared.
In this paper, the ontology of domain knowledge in the Web-based Intelligent Adaptive Learning Supporting System (WBIALS) is presented in detail.
2
Related Works
In recent years, different efforts have been made in laboratories, industry consortia, associations of professionals and standardization bodies to develop open learning technology specifications, and there are already some results (e.g. IEEE LTSC LOM, the IMS specifications, ADL SCORM). In LOM [9], the concept of a learning object is defined: any entity, digital or non-digital, that can be used, re-used, or referenced during technology-supported learning. Learning objects are entities that may be referred to with metadata. In practice, learning objects are mostly small objects and do not provide sufficient information to build a learning unit. In IEEE LTSC CMI, a course can be cut into several blocks, and a block can be cut into assignable units. For assignable units, prerequisite units can be set, which makes course sequencing possible. In the IMS Learning Design Specification [10], an information model of a unit of learning is defined, which provides a framework for learning material and the learning process.
3
Ontology of Domain Model
The domain model represents the elements of a domain and their relations. In the WBIALS system, the structure of the domain model can be depicted as follows (Fig. 1):
(Figure: curriculum group, curriculum, concept, glossary, example, task, FAQ, learning unit, curriculum structure, test group, discuss group, test item)
Fig. 1. Ontology of domain model (UML class diagram)
3.1
Curriculum and Curriculum Group
The goal of the WBIALS system is to be a general platform supporting adaptive learning. It can support many curriculums, so in this system a curriculum group is a set of curriculums.
Definition 1: A curriculum group is a set of curriculums, i.e., Curriculum group = {x | x is a curriculum}.
3.2
Concept in Curriculum
Concepts in a curriculum have five types: fact, concept, ability, principle, and problem solving. Concepts in a curriculum are used to index the glossary, examples, learning units, entries in the FAQ and so on. Every concept ci can be represented as a 4-tuple <C_id, C_name, C_type, C_des>, where
C_id: identifier of the concept;
C_name: name of the concept;
C_type: type of the concept;
C_des: description of the concept.
Definition 2: A concept is a 4-tuple <C_id, C_name, C_type, C_des>. All concepts in a curriculum form a set named Cs, where Cs = {ci | ci is a concept in the curriculum}.
3.3
Learning Unit
A learning unit is an entity: a collection of all kinds of learning material, instruction guidance, related information and so on. Usually, a learning unit consists of the following elements:
• Instruction hint or introduction. Depending on the instruction theory adopted, the content of this part can be a learning goal, learning guidance, or a real question about the unit.
• Unit content. This is the main part of a learning unit and is the object of the user's learning. It is often composed of one or a few HTML pages, one of which works as an entrance into the unit. Considering that users with different learning styles or backgrounds may need different unit content, the structure of unit content should be a 4-tuple <Lu_id, Url, ls_ID, Bg_ID>, where Lu_id: identifier of the learning unit; Url: address of an HTML file; ls_ID: identifier of the learning style; Bg_ID: identifier of background knowledge. One Lu_id can map to several HTML files, each of which adapts to one type of user.
• Lecture. A lecture is another instruction form, often used in the traditional classroom. It can be a PowerPoint file or a video recording, which can be provided to the user as alternative learning material.
• Example
• Demonstration
• Exercise
• Summary
• Expand content
It is not necessary for a learning unit to contain all the above elements. Generally, only the unit content is necessary, and all the other elements are optional.
Definition 3: A learning unit is an atomic learning cell, a collection of all kinds of learning material, instruction guidance, related information and so on, and can be represented as an 8-tuple <Intr, Cont, Lect, Exam, Demo, Exer, Summ, Expa>. In practice, every element in a learning unit is mapped to a paragraph, the address of an HTML file, or a pointer to another object.
A concept is an abstract representation of an information item from the application domain, whereas a learning unit is an entity in the system, so concepts must be related to learning units. A learning unit is often related to several concepts. The relation between them can be represented by a 3-tuple <C_id, Lu_id, Ob_tier>, where
C_id: identifier of a concept;
Lu_id: identifier of a learning unit;
Ob_tier: objective tier, Ob_tier ∈ {knowledge, comprehension, application, analysis, synthesis, evaluation}.
A learning unit has one exercise group or none, and one exercise group consists of many exercise items.
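A direct way to read these definitions is as plain record types. The sketch below is illustrative only; the field names simply follow the tuples defined above, and no claim is made about how WBIALS actually stores them.

```python
from dataclasses import dataclass

@dataclass
class Concept:              # 4-tuple <C_id, C_name, C_type, C_des>
    c_id: str
    c_name: str
    c_type: str             # fact, concept, ability, principle, or problem solving
    c_des: str = ""

@dataclass
class LearningUnit:         # 8-tuple <Intr, Cont, Lect, Exam, Demo, Exer, Summ, Expa>
    intr: str = ""          # instruction hint / introduction
    cont: str = ""          # unit content -- the only element that is normally required
    lect: str = ""
    exam: str = ""
    demo: str = ""
    exer: str = ""
    summ: str = ""
    expa: str = ""

@dataclass
class ConceptUnitRelation:  # 3-tuple <C_id, Lu_id, Ob_tier>
    c_id: str
    lu_id: str
    ob_tier: str            # knowledge, comprehension, application, analysis, synthesis, evaluation

print(ConceptUnitRelation("c12", "lu03", "application"))
```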
3.4
Curriculum Structure
The following information was included in the curriculum structure:
• tree-like directory
• pre-next network
• inference network
3.4.1 Tree-Like Directory
The tree-like directory is constructed from learning units. It helps the users get a book-like overview of the curriculum.
For most nodes in the tree, each node can normally be mapped to a learning unit. However, there are also some nodes, called virtual nodes, that have no content attached to them. Besides, some nodes in the directory may correspond to optional learning units.
Definition 4: A curriculum directory is a tree-like structure whose nodes can be represented as a 7-tuple <Node_ID, Node_name, Node_layer, Node_sequence, Lu_ID, Node_type, Url>, where
Node_ID: identifier of a node;
Node_name: name of a node, as shown in the directory;
Node_layer: tier of a node;
Node_sequence: sequence of a node (according to Node_sequence and Node_layer, the structure and appearance of the directory can be decided);
Lu_ID: identifier of a learning unit;
Node_type: whether the node is optional or compulsory;
Url: when the node is a virtual one, this item decides which HTML page will be displayed.
3.4.2 Relation among Learning Units in a Curriculum
One learning unit in a curriculum often has a relation to another one; there are inherent logical relations among them. When writing a textbook, the author should analyze the inherent relations of the curriculum and then decide the sequence of chapters. In the process of learning, the student will be more efficient if he can follow a rational route through the textbook. Indeed, during instruction design, the main job of the analysis of instruction tasks is to work out the instruction goal and then divide it into several smaller goals so as to set up a proper sequence that promotes valid learning. There are two types of relations among the learning units: the pre-next relation and the inference relation.
3.4.3 Network of Pre-next Relation
According to instruction experience, the following generation rule may exist between learning units vi and vj: if vi is mastered then vj can be learnt next (supporting factor λ). It shows that learning unit vi is a precondition of vj, with supporting factor λ. This rule can be represented as a 3-tuple <vi, vj, λ>. The degree of "mastered" can be: very good, good, average, small part, not. Combining all the pre-next generation rules together makes a weighted directed acyclic graph (DAG), shown in Fig. 2.
Definition 5: The network of pre-next relations is a weighted directed acyclic graph (DAG) G=<V, R, W>, where V={v1, v2, ..., vn} is the set of learning units, R={<vi, vj> | vi is a precondition of vj}, and W={wij | <vi, vj> ∈ R and wij ∈ [0,1]}.
Fig. 2. Network of pre-next relations
3.4.4 Inference Network
According to instruction experience, the following generation rule may exist between learning units vi and vj: if vi is "mastered" then vj is "mastered" (threshold λ). It expresses an inference relation between vi and vj. That is to say, if the system obtains evidence that the degree of mastery of vi is bigger than λ, it can infer that vj is mastered. The degree of "mastered" can be: very good, good, average, small part, not. The inference relations in a curriculum also make a weighted directed acyclic graph (DAG).
Definition 6: The network of inference relations is a weighted directed acyclic graph (DAG) G=<V, R, W>, where V={v1, v2, ..., vn} is the set of learning units, R={<vi, vj> | if vi is "mastered" then vj is "mastered"}, and W={wij | <vi, vj> ∈ R and wij ∈ [0,1]}.
From the above, the mastery degree of a learning unit, the pre-next rules and the inference rules are all vague, so it is necessary to use fuzzy sets and fuzzy rules to represent them.
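As a rough sketch of how such a weighted prerequisite network could be used at run time (the paper's own fuzzy inference is not reproduced here), the code below stores pre-next rules as weighted edges, maps the linguistic mastery degrees to assumed numeric values, and lists the units a student appears ready to learn next.

```python
# Hypothetical numeric encoding of the linguistic mastery degrees.
MASTERY = {"not": 0.0, "small part": 0.25, "average": 0.5, "good": 0.75, "very good": 1.0}

# Pre-next rules <vi, vj, w>: vi is a precondition of vj with supporting factor w.
PRE_NEXT = [("u1", "u2", 0.8), ("u1", "u3", 0.6), ("u2", "u4", 0.9), ("u3", "u4", 0.7)]

def ready_units(student_mastery: dict, threshold: float = 0.5) -> set:
    """Units whose every weighted precondition is sufficiently mastered."""
    candidates = {vj for _, vj, _ in PRE_NEXT}
    ready = set()
    for unit in candidates:
        preconditions = [(vi, w) for vi, vj, w in PRE_NEXT if vj == unit]
        if all(MASTERY.get(student_mastery.get(vi, "not"), 0.0) * w >= threshold
               for vi, w in preconditions):
            ready.add(unit)
    return ready

student = {"u1": "very good", "u2": "average", "u3": "good"}
print(ready_units(student))  # e.g. {'u2', 'u3'} -- u4 is still blocked by u2
```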
3.5
Learning Task
The learning task, which supports a task-based learning style, is another organization form of the curriculum content. A learning task, which is a subset of the set of learning units, consists of several learning units and has an identifier. The pre-next relations and inference relations within a task form subgraphs of the pre-next network and the inference network.
3.6
FAQ and Discuss Group
Every entry in the FAQ and discuss group is indexed by the identifier of a learning unit, which lays the ground for customizing help information.
4
Conclusion
In this paper, we give a precise description of the ontology of the domain model. In the WBIALS system, we use an overlay model as the user model, and fuzzy sets and fuzzy rules are used to represent vague information and make inferences. The system provides adaptive presentation, adaptive navigation and customized help information. In particular, an adaptive dynamic learning zone was implemented for the task-based learning style. This paper is sponsored by the scientific fund of the Education Department, GuangDong, China (Project No. Z02021).
References
[1] Brusilovsky, P.: Methods and techniques of adaptive hypermedia. User Modeling and User-Adapted Interaction, 6 (2-3), pp. 87-129 (1996).
[2] De Bra, P., Houben, G.J., Wu, H.: AHAM: A Dexter-based Reference Model for Adaptive Hypermedia. Proceedings of the ACM Conference on Hypertext and Hypermedia, pp. 147-156, Darmstadt, Germany (1999). (Editors: K. Tochtermann, J. Westbomke, U.K. Wiil, J. Leggett)
[3] Gruber, T.R.: Toward Principles for the Design of Ontologies Used for Knowledge Sharing. Int. Journal of Human and Computer Studies (1995), 907-928.
[4] Liao Minghong: Ontology and Information Retrieval. Computer Engineering (China), 2000(2), pp. 56-58.
[5] Jin Zhi: Ontology-based Requirements Elicitation. Chinese J. Computers, 2000(5), pp. 486-492.
[6] Brusilovsky, P., Eklund, J., and Schwarz, E.: Web-based education for all: A tool for developing adaptive courseware. Computer Networks and ISDN Systems (Proceedings of the Seventh International World Wide Web Conference, 14-18 April 1998), 30 (1-7), 291-300 (1998).
[7] Weber, G., Kuhl, H.-C. & Weibelzahl, S.: Developing Adaptive Internet Based Courses with the Authoring System NetCoach. In: P. De Bra and P. Brusilovsky (eds.) Proceedings of the Third Workshop on Adaptive Hypertext and Hypermedia. Berlin: Springer (2001).
[8] De Bra, P. & Calvi, L.: AHA! An open Adaptive Hypermedia Architecture. The New Review of Hypermedia and Multimedia, 4, 115-139 (1998).
[9] IEEE LTSC, http://ltsc.ieee.org
[10] IMS Learning Design Specification, http://www.imsglobal.org
Individualizing a Cognitive Model of Students' Memory in Intelligent Tutoring Systems Maria Virvou and Konstantinos Manos Department of Informatics University of Piraeus Piraeus 18534, Greece [email protected];
[email protected]
Abstract: Educational software applications may be more effective if they can adapt their teaching strategies to the needs of individual students. Individualisation may be achieved through student modelling, which is the main practice for Intelligent Tutoring Systems (ITS). In this paper, we show how principles of cognitive psychology have been adapted and incorporated into the student modelling component of a knowledge-based authoring tool for the generation of ITSs. The cognitive model takes into account the time that has passed since the learning of a fact has occurred and gives the system an insight of what is known and remembered and what needs to be revised and when. This model is individualised by using evidence from each individual student's actions.
1
Introduction
The fast and impressive advances of Information Technology have rendered computers very attractive media for the purposes of education. Many presentation advantages may be achieved through multimedia interfaces and easy access can be ensured through the WWW. However, to make use of the full capabilities of computers as compared to other traditional educational media such as books, the educational applications need to be highly interactive and individualised to the particular needs of each student. It is simple logic that response individualized to a particular student must be based on some information about that student; in Intelligent Tutoring Systems this realization led to student modeling, which became a core or even defining issue for the field (Cumming & McDougall 2000). One common concern of student models has been the representation of the knowledge of students in relation to the complete domain knowledge, which should be learnt by the student eventually. Students' knowledge has often been considered as a subset of the domain knowledge, such as in the overlay approach that was first used in (Stansfield, Carr & Goldstein 1976) and has been used in many systems since then (e.g. Matthews et al. 2000). Another approach is to consider the student's knowledge
in part as a subset of the complete domain knowledge and in part as a set of misconceptions that the student may have (e.g. Sleeman et al. 1990). Both cases represent an all or nothing approach on what the student knows and they certainly do not take into account the temporal aspects of knowledge, which are associated with how students learn and possibly forget. In view of the above, this paper describes the student modeling module of an educational application. This module can measure-simulate the way students learn and possibly forget through the whole process of a lesson. For this purpose, it uses principles of cognitive psychology concerning human memory. These principles are combined with evidence from the individual students' actions. Such evidence reveals the individual circumstances of how a student learns. Therefore the student model takes into account how long it has been since the student has last seen a part of the theory, how many times s/he has repeated it, how well s/he has answered questions relating to it. As a test-bed for the generality of our approach and its effectiveness within an educational application we have incorporated it in a knowledge based authoring tool. The authoring tool is called Ed-Game Author (Virvou et al. 2002) and can generate ITSs that operate as educational games in many domains.
2
The Cognitive Model of Memory for an Average Student
A classical approach to how people forget is based on research conducted by Hermann Ebbinghaus, which appears in a reprinted form in (Ebbinghaus, 1998). Ebbinghaus' empirical research led him to the creation of a mathematical formula which calculates an approximation of how much may be remembered by an individual in relation to the time from the end of learning (Formula 1).
b = 100k / ((log t)^c + k)    (1)
Where:
• t: the time in minutes, counting from one minute before the end of the learning
• b: the equivalent of the amount remembered from the first learning
• c and k: two constants with the following calculated values: k = 1.84 and c = 1.25
In the student model of Ed-Game Author the Ebbinghaus calculations have been the basis for finding out how much is remembered by an average student. In particular, there is a database that simulates the mental library of the student. Each fact a student encounters during the game-lesson is stored in this database as a record. In addition to the fact, the database also stores the date and time when the fact was last used, in a field called LastAccessDate. A fact is first added to the memory database when a student is first taught this fact through a lesson. When a fact is inserted in the database, the current date and time is also recorded in the field called TeachDate.
Thus, whenever the system needs to know the current percentage of a student's memory of a fact, the equation (2) is used, which is largely based on the Ebbinghaus' power function. However, equation (2) has been adapted to include one more factor, which is called the Retention Factor (RF). The retention factor is used to individualise this equation to the particular circumstances of each student by taking into account evidence from his/her own actions. If the system does not take into account this evidence from the individual students' actions then the Retention Factor may be set to 100, in which case the result is identical to the generic calculations of Ebbinghaus concerning human memory in general. However, if the system has collected sufficient evidence for a particular student then when a fact is first encountered by this student the Retention Factor is set to 95 and then it is modified accordingly as will be described in detail in the following sections. The RF stored in the “mental” database for each fact is the one representing the student's memory state at the time showed by the TeachDate field.
X% = (b ∗ RF) / 100    (2)
Where b is the Ebbinghaus' power function result, setting t=Now-TeachDate.
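A direct implementation of equations (1) and (2) might look as follows. This is only a sketch; the base-10 logarithm and the minute-based clock (with t counted from one minute before the end of learning) follow Ebbinghaus' formulation as described above, and the example dates are invented.

```python
import math
from datetime import datetime

K, C = 1.84, 1.25  # Ebbinghaus' constants

def ebbinghaus_b(minutes_since_learning: float) -> float:
    """Equation (1): percentage remembered t minutes after learning."""
    t = max(minutes_since_learning, 1.0)   # t starts one minute before the end of learning
    return (100.0 * K) / (math.log10(t) ** C + K)

def remembered_percentage(teach_date: datetime, retention_factor: float,
                          now: datetime) -> float:
    """Equation (2): X% = b * RF / 100, with t = now - teach_date."""
    minutes = (now - teach_date).total_seconds() / 60.0
    return ebbinghaus_b(minutes) * retention_factor / 100.0

taught = datetime(2003, 5, 1, 10, 0)
print(round(remembered_percentage(taught, 95.0, datetime(2003, 5, 1, 10, 5)), 1))
# about 70.5 percent remembered after five minutes with RF = 95
```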
3
Memorise Ability
One important individual student characteristic that is taken into account is the ability of each student to memorise new facts. Some students have to repeat a fact many times to learn it while others may remember it from the first occurrence with no repetition. To take into account these differences, we have introduced the student's Memorise Ability factor (MA). This factor's values range between 0 and 4. The value 0 corresponds to “very weak memory”, 1 to “weak memory”, 2 to “moderate memory”, 3 to “strong memory” and 4 to “very strong memory”. During the course of a virtual-game there are many different things that can give an insight on what the student's MA is. One important hint can be found in the time interval between a student's having read about a fact and his/her answering a question concerning this fact. For example, if the student has given a wrong answer about a fact that s/he has just read about then s/he is considered to have a weak memory. On the other hand if s/he gives a correct answer concerning something s/he had read about a long time ago then s/he is considered to have a strong memory. Taking into consideration such evidence, the student's MA value may be calculated. Using MA the Retention Factor is modified according to the MA value of the student in the way illustrated in Table 1. As mentioned earlier, every fact inserted in the database has an initial RF of 95.
Table 1. Retention Factor's modification depending on the Memorise Ability
Memorise Ability Value    Retention Factor Modification
0                         RF' = RF - 5
1                         RF' = RF - 2
2                         RF' = RF
3                         RF' = RF + 2
4                         RF' = RF + 5
After these modifications the RF ranges from 90 (very weak memory) to 100 (very strong memory), depending on the student's profile. Taking as a fact that any RF below 70 corresponds to a “forgotten” fact, using the Ebbinghaus' power function, the “lifespan” of any given fact for the above mentioned MA may be calculated. So a student with a very weak memory would remember a fact for 3 minutes while a student with a very strong memory would remember it for 6.
4
Response Quality
During the game, the student may also face a question-riddle (which needs the "recall" of some fact to be correctly answered). In that case the fact's factor is updated according to the student's answer. For this modification an additional factor, the Response Quality (RQ) factor, is used. This factor ranges from 0 to 3 and reflects the "quality" of the student's answer. In particular, 0 represents "no memory of the fact", 1 represents an "incorrect response, but the student was close to the answer", 2 represents "correct response, but the student hesitated" and 3 represents a "perfect response". Depending on the Response Quality factor, the formulae for the calculation of the new RF are illustrated in Table 2. When a student gives an incorrect answer, the TeachDate is reset, so that the Ebbinghaus power function is restarted. When a student gives a correct answer, the increase of his/her Retention Factor depends on his/her profile and more specifically on his/her Memorise Ability factor. In particular, if the student's RQ is 2 and s/he has a very weak memory then the RF will be increased by 3 points (extending the lifespan of the memory of a fact by about a minute), while if s/he has a very strong memory the RF will be increased by 15 (extending its lifespan by over 6 minutes). These formulae for the calculation of the RF give a more "personal" aspect to the cognitive model, since they are not generic but based on the student's profile.
Table 2. Response Quality Factor reflecting the student's answer's quality
RQ    Modification
0     RF' = X - 10, where TeachDate = Now
1     RF' = X - 5, where TeachDate = Now
2     RF' = RF + (MA + 1) * 3
3     RF' = RF + (MA + 1) * 4
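Read together, Tables 1 and 2 define a small update procedure for the retention factor. A sketch of that logic could be the following, assuming the current memory percentage X is computed with equation (2) above; it is an illustration of the tables, not the Ed-Game Author code itself.

```python
# Table 1: initial RF adjusted by the Memorise Ability (MA) factor, 0..4.
MA_ADJUSTMENT = {0: -5, 1: -2, 2: 0, 3: +2, 4: +5}

def initial_rf(memorise_ability: int) -> int:
    """Every new fact starts at RF = 95, shifted by the student's MA."""
    return 95 + MA_ADJUSTMENT[memorise_ability]

# Table 2: RF update after a question about the fact, by Response Quality (RQ) 0..3.
def update_rf(rf: float, x_percent: float, rq: int, ma: int):
    """Return (new_rf, reset_teach_date) after an answer of quality rq."""
    if rq == 0:
        return x_percent - 10, True      # no memory of the fact; restart forgetting curve
    if rq == 1:
        return x_percent - 5, True       # wrong but close; restart forgetting curve
    if rq == 2:
        return rf + (ma + 1) * 3, False  # correct response with hesitation
    return rf + (ma + 1) * 4, False      # perfect response

rf = initial_rf(4)                        # very strong memory -> RF = 100
rf, reset = update_rf(rf, x_percent=80.0, rq=2, ma=4)
print(rf, reset)                          # 115 False
```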
At the end of a "virtual lesson", the final RF of a student for a particular fact is calculated. If this result is above 70 then the student is assumed to have learnt the fact, else s/he needs to revise it.
5
Conclusions
In this paper we described the part of the student modelling process of an ITS authoring tool that keeps track of the students' memory of facts that are taught to him/her. For this reason we have adapted and incorporated principles of cognitive psychology into the system. As a result, the educational application takes into account the time that has passed since the learning of a fact has occurred and combines this information with evidence from each individual student's actions. Such evidence includes how easily a student can memorise new facts and how well she can answer to questions concerning the material taught. In this way the system may know when each individual student may need to revise each part of the theory being taught.
References
[1] Cumming, G. & McDougall, A.: Mainstreaming AIED into Education? International Journal of Artificial Intelligence in Education, Vol. 11 (2000), 197-207.
[2] Ebbinghaus, H.: Classics in Psychology, 1885: Vol. 20, Memory. R.H. Wozniak (Ed.), Thoemmes Press (1998).
[3] Matthews, M., Pharr, W., Biswas, G. & Neelakandan: USCSH: An Active Intelligent Assistance System. Artificial Intelligence Review 14 (2000), pp. 121-141.
[4] Sleeman, D., Hirsh, H., Ellery, I. & Kim, I.: Extending domain theories: Two case studies in student modeling. Machine Learning, 5 (1990), pp. 1137.
[5] Stansfield, J.C., Carr, B., & Goldstein, I.P.: Wumpus advisor I: a first implementation of a program that tutors logical and probabilistic reasoning skills. AI Lab Memo 381, Massachusetts Institute of Technology, Cambridge, Massachusetts (1976).
[6] Virvou, M., Manos, C., Katsionis, G., Tourtoglou, K.: Incorporating the Culture of Virtual Reality Games into Educational Software via an Authoring Tool. Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC 2002), Tunisia (2002).
An Ontology-Based Approach to Intelligent Instructional Design Support
Helmut Meisel1, Ernesto Compatangelo1, and Andreas Hörfurter2
1 Department of Computing Science, University of Aberdeen, AB24 3UE, Scotland
2 Kompetenzwerk, Riedering, Germany
Abstract. We propose a framework for an Intelligent Tutoring System to support instructors in the design of a training. This requires the partial capture of Instructional Design theories, which define processes for the creation of a training and outline methods of instruction. Our architecture relies on a knowledge representation based on ontologies. Reasoning based on Description Logics supports the modelling of knowledge, the retrieval of suitable teaching methods, and the validation of a training. A small exemplary ontology is used to demonstrate the kind of Instructional Design knowledge that can be captured in our approach.
1
Background and Motivation
Instructional design (ID) theories support instructors in designing a training. These theories describe processes for the creation of a training, outlining methods of instruction together with situations in which those methods should be used [1]. The design of a training involves the definition and the classification of learning goals, the selection of suitable teaching methods and their assembly into a course [2]. The learning outcome of a training is improved if these theories are applied in practice. This paper proposes an architecture for an Intelligent Tutoring System that promotes the application of ID theories in practice. Such a system should assist an instructor in designing a training while educating him/her in Instructional Design. More specifically, this system should fulfill the following requirements:
– Assist its user in the selection of appropriate teaching methods for a training and encourage the application of a wide range of available teaching methods.
– Instruct its user about particular Teaching Methods (TMs).
– Highlight errors in the design of a training.
In our envisaged system (see Figure 1 for an architectural overview) ID experts maintain the system's knowledge directly. This requires a conceptual representation of ID knowledge that can be understood by non-computing experts. Inferences are necessary to perform the validation and verification tasks as well as the retrieval of suitable teaching methods. We propose ontologies as the knowledge representation mechanism and show how set-theoretic reasoning with ontologies can provide the necessary inferences.
(Figure: an Instructional Designer maintains procedural and domain Instructional Design knowledge through an Ontology Editor; ontologies, a Reasoner, V&V and a User Interface support the Trainee in creating a course and suggesting methods)
Fig. 1. Architectural Overview
as well as the retrieval of suitable teaching methods. We propose ontologies as the knowledge representation mechanism and show how set theoretic reasoning with ontologies can provide the necessary inferences. Related work in Artificial Intelligence has contributed to Instructional Design with tools targeting its authoring [3] and with systems for the (partial) automation of ID processes [4]. Early systems were built with heuristic knowledge such as condition-action rules [5]. A recent example of a rule-based approach is the Instructional Material Description Language (IMDL), that provides specifications for the automated generation of alternative course designs. IMDL considers instructional elements (i.e. learners or learning objectives) as pre-conditions and didactic elements (i.e. software components for the display of courses) as post-conditions. An inference engine creates alternative courses based on these specifications [6]. While this approach performs reasoning about ID knowledge, it does not provide a solution for an Intelligent Tutoring System. Rule-based systems create a “conceptual gap” between an expert’s knowledge and its representation in the rule-base [5]. Rule-based systems do not highlight relationships between corresponding rules and are therefore difficult to create or to maintain. In e-learning, Meta-Data (such as the Learning Object Metadata) enable sharing and reuse of educational resources. However, current standards offer only limited support to describe pedagogical knowledge [7, 8]. Our framework subsumes these standards, as Metadata classes can be viewed as ontology concepts. Ontologies are increasingly used to organize ID knowledge in e-learning platforms. In most cases, they facilitate the retrieval of semantically marked-up educational resources [9, 10]. The Educational Modelling Language (EML) aims to provide a framework for the conceptual modelling of learning environments. It consists of four extendable top level ontologies that describe (i) theories about learning and instruction, (ii) units of study, (iii) domains, and (iv) how learners learn. EML ontologies could be reused in our framework. However, we aim to deploy the ontologies directly in an Intelligent Tutoring System rather than using them for design purposes only. Knowledge representation and reasoning is equally important to meet the requirements for an Intelligent Tutoring System in ID. While rule-based systems could provide the necessary inferences, representational issues prevent their us-
Fig. 2. Exemplary Ontology
Ontologies, on the other hand, are a suitable means to capture ID knowledge. However, there is no standard inference mechanism for reasoning with ontologies.
2 Representation of ID Knowledge
Instructional Design knowledge consists of a domain part about "methods of instruction" and their application and a procedural part about ID processes. Procedural knowledge about ID processes is of a static nature and can be encoded in the User Interface as sequences of screens. Our framework mainly addresses the declarative part (i.e. the domain knowledge) of Instructional Design. The architecture of the proposed system relies on ontologies to capture this knowledge. Ontologies are "shared specifications of conceptualizations" [11] and encourage collaborative development by different experts. They capture knowledge at the conceptual level, thus enabling ID experts to manipulate it directly without the involvement of a knowledge engineer. Trainers may explore the ontology either by browsing through its entries or by querying it. Queries can also be assembled by the user interface in accordance with the information provided during the design process.
Fig. 3. Layered knowledge representation (a System Ontology with top-level concepts such as Teaching Method, Knowledge Type and Applicable Domain and links such as supports_ktype and applicable_domain; Intensional Descriptions, e.g. an ontology of methods with ReceptiveMethod, ActiveMethod, GroupWork and TeamProject; Extensional Descriptions, i.e. concrete individuals such as Java-Prog:Domain)
Ontologies represent knowledge in taxonomies, where more specific concepts inherit the properties of the concepts they specialize. This allows knowledge reuse when an ontology needs to be extended. Figure 2 shows a small exemplary ontology to demonstrate how ID knowledge can be captured. The part about Teaching Programming has been extracted from an existing evaluation of TMs in Computing Science [12]. Teaching methods are represented in the ontology by describing the situations they can be applied to (e.g. learning goals, characteristics of the learners, course domain, etc.). Instructions about the application of a teaching method can also be added to the ontology. In our exemplary ontology a Training is defined as a sequence of at least three different Teaching Methods, thus asserting that diverse teaching methods should be applied during a training. Moreover, each Training must have at least one Learning Goal. A Learning Goal is addressed by one or more Teaching Methods. Note that it is thus possible to apply more than one teaching method to a learning goal. The selection of a Teaching Method in our example depends on (i) the Domain it is applied to (e.g. programming), (ii) the supported Learning Goal (e.g. application of a programming language) and (iii) the Learner. In our exemplary ontology, Teaching Programming is understood to be the set of all teaching methods that are applicable to Programming. A specific rule like "It is generally agreed that fluency in a new programming language can only be attained by actually programming" [12] can also be included in the specification. This rule is represented by linking the concepts Programming Exercise and Gain Fluency in Programming with the attributes Learning Goal and Addressed by.
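Read as Description Logic axioms, the constraints above could be written roughly as follows. This is a hedged transcription: the role names hasMethod, hasGoal, addressedBy and applicableDomain are not given in the paper and are assumed here for illustration only.

$$\textit{Training} \sqsubseteq\ \geq 3\,\textit{hasMethod}.\textit{TeachingMethod}\ \sqcap\ \geq 1\,\textit{hasGoal}.\textit{LearningGoal}$$
$$\textit{LearningGoal} \sqsubseteq\ \geq 1\,\textit{addressedBy}.\textit{TeachingMethod}$$
$$\textit{TeachingProgramming} \equiv \textit{TeachingMethod} \sqcap \exists\,\textit{applicableDomain}.\textit{Programming}$$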
Instructional information about concepts must also be included in the ontology. This kind of information can be reviewed by a user in order to learn more about a particular teaching method. In the example, the description, strength, and weakness of a Teaching Method carry this information. Subclasses inherit this information from their superclasses. For instance, as Lecture about syntax and semantics is a subclass of both Lecture and Teaching Programming, it inherits the instructional information from both of them. A further benefit of multiple inheritance is that multiple views on the ontology can be defined. For instance, Teaching Methods can be classified according to learning strategy (e.g. Collaborative Learning, Discovery Learning, or Problem-based Learning). Our architecture represents knowledge in three layers (see Figure 3):
– System Ontology. It defines categories of links between concepts (i.e. attributes). Links either point to instructional information about a TM or represent constraints for the selection of a TM. Only links representing constraints are subject to reasoning. Furthermore, the System Ontology specifies top-level concepts (e.g. Teaching Method, Learner, Learning Goal) which are relevant for modelling the specific implementation.
– Intensional Descriptions. Every class in the System Ontology is specialized by subclasses (e.g. teaching methods can be classified as either receptive or active methods). This level defines generic building blocks for the description of concrete teaching methods in the following layer.
– Extensional Descriptions. Individuals populate the ontologies defined in the previous layers. For instance, concrete teaching methods are elements of intensional descriptions (e.g. a concrete Teaching Method "Writing a bubble sort program in Java" is an instance of the class Programming Exercise). The use of individuals is essential for validation purposes, as the reasoner can identify whether an individual commits to the structure defined in the ontology. This will be explored in the next section.
3 Reasoning Services
We have developed an Intelligent Knowledge Modelling Environment called ConcepTool [13], which is based on a conceptual model that can be emulated in a Description Logic [14]. This enables deductions based on set-theoretic reasoning, where concepts are considered as set specifications and individuals are considered as set elements. Computation of set containment (i.e. whether a set A is included in a set B), inconsistency (i.e. whether a set is necessarily empty), and set membership (i.e. whether an individual x is an element of the set A) enables the introduction of the following reasoning services.
– Ontology Verification and Validation. Automated reasoning can provide support to ID experts during ontology creation and maintenance by detecting errors and by highlighting additional hierarchical links. For instance, if the ontology states that a TM must mention its strengths and weaknesses, the system will not accept TMs without this description. If the reasoner derives
a set containment relationship between two classes (e.g. a TM applicable to any domain and another TM applicable to the Computing domain only), it suggests the explicit introduction of a subclass link. In this case, all the properties of the superclass will be propagated to the subclass.
– Retrieval of suitable teaching methods. Teaching Methods are returned as the result of a query, stated as a concept description, which is issued to the reasoner. This retrieves all the individual TMs that are members of this concept. In our exemplary ontology, a query that searches for all the TMs for the Computing Domain with the Learning Goal Gain fluency in Programming returns all the elements of the classes Programming Exercise, Write Programs from Scratch, and Adapt Existing Code. Note that individuals are not included in the exemplary ontology shown in Figure 2. The query concept can be generated by the user interface during the creation of a training. As reasoning in Description Logics is sound, only teaching methods that satisfy the constraints are suggested.
– Validation and Verification of a training. Errors in the design of a training can be detected in two ways. The first is to check whether a training commits to the axioms of the ontology. The training to be validated is considered as an instance of the Training class and thus needs to commit to its structure. Possible violations w.r.t. our exemplary ontology might be (i) defining only two Teaching Methods (whereas at least three are required), (ii) forgetting to specify a Learning Goal, or (iii) forgetting to address a learning goal with a Teaching Method. A further possibility to validate a training is to define classes of common errors (e.g. a training with receptive teaching methods only) and check whether the training under validation is an instance of a common error class. A minimal illustrative sketch of these reasoning services is given after this list.
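The sketch below is an illustration only, not the ConcepTool/Description Logic implementation used in the paper: it hand-rolls the retrieval query and the training-validation checks against the exemplary constraints, and all class, attribute and individual names are assumptions.

```python
# Toy stand-in for the set-theoretic reasoning services (names are hypothetical).
from dataclasses import dataclass, field

@dataclass
class TeachingMethod:
    name: str
    domains: set        # domains the TM is applicable to, e.g. {"Computing"}
    goals: set          # learning goals the TM addresses

@dataclass
class Training:
    methods: list = field(default_factory=list)
    goals: set = field(default_factory=set)

def retrieve(methods, domain, goal):
    """Query: all TMs applicable to `domain` that address `goal` (set membership)."""
    return [m for m in methods
            if (domain in m.domains or "AnyDomain" in m.domains) and goal in m.goals]

def validate(training):
    """Check a training against the axioms of the exemplary ontology."""
    errors = []
    if len({m.name for m in training.methods}) < 3:
        errors.append("a Training requires at least three different Teaching Methods")
    if not training.goals:
        errors.append("a Training must have at least one Learning Goal")
    for goal in training.goals:
        if not any(goal in m.goals for m in training.methods):
            errors.append(f"Learning Goal '{goal}' is not addressed by any Teaching Method")
    return errors

tms = [TeachingMethod("Programming Exercise", {"Computing"}, {"Gain fluency in Programming"}),
       TeachingMethod("Lecture about syntax and semantics", {"Computing"}, {"Know the syntax"})]
print(retrieve(tms, "Computing", "Gain fluency in Programming"))
print(validate(Training(methods=tms, goals={"Gain fluency in Programming"})))
```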
4 Discussion and Future Research
This paper presents an architecture for an Intelligent Tutoring System in Instructional Design (ID). Such an architecture addresses declarative ID knowledge, which can be directly manipulated by ID experts as this knowledge is represented explicitly using ontologies. Automated reasoning fulfills the requirements stated in Section 1 (i.e. assist users in the selection of appropriate teaching methods and highlight errors in the design of a training). The requirement to instruct a user about particular teaching methods can be achieved by attaching instructional information to each TM. Our framework does not commit to any particular ID theory. In principle, it can be applied to any approach as long as this allows the construction of an ontology. However, as the framework does not include a representation of procedural knowledge, the user interface (most likely) needs to be developed from scratch. Nevertheless, we anticipate that changes of an ID process will rarely occur. As the ConcepTool ontology editor is reasonably complete, we aim to develop ontologies and implement the User Interface part of this architecture. This
will investigate the validity of the assumption that a user who is not an expert in knowledge modelling can represent ontologies without the help of a knowledge engineer.
Acknowledgements The work is partially supported by the EPSRC under grants GR/R10127/01 and GR/N15764.
References
[1] Reigeluth, C.M. (ed.): Instructional-Design Theories and Models: A New Paradigm of Instructional Theory. Volume 2. Lawrence Erlbaum Associates (1999)
[2] van Merriënboer, J.: Training Complex Cognitive Skills: A Four-Component Instructional Design Model for Technical Training. Educational Technology Publications (1997) ISBN: 0877782989
[3] Murray, T.: Authoring intelligent tutoring systems: An analysis of the state of the art. International Journal of Artificial Intelligence in Education 10 (1999) 98–129
[4] Kasowitz, A.: Tools for automating instructional design. Technical Report EDO-IR-98-1, Education Resources Information Center on Information Technology (1998) http://ericit.org/digests/EDO-IR-1998-01.shtml
[5] Mizoguchi, R., Bourdeau, J.: Using ontological engineering to overcome common AI-ED problems. International Journal of Artificial Intelligence in Education 11 (2000) 107–121
[6] Gaede, B., H., S.: A generator and a meta specification language for courseware. In: Proc. of the World Conf. on Educational Multimedia, Hypermedia and Telecommunications 2001(1). (2001) 533–540
[7] Pawlowski, J.: Reusable models of pedagogical concepts - a framework for pedagogical and content design. In: Proc. of ED-MEDIA 2002, World Conference on Educational Multimedia, Hypermedia and Telecommunications. (2002)
[8] Recker, M., Wiley, D.: A non-authoritative educational metadata ontology for filtering and recommending learning objects. Interactive Learning Environments 9 (2001) 255–271
[9] Leidig, T.: L3 - towards an open learning environment. ACM Journal of Educational Resources in Computing 1 (2001)
[10] Allert, H., et al.: Instructional models and scenarios for an open learning repository - instructional design and metadata. In: Proc. of E-Learn 2002: World Conference on E-Learning in Corporate, Government, Healthcare, & Higher Education. (2002)
[11] Uschold, M.: Knowledge level modelling: concepts and terminology. The Knowledge Engineering Review 13 (1998) 5–29
[12] Nicholson, A.E., Fraser, K.M.: Methodologies for teaching new programming languages: a case study teaching Lisp. In: Proc. of the 2nd Australasian Conference on Computer Science Education. (1997)
[13] Compatangelo, E., Meisel, H.: K-ShaRe: an architecture for sharing heterogeneous conceptualisations. In: Proc. of the 6th Intl. Conf. on Knowledge-Based Intelligent Information & Engineering Systems (KES'2002), IOS Press (2002) 1439–1443
[14] Horrocks, I.: FaCT and iFaCT. In: Proc. of the Intl. Workshop on Description Logics (DL'99). (1999) 133–135
Javy: Virtual Environment for Case-Based Teaching of Java Virtual Machine
Pedro Pablo Gómez-Martín, Marco Antonio Gómez-Martín, and Pedro A. González-Calero
Dep. Sistemas Informáticos y Programación, Universidad Complutense de Madrid, Spain
{pedrop,marcoa,pedro}@sip.ucm.es
Abstract. Knowledge-based learning environments have become an ideal solution for providing effective learning. These systems base their teaching techniques upon constructivist problem solving to supply an engaging learning environment. The students are presented with more and more challenging exercises, selected from a set of different scenarios depending on their knowledge. This paper presents a new system of this kind, which aims to teach Java compilation with the help of a metaphorical virtual environment that simulates the Java Virtual Machine.
1 Introduction
Knowledge-based learning environments are considered a good solution for instructing students in those domains where "learning by doing" is the best teaching methodology. Students are faced with more and more complex problems, tailored to their needs depending on their increasing knowledge. Nowadays, improvements in computer capacity let us implement multimedia systems and show real-time graphics to the users. New educational software has started to use virtual environments, so users are immersed in synthetic microworlds where they can experiment and check their knowledge ([10]). A further enhancement is to incorporate animated pedagogical agents who inhabit these virtual environments. An animated pedagogical agent is a lifelike on-screen character who provides contextualized advice and feedback throughout a learning episode ([6]). These agents track students' actions and supply guidance and help in order to promote learning when misconceptions are detected. Students are allowed to ask for help at any time, and the agent will offer specific suggestions and concept explanations according to the current exercise. In order to supply the user with guidance, agents need to possess a good comprehension of the taught domain and the current scenario, and to hold information about the user's knowledge (usually stored in a user profile). These systems also require a pedagogical module that determines which aspects of the domain knowledge the student should practice according to her strengths and weaknesses.
Supported by the Spanish Committee of Science & Technology (TIC2002-01961)
With this purpose, these programs need a large collection of scenarios where different exercises are kept, indexed by the concepts they assess. The user profile keeps information about which of these concepts the student already understands and which she does not yet know. The pedagogical module uses this data and the indexes in the set of exercises to choose the most suitable scenario ([9]). Animated pedagogical agents are an active research topic and several of them have been implemented, for example Design-A-Plant, Internet Advisor, CPU-City and Steve ([7], [4], [1], [2]).
2 General Description
Our aim is to implement an animated pedagogical agent to teach the Java Virtual Machine (JVM) structure and Java language compilation. Users are supposed to know imperative programming, particularly Java programming, and they will be able to improve their knowledge of object-oriented programming and the compilation process. Our program uses a metaphorical 3D virtual environment which simulates the JVM ([5]). The user is symbolized as an avatar which is used to interact with the objects in the virtual world. This avatar owns an inventory where it can keep objects. The virtual world is also inhabited by Javy (JavA taught VirtuallY), an animated pedagogical agent that is able to perform two main functions: (1) monitor the student whilst she is solving a problem, with the purpose of detecting the errors she makes in order to give her advice or guidance, and (2) solve the exercise by himself, giving an explanation at each step. In the virtual environment there are different objects that the user can manipulate using four basic primitives, borrowed from some entertainment programs: "look at", "take", "use" and "use with". Some of the objects have a human look, although they cannot be considered intelligent agents. Their simple mission is to allow the user to interact with some of the metaphorical JVM structures. Each exercise is designed by tutors using an authoring tool, and it consists of Java source code and its associated compiled code (Java byte codes). The student has to use the different provided structures to execute the supplied Java byte codes. Learning about the compilation phase is considered a side effect, since the system also shows the Java source code, where the statement that generated the compiled instructions being executed is highlighted. The user is supposed to pay attention to both codes, and to try to understand how the compilation process is performed. Of course, Javy can also give information about this procedure. A second information source is available through the "look at" user interface operation. For example, if the student ordered "look at the operand stack", her own avatar would answer "Operand stack stores temporal results needed to calculate the current expression. We could ask Javy to get more information." The auxiliary characters' behaviour and the phrases said by the avatar are relatively hard-coded. However, Javy is quite a lot more complex, because not only does he give
advice and help, but he can also replace the user and finish the current exercise in order to show the student the correct way. Consequently, Javy's actions are based on a rather large amount of knowledge about the domain, and a more detailed description is needed.
3 Conceptual Hierarchy
Our system stores the domain concepts the students have to learn using a conceptual hierarchy. At first glance, the application uses them to perform two tasks:
– User model: the system keeps information about the knowledge the user already has about the specific domain being taught. The user model stores the concepts we can assume she knows, and those that she is learning.
– Exercise database: the pedagogical module uses the user model to retrieve the next scenario to be presented to the student. It uses an exercise database which is indexed by the concepts each exercise tests. The system should try to select exercises which exhibit a degree of complexity that is not too great for the learner but is sufficiently complex to be challenging ([3]).
The conceptual hierarchy is also used by Javy to generate explanations. For our system, we have identified five kinds of concepts with different abstraction levels:
– Compilation concepts: they are related to the high-level structures that the student should learn to compile (e.g. arithmetic expressions).
– Virtual machine instructions: this group includes the JVM instructions, sorted into a hierarchy.
– Microinstructions: they are the different primitive actions that both the student and Javy can perform in order to change the JVM state (e.g. pushing a value onto the operand stack). Each concept representing a virtual machine instruction in the previous group is related to all the microinstructions needed to execute the instruction.
– Virtual machine structures: each part of the JVM has a concept in the hierarchy (like "operand stack", "frame stack" and "heap").
– User interface operations: each microinstruction is executed in the virtual world using a metaphorical element, and by interacting with it through one of the three operations ("take", "use" or "use with"). Concepts in this group symbolize the relation between them. For example, the concept "use operand stack" (with another object) is related to the microinstruction "push value".
The user model and the indexes of the exercise database only use the first two groups of concepts (compilation and virtual machine instruction concepts). Each concept has a name and a description, which is used by Javy when the user asks for information about that concept. As students are clearly irritated by the repetition of the same explanation ([7]), each concept also has a short description, used when the first one has been presented recently to the user.
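As a minimal sketch only (the cooldown mechanism and field names below are assumptions, not the paper's design), a concept carrying both descriptions could be modelled like this:

```python
# Hypothetical sketch: a concept stores a full and a short description, and the
# full text is not repeated if the concept was explained recently.
import time

class Concept:
    def __init__(self, name, description, short_description, cooldown=300):
        self.name = name
        self.description = description
        self.short_description = short_description
        self.cooldown = cooldown              # seconds before the full text is reused
        self._last_explained = None

    def explain(self):
        now = time.time()
        recently = (self._last_explained is not None
                    and now - self._last_explained < self.cooldown)
        self._last_explained = now
        return self.short_description if recently else self.description

stack = Concept(
    "operand stack",
    "Operand stack stores temporal results needed to calculate the current expression.",
    "Holds the intermediate values of the expression being evaluated.")
print(stack.explain())   # full description the first time
print(stack.explain())   # short description when asked again soon afterwards
```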
The conceptual hierarchy is built by the authors or tutors who provide the specific knowledge to the system. An authoring tool has been developed which allows the tutor to define concepts and to set their properties and relations.
4 Representation of the JVM Instruction Set
Students have to execute the JVM instructions by themselves. The system, meanwhile, monitors them and checks that they perform each step correctly. In addition, Javy also has to be able to execute each instruction. Actually, we could think that the user would only learn how the JVM works. However, we assume that the user will also learn the compilation of Java programs as a side effect, by comparing the source and compiled code provided by the system and by using Javy's explanations. The system has to store information about the steps (microinstructions) that have to be executed to complete each JVM instruction. The conceptual relations between JVM instructions and primitive instructions are not enough because no ordering information is kept. Therefore, the system also stores an execution graph for each instruction. Each graph is related to the concept of the instruction it represents. Graph nodes are states, and microinstructions are stored on the edges. In addition, each edge is related to the microinstruction concept in the concept hierarchy. When a primitive action is executed, the state changes. Some of these microinstructions have arguments or parameters, and the avatar uses objects in the inventory to supply them. One example is the primitive action of pushing a constant value onto the operand stack; the user got the value in a previous primitive action, such as fetching the JVM instruction. Graph edges also store explanations that show why their primitive actions are needed to complete the execution of the JVM instruction. This is useful because the description related to the microinstruction concept explains what it does, but not why it is important in a specific situation. For example, the microinstruction concept "push a value into the operand stack" has a description such as "The selected value in the inventory is stacked up the operand stack. If no more values were piled up, the next extracted value would be the value just pushed". On the other hand, this primitive action is used in the iload instruction execution; the explanation related to it in the graph edge could be "Once we have got the local variable, it must be loaded into the stack". Graphs also store wrong paths and associate with them an explanation indicating why they are incorrect. This is used to detect misconceptions. When the user tries to execute one of these primitive actions, Javy stops her and explains why she is wrong. Because Javy does not allow the user to execute an invalid operation, the graph only has to store "one level of error": in our first approach, we never let her reach a wrong execution state. In case the student executes a microinstruction that is not considered in the graph, Javy also stops her, but he is not able to give her a suitable explanation. When this occurs, the system registers which wrong operation was executed and in which state
the user was. Later, the tutor who built the graph will analyze this data and expand the graph with the wrong paths that are repeated most frequently, adding explanations about why they are incorrect. When Javy is performing a task, he executes the microinstructions of a correct path and uses their explanations. Therefore, our knowledge representation is valid for both executing and monitoring cases. These graphs are also built by the authors who provide the specific domain knowledge. A further authoring tool is provided which allows the user to create the graphs. As one would expect, the tool checks coherence between the conceptual hierarchy and the graphs being created.
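The following sketch illustrates one possible shape for such an execution graph; it is not the authors' implementation, and the particular microinstruction breakdown shown for iload is an assumption made for the example.

```python
# Hypothetical execution graph: nodes are states, edges carry a microinstruction,
# an explanation, and a flag marking known wrong paths.
class InstructionGraph:
    def __init__(self, start):
        self.start = start
        self.edges = {}   # state -> list of (microinstruction, next_state, explanation, is_wrong)

    def add_edge(self, state, micro, next_state, explanation, is_wrong=False):
        self.edges.setdefault(state, []).append((micro, next_state, explanation, is_wrong))

    def check(self, state, micro):
        """Monitoring mode: classify the microinstruction a user tries in a given state."""
        for m, nxt, explanation, wrong in self.edges.get(state, []):
            if m == micro:
                return ("wrong" if wrong else "ok"), nxt, explanation
        return "unknown", state, None   # registered so the tutor can extend the graph later

iload = InstructionGraph("start")
iload.add_edge("start", "fetch instruction", "fetched",
               "First, the iload instruction and its operand are fetched.")
iload.add_edge("fetched", "read local variable", "have value",
               "iload reads the local variable selected by the operand.")
iload.add_edge("have value", "push value onto operand stack", "done",
               "Once we have got the local variable, it must be loaded into the stack.")
iload.add_edge("have value", "push value onto frame stack", "have value",
               "The frame stack holds frames, not operands; the value belongs on the operand stack.",
               is_wrong=True)
print(iload.check("have value", "push value onto operand stack"))
```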
5 Scenarios
So far, the knowledge added to the system is general, and is mostly focused on JVM structures and instruction execution. Using the application consists of Javy or the student executing a Java program using the compiled code. These exercises (frequently called scenarios) are provided by a tutor or course author using a third authoring tool, which must simplify exercise construction, because the scenario-based (case-based) instruction paradigm (on which Javy is based) works better if a large number of scenarios has been defined ([8]). The tutor creates the Java code of the scenario, which can include one or several Java classes. Each exercise has a description and a list of all the concepts that are tested in it. These concepts are used in case retrieval when a student starts a new exercise. The authoring tool compiles the Java code given by the tutor. One of the resulting class files will have the main function. Each JVM instruction is automatically related to a portion of the source code and, during exercise resolution, when the avatar is performing this JVM instruction, this source code region is highlighted. The course author can modify these relations in order to correct mistakes made by the authoring tool when it creates them automatically. The explanations in the concept hierarchy and graphs are used during instruction execution. However, our aim is for the student to learn, as a side effect, how Java source code is compiled. Javy will never check whether the user understands these concepts, but he can provide information about the process. In order for Javy to be able to give explanations about the compiled code, the author divides the Java byte codes into different levels and constructs a tree (usually a subset of the syntax tree). For example, the execution of a while loop can be decomposed into the evaluation of the boolean expression and the loop instructions. The author also gives an explanation for each part, and why each phase is important. That description refers to the relationship between source code and compiled code. When the avatar is carrying out a scenario, the current frame and program counter are related to a region in the Java source code, which is part of a greater region, and so on. If the student asks Javy for an explanation, he will use the text given by the author for the current region. When the user asks again, Javy will use the parent explanation.
In addition, each region can be linked with a concept in the hierarchy described above. Therefore Javy is able to use the explanations stored in these concepts to give more details.
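A minimal sketch of this escalation over the region tree is shown below; the region contents and explanation texts are invented for illustration and do not come from the paper.

```python
# Hypothetical region tree: asking "why?" repeatedly walks up from the current
# region towards the root, returning one explanation per level.
class Region:
    def __init__(self, explanation, parent=None):
        self.explanation = explanation
        self.parent = parent

method_region = Region("The method computes the sum of the elements of an array.")
loop_region = Region("A while loop iterates over the array elements.", parent=method_region)
condition_region = Region(
    "The boolean expression checks whether the index has reached the array length.",
    parent=loop_region)

def ask_why(region):
    """Yield explanations from the current region up to the root."""
    while region is not None:
        yield region.explanation
        region = region.parent

for answer in ask_why(condition_region):
    print(answer)
```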
6 Architecture
The system has three elements: the virtual world, the agent and the user interface. The virtual environment represents the Java Virtual Machine where Javy and the student perform their actions. We have implemented a simplified JVM with all the capabilities we try to teach. The virtual world contains static and dynamic objects. The first are stationary objects like terrain, fences and so on, and they are rendered directly by the user interface, using OpenGL. Dynamic objects are controlled by the Object Interface Manager (OIM) and can be objects representing a JVM structure (a loaded class appears as a building) or objects which the user or agent interacts with. In some cases, an entity belongs to both categories; for example, the operand stack in the virtual world is part of the JVM and the user interacts with it in order to push/pop values. When the student's avatar is close to some of the interactive elements, the user interface shows their names so the student can select one and perform the desired operation ("look at", "take", "use" or "use with"). Javy's management is divided into three main modules: cognition, perception and movement. The perception layer receives changes in the virtual environment and informs the cognition layer about them. This layer interprets the new state and decides to perform some action, which is communicated to the movement layer in order to move Javy and to execute it. The cognition layer interprets the input received from the perception layer, constructs plans and executes them, sending primitive actions to the movement layer. When Javy is demonstrating a task, these plans execute the correct microinstructions. When he is monitoring, the plans check the user's actions and warn her in case of error. The cognition layer uses the instruction graphs and the trees of each scenario. These trees can be seen as trees of tasks and subtasks. However, Javy only uses them to explain what he is doing when he is performing a task or when the student asks for help. When Javy is executing an exercise (each JVM instruction), the cognition layer knows which leaf he is in. When the student asks for an explanation of an action ("why?"), Javy uses the text stored in the leaf. Once he has explained it, he will use the parent's explanation if the pupil asks again, and so on. On the other hand, low-level operation (microinstruction execution) is managed by the graphs. As described above, the graph edges contain the primitive actions the avatar must perform to complete each JVM instruction. When Javy is executing the task, the cognition layer chooses the next action using the graph, orders the movement layer to carry out the microinstruction, and waits until the perception layer notifies it of its completion. When the perception layer reports that a task has been completed, the cognition layer uses the graph, changes
the state using the transition that contains the action, and decides the next primitive action.
7 Conclusions and Future Work
In this paper we have presented a new animated pedagogical agent, Javy, who teaches the Java Virtual Machine structure and the compilation process of the Java language using a constructivist learning environment. Although the system still needs a lot of implementation work, we foresee some further improvements. Currently, learning about the compilation process is a side effect, and the system does not evaluate it. In this sense, one more feature could be added to the application: instead of just giving the user Java source code and its Java byte codes, future exercises could consist only of the Java source code, and the user would have to compile this code "on the fly" and execute it in the metaphorical virtual machine. Javy would detect the user's plans when she is executing primitive actions, to find out which JVM instruction she is trying to execute, and prevent her from doing it when she is wrong. Currently, all the explanations given by the agent are fixed by the course author who builds the domain knowledge or scenarios. A language generation module could be added. Finally, we have to study whether our system is useful. In order to check its benefits, we must analyze whether students take advantage of the software and whether they are engaged by it.
References
[1] W. Bares, L. Zettlemoyer, and J. C. Lester. Habitable 3D learning environments for situated learning. In Proceedings of the Fourth International Conference on Intelligent Tutoring Systems, pages 76–85, San Antonio, TX, August 1998.
[2] W. L. Johnson, J. Rickel, R. Stiles, and A. Munro. Integrating pedagogical agents into virtual environments. Presence: Teleoperators & Virtual Environments, 7(6):523–546, December 1998.
[3] J. C. Lester, P. J. FitzGerald, and B. A. Stone. The pedagogical design studio: Exploiting artifact-based task models for constructivist learning. In Proceedings of the Third International Conference on Intelligent User Interfaces (IUI'97), pages 155–162, Orlando, FL, January 1997.
[4] J. C. Lester, J. L. Voerman, S. G. Towns, and C. B. Callaway. Cosmo: A life-like animated pedagogical agent with deictic believability. In Working Notes of the IJCAI '97 Workshop on Animated Interface Agents: Making Them Intelligent, pages 61–69, Nagoya, Japan, August 1997.
[5] T. Lindholm and F. Yellin. The Java Virtual Machine Specification. 2nd Edition. Addison-Wesley, Oxford, 1999.
[6] R. Moreno, R. E. Mayer, and J. C. Lester. Life-like pedagogical agents in constructivist multimedia environments: Cognitive consequences of their interaction. In Conference Proceedings of the World Conference on Educational Multimedia, Hypermedia, and Telecommunications (ED-MEDIA), pages 741–746, Montreal, Canada, June 2000.
[7] B. Stone and J. C. Lester. Dynamically sequencing an animated pedagogical agent. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 424–431, August 1996.
[8] D. Stottler, N. Harmon, and P. Michalak. Transitioning an ITS developed for schoolhouse use to the fleet: TAO ITS, a case study. In Proceedings of the Industry/Interservice, Training, Simulation & Education Conference (I/ITSEC 2001), August 2001.
[9] R. H. Stottler. Tactical action officer intelligent tutoring system (TAO ITS). In Proceedings of the Industry/Interservice, Training, Simulation & Education Conference (I/ITSEC 2000), November 2000.
[10] J. F. Trindade, C. Fiolhais, V. Gil, and J. C. Teixeira. Virtual environment of water molecules for learning and teaching science. In Proceedings of Computer Graphics and Visualization Education '99 (GVE'99), pages 153–158, July 1999.
Self-Organization Leads to Hierarchical Modularity in an Internet Community Jennifer Hallinan Institute for Molecular Biosciences The University of Queensland Brisbane, QLD, Australia 4072 [email protected]
Abstract. Many naturally-occurring networks share topological characteristics such as scale-free connectivity and a modular organization. It has recently been suggested that a hierarchically modular organization may be another such ubiquitous characteristic. In this paper we introduce a coherence metric for the quantification of structural modularity, and use this metric to demonstrate that a self-organized social network derived from Internet Relay Chat (IRC) channel interactions exhibits measurable hierarchical modularity, reflecting an underlying hierarchical neighbourhood structure in the social network.
1 Introduction
The social community existing by virtue of the internet is unique in that it is unconstrained by geography. Social interactions may occur between individuals regardless of gender, race, most forms of physical handicap, or geographical location. While some forms of interaction are subject to external control, in that a moderator may decide what topics will be discussed and who will be allowed to participate in the discussion, many are subject to little or no authority. Such networks self-organize in a consistent, structured manner. Internet Relay Chat (IRC) is a popular form of online communication. Individuals log onto one of several server networks and join one or more channels for conversation. Channels may be formed and dissolved by the participants at will, making the IRC network a dynamic, self-organizing system. Interactions between individuals occur within channels. However, one individual may join more than one channel at once. The system can therefore be conceptualized as a network in which channels are nodes, and an edge between a pair of nodes represents the presence of at least one user on both channels simultaneously. The IRC network is extremely large; there are dozens of server networks, each of which can support 100,000 or more users. In the light of previous research into the characteristics of self-organized social networks, the IRC network would be expected to exhibit a scale-free pattern of
connectivity [1], [2]. Social networks have also been shown to have a locally modular structure, in which clusters of relatively tightly-connected nodes exist, having fewer links to the rest of the network [3], [4], [5], and it has been suggested that these networks are potential candidates for hierarchical modularity [6], [7].
2 Methods
2.1 Data Collection
Network data was collected using a script for the IRC client program mIRC. The script starts in a single channel and identifies which other channels participants are currently on. It then visits each of these channels in turn and repeats the process. This data was then converted into a set of channel-channel interaction pairs. Because of the dynamic nature of the network, which changes over time as individuals join and leave different channels, no attempt was made to collect data for the entire network. Instead, the network consists of all the channels for which data could be collected in one hour, starting from a single channel and spidering out so that the data collected comprised a single connected component.
2.2 Modularity Detection
Modularity in the network was detected using the iterated vector diffusion algorithm described by [8]. The algorithm operates on a graph. It is initialized by assigning to each vertex in the graph a binary vector of length n, initialized to
$$v_{i,j} = \begin{cases} 0, & i \neq j \\ 1, & i = j \end{cases} \qquad (1)$$
where i is an index into the vector and j is the unique number assigned to a given vertex. The algorithm proceeds iteratively. At each iteration an edge is selected at random and the vectors associated with each of its vertices are moved towards each other by a small amount, δ. This vector diffusion process is iterated until a stopping criterion is met. We chose to compute a maximum number of iterations as the stopping criterion. This number, n, is dependent upon both the number of connections in the network, c, and the size of δ, such that
$$n = \frac{\alpha}{\delta}\, c \qquad (2)$$
where α is the average amount by which a vector is changed in the course of the run. A value for α of 0.1 was selected empirically in trials on artificially generated networks. The final set of vectors is then subjected to hierarchical clustering using the algorithm implemented by [9].
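A sketch of this procedure is given below. It assumes, since the text does not spell it out, that "moved towards each other" means each endpoint vector shifts a fraction δ of the difference towards the other; it is an illustration, not the authors' code.

```python
# Illustrative iterated vector diffusion (assumed interpretation of the update step).
import random

def vector_diffusion(n_vertices, edges, delta=0.01, alpha=0.1):
    # Each vertex starts with the indicator vector of equation (1).
    vectors = [[1.0 if i == j else 0.0 for i in range(n_vertices)]
               for j in range(n_vertices)]
    c = len(edges)
    n_iterations = int(alpha * c / delta)        # stopping criterion from equation (2)
    for _ in range(n_iterations):
        a, b = random.choice(edges)              # select an edge at random
        va, vb = vectors[a], vectors[b]
        for i in range(n_vertices):              # move the two vectors towards each other
            diff = vb[i] - va[i]
            va[i] += delta * diff
            vb[i] -= delta * diff
    return vectors                               # passed on to hierarchical clustering

# Toy usage: a path graph on four vertices
diffused = vector_diffusion(4, [(0, 1), (1, 2), (2, 3)])
```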
Fig. 1. Thresholding a cluster tree. a) Tree thresholded at parent level 2 produces two clusters b) The same tree thresholded at parent level 3 has four clusters
2.3 Cluster Thresholding
The output of the cluster algorithm is a binary tree, with a single root node giving rise to two child nodes, each of which gives rise to two child nodes of its own, and so on. The tree can therefore be thresholded at various levels (two parents, four parents, eight parents, etc.; see Figure 1) and the modularity of the network at each level can be examined.
2.4 Hierarchical Modularity Detection
The binary hierarchical classification tree produced by the cluster detection algorithm was thresholded at every possible decision level, and the average coherence of the modules detected at each level measured, using a coherence metric χ:
$$\chi = \frac{2k_i}{n(n-1)} \;-\; \frac{1}{n}\sum_{j=1}^{n}\frac{k_{ji}}{k_{ji}+k_{jo}} \qquad (3)$$
where k_i is the total number of edges between vertices in the module, n is the number of vertices in the module, k_{ji} is the number of edges between vertex j and other vertices within the module, and k_{jo} is the number of edges between vertex j and vertices outside the module. The first term in this equation is simply the proportion of possible links within the nodes comprising the module which actually exist: a measure of the connectivity within the module. The second term is the average proportion of edges per node which are internal to the module. A highly connected node with few external edges will therefore have a lower value of χ than a highly connected node with many external edges. χ takes a value in the range [-1, +1]. At each level in the hierarchy the number of modules and the average modular coherence of the network were computed. Average coherence was then plotted against threshold level to produce a "coherence profile" summarizing the hierarchical modularity of the network.
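A direct transcription of equation (3) in code is shown below, assuming an undirected graph given as an adjacency set per vertex; this is an illustration rather than the authors' implementation.

```python
# Coherence of a module: link density inside the module minus the average
# proportion of each member's edges that stay inside the module (equation 3).
def coherence(module, adjacency):
    n = len(module)
    if n < 2:
        return 0.0
    k_i = sum(1 for v in module for w in adjacency[v] if w in module and v < w)
    density = 2.0 * k_i / (n * (n - 1))
    avg_internal = 0.0
    for j in module:
        k_ji = sum(1 for w in adjacency[j] if w in module)
        k_jo = len(adjacency[j]) - k_ji
        if k_ji + k_jo:
            avg_internal += k_ji / (k_ji + k_jo)
    avg_internal /= n
    return density - avg_internal          # value lies in [-1, +1]

# Toy usage: a triangle {0, 1, 2} with a pendant vertex 3 attached to vertex 2
adjacency = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(coherence({0, 1, 2}, adjacency))
```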
Fig. 2. The IRC network. Diagram produced using the Pajek software package [10]
3 Results
3.1 Connectivity Distribution
The IRC network consists of 1,955 nodes (channels) and 1,416 edges, giving an average connectivity of 1.07 (Figure 2). Many self-organized networks have been found to have a scale-free connectivity distribution. In the IRC network connectivity ranged from 1 to 211 edges per node, with a highly non-normal distribution (Figure 3). Although the connectivity distribution of the network is heavily skewed, the log-log plot in Figure 3b indicates that it is not completely scale-free. There is a strongly linear area of the plot, but both tails of the distribution depart from linearity. This deviation from the expected scale-free distribution is probably due to practical constraints on the formation of the network. It can be seen from Figure 3b that nodes with low connectivity are under-represented in the network, while nodes of high connectivity are somewhat over-represented.
Fig. 3. Connectivity distribution of the IRC network. a) Connectivity histogram. b) Connectivity plotted on a log-log scale
Under-representation of low-connectivity nodes indicates that very few channels share participants with only a few other channels. This is not unexpected, since a channel may contain up to several hundred participants; the likelihood that most or all of those participants are only on that channel, or on an identical set of channels, is slim. The spread of connectivity values at the high end of the log-log plot indicates that there is a relatively wide range of connectivities with approximately equal probabilities of occurrence. There is a limit to the number of channels to which a user can pay attention, a fact which imposes an artificial limit upon the upper end of the distribution.
3.2 Hierarchical Modularity
The coherence profile of the IRC network is shown in Figure 4a, with the profile for a randomly connected network with the same number of nodes and links shown in Figure 4b. The IRC profile shows evidence of strong modular coherence over almost the entire range of thresholds in the network. The hierarchical clustering algorithm used will identify "modules" at every level of the hierarchy, but without supporting evidence it cannot be assumed that these modules represent real structural features of the network. The coherence metric provides this evidence. In the IRC network it can be seen that at the highest threshold levels, where the algorithm is identifying a small number of large modules, the coherence of the "modules" found is actually negative; there are more external than internal edges associated with the nodes of the modules. By threshold level 14 a peak in modular coherence is reached. At this point, the average size of the modules is about 13 nodes (data not shown). As thresholding continues down the cluster tree, detecting more and smaller modules, the modular coherence declines, but remains positive across the entire range of thresholds. It appears that the IRC network has a hierarchically modular structure, with a characteristic module size of about 14, very close to the mode of the connectivity distribution apparent in Figure 3a. The random network, in comparison, has negative coherence over much of its range, with low modular coherence apparent at higher thresholds.
Fig. 4. a) Coherence profile of the IRC network. b) Coherence profile of an equivalent randomly connected network
4 Conclusions
The analysis of large, naturally-occurring networks is a topic of considerable interest to researchers in fields as diverse as sociology, economics, biology and information technology. Investigation of networks in all of these areas has revealed that, despite their dissimilar origins, they tend to have many characteristics in common. A recent candidate for the status of ubiquitous network characteristic is a hierarchically modular topological organization. We present an algorithm for the quantification of hierarchical modularity, and use this algorithm to demonstrate that a self-organized internet community, part of the IRC network, is indeed organized in this manner, as has been hypothesized but not, as yet, demonstrated. Progress in the analysis of network topology requires the development of algorithms which can translate concepts such as "module: a subset of nodes whose members are more tightly connected to each other than they are to the rest of the network" into a numeric measure such as the coherence metric we suggest here. Such metrics permit the objective analysis and comparison of the characteristics of different networks. In this case, application of the algorithms to the IRC network detects significant hierarchical modularity, providing supporting evidence for the contention that this topology may be characteristic of naturally-occurring networks.
References
[1] Albert, R., Jeong, H. & Barabasi, A.-L. Error and attack tolerance of complex networks. Nature 406, 378–382 (2000).
[2] Huberman, B. A. & Adamic, L. A. Internet: Growth dynamics of the world-wide web. Nature 401(6749), 131 (1999).
[3] Girvan, M. & Newman, M. E. J. Community structure in social and biological networks. Santa Fe Institute Working Paper 01-12-077 (2001).
[4] Flake, G. W., Lawrence, S., Giles, C. L. & Coetzee, F. M. Self-organization and identification of web communities. IEEE Computer 35(3), 66–71 (2002).
[5] Vasquez, A., Pastor-Satorras, R. & Vespignani, A. Physical Review E 65, 066130 (2002).
[6] Ravasz, E., Somera, A. L., Oltvai, Z. N. & Barabasi, A.-L. Hierarchical organization of modularity in metabolic networks. Science 297, 1551–1555 (2002).
[7] Ravasz, E. & Barabasi, A.-L. Hierarchical organization in complex networks. LANL Eprint Archive (2002).
[8] Hallinan, J. & Smith, G. Iterative vector diffusion for the detection of modularity in large networks. InterJournal Complex Systems B, article 584 (2002).
[9] Eisen, M. B., Spellman, P. T., Brown, P. O. & Botstein, D. Cluster analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences 95, 14863–14868 (1998).
[10] Batagelj, V. & Mrvar, A. Pajek - Program for Large Network Analysis. Connections 21, 47–57 (1998).
Rule-Driven Mobile Intelligent Agents for Real-Time Configuration of IP Networks
Kun Yang 1,2, Alex Galis 1, Xin Guo 1, Dayou Liu 3
1 University College London, Department of Electronic and Electrical Engineering, Torrington Place, London WC1E 7JE, UK, {kyang,agalis,xguo}@ee.ucl.ac.uk
2 University of Essex, Department of Electronic Systems Engineering, Wivenhoe Park, Colchester, Essex CO4 3SQ, UK
3 Jilin University, School of Computer Science & Technology, Changchun, 130012 China
Abstract. Even though the intelligent agent has proven itself to be a promising branch of artificial intelligence (AI), its mobility has not yet received enough attention to match the pervasive trend of networks. This paper proposes to inject intelligence into the mobile agents of the current literature by introducing rule-driven mobile agents, so as to retain both the intelligence and the mobility of current agents. This methodology is exemplified in the context of real-time IP network configuration through an intelligent-mobile-agent-based network management architecture, a policy specification language and a policy information model. A case study of inter-domain IP VPN configuration demonstrates the design and implementation of this management system, based on the test-bed developed in the context of the European Union IST project CONTEXT.
1 Background and Rationale
After years of recession, Artificial Intelligence (AI) regained its vitality partly thanks to the inception of the Intelligent Agent (IA). Agents were even highlighted as another approach to AI by S. Russell et al. [1]. An intelligent agent is usually a piece of software with autonomous, intelligent and social capabilities. Intelligent agents and their related areas have been intensively researched over the last decades, and substantial achievements covering a wide range of research fields are available in the literature. As computers and networks become more pervasive, the requirement for intelligent agents to be more (automatically) mobile is becoming a necessity rather than an option. As an active branch of agent technology research, the mobile agent paradigm intends to bring increased performance and flexibility to distributed systems by promoting
"autonomous code migration" (mobile code moving between places) instead of traditional RPC (remote procedure call) such as CORBA, COPS (Common Open Policy Service) [2]. It turns out that more attention has been given to the mobility of mobile agent whereas the intelligence of mobile agent is seldom talked about in the mobile agent research community. Mobile agent technology is very successfully used in the network-related applications, especially network management, where its mobility feature is largely explored [3], but these mobile agents are usually lack of intelligence. We believe mobile agent is first of all an agent that has intelligence. This paper aims to explore the potential use of mobile agent to manage IP network in a more intelligent and automated way. For this purpose, mobile agents should contain certain extent of intelligence to reasonably respond to the possible change in destination elements and perform negotiation. This kind of intelligence should reflect the management strategy of administrator. A straightforward way for network administrator to give network management command or guide is to produce highlevel rules such as if sourceHost is within finance and time is between 9am and 5pm then useSecureTunnel. Then mobile agent can take this rule and enforce it automatically. By using rules to give network management command or strategy, a unique method of managing network can be guaranteed. The use of rule to management network is exactly what so called Policy-based Network Management (PBNM) [4] is about since policies usually appear as rules for network management. Here in this paper, we don't distinguish the difference between rule-based management and policy-based management and in many cases, the term “policybased” is more likely to be used. In order to put this idea into practice, a specific network management task is selected, i.e., IP VPN (Virtual Private Network) configuration. VPN enables an organization to interconnect its distributed sites over a public network with much lower price than the traditional leased-line private network. VPN is a key and typical network application operating in every big telecom operator as a main revenue source. But the lack of real-time and automated configuration and management capabilities of current IP VPN deployment makes the management of growing networks timeconsuming and error-prone. The integration of mobile agents and policy-based network management (as such making a mobile intelligent agent) claims to be a practical solution to this challenge. This paper first discusses an intelligent mobile agent based IP network management architecture with emphasis on IP VPN; then a detailed explanation with respect to policy specification language and IP VPN policy information model is presented. Finally, before the conclusion, a case study for inter-domain IP VPN configuration is demonstrated, aiming to exemplify the design and implementation of this intelligent MA-based IP network management system.
Fig. 1. Intelligent MA-based IP Network Management Architecture (components shown include the policy-based IP management tool with credential check, the LDAP policy repository, the PDP (MA platform) with policy parser, PDP manager and MA factory, the mobile agent initiator, the IP VPN PDP with key management, IKE management, SA management and monitoring, and PEPs (MA platforms) performing IPsec configuration of KLIPS (kernel IPsec) on Linux and hardware routers connected by tunnels)
2 An Intelligent MA-Based IP Network Management Architecture
2.1 Architecture Overview
An intelligent MA-based IP network management system architecture and its main components are depicted in Fig. 1; it is organized according to the PBNM concept as suggested by the IETF Policy Working Group [5]. Please note that we use IP VPN configuration as an example, but this architecture is generic enough for any other IP network management task, provided that the corresponding PDP and its information model are given. The PBNM system mainly includes four components: the policy management tool, the policy repository, the Policy Decision Point (PDP) and the Policy Enforcement Point (PEP). The policy management tool serves as a policy creation environment for the administrator to define/edit/view policies in an English-like declarative language. After validation, new or updated policies are translated into an object-oriented representation and stored in the policy repository, which stores policies in the form of an LDAP (Lightweight Directory Access Protocol) directory. Once the new or updated policy is stored, signalling information is sent to the corresponding PDP, which then retrieves the policy through the PDP Manager via the Policy Parser. After passing the Credential Check, the PDP Manager gets the content of the retrieved policy, upon which it selects the corresponding PDP, in this case the IP VPN PDP. After rule-based reasoning on the retrieved policy, which may involve other related policies stored in the Policy Repository, the PDP decides the action(s) to be taken for the policy. Corresponding mobile agents, initiated via the Mobile Agent Initiator, then carry the bytecode for the actions and move themselves to the PEP and
enforce the policy on the PEP. The automation of the whole procedure also depends on a proper policy information model that can translate the rule-based policies into element-level actions. This will be discussed separately in the next section. Since there is plenty of work presenting rule-based reasoning in the knowledge engineering field, this paper prefers not to repeat it here. Please note that both PDPs and PEPs are in the form of mobile intelligent agents, and intelligence is embedded inside the bytecode itself.
2.2 IP VPN Components
The IP VPN operational part can be regarded as a type of PDP since it performs a subset of the policy management functionality. For easy demonstration, in Fig. 1 all the VPN functional components are placed into one single PDP box. In the actual implementation, they can be separated into different PDPs and coordinated by a VPN PDP manager. Our IP VPN implementation is based on FreeS/WAN IPsec [6], which is a Linux implementation of the IPsec (IP security) protocols. Since an IP VPN is built over the Internet, which is a shared public network with open transmission protocols, VPNs must include measures for packet encapsulation (tunneling), encryption and authentication so as to prevent sensitive data from being tampered with by unauthorized third parties during transit. Three protocols are used: AH (Authentication Header) provides a packet-level authentication service; ESP (Encapsulating Security Payload) provides encryption plus authentication; and finally, IKE (Internet Key Exchange) negotiates connection parameters, including keys, for the other two. KLIPS (kernel IPsec) from FreeS/WAN implements AH, ESP and packet handling within the kernel [6]. More discussion is given to IKE issues, which are closely related to the policies delivered by the administrator via the policy management tool. Key Management Component: encryption is usually the starting point of any VPN solution. The encryption algorithms are well known and widely available in cryptographic libraries. The following features need to be taken into consideration for the key management component: key generation, key length, key lifetime, and key exchange mechanism. IKE Management: the IKE protocol was developed to manage these key exchanges. Using IPsec with IKE, a system can set up security associations (SAs) that include information on the algorithms for authenticating and encrypting data, the lifetime of the keys employed, the key lengths, etc.; this information is usually extracted from rule-based policies. Each pair of communicating computers will use a specific set of SAs to set up a VPN tunnel. The core of the IKE management is an IKE daemon that sits on the nodes for which SAs need to be negotiated. An IKE daemon is deployed on each node that is to be an endpoint of an IKE-negotiated SA. The IKE protocol sets up IPsec connections after negotiating appropriate parameters. This is done by exchanging packets on UDP port 500 between two gateways. The ability to monitor all VPN devices cohesively is vitally important. It is essential to ensure that policies are being satisfied by determining the level of
performance achieved and by knowing what in the network is not working properly, if anything. The monitoring component drawn in the PDP box is actually a monitoring client for querying the status of VPN devices or links. The real monitoring daemons are located next to the monitored elements and are implemented using different technologies depending on the features of those elements.
3
Policy Specification Language and Information Model
Based on the network management system architecture presented above, this section details its design and implementation in terms of two critical policy-based management concerns: the policy specification language and the policy information model. A high-level policy specification language has been designed and implemented to give the administrator the ability to add and change policies in the policy repository. A policy takes the following rule-based format:
[PolicyID] IF {condition(s)} THEN {action(s)}
It means that the action(s) are taken if the condition(s) are true. A policy condition can be expressed in either disjunctive normal form (DNF, an ORed set of AND conditions) or conjunctive normal form (CNF, an ANDed set of OR conditions). The PolicyID field defines the name of the policy rule and also determines how the policy is stored in the policy repository. An example policy is given below; it forces the SA to specify which packets are to be discarded.
IF (sourceHost == Camden) and (EncryptionAlgorithm == 3DES) THEN IPsecDiscard
This rule-based policy is further represented in XML (eXtensible Mark-up Language) due to XML's built-in syntax checking and its portability across heterogeneous platforms [7]. An object-oriented information model has been designed to represent the IP VPN management policies, based on the IETF PCIM (Policy Core Information Model) [8] and its extensions [9]. The major objective of such information models is to bridge the gap between the human policy administrator who enters the policies and the actual enforcement commands executed at the network elements. The IETF has described an IPsec Configuration Policy Model [10], representing IPsec policies that result in configuring network elements to enforce them. Our information model extends the IETF IPsec policy model by adding further functionality at a higher level (the network management level). Fig. 2 depicts part of the inheritance hierarchy of our information model representing the IP VPN policies and indicates its relationships to the IETF PCIM and its extensions. Some of the actions are not shown due to space limitations.
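To make the rule format above concrete, the following is a minimal sketch of how a PDP might hold and evaluate such a rule in memory. The class and field names are illustrative assumptions only; the actual system encodes policies in XML and reasons over the PCIM-based model just described.

```python
# Illustrative sketch only: a toy in-memory form of a rule such as
#   [VpnDiscardRule] IF (sourceHost == Camden) and (EncryptionAlgorithm == 3DES)
#                    THEN IPsecDiscard
# The class and attribute names are assumptions, not the system's actual API.
class Policy:
    def __init__(self, policy_id, conditions, actions):
        self.policy_id = policy_id    # name under which the rule is stored in the repository
        self.conditions = conditions  # list of (attribute, expected value) pairs, ANDed
        self.actions = actions        # action identifiers handed to the PEP

    def evaluate(self, request):
        """Return the actions to enforce if all conditions hold, else an empty list."""
        if all(request.get(attr) == value for attr, value in self.conditions):
            return self.actions
        return []

rule = Policy("VpnDiscardRule",
              [("sourceHost", "Camden"), ("EncryptionAlgorithm", "3DES")],
              ["IPsecDiscard"])

# A hypothetical SA negotiation request arriving at the PDP:
request = {"sourceHost": "Camden", "EncryptionAlgorithm": "3DES"}
print(rule.evaluate(request))   # -> ['IPsecDiscard']
```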
Fig. 2. Class Inheritance Hierarchy of VPN Policy Information Model (the hierarchy includes Policy (PCIM), PolicyCondition (PCIM), PolicyAction (PCIM), SimplePolicyAction (PCIMe), SAAction (abstract), IKEAction, IKERejectAction, IPsecAction (abstract), IPsecDiscardAction, IPsecBypassAction, and PreconfiguredIPsecAction)
4
Case Study: Inter-domain IP VPN
Inter-domain communication is also a challenging research field in network management. This paper provides, as a case study, a solution to inter-domain communication based on mobile intelligent agents. Mobile intelligent agents play a very important role since the most essential components in PBNM, such as the PDP and PEP, are in the form of mobile intelligent agents. Other non-movable components in the PBNM architecture, such as the policy receiving module, are in the form of stationary agents waiting to communicate with incoming mobile agents. Mobile intelligent agents are also responsible for transporting XML-based policies across multiple domains. This case study has been implemented within the context of the EU IST Project CONTEXT [11].
Fig. 3. Inter-domain IP VPN based on Intelligent Mobile Agents (the administrator's Policy Station distributes XML policies and downloadable MA code to Domain A and Domain B; in each domain a Linux machine hosting the MA/PEP configures the Cisco router via SNMP to set up the IP VPN)
The entire scenario is depicted in Fig. 3. The network administrator uses the Policy Management Station to manage the underlying network environment (two domains, each with one physical Cisco router and one Linux machine next to it) by issuing policies, which are translated into XML files and transported to the relevant sub-domain PBNM stations by mobile intelligent agents. In this scenario two mobile intelligent agents are generated at the same time, each going to one domain. Taking one of them as an example: after the mobile agent arrives at the sub-domain management station, it communicates with the stationary agent waiting there. Based on the policy it carries, the sub-domain PDP manager downloads the proper PDP, itself a mobile agent, to make the policy decision. The selected and/or generated policies are then handed to the PEP manager which, also sitting on the sub-domain PBNM station, requests the PEP code required by the policy, e.g., for configuring a new IP tunnel. The PEP, also a mobile agent, moves itself to the Linux machine, where it uses SNMP (Simple Network Management Protocol) to configure the physical router and set up one end of the IP VPN tunnel. The same process happens in the other domain to bring up the other end of the tunnel.
5
Conclusions and Future Work
As shown in the above case study, once the administrator has provided the input requirements, the entire configuration procedure proceeds automatically. The administrator does not need to know or analyse specific sub-domain information, thanks to the mobility and intelligence of the mobile agents. Rule-driven mobile agents bring many advantages, such as automated and rapid deployment of new services, customisation of existing network features, scalability, and cost reduction in network management. However, this is just a first step towards bringing the intelligence and mobility of software agents into IP network management. Defining a full range of rules for IP network management, and studying how they can coexist in a practical network management solution, remain future work. Rule conflict detection and resolution mechanisms will also require more work as the number of policies increases dramatically.
Acknowledgements This paper describes part of the work undertaken in the context of the EU IST project CONTEXT (IST-2001-38142-CONTEXT). The IST programme is partially funded by the Commission of the European Union.
References
[1] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice-Hall, 1995.
[2] K. Yang, A. Galis, T. Mota, and A. Michalas. "Mobile Agent Security Facility for Safe Configuration of IP Networks". Proc. of 2nd Int. Workshop on Security of Mobile Multiagent Systems: 72-77, Bologna, Italy, July 2002.
[3] D. Gavalas, D. Greenwood, M. Ghanbari, M. O'Mahony. "An infrastructure for distributed and dynamic network management based on mobile agent technology". Proc. of Int. Conf. on Communications: 1362-1366, 1999.
[4] M. Sloman. "Policy Driven Management for Distributed Systems". Journal of Network & System Management, 2(4): 333-360, 1994.
[5] IETF Policy workgroup web page: http://www.ietf.org/html.charters/policy-charter.html
[6] FreeS/WAN website: http://www.freeswan.org/
[7] K. Yang, A. Galis, T. Mota and S. Gouveris. "Automated Management of IP Networks through Policy and Mobile Agents". Proc. of Fourth International Workshop on Mobile Agents for Telecommunication Applications (MATA2002): 249-258, LNCS 2521, Springer, Barcelona, Spain, October 2002.
[8] J. Strassner, E. Ellesson, and B. Moore. "Policy Framework Core Information Model". IETF Policy WG, Internet Draft, May 1999.
[9] B. Moore. "Policy Core Information Model Extensions". IETF Draft, IETF Policy Working Group, 2002.
[10] J. Jason. "IPsec Configuration Policy Model". IETF Draft.
[11] European Union IST Project CONTEXT web site: http://context.upc.es/
Neighborhood Matchmaker Method: A Decentralized Optimization Algorithm for Personal Human Network
Masahiro Hamasaki¹ and Hideaki Takeda¹,²
¹ The Graduate University for Advanced Studies, Shonan Village, Hayama, Kanagawa 240-0193, Japan
² National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo 101-8430, Japan
Abstract. In this paper, we propose an algorithm called the Neighborhood Matchmaker Method to optimize personal human networks. A personal human network is useful for various kinds of information utilization, such as information gathering, but it is usually formed locally and often independently. In order to meet various needs for information utilization, it is necessary to extend and optimize it. Using the neighborhood matchmaker method, a person can gain new acquaintances who are expected to share interests, via his/her existing neighborhoods on the personal human network. Iteration of matchmaking is used to optimize personal human networks. We simulate the neighborhood matchmaker method with practical data and with random data, and compare the results of our method with those of a central server model. The neighborhood matchmaker method reaches almost the same results as the server model with each type of data.
1
Introduction
Exchanging information among people is one of the most powerful and practical ways to cope with the information flood, because people can act as intelligent agents for each other to collect, filter and associate necessary information. This power stems from the personal human network: if we need varied information to exchange, we must have a good human network. A personal human network is useful for various kinds of information utilization, such as information gathering, but it is usually formed locally and often independently. In order to meet various needs for information utilization, it is necessary to extend and optimize it. In this paper, we propose a network optimization method called the "Neighborhood Matchmaker Method". It can optimize networks in a distributed manner, starting from arbitrarily given networks.
2
Related Work
There are some systems that capture and utilize personal human networks in computers. Kautz et al. [1] emphasized the importance of human relations for the WWW and
did pioneering work on finding human relations: their system, ReferralWeb, can find people by analyzing a bibliography database. Sumi et al. [2] helped people meet persons who have the same interests and share information, using mobile computers and web applications. Kamei et al. supported community formation by visualizing the relationships among participants [3]. These systems assume a group as a target, either explicitly or implicitly. The first problem is how to form such groups, and especially how to find people as members of groups; we call it the "meet problem". The second problem is how to find, within a group, people suitable for specific topics and persons; we call this the "select problem". The bigger a group is, the more likely it is to contain persons valuable for exchanging information. However, with these systems, we have to make more effort to select such persons from the many candidates in the group, and it is difficult to organize and manage such a large group. Therefore, information exchanging systems should support methods that realize the above two requirements, i.e., to meet and select new partners.
3
Neighborhood Matchmaker Method
As mentioned in the previous section, if we need better relationships for information exchange, we must meet and select more and more partners. This is a big burden, because we would have to meet all the candidates before selecting among them, and since we do not know new friends before meeting them, we have no way to select them. How do we solve this problem in daily life? The practical way is the introduction of new friends by current friends. It is realistic and efficient because a person who knows both parties can judge whether the combination is suitable. Friends work as matchmakers for new friends. We formalize this "friends as matchmakers" idea as an algorithm to extend and optimize networks. The key feature of this approach is that no central server is needed. The benefits are three-fold. The first is that the spread of information is kept minimal: information about a person is transferred only to persons connected to her/him directly, which helps keep personal information secure. The second is distributed computation: the computation needed to figure out better relationships is done by each node, i.e., by the computers used by the participants, which is appropriate for a personal human network because we do not have to worry about the size of the network. The third is gradual computation: the network converges gradually, so we can obtain a network that is optimal to some extent even if we stop the computation at any time.
4
Formalization
In this section, we introduce a model that can optimize networks by formalizing the method we use in real life; we call it the "Neighborhood Matchmaker Method (NMM)" hereafter. Before explaining NMM, we define the network model for this problem. First we define a person as a node, and a connection
Fig. 1. Behavior of nodes
for information exchange between people as a path. Here we assume that we can measure the degree of connection between two nodes (hereinafter referred to as the "connection value"). We can then define making a good environment for information exchange as optimizing this network. In NMM, the network is optimized by the matchmaking of neighbor nodes. Two conditions are needed to apply NMM:
– All nodes can possibly connect to each other
– All nodes can calculate the relationship between the nodes connected to them
In summary, these conditions mean that every node can act as a matchmaker for its connected nodes in order to improve the connection network. The behavior of a node as a matchmaker is as follows.
1. The node calculates the connection values between its neighbor nodes (we call this node the "matchmaker").
2. If it finds pairs of nodes with good enough connection values, it recommends them, i.e., it tells each element of the recommended pair that the pair is a good candidate for connection.
3. A node that receives a recommendation decides whether to accept it or not.
We can optimize the personal human network by iterating this behavior. Figure 1 shows these behaviors, and a minimal sketch of one matchmaking round is given below. In the next section, we test this method with simulations.
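The sketch below illustrates one matchmaking round under these two conditions. It is an illustration only, not the authors' implementation: the connection-value function, the recommendation threshold, and the "always accept" decision in step 3 are placeholder assumptions (the paper leaves the acceptance tactic open, and the simulation in the next section uses a replace-the-worst-path tactic instead).

```python
# Minimal sketch of one matchmaking round (illustrative assumptions throughout).
# `network` maps each node to the set of its current neighbors; `value(a, b)` is
# the connection value between two nodes, assumed computable by a common neighbor.
def matchmaking_round(network, value, threshold=0.5):
    recommendations = []
    for matchmaker, neighbors in network.items():
        nodes = list(neighbors)
        # 1. The matchmaker computes connection values between its own neighbors.
        for i in range(len(nodes)):
            for j in range(i + 1, len(nodes)):
                a, b = nodes[i], nodes[j]
                if b in network[a]:
                    continue                      # the pair is already connected
                # 2. Pairs with a good enough value are recommended to both nodes.
                if value(a, b) >= threshold:
                    recommendations.append((a, b, value(a, b)))
    # 3. Each recommended node decides whether to accept; here both always accept.
    for a, b, _ in recommendations:
        network[a].add(b)
        network[b].add(a)
    return recommendations
```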
5
Experiments
Since NMM only ensures local optimization, we should investigate the global behavior that results from applying the method. We test the method by simulation, simulating optimization with NMM using both random data and practical data.
5.1 The Procedure of the Simulation
In the previous section, we introduced NMM as three steps, but the third step, i.e., the decision, is free to use any tactics for the recommended nodes.
Fig. 2. Flow chart of the simulation
In the simulation, we choose a simple tactic: each node wants to connect to other nodes that have better connection values, i.e., if a new node offers a better connection than the worst existing one, the former replaces the latter. Figure 2 shows the flow chart of the simulation. First, we create nodes, each of which holds some data representing a person; in this experiment, the data is a 10-dimensional vector or a WWW bookmark taken from a user. We initially place paths between nodes randomly and fix the number of paths during the simulation, which means that adding a path requires deleting another. In every turn, one node is selected randomly and exchanges paths. All nodes follow the same tactics for exchanging paths (a sketch is given below): a node adds the best path recommended by its matchmakers and, if it adds a path, removes its worst path instead, so that the number of paths in the network stays fixed; a path is added only if it is better than the worst path the node already has. If no node can obtain a new path through the matchmakers, the network has converged, and the simulation is concluded.
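A sketch of this path-exchange tactic is given below. It assumes helper arguments (the connection-value function and the list of partners recommended by the node's matchmakers) that stand in for the simulation machinery; it is not the authors' code.

```python
# Sketch of the simulation tactic (assumed data layout, not the authors' code):
# the selected node adds the best recommended path and drops its current worst
# path, so the total number of paths in the network stays fixed.
def exchange_paths(node, network, value, recommended):
    """recommended: candidate partners suggested by this node's matchmakers."""
    candidates = [c for c in recommended if c != node and c not in network[node]]
    if not candidates or not network[node]:
        return False
    best_new = max(candidates, key=lambda c: value(node, c))
    worst_old = min(network[node], key=lambda c: value(node, c))
    if value(node, best_new) <= value(node, worst_old):
        return False                      # no improvement: keep the network as is
    # Replace the worst existing path by the better recommended one.
    network[node].remove(worst_old); network[worst_old].remove(node)
    network[node].add(best_new); network[best_new].add(node)
    return True
```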
5.2 The Measurement
Since the purpose of the simulation is to see how well our method achieves optimization of the network, we should define what the optimized network is. We adopt a simple criterion: the best network with N paths is the network that includes the N best paths in terms of connection values¹. The good news is that this network can easily be calculated by collecting and processing the information of all nodes, so we can compare this best network with the networks generated by our method. Of course, this computation requires a central server, whereas our method can be performed in a distributed way. We compare two networks in the following two ways. One is the cover rate, i.e., how many of the paths in the best network are found in the generated network.
¹ This criterion may not be "best" for individual nodes, because some nodes may end up without any connections. Another criterion can be adopted if needed.
The cover rate thus indicates how similar in structure the two networks are. The other measure is the reach rate, which compares the average connection value of the best and the generated networks; it indicates how similar in effectiveness the two networks are. These measures are defined by the following formulas:

cover rate = |{P_current} ∩ {P_best}| / N

reach rate = ( Σ_{l=1..N} f(p_l | p_l ∈ {P_current}) ) / ( Σ_{m=1..N} f(p_m | p_m ∈ {P_best}) )

where p is a path, N is the number of paths, {P} is a set of paths, {P_best} is the best set of paths, {P_current} is the current set of paths, and f(p) is the value of a path.
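Both measures can be computed directly from the two path sets, as in the sketch below. It is a hedged illustration, assuming each path is stored as a sorted (node, node) tuple so that undirected paths compare equal, and `value` stands for the path-value function f.

```python
# Sketch of the two comparison measures (illustrative only).
def cover_rate(current_paths, best_paths):
    """Fraction of the best network's paths that also appear in the generated network."""
    return len(set(current_paths) & set(best_paths)) / len(best_paths)

def reach_rate(current_paths, best_paths, value):
    """Ratio of summed path values: generated network vs. best network."""
    return (sum(value(a, b) for a, b in current_paths) /
            sum(value(a, b) for a, b in best_paths))
```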
6
Simulation Results
There are two parameters controlling the experiments: the number of nodes and the number of paths. In this experiment, we set the number of nodes from 10 to 100 and the number of paths from 1 to 5 times the number of nodes. The simulation is performed 10 times for each set of parameters, and we use the average as the result. The graphs in Figure 3 plot the average cover rate against the turn. Figure 3-a shows the results when the number of paths is fixed at three times the number of nodes, and Figure 3-b shows the results when the number of nodes is fixed at 60. In our formalization, we cannot know in advance whether the network will converge; however, all graphs become horizontal, which implies that all networks converged using matchmaking. We also find that the average of the measurements and the turn of convergence are affected by the numbers of nodes and paths. We observed similar results for the reach rate; the difference is that the reach rate is less dependent on the numbers of paths and nodes. We also examine the relationship between the size of the network and the turn of convergence. After iterating the simulation while varying the numbers of nodes and paths, we obtain the graph in Figure 4, which plots the average number of convergence turns against the number of nodes. The graph indicates that the turn of convergence increases linearly with the number of nodes. In this simulation, only a single node can exchange paths in a turn, so the number of exchanges per node does not become very large. Let us roughly estimate the computational complexity of the algorithm. When the average number of neighborhood nodes is r, the algorithm calculates connection values 2r times in every turn. When the number of nodes is N and the number of turns to convergence is kN according to Figure 4, the number of calculations
Fig. 3. Cover-Rate in the random data
needed to converge is 2rkN using NMM. In the centralized model, the number of calculations is N² because connection values must be calculated among all nodes. Since r and k are fixed values, the order is O(N) using NMM, which is less than the O(N²) of the centralized model. We also used practical data generated by people: we use WWW bookmarks to measure the connection values among people. Users add web pages in which they are interested and organize topics as folders in their WWW bookmarks, so it can be said that a WWW bookmark represents the user profile. In this simulation, we need to calculate the relationship between nodes; we use a parameter called "category resemblance" as the value of the relationship between nodes [4]. This parameter is based on the resemblance of the folder structures of the WWW bookmarks. We examined the average of the measurements and the convergence turns and found a tendency similar to that of the random data. These results indicate that the network can also be optimized with practical data.
Fig. 4. Average of Convergence Turn
7
Conclusion
In this paper, we proposed a way to find new partners for exchanging information, through a method called the "Neighborhood Matchmaker Method (NMM)". Our method uses collaborative and autonomous matchmaking and does not need any central server. Nevertheless, as our experimental results show, a near-optimal personal human network can be obtained. In the simulations, a number of paths 2 to 3 times the number of nodes and a number of turns 1.5 to 2 times the number of nodes are needed in order to optimize the network sufficiently. The method is applicable to communities of any size, because it calculates the relationships among people without collecting all the data on one server; it can thus assist bigger groups, which are more likely to contain persons valuable for exchanging information, and it has a lower computational cost. Furthermore, it is an easy and quick method because it can be started anytime and anywhere without registration at a server, so it can help form the dynamic and emergent communities that are typical of the Internet. We are now developing a system based on the proposed method, a system for sharing hyperlinks and comments. In the real world, personal networks change dynamically through the exchange of information among people. A further direction of this study will be to experiment with this system and investigate its effectiveness in the real world.
References
[1] H. Kautz, B. Selman, M. Shah. ReferralWeb: Combining Social Networks and Collaborative Filtering. Communications of the ACM, Vol. 40, No. 3, 1997.
[2] Y. Sumi, K. Mase. Collecting, Visualizing, and Exchanging Personal Interests and Experiences in Communities. In Proceedings of the 2001 International Conference on Web Intelligence (WI-01), 2001.
[3] K. Kamei, et al. Community Organizer: Supporting the Formation of Network Communities through Spatial Representation. In Proceedings of the 2001 Symposium on Applications and the Internet (SAINT'01), 2001.
[4] M. Hamasaki, H. Takeda. Experimental Results for a Method to Discover Human Relationship Based on WWW Bookmarks. In Proceedings of Knowledge-Based Intelligent Information & Engineering Systems (KES'01), 2001.
Design and Implementation of an Automatic Installation System for Application Program in PDA
Seungwon Na and Seman Oh
Dept. of Computer Engineering, Dongguk University, 263-Ga Phil-Dong, Jung-Gu, Seoul 100-715, Korea
{nasw,smoh}@dgu.ac.kr
Abstract. The extension of Internet technology to mobile communication technology has brought us the wireless Internet, which is growing into a popular service because of its added convenience of mobility. Wireless Internet was first provided through cellular phones, but the current trend is moving toward PDAs (Personal Digital Assistants), which have extended functionality. Applications that increase the functionality of PDAs are constantly being developed, and application software must occasionally be installed. Also, when a PDA's power supply becomes fully discharged, all data stored in RAM (Random Access Memory) is erased and programs must be reinstalled. This paper presents an automated application program installation system, PAIS (PDA Auto Installing System), designed as a solution to the problem of PDA users having to install application programs on their PDAs themselves. When this system is applied, PDA users save the time and effort required for reinstallations, an added convenience, and application software companies save on the costs previously needed to create materials explaining the installation process.
1
Introduction
Wireless Internet was first provided through cellular phones in 1999, in Korea. But because of their many restrictions, including limited CPU processing power, narrow memory space, small display area, etc., much needed to be improved for cellular phones to be used as mobile Internet devices [9]. Therefore PDAs, which have extended functionality compared with cellular phones, are rising as the new medium for mobile Internet devices. PDAs were originally used mainly as personal schedule planners, but with the addition of wireless connection modules they are evolving into next-generation mobile devices which provide not only telephony services but also wireless Internet services. To provide various services, many new technologies are being developed, and each time users must install new software. Also, when a PDA's power supply becomes fully discharged, all data stored in RAM is erased, and in this case
users must reinstall application software [10]. Added to this, PDAs are equipped with one of several different operating systems, such as PPC2002, P/B, Palm or Linux. This means that there are several programs to choose from, and the installation process becomes complicated. These problems must be addressed before the use of wireless Internet on PDAs can be accelerated. In this paper we present the design and implementation of a system that automates the installation process of application programs on PDAs. This system is named PAIS (PDA Auto Installing System). Its main function is as follows: the PDA agent sends an installation data file to the server, which the server compares to its file management table to figure out which installation file to send; it then sends the right installation file to each PDA, and the agent installs the downloaded file into the proper directory.
2
Related Works
In this section, the PDA's structure and SyncML, the proposed standard for data synchronization technology, are discussed.
2.1 Current Status of PDA
A PDA device is composed of a network communication unit, an input/output unit, and a memory storage unit; Table 1 shows the components of each unit. The PDA's memory unit is made up of ROM (Read Only Memory) and RAM (Random Access Memory). ROM is divided again into EPROM and Flash ROM; Flash ROM supports both read and write access [3]. A PDA does not have a separate auxiliary storage device and uses the object storage technique, storing most executable programs and data in RAM. Using RAM as a memory device provides faster execution than ROM, but has the problem that its contents are erased when the power supply is fully discharged.
2.2 SyncML (Synchronization Markup Language)
SyncML is a standard language for synchronizing all devices and applications over any network. SyncML defines three basic protocols for transmitting messages: the Representation protocol, which defines the message data format; the Synchronization protocol, which defines the synchronization rules; and the Sync protocol, which defines the binding methodology [11].
Table 1. PDA Components
Device unit           | Components
Network Communication | Modem, TCP/IP, IrDA
Input/Output          | Touch Panel, MIC/Phone, LCD Display
Memory Storage        | ROM, RAM, Memory Card
Fig. 1. SyncML Framework
The SyncML Representation protocol is a syntax for SyncML messages; it defines an XML DTD for expressing all forms of data, metadata, and synchronization instructions needed to accomplish synchronization [6]. Transfer protocols include HTTP, WSP (Wireless Session Protocol) and OBEX (Object Exchange Protocol). These protocols are not dependent on the Representation protocol or the Sync protocol, and therefore other transfer protocols can be bound later on. The Sync protocol defines the rules for exchanging messages between client and server to add, delete, and modify data; it defines the operational behaviour for actual synchronization and also defines the synchronization types. Through these mechanisms SyncML supports the following features [4]:
a) Efficiency of operation in both wired and wireless environments
b) Support of several different transport protocols
c) Support of any data type
d) Access to data from several applications
e) Consideration of limited mobile device resources
f) Compatibility with the currently existing Internet and web environment
The structure of the SyncML framework is shown in Figure 1.
Fig. 2. PAIS Operation Concept
Fig. 3. PDA Agent Operation
3
PAIS Design
PAIS (PDA Auto Installing System) is a system which sends the executable file data stored in a PDA to the server, receives the selected installation file, and automatically installs the downloaded file. The basic operation concept of PAIS, shown in Fig. 2, is as follows: each PDA (A, B) sends its data (60, 70) to the PAIS server; the server compares this to its own data (100), selects the missing installation files (40, 30) and creates an installation package; these installation packages are then uploaded to the proper devices. PAIS consists of two components, the embedded PDA agent and the PAIS server, which manages setup files: the PDA agent sends and receives file information, and the PAIS server manages installation files [8].
3.1 PDA Agent Design
The main function of the PDA agent is to collect internal file information and transmit it to the PAIS server, then download the final installation file and automatically install it into the appropriate directory. The PDA agent collects executable file information from the registry and sends it to the PAIS server. The collected information is as shown in Table 2 and is stored as a binary file together with the ID. The detailed information collected from the PDA's registry is the basis for identifying the application programs. The collected data file is sent as shown in Figure 4.
Table 2. PDA gathering Information Examples
Category             | Detailed Information                     | ID
Customer Information | Customer ID                              | 10~
PDA Device           | HP, Samsung, SONY                        | 20~
PDA O/S              | PPC 2002, Palm, Linux                    | 30~
Application          | Internet Browser, e-Book Viewer, VM etc. | 40~
Fig. 4. PAIS Agent File Processing Flow
3.2 PAIS Server Design
The PAIS server handles the PDA's connection authentication using the DB server, manages application programs, creates packages with installation files, and sends the installation files to each PDA. It is also connected to an SMS (Short Message Service) server and alerts the PDA user to new or upgraded files to promote installation. As additional functions, statistical data and bulletin board support are provided. The SMS service informs the user about new updates, but the actual installation through the PDA agent is controlled by the user's settings. PDA device type and application program information is managed through specified ID numbers, and the agent manages on a per-application basis with an already specified ID number table. The PAIS server compares the transmitted PDA file information to the installation file comparison table. The basis for this comparison is the version information: only when the server's version is higher is the designated installation file selected. Figure 6 outlines a case where the first and third files are selected. In this case, the selected executable files are packaged and uploaded to the PDA through the wireless connection. The package file contains three URLs: the basic file catalog URL, the detailed file information URL, and the download URL.
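A simplified sketch of this server-side selection step is shown below. The data layout (ID-keyed dictionaries with numeric version values) is an assumption made only for illustration; the actual comparison table and IDs follow the scheme of Table 2.

```python
# Illustrative sketch of the server-side comparison (not the actual PAIS code).
# Both tables map an application ID (e.g. 40, 41, ...) to a version number.
def select_install_files(pda_info, server_table):
    """Return the IDs whose server version is higher than the PDA's version."""
    selected = []
    for app_id, server_version in server_table.items():
        if server_version > pda_info.get(app_id, 0):
            selected.append(app_id)
    return selected

pda_info     = {40: 1.0, 41: 2.0, 42: 1.1}   # reported by the PDA agent
server_table = {40: 1.2, 41: 2.0, 42: 1.3}   # installation file comparison table
print(select_install_files(pda_info, server_table))   # -> [40, 42] (first and third)
```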
Fig. 5. PAIS Agent File Processing Flow
Fig. 6. Install File Comparison Example
Transmission mode: Server address?UID=serviceID_platformID_applicationID
[BASIC_URL]    = http://www.PAIS.com/basic.asp
[DETAIL_URL]   = http://www.PAIS.com/detail.asp
[DOWNLOAD_URL] = http://www.PAIS.com/download.asp
4
PAIS Implementation
The development environment and software tools used to implement the PAIS system proposed in this paper are as follows:
• Embedded Visual C++ 3.0 (language)
• Pocket PC PDA
• PocketPC 2002 (O/S)
The download and automatic installation of an application program were demonstrated through an emulator, as shown in Figure 7.
Fig. 7. Implement Result
The screen capture on the left shows the PDA uploading the application's information to the server. The screen capture in the center shows the PDA downloading the package file from the server, i.e., a package for installing the service. The screen capture on the right shows the automatic installation of the received install package file.
5
Conclusion
When the power supply of a PDA becomes fully discharged, all data stored in RAM is erased, and each time this happens users must reinstall all programs. Also, because PDAs are constantly being upgraded, not only do users have to install new programs, but updates also occur frequently, and users must perform the actual installation themselves. To solve this drawback of PDAs, we have proposed in this paper a system named PAIS, which automatically installs the desired application programs on a PDA. When PAIS is applied, PDA users save the time and effort required for installations, an added convenience, and application software companies save on the costs previously needed to create materials explaining the installation process. This improves the limited PDA environment and provides better wireless Internet services. For future research, we plan to look into ways of expanding this automatic installation technology to cellular phones and other mobile devices.
References
[1] Byeongyun Lee, "Design and Implementation of SyncML Data Synchronization System on Session Manager", KISS Journal, Vol. 8, No. 6, pp. 647-656, Dec. 2002.
[2] Insuk Ha, "All Synchronization Suite of Programs SDA Analysis", Micro Software, pp. 309-313, Aug. 2001.
[3] James Y. Wilson, Building Powerful Platforms with Windows CE, 2nd Edition, Addison Wesley, March 2000.
[4] Jiyeon Lee, "The Design and Implementation of Database for Sync Data Synchronization", KIPS Conference, Seoul, Vol. 8, No. 2, pp. 1343-1346, Oct. 2001.
[5] Suhui Ryu, "Design and Implementation of Data Synchronization Server's Agent that Uses the Sync Protocol", KISS Conference, Seoul, Vol. 8, No. 2, pp. 1347-1350, Oct. 2001.
[6] Sync Initiative, Sync Architecture version 0.2, May 2000.
[7] Taegyune An, Mobile Programming with Pocket PC, Inforgate, July 2002.
[8] Uwe Hansman, Synchronizing and Managing Your Mobile Data, Prentice Hall, Aug. 2002.
[9] Intromobile, Mobile Multimedia Technology Trend, http://www.intromobile.co.kr/solution/
[10] Microsoft, PocketPC official site, http://www.microsoft.com/mobile/pocketpc/
[11] Sync Initiative, http://www.syncml.org/
Combination of a Cognitive Theory with the Multi-attribute Utility Theory
Katerina Kabassi and Maria Virvou
Department of Informatics, University of Piraeus, 80 Karaoli & Dimitriou Str., 18534 Piraeus, Greece
{kkabassi,mvirvou}@unipi.gr
Abstract. This paper presents how a novel combination of a cognitive theory with the Multi-Attribute Utility Theory (MAUT) can be incorporated in a Graphical User Interface (GUI) in order to provide adaptive and intelligent help to users. The GUI is called I-Mailer and is meant to help users achieve their goals and plans during their interaction with an e-mailing system. I-Mailer constantly observes the user and in case it suspects that s/he is involved in a problematic situation it provides spontaneous advice. In particular, I-Mailer suggests alternative commands that the user could have given instead of the problematic one that s/he gave. The reasoning about which alternative command could have been best is largely based on the combination of the cognitive theory and MAUT.
1
Introduction
In real-world situations humans must make a great number of decisions that usually involve several objectives, viewpoints or criteria. The representation of different points of view (aspects, factors, characteristics) with the help of a family of criteria is undoubtedly the most delicate part of the formulation of a decision problem (Bouyssou 1990). Multi-criteria decision aid is characterised by methods that support planning and decision processes through collecting, storing and processing different kinds of information in order to solve a multi-criteria decision problem (Lahdelma 2000). Although research in Artificial Intelligence has tried to model the reasoning of users, it has not considered multi-criteria decision aid as much as it could. This paper shows that the combination of a cognitive theory with a theory of multi-criteria analysis can be incorporated into a user interface to improve its reasoning. This reasoning is used by the system to provide automatic intelligent assistance to users who are involved in problematic situations. For this purpose, a graphical user interface has been developed. The user interface is called Intelligent Mailer (I-Mailer) and is meant to operate as a standard e-mail client. I-Mailer monitors all users' actions and reasons about them; in case it diagnoses a problem, the system provides spontaneous assistance. The system uses Human Plausible Reasoning (Collins & Michalski
1989) (henceforth referred to as HPR) and its certainty parameters in order to make inferences about possible user errors based on evidence from the users' interaction with the system. HPR is a cognitive theory about human reasoning and has been used in I-Mailer to simulate the users' reasoning, which may be correct or incorrect (but still plausible) and may thus lead to "plausible" user errors. HPR has previously been adapted in two other Intelligent Help Systems (IHS), namely RESCUER (Virvou & Du Boulay, 1999), an IHS for UNIX users, and IFM (Virvou & Kabassi, 2002), an IHS for users of graphical file manipulation programs such as the Microsoft Windows Explorer (Microsoft Corporation 1998). However, neither of these systems incorporated the combination of MAUT and HPR. In the case of a user's error, I-Mailer uses statement transforms, the simplest class of inference patterns of HPR, to find the action that the user might have meant to issue instead of the one causing the problem. However, the main problem with this approach is the generation of many alternative actions. In order to select the most appropriate one, I-Mailer uses HPR's certainty parameters. Each certainty parameter represents a criterion that a user takes into account in order to select the action that s/he means to issue. Therefore, each time the system generates alternative actions, a problem is defined that has to be solved taking into account multiple, often conflicting, criteria. The aim of multi-criteria decision analysis is to recommend an action when several alternatives have to be evaluated in terms of many criteria. One important theory in multi-criteria analysis is the Multi-Attribute Utility Theory (MAUT), which is based on aggregating the different criteria into a function that has to be maximised. In I-Mailer, we have applied MAUT to combine the values for a given action of a user, rank the set of actions and thus select the best alternative action to suggest to the user.
2
Background
2.1 Multi-attribute Utility Theory
As Vincke (1992) points out, a multi-criteria decision problem is a situation in which, having defined a set A of actions and a consistent family F of n criteria g_1, g_2, ..., g_n (n ≥ 3) on A, one wishes to rank the actions of A from best to worst and determine a subset of actions considered to be the best with respect to F. The preferences of the Decision Maker (DM) concerning the alternatives of A are formed and argued by reference to the n points of view adequately reflected by the criteria contained in F. When the DM must compare two actions a and b, three cases can describe the outcome of the comparison: the DM prefers a to b, the DM is indifferent between the two, or the two actions are incomparable. The traditional approach is to translate the decision problem into the optimisation of some function g defined on A: if g(a) > g(b) then the DM prefers a to b, whereas if g(a) = g(b) then the DM is indifferent between the two. The theory
defines a criterion as a function g, defined on A, taking its values in a totally ordered set, and representing the DM's preferences according to some point of view. Therefore, the evaluation of action a according to criterion j is written g_j(a). MAUT is based on a fundamental axiom: any decision-maker attempts unconsciously (or implicitly) to maximise some function U = U(g_1, g_2, ..., g_n) aggregating all the different points of view which are taken into account. In other words, if the DM is asked about his/her preferences, his/her answers will be coherent with a certain unknown function U. The role of the researcher is to try to estimate that function by asking the DM some well-chosen questions. The simplest (and most commonly used) analytical form is, of course, the additive form:

U(a) = Σ_{j=1..n} k_j U_j(x_j^a)    (1)

where ∀j: x_j^a = g_j(a), Σ_{j=1..n} k_j = 1, U_j(x_j) = 0 and U_j(y_j) = 1.
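A minimal sketch of this additive aggregation is shown below; the criterion scores and weights are made-up numbers used only to illustrate how two actions are compared.

```python
# Sketch of the additive MAUT form in equation (1); all numbers are invented.
def utility(scores, weights):
    """U(a) = sum_j k_j * U_j(x_j^a), with the weights summing to 1."""
    assert abs(sum(weights) - 1.0) < 1e-9
    return sum(k * u for k, u in zip(weights, scores))

weights = [0.4, 0.3, 0.2, 0.1]   # k_1..k_4
a = [0.8, 0.2, 0.5, 0.9]         # U_j(x_j^a) for action a
b = [0.6, 0.7, 0.4, 0.1]         # U_j(x_j^b) for action b
print("DM prefers a" if utility(a, weights) > utility(b, weights) else "DM prefers b")
```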
2.2 Human Plausible Reasoning
Human Plausible Reasoning theory (Collins & Michalski 1989, Burstein et al. 1991) is a cognitive theory that attempts to formalise the plausible inferences that occur in people's responses to different questions when the answers are not directly known. The theory is grounded on an analysis of people's answers to everyday questions about the world, a set of parameters that affect the certainty of such answers, and a system relating the different plausible inference patterns and the different certainty parameters. For example, if the question asked was whether coffee is grown in the Llanos region in Colombia, the answer would depend on the knowledge retrieved from memory. If the subject knew that Llanos was in a savanna region similar to that where coffee grows, this would trigger an inductive, analogical inference and generate the answer yes (Carbonel & Collins 1973). According to the theory, a large part of human knowledge is represented in "dynamic hierarchies", which are always being updated, modified or expanded. In this way, the reasoning of people with patchy knowledge can be modelled. Statement transforms, the simplest class of inference patterns, exploit the 4 possible relations among arguments and among referents to yield 8 types of statement transform. Statement transforms can be affected by certainty parameters. The degree of similarity (σ) signifies the degree of resemblance of one set to another. The degree of typicality (τ) represents how typical a subset is within a set (for example, cow is a typical mammal). The degree of frequency (φ) counts how frequent a referent is in the domain of the descriptor. Dominance (δ) indicates how dominant a subset is in a set (for example, elephants are not a large percentage of mammals). Finally, the only certainty parameter applicable to any expression is the degree of certainty (γ). The degree of
certainty or belief that an expression is true is defined by HPR as γ = f(σ, δ, φ, τ). However, the exact formula for the calculation of γ is not specified.
3
Intelligent Mailer
Intelligent Mailer (I-Mailer) is an intelligent graphical user interface that works in a similar way to a standard e-mail client but also incorporates intelligence. The system's main aim is to provide spontaneous help and advice to users who have made an error with respect to their hypothesised intentions. Every time a user issues a command, the system reasons about it using a limited goal-recognition mechanism. If the command is considered problematic, the system uses the principles of HPR to make plausible guesses about the user's errors. In particular, it implements some statement transforms in order to generate alternative commands that the user could have issued instead of the problematic one. If an alternative command is found that is compatible with the user's hypothesised goals and is not problematic like the command issued, this command is suggested to the user. However, the main problem with this approach is the generation of many alternative commands. In order to select the most appropriate one, I-Mailer uses an adaptation of five of the certainty parameters introduced in HPR. The values of the certainty parameters for every alternative action are supplied by the user model of the particular user (Kabassi & Virvou 2003). In I-Mailer, the degree of similarity (σ) is used to calculate the resemblance of two commands or two objects; it is mainly used to capture possible confusions between commands. The typicality (τ) of a command represents the estimated frequency of execution of that command by the particular user. The degree of frequency (φ) of an error represents how often a specific error is made by a particular user. The dominance (δ) of an error within the set of all errors shows which kind of error is the most frequent for a particular user. The values of these certainty parameters are calculated from the information stored in the individual user models and the domain representation. Finally, all the parameters presented above are combined to calculate a degree of certainty for every alternative command generated by I-Mailer; this degree of certainty (γ) represents the system's certainty that the user intended the alternative action generated. However, as mentioned earlier, the exact formula for the calculation of the degree of certainty has not been specified by HPR; therefore, we have used MAUT to calculate it, as presented in detail in the next section. An example of how I-Mailer works is presented below. A user, in an attempt to organise his mailbox, intends to delete the folder 'Inbox\conference1\'. However, he accidentally attempts to delete 'Inbox\conference2\'; in this case, he runs the risk of losing all e-mail messages stored in the folder 'Inbox\conference2\'. I-Mailer would suggest that the user delete the folder 'Inbox\conference1\', because 'Inbox\conference1\' is empty whereas 'Inbox\conference2\' is not, and the two folders have very similar names, so one could have been mistaken for the other.
Fig. 1. A graphical representation of the user's mailbox
4
Specification of the Degree of Certainty Based on MAUT
The decision problem in I-Mailer is to find the best alternative command to be suggested to the user instead of the problematic one that s/he has issued. This problem involves the calculation of the degree of certainty γ for each alternative command generated through the statement transforms. Each of the certainty parameters of HPR is considered as a criterion; this means that we consider criteria such as the degree of similarity, the degree of frequency, the degree of typicality and dominance. Thus, the main goal of MAUT in I-Mailer is to try to optimise the function U(a) = Σ_{j=1..n} k_j U_j(x_j^a) described in Section 2.1, where a is an alternative command to be suggested, U_j is the estimated value of each certainty parameter (j = 1, 2, 3, 4) and k_j is the weight of the corresponding certainty parameter. For each alternative command, each certainty parameter is given a value based on the knowledge representation and long-term observations about the user. For example, in a case where the user is very prone to accidental slips, the degree of frequency would also be very high. As another example, the degree of similarity between two mails with subjects 'Confirmation1' and 'Confirmation2' is 0.90 because they have very similar names and are neighbouring objects in the graphical representation of the electronic mailbox. The theory states that in order to estimate the values of the weights k_j, one must obtain (n-1) pairs of indifferent actions, where n is the number of criteria. In the case of I-Mailer we had to find examples of 3 pairs of indifferent actions, since n = 4 (we have 4 criteria). A pair of indifferent actions in I-Mailer is a pair of alternative commands for which a human advisor would not have any preference between one or the other. Therefore, we conducted an empirical study in order to find pairs of alternative actions that human experts would count as indifferent. The empirical study involved 30 users of different levels of expertise in the use of an e-mailing system and 10 human experts of the domain. All users were asked to interact with a standard e-mailing system, as they would normally do. During their interaction
with the system, their actions were video-captured and the human experts were asked to comment on the protocols collected. The analysis of the comments revealed that in some cases the human experts thought that two alternatives were equally likely to have been intended by the user. From the actions that the majority of the human experts counted as indifferent, we selected the 3 pairs with the greatest acceptability. For each alternative action, we calculated the values of the certainty parameters and substituted them into the function U(a) = Σ_{j=1..n} k_j U_j(x_j^a), letting the weights of the certainty parameters be the unknown quantities (k_j). As the value of U(a) is the same for two indifferent actions, we equate the values of the function U(a) for each pair of indifferent actions. This process resulted in the following 3 equations:

a1 I a2 ⇔ 0.65k1 + 0.50k2 + 0.31k3 + 0.35k4 = 0.55k1 + 0.60k2 + 0.39k3 + 0.03k4
a3 I a4 ⇔ 0.30k1 + 0.40k2 + 0.23k3 + 0.65k4 = 0.25k1 + 0.55k2 + 0.11k3 + 0.13k4
a5 I a6 ⇔ 0.75k1 + 0.70k2 + 0.56k3 + 0.43k4 = 0.90k1 + 0.55k2 + 0.48k3 + 0.55k4

These equations, together with the equation Σ_{j=1..4} k_j = 1, form a system of 4 equations with 4 unknown quantities, which is easy to solve. After this process, the values of the weights of the certainty parameters were found to be k1 = 0.44, k2 = 0.36, k3 = 0.18 and k4 = 0.02. Therefore, the final formula for the calculation of the degree of certainty is:

γ = 0.44σ + 0.36δ + 0.18φ + 0.02τ    (2)
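For illustration, the small numpy sketch below reproduces this weight estimation: its matrix rows simply restate the three indifference equations above (moved to the form "left minus right equals zero") plus the normalisation constraint, and the certainty-parameter values used to compute γ at the end are invented numbers.

```python
# Sketch that reproduces the weight estimation described above (not the authors' code).
import numpy as np

# Coefficients of k1..k4 after subtracting the right-hand sides of the
# three indifference equations a1~a2, a3~a4, a5~a6.
A = np.array([
    [0.65 - 0.55, 0.50 - 0.60, 0.31 - 0.39, 0.35 - 0.03],
    [0.30 - 0.25, 0.40 - 0.55, 0.23 - 0.11, 0.65 - 0.13],
    [0.75 - 0.90, 0.70 - 0.55, 0.56 - 0.48, 0.43 - 0.55],
    [1.0, 1.0, 1.0, 1.0],            # the weights must sum to 1
])
b = np.array([0.0, 0.0, 0.0, 1.0])

k = np.linalg.solve(A, b)
print(np.round(k, 2))                # -> [0.44 0.36 0.18 0.02]

# Degree of certainty for one alternative command with certainty-parameter
# values (sigma, delta, phi, tau); these particular values are made up.
sigma, delta, phi, tau = 0.90, 0.40, 0.30, 0.55
gamma = np.dot(k, [sigma, delta, phi, tau])
```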
5
Conclusions
In this paper, we have described an intelligent e-mailing system, I-Mailer, that helps users achieve their goals and plans. In order to provide intelligent and individualised help, I-Mailer combines the principles of Human Plausible Reasoning with the Multi-Attribute Utility Theory. HPR provides a domain-independent, formal framework for generating hypotheses about the users' beliefs and intentions from the point of view of a human advisor who watches the user over his/her shoulder and reasons about his/her actions. In case the system suspects that the user is involved in a problematic situation, it uses the adaptation of HPR to find possible alternative actions that the user might have meant to issue instead of the erroneous one. However, this process usually results in the generation of many alternatives. Therefore, the system uses the certainty parameters of HPR together with MAUT in order to find the most "predominant" alternative action, i.e., the action that is most likely to have been intended by the user.
References
[1] Bouyssou, D.: Building Criteria: A Prerequisite for MCDA. In Bana e Costa, C. (ed.): Readings in MCDA, Springer-Verlag (1990).
[2] Burstein, M.H., Collins, A. & Baker, M.: Plausible Generalisation: Extending a Model of Human Plausible Reasoning. Journal of the Learning Sciences, (1991) Vol. 3 and 4, 319-359.
[3] Carbonel, J.R. & Collins, A.: Natural Semantics in Artificial Intelligence. In Proceedings of the Third International Joint Conference on Artificial Intelligence, Stanford, California, (1973) 344-351.
[4] Collins, A. & Michalski, R.: The Logic of Plausible Reasoning: A Core Theory. Cognitive Science, (1989) Vol. 13, 1-49.
[5] Kabassi, K. & Virvou, M.: Adaptive Help for e-mail Users. In Proceedings of the 10th International Conference on Human Computer Interaction (HCII'2003), to appear.
[6] Lahdelma, R., Salminen, P. & Hokkanen, J.: Using Multicriteria Methods in Environmental Planning and Management. Environmental Management, Springer-Verlag, New York, (2000) Vol. 26, No. 6, 565-605.
[7] Vincke, P.: Multicriteria Decision-Aid. Wiley (1992).
[8] Virvou, M. & Du Boulay, B.: Human Plausible Reasoning for Intelligent Help. User Modeling and User-Adapted Interaction, (1999) Vol. 9, 321-375.
[9] Virvou, M. & Kabassi, K.: Reasoning about Users' Actions in a Graphical User Interface. Human-Computer Interaction, (2002) Vol. 17, No. 4, 369-399.
Using Self Organizing Feature Maps to Acquire Knowledge about Visitor Behavior in a Web Site
Juan D. Velásquez¹, Hiroshi Yasuda¹, Terumasa Aoki¹, Richard Weber², and Eduardo Vera³
¹ Research Center for Advanced Science and Technology, University of Tokyo, {jvelasqu,yasuda,aoki}@mpeg.rcast.u-tokyo.ac.jp
² Department of Industrial Engineering, University of Chile, [email protected]
³ AccessNova Program, Department of Computer Science, University of Chile, [email protected]
Abstract. When a user visits a web site, important information concerning his/her preferences and behavior is stored implicitly in the associated log files. This information can be revealed by using data mining techniques and can be used to improve both the content and the structure of the respective web site. From the set of possible variables that define the visitor's behavior, two have been selected: the visited pages and the time spent on each of them. With this information, a new distance was defined and used in a self-organizing map which identifies clusters of similar sessions, allowing the analysis of visitor behavior. The proposed methodology has been applied to the log files of a particular web site. The respective results gave very important insights regarding visitor behavior and preferences and prompted the reconfiguration of the web site.
1
Introduction
When a visitor enters a web site, the selected pages have a direct relation to the information he/she is looking for. The ideal structure of a web site should support visitors in finding such information. However, reality is quite different: in many cases the structure of a web site does not help to find the desired information, although a page that contains it does exist [3]. Studying visitors' behavior is important in order to create more attractive content, to predict their preferences and to prepare links with suggestions, among other goals [9]. These research initiatives aim at facilitating web site navigation and, in the case of commercial sites, at increasing market share [1], transforming visitors into customers, increasing customer loyalty and predicting their preferences.
Each click of a web site visitor is stored in files known as web logs [7]. The knowledge about visitors' behavior contained in these files can be extracted using data mining techniques such as self-organizing feature maps (SOFM). In this work, a new distance measure between web pages is introduced and used as input for a specially developed self-organizing feature map that identifies clusters of sessions. In this way, the behavior of a web site's visitors can be analyzed and employed for web site improvement. The special characteristic of the SOFM is its toroidal topology, which has already shown its advantages when it comes to maintaining the continuity of clusters [10]. In Sect. 2, a technique to compare user sessions in a web site is introduced. Section 3 shows how the user behavior vector and the distance between web pages are used as input for self-organizing feature maps in order to cluster sessions. Section 4 presents the application of the suggested methodology to a particular web site. Finally, Sect. 5 concludes the present work and points at extensions.
2 Comparing User Sessions in a Web Site

2.1 User Behavior Vector Based on Web Site Visits
We define two variables of interest: the set of pages visited by the user and the time spent on each one of them. This information can be obtained from the log files of the web site, which are preprocessed using the sessionization process [5, 6].

Definition 1. User Behavior Vector. U = {u(1), . . . , u(V)}, where u(i) = (u_p(i), u_t(i)); u_p(i) is the web page that the user visits in event i of his/her session and u_t(i) is the time the user spent visiting that page. V is the number of pages visited in a certain session.

Figure 1 shows a common structure of a web site. If we have a user visiting the pages 1, 3, 6, 11 and spending 3, 40, 5, 16 seconds, respectively, the corresponding user behavior vector is: U = ((1,3),(3,40),(6,5),(11,16)).
Fig. 1. A common structure of a web site and the representation of user behavior vectors
After completing the sessionization step, we have the pages visited and the time spent on each one of them, except for the last visited page, of which we only know when its visit began. An approximation for this value is to take the average of the time spent on the other pages visited in the same session. Additionally, it is necessary to consider that the number of web pages visited by different users varies. Thus the numbers of components in the respective user behavior vectors differ. However, we can introduce a modification in order to create vectors with the same number of components. Let L be the maximum number of components in a vector, and U a vector with S components so that S ≤ L. Then the modified vector is:

U(k) = \begin{cases} (u_p(k), u_t(k)) & 1 \le k \le S \\ (0, 0) & S < k \le L \end{cases} \qquad (1)
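As a concrete illustration, the construction and padding of a user behavior vector could be coded as follows (a minimal sketch; function and variable names are illustrative, not from the paper):

```python
def user_behavior_vector(session, L):
    """Build a user behavior vector from a session given as a list of
    (page_id, seconds) pairs; the last page's unknown duration is passed as None."""
    known = [t for _, t in session if t is not None]
    avg = sum(known) / len(known) if known else 0
    # Approximate the unknown time of the last visited page by the session average.
    u = [(p, t if t is not None else avg) for p, t in session]
    # Pad with (0, 0) components so every vector has exactly L components (equation 1).
    return (u + [(0, 0)] * L)[:L]

# Example from the text: pages 1, 3, 6, 11; the time of the last page is estimated
# as the average of 3, 40 and 5 seconds, i.e. 16 seconds.
U = user_behavior_vector([(1, 3), (3, 40), (6, 5), (11, None)], L=6)
# U == [(1, 3), (3, 40), (6, 5), (11, 16.0), (0, 0), (0, 0)]
```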
2.2 Preprocessing
A web page contains a variety of tags and words that do not have a direct relation with the content of the page we want to study. Therefore we have to filter the text, eliminating HTML tags and stopwords (i.e., pronouns, prepositions, conjunctions, etc.) and applying word stemming (suffix removal). Each web page is then a document represented by the vector space model [2], in particular by a vector of words. Let P = {p_1, . . . , p_Q} be the set of web pages in a web site. Its vectorial representation is a Q x R matrix, where Q is the number of pages in the web site and R is the number of different words in P. Then the matrix M that contains the vectors of words is:

M = (m_{ij}), \quad i = 1, \ldots, Q,\; j = 1, \ldots, R \qquad (2)
where m_ij is the weight of word i in document j. In order to estimate these weights, we use the tf×idf weighting [2], defined by equation (3):

m_{ij} = f_{ij} \cdot \log\left(\frac{Q}{n_i}\right) \qquad (3)
where f_ij is the number of occurrences of word i in document j and n_i is the total number of times that word i appears in the whole collection of documents.
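A minimal sketch of this weighting scheme (the token lists are assumed to be the filtered, stemmed words of each page; names are illustrative):

```python
import math
from collections import Counter

def tfidf_weights(pages):
    """pages: list of token lists, one per web page (Q pages in total).
    Returns m[word][page_index] = f_ij * log(Q / n_i) as in equation (3),
    with n_i the total number of occurrences of the word in the collection."""
    Q = len(pages)
    counts = [Counter(tokens) for tokens in pages]
    n = Counter()
    for c in counts:
        n.update(c)
    m = {}
    for word, total in n.items():
        m[word] = {j: counts[j][word] * math.log(Q / total)
                   for j in range(Q) if word in counts[j]}
    return m
```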
2.3 Distance Measure between Two Pages
With the above definitions we can use vectorial linear algebra in order to define a distance measure between two web pages.

Definition 2. Word Page Vectors. WP^i = (wp^i_1, . . . , wp^i_R) = (m_i1, . . . , m_iR).

Thus the distance between page vectors [4] is:

dp(WP^i, WP^j) = \frac{\sum_{k=1}^{R} wp^i_k\, wp^j_k}{\sqrt{\sum_{k=1}^{R} (wp^i_k)^2 \,\sum_{k=1}^{R} (wp^j_k)^2}} \qquad (4)
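Equation (4) is the cosine measure between the word vectors of two pages; it can be sketched with dictionary-based vectors as follows (illustrative names):

```python
import math

def dp(wp_i, wp_j):
    """Content similarity between two word page vectors (equation 4):
    wp_i, wp_j map words to tf-idf weights; the result is the cosine of the angle."""
    num = sum(w * wp_j.get(word, 0.0) for word, w in wp_i.items())
    den = math.sqrt(sum(w * w for w in wp_i.values()) *
                    sum(w * w for w in wp_j.values()))
    return num / den if den else 0.0
```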
Definition 3. Page Distance Vector. D_AB = (dp(a_1, b_1), . . . , dp(a_m, b_m)), where A = {a_1, . . . , a_m} and B = {b_1, . . . , b_m} are sets of word page vectors with the same cardinality.
2.4 Comparison between Two User Behavior Vectors
In order to compare two user sessions it is necessary to define a measure that determines the difference between two user behavior vectors based on both characteristics of the user behavior vector (time and page content); see equation (5):

dub(U^i, U^j) = \sum_{k=1}^{L} \min\left\{ \frac{u^i_t(k)}{u^j_t(k)}, \frac{u^j_t(k)}{u^i_t(k)} \right\} \cdot dp(u^i_p(k), u^j_p(k)) \qquad (5)
with dp the distance measure between the content of two pages. We use dp (equation 4) because it is possible that two users visit different web pages in the web site whose content is nevertheless similar. This is a variation of the approach proposed in [7], where only the user's path was considered but not the content of each page. The second element of equation (5), min{u^i_t(k)/u^j_t(k), u^j_t(k)/u^i_t(k)}, indicates the user's interest in the pages visited. The assumption is that the time spent on a page is proportional to the interest the user has in its content. In this way, if the times spent are close, the value of the expression will be near 1. In the opposite case, it will be near 0. The final expression of equation (5) combines the content of the visited pages with the time spent on each of the pages by a multiplication. This way we can distinguish between two users who visited similar pages but spent different times on each of them. Similarly we can separate users that spent the same time visiting pages with different content and position in the web.
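Equation (5) translates directly into code; in this sketch the padded (0,0) components are simply skipped, which is an assumption, and dp_fn stands for the page similarity of equation (4) (illustrative names):

```python
def dub(U_i, U_j, dp_fn):
    """Similarity between two user behavior vectors of equal length L (equation 5)."""
    total = 0.0
    for (p_i, t_i), (p_j, t_j) in zip(U_i, U_j):
        if t_i <= 0 or t_j <= 0:
            continue  # padded (0, 0) components contribute nothing (assumption)
        total += min(t_i / t_j, t_j / t_i) * dp_fn(p_i, p_j)
    return total
```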
3 Self Organizing Feature Map for Session Clustering
We used an artificial neural network of the Kohonen type (Self-Organizing Feature Map, SOFM) [8]. Schematically, it is presented as a two-dimensional array in whose positions the neurons are located. Each neuron is constituted by an n-dimensional vector whose components are the synaptic weights. By construction, all the neurons receive the same input at a given moment. The idea in this learning process is to present an example to the network and, using a metric, to search for the neuron in the network most similar to the example (center of excitation, winner neuron). Next we have to modify its weights and those of the center's neighbors. The notion of neighborhood among the neurons allows diverse topologies. In this case the toroidal topology was used [10], which means that the neurons closest to the superior edge are also neighbors of those on the inferior edge, and likewise for the lateral edges (see Fig. 2).
Fig. 2. Proximity of the user behavior vectors in a toroidal Kohonen network
The U vectors have two components (time and content) for each web page. Therefore it is necessary to modify both when the neural network changes the weights for the winner neuron and its neighbors. The time component of the U vector is modified with a numerical adjustment, but the page component needs a different updating scheme [8]. In the preprocessing step, we constructed a matrix with the pairwise distances among all pages in the web site. Using this information we can adjust the respective weights. Let N be a neuron in the network and E the user behavior vector example presented to the network. Using Definition 3, the page distance vector is:

D_{NE} = \left( dp(N_p(1), E_p(1)), \ldots, dp(N_p(M), E_p(M)) \right) \qquad (6)
Now the adjustment is applied to the D_NE vector, i.e., we have D'_NE = D_NE * f_p, with f_p an adjustment factor. Using D'_NE, it is necessary to find a set of pages whose distances to N are near to D'_NE. Thus the final adjustment for the winner and its neighbor neurons is given by equation (7):

N^{n+1}(i) = \left( N^n_t(i) \cdot f_t,\; p \right), \quad p \in P \text{ such that } D'_{NE}(i) \approx dp(p, N_p(i)) \qquad (7)

with i = 1, . . . , L.
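A sketch of the winner search and the toroidal neighborhood used during training (illustrative names; treating dub as a similarity to be maximized and using a square neighborhood are assumptions, and the page-component adjustment of equation (7) is left out):

```python
def toroidal_neighborhood(i, j, rows, cols, radius=1):
    """Grid positions within the given radius on a toroidal grid: indices wrap
    around, so the top and bottom edges (and the lateral edges) are adjacent."""
    return [((i + di) % rows, (j + dj) % cols)
            for di in range(-radius, radius + 1)
            for dj in range(-radius, radius + 1)]

def find_winner(grid, example, dub_fn):
    """grid: 2D list of neuron vectors with the same structure as user behavior
    vectors. Returns the position of the neuron most similar to the example."""
    best, best_pos = float('-inf'), None
    for i, row in enumerate(grid):
        for j, neuron in enumerate(row):
            score = dub_fn(neuron, example)
            if score > best:
                best, best_pos = score, (i, j)
    return best_pos
```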
4 Application of the Proposed Methodology
4.1 Selecting the Web Site
In order to prove the effectiveness of the tools developed in this work, a web site was selected (http://www.dii.uchile.cl/~diplomas/). It contains information about specialization programs for professionals and belongs to the Department of Industrial Engineering of the University of Chile. The site is written in Spanish and has 142 static web pages; for this study approximately 24,000 web log registers were considered, corresponding to the period August to October 2002.
4.2 Preprocessing and Indexing
In the preprocessing step, grammatical particles (articles, prepositions, conjunctions, etc.), characters with accents and HTML tags were eliminated. Additionally, the word stemming process was applied. Link pages, i.e., pages that only contain links to other web pages, were not considered in the analysis. The total number of pages is 122. Next we created the matrix with the distances among page vectors. The dimension of this matrix is 6234x122, i.e., R = 6234 and Q = 122.
4.3 Sessionization and User Behavior Vector
We used the time-based heuristic for the sessionization process and considered 30 minutes as the longest user session. Only 7% of the users have sessions with 7 or more pages visited and 11% visited at least 3 pages. We therefore set three and six as the minimum and maximum number of components in a user behavior vector, respectively. Using these filters, we identified 4113 user behavior vectors.
4.4 Training the Neural Network
We used a SOFM with 6 input neurons (corresponding to the six pages in a visit) and 256 output neurons. Using this structure, we could map the 4113 user behavior vectors to the 256 neurons of the feature map. The toroidal topology maintains the continuity of clusters, which allows us to study the transitions among the preferences of the users from one cluster to another. The training of the neural network was carried out on a Pentium IV computer with 512 MB of RAM running Linux (Red Hat 7.1). The time necessary was 2.5 hours and the epoch parameter was 100.
4.5 Results
We identified six main clusters, as shown in Table 1. The second and third columns of Table 1 contain the center neurons of each of the clusters, representing the visited pages and the time spent on each one of them. The pages in the web site were labelled with a number to facilitate their analysis. Table 2 shows the main content of each page. The cluster analysis shows:
– Cluster 1. The users are interested in the profile of the students, the program and the faculty staff's curriculums.
– Clusters 2 and 3. The users show preferences for courses about environmental topics and visit the page where they can ask for information. In both clusters, program and schedule are very important for the user.
– Cluster 4. This cluster contains sessions of users interested in new courses.
Table 1. User behavior clusters

Cluster  Pages Visited           Time spent in seconds
1        (2,15,60,42,70,62)      (3,5,113,67,87,43)
2        (5,43,65,75,112,1)      (4,53,40,63,107,10)
3        (6,47,67,7,48,112)      (4,61,35,5,65,97)
4        (10,51,118,87,105,1)    (5,80,121,108,30,5)
5        (11,55,37,87,114,12)    (3,75,31,43,76,8)
6        (13,57,41,98,120,107)   (4,105,84,63,107,30)
Table 2. Pages and their content

Pages            Content
1                Home page
2, ..., 14       Main page about a course
15, ..., 28      Presentation of the program
29, ..., 41      Objectives
42, ..., 58      Program: Course's modules
59, ..., 61      Student profile
62, ..., 68      Schedule and dedication
69, ..., 91      Curriculums of the staff of instructors
92, ..., 108     Menu to request information
108, ..., 121    Information: cost, schedule, application, etc.
122              News page
– Cluster 5. The users are interested in the student profile and the course objectives.
– Cluster 6. In this case sessions are similar to the sessions in cluster 5. This kind of course is a seminar.

Reviewing the clusters found, it can be inferred that the users show interest in the profile of the students, the schedules and contents of the courses and the professors who are in charge of each subject. Based on our analysis we propose to change the structure of the Web site, giving priority to this information. The following step is to analyze the content of the pages using the distance defined in equation (5).
5 Conclusions
In this work we introduced a methodology to study user behavior in a web site. In the first part we proposed a way to study user behavior on the web, using a new distance measure based on two characteristics derived from the user sessions: the pages visited and the time spent on each one of them. Using this distance in a self organizing map, we found clusters of user sessions, which allow us to study the user behavior in the particular web site. The experiments made with data from a certain web site showed that the methodology allows us to create clusters of user sessions and, using this information, to study the user behavior in the web site. Since the distance considers the content and position of the page in the web site, its structure and the words used in the pages are variables that directly influence the capacity of the SOFM to create clusters. The introduced distance is very useful for increasing the knowledge about the user behavior in a web site. As future work, we propose to improve the presented methodology by introducing new variables derived from user sessions. It will also be necessary to continue applying our methodology to other web sites in order to get new hints for future developments.
References

[1] S. Araya, M. Silva and R. Weber, Identifying web usage behavior of bank customers. Proceedings of SPIE, Data Mining and Knowledge Discovery: Theory, Tools, and Technology IV, Vol. 4730, pages 245-251, April 1-5, Orlando, USA, 2002.
[2] R. Baeza-Yates, B. Ribeiro-Neto, Modern Information Retrieval, chapter 2. Addison-Wesley, 1999.
[3] N. J. Belkin, Helping people find what they don't know. Communications of the ACM, Vol. 43(8), pages 58-61, 2000.
[4] M. W. Berry, S. T. Dumais and G. W. O'Brien, Using linear algebra for intelligent information retrieval, SIAM Review, Vol. 37, pages 573-595, December 1995.
[5] R. Cooley, B. Mobasher, J. Srivastava, Data preparation for mining world wide web browsing patterns. Journal of Knowledge and Information Systems, Vol. 1, pages 5-32, 1999.
[6] R. Cooley, B. Mobasher and J. Srivastava, Grouping Web Page References into Transactions for Mining World Wide Web Browsing Patterns. In Knowledge and Data Engineering Workshop, pages 2-9, Newport Beach, CA, 1997.
[7] A. Joshi and R. Krishnapuram, On Mining Web Access Logs. In Proceedings of the 2000 ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pages 63-69, 2000.
[8] T. Kohonen, Self-Organization and Associative Memory, Springer-Verlag, 1987, 2nd edition.
[9] B. Mobasher, R. Cooley and J. Srivastava, Creating Adaptive Web Sites Through Usage-Based Clustering of URLs, Proceedings of IEEE Knowledge and Data Engineering Exchange, November 1999.
[10] J. Velásquez, H. Yasuda, T. Aoki and R. Weber, Voice Codification using Self Organizing Maps as Data Mining Tool. Proceedings of the Second International Conference on Hybrid Intelligent Systems, pages 480-489, Santiago, Chile, December 2002.
A Framework for the Development of Personalized Agents

Fabio Abbattista, Graziano Catucci, Marco Degemmis, Pasquale Lops, Giovanni Semeraro, and Fabio Zambetta

Dipartimento di Informatica, Via Orabona 4, 70125 Bari, Italy
{fabio,degemmis,lops,semeraro,zambetta}@di.uniba.it, [email protected]
Abstract. The amount of information available on the web, as well as the number of e-businesses and web shoppers, is growing exponentially. Customers spend a lot of time browsing the net in order to find relevant product information. One way to overcome this problem is to use dialoguing agents that exploit the knowledge stored in user profiles in order to generate personal recommendations. This paper presents a general framework designed according to this idea in order to develop intelligent e-business applications.
1 Introduction
In recent years, the Web has quickly become a global marketplace. Nowadays enterprises are developing new business portals, providing their customers with large amounts of product information: choosing among so many options is very time consuming. This problem requires solutions that show a certain degree of autonomy, personalization, and ability to react to specific circumstances. Agents able to dialogue in natural language with users fit these requirements since they represent a paradigm for the implementation of autonomous and proactive behaviors. Moreover, the interface of these agents should adopt best-practice solutions to achieve a high degree of dialogue intelligence and use an appropriate graphical design. Finally, the effectiveness of the solution also depends on the ability of the systems to adapt to individual users by learning about their preferences. It is necessary to adopt techniques that turn raw data about customers into knowledge (about their interests) that can be stored in personal profiles exploited to deliver personalized content [8]. In this paper, we present a framework for developing 3D personalized agents that combines an advanced user interface based on natural language processing with machine learning-based profiling techniques. Previous research on related topics can be roughly broken up into two areas, the first focusing on learning user profiles for item recommending, the second on conversational interfaces.
Recommender systems typically take one of two approaches to help users in searching for items of interest. The content-based approach recommends items similar to the ones that a user has liked in the past; the collaborative approach selects items for a given user that similar users have also liked. The LIBRA system [7] performs content-based book recommending by applying text categorization methods to the book descriptions in Amazon.com, using a naïve Bayes classifier. Data mining methods are used by the 1:1Pro system [2] to build profiles that contain rules describing a customer's behavior. Rules are derived from transactional data representing the purchasing and browsing activities of each user. Using rules is an intuitive way to represent the customer's needs; moreover, rules generated from a huge number of transactions tend to be statistically reliable. Thus, the profiling system within the framework presented in the next section exploits transactional data to discover rules describing the preferences of a user. Moreover, we combine the rule-based approach with the content-based method described in [7], in order to build more detailed user profiles. Amongst systems using natural language to interact with users, Adaptive Place Advisor (APA) [6] is designed to help a user to select a destination, for example a restaurant. Similarly to our conversational agent (Section 3), APA adopts the content-based recommending approach; but unlike many recommender systems that accept keywords and produce a ranked list of results, it carries out a conversation with the user to progressively narrow his/her options.
2 A Framework for Developing Intelligent E-commerce Applications
Our framework for developing e-commerce applications (Figure 1) is composed of 5 main macro-modules:

1. Web Site, accommodates the search engine used by the agent by means of a remote call to browse information and navigate the contents of the site.
2. Conversational Agent, consists of a chatterbot component and a 3D character. The former is exploited to converse with a user and collect data from the dialogue; the latter elicits non-verbal cues occurring during a face-to-face conversation, thus endowing the system with a higher level of human-computer interaction.
3. Personalization Engine, extracts knowledge about users to personalize the content and the interaction with the web site. Such knowledge can be extracted from different sources, e.g. the interaction logs and the searching and purchasing history. Another important source of information is represented by chat logs, containing the dialogues between the users and the conversational agent. Different techniques can be used to extract valuable information from these sources: (a) data mining techniques to induce user profiles or usage patterns, e.g., community or association rules; (b) information extraction techniques to inspect data contained in the chat logs, in order to retrieve other information on possible interests; (c) collaborative filtering techniques that can be plugged into the framework.
4. XML Content Manager, represents and manages data in XML format. XML technologies play an important role in modern architectures, where the portability and the integration with different data sources are key factors for successful integrated environments. Furthermore, the communication between the different modules is achieved by the use of web services, since a homogeneous representation for data and communication messages is required.
5. Information Retrieval (IR) Suite, exploits knowledge sources to retrieve the right information at the right time when coping with unexpected or ambiguous situations.
Fig. 1. Architecture of the framework
3 The Conversational Agent
Conversational web agents allow a simpler and more natural interaction metaphor between the user and the machine, entertaining the user and giving, to some extent, the illusion of interacting with a human-like interface. Frequently these systems are coupled with an animated 2D/3D look-and-feel, embodying their intelligence via a face or an entire body, in order to enhance users' trust in these systems by simulating a face-to-face dialogue [3]. Though interesting as proofs of concept, most of the existing systems are generally heavy to implement, difficult to port onto different platforms, and often not embeddable in Web browsers. These reasons led us to pursue a lightweight solution, which turns out to be portable, easy to implement and fast enough in medium-sized computer environments. Our solution, the SAMIR (Scenographic Agents Mimic Intelligent Reasoning) system, is a digital assistant in which an artificial-intelligence-based Web agent is integrated with a purely 3D humanoid, robotic, or cartoon-like layout [1]. SAMIR is a client-server application (Figure 2) composed of 3 main sub-systems: the Dialogue Management System (DMS), the Behavior Manager and the Animation Module. The DMS is responsible for the management of user dialogues and for the extraction of the necessary information for information searching. It is a client-server application composed mainly of two software modules communicating through the HTTP protocol. The client side lets a user type requests in a human-like language
and sends these requests to the server-side application in order to process them. On the server side, the ALICE Server Engine (http://www.alicebot.org) has been integrated into SAMIR. It encloses all the knowledge and the core system services to process user input. At the same time, based on the events raised by the user on the web site and on his/her requests, a communication between the DMS and the Behavior Manager is set up. The Behavior Generator aims at managing the consistency between the facial expression of the character and the conversation tone. The module is mainly based on a classifier system, the XCS [11], strictly related to the Q-learning approach but able to generate more compact task representations than tabular Q-learning [10].
Fig. 2. SAMIR Architecture
At discrete time intervals, the XCS-based agent observes a state of the environment, selects the best action to be performed (Performance Component), observes a new state and finally receives an immediate reward (evaluated by the Reinforcement Component). In case of degrading performance, the agent tries to identify new and better-performing rules (Discovery Component). Behavior rules are expressed in the classical format "if <condition> then <action>", where <condition> (the state of the environment) combines 4 different conversation tones, namely: user salutation, user request formulation to the agent, user compliments/insults to the agent, and user permanence in the Web page, while <action> represents the expression that the Animation System displays during user interaction. Such an expression is built as a linear combination of a set of expressions that includes the six basic emotions proposed by Ekman [4]. Other emotions and many combinations of emotions have been studied but remain unconfirmed as universally distinguishable. The expression assumed by the Animation System is coded into a string specifying coefficients for each of the possible morph targets [5] in our system. Our Animation System was conceived keeping in mind lightness and performance, so it supports a variable number of morph targets. For example, we currently use either 12 high-level morph targets or the entire "low-level" FAP set in order to achieve MPEG-4 compliance. An unlimited number of timelines can be used, allocating one channel for some typical stimulus-response expressions, another one for eye-lid non-conscious reflexes, and so on.
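As an illustration only (the rule encoding, field names and values below are assumptions, not SAMIR's actual implementation), such a behavior rule could be represented as:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class BehaviorRule:
    """An 'if <condition> then <action>' classifier in the spirit of XCS."""
    condition: List[str]     # one symbol per conversation-tone feature, '#' = don't care
    action: List[float]      # coefficients of the morph targets (linear combination)
    prediction: float = 0.0  # expected reward, updated by the reinforcement component
    fitness: float = 0.0     # accuracy-based fitness, as in XCS

    def matches(self, state: List[str]) -> bool:
        return all(c == '#' or c == s for c, s in zip(self.condition, state))

# Hypothetical example: a salutation combined with a compliment triggers an
# expression dominated by the first morph target (e.g. a smile).
rule = BehaviorRule(condition=['salutation', '#', 'compliment', '#'],
                    action=[0.8, 0.1, 0.0, 0.0, 0.1, 0.0])
```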
4 The Personalization Process
The approach adopted to turn raw data about customers into knowledge about their preferences relies on a two-step process for generating profiles. In the first step, the system learns coarse-grained profiles, in which preferences concern the product categories the user is interested in. In the second step, profiles are refined by including a probabilistic model of each product category, induced from the descriptions of the products the user likes. The outcome of the process is a fine-grained user profile able to discriminate between interesting and uninteresting products. The first step of the process exploits supervised learning techniques to build the coarse-grained profiles. The system was tested on a dataset about customers accessing a virtual bookshop: the preferences are the main book categories the product database is subdivided into. Transactional data about customers are arranged into a set of unclassified instances (each instance represents a customer). Then, a domain expert classifies each instance in the training set as a member or non-member of each book category. Instances are processed and a classification rule set for each book category is induced. These rule sets are used to predict whether a user is interested in each book category. The preferred book categories, ranked according to the degree of interest computed by the system, are stored in the profile. In the second step, profiles are refined by taking into account the user's preferences in each category. We adopted the naïve Bayes algorithm [7] to classify the textual descriptions of the books belonging to a specific category as interesting or uninteresting for a particular user. Each instance (book) is represented by three slots. Each slot is a textual field corresponding to a feature of a book: title, authors and textual annotation. Text in each slot is a bag of words (BOW) processed taking into account the occurrences of words in the original text. Given a set of pre-defined classes C = {c_1, c_2, ..., c_|C|}, the conditional probability of a class c_j given a document d_i is calculated according to Bayes' theorem. In our problem, we have 2 classes: c+ represents the positive class (user-likes), and c- the negative one (user-dislikes). The posterior probability of a category c_j given an instance d_i is computed using the formula:

P(c_j \mid d_i) = \frac{P(c_j)}{P(d_i)} \prod_{s_m \in S} \prod_{t_k \in b_{im}} P(t_k \mid c_j, s_m)^{n_{kim}} \qquad (1)

where S is the set of slots, b_im is the BOW in the slot s_m of the instance d_i, and n_kim is the number of occurrences of the token t_k in b_im. To calculate (1), we need to estimate the probability terms P(c_j) and P(t_k | c_j, s_m) from the training set. Each instance is weighted using a discrete rating r_i (1-10) provided by the user:
The weights in (2) are used to estimate the two probability terms:
In (4), the denominator denotes the total weighted length of the slot s_m in the class c_j. This approach allows for the refinement of the coarse-grained profiles by including a probabilistic model able to describe a customer's preferences for each book category. The outcome is a fine-grained profile. The experiments reported in the next section show the promise of the approach.
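Since equations (2)-(4) are not reproduced above, the following sketch shows one possible realization of the rating-based weighting and of the probability estimates; the weight mapping, the smoothing and all names are assumptions rather than the paper's exact formulas:

```python
import math
from collections import defaultdict

SLOTS = ('title', 'authors', 'annotation')

def train(instances):
    """instances: list of (slots, rating), where slots maps a slot name to its list
    of tokens and rating is the user's 1-10 value. Returns the estimated model."""
    prior_num = {'+': 0.0, '-': 0.0}
    counts = {c: {s: defaultdict(float) for s in SLOTS} for c in ('+', '-')}
    length = {c: {s: 0.0 for s in SLOTS} for c in ('+', '-')}
    for slots, rating in instances:
        w_pos = (rating - 1) / 9.0               # assumed mapping of the rating to a weight
        for cls, w in (('+', w_pos), ('-', 1.0 - w_pos)):
            prior_num[cls] += w
            for s in SLOTS:
                for token in slots.get(s, []):
                    counts[cls][s][token] += w
                    length[cls][s] += w          # weighted slot length (denominator of (4))
    priors = {c: prior_num[c] / len(instances) for c in ('+', '-')}
    return priors, counts, length

def log_posterior(cls, slots, priors, counts, length, eps=1e-6):
    """Unnormalized log P(cls | d) for the slot-based naive Bayes model of equation (1)."""
    lp = math.log(priors[cls] + eps)
    for s in SLOTS:
        for token in slots.get(s, []):
            lp += math.log((counts[cls][s][token] + eps) / (length[cls][s] + eps))
    return lp
```

A book would then be labelled interesting when the '+' log-posterior exceeds the '-' one, mirroring the P(c+|d_i) ≥ 0.5 criterion used in the evaluation below.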
4.1 Experimental Sessions
Eight book categories were selected at the Web site of a virtual bookshop. For each book category, a set of book descriptions was obtained and stored in a local database. For each category we considered: the number of book descriptions (books belonging to the category), the number of books with a non-empty annotation slot, and the average annotation length in words (see Table 1). Each user involved in the experiment was requested to choose one or more categories of interest and to rate 40 or 80 books in each selected category, providing 1-10 discrete ratings. In this way, a dataset for each pair <user, category> was obtained.

Table 1. Database information
Category            Book descriptions   Books with annotation   Avg. annotation length
Computing & Int.    5378                4178 (77%)              42.35
Fiction & lit.      5857                3347 (57%)              35.71
Travel              3109                1522 (48%)              28.51
Business            5144                3631 (70%)              41.77
SF, horror & fan.   556                 433 (77%)               22.49
Art & entert.       1658                1072 (64%)              47.17
Sport & leisure     895                 166 (18%)               29.46
History             140                 82 (58%)                45.47
Total               22785               14466
On each dataset a 10-fold cross-validation was run and several metrics were used in the testing phase. In the evaluation phase, the concept of relevant book is central. A book in a specific category is considered relevant by a user if his or her rating is greater than 5. This corresponds to having P(c+|d_i) ≥ 0.5, calculated by the system as in equation (1), where d_i is a book in a specific category. Classification effectiveness is measured in terms of the classical Information Retrieval notions of Precision, Recall and Accuracy [9] (Table 2).
Table 2. Experimental results
User Id   Precision   Recall   Accuracy
37        0.767       0.883    0.731
26        0.818       0.735    0.737
30        0.608       0.600    0.587
35        0.651       0.800    0.725
24c       0.586       0.867    0.699
36        0.783       0.783    0.700
24f       0.785       0.650    0.651
33        0.683       0.808    0.730
34        0.608       0.490    0.559
23        0.500       0.130    0.153
Mean      0.679       0.675    0.627
5 Conclusions
We have presented a framework for developing e-business applications that integrates agents and personalization. The conversational agent is comparable with a human assistant that, after a preliminary training, continues to learn new rules of behavior on the basis of experiences and interactions with human customers. The personalization is based on machine learning techniques used for the extraction of dynamic user profiles and usage patterns. Experiments confirm that the use of the profiles has a significant positive effect on the quality of the recommendations made by the system. This results in an improved performance of the agent in terms of the interaction with the users.
References

[1] F. Abbattista, P. Lops, G. Semeraro, F. Zambetta, SAMIR: An Intelligent Web Agent, Proc. of the Sixth International Conference on Knowledge-Based Intelligent Information and Engineering Systems, IOS Press, 2002, 1103-1109.
[2] G. Adomavicius, A. Tuzhilin, Using data mining methods to build customer profiles, IEEE Computer, 34/2, 2001, 74-82.
[3] J. Cassell, J. Sullivan, S. Prevost, E. Churchill (eds.), Embodied Conversational Agents, MIT Press, Cambridge, 2000.
[4] P. Ekman, Emotion in the human face, Cambridge University Press, Cambridge, 1982.
[5] B. Fleming, D. Dobbs, Animating Facial Features and Expressions, Charles River Media, Hingham, 1998.
[6] P. Langley, C. Thompson, R. Elio, A. Haddadi, An adaptive conversational interface for destination advice, Proc. of the Third International Workshop on Cooperative Information Agents, 1999, 347-364.
[7] R. J. Mooney, L. Roy, Content-based book recommending using learning for text categorization, Proceedings of the ACM Conference on Digital Libraries, ACM Press, 2000, 195-204.
[8] M. Pazzani, D. Billsus, Learning and Revising User Profiles: The Identification of Interesting Web Sites, Machine Learning 27/3, 1997, 313-331.
[9] G. Salton, M. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, New York, 1983.
[10] C.J.C.H. Watkins, Learning from delayed rewards, PhD thesis, University of Cambridge, Psychology Department, 1989.
[11] S.W. Wilson, Classifier Fitness based on Accuracy, Evolutionary Computation 3/2, 1995, 149-175.
Topic Cascades: An Interactive Interface for Exploration of Clustered Web Search Results Based on the SVG Standard

M. Lux, M. Granitzer, V. Sabol, W. Kienreich and J. Becker

Know-Center, Inffeldgasse 16c, 8010 Graz, Austria
{mlux,mgrani,vsabol,wkien}@know-center.at, [email protected]
http://www.know-center.at
Abstract. The WebRat is a light-weight, web-based retrieval, clustering and visualisation framework which can be used to quickly design and implement search solutions for a wide area of application domains. We have employed this framework to create a web meta search engine combined with an interactive visualisation and navigation toolkit. Based on the SVG graphics standard, this application allows users to explore search results in a quick and efficient way, by choosing from topically organized result groups. The visualisation and the cluster representation can be stored and reused. We have combined hierarchical navigation of search result sets with a topical similarity based arrangement of these results in one consistent, standard-based system which demonstrates the potential of SVG for web-based visualisation solutions.
1 Overview
This paper introduces an innovative SVG-based visualisation component of the WebRat retrieval and visualisation framework, for exploration and browsing of clustered web search results. An overview of WebRat and motivation for the chosen visualisation approach are provided. Preliminary results of usability studies and an outlook on future activities are discussed.
2 Introducing WebRat
When searching for a specific topic in the Internet, very large amounts of information are returned, and a significant portion of the hits is often not of interest at all. No explicit relations between the retrieved documents are returned, which makes it difficult to obtain an overview, navigate the search results and find relevant information. WebRat [1] is a web-based framework upon which incremental retrieval, clustering and visualisation applications can be built. It addresses the problem of information
overload by integrating documents from different data sources into a topically organized presentation and providing means for interactive exploration of the document set. The system has already been applied to web query refinement and metadata based environmental information search [2]. A meta search engine based on the WebRat framework is available at http://www.know-center.at/webrat.
2.1 Retrieval and Processing
WebRat employs an innovative, incremental, three-stage processing of retrieved documents, introducing a feedback loop to improve processing speed and the quality of results. Retrieval begins by sending the search query to various web data sources. In the high-dimensional stage, documents retrieved from different data sources are transformed into a language-independent term vector representation by using n-gram decomposition [3], and a term-frequency inverse-document-frequency (TFIDF) weighting scheme is applied. Words and double words which were sources for different n-grams build word vectors which are used for key-term extraction. The mapping stage maps the high-dimensional vector representation to the 2D viewport space by employing an incremental force-directed placement algorithm enhanced by a stochastic sampling schema [6] and a clustering method. The computed 2D configuration reflects the high-dimensional relations of the search results (as far as possible): topically similar documents form dense groups in the 2D layout. The low-dimensional stage makes use of advanced rendering techniques to quickly compute a density matrix based on the 2D document coordinates. The density matrix serves two purposes. First, the landscape background image is generated from it, where islands represent dense areas of topically similar documents. Second, it is used to periodically identify the density maxima positions in 2D, which are used as cluster seeds. Clusters are created by assigning each document to the nearest seed. A feedback loop is created by passing clusters back to the first two stages. The high-dimensional component computes high-dimensional centroids of the clusters and employs statistical methods to extract key-terms from the underlying documents and compute cluster descriptors. These descriptors are used by the labelling engine to describe the groups of topically similar documents the user sees on the map. The mapping stage uses clusters to reduce the computational complexity and increase separation in the layout.
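The n-gram decomposition of the high-dimensional stage can be sketched as follows (the n-gram length and the names are illustrative; a TFIDF weighting would then be applied on top of these raw frequencies):

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Language-independent character n-gram decomposition of a document:
    whitespace is normalised and the text is padded so word boundaries are kept."""
    text = ' ' + ' '.join(text.lower().split()) + ' '
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# Term-frequency vectors for two retrieved result snippets.
vectors = [char_ngrams("Scalable Vector Graphics tutorial"),
           char_ngrams("SVG animation examples")]
```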
2.2 Visualisation
In its basic form, WebRat provides a visualisation following the concept of thematic landscapes, which is a visual representation in which the spatial proximity between visualised entities is a measure of the thematic similarity of the underlying documents. Systems like Bead [5] visualise document sets as galaxies of stars or a landscape. However, a thematic landscape is not very appropriate for displaying hierarchically organised information (i.e. clusters). Besides the standard tree visualisation, we also considered cone trees [14], hyperbolic trees (star trees) [17] and information slices [13]. None of these approaches fulfilled our requirements regarding use of screen real estate, ease of use, providing an overview, avoiding occlusion and the necessity to scroll.
Fig. 1. WebRat's landscape visualisation. Query was “SVG”
There is a general trend towards clustering of search results as performed by the Alta Vista and Northern Light search engines. The Vivisimo [11] meta search engine creates a tree-like structure displaying a hierarchy of clustered search results. Lighthouse [4] also performs clustering and computes 2D or 3D thematic layouts. Systems like InfoSky [7] try to combine hierarchical structure with visualising document similarity for very large document corpora. Mondeca [10] is doing research on SVG-based visualisation of Topic Maps. All considered systems either provide only simple tree representations or present results in a complex 2D or 3D hierarchical layout. They also feature fixed, non-configurable visualisation stages. WebRat provides more flexibility, offering a pluggable visualisation interface, and is more advanced in the sense that its visualisation works incrementally, meaning that it can incorporate results as they arrive.
2.3 Intelligent Retrieval with WebRat
WebRat supports several powerful features in the area of intelligent retrieval and visualisation of search results:

• The capability to retrieve textual documents from a number of heterogeneous data sources, such as search engines, knowledge management environments [16], environmental databases [15], and others. WebRat combines these into a single consistent internal representation.
• Thematic organization of retrieved documents through unsupervised incremental clustering. WebRat accommodates new results as they arrive into a growing, topically organized 2D map, identifies clusters of similar documents and inserts new documents into the identified topical hierarchy.
• Dynamic labeling, which provides enhanced orientation while navigating the document set. WebRat computes labels on the fly, depending on the zoom level and hierarchy depth, to best describe the documents and document groups the user is currently focusing on.
• Query refinement through recommending additional query terms. Depending on the area the user is focusing on, WebRat automatically proposes query terms to narrow down the search query.
Based on these features, a scenario can be developed in which a user issues a specific query, such as “SVG”, gets an overview of the result set and learns about the sub-topics and vocabulary describing each sub-topic. From the user's focus on one special field of interest, for example “ANIMATIONS IMPLEMENTING” (see fig. 1), WebRat “recognizes” the point of interest and automatically extends the search query to gather more documents on that particular sub-topic. As topics overlap between clusters, results of the refined query are incorporated not only within the “implementing animations with SVG” topic, but also within other topics related to SVG (the original query). In conjunction with WebRat's demonstrated ability to present users with a visual summarisation of deep hierarchies [16], this scenario can be extended to the case of a knowledge management system enhancement which, by determining points of interest as described above, offers users a personalised access point of a knowledge space which is created autonomously.
3 The SVG Application

3.1 Motivation
While the features and scenarios described promise several advantages for users querying large repositories, in a real-life environment the 2D visualisation used by WebRat is often not suitable. Most knowledge management systems provide users with a tree view navigation system covering a sizeable amount of screen real estate, with the remaining space holding content and metadata display elements which cannot be hidden or discarded without losing most of the intended functionality. In consequence, a major challenge has been to develop an interactive visualisation suitable for presenting hierarchies which at the same time offers as much as possible of WebRat's topic-based navigational capabilities and eliminates several of the drawbacks of the standard tree approaches. For example, reduction or elimination of the need for scrolling, as experienced with standard tree views when large branches are expanded, has been targeted. We also wanted a visualisation that is stable in the sense that navigating the hierarchy down to 4 or 5 levels of depth will not cause any changes in the presentation of the upper levels, as is the case with hyperbolic trees. The visualisation should offer an overview over several levels of depth, without occlusion and the necessity to scroll, through better use of screen real estate. We also wanted to incorporate statistical information like relevance and size, as is the case in the WebTOC system [18]. However, the standard tree is not, in our opinion, a structure where extra information can be incorporated and embedded in a clear manner.
Fig. 2. SVG Visualisation
Fig. 3. File system visualization
Our requirements for the format included the following: vector-based visualisation; mouse interactivity for tasks like opening a URL, showing a description, or showing and hiding parts of the visualisation; and, finally, saving or exporting the results in a reusable and interchangeable format. Searching for standards for presenting graphical information on the Web, we identified two possible candidates: Macromedia Flash [8] and Scalable Vector Graphics (SVG) [9]. Although software providers deliver browsers supporting Flash whereas an extra plug-in is needed for viewing SVG, we preferred the second option, because SVG is new, state-of-the-art, XML-based and, compared to Flash, an open standard.
3.2 Visualisation Metaphor
For the visualisation a modified tree representation is used. On the one hand, the obvious advantages of a tree for navigating and grouping the results could be retained; on the other hand, the modifications allowed us to integrate statistical information about the results or groups of results. The nodes of the tree are represented by blue rectangles, where inner nodes represent clusters and the leaves represent the actual search results (see fig. 2). Additional data is integrated in the visualisation in three different ways. Firstly, the coloured bars attached to the left side of each rectangle indicate the relevance of the results. Relevance is coded by colours, from green across yellow to red, where green represents high and red low relevance. Secondly, the height of the rectangles indicates the cardinality of the cluster. The third kind of information embedded in the visualisation is the relative position of the rectangles: neighbouring rectangles have a higher content-based similarity than results which are not adjacent.
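A minimal sketch of how such a cluster column could be emitted as SVG (the element layout, colour mapping and names are illustrative, not the actual Topic Cascades code):

```python
def relevance_colour(r):
    """Map a relevance value in [0, 1] to a green-yellow-red scale (1 = green, 0 = red)."""
    return f"rgb({int(255 * (1 - r))},{int(255 * r)},0)"

def cluster_column(clusters, x=20, width=160, unit_height=14):
    """clusters: list of (label, size, relevance) tuples; returns one SVG column."""
    parts, y = [], 10
    for label, size, relevance in clusters:
        h = size * unit_height  # height encodes cluster cardinality
        parts.append(f'<rect x="{x}" y="{y}" width="{width}" height="{h}" fill="blue"/>')
        parts.append(f'<rect x="{x - 8}" y="{y}" width="6" height="{h}" '
                     f'fill="{relevance_colour(relevance)}"/>')  # relevance bar on the left
        parts.append(f'<text x="{x + 4}" y="{y + 12}" fill="white">{label}</text>')
        y += h + 4
    return ('<svg xmlns="http://www.w3.org/2000/svg" width="400" height="600">'
            + ''.join(parts) + '</svg>')

print(cluster_column([("ANIMATIONS IMPLEMENTING", 5, 0.9), ("TUTORIALS", 3, 0.4)]))
```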
3.3 Application
As an application of our SVG visualisation, we have implemented a web meta-search engine based on the WebRat framework and added the SVG visualisation as a result browser. Users access the system through a standard web search interface featuring an input box for query terms and some additional controls for configuration of a query session. After starting a query, the WebRat system queries the search engines, retrieves, transforms and organises the results and then starts the SVG component in a new browser window, sending to it the results of the query in XML format. Finally, the uppermost level of clusters is displayed in the new browser window and the user can start exploring the search results. Moving the mouse cursor over a blue rectangle results in a colour change to red. Clicking on the rectangle representing a group opens a new column displaying its children. The leaves of the tree represent single search results, and clicking on one of them will open the URI of that specific search result. Our SVG application creates a persistent visual description of a search result space which can be saved and recalled for later use.
4 Future Work
In its current form, the visualisation degrades in usability if too many levels are open, because the screen gets cluttered with too much information. We plan to address this issue by fading unused branches or eventually introducing pan and zoom operations. The descriptions of the leaves and nodes are simply cut off after a specific number of characters. Even though the full description of the node is displayed in a status line after clicking on it, a mouse-over effect should reveal the full description.
5 Conclusion
While search result visualisations such as information landscapes are an improvement over the standard ranked list, a more structured and discrete approach can be of benefit in certain situations. We have combined classical tree-based navigation with representation of topical similarity to create a visual representation of search result sets which is both easy to navigate and, at the same time, expresses additional properties of the result set such as topical similarity and statistical information. We found SVG to be a stable and elegant standard for vector graphics which could prove quite useful in developing web-based visualisation applications.
Acknowledgements The Know-Center is a competence center funded within the Austrian Competence Center Programme K plus under the auspices of the Austrian Ministry of Transport, Innovation and Technology (www.kplus.at).
References

[1] Sabol, V., Kienreich, W., Granitzer, M., Becker, J., Tochtermann, K., Andrews, K. (2002). "Applications of a lightweight, web-based retrieval, clustering and visualisation framework", in Proceedings of the 4th International Conference on Practical Aspects of Knowledge Management, Vienna University of Technology, Austria.
[2] Tochtermann, K., Sabol, V., Kienreich, W., Granitzer, M. and Becker, J. (2002). "Intelligent Maps and Information Landscapes: Two new Approaches to support Search and Retrieval of Environmental Information Objects", in Proceedings of the 16th International Conference on Informatics for Environmental Protection, Vienna University of Technology, Austria.
[3] Cavnar, W.B., Trenkle, J.M. (1994). "n-Gram based text categorization". In Symposium on Document Analysis and Information Retrieval, p. 161-176, University of Nevada, Las Vegas.
[4] Leuski, A., Allan, J. (2000). "Lighthouse: Showing the Way to Relevant Information." In Proceedings of the IEEE Symposium on Information Visualisation 2000, pp. 125-130, InfoVis2000, Salt Lake City, Utah.
[5] Chalmers, M. (1993). "Using a landscape metaphor to represent a corpus of documents." In Proceedings of the European Conference on Spatial Information Theory, COSIT 93, pages 337-390, Elba.
[6] Chalmers, M. (1996). "A linear iteration time layout algorithm for visualising high-dimensional data." In Proceedings Visualization 96, IEEE Computer Society, pages 127-132, San Francisco, California.
[7] Andrews, K., Kienreich, W., Sabol, V., Becker, J., Droschl, G., Kappe, F., Granitzer, M., Auer, P., Tochtermann, K. (2002). "The InfoSky Visual Explorer: Exploiting Hierarchical Structure and Document Similarities." In Palgrave Journal on Information Visualisation, Hampshire, England.
[8] Macromedia Inc. (2003). "Macromedia Flash Support Center", http://www.macromedia.com/support/flash/
[9] W3C, SVG Working Group (2003). "Scalable Vector Graphics", http://www.w3.org/Graphics/SVG
[10] Delahousse, J. (2001). "Index and knowledge drawing: a natural bridge from Topic Maps to XML SVG", Mondeca, http://www.idealliance.org/papers/xml2001/papers/html/04-04-02.html
[11] Vivisimo Inc. (2003). "Vivisimo Document Clustering", http://www.vivissimo.com
[12] Hunter, J. (2003). "JDom", http://www.jdom.org
[13] Andrews, K., Heidegger, H. (1998). "Information Slices: Visualising and Exploring Large Hierarchies using Cascading, Semi-Circular Discs". In Late Breaking Hot Topic Paper, IEEE InfoVis'98, Research Triangle Park, North Carolina.
[14] Robertson, G.G., Mackinlay, J.D., Card, S.K. (1991). "Cone trees: Animated 3D Visualisations of Hierarchical Information." In Proceedings CHI91, pages 189-194, New Orleans, Louisiana.
[15] Tochtermann, K., Sabol, V., Kienreich, W., Granitzer, M., Becker, J. (2003). "Enhancing Environmental Search Engines with Information Landscapes", ISESS - 8th International Symposium on Environmental Software Systems, Vienna, Austria.
[16] Kienreich, W., Sabol, V., Granitzer, M., Becker, J., Tochtermann, K. (2003). "Themenkarten als Ergänzung zu hierarchiebasierter Navigation und Suche in Wissensmanagementsystemen", 4. Oldenburger Forum Wissensmanagement, Oldenburg, Germany.
[17] Lamping, J., Rao, R. (1994). "Laying out and visualizing large trees using a hyperbolic space". In Proceedings UIST94, p. 13-14, Marina del Rey, California.
[18] Nation, D.A., Plaisant, C., Marchionini, G., Komlodi, A. (1997). "Visualizing websites using a hierarchical table of contents browser: WebTOC", University of Maryland.
Active Knowledge Mining for Intelligent Web Page Management

Hiroshi Ishikawa, Manabu Ohta, Shohei Yokoyama, Takuya Watanabe, and Kaoru Katayama

Tokyo Metropolitan University, Japan
Abstract. In this paper, we describe active knowledge mining approaches to intelligent Web page management. Through applying them to operational Web systems, we have found them very effective, as follows. First, we describe an adaptable recommendation system called System L-R, which constructs user models as knowledge by classifying the Web access logs and by extracting access patterns based on the transition probability of page accesses, and recommends the relevant pages to the users based both on the user models and the Web site structures. We have evaluated the prototype system and have successfully obtained positive effects of the mined knowledge. Second, we describe another approach to constructing user models, which clusters Web access logs of operational systems based on access patterns. In this case, the knowledge helps to discover unexpected access paths corresponding to ill-formed Web site design. Third, we have successfully identified undiscovered research issues, such as dynamic page recommendation, when we have attempted to mine Web usage logs of operational systems.
1 Introduction
Systems supporting users in the navigation of Web contents are in high demand since current operational Web sites consist of a lot of pages. Furthermore, support for detecting access paths that arise from ill-structured Web site design, contrary to the Web site administrators' expectations, is also in high demand. The key solution to such intelligent page management is Web usage mining. In this paper, we describe two complementary approaches to Web usage mining. First, we describe an adaptable recommendation system, which constructs user models by mining the user access logs and recommends the relevant pages to the users based both on the user models and the Web structures. Second, we describe another approach to constructing user models, which clusters Web access logs based on access patterns. The user models help to discover unexpected access paths corresponding to ill-formed Web site design as well as play a basic role in recommendation. We generally consider active Web usage mining as follows: The effectiveness of knowledge for Web usage mining can be checked by applying it to real operational
systems (e.g., organizational portal sites). Applying such knowledge to operational systems enables the administrative users (e.g., Web site administrators) to add to their stock of information (e.g., about Web site restructuring) about the operational systems. Attempts to apply such knowledge help the researchers to address new issues in them (e.g., new techniques of recommendation). First, we take a Web recommendation system for personalization of our department portal as an operational system to which knowledge is being applied. We have constructed user models, that is, knowledge, as a result of Web usage mining. We have successfully verified the effectiveness of the knowledge by doing empirical studies through applying it to Web recommendation systems. Second, we have been able to discover new insights about an operational system (i.e., a commercial Web site for a cinema complex) by mining user access patterns as knowledge. Thus, access patterns of the site users which the site administrators could not expect in advance have been mined. The results have made the site administrators realize that it is necessary to restructure the Web site in order to reduce unexpected access patterns. Third, we have successfully identified undiscovered research issues when we have attempted to mine Web usage logs for operational systems such as portal sites, which heavily use dynamic pages represented as a combination of environment variables, which are evaluated to result in static pages. In such situations, we have found it difficult to identify explicit access patterns corresponding to explicit access paths. Therefore, it is necessary to invent a new mining technique for access paths containing dynamic pages. Further, it is expected that support counts for the same access paths vary depending on the time period of the log data, even if the durations have the same length. We have discovered some trends, such as rise or fall, in the support counts. It is necessary to invent a new recommendation technique which can take such trends into account. We briefly describe approaches to these issues.
2 Web Usage Mining for Page Recommendation

2.1 Extraction of Web Usage Logs
Our system, called System L-R, recommends the relevant pages to the users based on the probability of the user's transition from one page to another, which is calculated by using the Web access logs. First, we delete the unnecessary log data left by so-called Web robots (i.e., sent by search engines) by heuristically finding them in the Web access logs. Second, we collect a sequence of page accesses by the same user within a predetermined time interval (e.g., 30 minutes) into a user session by using the genuine (i.e., human) Web logs. At the same time, we correct the session by deleting the transitions caused by pushing the back button in Web browsers. Lastly, we calculate the probability of the transition from page A to page B, denoted by P(A->B), based on the modified Web log as follows: P(A->B) = (the total number of transitions from A to B) / (the total number of transitions from A). If count(A, B) is the frequency of the transitions from A to B and "*" denotes any page, then P(A->B) is alternatively represented as count(A, B) / count(A, *). Note that
the transition probability of the path from A to B to C, which we denote by A->B->C (where the length of the path is two), is calculated as follows: P(A->B->C) = P(A->B) * P(B->C).
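A sketch of these computations over the cleaned sessions (illustrative names):

```python
from collections import Counter

def transition_probabilities(sessions):
    """sessions: list of page-id sequences with back-button transitions removed.
    Returns P[(A, B)] = count(A, B) / count(A, *)."""
    pair_counts, out_counts = Counter(), Counter()
    for s in sessions:
        for a, b in zip(s, s[1:]):
            pair_counts[(a, b)] += 1
            out_counts[a] += 1
    return {(a, b): c / out_counts[a] for (a, b), c in pair_counts.items()}

def path_probability(path, P):
    """Probability of a path such as A->B->C as the product of its step probabilities."""
    prob = 1.0
    for a, b in zip(path, path[1:]):
        prob *= P.get((a, b), 0.0)
    return prob
```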
2.2 Recommendation
We provide various methods for recommendation, which can be classified into two groups: pure probability-based methods and weighted probability-based methods.
1) Pure Probability-Based Methods. This class of recommendation methods is based only on the transition probabilities calculated from the Web log. For example, Future Prediction-Based Recommendation: a page ahead by more than one link is recommended based on the path transition probability. This lets the user know in advance what exists ahead. Any link ahead (i.e., any future path farther than two) can be recommended, but it is also possible that the user may miss their true target between the current page and the page far ahead. For example, if the path A->B->C has the highest probability, then the user at A is recommended C.
2) Weighted Probability-Based Methods. This class of recommendation methods is based on the probability weighted by other aspects such as the time length of stay, the Web policy, and the Web link structures. For example, Weighted by Number of References: if n pages reference a page within the same site, then the weight of the referenced page is increased by n. This method is validated by the observation that many pages link to the important pages, which is similar to authorities and hubs in the WWW [3].
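Future prediction-based recommendation could then be sketched as follows (illustrative names; the path length and the number of recommendations are assumptions):

```python
def recommend_future(current_page, P, length=2, top_k=3):
    """Recommend pages reached by the most probable paths of the given length
    starting from the current page."""
    paths = [([current_page], 1.0)]
    for _ in range(length):
        paths = [(path + [b], prob * q)
                 for path, prob in paths
                 for (a, b), q in P.items() if a == path[-1]]
    paths.sort(key=lambda item: item[1], reverse=True)
    recommendations, seen = [], set()
    for path, prob in paths:
        if path[-1] not in seen and path[-1] != current_page:
            seen.add(path[-1])
            recommendations.append((path[-1], prob))
        if len(recommendations) == top_k:
            break
    return recommendations
```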
3 Experiment and Evaluation
3.1 Experimental System
We have implemented an experimental recommendation system and evaluated it using the following logs of our department's Web server: access logs: 384,941 records; Web pages: about 170 HTML files; log duration: June to December 2000; subjects: five in-campus (knowledgeable) students and five off-campus (less knowledgeable) students. Before the evaluation, we prepared problems corresponding to possible objectives of users visiting the department Web site, for example: find the contact information of the faculty member doing research on "efficient implementation of pipeline-adaptable filter". Next, we describe the method for evaluating the experimental system. We instruct the subjects to write down the number of transitions (i.e., clicks) and the time needed to reach the target pages as answers to these problems, and we evaluate the experimental system based on these figures. Each subject solves different problems with different recommendation methods. For this evaluation, we implemented the following four recommendation methods, which are effective for our department site: (1) recommendation with back button transitions, (2) recommendation without back button transitions,
(3) future prediction-based recommendation, and (4) recommendation weighted by the number of references. Note that the first two methods are used only for comparison with the last two, which are our proposals. We use two user groups based on IP addresses: in-campus students and off-campus students. We have implemented the recommendation system using frames: the upper frame displays the original Web page and the lower frame displays the recommendation page. We now describe how support counts for specific access paths are calculated. We adapt Apriori, the algorithm for mining association rules [1], to mining sequential access patterns. First, we collect candidate access paths of length one and prune them to find the frequent access paths that satisfy a pre-specified minimum support. Next, we join the frequent access paths to make candidate access paths that are longer by one step, and again find the frequent ones by pruning the candidates. We continue this join-and-prune procedure until no more frequent access paths are found. Frequent access paths of length greater than one are used only for history-based recommendation; for the other recommendation methods we usually use frequent access paths of length one.
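The join-and-prune procedure can be sketched as follows (a simplified, unoptimized version for illustration; the actual adaptation of Apriori may differ, e.g. in how candidate paths are joined).

```python
def frequent_access_paths(sessions, min_support):
    """Level-wise mining of frequent access paths. A path of length k is a
    sequence of k+1 consecutive pages; its support is the number of
    sessions containing it as a consecutive subsequence."""
    def support(path):
        k = len(path)
        return sum(
            any(tuple(s[i:i + k]) == path for i in range(len(s) - k + 1))
            for s in sessions
        )

    # candidate paths of length one (page pairs), pruned by minimum support
    candidates = {tuple(s[i:i + 2]) for s in sessions for i in range(len(s) - 1)}
    level = {p for p in candidates if support(p) >= min_support}
    frequent_pairs, result = set(level), []

    while level:
        result.extend(sorted(level))
        # join: append a frequent pair that starts where a frequent path ends
        candidates = {p + (q[1],) for p in level for q in frequent_pairs if p[-1] == q[0]}
        # prune the longer candidates by minimum support
        level = {p for p in candidates if support(p) >= min_support}
    return result
```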
Fig. 1. Number of clicks and time for recommendation
3.2 Evaluation
Fig. 1 shows the average number of clicks and the time required to solve the five problems. First of all, the result shows the positive effect of
recommendation, because all methods (2-5) decrease the number of clicks and the time in comparison with no recommendation (1). Recommendation without back button transitions (3) is better than recommendation with them (2), so we conclude that the modified Web logs reflect user access histories more accurately. Future prediction-based recommendation (4) is slightly worse than recommendation without back button transitions (3), probably because the former can mistakenly skip over the user's true target. Recommendation weighted by the number of references (5) is the best overall, although recommendation without back button transitions (3) is slightly better in the number of clicks. The difference between in-campus and off-campus students suggests that we need to classify user groups; this difference is reduced by our recommendation system, which indicates that the system is more effective for less knowledgeable users.
4 Web Usage Mining for Page Restructuring
4.1 User Model Based on Clustering
Here we describe a complementary approach to Web usage mining. First, we cluster Web access log data based on access patterns. Each cluster corresponds to a specific user model with access patterns as its features. We can then use the models for recommendation. We can also use the clusters for extracting unexpected access patterns when we redesign sites to remedy inappropriate structures. We first describe in more detail how we use the clusters as user models for recommendation. We match clients with the user models (i.e., clusters) and determine the most likely model, and then recommend relevant pages to the clients according to the frequent access patterns of the selected cluster. At this point, we can use the various recommendation strategies described earlier as our first approach. In matching clients with user models, we can use the access attributes (i.e., IP addresses, access date and time, accessed pages) that we used for clustering, as well as other attributes of the Web access log not used for clustering itself; characteristics of such attributes are extracted from the constructed clusters by combining them with the original Web access log data. Next, we use the clusters for redesigning Web site structures: we can find access patterns contrary to the expectations of the Web site administrators, and Web sites can be refined by resolving such ill-formed structures.
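As an illustration of how the clusters could serve as user models at recommendation time, the sketch below assigns a new session to the closest cluster centroid and recommends pages from that cluster's frequent access paths; the matching rule and the recommendation heuristic are illustrative assumptions, not the system's actual procedure.

```python
import numpy as np

def assign_user_model(session_vector, centroids):
    """Return the index of the closest cluster centroid (user model),
    comparing page-reference or page-transition vectors by Euclidean distance."""
    distances = np.linalg.norm(np.asarray(centroids) - np.asarray(session_vector), axis=1)
    return int(np.argmin(distances))

def recommend_from_model(visited_pages, frequent_paths, top_n=3):
    """Recommend the end pages of the selected cluster's frequent access
    paths that start from a page the client has already visited."""
    candidates = [path[-1] for path in frequent_paths
                  if path[0] in visited_pages and path[-1] not in visited_pages]
    return candidates[:top_n]
```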
4.2 Experiments
We have performed experiments to validate the complementary approach described above. First, we describe the experimental setting. We use another Web log: the Web site is that of a cinema complex in Tachikawa City. The log contains the access records for Sunday, October 7th, 2001, consisting of 6,402 records accessing 138 distinct HTML pages, from which we extracted 712 user sessions. For this site, we defined a session as consecutive accesses from the same IP address whose intervals are within 40 minutes.
We provide two types of representation for sessions: page reference vectors and page transition vectors. Each page reference vector is a 138-dimensional vector, plus IP address and last access time, in which each component corresponds to a distinct referenced page. We further provide two subtypes of page reference vectors: set type and collection type. Set-type reference vectors indicate the existence of references to each page (each component contains either 0 or 1), while collection-type reference vectors record the number of references to each page (each component contains 0 or a larger integer). We exclude sessions that access only one page, as they are noisy. Each page transition vector is a 19044 (= 138*138)-dimensional vector, plus IP address and last access time, in which each component corresponds to a directed transition from one page to another. Again we provide set and collection subtypes, which, analogously to page reference vectors, indicate the existence and the count of a specific transition, respectively. In other words, we have four different types of clustering: set- and collection-typed page reference clustering, and set- and collection-typed page transition clustering. They are used selectively depending on the specific case.

(a) Collection type
Cluster 1: 250 sessions; most frequent transition: Road show -> Lineup (13%)
Cluster 2: 192 sessions; most frequent transition: Road show -> Next road show (6%)
Cluster 3: 83 sessions; most frequent transition: I-congestion -> City2 (7%)
Cluster 4: 82 sessions; most frequent transition: Road show -> Lineup (12%)
Cluster 5: 76 sessions; most frequent transition: Road show -> Lineup (7%)
Cluster 6: 29 sessions; most frequent transition: Next road show -> Road show (17%)

(b) Set type
Cluster 1: 279 sessions; most frequent transition: Road show -> Lineup (11%)
Cluster 2: 133 sessions; most frequent transition: I-congestion -> City1 (7%)
Cluster 3: 110 sessions; most frequent transition: Road show -> Lineup (5%)
Cluster 4: 71 sessions; most frequent transition: Road show -> Lineup (14%)
Cluster 5: 64 sessions; most frequent transition: Road show -> Lineup (12%)
Cluster 6: 55 sessions; most frequent transition: Next road show -> Road show (10%)
Fig. 2. The result of page transition clustering
We use Ward's method for clustering sessions. We start with each node in a separate cluster, with the node as its center of gravity. In each step, the two clusters that are closest according to a distance measure (e.g., Euclidean distance) are merged, and a new center of gravity is computed for the new cluster from those of the two merged clusters. We repeat this step until the target number of clusters is reached. In Ward's method we have to determine the target number of clusters, and to this end we provide a visual aid that helps the user choose an appropriate number: it plots the minimum of all inter-cluster distances against the number of clusters. With this help, we detected a big change in the minimum distance when the number of clusters changes from 5 to 4 for page reference clustering, so we chose 5 as the target number of clusters there. For the same reason, we chose 6 as the target number of clusters for page transition clustering. We focus
on the result of collection-typed page transition clustering. For each cluster we counted the total number of sessions and determined the most frequently transitioned page pair, its count, and its ratio. Here, too, the collection type emphasizes the characteristics of each cluster more clearly than the set type: the most frequent transition in cluster 6 of collection-typed page transition clustering is "next road show -> road show" with 17%, while the corresponding cluster of set-typed clustering has the same transition with 10% (see Fig. 2). We interpret this observation as follows. The Web log was recorded late on a Sunday, so users access the "next road show" page looking for the next week's schedule. However, on Sunday that page is still empty; it is updated on the following Tuesday, and until then the current schedule page "road show" contains the schedule from Saturday to Friday. The users notice the wrong visit and go to the right page, which causes this anomalous transition. This indicates a design of the Web pages that is contrary to the users' expectations; in other words, the observation suggests that the Web site administrator should restructure the Web site to make it more understandable.
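The Ward clustering step described above can be written with SciPy's hierarchical-clustering routines; this is only an illustrative, off-the-shelf formulation, not the system's own implementation.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_sessions(session_vectors, n_clusters):
    """Ward's agglomerative clustering of session vectors (page-reference
    or page-transition vectors), cut at a chosen number of clusters."""
    Z = linkage(np.asarray(session_vectors, dtype=float), method="ward")
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    # Z[:, 2] holds the merge distances; plotting them against the number
    # of remaining clusters gives the visual aid described in the text.
    return labels, Z[:, 2]
```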
5 Conclusion
Our main contribution is that we have validated the effectiveness of Web usage mining for page recommendation and restructuring through empirical studies. First, we have proposed an experimental recommendation system, L-R, based on both Web usage mining (i.e., transition probability calculation) and Web site structures. The evaluation of System L-R indicates that recommendation based on both Web usage logs and page structures is indeed effective. Second, we have described user model construction based on automatic clustering, and we have shown that collection-typed page clustering extracts the features of each cluster more clearly than set-typed clustering. Lastly, we summarize some issues discovered while applying our knowledge-based methods to operational systems, and briefly describe the new directions being taken to address them. We have attempted to mine frequent access patterns for operational systems such as commercial sites. Such systems make heavy use of dynamic pages represented as combinations of environment variables that are dynamically evaluated to produce static pages, and we found it difficult to identify explicit access patterns corresponding to explicit access paths. We have therefore devised a new mining technique for access paths containing dynamic pages, again adapting Apriori [1] to finding frequent dynamic access paths. First, we cluster all values of each variable based on distances. For the moment we handle numerical, nominal, binary, and text variables, all of which can be ordered. We calculate the average distance between all pairs of neighbouring values for each variable and cut the set of values at every distance longer than the average, which partitions the set into clusters. We then calculate the support count of each cluster to find the frequent value clusters of one variable. Next, we combine (i.e., join) all pairs of frequent value clusters of two distinct variables and prune them based on a pre-specified minimum support to find frequent value clusters of two variables, and we continue this combine-and-prune procedure until no more frequent value clusters are found.
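A minimal sketch of this value-clustering step for one (orderable, here numeric) variable is given below; the handling of nominal, binary, and text variables and the join-and-prune over several variables are omitted, and the request representation is an assumption for illustration.

```python
def partition_by_gaps(values):
    """Split an ordered set of variable values into clusters by cutting at
    every gap larger than the average gap between neighbouring values."""
    vals = sorted(values)
    gaps = [b - a for a, b in zip(vals, vals[1:])]
    if not gaps:
        return [vals]
    avg_gap = sum(gaps) / len(gaps)
    clusters, current = [], [vals[0]]
    for gap, v in zip(gaps, vals[1:]):
        if gap > avg_gap:
            clusters.append(current)
            current = [v]
        else:
            current.append(v)
    clusters.append(current)
    return clusters

def cluster_support(requests, variable, cluster):
    """Support count of a value cluster: number of requests whose value of
    `variable` falls inside the cluster's range."""
    lo, hi = cluster[0], cluster[-1]
    return sum(lo <= r[variable] <= hi for r in requests if variable in r)
```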
Further, as expected, the support counts of the same access paths vary depending on which part of the log data is mined, even if the durations have the same length (e.g., one week). We have observed trends, such as rises or falls, in the support counts of access paths for commercial sites. We have therefore devised a new recommendation technique that takes such trends into account when the mined access patterns are actually used: we predict future support counts by regression based on linear, polynomial, or exponential models. In other words, the recommended results differ from access time to access time even if the same access log data are used to mine the frequent access paths.

We now compare our work with related work. Cooley et al. [2] have clarified the preprocessing tasks necessary for Web usage mining; our approach basically follows their steps to prepare Web log data for mining. Srivastava et al. [7] have comprehensively surveyed techniques for Web usage mining; among the various techniques, we use transition probabilities as a sequential variation of association rules for page recommendation and collection-typed clustering for page restructuring. Spiliopoulou et al. [6] have proposed the Web Utilization Miner to find interesting sequential access patterns, which is similar to our approach for page recommendation; unlike them, however, we construct various types of recommendation from the same set of sequential access patterns. Perkowitz et al. [5] have proposed Adaptive Web Sites as a method for providing index pages suitable for users based on Web access logs; their system automatically recommends different pages to different individuals, while our system recommends pages in several ways chosen by both the Web site administrator and the users. Mobasher et al. [4] have proposed a method for recommending pages weighted by mining user access logs; their system recommends pages according to the user accesses recorded in cookies, while our system allows the user to choose among several recommendation methods and considers both Web logs and page structures.
Acknowledgements. This work was partially supported by the Ministry of Education, Culture, Sports, Science and Technology, Japan, under grant 14019075.
References
[1] R. Agrawal, T. Imielinski, and A. Swami: Mining Association Rules between Sets of Items in Large Databases. Proc. ACM SIGMOD, pp. 207-216, 1993.
[2] R. Cooley, B. Mobasher, and J. Srivastava: Data Preparation for Mining World Wide Web Browsing Patterns. Journal of Knowledge and Information Systems, vol. 1, no. 1, pp. 5-32, 1999.
[3] J. Kleinberg: Authoritative Sources in a Hyperlinked Environment. Proc. 9th ACM-SIAM Symposium on Discrete Algorithms, 1998.
[4] B. Mobasher, R. Cooley, and J. Srivastava: Automatic Personalization Based on Web Usage Mining. CACM, vol. 43, no. 8, pp. 142-151, 2000.
[5] M. Perkowitz and O. Etzioni: Adaptive Web Sites. CACM, vol. 43, no. 8, pp. 152-158, 2000.
[6] M. Spiliopoulou, L.C. Faulstich, and K. Winkler: A Data Miner Analyzing the Navigational Behavior of Web Users. Proc. Workshop on Machine Learning in User Modeling of ACAI99, 1999.
[7] J. Srivastava, R. Cooley, M. Deshpande, and P. Tan: Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data. SIGKDD Explorations, vol. 1, no. 2, pp. 12-23, 2000.
Case-Based Reasoning for Time Courses Prognosis Rainer Schmidt and Lothar Gierl Universität Rostock, Institut für Medizinische Informatik und Biometrie, Rembrandtstr. 16 / 17, D-18055 Rostock, Germany {rainer.schmidt,lothar.gierl}@medizin.uni-rostock.de
Abstract. Much research has been performed in the field of medical temporal course analysis in recent years. However, the methods developed so far require either a complete domain theory or well-known standards. Unfortunately, in many medical areas such knowledge is still missing. We have therefore developed a method for predicting temporal courses without well-known standards and without a complete domain theory. Our method combines Temporal Abstraction and Case-Based Reasoning. In the last decade, Case-Based Reasoning, an artificial intelligence method that uses experience in the form of cases, has become successful in many areas; Temporal Abstraction is a medical informatics technique for generalising and describing sequences of events. Here we present our method and summarise two medical applications. The first deals with multiparametric time courses of the kidney function; we apply the same ideas to the prognosis of the temporal spread of diseases like influenza.
1 Introduction
Since traditional time series techniques [1] work well with known periodicity but do not fit domains characterised by the possibility of abrupt changes, much research has been performed in the field of medical temporal course analysis in recent years. However, the methods developed so far require either a complete domain theory (like RÉSUMÉ [2]) or well-known standards [3] (e.g. course patterns or known periodicity). For temporal courses, our idea is to search with Case-Based Reasoning retrieval methods [4] for former, similar courses and to consider their continuations as possible prognoses for a current course. So far, we have successfully applied our method in the kidney function domain [5] and to the prognosis of the spread of infectious diseases, especially influenza [6]. Case-Based Reasoning (CBR) means using previous experience, represented as cases, to understand and solve new problems. A case-based reasoner remembers former cases similar to a current problem and attempts to modify solutions of the remembered cases to fit the current problem. The CBR cycle developed by Aamodt and Plaza [7] consists of four steps: retrieving former similar cases, adapting
their solutions to a current problem, revising a proposed solution, and retaining newly learned cases. However, there are two main tasks [4, 7]: the retrieval, i.e. the search for a similar case, and the adaptation, i.e. the modification of the solutions of retrieved cases. Since the differences between two cases are sometimes very complex, especially in medical domains, many case-based systems are so-called retrieval-only systems: they perform only the retrieval task, visualise the current and similar cases, and sometimes additionally point out important differences between them [8]. Temporal Abstraction has become a hot topic in medical informatics in recent years. For example, in the diabetes domain measured parameters can be abstracted into states (e.g. low, normal, high) and afterwards aggregated into intervals called episodes to generate so-called modal days [9]. The main principles of Temporal Abstraction have been outlined by Shahar [10]. The idea is to describe a temporal sequence of values, actions, or interactions in a more abstract form, which conveys a tendency about the status of a patient. For example, for monitoring the kidney function it is useful to provide daily reports of multiple kidney function parameters; however, information about the development of the kidney function over time and, if appropriate, an early warning against a forthcoming kidney failure represent a huge improvement [5].
2 Prognostic Model
Our prognostic model for time courses (Fig. 1) consists of state abstraction (which is only necessary for multiparametric courses), temporal abstraction, CBR retrieval of prototypes and cases, and, if possible, adaptation.
Fig. 1. Prognostic model for temporal courses
State Abstraction. The first step abstracts from a set of parameter values to a single function state. This step is of course only necessary if multiparametric courses are considered. A few requirements have to be met: meaningful states to describe the parameter sets and a hierarchy of these states must exist, and knowledge to define the states must be available.

Temporal Abstraction. To describe tendencies, an often-realised idea is to use different trend descriptions for different periods of time, e.g. short-term or long-term trend descriptions [e.g. 11]. The length of each trend description can be fixed or may depend on the concrete values (e.g. successive equivalent states may be concatenated). The concrete definitions of the trend descriptions, however, depend on characteristics of the application domain: on the number of states and the way they are ordered, on the lengths of the considered courses, and on what has to be detected, e.g. long-term developments or short-term changes.

Searching for a Prototype. Since doctors reason with prototypical cases anyway, it has become an often-realised idea, especially for medical applications, to organise the retrieval in two steps: first a prototype is found by applying some indexes, and afterwards the CBR retrieval algorithm considers only the cases belonging to this prototype. We have applied this idea only in the kidney function domain (for details see [5]).

CBR Retrieval. To determine the most similar cases, sophisticated similarity measures provide the best results, but they consider all stored cases sequentially, and especially for large case bases a sequential process is too time-consuming. So, a few non-sequential retrieval algorithms have been developed in the CBR community. Most retrieval algorithms can handle various sorts of attributes, but usually they only work well for the sorts of attributes or problems they were developed for. This means that the choice of the retrieval algorithm should mainly depend on the sort of values of the case attributes and sometimes additionally on application characteristics like the size of the case base. The question arises: of which sort are the parameters that describe a trend? The states are obviously nominal values ordered according to a hierarchy; the assessments also have ordered nominal values, e.g. steady, decreasing, etc.; only the lengths have numeric values. If the time points of the parameter measurements are a few integers, they can be treated as ordered nominal values. The proposed retrieval algorithms for ordered nominal-valued attributes are CBR retrieval nets [4], which are based on spreading activation [12]. So, if all parameters considered for retrieval have ordered nominal values, a good choice of retrieval algorithm is a CBR retrieval net.

Adaptation. The decision whether adaptation should be performed mainly depends on whether adaptation knowledge is available, whether the potential users prefer to adapt themselves, and whether visual presentations of current courses in comparison with former, similar courses (retrieval-only) are helpful.
3 Prognosis of Kidney Function Courses
When we started to develop our program, the doctors at our intensive care unit (ICU) received a daily printed renal report from the monitoring system NIMON [22], consisting of 13 measured and 33 calculated parameters of those patients for whom renal function monitoring was applied. The interpretation of all reported parameters is quite complex and needs special knowledge of renal physiology. The aim of our knowledge-based system ICONS is to give an automatic interpretation of the renal state, to detect impairments of the kidney function in time and to give early warnings against forthcoming kidney failures. In the renal domain, neither well-known standards nor complete knowledge about the kidney function exist; in particular, knowledge about the behaviour of the various parameters over time is still incomplete. So we combined the idea of RÉSUMÉ [2], to abstract many parameters into one single parameter, with the idea of Haimowitz and Kohane [3], to compare many parameters of current courses with well-known standards. Since well-known standards were not available, we used former similar cases instead. The first step of our program is thus an abstraction of the daily parameter sets into daily renal function states, which reflect states of increasing severity, beginning with a normal renal function and ending with a renal failure. This step is mainly done automatically; only when the determination is not obvious do we present the states under consideration to the user, sorted according to their probability, and the doctor has to accept one of them. The second step is a temporal abstraction of a sequence of seven daily kidney function states (time periods longer than a week are of little relevance for the current situation of a patient). We have defined three different trends: the short-term trend describes just the development since yesterday, the medium-term trend has no fixed length but describes the most recent trend direction, and the long-term trend describes the whole considered week. Furthermore, we have defined five assessments for these trends (steady, increasing, sharply increasing, decreasing, sharply decreasing); only for the long-term trend have we added four further assessments (alternating, oscillating, fluctuating, nearly steady). Fig. 2 shows the result of these two abstractions: the lower part depicts the sequence of (abbreviated) daily states, and the upper part shows the three trend descriptions.
Fig. 2. Retrieved similar course with an abstracted sequence of kidney function states
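To make the two abstraction steps concrete, the sketch below derives the three trend descriptions from a week of daily states. The severity ranking, the thresholds for "sharply", and the medium-term rule are illustrative assumptions rather than the definitions used in ICONS, and only the five basic assessments are handled.

```python
def sign(x):
    return (x > 0) - (x < 0)

def assess(delta):
    """Map a change in severity rank to a trend assessment (assumed thresholds)."""
    if delta >= 2:
        return "sharply increasing"
    if delta == 1:
        return "increasing"
    if delta == 0:
        return "steady"
    if delta == -1:
        return "decreasing"
    return "sharply decreasing"

def trend_descriptions(states, severity):
    """states: seven daily kidney function states, oldest first.
    severity: dict mapping each state to its rank in the severity hierarchy."""
    ranks = [severity[s] for s in states]
    short_term = assess(ranks[-1] - ranks[-2])      # development since yesterday
    # medium term: extend backwards while the day-to-day direction is unchanged
    direction = sign(ranks[-1] - ranks[-2])
    i = len(ranks) - 1
    while i > 0 and sign(ranks[i] - ranks[i - 1]) == direction:
        i -= 1
    medium_term = assess(ranks[-1] - ranks[i])
    long_term = assess(ranks[-1] - ranks[0])        # the whole considered week
    return short_term, medium_term, long_term
```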
After these two abstractions, we use the parameters of the three trend descriptions (assessments, length, beginning state, and final state) to retrieve former courses with similar trends. When a current case is finished (no more monitoring of the kidney), we create many 7-day courses from it: from the first to the seventh day, from the second to the eighth day, and so on. The course continuations of the similar courses serve as the prognosis (e.g. tomorrow and the 9th day in Fig. 2). Since many different continuations are possible for the same previous course, it is necessary to search for similar courses with different projections. We therefore divided the search space into nine parts corresponding to the possible continuation directions; each direction forms its own part of the search space. In each part, we apply a spreading activation algorithm [12] for retrieval. For evaluation we have compared it with an indexing algorithm [13], which works faster but does not always find the desired similar courses. The advantage of retrieval nets and spreading activation is that, within an attribute dimension, they allow similarities between nominal-valued attributes instead of indexing by exact matches. During retrieval the direction parts are searched separately, and each part may provide at most one similar course. The similar courses of these parts are presented together in the order of their computed similarity values, and each retrieved similar course is visually presented in comparison with the current query course (Fig. 2 shows just part of such a presentation).
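A much simplified retrieval step is sketched below: it scores stored courses by a plain weighted distance over the trend-description attributes instead of the retrieval net and spreading activation actually used, it ignores the partitioning of the search space by continuation direction, and the attribute layout is an assumption.

```python
ASSESSMENTS = ["sharply decreasing", "decreasing", "steady",
               "increasing", "sharply increasing"]

def trend_distance(t1, t2):
    """Distance between two trend descriptions, each a tuple
    (assessment, length, start_rank, end_rank)."""
    (a1, l1, s1, e1), (a2, l2, s2, e2) = t1, t2
    d = abs(ASSESSMENTS.index(a1) - ASSESSMENTS.index(a2))  # ordered nominal
    return d + abs(l1 - l2) + abs(s1 - s2) + abs(e1 - e2)

def retrieve_similar_courses(query_trends, case_base, k=3):
    """query_trends / case['trends']: the (short, medium, long) trend
    descriptions of a 7-day course; returns the k most similar cases."""
    def course_distance(case):
        return sum(trend_distance(q, c) for q, c in zip(query_trends, case["trends"]))
    return sorted(case_base, key=course_distance)[:k]
```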
4 Prognosis of the Spread of Infectious Diseases
In our current TeCoMed project we apply the same method to the prognosis of the spread of infectious diseases like influenza and bronchitis [6]. The aim of the project is to send early warnings against forthcoming waves or even epidemics of infectious diseases to interested practitioners, pharmacists, etc. in the German federal state of Mecklenburg-Western Pomerania. The available data are written confirmations of unfitness for work, which affected employees have to send to their employers and their health insurance schemes; these confirmations contain the diagnoses made by their doctors. Influenza waves are complicated to predict, because they are cyclic but not regular [14]. Usually, one influenza wave can be observed in Germany each winter, but the timing and intensity of these waves vary greatly: in some years they are nearly unnoticeable, while in other years doctors and pharmacists even run out of vaccine. Because of this irregular cyclic behaviour it is insufficient to determine average values based on former years and to give warnings as soon as such values are noticeably exceeded. So, again we apply temporal abstraction and use Case-Based Reasoning to search for similar developments in the past. However, there are some differences compared with the kidney function domain. Here, a state abstraction is unnecessary and indeed impossible, because there is just one parameter, namely the weekly incidences of a disease, so we have to deal with courses of integer values instead of nominal states related to a hierarchy, and the data are aggregated weekly. Since we believe that courses should reflect the development over four weeks, courses consist of four integer values.
Again, each case (e.g. each influenza season) is separated into many courses: from the first to the fourth week, from the second to the fifth week, and so on. And again we use three trend descriptions; here they are simply the assessments of the developments from last week to this week, from the week before last to this week, and so forth. For retrieval we use these three assessments (nominal values) plus the four original weekly data values (integers). We use these two sorts of parameters because we want to ensure that the current and the similar course are on the same level (similar weekly data) and that they show similar changes over time (similar assessments). So far, we sequentially compute the distances between a query course and all 4-week courses stored in the case base. This computation provides a list of all former 4-week courses sorted according to their distance to the query course. Since this very long list is not really helpful, we reduce it by two threshold parameters that guarantee sufficient similarity: the first considers the sum of the distances concerning the three trend assessments, the second the distance of the original weekly data of just the current week. For adaptation, namely for the decision whether a warning is appropriate in the query week, we apply Compositional Adaptation [15] to the rather small list of those former courses that are similar enough. We have manually marked the points of each former case (influenza season) at which, in retrospect, a warning was appropriate. So, when deciding whether a warning is appropriate, we split the list of similar courses into two lists, according to whether a warning was appropriate in retrospect or not. For both of these lists we compute the sum of the reciprocal distances of their courses, and the decision about the appropriateness of a warning then depends on which of the two sums is bigger.
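The retrieval-and-adaptation step just described can be sketched as follows; the attribute names, the use of numeric week-to-week changes as "assessments", and the absolute-difference distances are simplifying assumptions.

```python
def warn_this_week(query, case_base, max_assessment_dist, max_level_dist):
    """query and each stored case: dicts with 'weekly' (four incidence counts)
    and 'assessments' (three week-to-week changes); stored cases additionally
    carry 'warning_marked', the retrospective judgement whether a warning
    was appropriate at that point."""
    warn_sum = no_warn_sum = 0.0
    for case in case_base:
        assess_dist = sum(abs(a - b) for a, b in
                          zip(query["assessments"], case["assessments"]))
        level_dist = abs(query["weekly"][-1] - case["weekly"][-1])
        # the two thresholds guarantee sufficient similarity
        if assess_dist > max_assessment_dist or level_dist > max_level_dist:
            continue
        distance = assess_dist + level_dist + sum(
            abs(a - b) for a, b in zip(query["weekly"], case["weekly"]))
        weight = 1.0 / (distance + 1e-9)   # reciprocal distance
        if case["warning_marked"]:
            warn_sum += weight
        else:
            no_warn_sum += weight
    # compositional adaptation: warn if the 'warning' courses dominate
    return warn_sum > no_warn_sum
```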
4.1 Example: Bronchitis
As a second disease we applied our method to bronchitis. Its behaviour is similar to that of influenza: it is likewise cyclic and irregular, and often a nearly parallel spread can be observed. However, there are also deviations. While our method worked very well on the temporal spread of influenza for all four former influenza seasons stored in our case base, it did not work as well for bronchitis courses. We set up the same experiments as for influenza: we took one course out of the case base and tried to compute the desired warnings with the help of the remaining three courses as the case base. For bronchitis, the desired time points for warnings were as follows: 1997/98, 5th week of 1998; 1998/99, 7th week of 1999; 1999/00, 3rd week of 2000; and 2000/01, 3rd week of 2001 (see the squares in Fig. 3). Our method works very well for all courses but one: the desired warning (5th week) in the bronchitis season of 1997/98 cannot be computed (it is delayed until the 6th week), because the temporal spread is too dissimilar to the other three courses. A main problem is that our case base is still very small. Furthermore, our implicit assumption is that the spread of infectious diseases like influenza and bronchitis occurs more or less step by step, while in 1997/98 the bronchitis incidences jumped in a single week from an extremely low level to a warning level.
[Figure: weekly bronchitis incidence counts (0 to 700) for the seasons 1997/1998, 1998/1999, 1999/2000, and 2000/2001, plotted from the 40th week of one year to the 12th week of the next.]
Fig. 3. Bronchitis courses for Mecklenburg-Western Pomerania from October till March
References
[1] Robeson, S.M., Steyn, D.G.: Evaluation and comparison of statistical forecast models for daily maximum ozone concentrations. Atmospheric Environment 24 B (2) (1990) 303-312
[2] Shahar, Y.: Timing is Everything: Temporal Reasoning and Temporal Data Maintenance in Medicine. Proceedings of AIMDM'99, Lecture Notes in Artificial Intelligence, Vol. 1620. Springer-Verlag, Berlin Heidelberg New York (1999) 30-46
[3] Haimowitz, I.J., Kohane, I.S.: Automated trend detection with alternate temporal hypotheses. Proceedings of the 13th International Joint Conference on Artificial Intelligence, Morgan Kaufmann, San Mateo (1993) 146-151
[4] Lenz, M., Auriol, E., Manago, M.: Diagnosis and decision support. In: Lenz, M., et al. (eds.): Case-Based Reasoning Technology, From Foundations to Applications. Lecture Notes in Artificial Intelligence, Vol. 1400, Springer-Verlag, Berlin Heidelberg New York (1998) 51-90
[5] Schmidt, R., Pollwein, B., Gierl, L.: Medical multiparametric time course prognoses applied to kidney function assessments. Int J Med Inform 53 (2-3) (1999) 253-264
[6] Schmidt, R., Gierl, L.: Case-based Reasoning for Prognosis of Threatening Influenza Waves. In: Perner, P. (ed.): Advances in Data Mining; Applications in E-Commerce, Medicine, and Knowledge Management. Lecture Notes in Artificial Intelligence, Vol. 2394, Springer-Verlag, Berlin Heidelberg New York (2002) 99-107
[7] Aamodt, A., Plaza, E.: Case-based reasoning: Foundational issues, methodological variations, and system approaches. AI Communications 7 (1) (1994) 39-59
[8] Macura, R., Macura, K.: MacRad: Radiology image resources with a case-based retrieval system. Proceedings of the First ICCBR, Springer-Verlag, Berlin Heidelberg New York (1995) 43-54
[9] Larizza, C., Bellazzi, R., Riva, A.: Temporal abstraction for diabetic patients management. In: Keravnou, E., et al. (eds.): Proc. 6th Conference on AI in Medicine, Springer-Verlag, Berlin Heidelberg New York (1997) 319-330
[10] Shahar, Y.: A Framework for Knowledge-Based Temporal Abstraction. Artificial Intelligence 90 (1997) 79-133
[11] Miksch, S., Horn, W., Popow, C., Paky, F.: Therapy planning using qualitative trend descriptions. In: Barahona, P., Stefanelli, M., Wyatt, J. (eds.): Proc. of the 5th Conference on AI in Medicine, Springer-Verlag, Berlin Heidelberg New York (1995) 197-208
[12] Anderson, J.R.: A theory of the origins of human knowledge. Artificial Intelligence 40, Special Volume on Machine Learning (1989) 313-351
[13] Stottler, R.H., Henke, A.L., King, J.A.: Rapid retrieval algorithms for case-based reasoning. Proceedings of the 11th IJCAI, Detroit, Morgan Kaufmann Publishers, San Mateo (1989) 233-237
[14] Farrington, C.P., Beale, A.D.: The Detection of Outbreaks of Infectious Diseases. In: Gierl, L., et al. (eds.): International Workshop on Geomedical Systems, Teubner-Verlag, Stuttgart (1997) 97-117
[15] Wilke, W., Smyth, B., Cunningham, P.: Using Configuration Techniques for Adaptation. In: Lenz, M., et al. (eds.): Case-Based Reasoning Technology, From Foundations to Applications. Lecture Notes in Artificial Intelligence, Vol. 1400, Springer-Verlag, Berlin Heidelberg New York (1998) 139-168
Adaptation Problems in Therapeutic Case-Based Reasoning Systems Rainer Schmidt, Olga Vorobieva, and Lothar Gierl Universität Rostock, Institut für Medizinische Informatik und Biometrie, Rembrandtstr. 16 / 17, D-18055 Rostock, Germany {rainer.schmidt,olga.vorobieva,lothar.gierl}@medizin.uni-rostock.de
Abstract. Case-based Reasoning has become a successful technique in many application domains. Unfortunately, so far it is not as successful in medicine. One reason, probably the main one, is that on the one hand the adaptation problem in Case-based Reasoning cannot be solved in a domain-independent way, while on the other hand adaptation in medicine is often more difficult than in other domains, because more, and more complex, features have to be considered. In this paper, we indicate possibilities for solving adaptation problems in medical Case-based Reasoning systems.
1 Introduction
In many domains, Case-based Reasoning (CBR) has become a successful technique for knowledge-based systems. Unfortunately, in medicine additional problems arise when applying this method. Case-based Reasoning means using previous experience in the form of cases to understand and solve new problems: a case-based reasoner remembers former cases similar to a current problem and attempts to modify their solutions to fit the current case. A CBR system therefore has to solve two main tasks [1]. The first is the retrieval, i.e. the search for or the calculation of the most similar cases. Much research has been undertaken on this task; the basic retrieval algorithms for indexing [2], nearest-neighbour matching [3], pre-classification [4], etc. were developed some years ago and have been improved in recent years, so it has become comparatively easy to find sophisticated CBR retrieval algorithms adequate for nearly every sort of application problem. The second task, adaptation, means modifying the solution of a former similar case to fit the current problem. If there are no important differences between the current and the similar case, a solution transfer is sufficient; sometimes only a few substitutions are required, but usually adaptation is a complicated process. Though theories and models for adaptation have been developed [e.g. 5, 6], adaptation is still domain-dependent: usually, specific adaptation rules have to be generated for each application. Why Case-Based Reasoning in Medicine? Especially in medicine, the knowledge of experts does not only consist of rules, but of a mixture of textbook knowledge and
experience. The latter consists of cases, typical and exceptional ones, and the reasoning of physicians takes them into account [7]. In medical knowledge-based systems there are two sorts of knowledge: objective knowledge, which can be found in textbooks, and subjective knowledge, which is limited in space and time and changes frequently. The problem of updating the changeable subjective knowledge can partly be solved by incrementally incorporating new up-to-date cases [7]. Both sorts of knowledge can be clearly separated: objective textbook knowledge can be represented in the form of rules or functions, while subjective knowledge is contained in cases. So, the arguments for applying case-oriented methods in medicine are as follows:
1. Reasoning with cases corresponds to the decision-making process of physicians.
2. Incorporating new cases means automatically updating parts of the changeable knowledge base.
3. Objective and subjective knowledge can be clearly separated.
4. As cases are routinely stored, integration into clinical communication systems is easy.
2 Medical Case-Based Reasoning Systems
Though CBR has so far not become as successful in medicine as in some other domains, several medical systems have already been developed that at least apply parts of the Case-based Reasoning method. Some systems avoid the adaptation problem: they do not apply the complete CBR method, but only a part of it, namely the retrieval. These systems can be divided into two groups, retrieval-only systems and multi-modal reasoning systems. Retrieval-only systems are mainly used for image interpretation, because this is mainly a classification task [8]; they are also used for other visualisation tasks, e.g. for the development of kidney function courses [9] and for hepatic surgery [10]. Multi-modal reasoning systems apply parts of different reasoning methods; from CBR they usually incorporate the retrieval, often to calculate or support evidence [e.g. 11, 12].
3 Adaptation Solutions in Therapeutic Applications
So far, only a few medical systems have been developed that apply the complete CBR method. Considering these systems and based on our own experiences, here we discuss promising adaptation techniques, mainly for therapeutic applications. Constraints are a promising adaptation technique, but the use of constraints is limited to specific situations. In ICONS, an antibiotics therapy adviser [13], a similar case is retrieved, the list of therapy recommendations made for the retrieved case is transferred to the current patient, and this list is reduced by additional contraindications of the current patient. So, adaptation reduces a list of solutions (therapies) by constraints (contraindications). In a diagnostic program concerning dysmorphic syndromes, which was developed in the GS.52 project [14], the retrieval provides a list of prototypes sorted according to their similarity in respect of a current
patient. Each prototype defines a diagnosis (dysmorphic syndrome) and represents the typical features of this diagnosis. The provided list of prototypes is checked against a set of explicit constraints, which state that some features of the patient either contradict or support specific prototypes (diagnoses); the list of prototypes is thus reduced by contradictions and re-sorted according to the evidence. Another typical example of applying adaptation constraints is menu planning [15]: to use the solution of a former case, it has to be guaranteed that this solution fulfils all requirements of the current query case, namely special diets and individual factors, personal preferences, and also contraindications and demands based on various complications.

Compositional Adaptation [16] is another successful adaptation technique. It is suitable for the calculation of dosages and for the determination of attributes of treatment plans. In TA3-IVF [17], a system to modify in vitro fertilisation treatment plans, relevant similar cases are retrieved (the relevance has to be specified by the user) and Compositional Adaptation computes weighted averages for the solution attributes.

Abstraction from single cases to more general prototypical cases seems promising for supporting adaptation implicitly. Since one reason for adaptation problems is the specificity of single cases, the generalisation from single cases into abstracted prototypes [14] or classes [18] may support the adaptation. The idea of generating more abstract cases is typical for the medical domain, because here (proto-)typical cases very often correspond directly to (proto-)typical diagnoses or therapies. In GS.52 [14] each case is characterised by a list of features, which usually contains between 40 and 130 symptoms and syndromes. This means that there are so many differences between a current and a similar case that adaptation obviously cannot take all of them into account. An abstracted prototypical case usually contains only up to 20 features; for a query case the most similar prototypes are calculated, and subsequently only a few constraints have to be checked for the adaptation.

Adaptation Rules are a technique that seems to be general enough to solve many medical adaptation problems. Unfortunately, the content of such rules has to be domain-dependent, and especially for more complex medical tasks the generation of adaptation rules is often too time-consuming and sometimes even impossible. One of the earliest medical expert systems to apply CBR techniques is CASEY [19], which deals with heart failure diagnosis. The most interesting aspect of CASEY is its ambitious attempt to solve the adaptation task: since the creation of a complete rule base for adaptation was too time-consuming, general operators are used for adaptation. Since many features have to be considered in the heart failure domain, not all differences between former, similar cases and a current case can be handled by general adaptation operators; so, if no similar case can be found or if adaptation fails, CASEY falls back on a rule-based domain theory.
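For the constraint technique, a toy sketch is shown below (purely illustrative; it is not the ICONS or GS.52 rule base): the solution list of the retrieved case is transferred and then reduced by the current patient's contraindications. The therapy names in the usage line are hypothetical.

```python
def adapt_by_constraints(retrieved_therapies, patient_contraindications):
    """Transfer the therapy recommendations of the retrieved case and drop
    every therapy that violates a constraint (a contraindication) of the
    current patient."""
    return [therapy for therapy in retrieved_therapies
            if therapy not in patient_contraindications]

# usage (hypothetical therapy names)
print(adapt_by_constraints(["therapy_a", "therapy_b", "therapy_c"], {"therapy_b"}))
```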
4 Examples from the Endocrinology Domain
Recently, we have developed some programs for endocrinology support in a children's hospital. To illustrate adaptation solutions, we present three typical tasks: computing an initial dose, updating the dose, and considering other diseases and complications.
All body functions are regulated by the endocrine system. The endocrine glands produce hormones and secrete them into the blood. Hypothyroidism means that a patient's thyroid gland does not naturally produce enough thyroid hormone. If hypothyroidism is undertreated, it may lead to obesity, bradycardia and other heart diseases, memory loss, and many other conditions [20]; furthermore, in children it causes mental and physical retardation. The diagnosis of hypothyroidism can be established by blood tests, and the therapy is always the same: thyroid hormone replacement with levothyroxine. The problem is to determine the therapeutic dose, because the thyroxine demand of a patient follows general schemata only very roughly, so the therapy must be individualised [21]. If the dose is too low, hypothyroidism is undertreated; if the dose is too high, the thyroid hormone concentration is also too high, which leads to hyperactive thyroid effects [20, 21]. Precise determination of the initial dose is most important for newborn babies with congenital hypothyroidism, because for them every week of proper therapy counts.

Computing an Initial Dose. For the determination of an initial dose (Fig. 1), a couple of prototypes, called guidelines, exist, which have been defined by commissions of experts. The assignment of a patient to a fitting guideline is obvious because of the way the guidelines have been defined. With the help of these guidelines, a range of good doses can be calculated. To compute an optimal dose, we retrieve similar cases with initial doses within the calculated range. Since there are only very few attributes and our case base is rather small, we use Tversky's sequential measure of dissimilarity [22]. On the basis of those retrieved cases that had the best therapy results, an average initial dose is calculated; the best therapy results can be determined from the values of another blood test after two weeks of treatment with the initial dose. The opposite idea, to consider cases with bad therapy results, does not work here, because bad results may have various causes. So, we apply two forms of adaptation: first, a calculation of ranges according to guidelines and patient attributes, and second, compositional adaptation, i.e. we take only similar cases with the best therapy results into account and calculate the average dose for these cases, which is then adapted to the query patient by another calculation.

Dose Update. For monitoring the patient, three laboratory blood tests have to be made. Usually the results of these tests correspond to each other; otherwise, this indicates a more complicated thyroid condition and additional tests are necessary. If the tests show that the patient's thyroid hormone level is normal, the current levothyroxine dose is OK. If the tests indicate that the thyroid hormone level is too low or too high, the current dose has to be increased or decreased, respectively, by 25 or 50 µg [27, 28]. So, for monitoring, adaptation is a calculation according to some rules based on guidelines. Figure 2 shows an example from a case study in which we compared the decisions of an experienced doctor with the recommendations of our system. In this example there are three deviations; usually there are fewer. At the second visit (v2), it was obvious that the dose should be increased; our program recommended too large an increase, so we modified an adaptation rule. At visit 10 (v10) the doctor tried to decrease the dose, but without success (v11).
At visit 21 (v21) the doctor increased the dose because of some minor symptoms of hypothyroidism, which our program did not assess as important.
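The two adaptation steps for the initial dose (guideline-based range calculation followed by compositional adaptation over the best similar cases) can be sketched as follows; the per-kilogram dose representation, the field names, and the final weight scaling are illustrative assumptions, not the program's actual data model.

```python
def optimized_initial_dose(patient, guideline_range_per_kg, similar_cases):
    """guideline_range_per_kg: (low, high) dose range derived from the
    fitting guideline and the patient's attributes.
    similar_cases: retrieved cases with 'dose_per_kg' and 'good_outcome'
    (the result of the blood test after two weeks of treatment)."""
    low, high = guideline_range_per_kg
    best = [c for c in similar_cases
            if c["good_outcome"] and low <= c["dose_per_kg"] <= high]
    if not best:
        return None   # fall back to the guideline range itself
    avg_per_kg = sum(c["dose_per_kg"] for c in best) / len(best)
    return avg_per_kg * patient["weight_kg"]
```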
Fig. 1. Determination of an initial dose (hypothyroidism patient, dose to be determined -> guideline determination -> adaptation by calculation -> range of dose -> retrieval of similar former cases -> check for best therapy results -> best similar cases -> compositional adaptation -> optimized dose)
[Figure: levothyroxine dose in µg/day (0 to 120) recommended by the doctor and by the program over visits V1 to V22.]
Fig. 2. Comparison of recommended dose updates with decisions of a doctor
Further Complications. It often occurs that patients suffer from further chronic diseases or complications, so the levothyroxine therapy has to be checked for contraindications, adverse effects, and interactions with additionally existing therapies. Since no alternative is available to replace levothyroxine, the additional therapy has to be modified, substituted, or compensated if necessary [20, 21]. In our support program we perform three tests. The first checks whether another existing therapy is contraindicated for hypothyroidism. This holds for only very few therapies, namely specific diets like soybean infant formula, which are, however, typical for newborn babies; such diets have to be modified. Since no exact knowledge about how to do this is available, our program just gives a warning saying that a modification is necessary. The second test considers adverse effects: a further existing therapy has either to be substituted or to be compensated by another drug. Since such knowledge is available, we have implemented corresponding rules for substitutional and compensational adaptation, respectively. The third test checks for interactions between the two therapies. Here we have implemented some adaptation rules, which mainly attempt to avoid the interactions; for example, if a patient has heartburn problems that are treated with an antacid, a rule exists stating that levothyroxine should be administered at least four hours before or after the antacid. If such adaptation rules cannot solve an interaction problem, the same substitution rules as for adverse effects have to be applied.
5 Summary of Promising Adaptation Techniques
At present we can only summarise useful adaptation techniques, and most of them are promising only for specific tasks.

Abstraction from single cases to more general prototypes seems to be a promising implicit support. Moreover, if the prototypes correspond to guidelines (as for dose calculations), they may even explicitly solve some adaptation tasks.

Compositional Adaptation at first glance does not seem appropriate in medicine, because it was originally developed for configuration. However, it has been successfully applied for calculating therapy doses (e.g. in TA3-IVF [17] and in our program).

Constraints are a promising adaptation technique too, but only for a specific situation, namely a set of solutions that can be reduced by checking, e.g., contraindications (in ICONS [13]) or contradictions (in GS.52 [14]).

Adaptation Rules. The only technique that seems to be general enough to solve many medical adaptation problems is the application of adaptation rules or operators. Unfortunately, while the technique is general, the content of such rules has to be domain-specific, and especially for more complex medical tasks the generation of adaptation rules is often too time-consuming and sometimes even impossible. However, for therapeutic tasks some typical forms of adaptation rules can be identified, namely for substitutional and compensational adaptation and for calculating doses.
References
[1] Aamodt, A., Plaza, E.: Case-based reasoning: Foundational issues, methodological variations, and system approaches. AICOM 7 (1994) 39-59
[2] Stottler, R., et al.: Rapid retrieval algorithms for case-based reasoning. In: 11th Int. Joint Conference on Artificial Intelligence, Morgan Kaufmann Publishers, San Mateo (1989) 233-237
[3] Broder, A.: Strategies for efficient incremental nearest neighbor search. Pattern Recognition 23 (1990) 171-178
[4] Quinlan, J.: C4.5, Programs for Machine Learning. Morgan Kaufmann Publishers, San Mateo (1993)
[5] Bergmann, R., Wilke, W.: Towards a new formal model of transformational adaptation in case-based reasoning. In: Gierl, L., Lenz, M. (eds.): 6th German Workshop on CBR, University of Rostock (1998) 43-52
[6] Fuchs, B., Mille, A.: A knowledge-level task model of adaptation in case-based reasoning. In: Althoff, K.-D., et al. (eds.): Case-Based Reasoning Research and Development, 3rd Int. Conference, Lecture Notes in Artificial Intelligence, Vol. 1650, Springer-Verlag, Berlin Heidelberg New York (1999) 118-131
[7] Gierl, L.: Klassifikationsverfahren in Expertensystemen für die Medizin. Mellen Univ. Press, Lewiston (1992)
[8] Perner, P.: Why Case-Based Reasoning is Attractive for Image Interpretation. In: Aha, D., Watson, I. (eds.): Case-Based Reasoning Research and Development, 4th Int. Conference, Lecture Notes in Artificial Intelligence, Vol. 2080, Springer-Verlag, Berlin Heidelberg New York (2001) 27-43
[9] Schmidt, R., et al.: Medical multiparametric time course prognoses applied to kidney function assessments. Int J Med Inform 53 (2-3) (1999) 253-264
[10] Dugas, M.: Clinical applications of Intranet-Technology. In: Dudeck, J., et al. (eds.): New Technologies in Hospital Information Systems. IOS Press, Amsterdam (1997) 115-118
[11] Montani, S., et al.: Diabetic patients management exploiting Case-based Reasoning techniques. Comput Methods Programs Biomed 62 (2000) 205-218
[12] Bichindaritz, I., et al.: Case-based reasoning in CARE-PARTNER: Gathering evidence for evidence-based medical practice. In: Smyth, B., Cunningham, P. (eds.): Advances in Case-Based Reasoning, 4th European Workshop, Lecture Notes in Artificial Intelligence, Vol. 1488, Springer-Verlag, Berlin Heidelberg New York (1998) 334-345
[13] Schmidt, R., Gierl, L.: Case-based Reasoning for Antibiotics Therapy Advice: An Investigation of Retrieval Algorithms and Prototypes. Artif Intell Med 23 (2) (2001) 171-186
[14] Gierl, L., Stengel-Rutkowski, S.: Integrating consultation and semi-automatic knowledge acquisition in a prototype-based architecture: Experiences with dysmorphic syndromes. Artif Intell Med 6 (1994) 29-49
[15] Petot, G.J., Marling, C., Sterling, L.: An artificial intelligence system for computer-assisted menu planning. Journal of the American Dietetic Association 98 (9) (1998) 1009-1014
[16] Wilke, W., Smyth, B., Cunningham, P.: Using Configuration Techniques for Adaptation. In: Lenz, M., et al. (eds.): Case-Based Reasoning Technology, From Foundations to Applications. Lecture Notes in Artificial Intelligence, Vol. 1400, Springer-Verlag, Berlin Heidelberg New York (1998) 139-168
[17] Jurisica, I., et al.: Case-based reasoning in IVF: prediction and knowledge mining. Artif Intell Med 12 (1998) 1-24
[18] Bichindaritz, I.: From cases to classes: Focusing on abstraction in case-based reasoning. In: Burkhard, H.-D., Lenz, M. (eds.): 4th German Workshop on CBR, Humboldt University Berlin (1996) 62-69
[19] Koton, P.: Reasoning about evidence in causal explanations. In: Kolodner, J. (ed.): First Workshop on CBR. Morgan Kaufmann, San Mateo (1988) 260-270
[20] Hampel, R.: Diagnostik und Therapie von Schilddrüsenfunktionsstörungen. UNI-MED, Bremen (2000)
[21] DeGroot, L.J.: Thyroid Physiology and Hypothyroidism. In: Besser, G.M., Turner, M. (eds.): Clinical Endocrinology. Wolfe, London (1994) (Chapter 15)
[22] Tversky, A.: Features of similarity. Psychological Review 84 (1977) 327-352
A Knowledge Management and Quality Model for R&D Organizations Guillermo Rodríguez-Ortiz Instituto de Investigaciones Eléctricas (IIE), Reforma 113 Edif. 27 1er Piso, Colonia Palmira, Cuernavaca, Morelos, México, 62490 [email protected]
Abstract. In this paper, a model that combines knowledge management and quality management approaches is presented. Essential concepts are provided about what must be considered in establishing a quality system for an institution, company or group that is dedicated to Research and Development (R&D) activities and that needs to fulfill the ISO9001:2000 standard to obtain certification. The paper describes how certain principles of knowledge management and the requirements of the ISO9001 standard can be aligned with the objectives of an R&D organization and what aspects should be considered in fulfilling the standard. The model and the comments provided will help an R&D organization to become ISO9001 certified with a minimum of additional effort compared with its operation without adhering to the standard.
1
Introduction
A Research and Development organization is any group or team of professionals that carries out R&D activities autonomously or inside some company or institution. The R&D activity can be carried out under various financing systems: for example, under contracts signed with external customers, internally with the objective of developing infrastructure or new products for the company, financed by the government within national development plans, or as research professors in higher educational institutions or universities. In an environment of competitiveness, productivity and quality, research is no longer an area where researchers have a wide margin of performance to exploit their creativity; delivery schedules and cost control are now factors that have to be considered for the success of a research project and the consequent customer or client satisfaction. In an R&D organization, projects apply and develop knowledge, and the key elements are speed and flexibility in a rapidly changing environment. At IIE (a thousand-employee R&D organization), where around 100 projects are executed annually, there is growing time pressure to make knowledgeable decisions.
Knowledge was required to implement quality processes and is required to improve and update them. Knowledge for R&D projects (processes) changes rapidly as a result of technological and scientific developments and of changing economic relationships [1]. Quality certification of Knowledge Management Systems (KMS) is not yet a topic extensively treated in the literature, unlike, for example, quality certification of knowledge-based systems [2] and software projects [7, 8]. We propose that some aspects of a KMS [10] be integrated as essential components of a Quality Management System (QMS) for R&D organizations. Our case study [9] is a QMS implemented as an Intranet system, which supports the processes in which knowledge is developed and distributed to those who need it. Knowledge is made accessible both for current use in the whole organization and for the future. Knowledge is included in the definition and establishment of processes related to the requirements of ISO9001:2000. Two kinds of processes were identified and implemented: business processes, which are the projects, and additional "quality" processes necessary to comply with the requirements of the standard that are not directly related to the products of the projects.
2
A Knowledge Management and Quality Model for R&D
In contrast with a factory of material goods, where production equipment generally exists, such as ovens that blend things, machines that manipulate objects (for example, they stretch them, wind them or cool them) and equipment that assembles or packs products, in a research group the equipment or machine that produces the product is the investigator's brain, the head of the knowledge worker. The computer, the support software, the books and other items are only tools or work instruments for the developer, just as the chisel and the hammer are utensils for a sculptor. 2.1
The ISO9000:2000 Standard
The requirements of the norm [3, 4] contain concepts that can be interpreted differently when a QMS has to be created for a group dedicated to R&D [5, 6] than when a QMS is elaborated for a factory that massively produces material goods. The accepted practice when carrying out R&D is to define a project. Roughly, the project is documented in sections that may contain, for example, the antecedents and motivations for the investigation, the objectives, the specification of the product, prospective benefits, work schedules, and the human resources and materials necessary for the realization of the project. The model presented in this paper complies with the standard and is described using an approach centered on the projects' life cycle, which is an appropriate language for an R&D group, company or institution.
2.2
Quality Processes
The model presented here is a summary of the QMS for IIE and indicates how the organization fulfills the standard during the daily operation of its processes. The model includes the processes of an organization dedicated to R&D, dividing them into two groups of activities related by the exchange of documents: a) the projects' life cycle activities, grouped by development phases or stages, and b) the support activities needed to complete compliance with the requirements of the norm. Table 1 presents the processes and the ISO9001 standard sections that are fulfilled when the processes are executed.
Table 1. QMS processes for an R&D organization
2.3
Project Life Cycle Processes
The typical project life cycle is composed of the following three phases: a) Planning and definition, b) Execution, c) Delivery and evaluation.
The objectives of the planning and definition phase of a project are to obtain the project proposal and the project quality plan, which lists all documents, procedures, information, appropriate measurement and work equipment, and other items needed so that the project activities can be executed. In a factory of material goods, procedures are typically written at a level of detail that is not possible to achieve when procedures are developed for a research team. The difference can be appreciated because procedures for factories contain all the knowledge necessary to build a product and hardly allow creativity; they are also usually adequate for mass production (many copies of the same product). Changes to these procedures are not made during their execution, but during quality revisions. The procedures for R&D projects are documents that explain general methods, methodologies, guides or manuals, sometimes elaborated by the team or found in textbooks, articles, manuals or other types of document. These documents provide the necessary "controlled conditions" for the development of a project and they have the characteristic of promoting creativity and the production of new knowledge, which is recorded in documents, code or prototypes. Additionally, the developers have, in their minds, previous non-written knowledge, experience, empirical rules, intuition, values and beliefs. 2.4
Management Support Processes
During management support activities, researchers participate by providing the documents and the collaboration necessary for the support activities to be carried out successfully. A quality policy and quality objectives must be issued for the organization. Also, quality objectives for each project are defined during the planning phase, where the requirements of the products are established. One peculiar aspect of an R&D organization is that the human resources that participate in the projects receive implicit or explicit training. Additionally, the project participants will learn the details of the domain of the application and will be trained on the supporting tools for the research. 2.5
The Knowledge Management and Quality Management System
Depending on the situation, an R&D organization has to provide additional knowledge and effort to comply with the standard requirements, as compared with the knowledge and effort needed for operation without adhering to the standard. In the case of IIE, this effort was minimal, since most procedures already existed for the projects' life cycle and some for the management support processes. To comply with the standard and obtain certification, IIE had to formalize only the following procedures: internal audits, management review, document and record control, improvement actions and control of nonconforming product (see Table 1), and to write the quality manual. Fig. 1 shows aspects of the entity-relationship model that represents the QMS knowledge. Even though there is no well-defined border between KM and QM processes, we can say that the management support processes compose the QM activities and the projects' life cycle processes compose the KM activities. The QM
maintains all documents for certification (a 40-page quality manual, 20 procedures, and records or evidence). Both projects and support processes use QMS documents to operate according to the standard. Audits and management reviews are support processes that implement improvement actions to correct or prevent non-conformities found in the QMS. The QMS helps to solve problems like: what evidence is used at IIE to comply with a specific standard requirement, how many nonconforming products were corrected during the last 3 years, how many clients were below satisfaction level 4, and how many QMS nonconformities were attended through improvement actions.
Fig. 1. The Model integrates Knowledge Management and Quality Management Systems
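As an illustration only (this sketch is not part of the IIE system description), a question such as how many nonconforming products were corrected during the last three years could be answered directly against the QMS records; the JDBC connection string and the table nonconforming_product(id, corrected_date) assumed below are hypothetical.

import java.sql.*;

// Minimal sketch, not the actual IIE implementation: counts the nonconforming
// products corrected during the last three years, assuming a hypothetical
// table nonconforming_product(id, corrected_date) reachable through JDBC.
public class QmsQueryExample {
    public static void main(String[] args) throws SQLException {
        String url = "jdbc:somedriver://qms-server/qmsdb"; // hypothetical data source
        try (Connection con = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = con.prepareStatement(
                     "SELECT COUNT(*) FROM nonconforming_product WHERE corrected_date >= ?")) {
            ps.setDate(1, Date.valueOf(java.time.LocalDate.now().minusYears(3)));
            try (ResultSet rs = ps.executeQuery()) {
                rs.next();
                System.out.println("Nonconforming products corrected in the last 3 years: " + rs.getInt(1));
            }
        }
    }
}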
The underlying KM activities support the development, selling, and application of ideas [11]. Actions undertaken by IIE are based on the alignment of its people, processes and technology with the business strategy, context and goals. The actual KM practices are sharing information from one project to another and documenting innovative ways of solving client problems. The actual mechanisms and processes in place for managing the acquisition, screening and selection of these practices at IIE are largely informal. The KM effort is classified under the following attributes: context; goals; strategy; culture; and technological, organizational and process infrastructure. Context. The driver is the market, since the main IIE clients are two of the most important oil and power industries in Mexico, CFE (Federal Commission of Electricity, the national electric utility) and PEMEX (the Mexican Oil Company). Both CFE and PEMEX are ISO9001 certified and usually ask their suppliers to be certified as well.
Goals. IIE has the overall purpose of carrying out projects of applied research and technology development, and its business objectives include satisfying its customers through products with a high content of knowledge, using the efficiency of engagement teams or the development of productive technology infrastructures.
Strategy. IIE captures information about past and present projects (identification, dates, leader, participants, client, and so forth) and the products delivered to its clients (including the proposal, the quality plan, and the project-specific products: systems, designs, prototypes, studies, diagnoses, courses and methodologies, among others), which is then made available to the researchers to share and reuse in other projects.
Culture. IIE follows a knowledge-friendly culture. It is the shared values, experiences and common goals that lead to a positive orientation to knowledge and remove inhibitors from employees, allowing the movement of knowledge from individuals to the organization.
Technological Infrastructure. This is aimed at supporting learning and at equipping researchers with all the knowledge required to successfully perform their engagements. IIE follows the accepted practice [12] in using Lotus Notes and Intranet technologies, supplemented by various specialized applications and utilities, such as a repository, implemented with database technologies, that records information used to support the execution of the projects (documents that explain general methods, methodologies, guides or manuals, standards, textbooks, articles, or other types of document). Additionally, project participants have access to a variety of information services to get knowledge. They can access more than 20 industrial and scientific databases, for example the Electric Power Database (EPRI), fluid engineering abstracts, chemical abstracts, the science citation index, US and world patent abstracts, the Engineering Index, the International Energy Agency Database, and electrical and electronics abstracts. They also have access to more than 100 journals and conference proceedings, and to a collection of more than 60,000 books.
Organizational Infrastructure. IIE created a small internal organization to establish, coordinate, and manage the technology and tools, and to facilitate the capture, development, and distribution of knowledge. This work is normally aimed at ensuring that common approaches are used and become institutionalized. The communities of practice are organized around industries, thematic areas, or functional expertise. New roles were not established, since the current organization is by thematic areas that lead the knowledge management activities and develop strategic approaches to knowledge; the "knowledge managers" are the actual area managers and the project leaders.
Process Infrastructure. To facilitate knowledge distribution and development, the information is made available for reuse to the researchers through the Intranet, but there is also a great deal of personal, face-to-face conversation. Almost all knowledge generation is developed within IIE, although some knowledge is acquired by hiring individuals who have it.
The KMS helps to solve problems like: what standard was used to diagnose the Tuxtla power transmission tower failure, and who participated in the Laguna Verde nuclear plant radiation analysis. The QMS has been in operation for more than three years, and IIE obtained ISO9001:2000 certification last year from an international organization.
3
Conclusions
In this paper, a model that combines quality management and knowledge management approaches has been presented from the perspective of research and development processes. The comments can help an R&D organization to become ISO9001 certified with a minimum of additional effort compared with its operation without adhering to the standard. Knowledge management plays a key role in a quality management system for an R&D organization. To ensure the effective and efficient development of new knowledge to solve the specified needs of a client, a new project looks for previous knowledge as well as knowledge about the state of the art. The model presented here is used at IIE by all its units, independently of their discipline (geothermal, nuclear, environmental, instrumentation and control, informatics, process supervision, simulation, transmission and distribution, network analysis, electrical and mechanical equipment, energy savings and turbo machinery).
References
[1] van der Spek, R., Spijkervet, A.: Knowledge Management: Dealing Intelligently with Knowledge. Knowledge Management Network, Kenniscentrum CIBIT and CSC, ISBN 90-75709-02-1 (1997)
[2] Vermesan, A.I.: Quality Assessment of Knowledge-Based Software: Some Certification Considerations. In: Proceedings of the 3rd International Software Engineering Standards Symposium (ISESS '97), IEEE (1997) 144-154
[3] ISO9001:2000 International Standard, Quality Management Systems - Requirements. Third edition 2000-12-15, International Organization for Standardization (ISO)
[4] Hoyle, D.: ISO 9000 Quality Systems Handbook: ISO 9000:2000 Version. 4th edition, Butterworth-Heinemann, 544 pages (2001)
[5] López, A.: Calidad en centros de investigación y desarrollo. Quinto congreso internacional de la Asociación Mexicana de Calidad A.C. (1997)
[6] Martínez, A.: La gestión de la calidad en el departamento de investigación y desarrollo. Alta Dirección, Num. 186 (1996)
[7] Xenos, M.N.: Addressing Quality Issues: Theory and Practice, A Case Study on a Typical Software Project. In: Proceedings of SCI2001, 5th World Multi-Conference on Systemics, Cybernetics and Informatics, Orlando, Florida, USA, Vol. I (2001) 556-560
[8] Stavrinoudis, D., et al.: Measuring User's Perception and Opinion of Software Quality. In: Proceedings of the 6th European Conference on Software Quality, European Organization for Quality - Software Committee (EOQ-SC), Vienna (1998)
[9] Yin, R.K.: Case Study Research: Design and Methods. 2nd ed., Applied Social Research Methods Series, Vol. 5, Sage Publications, Thousand Oaks, CA (1994)
[10] Quinn, Davenport and Alavi: Creating a System to Manage Knowledge. Harvard Business Review and Case Collection, product ID# 39103
[11] Apostolou, D., Mentzas, G.: Managing Corporate Knowledge: A Comparative Analysis of Experiences in Consulting Firms. Second International Conference on Practical Aspects of Knowledge Management, 29-30 October 1998, Basel, Switzerland
[12] Stenmark, D.: Information vs. Knowledge: The Role of Intranets in Knowledge Management. In: Proceedings of the 35th Hawaii International Conference on System Sciences (2002)
Knowledge Management Systems Development: A Roadmap Javier Andrade1, Juan Ares1, Rafael García1, Santiago Rodríguez1, Andrés Silva2, and Sonia Suárez1
1 Facultad de Informática, Universidad de A Coruña, Campus de Elviña s/n, 15071, A Coruña, Spain {jag,juanar,rafael,santi,ssuarez}@udc.es
2 Facultad de Informática, Universidad Politécnica de Madrid, Campus de Montegancedo s/n, 28660, Boadilla del Monte, Madrid, Spain [email protected]
Abstract. This paper approaches the study of Knowledge Management Systems, focusing not only on the establishment of essential development activities, but also on techniques, technologies, and tools for their support. Despite the wide range of existing proposals for the development of this type of system, none of them has achieved a level of detail sufficient to allow direct application. This study is intended to be a palliative for the above-mentioned lack of detail by means of a development guide for Knowledge Management Systems. In this way, the proposed solution offers a clear definition of what has to be done and which type of mechanisms should be used for its development.
1
Introduction
One of the principles that has become broadly accepted is that an important share of the real value of an organisation lies in its own knowledge, in particular the knowledge that comes from experience. However, in most cases this valuable asset is used below its capability; consequently, the full range of potential advantages that could be obtained is seldom achieved. This situation has led to a new discipline, known as Knowledge Management (KM), whose aim is the collection and optimal application of all the knowledge and experience that lies in every organisation. The ultimate goal, for the sake of maximum competitiveness, should be the accurate supply of relevant knowledge, not only to whoever might request it, but also with the required level of detail and whenever it is needed. On the other hand, despite the current rise of KM, none of the already existing approaches has outlined so far the sequence for the implantation of a Knowledge Management System (KMS) within an organisation, due to the fact that all of them suggest different options regarding its development, but with a noticeable lack of detail that does not allow their direct application [1].
Bearing in mind the aforesaid, and after an intensive study of the already existing proposals, this paper intends to define an approach that encloses the most relevant existing mechanisms for KMS development, in order to simplify this procedure and to provide the level of detail required for its implantation.
2
Proposed Approach
The analysis of the main proposals in KM has allowed the establishment of the key phases for KMS development. Setting aside Knowledge-Based System approaches (e.g. MIKE or KADS), in line with the conclusions of [2], one of the most remarkable findings was the existence of a broad agreement concerning the phases that should be considered: starting from a given piece of relevant knowledge, it should first be extracted and acquired from its sources in order to be later assimilated by the organisation and, eventually, used in the best terms for organisational competitiveness by means of the creation of a collaboration framework. Once these phases have been identified, the following stage should approach the classification of the existing mechanisms (techniques, technologies and tools) regarding their specific phase suitability. The main difficulty at this point lies in the multiple capabilities of some of the mechanisms for different phases, whose best-performing single-phase aptitude had to be refined first. 2.1
Knowledge Acquisition and Extraction
This stage consists of the collection of relevant knowledge emanating from both human and non-human sources. Regarding the latter, it is worth noting that their explicit information can be extracted by means of already existing technological approaches based on different techniques (indexation, statistical analysis, etc.). As far as human sources are concerned, due to their individual nature, their related knowledge is best captured through non-technological techniques based on dialogue and behavioural observation. The first step in this stage, prior to focusing on the knowledge itself, should be the accurate identification of the type of knowledge that is relevant for the organisation. The authors propose to carry out this task at two different levels of abstraction, the strategic level and the thematic areas level, as detailed in Table 1. Knowledge Acquisition at the Strategic Level. By means of a global analysis of the organisation, this level intends to identify the key aspects of the business (areas, procedures and/or functions) on which the KMS should focus. This would be carried out simply through the mere observation of the way in which tasks are performed within the organisation, together with the assessment of the different beliefs, disagreements or contributions expressed in interviews. For obvious reasons, this level has to be approached not with technological means but with methods based on the application of dialogue or behavioural observation techniques, such as SWOT analysis and bottleneck analysis [3]. By means of the former, weak as well as strong organisation features can be identified, which allows an overall picture of it. As far as the latter is concerned, bottleneck analysis identifies the circumstances affecting some processes that could keep them from developing their full potential.
Table 1. Main knowledge acquisition mechanisms
Strategic level: SWOT Analysis; Bottlenecks Analysis
Thematic areas level, non-human sources: KDD; Text Mining; Web Mining
Thematic areas level, human sources: Critical Knowledge Functions (CKFs); Benchmarking; Task Environment Analysis and Modelling (TEAM); Knowledge Use and Requirements Analysis (KURA); Knowledge Scripting and Profiling (KS&P); Knowledge Flows Analysis (KFA); Group Sessions; Observation-Based Analysis; Verbal Protocol Analysis (VPA); Questionnaire-Based Knowledge Surveys (QBKSs)
Knowledge Acquisition at the Thematic Areas Level. The business aspects previously identified at the strategic level are known as thematic areas, from which knowledge is obtained by means of different types of mechanisms depending on the type of knowledge source (Table 1):
• Human sources. These reflect the knowledge of an individual, which can be extracted by means of interviews, opinion surveys and behavioural observation.
• Non-human sources. Although both technological and manual techniques could be used to obtain knowledge from these sources, the former are preferred due to their simplicity; therefore only tools and technologies are quoted in this paper. There are specific tools depending on the degree of source structuring, as shown in Table 2. This degree varies from an absolute absence of a stiff structure (non-structured) to a high level of structuring (structured), with a middle stage in between (semi-structured).
2.2
Knowledge Assimilation
The collection of knowledge that the organisation has identified as relevant is the starting point of a trail whose next step involves the assimilation of that acquisition by the organisation itself. For this purpose, the knowledge has to be developed into a concept that later has to be properly embodied, tasks known as conceptualisation and representation. The former is purely an intellectual job carried out by experts in order to achieve a global understanding of the knowledge obtained from the acquisition and extraction stage. Regarding the latter, some of the most remarkable techniques used for this purpose are the so-called Knowledge Maps or Mental Maps [4], which work as ordinary maps do, guiding the members of the organisation towards specialised knowledge whenever they need it. Mind Manager (www.mindman.com) stands out clearly among the existing tools for the design of this type of map.
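As a minimal illustration of the idea only (it does not describe Mind Manager or any other specific tool), a knowledge map can be reduced to a structure that points from topics to the people and documents holding the specialised knowledge; all names in the sketch are invented.

import java.util.*;

// Toy knowledge map: topics point to knowledge sources (experts, documents),
// guiding members towards specialised knowledge. Purely illustrative.
public class KnowledgeMapSketch {
    private final Map<String, List<String>> topicToSources = new HashMap<>();

    public void link(String topic, String source) {
        topicToSources.computeIfAbsent(topic, t -> new ArrayList<>()).add(source);
    }

    public List<String> lookup(String topic) {
        return topicToSources.getOrDefault(topic, Collections.emptyList());
    }

    public static void main(String[] args) {
        KnowledgeMapSketch map = new KnowledgeMapSketch();
        map.link("web mining", "expert: A. Gomez");          // invented entries
        map.link("web mining", "doc: web-mining-guide.pdf");
        System.out.println(map.lookup("web mining"));
    }
}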
Table 2. Main tools for structured, semi-structured and non-structured sources
Structured sources, KDD / Data Warehouse: Extract Suite (www.eti.com/products/ext_intro.htm); Enterprise/Integrator (www.dillonweb.com/apertus3); IBM Visual WareHouse (www3.ibm.com/software/data/vw); WareHouse Studio (sybase.com/products/bi/industrywarehousestudio)
Structured sources, KDD / Data Mining: IBM Intelligent Miner (www.ibm.com/software/data/iminer); Clementine (www.spss.com/spssbi/clementine/index.htm); WEKA (www.cs.waikato.ac.nz/ml/weka); Oracle Data Mining Suite (www.oracle.com/ip/analyze/warehouse/datamining)
Semi-structured sources, Web Mining: WebHound (www.sas.com/products/webhound/index.html); WUM (wum.wiwi.hu-berlin.de); NetTracker (www.sane.com/products/NetTracker); Clementine (www.spss.com/spssbi/clementine/index.htm); WEKA (www.cs.waikato.ac.nz/ml/weka)
Non-structured sources, Text Mining: TextAnalyst (www.megaputer.com/products/ta/index.php3); SemioMap (www.semio.com/products); Document Explorer [5]; Autonomy (www.globalsoftware.com.ar/autonomy.htm)
2.3
Creation of a Collaboration Framework
Once relevant knowledge has been not only identified and conceptualised, but also represented, the next task to perform has to be the creation of the basic structures that make the corporate knowledge available to the members. As this task has to bear in mind not only newly incorporated knowledge, but also the optimal use of the existing knowledge, it implies a four-step procedure, as described next.
Knowledge Incorporation and Storage. Organisations are not static structures; they are in continuous evolution, acquiring new knowledge and refining the already existing knowledge. This dynamic aspect has to be borne in mind when determining the best way to introduce and integrate the knowledge. This incorporation can be active or passive, depending on who is responsible for monitoring the quality of the new knowledge. In the active mode, there is a specific KM group that performs this task, whereas in the passive mode the organisation may have many potential testers; in fact, any member who may want to share his/her knowledge and experience
would test individually whether it fulfils the minimum requirements of quality and relevance. Incorporated knowledge must be physically stored through Corporate Memories [6]. Nowadays, the most widely used implementation technology is the database, the multimedia type being particularly significant.
Notification of New Knowledge Incorporation. The storage of knowledge does not by itself imply a real benefit; nevertheless, a proper internal update regarding new incorporations allows each member of the organisation to refine their individual knowledge according to the new trends and advances. As a step prior to notification, it should be defined who should be addressed according to the specific type of newly incorporated knowledge. In this regard, there are two alternatives: by subscription (a private update concerning specific topics selected by every member individually) and by divulgation (notification that is not preceded by any expressed wish to be updated). One of the media most widely used for notification, by subscription as well as by divulgation, is e-mail [7] (a schematic sketch of the subscription mechanism is given at the end of Sect. 2.3).
Knowledge Location within the Organisation. Among the key points for a successful KMS implantation there is an essential one, which consists of providing the organisation with accurate search and retrieval mechanisms for existing and incorporated knowledge. Some of the most relevant tools for this purpose are quoted next [8]: hierarchical tools (e.g. Yahoo, www.yahoo.com); attribute-based tools (e.g. Verity, www.verity.com, or Amazon, www.amazon.com); content-based tools (e.g. Google, www.google.com); meta-search tools (e.g. Metacrawler, www.metacrawler.com); combination tools, such as the search engine Excite (www.excite.com), whose work is based on a meta-search focused on the content; and finally, intelligent tools, most notably Excalibur Retrieval Ware (www.convera.com).
Support and Communication Systems. One of the main requirements of a KMS is the existence of means of communication and collaboration among the users. Two of the technologies available nowadays are remarkable for this purpose. One of them is e-mail, whose main advantage consists of its asynchronous connection mode. The second technology alludes to shared applications, which allow a group of users to interact simultaneously with one or more running programmes. Some applications of this type are: whiteboards and screen sharing; workflow systems, as an excellent method of following up the work of geographically distant people; and on-line (e.g. chat) as well as off-line (e.g. forums) conference systems. Table 3 shows the main existing tools linked to each of the already mentioned technologies.
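The subscription alternative mentioned above can be pictured as a simple topic-based notification scheme. The sketch below is only a schematic assumption (in a real KMS the notification medium would typically be e-mail rather than console output), and every name in it is invented.

import java.util.*;

// Schematic subscription-based notification: members subscribe to topics and are
// notified when knowledge on those topics is incorporated. Illustrative only.
public class SubscriptionNotifier {
    private final Map<String, Set<String>> subscribersByTopic = new HashMap<>();

    public void subscribe(String member, String topic) {
        subscribersByTopic.computeIfAbsent(topic, t -> new HashSet<>()).add(member);
    }

    public void knowledgeIncorporated(String topic, String itemDescription) {
        for (String member : subscribersByTopic.getOrDefault(topic, Collections.emptySet())) {
            System.out.println("notify " + member + ": new item on '" + topic + "': " + itemDescription);
        }
    }

    public static void main(String[] args) {
        SubscriptionNotifier notifier = new SubscriptionNotifier();
        notifier.subscribe("ana@example.org", "text mining"); // invented subscriber
        notifier.knowledgeIncorporated("text mining", "report on document clustering");
    }
}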
2.4
Technologies for Creating a Collaboration Framework
Having described the main technologies and tools for the creation of a collaboration framework, it is important to remark that there are technological answers covering the entire stage, suggesting solutions for each activity. Three of these technologies are considered the most remarkable: groupware-based technology, web-based technology (Internet, intranet or extranet), and mixed technology.
Table 3. Main communication and support systems
E-mail: Outlook (www.microsoft.com/office/outlook); Netscape E-mail (channels.netscape.com/ns/browsers/default.jsp); Eudora (www.eudora.com/email)
Shared applications: MBONE (www.acm.org/crossroads/xrds2-1/mbone.html)
Workflow: Ultimus Workflow (www.ultimus1.com); BizFlow (www.handysoft.com/products/products.asp); Lotus Workflow (www.lotus.com/home.nsf/welcome/domworkflow)
Conference systems, on-line, Chat: ICQ (www.icq.com); Sametime (www.lotus.com/home.nsf/welcome/sametime); Quickplace (www.lotus.com/home.nsf/welcome/quickplace); Netmeeting (www.microsoft.com/Windows/NetMeeting)
Conference systems, on-line, Audio/video conference: NetMeeting (www.microsoft.com/Windows/NetMeeting); CuseeMe Pro (www.cuseeme.com); ICUII (www.icuii.com); Sametime (www.lotus.com/home.nsf/welcome/sametime)
Conference systems, on-line, Document conference: XO Document Conferencing (www.xo.com); CentraOne (www.centra.com/products/centraone.asp)
Conference systems, off-line, Forum: MATRIX (www.foruminc.com); Lundeen Web Crossing (www.webx.lundeen.com)
Conference systems, Media spaces: MEDIASPACE (www-ihm.lri.fr/~roussel/publications); CRUISER [9]
Groupware is defined as an integrated set of technologies that facilitate collaboration among users. The functions provided by this kind of platform include capabilities for communication and collaboration, product management and follow-up throughout its phases, exchange of ideas, work synchronisation and the registering of the group's collective memory. The most complete and popular groupware tool is Lotus Notes (www.lotus.com), which provides a proprietary architecture with several functions, the database (the reply function being the most significant), workflow and e-mail functions being the foremost ones. Other relevant tools that belong to this group are Teamware (www.teamware.com), Novell GroupWise (www.novell.com/products/groupwise), and Microsoft Exchange (www.microsoft.com/exchange). Web technology gives users the chance of being connected from and to anywhere, together with making available the content of any document, no matter the type of format, operating system or communication protocol. Internet, intranet and extranet
are the network types included in this technology. Any of these could be used depending on the organisation's requirements, but regarding KMS the intranet is preferred, since it provides the same facilities as the Internet but at an organisational level. Still, this does not exclude the use of an extranet for corporative KM purposes, allowing external partners access to the organisational network as well as the exploration of knowledge resources from allied companies. Finally, mixed technology provides the possibilities of both groupware and web technology. A good example of it, despite its actual competition with Microsoft Exchange 2000 (www.microsoft.com/exchange), is the Lotus Notes Domino technology (www.lotus.com/domino), which represents Lotus's integration effort to achieve most of the advantages of both platforms, Lotus Notes and web technology. This technology integrates a Domino web server, which allows the creation of knowledge in Notes and its later web publication.
3
Conclusions
Nowadays there is a great deal of confusion when dealing with KMS development. This situation is the consequence of the multiple existing approaches for developing this type of system which, however, do not describe properly how the task should be performed. For this reason, the present paper introduces an approach that intends to fill this gap by refining the already existing ones. This approach consists of three main stages: knowledge acquisition and extraction, knowledge assimilation, and creation of the collaboration framework. The goal of this proposal is to provide the organisation's KMS developer staff with an accurate work guide that points precisely to the currently existing techniques, technologies and tools, in order to support the implantation.
References
[1] Rubenstein-Montano, B., Liebowitz, J., Buchwalter, J., McCaw, D., Newman, B., Rebeck, K.: A Systems Thinking Framework of Knowledge Management. Decision Support Systems 31 (2001) 5-16
[2] Daniel, M., Decker, S., Domanetzki, A., Heimbrodt-Habermann, E., Höhn, F., Hoffmann, A., Röstel, H., Studer, R., Wegner, R.: ERBUS - Towards a Knowledge Management System for Designers. In: Proc. of the Knowledge Management Workshop at the 21st Annual German AI Conference, Freiburg, Germany (1997)
[3] Wiig, K., de Hoog, R., Van der Spek, R.: Supporting Knowledge Management: A Selection of Methods and Techniques. Expert Systems with Applications, Vol. 13, No. 1 (1997) 15-27
[4] Davenport, T.H., Prusak, L.: Working Knowledge: How Organizations Manage What They Know. Harvard Business School Press, Boston (2000)
[5] Feldman, R., Kloesgen, W., Zilberstein, A.: Document Explorer: Discovering Knowledge in Document Collections. In: Ras, Z.W., Skowron, A. (eds.): Proc. of the 10th International Symposium on Methodologies for Intelligent Systems, North Carolina (1997)
[6] Van Heijst, G., Van der Spek, R., Kruizinga, E.: Corporate Memories as a Tool for Knowledge Management. Expert Systems with Applications, Vol. 13, No. 1 (1997) 41-54
[7] Saadoun, M.: El proyecto Groupware. Ediciones Gestión 2000 S.A., Barcelona (1997)
[8] Tiwana, A.: The Knowledge Management Toolkit. Prentice Hall PTR, New Jersey (2000)
[9] Robinson, M.: Computer Supported Cooperative Work: Cases and Concepts. In: Baecker, R.M., Morgan Kaufmann Publishers Inc., San Francisco (1993) 29-49
An Extensible Environment for Expert System Development Daniel Pop and Viorel Negru Department of Computer Science, University of the West from Timişoara 4 V. Pârvan Street, RO-1900 Timişoara, Romania {popica,vnegru}@info.uvt.ro
Abstract. The speed-up of human professional work through the use of expert systems has been of up to two orders of magnitude, with resulting increases in human productivity and financial returns. The last decade has shown that a growing number of organizations are shifting their informational systems towards a knowledge-based approach. This fact generates the need for new tools and environments that intelligently port legacy systems into modern, extensible and scalable knowledge-integrated systems. This paper presents an extensible, user-friendly environment for expert system development, which supports the integration of high-level knowledge "beans" into host projects, data integration from conventional database systems, and system verification, debugging and profiling. Keywords: Expert systems, systems validation, knowledge management
1
Introduction
The use of expert systems in various disciplines has brought an increase in human productivity, financial benefits and a better answer to users' needs. The re-engineering of old, existing information systems and their transformation into modern, extensible, scalable, viable systems is a complex and tedious process involving significant costs and resources. The Expert System Creator suite was created as a joint project between academic and commercial partners. Its central goal is to help professionals in the process of shifting from old implementations to modern approaches based on the latest technologies. It assists the human designer by efficiently encoding expert knowledge and by reusing the available systems. Expert System Creator is a development and integration environment for knowledge management, expert system construction and validation, and database integration. It merges conventional CASE tool facilities with expert system technology. A similar approach is represented by the family of software CREATOR expert systems that has been developed [4]. Although one application of these systems is assisting the human designer when using a conventional CASE tool, they do not support the
translation between different knowledge representation forms nor the debugging, verification and profiling phases, in contrast to the Expert System Creator suite. Another powerful knowledge management tool is the Protégé-2000 project from Stanford Medical Informatics [10], which allows the construction of a domain ontology, the customization of knowledge-acquisition forms and the entering of domain knowledge. Moreover, Expert System Creator not only constructs the knowledge base but also integrates it within external projects. People involved in the different phases of knowledge-based system construction, or in the re-engineering of existing informational systems, have different computing skills and carry out different tasks. Therefore, a viable environment must provide appropriate tools for different categories of users, ranging from domain experts to software designers and programmers. Expert System Creator is a system for the development, verification and debugging, profiling and optimization of expert system applications. Its main facilities include:
• representation of domain knowledge using rules sets, decision tables or classification trees;
• support for the system design phase using active graphical widgets;
• computer-aided system verification;
• automatic code generation for declarative, functional or object-oriented programming languages;
• integration with external systems;
• integration with relational database management systems;
• visual debugging and tracing of expert systems using the high-level representation (decision tables, classification trees).
The next section presents how Expert System Creator helps domain experts and software developers port their old informational systems to knowledge-based approaches. The third section outlines techniques and tools for system validation, verification, debugging and profiling. The last section presents some final remarks, as well as the foreseen extensions.
2
Reengineering and Building Phase
Re-engineering is the first phase in implementing a knowledge-based system from an existing informational system. In order to import the legacy code, a Dictionary Manager module was introduced in Expert System Creator's architecture. A dictionary is a collection of external data, user-defined data types and procedures imported from external projects, i.e.: classes, structures, interfaces, type definitions, enumerations, variables, constants and class instances, functions and procedures. The Dictionary Manager module imports C, C++, Java and CLIPS/JESS definition files. New parsers and extensions can be easily added as plug-ins. All imported elements can be used in constructing the higher-level decision objects. The domain experts will operate with graphical representations of different knowledge forms, regardless of their implementation details, such as programming paradigm or language.
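The plug-in mechanism can be pictured as a small parser contract that each language front end implements and registers with the Dictionary Manager. The interface below is a hypothetical sketch with invented names, not the actual Expert System Creator API.

import java.io.File;
import java.util.List;

// Hypothetical parser plug-in contract for the Dictionary Manager; each supported
// language (C, C++, Java, CLIPS/JESS) would provide one implementation.
interface DictionaryParser {
    boolean accepts(File definitionFile);                 // e.g. decided by file extension
    List<DictionaryElement> parse(File definitionFile);   // imported types, variables, procedures...
}

// Minimal element record used by the sketch above.
class DictionaryElement {
    final String name;
    final String kind; // "class", "function", "variable", ...
    DictionaryElement(String name, String kind) { this.name = name; this.kind = kind; }
}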
2.1
Extensions of the Standard Knowledge Forms
For the representation of knowledge in expert systems, a number of forms are used, such as: rules sets (production rules, association rules, rules with exceptions), decision tables, classification and regression trees, instance-based representations, and clusters. Each representation has its advantages and drawbacks. Expert System Creator is endowed with advanced graphical tools for three of the above forms: decision table, classification tree and rules set. As these three forms are equivalent [9], an expert system built in one of them can be translated into any other one. In order to overcome the limitations of the "standard" decision table and tree representations, we have introduced several extensions to the basic models: pre-actions, intermediate actions and scoring. These will be presented in this section.
The Decision Frame Designer handles rule-based expert systems. The best-known shells for rule-based expert systems are CLIPS (C Language Integrated Production System) [2] and JESS (Java Expert System Shell) [5]. Both shells are integrated in the Decision Frame Designer and users can switch from one to the other at any moment. The Decision Frame Designer [9] is a project-based, user-friendly application for the development, verification, debugging and profiling of rule-based systems. A rule-based expert system can be integrated in C++/Java host projects. The calling code is automatically generated by the Decision Frame Designer.
A decision table consists of a two-dimensional array of cells, where the columns contain the system's constraints and each row makes a classification according to each cell's value (case of condition). We propose an extended model for the decision table. Each condition has two new concepts associated with it: pre-actions and a score. A pre-action denotes a specific action that is executed before the condition is evaluated. Each table condition (column) has an associated list of pre-actions that is executed prior to the condition's evaluation. A pre-action may be represented by variable initialization, user input/output handling, etc. The Decision Table Designer lets users construct a decision table object in a visual way.
A classification tree consists of a set of nodes and a set of arcs [11]. The set of nodes is divided into three classes: decision nodes, intermediate-action nodes and classification nodes (leaf nodes). Each decision node is associated with a constraint of the system and each leaf node (classification node) makes the classification based on the cases of the constraints from the decision nodes. Each arc has a decision node as its source, is associated with a case corresponding to the constraint from the source decision node, and has a decision or classification node as its destination. The standard decision tree model was extended with a new type of node, intermediate-action nodes, which let users specify actions to be executed before the decision node's evaluation. The semantics of intermediate actions is similar to that of the decision table's pre-actions. The Decision Tree Designer offers the possibility to develop classification trees, providing a structured designer interface that lets users group the tree's nodes at each level, thus saving design space.
To support inexact reasoning, the decision table/tree was enhanced with scores. A score, represented by a real number between 0 and 1, is attached to each condition of a decision table/tree. The score is a measure of the user's confidence in the specific condition.
The final rule's score (confidence) is obtained by combining the conditions' scores. Various combination functions can be implemented, but for now, experiments using cumulative and multiplicative functions have been made.
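A compact way to picture the extended condition cells and the score combination is sketched below; the class names and the exact form of the cumulative function are assumptions made for illustration, not the actual Expert System Creator code.

import java.util.List;
import java.util.function.BooleanSupplier;

// Illustrative extended condition: a pre-action runs before the condition is
// evaluated, and a score in [0,1] expresses the user's confidence in it.
class ScoredCondition {
    final Runnable preAction;            // e.g. variable initialization, user I/O handling
    final BooleanSupplier condition;
    final double score;                  // confidence in [0,1]

    ScoredCondition(Runnable preAction, BooleanSupplier condition, double score) {
        this.preAction = preAction;
        this.condition = condition;
        this.score = score;
    }

    boolean evaluate() {
        preAction.run();
        return condition.getAsBoolean();
    }
}

class ScoreCombiner {
    // Multiplicative combination: the product of the condition scores.
    static double multiplicative(List<Double> scores) {
        double s = 1.0;
        for (double x : scores) s *= x;
        return s;
    }
    // One possible cumulative combination (an assumption): s = s + x * (1 - s),
    // so each additional matching condition increases the overall confidence.
    static double cumulative(List<Double> scores) {
        double s = 0.0;
        for (double x : scores) s = s + x * (1.0 - s);
        return s;
    }
}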
2.2
Database Integration
A large number of legacy systems rely on relational database systems. To access and transform the available data of these systems, Expert System Creator offers direct database access for all three forms: rules set, decision table and classification tree. In the case of decision tables and classification trees, users can use their preferred database access libraries by importing them into the dictionary component. Using database access functions from within the constructed decision table or classification tree is straightforward and requires no specific handling. In the case of rules sets (or decision frames), the problem of reasoning on facts residing in conventional relational database systems requires more attention. A major objective of database integration is to provide independence from both the inference engine and the DBMS. The Decision Frame DataBase [9] is an independent subsystem that acts as a communication channel between one or more database systems and a rule-based expert system.
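The independence requirement can be summarised by a narrow channel interface that hides both sides; the sketch below is only an assumption about the shape of such an interface, with invented names, and is not the actual Decision Frame DataBase API.

import java.util.List;
import java.util.Map;

// Hypothetical channel between relational data and a rule-based system: rows are
// fetched from some DBMS and handed to the inference engine as facts, without
// either side depending on the other's concrete API.
interface FactChannel {
    List<Map<String, Object>> fetchRows(String query);                      // DBMS side
    void assertAsFacts(String template, List<Map<String, Object>> rows);    // engine side
}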
3
Verification, Validation and Debugging Phases
Knowledge base completeness and correctness are key issues in designing large knowledge-based systems. Expert System Creator includes appropriate mechanisms for visualizing and testing the correctness of the constructed knowledge bases and for debugging the execution of expert systems. The debuggers integrated in the Decision Frame/Table/Tree Designers can be used to debug the final system in its original form (as a rules set, table or tree), instead of "classical" C++/Java/Jess debugging. This new approach helps domain users and software developers to visually test and repair the constructed expert system. 3.1
Rules Set
For rule-based systems, the semantic graph (SG) highlights the relationships between rules and templates. The semantic graph is a pair
SG = (N, A)    (1)
where N, the set of nodes, is given by the system's rules and A is the set of arcs. An arc with source N1 and target N2 is defined if the consequent of rule N1 asserts a fact that appears in the antecedent of rule N2. The visual representation of this graph reveals main or isolated rules and templates. The graphical representation is a better approach than computing various numeric metrics that measure the quality of the system's design; for users who prefer numbers, a set of base metrics is also computed. Despite CLIPS's age, there are no integrated development environments offering "standard" debugging techniques for it, such as step-by-step execution or breakpoint management. The Decision Frame Debugger implements these debugging techniques by means of an easy-to-use, visual user interface. The Decision Frame Debugger includes the following features: rule-by-rule execution, breakpoint
management, stepping into a rule's RHS actions (procedural debugging), variable inspection, and display of the fact and agenda memories. It supports both the CLIPS and JESS inference engines. In debugging mode, the system is automatically executed in a rule-by-rule manner, stopping on each breakpoint. When the CLIPS inference engine is used, Java Native Interface (JNI) technology is used for bridging between the Decision Frame Debugger (Java environment) and the CLIPS engine (native environment). The communication between the Debugger and the inference engine is described by a set of general interfaces. For the moment, CLIPS and JESS interface instances are implemented, but more inference engines can easily be plugged in. In order to find the time-consuming rules, the rule-based profiler traces the system's execution. The system records the execution context in trace files, which are visualized using the Trace Viewer. In the case of a rules-set system, the execution context is formed by the antecedent and the consequent of the executed rule.
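The general interfaces between the debugger and the inference engines can be pictured as a small bridge contract, implemented once for CLIPS (through JNI) and once for JESS. The interface below is a guess at its shape with invented names, not the actual Expert System Creator interface.

// Hypothetical bridge between the Decision Frame Debugger and an inference engine;
// a CLIPS implementation would delegate to native code via JNI, a JESS one to the
// JESS Java API. Method names are illustrative only.
interface EngineBridge {
    void loadRules(String rulesText);
    void fireOneRule();                // rule-by-rule execution
    String[] listFacts();              // contents of the fact memory
    String[] listAgenda();             // contents of the agenda
    void setBreakpoint(String ruleName);
    boolean breakpointHit();
}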
3.2
Decision Table
The automatic check of the correctness and completeness of the decision table is carried out by the Table Analyzer tool (embedded in the Decision Table Designer), which highlights the duplicated and ambiguous rules that exist in the table. To measure the completeness of a decision table, the completion ratio (CR) is computed as follows:
CR = [Possible Rules] / [Actual Rules]    (2)
where [Possible Rules] is the number of possible rules and [Actual Rules] is the number of rules of the decision table. The number of possible rules is computed as the product of the cardinalities of all attribute domains.
The Decision Table Debugger lets you debug the system as a decision table. You can set breakpoints on the table's cells that will stop the execution of the system. While the execution is paused, you can inspect the variables' status. To stop the system's execution on a breakpoint, the Code Generator module generates additional Java/C++ code for each table cell. Before the execution of a Java/C++ statement in the "host" project, the Decision Table Debugger is interrogated as to whether or not a breakpoint has been hit. If a breakpoint is hit, the execution control is passed to the Expert System Creator thread and the current values of all watched variables are updated. When the user continues the program execution (from the Decision Table Debugger), control is regained by the host project thread and the watched variables' values (possibly modified by the user during the debugging session in the Decision Table Debugger) are sent back to the host project.
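As a small numerical illustration (not a figure taken from the paper), consider a decision table with three conditions whose attribute domains have 2, 3 and 2 cases respectively: the number of possible rules is 2 x 3 x 2 = 12, so a table that actually contains 6 rules has CR = 12 / 6 = 2, while a table covering every possible rule would have CR = 1.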
3.3
Classification Tree
To better highlight the rules induced by a classification tree, the Decision Tree Designer displays all the rules encapsulated in the tree in a distinct panel. It also offers statistics regarding the number of nodes in the tree, the number of nodes in each category (decision nodes, action nodes, leaf nodes), the number of induced rules, the number of incomplete decision nodes, etc. The Decision Tree Debugger module offers the possibility to debug the expert system in its classification tree form. Similarly to the decision table debugging, the
Code Generator module generates additional code for each tree node. In order to find the time-consuming rules, the Code Generator module optionally generates additional code for tracing the host program execution. During a "traced" execution of the host program, a trace file is created. It contains the execution context for each visited node. The execution context is composed of the timestamp and the variables' values. The trace file is visualized by the Trace Viewer module, which highlights the "bottleneck" nodes.
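The execution context written for each visited node can be illustrated with a very small trace-record writer; the record layout and file name below are invented for the example and do not reflect the actual trace-file format.

import java.io.FileWriter;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Illustrative trace writer: for each visited node, the generated tracing code could
// append a timestamp plus the watched variables' values. Not the real trace format.
public class TraceWriterSketch {
    private final FileWriter out;

    public TraceWriterSketch(String traceFile) throws IOException {
        this.out = new FileWriter(traceFile, true);
    }

    public void visited(String nodeId, Map<String, Object> variables) throws IOException {
        out.write(System.currentTimeMillis() + ";" + nodeId + ";" + variables + "\n");
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        TraceWriterSketch trace = new TraceWriterSketch("host-program.trace"); // invented file name
        Map<String, Object> watched = new HashMap<>();
        watched.put("pressure", 3.4);
        watched.put("valveOpen", true);
        trace.visited("node-12", watched);
    }
}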
4
Final Remarks and Future Extensions
Expert System Creator is a portable development environment that significantly cuts the cost of porting legacy systems to new technologies. The system is entirely implemented in Java using the Java2™ SDK 1.3. The Decision Frame module works together with the CLIPS [2][12] or JESS [5] expert system shells, which perform the knowledge-based reasoning process. The Code Generator outputs C/C++ and Java code for decision tables and trees, whilst the rule-based systems are generated using CLIPS/JESS syntax. Two main directions are foreseen for further development: enhanced knowledge representation and support for automatic project documentation generation. The first direction will be supported by several enhancements in Expert System Creator, such as the integration of fuzzy logic engines (like FuzzyCLIPS [7] or FuzzyJ [8]), advanced widgets supporting the knowledge acquisition phase, and automatic tree growing from large data sets. The second direction will be supported by integrating intelligent reporting tools available on the market. Its aim is to generate the system documentation based on the overall architecture of the developed expert system, using the user's comments and the implementation code inserted in frame, table or tree projects.
Acknowledgements
This project is a joint effort of both academia and industry partners. The University of the West from Timisoara supported this project through the Romanian Government's INFOSOC grant no. 61/2002 and CNCSIS grant no. 564/2002. Our industry partner is represented by Optimal Solution Software [6].
References
[1] Colomb, R.M.: Representation of Propositional Expert Systems as Partial Functions. Artificial Intelligence 109 (1999) 187-209
[2] Culbert, C., Riley, G., Donnell, B.: CLIPS Reference Manual, Vol. 1-3. Johnson Space Center, NASA (1993)
[3] Dial, R.B.: Decision Table Translation. Collected Algorithms from CACM (1970)
[4] Far, B.H., Takizawa, T., Koono, Z.: Software Creation: An SDL-Based Expert System for Automatic Software Design. In: Faergemand, O., Sarma, A. (eds.): Proceedings of SDL '93. Elsevier Publishing Co., North-Holland (1993) 399-410
[5] Friedman-Hill, E.: JESS: The Rule Engine for the Java Platform. http://herzberg.ca.sandia.gov/jess [3/11/2002]
[6] Optimal Solution web site. http://www.optsol.at [3/21/2002]
[7] Orchard, R.A.: FuzzyCLIPS User's Guide. Integrated Reasoning, Institute for Information Technology, National Research Council Canada (1998)
[8] Orchard, R.A.: NRC FuzzyJ Toolkit for the Java™ Platform. User's Guide (2001) http://www.iit.nrc.ca/IR_public/fuzzy/fuzzyJDocs [3/11/2002]
[9] Pop, D., Negru, V.: Knowledge Management in Expert System Creator. LNCS/LNAI 2443, Springer (2002) 233-242
[10] Protégé-2000. http://protege.stanford.edu/index.html [2/15/2003]
[11] Quinlan, J.R.: Induction of Decision Trees. Machine Learning 1 (1986)
[12] Riley, G., Donnell, B.: CLIPS Architecture Manual. Johnson Space Center, NASA (1993)
[13] Shwayder, K.: Conversion of Limited-Entry Decision Tables to Computer Programs: A Proposed Modification to Pollack's Algorithm. Communications of the ACM 14 (1971) 69-73
[14] Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann Publishers (2000)
An Innovative Approach for Managing Competence: An Operational Knowledge Management Framework Giulio Valente1 and Alessandro Rigallo2
1 Dipartimento di Informatica, Università di Torino, Corso Svizzera 185, 10149 Torino, Italy [email protected]
2 Telecom Italia Lab, Via G. Reiss Romoli 274, 10148 Torino, Italy [email protected]
Abstract. In this paper we present a general framework to manage the part of a company's knowledge known as Operational Knowledge (OK). Moreover, we analyze the main differences between the traditional Knowledge Management perspective and our Operational Knowledge Management approach. Afterwards, we present a case study on Operational Knowledge Management (OKM), defined and developed during 2001 and 2002 for Telecom Italia Mobile (TIM), one of the largest European mobile operators.
1
Introduction
In recent years, many organizations have shown a tremendous interest in implementing knowledge management processes and technologies, and are even beginning to adopt knowledge management as a part of their overall business strategy [14] [2]. In fact, "knowledge management allows improving organizational performance by enabling individuals to capture, share, and apply their collective knowledge to make optimal decision...in real time" [11]. However, because of the different kinds of corporate knowledge, an initial knowledge management solution covering all existing knowledge is an ambitious objective. So, in this paper, we focus our attention on the management of one part of corporate knowledge, called Operational Knowledge (OK), which is used by people performing their day-by-day activities. In other words, OK is mainly based on individuals' competence and experience, namely the sets of procedures used by people to construct a genuine strategy tailored to the specificities of a given situation [5]. We call our approach to managing OK Operational Knowledge Management (OKM). Whereas knowledge creation has been presented as a major issue for Knowledge Management Systems, very few works have attempted to develop models or frameworks of such systems using Nonaka's well-known spiral model. This lack of work is well summarized in the article "Knowledge Management: Problems, Promise, and Challenges", which claims that in traditional
knowledge management approaches the goal is "to store information from the past so that lesson will not be forgotten" [4]. Consequently, this perspective treats knowledge workers as passive recipients of information and does not consider the knowledge created by knowledge workers during their work. This paper presents an Operational Knowledge Management Framework (OKMF) based on Nonaka's knowledge creation model. Section 2 explores the main differences between traditional knowledge management approaches and our OKM approach. In Section 3 we derive the three main phases that constitute our framework. Finally, Section 4 briefly describes an OKM case study, called NetDoctor, which is used to manage the OK developed by skilled technicians performing daily maintenance and assurance activities on the mobile radio network.
2
Knowledge Management and Operational Knowledge Management
In traditional Knowledge Management approaches, management collects and structures a knowledge repository's contents as a finished product at design time (before the knowledge repository is deployed) and then disseminates the product. Such an approach assumes that the knowledge repository contains all the knowledge required by users. It is therefore a top-down approach, because it supposes that management creates the knowledge and that workers receive it. Our Operational Knowledge Management approach is an alternative that relates working and knowledge creation. This approach has two fundamental aspects. First, workers create knowledge at use time. Second, knowledge is a side effect of work. This is a bottom-up approach because it presumes that workers create the knowledge, which is then validated by managers. Table 1 summarizes the main differences between traditional approaches and our OKM perspective.
2.1 Knowledge Creation
To turn this abstract view of the two perspectives on knowledge management into a more precise observation, we now consider the knowledge creation process. In traditional KM approaches, knowledge is (a) captured by a knowledge engineer through one or more sessions with expert users and (b) stored in the repository. Once all of the domain's knowledge is stored in a repository, users of the Knowledge Management System (KMS) are able to access and retrieve the stored knowledge, at least until it becomes obsolete, for example after a strong technological innovation. Furthermore, the follow-up procedures for the knowledge repository are often not continuous. In other words, in a typical KMS, knowledge flows mainly from the knowledge repository to the user. This is what we call a "one-way flow" (see Fig. 1a) [1] [13]. We now consider Nonaka's well-known model of knowledge creation. Nonaka analyzes knowledge creation with a 2×2 matrix, where each cell represents a transition from tacit or explicit knowledge to tacit or explicit knowledge [6] [8]. Tacit knowledge is all
Table 1. Traditional KM versus the OKM approach

|  | Traditional KM | OKM Framework |
|---|---|---|
| Acquisition | Specialist (for example a knowledge engineer) | Specialist (for example a knowledge engineer) |
| Up-Grade | At design time (before system deployment); by a specialist (for example a knowledge engineer) | At design time and at use time (ongoing process); by everyone (for example people doing the work, knowledge engineers) |
| Dissemination | Lectures, broadcasting, meetings, document management | On-demand knowledge, best-practice diffusion |
| Learning paradigm | Knowledge transfer | Knowledge transfer and construction |
| Social structure | Individuals in hierarchical structures, communication primarily top-down | Communities of Practice, peer-to-peer and bottom-up communication |
| Work style | Standardized, repetitive and predictable | Complex, based on expertise, not predictable |
| Information space | Closed, static | Open, dynamic |
| Breakdown | Errors to be avoided | Opportunities for innovation and knowledge creation |
the implicit information we have created through personal reasoning, relationships, experience and so on. Explicit knowledge is all the information that we can find in books, courses, the internet and so on, namely knowledge expressed in a formal way, easy to transmit and to store. Each transition requires a different kind of thinking and interaction. When viewed as a continuous learning process, the model becomes a clockwise spiral. Nonaka claims that "the dynamic interaction, between the two types of knowledge, is the key to organizational knowledge creation" [8] [3]. However, if knowledge flows are "one-way flows", they do not allow the continuous process described by Nonaka. This suggested introducing into a KMS knowledge flows in the opposite direction (from user to knowledge repository), at the same intensity. This is what we call "double-way flows" (see Fig. 1b).
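To make the 2×2 matrix concrete, a minimal sketch of the four knowledge conversions is given below; the cell labels are the usual SECI terms from the literature on Nonaka's model and are our addition, since the text above only speaks of tacit/explicit transitions.

```python
# Minimal sketch of Nonaka's 2x2 knowledge-conversion matrix.  The cell
# labels are the standard SECI terms from the literature; the paper itself
# only speaks of transitions between tacit and explicit knowledge.
SECI = {
    ("tacit", "tacit"): "socialization",
    ("tacit", "explicit"): "externalization",
    ("explicit", "explicit"): "combination",
    ("explicit", "tacit"): "internalization",
}

def conversion(source: str, target: str) -> str:
    """Return the conversion mode for a source -> target knowledge transition."""
    return SECI[(source, target)]

print(conversion("tacit", "explicit"))   # externalization
```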
3
Operational Knowledge Management Framework
In our OKMF, we developed the "double-way flows" through three distinct phases [10] (see Fig. 1c):
Fig. 1. One-way flows vs. double-way flows: (a) a one-way knowledge flow from the source to the user(s); (b) a double-way knowledge flow between source and user(s); (c) the three OKM phases (Acquisition, Diffusion, Up-Grade) arranged around the source/user knowledge flows
1. Acquisition: the phase of capturing existing domain knowledge and storing it in a repository (in a structured manner);
2. Diffusion: the phase in which the repository's knowledge is accessed and used by the users;
3. Up-Grade: the phase of monitoring and upgrading the repository's knowledge during people's day-to-day activities (such as interacting with other people and with computer systems).
From the temporal point of view, knowledge is first Acquired. It can then be used, during the Diffusion phase. Afterwards, there is a cycle of use and up-grade of knowledge. This means that users are able to add new knowledge to the repository while they use it. In this way we do not need to acquire this new knowledge during a "formal" phase (Acquisition); we derive it directly from the usage of the OKM solution itself. However, it will be necessary to go back to the Acquisition phase, usually when the knowledge in the repository has become obsolete, for example, in the telecommunications field, when there is a major technological innovation (from GPRS to UMTS technology). The link between our OKMF and Nonaka's spiral model is therefore the temporal dimension. In fact, the knowledge repository should grow through the continuous cycle generated by the Acquisition, Diffusion and Up-Grade phases (see Fig. 2 and [8]). Following the identification of the phases comes their specification. In the specification, we have broken the phases up into seven key activities and divided these into sub-activities or modules. In this way, our OKMF has a modular architecture, so that the set of possible module
configurations covers different kinds of problems, corporate layouts and work strategies [12].
Fig. 2. The OKM Phases as a ”Knowledge Spiral”, picture adapted from [7]
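As a rough, purely illustrative sketch of this temporal ordering (the function, its parameters and the obsolescence test are ours, not part of the OKMF specification), the phase sequence could be enumerated as follows:

```python
# Illustrative sketch of the temporal ordering of the three OKM phases:
# an initial formal Acquisition, then a continuous Diffusion / Up-Grade
# loop, with a return to Acquisition once the repository is judged obsolete
# (e.g. after a major technology shift such as GPRS -> UMTS).
def okm_phases(n_work_cycles: int, obsolete_after: int):
    """Yield the phase executed at each step of the OKM life cycle."""
    yield "Acquisition"
    for cycle in range(1, n_work_cycles + 1):
        yield "Diffusion"
        yield "Up-Grade"
        if cycle % obsolete_after == 0:   # repository judged obsolete
            yield "Acquisition"

print(list(okm_phases(n_work_cycles=3, obsolete_after=2)))
```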
4
NetDoctor: A Case Study
In this section, we briefly discuss a specific module configuration of our OKM Framework, called NetDoctor. In particular, we describe how NetDoctor implements the three phases of our OKM Framework. NetDoctor was developed during 2001 and 2002 for TIM. It allows the OK to be managed (i.e., the competences developed by skilled technicians performing daily maintenance and assurance activities on the mobile radio network). To do so, it provides network operational personnel with real-time correlation of network performance data, and it monitors such data [9]. The network measurement data provide valuable input for the pro-active discovery of network faults. In case of network faults which affect the quality of the provided services, NetDoctor is able to detect and fix the troubles and, using the OK, provides the network operational personnel with the best solution as soon as possible. The TIM network departments are made up of seven different territorial areas (TAs), each with its own specific ways of performing activities. Each TA owns an instance of the NetDoctor system, and all these instances are connected through a WAN.
4.1 NetDoctor: Knowledge Acquisition Phase
Within the Knowledge Representation field, rules allow one to specify what to do in a situation with particular characteristics. For this reason NetDoctor uses a rule-based knowledge model to map and categorize the unstructured OK related to cellular network maintenance and assurance. The OK is acquired through a series of knowledge acquisition sessions with knowledge engineers and skilled technicians. NetDoctor's rules have the well-known form ⟨SYMPTOM, DIAGNOSIS, RESOLUTION⟩, where:
– SYMPTOM is the recognition of the network fault;
– DIAGNOSIS is the identification of the fault's causes;
– RESOLUTION is the set of activities for restoring the network.
Because the TAs differ, each instance of NetDoctor has two repositories of rules: a Local Rule Repository, with rules created by a single user in a specific territorial area, and a National Rule Repository, with rules shared by all territorial areas.
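As a concrete illustration of such rules and of the two repositories, a minimal sketch is given below; the field names and the example contents are hypothetical and are not taken from the actual NetDoctor knowledge base.

```python
# Hypothetical sketch of NetDoctor-style rules; the concrete fields and the
# example contents are illustrative only, not taken from the actual system.
from dataclasses import dataclass
from typing import List

@dataclass
class Rule:
    symptom: str            # recognition of the network fault
    diagnosis: str          # identification of the fault's causes
    resolution: List[str]   # activities for restoring the network
    scope: str = "local"    # "local" (one TA) or "national" (shared by all TAs)

national_rules = [
    Rule(symptom="high drop-call rate on a cell",
         diagnosis="faulty transceiver board",
         resolution=["reset the transceiver", "schedule board replacement"],
         scope="national"),
]
local_rules_ta1 = [
    Rule(symptom="degraded GPRS throughput",
         diagnosis="misconfigured packet control unit",
         resolution=["reload the PCU configuration"]),
]
```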
4.2 NetDoctor: Knowledge Diffusion Phase
In NetDoctor, knowledge diffusion is achieved through a rule-exchanging mechanism. For example, suppose there are two TAs (territorial areas), each with a single instance of NetDoctor. If there is network trouble in TA1 and the user of TA1 asks NetDoctor for a diagnosis, the Rule Engine starts by applying the National Rules. If a solution is not reached, the Rule Engine applies the Local Rules of the user's territorial area (in this case TA1). If it fails again, the Rule Engine automatically tries to find a solution in the Local Rules of the other territorial area (in this case TA2). In other words, the set of rules originally stored in a single instance of NetDoctor can easily be imported by the other TAs, realizing the diffusion of rules.
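A minimal sketch of this lookup order (National Rules first, then the requesting TA's Local Rules, then the Local Rules of the other TAs) is shown below; the matching test is deliberately reduced to a string comparison, whereas the real Rule Engine is far richer.

```python
# Simplified sketch of the rule-diffusion lookup order described above.
# Matching is reduced to a string comparison; the real Rule Engine
# (based on ILOG JRules) is of course far more sophisticated.
def diagnose(symptom, national_rules, local_rules_by_ta, requesting_ta):
    def first_match(rules):
        return next((r for r in rules if r["symptom"] == symptom), None)

    # 1. National Rules shared by all territorial areas.
    match = first_match(national_rules)
    if match:
        return match
    # 2. Local Rules of the requesting territorial area.
    match = first_match(local_rules_by_ta.get(requesting_ta, []))
    if match:
        return match
    # 3. Local Rules of the other territorial areas (rule import).
    for ta, rules in local_rules_by_ta.items():
        if ta != requesting_ta:
            match = first_match(rules)
            if match:
                return match
    return None

national = [{"symptom": "cell outage", "resolution": "restart BTS"}]
local = {"TA1": [], "TA2": [{"symptom": "GPRS congestion", "resolution": "rebalance PDCH"}]}
print(diagnose("GPRS congestion", national, local, "TA1"))   # imported from TA2
```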
4.3 NetDoctor: Knowledge Up-Grade Phase
NetDoctor has a user-friendly Rule Editor interface which allows skilled technicians to add, modify or delete their Local Rules (National Rules cannot be modified or deleted) during their day-by-day activities. This is one of the most important features of NetDoctor derived from the OKM Framework. Through the Monitoring module, a user can check the life-cycle of the rule set. This life-cycle is evaluated by a set of indicators such as the percentage of accesses, the percentage of re-uses and the percentage of imports. In this way a rule can be evaluated by users as a best practice (e.g., a high number of imports) or as obsolete (e.g., a low number of accesses and/or re-uses). Both the Rule Engine and the Rule Editor are based on ILOG JRules technology [1].
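The paper only names the life-cycle indicators, so the thresholds and scoring below are invented for illustration; a rule-status check along these lines might look as follows.

```python
# Illustrative only: the indicator names come from the paper, but the
# thresholds and the exact scoring are assumptions of ours.
def rule_status(access_pct: float, reuse_pct: float, import_pct: float) -> str:
    if import_pct >= 50.0:                      # widely imported by other TAs
        return "best-practice"
    if access_pct < 5.0 and reuse_pct < 5.0:    # rarely accessed or re-used
        return "obsolete"
    return "active"

print(rule_status(access_pct=2.0, reuse_pct=1.0, import_pct=0.0))     # obsolete
print(rule_status(access_pct=40.0, reuse_pct=30.0, import_pct=60.0))  # best-practice
```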
5
Conclusion
In this paper, we have described our approach to Operational Knowledge Management (OKM). Whilst Knowledge Management can be defined as a way for knowledge to evolve from tacit to explicit form, OKM is a part of KM specifically focused on Operational Knowledge. Among knowledge management problems, a topic of remarkable importance is knowledge creation. In fact, to the best of our knowledge, most knowledge management systems treat knowledge workers as passive recipients of information; therefore, new knowledge is not created as knowledge workers work. In this paper, we have claimed, on the contrary, that considering knowledge workers as active participants is a key issue for building knowledge management systems according to Nonaka's knowledge creation model. The last part of this paper briefly focused on a case
study of OKM, called NetDoctor. This is an example of the instantiation of a specific OKM system from a neutral OKM Framework in order to manage a particular kind of Operational Knowledge. The NetDoctor OKM system was field-proven during 2001 and 2002 in two TIM territorial areas of Italy. The TAs covered by these NetDoctor installations are quite varied and are characterized by about 5000 Network Elements under monitoring (about 30 million data records, kept for up to 30 days for historical correlation and diagnosis). In each TA, about 30 users used a NetDoctor instance with about 100 rules. Moreover, NetDoctor covers both GSM (Global System for Mobile communications) and GPRS (General Packet Radio Service) technologies. Over the whole period of its operation, the system has demonstrated its robustness in managing and monitoring a huge number of network anomalies and in measurement collection, correlation and problem-solving activities.
References
[1] ILOG JRules User's Manual.
[2] J. R. Anderson. Cognitive Psychology and Its Implications. W. H. Freeman and Company, 2nd edition, 1985. New York.
[3] C. M. Barlow. The knowledge creating cycle. Available at http://www.stuart.iit.edu/courses/mgt581/filespdf/nonaka.pdf.
[4] G. Fisher and J. Ostwald. Knowledge management: Problems, promise, realities and challenges. IEEE Intelligent Systems, (January/February):60–72, 2001.
[5] J. Pomerol, P. Brezillon, and L. Pasquier. Operational knowledge representation for practical decision making. In Proc. of the 34th Hawaii International Conference on System Sciences, 2001.
[6] A. D. Marwick. Knowledge management technology. IBM Systems Journal, 40(4):814–830, 2001.
[7] K. McLennan and B. Guay. Knowledge management advances - organizing know-how to gain competitive advantage. Technical report, DMR Consulting, 2000.
[8] I. Nonaka and H. Takeuchi. The Knowledge-Creating Company. Oxford University Press, 1995. New York.
[9] A. Rigallo, A. Stringa, and F. Verroca. Real-time monitoring and operational assistant system for mobile networks. In Proc. of the Network Operations and Management Symposium, 2002.
[10] N. Shadbolt, N. Milton, H. Cottam, and M. Hammersley. Towards a knowledge technology for knowledge management. Int. J. of Human-Computer Studies, (51):615–641, 1999.
[11] R. G. Smith and A. Farquhar. The road ahead for knowledge management. AI Magazine, (Winter 2000):17–40, 2000.
[12] G. Valente and A. Rigallo. Operational knowledge management: a way to manage competence. In Proc. of the International Conference on Information and Knowledge Engineering, 2002.
[13] M. H. Zack. An information infrastructure model for systems planning. Journal of Systems Management, 43(8):16–40, 1992.
[14] M. H. Zack. Managing codified knowledge. Sloan Management Review, 40(4):45–58, 1999.
A Synergy of Modelling for Constraint Problems
Gerrit Renker, Hatem Ahriz, and Inés Arana
School of Computing, The Robert Gordon University, Aberdeen, Scotland, UK
{gr,ha,ia}@comp.rgu.ac.uk
Abstract. Formulating and solving constraint problems requires both mathematical modelling as well as software development skills. Rather than reconciling these demands at implementation level, we have introduced a separate modelling layer in our research that abstracts away from low-level representation and implementation issues. We found that this benefits both the mathematical modelling and the software development aspects of constraint programming, leading to an efficient synergy of these activities. In this paper, we report on experiences and advances with our uml-based modelling approach.
1
Introduction
A constraint problem (CP) is a man-made formulation of a real-world phenomenon whose intrinsic structure and characteristics are described by an underlying mathematical model. Common and popular examples are scheduling, timetabling, document formatting and resource allocation. A constraint is a relation that must hold for one or more variables. Constraints can be combined into a constraint network, which serves as the basis for defining constraint problems.

Definition 1 (Constraint Network). A constraint network is a triple (X, D, C). X is a finite tuple of k > 0 variables xi. D = {d1, ..., dk} is a set of domains, the union of whose members forms a universe of discourse, U. C is a finite set of r ≥ k constraints. In each constraint ci(xj) ∈ C, xj is a j-ary sub-tuple of X and ci(xj) is a subset of U^j. C contains k unary constraints of the form ci(xv) = di, one for each variable xv in X, restricting xv to range over a domain di ∈ D which is called the domain of xv.

Different problems can be formulated on the same constraint network; probably the best known is the constraint satisfaction problem (CSP).1 Solving a CSP is usually understood as finding an assignment of domain values vj ∈ di, di ∈ D, for all variables xi in X such that all constraints in C are satisfied. Modelling in the context of constraint problems currently has much in common with traditional mathematical modelling [7, 11, 22], as it marks the process
1 For more variants of problems cf. Bowen [5], to whom this distinction between a constraint problem and the constraint network on which it is defined is due.
of choosing which entities to represent as variables and how to relate these via constraint relations. Quoting [16, p. 167]: “Modelling is at the heart of constraint programming since it is by this process that the problem is specified in terms of constraints that can be handled by the underlying solver ”. This form of modelling requires considerable domain expertise, experience and knowledge representation skills. Strategically solving a formulated constraint problem demands two more forms of modelling skills. First, the implementation of a problem requires a representation in terms of the given data structures and primitives of the programming language at hand. The initial problem formulation may have to be revised and adapted, subject to restrictions of the implementation language. Secondly, solving the problem requires committing to a particular solving strategy, which may require setting solver-specific parameters. In the case of combinatorial search e.g., variable and value orderings, depth thresholds and the like need to be modelled. This dichotomy of mathematical modelling on the one hand and software development in constraint programming on the other hand is reflected throughout recent developments. The OPL programming language [14] combines elements from both mathematical modelling languages (such as ampl [11]) and constraint programming, stressing the fact that constraint programming embraces both constraint-centric and programming-language centric elements. Recent research in the Alma-0 project [3] has made clear that the differences between a traditional imperative programming language and a constraint programming language can be described in terms of just 9 elementary features. Dedicated support for modelling search is for instance available in OPL [14] and Localizer [17]. The cacp project [6, 24] tries to minimize the modelling demands of constraint technology on the user by providing an entire support system. The software modelling aspects are made transparent by the use of an interfaced constraint language which provides a solver - (language) independent internal representation. The mathematical modelling aspects are simplified through assistance in problem formulation, via intuitive interfaces, validation of syntax and semantics of the problem formulation and aid in algorithm selection. The complex and multidisciplinary nature of formulating constraint problems may be the reason why this area still is mostly reserved to specialized experts. We have found that the introduction of a separate modelling layer benefits both the mathematical modelling as well as the software development aspects of constraint programming and therefore developed a uml-based modelling paradigm [21]. The concept is sketched in Fig. 1. By using a higher-level modelling layer, we can abstract away
Fig. 1. Modelling Layer at Higher Level (the modelling layer sits above both the mathematical model and the software implementation)
from low-level representation and software implementation issues. Working on the modelling layer fuses the two main activities in constraint programming in a more abstract way. Our experiences with the approach, which we report on in this paper, were positive and we point out limitations of the underlying uml. Such criticism is in the interest of refining the approach, but may as well benefit the OO modelling community. The remainder of this paper is structured as follows. In the next section, we describe the benefits of the approach, illustrated by an example in Sect. 3. We point out current limitations of using uml in Sect. 4 and conclude in Sect. 5.
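To make Definition 1 of Sect. 1 concrete before proceeding, the following minimal sketch (our own encoding, not taken from the paper) represents a tiny constraint network as the triple (X, D, C) and enumerates the solutions of the corresponding CSP.

```python
# Minimal sketch of the (X, D, C) triple of Definition 1 and of the CSP test.
# Constraints are represented as predicates over sub-tuples of variables;
# this particular encoding is ours, not prescribed by the paper.
from itertools import product

X = ("x1", "x2")                          # variables
D = {"x1": {1, 2, 3}, "x2": {1, 2, 3}}    # domains (the unary constraints)
C = [                                     # further constraints over sub-tuples
    (("x1", "x2"), lambda a, b: a < b),
]

def satisfies(assignment):
    """True iff the assignment is a solution of the CSP on (X, D, C)."""
    if any(assignment[x] not in D[x] for x in X):
        return False
    return all(pred(*(assignment[v] for v in scope)) for scope, pred in C)

solutions = [dict(zip(X, values))
             for values in product(*(sorted(D[x]) for x in X))
             if satisfies(dict(zip(X, values)))]
print(solutions)   # [{'x1': 1, 'x2': 2}, {'x1': 1, 'x2': 3}, {'x1': 2, 'x2': 3}]
```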
2
The Case for a Separate Modelling Layer
In this section, we sum up research experiences with a separate modelling layer in the recop (REpresenting and REformulating Constraint and Optimization Problems) project. Our approach is centered around building two kinds of models. First, a structural model is visually expressed in uml [19]. This is complemented by textual notation for the constraints, currently using the ocl (Object Constraint Language [25]). The end result of this modelling process is an algebraic model, i.e. a model sufficiently specific and precise to be transformed into a constraint programming language [21]. Separate Modelling and Solving. A lesson learnt from mathematical modelling is that structuring a problem is best carried out separately from solving it, as too strong a focus on problem solving creates a spirit of attachment to a particular method [22, Sect. 1.5]. For similar reasons, knowledge bases and inference engines appear as separate units in knowledge engineering. Separating the stages of modelling and solving is much more likely to drive constraints research forward towards new formalisms than constraining models around the limited capabilities of given solvers. Abstraction Facilities. Defining constraint problems in terms of variables and relations (constraints) confines the modelling perspective to a single, low-level vantage point. Domains of variables may occur as complex (hierarchically) nested structures or entire subproblems may appear in the form of meta-variables [23]. It is also common to reify the satisfaction of a constraint into a variable [10]. Apart from being a flat structure, the variable-set based representation further suffers from scalability issues. The structure of larger problems with a higher number of variables becomes very hard to grasp, let alone debug, if the problem representation is based on variables or nodes to represent variables in the constraint graph.2 Missing in the conventional definition of constraint models are abstraction facilities (cf. Sect. 3). This is even more severe as availability of multiple abstraction levels is considered essential for automatic problem reformulation [12]. 2
A constraint problem with k variables can be represented as a constraint (hyper-) graph with k nodes.
A Synergy of Modelling for Constraint Problems
1033
Missing Common Notation. A strong advantage of purely mathematical models is that their unambiguous notation is widely understood, so that models can be communicated all over the world. Constraint models however are usually tied to a particular implementation language. This makes it difficult to share problem solving and domain knowledge, communicate about models and develop these in a larger, possibly distributed team. A separate modelling layer on the other hand facilitates a common notation through abstraction from implementation details. This significantly improves communication about models and allows to compare features of different models on the same background. Constraint Acquisition. Finding the right model in terms of variables and constraints is often the biggest difficulty faced by (novice) constraint programmers [16, Chap. 5]. This cognitive process, analogous to knowledge acquisition, has been termed ‘constraint acquisition’ and forms a branch of current constraints research [20]. The advantage that a modelling layer affords over direct programming is that, especially in early stages of development, unnecessary facts can be elided in order to concentrate on the core features of a problem [7, p. 68]. Relational Basis. A constraint problem has a relational basis and can thus be described in terms of relational algebra [21, Sect. 2.4]. Despite being basic building blocks of constraint problems, relations alone prove too limited for the representation of deeper semantics. In scheduling for instance, often several pieces of information are aggregated around a single task variable, e.g. the duration of the task, the required resource, a tag indicating if this task is interruptible etc. A semantically rich notion of relationships (the uml distinguishes six types of association) and facilities to record additional modelling information as it occurs have proven valuable tools of the modelling layer. Model Optimisation. A model is not accomplished after merely naming variables and constraints. First, the model needs to be validated in regard to external consistency with the phenomenon being modelled and verified with respect to internal consistency (debugging), i.e. it may contain errors or be overconstrained (admitting only an empty solution set). Of great interest is further finding a reformulation of the model which is most likely to minimize the effort of solving it [16, sec. 8.4]. Several variants may be tried until a model optimal in regard to the solving procedure is worked out. A modelling layer is helpful here in that it allows to describe the differences and transformations between models and is a foundation for automated reformulation [12].
3
Example
To illustrate some modelling facilities (abstraction and aggregation), we use a small toy scheduling problem. For a more involved case study see [21]. The task at hand is taken from [16, Sect. 1.2] and involves scheduling the activities of building a house; the interdependencies are summarized in Table 1.
Table 1. Tasks of building a house

| Task Name | Duration | Predecessors |
|---|---|---|
| Start | 0 | / |
| Foundations | 7 | Start |
| Interior Walls | 4 | Foundations |
| Exterior Walls | 3 | Foundations |
| Chimney | 3 | Foundations |
| Roof | 2 | Exterior Walls |
| Tiles | 3 | Chimney, Roof |
| Windows | 3 | Exterior Walls |
| Doors | 2 | Interior Walls |
| End | 0 | Doors, Windows, Tiles |
Each of the 10 tasks is to be assigned a start date, subject to precedence constraints; each task has a name, a duration and possibly predecessors. Rather than repeating the individual information for each of the tasks in Table 1, a generic Task class is at the heart of the structural uml model in Fig. 2. The aggregation of attributes in the Task class replaces tenfold repetition in the diagram. All other task classes in Fig. 2 are subtypes of Task and inherit its attributes, constraints and associations. In place of the 12 syntactically identical precedence constraints [16, Sect. 1.2], there is a single generic ocl constraint:

context Task inv: previous->forAll(t:Task| t.start + t.duration <= self.start)
The constraint asserts that, for each instance of a Task, all tasks preceding this task instance must have completed before its start time. Figure 2 illustrates the abstraction facilities that type level affords over instance level and class level over subclass level.
Fig. 2. Structural model of the problem (a generic Task class with attributes name: String, start: Integer, duration: Integer and a 0..* 'previous' association to itself; Start, Foundations, Int_Walls, Ext_Walls, Chimney, Roof, Tiles, Windows, Doors and End are its subclasses)
Fig. 3. Packaging tasks into groups (the generic Task class is placed in a Generic package; the Construction package, with its nested Masonry package, and the Decoration package access the Generic package via <<access>> dependencies, and the individual task classes are distributed over these packages)
The problem remains that for a larger number of tasks (say a few hundred) the diagram becomes incomprehensible and cluttered. Therefore, the possibility of grouping tasks of a similar sort into packages is shown in Fig. 3. The packages appear as tabbed rectangles; the dependence of the Construction and Decoration packages on the Generic package is shown as a dashed line with an access stereotype [19, Sect. 3.38], signifying the accessibility of the elements of the Generic package to the other packages. The Construction package includes the Masonry package. In this manner, containment hierarchies can be built. On the modelling layer, this allows the same form of structuring as introduced in Goualard's S-Box method [13] for constraint store debugging. The structuring power of these grouping and abstraction facilities unfolds with growing problem size, allowing large arrays of variables to be broken into small manageable groups of entities.
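To connect the uml/ocl model back to an executable formulation, the following small sketch (ours, not the paper's code) encodes the tasks of Table 1 together with the generic precedence rule of the ocl invariant and computes the earliest feasible start dates.

```python
# Sketch of the Table 1 scheduling problem; the precedence rule mirrors the
# single generic OCL invariant (a task may only start once every predecessor
# has finished).  The greedy earliest-start computation below is our own.
TASKS = {  # name: (duration, predecessors)
    "Start": (0, []),
    "Foundations": (7, ["Start"]),
    "Interior Walls": (4, ["Foundations"]),
    "Exterior Walls": (3, ["Foundations"]),
    "Chimney": (3, ["Foundations"]),
    "Roof": (2, ["Exterior Walls"]),
    "Tiles": (3, ["Chimney", "Roof"]),
    "Windows": (3, ["Exterior Walls"]),
    "Doors": (2, ["Interior Walls"]),
    "End": (0, ["Doors", "Windows", "Tiles"]),
}

def earliest_starts(tasks):
    """Assign each task the earliest start date satisfying all precedences."""
    start = {}
    while len(start) < len(tasks):
        for name, (duration, preds) in tasks.items():
            if name not in start and all(p in start for p in preds):
                # OCL: previous->forAll(t | t.start + t.duration <= self.start)
                start[name] = max((start[p] + tasks[p][0] for p in preds), default=0)
    return start

print(earliest_starts(TASKS))   # 'End' starts at day 15, the project duration
```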
4
Discussion and Further Work
It would be naive to assume that vanilla uml can cater for all scenarios of such a specialized area as constraint programming. There are some interesting areas for changes and extensions. These are mainly the restricted extension mechanism of uml 1.4 and its inherent lack of precision. The current uml extension mechanism, called profiles [19, Sect. 2.6], is restricted to renaming existing uml elements along with the specification of syntactical well-formedness rules and additional attributes (tagged values). Genuine language extensions (as required here) are currently not possible; a change may come with the release of uml 2.0.3 We are thus facing a situation similar to Agent uml [18], which also proposes uml language extensions for a domain other than mainstream software development.
This is still work in progress and long overdue, cf. www.omg.org/uml.
Fig. 4. Trying to represent relations in uml (an association between two classes A and B intended to denote a relation R(A,B))

The second major problem is the lack of precision in uml, which is defined in terms of itself (via a uml meta-model [19]) and natural language. It can often be observed that people resort to more rigorous notations like Z once precision becomes required, as demonstrated in [9]. Modelling constraint problems does require precision to mark differences between models, to define model transformations and to allow unambiguous model specifications. We are currently investigating several alternatives for combining uml with a rigorous notation. Lastly, another major problem is that uml does not straightforwardly allow the modelling of mathematical relations, i.e. subsets of Cartesian products. Note that this notion of relation is different from relationships as a semantic uml modelling element [19, Sect. 2.5]. It is currently not possible to post a constraint on an association between two classes A and B (Fig. 4), denoting that both form a binary relation R(A,B) with certain properties such as being functional and injective. This is due to the fact that associations are not first-class citizens in uml. Several papers have introduced work-arounds for this problem [1, 2, 4, 15], all of which require using a handful of uml elements to describe a simple relation. Considering that a relation is a low-level modelling primitive, its representation should not require such complexities but rather be present as an atomic element. Even more, as relations are elementary building blocks of constraint problems, this marks not only a deficiency of uml but has prompted us to develop an extension. Other future work focuses on formalizing the approach, making it precise enough to specify model translations.
5
Conclusion
We have presented benefits and experiences of using a separate high-level modelling layer for constraint programming, illustrated our approach on a small example and pointed out limitations of the employed uml. Looking at future prospects, we see a potential of the approach to increase modelling support, allow communication about models on a common basis and to open a door for (models of) distributed constraint problems. Namely, model descriptions can be translated into an xml dialect [8], which allows homogeneous constraint representation, processing and constraint fusion across heterogeneous nodes.
References [1] D. H. Akehurst and B. Bordbar. On Querying UML Data Models with OCL. In M. Gogolla and C. Kobryn, editors, UML 2001 - The Unified Modeling Language. Modeling Languages, Concepts, and Tools. 4th International Conference, Toronto, Canada, October 2001, Proceedings, volume 2185 of LNCS, pages 91– 103. Springer, 2001. 1036 [2] D. H. Akehurst and S. Kent. A Relational Approach to Defining Transformations in a Metamodel. In J.-M. J´ez´equel, H. H. Hussmann, and S. Cook, editors, UML 2002 - The Unified Modeling Language, 5th International Conference, Dresden, Germany, September 30 – October 4, 2002; Proceedings, volume 2460 of LNCS, pages 243–258. Springer, 2002. 1036 [3] K. R. Apt, J. Brunekreef, V. Partington, and A. Schaerf. Alma-0: An Imperative Language that Supports Declarative Programming. ACM Toplas, 20(5):1014– 1066, September 1998. 1031 [4] H. Balsters. Derived classes as a basis for views in UML/OCL data models. Research Report 02A47, University of Groningen, Research Institute SOM (Systems, Organisations and Management), 2000. 1036 [5] J. Bowen. The (minimal) specialization CSP: a basis for generalized interactive constraint processing. In Working Notes of the First International Workshop on User-Interaction in Constraint Satisfaction, held in conjunction with CP 2001, pages 1–14, Paphos, Cyprus, December 2001. 1030 [6] R. Bradwell, J. Ford, P. Wills, E. Tsang, and R. Williams. An Overview of The CACP Project: Modelling And Solving Constraint Satisfaction/Optimisation Problems With Minimal Expert Intervention. In In CP 2000 Workshop on Analysis and Visualization of Constraint Programs and Solvers, Singapore, September 2000. 1031 [7] M. Cross and A. O. Moscardini. Learning the Art of Mathematical Modelling. Ellis Horwood Ltd., 1985. 1030, 1033 [8] B. Demuth, H. Hussmann, and S. Obermaier. Experiments With XMI Based Transformations of Software Models. In Proceedings of WTUML: Workshop on Transformations in UML in Genova, Italy, Saturday April 7th, 2001, 7 April 2001. 1036 [9] A. S. Evans. Reasoning with UML Class Diagrams. In Second IEEE Workshop on Industrial Strength Formal Specification Techniques, October 20 - 23, 1998, Boca Raton, Florida, pages 102–113. IEEE, 1998. 1036 [10] A. Fernandez and P. Hill. A Comparative Study of Eight Constraint Programming Languages over the Boolean and Finite Domains. Constraints (Kluwer), 5(3):275– 301, July 2000. 1032 [11] R. Fourer, D. M. Gay, and B. W. Kernighan. AMPL: A Modeling Language for Mathematical Programming. Duxbury Press, 1993. 1030, 1031 [12] A. M. Frisch, B. Hnich, I. Miguel, B. M. Smith, and T. Walsh. Towards CSP Model Reformulation at Multiple Levels of Abstraction. International Workshop on Reformulating Constraint Satisfaction Problems at CP-02, Ithaca, NY, USA, 2002. 1032, 1033 [13] F. Goualard and F. Benhamou. A Visualization Tool for Constraint Program Debugging. In Proceedings of The 14th IEEE International Conference on Automated Software Engineering (ASE-99), pages 110–118. IEEE Computer Society, October 1999. 1035
[14] P. V. Hentenryck. The OPL Optimization Programming Language. The MIT Press, January 1999. 1031 [15] L. Mandel and M. V. Cengarle. On the Expressive Power of OCL. In Jeannette M. Wing, Jim Woodcock, and Jim Davies, editors, Proceedings of the World Congress on Formal Methods (FM-99), Toulouse, France, volume 1708 of LNCS, pages 854– 874. Springer, 1999. 1036 [16] K. Marriott and P. J. Stuckey. Programming with Constraints: An Introduction. The MIT Press, 1998. 1031, 1033, 1034 [17] L. Michel and P. V. Hentenryck. Localizer. Constraints, 5(1-2):43–84, January 2000. 1031 [18] J. Odell, H. V. D. Parunak, and B. Bauer. Extending UML for Agents. AOIS Workshop at AAAI 2000; Austin, Texas (USA), July 30, 2000. 1035 [19] OMG. Unified Modeling Language Specification, Version 1.4, September 2001. Available at http://www.omg.org/cgi-bin/doc?formal/01-09-67, 2001. 1032, 1035, 1036 [20] B. O’Sullivan, E. C. Freuder, and S. O’Connell. Interactive constraint acquisition. In Working Notes of the First International Workshop on User-Interaction in Constraint Satisfaction, held in conjunction with CP 2001, Paphos, Cyprus, December 2001. 1033 [21] G. Renker, H. Ahriz, and I. Arana. CSP - There is more than one way to model it. In M. Bramer, A. Preece, and F. Coenen, editors, Research and Development in Intelligent Systems XIX: Proceedings ES 2002, The Twenty-second SGAI International Conference on Knowledge Based Systems and Applied Artificial Intelligence, pages 395–408. Springer, 2002. 1031, 1032, 1033 [22] T. L. Saaty and J. M. Alexander. Thinking With Models. Pergamon Press, 1981. 1030, 1032 [23] D. Sabin and E. C. Freuder. Configuration as Composite Constraint Satisfaction. In G. F. Luger, editor, Proceedings of the (1st) Artificial Intelligence and Manufacturing Research Planning Workshop, pages 153–161. AAAI Press, 1996. 1032 [24] E. Tsang, P. Mills, R. Williams, J. Ford, and J. Borrett. A Computer Aided Constraint Programming System. In The First International Conference on The Practical Application of Constraint Technologies and Logic Programming (PACLP 99), pages 81–93, London, April 1999. 1031 [25] J. B. Warmer and A. G. Kleppe. The Object Constraint Language: Precise Modeling with UML. The Addison-Wesley Object Technology Series. Addison Wesley, 1999. 1032
Object-Oriented Design of E-Government System: A Case Study
Jiang Tian and Huaglory Tianfield
School of Computing and Mathematical Sciences, Glasgow Caledonian University, 70 Cowcaddens Road, Glasgow, G4 0BA, UK
{j.tian,h.tianfield}@gcal.ac.uk
Abstract. E-government systems have become widespread in more and more countries. However, e-government systems are highly complex to design. This paper first analyses the functions and services of government and the possibility of their digitalization. Secondly, it analyses the overall requirements of e-government systems. Finally, object-oriented design (OOD) is introduced as an effective method for designing e-government systems. The OOD process, composed of five major steps, is illustrated in detail with the case of a passport application system.
1
Introduction
E-government is "utilising the internet and the world-wide-web for delivering government information and services to citizens" [1]. E-government can not only improve the operational efficiency of government, but also meet the increasing requirements of scientific decision-making and democratic reform. It can be envisioned that e-government will satisfy the requirements of citizens and the public far better. However, e-government systems are highly complex and volatile to design. How can this complexity and evolution be conquered in order to meet the increasing requirements of e-government systems? OOD can not only decompose complex systems into objects and classes that map directly to physical processes [2], but can also meet the demand for extensive reuse and evolution of software systems. The OOD method is therefore well suited to tackling the complexity of e-government systems.
2
Governmental Functions/Services and E-Government System Requirement Analysis
Although governments take various forms, such as federal government, republican government, military government and constitutional monarchy, the essence of government is an effective organisational form of regime and results
from the establishment of countries. The functions/services of government are realised through the horizontal and vertical structures of government [3].
2.1 Governmental Functions/Services
Governmental functions/services cover a wide range. "Government provides not only the legal, political, and economic infrastructure to support other sectors, but also exerts significant influence on the social factors that contributed to their development" [4]. In a general sense, the functions/services of a government include four aspects, i.e., (1) to establish and consolidate governance, (2) to deal with foreign affairs, (3) to develop the economy and promote the progress of society, and (4) to serve citizens and the public. Although all functions/services of government could ultimately be digitalized in the future, only a limited number of them can currently be digitalized, mainly because of national security and secrecy concerns along with some current technological limitations [5].
2.2 Requirement Analysis of E-Government Systems
In e-government systems, the functional requirements comprise three aspects: access to information, transaction services and participation [6, 7], while the non-functional requirements focus on efficiency, portability, reuse, democratic reform, etc. The functional and non-functional requirements of an e-government system are depicted in Table 1.

Table 1. The functional and non-functional requirements of e-government systems
Functional requirements:
(1) Access to information: announcements; policy displays
(2) Transaction services: various licenses online; permits and patents online; linking to deposit and other systems
(3) Participation: electronic petitions; electronic voting

Non-functional requirements:
(1) Efficiency (2) Portability (3) Standards (4) Reuse (5) Democratic reform (6) Interfaces to external systems, such as e-commerce, bank and digital library systems
In order to illustrate the process of digitalization, the case of passport application is further analysed with OOD in terms of the functional and non-functional requirements. Passport application is a typical government service.
3
OOD Process of E-Government Systems
Fundamentally, OOD consists of three things, i.e., notation, strategies and goodness criteria. Notation is for communication, strategies are for developing solution "patterns", and goodness criteria are for evaluating the design [8]. These three things are merged in the OOD process. The OOD process is a complete system comprising a neat sequence of activities, which typically contains five steps, i.e., (1) to design the context and the model of use for the system, (2) to design the system architecture, (3) to identify the objects in the system, (4) to develop design models, and (5) to specify object interfaces [2]. These steps will be illustrated individually in a passport application case. In this case, applicants present applications and relevant documents on-line and then obtain their issued passports. The main tasks are (1) to receive the application, (2) to judge the application, (3) to collect the application documents, and (4) to issue passports correctly according to the different application categories. At the same time, the system also provides consultation and help services, and deals with illegal applications.

Fig. 1. Use cases for passport application (use cases include Startup, Shutdown, Receipt, Collection, Judgment, Issuing Passport, Report, Archive, Charge, Illegality, Consultation and Help; the actors are the Passport Official and the Applicant)
3.1
System Context and the Model of Use
The system context defines the problem which needs to be resolved and distinguishes the system from its environment. Generally, the use model is an effective means of defining the system context by modelling the practical process. A government service relates to many entities. The closest external systems for the passport application system include the civil service system, the security management system, the ID database, the bank system and the post system. For example, when the passport application system executes a judgment it will interact with the ID database and the security management system. The passport application process is carried out by the Passport Office and the applicant. The whole process is divided into various steps, such as receipt, collection, judgement and issuing; all steps are associated with the Passport Office, while certain steps are also associated with the applicant. The use cases of the passport application are depicted in
Fig. 1, where the ellipses represent the different steps in the process and "shutdown" is the initial state of the system.
3.2 System Architecture
System architecture represents the fundamental structure used to realise the functions. A system architecture decomposes a complex system, as simply as possible, into the necessary subsystems which implement particular functions of the system. A layered architecture is usually used to depict the system architecture in terms of processing stages. The data collection, data processing, passport issuing, data report and data archiving subsystems and their functions are depicted in Fig. 2. In each stage the operation relies only on the processing of the previous stage.

Fig. 2. The layered architecture of the passport application system: a data collection layer (objects concerned with acquiring accurate data from the applicant), a data processing layer (objects concerned with checking, integrating and comparing the application data), a passport issuing layer (objects concerned with issuing different passports according to the different applicants and application reasons), a data report layer (objects concerned with reporting and presenting the data in a human-readable form) and a data archiving layer (objects concerned with storing the data for further processing)
3.3
Identification of the Principal Object Classes
An object is the abstraction of a physical entity of the requirements; it has state, behaviour and identity. An object class is a collection of one or more objects with a uniform set of attributes and services [2, 8]. All object classes together should therefore cover and meet all the functional requirements of the system. The following object classes are designed based on the requirement analysis of the passport application, and each object class realises certain requirements of the passport application. For example, the judgement object class is abstracted from the judgment requirement and executes the judgement function in the passport application process. These object classes are depicted as named rectangles with two sections: the object attributes are listed in the top section, and the operations are set out in the bottom section.
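As an illustration of one such class, with attributes in the top section and operations in the bottom section, a judgement class could be sketched as follows; the attribute and operation names are ours, inferred from the use cases, and are not taken from the paper.

```python
# Hypothetical sketch of one principal object class of the passport
# application system; attribute and operation names are illustrative only.
class Judgement:
    def __init__(self, application_id: str):
        # attributes (the "top section" of the UML class rectangle)
        self.application_id = application_id
        self.documents_complete = False
        self.verdict = None            # "ok", "illegal" or None

    # operations (the "bottom section" of the UML class rectangle)
    def compare(self, document_data: dict, id_record: dict) -> bool:
        self.documents_complete = all(document_data.get(k) == v
                                      for k, v in id_record.items())
        return self.documents_complete

    def judge(self) -> str:
        self.verdict = "ok" if self.documents_complete else "illegal"
        return self.verdict

j = Judgement("APP-001")
j.compare({"name": "A. Applicant"}, {"name": "A. Applicant"})
print(j.judge())   # ok
```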
3.4
Development of Design Models
The system model includes static and dynamic models. The typical static model is the subsystem model, and the typical dynamic models include both sequence models and state machine models (or state charts).

3.4.1 Subsystem Models
Subsystem models are static models, which reflect the structures and associations of the subsystems. Each subsystem is an encapsulation of object classes or data stores. Most subsystems are associated with each other. For example, the instrument subsystem is associated with the judgment, passport issuing, receipt and document collection subsystems.

3.4.2 Sequence Models
Sequence models are dynamic models, which show the sequence of object interactions. The sequence model is depicted in Fig. 3. In Fig. 3, the objects involved in the interaction are arranged horizontally, the dashed vertical lines represent time progressing downwards, the thin rectangles represent the time when the object is the controlling object in the system, small arrows indicate messages for which the sender does not expect a reply, the dotted arrows indicate a return of control, and the labelled arrows represent the interactions between objects.

Fig. 3. Sequence model of the passport application system (interactions among the Controller, Issuing, Judgment and Document Data objects, with messages such as Acknowledge(), Report(), Summarise() and Send())
3.4.3 State Machine Models (State Charts)
State machine models are also dynamic models, which show how individual objects change their state in response to messages or events. In other words, the change of state is the result of an object class reacting when it sends or receives various messages. In the passport application system, typical state charts are those of the issuing, receipt, document collection and archival object classes. For example, the state chart for judgement is depicted in Fig. 4.
Fig. 4. State chart for judgment (states include Shutdown, Waiting, Comparing, Judging, Summarising and Transmitting, with transitions triggered by events such as Startup(), Compare(), Comparison Complete, Judge(), Judgment Complete, Document Summary Complete and Transmission Done)
In Fig. 4, a rectangle with rounded corners represents a state of the object, an arrow represents the change from one state to another in response to an event, and the arrow with a black blob indicates that the "shutdown" state is the initial state.
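A minimal sketch of the judgement state chart as a transition table is given below; the states and events are those recoverable from Fig. 4, but the exact wiring between them is our reading and may differ from the original figure.

```python
# Sketch of the judgement object's state chart as a transition table.
# The states and events come from Fig. 4; the precise wiring is our
# interpretation and may not match the original diagram exactly.
TRANSITIONS = {
    ("Shutdown", "startup"): "Waiting",
    ("Waiting", "compare"): "Comparing",
    ("Comparing", "comparison complete"): "Judging",
    ("Judging", "judgment complete"): "Summarising",
    ("Summarising", "summary complete"): "Transmitting",
    ("Transmitting", "transmission done"): "Waiting",
    ("Waiting", "shutdown"): "Shutdown",
}

def run(events, state="Shutdown"):
    for event in events:
        state = TRANSITIONS.get((state, event), state)  # ignore invalid events
    return state

print(run(["startup", "compare", "comparison complete"]))   # Judging
```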
3.5 Specification of Object Interfaces
Specification of the interfaces focuses on the accesses to an object or to a group of objects. After the interfaces have been specified, the objects and components can be designed in parallel, so this is an important part of the design process. The main interfaces of the objects in the passport application system are depicted in Fig. 5. In Fig. 5, the arrows labelled with conditions represent the interfaces between the various object classes, and the two plain lines indicate the associations among the instrument, issuing and collection objects.

Fig. 5. The interfaces between object classes (Controller, Receipt, Collection, Judgment, Issuing, Illegality, Archiving and Instrument, with conditions such as "suspecting of application documents", "returning application", "material is full", "is ok" and "is illegal")
4
Conclusion
E-government is the new development trend for government in the information age. Because of the intricacy and volatility of e-government systems, a development method for e-government systems must meet special requirements such as the decomposition of a complex system and the evolution of an existing e-government system. The OOD method is a suitable choice. OOD has three distinctive advantages: (1) OOD makes systems easier to maintain, since changing the internal details of an object does not affect any other system objects and introducing new objects has no significant effect on the rest of the system; (2) OOD facilitates the reuse of objects and classes in future and other systems; and (3) objects are direct mappings of physical entities, so understanding and analysing the systems becomes simpler. These advantages address exactly the difficulties in developing e-government systems.
References [1] [2] [3] [4] [5] [6] [7] [8] [9]
Holliday, I.: Building e-government in East and Southeast Asia: regional rhetoric and national (In) action. Public Administration and Development, Volume 22, Issue 4, 2002, pp. 323-335 Sommerville, I.: Software Engineering (6th Edition). Addison-Wesley, 2000 Layne, K. and Lee, J.: Developing fully functional e-government: A four stage model. Government Information Quarterly, Issue 18, 2001, pp. 122-136 Elmagarmid, A. K. and McIver, W. J., Jr.: The ongoing march toward digital government. Computer, Volume34, Issue 2, February 2001, pp. 32 -38 Strejcek, G. and Theil, M.: Technology push, legislation pull? E-government in the European Union. Decision Support System, Issue 34, 2002, pp. 305-313 Marchionini, G., Samet, H. and Brandt, L.: Digital government: introduction. Communications of the ACM, January 2003, Volume 46, Issue 1, pp. 24-27 Barnum, G.: Availability, access, authenticity, and persistence: creating the environment for permanent public access to electronic government information. Government Information Quarterly, Issue 19, 2002, pp. 37-43 Yourdon, E.: Object-Oriented Systems Design: An Integrated Approach. Prentice-Hall International, Inc, 1994 http://www.passports.gov.uk/index.htm, https://www.passport-application.gov.uk/
A Practical Study of Some Virtual Sensor Methods as Part of Data Fusion
Jouni Muranen, Riitta Penttinen, Ari J Joki, and Jouko Saikkonen
Finnish Air Force, P.O. Box 30, FIN-41161 Tikkakoski, Finland
Abstract. This practical study compares two virtual sensor methods used in data fusion. Both methods have their own advantages, but the differences between them are noticeable. A certain part of a data fusion system had been implemented with an Expert System type of Case-Based Reasoning (CBR) approach. In this study the CBR part, which has some elements of fuzzy logic, was replaced with a pure fuzzy reasoning system. The simulation results are striking.
Keywords: CBR, fuzzy logic, expert systems, data fusion
1
Introduction
A virtual sensor is an algorithm that manages, computes, and calculates values from the data of real sensors. The basic idea is to deduce new items of data, virtual measurements, which would be too expensive, difficult, or impossible to measure otherwise. From this point of view, some parts of the data fusion problem can be considered a virtual sensor. This paper discusses two different kinds of virtual sensor approaches used in a simulation application. The first is a Case-Based Reasoning (CBR) type Expert System that includes some Fuzzy Logic parts, while the second approach is pure Fuzzy Logic (FL). The paper shows the main characteristics of both methods. The CBR approach is interesting from the point of view of combining different methods of reasoning, even when the target application is not the best fit for it. The FL approach, with its quick development cycle, is in its own field; this particular application is more suitable for it. The application area used in this paper is air situation picture generation based on sensor data in a simulation environment [9]. The aim of the virtual sensor is to compare the similarities of different kinds of electronic attributes in the sensor detection data. In ASP generation, all the data is fused in one way or another to form a concrete picture for the human operator of the situation in the field. This is done based on the data of different kinds of sensors. From the data, the locations of the aeroplanes, their speeds, and their identifications are computed and the result is presented to the operator.
2
Process Simulation
Air situation picture (ASP) generation is based on sensor measurement data. The data from the sensors is fused and used in the identification and classification of the objects in the ASP. This is a typical data fusion application where data from various sources is fused together [1, 2, 6]. In terms of data fusion, ASP generation does not differ much from, for example, a paper mill automation system, where different kinds of measurements are fused to give the operator the best possible view of the process. In ASP generation, as well as in other areas of control engineering, interest in new methods has arisen [3, 4, 8, 10]. In ASP as well as in other control processes, the aim is a fusion of measured data. This data always has some specific attributes depending on the objects it describes. In ASP, these attributes are electronic and physical characteristics measured from the target, such as the frequency, modulation, pulse width and polarisation of the measured signal. These attributes can be used in the identification of the target or of the process situation. Data fusion is thus key to estimating correctly the behaviour of targets or processes [2, 5, 7, 11]. In attribute data fusion for ASP, the commonly used methods are based on probability theory, Bayes networks, or Dempster-Shafer reasoning. FL and combined methods are thus an interesting field. In this simulation, electronic attributes are used to determine whether a new signal is from the same situation or object as the earlier ones or as other current signals. Thus, the similarity of attributes is 'measured'. This paper focuses on aspects of the similarity of these attributes.
3
Target Process
The similarity determines whether two signals are from the same object. In this case the attributes are the pulse repetition interval (PRI), frequency (FRQ) and pulse width (PW) found in the signals. The same attributes with the same parameters are used in both approaches, CBR and FL. In reality there is, of course, a larger set of possible attributes, as in any other process. The structure of the system is the same in each case. The three inputs are used as proximities: PRI proximity with the options weak, medium and strong, FRQ proximity as weak, medium or strong, and PW proximity as weak or strong. There is one output, OVERALL, with the values 'impossible', 'very little chance', 'may be', 'very probably', and 'strictly'. The rule base with expert rules is always handled as a Mamdani-type system with minimum as the AND-method and maximum as the aggregation method. The defuzzification method is centroid.
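A compact, self-contained sketch of a Mamdani system with these characteristics (minimum as AND, maximum as aggregation, centroid defuzzification) is given below; the membership-function break-points and the two example rules are invented, since the paper does not publish the actual 18-rule base.

```python
# Minimal Mamdani sketch: minimum as AND, maximum as aggregation, centroid
# defuzzification.  The membership functions and the two rules shown are
# invented for illustration; the real system uses 18 expert rules.
import numpy as np

def tri(x, a, b, c):
    """Triangular membership function evaluated on the array x."""
    return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

u = np.linspace(0.0, 1.0, 201)   # universe of discourse of the OVERALL output

overall = {                      # five output terms (invented break-points)
    "impossible": tri(u, -0.25, 0.0, 0.25),
    "very little chance": tri(u, 0.0, 0.25, 0.5),
    "may be": tri(u, 0.25, 0.5, 0.75),
    "very probably": tri(u, 0.5, 0.75, 1.0),
    "strictly": tri(u, 0.75, 1.0, 1.25),
}
three_terms = {"weak": (-0.5, 0.0, 0.5), "medium": (0.0, 0.5, 1.0), "strong": (0.5, 1.0, 1.5)}
two_terms = {"weak": (-1.0, 0.0, 1.0), "strong": (0.0, 1.0, 2.0)}

def fuzzify(value, terms):
    """Fuzzify a crisp proximity value in [0, 1] against the given term set."""
    return {name: float(tri(np.array([value]), *abc)[0]) for name, abc in terms.items()}

def infer(pri, frq, pw):
    p, f, w = fuzzify(pri, three_terms), fuzzify(frq, three_terms), fuzzify(pw, two_terms)
    rules = [  # two illustrative rules out of the 18
        (min(p["strong"], f["strong"], w["strong"]), overall["strictly"]),
        (min(p["weak"], f["weak"], w["weak"]), overall["impossible"]),
    ]
    aggregated = np.zeros_like(u)
    for strength, term in rules:                 # clip each consequent, then max
        aggregated = np.maximum(aggregated, np.minimum(strength, term))
    if aggregated.sum() == 0.0:                  # no rule fired at all
        return 0.5
    return float((u * aggregated).sum() / aggregated.sum())   # centroid

print(round(infer(pri=0.9, frq=0.85, pw=0.95), 2))   # close to 1.0, i.e. 'strictly'
```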
CBR Expert System with Fuzzy Membership Functions
The reason for using the CBR expert system approach is the variability of process situations. The idea is to make sure that each situation is handled correctly. Additionally, the combination of expert systems, CBR and fuzzy logic is very interesting.
In this simulation the CBR cases, which are the possible process situations, are described by expert rule bases. Each case is determined based on, for example, whether or not certain data exist: thus, a case can be 'when data of all three attributes is available' or 'when only two attributes are available'. In the CBR approach 24 cases were determined. Each CBR case is described with a similar rule base. Of course, the rule bases could differ a lot, but in this case they do not. The rule bases are formed as expert rules, and thus each of them contains a rule for every possible combination of the input variations. This means that there are 18 rules in each rule base. Together, 24 cases with 18 rules each mean 432 rules, which has a strong effect on modifications and updates of the system: each change in the input definitions means that every rule must be checked. The problem with this solution is that in this application the cases are not very different, so the real need for separate handling does not exist. This solution does show, though, that it is also possible to combine the methods and get reasonable results.
3.2
Pure Fuzzy Logic
The CBR solution was replaced with FL, where the rule base includes only 18 rules. An important factor facilitating the replacement is that the differences between the cases of the original solution could be described in the rules themselves, so the case structure was not necessary. The tuning of the membership functions and the final formulation of the rule base were done with the assistance of experts. These modifications help the maintenance of the system: it is easier to understand and quicker to handle a small rule base than a huge set of rules. Thus, adjustments and modifications are simple.
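As a concrete illustration of this structure, the minimal Python sketch below implements a Mamdani system of the kind described in Section 3: minimum as the AND method, maximum for aggregation and centroid defuzzification. The normalised [0, 1] proximity scale, the membership-function breakpoints and the two example rules are assumptions made for illustration only; the actual 18 expert rules are not reproduced here.

import numpy as np

def tri(x, a, b, c):
    """Triangular membership function evaluated on a numpy grid or array x."""
    return np.maximum(np.minimum((x - a) / (b - a + 1e-9),
                                 (c - x) / (c - b + 1e-9)), 0.0)

# Proximity inputs are assumed to be normalised to [0, 1] (an assumption).
U = np.linspace(0.0, 1.0, 201)                     # universe for the OVERALL output
weak   = lambda v: tri(np.array([v]), -0.5, 0.0, 0.5)[0]
medium = lambda v: tri(np.array([v]),  0.0, 0.5, 1.0)[0]
strong = lambda v: tri(np.array([v]),  0.5, 1.0, 1.5)[0]

# Output fuzzy sets for OVERALL (five labels, evenly spaced here by assumption).
out_sets = {
    "impossible":         tri(U, -0.25, 0.00, 0.25),
    "very little chance": tri(U,  0.00, 0.25, 0.50),
    "may be":             tri(U,  0.25, 0.50, 0.75),
    "very probably":      tri(U,  0.50, 0.75, 1.00),
    "strictly":           tri(U,  0.75, 1.00, 1.25),
}

def infer(pri, frq, pw):
    # Two illustrative rules (hypothetical, not taken from the paper):
    # R1: IF PRI strong AND FRQ strong AND PW strong THEN OVERALL strictly
    # R2: IF PRI weak  AND FRQ weak                  THEN OVERALL impossible
    rules = [
        (min(strong(pri), strong(frq), strong(pw)), "strictly"),
        (min(weak(pri), weak(frq)),                 "impossible"),
    ]
    # Mamdani inference: clip each consequent by its firing strength (min),
    # aggregate the clipped sets by max, then defuzzify by centroid.
    agg = np.zeros_like(U)
    for strength, label in rules:
        agg = np.maximum(agg, np.minimum(strength, out_sets[label]))
    return float(np.sum(U * agg) / (np.sum(agg) + 1e-9))

print(infer(pri=0.9, frq=0.85, pw=0.8))   # strong proximities give a high OVERALL value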
4
Simulations
Both approaches were tested with similar simulations. Tests were done using ten scenarios, each with a duration of 2 hours of real time. The long duration covered the statistical requirements of the simulations. Both approaches were used and analysed with respect to certain key figures. The true paths of the objects (which correspond, for example, to the true situation in process control) were compared with the results of both approaches to see how well the data fusion systems were able to resolve the situation. Other analyses were also performed. There were 1-5 flying targets in scenarios 1-8 and more than ten in scenarios 9 and 10; the two groups are therefore shown in separate figures. Scenarios 7 and 8 included different types of sensors.
Fig. 1. Total coverage of paths. Black is for FL and grey for CBR. The results of FL are at least as good as with CBR. Scenarios 9 and 10 are shown in a graph of their own. [9] (Bar charts omitted; axes: Total Coverage in % vs. Scenario.)

Fig. 2. Angular track coverage. Results of FL (black) are still at least as good as CBR (grey). In scenarios 9 and 10 the difference is much higher. [9] (Bar charts omitted; axes: Angular Track Coverage in % vs. Scenario.)
Figures 1 and 2 illustrate the analysis results of the scenarios. Figure 1 shows the percentage of proximity between the true paths and the estimated paths; 100% means that the estimated paths are in exactly the correct places. Figure 2 shows the proximity between angular tracks and true paths in the same way. Both figures show that FL gives results at least as good as CBR. Different sensors in scenarios 6 and 7 improve the results, as seen in the figures. The results of scenarios 9 and 10 are notable: in scenario 9 the fuzzy system gives results that are up to 20% better.
Fig. 3. Difference in the amount of false tracks. The difference is more than 20%. FL (black) gives more reliable results than CBR (grey). [9] (Bar chart omitted; y-axis: number of false tracks, scenarios 9 and 10.)
Fig. 4. Heading error of angular tracks of the system with FL (black) and CBR (grey). The error is shown here in degrees of arc. [9] (Bar chart omitted; y-axis: heading error in degrees, scenarios 9 and 10.)
FL has given very interesting results by minimising the number of rules, thus making maintenance more manageable, while the CBR approach has shown that combinations of different methods can be used for problems of this kind.
5
Results
The results of these simulations were very interesting. The difference between the two approaches considered was remarkable and promising. Three quite different conclusions can be made based on these simulations: 1) fuzzy logic is a very good option for this kind of task, 2) a combination of different methods is a possible solution, and 3) even a small change in a larger system might have a large effect on the result. It is known that FL systems are generally easy to maintain and understand. This was the main reason for choosing FL as the alternative to be tested. The hope was that the results would be at almost the same level as with the CBR approach, so results that were as good, and even better, were a pleasant surprise. On the other hand, the difference in the results was not large enough to cause outright rejection of CBR, but it was enough to show a need for more studies of combinations. The effects of the modifications motivated continuing the study of small parts of the system in order to improve the functionality of the whole.
6
Conclusion
Air situation picture generation needs plenty of different kinds of algorithms before the final product is reached. In this study the aim was to compare two different methods for a small part of ASP generation and to find out the characteristics of both. The object process of the study was the testing of the similarity of electronic attributes of detections. This information is used in deciding whether the detections are from the same target.
The results of the simulations were interesting and motivating: they showed the strength of the fuzzy system but did not totally reject the CBR method. They also showed the effect of small parts on the large system as a whole: even a small change in a small part has an effect; it is not a drop in the ocean.
Acknowledgements We would like to thank Professor Hannu Koivisto for his expertise during the study as well as our colleague researcher Tuomas Silvennoinen for his assistance in the area.
References
[1] Bar-Shalom, Y., 1988. Tracking and Data Association. Academic Press, Inc., San Diego, California. 347 p.
[2] Blackman, S. S., 1986. Multiple-Target Tracking with Radar Applications. Artech House Inc., 610 Washington Street, Dedham. 441 p.
[3] Bonissone, P. P., Chen, Y.-T., Goebel, K., Khedkar, P. S. Hybrid Soft Computing Systems: Industrial and Commercial Applications. GE Corporate Research and Development, One Research Circle, Niskayuna, NY, USA. 29 p.
[4] Buede, D. M., Waltz, E. L., 1989. Benefits of Soft Sensors and Probabilistic Fusion. Signal and Data Processing of Small Targets, Proceedings of SPIE - The International Society for Optical Engineering, 27-29 March 1989, Orlando, FL. SPIE.
[5] Dailey, D. J., Harn, P., Lin, P.-J., 1996. ITS Data Fusion. Technical Report WA-RD 410.1, University of Washington, Seattle, Washington. 98 p.
[6] Hiirsalmi, M., Kotsakis, E., Pesonen, A., Wolski, A., 2000. Discovery of Fuzzy Models from Observation Data. Helsinki, VTT Information Technology, Research Report TTE1-2000-43. 36 p.
[7] Hovanessian, S. A., 1988. Introduction to Sensor Systems. Artech House Inc., 685 Canton Street, Norwood. 299 p.
[8] Koutsoukos, X., Zhao, F., Haussecker, H., Reich, J., Cheung, P. Fault Modeling for Monitoring and Diagnosis of Sensor-Rich Hybrid Systems. Xerox Palo Alto Research Center, Palo Alto. 9 p.
[9] Muranen, J., 2002. Virtuaalianturointi sumean laskennan menetelmin (in Finnish; Fuzzy logic based virtual sensor). Master of Science thesis, Tampere University of Technology. 73 p.
[10] Oosterom, M., Babuska, R. Virtual Sensor for Fault Detection and Isolation in Flight Control Systems - Fuzzy Modeling Approach. Delft, Delft University of Technology. 6 p.
[11] Syrjärinne, J., 1998. Data Fusion in Passive Multisensor Tracking. Thesis for the degree of Licentiate of Technology. Tampere, Tampereen Teknillinen Korkeakoulu, Sähkötekniikan osasto. 132 p.
A Mamdani Model to Predict the Weighted Joint Density
Hakan A. Nefeslioglu¹, Candan Gokceoglu², Harun Sonmez²
¹ General Directorate of Mineral Research and Exploration, 06520, Ankara, Turkey. [email protected]
² Hacettepe University, Engineering Faculty, Department of Geological Engineering, Applied Geology Division, 06532, Beytepe, Ankara, Turkey. {cgokce, haruns}@hacettepe.edu.tr
Abstract. Estimating the block size is a major task for the quarry economy. Two approaches, volumetric joint count and weighted joint density, exist in the literature to assess the block size. However, due to the complex nature of discontinuities in rock masses, this parameter cannot always be predicted easily. Especially when working in rock masses having a wide discontinuity spacing, it is very difficult to perform a scanline survey. In this study, to overcome this difficulty, the photoanalysis method was considered to obtain the data required to construct a predictive model for weighted joint density. Considering the obtained data, a Mamdani fuzzy inference system was constructed and its performance was assessed. As a result, a model is proposed in the present study for predicting the weighted joint density.
1
Introduction
Owing to rapid urbanization, the need for natural building stones increases while the existing resources decrease. Moreover, the production of natural resources results in more or less environmental degradation. For this reason, to maximize the production and minimize the environmental degradation, complex engineering projects are required. One of the most important parameters governing the production of natural building stones is the block size. The sizes of the blocks are controlled by the number of joint sets and the spacing of the joint sets. Due to the complex nature of joints, some difficulties can be encountered in the direct determination of these parameters. For this reason, two approaches, volumetric joint count (Jv) and weighted joint density (wJd) [7], exist in the literature for the assessment of block size. According to [8], the weighted joint density method (Eq. 1) offers better
characterization of the degree of jointing and block size than the common methods in use today.

wJd = (1/√A) Σ (1/sin d) = (1/√A) Σ fi        (1)
where A is the size of the observed area in m² and d is the intersection angle. As seen in Eq. 1, the intersection angle is required to calculate wJd. However, obtaining this angle from high exposures with very steep inclinations is very difficult. To overcome this difficulty, the fi factor was proposed by the originator of the approach [8]. The aim of this study is the construction of a Mamdani fuzzy inference system to obtain wJd without using the intersection angle. For this purpose, the data needed were collected from two marble quarries.
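As a worked illustration of Eq. 1, the short sketch below computes wJd from a set of measured intersection angles. The 1/√A normalisation follows the reconstruction of Eq. 1 above, and the area and angle values are hypothetical rather than measurements from this study.

import math

def weighted_joint_density(area_m2, intersection_angles_deg):
    """wJd per Eq. (1): (1/sqrt(A)) * sum of 1/sin(d) over the joints
    intersecting an observation surface of area A (angles d in degrees)."""
    total = sum(1.0 / math.sin(math.radians(d)) for d in intersection_angles_deg)
    return total / math.sqrt(area_m2)

# Hypothetical example: a 20 m^2 observation window crossed by five joints.
print(weighted_joint_density(20.0, [90.0, 60.0, 45.0, 30.0, 75.0]))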
2
Data Collection
To obtain the data needed, an extensive field study was performed. However, the operating benches are too high (~6 m) and too steep (~90°) to measure directly. For this reason, the discontinuity data were extracted from photographs. According to [5], it is desirable to take colour photographs of the rock face and scanline, including a scale and an appropriate label, before commencing the sampling process. Besides, it is usually necessary to shoot upwards from the bottom of the face, which may be tilted back from the vertical by as much as 30° [9]. Considering these recommendations, the photographs were taken (Fig. 1a) and then scanned. All visible weakness planes were drawn using a digitizing program (Fig. 1b). In total, an area of 1854.72 m² was investigated. From the investigated area, a total discontinuity length of 1750.11 m was measured over the 84 photographs. The other important discontinuity parameter is the spacing; this parameter was also determined during the study. The statistical evaluations of the obtained parameters are given in Table 1.
Fig. 1. An example of a scanned colour photograph (a) and the determined discontinuities (b). (Photographs omitted; the scale is indicated by an A4-size sheet of paper.)
Table 1. Statistical evaluations of the discontinuity parameters

Parameter                        | N    | Min. | Max.   | Avg.  | Std. Dev. | Var.
Discontinuity length L (m)       | 1118 | 2.88 | 190.64 | 20.84 | 21.37     | 456.51
L/Area (m-1)                     | 84   | 0.28 | 2.05   | 0.665 | 0.367     | 0.135
Mean discontinuity spacing X (m) | 84   | 0.42 | 2.98   | 1.243 | 0.563     | 0.317
wJd                              | 84   | 0.65 | 10.52  | 3.714 | 2.106     | 4.436
3
Fuzzy Inference System
In the last two decades, an increase in the application of fuzzy sets to solve many rock mechanics and engineering geology problems has been observed, e.g. [2], [3], [4], [6], [10], because fuzzy models can cope with complex and ill-defined systems in a flexible and consistent way [1]. In fact, the problems related to rock masses are very complex, and the determination of the discontinuity characteristics of rock masses involves some uncertainties and difficulties, as mentioned in the previous paragraphs. Fuzzy set theory, introduced by [11], can be considered one of the tools for handling such uncertainties. In this study, the Mamdani fuzzy algorithm was taken into consideration to express wJd without the intersection angle for high and steep rock faces. As mentioned by [1], the Mamdani algorithm is perhaps the most appealing fuzzy method to employ in engineering geological problems. The general "if-then" rule structure of the Mamdani algorithm is given in Eq. 2:

Ri: if x1 is Ai1 and ... and xr is Air then y is Bi,   for i = 1, 2, ..., k        (2)
In the present study, the fuzzy inference system includes two inputs (L/A and X) and one output (wJd). To extract the input and output fuzzy sets, simple regression analyses were performed between each input and the output (Fig. 2a and b). The best relationship between planar intensity (L/A) and wJd is a power function (Fig. 2a), while that between mean discontinuity spacing (X) and wJd (Fig. 2b) is logarithmic. If both relationships had been linear, a multiple regression equation would have been considered. For this reason, only the fuzzy inference system was employed in the present study. The input and output fuzzy sets are given below:

Input (L/A):
Extremely Low (EL) = {0/0, 1/0, 1/0.283, 0/0.578}
Very Low (VL) = {0/0.283, 1/0.578, 0/0.872}
Low (L) = {0/0.578, 1/0.872, 0/1.167}
Medium (M) = {0/0.872, 1/1.167, 0/1.461}
High (H) = {0/1.167, 1/1.461, 0/1.755}
Very High (VH) = {0/1.461, 1/1.755, 0/2.049}
Extremely High (EH) = {0/1.755, 1/2.049, 1/2.500, 0/2.500}

Input (Mean X):
Extremely Low (EL) = {0/0, 1/0, 1/0.171, 0/0.300}
Very Low (VL) = {0/0.171, 1/0.300, 0/0.524}
Low (L) = {0/0.300, 1/0.524, 0/0.909}
Medium (M) = {0/0.524, 1/0.909, 0/1.571}
High (H) = {0/0.909, 1/1.571, 0/2.682}
Very High (VH) = {0/1.571, 1/2.682, 0/4.522}
Extremely High (EH) = {0/2.682, 1/4.522, 1/5.000, 0/5.000}

Output (wJd):
Extremely Low (EL) = {0/0, 1/0, 1/0.940, 0/1.992}
Very Low (VL) = {0/0.940, 1/1.992, 0/3.069}
Low (L) = {0/1.992, 1/3.069, 0/4.170}
Medium (M) = {0/3.069, 1/4.170, 0/5.282}
High (H) = {0/4.170, 1/5.282, 0/6.405}
Very High (VH) = {0/5.282, 1/6.405, 0/7.538}
Extremely High (EH) = {0/6.405, 1/7.538, 1/12.000, 0/12.000}

The second stage of the fuzzy inference system is the extraction of the "if-then" rules. All combinations of the inputs were considered and a total of 64 weighted (Fig. 3) "if-then" rules were extracted. Six of the 64 weighted rules are given below as examples:
if
L/A is EL
and
X is EH
then
wJd is VL
( 0.44 )
R11:
if
L/A is EL
and
X is EH
then
wJd is L
( 0.56 )
R19:
if
L/A is VL
and
X is VH
then
wJd is VL
( 0.71 )
R20:
if
L/A is VL
and
X is VH
then
wJd is L
( 0.29 )
R47:
if
L/A is H
and
X is M
then
wJd is H
( 0.69 )
R48:
if
L/A is H
and
X is M
then
wJd is VH
( 0.31 )
During the construction of the fuzzy inference system, 74 of the 84 data were used as the training data set while the remaining 10 were used as the checking data set. The root mean square error (RMSE) performance index was calculated for both the training and checking data sets to assess the performance of the constructed fuzzy inference system. The RMSE indices were calculated as 1.457 and 1.470 for the training and checking data sets, respectively. These results revealed that the fuzzy inference system has a high prediction performance.
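For reference, the RMSE index reported above can be computed as in the following sketch; the measured and predicted wJd values shown are hypothetical and serve only to illustrate the calculation.

import numpy as np

def rmse(measured, predicted):
    """Root mean square error performance index."""
    measured, predicted = np.asarray(measured, float), np.asarray(predicted, float)
    return float(np.sqrt(np.mean((measured - predicted) ** 2)))

# Hypothetical measured vs. predicted wJd values for a small checking set.
measured_wjd  = [3.2, 5.1, 0.9, 7.4, 2.6]
predicted_wjd = [2.8, 5.9, 1.5, 6.2, 3.0]
print(rmse(measured_wjd, predicted_wjd))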
Fig. 2. Inputs and output of the fuzzy inference system: (a) L/A - wJd and (b) Mean X - wJd. (Plots omitted. The fitted regressions are wJd = 3.5447(L/A)^1.0518 with r = 0.73 in (a), and wJd = -2.014 Ln(X) + 3.9788 with r = 0.41 in (b); the membership functions EL, VL, L, M, H, VH and EH for each variable are drawn along the axes.)
4
Results and Conclusions
The following results and conclusions can be drawn from the present study:
a) Considering the difficulties encountered when studying steep and high rock faces, the photoanalysis method was employed and a total area of 1854.72 m² was investigated. This study showed that the photoanalysis method is very effective when studying rock masses having a wide discontinuity spacing.
b) A fuzzy inference system to predict the weighted joint density was constructed and its performance was checked. The inference system exhibited a high performance. This will provide an efficient tool for mine planners when calculating the block size.
Fig. 3. General principle of the extraction of the weighted rules employed in this study. (Diagram omitted; it relates a value X lying between the class centres X_L and X_M on the wJd axis, with weight for the low class = 1 - (X - X_L)/S_wJd and weight for the moderate class = 1 - (X - X_M)/S_wJd.)
References
[1] Alvarez Grima, M., 2000. Neuro-fuzzy Modeling in Engineering Geology. A.A. Balkema, Rotterdam, 244 p.
[2] Alvarez Grima, M., Babuska, R., 1999. Fuzzy model for the prediction of unconfined compressive strength of rock samples. International Journal of Rock Mechanics and Mining Science 36, 339-349.
[3] den Hartog, M.H., Babuska, R., Deketh, H.J.R., Alvarez Grima, M., Verhoef, P.N.W., Verbruggen, H.B., 1996. Knowledge-based fuzzy model for performance prediction of a rock-cutting trencher. International Journal of Approximate Reasoning 16, 43-66.
[4] Gokceoglu, C., 2002. A fuzzy triangular chart to predict the uniaxial compressive strength of Ankara agglomerates from their petrographic composition. Engineering Geology 66, 39-51.
[5] Hudson, J.A., Priest, S.D., 1979. Discontinuities and rock mass geometry. International Journal of Rock Mechanics, Mining Science and Geomechanics Abstracts 16, 339-362.
[6] Nguyen, V.U., 1985. Some fuzzy set applications in mining geomechanics. International Journal of Rock Mechanics and Mining Science 22, 369-379.
[7] Palmström, A., 1995. RMi - a rock mass characterization system for rock engineering purposes. PhD thesis, Oslo University, Norway, 400 p.
[8] Palmström, A., 1996. The weighted joint density method leads to improved characterization of jointing. Conference on Recent Advances in Tunnelling Technology, New Delhi, 1-6.
[9] Priest, S.D., 1993. Discontinuity Analysis for Rock Engineering. Chapman & Hall Inc., London, 473 p.
[10] Sonmez, H., Gokceoglu, C., Ulusay, R., 2003. An application of fuzzy sets to the Geological Strength Index (GSI) system used in rock engineering. Engineering Applications of Artificial Intelligence (in press).
[11] Zadeh, L.A., 1965. Fuzzy sets. Information and Control 8, 338-353.
Mining Spatial Rules by Finding Empty Intervals in Data
Alexandr Savinov
Fraunhofer Institute for Autonomous Intelligent Systems, Schloss Birlinghoven, Sankt-Augustin, D-53754 Germany. [email protected], http://www.ais.fhg.de/~savinov/
Abstract. Most rule induction algorithms, including those for association rule mining, use high support as one of the main measures of interestingness. In this paper we follow the opposite approach and describe an algorithm, called Optimist, which finds all largest empty intervals in the data and then transforms them into multiple-valued rules. It is demonstrated how this algorithm can be applied to mining spatial rules where the data involves both geographic and thematic properties. The data preparation (spatial feature generation), data analysis and knowledge postprocessing stages were implemented in the SPIN! spatial data mining system, of which this algorithm is one component.
1
Introduction
The conventional approach to rule induction consists in finding anomalies by searching for intervals with surprisingly high values of the probability distribution representing the data semantics, and the larger such an interval the better. For instance, in association rule mining patterns are generated in the form of itemsets and their interestingness is measured by support (the number of objects satisfying both condition and conclusion) and confidence (the number of objects satisfying the rule consequent among those satisfying the antecedent). For example, we might infer a rule where some item, e.g., high long-term illness, under some conditions has 99% confidence. Implicitly this means that other (mutually exclusive) items, e.g., medium and low long-term illness, have a much lower probability, very close to 0. Thus the rule semantics can be reformulated as the incompatibility of some target values with the items in the condition. Interesting rules can then be generated by finding item combinations that never occur in the data. The goal is still to find some kind of anomalous behaviour, but the main distinction from traditional approaches is that we are trying to find empty areas instead of high-frequency areas in the data. A related approach to mining association rules based on this principle is described in [4-7], where empty intervals among numeric attributes are called holes in data. The holes are found using algorithms [8-9] from computational geometry. In this paper we apply an original rule induction
algorithm, called Optimist [1-3], which works with finite-valued attributes and generates rules in one pass through the data set by using a method of sectioned vectors. It is estimated that 80% of data are geo-referenced, and recently the spatial data mining area has received significant attention. In particular, spatial rule induction offers great potential benefits for solving the problem of spatial intelligent data analysis. In this paper we describe how the rule induction algorithm based on finding largest empty intervals in data can be applied to spatial data analysis. Since the analysis itself is known to take only a small portion of the whole knowledge discovery process, while such tasks as data preparation and postprocessing take most of the time, we integrated our rule induction algorithm into the SPIN! spatial analysis system. The SPIN! system integrates several data mining methods adapted to the analysis of spatial data, e.g., multi-relational subgroup discovery and spatial cluster analysis, and combines them with thematic mapping functionality for visual data exploration, thus offering an integrated environment for spatial data analysis [10-11].
2
Generating Largest Empty Intervals
Attributes x1, x2, ..., xn are assumed to take a finite number of values ni from their domains Ai = {ai1, ai2, ..., aini}. All combinations of values ω = ⟨x1, x2, ..., xn⟩ ∈ Ω = A1 × A2 × ... × An form the state space, or universe of discourse. Each record from a data set corresponds to one combination of attribute values, or a point. If a record exists in the data set for a combination of values, then the point is said to be possible; otherwise the point is impossible. To represent the data semantics as a Boolean distribution over the universe of discourse, we use the method of sectioned vectors and matrices [2,3]. The idea of the method is that one vector can represent a multidimensional interval of possible or impossible points (also called positive and negative intervals, respectively). Each vector consists of 0s and 1s, which are grouped into sections separated by dots and corresponding to the attributes. A section consists of ni components corresponding to the attribute's values. For example, 01.010.0101 is a sectioned vector for three attributes taking 2, 3 and 4 values. A sectioned vector associates n components with each point from Ω (one from each section). The position of these components in the vector corresponds to the point coordinates. To represent negative intervals we use the disjunctive interpretation of a sectioned vector. It means that a point is assigned 0 if all of its components in the vector are 0s, and it is assigned 1 if at least one of the components is 1. For example, the point ⟨a11, a21, a31⟩ is impossible according to the above vector semantics, while the point ⟨a11, a22, a33⟩ is possible since the component corresponding to a22 is equal to 1. The idea of the algorithm for finding the largest empty intervals consists in representing the data semantics by a set of negative sectioned vectors and updating it for each record. Initially the data is represented by a single empty interval consisting of all 0s, which makes all points impossible. After the first record is added, this interval is split into several smaller negative intervals so that the point corresponding to the record becomes possible. For example, addition of the record 01.001.0001 (where the 1s correspond to its
values) to the interval 00.010.0100 splits it into three new intervals: 01.010.0100, 00.011.0100 and 00.010.0101 (in each new interval one section has acquired an additional 1 from the record). During this procedure, very small intervals with a lot of 1s are removed, since they generate very specific rules, and only the top set of the largest intervals is kept. Once the largest empty intervals have been found, they can easily be transformed into rules by negating the sections that should be in the antecedent. For example, the vector {0,1} ∨ {0,1,0} ∨ {0,1,0,1} can be transformed into the implication {1,0} ∧ {1,0,1} → {0,1,0,1}, interpreted as the rule IF x1 = {a11} AND x2 = {a21, a23} THEN x3 = {a32, a34}. The rules are then filled in with statistical information in the form of the target value frequencies within the rule condition interval (for one additional pass through the data set). In other words, each value in the conclusion is assigned its frequency within the condition interval, e.g., IF x1 = {a11} AND x2 = {a21, a23} THEN x3 = {a32: 145, a34: 178}, which is obviously more expressive. Here 145 means that the value a32 occurs 145 times within the selected interval.
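A minimal Python sketch of this negative-interval update is given below. It represents each sectioned vector as a tuple of 0/1 tuples and splits every interval that still makes the new record impossible; the pruning threshold is a simplified stand-in for keeping only the top set of the largest intervals, and the sketch illustrates the idea rather than the actual Optimist implementation.

def is_impossible(record, interval):
    """A point is impossible if all of its components in the vector are 0."""
    return all(section[value] == 0 for section, value in zip(interval, record))

def add_record(intervals, record, max_ones=None):
    """Split each negative interval so the point given by `record` becomes possible."""
    result = []
    for interval in intervals:
        if not is_impossible(record, interval):
            result.append(interval)            # the record is already possible here
            continue
        for i, value in enumerate(record):     # one new interval per section
            new = list(interval)
            section = list(new[i])
            section[value] = 1                 # open the record's value in section i
            new[i] = tuple(section)
            new_interval = tuple(new)
            ones = sum(sum(s) for s in new_interval)
            if max_ones is None or ones <= max_ones:   # drop overly specific intervals
                result.append(new_interval)
    return result

# Three attributes with 2, 3 and 4 values; start from the all-zero interval.
intervals = [((0, 0), (0, 0, 0), (0, 0, 0, 0))]
intervals = add_record(intervals, record=(1, 2, 3))    # record 01.001.0001
for iv in intervals:
    print(iv)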
3
Mining Interesting Spatial Rules
The Optimist algorithm has been implemented as one of the SPIN! spatial data mining system components [10-11] (Fig. 1). It is tuned by a set of algorithm parameters, such as the maximal number of patterns (empty intervals) and execution on the client or on the server. Input data for the algorithm is specified by a standard SPIN! query component, which uses a separate connection component to access a database. The spatial rules generated by the algorithm are stored in a rule base component. When appropriately connected, this minimal set of components implements the conventional knowledge discovery cycle. The analysis starts from specifying the database and a query which can produce data in the necessary format. In our case we need the data columns to take only a finite number of values. Since most of the source data had continuous attributes, we applied the SPIN! optimal discretization algorithm [12]. Once the columns have been discretized, it is necessary to generate spatial attributes. For this purpose we used the spatial functionality of the Oracle 9i database, where objects are represented by means of a special built-in geometry type. Using such a representation, a query can combine spatial information with thematic data describing the objects located in space. It is important that various spatial properties can be generated automatically by the database with the help of spatial predicates and relationships. We used UK 1991 census data for Stockport, one of the ten districts in Greater Manchester, UK (all data are provided by Manchester University and Manchester Metropolitan University). The analysis was carried out at the level of enumeration districts (the lowest level of aggregation), characterized by such attributes as persons per household, cars per household, migration, long-term illness, unemployment and other census statistics. Spatial information was available as the coordinates and borders of such objects as enumeration districts, water, roads, streets, railways and bus stops.
Fig. 1. Visualization of spatial rules simultaneously and interactively with the map and other views in the SPIN! system. As one rule is selected in the upper right view all objects satisfying its condition are dynamically highlighted on the map in the lower right window
For a typical analysis, we might be interested in finding dependencies among different thematic and spatial attributes, for example, what spatial and non-spatial factors influence long-term illness. As a spatial characteristic we define an attribute which counts the number of water resources belonging to each enumeration district, calculated by means of an SQL statement with a spatial join. The final result set produced by the SQL query is a normal table, which can be directly analyzed by the Optimist rule induction algorithm. Here is an example generated by such an analysis, where MARRIED is the percentage of married people and WATER_NUM_REL is a characteristic of the water resources in the enumeration district:

IF MARRIED (461) {high (46%) OR medium (53%)} AND WATER_NUM_REL (447) {low (58%) OR medium (41%)} THEN LONG_TERN_ILLNESS (358) {low (68%) OR medium (31%)}

The produced rules can be shown in their own window, where they can be studied in detail. However, the SPIN! system provides a much more powerful method by using linked displays and interactive visualisation functionality [13]. The idea is that objects described in one view can be simultaneously visualised in other views. In our case the rules describe enumeration districts, while these very districts can be simultaneously shown on the map. Moreover, as we select some rule, all objects which satisfy its left-hand side are dynamically highlighted on the map so that we can easily
see how they are spatially distributed (Fig. 1). For example, we might find that the enumeration districts satisfying some rule, and thus having interesting characteristics in terms of the target attribute, form a cluster or have a more complex spatial configuration, e.g., with respect to other geographic objects such as roads and cities.
4
Conclusion
In this paper we described an approach to mining spatial rules by finding the largest empty intervals in multidimensional space. The advantage of the algorithm is that it directly generates highly expressive multiple-valued rules in one pass over the data set (an additional pass is needed for generating the rule statistics). In particular, it does not require the data set to be in memory and hence can be applied to very large tables. Combined with additional data preprocessing and geographic visualisation components within the SPIN! spatial data mining system, it allows complex analysis of real-world data involving both spatial and non-spatial attributes.
References
[1] A.A. Savinov, Mining possibilistic set-valued rules by generating prime disjunctions, Proc. 3rd European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD'99), Prague, Czech Republic, September 15-18, 1999, 536-541.
[2] A.A. Savinov, Application of multi-dimensional fuzzy analysis to decision making. In: Advances in Soft Computing - Engineering Design and Manufacturing, R. Roy, T. Furuhashi and P.K. Chawdhry (eds.), Springer-Verlag, London, 1999.
[3] A.A. Savinov, An algorithm for induction of possibilistic set-valued rules by finding prime disjunctions. In: Soft Computing in Industrial Applications, Suzuki, Y., Ovaska, S.J., Furuhashi, T., Roy, R., Dote, Y. (eds.), Springer-Verlag, London, 2000.
[4] Bing Liu, Liang-Ping Ku and Wynne Hsu, "Discovering Interesting Holes in Data," Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence (IJCAI-97), pp. 930-935, August 23-29, 1997, Nagoya, Japan.
[5] Bing Liu, Ke Wang, Lai-Fun Mun and Xin-Zhi Qi, "Using Decision Tree Induction for Discovering Holes in Data," Pacific Rim International Conference on Artificial Intelligence (PRICAI-98), 182-193, 1998.
[6] Liang-Ping Ku, Bing Liu and Wynne Hsu, "Discovering Large Empty Maximal Hyper-rectangles in Multi-dimensional Space," Technical Report, Department of Information Systems and Computer Science (DCOMP), National University of Singapore, 1997.
[7] Jeff Edmonds, Jarek Gryz, Dongming Liang, and Renée J. Miller, Mining for Empty Rectangles in Large Data Sets. In Proceedings of the 8th International Conference on Database Theory (ICDT), London, UK, January 2001, pp. 174-188.
[8] M. Orlowski, A New Algorithm for the Largest Empty Rectangle Problem. Algorithmica, 5(1):65-73, 1990.
[9] B. Chazelle, R. L. Drysdale, and D. T. Lee, Computing the largest empty rectangle. SIAM J. Comput., 15:300-315, 1986.
[10] M. May, A. Savinov, An Architecture for the SPIN! Spatial Data Mining Platform, Proc. New Techniques and Technologies for Statistics, NTTS 2001, 467-472, Eurostat, 2001.
[11] M. May, A. Savinov, An integrated platform for spatial data mining and interactive visual analysis, Data Mining 2002, 25-27 September 2002, Bologna, Italy, 51-60.
[12] Andrienko, G., Andrienko, N., and Savinov, A., Choropleth Maps: Classification revisited. In Proceedings ICA 2001, Beijing, China, Vol. 2, pp. 1209-1219.
[13] Andrienko, N., Andrienko, G., Savinov, A., Voss, H., and Wettschereck, D., Exploratory Analysis of Spatial Data Using Interactive Maps and Data Mining, Cartography and Geographic Information Science 2001, v.28(3), pp. 151-165.
The Application of Virtual Reality to the Understanding and Treatment of Schizophrenia
Jennifer Tichon and Jasmine Banks
Center for Online Health, University of Queensland, Australia. [email protected]
Advanced Computational Modelling Centre, University of Queensland, Australia. [email protected]
Abstract. Virtual Reality (VR) techniques are increasingly being used in education about, and in the treatment of, certain types of mental illness. Research indicates VR is delivering on its promised potential to provide enhanced training and treatment outcomes through incorporation of this high-end technology. Schizophrenia is a mental disorder affecting 1-2% of the population. A significant research project being undertaken at the University of Queensland has constructed virtual environments that reproduce the phenomena experienced by patients who have psychosis. The VR environment will allow behavioral exposure therapies to be conducted with exactly controlled exposure stimuli and an expected reduction in risk of harm. This paper reports on the work of the project, previous stages of software development, and current and future educational and clinical applications of the virtual environments.
1
Introduction
Schizophrenia is a debilitating mental illness that often strikes people in their prime. Psychotic symptoms include delusions, hallucinations, and thought disorder [1]. Most people with these symptoms “hear” voices, and a large proportion also “see” visual illusions. At present patients have to describe their hallucinations, and often feel that their therapists cannot really understand them. Therapists themselves have difficulties learning about the exact nature of psychosis. Virtual reality (VR) provides a real option for translating a person's psychological experience into a 'real' experience others can share. VR techniques are increasingly being used in trial clinical programs and in the treatment of certain types of mental illness [4,5,6]. The ability of VR users to become immersed in virtual environments provides a potentially powerful tool for mental health professionals [2,3]. The University of Queensland research team have been collaborating for the past twelve months in the development of VR software to use in education, training and treatment of schizophrenia.
2
The VISAC Laboratory and Remote Visualisation
The Visualisation and Advanced Computing Laboratory (VISAC) at the University of Queensland consists of an immersive curved screen environment of 2.5 m radius providing a 150-degree field of view. Three projectors separated by 50 degrees are used to project the images onto the curved screen. The curved screen environment is suitable for small groups of people, e.g. patients and caregivers, to share the immersive experience. These facilities are managed by the Advanced Computational Modelling Centre (ACMC). The Centre has around 20 academic staff with expertise in scientific modelling, advanced computing, visualisation and bioinformatics (see http://www.acmc.uq.edu.au).
2.1
Stages of Software Development
Using the VISAC facilities, the project has modelled the experience of psychosis in virtual environments. The ultimate goal of the project is to develop VR software that patients can use, in conjunction with their therapist, to re-create their experiences and so allow enhanced monitoring of their illness. To ensure the re-created virtual world accurately models the patient's inner world, actual patients with schizophrenia have been interviewed and asked to describe their symptoms in detail. These descriptions have then been transcribed into models of the hallucinations that are personal to individual patients. Patients are interviewed for both feedback and evaluation on their individualised VR program. The project commenced in October 2001 and has since undergone a number of distinct development phases.
Phase 1. The initial work involved building a model of an everyday environment, in this case a living room, using a commercial 3D modelling package called Realax. The hallucinations modelled included: a face in a portrait morphing from one person into another and also changing its facial expression; a picture on the wall undergoing distortion; the walls of the room contracting and distorting so that the straight edges of the walls appeared curved; the blades of a ceiling fan dipping down; and the TV switching on and off of its own accord. In addition, a soundtrack of auditory hallucinations, provided by the pharmaceutical company Janssen-Cilag, was played in the background as the visual hallucinations occurred. This was in order to achieve a good approximation to the cacophony of voices that would be present for a patient who is attempting to concentrate on everyday tasks. This initial model was presented to a number of patients with schizophrenia. Feedback from patients was generally positive. However, due to the generic nature of the auditory and visual hallucinations portrayed, it was not possible to confirm how realistic the model was. A number of patients commented that they liked the concept of virtual hallucinations; however, they felt the hallucinations modelled did not actually relate to them.
Phase 2. The second phase of the project involved modelling the actual experiences described by one specific patient with schizophrenia. In this way, a model of psychosis would be built from the patient's perspective. The virtual environment was moved from a living room to a psychiatric ward.
Development of the virtual environment for Phase 2 comprised two main steps. The first step involved creating the model of the psychiatric ward, and models of static elements of the scene (e.g. furniture), using a 3D modelling package. The model was built from photographs of an actual psychiatric unit at the Royal Brisbane Hospital in Brisbane. The static models of the psychiatric ward and objects were saved as VRML files for inclusion in the main program. The second step involved writing the main program, which loads, positions and displays the static elements, and which also implements the dynamic parts of the scene, such as sounds and movements of objects. The software was written in C/C++, in conjunction with an open source, cross-platform scene graph technology. This method of implementation was chosen as it will allow us to eventually port the software from the current IRIX platform to a PC platform. This will enable the software to be used, for example, in a psychiatrist's office or in a hospital, making it more accessible to patients, caregivers and mental health workers. The software was designed so that the user is able to navigate around the scene using the mouse and keyboard, and various hallucinations are triggered either by proximity to objects, by pressing hot keys, or by clicking an object with the mouse. Hallucinations modelled included an apparition of the Virgin Mary, which would appear and "talk" to the patient; the word "Death" appearing to stand out of newspaper headlines; random flashes of light; a political speech, which changes to refer to the patient; choral music; and auditory hallucinations described by the patient. Feedback from the patient who described her personal hallucinations was positive. She reported the virtual environment was an "extraordinary experience" and "captured the essence of the experience". The patient also commented that the virtual environment was effective in re-creating the same emotions that she experienced on a day-to-day basis during her psychotic episodes.
3
Application of the Software in Education and Training
Virtual Reality (VR) has enormous potential for teaching students about the complex nature of schizophrenia. It has been used successfully in education to provide a more interactive learning experience for students [7]. The use of VR can provide students with first-hand knowledge of what hallucinations "feel" like, with the outcome that practitioners will empathise more readily with patients with psychosis. Empathy is recognised as an essential component of effective mental health care. It is related to the constructs of rapport and therapeutic alliance and forms part of the core clinical skills repertoire for health professionals. An important focus of training in psychiatry is to facilitate the development of empathy or rapport as a result of a practitioner's appropriate attention to the client. The first application of the VR psychosis software has been in the psychiatry classroom. A survey was conducted with students on their experiences with current teaching methods as compared to the use of VR as a teaching tool. 75% of the students indicated that the use of VR in the classroom assisted them to better understand schizophrenia. More importantly, 67% of the students reported that the use of VR clarified many of the ambiguities of schizophrenia not resolved for students by current
teaching methods and tools. Through open-ended questions, students commented that the most useful application of VR in their training was to provide them with "as close a first hand experience as one would like to get and students can also better empathise" with the people suffering from schizophrenia [8]. In essence, the application of VR technology to the education of medical students facilitates an improved understanding of the impact of the psychological experiences of patients with schizophrenia. Future research will apply a specifically developed measurement tool to determine statistically whether practitioners score significantly higher in empathy towards patients with psychosis after VR training than before VR training. Level of experience in mental health and exposure to other sources of psychotic experience are also expected to influence scale scores and will be entered as covariates.
4
Application of the Software to Clinical Environments
VR is also proposed as a new medium for cognitive behavior therapy for patients with schizophrenia. Through the processes of habituation and extinction, in which the feared stimuli cease to elicit behavioral and psychological responses, their meaning becomes less threatening [9]. The use of VR in therapy is predicted, by activating these responses and modifying them, to improve the symptoms of schizophrenia. This further stage of the VR and Schizophrenia Project will investigate the software as a tool with which cognitive-behavior therapy can be effectively delivered to patients with schizophrenia. The focus is on implementation and evaluation, as outlined in the following two stages:
4.1
Stage One: Case Reports
There is a growing body of literature suggesting that the use of VR in exposure therapy for specific phobias, including claustrophobia, acrophobia and arachnophobia, is effective [2, 6, 9]. To date, no research exploring the clinical use of VR in psychosis can be located. Case studies will be conducted in which patients will be exposed, during therapy sessions, to the specific hallucinations they have described. It is proposed that therapy will be conducted twice weekly, for a total of twenty sessions. This is based on the use of non-technology-assisted cognitive-behavioral therapy [10].
4.2
Stage Two: Controlled Outcome Data
A pretreatment assessment will be conducted for a VR augmented treatment group, a standard treatment group and a wait-list group. Immediately following the pretreatment assessment the first treatment session will be conducted in which patients will be familiarised with the VR equipment. Following this session participants in the treatment group will receive a pre-determined number of weekly individual treatment sessions (approximately 20) consisting of exposure to their individualised hallucinations. Full assessments will be conducted at pre and post-treatment for each of the three groups of participants. These will involve measurement of changes using the CPRS, MADRS & SANS measures.
5
Conclusion
Virtual Reality (VR) is a high-end multimedia tool that has enormous potential for teaching students about the complex nature of schizophrenia. Even during the early stages of this project it has been used successfully in psychiatry lectures to provide a more interactive learning experience for students [7]. Medical students surveyed have reported that the virtual schizophrenia environments provided them with first-hand knowledge of what hallucinations "feel" like. To date, no research exploring the clinical use of VR in psychosis can be located. It is expected that clinical trials will commence within the next twelve months. As outlined, the main aims of the project are the development of the virtual reality software for use in clinical environments and its design for delivery on consulting-room PCs. This project has the potential to have a significant impact on the field of psychiatry in both the assessment and the ongoing monitoring of patients with schizophrenia.
References
[1] Kaplan, H., Sadock, B. (2000) Comprehensive Textbook of Psychiatry, vols. 1-2. Lippincott Williams & Wilkins, Philadelphia.
[2] Hodges, L., Anderson, P., Burdea, G., Hoffman, H., Rothbaum, B. (2001) Treating Psychological and Physical Disorders with VR. IEEE Computer Graphics and Applications, Nov/Dec, pp. 25-33.
[3] Kahan, M. (2000) Integration of Psychodynamic and Cognitive-Behavioral Therapy in a Virtual Environment. Cyberpsychology & Behavior, 3, pp. 179-183.
[4] Riva, G., Gamberini, L. (2000) Virtual Reality in Telemedicine. Cyberpsychology & Behavior, 6, pp. 327-340.
[5] Riva, G. (2000) From Telehealth to E-Health: Internet and Distributed Virtual Reality in Health Care. Cyberpsychology & Behavior, 3, pp. 989-998.
[6] Anderson, P., Rothbaum, B., Hodges, L. (2001) Virtual Reality: Using the Virtual World to Improve Quality of Life in the Real World. Bulletin of the Menninger Clinic, 65, pp. 78-91.
[7] MacPherson, C. & Keppell, M. (1998) Virtual Reality: What is the state of play in education? Australian Journal of Educational Technology, 14, pp. 60-74.
[8] Tichon, J. & Loh, J. (2002) The Use of Virtual Environments in the Education and Training of Psychiatry. Unpublished Manuscript.
[9] Rothbaum, B.O., Hodges, L.F., Ready, D., Graap, K. & Alarcon, R.D. (2001) Virtual reality therapy for Vietnam veterans with posttraumatic stress disorder. Journal of Clinical Psychiatry, 62(8), pp. 617-622.
[10] Sensky, T., Turkington, D., Kingdon, D., Scott, J.L., Scott, J., Siddle, R., O'Carroll, M. & Barnes, T.R. (2000) A randomized controlled trial of cognitive-behavioral therapy for persistent symptoms in schizophrenia resistant to medication. Archives of General Psychiatry, 57(2), pp. 165-173.
Representation of Algorithmic Knowledge in Medical Information Systems
Yuriy Prokopchuk¹ and Vladimir Kostra²
¹ Chair of Information Technology and Cybernetics, Ukrainian State Chemical Technology University, 8 Gagarin ave., Dniepropetrovsk, 49005, Ukraine. http://cyberlab.iatp.org.ua, [email protected]
² Department of System Analysis and Control Problems, ITM, National Academy of Sciences of Ukraine, 15 Leshko-Popeljy st., Dniepropetrovsk, 49600, Ukraine. [email protected]
Abstract. This paper deals with the problems of representing computing, schematic and production knowledge in medical information systems. It is possible to select four levels of problems in the system where algorithmic knowledge can be used: the level of a separate field of the medical document; the level of the whole medical document; the level of the whole medical card of the patient; and the level of the whole active database. A description of a practical implementation of the proposed approach is given.
1
Introduction
The structure of algorithmic knowledge has in fact three conceptually isolated layers: computing, schematic and production knowledge, above which problem statements and the so-called knowledge processors are formed [1]. Each layer deals with its own set of concepts. The layer hierarchy is arranged so that the concepts of the production layer are unveiled via concepts of the schematic layer, which in turn are unveiled via concepts of the computing layer. Traditionally, the computing knowledge is represented as subroutine libraries with specifications. The schematic knowledge represents the whole set of concepts of interest to the developer (researcher) and needed for describing the structural features and characteristics of the blocks of mathematical models and their research algorithms. Thus, the scheme is free from details concerning the form of the machine representation of the information and the features of the functional relations expressed by formulas, equations and algorithms, which are grouped in the computing knowledge layer; it contains only references to concepts of the computing layer.
The production knowledge layer captures the experience of the developer as techniques for analysis, design and decision-making, and allows choosing the most suitable algorithms depending on the model features as well as the numerical values of the controlling parameters of these algorithms. It is possible to select four levels of problems in medical information systems (MIS) where algorithmic knowledge can be used [1-3].
Level 1. The level of a separate field of the medical document. There is often a need to perform defined calculations, for example to calculate the geometrical parameters of a given area, while filling the document field with some data frame.
Level 2. The level of the whole medical document. Some of the parameters appearing in the document can be entered manually (or by using a medical apparatus), and the others can be calculated according to defined algorithms based on the data of the first part. In the calculation, external (global) variables such as sex, age, body height and weight written in the patient's card can also be used.
Level 3. The level of the whole medical card of the patient. At this level the computing problems of diagnostics, prediction and choice of treatment tactics are worked out. To solve these problems the data of all documents of the medical card can be used.
Level 4. The level of the whole active database. At this level the research problems (searching for regularities, creation of an information image of an illness, etc.), statistics and precedent search are solved.
The above classification of the problems is itself a unit of production knowledge in the given data domain. This paper presents production, schematic and computing knowledge for the problems of the 1st and 2nd levels.
2
Statement of the Problem
The Problems of the 1st Level. Let the formalized professional language (FPL), representing a collection of lexical trees (LT) [2], be used for filling the text boxes of the document (each expert uses their own subset of FPL). An FPL fragment intended for input of the results of an ultrasonic study of a thyroid gland is given below:

Thyroid gland { (Right lobe __._ x __._ x __._ mm. Size ______ cubic cm.), (Isthmus __._ mm.), (Left lobe __._ x __._ x __._ mm. Size _______ cubic cm.), Ultrasound-CONTROL. Echo image without essential dynamics. ... }

In the fragment above, the Size (volume) parameter is calculated manually and substituted into the text. Let the lexical processor (LP) described in [2] be used for operating with the FPL. The LP allows navigating a lexical tree (LT) and choosing the necessary lexemes when the text box of the data frame is created.
Thus, it is necessary to upgrade the FPL and LP so that computing operations with a lexical tree are executed automatically (for example, calculation of the volume of a thyroid gland). Following the principle of constructing open systems, let us also require that the computing operations are part of the metadata, i.e. are accessible for editing at any instant.
The Problems of the 2nd Level. Let the metainformation containing the description of document templates (forms) and lexicon templates represent the quick reference of templates as well as a set of special database (DB) tables describing each template separately [3]. In such a table, each record contains the description of some field of the document: the field type, layout, input mask, input order, norm boundaries, and the name of the connected quick-reference or lexicon file. Let the Designer of the Data Domain (DDD) from [1] be used to create, upgrade and fill templates. Thus, it is necessary to expand the structure of the metainformation and to upgrade the document template frame and the DDD so that derived parameters and linguistic variables are calculated automatically in the course of filling the document. Besides, the computing operations should be part of the metadata, i.e. accessible for editing at any instant.
3
Solution of the Problem
Let's consider the representation of schematic and computing knowledge for the problems of the 1st level (the level of a separate field of the document).
FPL modernization. If, when operating with an LT, it is necessary to execute computing operations with any parameters or risk factors (RF), the character '#' followed by the RF type (a number) is put at the end of the lexeme; the type defines the level of generality and the character of the system response to the given RF. The RF title follows. An RF activation condition, delimited by the characters "(" and ")", may come next [3]. Examples of lexemes with RF:

Lexeme: |stable; paroxysmal| giddiness              RF: ! #3 giddiness
Lexeme: In current ___ |years; months|              RF: ! #0 Number #0 Period
Lexeme: Amount of labors ___, abortions ____        RF: ! #3 Labors (Labors > 3) #3 Abortions
Either the value "constant" or the value "paroxysmal" will be assigned to the RF "Giddiness", depending on the doctor's choice. The value of the RF "Labors" is equal to the number of labors, but this RF will be activated only if the number of labors is greater than 3. The RF "Abortions" will be activated in any case. By using macro substitutions (&Name), complicated computing operations can be realized with the help of scripts. RFs serve as the arguments of the scripts (in general, arbitrary parameters). The description of the scripts is located directly in the LT under the heading "Scripts". The described way of implementing the computing operations represents schematic knowledge.
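As a rough illustration of how such RF annotations could be processed, the Python sketch below parses markers of the form "#<type> <title> (<condition>)" and checks an activation condition against entered values. The regular expression, the use of eval for the condition, and the sample values are assumptions made for this example; the paper itself does not prescribe a parsing mechanism (Python is used here in place of the system's own lexicon format).

```python
import re

# Hypothetical parser for RF annotations of the form "#<type> <Title> (<condition>)".
RF_PATTERN = re.compile(
    r"#(?P<type>\d+)\s+(?P<title>[^(#]+?)\s*(?:\((?P<cond>[^)]*)\))?\s*(?=#|$)")

def parse_rf(annotation: str):
    """Extract (type, title, activation condition) triples from a lexeme annotation."""
    return [(int(m["type"]), m["title"].strip(), m["cond"])
            for m in RF_PATTERN.finditer(annotation)]

def is_active(cond, values):
    """Evaluate an activation condition such as 'Labors > 3' against entered values."""
    return True if cond is None else bool(eval(cond, {}, values))

if __name__ == "__main__":
    rfs = parse_rf("#3 Labors (Labors > 3) #3 Abortions")
    print(rfs)                                   # [(3, 'Labors', 'Labors > 3'), (3, 'Abortions', None)]
    print(is_active(rfs[0][2], {"Labors": 4}))   # True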
Table 1.
Field name                           | Data type | Size | Input field mask | Input method   | File name | Norm  | Identifier
Body surface area, m2                | String    | 4    | 9.99             | Calculation    |           |       | S
Interval R-R, s                      | String    | 3    | 9.99             |                |           |       | RR
Cardiac contraction freq., per min.  | String    | 3    | 999              | Calculation    |           | 60-72 | CCF
Research                             | String    | 15   |                  | Reference book | p_kr      |       |
MITRAL VALVE                         | String    | 0    |                  |                |           |       |
Affection                            | String    | 15   |                  | Lexicon        | s_ehokg5  |       |
Scripts (computing knowledge) may be coded in various ways. The variant given below resembles a dialect of the Pascal language; the specialized interpreter executes the script. The example concerns the calculation of the volume of a thyroid gland. The lexeme
The right lobe __._ x __._ x __._ cm. Volume &Volume
is annotated with the parameters
! #a #b #c #d #e #f
where the macro substitution &Volume invokes the corresponding script from the "Scripts" section of the LT.
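The script text itself has not survived extraction, so the following Python sketch only illustrates the kind of computation the &Volume macro performs. The correction factor 0.479 is the commonly used ellipsoid approximation for a thyroid lobe; it is an assumption here, since the paper does not state the formula, and the function names are illustrative (Python stands in for the Pascal-like script dialect).

```python
def lobe_volume(a_cm: float, b_cm: float, c_cm: float) -> float:
    """Approximate volume of one thyroid lobe in cubic cm (ellipsoid factor assumed)."""
    return 0.479 * a_cm * b_cm * c_cm

def thyroid_volume(right: tuple, left: tuple) -> float:
    """Total gland volume from the dimensions of both lobes (parameters #a..#f)."""
    return lobe_volume(*right) + lobe_volume(*left)

if __name__ == "__main__":
    # illustrative dimensions in cm
    print(round(thyroid_volume((4.5, 1.5, 1.8), (4.3, 1.4, 1.7)), 1))
```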
Let us consider the representation of schematic and computing knowledge for the problems of the 2nd level (the document level). The document template is modernized as follows: 1) a new field "Identifier" is added; 2) "Calculation" is added to the list of data input methods (previously there were only: Edited, Lexicon, Reference book). An example of a part of one of the tables is given in Table 1. If the input method is "Calculation", the value of the field is defined by the execution of some semantic operation. In general, a semantic operation may be: a selection of the necessary data from the DB; an entry of data into the DB; a logical or computing operation on the data. In the present work we consider only the last kind of semantic operation, i.e. operations of the form "if A then B". To store the semantic operations (computing knowledge) it is proposed to introduce a new object, a "file with scripts" (a DB table), into the structure of the metainformation; it is connected to the document template (the name of the file with scripts coincides with the name of the document template file). The fields of the database table for the storage of scripts are: 1) the identifier (it must coincide with the identifier in the document template); 2) the comment, a string describing the purpose of the identifier; 3) the program script representing the computing knowledge. The set of connected DB tables (document templates, lexicon files and files with scripts) represents the schematic knowledge. Note the relative independence of the schematic knowledge from the computing knowledge. In a general form, the computing knowledge stored in the file with scripts can be presented as shown in Table 2.
Table 2.
Identifier | Comment string    | Program script | The value returned in the template
Auto       | Body surface area | if A then B    | S
CCF        | CCF per min       | if X then Y    | CCF
The program realization of the semantic operations may differ considerably. In the implemented version the script is a set of text strings in a Pascal-like programming language. As in Pascal, the language demands an explicit declaration of all variables used. The specialized interpreter executes the script. The set of all scripts represents the computing knowledge. The introduction of a new type of metainformation (files with scripts) has required the development of a specialized knowledge processor (the semantic processor) [3]. The above-mentioned approach can also be used in operation with the specialized processors for inputting and processing the initial data (biosignal processors and image processors). With such an approach the primary-data processors are responsible only for extracting the significant parameters and for the visualization of signals and fields. The parameters are then transferred to the document template, where the intellectual processing that forms the final conclusion is carried out.
4
Conclusions
One of the obstacles to the wide use of the formalized professional language in automated doctors' workplaces has been the impossibility of performing computing operations while filling in the text sections of medical documents. The proposed technology removes this obstacle. Besides purely algorithmic operations, arbitrary semantic operations of data processing may be programmed in a similar way. A further advantage of the technology is the openness of both the lexicon and the scripts. In principle, the proposed realization of the metaknowledge and the semantic processor makes it possible to build any local expert system oriented to a concrete medical document. Research into other principles of realizing the algorithmic knowledge, for example with the help of neural networks, is of interest. Using the proposed approaches, a number of laboratory and hospital systems have been created and introduced in the Dniepropetrovsk Region (Ukraine).
References
[1] Prokopchuk, Yury A., Kostra, Vladimir V.: Intelligent Modules of Medical Information Systems. In: Proc. of the joint meeting of the 5th World Multiconference on Systemics, Cybernetics and Informatics (SCI2001) and the 7th International Conference on Information Systems Analysis and Synthesis (ISAS2001), Orlando, Florida, USA, July 2001. IIIS, Vol. II (2001) 205-208
[2] Prokopchuk, Y., Kostra, V.: The Architecture of Hospital Information Systems. In: Valafar (ed.): Proceedings of the 2001 International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences (METMBS'2001), June 25-28, 2001, Las Vegas, Nevada, USA. CSREA Press (2001) 197-200
[3] Prokopchuk, Yu. A., Kharchenko, O. A., Kostra, V. V., Khoroshilov, S. V.: Construction of Open Data Processing Systems in Medicine. In: Proc. of the 4th International Workshop on Biosignal Interpretation, Villa Olmo, Como, Italy, June 24-26, 2002, 475-476
A Case-Based Reasoning Approach to Business Failure Prediction Angela Y N Yip and Hepu Deng School of Business Information Technology, RMIT Business, GPO Box 2476V, Melbourne, Victoria 3001, Australia [email protected] [email protected]
Abstract. Tremendous efforts have been spent and numerous approaches have been developed for predicting business failures. However, none of the existing approaches is dominant with respect to the accuracy and reliability of the prediction outcome. Contradictory prediction results are often obtained when different approaches are used. Moreover, the explanation and justification of a prediction is often neglected. This paper reviews the different approaches and presents the framework of a case-based reasoning (CBR) approach to business failure prediction that integrates two techniques, namely nearest neighbor and induction. It is unrealistic to assume that all attributes are equally important in the similarity function of the nearest neighbor assessment. To avoid the inconsistency of the subjective preferences of human experts, induction is used to find the relevancy of the attributes for the nearest neighbor assessment in the case matching process. The approach is expected to provide accurate predictions with justification, which is useful and beneficial to the stakeholders of the companies.
1
Introduction
Accurately identifying potentially failing companies helps in preventing or reducing the number of failures, and is beneficial to stakeholders of the failing companies. Numerous approaches for predicting business failure have been developed and many applications of these approaches have been reported in the literature [1, 2, 7, 12, 15, 16]. In general, these approaches can be classified into statistical and artificial intelligence approaches. All of them claim a certain degree of success in terms of predictive accuracy but none is dominant. Justification of a prediction is often neglected. Case-based reasoning (CBR) solves problems based on past experiences with the expectation that the world as a whole is regular, consistent and predictable [9]. It offers explanation for its results, which is crucial in solving financial problems. For example, a company often asks why its corporate bond is rated as ‘C' by a rating agency. In predicting business failures, CBR provides similar companies that failed in the past as a justification when a company is identified as failing. V. Palade, R.J. Howlett, and L.C. Jain (Eds.): KES 2003, LNAI 2773, pp. 1075-1080, 2003. Springer-Verlag Berlin Heidelberg 2003
This paper outlines a CBR approach to business failure prediction that integrates induction and nearest neighbor. Some attributes are more important than others, so a weighting is needed for every attribute. Experts can determine the relevancy of the attributes, but their judgments are often subjective. To avoid the inconsistency of the subjective preferences of experts, induction is used to find the relevancy of the attributes in the nearest neighbor assessment. This paper is organized as follows. Section 2 provides a review of the existing approaches to business failure prediction. Section 3 describes the framework of the CBR approach. Section 4 concludes the paper.
2
A Review of Approaches to Business Failure Prediction
Many approaches have been developed for predicting business failures. Statistical approaches are the techniques used traditionally, and artificial intelligence approaches have been widely used in the last decade. Other techniques include recursive partitioning analysis [7] and rough sets [6]. A review of a wide range of techniques can be found in [5]. Statistical techniques such as discriminant analysis (DA), logit and probit analyses and the linear probability model (LPM) usually generate a result as a numerical value or a probability of failure based on a set of explanatory variables for each firm under investigation. The result is then compared to a predetermined cutoff score in order to determine whether the firm is going to fail or not [1, 11, 12]. Statistical techniques are good at handling numerical data but are inherently limited in their ability to evaluate qualitative or symbolic data. They cannot explain their results. Artificial intelligence techniques can learn directly from data, cope with imprecise and incomplete data, and handle both symbolic and numerical data. Applications of these techniques in predicting failures include neural networks (NNs) [15], genetic algorithms (GAs) [16] and CBR [3, 8, 13]. An NN is a dynamic model with a set of interconnected units that perform computing tasks. Associated with each connection is a weight that is adjusted through learning processes until the network performs well on training data. A major drawback of NNs is that they cannot explain their results. GAs apply principles of evolution to problem solving by creating an initial set of guesses as potential solutions to a problem and then repeatedly rearranging these guesses according to how well they solve the problem, through selection, crossover and mutation, until a good solution is found. GAs only require one to recognize a good solution, and the results produced can be explained. However, there is no guarantee of an optimal solution. CBR uses old experiences to solve current problems [9]. It retrieves similar cases stored in a case base and adapts the solutions of the retrieved cases to solve a new problem. In CBR, most knowledge is represented in cases. A case comprises a problem state, its corresponding solution and/or outcome. CBR is particularly useful in domains that are impossible or difficult to understand completely. It can handle various data types and can justify its solution. It is less appropriate when cases are difficult or impossible to obtain. Comparisons among different techniques in the context of business failure prediction have been conducted and contradictory comparative results are often present. The work of Collins and Green [4] shows that DA and LPM produce consistently
better results than logit analysis does. However, the study conducted by Lennox [10] shows that well-specified logit and probit models are more accurate than DA. A comparison of NNs and DA indicates that the NN technique has a higher predictive accuracy than DA [19] or a similar degree of accuracy [2]. These examples reinforce the fact that no single technique consistently shows superiority over the others.
3
Framework of the CBR Approach
In the context of business failure prediction, a case represents attributes or factors that can affect the viability of a company. The case outcome takes the value either failed or non-failed. When a new company with certain attributes is being assessed, similar cases that match these attributes are retrieved from the case base with a retrieval algorithm. The prediction of the outcome for the new company depends on the outcomes of the retrieved cases. Among the few studies that applied CBR to bankruptcy or business failure prediction, Jo et al. [8] and Park and Han [13] use nearest neighbor, while Bryant [3] employs induction for case retrieval in bankruptcy prediction. Induction and nearest neighbor are both well-known techniques for retrieving similar cases [17]. Induction is a method of discovering structures and regularities in data through analyzing examples. Nearest neighbor assesses the similarity between a stored case and a new case based on a numerical function. A pure nearest neighbor calculation assumes that all attributes are equally important, which is unrealistic because some attributes are inherently more relevant than others. A weight has to be assigned to each attribute to take its relative importance into consideration. The similarity function typically used for assessment is the inverse of the weighted normalized Euclidean distance. A similarity score is calculated by

SIM(X, Y) = 1 - DIST(X, Y) = 1 - \sqrt{ \sum_{i=1}^{n} w_i \, \mathrm{dist}^2(x_i, y_i) }.   (1)

where X and Y are the new and stored case respectively, with n attributes, while x_i and y_i are the normalized values of the i-th attribute. A normalized weight w_i is assigned to each attribute. This calculation is repeated for every stored case in the case base. Cases are then ranked by similarity to the new case, and those with higher scores are the more similar cases. Finding the set of weights that improves the overall accuracy is an important as well as a challenging task. Integrating domain knowledge in weight determination is one means of resolving this problem. However, suitable experts are not always available and their subjective preferences are prone to inconsistency. Induction can help in assigning the importance of the attributes in the similarity measure, as illustrated in Fig. 1. Cases in the case base are indexed and induction generates a set of attribute weights for the nearest neighbor assessment. The most similar cases to the new case are retrieved and their outcomes are reused in predicting the outcome for the new case. Justification and explanation for the prediction can be readily given by providing the list of retrieved cases as precedents.
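A minimal Python sketch of the weighted nearest-neighbour assessment of Eq. (1). It assumes that attribute values have already been normalized to [0, 1] and that the weights are normalized; the case identifiers and sample figures are purely illustrative.

```python
import math

def similarity(new_case, stored_case, weights):
    """SIM(X, Y) = 1 - sqrt(sum_i w_i * dist^2(x_i, y_i))."""
    dist_sq = sum(w * (x - y) ** 2
                  for w, x, y in zip(weights, new_case, stored_case))
    return 1.0 - math.sqrt(dist_sq)

def rank_cases(new_case, case_base, weights):
    """Return (case id, similarity) pairs sorted from most to least similar."""
    scores = [(cid, similarity(new_case, attrs, weights))
              for cid, (attrs, outcome) in case_base.items()]
    return sorted(scores, key=lambda s: s[1], reverse=True)

if __name__ == "__main__":
    case_base = {"firm_a": ([0.2, 0.8, 0.4], "failed"),
                 "firm_b": ([0.7, 0.3, 0.9], "non-failed")}
    print(rank_cases([0.25, 0.75, 0.5], case_base, [0.5, 0.3, 0.2]))
```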
Fig. 1. Induction assigns the set of weights (components of the flow: case base, indexing, induction, set of weights, nearest-neighbor retrieval of the new case, outcomes reused, prediction)
The way induction generates a set of weights is through information gain [14]. Information gain is a heuristic of ID3, an induction algorithm, for comparing potential splits when finding the most discriminating attribute for dividing the cases. It calculates the difference between the entropy of a case base and that of its partitions built from an attribute, and this difference can be assigned as an attribute weight [18]. The entropy characterizes the impurity of a set of cases T with respect to the target attribute, which has k outcomes. It is defined as

entropy(T) = \sum_{j=1}^{k} -p_j \log_2 p_j.   (2)

where p_j is the proportion of T belonging to outcome j. If T is partitioned on attribute X with n values, the expected value of the entropy is the weighted sum over the subsets, given by

entropy_X(T) = \sum_{i=1}^{n} \frac{|T_i|}{|T|} \, entropy(T_i).   (3)

where T_i is the subset of T for which attribute X has value i. The information gain obtained by branching on X to partition T is measured by

gain(X) = entropy(T) - entropy_X(T).   (4)
Information gain is calculated for every attribute and is used as the attribute weight in the nearest neighbor assessment. On the whole, this approach takes the relative importance of the attributes into consideration by using induction to automatically generate a set of weights to avoid the unavailability or inconsistency of experts.
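A sketch of how the information-gain weights of Eqs. (2)-(4) can be computed, assuming discretized attribute values as in ID3. The attribute values and outcomes in the example are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def entropy(outcomes):
    """Eq. (2): impurity of a set of cases with respect to the target outcome."""
    total = len(outcomes)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(outcomes).values())

def information_gain(cases, outcomes, attribute_index):
    """Eq. (4): entropy(T) minus the expected entropy after partitioning on one attribute."""
    partitions = defaultdict(list)
    for case, outcome in zip(cases, outcomes):
        partitions[case[attribute_index]].append(outcome)
    expected = sum(len(p) / len(cases) * entropy(p) for p in partitions.values())
    return entropy(outcomes) - expected

if __name__ == "__main__":
    cases = [("high", "low"), ("high", "high"), ("low", "low"), ("low", "high")]
    outcomes = ["failed", "failed", "non-failed", "non-failed"]
    weights = [information_gain(cases, outcomes, i) for i in range(2)]
    print(weights)  # the first attribute separates the classes, the second does not
```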
4
Conclusion
This paper reviews different approaches to business failure prediction. It also presents a CBR approach for predicting business failures by integrating nearest
neighbor and induction. Induction is used to assign the relative importance of the attributes for nearest neighbor assessment in order to avoid the inconsistency in the subjective preferences of human experts. It is believed that this approach can explain and justify its results. It also gives the company under assessment an accurate basis for prediction and an early warning, so that measures can be taken to avoid business failure or at least to minimize its impact.
References
[1] Altman, E. I.: Financial Ratios, Discriminant Analysis and the Prediction of Corporate Bankruptcy. The Journal of Finance 23 (1968) 589-609
[2] Altman, E. I., Marco, G., Varetto, F.: Corporate Distress Diagnosis: Comparisons using Linear Discriminant Analysis and Neural Networks (the Italian Experience). Journal of Banking and Finance 18 (1994) 505-529
[3] Bryant, S. M.: A Case-Based Reasoning Approach to Bankruptcy Prediction Modeling. Intelligent Systems in Accounting, Finance and Management 6 (1997) 195-214
[4] Collins, R. A., Green, R. D.: Statistical Methods for Bankruptcy Forecasting. Journal of Economics and Business 34 (1982) 349-354
[5] Dimitras, A. I., Zanakis, S. H., Zopounidis, C.: A Survey of Business Failures with an Emphasis on Prediction Methods and Industrial Applications. European Journal of Operational Research 90 (1996) 487-513
[6] Dimitras, A. I., Slowinski, R., Susmaga, R., Zopounidis, C.: Business Failure Prediction using Rough Sets. European Journal of Operational Research 114 (1999) 263-280
[7] Frydman, H., Altman, E. I., Kao, D.: Introducing Recursive Partitioning for Financial Classification: the Case of Financial Distress. The Journal of Finance 40 (1985) 269-291
[8] Jo, H., Han, I., Lee, H.: Bankruptcy Prediction using Case-Based Reasoning, Neural Networks, and Discriminant Analysis. Expert Systems with Applications 13 (1997) 97-108
[9] Kolodner, J. L.: Case-Based Reasoning. Morgan Kaufmann, San Mateo, CA (1993)
[10] Lennox, C.: Identifying Failing Companies: a Re-evaluation of the Logit, Probit and DA Approaches. Journal of Economics and Business 51 (1999) 347-364
[11] Martin, D.: Early Warning of Bank Failure. Journal of Banking and Finance 1 (1977) 249-276
[12] Ohlson, J. A.: Financial Ratios and the Probabilistic Prediction of Bankruptcy. Journal of Accounting Research 18 (1980) 109-131
[13] Park, C., Han, I.: A Case-Based Reasoning with the Feature Weights Derived by Analytic Hierarchy Process for Bankruptcy Prediction. Expert Systems with Applications 23 (2002) 255-264
[14] Quinlan, J. R.: Induction of Decision Trees. Machine Learning 1 (1986) 81-106
[15] Tam, K. Y., Kiang, M. Y.: Managerial Applications of Neural Networks: the Case of Bank Failure Predictions. Management Science 38 (1992) 926-947 [16] Varetto, F.: Genetic Algorithms Applications in the Analysis of Insolvency Risk. Journal of Banking and Finance 22 (1998) 1421-1439 [17] Watson, I.: Applying Case-based Reasoning: Techniques for Enterprise Systems. Morgan Kaufmann, San Francisco, CA (1997) [18] Wettschereck, D., Aha, D. W.: Weighting Features. In: Proceedings of the First International Conference on Case-based Reasoning, Sesimbra (1995) 347-358 [19] Wilson, R. L., Sharda, R.: Bankruptcy Prediction using Neural Networks. Decision Support Systems 11 (1994) 545-557
A Metaheuristic Approach to Fuzzy Project Scheduling Hongqi Pan and Chung-Hsing Yeh School of Business Systems, Monash University Clayton, VIC 3800, Australia [email protected] [email protected]
Abstract. In practice, projects may contain many activities. Scheduling such projects under the constraints of limited resources and precedence relations is an NP-hard problem, and exact algorithms have difficulty solving problems of realistic size. In addition, many activity durations of a project are imprecise and vague owing to a lack of sufficient information; fuzzy set theory is well suited to describing such data. In this study, a fuzzy simulated annealing approach is developed to handle resource-constrained project scheduling with fuzzy data.
1
Introduction
Resource-constrained project scheduling (RCPS) has been the subject of academic research for nearly 40 years, since Kelley [1] and Wiest [2] raised this problem in the 1960s, and has been widely used in many areas. One of the most common goals is to minimize the project completion time under resource constraints and precedence relations while scheduling a project in which activity durations are commonly considered deterministic. However, in practice, activity durations are sometimes impossible to forecast as precise estimates and have to be considered as nondeterministic. Traditionally, probabilistic approaches have been applied by modelling them as random occurrences. Such approaches are theoretically valid if there is sufficient prior information, but in many projects there is most frequently insufficient information for determining the activity durations. Such uncertainty can only be effectively handled by fuzzy set theory, proposed by Zadeh [3]. Fuzzy set theory allows project managers to use their knowledge and experience to estimate approximate durations of activities in an intuitive and subjective manner, expressing their optimistic and pessimistic views. The application of fuzziness in project scheduling has become a challenging issue in which only a limited number of papers have been published, and more vigorous research is needed. Prade [4] introduced the concept of fuzzy sets into PERT in 1979. McCahon [5] and Rommelfanger [6] applied fuzzy sets to project planning. Fu and Wang [7] in 1996 proposed a fuzzy resource allocation model that uses linear programming when resources are insufficient. In relation to the consideration of resource constraints, Willis et al. [8] applied the fuzzy goal programming approach, V. Palade, R.J. Howlett, and L.C. Jain (Eds.): KES 2003, LNAI 2773, pp. 1081-1087, 2003. Springer-Verlag Berlin Heidelberg 2003
but such an approach is only suitable for small-sized projects. Lorterapong [9] and Pan et al. [10] presented heuristic-based approaches applying fuzzy sets to deal with the fuzzy activity durations of a project. Hapke et al. [11] adopted the simulated annealing technique in their attempt to solve the multiple-objective combinatorial optimisation of fuzzy RCPS. In this paper, a fuzzy simulated annealing approach is developed to handle resource-constrained project scheduling with the minimisation of project completion time for a project with fuzzy activity durations.
2
Fuzzy Notation and Its Operations
A fuzzy set \tilde{A} in the universe of discourse U can be defined by a membership function \mu_{\tilde{A}}, so that \mu_{\tilde{A}}(x) describes the degree of membership of x in \tilde{A}. Any fuzzy number \tilde{A} can be expressed in the mathematical form

\tilde{A} = \{ \mu_{\tilde{A}}(x)/x \}, \quad x \in U, \; \mu_{\tilde{A}}(x) \in [0, 1]   (1)
To operate on fuzzy numbers, let \tilde{A} and \tilde{B} be fuzzy numbers and * be any basic fuzzy arithmetic operation. A particular operation on two fuzzy numbers, \tilde{A} * \tilde{B}, can be denoted in the following form:

\mu_{\tilde{A} * \tilde{B}}(z) = \max_{x * y = z} \{ \min[ \mu_{\tilde{A}}(x), \mu_{\tilde{B}}(y) ] \}   (2)
In project scheduling, trapezoidal and triangular fuzzy numbers are used. A trapezoidal number is a flat fuzzy number that can be represented by the 4-tuple (a1, a2, a3, a4). A triangular fuzzy number can be regarded as a special case of the trapezoidal fuzzy number in which the lower modal (a2) and upper modal (a3) values are the same. The following fuzzy operations are based on trapezoidal numbers:

\tilde{A} + \tilde{B} = (a_1 + b_1, a_2 + b_2, a_3 + b_3, a_4 + b_4)   (3)

\tilde{A} - \tilde{B} = (a_1 - b_4, a_2 - b_3, a_3 - b_2, a_4 - b_1)   (4)

\tilde{A} \times \tilde{B} = (a_1 \times b_1, a_2 \times b_2, a_3 \times b_3, a_4 \times b_4)   (5)

\max(\tilde{A}, \tilde{B}) = (\vee(a_1, b_1), \vee(a_2, b_2), \vee(a_3, b_3), \vee(a_4, b_4))   (6)

\min(\tilde{A}, \tilde{B}) = (\wedge(a_1, b_1), \wedge(a_2, b_2), \wedge(a_3, b_3), \wedge(a_4, b_4))   (7)

where +, −, and × represent fuzzy addition, subtraction and multiplication respectively, and ∨ and ∧ symbolise the maximum and minimum operations for fuzzy numbers respectively.
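A minimal Python sketch of the componentwise operations (3)-(7) on trapezoidal fuzzy numbers represented as 4-tuples; the example durations are illustrative.

```python
def f_add(a, b):
    return tuple(x + y for x, y in zip(a, b))

def f_sub(a, b):
    # (a1 - b4, a2 - b3, a3 - b2, a4 - b1)
    return tuple(x - y for x, y in zip(a, reversed(b)))

def f_mul(a, b):
    return tuple(x * y for x, y in zip(a, b))

def f_max(a, b):
    return tuple(max(x, y) for x, y in zip(a, b))

def f_min(a, b):
    return tuple(min(x, y) for x, y in zip(a, b))

if __name__ == "__main__":
    A, B = (2, 3, 4, 6), (1, 2, 2, 3)      # illustrative fuzzy durations
    print(f_add(A, B), f_sub(A, B), f_max(A, B))
```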
In scheduling, fuzzy numbers need to be compared to find out which is bigger or smaller, since this is not intuitively straightforward. Here Cheng's approach [12] is used, based on the calculation of the centroid point (x_0, y_0) to obtain a distance index, where x_0 and y_0 are the centroid values on the horizontal and vertical axes respectively. The centroid point (x_0, y_0) of a fuzzy number \tilde{A} can be calculated using the following formulae:

x_0(\tilde{A}) = \frac{\int_{a_1}^{a_2} x\,\mu^{L}_{\tilde{A}}(x)\,dx + \int_{a_2}^{a_3} x\,dx + \int_{a_3}^{a_4} x\,\mu^{R}_{\tilde{A}}(x)\,dx}{\int_{a_1}^{a_2} \mu^{L}_{\tilde{A}}(x)\,dx + \int_{a_2}^{a_3} dx + \int_{a_3}^{a_4} \mu^{R}_{\tilde{A}}(x)\,dx}, \qquad y_0(\tilde{A}) = \frac{\int_{0}^{1} y\,g^{L}_{\tilde{A}}(y)\,dy + \int_{0}^{1} y\,g^{R}_{\tilde{A}}(y)\,dy}{\int_{0}^{1} g^{L}_{\tilde{A}}(y)\,dy + \int_{0}^{1} g^{R}_{\tilde{A}}(y)\,dy}   (8)

The ranking index can be expressed as

R(\tilde{A}) = \sqrt{(x_0)^2 + (y_0)^2}   (9)

Assume that \tilde{A}_i and \tilde{A}_j are any fuzzy numbers in the set \Re; once the ranking indices are obtained, the comparison of fuzzy numbers has the following properties:
(1) if R(\tilde{A}_i) > R(\tilde{A}_j), then \tilde{A}_i > \tilde{A}_j;
(2) if R(\tilde{A}_i) = R(\tilde{A}_j), then \tilde{A}_i = \tilde{A}_j;
(3) if R(\tilde{A}_i) < R(\tilde{A}_j), then \tilde{A}_i < \tilde{A}_j.

3
Fuzzy Simulated Annealing
Simulated Annealing (SA) is a random search technique that has been applied to many NP-hard problems in an attempt to find globally optimal solutions [13]. To avoid being trapped in local optima, it allows the occasional acceptance of "up-climbing" (bad) moves while moving from one solution to another, in order to explore additional search areas. SA has proven to be a powerful stochastic search algorithm for applications where little prior knowledge about the problem is available [14, 15, 16].
BEGIN
  Generate an initial solution by assigning activity priorities randomly;
  Initialise temperature change counter t := 0;
  Set an initial temperature T(0) > 0;
  DO WHILE stopping criterion (FreezeCount) not reached
    Initialise repetition counter n := 0;
    REPEAT until n = N(t)
      Generate a neighbour solution s' from the current solution s;
      Calculate ∆f = f(s') − f(s);
      IF ∆f < 0 OR random(0,1) < exp(−∆f / T(t)) THEN
        CurSolution := f(s')
      ENDIF
      IF BestSolution > f(s') THEN
        BestSolution := f(s')
      ENDIF
      n := n + 1
    ENDREPEAT
    Calculate T(t+1);
    t := t + 1;
  ENDDO
END
Fig. 1
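The following Python sketch mirrors the acceptance step of Fig. 1 using the standard Metropolis criterion. The objective f is assumed to return the ranking index of the fuzzy project completion time, and the neighbour generator, the geometric cooling schedule and the freeze criterion are assumptions of this sketch rather than details given in the paper.

```python
import math
import random

def accept(delta_f, temperature):
    """Accept improving moves always; worsening moves with probability exp(-df/T)."""
    return delta_f < 0 or random.random() < math.exp(-delta_f / temperature)

def anneal(initial, f, neighbour, t0, n_per_temp, cooling=0.9, freeze=50):
    """Generic SA loop: f(), neighbour() and the schedule are placeholders."""
    current, best = initial, initial
    t, stale = t0, 0
    while stale < freeze:
        improved = False
        for _ in range(n_per_temp):
            candidate = neighbour(current)
            if accept(f(candidate) - f(current), t):
                current = candidate
            if f(current) < f(best):
                best, improved = current, True
        stale = 0 if improved else stale + 1
        t *= cooling                      # geometric cooling schedule (assumed)
    return best
```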
In fuzzy scheduling, the initial solution is generated randomly in the first place, followed by the generation of a neighbourhood solution, whose acceptance is governed by a probability controlled by the cooling temperature. At the beginning the cooling temperature is relatively high, so that a large proportion of the generated solutions is accepted; the temperature is then progressively reduced so that only prospective solutions are accepted. This prevents the algorithm from getting trapped in local optima at early stages. The framework of the fuzzy simulated annealing is presented in Fig. 1. In the SA algorithm for RCPS, the priority of each activity is first assigned randomly. To generate an initial solution, the fuzzy parallel scheduling method is applied, using the fuzzy operations of the previous section. Once an initial solution is obtained, neighbourhood solutions can be generated in SA. The following is the fuzzy parallel method used to generate neighbourhood solutions once priorities are determined for the activities:
INITIALISATION: n := 1, ~t_n := 0, D(~t_n) := {1}, πK_r := K_r ∀r ∈ R
  A(~t_n) := C(~t_n) := ∅, R(~t_n) := {Z}
  GOTO Step (2)
DO WHILE n < J
  DO STAGE n
  BEGIN
  (1) ~t_n := min{ ~FT_i | Z_i ∈ A(~t_{n-1}) }
      A(~t_n) := A(~t_{n-1}) \ { Z_i | Z_i ∈ A(~t_{n-1}), ~FT_i ≅ ~t_n }
      C(~t_n) := C(~t_{n-1}) ∪ { Z_i | ~FT_i ≅ ~t_n }
      πK_r := K_r − Σ_{i ∈ A(~t_n)} ~k_ir , ∀r ∈ R
      D(~t_n) := { Z_i | Z_i ∉ {C(~t_n) ∪ A(~t_n)}, P_i ⊆ C(~t_n) }
      R(~t_n) := Z \ { C(~t_n) ∪ A(~t_n) ∪ D(~t_n) }
  (2) Order the priority list in D(~t_n)
      DO WHILE k_ir ≤ πK_r
        ~ST_i := ~t_n
        ~FT_i := ~ST_i + ~d_i
        A(~t_n) := A(~t_n) ∪ {Z_i}
        Update πK_r ∀r ∈ R, D(~t_n), and R(~t_n)
      ENDDO
  n := n + 1
  END
ENDDO
Fig. 2
The fuzzy SA has been integrated into a system written in VB.Net. This research attempts to find an efficient way of obtaining an approximately optimal solution, since the problem is typically NP-hard.
4
Experimentation
Ten sets of projects were examined with the fuzzy SA. The project sizes ranged from 20 to 100 activities. The initial temperature T(0) is set to 5 times the number of activities in the project, and the Markov chain length N(t) is set equal to the number of activities in the experiment.
Table 1. Results on Fuzzy SA
Set | Number of Activities | Best solution      | Worst solution     | Average deviation from best solution
1   | 50                   | (68,85,96,105)     | (79,90,105,112)    | 5.4
2   | 65                   | (77,94,107,145)    | (85,115,120,161)   | 5.8
3   | 80                   | (81,105,120,151)   | (80,119,135,170)   | 6.0
4   | 95                   | (90,110,130,159)   | (98,113,147,171)   | 6.2
5   | 110                  | (98,120,135,168)   | (109,129,144,170)  | 5.9
6   | 125                  | (103,128,141,180)  | (109,135,150,198)  | 7.4
7   | 140                  | (111,140,159,188)  | (120,149,170,201)  | 7.8
8   | 155                  | (122,150,180,201)  | (117,159,191,211)  | 7.5
9   | 170                  | (141,190,220,241)  | (154,201,231,255)  | 6.9
10  | 185                  | (144,189,217,244)  | (140,195,224,256)  | 7.6
To assess the dispersion from the best solution, each problem set was run 100 times to determine how far the average solution differed from the best one. In order to compute the average deviation from the best solution, all results are converted into ranking indices, because fuzzy numbers are used for the project completion times. Table 1 lists the best and worst solutions and the average deviations. The results achieved by the fuzzy SA are encouraging, as can be seen in Table 1: over the 10 sets, the average deviations range from 5.4 to 7.8, and the worst solutions do not deviate greatly from the best ones. This demonstrates that the system developed is capable of managing project scheduling under resource constraints.
5
Conclusion
The RCPS problem is a challenging issue in project management, and it easily becomes NP-hard when the number of activities in a project grows large. In addition, activity durations are often uncertain because the available information is not sufficiently precise, so that stochastic approaches are unable to handle them. Metaheuristic approaches therefore appear more attractive and powerful for handling such problems. This study provides the framework of a fuzzy SA approach for solving the RCPS problem where activity durations are imprecise. The approach is implemented in a computer-aided system written in VB.Net.
References
[1] Kelly, J.E.: The Critical-path Method: Resources Planning and Scheduling. In: Muth, J.E. et al. (eds.): Industrial Scheduling. Prentice Hall, New Jersey (1963) 347-365
[2] Wiest, J.D.: Some Properties of Schedules for Large Projects with Limited Resources. Operations Research 12 (1964) 395-418
[3] Zadeh, L.A.: Fuzzy Sets. Information and Control 8 (1965) 338-353
[4] Prade, H.: Using Fuzzy Set Theory in a Scheduling Problem: A Case Study. Fuzzy Sets and Systems 2 (1979) 153-165
[5] McCahon, C.S., Lee, E.S.: Project Network Analysis with Fuzzy Activity Times. Computers & Mathematics with Applications 15 (1988) 829-838
[6] Rommelfanger, H.J.: Network Analysis and Information Flow in Fuzzy Environment. Fuzzy Sets and Systems 67 (1994) 119-128
[7] Fu, C.C., Wang, H.F.: Fuzzy Resource Allocations in Project Management When Insufficient Resources Are Considered. Soft Computing in Intelligent Systems and Information Processing. IEEE, New York (1996) 290-295
[8] Willis, R.J., Pan, H., Yeh, C-H.: Resource-constrained Project Scheduling under Uncertain Activity Duration. In: Mohammadian, M. (ed.): Proceedings of the 1999 International Conference on Computational Intelligence for Modelling, Control and Automation (1999) 429-434
[9] Lorterapong, P.: A Fuzzy Heuristic Method for Resource-constrained Project Scheduling. Project Management Journal 25 (1994) 12-18
[10] Pan, H., Willis, R., Yeh, C-H.: Resource-constrained Project Scheduling with Fuzziness. In: Mastorakis (ed.): Advances in Fuzzy Systems and Evolutionary Computation. WSES Press, Danvers (2001) 173-179
[11] Hapke, M., Jaszkiewicz, A., Slowinski, R.: Interactive Analysis of Multiple-criteria Project Scheduling Problems. European Journal of Operational Research 107 (1998) 315-324
[12] Cheng, C-H.: A New Approach for Ranking Fuzzy Numbers by Distance Method. Fuzzy Sets and Systems 95 (1998) 307-317
[13] Eglese, R.W.: Simulated Annealing: A Tool for Operational Research. European Journal of Operational Research 46 (1990) 271-281
[14] Pannetier, J.: Simulated Annealing: An Introductory Review. International Physics Conference 107 (1990) 23-39
[15] Yao, X.: A New Simulated Annealing Algorithm. International Journal of Computer Mathematics 56: 3-4 (1995) 161-168
[16] Lee, J.-K., Kim, Y.-D.: Search Heuristics for Resource Constrained Project Scheduling. Journal of the Operational Research Society 47: 5 (1996) 678-689
A Note on the Sensitivity to Parameters in the Convergence of Self–Organizing Maps Marcello Cattaneo Adorno1 and Marina Resta2 1
CA Investment Advisers Ltd. 33 St James’s street, SW1A 1HD, London, United Kingdom, [email protected] 2 University of Genova, DIEM sez. di Matematica Finanziaria via Vivaldi 5,16126 Genova, Italy [email protected]
Abstract. This paper studies which parameters have the largest influence on the convergence of Kohonen Self-Organizing Maps (SOMs), with particular attention to the occurrence of meta-stable states when systems of maps are employed. The underlying assumption is that, notwithstanding the random initialization of the SOMs and the randomization of pattern presentation, trained map configurations should converge to an 'optimal' mapping of the original data-set. Therefore, we should look for a set of learning parameters that minimizes the divergence between SOMs trained from the same input space. To this purpose we introduce a Convergence Index, which is able to test the robustness of the fit of trained SOMs to their input space. These arguments are tested on a highly non-linear financial data-set, and some conclusions are drawn about the architecture best suited to generate robust Kohonen maps.
1
Introduction
Kohonen maps (or Self-Organizing Maps, SOMs, as they are also known) are powerful tools for data analysis and, specifically, for the analysis of time-series data. Although widely used, their employment has a number of drawbacks, which are linked to: – The intrinsic nature of the data under examination. The behaviour of some types of data is strictly dependent on a number of determinants which are either not completely known (phase-space reconstruction can be a difficult task, and attempts to build attractors often lead to contradictory or unsatisfactory results) or qualitative (in such cases the structure of dependence is difficult to render and cannot be modelled in a unique way) [1]. – Structural limitations of the algorithm: no exact rules are known for fixing the structure of a Kohonen map (namely, its parameters) according to the particular task it is devoted to. V. Palade, R.J. Howlett, and L.C. Jain (Eds.): KES 2003, LNAI 2773, pp. 1088–1094, 2003. © Springer-Verlag Berlin Heidelberg 2003
This paper starts from these remarks, studying how changes in the parameters of SOMs can affect overall performance, and investigates a tractable way of providing suggestions about the optimal net structure, i.e. the one best suited to fit data from the input space. To this aim, we develop a Convergence Index that measures the speed at which Kohonen maps converge to the input data distribution. For a review of the existing literature on the convergence of Self-Organizing Maps and the existence of possible meta-stable states, the reader is referred to [2], [3], [4], [5]. Notwithstanding the theoretical contributions to this topic, we explore it from a different point of view, which appears to be new, since we focus on the requirements for the existence of meta-stable states in groups of maps, trying to provide a practical tool that can be used in applications to test the robustness of the fit given by the SOMs. The structure of the work is as follows: Section 2 discusses the features of the Convergence Index and gives a proper mathematical framework. After a short description of the data employed, Section 3 presents the experimental design and discusses the simulation results, with additional remarks on the variants of the basic net structure that have been considered. Finally, Section 4 ends the paper with some concluding remarks and an outlook on future work.
2
The Convergence Index
In order to obtain a robust representation of the input data from a group of SOMs, it is necessary for the trained maps to be close to the ideal, optimal mapping of the training data-set. We do not know this ideal mapping, but we can infer that if several randomly initialized SOMs, trained on the same data-set, are very similar to each other, then they will also be close to the optimal SOM. We assume that the notion of similarity between two maps is strictly related to the notion of distance between them. We therefore introduce the following definition of distance between two equally sized maps.

Definition 1. The distance between two equally sized maps (DESM) M_i and M_j is defined as:

DESM(M_i, M_j) = \frac{1}{Nk} \sum_{r=1}^{N} \| x_r^{M_i} - y_r^{M_j} \|   (1)

where N is the size of the maps, k is the length of the nodes of the maps, and x_r^{M_i}, y_r^{M_j} are reference vectors from maps M_i and M_j respectively. For one-dimensional maps, one can compute the DESM for each permutation of the order of the nodes: if the number of nodes is N, there will be N! such permutations. The following definition is therefore perfectly straightforward.

Definition 2. The minimum distance (md) between two equally sized maps M_i and M_j is given by:

md(M_i, M_j) = \min[ DESM(P(M_i), M_j) ]   (2)
where P(M_i) spans the range of all possible permutations of the N nodes in the map M_i. Note that Definition 2 is general, since it can be extended from the case of mono-dimensional maps to that of multi-dimensional maps. In the case of two-dimensional maps, in fact, it is always possible to relax them along one of the two directions of the original maps, hence obtaining mono-dimensional objects. In this case we deal with a number of nodes N = n × m, where n and m are respectively the number of rows and columns in the original maps. The number of distances to evaluate (in the sense stated by Definition 1) will hence be equal to (m × n)!. More generally, given a set of d-dimensional maps, we can always consider them as mono-dimensional objects by relaxing them over d − 1 directions. The resulting mono-dimensional map will have a number of nodes N = \prod_{i=1}^{d} n(dim_i), where n(dim_i) is the number of nodes along the i-th direction. Obviously, the number of possible permutations will increase accordingly, and will be equal to (\prod_{i=1}^{d} n(dim_i))!. We are now ready to introduce the notion of Convergence Index.

Definition 3. Given a set M = {M_1, M_2, ..., M_q} of equally sized maps, each with N nodes defined in a k-dimensional space, the Convergence Index (CI) is given by:

CI = \frac{1}{q} \sum_{i=2}^{q} md(M_1, M_i) = \frac{1}{q} \sum_{i=2}^{q} \min( DESM(P(M_1), M_i) )   (3)
Here M_1 is a reference map, assumed to be the closest map to the input space. The Convergence Index is then a measure of similarity; we use it to assess whether or not the training of the SOMs has been effective in attaining a stable configuration (that is, an organization close to the unknown optimum). To this aim, it is important to develop a notion of goodness for the Convergence Index, in order to find a threshold level separating stable from unstable configurations. As a preliminary approximation, in this work we have evaluated normalized square-root CI quantities (nCI), assuming the Convergence Index to be good when the value of nCI is smaller than 0.05 (nCI = 0.05 corresponds to CI = 0.0025). It is important to point out that the Convergence Index measures the Euclidean distance between two maps, i.e. the average distance between corresponding elements of two maps trained on the same data-set, after the nodes of one have been transposed to the permutation that yields the smallest distance from the other; nCI transforms this measure so that it is on the same scale as the values of the node elements. Considering that in our experiments we rescale the input data to the range [0, 1], a value of nCI of 0.05 represents 5% of the input data range. One can easily see that, at this level of nCI, there is still a very high probability that new patterns presented to the SOM will fall into different categories depending on which trained map is used. To avoid this problem, we would want nCI values smaller than 0.01 or 0.005, i.e. less than 1% of the range of the rescaled input data. To achieve these results we need to optimize the net parameters, as described in the following section.
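A sketch of DESM, md, CI and the nCI transform (Definitions 1-3) in Python, assuming each map is given as a list of N reference vectors of length k; math.dist from the standard library supplies the Euclidean norm. The exhaustive search over node permutations is the same brute-force step the authors describe, which restricts the method to small maps.

```python
import math
from itertools import permutations

def desm(map_i, map_j):
    """Eq. (1): mean node-wise distance, normalized by N * k."""
    n, k = len(map_i), len(map_i[0])
    total = sum(math.dist(x, y) for x, y in zip(map_i, map_j))
    return total / (n * k)

def md(map_i, map_j):
    """Eq. (2): minimum DESM over all permutations of the nodes of map_i."""
    return min(desm([map_i[p] for p in perm], map_j)
               for perm in permutations(range(len(map_i))))

def convergence_index(maps):
    """Eq. (3): maps[0] is the reference map M1."""
    q = len(maps)
    return sum(md(maps[0], m) for m in maps[1:]) / q

def nci(ci):
    return math.sqrt(ci)   # normalized square-root CI used in the paper
```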
Table 1. Basic parameters configuration
XDim | YDim | NEp | AlphaT | InAlpha | Neighb | InRad
7    | 1    | 5   | 1      | 0.15    | 1      | 2

3
Simulation Results
3.1
Experimental Design
The algorithm in use has been developed starting from the work of Kohonen, as presented in [6]. Let us assume u(t) to be a generic output of the map at time t, so that the following equality holds:

u(t) = f(XDim, YDim, NEp, AlphaT, InAlpha, Neighb, InRad)   (4)
where XDim and YDim are respectively the number of rows and columns of the map; NEp is the number of epochs; AlphaT specifies the shape of the learning-rate function which drives the learning process. Two possibilities have been considered, indicated by flag numbers: (1) the Linear option, which makes the function decrease linearly over time, and (2) the Constant option, which keeps the function α constant throughout the simulation. Additionally, InAlpha is the value assumed by the function α at time t = 0; InRad is the radius R of the neighbourhood (with the constraint R ≤ min(XDim, YDim) when YDim is greater than one, and R ≤ XDim when YDim = 1); finally, Neighb defines the neighbourhood structure of the net. Various options have been considered, each associated with a code number: 1 indicates the Bubble kernel, that is a fixed array of points around the winner (in the case of bi-dimensional maps it reduces to a standard Moore neighbourhood); 3 is for the Gaussian or Mexican-hat kernel; finally, 11 and 13 are similar to 1 and 3 respectively, but with a "wraparound" map option 1. Due to limitations in computing speed we have limited the tests to toy configurations of one-dimensional SOMs. The basic parameter configuration for a set of Nmaps = 30 maps is displayed in Table 1. From this configuration we perform two different tasks. Firstly, we study the effects of variations of the parameters XDim, NEp, AlphaT, InAlpha, Neighb, and InRad. In particular, we initially step through one parameter at a time, disregarding second- and third-level interactions; we therefore test 11 different parameter sets. In the second phase of the study, some of the best-performing parameters are combined together, and the effects of such combinations are then analysed. Table 2 summarises the essential features of each run.
1 When this option is in use, we test for closeness to the map boundaries before running the adaptation routine: if the winner is closer to a boundary than the current radius, its location is shifted away from the boundary by the necessary number of cells. We also perform a circular shift on the entire array of nodes, by the same amount and in the same direction as we shifted the winner. This circular shift is permanent, i.e. it is not undone after the adaptation procedure.
Table 2. Simulation essential features
Run  | XDim | YDim | Nmaps | NEp | AlphaT | InAlpha | Neigh | InRad
KC1  | 7    | 1    | 30    | 5   | 1      | 0.15    | 1     | 2
KC2  | 4    | 1    | 30    | 5   | 1      | 0.15    | 1     | 2
KC3  | 10   | 1    | 30    | 5   | 1      | 0.15    | 1     | 2
KC4  | 7    | 1    | 30    | 2   | 1      | 0.15    | 1     | 2
KC5  | 7    | 1    | 30    | 5   | 2      | 0.15    | 1     | 2
KC6  | 7    | 1    | 30    | 5   | 1      | 0.05    | 1     | 2
KC7  | 7    | 1    | 30    | 5   | 1      | 0.5     | 1     | 2
KC8  | 7    | 1    | 30    | 5   | 1      | 0.15    | 3     | 2
KC9  | 7    | 1    | 30    | 5   | 1      | 0.15    | 11    | 2
KC10 | 7    | 1    | 30    | 2   | 1      | 0.15    | 13    | 2
KC11 | 7    | 1    | 30    | 5   | 2      | 0.15    | 1     | 3
The data-set consists of a series of financial mono-variate data provided in the Santa Fe Competition, directed in 1991 by Neil Gershenfeld and Andreas Weigend 2. These data are the tick-wise bids for the exchange rate from Swiss francs to US dollars, recorded from August 7, 1990 to April 18, 1991. Before running the Kohonen procedure we rescale the data to the range [0, 1]. Since the algorithm introduced by Kohonen is essentially a projection method, it is also important to provide a proper evaluation of the projection dimension (i.e. of the dimension of the space where the input samples lie): if the estimated value is too small, a considerable amount of information about the observed time series could be lost; on the other hand, too large an estimate might make the procedure useless. Since the exact determination of this quantity is often difficult, we run simulations assuming 4 different rough approximations of the (real) value that should be used, setting the projection dimension (and hence the number of elements of each node) to 3, 5, 7 and 10. In summary, we perform 11 × 4 = 44 runs for the given data-set.

3.2
Discussion of Results
Simulation results are summarised in Table 3. It is possible to note that when the data are embedded into a 10-dimensional phase space, the sensitivity to parameter changes is much more evident than in the other cases. Additionally, we observe that setting AlphaT to type 2 (fixed at the initial value) decreases the distance between maps, and changing the neighbourhood kernel (Neighb) from bubble to Gaussian improves training. It is also obvious that the nCI values obtained in all simulations are far from the desired values outlined in the previous section, as they are in the range of 13% to 65% of the input-data range.
2 Freely downloadable at the URL: http://www-psych.stanford.edu/~andreas/TimeSeries/SantaFe.html.
Table 3. Simulation results: the columns show nCI for different values of the projection dimension
Run  | 3     | 5     | 7     | 10
KC1  | 0.653 | 0.445 | 0.448 | 0.448
KC2  | 0.351 | 0.445 | 0.448 | 0.257
KC3  | 0.517 | 0.445 | 0.448 | 0.426
KC4  | 0.462 | 0.445 | 0.448 | 0.128
KC5  | 0.634 | 0.445 | 0.448 | 0.395
KC6  | 0.595 | 0.445 | 0.448 | 0.209
KC7  | 0.616 | 0.445 | 0.448 | 0.303
KC8  | 0.549 | 0.445 | 0.448 | 0.454
KC9  | 0.489 | 0.445 | 0.448 | 0.369
KC10 | 0.582 | 0.445 | 0.448 | 0.413
KC11 | 0.653 | 0.445 | 0.448 | 0.448

3.3
An Optimal Configuration and the Brute Force Approach
We now perform a second set of runs, using the optimal parameters of the previous runs, and concentrating on the effect of the number of epochs on convergence, varying it from a minimum of 1 to a maximum of 625. We use the parameter configurations reported in Table 4. The other learning parameters are also set to the optimal values found in the initial tests. We vary the number of epochs across a wide range, in a brute-force attempt to increase convergence dramatically. The result is very encouraging: for a number of epochs (NEp) greater than 125, the convergence index falls within the desired range.
Table 4. Simulation results: nCI values with the brute-force approach
Number of Epochs | 3     | 5     | 7     | 10
1                | 0.072 | 0.053 | 0.089 | 0.056
5                | 0.014 | 0.019 | 0.016 | 0.022
25               | 0.020 | 0.026 | 0.013 | 0.022
125              | 0.019 | 0.018 | 0.026 | 0.020
625              | 0.015 | 0.017 | 0.018 | 0.016

4

Conclusions

In order to obtain a robust representation of the input data in SOMs it is necessary for the trained map to be close to the ideal, optimal mapping of the training data-set. We do not know the ideal mapping, but we can infer that if several SOMs trained on the same data and randomly initialised are very similar to each other, they will also be close to the optimal SOM. We have developed a measure of similarity, the Convergence Index (CI), and determined what would be a desirable value for such a measure; we have also applied a series of steps to determine suitable training parameters. We take a sample data-set, train one or more maps on the data-set and, depending on the distance between categories, determine the value of CI that provides sufficient resolution so that the risk of a new pattern falling into the 'wrong' node is minimised. We then train several groups of SOMs with increasing values of the Number of Epochs parameter, and measure the CI of each group, choosing a value that satisfies our minimum requirement. In the course of our experiments, we find that the most reliable and powerful way to increase SOM convergence, as measured by the CI, is to increase the Number of Epochs parameter, thereby increasing the number of times the input set is presented to the net. This behaviour is consistent with the intuitive observation that the self-organizing capability of the net should improve as the number of learning iterations increases. At present our capability to measure the CI is limited to SOMs having at most 10 nodes. The limitation is due to the fact that, in order to find the minimum distance between two maps, we have to loop through all N! permutations of the order of the nodes. In a future study we must look for a way of computing at least an approximation of the CI for SOMs larger than 10 nodes. We must also develop better criteria to establish acceptance/rejection levels of the CI, given a required probability of avoiding incorrect classification by a trained SOM.
References
[1] Casdagli, M.: Nonlinear Prediction of Chaotic Time Series. Physica D 35 (1989) 335-356. 1088
[2] Cottrell, M., Fort, J. C.: Etude d'un processus d'autoorganisation. Annales de l'Institut Henri Poincaré 23(1) (1987) 1-20. 1089
[3] Cottrell, M., Fort, J., Pages, G.: Theoretical Aspects of the SOM Algorithm. Neurocomputing 21 (1998) 119-138. 1089
[4] Erwin, E., Obermeyer, K., Schulten, K.: Convergence Properties of Self-Organizing Maps. In: Kohonen, T., Makisara, K., Simula, O., Kangas, J. (eds.): Artificial Neural Networks. Elsevier Science Publishers B.V., North Holland (1991) 409-414. 1089
[5] Erwin, E., Obermeyer, K., Schulten, K.: Self-Organizing Maps: Stationary States, Metastability and Convergence Rate. Biological Cybernetics 67 (1992) 35-45. 1089
[6] Kohonen, T.: Self-Organizing Maps. Springer Series in Information Sciences, Vol. 30. Springer, Berlin Heidelberg New York (1997). 1091
Utilization of AI & GAs to Improve the Traditional Technical Analysis in the Financial Markets Norio Baba1, Yaai Wang2, Tomoko Kawachi1, Lina Xu3, and Zhenglong Deng 3 1
Information Science, Osaka Kyoiku University, 582-8582, Japan [email protected] 2 Information Science, Osaka Kyoiku University (On one year and half leave from Control Engineering, Harbin Institute of Technology, 150001, P.R.China) 3 Control Engineering, Harbin Institute of Technology, 150001, P.R.China
Abstract. Traditionally, many stock traders have utilized the measure LMMA − SMA, the difference between a long/medium-term moving average (such as the 26-week moving average) and a short-term moving average (such as the 13-week moving average), in order to detect up and down tendencies of the stock market. In this paper, we demonstrate through several computer simulations that a DSS whose dealings are driven by the measure LMMA − SMA can be further improved by the use of AI techniques. We also suggest that GAs would be useful for improving the proposed DSS.
1
Introduction
Recently, a large number of researchers have shown growing interest in applying various approaches to the construction of intelligent decision support systems for dealing stocks [1-8]. In this paper, we suggest that the approach utilizing traditional technical analysis can be further improved by the sophisticated use of AI techniques and GAs.
2
Traditional Technical Analysis in the Financial Market
Traditionally, many stock traders have utilized the measure "long (medium) term moving average versus short term moving average" in order to detect up and down tendencies of stocks. Fig. 1 and Fig. 2 illustrate the golden cross and the dead cross of the two moving averages, respectively. As is well known, a "golden cross" is a sign that the price of the stock is moving higher, and a "dead cross" is a sign of downward movement. Therefore, many people have utilized this approach for dealing stocks. The prototype DSS utilizing traditional technical analysis can be described as follows: V. Palade, R.J. Howlett, and L.C. Jain (Eds.): KES 2003, LNAI 2773, pp. 1095-1099, 2003. Springer-Verlag Berlin Heidelberg 2003
1) If the sign of 26MA − 13MA changes from positive to negative, then buy stocks.
2) If the sign of 26MA − 13MA changes from negative to positive, then sell stocks.
Remark 1. As the measure "long (medium) term moving average versus short term moving average", "26-week moving average versus 13-week moving average" has often been utilized. "13-week moving average versus 5-week moving average" has also been utilized in order to detect the tendency of the stock market.
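A minimal Python sketch of the prototype rule, assuming a list of weekly closing prices; the exact data handling (e.g. which price is used each week) is not specified in the paper and is an assumption of this sketch.

```python
def moving_average(prices, window):
    return sum(prices[-window:]) / window

def crossover_signal(weekly_prices):
    """Return 'buy', 'sell' or None for the latest week, from the sign change of 26MA - 13MA."""
    if len(weekly_prices) < 27:
        return None
    prev = moving_average(weekly_prices[:-1], 26) - moving_average(weekly_prices[:-1], 13)
    curr = moving_average(weekly_prices, 26) - moving_average(weekly_prices, 13)
    if prev > 0 and curr <= 0:
        return "buy"    # golden cross: the 13-week MA rises above the 26-week MA
    if prev < 0 and curr >= 0:
        return "sell"   # dead cross
    return None
```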
Fig. 1. Golden Cross of the moving averages
Fig. 2. Dead Cross of the moving averages

3
A Trial to Improve the Traditional Technical Analysis
The traditional technical analysis has been widely recognized as one of the most reliable techniques for dealing stocks. We have also carried out a large number of computer simulations in order to check the effectiveness of the DSS utilizing this technique [7],[8]. Almost all of the simulations confirm the effectiveness of the DSS. However, we now consider that the DSS can be further improved by incorporating AI techniques and/or GAs. The DSS shown in the previous section executes dealings only after a golden cross or a dead cross has occurred. However, such dealings often miss an important chance to adapt to sudden changes in the stock market. It might be much better if we could carry out dealings before the golden cross (or dead cross) occurs. In this section, we propose the following DSS to realize this idea:

DSS
Let z_t = (26-week moving average (26MA) − 13-week moving average (13MA)) on the Friday of the t-th week.
a) Assume that z_t ≥ 0. Let t_1 be the number of the week which is nearest to t among the weeks t' satisfying the relations z_{t'−1} ≤ 0, z_{t'} > 0, and t' < t. Let
Mz_t = Max(z_{t_1}, z_{t_1+1}, ..., z_t)
Choose parameters a, b, c. If both of the following relations (1) and (2) are satisfied, then buy:
Mz_t > b × c   (1)
z_t < Min(Mz_t / a, c)   (2)
b) Assume that z_k < 0. Let k_1 be the number of the week which is nearest to k among the weeks k' satisfying the relations z_{k'−1} ≥ 0, z_{k'} < 0, and k' < k. Let w_k = −z_k (k = k_1, ..., k). Further, let
Mw_k = Max(w_{k_1}, w_{k_1+1}, ..., w_k)
Choose parameters a', b', c'. If both of the following relations (3) and (4) are satisfied, then sell:
Mw_k > b' × c'   (3)
w_k < Min(Mw_k / a', c')   (4)
We have carried out computer simulations for dealing various individual stocks. We have also carried out computer simulations concerning the well known indexes such as Nikkei-225 and TOPIX. In order to derive fair evaluation (as possible as we can), we have tried to do simulations for rather long periods. In the followings, just for the information, let us briefly introduce one of the simulation results. Table 1 shows the changes of the initial amount of money (10 billion yen) by the dealings in Kyosera (one of the most famous corporations listed in the Tokyo Stock Market) utilizing the DSS proposed in the previous section.As for the parameter values, we have used: a = a = 5, b = b = 2, c = c = 100 It also shows the changes of the initial money by the traditional method where the dealings are done after the sign of “Long-Medium Term Moving Average - Short Term Moving Average” changes. Further, it also adds the data of the changes of the initial amount of money by the Buy-and-Hold method. Figure 3 illustrates the changes of the stock price (Kyosera) during the simulation periods.
Table 1. Changes of the initial money (10 billion yen) by the dealings in the individual stock (Kyosera), in billion yen. Simulation periods: A1: 1996.1 - 1998.12, A2: 1997.1 - 1999.12, A3: 1998.1 - 2000.12

                        A1         A2         A3         Average
Proposed Method         10.59498   32.57745   27.55398   23.57544
Traditional Approach     8.21466   28.61131   19.54632   18.79076
Buy-and-Hold             7.81114   26.73854   21.01833   18.52267
Fig. 3. Changes of the stock price (Kyosera) during the simulation periods (vertical axis: price in yen, 0 to 30,000; horizontal axis: 96/1 to 00/1)
5 Utilization of GAs for Making the Proposed DSS More Effective
Through a large number of computer simulations, we have confirmed that the proposed DSS is quite effective in yielding a considerable return under a specific combination of parameter values. However, in order to make the proposed DSS more reliable, one has to investigate its effectiveness under various combinations of parameter values and find the most suitable one for each simulation period. We are now trying to utilize GAs for this purpose. Due to space limitations, we do not go into details here; interested readers are kindly invited to attend our presentation.
6 Concluding Remarks
A decision support system for dealing in stocks, which improves on the approach based on traditional technical analysis, has been proposed. Several computer simulation results confirm the effectiveness of the proposed DSS.
Acknowledgements The authors would like to express their thanks to QUICK Corporation for their kind support in providing various financial data.
References
[1] Refenes, A.P. (ed.): Neural Networks in the Capital Markets. Wiley (1995)
[2] Weigend, A.S. et al. (eds.): Decision Technologies for Financial Engineering. World Scientific (1997)
[3] Baba, N., Kozaki, M.: An intelligent forecasting system of stock price using neural networks. In: Proceedings of IJCNN, 1 (1992) 371-377
[4] Baba, N. et al.: A hybrid algorithm for finding the global minimum of error function of neural networks and its applications. Neural Networks 7 (1994) 1253-1265
[5] Baba, N. et al.: Utilization of neural networks and GAs for constructing an intelligent decision support system to deal stocks. In: Proceedings of the SPIE Conference, Orlando, U.S.A., 2760 (1996) 164-174
[6] Baba, N., Suto, H.: Utilization of artificial neural networks and TD-learning method for constructing intelligent decision support systems. European Journal of Operational Research 122 (2000) 501-508
[7] Baba, N. et al.: Utilization of soft computing techniques for constructing reliable decision support systems for dealing stocks. In: Proceedings of IJCNN 2002, Hawaii, U.S.A. (2002)
[8] Baba, N.: Utilization of soft computing techniques for dealing in the TOPIX and the Nikkei-225. In: Proceedings of KES 2002, Milano, Italy (2002) 374-380
On the Predictability of High-Frequency Financial Time Series Mieko Tanaka-Yamawaki Department of Computer Science and Systems Engineering Miyazaki University, Miyazaki, 889-2192 Japan [email protected]
Abstract: We focus on the memory length of tick-wise price fluctuations and its stability. It is known that price fluctuations can be approximated by a random walk but have a short memory. However, it has not been known so far exactly what the memory length is and how stable it is. We have analyzed tick-wise price fluctuations of the U.S. Dollar vs. Japanese Yen exchange rate extending over five and a half years, by automatically generating output files of conditional probabilities for memory lengths up to seven. We have identified the memory length to be three ticks almost everywhere throughout the entire period of the data. This result coincides with two other independent analyses: autocorrelation functions and mutual information. This fact implies a possibility of short-time prediction based on Markov-type models.
1 Introduction
Price fluctuations can be decomposed into long-term trends and short-term fluctuations; to a first approximation the latter can be assumed to be a random walk and thus approximated by the normal distribution [1]. However, the probability distribution of real short-term price fluctuations is known to follow distributions that put more weight on both tails than the normal distribution does. Such distributions are called 'fat-tail' distributions. Although this idea itself is generally accepted, no consensus has been reached on the exact shape of the probability distribution of such price fluctuations. In particular, the question of pinpointing the stable distribution has moved far from the original proposal of Levy's stable distribution of index 1.4 [2]. A stylized fact currently accepted by the community [3,4] is that the center part and the tail part of the distribution seem to follow two different distributions. There are also arguments against the assumption of a stable distribution, since we do not really know when and how the stability is maintained in such data. The statistical distribution itself does not give us much insight into predictability. In order to predict the next move based on past moves, we need to know the length of the past moves that govern the next move. This kind of knowledge requires an
extensive study of real tick data. Unfortunately, tick data are generally hard to access because of their huge size and high price. The data also suffer from various shortcomings [5]. The worst of all is recording errors. For example, a lone point vastly different in magnitude from the neighboring points is almost certainly an error. On the other hand, there are cases in which orders made by mistake at outrageously low prices are actually executed in a real market. Various factors must therefore be handled with care in order to extract a rule from data produced by human actions, which are bound to include mistakes. The next problem is the type of data. Many tick data contain not only actually executed prices but also simple quotations. Often identical prices are bunched at a particular time of day, or quotations made at different places on the Earth are lined up on the same time series. Such data do not necessarily reflect the rule behind the price evolution. Taking all these factors into account, we can still extract some message hidden in the tick data. Analysis of millions or more price motions enables us to extract various features that characterize high-frequency financial data [5-9]. In this paper we first report the facts we have observed in financial tick data by using our program, which automatically generates time series of conditional probabilities from large financial time series. We focus on the simplified case of binary motion {up, down} = {1, 0}, neglecting the magnitude of each move as well as unmoved ticks [6]. Namely, we ask whether the direction of a move at a certain time and the next move are correlated, and how long the correlation lasts. A surprising feature has been found [5-8]. While '1' (up) and '0' (down) appear approximately equally often in the data, an up move after a down move occurs in about 70% of cases, an up move after an up move occurs in about 30%, and both ratios are very stable throughout the entire period. This discovery indicates the existence of a certain stability over a long period, which is expected to be a key to clarifying the nature of price motions. In computing the time series of each conditional probability, we split the entire data into pieces of 2000 points and compute one conditional probability per piece. We have compared various sizes and concluded that the minimum length necessary to guarantee a stable error is 500-1000. We observe from these results that a move is governed by the previous move, and this effect seems to have regularity. What if we consider the effect of the previous two moves? If we continue this line of thought and the correlation terminates after some point, the time series is Markovian, and we can draw a picture of a Markovian model with 2^m states. Note that we do not expect a deterministic prediction; we limit ourselves to probabilistic predictions. The length of memory can be identified as three ticks by examining the time series of various conditional probabilities. For memory lengths up to three, lines corresponding to different conditional probabilities are distinguishable, while for memory lengths larger than three those conditional probabilities begin to overlap. This clearly represents a loss of information for memory lengths longer than three ticks. A consistent result is obtained more directly from the shape of the autocorrelation of the price increments: the correlation vanishes after three ticks. We have also checked the mutual information between the next move and past moves of
various sizes. It has been found that the mutual information saturates for memory lengths longer than three ticks.
2 Characteristic Features of High-Frequency Financial Fluctuations
Currency exchange data are the most convenient among financial data for studying conditional probabilities of various memory depths, because such an analysis requires a large data set in order to compute the probabilities from real data. The currency exchange market is busy at any time of the day, because some market is open somewhere at every hour except weekends and holidays. In this section we focus on the stability found in the high-frequency price fluctuations and show that the memory length stays almost constant over a long period of time. In tick data, {time, bid price, ask price} are recorded whenever a price is quoted. A price is quoted when a trader makes a trade, or simply when a quotation is issued by a bank/market after a certain time interval. Therefore the time intervals are not equal. For currency exchange data, moreover, no information is recorded on the volume, on whether the order is executed, or even on the difference between an order and a quote. At every point in time, bid is lower than ask, since bid is the price at which the bank wants to buy and ask is the price at which the bank wants to sell; thus a person making a deal with a bank must buy high and sell low. It is said that bid is more accurate than ask. Generally speaking, ask and bid move in parallel and either one is enough for our analysis; we choose the 'ask' prices. We simplify the data into binary motions by discarding unmoved ticks and then neglecting the magnitude of each move. The new data thus become a time series of '1' (up) and '0' (down). The ratios of 1 and 0 are about equal. If the price fluctuations were indeed random and uncorrelated, the following relation would be expected to hold:
P(1|1) = P(1|0) = P(1) = 0.5    (1)
We compute the conditional probabilities as a time series using the exchange rates of U.S. Dollar vs. Japanese Yen from January 1995 to April 2001. The ask positions contain a little more than ten million data points, and we use the first 10,000,000 points for this analysis. It is essential to write an efficient program to deal with such large data. The algorithm that we have used to compute the various conditional probabilities for a fixed memory depth m is as follows. The time series is split into pieces of a fixed size. Then the 2^m conditional probabilities are computed for each piece. Finally they are rearranged into time series for each m and written into automatically generated files, one for each conditional probability. First we seek the best size for splitting the data. Fig. 1 shows the mean squared error for P(1|0) and P(1|1) as a function of the size of each split data set. We observe that stability is reached for sizes larger than 20,000.
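The partition-and-count procedure described above can be sketched as follows. This is an illustrative reimplementation, not the author's program: it assumes a Python list `moves` of 0/1 symbols, and the names are ours.

```python
from collections import Counter

def conditional_prob_series(moves, m, piece_size):
    """For each piece, estimate P(1 | previous m moves) for every m-bit history."""
    series = []
    for start in range(0, len(moves) - piece_size + 1, piece_size):
        piece = moves[start:start + piece_size]
        counts, ups = Counter(), Counter()
        for i in range(m, len(piece)):
            history = tuple(piece[i - m:i])
            counts[history] += 1
            ups[history] += piece[i]
        series.append({h: ups[h] / counts[h] for h in counts})
    return series  # one dict {history: P(1|history)} per piece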
Fig. 1. Mean squared error of the conditional probability of an up motion after a down motion, P(1|0), as a function of the partition size; it becomes stable for sizes larger than 20,000
In Fig. 2(a) and Fig. 2(b) we can see that the results for the two partition sizes are essentially the same. Fig. 2(b), Fig. 3, Fig. 4, and Fig. 5 show the results for memory lengths m = 1, 2, 3, and 4, respectively, for a partition size of 50,000. If the price fluctuations in this exchange rate were utterly random, each step of the price time series would be independent and P(1|0) = P(1|1) = 0.5, as in Eq. (1). We observe from Fig. 2(a), however, that for a long period of time the following relation holds instead of Eq. (1):
P(1|0) = 0.71 >> P(1|1) = 0.29    (2)
Each move is strongly correlated with the previous move, and the probability of moving to the opposite side is much larger than that of moving in the same direction. This time series has a finite memory! The case of memory depth m = 2 is shown in Fig. 3. We still recognize four very stable lines. The line for P(1|0) in Fig. 2 splits into two distinct lines, P(1|00) = 0.78 ± 0.05 and P(1|10) = 0.68 ± 0.03, in Fig. 3. If the memory length were only one step, then the following would hold instead:
P(1|00) = P(1|10) = P(1|0) = 0.71    (3)
Similarly, in Fig. 4 eight lines are clearly separated and stay stable for a long time, which means that the memory length extends to three steps (m = 3), while for m = 4 the sixteen lines overlap and form eight pairs. This indicates the following situation:
P(1|0000) = P(1|1000) = P(1|000)    (4)
This means that memory of more than three steps is irrelevant in this time series. This fact is stable throughout the five and a half years of data.
Fig. 2(a). Conditional probabilities P(1|0) and P(1|1) (shown as P0 and P1) as time series when the partition size is 50,000. Note that P(1|0) is much larger than P(1|1), indicating a strong correlation between two consecutive moves. The two lines are also very stable over the entire period of five and a half years
Fig. 2(b). Conditional probabilities P(1|0) and P(1|1) when the partition size is 100,000. Figs. 2(a) and 2(b) are essentially the same
Fig. 3. Conditional probabilities P(1|00), P(1|10), P(1|01), P(1|11) for m = 2, ordered from top to bottom in the figure (shown as P00, P10, P01, and P11), for partition size 50,000. The four lines are separated and stable throughout the period, indicating that the memory depth extends to m = 2
Fig. 4. Conditional probabilities P(1|000), P(1|001), P(1|010), P(1|011), P(1|100), P(1|101), P(1|110), P(1|111) for m = 3, ordered from top to bottom in the figure (shown as P000, P001, etc.), for partition size 50,000. The eight lines are separated and stable throughout the period, indicating that the memory depth extends to m = 3
Fig. 5. Conditional probabilities P(1|0000), P(1|0001), …, P(1|1111) for m = 4, ordered from top to bottom in the figure (shown as P0000, P0001, etc.), for partition size 50,000. The sixteen lines overlap into eight bunches: the memory at m = 4 is washed out in the price fluctuations
3 Autocorrelation
The autocorrelation C(T) of the price fluctuation Δx can be calculated by Eq. (5):
C(T) = { <Δx(T+t) Δx(t)> − <Δx(T+t)> <Δx(t)> } / σ²    (5)
Here σ is the standard deviation. C(T) is zero if Δx(t) and Δx(t+T) are uncorrelated. Fig. 6 shows the result of this calculation: C(T) vanishes after T = 3 or 4, indicating that the memory length for the currency exchange is approximately three.
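A minimal sketch of Eq. (5) is given below. It assumes NumPy and a 1-D array `dx` of price increments; for simplicity σ² is taken as the sample variance of the whole series, and the names are illustrative.

```python
import numpy as np

def autocorrelation(dx, T):
    """Autocorrelation C(T) of price increments dx, as in Eq. (5)."""
    if T == 0:
        return 1.0
    x, y = dx[:-T], dx[T:]                     # pairs (dx(t), dx(t+T))
    return (np.mean(x * y) - np.mean(x) * np.mean(y)) / np.var(dx)
```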
Fig. 6. Autocorrelation C(T) of price fluctuations as a function of T(ticks) vanishes after T=3, or 4
Fig. 7. Time Series of Mutual Information between the next move and the past move for m=1,2,3,4
4 Mutual Information
Mutual information is a quantity that measures the degree of correlation between two pieces of information x and y:
M(x, y) = H(x) − H(x|y) = −Σ_x P(x) log₂ P(x) + Σ_x Σ_y P(x|y) log₂ P(x|y)    (6)
This is the difference between the uncertainty of x, namely the entropy H(x), and the uncertainty of x after knowing y, namely the conditional entropy H(x|y). M(x,y) therefore measures how much certainty is gained, or how much uncertainty about x is removed, by knowing y. We compute M(x,y) by computing P(x) and P(x|y) as in Section 2, using the first 52,000 data points out of the 10,000,000 points. Namely, the first value of P(x) or P(x|y) is computed from the 1st-50,000th points, the second from the 2nd-50,001st points, and so on. This process generates a time series of length 2001 for each memory length m. Fig. 7 shows the mutual information for m = 1, 2, 3, and 4. We observe that the
two lines for m = 1 and m = 2 are well separated, which means that more information is obtained if we consider a history of two steps rather than only one. Similarly, still more information is gained by increasing the memory from m = 2 to m = 3, as judged by the fact that the line for m = 3 lies well above the line for m = 2. The increase from m = 3 to m = 4, however, gains little, because the lines for m = 3 and m = 4 almost overlap. Therefore we conclude that the fourth-step memory is not relevant for predicting the future of this time series.
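The mutual information between the next move and the previous m moves can be estimated from counts with the standard plug-in estimator sketched below. This may differ in detail from the author's computation; the names are illustrative.

```python
from collections import Counter
from math import log2

def mutual_information(moves, m):
    """M(next move; previous m moves) estimated from counts in `moves`."""
    joint = Counter((tuple(moves[i - m:i]), moves[i]) for i in range(m, len(moves)))
    total = sum(joint.values())
    p_next, p_hist = Counter(), Counter()
    for (hist, nxt), c in joint.items():
        p_hist[hist] += c
        p_next[nxt] += c
    mi = 0.0
    for (hist, nxt), c in joint.items():
        p_xy = c / total
        mi += p_xy * log2(p_xy / ((p_next[nxt] / total) * (p_hist[hist] / total)))
    return mi
```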
5 Conclusion
In order to extract a possible rule for tick-wise price movements, we have analyzed a high-frequency time series of the U.S. Dollar vs. Japanese Yen exchange rate containing ten million data points over the five-and-a-half-year period from 1995 to 2001, by automatically generating outputs in one program for each memory length m. We have observed from these outputs that the conditional probabilities of {up, down} motions are highly stable throughout the entire period of the data. Moreover, the relevant memory length extracted from the conditional probabilities stays at three ticks throughout the entire period. The memory length extracted from the autocorrelation function of the price increments is also three ticks, consistent with the above result. Finally, we have computed the mutual information between one move and its past history of various sizes. It is shown that the mutual information increases as the depth of memory increases up to three and then saturates after m = 3. This also indicates the loss of memory beyond three ticks, which verifies the previous conclusion. This result provides a key piece of knowledge for studying the stochastic nature of price fluctuations.
Acknowledgments This work is supported in part by the Scientific Grant in Aid by the Ministry of Education, Science, Sports, and Culture of Japan (C2:14580385). The author is grateful to Mr. H. Moriya (Oxford Financial Education) for his effort of providing a collection of currency exchange data for 1995-2001.
References
[1] Bachelier, L.: "Théorie de la spéculation", Doctoral Thesis, Annales Scientifiques de l'École Normale Supérieure III-17 (1900) 21-86; translation (1964) in: Cootner, P.H. (ed.): The Random Character of Stock Market Prices, MIT Press, 17-18
[2] Mantegna, R.N., Stanley, H.E.: "Scaling Behavior in the Dynamics of an Economic Index", Nature 376 (1995) 46-49
[3] Proceedings of the International Conference on Econophysics at Bali, August 2002, Physica A (2003), in press
[4] Proceedings of the IEEE International Symposium on Computational Intelligence on Financial Engineering (CIFER03), March 20-23 (2003), Hong Kong
[5] Tanaka-Yamawaki, M.: "Stability of Markovian Structure Observed in High Frequency Foreign Exchange Data", Ann. Inst. Statist. Math. (AISM) 55 (2003), in press
[6] Ohira, T., et al.: "Predictability of Currency Exchange Market", Physica A 308 (2002) 368-374
[7] Tanaka-Yamawaki, M.: "Stability of Markovian Structure Observed in High Frequency Foreign Exchange Data", New Trends in Optimization and Computer Algorithms (December 9-13, 2001, Kyoto)
[8] Tanaka-Yamawaki, M.: "A Study on the Predictability of High-Frequency Financial Data", Proceedings of the 7th International Symposium on Artificial Life and Robotics, vol. 1 (2002) 74-77
[9] Tanaka-Yamawaki, M., Komaki, S., Itabashi, T.: "Arbitrage Chances and the Non-Gaussian Features of Financial Data", Proceedings of the IEEE International Symposium on Computational Intelligence on Financial Engineering (CIFER03), March 20-23 (2003), Hong Kong
Incremental Learning and Forgetting in RBF Networks and SVMs with Applications to Financial Problems Hirotaka Nakayama and Atsushi Hattori Konan University Dept. of Info. Sci. & Sys. Eng. Kobe 658-8501, JAPAN [email protected] Abstract. Radial Basis Function Networks (RBFNs) have been widely applied to practical classification problems. In recent years, Support Vector Machines (SVMs) are gaining much popularity as promising methods for classification problems. This paper compares those two methods in view of incremental learning and forgetting. The authors have reported that the incremental learning and active forgetting in RBFNs provide a good performance for classification under the changeable environment. First in this paper, a method for incremental learning and forgetting in SVMs is proposed. Next, a comparative simulation for a portfolio problems between RBFNs and SVMs will be made.
1 Introduction
In many practical decision-making problems, the environment changes over time. In machine learning, therefore, decision rules need to adapt to such changing situations. To this end, incremental learning should be performed on the basis of new data. We have reported the effectiveness of incremental learning in several machine learning techniques (Nakayama et al. 1997). On the other hand, since the classification rule becomes more and more complex with incremental learning alone, some appropriate forgetting is also necessary. Although several approaches to forgetting in machine learning have been suggested, they are designed so that the degree of importance of data decreases over time (Nakayama 1999; Nakayama & Yoshii 2000). We call forgetting based only on the passage of time "passive forgetting". However, it seems more effective to forget data which adversely influence the current judgment; we call this way of actively forgetting such "obstacle data" "active forgetting". In the following, we suggest a method for incremental learning and active forgetting in SVMs, and compare its performance with RBF networks.
2 Support Vector Machine
Suppose that there are two sets A and B. Consider a separating hyperplane which classifies these two sets. If the original problem is not linearly separable, the
original data space is mapped to a (usually high-dimensional) feature space by some nonlinear mapping so that the mapped sets corresponding to A and B may be linearly separable. The minimal distance between the data and the separating hyperplane is called the margin. The optimal separating hyperplane is decided in such a way that the margin is maximized. We set y_i = +1 for the data x_i in A, and y_j = −1 for x_j in B. Suppose that the separating hyperplane in the feature space Z is given by w^T z + b = 0, where z = φ(x). The maximization of the margin reduces to the following QP problem:

Minimize:  (1/2) w^T w
Subject to:  y_i (w^T φ(x_i) + b) ≥ 1  (i = 1, …, n)    (1)
The dual problem associated with (1) is as follows:

Maximize:  Σ_{i=1}^{p} α_i − (1/2) Σ_{i,j=1}^{p} α_i α_j y_i y_j φ(x_i)^T φ(x_j)
Subject to:  α_i ≥ 0  (i = 1, …, p),  Σ_{i=1}^{p} α_i y_i = 0    (2)
Using a kernel function K(x, x′) with the property K(x, x′) = φ(x)^T φ(x′), problem (2) can be reformulated as follows:

Maximize:  Σ_{i=1}^{p} α_i − (1/2) Σ_{i,j=1}^{p} α_i α_j y_i y_j K(x_i, x_j)
Subject to:  α_i ≥ 0  (i = 1, …, p),  Σ_{i=1}^{p} α_i y_i = 0    (3)
In this paper, we use the Gaussian kernel

K(x, x′) = exp(−‖x − x′‖² / r²)

Here, it is important to choose the parameter r appropriately. It has been observed that the following simple estimate works well in many problems:

r = d_max / (n m)^{1/n}

where d_max is the maximal distance among the data, m is the dimension of the data, and n is the number of data.
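The width heuristic can be sketched as follows. Note that the formula above is reconstructed from garbled layout, so the root index used here (n) is an assumption; NumPy and the names are ours.

```python
# Hedged sketch of the kernel-width heuristic; X is an (n, m) data matrix.
import numpy as np

def gaussian_kernel_width(X):
    n, m = X.shape
    # maximal pairwise distance among the data
    d_max = max(np.linalg.norm(X[i] - X[j]) for i in range(n) for j in range(i + 1, n))
    return d_max / (n * m) ** (1.0 / n)   # root index n is a reconstruction assumption

def gaussian_kernel(x1, x2, r):
    return np.exp(-np.linalg.norm(x1 - x2) ** 2 / r ** 2)
```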
Separating the two sets A and B completely is called the hard margin method, which tends to overlearn; this implies that the hard margin method is easily affected by noise. In order to overcome this difficulty, the soft margin method is introduced. The soft margin method allows some slight error, represented by a slack variable ξ_i. Dualizing the original margin-maximization problem in the feature space, we have the following formulation for the soft margin method:

Maximize:  Σ_{i=1}^{p} α_i − (1/2) Σ_{i,j=1}^{p} α_i α_j y_i y_j K(x_i, x_j)
Subject to:  C ≥ α_i ≥ 0  (i = 1, …, p),  Σ_{i=1}^{p} α_i y_i = 0    (4)
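For illustration only, an off-the-shelf soft-margin SVM with a Gaussian kernel can be trained as below; scikit-learn is not used in the paper and is purely our assumption here. C is the soft-margin upper bound on α in (4), and gamma corresponds to 1/r² above.

```python
from sklearn.svm import SVC

def train_soft_margin_svm(X, y, r, C=50.0):
    """y should contain +1 / -1 labels; r is the Gaussian kernel width."""
    model = SVC(C=C, kernel="rbf", gamma=1.0 / r**2)
    model.fit(X, y)
    return model
```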
3 RBF Networks
Radial Basis Function Networks (RBFNs) are artificial neural networks with a three-layer structure. RBFNs try to follow the teacher's signal by a function of the form

f(x) = Σ_{j=1}^{m} w_j h_j(x),

where h_j (j = 1, …, m) are radial basis functions, e.g., h_j(x) = exp(−‖x − μ_j‖² / r_j).

Usually, learning in an RBFN is performed by solving

E = Σ_{i=1}^{p} (ŷ_i − f(x_i))² + Σ_{j=1}^{m} λ_j w_j²  →  Min,

where the second term is introduced for regularization. Letting A = (H_p^T H_p + Λ), a necessary condition for the above minimization is A ŵ = H_p^T ŷ. Here H_p^T = [h_1 ⋯ h_p], where h_j^T = [h_1(x_j), ⋯, h_m(x_j)], and Λ is a diagonal matrix whose diagonal components are λ_1, ⋯, λ_m. Therefore, our problem reduces to finding A^{-1} = (H_p^T H_p + Λ)^{-1}.
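A minimal sketch of the regularized least-squares solution A w = H^T y above is shown below. It assumes NumPy, and that the centers μ_j, widths r_j, and regularization parameters λ_j are given (their choice is not fixed at this point in the paper); all names are illustrative.

```python
import numpy as np

def rbf_design_matrix(X, centers, widths):
    # H[i, j] = exp(-||x_i - mu_j||^2 / r_j)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / widths)

def train_rbfn(X, y, centers, widths, lambdas):
    H = rbf_design_matrix(X, centers, widths)
    A = H.T @ H + np.diag(lambdas)
    w = np.linalg.solve(A, H.T @ y)      # equivalent to w = A^{-1} H^T y
    return w
```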
4 Incremental Learning
4.1 Passive Incremental Learning
Passive incremental learning is the method in which new data are added without considering their importance: a new test datum is added to the training dataset regardless of whether or not it is correctly classified by the current rule. Eventually, all new data are added to the training dataset.
4.2 Selectively Incremental Learning
Selectively incremental learning adds new data selectively by considering their importance: we judge whether a new datum is important, and add it to the training dataset only if it is. The details are as follows. In selectively incremental learning, not only misclassified data but also correctly classified data with small output are added to the training dataset. Namely, a new datum x_t is added to the training dataset if x_t is not classified correctly; in addition, if x_t is correctly classified but the absolute value of the output for x_t is smaller than a threshold ω, then x_t is also added to the training dataset.
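The selective-addition rule can be sketched as follows; `predict_output`, standing in for the signed output of the current SVM or RBFN rule, is an assumed helper, and the names are ours.

```python
# Hedged sketch of the selective incremental learning rule above.
def should_add(x_t, y_t, predict_output, omega=0.9):
    out = predict_output(x_t)            # signed output of the current rule
    misclassified = (out * y_t) <= 0     # wrong side of the decision boundary
    low_confidence = abs(out) < omega    # correct but close to the boundary
    return misclassified or low_confidence
```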
5 Forgetting
If we perform only incremental learning, the classification rule becomes more and more complex, which in general means poor generalization ability. Human beings seem to avoid this by forgetting unnecessary data. There are two ways of forgetting: passive forgetting and active forgetting.
5.1 Passive Forgetting
In many situations, it seems natural that the degree of importance of a datum decreases as time passes. Suppose that the output at time t is given by y(t) = e^{−βt}. The parameter β can be decided from the memory period M_p and the threshold θ by θ = e^{−β M_p}. After the time exceeds M_p, the output is set to zero.
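A small sketch of this passive-forgetting weight is given below; β follows from θ = e^{−β M_p}, and the names are illustrative.

```python
from math import exp, log

def passive_weight(age, Mp, theta):
    """Importance weight of a datum `age` steps old; zero once older than Mp."""
    beta = -log(theta) / Mp
    return exp(-beta * age) if age <= Mp else 0.0
```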
5.2 Active Forgetting
In passive forgetting, old data are regarded as less important than new data. However, important data may exist even among the old data. It seems more effective to forget data which adversely influence the current judgment; we call this way of actively forgetting such "obstacle data" active forgetting. One way of finding obstacle data is as follows. Suppose that a test pattern x_t is misjudged by the current rule. Let I_F denote the set of data in the category other than that of x_t. Removing a datum x_i ∈ I_F, judge the category of the test pattern x_t again; if the judgment becomes correct, the datum x_i is considered an obstacle datum. Such obstacle data are found by checking all data x_i ∈ I_F. In many cases, the smaller the distance between a datum x_i and the test pattern x_t, the larger the influence of x_i. Therefore, it seems natural to increase the rate of forgetting as the distance between the obstacle datum and the test pattern becomes smaller. In this case, the teacher's signal is controlled directly. One example is

y_f′ = α ( 2 / (1 + e^{−γs}) − 1 ) y_f

where s is the distance between the obstacle datum x_f and the additional datum x_t, α takes a value from [0, 1], and the parameter γ is determined mainly by experience. y_f is the teacher's signal before the renewal and y_f′ is the teacher's signal after the renewal.

Fig. 1. Nikkei Index
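The two steps of active forgetting (finding obstacle data and shrinking their teacher signals) can be sketched as below. `classify_without` is an assumed helper that classifies x_t with one datum removed, and the renewal formula follows the reconstruction given above; all names are ours.

```python
from math import exp

def find_obstacles(x_t, y_t, opposite_indices, classify_without):
    """Indices whose removal makes the judgment of x_t correct."""
    return [i for i in opposite_indices if classify_without(x_t, i) == y_t]

def renew_teacher_signal(y_f, s, alpha=0.5, gamma=1.0):
    """Smaller distance s between obstacle and test pattern -> stronger forgetting."""
    return alpha * (2.0 / (1.0 + exp(-gamma * s)) - 1.0) * y_f
```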
6 Application to Stock Portfolio Problems
Our problem is to judge whether a stock should be bought or not. Seven economic indices are taken into account. We have data for 119 past periods for which it is already known whether the stock should have been bought or not. We take the first 50 data as the teacher's data and examine the classification ability on the remaining 69 data. Note that there was a bubble economy in the first half of the period, which burst in the second half, as can be seen in Figure 1. Figure 2 shows the result of comparing incremental learning with and without active forgetting in the SVM and the RBFN for one stock. Good performance using incremental learning can be seen for both SVM and RBFN. Furthermore, the RBFN obtains a better result with active forgetting than without it. However, the SVM does not necessarily provide better results when active forgetting is introduced. The soft margin method with an appropriate value of the parameter C and incremental learning alone can give almost the same performance as the RBFN with active forgetting. This seems to be because the soft margin method in SVM imposes an upper bound on the Lagrangian multipliers α and thus has an effect similar to active forgetting.
Fig. 2. Comparison of SVM and RBFN in terms of incremental learning with/without forgetting
Tables 1-3 show the simulation results in more detail. With passive forgetting, the SVM provides performance better than (or at least almost equal to) that with selective incremental learning only. With active forgetting, however, the SVM does not necessarily yield better results than with selective incremental learning only.
Table 1. Misclassifications in passive incremental learning by SVM with various memory periods Mp

                         Mp:  110  100   90   80   70   60   50   40   30   20   10
hard margin                    14   14   14   14   14   15   15   15   13   11   14
soft margin (C = 50.0)         10   10   10   10   10   12   12   15   13   11   14
soft margin (C = 10.0)         12   12   12   12   11   12   12   13   13   12   14
Table 2. Misclassifications in selectively incremental learning by SVM with various memory periods Mp (ω = 0.9)

                         Mp:  110  100   90   80   70   60   50   40   30   20   10
hard margin                    13   13   13   13   13   12   12   15   13   11   11
soft margin (C = 50.0)         10   10   10   10   10   12    9   15   13   11   11
soft margin (C = 10.0)          8    8    8    8    9    9   10   13   11   10   12
Table 3. Misclassifications in active forgetting with passive/selective incremental learning by SVM with various forgetting ratios p

                          p   passive incremental   selective incremental
                              learning              learning (ω = 0.9)
hard margin               1   18                    20
                          2   13                    17
                          3   15                    17
                          4   14                    17
soft margin (C = 50.0)    1   17                    18
                          2   12                    14
                          3   10                    16
                          4   10                    17
soft margin (C = 10.0)    1   13                    16
                          2   13                    13
                          3   14                    13
                          4   13                    14
7 Concluding Remarks
In both SVM and RBFN, it has been observed that incremental learning provides better classification ability than initial learning only. Furthermore, forgetting (in particular, active forgetting) can raise the classification precision of the RBFN. On the other hand, forgetting (in particular, active forgetting) in the SVM does not necessarily yield better results. This seems to be because active forgetting in the SVM removes the unnecessary data suddenly, all at once. It has also been observed that the soft margin method in the SVM gives an effect similar to forgetting.
References
[1] Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press (2000)
[2] Haykin, S.: Neural Networks: A Comprehensive Foundation. Macmillan College Publishing Company (1994)
[3] Nakayama, H.: Growing Learning Machines and their Applications to Portfolio Problems. Proc. of the International ICSC Congress on Computational Intelligence Methods and Applications (CIMA'99), pp. 680-683 (1999)
[4] Nakayama, H., Yoshida, M.: Additional Learning and Forgetting by Potential Method for Pattern Classification. Proc. ICNN'97, pp. 1839-1844 (1997)
[5] Nakayama, H., Yoshii, K.: Active Forgetting in Machine Learning and its Application to Financial Problems. Proc. International Joint Symposium on Neural Networks (on CD-ROM) (2000)
Expanded Neural Networks in System Identification Shigenobu Yamawaki1 and Lakhmi Jain2 1
Dep. of Electrical Engineering, School of Science & Engineering Kinki University, Osaka, 577-8502, Japan [email protected] 2 Knowledge-Based Intelligent Engineering Systems Centre (KES), University of South Australia, Adelaide, Mawson Lakes, South Australia, 5095 [email protected]
Abstract. Neural networks are recognized to possess fault tolerance and learning capability, and they are also used in the identification of nonlinear systems. In system identification, however, it is important to whiten colored noise using a noise model. In this paper we propose an expanded neural network in which a noise model is incorporated into the output layer of the neural network. We have developed a learning algorithm that converges more quickly than the classical back-propagation algorithm. The proposed algorithm estimates the parameters of the expanded neural network using the least-squares method, and estimates the thresholds by the fundamental error back-propagation method.
1 Introduction
It is known that a neural network [1] can uniformly approximate an arbitrary continuous function. This implies that a neural network can be used to model complex systems. Identification methods for nonlinear systems using neural networks have been investigated by a number of researchers [2-5]. In these reports, however, the identification methods are discussed under the assumption that the noise is white. Colored noise should also be considered in order to discuss the validity of identification methods using neural networks. This paper proposes the Expanded Neural Network (ENN), in which a noise model is included in the output layer [6]. The parameters of the ENN are estimated by the error back-propagation method using a least-squares method, and the thresholds are estimated with the fundamental error back-propagation method. By applying the proposed method to system identification, we demonstrate its validity.
2 The Expanded Neural Networks in System Identification
In this paper, we consider a method for the identification of a class of nonlinear systems described as follows:
y(k) = F(Y(k−1), U(k−1)),  z(k) = y(k) + v(k)    (1)
where Y(k−1) is the set {y(k−1), y(k−2), …} and U(k−1) is the set {u(k−1), u(k−2), …}. Let u(k) and y(k) be the q-dimensional input vector and the p-dimensional output vector, respectively. The input u(k) is given by true values; the output y(k), however, is observed as z(k) after the observation noise v(k) has been added. The observation noise v(k) is statistically independent of the input. F is an unknown nonlinear function, and k is the step number. The model used for the identification of the system (1) is the Expanded Neural Network (ENN) [6], in which a noise model is included in the output layer of an output-recurrent neural network (NN). We propose a method for obtaining the NN and the NM simultaneously:

x_N(k) = Σ_{i=1}^{m1} A_{Ni} y_N(k−i) + Σ_{i=1}^{m1} B_{Ni} u(k−i) + θ_N,
o_N(k) = f(x_N(k)),
f(x_N(k)) = [f_1(x_{N1}(k)) f_2(x_{N2}(k)) … f_n(x_{Nn}(k))]^T,
f_i(x) = λ ( 2 / (1 + exp(−x/q_s)) − 1 )    (2.1)

y_EN(k) = y_N(k) + v_d(k) = C_N o_N(k) + {−D_1 v(k−1) − … − D_{m2} v(k−m2)}    (2.2)
where x_N(k), o_N(k), and u(k) are the n-dimensional state, the hidden-layer output of the same dimension, and the q-dimensional input of the ENN at step k, respectively. θ_N is the threshold vector of the ENN at step k. The weight parameters A_{Ni}, B_{Ni}, and C_N are appropriately sized coefficient matrices for each layer, and m1 is the finite difference order of the ENN. The sigmoid function f_i(x) has amplitude λ and slope q_s. The variable y_EN(k) is the p-dimensional expanded output, consisting of the output y_N(k) of the basic neural network (NN) and the estimated noise v_d(k). We can easily verify that (2.2) satisfies (3) as follows:

z(k) − y_EN(k) = {y(k) + v(k)} − {y_N(k) + v_d(k)} = {y(k) − y_N(k)} + {v(k) + D_1 v(k−1) + … + D_{m2} v(k−m2)}    (3)
If the output y_N(k) (= C_N o_N(k)) of the ENN (2) accurately approximates the output y(k) of the nonlinear system (1), then the AR-type noise model is obtained as follows:
z(k) − y_EN(k) = v(k) + D_1 v(k−1) + … + D_{m2} v(k−m2) = e(k)    (4)
Consequently, the output equation (2.2) shows that the AR-type noise model (NM) is incorporated in the output layer of the neural network. Although the ENN directly outputs white noise as shown in (4), the NN section of the ENN can be utilized as the external description model of the nonlinear system (1), and the NM section of the ENN can act as a whitening filter for the noise. In the proposed identification method, the parameters of the ENN are estimated with the error back-propagation algorithm using the least-squares method, and the thresholds are corrected by the basic error back-propagation method. The proposed algorithm can be summarized as follows.
<Algorithm>
1) The output error is back-propagated to the input layer by the amount of the correction rate β (0 < β < 1).
2) The threshold θ_N is corrected by the standard error back-propagation method.
3) The least-squares method is applied to estimate the output parameter C_EX = [C_N −D_1 … −D_{m2}] of the ENN. As the noise v(k) cannot be observed directly, the estimated value v̂(k) is found as follows:
v̂(k) = z(k) − y_N(k)    (5)
4) The least-squares method is applied to estimate the input-layer and state-layer parameters W_N = [A_{N1} … A_{Nm1} B_{N1} … B_{Nm1}]. 5) When the correction has been completed for all data, the procedure above is repeated l times.
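Step 3) can be sketched as a single least-squares fit, as below. This is only a rough illustration under assumptions: it uses NumPy, stacks the hidden outputs o_N(k) and the past noise estimates v̂(k−1), …, v̂(k−m2) into one regressor, and all names are ours rather than the authors' implementation.

```python
import numpy as np

def estimate_output_parameters(O, Vhat, Z, m2):
    """Least-squares estimate of C_EX = [C_N  -D_1 ... -D_m2] from (2.2).

    O    : (T, n)  hidden-layer outputs o_N(k)
    Vhat : (T, p)  noise estimates vhat(k) = z(k) - y_N(k), Eq. (5)
    Z    : (T, p)  observed outputs z(k)
    """
    T = O.shape[0]
    rows = []
    for k in range(m2, T):
        past_noise = np.concatenate([np.atleast_1d(Vhat[k - i]) for i in range(1, m2 + 1)])
        rows.append(np.concatenate([O[k], past_noise]))
    Phi = np.array(rows)                       # regressor matrix
    C_ex, *_ = np.linalg.lstsq(Phi, Z[m2:], rcond=None)
    return C_ex.T                              # rows give [C_N  -D_1 ... -D_m2]
```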
3 Examples
We have used for an identification of the bilinear system described as below, where the output z (k ) was observed the observation noise v(k ) added to the output y (k ) . 0.4 0.0 0.2 x(k ) + u1 (k ) x(k ) + 0.2 0.3 0.0 0.0 0.4 1.0 0.0 u2 ( k ) x k x k ( ) ( ), + 0.0 −0.2 0.2 1.0 1.0 −0.3 y (k ) = x ( k ) 0.4 1.0 z (k ) = y (k ) + v(k )
0.3 x(k + 1) = −0.4
0.2 0.4 − 0.2 v(k ) = v(k − 1) + 0.1 0 . 0 0 . 3
0.0 v(k − 2) + e(k ) 0.5
(6)
(7)
Here e^T(k) = [e1(k) e2(k)] has zero mean for each element and is given by normalized random numbers with covariance σ² = 0.25. The observed noise v(k) has an average value v̄ = 1.3164 × 10⁻² and a variance matrix Λ ≅ 9.2904 × 10⁻²; Λ corresponds to about 3.7680 × 10⁻¹ of Λ_y, where Λ_y is the covariance matrix of the true output. In the estimation, the number of data was taken to be 500 and the correction rate β = 0.1. The number of iterations of each step was taken to be l = 10. The estimation result of the ENN (2) for m1 = 1, n = 6, and m2 = 2 is shown in Fig. 1. The estimated result was evaluated by the Akaike information criterion (AIC) [7]. The estimated result of the algorithm applied to the basic neural network (NN) is also shown in Fig. 1.
10
Fig. 1. Akaike information criterion (AIC), as a function of the number of iterations l, for identification with the ENN and the NN (curves: ENN LSBP, ENN LSBP+BP, NN BP, NN LSBP, NN LSBP+BP). LSBP: the error back-propagation method using the least-squares method; LSBP+BP: the proposed method; BP: fundamental error back-propagation method
The algorithms applied are the classical error back-propagation method (BP), the error back-propagation method using the least-squares method (LSBP), and the proposed method (LSBP+BP). It is clear from Figure 1 that the ENN improves the estimation accuracy for each algorithm. Furthermore, Fig. 2 shows the estimation results for the models and algorithms used to demonstrate the validity of this procedure. It shows that the ENN estimated by the proposed procedure performs similarly to the model that uses the NN and the NM together. Next, the respective estimation errors are shown in Table 1.
Fig. 2. Akaike information criterion (AIC) of identification for each model (NN, NN+NM, ENN) and algorithm (BP, LSBP, LSBP+BP). NN: basic neural network; NN+NM: basic neural network with noise model; ENN: expanded neural network

Table 1. Estimation error for each model structure
Model    Algorithm   ê (×10⁻²)   Λ̂ (×10⁻²)
NN       BP            4.6705     15.0930
         LSBP          1.5496     12.9970
         LSBP+BP      13.4220     11.5240
NN+NM    BP            6.4257     11.0410
         LSBP          2.1358      8.8091
         LSBP+BP      16.9760      8.2724
ENN      LSBP          3.4909      9.0935
         LSBP+BP      22.9330      8.6250

Table 2. Residual tests for each model structure (values in %)

Model    Algorithm   e(k)   e(k)−u(k)
NN       BP           40       28
         LSBP         28       18
         LSBP+BP      52       34
NN+NM    BP           16       28
         LSBP          8       22
         LSBP+BP      44       26
ENN      LSBP         12       22
         LSBP+BP      96       34
Finally, the result of the residual test for each model and estimation algorithm is shown in Table 2. The residual test was applied using the 95% confidence limit. Table 1 demonstrates that this procedure obtains estimates which are biased for each model. Therefore, it is evident from Table 2 that the whitening of the estimation error is not fully achieved with this procedure.
4 Conclusion
In this paper, we have proposed the expanded neural network, in which a noise model is incorporated into the output layer of the neural network. Furthermore, we have proposed an algorithm that applies the least-squares method to the parameters of the ENN and estimates the thresholds by the fundamental error back-propagation method. The validity of the proposed ENN and algorithm was clarified by applying them to the identification of a nonlinear system. The simulation made it clear that the estimates obtained with the proposed algorithm are biased. Consequently, the noise model was not obtained as a complete whitening filter; the estimation accuracy, however, is improved.
Acknowledgements The author, S. Yamawaki, wishes to thank the Knowledge-Based Intelligent Engineering Systems Centre (KES) of the University of South Australia for their hospitality and research discussion during his stay in KES. (October 2002 ~ October 2003).
References
[1] Funahashi, K.: On the Approximate Realization of Continuous Mappings by Neural Networks. Neural Networks 2 (1989) 183-192
[2] Chen, S., Billings, S.A., Grant, P.M.: Non-linear system identification using neural networks. Int. J. Control 51(6) (1990) 1191-1214
[3] Yamawaki, S., Fujino, M., Imao, S.: Modeling of dynamic systems by neural networks and characteristic analysis. Transactions of the Institute of Systems, Control and Information Engineers 6(5) (1993) 207-212 (in Japanese)
[4] Yamawaki, S., Fujino, M., Imao, S.: An Approximate Maximum Likelihood Estimation of a Class of Nonlinear Systems using Neural Networks and Noise Models. T. ISCIE 12(4) (1999) 203-211 (in Japanese)
[5] Yamawaki, S., Fujino, M., Imao, S.: The Back Propagation Method Using the Least Mean-Square Method for the Output Recurrent Neural Network. T. ISCIE 12(4) (1999) 225-233 (in Japanese)
[6] Yamawaki, S.: A System Identification Method using Expanded Neural Networks. Knowledge-Based Intelligent Information Engineering Systems & Allied Technologies, KES 2002, IOS Press (2002) 358-363
[7] Ljung, L.: System Identification - Theory for the User. Prentice-Hall (1987)
A New Learning Algorithm for the Hierarchical Structure Learning Automata Operating in the General Nonstationary Multiteacher Environment Norio Baba1 and Yoshio Mogami2 1
2
Information Science Osaka Kyoiku University 582-8582, Japan Intelligent Information Engineering Systems Tokushima University, 770-8506, Japan
Abstract. Learning behaviors of the hierarchical structure stochastic automata operating in the general nonstationary multiteacher environment are considered. It is shown that convergence with probability 1 to the optimal path is ensured by a new learning algorithm which is an extended form of the relative reward strength algorithm.
1
Introduction
The study of learning automata in an unknown environment was started by Varshavskii and Vorontsova [1] and has since been done quite extensively by many researchers [2, 3, 4, 5, 6]. The learning automata theory has now reached a relatively high level of maturity, and various successful applications utilizing learning automata has so far been reported [7, 8, 9]. Despite the current matured state concerning learning automata theory and applications, there are still several problems to be settled. One of the most important is the insufficient tracking ability to the nonstationary environment. In order to overcome this problem, extensive research effort has so far been done by many researchers. (Due to limitation of space, we don’t go into details. Interested readers are kindly asked to read the references [10, 11, 12, 13, 14, 15].) However, research level concerning this matter is still in its infancy. In this paper, we shall consider the learning behaviors of the hierarchical structure learning automata (HSLA) in the general multiteacher environment. We shall propose a new learning algorithm which exploits the idea of the relative reward strength algorithm [13] and show that it can ensure the convergence with probability 1 to the optimal path under a certain condition.
V. Palade, R.J. Howlett, and L.C. Jain (Eds.): KES 2003, LNAI 2773, pp. 1122–1128, 2003. c Springer-Verlag Berlin Heidelberg 2003
A New Learning Algorithm
1123
Unknown Random Environment R ( C1 ,....,Cr )
yi
S(t)=(0,1)
Stochastic Automaton A ( w1 ,....,wr )
Fig. 1. Basic model of a learning automaton operating in an unknown random environment
2
Learning Mechanism of the Hierarchical Structure Learning Automata (HSLA) Operating in the Nonstationary Multiteacher Environment
The learning behaviors of stochastic automaton have been extensively studied under the basic model shown in Fig. 1. However, one of the most serious bottlenecks concerning the basic model is that a single automaton can hardly cope with the problems with high dimensionality. To overcome this problem, Thathachar and Ramakrishnan [6] proposed the concept of HSLA. Since then, the learning behaviors of the HSLA have been extensively studied by many active researchers. In the followings, we shall briefly touch upon the learning behaviors of the HSLA operating in the nonstationary multiteacher environment. 2.1
Hierarchical Structure Learning Automata (HSLA) Operating in the Nonstationary Multiteacher Environment
Fig. 2 illustrates the learning mechanism of the HSLA operating in the general nonstationary multiteacher environment. The hierarchy is composed of the single automaton A¯ in the first level, r automata A¯1 , · · · , A¯r in the second level, and rs−1 automata A¯j1 j2 ···js−1 (ji = 1, · · · , r; i = 1, · · · , s − 1) in the sth level (s = 1, · · · , N ). Each automaton in the hierarchy has r actions. The operation of the learning system of the HSLA can be described as follows. Initially, all the action probabilities are set equal. A¯ chooses an action at time t from the action probability distribution (p1 (t), · · · , pr (t)) . Suppose that αj1 (j1 = 1, · · · , r) is the output from A¯ and β1k1 j1 (t) k1 = 1, · · · , r1 , where t denotes time, is the reward strength from the k1 th teacher. (β1k1 j1 (t) can be an arbitrary number in the closed line segment [0, 1].) Depending upon the output αj1 , the responses β1k1 j1 (t), k1 = 1, · · · , r1 , and the responses from the lower levels, the first level automaton A¯ changes its action probability vector P (t) = (p1 (t), · · · , pr (t)) . Corresponding to the output αj1 at the first level, the automaton A¯j1 is actuated
1124
Norio Baba and Yoshio Mogami j
α j 1j 2
j
α1 − A2
− A1
α 11 α 12 α 13
− A 11
α j 1j 2j 3
α 21 α 22 α 23
β 1i 1
α3
α2
r j1 β 11
j j
Level 2
− A3
Environment
Environment
α j1
β 11 1
Level 1
− A
β 12 1 2 j j
β 2i 1 2
α 31 α 32 α 33
− A 21
− A 32
r j j β 22 1 2
Level 3
j j j
β 13 1 2 3 j j j
β 3i 1 2 3 r 3 j 1j 2j 3
β3 α 111
α 132
α 211
α 311
α 332
Fig. 2. Learning mechanism of hierarchical structure learning automata
in the second level. This automaton chooses an action from the current action probability distribution (pj1 1 (t), · · · , pj1 r (t)) . This cycle of operation from the top to the bottom. The sequence of actions {αj1 , αj1 j2 , · · · αj1 j2 ···jN } having been chosen by N automata is called the path. Let φj1 ···jN denote the path. Corresponding to the path, the HSLA receive reward strengths {(β11j1 , · · · , 1j1 ···jN rN j1 ···jN β1r1 j1 ), (β21j1 j2 , · · · , β2r2 j1 j2 ), · · · , (βN , · · · , βN )} from the general multiteacher environment. The HSLA model utilizes these reward strengths in order to update the current recent reward vector. The action probability of each automaton relating to the path is updated by using the information concerning the recent reward vector. After all of the above procedures have been completed, time t is set to be t + 1. Let πj1 j2 ···jN (t) denote the probability that the path φj1 j2 ···jN is chosen at time t. Then, πj1 j2 ···jN (t) = pj1 (t)pj1 j2 (t) · · · pj1 j2 ···jN (t). 2.2
(1)
The Nonstationary Multiteacher Environment
In this paper, we consider the learning behaviors of the hierarchical structure learning automata (HSLA) under the nonstationary multiteacher environment with the following property: ∗ satisfying the following relation (2) exists The optimal path φj1∗ j2∗ ···jN uniquely: inf E t
r j ∗ j ∗ ···j ∗ (t) + · · · + βs s 1 2 s (t) rs 1i1 i2 ···is βs (t) + · · · + βsrs i1 i2 ···is (t) > sup E for all s = 1, · · · , N (2) rs t 1j1∗ j2∗ ···js∗
βs
A New Learning Algorithm
1125
Remark 1. βslk1 k2 ···ks (t) can be arbitrary number in the closed interval [0, 1]. The large value indicates that the high reward is given by the lth teacher in the sth level. Remark 2. E{·} denotes the mathematical expectation of ·
3
A New Learning Algorithm of HSLA
Recently, we [14] have extended the relative reward strength algorithm of Simha and Kurose [13] in order that it can be used in the HSLA model where a single reward strength β(t) is given as an environmental response. In this section, we shall propose another extended algorithm to be utilized in the learning model where HSLA receive reward strengths from each level of the hierarchy of the multiteacher environment. First, let us define ”recent average reward strength to the path” and ”recent average reward strength vector”. Definition 1: Let us assume that the path φi1 i2 ···iN has been chosen at time t 1i1 i2 ···iN rN i1 i2 ···iN and, corresponding to φi1 i2 ···iN , reward strengths βN (t), · · · , βN (t) has been given from the N th level (bottom level) of the multiteacher environment. Then, ”recent average reward strength (at the bottom level) to the path φi1 i2 ···iN ” is defined as follows: 1i1 ···iN rN i1 ···iN βN (t) + · · · + βN (t) . (3) u¯i1 i2 ···iN (t) = rN On the other hand, the other recent average reward strength (at the bottom = jk ) is defined as follows: level) to the path φj1 j2 ···jN (∃ k; ik u¯j1 j2 ···jN (t) =
1j1 ···jN rN j1 ···jN βN (τj1 j2 ···jN ) + · · · + βN (τj1 j2 ···jN ) . rN
(4)
where τj1 j2 ···jN is the most recent time when the path φj1 j2 ···jN has been chosen, rk j1 ···jN (τj1 j2 ···jN ) (k = 1, · · · , N ) is the reward strength from the rk th and βN teacher at the bottom level of the multiteacher environment at τj1 j2 ···jN . ¯ i1 i2 ···is−1 (t) = (¯ Definition 2: Let v vi1 i2 ···is−1 1 (t), · · · , v¯i1 i2 ···is−1 r (t)) be the recent average reward strength vector relating to the sth level LA A¯i1 i2 ···is−1 (s = ¯ i1 i2 ···is−1 (t) is constructed as 1, 2, · · · , N ). Then, each of the components of v follows: At the N th (bottom) level v¯i1 i2 ···iN (t) = u ¯i1 i2 ···iN (t).
(5)
At the sth level (1 ≤ s ≤ N − 1) v¯i1 i2 ,···is (t) = αs Aβ + (1 − αs ) max{¯ vi1 i2 ···is 1 (t), · · · , v¯i1 i2 ···is r (t)}.
(6)
1126
Norio Baba and Yoshio Mogami
Here, Aβ denotes the recent average reward strength obtained from the sth level multiteacher environment and αs is the parameter satisfying the relation 0 < αs < 1. Aβ is defined as follows: 1. If the automaton is on the path φi1 ···iN chosen at time t, then Aβ =
βs1i1 ···is (t) + · · · + βsrs i1 ···is (t) . rs
(7)
2. If the automaton is on the other path φj1 ···jN (∃ k; ik = jk ), then Aβ =
βs1j1 ···js (τj1 ···jN ) + · · · + βsrs j1 ···js (τj1 ···jN ) , rs
(8)
where τj1 ···jN is the most recent time when the path φj1 ···jN has been chosen. As in [13], we assume that the following condition holds for all i1 , i2 , · · · , is (iq = 1, 2, · · · , r; q = 1, 2, · · · , s(s = 1, 2, · · · , N )): qmin ≤ pi1 i2 ···is (t) ≤ qmax ,
(9)
0 < qmin < qmax < 1 and qmax = 1 − (r − 1)qmin .
(10)
where Let us propose a new learning algorithm of HSLA operating in the multiteacher environment. Learning Algorithm In the followings, let us propose a new learning algorithm of the HSLA operating in the multiteacher environment: Assume that the path φ(t) = φj1 j2 ···jN has been chosen at time t and actions αj1 , αj1 j2 , · · · , αj1 j2 ···jN has been actuated to the multiteacher environment (MTEV). Further, assume that (corresponding to the actions by HSLA) environmental responses {(β11j1 , · · · , β1r1 j1 ), (β21j1 j2 , · · · , β2r2 j1 j2 ), · · · , 1j1 ···jN rN j1 ···jN (βN , · · · , βN )} have been given to the HSLA. Then, the action probabilities pj1 j2 ···js−1 is (t) (is = 1, 2, · · · , r) of each automaton A¯j1 j2 ···js−1 (s = 1, · · · , N ) connected to the path being chosen are updated by the following equation: pj1 j2 ···js−1 is (t + 1) = pj1 j2 ···js−1 is (t) + λj1 j2 ···js−1 (t)∆pj1 j2 ···js−1 is (t),
(11)
where ∆pj1 j2 ···js−1 is (t) is calculated by v¯j1 j2 ···js−1 is (t) ∆pj1 j2 ···js−1 is (t) = − |Bs1(t)| ls ∈Bs (t) v¯j1 j2 ···js−1 ls (t) ∀is ∈ Bs (t) 0 is ∈ Bs (t)
(12)
A New Learning Algorithm
1127
Here, the set Bs (t) is constructed as follows: 1. Place v¯j1 j2 ···js−1 js (t) in descending order. vj1 j2 ···js−1 ks (t) = maxis {¯ vj1 j2 ···js−1 is (t)}}. 2. Set Dj1 j2 ···js−1 = {ks |¯ 3. Repeat the following procedure for is (is ∈ Dj1 j2 ···js−1 ) in descending order of v¯j1 j2 ···js−1 is (t): If the inequality pj1 j2 ···js−1 is > qmin can be satisfied as a result of calculation by (11) and (12), then set Dj1 j2 ···js−1 = Dj1 j2 ···js−1 ∪{is }. 4. Set Bs (t) = Dj1 j2 ···js−1 . Remark 3. λj1 j2 ···js−1 (t) denotes the step size parameter at time t. Remark 4. The action probabilities of each automaton which is not on the selected path are not changed.
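The update (11)-(12) for a single automaton on the chosen path can be sketched as follows. This is an approximate reading of the B_s(t) construction above (the admission test in step 3 is paraphrased), assuming plain Python lists and illustrative names rather than the authors' formulation.

```python
# Hedged sketch of the probability update (11)-(12) for one automaton.
def update_probabilities(p, v_bar, lam, q_min):
    """p, v_bar: lists over the r actions of one automaton; lam: step size."""
    order = sorted(range(len(p)), key=lambda i: v_bar[i], reverse=True)
    B = [order[0]]                                   # start from the best action
    for i in order[1:]:
        trial = set(B) | {i}
        total = sum(v_bar[j] for j in trial)
        # keep i only if its updated probability would stay above q_min
        if p[i] + lam * (v_bar[i] / total - 1.0 / len(trial)) > q_min:
            B = list(trial)
    total = sum(v_bar[j] for j in B)
    new_p = list(p)
    for i in range(len(p)):
        if i in B:
            new_p[i] = p[i] + lam * (v_bar[i] / total - 1.0 / len(B))
    return new_p
```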
4
Convergence Theorem
In this section, we shall give a convergence theorem concerning the learning performance of the proposed algorithm. First, we show that the following lemma can be obtained. 1 Lemma 1. Assume that HSLA receives reward strengths {(β11j1 , · · · , 1j1 ···jN rN j1 ···jN β1r1 j1 ), · · · , (βN , · · · , βN )} from the multiteacher environment. Then, ∗ (s = 1, · · · , N ) holds for the following inequality concerning the LA A¯j1∗ j2∗ ···js−1 all t: ∗ E{¯ vj1∗ j2∗ ···js∗ (t)} > E{¯ vj1∗ j2∗ ···js−1 (13) is (t)}, = js∗ . where is = 1, 2, · · · , r, is By taking advantage of the above lemma, we can easily obtain the following theorem. 1 Theorem 1: Assume that the conditions (9) and (10) and the conditions in the lemma hold. Further, let λj1 ,···,js−1 (t) be a sequence of real numbers which ∞ satisfies the following conditions: λ (t) > 0, λj1 ,···,js−1 (t) = ∞, j ,···,j 1 s−1 t=1 ∞ 2 ∗ ∗ t=1 λj1 ,···,js−1 (t) < ∞. Then, the path probability πj1 ···jN (t) that the HSLA N ∗ (t) at time t converges almost surely to (qmax ) . chooses the optimal path φj1∗ ···jN
5 Concluding Remarks
In this paper, we have extended the relative reward strength algorithm of Simha and Kurose [13] so that it can be used in the HSLA model operating in the multiteacher environment. We have shown that the proposed algorithm ensures convergence to the optimal path with probability 1. Further research is needed in order to investigate the detailed behavior of the HSLA under various types of nonstationary environments.
¹ Due to space limitations, we do not go into the details of the proofs of the lemma and the theorem. They can be proved by using the same procedure as in [13, 14]. Interested readers are kindly asked to consult these papers.
References
[1] Varshavskii V. I., Vorontsova I. P.: On the behavior of stochastic automata with variable structure. Autom. Remote Control 22 (1961) 1345–1354
[2] Lakshmivarahan S.: Learning Algorithms Theory and Applications. Springer-Verlag, Berlin Heidelberg New York (1981)
[3] Baba N.: New Topics in Learning Automata Theory and Applications. Springer-Verlag, Berlin Heidelberg New York (1985)
[4] Narendra K. S., Thathachar M. A. L.: Learning Automata: An Introduction. Prentice-Hall, Englewood Cliffs, NJ (1989)
[5] Poznyak A. S., Najim K.: Learning Automata and Stochastic Optimization. Springer-Verlag, Berlin Heidelberg New York (1997)
[6] Thathachar M. A. L., Ramakrishnan K. R.: A hierarchical system of learning automata. IEEE Trans. Syst. Man Cybern. SMC-11 (1981) 236–241
[7] Srikantakumar R. P., Narendra K. S.: A learning model for routing in telephone networks. SIAM J. Control Optim. 20 (1982) 4–57
[8] Zeng X., Zhou J., Vasseur C.: A strategy for controlling nonlinear systems using a learning automata. Automatica 36 (2000) 1517–1524
[9] Papadimitriou G. I., et al.: On the use of learning automata in the control of broadcast networks: A methodology. IEEE Trans. Syst. Man Cybern. 32 (2002) 781–790
[10] Baba N., Sawaragi Y.: On the learning behavior of stochastic automata under a nonstationary random environment. IEEE Trans. Syst. Man Cybern. SMC-5 (1975) 273–275
[11] Thathachar M. A. L., Sastry P. S.: A class of rapidly converging algorithms for learning automata. Proc. IEEE Int. Conf. Cybernetics and Society, Bombay, India (1984) 602–606
[12] Thathachar M. A. L., Sastry P. S.: A hierarchical system of learning automata that can learn the globally optimal path. Inf. Sci. 42 (1987) 143–166
[13] Simha R., Kurose J. F.: Relative reward strength algorithms for learning automata. IEEE Trans. Syst. Man Cybern. 19 (1989) 388–398
[14] Baba N., Mogami Y.: A new learning algorithm for the hierarchical structure learning automata operating in the nonstationary S-model random environment. IEEE Trans. Syst. Man Cybern. 32 (2002) 750–758
[15] Oommen B. J., Agache M.: Continuous and discretized pursuit learning schemes: Various algorithms and their comparison. IEEE Trans. Syst. Man Cybern. B 31 (2001) 277–287
An Age Estimation System Using the Neural Network Kensuke Mitsukura1 , Yasue Mitsukura2 , Minoru Fukumi1 , and Sigeru Omatu3 1
The University of Tokushima 2-1 Minami-Josanjima Tokushima, 770-8506 JAPAN {kulaken,fukumi}@is.tokushima-u.ac.jp 2 Okayama University 3-1-1 Tsushima Okayama 700-8530 JAPAN [email protected] 3 Osaka Prefecture University 1-1 Gakuen-Cyo, Sakai, Osaka 599-8531 JAPAN [email protected]
Abstract. Threshold selection in multi-value images is performed based on their color information. When the threshold for one image is fixed, it lacks versatility for other images, because the color information varies under the influence of lighting conditions. Furthermore, it is important to decide whether a region is a face or not, and defining such a face decision standard is very difficult. In this paper, a Genetic Algorithm (GA) is used to select the most likely values of lip and skin colors under a given lighting condition. It is possible to extract objects from a multi-value image using only the color information. In this paper, the objects of extraction are chosen to be human lip and skin colors. Furthermore, we propose a face decision standard, that is, a method for deciding whether a region is a face or not. Since it is also very important to identify individuals, the detected faces are distinguished individually by using their color maps.
1 Introduction
Much work on face recognition with digital computers has been done actively [1]−[5] and applied to human tracking and counting. For these demands, there are methods based on the color histogram; however, these methods cannot cope with changes in color [6], [7]. It is very difficult to extract a specific object from an image by using nothing but color information. People recognize color information using various color characteristics, and objects are normally not monochrome, except those made intentionally so. Good results can be obtained with a threshold obtained from binarization of a given image. However, the results are not as good when several images are used, because of variations in lighting, orientation, time, and so on. Therefore, when the threshold is fixed, generality cannot be achieved. In this paper, the best threshold for every image is found using a GA. Lip extraction research uses various techniques in which the position of a face, distance relations, principal component analysis, etc., are
utilized. Some fixed conditions are needed for these images. However, even if the race of the subject changes, there is no great difference in lip color and shape, so it should be possible to seek an appropriate threshold for lip extraction. Therefore, this paper proposes a method for extracting lip regions that pays attention to the color information of the lips. Although it is difficult to decide the threshold by conventional methods, it can be decided automatically by using a GA. Furthermore, faces are detected by using the best thresholds (skin color and lip color thresholds) sought by the GA, where “best” means the most likely lip and skin colors found by the GA. Faces detected by the proposed method must then be assigned to individuals; therefore, face identification is also performed in the proposed method. Moreover, in order to show the effectiveness of the proposed method, computer simulations are carried out using real images.
2 Image Processing
2.1 Use of Color System
To obtain optical features of the color information in color images, each attribute of a color system is used. Examples of color systems are RGB, Y CrCb, Y IQ, etc. In this paper, the Y, Cr, Cb color system is used, since it best reflects the color perception properties of human beings. In particular, the value Y is used for the purpose of estimating the light intensity. The transform from the RGB color values is given by

   Y  =  0.29900 R + 0.58700 G + 0.11400 B
   Cr =  0.50000 R − 0.41869 G − 0.08131 B
   Cb = −0.16874 R − 0.33126 G + 0.50000 B ,

where Y is the brightness and Cr, Cb are the color components. Note that only Y is used in this paper. The method using thresholds can recognize objects at high speed. However, how to choose the thresholds is a difficult problem, because a threshold changes with the lighting condition; therefore, the recognition accuracy changes significantly with the threshold value. In this paper, appropriate thresholds for the lip and skin colors are found by using a GA, conditioned on the Y value of an unknown image, which is calculated by the above equation.
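As an illustration, the brightness computation used throughout this paper can be written in a few lines of Python. This is only a sketch; the array layout and the 0–255 value range are assumptions, not specifications from the paper.

```python
import numpy as np

# Y/Cr/Cb conversion coefficients from the equation above; only Y (brightness)
# is used later when looking up the GA-trained thresholds.
def brightness_y(rgb_image):
    """Average brightness Y of an H x W x 3 RGB image (values 0-255)."""
    y = (rgb_image[..., 0] * 0.299
         + rgb_image[..., 1] * 0.587
         + rgb_image[..., 2] * 0.114)
    return float(y.mean())
```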
2.2 Genetic Algorithm
First, the average brightness Y of the whole image is calculated. Next, the most likely values of the lip and skin colors are encoded as an individual of the GA. GA learning is terminated if the recognition accuracy reaches 100%. Images of the same scenery with varying brightness are used for GA learning. A table defining the relations between the most likely lip and skin color values and the brightness is produced as a result of GA learning. Then, the Y value of an unknown image is examined, and its most likely lip and skin color values are
Fig. 1. Examples of images used here
chosen from the table. The most likely values in this paper mean representative values of the lips and skin. Finally, the search for faces is performed. In this paper, the most likely lip and skin color values are determined by using a real-coded GA. These procedures are summarized in the following (a sketch of the crossover step is given after the list).
Step 1: An initial uniform random number in [0, 255] is assigned to all individuals.
Step 2: Calculation of the fitness function {fI}, where I denotes an individual of the N-th generation and N(max) is the maximum generation number.
Step 3: Individuals are rearranged in order of their recognition rates.
Step 4: Reproduction of the next generation's population through selection, crossover, and mutation.
In Step 2, the fitness function is given by the recognition accuracy; that is, the most likely lip and skin color values are given by the individual with the highest fitness value. In Step 3, the elite strategy is used until generation N(max)/2; after N(max)/2, selection follows the probability distribution. As for the crossover, first two individuals (I(1), I(2)) are selected, and their fitness values (fI(1), fI(2)) are obtained. The individual of the next generation is given as the real number that divides the line segment between I(1) and I(2) in the ratio fI(2) to fI(1).
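The following Python fragment sketches one reading of the crossover and mutation steps described above. The mutation rate and the interpretation of the dividing ratio as a fitness-weighted internal division point are assumptions, not values given in the paper.

```python
import random

def crossover(i1, f1, i2, f2):
    """Fitness-weighted crossover of two real-coded individuals (threshold values).

    The child is the internal division point of the segment [i1, i2] in the
    ratio f2 : f1, i.e. a fitness-weighted average lying closer to the fitter
    parent (one reading of the crossover described in the text).
    """
    return (f1 * i1 + f2 * i2) / (f1 + f2)

def mutate(x, rate=0.05):
    """Uniform mutation within the 0-255 colour range (the rate is an assumed value)."""
    return random.uniform(0, 255) if random.random() < rate else x
```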
3 Computer Simulations
In order to show the effectiveness of the proposed method, computer simulations are done. First, several sample images taken under various conditions are used for GA training. The samples are shown in Fig. 1.
Fig. 2. The estimated threshold lines (lips) obtained by using the RLS algorithm

Next, the average brightness Y values of these images are calculated. Moreover, the most likely values of the lip and skin colors are found for each brightness Y by using the GA. The threshold lines estimated by using the RLS algorithm are illustrated in Figs. 2 and 3; RLS means the least squares algorithm. Tests were then carried out using these results on unknown images. First, the average brightness Y is calculated, and then offsets of +10 and −10 to the RGB values obtained from Figs. 2
Fig. 3. The estimated threshold lines (skin) by using the RLS algorithm
Fig. 4. RGB distributions of the individuals
and 3 are set as the upper-bound and lower-bound thresholds, respectively. When we apply the proposed method to real images, a recognition accuracy of 98.6% is obtained. Furthermore, a face can be detected in only 0.08 seconds. In the proposed method, faces are determined by both skin color and lip color. Therefore, if a skin region is detected by mistake, no lip color is found inside that falsely detected region, and such false detections are drastically reduced. The detected faces are then assigned to individual identities, and a recognition accuracy of 95.2% is obtained. Fig. 4 shows the RGB distributions of the individuals. From these results, it is confirmed that the proposed method works well. Furthermore, in order to show the effectiveness of the proposed face identification method, simulation results are shown in Fig. 7. This figure shows that the face identification method works well.
4 Conclusions
In this paper, threshold values were found in consideration of lighting conditions by using a GA. Moreover, face identification was performed by color identification. In order to show the effectiveness of the proposed method, computer simulations were performed. From the simulation results, it was confirmed that the proposed scheme works well and very fast, and that the face identification by the proposed method also works well.
References
[1] M. Hasegawa, Y. Nasu, and E. Shimizu: A method of face images by using multiplex resolution image (globalize and localize), IEICE Tech. Report PRU 89-26, pp. 57-60 (1989), in Japanese
[2] S. Akamatsu: The research movement of the face recognition by the computer, IEICE Tech. Report (D-II), Vol. J80, No. 8, pp. 2031-2046 (1997), in Japanese
[3] Henry A. R., Shumeet B., and Takeo K.: Human Face Detection in Visual Scenes, Proc. NIPS 8 (Advances in Neural Information Processing Systems), pp. 875-881 (1996)
[4] H. Yokoo and M. Hagiwara: Human Face Detection Method Using Genetic Algorithm, The Journal of The Institute of Electrical Engineers of Japan (Electronics, Information and Systems Society), Vol. 117, No. 9, pp. 1245-1252 (1997), in Japanese
[5] H. W., Q. C., M. Y.: Detecting Human Face in Color Images, Proc. of IEEE SMC Conf., pp. 2232-2237 (1996)
[6] Murase, et al.: Fast Visual Search Using Focussed Color Matching - Active Search, IEICE Tech. Report (D-II), Vol. J81-D-II, No. 9, pp. 2035-2042 (1998), in Japanese
[7] H. Ishii, M. Fukumi, and N. Akamatsu: Face Detection based on skin color information in visual scenes by neural networks, Human Interface Society Tech. Report, Vol. 1, No. 3, pp. 49-54 (1999), in Japanese
[8] R. Nakagawa and T. Kawaguchi: Detection of Human Faces in a Color Image with Complex Background, IEICE Tech. Report PRMU 99-111, pp. 77-82 (1999)
Reconstruction of Facial Skin Color in Color Images
Stephen Karungaru, Minoru Fukumi, and Norio Akamatsu
University of Tokushima, 2-1 Minami Josanjima, 770-8506, Tokushima, Japan. {karunga,fukumi,akamatsu}@is.tokushima-u.ac.jp

Abstract. In this paper, a method to recover and reconstruct skin color from images over- or under-exposed to scene illumination is proposed. Over-exposure is a common and serious problem in computer vision, photography, etc., especially in images taken under direct sunlight or when extra illumination is added to a scene, for example by a camera's flash. In addition, in scenes with insufficient illumination, under-exposure is prevalent. The skin color regions most susceptible to over-exposure include the areas around the forehead, the tip of the nose, and the cheeks. This method uses a feed-forward neural network and a color distance measure to learn and construct the skin color face image map. The color distance measure used is the Cylindrical Metric Method (CMM). The relationships between the skin color distances of neighboring pixels and their relative positions in a face are learned using the neural network. This information is then used to reconstruct over- or under-exposed skin color regions.
1 Introduction
The color appearance in visual scenes depends on many factors. Of these, one of the most important parameters is the scene illumination used during image capture. Without any scene illumination, it is literally impossible to capture an image using a normal camera. In addition, if the correlated color temperature (CCT) of the scene illuminant is very high (it is assumed, of course, that the camera is calibrated using an illuminant with a lower CCT), then the colors in the image will appear distorted and over-exposed. This means that the RGB values of the image pixels are at or close to the value 255, making the color of such pixels appear whitish. This is a common problem with objects whose surface is convex in shape, since they tend to reflect more of the scene illuminant [1]. A lot of research has been done in the area of skin color correction, normalization, or compensation under varying illumination [1], [2], [3]. However, the main objective of these papers is to transform the skin color appearance under an unknown scene illumination to its appearance under a given scene illumination. Over-exposed parts of the skin color still appear over-exposed even in the transformed images. This paper hopes to fill this gap by going a step further: to recover the real skin color.
Fig. 1. RGB histogram for the averaged cheeks and forehead region pixels and the difference curves
2 Characteristics of Skin Color
The most natural-looking skin color in visual scenes can be achieved if the conditions at image acquisition are normalized and fixed. This ideal skin color image can be captured by a white-balanced camera with the scene illuminated by the daylight illuminant (CIE D65). To investigate the characteristics of skin color captured as described, we collected 100 ideal skin color images of 200x200 pixels of Caucasian subjects from the physics-based face database obtained from the University of Oulu [3]. The database contains frontal face images taken using four illuminants (for camera white balancing and scene illumination), including CIE D65. A face detector adapted from [5] was then used to detect the position of each face in the images. After the face position was detected, 500 samples, each 40x40 pixels in size, were manually extracted from the cheek and forehead regions of the faces. An average RGB histogram was then plotted for all the samples, as shown in Fig. 1. In Fig. 1, R, G and B represent the RGB color space components, Y the brightness, and (R-G) the difference between the R and G color components; likewise, (G-B) represents the difference between the G and B color components. Two major skin color characteristics were then observed.

1. In ideal face images, the general relationship between the three RGB components of skin color pixels can be expressed as

   R > G > B.   (1)

2. The average difference between the components R, G and B increases with brightness, as shown in Fig. 1. For the applicable brightness range, the two curves (R-G and G-B) were fitted using the least squares method, and the following equations were obtained to express this relationship.
(i) RG range: (R − G) = 0.4Y + 22.   (2)
(ii) GB range: (G − B) = 0.28Y.   (3)

Eqs. (2) and (3) are valid when the brightness has been normalized to the range

   75 ≤ Y ≤ 105.   (4)

This means that the ranges for R−G and G−B can be calculated from Eqs. (2) and (3) as follows:

(i) RG range: 52 ≤ (R − G) ≤ 64.   (5)
(ii) GB range: 21 ≤ (G − B) ≤ 29.   (6)
Equations (5) and (6) can be used to evaluate whether a skin color pixel is over/under-exposed or not. First, the given pixel's brightness level is normalized to within the range allowed by equation (4); then R−G and G−B are calculated to see whether they satisfy equations (5) and (6). If they do not, and the original brightness was above the higher brightness threshold of equation (4), then the skin color pixel is very likely to be over-exposed; otherwise, the pixel is likely to be under-exposed. It should, however, be noted that from this result alone we cannot conclude that the pixel is over- or under-exposed without running a test that decides whether the pixel is skin color or not. In addition, from equations (2) and (3) it can be concluded that if any two of the three RGB components are known, the third can be calculated. This helped reduce the number of nodes required to represent a pixel from three to two in the neural network's output layer; a smaller and hence much faster neural network was therefore obtained. Several methods exist for human skin segmentation from a visual scene [1], [6]. Of these, one of the fastest to compute is the threshold method using the YIQ color space [1]. Fig. 2 shows skin color regions not detected using this method because they are over-exposed.
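A rough Python sketch of this exposure test is given below. It simplifies the brightness-normalization step of Eq. (4) and treats the thresholds of Eqs. (5) and (6) as fixed; it is meant only to illustrate the logic, not to reproduce the authors' implementation.

```python
def exposure_check(r, g, b, y_high=105, y_low=75):
    """Rough over/under-exposure test for one pixel, following Eqs. (1)-(6).

    Returns 'over', 'under' or 'ok'.  Brightness normalisation is omitted:
    we simply compare Y against the valid range of Eq. (4).
    """
    y = 0.299 * r + 0.587 * g + 0.114 * b        # brightness
    rg_ok = 52 <= (r - g) <= 64                  # Eq. (5)
    gb_ok = 21 <= (g - b) <= 29                  # Eq. (6)
    if rg_ok and gb_ok and r > g > b:            # Eq. (1)
        return 'ok'
    if y > y_high:                               # too bright and outside the ranges
        return 'over'
    if y < y_low:                                # too dark and outside the ranges
        return 'under'
    return 'ok'
```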
Fig. 2. Example of skin color regions not detected mainly due to overexposure. The undetected skin color areas are the white regions within the black lines surrounding the face regions
3 Cylindrical Metric
The cylindrical metric is a distance measure used to calculate the distance between colors. It uses the HSI (Hue, Saturation, and Intensity) color space as its basis for calculating the distance, computing the distance between the projections of the pixel points on a chromatic plane. The cylindrical metric method is defined as follows. First, the image's RGB values are converted into the HSI color space using the equations derived in [4]. The distance between two given colors is then derived from the HSI values as shown in equation (7):

   d_cylindrical(s, i) = sqrt( (d_intensity)² + (d_chromaticity)² ),   (7)

where

   d_intensity = | I_s − I_i |   (8)

and

   d_chromaticity = sqrt( S_s² + S_i² − 2 S_s S_i cos θ ),   (9)

   θ = | H_s − H_i |          if | H_s − H_i | < 180,
   θ = 360 − | H_s − H_i |    if | H_s − H_i | > 180.   (10)
Here s and i represent the pixels being compared. The chromaticity distance d_chromaticity is defined as the distance between the two-dimensional hue-saturation vectors of two neighboring pixels on the chromaticity plane. By using the result of equation (7) as the input to a neural network, we can learn the relationship between the distances of neighboring pixels, where one of the pixels (ideally the center pixel) is used as a reference.
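The cylindrical metric of Eqs. (7)-(10) can be sketched in Python as follows. Hue is assumed to be given in degrees, and the function name is illustrative rather than taken from the paper.

```python
import math

def cylindrical_distance(h_s, s_s, i_s, h_i, s_i, i_i):
    """Cylindrical metric between two HSI colours (Eqs. (7)-(10)).

    h_* are hue angles in degrees, s_* saturations, i_* intensities.
    """
    d_intensity = abs(i_s - i_i)                                   # Eq. (8)
    dh = abs(h_s - h_i)
    theta = dh if dh <= 180 else 360 - dh                          # Eq. (10)
    d_chroma = math.sqrt(s_s ** 2 + s_i ** 2
                         - 2 * s_s * s_i * math.cos(math.radians(theta)))  # Eq. (9)
    return math.sqrt(d_intensity ** 2 + d_chroma ** 2)             # Eq. (7)
```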
4 Neural Network Structure
The size of the neural network used depends on the selected neighborhood, defined as the pixels surrounding the target pixel. Given a pixel, the immediate surrounding contains a minimum of 8 pixels, that is, a 3x3-pixel area with the target pixel at the center. In this work, a 5x5-pixel area was used to offer better generalization of the relationships between the pixel colors. Consequently, the neural network has 25 inputs, each representing a color distance calculated using equation (7). The number of nodes in the output layer is 3: one to check whether the pixel under consideration is skin color or not (called the skin-color-check node), and the other two
output nodes represent the R and B color components of the pixel. Ideally, three RGB color components should be used for each pixel, but because the final color can be recalculated using equations (2) and (3), only two components are necessary. The skin-color-check output is set to one or zero to indicate whether or not the pixel under consideration is skin color. The hidden layer has 9 nodes, and the back-propagation algorithm [7] was used to train the neural network. The total number of 5x5-pixel input samples used to train the neural network was 12950. All the training samples contain over- and under-exposed skin color pixels. The teacher signal was extracted from the images used in Section 2. Even with this large number of input samples, the neural network trained relatively fast because of its fairly small size; in fact, it took less than 20 minutes to achieve an error of 0.00001 using a learning rate of 0.001 and a momentum rate of 0.002. The values for the learning rate and the momentum rate were selected by trial and error. Computer simulations in this work were carried out using a Dell Dimension 4100 Pentium III 850 MHz personal computer.
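For illustration, the 25-dimensional network input described above could be assembled as in the following sketch, which reuses the cylindrical_distance function from the previous section. The array layout of the HSI patch is an assumption.

```python
import numpy as np

def network_input(hsi_patch, center=(2, 2)):
    """Build the 25-dimensional input vector from a 5x5 HSI patch.

    hsi_patch is a 5x5x3 array of (H, S, I) values; each input is the
    cylindrical distance between a neighbour and the centre (reference)
    pixel, so the centre itself contributes a distance of zero.
    """
    ch, cw = center
    h0, s0, i0 = hsi_patch[ch, cw]
    dists = [cylindrical_distance(h, s, i, h0, s0, i0)
             for h, s, i in hsi_patch.reshape(-1, 3)]
    return np.array(dists)          # 25 inputs, one per pixel in the patch
```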
5 Simulation Results
In order to show the effectiveness of the proposed method, computer simulations were performed using 10 images of various sizes, collected from a variety of sources including the world-wide web. In total, the images contain 180 faces, which were the targets of this method's skin color recovery. The total number of pixels in the images is 1,634,150, of which approximately 323,400 are skin color pixels. For a given visual scene, the simulation proceeds in the following steps.
1. Perform skin color detection using the threshold method adapted from [1].
2. Block out the individual skin color regions into squares or rectangles.
3. For all the pixels inside the blocked regions that were not detected as skin color, run the neural network.
4. If the output of the skin-color-check node is above 0.7 (indicating that the pixel is skin color), recover the skin color of the pixel; otherwise move to the next candidate.
Table 1 shows the results: the time taken to recover the skin pixels and the percentage of over/under-exposed pixels recovered. Note that the original total number of skin color pixels per skin color region in the images was manually estimated beforehand. The difference between that value and the value detected using the threshold method is the estimated number of over/under-exposed skin color pixels. Hence, the percentage of skin color pixels recovered is the number corrected by our method divided by the estimated number of over-exposed pixels. An over/under-exposed skin color pixel is successfully recovered if, after recovery, it can be classified as a skin color pixel using the threshold method used earlier.
Table 1. Simulation Results

              Skin Color Pixels   TM miss   Recv    Acc (%)   Time Taken (s)
                                                               TM       Recv
  Per Face    1450                420       402     95.7      0.001    0.001
  Total       323400              66510     62240   93.6      0.512    0.357
In Table 1, TM miss is the number of pixels not detected using the threshold method (over/under-exposed pixels), Recv represents the number of pixels recovered, Acc (%) is the percentage accuracy, and TM represents the time taken by the threshold method. Fig. 3 shows an example result achieved using this method for the image shown in Fig. 2. Note that not all skin color pixels were recovered; the reason is that these pixels are already white and therefore contain very little or no skin color at all.
6 Discussion
The validity of equations (2) and (3) in Section 2 depends entirely on the brightness level. Experiments showed that these equations are valid only for the brightness range of 50 to 220. Outside this range, one or more of the RGB components is distorted by either over- or under-exposure to scene illumination, and therefore the least squares method cannot be used to fit the curves effectively. Although the RGB color space is not very well suited for color image processing, it is used here as the target because it produces the final results at a much faster speed than the other color spaces. This is because the output already contains two of the three components necessary to describe a color, and only one needs to be calculated. The cylindrical metric values have a high computational cost due to the conversion from the RGB color space to the HSI color space and then to cylindrical
Fig. 3. Results of the recovery process for the image in Fig. 2
metric values. However, this was necessary because of the difficulty in choosing the teacher signals for the neural network. If the cylindrical metric values are not used, the system will obviously run faster, but the recovered skin color will be rather uniform, missing the natural skin color look and also the basic relationship between pixels.
7 Conclusion
In this paper, a method was presented for recovering skin color pixels that are over- or under-exposed due to a high or low scene illumination CCT. The method exploits the relationships between skin color pixels in the RGB color space and the color distances between neighboring pixels given by the cylindrical metric method. This method consumes most of its time during the initial neural network training but is otherwise very fast at test time. We achieved a recovery average of 93.6%. The pixels that could not be recovered had no color information at all; that is, all their RGB components were at a value of 255, making the pixels white. Note that this method recovers the skin color of pixels only under the scene illuminant used during the neural network training, and one assumption is that the camera used during image capture is white balanced using the CIE D65 (daylight) illuminant. This study is intended to be one of the preprocessing steps in systems such as face detection and recognition. One application area of this method, for example, is immediately after a skin color compensation system for varying scene illuminations, such as [1].
References
[1] S. Karungaru, M. Fukumi and N. Akamatsu: Skin color compensation under varying scene illumination. Proc. of IASTED/ASC, Banff, Canada, pp. 500-505, 2002.
[2] Storing M, Andersen H and Granum E: Estimation of the illuminant colour from human skin colour. Inter. Conf. on Automatic Face and Gesture Recognition, 2000.
[3] Soriano M, Marszalec E, Pietikainen M: Color correction of face images under different illuminants by RGB eigenfaces. Proc. 2nd Audio- and Video-Based Biometric Person Authentication Conference (AVBPA99), pp. 148-153, 1999.
[4] K. Plataniotis and A. Venetsanopoulos: Color image processing and applications. Springer, Ch. 1 and pp. 268-269, 2000.
[5] S. Karungaru, M. Fukumi and N. Akamatsu: Detection of human face in visual scenes. Proc. of ANZIIS, Perth, Australia, pp. 165-170, 2001.
[6] S. Karungaru, M. Fukumi and N. Akamatsu: Improved speed for human face detection in visual scenes using neural networks. Proc. of ICONIP, Taejon, S. Korea, FBP-30, pp. 1-6, 2000.
[7] Rumelhart D. E., G. E. Hinton and R. J. Williams: Learning internal representations by error propagation, in PDP. The MIT Press, pp. 318-362, 1986.
Grain-Shaped Control System by Using the SPCA Seiki Yoshimori1 , Yasue Mitsukura2 , Shigeru Omatsu1 , and Kohji Kita1 1
University of Tokushima, 770-8506 2-1 Minami-Josanjima Tokushima, Japan [email protected] 2 Okayama University, 700-8530 3-1-1 Tsushima Okayama, Japan [email protected]
Abstract. Recently, noise removal methods for images have been widely investigated. This is because analog data are increasingly converted into digital form and used digitally, and when data are converted from analog to digital, noise (impulse components) can be introduced. To address this problem, the edges of an image are detected by using simple principal component analysis (SPCA) in this paper. Furthermore, a method to remove noise from the parts of the image other than the edges is proposed. Using the proposed method, it is possible to remove noise without breaking down the edges of the original image. Finally, in order to show the effectiveness of the proposed method, simulations are performed for comparison with conventional methods.
1 Introduction
It is very difficult to remove noise from an image containing much impulse noise while also emphasizing its edges. In many cases, edges disappear if much noise exists, and noise remains if emphasis is put on the edges; these are in a trade-off relation. As methods that emphasize or detect object outlines, differentiation, the Sobel filter, the Laplacian filter, and so on are widely used in practice, based on emphasizing the change between adjacent pixels [1]. Recently, a switching filter has been proposed as a noise removal method to resolve the trade-off between the noise removal performance of the median filter and the preservation of the original signal. It distinguishes whether a pixel is impulse noise or an original-signal pixel, and changes its processing according to the result; therefore, it is necessary to distinguish clearly whether a pixel is impulse noise or not [2]. In this paper, the edges of principal component images are detected by using SPCA, which simplifies the calculation of PCA, applied to many images obtained by transforming a single
original image. A method that specifies the regions in which noise removal is performed is then proposed. In the proposed method, the edge of each component can be detected by searching for edges in each principal component image; therefore, sharp edges can be detected. Furthermore, the detected edges are transferred to the original image, and the parts other than the edges are corrected as degraded pixels. Finally, in order to show the effectiveness of the proposed method, simulations using real images are shown.
2 Images in this Paper
The images used in this paper were photographed with an instant camera and then digitized. From each original image we use 7 images: gray-scale images obtained from the R, G, and B values, an image obtained by the moving average method (3 × 3), and the HSI conversion images.
3 HSI
   [M1, M2, I1] = [R, G, B] ×
       [  2/√6     0      1/√3 ]
       [ −1/√6    1/√2    1/√3 ]
       [ −1/√6   −1/√2    1/√3 ]

   H = arctan(M1 / M2)   (1)

   S = (M1² + M2²)^{1/2}   (2)

   I = √3 · I1   (3)
In HSI conversion, “H” expresses hue, “S” expresses the degree of saturation and “I” expresses the brightness.
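A small Python sketch of this conversion for a single pixel is shown below; atan2 is used in place of arctan to avoid division by zero when M2 = 0, which is an implementation choice rather than part of the paper.

```python
import math

def rgb_to_hsi(r, g, b):
    """HSI values of one pixel via the M1/M2/I1 transform in Eqs. (1)-(3)."""
    m1 = (2 * r - g - b) / math.sqrt(6)
    m2 = (g - b) / math.sqrt(2)
    i1 = (r + g + b) / math.sqrt(3)
    h = math.degrees(math.atan2(m1, m2))   # Eq. (1)
    s = math.hypot(m1, m2)                 # Eq. (2)
    i = math.sqrt(3) * i1                  # Eq. (3), i.e. R + G + B
    return h, s, i
```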
4 SPCA
The SPCA has been proposed in order to achieve higher-speed processing than the PCA. It has been confirmed to be effective for information compression of handwritten digits and for dimensionality reduction in a vector space model [?]. The SPCA produces approximate solutions without calculating a variance-covariance matrix. The SPCA algorithm is given as follows:
1. Collect an n-dimensional data set V = {v1, v2, ..., vm}.
2. The vectors X = {x1, x2, ..., xm}, obtained by subtracting the average (the center of gravity) of the set of vectors from each vi, are used as the input vectors.
3. The column vector a1 is defined as the connection weights between the inputs and the output. The first weight vector a1 is used to approximate the first eigenvector. The output is given by

   y1 = a1ᵀ xj .   (4)

4. Using equations (5) and (6), repetitive calculation is carried out starting from an arbitrary vector suitably given as an initial value; as a result, the vector approaches the direction of the first eigenvector:

   a1^{k+1} = Σ_j Φ1(y1, xj) / || Σ_j Φ1(y1, xj) || ,   (5)

   y1 = (a1^k)ᵀ xj ,   (6)

where Φ1 is a threshold function given as

   Φ1(y1, xj) = +xj   if y1 ≥ 0,
   Φ1(y1, xj) = −xj   otherwise.   (7)

5. Using equation (8), we remove the first principal component from the data set in order to find the next principal component:

   xi ← xi − (a1ᵀ xi) a1 .   (8)

The remaining principal components are obtained by substituting the updated vectors into equations (5) and (6) and performing the repetitive calculation again [3]. In this paper, principal components are computed until the cumulative contribution rate reaches 99.9%.
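The iteration of Eqs. (4)-(8) can be sketched as follows. The fixed number of repetitions and the random initialization are assumptions; the paper instead stops extracting components when the cumulative contribution rate reaches 99.9%.

```python
import numpy as np

def spca(X, n_components, n_iter=20):
    """Simple PCA sketch following Eqs. (4)-(8).

    X : (m, d) array of centred data vectors x_j.
    Returns approximate principal component vectors a_1 ... a_n.
    """
    X = X.copy()
    components = []
    for _ in range(n_components):
        a = np.random.randn(X.shape[1])
        a /= np.linalg.norm(a)
        for _ in range(n_iter):
            y = X @ a                                  # Eq. (6): y_j = a^T x_j
            phi = np.where(y[:, None] >= 0, X, -X)     # Eq. (7): +x_j or -x_j
            s = phi.sum(axis=0)
            a = s / np.linalg.norm(s)                  # Eq. (5)
        components.append(a)
        X = X - np.outer(X @ a, a)                     # Eq. (8): deflation
    return np.array(components)
```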
5 Principal Component Images
In this paper, the eigenvectors of an input image obtained by SPCA are normalized to the range 0–255 and used as principal component images. Samples of principal component images are shown in Fig. 1; these images visualize the principal components of the image. It is expected that edges, i.e., boundaries between different components, can be detected from them.
6 Edge Detection
In this paper, the principal component images are used as the target images for edge detection. When edges are detected, there is a trade-off between the detection of fine edges and false detections caused by noise [4]. With this method, since edges can be obtained from each principal component, fine edges are
Fig. 1. Principal component images: (a) first, (b) second, (c) third, (d) fourth principal component image
detectable. Moreover, the noise in typically corrected images does not usually become extremely large, so it is rarely the case that noise appears strongly as a principal component. Therefore, a better detection result can be obtained than when the edges of the original image are detected directly. However, if all the detected edges are used, noise may also be detected simultaneously. Therefore, the amount of edge information added is weighted by the contribution rate, which reflects how much information of the original image each component carries. The Sobel operator, a comparatively simple edge detection technique that can be expected to give good results, is used. The result is shown in Fig. 2.
7 Filter
In this paper, the following techniques are used as the noise suppression (control) processing: processing by the value closest to the average in a nearby domain, the conventional median filter, the moving average method, the median filter over a 4-neighborhood, and processing of only the impulse components.
Fig. 2. Edge detection: (a) original image, (b) Sobel operator, (c) proposed method
The conventional median filter sets the attention pixel to the median of the concentration values in its neighborhood. The moving average method sets the attention pixel to the average of the concentration values in its neighborhood. The 4-neighborhood median filter sets the attention pixel to the median of the concentration values in the four directions around it. In processing by the approximate average value in a domain, the concentration value closest to the domain average becomes the attention pixel value. Processing of only the impulse components detects pixels with large changes by blurring the image and subtracting it from the original; the moving average method is then applied only to the detected pixels, excluding the edge portions [5].
8 System Flow
– Step 1: The original image is converted into gray-scale images using the R, G, and B values. Moreover, HSI conversion is applied to the original image, and a blurred image is created by applying the moving average method over a 3 × 3 domain.
– Step 2: The principal components are obtained with SPCA from the images created in Step 1. Principal components are computed until the cumulative contribution rate reaches 99.9%.
– Step 3: The principal component images are created from the obtained principal components. The eigenvector value of each pixel found by SPCA is normalized to 0–255, and an image is created.
– Step 4: Edges are detected using a Sobel operator, both on the gray-scale version of the original image and on each principal component image. According to the contribution rate of each principal component, the edges of the principal component images are added to the edges of the original image.
– Step 5: Based on the edge image obtained in Step 4, 3 × 3 filter processing performs the noise suppression (a sketch of this step is given after the list). All detected edge portions are preserved. Moreover, when an attention domain includes two or more regions on both sides of an edge, only the pixels of the same region are used for filter processing.
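A minimal sketch of the edge-preserving smoothing in Step 5 is given below, assuming a binary edge mask produced in Step 4. The handling of regions separated by an edge is simplified here to excluding edge pixels from each local average, which only approximates the rule described above.

```python
import numpy as np

def smooth_except_edges(gray, edge_mask, k=1):
    """3x3 moving average applied only to non-edge pixels (Step 5 sketch).

    gray      : 2-D grey-scale image
    edge_mask : boolean array, True where an edge was detected; edge pixels
                are left untouched and excluded from the local averages.
    """
    out = gray.astype(float).copy()
    h, w = gray.shape
    for y in range(h):
        for x in range(w):
            if edge_mask[y, x]:
                continue                      # preserve detected edges
            y0, y1 = max(0, y - k), min(h, y + k + 1)
            x0, x1 = max(0, x - k), min(w, x + k + 1)
            win = gray[y0:y1, x0:x1].astype(float)
            keep = ~edge_mask[y0:y1, x0:x1]   # drop edge pixels from the window
            out[y, x] = win[keep].mean()
    return out
```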
9 Computer Simulations
In order to show the effectiveness of the proposed method, we have carried out computer simulations using real images. The images after processing are shown in Fig. 3. Moreover, for comparison, a sharpened version of the original image and sharpened versions of the processed images are also shown in Fig. 4.
Grain-Shaped Control System by Using the SPCA
(a) Original image
(b) Median filter
(c) Moving average method
(d) Improved median method
(e) Median (2×2)
(f) Improved impulse remove method
Fig. 3. Simulation Result
Fig. 4. Simulation results (sharpened): (a) original image, (b) median filter, (c) moving average method, (d) improved median filter, (e) median (2×2), (f) improved impulse removal method
9.1 Simulation Results
Comparing the cases with and without the suppression processing, the impulse components are greatly reduced when the suppression processing is carried out. This is because the moving average method is applied to the attention pixels. However, it is generally known that an image becomes blurred if the moving average method is applied to all the pixels in an image. In this paper, the moving average method is applied only to the non-edge portions, without processing the edge portions; for this reason, no significant blurring of the image was observed. However, this processing presupposes that the edges can be extracted reliably, so the accuracy of the edge detection becomes very important. Comparing edge detection using the well-known Sobel operator alone with the proposed extension, the latter detects edges more accurately. From these results, the proposed method works very well. However, in this processing the edge portions are currently left untouched; for this reason, when the image is sharpened, the edge portions stand out. It is therefore necessary to perform suppression processing on the edge portions as well. However, when the proposed method is applied to them directly, colors from different regions are mixed along continuous edges, and a very unnatural result is produced at the edge portions. Therefore, when edges are found, it is thought necessary to divide the image into regions to some extent.
10 Conclusions
In this paper, we proposed a method in which the edges of an image are detected by using simple principal component analysis (SPCA). Furthermore, a method to remove noise from the parts of the image other than the edges was proposed. Finally, in order to show the effectiveness of the proposed method, computer simulations were performed using real images.
References
[1] H. Ishii, R. Taguchi and M. Sone, “An Edge Detection with Multi-Masks from Images Corrupted by High Probability Impulse Noise,” IEICE Tech. Report (A), vol. J82, no. 1, pp. 131–141, 1999.
[2] Y. Hashimoto, Y. Kajikawa and Y. Nomura, “Directional Difference-Based Switching Median Filters,” IEICE Tech. Report (A), vol. J83, no. 10, pp. 1131–1140, 1990.
[3] H. Takimoto, Y. Mitsukura, M. Fukumi and N. Akamatsu, “A Design of Face Detection System Using the GA and the Simple PCA,” Proc. of ICONIP'02, Vol. 4, pp. 2069–2073, 2002.
[4] M. Ohsaki, T. Sugiyama and H. Ohno, “Evaluation of Edge Detection Methods through Psychological Tests - Is the Detected Edge Really Desirable for Humans? -,” IEEE SMC 2000, Nashville, USA, pp. 671–677, 2000.
[5] T. Matsumoto and R. Taguchi, “Removal of Impulse Noise from Highly Corrupted Images by Using Noise Position Information and Directional Information of Image,” IEICE Tech. Report (A), vol. 12, pp. 1382–1392, 2000.
Reinforcement Learning Using RBF Networks with Memory Mechanism Seiichi Ozawa and Naoto Shiraga Graduate School of Science and Technology, Kobe University 1-1 Rokko-dai, Nada, Kobe 657-8501, Japan [email protected] http://frenchblue.scitec.kobe-u.ac.jp/~ozawa/index.html
Abstract. In reinforcement learning, the catastrophic interference could be serious when neural networks are used for approximating action-value functions. To solve this problem, we propose a memory-based model that is composed of Resource Allocating Network and an external memory. A remarkable feature of this model is that it needs only quite small memory capacity to execute incremental learning. To examine this property, the proposed model is applied to a mountain-car task in which the working area is temporally expanded. In the simulations, we verify that the proposed model needs smaller memory to approximate action-value functions properly as compared with some conventional models.
1 Introduction
In Reinforcement Learning (RL) tasks with a finite number of states and actions, it is possible to memorize all action-values in a look-up table. In many cases of practical interest, however, there are far more states than could be memorized in such a table. In this case, it is effective to learn action-value functions using function approximation methods. One of these is the linear method [1],[2], in which action-values are approximated by linear functions of feature vectors that roughly code the agent's states. In general, this method tends to need large memories, especially when the state space has large dimensions and/or a wide area. To solve this problem, nonlinear approximation methods like neural networks [1] are often used. However, it is well known that the learning of neural networks becomes difficult when the distribution of the given training data varies over time and training data are given incrementally [3]. In such a situation, input-output relations acquired in the past are easily collapsed by the learning of new training data. This disruption in neural networks is called “catastrophic interference”, which is caused by the excessive adaptation of connection weights to new training data. Several approaches to suppressing the interference in supervised learning have been proposed [3],[4]. We have proposed an incremental learning model for supervised learning tasks [5], in which a Long-Term Memory (LTM) was introduced into the Resource Allocating Network (RAN) [6]. In this model, called RAN-LTM, storage data in the LTM (denoted “memory items”) are produced from the inputs
and outputs of the network whose relationships are accurately approximated. In supervised learning, the exact errors between network outputs and their targets are given. In RL problems, however, agents are generally given only Temporal Difference (TD) errors calculated from currently estimated action-values and immediate rewards. This suggests that memory items generated in the past do not always hold proper action-values; hence, we should modify RAN-LTM so that memory items can be updated as the learning proceeds. In this paper, a new RAN-LTM model is proposed in which memory items are updated as the agent's action-values are improved. We shall call this model ‘RAN with Memory (RAN-M)’ in the following. Furthermore, through some standard RL problems, we verify that RAN-M can learn proper policies even in a difficult incremental learning setting.

Fig. 1. Structure of RAN-M
2 RAN with Memory Mechanism
2.1 Network Architecture
Figure 1 shows the architecture of RAN-M, which consists of two modules: Resource Allocating Network (RAN) [6] and an external memory. To approximate action-value functions, RAN with normalized radial basis functions is adopted here, in which the outputs of hidden units are normalized. When the training starts, the number of hidden units in RAN is set to one; hence, RAN possesses only simple approximation ability at first. As the trials proceed, the approximation ability of RAN is developed by allocating additional hidden units. The inputs of RAN at time t, x(t) = {x1(t), ···, xI(t)}, are set to the agent's states s(t) = {s1(t), ···, sI(t)}. Here, I is the number of input units of RAN. We associate the network outputs z(t) = {z1(t), ···, zK(t)} with action-values Qt(s(t), a(t)) that are utilized for selecting the agent's action a(t); i.e., zk(t) = Qt(s(t), a(t) = ak). Hence, the number of output units, K, is equal to the number of agent's actions. The outputs z(t) are given as follows:

   yj(t) = exp( − ||x(t) − cj||² / σj² )   (j = 1, ···, J)   (1)
   ŷj(t) = yj(t) / Σ_l yl(t)   (j = 1, ···, J)   (2)

   zk(t) = Σ_{j=1}^{J} wkj ŷj(t) + θk   (k = 1, ···, K)   (3)
where wkj and θk are connection weights and biases, respectively. ŷj(t), J and σj² are respectively the normalized output of the jth hidden unit, the number of hidden units, and the variance of the jth radial basis function. The agent's action is selected based on the following probability calculated from the outputs z(t) of RAN:

   P(a(t) = ak) = exp(zk(t)/T(t)) / Σ_l exp(zl(t)/T(t)) ,   (4)

where T(t) is the temperature at time t.
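For concreteness, the forward pass of Eqs. (1)-(3) and the Boltzmann selection of Eq. (4) can be sketched in Python as follows. Array shapes and the numerical-stability shift in the softmax are implementation assumptions.

```python
import numpy as np

def ran_forward(x, centers, sigmas, W, theta):
    """Normalised-RBF forward pass of RAN (Eqs. (1)-(3))."""
    d2 = ((x - centers) ** 2).sum(axis=1)          # squared distances to RBF centres
    y = np.exp(-d2 / sigmas ** 2)                  # Eq. (1)
    y_hat = y / y.sum()                            # Eq. (2): normalisation
    z = W @ y_hat + theta                          # Eq. (3): action values Q(s, a_k)
    return z, y_hat

def select_action(z, temperature):
    """Softmax (Boltzmann) action selection, Eq. (4)."""
    logits = z / temperature
    p = np.exp(logits - logits.max())              # shift for numerical stability
    p /= p.sum()
    return np.random.choice(len(z), p=p)
```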
2.2 Learning Algorithm
In learning RAN-M, either of the following two processes is carried out: 1. allocation of new hidden units; 2. update of connection weights, RBF centers, and biases. The former process is conducted when the agent comes across unknown states. This can be identified by checking whether the activations of all hidden units yj(t) (j = 1, ···, J) are low and the following TD error δk(t) is large enough:

   δk(t) = r(t + 1) + γ max_a Qt(s(t + 1), a) − Qt(s(t), ak)   if k = k′,
   δk(t) = 0   otherwise,   (5)

where k′ is the subscript of the action a(t) selected by the agent at time t, and r(t + 1) is the immediate reward given by the environment. On the other hand, the latter process is carried out in order to reduce the TD error in Eq. (5). However, as stated in Section 1, adaptation to only the given training sample might cause the catastrophic interference. Therefore, not only the above TD error but also the following error between the outputs z(s*_l) for the memory items s*_l and the desired outputs Q*_l is simultaneously reduced:

   E(t) = (1/2) Σ_{l∈Ω(t)} Σ_{k=1}^{K} (Q*_kl − zk(s*_l))²   (6)
where Ω(t) is a set of memory items generated at time t. These memory items correspond to the representative input-output pairs that are extracted from the mapping function of RAN. The detailed descriptions of generating and retrieving memory items are discussed in [5]; here let us show the procedures briefly.
[Learning Algorithm]
1. Start an episode. Initialize the agent state as s(0).
2. Set s(t) to RAN's inputs x(t), and calculate the output z(t) from Eqs. (1)-(3).
3. Select an action ak based on the probability in Eq. (4), and carry it out.
4. After observing the next state s(t + 1) and the immediate reward r(t + 1), calculate the TD error in Eq. (5).
5. If δk(t) > η1 and ||x(t) − c*|| > η2, add a hidden unit (i.e., J ← J + 1) and initialize its weight connections, RBF center, and width as follows: cJi = xi(t), wkJ = Ek(t), and σJ = κ||x(t) − c*||. Here, c* is the RBF center nearest to the input x(t) and κ is a positive constant. Otherwise, the following procedure is conducted.
(a) According to the retrieval procedure, recall memory items (s*_l, Q*_l) (l ∈ L(t)). Here, L(t) is the number of memory items generated at t.
(b) For all retrieved memory items, obtain the RAN outputs z(s*_l) and calculate the squared error E(t) based on Eq. (6).
(c) To minimize E(t) as well as the squared TD error (1/2) Σ_k δk(t)², update the network parameters of RAN as follows:

   cji^NEW = cji^OLD + α δk e^c_ji + Σ_{l∈Ω(t)} Σ_{k=1}^{K} ∆kl wkj ŷj(s*_l)(1 − ŷj(s*_l)) (s*_li − cji)/σj²

   e^c_ji^NEW = e^c_ji^OLD + wkj ŷj(t)(1 − ŷj(t)) (xi(t) − cji)/σj²

   wkj^NEW = wkj^OLD + α δk e^w_j + Σ_{l∈Ω(t)} Σ_{k=1}^{K} ∆kl ŷj(s*_l)

   e^w_j^NEW = e^w_j^OLD + ŷj(t)

   θk^NEW = θk^OLD + α δk e^θ_k + Σ_{l∈Ω(t)} Σ_{k=1}^{K} ∆kl

   e^θ_k^NEW = e^θ_k^OLD + 1

where ∆kl = Q*_kl − zk(s*_l), α is a positive learning ratio, and e^c_ji, e^w_j, e^θ_k are eligibility traces for the RBF centers, weight connections, and biases, respectively.
6. Execute the generation procedure of memory items.
7. If the episode is not over, update all memory items by recalculating z(s*_l).
8. t → t + 1 and go back to Step 2.
[Retrieval Procedure]
1. Obtain all subscripts j of hidden units whose outputs yj are larger than η3, and define the set of these subscripts as I1.
2. If I1 = φ, no memory item is retrieved and the procedure is terminated.
3. For all hidden units belonging to I1, obtain the memory item s*_j that is nearest to the center vector cj.
4. Add all memory items (s*_j, Q*_j) satisfying the condition ||cj − s*_j|| < η4 to Ω(t).
[Generation Procedure] 1. Obtain all subscripts j of hidden units whose outputs yˆj are larger than η3 , and define a set I2 of these subscripts.
2. If I2 = φ, this procedure is terminated.
3. The value vj is updated for all j ∈ I2 as follows: vj ← vj + 2 exp(−ρ|δk(t)|) − 1, where k is the subscript of the action selected at t and ρ is a positive constant.
4. If vj < η5 for all j ∈ I2, this procedure is terminated. Here, η5 is a positive constant.
5. For all hidden units that satisfy vj ≥ η5, the following procedures are applied.
(a) Initialize vj.
(b) If the corresponding memory item has not been generated yet, the number of memory items L(t) is increased by one (i.e., L(t) ← L(t) + 1). Otherwise, go back to Step (a).
(c) The center vector cj is given to RAN as its input, and the outputs z*(cj) are calculated.
(d) The obtained (cj, z*(cj)) is stored into memory as the L(t)th memory item.

Fig. 2. The working area of the one-dimensional mountain-car task
3 Simulations
3.1 Problem Statement
To examine the incremental learning ability of the proposed model, let us apply it to a standard RL problem: mountain-car task [1]. The mountain-car task is a problem in which a car driver (agent) learns an efficient policy to reach a goal as fast as possible. Figure 2 shows the working area in the one-dimensional case. In the original problem, only the left region (R1 ) in Fig. 2 is presented for learning. Here, to evaluate the suppression performance for the catastrophic interference, the right region (R2 ) is also trained after the learning of R1 . Furthermore, we extend this problem to a two-dimensional one where the agent moves around a two-dimensional space spanned by u1 and u2 . In the one-dimensional problem, when a car agent arrives at the left most and right most places in Fig. 2, the velocity is reset to zero. The goal of the car agent is to drive up the steep incline successfully and to reach a goal state at the top of the hill as fast as possible. Hence, the reward in this problem is −1 at all time steps until the car agent reaches the goal. There are three actions to be selected; “full throttle to goal” and “zero throttle” and “full throttle to opposite side of goal”. A car agent is initially positioned in either of two regions: R1 and R2 . The position u(t) and
Table 1. Average numbers of steps to reach the goal in R1 and R2, the maximum shared memory, and the convergence time in learning

(a) one-dimensional problem
          Steps in R1   Steps in R2   Memory (KB)   Time (sec.)
  TC      377           379           259           73
  LWR     196           184           1538          61
  RAN     2259          221           15.4          581
  RAN-M   185           195           21.4          1211

(b) two-dimensional problem
          Steps in R1   Steps in R2   Memory (KB)   Time (sec.)
  TC      517           603           5120          601
  LWR     310           281           15396         536
  RAN     2011          362           83.7          4121
  RAN-M   290           289           118.9         8015
velocity u̇(t) are updated based on the following dynamics:

   u(t + 1) = B[u(t) + u̇(t)]   (7)

   u̇(t + 1) = B[u̇(t) + 0.001 a(t) − 0.0025 cos(3u(t))]   (8)
where B[·] restricts the agent's working area to the following two regions: R1: {u | −1.2 ≤ u < 0.5} and R2: {u | 0.5 < u ≤ 2.2}. The goal is located at u = 0.5. The inputs of RAN correspond to the position u(t), the velocity u̇(t), and the previous action a(t − 1). In the two-dimensional problem, the working area is composed of the following two regions: R1: {(u1, u2) | −1.2 ≤ u1 < 0.5, −1.2 ≤ u2 < 0.5} and R2: {(u1, u2) | 0.5 < u1 ≤ 2.2, −1.2 ≤ u2 < 0.5}. The agent's location (u1, u2) and velocity (u̇1, u̇2) are subject to the differential equations obtained by applying Eqs. (7) and (8) to both u1 and u2. The agent can select the following five actions: do nothing, or step on the accelerator in one of four directions (right, left, up, down). The other experimental conditions are the same as in the one-dimensional case.
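A sketch of one environment transition under Eqs. (7) and (8) is shown below for region R1. The interpretation of B[·] as clipping with a velocity reset follows the description in the text; the update order and the action encoding are otherwise assumptions.

```python
import math

GOAL = 0.5

def step(u, u_dot, a, lo=-1.2, hi=0.5):
    """One transition of the one-dimensional mountain-car task, Eqs. (7)-(8).

    a is -1, 0 or +1 (full throttle away from / zero / towards the goal).
    The reward is -1 on every step until the goal u = 0.5 is reached.
    """
    new_u = u + u_dot                                         # Eq. (7)
    new_u_dot = u_dot + 0.001 * a - 0.0025 * math.cos(3 * u)  # Eq. (8)
    if new_u <= lo:                    # hit the left wall of the region
        new_u, new_u_dot = lo, 0.0     # B[.]: clip position, reset velocity
    done = new_u >= GOAL               # goal at the right edge of region R1
    return new_u, new_u_dot, -1.0, done
```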
Results and Discussions
Table 1 shows the experimental results of the one-dimensional and two-dimensional mountain-car tasks. For comparative purposes, the performance is also examined for Tile Coding (TC) [1], Locally Weighted Regression (LWR) [2], and RAN [6]. As seen from Table 1, for both problems, RAN needs a considerable number of steps to reach the goal when initial positions are located in R1 . Obviously, serious forgetting of the acquired action-value function is caused by the additional learning of R2 . On the other hand, for the proposed RAN-M as well as TC and
Reinforcement Learning Using RBF Networks with Memory Mechanism
1155
LWR, we cannot find distinctive increase in the average steps in R1 . This result suggests that these models can suppress the interference effectively. The average steps in both LWR and RAN-M are almost the same, but TC needs more steps. Since action values are trained separately for every tiles in TC, the interference does not occur. However, the continuity of the action values for neighbor tiles is not taken into consideration. This fact might lead to the poor result in TC. In Table 1, we also estimate the maximum size of shared memory and the average convergence time when the learning is conducted. As you can see, although the fast learning is realized in LWR and TC, they need large memory capacity. On the other hand, RAN and RAN-M need quite small memory capacity; however, the learning of RAN-M and RAN is very slow. This is because the learning is conducted based on the gradient descent algorithm. This problem could be solved by applying the linear method to the proposed model (see [7] for details).
4
Conclusions
To learn action-value functions stably with RBF networks, we developed a new version of the Resource Allocating Network into which a memory mechanism is introduced. To evaluate its incremental learning ability, the proposed model was applied to an extended mountain-car task in which the working area of the agent is expanded as learning proceeds: learning is first conducted for one region, and learning for a different region is then carried out. From the simulation results, we verified that the proposed model learns action values as accurately as, or more accurately than, several conventional models: Tile Coding (TC), Locally Weighted Regression (LWR), and the original RAN. Moreover, we showed that the shared memory capacity of the proposed model is much smaller than that of LWR and TC.
Acknowledgement The authors would like to thank Prof. Shigeo Abe for his useful discussions and comments. This research was partially supported by the Ministry of Education, Science, Sports and Culture, Grant-in-Aid for Young Scientists (B).
References
[1] Sutton, R. S., Barto, A. G.: Reinforcement Learning: An Introduction. The MIT Press (1998)
[2] Atkeson, C. G., Moore, A. W., Schaal, S.: Locally Weighted Learning. Artificial Intelligence Review 11 (1997) 11-73
[3] Schaal, S., Atkeson, C. G.: Constructive Incremental Learning from Only Local Information. Neural Computation 10 (1998) 2047-2084
[4] Yamauchi, K., Yamaguchi, N., Ishii, N.: Incremental Learning Methods with Retrieving of Interfered Patterns. IEEE Trans. on Neural Networks 10 (1999) 1351-1365
[5] Kobayashi, M., Zamani, A., Ozawa, S., Abe, S.: Reducing Computations in Incremental Learning for Feedforward Neural Network with Long-Term Memory. Proc. Int. Joint Conf. on Neural Networks (2001) 1989-1994
[6] Platt, J.: A Resource Allocating Network for Function Interpolation. Neural Computation 3 (1991) 213-225
[7] Okamoto, K., Ozawa, S., Abe, S.: A Fast Incremental Learning Algorithm of RBF Networks with Long-Term Memory. Proc. Int. Joint Conf. on Neural Networks (in press)
Face Identification Method Using the Face Shape Hironori Takimoto1 , Yasue Mitsukura2 , Norio Akamatsu1 , and Rajiv Khosla3 1
The University of Tokushima 2-1 Minami-Josanjima Tokushima, 770-8506 Japan {taki, akamatsu}@is.tokushima-u.ac.jp 2 Okayama University 3-1, Tsushima Okayama 700-8530 Japan [email protected] 3 The La Trobe University, Victoria 3086 Australia [email protected]
Abstract. Personal identification provides an important means of realizing man-machine interfaces and security products. Recently, many studies on personal identification using the face have been carried out. Among face-based identification methods, the feature-point method and the pattern-matching method are the standard techniques; however, lighting conditions often affect the recognition accuracy. In this paper, we propose a face identification method, based on the feature-point method, that is robust to lighting. The proposed method extracts the edges of the facial features and then determines oval parameters for each feature from the extracted edges by the Hough transform. Moreover, in order to show the effectiveness of the proposed method, computer simulations using real images are presented.
1
Introduction
Recently, personal identification has become practical thanks to advances in information technology. Existing personal identification methods include IC cards, passwords, and biometrics [1, 2]. Biometrics requires neither memorizing a password nor carrying a card, and only the registrant is accepted. In particular, the face is always exposed to society, so it carries little psychological burden compared with other physical features. Several personal identification methods using frontal face images have been proposed [3, 4]. Among them, the feature-point method and the pattern-matching method are the standard techniques. The basic concept of the feature-point method is to use knowledge about the structure of the face for recognition: the method extracts facial features such as the eyes and the mouth, and then calculates recognition feature vectors that describe the shape and position of each feature numerically. Many conventional approaches are based on the feature-point method. However, it is difficult for them to perform recognition that is robust to lighting, even when the data are limited to frontal face images. Furthermore, the pattern-matching method is often used for the extraction of facial features.
In this paper, we propose a face identification method, based on the feature-point method, that is robust to lighting. First, the proposed method extracts the edges of the facial features. Then, it determines oval parameters for each feature from the extracted edges by the Hough transform, and performs personal identification using these parameters. The method is therefore more robust to lighting than conventional methods. Moreover, compared with conventional methods, the feature points are detected reliably and easily, because the Hough transform is used to extract the parameters approximately. In order to show the effectiveness of the proposed method, computer simulations are carried out using real images.
2 Pre-processing
2.1 Face Image Data
In this paper, it is necessary to normalize the face image before recognition. The normalization method is as follows. The face image is normalized based on both eyes, for the following reasons. First, compared with the lips, nose, or ears, the eyes make it easy to normalize rotation and size. Second, many methods for extracting the eye regions have already been proposed [5, 6]. Therefore, using the eyes for normalization of the face image is efficient. First of all, an original image (420×560 pixels, 24-bit color; Fig. 1(a)) is converted into an 8-bit gray-scale image, and a median filter is applied in order to remove noise. Next, the center positions of both eyes are extracted. The line segment joining both eyes is then rotated so that it matches the horizontal line, and the image is scaled so that the distance between the eyes becomes 60 pixels. Moreover, in order to diminish the influence of hair and clothes, the image is cropped as shown in Fig. 2: letting the midpoint of the segment joining both eyes be the standard point, the region spreads 120 pixels horizontally (60 pixels to the left and right of the standard point) and 140 pixels vertically (35 pixels above and 105 pixels below it). Finally, in order to ease the influence of photometric properties, a gray-scale transformation is performed. The image of Fig. 1(a) is normalized as shown in Fig. 1(b). As long as the eye positions can be extracted, the gray-scale transformation of the facial region can be performed, so the subsequent processing is hardly influenced by lighting.
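A minimal sketch of this normalization is given below, assuming the two eye centers have already been located. The nearest-neighbour resampling and the helper name `normalize_face` are illustrative choices, not the authors' implementation.

```python
import numpy as np

def normalize_face(gray, left_eye, right_eye):
    """Rotate/scale a gray-scale face image so the eyes lie on a horizontal line
    60 px apart, then crop 120x140 px around the eye midpoint (60 px left/right,
    35 px above, 105 px below)."""
    (lx, ly), (rx, ry) = left_eye, right_eye
    angle = np.arctan2(ry - ly, rx - lx)          # rotation of the eye line
    scale = 60.0 / np.hypot(rx - lx, ry - ly)     # eye distance -> 60 px
    mx, my = (lx + rx) / 2.0, (ly + ry) / 2.0     # midpoint = standard point

    out_w, out_h = 120, 140
    ox, oy = 60, 35                               # standard point in the output image
    ys, xs = np.mgrid[0:out_h, 0:out_w]
    # inverse mapping: output pixel -> source pixel (unscale, rotate back by +angle)
    dx, dy = (xs - ox) / scale, (ys - oy) / scale
    src_x = mx + dx * np.cos(angle) - dy * np.sin(angle)
    src_y = my + dx * np.sin(angle) + dy * np.cos(angle)
    src_x = np.clip(np.rint(src_x).astype(int), 0, gray.shape[1] - 1)
    src_y = np.clip(np.rint(src_y).astype(int), 0, gray.shape[0] - 1)
    return gray[src_y, src_x]                     # nearest-neighbour sampling
```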
2.2 Edge Detection
In this paper, the edge image of the facial features is extracted from the face image by using the Sobel operator, which is a popular edge-extraction method. The edge image is then binarized by threshold processing. For edge detection, detecting the rough position of each facial feature is an important preprocessing step. The eye positions are already detected in the normalization step of the face image.
Fig. 1. The normalization of a face image: (a) the original image; (b) the normalized image
Furthermore, the position of the eyebrows is detected easily from the gray-scale image, because the difference in gray level between the eyebrows and the surrounding region is large. However, if the mouth region is detected from the gray-scale image, detection is difficult when the mouth is assimilated to the skin color by the effect of lighting. Thus, the L*a*b* color coordinate system, a color space reflecting human color perception [7], is used for detection of the mouth region. In the L*a*b* conversion, the lightness index is given by equation (1) and the perceived-chromaticity indices by equation (2). The parameters X, Y, and Z are the values of an object in the XYZ color coordinate system, and Xn, Yn, and Zn are the values of a standard light source. The parameter a* indicates the degree of redness; therefore, the mouth region is detected using the parameter a*.

L* = 116 (Y/Yn)^(1/3) - 16                                                    (1)
a* = 500 [ (X/Xn)^(1/3) - (Y/Yn)^(1/3) ],   b* = 200 [ (Y/Yn)^(1/3) - (Z/Zn)^(1/3) ]   (2)
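As a sketch, the redness index a* of Eq. (2) can be computed from an RGB image as follows. The sRGB-to-XYZ matrix, the D65 white point, and the final threshold are assumptions, since the paper only refers to "a standard light source"; sRGB gamma correction is also ignored for brevity.

```python
import numpy as np

# Assumed white point of the reference illuminant (D65).
XN, YN, ZN = 95.047, 100.0, 108.883

def a_star(rgb):
    """Return the a* (redness) channel of an RGB image, used to locate the mouth."""
    rgb = rgb.astype(np.float64) / 255.0
    # Assumed sRGB-to-XYZ conversion matrix (the paper does not specify one).
    m = np.array([[0.4124, 0.3576, 0.1805],
                  [0.2126, 0.7152, 0.0722],
                  [0.0193, 0.1192, 0.9505]])
    xyz = rgb @ m.T * 100.0
    fx = np.cbrt(xyz[..., 0] / XN)
    fy = np.cbrt(xyz[..., 1] / YN)
    return 500.0 * (fx - fy)   # Eq. (2): a* = 500[(X/Xn)^(1/3) - (Y/Yn)^(1/3)]

# Mouth-region candidates are pixels with a high redness degree; the threshold
# below is only illustrative.
# mouth_mask = a_star(image) > 15.0
```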
3 Parameter Extraction
3.1 Hough Transform
In this paper, the oval parameters of each facial feature are determined from the extracted edges by the Hough transform. The Hough transform extracts an object from an edge image when the object can be described by a parametric expression [8, 9]. First, some points on the edges of the edge image are selected. Then, the parameters of every object that passes through the selected points are plotted in the parameter space. The oval parameters are extracted by detecting the point in the parameter space that was plotted the largest number of times.
Fig. 2. The outline of the normalization
The Hough transform is robust to noise, and it can extract an object even if the object does not appear completely.
3.2
Determination of Oval Parameters
In this paper, the oval parameters of each feature are determined from the extracted edge information by the Hough transform, because a human face can be well expressed by ovals. An oval is described by 5 parameters, whose details are shown in Fig. 3. The facial features whose parameters are extracted are the eyes, the eyebrows, and the mouth. The eyes and the mouth are each divided into an upper contour and a lower contour, and each contour is described by the parameters of one oval; each eyebrow is approximated by a single oval. Therefore, the number of ovals describing a person is 8, and the number of parameters is 40.
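A naive sketch of this parameter voting is shown below: for each edge point and each candidate centre, semi-axis a and orientation, the remaining semi-axis b is solved from the oval equation and a vote is cast. The coarse parameter grids are assumptions; the paper does not specify the accumulator design actually used.

```python
import numpy as np
from collections import Counter

def hough_ellipse(edge_points, cx_range, cy_range, a_range, theta_bins=8, b_step=1.0):
    """Very coarse Hough voting for oval parameters (cx, cy, a, b, theta)."""
    acc = Counter()
    thetas = np.linspace(0.0, np.pi, theta_bins, endpoint=False)
    for x, y in edge_points:
        for cx in cx_range:
            for cy in cy_range:
                for a in a_range:
                    for th in thetas:
                        # rotate the point into the candidate ellipse frame
                        u = (x - cx) * np.cos(th) + (y - cy) * np.sin(th)
                        v = -(x - cx) * np.sin(th) + (y - cy) * np.cos(th)
                        denom = 1.0 - (u / a) ** 2
                        if denom <= 1e-6:
                            continue
                        b = abs(v) / np.sqrt(denom)   # (u/a)^2 + (v/b)^2 = 1
                        acc[(cx, cy, a, round(b / b_step) * b_step, round(th, 3))] += 1
    return max(acc, key=acc.get)   # parameter cell with the most votes
```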
4
Simulations of Personal Identification
For the purpose of showing the effectiveness of the proposed method, computer simulations are carried out using real images. In this simulation, images of 100 registrants (72 adult men and 28 adult women) are used, with 6 face images per person (600 face images in total). Only the registrants are used in the simulation. The recognition process is as follows, and each step is detailed below.
step 1  Normalization of face images
step 2  Edge extraction
step 3  Determination of feature parameters by the Hough transform
step 4  Face identification
In the face identification, the leave-one-out cross-validation method is used.
Fig. 3. The expression of oval parameter
5
Conclusions
In this paper, a face identification method, based on the feature-point method, that is robust to lighting was proposed. In this method, oval parameters of the facial features are extracted from the detected edges by the Hough transform, and face identification is performed using these parameters. The method is more robust to lighting than conventional methods. Computer simulations with real images were carried out to examine the effectiveness of the proposed method.
References
[1] Y. Yamazaki, N. Komatsu: "A Feature Extraction Method for Personal Identification System Based on Individual Characteristics". IEICE Trans. D-II, Vol. J79, No. 5, pp. 373-380 (1996), in Japanese
[2] A. K. Jain, L. Hong, and S. Pankanti: "Biometric identification". Commun. ACM, Vol. 43, No. 2, pp. 91-98 (2000)
[3] R. Chellappa, C. L. Wilson, and S. Sirohey: "Human and machine recognition of faces: A survey". Proc. IEEE, Vol. 83, No. 5, pp. 705-740 (1995)
[4] S. Akamatsu: "Computer Recognition of Human Face". IEICE Trans. D-II, Vol. J80, No. 8, pp. 2031-2046 (1997), in Japanese
[5] S. Kawato, N. Tetsutani: "Circle-Frequency Filter and its Application". Proc. Int. Workshop on Advanced Image Technology, pp. 217-222, Feb. (2001)
[6] T. Kawaguchi, D. Hikada, and M. Rizon: "Detection of the eyes from human faces by Hough transform and separability filter". Proc. of ICIP 2000, pp. 49-52 (2000)
[7] G. Wyszecki and W. S. Stiles: "Color Science: Concepts and Methods, Quantitative Data and Formulae". John Wiley & Sons, New York (1982)
[8] P. V. C. Hough: "Machine Analysis of Bubble Chamber Pictures". International Conference on High Energy Accelerators and Instrumentation, CERN (1959)
[9] H. K. Yuen, J. Illingworth, and J. Kittler: "Detecting Partially Occluded Ellipses using the Hough Transform". Image and Vision Computing, Vol. 7, No. 1, pp. 31-37, Feb. (1989)
Tracking of Moving Object Using Deformable Template Takuya Akashi1 , Minoru Fukumi2 , and Norio Akamatsu2 1
Graduate School of Engineering, University of Tokushima 2-1 Minami-josanjima, Tokushima, Japan [email protected] 2 Faculty of Engineering, University of Tokushima 2-1 Minami-josanjima, Tokushima, Japan {fukumi,akamatsu}@is.tokushima-u.ac.jp
Abstract. A method to track a moving object that undergoes continuous shape deformations is described in this paper. The method also aims to obtain information about the shape deformations. It uses template matching in which the template itself is deformed; the deformation of the template and the matching process are performed by a genetic algorithm. Our conventional method does not have good search efficiency. Therefore, a new method, which distinguishes between an objective function and a fitness function, is proposed in this paper. Comparisons between our conventional method and the new method are also simulated, using the lips region during speech as the target object. The results of this simulation show that the proposed method outperforms the conventional one in tracking accuracy and speed.
1
Introduction
Tracking of a moving object is very important for human interfaces. For example, a lipreading system is needed for mobile devices such as PDAs and cellular phones. For practical use, the tracking method should be invariant to many changes: geometric changes caused by camera shake and/or the user's head moving, and the object's own deformations. For a lipreading system on mobile devices, the geometric changes are parallel translation, scaling, and rotation of the lips region, and the deformations are an opened/closed mouth and showing/not showing any teeth during speech. One of the earlier works in this area that used lips images is the lipreading system reported in [1], which uses an eigen-template as the feature of the lips shape. However, that work cannot adapt to strong geometric changes caused by shaking of the head or the camera, or by a non-stable camera, because one of its assumptions is that the camera and the subject's head are fixed. On the other hand, to take the shape of the object into consideration, methods that use template matching with a genetic algorithm have been proposed [2, 3, 4, 5]. However, as far as we know, there is no method that achieves invariance to the shape deformations of the lips during speech with only one template.
Our conventional method [6] (NPDM: Normalised Pixel Difference Method) has invariance for an opened/closed mouth and showing/not showing any teeth, and has high speed and high tracking accuracy using only one deformable template by a genetic algorithm. In this paper, the Non-Normalised Pixel Difference Method (Non-NPDM) which improves NPDM is described.
2
System Flow
The lips region tracking system is set out to perform the following operations:
Step 1: Input a template image and target image.
Step 2: Deform the template shape to a "square annulus".
Step 3: Generate a population of individuals of the first generation randomly.
Step 4: Measure fitness of individuals.
Step 5: Perform genetic operators.
Step 6: Output image which has extracted lips region.
In Step 1, image data are obtained from the input images. A modified x component (redness) of the Yxy colour space [7] is used as the image data. The template shape in Step 1 is a square, which is unsuitable for the shape deformations of the lips during speech; therefore, the template shape is deformed to a new shape called a "square annulus" (refer to Sec. 3) in Step 2. Steps 3 and 4 are genetic algorithm processes and are described below. Finally, the result of tracking is output on the target image in Step 6. The procedure in Step 4 is as follows:
Step 4-1: Transform the template by using homogeneous coordinates.
Step 4-2: NPDM: calculate the fitness function and measure it. Non-NPDM: calculate the objective function and the fitness function, and measure them.
In Step 4-1, the parameters of each individual are decoded from its chromosome and used to transform the template with homogeneous coordinates; during the transformation, the aspect ratio of the template is controlled. In Step 4-2, in the case of the NPDM, the fitness function defined in Sec. 4 is calculated and the fitness value is measured. In the case of the Non-NPDM, the objective function (refer to Sec. 4.2) is first calculated with a penalty, and then the fitness function is calculated and measured.
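The transformation of Step 4-1 can be sketched with a 3×3 homogeneous-coordinate matrix as below; the composition order (scale, then rotate, then translate) is an assumption, since the paper does not state it.

```python
import numpy as np

def template_transform(cx, cy, mx, my, angle_deg):
    """3x3 homogeneous-coordinate matrix mapping template coordinates onto
    the target image (scaling, rotation, then parallel translation)."""
    t = np.deg2rad(angle_deg)
    scale = np.array([[mx, 0, 0], [0, my, 0], [0, 0, 1]], dtype=float)
    rot = np.array([[np.cos(t), -np.sin(t), 0],
                    [np.sin(t),  np.cos(t), 0],
                    [0, 0, 1]])
    trans = np.array([[1, 0, cx], [0, 1, cy], [0, 0, 1]], dtype=float)
    return trans @ rot @ scale      # applied right-to-left to a column vector

def map_point(matrix, i, j):
    """Map template pixel (i, j) to target-image coordinates."""
    x, y, w = matrix @ np.array([i, j, 1.0])
    return x / w, y / w
```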
3
Features of Shape Deformations of the Target
In general, the typical template shape is a square. However, considering the application of template matching to the shape deformations of the lips, the square template shape is unsuitable.
Fig. 1. Square annulus of template
This is because, during speech, the lips region undergoes intense variations such as an opened/closed mouth and showing/not showing any teeth. In other words, the lips shape changes into other shapes constantly during speech. This is a serious problem for tracking the lips region with only one template per user. To solve this problem, we focus our attention on invariance under constantly deforming shapes. We find that the lips shapes (without the buccal cavity, teeth, tongue, etc.) of an opened mouth during speech have the same topological properties; in fact, they are homeomorphic. Thus, we use the new template shape illustrated in Fig. 1 to cope with the ever-changing lips region. This shape is called a "square annulus". In Fig. 1, w and h are the width and height of the source square template, and w′ and h′ are the inside width and height of the new square annulus template. In the simulations (refer to Sec. 5), w′ and h′ are decided empirically. By ignoring the w′ × h′ region, tracking of the lips region during speech becomes possible, the amount of calculation is reduced, and the lips tracking can be done at high speed.
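A square annulus can be represented simply as a boolean mask over the template, for example as follows. The default inner ratios correspond to w′/w = 0.8 and h′/h = 0.5 used later in Sec. 5.2; centring the ignored region is an assumption.

```python
import numpy as np

def square_annulus_mask(w, h, inner_ratio_w=0.8, inner_ratio_h=0.5):
    """Boolean mask of a w x h template whose central w' x h' region is ignored
    (True = pixel is used in matching)."""
    wi, hi = int(w * inner_ratio_w), int(h * inner_ratio_h)
    mask = np.ones((h, w), dtype=bool)
    top, left = (h - hi) // 2, (w - wi) // 2
    mask[top:top + hi, left:left + wi] = False
    return mask
```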
4 Genetic Algorithm
4.1 Structure of Chromosome
A chromosome represents a candidate solution; in other words, its genes are the parameters that represent the coordinates, scaling, and rotation of the exploration object on the target image. The chromosome therefore has 5 parameters: cx and cy are the coordinates for parallel translation, mx and my are the scaling rates, and angle is the rotation angle of the lips shape. Each gene is 8 bits long, so the total chromosome length is 40 bits. During speech the lips region can be elongated more in width or more in height; thus, we use two-dimensional scaling with mx and my.
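A possible decoding of the 40-bit chromosome is sketched below; the value ranges of the five parameters are assumptions, since the paper fixes only the gene lengths.

```python
def decode_chromosome(bits, img_w=240, img_h=180,
                      scale_range=(0.5, 2.0), angle_range=(-45.0, 45.0)):
    """Decode a 40-bit chromosome (5 genes x 8 bits) into cx, cy, mx, my and angle.
    The parameter ranges are illustrative assumptions."""
    assert len(bits) == 40
    genes = [int(bits[i:i + 8], 2) / 255.0 for i in range(0, 40, 8)]

    def span(g, lo, hi):
        return lo + g * (hi - lo)

    cx = span(genes[0], 0, img_w - 1)
    cy = span(genes[1], 0, img_h - 1)
    mx = span(genes[2], *scale_range)
    my = span(genes[3], *scale_range)
    angle = span(genes[4], *angle_range)
    return cx, cy, mx, my, angle

# Example: an all-zero chromosome maps to the lower bound of each range.
# print(decode_chromosome("0" * 40))
```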
4.2 Fitness Function
In the NPDM, we use only a fitness function, normalised to the range [0, 1], in order to reduce the dispersion of the pixel differences, whose pixel values are real-valued x components of the Yxy colour space. This normalisation also turns the genetic algorithm process into a simple maximisation problem. In contrast, the Non-NPDM does not normalise and has both an objective function and a fitness function: the objective function defines a minimisation problem and the fitness function a maximisation problem.
NPDM. First, the parameters of the geometric transformation are obtained from the chromosome. Then, the fitness is calculated by equation (1):

fitness = 1.0 - [ Σ_{j=1..h} Σ_{i=1..w} D_ij ] / [ (w × h)(P_max) ]                  (1)

D_ij = |p*_ij - p_ij|   if p*_ij ∈ target image;    P_max   if p*_ij ∉ target image   (2)

where w and h are the template size, P_max is the maximum pixel value, P is a point on the template image, P* is the point on the target image that corresponds to the transformed point P, p is the pixel value of point P at coordinate (i, j) in the template image, and p* is the pixel value of point P* in the target image. D_ij is the pixel difference between p and p*; however, when the point P* falls outside the target image, D_ij takes the worst value P_max. A fitness value approaching 1 indicates a good exploration.

Non-NPDM.

O_ij = |p*_ij - p_ij|   if p*_ij ∈ target image;    P_max   if p*_ij ∉ target image   (3)

fitness = (w × h)(P_max) - Σ_{j=1..h} Σ_{i=1..w} O_ij                                 (4)

where O_ij is the value of the objective function, namely the pixel difference between p and p*; when the point P* falls outside the target image, O_ij takes the worst value P_max. An objective value approaching 0 indicates a good exploration; in other words, the exploration is good when the fitness value becomes large.
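The two criteria can be computed from the same per-pixel terms, as in the following sketch. The handling of the ignored square-annulus interior, the nearest-neighbour rounding, and the assumed maximum pixel value are illustrative choices, not details given in the paper.

```python
import numpy as np

P_MAX = 255.0   # assumed maximum pixel value of the x-component images

def pixel_terms(template, target, matrix, mask=None):
    """Per-pixel terms D_ij / O_ij of Eqs. (2)-(3): |p* - p| when the transformed
    point lands inside the target image, otherwise the worst value P_max.
    Pixels excluded by the square annulus contribute 0 (one possible convention)."""
    h, w = template.shape
    terms = np.zeros((h, w))
    for j in range(h):
        for i in range(w):
            if mask is not None and not mask[j, i]:
                continue
            x, y, _ = matrix @ np.array([i, j, 1.0])
            xi, yi = int(round(x)), int(round(y))
            if 0 <= yi < target.shape[0] and 0 <= xi < target.shape[1]:
                terms[j, i] = abs(float(target[yi, xi]) - float(template[j, i]))
            else:
                terms[j, i] = P_MAX
    return terms

def npdm_fitness(terms):
    h, w = terms.shape
    return 1.0 - terms.sum() / (w * h * P_MAX)    # Eq. (1): maximise, approaches 1

def non_npdm_fitness(terms):
    h, w = terms.shape
    return w * h * P_MAX - terms.sum()            # Eq. (4): maximise
```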
5 Results and Discussions
5.1 Input Images
The template images and target images are illustrated in Fig. 2. The left of Fig. 2 shows the template images; the template image size is 18 × 8 [pixels] for subjects 1 and 2 and 20 × 9 [pixels] for subject 3. The lower right of Fig. 2 shows examples (pronouncing the vowel /a/) of target images. The images, captured with a video camera, include a face and background while each of the three subjects pronounces the vowels.
Fig. 2. Template images and target images
Target images are then cut from the video streams. In consideration of use on mobile devices, the target images contain geometric changes relative to the template image, such as parallel translation, scaling, and rotation; these values can be regarded as the solutions obtained from the chromosome (refer to Sec. 4.1). All target images are 240 × 180 [pixels].
5.2
Configurations of System
The parameters of the genetic algorithm are: population size 70, crossover probability 0.7, and mutation probability 0.05. The template shape-change parameters are w′/w = 0.8 and h′/h = 0.5 (refer to Fig. 1 in Sec. 3). If the same elite fitness value persists for 30 generations, the solution is regarded as having converged and tracking is terminated. The larger this value becomes, the fairer the termination criterion becomes.
5.3
Result of Simulation and Consideration
Fig. 3 shows examples of results from the computer simulations using the NPDM and the Non-NPDM. The rectangular region is the extracted lips region. As shown in Fig. 3, the shape deformations of the lips during speech are extracted exactly. The effectiveness of our method is demonstrated with 10 simulations for each person (a total of 30 simulations per vowel), as shown in Tables 1 and 2.
Fig. 3. Result images
Table 1. Results of simulation (Non-NPDM, tough criterion)
                           /a/      /i/      /u/      /e/      /o/
tracking accuracy [%]     96.7     90.0    100.0     90.0     90.0
processing time [msec]   190.0    181.9    172.7    189.2    195.9
generation                80.7     76.4     73.0     78.5     82.7

Table 2. Results of simulation (NPDM, tough criterion)
                           /a/      /i/      /u/      /e/      /o/
tracking accuracy [%]     23.3     26.7     16.7     20.0     23.3
processing time [msec]   (90.0)   (86.3)  (124.0)  (113.3)  (102.9)
generation               (62.3)   (61.1)   (83.4)   (79.5)   (76.3)

Table 3. Results of simulation (NPDM, fair criterion)
                           /a/      /i/      /u/      /e/      /o/
tracking accuracy [%]     98.3     96.7     91.7     93.3     91.7
processing time [msec]   323.0    318.3    332.3    299.0    328.7
generation               140.7    139.9    147.1    130.8    144.2
On the other hand, Table 3 shows the result of 20 simulations for each person (a total of 60 simulations per vowel); these simulations are described in [6]. The simulations of Tables 1 and 2 use the configuration described in Sec. 5.2, whereas Table 3 was obtained with another configuration and shows the NPDM result reported in [6]. For Tables 1 and 2 the termination-criterion value is 30 (refer to Sec. 5.2), whereas for Table 3 it is 50; in other words, the criterion of Table 3 is fairer than that of Tables 1 and 2. Additionally, in Table 2 the processing time and generation are not important, because the tracking accuracies are very low. Comparing Table 2 with Table 3 indicates that the NPDM obtains high tracking accuracies under the fair criterion but low tracking accuracies under the tough criterion. In contrast, Table 1 shows that the Non-NPDM works well even under the tough criterion.
6
Conclusion and Future Work
In this paper, a method for lips region tracking is proposed that is invariant to an opened/closed mouth and showing/not showing any teeth, and that achieves high speed and high recognition accuracy using only one template. In Sec. 5, the effectiveness of the Non-NPDM is demonstrated by comparison with the NPDM. The results of the computer simulations indicate that the Non-NPDM performs a more efficient exploration, is invariant to the varying shape, and
achieves high speed and high recognition accuracy in the tracking of all the vowels. However, in the Non-NPDM some parameters still depend on experience. For example, w′ and h′, the inside width and height of the square annulus template, are decided empirically (refer to Sec. 3), and many simulations show that these parameters strongly affect the fitness value. Therefore, future work is needed to decide all parameters automatically. Another direction of future work is to apply this method to real-time processing; for this improvement, we expect that using other features of the lips, such as template shapes other than the square annulus, will be effective.
References
[1] Nakata, Y., Ando, M.: Detection of Mouth Position Using Color Extraction Method and Eigentemplate Technique for Lipreading. Technical Report of IEICE, Vol. PRMU2001-09, 101(303) (2001) 7-12
[2] Masunaga, S., Nagao, T.: Extraction of human facial regions in still images using a genetic algorithm. Technical Report of IEICE, Vol. PRU95-160, 95(365) (1995) 13-18
[3] Hara, A., Nagao, T.: Extraction of facial region of arbitrary directions from still images with a genetic algorithm. Technical Report of IEICE, Vol. HCS97-12, 97(262) (1997) 37-44
[4] Saitoh, F.: Pose recognition of gray-scaled template image using genetic algorithm. T. IEE Japan, Vol. 121-C, 10 (2001) 1500-1507
[5] Akashi, T., Fukumi, M., Akamatsu, N.: Lips Extraction with Template Matching by Genetic Algorithm. Knowledge-Based Intelligent Information Engineering Systems and Allied Technologies, Crema, Italy (2002) 343-347
[6] Akashi, T., Mitsukura, Y., Fukumi, M., Akamatsu, N.: Genetic Lips Extraction Method for Varying Shape. Computational Intelligence in Robotics and Automation, Kobe, Japan, 2003 (to appear)
[7] Minolta Co., Ltd.: The Essential of Imaging. Japan (2002) 28-44
Thai Banknote Recognition Using Neural Network and Continuous Learning by DSP Unit Fumiaki Takeda, Lalita Sakoobunthu, and Hironobu Satou Department of Information System Engineering, Kochi University of Technology 185 Miyanokuchi, Tosayamada-cho, Kami-gun, Kochi 782-8502 Japan [email protected]
Abstract. Nowadays, neural networks (NNs) are widely used in many fields of engineering and the most famous application is pattern recognition. In our previous researches, a banknote recognition system using a NN has been developed for various applications in worldwide banking systems such as banknote readers and sorters. In this paper, a new kind of banknotes, Thai banknotes, are being proposed as the objects of recognition. First, the slab values, which are the digitized characteristics of banknote by the mask set, are extracted from each banknote image. These slab values are the summation of non-masked pixel values of each banknote. Second, slab values are inputted to the NN to execute its learning and recognition process. Third, for commercial usability, the NN algorithm is implemented on the DSP unit in order to execute the continuous learning and recognition. We show the recognition ability of the proposed system and its possibility for self-refreshed function on the DSP unit using Thai banknotes.
1
Introduction
Recently, neural networks (NNs) have been widely used for pattern recognition because of their abilities of self-organization, parallel processing, and generalization (4). With these three abilities, an NN can recognize patterns effectively and robustly. Therefore, NNs have been applied in our previous research (1)-(8) to develop a new recognition system for banknote reader and sorter machines. In this paper, a new kind of banknote, the Thai banknote (Baht = B), is used as the object of recognition to test the recognition ability of the system. In this recognition system, a masking process is used for characteristic extraction from the banknote image; in particular, an axis-symmetric mask set is applied for recognition of multiple kinds of banknotes (1), (7). In addition, for feasibility and effectiveness in commercial banking machines, the NN learning and recognition algorithms are implemented on DSP devices (5)-(8) as a neuro-recognition engine. Furthermore, we propose continuous learning by the DSP unit, with which we have developed a self-refreshing function for banking machines, and we show its possibility with Thai banknotes.
2
Thai Banknotes
Thai banknotes consist of 5 kinds, namely 20B, 50B, 100B, 500B, and 1000B, and there are 4 conveyed directions (head upright, head reversed, tail upright, and tail reversed) in which they can be inserted into a banking machine. Thus, there are in total 20 patterns for inserting Thai banknotes into the banking machine. However, since the axis-symmetric mask set is applied for characteristic extraction, both the upright and the reversed direction of the head (HB) or tail (TB) side of the same banknote can be recognized at the same time. Therefore, the total number of recognized banknote patterns is 10.
Fig. 1. The Banknote Recognition Flowchart
3
System Overview
Fig. 1 shows the banknote recognition flowchart of the proposed system, which consists of 3 main parts: the preprocesses, the NN, and the DSP unit. The aim of the preprocessing part is slab-value extraction; the slab values are characteristics of the banknote image and are used as the NN inputs (1)-(8). The first preprocessing step is therefore the collection of banknote images with a scanner. Then, the edges and center of every banknote image are detected for mask location. After that, the mask set is applied to each banknote image for slab-value extraction; the details of the mask set are given in the next section. All preprocesses are performed on a PC before the banknote images and the mask set are transferred to the DSP unit for learning and recognition by the NN. The second part, the NN, has 3 layers of size 50×30×10, as shown in Fig. 2. The number 10 of the output layer represents the number of banknote patterns to be recognized (5 kinds × 2 conveyed directions). The NN has 2 important functions, learning and recognition. For NN learning, the back-propagation (BP) method is applied as the learning algorithm. After learning is finished, the NN weights, which have converged to the learning condition, are used for banknote recognition.
Although it seems that the preprocesses and the NN alone could recognize banknotes, this is not sufficient for a real banknote recognition machine. Therefore, the DSP unit is applied; it can perform both continuous learning and banknote recognition by itself. In this research, we performed 2 experiments: a PC experiment and a DSP unit experiment. First, we execute all preprocesses and the NN calculation on the PC to confirm the recognition ability of the proposed system. Next, for the DSP unit experiment, the banknote images, the mask set, and the converged NN weights from the PC are transferred to the DSP unit for continuous learning and recognition.
Fig. 2. The NN configuration with the Slab and Masking Processes
4
Masking Process
Our previous research (1)-(8) proposed the masking preprocess to extract the slab values of each banknote image as the NN inputs. These slab values, which are extracted with the MAJ (majority) operator (9), are used as the characteristics of the input image. First, the pixel values of the image are summed up and used as an input for the NN; such a slab value is regarded as a digitized characteristic of the banknote. However, two different images can sometimes have the same slab value. Therefore, the masking preprocess is applied to solve this problem, so that a slab value becomes the summation of the non-masked pixel values. Moreover, for an effective NN recognition system, several masks with different mask positions are applied as a mask set to each input image, yielding various different slab values. We decided to apply 50 axis-symmetric mask patterns as the mask set of this system because of their advantage in recognizing multiple kinds of banknotes (1), (7). Thus, in Fig. 2, the number of input nodes of the NN equals the number of mask patterns. With this kind of mask set, the masked parts are located symmetrically about the vertical and horizontal axes. Using this mask set, the slab values of the upright and the reversed direction of the same banknote pattern are identical, as shown in Fig. 3. Therefore, the two conveyed directions of a banknote image, upright and reversed, are recognized simultaneously.
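The following sketch illustrates the slab-value computation with randomly generated axis-symmetric masks. The real mask set is designed (and in related work GA-optimised), so the random masks, the masking ratio, and the frame size are only stand-ins for illustration.

```python
import numpy as np

def make_axis_symmetric_masks(h, w, n_masks=50, mask_ratio=0.3, seed=0):
    """Random axis-symmetric binary masks (True = masked-out pixel).  One quadrant
    is drawn at random and mirrored about both symmetry axes.  h and w are
    assumed even (e.g. the 32x216 NN data frame)."""
    rng = np.random.default_rng(seed)
    masks = []
    for _ in range(n_masks):
        quad = rng.random((h // 2, w // 2)) < mask_ratio
        top = np.hstack([quad, quad[:, ::-1]])          # mirror about the vertical axis
        masks.append(np.vstack([top, top[::-1, :]]))    # mirror about the horizontal axis
    return masks

def slab_values(image, masks):
    """Slab value of each mask = sum of the non-masked pixel values (the NN inputs)."""
    return np.array([image[~m].sum() for m in masks], dtype=float)

# Each mask equals its own 180-degree rotation, so an upright banknote and the
# same banknote rotated by 180 degrees yield identical slab values:
# assert np.allclose(slab_values(img, masks), slab_values(img[::-1, ::-1], masks))
```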
Fig. 3. The Axis-symmetry Mask Set Structure and Slab Values
5
Data Collection
Since the slab values of a banknote are extracted from its image, the banknote image must first be collected. First, each banknote image is collected with a scanner, and the scanned image is saved as bitmap data. Second, the bitmap data of each banknote image are transformed into the NN data format by dividing the bitmap along the vertical and horizontal axes to obtain a 32×216-pixel image frame; each pixel value of the NN data is the average of all the pixels of the bitmap data located in that pixel of the NN data. Finally, the NN data are saved on the PC and applied as input to the masking preprocess and the NN.
Fig. 4. Edges Detection with Threshold Values
To locate the masks, the banknote edges and center have to be detected. For edge detection in the NN data, 2 threshold values, threshold "cho" and threshold "tan", are selected manually. The first and last column (or row) whose gray-scale level exceeds the threshold values is taken as the edge of the banknote image. Fig. 4 shows the edge detection of the 20B head (20HB). Finally, the banknote center is detected automatically. Based on the detected center, the mask set is selected manually and then located on the banknote image by setting the mask center at the same position as the banknote center.
6
Experiments
Table 1 shows the NN structure and learning conditions of the proposed system. The total number of banknote patterns that are recognized is 10, which is the same as node number of the output layer. For the NN learning, 10 pieces of each banknote pattern are applied as the NN learning data. And the BP learning algorithm is shown in equation (1) (8).
∆W_ij(t) = -ε d_j · o_i + α ∆W_ij(t-1) + β ∆W_ij(t-2)                    (1)

where ∆W is the weight difference, d is the generalized error, o is the output response of the previous layer, and ε, α, and β are coefficients. During NN learning, the weights are adjusted by this learning rule until one of the learning conditions is reached.
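A direct implementation of the update rule in Eq. (1) for a single weight matrix might look as follows; the coefficient values are illustrative, as the paper does not list them.

```python
import numpy as np

class BPMomentum2:
    """Weight update of Eq. (1): dW(t) = -eps*d_j*o_i + alpha*dW(t-1) + beta*dW(t-2),
    for one weight matrix W[i, j] connecting layer outputs o to generalized errors d."""
    def __init__(self, shape, eps=0.05, alpha=0.8, beta=0.1):
        self.eps, self.alpha, self.beta = eps, alpha, beta
        self.prev1 = np.zeros(shape)   # dW(t-1)
        self.prev2 = np.zeros(shape)   # dW(t-2)

    def step(self, W, o, d):
        grad = np.outer(o, d)                          # grad[i, j] = d_j * o_i
        dW = -self.eps * grad + self.alpha * self.prev1 + self.beta * self.prev2
        self.prev2, self.prev1 = self.prev1, dW
        return W + dW
```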
Table 1. The NN Structure and Learning Conditions
Mask no.                    50
No. of banknote types       10
Hidden node no.             30
Learning data / pattern     20
Testing data / pattern      80
Mean square error           0.0001
Maximum iteration no.       20000

7
Recognition Results
After the NN learning process is completed, the converged NN weights and the mask set are saved on the PC, and 90 banknote images per pattern are then tested to evaluate the recognition ability on the PC. These testing data consist of usual, worn-out, painted, and defective banknotes, in 5 different positions and 2 different rotation angles. In the experiment, all banknote patterns achieve 100% recognition ability except the 50B head (50HB), 50B tail (50TB), and 100B head (100HB) patterns, whose recognition abilities are 95.00%, 98.89%, and 98.75%, respectively.
However, there is a reliability problem, because the output responses of some patterns vary. Next, 2 important factors that affect the recognition ability and the output variation are discussed.
(a) 8×2 mask area and (-48, -4) offset; (b) 8×1 mask area and (-48, -2) offset. Each panel plots the output response (0.0 to 1.0) of the 50TB pattern against the testing data number.
Fig. 5. Effect of Mask Area and Position on the Output Response
8
Discussion of Factors Affecting the Output Response
8.1
Masks Position and Area
This is the most important factor of the banknote recognition because the slab values are extracted from the mask set. Different mask positions also give different output responses as shown in Fig. 5. Here, the mask offset shows distance between the origin point of mask frame and the banknote center. 8.2
Threshold Values Selection and Center Shifting
Another key factor is the threshold values for the edges and center detection. With the suitable threshold values, the edges and center of banknote image are detected accurately. However, sometimes, there are many pairs of these two threshold values that can detect edges and center. Therefore, the most accurate threshold values should be applied because the center of image affects to the mask positions.
9
Continuous Learning and Recognition by the DSP Unit
9.1
Data Transportation
Using the DSP unit, 2 NN calculations and 3 important data are needed. First, the NN calculation for its learning and recognition are implemented on the DSP unit. Other 3 data, which are i) image databases, ii) the NN weights and iii) the mask set, are transported from PC to a flash memory of the DSP unit. Fig. 6 shows the hardware configuration of the DSP unit.
Fig. 6. The Hardware Configuration of the DSP Unit
9.2
Continuous Learning of the DSP Unit
Our DSP unit can work in both learning mode and recognition mode. In learning mode, the banknote images, the mask set, and the converged NN weights, which are saved in the flash memory, are loaded into the SRAM, the main memory of the DSP unit; the mask set applied on the DSP unit is the same as the one used in the PC experiment. Next, the DSP unit starts to execute NN learning. For the DSP unit experiment, 2 types of NN learning are performed. In the first, the initial learning, 20 images of each banknote pattern were transferred to the DSP unit, consisting of 10 images for learning and 10 images for testing; in this learning, the NN weights are adjusted starting from random initial weights. In the second, the continuous learning, 10 new images of each banknote pattern were added to the DSP unit as new learning data, and the NN weights are adjusted starting from the converged NN weights of the PC. Thus, there are 20 learning data for each pattern, and the same 10 images per pattern as in the initial learning are used as testing data to evaluate the recognition ability. The final converged NN weights are then saved in the flash memory for recognition.
Recognition Results
In recognition mode, the mask set and the converged NN weights are loaded from the flash memory into the SRAM, and recognition is started. The recognition abilities for both learning cases are shown in Table 2, where Case 1 represents the initial learning and Case 2 the continuous learning. Although the amount of testing data on the DSP unit is only 10 per pattern, these testing data are identical to the first 10 of the 90 testing data of the PC experiment in the initial learning case; therefore, the output responses of the 10 testing data in the DSP unit experiment are similar to the output responses of the first 10 testing data of the PC experiment.
Table 2. The Recognition Ability of Thai Banknote Patterns
Banknote Type    Recognition Ability (%)
                 Case 1     Case 2
20 HB            100.00     100.00
20 TB            100.00     100.00
50 HB             95.00      95.56
50 TB             98.89     100.00
100 HB            98.75      98.89
100 TB           100.00     100.00
500 HB           100.00     100.00
500 TB           100.00     100.00
1000 HB          100.00     100.00
1000 TB          100.00     100.00
Average           99.26      99.45
Moreover, the continuous learning also improves the recognition abilities; in particular, the recognition ability of the 50TB pattern increases from 98.89% to 100%, as shown in Table 2. We thus showed that the recognition ability of the DSP unit is almost the same as that of the PC, and we also confirmed the possibility of continuous learning and self-refreshing on the DSP unit.
10
Conclusion
An NN recognition system for Thai banknotes has been proposed, and its recognition ability was confirmed by evaluating many banknote images on both the PC and the DSP unit. Advanced banking machines especially need a self-refreshing function, because heavily damaged and worn-out banknotes in the field, such as Euro banknotes, are expected to increase. We showed the possibility of realizing self-refreshing for banking machines by NN continuous learning on the DSP unit.
References
[1] Nishikage, T. and Takeda, F., "Axis-Symmetrical Masks Optimized by GA for Neuro-currency Recognition and Their Statistical Analysis", Proceedings of World Multi-Conference on Systemics, Cybernetics and Informatics, Vol. 2, pp. 308-314 (1998)
[2] Sakoobunthu, L., Takeda, F., and Sato, H., "Thai Banknote Recognition Using Neural Network and Applied to the Neuro-Recognition Unit", The International Workshop on Signal Processing Application and Technology, pp. 107-114 (2002)
[3] Sakoobunthu, L., Takeda, F., and Sato, H., "Online Continuous Learning by DSP Unit", SICE System Integration Conference, Vol. 3, pp. 321-322 (2002)
[4] Takeda, F. and Omatu, S., "High Speed Paper Currency Recognition by Neural Networks", IEEE Trans. on Neural Networks, Vol. 6, No. 1, pp. 73-77 (1995)
[5] Takeda, F. and Omatu, S., "A Neuro-Paper Currency Recognition Method Using Optimized Masks by Genetic Algorithm", Proceedings of IEEE International Conference on Systems, Man and Cybernetics, Vol. 5, pp. 4367-4371 (1995)
[6] Takeda, F., Omatu, S. and Matsumoto, Y., "Development of High Speed Neuro-Recognition Unit and Application for Paper Currency", The International Workshop on Signal Processing Application and Technology, pp. 49-56 (1998)
[7] Takeda, F., Nishikage, T. and Matsumoto, Y., "Characteristic Extraction of Paper Currency using Symmetrical Masks Optimized by GA and Neuro-Recognition of Multi-National Paper Currency", Proceedings of IEEE World Congress on Computational Intelligence, Vol. 1, pp. 634-639 (1998)
[8] Takeda, F., Nishikage, T. and Omatu, S., "Banknote Recognition by Means of Optimized Masks, Neural Network, and Genetic Algorithm", Engineering Applications of Artificial Intelligence 12, pp. 175-184 (1999)
[9] Widrow, B., Winter, R.G. and Baxter, R.A., "Layered Neural Nets for Pattern Recognition", IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 36, No. 7, pp. 1109-1118 (1988)
Color-Identification System Using the Sandglass-Type Neural Networks Shin-ichi Ito1 , Kensuke Yano2 , Yasue Mitsukura3 , Norio Akamatsu1 , and Rajiv Khosla4 1
2
University of Tokushima, 2-1, Minami-Josanjima, Tokushima, Japan {itwo,akamatsu}@is.tokushima-u.ac.jp Japan Gain the Summit Co., Ltd., 2-11-10, Bunkyouku-Suidou, Tokyo, Japan [email protected] 3 Okayama University, 3-1-1, Tsushima-naka, Okayama, Japan [email protected] 4 University of La Trobe, Victoria 3086 Australia [email protected]
Abstract. Illegal copying of digital images, which abound on the WWW, and the copyright protection of image information have become serious problems. Electronic watermarking is attracting attention as a strong countermeasure. However, watermarking suffers from deterioration of the output quality and from the risk that the embedded codes are analyzed. In this paper, a neuro-color compression ID system is proposed that sets up security for an image without any deterioration of the output quality. In the proposed system, the position information and RGB values of pixels chosen at random from a color image are compressed with a sandglass-type neural network (SNN). The compressed information and the combination weights of the SNN are called the neuro-color compression ID. Moreover, the neuro-color compression ID can be restored by using the SNN and the position information of the randomly chosen pixels. A system that can prevent illegal copying and protect the copyright can be built by making the neuro-color compression ID, the positions of the chosen pixels, or the pixels themselves the basis of an image license. Furthermore, because no code needs to be embedded as in an electronic watermark, the method can be used even in fields with severe demands on information secrecy, as a new watermarking technology with no deterioration of the output quality. In order to show the effectiveness of the proposed method, simulations are carried out. The proposed method has been approved as a patent.
1
Introduction
Recently, handling of digital images has become easy due to the improved performance of computers and the higher bandwidth of networks. Along with this, illegal copies that ignore copyright are increasing, and the copyright protection of digital images has become a problem [1],[2].
Electronic watermark technology is attracting attention as one of the strong countermeasures to these problems [3],[4]. An electronic watermark can be classified as a spatial-domain type or a frequency-domain type, according to the domain in which the watermark information is embedded. In general, the spatial-domain type is easy to manage because the watermark information is embedded directly in the pixel values. On the other hand, the embedded watermark information is often lost through attacks such as image compression or smoothing by enlargement. Although this can be countered by strengthening the watermark to increase its tolerance against attacks, the deterioration of the output quality after embedding is then unavoidable. The frequency-domain type embeds the watermark information in the frequency domain and thereby strengthens the tolerance against attacks. However, techniques based on orthogonal transforms such as the FFT and DCT have the drawback that blocking and distortion easily appear in the processed image. Techniques based on the wavelet transform and spectrum spreading have been proposed to overcome this drawback, but the output quality after embedding can still deteriorate depending on the strength of the watermark. Therefore, as long as watermark information is embedded, a decline of the output quality cannot be avoided. Moreover, much of the watermark information used in electronic watermarks is enciphered individual information; a system is thus required that is difficult to analyze and that does not entail deterioration of the output quality. In this paper, we propose the neuro-color compression ID system, which turns an image into an ID by using the information that the color image itself contains. In the proposed method, pixels are first chosen at random from the image. Then, the position information and color information (RGB values) of the chosen pixels are compressed by using the sandglass-type neural network (SNN). We define the compressed information and the combination weights of the SNN as the neuro-color compression ID. The SNN is an information-compression technique based on neural networks proposed by Cottrell. Furthermore, the SNN can be considered suitable for turning a color image into an ID because, owing to the nature of its learning, the combination weights and the compressed information of the middle-layer units hardly ever coincide even when similar images are compressed. The neuro-color compression ID can then be restored by using the SNN and the position information of the randomly chosen pixels. A system that can prevent illegal copying and protect the copyright can be built by making the neuro-color compression ID or the pixel position information the basis of an image license. By using only information contained in the image in this way, deterioration of the output quality is prevented. Computer simulations using real data were carried out to verify the validity of the proposed system, and as a result the method was approved as a patent.
Fig. 1. The structure of sandglass type neural networks
2
Neuro-color Compression ID System
In the neuro-color compression ID system, the pixels used for the SNN are chosen at random from the original image. The positional information and the color information (RGB) of the selected pixels are compressed by using the SNN, and the neuro-color compression ID is obtained from the result: it consists of the middle-layer unit outputs and the combination weights of the SNN. This system can realize high security, because both the selected pixels and the neuro-color compression ID are absolutely necessary to acquire the image information. Moreover, it is also possible to embed information using the pixels selected at random. Since the system uses RGB values, it can be applied to any image format.
2.1
Sandglass Type Neural Networks
In this paper, sandglass-type neural networks (SNNs) are used to generate the neuro-color compression ID. The SNN, proposed by Cottrell, is an information-compression method based on neural networks. It is an auto-associative network that uses the same data as the input data and the teacher signal, and the back-propagation (BP) method is used for learning. The network is called sandglass-type because the number of middle-layer units is smaller than that of the input and output layers. Since a three-layer sandglass network can perform only linear dimensionality reduction, it is known that an SNN with five or more layers is necessary in order to perform nonlinear conversion to, and reverse conversion from, the internal representation. Therefore, the information carried by the pixels selected at random from the image is compressed using the five-layer SNN shown in Fig. 1, and the neuro-color compression ID is generated from this SNN.
Fig. 2. Original images (Image1, Image2, Image3)
Fig. 3. The selected pixels of the original images (R, G, and B values of the selected pixels plotted over the normalized x-y positions for Image1, Image2, and Image3)
Fig. 4. The RGB values of the selected pixels of the original images (separate R, G, and B panels, each comparing Image1, Image2, and Image3)
Fig. 5. Combination weights (combination weight value plotted against the index of the SNN combination weights, for Image1, Image2, and Image3)
2.2
The Generation Method of Neuro-color Compression ID
The neuro-color compression ID is generated from the information carried by the pixels chosen at random from the image. In this paper, the color information used is the RGB value.
Fig. 6. The output of the middle-layer units (first, second, and third unit outputs for Image1, Image2, and Image3)
The positional information and RGB values of the selected pixels are compressed by using the SNN, and the neuro-color compression ID is obtained from the SNN. The generation method of the neuro-color compression ID is as follows:
Step 1: Pixels in the image are chosen at random; the number of chosen pixels is fixed in advance.
Step 2: The position information and RGB value of each selected pixel are extracted. This operation is done for all of the selected pixels.
Step 3: The extracted position information and RGB values of the pixels chosen at random from the color image are input to the input-layer units of the SNN. The SNN is trained by the BP method, with the teacher signal equal to the input data, so that the position information and RGB values are compressed by the SNN. The neuro-color compression ID, consisting of the middle-layer unit outputs and the combination weights of the SNN, is then extracted. Note that the SNN is trained for each pixel separately.
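The following sketch outlines these steps with a small five-layer SNN trained by plain back-propagation. The hidden-layer sizes other than the 3-unit bottleneck, the learning rate, the omission of bias terms, and the number of selected pixels are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_snn(x, sizes=(5, 4, 3, 4, 5), lr=0.5, epochs=5000, seed=0):
    """Train a five-layer sandglass network on one sample x (input = teacher signal).
    Returns the weight matrices and the middle-layer (bottleneck) output."""
    rng = np.random.default_rng(seed)
    W = [rng.uniform(-0.5, 0.5, (sizes[k], sizes[k + 1])) for k in range(len(sizes) - 1)]
    for _ in range(epochs):
        acts = [x]
        for w in W:                                   # forward pass
            acts.append(sigmoid(acts[-1] @ w))
        delta = (acts[-1] - x) * acts[-1] * (1 - acts[-1])
        for k in range(len(W) - 1, -1, -1):           # backward pass (BP)
            grad = np.outer(acts[k], delta)
            if k > 0:
                delta = (delta @ W[k].T) * acts[k] * (1 - acts[k])
            W[k] -= lr * grad
    middle = sigmoid(sigmoid(x @ W[0]) @ W[1])        # 3-unit bottleneck output
    return W, middle

def neuro_color_id(image, n_pixels=20, seed=0):
    """Pick pixels at random, normalise (x, y, R, G, B) to [0, 1], train one SNN per
    pixel, and collect (weights, middle outputs) as the neuro-color compression ID."""
    rng = np.random.default_rng(seed)
    h, w, _ = image.shape                             # image: HxWx3 uint8 array
    ident = []
    for _ in range(n_pixels):
        y, x = rng.integers(h), rng.integers(w)
        v = np.array([x / (w - 1), y / (h - 1), *(image[y, x] / 255.0)])
        ident.append(train_snn(v))
    return ident
```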
3
Computer Simulations
In this paper, computer simulations were carried out using real data to verify the validity of the proposed system. The real data are human face images, used as examples of similar images; the number of images is three. Each original image is shown in Fig. 2, and the selected pixels of the original images are shown in Fig. 3. The RGB values of the selected pixels are shown in Fig. 4; the RGB distributions do not differ much between the images. Next, the RGB values and the position information of the pixels are input to the SNN, and the neuro-color compression ID is extracted by the SNN. Fig. 5 shows the neuro-color compression ID.
The neuro-color compression ID differs for each image. This result suggests that the neuro-color compression ID can serve as new watermark information.
4
Conclusions
In this paper, we proposed a new image-protection method using the SNN. Furthermore, in order to show the effectiveness of the proposed method, computer simulations were carried out using real images.
References
[1] Masataka Ejima and Akiko Miyazaki: Digital Watermark Technique Using One-Dimensional Image Signal Obtained by Raster Scanning, IEICE Trans. (A), Vol. J82-A, No. 7, pp. 1083-1091, 1999.
[2] Miki Haseyama and Isao Kondo: Image Authentication Based on Fractal Image Coding without Contamination of Original Image, IEICE Trans. (D-II), Vol. J85-D-II, No. 10, pp. 1513-1521, 2002.
[3] Katsutoshi Ando and Hitoshi Kiya: An Encryption Method for JPEG2000 Images Using Layer Function, IEICE Trans. (A), Vol. J85-A, No. 10, pp. 1091-1099, 2002.
[4] G. C. Langelaar, I. Setyawan, and R. L. Lagendijk: Watermarking digital image and video data, IEEE Signal Processing Magazine, Vol. 17, No. 5, pp. 20-46, 2000.
[5] Katsutoshi Ando, Osamu Watanabe, and Hitoshi Kiya: Partial-Scrambling of Images Encoded by JPEG2000, IEICE Trans. (D-II), Vol. J85-D-II, No. 2, pp. 282-290, 2002.
[6] Minoru Kuribayashi and Hatsukazu Tanaka: Watermarking Schemes Using the Addition Property among DCT Coefficients, IEICE Trans. (A), Vol. J85-A, No. 3, pp. 322-333, 2002.
[7] I. Echizen, H. Yoshiura, T. Arai, H. Kimura, and T. Takeuchi: General quality maintenance module for motion picture watermarking, IEEE Trans. Consum. Electron., Vol. 45, No. 4, pp. 1150-1158, 1999.
[8] I. J. Cox, J. Kilian, F. T. Leighton, and T. Shamoon: Secure spread spectrum watermarking for multimedia, IEEE Trans. Image Process., Vol. 6, No. 12, pp. 1673-1687, 1997.
[9] M. D. Swanson, M. Kobayashi, and A. H. Tewfik: Multimedia data-embedding and watermarking technologies, Proc. IEEE, Vol. 86, No. 6, pp. 1064-1087, 1998.
[10] F. Hartung and M. Kutter: Multimedia watermarking techniques, Proc. IEEE, Vol. 87, No. 7, pp. 1079-1107, 1999.
[11] S. Katzenbeisser and F. A. P. Petitcolas: Information hiding techniques for steganography and digital watermarking, Artech House Publishers, MA, 2000.
Secure Matchmaking of Fuzzy Criteria between Agents Javier Carbó 1, Jose M. Molina1, and Jorge Dávila2 1
Artificial Intelligence Group, Computer Science Dept., Univ. Carlos III of Madrid, Av. Universidad 30, Leganés 28911 Madrid, Spain {jcarbo,molina}@ia.uc3m.es http://scalab.uc3m.es/ 2 Information Security Group, Computer Science Faculty, Univ. Politecnica of Madrid, Campus de Montegancedo s/n, Boadilla del Monte 28660 Madrid, Spain [email protected] http://tirnanog.ls.fi.upm.es
Abstract. Subjective evaluation of services and agents plays a fundamental role in open systems. The vague nature of the personal criteria used in subjective evaluations justifies the use of fuzzy sets. This distributed approach requires cooperation (information sharing) between agents with similar criteria in order to improve the decision-making process, so similarity between criteria is used to decide whether or not to cooperate with a given agent. But since personal evaluation criteria reflect the user's profile, this information should be protected and not disclosed to arbitrary agents. This paper presents an analytic study of the viability of applying distributed secure computations to the matching of fuzzy sets representing such personal criteria.
1
Introduction
With the advent of the Internet, so many retail services became accessible that choosing the right option became a major problem. The standard solution to this problem consists of a hierarchy of trusted third parties issuing digital certificates, but this approach assumes the existence of objective evaluation criteria that are universally accepted. A customized evaluation of the provided services is, however, sometimes desirable. In that case the evaluation depends on personal subjective criteria, since every party holds a particular view of the world. This decentralized approach would be much less effective than the scheme of Certification Authorities if cooperation were not implemented, because the results of this type of evaluation are not so easily transferred to other parties. Both characteristics mentioned (personalization and cooperation) are among those that justify the use of agent technology. Agents are autonomous programs that act on behalf of humans as their representatives. They make use of user profiles to perform tasks as humans would. Therefore, we can assume that agents use personal criteria
to select the best services, and that they cooperate among themselves to improve the quality of their decisions. The goal of agent cooperation is to use the evaluations of services performed by other agents in the decision making. But if those other agents have very different criteria, the cooperation is useless. Furthermore, privacy protection of these personal preferences is a major concern of human users, and this concern is preventing the benefits of electronic commerce from meeting expectations. Therefore, cooperation should take place only among agents that share similar evaluation criteria, in order to avoid a generalized disclosure of the user's profile [1]. Hence a fair method for matching personal criteria is required.
2
Previous Works on Private Matchmaking of Personal Criteria
The most direct way to protect the privacy of personal criteria in the matchmaking process is through anonymity. This solution has been studied extensively in different ways. The most direct method to achieve anonymity is the use of pseudonyms based on public key cryptography that are changed after a certain time period. An alternative consists of hiding the identity within a crowd or group [2]; the level of anonymity then depends on the size of the group. This scheme is also called deniable signatures, since it is not possible to conclude which member of the group is responsible [3]. Finally, another possibility is the use of the blind signatures proposed for emulating electronic cash [4], but this last approach has shown several limitations [5]. Even if anonymity were easy to implement, it has relevant negative consequences derived from the initial reputation assigned to newcomers [6]. A low initial reputation is a serious handicap for agents that join the system late. On the other hand, a high initial reputation allows malicious behaviour to escape punishment, since re-emerged agents (with a new pseudonym) become rewarded. Therefore, the problem consists of authenticating the similarity of evaluation criteria with a newcomer while the privacy of the criteria used in the verification of similarity remains protected. This problem is similar to any two-party secure distributed computation: both parties behave honestly in computing the output of a prefixed function, but they wish to maintain the secrecy of their inputs. The security of such solutions relies on certain algebraic properties preserved by the encryption schemes used. The encrypted outputs are homomorphic with the inputs, so they partially preserve the multiplicative group operation (associativity, complementation, neutral element). This common algebraic structure is used to verify the satisfaction of certain properties of the inputs without revealing them. This is the case in the millionaires problem [7], in which two players want to know who is richer, but do not wish to disclose the exact amounts they own. There are also different solutions to this matchmaking problem that involve the participation of a trusted third party, such as [8][9]. An improved solution that does not require a third party was proposed in [10].
3
Secure Matchmaking of Fuzzy Criteria
Fuzzy logic aims to give sound mathematical foundations to the vague and imprecise reasoning typical of humans. In a broad sense, fuzzy logic is almost synonymous with fuzzy set theory, which is considered an extension of classical set theory. The difference between classical and fuzzy set theory lies in how an element x may be a member of a set A. Membership functions allow us to model vague concepts like much, a few, more or less, etc. as fuzzy sets. Since the criteria applied in subjective evaluations of behaviour reflect vague human reasoning, representing these criteria with fuzzy sets seems adequate. This approach has been applied by some of the authors of this paper to represent the reputation of merchants [11]. Furthermore, a trapezoidal fuzzy set may be defined by the four values that represent the corners of the trapezium. We may then use a scheme like the millionaires problem to compare each pair of corresponding corners of both trapeziums. In this way, two agents A and B learn in a secure way whether xAi is greater than or equal to, or less than, xBi, while the concrete values xAi remain secret for B, and vice versa. Both agents will then know, for example, that xA1 >= xB1, xA2 >= xB2, xA3 < xB3 and xA4 < xB4. With the results of these four comparisons, the authors proved in [12] that it is possible to estimate the similarity level between both preferences with a high success rate (very reliable similarity rates). That publication, however, lacked a measure of the quantity of information disclosed; its inclusion in this contribution completes that work and shows the effectiveness of the privacy protection proposed.
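The following Python sketch illustrates this idea: only the outcome of the four corner comparisons is shared, and the similarity with the hidden fuzzy set is estimated by averaging over random fuzzy sets that would produce the same comparison results (the Monte Carlo procedure of Section 4). The corner range [0, 100] and the simple corner-distance similarity measure are assumptions of this sketch; the paper relies on the measure defined in its reference [12].

# A minimal sketch, not the authors' protocol: each trapezoidal fuzzy set is
# given by four increasing corners in [0, 100], and only the outcome of the
# four secure comparisons (a_i >= x_i or a_i < x_i) is revealed.
import random

def random_fuzzy_set():
    return sorted(random.uniform(0, 100) for _ in range(4))

def comparison_pattern(a, x):
    return tuple(ai >= xi for ai, xi in zip(a, x))

def similarity(a, x):
    # Stand-in similarity measure (assumption): 1 - mean corner distance / 100.
    return 1.0 - sum(abs(ai - xi) for ai, xi in zip(a, x)) / (4 * 100.0)

def estimate_from_pattern(a, pattern, samples=100):
    """Average similarity over random fuzzy sets producing the same four
    comparison results against A (cf. Section 4)."""
    sims = []
    while len(sims) < samples:
        x = random_fuzzy_set()
        if comparison_pattern(a, x) == pattern:
            sims.append(similarity(a, x))
    mean = sum(sims) / len(sims)
    std = (sum((s - mean) ** 2 for s in sims) / len(sims)) ** 0.5
    return mean, std

a = [20, 30, 40, 50]
x_secret = [15, 35, 60, 80]                # known only to the other agent
pattern = comparison_pattern(a, x_secret)  # output of the secure comparisons
print(estimate_from_pattern(a, pattern))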
4
Evaluation
The goal of this section is to analyze how much information is disclosed by the proposed secure matchmaking of fuzzy criteria, compared with the reliability of the similarity rate obtained. Let A = (a1, a2, a3, a4) denote the fuzzy criteria of our agent, and X = (x1, x2, x3, x4) the fuzzy set corresponding to the other agent. In order to estimate the reliability of the similarity rate that can be computed with the proposed method, we have considered 100 different fuzzy criteria Xi for each possible combination of the four comparison results. The average matching of these 100 fuzzy sets Xi with the known fuzzy set A can then be taken as an approximate estimate of the similarity between A and any fuzzy set producing that combination of comparison results. The reliability of this measure is computed as the standard deviation of the matching over these 100 cases. Once the similarity rate and its reliability can be evaluated, we are interested in analyzing the quantity of disclosed information. It depends on the range of possible values of the four corners of the unknown trapezium representing the fuzzy criteria of the other agent. The left corner of the unknown trapezium (x1) belongs to the following range of possible values:
IF a1 > x1 THEN x1 ∈ [0, a1]
IF a1 < x1 AND ¬∃(ai, xi) such that ai > xi THEN x1 ∈ [a1, 100]                        (1)
IF a1 < x1 AND ∃(ai, xi) such that ai > xi THEN k = min{i}; x1 ∈ [a1, ak]

The right corner of the unknown trapezium (x4) belongs to this range of values:

IF a4 < x4 THEN x4 ∈ [a4, 100]
IF a4 > x4 AND ¬∃(ai, xi) such that ai < xi THEN x4 ∈ [0, a4]                          (2)
IF a4 > x4 AND ∃(ai, xi) such that ai < xi THEN k = max{i}; x4 ∈ [ak, a4]

Finally, the other two corners (xj with j ∈ {2, 3}) belong to these ranges of values:

IF aj > xj AND ¬∃(ai, xi) with i < j such that ai < xi THEN xj ∈ [0, aj]
IF aj > xj AND ∃(ai, xi) with i < j such that ai < xi THEN k = max{i}; xj ∈ [ak, aj]
IF aj < xj AND ¬∃(ai, xi) with i > j such that ai > xi THEN xj ∈ [aj, 100]             (3)
IF aj < xj AND ∃(ai, xi) with i > j such that ai > xi THEN k = min{i}; xj ∈ [aj, ak]
Therefore, given the four corners of our own fuzzy criteria (ai), we can obtain the width of the range of possible values of each corner for every possible result of the four comparisons. The narrower the range, the more information is disclosed about a corner of the unknown fuzzy criteria. An approximate estimate of the information disclosed about the four corners can thus be computed as:

1 − ( Σi width_of_range(xi) ) / (4 · 100)                                              (4)
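A small Python sketch of equations (1)-(3) and of the information estimate (4) follows; corner values are assumed to lie in [0, 100], as in the examples below. For the all-'≥' outcome it yields 0.65, which matches the 0.650 reported in Table 1 for that combination.

# A sketch of equations (1)-(3) and (4): given A's corners and the four
# comparison results, compute the interval each unknown corner x_j must lie
# in and the resulting estimate of the disclosed information.
def corner_ranges(a, greater):
    """a: corners of our fuzzy set; greater[j] is True when a_j >= x_j."""
    ranges = []
    for j in range(4):
        if greater[j]:
            # x_j < a_j; an earlier corner with a_i < x_i raises the lower bound.
            lows = [a[i] for i in range(j) if not greater[i]]
            ranges.append((max(lows) if lows else 0.0, a[j]))
        else:
            # x_j > a_j; a later corner with a_i > x_i lowers the upper bound.
            highs = [a[i] for i in range(j + 1, 4) if greater[i]]
            ranges.append((a[j], min(highs) if highs else 100.0))
    return ranges

def information_disclosed(ranges):
    # Equation (4): the narrower the ranges, the more is revealed.
    return 1.0 - sum(hi - lo for lo, hi in ranges) / (4 * 100.0)

a = [20, 30, 40, 50]
greater = (True, True, True, True)      # a_j >= x_j for every corner
r = corner_ranges(a, greater)
print(r, information_disclosed(r))      # information estimate: 0.65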
On the other hand, the probability of a given combination of four comparison results also depends on the width of these ranges. Since the probabilities are not independent, we apply Bayes' theorem:

P(x1, x2, x3, x4) = P(x1) P(x2 | x1) P(x3 | x2, x1) P(x4 | x3, x2, x1)                 (5)

The first factor of the equation is:

P(x1) = width_of_range(x1) / 100                                                       (6)

For the other factors, one of the next two equations has to be applied:
IF ∃ k where k ...                                                                     (7)
Else: P(xi | xi−1, ..., x1) = width_of_range(xi) / 100

We have applied these equations to two illustrative fuzzy criteria of our agent in order to observe the correlation between the reliability of the similarity rate, the quantity of information transmitted, and the probability that such a combination occurs.

Table 1. Information transmitted, probability, similarity rate and its reliability given a fuzzy criteria A defined by: a1=20, a2=30, a3=40, a4=50
1st cmp    2nd cmp    3rd cmp    4th cmp    Inform.  Probab.  Simil.  Reliab.
a1 ≥ x1    a2 ≥ x2    a3 ≥ x3    a4 ≥ x4    0.650    0.0024   0.033   0.933
a1 < x1    a2 ≥ x2    a3 ≥ x3    a4 ≥ x4    0.825    0.0011   0.086   0.913
a1 ≥ x1    a2 < x2    a3 ≥ x3    a4 ≥ x4    0.850    0.0008   0.236   0.045
a1 < x1    a2 < x2    a3 ≥ x3    a4 ≥ x4    0.850    0.0010   0.091   0.579
a1 ≥ x1    a2 ≥ x2    a3 < x3    a4 ≥ x4    0.850    0.0006   0.602   0.820
a1 < x1    a2 ≥ x2    a3 < x3    a4 ≥ x4    0.900    0.0002   0.337   0.807
a1 ≥ x1    a2 < x2    a3 < x3    a4 ≥ x4    0.850    0.0009   0.396   0.872
a1 < x1    a2 < x2    a3 < x3    a4 ≥ x4    0.825    0.0017   0.149   0.201
a1 ≥ x1    a2 ≥ x2    a3 ≥ x3    a4 < x4    0.650    0.0024   0.431   0.866
a1 < x1    a2 ≥ x2    a3 ≥ x3    a4 < x4    0.775    0.0019   0.255   0.873
a1 ≥ x1    a2 < x2    a3 ≥ x3    a4 < x4    0.775    0.0020   0.520   0.910
a1 < x1    a2 < x2    a3 ≥ x3    a4 < x4    0.775    0.0025   0.119   0.841
a1 ≥ x1    a2 ≥ x2    a3 < x3    a4 < x4    0.600    0.0300   0.859   0.876
a1 < x1    a2 ≥ x2    a3 < x3    a4 < x4    0.675    0.0058   0.379   0.779
a1 ≥ x1    a2 < x2    a3 < x3    a4 < x4    0.525    0.0857   0.512   0.858
a1 < x1    a2 < x2    a3 < x3    a4 < x4    0.350    0.5000   0.023   0.939
In Table 1 we observe that the three combinations of comparison results (rows) with the greatest probability present a highly reliable similarity rate (reliability greater than 0.85) and transmit less information than the others. In fact, the row with the greatest probability (0.5) also has the most reliable similarity rate and transmits less information than all the other possibilities.
Table 2. Information transmitted, probability, similarity rate and its reliability given a fuzzy criteria A defined by: a1=70, a2=75, a3=85, a4=90

1st cmp    2nd cmp    3rd cmp    4th cmp    Infor.   Prob.    Simil.  Reliab.
a1 ≥ x1    a2 ≥ x2    a3 ≥ x3    a4 ≥ x4    0.2000   0.4016   0.012   0.952
a1 < x1    a2 ≥ x2    a3 ≥ x3    a4 ≥ x4    0.8875   0.0027   0.118   0.507
a1 ≥ x1    a2 < x2    a3 ≥ x3    a4 ≥ x4    0.7375   0.0168   0.186   0.300
a1 < x1    a2 < x2    a3 ≥ x3    a4 ≥ x4    0.8750   0.0120   0.028   0.956
a1 ≥ x1    a2 ≥ x2    a3 < x3    a4 ≥ x4    0.6125   0.0087   0.612   0.835
a1 < x1    a2 ≥ x2    a3 < x3    a4 ≥ x4    0.9500   0.0004   0.340   0.847
a1 ≥ x1    a2 < x2    a3 < x3    a4 ≥ x4    0.7625   0.0070   0.472   0.869
a1 < x1    a2 < x2    a3 < x3    a4 ≥ x4    0.8875   0.0066   0.075   0.615
a1 ≥ x1    a2 ≥ x2    a3 ≥ x3    a4 < x4    0.4000   0.0446   0.322   0.850
a1 < x1    a2 ≥ x2    a3 ≥ x3    a4 < x4    0.9125   0.0013   0.239   0.887
a1 ≥ x1    a2 < x2    a3 ≥ x3    a4 < x4    0.7500   0.0112   0.511   0.878
a1 < x1    a2 < x2    a3 ≥ x3    a4 < x4    0.8875   0.0080   0.161   0.276
a1 ≥ x1    a2 ≥ x2    a3 < x3    a4 < x4    0.5750   0.0525   0.921   0.896
a1 < x1    a2 ≥ x2    a3 < x3    a4 < x4    0.9125   0.0111   0.384   0.833
a1 ≥ x1    a2 < x2    a3 < x3    a4 < x4    0.7000   0.0700   0.583   0.809
a1 < x1    a2 < x2    a3 < x3    a4 < x4    0.8000   0.1000   0.017   0.953
From the data in Table 2, we can conclude that three out of the five combinations of comparison results (rows) with the greatest probability present a highly reliable similarity rate (reliability greater than 0.85) and transmit little information. In fact, the row with the greatest probability (more than 0.4) is also among the most reliable and transmits the least information of all the possibilities. The observed correlation was also tested with several other fuzzy criteria A, so we can assert that this method of matching fuzzy criteria has a reasonably high probability of producing a good estimate of similarity while the information transmitted is minimized. As with many other solutions in the information security area, the protection is not perfect, but it may be considered an adequate approach. In spite of this, the contribution of this work faces a severe handicap when applied in real-time scenarios, since secure computations, such as the millionaires problem, have strong temporal requirements.
5
Conclusions
In this paper we have argued that fuzzy sets are an appropriate formalism to represent subjective evaluation criteria. We have then considered how to protect these criteria while deciding whether another agent's criteria are so different that a trust relationship cannot be established. The matching of the fuzzy sets is estimated in an approximate way by applying classical two-party secure distributed computations to the four corners that define a trapezoidal (piecewise-linear) fuzzy set. Finally, a brief analysis of the reliability of the similarity, its probability, and the level of disclosure has been presented through several illustrative examples.
References
[1] Foner, L.: A multi-agent referral-based match-making system. In Proc. 1st Int. Conf. on Autonomous Agents (1997) 301-307
[2] Cramer, R., Damgård, I., Schoenmakers, B.: Proofs of Partial Knowledge and Simplified Design of Witness Hiding Protocols. Lecture Notes in Computer Science 839 (1994) 174-187
[3] Kilian, J., Petrank, E.: Identity Escrow. Lecture Notes in Computer Science 1462 (1998) 169-185
[4] Chaum, D.: Blind signatures for untraceable payments. Advances in Cryptology: Proceedings of CRYPTO'82, Plenum Press, New York (1983) 199-203
[5] Stadler, M., Piveteau, J.M., Camenisch, J.: Fair blind signatures. Lecture Notes in Computer Science 921 (1995) 209-219
[6] Friedman, E., Resnick, P.: The Social Cost of Cheap Pseudonyms. Proc. Telecommunications Policy Research Conference (1998)
[7] Yao, A.: Protocols for Secure Computations. Proc. 23rd IEEE Symposium on Foundations of Computer Science (1982) 160-164
[8] Baldwin, R.W., Gramlich, W.C.: Cryptographic protocol for trustable matchmaking. In Proceedings of the 1985 IEEE Symposium on Security and Privacy (1985) 92-100
[9] Huberman, B., Franklin, M., Hogg, T.: Enhancing Privacy and Trust in Electronic Communities. Proc. 10th ACM Conference on Computers, Freedom & Privacy (2000)
[10] Meadows, C.: A more efficient cryptographic matchmaking protocol for use in the absence of a continuously available third party. Proceedings of the 1986 IEEE Symposium on Security and Privacy (1986)
[11] Carbo, J., Molina, J.M., Davila, J.: Trust Management through Fuzzy Reputation. Int. Journal on Cooperative Information Systems, Vol. 12, Num. 1 (2003) 135-155
[12] Carbo, J., Molina, J.M., Davila, J.: Privacy of Trust in Similarity Estimation through Secure Computations. International Workshop on Trust and Privacy in Digital Business, in Procs. of 2002 Int. Conf. on Database and Expert Systems Application. IEEE Computer Society (Aix-en-Provence, 2002) 498-505
Finding Efficient Nonlinear Functions by Means of Genetic Programming Julio César Hernández Castro 1, Pedro Isasi Viñuela 2, and Cristóbal Luque del Arco-Calderón 2 Computer Security Group, Computer Science Department, 28911 Leganés, Madrid, Spain [email protected] 2 Artificial Intelligence Group, Computer Science Department, 28911 Leganés, Madrid, Spain {isasi,cluque}@ia.uc3m.es 1
Abstract. The design of highly nonlinear functions is relevant for a number of different applications, ranging from database hashing to message authentication. But, apart from being useful, it is quite a challenging task. In this work, we propose the use of genetic programming for finding functions that optimize a particular nonlinearity criterion, the avalanche effect, using only very efficient operations, so that the resulting functions are extremely efficient both in hardware and in software.
1
Introduction
The design of highly nonlinear functions is useful for a number of different applications, from the construction of hash functions for large databases to the development of different cryptographic functions (pseudorandom functions, message authentication codes, block ciphers, etc.). For many of these applications, a desirable and conflicting property is efficiency. It is not difficult to design highly nonlinear functions, nor very efficient functions, but finding algorithms with both properties is quite a challenging task.
1.1
Compression Functions
A function F : Z_2^m → Z_2^n is said to be a compression function if m > n. Compression functions are extensively used in diverse cryptographic primitives, such as pseudorandom number generators or hash functions (used for digital signatures and message authentication). The main advantage of this approach is that it allows the simple processing of variable-length input, following the recurrence relation:
M = M_0 M_1 ... M_n
h_0 = IV
h_{i+1} = F(M_i, h_i)
h(M) = h_{n+1}

This scheme is very versatile, as it can be found, with some minor differences, in block cipher chaining and cryptographic hash functions, and it can also be used to produce pseudorandom numbers. We will focus our search on a quite general kind of compression function, where m = 2n.
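The following Python sketch shows the chaining scheme for m = 2n = 64; the compression function F used here is only a placeholder (an arbitrary mix of rotation, XOR and multiplication), not one of the functions evolved in this paper.

# A sketch of the chaining recurrence above with 32-bit chaining values.
MASK = 0xFFFFFFFF

def rotl(x, r):
    return ((x << r) | (x >> (32 - r))) & MASK

def F(m_i, h_i):
    # Toy compression function Z_2^64 -> Z_2^32 (illustrative only).
    return (rotl(m_i ^ h_i, 7) + (m_i * 0x9E3779B9 & MASK)) & MASK

def chain(blocks, iv=0):
    h = iv
    for m in blocks:          # h_{i+1} = F(M_i, h_i)
        h = F(m, h)
    return h                  # h(M) = h_{n+1}

print(hex(chain([0x12345678, 0x9ABCDEF0, 0xCAFEBABE])))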
2
The Avalanche Effect
Nonlinearity can be measured in a number of ways or, what amounts to the same thing, it has no unique and completely satisfactory definition. In this work we focus our attention on a particular property, named the avalanche effect because it tries to reflect, to some extent, the intuitive idea of high nonlinearity: a very small difference in the input produces a large change in the output, thus an avalanche of changes. Mathematically:
∀ x, y with H(x, y) = 1:   H(F(x), F(y)) ≈ n/2
So if F is to have the avalanche effect [1], the Hamming distance between the outputs of a random input vector and one generated by randomly flipping one of the bits should be, on average, n/2. That is, a minimum input change (one single bit) produces a maximum output change (half of the bits). This definition also tries to abstract the more general concept of independence of the output from the input. Although it is clear that this is impossible to achieve (a given input vector always produces the same output) the ideal F will resemble a perfect random function where inputs and outputs are statistically unrelated. Any such F would have perfect avalanche effect, so it is natural to try to obtain such functions by optimizing the amount of avalanche.
3
Genetic Programming
Genetic Programming [2] is a method for automatically creating working computer programs from a high-level statement of a given problem. This is achieved by breeding a population of computer programs using the principles of Darwinian natural selection and other biologically inspired operations that include reproduction, sexual recombination (crossover), mutation, and possibly others. Starting from an initial population of randomly created programs derived from a given set of functions and terminals, populations gradually evolve, giving birth to new, fitter individuals.
This is performed by repeating the cycle of fitness evaluation, Darwinian selection and genetic operations until a certain ending condition is met. Each individual (a program in the population) is evaluated to determine how fit it is at solving a given problem, and programs are then selected probabilistically from the population, according to their fitness values, for application of the remaining genetic operators. It is important to note that, while fitter programs have higher probabilities of being selected, all programs have a chance. After some generations, a program may emerge that solves, completely or approximately, the problem at hand. Genetic Programming combines the expressive high-level symbolic representations of computer programs with the learning efficiency of genetic algorithms. Genetic Programming techniques have been successfully applied to a number of different problems: apart from classical problems such as function fitting or pattern recognition, where other evolutionary computation techniques also work well, they have even produced results that are competitive with humans in some non-trivial tasks, such as designing electrical circuits [3] (some of which have been patented) or classifying protein segments [4].
4
Implementation Issues
We have used the lilgp genetic programming system, available at http://garage.cps.msu.edu/software/lil-gp/lilgp-index.html, but a number of modifications were needed for our problem. First, we need to define the set of functions. This is critical for our problem, as they are the building blocks of the algorithms we obtain. Since efficiency is one of the paramount objectives of our approach, it is natural to restrict the set of functions to include only very efficient operations; so the inclusion of the basic binary operations rotd (right rotation), roti (left rotation), xor (addition mod 2), or (bitwise or), not (bitwise not), and and (bitwise and) is an obvious first step. Other operators such as sum (addition mod 2^32) are necessary in order to avoid linearity, while remaining quite efficient. Another interesting operator introduced was kte, an operation that, whatever its input, returns the 32-bit constant value 0x9e377969 (the most significant digits of the expression of the golden ratio in hexadecimal notation). The idea behind this operator was to provide a constant value that, independently of the input, could be used by the aforementioned operators to increase nonlinearity, an idea suggested by [5]. The inclusion of the mult (multiplication mod 2^32) operator was not so easy to decide, because, depending on the particular implementation, the multiplication of two 32-bit values can cost up to fifty times more than an xor or and operation, so it is relatively inefficient, at least compared with the other operators used. In fact, we did not include it at first, but after extensive experimentation we concluded that its inclusion was beneficial because, apart from improving nonlinearity, it at least doubled and sometimes tripled the amount of avalanche we were trying to maximize, so we finally included it in the function set. The set of terminals in our case is easy to establish; as said above, we focus our attention on compression functions where m = 2n, so the input will be formed by
two 32-bit integers a0 and a1; these are the leaves of the function trees that the genetic programming algorithm constructs, with functions from the function set in the internal nodes. The fitness of every individual (algorithm or function) was evaluated by generating 1024 64-bit random vectors, randomly flipping one bit of each, and calculating the Hamming distance between the corresponding outputs. For each of these 1024 experiments a Hamming distance between 0 and 32 was obtained, and the fitness of the given function was the observed average Hamming distance. Although this is a quite natural way of measuring the avalanche effect as defined in the introduction, some additional explanation is needed. Obviously, the ideal observed value is 32/2 = 16, so a more natural fitness function would be not simply this average but its deviation from the optimum value of 16, that is |16 − average|. Nevertheless, after some experiments we observed that, at least for the depths studied in this work, the fitness of the resulting functions stayed well below the optimum value of 16, so simply using the average number of changes as the fitness function to maximize worked perfectly well. When using genetic programming approaches, it is necessary to put some limits on the depth and on the number of nodes the resulting trees may have. We preferred to vary only the depth and not to put any limitation (apart from the depth itself) on the number of nodes used. We selected a population size of 100 individuals, a crossover probability of 0.8, which produced better results than the default probability of 0.9, and an ending condition of reaching 500 generations. In many cases, ten different runs were performed, each one seeded with the 6 most significant digits of the expression (314159)i. When only one run was possible, the seed was 314159.
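A Python sketch of this fitness evaluation is given below; the candidate function passed in is an arbitrary placeholder, and the seed simply reuses the constant quoted above.

# A sketch of the fitness evaluation: 1024 random 64-bit inputs (a0, a1),
# one random bit flipped in each, and the average Hamming distance of the
# two 32-bit outputs taken as the avalanche fitness.
import random

MASK = 0xFFFFFFFF

def avalanche_fitness(func, trials=1024, seed=314159):
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        a0, a1 = rng.getrandbits(32), rng.getrandbits(32)
        bit = rng.randrange(64)
        if bit < 32:
            b0, b1 = a0 ^ (1 << bit), a1
        else:
            b0, b1 = a0, a1 ^ (1 << (bit - 32))
        diff = func(a0, a1) ^ func(b0, b1)
        total += bin(diff).count("1")     # Hamming distance of the outputs
    return total / trials                 # ideal value: 32 / 2 = 16

# Any candidate compression function Z_2^64 -> Z_2^32 can be plugged in;
# a purely linear function such as xor scores only 1.0.
print(avalanche_fitness(lambda a0, a1: (a0 ^ a1) & MASK))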
5
Results
Selected functions are shown in the Appendix. Table 1 below summarizes some of the results. Basically, Table 1 shows that, by increasing the allowed depth of the function trees, better avalanche effects can be obtained. This is intuitively clear, but not trivial to achieve. The idea that simply adding more and more complexity (more branches with functions and terminals) to any tree will increase the avalanche effect of the function can be easily dismissed by a simple inspection of the first two individuals shown in the Appendix. After all the experiments, we can conclude that we have found one super-individual, shown in the Appendix, that with a depth of only 5 and a small number of nodes is capable of competing in terms of avalanche with much more complex rivals. Furthermore, this highly fit function has many points in common with some of the fittest individuals at every depth (as shown at the end of the Appendix), so, in a way, it also reveals a promising new method of construction, or a new kind of function design. There are other very fit functions with great similarity to the aforementioned one, but we cannot include them in the Appendix due to lack of space.
Table 1. Avalanche effect as a function of depth. Averages exclude the maximal and minimal values. For some depths there is only one result

Depth   Maximum   Average   Minimum
2       -         8.8696    -
3       9.4424    9.0725    8.8174
4       10.1396   9.5619    9.1152
5       11.2148   9.9919    9.6533
6       11.1479   10.2296   9.6621
7       11.3232   10.7037   10.3145
8       -         11.5200   -
9       -         10.9482   -
10      -         11.2310   -
11      -         11.3513   -
12      -         11.8284   -
13      -         11.5164   -
14      -         10.9707   -
15      -         12.4451   -
16      -         11.6604   -

6
Conclusions
The average avalanche effect in Table 1 steadily increases with depth, so a natural question is the minimum depth at which an algorithm exists that reaches the perfect avalanche value of 16. To try to estimate this, we performed a series of additional searches at deeper depths (but only one run each, as they are much more time consuming) until reaching depth 16. Given the slow increase in the avalanche levels observed, our current best guess is that this could perhaps be achieved at depth 20. This naturally raises some questions: for example, is it better to dig into such deep trees, with the associated combinatorial explosion (for example in the number of nodes used), or is it more sensible to search for short, faster functions and then apply them twice? Which would be better, the best function we can find with this approach at depth 12, or the best function found at depth 6, with the same amount of resources, applied twice? Many other questions remain. One of the more interesting ones is whether it would be possible to assign to every function in the function set a weight or some other measure of its complexity or efficiency (for example the number of instructions needed at the processor level, or microseconds taken to execute) and let the genetic programming technique itself decide whether it is worth using that particular function, in terms of the avalanche effect introduced for a given cost. This and other questions and extensions of this work are currently under consideration.
Acknowledgements Supported by the Spanish Ministerio de Ciencia y Tecnologia research project TIC2002-04498-C05-4
References
[1] Feistel, H.: Cryptography and Computer Privacy. Scientific American, 228 (5) 15-23, 1973
[2] Koza, J.: Genetic Programming. In: Encyclopedia of Computer Science and Technology, v.39, 29-43, 1998
[3] Koza, J., et al.: Automated Synthesis of analog electrical circuits by means of genetic programming. In IEEE Transactions on Evolutionary Computation 1(2), 1997
[4] Koza, J., Andre, D.: Automatic discovery of protein motifs using genetic programming. In Evolutionary Computation: Theory and Applications. World Scientific Publications, 1996
[5] Wheeler, D., Needham, R.: TEA, a Tiny Encryption Algorithm. In: Proceedings of the 1994 Fast Software Encryption Workshop, and at http://www.ftp.cl.cam.ac.uk/ftp/papers/djw-rmn/djw-rmn-tea.html
Appendix: Experimental Results Depth 5 Run #4 fitness: 11.2148 TREE: (mult (kte (rotd a0)) (rotd (sum (roti (xor a0 a1)) (xor a0 a1)))) Depth 14: fitness: 10.9707 TREE: (xor (sum (xor (rotd (mult (and a0 (mult (mult (mult a1 (mult a1 a1)) (rotd a0)) (mult a0 (mult a1 a0)))) (rotd (rotd (xor a1 (rotd (xor a1 (rotd (sum a0 a1))))))))) (xor (xor a0 a1) (sum a0 (rotd (sum a0 a1)))))
(sum (xor (rotd (roti a1)) a0) (sum a0 a1))) (xor (sum (xor (sum (mult (mult a1 a1) (sum (sum a0 a0) (xor a1 (xor a0 (sum a0 (roti a1)))))) (xor a0 (sum a0 a1))) (xor a0 a1)) (not a0)) (mult (sum a0 a1) (mult a0 (mult a1 a0))))) =========================================================== Run #4 fitness: 11.0957 TREE: (mult (kte (xor (roti (roti a1)) a1)) (rotd (sum (sum (rotd a0) (rotd a1)) (roti (xor a0 a1))))) DepthMax=7 ========== Run #4 fitness: 11.3232 TREE: (mult (kte (rotd a0)) (rotd (sum (rotd (sum (sum a0 a0) (roti (rotd a1)))) (rotd (sum (sum (rotd a0) (rotd a1)) (rotd a0))))))
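For illustration, the depth-5 individual above can be written out directly with 32-bit operations, as in the following Python sketch; the rotation distance of rotd/roti is not specified in the paper and is assumed here to be a single bit, and kte returns the constant quoted in Section 4.

# The depth-5 individual above, written out with 32-bit operators.
MASK = 0xFFFFFFFF

def rotd(x):                      # right rotation by one bit (assumed)
    return ((x >> 1) | (x << 31)) & MASK

def roti(x):                      # left rotation by one bit (assumed)
    return ((x << 1) | (x >> 31)) & MASK

def kte(_):
    return 0x9E377969             # constant operator from Section 4

def depth5(a0, a1):
    # (mult (kte (rotd a0)) (rotd (sum (roti (xor a0 a1)) (xor a0 a1))))
    x = a0 ^ a1
    return (kte(rotd(a0)) * rotd((roti(x) + x) & MASK)) & MASK

print(hex(depth5(0x01234567, 0x89ABCDEF)))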
Keystream Generator Analysis in Terms of Cellular Automata

Amparo Fúster-Sabater 1 and Dolores de la Guía-Martínez 2

1 Instituto de Física Aplicada, C.S.I.C., C/ Serrano 144, 28006 Madrid, Spain
[email protected]
2 Centro Técnico de Informática, C.S.I.C., C/ Pinar 19, 28006 Madrid, Spain
[email protected]
Abstract. The equivalence between the binary sequences obtained from a particular kind of keystream generator, the shrinking generator, and those ones generated from cellular automata has been studied. The class of shrinking generators can be identified with a subset of one-dimensional linear hybrid cellular automata with rules 90 and 150. A cryptanalytic approach based on the phaseshift of cellular automata output sequences is proposed. From the obtained results, we can create linear cellular automata-based models to cryptanalyze the class of keystream generators. Keywords: shrinking generator, cellular automata, cryptography
1
Introduction
Cellular Automata (CA) are discrete dynamic systems characterized by a simple structure but a complex behavior (see [1] and [2]). In this sense, CA are studied in order to obtain a characterization of the rules (mappings to the next state) producing sequences with maximal length, balancedness and good distribution of 1's and 0's. These sequences can be obtained from a given subset of specific rules or combinations of such rules (for example, combinations of rules 90 and 150), see [3]. From a cryptographic point of view, it is fundamental to analyze some additional characteristics of these generators, such as linear complexity or correlation immunity. The results of this study point toward the equivalence between the sequences generated by CA and those obtained from Linear Feedback Shift Register-based models [4]. In this paper, the hybrid CA configurations constructed from combinations of rules 90 and 150 are considered. In fact, a linear model that describes the behavior of a kind of pseudorandom sequence generator, the so-called shrinking generator (SG), has been derived. In this way, the sequences generated by the SG can be studied in terms of CA. Thus, all the theoretical background on CA found in the literature can be applied to the analysis and/or cryptanalysis of shrinking generators.
2
Basic Structures
In the following subsections, we introduce the general characteristics of the two basic structures considered in the paper: Shrinking Generators and Cellular Automata. 2.1
The Shrinking Generator
It is a very simple generator with good cryptographic properties [5]. This generator is composed of two LFSRs: a control register, called R1, that decimates the sequence produced by the other register, called R2. The sequence {ai} produced by R1 controls which bits of the sequence {bi} produced by R2 are included in the output sequence {cj} (the shrunken sequence), according to the following rule:
1. If ai = 1 then cj = bi
2. If ai = 0 then bi is discarded.
According to [5], the period of the shrunken sequence is T = (2^L2 − 1) 2^(L1−1) and its linear complexity, notated LC, satisfies the inequality L2 2^(L1−2) < LC ≤ L2 2^(L1−1). A simple calculation, based on the fact that every state of R2 coincides once with every state of R1, allows one to compute the number of 1's in the shrunken sequence. Such a number is constant and equal to No. of 1's = 2^(L2−1) 2^(L1−1). Thus, the shrunken sequence is a quasi-balanced sequence. Since simplicity is one of its most remarkable characteristics, this scheme is suitable for the practical implementation of efficient stream cipher cryptosystems.
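A minimal Python sketch of the scheme follows. The LFSRs are implemented in Fibonacci form; the tap set of R2 is chosen so that, under the convention used here, it realizes the polynomial P2(D) = 1 + D + D^3 + D^4 + D^5 of Example 1 below, while the initial states are arbitrary.

# A minimal sketch of the shrinking generator: the control register R1
# decimates the sequence of R2 (the output keeps b_i only when a_i = 1).
def lfsr(state, taps):
    """Fibonacci LFSR: yields the leading bit, then shifts in the XOR of
    the tapped stages (stage indices count from the output end)."""
    state = list(state)
    while True:
        out = state[0]
        fb = 0
        for t in taps:
            fb ^= state[t]
        state = state[1:] + [fb]
        yield out

def shrinking_generator(r1, r2, n):
    shrunken = []
    while len(shrunken) < n:
        a, b = next(r1), next(r2)
        if a == 1:          # rule: keep b_i only when a_i = 1
            shrunken.append(b)
    return shrunken

r1 = lfsr([1, 0], taps=[0, 1])                 # length-2 control register R1
r2 = lfsr([1, 0, 0, 1, 0], taps=[0, 1, 3, 4])  # length-5 register R2
print(shrinking_generator(r1, r2, 20))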
2.2
Cellular Automata
One-dimensional cellular automata can be described as n-cell registers whose (binary) cells are updated simultaneously according to a k-variable function [3], also called a rule. If k = 2r + 1 input variables are considered, then there is a total of 2^k different neighbor configurations. Therefore, for cellular automata with binary cell contents there can be up to 2^(2^k) different mappings to the next state. Such mappings are the different rules Φ. In fact, the next state x_i^(t+1) of the cell x_i depends on the current state of k neighbor cells: x_i^(t+1) = Φ(x_(i−r)^t, ..., x_i^t, ..., x_(i+r)^t). If these functions are composed exclusively of XOR and/or XNOR operations, then the cellular automata are said to be additive. Thus, the next CA state (x_1^(t+1), ..., x_n^(t+1)) can be computed from the current state (x_1^t, ..., x_n^t) as follows: (x_1^(t+1), ..., x_n^(t+1)) = (x_1^t, ..., x_n^t) · A + C, where A is an n × n matrix with binary coefficients and C is the complementary vector. CA are called uniform (hybrid) if all cells evolve under the same rule (under different rules). At the ends of the array, two different boundary conditions are possible: null automata, when cells with permanent null content are supposed adjacent to the extreme cells, or periodic automata, when the extreme cells are supposed adjacent. In this paper, all automata considered will be null hybrid CA with rules 90 and 150. For k = 3, these rules are described as follows:
rule 90:  x_i^(t+1) = x_(i−1)^t + x_(i+1)^t
111 110 101 100 011 010 001 000
 0   1   0   1   1   0   1   0
01011010 (binary) = 90 (decimal)

rule 150: x_i^(t+1) = x_(i−1)^t + x_i^t + x_(i+1)^t
111 110 101 100 011 010 001 000
 1   0   0   1   0   1   1   0
10010110 (binary) = 150 (decimal)
The main idea of this work is to write a given SG in terms of hybrid cellular automata where one of their output sequences equals the SG output sequence.
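As an illustration, one synchronous update of such a null hybrid 90/150 automaton can be sketched in Python as follows; the automaton 0111001110 and the initial state used here are those obtained in Example 1 and Table 1 below, and the rightmost cell then reproduces the shrunken sequence.

# One synchronous update of a null one-dimensional hybrid CA, with the rule
# of each cell given as a bit string ('0' = rule 90, '1' = rule 150).
def ca_step(state, rules):
    n = len(state)
    nxt = []
    for i, r in enumerate(rules):
        left = state[i - 1] if i > 0 else 0          # null boundary
        right = state[i + 1] if i < n - 1 else 0     # null boundary
        centre = state[i] if r == '1' else 0         # rule 150 adds the cell itself
        nxt.append(left ^ centre ^ right)
    return nxt

def ca_run(state, rules, steps):
    seq = [list(state)]
    for _ in range(steps):
        state = ca_step(state, rules)
        seq.append(state)
    return seq

rules = "0111001110"                                  # automaton of Example 1
evolution = ca_run([0, 0, 0, 1, 1, 1, 0, 1, 1, 0], rules, 9)
print([row[-1] for row in evolution])                 # rightmost cell: shrunken sequence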
3
A Linear Model for the Shrinking Generator
In this section, an algorithm to determine the one-dimensional linear hybrid CA corresponding to a particular shrinking generator is presented. The algorithm is based on the following facts:
Fact 1: The characteristic polynomial of the shrunken sequence [5] is of the form P(D)^N, where P(D) is an L2-degree primitive polynomial and N satisfies the inequality 2^(L1−2) < N ≤ 2^(L1−1).
Fact 2: P(D) depends exclusively on the characteristic polynomial of the register R2 and on the length L1 of the register R1. Moreover, P(D) is the characteristic polynomial of the cyclotomic coset 2^L1 − 1, see [4].
Fact 3: Rule 90 (150) at the end of the array in null automata is equivalent to two consecutive rules 150 (90) with identical sequences.
According to the previous facts, the following algorithm is introduced:
Input: Two LFSRs R1 and R2 with their corresponding lengths, L1 and L2, and the characteristic polynomial P2(D) of the register R2.
Step 1: From L1 and P2(D), compute the polynomial P(D). In fact, P(D) is the characteristic polynomial of the cyclotomic coset E, where E = 2^0 + 2^1 + ... + 2^(L1−1). Thus,
P(D) = (D + α^E)(D + α^(2E)) · · · (D + α^(2^(L1−1) E))
α being a primitive root in GF(2^L2) as well as a root of P2(D).
Step 2: From P(D), apply the Cattell and Muzio synthesis algorithm [6] to determine the two linear hybrid CA whose characteristic polynomial is P(D). Such CA are written as binary strings with the following codification: 0 = rule 90 and 1 = rule 150.
Step 3: For each of the previous binary strings representing the CA, we proceed:
3.1 Complement its least significant bit. The resulting binary string is notated S.
3.2 Compute the mirror image of S, notated S*, and concatenate both strings, Sc = S S*.
3.3 Apply steps 3.1 and 3.2 to Sc recursively, L1 − 1 times.
Output: Two binary strings codifying the CA corresponding to the given SG.
Remark that the characteristic polynomial of the register R1 is not needed. It can be noticed that the computation of the CA is proportional to L1 instead of 2^L1. Consequently, the algorithm can be applied to SG in a range of cryptographic interest (e.g. L1, L2 ≈ 64). In order to clarify the previous steps, a simple numerical example is presented.
Example 1: Consider the following LFSRs: R1 with length L1 = 2, and R2 with length L2 = 5 and characteristic polynomial P2(D) = 1 + D + D^3 + D^4 + D^5.
Step 1: P(D) is the characteristic polynomial of the cyclotomic coset 3. Thus, P(D) = 1 + D^2 + D^5.
Step 2: From P(D) and applying the Cattell and Muzio synthesis algorithm, two linear hybrid CA whose characteristic polynomials are both P(D) can be determined. Such CA are written as: 01111 and 11110.
Step 3: The two binary strings of length L = 10 representing the required CA are: 0111001110 and 1111111111, with the codification mentioned above. The procedure has been carried out once, as L1 − 1 = 1.
From L = 10 known bits of the shrunken sequence {cj}, the whole sequence, whose period is T = 62, can be easily reconstructed. In fact, let {cj} be of the form {cj} = {0 1 0 1 1 0 1 0 0 1 ...}; then the initial state of the cellular automaton can be computed from right to left (or vice versa), according to the corresponding rules 90 and 150. Table 1 depicts the computation of the initial state for the first automaton when the shrunken sequence is placed in the rightmost column. Once the corresponding initial state is known, the cellular automaton will produce its corresponding output sequences and the shrunken sequence is univocally determined.
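Step 3 of the algorithm can be sketched in Python as follows; applied to the two length-5 strings of Example 1 it reproduces 0111001110 and 1111111111.

# Step 3: complement the last rule of the string, append the mirror image,
# and repeat L1 - 1 times.
def expand(ca_string, l1):
    s = ca_string
    for _ in range(l1 - 1):
        s = s[:-1] + ('1' if s[-1] == '0' else '0')   # complement last rule
        s = s + s[::-1]                               # concatenate mirror image
    return s

print(expand("01111", 2), expand("11110", 2))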
Table 1. CA: 0111001110; the shrunken sequence is in the rightmost column

90 150 150 150  90  90 150 150 150  90
 0   0   0   1   1   1   0   1   1   0
     0   1   0   0   1   0   0   0   1
         1   1   1   0   1   0   1   0
             1   1   0   1   0   1   1
                 1   0   1   0   0   1
                     0   1   1   1   0
                         0   1   0   1
                             1   0   0
                                 1   0
                                     1

4
A Cryptanalytic Approach to the Shrinking Generator
Since a linear model describing the behavior of the SG has been derived, a new cryptanalytic attack that exploits the weaknesses inherent in the CA-based linear model has also been developed. The key idea of this attack is the study of the repeated sequences in the automata under consideration and the relative shifts among such sequences. The analysis consists of several steps:
– First, the portion of intercepted shrunken sequence is placed in the rightmost (leftmost) column of the automaton.
– Then, the locations of the automaton cells that generate the same sequence are detected.
– Finally, the values of the relative shifts among sequences are computed.
Once the previous steps are accomplished, the original shrunken sequence can be reconstructed by concatenating the different shifted subsequences. In order to carry out this reconstruction, Bardell's algorithm [7] for phaseshift analysis of CA is applied. Let us see the procedure with a simple example.
Example 2: Consider a CA with the following characteristics:
– Automaton length L = 10
– Automaton under study: 0011001100
– P(D) = (1 + D + D^3 + D^4 + D^5)^2.
Let S be the shift operator defined on Xi (i = 1, ..., 10), the state of the i-th cell, as follows: SXi(t) = Xi(t + 1). Thus, the corresponding difference equation system for this automaton can be written as follows:
SX1 = X2                      SX6 = X5 + X7
SX2 = X1 + X3                 SX7 = X6 + X7 + X8
SX3 = X2 + X3 + X4            SX8 = X7 + X8 + X9
SX4 = X3 + X4 + X5            SX9 = X8 + X10
SX5 = X4 + X6                 SX10 = X9
The symmetry of the previous system must be noticed. Next, expressing each Xi as a function of X10, we obtain the following system:

X1 = (S^9 + S^4 + S^3 + S^2 + S + 1) X10
X2 = (S^8 + S^6 + S^5 + S^4 + S^3 + S + 1) X10
X3 = (S^7 + S^6 + S^5 + S^3 + 1) X10
X4 = (S^6) X10
X5 = (S^5 + S^3 + 1) X10
X6 = (S^4 + S) X10
X7 = (S^3 + S^2 + 1) X10
X8 = (S^2 + 1) X10
X9 = (S) X10
Analogous results can be obtained by expressing each Xi as a function of X1. Now, taking logarithms on both sides of the equalities:

log(X1) = log(S^9 + S^4 + S^3 + S^2 + S + 1) + log(X10)
log(X2) = log(S^8 + S^6 + S^5 + S^4 + S^3 + S + 1) + log(X10)
log(X3) = log(S^7 + S^6 + S^5 + S^3 + 1) + log(X10)
log(X4) = log(S^6) + log(X10)
log(X5) = log(S^5 + S^3 + 1) + log(X10)
log(X6) = log(S^4 + S) + log(X10)
log(X7) = log(S^3 + S^2 + 1) + log(X10)
log(X8) = log(S^2 + 1) + log(X10)
log(X9) = log(S) + log(X10)

The base of the logarithm is P(S), and the values of the logarithms are integers over a finite domain (the cycle length of P(S), see [7]). According to Bardell's algorithm, we determine the integers m (if they exist) such that S^m mod P(S) equals each of the polynomials in S included in the above system. For instance, S^26 mod P(S) = S^2 + 1, or simply S^26 = S^2 + 1 and 26 log(S) = log(S^2 + 1), with log(S) ≡ 1. Now, substituting into the previous system, the following equations can be derived:
log(X9) − log(X10) = 1        log(X2) − log(X1) = 1
log(X8) − log(X10) = 26       log(X3) − log(X1) = 26
log(X4) − log(X10) = 6        log(X7) − log(X1) = 6
The phaseshifts of the outputs 9, 8 and 4 relative to cell 10 are 1, 26 and 6 respectively. Similar values are obtained in the other group of cells, that is cells 2, 3 and 7 relative to cell 1. The other cells generate different sequences. Note that we have two different CA plus an additional pair of CA corresponding to the reverse version of the shrunken sequence (the pair associated to the reciprocal polynomial of P2 (D)). In each particular case, the most adequate automata or a combination of the successive automata can be used in order to reconstruct the output sequence.
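The modular computation behind these phaseshifts can be sketched in Python as follows: powers of S are reduced modulo P(S) = (1 + S + S^3 + S^4 + S^5)^2 over GF(2) until they equal the bracketed polynomial, which yields, for instance, the exponent 26 for S^2 + 1. Polynomials are encoded as integers, with bit i holding the coefficient of S^i.

# Phase-shift computation via powers of S modulo P(S) over GF(2).
def gf2_mod(a, m):
    deg_m = m.bit_length() - 1
    while a.bit_length() - 1 >= deg_m:
        a ^= m << (a.bit_length() - 1 - deg_m)
    return a

def gf2_mul(a, b, m):
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a = gf2_mod(a << 1, m)
    return gf2_mod(r, m)

def exponent_of(poly, m, limit=62):
    """Smallest e with S^e = poly (mod m), searched up to the cycle length."""
    acc = 1
    for e in range(1, limit + 1):
        acc = gf2_mul(acc, 0b10, m)      # multiply by S
        if acc == poly:
            return e
    return None

P = 0b10101000101    # (1 + S + S^3 + S^4 + S^5)^2 = 1 + S^2 + S^6 + S^8 + S^10
print(exponent_of(0b101, P))             # S^2 + 1  ->  26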
5
Conclusions
In this work, a particular family of LFSR-based keystream generators, the socalled Shrinking Generator, has been analyzed and identified with a subset of linear cellular automata. In fact, a linear model describing the behavior of the SG has been derived. The algorithm to convert the SG into a CA-based linear model is very simple and can be applied to shrinking generators in a range of practical interest. The linearity of this cellular model can be advantageously used in the analysis and/or cryptanalysis of the SG. Besides the traditional cryptanalytic attacks, an outline of a new attack that exploits the weaknesses inherent to these CA has been introduced too.
Acknowledgements Work supported by Ministerio de Ciencia y Tecnología (Spain) under grant TIC2001-0586.
References
[1] Martin, O., Odlyzko, A. M., Wolfram, S.: Algebraic Properties of Cellular Automata. Commun. Math. Phys. 93 (1984) 219-258
[2] Nandi, S., Kar, B. K., Chaudhuri, P. P.: Theory and Applications of Cellular Automata in Cryptography. IEEE Transactions on Computers 43 (1994) 1346-1357
[3] Das, A. K., Ganguly, A., Dasgupta, A., Bhawmik, S., Chaudhuri, P. P.: Efficient Characterisation of Cellular Automata. IEE Proc., Part E 1 (1990) 81-87
[4] Golomb, S.: Shift-Register Sequences (revised edition). Aegean Press (1982)
[5] Coppersmith, D., Krawczyk, H., Mansour, Y.: The Shrinking Generator. In: Advances in Cryptology - CRYPTO'93. Lecture Notes in Computer Science, Vol. 773. Springer-Verlag, Berlin Heidelberg New York (1994) 22-39
[6] Cattell, K., Muzio, J.: Analysis of One-Dimensional Linear Hybrid Cellular Automata over GF(q). IEEE Transactions on Computers 45 (1996) 782-792
[7] Bardell, P. H.: Analysis of Cellular Automata Used as Pseudorandom Pattern Generators. Proceedings of the IEEE International Test Conference, Paper 34.1 (1990) 762-768
Graphic Cryptography with Pseudorandom Bit Generators and Cellular Automata

Gonzalo Álvarez Marañón 1, Luis Hernández Encinas 1, Ascensión Hernández Encinas 2, Ángel Martín del Rey 3, and Gerardo Rodríguez Sánchez 2

1 Instituto de Física Aplicada, CSIC, C/ Serrano 144, 28006-Madrid, Spain
{gonzalo, luis}@iec.csic.es, http://www.iec.csic.es/~{gonzalo, luis}
2 ETSII, Universidad de Salamanca, Av. Fernández Ballesteros 2, 37700-Béjar, Salamanca, Spain
{ascen, gerardo}@usal.es
3 EPS, Universidad de Salamanca, C/ Sto. Tomás s/n, 05003-Ávila, Spain
[email protected]
Abstract. In this paper we propose a new graphic symmetrical cryptosystem for encrypting a colored image defined by pixels and by any number of colors. The cryptosystem is based on a reversible two-dimensional cellular automaton and uses a pseudorandom bit generator. As the key of the cryptosystem is the seed of the pseudorandom bit generator, the latter has to be cryptographically secure. Moreover, the image recovered from the ciphered image has no loss of resolution, and the ratio between the ciphered image and the original one, i.e., the expansion factor of the cryptosystem, is 1.
1
Introduction
As is well known, the goal of cryptography is to assure the secrecy and confidentiality of communications between two or more users who use an insecure channel ([9], [10]). Conversely, the goal of cryptanalysis is to break the security and privacy of these communications. Here, we are interested in a special class of messages: colored images. We propose to use cellular automata of dimension 2 and pseudorandom bit generators as a graphic symmetrical cryptosystem, that is, as a symmetrical cryptosystem to encrypt colored images. The proposed protocol begins with a message (the plaintext), uses an algorithm (based on a cellular automaton of dimension 2 and a pseudorandom bit generator), and ends with an encrypted message (the ciphertext). We wish both messages, the plaintext and the ciphertext, to belong to the set of colored images of the same size, i.e., the cipherimage will be defined by the same number of pixels as the plainimage. In
this way, several applications to watermarking, steganography and subliminal cryptography can be derived. To date, there are several proposals for using images in cryptography. Some of them are based on dynamical systems ([6], [7]), but these proposals are difficult to implement in practice due to the difference between the chaotic arithmetic defined by the dynamical systems and the discrete arithmetic of computers. Other proposals for encrypting images are based on the use of compression methods in an iterated way ([4], [5]). In these cases, the problem is that the recovered image has less definition than the original one. The foremost use of images in cryptography is made by means of visual cryptography ([12], [15]). It is based on visual t-out-of-n threshold schemes: the original image is divided into n shares, each of them is photocopied onto a transparency, and the original image is recovered by superimposing any t transparencies, but no fewer. Moreover, no cryptographic protocol is used to recover it. Nevertheless, the recovered image also has less definition than the original one. There are some applications of the previous protocols. Visual authentication and identification methods are proposed in [11]; the identification of documents and photograph signatures is presented in [1] and [13]; and visual secret sharing schemes for developing a method for intellectual property protection of grey-level images in [3]. The rest of this paper is organized as follows: cellular automata are briefly recalled in §2, and the main properties of pseudorandom bit generators are presented in §3. In §4 the proposal of a new graphic cryptosystem is presented, and finally, the conclusions are summarized in §5.
2
Cellular Automata
A 2-dimensional finite cellular automaton (CA for short), A = (L, S, V, f) ([14]), is a 4-tuple, where L is the cellular space formed by a 2-dimensional array of size r × s of identical objects, called cells. Each cell is denoted by ⟨i, j⟩, with 0 ≤ i ≤ r − 1 and 0 ≤ j ≤ s − 1. We denote by S the finite set of all possible values of the cells, which is called the state set. As the state set is finite, we take |S| = k. Let V ⊂ Z^2 be a finite ordered subset, called the set of indices of L; then for every cell ⟨i, j⟩ ∈ L, its neighborhood Vi,j is an ordered set of n cells defined as follows:

Vi,j = {⟨i + α, j + β⟩ : (α, β) ∈ V} ⊂ L.                                               (1)
Moreover, the local transition function f : S^n → S is a function which determines the evolution of the CA throughout time, i.e., the changes of the states of every cell taking the states of its neighbors into account. Finally, as the CA considered are finite, we take periodic boundary conditions of the form:

a_ij^(t) = a_kl^(t)  ⇔  i ≡ k (mod r) and j ≡ l (mod s),                                (2)
where a_ij^(t) ∈ S stands for the state of the cell ⟨i, j⟩ at time t. Hence, the cellular space can be regarded as a 2-dimensional toroidal array. The matrix

C^(t) = (a_ij^(t)),  0 ≤ i ≤ r − 1,  0 ≤ j ≤ s − 1,                                     (3)

is called the configuration of A at time t. In particular, C^(0) is called the initial configuration of the CA. Hence, the evolution of A is the sequence C^(0), C^(1), C^(2), .... A cellular automaton is reversible (RCA for short) if there exists another CA, called its inverse, that computes the inverse evolution of A.
3
Pseudorandom Bit Generators
As is well known, a pseudorandom bit generator (PRBG) is a deterministic algorithm which, given a truly random bit sequence of length l, outputs a binary sequence of length g ≫ l which appears to be random. The input to the PRBG is called the seed, whereas the output is called a pseudorandom bit sequence. Such generators are widely used in cryptography, for example to encrypt a plaintext by using a stream cipher, where the secret key is the seed. Hence, good random properties of the generator are needed to prevent statistical attacks; moreover, the generator must be cryptographically secure (a CSPRBG). Security, in this sense, means that the probability that an algorithm can produce the next bit of a given sequence in polynomial time is negligible. In particular, the BBS generator (see [2]) is defined by iterating the function x^2 (mod n) on the set of quadratic residues of integers modulo n, where n is a Blum integer, i.e., n = p · q with p and q very large prime numbers, both congruent to 3 modulo 4. That is, starting from a seed x0 and iterating x_(i+1) ≡ x_i^2 (mod n), one obtains a binary sequence b_i = parity(x_i). This PRBG is a CSPRBG under the assumption that integer factorization is intractable. The length of the orbits of the BBS generator was characterized in [8]; hence the BBS generator is a good option to be used in the graphic cryptosystem proposed below.
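A minimal Python sketch of the BBS generator follows; the primes used are tiny and serve only to illustrate the iteration, whereas a real key requires very large primes.

# BBS generator: x_{i+1} = x_i^2 mod n, emitting the parity of each x_i.
def bbs_bits(seed, p, q, count):
    assert p % 4 == 3 and q % 4 == 3      # Blum integer condition
    n = p * q
    x = (seed * seed) % n                 # start from a quadratic residue
    bits = []
    for _ in range(count):
        x = (x * x) % n
        bits.append(x & 1)                # parity(x_i)
    return bits

print(bbs_bits(seed=2029, p=499, q=547, count=16))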
4
The Graphic Cryptosystem
In this section we propose a graphic symmetrical cryptosystem in order to encrypt a colored image. This cryptosystem is based on a reversible 2-dimensional CA and uses a cryptographically secure PRBG. Let I be a colored image defined by r × s pixels P_ij, where the pixels are denoted by P_ij = (p_ij^1, p_ij^2, ..., p_ij^24), with p_ij^n ∈ Z_2, 1 ≤ i ≤ r, 1 ≤ j ≤ s, 1 ≤ n ≤ 24, and c colors, with 2 ≤ c ≤ 2^24. If P_ij is in the m-th position from the top left corner of the image, then m = (i − 1)s + j, with 1 ≤ m ≤ r × s.
4.1
Key Generation
The key of the graphic cryptosystem is the key used by the CSPRBG to generate the pseudorandom bit sequence. The security of the proposed cryptosystem is therefore guaranteed by the security of the PRBG, which is why a CSPRBG is used. Hence, to break this graphic cryptosystem it is necessary to determine the keys used by the pseudorandom bit generator. For the suggested generator (the BBS generator), this means determining the prime numbers p and q, or factorizing the modulus n = p · q; moreover, the session key k0 is also needed. To date this problem is computationally infeasible. In our practical implementation we use the BBS pseudorandom generator. Hence, the key is the couple (n, K), where the number n = p · q is the product of two large prime numbers, each of them congruent to 3 modulo 4, and K is the seed used to generate the bit sequence (the session key). The modulus n is a long-term key, whereas the seed K is a short-term key, that is, it must be changed for each session. We denote by B = (B1, B2, ..., B_(r×s)) the pseudorandom bit sequence of length r × s × 24 generated by the CSPRBG from the key K, where B_m = (b_m1, b_m2, ..., b_m24) and b_mn ∈ Z_2.
4.2
Encryption
To construct the encryption protocol, we consider the RCA A = (L, S, V, f) defined by:
(i) The cellular space, L, is a rectangular array of size r × s, i.e., L is the set of r × s pixels.
(ii) The state set is S = Z_2 × ··· × Z_2 (24 times), with |S| = 2^24; hence, each color of the image I can be identified with an element of S. We denote by x_m = (u_m1, ..., u_m24) an element of S, where u_mn ∈ Z_2, 1 ≤ n ≤ 24.
(iii) The set of indices, V, is selected in a public way, such that |V| = 24. For example, if we consider the square of size 5 × 5 around the cell ⟨i, j⟩, except the cell itself, the set of indices is
V = {(−2, −2), ..., (−2, 2), ..., (0, −1), (0, 1), ..., (2, −2), ..., (2, 2)},          (4)
and the neighborhood Vi,j can be viewed as follows:
• • • • •
• • • • •
• • ◦ • •
• • • • •
• • • • •
This set can be represented by two variables h and w, V_{i,j} = {(i + h, j + w)}, where:

h = ⌊(n − 1)/5⌋ − 2,  w = (n − 1) (mod 5) − 2,  for 1 ≤ n ≤ 12,   (5)
h = ⌊n/5⌋ − 2,        w = n (mod 5) − 2,        for 13 ≤ n ≤ 24.   (6)
(iv) To determine the encrypted pixel Q_ij of P_ij, 1 ≤ i ≤ r, 1 ≤ j ≤ s, the 24 pixels of the neighbourhood V_{i,j} of P_ij, namely P_{i−2,j−2}, ..., P_{i−2,j+2}, ..., P_{i,j−1}, P_{i,j+1}, ..., P_{i+2,j+2}, are taken, and the transition function f : S^24 → S is applied to them as follows:

P_ij → Q_ij :  f(P_{i−2,j−2}, P_{i−2,j−1}, ..., P_{i+2,j+2})
  = B_m ⊕ (π_1(P_{i−2,j−2}), π_2(P_{i−2,j−1}), ..., π_24(P_{i+2,j+2}))
  = (b_m1, b_m2, ..., b_m24) ⊕ (p^1_{i−2,j−2}, p^2_{i−2,j−1}, ..., p^24_{i+2,j+2})
  = (b_m1 ⊕ p^1_{i−2,j−2}, b_m2 ⊕ p^2_{i−2,j−1}, ..., b_m24 ⊕ p^24_{i+2,j+2})
  = (q^1_ij, q^2_ij, ..., q^24_ij) = Q_ij,

that is, q^n_ij = b_mn ⊕ p^n_{i+h,j+w}, where m denotes the position of the pixel P_ij, i.e., m = (i − 1)s + j; B_m is the m-th component of the bit sequence B; the operation ⊕ is the XOR operation; π_n : S → Z_2 is the projection onto the n-th component; and the boundary conditions are periodic.
The cipherimage C is obtained by applying the above transition function once to each pixel of the plainimage I. In this way, the cipherimage is defined by r × s pixels and d colors, with 2 ≤ d ≤ 2^24. Moreover, the expansion factor of this cryptosystem is 1, i.e., the ratio between the sizes of the cipherimage and the plainimage is 1. The cryptosystem considers the original image as the initial configuration of the CA, and the ciphered image as the configuration of the CA at time t = 1. It is not necessary to iterate the CA more than once, since doing so would not increase the security of the cryptosystem.
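The encryption step amounts to XOR-ing each keystream block B_m with the bits of the pixels picked out of the shifted neighbourhood. The following sketch (my own illustration, not the authors' code) makes the indexing of Eqs. (4)-(6) explicit; it takes a generic keystream argument in place of the BBS output and uses periodic boundaries and 0-based positions.

```python
# Sketch of the encryption transition function f applied once to every pixel.
# `image` is an r x s list of 24-bit integers; `keystream` is a list of r*s
# 24-bit integers (B_1 ... B_{r*s}), e.g. obtained by packing BBS output bits.

def neighbourhood_shift(n):
    """(h, w) offset used for bit plane n (1 <= n <= 24), per Eqs. (5)-(6)."""
    k = n - 1 if n <= 12 else n          # skip the centre cell of the 5x5 square
    return k // 5 - 2, k % 5 - 2

def encrypt_image(image, keystream):
    r, s = len(image), len(image[0])
    cipher = [[0] * s for _ in range(r)]
    for i in range(r):
        for j in range(s):
            m = i * s + j                # position of P_ij (0-based here)
            q = 0
            for n in range(1, 25):
                h, w = neighbourhood_shift(n)
                src = image[(i + h) % r][(j + w) % s]      # periodic boundaries
                p_bit = (src >> (n - 1)) & 1               # n-th bit of the shifted neighbour
                b_bit = (keystream[m] >> (n - 1)) & 1      # n-th bit of B_m
                q |= (p_bit ^ b_bit) << (n - 1)
            cipher[i][j] = q
    return cipher
```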
4.3 Decryption
For the decryption protocol, the receiver considers the inverse CA of A, that is, A^{-1} = (L, S, W, g). The cellular space L and the state set S of A^{-1} are the same as for A. The set of indices of A^{-1} is W = −V, i.e., W contains the same indices as V, but taken in the reverse order. For the pixel (i, j), the set of indices is then

W = {(2, 2), . . . , (2, −2), . . . , (0, 1), (0, −1), . . . , (−2, 2), . . . , (−2, −2)}.   (7)
Moreover, as in the previous case, the neighbourhood W_{k,l} of the cell (k, l) can be codified by the same two variables h and w: W_{k,l} = {(k − h, l − w)}. The transition function g : S^24 → S used to decrypt a pixel is defined as follows:

Q_kl → R_kl :  g(Q_{k+2,l+2}, Q_{k+2,l+1}, ..., Q_{k−2,l−2})
  = B_t ⊕ (π_1(Q_{k+2,l+2}), π_2(Q_{k+2,l+1}), ..., π_24(Q_{k−2,l−2}))
  = (b_t1, b_t2, ..., b_t24) ⊕ (q^1_{k+2,l+2}, q^2_{k+2,l+1}, ..., q^24_{k−2,l−2})
  = (b_t1 ⊕ q^1_{k+2,l+2}, b_t2 ⊕ q^2_{k+2,l+1}, ..., b_t24 ⊕ q^24_{k−2,l−2})
  = (b_t1 ⊕ b_m1 ⊕ p^1_{k,l}, b_t2 ⊕ b_m2 ⊕ p^2_{k,l}, ..., b_t24 ⊕ b_m24 ⊕ p^24_{k,l})
  = (p^1_kl, p^2_kl, ..., p^24_kl) = P_kl,
where r^n_kl = b_tn ⊕ q^n_{k−h,l−w} = p^n_kl; t denotes the position of Q_{k−h,l−w}, i.e., t = (k − 1 − h)s + l − w, and hence b_tn = b_{(k−1−h)s+l−w, n}; π_n : S → Z_2 is the projection onto the n-th component; the symbol ⊕ stands for the XOR operation; and the boundary conditions are periodic. Hence, to recover the decrypted pixel P_kl, one applies the transition function g to its corresponding encrypted pixel Q_kl. The plainimage I is recovered by applying the transition function once to each pixel of the cipherimage C. The recovered plainimage is identical to the original one (pixel by pixel), that is, there is no loss of resolution.
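Decryption mirrors the encryption sketch above: for each bit plane, the keystream index is taken from the position of the neighbour Q_{k−h,l−w} rather than from the pixel being recovered. The sketch below (same conventions and toy layout as before; again my own illustration) recovers the plainimage bit for bit when applied to the output of encrypt_image with the same keystream.

```python
# Sketch of the inverse transition function g; pairs with the encryption sketch above
# (same pixel/keystream layout, periodic boundaries, 0-based positions).

def neighbourhood_shift(n):
    k = n - 1 if n <= 12 else n
    return k // 5 - 2, k % 5 - 2

def decrypt_image(cipher, keystream):
    r, s = len(cipher), len(cipher[0])
    plain = [[0] * s for _ in range(r)]
    for k in range(r):
        for l in range(s):
            p = 0
            for n in range(1, 25):
                h, w = neighbourhood_shift(n)
                ki, lj = (k - h) % r, (l - w) % s   # neighbour Q_{k-h, l-w}
                t = ki * s + lj                     # its position gives the keystream index
                q_bit = (cipher[ki][lj] >> (n - 1)) & 1
                b_bit = (keystream[t] >> (n - 1)) & 1
                p |= (q_bit ^ b_bit) << (n - 1)
            plain[k][l] = p
    return plain

# decrypt_image(encrypt_image(img, ks), ks) == img, pixel by pixel.
```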
4.4 Example
In this subsection we present an example of a colored plainimage (“Self-Portrait” by Tamara de Lempicka) and its corresponding cipherimage. Both images are defined by 602 × 800 pixels, and they are shown reduced in Fig. 1. The number of colors of the first image is 85803, whereas the second one has 474750 colors. For this example we have used a key with artificially small parameters: the bit length of the prime numbers is 64, hence the bit length of the modulus n is 128. The values of the modulus and the key are, respectively:

n = p · q = 609490657811550215868356152313472597421,   (8)
K = 430146343670092314107950454676296640957.   (9)
Fig. 1. Example of a plainimage and its cipherimage
5 Conclusions
A new symmetric graphic cryptosystem has been presented to encrypt a colored image defined by pixels and by any number of colors. The cryptosystem is based on a reversible bidimensional cellular automaton and uses a cryptographically secure pseudorandom bit generator. The key of the cryptosystem is the same as that of the CSPRBG, and the session key is the seed used to generate the pseudorandom bit sequence. Moreover, the decrypted image is identical to the original one, i.e., no loss of resolution takes place.
Acknowledgements This work was partially supported by Ministerio de Ciencia y Tecnología (Spain) under grant TIC2001–0586, and by Consejería de Educación y Cultura del Gobierno de Castilla y León (Spain) under grant SA052/03.
References
[1] Bellamy, B., Mason, J. S., Ellis, M.: Photograph signatures for the protection of identification documents. Proc. of Crypto & Coding'99, LNCS 1746 (1999) 119–128.
[2] Blum, L., Blum, M., Shub, M.: A simple unpredictable pseudo-random number generator. SIAM J. Comput. 15 (1986) 364–383.
[3] Chang, C. C., Chuang, J. C.: An image intellectual property protection scheme for gray-level images using visual secret sharing strategy. Pattern Recogn. Lett. 23 (2002) 931–941.
[4] Chang, C., Hwang, M., Chen, T.: A new encryption algorithm for images cryptosystems. J. Syst. Software 58 (2001) 83–91.
[5] Chang, C., Liu, J. L.: A linear quadtree compression scheme for image encryption. Signal Process. Image 10 (1997) 279–290.
[6] Fridrich, J.: Image encryption based on chaotic maps. Proc. IEEE Int. Conf. Systems, Man Cybern. Comput. Cybern. Simul. (1997) 1105–1110.
[7] Fridrich, J.: Symmetric ciphers based on two-dimensional chaotic maps. Internat. J. Bifur. Chaos 8, 6 (1998) 1259–1284.
[8] Hernández Encinas, L., Montoya Vitini, F., Muñoz Masqué, J., Peinado Domínguez, A.: Maximal periods of orbits of the BBS generator. Proc. 1998 Int. Conf. on Inform. Secur. & Cryptol. (1998) 71–80.
[9] Menezes, A., van Oorschot, P., Vanstone, S.: Handbook of applied cryptography. CRC Press, Boca Raton, FL, 1997.
[10] Mollin, R. A.: An introduction to cryptography. Chapman & Hall/CRC, Boca Raton, FL, 2001.
[11] Naor, M., Pinkas, B.: Visual authentication and identification. Proc. of Crypto'97, LNCS 1294 (1997) 322–336.
[12] Naor, M., Shamir, A.: Visual cryptography. Proc. of Eurocrypt'94, LNCS 950 (1995) 1–12.
[13] O'Gorman, L., Rabinovich, I.: Secure identification documents via pattern recognition and public-key cryptography. IEEE Trans. Pattern Anal. Mach. Intell. 20, 10 (1998) 1097–1102.
[14] Packard, N. H., Wolfram, S.: Two-dimensional cellular automata. J. Statist. Phys. 38 (1985) 901–946.
[15] Stinson, D.: Cryptography. Theory and Practice. 2nd ed. CRC Press, Boca Raton, FL, 2001.
Agent-Oriented Public Key Infrastructure for Multi-agent E-service Yuh-Jong Hu and Chao-Wei Tang Emerging Network Technology (ENT) Lab. Dept. of Computer Science National Chengchi University, Taipei, Taiwan 116 {jong, g9017}@cs.nccu.edu.tw
Abstract. An agent is autonomous software that mediates e-services for humans on the Internet. The acceptance of agent-mediated e-service (AMES) has been slow owing to the lack of a security management infrastructure for multi-agent systems. We therefore propose an agent-oriented public key infrastructure (APKI) for multi-agent e-service. In this APKI, a taxonomy of digital certificates is generated, stored, verified, and revoked to satisfy different access and delegation control purposes. The agent identity certificate is designed for agent authentication, whereas the attribute and agent authorization certificates are proposed for agent authorization and delegation. Using these digital certificates, we establish agent trust relationships in cyberspace. A trusted agent-mediated e-service scenario is shown to demonstrate the feasibility of our APKI.
1 Introduction
Web technologies have developed very fast over the past few years. However, humans still have to spend a lot of time searching for the information they need, and agent technology provides the capacity to mitigate this problem. If an autonomous agent, acting as a human's software delegate, can carry out all kinds of e-services, it saves us a tremendous amount of time on mundane jobs: we give an execution order and delegate our authority to the agent, and the agent finishes the services for us autonomously. It is still uncertain whether agent technology can be deployed at large scale on the Web. One reason is that using agent services on the Web always carries considerable risk, and people are still hesitant to use agents as their legal delegates on the WWW [7]. In this paper we address several unsolved issues that must be settled before people can really apply agent technology on the Web. First, we must guarantee that each agent is tightly bound to its owner for legal responsibility: in case an agent does something wrong in cyberspace, we can trace its real owner for liability. Second, even if all transaction service agents are bound to their owners, we still need to establish trust relationships between these agents so that agents can verify each other's identity and their owners' trustworthiness.
This research was supported by the Taiwan National Science Council (NSC) under grant No. NSC 90-2213-E-004-008.
We propose an Agent-oriented Public Key Infrastructure (APKI) to resolve the above two issues. Similar to a Human PKI (HPKI), this APKI has Agent Certification Authorities (ACAs) to issue, store, and even revoke agents' certificates [10], and agents can verify each other's identity certificates under it. In addition, each ACA in the APKI is in charge of the legal binding service between an agent and its owner, so that the agent's owner is responsible for his agent in the real world. If desired, a human may even delegate trust validation authority to an ACA, so that the APKI might subsume the HPKI functions. We also introduce the human attribute certificate and the agent authorization certificate for authority initialization and delegation [8]. Finally, we briefly demonstrate how to build an APKI on a multi-agent system using FIPA-OS [3]. Our APKI complements existing FIPA standards and frameworks, which do not provide a security mechanism for agents to authenticate and authorize each other [4].
2 Related Work
Several PKI architectures have been proposed and implemented recently, such as X.509 PKI, PGP, and SPKI/SDSI [1, 10]. Each PKI has its own certificate format, name space, and topology to satisfy the requirements of its security criteria. Unfortunately, these PKIs are designed only for humans, not for agents. As the software paradigm shifts from monolithic, passive systems to cooperative, autonomous agent-based systems, we need an APKI that is specifically designed for multi-agent systems. There have been some studies on agent security and APKI, but they do not provide a binding mechanism between human and agent that would allow a trust path to be found between two e-service agents to establish an agent's trustworthiness [2, 5, 6, 11]. We provide a binding mechanism between an agent and its owner to guarantee legal responsibility. We also apply core agent technologies to certificate management operations, such as certificate application, issuing, storing, and revocation, so that trusted agent-mediated e-services can be achieved via digital certificate management under our APKI [9].
3 APKI Architecture
Why do we separate the APKI and the HPKI into two frameworks? Because humans live in the physical world, which provides the essential trust foundation for any e-service transaction, whereas agents live in cyberspace, which relies on the humans' trust linkage. Certainly we cannot force any existing Human CA (HCA) to provide agent certificate management services.
3.1 APKI Design Issues
There are several issues in the design of an APKI:
1. The APKI has to provide full capacity to manage all agent identity certificates on the WWW.
2. Human liability for an agent is based on the certificate binding mechanism between human and agent, so that non-repudiation by the agent's owner is ensured.
3. A feasible APKI topology needs to support efficient and robust agent certificate management for a tremendous number of agents.
4. An essential trust path between the service provider agent and the service requester agent needs to be set up to ensure the trustworthiness of the service.
5. The APKI might subsume the HPKI if humans grant the necessary trust validation authority to all ACAs on the essential trust path.

3.2 Agent Certification Authority
An Agent Certification Authority (ACA) is defined as two agents: a Registration Agent (RA) and a Management Agent (MA).
– RA: the RA registers with the MA of the adjacent upper layer to get its identity certificate, except for the RA at the top of the APKI, which registers with its own MA.
– MA: the MA provides certificate management operations for the agent certificates of the adjacent lower layer, including certificate issuing, storing, verification, and revocation.
What is a possible APKI topology to serve all agents on the Web? To resolve the above APKI design issues, we propose a strict hierarchical tree APKI topology with three layers of ACAs and one layer of application agents (see Fig. 1).
– Root ACA (RACA): the RACA is at the top layer of the APKI, so every lower ACA must trust it. The RA of the RACA is the root of trust, so it initially registers with its peer MA to get its agent identity certificate.
– Global ACA (GACA): the RA of a GACA registers with the MA of the RACA to get its agent identity certificate after the RACA has been initialized.
– Local ACA (LACA): the RA of a LACA registers with the MA of a GACA to get its agent identity certificate when the LACA is initialized.
– Application Agents (AAs): an AA is a real service agent for a human in our APKI. Before requesting services, an AA registers with a local ACA it trusts to apply for an identity certificate, so that other service provider agents can validate the agent's identity and its owner's trustworthiness through the trust path.
4 Agent and Human Certificate

4.1 Agent Identity Certificate
The format of the agent a’s identity certificate is shown as the followings: IDACA →a − Cert = (Ida , P ua , Sig(P ua )h , V, Option, SigACA )
where Id_a is a's distinguished identity; Pu_a is a's public key; Sig(Pu_a)_h is agent a's public key signed by human h's private key, i.e., the human-agent binding information; V is the validation period; Option is optional information; and Sig_ACA is the certificate signature signed by the ACA's private key.
– Agent-certificate-applying: the information required to apply for an agent certificate is the agent's public key, its owner's identity certificate, and the agent's public key as a message digest signed by the human's private key to endorse this particular agent (see Fig. 1). This guarantees that the human is fully responsible for his own agent, so that the non-repudiation criterion is satisfied. We do not sign the agent's process code as a message digest because the process code is mutable.
– Agent-certificate-issuing: the MA of the ACA issues an agent identity certificate after verifying the necessary information, such as the owner's authenticity.
– Agent-certificate-storing: the MA of the ACA stores a copy of the agent identity certificate in its repository for other agents' queries or validation.
– Agent-certificate-revocation: the MA of the ACA revokes the agent identity certificate from the repository when the certificate's validation period expires.
– Agent-certificate-verification: the establishment of a trust path ensures the trust relationships of all ACAs in the path. Thus an ACA not only endorses the binding relationship between an agent and its owner but also propagates the trust beliefs within the path. We can establish a trust path between any two AAs and their associated owners if the humans delegate the (human) trust verification and propagation authority to the corresponding ACAs.
Fig. 1. Agent-Oriented Public Key Infrastructure with three agent digital certificate management layers and one application layer
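One possible concrete rendering of the identity certificate and of the human-agent binding described above is sketched below. This is an assumption of mine, not the paper's implementation: Ed25519 signatures from the cryptography package stand in for whatever signature scheme an ACA would actually deploy, and the field names mirror the tuple (Id_a, Pu_a, Sig(Pu_a)_h, V, Option, Sig_ACA).

```python
# Sketch of an agent identity certificate with its two signatures: the owner's
# binding signature over the agent's public key, and the ACA's signature over
# the certificate body.  Ed25519 is used only as a stand-in signature scheme.
from dataclasses import dataclass
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

def raw_public_bytes(private_key):
    return private_key.public_key().public_bytes(
        serialization.Encoding.Raw, serialization.PublicFormat.Raw)

@dataclass
class AgentIdCert:
    agent_id: str          # Id_a
    agent_pub: bytes       # Pu_a
    owner_binding: bytes   # Sig(Pu_a)_h, signed by the owner's private key
    validity: str          # V
    option: str            # Option
    aca_sig: bytes = b""   # Sig_ACA over all preceding fields

    def body(self):
        return b"|".join([self.agent_id.encode(), self.agent_pub,
                          self.owner_binding, self.validity.encode(),
                          self.option.encode()])

# Issuance: the owner endorses the agent's public key, then the ACA signs the body.
owner_sk, agent_sk, aca_sk = (Ed25519PrivateKey.generate() for _ in range(3))
agent_pub = raw_public_bytes(agent_sk)
cert = AgentIdCert("agent:a", agent_pub, owner_sk.sign(agent_pub),
                   "2003-01-01/2003-12-31", "")
cert.aca_sig = aca_sk.sign(cert.body())

# Verification by a relying agent (each verify() raises InvalidSignature on failure);
# in practice the ACA's and owner's public keys come from their own certificates.
aca_sk.public_key().verify(cert.aca_sig, cert.body())
owner_sk.public_key().verify(cert.owner_binding, cert.agent_pub)
```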
The human attribute certificate and the agent authorization certificate are designed for agent authority initialization and delegation, as described below.

4.2 Human Attribute Certificate
The attribute certificate is proposed only for humans. We do not propose an agent attribute certificate because agents have no initiative authority unless a human or an institution delegates it. The format of the attribute certificate is:

AT_{HTA→h}-Cert = (Id_h, Ar_h, V, Option, Sig_HTA)

where Id_h is principal h's distinguished identity; Ar_h is principal h's attribute information; V is the validation period; Option is optional information; and Sig_HTA is the certificate signature signed by the private key of the HTA (Human aTtribute Authority).

4.3 Human and Agent Authorization Certificate
The SPKI/SDSI-based authorization certificate is initially created for a human to request a service from a service application agent [1]. This authorization certificate is then re-generated and delegated to agent(s) for further services via different delegation mechanisms, such as chain-ruled or threshold delegation [9]. Even though the issuer and subject in the authorization certificate are indicated only by anonymous public keys, we still have the capacity to trace who the original authority delegator is and who went wrong in the delegation process. More specifically, we can examine the authorization certificates along the chain and find the owner responsible for the liability of the agents. The format of a human or agent authorization certificate is:

AU_{p→q}-Cert = (Pu_p, Pu_q, A, D, V, Sig_p)

where Pu_p is the public key of the issuer, principal p, who grants the authorization; Pu_q is the public key of the subject, principal q, who receives the authorization; A is the authorization expression; D is the delegation bit with value 0 or 1; V is the validation period; and Sig_p is the certificate signature signed by p's private key.
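The delegation bit D and the issuer/subject key fields make chain checking mechanical. The sketch below is my own illustration: signature verification of Sig_p is elided, authorization-scope narrowing is not checked, and only the structural rules are enforced (each certificate is issued by the subject of the previous one, every intermediate link carries D = 1, and all validity periods cover the time of use).

```python
# Structural check of an authorization certificate chain AU_{u1->a}, AU_{a->b}, ...
# A real implementation would also verify each link's signature with Pu_p and
# check that the authorization expression A never widens along the chain.
from dataclasses import dataclass

@dataclass
class AuthCert:
    issuer_pub: str     # Pu_p
    subject_pub: str    # Pu_q
    authorization: str  # A
    delegation: int     # D: 1 = may delegate further, 0 = may not
    valid_until: int    # V, simplified to an expiry stamp (YYYYMMDD)

def chain_is_valid(chain, root_pub, now):
    if not chain or chain[0].issuer_pub != root_pub:
        return False                           # chain must start at the original delegator
    for i, cert in enumerate(chain):
        if cert.valid_until < now:
            return False                       # expired link
        if i < len(chain) - 1:
            if cert.delegation != 1:
                return False                   # this link may not delegate further
            if cert.subject_pub != chain[i + 1].issuer_pub:
                return False                   # broken issuer/subject linkage
    return True

chain = [AuthCert("Pu_u1", "Pu_a", "reserve-ticket", 1, 20031231),
         AuthCert("Pu_a", "Pu_b", "reserve-ticket", 1, 20031231),
         AuthCert("Pu_b", "Pu_c", "reserve-ticket", 0, 20031231)]
print(chain_is_valid(chain, root_pub="Pu_u1", now=20030901))   # True
```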
5 Trusted Agent-Mediated E-services
A simple agent-mediated e-service (AMES) scenario is now presented to demonstrate the feasibility of our APKI framework and its associated digital certificate management. In this trusted AMES scenario, all of the digital certificates we propose are required for trust verification. There exists an e-service portal on the WWW with a multi-agent system built on top of it. Agent b on this portal is customized by a user (or institution) u2 as a professional service agent with a specialty, such as ticket reservations. User u1 might launch his personal agent a to request a service s from agent b. Initially, we assume that the digital identity certificates of users u1, u2 and agents a, b have been issued and stored in the associated HCAs and ACAs of the respective HPKI and APKI.
1. User u1 delegates to agent a an authorization certificate AU_{u1→a}-Cert for requesting service s from agent b. Of course, u1's original authority is based on u1's attribute certificate AT_{HTA→u1}-Cert.
2. Agent a uses the Web portal's yellow pages to locate agent b by means of b's agent identity certificate. Before authority delegation is initiated, a trust association is established between agent b and agent a (shown as step 3). Then a new authorization certificate AU_{a→b}-Cert is generated to ask for the service.
3. A trust association between agent b and agent a is required to ensure that the trust criteria are satisfied, including:
– Agents a and b are legally bound to their respective owners. This is an embedded function of the LACA.
– There should exist a trust path between agent b and agent a in the APKI. A trust verification probing message is sent from agent b's (or a's) LACA to the connected GACA (or the RACA) to ascertain this condition.
– Users u1 and u2 mutually trust each other. This criterion is satisfied if all ACAs in the previously established trust path were granted authority by their respective owners.
4. Agent b might serve agent a's service request directly, or it might request a service from another agent c. In that case, go to step 2, and a new authorization certificate AU_{b→c}-Cert is generated.
5. The complete trusted AMES scenario is achievable if a trust path can be established and all digital certificates are validated and satisfy the access control rules.
6 APKI Implementation
There are two service management modules in the existing FIPA multi-agent system framework, namely the Agent Management System (AMS) and the Directory Facilitator (DF). The objective of the AMS is to accept agents' name registrations and to manage agent identities, whereas the objective of the DF is to accept agents' service registrations. An agent discovers a specific service from the DF with the SearchDF function. We built our APKI on FIPA-OS to evaluate the feasibility of our framework (see Fig. 2). Because the existing FIPA standards and associated frameworks lack a digital certificate management feature, our study is important and significant for the entire agent research community. Adding our APKI module to the FIPA management system certainly enhances agent authentication, authorization, and delegation trust services on the Web.
7 Conclusion and Further Studies

As agent technology is becoming popular, we envision the importance of trusted agent e-services using digital certificates. We have proposed a tree-hierarchy APKI topology to manage agent identity certificates. This topology is simple and extensible enough to serve a tremendous number of agents on the WWW, such as AgentCities. Agent authentication, authorization, and even delegation are also possible using our taxonomy of digital certificates. We need to establish a trust path between peer agents to secure the e-service trust relation. A trusted agent-mediated e-service scenario was shown and implemented using this APKI to demonstrate the feasibility of our proposed framework. In our further studies, we are using this certificate-based trust establishment theory to build trust ontologies and rules on the semantic web.

Fig. 2. APKI system implementation framework
References
[1] Ellison, C. M., SPKI/SDSI Certificates, http://world.std.com/
[2] Finin, T. W., Mayfield, J., Thirunavukkarasu, C., Secret Agents - A Security Architecture for the KQML Agent Communication Language, CIKM'95 Intelligent Information Agents Workshop, Baltimore (1995).
[3] FIPA-OS, http://fipa-os.sourceforge.net
[4] FIPA Standards, http://www.fipa.org
[5] Foner, N. L., A Security Architecture for Multi-Agent Matchmaking, Proceedings of the Second International Conference on Multi-Agent Systems (1996).
[6] He, Q., Sycara, P. K., and Finin, T. W., Personal Security Agent: KQML-Based PKI, Proceedings of the Second International Conference on Autonomous Agents (1998).
[7] Heckman, C. and Wobbrock, O. J., Liability for Autonomous Agent Design, Autonomous Agents 98, Minneapolis, MN, USA (1998) 392-399.
[8] Hu, Y.-J., Some Thoughts on Agent Trust and Delegation, The Fifth International Conference on Autonomous Agents, Montreal, Canada (2001).
[9] Hu, Y.-J., Trusted Agent-Mediated E-Commerce Transaction Services via Digital Certificate Management, Electronic Commerce Research, Vol. 3, Issues 3-4 (2003).
[10] Gerck, E., Overview of Certification Systems: X.509, CA, PGP and SKIP, MCG - Meta-Certificate Group, http://www.mcg.org.br
[11] Wong, H. C. and Sycara, K., Adding Security and Trust to Multi-Agent Systems, Proceedings of Autonomous Agents '99 Workshop on Deception, Fraud and Trust in Agent Societies, Seattle, Washington (1999).
SOID: An Ontology for Agent-Aided Intrusion Detection Francisco J. Martin1 and Enric Plaza2 1
School of Electrical Engineering and Computer Science, Oregon State University Corvallis, 97331 OR, USA [email protected] 2 IIIA - Artificial Intelligence Research Institute CSIC - Spanish Council for Scientific Research Campus UAB, 08193 Bellaterra, Catalonia, Spain [email protected]
Abstract. We introduce SOID, a Simple Ontology for Intrusion Detection, that allows an agent-aided intrusion detection tool called Alba (ALert BArrage) to reason about computer security incidents at a higher level of abstraction than most current intrusion detection systems do.
1 Introduction
Site Security Officers' (SSOs') responsibilities include, among other perimeter defense tasks, prevention, detection, and response to computer security incidents. Nowadays, Intrusion Detection Systems (IDSes) have become common tools (eTrust, eSafe, IntruShield, RealSecure, etc.) deployed by SSOs to combat unauthorized use of computer systems. The first research on computer-aided intrusion detection goes back to the 1980s [1]. However, using the current generation of IDSes, SSOs are continuously overwhelmed with a vast amount of log information and bombarded with countless alerts. The capacity of a human SSO to tolerate false positives and to respond correctly to the output of current IDSes is debatable [2]. There are even those who postulate that traditional IDSes have not only failed to provide an additional layer of security but have also added complexity to the security management task. Therefore, there is a compelling need to develop a new generation of tools that help to automate security management tasks such as the interpretation and correct diagnosis of IDS output. As the number of networked organizations proliferates and the number of computer security threats increases, this need only becomes more acute. To make the aforementioned tasks more manageable we envisage a new generation of intrusion detection tools under the heading of agent-aided intrusion detection. Some recent works can be seen as the prelude to this tendency [3, 4, 5, 6]. Agent-aided systems, also known as agent-based systems or multi-agent systems, constitute an active area of research within the artificial intelligence community. We are developing a first prototype of an agent-aided intrusion detection tool
On sabbatical leave from iSOCO-Intelligent Software Components S.A.
called Alba (ALert BArrage), which mediates between the alarm element, such as is considered in a generic architectural model of an IDS, and the corresponding SSO (see Fig. 1). Alba aims to perform the alert management task on behalf of its SSO, to reduce the number of false positives due to innocuous attacks, and to increase the predictive power for harmful multi-stage attacks. However, a useful level of description is required to automate the alert management task. The objective of the present work is to provide a mechanism that allows a sequence of alerts (alert stream) provided by a conventional IDS not only to be readable but also understandable by a software agent. Ontologies provide a way of capturing a shared understanding of terms that can be used by humans and programs to aid in information exchange. Unfortunately, there is currently no common ontology that allows computer security incidents to be conceptualized in a standardized way, nor does there exist a widely accepted common language for describing computer attacks. Consequently, the first and foremost step in the development of Alba has been to construct an ontology that provides a comprehensive understanding of security incident concepts. We have called this ontology SOID, which stands for Simple Ontology for Intrusion Detection. SOID establishes well-defined semantics that allow Alba to process information consistently and to reason about an IDS alert stream at a higher level of abstraction, facilitating in this way the automation of the alert management task. The rest of this article is organized as follows. Section 2 succinctly explains the methodology that we have followed to build SOID. Section 3 summarizes some approaches addressing the number of alerts that a human SSO has to handle. Finally, Sec. 4 presents some concluding remarks.

Fig. 1. Alba Overview
2 SOID
We understand an ontology as a formal specification of a vocabulary of concepts and of the relationships among these concepts; it provides a machine-readable set of definitions that in turn create a machine-understandable taxonomy of classes, subclasses, and the relationships between them. Several methodologies for building ontologies have been proposed; we have followed the V-model methodology, inspired by the software engineering V-process model [7].
2.1 Purpose and Conceptualization
SOID aims at providing a domain-specific representation language for alert management in intrusion detection. The SOID conceptualization allows Alba to diagnose the sequence of alerts (alert stream) provided by a conventional IDS; that is, Alba identifies whether a given alert really corresponds to intrusive behavior or not, reasoning in terms of the concepts provided by SOID. In order to automate the alert management task we have identified four key sources of knowledge to be conceptualized, and we have built a separate ontology for each of them using the knowledge representation language Noos [8]. Finally, we have merged these partial ontologies into a more global ontology that we call SOID.

Networks. A network is the computer system to be protected. We have defined a set of concepts and relationships to model a network based on the Network Entity Relationship Database (NERD) proposed in [9]. Properly modelling the network allows the importance of each alert to be correctly assessed, for instance, determining whether a given alert corresponds to an innocuous attack or not. Network models based on SOID can easily be coded into Noos and automatically updated by translating the reports provided by network scanners such as Nessus, Satan, or OVAL.

Incidents. An incident is an unauthorized use or abuse of the protected system. We have followed CLCSI [10], which defines an incident taxonomy based on three key concepts: events, attacks, and incidents. An event is an action directed at a target which is intended to result in a change of state of the target. An attack is defined as a sequence of actions directed at a target, taken by an attacker making use of some tool and exploiting a computer or network vulnerability. Finally, an incident is defined as a set of attacks carried out by one or more attackers with one or more goals.

Vulnerabilities. A vulnerability is a flaw in a target that could allow an unauthorized result. Knowing the vulnerabilities in our network is the main source of knowledge for automatically deciding whether a given alert corresponds to an innocuous attack or not. We have incorporated the Common Vulnerabilities and Exposures (CVE) dictionary provided by the MITRE Corporation into our ontology.

Alerts. An IDS aims to discover intrusion attempts, and whenever intrusive behavior is detected the IDS notifies the proper authority by means of alerts. Alerts take the form of emails, database or log entries, etc., and their format (plain text, XML, IDMEF, etc.) and content vary according to the particular IDS. In SOID we have conceptualized alerts according to the Snort ruleset. Snort is a network IDS where alerts are triggered by a collection of rules. Each Snort rule is composed of a Snort identification number (SID), a message that is included in the alert when the rule is triggered, an attack signature, and references to sources of information about the attack. Snort alerts are classified into twenty-three classes. Figure 2 shows the SOID sorts for representing them. Each alert is provided with an identifier, time and date, sensor identifier, triggered signature, IP and TCP headers, and payload. In Fig. 3 an alert corresponding to an attempt to propagate the CodeRed worm is shown.

Fig. 2. Sorts for Snort Alerts
2.2 Integration
Our ontology is coded using Noos, an object-centered knowledge representation language useful for developing knowledge systems that integrate problem solving and learning [8]. Noos has also been upgraded with agent-programming constructs. The three basic concepts that underpin the Noos knowledge representation language are sorts, feature terms, and subsumption. A sort is defined as a symbol that denotes a set of the individuals of a domain; sorts form a collection of partially ordered symbols. Noos is formalized using feature terms. Feature terms are a generalization of first-order terms and lambda terms; they constitute the basic Noos data structure and can be seen as extendable records organized in a subsumption hierarchy [8]. Feature terms are represented graphically by means of labeled directed graphs. In Noos, subsumption is defined as an informational ordering among feature terms: a feature term Ψ′ is subsumed by another feature term Ψ when all the information provided by Ψ is also provided by Ψ′. Subsumption is crucial in our approach since it is at the core of the algorithms that Alba uses to compare sequences of alerts. For instance, Fig. 4 shows an alert-tree that represents a multi-stage attack as a sequence of Snort alert classes (see Fig. 2) together with a risk factor that indicates whether the attack is innocuous or not. Alert-trees can be provided by an expert or learnt using frequent episode discovery algorithms [11]. Alba continuously searches the alert stream for subsequences of alerts that are subsumed by any of the alert-trees. Prior to comparison and subsumption, alerts are given a more abstract representation using the sorts shown in Fig. 2.

Fig. 3. Alert for CodeRed

Fig. 4. Innocuous Alert Tree
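Although the actual Noos semantics is richer (sorted feature terms with a partial order on sorts), the informational-ordering idea used above can be sketched with nested dictionaries: a term subsumes another when every feature it specifies is present in the other with a subsumed value. The snippet below is my own simplification with a hand-written toy sort table, not the Noos implementation.

```python
# Toy sketch of subsumption as an informational ordering over feature terms
# represented as nested dicts; the IS_A table stands in for the Noos sort hierarchy.

IS_A = {"web-application-attack": "attempted-user",   # toy sort ordering, for illustration
        "attempted-user": "snort-alert"}

def sort_subsumes(general, specific):
    while specific is not None:
        if specific == general:
            return True
        specific = IS_A.get(specific)
    return False

def subsumes(general, specific):
    """general subsumes specific iff all info in general is also in specific."""
    if isinstance(general, dict):
        return isinstance(specific, dict) and all(
            key in specific and subsumes(val, specific[key])
            for key, val in general.items())
    return sort_subsumes(general, specific)

abstract_alert = {"class": "attempted-user", "protocol": "tcp"}
concrete_alert = {"class": "web-application-attack", "protocol": "tcp", "dst_port": "80"}
print(subsumes(abstract_alert, concrete_alert))   # True: the abstract term is more general
```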
3 Related Work
The correct interpretation of an IDS alert stream is an active area of research in the intrusion detection community [9, 12, 13]. Like our approach, M2D2 [12] reuses models proposed by others and integrates multiple interesting concepts into a unified framework. The most significant difference between the two approaches is that M2D2 uses the B formal method to model the different sources of information for the alert management task, whereas we use a description-logic-like language, as does [9]. Crosbie and Spafford were the first to propose autonomous agents in the context of intrusion detection. Their initial proposal evolved to become AAFID [4]. Other works, such as Cooperating Security Managers, have proposed a multi-agent system to handle intrusions instead of only detecting them [3]. However, in these works agents lack reasoning capabilities and are used for mere monitoring. More sophisticated agents with richer functionality were introduced in [6]. It has been argued in [5] that a collection of heterogeneous software agents can reduce risks during the window of vulnerability between when an intrusion is detected and when the security manager can take an active role in the defense of the computer system. Some works have also proposed dealing with intrusion detection at a higher level of abstraction. The benefit of dealing with intrusions at a higher level of abstraction is twofold: it allows irrelevant details to be removed and the differences between heterogeneous systems to be hidden [14]. Different taxonomies of computer security incidents have been proposed [15]. An ontology centered on computer attacks was introduced in [6]; that ontology provides a hierarchy of notions specifying a set of harmful actions at different levels of granularity, from high-level intentions to low-level actions. An ontology based on natural language processing notions is proposed in [16] as the theoretical foundation to organize and unify the terminology and nomenclature of information security in general. A target-centric ontology for intrusion detection, based on the features of the target (system component, means of attack, consequences of attack, and location of attack), has been proposed in [15]. IDMEF tries to establish a common data model for intrusions mainly based on XML; thus it has a limited capability to describe the relations among objects and requires each IDS to interpret the data model programmatically [15]. In [15] RDFS was proposed as an alternative to IDMEF, and in [17] an ontology for computer attacks was provided on top of DAML+OIL (now renamed OWL). The representation we use in SOID is equivalent to OWL Lite.
4 Conclusions
One of the main obstacles to the rapid development of higher-level intrusion detection applications is the absence of a common ontology, which not only impedes dealing with computer security incidents at a higher level of abstraction but also hinders collaboration among different intrusion detection systems. Three key challenges for the alert management task were signaled in [13]: the absence of widely available domain expertise, the time-consuming and expensive effort due to the large number of alerts, and the heterogeneity of the information produced by different information security devices. The SOID ontology copes with these challenges by providing a homogeneous representation of expertise and by facilitating automation on the part of Alba, which uses case-based plan recognition, sequence analysis, and frequent episode discovery algorithms to perform the alert management task. Our first experiments show promising results: in a networked organization with more than 200 computers, in only 45 days, 3 Snort sensors detected 18662 attacks coming from 1367 different IPs. At the end of the first month Alba had reduced the number of alerts received by our SSO by 98%.
Acknowledgments Part of this work has been performed in the context of the MCYT-FEDER project SAMAP (TIC2002-04146-C05-01) and the SWWS project funded by the EC under contract number IST-2001-37134.
References
[1] Anderson, J. P.: Computer security threat monitoring and surveillance. Technical report, James P. Anderson Co., Fort Washington, PA, USA (1980)
[2] Axelsson, S.: The base-rate fallacy and the difficulty of intrusion detection. ACM Transactions on Information and System Security 3 (2000) 186–205
[3] White, M. G. B., Fisch, E. A., Pooch, U. W.: Cooperating security managers: A peer-based intrusion detection system. IEEE Network 10 (1996) 20–23
[4] Spafford, E. H., Zamboni, D.: Intrusion detection using autonomous agents. Computer Networks 34 (2000) 547–570
[5] Carver, C. A., Hill, J. M., Surdu, J. R., Pooch, U. W.: A methodology for using intelligent agents to provide automated intrusion response. In: Proc. of the IEEE Workshop on Information Assurance and Security, West Point, NY (2000) 110–116
[6] Gorodetski, V. I., Popyack, L. J., Kotenko, I. V., Skormin, V. A.: Ontology-based multi-agent model of information security system. In: 7th RSFDGrC. Number 1711 in Lecture Notes in Artificial Intelligence. Springer (1999) 528–532
[7] Stevens, R., Bechhofer, S., Goble, C.: Ontology-based Knowledge Representation for Bioinformatics. Briefings in Bioinformatics 1 (2000)
[8] Arcos, J., Plaza, E.: Inference and reflection in the object-centered representation language Noos. Journal of Future Generation Computer Systems 12 (1996) 173–188
[9] Goldman, R. P., Heimerdinger, W., Harp, S. A., Geib, C. W., Thomas, V., Carter, R. L.: Information modeling for intrusion report aggregation. In: DISCEX (2001)
[10] Howard, J. D., Longstaff, T. A.: A common language for computer security incidents. Technical Report SAND98-8667, Sandia National Laboratories (1998)
[11] Martin, F. J., Plaza, E.: Case-based sequence analysis for intrusion detection. In: Submitted (2003)
[12] Morin, B., Mé, L., Debar, H., Ducassé, M.: M2D2: A formal data model for IDS alert correlation. In: Proc. of RAID 2002. Number 2516 in Lecture Notes in Computer Science. Springer (2002) 115–137
[13] Porras, P. A., Fong, M. W., Valdes, A.: A mission-impact-based approach to INFOSEC alarm correlation. In: Proc. of RAID 2002. Number 2516 in Lecture Notes in Computer Science. Springer (2002) 95–114
[14] Ning, P., Jajodia, S., Wang, X. S.: Abstraction-based intrusion detection in distributed environments. ACM Transactions on Information and System Security 4 (2001) 407–452
[15] Undercoffer, J., Pinkston, J.: Modeling computer attacks: A target-centric ontology for intrusion detection. In: CADIP Research Symposium (2002)
[16] Raskin, V., Hempelmann, C. F., Triezenberg, K. E., Nirenburg, S.: Ontology in information security: a useful theoretical foundation and methodological tool. In: Proc. Workshop on New Security Paradigms, ACM Press (2001) 53–59
[17] Noel, S.: Development of a cyber-defense ontology. Center for Secure Information Systems, George Mason University, Fairfax, Virginia (2001)
Pseudorandom Number Generator – The Self Programmable Cellular Automata Sheng-Uei Guan and Syn Kiat Tan Department of Electrical and Computer Engineering, National University of Singapore 10 Kent Ridge Crescent Singapore 119260 {eleguans,engp1723}@nus.edu.sg
Abstract. This paper deals with the problem of generating high-quality random number sequences for cryptography. A new class of CA, the Self Programmable Cellular Automata (SPCA), is proposed. Experimental results show that the generated sequences perform well in the ENT and DIEHARD test suites, comparable to the results obtained by other researchers. Furthermore, the SPCA offers implementation-space savings over other similar variants of CA. The SPCA uses only Boolean operations; hence it is ideally suited for VLSI hardware implementation.
1 Background
Over the past years, researchers have applied cellular automata (CA) to pseudorandom number generation (PRNG) for various purposes such as cryptography [1,7], built-in self-test [2,4], etc. The field of one-dimensional (1-D) CA has been studied extensively, leading to several new variants of CA such as the hybrid CA [1,3], Programmable Cellular Automata (PCA) [1,10,11], Controllable Cellular Automata (CCA) [10,11], etc. However, the general randomness quality of number sequences generated by such CA is not satisfactory, and researchers began to investigate 2-D CA [6]. Although it is reported in the literature that 2-D CA generate number sequences that perform well in the stringent ENT [8] and DIEHARD [9] test suites, 2-D CA bring along possibly increased complexity and implementation space. Cellular automata (CA) are dynamic systems discrete in space and time. Minimally, a CA is an array of programmed cells, each having a finite number of states and interacting with its neighboring cells via a transition rule. These cells can be arranged in a 1-D array, 2-D grid, or 3-D mesh topology. In this paper, we will only consider CA with Boolean cells with states s ∈ {0, 1}. The transition rule is usually specified in the form of a truth table, with entries for every possible configuration of neighborhood cell inputs. Researchers have settled on a simple way of identifying the 2^8 (for a size-3 neighborhood) possible transition rules, and we adopt this convention in our paper. Since our CA is implemented in finite configurations, 3 types of boundary conditions are to be considered:
i) Cut-off boundaries, whereby the cells located at the left (right) end have a permanent input fixed at '0'.
ii) Spatially periodic boundaries, where the cells located at the left (right) end have each other as neighbors.
iii) Intermediate boundaries, where the cells located at the left (right) end have the two cells on their right (left) as neighbors [3].
Both uniform and hybrid CA have been used together with genetic algorithms (GA) [5,6,7,10,11] for generating random number sequences. The difference lies in that the transition rules of the cells need not all be the same: in the extreme case, each cell has a different transition rule. While complexity seems to be increased due to the loss of homogeneity, there are actually few consequences for either computational or hardware implementation costs, since essentially the same amount of resources is used. However, the randomness of the generated sequences may be better due to the increased complexity of the CA. An n-bit PCA utilizes a ROM which contains a set of 2^n pre-programmed maximum-length rules for the particular PCA. At each state transition, a counter is used to load different rules from the ROM into the PCA; for more details, refer to [1,10,11]. The CCA uses additional rule control signals and cell control signals to actively change the CA behavior during state transitions. It is a hybrid CA containing basic cells and controllable cells, whose behaviors are changed by the cell control signals. The generation of the additional signals can be performed by two CA; more details can be found in [10,11]. The performance of the above-mentioned CA-based PRNGs will be compared in the Experimental Results section. Our CA is structurally simpler than the above cases.
2 The Self Programmable Cellular Automata
Our research objective is to seek a new 1-D CA that can generate number sequences that pass ENT and DIEHARD, while having less complexity than the PCA and CCA. Our proposal is the Self Programmable Cellular Automata (SPCA); this SPCA uses the same 2 transition rules for each cell, which is quite similar to the 1-bit PCA. While the 1-bit PCA uses a pre-programmed ROM to load transition rules into the cells, the SPCA derives the rule selection algorithm using a selection neighborhood within the CA itself, hence its name. This selection neighborhood uses input from cells different from the conventional neighborhood. Fig. 1 shows an SPCA cell. The additional complexity lies in the additional logic circuitry and wiring for each cell, which is typically an XOR gate. Each SPCA is then characterized by the selection circuitry and the 2 transition rules. Using a 3-cell neighborhood for transition rules, we have a total of 2^16 possible combinations of transition rules. Though the search space seems small, the time it takes to minimally evaluate an SPCA (generating the numbers and running the ENT and DIEHARD tests) is at least 2 minutes, so a GA was adopted to minimize the search time.
Fig. 1. A SPCA Cell
Let s_0(t), s_1(t), ..., s_n(t) denote the states of the cells at time interval t, and let f_{k(i,t)}(·) be the selected transition rule for cell i at time t, where k(i,t) ∈ {0,1} is the transition rule index for cell i at time interval t. Then

s_i(t + 1) = f_{k(i,t)}(s_{i−1}(t), s_i(t), s_{i+1}(t)),

where f_{k(i,t)} ∈ {Rule 1, Rule 2} is selected using the following simple selection rule, with its corresponding selection neighborhood:

k(i, t+1) = s_{i+1}(t) + s_{i+2}(t) + s_{i+3}(t),

where periodic conditions apply for the selection neighborhood and calculations are over GF(2). Note that this rule is actually rule 150 shifted to the right by two cells. Every cell's transition rule changes during each state transition, due to the rule selection mechanism. Thus k = {k(1,t), k(2,t), ..., k(n,t)} forms the secret state of the SPCA when it is used for stream ciphers [1,7]. The initial selection of transition rules for the cells also forms part of the secret key used to initialize the PRNG. Hence the total secret key of the SPCA-PRNG consists of <s, k>, where s = {s(1,0), s(2,0), ..., s(n,0)} is the n-bit CA state and k is the n-bit secret state at t = 0. It is an option to keep the two transition rules as part of the secret key.
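The update can be sketched in a few lines of Python (my illustration of the equations above, assuming periodic boundaries for both neighborhoods and Wolfram's rule numbering; the hardware realization in the paper instead uses extra boundary cells):

```python
# Sketch of one SPCA state transition, following the equations above.

def apply_rule(rule, left, centre, right):
    """Output bit of an elementary CA rule under Wolfram numbering."""
    return (rule >> ((left << 2) | (centre << 1) | right)) & 1

def spca_step(state, k, rules):
    """state, k: lists of bits; rules: (rule_0, rule_1)."""
    n = len(state)
    new_state = [
        apply_rule(rules[k[i]], state[i - 1], state[i], state[(i + 1) % n])
        for i in range(n)
    ]
    # k(i, t+1) = s_{i+1}(t) + s_{i+2}(t) + s_{i+3}(t) over GF(2)
    new_k = [
        state[(i + 1) % n] ^ state[(i + 2) % n] ^ state[(i + 3) % n]
        for i in range(n)
    ]
    return new_state, new_k

# Example: a 20-cell SPCA with the evolved rule pair (105, 150) from Table 2.
state, k = [1, 0] * 10, [0, 1] * 10
for _ in range(8):
    state, k = spca_step(state, k, rules=(105, 150))
print(state)
```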
3 The Genetic Algorithm
We specify a chromosome by the two transition rules used. Each rule, in its truth table form, is encoded using 8 bits, which give the output bits for the 8 possible neighborhood configurations (so there are 256 possible rules). The two rules then form the chromosome: a concatenated 16-bit binary string (see Fig. 2).
Fig. 2. The GA chromosome
Our genetic algorithm uses a population of 10 CA chromosomes, each having its two transition rules randomly initialized. We use three tests from the ENT suite (entropy, serial correlation coefficient, and chi-square) and all nineteen tests from DIEHARD to evaluate the randomness quality of the sequences produced by each chromosome. DIEHARD is known as the more stringent test suite, and sequences that pass DIEHARD will most often pass ENT too, while the converse is not necessarily true. From experience, we also noticed that sequences that do not have satisfactory results on the three ENT tests will have problems passing the DIEHARD tests; hence we filter the sequences that are subjected to DIEHARD to avoid wasting computation time. The fitness function used is simply the number of DIEHARD tests (out of 19) the chromosome/sequence can pass. After evaluation, the top 3 chromosomes are selected for 1-point crossover, producing 6 new children. Another 4 children are created via mutation from the 2 best chromosomes (elitism is effectively used, as a copy of the best chromosome to date is kept aside for this purpose).
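The search loop can be sketched as follows (my illustration of the procedure just described; the randomness-test fitness is stubbed out, since evaluating a chromosome really means generating an SPCA sequence and counting how many of the 19 DIEHARD tests it passes):

```python
# Sketch of the GA loop with the 16-bit chromosome encoding two 8-bit rules.
import random

def random_chromosome():
    return [random.randint(0, 1) for _ in range(16)]

def decode(chrom):
    rule0 = sum(b << i for i, b in enumerate(chrom[:8]))
    rule1 = sum(b << i for i, b in enumerate(chrom[8:]))
    return rule0, rule1

def fitness(chrom):
    # Placeholder: should return the number of DIEHARD tests passed (0..19).
    return random.randint(0, 19)

def crossover(a, b):
    cut = random.randint(1, 15)                 # 1-point crossover
    return a[:cut] + b[cut:]

def mutate(chrom):
    child = chrom[:]
    child[random.randrange(16)] ^= 1            # flip one random bit
    return child

population = [random_chromosome() for _ in range(10)]
best_chrom, best_score = None, -1
for generation in range(20):
    scored = sorted(((fitness(c), c) for c in population),
                    key=lambda x: x[0], reverse=True)
    if scored[0][0] > best_score:               # keep a copy of the best so far (elitism)
        best_score, best_chrom = scored[0]
    top3 = [c for _, c in scored[:3]]
    best2 = [c for _, c in scored[:2]]
    children = [crossover(*random.sample(top3, 2)) for _ in range(6)]
    children += [mutate(random.choice(best2)) for _ in range(4)]
    population = children
print("best rules:", decode(best_chrom), "tests passed:", best_score)
```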
4 Experimental Results
Random bit sequences are obtained by sampling the states of the CA cells at each discrete time interval. Fig. 3 shows the 20-cell 1-D SPCA used, with 4 redundant boundary cells. The redundant cells (a form of site spacing) are required to offset the correlation inherent in the CA output. These 4 cells are only used to generate the rule selection values, and their state values are not sampled. This SPCA, using the best evolved rule, passed all ENT tests and 18 DIEHARD tests (results were averaged over 100 random initial seeds). Sequences generated by this 20-cell SPCA have consistently failed the COUNT-THE-1s test. This result is in line with other researchers' results [4,10,11] and is caused by the correlation between neighboring cells. The total count of 20 cells required to achieve such performance in the test suites compares well with the best results by other researchers. The performance comparison with other researchers' CA results is shown in Table 1. Here, the researchers used site spacing (ss) and time spacing (ts), common techniques used for de-correlating bit sequences [4]. Although this results in a loss of throughput per clock cycle, the sequences thus yielded often performed better in the ENT and DIEHARD suites. The best evolved chromosomes use the rule combinations listed in Table 2. Note that each pair is complementary.
Fig. 3. The 20 Cell SPCA
Table 1. Comparison with Other CA¹
Test name
ENT suite 1. Entropy 2. SCC 3. Chi-square DIEHARD suite 1.Overlapping sum 2. Runs test 3. 3D sphere 4. A parking lot 5.Birthday spacing 6.Count the ones 1 7.Binary rank 6*8 8.Binary rank 31*31 9.Binary rank 32*32 10. Count the ones 2 11. Bitstream test 12. Craps test 13.Minimum distance 14.Overlapping permu 15. Squeeze 16. OPSO test 17. OQSO test 18. DNA test 19. Overall KS test Total DIEHARD tests passed
1 bit SPCA n = 20 ts = 0 ss= 4 unused cells
1 bit PCA n = 50 ts = 0 ss= 1 [10]
2 bit PCA n = 50 ts = 0 ss= 0 [10]
CCA2 n = 50 ts = 0 ss = 0 [11]
CCA2 n = 64 ts = 1 ss = 2 [11]
PCA90165 n = 50 ts = 1 ss = 2 [11]
7.9999818 0.0002385 Passed
Passed
Passed
Passed
Passed
-
Passed
Passed
Passed
Passed
Passed
Passed
Passed Passed Passed Passed
Failed Passed Passed Passed
Passed Passed Passed Passed
Passed Passed Passed Passed
Passed Passed Passed Passed
Passed Passed Passed Passed
Failed
Failed
Failed
Failed
Passed
Passed
Passed Passed
Passed Passed
Passed Passed
Passed Passed
Passed Passed
Passed Passed
Passed
Passed
Passed
Passed
Passed
Passed
Passed
Passed
Passed
Passed
Passed
-
Passed Passed Passed
Failed Passed Passed
Passed Passed Passed
Passed Passed Passed
Passed Passed Passed
Passed Passed
Passed
Failed
Passed
Passed
Passed
Passed
Passed Passed Passed Passed Passed
Passed Failed Failed Failed -
Passed Failed Failed Passed -
Passed Passed Passed Passed -
Passed Passed Passed Passed -
Passed -
18
11
15
17
18
13
Table 2. Evolved Rule Combinations
Chromosome | No. of DIEHARD tests passed
105-150    | 18
154-101    | 18
86-179     | 18
210-45     | 18
Test result of p-values between 0.0001 and 0.9999 were taken as passed.
5 Conclusion
The SPCA is proposed as a suitable pseudorandom number generator in this paper. A GA was used to search for optimal transition rule combinations, and very good chromosomes were found. The resulting SPCA has a size advantage over the various CA-based PRNGs, while giving comparable results. Ongoing work on the SPCA includes establishing the transition matrices of the good SPCA found and using a multi-objective genetic algorithm to further optimize among the important aspects: randomness quality, CA length, length of interconnection, etc.
References
[1] S. Nandi, B. K. Kar, and P. Pal Chaudhuri, Theory and applications of cellular automata in cryptography, IEEE Trans. Comput. 43 (1994), pp. 1346-1357
[2] D. R. Chowdhury, I. S. Gupta and P. Pal Chaudhuri, A class of two-dimensional cellular automata and applications in random pattern testing, Journal of Electrical Testing: Theory and Applications, Vol. 5, 1994, pp. 65-80
[3] S. Nandi and P. Pal Chaudhuri, Analysis of periodic and intermediate boundary 90/150 cellular automata, IEEE Transactions on Computers, Vol. 45, No. 1, Jan 1999
[4] P. D. Hortensius, R. D. McLeod, Werner Pries, D. Michael Miller and H. C. Card, Cellular automata-based pseudorandom number generators for built-in self-test, IEEE Transactions on Computer-Aided Design, Vol. 8, No. 8, 1989, pp. 842-859
[5] Moshe Sipper and Marco Tomassini, Generating parallel random number generators by cellular programming, International Journal of Modern Physics, Vol. 7, No. 2, 1996, pp. 181-190
[6] Moshe Sipper and Marco Tomassini, On the generation of high-quality random numbers by two-dimensional cellular automata, International Journal of Modern Physics, Vol. 7, No. 2, 1996, pp. 181-190
[7] Marco Tomassini and Mathieu Perrenoud, Cryptography with cellular automata, Applied Soft Computing 1 (2001), pp. 151-160
[8] ENT test suite, http://www.fourmilab.ch/random
[9] Marsaglia, "Diehard", http://stat.fsu.edu/~geo/diehard.html, 1998
[10] Sheng-Uei Guan and Shu Zhang, Incremental Evolution of Cellular Automata for Random Number Generation, accepted by International Journal of Modern Physics
[11] Sheng-Uei Guan and Shu Zhang, Pseudorandom Number Generators Based on Controllable Cellular Automata, submitted to IEEE Transactions
Exclusion/Inclusion Fuzzy Classification Network Andrzej Bargiela 1, Witold Pedrycz 2,3, and Masahiro Tanaka 4 1
The Nottingham Trent University Nottingham NG1 4BU, UK ([email protected]) 2 University of Alberta, Edmonton, Canada 3 Systems Research Institute, Polish Academy of Sciences, Poland ([email protected]) 4 Konan University, 8-9-1 Okamoto, Higashinada-ku, Kobe, Japan (m_tanaka@konan_u.ac.jp)
Abstract. The paper introduces an exclusion/inclusion fuzzy classification neural network. The network is based on our GFMM [3] and it allows for two distinct types of hyperboxes to be created: inclusion hyperboxes that correspond directly to those considered in GFMM, and exclusion hyperboxes that represent contentious areas of the pattern space. The subtraction of the exclusion hyperboxes from the inclusion hyperboxes, implemented by EFC, provides for a more efficient coverage of complex topologies of data clusters.
1 Introduction and Problem Statement
Fuzzy hyperbox classification of data has been shown to be a powerful algorithmic approach to deriving intuitive interpretations of data [5, 6, 3, 2, 4, 7]. However, the interpretability of individual hyperboxes comes with the inherent limitation that their shape is incompatible with the shape of many real-life data clusters. In order to overcome this incompatibility it is necessary to use sets of appropriately sized hyperboxes to cover the more complex topologies of the actual data. The quality of this coverage, measured as a misclassification rate, depends on the maximum size of the hyperboxes: the smaller the maximum hyperbox size, the more accurate the coverage that can be obtained. Unfortunately, a direct consequence is that the increase in the number of hyperboxes erodes the interpretability of the results. It is therefore necessary to balance the requirement of interpretability against the classification accuracy. The tradeoff originally proposed by Simpson [5, 6] was the optimization of a parameter defining the maximum hyperbox size as a function of the misclassification rate. However, the use of a single maximum hyperbox size proved too restrictive, in that for some data clusters there was a need for several hyperboxes, while for other clusters, with a more complex topology, significant misclassification was still possible with hyperboxes of that size. One solution to this problem, proposed in [3], is the adaptation of the size of the hyperboxes, so that it is possible to generate larger
hyperboxes in some areas of the pattern space without sacrificing the recognition rate, while in other areas the hyperboxes are kept small to afford accurate coverage of the data topology. A similar effect has also been obtained in the context of partially supervised hyperbox clustering [1]. In this paper we propose a development of the General Fuzzy Min-Max Neural Network (GFMM) [3] architecture that generates two types of hyperboxes. The first type, called inclusion hyperboxes, is the type of hyperbox we have considered so far. The second type, called exclusion hyperboxes, contains data belonging to different classes; the exclusion hyperboxes therefore represent areas of the pattern space in which the classification is ambiguous. Using these two types of hyperboxes as the basic building blocks, it is possible to represent complex topologies of data clusters with a reduced overall number of hyperboxes. Also, the three steps of the GFMM algorithm, namely Expansion, Overlap test and Contraction, can be reduced to two, i.e. Expansion and Overlap test. This paper is organized as follows. Section 2 gives an overview of the GFMM. The new Exclusion/Inclusion Fuzzy Classification (EFC) network is introduced in Section 3. Section 4 provides a brief comparison of the EFC to FMM and GFMM using the IRIS data set.
2 General Fuzzy Min-Max Neural Network
The neural network that implements the GFMM algorithm is shown in Figure 1. It is a three-layer feed-forward neural network that grows adaptively to meet the demands of the classification problem. The input layer has 2*n processing elements, two for each of the n dimensions of the input pattern X_h = [x_h^l, x_h^u]. Each second-layer node represents a hyperbox fuzzy set, where the connections between the first and second layers are the min-max points and the transfer function is the hyperbox membership function. The connections are adjusted using the algorithm described in [3]. The min-points matrix V is applied to the first n input nodes, representing the vector of lower bounds x_h^l of the input pattern, and the max-points matrix W is applied to the second n input nodes, representing the vector of upper bounds x_h^u. The connections between the second- and third-layer nodes are binary values. They are stored in the matrix U, whose elements are defined as

u_{jk} = 1 if b_j is a hyperbox for class c_k, and u_{jk} = 0 otherwise   (1)
where b_j is the jth second-layer node and c_k is the kth third-layer node. Each third-layer node represents a class. The output of the third-layer node represents the degree to which the input pattern X_h fits within the class k. The transfer function for each of the third-layer nodes is defined as

c_k = max_{j=1,...,m} b_j u_{jk}   (2)
for each of the p+1 third-layer nodes. Node c_0 represents all unlabelled hyperboxes from the second layer. The outputs of the class-layer nodes can be fuzzy when calculated using expression (2), or crisp when a value of one is assigned to the node with the largest c_k and zero to the other nodes. The topology of the network depicted in Figure 1 is almost identical to the original fuzzy min-max neural network proposed by Simpson [5] except for two changes. First, the number of input nodes has been increased from n to 2*n. This has eliminated the need for double connections from input nodes to second-layer nodes. Second, an additional node c_0 representing all the unlabelled hyperboxes from the second layer has been included in the output layer. This allows combined consideration of the classification and clustering of data.

[The figure shows the input nodes x_h^l = [x_h1^l,...,x_hn^l] and x_h^u = [x_h1^u,...,x_hn^u], the min/max-point matrices V = [v_1,...,v_m] and W = [w_1,...,w_m], the hyperbox layer b = [b_1,...,b_m], the binary matrix U = [u_10,...,u_1p; ...; u_m0,...,u_mp] and the class layer c = [c_0, c_1,...,c_p].]
Fig. 1. The three-layer neural network implementation of the GFMM algorithm
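As a rough numerical illustration of how expressions (1) and (2) are evaluated, the following Python sketch computes class activations from hyperbox memberships. The membership function used here is a simplified stand-in rather than the exact GFMM membership of [3], and all names and values are illustrative assumptions.

    import numpy as np

    def hyperbox_membership(x_l, x_u, V, W, gamma=1.0):
        """Membership of the input interval [x_l, x_u] in each hyperbox [V[j], W[j]].

        Simplified stand-in: 1 minus the average per-dimension violation of the
        hyperbox bounds, clipped to [0, 1].
        """
        over = np.maximum(0.0, x_u - W)       # upper bound exceeding each max point
        under = np.maximum(0.0, V - x_l)      # lower bound falling below each min point
        violation = np.clip(gamma * (over + under), 0.0, 1.0)
        return 1.0 - violation.mean(axis=1)   # one membership value b_j per hyperbox

    def class_outputs(b, U):
        """Eq. (2): c_k = max_j b_j * u_jk for each class node (including c_0)."""
        return (b[:, None] * U).max(axis=0)

    # toy usage: 3 hyperboxes in 2-D, two labelled classes plus the unlabelled node c_0
    V = np.array([[0.1, 0.1], [0.6, 0.6], [0.3, 0.7]])   # min points
    W = np.array([[0.4, 0.4], [0.9, 0.9], [0.5, 0.9]])   # max points
    U = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]])      # columns: c_0, c_1, c_2
    x = np.array([0.2, 0.3])
    b = hyperbox_membership(x, x, V, W)                  # point input: x_l = x_u = x
    print(class_outputs(b, U))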
The training of the neural network involves adaptive construction of hyperboxes guided by the class labels. The input patterns are presented in a sequential manner and are checked for possible inclusion in the existing hyperboxes. If the pattern is fully included in one of the hyperboxes, no adjustment of the min- and max-points of the hyperbox is necessary; otherwise a hyperbox expansion is initiated. However, before the expansion can be confirmed it is necessary to perform an overlap test, since it is possible that the postulated expansion could result in some areas of the pattern space belonging simultaneously to two distinct classes, thus contradicting the classification itself. If the overlap test is negative, the postulated hyperbox expansion is confirmed and the next input pattern is considered. If, on the other hand, the overlap test is positive, the hyperbox contraction procedure is initiated. This involves subdivision of the hyperboxes along the overlapping coordinates and the consequent adjustment of the min- and max-points. However, the contraction procedure has an inherent weakness in that it inadvertently eliminates from the two hyperboxes some part of the pattern space that was unambiguous, while retaining some of the contentious part of the pattern space in each of the hyperboxes. This is illustrated in Figure 2.
Fig. 2. GFMM hyperbox expansion/contraction (a)-(c) and the proposed exclusion/inclusion hyperboxes (d)
The attempted expansion of hyperbox b2 (because the pattern X_h belongs to the same class as hyperbox b2) is shown in Figure 2(b). However, the expansion creates an overlap with hyperbox b1, which is assumed here to belong to a different class than hyperbox b2. This initiates the contraction procedure illustrated in Figure 2(c). The result of the contraction is that a proportion of the original hyperboxes b1 and b2 is discarded (areas highlighted with diagonal lines), while part of the new hyperboxes b1 and b2 still remains contentious (areas marked with a solid black fill). Note that the pattern X_h is still included in the new hyperbox b1 (see Figure 2(c)) and is therefore misclassified. Clearly there is scope for fine-tuning the contraction procedure; however, the result is always a compromise between the loss of correctly classified patterns and the inclusion of areas of the pattern space where misclassification occurs.
[The figure extends the network of Figure 1 with an exclusion-hyperbox layer e = [e_1,...,e_q], its min/max-point matrices S = [s_1,...,s_m] and T = [t_1,...,t_m], the binary connection matrix R = [r_10,...,r_1p,r_1(p+1); ...; r_q0,...,r_qp,r_q(p+1)], an extended matrix U = [u_10,...,u_1p,u_1(p+1); ...; u_m0,...,u_mp,u_m(p+1)] and the class layer c = [c_0, c_1,...,c_p, c_(p+1)].]

Fig. 3. Exclusion/Inclusion Fuzzy Classification Network
3 Exclusion/Inclusion Fuzzy Classification Network (EFC)
The solution proposed here is the explicit representation of the contentious areas of the pattern space as exclusion hyperboxes. This is illustrated in Figure 2(d). The original hyperbox b1 and the expanded hyperbox b2 do not lose any of the undisputed
area of the pattern space, but the patterns contained in the exclusion hyperbox are eliminated from the relevant classes in the {c_1,...,c_p} set and are instead assigned to class c_(p+1) (the class of contentious areas of the pattern space). This overruling implements in effect the subtraction of hyperbox sets, which allows for the representation of non-convex topologies with relatively few hyperboxes. The additional second-layer nodes e are formed adaptively in a similar fashion as the nodes b. The min-point and the max-point of an exclusion hyperbox are identified when the overlap test is positive for two hyperboxes representing different classes. These values are stored as new entries in matrix S and matrix T, respectively. If the new exclusion hyperbox contains any of the previously identified exclusion hyperboxes, the included hyperboxes are eliminated from the set e. The connections between the nodes e and the nodes c are binary values stored in matrix R, whose elements are defined as

r_{lk} = 1 if e_l is a hyperbox overlapping with class c_k and 1 <= k <= p, and r_{lk} = 0 otherwise   (3)
Note that the third layer has p+2 nodes [c_0, c_1,...,c_p, c_(p+1)], with the node c_(p+1) representing the new exclusion-hyperbox class. The output of the third layer is now moderated by the output from the exclusion-hyperbox nodes e and the values of matrix R. The transfer function for the third-layer nodes is defined as

c_k = max(0, max_{j=1,...,m} b_j u_{jk} - max_{i=1,...,q} e_i r_{ik})   (4)
The second component in (4) cancels out the contribution from the overlapping hyperboxes that belonged to different classes. Since we allow both fuzzy and crisp outputs of the class layer we restrict the minimum value of ck to 0.
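A minimal numerical sketch of expression (4) is given below, assuming the memberships b and e have already been computed; the array shapes and names are illustrative assumptions, not part of the original formulation.

    import numpy as np

    def efc_class_outputs(b, e, U, R):
        """Eq. (4): class activations moderated by exclusion hyperboxes.

        b : (m,) inclusion-hyperbox memberships
        e : (q,) exclusion-hyperbox memberships
        U : (m, p+2) binary inclusion-to-class connections
        R : (q, p+2) binary exclusion-to-class connections
        """
        inclusion = (b[:, None] * U).max(axis=0)                        # max_j b_j * u_jk
        exclusion = (e[:, None] * R).max(axis=0) if e.size else np.zeros_like(inclusion)
        return np.maximum(0.0, inclusion - exclusion)                   # clipped at 0 as in the text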
4 Numerical Example
The EFC was applied to a number of synthetic data sets and demonstrated improvement over the GFMM and the original FMM [5]. As a representative example, we illustrate the performance of the network using the Iris data-set. The network was trained using the first 75 patterns and the EFC performance was checked using the remaining 75 patterns. The results for FMM have been obtained using our implementation of the FMM algorithm which produced results consistent with those reported in [5]. The results are summarized in Table 1. Table 1. Comparison of performance of FMM, GFMM and EFC
Performance criterion                    FMM [5]     GFMM [3]    EFC
Recognition rate (range)                 97.33-92%   100-92%     100-97%
Number of hyperboxes (max. size 0.06)    32          29          18
Number of hyperboxes (max. size 0.03)    56          49          34
It is clear that the number of misclassifications has been significantly reduced while the number of hyperboxes has been reduced as well. This is due to the increased expressive power of the combined exclusion and inclusion hyperboxes. Further results will be reported at the conference.
Acknowledgments Support from the Engineering and Physical Sciences Research Council (UK), the Natural Sciences and Engineering Research Council of Canada (NSERC) and the Konan University is gratefully acknowledged.
References
[1] Bargiela A., Pedrycz W., Granular clustering with partial supervision, Proc. European Simulation Multiconference, ESM2001, Prague, June 2001, 113-120.
[2] Dubois D., Prade H., Fuzzy relation equations and causal reasoning, Fuzzy Sets and Systems, 75, 1995, 119-134.
[3] Gabrys B., Bargiela A., General fuzzy min-max neural network for clustering and classification, IEEE Trans. Neural Networks, vol. 11, 2000, 769-783.
[4] Pedrycz W., Gomide F., An Introduction to Fuzzy Sets, MIT Press, Cambridge, MA, 1998.
[5] Simpson P.K., Fuzzy min-max neural networks - Part 1: Classification, IEEE Trans. Neural Networks, vol. 3, no. 5, 1992, 776-786.
[6] Simpson P.K., Fuzzy min-max neural networks - Part 2: Clustering, IEEE Trans. Neural Networks, vol. 4, no. 1, 1993, 32-45.
[7] Zadeh L.A., Toward a theory of fuzzy information granulation and its centrality in human reasoning and fuzzy logic, Fuzzy Sets and Systems, 90, 1997, 111-117.
Discovering Prediction Rules by a Neuro-fuzzy Modeling Framework Giovanna Castellano, Ciro Castiello, Anna Maria Fanelli, and Corrado Mencar Computer Science Department, University of Bari Via Orabona, 4 - 70126 Bari, Italy {castellano,castiello,fanelli,mencar}@di.uniba.it
Abstract. In this paper, we propose a neuro-fuzzy modeling framework to discover fuzzy rules and its application to predict chemical properties of ashes produced by thermo-electric generators. The framework is defined by several sequential steps in order to obtain a good predictive accuracy and the readability of the discovered fuzzy rules. First, a feature selection procedure is applied to the available data by discarding the features possessing lowest ranking in terms of their predictive power. Then, a competitive learning scheme is adopted to initialize a fuzzy rule base, which is successively refined by a neuro-fuzzy network trained on the available data. To improve accuracy, we applied the process on each ash property to be predicted, hence obtaining a set of MISO models that are both accurate and transparent, as shown by the reported experimental results.
1 Introduction
In the last decade, neuro-fuzzy systems have found wide success due to their hybrid nature, which allows the acquisition of accurate knowledge from numerical data via neural learning together with the ability to represent it in a fuzzy rule-based structure, so that it can be classified as "human readable". In this paper we present a neuro-fuzzy modeling framework intended to discover prediction rules for a very complex industrial problem: the derivation of the properties of ashes originated from combustion processes for electric generation. This particular task can be considered of peculiar significance, since the study of discarded substances could lead to relevant results in the fields of environmental impact evaluation and material reuse. In our experimentation we propose to analyse the different typologies of fuels, undergoing some specified combustion processes, to derive the final properties of the resulting ashes by means of a fuzzy rule base generated on purpose. The inherent difficulty of the problem (which concerns the study of a complicated chemical process) is further accentuated if we consider the configuration of the dataset we worked with: it is composed of only 54 instances, each of them consisting of 32 input features and 22 output features. To avoid the well-known "curse of dimensionality" problem [1], a feature selection procedure becomes indispensable, since it can improve the generalization ability of
the resulting model and, at the same time, simplify the structure of the fuzzy rules to be generated, making them more intelligible to human users. The neuro-fuzzy approach is therefore based on a multistep strategy: once the feature selection process has been performed, the fuzzy inference model is automatically produced starting from input/output data, and is subsequently modified to improve its accuracy. A particular neural network (the neuro-fuzzy network) has been designed that reproduces in its topology the structure of the fuzzy rule base, in order to initialize and refine it employing neural learning. In the initialization stage, in fact, a clustering of input data is performed by means of an unsupervised learning of the neuro-fuzzy network. A supervised learning process is then applied for the consequent enhancement of the fuzzy rule base. The paper is organized as follows: the general scheme of the neuro-fuzzy modeling framework is illustrated in the next section. Then the feature selection process is detailed in Section 3. The fourth section describes the learning strategy for the neuro-fuzzy network, and experimental results are reported in Section 5. Finally, in the concluding section, some conclusions are drawn together with indications for future work.
2 The Neuro-fuzzy Modeling Framework
To approach the ash property prediction problem a MISO strategy has been adopted, consisting in producing a fuzzy rule base for each of the 22 outputs to be derived, corresponding to the 22 main features we want to predict at the end of the combustion process. In this way we aim at obtaining 22 fuzzy inference models, each of them constituted by a rule base whose fuzzy rules have the following form:

R_k: IF x_1 is A_k1 AND ... AND x_n is A_kn THEN y is b_k,   k = 1,...,K,   (1)
where x = (x_1,...,x_n) are the input variables and y is the output variable, A_ki are fuzzy sets defined on the input variables and b_k is a fuzzy singleton defined on the output variable (the example above refers to a generic kth rule). The fuzzy sets A_ki are defined by Gaussian membership functions

mu_ik(x_i) = exp(-(x_i - c_ik)^2 / sigma_ik^2),

where c_ik and sigma_ik are the center and the width of the Gaussian function. When an input vector x is presented to the prediction model, the activation strength of the kth rule is first evaluated as mu_k(x) = prod_{i=1,...,n} mu_ik(x_i), k = 1,...,K, and the output y is then obtained by

y = sum_{k=1,...,K} mu_k(x) b_k / sum_{k=1,...,K} mu_k(x)   (2)
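The inference of equation (2) can be sketched in a few lines of Python; the parameter names and shapes below are assumptions made for illustration only.

    import numpy as np

    def predict(x, centers, widths, singletons):
        """Zero-order Sugeno-style inference of Section 2 (illustrative shapes).

        centers, widths : (K, n) Gaussian parameters c_ik, sigma_ik for K rules and n inputs
        singletons      : (K,)  output singletons b_k
        """
        mu = np.exp(-((x - centers) ** 2) / widths ** 2)   # membership of each input in each rule
        strength = mu.prod(axis=1)                         # rule activation mu_k(x)
        return (strength * singletons).sum() / strength.sum()

    # toy usage with two rules over two inputs
    centers = np.array([[0.0, 1.0], [1.0, 0.0]])
    widths = np.full((2, 2), 0.5)
    print(predict(np.array([0.2, 0.8]), centers, widths, np.array([10.0, 20.0])))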
The described inference system is translated into a particular neural network, the neuro-fuzzy network, which reflects in its topology the structure of the fuzzy rule base. In particular, the neuro-fuzzy network consists of four layers performing: (i) the introduction of the input values; (ii) the evaluation of the Gaussian membership functions; (iii) the computation of the fulfillment degree for each rule; (iv) the production of the final output. The described neuro-fuzzy network is depicted in Fig. 1.

Fig. 1. The neuro-fuzzy network
3 Feature Selection
In order to perform feature selection, we first apply a feature ranking procedure to sort features by their predictive power. The sorted feature list can be used to discard less predictive features, as well as to eliminate highly correlated features that do not bring any additional information to the prediction process. Moreover, the sorted feature list can be used by domain experts to discard or keep features that are physically involved in the determination of the property to be predicted. Here, we describe a procedure that is similar to the one proposed in [2], [3], because of its adherence to the proposed modeling framework. This technique assumes that the predictive power of a feature hinges on the model to be used. The procedure requires the generation of a number of neuro-fuzzy models for each property to be predicted. Since such neuro-fuzzy models are only used for feature ranking purposes, their generation speed is much more important than their predictive accuracy. As a consequence, the generation of neuro-fuzzy models must be as fast as possible, but still adequate to the problem in hand. To reduce possible overfitting of the generated models, the available dataset has been split into a training set, used for model generation, and a test set for accuracy evaluation. To generate neuro-fuzzy models in a fast yet accurate way, the following simple clustering procedure is applied to define the initial structure of each model. The algorithm performs only one cycle over the training set, which implies rapid termination, and provides a set of multidimensional prototypes defined in the input/output product space.
Given a training set T = {z_1, z_2, ..., z_N}, z_j = (x_j, y_j):
1. Set the first prototype w_1 := z_1
2. Set nt := 1, NS_1 := 1
3. For each j = 1, 2, ..., N:
   (a) Find the nearest prototype w_k such that ||w_k - z_j|| = min_{i=1,...,nt} ||w_i - z_j||
   (b) If ||w_k - z_j|| <= delta, then
       i. Set NS_k := NS_k + 1
       ii. Set w_k := w_k + (z_j - w_k)/NS_k
   (c) else
       i. Set nt := nt + 1
       ii. Create a new prototype w_nt := z_j
4. End for
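A possible Python rendering of this one-pass prototype clustering is given below; the variable names and the running-mean update are a straightforward reading of the listed steps, not the authors' code.

    import numpy as np

    def one_pass_prototypes(Z, delta):
        """One cycle over the training set, returning multidimensional prototypes.

        Z     : (N, d) array of joint input/output samples z_j = (x_j, y_j)
        delta : distance threshold for assigning a sample to an existing prototype
        """
        prototypes = [Z[0].astype(float)]
        counts = [1]
        for z in Z[1:]:
            dists = [np.linalg.norm(w - z) for w in prototypes]
            k = int(np.argmin(dists))
            if dists[k] <= delta:
                counts[k] += 1
                prototypes[k] += (z - prototypes[k]) / counts[k]   # running mean update
            else:
                prototypes.append(z.astype(float))
                counts.append(1)
        return np.array(prototypes)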
Once the prototypes are available, Gaussian fuzzy sets can be defined by setting the centers to coincide with the prototypes (with the exception of the last dimension, which corresponds to the output), while the widths are assumed to be equal for each fuzzy set. The value of the common width is chosen from a range of possible values by selecting the width that corresponds to the most accurate neuro-fuzzy model. The results of the clustering algorithm mainly depend on the parameter delta, whose optimal value is selected by generating a number of neuro-fuzzy models with different values of delta and choosing the network with the best accuracy on the test set.
Once features have been ranked by the corresponding relevancy index, a feature selection procedure can be applied in order to provide a small yet useful subset of features for the design of the final neuro-fuzzy model by more accurate techniques. A first preliminary selection of features can be performed by discarding highly correlated features. When two features have a high correlation, it is possible to discard the feature with the smallest relevancy index or it can be chosen by a domain expert. The ranked feature list, eventually cleaned from highly correlated features, is used to determine the final subset of features as follows. A number of neuro-
fuzzy models is generated with an increasing number of features, from the simplest model with only the most relevant feature to the model with all features. Eventually, the domain expert can decide to force some features to be always selected in all models for physical reasons. The parameters of such models can be directly inherited from the neuro-fuzzy model selected for the feature ranking. All models are then evaluated on the test set; only the one with the best accuracy is chosen and the corresponding subset of features is selected. The feature ranking and selection procedures are repeated for each property to be predicted, in order to obtain a different selection of features for each property, useful for the design of more refined neuro-fuzzy models in the subsequent steps of the proposed framework.
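The ranking step can be sketched as follows, reusing the Gaussian-membership inference of Section 2; all names and shapes are illustrative assumptions rather than the authors' implementation.

    import numpy as np

    def relevancy_indices(X_train, centers, widths, singletons):
        """Rank features by the output spread obtained when only one feature is 'unlocked'.

        For each feature i, the memberships of all other features are forced to 1.0,
        the model output is computed over the training set, and the spread
        (max - min) of that output is recorded.  Indices are normalised by the
        largest spread.
        """
        n_features = X_train.shape[1]
        spreads = np.zeros(n_features)
        for i in range(n_features):
            outputs = []
            for x in X_train:
                mu = np.ones_like(centers)                  # lock every feature ...
                mu[:, i] = np.exp(-((x[i] - centers[:, i]) ** 2) / widths[:, i] ** 2)  # ... except feature i
                strength = mu.prod(axis=1)
                outputs.append((strength * singletons).sum() / strength.sum())
            spreads[i] = max(outputs) - min(outputs)
        return spreads / spreads.max()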
4 The Learning Strategy of the Neuro-fuzzy Network
The training process of the neuro-fuzzy network involves two distinct steps, which respectively correspond to the identification and the subsequent refinement of the fuzzy rule bases. The available observational data are employed in this phase, adopting the reduced set of input variables obtained at the end of the feature selection process detailed above. A competitive learning scheme is devoted to the identification of the fuzzy models: the number of rules and the initial parameters are derived via an unsupervised learning of the neuro-fuzzy network. This procedure performs a clustering of the input data and has the useful property of automatically obtaining the proper number of clusters (corresponding to the number of rules of the fuzzy model) starting from a guessed number. When an input is presented to the network, the nodes in the third layer compete: the node whose weight vector is closest to the input vector in terms of Euclidean distance is classified as the "winner", and the second closest as the "rival". The weight vector of the winner is then modified to shift it closer to the input vector, while the rival weight vector is moved farther away. This rival-penalized mechanism is performed in order to associate only one weight vector with each cluster. The structure and the weights of the trained network supply the basis for deriving an initial fuzzy rule base, whose performance is enhanced in terms of accuracy by the second step of the network training. A supervised learning process, based on a gradient descent technique, is performed to adjust the network weights and consequently to tune the fuzzy rule parameters, improving the final prediction results. More details regarding the described learning strategy can be found in [4, 5].
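A minimal sketch of one rival-penalised competitive learning update is shown below, with illustrative learning rates; the actual scheme used for the neuro-fuzzy network (see [4, 5]) may differ in detail.

    import numpy as np

    def rpcl_update(weights, x, lr_winner=0.05, lr_rival=0.002):
        """One rival-penalised competitive learning step (illustrative parameters).

        The node closest to x (the 'winner') is moved towards the input; the second
        closest (the 'rival') is pushed slightly away, which helps allocate exactly
        one weight vector per cluster.  'weights' must be a float array with at
        least two rows.
        """
        d = np.linalg.norm(weights - x, axis=1)
        winner, rival = np.argsort(d)[:2]
        weights[winner] += lr_winner * (x - weights[winner])
        weights[rival] -= lr_rival * (x - weights[rival])
        return weights

    # toy usage: five candidate cluster nodes in a 3-D input space
    W = np.random.rand(5, 3)
    W = rpcl_update(W, np.array([0.2, 0.7, 0.1]))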
5 Experimental Results
The dataset used in the modeling framework is provided by "ENEL Produzione e Ricerca S.p.A." (Italy).
Table 1. Comparison between the two groups of neuro-fuzzy models (final MSE values)

output variable    MISO without feature selection    MISO with feature selection
                   Training Set    Test Set          Training Set    Test Set
 1 Al              0,21            0,24              0,04            0,05
 2 Ca              0,48            0,28              0,30            0,16
 3 Fe              0,09            0,49              0,05            0,63
 4 Mg              0,21            0,19              0,23            0,26
 5 Na              0,21            0,10              0,15            0,24
 6 P               0,27            0,29              0,21            0,28
 7 S               0,33            0,39              0,17            0,43
 8 Si              0,54            0,23              0,14            0,29
 9 Ti              0,30            0,35              0,05            0,12
10 Ba              0,44            0,47              0,11            0,16
11 Co              0,37            0,13              0,05            0,07
12 Cr              0,46            0,25              0,43            0,28
13 Cu              0,13            0,93              0,13            0,86
14 K               0,27            0,13              0,04            0,54
15 Mn              0,56            0,22              0,51            0,22
16 Ni              0,02            0,38              0,05            0,16
17 Pb              0,46            0,42              0,26            0,21
18 Sr              0,19            0,20              0,18            0,19
19 V               0,26            0,28              0,03            0,08
20 Zn              0,27            0,19              0,09            0,05
21 D50             0,56            0,16              0,49            0,17
22 LOI             0,25            0,53              0,17            0,57
Besides the intrinsic complexity of the problem in hand, due to the low cardinality of the dataset compared to the high number of attributes, other problems have to be dealt with, namely the presence of missing values and the highly different ranges of the input variables. To overcome these problems, the data are pre-processed by filling missing values with mean values and by normalizing all ranges with a rescaling technique. In particular, each input and output variable has been transformed according to z~_i = (z_i - mean(z_i))/std(z_i). To derive the predictive models for ash property prediction, the available dataset has been split into a training set of 31 samples and a test set with the remaining 23 samples. The neuro-fuzzy models are built according to the described framework and, to validate the proposed feature selection technique, a second group of neuro-fuzzy models has been defined with all 32 features. Table 1 lists the obtained results together with a comparison of the two groups in terms of Mean Squared Error evaluated both on the training and the test set. The table also reports the names of the 22 output variables (chemical elements) we want to predict. In both experiments the same parameter configurations have been employed for the structure initialization phase, and the supervised learning processes of the neuro-fuzzy networks have been carried out for 1000 epochs. All the fuzzy rule bases obtained adopting the MISO approach without feature selection have 8 rules with 32 input variables. Table 2 reports the number of rules and selected inputs for each of the fuzzy models generated with the feature selection procedure.
6 Conclusions
The proposed modeling framework provides an effective tool for automatically generating predictive neuro-fuzzy models from data. As shown by the experimental results, the resulting neuro-fuzzy models have a simple structure and satisfactory prediction accuracy, despite the high complexity of the prediction problem.
Table 2. Number of selected features and derived rules for each predicted output

            Al  Ca  Fe  Mg  Na   P   S  Si  Ti  Ba  Co
# features   4   9   8   6   3  11   4   3   3  10   8
# rules     11  10  15  12  10  11  12  14  11   9  10

            Cr  Cu   K  Mn  Ni  Pb  Sr   V  Zn  D50 LOI
# features  11  10   2   6  10   9   8   3  10   8  10
# rules      9  10  11   8  11  13   7  10   8   8  12
The modeling framework is not tailored to a specific problem, but can be used as a general-purpose framework whenever a prediction problem is to be solved by means of fuzzy rule-based systems. Future research is in the direction of defining interval-valued rule outputs, so as to provide more meaningful results in the prediction process.
References
[1] Bellman, R.: Adaptive Control Processes: A Guided Tour (1961)
[2] Linkens, D. A., Chen, M. Y.: Input selection and partition validation for fuzzy modelling using neural network. Fuzzy Sets and Systems, 107(3) (1999) 299-308
[3] Linkens, D. A., Chen, M. Y.: A Systematic Neuro-Fuzzy Modeling Framework With Application to Material Property Prediction. IEEE Trans. on Systems, Man and Cybernetics, 31(5) (2001) 781-790
[4] Castellano, G., Fanelli, A. M.: A self-organizing neural fuzzy inference network. In: Proc. of IEEE Int. Joint Conference on Neural Networks (IJCNN 2000), Como, Italy, 5 (2000) 14-19
[5] Castellano, G., Fanelli, A. M.: Information granulation via neural network based learning. In: Proc. of Joint 9th IFSA World Congress and 20th NAFIPS International Conference (IFSA-NAFIPS 2001), Vancouver, Canada. Invited paper (2001)
SAMIR: Your 3D Virtual Bookseller F. Zambetta, G. Catucci, F. Abbattista, and G. Semeraro Dipartimento di Informatica, University of Bari, Italy Via E. Orabona, 4 I-70125 – Bari (I) {zambetta,fabio,semeraro}@di.uniba.it [email protected]
Abstract. Intelligent web agents that exhibit a complex behavior, i.e. an autonomous rather than a merely reactive one, are gaining popularity daily, as they allow a simpler and more natural interaction metaphor between the user and the machine, entertaining him/her and giving, to some extent, the illusion of interacting with a human-like interface. In this paper we describe SAMIR, an intelligent web agent satisfying the objectives listed above. It uses: a 3D face animated via a morph-target technique to convey expressions to be communicated to the user; a slightly modified version of the ALICE chatterbot to provide the user with dialoguing capabilities; and an XCS classifier system to manage the consistency between the conversation and the face expressions. We also show some experimental results obtained by applying SAMIR to a virtual bookselling scenario involving the well-known Amazon.com site.
1 Introduction
Intelligent virtual agents are software components designed to act as virtual advisors in applications, especially web applications, where a high level of human-computer interaction is required. Indeed, their aim is to replace the classical WYSIWYG interfaces, which are often difficult for casual users to manage, with reactive and possibly pro-active virtual ciceros able to understand users' wishes and converse with them, find information and execute non-trivial tasks usually activated by pressing buttons and choosing menu items. Frequently these systems are coupled with an animated 2D/3D look-and-feel, embodying their intelligence via a face or an entire body. This way it is possible to enhance users' trust in these systems by simulating a face-to-face dialogue, as reported in [1]. A very complete agent of this kind, frequently called an ECA (Embodied Conversational Agent), is REA [1], a Real Estate Agent able to converse with the users and sell them a house that complies with their wishes and needs. The interaction occurs in real time via sensors acquiring user facial expressions and hand pointing; moreover, speech recognition is performed so that users do not need to type their requests. REA answers using its body posture, its facial expressions and digitized
sounds rendering the salesperson's recommendations and utterances. The EMBASSI system [2] was created by a large consortium charged with defining the technologies and their ergonomic requirements to implement an intelligent shop assistant facilitating user purchasing and information retrieval. These objectives are pursued by multi-modal interaction: common text dialogs as well as speech recognition devices are used to sense user requests, while a 3D face is the front-end of the system. The agent is also able to send its response via classical multimedia content (video clips, hyperlinks, etc.). In [3] an agent is described which is not just able to be animated but also to answer users based on emotions modeled on the Five Factor Model (FFM) of personality [4] and implemented using Bayesian Belief Networks. Moreover, the ALICE chatterbot is used in order to let the web agent process and generate responses in the classical textual form. The MS Office assistants deserve to be mentioned in the agent panorama because of their widespread use, even though they are quite shallow in some respects. Sometimes they are too invasive and do not exhibit a very complex behavior; however, they suffice to help the inexperienced user in many situations. They are based upon Microsoft Agent technology, which enables multiple applications, or clients, to use animation services such as loading a character, playing a specified animation and responding to user input. Spoken text appears in a word balloon, but several TTS (text-to-speech) engines may be used to play it. Another example of a Microsoft Agent is IMP (Instant Messaging Personalities, see http://www.eclips.com.au), which gives a very good idea of how an agent can help users in handling all Microsoft Messenger features such as mail checking, instant messaging and so on. The agents have lip-synced faces pronouncing the messages the user receives, and they tell the user if any of his/her contacts have just gone online, offline, away, etc. A general observation is that the mentioned systems, though interesting, are generally heavy to implement, difficult to port onto different platforms, and usually not embeddable in Web browsers. We pursue a light solution, which should be portable, easy to implement and fast enough in medium-sized computer environments. In this paper we present the SAMIR (Scenographic Agents Mimic Intelligent Reasoning) system, a digital assistant where an artificial-intelligence-based Web agent is integrated with a purely 3D humanoid, robotic, or cartoon-like layout [5]. The remainder of the paper is organized as follows. Section 2 describes the architecture of SAMIR and details its three main modules. Some examples of SAMIR in action are given in Section 3. Finally, conclusions are drawn in Section 4.
2 The Architecture of SAMIR
SAMIR (see Figure 1) is a client-server application, composed of 3 main sub-systems detailed in the next sections: the Data Management System (DMS), the Behavior Manager and the Animation System. The DMS is responsible for directing the flow of information in our system: When the user issues a request from the web site, via a common form, an HTTP request is
directed to the DMS Server to obtain the HTTP response storing the chatterbot answer. At the same time, based on the events raised by the user on the web site and on his/her requests, a communication between the DMS and the Behavior Manager is set up. This results in a string encoding the expression the Animation System should assume. This string specifies coefficients for each of the possible morph targets [6] in our system: we use some high-level morph targets corresponding to the well-known fundamental expressions [7], but even low-level ones are a feasible choice in order to preserve full MPEG-4 compliance. After this interpretation step, a key-frame interpolation is performed to animate the current expression.
Fig. 1. The Architecture of SAMIR
2.1 The Animation System
The FACE (Facial Animation Compact Engine) Animation System is an evolution of the Fanky Animation System [8]. FACE was conceived keeping in mind lightness and performance so that it supports a variable number of morph targets: For example we currently use either 12 high-level ones or the number of the entire “low-level” FAP set, in order to achieve MPEG-4 compliance [9]. Using just a small set of high-level parameters might be extremely useful when debugging the behavior module because it is easier to reason about behavioral patterns in terms of expressions rather than longer sets of facial parameters. Besides it is clear that using a reduced set of parameters might avoid bandwidth limitations and this can be a major advantage in porting this animation module to a small device such as a Pocket PC [10], a process we are beginning to experiment with.
An unlimited number of timelines can be used, allocating one channel for stimulus-response expressions, another for non-conscious eye-lid reflexes, another for non-conscious head reflexes, and so on. We are integrating a TTS engine into our system and, for this reason, another channel will be used for visemes. Figure 2 illustrates some expressions taken from an animation stream. We are also developing a custom editor able to perform the same tasks as FaceGen while giving more control to the user: this way, we believe, both inexperienced and experienced users might enjoy the process of creating a new face tailored to their wishes, using specific low-level deformation tools based upon the well-known FFD technique [11].
Fig. 2. Some expressions assumed by a 3D face
Fig. 3. The DMS Architecture

2.2 The Dialogue Management System
The DMS (Dialogue Management System) is responsible for the management of user dialogues and for the extraction of the information necessary for book searching. The DMS can be viewed as a client-server application composed mainly of two software modules communicating through the HTTP protocol (see Figure 3). The client-side application is a simple Java applet whose main aim is to let the user type requests in natural language and send them to the server-side application for processing. The other important task it performs is retrieving
specific information on the World Wide Web, based on the responses elaborated by the server-side application, through JavaScript technology. On the server side we have the ALICE Server Engine, enclosing all the knowledge and the core system services needed to process user input. ALICE is an open-source chatterbot developed by the ALICE AI Foundation and based on the AIML language (Artificial Intelligence Markup Language), an XML-compliant language that gives us the opportunity to exchange dialogue data through the World Wide Web. ALICE has been fully integrated into SAMIR as a Java Servlet, and all the knowledge of the system has been stored in AIML files containing all the patterns matching user input. Dialogue data are exchanged through simple built-in classes handling classical HTTP socket communication. In order to obtain a system that lets users navigate a bookshop web site, we wrote some AIML categories aimed at book searching and shopping. An AIML category is a simple unit which contains a "pattern" section for matching user input and a corresponding "template" section containing an adequate response and/or action (i.e. a JavaScript execution) to user requests. Our categories were chosen to cover a very large set of the possible ways of requesting a given book in a real bookstore. We considered a set of seven fields that let a user specify the books he/she is interested in: the book title, author, publisher, publication date, subject, ISBN code and a more general keyword field. Successful examples of book requests for the Amazon bookshop web site are the following: I want a book written by Sepulveda; I am looking for a book entitled Journey and whose author is Celine; I am searching for all books written by Henry Miller and published after 1970; I am interested in a book about horror; or, in alternative forms, it is possible to send requests like: Could you find some book written by Fernando Pessoa?; Search all books whose author is Charles Bukowski; Give me the book whose ISBN code is 0-13-273350-1; Look for some book whose subject is fantasy. Clearly, AIML categories are not well suited to user requests that exhibit a high level of ambiguity due to the peculiar characteristics of human language interaction.
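As a toy illustration of the pattern/template idea behind an AIML category (and not of the actual ALICE engine), the following Python sketch matches a wildcard pattern against the user input and fills a response template; the patterns and responses are invented examples.

    import re

    # each entry pairs a wildcard pattern with a response template
    CATEGORIES = [
        (r"I WANT A BOOK WRITTEN BY (.+)", "searching Amazon for books by {0} ..."),
        (r"GIVE ME THE BOOK WHOSE ISBN CODE IS (.+)", "searching Amazon for ISBN {0} ..."),
    ]

    def respond(user_input):
        text = user_input.upper().strip(" ?.!")
        for pattern, template in CATEGORIES:
            match = re.fullmatch(pattern, text)
            if match:
                return template.format(*match.groups())
        return "Sorry, I did not understand your request."

    print(respond("I want a book written by Sepulveda"))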
2.3 The Behavior Generator
The Behavior Generator aims at managing the consistency between the facial expression of the character and the conversation tone. The module is mainly based on Learning Classifier Systems (LCS), a machine learning paradigm introduced by Holland in 1976 [12]. The learning module of SAMIR has been implemented through an XCS [13], a newer kind of LCS which differs in many aspects from Holland's traditional framework. The most appealing characteristic of this system is that it is closely related to Q-learning, but it can generate task representations that are more compact than tabular Q-learning [14]. At discrete time intervals, the agent observes a state of the environment, takes an action, observes a new state and finally receives an immediate reward. The basic components of an XCS are: the Performance Component, which, on the basis of the detected state of the environment, selects the best action to be performed; the Reinforcement Component, whose aim is to evaluate the reward to be
assigned to the system; and the Discovery Component, which, in case of degrading performance, is devoted to the evolution of new, better-performing rules. The environment in which SAMIR has to act is represented by the user dialogue (the higher the user satisfaction, the higher the reward received by SAMIR). At the very beginning of its life, the behavior of SAMIR is controlled by a set of randomly generated rules and, consequently, its capability is very low. Behavior rules are expressed in the classical format "if <condition> then <action>", where <condition> (the state of the environment) represents a combination of 4 possible events, sensed by 4 detectors, representing different conversation tones such as: user salutation (the user performs/does not perform a salutation), user request formulation to the agent (no request, polite, impolite), user compliments/insults to the agent (no compliment, a compliment, an insult, foul language), and user permanence in the Web page (the user changes/does not change the page), while <action> represents the expression that the Animation System displays during user interaction. In particular, the expression is built as a linear combination of a set of fundamental expressions that includes the basic emotion set proposed by Paul Ekman, namely anger, fear, disgust, sadness, joy, and surprise [7]. Other emotions and many combinations of emotions have been studied but remain unconfirmed as universally distinguishable. However, the basic set of expressions has been extended in order to include some typical human expressions such as bother, disappointment and satisfaction. The Behavior Manager, as explained above, is able to produce synthetic facial expressions to be shown according to the content of the ongoing conversation. Thus the <action> part provides the Animation System with the percentage of each one of the expressions, to be used to compose the desired expression of our character. For example, an expression composed of 40% joy and 60% surprise is coded into the following string:

0100 0000 0110 0000 0000 0000 0000 0000 0000
%Surprise %Sadness %Joy %Fear %Disgust %Anger %Bother %Disappointment %Satisfaction
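A hypothetical sketch of how such a behavior rule could pair a condition over the four sensed events with an action given as expression percentages is shown below; the encoding, names and the '#' don't-care symbol follow common XCS practice and are assumptions, not the system's actual representation.

    from dataclasses import dataclass

    # Hypothetical encoding: the condition is a 4-character string over the four
    # sensed events (salutation, request tone, compliments/insults, page change),
    # with '#' as the usual XCS "don't care" symbol; the action gives percentages
    # of the nine expressions.
    @dataclass
    class BehaviorRule:
        condition: str   # e.g. "1#0#": salutation given, any request tone, no insult, any page event
        action: dict     # e.g. {"joy": 40, "surprise": 60}

        def matches(self, state: str) -> bool:
            return all(c == "#" or c == s for c, s in zip(self.condition, state))

    rule = BehaviorRule("1#0#", {"joy": 40, "surprise": 60})
    print(rule.matches("1100"))   # True: greeting, polite request, no insult, page unchanged
    print(rule.matches("1120"))   # False: the third detector senses an insult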
During its life, SAMIR performs several interactions with different users and, on the basis of the received reward, the XCS evolves better behavior rules in order to achieve ever better performance. In a preliminary phase we defined a set of 30 interaction rules covering different situations that can occur in the course of an interaction (actions performed, user requests, etc.). This set of pre-defined rules represents the training set, that is, the minimal know-how that SAMIR should possess to start its work on the Web site. To evaluate the performance of the system we performed some preliminary experiments aimed at verifying the ability of the User Interaction module to learn the set of 30 predefined rules. Due to the inherent features of the XCS, SAMIR has been able to learn the pre-defined rules of behavior quite effectively and to generalize some new behavioral patterns that could update the initial set of rules. In this way, SAMIR is comparable with a human assistant who, after a preliminary phase of training, continues to learn new rules of behavior on the basis of personal experience and interaction with human customers.
3 Experimental Results
In this section we present some experimental results obtained from the interaction between SAMIR and some typical users searching for books on topics like literature, fantasy and horror, or for more specific books for which information such as title, author and publisher is given. When the user connects to the site, SAMIR presents itself and asks the user his/her name for authentication and recognition. Figure 4 shows the results of a user request about a horror book. The result of the query is a set of books in this genre available at the Amazon book site. It can be noticed that the book ranked first is by the author Stephen King. Figure 5 is an example of a more sophisticated query in which there is a request for Henry Miller books published after 1970. In this case the user heavily insults SAMIR and, consequently, its expression becomes angry.
Fig. 4. Requesting a horror book
4 Conclusions
In this paper we presented a first prototype of a 3D agent able to support users in searching for books on a Web site. At present, the prototype is linked to a specific site, but we are currently implementing an improved version that will be able to query several Web bookstores simultaneously and to report to users a comparison based on different criteria such as prices, delivery times, etc. Moreover, our future work will aim at giving a more natural behavior to our agent. This can be achieved by improving the dialogues and, eventually, the text processing
capabilities of the ALICE chatterbot, and by giving the agent a fully proactive behavior: the XCS should be able not only to learn new rules to generate facial expressions, but also to modify dialogue rules, suggest interesting links and supply effective help during site navigation.
Fig. 5. All books by Henry Miller published after 1970
References
[1] J. Cassell et al. (Eds.), Embodied Conversational Agents. MIT Press, Cambridge, 2000.
[2] M. Jalali-Sohi and F. Baskaya, A Multimodal Shopping Assistant for Home E-Commerce. In: Proceedings of the 14th Int'l FLAIRS Conf. (Key West FL, 2001), 2-6.
[3] S. Kshirsagar and N. Magnenat-Thalmann, Virtual Humans Personified. In: Proceedings of the Autonomous Agents Conference (AAMAS), (Bologna, 2002).
[4] R. R. McCrae and O. P. John, An introduction to the five factor model and its applications, J. of Personality, 60 (1992), 175-215.
[5] F. Abbattista, A. Paradiso, G. Semeraro, F. Zambetta, An agent that learns to support users of a web site. In: R. Roy, M. Koeppen, S. Ovaska, T. Furuhashi and F. Hoffmann (Eds.), Soft Computing and Industry: Recent Applications, Springer, 2002, 489-496.
[6] B. Fleming and D. Dobbs, Animating Facial Features and Expressions. Charles River Media, Hingham, 1998.
[7] P. Ekman, Emotion in the human face. Cambridge University Press, Cambridge, 1982.
[8] A. Paradiso, F. Zambetta, F. Abbattista, Fanky: a tool for animating 3D intelligent agents. In: A. de Antonio, R. Aylett, D. Ballin (Eds.), Intelligent Virtual Agents (Madrid, 2001), Springer, Berlin, 242-243.
[9] MPEG-4 standard specification. http://mpeg.telecomitalialab.com/standards/mpeg-4/mpeg-4.htm
[10] The Pocket PC website. http://www.microsoft.com/mobile/pocketpc/default.asp
[11] T. W. Sederberg, Free-Form Deformation of Solid Geometric Models, Computer Graphics, 20(4) (1986), 151-160.
[12] J. H. Holland, Adaptation. In: R. Rosen and F. M. Snell (Eds.), Progress in Theoretical Biology, New York: Plenum, 1976.
[13] S. W. Wilson, Classifier Fitness Based on Accuracy. Evolutionary Computation, 3(2) (1995), 149-175.
[14] C. J. C. H. Watkins, Learning from delayed rewards, PhD thesis, University of Cambridge, Psychology Department, 1989.
Detection and Diagnosis of Oscillations in Process Plants
Toru Matsuo (1), Hideki Sasaoka (2), and Yoshiyuki Yamashita (3)
(1) Technical Dept., Mitsui Chemicals Inc., 30 Asamuta-machi, Omuta, 836-8610 Japan
(2) Research & Development Headquarters, Yamatake Corporation, 1-12-2 Kawana, Fujisawa, 251-8522 Japan
(3) Department of Chemical Engineering, Tohoku University, 07 Aramaki Aoba, Sendai 980-8579, Japan
Abstract. A wavelet analysis tool is utilized for the detection of oscillations in actual industrial chemical plants. It can detect multiple oscillations in measurement signals and provides valuable information for the diagnosis. For each of the detected oscillatory control loops, the root cause of the oscillation is diagnosed by the analysis of the input-output characteristics of a valve. Finally, the combination of the wavelet tool and the analysis of MV-PV plots is shown to be a powerful practical method for the detection and diagnosis of oscillations in process plants.
1 Introduction
To achieve high-quality production with efficient and stable operation, many control loops are embedded in chemical plants. Recently, the requirements for quality and productivity have been increasing, and superior control performance is strongly required to satisfy them. It is also important to find control loops that may become excessively sluggish at their earliest stage, in order to prepare possible solutions to the problem. Causes of the poor performance of a controller can be classified into software problems and hardware problems. The typical software problem is insufficient tuning of the PID parameters. An inappropriate combination of P, I and D values can degrade the performance of the controller, causing sluggish response to set-point changes and vulnerability to disturbances. The relation between PID parameters and control behavior has been studied for many years, and various PID tuning rules have been proposed. The typical hardware problem is related to valve actions. In principle, any nonlinear response in a control system can produce sustained oscillations. In practice, most of the causes of the oscillations are the stiction (sticking and friction) or hysteresis of valves. A bad environment and poor maintenance of valves often produce large oscillations in a plant and severely decrease its productivity [1]. Once the oscillatory behavior has been detected, the mechanism causing the oscillation is often obvious. The problem is that oscillations are not detected easily in an actual plant. Oscillatory signals are sometimes hidden under
many kinds of other signals and become difficult to detect. To detect the oscillatory signals, spectral analysis with the FFT has been widely used. The method was applied to plant-wide monitoring of oscillations and root-cause analysis [2]. Recently, wavelet analysis has become a powerful tool for the detection and diagnosis of oscillatory behavior [3]. It can extract information in the time-frequency domain and visualize the characteristics of the signal.
2 Detection and Characterization of Oscillations
Before dealing with actual plants, the effect of nonlinearities in a control valve is analyzed by simulations. The schematic diagram of the simulator, built in Simulink, is shown in Fig. 1. The plant consists of a first-order plus time-delay block. A backlash block is used for the valve unit so as to simulate a hysteresis. Three typical output patterns of the simulation are shown in Fig. 1: rectangular, triangular and sinusoidal patterns. These signals are also analyzed by a commercial wavelet analysis tool, Time2Wave. The wavelet tool is useful to extract the oscillatory components from a noisy signal. It is also useful to identify the timing of frequency changes and to investigate the harmonics. The characteristics of the oscillation can be summarized as follows:
- Oscillation occurs in the control system when hysteresis is involved in the valve behavior.
- The appearance and properties of harmonic components in the oscillation depend on the time constant and the disturbance of the system.
- Detection of the oscillation is difficult in the time domain when the signal includes disturbances.
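A rough Python reconstruction of this kind of simulation (PI control of a first-order plus dead-time plant with valve backlash) is sketched below; all numerical values are illustrative and not taken from the paper.

    import numpy as np

    def simulate(T=20.0, dead_time=5.0, backlash=2.0, Kp=0.8, Ki=0.05,
                 dt=1.0, n_steps=2000, setpoint=50.0):
        """Closed loop with PI control, valve backlash and a first-order lag plus delay."""
        pv = np.zeros(n_steps)          # process variable
        mv = np.zeros(n_steps)          # controller output
        valve = np.zeros(n_steps)       # valve position after backlash
        integral = 0.0
        delay = int(dead_time / dt)
        for k in range(1, n_steps):
            error = setpoint - pv[k - 1]
            integral += Ki * error * dt
            mv[k] = Kp * error + integral
            # backlash: the valve only moves once the controller output leaves the dead band
            if mv[k] > valve[k - 1] + backlash / 2:
                valve[k] = mv[k] - backlash / 2
            elif mv[k] < valve[k - 1] - backlash / 2:
                valve[k] = mv[k] + backlash / 2
            else:
                valve[k] = valve[k - 1]
            u = valve[k - delay] if k >= delay else 0.0
            pv[k] = pv[k - 1] + dt / T * (u - pv[k - 1])   # first-order lag
        return pv, mv

    pv, mv = simulate()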
3 Case Study 1

3.1 Process Description
To detect and diagnose the oscillatory behaviors in the plant, the wavelet tool was applied to an actual industrial plant. A part of the schematic diagram of the target plant is shown in Fig. 2. In this plant, gaseous reactant B is fed into the reactor and reacts with A to produce the liquid product C, where the reactant A is prepared in the reactor. To keep the pressure constant during the course of the reaction, it is controlled by the feed rate of the reactant gas B.
3.2 Detection of Oscillations
An example of the operational data is shown on the left side of Fig. 3. The pressure in the reactor seems to be maintained constant after the corresponding controller is turned on. A variation from the SV can be found by zooming in on the pressure signal, but it is still difficult to judge whether it is oscillatory or not.
[The figure shows the simulation loop (Setpoint -> PID -> Valve -> Plant, with an additive Disturbance) and typical outputs for T = 10 (rectangular), T = 20 (triangular) and T = 100 (sinusoidal), together with their fundamental and harmonic components.]
Fig. 1. Schematic diagram and typical outputs of the simulation with valve hysteresis
The wavelet tool converts the signals into the time-frequency view shown on the right side of Fig. 3. From this figure, it is obvious that three of these signals have the same oscillatory characteristic with a period of 35 minutes. This may not be noticed by watching only the pressure signal, but it becomes much easier to detect by watching other signals having the same characteristics. It also indicates that the cause of the oscillation can be found in the variable group having the same characteristics.
[The figure shows the reactor (A(L) + B(G) -> C(L)) fed with gaseous reactant B through the control valve PCV, the flow measurement F.PV, and the PID pressure controller signals PC.PV and PC.MV.]
Fig. 2. Process flow of the first case study plant
[The figure shows the time-frequency views of PC.PV, F.PV and PC.MV.]

Fig. 3. Detected oscillation of the first case study plant

3.3 Cause Analysis
The time region to be analyzed was selected with the wavelet tool and then analyzed by plotting the manipulated variable (MV) vs. the process variable (PV; the flow rate in this case). The shape of the trajectory of these plots is a parallelogram whose top and bottom sides are horizontal. This type of oscillation is very similar to the simulation result shown in Fig. 1. Therefore the cause of the oscillation is estimated to be hysteresis of the control valve.
[The figure plots the flow (PV) against MV for case study plant 1.]
Fig. 4. Detailed view of the oscillations in the first case study plant
[The figure shows the site-wide gas header (a long tube) feeding plants 1-4, with the flow F2, the header pressure P2 (PV) and the PID controller (P: 20, I: 360, D: 0) acting on the control valve (CV, the MV).]
Fig. 5. Process flow of the second case study plant
4 Case Study 2
Here is another example of the analysis of a part of an actual plant. This is the feedback pressure control system maintaining the pressure of the feeding header of a gas resource (Fig. 5). As shown in the time series in Fig. 6, variations of about one percent in pressure are observed. Neither oscillatory behavior nor a close relation between the PV and MV values is found from the time-domain data. By applying the conventional FFT method, it is easily found that these two variables have the same oscillatory characteristics with periods of 52 minutes and 103 minutes. The wavelet transform shows that the oscillation period changes over time. As shown in Fig. 7, the shape of the trajectory on the MV-PV plots again becomes a parallelogram, but its top and bottom sides are not horizontal in this case. Although the valve action seems to be different from that in case study 1, another kind of nonlinearity is shown in this figure. The observation of harmonic bands of 52 minutes and 103 minutes in the time-frequency domain suggests the existence of valve hysteresis with a relatively short time constant.
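A simple stand-in for the FFT screening step described above is sketched here: it returns the periods of the strongest spectral peaks of a signal; the sampling interval and test signal are illustrative assumptions.

    import numpy as np

    def dominant_periods(signal, dt_minutes=1.0, n_peaks=2):
        """Periods (in minutes) of the strongest spectral peaks of a detrended signal."""
        x = signal - np.mean(signal)
        spectrum = np.abs(np.fft.rfft(x))
        freqs = np.fft.rfftfreq(len(x), d=dt_minutes)
        order = np.argsort(spectrum[1:])[::-1] + 1      # strongest bins first, skipping DC
        return [1.0 / freqs[i] for i in order[:n_peaks]]

    # toy check: 52-minute and 103-minute oscillations buried in noise
    # (record length chosen as a multiple of both periods so the peaks fall on exact bins)
    t = np.arange(0.0, 2 * 52 * 103, 1.0)
    pv = np.sin(2 * np.pi * t / 52) + 0.5 * np.sin(2 * np.pi * t / 103) + 0.2 * np.random.randn(len(t))
    print(dominant_periods(pv))                          # approximately [52.0, 103.0]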
[The figure shows, for both PV and MV over three days, the time series, the FFT spectra (with peaks at 52 min and 103 min) and the wavelet transforms.]
Fig. 6. Detected oscillation of the second case study plant

[The figure plots the flow (PV) against MV for case study plant 2.]
Fig. 7. Detailed view of the oscillations in the second case study plant
5 Conclusion
A wavelet tool was applied to the detection of oscillations in actual process plants. It detects oscillations together with information about the frequency, amplitude, and time duration of their occurrence. Without this tool, detection of the oscillations would be difficult because of the noise. By plotting the manipulated variable (MV) vs. the process variable (PV), valve hysteresis was identified as the likely cause of the oscillation, and a simulation study with backlash in a valve confirmed this cause. As a result, the combination of the wavelet tool and the analysis of MV-PV plots provides a powerful practical method for the detection and diagnosis of oscillations in process plants.
A Petri-Net Based Reasoning Procedure for Fault Identification in Sequential Operations

Yi-Feng Wang and Chuei-Tin Chang
Department of Chemical Engineering, National Cheng Kung University, Tainan, Taiwan 70101, Republic of China
[email protected]
Abstract. In implementing any hazard analysis method, there is a need to reason deductively for identifying all possible fault origins that could lead to an undesirable consequence. Due to the complex time-variant cause-and-effect relations between events and states in sequential operations, the manual deduction process is always labor intensive and often error-prone. The theme of the present study is thus concerned mainly with the development of Petri-net based reasoning algorithms for automating such cause-finding procedures. The effectiveness and correctness of this approach are demonstrated with a realistic example in this paper.
1 Introduction
In order to ensure operation safety, hazard analysis is one of the basic tasks that must be performed in designing or revamping any chemical process. A variety of techniques have already been proposed in the literature, e.g., hazard and operability study (HAZOP), fault tree analysis (FTA) and failure mode and effect analysis (FMEA). In implementing these methods, there is always a need to identify all possible causes of an undesirable consequence with a deductive reasoning approach. Traditionally, the task of characterizing the corresponding fault propagation scenarios is performed manually on an ad hoc basis. For a complex chemical process, the demand for time and effort is often overwhelming. Thus, there are real incentives to automate this cause-finding process. To facilitate development of fault identification algorithms, an accurate system model must first be constructed to describe the fault propagation behaviors. In this work, the Deterministic Timed Transitions Petri net (DTTPN for short) [1] is used as the modelling tool. A systematic procedure has already been developed in previous publications [2, 3] to construct the appropriate PN-based system models. Systematic simulation techniques have also been proposed to identify and enumerate all critical fault propagation scenarios. In implementing this approach, it is necessary to first acquire a comprehensive list of failure modes associated with every component in the system. Next, all failure scenarios must be studied with repeated simulation runs. The possible causes of any given undesirable consequence can then be identified from the
simulation results. As the system complexity increases, the number of cases requiring investigation becomes extremely large. Thus, this fault identification procedure tends to be tedious and ineffective when applied to realistic systems. A novel deductive reasoning method has been developed in the present study to reduce this workload. The reasoning steps are executed on the basis of the backward Petri net of a given DTTPN and represented with a fault tree according to a set of conversion rules. The effectiveness of the proposed method has been verified with a large number of examples, and one of them is presented in this paper.
2 Backward Petri Nets
Naturally, an appropriate PN model must first be constructed to describe the fault propagation behaviors in a given system. The construction procedure is detailed elsewhere [2, 3]. For ease of tracking the deductive reasoning process, a systematic procedure should then be followed to transform the system PN into a backward PN. Specifically, the directions of all input and output arcs in the forward net should be reversed. Although inhibitor arcs and test arcs are allowed in a forward DTTPN, all of them should be converted to normal arcs in the corresponding backward Petri net. The weights of these redirected arcs should remain the same, except for those obtained from inhibitor arcs; each of these weights is computed by subtracting one from the corresponding value in the forward PN. Let us now consider the simple DTTPN in Figure 1(a) as an example. The delay times of transitions t1, t2, t3 and t4 are 0, 1, 2 and 3 units, respectively. By applying the proposed transformation procedure, the corresponding backward PN shown in Figure 1(b) can be generated.
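Since the transformation is purely mechanical, a hedged sketch may help; the data structure (a dictionary of arcs tagged with their kind) is an assumption of this illustration, not the authors' implementation, and the toy net below is not the net of Fig. 1:

```python
def to_backward_pn(forward_pn):
    """Reverse every arc, turn inhibitor/test arcs into normal arcs,
    and reduce the weight of inhibitor-derived arcs by one."""
    backward = {
        "places": set(forward_pn["places"]),
        "delays": dict(forward_pn["delays"]),   # transition -> delay time
        "arcs": [],                             # (source, target, weight)
    }
    for src, dst, weight, kind in forward_pn["arcs"]:
        if kind == "inhibitor":
            weight = weight - 1                 # rule for inhibitor arcs
        backward["arcs"].append((dst, src, weight))   # reversed, now a normal arc
    return backward

forward = {
    "places": {"p1", "p2", "p3"},
    "delays": {"t1": 0, "t2": 2},
    "arcs": [("p1", "t1", 1, "normal"), ("t1", "p2", 1, "normal"),
             ("p3", "t2", 1, "inhibitor"), ("t2", "p1", 1, "normal")],
}
print(to_backward_pn(forward)["arcs"])
```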
Fig. 1. The Forward and the Corresponding Backward PN
Fig. 2. (a) The General Fault-Tree Structure; (b) The Time-Stamped Fault Tree generated from Fig. 1(b)
3 Time-Stamped Fault Trees
The fault tree is a graphic system model. The propagation mechanisms of all possible fault origins of a predefined undesirable consequence are embedded in this model. In essence, this graph can be viewed as an accurate representation of the deductive reasoning processes in fault identification, since all logical interrelationships of the basic and intermediate events leading to the given top event can be clearly depicted in a tree. Two basic types of logic gates, i.e., the OR gate and the AND gate, are commonly used in a typical fault tree; any other gate can be replaced with a proper combination of these two types. In this study, a fault tree is developed level by level on the basis of the backward Petri net. The general fault-tree structure between an input place and its output places is shown in Figure 2(a). The key step in developing a specific structure involves identifying the actual number of AND gates connected to the OR gate and the input events connected to each AND gate. In this paper, two simple conversion rules have been adopted for this purpose:

OR-Gate Rule: Connect an OR gate to the output event associated with the given input place. The number of AND gates connected to the OR gate should be the same as the number of output transitions connected to the input place.

AND-Gate Rule: Establish a one-to-one correspondence between the transitions and the AND gates. The input events of each AND gate should be associated with the output places of the corresponding transition.

In the conventional fault tree representation, the time relations between events and states cannot be represented explicitly. In order to mark the occurrence time of each event in a fault tree, a systematic procedure for generating
the global time stamps relative to the initiation time of the backward PN has also been applied in this paper. Initially, a definite global time stamp is assigned to the top event. Next, the global time stamp of each event in the next level is calculated by subtracting the time delay of the fired transition from the time stamp of the previous level. This calculation is repeated until all events in the fault tree are marked. For illustration purposes, let us consider the backward PN in Figure 1(b) as an example. It can easily be seen that the proposed techniques can be used to synthesize the corresponding fault tree in Figure 2(b).
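A hedged sketch of how the OR-gate rule, the AND-gate rule and the time-stamp calculation could be combined on a backward PN stored as simple dictionaries; the net used below is a hypothetical toy, not the net of Fig. 1(b):

```python
def build_fault_tree(bpn, place, time, depth=5):
    """OR over the output transitions of `place`; AND over each transition's
    output places, time-stamped by subtracting the transition delay."""
    if depth == 0 or not bpn["out_transitions"].get(place):
        return {"event": place, "time": time}            # basic event
    branches = []
    for t in bpn["out_transitions"][place]:              # OR-gate rule
        child_time = time - bpn["delays"][t]             # global time stamp
        children = [build_fault_tree(bpn, p, child_time, depth - 1)
                    for p in bpn["out_places"][t]]       # AND-gate rule
        branches.append({"via": t, "AND": children})
    return {"event": place, "time": time, "OR": branches}

# Toy backward PN (assumed format): place -> its output transitions,
# transition -> delay and output places.
toy = {
    "out_transitions": {"top": ["t1", "t2"], "a": [], "b": [], "c": []},
    "delays": {"t1": 2, "t2": 1},
    "out_places": {"t1": ["a", "b"], "t2": ["c"]},
}
print(build_fault_tree(toy, "top", time=10))
```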
4 Stalled Transitions
The occurrence of every event in a fault tree can be represented by placing a token in the corresponding place in the system PN model. If the tree is constructed according to the proposed conversion rules, the implied assumption is that the abnormal condition represented by the place under consideration can be reached only by firing at least one of its input transitions at the occurrence time. However, another possibility has been neglected in this reasoning process: the given place may have acquired a token at a prior time, with its normal output transitions all having been stalled since then. Let us consider the Petri net in Figure 1(a) as an example and further assume that it is a partial version of the net given in Figure 3(a). Notice that three discrete places (p9, p10 and p11) and one un-timed transition t5 are added in this Petri net. Notice also that all new arcs are normal except the inhibitor arc connecting t5 and p10. It is obvious that a token can be introduced in p7 at a given time t* if p4 acquired one at time t* − 2. On the other hand, it is also possible that p7 had already obtained a token before t* − ∆t, where ∆t denotes a finite period. The requirement for keeping this token in p7 at t* is to maintain either an empty p10 or a non-empty p11 throughout the time interval [t* − ∆t, t*].
Fig. 3. (a) The System Petri Net; (b) The Modified Forward Petri Net
Clearly, the latter scenario in the above example cannot be deduced if the proposed fault-tree construction approach is applied to the original system PN model in Figure 3(a). To facilitate proper use of the reasoning procedures developed previously, the original Petri net must first be converted to the one shown in Figure 3(b). Notice that an additional discrete place (p12) and three new transitions (t6, t7 and t8) are introduced in this modified net. The added transitions t6 and t7 are un-timed, but the delay time of t8 is ∆t. The two output arcs of p10 (and likewise the two arcs of p11) are logically complementary.
5 Application
Figure 4 is the flow diagram of a sequential process for drying air by using two fixed alumina beds [4], i.e., Bed-I and Bed-II. Ambient air, which contains water vapor, enters the process via stream 9 and passes through one bed to remove moisture. The dried air leaves the process in stream 25. In order to maintain a steady supply of dry air, the two beds are employed alternately. The operating recipe for a complete cycle can be found in Table 1.
Table 1. Operating Recipe for Fixed-Bed Air Drying Process

Time Period | 3W      | 4W-I             | 4W-II            | Bed-I        | Bed-II
1           | 11 → 12 | 18 → 19, 20 → 21 | 22 → 23, 24 → 25 | regeneration | in service
2           | 11 → 17 | 18 → 19, 20 → 21 | 22 → 23, 24 → 25 | cooling      | in service
3           | 11 → 12 | 18 → 23, 20 → 25 | 22 → 19, 24 → 21 | in service   | regeneration
4           | 11 → 17 | 18 → 23, 20 → 25 | 22 → 19, 24 → 21 | in service   | cooling

(Columns 3W, 4W-I and 4W-II give the valve positions; columns Bed-I and Bed-II give the bed status.)
Fig. 4. The Process Flow Diagram of a Utility Air Drying Process

Due to the limitation of space, the complete system model is difficult to present in graphic form. Basically, this Petri net contains the component models of the timer, the 3-way valve, the 4-way valves and the alumina beds. Four timer states are modelled here, i.e., P(1), P(2), P(3) and P(4), and each is associated with a time period in the operation cycle. The equipment states of every valve, i.e., PV(+) and PV(−), can be considered as the two alternative valve positions. The bed states are characterized with two parameters denoted by T (temperature) and M (water content). In particular, each state can be expressed with a two-part code. The first part, i.e., "I" or "II," is the bed number, and the second represents the parameter type, i.e., "T" or "M," and its qualitative value, i.e., "0," "1" or "2." Notice that the process configuration is governed by the valve positions. In this model, they are described with a set of places also labelled with two-part codes. The first part of each code again denotes the bed number and the second the inlet or outlet connection of the corresponding bed. Specifically, the codes "H," "C" and "P" are used to describe the air at the bed inlet, and "E" and "R" the destinations of the air leaving the alumina bed. The first two codes denote that this air is taken from stream 9, the difference between them being that the former has been heated; the third code means that the air is taken from the proportionating valve. The code "E" means that the air is delivered to stream 25 and "R" means that it is recycled. Finally, a small number of failure models are also included in this example. The timer failures considered here only result in spurious signals to switch 4W-I or 4W-II. All valve malfunctions can be attributed to one failure mode, i.e., sticking. It is assumed that the beds always function normally during operation.

Let us assume that a reasonable condition for hazard analysis is "H2O concentration in stream 25 is too high in time period k," where k can be 1, 2, 3 or 4. This is due to the fact that, if the outlet air from the air-drying process contains too much water, a large number of valuable instruments downstream may be damaged. Every direct cause of this undesirable consequence can be characterized with a distinct set of four places in the Petri net. Two of them are used for representing a particular process configuration and the rest for the bed states, e.g., {I-H, I-E, I-T(0), I-M(2)}. The proposed deductive reasoning
procedures can be applied repeatedly to each of the four places in every possible set. A small sample of the fault identification results is presented in Table 2. Notice that, in this table, the occurrence times of the events/conditions are specified in square brackets. It has been found that, in general, all identified failure mechanisms are quite reasonable. Let us consider the fault propagation scenario associated with the event "4W-II sticks in period 4" (row 6) as an example. Obviously, since Bed-I is still in service during the same period, the system should still behave normally right after the valve failure occurs. However, it should also be noted that the outlet air from Bed-I is scheduled to be recycled to the cooler in period 1 of the next cycle, and this operation step cannot be executed due to the same failure. As a result, the conditions listed in the second row of the first column can indeed be realized in this scenario.

Table 2. A Sample of the Fault Origins Identified with the Proposed Deductive Reasoning Approach

Direct Causes                  | Root Causes
{I-H, I-E, I-T(0), I-M(0)} [3] | { 4W-I sticking } [1]
                               | { 4W-I sticking } [2]
                               | { spurious signal to 4W-I, 4W-I sticking } [4]
{I-H, I-E, I-T(0), I-M(2)} [1] | { spurious signal to 4W-II, 4W-II sticking } [2]
                               | { 4W-II sticking } [3]
                               | { 4W-II sticking } [4]
{I-C, I-E, I-T(1), I-M(0)} [2] | { spurious signal to 4W-II } [2]
                               | { 4W-II sticking } [3]
                               | { 4W-II sticking } [4]
{I-P, I-E, I-T(1), I-M(0)} [3] | { 3W sticking } [1]
                               | { 3W sticking } [3]
{I-P, I-E, I-T(0), I-M(2)} [3] | { 3W sticking } [2]
                               | { 3W sticking } [4]

6 Conclusions
A systematic deductive reasoning procedure is presented in this paper to identify all possible causes of system hazards in sequential operations. This procedure is carried out on the basis of backward Petri nets transformed from the original system model. The corresponding fault trees can also be constructed accordingly to represent the deduction process. From the results obtained in the application example, it can be observed that the proposed approach can indeed be used for the design of computer programs to automate the cause-finding operation in hazard analysis.
References
[1] Ramchandani, C., Analysis of asynchronous concurrent systems by Petri nets, Project MAC, TR-120, M.I.T., Cambridge, MA (1974).
[2] Wang, Y. F., Wu, J. Y. and Chang, C. T., Automatic Hazard Analysis of Batch Operations with Petri Nets, Reliab. Eng. Syst. Saf., 76:1 (2002) 91-104.
[3] Wang, Y. F. and Chang, C. T., A Hierarchical Approach to Construct Petri Nets for Modeling the Fault Propagation Mechanisms in Sequential Operations, Comput. Chem. Eng., 27:2 (2003) 259-280.
[4] Shaeiwitz, J. A., Lapp, S. A. and Powers, G. J., Fault Tree Analysis of Sequential Systems, Ind. Eng. Chem. Process Des. Dev., 16:4 (1977) 529-549.
Synthesis of Operating Procedures for Material and Energy Conversions in a Batch Plant

Yoichi Kaneko (1), Yoshiyuki Yamashita (1), and Kenji Hoshi (2)
(1) Department of Chemical Engineering, Tohoku University, Sendai 980-8579, Japan
(2) Tohoku Pharmaceutical University, Sendai 981-8558, Japan
Abstract. The problem of operating procedure synthesis for chemical process plants is investigated. The knowledge about the plant structure and the material-conversion procedures is represented by directed graphs, and a subgraph-isomorphism algorithm is utilized to solve the problem. In this study, the concepts of heat resources and of multiple outputs for material conversions are proposed. These extensions provide a way to deal with heat-exchange and separation operations in the synthesis of operating procedures within the subgraph-isomorphism framework. The method is successfully demonstrated on a double-effect evaporator.
1 Introduction
Without operating procedures, no chemical plant can be operated to produce its target materials. Nowadays, small-quantity and multi-purpose production has become popular, and operating procedures have to be generated or changed frequently. Operating procedures are usually generated by human experts, which requires a considerable amount of time and effort. The demand for automatic generation of operating procedures is therefore increasing in order to improve efficiency, and it is also necessary to verify the generated operating procedures in order to improve reliability. Several approaches have been investigated for the automatic synthesis of operating procedures of a chemical plant. Viswanathan et al. proposed a synthesis method based on Grafcet, a discrete event model concept [1]. Aylett et al. applied an AI planning tool to the synthesis of operating procedures of a chemical plant [2]. We have proposed directed graph representations and a recursive algorithm to solve this problem [3, 4]. Recently, we introduced a subgraph isomorphism framework to this problem domain [5]. In this paper, our subgraph isomorphism framework is extended so as to deal with heat exchanges and chemical separations. For these purposes, the concept of heat resources is introduced as an addition to the material-conversion graph. Moreover, the output from one node in the material-conversion graph is extended to allow multiple materials. Finally, the method is demonstrated on the double-effect evaporator plant to generate operating procedures.
2 Method
To generate a sequence of operating procedures for the production of a specific product in a given plant, at least two kinds of knowledge are required: knowledge about the connectivity of the plant equipment, and knowledge about the operations for material conversions and energy conversions.

2.1 Plant Structure
In this study, the connectivity of plant equipment is represented as a digraph named the sequential plant-structure graph [3], which can also deal with state transitions of the plant. For example, the simple example plant shown in Fig. 1 is converted into the sequential structure graph shown on the right side of the figure. In this representation, each node represents a process equipment item and each directed arc represents a direct connection between equipment items. Valves and pumps are implicitly included in the arcs; for example, the arcs from T1 to T2 include valves V1 and V3, and the arcs from T2 to T3 include PU1 and V4. Each row of the graph corresponds to a time step, whose duration is not constant, and nodes in the same column are the same equipment item at different time steps. Each arc represents a material flow between the connected equipment items; an arc connecting an equipment item to itself represents the state of holding the contents in that item.
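A minimal sketch of one way to build such a time-expanded structure graph with networkx; the connectivity data and the number of steps are hypothetical, and the authors' actual representation may differ:

```python
import networkx as nx

def sequential_structure_graph(equipment, connections, n_steps):
    """Nodes are (equipment, step); arcs are transfers to the next step
    plus 'hold' arcs that keep the contents in the same equipment."""
    g = nx.DiGraph()
    for s in range(n_steps - 1):
        for e in equipment:
            g.add_edge((e, s), (e, s + 1), kind="hold")
        for src, dst in connections:
            g.add_edge((src, s), (dst, s + 1), kind="transfer")
    return g

plant = sequential_structure_graph(
    equipment=["T1", "T2", "T3"],
    connections=[("T1", "T2"), ("T2", "T3")],   # valves/pumps implicit in the arcs
    n_steps=5,
)
print(plant.number_of_nodes(), plant.number_of_edges())
```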
2.2 Material Conversion Procedures
The knowledge about unit operating procedures and material conversions is represented by a digraph named the material-conversion graph [1]. For example, the knowledge about unit operating procedures and material conversions listed on the left side of Fig. 2 is converted into the material-conversion graph shown on the right side. Here, the start points represent raw materials and the goal point is the target product; the other nodes represent a sequence of unit operations to convert the connected materials.

Fig. 1. The example plant and corresponding sequential plant-structure graph

Fig. 2. Unit operations for material conversions and the material-conversion graph (the listed unit operations are: UP1, Reaction, A → B, on T2; UP2, Reaction, C → D, on T2; UP3, Mixing, B + D → E, on T1, T2 or T3)

2.3 Synthesis by Subgraph Isomorphism
The synthesis of operating procedures to produce a target material is the task of finding a path of material conversions from the raw materials to the target material and of allocating each conversion procedure to a specific piece of equipment. This problem amounts to finding out whether the plant-structure graph contains a subgraph that is isomorphic to the material-conversion graph, or to finding all such isomorphic subgraphs. Cordella et al. have developed an efficient algorithm for graph-subgraph isomorphism [6], and we implemented this algorithm with special considerations for solving the present problem [5]. When applying it to practical problems, many kinds of restrictions must be considered to obtain appropriate procedures; typical restrictions for chemical plants are the avoidance of contamination, the set of equipment allowed for a specific operation, and so on. In this study, these restrictions are treated as constraints during the search for solutions. Figure 3(a) shows one of the feasible solutions of the operating procedures to produce E.
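As a rough illustration of the matching step (not the authors' implementation of the algorithm in [6]), networkx ships a VF2-based matcher; the tiny host and pattern graphs below are hypothetical, and the extra engineering constraints mentioned above would be added on top of the node_match test:

```python
import networkx as nx
from networkx.algorithms import isomorphism

# Host graph: a (much simplified) plant-structure graph with node labels.
host = nx.DiGraph()
host.add_nodes_from([("T1", {"kind": "tank"}), ("T2", {"kind": "reactor"}),
                     ("T3", {"kind": "tank"})])
host.add_edges_from([("T1", "T2"), ("T2", "T3")])

# Pattern: a fragment of a material-conversion requirement
# (a reactor that receives from one unit and sends to another).
pattern = nx.DiGraph()
pattern.add_nodes_from([("src", {"kind": "tank"}), ("rx", {"kind": "reactor"}),
                        ("dst", {"kind": "tank"})])
pattern.add_edges_from([("src", "rx"), ("rx", "dst")])

matcher = isomorphism.DiGraphMatcher(
    host, pattern, node_match=isomorphism.categorical_node_match("kind", None))
for mapping in matcher.subgraph_isomorphisms_iter():   # host node -> pattern node
    print(mapping)
```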
2.4 Energy Conversion Procedures
To deal with unit operations such as heat exchange, the conversion and transfer of energy must be considered. Let us consider the jacket heating in Fig. 1. In this example, the first unit operation is replaced by a reaction with heating in order to account for the heating (Table 1). The material conversion is the same, but the term (Heat) is introduced in the conversion procedure. This term means that the operation includes heat exchange and therefore requires a heat source. Sources of utilities are also introduced, because most heat exchanges in process plants use utility lines such as steam and water. With this definition, the subgraph isomorphism algorithm can be used to generate operating procedures for a plant including energy conversions. Figure 3(b) shows an example of operating procedures for the example plant: a line from Steam to UP1' provides steam from the heat source to the tank T2, and the UP1' operation then converts material A to B.

Table 1. A modified unit operation for material and energy conversions

name | type             | procedures     | equipments
UP1' | Reaction&Heating | A + (Heat) → B | T2

Fig. 3. One of the feasible operating procedures to produce E in the example plant: (a) without heating, (b) with heating

2.5 Separation Procedures
Separation operations were not treated in the previous study [5], because each node in the material-conversion graph had only one output arc. In this study, this restriction is removed so that a node can have multiple output arcs, and the implementation of the graph-subgraph matching was improved to support this functionality. As a result, separation operations can be treated in the same framework.
3 Case Study
To demonstrate the ability of the proposed algorithm, the synthesis of operating procedures for the start-up of a double-effect evaporator [2] is investigated here. Figure 4 shows the schematic flow of the target plant. The purpose of this plant is to remove water from a salt-water solution. The plant is called double-effect because the steam that is evaporated off from the salt-water solution in the first evaporation is used to supply the energy for the second evaporation. In this plant, the salt water SW is fed from the feed tank FT, preheated in the glass heater GH, heated in heat exchanger HE1, and evaporated in the first-stage evaporator EV1. The steam ST generated by the first-stage evaporator is used as a heat source in the second-stage evaporator EV2. The separated liquid LQ1 in EV1 is heated in HE2 and evaporated in EV2. Finally, the steam is collected by the barometric condenser BC1, water WT is obtained in tank TK2, and liquid product LQ2 is obtained in tank TK1.

Fig. 4. The double-effect evaporator (DEE) (FT: feed tank, GH: glass heater, HE: heat exchanger, EV: evaporator, V: valve, TK: tank, BC: barometric condenser, PW: process water)

This plant includes heat exchangers and evaporators, which require both material flow and energy flow. The material and energy conversion procedures for this plant are described in Table 2; in this table, UP1, UP2 and UP4 are the operations for heating, and UP6 is the operation for cooling. By representing the plant structure as the structure graph (Fig. 5) and the material and energy conversion procedures as the conversion graph (Fig. 6), the proposed algorithm was applied to generate the operating procedures.
Table 2. Material and energy conversion procedures for the case study problem

name | type         | conversion procedures | equipment constraints
UP1  | Preheating   | SW + (HEAT) → SW      | GH
UP2  | Heating      | SW + (HEAT) → SW      | HE1, HE2
UP3  | Evaporation  | SW → ST + LQ1         | EV1, EV2
UP4  | Heating      | LQ1 + (HEAT) → LQ1    | HE1, HE2
UP5  | Evaporation  | LQ1 → ST + LQ2        | EV1, EV2
UP6  | Condensation | ST − (Heat) → WT      | BC1
UP7  | Condensation | ST − (Heat) → WT      | BC1
Fig. 5. Plant structure graph for the DEE plant

Fig. 6. Material and energy conversion graph for the DEE plant

Fig. 7. An example of generated procedures for the DEE plant
Figure 7 shows an example of the generated matched graph representing the operating procedures to produce the target product LQ2 in TK1 and WT in TK2. Each arc in the graph indicates a material transfer, with the operation of valves and pumps implicitly included. The matched graph can easily be interpreted as a sequence of operations, as shown in Table 3. In this table, operations such as evaporation and heating correspond to the nodes carrying the names of unit operations in the matched graph, while the activation of valves and pumps corresponds to the arcs of the graph. Initially, all valves are assumed to be closed and no pumps are working. After the operations in a step are completed, any valve that was opened in that step must be closed again before moving to the next step. The final lines in steps 3, 5 and 8 do not require any valve operations, because the corresponding equipment items are connected directly without any valves. The initiation time and condition of each operation step are not considered. Inspecting these operating procedures, the sequence of operations appears very reasonable, and the double-effect operation is realized by the transfer in step 4. Although the algorithm can generate many possible solutions, the selection of an appropriate solution is left for future work.

Table 3. Interpretation of the generated procedures for the DEE plant

step  | operation                                    | activation of valves or pumps
step1 | Prepare salt water in FT                     |
      | Transfer the contents of FT to GH            | by opening V2
      | Transfer steam to GH                         | by opening V1
step2 | Preheat in GH                                |
      | Transfer the contents of GH to HE1           | by activating Pump1, V4 and V8
      | Transfer steam to HE1                        | by opening V7
step3 | Heat in HE1                                  |
      | (Transfer the contents of HE1 to EV1)        |
step4 | Evaporation in EV1                           |
      | Transfer the vapor from EV1 to HE2           | by opening V14
      | Transfer the liquid from EV1 to HE2          | by activating Pump1, Pump2, V5, V20 and V17
step5 | Heat in HE2                                  |
      | Transfer the vapor from HE2 to BC1           | by opening V16
      | (Transfer the liquid from HE2 to EV2)        |
step6 | Evaporation in EV2                           |
      | Condensation in BC1                          |
      | Transfer the vapor from EV2 to BC1, then TK2 | by opening V18
      | Transfer the liquid from EV2 to TK1          | by opening V15
step8 | Condensation in BC1                          |
      | (Transfer the contents of BC1 to TK2)        |
4 Conclusion
A subgraph-isomorphism framework for the synthesis of operating procedures for chemical plants has been extended to deal with heat-exchange and separation operations. This extension enables the consideration of operations related to utilities, such as steam and process water, for various plants. The proposed method was demonstrated on the double-effect evaporator plant, and it was confirmed that it automatically generates very reasonable operating procedures.
References
[1] Viswanathan, S., Johnsson, C., Srinivasan, R., Venkatasubramanian, V. and Årzén, K. E., Computers and Chemical Engineering, 22 (1998) 1673-1685.
[2] Aylett, R. S., Soutter, J., Petley, G. J., Chung, P. W. H. and Edwards, D., Engineering Applications of Artificial Intelligence, 14 (2001) 341-356.
[3] Hoshi, K., Nagasawa, K., Yamashita, Y. and Suzuki, M., KES2001, Osaka, Japan (2001).
[4] Hoshi, K., Nagasawa, K., Yamashita, Y. and Suzuki, M., Journal of Chemical Engineering of Japan, 35 (2001) 377-383.
[5] Hoshi, K., Yamashita, Y. and Suzuki, M., Kagaku Kogaku Ronbunshu, 29 (2003) 107-111.
[6] Cordella, L. P., Foggia, P., Sansone, C. and Vento, M., ICPR'96, Vienna, Austria (1996) 180-184.
IDEF0 Activity Model Based Design Rationale Supporting Environment for Lifecycle Safety

Tetsuo Fuchino (1) and Yukiyasu Shimada (2)
(1) Department of Chemical Engineering, Tokyo Institute of Technology, Tokyo 152-8552, Japan, [email protected]
(2) Department of Systems Engineering, Okayama University, Okayama 700-8530, Japan, [email protected]
Abstract. To manage safety through the plant lifecycle in the process industry, a system environment that enables recording and accessing the design rationale of the current process and/or plant is indispensable. In principle, the structure of the process design activity can be formulated if the design process is disclosed explicitly. In this study, the design activity of batch process design is considered on the basis of the ANSI/ISA S88.01 standard [1], and a design rationale supporting environment with a feature-oriented approach is provided. Because of the hierarchical nature of process design, the IDEF0 activity model is adopted for representing the design process, and a design rationale representation scheme is defined by considering the relation between process actions, recipe procedures and process structure in batch processes. These relations are represented by objects and associations, and bidirectional search between design intention and process structure via the design rationale is enabled.
1 Introduction
Chemical processes are designed to have integrity among process structure, operation and process actions in order to realize safety. Within the plant lifecycle, many changes to the process structure, operation or process behavior will be necessary to meet the economic and/or technological environment. To maintain safety, it is necessary to confirm that a planned modification does not conflict with the intention and/or rationale of the current process and/or plant design. Piping and instrument diagrams (P&IDs), process flow diagrams (PFDs), standard operating procedures (SOPs) and so on are outputs of the process design. In general, these outputs tell only the results of the design, while the reasoning and logic behind it remain implicit in the memory of individual designers. Therefore, safety management depends greatly upon human memory and/or expectations, which can lead to unsafe modifications and, in turn, to accidents and disasters. The development of a supporting environment that enables representing and accessing the design rationale is necessary for lifecycle safety.
There have been many studies on design rationale systems [2], and two approaches have been taken for representing the design rationale according to the design object: process-oriented and feature-oriented. Previous studies ([3], [4]) applied IBIS (Issue Based Information System), which is mainly a process-oriented approach. Process design appears to be a complicated activity requiring profound knowledge of chemical processes and plants: in process design, the operations to discover the preferable process actions are designed simultaneously with the process structure. This is the characteristic feature of process design, and it is what gives the impression of complexity. If the relation between operation, process action and process structure in performing design can be clarified, the structure of the process design activity can be formulated. In this study, a feature-oriented design rationale system for lifecycle safety is proposed. To consider the operational factor that characterizes process design, we focus on batch processes, and the process design process is analyzed on the basis of the hierarchical control model representations in the ANSI/ISA S88.01 standard [1]. The IDEF0 activity model is adopted to formulate the process design activity, and the design rationale is represented on the activity model by considering how the process intentions are converted into formula, procedure, equipment requirements and process structure via the design rationale.

Fig. 1. Relation between Process Action, Procedure and Process Structure in Process Design
2 Activity of Performing Process Design
Chemical processes are designed to satisfy the operational requirements [5], and these requirements are categorized into three operational phases: normal, abnormal, and emergency. To satisfy these requirements, the operation, process actions and process structure are detailed gradually through the design stages. In batch processes, the design stages are related to the recipe types: general, site, master, and control. Thus, process design can be defined over several spaces made from the operational requirements and recipe types. In this study, the normal operational requirement is considered as the first step, and the site recipe design stage is treated. In this design space, the operations, process actions and process structure corresponding to the site recipe are designed for a given general recipe and design intentions.
In the ANSI/ISA S88.01 standard [1], the relation between process operations (the procedure in the standard), process actions, and the process physical structure is defined from the viewpoint of controlling an existing process and/or plant. This model explains that a procedure combined with a process physical structure provides the required process functionality. The process design activity is considered on the basis of this model. The direction of information in process design, however, should be the reverse of that in ANSI/ISA S88.01: the necessary process actions need process operations, and the operations specify equipment requirements, as shown in Figure 1. The process structure is designed on the basis of the equipment requirements, and this series of actions details the process design gradually.

Fig. 2. Process Design Activity

Fig. 3. Sub-Activities of PVC Polymerization Reaction
It is clear that the process design activity is composed of a recipe design part and a process structure design part, so that the process design activity (process design at the site recipe level for normal operation) can be represented by the IDEF0 model shown in Figure 2. In the recipe design part, the given design intentions and the upper-class recipe (here the general recipe) are developed into more specific intentions and more detailed sub-recipes, owing to the hierarchical nature of process design. Therefore, the activity "Design Site Recipe" in Fig. 2 is decomposed into sub-activities to design sub-recipes as shown in Figure 3, where the sub-recipes are for the PVC polymerization reaction applied
in the later section. Each sub-activity to design sub-recipes in Fig. 3 is further decomposed into sub-activities for designing its own sub-recipes. The boxes in Figs. 2 and 3 describe decision activities in process design, and design rationales must be defined within them. However, in an ordinary IDEF0 model, these activities are defined by associating a glossary as plain text, which makes it difficult to systematize the design rationale supporting environment. Thus, a design rationale scheme representing each design activity is considered.
3 Representation of Design Rationale
The IDEF0 activity model for process design shown in Figs. 2 and 3 is provided on the basis of Fig. 1, so Fig. 1 can be considered as the representation of the process designer's viewpoint. In designing the recipe and the process structure, the process designer must have the same viewpoint. Thus, like the activity model, the design rationale scheme representing each design activity is considered on the basis of Fig. 1. In each recipe design activity shown in Figs. 2 and 3, the upper-class recipe is broken down into more detailed sub-recipes. For this development, the required process actions (formula) are designed from the developed process intentions, the procedure (sub-tasks and their sequence) to realize the required process state is designed, and the equipment requirements to enable the designed procedure are generated, following Fig. 1. The formula, sub-tasks, sub-task sequence and equipment requirements are the results of recipe design, and design rationales must exist to obtain these results. Moreover, in the process structure design activities, the process structure is designed for the equipment requirements, and the design rationale must be adapted for this design.

Fig. 4. Design Rationale Representation Scheme
Therefore, according to Fig. 1, the logical structure of recipe design and process structure design can be represented by the scheme shown in Figure 4. This scheme is common to the other activities in Figs. 2 and 3 and to their further decompositions. When each rectangular element and each logical relation in Fig. 4 are defined as objects and associations, the scheme can be coded in an object-oriented manner.
4 Implementation
In this study, the design rationale representation schemes on IDEF0 are described with Prolog facts. As shown in Fig. 4, the scheme is composed of multiple pairs of associated objects. The objects, described with rectangles in Fig. 4, have seven classes, i.e., intention, rationale, formula, sub-task, task-sequence, equipment requirement and process structure. The associations, which are also objects, have three classes, i.e., horizontal association within an activity, sequential association within an activity, and hierarchical association between upper- and lower-level activities. To describe a design rationale representation scheme having these properties in Prolog, three predicates (logic, log_unit, relation) are introduced, and the object classes are defined as arguments in the following three kinds of Prolog unit clauses:

logic('Activity','Association','Object_1','Port_1','Object_2','Port_2').
log_unit('Activity','Object_1','Object_Class_1').
log_unit('Activity','Object_2','Object_Class_2').
relation('Activity','Associate','Associate_Class').

The first clause means that 'Port_1' of 'Object_1' and 'Port_2' of 'Object_2' are logically linked via 'Association' in 'Activity'. The second (third) clause means that the class of 'Object_1' ('Object_2') is 'Object_Class_1' ('Object_Class_2') in 'Activity'. The fourth clause means that the class of 'Associate' is 'Associate_Class' in 'Activity'. Furthermore, the contents of intentions and/or design rationales are usually given as plain text, so another clause with the predicate glossary is introduced:

glossary('Object_1','Text_1').
glossary('Object_2','Text_2').

Although it is possible to link to an external documentation environment, the sentences explaining the definitions are written directly in 'Text_1' and/or 'Text_2' here. The IDEF0 activity modeling and the above formulation are repeated, and the flow sheet is produced in the process structure design activity. When the equipment requirements are associated with the item numbers used in the flow sheet to identify each item, it becomes possible to relate the design intentions and rationales with the process structure visually.
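For readers less familiar with Prolog, the same fact base and the bidirectional search it supports can be mimicked in a few lines of Python; this is only an illustrative analogue of the scheme, reusing the fact values from the application example below, and is not part of the authors' environment:

```python
# Each tuple mirrors a logic/6 fact: (activity, association, obj1, port1, obj2, port2)
logic_facts = [
    ("A112", "RE_001", "SQCD_003", "F1", "Rationale_001", "T1"),
    ("A112", "RE_006", "Rationale_001", "F1", "De-Aerate_Formula", "T1"),
]
glossary = {"Rationale_001": "De-Aerate_is_done_by_reducing_pressure",
            "De-Aerate_Formula": "Vacuum_to_0.7_bar"}

def neighbours(obj):
    """Objects linked to `obj` by any logic fact, in either direction."""
    out = set()
    for _, _, o1, _, o2, _ in logic_facts:
        if o1 == obj:
            out.add(o2)
        if o2 == obj:
            out.add(o1)
    return out

def reachable(start):
    """Transitive closure over the fact base, i.e. the Prolog recursive rule."""
    seen, stack = {start}, [start]
    while stack:
        obj = stack.pop()
        for nxt in neighbours(obj) - seen:
            seen.add(nxt)
            stack.append(nxt)
    return seen - {start}

print(reachable("SQCD_003"))   # -> {'Rationale_001', 'De-Aerate_Formula'}
```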
5 Application
The developed design rationale supporting environment is applied to the process design of a PVC polymerization reaction section. The general recipe, in ANSI/ISA-S88.01 format, is given in Table 1. In general, process design intentions are categorized into four types: safety, quality, cost and delivery, referred to as SQCD for short. In the A111 activity of Fig. 2, the categories of design intentions and the general recipe are provided. In the A112 activity, these categories are decomposed into more detailed intentions and recipes. In this example, nine intentions are considered, and some of them are shown in Table 2.
Table 1. General Recipe

Formula               | VCM: 100 [unit], DMW: 100 [unit], LOP: 0.0088 [unit], PVA: 0.090 [unit], Temp: 55 [deg C]
Procedure             | De-aerate, DMW, VCM, LOP, Warm up, LOP, React
Equipment Requirement | Jacketed Reactor with Agitator and Baffle Plate

Table 2. A Part of Process Intentions

Intention No. | Category | Argument
SQCD_001      | Safety   | Avoid same task simultaneously at multiple reactors.
SQCD_003      | Quality  | De-aerate sufficiently.
SQCD_004      | Quality  | Mix sufficiently after PVA is charged.
Fig. 5. Designed Process Structure
According to the scheme shown in Fig. 4, the process design is performed, and the objects and associations are defined. In this case, the PVC polymerization is divided into five sub-tasks: de-aerate, charge, warm up, react and unload. These sub-tasks are designed to be performed in this order in the recipe design at the A112 level. For example, on the basis of the second intention for quality in Table 2 (SQCD_003), the vacuum operation is selected for de-aeration, and the required action is specified in the formula of de-aeration. This part of the rationale representation scheme is described by the following Prolog facts:

logic('A112','RE_001','SQCD_003','F1','Rationale_001','T1').
logic('A112','RE_006','Rationale_001','F1','De-Aerate_Formula','T1').
glossary('Rationale_001','De-Aerate_is_done_by_reducing_pressure').
glossary('De-Aerate_Formula','Vacuum_to_0.7_bar').
In the same manner, the A112- and A113-level designs and their sub-activity-level designs are carried out according to the design rationale representation scheme, and the applied design intentions, design rationales and results are recorded as Prolog facts. As a result, the process flow sheet shown in Figure 5 can be designed. For example, the above design intention (SQCD_003) is realized by introducing the following valves and lines appearing in Fig. 5: VE-120, VE-121, VE-122, VE-110, VE-111, VE-115, MC64, MC62, MC21, MC5, MC10, MC11, MC12. These items can be searched from SQCD_003, and SQCD_003 is reachable from these items, by using a recursive rule of Prolog.
6 Conclusion
A design rationale supporting environment has been proposed. The process design process can be formulated by analyzing the relation between process intention, operation and process structure on the basis of the ANSI/ISA-S88.01 standard. A feature-oriented approach is adopted for the representation of the design rationale, and batch processes are considered in order to capture the operational feature of process design. An IDEF0 activity model based design rationale representation scheme is proposed, and the represented design rationale is converted into Prolog fact data. Bidirectional reference between process structure and design rationale is realized, and the approach is applied to the design of a PVC polymerization section.
Acknowledgements

The authors are grateful to the following members of the safety research group in the Society of Chemical Engineers, Japan (SCEJ) for their engineering advice: Mr. Noboru Yamamuro (ZEON), Mr. Yasumasa Kajiwara (Kaneka), Mr. Hiroyoshi Kitagawa (Asahi Glass), Mr. Susumu Ohara, Mr. Iwao Tezuka (Toyo Engineering) and Mr. Masaki Ueyama (Mitsui Takeda Chemicals).
References
[1] ANSI/ISA-S88.01-1995, "Batch Control, Part I: Models and Terminology"; Instrumentation, Systems, and Automation Society (1995).
[2] Regli, W. C., Hu, X., Atwood, M. and Sun, W.: A Survey of Design Rationale Systems: Approach, Representation, Capture and Retrieval, Eng. with Comput., 16, pp. 209-235 (2000).
[3] Banares-Alcantara, R. and King, J. M. P.: Design Support Systems for Process Engineering III - Design Rationale as a Requirement for Effective Support, Comput. and Chem. Eng., 21(3), pp. 263-276 (1997).
[4] Chung, P. W. H. and Goodwin, R.: An Integrated Approach to Representing and Accessing Design Rationale, Eng. Applic. Artif. Intell., 11(1), pp. 149-159 (1999).
[5] Batres, R., Lu, M. L. and Naka, Y.: Operations Planning Strategies for Concurrent Process Engineering, Proc. of the AIChE Annual Meeting, Los Angeles, U.S.A. (1997).
The Optimal Profit Distribution Problem in a Multi-Echelon Supply Chain Network: A Fuzzy Optimization Approach

Cheng-Liang Chen, Bin-Wei Wang, Wen-Cheng Lee
Department of Chemical Engineering, National Taiwan University, Taipei 10617, Republic of China
[email protected]

Abstract. A multi-product, multi-stage, and multi-period production and distribution planning model is formulated for a typical multi-echelon supply chain network to achieve multiple objectives such as maximizing the profit of each participant enterprise, maximizing the customer service level, maximizing the safe inventory levels, and ensuring a fair profit distribution for all partners. A two-phase fuzzy decision-making method is proposed to attain a compromised solution among all conflicting objectives. Therein, each objective function is viewed as a fuzzy goal, and a membership function is used to characterize the transition from the objective value to the degree of satisfaction. The final decision is interpreted as the fuzzy aggregation of the multiple objectives and is measured by the maximized overall degree of satisfaction. The proposed two-phase fuzzy optimization method combines the advantages of two popular t-norms, the minimum and the product, for implementing the fuzzy aggregation. A numerical case study is supplied, demonstrating that the proposed two-phase fuzzy intersection method can provide a better compensatory solution for multi-objective optimization problems in a supply chain network.
1 Introduction

In traditional supply chain management, minimizing costs or maximizing profit as a single objective is often the focus when considering the integration of a multi-echelon supply chain network. Recently, Gjerdrum et al. [1] proposed a mixed-integer linear programming model for a production and distribution planning problem and solved the fair profit distribution problem by using a Nash-type model as the objective function. However, directly maximizing the Nash-type overall profit objective may cause an unfair distribution of profits among the participants because of their different profit scales. Furthermore, today's consumers demand better customer service, whether in manufacturing or in service industries, and it only benefits a company to constantly improve its customer service. Thus, customer service should also be taken into consideration when formulating a multi-echelon supply chain system. In the traditional single-objective treatment of minimizing costs or maximizing profit, however, it is difficult to quantify customer service as a monetary amount in the objective function. To address this, the model presented here uses multi-objective optimization to formulate the production and distribution planning problem of a multi-echelon supply chain system. Besides maximizing the profit of the entire system, the distribution of profits among its members, the customer service levels, and the safe inventory levels are also taken as objectives. The model also considers the economies of scale in manufacturing and shipping faced by most firms today: binary variables are added to act as policy decisions on whether to exploit economies of scale for manufacturing or shipping. Nonlinear terms appear because of the customer service and safe inventory levels, so the model becomes a multi-objective mixed-integer nonlinear programming problem (MOMINLP). A two-phase fuzzy intersection method [2] is then proposed to solve the multi-objective programming problem, so that each member of the multi-echelon supply chain system can pursue its own maximal profit while being guaranteed at least a minimum required profit.
2 Problem Description

A general multi-echelon supply chain is considered, consisting of three levels of enterprises. The first-level enterprises are retailers, from which the products are sold to customers subject to a given lower bound on customer service. The second-level enterprises are distribution centers (DCs), which use different types of transport capacity to deliver products from the plant side to the retailer side. The third-level enterprises are plants, each of which batch-manufactures one product in one period. The overall problem can be stated as follows. Given: cost parameters, manufacturing data, transportation data, inventory data, forecast customer demand and product sales prices. Determine: the production plan of each plant, the transportation plan of each distribution center, the sales quantity of each retailer, the inventory level of each enterprise, and each kind of cost. The target is to integrate the multi-echelon decisions simultaneously, resulting in a fair profit distribution, and to increase the customer service level and safe inventory level as far as possible.
3 Mathematical Formulation

3.1 Parameters

The parameters are divided into cost parameters and other parameters, as shown in Table 1.

3.2 Variables

The binary variables, which act as policy decisions on the use of economies of scale for manufacturing or shipping, and the other variables are listed in Table 2.

3.3 Integration of Production and Distribution Models

We consider the inventory balances, the maximum and safe inventory quantities, the customer service and safe inventory levels, and all costs such as manufacturing, transportation and handling, to formulate an integrated planning model for the problem. The detailed formulation of the constraints and objective functions for retailer r, distribution center d and plant p can be found in [3]. Therein, the customer service level of retailer r at period t is defined as the average percentage ratio of the actual sales quantity of product i from retailer r to the customer at period t, Sirt, over the total demand quantity, where the total demand quantity is the sum of the forecast customer demand of product i at retailer r in period t, FCDirt, and the backlog level of product i at retailer r at the end of period t−1, Bir,t−1. The safe inventory level of retailer r at period t is defined as the average, over all products, of one minus the ratio of the short safe inventory level of product i of retailer r at period t, Dirt, over the safe inventory quantity of product i of retailer r, SIQir. The resulting multiple objectives Js, s ∈ S, the variable vector x, and the feasible search space Ω are summarized in the following.
$$
\max_{x \in \Omega} \big(J_1(x), \ldots, J_S(x)\big) =
\left(
\begin{array}{ll}
\sum_{t} Z_{rt}, & \forall r \in \mathcal{R} \\
\frac{1}{T}\sum_{t} \mathrm{CSL}_{rt}, & \forall r \in \mathcal{R} \\
\frac{1}{T}\sum_{t} \mathrm{SIL}_{rt}, & \forall r \in \mathcal{R} \\
\sum_{t} Z_{dt}, & \forall d \in \mathcal{D} \\
\frac{1}{T}\sum_{t} \mathrm{SIL}_{dt}, & \forall d \in \mathcal{D} \\
\sum_{t} Z_{pt}, & \forall p \in \mathcal{P} \\
\frac{1}{T}\sum_{t} \mathrm{SIL}_{pt}, & \forall p \in \mathcal{P}
\end{array}
\right)
\qquad (1)
$$

$$
x = \left\{
\begin{array}{l}
S_{irt},\, I_{irt},\, B_{irt},\, D_{irt},\, S_{idrt},\, Q^{k}_{drt},\, I_{idt},\, D_{idt},\, Y^{k}_{drt},\, S_{ipdt},\, Q^{k'}_{pdt},\, I_{ipt},\, D_{ipt},\, Y^{k'}_{pdt},\, \alpha_{ipt},\, \beta_{ipt},\, \gamma_{ipt},\, o_{ipt};\\
i \in \mathcal{I},\; r \in \mathcal{R},\; d \in \mathcal{D},\; p \in \mathcal{P},\; t \in \mathcal{T},\; k \in \mathcal{K},\; k' \in \mathcal{K}',\; n \in \mathcal{N}
\end{array}
\right\}
\qquad (2)
$$

$$
\Omega = \left\{ x \;\middle|\;
\begin{array}{l}
I_{irt} = I_{ir,t-1} + \sum_{d} S_{idr,t-\mathrm{TLT}_{dr}} - S_{irt} \\
B_{irt} = B_{ir,t-1} + \mathrm{FCD}_{irt} - S_{irt}, \quad I_{irT} \ge \mathrm{SIQ}_{irT}, \quad B_{irT} = 0, \quad \sum_{i} I_{irt} \le \mathrm{MIC}_{r} \\
\mathrm{SIQ}_{irt} - I_{irt} \le D_{irt} \le \mathrm{SIQ}_{irt}, \quad D_{irt},\, I_{irt},\, B_{irt},\, S_{irt},\, \mathrm{CSL}_{rt} \ge 0 \\
I_{idt} = I_{id,t-1} + \sum_{p} S_{ipd,t-\mathrm{TLT}_{pd}} - \sum_{r} S_{idrt} \\
I_{idT} \ge \mathrm{SIQ}_{idT}, \quad \sum_{i} I_{idt} \le \mathrm{MIC}_{d} \\
\mathrm{SIQ}_{idt} - I_{idt} \le D_{idt} \le \mathrm{SIQ}_{idt}, \quad D_{idt},\, I_{idt},\, S_{idrt} \ge 0 \\
\sum_{k} Q^{k}_{drt} = \sum_{i} S_{idrt}, \quad \mathrm{TCL}^{k-1}_{dr} Y^{k}_{drt} < Q^{k}_{drt} \le \mathrm{TCL}^{k}_{dr} Y^{k}_{drt}, \quad \sum_{k} Y^{k}_{drt} \le 1 \\
\sum_{r}\sum_{k} \mathrm{TCL}^{k}_{dr} Y^{k}_{drt} \le \mathrm{MOTC}_{d} \\
I_{ipt} = I_{ip,t-1} + \mathrm{FMQ}_{ip}\,\alpha_{ip,t-1} + \mathrm{OMQ}_{ip}\, o_{ip,t-1} - \sum_{d} S_{ipdt} \\
I_{ipT} \ge \mathrm{SIQ}_{ipT}, \quad \sum_{i} I_{ipt} \le \mathrm{MIC}_{p} \\
\mathrm{SIQ}_{ipt} - I_{ipt} \le D_{ipt} \le \mathrm{SIQ}_{ipt}, \quad D_{ipt},\, I_{ipt},\, S_{ipdt} \ge 0 \\
\sum_{k'} Q^{k'}_{pdt} = \sum_{i} S_{ipdt}, \quad \mathrm{TCL}^{k'-1}_{pd} Y^{k'}_{pdt} < Q^{k'}_{pdt} \le \mathrm{TCL}^{k'}_{pd} Y^{k'}_{pdt}, \quad \sum_{k'} Y^{k'}_{pdt} \le 1 \\
\sum_{p}\sum_{k'} \mathrm{TCL}^{k'}_{pd} Y^{k'}_{pdt} \le \mathrm{MITC}_{d} \\
\sum_{i} \beta_{ipt} = 1, \quad \alpha_{ipt} \le \beta_{ipt}, \quad \gamma_{ipt} \ge \beta_{ipt} - \beta_{ip,t-1}, \quad o_{ipt} \le \alpha_{ipt} \\
\sum_{i}\sum_{t} o_{ipt} \le \mathrm{MTO}_{p}, \quad \sum_{i}\sum_{n} o_{ip,t-n-1} \le N - 1
\end{array}
\right\}
\qquad (3)
$$
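To make the notation concrete, the following hedged sketch writes just the retailer part of Ω (inventory balance, backlog balance and the maximum-inventory bound) with PuLP; the index sets and data are tiny invented placeholders, and the economy-of-scale and nonlinear service-level parts of the actual model are omitted:

```python
from pulp import LpProblem, LpMaximize, LpVariable, lpSum

I, R, T = ["i1"], ["r1"], [1, 2, 3]            # products, retailers, periods (toy sets)
FCD = {(i, r, t): 10 for i in I for r in R for t in T}   # forecast demand
MIC_r, TLT = 100, 1                            # max inventory, DC-to-retailer lead time

S = LpVariable.dicts("S", (I, R, T), lowBound=0)     # sales to customers
Sd = LpVariable.dicts("Sdr", (I, R, T), lowBound=0)  # deliveries arriving from the DC
Inv = LpVariable.dicts("Inv", (I, R, T), lowBound=0) # inventory level
B = LpVariable.dicts("B", (I, R, T), lowBound=0)     # backlog level

model = LpProblem("retailer_fragment", LpMaximize)
model += lpSum(S[i][r][t] for i in I for r in R for t in T)   # stand-in objective

for r in R:
    for t in T:
        for i in I:
            prev_I = Inv[i][r][t - 1] if (t - 1) in T else 0
            prev_B = B[i][r][t - 1] if (t - 1) in T else 0
            arriving = Sd[i][r][t - TLT] if (t - TLT) in T else 0
            model += Inv[i][r][t] == prev_I + arriving - S[i][r][t]    # inventory balance
            model += B[i][r][t] == prev_B + FCD[i, r, t] - S[i][r][t]  # backlog balance
        model += lpSum(Inv[i][r][t] for i in I) <= MIC_r               # max inventory per period
model.solve()
```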
Table 1: Indices, sets, and parameters

Index/Set | Dimension | Physical meaning
r ∈ R     | [R] = R   | retailers
d ∈ D     | [D] = D   | distribution centers
p ∈ P     | [P] = P   | plants
i ∈ I     | [I] = I   | products
t ∈ T     | [T] = T   | periods
k ∈ K     | [K] = K   | transport capacity level, DC to retailer
k' ∈ K'   | [K'] = K' | transport capacity level, plant to DC

Parameter | * ∈         | Physical meaning
USRi*     | {pd, dr, r} | Unit Sale Revenue of i, p to d, etc.
UICi*     | {p, d, r}   | Unit Inventory Cost of i for p, d, r
UHCi*     | {p, d, r}   | Unit Handling Cost of i for p, d, r
UTCk'*    | {pd}        | k'-th level Unit Transport Cost, p to d
UTCk*     | {dr}        | k-th level Unit Transport Cost, d to r
FTCk'*    | {pd}        | k'-th level Fix Transport Cost, p to d
FTCk*     | {dr}        | k-th level Fix Transport Cost, d to r
UMCi*     | {p}         | Unit Manufacture Cost of i
OMCi*     | {p}         | Overtime unit Manuf. Cost of i
FMCi*     | {p}         | Fix Manuf. Cost changed to make i
FICi*     | {p}         | Fix Idle Cost to keep p idle
FCDi*     | {r}         | Forecast Customer Demand of i
TLT*      | {pd, dr}    | Transport Lead Time, p to d (d to r)
SIQi*     | {p, d, r}   | Safe Inventory Quantity in p, d, r
MIC*      | {p, d, r}   | Max inventory capacity of p, d, r
TCLk'*    | {pd}        | k'-th Transport Capacity Level, p to d
TCLk*     | {dr}        | k-th Transport Capacity Level, d to r
MITC*     | {d}         | Max Input Transport Capacity of d
MOTC*     | {d}         | Max Output Transport Capacity of d
FMQi*     | {p}         | Fix Manufacture Quantity of i
OMQi*     | {p}         | Overall fix Manufacture Quantity
MTO*      | {p}         | Max Total Overtime manuf. period
Table 2: Binary variables and other continuous variables for t ∈ T

Binary | ∗ ∈ | Meaning when having value of 1
Y^k′_∗t | {pd} | k′-th transport capacity level, p to d
Y^k_∗t | {dr} | k-th transport capacity level, d to r
α_i∗t | {p} | manufacture with regular-time workforce
β_i∗t | {p} | setup plant p to manufacture i
γ_i∗t | {p} | p changeover to manufacture i
o_i∗t | {p} | manufacture with overtime workforce

Real | ∗ ∈ | Physical meaning
S_i∗t | {pd, dr, r} | Sales quantity of i, p to d, etc.
Q^k′_∗t | {pd} | k′-th level transport quantity, p to d
Q^k_∗t | {dr} | k-th level transport quantity, d to r
Q_∗t | {pd, dr} | total transport quantity, p to d or d to r
I_i∗t | {p, d, r} | Inventory level of i in p, d, r
B_i∗t | {r} | Backlog level of i in r at end of t
D_i∗t | {p, d, r} | Short safe inventory level in p, d, r
TMC_∗t | {p} | Total Manufacture Cost of p
TPC_∗t | {d, r} | Total Purchase Cost of d, r
TIC_∗t | {p, d, r} | Total Inventory Cost of p, d, r
THC_∗t | {p, d, r} | Total Handling Cost of p, d, r
TTC_∗t | {d; pd, dr} | Total Transport Cost of d; p to d or d to r
PSR_∗t | {p, d, r} | Product Sales Revenue of p, d, r
SIL_∗t | {p, d, r} | Safe Inventory Level of p, d, r
CSL_∗t | {r} | Customer Service Level of r
Z_∗t | {p, d, r} | Net profit of p, d, r
4 Fuzzy Approach for Multi-objective Optimization

Considering the uncertainty inherent in human judgment, it is natural to assume that the decision-maker (DM) has a fuzzy goal for each objective Js, described by an interval [Js−, Js∗]: the s-th (maximized) objective is fully satisfactory when Js ≥ Js∗ and unacceptable when Js ≤ Js−. The original multi-objective optimization problem is thus equivalent to finding a decision that provides the maximal overall degree of satisfaction for the multiple fuzzy objectives. When the objectives are incompatible, the DM must make a compromise decision that provides a maximal degree of satisfaction for all of these conflicting objectives. The new optimization problem can be interpreted as the synthetic notation of a conjunction statement (maximize all objectives jointly). The result of this aggregation, D, can be viewed as a fuzzy intersection of all the fuzzy goals, s ∈ S, and is still a fuzzy set. The final degree of satisfaction for a given variable set, µD(x), is determined by aggregating the degrees of satisfaction of all objectives, µJs(x), s ∈ S, via specific t-norms such as the minimum or product operators. The procedure of the fuzzy satisfying approach for the multi-objective optimization problem, Eq. (1), is summarized as follows.

1. Determine the ideal and anti-ideal solutions by directly maximizing and minimizing each objective function, respectively:

$$\max_{\mathbf{x}\in\Omega} J_s = J_s^{*}\ \ \text{(ideal solution of } J_s\text{, totally acceptable value)},\qquad \min_{\mathbf{x}\in\Omega} J_s = J_s^{-}\ \ \text{(anti-ideal solution of } J_s\text{, unacceptable value)} \tag{4}$$

Notably, Js∗ and Js− are upper and lower bounds for Js. We use these values to define the membership function of each fuzzy objective.

2. Define each membership function. Without loss of generality, we adopt a linear function for all fuzzy objectives:

$$\mu_{J_s}=\begin{cases}1, & J_s \ge J_s^{*}\\ \dfrac{J_s-J_s^{-}}{J_s^{*}-J_s^{-}}, & J_s^{-}\le J_s\le J_s^{*}\\ 0, & J_s\le J_s^{-}\end{cases}\qquad \forall s\in S \tag{5}$$

3. (Phase I) Maximize the degree of satisfaction of the worst objective by selecting the minimum operator for fuzzy aggregation:

$$\max_{\mathbf{x}\in\Omega}\mu_D=\max_{\mathbf{x}\in\Omega}\min(\mu_{J_1},\mu_{J_2},\cdots,\mu_{J_S})=\mu_1 \tag{6}$$

4. (Phase II) Considering the satisfaction of all objectives, re-optimize the problem by selecting the product operator with a guaranteed minimum degree of satisfaction for all objectives:

$$\max_{\mathbf{x}\in\Omega^{+}}\mu_D=\max_{\mathbf{x}\in\Omega^{+}}(\mu_{J_1}\times\mu_{J_2}\times\cdots\times\mu_{J_S}),\qquad \Omega^{+}=\Omega\cap\{\mu_{J_s}\ge\mu_1,\ \forall s\in S\} \tag{7}$$
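To make the procedure concrete, the following sketch applies steps 1-4 (Eqs. (4)-(7)) to a small hypothetical bi-objective problem solved by grid search over a one-dimensional feasible set. The objective functions, the grid, and all numbers are illustrative stand-ins for the mixed-integer model of Eqs. (1)-(3), which is handled by a mathematical programming solver in the actual study.

```python
import numpy as np

# Hypothetical conflicting objectives on a 1-D feasible set Omega = [0, 1];
# stand-ins for the profit / service-level objectives of Eq. (1).
def J1(x): return 10.0 * x               # e.g. "profit" of one participant
def J2(x): return 8.0 * (1.0 - x ** 2)   # e.g. "service level" of another

omega = np.linspace(0.0, 1.0, 1001)      # crude grid search over Omega
J = np.vstack([J1(omega), J2(omega)])    # shape (S, |Omega|)

# Step 1: ideal (J*) and anti-ideal (J-) values, Eq. (4)
J_star, J_minus = J.max(axis=1, keepdims=True), J.min(axis=1, keepdims=True)

# Step 2: linear membership functions, Eq. (5)
mu = np.clip((J - J_minus) / (J_star - J_minus), 0.0, 1.0)

# Step 3 (Phase I): max-min aggregation, Eq. (6)
mu1 = mu.min(axis=0).max()

# Step 4 (Phase II): product aggregation on Omega+ = {x : mu_s(x) >= mu1}, Eq. (7)
feasible_plus = mu.min(axis=0) >= mu1 - 1e-9
best = np.where(feasible_plus, mu.prod(axis=0), -np.inf).argmax()

print(f"Phase I  worst-case satisfaction mu1 = {mu1:.3f}")
print(f"Phase II solution x = {omega[best]:.3f}, memberships = {mu[:, best].round(3)}")
```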
5 Numerical Example

Consider a multi-echelon supply chain consisting of 1 plant, 2 distribution centers, 2 retailers, and 2 products. Numerical values of all parameters can be found in [3]. We solve the multi-objective mixed-integer non-linear programming problem using the fuzzy approach procedure, and the results are
summarized in Table 3. Table 3 shows that by selecting the minimum as the fuzzy intersection operator, we obtain a more balanced satisfaction among all objectives, with degrees of satisfaction all around 0.66. By directly using the product operator to guarantee a unique solution, however, the results are unbalanced, with lower degrees of satisfaction for the profit and safe inventory level of d = 2 and for the profit of p = 1, while the high-performing objectives are given very high emphasis. The proposed two-phase method combines the advantages of these two popular fuzzy intersection operators: the minimum operator is used in phase I to find the least degree of satisfaction, and the product operator is applied in phase II with the guaranteed least membership value for all fuzzy objectives as additional constraints.

Table 3: Results of using the minimum operator, the product operator, and the two-phase method

Objective | minimum operator (Obj Value / Satisfaction) | product operator (Obj Value / Satisfaction) | two-phase method (Obj Value / Satisfaction)
Profit, r = 1 | 859,582 / 0.66 | 970,556 / 0.73 | 845,754 / 0.66
Profit, r = 2 | 1,066,607 / 0.66 | 1,208,310 / 0.75 | 1,053,162 / 0.66
Profit, d = 1 | 566,217 / 0.66 | 824,620 / 0.89 | 593,598 / 0.68
Profit, d = 2 | 1,959,172 / 0.66 | 1,515,645 / 0.49 | 1,935,237 / 0.66
Profit, p = 1 | 4,507,340 / 0.66 | 4,231,931 / 0.54 | 4,486,048 / 0.66
CSL, r = 1 | 0.92 / 0.72 | 0.99 / 1.00 | 0.99 / 1.00
CSL, r = 2 | 0.91 / 0.69 | 0.99 / 1.00 | 0.99 / 1.00
SIL, r = 1 | 0.63 / 0.67 | 0.91 / 0.97 | 0.88 / 0.94
SIL, r = 2 | 0.63 / 0.66 | 0.91 / 0.95 | 0.85 / 0.89
SIL, d = 1 | 0.66 / 0.66 | 1.00 / 1.00 | 0.99 / 0.99
SIL, d = 2 | 0.65 / 0.67 | 0.57 / 0.59 | 0.64 / 0.66
SIL, p = 1 | 0.65 / 0.66 | 0.77 / 0.79 | 0.65 / 0.66

CSL: Customer Service Level, SIL: Safe Inventory Level
6 Conclusion

In this paper, we investigate the fair profit distribution problem of a typical multi-echelon supply chain network. To implement this concept, we construct a multi-product, multi-stage, and multi-period production and distribution planning model with objectives such as maximizing the profit of each participating enterprise, maximizing the customer service level, maximizing the safe inventory level, and ensuring a fair profit distribution. Fuzzy set theory is used to obtain the compromise solutions, and we propose a two-phase fuzzy intersection method that combines the advantages of two popular t-norms to solve the fair profit distribution problem. A case study demonstrates that the proposed two-phase method can provide a better compensatory solution for multi-objective problems in a multi-echelon supply chain network.
References

1. Gjerdrum, J., Shah, N. and Papageorgiou, L.G.: Transfer price for multi-enterprise supply chain optimization. Ind. Eng. Chem. Res. 40 (2001) 1650.
2. Li, R.J. and Lee, E.S.: Fuzzy multiple objective programming and compromise programming with Pareto optimum. Fuzzy Sets and Systems 53 (1993) 275.
3. Chen, C.L., Wang, B.W. and Lee, W.C.: Multi-objective optimization for a multi-echelon supply chain network. Ind. Eng. Chem. Res. (2003), in press.
An Experimental Study of Model Predictive Control Based on Artificial Neural Networks

Ji-Zheng Chu1, Po-Feng Tsai2, Wen-Yen Tsai2, Shi-Shang Jang2*, David Shun-Hill Wong2, Shyan-Shu Shieh3, Pin-Ho Lin4 and Shi-Jer Jiang5

1 Department of Automation, Beijing University of Chemical Technology, Beijing
2 Chemical Engineering Department, National Tsing-Hua University, Hsin-Chu, Taiwan
3 Department of Occupational Safety and Hygiene, Chang Jung University, Tainan, Taiwan
4 Department of Chemical Engineering, Nanya Institute of Technology, Taoyuan, Taiwan
5 China Petroleum Corporation, Chia-Yi, Taiwan
Abstract. Practical implementations of two typical types of artificial neural networks (ANNs), feedforward networks and external recurrent networks, as the model for model predictive control (MPC) were carried out on the dual temperature control problem of two distillation columns: a pilot-scale i-butane and n-butane distillation column and a bench-scale ethanol and water column. The superiority of MPC based on ANN models over conventional proportional-integral controllers and over dynamic matrix control was demonstrated through experiments.
1 Introduction
In order to handle high non-linearity and complex dynamics, the features of a process, expressed as mathematical relationships called a process model, have to be taken into account in the design and operation of the corresponding control system. Quite a number of model-based control schemes have been proposed to incorporate a process model into a control system; Hussain [1] categorizes them into three classes, predictive control, inverse-model based control, and adaptive control, with the first being the most commonly found control technique. Dutta and Rhinehart [2] gave a brief comment on various control schemes that use process models. Model predictive control (MPC) has been a vibrant research topic in the process control area [3]. The basic idea of the MPC algorithm is to use a model to predict the future output trajectory of a process and to compute a controller action that minimizes the difference between the predicted trajectory and a user-specified one, subject to constraints [4].

* Corresponding author; electronic mail: [email protected]
Developing a valid model for the dynamics of a process is frequently the major part of the work required to implement most advanced control strategies; modeling costs normally account for over 75% of the expenditure in the design of an advanced control project. Artificial neural networks (ANNs) used as process models for control purposes offer clear advantages over more conventional modeling methods [1, 5]. Hussain summarized the active research on the application of artificial neural networks in model-based control design by listing 100 relevant papers, with the conclusion that actual successful online applications are still limited. The main objective of this study is to strengthen the practical experience with applications of artificial neural networks incorporated in model predictive control. The two most commonly used kinds of neural networks, feedforward networks (FFNs) and external recurrent networks (ERNs), incorporated in a standard model predictive control scheme for the dual temperature problem of distillation control, were tested on a bench-scale distillation column of water and ethanol and on a pilot-scale column of i-butane and n-butane. The major consideration in choosing distillation as the target process for the algorithm tests is that such a system constitutes a constrained, coupled, nonlinear, non-stationary process with disparate dynamics [2], and is therefore complex enough to reveal the advantages of a sophisticated scheme such as MPC based upon an artificial neural network model. Another reason for this choice is that there have been several reports on the application of model-based control strategies to distillation columns [2, 5, 6, 7]. The structure of model predictive control (MPC) is shown in Figure 1 and its working principle is well established [3, 4, 5, 8]; P, M and C are the process, model and controller, y and u are the controlled variables (CVs) and manipulated variables (MVs), d is the disturbance, and ȳ and ŷ are the setpoint and the model prediction. In this study, the Levenberg-Marquardt algorithm [9], as recommended by Ramchandran and Rhinehart [7], was adopted in searching for the increments of the MVs.
Fig. 1. General architecture of model predictive control
Artificial neural networks (ANNs) used as a model for MPC were trained and tested as suggested by Psichogios and Ungar[8]. Datasets for training and testing were collected by recording CVs under random changes of MVs.
2 Experiment on an Ethanol and Water Distillation Column
The purpose of this test was to compare the performance of model predictive control with an FFN as the model (the FFN-based MPC) with that of dynamic matrix control (DMC). The test was carried out on a bench-scale distillation column for an ethanol and water mixture; the structural and operational parameters of the column can be found elsewhere [10]. It should be noted that the azeotropic mixture (78.15 °C, 0.8943 mole fraction ethanol, at 1 atm) is likely to form at the top of the column. For this system, the controlled variables are the top and bottom temperatures (y1 and y2), and the manipulated variables are the reflux valve position (u1) and the heating steam pressure (u2). The measured relative gain array (RGA) indicated interactions between the two control loops.
Fig. 2. Comparison between the output of the FFN (a) and ERN (b) and the testing data for the ethanol-water column
Neural networks were trained to model the two controlled variables, the top and bottom temperatures. Their structural parameters are: one hidden layer with 3 hidden nodes, a hyperbolic tangent activation function for the hidden nodes, and a linear transfer function for the single output node. The input entries include u1(t-1), ..., u1(t-18), u2(t-1), ..., u2(t-18), and y(t), ..., y(t-6). These parameters were roughly determined by considering the dead time, time constant, and fitting error. Figure 2 depicts the testing results. Clearly, the FFNs perform much better than the ERNs in fitting the training data and in predicting the testing data, and were therefore chosen as the models for MPC. The FFN-based MPC was implemented by minimizing the following objective function:

$$\sum_{i=1}^{2}\sum_{p=0}^{11}\Bigl\{\bar{y}_i(t+p)-\bigl[\hat{y}_i(t+p)+h_i(t+p)\bigr]\Bigr\}^{2} \tag{2}$$
under the constraints

$$35\% \le u_1 \le 95\% \tag{3}$$

$$\Delta u_1 \le 6\% \tag{4}$$

$$55\ \mathrm{kPa} \le u_2 \le 145\ \mathrm{kPa} \tag{5}$$

$$\Delta u_2 \le 12\ \mathrm{kPa} \tag{6}$$
to determine the manipulated variables, where ȳ is the setpoint, ŷ is the value predicted by the ANN model, and h is the difference between the measured and predicted values of a controlled variable. The control horizon C was chosen to be 1 for rapid on-line execution; our testing showed that the control performance with C = 1 and C = 2 was very similar.
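The sketch below illustrates, under stated assumptions, how the cost of Eq. (2), the bias correction h, and the move constraints of Eqs. (3)-(6) fit together for C = 1. The linear function standing in for the trained FFN and the current operating values are hypothetical placeholders, and scipy's bounded trust-region least-squares solver is used in place of the Levenberg-Marquardt search described above (scipy's LM option does not support bounds); the prediction is held constant over the horizon for brevity.

```python
import numpy as np
from scipy.optimize import least_squares

# Hypothetical one-step process "model" standing in for the trained FFN:
# maps the inputs u = [u1, u2] to predicted [top, bottom] temperatures.
def ann_model(u):
    u1, u2 = u
    return np.array([78.0 + 0.05 * u1 - 0.02 * u2,    # y1_hat (top T, illustrative)
                     95.0 - 0.03 * u1 + 0.08 * u2])   # y2_hat (bottom T, illustrative)

y_sp   = np.array([79.0, 96.0])          # setpoints y_bar
y_meas = np.array([78.4, 95.1])          # current measurements (illustrative)
u_prev = np.array([60.0, 100.0])         # previous MVs (% valve, kPa steam)

h = y_meas - ann_model(u_prev)           # bias correction h = measured - predicted

def residuals(u):
    # residual vector whose squared norm is the cost of Eq. (2) for C = 1,
    # with a constant prediction over the horizon for brevity
    return y_sp - (ann_model(u) + h)

# Bounds play the role of Eqs. (3)-(6); the trust-region solver stands in
# for the Levenberg-Marquardt search used in the paper.
lo = np.maximum([35.0, 55.0], u_prev - np.array([6.0, 12.0]))
hi = np.minimum([95.0, 145.0], u_prev + np.array([6.0, 12.0]))
sol = least_squares(residuals, u_prev, bounds=(lo, hi))
print("next MVs:", sol.x.round(2))
```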
Fig. 3. Response of the ethanol-water distillation column under (a) the FFN-based MPC and (b) the DMC, to a setpoint step change in the top temperature
Fig. 4. Response of the ethanol-water distillation column under (a) the FFN-based MPC and (b) the DMC, to a setpoint step change in the bottom temperature
The DMC used in this study had a prediction horizon of 12 steps and a control horizon of 2 steps ahead of the current instant, and took no penalty on the manipulated variables; the detailed DMC algorithm can be found in the standard textbook by Marlin [11]. Figure 3 shows the transient curves of the top and bottom temperatures of the test column under the control of the FFN-based MPC and the DMC, respectively, when a setpoint step change in the top temperature was introduced. Figure 4 serves the same purpose as Figure 3 for the case of a setpoint change in the bottom temperature. It is clear from the comparison in Figures 3 and 4 that the FFN-based MPC was much superior to the DMC in tracking setpoint changes for this ethanol-water distillation column, where strong non-linearity was present because the operation was in the neighborhood of the azeotropic mixture at the top of the column.
3 Experiment on an i-Butane and n-Butane Distillation Column
The aim of this test was to compare model predictive control using an ERN model (the ERN-based MPC) with conventional proportional-integral (PI) controllers. The test was performed on a pilot-scale column for i-butane and n-butane. This pilot-scale column is packed with wire mesh packing, and its parameters can be found elsewhere [10]. The reboiler was heated by electricity, four metering pumps were included to control the various flow rates, and J-type thermocouples were adopted in
temperature measurement. The separation capability of the column was about 18 theoretical plates. The temperatures at the third and the twelfth theoretical plates, referred to as the top and bottom temperatures respectively, were the controlled variables: y1 = the temperature at the height of the 3rd theoretical plate (top temperature) and y2 = the temperature at the height of the 12th theoretical plate (bottom temperature). The manipulated variables are u1 = the reflux pump speed and u2 = the reboiler heating power. Optimal parameters for the two PI controllers were determined by the BLT approach [12].
Fig. 5. Comparison between the output of the ERN and its training data for the i-butane and n-butane column
Four neural networks were trained in this case, as for the ethanol-water column, to model the two controlled variables (the temperatures at the 3rd and 12th plates); their structural parameters are the same as those for the ethanol-water column. Again, the FFNs performed much better in data fitting. Figure 5 shows the training results with the ERNs. The objective function and constraints in our ERN-based MPC are:

$$\sum_{i=1}^{2}\sum_{p=0}^{15}\Bigl\{\bar{y}_i(t+p)-\bigl[\hat{y}_i(t+p)+h_i(t+p)\bigr]\Bigr\}^{2} \tag{8}$$

$$6\% \le u_1 \le 46\% \tag{9}$$

$$23\% \le u_2 \le 29\% \tag{10}$$
The control horizon was C = 1. It was a surprise for us to find that MPC with the FFN model exhibited steady-state offset, as shown in Figure 6. The reason is that FFNs are so-called series-parallel models [5], and offset is unavoidable in multistep predictive control with such a model if model mismatch exists; this offset is confirmed by our theoretical analysis and simulation study [13]. The performance of the ERN-based MPC was examined in this test. For comparison, PI controllers were also used to control the column in a parallel run. Figure 7 shows the results of control with the ERN-based MPC and with the PI controllers. The transient responses of the controlled variables to step changes in the setpoints of the top and bottom temperatures show the clear superiority of the ERN-based MPC over conventional PI control in this dual temperature control problem, where interaction exists, as shown by the measured relative gain matrix [10].
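The offset argument hinges on how multi-step predictions are formed, which the following toy sketch illustrates: a parallel (ERN-style) recursion feeds its own predictions back and therefore exposes an erroneous steady-state gain, whereas a series-parallel (FFN-style) one-step prediction, re-anchored on measurements, hides the mismatch. The first-order plant, the mismatched model, and all numbers are hypothetical and serve only to illustrate the distinction.

```python
# Toy first-order plant and a deliberately mismatched one-step model
def plant(y, u): return 0.90 * y + 0.10 * u        # true steady-state gain = 1.0
def model(y, u): return 0.90 * y + 0.08 * u        # modelled gain = 0.8

u, y = 1.0, 0.0
for _ in range(200):                                # run the plant to steady state
    y = plant(y, u)

# Parallel (ERN-style) long-range prediction: recursive in its own outputs,
# so the erroneous steady-state gain (0.8) is exposed and can be corrected.
yp = y
for _ in range(200):
    yp = model(yp, u)

# Series-parallel (FFN-style) prediction: always re-anchored on the measured y,
# so each one-step prediction looks accurate even though the gain is wrong.
ysp = model(y, u)

print(f"plant steady state      : {y:.3f}")
print(f"parallel model (ERN)    : {yp:.3f}  <- reveals the gain mismatch")
print(f"series-parallel one-step: {ysp:.3f}  <- small error hides the mismatch")
```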
Fig. 6. Transient process of the top and bottom temperatures of the i-butane and n-butane distillation column under the FFN-based MPC, in response to step changes in the setpoints of these two controlled variables
Fig. 7. Response of the i-butane and n-butane column under (a) the ERN-based MPC and (b) the PI control, to setpoint step changes in the bottom temperature
4 Conclusion
The superiority of MPC over PI control and DMC for the dual temperature control problem in distillation systems was experimentally demonstrated. The advantage of MPC comes from its capability to decouple the interactions between different control loops and from the capability of ANNs to capture the nonlinear dynamics of a process. While the FFNs did a better job than the ERNs in fitting the training data, both types of ANNs performed equally well. We also find in this study that FFNs, as a kind of series-parallel model, are not proper for multistep predictive control, because MPC with such a model produces steady-state offset, whereas ERNs, as parallel models, are more suitable for MPC. The success of the ERNs in the control of the i-butane and n-butane column strongly supports the viewpoint of Rhinehart and his coworkers [2, 7], who stated that “it is gain prediction, more-so than state prediction, that makes model-based control effective.”
Acknowledgement

The authors thank the National Science Council, Republic of China, for financial support of this work through grant NSC90-2622-E007-003.
References

[1] Hussain, M.A., 1999. Review of the application of neural networks in chemical process control --- simulation and online implementation. Artificial Intelligence in Engineering, 13: 55-68.
[2] Dutta, P. and Rhinehart, R.R., 1999. Application of neural network control to distillation and an experimental comparison with other advanced controllers. ISA Transactions, 38: 251-278.
[3] Morari, M. and Lee, J.H., 1999. Model predictive control: past, present and future. Computers and Chemical Engineering, 23: 667-682.
[4] Garcia, C.E., Prett, D.M. and Morari, M., 1989. Model predictive control: theory and practice --- a survey. Automatica, 25(3): 335-348.
[5] MacMurray, J.C. and Himmelblau, D.M., 1995. Modeling and control of a packed distillation column using artificial neural networks. Computers and Chemical Engineering, 19(10): 1077-1088.
[6] Shaw, A.M., Doyle III, F.J. and Schwaber, J.S., 1997. A dynamic neural network approach to nonlinear process modeling. Computers and Chemical Engineering, 21(4): 371-385.
[7] Ramchandran, S. and Rhinehart, R.R., 1995. A very simple structure for neural network control of distillation. Journal of Process Control, 5(2): 115-128.
[8] Psichogios, D.C. and Ungar, L.H., 1991. Direct and indirect model based control using artificial neural networks. Ind. Eng. Chem. Res., 30: 2564-2573.
[9] Marquardt, D.W., 1963. An algorithm for least-squares estimation of nonlinear parameters. J. Soc. Indust. Appl. Math., 11(2): 431-441.
[10] Tsai, W.-Y., 2001. Artificial Neural Network Model Predictive Control on Packed Distillation Columns. Master's thesis, National Tsing-Hua University, Taiwan.
[11] Marlin, T.E., 1995. Process Control: Designing Processes and Control Systems for Dynamic Performance. McGraw-Hill, International Editions.
[12] Luyben, W.L., 1990. Process Modeling, Simulation, and Control for Chemical Engineers, 2nd ed. McGraw-Hill, International Editions.
[13] Chu, J.-Z., Tsai, P.-F., Tsai, W.-Y., Jang, S.-S., Shieh, S.-S., Lin, P.-H. and Jiang, S.-J., 2002. Multistep predictive control based on artificial neural networks. Paper submitted to I&EC.
Fault Recognition System of Electrical Components in Scrubber Using Infrared Images

Kuo-Chao Lin1 and Chia-Shun Lai2

1 Department of Power Mechanical Engineering, National Tsing-Hua University, Hsinchu, Taiwan 300, R.O.C. [email protected]
2 Center for Environmental, Safety and Health Technology Development, Industrial Technology Research Institute, Hsinchu, Taiwan 300, R.O.C. [email protected]
Abstract. In this paper, an automatic recognition system is described for diagnosing fault patterns of the electrical components in a scrubber system. The implementation of this fault recognition scheme integrates several technologies. Firstly, preprocessing techniques are applied to diminish the environmental effects and the background temperature, so that the shape of the rising-temperature area of each electrical component can be obtained clearly. The thermal shape and the temperature distribution are selected for feature extraction: the thermal shape is used to distinguish components under a loading condition, and the temperature distribution is used to evaluate the deterioration severity of a component. Finally, a radial basis function neural network is built to identify the various failure modes. The accuracy of the designed fault recognition system reaches 89.4% with 80 hidden nodes, which shows the feasibility of using this model to diagnose and classify the failure modes of the electrical components in a scrubber system.
1 Introduction
A measuring and diagnostic integrated system based on infrared images has been developed and offers many advantages over conventional temperature measurement schemes. There are many successful application examples that use portable thermographic instruments to detect abnormal conditions of electrical components [1][2][3]. In 1998, Hou [4] utilized infrared thermography to discuss technical problems of diagnosing internal faults of electrical equipment and concluded that most faults can be detected by infrared thermography without powering off or dissecting the equipment. Merryman et al. [5] developed a diagnostic technique for power supply systems using infrared images; the concept of using infrared images in a real-time diagnostic and control system has been successfully demonstrated. Moreover, some detecting and monitoring techniques
utilizing infrared images have been developed for power and control components [6][7]. These research findings reveal that diagnostic systems based on infrared images are a powerful tool for preventing hazards caused by industrial equipment. In this paper, an automatic infrared image recognition system is proposed to diagnose the abnormal behaviors of electrical components in the electrical control system by using a neural network model. This system is capable of detecting different components of different scrubbers in the electrical control system continuously and automatically, even if the layout of the electrical components varies. Furthermore, it can automatically detect abnormal components and their level of deterioration. This system is expected to be a fast and accurate artificial-intelligence recognition system. The paper is arranged as follows. Section 2 describes the clustering of the failure modes of the electrical components. Section 3 discusses the preprocessing techniques and Section 4 covers the feature extraction from the infrared images. Section 5 develops the neural network architecture appropriate for this study, and the results are presented in Section 6. The conclusions are given in the last section.
2 Failure Modes
A scrubber system can be divided into a mechanical system and an electrical control system. The mechanical system provides the waste gas processing function in three sections: oxidation, burning chamber, and cooling and scrubbing reaction. The electrical control system supplies the heat needed for processing by the mechanical system. Different kinds of components are needed in the electrical control system, as shown in Fig. 1.
Fig. 1. Layout of the electrical control system
To classify the failure features in the infrared images, a new clustering algorithm is applied. The classification scheme is based on the temperature-rise pattern caused by heat effects: heat is generated by current effects, voltage effects, and electromagnetic effects. In this study, the failure modes of the electrical control system are classified as follows. The components with rising temperature caused by the voltage effect are the relay, solid-state relay and fuse. The component with rising temperature caused by the electromagnetic effect is the transformer, and the components with rising temperature caused by overloading and current unbalance are the single cable, double cables and triple cables. Furthermore, two components with rising temperature caused by connector loosening (a current effect) are the no-fuse breaker and the circuit breaker. In total, there are eighteen failure modes, as shown in Table 1. To classify the deterioration severity, two levels are defined: the alarm level and the dangerous level. When a component temperature is higher than the standard but still under the dangerous level, the insulation material starts to deteriorate and its color starts to change; after the component temperature reaches the dangerous level, the insulation material may start to melt and a fire could occur within a short period of time. Both the electrical component and its deterioration severity can be recognized at the same time.

Table 1. Failure modes used in the automatic recognition system
Mode | Component Type | Deterioration Severity
1 | Relay | Alarm
2 | Relay | Dangerous
3 | Solid State Relay | Alarm
4 | Solid State Relay | Dangerous
5 | Fuse | Alarm
6 | Fuse | Dangerous
7 | Transformer | Alarm
8 | Transformer | Dangerous
9 | Single Cable | Alarm
10 | Single Cable | Dangerous
11 | Double Cables | Alarm
12 | Double Cables | Dangerous
13 | Triple Cables | Alarm
14 | Triple Cables | Dangerous
15 | No Fuse Breaker | Alarm
16 | No Fuse Breaker | Dangerous
17 | Circuit Breaker | Alarm
18 | Circuit Breaker | Dangerous
3 Preprocessing Technique
The infrared image of the electrical components is a two-dimensional array F(x, y). The preprocessing includes two steps: the first removes the reflection influence, and the second identifies the higher-temperature area. Some pixel values are increased by the reflection of surrounding heat sources from the electrical components, and this reflection causes temperature measurement errors. Therefore, a smoothing operation is adopted to
diminish the reflection influence of surrounding heat sources in the first step. The smoothing operation is given in equation (1):

$$G(x, y) = F(x, y) \times H(x, y) \tag{1}$$
where H(x, y) is a mean filter over a predefined neighborhood of F(x, y). Under ideal conditions, the pixels with higher temperature and the background are grouped into two dominant modes, and one way to extract the higher-temperature shape from the background is to select the background temperature as the threshold. In reality, however, the background temperature may be raised by thermal radiation, in which case the threshold cannot depend on the background temperature alone. A threshold Tc for the infrared image G(x, y) is therefore defined as

$$T_c = \mu + \alpha \times Var^{1/2} \tag{2}$$

where µ is the mean value of the matrix G(x, y), Var is its variance, and α is an adaptive parameter. The threshold selection works as follows: if a pixel (x, y) of G(x, y) is higher than the threshold Tc, its original temperature value is kept; otherwise, the temperature value of the pixel is set to zero. The resulting image, with the background removed from G(x, y) through the threshold Tc, is written as I(x, y).
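A minimal sketch of the two preprocessing steps on a synthetic thermogram is given below, assuming the 4×4 mean filter and α = 1.4 reported in Section 6 and reading Eq. (2) as the mean plus α times the square root of the variance; the image itself is a placeholder, not experimental data.

```python
import numpy as np
from scipy.ndimage import uniform_filter

rng = np.random.default_rng(0)
# Synthetic 320x240 thermogram (deg C) standing in for a real infrared image F(x, y)
F = 25.0 + rng.normal(0.0, 0.5, size=(240, 320))
F[100:130, 150:190] += 15.0                     # a hypothetical overheated component

# Eq. (1): smoothing with a mean filter over a 4x4 neighborhood (Section 6 setting)
G = uniform_filter(F, size=4)

# Eq. (2): adaptive threshold from the global statistics of G, with alpha = 1.4
alpha = 1.4
Tc = G.mean() + alpha * np.sqrt(G.var())        # mean + alpha * (variance)^(1/2)

# Keep pixels above the threshold, zero out the background -> I(x, y)
I = np.where(G > Tc, G, 0.0)
print(f"threshold Tc = {Tc:.2f} C, hot pixels kept = {(I > 0).sum()}")
```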
4 Feature Extraction
For the feature extraction algorithm, the thermal shape and the temperature distribution are chosen. The thermal shape is used to distinguish the components under a loading condition, and the temperature distribution is used to evaluate the deterioration severity of the components. Although the thermal shape varies under translation, two-dimensional rotation and scaling, the feature vector of the thermal shape varies only within a very small range. The orientation and position of an electrical component in an infrared image vary with the installation direction of the same electrical component on site. For a two-dimensional infrared image I(x, y), the moment of order (p + q) is defined by

$$m_{pq} = \sum_x \sum_y x^p y^q I(x, y) \tag{3}$$

The central moment can be expressed as

$$\mu_{pq} = \sum_x \sum_y (x - \bar{x})^p (y - \bar{y})^q I(x, y) \tag{4}$$

where $\bar{x} = m_{10}/m_{00}$ and $\bar{y} = m_{01}/m_{00}$.
The size of the electrical components and the distance between the detected object and the infrared detector also influence the scale of the objects in the infrared images, so the central moments need to be normalized to achieve scale invariance. The normalized central moments are denoted by

$$\eta_{pq} = \frac{\mu_{pq}}{\mu_{00}^{\gamma}} \tag{5}$$

where $\gamma = \frac{p+q}{2} + 1$ for $p + q = 2, 3, \ldots$. In order to apply the invariant moments to a thermal shape, a lower-order moment method is applied to distinguish different thermal shapes. The following set of moments was shown to be invariant under translation, rotation, and scaling [8]:
$$\phi_1 = \eta_{20} + \eta_{02} \tag{6}$$

$$\phi_2 = (\eta_{20} - \eta_{02})^2 + 4\eta_{11}^2 \tag{7}$$

$$\phi_3 = (\eta_{30} - 3\eta_{12})^2 + (3\eta_{21} - \eta_{03})^2 \tag{8}$$

$$\phi_4 = (\eta_{30} + \eta_{12})^2 + (\eta_{21} + \eta_{03})^2 \tag{9}$$

$$\phi_5 = (\eta_{30} - 3\eta_{12})(\eta_{30} + \eta_{12})\bigl[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\bigr] + (3\eta_{21} - \eta_{03})(\eta_{21} + \eta_{03})\bigl[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\bigr] \tag{10}$$

$$\phi_6 = (\eta_{20} - \eta_{02})\bigl[(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\bigr] + 4\eta_{11}(\eta_{30} + \eta_{12})(\eta_{21} + \eta_{03}) \tag{11}$$

$$\phi_7 = (3\eta_{21} - \eta_{03})(\eta_{30} + \eta_{12})\bigl[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\bigr] - (\eta_{30} - 3\eta_{12})(\eta_{21} + \eta_{03})\bigl[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\bigr] \tag{12}$$
The invariant moments are combined into the thermal shape vector

$$S = [\phi_1, \ldots, \phi_7] \tag{13}$$

where φ1 is the first invariant moment, φ2 the second, and so on. After an infrared image has been processed by the preprocessing step to give I(x, y), the higher-temperature pixels of the electrical components are extracted. Different electrical components have different levels of deterioration severity because of their own operating temperature ranges, so the same degree of temperature rise for different electrical components may correspond to different levels of deterioration severity. For the feature extraction of the temperature distribution, temperature equalization is used to divide the higher-temperature area into L temperature levels after a normalization process; the number of contours is equal to the number of temperature levels L. The temperature distribution feature vector B is composed of the mean value and the highest value of each contour:

$$B = [M_k \;\; X_k], \qquad k = 1, 2, \ldots, L \tag{14}$$
where L is the number of temperature levels, M_k is the mean value of each contour, and X_k is the highest value of each contour. The vector B represents the feature vector of the temperature distribution for the failure modes. The overall feature vector X is composed of the feature data selected from the thermal shape S in (13) and the temperature distribution B in (14).
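The shape part of the feature vector can be sketched as follows for a small synthetic thresholded image I(x, y): raw moments from Eq. (3), central moments from Eq. (4), normalized moments from Eq. (5), and, for brevity, only the first two invariants φ1 and φ2 of Eqs. (6)-(7); the remaining invariants follow the same pattern. The blob geometry is illustrative only.

```python
import numpy as np

# Small synthetic thresholded image I(x, y): a warm elliptical blob on a cold background
yy, xx = np.mgrid[0:60, 0:80]
I = np.where(((xx - 40) / 18.0) ** 2 + ((yy - 30) / 9.0) ** 2 <= 1.0, 35.0, 0.0)

def raw_moment(I, p, q):
    # Eq. (3): m_pq = sum_x sum_y x^p y^q I(x, y)
    y_idx, x_idx = np.indices(I.shape).astype(float)
    return np.sum((x_idx ** p) * (y_idx ** q) * I)

m00, m10, m01 = raw_moment(I, 0, 0), raw_moment(I, 1, 0), raw_moment(I, 0, 1)
xc, yc = m10 / m00, m01 / m00                     # centroid (x_bar, y_bar)

def central_moment(I, p, q):
    # Eq. (4): central moment about the centroid
    y_idx, x_idx = np.indices(I.shape).astype(float)
    return np.sum(((x_idx - xc) ** p) * ((y_idx - yc) ** q) * I)

def eta(p, q):
    # Eq. (5): normalized central moment, gamma = (p + q)/2 + 1
    return central_moment(I, p, q) / central_moment(I, 0, 0) ** ((p + q) / 2.0 + 1.0)

phi1 = eta(2, 0) + eta(0, 2)                                  # Eq. (6)
phi2 = (eta(2, 0) - eta(0, 2)) ** 2 + 4.0 * eta(1, 1) ** 2    # Eq. (7)
print(f"phi1 = {phi1:.4f}, phi2 = {phi2:.4f}")
```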
5 RBF Network
A generalized regression neural (GRN) network is used to recognize the failure modes of the electrical components. The architecture of the GRN network, with two layers, is modified from the RBF network. In the first layer, the net input to the RBF transfer function is the vector distance between its weight vector and the input vector, multiplied by the bias. In the second layer, the transfer function is linear, and this layer has as many neurons as there are input and target vectors. The radial basis layer is described by

$$d(\mathbf{x}) = w_0 + \sum_{i=1}^{h} \theta_i\, p(\mathbf{x}) \tag{15}$$

where p(·) is a given function, θ_i is the weight, and h is the number of hidden nodes. Several choices of p(·) are possible in an RBF network; here p(·) is chosen to be the Gaussian function

$$p(\mathbf{x}) = \exp\!\left(-\frac{\lVert \mathbf{x} - \mu_i \rVert^{2}}{\sigma_i^{2}}\right) \tag{16}$$

where µ_i is the mean value of the Gaussian function and σ_i is its variance.
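A minimal sketch of the radial-basis layer of Eqs. (15)-(16) follows: Gaussian activations around stored centers, then a linear output layer fitted by least squares. The toy two-class feature data, the number of hidden nodes, and the common width σ are placeholders and do not correspond to the network trained on the 360 thermograms.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy 2-D feature vectors for two hypothetical failure modes, labelled -1 / +1
X = np.vstack([rng.normal(0.0, 0.3, (30, 2)), rng.normal(1.0, 0.3, (30, 2))])
t = np.hstack([-np.ones(30), np.ones(30)])

h, sigma = 10, 0.5
centers = X[rng.choice(len(X), size=h, replace=False)]   # the mu_i of Eq. (16)

def rbf_layer(X):
    # Eq. (16): Gaussian activations p_i(x) = exp(-||x - mu_i||^2 / sigma^2)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / sigma ** 2)

# Eq. (15): d(x) = w0 + sum_i theta_i p_i(x); fit [w0, theta] by least squares
P = np.hstack([np.ones((len(X), 1)), rbf_layer(X)])
w, *_ = np.linalg.lstsq(P, t, rcond=None)

pred = np.sign(P @ w)
print(f"training accuracy: {(pred == t).mean():.2%}")
```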
6 Experimental Results
The infrared camera used in this experiment is an INFRAMETRICS SC 1000 IR Imaging Radiometer with a nominal 3-5 µm wavelength. With this camera, the infrared image can be displayed and recorded for further analysis. The scrubber system was under working conditions while the infrared images were captured. In this experiment, 20 infrared images of each component type were taken; ten images of each type were chosen as the training data and the others were used as the testing data. There are 18 failure modes in this experiment, so in total 18×20 = 360 infrared images were taken. The target value of each fault mode is defined between -1 and +1.
For a thermal image, the original matrix F(x, y) is 320×240. In the preprocessing step, a 4×4 neighborhood mean filter H(x, y) is used to average the temperature value of each pixel of F(x, y) in the smoothing operation. When the adaptive parameter α of the threshold selection in equation (2) equals 1.4, an optimal thresholding effect is obtained. After preprocessing, seven feature values of the thermal shape are obtained from the moment invariants in equation (13), and there are ten feature values of the temperature distribution when the number of temperature levels L in equation (14) is set to five. The feature vector X is composed of the seven thermal-shape values and the ten temperature-distribution values. The results of the performance test are shown in Table 2. The accuracy of the RBF model is 89.4% with eighty hidden nodes, 92.8% with ninety nodes, and 96.1% with one hundred nodes. The RMS error of the RBF network reached 0.001, a satisfactory result.

Table 2. Comparison of recognition accuracy
Hidden Nodes | Epochs | Accuracy (%)
60 | 60 | 67.2
70 | 60 | 84.4
80 | 80 | 89.4
90 | 80 | 92.8
100 | 100 | 96.1
110 | 100 | 96.1
7 Conclusion
A systematic process is presented in this paper to identify the failure modes of the electrical components in a scrubber system. The preprocessing of the infrared images extracts suitable features for identification, so that the characteristics of the training data required for abnormal-component recognition are easier to refine. After preprocessing, a feature extraction process is applied to recognize the failure mode of the abnormal component, and an RBF network scheme is implemented to perform the fault diagnosis of the electrical components. The accuracy is over 89.4% for the associated patterns of electrical components.
References

[1] Blazquez, C.H.: Detection of Problems in High Power Voltage Transmission and Distribution Lines with an Infrared Scanner/Video System. Proceedings of Thermosense XVI, Vol. 2245, SPIE (1994) 27-32
[2] Newport, R.: Infrared electrical inspection myths. Proceedings of Thermosense XIX, Vol. 3056, SPIE (1997) 124-132
[3] Giesecke, J.L.: Substation component identification for infrared thermographers. Proceedings of Thermosense XIX, Vol. 3056, SPIE (1997) 153-163
[4] Hou, N.: The infrared thermography diagnostic technique of high-voltage electrical equipments with internal faults. Power System Technology International Conference, IEEE, Vol. 1 (1998) 110-115
[5] Merryman, S.A. and Nelms, R.M.: Diagnostic technique for power systems utilizing infrared thermal imaging. Industrial Electronics, IEEE (1995) 615-628
[6] Ishino, R.: Detection of a faulty power distribution apparatus by using thermal images. Power Engineering Society Winter Meeting, IEEE (2002) 1332-1337
[7] Chan, W.L., So, A.T.P. and Lai, L.L.: Three-dimensional Thermal Imaging for Power Equipment Monitoring. Generation, Transmission and Distribution, IEE Proceedings, Vol. 1 (2000) 355-360
[8] Hu, M.K.: Visual pattern recognition by moment invariants. IRE Trans. Inf. Theory, Vol. IT-8 (1962) 179-187
Nonlinear Process Modeling Based on Just-in-Time Learning and Angle Measure

Cheng Cheng and Min-Sen Chiu*

Department of Chemical and Environmental Engineering, National University of Singapore, 10 Kent Ridge Crescent, Singapore 119260
*[email protected]
Abstract: A new just-in-time learning methodology incorporating both a distance measure and an angle measure is developed. This enhances the existing methods, in which the available information on the angular relationship between two data samples is not exploited. In addition, a new procedure for selecting the relevant data set is proposed. The proposed methodology is illustrated by a case study of modeling a polymerization reactor, and the adaptive ability of just-in-time learning is also evaluated.
1 Introduction
Mathematical models are often required for process modeling, control, and fault detection and isolation. However, most chemical processes are multivariable and nonlinear in nature, and their dynamics can be time-varying. First-principles models are thus often unavailable due to the lack of complete physicochemical knowledge of chemical processes. An alternative approach is to develop data-driven methods that build models from process data measured in industrial processes. Traditional treatments of data-driven modeling focus on global approaches, such as neural networks, fuzzy sets and other kinds of nonlinear parametric models [1]. However, when dealing with large sets of data, this approach becomes less attractive because of the difficulties in specifying the model structure and the complexity of the associated optimization problem, which is usually highly non-convex. Another fundamental limitation of these models is their inability to extrapolate accurately once the information is outside the range of the data used to generate or train the model. On the other hand, the idea of local modeling is to approximate a nonlinear system with a set of relatively simple local models valid in certain operating regimes. The T-S fuzzy model [2] and the neuro-fuzzy network [1], [3] are well-known examples of the local modeling approach. Similar to the global modeling approaches, however, the local modeling approaches require parametric regression, and thus they suffer the drawbacks of requiring a priori knowledge to determine the partition of the operating space and a complicated training strategy to determine the optimal parameters of the models.
To alleviate these difficulties, Just-in-Time Learning (JITL) [4] was recently developed as an attractive alternative for modeling nonlinear systems. It is also known as instance-based learning [5], locally weighted modeling [6], lazy learning [7], or model-on-demand [8] in the literature. However, distance measures are overwhelmingly used in the previous work [5-8], and the complementary information available from angular relationships has not been exploited. In this paper, a new methodology for JITL based on both an angle measure and a distance measure is proposed. In addition, a new procedure for selecting the relevant data set is proposed. A case study of modeling an isothermal free-radical polymerization reactor is presented to evaluate the efficiency of the proposed method, and the online adaptive ability of JITL is also studied.
2 Just-in-Time Learning
Compared with traditional modeling methods, JITL has no standard learning phase: it merely gathers the data and stores them in a database, and no computation is performed until a query data point arrives. In the inquiry stage, there are three main steps: (1) the relevant data samples in the database that match the query data are searched by some nearest-neighborhood criterion; (2) a local model is built based on the relevant data; (3) the local model predicts the model output for the current query data. The local model is then discarded right after the answer is obtained; when the next query data point arrives, a new local model is built by the same procedure. It should be noted that the JITL model is only locally valid for the operating condition characterized by the current query data. In this sense, JITL constructs a local approximation of the dynamic system, so a simple model structure can be chosen, e.g. an ARX model. This feature allows JITL to be conveniently incorporated into model-based controller designs or process monitoring methods. Another advantage of JITL is its inherently adaptive nature, which is achieved by storing the current measured data in the database [1]. In contrast, neural network and neuro-fuzzy models require model updates from scratch, namely both the network structure (e.g. the number of hidden neurons in the former case and the number of fuzzy rules in the latter) and the model parameters may need to be changed simultaneously. Evidently, this procedure is not only expensive from a computational point of view, but it also interrupts the plant operation if these models are used for other purposes such as model-based controller design. To facilitate the ensuing developments, the algorithm of JITL is described next. Suppose that a database of process data {(y_i, x_i)}, i = 1, ..., N, y_i ∈ R, x_i ∈ R^n, is collected. It is worthwhile pointing out that the vector x_i is formed by the past values of both the process input and the process output, as typically required in building an ARX model. Given a specific query data point q ∈ R^n whose elements are defined in the same way as those of x_i, the objective of JITL is to predict the model output ŷ_q = f(q) from the known database. In the literature, a distance measure d(q, x_i), e.g. the Euclidean norm d(q, x_i) = ||q − x_i||_2, is overwhelmingly used to select the relevant
data set from the database by evaluating the relevance (or similarity) between the query data q and each x_i in the entire database; a smaller distance measure indicates a greater similarity between q and x_i. To do so, a weight w_i is assigned to each data point x_i and is calculated by a kernel function, w_i = K(d(q, x_i)/h), where h is the bandwidth of the kernel function K, which is usually a Gaussian function, K(d) = e^{−d²}. If a linear model is employed to predict the model output ŷ_q, the query answer is [6]:

$$\hat{y}_q = \mathbf{q}^{T}(\mathbf{Z}^{T}\mathbf{Z})^{-1}\mathbf{Z}^{T}\mathbf{v} \tag{1}$$
where Z = WΦ, v = Wy, W ∈ R^{N×N} is a diagonal matrix with diagonal elements w_i, Φ ∈ R^{N×n} is the matrix whose rows are x_i^T, and y = [y_1, y_2, ..., y_N]^T. In JITL, the PRESS statistic [9] is used to perform leave-one-out cross validation to assess the generalization capability of the local linear model [6]. In doing so, the optimal value of h, h_opt, and the model output ŷ_q are determined as follows: for a given h, eq. (1) is used to compute the predicted output and the validation error is calculated by the leave-one-out cross validation test. This procedure is repeated for a number of values of h, and h_opt is chosen as the one with the smallest validation error. With h_opt known, the optimal model prediction is then computed by eq. (1) for the current query data q.
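A sketch of a conventional distance-based JITL query following Eq. (1) is given below: Gaussian kernel weights from the Euclidean distances, a weighted local linear fit, and prediction at the query point. The toy nonlinear database and the few candidate bandwidths shown are illustrative; in JITL the bandwidth would be selected by the leave-one-out PRESS procedure described above.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy database: regressor vectors x_i and scalar outputs y_i from a nonlinear map
Phi = rng.uniform(-1.0, 1.0, size=(200, 2))
y = np.sin(2.0 * Phi[:, 0]) + 0.5 * Phi[:, 1] ** 2 + 0.05 * rng.normal(size=200)

def jitl_query(q, Phi, y, h):
    # Gaussian kernel weights w_i = K(d(q, x_i)/h), K(d) = exp(-d^2)
    d = np.linalg.norm(Phi - q, axis=1)
    w = np.exp(-(d / h) ** 2)
    Z = w[:, None] * Phi                 # Z = W * Phi
    v = w * y                            # v = W * y
    theta = np.linalg.lstsq(Z, v, rcond=None)[0]   # solves (Z^T Z) theta = Z^T v
    return q @ theta                     # Eq. (1): y_hat_q = q^T (Z^T Z)^-1 Z^T v

q = np.array([0.3, -0.4])
for h in (0.1, 0.3, 1.0):                # bandwidth chosen by PRESS in actual JITL
    print(f"h = {h:>4}: y_hat = {jitl_query(q, Phi, y, h):+.3f}")
print(f"true value        {np.sin(2 * q[0]) + 0.5 * q[1] ** 2:+.3f}")
```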
3 New JITL Methodology
As is evident from the preceding section, conventional JITL methods use only a distance measure to evaluate the similarity between two data samples. However, considering data observations as points in space leads to two types of measure: the distance between two points and the angle between two vectors. By incorporating the angular relationship into the formulation of JITL, the resulting method is therefore expected to give more accurate predictions than the existing methods. Using both the distance measure and the angle measure, the following similarity criterion is proposed:

$$s_i = \gamma \cdot e^{-d^{2}(\mathbf{q},\mathbf{x}_i)} + (1-\gamma)\cdot\cos(\theta_i), \qquad \text{if } \cos(\theta_i) \ge 0 \tag{2}$$

where γ is a weight parameter constrained between 0 and 1, θ_i is the angle between q and x_i, and s_i is the similarity number, bounded between 0 and 1. Another shortcoming of the existing JITL methods is that the weights of all the data in the database are computed in order to calculate the respective contribution of each data point to the regression, even though the data with very small weights (i.e. greater
distance) contribute little to the regression. This leads to a large sparse regression matrix Z ∈ R^{N×n} that is prone to numerical problems. To circumvent this problem, we propose that only a pre-specified number of relevant data points with greater resemblance to the query data q, as determined by the proposed similarity criterion eq. (2), be used in the regression. Specifically, two parameters k_min and k_max are chosen such that only the relevant data sets formed by the k_min-th most relevant up to the k_max-th most relevant data are used in the regression. Usually k_min and k_max are much smaller than the number of data in the entire database, so the computational burden is significantly reduced compared to conventional JITL methods. The detailed methodology is as follows. Given a data set {(y_i, x_i)}, i = 1, ..., N, parameters k_min and k_max, and a query data point q:

Step 1: Compute the distance and angle between q and each data point {y_i, x_i}:

$$d_i = \lVert \mathbf{q} - \mathbf{x}_i \rVert_2, \qquad \cos(\theta_i) = \frac{\mathbf{q}^{T}\mathbf{x}_i}{\lVert \mathbf{q} \rVert_2 \cdot \lVert \mathbf{x}_i \rVert_2}$$

If cos(θ_i) ≥ 0, compute the similarity number s_i according to eq. (2); otherwise, the data point {y_i, x_i} is discarded.

Step 2: Rearrange the s_i in descending order and construct k_max − k_min + 1 relevant data sets Φ, each consisting of k data points, where k belongs to [k_min, k_max], by selecting the data corresponding to the largest s_i down to the k-th largest s_i. Applying leave-one-out cross validation to these data sets and comparing the respective prediction errors, the smallest validation error is obtained.

Step 3: Repeat Steps 1 and 2 for different values of γ; the optimal γ is the one with the smallest validation error. With this optimal γ, eq. (1) is used to compute the optimal model output for the query data q.
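The proposed similarity criterion of Eq. (2) and the k_min-k_max relevant-set search can be sketched as follows on the same kind of toy database. For brevity the selected samples are fitted with an unweighted local linear model (a simplification of the weighted regression of Eq. (1)), and the leave-one-out selection of k and γ is replaced by a simple hold-out validation split.

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.uniform(-1.0, 1.0, size=(200, 2))                 # database regressors x_i
y = np.sin(2.0 * Phi[:, 0]) + 0.5 * Phi[:, 1] ** 2 + 0.05 * rng.normal(size=200)

def similarity(q, Phi, gamma):
    # Eq. (2): s_i = gamma * exp(-d^2(q, x_i)) + (1 - gamma) * cos(theta_i)
    d = np.linalg.norm(Phi - q, axis=1)
    cos_t = Phi @ q / (np.linalg.norm(Phi, axis=1) * np.linalg.norm(q))
    s = gamma * np.exp(-d ** 2) + (1.0 - gamma) * cos_t
    s[cos_t < 0] = -np.inf                                  # Step 1: discard cos(theta) < 0
    return s

def predict(q, Phi, y, gamma, k):
    # Step 2 (simplified): take the k most similar samples, fit a local linear model
    idx = np.argsort(similarity(q, Phi, gamma))[::-1][:k]
    theta = np.linalg.lstsq(Phi[idx], y[idx], rcond=None)[0]
    return q @ theta

def validation_error(gamma, k, Phi_db, y_db, Phi_val, y_val):
    preds = np.array([predict(x, Phi_db, y_db, gamma, k) for x in Phi_val])
    return np.mean((preds - y_val) ** 2)

# Steps 2-3 compressed: scan k in [k_min, k_max] and a few gamma values
Phi_db, y_db, Phi_val, y_val = Phi[:150], y[:150], Phi[150:], y[150:]
k_min, k_max = 8, 60
best = min((validation_error(g, k, Phi_db, y_db, Phi_val, y_val), g, k)
           for g in (0.9, 0.8, 0.75, 0.7) for k in range(k_min, k_max + 1, 4))
print(f"best gamma = {best[1]}, k = {best[2]}, validation MSE = {best[0]:.4f}")
```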
4 Example
Consider a polymerization reactor in which the isothermal free-radical polymerization of methyl methacrylate is carried out using azo-bis-isobutyronitrile as the initiator and toluene as the solvent. The output variable is the number average molecular weight (NAMW), y, and the input variable is the inlet initiator flow rate F_I. This reactor can be described by the following equations [10]:

$$\frac{dC_m}{dt} = -(k_p + k_{fm})C_m P_0 + \frac{F\,(C_{m_{in}} - C_m)}{V} \tag{3}$$

$$\frac{dC_I}{dt} = -k_I C_I + \frac{F_I C_{I_{in}} - F C_I}{V} \tag{4}$$

$$\frac{dD_0}{dt} = (0.5\,k_{Tc} + k_{Td})P_0^{2} + k_{fm} C_m P_0 - \frac{F D_0}{V} \tag{5}$$

$$\frac{dD_1}{dt} = M_m (k_p + k_{fm})C_m P_0 - \frac{F D_1}{V} \tag{6}$$

where $y = \dfrac{D_1}{D_0}$ and $P_0 = \left(\dfrac{2 f^{*} k_I C_I}{k_{Td} + k_{Tc}}\right)^{0.5}$. The model parameters and steady-state operating
condition can be found in [10]. The database is generated by adding a periodic pseudo-random sequence signal with a switching probability of 0.25 to the process input FI . Using a sampling time of 0.03h, 2500 input/output data as shown in Fig. 1 are simulated to build the database. u − u0 The data are scaled by their respective nominal values, i.e. u~ = and 0.01 y − y0 ~ y= . The scaled output range is ~ y ∈ [ −1.26 1.72] . 10000 2
Fig. 1. Input-output data used for database building

Table 1. Performance of JITL for various values of γ
 | MSE | Max(|error|)
Distance measure | 5.597×10⁻⁴ | 0.1533
γ = 0.90 | 1.036×10⁻⁴ | 0.0527
γ = 0.85 | 1.012×10⁻⁴ | 0.0517
γ = 0.80 | 9.996×10⁻⁵ | 0.0496
γ = 0.75 | 9.930×10⁻⁵ | 0.0489
γ = 0.70 | 1.025×10⁻⁴ | 0.0548
To proceed with the proposed JITL algorithm, x_i is chosen to be x_i = [y(k−1), y(k−2), u(k−1)]^T, and k_min = 8 and k_max = 60 are set. To determine the optimal γ, Table 1 lists the mean-squared error (MSE) of the validation test for five values of γ. As can be seen, the error decreases initially as γ decreases from 0.9 to 0.75, after which the error starts to increase. A similar trend is observed when the maximum absolute error is used as the measure. Therefore, the optimal γ is chosen to be 0.75. Based on the identical database, JITL with the distance measure alone is also considered for comparison. Table 1 and Fig. 2 compare the predictive performance of these two methods. It is evident that JITL complemented with the angle measure outperforms that based on the distance measure alone.
Fig. 2. Validation results. (a) Top: proposed method (γ = 0.75); bottom: JITL without angle measure. (b) Comparison of the modeling error between the proposed method (γ = 0.75) and JITL without angle measure
To illustrate that JITL can be made adaptive by simply adding the online process data to the database, step changes in F_I are introduced such that the output variable moves away from the normal operating space of the current database. Two scenarios are studied: non-adaptive and adaptive JITL. In the former, the database is fixed, whereas in the latter the database is constantly updated by adding the newly available input-output data to the database at each sampling time. The simulation results in Fig. 3 show that a significantly smaller modeling error is achieved by the adaptive JITL (with or without the angle measure) compared with the non-adaptive JITL. As expected, for both the adaptive and the fixed database, JITL complemented with the angle measure consistently gives better predictions than that based on the distance measure alone.
5 Conclusion
In this paper, a new JITL methodology for nonlinear process modeling is proposed. A simulation study of a polymerization reactor is used to evaluate the efficiency of the
proposed method. It is also demonstrated that JITL can be made adaptive online by simply adding the new data to the database.
Fig. 3. Comparison between non-adaptive and adaptive JITL. (a) Non-adaptive JITL: proposed method (MSEs: top 2.45×10⁻², bottom 7.32×10⁻⁵); JITL without angle measure (MSEs: top 3.23×10⁻², bottom 2.02×10⁻⁴). (b) Adaptive JITL: proposed method (MSEs: top 4.90×10⁻⁴, bottom 1.58×10⁻⁵); JITL without angle measure (MSEs: top 1.60×10⁻³, bottom 3.16×10⁻⁵)
References

[1] Nelles, O.: Nonlinear System Identification. Springer-Verlag (2001)
[2] Takagi, T. and Sugeno, M.: Fuzzy identification of systems and its application to modeling and control. IEEE Trans. on Sys., Man, & Cyber. (1985) 15, 116-132
[3] Jang, J.S.R. and Sun, C.T.: Neuro-fuzzy modeling and control. Proceedings of the IEEE (1995) 83, 378-406
[4] Cybenko, G.: Just-in-time learning and estimation. In S. Bittanti & G. Picci (Eds.), Identification, Adaptation, Learning: The Science of Learning Models from Data, Springer (1996), 423-434
[5] Aha, D.W., Kibler, D. and Albert, M.K.: Instance-based learning algorithms. Machine Learning (1991) 6, 37-66
[6] Atkeson, C.G., Moore, A.W. and Schaal, S.: Locally weighted learning. Artificial Intelligence Review (1997) 11, 11-73
[7] Bontempi, G., Bersini, H. and Birattari, M.: The local paradigm for modeling and control: from neuro-fuzzy to lazy learning. Fuzzy Sets and Systems (2001) 121, 59-72
[8] Braun, M.W., Rivera, D.E. and Stenman, A.: A model-on-demand identification methodology for nonlinear process systems. Int. Journal of Control (2001) 74, 1708-1717
[9] Myers, R.H.: Classical and Modern Regression with Applications. PWS-Kent, Boston, Mass. (1990)
[10] Doyle III, F.J., Ogunnaike, B.A. and Pearson, R.K.: Nonlinear model-based control using second-order Volterra models. Automatica (1995) 31, 697-714
Support Vector Machines for Improved Voiceband Classification

Stephen R. Alty

King's College London, Centre for Digital Signal Processing Research, Strand, London, WC2R 2LS
[email protected]
Abstract. A method for detecting and classifying the presence of various voiceband data signals on the General Switched Telephone Network (GSTN) is presented. The classification vectors are extracted from processing of the speech parameters evolved by a standard speech coding algorithm. A multi-class Support Vector Machine (SVM) approach is implemented to optimise the classification parameters improving the ability of the system to operate under poor signal-to-noise ratio (SNR) conditions. It is shown that the newly proposed classifier improves on previous implementations and is capable of detecting various ‘V’ series standards at SNRs well below 12dB.
1 Introduction
Recent advances in analogue to digital modulation and coding techniques have led to the development of many high speed internet access technologies for domestic use (for example 'ADSL'). Despite this, the take up of broadband access is quite slow, with the majority of users still relying on traditional voiceband modem devices to connect to the internet at home. Furthermore, the use of facsimile transceivers is widespread in the workplace. The analogue telephone network, therefore, remains very much a part of life and is used by millions of subscribers every day. Given the huge quantity of analogue speech signals traversing the GSTN, the desire for network providers to employ efficient speech coding schemes has never been greater. However, their wholesale application is severely restricted due to the considerable presence of voiceband data modulated signals. This is because speech coders utilise compression techniques specific to the characteristics of speech and therefore, in general, are not transparent to voiceband data [1]. Hence, there is a need for the robust classification of voiceband signals into two categories, either speech or data, to enable the safe application of speech coding techniques. The works of [1, 2] and [3], amongst others, started to address this problem. Moreover, the further classification of voiceband data into facsimile, modem and signalling types could enable the implementation of a Digital Signal Interpolation (DSI) system [4]. The objective of a DSI system is to allocate dynamically the correct bandwidth appropriate for the efficient transmission
of each signal type. For example, speech could be handled by using a low bit-rate vocoder, while voiceband data could be demodulated and transmitted at its baseband rate. This would lead to the low bit-rate communication of all voiceband signals, resulting in potentially large savings in network bandwidths. Such a real-time classifier could also facilitate network administration and planning by providing accurate voiceband traffic analysis. The works of [5] and [6] represent the most recent research on this matter. The aim of this work is to use Support Vector Machine (SVM) [7, 8, 9] learning theory (introduced in Section 2) to optimise the voiceband classification systems presented in [10, 11]. These works rely on a hand-tuned classification system which is shown, in Section 5, to be sub-optimal. Here, the application of an optimal SVM classifier brings about improvements in robustness such that the system can operate at signal-to-noise levels below 12dB, resulting in a gain of up to 3dB for each class (when compared to [11]).
2 Support Vector Machines
Support Vector Machines (SVMs) have received a great deal of attention recently [8, 9, 12], proving themselves to be very effective in a variety of pattern classification tasks. They have been applied with great success to a number of problems ranging from hand-written character recognition and bioinformatics to automatic speech recognition (amongst many others). A brief summary of the mathematical theory of SVMs follows; for a complete treatment please see [9]. Consider a binary classification task with a set of linearly separable training samples S = {(x_1, y_1), ..., (x_m, y_m)}, where x_i is the input vector such that x_i \in R^d (in d-dimensional input space) and y_i is the class label such that y_i \in {-1, 1}. The label indicates the class to which the data belongs. A suitable discriminating function could then be defined as:

f(x) = \mathrm{sgn}(x \cdot w + b).    (1)
Where vector w determines the orientation of a discriminant plane (or hyperplane), x·w is the inner product of the vectors, x and w and b is the bias or offset. Clearly, there are an infinite number of possible planes that could correctly classify the training data. Intuitively one would expect the choice of a line drawn through the “middle”, between the two classes, to be a reasonable choice. This is because small perturbations of each data point would then not affect the resulting classification. This therefore implies that a good separating plane is one that is more general, in that it is also more likely to more accurately classify a new set of, as yet unseen, test data. It is thus the object of an optimal classifier to find the best generalising hyperplane that is equidistant or furthest from each set of points. The set of input vectors is said to be optimally separated by the hyperplane if they are separated without error and the distance between the closest vector and the hyperplane is maximal. This approach leads to the determination of just one hyperplane.
2.1 Soft-Margin Classifier
Typically, real-world data sets are in fact linearly inseparable in input space; this means that the maximum margin classifier approach is no longer valid and a new model must be introduced. The constraints need to be relaxed somewhat to allow for the minimum amount of misclassification. Therefore, the points that subsequently fall on the wrong side of the margin are considered to be errors. They are, as such, apportioned a lower influence (according to a preset slack variable) on the location of the hyperplane. In order to optimise the soft-margin classifier, we must try to maximise the margin whilst allowing the margin constraints to be violated according to the preset slack variables \xi_i. This leads to the minimisation of

\frac{1}{2} w \cdot w + C \sum_{i=1}^{m} \xi_i

subject to y_i(w \cdot x_i + b) \ge 1 - \xi_i and \xi_i \ge 0 for i = 1, \dots, m. The minimisation of linear inequalities is typically solved by the application of Lagrangian duality theory [9]. Hence, forming the primal Lagrangian,

L(w, b, \xi, \alpha, r) = \frac{1}{2} w \cdot w + C \sum_{i=1}^{m} \xi_i - \sum_{i=1}^{m} \alpha_i \left[ y_i(w \cdot x_i + b) - 1 + \xi_i \right] - \sum_{i=1}^{m} r_i \xi_i,    (2)

where \alpha_i and r_i are independent Lagrangian multipliers. The dual form can be found by setting the derivatives of the primal to zero, which gives w = \sum_{i=1}^{m} y_i \alpha_i x_i and \sum_{i=1}^{m} y_i \alpha_i = 0, and then re-substituting into the primal. Hence,

L(w, b, \xi, \alpha, r) = \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} y_i y_j \alpha_i \alpha_j \, x_i \cdot x_j.    (3)
Interestingly, this is the same result as for the maximum margin classifier. The only difference is the constraint \alpha_i + r_i = C where r_i \ge 0, hence 0 \le \alpha_i \le C. This implies that the value C sets an upper limit on the Lagrangian optimisation variables \alpha_i; this is sometimes referred to as the box constraint. The value of C offers a trade-off between accuracy of data fit and regularisation; the optimum choice of C is usually determined by cross-validation of the data. These equations can be solved numerically using Quadratic Programming (QP) algorithms. There are many online resources of such algorithms available for download; see http://www.kernel-machines.org/ for an up-to-date listing.
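As a concrete illustration of handing the dual (3) with its box constraint to an off-the-shelf QP package, the sketch below sets up a linear soft-margin SVM. It assumes the open-source CVXOPT solver and NumPy; the function name and the recovery of w and b from the margin support vectors are our own choices, not part of the original classifier.

```python
import numpy as np
from cvxopt import matrix, solvers

def train_linear_soft_margin_svm(X, y, C=1.0):
    """Solve the dual (3) with box constraints 0 <= alpha_i <= C via a QP solver.
    X: (m, d) training vectors, y: (m,) labels in {-1, +1}."""
    y = np.asarray(y, dtype=float)
    m = X.shape[0]
    K = X @ X.T                                    # linear kernel Gram matrix x_i . x_j
    P = matrix(np.outer(y, y) * K)                 # quadratic term y_i y_j x_i . x_j
    q = matrix(-np.ones(m))                        # maximise sum(alpha) <=> minimise -sum(alpha)
    G = matrix(np.vstack((-np.eye(m), np.eye(m))))
    h = matrix(np.hstack((np.zeros(m), C * np.ones(m))))   # box constraint 0 <= alpha <= C
    A = matrix(y.reshape(1, -1))                   # equality constraint sum(alpha_i y_i) = 0
    b = matrix(0.0)
    solvers.options['show_progress'] = False
    alpha = np.ravel(solvers.qp(P, q, G, h, A, b)['x'])
    w = (alpha * y) @ X                            # w = sum_i alpha_i y_i x_i
    margin_sv = (alpha > 1e-6) & (alpha < C - 1e-6)
    b_off = np.mean(y[margin_sv] - X[margin_sv] @ w)
    return w, b_off
```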
2.2 Kernel Functions
It is quite often the case with real-world data that not only is it linearly nonseparable but it also exhibits an underlying non-linear characteristic nature. Kernel mappings offer an efficient solution by non-linearly projecting the data into a higher dimensional feature space to allow the successful separation of such cases. The key to the success of Kernel functions is that special types of mapping, that obey Mercer’s Theorem, offer an implicit mapping into feature space. This means that the explicit mapping need not be known or calculated, rather the inner-product itself is sufficient to provide the mapping. This simplifies
the computational burden dramatically and, in combination with the SVM's inherent generality, largely mitigates the so-called "curse of dimensionality". Further, this means that the input feature inner-product can simply be substituted with the appropriate Kernel function to obtain the mapping whilst having no effect on the Lagrangian optimisation theory. Hence, the relevant classifier function then becomes:

f(x) = \mathrm{sgn}\left( \sum_{i=1}^{n_{SV}} y_i \alpha_i K(x_i, x) + b \right).    (4)
The use of Kernel functions transforms a simple linear classifier into a powerful and general non-linear classifier. There are a number of different Kernel functions available [9]; however, one of the most useful is the Gaussian Radial Basis Function (RBF) Kernel, given by K(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / 2\sigma^2).
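For illustration, the decision rule (4) with an RBF kernel can be written in a few lines of NumPy. This is a sketch under the assumption that the support vectors, their labels and multipliers, and the offset b have already been obtained from training; the function names are ours.

```python
import numpy as np

def rbf_kernel(xi, xj, sigma=1.0):
    # Gaussian RBF kernel K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2))
    return np.exp(-np.sum((xi - xj) ** 2, axis=-1) / (2.0 * sigma ** 2))

def svm_decision(x, support_vectors, sv_labels, sv_alphas, b, sigma=1.0):
    """Evaluate f(x) = sgn( sum_i y_i alpha_i K(x_i, x) + b ), as in (4)."""
    k = rbf_kernel(support_vectors, x, sigma)        # kernel values against all SVs
    return np.sign(np.dot(sv_labels * sv_alphas, k) + b)
```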
3 Voiceband Data Signals
The majority of modem and facsimile devices are capable of establishing communication at different data rates, so as to allow for varying line quality and compatibility between old and new equipment. It is therefore necessary to establish mutual parameters and synchronise data transmissions. The ITU-T regularly publish (and update) internationally recognised standard protocols, so-called "Recommendations", for transmission of voiceband data at various rates. This work focusses upon the classification of the most popular standards V.26ter, V.27ter, V.29, V.17 and V.34¹.
3.1 Pre-message Procedure
Each modulation standard is different, but, in general, as part of the so-called pre-message procedure (before the onset of the modulated data), produces a unique synchronisation sequence. These sequences result in the brief (c.31ms– 107ms) production of phase alternations which manifest themselves as sinusoids at various frequencies depending on the baud-rate and centre frequency of the specific modulation standard (for details see http://www.itu.int/). It is these sequences that allows each standard and rate to be identified. Table 1 shows which frequencies correspond to which modulation standard and their duration.
4 Classification Technique
Before applying the SVM algorithms to the data, suitable features for the classification process need to be extracted. From prior knowledge of the signal under scrutiny [11], it was found that the application of a parametric spectral analysis technique in the form of linear predictive analysis provided the necessary time-frequency resolution required to accurately identify features of durations often shorter than 200-300 samples at 8kHz. Any efficient method could have been employed; however, the modified covariance method [13] was chosen as it is one of the fastest solutions and is free from the effects of line-splitting. Furthermore, as a linear predictor is the basis for many low bit-rate speech coding schemes, the classification scheme could be appended to such a coder with minimal computational impact.

¹ V.34 is capable of operating at many different baud-rates and centre frequencies depending on the line conditions. The three main modes of operation at the highest data rates have been examined and are labelled with the suffixes 'a', 'b' and 'c'.

Table 1. Observed frequencies during phase reversal sequence of pre-message procedure for each ITU-T 'V' series recommendation

Series    Frequencies (Hz)     Duration (ms)
V.26ter   1200, 2400           41.67
V.27ter   1000, 2600           31.25
V.29      500, 1700, 2900      53.33
V.17      600, 1800, 3000      106.67
V.34a     229, 1829, 3429      40.00
V.34b     320, 1920, 3520      40.00
V.34c     245, 1959, 3674      37.33
4.1 Feature Extraction
Thus, the modified covariance analysis is applied to the sample frame to determine a set of predictor coefficients, \alpha_k. A frame length of N = 160, equivalent to 20ms at f_s = 8kHz, and a 14th-order analysis are used to closely mimic those of a typical speech coder. However, the sample frame is block shifted 80 samples at a time so as to yield new predictor coefficients every 10ms. A pole-solving routine is then employed to determine the frequencies and radii of each pole. Further, the prediction filter gain is determined at each pole frequency, providing an extra feature for the classifier. This feature, in particular, was found to improve resistance to the false detection of speech (so-called "talk-off").
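A rough sketch of this front end is given below, assuming NumPy. The forward-backward least-squares fit stands in for the modified covariance method, and the pole-solving step uses the roots of the prediction polynomial; exact frame handling, scaling and any coder-specific details are omitted, and the function name and defaults are illustrative.

```python
import numpy as np

def lp_pole_features(frame, order=14, fs=8000):
    """Fit a linear predictor to one frame by forward-backward (modified
    covariance style) least squares, then return the frequency, radius and
    prediction-filter gain at each pole."""
    x = np.asarray(frame, dtype=float)
    N, p = len(x), order
    rows, targets = [], []
    for n in range(p, N):                       # forward prediction equations
        rows.append(x[n - p:n][::-1]); targets.append(x[n])
    for n in range(0, N - p):                   # backward prediction equations
        rows.append(x[n + 1:n + p + 1]); targets.append(x[n])
    a, *_ = np.linalg.lstsq(np.asarray(rows), np.asarray(targets), rcond=None)
    poly = np.concatenate(([1.0], -a))          # A(z) = 1 - sum_k a_k z^-k
    poles = np.roots(poly)
    poles = poles[np.imag(poles) > 0]           # one pole of each conjugate pair
    w = np.angle(poles)
    freqs = w * fs / (2 * np.pi)                # pole frequencies in Hz
    radii = np.abs(poles)
    gains = 1.0 / np.abs(np.polyval(poly, np.exp(1j * w)))   # |1/A(e^{jw})| at each pole
    return freqs, radii, gains
```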
4.2 Support Vector Classification
Two consecutive stages of support vector classifier are employed in the overall scheme. The first stage uses the radii and prediction gains (in a 2D mapping) to separate the relevant poles from those representing the unwanted speech/noise signal using a binary linear maximal margin classifier [14]. The data in the wanted class is then reformulated into 2-dimensional frequency data. This was then used to train a multi-class Radial Basis Function soft-margin classifier [15]. Finally, allocating a class for each voiceband data type in Table 1, plus one extra for the unwanted speech or noise data, led to the resultant eight-class support vector classifier used in the final design. Figure 1 shows the classification boundaries determined by the training algorithm.
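A schematic of this two-stage arrangement is sketched below using scikit-learn's SVC as a stand-in for the toolboxes cited in [14, 15]. All array names, the class encoding (non-negative integers, with 0 reserved for speech/noise) and the parameter values are assumptions for illustration only.

```python
import numpy as np
from sklearn.svm import SVC

def build_two_stage_classifier(radius_gain, wanted, freq_pairs, series_labels, sigma=300.0):
    """Stage 1: linear SVM keeps the 'wanted' poles in (radius, gain) space.
    Stage 2: RBF soft-margin SVM maps 2-D frequency pairs of the kept poles to
    the 'V'-series classes of Table 1 plus a speech/noise class."""
    stage1 = SVC(kernel='linear', C=10.0).fit(radius_gain, wanted)
    stage2 = SVC(kernel='rbf', C=1.0, gamma=1.0 / (2 * sigma ** 2)).fit(freq_pairs, series_labels)

    def classify(pole_radius_gain, pole_freq_pairs, noise_class=0):
        keep = stage1.predict(pole_radius_gain) == 1     # discard speech/noise poles
        if not np.any(keep):
            return noise_class
        votes = stage2.predict(pole_freq_pairs[keep]).astype(int)
        return int(np.bincount(votes).argmax())          # majority vote across poles

    return classify
```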
Fig. 1. The 8-class support vector classifier boundaries. (Note: for clarity, the speech/noise class is effectively denoted by the surrounding whitespace which extends from 0 to 4kHz in both dimensions)
5 Results
In order to test the classifier repeating sequences of each signal and periods of silence were generated with measured amounts of additive noise. The duration of silence was calculated to provide a cadence that would be asynchronous to the sample frame so as to present a greater challenge to the classifier. Table 2 presents the performance results of the classifier in terms of detection success rate for all the various standards for both the hand tuned [11] and SVM optimised methods. The results show that the new classifier is capable of detecting all standards at an SNR of 12dB or below. It clearly offers an improvement of a significant margin (up to 3dB) for each class. The noise headroom exhibited suggests that the classifier would function well in the presence of other channel impairments, (such as non-linear distortion) not considered in this work. Resistance to talk-off was high, with no false detections over 600s of mixed male and female speech.
Table 2. Classification success rates at various SNRs for each method

                  Hand Tuned                     SVM Optimised
Series     20dB   15dB   12dB    9dB      20dB   15dB   12dB    9dB
V.26ter    100%   100%   100%    97%      100%   100%   100%   100%
V.27ter    100%   100%   100%    92%      100%   100%   100%   100%
V.29       100%   100%   100%    19%      100%   100%   100%    97%
V.17       100%   100%   100%    32%      100%   100%   100%    99%
V.34a      100%   100%    89%     0%      100%   100%   100%    54%
V.34b      100%   100%    60%     0%      100%   100%   100%    81%
V.34c      100%   100%    97%     3%      100%   100%   100%    97%
6 Conclusions
This paper presents a novel method employing Support Vector Machines for augmenting a speech coding system to allow the simultaneous classification of a wide range of voiceband data standards. The new SVM based classifier clearly out-performs the previous hand-tuned method, resulting in a useful increase in robustness to noise. The system is capable of operating in very poor signal-tonoise conditions, down to 12dB and beyond. Comparison with other techniques is favourable. None of the other methods by Benvenuto [4], Sewall [5] and Benetazzo [6] offer the degree of classification range combined with the confidence of this method.
References [1] Benvenuto, N., “A speech/voiceband data discriminator,” IEEE Trans. on Communications, vol. 41, pp. 539–543, (1993) 1319 [2] Yatsuzuka, Y., “High-gain digital speech interpolation with adaptive differential PCM encoding,” IEEE Trans. on Communications, vol. COM-30, pp. 750–761, (1982) 1319 [3] Roberge, C. and Adoul, J.-P., “Fast on-line speech/ voiceband data discriminator for statistical multiplexing of data with telephone conversations,” IEEE Trans. on Communications, vol. COM-34, pp. 744–751, (1986) 1319 [4] Benvenuto, N. and Goeddel, T. W., “Classification of voiceband data signals using the constellation magnitudes,” IEEE Trans. on Communications, vol. 43, pp. 2759–2770, (1995) 1319, 1325 [5] Sewall, J. S., and Cockburn, B. F., “Voiceband signal classification using statistically optimal combinations of low-complexity discriminant variables,” IEEE Trans. on Communications, vol. 47, pp. 1623–1627, (1999) 1320, 1325 [6] Benetazzo, L., Bertocco, M., Paglierani, P. and Rizzi, E., “Speech/voice-band data classification for data traffic measurements in telephone-type systems,” IEEE Trans. on Instrumentation and Measurement, vol. 49, pp. 413–417, (2000) 1320, 1325 [7] Vapnik, V. The Nature of Statistical Learning Theory, Springer-Verlag, New York, (1995) 1320 [8] Burges, C. J. C., “A tutorial on support vector machines for pattern recognition,” Data Mining and Knowledge Discovery, 2, pp. 121–167, (1998) 1320 [9] Christianini, N. and Shawe-Taylor, J., An Introduction to Support Vector Machines, Cambridge University Press, (2000), http://www.support-vector.net/ 1320, 1321, 1322 [10] Alty, S. R. and Greenwood, A. R., “The classification of facsimile signalling tones on the general switched telephone network,” IEEE Nordic Signal Processing Symposium (NORSIG96), pp. 403–406, (1996) 1320 [11] Alty, S. R., The classification of voiceband signals, Ph.D. Thesis, Liverpool John Moores University, Liverpool, U. K., (1998) 1320, 1323, 1324 [12] Bennett, K. P. and Campbell, C., “Support Vector Machines: Hype or Hallelujah?,” SIGKDD Explorations, 2,2, 1-13. (2000), http://www.rpi.edu/ bennek/ 1320 [13] Marple, S. L., Digital spectral analysis with applications. Englewood Cliffs, N. J.: Prentice-Hall, (1987) 1323
[14] Gunn, S.: Matlab Support Vector Machine Toolbox (ver 2.1), (2001) 1323 [15] Ma, J., Zhao, Y. and Ahalt, S., OSU SVM Classifier Matlab Toolbox (ver 3.00), http://eewww.eng.ohio-state.edu/ maj/osus vm/”, (2002) 1323
Adaptive Prediction of Mobile Radio Channels Utilizing a Filtered Random Walk Model for the Coefficients
Torbjörn Ekman
UniK, PO Box 70, N-2027 Kjeller, Norway
[email protected]
Abstract. The predictor coefficients of a mobile radio channel predictor have to be adapted to the changes in the radio environment. A direct adaptive predictor for the taps of a mobile radio channel is proposed. The coefficients of the predictor are assumed to change according to a filtered random walk model and are tracked using a Kalman filter. The filtered random walk is the simplest linear model that describes smooth changes of the coefficients and that includes integration. The proper choice of tuning parameters is discussed.
1 Introduction
Prediction of rapidly time-varying mobile radio channels [1], [2], [3] is of interest to a large number of applications that increases the spectral efficiency of the radio transmission, e.g. adaptive modulation [4] and power control. Due to fading, the mobile radio channel changes dramatically over short distances. Still the parametrization of the amplitude and phase of the channel is slowly time varying [2]. This enables the use of channel predictors. Here an adaptive predictor for the individual complex valued taps of the impulse response of the observed mobile radio channel is proposed. The predictor has to be adaptive to follow the gradual changes of the radio environment when the transceiver is moved. The paper first presents a state space model for the predictor coefficients. The Kalman estimator for the coefficients and the resulting channel predictor is then treated. The performance of the adaptive linear predictor is examined in an example and the choice of tuning parameters is discussed.
2 State Space Model

2.1 Regression Model
The following linear regression model will be used,

y(t) = \varphi^H(t - L)\theta(t) + v(t),    (1)

where y(t) is a tap of a mobile radio channel, v(t) is the error of the regression model, \theta(t) is a column vector containing M time-varying coefficients and \varphi(t)
is the regressor vector consisting of terms of the time series y(t) up to time t. The prediction range is L. The M elements of the regressor \varphi(t) consist of delayed samples of the time series y(t) up to time t, either in direct or transformed form. In the following, the M predictor coefficients in \theta(t) are assumed to vary without abrupt changes, on a slower time scale than the variations of the tap, y(t).
2.2 AR1I Models
The filtered random walk model (AR1I) offers the simplest linear description of smooth changes of the coefficients and includes integration.¹ Let the predictor coefficients have increments \Delta\theta(t),

\theta(t + 1) = \theta(t) + \Delta\theta(t).    (2)

Under the assumption that there are no abrupt changes, \Delta\theta(t) may be modelled as a stochastic process with low-pass properties, e.g. a first order autoregressive (AR1) model

\Delta\theta(t) = \frac{\sqrt{1 - p^2}}{1 - pq^{-1}} e(t), \quad 0 \le p < 1,    (3)

where the innovation e(t) is a vector of complex-valued white noises with zero mean and constant or slowly time varying covariance matrix R_e. The pole of the model filter, 0 \le p < 1, is a tuning parameter chosen to render the desired tracking performance. Dependencies between the coefficients can be modelled by introducing non-zero off-diagonal elements in the covariance matrix of e(t). One can also use individual p values for each coefficient to better model differences between the coefficients.
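To make the coefficient model concrete, the short NumPy sketch below generates AR1I trajectories according to (2)-(3). The dimensions, pole and innovation scale are arbitrary illustration values, not taken from the paper.

```python
import numpy as np

def simulate_ar1i(M=25, T=2000, p=0.9, sigma_e=1e-3, seed=0):
    """Generate AR1I coefficient trajectories: theta(t+1) = theta(t) + dtheta(t),
    with the increments dtheta(t) following the AR(1) model (3)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros((T, M), dtype=complex)
    dtheta = np.zeros(M, dtype=complex)
    for t in range(1, T):
        theta[t] = theta[t - 1] + dtheta                 # random walk driven by dtheta
        e = sigma_e * (rng.standard_normal(M) + 1j * rng.standard_normal(M)) / np.sqrt(2)
        dtheta = p * dtheta + np.sqrt(1 - p ** 2) * e    # low-pass filtered innovations
    return theta
```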
2.3 State Space Model
The predictor coefficients and their increments, modelled as an AR1I process, can conveniently be expressed in a state space form as

\begin{bmatrix} \theta_i(t+1) \\ \Delta\theta_i(t+1) \end{bmatrix} = \underbrace{\begin{bmatrix} 1 & 1 \\ 0 & p \end{bmatrix}}_{A} \underbrace{\begin{bmatrix} \theta_i(t) \\ \Delta\theta_i(t) \end{bmatrix}}_{x_i(t)} + \underbrace{\begin{bmatrix} 0 \\ \sqrt{1-p^2} \end{bmatrix}}_{B} e_i(t+1), \qquad \theta_i(t) = \underbrace{\begin{bmatrix} 1 & 0 \end{bmatrix}}_{C} x_i(t).    (4)
The state space models (4) for each component of \theta(t) are used as building blocks for the state space model representing the time-varying prediction coefficients, with the regression model (1) as the measurement equation. The model can thus be expressed as

x(t + 1) = F x(t) + G e(t + 1)    (5)
\theta(t) = H x(t)    (6)
y(t) = \varphi^H(t - L)\theta(t) + v(t),    (7)

¹ Integration in the model is desired because then the corresponding Kalman filter can estimate time-invariant parameters without bias.
where the states (holding all the individual predictor coefficients and their increments) and innovations are

x(t) = [x_1^T(t) \dots x_M^T(t)]^T, \qquad e(t) = [e_1(t) \dots e_M(t)]^T,    (8)

and the matrices are block-diagonal, built as

F = \mathrm{diag}(A, \dots, A), \qquad G = \mathrm{diag}(B, \dots, B), \qquad H = \mathrm{diag}(C, \dots, C).    (9)
The model error term v(t) represents the L-step prediction error. In the Kalman design, v(t) is assumed to be zero mean white noise with variance σv2 . The time variability of the predictor coefficients is parameterized by F, G and H.
3 The Kalman Filter and Predictor
The best linear state estimator given the state space model (5)-(7) is the Kalman filter [5]. It is a state observer with a gain matrix obtained via the solution to a Riccati difference equation [6]. The Kalman filter for the states and the corresponding prediction of the predictor coefficients is given as

\epsilon(t) = y(t) - \varphi^H(t - L)\hat{\theta}(t|t-1)    (10)
\hat{x}(t|t) = F\hat{x}(t-1|t-1) + K_f(t)\epsilon(t)    (11)
\hat{\theta}(t + L|t) = H F^L \hat{x}(t|t),    (12)

where \epsilon(t) are the scalar output innovations, or the prediction errors, and K_f(t) is the Kalman filter gain. Note that the one-step prediction of the predictor coefficients is required in (10). The gain K_f(t) depends on the variance of the innovations, denoted \sigma^2(t), and on the state estimation error covariance matrix S(t|t-1), which are updated by the Riccati difference equation

S(t|t-1) = F S(t-1|t-1) F^H + G R_e G^H    (13)
P(t|t-1) = H S(t|t-1) H^H    (14)
\sigma^2(t) = \sigma_v^2 + \varphi^H(t - L) P(t|t-1) \varphi(t - L)    (15)
K(t) = S(t|t-1) H^H / \sigma^2(t)    (16)
K_f(t) = K(t)\varphi(t - L)    (17)
S(t|t) = S(t|t-1) - K_f(t) K_f^H(t) \sigma^2(t).    (18)
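The recursion (10)-(18) maps almost line by line onto code. The NumPy sketch below performs one update for complex-valued data; the function signature, the use of matrix_power for F^L and the real-part cast of \sigma^2(t) are our implementation choices.

```python
import numpy as np

def kalman_step(y_t, phi, x_hat, S, F, G, H, Re, sigma_v2, L=1):
    """One recursion of (10)-(18): track the predictor-coefficient states and
    return the L-step-ahead coefficient prediction theta_hat(t+L|t).
    phi is the regressor phi(t-L); x_hat, S are x_hat(t-1|t-1), S(t-1|t-1)."""
    x_pred = F @ x_hat                                       # F x_hat(t-1|t-1)
    theta_pred = H @ x_pred                                  # theta_hat(t|t-1)
    eps = y_t - phi.conj() @ theta_pred                      # innovation, (10)
    S_pred = F @ S @ F.conj().T + G @ Re @ G.conj().T        # (13)
    P = H @ S_pred @ H.conj().T                              # (14)
    sigma2 = sigma_v2 + np.real(phi.conj() @ P @ phi)        # (15), real by construction
    K = S_pred @ H.conj().T / sigma2                         # (16)
    Kf = K @ phi                                             # (17)
    x_new = x_pred + Kf * eps                                # (11)
    S_new = S_pred - np.outer(Kf, Kf.conj()) * sigma2        # (18)
    theta_L = H @ np.linalg.matrix_power(F, L) @ x_new       # (12)
    return x_new, S_new, theta_L, eps
```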
3.1 Prediction of the Channel
The prediction error \epsilon(t) in (10), based on the regressors \varphi(t - L), is used to modify the state \hat{x}(t|t) via (11), and to obtain the extrapolated predictor coefficients \hat{\theta}(t + L|t) via (12). These extrapolated predictor coefficients are then used together with \varphi(t), which is available at time t, to predict y(t + L) as

\hat{y}(t + L|t) = \varphi^H(t)\hat{\theta}(t + L|t),    (19)

where the unknown model error in (1) simply is set to zero. The corresponding prediction error² is

\tilde{\epsilon}(t + L) = y(t + L) - \hat{y}(t + L|t) = y(t + L) - \varphi^H(t)\hat{\theta}(t + L|t).    (20)

With the regressor vector

\varphi(t) = [y(t)\ y(t - m)\ \dots\ y(t - (M - 1)m)]^H,    (21)

the predictor (19) is a linear FIR filter with sub-sampling factor m. The matrix H F^L in (12) is built from blocks of C A^L, which can be expressed as

C A^L = \begin{bmatrix} 1 & \sum_{n=0}^{L-1} p^n \end{bmatrix} = \begin{bmatrix} 1 & \frac{1 - p^L}{1 - p} \end{bmatrix}.    (22)

The predicted coefficient \hat{\theta}_i(t + L|t) is thus obtained as the L-step-ahead extrapolation of the coefficient at time t with the expected changes added as

\hat{\theta}_i(t + L|t) = \hat{\theta}_i(t|t) + \sum_{n=0}^{L-1} p^n \Delta\hat{\theta}_i(t|t) = \hat{\theta}_i(t|t) + \frac{1 - p^L}{1 - p}\Delta\hat{\theta}_i(t|t),

with the unknown innovations set to zero. When p approaches one (the double integrator) this corresponds to an extrapolation of the present estimate \hat{\theta}_i(t|t) by a linear trend with slope \Delta\hat{\theta}_i(t|t).
3.2 Tuning Parameters
To simplify matters we assume that the innovations, e_i(t), are uncorrelated and have the same variance \sigma_e^2. The covariance matrix is then diagonal, R_e = \sigma_e^2 I. The variance of the innovations \sigma_e^2 and the variance of the measurement noise \sigma_v^2 do not affect the Kalman gain independently. Only the variance ratio \sigma_v^2/\sigma_e^2 is of importance. The tuning parameter \sigma_v^2/\sigma_e^2 determines how much we trust the measurements versus the model. The adaptation becomes slower when \sigma_v^2/\sigma_e^2 is increased. The Kalman filter gain K_f(t) in (17) is then reduced, due to an assumed higher noise level in the measurement y(t). The second tuning parameter, the position of the pole p in the filter (3), appearing in the F and G matrices, determines the assumed smoothness of the predictor coefficients. The changes of the predictor coefficients become more low-pass filtered when p approaches one.

² This prediction error, \tilde{\epsilon}(t + L), will differ from \epsilon(t + L) obtained from (10), since the predictor coefficient estimate \hat{\theta}(t + L|t + L - 1) is used when computing that error. The error \epsilon(t) is almost white, whereas \tilde{\epsilon}(t) often is colored.
Fig. 1. The strongest (upper plot) and second strongest tap (lower plot) of a measured mobile radio channel. Solid line denotes real part and dash-dotted imaginary part. The second tap does not contain the dominating slow oscillating component, corresponding to wavefronts coming from the side, that is apparent in the strongest tap
4 Example
The prediction properties of the adaptive linear filter that is obtained using (19) with the regressor (21) is examined on a measured mobile radio channel. The measurement, performed at 1880 MHz, is described in [7]. The Doppler frequency is roughly 157 Hz and a snapshot of the channel is taken every 0.11 ms. The strongest and second strongest taps of the sample channel are shown in Figure 1. 4.1
The Linear Adaptive Predictor
A linear adaptive predictor with 25 coefficients spread over a memory of 251 samples (a sub-sampling factor of 10), is used for prediction of the taps. The memory is chosen short to reduce the number of adapted coefficients for the predictor. There will always be a trade off between the number of coefficients on one hand and the length on the memory on the other. 4.2
Tuning Parameters
To find the best choice for the tuning parameters, σv2 /σe2 and p, the prediction gain is evaluated for a grid of different tuning parameters. There are two different
Table 1. Table of the results for the different predictors on the two strongest taps

Prediction range L = 10 samples (1.1 ms, λ = 0.17)
  Tap   Predictor   σ_v²/σ_e²   p      NG [dB]   PG [dB]
  1st   agile       10^-2       0.99   -3        27
  1st   slow        10^12       0.0    13        24
  2nd   agile       10^-2       0.99   -3        21
  2nd   slow        10^12       0.0    15        18

Prediction range L = 30 samples (3.3 ms, λ = 0.51)
  Tap   Predictor   σ_v²/σ_e²   p      NG [dB]   PG [dB]
  1st   agile       10^-2       0.99   -3        8
  1st   slow        10^12       0.0    24        8
  2nd   agile       10^-2       0.99   -3        2
  2nd   slow        10^12       0.0    23        3
10−2 1012
0.99 0.0
-3 23
2 3
choices of tuning parameters, resulting in either slow or agile change of the prediction coefficients, that produce the best performances. The slow predictor has a pole in zero and a huge \sigma_v^2/\sigma_e^2, that is, the changes of the coefficients are assumed to be small. This results in a Kalman estimator that behaves like an RLS estimator with a forgetting factor close to one. The prediction coefficients hardly change after the initial transient. The agile predictor has the pole close to the unit circle and a small variance ratio. The changes of the predictor coefficients are strongly low-pass filtered but are still allowed to change quite rapidly.

4.3 Results
In Table 1 the prediction gain³ (PG), noise gain⁴ (NG) and the corresponding tuning parameters are listed for the two strongest taps of the channel. The agile and slow predictors have similar performance, with a 3 dB advantage for the agile predictor for the shorter prediction range. The average frequency responses of the agile and slow linear predictors are shown in Figure 2. The slow predictor behaves like a predictor for a time-invariant band-limited process [8]. It amplifies high frequencies for which the power in the signal is low, thus the high NG. In the passband the amplification is approximately 0 dB. In contrast, the time-varying agile predictor on average suppresses high frequencies, which is reflected in the low NG values in Table 1.
³ The prediction gain is defined as 10 log_{10}(\sigma_y^2/\sigma_{\tilde{\epsilon}}^2), where \sigma_y^2 is the variance of the predicted signal and \sigma_{\tilde{\epsilon}}^2 is the variance of the prediction error in (20).
⁴ The noise gain is the amplification of a white noise by the linear predictor.
Fig. 2. Average frequency responses for the 10-step ahead predictors on the strongest tap. The slow predictor (dash-dotted) has a higher high-frequency gain than the agile predictor (solid). The gray line is the Doppler spectrum of the strongest tap
5 Discussion
This paper proposes the use of a Kalman estimator, based on a filtered random walk model, for tracking the time-varying coefficients of a mobile radio channel predictor. The filtered random walk model is quite flexible, as seen in the example. Different choices of tuning parameters result in predictors with totally different tracking properties, which in turn affect the average frequency response of the adaptive linear predictors. As the predictor coefficients are tracked directly, there is also a great freedom in the choice of the elements of the regressor. Thus, noise-reduced observations, as in [3], can easily be encompassed.
References [1] T. Ekman, Prediction of Mobile Radio Channels, Modeling and Design, PhD thesis, Uppsala University, Uppsala, Sweden, 2002. 1326 [2] A. Duel-Hallen, S. Hu and H. Hallen, “Long-range prediction of fading signals,” IEEE Signal Processing Magazine, vol. 17, pp 62-75, May 2000. 1326 [3] T. Ekman, M. Sternad and A. Ahl´en, “Unbiased Power Prediction on Broadband Channels,” IEEE Vehicular Technology Conference VTC2002-Fall, Vancouver, Canada, Sept. 2002. 1326, 1332 [4] S. T. Chung and A. Goldsmith “Degrees of Freedom in Adaptive Modulation: A Unified View,” IEEE Transaction on Communications, vol. 49, no. 9, pp. 15611571, Sept. 2001. 1326 [5] B. D. O. Anderson and J. B. Moore, Optimal Filtering, Prentice Hall, Englewod Cliffs, NJ, 1979. 1328
[6] A. H. Jazwinski, Stochastic Processes and Filtering Theory, Academic Press, New York, 1970. 1328 [7] M. Sternad, T. Ekman and A. Ahl´en, “Power prediction on Broadband Channels,” IEEE Vehicular Technology Conference VTC2001-Spring, Rhodes, Greece, May 2001. 1330 [8] A. Papoulis, ”A note on the predictability of band-limited processes,” Preceedings of the IEEE, 73(8):1332–1333, August 1985. 1331
Amplitude Modulated Sinusoidal Models for Audio Modeling and Coding
Mads Græsbøll Christensen, Søren Vang Andersen, and Søren Holdt Jensen
Department of Communication Technology, Aalborg University, Denmark
{mgc,sva,shj}@kom.auc.dk
Abstract. In this paper a new perspective on modeling of transient phenomena in the context of sinusoidal audio modeling and coding is presented. In our approach the task of finding time-varying amplitudes for sinusoidal models is viewed as an AM demodulation problem. A general perfect reconstruction framework for amplitude modulated sinusoids is introduced and model reductions lead to a model for audio compression. Demodulation methods are considered for estimation of the time-varying amplitudes, and inherent constraints and limitations are discussed. Finally, some applications are considered and discussed and the concepts are demonstrated to improve sinusoidal modeling of audio and speech.
1 Introduction
In the last couple of decades sinusoidal modeling and coding of both speech and audio in general has received great attention in research. In its most general form, it models a segment of a signal as a finite sum of sinusoidal components each having a time-varying amplitude and a time-varying instantaneous phase. Perhaps the most commonly used derivative of this model is the constant-frequency constant-amplitude model known as the basic sinusoidal model. This model is based on the assumptions that the amplitudes and frequencies remain constant within the segment. It has been used for many years in speech modeling and transformation [1]. The model, however, has problems in modeling transient phenomena such as onsets, which causes so-called pre-echos to occur. This is basically due to the quasi-stationarity assumptions of the model being violated and the fundamental trade-off between time and frequency resolution. Also, the use of overlap-add or interpolative synthesis inevitably smears the time-resolution. Many different strategies for handling time-varying amplitudes have surfaced in recent years. For example, the use of time-adaptive segmentation [2] improves performance greatly at the cost of increased delay. But even then pre-echos may still occur in overlap regions or if interpolative synthesis [1] is used. Also, the use of exponential dampening of each sinusoid has been extensively studied [3, 4, 5], although issues concerning quantization remain unsolved. Other approaches include the use of one common dampening factor for all sinusoids [6], the use of
This work was conducted within the ARDOR project, EU grant no. IST–2001–34095
asymmetric windows [7], the use of an envelope estimated by low-pass filtering of the absolute value of the input [8] and the approaches taken in [9, 10]. In [9] lines are fitted to the instantaneous envelope and then used in sinusoidal modeling, and in [10] transient locations are modified in time to reduce preecho artifacts. The latter requires the use of dynamic time segmentation. Also, tracking of individual speech formants by means of an energy separation into amplitude modulation and frequency modulation (FM) contributions has been studied in [11, 12, 13]. In this paper we propose amplitude modulated sinusoidal models for audio modeling and coding applications. The rest of the paper is organized as follows: In Sect. 2 the mathematical background is presented. A general perfect reconstruction model is derived in Sect. 3, and in Sect. 4 a model which addresses one of the major issues of audio coding regardless of type, namely pre-echo, is presented along with a computationally simple estimation technique. Finally, in Sect. 5 some experimental results are presented and discussed and Sect. 6 concludes on the work.
2 Some Preliminaries
The methods proposed in this paper are all based on the so-called analytic signal, which is derived from the Hilbert transform. First, we introduce the Hilbert transform and define the analytic signal and the instantaneous envelope. Then we briefly state Bedrosian's theorem, which is essential to this paper.

Definition 1 (Discrete Hilbert Transform). Let x_r(n) be a real discrete signal. The Discrete Hilbert transform, H\{\cdot\}, of this, denoted x_i(n), is then defined as (see e.g. [14])

x_i(n) = H\{x_r(n)\} = \sum_{m=-\infty}^{\infty} h(n - m)\, x_r(m),    (1)
where h(n) is the impulse response of the discrete Hilbert transform given by

h(n) = \begin{cases} \frac{2}{\pi n}\sin^2(\pi n/2), & n \ne 0 \\ 0, & n = 0 \end{cases}    (2)

A useful, and perhaps more intuitive, way of looking at the Hilbert transform definition is in the frequency domain:

X_i(\omega) = H(\omega)X_r(\omega), \quad \text{with} \quad H(\omega) = \begin{cases} j, & -\pi < \omega < 0 \\ 0, & \omega = \{0, \pi\} \\ -j, & 0 < \omega < \pi \end{cases}    (3)
where X_i(\omega) and X_r(\omega) are the Fourier transforms (denoted F\{\cdot\}) of x_i(n) and x_r(n), respectively, and H(\omega) is the Fourier transform of h(n). The so-called analytic signal and instantaneous envelope are then defined as

x_c(n) = x_r(n) + jx_i(n) \quad \text{and} \quad |x_c(n)| = \sqrt{x_r^2(n) + x_i^2(n)},    (4)

respectively. With these definitions in place, we now state Bedrosian's theorem [15].

Theorem 1 (Bedrosian). Let f(n) and g(n) denote generally complex functions in l^2(Z) of the real, discrete variable n. If
1. the Fourier transform F(\omega) of f(n) is zero for a < |\omega| \le \pi and the Fourier transform G(\omega) of g(n) is zero for 0 \le |\omega| < a, where a is an arbitrary positive constant, or
2. f(n) and g(n) are analytic,
then

H\{f(n)g(n)\} = f(n)H\{g(n)\}.    (5)
For proof of the continuous case see [15]. The theorem holds also for periodic signals in which case the Fourier series should be applied.
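These definitions are directly available in common signal-processing libraries. As a small illustration (assuming SciPy; the example signal and names are ours), the analytic signal and instantaneous envelope of (4) can be computed as follows.

```python
import numpy as np
from scipy.signal import hilbert

def instantaneous_envelope(x):
    """Return the analytic signal x_c(n) = x_r(n) + j x_i(n) and the
    instantaneous envelope |x_c(n)| of a real signal, as in (4)."""
    xc = hilbert(x)            # FFT-based analytic signal
    return xc, np.abs(xc)

# example: an amplitude modulated sinusoid and its recovered envelope
n = np.arange(2048)
env = 1.0 + 0.5 * np.cos(2 * np.pi * 0.002 * n)
x = env * np.cos(2 * np.pi * 0.1 * n)
_, est = instantaneous_envelope(x)   # est follows env closely (Bedrosian conditions hold)
```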
3 Sum of Amplitude Modulated Sinusoids
In this section we consider a perfect reconstruction framework based on a model consisting of a sum of amplitude modulated sinusoids:

\hat{x}(n) = \sum_{q=1}^{Q} \gamma_q(n) A_q \cos(\omega_q n + \phi_q) \quad \text{for } n = 0, \dots, N - 1,    (6)

where \gamma_q(n) is the amplitude modulating signal, A_q the amplitude, \omega_q the frequency, and \phi_q the phase of the qth sinusoid. We note in passing that the aforementioned exponential sinusoidal models [3, 4, 5] fall into this category. Assume that the signal has been split into a set of subbands by a perfect reconstruction nonuniform Q-band filterbank such as [16], having a set of cut-off frequencies \Omega_q for q = 0, 1, \dots, Q where \Omega_0 = 0 and \Omega_Q = \pi. Then we express the contents of each individual subband x_q(n) as an amplitude modulated sinusoid placed in the middle of the band, i.e.

x_q(n) = \gamma_q(n) A_q \cos(\omega_q n + \phi_q) = \gamma_q(n) s_q(n),    (7)
where \omega_q = \frac{\Omega_q + \Omega_{q-1}}{2} and \gamma_q(n) \in C, i.e. the modulation is complex. We start our demodulation by finding the analytic signal representation of both the left and right side of the previous equation:

\gamma_q(n)s_q(n) + jH\{\gamma_q(n)s_q(n)\} = x_q(n) + jH\{x_q(n)\},    (8)

which, according to Bedrosian's theorem, is equal to

\gamma_q(n)s_q(n) + jH\{\gamma_q(n)s_q(n)\} = \gamma_q(n)(s_q(n) + jH\{s_q(n)\})    (9)
= \gamma_q(n) A_q \exp(j(\omega_q n + \phi_q)).    (10)
This means that we can simply perform complex demodulation in each individual subband using a complex sinusoid, i.e.

\gamma_q(n) = (x_q(n) + jH\{x_q(n)\}) \frac{1}{A_q} \exp(-j(\omega_q n + \phi_q)).    (11)
In this case we have a modulation with a bandwidth equal to the bandwidth of the subband, \Delta_q = \Omega_q - \Omega_{q-1}. It is of interest to relax the constraint on the frequency of the carrier. Here we consider a more general scenario, where the carrier may be placed anywhere in the subband, i.e. \Omega_{q-1} \le \omega_q \le \Omega_q. In this case, the modulation is asymmetrical around the carrier in the spectrum. An alternative interpretation is that the carrier is both amplitude and phase modulated simultaneously. Alternatively, we can split the modulation into an upper (usb) and a lower sideband (lsb). These can be obtained by calculating the analytic signal of \gamma_q(n) and \gamma_q^*(n), which is similar to zeroing out the negative frequencies:

\gamma_{q,usb}(n) = \frac{1}{2}(\gamma_q(n) + jH\{\gamma_q(n)\})    (12)
\gamma_{q,lsb}(n) = \frac{1}{2}(\gamma_q^*(n) + jH\{\gamma_q^*(n)\}).    (13)

The complex modulating signal can be reconstructed as

\gamma_q(n) = \gamma_{q,usb}(n) + \gamma_{q,lsb}^*(n).    (14)
The modulating signal can be written as \gamma_q(n) = C + b(n), where b(n) is zero mean. For C \ne 0, this is the case where the sinusoidal carrier is present in the spectrum in the form of a discrete frequency component. For the special case C = 0, we have what is known as suppressed carrier AM, i.e. the carrier will not be present in the spectrum. In the context of speech modeling this representation may be useful in modeling non-tonal parts, e.g. unvoiced speech, whereas the non-suppressed AM (C \ne 0) case may be well-suited for voiced speech. In the particular case that the modulating signal is both non-negative and real, i.e. \gamma_q(n) \in R and \gamma_q(n) \ge 0, the demodulation simply reduces to

\gamma_q(n) = \frac{1}{A_q}|x_q(n) + jH\{x_q(n)\}|,    (15)
as the instantaneous envelope of the carrier is equal to 1. This last estimation is lossy as opposed to the previous demodulations. Notice that in the perfect reconstruction scenario, the filtering of the signal into subbands and subsequent demodulation can be implemented efficiently using an FFT. An alternative to the filterbank-based sum of amplitude modulated sinusoids scheme, which requires that the sinusoidal components are well spaced in frequency is the use of periodic algebraic separation [17, 18]. This allows for demodulation of closely spaced periodic components provided that the periods are known.
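The two demodulation forms (11) and (15) are easy to prototype. The sketch below does so for one subband, assuming SciPy's hilbert for the analytic signal; the function name, arguments and the flag that switches between the two forms are illustrative.

```python
import numpy as np
from scipy.signal import hilbert

def demodulate_subband(xq, wq, phiq=0.0, Aq=1.0, real_nonnegative=False):
    """Demodulate one subband around an assumed carrier at frequency wq
    (radians/sample): the general complex demodulation of (11), or the lossy
    envelope-only form of (15) when the modulation is real and non-negative."""
    n = np.arange(len(xq))
    xc = hilbert(xq)                                   # x_q(n) + j H{x_q(n)}
    if real_nonnegative:
        return np.abs(xc) / Aq                         # (15)
    return xc * np.exp(-1j * (wq * n + phiq)) / Aq     # (11)
```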
4 Amplitude Modulated Sum of Sinusoids
In this section a model for audio compression is introduced. This model addresses one of the major problems of audio coding regardless of type, namely pre-echo control. The perfect reconstruction model of the previous section has an amplitude modulating signal for each individual sinusoid. Here, we explore the notion of having more sinusoids in each subband and that modulating signal being identical for all sinusoids in the subband. This is especially useful in the context of modeling onsets and may even be used in the one-band case for low bit-rate or single source applications. The model of the qth subband is:

\hat{x}_q(n) = \gamma_q(n) \sum_{l=1}^{L_q} A_{q,l} \cos(\omega_{q,l} n + \phi_{q,l}) = \gamma_q(n)\hat{s}_q(n),    (16)

where \hat{s}_q(n) is the constant-amplitude part. In the one-band case where x_q(n) = x(n), the models in [6, 7, 8, 9] all fall into this category. These, however, do not reflect human sound perception very well as pre-echos may occur in the individual critical bands (see e.g. [19]). Neither do they take the presence of multiple temporally overlapping sources into account. The sum of amplitude modulated sinusoids, however, does take multiple sources into account. The basic principle in the estimation of the modulating signal \gamma_q(n) is that it can be separated from the constant-amplitude part of our model \hat{x}_q(n) under certain conditions. First we write the instantaneous envelope of equation (16), i.e.

|\hat{x}_q(n) + jH\{\hat{x}_q(n)\}| = |\gamma_q(n)\hat{s}_q(n) + jH\{\gamma_q(n)\hat{s}_q(n)\}|.    (17)
Since we are concerned here with sinusoidal modeling, we constrain the modulation to the case of non-suppressed carrier and the physically meaningful non-negative and real modulating signal. Equation (17) can then be rewritten using Bedrosian's theorem:

|\gamma_q(n)\hat{s}_q(n) + jH\{\gamma_q(n)\hat{s}_q(n)\}| = \gamma_q(n)|\hat{s}_q(n) + jH\{\hat{s}_q(n)\}|.    (18)

For this to be true, our constant-amplitude model and the amplitude modulation may not overlap in frequency, i.e. we have that the lowest frequency must be above the bandwidth, BW, of the modulating signal:

BW < \min_l \omega_{q,l}.    (19)
Using this constraint, we now proceed in the estimation of the amplitude modulating signal \gamma_q(n) by finding the analytic signal of the sinusoidal model

\hat{x}_{q,c}(n) = \sum_{l=1}^{L_q} A_{q,l}\,\gamma_q(n)\exp(j\phi_{q,l})\exp(j\omega_{q,l} n),    (20)
with subscript c denoting the analytic signal. We then find the squared instantaneous envelope of the model:

|\hat{x}_{q,c}(n)|^2 = \sum_{l=1}^{L_q}\sum_{k=1}^{L_q} \gamma_q^2(n) A_{q,l} A_{q,k} \exp(j(\phi_{q,k} - \phi_{q,l}))\exp(j(\omega_{q,k} - \omega_{q,l})n).    (21)
The squared instantaneous envelope is thus composed of a set of auto-terms (l = k) which identifies the amplitude modulating signal and a set of interfering cross-terms (l \ne k). From this it can be seen that the frequencies of these cross-terms in the instantaneous envelope are given by the distances between the sinusoidal components. Thus, the lowest frequency in the squared instantaneous envelope caused by the interaction of the constant-amplitude sinusoids is given by the minimum distance between two adjacent sinusoids. A computationally simple approach is to reduce the cross-terms by constraining the minimum distance between sinusoids and then simply lowpass filter the squared instantaneous envelope of the input signal, i.e.

\gamma_q^2(n) = \alpha\, e_q^2(n) \ast h_{LP}(n),    (22)

where e_q^2(n) = x_q^2(n) + H\{x_q(n)\}^2, \alpha is a positive scaling factor and h_{LP}(n) is the impulse response of an appropriate lowpass filter with a stopband frequency below half the minimum distance between two sinusoids, i.e.

2BW < \min_{l \ne k} |\omega_{q,l} - \omega_{q,k}|.    (23)
This estimate allows us to find a amplitude modulating signal without knowing the parameters of the sinusoidal model a priori. This is especially attractive in the context of matching pursuit [20]. Note that the constraint in equation (23) is more restrictive than those of theorem 1. The design of the lowpass filter is subject to conflicting criteria. On one hand, we want to have sufficient bandwidth for modeling transients well. On the other, we want to attenuate the cross-terms while having arbitrarily small spacing in frequency between adjacent sinusoids. Also these criteria have a time-varying nature. A suitable filter which can easily be altered to fit the requirements is described in [8]. Generally, the consequences of setting the cutoff frequency of the lowpass filter too low are more severe than setting it too high. Setting the cutoff frequency too high causes cross-terms to occur in γq (n), which may result in degradation in some cases, whereas setting the cutoff frequency too low reduces the models ability to handle transients. An alternative approach in finding γq (n) would be to estimate the amplitude modulating signals of the individual sinusoids and then combine these according to frequency bands or sources.
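A minimal version of the estimator (22) is sketched below, assuming SciPy. The zero-phase FIR filtering stands in for the lowpass filter h_LP(n) discussed above (the paper points to the filter of [8]); the function name, the filter length and the explicit clipping before the square root are our choices.

```python
import numpy as np
from scipy.signal import hilbert, firwin, filtfilt

def estimate_modulating_signal(xq, cutoff, fs, numtaps=63, alpha=1.0):
    """Estimate gamma_q(n) per (22): lowpass filter the squared instantaneous
    envelope of the subband signal and take the square root. The cutoff (Hz)
    must respect the cross-term constraint (23)."""
    e2 = np.abs(hilbert(xq)) ** 2                  # squared instantaneous envelope e_q^2(n)
    h_lp = firwin(numtaps, cutoff, fs=fs)          # linear-phase lowpass (arbitrary length)
    gamma2 = alpha * filtfilt(h_lp, [1.0], e2)     # zero-phase filtering of e_q^2(n)
    return np.sqrt(np.clip(gamma2, 0.0, None))     # enforce gamma_q(n) >= 0
```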
5 Results and Discussion
The framework in Sect. 3 has been verified in simulations to attain perfect reconstruction. The choice of model, whether it is some derivative of the sum of
Fig. 1. Signal examples: voiced speech. Original (top), modeled without AM (middle) and with AM (bottom)
amplitude modulated sinusoids or the amplitude modulated sum of sinusoids, should reflect signal characteristics. Types of sinusoidal signals that can be efficiently modeled using a one-band amplitude modulated sum of sinusoids are single sources that have a quasi-harmonic structure, i.e. pitched sounds. For example, voiced speech can be modeled well using such a model. In Fig. 1 two examples of onsets of voiced speech are shown (sampled at 8kHz) with the originals at the top, modeled without AM in the middle, and with at the bottom. The fundamental frequency was found using a correlation-based algorithm and the amplitudes and phases were then estimated using weighted least-squares. Segments of size 20ms and overlap-add with 50% overlap was used. It can be seen that the pre-echo artifacts present in the constant-amplitude model are clearly reduced by the use of the AM scheme. The proposed model and estimation technique was found to consistently improve performance of the harmonic sinusoidal model in transient speech segments with pre-echo artifacts clearly being reduced. More complex signals composed of multiple temporally overlapping sources, however, require more sophisticated approaches for handling non-stationarities. The ”glockenspiel” excerpt of SQAM [21] is such a signal. At first glance this signal seems well suited for modeling using a sinusoidal model. The onsets are, however, extremely difficult to model accurately using a sinusoidal model. This is illustrated in Fig. 2, again with the original at the top, modeled using constant amplitude sinusoids in the middle and using AM at the bottom. The signal on the left is the entire signal and the signal on the right is a magnification of a transition region between notes. In this case amplitude modulation is applied per equivalent rectangular bandwidth (ERB) (see [19]) and a simple matching pursuit-like algorithm was used for finding sinusoidal model parameters, i.e. no harmonic constraints on the frequencies. Again overlap-add using segments of 20ms and 50% overlap was employed. In this example the sampling frequency was 44.1kHz.
Fig. 2. Signal examples: "glockenspiel" [21]. Original (top), modeled without AM (middle) and with AM (bottom)

It can be seen that the onsets are smeared when employing constant amplitude and that there is a significant improvement when AM is applied, although some smearing of the transition still occurs due to the filtering.
6 Conclusion
In this paper we have explored the notion of amplitude modulated sinusoidal models. First, a general perfect reconstruction framework based on a filterbank was introduced, and different options with respect to modulation and their physical interpretations were presented. Here, one sinusoid per subband is used and everything else in the subband is then modeled as modulation of that sinusoid. This model is generally applicable and can be used for modeling not only tonal signals but also noise-like signals such as unvoiced speech. Then a physically meaningful, compact representation for sinusoidal audio coding and modeling and a demodulation scheme with low computational complexity was presented. In this model, each subband is represented using a sum of sinusoids having one common real, non-negative modulating signal, which is estimated by lowpass filtering the squared instantaneous envelope. The model and the proposed estimation technique was found to be suitable for modeling of onsets of pitched sounds and was verified to generally improve modeling performance of sinusoidal models.
References [1] McAulay, R.J., Quatieri, T.F.: Speech Analysis/Synthesis Based on a Sinusoidal Representation. In: IEEE Trans. Acoust., Speech, Signal Processing. Volume 34(4). (1986) 1334 [2] Prandom, P., Goodwin, M.M., Vetterli, M.: Optimal time segmentation for signal modeling and compression. In: Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing. (1997) 1334
[3] Goodwin, M.M.: Matching pursuit with damped sinusoids. In: Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing. (1997) 1334, 1336 [4] Nieuwenhuijse, J., Heusdens, R., Deprettere, E.F.: Robust Exponential Modeling of Audio Signals. In: Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing. (1998) 1334, 1336 [5] Jensen, J., Jensen, S.H., Hansen, E.: Exponential Sinusoidal Modeling of Transitional Speech Segments. In: Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing. (1999) 1334, 1336 [6] Jensen, J., Jensen, S.H., Hansen, E.: Harmonic Exponential Modeling of Transitional Speech Segments. In: Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing. (2000) 1334, 1338 [7] Gribonval, R., Depalle, P., Rodet, X., Bacry, E., Mallat, S.: Sound signal decomposition using a high resolution matching pursuit. In: Proc. Int. Computer Music Conf. (1996) 1335, 1338 [8] George, E.B., Smith, M.J.T.: Analysis-by-synthesis/overlap-add sinusoidal modeling applied to the analysis-synthesis of musical tones. In: J. Audio Eng. Soc. Volume 40(6). (1992) 1335, 1338, 1339 [9] Edler, B., Purnhagen, H., Ferekidis, C.: ASAC – Analysis/Synthesis Audio Codec for Very Low Bit Rates. In: 100th Conv. Aud. Eng. Soc., preprint 4179. (1996) 1335, 1338 [10] Vafin, R., Heusdens, R., Kleijn, W.B.: Modifying transients for efficient coding of audio. In: Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing. (2001) 1335 [11] Maragos, P., Kaiser, J.F., Quatieri, T.F.: Energy Separation in Signal Modulations with Application to Speech Analysis. In: IEEE Trans. Signal Processing. Volume 41(10). (1993) 1335 [12] Bovik, A.C., Havlicek, J.P., Desai, M.D., Harding, D.S.: Limits on Discrete Modulated Signals. In: IEEE Trans. on Signal Processing. Volume 45(4). (1997) 1335 [13] Quatieri, T.F., Hanna, T.E., O’Leary, G.C.: AM-FM Sepration using Audiotorymotivated Filters. In: Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing. (1996) 1335 [14] Oppenheim, A.V., Schafer, R.W.: Discrete-Time Signal Processing. 1st edn. Prentice-Hall (1989) 1335 [15] Bedrosian, E.: A product theorem for Hilbert transforms. In: IEEE Signal Processing Lett. Volume 44(1). (1963) 1336 [16] Goodwin, M.M.: Nonuniform filterbank design for audio signal modeling. In: Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing. (1997) 1336 [17] Zou, M.Y., Zhenming, C., Unbehauen, R.: Separation of periodic signals by using an algebraic method. In: Proc. IEEE Int. Symp. Circuits and Systems. (1991) 1337 [18] Santhanam, B., Maragos, P.: Multicomponent AM-FM demodulation via periodicity-based algebraic separation and energy-based demodulation. In: IEEE Trans. Communcations. Volume 48(3). (2000) 1337 [19] Moore, B.C.J.: An Introduction to the Psychology of Hearing. 4th edn. Academic Press (1997) 1338, 1340 [20] Mallat, S., Zhang, Z.: Matching pursuit with time-frequency dictionaries. In: IEEE Trans. Signal Processing. Volume 40. (1993) 1339 [21] European Broadcasting Union: Sound Quality Assessment Material Recordings for Subjective Tests. EBU (1988) http://www.ebu.ch. 1340, 1341
A Fast Converging Sequential Blind Source Separation Algorithm for Cyclostationary Sources
M. G. Jafari¹, D. P. Mandic², and J. A. Chambers¹
¹ Centre for Digital Signal Processing Research, King's College London
² Communications and Signal Processing Research Group, Imperial College, London
[email protected]
Abstract. A fast converging natural gradient algorithm (NGA) for the sequential blind separation of cyclostationary sources is proposed. The approach employs an adaptive learning rate which changes in response to the changes in the dynamics of the sources. This way the convergence and the robustness to the initial choice of parameters are much improved over the standard algorithm. The additional computational complexity of the proposed algorithm is negligible as compared to the cyclostationary NGA method. Simulations results support the analysis.
1 Introduction
Blind source separation (BSS) has recently received much attention from the signal processing and neural networks research communities, because of the wide range of problems to which it may be applied, spanning disciplines as diverse as wireless communications, geophysical exploration, airport surveillance and medical signal processing [1, 2, 3]. Several approaches have been developed for the solution of the blind source separation problem, which can broadly be categorised as block-based and sequential methods, depending on whether the parameters of interest are estimated from a block of available data, or as new measurements become available. When operating in a stationary environment, the performance of sequential algorithms is characterised by the convergence speed, as well as the steady-state misadjustment [4, 5], which in turn are controlled by the step-size parameter, since large step-sizes lead to fast convergence speed but large steady-state misadjustment, while small learning rates result in a better misadjustment but slower initial convergence speed. Clearly, the selection of an appropriate step-size is crucial to the performance of the algorithm, and generally the use of a fixed step-size parameter leads to slow convergence speed and poor tracking performance. An alternative approach is to use an adaptive step-size, whose value is adjusted according to some measure of the distance between the estimated filter parameters and their optimal values [5]. In this paper, the problem of source separation when the original sources are cyclostationary is addressed, and in this context, a fast converging algorithm is proposed, which uses
an adaptive learning rate that changes in response to the time-varying dynamics of the signals and the separating matrix. The instantaneous BSS problem is described in Section 2. Section 3 introduces the cyclostationary natural gradient algorithm (CSNGA), which performs source separation by exploiting the statistical cyclostationarity of the source signals, while the proposed gradient adaptive step-size method is presented in Section 4. The performance of the algorithm is shown by simulation in Section 5, while conclusions are drawn in Section 6.
2
Problem Statement
The aim of blind source separation is to recover the n unknown source signals s(k) ∈ R^n from a set of observed mixture signals x(k) ∈ R^m, which are obtained when the sources are mixed by an unknown channel. When the mixing operation is linear and instantaneous, and no additive noise is present, the m observed signals are given by [1]

x(k) = A s(k)    (1)

where A ∈ R^{m×n} is an unknown full column rank mixing matrix, and k denotes the discrete time index. The sources are estimated according to the following linear separating system

y(k) = W(k) x(k)    (2)

where W(k) ∈ R^{n×m} is the unmixing or separating matrix, and y(k) ∈ R^n represents an estimate of s(k). Ideally, the pseudo-inverse of the separating matrix W(k) is an estimate of the mixing matrix A. However, inherent to the BSS problem are a scaling and a permutation indeterminacy, such that the sources can be recovered only up to a multiplicative constant, and it is not possible to pre-determine their ordering. Perfect separation is therefore achieved when the global mixing-separating matrix P(k) = W(k)A tends toward a matrix with only one non-zero term in each row and column [1], given by P(k) = JD, where J ∈ R^{n×n} is a permutation matrix modeling the ordering ambiguity, and D ∈ R^{n×n} is a diagonal matrix accounting for the scaling indeterminacy.
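To make the notation concrete, the following minimal sketch (an illustration added here, not part of the original algorithm description) builds the instantaneous mixing model (1)-(2) in NumPy; the placeholder sources, mixing matrix and separating matrix are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m, K = 2, 2, 3000                            # sources, sensors, samples
s = np.sign(rng.standard_normal((n, K)))        # placeholder BPSK-like sources
A = rng.standard_normal((m, n))                 # unknown full column rank mixing matrix

x = A @ s                                       # observations, Eq. (1): x(k) = A s(k)

W = np.eye(n, m)                                # separating matrix, to be adapted
y = W @ x                                       # source estimates, Eq. (2): y(k) = W(k) x(k)

P = W @ A                                       # global mixing-separating matrix P(k)
```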
3
Cyclostationary Natural Gradient Algorithm
The cyclostationary natural gradient algorithm minimises the following cost function [6]

KL(W(k)) = -\log\det(W(k)) - \sum_{i=1}^{m} \log q_i(y_i(k)) - \frac{1}{2}\log\det\tilde{R}_y^{\alpha}(k) + \frac{1}{2}\mathrm{Tr}\,\tilde{R}_y^{\alpha}(k) - \frac{m}{2}    (3)

where Tr(·) and det(·) represent respectively the trace and determinant operators, and q_i(y_i(k)) is an appropriately chosen independent pdf. Moreover,
\tilde{R}_y^{\alpha}(k) is the output cyclic correlation matrix for the p-th cycle frequency, given by

\tilde{R}_y^{\alpha}(k) = \sum_{p=1}^{m} R_y^{\alpha_p}(k) = \sum_{p=1}^{m} E\{ e^{j\alpha_p k}\, y(k)\, y^T(k) \}    (4)

In the limit as k → ∞, the matrix \tilde{R}_y^{\alpha}(k) is required to satisfy \lim_{k\to\infty}\tilde{R}_y^{\alpha}(k) = I, such that each of the output cyclic correlation matrices converges to a matrix with only one non-zero entry, situated at the p-th position along the main diagonal. Applying the natural gradient descent method, the update equation for the cyclostationary natural gradient algorithm is given by

W(k+1) = W(k) + \mu \left[ I - f(y(k))\, y^T(k) + I - \tilde{R}_y^{\alpha}(k) \right] W(k)    (5)

where f(y(k)) is an appropriate odd non-linear function of the output y(k), (·)^T denotes the transpose operator, and μ is a positive, fixed step-size parameter. Practical implementation of (5) requires the estimation of the output cyclic correlation matrices R_y^{\alpha_p}(k) at the current iteration which, assuming that the cycle frequencies α_p, p ∈ {1, 2, ..., n}, are known, can be done using an exponentially weighted average of the instantaneous statistics

\hat{R}_y^{\alpha_p}(k+1) = (1-\lambda)\,\hat{R}_y^{\alpha_p}(k) + \lambda \cos(\alpha_p k)\, y(k)\, y^T(k)    (6)

where λ controls the leakiness of the average, and the exponential function in (4) simplifies to a cosine function because the source signals and mixing matrix coefficients are real valued.
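As a minimal illustration of one CSNGA iteration, the sketch below implements the cyclic-correlation estimate (6) and the update (5) for real-valued data; the nonlinearity f(y) = tanh(y) and the default parameter values are assumptions for the example, not values prescribed above.

```python
import numpy as np

def csnga_step(W, R_hat, x_k, k, alphas, mu=5e-4, lam=0.05):
    """One iteration of the cyclostationary natural gradient algorithm (sketch)."""
    n = W.shape[0]
    y = W @ x_k                                   # source estimate, Eq. (2)
    f_y = np.tanh(y)                              # assumed odd nonlinearity f(.)

    # Eq. (6): exponentially weighted estimates of the cyclic correlation matrices
    for p, a in enumerate(alphas):
        R_hat[p] = (1 - lam) * R_hat[p] + lam * np.cos(a * k) * np.outer(y, y)
    R_tilde = R_hat.sum(axis=0)                   # Eq. (4): sum over cycle frequencies

    I = np.eye(n)
    G = (I - np.outer(f_y, y)) + (I - R_tilde)    # bracketed term in Eq. (5)
    W = W + mu * G @ W                            # natural gradient update, Eq. (5)
    return W, R_hat, y
```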
4
Adaptive Learning Rate
The basic gradient adaptive step-size algorithm updates the learning rate µ (k) according to [4, 5, 7]
\mu(k) = \mu(k-1) - \rho\, \nabla_{\mu} J(W(k))\big|_{\mu=\mu(k-1)}    (7)

where ρ is a fixed step-size parameter, and J(W(k)) is the CSNGA cost function [8]. In [4], evaluation of (7) leads to the following update equation

\mu(k) = \mu(k-1) + \rho \left[ \frac{n-1}{1+\mu(k-1)} + \frac{1 - y^T(k-1) f(y(k-1))}{1 + \mu(k)\left[1 - y^T(k-1) f(y(k-1))\right]} - y^T(k) f(y(k)) + f^T(y(k)) f(y(k-1))\, y^T(k-1)\, y(k) \right]    (8)

where n denotes the number of source signals. To ensure stability of (8), and to prevent the adaptation of the learning rate from terminating entirely during the separation process, upper and lower bounds are imposed on the value of the step-size, µ(k) ∈ [δ, µ_max] [4], where δ is a small positive constant, and µ_max represents an upper bound on the step-size, controlling the size of µ(k). Apart from the complexity of the learning rate update equation (8), which at
each iteration imposes an increased computational requirement of O(n), the algorithm is sensitive to small variations in the parameters which are used in the update, does not exploit the cyclostationarity of the signals, and its derivation is based on several simplifying assumptions [4]. Here, the evaluation of (7) is performed considering the direct path only. Then, from (5) it can be shown that the adaptive step-size algorithm is given by

\mu(k) = \mu(k-1) + \gamma \left[ I - f(y(k)) y^T(k) + I - \tilde{R}_y^{\alpha}(k) \right] W(k)\, W^T(k-1) \left[ I - y(k-1) f^T(y(k-1)) + I - \tilde{R}_y^{\alpha}(k-1) \right]    (9)

where γ is a fixed step-size parameter, and the property \tilde{R}_y^{\alpha T}(k) = \tilde{R}_y^{\alpha}(k) has been used [9]. The above algorithm leads to a set of local learning rates, controlling the magnitude of the update of the individual separating matrix coefficients. When a global step-size is preferred, the expression (9) can be modified as follows

\mu(k) = \mu(k-1) + \gamma \left\| \left[ I - f(y(k)) y^T(k) + I - \tilde{R}_y^{\alpha}(k) \right] W(k)\, W^T(k-1) \left[ I - y(k-1) f^T(y(k-1)) + I - \tilde{R}_y^{\alpha}(k-1) \right] \right\|    (10)

where ‖·‖ represents the Euclidean norm of the matrix. Since the proposed adaptive learning rate algorithm (10) uses instantaneous estimates of the gradient, it is anticipated that fluctuations may be observed, particularly when the estimated parameters are close to their optimal values. Thus, an upper bound is imposed on the norm of the update term so that, while aiding algorithm stability, memory of the previous step-size value is retained, which ensures that the adaptive process remains related to the changes in the estimated parameters. Equation (10) can be re-written as

\mu(k) = \mu(k-1) + \gamma \left\| \frac{\partial J(W(k))}{\partial W(k)} \left[ \frac{\partial J(W(k-1))}{\partial W(k-1)} \right]^T \right\|    (11)

This form highlights the simplicity of the algorithm, since it only requires the current and previous gradient matrices to evaluate the learning rate at any time k, and these are already available from the standard algorithm.
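A sketch of the proposed global step-size adaptation (10)-(11): the learning rate is driven by the norm of the product of the current and previous natural-gradient matrices. The helper assumes the gradient term G(k) = [I - f(y)y^T + I - R_tilde]W(k) is available from the CSNGA step; the bound on the gradient norm and the value of γ are illustrative.

```python
import numpy as np

def adapt_step_size(mu_prev, G_curr, G_prev, gamma=1e-6, max_norm=100.0):
    """Global gradient-adaptive learning rate, cf. Eqs. (10)-(11).

    G_curr, G_prev : natural-gradient update matrices at times k and k-1,
                     i.e. (I - f(y)y^T + I - R_tilde) W.
    """
    # Euclidean norm of the product of the current and previous gradient matrices
    update = np.linalg.norm(G_curr @ G_prev.T)
    # cap the update so that memory of the previous step-size value is retained
    update = min(update, max_norm)
    return mu_prev + gamma * update
```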
5
Experimental Results
The performance of blind source separation methods is conventionally assessed by plotting the following performance index (PI)

\mathrm{PI}(k) = \frac{1}{n}\sum_{i=1}^{n}\left( \sum_{j=1}^{n} \frac{|p_{ij}|^2}{\max_q |p_{iq}|^2} - 1 \right) + \frac{1}{n}\sum_{j=1}^{n}\left( \sum_{i=1}^{n} \frac{|p_{ij}|^2}{\max_q |p_{qj}|^2} - 1 \right)    (12)
where P(k) = [p_ij] = W(k)A, and n is the number of source signals. The performance index is a measure of the closeness between the magnitude of the dominant entry and the sum of all the entries in each row and column: the more dominant that entry is, the more unambiguously the estimated separating matrix W(k) separates the sources. Generally, a low PI indicates better performance. Two BPSK signals carrying independent binary data, and using sinusoidal carriers of normalised frequencies 2(5π)^{-1} and (4π)^{-1}, are mixed by a real stationary channel A, and zero mean white Gaussian noise is added, so that the SNR is 8 dB. The parameters α_p, p = 1, 2, and λ in (6) are chosen respectively as 4(5π)^{-1}, (2π)^{-1}, and 0.05, while the fixed step-size parameters ρ and γ are set to, respectively, 10^{-8} and 10^{-6}. Separation is performed with the cyclostationary NGA algorithm, when the step-size is fixed to 0.0005, and using the updates (8) and (10), with initial learning rate µ(0) = 0.0005. Also, the upper bound on the gradient norm in (10) is set to 100, and the bounds on the step-size in (8) are selected as δ = 10^{-4} and µ_max = 0.004. The performance index resulting from the application of the three methods, averaged over 100 independent trials, is plotted in Figure 1. The results illustrate that both adaptive learning rate approaches lead to faster convergence than the fixed step-size parameter. Moreover, the cyclostationary NGA algorithm converges to a PI of 0.01 within approximately 600 samples when the proposed method is used, and 2200 samples when the learning rate (8) is employed. With the fixed step-size parameter, CSNGA does not achieve this PI value within 3000 samples.
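For reference, the performance index (12) can be computed directly from the global matrix P(k) = W(k)A, as in the illustrative helper below.

```python
import numpy as np

def performance_index(P):
    """Performance index of Eq. (12) for the global matrix P = W A."""
    Q = np.abs(P) ** 2
    row_term = (Q / Q.max(axis=1, keepdims=True)).sum(axis=1) - 1.0
    col_term = (Q / Q.max(axis=0, keepdims=True)).sum(axis=0) - 1.0
    # both averages are zero when P has a single non-zero entry per row and column
    return row_term.mean() + col_term.mean()
```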
Fig. 1. Average performance indices obtained for CSNGA with fixed step-size (solid line), and adaptive learning rates (8) (dashed line) and (10) (dotted line). The mixing channel is stationary, and additive noise is present, such that the SNR is 8 dB
Fig. 2. Average performance indices obtained for CSNGA with fixed step-size (solid line), and adaptive learning rates (8) (dashed line) and (10) (dotted line), when the mixing channel changes abruptly after 500 and 2653 samples
Next, the two source signals described above are mixed by a channel which changes abruptly after 500 and 2653 samples. When the fixed step-size parameter is used, µ = 0.0005, while for the adaptive learning rates (10) and (8), µ(0) = 0.0005, ρ = 10^{-8}, and γ = 10^{-6}. The values for δ, µ_max, and the upper bound on the gradient norm are selected as above. The performance index resulting from the application of the three methods, averaged over 100 Monte Carlo trials, is depicted in Figure 2, and shows that the use of the gradient adaptive step-sizes leads to a significant increase in speed of convergence. In particular, the convergence properties of CSNGA improve considerably when the learning rate is updated with the proposed algorithm.
6
Conclusions
A fast converging algorithm for the separation of cyclostationary sources has been proposed, which is of the same order of complexity as the standard cyclostationary NGA algorithm. The method improves the performance of CSNGA by varying the step-size in response to changes in the dynamics of the estimated parameters, and by exploiting the statistical cyclostationarity of the source signals. Computer simulations have shown that when the sources are recovered by the proposed approach, improved rate of convergence is obtained.
References
[1] S. Amari and A. Cichocki, "Adaptive blind signal processing - neural network approaches," Proceedings of the IEEE, vol. 86, pp. 2026-2048, 1998.
[2] T. W. Lee, Independent Component Analysis. Kluwer Academic Publishers, 1998.
[3] A. Mansour, A. K. Barros, and N. Ohnishi, "Blind separation of sources: Methods, assumptions and applications," IEICE Trans. on Fundamentals of Electronics, Communications and Computer Sciences, vol. E83-A, pp. 1498-1512, 2000.
[4] S. C. Douglas and A. Cichocki, "Adaptive step size techniques for decorrelation and blind source separation," Proc. of the Asilomar Conf. on Signals, Systems and Computers, vol. 2, pp. 1191-1195, 1998.
[5] V. J. Mathews and Z. Xie, "A stochastic gradient adaptive filter with gradient adaptive step size," IEEE Trans. on Signal Processing, vol. 41, pp. 2075-2087, 1993.
[6] M. G. Jafari and J. A. Chambers, "A new natural gradient algorithm for cyclostationary sources," submitted to IEE Proceedings Vision, Image and Signal Processing, 2002.
[7] D. P. Mandic and A. Cichocki, "An online algorithm for blind extraction of sources with different dynamical structures," to appear in Proc. of the Int. Workshop on Independent Component Analysis and Blind Signal Separation, 2003.
[8] M. G. Jafari, J. A. Chambers, and D. P. Mandic, "Natural gradient algorithm for cyclostationary sources," IEE Electronics Letters, vol. 38, pp. 758-759, 2002.
[9] M. G. Jafari, "Novel sequential algorithms for blind source separation of instantaneous mixtures," Ph.D. dissertation, King's College London, 2002.
Computationally Efficient Doubletalk Detection Using Estimated Loudspeaker Impulse Responses Per Åhgren Department of Systems and Control, Uppsala University Box 337, SE-75105 Sweden phone: +46-18-4713079 [email protected]
Abstract. For an adaptive acoustic echo canceller to perform well, a doubletalk detector has to be present in order to determine when filter adaptation is allowed. Since the doubletalk detector is to be implemented in real-time, it is important to have a doubletalk detector with a low computational complexity. In this paper we propose a new approach to doubletalk detection that can be used to reduce the computational complexity of existing algorithms. Furthermore, the numerical examples show that the new approach yields better doubletalk detection performance when applied to some of the best existing doubletalk detection algorithms.
1
Introduction
The problem of Acoustic Echo Cancellation (AEC) was introduced in [4] and is still an active field of research. Acoustic Echo Cancellers are needed for removing the acoustic echoes resulting from the acoustic coupling between the loudspeaker(s) and the microphone(s) in communication systems. In Fig. ?? a typical setup for AEC is shown. The essential purpose of the setup is that the near-end speech signal v(t) is to be picked up by the microphone M and propagated to the far-end room while far-end speech is to be emitted by the loudspeaker L into the near-end room. During doubletalk, the near-end speech in the microphone signal y(t) is corrupted by the echo of the far-end speech signal x(t) that is propagated in the near-end room from the loudspeaker L to the microphone M . Therefore, during doubletalk, the resulting microphone signal y(t) consists of near-end speech mixed with far-end speech filtered by the near-end room impulse response h from the loudspeaker to the microphone y(t) = hT x(t) + v(t) + w(t).
(1)
In (1), w(t) is noise, and the near-end room impulse response, modeled as a Finite Impulse Response (FIR) filter of order n, and the input data vector x(t) are defined as

h = [h_0\ h_1\ \cdots\ h_{n-1}]^T, \qquad x(t) = [x(t)\ x(t-1)\ \cdots\ x(t-n+1)]^T.    (2)
Usually, in order to remove the undesired echo, an adaptive filter estimate ĥ(t) of h is computed (in this paper we will only consider FIR filters, which is the most common filter type for AEC), and used to predict the far-end speech contribution h^T x(t) and subtract it from the microphone signal y(t). Thereby we get the error signal

e(t) = y(t) - \hat{h}^T(t) x(t) = v(t) + h^T x(t) - \hat{h}^T(t) x(t) + w(t)    (3)
that ideally should be equal to the near-end speech signal v(t). When no near-end speech is present, the error signal e(t) can be used to adapt the adaptive filter ĥ(t) using some algorithm for filter adaptation. Several different methods for filter adaptation in AEC have been proposed [2]. The most common one is perhaps the Normalized Least Mean Squares (NLMS) algorithm [3], which has been shown to perform well for the AEC problem. When there is doubletalk, however, the near-end speech signal v(t) disturbs the adaptation and can cause the adaptive filter to diverge. Therefore it is important to detect doubletalk in order to stop the filter adaptation when doubletalk is present. Several different algorithms have been proposed for doubletalk detection, of which the most interesting is probably the normalized cross-correlation (NCR) method presented in [1]. The main drawback of the NCR method is that it has a huge computational complexity. There is, however, an approximate variant of the NCR method [1] that trades some of the good performance of NCR for lower computational complexity. This we will denote Cheap-NCR. The AEC algorithms as well as the DTD algorithms are to be run in real-time on a digital signal processor with limited memory and computational power. As the complexities of these algorithms usually are proportional to a power of n (the length of the impulse response h), and n usually is very large, ranging from several hundred to several thousand, it is essential to minimize the computational complexity. The main purpose of this paper is to show how knowledge of the loudspeaker impulse response can be used to reduce the computational complexity of existing DTD algorithms while at the same time increasing the DTD performance.
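For reference, a minimal NLMS update for the adaptive echo-cancelling filter ĥ(t) of (3) is sketched below; the step size and regularisation constant are illustrative, and in practice adaptation would be frozen whenever the doubletalk detector fires.

```python
import numpy as np

def nlms_step(h_hat, x_vec, y_t, mu=0.5, eps=1e-6):
    """One NLMS update of the adaptive filter estimate h_hat (length n).

    x_vec : current far-end data vector x(t) = [x(t), ..., x(t-n+1)]^T
    y_t   : current microphone sample y(t)
    """
    e_t = y_t - h_hat @ x_vec                               # error signal, Eq. (3)
    h_hat = h_hat + mu * e_t * x_vec / (eps + x_vec @ x_vec)
    return h_hat, e_t
```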
2
DTD Using Estimated Loudspeaker Impulse Responses
In this paper we propose a new approach to DTD based on the knowledge of the impulse response of the loudspeaker L in Fig. ??. This new approach, which we denote the Loudspeaker-IMpulse-rEsponse (LIME) approach, may be used to modify existing DTD algorithms. As we will see, the LIME approach uses a data model similar to the one in (1). Thus the LIME approach can probably be used for most existing DTD algorithms working with the model in (1). However, in this paper we only apply the approach to the NCR and the Cheap-NCR algorithms. It turns out that the LIME approach can significantly reduce the computational complexity of these DTD algorithms while still obtaining a comparable, or even better, DTD performance.
The LIME approach is based on the fact that all the far-end speech, and no near-end speech, is filtered by the impulse response of the loudspeaker L in Fig. ??. This can be exploited: based on the knowledge of the loudspeaker impulse response we can modify existing DTD methods. These modifications are described in Section 2.1. For the LIME approach to be feasible it is vital that we can somehow obtain the loudspeaker impulse response, and this is discussed in Section 2.2. In Section 2.3, we summarize the LIME approach for the case when NCR and Cheap-NCR are used as DTD algorithms. Finally, in Section 2.4 we discuss the numerical complexities of the LIME approach.
2.1
The LIME Approach
The impulse response h in (2) consists both of the time-varying impulse response h_E of the echo path in the near-end room and the time-invariant impulse response h_L of the loudspeaker L. Assuming these impulse responses can be approximated as linear (which is a basic assumption in AEC), we can write h as

h = h_L ∗ h_E    (4)

where ∗ denotes convolution, and h_L and h_E are defined similarly to h. The orders of h_L and h_E are denoted by p and m, respectively. From (4), we then have that n = p + m − 1. Since we assume that we know an estimate ĥ_L of h_L, we can rewrite this equation as

y(t) = h_E^T \bar{x}(t) + v(t) + w(t), \qquad \bar{x}(t) = h_L^T x(t).    (5)
From these equations it is clear that we can compute an estimate ĥ_E of h_E from the signals x̄(t) and y(t). Most DTD algorithms work with the data model in (1) and rely on the fact that y(t) is a filtered version of x(t) and that v(t) is not. In the LIME approach we modify the data model in (1) and end up with the following model

y(t) = h_L^T \tilde{x}(t) + v(t) + w(t), \qquad \tilde{x}(t) = \hat{h}_E^T x(t)    (6)
which is similar to the model in (1). The computational complexities of most successful DTD algorithms such as NCR and Cheap-NCR are proportional to the dimension of the filter in the AEC data model. Thus, by applying the DTD algorithms to the model (6) instead of (1), we will lower the numerical complexity of the algorithms significantly, since p, the order of the filter h_L in (6), is generally much smaller than n, which is the order of the filter h in (1).
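A sketch of the LIME prefiltering step: the far-end signal is first filtered by the known loudspeaker response ĥ_L to give x̄(t), the shorter echo-path filter ĥ_E is identified from x̄(t) and y(t) (for example with the NLMS sketch above), and the DTD then works on model (6), whose filter has only p taps. The snippet itself is an illustration, not the paper's implementation.

```python
import numpy as np

def lime_prefilter(x, h_L_hat, h_E_hat):
    """Produce the signals used by the LIME data models (5) and (6).

    x        : far-end signal, 1-D array
    h_L_hat  : estimated loudspeaker impulse response (p taps)
    h_E_hat  : current estimate of the echo-path response (m taps)
    """
    x_bar = np.convolve(x, h_L_hat)[:len(x)]     # x filtered by the loudspeaker, Eq. (5)
    x_tilde = np.convolve(x, h_E_hat)[:len(x)]   # x filtered by the echo-path estimate, Eq. (6)
    return x_bar, x_tilde
```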
2.2
Estimation of the Loudspeaker Impulse Response
The impulse response of a loudspeaker may be obtained in different ways. The best, and perhaps most direct way, is to compute it from measurements taken in an anechoic chamber.
An important property of loudspeaker impulse responses is their time-invariance. Indeed, if the loudspeaker impulse responses were time-varying, the LIME approach would not be feasible. Fortunately, it seems that loudspeaker impulse responses are relatively time-invariant. This is a property on which music hardware for compensating for the acoustic properties of loudspeakers is based.
2.3
Summary of the LIME Approach
The LIME approach applied to the NCR and the Cheap-NCR detectors is summarized in the steps below. We will assume that we have previously computed an estimate ĥ_L of the loudspeaker impulse response h_L.
(i) Compute, possibly adaptively and recursively, an estimate ĥ_E of h_E from x̄(t) = ĥ_L^T x(t) and y(t).
(ii) Compute
\tilde{x}(t) = \hat{h}_E^T x(t).    (7)
(iii) Directly applying the DTD algorithms developed in [1] to y(t) and x̃(t), we get the following decision variables
\xi_{NCR}(t) = \frac{1}{\hat{\sigma}_y(t)} \sqrt{ r_{\tilde{x}y}^T(t)\, R_{\tilde{x}\tilde{x}}^{-1}(t)\, r_{\tilde{x}y}(t) }    (8)
\xi_{Cheap\text{-}NCR}(t) = \frac{ r_{\tilde{x}y}^T(t)\, \hat{h}_L }{ \hat{\sigma}_y^2(t) }    (9)
for NCR and Cheap-NCR, respectively. For the two decision variables we have that doubletalk is detected at time sample t if ξ(t) ≥ T, and not present if ξ(t) < T, where T is a constant threshold described below. In (8) and (9), R_{\tilde{x}\tilde{x}} and r_{\tilde{x}y} are defined as
R_{\tilde{x}\tilde{x}}(t) = E\{\tilde{x}(t)\,\tilde{x}^T(t)\}, \qquad r_{\tilde{x}y}(t) = E\{y(t)\,\tilde{x}(t)\}    (10)
where E denotes the expectation operator. The variance \hat{\sigma}_y^2(t) is defined as \hat{\sigma}_y^2(t) = E\{y^2(t)\}. T is a constant threshold that should be chosen to minimize the probability of false alarm (P_f) as well as the probability of missed detection (P_m) (defined in Section 3.1).
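The decision variables (8)-(9) can be formed from sample estimates of the quantities in (10). The sketch below uses simple block averages in place of the recursive sliding-window estimates, and is an illustration rather than the paper's implementation.

```python
import numpy as np

def dtd_statistics(X_tilde, y, h_L_hat):
    """NCR and Cheap-NCR decision variables, cf. Eqs. (8)-(10).

    X_tilde : (N, p) matrix whose rows are the prefiltered data vectors x~(t)
    y       : length-N vector of microphone samples
    """
    N = len(y)
    R = X_tilde.T @ X_tilde / N                  # sample estimate of R_{x~x~}, Eq. (10)
    r = X_tilde.T @ y / N                        # sample estimate of r_{x~y},  Eq. (10)
    sigma_y2 = np.mean(y ** 2)

    xi_ncr = np.sqrt(r @ np.linalg.solve(R, r)) / np.sqrt(sigma_y2)   # Eq. (8)
    xi_cheap = (r @ h_L_hat) / sigma_y2                                # Eq. (9)
    return xi_ncr, xi_cheap
```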
2.4
Numerical Complexity Comparison
In Table 1 the numerical complexities of NCR are stated for the cases when the LIME approach is used, and when it is not. Note that we have assumed that NLMS is used to compute ĥ_E in step (i) of the LIME approach, and that R_{\tilde{x}\tilde{x}}^{-1}(t) and r_{\tilde{x}y}(t) have been computed recursively in time over a sliding window. Also note that we will not consider the numerical complexity of Cheap-NCR, since Cheap-NCR in a sliding window or forgetting factor implementation only requires a few multiplications per time sample. It is clear that the computational complexity of the AEC setup when the NCR method is used is much lower with the LIME approach than without the LIME approach.
3
Numerical Example
3.1
Definitions
The probability of missed detection (P_m) and the probability of false alarm (P_f) are defined as

P_m = \frac{N_{Dm}}{N_D}, \qquad P_f = \frac{N_{Df}}{N_{ND}}    (11)

where N_{Dm} is the number of samples where doubletalk was not detected but was present, N_D is the total number of samples where doubletalk was present, N_{Df} is the number of samples where doubletalk was detected but where no doubletalk was present, and N_{ND} is the total number of samples where doubletalk was not present. The Near-end to Far-end speech Ratio (NFR) and the Signal to Noise Ratio (SNR) are defined as

\mathrm{NFR} = 10\log \frac{E\{[y(t)-v(t)]^2\}}{E\{v^2(t)\}}, \qquad \mathrm{SNR} = 10\log \frac{E\{y^2(t)-w^2(t)\}}{E\{w^2(t)\}}    (12)

where y(t), w(t) and v(t) are defined in (1).
3.2
DTD Algorithm Evaluation Scheme
(i) Generate two seconds of data according to the model in (1) without any doubletalk present (v(t) ≡ 0).
(ii) Apply the detector to the data and choose a threshold T that gives a P_f of 0.1.
(iii) Create nine different data sets, in each of which one of three different 1/2 second speech samples is added in three different positions into the original data set from step (i).
(iv) Apply the detector to all the nine data sets and compute the average probability of missed detection.
3.3
Numerical Simulation
The model in (1) was used to generate the data. The impulse response h in (1) was obtained in an office room using an AEC setup with a loudspeaker with known (computed in an anechoic chamber) impulse response hL . Prerecorded speech data was used for the far-end speech signal as well as doubletalk speech signals. The sampling frequency was set to 8 kHz in order to keep the computational complexity of the simulations for NCR (without the LIME approach) reasonably low. The doubletalk detection performance of the DTD algorithms NCR and Cheap-NCR with, and without, the LIME approach was tested using the evaluation scheme presented in Section 3.2. The total room impulse response (including the loudspeaker impulse response) had a length of 250 filter taps (the
reason for using such short impulse responses was that the NCR method without the LIME approach was too computationally complex to allow much higher filter orders) and the loudspeaker impulse response was truncated to a length of 75 filter taps. The estimates (ĥ_E) of the echo paths used in the detectors were estimated from 2 seconds of data generated using the model in (1) without any doubletalk. In all the data, the SNR was set to the rather high value of 30 dB to ensure that the noise did not have too much influence on the algorithm performance evaluation. The detectors were evaluated for different NFR and the results are displayed in Fig. ??, where P_m is plotted as a function of the NFR. It is clear from the figures that the NCR and Cheap-NCR algorithms with the LIME approach outperform their counterparts without the LIME approach. Another interesting thing to note in the figures is that the Cheap-NCR algorithm performs better than the NCR algorithm both with and without the LIME approach. This may seem strange at first sight. The comparison between the NCR and Cheap-NCR algorithms is, however, not fair, since Cheap-NCR uses an impulse response that has been computed from 2 seconds of data, and the NCR method only uses data from a moving time window to compute the decision variable. Thus, by using more data the Cheap-NCR algorithm can perform better than the NCR algorithm.
4
Conclusion
We have proposed a new approach to DTD that can be used for most existing DTD algorithms. The new approach offers a lower computational complexity compared to the original DTD algorithm. The numerical simulations show that when applied to some doubletalk detectors it may also give improved DTD performance.
References
[1] J. Benesty, D.R. Morgan, and J.H. Cho. A New Class of Doubletalk Detectors Based on Cross-Correlation. IEEE Transactions on Speech and Audio Processing, 8(2):168–172, Mar 2000.
[2] C. Breining, P. Dreiseitel, E. Hänsler, A. Mader, B. Nitsch, H. Puder, T. Schertler, G. Schmidt, and J. Tilp. Acoustic Echo Control - An Application of Very-High-Order Adaptive Filters. IEEE Signal Processing Magazine, 16(4):42–69, July 1999.
[3] S. Haykin. Adaptive Filter Theory. Prentice-Hall, Upper Saddle River, New Jersey, 3rd edition, 1996.
[4] M.M. Sondhi. An adaptive echo canceler. Bell System Technical Journal, XLVI(3):497–510, 1967.
Texture Segmentation Using Semi-supervised Support Vector Machine S. Sanei Centre for Digital Signal Processing Research, King's College London, London, WC2R 2LS, UK Phone: +44 (0)207 848 1039 [email protected]
Abstract. Support vector machine (SVM) is used here to detect the texture boundaries. In order to do that, a cost function is initially defined based on the estimation of higher order statistics (HOS) of the intensities within small regions. K-mean algorithm is used to find the centres of the two clusters (boundary or texture) from the values of the cost function over the entire image. Then the target values are assigned to the class members based on their Euclidean distances from the centres. A supervised nonlinear SVM algorithm with RBF kernel is later used to classify the cost function values. The boundary is then identified in places where the cost function has greater values. The overall system will be semi-supervised since the targets are not predetermined; however, the number of classes is considered as two. The results show that the algorithm performance is superior to other conventional classification systems for texture segmentation. The displacement of the edges is negligible.
1
Introduction
Texture segmentation plays an important role in analysis, understanding, recognition, and coding of images. A successful texture segmentation scheme provides homogeneous regions and ensures that adjacent regions possess significantly different properties. Many methods for texture segmentation have been introduced in the literature. Application of Gabor filters using the properties of the human visual system (HVS) is well known [1][2]. A similar alternative is clustering the features extracted by filtering the textures using Gaussian filters and their derivatives [3][4]. Similarly, classification of the dominant eigenimages constructed from eigenvectors weighted by their corresponding eigenvalues is an effective method. Other methods based on AR modelling have also been considered for semi-periodic textures [4][5]. In this paper, we present a novel HOS-SVM based texture segmentation technique. HOS are especially appropriate for texture segmentation for at least two reasons. Firstly, HOS are resistant to additive Gaussian noise and are therefore appropriate in high-noise situations where even traditional second order statistic methods fail.
Secondly, HOS characterise non-Gaussian processes and can therefore be used to discriminate textures that are second-order spectrally equivalent. Our approach is to initially adopt and generalise the method of Tsatsanis and Giannakis for detecting a change in the autocorrelation function [6] and the method of Sadler for detecting a change in the second and higher order statistics [7]. We deem edges as those areas where change occurs. Changes often take the form of a step, which might occur in various order features. Between these changes the regions are assumed stationary or semi-stationary. The features then need to be clustered and classified. A k-mean algorithm followed by an SVM, acting as a semi-supervised classifier, is utilised to separate the edges.
2
Texture Segmentation
Our algorithm includes (a) computation of proper HOS-based features, (b) definition of a suitable cost function, (c) clustering of the features using the k-mean algorithm, (d) classification of the feature function using SVM, and (e) post-processing of the classification result to achieve a one-pixel-wide, smooth, linked edge map. Prior to application of the algorithm the image is median filtered and spatially lowpass filtered to remove or alleviate the noise.
2.1
Feature Detection
In order to determine the features we perform the calculations over a sliding square window. The window size N_w × N_w has to be carefully selected. To enable an accurate estimation of the features in a texture with a distinguishable fundamental frequency f_0, N_w may be considered at least equal to 2/f_0. In the case of having different fundamental frequencies along a row or column of the image, N_w will be equal to 2/f_{0,min}, where f_{0,min} is the minimum fundamental frequency. The fundamental frequency can be measured by looking at the spectrum of the region. In this part we extend the one-dimensional non-parametric detection methodology proposed in [6] into a two-dimensional space. Then we will introduce the smoothing method for reducing the variance of the change detection statistic proposed in [7]. We wish to detect a change in a discrete two-dimensional space for a random process I(x, y) at point s_0 = (x_0, y_0). It is assumed that I(x, y) is stationary for s < s_0 and it is also stationary for s > s_0 but with some new set of characteristics. Let η(z, τ) = max(η_x(z, τ_x), η_y(z, τ_y)) be the corresponding HOS, where the τ values are space variables, η_x(z; τ) = E[I(x, y) I(x + τ_1, y) I(x + τ_2, y)] and η_y(z; τ) = E[I(x, y) I(x, y + τ_3) I(x, y + τ_4)] denote the space-varying third-order feature functions along the x and y directions respectively, z = (x, y), τ_1, τ_2, τ_3, τ_4 > 0, τ_x = max(τ_1, τ_2), τ_y = max(τ_3, τ_4), and
τ = arg max_{τ_x, τ_y} (η(z, τ)). Then we can state the boundary detection problem as finding the pixels around which a function of η(z, τ) has a significant change. Assuming there is a stepwise jump in the signal, we have s = s_0 + 1. According to [6], detection of the changes can be performed by minimising

J_N(s) = \frac{1}{N-\tau} \sum_{z=0}^{N-1-\tau} J(z, s)    (1)

where

J(z, s) = E\{ [\eta(z, \tau) - \bar{\eta}(\tau) - \theta_s(z, \tau)]^2 \}    (2)

\bar{\eta}(\tau) = \frac{1}{(N-\tau)^2} \sum_{x=0}^{N-1-\tau} \sum_{y=0}^{N-1-\tau} \eta(z, \tau)    (3)

θ_s(z, τ) will be matched to the behaviour of η(z, τ). Expression (1) can be simplified by dropping terms that do not depend on s and scaling θ_s(z, τ) to be zero mean and to have unit energy for all values of s, resulting in

J_N(s) = \frac{1}{(N-\tau)^2} \sum_{y=0}^{N-1-\tau} \sum_{x=0}^{N-1-\tau} \eta(x, y, \tau)\, \theta_s(x, y, \tau)    (4)
where

\theta_s(z, \tau) = \begin{cases} -\left[ \dfrac{N-s-\tau}{s\,(N-\tau)} \right]^{1/2}, & z = 0, \ldots, s-1 \\ \left[ \dfrac{s}{(N-s-\tau)(N-\tau)} \right]^{1/2}, & z = s, \ldots, N-\tau \end{cases}    (5)
It is important to note that in (5) z can be x or y depending on the direction of the maximum change in J_N(s). In practice N = N_w, which is the size of the sliding window. The use of J_N(s) relies on the estimate of η(z, τ). These estimates are typically of high variance, so it is necessary to smooth them. The smoothing procedure should avoid deterioration of the edges as much as possible. Adaptive smoothing has been used successfully to reduce image noise while preserving edge information. We adopt the smoothing technique proposed in [7]. Under this method, we first estimate the variance in two windows on either side of the point of interest and take the mean over the window that exhibits the lower variance.
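To illustrate the detection statistic, the sketch below implements a one-dimensional analogue of (1)-(5): a third-order feature is estimated along a line, the mean is removed, and the result is correlated with a zero-mean, unit-energy step template for every candidate change position s. The particular template normalisation and lag values are assumptions of the example, not the paper's exact choices.

```python
import numpy as np

def third_order_feature(signal, tau1=1, tau2=2):
    """eta(z; tau) ~ I(z) I(z+tau1) I(z+tau2) estimated along one image line."""
    L = len(signal) - tau2
    return signal[:L] * signal[tau1:L + tau1] * signal[tau2:L + tau2]

def change_statistic(eta):
    """Statistic proportional to J_N(s) for every candidate change position s."""
    M = len(eta)
    eta0 = eta - eta.mean()                       # remove the mean, cf. Eq. (3)
    J = np.zeros(M)
    for s in range(1, M - 1):
        theta = np.empty(M)
        theta[:s] = -np.sqrt((M - s) / (s * M))   # zero-mean, unit-energy step template
        theta[s:] = np.sqrt(s / ((M - s) * M))
        J[s] = np.dot(eta0, theta) / M            # correlation with the step, cf. Eq. (4)
    return J

# the estimated boundary position is where |J_N(s)| is largest, e.g.
# s_hat = np.argmax(np.abs(change_statistic(third_order_feature(row))))
```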
2.2
Edge Detection Using SVM
Most of the known classification algorithms, such as the Fisher discriminator or Bayesian classifiers, perform well only for Gaussian data. Application of a multi-class nonlinear SVM overcomes this problem. SVM uses Lagrangian constraints to identify the best discriminator in a multi-class environment [8]. Here, the values of J_N(s) within each two-dimensional window are to be classified. Application of SVM for classification is subject to having a training sequence for various values of J_N(s). In places where the textures are already known this will be an easy task. However, we may consider that there is no a priori information about the textures. In that case, prior to the application of SVM, a k-mean algorithm [9] is used to find the class centres and allocate a target value based on the J_N(s) values in the entire image. K-mean is an iterative partitioning clustering algorithm. The procedure is repeated for different images. The points farther from the class centres than a threshold value are discarded, so they will not be considered for classification. After clustering, we have
J_c(s) = \begin{cases} 1, & J_N(s) > T_u \\ -1, & J_N(s) < T_l \end{cases}    (6)
where J_c(s) are the clustered values, and T_u and T_l are upper and lower threshold levels respectively, which identify the points to be classified using SVM. SVM is used here to classify those J_N(s) for which J_c(s) has a (target) value. After the support vectors are identified, a proper kernel in conjunction with the classifier is used. Here, the reproducing kernel Hilbert space function used is a radial basis function (RBF) defined as
K(J_i, J_j) = \exp\left( -\frac{\|J_i - J_j\|^2}{2\sigma^2} \right)    (7)
where σ² is the variance of the cost function values J_N(s). Then the classifier will be

f(J) = \mathrm{sgn}\left( \sum_{i=1}^{P} y_i \alpha_i K(J_i, J) + b \right)    (8)

where P denotes the number of support vectors, α_i are the Lagrangian multipliers, b is the bias (offset) of the separating hyperplane, and y_i = J_c, which are already identified using the k-mean algorithm. The encompassed convex hulls are either the texture or the boundary regions. To further restore the edges, the boundary regions are thinned and linked (if necessary) by simple means of thinning (e.g. using morphological operators) and linking criteria.
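A compact version of this semi-supervised labelling-and-classification stage using scikit-learn: k-means (k = 2) supplies provisional ±1 targets for the confidently clustered cost values, an RBF-kernel SVM is trained on them, and the remaining points are then classified. The threshold margin and the one-dimensional feature layout are illustrative choices, not values taken from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def semi_supervised_edges(J, margin=0.25):
    """Label J_N values as boundary (+1) or texture (-1) for one window."""
    X = J.reshape(-1, 1)
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    centres = np.sort(km.cluster_centers_.ravel())
    T_l = centres[0] + margin * (centres[1] - centres[0])   # lower threshold
    T_u = centres[1] - margin * (centres[1] - centres[0])   # upper threshold

    y = np.zeros(len(J), dtype=int)
    y[J > T_u] = 1                       # boundary targets, Eq. (6)
    y[J < T_l] = -1                      # texture targets
    labelled = y != 0

    svc = SVC(kernel="rbf", gamma=1.0 / (2.0 * J.var()))    # RBF kernel, Eq. (7)
    svc.fit(X[labelled], y[labelled])
    if (~labelled).any():
        y[~labelled] = svc.predict(X[~labelled])
    return y
```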
3
Experimental Results
Two combinations of textures from the Brodatz collection [10] have been used to test and justify the performance of the algorithm. The images are given in Figure 1. Figure 2 shows how the points J_N have been clustered using the k-mean algorithm. In Figure 3 the result of the implementation of SVM for classification of the points within one two-dimensional window of 32×32 is given. The region of edge points is highlighted in Figure 3 (b). In this experiment the windows do not overlap. Finally, in Figure 4 (a) the entire boundaries obtained using SVM are depicted and in Figure 4 (b) the final edge-maps are illustrated.
4
Summary and Conclusions
A semi-supervised method consisting of a k-mean algorithm followed by a nonlinear SVM using an RBF kernel is used to classify the values of a cost function. The cost function is a measure of the third order statistics of the intensity values of images with distinct regions of different textures. The result of the implementation of the algorithm shows that the method is robust for separation of uniform patterns such as most of the Brodatz textures. The method is potentially useful for application in pattern and object recognition, where there are various objects of uniform patterns. Implementation of the algorithm locates the edges with a negligible displacement error.
Fig. 1. Original Images 256×256
Fig. 2. Clustering over a frame of 32×32
Fig. 3. (a) Detection of the boundary region using SVM within a 32×32 region, (b) the boundary region is highlighted
Fig. 4. (a) The overall boundary region (b) the final edge map
References
[1] Turner M. R., "Texture discrimination by Gabor functions," Biol. Cybernetics, 55, 71-82, 1986.
[2] Jain A. K. and Farrokhnia F., "Unsupervised texture segmentation using Gabor filters," Pattern Recognition, 24(12), 1167-1186, 1991.
[3] Kasparis T., Charalampidis D., Georgiopoulos M., and Rolland J., "Segmentation of textured images based on fractals and image filtering," Pattern Recognition, 34, 1963-1973, 2001.
[4] Randen T. and Husøy J. H., "Filtering for texture classification: A comparative study," IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(4), 291-310, 1999.
[5] Kashyap R. L., "Characterization and Estimation of Two-dimensional ARMA models," IEEE Transactions on Information Theory, 30, 736-745, 1984.
[6] Tsatsanis M. K. and Giannakis G. B., "A non-parametric approach for detecting changes in the autocorrelation structure," Proc. 6th European Signal Processing Conference (EUSIPCO-92), II, 843-846, Belgium, 1992.
[7] Sadler B. M., "Texture segmentation by change detection in second and higher order statistics," Conference Record of the 27th Asilomar Conference on Signals, Systems, and Computers, 1, 436-440, 1993.
[8] DeCoste D. and Scholkopf B., "Training invariant support vector machines," Machine Learning, Kluwer Press, 2001.
[9] Friedman J. H., Baskett F., and Shustek L. J., "An algorithm for finding nearest neighbours," IEEE Transactions on Computers, C-24, 1000-1006, 1975.
[10] Brodatz P., Texture: A Photographic Album for Artists and Designers, Dover, New York, 1966.
A Non-parametric Test for Detecting the Complex-Valued Nature of Time Series
Temujin Gautama1, Danilo P. Mandic2, and Marc M. Van Hulle1
1 Laboratorium voor Neuro- en Psychofysiologie, K. U. Leuven, Campus Gasthuisberg, Herestraat 49, B-3000 Leuven, Belgium
{temu,marc}@neuro.kuleuven.ac.be
2 Department of Electrical and Electronic Engineering, Imperial College of Science, Technology and Medicine, Exhibition Road, SW7 2BT, London, UK
[email protected]
Abstract. The emergence of complex-valued signals in natural sciences and engineering has been highlighted in the open literature, and in cases where the signals have complex-valued representations, the complex-valued approach is likely to exhibit advantages over the more convenient real-valued bivariate one. It remains unclear, however, whether and when the complex-valued approach should be preferred over the bivariate one, thus clearly indicating the need for a criterion that addresses this issue. To this cause, we propose a statistical test, based on the local predictability in the complex-valued phase space, which discriminates between the bivariate and complex-valued nature of time series. This is achieved in the well-established surrogate data framework. Results on both the benchmark and real-world IPIX complex radar data support the approach.
1
Introduction
Recently, the use of complex-valued signals has shown many advantages over real-valued bivariate ones, and such signals are an increasingly popular topic in many branches of physics and DSP. Consequently, considerable research effort has been devoted to the extension of nonlinear modelling and filtering approaches towards
T.G. was supported by a scholarship from the Flemish Regional Ministry of Education (GOA 2000/11) and a research grant from the Fund for Scientific Research (G.0248.03). D.P.M. was supported by QLG3-CT-2000-30161 and a visiting fellowship from the K.U.Leuven (F/01/079) while at the Laboratorium voor Neuro- & Psychofysiologie, K.U.Leuven, Belgium. M.M.V.H. was supported by research grants received from the Fund for Scientific Research (G.0185.96N; G.0248.03), the National Lottery (Belgium) (9.0185.96), the Flemish Regional Ministry of Education (Belgium) (GOA 95/99-06; 2000/11), the Flemish Ministry for Science and Technology (VIS/98/012), and the European Commission, 5th framework programme (QLG3-CT-2000-30161 and IST2001-32114).
complex-valued signals [1, 2, 3], the applicability of which has been demonstrated, among others, in radar, sonar and phase-only DSP. The surrogate data method, as originally proposed by Theiler et al. [4], has evolved into a standard technique to test for the presence of nonlinearity in a real-valued time series. In the field of nonlinear signal processing, such tests are indispensable, since, in principle, signal nonlinearity should be assessed prior to the utilisation of nonlinear models, the parameters of which are more mathematically involved to determine than those of linear models. A reliable statistical test for assessing the complex-valued nature of a signal, however, is still lacking. To that cause, we extend the iterative Amplitude Adjusted Fourier Transform (iAAFT) approach [5] towards complex-valued signals (Section 2). A null hypothesis of a complex-valued linear system underlying the time series under study is utilised. Next, a novel methodology is proposed for characterising (Section 3), and statistically testing (Section 4), the complex-valued nature of a time series. Simulations which support the analysis are performed on both benchmark and real-world complex-valued data.
2
Surrogate Data
The surrogate data method computes test statistics on the original time series and a number of so-called 'surrogates', which are realisations of a certain null hypothesis, H0. These are further used for bootstrapping the distribution of the test statistic under the assumption of H0. In this section, a surrogate data generation procedure known as the (real-valued) iterative Amplitude Adjusted Fourier Transform (iAAFT) method [5] is briefly introduced, after which an extension of this method towards complex-valued signals is proposed.
2.1
Real-Valued iAAFT Method
The iAAFT method generates a surrogate for a real-valued (univariate) time series under the null hypothesis that the original time series is generated by a Gaussian linear process, followed by a, possibly nonlinear, static (memoryless) observation function, h(·). The surrogates have their signal distributions identical to that of the original signal, and amplitude spectra that are approximately identical, or vice versa. Let {|S_k|} be the Fourier amplitude spectrum of the original time series, s, and {c_k} the (signal value) sorted version of the original time series. Note that k denotes the frequency index for the amplitude spectrum, whereas for a time series, it denotes the time index. At every iteration j of the algorithm, two time series are calculated, namely r^(j), which has a signal distribution identical to that of the original, and s^(j), which has an amplitude spectrum identical to the original. The iAAFT iteration starts with r^(0) a random permutation of the time samples:
Repeat:
1. compute the phase spectrum of r^(j−1) → {φ_k}
2. compute s^(j) as the inverse transform of {|S_k| exp(iφ_k)}
Fig. 1. Realisation of the Ikeda Map (A), iAAFT (B) and CiAAFT (C) surrogates
3. compute r^(j) by rank-ordering s^(j) to match {c_k}, i.e., sort {s_k^(j)} in ascending order and set r_k^(j) = c_rank(s_k^(j))
Until error convergence
The modelling error can be quantified as the mean-square-error (MSE) between {|S_k|} and the amplitude spectrum of r^(j). The algorithm was extended towards the multivariate case in [5], yielding surrogates that retain not only the amplitude spectra of the variates separately, but also the cross-correlation spectrum. This was done by modifying the phase adjustment step (step 1): the cross-correlation between the variates can be retained if the relative phases between the frequency components remain intact. For details we refer to [5]. Figure 1B shows a real-valued bivariate iAAFT realisation of the Ikeda Map (shown in Fig. 1A).
2.2
Complex-Valued iAAFT Method
A straightforward extension of the univariate iAAFT method towards complex-valued signals would be obtained if the desired amplitude spectrum is replaced by the amplitude spectrum of the original complex-valued signal. In the next step, the desired signal distribution needs to be imposed on the surrogate in the time domain (step 3 in the iAAFT procedure). This can be achieved by applying the rank-ordering procedure to the real and imaginary parts of the signal separately. However, in practice, for complex-valued signals it is more important to impose equal empirical distributions on the moduli of the complex-valued samples, rather than on the real and imaginary parts separately. Therefore, we subsequently perform a rank-ordering procedure on the moduli, so as to match the moduli of the original time series. The underlying null hypothesis is that the time series is generated by a linear complex-valued process, driven by Gaussian white noise, followed by a (possibly nonlinear) static observation function, h(·), which operates on the moduli of the complex-valued time samples. We propose the following complex-valued iAAFT (CiAAFT) procedure, using the same conventions as in the iAAFT case, namely {|S_k|} is the Fourier amplitude spectrum of the original time series, {c_k} is the modulus-sorted version of the time series, and r^(j) and s^(j) are time series at iteration j with a modulus distribution, respectively an amplitude spectrum, identical to the original time series:
Fig. 2. A) Convergence of the CiAAFT algorithm; B) DVV plots for the Ikeda Map (thick solid), for an iAAFT (thin dashed) and a CiAAFT surrogate (thin solid). C) Number of time series that were judged complex-valued by the proposed method, for every rangebin in the IPIX radar data set
Repeat:
1. compute the phase spectrum of r^(j−1) → {φ_k}
2. compute s^(j) as the inverse transform of {|S_k| exp(iφ_k)}
3. rank-order the real and imaginary parts of r^(j) to match the real and imaginary parts of {c_k}
4. rank-order the moduli of r^(j) to match the modulus distribution of {c_k}
Until error convergence
The iteration is started with r^(0) a random permutation of the complex-valued time samples. Convergence can be monitored as the MSE computed between {|S_k|} and the amplitude spectrum of r^(j). Simulations suggest that the iteration can be terminated when the MSE decrement is smaller than 10^{-5}, which typically occurs after fewer than 100 iterations. Figure 1C shows a CiAAFT realisation of the Ikeda Map, for which the error curve is shown in Fig. 2A.
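A condensed implementation of the CiAAFT loop above is sketched next; the convergence tolerance follows the 10^{-5} decrement suggested in the text, while the use of argsort for the rank-ordering steps and the random seed are implementation choices of this example.

```python
import numpy as np

def ciaaft(z, tol=1e-5, max_iter=200, seed=0):
    """Complex iAAFT (CiAAFT) surrogate of the complex-valued series z (sketch)."""
    rng = np.random.default_rng(seed)
    z = np.asarray(z, dtype=complex)
    S_mag = np.abs(np.fft.fft(z))                  # target amplitude spectrum {|S_k|}
    re_s, im_s = np.sort(z.real), np.sort(z.imag)  # sorted real/imaginary parts of {c_k}
    mod_s = np.sort(np.abs(z))                     # sorted moduli of {c_k}

    r = rng.permutation(z)                         # start: random permutation of samples
    prev_err = np.inf
    for _ in range(max_iter):
        phases = np.angle(np.fft.fft(r))                     # step 1
        s = np.fft.ifft(S_mag * np.exp(1j * phases))         # step 2: impose |S_k|
        # step 3: rank-order real and imaginary parts separately
        re = np.empty(len(z)); re[np.argsort(s.real)] = re_s
        im = np.empty(len(z)); im[np.argsort(s.imag)] = im_s
        r = re + 1j * im
        # step 4: rank-order the moduli to match the original modulus distribution
        mod = np.empty(len(z)); mod[np.argsort(np.abs(r))] = mod_s
        r = r * mod / np.maximum(np.abs(r), 1e-12)
        err = np.mean((S_mag - np.abs(np.fft.fft(r))) ** 2)  # MSE on amplitude spectra
        if prev_err - err < tol:
            break
        prev_err = err
    return r
```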
3
Delay Vector Variance Method
We have used a complex-valued variant of the Delay Vector Variance (DVV) method [6] for characterising the time series based on its local predictability in phase space over different scales. For a given embedding dimension m and the resulting time delay embedding representation (i.e., a set of delay vectors (DVs), x(k) = [x_{k−m}, ..., x_{k−1}]^T), a measure of unpredictability, σ*²(r_d), is computed for a standardised range of degrees of locality, r_d:
– The mean, µ_d, and standard deviation, σ_d, are computed over all pairwise Euclidean distances between DVs, ‖x(i) − x(j)‖² = Σ_{n=1}^{m} |x_{i−n} − x_{j−n}|² (i ≠ j).
– The sets Ω_k(r_d) are generated such that Ω_k(r_d) = {x(i) | ‖x(k) − x(i)‖ ≤ r_d}. The range r_d is taken from the interval [max{0, µ_d − n_d σ_d}; µ_d + n_d σ_d], e.g., uniformly spaced, where n_d is a parameter controlling the span over which to perform the DVV analysis.
– For every set Ω_k(r_d), the variance of the corresponding targets, σ_k²(r_d), is computed as the sum of the variances of the real and imaginary parts. The average over all sets Ω_k(r_d), normalised by the variance of the time series, σ_x², yields the 'target variance', σ*²(r_d):

\sigma^{*2}(r_d) = \frac{ \frac{1}{N}\sum_{k=1}^{N} \sigma_k^2(r_d) }{ \sigma_x^2 }.    (1)

Note that the computation of the Euclidean distance between complex-valued DVs is equivalent to considering real and imaginary parts as separate dimensions. Since for bivariate time series a delay vector is generated by concatenating time delay embedded versions of the two dimensions, the complex-valued and bivariate versions of the DVV method are equivalent, and can be conveniently compared when the variance of a bivariate variable is computed as the sum of the variances of each variate. A DVV plot, D, is obtained by plotting the target variance, σ*²(r_d), as a function of the standardised distance, (r_d − µ_d)/σ_d. The DVV plots for a 1000-sample realisation of the Ikeda Map, D, and the two types of surrogates, D_b and D_c, generated using the iAAFT and CiAAFT methods, respectively, are shown in Fig. 2B, using m = 3 and n_d = 3.
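A direct, unoptimised implementation of the DVV target variance for a complex-valued series follows; the default embedding dimension and span mirror the m = 3 and n_d = 3 used above, whereas the number of distance bins and the minimum-set-size guard are arbitrary choices of this sketch.

```python
import numpy as np

def dvv(z, m=3, n_d=3, n_bins=25):
    """Delay Vector Variance plot: standardised distances and sigma*^2(r_d)."""
    z = np.asarray(z)
    X = np.array([z[k - m:k] for k in range(m, len(z))])   # delay vectors x(k)
    t = z[m:]                                              # corresponding targets

    # pairwise Euclidean distances (real and imaginary parts as separate dimensions)
    D = np.sqrt(np.sum(np.abs(X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
    iu = np.triu_indices(len(X), k=1)
    mu_d, sd_d = D[iu].mean(), D[iu].std()

    var_t = t.real.var() + t.imag.var()                    # variance of the time series
    rds = np.linspace(max(0.0, mu_d - n_d * sd_d), mu_d + n_d * sd_d, n_bins)
    sigma2 = []
    for rd in rds:
        local_vars = []
        for k in range(len(X)):
            idx = D[k] <= rd                               # the set Omega_k(r_d)
            if idx.sum() >= 2:                             # need >= 2 targets for a variance
                local_vars.append(t[idx].real.var() + t[idx].imag.var())
        sigma2.append(np.mean(local_vars) / var_t if local_vars else np.nan)
    return (rds - mu_d) / sd_d, np.array(sigma2)           # DVV plot, cf. Eq. (1)
```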
4
Statistical Testing
In the framework of surrogate data testing, as introduced by Theiler et al. [4], a time series is characterised by a certain test statistic, which is compared to an empirical distribution of test statistics, computed for a number of surrogates generated under the assumption of a null hypothesis. A significant difference between the two then indicates that the null hypothesis can be rejected. In the CiAAFT case, a rejection of the null hypothesis that the signal is complex-valued and linear could be due to a deviation from either of the two properties. Therefore, we propose a different approach: rather than comparing the original time series to the surrogates, we compare surrogates generated under different null hypotheses, namely that of a linear and bivariate time series, H0b, and that of a linear and complex-valued time series, H0c. The respective surrogates are generated using the bivariate iAAFT [5] and the proposed CiAAFT method. All time series are characterised using the DVV method, and a significant difference between the two sets of surrogates is an indication that the original time series is complex-valued. The proposed methodology is the following.
1. Generate N_{s,ref} CiAAFT surrogates and the average DVV plot → D_0;
2. Generate N_s iAAFT surrogates and corresponding DVV plots → {D_b};
3. Generate N_s CiAAFT surrogates and corresponding DVV plots → {D_c};
4. Compare (D_0 − {D_b}) and (D_0 − {D_c}).
To perform the final step in a statistical manner, the (cumulative) empirical distributions of root-mean-square distances between {D_b} and D_0, and between {D_c} and D_0, are compared using a Kolmogorov-Smirnov (K-S) test. This way,
the different types of linearisations (bivariate, {Db }, and complex-valued, {Dc }) are compared to the ‘reference’ linearisation given a complex-valued nature of the time series, D0 . The distributions of the test statistics (the root-mean-square distances) under the different null hypotheses are, in fact, bootstrapped using the proposed methodology. If the two distributions of test statistics are significantly different at a certain level α, the original time series is complex-valued. Note that assumptions regarding the possible nonlinearity of the signal are avoided.
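The final comparison can be carried out with a two-sample Kolmogorov-Smirnov test on the two sets of RMS distances to the reference DVV plot; the sketch below uses scipy's ks_2samp, and the rmse helper and α level are illustrative.

```python
import numpy as np
from scipy.stats import ks_2samp

def is_complex_valued(D0, Db_list, Dc_list, alpha=0.05):
    """Compare bivariate (iAAFT) and complex-valued (CiAAFT) surrogates against D0."""
    rmse = lambda D: np.sqrt(np.nanmean((D - D0) ** 2))
    d_b = [rmse(D) for D in Db_list]     # distances of iAAFT surrogate DVV plots to D0
    d_c = [rmse(D) for D in Dc_list]     # distances of CiAAFT surrogate DVV plots to D0
    _, p_value = ks_2samp(d_b, d_c)      # two-sample K-S test on the two distributions
    return p_value < alpha               # True -> the series is judged complex-valued
```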
5
Simulations
5.1
Synthetic Time Series
The proposed algorithm was tested on five sets of synthetically generated benchmark signals containing N = 1000 time samples. The two linear sets contained time series 1) consisting of time samples that are drawn from a bivariate Gaussian distribution, N([0, 0], [1, 2]), rotated over an angle of π/3 (linear bivariate, "LB"), and 2) generated by considering a Gaussian 'amplitude spectrum', adding random phase and computing the inverse FFT (linear complex, "LC"). The two nonlinear sets were generated by a nonlinear system described in [7]:

y_k = \gamma\, \frac{ x_{k-1}\, x_{k-2}\, (x_{k-1} + 2.5) }{ 1 + x_{k-1}^2 + x_{k-2}^2 } + x_k,    (2)

where γ is a parameter controlling the prevalence of the nonlinear over the linear part of the signal, which was set to γ = 0.6, unless stated otherwise. In the first nonlinear set (nonlinear bivariate, "NLB"), both dimensions of "LB" were separately passed through the nonlinear system, and in the second set (nonlinear complex, "NLC"), x represents the complex-valued time series "LC". The final set contained realisations of the Ikeda Map (an example is shown in Fig. 1A). For each of the five sets, 100 realisations of time series were generated, to each of which the proposed test was applied. For the bivariate sets (LB and NLB), the number of (erroneous) rejections was of the order expected from the α = 0.05 level (5/100 and 1/100). The proposed test did not perform well on the LC set: only 16/100 of the time series were correctly judged to be complex-valued. However, this is not surprising, since any linear complex-valued system has a bivariate equivalent, though not vice versa. Consequently, the iAAFT method can represent these time series equally well as the CiAAFT method. For the NLC set, the proposed test correctly judged the time series to be complex-valued in 62/100 cases (the performance increased to 79/100 with γ = 1.0), and in all of the Ikeda Map realisations.
5.2
Radar Data
We further considered real-world data taken from in-phase and quadrature components from the Dartmouth 1993 IPIX radar data, which is publicly available (http://soma.crl.mcmaster.ca/ipix). We have arbitrarily selected data set #17,
which was recorded during a higher sea state, with the waves moving away from the radar. It consisted of 14 rangebins, each containing a time series of 131,072 complex-valued samples. In the ninth rangebin (and, to a lesser degree, in rangebins 8, 10 and 11), a target was present. The remaining bins only contained so-called 'sea clutter', i.e., radar backscatter from the ocean surface. From every bin, we considered time segments of N = 1000 samples (one second), and generated 100 non-overlapping time segments, on each of which the proposed test was applied. The number of time series which were judged to be complex-valued is shown in Fig. 2C for every bin. On average, 51/100 time series in every bin were found to be complex-valued, and there were stronger indications of a complex-valued nature in those bins in which a target was present (bins 8–11, but also in bin 12). The increased complex-valued nature in the presence of a target was consistent over different data sets from the same database (results not shown).
6
Conclusions
We have introduced a novel methodology for statistically testing whether or not the processing of a bivariate time series could benefit from a complex-valued representation. We have proposed a novel procedure, the Complex iterative Amplitude Adjusted Fourier Transform (CiAAFT) method, for generating surrogate time series under the null hypothesis of a linear and complex-valued system underlying the time series. Consequently, surrogates generated using the traditional iAAFT method for bivariate time series can be compared to those generated using the CiAAFT method. Both types of surrogates have been characterised using a complex-valued extension of the Delay Vector Variance (DVV) method, allowing for a statistical comparison between the two types of surrogates. If the difference is significant, the time series is judged complex-valued, and it is judged bivariate otherwise. The methodology was validated on synthetically generated time series, and applied to real-world data obtained from the IPIX radar. The latter data has been frequently addressed in the open literature (for an overview, see [8]), and it has been shown that short time segments can be modelled adequately by a complex-valued autoregressive (AR) model. It was demonstrated that 50 % of the time series from the radar data showed a complex-valued nature, and, furthermore, that this proportion increased in the presence of a target.
References
[1] Leung, H., Haykin, S.: The complex backpropagation algorithm. IEEE Trans. Signal Processing 39 (1991) 2101–2104
[2] Nitta, T.: An analysis of the fundamental structure of complex-valued neurons. Neural Processing Letters 12 (2000) 239–246
[3] Kim, T., Adali, T.: Universal approximation of fully complex feed-forward neural networks. In: Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing (ICASSP). (2002)
[4] Theiler, J., Eubank, S., Longtin, A., Galdrikian, B., Farmer, J.: Testing for nonlinearity in time series: The method of surrogate data. Physica D 58 (1992) 77–94
[5] Schreiber, T., Schmitz, A.: Surrogate time series. Physica D 142 (2000) 346–382
[6] Gautama, T., Mandic, D., Van Hulle, M.: Indications of nonlinear structures in brain electrical activity. Phys. Rev. E 67 (2003) 046204
[7] Narendra, K., Parthasarathy, K.: Identification and control of dynamical systems using neural networks. IEEE Trans. Neural Networks 1 (1990) 4–27
[8] Haykin, S., Bakker, R., Currie, B.: Uncovering nonlinear dynamics: The case study of sea clutter. In: Proc. of IEEE, Special Issue. (2002) submitted
Domain Ontology Analysis in Agent-Oriented Requirements Engineering Paolo Donzelli1 and Paolo Bresciani2 1
Department of Innovation and Technology – Italian Cabinet Office Via Barberini 38, I-00187 Roma, Italy [email protected] 2 ITC-irst Via Sommarive 18, I-38050 Trento-Povo, Italy [email protected]
Abstract. Goal and agent orientation in Requirements Engineering has been recognized as a very promising approach. In fact, by adopting notions such as Agent, Goal, and Intentional Dependency, it is possible to smoothly refine high-level requirements into detailed descriptions of the system-to-be. This paper introduces a requirements engineering framework based on the notions of Agent, Goal, and Intentional Dependency. The framework is a powerful tool for describing the organizational ontology and capturing high-level organizational needs, in order to transform them into system requirements.
1 Introduction
In Requirements Engineering (RE), Goal and Agent orientation has been recognized as a very promising approach [1, 2, 3, 4]. In fact, by adopting notions such as Agent, Goal, and Intentional Dependency, it is possible to refine high-level requirements, originating from the organizational setting, into detailed descriptions of the system to be implemented, in a smooth and controlled manner, especially, but not only [5], if the target programming paradigm is an agent-oriented one [6, 7, 8]. This paper briefly introduces a requirements engineering framework (called REF), by means of its sample application to a case study in e-Government. REF is designed to deal with, and reason about, socio-technical systems. It is a powerful tool that allows the analyst to model high-level organizational needs and to transform them into system requirements, while redesigning the organizational structure to better exploit the new system. One of the key points of REF is its focus on an ontological dimension often neglected in application domain modeling: the intentional dependency between agents. In REF, Agents represent any kind of active entity (e.g., teams, humans, and machines, including the target system [9, 10]) having specific goals. Goals [9, 10, 1] model objectives for the achievement of which agents may depend on each other. Thus, Intentional Dependencies among agents provide the connecting lattice in which the agents
are placed and related to each other for the collaborative achievement of delegated goals. Additional notational elements, the Constraints, allow us to further enrich the ontological description by operationalizing goals at an objective level; thus, goal semantics is made unambiguous, and potential sources of conflict are highlighted. REF comprises not only a diagrammatic language for ontology representation, but also a clear methodology for knowledge acquisition and, thus, requirements elicitation. The ontology building and analysis process is supported by a clear cycle; in addition, REF provides some "temporary" notational elements (the S-connection and the H-connection), to be used by the analyst during the ontology development process. The paper is organized as follows. Section 2 introduces a case study in e-Government, used to illustrate REF throughout Section 3. Section 4 gives some conclusions.
2 The Case Study Initial Ontology
The case study reports on the Requirements Engineering phase of an ongoing project aiming at introducing an Electronic Record Management System (ERMS) into the administrative processes of a complex governmental organization, to transform a huge repository of documents into a readily available source of knowledge to be shared among all the agents acting within the organization. ERMS is based on the adoption of complex ICT solutions that allow for efficient storage and retrieval of document-based unstructured information, combining classical filing strategies with information retrieval techniques, and encompassing mechanisms for facilitating document routing and notification, while also supporting interoperability. According to the REF process, a first organization diagram describing the original organizational setting before the introduction of the ERMS was produced (see Figure 1). In REF diagrams, circles represent agents, and dashed ovals are used to bound the internal structure of complex agents, i.e., agents containing other agents. As in i* [11], by which the REF notation is inspired, and in other derived methodologies such as Tropos [6], a distinction is made in REF between soft- and hard-goals [12]. Soft-goals qualitatively specify objectives that are not sharp-cut, the precise definition of which requires further details, while hard-goals clearly define a state an agent desires to reach. Again in Figure 1, rounded boxes represent hard-goals and clouds represent soft-goals. Goals are always connected with arrows to (one or more) agents: an incoming arrow means that the goal is desired, wanted or needed by the connected agent. An out-going arrow means that there is a dependency on the connected agent for the fulfillment of the goal. Thus, in the most general case, in which an agent A is connected to a goal G that is connected to another agent B, we have that A (who wants G) depends on B for G to be fulfilled. In a very similar way,
[Figure 1 diagram omitted: REF organization diagram with a legend for agent, soft-goal, hard-goal, resource, task and dependency-link symbols, showing the agents Head of Unit, Secretary, Employee, Archivist, Personal Computer and Physical Archive within the Organisational Unit and their document, task and soft-goal dependencies]
Fig. 1. The organizational context before the ERMS
resources (rectangles) and tasks¹ (hexagons) may be represented. In Figure 1, the complex agent Organizational Unit corresponds to the organizational fragment into which it is planned to introduce the new ERMS, whereas the Head of Unit, the Secretary, the Employee, the Archivist, the Personal Computer and the Physical Archive are simple agents, acting within the Organizational Unit. Thus, for example, the Secretary receives from the enclosing context (not depicted here) the input documents, which she then passes to the Head of Unit. So, the Head of Unit depends on the Secretary for receiving the document and for the qualitative constraint (soft-goal): most important first. A complete description of the diagram can be found in [12].
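A minimal sketch of how these REF concepts (agents, dependums and dependency links) could be captured as a small data model is given below; the class and attribute names are ours and are only meant to mirror the notation just described.

from dataclasses import dataclass, field
from typing import List, Literal

@dataclass
class Agent:
    name: str
    sub_agents: List["Agent"] = field(default_factory=list)  # complex agents contain other agents

@dataclass
class Dependum:
    name: str
    kind: Literal["hard-goal", "soft-goal", "task", "resource"]

@dataclass
class Dependency:
    depender: Agent    # the agent that wants the dependum
    dependee: Agent    # the agent depended upon for its fulfilment
    dependum: Dependum

# Fragment of Figure 1: the Head of Unit depends on the Secretary for the
# document resource and for the 'most important first' soft-goal.
unit = Agent("Organisational Unit")
head, secretary = Agent("Head of Unit"), Agent("Secretary")
unit.sub_agents += [head, secretary]
dependencies = [
    Dependency(head, secretary, Dependum("document", "resource")),
    Dependency(head, secretary, Dependum("most important first", "soft-goal")),
]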
3 Evolving the Ontology to Provide Requirements
The first step towards identifying the requirements for the ERMS, together with the new organizational setting suitable to exploit the ERMS capabilities, is to produce a new organization diagram capturing the motivations underlying the project. Figure 2 again represents the complex agent Organizational Unit and the agent Head of Unit. In addition, new elements are introduced: the soft-goals exploit ICT to increase performance while avoiding risks and cost/effective
¹ A task is a well specified prescriptive activity.
[Figure 2 diagram omitted: initial organization model with the soft-goals exploit ICT to increase performance while avoiding risks and cost/effective and quick solution assigned to the Head of Unit within the Organisational Unit]
Fig. 2. Introducing the ERMS: the initial organization model
and quick solution, which represent the required organizational improvements that the Head of Unit is responsible for achieving. REF models are introduced and refined along an incremental and iterative analysis process, including three distinct phases: Organization Modeling, Hard-Goal Modeling, and Soft-Goal Modeling. This helps to reduce the complexity of the modeling effort. During Organization Modeling, the organizational context is analyzed and the agents and their hard- and soft-goals identified. Hard-Goal Modeling seeks to determine how the agent can achieve a hard-goal placed on it, by decomposing it into more elementary subordinate hard-goals and tasks. Soft-Goal Modeling aims at producing the operational definitions of the soft-goals that emerged during the organizational modeling, sufficient to capture and make explicit the semantics that are usually assigned implicitly by the involved agents [13, 14, 3] and to highlight the system quality issues from the start. A soft-goal is refined in terms of subordinate soft-goals, hard-goals, tasks and constraints, where constraints (represented by rounded boxes with a line) specify the quality attributes corresponding to a soft-goal. According to the REF process, the analysts, through continuous interaction with the stakeholders and supported by the different models, will deal first with the high-level organizational structure, and then descend into the details of the domain. In particular, after the construction of an initial organization model (as in Figure 2), the REF process evolves cyclically through the phases of hard-goal, soft-goal and organization modeling. For example, from the organization model in Figure 2 the analysis may proceed by modeling the emerging soft-goals. An example of the result of a soft-goal modeling activity is presented in Figure 3. The figure describes how the soft-goal exploit ICT to increase performance while avoiding risks is iteratively decomposed top-down to finally produce a set of tasks, hard-goals, and constraints that precisely defines the meaning of the soft-goal, i.e., the way to achieve it. For example, the soft-goal provide employee's performance is better defined in terms
[Figure 3 diagram omitted: top-down decomposition of the soft-goal exploit ICT to increase performance while avoiding risks into subordinate soft-goals, hard-goals, tasks and constraints, including provide employee's number of documents and twice a week update]
Fig. 3. The exploit ICT to increase performance while avoiding risks Soft-Goal Model
of a task (provide employee's number of documents) and a quantitative constraint on the frequency (twice a week). Again, the arrowhead lines indicate dependency links. A soft-goal depends on a subordinate soft-goal, hard-goal, task or constraint when it requires that goal, task or constraint to be achieved, performed, or implemented in order to be achieved itself. These dependency links may be seen as a top-down decomposition of the soft-goal, and may be conjunctive (indicated by the label "A" on the dependency link) or disjunctive (indicated by the label "O"). Using the same kind of link to describe both agent dependencies and goal decomposition has proved to be, for the stakeholders, an acceptable and well understood simplification [12]. Soft-goal modeling is usually a long and fatiguing process, during which stakeholders and analysts interact several times. In order to provide the analyst with a method to annotate promising evolutions of the ontology design (where the
[Figure 4 diagram omitted: S-connection between the Head of Unit's soft-goal increase personal performance and the soft-goal be more productive, within the decomposition of exploit ICT to increase performance while avoiding risks]
Fig. 4. A possible sharing between goals of different agents
potential value of one branch in the process may derive from foreseen commonalities among subtrees not yet expanded), REF provides a specific notation, called the S-connection ("S" for sharing), to link two or more goals between which a possible commonality is presumed. A simple example is presented in Figure 4, where the S-connection is used to note that a sharing probably exists between the soft-goal increase personal performance, which the Head of Unit wants to achieve, and the soft-goal be more productive, which the Head of Unit transfers to the Employee (and which is hence drawn in thick lines; see also Figure 5). Along a similar rationale, although from an opposite perspective, REF also provides the so-called H-connection link ("H" for "hurting"). This is a powerful tool to detect possible conflicts and try to reconcile different stakeholders' points of view, allowing the analysis to evolve only along the most promising alternatives. An example of an H-connection is shown in Figure 5, where it has been used to highlight the possibility of a conflict between the Employee's soft-goal protect my privacy and the task provide employee's number of documents, required by the Head of Unit. Both the H-connection and the S-connection provide the analyst with a marking mechanism that is very useful for driving and controlling the analysis process.
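A minimal sketch of how such a decomposition, together with its S- and H-connection annotations, could be recorded and checked is given below; the data model, the function and all names are ours and are only meant to mirror the notation described above.

from dataclasses import dataclass, field
from typing import List, Literal, Tuple

@dataclass
class Node:
    name: str
    kind: Literal["soft-goal", "hard-goal", "task", "constraint"]
    link: Literal["A", "O"] = "A"      # label of the dependency link to the parent
    children: List["Node"] = field(default_factory=list)

def operationalized(node: Node) -> bool:
    # A soft-goal counts as operationally defined once every conjunctive child is
    # operationalized and, if disjunctive children exist, at least one of them is.
    if node.kind != "soft-goal":
        return True
    if not node.children:
        return False
    conj = [c for c in node.children if c.link == "A"]
    disj = [c for c in node.children if c.link == "O"]
    return all(operationalized(c) for c in conj) and (not disj or any(operationalized(c) for c in disj))

# Fragment of Figure 3: the soft-goal is fully defined by a task and a constraint.
provide_perf = Node("provide employee's performance", "soft-goal", children=[
    Node("provide employee's number of documents", "task"),
    Node("twice a week update", "constraint"),
])
assert operationalized(provide_perf)

# "Temporary" annotations recorded while the ontology is still evolving.
s_connections: List[Tuple[str, str]] = [("increase personal performance", "be more productive")]
h_connections: List[Tuple[str, str]] = [("protect my privacy", "provide employee's number of documents")]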
4 Conclusions
The paper introduced the requirements engineering framework REF, explicitly designed to support the analysts in reasoning about socio-technical systems and in transforming high-level organizational needs into system requirements. One of the key points of REF is its capability of representing the domain ontology by adopting concepts such as Agent, Goal, and Intentional Dependency, and by introducing an essential graphical notation [15]. Thus, REF proves to be
[Figure 5 diagram omitted: evolving organization model with the Head of Unit, Employee, Archivist, ERMS and Organisational Unit, including an H-connection between the Employee's soft-goal protect my privacy and the task provide employee's number of documents]
Fig. 5. The evolving organization model
a very effective and usable technique, able to tackle complex real-world situations while remaining simple enough to allow concrete and effective stakeholder involvement. In addition, REF supports the analysts in dealing with complex system and organizational design issues, such as shared and clashing stakeholder needs, by introducing specific analysis-oriented notations that allow an early marking and detection of such situations.
References
[1] Anne Dardenne, Axel van Lamsweerde, and Stephen Fickas. Goal-directed requirements acquisition. Science of Computer Programming, 20(1-2):3–50, 1993.
[2] A. van Lamsweerde. Requirements Engineering in the Year 00: A Research Perspective. In Proceedings of the 22nd International Conference on Software Engineering, June 2000.
[3] L. K. Chung, B. A. Nixon, E. Yu, and J. Mylopoulos. Non-Functional Requirements in Software Engineering. Kluwer Publishing, 2000.
[4] John Mylopoulos, Lawrence Chung, Stephen Liao, Huaiqing Wang, and Eric Yu. Exploring alternatives during requirements analysis. IEEE Software, 18(1):92–96, 2001.
[5] Anna Perini, Paolo Bresciani, Paolo Giorgini, Fausto Giunchiglia, and John Mylopoulos. Towards an Agent Oriented approach to Software Engineering. In Proceedings of WOA 2001 – Dagli oggetti agli agenti: tendenze evolutive dei sistemi software, Modena, September 2001. Pitagora Ed., Bologna.
[6] J. Mylopoulos and J. Castro. Tropos: A framework for requirements-driven software development. In J. Brinkkemper and A. Solvberg, editors, Information System Engineering: State of the Art and Research Themes, Lecture Notes in Computer Science. Springer-Verlag, 2000.
[7] P. Bresciani, A. Perini, F. Giunchiglia, P. Giorgini, and J. Mylopoulos. A Knowledge Level Software Engineering Methodology for Agent Oriented Programming. In Proceedings of the Fifth International Conference on Autonomous Agents, Montreal, Canada, May 2001.
[8] J. Castro, M. Kolp, and J. Mylopoulos. Developing agent-oriented information systems for the enterprise. In Proceedings of the Third International Conference on Enterprise Information Systems, Stafford, UK, July 2000.
[9] P. Donzelli and M.R. Moulding. Developments in application domain modelling for the verification and validation of synthetic environments: A formal requirements engineering framework. In Proceedings of the Spring 99 Simulation Interoperability Workshop, LNCS, Orlando, FL, 2000. Springer-Verlag.
[10] Eric Yu. Why agent-oriented requirements engineering. In Proceedings of the 3rd Workshop on Requirements Engineering For Software Quality, Barcelona, Catalonia, June 1997.
[11] E. Yu. Modeling Strategic Relationships for Process Reengineering. PhD thesis, Department of Computer Science, University of Toronto, 1995.
[12] P. Donzelli and R. Setola. Handling the knowledge acquired during the requirements engineering process. In Proceedings of the Fourteenth International Conference on Knowledge Engineering and Software Engineering (SEKE), 2002.
[13] V. R. Basili, G. Caldiera, and H. D. Rombach. The Goal Question Metric Approach. Wiley & Sons Inc, 1994.
[14] G. Cantone and P. Donzelli. Goal-oriented software measurement models. In Proc. of the European Software Control and Metrics Conference, Herstmonceux Castle, UK, April 1999.
[15] P. Donzelli and P. Bresciani. Goal-oriented requirements engineering: a case study in e-government. In Proceedings of the 15th Conference On Advanced Information Systems Engineering (CAiSE'03), Velden, Austria, June 2003.
A Framework for ACL Message Translation for Information Agents Zhan Cui, Yang Li and John Shepherdson Intelligent Systems Laboratory, BTexact Technologies B62 MLB1/pp12, Adastral Park, Martlesham Heath, Ipswich, Suffolk, IP5 3RE, UK {zhan.cui,yang.li,john.shepherdson}@bt.com
Abstract. Legacy databases are a valuable asset for any large organisation. The rapid spread of the Internet has created opportunities where these databases can be made available on-line (via information agents) to third parties and thus generate further revenue. The key enabler for this scenario is semantic interoperability of heterogeneous databases. In this paper we present an approach to semantic interoperability, which uses shared ontologies to describe the heterogeneity of database schema for information agents, in order to automate message translation at run time. Unlike any other work in this area, we take advantage of ontological knowledge to improve the degree of automation for this mapping process.
1 Introduction
Legacy database systems are valuable assets for large organisations. The rapid spread of the Internet has offered opportunities for e-commerce, where heterogeneous databases are put on-line and the information held in them is exploited. To facilitate e-commerce, communities of information agents that can maintain and query databases, and communicate with each other to exchange information, are increasingly being deployed [1, 2]. Databases that were developed independently of one another tend to have different structures and use different sets of vocabulary. In order to achieve effective communication among information agents, each agent must understand precisely what the other is "talking" about. FIPA [3] has made great strides towards heterogeneous agent interoperability, by defining standards for interaction protocols, agent communication language (ACL), content languages and ontologies. Even so, any application that makes use of a community of disparate information agents representing a number of heterogeneous databases requires some form of "mapping" between the structure and vocabulary of each database and the content and ontology of the ACL messages exchanged by the agents. In an environment that uses interoperability standards such as FIPA's, the agents can be considered as go-betweens for the various databases, and the problem becomes one of matching the structure and vocabulary of one database to another.
There are two common strategies for creating a set of structure (or vocabulary) mappings. One is to map the structure/vocabulary of every combination of two databases directly; the other is to map all the databases indirectly, i.e., through an integrated schema or a common ontology. Suppose there are n databases that need to be mapped. By adopting the first strategy, n*(n-1) sets of mappings are needed; by adopting the second strategy, only n sets of mappings are needed. The typical features of legacy databases (e.g. original developers have moved on, the schema details are only partially documented) make fully automated mapping difficult; as a result, manual mapping is the dominant technique. Moreover, as manual mapping is a labour-intensive task, the strategy of mapping multiple databases to a shared ontology is favoured, as it reduces the total number of mapping sets, hence reducing cost. Existing work on creating mappings between a database schema and a shared ontology is purely manual. It is our observation that a certain degree of automation can be achieved by exploiting ontological knowledge embedded in both the ontology base and the database. To this end, we have developed an infrastructure that allows each database to have a specialised version of a shared ontology so that the peculiarities of each database can be described at design time. This is very important in e-commerce because each vendor wants to differentiate its products from others. We have also built a mapping editor - a decision support tool that guides the user through the processes of creating a set of mappings, by analysing the design-time mismatch (or contents) descriptions and suggesting appropriate pairings between database and ontological items. Based on these mappings, we have developed a centralised mapping translation service for information agents to use to service the contents of the databases they manage. This removes the burden of developing algorithms to translate messages and to resolve mismatches among incoming messages. This also reduces the impact of ontological changes on information agents because most of the ontological and data mediations are made externally. The remaining part of the paper is organised as follows. In Section 2, we give an overview of the framework of our approach and discuss the ontology server and mapping editor respectively. In Section 3 we discuss related work. Finally we offer some conclusions.
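The difference between the two strategies is easy to quantify; the short sketch below is our own illustration and simply compares the number of mapping sets needed for a few values of n.

def direct_mappings(n: int) -> int:
    # one set of mappings for every ordered pair of databases
    return n * (n - 1)

def shared_ontology_mappings(n: int) -> int:
    # one set of mappings per database, onto the shared ontology
    return n

for n in (3, 10, 50):
    print(n, direct_mappings(n), shared_ontology_mappings(n))
# 3 databases: 6 vs 3 mapping sets; 10: 90 vs 10; 50: 2450 vs 50.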
2 Framework of the Proposed Approach
The prevailing use of a shared ontology for mediating interoperability among information agents or databases is to map local models (agents or databases) to a shared ontology [4], [5] and [6]. This is very similar to mapping local models to an integrated schema, although a shared ontology provides a richer set of constraints than an integrated schema. This use requires concepts from the shared ontology to be tightly associated (i.e., explicitly present in concept descriptions or definitions) with all the attributes concerned. As each database or information agent would like to be differentiable from the others, most of the concepts end up with tens of attributes. This makes it difficult to understand and reason with concepts. Furthermore, many of these attributes are just pragmatic associations to store data. For example, products
have accessories, but the accessory attribute does not normally contribute to a product definition; such attributes are not essential for the semantic definitions of concepts [7]. In the proposed framework, an ontology defines general types or concepts without the need to associate them tightly with too many attributes. Only distinguishing or contributing attributes should be included in concept descriptions or definitions. The concepts corresponding to all optional attributes should be present in a shared ontology, which gives each information agent or database the ability to specialise the shared ontology by constraining type definitions or by making certain attributes contributing ones. The rationale is that some attributes may not be contributing attributes in a general ontology, but they may become contributing attributes in a constrained domain. However, the decision about which attributes are contributing is subjective and depends on the use of an ontology. The separation of contributing and optional attributes is very important in database integration and interoperability. For example, two databases may have the same notion of customers, but one database has the customer address attribute and not the telephone number attribute, while the other has the telephone number attribute but not the address attribute. If the concept Customer in a shared ontology dictates both the address and telephone number attributes, it would not be possible to map the two databases to the shared ontology. By choosing to make address and telephone number optional attributes, the two databases are able to map their customers to the Customer concept in the shared ontology. Of course, the concepts Address and Telephone Number have to be in the shared ontology. This idea is well supported by most Description Logics (DLs) [8]. Attributes in DLs are defined with domains and ranges; they specify where they can be applied, but do not require any concept to use them. In most DLs, attributes often have an inheritance hierarchy, making fine type distinctions possible. Based on this idea, we have developed a framework for message translation and data integration, as shown in Fig. 1.
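Before turning to the framework of Fig. 1, the customer example can be made concrete with the following sketch. The check and all names are ours, under the assumption that only contributing attributes are mandatory for a mapping.

from dataclasses import dataclass, field
from typing import Set

@dataclass
class Concept:
    name: str
    contributing: Set[str]                       # required of any source mapped to this concept
    optional: Set[str] = field(default_factory=set)

def can_map(table_columns: Set[str], concept: Concept) -> bool:
    # a table can be mapped if it covers all contributing attributes
    return concept.contributing <= set(table_columns)

customer = Concept("Customer", contributing={"name"},
                   optional={"address", "telephone"})
db1_columns = {"name", "address"}      # has the address, not the telephone number
db2_columns = {"name", "telephone"}    # has the telephone number, not the address
assert can_map(db1_columns, customer) and can_map(db2_columns, customer)
# Had 'address' and 'telephone' been contributing attributes, one mapping would fail.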
[Figure 1 diagram omitted: agents Agent1–Agent4 with source ontologies Ont1–Ont3, connected to the centralised mapping server, ontology server and translation service]
Fig. 1. A framework for message translation and data integration
Source ontologies such as Ont1, Ont2 and Ont3 are specialised versions of a shared ontology from the ontology server. Each agent or database can specialise a shared ontology by adding additional contributing attributes, such as address and telephone number, by constraining attribute types, or by adding new axioms sanctioned by the ontology language. Two agents, such as Agent1 and Agent3, may share the same source ontology. The mapping server, ontology server and translation services are
centralised in this framework. They provide services for all agents. This makes it possible to share general mappings and conversion functions. As mentioned above, most of the mappings have to be defined manually, but the use of a mapping server makes it possible for multiple developers to contribute and share. This centralised approach also makes it possible to develop translation services for all information agents. We have developed a general translation service, which receives messages from issuing agents, and translates and routes them to target agents. Based on this framework we have developed the following approach for mapping services. Fig. 2 shows the components and message flows.
[Figure 2 diagram omitted: ontology server and ontology base, mapping server and mapping base, two databases and agents Agent1 and Agent2 exchanging Message1/Message2, with numbered control and data flows]
Fig. 2. The proposed mapping and translation services
The first step is to construct a shared ontology through the ontology server, using a GUI-based ontology editor. The resulting ontology is stored in the ontology base. Next, each information agent developer either uses an ontology from the ontology server or specialises an existing ontology, using the same ontology editor. The specialised version is also stored in the ontology server. After choosing an ontology to use for its database, the developer creates the mappings between a local database and the ontology through the mapping server using a GUI-based mapping editor, and the set of mappings is stored in the mapping base. At run-time, when information agent Agent1 needs to send Message1 to information agent Agent2, it follows these steps: • It queries the mapping server, where local terms used by Agent1 are translated into a shared ontology, through which they are further translated into the local
terms used by Agent2. The translation is carried out by applying the appropriate set of mapping rules, and a new message, Message2, is generated from Message1, in which the old terms are replaced by the new ones.
• It sends Message2 to Agent2.
There are two key parts to the system: the ontology server and the mapping server. Their roles and services are described in the following sections; a minimal sketch of the run-time translation step is given below.
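The sketch below only illustrates the two-stage translation just described; the mapping-server interface (to_shared, to_local) and the message representation are hypothetical, not part of the actual system.

def translate_message(message_content: dict, sender: str, receiver: str, mapping_server) -> dict:
    # Lift each of the sender's local terms to the shared ontology, then lower it
    # to the receiver's local vocabulary, keeping the associated values unchanged.
    translated = {}
    for local_term, value in message_content.items():
        shared_term = mapping_server.to_shared(sender, local_term)
        translated[mapping_server.to_local(receiver, shared_term)] = value
    return translated

# message2_content = translate_message(message1_content, "Agent1", "Agent2", mapping_server)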
2.1 The Ontology Server
The functions of the ontology server are (1) to maintain one or more shared or source ontologies, and (2) to answer user queries regarding ontological relationships. The shared ontologies currently use the OIL format [9]. The relationships that can be queried include specialisation, generalisation and disjointness.
2.2 The Mapping Server
The function of the mapping server is to establish a set of mappings between items in an ontology and a given database schema. This is done by selectively applying a set of pre-defined mapping rules. Examples include host-type conversions, measurement system conversions, currency-type conversions, message syntax or format conversions, and so on. These are considered low-level mappings. By using a centralised mapping server, these mappings can be jointly developed and shared. The mapping process between an ontology and a given database schema is semi-automatic. The user makes decisions at the concept level about which concepts should be mapped. Then the mapping server checks the concept definitions and makes suggestions about attribute mappings and type conversions. The actual mapping process is as follows:
• The user makes a link between a concept in an ontology and a table name or schema name in the database. This is done through 'click and drag', and is called concept-level mapping;
• The mapping server then checks for attribute correspondences through context-based mappings. Here each concept is treated as a context. This is very effective in, e.g., e-commerce applications, because data has to be normalised and rationalised. For each context, there is a set of pre-defined mapping rules. Some examples of the mapping rules are as follows:
Concept A: st → street
Concept B: st → Saint
Concept A: ave → avenue
Concept C: id → code
The mapping of attributes is carried out through the following sequence:
• If one of the mapping rules mentioned above applies, then the two attributes are matched;
• If the two attributes are individuals of the same concept, then the two are matched;
• If a generalisation or specialisation relationship exists between the two attributes, then the two are matched;
• If a disjoint relationship exists between the two attributes, then the two are mismatched;
• All other situations are considered neutral.
If the two attributes are matched, a mapping between them will be established automatically. If a mismatch is found, an alert will be raised when the user tries to establish a mapping between them. The system ignores all neutral situations. A sketch of this matching sequence is given below.
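The sketch assumes a hypothetical ontology interface (concept_of, subsumes, disjoint) and a rule table keyed by (context, attribute); it is only meant to mirror the sequence above, not the system's actual API.

def match_attributes(attr_a: str, attr_b: str, context: str, rules: dict, ontology) -> str:
    # Returns 'match', 'mismatch' or 'neutral', following the sequence above.
    # 1. a context-specific mapping rule applies, e.g. rules[("Concept A", "st")] == "street"
    if rules.get((context, attr_a)) == attr_b or rules.get((context, attr_b)) == attr_a:
        return "match"
    # 2. both attributes are individuals of the same concept
    if ontology.concept_of(attr_a) == ontology.concept_of(attr_b):
        return "match"
    # 3. a generalisation/specialisation relationship exists between them
    if ontology.subsumes(attr_a, attr_b) or ontology.subsumes(attr_b, attr_a):
        return "match"
    # 4. a disjoint relationship means the pair is mismatched (an alert is raised)
    if ontology.disjoint(attr_a, attr_b):
        return "mismatch"
    # 5. everything else is neutral and ignored
    return "neutral"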
3 Related Work
A shared ontology has been used as an integrated schema for agents and databases in several projects, including InfoSleuth [1] and OBSERVER [5]. In our view the potential advantages of ontologies have not been fully exploited. The uses quoted here rely on there being well-defined semantics for concepts. However, in practical applications this is not possible, because only partial definitions of concepts are available. We have identified the advantages of ontologies and mappings and made use of them together to automate some mapping tasks. Work on creating mappings between a database and an ontology base/knowledge base has been underway for quite a long time [10]. Recently, in the J2EE platform for example, a user can map an entity bean [11], which is an element of an ER model [12], to a table of a backend database in order to establish persistence and make transaction management transparent. However, to our knowledge, most of this work was purely manual, and the power of ontological knowledge was not exploited. In addition, there are no shared mapping pools to allow the creation of shared mappings. We believe shared mapping creation is going to play an important role for, e.g., e-commerce applications in the foreseeable future. It seems likely that fully automated mapping creation based on ontologies will not be available in the next few years. Attempts to utilise ontological knowledge and improve the degree of automation of the mapping creation process started in the field of software maintenance [13], where AI techniques were used to build an intelligent assistant [14] to help recover domain knowledge from legacy software code, including legacy databases. The work described in this paper is an extension of our previous work that allows it to be used in a multi-agent environment.
4 Concluding Remarks
In this paper we have presented our approach to reducing the burden of enabling semantic interoperability between disparate data sources, and given an example of its utility in the case of information agents that manage legacy databases. In particular, we presented a mapping editor that can help a user map a database schema to a shared ontology semi-automatically. The mapping server can give advice on which attributes can and cannot be mapped. This helps to reduce the overhead of the mapping process
and improves the efficiency of indirectly linking legacy databases in a multi-agent environment. The separation of contributing and optional attributes could be potentially exploited to develop specialised agents to mediate different specialisations. The use of centralised mapping and translation services allows agent designers to consider how to select mappings and leave the execution of those mappings to the specialised mapping services. There is no need for everyone to develop the mapping execution algorithm. It also promotes shared development of common mapping rules. This is very important in an increasingly complex e-commerce world.
References
[1] Nodine, M., Fowler, J., Ksiezyk, T., Perry, B., Taylor, M., and Unruh, A.: Active Information Gathering in InfoSleuth. In: International Journal of Cooperative Information Systems 9:1/2 (2000) 3-28
[2] Revelli, C.: Agents for E-Commerce. http://www.agentland.com/pages/learn/revelli/extbook4-1.html (2003)
[3] FIPA, the Foundation for Intelligent Physical Agents: www.fipa.org (2003)
[4] Roth, M. T. and Schwarz: Don't Scrape It, Wrap It! A Wrapper Architecture for Legacy Data Sources. VLDB (1997)
[5] Mena, E., Illarramend, A., Kashyap, V., and Sheth, A.: OBSERVER: An Approach for Query Processing in Global Information Systems based on Interoperation across Pre-existing Ontologies. In: International Journal Distributed and Parallel Databases (DAPD), Vol. 8, No. 2, ISSN 0926-8782 (2000) 223-271
[6] Ambite, J.L., and Knoblock, C.A.: Agents for Information Gathering. In: IEEE Expert: Intelligent Systems and their Applications (1997)
[7] Tamma, V.A.M.: An Ontology Model Supporting Multiple Ontology for Knowledge Sharing. PhD Thesis, The University of Liverpool (2002)
[8] Baader, F., Calvanese, D., McGuinness, D., Nardi, D., and Patel-Schneider, P. (eds.): The Description Logic Handbook. Cambridge University Press (2003)
[9] Ontoknowledge.org: OIL, www.ontoknowledge.org/oil/ (2003)
[10] Kashyap, V. and Sheth, A.: Schematic and Semantic Similarities between Database Objects: A Context-based Approach. VLDB Journal, 5(4) (1996)
[11] Bodoff, S., Green, D., Haase, K., Jendrock, E., Pawlan, M., and Stearns, B.: The J2EE Tutorial. Addison-Wesley (2002)
[12] Chen, P.: The Entity-Relationship Model – Toward a Unified View of Data. ACM Transactions on Database Systems, V.1, N.1 (1976)
[13] Li, Y., Yang, H., and Chu, W.: Towards Building A Smarter Domain Knowledge Recovery Assistant. In: Proceedings of the 24th IEEE COMPSAC (2000)
[14] Li, Y., Yang, H., and Chu, W.: Information Elicitation from Software Code. Chapter for Handbook of Software Engineering and Knowledge Engineering, V2, World Scientific Publishing Co. (2001)
An Ontology for Modelling Security: The Tropos Approach Haralambos Mouratidis1, Paolo Giorgini2, and Gordon Manson1 1
University of Sheffield, Computer Science Department, UK {haris,g.manson}@dcs.shef.ac.uk 2 University of Trento, Department of Information and Communication Technology, Italy [email protected]
Abstract. It has been argued that security concerns should inform all the stages of the development process of an agent-based system. However, this is usually not the case, since current agent-oriented methodologies do not, as a rule, consider security concepts within their modelling ontology. In this paper we present extensions to the Tropos ontology to enable it to model security.
1 Introduction
Following the wide recognition of multi-agent systems, agent-oriented software engineering has been introduced as a major field of research. Many agent-oriented software engineering methodologies have been proposed [1, 2], each offering a different approach to modelling multi-agent systems. It has been argued [3] that security issues should inform all the stages of the development of agent-based systems. However, this is usually not the case. One of the reasons is the lack of concepts and notations employed by the current methodologies to help towards the inclusion of security within the development stages. In other words, agent-oriented software engineering methodologies do not usually integrate security concepts within their ontology. In this paper we describe how the Tropos ontology has been extended to consider security issues. Section 2 of the paper provides an overview of the Tropos ontology, and Section 3 identifies the need to extend the methodology to consider security issues. Section 4 describes the newly introduced (security) concepts and Section 5 concludes the paper and presents directions for future work.
2 TROPOS
Tropos [2] is an information system development methodology, tailored to describe both the organisational environment of a system and the system itself, employing the same concepts throughout the development stages. The Tropos ontology is described at
three levels of granularity [4] and is inspired by social and organisational structures. At the first (lowest) level, the Tropos ontology adopts components from the i* modelling framework [5], which is based on the concepts of actors, goals, soft goals, tasks, resources, and social dependencies. Social dependencies represent obligations or agreements, called the dependum, between two different actors, called the depender and the dependee. To partially illustrate the modelling of social dependencies between actors, consider the eSAP System [6]. Such a system involves the actors Older Person, R&D Agency, Benefits Agency, Department of Health (DoH), and Professional (Figure 1) [6]. The depender is the depending actor and the dependee is the actor who is depended upon. For example, in Figure 1 the Older Person depends on the Professional to fulfil the Receive Appropriate Care goal dependency. For this dependency, the Older Person is the depender, the Professional the dependee and the Receive Appropriate Care goal the dependum. Actors have strategic goals and intentions within the system or the organisation and represent (social) agents (organisational, human or software), roles or positions (a position represents a set of roles). A goal represents the strategic interests of an actor. In Tropos we differentiate between hard goals (simply goals hereafter) and soft goals, the latter having no clear definition or criteria for deciding whether they are satisfied or not. A task represents a way of doing something; thus, for example, a task can be executed in order to satisfy a goal. A resource represents a physical or an informational entity, while a dependency between two actors indicates that one actor depends on another to accomplish a goal, execute a task, or deliver a resource. At the second level, the Tropos ontology provides a set of organisational styles inspired by organisation theory and strategic alliances [4]. These styles are used to describe the overall architecture of the system or of its organisational context. The last element of the Tropos ontology consists of social patterns [4]. These patterns, unlike organisational styles, are focused on the social structure necessary to achieve a particular goal rather than on the overall goals of the organisation.
Fig. 1. The social Dependencies between the stakeholders of the eSAP system
In addition to the graphical representation, Tropos provides a formal specification language called Formal Tropos [7]. Formal Tropos complements i* by defining a textual notation for i* models and allows us to describe dynamic constraints among the different elements of the specification in a first-order linear-time temporal logic [7].
3 (Lack of) Security Ontology in TROPOS
As we have argued in a previous paper [3], the Tropos methodology needs to be extended in order to adequately model security. The current Tropos ontology gives developers the ability to model security requirements as soft goals. The concept of a soft goal is "used to model quality attributes for which there are no a priori, clear criteria for satisfaction, but are judged by actors as being sufficiently met" [5]. However, security requirements may relate to the system's quality properties, or alternatively may define constraints on the system [8]. Qualities are properties or characteristics of the system that its stakeholders care about, while constraints are restrictions, rules or conditions imposed on the system and, unlike qualities, are (theoretically) non-negotiable. Thus, although the concept of a soft goal captures qualities, it fails to adequately capture constraints [3]. Security constraints might affect the analysis and design of the system by restricting some alternative design solutions, by conflicting with some of the requirements of the system, and by refining some of the goals of the system or introducing new ones that help the system towards the satisfaction of the constraint. We believe the current Tropos ontology must be extended in three main directions. Firstly, the concept of a security constraint must be introduced as a separate concept, next to the existing concepts of Tropos. Secondly, existing concepts such as goals, tasks and resources must be defined with and without security in mind. For example, a goal should be differentiated from a secure goal, the latter representing a goal that affects the security of the system. Thirdly, security-engineering concepts such as security features, protection objectives, security mechanisms and threats, which are widely used in security engineering, must be introduced into the Tropos ontology, in order to make the methodology applicable by software engineers as well as security engineers. In this paper, due to lack of space, we only present the extensions in the first two directions. Readers interested in how security-engineering concepts are integrated within the Tropos methodology should refer to [3].
4 Security Concepts
4.1 Security Constraints
We define a security constraint as a constraint that is related to the security of the system. Since constraints can influence the security of the system either positively (e.g., Allow Access Only to Personal Record) or negatively (e.g., Send Record Plain Text, not encrypted), we further define positive and negative security constraints, respectively. In the early requirements analysis, security constraints are identified and analysed according to the constraint analysis processes we have proposed in [9]. Security constraints are then imposed on different parts of the system, and possible conflicts between security and other (functional and non-functional) requirements of the system are identified and solved. To identify these conflicts we differentiate between security constraints that contribute positively or negatively to the other requirements of the
system. We consider security constraints at a higher level of abstraction. This means we do not take into consideration specific security protocols, which should be decided during the implementation of the system and which, most of the time, restrict the design to the use of a particular implementation language.
4.2 Secure Entities
The term secure entity covers any secure goals, tasks and resources of the system. A secure entity is introduced to the actor (or the system) in order to help in the achievement of a security constraint. For example, if a health professional actor has the security constraint Share Info Only If Consent Obtained, the secure goal Obtain Patient Consent can be introduced to this actor in order to help in the achievement of the constraint. A secure goal does not define precisely how the security constraint can be achieved, since (as in the definition of a goal, see [5]) alternatives can be considered. This is, however, possible through a secure task, since a task specifies a way of doing something [5]. Thus, a secure task represents a particular way of satisfying a secure goal. For example, for the secure goal Check Authorisation, we might have secure tasks such as Check Password or Check Digital Signatures. A resource that is related to a secure entity or a security constraint is considered a secure resource. For example, an actor may depend on another actor to receive some information, and this dependency may be restricted by the constraint Only Encrypted Info.
4.3 Secure Dependencies
A secure dependency introduces security constraint(s), proposed either by the depender or by the dependee, that must be satisfied for the dependency to succeed. For example, a Doctor (depender) depends on a Patient (dependee) to obtain Health Information (dependum). However, the Patient imposes a security constraint on the Doctor: to share health information only if consent is obtained. Both the depender and the dependee must agree on this constraint for the secure dependency to be valid. That means that, on the depender side, the depender expects the dependee to satisfy the security constraints, while on the dependee side, a secure dependency means that the dependee will make an effort to deliver the dependum by satisfying the security constraint(s). There are two degrees of security: the Open Secure dependency (a normal dependency) and the Secure dependency. In an Open Secure Dependency [3] some security conditions might be introduced, but if the dependee fails to satisfy them the consequences will not be serious; the security of the system will not be endangered if some of these conditions are not satisfied. On the other hand, there are three different types of secure dependency [3]: the Dependee Secure Dependency, the Depender Secure Dependency, and the Double Secure Dependency. Taking as an example the eSAP system illustrated in Section 2, the social dependencies between the actors of the system can now be modelled taking into account the security constraints between them, as shown in Figure 3. The Older Person depends on the Benefits Agency to Receive Financial Support. However, the Older Person worries about the privacy of their finances, so they impose a constraint on the Benefits Agency actor to keep their financial information private. The Professional
depends on the Older Person to Obtain Information; however, one of the most important and delicate matters for a patient (in our case the older person) is the privacy of their personal medical information and the sharing of it. Thus, most of the time, the Professional has a constraint imposed on them to share this information if and only if consent is obtained. In addition, one of the main goals of the R&D Agency is to Obtain Clinical Information in order to perform tests and research. To get this information the R&D Agency depends on the Professional. However, the Professional has a constraint imposed on them (by the Department of Health) to Keep Patient Anonymity.
Fig. 3. Social dependencies between the eSAP stakeholders
The security constraints imposed on each actor can be further analysed by identifying which goals of the actor they restrict [9]. For example, the Professional actor has had two security constraints imposed on it (Share Info Only If Consent Obtained and Keep Patient Anonymity). During the means-end analysis [1] of the Professional actor we have identified the Share Medical Info goal. However, this goal is restricted by the Share Info Only If Consent Obtained constraint imposed on the Professional by the Older Person. For the Professional to satisfy the constraint, a secure goal can be introduced, such as Obtain Older Person Consent. However, this goal can be achieved in many different ways; for example, a Professional can obtain the consent personally or can ask a nurse to obtain it on their behalf. Thus a sub-constraint can be introduced, Only Obtain Consent Personally. This sub-constraint introduces another secure goal, Personally Obtain Consent. This goal can be divided into two sub-tasks, Obtain Consent by Mail and Obtain Consent by Phone.
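A minimal sketch of how such a secure dependency and its constraint refinement could be recorded is given below. The data model and its names are ours and only mirror the concepts introduced in this section; the security-type assignment is illustrative.

from dataclasses import dataclass, field
from typing import List, Literal

@dataclass
class SecurityConstraint:
    text: str
    secure_goals: List[str] = field(default_factory=list)                      # secure goals introduced to satisfy it
    sub_constraints: List["SecurityConstraint"] = field(default_factory=list)

@dataclass
class SecureDependency:
    depender: str
    dependee: str
    dependum: str
    security_type: Literal["Open", "Dependee", "Depender", "Double"]
    constraints: List[SecurityConstraint] = field(default_factory=list)

# The Obtain Information dependency of Figure 3: the Older Person (dependee)
# imposes the consent constraint on the Professional (depender).
consent = SecurityConstraint(
    "Share Info Only If Consent Obtained",
    secure_goals=["Obtain Older Person Consent"],
    sub_constraints=[SecurityConstraint("Only Obtain Consent Personally",
                                        secure_goals=["Personally Obtain Consent"])])
obtain_info = SecureDependency("Professional", "Older Person", "Obtain Information",
                               "Dependee", [consent])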
4.4 Formal Tropos
Formal Tropos [7] complements graphical Tropos by extending the Tropos graphical language into a formal specification language [7]. The language offers all the primitive concepts of graphical Tropos, supplemented with a rich temporal specification language, inspired by KAOS [10], that has formal semantics and is
amenable to formal analysis. In addition, Formal Tropos offers a textual notation for i* models and allows the description of different elements of the specification in a first-order linear-time temporal logic. A specification in Formal Tropos consists of a sequence of declarations of entities, actors, and dependencies [7]. Formal Tropos can be used to perform a formal analysis of the system and also to verify the model of the system by employing formal verification techniques, such as model checking, to allow for an automatic verification of the system properties [7]. As with graphical Tropos, Formal Tropos has not been conceived with security in mind. Thus, Formal Tropos fails to adequately model some security aspects (such as secure dependencies and security constraints). Extending Formal Tropos allows us to perform a formal analysis of our introduced concepts and thus provide formalism for our approach. In this direction, we have extended the Formal Tropos grammar [9]; below we present an example in which the secure dependency Obtain OP Information between the Older Person and the Professional (Figure 3) is specified:

Entity HealthInformation
  Attribute constant record: Record

Entity Record
  Attribute constant content: CarePlan, accessControl: boolean,
    patient: Patient, consent: boolean
  Security Constraint
    ∃ hi: HealthInformation ((hi.record = self) → self.accessControl)

Actor Professional
  Attribute patients: PatientList
  Goal provideCare
    Creation condition ∃ p: Patient (In(p, self.patients) ∧ ¬p.healthOK)

Actor Older Person
  Attribute healthOK: boolean
  Goal MaintainGoodHealth
    Creation condition ¬self.healthOK
  Security constraint ∀ rec: Record ((rec.patient = self) → rec.accessControl)

Dependency ObtainOPInformation
  Type Goal
  Security Type Dependee
  Mode Achieve and Maintain
  Depender Professional
  Dependee Older Person
  Attribute constant
  Creation condition In(self.dependee, self.depender.patients) ∧ self.dependee.healthOK
  Security Constraint for depender ∀ rec: Record ((rec.patient = dependee) ∧ rec.consent)
5 Conclusions and Future Work
In this paper we have presented extensions to the Tropos ontology to enable it to model security issues. Concepts and notations were added to the existing graphical Tropos, and the Formal Tropos grammar was extended to provide a formal basis for our newly introduced concepts. During the process of extending the Tropos ontology we have reached some useful conclusions. By introducing the concept of security constraints, functional, non-functional and security requirements are defined together, while a clear distinction between them is preserved. In addition, by considering the overall software development process, it is easy to identify security requirements at the early requirements stage and propagate them through to the implementation stage. This introduces a security-oriented paradigm into the software engineering process. Also, the iterative nature of the methodology, together with the security concepts, allows the redefinition of security requirements at different levels, therefore providing a better integration of security and system functionality. Our extensions apply only to the first level of the Tropos ontology. Future work involves the expansion of our approach to the other two levels of the Tropos ontology. We aim to provide a set of organisational styles and a pattern language that will help developers to consider security throughout the development of an agent-based system.
References
[1] C. Iglesias, M. Garijo, J. Gonzales, "A survey of agent-oriented methodologies", Intelligent Agents IV, A. S. Rao, J. P. Muller, M. P. Singh (eds), Lecture Notes in Computer Science, Springer-Verlag, 1999
[2] J. Castro, M. Kolp and J. Mylopoulos, "A Requirements-Driven Development Methodology", In Proc. of the 13th Int. Conf. on Advanced Information Systems Engineering (CAiSE'01), Interlaken, Switzerland, June 2001
[3] H. Mouratidis, P. Giorgini, G. Manson, "Modelling Secure Multiagent Systems", (to appear) in the Proceedings of the 2nd International Joint Conference on Autonomous Agents and Multiagent Systems, July 2003
[4] A. Fuxman, P. Giorgini, M. Kolp, J. Mylopoulos, "Information Systems as Social Structures", In Proceedings of the Second International Conference on Formal Ontologies for Information Systems (FOIS-2001), Ogunquit, USA, October 17-19, 2001
[5] E. Yu, "Modelling Strategic Relationships for Process Reengineering", PhD thesis, Department of Computer Science, University of Toronto, Canada, 1995
[6] H. Mouratidis, I. Philp, G. Manson, "Analysis and Design of eSAP: An Integrated Health and Social Care Information System", In the Proceedings of the 7th International Symposium on Health Information Management Research (ISHIMR2002), Sheffield, July 2002
[7] A. Fuxman, M. Pistore, J. Mylopoulos, P. Traverso, "Model Checking Early Requirements Specification in Tropos", In the Proceedings of the 5th Int. Symposium on Requirements Engineering, RE'01, Toronto, Canada, 2001
[8] I. Sommerville, "Software Engineering", sixth edition, Addison-Wesley, 2001
[9] H. Mouratidis, "Extending Tropos Methodology to Accommodate Security", Progress Report, Computer Science Department, University of Sheffield, October 2002
[10] A. Dardenne, A. van Lamsweerde, S. Fickas, "Goal-directed Requirements Acquisition", Science of Computer Programming, 20, pp 3-50, 1993
Towards a Pragmatic Use of Ontologies in Multi-agent Platforms
Philippe Mathieu, Jean-Christophe Routier, and Yann Secq
Laboratoire d'Informatique Fondamentale de Lille – CNRS UMR 8022, Université des Sciences et Technologies de Lille, 59657 Villeneuve d'Ascq Cedex
{mathieu,routier,secq}@lifl.fr
Abstract. The knowledge representation field is gaining momentum with the advent of the Semantic Web activity within the W3C. This working group, building on previous research, has proposed the Ontology Web Language to enhance the expressivity of web pages and to allow semantic inferences. This paper argues that knowledge representation technologies should be core components of multi-agent platforms. In the first part of the paper, we introduce our agent model, which relies on the notion of skill. Then, we identify several criteria that lead us to believe that the Owl language should be used as a content language within Agent Communication Languages and also in the design of multi-agent platforms. In the last part, we discuss the conceptual and technological challenges that platform designers coming from the multi-agent field have to deal with when trying to integrate knowledge representation technologies.
1 Introduction
The multi-agent systems field has grown rapidly during the last decade. Multi-agent systems are now used in industrial contexts, and the standardization process initiated by the Fipa organization (Foundation for Intelligent Physical Agent, http://www.fipa.org) is gaining momentum. This growth has led to several proposals of agent methodologies [12, 18, 6] to ease the analysis and design of such complex distributed systems. These methodologies often make reference to knowledge and linguistic theories: ontologies to add a semantic layer to content languages, Speech Act Theory [1] to handle illocutionary acts, and knowledge representation to model agent beliefs [15]. Sadly, even if these intentions are sound, many agent frameworks do not use all these features and rely instead on lower-level approaches that have the advantage of being practical. Moreover, the multiplicity of agent models and frameworks makes it difficult to capitalise on experience, which is often not easily reusable on a platform other than the one used to acquire it. The knowledge representation field is also gaining momentum with the so-called Semantic Web initiative supported by the W3C (http://www.w3c.org/2001/sw/WebOnt/). More precisely, works
that have been done on languages like Rdf [3] (Resource Description Framework), Daml+Oil [5] and, more recently, the Working Draft of Owl, the Ontology Web Language, show a strong trend towards a broadening of knowledge representation use in everyday Internet technologies. Nevertheless, these languages are still in their infancy and are seldom available in commercial products. This paper is a prospective study of the use of knowledge representation languages, and more precisely Owl, as a core component of agent communication and agent platforms. We first introduce our agent model, which tries to define several levels of abstraction in agent design and which relies on generic components called skills. Then, we argue that the Owl language should be used to add a semantic layer at different levels within agents. This layer could have an important impact on the engineering of multi-agent systems and, more prospectively, on their reliability. Finally, we set out the difficulties and challenges that agent platform designers have to deal with, from a conceptual and technological point of view.
2 A Generic Agent Model Relying on the Notion of Skill
The basis of our model is, on the one hand, the interactive creation of agents [16] and, on the other hand, a study of the fundamental functionalities of agenthood. We are not interested in the description of the individual behaviour of agents, but rather in the identification of functions that are sufficient and necessary to an agent. Indeed, the management of interactions, the management of knowledge or the management of organizations are not specific to a particular agent model, but are intrinsic characteristics of the concept of agent. In our model, an agent is a container which can host skills. A skill is a coherent set of functionalities accessible through a neutral interface. This concept of skill is to be brought closer to the concept of software component in object-oriented technologies. Thus, an agent consists in a set of skills which carry out various parts of its behaviour. We have identified four layers, characterized by the level of abstraction of the functionalities that they propose:

  4  Applicative skills        Database access, graphical user interface, ...
  3  Agent model skills        Inference engine, behavioral engine, ...
  2  Agenthood skills          KB, conversation management, organization management
  1  Minimal system skills     Communication and skill management
The first level corresponds to system skills, i.e. the minimal functionalities allowing an agent to be bootstrapped: communication (emission/reception of messages) and the management of skills (dynamic acquisition/withdrawal of skills). The second level identifies agenthood skills: the knowledge base, which is the medium of interaction between skills and the place of knowledge representation, the management of interaction protocols and the management of organizations. The third level is related to the skills that define the agent model (reactive, BDI, ...), while the last level represents purely applicative skills. Rather than the skills implementing these
various levels, it is the functionalities they represent that are fundamental: the management of communications, just like the knowledge base, can be implemented in different ways, but these functions must be present within the agent. Thus, the first and second levels characterize our generic minimal agent model. This model is generic with respect to the agent models that can be used, and minimal in the sense that it is not possible to remove one of the functionalities without losing a fundamental aspect of agenthood. A skill is made of two parts: its interface and its implementation. The interface specifies the incoming and outgoing messages, while the implementation carries out the processing of these messages. This separation uncouples the specification from its realization, and thus makes it possible to have several implementations for a given interface. The interface of a skill is defined by the set of message patterns that it accepts and produces. These messages must be discriminated; it is thus necessary to type them: interface := ((m_in)+, (m_out)*)*, where m_x is a message pattern. The typing of message patterns can take several forms: strong typing has the advantage of totally specifying the interfaces, while weak typing offers more flexibility with regard to interface evolution. Thus, if the content of messages is expressed in KIF or Owl, strong typing will consist of checking the entire message, while weak typing will only check it partially. An analogy can be drawn between this approach and the proposals for typing Xml messages: Xml Schema induces a full verification, while Schematron does partial checking (for a comparison of XML schema languages, see [13]). From an implementation point of view, our notion of skill is similar to the idea of Web Services: a neutral interface that can be implemented in several languages and component models. The component models that could be used to realize skills range from Enterprise Java Beans from Sun, to the Corba Component Model from the Omg, or even OSGi bundles [10] for constrained environments. Skill interfaces are only concerned with incoming and outgoing messages, but this approach raises the problem of the semantics of messages. The exchange of messages is one of the main principles of multi-agent systems. It was identified early on in the work of Hewitt [11] on Actor Languages and has more recently been used as a means to enable agent interoperability. This last approach has been implemented with the Kqml [7] language during the last decade. This language was built using notions from Speech Act Theory, introduced by Austin [1] and developed by Searle [17], which claims that talking is acting. Thus, Searle identified four categories of speech acts: utterances, propositional utterances, illocutionary utterances and perlocutionary utterances. These notions have been interpreted and reduced to the following key elements that constitute a Kqml message: the sender agent, the intended agent to whom it is addressed, the reply to the message that the sending agent needs to receive, the performative name (25 are predefined), the language used to specify the
content, the ontology that describes the meaning of the message (i.e. what it is trying to achieve), and finally the message content. Nowadays, Kqml is being slowly replaced by the Fipa-Acl, which retains the same principles. Although this approach is really appealing, it is not practical: developers always have to agree on content languages and to manually interpret the ontology (i.e. the semantics) of messages (we refer here to heterogeneous agent platforms; many studies have been done and are working with homogeneous environments). We believe that a leap forward could be achieved if the Fipa foundation proposed general-purpose ontologies specified in Owl. It would also remove the incoherence that emerges when the envelope, the payload and the message are not all expressed in the same language [14].
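To make the structure of such messages concrete, the following sketch shows one possible in-memory representation of the message elements just listed, with Owl as the content language. It is written in Python purely for illustration; the field names and the example values are ours and are not prescribed by Kqml or Fipa-Acl.

from dataclasses import dataclass

@dataclass
class AclMessage:
    # Field names follow the KQML/FIPA-ACL elements described in the text.
    sender: str        # agent sending the message
    receiver: str      # intended agent to whom it is addressed
    reply_with: str    # identifier the sender expects to see in the reply
    performative: str  # e.g. "query-ref" (one of the predefined performatives)
    language: str      # content language, e.g. "OWL"
    ontology: str      # ontology giving the content its meaning
    content: str       # the message content itself

# Hypothetical example: querying a shop agent, with the content expressed
# against a (fictitious) e-commerce ontology; the OWL fragment is elided.
msg = AclMessage(
    sender="buyer@host-a",
    receiver="shop@host-b",
    reply_with="query-42",
    performative="query-ref",
    language="OWL",
    ontology="http://example.org/ontologies/e-commerce",
    content="<rdf:RDF>...</rdf:RDF>",
)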
3 Integrating Knowledge Representation Technologies at the Core of Agent Platforms
In our first framework, Magique [16], agents exchange semantically weak messages. These messages can be viewed as a kind of remote method invocation. Skill interfaces are basically Java interfaces, and their implementations are Java objects or components. We therefore wanted to add some XML-based language to describe our skill interfaces in order to get rid of the dependency on the Java language. We studied existing approaches and found that Wsdl (Web Service Description Language, http://www.w3.org/TR/wsdl12/) was the closest technological solution. But this language is in the end just an XML encoding of our previous approach, and we wanted more expressiveness. So we took a closer look at Rdf, and soon after at Daml+Oil. The latter unifies work from different communities: it has a formal semantics and efficient reasoning (through Description Logics), it provides rich modelling primitives (frame-like concepts), and a standard syntactical exchange format (Rdf triples). Our first use of Owl is thus to define skill interfaces (we also considered DAML-S [4], but this initiative does not seem to be growing). This enables us to add meta-information to ease the management of skill interfaces or implementations: version number, library dependencies, deployment information, etc. Indeed, the use of Daml+Oil or its successor Owl represents a shift from object processing to document processing, and inference facilities can be seen as powerful information accessors. The second use that we consider is related to the agent knowledge base. As agents exchange semantically strong messages, being able to use the same tool to represent knowledge could ease the agent developer's task. Implementing agents requires message matching, knowledge base querying and updating, and message creation for replies. If the same language is used through all these stages, some translations can be avoided. Moreover, Owl has all the features needed to describe rich knowledge structures (it was designed for this aim), but can also ease the transition from object technologies thanks to the inclusion of datatypes. The other aspect we consider is the use of the knowledge base as a medium for local
(intra-agent) skill interactions. For that purpose, we propose to use the information in the knowledge base as a kind of semantic Linda-space [8]. This approach would induce a real uncoupling, where dependencies would only be expressed through the semantics of the data, and again inferences would be much more expressive than pattern matching (semantics versus syntax). Going further, having Owl as a core component of agent platforms could lead to more prospective aspects, like advanced integrated development environments that could leverage the semantic layer, or even facilitate the use of agent platforms for model-based experimentation. Who has never been wandering through the thousands of classes available in the standard library of Java? Indeed, the semantic description of skills could be used in development environments as enhanced technical documentation, a kind of semantic "Skilldoc" (by analogy with Javadoc, the automated project documentation framework). Model-based programming aims at developing sophisticated regulatory and immune systems that accurately and robustly control their internal functions (see model-based computing at Xerox: http://www2.parc.com/spl/projects/mbc/). To accomplish this, these systems exploit a vast nervous system of sensors to model themselves and their environment. This enables them to reconfigure themselves. A tight coupling between the higher-level coordination functions provided by symbolic reasoning and the lower-level processes of adaptive estimation and control is thus necessary. Working with agents that are built on semantically strong descriptions, and that rely on an inference engine, could ease the design of such systems: one of the agent's skills could monitor the others and react if one fails.
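As a minimal sketch of the semantic Linda-space idea, the following Python fragment shows skills publishing facts into a shared intra-agent knowledge base and retrieving them by a condition rather than by a fixed syntactic pattern. All names are ours, and the matching condition stands in for what a real system would delegate to a Description Logic reasoner.

class SemanticBlackboard:
    # Toy intra-agent tuple space: skills publish (subject, predicate, object)
    # facts and retrieve them with an arbitrary condition.
    def __init__(self):
        self.facts = []

    def publish(self, subject, predicate, obj):
        self.facts.append((subject, predicate, obj))

    def query(self, condition):
        # 'condition' replaces the inference step of a real semantic match.
        return [fact for fact in self.facts if condition(fact)]

# Hypothetical usage: a monitoring skill looks up every skill whose status is "failed".
kb = SemanticBlackboard()
kb.publish("skill:db-access", "hasStatus", "failed")
kb.publish("skill:gui", "hasStatus", "running")
failed = kb.query(lambda f: f[1] == "hasStatus" and f[2] == "failed")
print(failed)  # [('skill:db-access', 'hasStatus', 'failed')]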
4 Conceptual and Technological Challenges
Nevertheless, even if the integration of knowledge representation would be really useful for multi-agent software engineers, this task is really challenging. The first challenge is the novelty of these technologies and the lack of tools to ease their integration within existing systems. While Rdf is now widely supported, Daml+Oil support is just beginning (and even more so for Owl, which is not yet standardized). Several editors are available: a mode for emacs, which does not provide more than syntax highlighting; OILed [2], which is defined by its creators as an ontology "notepad"; Protégé [9], an ontology and knowledge-base editor that should integrate a Daml+Oil plugin soon; and some commercial tools. The main problems are the lack of coherence checking or querying facilities in these editors (even if OILed can rely on the FaCT reasoner), and the lack of embeddable components of such tools. An exception should be noted: the Java Theorem Prover (JTP, http://www.ksl.stanford.edu/software/JTP/) is a nice API that is portable and easily embeddable. It is likely that the situation will be better when Owl becomes a W3C Recommendation. Another possibility is the Owl-Lite language, which is a subset of Owl: it will probably be easier to create tools that support it, and this availability of tools could ease the widespread use of Owl.
To leverage the use of these technologies, the Fipa organization could provide Owl ontologies within some of its specifications. Technological choices are fundamental for industry adoption. For example, the use of SL as a content language and Iiop as a transport layer have been wrong choices: relying on an XML-based language like Daml+Oil and on Http for transport would have eased the development of libraries, tools and applications around the Fipa specifications. A nice initiative towards this aim is the Java Agent Services project (http://www.java-agent.org), held under the Java Community Process, which implements the Fipa Abstract Architecture (http://www.fipa.org/specs/fipa00001/SC00001L.pdf) and provides a nice object-oriented API that could leverage work done on agent infrastructures. Sadly, since this project has gone under Public Review, it seems to be stalled. A last challenge, more cultural, is that knowledge representation technologies, and particularly ontology design and development, are not an easy task. And because of the novelty of Daml+Oil and Owl, resources like how-tos or tutorials are quite scarce. This last point is very important, and the knowledge representation community can play an important pedagogic role in "evangelising" the multi-agent community, and more precisely platform designers.
5 Conclusion
In this paper, we have studied the role that knowledge representation technologies could play in the design of open multi-agent platforms. We insist on the fact that using KR languages within agent-based applications is not new, but their use for open systems has not really worked yet. The advent of the Semantic Web, and more precisely the transition from Daml+Oil to the Working Draft of Owl, could enable a wider use of KR technologies within the WWW, but will also leverage standardization of ontology languages. This fact should deeply impact multi-agent technologies: instead of esoteric languages like KIF and SL0 or dedicated low-level languages (i.e. Java objects), Agent Communication Languages could finally enable real interoperability through an agreement on Owl as a content language. An analogy could be drawn with the advent of XML and the impact it has had on server-side environments. We believe that with Owl as a W3C standard, the Fipa organization should follow and provide basic ontologies for agent and platform services (instead of informal frame-based ones, which are not usable without first being interpreted by developers). We have identified several points where the use of Owl would be useful in agent-based systems: as a content language within ACLs, as a knowledge representation language within the agent knowledge base, and as a semantically stronger description of skill interfaces. These are direct applications, but we also raise more prospective ones, like a kind of semantic technical documentation, and the facilities that could be used to add model-based notions within agent-based
systems to enhance their reliability. Unfortunately, the use of these technologies is challenging for several reasons: the technologies are new, so few tools are available, and ontology design and management is not an easy task. Nevertheless, we believe that the Owl language will play an important role in the agent software engineering field.
References
[1] J. L. Austin. How To Do Things With Words. Harvard University Press, second edition, 1975.
[2] S. Bechhofer, I. Horrocks, C. Goble, and R. Stevens. OilEd: A reason-able ontology editor for the semantic Web. Lecture Notes in Computer Science, 2174:396–??, 2001.
[3] S. Decker, S. Melnik, F. van Harmelen, D. Fensel, M. C. A. Klein, J. Broekstra, M. Erdmann, and I. Horrocks. The semantic web: The roles of XML and RDF. IEEE Internet Computing, 4(5):63–74, 2000.
[4] A. Ankolekar et al. DAML-S: Web service description for the semantic web. In Proc. 1st International Semantic Web Conference (ISWC 02), 2002.
[5] D. Fensel, F. van Harmelen, I. Horrocks, D. McGuinness, and P. Patel-Schneider. OIL: An ontology infrastructure for the semantic web, 2001.
[6] J. Ferber and O. Gutknecht. Operational semantics of a role-based agent architecture. In Proceedings of ATAL'99, January 1999.
[7] T. Finin, R. Fritzson, D. McKay, and R. McEntire. KQML as an agent communication language. In Proceedings of the 3rd International Conference on Information and Knowledge Management (CIKM'94), pages 456–463, Gaithersburg, Maryland, 1994. ACM Press.
[8] D. Gelernter. Multiple tuple spaces in Linda. In E. Odijk, M. Rem, and J.-C. Syre, editors, PARLE '89: Parallel Architectures and Languages Europe, volume 366 of Lecture Notes in Computer Science, pages 20–27, 1989.
[9] W. Grosso, H. Eriksson, R. Fergerson, J. Gennari, S. Tu, and M. Musen. Knowledge modeling at the millennium – the design and evolution of Protégé, 2000.
[10] H. Cervantes and R. S. Hall. Beanome: A component model for the OSGi framework. In Workshop on Software Infrastructures for Component-Based Applications on Consumer Devices, September 2002.
[11] C. Hewitt. Viewing control structures as patterns of passing messages. In Artificial Intelligence: An MIT Perspective. MIT Press, Cambridge, Massachusetts, 1979.
[12] E. A. Kendall, M. T. Malkoun, and C. H. Jiang. A methodology for developing agent based systems. In C. Zhang and D. Lukose, editors, First Australian Workshop on Distributed Artificial Intelligence, Canberra, Australia, 1995.
[13] D. Lee and W. W. Chu. Comparative analysis of six XML schema languages. SIGMOD Record (ACM Special Interest Group on Management of Data), 29(3):76–87, 2000.
[14] M. Schalk, T. Illmann, T. Liebig, and F. Kargl. Combining FIPA ACL with DAML+OIL – a case study. In Proceedings of the Workshop on Ontologies in Agent Systems, 1st International Joint Conference on Autonomous Agents and Multi-Agent Systems, Bologna, Italy, 2002.
[15] A. S. Rao and M. P. Georgeff. BDI-agents: from theory to practice. In Proceedings of the First Intl. Conference on Multiagent Systems, San Francisco, 1995.
[16] J.-C. Routier, P. Mathieu, and Y. Secq. Dynamic skill learning: A support to agent evolution. In Proceedings of the AISB'01 Symposium on Adaptive Agents and Multi-Agent Systems, pages 25–32, 2001.
[17] J. Searle. Speech Acts: An Essay in the Philosophy of Language. Harvard University Press, second edition, 1969.
[18] M. Wooldridge, N. R. Jennings, and D. Kinny. The GAIA methodology for agent-oriented analysis and design. Journal of Autonomous Agents and Multi-Agent Systems, 2000.
Ontological Foundations of Natural Language Communication in Multiagent Systems
Luc Schneider¹ and Jim Cunningham²
¹ Institute for Formal Ontology and Medical Information Science, University of Leipzig
² Department of Computing, Imperial College London
Abstract. The paper outlines a semantic ontology as a minimal set of top-level conceptual distinctions underlying natural language communication. A semantic ontology can serve as the basis for the specification of the meaning, as the logical form, of agent messages couched in natural language. It represents a general and reusable module in the architecture of multi-agent systems involving human as well as software agents. As a practical example, we will sketch a basic multi-agent system relying on natural language communication.
1 Ontology as a Basis for Multiagent Semantics
Successful communication in a multiagent system requires not only that the communicating agents share a common language, but also that they are committed to the same intended model for the semantics of this language. The semantics of a communication language is the theory that specifies the truth conditions of the messages embedded in the agents' speech acts. Under the closed world assumption, a shared intended model may be specified as a subset of the Herbrand base, that is, the set of ground goals of the communication language. In this case an ontology can be regarded as the logic program whose declarative meaning (roughly, the set of ground goals deducible from it) is an intended model shared by a community of communicating agents. This is just a paraphrase of the classical definition of an ontology as the formal statement of a model specifying the shared understanding of cooperating agents (Gruber [1991, 1995]). A semantic ontology is a conceptualisation, common to a community of agents that understand natural language, of the categories and relations that pervade the agents' environment as a whole. It can be used to specify the logical form as the truth-functional meaning of agent messages embedded in natural language. Architecturally a semantic ontology is the most reusable component of multiagent systems involving a human-computer interface. A semantic ontology has to reflect the wired-in conceptual framework human agents are equipped with. In (Schneider [2001]), a minimal semantic ontology
was drafted which drew its inspiration from two sources: the semantical analysis of natural language and philosophical accounts of the commonsense view of reality. Indeed, Parsons’ ([1990]) account of the semantics of verbs in terms of underlying events as well as the parts played in the latter by objects can be ideally complemented by Strawson’s ([1959]) “descriptive metaphysics”, an attempt to specify the basic entity-types of commonsense. The paper is structured as follows. In Section 2, we present the basic conceptual distinctions that are required by a minimal semantic ontology. Events and processes have to be differentiated from the objects that participate in them; the participants of events are either physical objects or persons. In Section 3, we show how to classify different ways of participation according to the kinds of events and participants involved, giving an ontological reading of verb complementation. Section 4 sketches the role of a semantic ontology as a basis of natural language communication between agents by using a simple multi-agent architecture involving human and software agents as an example.
2 Basic Distinctions in Semantic Ontology
Dependent and Independent Entities
It seems to be a fundamental feature of the human conceptual scheme that some kinds of entities, like physical objects or persons, are considered as basic, while other types of individuals, like qualities or boundaries, are regarded as somehow dependent on the former. According to Strawson ([1959]:16-17), this dependence has to be understood in terms of identification in an agent's environment. A class of particulars A (say, colours or boundaries) is identification-dependent on a class of particulars B (say, physical objects) if and only if, in order to be able to identify an instance of A, an agent has to single out an instance of B first. The commonsense distinction between dependent and independent entities is also acknowledged by recent computational upper-level ontologies, like BFO (Smith [2002]) or DOLCE (Gangemi et al. [2002]). In particular, the dichotomy between objects and the characteristics dependent on them is fundamental for a semantic ontology underlying natural language communication, as it motivates the grammatical difference between nouns and adjectives. The common role of nouns is to refer to objects or kinds of objects, while adjectives usually denote attributes. Of course, there are exceptions to that rule, but nominalisations of adjectives, such as "green" or "wisdom", seem to be recognised by speakers as exceptions to a more basic semantic rule.

Persons and Bodies
Another distinction that is crucial for a semantic ontology is that between mental or private characteristics (e.g. beliefs, intentions, desires) on the one hand, and physical or public characteristics (e.g. weight, colour) on the other hand. According to Strawson, our conceptual equipment is such as to posit the distinction between two types of spatio-temporal objects, namely bodies, to which only physical attributes can be ascribed, and
persons, to which both mental and physical characteristics can be attributed (Strawson [1959]:102-103). Many natural languages reflect this distinction explicitly, by gender or other systems of noun classification. We will see that the Person/Body dichotomy even underlies the semantical subcategorisation or complementation of verbs. Thus cognitively oriented ontologies like DOLCE (Gangemi et al. [2002]) have to include the difference between agentive and non-agentive objects in their taxonomies.

Objects and Events
Following Davidson ([Davidson 1980]), Parsons defends the view that the semantics of verbs and verb phrases implies the existence of events and processes ([1990]:4, 186-187): verbs may be considered to represent kinds of processes or events. However, the idea that the grammatical distinction between nouns and verbs is grounded in the ontological dichotomy of objects versus events or processes has always been intuited by natural language syntacticians (Tesnière [1959]). Objects persist through time in virtue of core characteristics that are fully present throughout their life. Processes exist in time by having different phases at different instants, with the exception of events, which are instantaneous boundaries of processes (Simons [1987]). Strawson argues that events or processes are dependent on objects with regard to their identification ([1959]:39, 45-46). Objects enjoy an ontological priority over events or states. The dependence of an occurrence on an object is called participation in DOLCE (Gangemi et al. [2002]).
3 Object Participation in Language
The different ways objects participate in occurrences (processes or events) have been studied by linguists interested in the phenomenon of verb complementation or thematic roles. These are partly syntactic, partly semantic relations between noun phrases and the main verb of a sentence. Thematic roles correspond to the different parts that referents play in the occurrence expressed by the verb (Parsons [1990]:72-73). Table 1 shows Parsons' ([1990]:73-78) list of thematic roles together with their definitions and examples. Obviously, Parsons' empirically assembled list lacks an ontological systematisation. We count three subject-related roles: Agent, Experiencer and Performer, where the Person/Thing and Private/Public distinctions are muddled together. In (Schneider [2001]), a coherent ontological account of thematic roles and ways of participation is given along the following lines. Firstly, we consider as basic only those thematic roles which express mere specifications of the participation relation that are neutral as to the types of occurrences or objects involved. The result is shown in Table 2. Secondly, Agent, Experiencer and Performer are defined using our four elementary thematic roles and the basic particular-types. We are here in the presence of two orthogonal oppositions:
Table 1. Parson’s Classification of Thematic Roles Thematic Roles
Definition
Example sentences
Agent
Person initiating the event
John writes a book. The book is signed by John.
Theme
Entity affected by the event
Mary reads a book. Mary blushed at his sight.
Goal
Addressee
John gives Mary a rose. Anna writes a letter to Mary.
Benefactive
Entity to whose benefit the event occurs
Mary gave Anne a party. John signs a book for Mary.
Experiencer
Person the event is an experience of
Mary sees a rose. John thinks about Mary.
Instrument
Thing the event is accomplished with
John opens the letter with a knife.
Performer
Thing initiating the event
The knife opened the letter.
Table 2. Revised Classification of Basic Thematic Roles Thematic Roles
Definition
Example sentences
Origin
Entity initating the event
John writes a book. A stone hits the window. The book is signed by John. The window was hit by a stone.
Theme
Entity affected by the event
Mary reads a book. Mary blushed at his sight.
Addressee
Entity the event is directed to
John gives a rose to Mary. Mary gives water to her flowers.
Benefactive
Entity to whose benefit the event occurs
Mary gave Anne a party. John signs a book for Mary.
1. Agent or Experiencer vs. Performer: the difference between personal and non-personal origins of occurrences;
2. Agent or Performer vs. Experiencer: the difference between a physical and a mental occurrence of which the object is an origin.
To clean up these orthogonal classifications, we first define a new thematic role, namely Initiator, and redefine Performer, as restrictions of the Origin role to
the object-types Person and Body respectively. An object x is an initiator of an occurrence y if and only if x is an origin of y and x is a person. An object x is a performer of an occurrence y if and only if x is an origin of y and x is a body. The thematic roles of Agent and Experiencer are then characterised as specifications of the Initiator role. Indeed, if x is an initiator of an occurrence y, then x is an agent of y if and only if y is a public or physical occurrence, and x is an experiencer of y if and only if y is a mental or private occurrence of x. Thus, by using basic ontological distinctions, we can transform a flat, unsystematised list of thematic roles into a reasoned taxonomy.
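These definitions translate directly into executable form. The following fragment is a sketch in Python with predicate names chosen by us (the system described in Section 4 is actually written in Qu-Prolog); it merely restates the taxonomy: Initiator and Performer restrict Origin to persons and bodies, while Agent and Experiencer further restrict Initiator by the public or private character of the occurrence.

def is_initiator(x, y, kb):
    # x initiates y iff x is an origin of y and x is a person
    return kb["origin"](x, y) and kb["person"](x)

def is_performer(x, y, kb):
    # x performs y iff x is an origin of y and x is a body
    return kb["origin"](x, y) and kb["body"](x)

def is_agent(x, y, kb):
    # an initiator of a public (physical) occurrence
    return is_initiator(x, y, kb) and kb["public"](y)

def is_experiencer(x, y, kb):
    # an initiator of a private (mental) occurrence
    return is_initiator(x, y, kb) and kb["private"](y)

# Hypothetical knowledge base for "Joan liked Marcel": o1 is a liking whose origin is Joan.
kb = {
    "origin":  lambda x, y: (x, y) == ("joan", "o1"),
    "person":  lambda x: x == "joan",
    "body":    lambda x: False,
    "public":  lambda y: False,
    "private": lambda y: y == "o1",
}
print(is_experiencer("joan", "o1", kb))  # True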
4 A Proof of Concept
The role of a semantic ontology with respect to natural language understanding in multi-agent systems has been exemplified in (Schneider [2001]) by implementing, as a simple proof of concept, a reasoning agent as a server capable of processing natural language queries from multiple human-operated clients. Concretely, this reasoning agent is able to engage in a game of challenges and answers: opponents send natural language assertions to be parsed, proved or disproved, the agent justifying its answers by indicating the respective logical form (meaning) or proof established on the basis of a semantics/ontology shared with the human opponent. A way to realise such a system is to implement it as a client-server architecture, the server being the reasoning agent and the client(s) operated by human users. The server spawns off a new thread for each client, thus allowing peer-to-peer communication. Multi-agency is thus not merely implemented by pairing off a single program with a single human, but actually involves multi-threading and inter-thread communication. This architecture has been implemented in Qu-Prolog, a distributed and concurrent version of Prolog (Clark, Robinson and Hagen [1999]; Robinson [2000]). By spawning a new thread or agent at each client's request, the reasoner server becomes the central node in a multi-agent system of communicating human and non-human peers. The content of the humans' messages consists of declarative natural language sentences whose meaning, i.e. logical form, mirrors the everyday conceptual framework of intelligent primates. Shared understanding is made possible by arranging that the software agents have the same ontology, i.e. set of fundamental conceptual distinctions, as their human partners. This ontology is the basis for the semantics of the natural language fragment used by the humans to communicate with their non-human peers. Sharing this ontology as a part of their knowledge base, the software agents have the capability of parsing and proving, i.e. of understanding and reasoning upon, the assertions submitted to them. The logical form and proof computed by a thread of the reasoning server reflect the semantic and ontological intuitions of the human operators. The primitive and defined predicates of the semantic ontology, i.e. the basic particular-types Person and Body as well as the thematic roles discussed in the
previous section, are used directly in parsing the natural language sentences submitted by the human users. The parser is basically a logic grammar translating into action the semantical analysis of verbs and sentences in terms of underlying events and thematic roles. The thematic roles occurring in the entries of verb meanings in the parser's lexicon are declared in the semantic ontology, which constitutes a separate module of the agent's knowledge base. As an illustration, we describe a sample run of the system presented in Schneider [2001] (Figure 1). After connecting to the server, the user requests the proof of the natural language sentence "joan liked marcel" by the server thread spawned for that purpose. First, the server thread parses this sentence into its logical form, expressing that there exists an occurrence which is a liking, whose experiencer is Joan and whose theme is Marcel. Second, the server thread proves this first-order logical formula using its knowledge base, displaying the steps of the deduction. Indented steps indicate backtracking triggered successively on the conjuncts in the body of the definitions of Experiencer, Performer and Private respectively. These definitions are not part of the particular domain-related knowledge of the agent, but belong to the semantic ontology as a separate module.
| ?- reasoner_prove.
> joan liked marcel
exists : (_34A , ((liking(_34A) , (exp(_34A, joan) , th(_34A, marcel)))))
liking(o(20))
exp(o(20), joan)
    perf(o(20), joan)
    or(o(20), joan)
    person(joan)
    private(o(20))
    liking(o(20))
th(o(20), marcel)
yes
Fig. 1. A sample run

By constructing the semantic ontology as a separate module in the architecture of a reasoning agent, three essential goals for agent implementations are achieved: scalability, reusability and maintainability. By storing the declarations and definitions of the various ontological types and roles outside of the lexicon, the latter can be scaled down in size, thus enhancing the efficiency of the parser. As a component of its own, a semantic ontology is easier to share between applications and to reuse in various contexts involving different parsing technologies. Finally, a semantic ontology as a distinct module is trivially easier to maintain, without the need to modify other components of the multi-agent architecture.
5 Conclusions
The aim of this paper has been to outline a minimal semantic ontology, a set of high-level concepts that can serve as a basis for specifying the logical form of agent messages using natural language. Its main inspiration comes from the semantical analysis of natural language, as well as from philosophical accounts of the commonsense view of the world. A semantic ontology may be put to two uses: to define the fundamental concepts necessary for agents to communicate and to reason, and to contribute to the computational analysis of natural language. The two uses can be combined in a multi-agent system involving humans and thus resorting to natural language communication. A simple instance of a multi-agent architecture based on natural language communication that illustrates both of these uses has been sketched at the end of this paper. We emphasised the advantages in terms of scalability, reusability and maintainability of having a semantic ontology as a separate module in a multi-agent architecture.
Acknowledgements This paper is partly based on work supported by the Alexander von Humboldt Foundation under the auspices of its Wilhelm Paul Programme.
References
[1999] Clark, K., Robinson, P. and Hagen, R. 1999. Multi-Threading and Message Communication in Qu-Prolog. Technical Report 99-41, Software Verification Research Center, University of Queensland.
[Davidson 1980] Davidson, D. 1980. "The Logical Form of Action Sentences". In Davidson, D., Essays on Actions and Events. Oxford: Oxford University Press, 105-148.
[2003] Grenon, P. 2003. "The Formal Ontology of Spatio-Temporal Reality and its Formalization". AAAI Spring Symposium on the Foundations and Applications of Spatio-Temporal Reasoning 2003.
[1991] Gruber, T. 1991. "The Role of Common Ontology in Achieving Sharable, Reusable Knowledge Bases". Principles of Knowledge Representation and Reasoning: Proceedings of the Second International Conference. Morgan Kaufmann, San Mateo/CA.
[1995] Gruber, T. 1995. "Toward Principles for the Design of Ontologies Used for Knowledge Sharing". International Journal of Human and Computer Studies. Special Issue: Formal Ontology, Conceptual Analysis and Knowledge Representation.
[2002] Gangemi, A., Guarino, N., Masolo, C., Oltramari, A., Schneider, L. 2002. "Sweetening Ontologies with DOLCE". Proceedings of EKAW 2002. Siguenza, Spain.
[1990] Parsons, T. 1990. Events in the Semantics of English: A Study in Subatomic Semantics. Cambridge/MA: MIT Press.
[2000] Robinson, P. 2000. Qu-Prolog 6.0 User Guide. Technical Report 00-20, Software Verification Research Center, University of Queensland.
[2001] Schneider, L. 2001. "Naïve Metaphysics. Merging Strawson's Theory of Individuals with Parsons' Theory of Thematic Roles as a Basis for Multiagent Semantics". MSc Computing Science thesis, available as Departmental Technical Report 2002/5. London: Department of Computing, Imperial College London.
[1987] Simons, P. 1987. Parts. A Study in Ontology. Oxford: Clarendon.
[2002] Smith, B. 2002. "Basic Formal Ontology". http://ontology.buffalo.edu/bfo/
[1959] Tesnière, L. 1959. Éléments de syntaxe structurale. Klincksieck, Paris.
[1959] Strawson, P. 1959. Individuals. An Essay in Descriptive Metaphysics. London: Routledge.
Ontology Management for Agent Development
Kwun-tak Ng, Qin Lu and Yu Le
Department of Computing, The Hong Kong Polytechnic University, Hong Kong
{csktng,csluqin,c0955928}@comp.polyu.edu.hk
Abstract. The content of agent messages is often defined according to some ontology through which agents can communicate with each other based on a common vocabulary. However, when the ontology evolves, existing agents may not be able to communicate properly with new agents built on the evolved ontology unless the existing agents are redesigned or re-implemented. In this paper, we propose an ontology management platform called OntoWrap with three distinct components. Firstly, it maintains an ontology repository which keeps track of changes to an ontology so that the relationships between different versions of the ontology and their extensions are maintained. Secondly, it maintains a service repository which links ontologies with agent service specifications so that different implementations of an ontology are also maintained. Thirdly, based on the ontology versioning and service specifications, a wrapping service framework is provided for the automatic generation of adapter agents that allow existing agents to make use of new services.
1 Introduction
In multi-agent systems, autonomous agents can look for suitable agent services as they navigate through various hosts, which we call service discovery. From the perspective of the roles they play, we categorize agents into two classes: client agents and server agents, where client agents seek services from server agents and server agents provide services to client agents. An agent can be both a client agent and a server agent depending on its role. That is, if a server agent A makes use of a service provided by agent B, A is considered a client agent of agent B. However, if agent A provides a service to another agent C, A is considered a server agent for C. Each server agent must provide a set of services to client agents, which we refer to as the agent services. An agent service directory server is designated to provide a directory service to help agents locate their required agent services autonomously. Most directory services currently only provide run-time binding for a client to a service agent of which the client has complete knowledge beforehand. There are insufficient mechanisms for the functionalities of agent services to be declared or exported so as to facilitate the development of agent applications. The objectives of this work are (1) to find a way for agent service providers to declare the types of services they can provide, with interfaces and methods, to facilitate the development of agent applications on-line, and (2) methods for agent service providers to map their agent services to what is required by client agents.
In a multi-agent model, agents accomplish tasks and achieve desired objectives by cooperation and coordination with each other. Cooperation and coordination, in turn, are done through message exchanges. These messages are written in an agent communication language (ACL). The content of a message is written based on an ontology which is understood in a specific domain so that agents can speak the same language and can reach conclusions through reasoning and inference processes. Ontology, generally speaking, models and conceptualizes the agent world [1]. It specifies, in a specific domain of interest, the concepts, objects and relationships among them [2]. It provides a framework for describing the semantics of terms and data in multi-agent systems. In this paper, we follow the FIPA definition [3] and refer to an ontology as the specification of concepts, propositions, rules, and actions, usually in a certain confined domain, such as music, biology, or e-commerce. The ontology here provides a common language and vocabulary for different agents to talk to each other so that messages can be interpreted uniformly. In practice, an ontology evolves, as a specific domain always has new concepts being introduced into it, and the semantics of existing terms may also be modified over time. If such changes are not properly traced and managed, the use of the ontology is hindered. Agents would be speaking different "languages" at different times, even though the point of having an ontology is that every agent speaks the same language. For example, if an old client agent uses ontology OA, it cannot use the service provided by a new service agent if the new service is built based on a revised ontology OA' which is in the same domain as OA. All agents are built on some domain-specific ontology, either explicitly or implicitly. In this work, we aim to achieve our objectives through the use of ontology management and the linking of ontology management with agent services. We propose a platform, referred to as OntoWrap, which makes use of ontology and its management for service providers to declare their services. Client developers can also declare required services through OntoWrap. With such declarations, new service providers can provide services to existing clients. Changes and enhancements can be tracked through the management of the ontology repository. More specifically, the ontology platform in our system not only specifies the semantics of agent services, but also provides interface specifications for clients to use such systems, with semantic information provided. Through the ontology management mechanisms, changes and extensions to services can be traced, and new service providers can make sure that their services can be provided both for new client agents and for clients who are using older versions of the ontology. The rest of this paper is organized as follows. Section 2 introduces related work. Section 3 analyzes the evolution of ontology. Section 4 presents the platform architecture. Section 5 discusses implementation issues. Section 6 gives our conclusion.
2 Related Work
Currently, there are a few supporting tools using ontology to facilitate agent development. ParADE [4] is a tool for providing development support for autonomy and interoperability on FIPA-compliant platforms. It is composed of a set of development
tools at two levels of abstraction: the agent level and the object level. The agent level models the architecture of the multi-agent system and the elements that agents use to communicate, such as social organization, goals and beliefs, whereas the object level is used for application-specific behavior implementation or legacy code integration. However, ParADE does not consider the issue of software reuse nor the evolution of the ontology. Protégé 2000 is an ontology tool to help developers construct a domain-specific ontology by defining concepts, predicates, slots and agent actions [5]. With an appropriate plug-in [6], the tool can generate Java classes to support the definition of ontologies in JADE [5]. Both of the above-mentioned tools are designed to facilitate the development of agent applications by providing an ontology interface which links the ontology to the classes/objects which are used to implement the services. However, they do not address the ontology reuse issue, nor are they concerned with ontology management so as to keep track of ontology changes during the development process.
3 Ontology Evolution Analysis
Ontology evolves either because of domain change, shared conceptualization change, or specification change [8]. Once the ontology is implemented in an application, feedback from end users and developers may lead to conceptual changes in the application model. Ontology changes can be categorized as either integration changes [9] or merge changes. Integration changes result from adapting an ontology from a different domain into a new ontology for a specific application, whereas merge changes result from aligning ontologies in the same domain into a consolidated common ontology. Both integration and merge finally result in some designated ontologies for an application. Variants of these ontologies are developed as changes are made. Consequently, there are two kinds of relationships in ontology changes: the relationships between different or similar subject ontologies and the relationships between ontologies of different versions, which we refer to as ontology-variant relationships. In this project, we focus only on the study of ontology-variant relationships and how to transform one version of an ontology into another. We use the concept of ontology versioning and its management to relate one version of a concept to another version of that concept explicitly, so as to keep track of these variants. This by no means eliminates the capability of the system to keep track of ontology changes due to references to different/similar subjects. If a developer has choices, such relationships can be captured by the version change. When changes are made to the existing ontology, a new version of the ontology is created. In order to keep track of ontology-variant relationships, the actual ontology change operations applied to an ontology should be recorded. Basically, we can consider that all changes are made by three basic operations: add, delete and rename. The add operation extends the existing ontology with new ontology elements; it is typically caused by the advancement of technology and additional concepts found in the domain. The delete operation removes elements from the ontology; an obsolete concept may be deleted in a new ontology version.
The rename operation renames the identifier of an element but keeps the same semantic meaning of the original construct. Generally speaking, any change to the ontology can be described by a sequence of the three operations. For example, a replacement operation can be considered as a delete operation followed by an add operation. However, changes to an ontology can be either syntactic or semantic, and this must be understood if an automatic wrapping service is to be provided. This issue is discussed in Section 5.
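To make the idea of recording change operations concrete, the sketch below (Python; class and method names are ours, not OntoWrap's actual interfaces) shows an ontology version that logs every add, delete and rename applied to it, using the Music Shop example discussed in Section 5. The recorded log is what later enables the semi-automatic generation of adapter agents.

from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class ChangeOp:
    kind: str           # "add", "delete" or "rename"
    element: str        # affected ontology element (concept, predicate, action)
    new_name: str = ""  # only used by "rename"

@dataclass
class OntologyVersion:
    name: str
    version: int
    elements: Set[str]
    log: List[ChangeOp] = field(default_factory=list)

    def add(self, element):
        self.elements.add(element)
        self.log.append(ChangeOp("add", element))

    def delete(self, element):
        self.elements.discard(element)
        self.log.append(ChangeOp("delete", element))

    def rename(self, element, new_name):
        # the identifier changes, the intended semantics of the construct does not
        self.elements.discard(element)
        self.elements.add(new_name)
        self.log.append(ChangeOp("rename", element, new_name))

# A replacement is recorded as a delete followed by an add.
v2 = OntologyVersion("MusicShop", 2, {"CD", "Sell", "Own"})
v2.delete("Own")
v2.add("hasStock")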
Fig. 1. OntoWrap components: a Conceptual layer (Ontology Editor), an Ontology layer (Online Ontology Repository), an Implementation layer (Agent Online Service Repository and Agent Wrapper Framework), and an Application/Runtime layer (the agent-based application running on the agent platform).
4 System Architecture Overview
The proposed ontology management platform OntoWrap is composed of four components, as shown in Figure 1. Firstly, it maintains an ontology repository which keeps track of changes to ontologies so that the ontology-variant relationships between different versions of an ontology and their extensions are maintained. Secondly, the ontology editor provides functions to modify and create ontologies. Thirdly, it keeps a service repository which links ontologies with agent service specifications so that different implementations of an ontology are also maintained. Fourthly, based on the ontology versioning and service specifications, it provides a wrapping service framework for the automatic generation of adapter agents that allow existing agents to use new services. Before the development of any multi-agent system, the problem domain must first be analyzed. Through conceptualization, the problem domain and the planned target objectives are realized in a defined ontology, which enables an agent to understand message content during agent communication. The produced ontology thus reflects and binds the capability of an agent. Ontologies for various domains are created and stored in the ontology repository. When there are ontology changes, the versioning function keeps track of these changes and distinguishes ontology variants. The agent service specification describes the syntax and semantics of the services provided by an agent, such as the agent behavior, pre-requisite state, final state, and input arguments. The semantics are linked to the ontology repository and the syntax is written in an object-oriented language. It should be pointed out that service agents do
not need to understand the complete ontology nor implement everything in an ontology. The linking of the service repository to the ontology repository provides the platform for service agents to declare their capabilities with respect to a defined ontology. The service repository also provides client agents with a platform to declare the kind of generic service they require with respect to an ontology. As a result, service agents developed later have a way to know how to accommodate existing client agents. This also makes automatic wrapping of new services to existing clients possible.
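A minimal sketch of the linkage between the two repositories might look as follows (Python; all names are ours and are not taken from OntoWrap): each service declaration points into the ontology repository by ontology name and version, so that providers and clients written against different versions can still be matched, with a version mismatch being what calls for adapter generation.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ServiceSpec:
    agent: str                 # declaring agent (provider or client)
    service: str               # declared service, e.g. "Sell"
    ontology: Tuple[str, int]  # (ontology name, version) the declaration is written against

class ServiceRepository:
    def __init__(self):
        self.specs: List[ServiceSpec] = []

    def register(self, spec: ServiceSpec):
        self.specs.append(spec)

    def providers_for(self, service: str, ontology: str) -> List[ServiceSpec]:
        # return every provider of 'service' for the named ontology, whatever the version;
        # a version mismatch with the client is what triggers an adapter agent
        return [s for s in self.specs
                if s.service == service and s.ontology[0] == ontology]

repo = ServiceRepository()
repo.register(ServiceSpec("shop-agent", "Sell", ("MusicShop", 2)))
print(repo.providers_for("Sell", "MusicShop"))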
5 System Implementation
The OntoWrap system is being implemented on the JADE platform. To facilitate easy access and maintenance, OntoWrap is accessed through the Internet. Various basic functions such as browse, search, and version control are provided. The browse function allows a developer to look up a specific ontology module by reading through a simple list of ontology modules with appropriate links between different versions. A developer can also enter keywords to search for the taxonomies defined in the ontology. A web-based directory service can serve this purpose.
Fig. 2. Ontology editor interface
The ontology editor is the user interface for accessing the ontology repository. Our ontology editor is a tab widget plug-in based on Protégé 2000 [6]. Protégé 2000 is a generic ontology editor that allows a user to construct classes, attributes and relationships between classes. Yet, it lacks customized support for agent development. Following the FIPA 2000 specification, an ontology includes elements such as concepts, actions, predicates and relationships. With these specifiable elements, ontology experts can create elements and define appropriate attributes. All elements defined in FIPA 2000 are supported and maintained in our system. As an example, shown in Figure 2, three classes, namely Concept (such as AID and CD), AgentAction (such as Sell), and Predicate (such as Own), are created for the MusicShop ontology. All the change-related operations such as add, delete and rename must be done explicitly
through the interface, as shown in the middle section of the interface. This allows the system to trace every change made to the ontology. Any change to an ontology is recorded in a portable format written in XML. Similar to the ontology repository, the service repository has a web-based interface to facilitate the browse and search functions. A developer can register an agent service specification in the repository. It is connected with the ontology repository so that it shows which parts of the ontology have an agent implementation. The details of service management will be discussed in a separate paper. Based on the versioning information recorded in the ontology repository and the declarations maintained by the service repository, adapter agents can be generated that convert agent message content from one ontology to another, provided that these two ontologies have some commonality or relationship. The adapter agent building process is semi-automatic due to the fact that not all mapping issues can be resolved automatically. We distinguish two types of mapping issues, namely syntactic and semantic. The syntactic issue is related to changes of the ontology definition in representation only. For example, the price slot of a CD concept in the Music Shop ontology may change from a string type to a floating-point data type. Pre-defined syntactic conversion rules can be defined to convert an ontology element from one type to another. The semantic issue is related to the semantics of the basic ontology change operations. For example, if the ontology expert decides to delete a concept CD (Compact Disc) and add a concept MD (Mini-Disc) in the Music Shop ontology, he can conveniently rename concept CD to concept MD instead of using delete and add operations. Although the CD and MD concepts have almost the same attributes, such as album name, artist, sound track and playing time, they are indeed two different things semantically. If an instance of concept CD is mapped to an instance of concept MD during agent communication, consistency is questionable: a client agent that plans to buy a CD will not expect to get an MD. Thus, the change operations should be used carefully. Ontology experts need to resolve this kind of mapping problem if a change operation is not applied properly. Another example is a predicate owns(CD, owner) changed to the predicate hasStock(CD, owner) by a rename operation. Although the two predicates have the same arguments, they have different meanings and different interaction protocols. Although OntoWrap has a basic set of change operations, it may not be sufficient to modify an ontology without restriction. For example, we can move the CD concept in an ontology from its parent Item to MusicMedia by a delete and then an add operation. However, these two generic operations make it impossible to associate the CD in the new ontology with the CD in the original ontology; CD becomes a new concept after the add operation. Additional constraints, such as restricting an add to re-adding the deleted concept only, should be applied appropriately to these generic operations so as to avoid losing the original meaning and relationships. The use of different operations may result in different modified ontology structures [10]. For example, when a concept is deleted from the middle of the tree, its subconcepts may either be deleted from the tree or be connected to the parent concept of the deleted concept. The final states of the ontology tree structures may be totally different.
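A sketch of how an adapter might apply such recorded rules is given below (Python; the rule format and function names are ours). Syntactic type conversions are applied automatically, while renames are only flagged, since, as in the CD/MD case above, they may hide a semantic change that the ontology expert has to confirm.

# Hypothetical conversion rules derived from the recorded change log
TYPE_RULES = {("CD", "price"): float}    # price slot changed from string to float
RENAME_RULES = {"owns": "hasStock"}      # predicate renamed between versions

def adapt_content(content: dict) -> dict:
    # 'content' maps (concept, slot) pairs to values expressed against the old version
    adapted = {}
    for (concept, slot), value in content.items():
        if (concept, slot) in TYPE_RULES:
            value = TYPE_RULES[(concept, slot)](value)   # purely syntactic conversion
        adapted[(concept, slot)] = value
    return adapted

def needs_expert_review(predicate: str) -> bool:
    # a rename may hide a semantic change, so it is only flagged here
    return predicate in RENAME_RULES

old_message = {("CD", "price"): "12.50", ("CD", "album"): "Blue"}
print(adapt_content(old_message))   # {('CD', 'price'): 12.5, ('CD', 'album'): 'Blue'}
print(needs_expert_review("owns"))  # True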
6 Conclusions
In this paper, we pointed out the important role of ontology in an agent development environment. As an ontology evolves over time, so must the agent implementation. In order to reduce the cost of redevelopment and facilitate implementation, we propose a system that links ontology management with agent development. Ontology versioning keeps track of ontology changes. The ontology editor not only assists the user in constructing the ontology but also records the change operations. These recorded changes are useful for the automatic generation of adapter agents, which aim at reusing the existing multi-agent system implementation.
Acknowledgements The work is partially supported by the PolyU research project (Z044).
References
[1] Gruber, T. R.: Toward principles for the design of ontologies used for knowledge sharing. Intl. Journal of Human-Computer Studies, Vol. 43. Academic Press (1995) 907-928
[2] Wooldridge, M.: Introduction to Multiagent Systems. John Wiley & Sons (2002)
[3] FIPA.org: FIPA 2000 Specification [Internet]. Foundation for Intelligent Physical Agents. [Accessed 22 October 2002]
[4] Bergenti, F. and Poggi, A.: A Development Toolkit to Realize Autonomous and Interoperable Agents. In: Proceedings of the Fifth International Conference on Autonomous Agents. Montreal (2001) 632-639
[5] Bellifemine, F., Poggi, A. and Rimassa, G.: JADE - A FIPA-compliant Agent Framework. In: Proceedings of the Practical Application of Intelligent Agents and Multi-Agents. London (1999) 97-108
[6] Noy, N. F., Fergerson, R. W. and Musen, M. A.: The Knowledge Model of Protégé-2000: Combining Interoperability and Flexibility. In: 2nd International Conference on Knowledge Engineering and Knowledge Management, Juan-les-Pins, France (2000)
[7] IBROW Project: Ontology Bean Generator for Jade 2.5 [Internet]. Universiteit van Amsterdam. [Accessed 22 October 2002]
[8] Klein, M. and Fensel, D.: Ontology Versioning for the Semantic Web. In: Proceedings of the International Semantic Web Working Symposium. Stanford University, California, USA (2001) 75-91
[9] Pinto, H. S. and Martins, J. P.: A Methodology for Ontology Integration. In: Proceedings of the International Conference on Knowledge Capture. Victoria, Canada (2001) 131-138
[10] Stojanovic, L., Maedche, A., Motik, B. and Stojanovic, N.: User-driven Ontology Evolution Management. In: Proceedings of the 13th European Conference on Knowledge Engineering and Knowledge Management. Madrid, Spain (2002)
A Multiagent, Distributed Approach to Service Robotics
Maurizio Miozzo, Antonio Sgorbissa, and Renato Zaccaria
DIST – University of Genoa, Via Opera Pia 13, Tel +39 010 3532801
{maurizio.miozzo,antonio.sgorbissa,[email protected]
Abstract. We propose a multiagent, distributed approach to Service Mobile Robotics which we metaphorically call the "Artificial Ecosystem": robots are thought of as mobile units within an intelligent environment where they coexist and co-operate with fixed, intelligent devices that are assigned different roles: helping the robot to localize itself, controlling automated doors and elevators, detecting emergency situations, etc. In particular, intelligent sensors and actuators (i.e. physical agents) are distributed both onboard the robot and throughout the environment, and they are handled by Real-Time software agents which exchange information on a distributed message board. The paper describes the approach and a case study within a hospital, outlining the benefits in terms of efficiency and Real-Time responsiveness.
1 Introduction
In the mobile robotics literature, the discussion has often focused on the relative importance to be given to reactive and deliberative activities in governing the robot's actions: see, for example, the contraposition between behaviour-based architectures (such as the Subsumption Architecture [1] and ALLIANCE [2]) and hybrid approaches (such as AuRA [3] and ATLANTIS [4]). However, in spite of their differences, almost all existing systems share a philosophical choice (partially due to efficiency reasons) which unavoidably leads to a centralized design. We call it the autarchic robot design; it can be summarized in the following two points:
• robots must be fully autonomous: i.e., they are often asked to co-operate with humans or other robots, but they mainly rely on their own sensors, actuators, and decision processes in order to carry out their tasks;
• robots must be able to operate in non-structured environments (i.e. environments which have not been purposely modified to help robots perform their tasks).
Up to now, this approach has not been very successful. No robot (or team of robots) has yet proven to be really "autonomous" in a generic, "non-structured" environment, i.e. able to work continuously, for a long period of time, carrying out its tasks with no performance degradation or human intervention. On the contrary, the few examples of robots which come closest to being considered autonomous (in the sense
which has just been given) were designed to work in a specific – even if unstructured – environment: see for example the museum tour-guide robot Rhino [5], which heavily relied on the building's ceiling lights to periodically correct its position in the environment. Having learnt from this lesson, in the "Artificial Ecosystem" approach we face all the problems related to autonomy in a fully multiagent perspective; robots are thought of as mobile physical agents within an intelligent environment where they coexist and cooperate with fixed physical agents, i.e. intelligent sensing/actuating devices that are assigned different roles: devices which provide the robots with clues about their position in the environment, devices that control automated doors and elevators, devices for detecting emergency situations such as fires or liquid leaks, cameras for surveillance, etc. Both robots and intelligent devices are handled by software agents, which can communicate through a distributed message board and are given a high level of autonomy. As a consequence, autonomy (and intelligence) is not just a characteristic attributed to robots, but is distributed throughout the building (Fig. 1a): we say that robots are autonomous but not autarchic. Next, we extend the concept of intelligent devices to the sensors and actuators on board the robot: analogously to the fixed, intelligent devices distributed throughout the building, onboard sensors and actuators are implemented as autonomous devices, i.e. they are handled by software agents which can communicate through the distributed message board and are given a high level of autonomy in taking decisions. Thus, intelligent control on the robot is performed at two different levels (Fig. 1b): at the lower level, higher reactivity is reached through simple reactive agents which control sensors and actuators; at a higher level, sensorial inputs are collected by agents running on the onboard computer, which perform more sophisticated computations before issuing control to actuators.
Fig. 1. The Artificial Ecosystem – a) agents in the building; b) agents on the robot
This increases the tolerance of the system to failures: the coupling of intelligent sensors and actuators is able to produce an emergency behaviour even in case of a failure in the onboard software. In Section 2 we show the overall architecture of the system; in Section 3 we show a case study at Gaslini Hospital of Genoa and discuss some experimental results. Conclusions follow.
2 Agents in the Artificial Ecosystem
Three different types of software agents are devised in the AE approach (agents are classified according to Russell and Norvig's definition [6]); a minimal sketch of the three types follows this list:
1. simple reflex agents, i.e. agents with no internal state governed by condition-action rules. These agents are used for purely reactive behaviours, e.g. stopping the motors to avoid an imminent collision (a task fundamental for mobile robots) or opening an automated door upon request (a task usually assigned to fixed devices).
2. agents that keep track of the world, i.e. agents which maintain an internal state and/or representation in order to choose which action to perform. These agents are used for more complex tasks, such as avoiding obstacles on the basis of a continuously updated local map of the environment (a task for mobile robots) or controlling an elevator dealing with multiple requests (a task for fixed devices).
3. goal-based agents, i.e. agents which handle goals and find action sequences to achieve these goals. These agents are used for high-level planning tasks, such as finding an action sequence to reach a target location (a task for mobile robots): e.g. opening an automated door, moving into a room, calling an elevator, and so on.
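The following Python sketch illustrates the three agent types in skeleton form; the class names, rule format and example percepts are invented for illustration and do not come from the AE implementation.

```python
# Illustrative sketch of the three agent types; names and rules are invented.

class SimpleReflexAgent:
    """Type 1: condition-action rules, no internal state."""
    def __init__(self, rules):
        self.rules = rules                      # list of (condition, action)

    def act(self, percept):
        for condition, action in self.rules:
            if condition(percept):
                return action
        return "noop"

class ModelBasedAgent(SimpleReflexAgent):
    """Type 2: keeps track of the world through an internal state."""
    def __init__(self, rules):
        super().__init__(rules)
        self.state = {}

    def act(self, percept):
        self.state.update(percept)              # e.g. a local obstacle map
        return super().act(self.state)

class GoalBasedAgent:
    """Type 3: searches for an action sequence that achieves a goal."""
    def __init__(self, planner):
        self.planner = planner

    def act(self, state, goal):
        return self.planner(state, goal)        # e.g. ["open door", "enter room"]

# A bumper-stop reflex expressed as a type 1 rule:
stop_rule = (lambda p: p.get("bumper_pressed", False), "stop motors")
print(SimpleReflexAgent([stop_rule]).act({"bumper_pressed": True}))  # 'stop motors'
```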
Notice that the agents in the system are required to run on different hardware platforms and operating systems: agents controlling intelligent devices are scheduled by dedicated low-cost microprocessors with small computational power and memory storage (no type 3, goal-based agents run on fixed devices), while agents for high-level control of the robot are executed on a standard PC platform with the Linux OS; we will refer to them as ID agents and PC agents, respectively. To implement ID agents [7], we rely on fieldbus technology for distributed control (in particular Echelon LonWorks, a standard for building automation); for PC agents, we rely on ETHNOS (Expert Tribe in a Hybrid Network Operating System), an operating system and programming environment for distributed robotics applications which was introduced in [8].
Fig. 2. a) ID agents producing a "stop the robot" behaviour; b) PC agents generating a trajectory
In spite of differences in their internal model and implementation, ID and PC agents have some common requirements in terms of Communication and Scheduling:
1. Communication - agents are not omniscient: at any given time, they have only partial knowledge of the state of the world and of the system. In order to share information, all agents communicate on the basis of a publish/subscribe protocol. This makes it possible to dynamically add or remove agents without other agents being aware of it, thus increasing the versatility and reconfigurability of the system. In particular, we implement a two-level message board (Fig. 1), composed of 1) a global message board, which contains information of general interest and to which all agents in the system can publish/subscribe, and 2) many local message boards, which contain information that is relevant only within a single subset of agents and to which only those agents can publish/subscribe (e.g. only the agents running on a robot need to know the trajectory to be followed to avoid an obstacle).
2. Scheduling - agents have different requirements in terms of computational resources and timing constraints. Since we want the system to operate in the real world and to deal with an uncertain and dynamic environment, the system architecture has soft/hard Real-Time characteristics in order to guarantee predictable and safe behaviour of the system when computational resources are limited. Notice that ID agents, owing to the limitations of the Echelon operating system, are implemented as soft Real-Time tasks; in contrast, PC agents are handled by ETHNOS, which allows the concurrent scheduling of hard Real-Time processes according to the Rate Monotonic scheduling policy. For implementation details see [7] and [8].
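As a rough illustration of point 1, the sketch below implements a two-level publish/subscribe message board in Python; the topic names and the callback-based API are assumptions made for the example and are not the actual ETHNOS or LonWorks interfaces.

```python
# Minimal two-level publish/subscribe message board (illustrative only).
from collections import defaultdict

class MessageBoard:
    def __init__(self):
        self.subscribers = defaultdict(list)    # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Agents can be added or removed without the publishers knowing.
        for callback in self.subscribers[topic]:
            callback(message)

global_board = MessageBoard()                   # shared by all agents
local_board = MessageBoard()                    # shared by onboard agents only

# A motor agent subscribes to speed/jog commands on the local board;
# a beacon agent posts localization clues to the global board.
local_board.subscribe("SpeedJog", lambda m: print("motors set to", m))
global_board.subscribe("LandmarkPosition", lambda m: print("new landmark", m))

local_board.publish("SpeedJog", {"speed": 0, "jog": 0})        # emergency stop
global_board.publish("LandmarkPosition", {"x": 2500, "y": 6500})
```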
3 A Case Study: An Artificial Ecosystem at Gaslini Hospital
The AE approach is being tested at the Gaslini Hospital of Genova: the experimental set-up is composed of the mobile robot Staffetta and a set of intelligent devices distributed in the building, whose primary purpose is to control active beacons for localization and elevators. The robot simulates transportation tasks, i.e. it is able to execute the following activities concurrently: 1) plan paths to the target, 2) localize itself, 3) generate smooth trajectories and, finally, 4) follow the trajectory while avoiding unpredicted obstacles in the environment. To achieve this, the robot is equipped with ultrasonic sensors; a laser rangefinder could be used as well (a comparison between sonars and laser rangefinders in terms of noise, precision, safety, and cost is currently the subject of debate in the international research community). All ID agents are type 1 and type 2 agents (in the classification of Section 2), i.e. simple reflex agents and agents with internal state (no goal-based agents). ID agents on board the robot: A1 controls a pneumatic bumper which is able to detect collisions on different areas of the robot chassis; A2 controls sixteen ultrasonic sensors; A3 generates motion reflexes to avoid detected obstacles; A4, A5, and A6 control the two rear motors and the front steering wheel; A7 controls the DLPS (the onboard rotating laser and the infrared communication device, which are part of the
beacon-based localization system); A8 monitors the batteries' state; A9 is directly connected to the motors and works as a watchdog, periodically communicating with the onboard PC and disabling the motors whenever this communication fails; moreover, A9 controls the joystick interface, which is activated for manually displacing the robot when the onboard software is not running; A10 computes inverse kinematics and odometric reconstruction; A11 takes control of the motors in case of a system crash, in order to move the robot to a safe location. Finally – as anticipated – one agent is employed to handle the communication between the ID and the PC agents (through the PCLTA board, an off-the-shelf Echelon component which works as a bridge between the Echelon fieldbus and a standard PC). ID agents in the building: a varying number of agents A12i, each controlling a beacon i for localization, and two agents A13 which are responsible for controlling the elevator. Notice that ID agents, in spite of their internal simplicity, are able to produce interesting behaviours. For example, they guarantee safe motion even when the agents on the onboard PC are no longer able to correctly control the robot's motion, either because of some bug in the software or because of a crash of the operating system. Fig. 2a shows how interacting ID agents can increase the reactivity of the system to obstacles which suddenly appear in the environment (as long as they are detected by the ultrasonic sensors or the pneumatic bumper). The sensorial data collected by A1 (pneumatic bumper) and A2 (US sensors) are posted to the local message board (message type Proximity Data), thus becoming available to PC agents for map building, trajectory generation, localization, etc. However, Proximity Data are also available to the ID agent A3, which is able to generate simple motion reflexes: when required, A3 produces a Speed, Jog message containing speed and jog values which allow an imminent collision to be avoided (for example, speed=jog=0) and posts it to the message board. Finally A4, A5, and A6, which control the two rear motors and the front steering wheel, read the speed and jog values on the message board and consequently stop the motors before waiting for a command from the PC agents (which require higher computational time and are therefore more critical). PC agents can be agents of type 1, 2, and 3, i.e. simple reflex agents, agents with internal state, and goal-based agents (see Fig. 2b). Agent A14 is a type 3 goal-based agent responsible for plan selection and adaptation, allowing the robot to execute high-level navigation tasks such as "go into office A". These tasks, depending on the current robot context, may be decomposed into many, possibly concurrent, sub-tasks such as "localise, go to the door, open door, etc." and eventually into primitive actions such as "go to position (x,y,θ)". In Fig. 2b, A14 plans a path to a target as a sequence of landmark positions, and posts to the message board the location of the next landmark to be visited (message type Landmark Position). Agent A15 is a type 1 agent which subscribes to Landmark Position and is capable of executing smooth trajectories from the robot's current position to the specified landmark, relying on a biologically inspired, non-linear law of motion (the ξ model [8]). It produces speed and jog values which are made available to ID agents A4, A5, and A6 to control the motors; however, A15 cannot deal with obstacles in the robot's path.
In the figure, two shared representations (indicated as grey rectangles) can be observed: the bitmap, an ecocentric statistical dynamic description of the environment the robot moves in, and the APF (Abstract Potential Field), based on the bitmap and on direct sensorial
information. These representations are handled and continuously updated, on the basis of sensor data, to maintain consistency with the real world by two agents of type 2: A16 (for the bitmap) and A17 (for the APF); the details of these agents are beyond the scope of this paper but can be found in [9]. Agents A16 and A17, together with A18, are responsible for obstacle avoidance. In particular, A18 (another type 2 agent) executes a virtual, mental navigation in the abstract potential field, thus determining a virtual trajectory and a landmark position that successfully avoids obstacles while driving the robot towards its final goal. The new Landmark Position is posted to the message board and becomes available to A15 (the ξ trajectory generator) for smooth obstacle avoidance. Notice that, should a collision become imminent, the ID simple reactive agents would take control by stopping the motors to deal with the emergency (as already explained). Consider now both the ID and PC agents on board the robot: the agents in the architecture communicate using messages conveying the type of information or command to be executed at the appropriate level (i.e. Landmark Position, Speed and Jog, etc.). However, all agents operate concurrently and the architecture is only partially hierarchical, implying a dynamic competition for the allocation of "resources". For example, both PC agent A15 (which generates a smooth trajectory) and ID agent A3 (which produces motion reflexes) publish Speed, Jog messages, thus competing with each other; ID agents A4, A5, and A6 need a way to choose which speed and jog commands should be issued to the actuators. The conflict is solved using an authority-based protocol. Each agent is assigned a specific authority (possibly varying in time) with which it issues command messages. The receiving agent stores the authority associated with the command and begins its execution (for example, speed = 2 cm/s). If the action, as in the given example, continues indefinitely or requires some time to be executed, the agent continues its execution until a contrasting message with a similar or higher authority is received. Messages with lower authorities are not considered. For example, A3 has a higher authority than A15, thus overriding any command from the latter in case of an emergency.
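The authority-based arbitration can be sketched in a few lines of Python; the numeric authority values and class names below are invented for illustration, and the release of control after an emergency (not detailed above) is left out.

```python
# Sketch of authority-based arbitration: a command with lower authority than
# the one currently being executed is ignored.  Values are illustrative.

class MotorAgent:
    def __init__(self):
        self.current_authority = 0
        self.speed, self.jog = 0.0, 0.0

    def on_speed_jog(self, speed, jog, authority):
        if authority >= self.current_authority:
            self.speed, self.jog = speed, jog
            self.current_authority = authority
        # else: the message comes from a less authoritative agent -> discarded

motors = MotorAgent()
motors.on_speed_jog(0.2, 0.1, authority=1)   # A15: smooth trajectory command
motors.on_speed_jog(0.0, 0.0, authority=5)   # A3: reflex stop, higher authority
motors.on_speed_jog(0.3, 0.0, authority=1)   # A15 again: ignored during emergency
print(motors.speed, motors.jog)              # 0.0 0.0
```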
Fig. 3. a) Correcting the robot's position; b) Motion reflex to avoid an obstacle
The whole system has been tested in Gaslini Hospital of Genoa and during many public exhibitions. During the experiments, all the PC and ID agents described in the previous Section are running: thus, the robot performs concurrently goal-oriented navigation, map making, smooth trajectory generation, obstacle avoidance,
localization, etc. Finally, the environment is intrinsically dynamic because of the presence of people. However, in spite of the complexity of the environment, the system keeps working with no performance degradation: during the Tmed exposition (Magazzini del Cotone, Porto Antico di Genova, October 2001), the system ran from opening to closing hours, meeting many people and interacting with them (e.g., saying sentences in particular situations). We were sometimes forced to stop the robot to recharge its batteries, but never because the robot got lost in the environment: thanks to the Intelligent Devices (beacons) distributed in the building, computing the robot's position becomes a straightforward problem through the interaction of ID agents A12i and A7 and PC agent A20. Fig. 3a shows the estimated position of the robot (dots) together with the estimated uncertainty (ellipses) as it varies in time. Notice that the initial uncertainty (when the robot is at (2500, 6500)) drastically decreases as more beacons are detected and the position is corrected by the Extended Kalman Filter. Finally, the experiments show that distributing control increases the robustness of the system. For example, we tested the reactivity of the system to the presence of an unknown obstacle quickly approaching the robot. When the robot detects a very close obstacle by means of ultrasonic readings (Fig. 3b), the reflex behaviour generated by A3 temporarily takes control of A4, A5, and A6 and computes speed and jog values in order to safely avoid a collision (thus inhibiting A15's output). Next, it releases control and the robot moves again towards its target.
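The kind of Extended Kalman Filter correction mentioned above can be illustrated with a compact NumPy sketch. The range/bearing measurement model, the noise values and the beacon coordinates are assumptions made for the example and are not taken from the Staffetta implementation.

```python
# Illustrative EKF measurement update fusing one beacon detection.
import numpy as np

def ekf_beacon_update(x, P, z, beacon, R):
    """x = [px, py, theta], P = 3x3 covariance, z = [range, bearing]."""
    dx, dy = beacon[0] - x[0], beacon[1] - x[1]
    q = dx**2 + dy**2
    z_pred = np.array([np.sqrt(q), np.arctan2(dy, dx) - x[2]])
    # Jacobian of the measurement with respect to the state
    H = np.array([[-dx/np.sqrt(q), -dy/np.sqrt(q), 0.0],
                  [ dy/q,          -dx/q,         -1.0]])
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)               # Kalman gain
    y = z - z_pred
    y[1] = (y[1] + np.pi) % (2*np.pi) - np.pi    # wrap bearing innovation
    return x + K @ y, (np.eye(3) - K @ H) @ P

x = np.array([2500.0, 6500.0, 0.0])              # initial estimate (mm, mm, rad)
P = np.diag([400.0**2, 400.0**2, 0.2**2])        # large initial uncertainty
R = np.diag([50.0**2, 0.05**2])                  # beacon measurement noise
x, P = ekf_beacon_update(x, P, np.array([3000.0, 0.5]), beacon=(5400.0, 7900.0), R=R)
```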
4 Conclusions
In this paper we describe the "Artificial Ecosystem", a novel multiagent approach to intelligent robotics: we claim that, given the current state of sensor and actuator technology, mobile robots will have great difficulty substituting for humans in a generic, human-inhabited environment, even for the simplest navigation task. Thus, we think that modifying the environment to fit the robot's requirements can be a temporary solution to obtain significant results with the current technology. To achieve this, we distribute sensors and actuators not only onboard the robot, but also in the building. As a consequence, intelligence is not only a characteristic of each single robot; instead, robots are considered mobile units within a "forest" of intelligent fixed devices, which are handled by cooperating software agents with a high level of autonomy in taking decisions. Finally, notice that, at present, the Intelligent Devices distributed throughout the building have the primary purpose of helping robots to carry out their navigation tasks; however, we also foresee the opposite situation, i.e. robots helping fixed Intelligent Devices to execute their own tasks (e.g. imagine an intelligent camera which detects an anomalous situation and asks for the robot's intervention in order to better investigate the area).
References
[1] Brooks, R. (1986). A Robust Layered Control System for a Mobile Robot. IEEE Journal of Robotics and Automation, RA-2(1)
[2] Parker, L. E. (1998). ALLIANCE: An Architecture for Fault Tolerant Multi-Robot Cooperation. IEEE Transactions on Robotics and Automation, 14(2)
[3] Arkin, R. C. (1990). Motor Schema-Based Mobile Robot Navigation. International Journal of Robotics Research
[4] Gat, E. (1992). Integrating Planning and Reacting in a Heterogeneous Asynchronous Architecture for Controlling Real-World Mobile Robots. Proceedings of the National Conference on Artificial Intelligence (AAAI)
[5] Burgard, W., Cremers, A. B., Fox, D., Hähnel, D., Lakemeyer, G., Schulz, D., Steiner, W., and Thrun, S. (2000). Experiences with an Interactive Museum Tour-Guide Robot. Artificial Intelligence (AI), 114(1-2)
[6] Russell, S. and Norvig, P. (1995). Artificial Intelligence: A Modern Approach. Prentice Hall, Englewood Cliffs, NJ
[7] Miozzo, M., Scalzo, A., Sgorbissa, A., and Zaccaria, R. (2002). Autonomous Robots and Intelligent Devices as an Ecosystem. International Symposium on Robotics and Automation (ISRA 02), September 2002, Toluca, Mexico
[8] Piaggio, M., Sgorbissa, A., and Zaccaria, R. (2000). Pre-emptive Versus Non Pre-emptive Real Time Scheduling in Intelligent Mobile Robotics. Journal of Experimental and Theoretical Artificial Intelligence, 12(2)
[9] Sanguineti, V. and Morasso, P. (1997). Computational Maps and Target Fields for Reaching Movements. In: Self-organization, Computational Maps, and Motor Control (P. Morasso and V. Sanguineti, Eds.), Elsevier Science Publishers, Amsterdam, 547-592
[10] Piaggio, M. and Zaccaria, R. (1997). An Autonomous System for a Vehicle Navigating in a Partially or Totally Unknown Environment. Proc. Int. Workshop on Mechatronical Computer Systems for Perception and Action (MCPA), Pisa, Italy
PCA Based Digital Watermarking
Thai D Hien 1, Yen-Wei Chen 1,2, and Zensho Nakao 1
1 Department of EEE, Faculty of Engineering, Univ. of the Ryukyus, Okinawa 903-0213, Japan
2 Institute for Computational Science and Engineering, Ocean Univ. of China, China
{tdhien,chen,nakao}@augusta.eee.u-ryukyu.ac.jp
Abstract. This work evaluates a novel watermarking method based on Principal Component Analysis (PCA) and its effectiveness against some watermark attacks. PCA is used on a block-by-block basis to decorrelate the image pixels, and watermarks are added to the principal components of the image. A theoretical description of the method is included together with experimental results in order to validate the methodology presented. Simulations show the method to be robust to image cropping and to attacks such as additive noise, low-pass filtering, median filtering, and JPEG compression. This research presents a new approach to the watermarking field with good performance under image cropping; enhancements to the system with respect to robustness against various attacks are under investigation.
1 Introduction
Today, more and more multimedia information is transmitted digitally; digital multimedia data facilitate efficient and easy distribution, reproduction, and manipulation over networked information systems. Watermarking of multimedia content has been introduced for copyright protection, and digital watermarking is now one of the most active research fields in the signal/image processing area. Many watermarking algorithms have been proposed, and the techniques proposed so far can be divided into two main groups: those which embed watermark signals directly in the spatial domain and those which operate in a frequency domain [6],[7]. Spatial watermarking techniques are known not to be robust against lossy image compression, filtering, and scanning, and they can embed only a small number of bits. Frequency-domain techniques embed watermark data by changing frequency component values obtained by an orthogonal transformation. They can embed a larger number of bits invisibly and can be employed with common image transforms, such as the discrete cosine transform (DCT) [1][2], the Fourier transform [5], and the wavelet transform [3][4]. An advantage of the frequency-domain techniques over spatial-domain approaches is that watermark casting and detection can be done with compressed images, making it possible to insert a watermark into JPEG images as well as MPEG and MPEG-4 streams. The frequency-domain
watermarking methods are relatively robust to attacks such as noise, common image processing operations, and JPEG compression compared to spatial-domain methods, and many watermarking schemes operate in the frequency domain. In this paper we evaluate a novel watermarking scheme based on PCA, and show that the method is robust enough against common attacks such as image cropping, image compression (JPEG), and image processing operations (low-pass filtering, median filtering, Gaussian noise), and that the embedded marks are invisible, as needed in most practical applications.
2 PCA Watermarking Theory
Let a picture I(m,n) with size M×N be denoted by [I], where m and n take integer values from 0 through M-1 and N-1, respectively. We write the transform of the image as

$$T(u,v) = \sum_{m=0}^{M-1}\sum_{n=0}^{N-1} I(m,n)\,\varphi_{(u,v)}(m,n) \qquad (1)$$

where $[\varphi_{(u,v)}]$ is a transformation matrix (basis matrix) and $[T(u,v)]$ is the transform of the image. The inverse transformation can be defined as

$$I(m,n) = \sum_{u=0}^{M-1}\sum_{v=0}^{N-1} T(u,v)\,\varphi^{-1}_{(u,v)}(m,n) \qquad (2)$$

where $[\varphi^{-1}_{(u,v)}]$ is the inverse transformation matrix. It is noted that the discrete Fourier transform uses sinusoidal functions as its basis functions, the DCT uses cosine functions, and the wavelet transform uses a particular wavelet as its basis function. The PCA approach uses the basis matrix obtained by finding the eigenvectors of the image correlation matrix. We extract the principal components of the sub-blocks of an image by finding the PCA transformation matrix $[\varphi]$. Each sub-block is transformed by the PCA transformation matrix $[\varphi]$, and watermarks are embedded into the perceptually significant components of the selected sub-blocks. The following steps are applied to the original image to find the transformation matrix $[\varphi]$ and the watermarked image.

Step 1. First, partition the picture into a number of sub-pictures for convenience in numerical implementation. Consider each sub-image as a vector (a vector of pixels); the image data vectors can be expressed as $I = (i_1, i_2, i_3, \ldots, i_m)^T$, where $i_i$ is the $i$-th sub-image and $T$ denotes the transpose. Each sub-image has $n^2$ pixels, so each vector $i_i$ is $n^2$-dimensional.

Step 2. Calculate the covariance matrix $C_x$ of the sub-images and the eigenvectors and eigenvalues of the covariance matrix: $C_x = E[(I - m_i)(I - m_i)^T]$, where $m_i = E(I)$ is the mean vector of the sub-vectors $i_i$.
Each sub-image may now be decomposed into uncorrelated coefficients by first finding the eigenvectors (the basis functions of the transformation) and the corresponding eigenvalues of the covariance matrix: $C_x \Phi = \lambda_x \Phi$. The matrix $[\varphi]$ is formed by the eigenvectors $\Phi = (e_1, e_2, e_3, \ldots, e_n)$; the eigenvalues $(\lambda_1 \ge \lambda_2 \ge \lambda_3 \ge \ldots \ge \lambda_n)$ and the eigenvectors $[\varphi]$ are sorted in descending order. The matrix $[\varphi]$ is an orthogonal matrix whose columns are the basis functions of the PCA.

Step 3. Transform the sub-images into uncorrelated coefficients. The original, spectrally correlated image $I$ can be decorrelated by the basis matrix $[\varphi]$ to obtain the eigen-image $Y = \Phi^T I = (y_1, y_2, \ldots, y_n)^T$, whose entries $y_i$ are the principal components of each sub-block. Once the principal components have been calculated with the PCA basis functions, we have to decide in which of them to embed the watermark. Since the decomposition is done block by block according to the maximum of the variance, the first components contain most of the information, and the last components correspond mainly to noise. Watermarks are embedded into the perceptually significant components of the selected sub-blocks. The image is watermarked under the following conditions:
• The first components are usually kept unchanged.
• The watermark is cast into some of the principal components of the uncorrelated coefficients of each sub-image.

Step 4. The watermarked image is retrieved by the inverse transform $I = (\Phi^T)^{-1} Y = \Phi Y$.
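A minimal NumPy sketch of Steps 1–3 is given below: it blocks the image, estimates the covariance of the block vectors, and projects each block onto the eigenvector basis. Function names, the explicit mean subtraction and the 8×8 block size are implementation choices made for the example, not a verbatim rendering of the authors' code.

```python
# Sketch of block-wise PCA decomposition (Steps 1-3 above).
import numpy as np

def block_pca(image, n=8):
    h, w = image.shape
    # Step 1: partition into n x n sub-images and flatten each into a vector
    blocks = (image[:h - h % n, :w - w % n]
              .reshape(h // n, n, w // n, n)
              .swapaxes(1, 2)
              .reshape(-1, n * n)).astype(float)
    # Step 2: covariance of the block vectors and its sorted eigenvectors
    mean = blocks.mean(axis=0)
    C = np.cov((blocks - mean).T)
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]            # descending eigenvalues
    Phi = eigvecs[:, order]                      # columns = PCA basis functions
    # Step 3: uncorrelated (principal component) coefficients of each block
    Y = (blocks - mean) @ Phi
    return Y, Phi, mean

# Y[k] holds the n*n principal components of block k; Step 4 (reconstruction)
# is simply blocks = Y @ Phi.T + mean, since Phi is orthogonal.
```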
3 PCA Watermarking System

The watermark system goes through two main steps: encode and decode.

3.1 Encode System
We suppose that an original image I0 (N×N) is subdivided into n-by-n blocks, from which a PCA basis function, denoted by Φ, is obtained; the PCA uncorrelated coefficients are then computed on this set of image sub-blocks. It is worth noting how related methods select coefficients. In the DCT-based method of Cox et al. [1], the watermark is added to the n highest-magnitude coefficients (excluding the DC component), which may degrade image quality. Enhancing this method, Barni et al. [2] suggested adding the watermark to a larger number of DCT coefficients which need not be significant: the DCT coefficients are taken in a zig-zag scan, the first 16000 coefficients are left out, and the watermark is added to the next 25000 coefficients. In the DFT approach [5], the watermark is embedded in the phase of the DFT, which is quite robust to geometric distortion. In the wavelet domain, Dugad et al. [3] added the watermark to the high-pass sub-bands, leaving out the low-pass sub-band and picking all coefficients above a pre-defined threshold T1 in magnitude instead of selecting a fixed number of coefficients.
In our method, based on the properties of PCA, a set of coefficients in each sub-block is selected to cast the watermark by modifying those coefficients, since the watermark should be placed in perceptually significant components of the signal. In the proposed method the watermark signal consists of a pseudo-random number sequence $W = (w_1, w_2, \ldots, w_M)$ of length $M$ with normally distributed values $w_i$. The scheme embeds the watermark into pre-defined components of the PCA uncorrelated coefficients of each sub-block. The embedded coefficients are modified according to the following equation:

$$y_i' = y_i + \xi\, y_i w_i \qquad (3)$$

where $i = 1, 2, \ldots, M$, $\xi$ is a scaling parameter that controls the strength of the watermark, and $y_i'$ are the watermarked coefficients. The watermarked image I' is obtained by applying the inverse PCA.
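A short sketch of the embedding rule of Eq. (3) is shown below. It builds on the block_pca sketch given earlier (an assumption of this example); the choice of marking the last sixteen coefficients of each block and the PRNG seed follow the experimental setup described later, while the per-block arrangement of the length-M sequence is a simplification for readability.

```python
# Sketch of Eq. (3): mark selected PCA coefficients of each block.
import numpy as np

def embed_watermark(Y, xi=0.2, n_marked=16, seed=0):
    """Y: (num_blocks, n*n) PCA coefficients; returns (marked Y, watermark W)."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((Y.shape[0], n_marked))    # normally distributed marks
    Y_marked = Y.copy()
    # keep the first components, cast the mark into the last n_marked components
    Y_marked[:, -n_marked:] += xi * Y[:, -n_marked:] * W
    return Y_marked, W
```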
3.2 Decode System
Watermark detection is applied to the watermarked image I'. The sub-block uncorrelated coefficients of I' are computed by applying the PCA basis function Φ, and the coefficients in which the watermark was embedded are selected to generate the watermarked coefficient vector $Y^* = (y^*_1, y^*_2, \ldots, y^*_M)$. The correlation between the candidate mark $W$ and the possibly corrupted coefficients $Y^*$ is calculated to detect the mark:

$$CV = \frac{W \cdot Y^*}{M} = \frac{1}{M}\sum_{i=1}^{M} w_i\, y^*_i \qquad (4)$$

Watermark correlations are calculated first for the mark $W$ and then for 1000 different marks. The correlation $CV$ can be used to determine whether a given mark is present or not. For watermark detection, the threshold $T$ is defined by

$$T = \frac{\xi}{2M}\sum_{i=1}^{M} y_i \qquad (5)$$

As in [2][3], the threshold, which is estimated on the watermarked image, is used to evaluate the decoding system.
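The detector of Eqs. (4)–(5) can be written as a few lines of NumPy; the function assumes the per-block coefficient layout used in the embedding sketch above, which is an assumption of this example rather than the authors' exact implementation.

```python
# Sketch of Eqs. (4)-(5): correlation detector with image-estimated threshold.
import numpy as np

def detect_watermark(Y_star, W, xi=0.2, n_marked=16):
    y = Y_star[:, -n_marked:].ravel()        # possibly corrupted coefficients
    w = W.ravel()                            # candidate watermark sequence
    M = y.size
    cv = np.dot(w, y) / M                    # correlation CV, Eq. (4)
    T = xi * np.sum(y) / (2 * M)             # threshold T, Eq. (5)
    return cv, T, cv > T                     # mark declared present if CV > T
```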
4 Computer Simulation
The watermarking scheme must guarantee imperceptibility, reliability and robustness. To evaluate the scheme, five standard images were selected for our experiments: Aerial, Baboon, Elaine, Man, and Peppers. All images are 512×512 pixels, and n = 8 for each sub-block; for each image, sixteen different watermark random numbers were inserted into the sixteen last coefficients of each block. To carry out this process, a watermark of total length M = 65536 was randomly generated with a standard
normal distribution. In the PCA-based watermarking scheme, we found that robustness can be improved by increasing the strength (the scaling parameter) of the embedded watermark data, but this may cause perceptible degradation of the image.
Fig. 1. Fig. 1a shows the watermark detector responses for different scaling parameters for the five pictures. Fig. 1b shows the change in PSNR as the watermark strength increases
In the PCA-based watermarking system, the scaling parameter was set to ξ = 0.2 to ensure imperceptible degradation together with robustness. Table 1 shows the detector response, threshold, mean square error and PSNR for the five pictures with our method, the method in [2] and the method in [3]. We can see that our method has stronger detection ability than the method in [2], while the method in [3] is not good for images composed of texture and edges, because too much watermark energy is added to non-smooth images and the invisibility of the watermark is affected; our method is suitable for both smooth and non-smooth images. Figure 2 shows that the proposed system can detect watermarks under various attacks. The watermarked image was subjected to a cropping attack; if the cropped part is at least 10% of the original image, the system shows good results against cropping attacks (Fig. 2a, 2d). The proposed system can successfully detect watermarks when the watermarked images are lossily compressed by the JPEG algorithm (Fig. 2b, 2e). We also attempted to detect the watermark after the addition of Gaussian noise: the decoder can still detect the mark even when the variance is increased to σ² = 25000, with high image degradation (Fig. 2c, 2f). Finally, the watermarked image was filtered with low-pass and median filters: Figs. 3a and 3b show the watermarked image "Baboon" under filters with a 5×5 window, and Figs. 3c and 3d show the corresponding detector responses.
Table 1. The detector response, threshold, MSE and PSNR of the PCA-based method, the method in [2] and the method in [3] (each entry: PCA method / DCT method [2] / wavelet method [3])

Pictures  Detector response   Threshold        MSE              PSNR
Aerial    1.33/3.43/0.73      0.63/2.48/0.23   0.93/0.81/10.58  48.47/49.02/38.00
Baboon    2.80/3.55/1.03      1.39/3.12/0.28   2.72/0.98/10.13  43.78/48.22/38.07
Elaine    1.11/1.36/0.10      0.54/0.83/0.03   0.66/0.07/1.03   49.90/59.54/47.01
Man       1.09/2.49/0.48      0.51/2.10/0.13   0.47/0.42/5.86   49.42/51.94/40.47
Peppers   1.14/1.34/0.19      0.56/1.23/0.06   0.70/0.16/2.31   49.66/56.05/44.49
Fig. 2. Figs. 2a, 2b and 2c show a 128×128 cropping of the "Baboon" image in which only the central part of the watermarked image remains, the watermarked image "Baboon" after JPEG compression (quality 18%), and the effect of adding Gaussian noise with variance σ² = 4000 and zero mean to the watermarked image "Baboon"; Figs. 2d, 2e and 2f show the corresponding detector responses
Fig. 3. Figs. 3a and 3b show the watermarked image "Baboon" after 5×5 low-pass filtering and 5×5 median filtering; Figs. 3c and 3d show the corresponding detector responses
5 Conclusion
We evaluated a novel method that embeds and detects watermarks in digital images using Principal Component Analysis. With this method, we could successfully detect watermarks under most well-known attacks and showed good performance against the geometric distortion attack of image cropping.
References
[1] Ingemar J. Cox, Joe Kilian, Tom Leighton, and Talal G. Shamoon: Secure spread spectrum watermarking for multimedia. In: Proceedings of the IEEE ICIP '97, vol. 6, pp. 1673-1687, Santa Barbara, California, USA, 1997.
[2] M. Barni, F. Bartolini, V. Cappellini, and A. Piva: A DCT-domain system for robust image watermarking. Signal Processing, Special Issue on "Copyright Protection and Access Control for Multimedia Services," pp. 357-372, 1998.
[3] Rakesh Dugad, Krishna Ratakonda, and Narendra Ahuja: A new wavelet-based scheme for watermarking images. In: Proceedings of the IEEE International Conference on Image Processing, ICIP '98, Chicago, IL, USA, October 1998.
[4] Jong Ryul Kim and Young Shik Moon: A robust wavelet-based digital watermark using level-adaptive thresholding. In: Proceedings of the 6th IEEE International Conference on Image Processing, ICIP '99, p. 202, Kobe, Japan, October 1999.
[5] V. Licks and R. Jordan: On digital image watermarking robust to geometric transformations. In: IEEE International Conference on Image Processing, Vancouver, Canada, 2000.
[6] Ingemar Cox, Matthew Miller, and Jeffrey Bloom: Digital Watermarking. Morgan Kaufmann Publishers, October 2001, ISBN 1-55860-714-5.
[7] J. J. Eggers and B. Girod: Informed Watermarking. Kluwer Academic Publishers, 2002, ISBN 1-4020-7071-3.
Image Retrieval Based on Independent Components of Color Histograms
Xiang-Yan Zeng 1, Yen-Wei Chen 1,2, Zensho Nakao 1, Jian Cheng 3, and Hanqing Lu 3
1 Department of EEE, Faculty of Engineering, University of the Ryukyus, Okinawa 903-0213, Japan
2 Institute of Computational Science and Engineering, Ocean University of China, China
3 National Laboratory of Pattern Recognition, Chinese Academy of Science, Beijing 2772, China
Abstract. Color histograms are effective for representing color visual features. However, the high dimensionality of the feature vectors results in high computational cost. Several transformations, including principal component analysis (PCA), have been proposed to reduce the dimensionality. PCA reduces the dimensionality by projecting the data onto a subspace which contains most of the variance; it is restricted to an orthogonal transformation and may not be optimal for representing the intrinsic features of the data. In this paper, we apply independent component analysis (ICA) to extract the features in color histograms. PCA is applied to reduce the dimensionality, and ICA is then performed on the low-dimensional PCA subspace. Furthermore, spatial information is incorporated by performing ICA on a color coherence vector (CCV). The experimental results show that the proposed method outperforms other methods based on the SVD of a quadratic matrix or on PCA, in terms of retrieval accuracy.
1 Introduction
Using low-level visual features for content-based image retrieval has drawn much attention from researchers in recent years. Color, texture and shape information are used separately or in combination for this task [1],[2]. Among them, color is perhaps the most dominant and distinguishing visual feature. Several color features have been proposed to represent the color composition of images [3],[4]. The color histogram is the most widely used color index in content-based image retrieval. A color histogram describes the global color distribution in an image; it is insensitive to small changes of viewpoint and has rather robust performance. A disadvantage of color histograms is that the dimensionality of the feature vectors is very large. Data structures designed for fast searching in large databases are efficient only for small dimensions (of the order of 1-10). As the data dimensionality increases, the query performance of these structures degrades rapidly. Therefore,
the high dimensionality of color features is a crucial problem to overcome in content-based image retrieval. The current technique for indexing high-dimensional data is to first transform the data into low-dimensional features and then index the new feature space. Suitable transformations include the SVD [5], PCA or the Karhunen-Loeve transform (K-L) [6], the Discrete Cosine Transform (DCT), and the Wavelet Transform. These transformations form two large families: (1) data-dependent transforms, such as PCA, where the transformation matrix is obtained from statistical analysis of sample data; and (2) data-independent transforms, such as the DCT and the wavelet transform, where the transformation matrix is determined a priori. Since data-dependent transforms can be tuned to the specific data set, they can achieve better performance; the drawback is that the transformation matrix needs to be recomputed when the database changes significantly. In this paper, we propose a new low-dimensional color index obtained by independent component analysis (ICA). As a decorrelation approach, ICA differs from PCA in the following two aspects: (1) ICA is not necessarily an orthogonal transformation; (2) ICA features (coefficients) are higher-order uncorrelated while PCA's are only second-order uncorrelated. On the whole, PCA is appropriate for capturing the structure of data that are well described by a Gaussian cloud. Such a case is illustrated in Fig. 1(a), where the 2-dimensional data have Gaussian distributions. The two PCA basis functions in Fig. 1(c) are orthogonal and the first one is along the direction of maximum variance. However, in practice, data may not be a Gaussian cloud with second-order correlation but may have higher-order dependence. An example is given in Fig. 1(b), where the data have super-Gaussian distributions. As shown in Fig. 1(d), the ICA basis functions are not orthogonal and they are in the directions of maximum independence; the PCA basis functions constitute an orthogonal rotation that obviously does not capture the structure of the data. High-dimensional color histograms cannot be appropriately characterized by second-order statistics, and higher-order statistical dependence should be considered. Therefore, we utilize ICA to extract the intrinsic features of the color histograms and use the ICA feature as a new
Fig. 1. PCA and ICA transformations. The original data are 2-dimensional and the transformations include two basis functions. (a) two components of Gaussian data, (b) two components of super-Gaussian data, (c) two PCA basis functions of (a), (d) two ICA basis functions and two PCA basis functions of (b)
color index. Furthermore, to incorporate the spatial characteristics into color histograms, we perform ICA separately on the coherent and incoherent parts of the histogram.
2 Color Histogram
Color histograms are widely used to capture the color information in an image. They are easy to compute and tend to be robust against small changes of camera viewpoint. Consider an image I in some color space (e.g., red, green, blue). The color channels are quantized into a coarser space with k bins for red, m bins for green and l bins for blue. The color histogram is therefore a vector $h = (h_1, h_2, \cdots, h_n)^T$, where $n = k \times m \times l$ and each element $h_j$ represents the number of pixels of the discretized color j in the image. We assume that all images have been scaled to the same size; otherwise, we normalize the histogram elements as

$$h_j = \frac{h_j}{\sum_{j=1}^{n} h_j} \qquad (1)$$

The normalized color histogram $f = (h_1, h_2, \cdots, h_n)^T$ is the feature vector to be stored as the index in the image database.
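A minimal NumPy sketch of this computation is shown below; the 8×8×8 quantization matches the experiments in Sec. 5.2, while the function name and the uniform binning rule are choices made for the example.

```python
# Sketch of the normalized color histogram of Eq. (1).
import numpy as np

def color_histogram(image, bins=(8, 8, 8)):
    """image: (H, W, 3) uint8 RGB array; returns the normalized histogram."""
    k, m, l = bins
    r = image[..., 0].astype(int) * k // 256   # red bin index per pixel
    g = image[..., 1].astype(int) * m // 256
    b = image[..., 2].astype(int) * l // 256
    idx = (r * m + g) * l + b                  # combined bin index, n = k*m*l bins
    h = np.bincount(idx.ravel(), minlength=k * m * l).astype(float)
    return h / h.sum()                         # normalize so the elements sum to 1
```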
3 Related Work

3.1 SVD on Quadratic Form Distance Measure
To reduce the high dimensionality of color histograms, Hafner et al. proposed performing an SVD on a quadratic matrix whose elements denote the similarity between color bins. Given two color histograms x and y with N color bins, the quadratic form distance is defined as

$$d^2_{hist}(x, y) = (x - y)^T A (x - y) \qquad (2)$$

where A is a matrix whose element $a_{ij}$ indicates the similarity between color bins i and j. Let $A_k = V_k^T \Sigma_k V_k$ be an approximation of A, where $\Sigma_k$ is a diagonal matrix whose elements are the first k singular values of A and the rows of $V_k$ are the corresponding singular vectors. To index an image, the k-dimensional features are precomputed from its histogram x by $x_k = V_k x$. The quadratic distance in the k-dimensional space is

$$d^2_k(x_k, y_k) = (x_k - y_k)^T \Sigma_k (x_k - y_k) \qquad (3)$$

Since the matrix A is data independent, it can be applied to different databases without recomputing the SVD. At the same time, this data independence makes it hard to achieve optimal performance.
3.2 Color Indexing by Principal Component Analysis
Principal component analysis is a data analysis technique widely used in image processing and pattern recognition. The main idea of PCA is to obtain a set of uncorrelated features that optimally represent the distribution of the original data. Given an n-dimensional histogram x, PCA linearly transforms it with a matrix V, so that the feature vector

$$c = V x \qquad (4)$$

has uncorrelated components and most of the variance is concentrated in the first several components. The transformation matrix V is an orthogonal matrix whose rows (the columns of B) are the eigenvectors of the covariance matrix of x. The eigenvectors are sorted in decreasing order of eigenvalues, and the first k eigenvectors form the rows of the transformation matrix V. The PCA feature vector $c = (c_1, c_2, \cdots, c_k)^T$ is used as the database index.
4 Color Indexing by Independent Component Analysis
ICA generalizes the technique of PCA and has proven to be an effective tool for feature extraction [7]. The goal is to express a set of random variables as linear combinations of statistically independent component variables. In the simplest ICA model [8], we observe n scalar random variables $x_1, x_2, \cdots, x_n$ which are linear combinations of $k$ ($k \le n$) unknown independent sources $s_1, s_2, \cdots, s_k$. Arranging the random variables into a vector $x = (x_1, x_2, \cdots, x_n)$ and the sources into a vector $s = (s_1, s_2, \cdots, s_k)$, the linear relationship is given by

$$x = A s \qquad (5)$$

where A is an unknown mixing matrix. In the application of ICA to feature extraction, the columns of A represent the basis functions and $s_i$ represents the i-th feature in the observed data x. The goal of ICA is to find a matrix W such that the resulting vector

$$y = W x \qquad (6)$$

recovers the independent sources s, possibly permuted and rescaled. We apply ICA to the color histogram vector x and use the independent component vector y as a new color index. Unlike in PCA, the basis functions of ICA cannot be calculated analytically; the adopted approach is to minimize or maximize some relevant criterion function. Several ICA algorithms have been proposed. We use the fixed-point algorithm proposed by Hyvarinen et al. [9]; compared with other adaptive algorithms, it converges very fast and is not affected by a learning rate. Before performing ICA, PCA is used to prewhiten the data, and the dimensionality is also reduced. The purpose of ICA is not to reduce the dimensionality but to extract the underlying features.
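The indexing pipeline described above can be sketched with off-the-shelf tools as follows; scikit-learn's FastICA is used here as a stand-in for the fixed-point algorithm of [9], and the number of components (10) follows the experiments in Sec. 5 — the function names and interface are assumptions of this example.

```python
# Sketch of the PCA-prewhitened ICA color index.
import numpy as np
from sklearn.decomposition import PCA, FastICA

def build_ica_index(histograms, k=10):
    """histograms: (num_images, 512) array of normalized color histograms."""
    pca = PCA(n_components=k, whiten=True).fit(histograms)
    reduced = pca.transform(histograms)          # low-dimensional PCA subspace
    ica = FastICA(n_components=k, random_state=0).fit(reduced)
    features = ica.transform(reduced)            # independent components y = Wx
    return features, pca, ica

def index_query(query_histogram, pca, ica):
    """Map a single query histogram into the same ICA feature space."""
    return ica.transform(pca.transform(query_histogram.reshape(1, -1)))[0]
```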
5 Image Retrieval

5.1 Distance Measure
Given a query image, we want to retrieve all the images whose color features are similar to those of the query image. To measure the similarity between two color indices $F^1$ and $F^2$, one can use the $L_1$-norm [3] or the $L_2$-norm [14]. In our experiments, the best performance is achieved with a distance measure similar to the $L_1$ distance:

$$d(F^1, F^2) = \sum_{j=1}^{n} \frac{|F^1_j - F^2_j|}{|F^1_j| + |F^2_j|} \qquad (7)$$
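Eq. (7) translates directly into a small NumPy function; the small epsilon added to the denominator is an assumption of this sketch to avoid division by zero when both components are zero.

```python
# The weighted L1 distance of Eq. (7).
import numpy as np

def weighted_l1(f1, f2, eps=1e-12):
    return np.sum(np.abs(f1 - f2) / (np.abs(f1) + np.abs(f2) + eps))
```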
5.2 Retrieval Performance
Image retrieval experiments are carried out to compare the proposed method with other low-dimensional color indices. The database images and the query images are 8-bit color images scaled to size 256 × 384. Our database consists of 2000 images from Corel and from the database used in SIMPLIcity [10], including nature scenes, people, buildings, and so on. Another 60 images containing a variety of colors and color combinations are chosen as queries. We discretize each RGB color channel into 8 levels; therefore, the color histogram feature vector has 512 components. Using the SVD and PCA transformations, we obtain low-dimensional features from the 512-dimensional color histogram feature. SVD is performed on the quadratic matrix as described in Sec. 3.1. Although the SVD approach is based on the quadratic form distance measure, it is observed that applying the weighted $L_1$ distance of Sec. 5.1 to the transformed feature vector achieves better performance. PCA is performed on the color histograms of the database images. In both cases, the eigenvectors corresponding to the several largest eigenvalues constitute the transformation matrix. In the proposed method, ICA is performed on the data whose dimensionality has been reduced by PCA; the goal of ICA is not dimensionality reduction but to extract the intrinsic features in the low-dimensional PCA subspace. The retrieval accuracy is measured by the Recall, defined as

$$Recall(C) = R_c / M \qquad (8)$$

where C is the number of retrievals, $R_c$ is the number of relevant matches among the C retrievals, and M is the total number of relevant matches in the database. Before the evaluation, the relevant matches for each query image were found through the entire database. The average Recall is computed over the 60 query images and used to evaluate the retrieval performance. Fig. 2(a) shows the relation between the number of index components k and the average Recall, where the number of retrievals C is 30. On the whole, SVD is inferior to PCA and ICA. PCA outperforms the others in the small area
of k < 7; however, beyond that area the performance of PCA does not improve as the number of components increases. The performance is also evaluated by the Recall for different numbers of retrievals. The result is shown in Fig. 2(b), where the number of SVD, PCA and ICA features k is 10 and a comparison is also made with the 512-dimensional color histogram. It can be seen from these figures that the low-dimensional ICA feature achieves better performance than the SVD and PCA features.

Fig. 2. Comparison of the ICA feature and other color features. (a) Average recall versus number of feature components; C = 30. (b) Average recall versus number of retrievals; number of SVD, PCA and ICA components k = 10

5.3 Incorporation of Spatial Information
As a general observation, incorporating spatial information can improve retrieval performance. To add spatial constraints, we perform ICA on a split histogram called a color coherence vector (CCV) [11]. In a CCV, pixels are classified as coherent or incoherent. A coherent pixel is part of a large group of pixels of the same color, while an incoherent pixel is not. A color histogram is split into a coherent pair vector

$$h = \{h_1, \cdots, h_n\} = \{\langle \alpha_1, \beta_1 \rangle, \cdots, \langle \alpha_n, \beta_n \rangle\} \qquad (9)$$

where $\alpha_i$ and $\beta_i$ are the numbers of coherent and incoherent pixels of color i, and $h_i = \alpha_i + \beta_i$. We perform ICA separately on the coherent part $\{\alpha_1, \cdots, \alpha_n\}$ and the incoherent part $\{\beta_1, \cdots, \beta_n\}$; the new color index contains the ICA features of the two vectors. Fig. 3 compares the retrieval performance of ICA and ICA+CCV. It can be seen that ICA+CCV outperforms ICA when the number of total components (coherent and incoherent) k > 17.

Fig. 3. The retrieval performance of ICA + CCV
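One possible way to compute the pair vector of Eq. (9) is sketched below using connected-component labelling; the coherence threshold tau and the 4-connectivity choice are assumptions of this example and are not specified in the text above.

```python
# Sketch of splitting a quantized image into the coherent/incoherent
# pair vector of Eq. (9).
import numpy as np
from scipy.ndimage import label

def coherence_vector(quantized, n_colors, tau=300):
    """quantized: (H, W) array of color-bin indices; returns (alpha, beta)."""
    alpha = np.zeros(n_colors)                  # coherent pixel counts
    beta = np.zeros(n_colors)                   # incoherent pixel counts
    for c in range(n_colors):
        mask = (quantized == c)
        if not mask.any():
            continue
        labels, num = label(mask)               # connected regions of color c
        sizes = np.bincount(labels.ravel())[1:] # region sizes, skip background
        alpha[c] = sizes[sizes >= tau].sum()
        beta[c] = sizes[sizes < tau].sum()
    return alpha, beta

# ICA is then applied separately to the alpha and beta vectors (stacked over
# the database images), and the two ICA feature sets are concatenated.
```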
6 Conclusion

In this paper, we apply ICA to extract a new index from the color histogram. ICA generalizes the technique of PCA and has proven to be an effective tool for finding
structure in data. Differing from PCA, ICA is not restricted to an orthogonal transformation and can find structure in non-Gaussian data. We first use PCA to reduce the dimensionality of the color histogram, and then apply ICA to the low-dimensional subspace. The effectiveness of the proposed color index has been demonstrated by experiments. Comparisons with PCA and SVD show that the ICA feature outperforms these low-dimensional color indices in terms of retrieval accuracy. Experiments were also done to incorporate spatial information into the new index: we perform ICA on the two parts of the CCV and obtain improved accuracy with slightly more components.
Acknowledgements This research was partly supported by the Outstanding Overseas Chinese Scholars Fund of Chinese Academy of Science.
References

[1] A. Pentland, R. W. Picard and S. Sclaroff: Photobook: Content-based manipulation of image databases. International Journal of Computer Vision, vol. 18, no. 3, (1996) 233-254.
[2] Y.-W. Chen, X.-Y. Zeng, Z. Nakao and H. Q. Lu: An ICA-based illumination-free texture model and its application to image retrieval. Lecture Notes in Computer Science, Springer, vol. 2532, (2002) 167-174.
[3] M. J. Swain and D. H. Ballard: Color indexing. Int. Journal of Computer Vision, vol. 7, no. 1, (1991) 11-32.
[4] M. A. Stricker and M. Orengo: Similarity of color images. Storage and Retrieval for Still Image and Video Databases IV, SPIE, (1996) 381-392.
[5] J. Hafner, H. S. Sawhney, et al.: Efficient color histogram indexing for quadratic form distance functions. IEEE Trans. Pattern Anal. and Machine Intell., vol. 17, (1995) 729-736.
[6] C. Faloutsos, W. Equitz, et al.: Efficient and effective query by image content. Journal of Intelligent Information Systems, vol. 3, no. 4, (1996) 231-262.
[7] A. Bell and T. Sejnowski: The 'independent components' of natural scenes are edge filters. Vision Research, vol. 37, (1997) 3327-3338.
[8] P. Comon: Independent component analysis, a new concept? Signal Processing, vol. 36, (1994) 287-314.
[9] A. Hyvarinen and E. Oja: A fast fixed-point algorithm for independent component analysis. Neural Computation, vol. 9, (1997) 1483-1492.
[10] J. Z. Wang, J. Li, and G. Wiederhold: SIMPLIcity: Semantics-sensitive integrated matching for picture libraries. IEEE Trans. Pattern Anal. and Machine Intell., vol. 23, no. 9, (2001) 947-963.
[11] G. Pass and R. Zabih: Histogram refinement for content-based image retrieval. IEEE Workshop on Applications of Computer Vision, (1998) 59-66.
Face Recognition Using Overcomplete Independent Component Analysis

Jian Cheng 1, Hanqing Lu 1, Yen-Wei Chen 2, and Xiang-Yan Zeng 2

1 National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, P.O. Box 2728, 100080, P.R. China
{jcheng,luhq}@nlpr.ia.ac.cn
2 Department of EEE, Faculty of Engineering, University of the Ryukyus, Japan
[email protected], [email protected]
Abstract. Most current face recognition algorithms find a set of basis functions in a subspace by training on the input data. In many applications, however, the training data is limited or only a few training samples are available, and these classic algorithms degrade rapidly in that case. Overcomplete independent component analysis (overcomplete ICA) can separate out more source signals than there are input signals. In this paper, we use overcomplete ICA for face recognition with limited training data. The experimental results show that overcomplete ICA can effectively improve the recognition rate.
1
Introduction
The face recognition problem has attracted much research effort in the last 10 years. Among the popular face recognition techniques, the eigenfaces method proposed by Turk and Pentland [1] is one of the most successful. The eigenfaces method is based on Principal Component Analysis (PCA), which reduces the dimensionality of the input data by decorrelating its 2nd-order statistical dependencies. However, PCA cannot represent higher-order statistical dependencies, such as relationships among three or more pixels. Independent Component Analysis (ICA) [2,3] is a relatively recent technique that can be considered a generalization of PCA. ICA finds a linear transform of the input data onto a set of basis vectors that are not only decorrelating but also as mutually independent as possible. ICA has been successfully applied to face recognition by Bartlett [4], whose results show that ICA representations are superior to PCA for face recognition. However, both the PCA and ICA techniques for face recognition share a serious drawback: they need a large-scale database to learn the basis functions. In general, a face database large enough to reach a high recognition precision cannot be obtained. Recently, an extension of the ICA model, overcomplete independent component analysis, has received more and more attention. A distinct difference from standard ICA is that overcomplete ICA assumes more sources than observations. In this paper, we propose a new
face recognition approach using overcomplete ICA with a small training face database. The paper is organized as follows: Section 2 gives a brief introduction to overcomplete ICA, Section 3 presents the experimental results, and concluding remarks are given in the final section.
2
Overcomplete Independent Component Analysis
The overcomplete independent component analysis model can be written as

x = A s + ε    (1)

where x = (x_1, x_2, ..., x_m)^T is the m-dimensional vector of observed variables, s = (s_1, s_2, ..., s_n)^T is the n-dimensional vector of latent variables or source signals, A is an unknown m × n mixing matrix, and ε is assumed to be white Gaussian noise. In standard independent component analysis the mixing matrix A is square, i.e., the dimension of the observed variables equals the dimension of the source signals, whereas m < n in overcomplete ICA. In general this is an ill-posed problem. Several methods have been proposed for estimating the mixing matrix A; for example, Lewicki and Sejnowski [5,6,7] use a Bayesian method for inferring an optimal basis to find efficient codes. Their method comprises two main steps: inferring the sources s and learning the basis matrix A.

2.1
Inferring the Sources
Given the model (1), the noise ε is assumed to be white Gaussian noise with variance σ², so that

log P(x | A, s) ∝ − (1 / 2σ²) |x − A s|²    (2)
Using Bayes' rule, s can be inferred from x:

P(s | x, A) ∝ P(x | A, s) P(s)    (3)

We assume the prior distribution of s is a Laplacian distribution, P(s) ∝ exp(−θ |s|), which constrains s to have sparse and statistically independent components. Maximizing the posterior distribution P(s | x, A), s can be approximated as:
ŝ = max_s P(s | x, A)
  = max_s [log P(x | A, s) + log P(s)]
  = min_s [ (1 / 2σ²) |x − A s|² + θ^T |s| ]    (4)

2.2
Learning the Basis Vectors
The objective in learning the basis vectors is to obtain a good model of the observed data. The goodness of fit can be assessed by the expected log-probability of the observed data under the model,

L = E{log P(x | A)}    (5)

where

P(x | A) = ∫ ds P(x | A, s) P(s)    (6)

An approximation of L can be obtained using a Gaussian approximation to the posterior:

L ≈ const. − E{ (1 / 2σ²) |x − A ŝ|² } + log P(ŝ) − (1/2) log det H    (7)

where H is the Hessian matrix of the log posterior at ŝ. Performing gradient ascent on L with respect to A and multiplying by A A^T yields the iteration equation for learning the basis vectors:

ΔA ∝ − A (z s^T + A^T A H^{-1})    (8)

where z = d log P(s) / ds.
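The MAP estimate in Eq. (4) is an L1-regularized least-squares problem. The sketch below solves it with a plain iterative soft-thresholding scheme; this is a generic solver under stated assumptions (a scalar regularization weight, a step size from the spectral norm, a fixed iteration count), not the inference procedure of [5,6,7].

```python
# Minimal ISTA-style sketch for Eq. (4): minimize (1/2σ²)|x - A s|² + θ|s|_1.
import numpy as np

def infer_sources(x, A, sigma=1.0, theta=0.1, n_iter=500):
    m, n = A.shape                                   # overcomplete case: m < n
    s = np.zeros(n)
    L = np.linalg.norm(A, 2) ** 2 / sigma ** 2       # Lipschitz constant of the data term
    for _ in range(n_iter):
        grad = A.T @ (A @ s - x) / sigma ** 2        # gradient of the quadratic term
        z = s - grad / L
        s = np.sign(z) * np.maximum(np.abs(z) - theta / L, 0.0)   # soft threshold
    return s

# Toy usage with the shapes used later in the experiment (20 mixtures, 30 sources).
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 30))
s_true = rng.standard_normal(30) * (rng.random(30) < 0.2)         # sparse sources
s_hat = infer_sources(A @ s_true, A)
```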
3
Experimental Results
Among classic face recognition algorithms, most are subspace analysis methods such as PCA (eigenfaces), LDA, etc. These algorithms represent face images by mapping the input data from a high-dimensional space to a low-dimensional subspace. Training the subspace usually requires a large-scale face database. However, in many applications the training data is limited or only a few training images are available. In this case, in order to improve the recognition rate, we must obtain as much information as possible from the limited input data. As shown in Section 2, overcomplete ICA can obtain more source signals than observed signals. We performed face recognition experiments using overcomplete ICA to extract the face features.
Our experiments were performed on face images from a subset of the FERET database. The training data contained 20 face images selected randomly from the test database. The test database includes 70 individuals; each has 6 face images with different luminance and expression, 420 face images in all. Each face image has 112 × 92 pixels. We use the model described by Eq. (1). x is a 20 × 10304 matrix whose rows are the training faces. A is set to 20 rows and 30 columns. The overcomplete ICA algorithm introduced in Section 2 is used to produce the source matrix s with 30 rows and 10304 columns; each row of s is a source face image. The 30 source faces are shown in Fig. 1.
Fig.1. 30 source faces separated from the 20 training faces using the overcomplete ICA
In order to compare the overcomplete ICA algorithm with other algorithms, face recognition experiments are performed on the test face database. We compare overcomplete ICA with PCA [1] and standard ICA [4] on the same training and test databases described above. In [4] there are two ICA models for face recognition; we selected the first model, which is superior to the second. First, the source face images in overcomplete ICA, the eigenfaces in PCA, and the independent components in standard ICA are inferred by each algorithm as a set of basis functions B. Second, each face image a from the test database is projected onto the set of basis functions B, and the coefficients f are taken as the feature vector of this face for recognition:
f = B * a    (9)
The face recognition performance is evaluated on the feature vectors f using the cosine as the similarity measure; the nearest face is the most similar:
d_ij = (f_i · f_j) / (||f_i|| ||f_j||)    (10)
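A compact sketch of the recognition step described by Eqs. (9) and (10): project faces on the basis B and pick the gallery face with the largest cosine similarity. Array shapes and function names are assumptions for illustration, not the authors' code.

```python
# Hedged sketch of feature projection (Eq. (9)) and cosine matching (Eq. (10)).
import numpy as np

def extract_features(B, faces):
    """B: (n_basis, n_pixels) basis images; faces: (n_faces, n_pixels) test images."""
    return faces @ B.T                                    # each row is f = B * a

def cosine_similarity(fi, fj):
    return float(fi @ fj / (np.linalg.norm(fi) * np.linalg.norm(fj)))

def recognize(query_feature, gallery_features, gallery_labels):
    sims = [cosine_similarity(query_feature, g) for g in gallery_features]
    return gallery_labels[int(np.argmax(sims))]           # nearest = most similar
```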
Fig. 2 compares the face recognition performance of overcomplete ICA, standard ICA, and PCA. The horizontal axis is the number n of most similar face images returned for the query face image, and the vertical axis is the average precision over the test database. Fig. 2 shows a trend for overcomplete ICA and standard ICA to outperform the PCA algorithm, with overcomplete ICA giving a slight improvement over standard ICA.
Fig. 2. The horizontal axis is the number n of most similar face images to the query face image; the vertical axis is the average precision over all test face images. The dashed line represents overcomplete ICA, the solid line standard ICA, and the dash-dotted line PCA
4
Conclusions
In this paper, we applied overcomplete ICA to learn efficient basis functions for face recognition. Three different algorithms for face recognition have been compared, and the experimental results demonstrate that overcomplete ICA can improve the precision rate of face recognition.
Acknowledgement This research was partly supported by the Outstanding Overseas Chinese Scholars Fund of Chinese Academy of Science.
References

[1] M. Turk and A. Pentland: Eigenfaces for Recognition. Journal of Cognitive Neuroscience, 3(1): 77-86, 1991.
[2] A. Bell and T. Sejnowski: An Information Maximization Approach to Blind Separation and Blind Deconvolution. Neural Computation, Vol. 7, 1129-1159, 1995.
[3] A. Hyvärinen: Fast and Robust Fixed-Point Algorithms for Independent Component Analysis. IEEE Trans. on Neural Networks, Vol. 10, No. 3, 626-634, 1999.
[4] M. S. Bartlett, J. R. Movellan and T. J. Sejnowski: Face Recognition by Independent Component Analysis. IEEE Trans. on Neural Networks, Vol. 13(6), 1450-1464, 2002.
[5] M. S. Lewicki and T. J. Sejnowski: Learning Overcomplete Representations. Neural Comp., Vol. 12, 337-365, 2000.
[6] M. S. Lewicki and B. A. Olshausen: A Probabilistic Framework for the Adaptation and Comparison of Image Codes. J. Opt. Soc. Am. A: Optics, Image Science and Vision, Vol. 16(7), 1587-1601, 1999.
[7] T.-W. Lee, M. S. Lewicki, M. Girolami and T. J. Sejnowski: Blind Source Separation of More Sources Than Mixtures Using Overcomplete Representations. IEEE Sig. Proc. Lett., Vol. 6(4), 87-90, 1999.
An ICA-Based Method for Poisson Noise Reduction

Xian-Hua Han, Yen-Wei Chen, and Zensho Nakao

Department of EEE, Faculty of Engineering, University of the Ryukyus, Japan
[email protected], [email protected]
Abstract. Many imaging systems rely on photon detection as the basis of image formation. One of the major sources of error in these systems is Poisson noise, due to the quantum nature of the photon detection process. Unlike additive Gaussian noise, Poisson noise is signal-dependent, and consequently separating signal from noise is a very difficult task. In most current Poisson noise reduction algorithms, the noisy signal is first pre-processed to approximate Gaussian noise and then denoised by a conventional Gaussian denoising algorithm. In this paper, based on the property that Poisson noise depends on the signal intensity, we develop and analyze a new method using an optimal ICA-domain filter for Poisson noise removal. The performance of this algorithm is assessed with simulated data, and the experimental results demonstrate that it greatly improves image denoising performance.
1
Introduction
In medical and astronomical imaging systems, the images obtained are often contaminated by noise; this noise usually obeys a Poisson law and hence is highly dependent on the underlying intensity pattern being imaged. The contaminated image can thus be decomposed into the true mean intensity and Poisson noise, where the noise represents the variability of pixel amplitude about the true mean intensity. It is well known that the variance of a Poisson random variable is equal to its mean. Therefore, the variability of the noise is proportional to the image intensity and hence signal-dependent [1]. This signal dependence makes it much more difficult to separate signal from noise. Current methods for Poisson noise reduction mainly follow two strategies. One is to work with the square root of the noisy image, since the square-root operation is a variance-stabilizing transformation [8]. However, after this preprocessing the noise does not approach white Gaussian noise when the photon counts are low, so a Gaussian noise reduction algorithm is not entirely suitable. The other strategy is wavelet shrinkage, but the basis functions of the wavelet transform are fixed and cannot adapt to different kinds of data sets. Recently, an ICA based denoising method has been developed by Hyvarinen and his co-workers [2][3][4]. The basic motivation behind this method is that the ICA
components of many signals are often very sparse, so that one can remove noise in the ICA domain. It has been shown that ICA-domain filtering for denoising non-Gaussian signals corrupted by Gaussian noise performs well when a soft-threshold (shrinkage) operator is applied to the components of the sparse code [5]. For data sets contaminated by Poisson noise, however, it is necessary to develop a new ICA-domain filter adapted to the signal-dependent nature of the noise. In this paper, we develop a novel ICA-domain shrinkage procedure for noise removal in Poisson-noisy images. The shrinkage scheme (filter) adapts to both the signal and the noise, and balances the trade-off between noise removal and excessive smoothing of image details. The filtering procedure has a simple interpretation as a joint edge detection/estimation process. Our method is closely related to wavelet shrinkage, but it has the important benefit over wavelet methods that the representation is determined solely by the statistical properties of the data. Therefore, ICA based methods may perform better than wavelet based methods in denoising applications. The paper is organized as follows: Section 2 reviews the ICA based denoising algorithm, Section 3 gives a new ICA-domain shrinkage scheme for Poisson noise, Section 4 presents the experimental results, and concluding remarks are given in the final section.
2
ICA Based Denoising Algorithm
In using the method of Hyvarinen et al. to denoise a signal corrupted by Gaussian noise, one first employs the fixed-point algorithm on noise-free data to obtain the transformation matrix (basis functions), and then uses maximum likelihood to estimate the parameters of the shrinkage scheme. Assume that we observe an n-dimensional vector contaminated by Gaussian noise. We denote by x the observed noisy vector, by s the original non-Gaussian vector and by u the noise signal. Then

x = s + u    (1)

In the method of Hyvarinen et al., u is Gaussian white noise. The goal of signal denoising is to find š = g(x) such that š is close to s in some well-defined sense. The ICA based denoising method of Hyvarinen et al. works as follows [5]:

Step 1. Estimate an orthogonal ICA transformation matrix w using a set of noise-free representative data z.
Step 2. For i = 1, …, n, estimate a density which approximates the actual distribution of the variable s_i = w_i^T z (where w_i is the ith column of w). Based on the estimated model and the variance of u (assumed to be known), determine the nonlinear function g_i.
Step 3. For each observed x, the final denoising procedure is:
(1) ICA transform:        y = w x    (2)
(2) Nonlinear shrinkage:  š_i = g_i(y_i)    (3)
(3) Inverse transform:    š = w^T š    (4)
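The three-step procedure above can be written compactly as in the sketch below. This is only schematic: the orthogonal ICA matrix w (learned from noise-free data) and the per-component shrinkage nonlinearity g are assumed given, and the soft-threshold shown is just one classical choice, not necessarily the one estimated by maximum likelihood in [5].

```python
# Schematic version of Steps 1-3 (Eqs. (2)-(4)); w and g are assumed given.
import numpy as np

def ica_domain_denoise(x, w, g):
    """x: noisy vector; w: orthogonal ICA matrix; g: component-wise shrinkage."""
    y = w @ x                      # (2) ICA transform
    y_shrunk = g(y)                # (3) nonlinear shrinkage
    return w.T @ y_shrunk          # (4) inverse transform (w is orthogonal)

# Example shrinkage: classical soft thresholding with an illustrative threshold t.
def soft_threshold(y, t=0.1):
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)
```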
The ICA based method needs additional noise-free data to estimate the transformation matrix w and the shrinkage nonlinearities. Although we cannot obtain the exact original of the noisy image, we can use images that have a structure similar to the noisy image. For example, if the noisy image is a man-made scene, we choose similar noise-free man-made images to obtain the transformation matrix w. In the ICA based method of Hyvarinen et al. the additive noise is Gaussian white noise, whereas our goal is to reduce Poisson noise in images. In the next section, based on the special property of Poisson noise, we give an efficient shrinkage scheme which can be obtained directly from the noisy data.
3
New Shrinkage Scheme Adjusts to Poisson Noise
In the case of Poisson noise, the noise power differs between ICA-domain coefficients, depending on the image intensity, after the ICA transformation of the image. This spatial variation of the noise must be accounted for in the design of the ICA-domain shrinkage function. The shrinkage function of Hyvarinen et al. does not adjust to these differences. Given the signal and noise power, a natural choice for an ICA-domain shrinkage function is

š = s, if the SNR in s is high;  š = 0, if the SNR in s is low.    (5)

According to R. D. Nowak et al. [6], a cross-validation algorithm can be used to design an optimal shrinkage function of this form. Since the ICA transform matrix (basis functions) has properties similar to wavelet basis functions, we can directly use the nonlinear shrinkage function of wavelet-domain denoising. The optimal shrinkage function in [6] takes the form

š = s (s² − δ²) / s²    (6)
where š is the denoised ICA-domain coefficient, s is the noisy ICA-domain coefficient, and š² and δ² are, respectively, the power of the noise-free signal and of the Poisson noise. We can simply take š² of the form

š² = s² − δ²    (7)

where s is obtained directly from the ICA transformation and the noise power is obtained from the following formula (δ_i² is the noise power of the ith component):
δ_i² = (w_i · w_i) x    (8)
Thus we can obtain the noise power in each sample of noisy data in the ICA domain and then denoise each data sample according to its SNR. The shrinkage function can be interpreted as follows. Because the ICA transform matrix w can be considered a set of local directional filters, after the ICA transformation the ICA-domain coefficients can be thought of as projections of the image onto localized "details". For the noise power estimate, we project the image onto the square of the corresponding transformation vector, which effectively computes a weighted average of local intensity in the image; this approximates the noise power according to the properties of Poisson noise. It is clear that this estimate of noise power adapts to local variations in the signal and noise. The shrinkage function above simply weights each noisy ICA coefficient s(i, j) by a factor equal to the estimated signal power divided by the estimated signal-plus-noise power. If the estimated signal to signal-plus-noise power ratio is negative, the shrinkage function simply thresholds the ICA-domain coefficient to zero. Hence, this optimal shrinkage function has a very simple interpretation as a data-adaptive ICA-domain Wiener filter.
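A hedged sketch of the Poisson-adapted shrinkage of Eqs. (6)-(8) for a single vectorized patch is given below; w is the (orthogonalized) ICA matrix with rows w_i, and flooring the estimated signal power at zero implements the thresholding behaviour just described. This is an illustration under stated assumptions, not the authors' implementation.

```python
# Sketch of the data-adaptive ICA-domain Wiener-like filter (Eqs. (6)-(8)).
import numpy as np

def poisson_ica_shrinkage(x, w, eps=1e-12):
    s = w @ x                                        # ICA-domain coefficients
    noise_power = (w * w) @ x                        # Eq. (8): project x onto squared weights
    signal_power = np.maximum(s ** 2 - noise_power, 0.0)   # Eq. (7), thresholded at zero
    shrunk = s * signal_power / (s ** 2 + eps)       # Eq. (6)
    return w.T @ shrunk                              # back to the image domain
```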
4
Experiment Results
In this section, we present a comparison of the performance of the proposed ICA-domain shrinkage scheme, a modified Wiener filter, and R. D. Nowak's wavelet denoising method [1]. In our experiment, we use the standard Lena image and simulate a Poisson process on it to obtain the Poisson-noisy image. The original and noisy images are shown in Fig. 1(a) and (b), respectively. We used Tony Bell and T. J. Sejnowski's infomax algorithm to learn the ICA transformation matrix w [7][8], with 8 × 8 sub-windows randomly sampled from noise-free images. These sub-windows were presented as 64-dimensional vectors, and the DC value was removed from each vector as a preprocessing step. The infomax algorithm was performed on these vectors to obtain the transformation matrix w. For the reason given in [5], we orthogonalize w by

w = w (w^T w)^{-1/2}    (9)
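One way to realize Eq. (9) numerically is through an eigendecomposition of w^T w, as sketched below; this is a generic symmetric-orthogonalization routine, not the authors' code.

```python
# Symmetric orthogonalization w <- w (w^T w)^(-1/2) via eigendecomposition.
import numpy as np

def symmetric_orthogonalize(w):
    d, E = np.linalg.eigh(w.T @ w)                   # w^T w = E diag(d) E^T
    inv_sqrt = E @ np.diag(1.0 / np.sqrt(d)) @ E.T   # (w^T w)^(-1/2)
    return w @ inv_sqrt
```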
After the ICA transformation, the denoising algorithm was applied to each sub-window of the image, so that 64 reconstructions were obtained for each pixel; the final result is the mean of these reconstructions. Experimental results are shown in Fig. 1. We denoised the Lena image contaminated by Poisson noise using the modified Wiener filter (Wiener filtering on the square root of the noisy image), the wavelet-domain optimal filter, and our method. It is evident that our method produces better noise-removal results than the other two. Table 1 gives the SNR and MSE of the three algorithms; for the comparison, the intensities of all images are normalized to [0, 255]. From Table 1, we see that our method obtains a higher SNR and a smaller MSE.
Fig. 1. (a) The original image, (b) Poisson noisy image, (c) Result by Wiener filter on the square root of noisy image, (d) Result by wavelet-domain filter, (e) Result by our method
Table 1. SNR and M.S.E comparison

Method                       S/N (dB)   M.S.E
Noisy image                  10.4249    5.7885
Modified Wiener filtering    15.6969    1.7085
Wavelet transformation       15.7572    1.6830
Our denoising algorithm      19.9366    0.6375

5
Conclusion
The usual Poisson denoising methods mainly include the modified Wiener filter and wavelet shrinkage. However, the preprocessing of the modified Wiener filter cannot turn Poisson noise into an accurate approximation of Gaussian noise, and for wavelet shrinkage the basis functions are fixed. The ICA based method adjusts the transform matrix according to the data and thus provides a new way to improve denoising performance. However, this method needs additional noise-free data to estimate the transformation matrix w. Future work will focus on how to obtain the ICA transformation matrix from noisy data.
References

[1] R. D. Nowak and R. Baraniuk: Wavelet Domain Filtering for Photon Imaging Systems. IEEE Transactions on Image Processing, May 1999.
[2] A. Hyvarinen, E. Oja, and P. Hoyer: Image Denoising by Sparse Code Shrinkage. In S. Haykin and B. Kosko (eds), Intelligent Signal Processing, IEEE Press, 2000.
[3] P. Hoyer: Independent Component Analysis in Image Denoising. Master's Thesis, Helsinki University of Technology, 1999.
[4] R. Oktem et al.: Transform Based Denoising Algorithms: Comparative Study. Tampere University of Technology, 1999.
[5] A. Hyvarinen: Sparse Code Shrinkage: Denoising of Nongaussian Data by Maximum Likelihood Estimation. Neural Computation, 11(7): 1739-1768, 1999.
[6] R. D. Nowak: Optimal signal estimation using cross-validation. IEEE Signal Processing Letters, vol. 3, no. 2, pp. 23-25, 1996.
[7] T.-W. Lee, M. Girolami, A. J. Bell and T. J. Sejnowski: A Unifying Information-theoretic Framework for Independent Component Analysis. Computers & Mathematics with Applications, Vol. 31(11), 1-21, March 2000.
[8] T.-W. Lee, M. Girolami and T. J. Sejnowski: Independent Component Analysis Using an Extended Infomax Algorithm for Mixed Sub-Gaussian and Super-Gaussian Sources. Neural Computation, Vol. 11(2): 417-441, 1999.
Recursive Approach for Real-Time Blind Source Separation of Acoustic Signals

Shuxue Ding and Jie Huang

School of Computer Science and Engineering, The University of Aizu, Tsuruga, Ikki-machi, Aizu-Wakamatsu City, Fukushima, 965-8580 Japan
[email protected], [email protected]
Abstract. In this paper we propose and investigate a recursive approach for blind source separation (BSS) and independent component analysis (ICA). Based on this approach we present a deterministic (i.e., without stochastic learning) algorithm for real-time blind source separation of convolutive mixtures. When applied to acoustic signals, the algorithm shows a superior rate of convergence over its gradient-based counterpart in our simulations. By applying the algorithm in a real-time BSS system for realistic acoustic signals, we also give experiments that illustrate its effectiveness and validity.
1
Introduction
Blindly separating acoustic sources in a real-world environment has proven to be a very challenging problem [1]. A usual way to approach it is to model the real-world superposition of audio sources as a mixture of delayed and filtered versions of the sources. To separate the sources successfully, one needs to estimate the relative delays between channels and the weights of the filter taps as accurately as possible. Accurately estimating the weights is a challenging task, since the filters needed to model a realistic environment are very long. In [2], the authors studied blind separation of acoustic sources in real-world environments, especially inside vehicles. In that approach, the sources are first separated in the time-frequency domain, where the mixture becomes instantaneous, and the separated output signals are then reconstructed in the time domain. A similar consideration can also be found in [3], though a different criterion for BSS is used. Although the approach is quite effective, it can only work in a batch mode or a semi-real-time mode with a large buffer. Therefore other approaches are needed to realize real-time blind source separation. One possibility is to adapt the filter weights in the time domain by a gradient search for the minimum of a corresponding cost function; indeed, there have been many discussions of such methods (e.g., [4, 5, 6, 7]).
However, a major problem, namely convergence speed, arises when such an approach is applied to acoustic signals. A gradient search for the minimum still cannot converge fast enough to satisfy the requirements of typical realistic applications, even though the so-called natural gradient approach has greatly improved the convergence of the plain gradient approach [4]. Since transmitting channels are usually time-variant in real-world environments, the learning process must converge as fast as possible to track these time variations. Because convergence becomes slower as the eigenvalue spread of the correlation matrix of the input signals grows, the slow convergence of gradient-based approaches appears to be deeply related to the instability of acoustic sources. The local minima of the cost function are also related to the slow convergence. Based on our experiments, gradient learning can scarcely converge to the true minimum in most realistic situations, since there are too many local minima. It is helpful to recall the situation of adaptive filtering with supervised learning. Adaptive processing is usually implemented by the least mean square (LMS) algorithm, which also converges more slowly for signals with larger eigenvalue spreads. However, the recursive least squares (RLS) algorithm can remarkably improve the situation [8]. The least-squares problem is formulated as the normal equation with respect to the samples observed so far, which corresponds deterministically to the true minimum of the weighted mean square error. By solving the normal equation, one reaches the true minimum directly, avoiding the traps of local minima. The motivation of this paper is to investigate the possibility of developing a recursive type of BSS approach that likewise improves the convergence speed of conventional gradient-based approaches. In this approach, we address the problems arising from the non-stationarity of source signals and, simultaneously, the local-minimum problem of the cost function. A deterministic (without a stochastic learning process) algorithm is presented for real-time blind separation of convolutively mixed sources. When applied to acoustic signals, the algorithm shows a superior rate of convergence and a lower cost-function floor than its gradient-based counterpart in our simulations. By applying the algorithm in a real-time BSS system for realistic acoustic signals, we also give experiments that illustrate its effectiveness and validity.
2
Problem Formulation
We assume M statistically independent sources s(t) = [s_1(t), ..., s_M(t)]^T, where t is the sample index. These sources are convolved and mixed in a linear medium, leading to N sensor signals x(t) = [x_1(t), ..., x_N(t)]^T,

x(t) = A ∗ s(t)
(1)
where ∗ denotes the convolution operator and A is an N × M matrix of filters that describes the transmitting channels. At this stage of the discussion we ignore the sensor-noise term for simplicity.
The purpose of BSS is to find an inverse model W of A such that y(t) = W ∗ x(t)
(2)
and the components of y(t) become as independent as possible. We can transform equation (1) and equation (2) into the frequency domain, X(ω, t) = A(ω)S(ω, t)
(3)
Y(ω, t) = W(ω)X(ω, t)
(4)
where X(ω, t) = DFT([x(t), ..., x(t + L − 1)]), Y(ω, t) = DFT([y(t), ..., y(t + L − 1)]), and W(ω) = DFT(W). Here DFT denotes the discrete Fourier transform and L is its length. In recursive implementations of the BSS approach, we start the computation with known initial conditions and use the information contained in new samples to update the old estimate of the optimal solution; the length of the observed sample sequence is therefore variable. Moreover, we separate the sources in the frequency domain, since this is more efficient than working in the time domain. Accordingly, we express the cost function to be minimized as l(ω, n), where ω is the frequency and n is the variable number of observed sample blocks. Conveniently, if we set n = 1 for the initial sample block, n is equal to the index of the current sample block. Similarly, the separation matrix becomes W(ω, n), which also depends on n. As a first attempt at a recursive approach, in this paper we use a cost function based on second-order moments of the signals. There have already been many discussions of convolutive BSS with such a cost function [3, 5, 6, 7, 9]. Since the subband signals on different frequency bins are approximately orthogonal to each other, we can realize source separation by separating each bin independently. A difference from the previous discussions, however, is that we introduce a weighting factor into the cost function, as is customary in recursive approaches. We thus write the cost function as

l(ω, n) = Σ_{i≠j} |(R_Y(ω, n))_ij|²    (5)
where

(R_Y(ω, n))_ij ≡ Σ_{k=0}^{n} β(n, k) Y_i(ω, kδ) Y_j^H(ω, kδ) / Λ_i    (6)

is the weighted correlation matrix of the outputs. Here β(n, k) is the weighting factor, δ is the sample shift between neighboring blocks, and Λ_i is a normalization factor (the variance of the i-th output) used to normalize the covariance of the i-th output. The use of the weighting factor β(n, k) is intended to ensure that samples in the distant past are "forgotten", in order to afford the possibility of following the
statistical variations of the observable samples when the BSS operates in a non-stationary environment. A commonly used special form of weighting is the exponential forgetting factor defined by β(n, k) = λ^{n−k}, for k = 1, 2, ..., n, where λ is a positive constant close to, but less than, 1. By equations (4) and (6), the weighted correlation matrix of the outputs can be written as

R_Y(ω, n) = W(ω, n) R_X(ω, n) W^H(ω, n)    (7)

where

(R_X(ω, n))_ij ≡ Σ_{k=0}^{n} λ^{n−k} X_i(ω, kδ) X_j^H(ω, kδ)    (8)
is the weighted correlation matrix of the normalized observations in the frequency domain. The problem of blind source separation can now be formulated as finding W(ω, n) such that the cost function l(ω, n) attains its minimum value. In BSS and ICA, the traditional way to find this minimum is to use stochastic gradient optimization [4, 1]. Instead, we give a different, recursive approach. The idea is that when the outputs of the BSS become mutually independent, the weighted cross-correlations of the outputs in equation (7) should be approximately zero, i.e.,

R_Y(ω, n) = I, or W(ω, n) R_X(ω, n) W^H(ω, n) = I    (9)
We may call equation (9) the "normal equation" of BSS, in analogy with the normal equation of the RLS algorithm in adaptive filtering [8]. The problem of BSS can now be reduced to finding solutions of equation (9).
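For one frequency bin, the quantities in Eqs. (5)-(9) can be computed in batch form as in the sketch below. This is only an illustration under stated assumptions (0-based block indexing, the normalization Λ_i omitted), not the paper's implementation.

```python
# Hedged sketch of the exponentially weighted correlation matrix (Eq. (8)) and
# the off-diagonal cost (Eqs. (5), (7)) at a single frequency bin ω.
import numpy as np

def weighted_correlation(X_blocks, lam=0.95):
    """X_blocks: (n_blocks, N) frequency-domain frames X(ω, kδ) at one bin."""
    n, N = X_blocks.shape
    R = np.zeros((N, N), dtype=complex)
    for k in range(n):
        Xk = X_blocks[k][:, None]                           # column vector
        R += lam ** (n - 1 - k) * (Xk @ Xk.conj().T)        # forgetting factor
    return R

def off_diagonal_cost(W, R_X):
    """Cost of Eq. (5): squared magnitudes of the off-diagonal entries of R_Y."""
    R_Y = W @ R_X @ W.conj().T                              # Eq. (7)
    off = R_Y - np.diag(np.diag(R_Y))
    return float(np.sum(np.abs(off) ** 2))
```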
3
A Recursive Algorithm for BSS/ICA
We shall give a recursive approach to estimate R_X(ω, n) and a deterministic method to solve equation (9), which makes online processing quite easy. First, equation (9) can be written as

W(ω, n) W^H(ω, n) = (R_X(ω, n))^{-1}    (10)
It is easy to show that the n-th correlation matrix R_X(ω, n) is related to the (n−1)-th correlation matrix R_X(ω, n − 1) by

(R_X(ω, n))_ij = λ (R_X(ω, n − 1))_ij + X_i(ω, nδ) X_j^H(ω, nδ)    (11)
where R_X(ω, n − 1) is the previous value of the correlation matrix, and the matrix product X_i(ω, nδ) X_j^H(ω, nδ) plays the role of a "correction" term in the updating operation.
According to the matrix inversion lemma [4], we obtain

(R_X^{-1}(ω, n))_ij = λ^{-1} (R_X^{-1}(ω, n − 1))_ij − ( λ^{-2} R_X^{-1}(ω, n − 1) X(ω, n) X^H(ω, n) R_X^{-1}(ω, n − 1) / (1 + λ^{-1} X^H(ω, n) R_X^{-1}(ω, n − 1) X(ω, n)) )_ij    (12)
Fig. 1 shows a block diagram of our recursive algorithm. In the figure, we have omitted the DFT of the input signals and the IDFT of the output signals. We adopt the overlap-and-save method [8] for the real-time DFT-IDFT processing. This method is needed because (1) in order to make the separation filters perform linear convolutions instead of cyclic ones, part of the tap weights have to be set to zero [8]; and (2) the so-called permutation problem [3, 2] can be solved by constraining the solutions W(ω, n) to filters that have no time response beyond a fixed size [6]. The initial condition for the recursive processing is R_X^{-1}(ω, n) = I for n ≤ 0. In a conventional gradient-type algorithm, some parameters related to the output signals, for example a score function, usually need to be estimated, and the results of these estimations are fed back to update the matrix of separation filters. In contrast, Fig. 1 shows that the recursive BSS algorithm has no such feedback at all: all of the estimates used to update the separation matrix are computed from the input signals only.
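Below is a hedged sketch of the recursive update of Eqs. (11)-(12) at one frequency bin, written in the style of an RLS inverse-correlation update. Extracting W(ω, n) from R_X^{-1} so that Eq. (10) holds is only indicated in a comment (e.g. via a Cholesky factor), since the paper additionally constrains W for the permutation problem; this is not the authors' implementation.

```python
# Sketch of the recursive inverse-correlation update (matrix inversion lemma).
import numpy as np

def update_inverse_correlation(P, X_new, lam=0.95):
    """P: R_X^{-1}(ω, n-1); X_new: new frequency-domain frame X(ω, nδ) at this bin."""
    X = X_new[:, None]                                        # column vector
    PX = P @ X
    denom = 1.0 + (X.conj().T @ PX).real.item() / lam
    return P / lam - (PX @ PX.conj().T) / (lam ** 2 * denom)  # = R_X^{-1}(ω, n)

# Initial condition R_X^{-1}(ω, 0) = I, as stated in the text; N = 2 sensors here.
P = np.eye(2, dtype=complex)
# One possible (unconstrained) choice satisfying Eq. (10) would be a Cholesky
# factor of P, e.g. W = np.linalg.cholesky(P), before applying further constraints.
```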
Fig. 1. Block diagram for recursive BSS
4
Simulations and Experiments
For a blind separation system, since no reference signal is available and the parameters of the mixing model are unknown, it is not straightforward to define a performance measure. Since the cost function defined by equation (5) is related to the cross-correlations between the outputs, and the cross-correlation between independent sources is very small, it can be taken as a quantitative measure of separation performance. In this paper, we only consider the case M = 2 and N = 2; however, some extra background noise is present.

4.1
Simulation Results of Separation of Real-World Benchmarks
In this subsection, a real-world benchmark recording downloaded from the web [10] is used to evaluate the recursive BSS algorithm. Figure 2 shows the learning curves of the conventional gradient-based BSS and of the recursive BSS proposed in this paper. The results in Fig. 2 clearly show the superior rate of convergence of the recursive BSS over its gradient-based counterpart. In this figure, the vertical axis shows the value of the cost function and the horizontal axis shows the iteration number. Since one iteration of processing corresponds to one signal sample block, the horizontal axis also shows the sample block number n. The wild fluctuations of the learning curves in Fig. 2 and the following figures are due to the non-stationarity of the sources.
Fig. 2. Learning curves for recursive BSS (λ = 0.95) and gradient BSS algorithms (µ = 0.01, optimized). Signals: rss mA and rss mB (Lee [10]); Size of filter taps=1024; Length of FFT=4096
Fig. 3. Learning curves for the recursive BSS (λ = 0.95) and gradient BSS algorithms (µ = 0.01, optimized). Signals: real-world recordings in vehicle environment; Size of filter taps = 2048; Length of FFT=8192
4.2
Experimental Results of Separation of Real-World Recordings
Real-time experiments have been implemented both as a Simulink model and on a TMS320C6701 Evaluation Module board from Texas Instruments. The experiments were done with audio recorded in a real acoustic environment (an automobile). The automobile interior used for the recordings was 114.0 cm × 136.5 cm × 233.0 cm (height × width × depth). Two persons read their sentences and the resulting sound was recorded by two microphones spaced 10.0 cm apart. The recordings were digitized to 16 bits per sample at a sampling rate of 44.1 kHz. The acoustic environment was corrupted by engine noise and other directionless noises. The learning curves presented in Fig. 3 again clearly show the superior rate of convergence of the recursive BSS over its gradient-based counterpart. The meanings of the vertical and horizontal axes are the same as in Fig. 2.
5
Conclusions and Discussions
In this paper we have proposed and investigated a recursive approach for the real-time implementation of BSS/ICA. Based on this approach we have presented a deterministic algorithm for real-time blind separation of convolutively mixed sources. When applied to acoustic signals, the algorithm has shown a superior rate of convergence over its gradient-based counterpart in
our simulations. By applying the algorithm in a real-time BSS system for realistic acoustic signals, we have also given experiments that illustrate its effectiveness and validity. At the present stage, we have only realized a recursive BSS algorithm with a cost function based on second-order statistics of the signals. The approach is quite general, however, and can be applied to other cost functions based on higher-order statistics. We intend to present such considerations and investigations in future work.
References

[1] Torkkola, K.: Blind separation of convolved sources based on information maximization. In S. Usui, Y. Tohkura, S. Katagiri, and E. Wilson, editors, Proc. NNSP96, pp. 423-432, New York, NY, 1996. IEEE Press.
[2] Ding, S., Otsuka, M., Ashizawa, M., Niitsuma, T., and Sugai, K.: Blind source separation of real-world acoustic signals based on ICA in time-frequency domain. Technical Report of IEICE, Vol. EA2001-1, pp. 1-8, 2001.
[3] Murata, N., Ikeda, S. and Ziehe, A.: An approach to blind source separation based on temporal structure of speech signals. BSIS Technical Reports, 98-2, 1998.
[4] Cichocki, A. and Amari, S.: Adaptive Blind Signal and Image Processing. John Wiley & Sons, Ltd., 2002.
[5] Kawamoto, M., Matsuoka, K. and Ohnishi, N.: A method of blind separation for convolved non-stationary signals. Neurocomputing, Vol. 22, pp. 157-171, 1998.
[6] Parra, L. and Spence, C.: Convolutive blind separation of non-stationary sources. IEEE Trans. Speech Audio Proc., Vol. 8, No. 3, pp. 320-327, May 2000.
[7] van de Laar, J., Habets, E., Peters, J. and Lokkart, P.: Adaptive Blind Audio Signal Separation on a DSP. Proc. ProRISC 2001, pp. 475-479, 2002.
[8] Haykin, S.: Adaptive Filter Theory. 3rd Edition, Prentice-Hall, Inc., 1996.
[9] Schobben, D. W. E. and Sommen, P. C. W.: A new blind signal separation algorithm based on second order statistics. Proc. IASTED, pp. 564-569, 1998.
[10] Lee, T.: http://www.cnl.salk.edu/~tewon/