Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board David Hutchison Lancaster University, UK Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Friedemann Mattern ETH Zurich, Switzerland John C. Mitchell Stanford University, CA, USA Moni Naor Weizmann Institute of Science, Rehovot, Israel Oscar Nierstrasz University of Bern, Switzerland C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen University of Dortmund, Germany Madhu Sudan Massachusetts Institute of Technology, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University of California, Berkeley, CA, USA Moshe Y. Vardi Rice University, Houston, TX, USA Gerhard Weikum Max-Planck Institute of Computer Science, Saarbruecken, Germany
4492
Derong Liu Shumin Fei Zengguang Hou Huaguang Zhang Changyin Sun (Eds.)
Advances in Neural Networks – ISNN 2007 4th International Symposium on Neural Networks, ISNN 2007 Nanjing, China, June 3-7, 2007 Proceedings, Part II
Volume Editors Derong Liu University of Illinois at Chicago, IL 60607-7053, USA E-mail:
[email protected] Shumin Fei Southeast University, School of Automation, Nanjing 210096, China E-mail:
[email protected] Zengguang Hou The Chinese Academy of Sciences, Institute of Automation, Beijing, 100080, China E-mail:
[email protected] Huaguang Zhang Northeastern University, Shenyang 110004, China E-mail:
[email protected] Changyin Sun Hohai University, School of Electrical Engineering, Nanjing 210098, China E-mail:
[email protected]
Library of Congress Control Number: 2007926816
CR Subject Classification (1998): F.1, F.2, D.1, G.2, I.2, C.2, I.4-5, J.1-4
LNCS Sublibrary: SL 1 – Theoretical Computer Science and General Issues
ISSN 0302-9743
ISBN-10 3-540-72392-7 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-72392-9 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. Springer is a part of Springer Science+Business Media springer.com © Springer-Verlag Berlin Heidelberg 2007 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper SPIN: 12060771 06/3180 543210
Preface
ISNN 2007 – the Fourth International Symposium on Neural Networks—was held in Nanjing, China, as a sequel of ISNN 2004/ISNN 2005/ISNN 2006. ISNN has now become a well-established conference series on neural networks in the region and around the world, with growing popularity and increasing quality. Nanjing is an old capital of China, a modern metropolis with a 2470-year history and rich cultural heritage. All participants of ISNN 2007 had a technically rewarding experience as well as memorable experiences in this great city. A neural network is an information processing structure inspired by biological nervous systems, such as the brain. It consists of a large number of highly interconnected processing elements, called neurons. It has the capability of learning from example. The field of neural networks has evolved rapidly in recent years. It has become a fusion of a number of research areas in engineering, computer science, mathematics, artificial intelligence, operations research, systems theory, biology, and neuroscience. Neural networks have been widely applied for control, optimization, pattern recognition, image processing, signal processing, etc. ISNN 2007 aimed to provide a high-level international forum for scientists, engineers, and educators to present the state of the art of neural network research and applications in diverse fields. The symposium featured plenary lectures given by worldwide renowned scholars, regular sessions with broad coverage, and some special sessions focusing on popular topics. The symposium received a total of 1975 submissions from 55 countries and regions across all six continents. The symposium proceedings consists of 454 papers among which 262 were accepted as long papers and 192 were accepted as short papers. We would like to express our sincere gratitude to all reviewers of ISNN 2007 for the time and effort they generously gave to the symposium. We are very grateful to the National Natural Science Foundation of China, K. C. Wong Education Foundation of Hong Kong, the Southeast University of China, the Chinese University of Hong Kong, and the University of Illinois at Chicago for their financial support. We would also like to thank the publisher, Springer, for cooperation in publishing the proceedings in the prestigious series of Lecture Notes in Computer Science. Derong Liu Shumin Fei Zeng-Guang Hou Huaguang Zhang Changyin Sun
ISNN 2007 Organization
General Chair Derong Liu, University of Illinois at Chicago, USA, and Yanshan University, China
General Co-chair Marios M. Polycarpou, University of Cyprus
Organization Chair Shumin Fei, Southeast University, China
Advisory Committee Chairs Shun-Ichi Amari, RIKEN Brain Science Institute, Japan Chunbo Feng, Southeast University, China Zhenya He, Southeast University, China
Advisory Committee Members Hojjat Adeli, Ohio State University, USA Moonis Ali, Texas State University-San Marcos, USA Zheng Bao, Xidian University, China Tamer Basar, University of Illinois at Urbana-Champaign, USA Tianyou Chai, Northeastern University, China Guoliang Chen, University of Science and Technology of China, China Ruwei Dai, Chinese Academy of Sciences, China Dominique M. Durand, Case Western Reserve University, USA Russ Eberhart, Indiana University Purdue University Indianapolis, USA David Fogel, Natural Selection, Inc., USA Walter J. Freeman, University of California-Berkeley, USA Toshio Fukuda, Nagoya University, Japan Kunihiko Fukushima, Kansai University, Japan Tom Heskes, University of Nijmegen, The Netherlands Okyay Kaynak, Bogazici University, Turkey Frank L. Lewis, University of Texas at Arlington, USA Deyi Li, National Natural Science Foundation of China, China Yanda Li, Tsinghua University, China Ruqian Lu, Chinese Academy of Sciences, China
John MacIntyre, University of Sunderland, UK Robert J. Marks II, Baylor University, USA Anthony N. Michel, University of Notre Dame, USA Evangelia Micheli-Tzanakou, Rutgers University, USA Erkki Oja, Helsinki University of Technology, Finland Nikhil R. Pal, Indian Statistical Institute, India Vincenzo Piuri, University of Milan, Italy Jennie Si, Arizona State University, USA Youxian Sun, Zhejiang University, China Yuan Yan Tang, Hong Kong Baptist University, China Tzyh Jong Tarn, Washington University, USA Fei-Yue Wang, Chinese Academy of Sciences, China Lipo Wang, Nanyang Technological University, Singapore Shoujue Wang, Chinese Academy of Sciences Paul J. Werbos, National Science Foundation, USA Bernie Widrow, Stanford University, USA Gregory A. Worrell, Mayo Clinic, USA Hongxin Wu, Chinese Academy of Space Technology, China Youlun Xiong, Huazhong University of Science and Technology, China Lei Xu, Chinese University of Hong Kong, China Shuzi Yang, Huazhong University of Science and Technology, China Xin Yao, University of Birmingham, UK Bo Zhang, Tsinghua University, China Siying Zhang, Qingdao University, China Nanning Zheng, Xi’an Jiaotong University, China Jacek M. Zurada, University of Louisville, USA
Steering Committee Chair Jun Wang, Chinese University of Hong Kong, China
Steering Committee Co-chair Zongben Xu, Xi’an Jiaotong University, China
Steering Committee Members Tianping Chen, Fudan University, China Andrzej Cichocki, Brain Science Institute, Japan Wlodzislaw Duch, Nicholaus Copernicus University, Poland Chengan Guo, Dalian University of Technology, China Anthony Kuh, University of Hawaii, USA Xiaofeng Liao, Chongqing University, China Xiaoxin Liao, Huazhong University of Science and Technology, China Bao-Liang Lu, Shanghai Jiaotong University, China
Chenghong Wang, National Natural Science Foundation of China, China Leszek Rutkowski, Technical University of Czestochowa, Poland Zengqi Sun, Tsinghua University, China Donald C. Wunsch II, University of Missouri-Rolla, USA Gary G. Yen, Oklahoma State University, Stillwater, USA Zhang Yi, University of Electronic Science and Technology, China Hujun Yin, University of Manchester, UK Liming Zhang, Fudan University, China Chunguang Zhou, Jilin University, China
Program Chairs Zeng-Guang Hou, Chinese Academy of Sciences, China Huaguang Zhang, Northeastern University, China
Special Sessions Chairs Lei Guo, Beihang University, China Wen Yu, CINVESTAV-IPN, Mexico
Finance Chair Xinping Guan, Yanshan University, China
Publicity Chair Changyin Sun, Hohai University, China
Publicity Co-chairs Zongli Lin, University of Virginia, USA Weixing Zheng, University of Western Sydney, Australia
Publications Chair Jinde Cao, Southeast University, China
Registration Chairs Hua Liang, Hohai University, China Bhaskhar DasGupta, University of Illinois at Chicago, USA
Local Arrangements Chairs Enrong Wang, Nanjing Normal University, China Shengyuan Xu, Nanjing University of Science and Technology, China Junyong Zhai, Southeast University, China
Electronic Review Chair Xiaofeng Liao, Chongqing University, China
Symposium Secretariats Ting Huang, University of Illinois at Chicago, USA Jinya Song, Hohai University, China
ISNN 2007 International Program Committee Shigeo Abe, Kobe University, Japan Ajith Abraham, Chung Ang University, Korea Khurshid Ahmad, University of Surrey, UK Angelo Alessandri, University of Genoa, Italy Sabri Arik, Istanbul University, Turkey K. Vijayan Asari, Old Dominion University, USA Amit Bhaya, Federal University of Rio de Janeiro, Brazil Abdesselam Bouzerdoum, University of Wollongong, Australia Martin Brown, University of Manchester, UK Ivo Bukovsky, Czech Technical University, Czech Republic Jinde Cao, Southeast University, China Matthew Casey, Surrey University, UK Luonan Chen, Osaka-Sandai University, Japan Songcan Chen, Nanjing University of Aeronautics and Astronautics, China Xiao-Hu Chen, Nanjing Institute of Technology, China Xinkai Chen, Shibaura Institute of Technology, Japan Yuehui Chen, Jinan University, Shandong, China Xiaochun Cheng, University of Reading, UK Zheru Chi, Hong Kong Polytechnic University, China Sungzoon Cho, Seoul National University, Korea Seungjin Choi, Pohang University of Science and Technology, Korea Tommy W. S. Chow, City University of Hong Kong, China Emilio Corchado, University of Burgos, Spain Jose Alfredo F. Costa, Federal University, UFRN, Brazil Mingcong Deng, Okayama University, Japan Shuxue Ding, University of Aizu, Japan Meng Joo Er, Nanyang Technological University, Singapore Deniz Erdogmus, Oregon Health & Science University, USA
Gary Feng, City University of Hong Kong, China Jian Feng, Northeastern University, China Mauro Forti, University of Siena, Italy Wai Keung Fung, University of Manitoba, Canada Marcus Gallagher, University of Queensland, Australia John Qiang Gan, University of Essex, UK Xiqi Gao, Southeast University, China Chengan Guo, Dalian University of Technology, China Dalei Guo, Chinese Academy of Sciences, China Ping Guo, Beijing Normal University, China Madan M. Gupta, University of Saskatchewan, Canada Min Han, Dalian University of Technology, China Haibo He, Stevens Institute of Technology, USA Daniel Ho, City University of Hong Kong, China Dewen Hu, National University of Defense Technology, China Jinglu Hu, Waseda University, Japan Sanqing Hu, Mayo Clinic, Rochester, Minnesota, USA Xuelei Hu, Nanjing University of Science and Technology, China Guang-Bin Huang, Nanyang Technological University, Singapore Tingwen Huang, Texas A&M University at Qatar Giacomo Indiveri, ETH Zurich, Switzerland Malik Magdon Ismail, Rensselaer Polytechnic Institute, USA Danchi Jiang, University of Tasmania, Australia Joarder Kamruzzaman, Monash University, Australia Samuel Kaski, Helsinki University of Technology, Finland Hon Keung Kwan, University of Windsor, Canada James Kwok, Hong Kong University of Science and Technology, China James Lam, University of Hong Kong, China Kang Li, Queen’s University, UK Xiaoli Li, University of Birmingham, UK Yangmin Li, University of Macau, China Yongwei Li, Hebei University of Science and Technology, China Yuanqing Li, Institute of Infocomm Research, Singapore Hualou Liang, University of Texas at Houston, USA Jinling Liang, Southeast University, China Yanchun Liang, Jilin University, China Lizhi Liao, Hong Kong Baptist University, China Guoping Liu, University of Glamorgan, UK Ju Liu, Shandong University, China Meiqin Liu, Zhejiang University, China Xiangjie Liu, North China Electric Power University, China Yutian Liu, Shangdong University, China Hongtao Lu, Shanghai Jiaotong University, China Jinhu Lu, Chinese Academy of Sciences and Princeton University, USA Wenlian Lu, Max Planck Institute for Mathematics in Sciences, Germany
Shuxian Lun, Bohai University, China Fa-Long Luo, Anyka, Inc., USA Jinwen Ma, Peking University, China Xiangping Meng, Changchun Institute of Technology, China Kevin L. Moore, Colorado School of Mines, USA Ikuko Nishikawa, Ritsumeikan University, Japan Stanislaw Osowski, Warsaw University of Technology, Poland Seiichi Ozawa, Kobe University, Japan Hector D. Patino, Universidad Nacional de San Juan, Argentina Yi Shen, Huazhong University of Science and Technology, China Daming Shi, Nanyang Technological University, Singapore Yang Shi, University of Saskatchewan, Canada Michael Small, Hong Kong Polytechnic University Ashu MG Solo, Maverick Technologies America Inc., USA Stefano Squartini, Universita Politecnica delle Marche, Italy Ponnuthurai Nagaratnam Suganthan, Nanyang Technological University, Singapore Fuchun Sun, Tsinghua University, China Johan A. K. Suykens, Katholieke Universiteit Leuven, Belgium Norikazu Takahashi, Kyushu University, Japan Ying Tan, Peking University, China Yonghong Tan, Guilin University of Electronic Technology, China Peter Tino, Birmingham University, UK Christos Tjortjis, University of Manchester, UK Antonios Tsourdos, Cranfield University, UK Marc van Hulle, Katholieke Universiteit Leuven, Belgium Dan Ventura, Brigham Young University, USA Michel Verleysen, Universite Catholique de Louvain, Belgium Bing Wang, University of Hull, UK Dan Wang, Dalian Maritime University, China Pei-Fang Wang, SPAWAR Systems Center-San Diego, USA Zhiliang Wang, Northeastern University, China Si Wu, University of Sussex, UK Wei Wu, Dalian University of Technology, China Shunren Xia, Zhejiang University, China Yousheng Xia, University of Waterloo, Canada Cheng Xiang, National University of Singapore, Singapore Daoyi Xu, Sichuan University, China Xiaosong Yang, Huazhong University of Science and Technology, China Yingjie Yang, De Montfort University, UK Zi-Jiang Yang, Kyushu University, Japan Mao Ye, University of Electronic Science and Technology of China, China Jianqiang Yi, Chinese Academy of Sciences, China Dingli Yu, Liverpool John Moores University, UK Zhigang Zeng, Wuhan University of Technology, China
Guisheng Zhai, Osaka Perfecture University, Japan Jie Zhang, University of Newcastle, UK Liming Zhang, Fudan University, China Liqing Zhang, Shanghai Jiaotong University, China Nian Zhang, South Dakota School of Mines & Technology, USA Qingfu Zhang, University of Essex, UK Yanqing Zhang, Georgia State University, USA Yifeng Zhang, Hefei Institute of Electrical Engineering, China Yong Zhang, Jinan University, China Dongbin Zhao, Chinese Academy of Sciences, China Hongyong Zhao, Nanjiang University of Aeronautics and Astronautics, China Haibin Zhu, Nipissing University, Canada
Table of Contents – Part II
Chaos and Synchronization

Synchronization of Chaotic Systems Via the Laguerre–Polynomials-Based Neural Network . . . . . . . . . . . . . . . . . . . . . . . . Hongwei Wang and Hong Gu
1
Chaos Synchronization Between Unified Chaotic System and Genesio System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xianyong Wu, Zhi-Hong Guan, and Tao Li
8
Robust Impulsive Synchronization of Coupled Delayed Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Lan Xiang, Jin Zhou, and Zengrong Liu
16
Synchronization of Impulsive Fuzzy Cellular Neural Networks with Parameter Mismatches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tingwen Huang and Chuandong Li
24
Global Synchronization in an Array of Delayed Neural Networks with Nonlinear Coupling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jinling Liang, Ping Li, and Yongqing Yang
33
Self-synchronization Blind Audio Watermarking Based on Feature Extraction and Subsampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xiaohong Ma, Bo Zhang, and Xiaoyan Ding
40
An Improved Extremum Seeking Algorithm Based on the Chaotic Annealing Recurrent Neural Network and Its Application . . . . . . . . . . . . . Yun-an Hu, Bin Zuo, and Jing Li
47
Solving the Delay Constrained Multicast Routing Problem Using the Transiently Chaotic Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Wen Liu and Lipo Wang
57
Solving Prize-Collecting Traveling Salesman Problem with Time Windows by Chaotic Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yanyan Zhang and Lixin Tang
63
A Quickly Searching Algorithm for Optimization Problems Based on Hysteretic Transiently Chaotic Neural Network . . . . . . . . . . . . . . . . . . . . . . Xiuhong Wang and Qingli Qiao
72
Secure Media Distribution Scheme Based on Chaotic Neural Network . . . Shiguo Lian, Zhongxuan Liu, Zhen Ren, and Haila Wang
79
An Adaptive Radar Target Signal Processing Scheme Based on AMTI Filter and Chaotic Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Quansheng Ren, Jianye Zhao, Hongling Meng, and Jianye Zhao
88
Horseshoe Dynamics in a Small Hyperchaotic Neural Network . . . . . . . . . Qingdu Li and Xiao-Song Yang
96
The Chaotic Netlet Map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Geehyuk Lee and Gwan-Su Yi
104
A Chaos Based Robust Spatial Domain Watermarking Algorithm . . . . . . Xianyong Wu, Zhi-Hong Guan, and Zhengping Wu
113
Integrating KPCA and LS-SVM for Chaotic Time Series Forecasting Via Similarity Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jian Cheng, Jian-sheng Qian, Xiang-ting Wang, and Li-cheng Jiao
120
Prediction of Chaotic Time Series Using LS-SVM with Simulated Annealing Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Meiying Ye
127
Radial Basis Function Neural Network Predictor for Parameter Estimation in Chaotic Noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hongmei Xie and Xiaoyi Feng
135
Global Exponential Synchronization of Chaotic Neural Networks with Time Delays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jigui Jian, Baoxian Wang, and Xiaoxin Liao
143
Neural Fuzzy Systems

A Fuzzy Neural Network Based on Back-Propagation . . . . . . . . . . . . . . Huang Jin, Gan Quan, and Cai Linhui
151
State Space Partition for Reinforcement Learning Based on Fuzzy Min-Max Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yong Duan, Baoxia Cui, and Xinhe Xu
160
Realization of an Improved Adaptive Neuro-Fuzzy Inference System in DSP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xingxing Wu, Xilin Zhu, Xiaomei Li, and Haocheng Yu
170
Neurofuzzy Power Plant Predictive Control . . . . . . . . . . . . . . . . . . . . . . . . . . Xiang-Jie Liu and Ji-Zhen Liu GA-Driven Fuzzy Set-Based Polynomial Neural Networks with Information Granules for Multi-variable Software Process . . . . . . . . . . . . . Seok-Beom Roh, Sung-Kwun Oh, and Tae-Chon Ahn
179
186
The ANN Inverse Control of Induction Motor with Robust Flux Observer Based on ESO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xin Wang and Xianzhong Dai Design of Fuzzy Relation-Based Polynomial Neural Networks Using Information Granulation and Symbolic Gene Type Genetic Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . SungKwun Oh, InTae Lee, Witold Pedrycz, and HyunKi Kim
196
206
Fuzzy Neural Network Classification Design Using Support Vector Machine in Welding Defect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xiao-guang Zhang, Shi-jin Ren, Xing-gan Zhang, and Fan Zhao
216
Multi-granular Control of Double Inverted Pendulum Based on Universal Logics Fuzzy Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bin Lu and Juan Chen
224
The Research of Decision Information Fusion Algorithm Based on the Fuzzy Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Pei-Gang Sun, Hai Zhao, Xiao-Dan Zhang, Jiu-Qiang Xu, Zhen-Yu Yin, Xi-Yuan Zhang, and Si-Yuan Zhu Equalization of Channel Distortion Using Nonlinear Neuro-Fuzzy Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Rahib H. Abiyev, Fakhreddin Mamedov, and Tayseer Al-shanableh Comparative Studies of Fuzzy Genetic Algorithms . . . . . . . . . . . . . . . . . . . Qing Li, Yixin Yin, Zhiliang Wang, and Guangjun Liu Fuzzy Random Dependent-Chance Bilevel Programming with Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Rui Liang, Jinwu Gao, and Kakuzo Iwamura Fuzzy Optimization Problems with Critical Value-at-Risk Criteria . . . . . . Yan-Kui Liu, Zhi-Qiang Liu, and Ying Liu
234
241
251
257
267
Neural-Network-Driven Fuzzy Optimum Selection for Mechanism Schemes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yingkui Gu and Xuewen He
275
Atrial Arrhythmias Detection Based on Neural Network Combining Fuzzy Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Rongrong Sun and Yuanyuan Wang
284
A Neural-Fuzzy Pattern Recognition Algorithm Based Cutting Tool Condition Monitoring Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Pan Fu and A.D. Hope
293
Research on Customer Classification in E-Supermarket by Using Modified Fuzzy Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yu-An Tan, Zuo Wang, and Qi Luo
301
Recurrent Fuzzy Neural Network Based System for Battery Charging . . . R.A. Aliev, R.R. Aliev, B.G. Guirimov, and K. Uyar
307
Type-2 Fuzzy Neuro System Via Input-to-State-Stability Approach . . . . . Ching-Hung Lee and Yu-Ching Lin
317
Fuzzy Neural Petri Nets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hua Xu, Yuan Wang, and Peifa Jia
328
Hardware Design of an Adaptive Neuro-Fuzzy Network with On-Chip Learning Capability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tzu-Ping Kao, Chun-Chang Yu, Ting-Yu Chen, and Jeen-Shing Wang Stock Prediction Using FCMAC-BYY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jiacai Fu, Kok Siong Lum, Minh Nhut Nguyen, and Juan Shi
336
346
A Hybrid Rule Extraction Method Using Rough Sets and Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shufeng Wang, Gengfeng Wu, and Jianguo Pan
352
A Novel Approach for Extraction of Fuzzy Rules Using the Neuro-Fuzzy Network and Its Application in the Blending Process of Raw Slurry . . . . Rui Bai, Tianyou Chai, and Enjie Ma
362
Training and Learning Algorithms for Neural Networks

Neural Network Training Using Genetic Algorithm with a Novel Binary Encoding . . . . . . . . . . . . . . . . . . . . . . . . Yong Liang, Kwong-Sak Leung, and Zong-Ben Xu
371
Adaptive Training of a Kernel-Based Representative and Discriminative Nonlinear Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Benyong Liu, Jing Zhang, and Xiaowei Chen
381
Indirect Training of Grey-Box Models: Application to a Bioprocess . . . . . Francisco Cruz, Gonzalo Acu˜ na, Francisco Cubillos, Vicente Moreno, and Danilo Bassi FNN (Feedforward Neural Network) Training Method Based on Robust Recursive Least Square Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . JunSeok Lim and KoengMo Sung
391
398
A Margin Maximization Training Algorithm for BP Network . . . . . . . . . . Kai Wang and Qingren Wang
406
Learning Bayesian Networks Based on a Mutual Information Scoring Function and EMI Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Fengzhan Tian, Haisheng Li, Zhihai Wang, and Jian Yu
414
Learning Dynamic Bayesian Networks Structure Based on Bayesian Optimization Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Song Gao, Qinkun Xiao, Quan Pan, and Qingguo Li
424
An On-Line Learning Algorithm of Parallel Mode for MLPN Models . . . D.L. Yu, T.K. Chang, and D.W. Yu
432
An Robust RPCL Algorithm and Its Application in Clustering of Visual Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Zeng-Shun Zhao, Zeng-Guang Hou, Min Tan, and An-Min Zou
438
An Evolutionary RBFNN Learning Algorithm for Complex Classzification Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jin Tian, Minqiang Li, and Fuzan Chen
448
Stock Index Prediction Based on Adaptive Training and Pruning Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jinyuan Shen, Huaiyu Fan, and Shengjiang Chang
457
An Improved Algorithm for Eleman Neural Network by Adding a Modified Error Function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Zhiqiang Zhang, Zheng Tang, GuoFeng Tang, Catherine Vairappan, XuGang Wang, and RunQun Xiong
465
Regularization Versus Dimension Reduction, Which Is Better? . . . . . . . . . Yunfei Jiang and Ping Guo
474
Integrated Analytic Framework for Neural Network Construction . . . . . . Kang Li, Jian-Xun Peng, Minrui Fei, Xiaoou Li, and Wen Yu
483
Neural Networks Structures

A Novel Method of Constructing ANN . . . . . . . . . . . . . . . . . . . . . . . . . . Xiangping Meng, Quande Yuan, Yuzhen Pi, and Jianzhong Wang
493
Topographic Infomax in a Neural Multigrid . . . . . . . . . . . . . . . . . . . . . . . . . James Kozloski, Guillermo Cecchi, Charles Peck, and A. Ravishankar Rao
500
Genetic Granular Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yan-Qing Zhang, Bo Jin, and Yuchun Tang
510
A Multi-Level Probabilistic Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . Ning Zong and Xia Hong An Artificial Immune Network Model Applied to Data Clustering and Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chenggong Zhang and Zhang Yi Sparse Coding in Sparse Winner Networks . . . . . . . . . . . . . . . . . . . . . . . . . . Janusz A. Starzyk, Yinyin Liu, and David Vogel Multi-valued Cellular Neural Networks and Its Application for Associative Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Zhong Zhang, Takuma Akiduki, Tetsuo Miyake, and Takashi Imamura Emergence of Topographic Cortical Maps in a Parameterless Local Competition Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A. Ravishankar Rao, Guillermo Cecchi, Charles Peck, and James Kozloski Graph Matching Recombination for Evolving Neural Networks . . . . . . . . . Ashique Mahmood, Sadia Sharmin, Debjanee Barua, and Md. Monirul Islam
516
526 534
542
552
562
Orthogonal Least Squares Based on QR Decomposition for Wavelet Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Min Han and Jia Yin
569
Implementation of Multi-valued Logic Based on Bi-threshold Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Qiuxiang Deng and Zhigang Zeng
575
Iteratively Reweighted Fitting for Reduced Multivariate Polynomial Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Wangmeng Zuo, Kuanquan Wang, David Zhang, and Feng Yue
583
Decomposition Method for Tree Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . Peng Huang and Jie Zhu An Intelligent Hybrid Approach for Designing Increasing Translation Invariant Morphological Operators for Time Series Forecasting . . . . . . . . . Ricardo de A. Ara´ ujo, Robson P. de Sousa, and Tiago A.E. Ferreira Ordering Grids to Identify the Clustering Structure . . . . . . . . . . . . . . . . . . Shihong Yue, Miaomiao Wei, Yi Li, and Xiuxiu Wang An Improve to Human Computer Interaction, Recovering Data from Databases Through Spoken Natural Language . . . . . . . . . . . . . . . . . . . . . . . Omar Florez-Choque and Ernesto Cuadros-Vargas
593
602 612
620
3D Reconstruction Approach Based on Neural Network . . . . . . . . . . . . . . . Haifeng Hu and Zhi Yang
630
A New Method of IRFPA Nonuniformity Correction . . . . . . . . . . . . . . . . . . Shaosheng Dai, Tianqi Zhang, and Jian Gao
640
Novel Shape-From-Shading Methodology with Specular Reflectance Using Wavelet Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Lei Yang and Jiu-qiang Han Attribute Reduction Based on Bi-directional Distance Correlation and Radial Basis Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Li-Chao Chen, Wei Zhang, Ying-Jun Zhang, Bin Ye, Li-Hu Pan, and Jing Li Unbiased Linear Neural-Based Fusion with Normalized Weighted Average Algorithm for Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yunfeng Wu and S.C. Ng
646
656
664
Discriminant Analysis with Label Constrained Graph Partition . . . . . . . . Peng Guan, Yaoliang Yu, and Liming Zhang
671
The Kernelized Geometrical Bisection Methods . . . . . . . . . . . . . . . . . . . . . . Xiaomao Liu, Shujuan Cao, Junbin Gao, and Jun Zhang
680
Design and Implementation of a General Purpose Neural Network Processor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yi Qian, Ang Li, and Qin Wang
689
A Forward Constrained Selection Algorithm for Probabilistic Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ning Zong and Xia Hong
699
Probabilistic Motion Switch Tracking Method Based on Mean Shift and Double Model Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Risheng Han, Zhongliang Jing, and Gang Xiao
705
Neural Networks for Pattern Recognition

Human Action Recognition Using a Modified Convolutional Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ho-Joon Kim, Joseph S. Lee, and Hyun-Seung Yang
715
Neural Networks Based Image Recognition: A New Approach . . . . . . . . . . Jiyun Yang, Xiaofeng Liao, Shaojiang Deng, Miao Yu, and Hongying Zheng
724
Human Touching Behavior Recognition Based on Neural Networks . . . . . Joung Woo Ryu, Cheonshu Park, and Joo-Chan Sohn
730
Kernel Fisher NPE for Face Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . Guoqiang Wang, Zongying Ou, Fan Ou, Dianting Liu, and Feng Han
740
A Parallel RBFNN Classifier Based on S-Transform for Recognition of Power Quality Disturbances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Weiming Tong and Xuelei Song
746
Recognition of Car License Plates Using Morphological Features, Color Information and an Enhanced FCM Algorithm . . . . . . . . . . . . . . . . . . . . . . Kwang-Baek Kim, Choong-shik Park, and Young Woon Woo
756
Modified ART2A-DWNN for Automatic Digital Modulation Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xuexia Wang, Zhilu Wu, Yaqin Zhao, and Guanghui Ren
765
Target Recognition of FLIR Images on Radial Basis Function Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jun Liu, Xiyue Huang, Yong Chen, and Naishuai He
772
Two-Dimensional Bayesian Subspace Analysis for Face Recognition . . . . Daoqiang Zhang
778
A Wavelet-Based Neural Network Applied to Surface Defect Detection of LED Chips . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hong-Dar Lin and Chung-Yu Chung
785
Graphic Symbol Recognition of Engineering Drawings Based on Multi-Scale Autoconvolution Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chuan-Min Zhai and Ji-Xiang Du
793
Driver Fatigue Detection by Fusing Multiple Cues . . . . . . . . . . . . . . . . . . . . Rajinda Senaratne, David Hardy, Bill Vanderaa, and Saman Halgamuge
801
Palmprint Recognition Using a Novel Sparse Coding Technique . . . . . . . . Li Shang, Fenwen Cao, Zhiqiang Zhao, Jie Chen, and Yu Zhang
810
Radial Basis Probabilistic Neural Networks Committee for Palmprint Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jixiang Du, Chuanmin Zhai, and Yuanyuan Wan
819
A Connectionist Thematic Grid Predictor for Pre-parsed Natural Language Sentences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jo˜ ao Lu´ıs Garcia Rosa
825
Perfect Recall on the Lernmatrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Israel Rom´ an-God´ınez, Itzam´ a L´ opez-Y´ an ˜ez, and Cornelio Y´ an ˜ez-M´ arquez
835
A New Text Detection Approach Based on BP Neural Network for Vehicle License Plate Detection in Complex Background . . . . . . . . . . . . . . Yanwen Li, Meng Li, Yinghua Lu, Ming Yang, and Chunguang Zhou Searching Eye Centers Using a Context-Based Neural Network . . . . . . . . . Jun Miao, Laiyun Qing, Lijuan Duan, and Wen Gao
842
851
A Fast New Small Target Detection Algorithm Based on Regularizing Partial Differential Equation in IR Clutter . . . . . . . . . . . . . . . . . . . . . . . . . . Biyin Zhang, Tianxu Zhang, and Kun Zhang
861
The Evaluation Measure of Text Clustering for the Variable Number of Clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Taeho Jo and Malrey Lee
871
Clustering-Based Reference Set Reduction for k-Nearest Neighbor . . . . . . Seongseob Hwang and Sungzoon Cho
880
A Contourlet-Based Method for Wavelet Neural Network Automatic Target Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xue Mei, Liangzheng Xia, and Jiuxian Li
889
Facial Expression Analysis on Semantic Neighborhood Preserving Embedding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shuang Xu, Yunde Jia, and Youdong Zhao
896
Face Recognition from a Single Image per Person Using Common Subfaces Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jun-Bao Li, Jeng-Shyang Pan, and Shu-Chuan Chu
905
SOMs, ICA/PCA

A Structural Adapting Self-organizing Maps Neural Network . . . . . . . . . . Xinzheng Xu, Wenhua Zeng, and Zuopeng Zhao
913
921
931
938
Network Anomaly Detection Based on DSOM and ACO Clustering . . . . Yong Feng, Jiang Zhong, Zhong-yang Xiong, Chun-xiao Ye, and Kai-gui Wu
947
Hybrid Pipeline Structure for Self-Organizing Learning Array . . . . . . . . . . Janusz A. Starzyk, Mingwei Ding, and Yinyin Liu
956
CSOM for Mixed Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Fedja Hadzic and Tharam S. Dillon
965
The Application of ICA to the X-Ray Digital Subtraction Angiography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Songyuan Tang, Yongtian Wang, and Yen-wei Chen
979
Relative Principle Component and Relative Principle Component Analysis Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Cheng-Lin Wen, Jing Hu, and Tian-Zhen Wang
985
The Hybrid Principal Component Analysis Based on Wavelets and Moving Median Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Cheng-lin Wen, Shao-hui Fan, and Zhi-guo Chen
994
Recursive Bayesian Linear Discriminant for Classification . . . . . . . . . . . . . 1002 D. Huang and C. Xiang Histogram PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1012 P. Nagabhushan and R. Pradeep Kumar Simultaneously Prediction of Network Traffic Flow Based on PCA-SVR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1022 Xuexiang Jin, Yi Zhang, and Danya Yao An Efficient K-Hyperplane Clustering Algorithm and Its Application to Sparse Component Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1032 Zhaoshui He and Andrzej Cichocki A PCA-Combined Neural Network Software Sensor for SBR Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1042 Liping Fan and Yang Xu Symmetry Based Two-Dimensional Principal Component Analysis for Face Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1048 Mingyong Ding, Congde Lu, Yunsong Lin, and Ling Tong A Method Based on ICA and SVM/GMM for Mixed Acoustic Objects Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1056 Yaobo Li, Zhiliang Ren, Gong Chen, and Changcun Sun ICA Based Super-Resolution Face Hallucination and Recognition . . . . . . 1065 Hua Yan, Ju Liu, Jiande Sun, and Xinghua Sun
Principal Component Analysis Based Probability Neural Network Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1072 Jie Xing, Deyun Xiao, and Jiaxiang Yu A Multi-scale Dynamically Growing Hierarchical Self-organizing Map for Brain MRI Image Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1081 Jingdan Zhang and Dao-Qing Dai
Biomedical Applications A Study on How to Classify the Security Rating of Medical Information Neural Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1090 Jaegu Song and Seoksoo Kim Detecting Biomarkers for Major Adverse Cardiac Events Using SVM with PLS Feature Selection and Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . 1097 Zheng Yin, Xiaobo Zhou, Honghui Wang, Youxian Sun, and Stephen T.C. Wong Hybrid Systems and Artificial Immune Systems: Performances and Applications to Biomedical Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1107 Vitoantonio Bevilacqua, Cosimo G. de Musso, Filippo Menolascina, Giuseppe Mastronardi, and Antonio Pedone NeuroOracle: Integration of Neural Networks into an Object-Relational Database System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1115 Erich Schikuta and Paul Glantschnig Discrimination of Coronary Microcirculatory Dysfunction Based on Generalized Relevance LVQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1125 Qi Zhang, Yuanyuan Wang, Weiqi Wang, Jianying Ma, Juying Qian, and Junbo Ge Multiple Signal Classification Based on Genetic Algorithm for MEG Sources Localization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1133 Chenwei Jiang, Jieming Ma, Bin Wang, and Liming Zhang Registration of 3D FMT and CT Images of Mouse Via Affine Transformation with Bayesian Iterative Closest Points . . . . . . . . . . . . . . . . 1140 Xia Zheng, Xiaobo Zhou, Youxian Sun, and Stephen T.C. Wong Automatic Diagnosis of Foot Plant Pathologies: A Neural Networks Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1150 Marco Mora, Mary Carmen Jarur, Daniel Sbarbaro, and Leopoldo Pavesi Phase Transitions Caused by Threshold in Random Neural Network and Its Medical Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1159 Guangcheng Xi and Jianxin Chen
Multiresolution of Clinical EEG Recordings Based on Wavelet Packet Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1168 Lisha Sun, Guoliang Chang, and Patch J. Beadle Comparing Analytical Decision Support Models Through Boolean Rule Extraction: A Case Study of Ovarian Tumour Malignancy . . . . . . . . . . . . . 1177 M.S.H. Aung, P.J.G Lisboa, T.A. Etchells, A.C. Testa, B. Van Calster, S. Van Huffel, L. Valentin, and D. Timmerman Human Sensibility Evaluation Using Neural Network and Multiple-Template Method on Electroencephalogram (EEG) . . . . . . . . . . . 1187 Dongjun Kim, Seungjin Woo, Jeongwhan Lee, and Kyeongseop Kim A Decision Method for Air-Pressure Limit Value Based on the Respiratory Model with RBF Expression of Elastance . . . . . . . . . . . . . . . . 1194 Shunshoku Kanae, Zi-Jiang Yang, and Kiyoshi Wada Hand Tremor Classification Using Bispectrum Analysis of Acceleration Signals and Back-Propagation Neural Network . . . . . . . . . . . . . . . . . . . . . . . 1202 Lingmei Ai, Jue Wang, Liyu Huang, and Xuelian Wang A Novel Ensemble Approach for Cancer Data Classification . . . . . . . . . . . 1211 Yaou Zhao, Yuehui Chen, and Xueqin Zhang Biological Sequence Data Preprocessing for Classification: A Case Study in Splice Site Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1221 A.K.M.A. Baten, Saman K. Halgamuge, Bill Chang, and Nalin Wickramarachchi A Method of X-Ray Image Recognition Based on Fuzzy Rule and Parallel Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1231 Dongmei Liu and Zhaoxia Wang Detection of Basal Cell Carcinoma Based on Gaussian Prototype Fitting of Confocal Raman Spectra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1240 Seong-Joon Baek, Aaron Park, Sangki Kang, Yonggwan Won, Jin Young Kim, and Seung You Na Prediction of Helix, Strand Segments from Primary Protein Sequences by a Set of Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1248 Zhuo Song, Ning Zhang, Zhuo Yang, and Tao Zhang A Novel EPA-KNN Gene Classification Algorithm . . . . . . . . . . . . . . . . . . . . 1254 Haijun Wang, Yaping Lin, Xinguo Lu, and Yalin Nie A Novel Method for Prediction of Protein Domain Using Distance-Based Maximal Entropy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1264 Shuxue Zou, Yanxin Huang, Yan Wang, Chengquan Hu, Yanchun Liang, and Chunguang Zhou
The Effect of Recording Reference on EEG: Phase Synchrony and Coherence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1273 Sanqing Hu, Matt Stead, Andrew B. Gardner, and Gregory A. Worrell Biological Inspired Global Descriptor for Shape Matching . . . . . . . . . . . . . 1281 Yan Li, Siwei Luo, and Qi Zou Fuzzy Support Vector Machine for EMG Pattern Recognition and Myoelectrical Prosthesis Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1291 Lingling Chen, Peng Yang, Xiaoyun Xu, Xin Guo, and Xueping Zhang Classification of Obstructive Sleep Apnea by Neural Networks . . . . . . . . . 1299 Zhongyu Pang, Derong Liu, and Stephen R. Lloyd Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1309
Synchronization of Chaotic Systems Via the Laguerre–Polynomials-Based Neural Network Hongwei Wang and Hong Gu Department of Automation, Dalian University of Technology
[email protected]
Abstract. In recent years, chaos synchronization has attracted considerable research interest. For a class of chaotic synchronization systems with unknown uncertainties caused by both model variations and external disturbances, an orthogonal function neural network is utilized to realize the synchronization of the chaotic systems. The basis functions of the orthogonal function neural network are Laguerre polynomials. First, the orthogonal function neural network is trained to learn the uncertain information. Then, the parameters of the Laguerre orthogonal neural network are adjusted, based on the Lyapunov stability theorem, to accomplish the synchronization of two chaotic systems subject to perturbations. Finally, a numerical example illustrates the validity of the proposed method.
1 Introduction

In recent years, chaos synchronization has attracted many researchers' interest. Different methods have been used for the synchronization of chaotic systems, such as the radial basis function neural network, the recurrent neural network, and the wavelet neural network [1-6], all of which possess the ability to approximate nonlinear systems. Chaos synchronization can be viewed from a state-observer perspective, in the sense that the response system can be regarded as the state observer of the drive system [7-9]. In the state-observer-based approach, the output can be chosen to be a linear or nonlinear combination or function of the system state variables. However, it has been shown that the state-observer-based scheme has an inherent disadvantage: the transmission noise affects the performance of synchronization and communication [7]. On the other hand, control methods that are applicable to general nonlinear systems have been extensively developed since the early 1980s, for example based on differential geometry theory [10]. Recently, the passivity approach has generated increasing interest for synchronization control laws for general nonlinear systems [11]. An important problem in this approach is how to achieve robust nonlinear control in the presence of unmodelled dynamics and external disturbances. Along this line there is the so-called $H_\infty$ nonlinear control approach [12-13]. One major difficulty with this approach, alongside its possible structural instability, seems to be the requirement of solving the associated partial differential equations. In addition, for dynamic systems with complex, ill-conditioned, or nonlinear characteristics, the fuzzy
modeling method is very effective for describing the properties of such systems. In this research area, there have also been many attempts to achieve synchronization of chaotic systems by using fuzzy methods [14-15]. In this paper, an orthogonal function neural network, namely a Laguerre orthogonal neural network, is utilized to realize the synchronization of chaotic systems. The orthogonal function neural network is trained to learn the uncertain information of the system. Finally, simulation results for a numerical example demonstrate the validity of the proposed method.
2 The Description of the Problem

Consider the chaotic system

$\dot{x} = F(x) + u$
(1)
where $x \in R^n$; $F(x)$ is a nonlinear function satisfying $F(x) = (f_1(x), f_2(x), \ldots, f_n(x))^T$, $f_i(x)$ $(i = 1,2,\ldots,n)$; and $u \in R^n$ is the input vector.
$\dot{x} = F(x) + \Delta F(x) + u$
(2)
where $\Delta F(x)$ is the external perturbation.

Definition: the reference chaotic system is

$\dot{x}_r = g(x_r)$
(3)
where $x_r \in R^n$ is the state vector and $g(\cdot)$ is a vector of smooth nonlinear functions, which may have the same structure as $F(\cdot)$ or a different one. Let $e = x - x_r$. If

$\lim_{t \to \infty} e(t) = 0$
(4)
then system (2) is said to be synchronized with the reference system (3). Let $A$ be a matrix whose eigenvalues all have negative real parts. The error dynamics are

$\dot{e} = \dot{x} - \dot{x}_r = A(x - x_r) + Ax_r - Ax + F(x) + \Delta F(x) + u - g(x_r)$
(5)
Equation (5) can be rewritten as

$\dot{e} = Ae + Ax_r - Ax + F(x) + \Delta F(x) + u - g(x_r)$
(6)
Since the eigenvalues of $A$ have negative real parts, there exists a positive definite symmetric matrix $P$ satisfying the Lyapunov equation

$A^T P + PA = -Q$
(7)
where $Q$ is a positive definite symmetric matrix. Define $G(x)$ as

$G(x) = Ax_r - Ax + F(x) + \Delta F(x) - g(x_r)$
(8)
Then Equation (5) can be written as

$\dot{e} = Ae + G(x) + u$
(9)
When $G(x)$ is unknown, it is replaced by its estimate $\hat{G}(x)$. The controller is defined as

$u = -\hat{G}(x) - \sigma$
(10)

where $\sigma$ is a vector with a small norm.
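As a side note, the matrix $P$ in Equation (7) can be computed numerically once $A$ and $Q$ are fixed. The following minimal Python sketch (not part of the original paper) uses SciPy's continuous Lyapunov solver with the values later adopted in Section 4; any other stable $A$ and positive definite $Q$ could be substituted.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Stable A and positive definite Q (values used in Section 4 of this paper)
A = np.diag([-1.0, -1.0, -1.0])
Q = np.diag([2.0, 2.0, 2.0])

# solve_continuous_lyapunov(M, R) solves M X + X M^T = R, so passing
# M = A^T and R = -Q yields the required  A^T P + P A = -Q.
P = solve_continuous_lyapunov(A.T, -Q)

print(P)                                 # diag(1, 1, 1) for this choice
print(np.allclose(A.T @ P + P @ A, -Q))  # True
```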
3 The Controller of Orthogonal Function Neural Network

Compared with the common BP neural network, the orthogonal function neural network has a simple structure and fast convergence, and it can approximate any nonlinear function on a compact set. In this paper, the basis functions of the orthogonal function neural network are Laguerre orthogonal polynomials, defined by the recursion
$P_1(x) = 1, \quad P_2(x) = 1 - x, \quad P_i = \{[P_2 + 2(i-2)]P_{i-1} - (i-2)P_{i-2}\}/(i-1), \quad i = 1,2,\ldots,N$
(11)
The global output of the orthogonal function neural network is defined as Equation (12). N
$y = \sum_{i=1}^{N} \Phi_i W_i$
(12)
where $\Phi_i = P_{1i}(x_1) \times P_{2i}(x_2) \times \cdots \times P_{ni}(x_n) = \prod_{j=1}^{n} P_{ji}(x_j)$, and $P_{ji}(x_j)$ is the Laguerre polynomial with $P_{j1}(x_j) = 1$, $P_{j2}(x_j) = 1 - x_j$, $P_{ji} = \{[P_{j2} + 2(i-2)]P_{j(i-1)} - (i-2)P_{j(i-2)}\}/(i-1)$, $i = 1,2,\ldots,N$, $j = 1,2,\ldots,n$.
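To make the construction above concrete, the sketch below (Python, not from the original paper) evaluates the recursion (11) for each input dimension, forms the product terms $\Phi_i$, and computes the network output (12). The network size N, the sample input, and the weight values are illustrative assumptions.

```python
import numpy as np

def laguerre_basis(x, N):
    """Values P_1(x), ..., P_N(x) of the recursion in Equation (11) at a scalar x."""
    p = [1.0, 1.0 - x]                          # P_1 and P_2
    for i in range(3, N + 1):                   # recursion for the remaining terms
        p.append(((p[1] + 2 * (i - 2)) * p[-1] - (i - 2) * p[-2]) / (i - 1))
    return np.array(p[:N])

def phi(x, N):
    """Phi_i = prod_j P_{ji}(x_j) for i = 1..N, with input vector x in R^n."""
    per_dim = np.array([laguerre_basis(xj, N) for xj in x])   # shape (n, N)
    return per_dim.prod(axis=0)                               # shape (N,)

def network_output(x, W, N):
    """Global output y = sum_i Phi_i(x) W_i, Equation (12); W has shape (N, n_out)."""
    return phi(x, N) @ W

# Illustrative usage with assumed values
N = 4
x = np.array([0.2, -0.1, 0.3])
W = 0.1 * np.random.randn(N, 3)
print(network_output(x, W, N))
```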
4
H. Wang and H. Gu
Lemma 1. For any function $g(X)$ defined on the interval $[a,b]$ and any small positive number $\varepsilon$, there exist an orthogonal function sequence $\{\Phi_1(X), \Phi_2(X), \ldots, \Phi_N(X)\}$ and a real number sequence $W_i$ $(i = 1,2,\ldots,N)$ such that

$\left\| g(X) - \sum_{i=1}^{N} W_i \Phi_i(X) \right\| \le \varepsilon$
(13)
On the basis of Lemma 1, the following property is obtained.

Property 1. Given a positive constant $\varepsilon_0$ and a continuous function $G(x)$, $x \in R^n$, there exists an optimal weight matrix $W = W^*$, $W^* = [W_1^*, W_2^*, \ldots, W_N^*]$, such that

$\left\| G(x) - W^{*T} \Phi(x) \right\| \le \varepsilon_0$
(14)

where $\Phi(x) = [\Phi_1(x), \Phi_2(x), \ldots, \Phi_N(x)]^T$. Based on Equation (14), $G(x)$ can be written as

$G(x) = W^{*T} \Phi(x) + \eta$
(15)

where $\eta$ is a vector with a small norm.
e = Ae + W * Φ ( x) + η + u (t ) T
(16)
P , satisfying T T W = Φ ( x)e P , A P + PA = −Q , where Q is a positive definite and symmetric
Theorem 1. Exist a positive definite and symmetric matrix matrix, the controller is designed as the following form.
u = u1 + u 2 where
(17)
u1 is u1 = −W T Φ( x) and u 2 is u 2 = −η 0 sgn( Pe)
The state of Equation (9) approaches zero, namely chronization with the system (2).
x → x r , the system (1) is syn-
~ W = W * − W , which W * an optimization weighted matrix, W is a ~ weighted matrix, W is an estimated error matrix. Define the norm of matrix R: 2 R = tr ( RR T ) = tr ( R T R ) . The lyapunov function is shown as the following form. Proof. Let
V =
1 T 1 ~ e Pe + W 2 2
2
(18)
Differentiating Equation (18) with respect to the time is shown the following steps.
1 ~ ~ V = (e T Pe + e T Pe) + tr (W T W ) = 2 1 T T ~ ~ ~ e A P + PA e + Φ T ( x)WPe + e T Pη − e T Pη 0 sgn(e T P) + tr (W T W ) = 2 1 ~ ~ ~ − e T Qe + Φ T ( x)WPe + e T Pη − e T Pη 0 sgn(e T P) + tr (W T W ) 2 ~ ~ T T T T T Because of Φ ( x )WPe = tr ( PeΦ ( x )W ) , e Pη − e Pη0 sgn(e P) ≤ 0
(
)
1 1 ~ ~ ~ V ≤ − e T Qe + tr (W T W + PeΦ T ( x)W ) = − e T Qe ≤ 0 2 2 ~ On the basis of the lyapunov theorem, e , W have the limit, then lim e = 0 . t →∞
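For implementation purposes, the controller and the weight adaptation law of Theorem 1 reduce to a few matrix operations per time step. The following Python sketch is an assumed Euler discretization (not given in the paper); the basis function phi and the step size dt are passed in, for instance as defined in the earlier sketches.

```python
import numpy as np

def control_and_update(x, e, W, P, eta0, phi, dt):
    """One step of the Theorem 1 controller and weight adaptation.

    u1 = -W^T Phi(x),   u2 = -eta0 * sgn(P e),
    dW/dt = Phi(x) e^T P  (discretized here with a simple Euler step).
    """
    ph = phi(x)                               # Phi(x), shape (N,)
    u = -W.T @ ph - eta0 * np.sign(P @ e)     # u = u1 + u2
    W_next = W + dt * np.outer(ph, e) @ P     # adaptation law
    return u, W_next
```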
4 Simulation

The Lorenz chaotic system is

$\dot{v}_1 = a(v_2 - v_1), \quad \dot{v}_2 = (b - v_3)v_1 - v_2, \quad \dot{v}_3 = -c v_3 + v_1 v_2$
(19)
In practice, the Lorenz chaotic system is perturbed by the environment and is then represented as

$\dot{v}_1 = (a + \delta a)(v_2 - v_1) + d_1 + u_1, \quad \dot{v}_2 = (b + \delta b - v_3)v_1 - v_2 + d_2 + u_2, \quad \dot{v}_3 = -(c + \delta c)v_3 + v_1 v_2 + d_3 + u_3$
(20)
The system parameters are $a = 10$, $b = 30$, $c = 8/3$; the parameter perturbations are $\delta a = 0.1$, $\delta b = 0.2$, $\delta c = 0.2$; and the state disturbances are $d_1 = 0.03 \sin t$, $d_2 = 0.01 \cos t$, $d_3 = 0.02 \sin(3t)$.
The parameters of the Laguerre orthogonal neural network are adjusted to accomplish the synchronization. The other parameters are $A = \mathrm{diag}(-1,-1,-1)$, $Q = \mathrm{diag}(2,2,2)$, $\eta_0 = 0.02$, $P = \mathrm{diag}(1,1,1)$.
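A closed-loop simulation of the scheme can be sketched in Python as below. The initial states, the number of basis terms N, the input scaling inside the basis, and the integration step are assumptions, since the paper does not report them; the system equations follow (19) and (20), and the controller and adaptation law follow Theorem 1. The sketch is an outline rather than a faithful reproduction of Figs. 1-3.

```python
import numpy as np

a, b, c = 10.0, 30.0, 8.0 / 3.0
da, db, dc = 0.1, 0.2, 0.2
P = np.diag([1.0, 1.0, 1.0])
eta0 = 0.02
N, dt, steps = 4, 1e-4, 200_000               # assumed network size and step size

def laguerre_basis(x, N):
    p = [1.0, 1.0 - x]
    for i in range(3, N + 1):
        p.append(((p[1] + 2 * (i - 2)) * p[-1] - (i - 2) * p[-2]) / (i - 1))
    return np.array(p[:N])

def phi(x):
    # states are scaled before entering the basis (an assumption, to keep values moderate)
    per_dim = np.array([laguerre_basis(xj / 50.0, N) for xj in x])
    return per_dim.prod(axis=0)

def f_ref(v):
    # reference Lorenz system, Equation (19)
    return np.array([a * (v[1] - v[0]),
                     (b - v[2]) * v[0] - v[1],
                     -c * v[2] + v[0] * v[1]])

def f_resp(v, t, u):
    # perturbed Lorenz system with disturbances and control, Equation (20)
    d = np.array([0.03 * np.sin(t), 0.01 * np.cos(t), 0.02 * np.sin(3.0 * t)])
    return np.array([(a + da) * (v[1] - v[0]),
                     (b + db - v[2]) * v[0] - v[1],
                     -(c + dc) * v[2] + v[0] * v[1]]) + d + u

x = np.array([1.0, 2.0, 3.0])                 # assumed initial state of the response system
xr = np.array([-1.0, 0.5, 4.0])               # assumed initial state of the reference system
W = np.zeros((N, 3))
for k in range(steps):
    t = k * dt
    e = x - xr
    ph = phi(x)
    u = -W.T @ ph - eta0 * np.sign(P @ e)     # controller of Theorem 1
    W = W + dt * np.outer(ph, e) @ P          # adaptation law
    x = x + dt * f_resp(x, t, u)
    xr = xr + dt * f_ref(xr)

print("final synchronization error:", x - xr)
```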
Fig. 1. The response of the error e1
Fig. 2. The response of the error e2
Fig. 3. The response of the error e3
The error responses are shown in Fig. 1 to Fig. 3. Based on these responses, the synchronization of the two chaotic systems is achieved by the proposed method, and the validity of the proposed method is demonstrated by the numerical simulation results.
5 Conclusion

In this paper, an orthogonal function neural network based on Laguerre orthogonal polynomials is utilized to realize the synchronization of chaotic systems. The parameters of the orthogonal neural network are adjusted, based on the Lyapunov stability theorem, to accomplish the synchronization of two chaotic systems with parameter perturbations. The proposed method guarantees the synchronization of the two chaotic systems in the presence of such perturbations.
Acknowledgement This work is supported by National Natural Science Foundation of China (60674061).
References 1. Liu, F., Ren, Y., Shan, X.M., Qiu, Z.L.: A Linear Feedback Synchronization Theorem for a Class of Chaotic Systems. Chaos, Solution and Fractals 13 (2002) 723-730 2. Sarasola, C., Torrealdea, F.J.: Cost of Synchronizing Different Chaos Systems. Mathematics and Computers in Simulation 58 (2002) 309-327 3. Shahverdiev, E.M., Sivaprakasam, S., Shore, K.A.: Lag Synchronization in Time-Delayed Systems. Physics Letter A 292 (2002) 320-324 4. Tsui, A., Jones, A.: Periodic Response to External Stimulation of a Chaotic Neural Network with Delayed Feedback. International Journal of Bifurcation and Chaos 9 (1999) 713-722 5. Tan, W., Wang, Y.N., Liu, Z.R., Zhou, S.W.: Neural Network Control for Nonlinear Chaotic Motion. Acta Physica Sinica 51 (2002) 2463-2466 6. Li, Z., Han, C.S.: Adaptive Control for a Class of Chaotic Systems with Uncertain Parameters. Acta Physica Sinica 50 (2002) 847-850 7. Alvarez-Ramirez, J., Cervantes, I.: Stability of Observer-Based Chaotic Communications for a Class of Lure Systems. Bifurcation and Chaos 12 (2002) 1605-1618 8. Grassi, G., Masolo, S.: Nonlinear Observer Design to Synchronize Hyperchaotic Systems via a Scalar Signal. IEEE Transaction of Circuits Systems 44 (1997) 1011-1014 9. Jiang, G.P., Zheng, W.X.: An LMI Criterion for Chaos Synchronization via the LinearState-Feedback Approach. IEEE International Symposium Computer Aided Control System Design (2004) 368-371 10. Isidori, A.: Nonlinear Control Systems. 3rd Ed, Spring Verlag, New York, USA, 1995. 11. Hill, D. J., P. Moylan.: The Stability of Nonlinear Dissipative Systems. IEEE Transaction on Automatic Control 21 (3) (1996) 708-711 12. Knobloch, H.W., Isidori, A., Flockerzi, D.: Topics in Control Theory. Birkhauser, Boston, USA, 1993 13. Yu, G.R.: Fuzzy Synchronization of Chaos Using Gray Prediction for Secure Communication. IEEE International Conference on Systems, Man, Cybernetics 4 (2004) 3104-31099 14. Hyun, C.H., Kim, J.J., Kim, E.: Adaptive Fuzzy Observer Based on Synchronization Design and Secure Communications of Chaotic Systems. Chaos, Soliton and Fractals 27 (4) (2006) 930-940 15. Vasegh, N., Majd, V.J.: Adaptive Fuzzy Synchronization of Discrete-Time Chaotic Systems. Chaos, Soliton and Fractals 27 (4) (2006) 1029-1036
Chaos Synchronization Between Unified Chaotic System and Genesio System Xianyong Wu1,2, Zhi-Hong Guan1, and Tao Li1 1
Department of Control Science and Engineering, Huazhong University of Science and Technology, Wuhan, 430074, China 2 School of Electronics and Information, Yangtze University, Jingzhou, Hubei, 434023, China
[email protected]
Abstract. This work presents chaos synchronization between two different chaotic systems via active control and adaptive control. Synchronization between the unified chaotic system and the Genesio system is investigated; different controllers are designed to synchronize the drive and response systems, and numerical simulations show the effectiveness of the proposed schemes.
1 Introduction Since Pecora and Carroll introduced a method [1] to synchronize two identical chaotic systems with different initial conditions, chaos synchronization, an important topic in nonlinear science, has been investigated and studied extensively in the last few years. A variety of approaches have been proposed for the synchronization of chaotic systems such as drive-response synchronization [2], linear and nonlinear feedback synchronization [3], adaptive synchronization [4-6], coupled synchronization [7,8], active control method [9,10], impulsive synchronization [11,12], etc., most of the methods mentioned above synchronize two identical chaotic systems with known or unknown parameters. However, the method of synchronization of two different chaotic systems is far from being straightforward because of their different structures and parameters mismatch. In practice, it is hardly the case that every component can be assumed to be identical, especially when chaos synchronization is applied to secure communication, in which the structures of drive and response systems are different. Therefore, synchronization of two different chaotic systems in the presence of known or unknown parameters is more essential and useful in real-life applications. Recently, Bai and Lonngren studied synchronization of unified chaotic systems via active control [10], Ref. [13] used backstepping approach to synchronize two Genesio systems. However, the approach of synchronization between unified chaotic system and Genesio system is seldom reported. In this paper, we propose a scheme to synchronize unified chaotic system and Genesio system with different structures by two different methods, active control is applied when system parameters are known; adaptive synchronization is employed when system parameters are unknown or uncertain, the controllers and adaptive laws of parameters are designed based on Lyapunov stability theory. D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 8–15, 2007. © Springer-Verlag Berlin Heidelberg 2007
2 Systems Description and Mathematical Models

Consider a nonlinear chaotic system as follows:

$\dot{x} = f(t, x)$
$\dot{y} = g(t, y) + u(t, x, y)$   (1)
where x, y ∈ R n , f , g ∈ R × R n are differentiable functions, the first equation in (1) is the drive system, and the second one is the response system, u (t , x, y ) is the control input. Let e = y − x be the synchronization error, our goal is to design a controller u such that the trajectory of response system with initial conditions y0 asymptotically approaches the drive system with initial conditions x0 and finally implements synchronization, in the sense that
$\lim_{t\to\infty}\|e\| = \lim_{t\to\infty}\|y(t, y_0) - x(t, x_0)\| = 0$
where $\|\cdot\|$ is the Euclidean norm. The Genesio system, proposed by Genesio and Tesi [14], is one of the paradigms of chaos since it captures many features of chaotic systems. It includes a simple square term and three simple ordinary differential equations that depend on three negative real parameters. The dynamic equation of the system is as follows:

$\dot{x} = y$
$\dot{y} = z$   (2)
$\dot{z} = ax + by + cz + x^2$
where $x, y, z$ are state variables; when $a = -6$, $b = -2.92$, $c = -1.2$, system (2) is chaotic. Lü et al. proposed a unified chaotic system [15], which is described by

$\dot{x}_1 = (25\alpha + 10)(y_1 - x_1)$
$\dot{y}_1 = (28 - 35\alpha)x_1 + (29\alpha - 1)y_1 - x_1 z_1$   (3)
$\dot{z}_1 = x_1 y_1 - \frac{8+\alpha}{3}z_1$
where α ∈ [0,1] . Obviously, system (3) becomes the original Lorenz system for α = 0 while system (3) becomes the original Chen system for α = 1. When α = 0.8, system (3) becomes the critical system. In particular, system (3) bridges the gap between Lorenz system and Chen system. Moreover, system (3) is always chaotic in the whole interval α ∈ [0,1] . In the next sections, we will study chaos synchronization between unified chaotic system and Genesio system by two different methods.
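The interpolation role of α can be seen directly by simulating system (3) for different values of α. The following is a minimal, illustrative Python sketch (initial values and step size are our own choices); for α = 0 it reproduces Lorenz-type dynamics and for α = 1 Chen-type dynamics.

```python
import numpy as np

def unified(v, alpha):
    """Right-hand side of the unified chaotic system (3)."""
    x, y, z = v
    return np.array([
        (25 * alpha + 10) * (y - x),
        (28 - 35 * alpha) * x + (29 * alpha - 1) * y - x * z,
        x * y - (8 + alpha) / 3 * z,
    ])

def simulate(alpha, v0=(1.0, 2.0, 3.0), h=0.001, steps=20000):
    v = np.array(v0, dtype=float)
    for _ in range(steps):
        k1 = unified(v, alpha)
        k2 = unified(v + h / 2 * k1, alpha)
        k3 = unified(v + h / 2 * k2, alpha)
        k4 = unified(v + h * k3, alpha)
        v = v + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)   # RK4 step
    return v

print("alpha = 0 (Lorenz):", simulate(0.0))
print("alpha = 1 (Chen):  ", simulate(1.0))
```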
3 Synchronization Between Unified Chaotic System and Genesio System Via Active Control

In order to observe the synchronization behavior between the unified chaotic system and the Genesio system via active control, we assume that the Genesio system (2) is the drive system and the controlled unified chaotic system (4) is the response system:

$\dot{x}_1 = (25\alpha + 10)(y_1 - x_1) + u_1$
$\dot{y}_1 = (28 - 35\alpha)x_1 + (29\alpha - 1)y_1 - x_1 z_1 + u_2$   (4)
$\dot{z}_1 = x_1 y_1 - \frac{8+\alpha}{3}z_1 + u_3$
Three control functions $u_1, u_2, u_3$ are introduced in system (4). In order to determine the control functions that realize synchronization between systems (2) and (4), we subtract (2) from (4) and get

$\dot{e}_1 = (25\alpha + 10)(y_1 - x_1) - y + u_1$
$\dot{e}_2 = (28 - 35\alpha)x_1 + (29\alpha - 1)y_1 - x_1 z_1 - z + u_2$   (5)
$\dot{e}_3 = x_1 y_1 - \frac{8+\alpha}{3}z_1 - ax - by - cz - x^2 + u_3$
where e1 = x1 − x, e2 = y1 − y, e3 = z1 − z , we define active control functions u1 , u2 and u3 as follows
$u_1 = -(25\alpha + 10)(y_1 - x) + y + V_1$
$u_2 = -(28 - 35\alpha)x_1 - (29\alpha - 1)y + x_1 z_1 + z + V_2$   (6)
$u_3 = -x_1 y_1 + \frac{8+\alpha}{3}z + ax + by + cz + x^2 + V_3$
Hence the error system (5) becomes

$\dot{e}_1 = -(25\alpha + 10)e_1 + V_1$
$\dot{e}_2 = (29\alpha - 1)e_2 + V_2$   (7)
$\dot{e}_3 = -\frac{8+\alpha}{3}e_3 + V_3$
The error system (7) to be controlled is a linear system with control inputs V1 ,V2 and V3 as functions of the error states e1 , e2 and e3 . As long as these feedbacks stabilize the system, e1 , e2 and e3 converge to zero as time t tends to infinity. This implies that unified chaotic system and Genesio system are synchronized with feedback control. There are many possible choices for the control V1 ,V2 and V3 . We choose
$[V_1, V_2, V_3]^T = A\,[e_1, e_2, e_3]^T$,

where $A$ is a $3 \times 3$ constant matrix. In order to make the closed-loop system stable, the elements of matrix $A$ must be chosen so that the feedback system has all eigenvalues with negative real parts. Let matrix $A$ be chosen in the following form:

$A = \begin{pmatrix} 25\alpha + 9 & 0 & 0 \\ 0 & -29\alpha & 0 \\ 0 & 0 & \frac{5+\alpha}{3} \end{pmatrix}$
With this particular choice, the closed-loop system (7) has the eigenvalues −1, −1 and −1, so the error states $e_1, e_2, e_3$ converge to zero as time t tends to infinity, and hence synchronization between the unified chaotic system and the Genesio system is achieved. In the simulation, the fourth-order Runge-Kutta integration method is used to solve the two systems of differential equations (2) and (4) with time step 0.001. We select the parameter of the unified chaotic system as α = 0.2 and the parameters of the Genesio system as a = −6, b = −2.92, c = −1.2; the initial values of the drive and response systems are (x(0), y(0), z(0)) = (1, 2, 3) and (x1(0), y1(0), z1(0)) = (−1, −2, 5), respectively, while the initial errors of system (5) are (e1(0), e2(0), e3(0)) = (−2, −4, 2). Fig. 1 shows the synchronization errors between the unified chaotic system and the Genesio system; one can see that the response system traces the drive system rapidly and the two systems finally coincide.
Fig. 1. Synchronization errors e1 , e2 , e3 between unified chaotic system and Genesio system via active control
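The simulation just described can be reproduced with the short sketch below, which integrates the Genesio drive (2) and the controlled unified response (4) under the active controllers (6) with V = A e and the matrix A given above. The integration routine and variable names are our own; parameter and initial values follow the text.

```python
import numpy as np

a, b, c, alpha = -6.0, -2.92, -1.2, 0.2

def coupled(t, s):
    """Genesio drive (2) and unified response (4) under active control (6)."""
    x, y, z, x1, y1, z1 = s
    e1, e2, e3 = x1 - x, y1 - y, z1 - z
    # feedback V = A e with A = diag(25*alpha+9, -29*alpha, (5+alpha)/3)
    V1 = (25 * alpha + 9) * e1
    V2 = -29 * alpha * e2
    V3 = (5 + alpha) / 3 * e3
    # active controllers (6)
    u1 = -(25 * alpha + 10) * (y1 - x) + y + V1
    u2 = -(28 - 35 * alpha) * x1 - (29 * alpha - 1) * y + x1 * z1 + z + V2
    u3 = -x1 * y1 + (8 + alpha) / 3 * z + a * x + b * y + c * z + x**2 + V3
    return np.array([
        y, z, a * x + b * y + c * z + x**2,                       # drive (2)
        (25 * alpha + 10) * (y1 - x1) + u1,                        # response (4)
        (28 - 35 * alpha) * x1 + (29 * alpha - 1) * y1 - x1 * z1 + u2,
        x1 * y1 - (8 + alpha) / 3 * z1 + u3,
    ])

def rk4(f, t, s, h):
    k1 = f(t, s); k2 = f(t + h / 2, s + h / 2 * k1)
    k3 = f(t + h / 2, s + h / 2 * k2); k4 = f(t + h, s + h * k3)
    return s + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

s = np.array([1.0, 2.0, 3.0, -1.0, -2.0, 5.0])   # (x, y, z, x1, y1, z1)
h = 0.001
for i in range(10000):
    s = rk4(coupled, i * h, s, h)
print("errors e1, e2, e3 after 10 s:", s[3:] - s[:3])
```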
4 Adaptive Synchronization Between Unified Chaotic System and Genesio System with Unknown Parameters

In order to compare with the active control method, we still assume that the Genesio system (2) is the drive system and the controlled unified chaotic system (8) is the response system.
$\dot{x}_1 = (25\alpha + 10)(y_1 - x_1) + u_1$
$\dot{y}_1 = (28 - 35\alpha)x_1 + (29\alpha - 1)y_1 - x_1 z_1 + u_2$   (8)
$\dot{z}_1 = x_1 y_1 - \frac{8+\alpha}{3}z_1 + u_3$

Subtracting (2) from equation (8) yields
$\dot{e}_1 = (25\alpha + 10)(y_1 - x_1) - y + u_1$
$\dot{e}_2 = (28 - 35\alpha)x_1 + (29\alpha - 1)y_1 - x_1 z_1 - z + u_2$   (9)
$\dot{e}_3 = x_1 y_1 - \frac{8+\alpha}{3}z_1 - ax - by - cz - x^2 + u_3$
where $e_1 = x_1 - x$, $e_2 = y_1 - y$, $e_3 = z_1 - z$. Our goal is to find proper controllers $u_i$ ($i = 1, 2, 3$) and parameter adaptive laws such that system (8) globally synchronizes with system (2) asymptotically, i.e., $\lim_{t\to\infty}\|e(t)\| = 0$, where $e = [e_1, e_2, e_3]^T$.

Theorem: If the controllers are chosen as
$u_1 = -(25\hat{\alpha} + 10)(y_1 - x_1) + y - k_1 e_1$
$u_2 = -(28 - 35\hat{\alpha})x_1 - (29\hat{\alpha} - 1)y_1 + x_1 z_1 + z - k_2 e_2$   (10)
$u_3 = -x_1 y_1 + \frac{8+\hat{\alpha}}{3}z_1 + \hat{a}x + \hat{b}y + \hat{c}z + x^2 - k_3 e_3$
and the adaptive laws of the parameters are chosen as

$\dot{\hat{a}} = -x e_3$
$\dot{\hat{b}} = -y e_3$
$\dot{\hat{c}} = -z e_3$   (11)
$\dot{\hat{\alpha}} = 25(y_1 - x_1)e_1 - (35x_1 - 29y_1)e_2 - \frac{1}{3}z_1 e_3$
then system (8) globally synchronizes with system (2) asymptotically, where $k_i$ ($i = 1, 2, 3$) are positive constants and $\hat{a}, \hat{b}, \hat{c}, \hat{\alpha}$ are estimates of $a, b, c, \alpha$, respectively.

Proof: Applying control laws (10) to (9) yields the error dynamics as follows:
$\dot{e}_1 = -25\tilde{\alpha}(y_1 - x_1) - k_1 e_1$
$\dot{e}_2 = 35\tilde{\alpha}x_1 - 29\tilde{\alpha}y_1 - k_2 e_2$   (12)
$\dot{e}_3 = \tilde{a}x + \tilde{b}y + \tilde{c}z + \frac{1}{3}\tilde{\alpha}z_1 - k_3 e_3$
where $\tilde{a} = \hat{a} - a$, $\tilde{b} = \hat{b} - b$, $\tilde{c} = \hat{c} - c$, $\tilde{\alpha} = \hat{\alpha} - \alpha$. Consider the following Lyapunov function:

$V = \frac{1}{2}\left(e^T e + \tilde{a}^2 + \tilde{b}^2 + \tilde{c}^2 + \tilde{\alpha}^2\right)$

The time derivative of $V$ along the solution of the error dynamical system (12) gives
$\frac{dV}{dt} = e^T\dot{e} + \tilde{a}\dot{\tilde{a}} + \tilde{b}\dot{\tilde{b}} + \tilde{c}\dot{\tilde{c}} + \tilde{\alpha}\dot{\tilde{\alpha}} = e_1[-25\tilde{\alpha}(y_1 - x_1) - k_1 e_1] + e_2[35\tilde{\alpha}x_1 - 29\tilde{\alpha}y_1 - k_2 e_2] + e_3[\tilde{a}x + \tilde{b}y + \tilde{c}z + \tfrac{1}{3}\tilde{\alpha}z_1 - k_3 e_3] + \tilde{a}(-x e_3) + \tilde{b}(-y e_3) + \tilde{c}(-z e_3) + \tilde{\alpha}[25(y_1 - x_1)e_1 - (35x_1 - 29y_1)e_2 - \tfrac{1}{3}z_1 e_3] = -k_1 e_1^2 - k_2 e_2^2 - k_3 e_3^2 = -e^T P e \le 0$

where $P = \mathrm{diag}\{k_1, k_2, k_3\}$. Since $V$ is positive definite and $\frac{dV}{dt}$ is negative semi-definite in the neighborhood of the zero solution of system (12), it follows that $e, \tilde{a}, \tilde{b}, \tilde{c}, \tilde{\alpha} \in L_\infty$. From the fact that
$\int_0^t \lambda_{\min}(P)\|e\|^2\,dt \le \int_0^t e^T P e\,dt = \int_0^t -\dot{V}\,dt = V(0) - V(t) \le V(0)$
where $\lambda_{\min}(P)$ is the minimal eigenvalue of the positive definite matrix $P$, we conclude that $e \in L_2$; from Eq. (12) we have $\dot{e} \in L_\infty$, and by Barbalat's lemma $\lim_{t\to\infty} e = 0$. Thus the response system (8) can globally synchronize with the drive system (2) asymptotically. This completes the proof.

In simulation, the fourth-order Runge-Kutta integration method is used to solve the two systems of differential equations (2) and (8). We select the parameter of the unified chaotic system as α = 0.95 and the parameters of the Genesio system as a = −6, b = −2.92,
Fig. 2. Synchronization errors e1 , e2 , e3 between unified chaotic system and Genesio system via adaptive control
Fig. 3. Adaptive parameters aˆ , bˆ, cˆ of Genesio system
Fig. 4. Adaptive parameters αˆ of unified chaotic system
c = −1.2, ki (i = 1, 2,3) = 2, the initial values of drive and response systems are
(x(0), y(0), z(0)) = (1, 2, 3) and (x1(0), y1(0), z1(0)) = (−1, −2, 5), respectively, while the initial errors of system (9) are (e1(0), e2(0), e3(0)) = (−2, −4, 2), and the initial values of the estimated parameters are $\hat{a}(0) = \hat{b}(0) = \hat{c}(0) = 1$, $\hat{\alpha}(0) = 2$. The synchronization errors between the unified chaotic system and the Genesio system are shown in Fig. 2. The estimated parameters $\hat{a}, \hat{b}, \hat{c}$ and $\hat{\alpha}$ are shown in Figs. 3 and 4, respectively. Obviously, the synchronization errors converge asymptotically to zero and the two different systems indeed achieve chaos synchronization. Furthermore, the estimates of the parameters converge to their real values.
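The adaptive scheme can be reproduced with the sketch below, which integrates the drive (2), the controlled response (8) with controllers (10), and the adaptive laws (11), using the gains k_i = 2, α = 0.95 and the initial values given in the text. The integration routine and names are our own.

```python
import numpy as np

a, b, c, alpha = -6.0, -2.92, -1.2, 0.95
k1 = k2 = k3 = 2.0

def rhs(s):
    """Drive (2), controlled response (8) with (10), and adaptive laws (11)."""
    x, y, z, x1, y1, z1, ah, bh, ch, alh = s
    e1, e2, e3 = x1 - x, y1 - y, z1 - z
    u1 = -(25 * alh + 10) * (y1 - x1) + y - k1 * e1
    u2 = -(28 - 35 * alh) * x1 - (29 * alh - 1) * y1 + x1 * z1 + z - k2 * e2
    u3 = -x1 * y1 + (8 + alh) / 3 * z1 + ah * x + bh * y + ch * z + x**2 - k3 * e3
    return np.array([
        y, z, a * x + b * y + c * z + x**2,                      # Genesio drive
        (25 * alpha + 10) * (y1 - x1) + u1,                      # response (8)
        (28 - 35 * alpha) * x1 + (29 * alpha - 1) * y1 - x1 * z1 + u2,
        x1 * y1 - (8 + alpha) / 3 * z1 + u3,
        -x * e3, -y * e3, -z * e3,                               # adaptive laws (11)
        25 * (y1 - x1) * e1 - (35 * x1 - 29 * y1) * e2 - z1 * e3 / 3,
    ])

s = np.array([1.0, 2.0, 3.0, -1.0, -2.0, 5.0, 1.0, 1.0, 1.0, 2.0])
h = 0.001
for _ in range(50000):                       # 50 time units, RK4 steps
    s1 = rhs(s); s2 = rhs(s + h / 2 * s1)
    s3 = rhs(s + h / 2 * s2); s4 = rhs(s + h * s3)
    s = s + h / 6 * (s1 + 2 * s2 + 2 * s3 + s4)
print("errors:", s[3:6] - s[0:3])
print("estimates a, b, c, alpha:", s[6:])
```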
5 Conclusions

This paper presents two chaos synchronization schemes between the unified chaotic system and the Genesio system with different structures and parameters. Active control is used when the system parameters are known and adaptive control is used when the system parameters are unknown. Computer simulations show the effectiveness of the proposed schemes.
Acknowledgement

This work was supported by the National Natural Science Foundation of China under Grants 60573005 and 60603006.
References 1. Pecora, L.M., Carroll, T.L.: Synchronization in chaotic systems. Phys Rev Lett. 64 (1990) 821-824 2. Yang, X.S., Duan, C.K., Liao, X.X.: A note on mathematical aspects of drive-response type synchronization. Chaos, Solitons & Fractals 10 (1999) 1457-1462 3. Lu, J., Wu, X., Han, X., Lü, J.H.: Adaptive feedback synchronization of a unified chaotic system. Phys Lett A. 329 (2004) 327-333 4. Femat, R., et al.: Adaptive synchronization of high-order chaotic systems: a feedback with low-order parameterization. Physica D. 139 (2000) 231-246 5. Yassen, M.T.: Adaptive synchronization of two different uncertain chaotic systems. Phys Lett A. 337 (2005) 335-341 6. Feki, M.: An adaptive chaos synchronization scheme applied to secure communication. Chaos,Solitons & Fractals. 18 (2003) 141-148 7. Lü, J.H., Zhou, T.S., Zhang, S.C.: Chaos synchronization between linearly coupled chaotic systems. Chaos, Solitons & Fractals. 14 (2002) 529-541 8. Alexeyev, A.A., Shalfeev, V.D.: Chaotic synchronization of mutually coupled generators with frequency-controlled feedback loop. Int J Bifurcat Chaos. 5 (1995) 551-557 9. Ho, M.C., Hung, Y.C.: Synchronization of two different systems by using generalized active control. Phys Lett A. 301 (2002) 424-8 10. Ucar, A., Lonngren, K.E., Bai, E.W.: Synchronization of the unified chaotic systems via active control. Chaos, Solitons & Fractals. 27 (2006) 1292-1297 11. Chen, S., Yang, Q., Wang, C.: Impulsive control and synchronization of unified chaotic system. Chaos, Solitons & Fractals. 20 (2004) 751-758 12. Yang, T., Chua, L.O.: Impulsive control and synchronization of nonlinear dynamical systems and application to secure communication. Int J Bifurcat Chaos. 7 (1997) 645-664 13. Park, J.H.: Synchronization of Genesio chaotic system via backstepping approach. Chaos, Solitons & Fractals. 27 (2006) 1369-1375 14. Genesio, R., Tesi, A.: A harmonic balance methods for the analysis of chaotic dynamics in nonlinear systems. Automatica. 28 (1992) 531-548 15. Lü, J.H., Chen, G., Cheng, D.Z., Celikovsky, S.: Bridge the gap between the Lorenz system and the Chen system. Int J Bifur Chaos. 12 (2002) 2917-2926
Robust Impulsive Synchronization of Coupled Delayed Neural Networks

Lan Xiang (1), Jin Zhou (2), and Zengrong Liu (3)

(1) Department of Physics, School of Science, Shanghai University, Shanghai, 200444, P.R. China
[email protected]
(2) Shanghai Institute of Applied Mathematics and Mechanics, Shanghai University, Shanghai, 200072, P.R. China
[email protected]
(3) Institute of System Biology, Shanghai University, Shanghai, 200444, P.R. China
[email protected]
Abstract. The present paper studies robust impulsive synchronization of coupled delayed neural networks. Based on impulsive control theory on dynamical systems, a simple yet less conservative criteria ensuring robust impulsive synchronization of coupled delayed neural networks is established. Furthermore, the theoretical result is applied to a typical scale-free (SF) network composing of the representative chaotic delayed Hopfield neural network nodes, and numerical results are presented to demonstrate the effectiveness of the proposed control techniques.
1 Introduction
Over the last decade, control and synchronization of coupled chaotic dynamical systems has attracted a great deal of attention due to its potential applications in many fields including secure communications, chemical reactions, biological systems and information science, etc [1], [2], [3], [4]. As a typical example, synchronization of coupled neural networks has currently been an active area of research, and a wide variety of strategies have been developed for chaos synchronization, see ([3], [5], [6], [7], [8], [9], [10]) and relevant references therein. In the past several years, impulsive control has been widely used to stabilize and synchronize chaotic dynamical systems due to its potential advantages over general continuous control schemes [12], [13]. There are many important results focusing mainly on some well-known chaotic dynamical systems such as the Lorenz system, R¨ossler system, Chua system, Duffing oscillator, Brusselator oscillator and so on [1], [13]. It has been proved, in the study of chaos synchronization, that impulsive synchronization approach is effective and robust in synchronization of chaotic dynamical systems. Moreover, the controllers used usually have a relatively simple structure. In an impulsive synchronization scheme, only
the synchronization impulses are sent to the receiving systems at the impulsive instances, which can decrease the information redundancy in the transmitted signal and increase robustness against the disturbances. In this sense, impulsive synchronization schemes are very useful in practical application, such as in digital secure communication systems [4]. Therefore, the investigation of impulsive synchronization for coupled delayed neural networks is an important step for practical design and application of delayed neural networks. This paper is mainly concerned with the issues of robust impulsive synchronization of coupled delayed neural networks. Based on impulsive control theory on delayed dynamical systems, a simple yet less conservative criteria is derived for robust impulsive synchronization of coupled delayed neural networks. It is shown that the approaches developed here further extend the ideas and techniques presented in recent literature, and they are also simple to implement in practice. Finally, a typical scale-free (SF) network composing of the representative chaotic delayed Hopfield neural network nodes is used as an example to illustrate this impulsive control scheme, and the numerical simulations also demonstrate the effectiveness and feasibility of the proposed control techniques.
2 Problem Formulations and Preliminaries
First, we consider a dynamical system consisting of N linearly coupled identical delayed neural networks, which is described by the following set of differential equations with time delays [3]:

$\dot{x}_i(t) = -Cx_i(t) + Af(x_i(t)) + A^\tau g(x_i(t-\tau)) + I(t) + \sum_{j=1}^{N} b_{ij}\Gamma x_j(t), \qquad i = 1, 2, \ldots, N.$   (1)
where xi (t) = (xi1 (t), · · · , xin (t)) are the state variables of the ith delayed neural network, C = diag(c1 , . . . , cn ) is a diagonal matrix with positive diagonal entries cr > 0 (i = 1, · · · , n), A = (a0rs )n×n and Aτ = (aτrs )n×n denote the connection weight matrix and the delayed connection weight matrix respectively, I(t) = (I1 (t), · · · , In (t)) is an external input vector, τ is the time delay, and the activation function vectors f (xi (t)) = [f1 (xi1 (t)), · · · , fn (xin (t))] and g(xi (t)) = [g1 (xi1 (t), · · · , gn (xin (t))] , here we assume that the activation functions fr (x) and gr (x) are globally Lipschitz continuous, i. e., (A1 ) There exist constants kr > 0, lr > 0, r = 1, 2, · · · , n, for any two different x1 , x2 ∈ R, such that 0≤
$\frac{f_r(x_1) - f_r(x_2)}{x_1 - x_2} \le k_r, \qquad |g_r(x_1) - g_r(x_2)| \le l_r|x_1 - x_2|, \qquad r = 1, 2, \ldots, n.$
For simplicity, we further assume that the inner connecting matrix Γ = diag(γ1 , · · · , γn ), and the coupling matrix B = (bij )N ×N is the Laplacian matrix, i.e., a symmetric irreducible matrix with zero-sum and real spectrum. This
implies that zero is an eigenvalue of B with multiplicity 1 and all the other eigenvalues of B are strictly negative [3]. Next, we consider the issues of impulsive control for robust synchronization of the coupled delayed neural network (1). By adding an impulsive controller {tk , Iik (t, xi (t))} to the ith-dynamical node in the coupled delayed neural network (1), we have the following impulsively controlled coupled delayed neural network: ⎧ x˙ i (t) = −Cxi (t) + Af (xi (t)) + Aτ g(xi (t − τ )) + I(t) ⎪ ⎪ ⎪ N ⎨ + bij Γ xj (t), t = tk , t ≥ t0 , (2) ⎪ ⎪ j=1 ⎪ ⎩ xi = Iik (t, xi (t)), t = tk , k = 1, 2, · · · , where i = 1, 2 · · · , N , the time sequence {tk }+∞ k=1 satisfy tk−1 < tk and limk→∞ tk = − +∞, xi = xi (t+ ) − x (t ) is the control law in which xi (t+ i k k k ) = limt→t+ xi (t) and k
xi (t− xi (t). Without loss of generality, we assume that limt→t+ xi (t) = k ) = limt→t− k k xi (tk ), which means the solution x(t) is continuous from the right. The initial conditions of Eq. (2) are given by xi (t) = φi (t) ∈ P C([t0 − τ, t0 ], Rn ), where P C([t0 − τ, t0 ], Rn ) denotes the set of all functions of bounded variation and right-continuous on any compact subinterval of [t0 − τ, t0 ]. We always assume that Eq. (2) has a unique solution with respect to initial conditions. Clearly, if Iik (t, xi (t)) = 0, then the controlled model (2) becomes the well-known continuous coupled delayed neural network (1) [3]. The main objective of this paper is to design and implement an appropriate impulsive controller {tk , Iik (t, xi (t))} such that the states of the controlled coupled delayed neural network (2) will achieve synchronization, i. e., lim xi (t) − s(t) = 0,
t→+∞
i = 1, 2 · · · , N,
(3)
where s(t) is called as the synchronization state of the controlled coupled delayed neural network (2). It may be an equilibrium point, a periodic orbit, or a chaotic attractor. Throughout this paper, we define the synchronization state of the N 1 controlled coupled delayed neural network (2) as s(t) = xi (t), where N i=1 xi (t) (i = 1, 2 · · · , N ) are the solutions of the continuous coupled delayed neural network (1) [11]. For the later use, the definition with respect to robust impulsive synchronization of the controlled coupled delayed neural network (2) and the famous Halanay differential inequality on impulsive delay differential inequality are introduced as follows: Definition 1. The controlled coupled delayed neural network (2) is robustly exponentially synchronized, if there exist constants ε > 0 and M > 0, for all φi (t) ∈ P C([t0 − τ, t0 ], Rn ), such that xi (t) − s(t) ≤ M e−ε(t−t0 ) ,
t ≥ t0 ,
i = 1, 2 · · · , N.
(4)
Lemma 1. [12] Suppose p > q ≥ 0 and u(t) satisfies the scalar impulsive differential inequality

$D^+u(t) \le -p\,u(t) + q \sup_{t-\tau\le s\le t} u(s), \quad t \ne t_k,\ t \ge t_0;$
$u(t_k) \le \alpha_k u(t_k^-);$   (5)
$u(t) = \phi(t), \quad t \in [t_0-\tau, t_0],$

where u(t) is continuous at $t \ne t_k$, $t \ge t_0$, $u(t_k) = u(t_k^+) = \lim_{s\to 0^+}u(t_k+s)$, $u(t_k^-) = \lim_{s\to 0^-}u(t_k+s)$ exists, and $\phi \in PC([t_0-\tau, t_0], R)$. Then

$u(t) \le \Big(\prod_{t_0 < t_k \le t}\theta_k\Big)e^{-\mu(t-t_0)}\sup_{t_0-\tau\le s\le t_0}\phi(s),$   (6)
t0
where θk = max{1, |αk |} and μ > 0 is a solution of the inequality μ−p+qeμτ ≤ 0.
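The decay rate μ in Lemma 1 is not given in closed form; it is any positive number satisfying μ − p + q e^{μτ} ≤ 0. A hedged sketch of how the largest admissible μ can be found numerically by bisection is shown below (the example values of p, q, τ are placeholders, not taken from the paper).

```python
import math

def decay_rate(p, q, tau, tol=1e-10):
    """Largest mu > 0 with mu - p + q*exp(mu*tau) <= 0; requires p > q >= 0.
    Returns 0.0 if no positive solution exists."""
    g = lambda mu: mu - p + q * math.exp(mu * tau)
    if g(0.0) >= 0.0:            # g(0) = q - p must be negative
        return 0.0
    hi = 1.0
    while g(hi) < 0.0:           # bracket the root
        hi *= 2.0
    lo = 0.0
    while hi - lo > tol:         # bisection
        mid = 0.5 * (lo + hi)
        if g(mid) <= 0.0:
            lo = mid
        else:
            hi = mid
    return lo

print(decay_rate(p=2.0, q=0.5, tau=1.0))   # illustrative values only
```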
3 Robust Impulsive Synchronization
Based on impulsive control theory for delayed dynamical systems, the following sufficient condition for robust impulsive synchronization of the controlled coupled delayed neural network (2) is established.

Theorem 1. Consider the controlled coupled delayed neural network (2). Let the impulsive controller be

$u_i(t, x_i) = \sum_{k=1}^{+\infty} I_{ik}(t, x_i(t))\,\delta(t - t_k) = \sum_{k=1}^{+\infty} d_k\big(x_i(t_k^-) - s(t)\big)\,\delta(t - t_k),$   (7)

where $d_k$ is a constant called the control gain and $\delta(t)$ is the Dirac function, and let the eigenvalues of the coupling matrix B be ordered as

$0 = \lambda_1 > \lambda_2 \ge \lambda_3 \ge \cdots \ge \lambda_N.$   (8)
Assume that, in addition to (A1), the following conditions are satisfied for all i = 1, 2, ..., n and $k \in Z^+ = \{1, 2, \ldots, \infty\}$:

(A2) There exist n positive numbers $\delta_1, \ldots, \delta_n$ and two numbers

$p_i = \delta_i + c_i - (a_{ii}^0)^+k_i - \frac{1}{2}\sum_{j=1,\,j\ne i}^{n}\big(|a_{ij}^0|k_j + |a_{ji}^0|k_i\big) - \frac{1}{2}\sum_{j=1}^{n}|a_{ij}^\tau|l_j, \qquad q_i = \frac{1}{2}\sum_{j=1}^{n}|a_{ji}^\tau|l_i,$

such that $p = \min_{1\le i\le n}\{2p_i\} > q = \max_{1\le i\le n}\{2q_i\}$ and $\gamma_i\lambda(\gamma_i) + \delta_i \le 0$, where $(a_{ii}^0)^+ = \max\{a_{ii}^0, 0\}$ and $\lambda(\gamma_i) = \lambda_2$ if $\gamma_i > 0$, $\lambda(\gamma_i) = 0$ if $\gamma_i = 0$, $\lambda(\gamma_i) = \lambda_N$ if $\gamma_i < 0$.
(A3) Let μ > 0 satisfy $\mu - p + q e^{\mu\tau} \le 0$, and let

$\theta_k = \max\{1, (1 + d_k)^2\}, \qquad \theta = \sup_{k\in Z^+}\frac{\ln\theta_k}{t_k - t_{k-1}},$   (9)

such that θ < μ. Then the controlled coupled delayed neural network (2) is robustly exponentially synchronized.

Brief Proof. Let $v_i(t) = x_i(t) - s(t)$ (i = 1, 2, ..., N); then the error dynamical system can be rewritten as

$\dot{v}_i(t) = -Cv_i(t) + A\tilde{f}(v_i(t)) + A^\tau\tilde{g}(v_i(t-\tau)) + \sum_{j=1}^{N} b_{ij}\Gamma v_j(t) + J, \quad t \ne t_k,\ t \ge t_0,$
$v_i(t_k) = (1 + d_k)v_i(t_k^-), \quad t = t_k,\ k = 1, 2, \ldots,$   (10)

where $\tilde{f}(v_i(t)) = f(v_i(t) + s(t)) - f(s(t))$, $\tilde{g}(v_i(t-\tau)) = g(v_i(t-\tau) + s(t-\tau)) - g(s(t-\tau))$, and $J = Af(s(t)) + A^\tau g(s(t-\tau)) - \frac{1}{N}\sum_{k=1}^{N}\big[Af(x_k(t)) + A^\tau g(x_k(t-\tau))\big]$. Let us construct a Lyapunov function

$V(t) = \frac{1}{2}\sum_{i=1}^{N} v_i^\top(t)v_i(t).$   (11)
Calculating the upper Dini derivative of V (t) with respect to time along the solution N of Eq. (10), from Condition (A1 ), and note that vi (t) = 0, we can get for t = tk , i=1
D+ V (t) ≤
N n
1 0 (|a |ks + |a0sr |kr ) 2 s=1 rs n
− δr − cr + (a0rr )+ kr +
i=1 r=1
s=r
n N 1 1 τ 2 2 + |aτrs |ls vir (t) + |asr |lr vir (t − τ ) + vi (t) 2 s=1 2 s=1 i=1 n
×
N
bij Γ vj (t) + diag(δ1 , . . . , δn )vi (t)
j=1
≤ −pV (t) + qV (t − τ ) +
n
v¯j (t)(γj B + δj IN )¯ vj (t),
(12)
j=1
N where v¯j (t) = (¯ v1j (t), · · · , v¯N j (t)) ∈ L def = z = (z1 , · · · , zN ) ∈ RN | i=1 zi = n 0 , from which it can be concluded that if γj λ(γj )+δj ≤ 0, then v¯j (t)(γj B+ j=1
δj IN )¯ vj (t) ≤ 0. This leads to D+ V (t) ≤ −pV (t) + q( sup
t−τ ≤s≤t
V (s)).
(13)
On the other hand, from the construction of V (t), we have V (tk ) = (1 + dk )2
N
− − 2 vj (t− k )vj (tk ) ≤ (1 + dk ) V (tk ).
(14)
j=1
It follows from Lemma 1 that if θ < μ for all t > t0 , V (t) ≤ e−(μ−θ)(t−t0 ) (
sup
t0 −τ ≤s≤t0
V (s)).
(15)
This completes the proof of Theorem 1. Remark 1. From the proof of Theorem 1, it should be noted that, different from previous investigations in [4], [5], [6], here our main strategy is to control all the states of dynamical networks to its synchronization state s(t), but where s(t) may be not a solution of an isolated dynamical node. Moreover, it can be seen from (A2 ) and (A3 ) that robust impulsive synchronization of the controlled coupled delayed neural network (2) not only depends on the coupling matrix B, the inner connecting matrix Γ, and the time delay τ, but also is heavily determined by the impulsive control gain dk and the impulsive control interval tk − tk−1 . Therefore, the approaches developed here further extend the ideas and techniques presented in recent literature, and they are also simple to implement in practice. Example 1. Consider a model of the controlled coupled delayed neural network: ⎧ x˙ i (t) = −Cxi (t) + Af (xi (t)) + Aτ g(xi (t − τ )) + I(t) ⎪ ⎪ ⎪ N ⎨ + bij Γ xj (t), t = tk , t ≥ t0 , ⎪ ⎪ j=1 ⎪
⎩ xi (t) = (1 + dk ) (xi (t− t = tk , k = 1, 2, · · · , k ) − s(t) , in which xi (t) = (xi1 (t), xi2 (t)) , f (xi (t)) = tanh(xi2 (t))) (i = 1, · · · , 100), I(t) = (0, 0) and C=
10 , 01
A=
2.0 −0.1 −5.0 3.0
g(xi (t))
with
Aτ =
=
(16)
(tanh(xi1 (t)),
−1.5 −0.1 . −0.2 −2.5
where the synchronization state of the coupled delayed neural network (16) is defined as $s(t) = \frac{1}{100}\sum_{k=1}^{100}x_k(t)$. It should be noted that the isolated neural network

$\dot{x}(t) = -Cx(t) + Af(x(t)) + A^\tau g(x(t-1)),$   (17)
is actually a chaotic delayed Hopfield neural network [8], [9] (see Fig. 1 (a)).
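For reference, the chaotic behavior of the isolated delayed Hopfield node (17) can be visualized with a simple fixed-step integration of the delay differential equation, as in the hedged sketch below (the step size, initial history and Euler scheme are our own choices; the matrices are those of the example).

```python
import numpy as np

C = np.array([[1.0, 0.0], [0.0, 1.0]])
A = np.array([[2.0, -0.1], [-5.0, 3.0]])
Atau = np.array([[-1.5, -0.1], [-0.2, -2.5]])
tau, h = 1.0, 0.01
delay_steps = int(tau / h)

def f(x):
    return np.tanh(x)

# constant initial history on [-tau, 0]
history = [np.array([0.2, 0.3]) for _ in range(delay_steps + 1)]

traj = []
for step in range(40000):                      # 400 time units
    x = history[-1]
    x_delayed = history[-delay_steps - 1]      # state at time t - tau
    dx = -C @ x + A @ f(x) + Atau @ f(x_delayed)
    history.append(x + h * dx)                 # explicit Euler step
    history = history[-(delay_steps + 1):]     # keep only the needed history
    traj.append(history[-1])

traj = np.array(traj)
print("x1 range:", traj[:, 0].min(), traj[:, 0].max())  # double-scroll-like attractor
```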
Now we consider a scale-free network with 100 dynamical nodes. We here take the parameters N = 100, m = m0 = 5 and κ = 3; then the coupling matrix B = Bsf of the SF network can be randomly generated by the B-A scale-free model [11]. In this simulation, the second-largest eigenvalue and the smallest eigenvalue of the coupling matrix Bsf are λ2 = −1.2412 and λ100 = −34.1491, respectively. For simplicity, we consider the equidistant impulsive interval tk − tk−1 = 0.1 and dk = −0.5000 (k ∈ Z+). By taking kr = lr = 1 and δr = 1/2 (r = 1, 2), it is easy to verify that if γ1 = γ2 = 6, then all the conditions of Theorem 1 are satisfied. Hence the controlled coupled delayed neural network (16) achieves robust impulsive synchronization. The simulation results corresponding to this situation are shown in Fig. 1(b).
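A hedged sketch of how the scale-free coupling matrix and the impulsive quantities of this example can be generated and inspected is given below; networkx is our own choice of library for the B-A model, and the eigenvalues reported will depend on the random realization (so they need not equal the values quoted above).

```python
import numpy as np
import networkx as nx

# Barabasi-Albert scale-free graph with N = 100 nodes and m = m0 = 5,
# as in the example; the realization depends on the random seed.
G = nx.barabasi_albert_graph(n=100, m=5, seed=0)
L = nx.laplacian_matrix(G).toarray().astype(float)
B = -L                                  # diffusive coupling matrix: zero row sums

eig = np.sort(np.linalg.eigvalsh(B))[::-1]
print("lambda_1 =", eig[0])             # ~ 0
print("lambda_2 =", eig[1])             # second-largest eigenvalue
print("lambda_N =", eig[-1])            # smallest eigenvalue

# impulsive-control quantities of the example: d_k = -0.5, interval 0.1
d_k, interval = -0.5, 0.1
theta_k = max(1.0, (1.0 + d_k) ** 2)
theta = np.log(theta_k) / interval
print("theta =", theta)                 # 0 here, so theta < mu holds for any mu > 0
```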
Fig. 1. (a) A fully developed double-scroll-like chaotic attractors of the isolate delayed Hopfield neural network (17). (b) Impulsive synchronization process of the state variables in the controlled coupled delayed neural network (16).
4 Conclusions
In this paper, we have investigated the issues of robust impulsive synchronization of coupled delayed neural networks. A simple criterion for robust impulsive synchronization of such dynamical networks has been derived analytically. It is shown that the theoretical results can be applied to some typical chaotic neural networks such as delayed Hopfield neural networks and delayed cellular neural networks (CNN). The numerical results are given to verify and also visualize the theoretical results. Acknowledgments. This work was supported by the National Science Foundation of China (Grant nos. 60474071 and 10672094), the Science Foundation of Shanghai Education Commission (Grant no. 06AZ101), the Shanghai Leading Academic Discipline Project (Project nos. Y0103 and T0103) and the Shanghai Key Laboratory of Power Station Automation Technology.
References 1. Chen, G., Dong, X.: From Chaos to Order: Methodologies, Perspectives, and Applications, World Scientific Pub. Co, Singapore (1998) 2. Wu, C. W., Chua, L. O.: Synchronization in an Array Linearly Coupled Dynamical System. IEEE Trans. CAS-I 42 (1995) 430-447 3. Chen, G., Zhou, J., Liu, Z.: Global Synchronization of Coupled Delayed Neural Networks and Applications to Chaotic CNN Model. Int. J. Bifur. Chaos 14 (2004) 2229-2240 4. Liu, B., Liu, X., Chen, G.: Robust Impulsive Synchronization of Uncertain Dynamical Networks, IEEE Trans. CAS-I. IEEE Trans 52 (2005) 1431-1441 5. Wang, W., Cao, J.: Synchronization in an Array of Linearly Coupled Networks with Time-varying Delay. Physica A 366 (2006) 197-211 6. Li, P., Cao, J., Wang, Z.: Robust Impulsive Synchronization of Coupled Delayed Neural Networks with Uncertainties. Physica A 373 (2006) 261-272 7. Zhou, J., Chen, T., Xiang, L.: Adaptive Synchronization of Coupled Chaotic Systems Based on Parameters Identification and Its Applications. Int. J. Bifur. Chaos 16 (2004) 2923-2933 8. Zhou, J., Chen, T., Xiang, L.: Robust Synchronization of Delayed Neural Networks Based on Adaptive Control and Parameters Identification. Chaos, Solitons, Fractals 27 (2006) 905-913 9. Zhou, J., Chen, T., Xiang, L.: Chaotic Lag Synchronization of Coupled Delayed Neural Networks and Its Applications in Secure Communication. Circuits, Systems and Signal Processing 24 (2005) 599-613 10. Zhou, J., Chen, T., Xiang, L.: Global Synchronization of Impulsive Coupled Delayed Neural Networks. Wang, J and Yi, Z. (eds.): Advances in Neural Networks ISNN 2006. Lecture Notes in Computer Science, Vol. 3971. Springer-Verlag, Berlin Heidelberg New York (2006) 303-308 11. Zhou, J., Chen, T.: Synchronization in General Complex Delayed Dynamical Networks. IEEE Trans. CAS-I 53 (2006) 733-744 12. Yang, Z., Xu, D.: Stability Analysis of Delay Neural Networks with Impulsive Effects. IEEE Trans. CAS-II 52 (2005) 517-521. 13. Yang, T.: Impulsive Control Theory. Springer-Verlag, Berlin Heidelberg New York (2001)
Synchronization of Impulsive Fuzzy Cellular Neural Networks with Parameter Mismatches

Tingwen Huang (1) and Chuandong Li (2)

(1) Texas A&M University at Qatar, Doha, P.O. Box 5825, Qatar
[email protected]
(2) College of Computer Science, Chongqing University, Chongqing, 400030, China
[email protected]
Abstract. In this paper, we study the effect of parameter mismatches on the fuzzy neural networks with impulses. Since it is impossible to make two non-identical neural networks complete synchronized, we study the synchronization of two neural networks in terms of quasi-synchronization. Using Lyapunov method and linear matrix inequality method, we obtain a sufficient condition for a global synchronization error bound of the two neural networks.
1 Introduction
Since L. Pecora and T. Carroll [16] published their pioneering work on synchronization of chaos, synchronization of chaotic systems has been investigated intensively by many researchers [2,3,4,6,7,9-12,14-19,24,25] in various fields such as applied mathematics, physics and engineering due to its practical applications such as security communication. The most common regime of synchronization been investigated is complete synchronization, which implies the coincidence of states of interacting (master and response) systems. However, due to the parameter mismatch [2,6,7,9,18] which is unavoidable in real implementation, the master system and response system are not identical and the resulting synchronization is not exact. It is impossible to achieve complete synchronization. However, it is possible to make the synchronization error bounded by a small positive constant ε which is depended on the differences between parameters of two fuzzy neural networks. To the best of our knowledge, no report has been reported on quasisynchronization of two non-identical fuzzy neural networks. In this paper, we will investigate the effect of parameter mismatches on chaos synchronization of fuzzy neural networks by impulsive controls. It is known that the main obstacle for the impulsive synchronization in the presence of parameter mismatches is to get a good estimate of the synchronization error bound. To overcome this problem, we will obtain a numerically tractable, though suboptimal, sufficient condition using linear decomposition and comparison-system method. This paper is organized as follows. In Section 2, the problem is formulated and some preliminaries are given. In Section 3, the main results are presented. In Section 4, conclusions are drawn. D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 24–32, 2007. c Springer-Verlag Berlin Heidelberg 2007
2 Problem Formulation and Preliminaries
Consider the driving system described by the following fuzzy neural networks:

$\frac{dx_i}{dt} = -d_i x_i(t) + \bigwedge_{j=1}^{n}\alpha_{ij}f_j(x_j(t)) + \bigvee_{j=1}^{n}\beta_{ij}f_j(x_j(t)) + \bigwedge_{j=1}^{n}H_{ij}\mu_j + \bigvee_{j=1}^{n}T_{ij}\mu_j + I_i,$   (1)
where i = 1, · · · , n; αij , βij , Tij and Hij are elements of fuzzy feedback MIN template, fuzzy feedback MAX template, fuzzy feed forward MIN template and fuzzy feed forward MAX template respectively; and denote the fuzzy AND and fuzzy OR operation respectively; xi , μi and Ii denote state, input and bias of the ith neurons respectively; fi is the activation function. At discrete instants, τk , k = 1, 2, 3, · · · , state variables of the driving system are transmitted to the driven system and then the state variables y = (y1 , · · · , yn ) of the driven system are subjected to sudden changes at these instants. In this sense, the driven system is described by an impulsive fuzzy neural networks as n n dyi = −di yi (t) + αij fj (yj (t)) + β ij fj (yj (t)) dt j=1 j=1
+
n
Hij μj +
j=1
n
Tij μj + Ii ,
t = τk ,
j=1
Δy|t=τk = y(τk+ ) − y(τk− ) = −Ce,
t = τk , (2)
where i = 1, · · · , n, k = 1, 2, 3, · · · , C ∈ Rn×n is the control gain and e = x − y is the synchronization error, In general di , αij , β ij are different from di , αij , βij , in other words, there exist parameter mismatches between the driving and driven systems. From systems (1) and (2), the error system of the impulsive synchronization is given by n n dei (t) = −di ei (t) + (di − di )yi (t) + αij fj (xj (t)) − αij fj (yj (t)) dt j=1 j=1
+
n j=1
βij fj (xj (t)) −
n
β ij fj (yj (t)),
t = τk ,
j=1
Δy|t=τk = y(τk+ ) − y(τk− ) = −Be,
t = τk , (3)
Remark 1: It is clear that the origin e = 0 is not equilibrium point of equation (3) when parameter mismatches exist, so complete synchronization between systems (1) and (2) is impossible.
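For intuition, the fuzzy AND and fuzzy OR in model (1) act as componentwise min and max operations over j. A minimal, illustrative Python sketch of the right-hand side of the driving fuzzy network (1) is given below; the activation, matrices and sizes are placeholders of our own, not values from the paper.

```python
import numpy as np

def fcnn_rhs(x, d, alpha, beta, H, T, mu, I):
    """Right-hand side of the fuzzy cellular neural network (1):
    fuzzy AND = min over j, fuzzy OR = max over j."""
    f = np.tanh(x)                               # example activation
    dx = np.empty_like(x)
    for i in range(x.size):
        dx[i] = (-d[i] * x[i]
                 + np.min(alpha[i] * f)          # fuzzy feedback MIN template
                 + np.max(beta[i] * f)           # fuzzy feedback MAX template
                 + np.min(H[i] * mu)             # fuzzy feed-forward MIN template
                 + np.max(T[i] * mu)             # fuzzy feed-forward MAX template
                 + I[i])
    return dx

# illustrative 2-neuron example (parameters are not from the paper)
n = 2
rng = np.random.default_rng(0)
x = rng.standard_normal(n)
d = np.ones(n)
alpha, beta, H, T = (rng.standard_normal((n, n)) for _ in range(4))
mu, I = np.zeros(n), np.zeros(n)
print(fcnn_rhs(x, d, alpha, beta, H, T, mu, I))
```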
In this paper, we assume that

(H) $f_i$ is a bounded function defined on $R$ satisfying $|f_i(x) - f_i(y)| \le l_i|x - y|$, $i = 1, \ldots, n$,   (4)
for any x, y ∈ R. In the following, we cite several concepts on quasi-synchronization and impulsive differential equation. Definition 1. ([9]). Let χ denote a region of interest in the phase space that contains the chaotic attractor of system (1). The synchronization schemes (1) and (2) are said to be uniformly quasi-synchronized with error bound ε > 0 if there exists a T ≥ t0 such that ||x(t) − y(t)|| ≤ ε for all t ≥ T starting from any initial values x(t0 ) ∈ χ and y(t0 ) ∈ χ. Definition 2. ([23]) A function V : R+ × Rn → R+ is said to belong to class Σ if 1) V is continuous in (τk−1 , τk ) × Rn and, for each x ∈ Rn , k = 1, 2, · · · , lim(t,y)→(τ +,x) V (t, y) = V (τk+ , x) exists; k 2) V is locally Lipschitzian in x For the following general impulsive differential equation x˙ = g(t, x),
t = τk ,
x(τk+ ) = ψk (x(τk )), t = τk , x(t0 ) = x0 , t0 ≥ 0,
(5)
the right-upper Dini’s derivative of V ∈ Σ is defined as the following: Definition 3. ([23]) For (t, x) ∈ (τk−1 , τk ] × Rn , the right-upper Dini’s derivative of V ∈ Σ with respect to time is defined as D+ V (t, x) = lim {V [t + h, x + hg(t, x)] − V (t, x)} h→0+
(6)
Definition 4. ([23]) For impulsive system (5), let V ∈ Σ and assume that D+ V (t, x) ≤ g[t, V (t, x)], t = τk , − V [t, y(τk ) − Be] ≤ ψk [V (t, x)], t = τk , (7) where g : R+ × R+ → R is continuous and g(t, 0) = 0, ψk : R+ → R+ is non-decreasing. Then, the system ω˙ = g(t, ω), ω(τk+ )
t = τk ,
= ψk (ω(τk )),
ω(t0 ) = ω0 ,
t0 ≥ 0.
t = τk , (8)
is called the comparison system for (5). For the convenience, we give the matrix notations here. For A, B ∈ Rn×n , A ≤ B(A > B) means that each pair of the corresponding elements of A and B satisfy the inequality ≤ ( > ). Also, if A = (aij ), then |A| = (|aij )|.
3 Main Results
In this Section, we will obtain a sufficient condition for quasi-synchronization and estimate the synchronization bound at the same time using Lyapunov-like function. Before we state the main results, we state the following theorem first. Lemma 1. ([22]). For any aij ∈ R, xj , yj ∈ R, i, j = 1, · · · , n, we have the following estimations, |
n
aij xj −
j=1
and |
n j=1
n
aij yj | ≤
j=1
aij xj −
n
(|aij | · |xj − yj |)
(9)
(|aij | · |xj − yj |)
(10)
1≤j≤n
aij yj | ≤
j=1
1≤j≤n
Let $D = \mathrm{diag}(d_1, \ldots, d_n)$, $L = \mathrm{diag}(l_1, \ldots, l_n)$, $\bar{D} = \mathrm{diag}(\bar{d}_1, \ldots, \bar{d}_n)$, $A = (\alpha_{ij})_{n\times n}$, $\bar{A} = (\bar{\alpha}_{ij})_{n\times n}$, $B = (\beta_{ij})_{n\times n}$, $\bar{B} = (\bar{\beta}_{ij})_{n\times n}$, $\Delta D = \bar{D} - D$, $\Delta A = \bar{A} - A$, $\Delta B = \bar{B} - B$. Now we are ready to state and prove the main result on the synchronization of the driving system (1) and the driven system (2).

Theorem 1. Let $\chi = \{x \in R^n \mid \|x\| \le \delta_1\}$, let the parameter mismatches satisfy $\Delta D^T\Delta D + \Delta A^T\Delta A + \Delta B^T\Delta B \le \delta_2$ with $\delta = \delta_1^2\delta_2^2$, and let the sequence of impulses be equidistant and separated by an interval τ. If there exists a symmetric positive definite matrix P > 0 such that the following conditions hold:

(i) $-2PD + P^2 + 2L|A| + P^2L^2 + 2L|B| + P^2L^2 - \lambda P \le 0$,
(ii) $(I + C)^T P(I + C) - \rho P \le 0$,
(iii) $\ln\rho + \lambda\tau < 0$,

then the synchronization error system (3) converges exponentially to a small region containing the origin, namely

$\Big\{e \in R^n \,\Big|\, \|e\| \le \sqrt{\tfrac{\tau\delta}{\lambda_m(P)\rho(\ln\rho + \tau\lambda)}}\Big\}$   (11)

Thus quasi-synchronization between systems (1) and (2) with error bound $\varepsilon = \sqrt{\tfrac{\tau\delta}{\lambda_m(P)\rho(\ln\rho + \tau\lambda)}}$ is achieved.

Proof. Consider the following Lyapunov-like function:

$V(e(t)) = e(t)^T P e(t)$   (12)
28
T. Huang and C. Li
+
+
n
αij fj (yj (t))−
j=1 n
n
αij fj (yj (t))+
j=1 n
βij fj (yj (t)) −
j=1
n
βij fj (xj (t)) −
j=1
n
βij fj (yj (t))
j=1
β ij fj (yj (t)),
j=1
≤ −di ei (t) + (di − di )yi (t) +
n
|αij fj (xj (t)) − αij fj (yj (t))|
j=1
+
+
n j=1 n
|αij fj (yj (t)) − αij fj (yj (t))| +
n
|βij fj (xj (t)) − βij fj (yj (t))|
j=1
|βij fj (yj (t)) − β ij fj (yj (t))|,
j=1
≤ −di ei (t) + (di − di )yi (t) + +
+
n j=1 n
lj |αij − αij ||yj (t)| +
n j=1 n
lj |αij ||xj (t) − yj (t)| lj |βij ||(xj (t) − (yj (t)|
j=1
lj |βij − β ij ||yj (t)|,
j=1
≤ −di ei (t) + (di − di )yi (t) + +
+
n j=1 n
lj |αij − αij ||yj (t)| +
n j=1 n
lj |αij |e(t) lj |βij |e(t)
j=1
lj |βij − β ij ||yj (t)|,
(13)
j=1
We write the above inequality as matrix form: de(t) ≤ −De(t) + (D − D)y(t) + L|A|e(t) dt +L|A − A||y(t)| + L|B|e(t) + L|B − B||y(t)|
(14)
Now calculate the derivative of V with respect to time t ∈ (τk+ , τk+1 ] along the solution to (3). V + (e(t)) = 2e(t)T P e(t) ≤ 2e(t)T P {−De(t) + (D − D)y(t) + L|A|e(t) +L|A − A||y(t)| + L|B|e(t) + L|B − B||y(t)|} ≤ −2e(t)T P De(t) + e(t)T P 2 e(t) + y(t)T (D − D)T (D − D)y(t) +2e(t)T L|A|e(t) + e(t)T P 2 L2 e(t) + y(t)T (A − A)T (A − A)y(t)
+2e(t)T L|B|e(t) + e(t)T P 2 L2 e(t) + y(t)(B − B)T (B − B)y(t) = e(t)T (−2P D + P 2 + 2L|A| + P 2 L2 + 2L|B| + P 2 L2 )e(t) +y(t)T (ΔDT ΔD + ΔAT ΔA + ΔB T ΔB)y(t) = e(t)T (−2P D + P 2 + 2L|A| + P 2 L2 + 2L|B| + P 2 L2 − λP )e(t) +e(t)T λP e(t) + y(t)T (ΔDT ΔD + ΔAT ΔA + ΔB T ΔB)y(t) ≤ λe(t)T P e(t) + y(t)T (ΔDT ΔD + ΔAT ΔA + ΔB T ΔB)y(t) ≤ λV (e(t)) + δ
(15)
At the impulsive points, we get V ((I + C)e(τk+ )) = e(τk− )T (I + C)T P (I + C)e(τk− ) = e(τk− )T [(I + C)T P (I + C) − ρP ]e(τk− ) + ρe(τk− )T P e(τk− ) ≤ ρe(τk− )T P e(τk− ) = ρV (τk− )
(16) Thus, the error system has the following comparison system: t = τk , t = τk ,
z(t) ˙ = λz(t) + δ, z(τk+ ) = ρz(τk− ),
z(t0 ) = z0 = V (e(t+ 0 )).
(17)
To obtain the solution to (16) explicitly, consider the linear reference system for (16): z(t) ˙ = λz(t), t = τk , + − z(τk ) = ρz(τk ), t = τk , z(t0 ) = z0 = V (e(t+ 0 )).
(18)
The unique solution to the above equation is z(t, t0 , z0 ) = ρn(t,t0 ) eλ(t−t0 ) z0 ,
t > t0 ,
(19)
τ1 (t
where n(t, t0 ) = − t0 ). Here . is the floor operation. Since ρ < 1 and 1 1 n(t,s) (t − s) − 1 < n(t, s) ≤ < ρ−1 ρ(t−s)/τ , thus, τ τ (t − s) for t > s, we have ρ z(t, s, z(s)) ≤ ρ−1 (ρ τ eλ )t−s) z(s), 1
t > s ≥ t0 ,
(20)
The solution of Equation (16) with initial value z0 is t z(t, t0 , z0 ) = ρn(t,t0 ) eλ(t−t0 ) z0 + ρn(t,s) eλ(t−s) δds t0 −1
≤ρ
1 τ
λ (t−t0 )
(ρ e )
z0 + δρ
−1
t
1
(ρ τ eλ )(t−s) ds t0
τδ 1 = ρ−1 (ρ eλ )(t−t0 ) z0 − [1 − (ρ τ eλ )(t−t0 ) ] ρ(lnρ + τ λ) 1 τ
(21)
By the Theorem 3.1.1 in [23], we have V (e(t)) = e(t)T P e(t) ≤ z(t, t0 , z0 ),
t > t0 ,
(22)
where z0 = V (e(t0 )). Let λm (P ) is the minimal eigenvalue of the square matrix P . From equations (21) and (22), we have λm (P )||e(t)||2 ≤ e(t)T P e(t) = V (e(t)) ≤ z(t, t0 , z0 ) 1 τδ 1 ≤ ρ−1 (ρ τ eλ )(t−t0 ) z0 − [1 − (ρ τ eλ )(t−t0 ) ] ρ(lnρ + τ λ) (23) so, ||e(t)||2 ≤
1 1 τδ 1 {ρ−1 (ρ τ eλ )(t−t0 ) z0 − [1 − (ρ τ eλ )(t−t0 ) ]} λm (P ) ρ(lnρ + τ λ) (24)
1
Since ρ τ eλ < 1, the first and third term on the right side of equation (23) will go to 0 exponentially as t approaches to ∞. Thus, there exists a large T > 0 such that ||e(t)||2 ≤
τδ λm (P )ρ(lnρ + τ λ) (25)
namely, ||e(t)|| ≤
τδ λm (P )ρ(lnρ + τ λ) (26)
The proof of the theorem is completed.
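As a numerical aside, the quasi-synchronization error bound of Eq. (26) can be evaluated directly once τ, δ, P, ρ and λ are fixed. The hedged sketch below uses placeholder numbers (not values from the paper); note that by condition (iii) ln ρ + τλ is negative, so the positive bound is taken with the sign of Eq. (21).

```python
import numpy as np

def quasi_sync_bound(tau, delta, P, rho, lam):
    """Quasi-synchronization error bound; since ln(rho)+tau*lam < 0 by (iii),
    the positive bound from Eq. (21) is
    sqrt(-tau*delta / (lambda_min(P) * rho * (ln(rho) + tau*lam)))."""
    lam_min = np.linalg.eigvalsh(P).min()
    return np.sqrt(-tau * delta / (lam_min * rho * (np.log(rho) + tau * lam)))

# illustrative numbers only
P = np.eye(2)
print(quasi_sync_bound(tau=0.1, delta=0.05, P=P, rho=0.5, lam=1.0))
```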
4 Conclusion
Since parameter mismatches are inevitable, and have detrimental effect on the synchronization quality between driving system and driven system, it is important to find out what the effect of parameter mismatch on the synchronization. In this paper, we have investigated synchronization of two systems with parameter mismatches using Lyapunov method and comparison theorem. We obtained a sufficient condition for quasi-synchronization with error bound ε of two nonidentical fuzzy neural networks by impulsive control.
Acknowledgments

The first author is grateful for the support of Texas A&M University at Qatar. Also, this work was partially supported by the National Science Foundation of China (Grant No. 60574024).
References 1. Arik S.: Global Robust Stability of Delayed Neural Networks. IEEE Trans. Circ. Syst. I, 50(2003)156-160 2. Astakhov,V., Hasler,M., Kapitaniak,T., Shabunin,A., Anishchenko,V.: Effect of Parameter Mismatch on The Mechanism of Chaos Synchronization Loss in Coupled Systems. Physical Review E 58 (1998) 5620-5628 3. Cao J., Li P., Wang W.: Global Synchronization in Arrays of Delayed Neural Networks with Constant and Delayed Coupling. PHysics Letters A, 353(2006) 318-325 4. Cao J., Li H., Ho D.: Synchronization Criteria of Lur’e Systems with Time-delay Feedback Control. Chaos Solitons & Fractals 23 (2005) 1285-1298 5. Huang T.: Exponential Stability of Fuzzy Cellular Neural Networks with Distributed Delay. Physics Letters A 351 (2006)48-52 6. Jalnine,A., Kim,S.-Y.: Characterization of The Parameter-mismatching Effect on The Loss of Chaos Synchronization. Physical Review E 65 (2002), 026210 7. Leung,H., Zhu,Z.: Time-varying Synchronization of Chaotic Systems in The Presence of System Mismatch. Physical Review E 69 (2004) 026201 8. Liu,Y., Tang,W.: Exponential Stability of Fuzzy Cellular Neural Networks with Constant and Time-varying Delays. Physics Letters A, 323 (2004) 224-233 9. Li C., Chen G., Liao X., Fan Z.: Chaos Quasisynchronization Induced by Impulses with Parameter Mismatches. Chaos 16 (2006), No.02102 10. Li C., Chen G., Liao X., Zhang X.: Impulsive Synchronization of Chaotic Systems. Chaos 15 (2005), No. 023104 11. Li C., Liao X., Yang X. and Huang T.: Impulsive Stabilization and Synchronization of A Class of Chaotic Delay Systems. Chaos 15 (2005), 043103 12. Li C., Liao X.: Wong KW Chaotic Lag Synchronization of Coupled Time-delayed Systems and Its Applications in Secure Communication. Physica D 194 (2004) 187-202 13. Liao X., Wu Z., Yu J.: Stability Analyses for Cellular Neural Networks with Continuous Delay. Journal of Computational and Applied Mathematics, 143(2002)29-47 14. Lu J., Cao J.: Adaptive Complete Synchronization of Two Identical or Different Chaotic (Hyperchaotic) Systems with Fully Unknown Parameters. Chaos 15 (2005), No. 043901. 15. Lu W., Chen T.: New Approach to Synchronization Analysis of Linearly Coupled Ordinary Differential Systems. Physica D 213 (2006) 214-230 16. Pecora L., Carroll, T.: Synchronization in Chaotic systems. Physical Review Letters 64 (1990) 821-824. 17. Wang W., Cao J.: Synchronization in An Array of Linearly Coupled Networks with Time-varying Delay. Physica A, 366(2006) 197-211 18. Wu,C. W., Chua,L.O.: A Unified Framework for Synchronization and Control of Dynamical Systems. Int. J. Bifurcation Chaos, 4 (1994) 979-989
19. Xiong W., Xie W., Cao J.: Adaptive Exponential Synchronization of Delayed Chaotic Networks. Physica A 370 (2006) 832-842 20. Yang T., Yang L.B., Wu C.W., Chua L.O.: Fuzzy Cellular Neural Networks: Theory. In Proc. of IEEE International Workshop on Cellular Neural Networks and Applications, (1996)181-186 21. Yang T. , Yang L.B., Wu C.W. and Chua L.O.: Fuzzy Cellular Neural Networks: Applications. In Proc. of IEEE International Workshop on Cellular Neural Networks and Applications, (1996)225-230. 22. Yang T., Yang L.B.: The Global Stability of Fuzzy Cellular Neural Network. Circuits and Systems I: Fundamental Theory and Applications, 43(1996)880-883 23. Yang Y.: Impulsive Control Theory. Springer, Berlin, 2001. 24. Zhang X., Liao X., Li C.: Impulsive Control, Complete and Lag Synchronization of Unified Chaotic System with Continuous Periodic Switch. Chaos Solitons & Fractals 26 (2005) 845-854 25. Zhou J., Chen T., Xiang L.: Robust Synchronization of Delayed Neural Networks Based on Adaptive Control and Parameters Identification. Chaos Solitons & Fractals 27 (2006) 905-913
Global Synchronization in an Array of Delayed Neural Networks with Nonlinear Coupling

Jinling Liang (1), Ping Li (1), and Yongqing Yang (2)

(1) Department of Mathematics, Southeast University, Nanjing, 210096, China
(2) School of Science, Southern Yangtze University, Wuxi, 214122, China
[email protected]
Abstract. In this paper, synchronization is investigated for an array of nonlinearly coupled identical connected neural networks with delay. By employing the Lyapunov functional method and the Kronecker product technique, several sufficient conditions are derived. It is shown that global exponential synchronization of the coupled neural networks is guaranteed by a suitable design of the coupling matrix, the inner linking matrix and some free matrices representing the relationships between the system matrices. The conditions obtained in this paper are in the form of linear matrix inequalities, which can be easily computed and checked in practice. A typical example with chaotic nodes is finally given to illustrate the effectiveness of the proposed synchronization scheme.
1 Introduction
Dynamical behaviors of recurrent neural networks have been deeply investigated in the past decades due to their successful application in optimization, signal processing, pattern recognition and associative memories, especially in processing static images [1]. Most of the previous studies predominantly concentrated on the stability analysis, periodic oscillations and dissipativity of such kind of neural networks [2]. However, complex dynamics such as bifurcation and chaotic phenomena have also been shown to exist in these networks [3]. On the other hand, both theoretical studies and practical experiments have been reported that synchronization phenomena occur generically in many cases, such as in a mammalian brain, in language emergence and in an array of coupled identical neural networks. Arrays of coupled systems have received much attention recently for they can exhibit many interesting phenomena such as spatio-temporal chaos, autowaves and they can be utilized in engineering fields such as secure communication, chaos generators design and harmonic oscillation generation [4,5]. Synchronization of coupled chaotic systems has been extensively investigated, for more information one may refer to [6-10, 12-13] and the references cited therein. However, in these papers, the coupling terms of the models been studied are always linear, to the best of our knowledge, up till now, there are very few results on an array of nonlinearly coupled neural networks. Based on the D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 33–39, 2007. c Springer-Verlag Berlin Heidelberg 2007
above discussions, in this paper, the following nonlinearly coupled neural network model will be studied:

$\frac{dx_i(t)}{dt} = -Cx_i(t) + Af(x_i(t)) + Bf(x_i(t-\tau)) + I(t) + \sum_{j=1}^{N}G_{ij}Df(x_j(t)),$   (1)
where i = 1, 2, . . . , N and xi (t) = (xi1 (t), . . . , xin (t))T is the state vector of the ith network at time t; C = diag(c1 , . . . , cn ) > 0 denotes the rate with which the cell i resets its potential to the resting state when isolated from other cells and inputs; A and B are the weight matrix and the delayed weight matrix, respectively; activation function f (xi (t)) = (f1 (xi1 (t)), . . . , fn (xin (t)))T ; I(t) = (I1 (t), . . . , In (t))T is the external input and τ > 0 represents the transmission delay; D is an n×n matrix and G = (Gij )N ×N denotes the coupling configuration of the array and satisfying the diffusive coupling connections (i = j),
$G_{ij} = G_{ji}\ (i \ne j), \qquad G_{ii} = -\sum_{j=1,\,j\ne i}^{N}G_{ij}, \qquad i, j = 1, 2, \ldots, N.$   (2)
For simplicity, let $x(t) = (x_1^T(t), x_2^T(t), \ldots, x_N^T(t))^T$, $F(x(t)) = (f^T(x_1(t)), f^T(x_2(t)), \ldots, f^T(x_N(t)))^T$, $I(t) = (I^T(t), \ldots, I^T(t))^T$; combining these with the Kronecker product $\otimes$, model (1) can be rewritten as

$\frac{dx(t)}{dt} = -(I_N \otimes C)x(t) + (I_N \otimes A)F(x(t)) + (I_N \otimes B)F(x(t-\tau)) + I(t) + (G \otimes D)F(x(t))$   (3)
The initial conditions with (3) are given by xi (s) = φi (s) ∈ C([−τ, 0], Rn ),
i = 1, 2, . . . , N.
(4)
Throughout this paper, the following assumption is made:

(H) There exist constants $l_r > 0$, $r = 1, 2, \ldots, n$, such that $0 \le \frac{f_r(x_1) - f_r(x_2)}{x_1 - x_2} \le l_r$ for any different $x_1, x_2 \in R$.

Definition 1. Model (3) is said to be globally exponentially synchronized if there exist two constants $\epsilon > 0$ and $M > 0$ such that for all $\phi_i(s)$ ($i = 1, 2, \ldots, N$) and for sufficiently large $T > 0$, $\|x_i(t) - x_j(t)\| \le Me^{-\epsilon t}$ for all $t > T$, $i, j = 1, 2, \ldots, N$.

Lemma 1 [11]. Let $\otimes$ denote the Kronecker product, $\alpha \in R$, and let $A, B, C, D$ be matrices with appropriate dimensions. Then
(1) $(\alpha A) \otimes B = A \otimes (\alpha B)$;
(2) $(A + B) \otimes C = A \otimes C + B \otimes C$;
(3) $(A \otimes B)(C \otimes D) = (AC) \otimes (BD)$.
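These three Kronecker identities are easy to check numerically; the short sketch below verifies them with random matrices of compatible shapes (the shapes and seed are arbitrary choices of ours).

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = 2.5

# (1) (alpha*A) ⊗ B = A ⊗ (alpha*B)
A = rng.standard_normal((2, 3)); B = rng.standard_normal((4, 5))
print(np.allclose(np.kron(alpha * A, B), np.kron(A, alpha * B)))

# (2) (A + B) ⊗ C = A ⊗ C + B ⊗ C   (A and B of the same shape)
A2 = rng.standard_normal((2, 3)); B2 = rng.standard_normal((2, 3))
C = rng.standard_normal((4, 4))
print(np.allclose(np.kron(A2 + B2, C), np.kron(A2, C) + np.kron(B2, C)))

# (3) (A ⊗ B)(C ⊗ D) = (AC) ⊗ (BD)   (compatible shapes)
A3 = rng.standard_normal((2, 3)); C3 = rng.standard_normal((3, 2))
B3 = rng.standard_normal((4, 5)); D3 = rng.standard_normal((5, 3))
print(np.allclose(np.kron(A3, B3) @ np.kron(C3, D3), np.kron(A3 @ C3, B3 @ D3)))
```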
2 Main Results
In this section, Lyapunov functional method will be employed to investigate the global exponential synchronization of system (3). Theorem 1. Under the assumption (H), system (3) with initial condition (4) is globally exponentially synchronized, if there exist three positive definite matrices Pi > 0 (i = 1, 2, 3) and two positive diagonal matrices S, W such that the following LMIs are satisfied for all 1 ≤ i < j ≤ N : ⎡
−P1 C − CP1 + P3 LS + P1 A − N Gij P1 D ⎢ SL + AT P1 − N Gij DT P1 P2 − 2S Ωij =⎢ ⎣ 0 0 B T P1 0
⎤ 0 P1 B ⎥ 0 0 ⎥ < 0, ⎦ −P3 LW W L −P2 − 2W (5)
where L = diag(l1 , l2 , . . . , ln ). Proof. Condition (5) ensures that there exists a scaler > 0 such that ⎡
⎤ 0 P1 B ⎥ 0 0 ⎥ < 0. ⎦ −P3 LW W L −P2 − 2W (6) Let e = (1, 1, . . . , 1)T , EN = eeT be the N × N matrix of all 1’s, and U = N IN −EN , in which IN denotes the N ×N unitary matrix. Consider the following Lyapunov functional candidate for system (3): P1 − P1 C − CP1 + eτ P3 LS + P1 A − N Gij P1 D ⎢ SL + AT P1 − N Gij DT P1 eτ P2 − 2S ij =⎢ Ω ⎣ 0 0 B T P1 0
V (t, xt ) = V1 (t, xt ) + V2 (t, xt ) + V3 (t, xt ),
(7)
where V1 (t, xt ) = et xT (t)(U ⊗ P1 )x(t),
t V2 (t, xt ) = t−τ e(s+τ ) F T (x(s))(U ⊗ P2 )F (x(s))ds,
t V3 (t, xt ) = t−τ e(s+τ ) xT (s)(U ⊗ P3 )x(s)ds. Calculating the derivative of V (t) along the solutions of (3), and notifying that (U ⊗ P1 )I(t) ≡ 0, U G = N G; by Lemma 1, we have dV (t,xt ) dt t T
= e x (t)(U ⊗ P1 )x(t) + 2et xT (t)(U ⊗ P1 )[−(IN ⊗ C)x(t) +(IN ⊗ A)F (x(t)) + (IN ⊗ B)F (x(t − τ )) + I(t) + (G ⊗ D)F (x(t))] +e(t+τ )F T (x(t))(U ⊗ P2 )F (x(t)) − et F T (x(t − τ ))(U ⊗ P2 )F (x(t − τ )) +e(t+τ )xT (t)(U ⊗ P3 )x(t) − et xT (t − τ )(U ⊗ P3 )x(t − τ ) = et {xT (t)[(U ⊗ P1 ) − 2U ⊗ (P1 C)]x(t) + 2xT (t)(U ⊗ (P1 A) +(N G) ⊗ (P1 D))F (x(t)) + 2xT (t)(U ⊗ (P1 B))F (x(t − τ )) +eτ F T (x(t))(U ⊗ P2 )F (x(t)) − F T (x(t − τ ))(U ⊗ P2 )F (x(t − τ )) +eτ xT (t)(U ⊗ P3 )x(t) − xT (t − τ )(U ⊗ P3 )x(t − τ )}
 = e^{εt} ∑_{i=1}^{N−1} ∑_{j=i+1}^{N} { (xi(t) − xj(t))^T [(εP1 − 2P1C)(xi(t) − xj(t))
   + 2(P1A − N·Gij·P1D)(f(xi(t)) − f(xj(t))) + 2P1B(f(xi(t − τ)) − f(xj(t − τ)))]
   + e^{ετ} (f(xi(t)) − f(xj(t)))^T P2 (f(xi(t)) − f(xj(t))) − (f(xi(t − τ)) − f(xj(t − τ)))^T P2 (f(xi(t − τ)) − f(xj(t − τ)))
   + e^{ετ} (xi(t) − xj(t))^T P3 (xi(t) − xj(t)) − (xi(t − τ) − xj(t − τ))^T P3 (xi(t − τ) − xj(t − τ)) }.    (8)
Under assumption (H), one can easily obtain the following inequalities:

  (f(xi(t)) − f(xj(t)))^T S (f(xi(t)) − f(xj(t))) ≤ (xi(t) − xj(t))^T LS (f(xi(t)) − f(xj(t))),    (9)
  (f(xi(t − τ)) − f(xj(t − τ)))^T W (f(xi(t − τ)) − f(xj(t − τ))) ≤ (xi(t − τ) − xj(t − τ))^T LW (f(xi(t − τ)) − f(xj(t − τ))),    (10)
where 1 ≤ i < j ≤ N. Substituting (9) and (10) into (8), we obtain

dV(t, xt)/dt ≤ e^{εt} ∑_{i=1}^{N−1} ∑_{j=i+1}^{N} { (xi(t) − xj(t))^T [εP1 − 2P1C + e^{ετ}P3](xi(t) − xj(t))
   + (f(xi(t)) − f(xj(t)))^T [e^{ετ}P2 − 2S](f(xi(t)) − f(xj(t)))
   − (xi(t − τ) − xj(t − τ))^T P3 (xi(t − τ) − xj(t − τ))
   − (f(xi(t − τ)) − f(xj(t − τ)))^T (P2 + 2W)(f(xi(t − τ)) − f(xj(t − τ)))
   + 2(xi(t) − xj(t))^T [LS + P1A − N·Gij·P1D](f(xi(t)) − f(xj(t)))
   + 2(xi(t) − xj(t))^T P1B (f(xi(t − τ)) − f(xj(t − τ)))
   + 2(xi(t − τ) − xj(t − τ))^T LW (f(xi(t − τ)) − f(xj(t − τ))) }
 = e^{εt} ∑_{i=1}^{N−1} ∑_{j=i+1}^{N} ξij^T Ω̃ij ξij,    (11)
in which ξij = [(xi(t) − xj(t))^T, (f(xi(t)) − f(xj(t)))^T, (xi(t − τ) − xj(t − τ))^T, (f(xi(t − τ)) − f(xj(t − τ)))^T]^T. From condition (6), inequality (11) implies that V(t) ≤ V(0); hence e^{εt} x^T(t)(U ⊗ P1)x(t) is bounded, which yields

  λmin(P1) ‖xi(t) − xj(t)‖² ≤ ∑_{i=1}^{N−1} ∑_{j=i+1}^{N} (xi(t) − xj(t))^T P1 (xi(t) − xj(t)) = O(e^{−εt}),

for all 1 ≤ i < j ≤ N. According to Definition 1, we conclude that the dynamical system (3) is globally exponentially synchronized.

Based on Theorem 1, one can easily obtain the following corollary:

Corollary 1. Under assumption (H), system (3) with initial condition (4) is globally exponentially synchronized if there exist three positive definite matrices
Pi > 0 (i = 1, 2, 3) and one positive diagonal matrix S such that the following LMIs are satisfied for all 1 ≤ i < j ≤ N:

  [ −P1C − CP1 + P3,   LS + P1A − N·Gij·P1D,   0,   P1B ;
    SL + A^T P1 − N·Gij·D^T P1,   P2 − 2S,   0,   0 ;
    0,   0,   −P3,   0 ;
    B^T P1,   0,   0,   −P2 ] < 0.    (12)
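The paper solves LMIs such as (12) with the MATLAB LMI Control Toolbox (see Sect. 3). Purely as a hedged illustration of the same feasibility test, the sketch below sets up (12) for a single pair (i, j) with CVXPY; the network data C, A, B, L, N, Gij and D here are placeholders, not values from the paper.

```python
import numpy as np
import cvxpy as cp

n, N, Gij = 2, 3, 1.0                              # placeholder sizes and coupling entry
C = np.eye(n); A = np.eye(n); B = 0.5 * np.eye(n)  # placeholder network matrices
L = np.eye(n); D = np.eye(n)

P1 = cp.Variable((n, n), symmetric=True)
P2 = cp.Variable((n, n), symmetric=True)
P3 = cp.Variable((n, n), symmetric=True)
s = cp.Variable(n, nonneg=True)                    # S = diag(s) > 0
S = cp.diag(s)
Z = np.zeros((n, n))

Omega = cp.bmat([
    [-P1 @ C - C @ P1 + P3, L @ S + P1 @ A - N * Gij * P1 @ D, Z, P1 @ B],
    [S @ L + A.T @ P1 - N * Gij * D.T @ P1, P2 - 2 * S, Z, Z],
    [Z, Z, -P3, Z],
    [B.T @ P1, Z, Z, -P2],
])
Omega = 0.5 * (Omega + Omega.T)                    # symmetrize before the PSD constraint

eps = 1e-6
cons = [P1 >> eps * np.eye(n), P2 >> eps * np.eye(n), P3 >> eps * np.eye(n),
        s >= eps, Omega << -eps * np.eye(4 * n)]
prob = cp.Problem(cp.Minimize(0), cons)
prob.solve(solver=cp.SCS)
print("LMI feasibility status:", prob.status)
```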
3   Numerical Example
Consider a 2-dimensional neural network with delay presented in [3]:

  dy(t)/dt = −Cy(t) + Af(y(t)) + Bf(y(t − 0.93)) + I(t),    (13)
where y(t) = (y1(t), y2(t))^T ∈ R² is the state vector of the network and the activation function is f(y(t)) = (f1(y1(t)), f2(y2(t)))^T with fi(yi) = 0.5(|yi + 1| − |yi − 1|) (i = 1, 2); obviously, assumption (H) is satisfied with L = diag(1, 1). The external input vector is I(t) = (0, 0)^T, and the other matrices are as follows:

  C = [ 1, 0 ; 0, 1 ],   A = [ 1 + π/4, 20 ; 0.1, 1 + π/4 ],   B = [ −1.3√2·π/4, 0.1 ; 0.1, −1.3√2·π/4 ].

The dynamical chaotic behavior with initial conditions

  y1(s) = 0.2,   y2(s) = 0.3,   ∀s ∈ [−0.93, 0]    (14)

is shown in Fig. 1.
Fig. 1. Chaotic trajectory of (13)

Fig. 2. Synchronization error e(t)
Now consider a complex system consisting of three nonlinearly coupled identical models (13). The state equations of the entire array are

  dxi(t)/dt = −Cxi(t) + Af(xi(t)) + Bf(xi(t − 0.93)) + I(t) + ∑_{j=1}^{3} Gij·D·f(xj(t)),    (15)
where xi(t) = (xi1(t), xi2(t))^T (i = 1, 2, 3) is the state vector of the ith neural network. Choose the coupling matrix G and the linking matrix D as

  G = [ −3, 1, 2 ; 1, −2, 1 ; 2, 1, −3 ],   D = [ 4, 0 ; 0, 4 ].

By applying the MATLAB LMI Control Toolbox, (12) can be solved to yield the following feasible solutions:

  P1 = [ 0.0632, 0.0467 ; 0.0467, 2.2843 ],   P2 = [ 0.4114, −0.5336 ; −0.5336, 14.6908 ],   P3 = [ 0.0084, −0.0119 ; −0.0119, 0.2428 ],

and S = diag(1.0214, 36.3493). According to Corollary 1, network (15) achieves global exponential synchronization. The synchronization performance is illustrated in Fig. 2, where e(t) = (e1(t), e2(t))^T with ej(t) = ∑_{i=2}^{3} (xij(t) − x1j(t))², and the initial states for (15) are taken as random constants in [0, 1] × [0, 1]. Fig. 2 confirms that the dynamical system (15) is globally exponentially synchronized.
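For readers who want to reproduce the behaviour reported in Fig. 2, the following is a minimal forward-Euler sketch of the coupled array (15). It is not the authors' code; the step size and the constant random initial histories are assumptions.

```python
import numpy as np

f = lambda y: 0.5 * (np.abs(y + 1.0) - np.abs(y - 1.0))
C = np.eye(2)
A = np.array([[1 + np.pi / 4, 20.0], [0.1, 1 + np.pi / 4]])
B = np.array([[-1.3 * np.sqrt(2) * np.pi / 4, 0.1],
              [0.1, -1.3 * np.sqrt(2) * np.pi / 4]])
G = np.array([[-3.0, 1.0, 2.0], [1.0, -2.0, 1.0], [2.0, 1.0, -3.0]])
D = 4.0 * np.eye(2)

dt, tau, T = 0.001, 0.93, 15.0
d, steps = int(round(tau / dt)), int(round(T / dt))
x = np.zeros((steps + d + 1, 3, 2))
x[: d + 1] = np.random.default_rng(1).uniform(0.0, 1.0, (3, 2))  # constant random history

for k in range(d, steps + d):
    for i in range(3):
        coupling = sum(G[i, j] * (D @ f(x[k, j])) for j in range(3))
        dx = -C @ x[k, i] + A @ f(x[k, i]) + B @ f(x[k - d, i]) + coupling
        x[k + 1, i] = x[k, i] + dt * dx

e = np.sum((x[d:, 1:] - x[d:, :1]) ** 2, axis=1)  # e_j(t) = sum_{i=2,3}(x_ij - x_1j)^2
print("synchronization error at t = T:", e[-1])
```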
References 1. Hopfield, J.J.: Neurons with Graded Response Have Collective Computational Properties Like Those of Two-Stage Neurons. Proc. Natl. Acad. Sci. USA 81 (1984) 3088-3092 2. Zhang, J., Suda, Y. and Iwasa, T.: Absolutely Exponential Stability of a Class of Neural Networks with Unbounded Delay. Neural Networks 17 (2004) 391-397 3. Gilli, M.: Strange Attractors in Delayed Cellular Neural Networks. IEEE Trans. Circuits Syst.-I 40(11) (1993) 849-853 4. Hoppensteadt, F.C. and Izhikevich, E.M.: Pattern Recognition Via Synchronization in Phase Locked Loop Neural Networks. IEEE Trans. Neural Networks 11(3) (2000) 734-738 5. Zheleznyak, A. and Chua, L.O.: Coexistence of Low- and High-Dimensional SpatioTemporal Chaos in a Chain of Dissipatively Coupled Chua’s Circuits. Int. J. Bifur. Chaos 4(3) (1994) 639-674 6. Wu, C.W. and Chua, L.O.: Synchronization in an Array of Linearly Coupled Dynamical Systems. IEEE Trans. Circuits Syst.-I 42(8) (1995) 430-447 7. Chen, G.R., Zhou, J. and Liu, Z.R.: Global Synchronization of Coupled Delayed Neural Networks and Applications to Chaotic Models. Int. J. Bifur. Chaos 14(7) (2004) 2229-2240 8. Lu, W.L. and Chen, T.P.: Synchronization of Coupled Connected Neural Networks with Delays. IEEE Trans. Circuits Syst.-I 51(12) (2004) 2491-2503 9. Cao, J., Li, P. and Wang, W.W.: Global Synchronization in Arrays of Delayed Neural Networks with Constant and Delayed Coupling. Phys. Lett. A 353 (2006) 318-325 10. Li, Z. and Chen, G.R.: Global Synchronization and Asymptotic Stability of Complex Dynamical Networks. IEEE Trans. Circuits Syst.-II 53(1) (2006) 28-33
11. Chen, J.L. and Chen, X.H.: Special Matrices, Tsinghua University press, China, 2001 12. Cao, J. and Lu, J.: Adaptive Synchronization of Neural Networks with or without Time-Varying Delays. Chaos 16 (2006) art. no. 013133 13. Huang, X. and Cao, J.: Generalized Synchronization for Delayed Chaotic Neural Networks: a Novel Coupling Scheme. Nonlinearity 19(12) (2006) 2797-2811
Self-synchronization Blind Audio Watermarking Based on Feature Extraction and Subsampling Xiaohong Ma, Bo Zhang, and Xiaoyan Ding School of Electronic and Information Engineering, Dalian University of Technology, Dalian 116023, China
[email protected]
Abstract. A novel embedding watermark signal generation scheme based on feature extraction is proposed in this paper. The original binary watermark image is divided into two blocks with the same size and each block is changed into one dimension sequences. After that, Independent Component Analysis (ICA) is used to extract the independent features of them, which are regarded as two embedding watermark signals. In the embedding procedure, the embedding watermark signals are embedded in some selected wavelet coefficients of the subaudios obtained by subsampling. And Self-synchronization is implemented by applying special peak point extraction scheme. The blind extraction procedure is basically the converse procedure of the embedding one. And the original watermark image can be recovered with the help of the mixing matrix of the ICA. Experimental results show the validity of this scheme.
1   Introduction
Recent growth in the distribution of digital multimedia data over networks and internet has caused authentication and copyright problems. Digital watermarking is proposed as an effective solution to these problems. The most important properties of digital watermarking are robustness and imperceptibility [1]. To achieve them, the watermark is usually embedded in the transformed domain. As Discrete Wavelet Transform (DWT) can reflect both time and frequency properties, lots of watermarking algorithms are based on DWT [2], [3]. Synchronization attack is a serious problem to any audio watermarking scheme. Audio processing such as random cropping causes displacement between embedding and detected signals in the time domain, and therefore it is difficult for the watermark to survive [4]. In [5], the authors proposed a synchronization scheme based on peak point extraction. The scheme proposed in this paper has made some improvements on it. It can make the search of synchronization points more accurate without adding extra information to the original audio signal. As a kind of blind source separation (BSS) algorithm, ICA has received much attention because of its potential applications in signal processing. In many audio watermark embedding schemes, it is used to separate watermark and audio signals [1], [6], [7]. In digital image watermark schemes, the usage of ICA can obtain independent feature components of an image for watermark embedding to improve robustness [8]. D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 40–46, 2007. c Springer-Verlag Berlin Heidelberg 2007
A novel embedding watermark generation method based on ICA is proposed in this paper. It is employed to extract the independent feature components corresponding to the watermark image to generate the embedding watermark signals. The mixing matrix obtained during the ICA is kept as secret key. In the embedding procedure, a new method called subsampling [9] is utilized and a synchronization scheme called peak point extraction is utilized to resist cropping attack. The original audio signal is not required during watermark extraction.
2   Watermark Embedding
The block diagram of watermark embedding is shown in Fig. 1. There are three main steps, the embedding watermark signals generation, which is enclosed with dashed line, the synchronization point extraction and the watermark embedding.
Fig. 1. The block diagram of watermark embedding
2.1   Embedding Watermark Signal Generation
In this paper, to ensure the security of the scheme, FastICA method [10] is applied to extract the two independent features corresponding to the original watermark image as two embedding watermark signals. An image can be considered as a mixture of several independent features. In this paper, the watermark image is taken as a two-feature-image combination, and FastICA is employed to extract these two independent feature components to generate the embedding watermark signals. The watermark image is divided into two subblocks and each subblock is changed to a vector of one dimension as an observation signal of FastICA. After this process, two feature components and two matrices can be obtained. The generation process can make the watermark scheme be much securer. The original watermark W is a binary image with the size of 32 × 32. It is divided into 2 subblocks of 16 × 32 and resized into two vectors d1 and d2 . And then, FastICA method is applied to them to obtain two feature components v1 and v2 , v1 = {v1 (i), i = 1, 2, · · · , 512}, v2 = {v2 (i), i = 1, 2, · · · , 512}. The
mixing matrix is kept as secret key key1, which can recover the watermark image by multiplying it with the extracted feature signals in the watermark extraction scheme. There are altogether two possible element values in v1 and four possible element values in v2. The elements in v2 are selected to form two groups t1 and t2, denoted as t1 = {t1(i), i = 1, 2, · · · , S}, t2 = {t2(i), i = 1, 2, · · · , 512 − S}. The elements of t1 have the same absolute value, and so do the elements of t2. The positions of t1 in v2 and the absolute values of v1, t1 and t2 are kept as secret key key2 for the extraction procedure. v1, t1 and t2 are quantized as follows:

  w1(i) = 1 if v1(i) > 0,   w1(i) = −1 if v1(i) < 0,    (1)
  tk(i) = 1 if tk(i) > 0,   tk(i) = −1 if tk(i) < 0,   k = 1, 2.    (2)

The combination of t1 and t2, which can be written as w2 = [t1, t2], and w1, denoted as w1 = {w1(i), i = 1, 2, · · · , 512}, are the two embedding watermark signals.
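As an illustration of the generation step only (the exact FastICA settings of the paper are not given), the sketch below applies scikit-learn's FastICA to a stand-in 32 × 32 binary image; the variable names and the random image are assumptions.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
W_img = (rng.random((32, 32)) > 0.5).astype(float)       # stand-in binary watermark image

d1, d2 = W_img[:16].reshape(-1), W_img[16:].reshape(-1)   # two 16x32 sub-blocks as 512-point vectors
X = np.column_stack([d1, d2])                             # FastICA observations, shape (512, 2)

ica = FastICA(n_components=2, random_state=0)
V = ica.fit_transform(X)                                  # independent feature components v1, v2
key1 = ica.mixing_                                        # mixing matrix kept as secret key key1

w1 = np.sign(V[:, 0])                                     # quantization of v1, Eq. (1)
# v2 is split into the groups t1, t2 by absolute value and quantized as in Eq. (2)
```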
2.2   Synchronization Point Extraction
Synchronization is a significant issue in digital audio watermarking because attacks such as cropping are very destructive. Therefore, many synchronization schemes have been proposed to resist various attacks along the time axis. In [11], a Barker code is embedded into the original audio signal to indicate the location of the watermark, but extra information embedded in the audio signal may distort the original audio signal and draw the attention of attackers; moreover, the search for the synchronization code is time consuming. Another solution is called self-synchronization; in this kind of scheme, feature points or regions of the signal itself are exploited. In [6], a feature of the audio signal is utilized to implement synchronization, but the synchronization points are not prominent and are difficult to locate. In this scheme, the power of the original signal is specially shaped by raising the sample values to a high power, as in the following equation:

  x'(n) = x⁴(n),    (3)

where x(n) is the original audio signal and x'(n) is the signal after the special shaping. The power of 4 is chosen for the convenience of identifying the outstanding peaks. This process amplifies the energy differences between the peak regions and the low-energy regions. The special regions are identified by comparing with a threshold th, which is set to 20% of the sample value of the highest peak after special shaping. Samples whose values are higher than the threshold are extracted as peak points. The peak points usually appear in groups consisting of many samples. If the number of consecutive peak points in a group is equal to or greater than N, this group is chosen for embedding. In [6], the last point of the group is taken
Self-synchronization Blind Audio Watermarking
43
as a synchronization point. In this scheme, the largest point in the group is taken as the synchronization point because it is more prominent within the group. The selection of N is based on practical experiments and varies among different audio signals. In this scheme, to improve security and robustness, two synchronization points are selected and the watermark signals are embedded twice.
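A compact NumPy sketch of the peak-point rule described above is given below; the threshold ratio and the handling of a run that touches the end of the signal are assumptions.

```python
import numpy as np

def sync_points(x, N=10, ratio=0.2):
    """Return synchronization points: the largest sample of every group of >= N peak points."""
    p = x.astype(float) ** 4                # special shaping, Eq. (3)
    mask = p > ratio * p.max()              # threshold at 20% of the highest shaped peak
    points, start = [], None
    for n, m in enumerate(mask):
        if m and start is None:
            start = n
        elif not m and start is not None:
            if n - start >= N:
                points.append(start + int(np.argmax(p[start:n])))
            start = None
    if start is not None and len(mask) - start >= N:
        points.append(start + int(np.argmax(p[start:])))
    return points
```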
2.3   Embedding
L points of original audio signal after the synchronization point are selected as the watermark embedding segment Audio, L = 4k, k = 1, · · · , M . It can be subsampled as follows: Ai (k) = Audio(4k − 4 + i) , k = 1, 2, · · · L/4 , i = 1, 2, 3, 4
(4)
where A1, A2, A3, A4 are four similar subaudios. To ensure robustness, a 3-level DWT is applied to these signals. Their approximation coefficients are rearranged in descending order and then checked against Eq. (5) and Eq. (6) to see whether they satisfy the embedding condition:

  Vj = (V1j + V2j) / 2,    (5)
  |V1j − V2j| < 2a·Vj,    (6)

where V1j and V2j are the rearranged approximation coefficients of A1 and A2, and a is a positive constant. 512 coefficients that satisfy Eq. (6) are picked out for embedding and denoted as V1i and V2i; at the same time, the embedding positions are kept as secret key key3. The watermark signal w1 is embedded according to Eq. (7):

  V1i = Vi (1 + a·w1(i)),   V2i = Vi (1 − a·w1(i)),   i = 1, 2, · · · , 512.    (7)
The selection of a is a tradeoff between audio distortion and detection accuracy. Because the four subaudios are similar, their approximation coefficients are similar too, so the embedding positions of w2 are the same as those of w1, and the embedding procedure of w2 is exactly the same as that of w1. To resist cropping attacks, w1 and w2 are each embedded twice. Finally, the IDWT is applied to the modified coefficients together with the unmodified ones to obtain the watermarked audio signal.
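The sketch below illustrates Eqs. (4)-(7) for one segment and the pair (A1, A2) using PyWavelets. The wavelet name, the omission of the descending-order rearrangement, and the use of |Vj| in the selection test are simplifying assumptions, not choices stated in the paper.

```python
import numpy as np
import pywt

def embed_w1(segment, w1, a=0.1, wavelet="db4"):
    """Embed the bipolar signal w1 into subaudios A1, A2 of one L-point segment."""
    subs = [segment[i::4].astype(float) for i in range(4)]        # subsampling, Eq. (4)
    coeffs = [pywt.wavedec(s, wavelet, level=3) for s in subs]    # 3-level DWT per subaudio
    cA1, cA2 = coeffs[0][0], coeffs[1][0]                         # approximation coefficients
    V = 0.5 * (cA1 + cA2)                                         # Eq. (5)
    pos = np.where(np.abs(cA1 - cA2) < 2 * a * np.abs(V))[0][: len(w1)]  # Eq. (6) -> key3
    cA1[pos] = V[pos] * (1 + a * w1[: len(pos)])                  # Eq. (7)
    cA2[pos] = V[pos] * (1 - a * w1[: len(pos)])
    out = segment.astype(float).copy()
    for i in range(4):
        out[i::4] = pywt.waverec(coeffs[i], wavelet)[: len(subs[i])]  # IDWT back into place
    return out, pos                                               # pos plays the role of key3

# w2 is embedded at the same positions in the same way; both signals are embedded twice.
```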
3   Watermark Extraction
The block diagram of watermark extraction is shown in Fig. 2. Just like the watermark embedding procedure, synchronization points are searched and the following L points are selected and subsampled to get four
Fig. 2. The block diagram of watermark extraction
subaudios as described in Fig. 2. A 3-level DWT is applied to each subaudio. According to the secret key key3, the embedding positions in the approximation coefficients are obtained. The extraction of w1 follows Eq. (8). Considering that the watermarked audio signal may have undergone attacks or processing, the selected pairs of approximation coefficients are denoted as U1i and U2i, and the watermark signal ŵ1 = {ŵ1(i), i = 1, 2, · · · , 512} is recovered as follows:

  ŵ1(i) = (1/a) · (U1i − U2i) / (U1i + U2i).    (8)
The extraction of ŵ2 is exactly the same as that of ŵ1. As discussed in the watermark signal generation, an inverse process is needed for ŵ1 and ŵ2: the positive elements in ŵ1 are replaced by the absolute value of v1 kept in key2, and the remaining elements by its negative; the elements of ŵ2 in the positions kept in key2 are replaced by the positive or negative absolute value of t1 depending on their own signs, and the remaining elements of ŵ2 are replaced according to the same rule. After that, the watermark is recovered as

  ww = A · [ŵ1 ; ŵ2],    (9)

where ww is a 2 × 512 matrix and A is the mixing matrix kept as key1. Taking 0 as the threshold, the elements in ww are mapped into {0, 255}. Each row vector is reshaped into a 16 × 32 matrix, and the two matrices are combined into the complete watermark image.
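A minimal sketch of the blind detection rule (8), assuming the coefficient pairs have already been located via key3:

```python
import numpy as np

def extract_bits(U1, U2, a=0.1):
    """Recover the bipolar watermark from selected coefficient pairs, Eq. (8)."""
    w_hat = (U1 - U2) / (a * (U1 + U2))
    return np.sign(w_hat)   # quantize before the key2/key1 post-processing of Eq. (9)
```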
4   Experiment Results
The parameters in our experiment are given as follows: N = 10; a = 0.1; L = 70856. The sampling rate of original audio signal is 44.1 KHz and length is 112080. The original audio signal and the watermarked audio signal are shown in Fig. 3(a) and Fig. 3(b) respectively. There is no visible distortion between them, and it is also true in listening test.
Fig. 3. Original audio signal and watermarked audio signal. (a) original audio signal. (b) watermarked audio signal.
Fig. 4. The original watermark image and extracted watermark images under various conditions. (a) original watermark image. (b) extracted watermark image without any attack. (c) MP3 compression. (d) cropping. (e) adding white Gaussian noise (SNR is 25 dB). (f) requantization. (g) resampling (from 44.1 kHz to 88.2 kHz, then back to 44.1 kHz). (h) lowpass filtering.
The original watermark image is shown in Fig. 4(a), and the extracted watermark without any attack is shown in Fig. 4(b). Fig. 4(c)-Fig. 4(h) show the extracted watermarks under various attacks. All the extracted watermarks except Fig. 4(g) and Fig. 4(h) are all very clear. Though under the attack of resampling and filtering, the embedding watermark signal has been degraded, the extracted watermark can still be recognized clearly.
5   Conclusion
A novel watermark signal generation scheme based on feature extraction is proposed in this paper. It makes use of ICA for feature extraction to generate the embedding watermark signals which makes the audio watermark scheme much securer. Watermark signals are embedded in the DWT domain of four subaudios obtained by subsampling. The synchronization scheme can improve the robustness
against cropping attack without introducing additional information and the extraction procedure is completely blind. Experimental results show the excellent imperceptibility and good robustness against various attacks. Acknowledgments. This work was supported by the National Natural Science Foundation of China under Grant No. 60575011 and the Liaoning Province Natural Science Foundation of China under Grant No. 20052181.
References 1. Liu, J., Zhang, X. G., Najar, M., Lagunas, M. A.: A Robust Digital Watermarking Scheme Based on ICA. International Conference on Neural Networks and Signal, Oregon, USA 2 (2003) 1481-1484 2. Vieru, R., Tahboub, R., Constantinescu, C., Lazarescu, V.: New Results Using the Audio Watermarking Based on Wavelet Transform. International Symposium on Signals, Circuits, and Systems, Kobe, Japan 2 (2005) 441-444 3. Cvejic, N., Seppanen, T.: Robust Audio Watermarking in Wavelet Domain Using Frequency Hopping and Patchwork Method. The 3rd International Symposium on Image and Signal Processing and Analysis, Rome, Italy 1 (2003) 251-255 4. Wei Li, Xiangyang Xue, Peizhong Lu.: Localized Audio Watermarking Technique Robust Against Time-Scale Modification. IEEE Transactions on Multimedia 8 (2006) 60-69 5. Foo Say Wei, Xue Feng, Li Mengyuan.: A Blind Audio Watermarking Scheme Using Peak Point Extraction. IEEE International Symposium on Circuits and Systems, Kobe, Japan 5 (2005) 4409-4412 6. Toch, B., Lowe, D., Saad, D.: Watermarking of Audio Signals Using Independent Component Analysis. The Third International Conference WEB Delivering of Music, Leeds, United Kingdom (2003) 71-74 7. Sener, S., Gunsel, B.: Blind Audio Watermark Decoding Using Independent Component Analysis. The 17th International Conference on Pattern Recognition, Cambridge, United Kingdom 2 (2004) 875-878 8. Sun, J., Liu, J.: A Novel Digital Watermark Scheme Based on Image Independent Feature. The 2003 IEEE International Conference on Robotics, Intelligent Systems and Signal Processing, Changsha, China 2 (2003) 1333-1338 9. Chu., Wai C.: DCT-Based Image Watermarking Using Subsampling. IEEE Transactions on Multimedia 5 (1) (2003) 34-38 10. Hyvarinen, A., Oja, E.: A Fast Fixed-point Algorithm for Independent Component Analysis. Neural Computation 9 (7) (1997) 1483-1492 11. Huang, J., Wang, Y., Shi, Y.: A Blind Audio Watermarking Algorithm with Selfsynchronization. IEEE International Symposium on Circuits and Systems, Arizona, USA 3 (2002) 627-630
An Improved Extremum Seeking Algorithm Based on the Chaotic Annealing Recurrent Neural Network and Its Application* Yun-an Hu, Bin Zuo, and Jing Li Department of Control Engineering, Naval Aeronautical Engineering Academy Yantai 264001, China
[email protected],
[email protected]
Abstract. The application of sinusoidal periodic search signals into the general extremum seeking algorithm(ESA) results in the “chatter” problem of the output and the switching of the control law and incapability of escaping from the local minima. An improved chaotic annealing recurrent neural network (CARNN) is proposed for ESA to solve those problems in the general ESA and improve the global searching capability. The paper converts ESA into seeking the global extreme point where the slope of Cost Function is zero, and applies a CARNN to finding the global point and stabilizing the plant at that point. ESA combined with CARNN doesn’t make use of search signals such as sinusoidal periodic signals, which solves those problems in previous ESA and improves the dynamic performance of the controlled system greatly. During the process of optimization, chaotic annealing is realized by decaying the amplitude of the chaos noise and the probability of accepting continuously. The process of optimization was divided into two phases: the coarse search based on chaos and the elaborate search based on ARNN. At last, CARNN will stabilize the system to the global extreme point. At the same time, it can be simplified by the proposed method to analyze the stability of ESA. The simulation results of a simplified UAV tight formation flight model and a typical Schaffer function validate the advantages mentioned above.
1 Introduction Extremum seeking problem deals with the problem of minimizing or maximizing a plant over a set of decision variables[1]. Extremum seeking problems represent a class of widespread optimization problems arising in diverse design and planning contexts. Many large-scale and real-time applications, such as traffic routing and bioreactor systems, require solving large-scale extremum seeking problem in real time. In order to solve this class of extremum seeking problems, a novel extremum seeking algorithm was proposed in the 1950’s. Early work on performance improvement by extremum seeking can be found in Tsien. In the 1950s and 1960s, Extremum seeking algorithm was considered as an adaptive control method[2]. Until 1990s sliding mode control for extremum seeking has not been utilized successfully[3]. Subsequently, a method of adding compensator dynamics in ESA was proposed by Krstic, which *
This research was supported by the Natural Science Foundation of P.R.China (No. 60674090).
D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 47–56, 2007. © Springer-Verlag Berlin Heidelberg 2007
48
Y.-a. Hu, B. Zuo, and J. Li
improved the stability of the system[4]. Although those methods improved tremendously the performance of ESA, the “chatter” problem of the output and the switching of the control law and incapability of escaping from the local minima limit the application of ESA. The method of introducing a chaotic annealing recurrent neural network into ESA is proposed in the paper. First, an extremum seeking problem is converted into the process of seeking the global extreme point of the plant where the slope of cost function is zero. Second, an improved CARNN is constructed; then, we can apply the CARNN to finding the global extreme point and stabilizing the plant at that point. The CARNN proposed in the paper doesn’t make use of search signals such as sinusoidal periodic signals, so the method can solve the “chatter” problem of the output and the switching of the control law in the general ESA and improve the dynamic performance of the ESA system. At the same time, CARNN utilizes the randomicity and the property of global searching of chaos system to improve the capability of global searching of the system[5-6], During the process of optimization, chaotic annealing is realized by decaying the amplitude of the chaos noise and the accepting probability continuously. Adjusting the probability of acceptance could influence the rate of convergence. The process of optimization was divided into two phases: the coarse search based on chaos and the elaborate search based on RNN. At last, CARNN will stabilize the system to the global extreme point, which is validated by simulating a simplified UAV tight formation flight model and a typical Schaffer Function. At the same time, it can be simplified by the proposed method to analyze the stability of ESA.
2 Annealing Recurrent Neural Network Descriptions 2.1 Problem Formulation Consider a general nonlinear system:
x = f ( x ( t ) ,u ( t ) )
(1)
y = F ( x (t ))
Where
x ∈ R n ,u ∈ R m and y ∈ R are the states, the system inputs and the system
F ( x ) is also defined as the cost function of the
output, respectively. system.
f ( x,u ) and F ( x ) are smooth functions.
If the nonlinear system(1) is an extremum seeking system, then it must satisfy three assumptions described in [7]. We know that there must be a smooth control law
u ( t ) = α ( x ( t ) ,θ )
to
stabilize
the
nonlinear
system(1),
θ = ⎡⎣θ1,θ2 ,",θi ,",θp ⎤⎦ ( i ∈[12 , ,",p]) is a parameter vector of p
where
T
dimension which
determines a unique equilibrium vector. Then there must also be a smooth function
An Improved Extremum Seeking Algorithm
49
xe : R p → R n such that: f ( x,α ( x,θ ) ) = 0 ↔ x = xe (θ ) . Therefore, the static performance map at the equilibrium point
xe (θ ) from θ to y represented by:
y = F ( xe (θ ) ) = F (θ ) .
(2)
Differentiating (2) with respect to time yields the relation between
∂ (θ ( t ) ) θ ( t ) = y ( t )
θ
and y ( t ) .
,
(3)
⎡ ∂F (θ ) ∂F (θ ) ∂F (θ ) ⎤ T where ∂ (θ ( t ) ) = ⎢ , ," , ⎥ and θ ( t ) = ⎡⎣θ1 ,θ 2 ," ,θ p ⎤⎦ . ∂θ2 ∂θ p ⎦⎥ ⎣⎢ ∂θ1 T
Once the seeking vector
θ
of the extremum seeking system (1) converges to the T
⎡ ∂F (θ ) ∂F (θ ) ∂F (θ ) ⎤ global extreme vector θ , then ∂ (θ ) = ⎢ , ," , ⎥ must also ∂θ2 ∂θ p ⎦⎥ ⎣⎢ ∂θ1 ∗
converge to zero. A CARNN is introduced into ESA in order to minimize ∂ (θ ) in finite time. Certainly the system (1) is subjected to (3). Then, the extremum seeking problem can be written as follows Minimize: Subject to: where
f1 (υ ) = cTυ
p1 (υ ) = Aυ − b = 0 .
(4)
∂T (θ ) denotes the transpose of ∂ (θ ) . υ = ⎡⎣∂ (θ )
⎡ 11× p −sign ( ∂T (θ ) ) 01× p ⎤ ⎢ ⎥ A = ⎢θ T ( t ) 01× p 01× p ⎥ ⎢0 01× p ∂T (θ ) ⎥⎥ ⎢⎣ 1× p ⎦
b = ⎡⎣0 y ( t )
,
∂ (θ )
c = ⎡⎣01× p 11× p
T θ ( t ) ⎤⎦
01× p ⎤⎦
,
T
,
⎧1 x > 0 y ( t ) ⎤⎦ , and sign ( x ) = ⎪⎨0 x = 0 . ⎪ −1 x < 0 ⎩ T
By the dual theory, the dual program corresponding to the program (4) is Maximize: Subject to: where,
ω
f 2 ( ω ) = bT ω
p2 (ω ) = AT ω − c = 0 .
denotes the dual vector of υ ,
(5)
ω T = [ω1 ω2 ω3 ]1×3 .
Therefore, an extremum seeking problem is converted into the programs defined in (4) and (5).
50
Y.-a. Hu, B. Zuo, and J. Li
2.2 Annealing Recurrent Neural Network(ARNN) Design In view of the primal and dual programs (4) and (5), define the following energy function:
E (υ , ω ) = T ( t ) ( f1 (υ ) − f 2 (ω ) ) 2 + p1 (υ ) 2
2 + p2 (ω )
2
2
2.
(6)
Clearly, the energy function (6) is convex and continuously differentiable. The first term in (6) is the squared difference between the objective functions of the programs (4) and (5), respectively. The second and the third terms are for the equality constraints of (4) and (5).
T ( t ) denotes a time-varying annealing parameter.
With the energy function defined in (6), the dynamics for ARNN solving (4) and (5) can be defined by the negative gradient of the energy function as follows:
dσ dt = −μ∇E (σ ) . where, σ
(7)
= (υ ,ω ) , ∇E (σ ) is the gradient of the energy function E (σ ) defined T
in (6), and μ is a positive scalar constant, which is used to scale the convergence rate of annealing recurrent neural network. The dynamical equation (7) of annealing recurrent neural network can be expressed as:
du1 dt = − μ ∂E (υ , ω ) ∂υ = −μ ⎡⎣T ( t ) c ( cTυ − bT ω ) + AT ( Aυ − b ) ⎤⎦ .
(8)
du2 dt = −μ ∂E (υ , ω ) ∂ω = −μ ⎡⎣ −T ( t ) b ( cTυ − bT ω ) + A ( AT ω − c ) ⎤⎦ .
(9)
υ = q ( u1 ) .
(10)
ω = q ( u2 ) .
(11)
where,
q(
) is a sigmoid activation function, υ = q ( u ) = ( b − a ) (1+ e
−u1 ε1
ω = q ( u2 ) = ( b2 − a2 ) (1 + e−u
2
below bound of
υ . a2
ε1 > 0 and ε 2 > 0 .
and
ε2
)+a . 2
1
1
1
)+a
1
and
a1 and b1 denote the upper bound and the
b2 denote the upper bound and the below bound of ω .
~
The annealing recurrent neural network is described as the equations (8) (11), which are determined by the number of decision variables such as (υ ,ω ) , ( u1 ,u2 ) is the column vector of instantaneous net inputs to neurons, (υ ,ω ) is the column output vector of neurons.
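As an illustration only (the construction of A, b and c from the measured gradient information is omitted, and all names and parameter values are placeholders rather than the authors' implementation), one Euler step of the dynamics (8)-(11) can be sketched as follows.

```python
import numpy as np

def arnn_step(u1, u2, A, b, c, T, mu=23.5, dt=1e-3, lo=-0.5, hi=0.5, eps=10.0):
    """One Euler step of the ARNN dynamics (8)-(11); A, b, c come from the current gradient data."""
    q = lambda u: (hi - lo) / (1.0 + np.exp(-u / eps)) + lo  # bounded sigmoid outputs, Eqs. (10)-(11)
    v, w = q(u1), q(u2)
    gap = float(c @ v - b @ w)                               # duality gap f1(v) - f2(w)
    du1 = -mu * (T * c * gap + A.T @ (A @ v - b))            # Eq. (8)
    du2 = -mu * (-T * b * gap + A @ (A.T @ w - c))           # Eq. (9)
    return u1 + dt * du1, u2 + dt * du2

T_sched = lambda t, beta=0.01, alpha=np.e, eta=5.0: beta * alpha ** (-eta * t)  # annealing T(t)
```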
An Improved Extremum Seeking Algorithm
51
3 Convergence Analysis In this section, analytical results on the stability of the proposed annealing recurrent neural network and feasibility and optimality of the steady-state solutions to the programs described in (4) and (5) are presented. Theorem 1. Assume that the Jacobian matrices J ⎡⎣ q ( u1 ) ⎤⎦ and J ⎡⎣ q ( u2 ) ⎤⎦ exist and are positive semidefinite. If the temperature parameter T ( t ) is nonnegative, strictly monotone decreasing for t ≥ 0 , and approaches zero as time approaches infinity, then the annealing recurrent neural network (8) (11) is asymptotically stable.
~
Proof: Consider the following Lyapunov function:
L = E (υ , ω ) = T ( t ) ( f1 (υ ) − f 2 (ω ) ) 2 + p1 (υ ) 2
Apparently,
2
2 + p2 (ω )
2
2.
(12)
L ( t ) > 0 . The differentiation of L along time trajectory of (12) is
as follows:
∂f (υ ) ∂p (υ ) ⎤ dυ . dL ⎡ = ⎢T ( t ) ⋅ 1 ⋅ ( f1 (υ ) − f 2 (ω ) ) + 1 ⋅ p1 (υ ) ⎥ ⋅ dt ⎣ ∂υ ∂υ ⎦ dt
⎡ ∂f (ω) ∂p (ω) ⎤ dω 1 dT ( t) 2 +⎢−T ( t ) ⋅ 2 ⋅( f1 (υ) − f2 (ω) ) + 2 ⋅ p2 (ω) ⎥⋅ + f1 (υ) − f2 (ω) ) . ( ∂ω ∂ω ⎣ ⎦ dt 2 dt
(13)
According to the equations (8) and (9), and the following equations dυ dt = J ⎡⎣ q ( u1 ) ⎤⎦ ⋅ du1 dt and d ω dt = J ⎡⎣ q ( u2 ) ⎤⎦ ⋅ du2 dt . We can have: 2 dL 1 du du 1 du du 1 dT ( t ) = − ⋅ 1 ⋅ J ⎡⎣q ( u1 )⎤⎦ ⋅ 1 − ⋅ 2 ⋅ J ⎡⎣q ( u2 )⎤⎦ ⋅ 2 + f1 (υ) − f2 (ω) ) ( dt μ dt dt μ dt dt 2 dt
(14)
We know that the Jacobian matrices of J ⎡⎣ q ( u1 ) ⎦⎤ and J ⎣⎡ q ( u2 ) ⎦⎤ both exist and are positive semidefinite and μ is a positive scalar constant. If the time-varying annealing parameter T ( t ) is nonnegative, strictly monotone decreasing for t ≥ 0 , and
approaches zero as time approaches infinity, then dL dt is negative definite. Because
T ( t ) represents the annealing effect, the simple examples of T ( t ) can described by −η T ( t ) = βα −η t or T ( t ) = β (1 + t ) , where α > 1 , β > 0 and η > 0 are constant
parameters. Parameters
β
and η can be used to scale the annealing parameter.
Because L ( t ) is positive definite and radially unbounded, and dL dt is negative definite. According to the Lyapunov’s theorem, the designed annealing recurrent neural network is asymptotically stable.
52
Y.-a. Hu, B. Zuo, and J. Li
Theorem 2. Assume that the Jacobian matrices J ⎡⎣ q ( u1 ) ⎤⎦ and J ⎡⎣ q ( u2 ) ⎤⎦ exist and are positive semidefinite. If T ( t ) ≥ 0 , dT ( t ) dt < 0 and lim T ( t ) = 0 , then the t →∞
steady state of the annealing neural network represents a feasible solution to the programs described in equations (4) and (5). Proof: The proof of Theorem 1 shows that the energy function E (υ , ω ) is positive definite and strictly monotone decreasing with respect to time lim E (υ , ω , T ( t ) ) = 0 . Because lim T ( t ) = 0 , then we have t →∞
t →∞
(
lim E (υ , ω , T ( t ) ) = lim p1 (υ ( t ) ) t →∞
t →∞
p1 (υ ( t ) )
Because
(
lim p1 (υ ( t ) ) t →∞
= p1 (υ )
υ
t , which implies
and
ω
2
2
2
2
(
) (
p2 ( ω ( t ) )
2 = p1 limυ ( t ) t →∞
)
2
)
2 =0 are
2
(15) continuous,
(
2 + p2 lim ω ( t ) t →∞
)
2
2
2 = 0 , so we have p1 (υ ) = 0 and p2 (ω ) = 0 , where
are the stable solutions of
Now, Let F1(υ) =⎡ f1(υ) ⎣
2 + p2 (ω ( t ) )
and
2 + p2 (ω ( t ) )
2 + p2 (ω )
2
υ
and
ω.
) ( f (υ)) ( f (υ))⎤⎦
T
1
1
(
and F2 (ω) =⎡ f2 (ω) ⎣
) ( f (ω)) ( f (ω))⎤⎦
T
2
2
be the augmented vector. Theorem 3. Assume that the Jacobian matrices J ⎡⎣ q ( u1 ) ⎤⎦ ≠ 0 and J ⎡⎣ q ( u2 ) ⎤⎦ ≠ 0 and are positive semidefinite, ∀t ≥ 0 , and ∇ ( f1 (υ ) ) ≠ 0 and ∇ f 2 (ω ) ≠ 0 . If
dT ( t ) dt < 0 , lim T ( t ) = 0 and
(
)
t →∞
⎧ ⎛ ∂p1(υ) ∂p (υ) ⎞ T T p1(υ) −∇F1 ⎡⎣υ( t)⎤⎦ J⎡⎣q( u1)⎤⎦ 1 p1(υ) ⎟ ⎪ ⎜∇p1 ⎡⎣υ( t)⎤⎦ J⎡⎣q( u1)⎤⎦ ∂υ ∂υ ⎪ ⎝ ⎠ T( t) ≥max⎨0, , ⎛ ∂ f υ ∂ f υ T T ⎪ ∇F⎡υ( t)⎤ J⎡q( u )⎤ 1( ) ( f (υ) − f (ω) ) −∇p ⎡υ( t)⎤ J⎡q( u )⎤ 1( ) ( f (υ) − f (ω) ) ⎞ ⎟ 2 1⎣ 2 ⎦ ⎣ 1 ⎦ ∂υ 1 ⎪ ⎜⎝ 1 ⎣ ⎦ ⎣ 1 ⎦ ∂υ 1 ⎠ ⎩ ⎫ ∂p2 (ω) ∂p (ω) ⎛ ⎞ T T p2 (ω) −∇F2 ⎡⎣ω( t) ⎤⎦ J ⎡⎣q( u2) ⎤⎦ 2 p2 (ω) ⎟ ⎪ ⎜∇p2 ⎡⎣ω( t)⎤⎦ J ⎡⎣q( u2 ) ⎤⎦ ∂ω ∂ω ⎪ (16) ⎝ ⎠ ⎬ ⎛ ∂f2 (ω) ∂f2 (ω) ⎞⎪ T T ∇ F ⎡ ω t ⎤ J ⎡ q u ⎤ f υ − f ω −∇ p ⎡ ω t ⎤ J ⎡ q u ⎤ f υ − f ω ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( 1 2 ) 2 ⎣ ⎦ ⎣ 2 ⎦ ∂ω ( 1 2 ) ⎟⎪ ⎜ 2⎣ ⎦ ⎣ 2 ⎦ ∂ω ⎝ ⎠⎭
An Improved Extremum Seeking Algorithm
then the steady states
υ
and
ω
53
of the annealing neural network represents the
optimal solutions υ and ω to the programs described in equations (4) and (5). Because of the length restriction, we omit the proof of Theorem 3. ∗
∗
4 A Chaotic Annealing Recurrent Neural Network Descriptions In order to improve the global searching performance of the designed annealing recurrent neural network, we introduce chaotic factors into the designed neural network. Therefore, the structure of a chaotic annealing recurrent neural network is described as follows. du1 dt
du2 dt
T t
f1
f1 f1
T t
Tt
f2
f1 Tt
p1
f2 f1
f2
f1
f2
1
b1 a1
1
p2
2
p2
b2 a2
2
ω = q ( u2 ) = ( b2 − a2 ) (1 + e −u
ε1
)+a .
random P2 t
(19)
1
2
ε2
)+a .
(20)
2
ηi ( t + 1) = (1 − κ )ηi ( t ) i = 1, 2 . Pi t
a2
(18)
p2 1
Pi t 0
random P1 t
p1
υ = q ( u1 ) = ( b1 − a1 ) (1 + e −u
Pi t 1
a1
(17)
p1
f2
p2
f2
p1
(21)
0 .
(22) .
χ i ( t + 1) = γχi ( t ) (1 − χi ( t ) ) .
(23)
where γ = 4 , Pi ( 0 ) > 0 , 0 < κ < 1 , 0 < δ < 1 , η i ( 0 ) > 0 , ε1 > 0 and ε 2 > 0 . We know that equation (23) is a Logistic map, when γ = 4 , the chaos phenomenon will happen in the system. As time approaches infinity, the chaotic annealing recurrent neural network will evolve into the annealing recurrent neural network (8) (11). Therefore, we must not repeatedly analyze the stability and solution feasibility and solution optimality of the chaotic annealing recurrent neural network (17) (23).
~
~
54
Y.-a. Hu, B. Zuo, and J. Li
5 Simulation Analysis
ⅰ
( ) A Simplified Tight Formation Flight Model Simulation Consider a simplified tight formation flight model consisting of two Unmanned Aerial Vehicles tested in reference [8]. The cost function of the tight formation flight model is given by
y ( t ) = −10 ( x1 ( t ) + 0) − 5( x3 ( t ) + 9) + 590 . 2
2
(24)
Clearly, if the states of the model are x1∗ = 0 and x3∗ = −9 , then the cost function
y ( t ) will reach its maximum y ∗ = 590 .
The initial conditions of the model are given as x1 ( 0 ) = − 2 , x 2 ( 0 ) = 0 ,
x 3 ( 0 ) = − 4 , x 4 ( 0 ) = 0 , θ 1 ( 0 ) = − 2 , θ 2 ( 0 ) = − 4 . Choose T ( t ) = β α −η t , where β = 0 .01 , α = e , η = 5 . Applying CARNN to the model described in reference [8], the parameters are given as: μ = 23.5 , γ = 4 , P1 ( 0 ) = P2 ( 0 ) = 1 , κ = 0.01 , δ = 0.01 , ε 1 = 10 , ε 2 = 10 , χ1 ( 0 ) = 0.912 ,
χ2 ( 0) = 0.551 , η1 ( 0) = [ −10 −1 5]T , η2 ( 0) = [3 10 5]T , b1 = b2 = 0.5 ,
a1 = a2 = −0.5 .
The simulation results are shown from figure 1 to figure 3. In those simulation results, solid lines are the results applying CARNN to ESA; dash lines are the results applying ESA with sliding mode[9]. Comparing those simulation results, we know the dynamic performance of the method proposed in the paper is superior to that of ESA with sliding mode. By figure 1 and figure 2, the “chatter” phenomenon disappears in the CARNN’s output, which is very harmful in practice. Moreover the convergence rate of ESA with CARNN can be scaled by adjusting the annealing parameter T ( t ) .
x1
x3
Learning iterative times
Learning iterative times
n
Fig. 1. The simulation result of the state
x1
n
Fig. 2. The simulation result of the state
x3
An Improved Extremum Seeking Algorithm
55
ⅱ
( ) Schaffer Function Simulation In order to exhibit the capability of global searching of the proposed CARNN, the typical Schaffer function (25) is defined as the testing function[10].
f ( x1 , x2 ) =
sin 2
x12 + x22 − 0.5
(1 + 0.001( x
2 1
+x
2 2
))
2
− 0.5, xi ≤ 10, i = 1, 2 .
(25)
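The garbled display above is read here as the usual Schaffer-type test function, which is consistent with the stated global minimum f(0, 0) = −1; the small check below encodes that reading and is not part of the original paper.

```python
import numpy as np

def schaffer(x1, x2):
    """Schaffer test function (25); global minimum f(0, 0) = -1, with |x_i| <= 10."""
    r2 = x1 ** 2 + x2 ** 2
    return (np.sin(np.sqrt(r2)) ** 2 - 0.5) / (1 + 0.001 * r2) ** 2 - 0.5

assert abs(schaffer(0.0, 0.0) + 1.0) < 1e-12
```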
When x1 = x2 = 0 , the schaffer function f ( x1 , x2 ) will obtain the global
minimum f ( 0, 0 ) = − 1 . However, there are numerous local minimums and
maximums among the range of 3.14 away from the global minimum. Now, we define θ1 = x1 and θ 2 = x 2 . The values of CARNN’s parameters are same with those in subsection 5.1 except for μ = 35 , η1 ( 0) = [ −200 −20 50] and T
η2 ( 0) = [100 300 50] . The simulation condition T
Ⅰ:
the initial conditions of the
function (25) are given as x1 ( 0 ) = − 2 and x2 ( 0 ) = 3.5 ; the simulation condition
Ⅱ: the initial conditions are given as x ( 0 ) = − 1 and x 1
2
( 0 ) = 9.5 . The simulation
y
Learning iterative times
n
Fig. 3. The simulation result of the output
Fig. 5. The simulation result of
x1
y
Fig. 4. The simulation result of f ( x1 , x2 )
Fig. 6. The simulation result of
x2
56
Y.-a. Hu, B. Zuo, and J. Li
results are shown as from figure 4 to figure 6, where the dash-dot lines are the results of the simulation condition , and the solid lines are the results of the simulation condition . We have accomplished a great deal of simulations in different initial conditions. The ESA based on the chaotic annealing recurrent neural network can find the global minimum of Schaffer function in every different simulation.
Ⅱ
Ⅰ
6 Conclusions The method of introducing CARNN into ESA greatly improves the dynamic performance and the global searching capability of the system. Two phases of the coarse search based on chaos and the elaborate search based on ARNN ensure that the system could fully carry out the chaos searching and find the global extremum point and accordingly converge to that point. At the same time, the disappearance of the “chatter” of the system output and the switching of the control law are beneficial to engineering applications.
References 1. Natalia I. M.: Applications of the Adaptive Extremum Seeking Control Techniques to Bioreactor Systems. A dissertation for the degree of Master of Science. Ontario: Queen’s University, (2003). 2. Blackman, B.F.: Extremum-seeking Regulators. An Exposition of Adaptive Control, New York: Macmillan (1962) 36-50 3. Drakunov, S., Ozguner, U., Dix, P., Ashrafi, B.: ABS Control Using Optimum Search via Sliding Mode., IEEE Transactions on Control Systems Technology 3 (1995) 79-85 4. Krstic, M.: Toward Faster Adaptation in Extremum Seeking Control. Proc. of the 1999 IEEE Conference on Decision and Control, Phoenix. AZ (1999) 4766-4771 5. Tan, Y., Wang, B.Y., He, Z.Y.: Neural Networks with Transient Chaos and Time-variant gain and Its Application to Optimization Computations. ACTA ELECTRONICA SINICA. 26 (1998) 123-127 6. Wang, L., Zheng, D.Z.: A Kind of Chaotic Neural Network Optimization Algorithm Based on Annealing Strategy. Control Theory and Applications 17 (2000) 139-142 7. Hu, Y.A., Zuo, B.: An Annealing Recurrent Neural Network for Extremum Seeking Control. International Journal of Information Technology 11 (2005) 45-52 8. Zuo, B., Hu, Y.A.: Optimizing UAV Close Formation Flight via Extremum Seeking. WCICA2004 4 (2004) 3302-3305 9. Pan, Y., Ozguner, U., Acarman, T.: Stability and Performance Improvement of Extremum Seeking Control with Sliding Mode. Control. Vol. 76 (2003) 968-985. 10. Wang, L..: Intelligent Optimization Algorithms with Application. Beijing: Tsinghua University Press (2004)
Solving the Delay Constrained Multicast Routing Problem Using the Transiently Chaotic Neural Network Wen Liu and Lipo Wang College of Information Engineering, Xiangtan University, Xiangtan, Hunan, China School of Electrical and Electronic Engineering, Nanyang Technology University, Block S1, 50 Nanyang Avenue, Singapore 639798 {liuw0004,elpwang}@ntu.edu.sg Abstract. Delay constrained multicast routing (DCMR) aims to construct a minimum-cost tree with end-to-end delay constraints. This routing problem is becoming more and more important to multimedia applications which are delay-sensitive and require real time communications. We solve the DCMR problem by the transiently chaotic neural network (TCNN) of Chen and Aihara. Simulation results show that the TCNN is more capable of reaching global optima compared with the Hopfield neural network (HNN).
1
Introduction
There are two types of multimedia delivery: real-time file streaming and nonreal-time downloads. As for the real-time communication, its applications usually have various quality of service (QoS) requirements, such as bandwidth limit, cost minimization, and delay constraint. The QoS constrained routing problem covers a wide area, e.g., point-to-point and group-to-group routing, with different endto-end QoS requirements [1, 2]. In this paper we focus on delay constrained multicast routing (DCMR) problem, which is also called the constrained Steiner tree (CST) problem. Multicast routing [3, 4] covers the delivery service that can not be accomplished by broadcast and point-to-point delivery. The multicast routing functionality including three parts: the management of group membership, the construction of data delivery route, and the information replication at the interior node. Our work is on the second part: construct a delay constrained minimal cost tree with the transiently chaotic neural network (TCNN) [5]. The neural network is applied to the routing problem for the powerful parallel computational ability of the neural network [6]. Rauch and Winarske use neural networks for the shortest path problem [7]. A modified version of the Hopfield neural network for the delay constrained multicast routing was proposed in [8]. The model is capable to find the solution for an 8-node network, but for large scale communication networks, this HNN model may be easily trapped at local minima. To overcome this limitation of HNNs, Nozawa [9] proposed a chaotic D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 57–62, 2007. c Springer-Verlag Berlin Heidelberg 2007
58
W. Liu and L.P. Wang
neural network (CNN) by adding negative self-feedbacks into HNNs. Chen and Aihara [5] further developed the CNN and presented a neural network with transient chaos, namely, the transiently chaotic neural network (TCNN). Since the chaos is able to improve the ability of the neural network model to reach global optima, this transiently chaotic neurodynamics makes the TCNN a promising tool for the combinatorial optimization problem. Hence we use TCNNs here to solve the DCMR problem for the powerful searching ability of this transiently chaotic model. This paper is organized as follows. We introduce the delay constrained multicast routing problem in Section 2. The transiently chaotic neural network is reviewed in Section 3. Simulation results are presented and discussed in Section 4. Finally, we conclude this paper in Section 5.
2 2.1
The Delay Constrained Multicast Routing Problem Problem Formulation
Based on the formulation presented in [10], an n-node D destinations communication network is formulated on D-n × n matrices, where matrix m is used to compute the constrained unicast route to destination dm , (m = 1, · · · , D). Each element in one matrix is treated as a neuron and neuron mxi describes the link from node x to node i for destination dm in the communication network. Pxi characterizes the connection status of the communication network: Pxi = 1 (m) if the link from node x to node i does not exist; Otherwise Pxi = 0. Vxi is the (m) output of the neuron at location (x, i) in matrix m, Vxi = 1 implies the link from node x to node i is on the final optimal tree for destination m; Otherwise (m) Vxi = 0. Cxi and Lxi denote the cost and delay of a link from node x to node i, respectively, which are assumed to be real non-negative numbers [8]. For nonexisting arcs, Cxi = Lxi = 0. Costs and delays are assumed to be independent. E.g., costs could be a measure of channel utilization, and the delay could be a combination of propagation, transmission, and queuing delay. 2.2
Problem Definition
The delay constrained multicast routing problem is defined to construct a tree rooted at the source s and spanning to all the destination members of D = {d1 , d2 , . . . , dm } such that not only the total cost of the tree is minimum but also the delay from the source to each destination is not greater than the ren n (m) quired delay constraint, i.e., x=1 i=1,i=x Lxi Vxi ≤ Δ, where Δ is the delay (m)
bound. Vxi ∈ {0, 1} denotes the neuron output of constrained unicast route for destination dm . 2.3
The Energy Function
Pornavalai et al. [8] proposed the energy function for the delay constrained multicast routing problem. We change the neuron update rule by using the mean value of
Solving the DCMR Problem Using the TCNN
59
neuron outputs as the threshold to fire the neuron. In the original energy function, n n (m) (m) the outputs are forced to be 0 or 1 by an energy term x=1 i=1 Vxi (1 − Vxi ). The total energy function E for the delay constrained multicast routing is the sum of energyfunctions of delay constrained unicast routing to every desN tination [8]: E = m=1,m∈D E (m) . Where E (m) is used to find the constrained unicast route from source node s to destination dm : n n (m) (m) (m) E (m) = μ1 [ Cxi fxi (V )Vxi ] + μ2 (1 − Vms ) x=1 i=1,i =x
+ μ3
⎧ n ⎨ n
x=1
+ μ5
⎩
(m)
Vxi
−
i=1,i =x
n
(m)
Vix
i=1,i =x
⎫2 ⎬ + μ4
⎭
n n
(m)
Pxi Vxi
x=1 i=1,i =x
h(z)dz
(1)
where, m fxi (V ) =
1+ 0, h(z) = z,
n
1
(2)
(j)
j=1,j =m
Vxi
if z ≤ 0; otherwise.
(3)
μ1 term is the total cost of the unicast route for destination dm . The function (m) fxi (V ) reduces the cost when unicast routes for different destinations choose the same link. μ2 term creates a virtual link from destination dm to source s, which is used to satisfy the constraint state in μ3 term. μ3 term ensures that for every node, the number of incoming links is equal to the number of outgoing links. μ4 term penalizes neurons that represent non-existing links. μ5 term is used (m) to satisfy the delay constraint, with z = nx=1 ni=1,i=x LxiVxi − Δ. Thus the μ5 term contributes positively only when the delay constraint is violated [10].
3
Transiently Chaotic Neural Networks
Chen and Aihara proposed a transiently chaotic neural network (TCNN) [5] as follows: (m)
(m)
Uxi (t + 1) = kUxi (t) +
N N
(m)
wyj,xi Vyj (t)
y=1 j=1,j =y (m)
(m)
Vxi
+Ixi − zxi (t)[Vxi (t)) − I0 ] 1 (m) (m) = fxi (Uxi ) = (m) (m) −U 1 + e xi /xi
where, −
∂E (m) ∂Vxi
=
N N y=1 j=1,j =y
(m)
wyj,xi Vyj (t) + Ixi
(4) (5)
60
W. Liu and L.P. Wang
zxi (t + 1) = (1 − β)zxi (t) zxi (t) = self-feedback neuronal connection weight (zxi (t) ≥ 0). (m)
(m)
Uxi and Vxi are internal state and output of neuron (x, i) in matrix m, respectively. k is the damping factor of the nerve membrane (0 ≤ k ≤ 1), Ixi is the input bias of neuron (x, i), I0 is the positive bias, β is the damping factor for (m) the time-dependent neuronal self coupling (0 ≤ β ≤ 1)., and xi is the steepness parameter of the neuron activity function ( ≥ 0).
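A minimal sketch of one synchronous update of Eqs. (4)-(6) is given below; the connection weights W and biases I are assumed to have been derived from the energy function (1), and the parameter values are placeholders only.

```python
import numpy as np

def tcnn_step(U, z, W, I, k=0.9, I0=0.65, eps=0.004, beta=0.001):
    """One synchronous TCNN update over a flattened neuron vector."""
    V = 1.0 / (1.0 + np.exp(-U / eps))            # neuron outputs, Eq. (5)
    U_next = k * U + W @ V + I - z * (V - I0)     # internal states, Eq. (4)
    z_next = (1.0 - beta) * z                     # decaying self-feedback (transient chaos)
    return U_next, V, z_next
```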
4
Simulation Results
We implement the algorithm in VC++. The end of an iteration is determined by the change in the energy function between two steps: ΔE = E(t) − E(t − 1). The iterations stop when ΔE is smaller than a threshold (0.002) in three continuous steps. The communication network used in the simulation are generated by a graph generator [11]. A network with n nodes is randomly placed on a Cartesian coordinate. Fig.1 is an example of randomly generated 80-node communication network. Table 1 shows the specifications of communication networks we generated. Values for the weighting coefficients are chosen as follows based on [8]: μ1 = 200 μ2 = 5000 μ3 = 1500 μ4 = 5000 μ5 = 250 Corresponding to the parameter setting principle described in [12], we let = = 0.004, I0 = 0.65, β = 0.001, and z(0) = 0.1. Initial inputs of neural networks Uxi (0) are randomly generated between [−1, 1]. At the end of each iteration, we set each neuron on or off according to the average value (VT ) of (m) xi
1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0
0
0.2
0.4
0.6
0.8
1
Fig. 1. A 80-node network used in our simulations, with an average degree 4 (the number of links for a node)
Solving the DCMR Problem Using the TCNN
61
Table 1. Specifications of the randomly generated geometric instances Instance
Nodes
Destinations Edges Delay bound
Number N
Number
E
Δ
Case #1
8
5
11
20
Case #2
16
5
22
20
Case #3
30
5
47
20
Case #4
80
5
154
25
Table 2. Results for the HNN, TCNN for instances #1 to #4. “sd” stands for “standard deviation”. Cost mean±sd
Time mean±sd (s)
Instance No. HNN
TCNN
HNN
TCNN
1
12.19±0.53 8.72±0.31 0.18±0.22 0.68±0.29
2
32.06±3.23 23.20±0.28 2.17±0.64 6.49±0.70
3
24.35±1.31 22.67±1.95 23.32±3.71 46.86±4.18
4
29.07±1.02 25.13±0.98 537.3±79.2 1150±127
the output matrix. If Vxi ≥ VT , then Vxi = 1, the link from x to i is on the final optimal tree, and vice versa. The algorithm is run 1000 times with randomly generated initial neuron states, and compared with conventional Hopfield networks used in [8]. The result is listed in Table 2. The TCNN is capable to jump out of local minima and achieves the global optimal due to its complex dynamics. As a trade off, the execution time increases. In applications, we can balance the route optimality ratio and the execution time through the parameter β, which determines the decaying of chaotic dynamics. Larger β will make the TCNN converge faster, while smaller one will make the TCNN more probable to reach the global optimal.
5
Conclusion
We studied the delay constrained multicast routing problem which is motivated by fast development of delay-sensitive communication applications. We showed that the transiently chaotic neural network is more capable to reach the global optimal solutions compared with the HNN.
62
W. Liu and L.P. Wang
Individual QoS parameters may be conflicting and interdependent, thus making the problem even more challenging [13]. Computing multicast routes that satisfy different QoS parameters simultaneously is an NP-hard problem. It is even harder to solve when each destination has different QoS requirements. Furthermore, the multicast group may be dynamic, i.e., the node may join or leave the communication network at any instance of time. We will keep exploring this area in future.
References 1. Reeves, D.S., Salama, H.F.: A Distributed Algorithm for Delay-constrained Unicast Routing. IEEE/ACM Transactions on Networking 8(2) (2000) 239-250 2. Chen, J., Chan, S.H.G., Li, V.O.K.: Multipath Routing for Video Delivery over Bandwidth-limited Networks. IEEE Transactions on Selected Areas in Communications 22(10) (2004) 1920-1932 3. Chakraborty, D., Chakraborty, G., Shiratori, N.: A Dynamic Multicast Routing Satisfying Multiple QoS Constraints. Int. Journal of Network Management 13(5) (2003) 321-335 4. Ganjam, A., Zhang, H.: Internet Multicast Video Delivery. Proceedings of the IEEE 93(1) (2005) 159-170 5. Chen, L.N., Aihara, K.: Chaotic Simulated Annealing by a Neural Network Model with Transient Chaos. Neural Networks 8(6) (1995) 915-930 6. Venkataram, P., Ghosal, S., Kumar, B.P.V.: Neural Network Based Optimal Routing Algorithm for Communication Networks. Neural Networks 15(10) (2002) 1289-1298 7. Rauch, H.E., Winarske, T.: Neural Networks for Routing Communication Traffic. IEEE Cont. Syst. Mag. 8(2) (1988) 26-31 8. Pornavalai, C., Chakraborty, G., Shiratori, N.: A Neural Network Approach to Multicast Routing in Real-time Communication Networks. In: International Conference on Network Protocols (ICNP-95) (1995) 332-339 9. Nozawa, H.: A Neural-network Model as a Globally Coupled Map and Applications Based on Chaos. Chaos 2(3) (1992) 377-386 10. Ali, M.K.M., Kamoun, F.: Neural Networks for Shortest Path Computation and Routing in Computer Networks. IEEE Transactions on Neural Networks 4(6) (1993) 941-954 11. Waxman, B.: Routing of Multipoint Connections. IEEE J. select. Areas Communication 6(9) (1988) 1617-1622 12. Wang, L.P., Li, S., Tian, F.Y., Fu, X.J.: A Noisy Chaotic Neural Network for Solving Combinatorial Optimization Problems: Stochastic Chaotic Simulated Annealing. IEEE Transactions on System, Man, and Cybernetics-Part B: Cybernetics 34(5) (2004) 2119-2125 13. Roy, A., Banerjee, N., Das, S.K.: An Efficient Multi-objective Qos Routing Algorithm for Real-time Wireless Multicasting. In: Proceedings of IEEE Vehicular Technology Conference (2002) 1160-1164
Solving Prize-Collecting Traveling Salesman Problem with Time Windows by Chaotic Neural Network Yanyan Zhang and Lixin Tang
,
The Logistics Institute Northeastern University, Shenyang, China
[email protected]
Abstract. This paper presents an artificial neural network algorithm for prize-collecting traveling salesman problem with time windows, which is often encountered when scheduling color-coating coils in cold rolling production or slabs in hot rolling mill. The objective is to find a subset sequence from all cities such that the sum of traveling cost and penalty cost of city unvisited is minimized. To deal with this problem, we construct mathematical model and the corresponding network formulation. Chaotic neurodynamic is introduced and designed to obtain the solution of the problem, and the workload reduction strategy is proposed to speed up the solving procedure. To verify the efficiency of the proposed method, we compare it with ordinary Hopfield neural network by performing experiment on the problem instances randomly generated. The results clearly indicate that the proposed method is effective and efficient for given size of problems with respect to solution quality and computation time.
1 Introduction A great deal of problems in theory and practice are related to combinatorial optimization problems, most of which are hard to solve and belong to NP-hard problems. Therefore researches in this field usually aim at developing efficient and effective techniques to find better solutions instead of exact ones. And from the practical viewpoint, rather fast approximate algorithms are useful and have achieved considerable success when applied to practical case. A typical such kind of combinatorial optimization problem can be found in color coating coils scheduling in cold rolling mill and slabs scheduling in hot rolling mill. In the production of color coating coils, after the surface treatment, the cold rolled coils and galvanized coils are dressed with all kinds of paints to the surface in roller applying method. In the course of operation, considering productivity and cost, many requirements between adjacent coils must be taken into account. Most of these requirements can be transformed into a parameter [1] (similar to the sense of “distance” in TSP) between adjacent coils (cities). The situation of slabs scheduling in hot rolling mill is much similar. Based on such transformation, these production scheduling problems can be formulated as the framework of a well-studied Prize-Collecting Traveling Salesman Problem with Time Windows D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 63–71, 2007. © Springer-Verlag Berlin Heidelberg 2007
64
Y. Zhang and L. Tang
(PCTSPTW), which can be characterized by prize-collecting mechanism and time windows requirements. With respect to the prize-collecting mechanism, that is, each city is assigned a prize value and a penalty, the goal is to construct a tour that maximizes total prize collected and (or) minimizes the penalties occurred while minimizing the distance traveled; this allows a salesman to skip certain unprofitable sites, similar researches have been done in[2][3]. As for the time windows demand, it can be expressed as that, each city can only be visited within a given time interval. Because of the time windows, only feasible paths need to be considered. In practice, such windows are needed because there exists a holding time constraint for each coil or slab, if a coil (slab) has been kept longer than the allowable time limitation before processing, unnecessary pretreatment cost will occur. In this research the holding time requirement, the arrival and visiting time of each city is defined as the time windows, a real time index of the city. Therefore, the best sequence in the above context is the sequence with minimum cost, with respect to both the visiting costs and penalties. Unlike most of the researches that consider the time windows requirements as soft constraints (violation of time windows constraints leads to penalties), this paper formulate them as hard constraints, only feasible solution with respect to time windows can be accepted, which increases the difficulties of solving. It has been proved that elementary shortest path with time windows is a strongly NP-hard problem [4], and relaxed versions of this problem have been reported [5]. Therefore the task of PCTSPTW with such complex constraints that we address is intractable, even for a feasible solution. As for the solving approach, artificial neural networks have been applied to many combinatorial optimization problems [6][7] such as TSP and production scheduling problems. But in these problems, no time windows are taken into account. Since the performance and structure of artificial neural networks have been optimized, and among which the transient chaotic neural network [8][9] is one of the most successful applications. In this paper, we propose a novel neural network algorithm in which chaotic mechanism is introduced to escape from local minima of traditional neural network. It is, with the authors’sights, the first such algorithm in the literature to solve the PCTSPTW. The contribution of this research involves the construction of a network formulation, the derivation of running neurodynamic, the innovations of reducing computation cost and the design of the experiments.
2 Problem Description and Formulation We define the Prize-Collecting Traveling Salesman Problem with time-windows (PCTSPTW) as follows. 2.1 Notations n i, j
— the number of all available cities to be processed. — city identifier, i, j=1, 2, … , n
Solving Prize-Collecting Traveling Salesman Problem
65
C pi ri ei ci Bi k,l
— the capacity demand (the upper bound of prize) of the sequence. — the penalty occurred when nod i s not selected in current sequence. — the arrival time of city i. — the ending time of city i. — the visiting prize of city i. — the start time of city i. — the processing position, k, l=1, 2, … , n
vik
— the output of a neuron.
uik
— the state of a neuron.
Iik
— the threshold value of a neuron.
Wik,jl
— the connection weights between two neurons.
ISk
— the immediate succeeding) visiting position of k in the route.
IPk dij
— the immediate preceding visiting position of k in the route. — the distance between city i and j. — damping factor of nerve membrane ( 0 ≤α ≤1). — positive scaling parameter for inputs.
α
γ zik β
— self-feedback connection weight or refractory strength (zik ≥ 0 ). — damping factor of the time-dependant zi ( 0 ≤ β ≤1)
I0 ΩS
— positive parameter. —set of selected cities in current sequence.
Ωu
—set of unvisited cities.
Decision variables: ⎧1, xij = ⎨ ⎩0, ⎧1 yi = ⎨ ⎩0
if city i is visited immediately before city j otherwise
if city i is selected in the current sequence otherwise
2.2 Mathematical Model
The objective function n
n
n
min ∑∑ x ij d ij + ∑ p i (1 − y i ) i =1 j =1
(1)
i =1
Subject to n
∑x i =1
ij
≤ 1,
j = 1,2,..., n
(2)
66
Y. Zhang and L. Tang n
∑x j =1
∑x
i , j∈S
≤ 1,
ij
ij
≤ S − 1,
ri ≤ Bi ≤ ei n
∑c y i =1
∑x
i
i = 1,2,..., n
(3)
∀S ⊆ Ω s
(4)
∀i ∈ Ω s ∪ Ω u
(5)
≤C
i
(6)
= yi
∀i ∈ Ω s ∪ Ω u
(7)
xij ∈ {0,1},
i, j = 1,2,..., n
(8)
i = 1,2,..., n
(9)
j∈Ω \ i
ij
yi ∈ {0,1},
The first item in the objective function is the sum of distances between all pairs of adjacent cities, the second item is the total penalties of unscheduled cities. Constraints (2) ensure that for each city there are at most one city is arranged before it. Constraints (3) guarantee that for each city there are at most one city is arranged after it. Constraints (4) ensure the feasibility of the obtained route that no cycle is allowed to exist, where S is the generated city sequence. Constraints (5) present the time windows of each city, the hard real-time constraints, only within this time the city can be processed. Constraints (6) mean that total prizes in the sequence should not exceed upper bound the capacity demand. Equations (7), (8) and (9) are the variable values constraints. 2.3 Networks Formulation
Objective function n
n
min ∑∑
n
n
n
i =1
k =1
∑ vik (v j ,ISk + v j ,IPk )dij + ∑ pi (1 − ∑ vik )
i =1 k =1 j =1, j ≠i
(10)
Subject to n
n
n
∑∑ ∑ v i =1 k =1 j =1, j ≠ i
ik
v jk = 0
(11)
Solving Prize-Collecting Traveling Salesman Problem n
n
n
∑∑ ∑ v i =1 k =1 l =1,l ≠ k
n
v =0
ik il
67
(12)
n
(∑∑ vik − num) 2 = 0
(13)
i =1 k =1
n
n
(∑∑ ci vik − C ) 2 = 0
(14)
i =1 k =1 n
min ∑ vik ( Bi − ri ) ≥ 0 1≤ i ≤ n
(15)
k =1
n
min ∑ vik (ei − Bi ) ≥ 0 1≤ i ≤ n
(16)
k =1
Where Bi is the start time of city i, Bi = max{BIPi + d IPi , i , ri } . In the objective function, the first item is the sum of distances between all pairs of adjacent cities, the second item is the penalty of all cities for the tardiness of due date, the third item is the penalty for unscheduled cities. Constraints (11) require that on one position, only one city can be arranged. Constraints (12) mean that each city can only be arranged to one processing position. Constraints (13) claim that the approximate number of scheduled cities is num which corresponds to capacity (prize) limitation and is expressed as num =[C/(
∑
n i =1
c i /n)]. Constraints (14) express the demand for the sum of
prizes in the sequence. Constraints (15) and (16) are the time window requirements, giving that once a city is selected, its start time must be after its earliest possible start time and before its latest allowable start time. Then we get the following energy function.
E=
A1 n n n (∑∑ ∑vik (v j ,IPk + v j ,ISk )dij 2 i=1 k =1 j=1, j≠i
n
n
+ ∑ pi (1 − ∑vik )2 ) + i=1 n
k =1
n
+ ∑∑
n
∑vikvil ) +
i=1 k =1 l =1,l ≠k
+
A2 n n n (∑∑ ∑vik v jk 2 i=1 k=1 j=1, j≠i
n n A3 n n A (∑∑vik − num)2 + 4 F (C − ∑∑ci vik ) 2 i=1 k=1 2 i=1 k =1
n n A5 (G min∑vik (Bi − ri ) + G min∑vik (ei − Bi )) 1≤i≤n 1≤i≤n 2 k =1 k =1
(17)
68
Y. Zhang and L. Tang
The connection weights and threshold values are as follows:
(
)
wik , jl = − A1 (1 − δ ij )(δ l , IPk +δ l , IS k )dij + δ ij pi − A2 ((1 − δ ij )δ kl n
n
(18)
+ δ ij (1 − δ kl )) − A3 − A4ci c j g (C − ∑∑ c p v pq ) p =1 q =1
n
n
I ik = − A1λ2 ci − A3 num − A4 Cci g (C − ∑∑ c p v pq ) p =1 q =1
n
n
− A5 ( Bi − ri ) g (min ∑ v jl ( B j − r j )) + (ei − Bi ) g (min ∑ v jl (e j − B j )) 1≤ j ≤ n
1≤ j ≤ n
l =1
l =1
(19) Substitute the above formula for the Wik,jl and Iik in the following equation. n
n
u ik (t ) = ∑
∑w
j =1 l =1, jl ≠ ik
ik , jl
v jl (t ) − I ik
(20)
Then we get the running dynamics of our networks as follows. n
uik (t ) = ∑
∑ (− A ((1 − δ n
1
j =1 l =1, jl ≠ik
ij
)(δ l ,IPk+δ l ,ISk )d ij + δ ij pi ) − A2 ((1 − δ ij )δ kl (21)
⎞ + δ ij (1 − δ kl )) − A3 − A4 ci c j g (C − ∑∑ ci vik ) ⎟v jl (t ) − I ik i =1 k =1 ⎠ n
u ik (t ) = − A1 ( − A2 (
n
∑
j =1, j ≠ i
n
n
∑ (v j ,IPk (t ) + v j , ISk (t ))d ij + pi (∑ vil − 1))
j =1, j ≠ i
v jk (t ) + n
n
l =1
n
∑v
l =1, l ≠ k
n
n
il
n
(t )) − A3 (∑∑ v jl (t ) − num) j =1 l =1
n
n
− A4 g (C − ∑ ∑ c p v pq )c i (∑ ∑ v jl (t )c j − C ) p =1 q =1
j =1 l =1
n
n
− A5 (( Bi − ri ) g (min ∑ v jl ( B j − r j )) + (ei − Bi ) g (min ∑ v jl (e j − B j ))) 1≤ j ≤ n
l =1
1≤ j ≤ n
l =1
(22)
Solving Prize-Collecting Traveling Salesman Problem
69
When chaos is introduced, n
n
n
uik (t + 1) = αuik (t ) + β (∑∑∑
n
∑w
i =1 k =1 j =1 l =1, jl ≠ik
ik jl
v jl (t ) − I ik ) + z (t )(vik (t ) − I 0 )
n n ⎛ = αuik (t ) + γ ⎜⎜ − A1 ( ∑ (v j ,IPk (t ) + v j ,ISk (t ))d ij + pi (∑ vil − 1)) j =1, j ≠i l =1 ⎝
− A2 (
n
∑v
j =1, j ≠i
jk
(t ) +
n
n
∑v
l =1,l ≠ k
il
n
n
(t )) − A3 (∑∑ v jl (t ) − num) j =1 l =1
n
n
n
− A4 g (C − ∑∑ c p v pq )ci (∑∑ v jl (t )c j − C ) + zik (t )(vik (t ) − I 0 ) p =1 q =1
j =1 l =1
n ⎞ − A5 (( Bi − ri ) g (min ∑ v jl ( B j − rj )) + (ei − Bi ) g (min ∑ v jl (e j − B j ))) ⎟ 1≤ j ≤ n 1≤ j ≤ n l =1 l =1 ⎠ n
(23) Where
z ik (t + 1) = z ik (t ) / ln(e + β (1 − z ik (t )))
Where the output
⎧0 F ( x) = ⎨ 2 ⎩x
vik = 1 /(1 + e
x≥0
⎧0 G ( x) = ⎨ x<0 ⎩x
−
(24)
u ik
μ
) and
x≥0 ⎧0 g ( x) = ⎨ x<0 ⎩1
⎧1 i = j x≥0 δ ij = ⎨ x<0 ⎩0 i ≠ j
3 Computational Experiments In this section, we compare the performance of the proposed chaotic searching method with conventional Hopfield network for the PCTSPTW. The experiments are implemented using C++ on a Pentium-IV 3.0-GHz PC with 512 MB RAM. The data set used in the experiments is as follows: the capacity, the distance matrix [0, 1], the visiting prize [0, 1], the unvisited penalty [0, 1] and the time windows are all generated at random. In the time windows, the start time ri ∈ [0, the distance sum of all cities], the end time of a city i is set as such : ei=ri+[0, 1]. Other parameters are set as follows:
α = 0 .88 ; γ = 0 .01 ; β = 0 .015 ; μ = 0 .004 ; I( 0 ) = 0 .65 ; z ik ( 0 ) = 0 .08 And A1 ∈ [4, 5] , A2 ∈ [4, 5] , A3 ∈ [4, 5] , A4 ∈ [1.5, 2.5] , A5 ∈ [5.5, 7] , depending on different testing examples. The number of cities in our research is 10, 5 groups of instances are generated. For each group of instances, we present the results of 100
70
Y. Zhang and L. Tang
individual iterations. The iteration stop when the energy reduction has been continuouly less than 0.01 for 20 times. In addition, in the case of applying the proposed methods to PCTSPTW, it will be much beneficial if we can reduce the computational complexity of the methods so as to make it easier to apply them to large problems. The basic idea is that, if we can reduce the computational cost for unimportant parts, we can spend more cost on important parts or finding better solutions. Here we discuss a simple method for reducing the computational workload of the proposed method. Because of the hard time windows requirements, it is very rare to visit cities with relative later time windows at the beginning of the route. Otherwise, the generated solution will in large degree be undesirable. Similarly, cities with relative earlier time windows will not be visited at the end part of the route. Since it is rare for good solutions to include such routes, we can omit the calculations for such paths. That is to say, we can set the corresponding values of such neurons to be zero before computation. During the course of iteration, the output of such neurons keep fixed to reduce the number of neurons to be calculated. From the testing results we find that when not all cities are fully taken into account (‘unprofitable’ cities are omitted), the computational complexity is reduced, CPU time is much decreased and the possibility of being trapped in infeasible region is reduced. The results of comparison are summarized in Table 1 (CN: the method in this paper, HN: Hopfield network). For such complex problem the proposed approach exhibits much better performance that the high rate of feasibility can be guaranteed and in most cases satisfying solutions can be obtained. Although some local minima occurred, the deviations of such solutions are all less than 10%. Table 1. Results comparison on 5 groups of problem instances
Problem
CN
HN
structure
1
2
3
4
5
1
2
3
4
5
Rate of feasible solution(%)
88
96
100
90
57
12
42
34
50
10
Rate of satisfactory solution (%)
60
84
56
66
43
12
30
16
32
2
Rate of Local minima (%)
28
14
44
24
14
0
12
18
18
8
4 Conclusions This paper presents an artificial neural network algorithm, the first such algorithm in the literature to solve the Prize-Collecting Traveling Salesman Problem with Time Windows (PCTSPTW). For solving such kind of NP-hard problems, the practical way
Solving Prize-Collecting Traveling Salesman Problem
71
is to design efficient approximate algorithm to obtain satisfactory solutions within reasonable time. Chaotic neural network approach is adopted to find near-optimal solution in this paper. We formulate the problem as a PCTSPTW, considering the requirements of prized-collecting and time windows. The objective tries to find compromised solutions considering all constraints involved. Chaotic mechanism is introduced to escape from local minima of traditional neural network. Computational results prove the efficiency of the proposed approach and the promising prospect of applying neural network to such complex problems.
Acknowledgement This research is partly supported by National Natural Science Foundation for Distinguished Young Scholars of China (Grant No. 70425003), National Natural Science Foundation of China (Grant No. 60274049) and (Grant No. 60674084), the Excellent Young Faculty Program and the Ministry of Education, China.
References [1] Okano H., Davenport A.J., Trumbo, M., Reddy C., Yoda K., Amano M.: Finishing Line Scheduling in the Steel Industry. Journal of Research & Development 48 (5/6) (2004) 811-830 [2] Laporte, G., Martello, S.: The Selective Traveling Salesman Problem. Discrete Applied Mathematics 26 (1990) 193-207 [3] Balas, E.: The Prize Collecting Traveling Salesman Problem. Networks 19 (1989) 621-636 [4] Dror, M.: Note on the Complexity of the Shortest Path Models for Column Generation in VRPTW. Operations Research 42 (1994) 977–978 [5] Desrochers, M., Soumis, F.: A Reoptimization Algorithm for the Shortest Path Problem with Time Windows. European Journal of Operational Research 35 (1988) 242–254 [6] Smith, K.A.: Hopfield Neural Networks of Timetabling: Formulations, Methods and Comparative Results. Computers & Industrial Engineering 44 (2003) 283-284 [7] Hopfield, J.J., Tank, D.W.: Neural Computation of Decisions in Optimization Problems. Biological Cybernetics 52 (1985) 141-152 [8] Hasegawa, M., Ikeguchi, T., Aihara, K.: Solving Large Scale Traveling Salesman Problems by Chaotic Neurodynamics. Neural Networks 15 (2002) 271-283 [9] Chen, L.N., Aihara, K.: Chaotic Simulated Annealing by a Neural Network Model with Transient Chaos. Neural Networks 8 (6) (1995) 915-930
A Quickly Searching Algorithm for Optimization Problems Based on Hysteretic Transiently Chaotic Neural Network Xiuhong Wang1 and Qingli Qiao2 1
School of Management, Tianjin University, Tianjin 300072, China
[email protected] 2 Department of Biomedical Engineering, Tianjin Medical University, Tianjin 300070, China Qlqiao@ gmail.com
Abstract. This paper presents a fast algorithm based on the hysteretic transiently chaotic neural network (HTCNN) model for solving optimization problems. By using hysteretic activation function, HTCNN has higher ability of overcoming drawbacks that suffer from the local minimum. Meanwhile, in order to avoid oscillation and offer a considerable acceleration of converging to the optimal solution, a fast speed strategy is involved in HTCNN. Numerical simulation of a combinatorial optimization problem-assignment problem shows that HTCNN with fast speed strategy (FHTCNN) can overcome drawbacks that suffer from the local minimum and find the global optimal solutions quickly.
1 Introduction In order to overcome the shortcoming of Hopfield neural networks (HNN) [1], which often suffer from the local minima, many researchers have investigated the characteristics of the chaotic neural networks (CNN) and attempted to apply them to solve optimization problems [2]. In fact, most of these chaotic neural networks have been proposed with monotonous activation function such as sigmoid function. Recently, the neuro-dynamics with a non-monotonous have been reported to posse an advantage of the memory capacity superior to the neural network models with a monotonous mapping [3]. Hysteretic neuron models have been proposed [4] for association memory, and have been demonstrated performing better than nonhysteretic neuron models, in terms of capacity, signal-to-noise ratio, recall ability, etc. We have proposed a transiently chaotic neural network model with a non-monotonous mapping-hysteresis (HTCNN) to solve the assignment problem, because of using hysteretic function that is multi-valued, adaptive, and has memory, the HTCNN has higher ability of overcoming drawbacks that suffered from the local minimum [5]. In order to simulate continuous dynamics and accomplish fast calculation with parallel digital computers, the HTCNN algorithm is operated in a synchronous and discrete way. When synchronous and discrete dynamics are applied to the HTCNN for the optimization problems, much iteration is required before converging to optimal solution with small time difference. When time difference is large, however, the D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 72–78, 2007. © Springer-Verlag Berlin Heidelberg 2007
A Quickly Searching Algorithm for Optimization Problems Based on HTCNN
73
system becomes oscillatory and the search fails completely. In this paper, we analyze the eigenspace of the weight matrix that is designed to solve the optimization problems and propose a new algorithm, which offers a considerable acceleration of converging to the optimal solution when the HTCNN is updated in a synchronous discrete manner.
2 Hysteretic Neuron and Hysteretic Chaotic Neural Network Our hysteretic neuron activation function is depicted in Fig.1.and is described as:
y ( x / x ) = φ ( x − λ ( x )) = tanh(γ ( x )( x − λ ( x ))) , where
⎧γ α , x ≥ 0 ⎧− α , x ≥ 0 γ ( x ) = ⎨ , λ ( x ) = ⎨ ⎩ β , x < 0 ⎩γ β , x < 0
and (γ α , γ β )
(1)
> 0 , β > −α ,
Fig. 1. Hysteretic activation function
Observed that, this neuron’s output
y not only depends on its input x , but also on
derivative information, namely x . It is the latter information that provides the neuron with memory and distinguishes it from other neurons. The hysteretic neuron’s activation function has four parameters, namely, α , β , γ ε , γ β , and we can think about tuning all of its parameters in order to maximize its performance. So it seems that the hysteretic neuron has much more flexibility than the usual neuron. By combining N neurons given in (1), we can form the hysteretic Hopfield neural network (HHNN). The dynamics of HHNN is defined by:
dxi x = − i + ∑ wij y j + I i , dt τ j
(2)
where yi, xi, and Ii are output value, internal input value, and bias of a neuron i, respectively, and wij is a synaptic weight from neuron j to i. τis the time constant.
74
X. Wang and Q. Qiao
HHNN includes memory because of using hysteretic function as neuron’s activation function. And due to a change in the direction of the input, a system can pull itself out of a saturated region by jumping from one segment of the hysteretic activation function to the other segment. This make the HHNN has a tendency to overcome local minima. We have demonstrated stability of the HHNN by Lyapunov direct method [5]. In order to make use of the advantages of both the chaotic neurodynamics and the hysteretic neural networks, we create hysteretic transiently chaotic neural network (HTCNN) and the continuous dynamics of the HTCNN is [5].
dxi x = − i + ∑ wij y j + I i − z ( yi − I 0 ) , dt τ j dz
= −β 0 z ,
dt
(3) (4)
where z (t) is the self-feedback connection weight, β0 (0<β0<1) is damping factor, and I0 is a positive parameter. Parameter z decay exponentially as eqn. (6) and make the neural network actually has transiently chaotic dynamics. Damping factor β0 control the chaotic dynamics and convergence speed of the network. Using the Eular discretization, the difference equation is written in the form:
xi (t + 1) = kxi (t ) + α 0 (∑ wij y j (t ) + I i ) − z(t )( yi − I 0 ) ,
(5)
z (t + 1) = z (t )(1 − β 0 ) ,
(6)
j
where k=(1-Δt/г) and α0=Δt.
3 Fast Hysteretic Transiently Chaotic Neural Network (FHTCNN) for the Optimization Problems Kindo and Kakeya [6] have proposed a geometrical method for analyzing the properties of associative memory model. Based on it, we give a short review of the geometrical explanation on neural dynamics. For simplicity, assume v=f(u)=sgn(u), I=0 and z=0 (∀i), Then v = N holds, for the state vector v has +1 or –1 as its components. Therefore v is always on the surface of the hypersphere SN-1 with radius N . The neural dynamics are divided into two phases. In the first phase, the state vector v(t) is transferred to the vector u(t+1)=(1- t)u(t)+ twv(t) linearly (τ=1) with the weight matrix w. In the second phase, the vector is quantified to the nearest state vector that requires the least angle rotation. Therefore, from the hyperspherical viewpoint, linear transformation gives the major driving force of dynamics, while nonlinear transformation generates the terminal points of dynamics. That is to say linear transformation is more important than the nonlinear transformation when we discuss non-equilibrium dynamical properties of neural network. This suggests that the eigenspace analysis of the weight matrix gives major information to explain the
△
△
A Quickly Searching Algorithm for Optimization Problems Based on HTCNN
75
global feature of the dynamics. Now we apply this approach and analyze the weight matrix of the neural network, which is designed to solve the optimal problem. As stated above, the good solutions of optimization problems are located in the low energy area of the state space, and the low energy state of the network corresponds to the state that is composed mainly of the eigenvectors with large eigenvalues. Therefore good solutions have large components of eigenvectors with large eigenvalues and almost no components of eigenvector with negative eigenvalues. While the synchronous discrete dynamics with large t do not always realize state transition toward the low energy. Fig.2 is used to illustrate the simple mechanism. Here the nonlinear transformation is neglected for simplicity, and the dynamics given by u(t+1)=(1- t)u(t)+ twv(t) are illustrated. When t is small, the state vector converges to the eigenvector of w with the largest positive eigenvalue, which spans the low energy states. When t is large, however, the state vector is attracted to the eigenvector of w whose eigenvalue has larger absolute value. This means that the state vector stays in the high-energy states when a negative eigenvalue has larger absolute value than the maximun positive eigenvalue. In this case, t has to be kept small to ensure convergence to a low energy state though larger t leads to faster convergence when positive eigenvalues are dominant. From this discussion, it is expected that synchronous and discrete state transition with large t can proceed toward the low energy states if the effect of the minimum eigenvalue is canceled.
△
△
△
△
△
△
t=1 t=3
Eigenvector (Eigenvalue=1) t=2
t=1
t=1
t=3
t=4
Eigenvector (Eigenvalue=-2)
t=0.5
△
△
Eigenvector (Eigenvalue=1) t=4 t=2
Eigenvector (Eigenvalue=-2)
Fig. 2. Convergence of dynamics given by difference equation with small and large time differences t
△
The component of the eigenvector with the minimum eigenvalue is eliminated from the weight matrix w of neural network for optimization problem by calculating
Ψij = Wij − ρλ min ei(min) e (min) , j
(7)
where λmin is the minimum eigenvalue and ei(min) is its normalized eigenvector, ρ is a positive constant, when ρ=1, the minimum eigenvalue component is eliminated from w completely. However, simple adoption of this weight matrix Ψ often fails, because reduction of small eigenvalues increases the firing rate of the network, as a result, the network converge to a solution which does not satisfy the constraints. To adjust the firing rate, the threshold should be raised in accordance with the increase of the average weight. Since the threshold is always active while the firing rate of neurons in
76
X. Wang and Q. Qiao
the feasible is 1/N, the effect of the threshold is N times larger than that of the neurons. Therefore threshold is
φi = I i +
1 ρλ min ∑ ei(min) e (min) , j N j
(8)
4 Fast Hysteretic Transiently Chaotic Neural Network (FHTCNN) for the Assignment Problems The simplest case of Assignment problems (AP) can be illustrated as: Let a number n of jobs be given that have to be performed by n machines, where the cost depend on the specific assignments. Each job has to be assigned to one and only one machine, and each machine has to perform one and only job. The problem is to find such an assignment that the total cost of the assignments became minimum. Give two lists of elements and a cost value for the paring of any two elements from these lists; the problem is to find the particular one-to-one assignment or match between the elements of the two lists that results in an overall minimum cost. We use capital letters to describe the elements of one list (i.e. X=A, B, C, etc.) and enumerate the elements of the other list (i.e. i=1, 2, 3, etc.). A one-to-one assignment means that each element of X has to be assigned to exactly one element of i. The cost PXi for every possible assignment between X and i is given for each. →i ⎛ 68 ⎜ ⎜6 ⎜ 68 ↓⎜ 42 X ⎜⎜ 33 ⎜ ⎜ 72 ⎜ 44 ⎝
68 93 38 52 83 4 ⎞ ⎛ 0 ⎟ ⎜ 53 67 1 38 7 42 ⎟ ⎜ 1 59 93 84 53 10 65 ⎟ ⎜ 0 ⎟ ⎜ 70 91 76 26 5 73 ⎟ → ⎜ 0 65 75 99 37 25 98 ⎟ ⎜ 0 ⎟ ⎜ 75 65 8 63 88 27 ⎟ ⎜ 0 76 48 24 28 36 17 ⎟⎠ ⎜⎝ 0
0 0 0 0 0 1⎞ ⎟ 0 0 0 0 0 0⎟ 0 0 0 0 1 0⎟ ⎟ 0 0 0 1 0 0⎟ 1 0 0 0 0 0⎟ ⎟ 0 0 1 0 0 0⎟ 0 1 0 0 0 0 ⎟⎠
Fig. 3. Cost matrix and output matrix of the neural network of a 7×7 AP
The assignment problem is represented by a two dimensional quadratic matrix of units, whose outputs are denoted by vXi which is a “decision” variable, with vXi=1 meaning that the element X should be assigned to the element i, and vXi=0 meaning that the pairing between X and i should not be made. This way, a solution to the AP can be uniquely encoded by the two dimensional matrix of the outputs vXi after all units converge to 0 or 1. The constraints of the one-to-one assignment require that the outputs of the network after convergence should produce a permutation matrix with exactly one unit “on” in each row and column, Such as Fig.3. In this example, the output-matrix determines the assignment of elements A to 7, B to 1, C to 6 etc. The solution encoded by the output-matrix is optimal with an overall cost of 165.
A Quickly Searching Algorithm for Optimization Problems Based on HTCNN
77
The energy function is:
E AP =
A (∑ (∑ v Xi − 1) 2 + ∑ (∑ v Xi − 1) 2 ) + D ∑ ∑ p Xi v Xi , 2 X i i X X i
(9)
A and D are constant parameters, the weight values wXi,Yj and external bias IXi of the neural network neurons are w
Xi , Yj
= − A (δ
XY
(1 − δ
ij
)+δ
ij
(1 − δ
XY
(10)
)) ,
I Xi = 2A − Dp Xi ,
,otherwise δ
where δ ij = 1 , if i=j
ij
(11)
= 0 . Introduce (10), (11) into (5) and then
derive the HTCNN for solving the assignment problem, where, the neuron activation function is given by the hysteretic function as follows:
⎧0.5 tanh(γ αXi (u Xi + α Xi )) + 0.5, u Xi ≥ 0, vXi = ⎨ β ⎩0.5 tanh(γ Xi (u Xi − β Xi )) + 0.5, u Xi < 0, where,
v Xi and u Xi are output value and internal input value of neuron
(12)
Xi .
We use FHTCNN to solve the above specified 7×7 assignment problem. The parameter are chosen as A=1, D=1, I0=0.4, k=0.975,α0=0.015, z(0)=0.45,β0=0.02,
γ αXi = γ Xiβ = 50 , α Xi = β Xi = 0.02 .
The eigenvalue distribution of the weight
matrix w is shown in Fig.4. It has an extremely small eigenvalue –85.
Fig. 4. Eigenvalue distribution of weight matrix
Chose ρ=0.50 for calculating weight matrix Ψ and threshold Ф, The result with 100 different initial conditions in FHTCNN,HTCNN, TCNN and HNN under sets of time difference t=0.001, are summarized in Table.1. It is shown that FHTCNN has higher ability to search for globally optimal solution and converge to the optimal solution quickly.
△
78
X. Wang and Q. Qiao Table 1. Result of FHTCNN, HTCNN, TCNN, HHNN, and HNN for AP
Neural network Average of valid solutions Average iterations for convergence
FHTCNN 169.58 278
HTCNN 171.33 302
TCNN 182.3 435
HNN 402.3 324
5 Conclusions In order to accelerate the HTCNN searching for optimal solution of optimization problems under the synchronous discrete computation, we have analyzed the eigenspace of the weight matrices in geometrical approach, presented a fast algorithm by eliminating the components of the eigenvectors with eminent negative eigenvalues of the weight matrix. The simulation of an assignment problem shows that FHTCNN with modified weight matrix requires much less iteration than HTCNN with standard weight matrix before reaching optimal solution.
Acknowledgement This work is supported by the National Basic Research Program (also called 973 Program) of China under Grant 2005CB724302.
References 1. Hopfield, J.J., Tank, D.W.: “Neural” Computation of Decisions in Optimization Problems. Biological Cybernetics 52 (1985) 141-152 2. Wang, X.H., Qiao, Q.L.: Solving Assignment Problems with Chaotic Neural Network. Journal of Systems Engineering 16 (2001) 46-49 3. Nakagawa, M.: An Artificial Neuron Model with a Periodic Activation Function. Journal of the Physical Society of Japan 64 (1995) 1023-1031 4. Yanai, H., Sawada, Y.: Associative Memory Network Composed of Neurons with Hysteretic Property. Neural Networks 3 (1990) 223-228 5. Wang, X.H., Qiao, Q.L.: Solving Optimization Problems Based on Chaotic Neural Network with Hysteretic activation function. Advances in Neural Networks-ISNN 2005, PT 1, Proceedings Lecture Notes in Computer Science 3496 (2005) 754-749 6. Kindo, T., Kakeya, H.: A geometrical analysis of associative memory. Neural Networks 11 (1998) 39-51
Secure Media Distribution Scheme Based on Chaotic Neural Network Shiguo Lian, Zhongxuan Liu, Zhen Ren, and Haila Wang France Telecom R&D Beijing, Beijing 100080, China
[email protected]
Abstract. A secure media distribution scheme is proposed in this paper, which distributes different copy of media content to different customer in a secure manner. At the sender side, media content is encrypted with a chaotic neural network based cipher under the control of a secret key. At the receiver side, the encrypted media content is decrypted with the same cipher under the control of both a secret key and the customer information. Thus, the decrypted media copy containing customer information is slightly different from the original one. The difference can be detected and used to trace media content’s illegal distribution. The scheme’s performances, including security, imperceptibility and robustness, are analyzed and tested. It is shown that the scheme is suitable for secure media distribution.
1 Introduction Neural networks are used to design data protection schemes because of its complicated and time-varying structures [1]. For example, the cipher [2] is constructed based on the random sequence generated from the neural network. For the property of initial-value sensitivity, ergodicity or random similarity, chaos is also introduced to data protection. For example, the block cryptosystem is designed via iterating a chaotic map [3], the one-way hash is constructed based on the chaotic map with changeable parameter [4,5]. As a combination of neural networks and chaos, chaotic neural networks are expected to be more suitable for data encryption. For example, it is reported [6] that faster synchronization can be obtained by jointing neural network’s synchronization and chaos’ synchronization. Due to both of the properties, chaotic neural networks are regarded more suitable for data encryption. Till now, various chaotic neural network based encryption algorithms have been reported, which can be classified into two classes: stream cipher and block cipher. The first one uses neural networks to construct stream ciphers, in which, neural network is used to produce pseudorandom sequences [7,8,9,10]. The second one uses neural network to construct block ciphers, which makes use of chaotic neural networks’ properties to encrypt plaintext block by block [11,12]. Chaotic neural networks have also been used in multimedia content protection. For example, Yue et al. [13] constructed the image encryption algorithm based on chaotic neural networks, which encrypts several gray images into some binary images. Lian et al. [14,15] proposed a cipher based on chaotic neural network, which is used to D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 79–87, 2007. © Springer-Verlag Berlin Heidelberg 2007
80
S. Lian et al.
encrypt compressed images or videos partially. In practical applications, due to the rapid development of Internet technology and multimedia technology, multimedia data, such as images, videos or audios, may be redistributed after data decryption. For example, in video-on-demand system, the authorized customer can record the media content and redistribute it to other unauthorized customers. In this case, only encrypting the media data is not secure enough. To conquer the illegal distributors, the media content should be marked in order to show the ownership. In this paper, a secure distribution scheme based on chaotic neural network is proposed. Firstly, a parameter-adjustable sequence generator based on chaotic neural networks is proposed. At the server side, the media data are encrypted with the pseudorandom sequence (encryption sequence) generated from the sequence generator. At the customer side, the media data are recovered by the pseudorandom sequence generated from the combination of the encryption sequence and a mark sequence. The mark sequence is produced by the sequence generator under the control of the unique customer code. Thus, the recovered media content containing customer information is slightly different from the original one. The customer information can be used to tell the illegal distributors. The rest of the paper is arranged as follows. In Section 2, the proposed secure distribution scheme is presented in detail. And the performances including the security, imperceptibility, robustness and computational efficiency are tested and analyzed in Section 3. Finally, some conclusions are drawn, and the future work is presented.
2 The Proposed Secure Distribution Scheme 2.1 Architecture of the Proposed Distribution Scheme The scheme is composed of two parts: the sender side and the receiver side. At the sender side, the server encrypts the media data with a chaotic neural network based cipher. That is, the media data P is modulated by the pseudorandom sequence S0, and transformed into the cipher media C. The sequence S0 is produced by a chaotic neural network based pseudorandom number generator (CNNG) under the control of a secret seed and a quantization factor. The encrypted media C is then transmitted to customers. At the receiver side, the media C is demodulated by the pseudorandom sequence S and transformed into the cipher media P’. S is the combination of two sequences, S0 and S1. Among them, S0 is produced by the CNNG under the control of the secret seed and the quantization factor that are same to the ones used at the sender side, while S1 is produced by the CNNG under the control of a unique customer code and another quantization factor. The customer code is securely stored at the receiver side, and the quantization factor is often far smaller than the one used at the sender side. In the following contents, they will be presented in detail. 2.2 Chaotic Neural Network Based Pseudorandom Number Generator (CNNG) Chaotic neural network has been used in pseudorandom number generator [8,9,10] because of the properties, such as initial-value sensitivity, parameter sensitivity and random similarity. Here, a parameter-adjustable pseudorandom sequence generator is proposed, as shown in Fig. 2. It is composed of two steps: chaotic neural network based sequence generation and sequence quantization.
Secure Media Distribution Scheme Based on Chaotic Neural Network
C
Sender
81
Receiver
P'
P
S S0 Q0
CNNG
S0 Q0
CNNG
S1 CNNG
Q1
Customer Information Secret key
Secret key
Fig. 1. Architecture of the proposed secure media distribution scheme Q
K
Chaotic Neural Network
X=x0,x1,...,xn-1 Quantization
S=s 0,s1,...,s n-1
Fig. 2. The pseudorandom number generator based on chaotic neural network
Step 1. The chaotic neural network produces the chaotic sequence X=x0, x1, …, xn-1 under the control of K. Taking the one-way coupled map lattice (OCOML) [16,17] model for example, K is partitioned into two parts: one for the initial value, and another for the control parameter. Thus, the produced sequence satisfies 0 xi 1 (i=0,1,…,n-1). Step 2. Set x0, x1, …, xn-1 be the chaotic sequence generated from a chaotic neural network, A the maximal amplitude of the chaotic sequence, and Q the quantization factor. Then the pseudorandom sequence S=s0, s1, …, sn-1 is generated according to the following method.
≤≤
si =
xi Q (i = 0,1," , n − 1) . A
(1)
Generally, the bigger Q is, the bigger the amplitude of S is. 2.3 The Encryption Process The plaintext P=p0, p1, …, pn-1 is modulated by the pseudorandom sequence S0=s0,0, s0,1, …, s0,n-1 that is generated under the control of the secret seed and the quantization factor Q0 (0
≤
82
S. Lian et al.
Here, the pseudorandom sequence X0=x0,0, x0,1, …, x0,n-1 is produced by the chaotic neural network under the control of the secret seed, and L is the gray level of the plaintext’s pixel. For example, for a binary image, L=2, while for a 256-color image, L=256. Generally, the bigger Q0 is, the higher the security is. Thus, Q0=L is preferred. 2.4 The Decryption Process The decryption process is composed of three steps: the generation of two pseudorandom sequences, the combination of the sequences, and the demodulation operation. Step 1. The pseudorandom sequence S0=s0,0, s0,1, …, s0,n-1 is generated with the same method adopted in encryption. That is
s0,i =
x0,i Q0 (i = 0,1," , n − 1) . A
The other sequence S1= s1,0, s1,1, …, s1,n-1 is generated with the same method while different parameter. That is
s1,i =
x1,i − 1/ 2 A
Q1 (i = 0,1,", n − 1) .
Here, the pseudorandom sequence X1=x1,0, x1,1, …, x1,n-1 is produced by the chaotic neural network under the control of the unique customer ID, and Q1 is another quantization factor. Step 2. The two pseudorandom sequences are combined together and produce the new sequence S= s0, s1, …, sn-1 with the following method.
si = ( s0,i − s1,i ) mod L = (
x0,i A
Q0 −
x1,i − 1/ 2 A
Q1 ) mod L (i = 0,1," , n − 1) .
(3)
Step 3. The cipher text C= c0, c1, …, cn-1 is demodulated by the new sequence S, and thus the cipher text P’= p’0, p’1, …, p’n-1 is produced according to the following operation.
pi′ = ( ci − si ) mod L x x x − 1/ 2 . (4) ⎡ ⎤ = ⎢( pi + 0,i Q0 ) mod L − ( 0,i Q0 − 1,i Q1 ) mod L ⎥ mod L A A A ⎣ ⎦ x1,i − 1/ 2 =( pi + Q1 ) mod L (i = 0,1,", n − 1) A Seen from Eq. (4), if Q1=0, the decrypted media copy is same to the original copy, P’=P. Otherwise, the smaller Q1 is, the fewer differences between the original copy and the decrypted copy. Generally, the selection of Q1 depends on both the imperceptibility of the decrypted media data and the robustness of the embedded customer code. 2.5 Customer Tracing To detect which copy has been illegally distributed, the correlation-based detection operation is applied to the media copy. Set P’k (k=0,1,…,m-1) be the k-th customer’s
Secure Media Distribution Scheme Based on Chaotic Neural Network
83
media copy, Sj1 (j=0,1,…,m-1) the j-th customer’s sequence. Thus, the correlationbased detection operation is n −1
< Pk′ − P, S1j >=
∑ ( p′ i =0
− pi ) s1,j i
k ,i
.
n −1
∑s i =0
(5)
j j 1,i 1,i
s
For different customer owns different customer ID, the produced pseudorandom sequences are often independent from each other. Thus, set the threshold be T, then the customer can be detected by the following method.
⎧⎪k = j, < P ′ − P, S j > ≥ T k 1 . ⎨ ⎪⎩k ≠ j, < Pk ′ − P, S1j > < T
(6)
3 Performance Analysis 3.1 Security 3.1.1 Security of the Secret Seed The system’s security depends on the secret seed that is stored or transmitted in a secure way. The large size of the secret seed keeps it secure against brute-force attacks. In the proposed scheme, the chaotic neural networks with cipher-suitable properties are preferred, which produce the sequences with high sensitivity to changes [16,17,18]. This property keeps the system secure against the attacks based on key statistics. 3.1.2 Perceptual Security The quantization factor Q0 (0
≤
3.1.3 Collusion Attacks Assume N customers (k,k+1,…,k+N-1) attend the collusion attack. In the collusion attack, the N copies Pk′, Pk′+1 ," , Pk′+ N −1 are averaged, which produces the colluded copy P’’. In this case, the colluders can still be detected with a smaller threshold T/N. Since S1j ( j = 0,1,..., m − 1) is independent from each other, for each detected colluder, i.e., j=k+1, the correlation result is
< P′′ − P, S1j >=
1 1 < S1k +1 , S1j >= < S1j , S1j > . N N
(7)
84
S. Lian et al.
(a) Q0=8
(b) Q0=90
(c) Q0=180
(d) Q0=256
Fig. 3. Encryption results corresponding to different Q0
For the correlation result becomes smaller, the smaller threshold T/N can be used to detect the colluders. Generally, if N is no bigger than 20, the correlation value is no smaller than 0.05, that can still be distinguished from the correlation value between two independent sequences. 3.2 Imperceptibility and Robustness The quantization factor Q1 is in relation with the imperceptibility and robustness of the decrypted media data. Generally, the bigger Q1 is, the more greatly the decrypted media data are degraded, and the lower the imperceptibility is. Fig. 4(a) gives the relation between Q1 and the image’s quality (PSNR). Here, the images, Lena and Airplane, are tested, with Q1 ranging from 2 to 16. As can be seen, to keep the decrypted image be of high quality, Q1 should be kept small. Generally, the decrypted image’s PSNR is no smaller than 32 if Q1 is no bigger than 8. The robustness refers to the ability for the customer information to survive such operation as adding noise or recompression. Taking various images (Lena, Airplane, Couple, Boats, Village, Bridge, Baboon, Cameraman, Crowd, and Barbara) for example, the relation between the average correlation value, noise strength, compression quality and Q1 is tested and shown in Fig. 4(b) and Fig. 4(c). As can be seen, when the compression quality or noise strength is certain, the bigger Q1 is, the higher the correct detection rate is. Generally, if Q1 is no smaller than 8, the correlation value
Secure Media Distribution Scheme Based on Chaotic Neural Network 50 Lena Airplane
PSNR (dB)
45 40 35 30 25
5
10 15 Q1 (a) Imperceptibility test (Q1 - Decrypted image’s quality) 1.2 1
Cor
0.8 0.6
Q1=2 Q1=4 Q1=8 Q1=16
0.4 0.2 0 -0.2 -10
-8 -6 -4 Variance of Gaussian Noise (log10)
-2
(b) Robustness against adding noise 0.5
0.4
Cor
0.3 Q1=2 Q1=4 Q1=8 Q1=16
0.2 0.1
0 30
40
50 60 70 80 90 JPEG Compression Quality
(c) Robustness against recompression
Fig. 4. Imperceptibility and robustness test
100
85
86
S. Lian et al.
keeps no smaller than 0.4 when the noise strength (Gaussian noise’s variance) is no bigger than 0.01 and compression quality (JPEG compression quality) is no smaller than 70.
4 Conclusions and Future Work In this paper, a secure media distribution scheme based on chaotic neural network is proposed. It encrypts the media data at the server side and decrypts the media data into different copies at the customer side. The decrypted copy contains the customer information that can be used to trace illegal distributors. Such performances of the proposed scheme as security, imperceptibility and robustness are analyzed and tested through experiments. The colluders can still be detected, the customer information in the decrypted media data keeps imperceptible, the customer information can survive such acceptable operations as adding noise or recompression, and the encryption or decryption operation can be implemented efficiently. In future work, some means will be taken to support more colluders, and the robustness against some other operations (filtering, resample, rotation, etc) will be evaluated.
Acknowledgement This work was supported by the France Telecom project “Tatouage” through the grant number PEK06-ILAB-008.
References 1. Chan, C., Cheng, L.: Pseudorandom Generator Based on Clipped Hopfield Neural Network. Proceedings of IEEE International Symposium on Circuits and Systems 3 (1998) 183–186 2. Guo, D., Cheng, L., Cheng, L.: A New Symmetric Probabilistic Encryption Scheme Based on Chaotic Attractors of Neural networks. Applied Intelligence 10(1) (1999) 71-84 3. Xiang, T., Liao, X., Tang, G., Chen, Y., Wong, K.: A Novel Block Cryptosystem Based on Iterating a Chaotic Map. Physics Letters A 349(1-4) (2006) 109-115 4. Lian, S., Sun, J., Wang, Z.: Secure Hash Function Based on Neural Networks. Neurocomputing 69(16-18) (2006) 2346-2350 5. Xiao, D., Liao, X., Deng, S.: One-way Hash Function Construction Based on the Chaotic Map with Changeable-parameter. Chaos, Solitons & Fractals 24(1) (2005) 65-71 6. Rachel, M., Einat, K., Wolfgang, K.: Public Channel Cryptography by Synchronization of Neural Networks and Chaotic Maps. Physical Review Letters 91(11) (2003) 1-4 7. Cauwenberghs, G.: Delta-sigma Cellular Automata for Analog VLSI Random Vector Generation. IEEE Trans. Circuits and Systems II 46(3) (1999) 240–250 8. Chan, C., Cheng, L.: The Convergence Properties of a Clipped Hopfield Network and its Application in the Design of Keystream Generator. IEEE Trans. Neural Networks 12(2) (2001) 340-348 9. Karras, D., Zorkadis, V.: On Neural Network Techniques in the Secure Management of Communication Systems through Improving and Quality Assessing Pseudorandom Stream Generators. Neural Networks 16(5-6) (2003) 899-905
Secure Media Distribution Scheme Based on Chaotic Neural Network
87
10. Caponetto, R., Lavorgna, M., Occhipinti, L.: Cellular Neural Networks in Secure Transmission Applications. Proceedings of the IEEE International Workshop on Cellular Neural Networks and their Applications (1996) 411-416 11. Yen, J., Guo, J.: A Chaotic Neural Network for Signal Encryption/Decryption and Its VLSI Architecture. Proc. 10th (Taiwan) VLSI Design/CAD Symposium (1999) 319-322 12. Yee, L., Silva, D.: Application of Multilayer Perception Networks in Symmetric Block Ciphers. Proceedings of International Joint Conference on Neural Networks 2 (2002) 1455– 1458 13. Yue, T., Chiang, S.: A Neural Network Approach for Visual Cryptography. Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks 5 (2000) 494–499 14. Lian, S., Chen, G., Cheung, A., Wang, Z.: A Chaotic-Neural-Network-Based Encryption Algorithm for JPEG2000 Encoded Images. 2004 IEEE Symposium on Neural Networks (ISNN2004), Dalian, China, Springer LNCS, 3174 (2004) 627-632 15. Lian, S., Sun, J., Li, Z., Wang, Z.: A Fast MPEG4 Video Encryption Scheme Based on Chaotic Neural Network. 2004 International Conference on Neural Information Processing (ICONIP2004), India, Springer LNCS, (2004) 720-725 16. Wang, S., Kuang, J., Li, J.: Chaos-based Communication in A Large Community. Phys Rev 66 (6) (2002) 1-4 17. Kuang, J., Deng, K., Huang, R.: An Encryption Approach to Digital Communication by Using Spatiotemporal Chaos Synchronizaiton. Acta Physica 50(10) (2001) 1806-1856 18. Lian, S., Sun, J., Wang, J., Wang, Z.: A Chaotic Stream Cipher and the Usage in Video Protection. Chaos, Solitons & Fractals. In Press, Corrected Proof, Available online 14 June (2006)
An Adaptive Radar Target Signal Processing Scheme Based on AMTI Filter and Chaotic Neural Networks Quansheng Ren, Jian Wang, Hongling Meng, and Jianye Zhao* Department of Electronics, Peking University, 100871 Beijing, China
[email protected]
Abstract. In the proposed new scheme of adaptive radar target signal processing, the chaotic neural network not only detects the target signal by reconstructing the chaotic clutter, but also repairs the frequency spectrum according to its associative memory characteristic. The clutter is filtered by the Burg algorithm based on the adaptive MTI filter. The information of distance and velocity is also obtained by Burg spectral estimation. The validity of the scheme is analyzed theoretically, and the simulation results show that it has good performance in clutter and noise background. The adaptive method adopted in this paper facilitates the radar design in complex environment.
1 Introduction The aim of radar target signal processing is to get adequately valid information of distance, velocity etc. Signal processing module should detect the target signal correctly from the background, filter the cluster and noise, get valid information, and improve the result quality such as the signal to noise ratio. However, the noise and cluster vary with environmental characteristic, such as terrain, weather, and flyers. This variation brings tremendous difficulty to target signal processing. In this context, the adaptive radar target signal processing scheme is required. Recent studies have introduced nonlinear model (chaos or fractal) to analysis the cluster [1-5]. In [2], H. Leung proposed a kind of signal detection method based on nonlinear prediction using RBF neural network. The shortage of this method is that it ignores the disturbance of noise. Z. L. Xiong etc. In [6], the authors applied Single Value Decomposition (SVD) and classical matched filter, and used Chaotic Neural Network (CNN) instead of RBF neural network. Some literatures [7-8] utilize AR model to approach the frequency response of the clutter spectrum and design the Adaptive Moving Target Indication (AMTI) filter. Modified Lower and Upper triangular matrix Decomposition (MLUD) algorithm has also introduced in [7]. In this paper, a new scheme of adaptive radar target signal processing is proposed. Unlike schemes in [6] and [7], we combine CNN with Burg algorithm and AMTI filter. We utilize Burg algorithm to design AMTI filter and get the target information of distance and velocity. Because of its complex dynamics, chaotic neural network *
Corresponding author.
D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 88–95, 2007. © Springer-Verlag Berlin Heidelberg 2007
An Adaptive Radar Target Signal Processing Scheme
89
has more memory capacity and error tolerance than other neural networks. In the proposed scheme, the CNN module not only detects the target signal, but also repairs the frequency spectrum corrupted by the noise. The results of simulation show that the proposed scheme can improve the performances of detection, filtering and frequency spectrum. This paper is organized as follows: Section 2 describes the system model. In Section 3, the detail design of CNN and Burg algorithm based adaptive MIT filter is theoretically analyzed. Results of simulation are given in Section 4. Section 5 shows some concluding remarks.
2 System Model The new scheme of adaptive radar target signal processing is illustrated in Figure 1. Its signal processing flow is as follows:
Fig. 1. The block diagram of adaptive radar target signal processing
When the radar is far from the target, there is only clutter in the echo. The signal processing scheme is working in the “learning period” first. During learning period, the clutter time series pass through delay line. The delayed clutter sequence is input to the chaotic neural network. The chaotic neural network is trained to approach the inherent function hid in the clutter and to reconstruct the phase space of the clutter dynamics. The clutter time series also pass through the Burg module (dashed lines). The parameters of adaptive MTI filter can be obtained through the AR spectral estimation using Burg algorithm (dashed lines). After the system has been trained successfully, the system begins to work in the “detecting period”. The output of the chaotic neural network is compared with the real echo. If there is no target signal in the echo, the average square value of predictive error is less than the threshold. On the other hand, if the average square value of predictive error exceeds the threshold, it indicates that the radar echo contains the target signal. The result of the signal detection (CFAR module) controls the start-up of the adaptive MTI filter. If the target signal is detected successfully, the system comes into the “processing period”. In this period, the radar echo passes through the adaptive MTI filter, and the clutter is filtered. The output of the filter is used to get the target information of distance and velocity by the spectral estimator, i.e. Burg algorithm. Besides clutter, there may be noise in the radar echo, such as thermal noise. The effect of this
90
Q. Ren et al.
fact is that the frequency spectrum peak of the target signal would be expanded by the noise. The associative memory characteristic of chaotic neural network is utilized to repair the spectrum.
3 Theory Analysis 3.1 Chaotic Neural Network as Detecting Module The chaotic neuron model is expressed as follows [9]: t ⎡ ⎤ y (t + 1) = f ⎢ A(t ) − α k d g {y (t − d )} − Θ⎥ , d =0 ⎣⎢ ⎦⎥
∑
(1)
where y(t) is the output of the chaotic neuron, t is the discrete time steps, t=0,1,2,…, f is the output function, A(t) is the external stimulation at the time t, g is the refractory function, α , k and Θ are the refractory scaling parameter, the refractory decay parameter and the threshold, respectively. In the proposed radar target signal processing scheme, g(y)=y, θ (t ) = a , f adopts
the function ϕ j ( x j (n)) = a tanh(bx j (n)), (a, b) > 0 . The Aihara chaotic neuron is
used as the constituent element of the chaotic neural network module. The dynamics of the ith chaotic neuron is simplified as follows: t N t t ⎡M ⎤ yi (t + 1) = f ⎢∑ ε ij ∑ k e A j (t ) + ∑ wij ∑ k f y j (t ) − α ∑ k r g {yi (t )} − Θ i ⎥ , j =1 d =0 d =0 ⎣ j =1 d =0 ⎦
where
ε ij
(2)
and wij are synaptic weights to the ith neuron from the jth external input
and from the jth neuron respectively, and k e , k f and
k r are the parameters for the
external inputs, the feedback inputs, and the refractoriness, respectively. According to the Takens embedding theorem, the geometrical structure of ddimension dynamics system could be observed from the D-dimension vector
YR (n) = [ y (n), y (n − τ ),", y (n − ( D − 1)τ )]T ,
(3)
where τ is an integer denoting the embedded delay. For a given y(n), it has relation with an observable variable of the unknown dynamics. If D ≥ 2d + 1 , reconstruction could be achieved from YR (n) .
Fig. 2. Chaotic neural network block diagram
An Adaptive Radar Target Signal Processing Scheme
91
The chaotic neural network is composed of three layers: multi-input layer, hidden layer and output layer. According to LMS algorithm, the value of weighting is learning as follows,
Δw ji (n) = −η
∂Ε(n) = ηe j (n) ⋅ ϕ 'j ( x j (n)) ⋅ y j ( n) = ηδ j (n) y j (n) , ∂w ji ( n)
(4)
∂Ε( n) = ηe j ( n) ⋅ ϕ 'j ( x j ( n)) ⋅ yi (n) = ηδ j (n) yi ( n) , ∂ε ji ( n)
(5)
Δε ji (n) = −η where δ j (n) = −
∂Ε(n) , ∂x j (n)
y j (n) is the output of the jth neuron in the hidden layer,
x j (n) is the inner state variable of the jth neuron in the hidden layer. For output layer,
b a
δ j (n) = e j (n) ⋅ ϕ 'j ( x j (n)) = [d (n) − o(n)][a − o(n)][a + o(n)] , where
(6)
o(n) is the predictive value, d (n) is the real clutter value, a and b are the
parameters of the output function ϕ j ( x j (n)) = a tanh(bx j (n)), (a, b) > 0 . For hidden layer,
δ j (n) = ϕ 'j ( x j (n)) ⋅
∑δ
k (n) ⋅ wkj (n) =
k
∑δ
b [a − y j (n)][a + y j (n)] a
k (n) wkj (n) .
(7 )
k
3.2 Adaptive MTI Filter Based on Burg Algorithm
Hawkes and Haykin pointed out that most of clutters could be fitted by the low-rank auto regressive (AR) sequences. The coefficients of the AR model are determined by the kind of clutter and environment. The Maximum Entropy Method (MEM) of spectral estimation has the following power spectral expression: 2
P(ω ) = σ 2 A(e jω ) , A( z ) =
p
∑ a(i) z i =0
Fig. 3. FIR filter scheme
−i
.
(8)
92
Q. Ren et al.
Burg is one of the MEM algorithms, and it is equivalent to the AR model when a(0) = 1. To filter the clutter, an FIR filter is designed whose coefficients are just the coefficients (a_0, a_1, …, a_N) obtained by the Burg algorithm. The output of the filter is
y(n) = Σ_{k=0}^{N} a_k x(n−k).
The system equation is
H(z) = Σ_{k=0}^{N} a_k z^{−k}.
The frequency response is
H(e^{jω}) = Σ_{k=0}^{N} a_k e^{−jωk}.  (9)
Comparing equation (8) with (9), one can see that the zeros of the FIR filter frequency response are just the poles of the Burg spectral expression. Therefore the filter has the ideal frequency response, which is just the "inverted" clutter spectrum. Since the central frequency and bandwidth of the clutter are estimated by the Burg algorithm during the "learning period", the filter can be adaptively adjusted to the characteristics of the particular clutter spectrum.
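A minimal sketch of the filtering step is given below, assuming the coefficients (a_0, …, a_N), a_0 = 1, have already been produced by a Burg estimator during the learning period; the function name and the zero initial conditions are assumptions.

```python
import numpy as np

def mti_filter(echo, ar_coeffs):
    """Apply the AR coefficients (a_0, ..., a_N), a_0 = 1, as FIR taps:
    y(n) = sum_k a_k * x(n - k).  The FIR zeros coincide with the poles of
    the AR clutter model, so the clutter peak is notched out of the echo."""
    echo = np.asarray(echo, dtype=float)
    taps = np.asarray(ar_coeffs, dtype=float)
    return np.convolve(echo, taps, mode="full")[: len(echo)]
```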
3.3 Chaotic Neural Network as Spectrum Repair Module
The effect of noise, such as thermal noise, is that it expands the frequency spectrum peak of the target signal. This expansion brings difficulties and errors to the estimation of the distance and velocity information. One of the most useful functions of the chaotic neural network is its associative memory characteristic [10]. Because of its complex dynamics, the chaotic neural network has more memory capacity and error tolerance than the Hopfield neural network. In this paper, the chaotic neural network memorizes the ideal spectrum peaks, and associates the expanded spectrum peak to the most likely memorized ideal peak. The expression of the chaotic neural network is just like the network (2) introduced in Section 3.1. However, the output function adopts the sigmoid function
f(y) = (1 − exp(−λy)) / (1 + exp(−λy)),  (10)
where λ is the steepness parameter, and w_{ij} are the synaptic weights to the ith neuron from the jth neuron. The chaotic neural network memorizes T ideal spectrum peaks, and its learning rule adopts the Hebb rule
ω_{ij} = Σ_{p=1}^{T} (x_i^p − x̄_i)(x_j^p − x̄_j),  (11)
where x_i^p is the ith element of the pth memorized peak.
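A minimal sketch of the Hebb-rule memory (11) follows. The mean-centring reading of (11) matches the reconstruction above but is uncertain, and the zeroed diagonal is a common convention that the text does not state.

```python
import numpy as np

def hebb_weights(patterns):
    """Hebb-rule weights for the spectrum-repair memory, Eq. (11):
    w_ij = sum_p (x_i^p - mean_i)(x_j^p - mean_j) over the T stored ideal
    spectrum peaks.  Zeroing the diagonal (no self-feedback) is an assumed
    convention, not taken from the paper."""
    X = np.asarray(patterns, dtype=float)   # shape (T, N): one stored peak per row
    Xc = X - X.mean(axis=0)                 # subtract the per-element mean
    W = Xc.T @ Xc
    np.fill_diagonal(W, 0.0)
    return W
```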
4 Simulations
4.1 Simulations of the CNN Detecting Module
The main interference comes from the main spectral peak of the clutter. The central frequency spectral peaks of the clutter and the target are simulated, and the result is illustrated in Fig. 4. The solid line corresponds to the real signal, and the dashed line to the predictive signal. For (a) and (b), there is only the clutter central frequency in the radar echo; the predictive error in this case is very low, which implies that the CNN learns and reconstructs the clutter successfully. For (c) and (d), both the clutter and the target central frequencies are present in the radar echo; the predictive error is much higher than in the previous case, which implies that the CNN learns the clutter and detects the target signal successfully.
Fig. 4. The detection of the central frequency spectral peaks
4.2 Simulations of the Adaptive MTI Filter
Clutter can be filtered by the designed adaptive MTI filter. Fig. 5 illustrates the spectral estimation results of the radar echo and the filtered signal. The powers of the clutter and the target signal are equal, i.e., the Signal-to-Clutter Ratio (SCR) is 0 dB. For case (a), there is only clutter (central frequency 6 kHz) and no target signal in the radar echo; after filtering, the spectral peak of the clutter is removed. For case (b), both clutter (central frequency 6 kHz) and target signal (central frequency 12 kHz) are present in the radar echo; after filtering, the spectral peak of the clutter is removed and the spectral peak of the target signal is preserved.
Fig. 5. The spectrum of the radar echo and filtered signal
For case (c), the central frequency of the target signal is 48 kHz, and a similar result is obtained.
4.3 Simulations of the CNN Spectrum Repair Module
The associative memory of CNN is utilized to repair the expanded spectrum.
Fig. 6. The spectral repair effect of the CNN
Fig. 6 illustrates this effect. For case (a), the chaotic neural network memorizes one ideal peak (26.6kHz) as the sample, and repairs the expanded spectrum by associative memory. For case (b), the chaotic neural network memorizes six ideal peaks (8.9kHz, 17.8kHz, 26.6kHz, 35.5kHz, 44.4kHz, 53.3kHz) as the samples, and repairs the expanded spectrum by associative memory successfully.
5 Conclusion
In this paper we proposed a new scheme for adaptive radar target signal processing. The chaotic neural network is designed to reconstruct the chaotic clutter and to detect the target signal utilizing the Takens embedding theorem. After detection, the clutter is filtered by the Burg-algorithm-based adaptive MTI filter, and the distance and velocity information is obtained by Burg spectral estimation. Noise expands the spectral peak of the target signal; because of its complex dynamics, the chaotic neural network has more memory capacity and error tolerance than other neural networks. In the proposed scheme, the CNN module not only detects the target signal, but also repairs the frequency spectrum through its associative memory characteristic. The validity of the scheme is analyzed theoretically, and the simulation results show that it performs well against clutter and noise backgrounds. The adaptive method adopted in this paper facilitates radar design in complex environments.
References 1. Haykin, S., Puthusserypady, S.: Chaotic dynamics of sea clutter. Chaos 7 (1997) 777–802 2. Leung, H., Dubash, N., Xie, N.: Detection of small objects in clutter using a GA-RBF neural network. IEEE Trans. Aerosp. Electron. Syst. 38 (2002) 98–118
3. Haykin, S., Bakker, R., Currie, B.W.: Uncovering nonlinear dynamics-the case study of sea clutter. Proc. IEEE. 90 (2002) 860–881 4. Morrison, A.I., Srokosz, M.A.: Estimating the fractal dimension of the sea-surface—A 1st attempt. Annales Geophysicae-Atmospheres Hydrospheres and Space Sci. 11 (1993) 648–658 5. Hu, J., Tung, W.W., Gao, J.B.: Detection of low observable targets within sea clutter by structure function based multifractal analysis. IEEE Transactions on antennas and propagation 54 (2006) 136-143 6. Xiong, Z.L., Shi, X.Q.: A novel signal detection subsystem of radar based on HA-CNN. Lecture Notes in Computer Science 3174 (2004) 344-349 7. Huang, Y., Peng, Y.N.: Design of airborne adaptive recursive MTI filter for detecting targets of slow speed. IEEE National Radar Conference – Proceedings (2000) 215-218 8. Xiang, Y., Ma, X.Y.: AR model approaching-based method for AMTI filter design. Systems Engineering and Electronics 27 (2005) 1826-1830 9. Aihara, K., Takabe, T., Toyoda, M.: Chaotic neural networks. Physics Letters A 144 (1990) 333-340 10. Adachi, M., Aihara, K.: Associative dynamics in a chaotic neural network. Neural Networks 10 (1997) 83-98
Horseshoe Dynamics in a Small Hyperchaotic Neural Network
Qingdu Li1 and Xiao-Song Yang2
1 Department of Electronics and Information Engineering, Huazhong University of Science and Technology, Wuhan, 430074, China
[email protected]
2 Department of Mathematics, Huazhong University of Science and Technology, Wuhan, 430074, China
[email protected]
Abstract. This paper studies the hyperchaotic dynamics in a four-dimensional Hopfield neural network. A topological horseshoe on a three-dimensional block is found in a carefully chosen Poincaré section hyperplane of the ordinary differential equations. Numerical studies show that there exist two-directional expansions in this horseshoe map. In this way, a computer-assisted verification of hyperchaoticity of this neural network is presented by virtue of topological horseshoe theory.
1 Introduction
Among the various neural dynamics, deterministic chaos is of much interest: it has been regarded as a powerful mechanism for the storage, retrieval and creation of information in neural networks and has received considerable attention in recent years [1, 2, 3, 4]. Research in anatomy and physiology suggests trying to understand the emergent dynamical properties of a large network in terms of interacting smaller subnetworks [5, 6, 7, 8]. Therefore, a thorough investigation of the chaotic dynamics of small neural networks is significant for the study of brain functions and artificial neural networks [9, 10, 11, 4, 12, 13, 14, 15]. The existence of a horseshoe embedded in a dynamical system should be the most compelling signature of chaos, since it can be used to prove the existence of chaos, show the structure of chaotic attractors and reveal the mechanism inside chaotic phenomena. It is now well recognized that horseshoe theory with symbolic dynamics provides a powerful tool for rigorous studies of chaos [16, 17, 18, 19]. This tool has been successfully applied in studies of common chaos with one positive Lyapunov exponent in neural networks [10, 20, 21, 22]. In this paper, we use this tool to carry out a rigorous study of the hyperchaotic dynamics of a small neural network proposed in [15] by showing a topological horseshoe with two-directional expansion and presenting a computer-assisted verification of hyperchaoticity.
2 Horseshoe Dynamics in the Hyperchaotic Neural Network
In this section, we first recall a result of horseshoe theory, and then present our main result.
2.1 A Result of Topological Horseshoe
Let X be a metric space, let D be a compact subset of X, and let f : D → X be a map satisfying the assumption that there exist m mutually disjoint compact subsets D_1, D_2, …, D_m of D such that the restriction of f to each D_i, i.e., f|D_i, is continuous.

Definition 1. Let γ be a compact subset of D such that for each 1 ≤ i ≤ m, γ_i = γ ∩ D_i is nonempty and compact; then γ is called a connection with respect to D_1, D_2, …, D_m. Let F be a family of connections γ with respect to D_1, D_2, …, D_m satisfying the property γ ∈ F ⇒ f(γ_i) ∈ F. Then F is said to be an f-connected family with respect to D_1, D_2, …, D_m.

Theorem 1. Suppose that there exists an f-connected family F with respect to D_1, D_2, …, D_m. Then there exists a compact invariant set K ⊂ D such that f|K is semiconjugate to m-shift dynamics.

Here, the semiconjugacy is conventionally defined as follows.

Definition 2. Let X and Σ_m be topological spaces, and let f : X → X and σ : Σ_m → Σ_m be continuous functions. We say that f is topologically semiconjugate to σ if there exists a continuous surjection h : Σ_m → X such that f ∘ h = h ∘ σ.

Proposition 1. Let X be a compact metric space, and let f : X → X be a continuous map. If there exists an invariant set Λ ⊂ X such that f|Λ is semiconjugate to the m-shift dynamics σ|Σ_m, then
ent(f) ≥ ent(σ) = log m,  (1)
where ent(f) denotes the entropy of the map f. In addition, for every positive integer k,
ent(f^k) = k · ent(f).  (2)
For details about the proof of Theorem 1, see [19]; for details of symbolic dynamics and horseshoe theory, see [16].

2.2 Poincaré Map and Horseshoe
The dynamics of the 4D hyperchaotic Hopfield neural network can be described by the following ordinary differential equations:
ẋ_i = −c_i x_i + Σ_{j=1}^{4} w_{ij} tanh(x_j),  (3)
where W = (w_{ij}) is the connection matrix. When the parameters take
c_1 = c_2 = c_3 = 1, c_4 = 100 and
W = [ 1  0.5  −3  −1 ;  0  2.3  3  0 ;  3  −3  1  0 ;  100  0  0  170 ],
computer simulations show that (3) has an attractor, as illustrated in Fig. 1 [15]. Its Lyapunov exponents are 0.237, 0.024, −0.000 and −74.08, which suggests that the attractor is hyperchaotic. In what follows, we give a detailed discussion of a horseshoe embedded in this attractor.
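The attractor can be reproduced numerically with an off-the-shelf ODE solver. The sketch below uses the parameter values quoted above; the initial state, time span and solver tolerances are arbitrary illustrative choices, not taken from the paper.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Parameters of Eq. (3) as quoted in the text.
c = np.array([1.0, 1.0, 1.0, 100.0])
W = np.array([[1.0,   0.5, -3.0, -1.0],
              [0.0,   2.3,  3.0,  0.0],
              [3.0,  -3.0,  1.0,  0.0],
              [100.0, 0.0,  0.0, 170.0]])

def hopfield(t, x):
    """Right-hand side of the 4D Hopfield network, Eq. (3)."""
    return -c * x + W @ np.tanh(x)

# Integrate from an arbitrary initial state; the early transient would be
# discarded before plotting a phase portrait like Fig. 1.
sol = solve_ivp(hopfield, (0.0, 100.0), [0.1, 0.0, -0.1, 0.0],
                rtol=1e-8, atol=1e-10, dense_output=True)
```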
Fig. 1. The phase plot of (3) and the position of block a and block b
As shown in Fig. 1, we choose a 3D section P = {x_1 ∈ (−3.1, 2.8), x_2 ∈ (0, 0.8), x_3 ∈ (−0.19, 0.35)} in the hyperplane Q : x_4 = 1.1. The Poincaré map π : P → Q is defined as follows: for each x ∈ P, π(x) is taken to be the second return point in Q under the flow with the initial condition x. If κ is a subset of P, its image under the map π is denoted by κ′ = π(κ) in the following discussion. The following statement can be obtained by numerical computations on P.

Theorem 2. For the Poincaré map π corresponding to the cross section P, there exists a closed invariant set Λ ⊂ P for which π²|Λ is semiconjugate to the 2-shift dynamics, and ent(π) ≥ (1/2) log 2.

Proof. In view of Theorem 1, we only need to show that there exists a π²-connected family F with respect to two subsets of P. After a number of attempts, we find two subsets a and b of P with their eight vertices, in terms of (x_1, x_2, x_3), given by
A1 = (−0.29581779549343, −0.24019597042840, 0.09142880863296),
A2 = (−0.29542801291366, −0.25654351847307, 0.08383869638895),
A3 = (−0.29664321742707, −0.28604018124933, 0.10066504874467),
A4 = (−0.29726992667299, −0.27182492208005, 0.11184324741833),
A5 = (−0.29382176324355, −0.24020663129214, 0.09155427417961),
A6 = (−0.29343198066377, −0.25655417933681, 0.08396416193560),
A7 = (−0.29464718517719, −0.28605084211307, 0.10079051429132),
A8 = (−0.29527389442310, −0.27183558294379, 0.11196871296498),
B1 = (−0.30406472867537, −0.34778553401970, 0.21313753034174),
B2 = (−0.30304332426000, −0.32692785416949, 0.19847573234127),
B3 = (−0.30297948648404, −0.33166823595363, 0.19709871410172),
B4 = (−0.30423141620149, −0.35505411942204, 0.21523569550565),
B5 = (−0.30206868992023, −0.34779509551942, 0.21326298099230),
B6 = (−0.30104728550486, −0.32693741566922, 0.19860118299183),
B7 = (−0.30098344772890, −0.33167779745335, 0.19722416475228),
B8 = (−0.30223537744636, −0.35506368092176, 0.21536114615621),
on which π|a and π|b are both diffeomorphisms, as shown in Fig. 1. For block a, the top surface a_t = |A5 A6 A7 A8| is parallel to the bottom surface a_b = |A1 A2 A3 A4|; both are quadrangles, and the other four surfaces of a, called the side of a in the following discussion (indicated by a_s), are all parallelograms. Block b has the same structure. By means of interval analysis, their images under π are computed as in [23, 24] and shown in Figs. 2 and 3. From Fig. 2, it is easy to see that the Poincaré map π sends block a to its image a′ as follows: the top quadrangle a_t and the bottom quadrangle a_b of a are both expanded in two directions; they transversely intersect block a between a_t and a_b and intersect block b between b_t and b_b, while the side of a, i.e., a_s, is mapped outside of a_s and b_s, as shown in Fig. 2(b). In this case, for each subset of a, if it transversely intersects a between a_t and a_b, its image must transversely intersect block a and block b between their top and bottom surfaces; we say that the image a′ = π(a) wholly crosses a and b. Similarly, it is easy to see from Fig. 3 that the Poincaré map π sends block b to its image b′ as follows: the top quadrangle b_t and the bottom quadrangle b_b of b are both expanded in two directions and transversely intersect block a between a_t and a_b, while the side of b, i.e., b_s, is mapped outside of a_s, as shown in Fig. 3(b). In this case, we say that the image b′ = π(b) wholly crosses a. Since π|a and π|b are both diffeomorphisms, it is easy to find a sub-block ã of a and a sub-block b̃ of b such that ã and b̃ both wholly cross ã and b̃ under π², e.g., ã = π⁻¹(π(a) ∩ a) and b̃ = π⁻¹(π(b) ∩ a). Note that the subsets a and b are mutually disjoint, so ã and b̃ must also be mutually disjoint. It is then not hard to find a π²-connected family F with respect to ã and b̃. In view of Theorem 1, the Poincaré map π² is semiconjugate to a 2-shift map, and ent(π) ≥ (1/2) log 2. The global picture of the images π(a) and π(b) suggests that π|a and π|b both expand in two directions (corresponding to the two positive Lyapunov exponents).
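For readers who want a non-rigorous feel for the map π, the sketch below evaluates it with an ordinary ODE solver and event detection, reusing the `hopfield` right-hand side from the sketch above. The paper's actual verification relies on rigorous interval analysis; the crossing-direction convention and tolerances here are assumptions.

```python
import numpy as np
from scipy.integrate import solve_ivp

def poincare_map(x0, n_return=2, x4_sec=1.1):
    """Numerically evaluate the Poincare map pi: follow the flow of Eq. (3)
    from the 4D point x0 (on or near the section x4 = x4_sec) and return the
    n_return-th crossing of that hyperplane."""
    def cross(t, x):
        return x[3] - x4_sec
    cross.direction = 1          # count crossings in one direction only (assumed)
    sol = solve_ivp(hopfield, (0.0, 50.0), x0, events=cross,
                    rtol=1e-9, atol=1e-12)
    # Drop a possible spurious event at t = 0 if x0 lies exactly on the section.
    crossings = [y for t, y in zip(sol.t_events[0], sol.y_events[0]) if t > 1e-6]
    return crossings[n_return - 1]
```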
Fig. 2. a′ = π(a) wholly across a and b: (a) the 3D view; (b) the top view; (c) the side view
The local expansions of π on a and b can be partially confirmed by numerically studying the partial derivatives ∂π of π at randomly chosen points in the intersection set of blocks a and b and their images. We numerically find that the matrix ∂π has one eigenvalue lying in the interior of the unit circle and two eigenvalues located outside of the unit circle. This gives strong evidence that the attractor illustrated in Fig. 1 is hyperchaotic.
Fig. 3. b′ = π(b) wholly across a: (a) the 3D view; (b) the top view; (c) the side view
3 Conclusions
We have presented a 3D topological horseshoe in the small hyperchaotic neural network proposed in [15]. Numerical studies suggest that there exist two-directional expansions in this horseshoe map. In this way, a computer-assisted verification of hyperchaos has been provided by virtue of topological horseshoe theory, which is much more intuitive and convincing than the usual method of calculating Lyapunov exponents.
Acknowledgements
This work is supported in part by the Program for New Century Excellent Talents in University (NCET-04-0713), the National Natural Science Foundation of China (10672062) and the Doctoral Thesis Fund of Huazhong University of Science and Technology (D0640).
References 1. Elbert, T., Ray, W.J., Kowalik, Z.J., Skinner, J.E., Graf, K.E., Birbaumer, N.: Chaos and Physiology: Deterministic Chaos in Excitable Cell Assemblies. Physiological Reviews 74 (1994) 1–47 2. Freeman, W.J., Yao, Y.: Model of Biological Pattern Recognition with Spatially Chaotic Dynamics. Neural Networks 3 (1990) 153–170 3. Babloyantz, A., Lourenco, C.: Brain Chaos and Computation. Int. J. Neural Syst. 7 (1996) 461–471 4. Lewis, J.E., Glass, L.: Nonlinear Dynamics and Symbolic Dynamics of Neural Networks. Neural Computation 4 (1992) 621–642 5. Abeles, M.: Corticonics. Cambridge University Press, Cambridge (1991) ´ 6. Arbib, M.A., Erdi, P., Szent´ agothai, J.: Neural Organization - Structure, Function, and Dynamics. MIT Press, Massachusetts (1998) 7. Shepherd, G.M., ed.: The Synaptic Organization of the Brain Cortex. Ocford Univ. Press, New York (1990) 8. White, E.L.: Cortical Circuits: Synaptic Organization of the Cerebral Cortex Structure, Function and Theory. Birkh¨ auser, Boston (1989) 9. Pasemann, F.: Complex Dynamics and the Structure of Small Neural Networks. Network: Comput. Neural Syst. 13 (2002) 195–216 10. Guckenheimer, J., Oliva, R.A.: Chaos in the Hodgkin-Huxley Model. Siam J. Applied Dynamical Systems 1 (2002) 105–114 11. Das, A., Das, P., Roy, A.B.: Chaos in a Three-Dimensional General Model of Neural Network. I. J. Bifurcation and Chaos 12 (2002) 2271–2281 12. Bersini, H.: The Frustrated and Compositional Nature of Chaos in Small Hopfield Networks. Neural Networks 11 (1998) 1017–1025 13. Bersini, H., Sener, P.: The Connections between the Frustrated Chaos and the Intermittency Chaos in Small Hopfield Networks. Neural Networks 15 (2002) 1197–1204 14. Li, Q., Yang, X.S.: Complex Dynamics in a Simple Hopfield-Type Neural Network. In Wang, J., Liao, X., Yi, Z., eds.: Advances in neural networks - ISNN 2005. Volume 3496., New York, Springer-Verlag (2005) 357–360 15. Li, Q., Yang, X.S., Yang, F.: Hyperchaos in Hopfield-Type Neural Networks. Neurocomputing 67 (2005) 275–280 16. Wiggins, S.: Introduction to Applied Nonlinear Dynamical Systems and Chaos. Springer-Verlag, New York (1990) 17. Szymczak, A.: The Conley Index and Symbolic Dynamics. Topology 35 (1996) 287–299 18. Kennedy, J., Yorke, J.A.: Topological horseshoes. Transactions of The American Mathematical Society 353 (2001) 2513–2530 19. Yang, X.S., Tang, Y.: Horseshoes in Piecewise Continuous Maps. Chaos, Solitons and Fractals 19 (2004) 841–845
20. Li, Q., Yang, X.S.: Chaotic Dynamics in a Class of Three Dimensional Glass Networks. Chaos 16 (2006) 033101 21. Yang, X.S., Yang, F.: A Rigorous Verification of Chaos in an Inertial Two-Neuron System. Chaos, Solitons and Fractals 20 (2004) 587–591 22. Yang, X.S., Li, Q.: Horseshoe Chaos in Cellular Neural Networks. Int. J. Bifurcation and Chaos 16 (2006) 131–140 23. Zgliczy´ nski, P.: Computer Assisted Proof of Chaos in the R¨ ossler Equations and in the H´enon map. Nonlinearity 10 (1997) 243–252 24. Li, Q., Yang, X.S.: A Computer-Assisted Verification of Hyperchaos in the Saito Hysteresis Chaos Generator. J. Phys. A: Math. Gen. 39 (2006) 9139–9150
The Chaotic Netlet Map
Geehyuk Lee and Gwan-Su Yi
Information and Communications University, Daejeon 305-732, South Korea
[email protected], [email protected]
Abstract. The parametrically coupled map lattice (PCML) exhibits many interesting dynamical behaviors that are reminiscent of the adaptation and the learning of the neural network. In order for the PCML to be a model of the neural network, however, it is necessary to identify the biological counterpart of one-dimensional maps that constitute the PCML. One of the possible candidates is a netlet, a small population of randomly interconnected neurons, that was suggested to be a functional unit constituting the neural network. We studied the possibility of representing a netlet by a chaotic one-dimensional map and the result is the chaotic netlet map that we introduce in this paper.
1 Introduction
The coupled map lattice (CML) [1] is a mathematical model of a spatially extended dynamical system having discrete time, discrete space, and continuous states. In spite of its simplicity, the CML has been a successful tool for studying spatiotemporal chaos, which arises in many physical systems. More recently, the parametrically coupled map lattice (PCML) [2,3] was proposed as a model neural network capable of automatic adaptation and learning. While the PCML displays many unique dynamical behaviors that are reminiscent of the adaptation and learning of the neural network, it is not easy to relate the PCML to the neural network. Before we can relate them to each other, we need to identify the biological counterpart of the chaotic map that constitutes the PCML. One possible candidate is the chaotic neuron as suggested by many researchers including Aihara [4] and Farhat [5]. Another candidate is the netlet [6,7], a small population of randomly interconnected neurons that was suggested to be a functional unit constituting the neural network. Farhat [2] considered the netlet to be the biological counterpart of the chaotic map constituting the PCML. However, there has not yet been a satisfactory explanation for the linkage between the netlet and a chaotic map. This paper is about our effort to find such an explanation. We started with a review of Harth's one-dimensional map model of a netlet [6], and observed that, although the map model was successful in explaining certain collective behaviors of a netlet, it is inherently unable to model the chaotic aspect of a netlet. This led us to reconsider the dynamics of a netlet in a different time scale. The result was a
chaotic one-dimensional map model of a netlet, that we call the chaotic netlet map. A brief review of Harth’s map model will be given in Sect. 2, and the possibility of chaos in Harth’s map model will be discussed in Sect. 3. The derivation of the chaotic netlet map model will be given in Sect. 4, followed by a reminder of the assumptions made in the derivation of the new model.
2 The Netlet
Harth and others [6] suggested that the structure of the neural network may be approximated by a set of discrete populations of randomly interconnected neurons, which they named netlet. The concept netlet was an answer to the question of redundancy in the neural network. A netlet is a reliable functional unit made of many less reliable units, i.e., neurons. Due to the redundancy, a netlet does not require precise wiring between the constituent neurons. The connections between the neurons are determined by only a few probability parameters. Nevertheless, due to cooperative action among neurons, the netlet is much more reliable than the independent duplications of the equivalent number of the identical neurons. A detailed description of the mathematical model of a netlet is given by Anninos et al. [7]. In this section, the derivation of their map model of a netlet in the special case of no inhibitory neurons will be given. Consider a netlet consisting of N neurons. Each of them has μ afferent connections on average. The synaptic delay of a connection is identical throughout the netlet and is taken as the time unit. Assume that the absolute refractory period is longer than the synaptic delay, but shorter than twice the synaptic delay. Thus a neuron that fires at time n will be insensitive at n + 1, and fully recover at t = n + 2. Next, define the activity α(n) of a netlet as the fractional number of neurons firing at time n. If we assume that the integration time of the postsynaptic potentials is less than the synaptic delay, we can see that α(n + 1) depends only on α(n). The expectation value of α(n + 1) depends on α(n) as follows. Since each of the excitatory neurons has μ efferent connections, there are α(n)N μ excitatory postsynaptic potentials (EPSPs) at n + 1. Since the connections are assumed to be distributed uniformly, the expected number of EPSPs per neuron is α(n)μ. The probability that a neuron receive l EPSPs is given by the Poisson distribution in the limit of a large total number of EPSPs: pl =
(α(n)μ)^l / l! · e^{−α(n)μ}.  (1)
The probability P(α(n)) that a neuron receives EPSPs exceeding its threshold at t = n + 1 is then given by
P(α(n)) = Σ_{l=η}^{α(n)Nμ} p_l ≈ Σ_{l=η}^{∞} p_l = 1 − e^{−α(n)μ} Σ_{l=0}^{η−1} (α(n)μ)^l / l!,  (2)
where η is the minimum number of EPSPs necessary to trigger a neuron. The approximation here is possible because p_l is already negligibly small when l = α(n)Nμ.
Finally, because (1 − α(n)) is the fraction of neurons that are not in the refractory period at t = n + 1, the expectation value of α(n + 1) is given by
⟨α(n + 1)⟩ = (1 − α(n)) P(α(n)).  (3)
If we approximate α(n + 1) by its expectation value ⟨α(n + 1)⟩ and then use (2), we arrive at the following one-dimensional map:
α(n + 1) = (1 − α(n)) [ 1 − e^{−α(n)μ} Σ_{l=0}^{η−1} (α(n)μ)^l / l! ].  (4)
Figure 1 shows the return-maps of (4), the one-dimensional map model of a netlet, for several different values of μ and η. These return-maps explain that a netlet can have three different dynamical modes. When η = 1, there are two fixed points: one at the origin is a repeller and the other off the origin is an attractor. Any sequence {α(n)} will eventually settle down to the attractor. When η is greater than a certain threshold, the return-map will be contained below the diagonal line α(n + 1) = α(n). In this case, there is only one fixed point at the origin, which is an attractor, and any sequence {α(n)} will eventually converge to 0. When η is between 1 and the threshold, there can be three fixed points: two attractors and one repeller between them, and a sequence {α(n)} will converge to either attractor. Considering the dynamical characteristics of the one-dimensional map given by (4), we will call it the stable netlet map in the following.
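A minimal sketch of the stable netlet map (4) is given below; the function name, the example parameter values and the iteration loop are illustrative only.

```python
import numpy as np
from math import factorial

def stable_netlet_map(alpha, mu, eta):
    """Stable netlet map, Eq. (4): expected fractional activity at step n+1
    given the activity alpha at step n, mean connectivity mu and threshold eta."""
    x = alpha * mu
    below_threshold = np.exp(-x) * sum(x**l / factorial(l) for l in range(eta))
    return (1.0 - alpha) * (1.0 - below_threshold)

# Iterating from an arbitrary initial activity; the orbit settles onto one of
# the attracting fixed points described in the text.
a = 0.1
for _ in range(50):
    a = stable_netlet_map(a, mu=10, eta=2)
```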
3 The Netlet and Chaos
Harth provided some evidence of the chaotic behavior of a netlet from computer simulation of a netlet. On the other hand, the stable netlet map given by (4) cannot exhibit chaos, for the following reason. From the derivation of (4), we know that the second factor on the right side of (4) is a probability, which was denoted by P(α(n)) in (2), i.e.,
α(n + 1) = (1 − α(n)) P(α(n)).  (5)
If we take the derivative of the map function,
dα(n + 1)/dα(n) = −P(α(n)) + (1 − α(n)) P′(α(n)).  (6)
Regardless of the statistical reasoning used to evaluate P(·), P(·) is a probability and therefore cannot exceed 1. Since the effect of refractoriness is already taken into account by the first factor (1 − α(n)), the probability function P(·) must be an increasing function of α(n). Therefore,
dα(n + 1)/dα(n) ≥ −1 + (1 − α(n)) P′(α(n)) ≥ −1.  (7)
Fig. 1. The return-maps of the stable netlet map (panels for μ = 5, 10, 15 and 20): five curves for 5 different values of η (η = 1, 2, 3, 4 and 5, from the top) in each plot
The second inequality follows from (1 − α(n)) P′(α(n)) ≥ 0. The fact that the map function cannot have a derivative smaller than −1 places a strict limitation on the possibility of chaos in the stable netlet map. The map function may cross the line α(n + 1) = α(n) and have one or more fixed points, but none of them has the chance of a flip bifurcation that could develop into a chaotic attractor. (See [8] for a detailed description of bifurcation mechanisms in unimodal one-dimensional maps.) The conclusion is that we cannot revise the stable netlet map to derive a map that can exhibit chaos. We also studied other map models of the netlet, for instance the one by Usher [9], but arrived at basically the same conclusion. This is the consequence of using the absolute refractory period as the time unit of the map models. In this regard, we say that these map models are basically models of absolute refractoriness. We had to back off a step and look at the netlet again at a coarser time scale in order to free ourselves from the restriction of absolute refractoriness.
4 Chaotic Netlet Map
Consider a netlet of N neurons which are fully connected to one another. Full connection is not necessary in our derivation but will help keep our argument simple without affecting the validity of the final result. Let us begin by defining two basic time constants in the dynamics of a netlet.
– Let τ_r be the absolute refractory period of a neuron, which is assumed to be identical for all neurons in a netlet.
– Let τ_d be the pulse integration time of a neuron, which is also assumed to be identical for all neurons in a netlet.
In the derivation of the stable netlet map, τ_d was assumed to be of the same order as τ_r, and therefore τ_r played the role of the time unit. As we pointed out in the previous section, any effort to derive a map model of a netlet with τ_r as the time unit will lead to basically the same result as the stable netlet map. Therefore, we decided to examine the dynamics of a netlet in a different time scale. Our choice was to use τ_d as the time unit as before, but we assumed that τ_d is several times larger than τ_r, which is in fact no less acceptable than the assumption of the stable netlet map. In this case the dynamics of a netlet can be better described in terms of the number of pulses that a neuron generates than in terms of the number of active neurons in a netlet. We chose y_i(n), the average number of pulses generated by neuron i at time step n, to be the state variable of the target one-dimensional map. To avoid unnecessary complication by integer arithmetic, we assumed that the variable y_i(n) takes on a real value. Now the time unit and the state variable for a one-dimensional map are determined. The next step is to design a first-order map function for a netlet. At this point, it may be worth reviewing the history of the logistic map [10], since it is a model of a dynamical system that is also a kind of population, as a netlet is. The logistic map is a model of a population with the following two conditions:
1. There is a multiplying factor in the system. In the population dynamics of insects, a couple of insects gives birth to tens of offspring.
2. There is a resource constraint. In the population dynamics of insects, it is the limited supply of food.
We may be guided by these two conditions in our reasoning toward the development of a first-order map function for a netlet. We consider first a multiplying factor in a netlet and then two types of resource constraints in a netlet.
Multiplying factor: A neuron can deliver an output pulse to a number of postsynaptic neurons. If the postsynaptic neurons fire in response to the pulses with some probability, the net result is multiple pulses out of a single pulse. Suppose, for example, every neuron fires once at a certain time step in a netlet of 100 neurons (y_i(n) = 1). Since the netlet is fully connected, each neuron will receive 100 EPSPs on average. If a neuron fires with probability 0.1 for an incoming pulse, every neuron in the netlet will fire 10 times on average in the next time step (y_i(n+1) = 10), meaning multiplication of the firing frequency by 10. In more general terms, this multiplying dynamics can be stated as follows:
y_i(n + 1) = p Σ_{j=1}^{N} y_j(n),  (8)
where p is the probability for a neuron to fire in response to an incoming pulse. Using a mean-field argument, we replace the summation by N y_i(n). Then,
y(n + 1) = N p y(n).  (9)
The subscript on the state variable is dropped since individual neurons are no longer distinguished after the mean-field approximation.
Constraint by absolute refractoriness: The absolute refractory period limits the maximum number of pulses a neuron can generate in a unit time interval. In the current framework, τ_d is the time unit, and therefore the upper bound ŷ on the number of pulses that a neuron generates in a unit time is given by τ_d/τ_r. With this hard bound on the state variable, (9) becomes
y(n + 1) = min(ŷ, N p y(n)).  (10)
Metabolic constraint: In addition to the hard constraint by the absolute refractory period, there are many environmental factors that can affect the efficiency of a neuron. For instance, ions and neurotransmitters are essential in the relay of signals between neurons, and therefore their varying availability and activity in a netlet can be one of the factors that control the efficiency of a neuron. A cellular energy source such as adenosine triphosphate (ATP) can be a main environmental factor of neuronal activity. ATP is necessary for most cellular signal transduction and for the active transport of ions involved in the process of neuronal pulse generation. In particular, active transport is needed for the polarization and repolarization of a neuron, which can affect the efficiency of pulse generation directly and possibly set the upper bound on the number of pulses as well. Another issue to raise here is that various external and internal factors of a neuron can invoke temporal and localized changes of the ATP level, which can be a source of inconsistent neuronal activity. (See [11] for a detailed description of the role of ATP in neuronal signal transduction.) At present, however, we would not be able to describe the exact mechanism of the environmental factors on this process without further experimental evidence. It is inevitable to leave it as an assumption which needs to be justified in the future. The efficiency of a neuron is represented by the parameter p in (8). The parameter p is proportional to the environmental condition, i.e., p = p_o z, where z ∈ [0, 1] represents the environmental condition, and p_o is the value of p when the environment is in the best condition (z = 1). Since every firing of a neuron consumes some amount of resource in the environment, we may write p in the following form: p = p_o (1 − b y(n)), where b is a small positive constant. A flaw with this form of p is that p can become 0, which never occurs in an open system like a netlet. A remedy to this flaw is to replace (1 − b y(n)) by e^{−b y(n)}, which approximates (1 − b y(n)) well when b y(n) is small, and approaches 0 as b y(n) becomes larger, but never becomes 0. With this exponential factor incorporated, (9) now becomes
y(n + 1) = min(ŷ, a y(n) e^{−b y(n)}),  (11)
Fig. 2. The chaotic netlet map: the firing rate y(n) is multiplied by the population size N, attenuated by the metabolic constraint, and finally hard-limited by absolute refractoriness before it becomes y(n + 1)
Fig. 3. The return-maps of the chaotic netlet map: four plots for β = 5, 6, 7 and 8, and four curves in each plot for α = 5, 10, 15 and 20 (from the lowest curve)
where a ≡ N po is called the multiplying factor of a netlet and b is the resource factor of a neuron. Figure 2 is a graphical representation of (11). y(n) pulses generated by a neuron result in N y(n) EPSPs. A receiving neuron cannot fire in response to every incoming pulse, but its responsiveness depends on the available metabolic resource left unused in the previous time step, which is modeled by
Fig. 4. The bifurcation diagrams of the chaotic netlet map with the parameter α being the bifurcation parameter: four diagrams for β = 5, 6, 7 and 8
the factor e−byi (n) . Finally, the number of pulses are hard-limited by yˆ due to the absolute refractoriness of a neuron. We introduce a normalized state variable x(n) = y(n) yˆ to convert the map into a form better suited for comparison with other well-known maps defined on the unit interval. In terms of the normalized variable, (11) becomes x(n + 1) = min 1, αx(n)e−βx(n) , (12) where α ≡ ayˆ and β ≡ bˆ x. The meaning of β can be understood when we note that −β e is the minimum responsiveness of a neuron after it experiences the maximum activity, which is allowed by the absolute refractory period of a neuron, in the previous time step. Since the resulting one-dimensional map given by (12) is a model of the chaotic aspect of a netlet, we named it the chaotic netlet map. Figure 3 shows the return-maps of the chaotic netlet map: four plots for β = 5, 6, 7 and 8, and four curves for α = 5, 10, 15 and 20 in each plot. From these return-maps, we can expect the bifurcation pattern of the map will be similar to that of the logistic map since they have an unstable fixed point at the origin and another fixed point with a negative slope. The multiplying factor α can be used to change the slope like the μ-parameter of the logistic map. Actually, the bifurcation diagrams of the chaotic netlet map shown in Fig. 4 are similar to that
of the logistic map except for the disappearance of the chaotic orbits for large values of α, due to the clipping of the return-map by the absolute refractoriness.
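A minimal sketch of the chaotic netlet map (12) and of a crude bifurcation scan in the spirit of Fig. 4 follows; the scanned range of α, the fixed β, the initial condition and the transient length are assumptions, since the exact values used for the figure are not recoverable from the text.

```python
import numpy as np

def chaotic_netlet_map(x, alpha, beta):
    """Chaotic netlet map, Eq. (12): x(n+1) = min(1, alpha*x(n)*exp(-beta*x(n)))."""
    return min(1.0, alpha * x * np.exp(-beta * x))

# Crude bifurcation scan over alpha at fixed beta (ranges are assumed).
beta = 7.0
diagram = []
for alpha in np.linspace(2.0, 60.0, 200):
    x = 0.1
    for _ in range(500):                 # discard the transient
        x = chaotic_netlet_map(x, alpha, beta)
    orbit = []
    for _ in range(100):                 # record the asymptotic orbit
        x = chaotic_netlet_map(x, alpha, beta)
        orbit.append(x)
    diagram.append((alpha, orbit))
```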
5 Conclusions
We showed that, when the integration time of a neuron is several times larger than its absolute refractory period and when we look at the dynamics of a netlet on such a coarser time scale, a netlet can be represented by a chaotic one-dimensional map that is similar in form and behavior to the logistic map. It seems that our initial goal of deriving a chaotic one-dimensional map model of a netlet has been achieved, but it should be remembered that we left many assumptions unverified. Among others, we still need to come up with evidence of a biological mechanism that can explain the resource constraint in a netlet. Also, it should be noted that the new map model exhibits chaos only if the parameters α and β are chosen properly. We have yet to check the validity of the ranges of the parameters of the chaotic netlet map from the biological point of view. The values of the parameter α seem to be acceptable since the total number of neurons N is usually much larger than the number of pulses required for a neuron to fire. On the other hand, the validity of the values of β used in Fig. 4 needs further examination.
References 1. Kaneko, K.: Theory and Applications of Coupled Map Lattices, Chichester : New York : John Wiley & Sons (1993) 2. Farhat, N.H., Hernandez, E.D.M., Lee, G.: Strategies for autonomous adaptation and learning in dynamical networks, In: IWANN ‘97. (1997) 3. Lee, G., Farhat, N.H.: Parametrically coupled sine map networks, International Journal of Bifurcation and Chaos 11 (2001) 1815–1834 4. Aihara, K., Takabe, T., Toyoda, M.: Chaotic neural networks, Physics Letters A 144 (1990) 333–340 5. Farhat, N.H., Eldefrawy, M.: The bifurcating neuron, In: Digest Annual OSA Meeting, San Jose, CA. (1991) 10–10 6. Harth, E.M., Csermely, T.J., Beek, B., Lindsay, R.D.: Brain functions and neural dynamics, Journal of Theoretical Biology 26 (1970) 93–120 7. Anninos, P.A., Beek, B., Csermely, T.J., Harth, E.M., Pertile, G.: Dynamics of neural structures, Journal of Theoretical Biology 26 (1970) 121–148 8. Hilborn, R.C.: Chaos and Nonlinear Dynamics, Oxford University Press, New York (1994) 9. Usher, M., Schuster, H.G., Niebur, E.: Dynamics of populations of integrated-andfire neurons, partial synchronization and memory, Neural Computation 5 (1993) 570–586 10. May, R.M.: Simple mathematical models with very complicated dynamics, Nature 26 (1976) 459–467 11. Nicholls, J.G., Martin, A.R., Wallace, B.G.: From Neuron to Brain. 3rd edn, Sinauer, Sunderland, MA (1992)
A Chaos Based Robust Spatial Domain Watermarking Algorithm
Xianyong Wu1,2, Zhi-Hong Guan1, and Zhengping Wu1
1 Department of Control Science and Engineering, Huazhong University of Science and Technology, Wuhan, 430074, China
2 School of Electronics and Information, Yangtze University, Jingzhou, Hubei, 434023, China
[email protected]
Abstract. This paper presents a novel spatial domain watermarking scheme based on chaotic maps. Two chaotic maps are employed in our scheme, which is different from most of the existing chaotic watermarking methods, 1-D Logistic map is used to encrypt the watermark signal, and generalized 2-D Arnold cat map is used to encrypt the embedding position of the host image. Simulation results show that the proposed digital watermarking scheme is effective and robust to commonly used image processing operations.
1 Introduction
Recently, a new information security technology, information hiding, has become a major concern; it includes digital watermarking and steganography. Many different watermarking schemes have been proposed in recent years [1-4], which can be classified into two categories: spatial domain [5] and frequency domain [6-9] watermarking. Spatial domain watermarking shows that a large number of bits can be embedded without incurring noticeable visual artifacts, whereas frequency domain watermarking has been shown to be quite robust against JPEG compression, filtering, noise pollution and so on. In most spatial domain schemes, the watermark signal is embedded in the LSB (least significant bit) of the pixels in the host image, but the robustness against attacks is weak and the watermark can be detected easily. Therefore, many new schemes based on the LSB algorithm have been proposed to improve the robustness, but they are not secure enough. In [10], for example, a hash function is employed to improve the security of the watermarking algorithm. In [11], a digital signature approach that will not degrade the quality of the host image is proposed, but a mapping table is needed to record the embedding position, which increases the complexity of the algorithm. In this paper, the 1-D Logistic map is used to encrypt the watermark signal; to spread the watermark signal chaotically over all regions of the host image, the 2-D Arnold cat map is employed to shuffle the embedding positions of pixels in the host image, which ensures the security of our scheme; another chaotic sequence is generated to locate the bit of the pixel in the host image, and the watermark bits are used to randomly modify the 3rd, 4th, 5th or 6th bit of the
corresponding shuffled pixels in host image randomly, which further enhances the robustness and security of the proposed scheme.
2 Chaos and Its Application in Watermarking
2.1 Encryption to Watermark Signal
Due to their extreme sensitivity to initial conditions and the spreading of orbits over the entire space, chaotic maps are widely used for watermarking and encryption. To ensure the security of the watermarking scheme, the watermark is encrypted before embedding. First, the watermark signal is encoded into binary bit streams using ASCII codes; then a random-like, uncorrelated and deterministic chaotic sequence is created by the 1-D Logistic map, and the initial condition and parameters of the chaotic map are kept as the secret key; next, the encoded watermark bits are encrypted by the chaotic sequence. Therefore, a number of uncorrelated, random-like and reproducible encrypted watermarking signals are generated. A commonly used chaotic map is the Logistic map, which is described by:
z_{n+1} = μ z_n (1 − z_n),  (1)
where z_n ∈ (0, 1), μ ∈ (0, 4]. When μ > 3.5699456, the sequence iterated with an initial value is chaotic, and different sequences will be generated with different initial values. The encryption formula is as follows:
we_n = w ⊕ c_n,  (2)
where we_n is the n-th encrypted watermark signal, w is the original watermark signal, and c_n is the chaotic sequence.
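A minimal sketch of the encryption step (1)–(2) follows. The paper does not state how the real-valued chaotic sequence c_n is turned into bits before the XOR, so the thresholding rule below, as well as the illustrative initial value, are assumptions.

```python
import numpy as np

def logistic_bits(z0, mu, n, split=0.5):
    """Generate a binary key stream from the Logistic map, Eq. (1),
    by thresholding the iterates (the thresholding rule is an assumption)."""
    z, bits = z0, []
    for _ in range(n):
        z = mu * z * (1.0 - z)
        bits.append(1 if z > split else 0)
    return np.array(bits, dtype=np.uint8)

def encrypt_watermark(w_bits, z0=0.3, mu=4.0):
    """Encrypt the binary watermark by XOR with the chaotic bit stream, Eq. (2).
    z0 = 0.3 is an illustrative key value, not the one used in the paper."""
    c = logistic_bits(z0, mu, len(w_bits))
    return np.bitwise_xor(np.asarray(w_bits, dtype=np.uint8), c)
```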
2.2 Encryption to Embedding Position of Watermark
In order to shuffle the embedding positions of the host image, the 2-D Arnold cat map [12] is adopted in our scheme, which is described by
x_{n+1} = (x_n + y_n) mod 1,  y_{n+1} = (x_n + 2 y_n) mod 1,  (3)
where the notation "x mod 1" denotes the fractional part of a real number x, obtained by adding or subtracting an appropriate integer. Therefore, (x_n, y_n) is confined to the unit square [0, 1] × [0, 1]. Writing formula (3) in matrix form, we obtain
[x_{n+1}; y_{n+1}] = [1 1; 1 2] [x_n; y_n] mod 1 = A [x_n; y_n] mod 1.  (4)
A unit square is first stretched by linear transformation and then folded by modulo operation, so the cat map is area preserving, the determinant of its linear
transformation matrix |A| is equal to 1. The map is known to be chaotic. In addition, it is a one-to-one map: each point of the unit square is uniquely mapped onto another point in the unit square. Hence, watermark pixels at different positions will get different embedding positions. The cat map above can be extended as follows: first, the phase space is generalized to {0, 1, 2, …, N − 1} × {0, 1, 2, …, N − 1}, i.e., only the non-negative integers from 0 to N − 1 are taken; then equation (4) is generalized to the 2-D invertible chaotic map
[x_{n+1}; y_{n+1}] = [a b; c d] [x_n; y_n] mod N = A [x_n; y_n] mod N,  (5)
where a, b, c and d are positive integers and |A| = ad − bc = 1; therefore, only three of the four parameters a, b, c and d are independent under this condition. The generalized cat map (5) also has chaotic characteristics. By using the generalized cat map (5), we can obtain the embedding positions of the watermark pixels: the coordinate (i, j) of a watermark pixel serves as the initial value, the three independent parameters and the iteration number n serve as the secret key, and after n rounds of iteration the result (x_n, y_n) serves as the embedding position of the watermark pixel at (i, j). When the iteration number n is big enough, two arbitrary adjacent watermark pixels will be separated widely in the host image, and different watermark pixels will get different embedding positions. To locate the pixel bits to be embedded in the host image, the 1-D Logistic map is used once more in our approach. Because the chaotic sequence is distributed over the interval (0, 1) and is non-periodic, the interval (0, 1) can be divided into several subintervals which correspond to different pixel bits of the host image.
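A minimal sketch of the position scrambling with the generalized cat map (5) follows; the function name is illustrative, and the choice d = 7 in the usage line is simply one value satisfying ad − bc = 1 for the parameters a = 1, b = 2, c = 3 used later in the simulations.

```python
def cat_map_position(i, j, A, n_iter, N):
    """Iterate the generalized Arnold cat map, Eq. (5), n_iter times to map a
    watermark coordinate (i, j) to an embedding position (x, y) in an N x N
    host image.  A = ((a, b), (c, d)) must satisfy a*d - b*c == 1."""
    (a, b), (c, d) = A
    assert a * d - b * c == 1, "cat map must be area preserving"
    x, y = i, j
    for _ in range(n_iter):
        x, y = (a * x + b * y) % N, (c * x + d * y) % N
    return x, y

# Example with the simulation parameters a=1, b=2, c=3 (so d=7), n=20, N=256.
x, y = cat_map_position(2, 3, ((1, 2), (3, 7)), 20, 256)
```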
3 Watermark Embedding and Extraction
3.1 Watermark Embedding
Let the binary watermark of size M_1 × M_2 be denoted as W = {w(i, j), 1 ≤ i ≤ M_1, 1 ≤ j ≤ M_2}, and the original host image of size N_1 × N_2 be denoted as F = {f(x, y), 1 ≤ x ≤ N_1, 1 ≤ y ≤ N_2}, where (i, j) and (x, y) represent pixel coordinates of the binary watermark image and the original host image, respectively, w(i, j) ∈ {0, 1} and f(x, y) ∈ {0, 1, …, 2^L − 1} represent the pixel values of the watermark and the host image, respectively, and L denotes the number of bits of a gray-image pixel. For simplicity, let M_1 = M_2 = M and N_1 = N_2 = N. Watermark bits (1 bit per pixel) are embedded randomly into pixel bits of the host image for security (to avoid unauthorized extraction); the embedding position (x, y) is calculated by formula (5). Therefore, different watermark positions (i, j) will be mapped onto different embedding positions
(x, y). Since the Arnold cat map is a one-to-one map, a record table is not needed to record colliding positions in our algorithm. The watermark bit at position (i, j) is embedded into the k-th bit of position (x, y) in the host image, where k = 3, 4, 5, 6 is determined by the subinterval of z_n generated by the Logistic map (1); the bit position to be embedded is thus located by the coordinate (x, y) and k. Let the pixel of the watermarked image be denoted f′(x, y). If w(i, j) is the same as the k-th bit of f(x, y), then f′(x, y) = f(x, y), i.e., the pixel value is kept unchanged; otherwise, the k-th bit of f(x, y) is substituted by w(i, j). The watermark embedding algorithm can be described as follows.
Step 1: Encrypt the watermark signal with the Logistic chaotic sequence to obtain an encrypted watermark signal.
Step 2: Designate the three independent parameters of the Arnold cat map, the initial value (i, j), the iteration number n, and the initial value z_0 of the Logistic map.
Step 3: For watermark pixel w(i, j), let x_0 = i, y_0 = j, and iterate n times by (5) to obtain (x, y). Then let i = i + 1, j = j + 1.
Step 4: Perform a Logistic iteration to obtain a real value z_n ∈ (0, 1), then determine k: z_n ∈ (0, 0.25], k = 3; z_n ∈ (0.25, 0.5], k = 4; z_n ∈ (0.5, 0.75], k = 5; z_n ∈ (0.75, 1), k = 6. Then find the k-th bit b_k of f(x, y).
Step 5: If w(i, j) = b_k, then f′(x, y) = f(x, y); otherwise, replace the k-th bit of f(x, y) with w(i, j) to obtain f′(x, y).
Step 6: Take the next watermark pixel; repeat Step 3 through Step 5 until all watermark pixels are embedded.
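The bit-level operation in Step 5 of embedding (and Step 4 of the extraction procedure below) can be sketched as follows; counting the k-th bit from the least significant bit is an assumption, since the text does not fix the convention.

```python
def embed_bit(pixel, bit, k):
    """Replace the k-th bit of an 8-bit pixel with the watermark bit
    (k counted from the least significant bit as bit 1 -- an assumed convention)."""
    mask = 1 << (k - 1)
    return (pixel & ~mask & 0xFF) | (mask if bit else 0)

def extract_bit(pixel, k):
    """Read the k-th bit back out of the watermarked pixel."""
    return (pixel >> (k - 1)) & 1
```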
3.2 Watermark Extraction
Watermark extraction is just the inverse of the embedding algorithm. The key parameters, the initial value z_0 and the watermark length are needed for extraction. Let w′(i, j) denote the watermark pixel to be extracted; the extraction algorithm can be described as follows.
Step 1: Designate the three independent parameters of the Arnold cat map, the initial value (i, j), the iteration number n, and the initial value z_0 of the Logistic map.
Step 2: For the watermark pixel to be extracted w′(i, j), let x_0 = i, y_0 = j, and iterate n times by formula (5) to obtain (x, y). Then let i = i + 1, j = j + 1.
Step 3: Perform the same Logistic iteration as in the embedding process to obtain z_n: z_n ∈ (0, 0.25], k = 3; z_n ∈ (0.25, 0.5], k = 4; z_n ∈ (0.5, 0.75], k = 5; z_n ∈ (0.75, 1), k = 6.
Step 4: Take the k-th bit p_k of f′(x, y) to obtain the encrypted watermark bit w′(i, j) = p_k.
Step 5: Repeat Step 2 through Step 4 until all the watermark pixels w′(i, j) (i, j = 1, 2, …, M) are extracted.
Step 6: Decrypt the encrypted watermark to obtain the original watermark.
4 Simulation Results
To demonstrate the effectiveness of the proposed algorithm, MATLAB simulations are performed using the 256 × 256 pixel gray-level "peppers" image and a 64 × 64 pixel binary watermark logo "HUST". The three independent parameters and the initial value of the Arnold cat map are chosen as a = 1, b = 2, c = 3 and (x_0, y_0) = (2, 3), respectively, and the iteration number is n = 20; the parameter and the initial value of the Logistic map are chosen as μ = 4 and z_0 = 0.5, respectively. The watermark bits are embedded randomly into the 3rd, 4th, 5th or 6th bits of the pixel at position (x, y) in the host image.
Fig. 1. Demonstration of invisibility: (a) Original "peppers" image; (b) Watermark logo "HUST"; (c) Watermarked image; (d) Extracted watermark logo
Fig. 1 demonstrates the invisibility of the watermark. Figures 1(a) and 1(b) show the original host image and the binary watermark logo, respectively; 1(c) and 1(d) show the watermarked image (PSNR = 47.25 dB) and the extracted watermark logo "HUST", respectively. One can see that the watermark is perceptually invisible. Fig. 2 demonstrates the robustness of our algorithm. Figures 2(a)–2(e) show the JPEG-compressed watermarked image with quality = 10, the watermarked image after 5×5 median filtering, the watermarked image polluted by additive Gaussian noise (0, 0.01), the watermarked image with a quarter cropped at the upper left corner, and the watermarked image with 2° rotation, respectively; 2(f)–2(j) show the corresponding extracted watermark logos. The results show that the recovered watermark logos remain recognizable even when the watermarked image has suffered severe attacks.
Fig. 2. Demonstration of robustness: (a) JPEG compressed (quality = 10) watermarked image; (b) Watermarked image after 5×5 median filtering; (c) Noisy watermarked image (0, 0.01); (d) Watermarked image with one quarter cropped; (e) Rotated watermarked image (2°); (f)–(j) the corresponding extracted watermark logos
5 Conclusions
In this paper, a novel spatial domain watermarking algorithm based on the Logistic map and the Arnold cat map is proposed. The embedding positions of the watermark signal are encrypted by the 2-D Arnold cat map, and the pixel bit to be embedded in the host image is
determined by the 1-D Logistic chaotic map. Computer simulations show that the scheme is secure and robust to commonly used image processing operations.
Acknowledgment
This work is supported by the National Natural Science Foundation of China under Grants 60573005 and 60603006.
References 1. Hsuct, W.: Hidden Digital Watermarks in Images. IEEE Trans. Image Processing 8 (1999) 58-68 2. Cox, I. J., Miller, M.L., Bloom, J.A.: Digital Watermarking. Academic Press, New York (2002) 3. Lee, C.H., Lee, Y.K.: An Adaptive Digital Image Watermarking Technique for Copyright Protection. IEEE Trans. Consumer Electronics 45 (1999) 1005-1015 4. Zhang, J.S., Tian, L.H., Tai, M.: A New Watermarking Method Based on Chaotic Maps. IEEE international Conference on Multimedia and Expo (2004) 939-942 5. Bender, W.R., Gruhl, D., Morimoto, N.: Techniques for Data Hiding. In Proc. SPIE: Storage and Retrieva1 of Image and Video Databases 2420 (1995) 164-173 6. Barni, M., Bartolini, F., Cappellini, V., Piva, A.: A DCT-domain System for Robust Image Watermarking. Signal Processing 66 (1998) 357-372 7. Lin, S.D., Chen, C.F.: A DCT Based Image Watermarking with Threshold Embedding. Int. J. of Comp. and Applications 25 (2003) 130-l35 8. Zhao, D.W., Chen, G.R., Liu, W.B.: A Chaos-Based Robust Wavelet Domain Watermarking Algorithm. Chaos Solitons & Fractals 22 (2004) 47 9. Lu, W., Lu, H.T., Chung, F.L.: Chaos-Based Spread Spectrum Robust Watermarking in DWT Domain. Proceedings of the 4th International Conference on Machine Learning and Cybernetics (2005) 5308-5313 10. Hwang, M.S., Chang, C.C., Hwang, K.F.: A Watermarking Technique Based on One-way Hash Function. IEEE Trans. Consumer Electronics 45 (1999) 286-294 11. Chang, C., Hsiao, J., Chiang, C.: An Image Copyright Protections Scheme Based on Torus Automorphism. Proc. of the IEEE (2002) 217-224 12. Kohda, T., Aihara, K.: Chaos in Discrete Systems and Diagnosis of Experimental Chaos. Transactions of IEICE E 73 (1990) 772-783
Integrating KPCA and LS-SVM for Chaotic Time Series Forecasting Via Similarity Analysis
Jian Cheng1, Jian-sheng Qian1, Xiang-ting Wang1, and Li-cheng Jiao2
1 School of Information and Electrical Engineering, China University of Mining and Technology, 221116, Xuzhou, China
2 Institute of Intelligent Information Processing, Xidian University, 710071, Xi'an, China
[email protected]
Abstract. A novel approach is presented to reconstruct the phase space using kernel principal component analysis (KPCA) with similarity analysis and to forecast chaotic time series with a least squares support vector machine (LS-SVM) in that phase space. A three-stage architecture is proposed to improve prediction accuracy and generalization performance for chaotic time series forecasting. In the first stage, KPCA is adopted to extract features and obtain the kernel principal components. In the second stage, the similarity between each principal component and the output variable is analyzed, and some principal components are chosen to construct the phase space of the chaotic time series according to their similarity degree to the model output. LS-SVM is employed in the third stage to forecast the chaotic time series. The method was evaluated experimentally on coal mine gas concentration data. The simulation shows that LS-SVM with phase space reconstruction using KPCA with similarity analysis performs much better than that without similarity analysis.
1 Introduction

Interest in chaotic time series forecasting has been increasing; however, most practical time series are of a nonlinear and chaotic nature, which makes conventional linear prediction methods inapplicable. Although neural networks have been developed for chaotic time series prediction, some inherent drawbacks, e.g., the multiple-local-minima problem, the choice of the number of hidden units, and the danger of overfitting, make it difficult to put neural networks into practice. The support vector machine (SVM), established on the theory of the structural risk minimization principle [1], usually achieves higher generalization performance than traditional neural networks, which implement the empirical risk minimization principle, in solving many machine learning problems. Another key characteristic of the SVM is that training an SVM is equivalent to solving a linearly constrained quadratic programming problem, so the solution of the SVM is always unique and globally optimal. The least squares support vector machine (LS-SVM) [2], a new kind of SVM, is easier to use than the usual SVM, so LS-SVM is employed here to forecast chaotic time series. In developing an LS-SVM model for chaotic time series, the first important step is to reconstruct the embedding phase space. The traditional time series phase space
reconstruction usually adopts the coordinate delay method, whose key is to ascertain the embedding dimension and the time delay [3]. The G-P algorithm [4] and the FNN (false nearest neighbors) method [5] can ascertain the embedding dimension. Besides being time-consuming, their most serious problem is that there may be correlation between different features in the reconstructed phase space, which degrades the quality of the phase space and the modeling effect. Principal component analysis (PCA) is a well-known method for feature extraction, which acquires the embedding dimension from the time series directly, but PCA is a linear method in nature [6]. Kernel principal component analysis (KPCA) is a nonlinear extension of PCA obtained by introducing the kernel method [7]: it first maps the original input space into a high-dimensional feature space using the kernel method and then performs PCA in that feature space. Linear PCA in the high-dimensional feature space corresponds to a nonlinear PCA in the original input space. This paper proposes a phase space reconstruction method based on KPCA with similarity analysis in order to improve the quality of the phase space and the accuracy of chaotic time series modeling. On the basis of KPCA, some kernel principal components are chosen according to their similarity degree to the model output and are used to reconstruct the final phase space of the chaotic time series. The reconstructed phase space is then used as the input space of the LS-SVM to realize chaotic time series forecasting. By examining the performance in forecasting coal mine gas concentration, the simulation shows that LS-SVM with phase space reconstruction combining KPCA with similarity analysis performs much better than that without similarity analysis. The rest of this paper is organized as follows. Section 2 presents the phase space reconstruction of chaotic time series based on KPCA. In Section 3, reducing the dimension of the phase space is presented. The architecture and algorithm are given in Section 4. Section 5 presents the results and discussion of the experimental validation. Finally, some concluding remarks are drawn in Section 6.
2 Phase Space Reconstruction Based on KPCA

Given a set of centered chaotic time series $x_k$ ($k = 1, 2, \ldots, l$, with $\sum_{k=1}^{l} x_k = 0$), the basic idea of KPCA is to map the original input vectors $x_k$ into a high-dimensional feature space $\Phi(x_k)$ and then to perform linear PCA on $\Phi(x_k)$. By mapping $x_k$ into $\Phi(x_k)$, KPCA solves the eigenvalue equation

$$\lambda_i u_i = \tilde{C} u_i, \quad i = 1, 2, \ldots, l, \qquad (1)$$

where $\tilde{C} = \frac{1}{l}\sum_{k=1}^{l} \Phi(x_k)\Phi(x_k)^T$ is the sample covariance matrix of $\Phi(x_k)$, $\lambda_i$ is one of the non-zero eigenvalues of $\tilde{C}$, and $u_i$ is the corresponding eigenvector. Equation (1) can be transformed into the eigenvalue equation

$$\tilde{\lambda}_i \alpha_i = K \alpha_i, \quad i = 1, 2, \ldots, l, \qquad (2)$$

where $K$ is the $l \times l$ kernel matrix. Each element of $K$ equals the inner product of two high-dimensional feature vectors $\Phi(x_i)$ and $\Phi(x_j)$, that is, $K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$. The advantage of using $K$ is that one can deal with $\Phi(x_k)$ of arbitrary dimensionality without having to compute $\Phi(x_k)$ explicitly, since all dot products $(\Phi(x_i)\cdot\Phi(x_j))$ are replaced by the kernel function $K(x_i, x_j)$; the mapping of $x_k$ to $\Phi(x_k)$ is therefore implicit. $\tilde{\lambda}_i$ is one of the eigenvalues of $K$, satisfying $\tilde{\lambda}_i = l\lambda_i$, and $\alpha_i$ is the corresponding eigenvector of $K$, satisfying $u_i = \sum_{j=1}^{l} \alpha_i(j)\Phi(x_j)$, where $\alpha_i(j)$, $j = 1, 2, \ldots, l$, are the components of $\alpha_i$. To ensure that $u_i$ has unit length, each $\alpha_i$ must be normalized using the corresponding eigenvalue:

$$\tilde{\alpha}_i = \alpha_i / \sqrt{\tilde{\lambda}_i}, \quad i = 1, 2, \ldots, l. \qquad (3)$$

Based on the estimated $\tilde{\alpha}_i$, the principal components of $x_k$ are calculated by

$$s_k(i) = u_i^T \Phi(x_k) = \sum_{j=1}^{l} \tilde{\alpha}_i(j) K(x_j, x_k), \quad i = 1, 2, \ldots, l. \qquad (4)$$

In addition, to enforce $\sum_{k=1}^{l} \Phi(x_k) = 0$ in equation (4), the kernel matrix on the training set, $K$, and on the testing set, $K_t$, are modified respectively as

$$\tilde{K} = (I - \tfrac{1}{l}\mathbf{1}_l\mathbf{1}_l^T)\, K\, (I - \tfrac{1}{l}\mathbf{1}_l\mathbf{1}_l^T), \qquad (5)$$

$$\tilde{K}_t = (K_t - \tfrac{1}{l}\mathbf{1}_{l_t}\mathbf{1}_l^T K)(I - \tfrac{1}{l}\mathbf{1}_l\mathbf{1}_l^T), \qquad (6)$$

where $I$ is the $l$-dimensional identity matrix, $l_t$ is the number of testing data points, $\mathbf{1}_l$ and $\mathbf{1}_{l_t}$ are vectors whose elements are all ones, of length $l$ and $l_t$ respectively, and $K_t$ is the $l_t \times l$ kernel matrix for the testing data points. From the above equations it follows that the maximal number of principal components extracted by KPCA is $l$. If only the first several eigenvectors, sorted in descending order of the eigenvalues, are considered, the number of principal components in $s_k$ can be reduced. Popular kernel functions include the Gaussian kernel, the sigmoid kernel, the polynomial kernel, etc. The Gaussian kernel is employed in this paper:

$$K(x, x_k) = \exp(-\|x - x_k\|^2 / \sigma^2). \qquad (7)$$
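The computations of Eqs. (2)-(7) can be illustrated with a short numerical sketch. The following Python/NumPy fragment is only an illustration of the procedure under the notation above, not the authors' code; all function and variable names are ours.

```python
import numpy as np

def gaussian_kernel_matrix(X, Z, sigma2):
    """Gaussian kernel K(x, z) = exp(-||x - z||^2 / sigma2), cf. Eq. (7)."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / sigma2)

def kpca_fit(X_train, sigma2, n_components):
    """Centre the kernel matrix (Eq. (5)), solve the eigenproblem (Eq. (2))
    and normalise the eigenvectors (Eq. (3))."""
    l = X_train.shape[0]
    K = gaussian_kernel_matrix(X_train, X_train, sigma2)
    J = np.eye(l) - np.ones((l, l)) / l            # I - (1/l) 1 1^T
    K_tilde = J @ K @ J                            # Eq. (5)
    lam, alpha = np.linalg.eigh(K_tilde)           # ascending eigenvalues
    idx = np.argsort(lam)[::-1][:n_components]     # keep the largest ones
    lam, alpha = np.clip(lam[idx], 1e-12, None), alpha[:, idx]
    alpha = alpha / np.sqrt(lam)                   # Eq. (3): unit-norm u_i
    return {"X": X_train, "K": K, "J": J, "alpha": alpha, "sigma2": sigma2}

def kpca_transform(model, X_new):
    """Project new points using the test-set centring of Eq. (6) and Eq. (4)."""
    l = model["X"].shape[0]
    Kt = gaussian_kernel_matrix(X_new, model["X"], model["sigma2"])
    Kt_tilde = (Kt - np.ones((X_new.shape[0], l)) / l @ model["K"]) @ model["J"]
    return Kt_tilde @ model["alpha"]               # Eq. (4): s_k(i)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    model = kpca_fit(X, sigma2=75.0, n_components=10)
    S = kpca_transform(model, X[:10])
    print(S.shape)                                 # (10, 10)
```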
3 Reducing the Dimension of the Embedding Phase Space Via Similarity Analysis

The kernel principal components $s_k$ in feature space are computed as in Section 2; for convenience they are denoted in this section by $H^1, H^2, \ldots, H^l$, where $H^i = (H^i_1, H^i_2, \ldots, H^i_l)^T$ for $i = 1, 2, \ldots, l$ and $H^i_j$ is the $i$th principal component of the $j$th sample. The first $q$ principal components are chosen such that their accumulative contribution ratio is large enough; they form the reconstructed phase space. As in formula (8), the training sample pairs for chaotic time series modeling can then be formed as

$$\tilde{X} = \begin{bmatrix} H^1_1 & H^2_1 & \cdots & H^q_1 \\ H^1_2 & H^2_2 & \cdots & H^q_2 \\ \vdots & \vdots & \ddots & \vdots \\ H^1_l & H^2_l & \cdots & H^q_l \end{bmatrix}, \qquad Y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_l \end{bmatrix}. \qquad (8)$$

Modeling the time series on the basis of this KPCA phase space reconstruction amounts to finding the hidden function $f$ between the input $\tilde{X}$ and the output $Y$ such that $y_i = f(\tilde{x}_i)$. The above KPCA-based phase space reconstruction chooses the first $q$ principal components only according to their accumulative contribution ratio (which must be large enough so that they represent most of the information in the original variables), without considering the similarity between each chosen principal component $H^i$ ($1 \le i \le q$) and the output variable $Y$. This paper analyzes the similarity between the principal components and the output variable on the basis of KPCA. Set a threshold $\theta$ and compute the similarity coefficient between the principal component $H^i$ ($1 \le i \le q$) and the output $Y$,

$$\rho_i = \frac{\mathrm{Cov}(H^i, Y)}{\sqrt{\mathrm{Cov}(H^i, H^i) \cdot \mathrm{Cov}(Y, Y)}}, \qquad (9)$$

where $\mathrm{Cov}(H^i, Y)$ is the covariance of the vectors $H^i$ and $Y$. The principal components $H^i$ ($1 \le i \le q$) whose similarity coefficient satisfies $\rho_i \ge \theta$ are chosen to form the reconstructed phase space $H$.
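The similarity analysis of Eq. (9) amounts to keeping the kernel principal components whose correlation with the output reaches the threshold θ. A minimal illustrative sketch (our own naming, not the authors' code) is:

```python
import numpy as np

def select_by_similarity(H, Y, theta=0.90):
    """H: (l, q) matrix whose columns are the first q kernel principal
    components; Y: (l,) output vector.  Returns the indices of the columns
    whose similarity coefficient rho_i (Eq. (9)) satisfies rho_i >= theta,
    together with the reduced phase-space matrix."""
    Yc = Y - Y.mean()
    Hc = H - H.mean(axis=0)
    cov_hy = Hc.T @ Yc / (len(Y) - 1)               # Cov(H^i, Y)
    var_h = (Hc ** 2).sum(axis=0) / (len(Y) - 1)    # Cov(H^i, H^i)
    var_y = Yc @ Yc / (len(Y) - 1)                  # Cov(Y, Y)
    rho = cov_hy / np.sqrt(var_h * var_y)           # Eq. (9)
    keep = np.where(rho >= theta)[0]
    return keep, H[:, keep]

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    Y = rng.normal(size=200)
    H = np.column_stack([Y + 0.05 * rng.normal(size=200),   # strongly related
                         rng.normal(size=200)])             # unrelated
    keep, H_sel = select_by_similarity(H, Y, theta=0.90)
    print(keep, H_sel.shape)        # e.g. [0] (200, 1)
```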
4 The Proposed Architecture and Algorithm

The basic idea is to use KPCA to reconstruct the phase space with similarity analysis (SA) and to apply LS-SVM to forecast the chaotic time series. Fig. 1 shows how the model is built.
Fig. 1. The architecture of model for chaotic time series forecasting
With these three stages the process of predicting the chaotic time series is completed. The detailed steps of the algorithm are as follows:

Step 1. For a chaotic time series $x_k$, KPCA is applied to assign the embedding dimension according to the accumulative contribution ratio. The principal components $s_k$, whose dimension is smaller than that of $x_k$, are obtained.

Step 2. $s_k$ is used as the input of the similarity analysis, and an appropriate threshold $\theta$ is selected according to the obtained result. The dimension of the final embedding phase space is thus assigned.

Step 3. In the reconstructed phase space, the LS-SVM model is built, trained and validated on the partitioned data sets to determine the kernel parameters $\sigma^2$ and $\gamma$ of the LS-SVM with Gaussian kernel. The LS-SVM that produces the smallest error on the validation data set is chosen for chaotic time series forecasting.
5 Simulation Results

The gas concentration, which is essentially a chaotic time series, is one of the key factors endangering production safety in coal mines; strengthening the forecasting and control of coal mine gas concentration therefore has great social and economic benefit. In this study, 2010 samples were collected from an online sensor underground after eliminating abnormal data. The goal of the task is to use known values of the time series up to the point x = t to predict the value at some point in the future x = t + τ. For simplicity, the forecasting method creates a mapping from d points of the time series spaced τ apart, that is, (x(t − (d − 1)τ), ..., x(t − τ), x(t)), to a forecast of the future value x(t + τ). 1200 samples are used to reconstruct the phase space with KPCA. Through several trials, σ² = 75 and the number of principal components is 28, for which the accumulative contribution ratio is 0.95. The embedding dimension of the phase space is then reduced to 15 through similarity analysis with θ = 0.90. The embedding phase space is thus reconstructed with the parameter values d = 45 and τ = 4 in the experiment. From the gas concentration time series x(t), we extracted 1200 input-output data pairs. The first 500 pairs are used as the training data set, the next 200 pairs as the validation data set for finding the optimal parameters of the LS-SVM, and the remaining 500 pairs as the test data set for assessing the
predictive power of the model. The prediction performance is evaluated using the root mean squared error (RMSE) and the normalized mean squared error (NMSE):
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}, \qquad (10)$$

$$\mathrm{NMSE} = \frac{1}{\delta^2 n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2, \qquad \delta^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2, \qquad (11)$$
where $n$ represents the total number of data points in the test set, and $y_i$, $\hat{y}_i$, $\bar{y}$ are the actual value, the predicted value, and the mean of the actual values, respectively. When applying LS-SVM to modeling, the first thing to consider is which kernel function to use. As the dynamics of chaotic time series are strongly nonlinear, it is intuitively expected that nonlinear kernel functions will achieve better performance than the linear kernel. In this investigation, the Gaussian kernel tends to give good performance under general smoothness assumptions. The second thing to consider is which values of the kernel parameters ($\gamma$ and $\sigma^2$) to use. As there is no structured way to choose the optimal parameters of LS-SVM, the values that produce the best result on the validation set are used. Through several trials, it was found that $\sigma^2$ and
$\gamma$ play an important role in the generalization performance of LS-SVM, so $\sigma^2$ and $\gamma$ are fixed at 0.15 and 25, respectively, for the following experiments. The simulation results are shown in Table 1, where SA denotes the similarity analysis.

Table 1. The converged RMSE and NMSE and the number of principal components for the gas concentration chaotic time series

Model              #Principal components   RMSE (training / testing)   NMSE (training / testing)
KPCA+LS-SVM        28                      0.0141 / 0.0150             0.0379 / 0.0725
KPCA(SA)+LS-SVM    15                      0.0101 / 0.0107             0.0291 / 0.0348
Fig. 2. The forecasting errors in the KPCA+LS-SVM model (the dotted line) and the KPCA(SA)+LS-SVM model (the solid line)
From Table 1, it can be observed that KPCA(SA)+LS-SVM forecasts the actual values more closely than KPCA+LS-SVM. Correspondingly, the forecasting errors of KPCA(SA)+LS-SVM (the solid line) are smaller than those of KPCA+LS-SVM (the dotted line), as illustrated in Fig. 2.
6 Conclusions

This paper describes a novel methodology, an LS-SVM based on combining KPCA with similarity analysis, to model and forecast chaotic time series. Firstly, KPCA, a nonlinear PCA obtained by generalizing linear PCA with the kernel method, is adopted to extract features of the chaotic time series and thus fully reflect its nonlinear characteristics. Secondly, on the basis of KPCA, the embedding dimension of the phase space of the chaotic time series is reduced according to the similarity degree of the principal components to the model output, so the model precision is greatly improved. The proposed model has been evaluated on coal mine gas concentration data, and its superiority is demonstrated by comparing it with the model without similarity analysis. The simulation results show that the proposed model achieves higher prediction accuracy and better generalization performance than the model without similarity analysis. On the other hand, some issues should be investigated in future work, such as how to determine the accumulative contribution ratio of KPCA and the confidence threshold of the similarity analysis, which deeply affect the performance of the whole model, and how to construct the kernel function and determine the optimal kernel parameters.
Acknowledgements This research is supported by National Natural Science Foundation of China under grant 70533050 and Young Science Foundation of CUMT under grant 2006A010.
References
1. Vapnik, V.N.: An Overview of Statistical Learning Theory. IEEE Transactions on Neural Networks 10 (5) (1999) 988-999
2. Suykens, J.A.K., Vandewalle, J., De Moor, B.: Optimal Control by Least Squares Support Vector Machines. Neural Networks 14 (2001) 23-35
3. Wei, X.K., Li, Y.H., et al.: Analysis and Applications of Time Series Forecasting Model via Support Vector Machines. System Engineering and Electronics 27 (3) (2005) 529-532
4. Chen, K., Han, B.T.: A Survey of State Space Reconstruction of Chaotic Time Series Analysis. Computer Science 32 (4) (2005) 67-70
5. Kennel, M.B., Brown, R., et al.: Determining Embedding Dimension for Phase-space Reconstruction Using a Geometrical Construction. Phys. Rev. A 45 (1992) 3403-3411
6. Palus, M., Dvorak, I.: Singular-value Decomposition in Attractor Reconstruction: Pitfalls and Precautions. Physica D 55 (1992) 221-234
7. Scholkopf, B., Smola, A.J., Muller, K.R.: Nonlinear Component Analysis as a Kernel Eigenvalue Problem. Neural Computation 10 (1998) 1299-1319
Prediction of Chaotic Time Series Using LS-SVM with Simulated Annealing Algorithms Meiying Ye Department of Physics, Zhejiang Normal University, Jinhua 321004, China
[email protected]
Abstract. Least squares support vector machine (LS-SVM) is a popular tool for the analysis of time series data sets. Choosing optimal hyperparameter values for LS-SVM is an important step in time series analysis. In this paper, we combine LS-SVM with simulated annealing (SA) algorithms for nonlinear time series analysis. The LS-SVM is used to predict chaotic time series; its parameters are automatically tuned using SA, and the generalization performance is estimated by minimizing the k-fold cross-validation error. A benchmark problem, the Mackey-Glass time series, is used as an example for demonstration. It is shown that this approach avoids the arbitrariness of manually choosing the LS-SVM parameters and enhances the prediction capability for chaotic time series.
1 Introduction

Time series prediction is a very important practical problem with a diverse range of applications, from economic and business planning, inventory and production control, and weather forecasting to signal processing and control. However, time series analysis is a complex problem: most time series of practical relevance are of a nonlinear and chaotic nature, which makes conventional linear prediction methods inapplicable. Hence, a number of nonlinear prediction methods have been developed, including neural networks (NN), which, though not initially proposed for time series prediction, can exceed conventional methods in accuracy by orders of magnitude. One of the most common NNs in the area of chaotic time series prediction is the multilayer NN with the error backpropagation learning algorithm, which has been successfully utilized to predict chaotic dynamical systems. The NN employs a gradient descent method to find suitable network weights by minimizing the sum of squared errors; training is usually done by iteratively updating the weights according to the error signal. Although NNs have been developed for chaotic time series prediction, some inherent drawbacks, e.g., the multiple-local-minima problem, the choice of the number of hidden units, and the danger of overfitting, make it difficult to put the NN into practical application. The present study focuses on the problem of chaotic time series prediction using least squares support vector machine (LS-SVM) regression [1] [2], whose parameters are automatically tuned using SA [3] and whose generalization performance is estimated by minimizing the k-fold cross-validation error [4].
2 Problem Description

A chaotic time series is an array of values from subsequent samples, usually coming from the output of a nonlinear dynamic system. It is assumed that neither the state of the nonlinear dynamic system is measurable nor the equation describing its dynamics is known. If the nonlinear dynamic system is deterministic, we can try to predict the chaotic time series by reconstructing the state space. The objective of chaotic time series forecasting is to build an estimate of the system's transfer function using only its output. Many conventional regression techniques can be used to solve such function estimation problems; in this investigation we concentrate on the LS-SVM. Assume that the chaotic time series is sampled every $T$; it can then be expressed as $x(T), x(2T), \ldots, x(NT)$. The chaotic time series prediction can be stated as a numerical problem: split the time series $x(T), x(2T), \ldots, x(NT)$ into windows $x((i-D+1)T), \ldots, x(iT)$ of size $D$, and then find a good estimate of the function $F: R^D \to R$ such that

$$x((i+1)T) = F(x((i-D+1)T), \ldots, x(iT)), \qquad (1)$$

for every $i \in \{D, \ldots, N\}$. Here $F(\cdot)$ is an unknown function and $D$ is a positive integer, the so-called embedding dimension. In many time series applications, one-step prediction schemes are used to predict the next sample $x((i+1)T)$ from previous samples. However, one-step prediction may not provide enough information, especially in situations where a broader knowledge of the time series behavior is useful or where it is desirable to anticipate the behavior of the time series process. The present study deals with chaotic time series prediction several steps ahead into the future, $x((i+1)T), x((i+2)T), \ldots, x((i+P)T)$, starting from information at instant $i+1$. Hence, the goal is to approximate the function $F(\cdot)$ such that the model given by equation (1) can be used as a chaotic time series prediction scheme. In this work, we apply LS-SVM and SA to estimate the unknown function $F(\cdot)$.
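The windowing described above can be made concrete with a small helper that turns a sampled series into (input window, P-step-ahead target) pairs. This is an illustrative sketch with our own naming, not code from the paper.

```python
import numpy as np

def make_windows(x, D, P=1):
    """Build pairs (x[i-D+1..i], x[i+P]) from a 1-D series x, cf. Eq. (1)
    with a prediction horizon of P samples instead of one."""
    X, y = [], []
    for i in range(D - 1, len(x) - P):
        X.append(x[i - D + 1:i + 1])
        y.append(x[i + P])
    return np.asarray(X), np.asarray(y)

if __name__ == "__main__":
    x = np.sin(0.1 * np.arange(300))        # stand-in for a sampled series
    X, y = make_windows(x, D=4, P=6)
    print(X.shape, y.shape)                 # (291, 4) (291,)
```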
3 SVM and Its Parameter Selection by SA

The present study focuses on the problem of chaotic time series prediction using LS-SVM and SA. In the following, we briefly introduce LS-SVM regression and SA. For further details on LS-SVM and SA we refer to Refs. [1], [2] and [3].

3.1 LS-SVM Model for Chaotic Time Series Prediction
Consider a given training set of $N$ data points $\{x_k, y_k\}_{k=1}^{N}$ with input data $x_k \in R^D$ and output $y_k \in R$. In the feature space, LS-SVM models take the form

$$y(x) = w^T \varphi(x) + b, \qquad (2)$$

where the nonlinear mapping $\varphi(\cdot)$ maps the input data into a higher-dimensional feature space. Note that the dimension of $w$ is not specified (it can be infinite-dimensional). In LS-SVM for function estimation the following optimization problem is formulated:

$$\min J(w, e) = \frac{1}{2} w^T w + \frac{1}{2}\gamma \sum_{k=1}^{N} e_k^2, \qquad (3)$$

subject to the equality constraints

$$y(x_k) = w^T \varphi(x_k) + b + e_k, \quad k = 1, \ldots, N. \qquad (4)$$

Important differences with the standard SVM [5] are the equality constraints and the squared error term, which greatly simplify the problem. The solution is obtained after constructing the Lagrangian

$$L(w, b, e, \alpha) = J(w, e) - \sum_{k=1}^{N} \alpha_k \{ w^T \varphi(x_k) + b + e_k - y_k \} \qquad (5)$$

with Lagrange multipliers $\alpha_k$. After optimizing equation (5) and eliminating $e_k$ and $w$, the solution is given by the following set of linear equations:

$$\begin{bmatrix} 0 & \mathbf{1}^T \\ \mathbf{1} & \varphi(x_k)^T\varphi(x_l) + \gamma^{-1} I \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ y \end{bmatrix}, \qquad (6)$$

where $y = [y_1; \ldots; y_N]$, $\mathbf{1} = [1; \ldots; 1]$, $\alpha = [\alpha_1; \ldots; \alpha_N]$, and the Mercer condition

$$K(x_k, x_l) = \varphi(x_k)^T \varphi(x_l), \quad k, l = 1, \ldots, N \qquad (7)$$

has been applied. This finally results in the following LS-SVM model for function estimation:

$$y(x) = \sum_{k=1}^{L} \alpha_k K(x, x_k) + b, \qquad (8)$$

where $\alpha_k$ and $b$ are the solution of the linear system, $K(\cdot,\cdot)$ is the kernel corresponding to the high-dimensional feature space nonlinearly mapped from the input space, and $L$ is the number of support vectors. The LS-SVM approximates the function using equation (8). Any function that satisfies Mercer's condition can be used as the kernel function $K(\cdot,\cdot)$; there are several possibilities. Popular kernel functions are the

Gaussian kernel: $K(x_k, x_l) = \exp\!\left(-\dfrac{\|x_k - x_l\|^2}{2\sigma^2}\right)$, \qquad (9)

Polynomial kernel: $K(x_k, x_l) = (1 + x_k \cdot x_l)^{\beta}$, \qquad (10)

where $\sigma$ and $\beta$ are positive real constants.
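Equations (6) and (8) translate directly into a few lines of NumPy. The following sketch (our own illustrative code, Gaussian kernel only, not the author's implementation) solves the linear system for b and α and evaluates the resulting model:

```python
import numpy as np

def gaussian_kernel(X, Z, sigma):
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))        # Eq. (9)

def lssvm_fit(X, y, sigma, gamma):
    """Solve the LS-SVM linear system of Eq. (6) for b and alpha."""
    N = len(y)
    K = gaussian_kernel(X, X, sigma)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0                                  # first row: [0  1^T]
    A[1:, 0] = 1.0                                  # first column of ones
    A[1:, 1:] = K + np.eye(N) / gamma               # K + gamma^-1 I
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return {"b": sol[0], "alpha": sol[1:], "X": X, "sigma": sigma}

def lssvm_predict(model, X_new):
    """Evaluate Eq. (8): y(x) = sum_k alpha_k K(x, x_k) + b."""
    K = gaussian_kernel(X_new, model["X"], model["sigma"])
    return K @ model["alpha"] + model["b"]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(200, 1))
    y = np.sinc(X[:, 0]) + 0.05 * rng.normal(size=200)
    model = lssvm_fit(X, y, sigma=0.8, gamma=25.0)
    print(np.max(np.abs(lssvm_predict(model, X) - y)))   # small training error
```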
In this work, the Gaussian kernel function is used as the kernel function of the LS-SVM, because Gaussian kernels tend to give good performance under general smoothness assumptions. Consequently, they are especially useful if no additional knowledge of the data is available. Note that in the case of Gaussian kernels one has only two additional tuning parameters, viz. the kernel width parameter σ in equation (9) and the regularization parameter γ in equation (3), which is fewer than for the standard SVM.

3.2 SA for Parameter Tuning of LS-SVM
To obtain good prediction performance, some parameters in LS-SVM have to be chosen carefully. These parameters include:

• the regularization parameter γ, which determines the tradeoff between minimizing the training error and minimizing model complexity; and
• the parameter (σ or β) of the kernel function that implicitly defines the nonlinear mapping from the input space to some high-dimensional feature space (in this paper we focus entirely on the Gaussian kernel).
These "higher level" parameters are usually referred to as hyperparameters. In this paper, these parameters are automatically tuned using SA, and the generalization performance of LS-SVM is estimated by minimizing the k-fold cross-validation error in the training phase. SA is an optimization technique analogous to the annealing process in material physics. Boltzmann [6] pointed out that if a system is in thermal equilibrium at a temperature $T$, the probability $P_T(s)$ of the system being in a given state $s$ is given by the Boltzmann distribution

$$P_T(s) = \frac{\exp(-E(s)/KT)}{\sum_{w \in S} \exp(-E(w)/KT)}, \qquad (11)$$

where $E(s)$ denotes the energy of state $s$, $K$ represents the Boltzmann constant, and $S$ is the set of all possible states. However, equation (11) does not contain information on how a fluid reaches thermal equilibrium at a given temperature. Metropolis et al. [3] developed an algorithm that simulates this process. The Metropolis algorithm is summarized as follows. When the system is in the original state $s_{old}$ with energy $E(s_{old})$, a randomly selected atom is perturbed, resulting in a state $s_{new}$ with energy $E(s_{new})$. This new state is either accepted or rejected depending on the Metropolis criterion: if $E(s_{new}) \le E(s_{old})$, the new state is automatically accepted; in contrast, if $E(s_{new}) > E(s_{old})$, the probability of accepting the new state is given by

$$P_t(\text{accept } s_{new}) = \exp\!\left(\frac{E(s_{old}) - E(s_{new})}{KT}\right). \qquad (12)$$
Based on the studies of Boltzmann and Metropolis, Kirkpatrick et al. [7] showed that the Metropolis procedure can be conducted for each temperature on an annealing schedule until thermal equilibrium is reached. Additionally, a prerequisite for applying the SA algorithm is that a given set of the multiple variables defines a unique system state for which the objective function can be calculated. The SA algorithm in our investigation is described as follows:

Step 1 (Initialization). Set upper bounds for the two positive LS-SVM parameters σ and γ. Then generate initial values of the two parameters and feed them into the LS-SVM model. The forecasting error is defined as the system state ($E$); this yields the initial state ($E_0$).

Step 2 (Provisional state). Make a random move to change the existing system state to a provisional state; another set of the two positive parameters is generated in this stage.

Step 3 (Acceptance tests). The following rule is employed to determine the acceptance or rejection of the provisional state:

$$\begin{cases} \text{accept the provisional state} & \text{if } E(s_{new}) > E(s_{old}) \text{ and } p < P_t(\text{accept } s_{new}),\ 0 \le p < 1, \\ \text{accept the provisional state} & \text{if } E(s_{new}) \le E(s_{old}), \\ \text{reject the provisional state} & \text{otherwise,} \end{cases} \qquad (13)$$

where $p$ is a random number used to determine the acceptance of the provisional state. If the provisional state is accepted, it becomes the current state.

Step 4 (Incumbent solutions). If the provisional state is not accepted, return to Step 2. Furthermore, if the current state is not superior to the system state, repeat Steps 2 and 3 until the current state is superior to the system state, and finally set the current state as the new system state. Previous studies [8,9] indicated that the maximum number of loops ($N_{sa}$) should be $100D$ to avoid infinitely repeated loops, where $D$ denotes the problem dimension. In this investigation, two parameters (σ and γ) determine the system state, hence $N_{sa}$ is set to 200.

Step 5 (Temperature reduction). After the new system state is obtained, reduce the temperature:

$$\text{New temperature} = (\text{Current temperature}) \times \rho, \quad 0 < \rho < 1. \qquad (14)$$
The value of ρ is set to 0.9 in this study. If the pre-determined final temperature is reached, the algorithm stops and the latest state is taken as an approximate optimal solution; otherwise, go to Step 2. Cross-validation is a popular technique for estimating generalization performance, and there are several versions. The k-fold cross-validation is computed as follows. The training set is randomly divided into k mutually exclusive subsets (folds). The LS-SVM is trained on k − 1 subsets and then tested on the remaining subset to obtain the regression error. This procedure is repeated k times so that each subset is used for testing once. Averaging the test error over the k trials gives an estimate of the generalization performance. The k-fold cross-validation is
applicable to arbitrary learning algorithms. In order to evaluate the performance of the proposed methods, we use k = 5 for the number of folds.
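Assuming LS-SVM fit/predict helpers such as those sketched after Eq. (10) above, the SA tuning procedure with the 5-fold cross-validation error as the energy can be illustrated as follows. The acceptance rule follows Eq. (13) and the cooling schedule Eq. (14); the search bounds, the proposal move, and the constants other than ρ = 0.9 and N_sa = 200 are our own illustrative choices, not those of the paper.

```python
import numpy as np
# Assumes the lssvm_fit / lssvm_predict helpers sketched in Sect. 3.1 above.

def cv_error(X, y, sigma, gamma, k=5, seed=0):
    """k-fold cross-validation RMSE used as the SA energy E(s)."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    errs = []
    for f in range(k):
        te = folds[f]
        tr = np.concatenate([folds[j] for j in range(k) if j != f])
        model = lssvm_fit(X[tr], y[tr], sigma, gamma)
        errs.append(np.sqrt(np.mean((lssvm_predict(model, X[te]) - y[te]) ** 2)))
    return float(np.mean(errs))

def sa_tune(X, y, bounds=((0.05, 5.0), (1.0, 1000.0)),
            T0=1.0, rho=0.9, T_min=1e-3, n_sa=200, seed=0):
    """Simulated annealing over (sigma, gamma), acceptance rule of Eq. (13),
    geometric cooling of Eq. (14) with rho = 0.9."""
    rng = np.random.default_rng(seed)
    state = np.array([rng.uniform(*bounds[0]), rng.uniform(*bounds[1])])
    E_old = cv_error(X, y, *state)
    best_state, best_E, T = state.copy(), E_old, T0
    while T > T_min:
        for _ in range(n_sa):
            cand = np.clip(state * np.exp(0.2 * rng.normal(size=2)),
                           [bounds[0][0], bounds[1][0]],
                           [bounds[0][1], bounds[1][1]])
            E_new = cv_error(X, y, *cand)
            accept = (E_new <= E_old) or (rng.random() < np.exp((E_old - E_new) / T))
            if accept:
                state, E_old = cand, E_new
                if E_new < best_E:
                    best_state, best_E = cand.copy(), E_new
        T *= rho                                    # Eq. (14)
    return best_state, best_E
```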
4 Benchmark Problem and Experimental Results

In this section, we present an example showing the effectiveness of using LS-SVM with SA for chaotic time series prediction. We use data sets generated by the Mackey-Glass differential-delay equation [10]. We generate a time series by numerically integrating the Mackey-Glass time-delay differential equation

$$\frac{dx(t)}{dt} = -h\, x(t) + \frac{g\, x(t-\tau)}{1 + x^{10}(t-\tau)} \qquad (15)$$

with parameters $g = 0.2$, $h = 0.1$, $\tau = 17$ and initial conditions $x(0) = 1.2$ and $x(t) = 0$ for $t < 0$. Equation (15) was originally introduced as a model of blood cell regulation.
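A minimal sketch of generating the series by fourth-order Runge-Kutta integration of Eq. (15) is given below; the step size, the sampling interval, and the treatment of the delayed term are our own illustrative choices, not the paper's exact settings.

```python
import numpy as np

def mackey_glass(n_samples, g=0.2, h=0.1, tau=17, x0=1.2, dt=0.1, sample_every=10):
    """Integrate Eq. (15) with a fourth-order Runge-Kutta scheme.
    The delayed value x(t - tau) is read from the stored history and held
    fixed within a step (a common simplification for this benchmark)."""
    n_steps = n_samples * sample_every
    delay = int(round(tau / dt))
    x = np.zeros(n_steps + 1)
    x[0] = x0                                    # x(t) = 0 for t < 0
    def f(xt, x_del):
        return -h * xt + g * x_del / (1.0 + x_del ** 10)
    for n in range(n_steps):
        x_del = x[n - delay] if n >= delay else 0.0
        k1 = f(x[n], x_del)
        k2 = f(x[n] + 0.5 * dt * k1, x_del)
        k3 = f(x[n] + 0.5 * dt * k2, x_del)
        k4 = f(x[n] + dt * k3, x_del)
        x[n + 1] = x[n] + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
    return x[::sample_every][:n_samples]         # one sample per time unit

if __name__ == "__main__":
    series = mackey_glass(1000)
    print(series[:5])
```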
Fig. 1. Predicted and desired values of the Mackey-Glass series; the parameter P is set to 36
Fig. 2. Root-mean-square error (RMSE) as a function of P. The solid line indicates the prediction errors with SA and the dashed line those obtained using 5-fold cross-validation.
The time series data was obtained by applying the conventional fourth-order Runge-Kutta algorithm to determine the numerical solution of equation (15). This nonlinear time series is chaotic, so there is no clearly defined period; the series neither converges nor diverges, and the trajectory is highly sensitive to initial conditions. The prediction of future values of this series is a benchmark problem. In time series prediction, we want to use known values of the time series up to a point in time, say i, to predict the value at some point in the future, say i + P. The standard method for this type of prediction is to create a mapping from D sample data points, sampled every T units in time, (x(i − (D − 1)T), ..., x(i − T), x(i)), to a predicted future value x(i + P). Following the conventional settings for predicting the Mackey-Glass time series, we set D = 4. For each i, the input training data for the LS-SVM is a four-dimensional vector. We extracted input/output data pairs of the following format:
$$[\,x(i-18),\ x(i-12),\ x(i-6),\ x(i);\ x(i+6)\,] \qquad (16)$$
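Given a sampled series (for instance from the integration sketch after Eq. (15)), pairs of format (16) can be extracted as follows (illustrative code, our own naming):

```python
import numpy as np

def mackey_glass_pairs(x, P=6):
    """Inputs [x(i-18), x(i-12), x(i-6), x(i)] with target x(i+P), cf. (16)."""
    X, y = [], []
    for i in range(18, len(x) - P):
        X.append([x[i - 18], x[i - 12], x[i - 6], x[i]])
        y.append(x[i + P])
    return np.asarray(X), np.asarray(y)

if __name__ == "__main__":
    x = np.sin(0.05 * np.arange(500))     # stand-in series
    X, y = mackey_glass_pairs(x)
    print(X.shape, y.shape)               # (476, 4) (476,)
```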
In Figure 1, multi-step predictions are considered; the parameter P is set to 36. Figure 2 shows how the prediction error depends on the prediction step when the LS-SVM parameters are automatically tuned using SA. We can see that the prediction is quite accurate; as the prediction parameter P increases, the prediction errors increase. These results indicate that low prediction errors were obtained, which may be attributable to the fact that the tuning is more likely to converge to the global optimum.
5 Conclusions

In this paper, LS-SVM with SA is used for chaotic time series prediction. The Mackey-Glass time series has been used as an example for demonstration. The results demonstrate that the prediction method using LS-SVM with SA is suitable for multi-step prediction, and that this approach avoids the arbitrariness of manually choosing the LS-SVM parameters. Although the experiments focus on the Mackey-Glass differential-delay equation, we believe that the proposed method can be applied to other complex chaotic time series.
Acknowledgements. This work was supported by the Zhejiang Provincial Natural Science Foundation of China (Y105281, Y106786).
References
1. Suykens, J.A.K., Vandewalle, J.: Least Squares Support Vector Machine Classifiers. Neural Processing Letters 9 (1999) 293-300
2. Suykens, J.A.K., De Brabanter, J., Van Gestel, T., Vandewalle, J.: Least Squares Support Vector Machines. World Scientific, Singapore (2002)
3. Metropolis, N., Rosenbluth, A.W., Rosenbluth, M.N., Teller, A.H., Teller, E.: Equations of State Calculations by Fast Computing Machines. J. Chem. Phys. 21 (1953) 1087-1091
4. Duan, K., Keerthi, S.S., Poo, A.N.: Evaluation of Simple Performance Measures for Tuning SVM Hyperparameters. Neurocomputing 51 (2003) 41-59
5. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer-Verlag, New York (1999)
6. Cercignani, C.: The Boltzmann Equation and Its Applications. Springer-Verlag, Berlin (1988)
7. Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P.: Optimization by Simulated Annealing. Science 220 (1983) 671-680
8. Van Laarhoven, P.J.M., Aarts, E.H.L.: Simulated Annealing: Theory and Applications. Kluwer Academic Publishers, Dordrecht (1987)
9. Dekkers, A., Aarts, E.: Global Optimization and Simulated Annealing. Math. Programm. 50 (1991) 367-393
10. Mackey, M., Glass, L.: Oscillations and Chaos in Physiological Control Systems. Science 197 (1977) 287-289
Radial Basis Function Neural Network Predictor for Parameter Estimation in Chaotic Noise Hongmei Xie1,* and Xiaoyi Feng2 1
Department of Electronics and Information Engineering, School of Electronics and Information, Northwestern Polytechnical University, Xi’an 710072, P.R. China
[email protected] 2 Department of Electronics Science and Technology, School of Electronics and Information, Northwestern Polytechnical University, Xi’an 710072, P.R. China
[email protected]
Abstract. Chaotic noise cancellation has potential applications in both secret communication and radar target identification. To solve the problem of parameter estimation in chaotic noise, a novel radial basis function neural network (RBF-NN) based chaotic time series modeling method is presented in this paper. The algorithm combines the neural network's ability to approximate arbitrary nonlinear functions with classical amplitude spectral analysis. Based on the flexibility of the RBF-NN predictor and spectral analysis, this paper proposes a new algorithm for parameter estimation in chaotic noise. An analysis of the proposed algorithm's principle and simulation results are given, which show the effectiveness of the proposed method. We conclude that the study has potential applications in various fields, such as narrow-band interference rejection or attenuation in secret communication, and weak target detection and identification in sea clutter in radar signal processing.
1 Introduction

Nonlinear dynamics is very important in describing many physical phenomena in practice [1]. In the field of radar signal processing, sea clutter can be modeled as chaotic noise. In communication systems, speech and indoor multi-path signals have been demonstrated to be chaotic rather than purely random. In applications such as radar surveillance, secure communication, and narrowband interference cancellation, the chaotic signal is a kind of noise. Therefore, there is a great need to detect and extract useful signal parameters in chaotic noise. For example, chaotic modulation is used in secret communication, where impulse interference cancellation relies on the performance of frequency estimation in chaotic noise [2]. Modeling sea clutter with nonlinear chaotic dynamics turns target velocity estimation into frequency estimation in chaotic noise. *
Corresponding author.
To solve the problem of parameter estimation in chaotic noise, the minimum phase space volume (MPSV) algorithm and improved versions such as the genetic-algorithm minimum phase space volume (GA-MPSV) [2][3][6], as well as least squares autoregressive (LS-AR) estimation [4], have been proposed. However, their performance is not satisfactory. On the one hand, the MPSV-based algorithm is very complex because it involves inverse filter design followed by a global search and optimization procedure, although it takes the nature of chaotic noise into account and can achieve correct results. On the other hand, the LS-AR algorithm does not take the nature of chaotic noise into consideration, although its computational burden is small and it can work at relatively high signal-to-noise ratio (SNR). In this paper, we propose a neural network and power spectrum analysis based algorithm, which takes both the computational burden and the numerical precision into account at the same time. Our motivation is that a neural network can fit the chaotic nonlinear dynamic function, since neural networks (NN) have the ability to model nonlinear time series globally [5][6]. An analysis of the proposed algorithm's principle and simulation results are given, and the results show the effectiveness of the proposed method. Time-delay chaotic reconstruction and power spectral density analysis are used to estimate the useful parameter. A systematic comparison of the three kinds of parameter estimation algorithms is also given in this paper. The paper is organized as follows: Section 2 describes the mathematical formulation and physical description of the problem to be solved. Section 3 gives the block diagram of the novel parameter estimation algorithm and some considerations concerning the selection of key factors. In Section 4, simulation experiments are designed and the results are given and analyzed. In the last section, discussion and conclusions are presented.
2 Problem Formulation and Description

Generally speaking, the problem can be expressed as

$$x_t = s_t(\theta_0) + n_t = \sum_{i=1}^{k} \alpha_i \sin(2\pi f_i t) + n_t, \quad t = 1, 2, \ldots, N, \qquad (1)$$

where $\theta_0 = [\theta_1, \ldots, \theta_p]$ is the parameter vector to be estimated in the useful signal $s_t(\theta_0)$, and the additive noise $n_t$ is chaotic noise. Here $p$ is the dimension of the vector $\theta_0$, i.e., the number of unknown parameters.
In radar or sonar systems, parameters such as the DOA, the moving velocity and the RCS are needed to describe a target exactly and to track it. By the Doppler effect, the moving velocity can be transformed into a frequency. Thus, for a real system, the problem is to estimate certain frequencies in the signal, which can be written as Eq. (1).
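For illustration, the observation model of Eq. (1) can be simulated with a sinusoid embedded in chaotic noise generated by the logistic map. The specific map x_{k+1} = 4 x_k (1 − x_k), its rescaling, and the amplitudes are our own assumptions for this sketch; the paper only states that a logistic chaotic mapping is used.

```python
import numpy as np

def logistic_noise(n, x0=0.68587, scale=0.5):
    """Chaotic noise from the logistic map x_{k+1} = 4 x_k (1 - x_k),
    shifted to zero mean (an illustrative assumption)."""
    x = np.empty(n)
    x[0] = x0
    for k in range(n - 1):
        x[k + 1] = 4.0 * x[k] * (1.0 - x[k])
    return scale * (x - 0.5)

def observation(n=1024, amps=(1.0,), freqs=(0.1,)):
    """x_t = sum_i alpha_i sin(2 pi f_i t) + n_t, cf. Eq. (1)."""
    t = np.arange(n)
    s = sum(a * np.sin(2 * np.pi * f * t) for a, f in zip(amps, freqs))
    return s + logistic_noise(n), s

if __name__ == "__main__":
    x, s = observation()
    print(x[:5])
```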
To solve the problem of parameter estimation in chaotic noise, what we need to do first is to model and predict the chaotic component correctly. From the point of view of signal processing, the modeling of a chaotic signal can be described as obtaining a proper state space from clean or noisy received signals. Consider a discrete dynamical system whose state can be described by a set of physical variables; to simplify the problem, assume that the observation data are acquired at discrete times $t = 1, 2, \ldots$. The dynamic rule can then be written as the mapping

$$Y(t+1) = \psi(Y(t)), \qquad (2)$$

and each element of $Y(t+1)$ can be expressed as

$$y(t+1) = \psi\{y(t), y(t-1), \ldots, y(t-2D)\}. \qquad (3)$$

In other words, the value $y(t+1)$ at time $t+1$ can be obtained from the previous values $y(t), y(t-1), \ldots$; i.e., $Y(t+1)$ can be obtained from the previous system values $Y(t), Y(t-1), \ldots$. Therefore, the state of a dynamic system at time $t+1$ can be formed from its former states. Basically, chaos is a kind of deterministic nonlinear system, and a chaotic signal can be predicted over short periods. Moreover, this local predictability relies on knowing the deterministic function $\psi$. Therefore, the aim is to construct a model that can reconstruct the mapping from the observations $Y(t)$. According to Takens' delay embedding theorem, a compact manifold of dimension $d$ can be reconstructed by a delay map of dimension at least $m = 2d + 1$; this gives the conditions that have to be respected when designing a delay-embedding reconstruction. To solve the problem of chaotic time series modeling, local methods using different $m$-order AR systems and global methods have been proposed. The main disadvantage of the former is that one needs to choose the size of the local region, because the method will fail without a proper region size. Global methods, which include polynomial modeling, radial basis functions, and feedforward neural networks, can overcome this disadvantage of the local methods [5].
H. Xie and X. Feng
3 Depiction of the Proposed Scheme The basic idea of the new scheme is based on the local predictability of chaotic noise. The brief description and implementation of our scheme is shown in Fig.1. First, we use neural network as a tool to reconstruct the nonlinear dynamic system for chaotic component in the received signal. Then subtract the reconstructed chaotic noise from the received signal to obtain the weak but useful remain (error) signal. The remaining signal mainly contains of information in which we are interested. After that we perform power spectra density (PSD) analysis to the error signal and derive the parameter using the PSD results by traditional method.
r (t )
s (t ) n (t )
5(&216758&7 121/,1($5 )81&7,21)25 &+$26&20321(17
rˆ ( t )
r (t )
2%7$,1 (5525
36' $1$/<6,6
Fig. 1. Block diagram of the proposed algorithm
Since constructing the nonlinear dynamic function is significant for our scheme, we use neural network as a tool to implement it. Neural network has its special advantage in mapping and approximation nonlinear function. Among many kinds of the available neural networks, we choose RBF-based neural network as the nonlinear modeling tool since it has the following merits: 1). It has the capability of learning on-lien, which is very important for real time signal processing application; 2). It has strong robustness for modeling and curve fitting because its ability of self-adjustment. 3). The initial structure and size of RBF-NN can be easily constructed. A typical construction of a RBF-NN is shown in Fig.2. Typically it comprises input layer, hidden layer and output layer and the connection from input to hidden layer is nonlinear while the connection from hidden layer to output layer is linear. When using as function fitting, the overall response function is h h ⎡ ( x − c) 2 ⎤ f RBF (x) = ∑ wi R(x) = ∑ wi exp ⎢ − ⎥, σ k2 ⎦ k =1 k =1 ⎣
where input vector
(4)
x(t) = ( x1 (t ), x2 (t),… ,x m (t)) is one vector from input space,
which is derived from the received data.
RBF-NN Predictor for Parameter Estimation in Chaotic Noise
139
Fig. 2. Construction of RBF-NN
In the phase of design and training for NN, there are 5 groups of RBF-NN parameters to be determined. These parameters include embedding dimension m , connection weights w , center c , width σ and the number of hidden layer cells. Embedding dimension m equals to the number of input cells and must satisfy the Taken’s theorem. Other parameters for NN can be obtained by training the NN and minimize some cost function. The main work is that we need divide our received data into training set and test set.
4 Simulations and Analysis In this part we perform computer simulation for both radar target identification in chaotic noise and interference cancellation for chaos modulation secret communication system. The former one can be expressed using eq.(1) while the latter can be formulated as x (t ) =
P
∑
i =1
a i x (t − i) + n (t ),
(5)
where p is the order of autoregressive model describing the interference[2], a i is the parameter. For both cases the noise component is chaotic. Here we use logistic chaotic mapping to simulate the chaotic noise. The results are shown in Fig. 3. Fig.3 (a) (The left figure of Fig.3 ) gives out the PSD of singular sinusoid frequency estimation results For the simulation for problem described in Eq.(5), we use a 2 order AR model to evaluation the performance of our scheme. Let xt = 0.195 xt −1 − 0.95 xt −2 + nt is the received data and the initial condition for
140
H. Xie and X. Feng
(a) Predicting ability of RBF neural network.
(b) Results of parameter estimation.
Fig. 3. Simulation results for RBF-NN based estimation scheme
chaotic logistic mapping is x (0) = 0.68587
, y(0) = 0.65876 . We get the
correct estimated values −0.20 and 0.95 for the 2-order AR system. To illustrate Fig.3 more clearly, we make such restatement. In Fig. 3(a), the solid line represents the curve to be fitted and the dash and plus line represents the fitted resultant curve while in Fig.3 (b), the solid line represents the PSD to be fitted and the dash line represents the fitted result It can be seen that the proposed scheme is effective for the real problem since the maximum frequency of the estimated signal and the real theoretical one corresponds each other very well. To further evaluate and compare the performance of the proposed RBF-NN based scheme to other algorithm, we perform Monte Carlo simulation for all the three algorithms (LS-AR, GA-MPSV and our NN-based schemes) and the results and
Table 1. Comparisons of different methods
LS-AR Basic analysis
Simple& general But without the consideration of chaotic property
GA-MPSV Effective but complex for parameter estimation in chaos, need inverse filter and global searching resulting its high computation complexity
RBF-NN based scheme (The one proposed in this paper) Effective and less complex, by proper choosing NN type and dividing the data into training and test set
Mean Square Error (MSE)
0.4002
0.1025
2.5 × 10 −6
Time for once computation
5 minutes
4~6 hours
5~10 minutes
RBF-NN Predictor for Parameter Estimation in Chaotic Noise
141
analysis are given out in Table 1, which shows the new scheme has the best tradeoff between computational complexity and estimation performance. Moreover, the new scheme can work well for low SNR. It should be noted that the time complexity is obtained using our desktop computer.
5 Summary and Discussion In order to solve the problem of parameter estimation in chaotic noise for radar and communication system, we proposed a new scheme that combines the NN and PSD analysis together. Theory analysis and simulation results are given out to show the effectiveness of the proposed method. Moreover, specific technique such as the design, initialize and training of NN are also discussed in this paper. Finally, we compare the performances of several methods systematically in this paper. We end this paper in such conclude that the research of this study has potential application in various fields as in secret communication for narrow band interference rejection or attenuation and in radar signal processing for weak target detection and identification in sea clutter [7][8][9][10][11][12][13][14][15][16]. Further work may be focus on the study of the influence of SNR to the algorithm and apply our algorithm to some real systems.
Acknowledgement The authors gratefully acknowledge the support of “National Science Fund of Shaanxi Province” and “NPU Yingcai Training Task” funds. Hongmei Xie would also like to give her sincere thanks to Dr. Mingyu Lu who is now an assistant professor in Wave Scattering Research Center (WSRC) of Electrical Engineering department in the University of Texas at Arlington (UTA) because she really learned a lot and improved her work and English expression after discussing with him.
References [1] Li, X.: Detection of signals in chaos. Proceedings of the IEEE 83 (1) 95-122 [2] Fu, Y., Leung, H.: Narrow-Band Interference Cancellation in Spread Spectrum Communication Systems Using Chaos. IEEE Trans. Circuits and Systems-I: Fundamental theory and application 48 (7) (2001) 847-858 [3] Leung, H., Huang, X.: Parameter Estimation in Chaotic Noise. IEEE Trans. Signal Processing 41 (10) (1996) 2456-2463 [4] Xie, H., Zhao, J., Yu, B.:The Parameter Estimation in Chaotic Noise and its Performance Simulation. NanJing, P.R.China, Proceedings IEEE ICNNSP2003 12 (2003) 779-783 [5] Holger, K.: Nonlinear Time Series Analysis. London :Cambridge university Press (1997) [6] Djonin, D., et.al.: On the Application of Minimum Phase Space Volume Parameter Estimation. European Signal Processing Conference, (EUPSICO), Finlande (2000) [7] Li, Z., Dong, H., Quan, T.:Apply Genetic Algorithm to Parameter Estimation in Chaotic Noise. ICSP’02 Proceedings (2002) 1399-1402
142
H. Xie and X. Feng
[8] Lo, T., et.al.:Fractal Characterization of Sea-Scattered Signals and Detection of Sea-Surface Targets. IEE Proc F 140 (1990) 243-250 [9] Andreyev, Y., et.al.: Information Processing Using Dynamical Chaos: Neural Networks Implementation. IEEE Trans on Neural Network 7 (2) (1996) 290-298 [10] Wako, Y., Shin, I., Sato, M.: Reconstruction of Chaotic Dynamic Using a Noise-Robust Embedding Method. IEEE ICASS (2000) 181-184 [11] Leung, H., Neville, D.: Detection of Small Objects in Clutter Using a GA-RBF Neural Network. IEEE Trans. Aerospace Electronic Systems 38 (1) (2002) 98-116 [12] Stright, J., et.al.: An Application of Embedology to Spatio-Temporal Pattern Recognition. IEEE Trans on Aerospace Electron. Systems 32 (2) (1996) 768-773 [13] Leung, H.: Applying Chaos to Radar Detection in an Ocean Environment: An Experimental Study. IEEE Journal of Ocean Engineering 20 (1) (1995) 56-64 [14] Leung, H.: Chaotic Radar Signal Processing over the Sea. IEEE Journal of Ocean Engineering 18 (3) (1993) 287-295 [15] Haykin, S., Li, X.: Detection of Signals in Chaos. Proceedings of the IEEE 83 (1) (1995) 95-122
Global Exponential Synchronization of Chaotic Neural Networks with Time Delays Jigui Jian1 , Baoxian Wang1 , and Xiaoxin Liao2 1
2
Institute of Nonlinear Complex Systems, Three Gorges University, Yichang, Hubei, 443000, China
[email protected] Department of Control Science and Engineering, Huazhong University of Science and Technology, Wuhan, Hubei, 430074, China xiaoxin
[email protected]
Abstract. This Letter deals with the global exponential synchronization of a class of chaotic neural networks with time delays. Based on the the Halanay inequality technique and the Lyapunov stability theory, a delay-independent and decentralized control law is derived to ensure the exponential synchronization of the model and the simpler, less conservative and more efficient results are easy to be verified in engineering applications. Finally, an illustrative example is given to demonstrate the effectiveness of the presented synchronization scheme.
1
Introduction
It has been well known that a chaotic system is a nonlinear deterministic system with complex and unpredictable behavior. Since Pecora et al. [1,2] proposed drive-response concept for constructing synchronization of coupled chaotic systems, the control and synchronization problems of the chaotic systems such Lorenz system and Chen’s circuits have been proposed and thoroughly studied over the past two decades due to its potential applications in creating secure communication systems [3-9 and the references therein]. Meanwhile, both Hopfield neural networks and cellular neural networks have been applied to describe complex nonlinear dynamical systems, and have become a field of active research over the past two decades. Most of researches devoted to the stability analysis of this kind of delayed neural networks, and several stability criteria have been developed [10-12]. The chaotic phenomena in Hopfield neural networks and cellular neural networks with two or more neurons and different delays have also been found and investigated [3-7,9]. To the best of our knowledge, neural networks are all nonlinear and highdimensional systems consisting of many neurons. For such systems, centralized control method is hard to implement. This Letter considers a decentralized feedback control method for the synchronization problem of a class of chaotic systems such as Hopfield neural networks and cellular neural networks with time delays. A decentralized feedback control technique is adopted here. Based on the the Halanay inequality technique and the Lyapunov stability theory, a decentralized D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 143–150, 2007. c Springer-Verlag Berlin Heidelberg 2007
144
J. Jian, B. Wang, and X. Liao
feedback control input is derived to achieve synchronization of the drive-response chaotic neural networks with time delays. The rest of the Letter is organized as follows. Section 2 defines the exponential synchronization problem of the drive-response chaotic neural networks and gives some assumptions. In Section 3, a decentralized feedback control input is derived to achieve exponential synchronization. In Section 4, we show an illustrative example. Finally, conclusions are presented in Section 5.
2
Neural Network Model and Preliminaries
Consider the following chaotic neural network x˙ i (t) = −gi (xi (t))+
n
aij fj (xj (t))+
j=1
n
bij fj (xj (t−τ ))+Ii , i = 1, 2, . . . , n, (1)
j=1
for t ≥ 0, where n denotes the number of neurons in the network, xi (t)(i = 1, 2, . . . , n) denotes the state variable of the chaotic system. Functions gi (·) and fi (·) : R −→ R are continuous, and gi (0) = fi (0) = 0. A = (aij )n×n and B = (bij )n×n are real matrixes, which denote the strength of neuron interconnections, Ii is external inputs. The initial values with (1) are xi (t) = ϕi (t) ∈ C([−τ, 0], R) for i = 1, 2, . . . , n. Note x(t) = (x1 (t), x2 (t), . . . , xn (t))T . In order to observe the synchronization behavior of system (1), we introduce another chaotic network which is the response system of the drive system (1). The behavior of the response system depends on the behavior of the drive system, but the drive system is not influenced by the response system. Moreover, considering a chaotic system depends extremely on initial values, the initial condition of the response system is defined to be different from that of the drive system. Therefore, the response system of network (1) can be written as y˙ i (t) = −gi (yi (t)) +
n j=1
aij fj (yj (t)) +
n
bij fj (yj (t − τ ) + Ii + ui , i = 1, 2, . . . , n,
j=1
(2) where yi (t) (i = 1, 2, . . . , n) denotes the state variable of the response system, y(t) = (y1 (t), y2 (t), . . . , yn (t))T , ui = ui (t) is the appropriate control input that will be designed in sequel to obtain a certain control objective. The initial values with (2) are yi (t) = φi (t) ∈ C([−τ, 0], R) for i = 1, 2, . . . , n. Here the drive system with state variable xi (t) drives the response system having identical dynamical equations with state variable yi (t). Although the system’s parameters are same, the initial condition on the drive system is different from that of the response system. In fact, even infinitesimal difference in the initial condition in (1) and (2) will lead to different chaotic phenomenon in those systems. Let us define the synchronization error vector e(t) = (e1 (t), e2 (t), . . . , en (t))T with ei (t) = xi (t) − yi (t) for i = 1, 2, . . . , n. Moreover, e(t) −→ 0 as t −→ +∞ means the models given in (1) and (2) are asymptotically synchronized. Before proceeding, we make the following assumptions for the functions gi (·) and the activation functions fi (·):
Global Exponential Synchronization of Chaotic Neural Networks
145
Assumption H1. Functions gi (xi (t)) are globally Lipschitz continuous. Morei (x) over γi = inf { dgdx } > 0 for i = 1, · · · , n. x∈R
Assumption H2. Each activation function f_i(·) is bounded and satisfies a Lipschitz condition with Lipschitz constant L_i > 0, i.e., |f_i(x) − f_i(y)| ≤ L_i|x − y| for all x, y ∈ R.

The error dynamics between (1) and (2) can be expressed by

\dot{e}_i(t) = -G_i(e_i(t)) + \sum_{j=1}^{n} a_{ij} F_j(e_j(t)) + \sum_{j=1}^{n} b_{ij} F_j(e_j(t-\tau)) - u_i,  i = 1, 2, ..., n,   (3)

where G_i(e_i(t)) = g_i(x_i(t)) − g_i(y_i(t)) and F_i(e_i(t)) = f_i(x_i(t)) − f_i(y_i(t)). Model (3) can be rewritten in the matrix form

\dot{e}(t) = -G(e(t)) + AF(e(t)) + BF(e(t-\tau)) - u(t),   (4)
where G(e(t)) = (G_1(e_1(t)), G_2(e_2(t)), ..., G_n(e_n(t)))^T, F(e(t)) = (F_1(e_1(t)), F_2(e_2(t)), ..., F_n(e_n(t)))^T, and u(t) = (u_1(t), u_2(t), ..., u_n(t))^T.

Definition 1. Systems (1) and (2) are said to be exponentially synchronized if there exist constants M ≥ 1 and λ > 0 such that

\|x(t) - y(t)\| \le M \sup_{t_0-\tau \le s \le t_0} \|\varphi(s) - \phi(s)\|\, e^{-\lambda(t-t_0)},  t ≥ t_0,

where ϕ(s) = (ϕ_1(s), ϕ_2(s), ..., ϕ_n(s))^T and φ(s) = (φ_1(s), φ_2(s), ..., φ_n(s))^T. The constant λ is called the exponential synchronization rate.

Lemma 1 (Halanay inequality) [11]. Let τ > 0 and let x(t) be a nonnegative continuous scalar function defined for t ≥ t_0 − τ which satisfies D^+ x(t) ≤ -r_1 x(t) + r_2 \tilde{x}(t) for t ≥ t_0, where \tilde{x}(t) = \sup_{t-\tau \le s \le t}\{x(s)\} and r_1, r_2 are constants. If r_1 > r_2 > 0, then

x(t) \le \tilde{x}(t_0) e^{-\lambda(t-t_0)},  t ≥ t_0,

where λ is the unique positive root of the equation λ = r_1 − r_2 e^{λτ}.

The aim of this Letter is to determine a decentralized state-feedback control input u_i(t) that exponentially synchronizes the unidirectionally coupled identical chaotic neural networks, which share the same system parameters but start from different initial conditions.
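The rate λ in Lemma 1 is only defined implicitly. As a quick numerical aside (ours, not part of the original Letter), the unique positive root of λ = r_1 − r_2 e^{λτ} can be found by bisection, since g(λ) = r_1 − r_2 e^{λτ} − λ is strictly decreasing and g(0) = r_1 − r_2 > 0; the sample values passed at the end are purely illustrative:

```python
import math

def halanay_rate(r1, r2, tau, tol=1e-10):
    """Unique positive root of lambda = r1 - r2*exp(lambda*tau), assuming r1 > r2 > 0."""
    g = lambda lam: r1 - r2 * math.exp(lam * tau) - lam
    lo, hi = 0.0, 1.0
    while g(hi) > 0:              # bracket the root: g(0) = r1 - r2 > 0 and g is decreasing
        hi *= 2.0
    while hi - lo > tol:          # bisection
        mid = 0.5 * (lo + hi)
        if g(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

print(halanay_rate(4.0, 1.0, 1.0))   # illustrative r1, r2, tau only
```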
3 Main Results
Theorem 1. Consider the drive-response chaotic neural networks (1) and (2) satisfying assumptions (H1) and (H2). If the control input u_i(t) in (3) is designed as

u_i(t) = \eta_i e_i(t),  i = 1, 2, ..., n,   (5)
where the constants η_i are chosen such that the matrix \tilde{A} = (\tilde{a}_{ij})_{n×n} is negative definite and d_m \lambda_M(\tilde{A}) > d_M \lambda_M(\tilde{B}) for some positive diagonal matrix D = diag(d_1, d_2, ..., d_n) > 0 with d_m = \min_{1\le i\le n}\{d_i\} and d_M = \max_{1\le i\le n}\{d_i\}, where

\tilde{a}_{ij} = \begin{cases} -d_i(\gamma_i + \eta_i) + d_i L_i |a_{ii}|, & i = j = 1, 2, ..., n,\\ \dfrac{d_i L_j |a_{ij}| + d_j L_i |a_{ji}|}{2}, & i \ne j;\ i, j = 1, 2, ..., n, \end{cases}
\qquad
\tilde{B} = \begin{pmatrix} 0 & D|B|L \\ L|B|^{T}D & 0 \end{pmatrix}

is a 2n × 2n matrix with |B| = (|b_{ij}|)_{n×n} and L = diag(L_1, L_2, ..., L_n), and -\lambda_M(\tilde{A}) and \lambda_M(\tilde{B}) are the maximum eigenvalues of \tilde{A} and \tilde{B}, respectively, then system (3) is globally exponentially stable, i.e., global exponential synchronization of systems (1) and (2) is obtained with synchronization rate \tfrac{1}{2}\lambda, where λ is the unique positive root of the equation

\lambda = \frac{2\lambda_M(\tilde{A})}{d_M} - \frac{\lambda_M(\tilde{B})}{d_m} - \frac{\lambda_M(\tilde{B})}{d_m} e^{\lambda\tau}.
Proof. To confirm that the origin of (3) (equivalently (4)) is globally exponentially stable, consider the continuous Lyapunov function

V(t) = \frac{1}{2}\sum_{i=1}^{n} d_i e_i^2(t).   (6)

Then for every e ∈ R^n,

\frac{1}{2} d_m \|e(t)\|_2^2 \le V(t) \le \frac{1}{2} d_M \|e(t)\|_2^2.   (7)

Subsequently, with the controller (5) and assumptions (H1) and (H2), evaluating the time derivative of V(t) along the trajectory of (3) gives

\dot{V}(t) = -\sum_{i=1}^{n} d_i e_i(t) G_i(e_i(t)) + \sum_{i=1}^{n}\sum_{j=1}^{n} d_i a_{ij} e_i(t) F_j(e_j(t)) + \sum_{i=1}^{n}\sum_{j=1}^{n} d_i b_{ij} e_i(t) F_j(e_j(t-\tau)) - \sum_{i=1}^{n} d_i \eta_i e_i^2(t)
\le \sum_{j=1}^{n}\Big[\big(-d_j(\gamma_j+\eta_j) + d_j L_j |a_{jj}|\big) e_j^2(t) + \sum_{i \ne j} d_i L_j |a_{ij}|\, |e_i(t) e_j(t)|\Big] + \sum_{j=1}^{n}\sum_{i=1}^{n} d_i L_j |b_{ij}|\, |e_i(t) e_j(t-\tau)|
\le |e(t)|^{T}\tilde{A}|e(t)| + \frac{1}{2}\begin{pmatrix}|e(t)|\\ |e(t-\tau)|\end{pmatrix}^{T}\tilde{B}\begin{pmatrix}|e(t)|\\ |e(t-\tau)|\end{pmatrix}
\le \big(-\lambda_M(\tilde{A}) + \tfrac{1}{2}\lambda_M(\tilde{B})\big) e^{T}(t)e(t) + \tfrac{1}{2}\lambda_M(\tilde{B})\, e^{T}(t-\tau)e(t-\tau)
\le -r_1 V(t) + r_2 \tilde{V}(t),

where

r_1 = \frac{2\lambda_M(\tilde{A})}{d_M} - \frac{\lambda_M(\tilde{B})}{d_m},  r_2 = \frac{\lambda_M(\tilde{B})}{d_m},  \tilde{V}(t) = \sup_{t-\tau \le s \le t} V(s) = \frac{1}{2}\sup_{t-\tau \le s \le t}\sum_{i=1}^{n} d_i e_i^2(s).
By Lemma 1, we obtain

V(t) \le \tilde{V}(t_0)\exp(-\lambda(t-t_0)),  t ≥ t_0.   (8)

Combining (7) and (8), we have

\|e(t)\| \le \sqrt{\frac{d_M}{d_m}} \sup_{t_0-\tau \le s \le t_0}\|\varphi(s)-\phi(s)\| \exp\Big(-\frac{\lambda}{2}(t-t_0)\Big),  t ≥ t_0.

Therefore, system (3) is globally exponentially stable, i.e., under the control input (5) every trajectory y_i(t) of system (2) synchronizes exponentially with the corresponding variable x_i(t) of neural network (1). The proof is completed.

Corollary 1. If the drive-response chaotic neural networks (1) and (2) satisfy assumptions (H1) and (H2) and the control input u_i(t) in (3) is given by (5) such that r_1 + r_2 < 0, then exponential synchronization of systems (1) and (2) is obtained with synchronization rate \tfrac{1}{2}\lambda, where λ is the unique positive root of the equation λ = r_1 − r_2\exp(\lambda\tau) and

r_1 = \max_{1\le j\le n}\Big\{-2(\gamma_j+\eta_j) + \sum_{i=1}^{n}\big[L_j|a_{ij}| + L_i(|a_{ji}| + |b_{ji}|)\big]\Big\},  r_2 = \max_{1\le j\le n}\Big\{\sum_{i=1}^{n} L_j|b_{ij}|\Big\}.

Corollary 2. If the drive-response chaotic neural networks (1) and (2) satisfy assumptions (H1) and (H2) and the control input u_i(t) in (3) is given by (5) such that r_1 + r_2 + r_3 < 0, then exponential synchronization of systems (1) and (2) is obtained with synchronization rate \tfrac{1}{2}\lambda, where λ is the unique positive root of the equation λ = r_1 + r_2 − r_3\exp(\lambda\tau) and

r_1 = \max_{1\le j\le n}\Big\{-2(\gamma_j+\eta_j) + \sum_{i=1}^{n}\big(L_j|a_{ij}| + L_i|a_{ji}|\big)\Big\},  r_2 = \max_{1\le j\le n}\Big\{\sum_{i=1}^{n} L_i|b_{ji}|\Big\},  r_3 = \max_{1\le j\le n}\Big\{L_j\sum_{i=1}^{n}|b_{ij}|\Big\}.
4 An Illustrative Example
It has been demonstrated that if the system matrices A and B and the delay parameter τ are suitably specified, system (1) may display chaotic behavior [3,6,7]. The exponential synchronization condition for the delayed system (1) is illustrated by the following example.

Example. Consider a delayed Hopfield neural network (HNN) with two neurons [6]:

\begin{pmatrix}\dot{x}_1\\ \dot{x}_2\end{pmatrix} = \begin{pmatrix}-x_1(t)\\ -x_2(t)\end{pmatrix} + \begin{pmatrix}2 & -0.1\\ -5 & 3\end{pmatrix}\begin{pmatrix}f_1(x_1(t))\\ f_2(x_2(t))\end{pmatrix} + \begin{pmatrix}-1.5 & -0.1\\ -0.2 & -2.5\end{pmatrix}\begin{pmatrix}f_1(x_1(t-\tau))\\ f_2(x_2(t-\tau))\end{pmatrix},   (9)
Fig. 1. The chaotic behavior of system (9) with the initial condition x(s) = [0.4, 0.6]^T, −1 ≤ s ≤ 0; the horizontal axis is the state x_1(t) and the vertical axis is the state x_2(t)
Fig. 2. The synchronization error e1 (t), e2 (t) with the initial condition e(s) = [−1, 1]T , −1 ≤ s ≤ 0 between system (9) and system (10), in which the dashed line depicts the trajectory of error state e1 (t) and the solid line depicts the trajectory of error state e2 (t)
where g_i(x_i) = x_i, τ = 1, and f_i(x_i) = tanh(x_i) for i = 1, 2. The system satisfies assumptions (H1) and (H2) with L_1 = L_2 = 1 and γ_1 = γ_2 = 1. It should be noted that system (9) is actually a chaotic delayed Hopfield neural network
with the initial condition (x_1(s), x_2(s))^T = (0.4, 0.6)^T for −1 ≤ s ≤ 0 (see [3,6]). The response chaotic Hopfield neural network with delays is designed as

\begin{pmatrix}\dot{y}_1\\ \dot{y}_2\end{pmatrix} = \begin{pmatrix}-y_1(t)\\ -y_2(t)\end{pmatrix} + \begin{pmatrix}2 & -0.1\\ -5 & 3\end{pmatrix}\begin{pmatrix}f_1(y_1(t))\\ f_2(y_2(t))\end{pmatrix} + \begin{pmatrix}-1.5 & -0.1\\ -0.2 & -2.5\end{pmatrix}\begin{pmatrix}f_1(y_1(t-\tau))\\ f_2(y_2(t-\tau))\end{pmatrix} + \begin{pmatrix}u_1(t)\\ u_2(t)\end{pmatrix},   (10)

and the control inputs are designed as u_1(t) = η_1 e_1(t), u_2(t) = η_2 e_2(t). Let d_1 = d_2 = 1; then

\tilde{A} = \begin{pmatrix}1-\eta_1 & 2.55\\ 2.55 & 1-\eta_2\end{pmatrix},  \tilde{B} = \begin{pmatrix}0 & 0 & 1.5 & 0.1\\ 0 & 0 & 0.2 & 2.5\\ 1.5 & 0.2 & 0 & 0\\ 0.1 & 2.5 & 0 & 0\end{pmatrix}

with λ_M(\tilde{B}) = 2.52, where η_1 and η_2 can be chosen to ensure that \tilde{A} is negative definite and λ_M(\tilde{A}) > λ_M(\tilde{B}). If we let η_1 = η ≥ 7 and η_2 = η + 1, then λ_M(\tilde{A}) = η − 3.55 ≥ 3.45 > λ_M(\tilde{B}) = 2.52. From Theorem 1, exponential synchronization of systems (9) and (10) is obtained with synchronization rate \tfrac{1}{2}λ, where λ is the unique positive root of the equation λ = 2(η − 3.55) − 2.55 − 2.55\exp(λτ). For instance, for η = 7 and η = 10, the exponential synchronization rates of (9) and (10) are at least \tfrac{1}{2}λ = 0.225 and \tfrac{1}{2}λ = 0.64, respectively. Figure 1 depicts the chaotic behavior of system (9) with the initial condition x(s) = [0.4, 0.6]^T, −1 ≤ s ≤ 0. Figure 2 depicts the synchronization error e_1(t), e_2(t) between the drive system (9) and the response system (10).
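As a rough check of this example (ours, not part of the original Letter), the drive-response pair (9)-(10) with the controller u_i = η_i e_i can be integrated with a simple Euler scheme and a delay buffer. The matrices are the ones read off from (9) above; the step size, constant initial histories, and the specific gains η_1 = 7, η_2 = 8 are assumptions made for the sketch:

```python
import numpy as np

A = np.array([[2.0, -0.1], [-5.0, 3.0]])      # connection matrix of (9)
B = np.array([[-1.5, -0.1], [-0.2, -2.5]])    # delayed connection matrix of (9)
tau, dt, T = 1.0, 0.001, 10.0                 # delay, Euler step (assumption), horizon
eta = np.array([7.0, 8.0])                    # eta_1 = 7, eta_2 = eta_1 + 1
d = int(round(tau / dt))                      # delay expressed in steps

steps = int(T / dt)
x = np.zeros((steps + d + 1, 2))              # drive state, with history on [-tau, 0]
y = np.zeros((steps + d + 1, 2))              # response state
x[: d + 1] = [0.4, 0.6]                       # constant initial history of the drive system
y[: d + 1] = [1.4, -0.4]                      # gives e(s) = x(s) - y(s) = [-1, 1] on [-tau, 0]

for k in range(d, d + steps):
    fx, fxd = np.tanh(x[k]), np.tanh(x[k - d])
    fy, fyd = np.tanh(y[k]), np.tanh(y[k - d])
    u = eta * (x[k] - y[k])                   # decentralized controller (5)
    x[k + 1] = x[k] + dt * (-x[k] + A @ fx + B @ fxd)
    y[k + 1] = y[k] + dt * (-y[k] + A @ fy + B @ fyd + u)

print("final synchronization error:", x[-1] - y[-1])
```

Under the stated assumptions the error decays quickly toward zero, which is consistent with the behavior reported in Fig. 2.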
5 Conclusion
This Letter has proposed a decentralized control scheme to guarantee global exponential synchronization for a class of delayed chaotic neural networks, including Hopfield neural networks and cellular neural networks. By constructing suitable controllers and using the Halanay inequality, delay-independent criteria have been derived to ensure global exponential synchronization of delayed chaotic neural networks. Furthermore, the synchronization rate can be easily estimated. Finally, a numerical example has been given to verify the correctness of the results.
Acknowledgments This work was partially supported by National Natural Science Foundation of China (60474011, 60574025), and the Scientific Research Projects of Hubei Provincial Department of Education (D200613002) and the Doctoral PreResearch Foundation of China Three Gorges University.
References
1. Pecora, L.M., Carroll, T.L.: Synchronization in Chaotic Systems. Phys. Rev. Lett. 64 (8) (1990) 821-824
2. Carroll, T.L., Pecora, L.M.: Synchronizing Chaotic Circuits. IEEE Trans. Circ. Syst. 38 (4) (1991) 453-456
3. Cheng, C.J., Liao, T.L., Yan, J.J., Hwang, C.C.: Synchronization of Neural Networks by Decentralized Feedback Control. Physics Letters A 338 (2005) 28-35
4. Wang, Z.S., Zhang, H.G., Wang, Z.L.: Global Asymptotic Synchronization of a Class of Delayed Chaotic Neural Networks. Journal of Northeastern University (Natural Science) 27 (6) (2006) 598-601
5. Wang, Z.S., Zhang, H.G., Wang, Z.L.: Global Synchronization of a Class of Chaotic Neural Networks. Acta Physica Sinica 55 (6) (2006) 2687-2693
6. Cheng, C.J., Liao, T.L., Hwang, C.C.: Exponential Synchronization of a Class of Chaotic Neural Networks. Chaos, Solitons & Fractals 24 (2005) 197-206
7. Li, C., Chen, G.: Synchronization in General Complex Dynamical Networks with Coupling Delays. Physica A 343 (2004) 263-278
8. Liao, X.X., Chen, G.R., Wang, H.O.: On Global Synchronization of Chaotic Systems. Dynamics of Continuous, Discrete and Impulsive Systems 10 (2003) 865-872
9. Cao, J., Li, P., Wang, W.: Global Synchronization in Arrays of Delayed Neural Networks with Constant and Delayed Coupling. Physics Letters A 353 (2006) 318-325
10. Jian, J.G., Kong, D.M., Luo, H.G., Liao, X.X.: Exponential Stability of Differential Systems with Separated Variables and Time Delays. J. Central South University (Science and Technology) 36 (2) (2005) 282-287
11. Liao, X.X., Xiao, D.M.: Globally Exponential Stability of Hopfield Neural Networks with Time-Varying Delays. Acta Electronica Sinica 28 (4) (2000) 87-90
12. Zhang, J.Y.: Globally Exponential Stability of Neural Networks with Variable Delays. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications 50 (2) (2003) 288-291
A Fuzzy Neural Network Based on Back-Propagation
Huang Jin1,2, Gan Quan1, and Cai Linhui1
1 NanJing Artillery Academy, NanJing 211132
2 NanJing University of Science and Technology, NanJing 210094
[email protected]
Abstract. Several fuzzy neural network algorithms have been put forward in which the weights are treated as special fuzzy numbers. This paper proposes the concept of a strong L-R type fuzzy number and derives a learning algorithm based on the BP algorithm via level sets of strong L-R type fuzzy numbers. The restriction to a special class of fuzzy numbers is thereby weakened to the general case, which enlarges the range of application.
1 Introduction

Some fuzzy neural network models have been put forward in recent years [1], [2]. One approach to direct fuzzification is to transform real inputs and real targets into fuzzy numbers. Ishibuchi et al. proposed a neural network for fuzzy input vectors in which the connection weights were fuzzified. Hayashi also fuzzified the delta rule, while Ishibuchi et al. derived a crisp learning algorithm for triangular fuzzy weights. All of these approaches share the same drawback: the fuzzy weights must be symmetric triangular fuzzy numbers, which restricts the range of application. In this paper we first put forward a fuzzy neural network whose input-output relation is defined by the extension principle of Zadeh [3]. The input-output relation is numerically calculated by interval arithmetic via level sets (i.e., α-cuts) of the fuzzy weights and fuzzy inputs. Next we define the strong L-R type fuzzy number and show its good properties with respect to interval arithmetic. After defining a cost function for level sets of the fuzzy outputs and fuzzy targets, we propose a learning algorithm that adjusts the three parameters of each strong L-R type fuzzy weight from this cost function. Lastly, we examine the ability of the proposed fuzzy neural network to implement fuzzy if-then rules.
2 Fuzzy Neural Network Algorithms

In fuzzy neural networks of the BP type, neurons are organized into a number of different layers and signals flow in one direction; there are no interactions or feedback loops among the neurons in the same layer. Fig. 1 shows this fuzzy neural network model. According to the types of inputs and weights, we define three different kinds of fuzzy neural networks: (I) crisp weights and fuzzy inputs; (II) fuzzy weights
and crisp inputs; (III) fuzzy weights and fuzzy inputs. This paper deals with type (III) fuzzy feedforward neural networks. In this model, the connections between the layers are described by a matrix of fuzzy weights W_ji, where W_ji is the fuzzy weight of the connection between the ith neuron of the input layer and the jth neuron of the hidden layer. The total fuzzy input of the jth neuron in the second layer is defined as

Net_{pj} = \sum_{i=1}^{N_j} W_{ji} \cdot O_{pi} + \Theta_j,   (1)
where Net_{pj} is the total fuzzy input of the jth neuron of the hidden layer, O_{pi} = X_{pi} is the ith fuzzy input of that neuron, and Θ_j is the fuzzy bias of the jth neuron. The fuzzy output of the jth neuron is defined with the transfer function f(Net) = 1/(1 + exp(−Net)):

O_{pj} = f(Net_{pj}),  j = 1, 2, ..., N_H.   (2)
Furthermore, the total fuzzy input and the fuzzy output of the kth neuron of the output layer are defined as

Net_{pk} = \sum_{j=1}^{N_H} W_{kj} \cdot O_{pj} + \Theta_k,   (3)

O_{pk} = f(Net_{pk}).   (4)
The input-output relations (1)-(4) can be defined by the extension principle [3]. The fuzzy output is numerically calculated for level sets (i.e., α-cuts) of the fuzzy inputs, fuzzy weights, and fuzzy biases. Next, we need a type of fuzzy number for the fuzzy inputs, fuzzy weights, and fuzzy biases whose properties make it easy to handle with interval arithmetic. Furthermore, let (X_p, T_p) be a fuzzy input-output pair, where T_p = (T_{p1}, T_{p2}, ..., T_{pn}) is an N_O-dimensional fuzzy target vector corresponding to the fuzzy input vector X_p. The cost function for the input-output pair (X_p, T_p) is obtained as
e_p = \sum_{h} e_{ph}.   (5)

The cost function for the h-level sets of the fuzzy output vector O_p and the fuzzy target vector is defined as

e_{ph} = \sum_{k=1}^{N_O} e_{pkh},   (6)
where

e_{pkh} = e^{L}_{pkh} + e^{U}_{pkh},   (7)

e^{L}_{pkh} = h \cdot \frac{([T_{pk}]^{L}_{h} - [O_{pk}]^{L}_{h})^2}{2},   (8)

e^{U}_{pkh} = h \cdot \frac{([T_{pk}]^{U}_{h} - [O_{pk}]^{U}_{h})^2}{2}.   (9)
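To make the level-set cost (6)-(9) concrete, here is a minimal sketch (ours, not the authors' code): the h-level sets of the fuzzy target and fuzzy output are intervals, and the cost is the h-weighted squared error of their endpoints; the numbers in the example call are purely illustrative:

```python
def level_cost(h, target_iv, output_iv):
    """Cost e_pkh of eqs. (7)-(9): target_iv and output_iv are (lower, upper) h-level intervals."""
    tL, tU = target_iv
    oL, oU = output_iv
    return h * (tL - oL) ** 2 / 2 + h * (tU - oU) ** 2 / 2

print(level_cost(0.5, (0.9, 1.1), (0.7, 1.3)))   # one output unit at h = 0.5
```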
In the next section we introduce the strong L-R type fuzzy number and put forward an FNN learning algorithm based on BP.
3 Strong L-R Representation of Fuzzy Numbers

Definition 1. A function, usually denoted L or R, is a reference function of fuzzy numbers if (1) L(0) = 1; (2) L(x) = L(−x); (3) L is non-increasing on [0, +∞).

Definition 2. A fuzzy number M is said to be an L-R type fuzzy number if

\mu_M(x) = \begin{cases} L\Big(\dfrac{m-x}{\alpha}\Big), & x \le m,\ \alpha > 0,\\[4pt] R\Big(\dfrac{x-m}{\beta}\Big), & x \ge m,\ \beta > 0, \end{cases}   (10)

where L is the left and R the right reference function, m is the mean value of M, and α and β are called the left and right spreads. Symbolically, we write M = (m, α, β)_{LR}.

Definition 3. A fuzzy number M is said to be a strong L-R type fuzzy number if L(1) = R(1) = 0.

This kind of fuzzy number has the following properties:
1. The α-cuts of every fuzzy number are closed intervals of real numbers.
2. Fuzzy numbers are convex fuzzy sets.
3. Since L(1) = R(1) = 0, the membership function vanishes at x = m − α and at x = m + β, so the support of every strong L-R type fuzzy number is the bounded interval (m − α, m + β).
These properties are essential for defining meaningful arithmetic operations on fuzzy numbers. Since each fuzzy set is uniquely represented by its α-cuts, and these are closed intervals of real numbers, arithmetic operations on fuzzy numbers can be defined in terms of arithmetic operations on closed intervals of real numbers. Such interval operations are a cornerstone of interval analysis, a well-established area of classical mathematics; we apply them in the next section to define arithmetic operations on fuzzy numbers. The strong L-R type is an important class of fuzzy numbers, and the triangular fuzzy number (T.F.N.) is a special case of it. We can write any strong L-R type fuzzy number symbolically as M = (α, β, γ)_{LR*}; in other words, a strong L-R type fuzzy number can be uniquely represented by three parameters. Accordingly, we can adjust the three parameters of each strong L-R type fuzzy weight and fuzzy bias.
W_{kj} = (w^{\alpha}_{kj}, w^{\beta}_{kj}, w^{\gamma}_{kj})_{LR*},  W_{ji} = (w^{\alpha}_{ji}, w^{\beta}_{ji}, w^{\gamma}_{ji})_{LR*},  \Theta_k = (\theta^{\alpha}_k, \theta^{\beta}_k, \theta^{\gamma}_k)_{LR*},  \Theta_j = (\theta^{\alpha}_j, \theta^{\beta}_j, \theta^{\gamma}_j)_{LR*}.

Moreover, let

c_{kj} = \frac{w^{\gamma}_{kj} - w^{\beta}_{kj}}{w^{\beta}_{kj} - w^{\alpha}_{kj}},  c_{ji} = \frac{w^{\gamma}_{ji} - w^{\beta}_{ji}}{w^{\beta}_{ji} - w^{\alpha}_{ji}},  c_k = \frac{\theta^{\gamma}_k - \theta^{\beta}_k}{\theta^{\beta}_k - \theta^{\alpha}_k},  c_j = \frac{\theta^{\gamma}_j - \theta^{\beta}_j}{\theta^{\beta}_j - \theta^{\alpha}_j};

then

w^{\beta}_{kj} = \frac{w^{\gamma}_{kj} + c_{kj} \cdot w^{\alpha}_{kj}}{1 + c_{kj}},

and w^{\beta}_{ji}, \theta^{\beta}_k, \theta^{\beta}_j have the same form as w^{\beta}_{kj}.
We now discuss how to learn the strong L-R type fuzzy weight W_{kj} = (w^{\alpha}_{kj}, w^{\beta}_{kj}, w^{\gamma}_{kj}) between the jth hidden unit and the kth output unit. Similarly to Rumelhart's rule, the adjustment of each parameter is computed from the cost function:

\Delta w^{L}_{kj}(t) = -\eta \frac{\partial e_{ph}}{\partial w^{L}_{kj}} + \xi \cdot \Delta w^{L}_{kj}(t-1),   (11)

\Delta w^{U}_{kj}(t) = -\eta \frac{\partial e_{ph}}{\partial w^{U}_{kj}} + \xi \cdot \Delta w^{U}_{kj}(t-1).   (12)
The derivatives above can be written as follows:

\frac{\partial e_{ph}}{\partial w^{\alpha}_{kj}} = \frac{\partial e_{ph}}{\partial [w_{kj}]^{\alpha}_{h}} \cdot \frac{\partial [w_{kj}]^{\alpha}_{h}}{\partial w^{\alpha}_{kj}} + \frac{\partial e_{ph}}{\partial [w_{kj}]^{\gamma}_{h}} \cdot \frac{\partial [w_{kj}]^{\gamma}_{h}}{\partial w^{\alpha}_{kj}},   (13)

\frac{\partial e_{ph}}{\partial w^{\gamma}_{kj}} = \frac{\partial e_{ph}}{\partial [w_{kj}]^{\alpha}_{h}} \cdot \frac{\partial [w_{kj}]^{\alpha}_{h}}{\partial w^{\gamma}_{kj}} + \frac{\partial e_{ph}}{\partial [w_{kj}]^{\gamma}_{h}} \cdot \frac{\partial [w_{kj}]^{\gamma}_{h}}{\partial w^{\gamma}_{kj}}.   (14)
Since W_{kj} is a strong L-R type fuzzy number, its h-level and 0-level parameters are related as follows:

[w_{kj}]^{\alpha}_{h} = \frac{w^{\gamma}_{kj} + c_{kj} \cdot w^{\alpha}_{kj}}{1 + c_{kj}} - \frac{w^{\gamma}_{kj} - w^{\alpha}_{kj}}{1 + c_{kj}} \cdot L^{-1}(h),   (15)

[w_{kj}]^{\gamma}_{h} = \frac{w^{\gamma}_{kj} + c_{kj} \cdot w^{\alpha}_{kj}}{1 + c_{kj}} + \frac{c_{kj}(w^{\gamma}_{kj} - w^{\alpha}_{kj})}{1 + c_{kj}} \cdot R^{-1}(h).   (16)
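As an aside, relations (15) and (16) can be made concrete with a short sketch (ours). It assumes the reference functions L(x) = R(x) = max(0, 1 − x²) used later in Section 4, so that L^{-1}(h) = R^{-1}(h) = sqrt(1 − h); the specific parameter values are illustrative only:

```python
import math

def h_level_interval(w_alpha, w_beta, w_gamma, h):
    """Endpoints of the h-level set of a strong L-R weight (w_alpha, w_beta, w_gamma),
    assuming L(x) = R(x) = max(0, 1 - x**2), so L^{-1}(h) = R^{-1}(h) = sqrt(1 - h)."""
    c = (w_gamma - w_beta) / (w_beta - w_alpha)
    peak = (w_gamma + c * w_alpha) / (1 + c)          # equals w_beta
    inv = math.sqrt(1.0 - h)
    lower = peak - (w_gamma - w_alpha) / (1 + c) * inv
    upper = peak + c * (w_gamma - w_alpha) / (1 + c) * inv
    return lower, upper

print(h_level_interval(-1.0, 0.5, 2.0, 0.0))   # h = 0 recovers (w_alpha, w_gamma)
print(h_level_interval(-1.0, 0.5, 2.0, 1.0))   # h = 1 collapses to the peak w_beta
```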
Therefore,

\frac{\partial e_{ph}}{\partial w^{\alpha}_{kj}} = \frac{\partial e_{ph}}{\partial [w_{kj}]^{\alpha}_{h}} \cdot \Big[\frac{c_{kj}}{1+c_{kj}} + \frac{L^{-1}(h)}{1+c_{kj}}\Big] + \frac{\partial e_{ph}}{\partial [w_{kj}]^{\gamma}_{h}} \cdot \Big[\frac{c_{kj}}{1+c_{kj}} - \frac{c_{kj}}{1+c_{kj}} R^{-1}(h)\Big],   (17)

\frac{\partial e_{ph}}{\partial w^{\gamma}_{kj}} = \frac{\partial e_{ph}}{\partial [w_{kj}]^{\alpha}_{h}} \cdot \Big[\frac{1}{1+c_{kj}} - \frac{L^{-1}(h)}{1+c_{kj}}\Big] + \frac{\partial e_{ph}}{\partial [w_{kj}]^{\gamma}_{h}} \cdot \Big[\frac{1}{1+c_{kj}} + \frac{c_{kj}}{1+c_{kj}} R^{-1}(h)\Big].   (18)
These relations explain how the error signals \partial e_{ph}/\partial [w_{kj}]^{\alpha}_{h} and \partial e_{ph}/\partial [w_{kj}]^{\gamma}_{h} for the h-level set propagate to the 0-level parameters of the strong L-R type fuzzy weight W_{kj}; the fuzzy weight is then updated by the following rules:

w^{\alpha}_{kj}(t+1) = w^{\alpha}_{kj}(t) + \Delta w^{\alpha}_{kj}(t),   (19)

w^{\gamma}_{kj}(t+1) = w^{\gamma}_{kj}(t) + \Delta w^{\gamma}_{kj}(t).   (20)
We assume that n values of h (i.e., h_1, h_2, ..., h_n) are used for the learning of the fuzzy neural network. The learning algorithm of the fuzzy neural network can then be defined as follows:
1. Initialize the fuzzy weights and the fuzzy biases.
2. For h = h_1, h_2, ..., h_n, repeat step 3.
3. Repeat the following procedures for p = 1, 2, ..., m (the m input-output pairs (X_p, T_p)):
   Forward calculation: calculate the h-level set of the fuzzy output vector O_p corresponding to the fuzzy input vector X_p.
   Back-propagation: adjust the fuzzy weights and the fuzzy biases using the cost function e_{ph}.
4. If a pre-specified stopping condition (e.g., the total number of iterations) is not satisfied, go to step 2.

Here (X_p, T_p) is a fuzzy input-output pair and T_p = (T_{p1}, T_{p2}, ..., T_{pn}) is the N_O-dimensional fuzzy target vector corresponding to the fuzzy input vector X_p. A schematic sketch of this loop is given below.
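The procedure amounts to a nested loop over h-levels and training pairs. The following sketch is only an outline under our own naming: the `forward` and `backward` routines are placeholders, not the authors' implementation.

```python
def train_fuzzy_bp(weights, biases, pairs, h_levels, epochs, forward, backward):
    """Schematic outer loop of the learning algorithm: 'forward' returns the h-level set of the
    fuzzy output, 'backward' adjusts the strong L-R parameters from the cost e_ph."""
    for _ in range(epochs):                       # step 4: stop on a pre-specified condition
        for h in h_levels:                        # step 2: repeat for h = h1, ..., hn
            for X_p, T_p in pairs:                # step 3: loop over the m input-output pairs
                O_p = forward(weights, biases, X_p, h)
                backward(weights, biases, X_p, T_p, O_p, h)
    return weights, biases
```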
4 Simulation

We consider an n-dimensional fuzzy classification problem. It can be described by if-then rules of the form: if x_{p1} is A_{p1} and ... and x_{pn} is A_{pn}, then x_p = (x_{p1}, ..., x_{pn}) belongs to G_p, where p = 1, 2, ..., k and A_{pi} is a linguistic term, for example "large" or "small". For convenience of computation, we assume that A_{pi} is a symmetric strong L-R type fuzzy number, that is, L = R = max(0, 1 − |x|^2). We can solve the above problem by using the proposed fuzzy neural network. We denote the fuzzy input as A_p = (A_{p1}, A_{p2}, ..., A_{pn}), and the target output T_p is defined as

T_p = \begin{cases} 1, & A_p \in \text{Class 1},\\ 0, & A_p \in \text{Class 2}. \end{cases}   (21)

According to the target output T and the actual output O, we define the error function

e_{ph} = \max\Big\{\frac{(t_p - o_p)^2}{2} \;\Big|\; o_p \in [Y_p]_h\Big\}.   (22)
We train this network so as to minimize e_{ph}. It is easy to see that this error function reduces to the classical BP error function e = \sum_{p=1}^{k} (t_p - o_p)^2/2 when the input vector A_p and the output Y_p are real numbers. We train the fuzzy neural network with the h-level sets h = 0.2, 0.4, 0.6, 0.8; the error function for a pair is then

e_p = \sum_{h} h \cdot \max\Big\{\frac{(t_p - o_p)^2}{2} \;\Big|\; o_p \in [Y_p]_h\Big\}.   (23)
In this way, we can deal with the fuzzy classification problem using the model of Section 3, where the input vectors and the weights are symmetric strong L-R type fuzzy numbers.
5 Example

As an example we measure the height of a city wall. A certain wall is chosen, with forty-five feature points. Twenty-three feature points, chosen regularly, are taken as training samples; the other twenty-two points are taken as testing samples. The BP algorithm applies the classical iterative error method; Table 1 shows the practical parameters.

Table 1. The parameters of the neural network system

Sample   Work   nInput   nHidden   nOutput   Eita   Alfa   Error   StepE   Trans Min-Max
23       45     3        15        1         1.2    0.5    0.3     6       0.2 - 0.8
5.1 The Result of the BP Network Calculation

Six-grade iteration is adopted; when the error falls below the given ε = 0.003 the loop ends. Table 2 shows the known height y0, the fitted height y, and the difference between the two heights at each observation point. Comparing the actual output with the measured value, the maximum error is 0.99 m and the minimum error is 0.01 m. The fitting result is not fully satisfactory because (1) the city-wall feature points are distributed along a line, so this method is greatly limited in describing the spatial information, and (2) the chosen extent of the city wall is broad, the height changes constantly, and the changing rule is hard to describe.
Table 2. The result of simulating the wall height at each feature point: known height y0 and simulated height y, for the training samples and the testing samples
5.2 Using the BP Algorithm to Interpolate in Segments

In this example, the broad and constantly changing city wall can be divided into smaller segments, and the changing rule of each segment can then be found with the same neural network model. Taking points No. 0-9 as an example, five points are used as training samples and the other five as testing samples. Table 3 shows the simulation result. From this result, the maximum error of the interpolation is 9 cm and the minimum error is 0.8 cm, so the measuring accuracy satisfies the requirements of this project.
Table 3. The result of simulation in segments

point   y0         y          dy
0       15.32755   15.21728    .060271
2       13.85258   13.96085   -.088274
4       13.09941   13.06753    .031876
6       13.12414   13.02487    .089266
8       13.33703   13.24328    .093745
1       14.93377   14.91906    .00807
3       13.29222   13.34892   -.0567
5       13.0759    12.94139    .094508
7       13.26844   13.19278    .075663
9       13.21131   13.09183    .08948
6 Conclusion

In this paper, we proposed a fuzzy neural network architecture with strong L-R type fuzzy numbers and defined the corresponding learning algorithm. Since the strong L-R type fuzzy number is more general than the triangular fuzzy number, the proposed fuzzy network can be considered an extension of the former work.
References
[1] L, M., Quan, T.F., Luan, S.H.: An Attribute Recognition System Based on Rough Set Theory-Fuzzy Neural Network and Fuzzy Expert System. Fifth World Congress on Intelligent Control and Automation (WCICA) (2004) 2355-2359
[2] W, S.Q., L, Z.H., X, Z.H., Zhang, Z.P.: Application of GA-FNN Hybrid Control System for Hydroelectric Generating Units. Proc. Int. Conf. Machine Learning and Cybernetics 2 (2005) 840-845
[3] Dubois, D., Prade, H.: Fuzzy Sets and Systems: Theory and Applications. Academic Press, New York (1982)
[4] Feng, L., Liu, Z.Y.: Genetic Algorithms and Rough Fuzzy Neural Network-based Hybrid Approach for Short-term Load Forecasting. IEEE Power Engineering Society General Meeting (2006) 1-6
State Space Partition for Reinforcement Learning Based on Fuzzy Min-Max Neural Network
Yong Duan1, Baoxia Cui1, and Xinhe Xu2
1 School of Information Science & Engineering, Shenyang University of Technology, Shenyang 110023, China
2 Institute of AI and Robotics, Northeastern University, Shenyang 110004, China
[email protected]
Abstract. In this paper, a tabular reinforcement learning (RL) method is proposed based on improved fuzzy min-max (FMM) neural network. The method is named FMM-RL. The FMM neural network is used to segment the state space of the RL problem. The aim is to solve the “curse of dimensionality” problem of RL. Furthermore, the speed of convergence is improved evidently. Regions of state space serve as the hyperboxes of FMM. The minimal and maximal points of the hyperbox are used to define the state space partition boundaries. During the training of FMM neural network, the state space is partitioned via operations on hyperbox. Therefore, a favorable generalization performance of state space can be obtained. Finally, the method of this paper is applied to learn behaviors for the reactive robot. The experiment shows that the algorithm can effectively solve the problem of navigation in a complicated unknown environment.
1 Introduction

Reinforcement learning (RL) requires the agent to obtain a mapping from states to actions, with the aim of maximizing the accumulated future reinforcement signals (rewards) received from the environment. In applications, the state and action spaces of RL are often large, which makes the search space of training overly large, so the agent can hardly visit every state-action pair. To cope with this problem, generalization approaches are used to approximate or quantize the state space, with the aim of reducing the complexity of the search space. Several quantization methods have been proposed, such as BOX [1, 2], whose basic idea is to quantize the state space of the RL problem into non-overlapping regions, each called a box. Moore [3] proposed the Parti-game algorithm, which partitions the state space using a k-d tree, and several improved versions of Parti-game have since been studied [4, 5]. Murao and Kitamura put forward an approach called QLASS [6], in which the state space of an RL problem is constructed as a Voronoi diagram. Lee and Lau also presented an on-line state-space partition algorithm [7]. How to make the agent adaptively partition the state space according to the characteristics of the environment and the learning task has therefore become a key issue of RL. In
order to effectively solve the problem of state space, the improved fuzzy min-max (FMM) neural network is applied to quantify the state space of RL in this paper. The hyperboxes of FMM serve as the partition regions of the state space. By tuning the min and max points and related parameters, the hyperboxes can reflect adaptively the distribution characteristics. So the quantization distortion can be decreased effectively. The hyperboxes of FMM can constitute the tabular RL, which is instrumental to implement the exploration scheme and increases the learning speed. The FMM [8-10] neural network can be viewed as the online classifier based on hyperbox fuzzy sets. Each hyperbox represents one cluster. The min-max points are utilized to define the boundaries of cluster. This clustering approach is based on soft verification, that is, the input training data does not definitely belong to some hyperbox (cluster). Instead, the fuzzy membership function is used to denote the degree of membership of being in the hyperbox. So the vector data set can be classified accurately. According to the above merits, FMM neural network is used to partition the state space of RL problems. The basic FMM algorithm is improved and integrated with Q(λ ) -learning, which is noted as FMM-RL. In the learning process, the state space is partitioned online through the operations of hyperbox expansion, contraction, merging and deletion. Synchronously, Q(λ ) -learning proceeds. Therefore, the method in this paper can construct the state space and solve the RL problem simultaneously. In a way, RL suits well to the robot control domain. RL has the merits of independent to environment models and self-adaptability, consequently, it pioneers the new research field of robotics. In the application of autonomous mobile robots, RL not only is able to implement the lower elementary control of robot behavior; but also can be used for learning the high-level behavior and complicated strategy of the robot [11]. Therefore, the above RL algorithm based on FMM is utilized to control the behaviors of a reactive robot. The robot is able to learn diversified behaviors and accomplish appointed tasks through interacting with the environment, under an unsupervising situation. In unknown and unstructured environments, the robot only can sense the environmental information by its own sensors. Thereby, the perceptible sensor data are viewed as the state vector of the RL problem. The state vector is quantized by a FMM neural network. The goal is to discretize the continuous state space and decrease the distortion of generalization.
2 Q(λ)-Learning In Markov decision process (MDP), the agent can perceive the state set S = {s i | s i ∈ S} from the environment. And the action set of the agent is A = {a i | a i ∈ A} . At the time step t , the agent senses the current state st and chooses the action a t . Through implementing this action, the agent can receive the reward rt from the environment and transform into the new state st +1 . The aim of the RL method is to achieve an optimum control scheme π : S → A
Q-learning [12] is an important algorithm of RL [1,11]. In Q -learning, the idea is to directly optimize Q -function. The function Q( s t , a t ) represents evaluation value of the state-action pair < s t , a t > . Q -learning is given by [12]:
\hat{Q}(s_t, a_t) = Q(s_t, a_t) + \eta \cdot \big[r_t + \gamma \cdot \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)\big],   (1)
where the value of η is a learning rate. γ denotes the discount factor. The equation (1) updates the current Q -function based on evaluation value of the next state, which is called one-step Q -learning [11, 13]. When the convergence of Q -function is achieved, the optimal policy can be confirmed. The TD (λ ) method is introduced into Q -learning, which becomes incremental multi-step Q -learning. This method is noted Q(λ ) -learning [13]. Firstly, Q -function is updated according to the normal one-step Q -learning. Then, the temporal difference of greedy policy is used to update Q -function again. Therefore, Q(λ ) learning is an on-line algorithm. Q(λ ) -learning has faster convergence speed than one-step Q -learning. Moreover, it is more effective than one-step Q -learning [11,13]. Now let
\varsigma_t = r_t + \gamma \max Q_{t+1} - \max Q_t,   (2)

and

\zeta_t = r_t + \gamma \max Q_{t+1} - Q_t.   (3)
Then the update of the Q value is calculated as follows: if s = s_t and a = a_t, then

\hat{Q}(s_t, a_t) = Q(s_t, a_t) + \eta_t \cdot \big[\zeta_t + \varsigma_t e_t(s_t, a_t)\big],
(4)
otherwise, Qˆ ( st , at ) = Q( st , at ) + η t ς t et ( st , at ) ,
(5)
where et ( s t , a t ) is the eligibility trace of the state-action pair < s t , a t > .
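A compact sketch (ours, not the authors' code) of the tabular update (2)-(5) follows; the trace-decay and trace-increment step at the end is a standard accumulating-trace convention added for completeness, since it is not spelled out in the text:

```python
import numpy as np

def q_lambda_update(Q, E, s, a, r, s_next, eta=0.1, gamma=0.95, lam=0.8):
    """One step of the Q(lambda) update of eqs. (2)-(5).
    Q, E: |S| x |A| arrays of action values and eligibility traces (illustrative hyperparameters)."""
    sigma = r + gamma * Q[s_next].max() - Q[s].max()   # eq. (2)
    zeta  = r + gamma * Q[s_next].max() - Q[s, a]      # eq. (3)
    Q += eta * sigma * E                               # eq. (5) applied to all state-action pairs
    Q[s, a] += eta * zeta                              # extra term of eq. (4) for the visited pair
    E *= gamma * lam                                   # assumed accumulating-trace bookkeeping
    E[s, a] += 1.0
    return Q, E
```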
3 RL Based on FMM Neural Network

3.1 FMM Clustering Neural Network
FMM neural network is one of the online learning classifier. Each hyperbox with ndimension is composed of the n-dimensional min-max points. The hyperbox can be regard as the fuzzy set. The membership function of the hyperbox describes the degree of an input state vector S ∈ R n pertaining to the hyperbox. The hyperbox B j is defined as follows [8, 9]: B j = {Si ,V j ,W j , μ j ( Si ,V j ,W j )} . Where, Si denotes the ith
input vector, S_i = (S_{i1}, S_{i2}, ..., S_{iL}); V_j = (v_{j1}, v_{j2}, ..., v_{jL}) and W_j = (w_{j1}, w_{j2}, ..., w_{jL}) denote, respectively, the minimum and maximum points of the hyperbox B_j; and μ_j(S_i, V_j, W_j) is the membership function of B_j, which is calculated as

\mu_j(S, V_j, W_j) = \frac{1}{L}\sum_{i=1}^{L}\big[1 - f(s_i - w_{ji}, \gamma) - f(v_{ji} - s_i, \gamma)\big],   (6)
where L is the dimension of the state vector, γ is the sensitivity parameter that regulates the gradient of the membership function, and f(·) is the two-parameter ramp threshold function

f(x, \gamma) = \begin{cases} 1, & x\gamma > 1,\\ x\gamma, & 0 \le x\gamma \le 1,\\ 0, & x\gamma < 0. \end{cases}   (7)
Fig. 1. FMM neural network element
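The membership computation (6)-(7) is short enough to sketch directly (ours); the input is assumed to be scaled to [0, 1] as usual for FMM networks, and the sensitivity value is an arbitrary choice:

```python
import numpy as np

def ramp(x, gamma):
    """Two-parameter ramp threshold function of eq. (7)."""
    return np.clip(x * gamma, 0.0, 1.0)

def membership(s, v, w, gamma=4.0):
    """Degree of membership (6) of state vector s in the hyperbox with min point v and max point w."""
    s, v, w = map(np.asarray, (s, v, w))
    return np.mean(1.0 - ramp(s - w, gamma) - ramp(v - s, gamma))

print(membership([0.3, 0.7], [0.2, 0.6], [0.4, 0.8]))   # point inside the box -> membership 1.0
```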
As described in Fig. 1, FMM can be viewed as a two-layer neural network. The input vectors and the hyperboxes serve as the input and output nodes of the neural network. The connected weights of the nodes are the min and max points of the hyperbox. The weights connecting the ith input node and the jth output node are depicted as v ji and w ji . The corresponding transforming function is the membership function of the hyperbox. The output y j = μ j of jth node can be calculated according to the membership function. Each output node denotes the only cluster. By competing, the victorious hyperbox with the maximal degree of the membership is the output of the neural network. During the training process, the weights (min-max points) of the victorious neural node (hyperbox) are updated continuously. During the learning process, the original FMM neural network approximates gradually the state space of RL through appending the new hyperboxes. The hyperbox mergence and the hyperbox deletion are appended to the basic operations of the FMM neural network. The hyperbox mergence operation can merge the similar hyperboxes into an exclusive hyperbox. The condition of hyperboxe mergence is that the two hyperboxes are near enough. Moreover, the evaluation values Q are also similar enough. Whether deleting the hyperbox that is determined by the visited frequency
and the accumulated reward of the hyperbox. The improved method can effectively remove redundant and insignificant hyperboxes, so the state space can be approximated adequately by a finite set of hyperboxes.

3.2 Q(λ)-Learning Based on FMM
The FMM-RL system is composed of the FMM neural network and Q(λ ) -learning. Firstly, the state vectors of RL are viewed as the training data of the FMM neural network. The hyperboxes represent the segmentation regions of the state space of RL. The learning of the FMM neural network can be regarded as the process of the dynamic partition for the state space. In the learning process, the partition region boundaries are changed by tuning the min and max points of the hyperbox. The hyperboxes can express self-adaptively the distribution characteristics of the state space through operations on hyperbox expansion, contraction, append, mergence and deletion. Consequently, the tradeoff of coarse and fine partition of the state space can be solved effectively. Then, the hyperboxes of FMM serves as the discrete state vectors of the tabular RL. After training with RL, the action with the maximum Qvalue of each state vector (hyperbox) is selected as the optimal policy. Consequently, the state vector and its corresponding optimum action constitute the Look-up table. Synthesizing improved the FMM neural network and Q(λ ) -learning, the FMM-RL algorithm can be described as follows: (1) Parameter Initialization. Define the maximum hyperbox size ϑ and the sensitivity gain γ . Initialize the visited frequency threshold κ and the hyperbox comparability threshold δ . Initialize the mean square difference threshold ε of the Q-values and the accumulated reward threshold χ .Initialize the initial hyperbox B0 , which min point V0 = 1 and max point W0 = 0 . Define the initial evaluation Q ( B0 , ak ) = 0 , k = 1,", N , where, N denotes the number of the selected actions of RL. Initialize the accumulated reward AR j = 0 and the visited frequency HF j = 0 of the hyperbox B j .Initialize the eligibility trace e( B0 , ak ) = 0 . Furthermore, archive the current state S0 . (2) The action at is selected. By executing the action at , the agent obtains the immediate reward rt and the next state St +1 . (3) Find the most adjacent hyperbox with the current state. The degree of membership μ j , j = 1,", M t that the current state belongs to each hyperbox is calculated by equation 6. The hyperbox Bt +1 = B j that has the highest degree of ∗
membership is selected as the victorious hyperbox. The visitied frequency HF j of hyperbox B j is updated. (4) Verdict the condition of hyperbox expansion. If the min and max points of hyperbox B j meet the equation 8, goto (5). Otherwise, the expansion condition of the hyperbox except for B j is judged until all the hyperboxes are exhausted. If all the hyperboxes do not satisfy the condition, the new hyperbox is appended, goto (6).
\sum_{i=1}^{L}\big[\max(w_{ji}, s_i) - \min(v_{ji}, s_i)\big] \le L \cdot \vartheta.   (8)

(5) Hyperbox expansion. The min and max points of the hyperbox are adjusted according to (9) and (10):

v^{new}_{ji} = \min(v^{old}_{ji}, s_i),  \forall i = 1, 2, ..., L,   (9)

w^{new}_{ji} = \max(w^{old}_{ji}, s_i),  \forall i = 1, 2, ..., L.   (10)
(6) Hyperbox append. The min and max points of appended hyperbox is the current state vector, that is, Vnew = Wnew = St +1 . Furthermore, the corresponding evaluation value Q ( Bnew , ak ) and the eligibility trace e( Bnew , ak ) are appended and initialized. (7) Q(λ ) -learning. The Q -values, eligibility trace e( B j , ak ) and the accumulated reward AR j are updated according to the Q(λ ) -learning algorithm. (8) Hyperbox overlapping test. If the expanded or appended hyperbox is satisfied for each of the following case, goto step(9); otherwise, goto (10). Case 1: v pi < v ji < w pi < w ji ; Case 2: v ji < v pi < w ji < w pi ; Case 3: v pi < v ji ≤ w ji < w pi ; Case 4: v ji < v pi ≤ w pi < w ji . (9) Hyperbox Contraction. If the hyperbox overlapping conditions are met, the overlap is eliminated. According to the four cases previously demonstrated, the overlapping hyperboxes are contracted as follows: Case 1: v pi < v ji < w pi < w ji , new old old v new ji = w pi = (v ji + w pi ) / 2 .
Case 2: v ji < v pi < w ji < w pi , new old old v new pi = w ji = (v pi + w ji ) / 2 .
Case 3: v pi < v ji ≤ w ji < w pi , If w ji − v pi < w pi − v ji , old v new pi = w ji . old Otherwise, wnew pi = v ji .
Case 4: v_{ji} < v_{pi} \le w_{pi} < w_{ji}: the same assignments as in Case 3, under the same conditions.
(10) Hyperbox mergence. The condition for merging two hyperboxes is that they are sufficiently close, that is,

\sum_{i=1}^{L}\big[(v_{pi} - v_{ji})^2 + (w_{pi} - w_{ji})^2\big] \le \delta

and Q_{th} \le \varepsilon, where Q_{th} = \sum_{k=1}^{N}\big[Q(B_p, a_k) - Q(B_j, a_k)\big]^2. The merged hyperbox is given by v^{new} = (v_p + v_j)/2 and w^{new} = (w_p + w_j)/2, and the corresponding Q-value is Q(B^{new}, a_k) = [Q(B_p, a_k) + Q(B_j, a_k)]/2.
(11) Hyperbox deletion. If the visited frequency of the hyperbox is less than the threshold κ (HF_j < κ) and the accumulated reward of the hyperbox is less than the threshold χ (AR_j < χ), the hyperbox is deleted.
(12) State transition. S_t ← S_{t+1}; return to step (2).
(13) Iterate steps (2) to (12) until the min and max points of the hyperboxes no longer change.
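The expansion test and update (8)-(10), on which the partitioning rests, can be sketched in a few lines (ours; variable names and the example values are assumptions):

```python
import numpy as np

def can_expand(v, w, s, theta):
    """Expansion criterion (8): the enlarged box must not exceed the size bound theta."""
    return np.sum(np.maximum(w, s) - np.minimum(v, s)) <= len(s) * theta

def expand(v, w, s):
    """Expansion (9)-(10): stretch the min and max points to absorb the state s."""
    return np.minimum(v, s), np.maximum(w, s)

v, w = np.array([0.2, 0.6]), np.array([0.4, 0.8])
s = np.array([0.45, 0.55])
if can_expand(v, w, s, theta=0.3):
    v, w = expand(v, w, s)
print(v, w)
```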
4 Robot Navigation Based on FMM-RL

The robot adopts a two-wheel differential drive at its geometric center (see Figure 2); the drive motors of the two wheels are independent, and v_l and v_r are the velocities of the left and right wheels. The sensors of the robot are divided into three groups according to their coverage areas, sensing the distances to obstacles to the right, in front, and to the left of the robot, respectively. In each group, the distance between the robot and the obstacles is the minimum value of the sensed data, i.e., D_min = min(d_i). θ is the angle between the moving direction of the robot and the line connecting the robot center with the target.
Fig. 2. The perceptive model of robot
Firstly, the robot senses the state information of the environment. According to the above section, the operations on hyperbox expansion, append, contraction, mergence and deletion are implemented. Thereby, the robot can online partition the state space and implement Q(λ ) -learning during the learning process. Then, the action corresponding with the above state vector searched from Look-up table is regarded as the control variables of the robot. The robot control variables are the left/right wheel velocities vl and v r , which are represented respectively by five discrete values. They constitute 25 different
combinations, which are used as the action variables of RL. The corresponding Q-value of each action is updated by RL. After training, the action with the maximal Q-value is selected as the optimal action of the hyperbox.
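For concreteness, the 25-action set can be formed as the Cartesian product of five discrete wheel speeds; the speed values themselves are not given in the paper and are placeholders here:

```python
from itertools import product

wheel_speeds = [-0.2, -0.1, 0.0, 0.1, 0.2]             # placeholder values; the paper fixes only the count
actions = list(product(wheel_speeds, wheel_speeds))    # (v_l, v_r) pairs
assert len(actions) == 25
```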
5 Experimental Results In order to demonstrate the effectiveness of the proposed FMM-RL, the experiments are performed with the simulation and the real mobile robot Pioneer II. According to the previous section, the ultrasonic sensors of Pioneer II are divided into three groups. Each group sensors can measure the distance between the robot and the obstacles in the different directions. In order to increase the learning speed of the FMM-RL method and decrease the exhaustion of the real robot, we apply the proposed method to the robot that tries to learn the behaviors in the simulation environment. Then the learned results are tested with the real robot Pioneer II. In this section, we study the learned wandering behaviors. The wandering behavior of the robot is that the robot can explore stochastically the unknown and changed environment without collision, which aim is to obtain the environment information or search the targets. For wandering behavior, the sensor measure values Dl , Dc and Dr of the triple orientations (left, front and right) are viewed as the state variables of RL. By implementing the FMM-RL method, the state vectors are partition online. To avoidance obstacles, it is natural to wish the robot is far away from obstacles. If the robot is close to obstacle, it will receive the punishment (negative reinforcement signal); On the contrary, the robot receives the bonus (positive reinforcement signal).Thereby, the reinforcement signal function is defined as follows:
r_t = \begin{cases} -1, & d_t < D_S,\\ -\tau(D_A - d_t), & D_S < d_t \le D_A,\\ 0, & \text{otherwise}, \end{cases}   (11)
Where rt is the immediate reinforcement signal at the time step t . d t denotes the minimum obstacle distance of triple directions around the robot, i.e., dt = min{Dl , Dc , Dr } . The parameter τ is proportional gain. DS represents the threshold of the safe distance. If the distance between the robot and the obstacles is less than D S , the robot is considered have been collided. D A is the distance threshold of avoiding obstacles. Within the range from D S to D A , the robot is able to avoid obstacles effectively. The simulated robot is located in the complicated unknown environment to train the obstacle-avoidance behavior. According to the equation 11, the robot receives the reinforcement signal. If the robot collides with obstacles, reaches the target or completes the trials, it will return the start state and perform a new learning stage. Figure 3 denotes the wandering trajectories of the robot in the unknown simulation
Fig. 3. Wandering trajectories in simulation
Fig. 4. Pioneer II robot wandering behavior
environment. Figure 4 shows that the robot Pioneer II performs the wandering behavior in the real environment. The effectiveness of the proposed method is demonstrated through simulator and the real robot experiments. The robot with controller designed by FMM-RL can explore the environment without collision.
6 Conclusions In this paper, the improved FMM neural network and RL are integrated, which constitute the FMM-RL algorithm. Firstly, the FMM neural network is used to quantify the continuous state space of RL. So the continuous state space can be approximated by the finite hyperboxes of FMM. The proposed algorithm not only partitions self-adaptively the state space, but also can effectively delete and merge the insignificance state partition regions. Consequently, the tabular RL method can be implemented. We study the behavior learning of the mobile robot based on the FMMRL method. The experimental results indicate the FMM-RL method with the reasonable reinforcement signal function can complete effectively the learned tasks.
References 1. Sutton, R. S., Barto, A. G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge, Massachusetts (1998) 2. Michie, D., Chambers, R. A.: Box: An experiment in adaptive control. Machine Intelligent 2 (1968) 137-152 3. Moore, A.W., Atkeson, C.G.: The Parti-game Algorithm for Variable Resolution Reinforcement Learning in Multidimensional State-spaces. Machine Learning 21 (1995) 199-233 4. Munos, R., Moore, A. W.: Variable Resolution Discretization for High-accuracy Solutions of Optimal Control Problems. Proc. 16th International Joint Conf. on Artificial Intelligence (1999) 1348--1355 5. Reynolds, S. I.: Adaptive Resolution Model-free Reinforcement Learning: Decision Boundary Partitioning. Proc. 17th International Conf. on Maching Learning (2000) 783-790
6. Murao, H., Kitamura, S.: Q-Learning with Adaptive State Segmentation (QLASS). Proc. IEEE International Symposium on Computational Intelligence in Robotics and Automation (1997) 179-184 7. Ivan, S.K. Lee, Henry, Y.K.Lau.: Adaptive State Space Partitioning for Reinforcement Learning. Engineering Applications of Artificial Intelligence 17 (2004) 577-588 8. Simpson, P. Fuzzy Min-max Neural Network-Part I: Classification. IEEE Trans. on Neural Networks 3 (5) (1992) 776-786 9. Simpson, P. K.: Fuzzy Min-max Neural Network-Part II: Clustering. IEEE Trans. on Fuzzy Systems 1 (1) (1993) 32-45 10. Gabrys, B., Bargiela.: General Fuzzy Min-max Neural Network for Clustering and Classification. IEEE Trans. on Neural Networks 11 (3) (1999) 769-783 11. Zhang, R.B.: Reinforcement Learning Theory and Applications. Harbin Engineering University Press (2000) 12. Watkins, C. J., Dayan P.: Q-learning. Machine Learning 8 (3) (1992) 279-292 13. Peng, J., Williams, R.J.: Incremental Multi-step Q-learning. Machine Learning: Proceedings of the Eleventh International Conference(ML94), Morgan Kaufmann, New Brunswick, NJ, USA (1994) 226-232
Realization of an Improved Adaptive Neuro-Fuzzy Inference System in DSP Xingxing Wu, Xilin Zhu, Xiaomei Li, and Haocheng Yu College of Mechanical science and Engineering, Jilin University, Changchun 130025, China
[email protected]
Abstract. The scaled conjugate gradient (SCG) algorithm is used to improve the adaptive neuro-fuzzy inference system (ANFIS). Applications to chaotic time-series prediction show that the improved ANFIS converges in less time and with fewer iterations than standard ANFIS or ANFIS improved with the Fletcher-Reeves update method. The way in which ANFIS can be improved on the basis of the standard algorithm in the MATLAB fuzzy logic toolbox is described in detail, and a convenient method to realize ANFIS on TI's digital signal processor (DSP) TMS320C5509 is presented. Experimental results indicate that the output of ANFIS realized in the DSP coincides with that in MATLAB, which validates the method.
1 Introduction
Artificial neural networks and fuzzy inference systems have been applied in more and more engineering fields for their ability to simulate human learning and inference. The adaptive neuro-fuzzy inference system (ANFIS) utilizes the learning principle and adaptive ability of neural networks to model a fuzzy inference system, so that membership function parameters and fuzzy inference rules can be obtained by learning from large quantities of input and output data. ANFIS therefore has a special advantage for complex systems in which qualitative knowledge and experience are deficient or hard to obtain. With its self-learning ability, ANFIS can expand its fuzzy inference library according to changes in the application circumstances, which improves the system's flexibility and adaptability. At present ANFIS has been successfully applied in many fields such as modeling and forecasting of nonlinear systems, fingerprint matching, etc. [1,2,3]. Programmable digital signal processors have developed at high speed in the past twenty years; with an increasingly high cost-performance ratio, the digital signal processor (DSP) has become the core of many electronic devices and is widely used in communication, automatic control, spaceflight, and other fields [4]. The standard ANFIS algorithm is rather slow because complex computations over large amounts of data must be carried out during training. In this paper the scaled conjugate gradient (SCG) algorithm [7] is used to improve the ANFIS algorithm, accelerating the training process and reducing the number of training iterations. In addition, a convenient way to realize ANFIS in a DSP on the basis of the MATLAB fuzzy toolbox is presented.
2 Improvement of ANFIS
Fuzzy inference systems can be classfied into Mamdani-type, Sugeno-type ,pure fuzzy inference system,etc. Mamdani-type fuzzy inference system can express knowledge conveniently because the form of fuzzy inference rules coincides with human thought and language expression customs. But its computation is rather complicated and difficult for analysis in math. Sugeno-type fuzzy inference system is simple in computation and easy to be combined with optimizing and self-adapting methods [9]. ANFIS based on Sugeno-type fuzzy inference system was put forward by Jang [5]. In ANFIS parameters which determine member function shapes of each input are called premise parameters. Output of each rule is linear combination of inputs and constant. The linear combination coefficients and constant are called consequent parameters. All these parameters are adjusted by back propagation(BP) algorithm or a combination of least squares estimation and back propagation algorithm in the way similar to neural network. In the pure BP method both premise and consequent parameters are adjusted by BP algorithm. In the hybrid method premise parameters are adjusted by BP algorithm while consequent parameters are adjusted by least squares estimation. But standard back propagation algorithm is often too slow for application and may get stuck in a shallow local minimum. So many faster algorithms have been presented such as variable learning rate BP,resilient BP,conjugate gradient,Levenberg-Marquardt(LM),and so on. As far as conjugate gradient algorithm is concerned, it also can be divided into several kinds such as scaled conjugate gradient(SCG) , Fletcher-Reeves Update(FRU), Powell-Beale Restarts, etc. Following conclusion was drawn according to experiments on different algorithms described above with different structures and precision in solving six different kinds of practical problems [6]. Generally the LM algorithm will have the fastest convergence for networks that contain up to a few hundred weights on function approximation problems. But its performance is relatively poor on pattern recognition problems. Resilient BP algorithm is the fastest algorithm on pattern recognition problems while it does not perform well on function approximation problems. The conjugate gradient algorithms, in particular scaled conjugate gradient (SCG) algorithm, seem to perform well over a wide variety of problems, particularly for networks with a large number of weights. The SCG algorithm is almost as fast as the LM algorithm on function approximation problems (faster for large networks) and is almost as fast as resilient back propagation algorithm on pattern recognition problems. So in this study ANFIS was improved with the SCG algorithm to quicken its training speed. Standard back propagation algorithm adjusts the weights in the steepest descent direction.As formula (1) and (2) show. Δf (Wn ) = −αn ∇f (Wn ),
(1)
Wn+1 = Wn + Δf (Wn ),
(2)
where Wn is weight vector at iteration n, αn is current step size and ∇f (Wn ) is current gradient vector. It turns out that it doesn’t necessarily produce the
172
X. Wu et al.
fastest convergence along the negative of the gradient. In conjugate gradient algorithms the search direction is conjugate to previous search direction except that the first search direction is along the negative of the gradient. Generally conjugate gradient algorithms converge faster than standard BP algorithm. Line searches are performed to determine the optimal distance to move along search directions in conjugate algorithms such as FRU, Powell-Beale Restarts, etc. SCG algorithm put forward by Moller combines the model-trust region approach and the conjugate gradient approach to avoid the time-consuming line search and improve the convergence speed [7,8]. It’s shown as below. (1) At n=0,choose an initial weight vector W0 , and scalars0 < σ < 10−4 ,0 < ρ0 < 10−6 ,ρ0 = 0,set the Boolean success=true. Set the initial direction vector D0 = G0 = −∇f (W0 ).
(3)
(2)If success=true then calculate second order information: σ , |Dn |
(4)
∇f (Wn + σn Dn ) − ∇f (Wn ) , σn
(5)
σn = Sn =
θn = DnT Sn ,
(6)
(3)scale θn : 2
θn = θn + (ρn − ρn ) |Dn | , (4)if θn ≤ 0 then make the Hessian positive definite: θn ρn = 2 ρ n − , 2 |Dn | 2
(7)
(8)
θn = −θn + ρn |Dn | ,
(9)
ρn = ρn ,
(10)
ξn = DnT Gn ,
(11)
(5)calculate the step size: ξn αn = , θn (6)calculate the comparison parameter Cn : Cn = 2θn
f (Wn ) − f (Wn + αn Dn ) . ξn2
(12)
(13)
(7)Weight and direction update: If Cn > 0 then a successful update can be made Wn+1 = Wn + αn Dn ,
(14)
Gn+1 = −∇(Wn+1 ),
(15)
Realization of an Improved ANFIS in DSP
173
ρn = 0,Success=true. If n mod N=0 then restart the algorithm with Dn+1 = Gn+1 , else 2 βn = (|Gn+1 | − GT n+1 Gn )/ξn ,
(16)
Dn+1 = Gn+1 + βn Dn .
(17)
ρn = ρn /4,
(18)
If Cn ≥ 0.75 then else ρn = ρn ,success=false. (8)If Cn < 0.25 then 2
ρn = ρn + θn (1 − Cn )/ |Dn | .
(19)
(9)If the steepest descent direction:Gn = 0,set n=n+1 and go back to (2) else terminate and return Wn+1 as the desired minimum. In MATLAB standard ANFIS funciton anfis() is provided,with which pure BP method or hybrid method can be chosen to train the system. It was found out from analysis of MALTAB language source file anfis.m that in anfis() anfismex.dll was called to realize the kernel training algorithm. C language source codes of anfismex.dll can be found in the directory toolbox\fuzzy\fuzzy\src of MATLAB. Through analyzing these source codes it can be concluded that variable learning rate BP algorithm has been used to improve the convergence speed. The basic idea of variable learning rate BP to increase or decrease the learning rate(or called step size) by judging if current training error is smaller than last training error. If the training errors decrease in succession for several times the learning rate will be increased. If the training errors vibrate the learning rate will be decreased and otherwise keep constant. As above algorithm flow shows,in SCG algorithm the learning rate is computed from second order information of performance function. Compared to variable learning rate BP algorithm,it can avoid vibrations may caused by inappropriate initial learning rate or increasing/decreasing rate. To construct a new ANFIS function anfisscg() based on SCG algorithm, Firstly a new kernel training algorithm library anfisscgmex.dll should be made. It was realized by modifying the source codes of anfismex.dll according to SCG algorithm described above. Main modification was made to the function anfislearning() in learning.c as most learning procedures were completed in it. In order to complete computation of formula (5),(13) and (16) conveniently,new members were added to the construct type FIS and NODE in anfis.h.Assigning and freeing memory as well as initiating codes for new members were added in datstruc.c. After modification the file that contained function mexfunction() was renamed to anfisscgmex.c. Then anfisscgmex.dll was generated by command ”mex anfisscgmex.c -output anfisscgmex.dll” in MATLAB. At last the improved ANFIS function anfisscg() was got by substituting function anfismex with function anfisscgmex in anfis.m and renaming the file anfisscg.m.
174
3
X. Wu et al.
Test of the Improved ANFIS Algorithm
In order to test the improved ANFIS algorithm,the standard ANFIS ,ANFIS improved with the Fletcher-Reeves update method and ANFIS improved with SCG algorithm were respectively applied to forecasting chaotic time series. A chaotic time series is generated by following Mackey-Glass (MG) time-delay differential equation: •
x(t) =
0.2x(t − τ ) − 0.1x(t). 1 + x10 (t − τ )
(20)
This time series is chaotic, and so there is no clearly defined period. The series will not converge or diverge, and the trajectory is highly sensitive to initial conditions. This is a benchmark problem in the neural network and fuzzy modeling research communities [9]. In MATLAB file mgdata.dat has been provided in the directory toolbox\fuzzy\fuzdemos which contains time series data calculated by the fourth-order Runge-Kutta method.Half of the data were used to train while the other half were used for checking to assure the modeling was successful. In all algorithms system was initialized by grid partition method and Gauss type of membership function was selected as membership function of inputs. Other parameters used the default values. Training results of different algorithms are as Table 1 shows. Table 1. Training results of different algorithms Algorithm
Target Error Et Iteration N Time T /s
Standard ANFIS BP Method Improved by FRU BP Method Improved by SCG BP Method Standard ANFIS Hybrid Method Improved by FRU Hybrid Method Improved by SCG Hybrid Method
0.02 0.02 0.02 0.0017 0.0017 0.0017
5919 220 40 155 72 63
338.3970 28.4810 6.8000 51.8140 35.8820 30.3430
Training error curves of Standard ANFIS, ANFIS improved by FRU and ANFIS improved by SCG using BP method are as Fig. 1 shows. In ANFIS improved by FRU the parameter βn which determined how much last search direction influence current search direction was computed according to formula (21): 2
2
βn = |Gn+1 | / |Gn | .
(21)
Step size αn was adjusted in the same way as variable learning rate BP algorithm. In ANFIS improved by SCG using Hybrid Method consequent parameters were adjusted least squares estimation and premise parameters were adjusted in a way similar to FRU except that βn was computed according to formula (16) instead of formula (21).
Realization of an Improved ANFIS in DSP
175
1 Standard ANFIS ANFIS improved with FRU ANFIS improved with SCG
0.9 0.8
Training Error E
0.7 0.6 0.5 0.4 0.3 0.2 0.1 0
0
50
100 Training Iterations N/times
150
200
Fig. 1. Comparison of training error curves
It can be concluded from Table 1. and Fig. 1 that ANFIS improved by SCG converges much faster than standard ANFIS or ANFIS improved by FRU. It takes less time and fewer iterations for ANFIS improved by SCG to reach the same target error.
4
Realization of ANFIS in DSP
High speed Real-time signal process can be achieved in DSP for advanced technologies used in it such as Harvard architecture,super scale pipeline, special MAC units and instruction,etc. The improved ANFIS algorithm realized in MATLAB running on PC is fitful for analysis and simulation. But it can’t satisfy the field signal process demands such as low power,real-time,small size,etc. It will greatly promote ANFIS’s applications in more fields if it can be realized in DSP conveniently. DSP used in this study is TMS320VC5509, which is based on the latest TMS320C55x DSP processor core.The C55x DSP architecture achieves high performance and low power through increased parallelism and total focus on reduction in power dissipation [10]. The devices used include emulator,code composer studio (CCS)5000 and a target board which has been extended with 1M*16 bit SDRAM and 512K*16bit flash. CCS 5000 supports development and debugging of TMS320C55x c or assembly language program. According to this study there are two methods to realize ANFIS in DSP. In the first method c language source codes of anfisscgmex.dll was modified according to TMS320C55x c language and hardware attributes of TMS320C5509 to realize whole training and inference processes in DSP. In the second method training process was completed in MATLAB. After training, the system was saved to .fis format file with the use of function writefis(). Structure and parameters of the inference system can be extracted from the .fis format file saved for system initialization. The stand-alone c code fuzzy inference engine contained in fis.c in
176
X. Wu et al.
the directory toolbox\fuzzy\fuzzy of MATLAB was modified in CCS to complete the inference process in DSP. As there is no file system in DSP, extraction of parameters was completed by calling the function returnFismatrix() in fis.c. In practical applications it often takes long time to train the system while the inference speed should be as quick as possible. So off-line training and online inference is a wise choice. As a result the second method is better as it also demands smaller memory in DSP. For example, The improved ANFIS to forecast chaotic time series was realized in DSP using the second method. Forecast results of time series number 124 to 223 in DSP is as Fig. 2 shows.
Fig. 2. Output results in DSP
Here dual time time/frequency graph was used to show computation results in DSP. The upper curve represents forecast values obtained from fuzzy inference computation. The lower curve represents errors between forecast values and real values in mgdata.dat. Start addresses of the upper curve and lower curve were respectively set to be names of arrays in which system outputs and errors were saved. Both display buffer size and display data size were set to be 100. 32-bit IEEE float point was used as DSP data type. Output of the system was shown in the stdout window. In MATLAB outputs of the system were computed by calling function readfis ( ) and evalfis( ) [9]. As Fig. 3 shows,the upper graph is the comparison of real values (represented by circles) to outputs of the system (represented by line) in MATLAB. In the graph trend of the line coincides with that of circles which indicates the forecast is successful. The lower graph shows the forecast error curve. Parts of forecast results in DSP and MATLAB are as Table 2 shows. It can be seen from Fig. 2,Fig. 3 and Table 2 that chaotic time series forecast has been successfully achieved both in MATLAB and in DSP. System output
system output and real value
Realization of an Improved ANFIS in DSP
177
1.5
1 real value system output
0.5
0 120
140
160
180 200 Time series number
220
240
140
160
180 200 Time series number
220
240
−3
Forecast error
4
x 10
2 0 −2 −4 120
Fig. 3. Output results in MATLAB Table 2. Forecast results in MATLAB and DSP Time series Forecast value Forecast error Real value number MATLAB DSP MATLAB DSP 123 1.0510 1.0516 1.051554 0.0006 0.000554 125 0.9564 0.9530 0.952994 -0.0034 -0.003406 136 0.6526 0.6541 0.654167 0.0014 0.001567 145 0.8663 0.8659 0.865982 -0.0004 -0.000318 159 1.2022 1.2021 1.202076 -0.0000 -0.000124 167 1.1540 1.1541 1.154046 0.0001 0.000046 186 0.5053 0.5040 0.504101 -0.0013 -0.001199 222 1.0022 1.0026 1.002582 0.0004 0.000382
in DSP coincides with system output in MATLAB,which verify the method to realize ANFIS in DSP.
5
Conclusions
This paper presents an improved ANFIS algorithm and a convenient method to realize it in DSP. The ANFIS improved by SCG algorithm converges faster than standard ANFIS or ANFIS improved by FRU conjugate gradient algorithm. In this way training iterations and time of ANFIS can be reduced. ANFIS can be conveniently realized in DSP by the way of off-line training and online inference. Tests on chaotic time series forecast verify the improved ANFIS algorithm and the method to realize ANFIS in DSP. With faster training speed and a convenient method to be realized in DSP ANFIS will be applied in more and more practical fields.
178
X. Wu et al.
References 1. Lee, K.C., Gardner, P.: Adaptive Neuro-Fuzzy Inference System (ANFIS) Digital Predistorter for RF Power Amplifier Linearization. IEEE Transactions on Vehicular Technology 55 (1) (2006) 43-51 2. Hui, H., Song, F.J., Widjaja, J.: ANFIS-Based Fingerprint-Matching Algorithm. Optical Engineering 43 (3) (2004) 415-438 3. Jwo, D.J., Chen, Z.M.: ANFIS Based Dynamic Model Compensator for Tracking and GPS Navigation Applications. Lecture Notes in Computer Science 3611. Springer-Verlag, Berlin Heidelberg (2005) 425-431 4. Wang, C.M., Sun, H.B., Ren, Z.G.: Design and Development Examples of TMS320C5000 Series DSP System.Publish house of electronics industry. Beijing (2004) 5. Jang, J.R.:ANFIS: Adaptive-Network-Based Fuzzy Inference System.IEEE Transactions on Systems,Man and Cybernetics 23 (3) (1993) 665-685 6. Demuth, H., Beale, M., Hagan, M.: Neural Network Toolbox for Use with MATLAB User’s Guide.4th edn. The MathWorks IncMA(2005) 7. Moller, M.F.: A Scaled Conjugate Gradient Algorithm for Fast Supervised Learning. Neural Networks 6 (4) (1993) 525-533 8. Falas, T., Stafylopatis, A.: Implementing Temporal-Difference Learning with the Scaled Conjugate Gradient Algorithm. Neural Processing Letters 22 (3) (2005) 361-375 9. Fuzzy Logic Toolbox for Use with MATLAB User’s Guide.2nd edn. The MathWorks IncMA(2005) 10. TMS320VC5509 Fixed-Point Digital Signal Processor Data Manual. The Texas Instruments Inc, Dallas(2001)
Neurofuzzy Power Plant Predictive Control Xiang-Jie Liu and Ji-Zhen Liu Department of Automation, North China Electric Power University Beijing 102206, China
[email protected] Abstract. In unit steam-boiler generation, a coordinated control strategy is required to ensure a higher rate of load change without violating thermal constraints. The process is characterized by nonlinearity and uncertainty. Using of neuro-fuzzy networks (NFNs) to represent a nonlinear dynamical process is one choice. Two alternative methods of exploiting the NFNs within a generalised predictive control (GPC) framework are described. Coordinated control of steam-boiler generation using the two nonlinear GPC methods show excellent tracking and disturbance rejection results.
1 Introduction In modern power plant, the coordinated control scheme constitutes the uppermost layer of the control system, which is responsible for driving the boiler-turbinegenerator set as a single entity, harmonising the slow response of the boiler with the faster response of the turbine-generator, to achieve fast and stable unit response during load tracking manoeuvres and load disturbances. In existing method, the PID controller is still the most widespread, being developed in power plant control loops. However, steam-boiler turbine system is the complex industrial process with highly nonlinear, non-minimum, uncertainty and load disturbance [1]. Load-cycling operation between full load and low load is a common feature in modern power plant. This leads to the change of operating point right across the whole operating range. Variations in plant variables become quite nonlinear. This has presented a great challenge to power plant control system. Model predictive control (MPC) has emerged to be an effective way of power plant control. The application of a decentralized predictive control scheme was proposed in [2] based on a state space implementation of GPC for a combined-cycle power plant, in which a two-level decentralized Kalman filter was used to locally estimate the states of each of the subprocess. A nonlinear long-range predictive controller based on neural networks is developed in [3] to control the power plant process. In the presence of constraints, the optimum predicted control trajectory is defined through the on-line solution of a quadratic programming problem. For nonlinear system, since the on-line optimization problem is generally nonconvex, the on-line computation demand is high for any reasonably nontrivial systems. Using a neurofuzzy networks (NFNs) [4] to learn the plant model from operational process data for nonlinear GPC is one solution. In the NFNs, expert knowledge in linguistic form can be incorporated into the network through the fuzzy rules. This article describes how this nonlinear neurofuzzy modelling technique can be integrated within an MPC framework. It also discusses how constraint handling can be D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 179–185, 2007. © Springer-Verlag Berlin Heidelberg 2007
180
X.-J. Liu and J.-Z. Liu
incorporated in the nonlinear control scheme while ensuring the highest possible rate of load change. Comparative control studies produce good results for both nonlinear coordinated control schemes.
2 Neuro-Fuzzy Network Modelling Consider the following general single-input single-output nonlinear dynamic system:
y (t ) = f [ y (t − 1),", y (t − n′y ), u (t − d )," , u (t − d − nu′ + 1), e(t − 1)," , e(t − ne′ )] + e(t ) Δ
(1)
where f[.] is a smooth nonlinear function such that a Taylor series expansion exists, e(t) is a zero mean white noise and Δ is the differencing operator, n ′y , nu′ , ne′ and d are respectively the known orders and time delay of the system. Let the local linear model of the nonlinear system (1) at the operating point O(t) be given by:
A ( z −1 ) y (t ) = z − d B ( z −1 )Δu (t ) + C ( z −1 )e(t )
(2)
where A ( z −1 ) = ΔA( z −1 ) , B( z −1 ) and C ( z −1 ) are polynomials in z-1, the backward shift operator. The nonlinear system (1) is partitioned into several operating regions, such that each region can be approximated by a local linear model. Since NFNs is a class of associative memory networks with knowledge stored locally [4], they can be applied to model this class of nonlinear systems. A schematic diagram of the NFN is shown in Fig. 1. The input of the network is the antecedent variable [ x1 , x2 " xn ] , and the output, yˆ (t ) , is a weighted sum of the output of the local linear models yˆ i (t ) . Bspline functions are used as the membership functions in the NFNs. The membership functions of the fuzzy variables can be obtained by, n
a i = ∏ μ Ai ( x k ) ; for i = 1,2, " , p
(3)
k
k =1
where n is the dimension of the input vector x, and p, the total number of weights: n
p = ∏ ( Ri + k i ) i =1
Fig. 1. Neuro-fuzzy network
(4)
Neurofuzzy Power Plant Predictive Control
181
where k i and Ri are the order of the basis function and the number of inner knots respectively. The output of the NFN is, p
yˆ =
∑ yˆ a i =1 p
i
∑a i =1
i
p
= ∑ yˆ iα i
(5)
i =1
i
3 Neuro-Fuzzy Network Predictive Control 3.1 Local Model-Based Generalized Predictive Control (LMB-GPC)
The neurofuzzy network provides a global nonlinear plant representation from a set of locally valid CARIMA models together with a weight function, producing a value close to one in parts of the operating space where the local model is a good approximation and a value approaching zero elsewhere. Notice that this is the main property of the B-spline neuro-fuzzy networks. An alternative way of developing nonlinear controller is to use the same operating regime based model directly with a model based control framework. In this way, global modeling information may be used to determine the control input at each sample time. The closed-loop performance, stability and robustness are then all directly related to both the quality of the identified model and the general properties of GPC. It is assumed to constitute a linear representation of the process at any time instant and may then be used by a GPC controller to represent the process dynamics locally. The resultant LMB-GPC is shown in Fig.2.
Fig. 2. Local model-based generalized predictive control
3.2 Composed Controller Generalized Predictive Control (CC-GPC)
The control structure here consists of the family of controllers and the scheduler. At each sample instant the latter decides which controller, or combination of controllers, to apply to the process. Generally, the controllers are tuned about a model obtained from experiments at a particular equilibrium point. The interpolated outputs are then
182
X.-J. Liu and J.-Z. Liu
summed and used to supply the control commands to the process. The resultant CCGPC structure is shown in Fig.3. The interpolation function effectively smoothes the transition between each of the local controllers. In addition, the transparency of the nonlinear control algorithm is improved as the operating space is covered using controllers rather than models.
Fig. 3. Composed controller generalized predictive control
3.3 Constraint Handling
One of the main application benefits of using a linear predictive controller is its ability to handle process constraints directly within the control law. The inclusion of constraints in LMB-GPC is straightforward, since the least squares solution to the chosen cost function may be replaced by a constrained optimization technique such as quadratic programming. The drawback is the increasing computation required to solve for the control sequence at each sample instant. While the same approach is applied to the CC-GPC, a problem arises, as there is no way of knowing that the summation of all of the controller outputs will not in fact violate a process constraint. Notice that we are using a B-spline neuro-fuzzy network, i.e., ∑ μ kj ( x) ≡ 1, x ∈ [ xmin , xmax ] , signifyj
ing that the basis functions form a partition of unity. In such a way, the summation of all of the controller outputs will not in fact violate a process constraint, since they are weighted sum by the normalized B-spline neuro-fuzzy network.
4 Coordinated Control in Steam-Boiler Generation A valid neurofuzzy model of the plant, which is an essential tool for the improvement of the control system, has been established in [1].The proposed two kinds of neurofuzzy predictive controllers are now incorporated in the system. In the control system shown in Fig.4, W Nμ (s ) is the transfer function relating the steam valve setting to the load power, and W NM (s ) is the transfer function between the fuel consumption and the load power, i.e., W Nμ ( s ) = K μ WT ( s )
W NM ( s ) = WPM ( s ) K PWT ( s )
(6)
In the CC-GPC, the nonlinear controller consists of five local controllers, each one of which is designed about one of the local models, and thus each with a set of tuning parameters. At each sample instant the load signal was fed to the interpolation membership
Neurofuzzy Power Plant Predictive Control
183
Fig. 4. Load control system in boiler-following mode
function of the B-spline NFNs, which in turn generates the activation weights for each of the local controllers. Each local controller was assumed to be linear and hence the control sequence for each could be solved analytically. However, the summation of the interpolated outputs is nonlinear. Notice that, since the B-spline membership function was chosen to be second order, there are two controllers working at any time instant. In the LMB-GPC, the NFNs model for the process was used with a GPC algorithm for control purposes. At each sample instant the load signal was fed to the interpolation membership function of the NFNs. Each of the five sets of local model parameters was then passed through this B-spline interpolation function to form a local model, which accurately represents the process around that particular operating point. This local model may be assumed linear and is used by the GPC controller. Also notice that, since the B-spline membership function was chosen to be second order, there are two local models working at any instant time. The LMB-GPC strategy requires only one set of tuning parameters. The internal model of a single GPC controller is updated at each sample instant. The linear GPC is obtained by minimizing the following cost function, N
M
j =1
j =1
J = E{∑ q j [ yˆ (t + j ) − y r (t + j )] 2 } + ∑ λ j [Δu (t + j − 1)] 2
(7)
subject to u min < u (t + i − 1) < u max Δu min < u (t + i − 1) < Δu max ,for i = 1,2, " , m The controller parameters are chosen as Q = I , and λ = 0.1× I . The sampling in-
,
terval is chosen to be 30s. N=10 M = 6. In the sliding pressure mode, the steam pressure setpoint was incremented every 10 minutes from 11Mpa to 19Mpa, leading to a load increase from 140MW to 300MW. This was done in order to move the process across a wide operating range.The “tuning knobs” of the neuro-fuzzy GPC are chosen as discussed above. Simulations were first taken under unconstraint condition. The sliding pressure responses are shown in Fig.5 by the dotted lines. It is readily apparent that the linear GPC controller could not offer satisfactory results in most of the cases. This is because its internal model was generated at a load “Medium” where the plant gain is moderate. The nonlinear GPC controllers show good sliding pressure response. Overall there seems to be very little difference between the two nonlinear controllers during this test. Simulations were then made under constraint condition: −0.005 ≤ u1 ≤ 0.005
−1.0 ≤ u 2 ≤ 0.02
(8)
184
X.-J. Liu and J.-Z. Liu
(a)
(b)
(c) Fig. 5. Sliding pressure response and control efforts under (a) linear GPC, (b) local modelbased GPC and (c) composed controller GPC
The sliding pressure responses and control efforts are shown in Fig.5 by the dotted lines. Similar comparing results were obtained except that in every scheme, control change effort was limited, leading to a slower response. Boiler following or “constant pressure” mode is the most commonly used mode in power plant coordinated control. Fig.6-a shows steam pressure transient process while load increases from 260MW to 290MW. The opening of the steam valve leads to a quick increase in the load, as energy stored in the boiler are being released. The steam pressure is restored to its original level by increasing the fuel delivery, after being decreased. All the three controllers give a similar performance, since the plant dynamic is within one operating region and the tuning parameter of the linear controller are valid within this region. Fig.6-b shows steam pressure response while load increases from 240MW to 300MW. The nonlinear controllers exhibit superior action,
Neurofuzzy Power Plant Predictive Control
(a)
185
(b)
Fig. 6. Steam pressure transient process under boiler following mode
since the tuning parameters of the linear controller were specified at one region and the plant dynamic changes across two regions.
5 Conclusion GPC can produce excellent results compared to conventional methods. One limitation of GPC is that it is mostly based on a linear model. It would lead to large differences between the actual and predicted output values, especially when the current output is relatively far away from the operating point at which the linear control model was generated. Introducing NFNs could help to solve this problem. The proposed nonlinear GPC controllers were applied in the simulation of the power plant coordinated control, which is kernel system of unit steam-boiler. Better results are obtained when compared with the linear GPC. Also it is shown how constraints handling can be incorporated into the GPC system by using the B-spline NFNs. The advantage of the method is that it is suitable for improving many industrial plants already controlled by linear controllers.
Acknowledgment This work is supported by National Natural Science Foundation of China under grant 50576022 and 69804003, Natural Science Foundation of Beijing under grant 4062030.
References 1. Liu, X.J., Lara-Rosano, F., Chan, C.W.: Neurofuzzy Network Modelling and Control of Steam Pressure in 300MW Steam-boiler System. Engineering Applications of Artificial Intelligence 16(5) (2003) 431-440 2. Katebi, M.R., Johnson, M.A.: Predictive Control Design for Large-scale Systems. Automatica 33(3) (1997) 421-425 3. Prasad, G., Swidenbank, E., Hogg, B.W.: A Neural Net Model-based Multivariable Longrange Predictive Control Strategy Applied Thermal Power Plant Control. IEEE Trans. Energy Conversion 13(2) (1998) 176-182 4. Brown, M., Harris, C.J.: Neurofuzzy Adaptive Modelling and Control. Englewood Cliffs, Prentice-Hall, NJ (1994)
GA-Driven Fuzzy Set-Based Polynomial Neural Networks with Information Granules for Multi-variable Software Process Seok-Beom Roh1, Sung-Kwun Oh2, and Tae-Chon Ahn1 1
Department of Electrical Electronic and Information Engineering, Wonkwang University, 344-2, Shinyong-Dong, Iksan, Chon-Buk, 570-749, South Korea {nado,tcahn}@wonkwang.ac.kr 2 Department of Electrical Engineering, The University of Suwon, San 2-2 Wau-ri, Bongdam-eup, Hwaseong-si, Gyeonggi-do, 445-743, South Korea
[email protected]
Abstract. In this paper, we investigate a GA-driven fuzzy-neural networks– Fuzzy Set–based Polynomial Neural Networks (FSPNN) with information granules for the software engineering field where the dimension of dataset is high. Fuzzy Set–based Polynomial Neural Networks (FSPNN) are based on a fuzzy set-based polynomial neuron (FSPN) whose fuzzy rules include the information granules obtained through Information Granulation. The information Granules are capable of representing the specific characteristic of the system. We have developed a design methodology (genetic optimization using real number type gene Genetic Algorithms) to find the optimal structures for fuzzy-neural networks which are the number of input variables, the order of the polynomial, the number of membership functions, and a collection of the specific subset of input variables. The augmented and genetically developed FSPNN (gFSPNN) with aids of information granules results in being structurally optimized and information granules obtained by information granulation are able to help a GA-driven FSPNN showing good approximation on the field of software engineering. The GA-based design procedure being applied at each layer of FSPNN leads to the selection of the most suitable nodes (or FSPNs) available within the FSPNN. Real number genetic algorithms are capable of reducing the solution space more than conventional genetic algorithms with binary genetype chromosomes. The performance of GA-driven FSPNN (gFSPNN) with aid of real number genetic algorithms is quantified through experimentation where we use a Boston housing data.
1 Introduction In recent, a great deal of attention has been directed towards usage of Computational Intelligence such as fuzzy sets, neural networks, and evolutionary optimization towards system modeling on the high-dimensional input-output space. A lot of researchers on system modeling have been interested in the multitude of challenging and conflicting objectives such as compactness, approximation ability, generalization D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 186–195, 2007. © Springer-Verlag Berlin Heidelberg 2007
GA-Driven FSPNN with Information Granules for Multi-variable Software Process
187
capability and so on which they wish to satisfy. Fuzzy sets emphasize the aspect of linguistic transparency of models and a role of a model designer whose prior knowledge about the system may be very helpful in facilitating all identification pursuits. It is difficult to build the fuzzy model which has good approximation ability and superior generalization capability on the multi-dimensional field. In addition, to build models with substantial approximation capabilities on the multi-dimensional field, there should be a need for advanced tools. As one of the representative and sophisticated design approaches comes a family of fuzzy polynomial neuron (FPN)-based self organizing neural networks (abbreviated as FPNN or SOPNN and treated as a new category of neural-fuzzy networks) [1], [2], [3], [4]. The design procedure of the FPNNs exhibits some tendency to produce overly complex networks as well as comes with a repetitive computation load caused by the trial and error method being a part of the development process. The latter is in essence inherited from the original GMDH algorithm that requires some repetitive parameter adjustment to be completed by the designer. In this study, in addressing the above problems coming with the conventional SOPNN (especially, FPN-based SOPNN called “FPNN”) [1], [2], [3], [4] as well as the GMDH algorithm, we introduce a new genetic design approach as well as a new FSPN structure treated as a FPN within the FPNN. Bearing this new design in mind, we will be referring to such networks as GA-driven FPNN with fuzzy set-based PNs (“gFPNN” for brief). In other hand, we introduce a new structure of fuzzy rules as well as a new genetic design approach. The new structure of fuzzy rules based on the fuzzy set-based approach changes the viewpoint of input space division. From a point of view of a new understanding of fuzzy rules, information granules seem to melt into the fuzzy rules respectively. The determination of the optimal values of the parameters available within an individual FSPN (viz. the number of input variables, the order of the polynomial corresponding to the type of fuzzy inference method, the number of membership functions(MFs) and a collection of the specific subset of input variables) leads to a structurally and parametrically optimized network. The network is directly contrasted with several existing neural-fuzzy models reported in the literature.
2 The Architecture and Development of Fuzzy Set-Based Polynomial Neural Networks (FSPNN) The FSPN encapsulates a family of nonlinear “if-then” rules. When put together, FSPNs results in a self-organizing Fuzzy Set-based Polynomial Neural Networks (FSPNN). Each rule reads in the form. if xp is Ak then z is Ppk(xi, xj, apk), if xq is Bk then z is Pqk(xi, xj, aqk),
(1)
where aqk is a vector of the parameters of the conclusion part of the rule while P(xi, xj, a) denoted the regression polynomial forming the consequence part of the fuzzy rule. The activation levels of the rules contribute to the output of the FSPN being computed
188
S.-B. Roh, S.-K. Oh, and T.-C. Ahn
1st layer
2nd layer or higher FSPN
FSPN FSPN
x1 FSPN
FSPN FSPN
x2
FSPN
yˆ
FSPN FSPN
x3
FSPN
FSPN
x4
FSPN FSPN FSPN
FSPN xi , x j F
x3
P
μ31
μˆ 31
P31
μ32
μˆ32
P32
μ41
μˆ 41
P41
μ42
μˆ 42
P42
∑
{ A3 }
∑
z
xi , x j
x4
∑
{B4 }
Fuzzy set-based processing(F) part
Membership function
Triangular
Gaussian
No. of MFs per each input
Polynomial form of mapping(P) prat
2≤M ≤5
Fuzzy inference method
Simplified fuzzy inference Regression polynomial fuzzy inference
The structure of consequent part of fuzzy rules
Selected input variables
PD : C0+C1x3+C2+x4
Entire system input variables
PD : C0+C1x1+C2+x2+C3x3+C4x4
Fig. 1. A general topology of the FSPN based FPNN along with the structure of the generic FSPN module (F: fuzzy set-based processing part, P: the polynomial form of mapping) Table 1. Different forms of the regression polynomials forming the consequence part of the fuzzy rules No. of inputs Order of the polynomial 0 (Type 1) 1 (Type 2)
1
2
3
Constant
Constant
Constant
Linear
Bilinear
Trilinear
Biquadratic-1
Triquadratic-1
Biquadratic-2
Triquadratic-2
2 (Type 3) Quadratic 2 (Type 4)
1: Basic type, 2: Modified type
as a weighted average of the individual condition parts (functional transformations) PK. (note that the index of the rule, namely “k” is a shorthand notation for the two indices of fuzzy sets used in the rule (1), that is K=(l,k)). total inputs
z=
∑ l =1
total inputs
=
∑ l =1
(
(
total_rules related to input l
∑
total_rules related to input l
μ( l , k ) P( l , k ) ( xi , x j , a ( l , k ) )
k =1
rules related to input l
∑ k =1
μ ( l , k ) P( l , k ) ( xi , x j , a ( l , k ) )
).
∑ k =1
μ ( l ,k )
)
(2)
GA-Driven FSPNN with Information Granules for Multi-variable Software Process
189
In the above expression, we use an abbreviated notation to describe an activation level of the “k”th rule to be in the form
μ(l , k ) μ~( l , k ) = total rule related . to input l ∑ μ(l , k )
(3)
k =1
3 Information Granulation Through Hard C-Means Clustering Algorithm Information granules are defined informally as linked collections of objects (data points, in particular) drawn together by the criteria of indistinguishability, similarity or functionality [12]. - Definition of the premise and consequent part of fuzzy rules using Information Granulation The fuzzy rules of Information Granulation-based FSPN are as followings. if xp is A*k then z-mpk = Ppk((xi-vipk),(xj- vjpk),apk), if xq is B*k then z-mqk = Pqk((xi-viqk),(xj- vjqk),aqk),
(4)
where, A*k and B*k mean the fuzzy set, the apex of which is defined as the center point of information granule (cluster) and mpk is the center point related to the output variable on clusterpk, vipk is the center point related to the i-th input variable on clusterpk and aqk is a vector of the parameters of the conclusion part of the rule while P((xivi),(xj- vj),a) denoted the regression polynomial forming the consequence part of the fuzzy rule. The given inputs are X=[x1 x2 … xm] related to a certain application and the output is Y=[y1 y2 … yn]T. Step 1) build the universe set. Step 2) build m reference data pairs composed of [x1;Y], [x2;Y], and [xm;Y]. Step 3) classify the universe set U into l clusters such as ci1, ci2, …, cil (subsets) by using HCM according to the reference data pair [xi;Y]. Step 4) construct the premise part of the fuzzy rules related to the i-th input variable (xi) using the directly obtained center points from HCM. Step 5) construct the consequent part of the fuzzy rules related to the i-th input variable (xi). Sub-step1) make a matrix as (5) according to the clustered subsets
⎡ x21 ⎢x 51 i Aj = ⎢ ⎢ xk 1 ⎢ ⎣#
y2 ⎤
x22
"
x2 m
x52
"
x5 m
y5 ⎥
xk 2
"
xkm
yk ⎥
#
"
#
# ⎦
⎥, ⎥
(5)
190
S.-B. Roh, S.-K. Oh, and T.-C. Ahn
where, {xk1, xk2, …, xkm, yk}∈cij and Aij means the membership matrix of j-th subset related to the i-th input variable. Sub-step2) take an arithmetic mean of each column on Aij. The mean of each column is the additional center point of subset cij. The arithmetic means of column is (6)
center points = ⎣⎡ vij 1
2
vij
"
m
vij
mij ⎦⎤ .
(6)
Step 6) if i is m then terminate, otherwise, set i = i +1 and return step 3.
4 Genetic Optimization of FSPNN with Aid of Real Number Gene-Type Genetic Algorithms Let us briefly recall that GAs is a stochastic search technique based on the principles of evolution, natural selection, and genetic recombination by simulating a process of “survival of the fittest” in a population of potential solutions (individuals) to the given problem. GAs are aimed at the global exploration of a solution space. They help pursue potentially fruitful search paths while examining randomly selected points in order to reduce the likelihood of being trapped in possible local minima. The main features of genetic algorithms concern individuals viewed as strings, population-based optimization (where the search is realized through the genotype space), and stochastic search mechanisms (selection and crossover). The conventional genetic algorithms use several binary gene type chromosomes. However, real number gene type genetic algorithms use real number gene type chromosomes not binary gene type chromosomes. We are able to reduce the solution space with aid of real number gene type genetic algorithms. That is the important advantage of real number gene type genetic algorithms. In order to enhance the learning of the FPNN, we use GAs to complete the structural optimization of the network by optimally selecting such parameters as the number of input variables (nodes), the order of polynomial, and input variables within a FSPN. In this study, GA uses the serial method of binary type, roulette-wheel used in the selection process, one-point crossover in the crossover operation, and a binary inversion (complementation) operation in the mutation operator. To retain the best individual
Fig. 2. Overall genetically-driven structural optimization process of FSPNN
GA-Driven FSPNN with Information Granules for Multi-variable Software Process
191
and carry it over to the next generation, we use elitist strategy [3], [8]. The overall genetically-driven structural optimization process of FPNN is visualized in Fig. 2.
5 Design Procedure of GA-Driven FPNN (gFPNN) The framework of the design procedure of the GA-driven Fuzzy Polynomial Neural Networks (FPNN) with fuzzy set-based PNs (FSPN) comprises the following steps [Step 1] Determine system’s input variables [Step 2] Form training and testing data [Step 3] specify initial design parameters - Fuzzy inference method - Type of membership function : Triangular or Gaussian-like MFs - Number of MFs allocated to each input of a node - Structure of the consequence part of the fuzzy rules
Fig. 3. The FSPN design–structural considerations and mapping the structure on a chromosome
192
S.-B. Roh, S.-K. Oh, and T.-C. Ahn
[Step 4] Decide upon the FSPN structure through the use of the genetic design [Step 5] Carry out fuzzy-set based fuzzy inference and coefficient parameters estimation for fuzzy identification in the selected node (FSPN) [Step 6] Select nodes (FSPNs) with the best predictive capability and construct their corresponding layer [Step 7] Check the termination criterion [Step 8] Determine new input variables for the next layer Finally, an overall design flowchart of the genetic optimization of FSPNN is shown Fig. 4. STA RT
D e c is io n o f e n t ir e s y s t e m 's in p u t v a r ia b le s
I n it ia l in f o r m a tio n f o r c o n s tr u c t in g F P N N a r c h it e c tu r e D e c is o n o f in it ia l in f o r m a t io n f o r f u z z y in f e r e n c e m e t h o d & f u z z y id e n t if ic a t io n ¾ ¾ ¾ ¾ ¾ ¾
D e c is io n o f f u z z y in f e r e n e m e th o d D e c is io n o f M F ty p e D e c is io n o f th e s tr u c tu r e o f c o n s e q u e n t p a r t o f fu z z y r u le
S e le c tio n o f th e te r m in a tio n c r ite r io n D e c is io n o f th e m a x im u m n u m b e r o f in p u t v a r ia b le s o f n o d e s in e a c h la y e r D e c is io n o f n o . o f n o d e s in e a c h la y e r
G e n e r a t io n o f F P N N a r c h ite c t u r e in t h e c o r r e s p o n d in g la y e r b y G A s
¾ ¾ ¾ ¾ ¾
¾
R ep ro d u ctio n R ep r od u ctio n
¾ ¾ ¾ ¾ ¾
N o . o f g e n e r a tio n s . o nf greantee r a tio n s M uNtaotio C r oMs suotavtio e r nr a rtea te s o vne sr izr aete P oC p ur olastio P o p s iz C h r muola s otio m ne le n ge th C h r o m o s o m e le n g th
G e n e r a tio n o f a F S P N b y c h r o m o so m e G e n er a tio n o f a F S P N b y c h r o m o so m e
1 st su b -c h ro m o so m e : 1 sSt es le u bc-tio c hnr oomf onsoo. mo fe in : p u t v a r ia b le s 2 n d s u bS-ecle h rcotio m no soofmneo :. o f in p u t v a r ia b le s 2 nSdesleucbtio - c hnr oo m om : ia l o r d e r f poos ly n oe m S e le c tio n o f p o ly ¾ 3 r d s u b - c h r o m o s m e : n o m ia l o r d e r ¾ 3 r dSseuleb c- ctio h rno m o fons oo m . oef :M F s ¾ 4 th s u b -Sc he le r ocmtio o sno omf en:o . o f M F s ¾ 4 thS seulebc-tio c h rno mo fo in s o pmuet v: a r ia b le s S e le c tio n o f in p u t v a r ia b le s ¾
¾ S e le c tio n : R o u le tte - w h e e l ¾ ¾ C r oS sesleo cvtio e r n: :ORnoeu- pleotte in t- wc rhoeseslo v e r o s ns o:v Ienr v:eO in nt c r o s s o v e r ¾ ¾ M uCartio r t nme -upaotio ¾ M u a tio n : I n v e r t m u a tio n
D e c isio n o f g e n e tic in itia l in fo r m a to in D e c isio n o f g e n e tic in itia l in f o r m a to in
In fo tm a tio n G ra n u latio n In fo tm a tio n G ra n u la tio n
¾
E x tr a c t I n fo r m a tio n G r a n u le s f orr m tiotsn oGf reaancuhle s s u cEhx tr a sa cCt eInn te P oa in s u cChluassteCr eunste o in inr gP H C ts M o f each C lu s te r u s in g H C M
¾
E v a lu a tio n o f F S P N s b y fitn ess v a lu e E v a lu a tio n o f F S P N s b y fitn ess v a lu e
I n f o r m a t io n G r a n u le s
E litis t st r a te g y & S e le c t io n o f F S P N s( W ) E litis t s tr a te g y & S e le c tio n o f F S P N s( W )
¾ A r r a y o f n o d e s o n th e b a s is o f fitn e s s v a lu e s r r ath y eo fd un po lic d e sa te o dn th e ebsass is ¾ ¾ U nAify fitn v aoluf efsitn e s s v a lu e s U cn tio ifyntho ef ndoudpelic a te d hfitn a lu ¾ ¾ S e le sw h ic h aevses hvig h eers fitn e s s v a lu e s ( W ) h hoaf vin e itia h iglh pe or pf itn s s nv anluu ems b( W ¾ ¾ A r Sr aeyleocftio n on doefs noond eths ewbhaic s is u laetio e r) ¾ A r r a y o f n o d e s o n th e b a s is o f in itia l p o p u la tio n n u m b e r
NO
T e r m in a t io n c r it e r io n ? YES
G e n e r a tio n o f F P N N a r c h ite c u r e in th e c o r r e s p o n d in g la y e r G e n e r a tio n o f F P N N a r c h ite c u r e in th e c o r r e s p o n d in g la y e r
NO
T e r m in a t io n c r ite r io n ?
D e t e r m in e n e w in p u t v a r ia b le s f o r t h e n e x t la y e r xj = zi
YES F in a l p r e d ic t iv e m o d e l
fˆ
END
Fig. 4. An overall design flowchart for the genetic optimization of the FPNN architecture
GA-Driven FSPNN with Information Granules for Multi-variable Software Process
193
Table 2. System’s variables description System variables CRIM ZN INDUS NOX CHAS RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
description Per capita crime rate by town Proportion of residential land zones for lots over 25,00 sq. ft. Proportion of non-retail business acres per town Nitric oxides concentration(parts per 10 million) Chareles River dummy variables (1-track bounds river, 0-otherwise) Average number of rooms per dwelling Proportion of owner-occupied units built prior to 1940 Weighted distance to five Boston employment centers Index of accessibility to radial highways Full-value property-tax rate per $ 10,000 Pupil-teacher ratio by town 1000 ⋅ (Bk-0.63) 2 , Bk is the proportion of blacks by town % lower status of the population Media value of owner-occupied homes in $1000s
Maximal number of inputs to be selected(Max) 2(A) ; , 3(B) ; , 4(C) ; , 5 (D) ;
Maximal number of inputs to be selected(Max) 2(A) ; , 3(B) ; , 4(C) ; , 5 (D) ;
20
34
A : (5 13;4 4) B : (2 5 13;4 4) C : (6 8 11 13;2 4) D : (6 8 11 13 0;2 4)
Training error
16 14
32 30
A : (4 25;4 5) B : (3 4 13;2 2) C : (2 6 11 16;4 2) D : (2 10 13 0 0;2 4)
12 10
Testing error
18
A : (3 26;4 2) B : (3 4 13;2 2) C : (1 10 25 0;2 2) D : (11 13 21 0 0;2 3)
28 26 24
8
22
6
20
4 2
18 1
2
16
3
Layer
1
2
3
Layer
(a-1) Training error
(a-2) Testing error (a) Triangular M F Maximal number of inputs to be selected(Max) 2(A) ; , 3(B) ; , 4(C) ; , 5 (D) ;
Maximal number of inputs to be selected(Max) 2(A) ; , 3(B) ; , 4(C) ; , 5 (D) ; 32
20 18
30
A : (7 13;4 5) B : (8 11 13;2 4) C : (6 8 11 13;2 4) D : (6 8 11 13 0;2 4)
14
28
A : (1 9;4 2) B : (3 20 29;2 3) C : (10 26 0 0;4 3) D : (2 5 0 0;4 2)
12
Training error
Training error
16
A : (9 19;3 4) B : (10 13 17;2 4) C : (6 8 11 13;2 4) D : (12 14 19 21 0 0;2 4)
10 8
26
24
22
6 20
4 2
1
1.2
1.4
1.6
1.8
2 2.2 Layer
2.4
2.6
2.8
3
18
1
1.2
(b-1) Training error
1.4
1.6
1.8
2
2.2
2.4
2.6
2.8
3
Layer
(b-2) Testing error (b) Gaussian-like M F
Fig. 5. Performance index of IG-gFSPNN (with Type T) with respect to the increase of number of layers
194
S.-B. Roh, S.-K. Oh, and T.-C. Ahn
6 Experimental Studies In the experiment of this study, we investigate a Boston housing data set on the software engineering [6]. It concerns a description of real estate in the Boston area where housing is characterized by a number of features including crime rate, size of lots, number of rooms, age of houses, etc. and median price of houses. The Boston dataset consists of 504 14-dimensional points, each representing a single attribute, Table 2. The construction of the fuzzy model is completed for 336 data points treated as a training set. The rest of the data set (i.e. 168 data points) is retained for testing purposes. Fig. 5 depicts the performance index of each layer of Information Granules based gFSPNN with Type T according to the increase of maximal number of inputs to be selected. Fig. 6 illustrates the different optimization process between gFSPNN and the proposed IG-gFSPNN by visualizing the values of the performance index obtained in successive generations of GA when using Type T*. 22
40
.. : IG-gFSPNN - : gFSPNN
.. : IG-gFSPNN - : gFSPNN
20 18
35
Testing Error
Training Error
16 14 12
1st layer
2nd layer
3rd layer
10
30
1st layer
2nd layer
3rd layer
25
8 6
20
4 2
0
50
100
150
200
Generation
(a) Training error
250
300
15
0
50
100
150
200
250
300
Generation
(b) Testing error
Fig. 6. The different optimization process between gFSPNN and IG-gFSPNN quantified by the values of the performance index (in case of using Gaussian MF with Max=4 and Type T)
In case when using triangular MF and Max=4 in the IG-gFSPNN, the minimal value of the performance index, that is PI=3.5071, EPI=16.9334 are obtained. In case when using Gaussian-like MF and Max=5, the best results are reported in the form of the performance index such as PI=2.5726, EPI=18.0604.
7 Concluding Remarks In this study, we have investigated the real number gene type GA-based design procedure of Fuzzy Set-based Polynomial Neural Networks (IG-FSPNN) with information granules along with its architectural considerations. The design methodology emerges as a hybrid structural optimization framework (based on the GMDH method and genetic optimization) and parametric learning being regarded as a two-phase design procedure. The GMDH method is comprised of both a structural phase such as
GA-Driven FSPNN with Information Granules for Multi-variable Software Process
195
a self-organizing and an evolutionary algorithm (rooted in natural law of survival of the fittest), and the ensuing parametric phase of the Least Square Estimation (LSE)based learning. The comprehensive experimental studies involving well-known datasets quantify a superb performance of the network when compared with the existing fuzzy and neuro-fuzzy models. Most importantly, the proposed framework of genetic optimization supports an efficient structural search resulting in the structurally and parametrically optimal architectures of the networks. Acknowledgements. This work was supported by the Korea Research Foundation Grant funded by the Korean Government (MOEHRD)(KRF-2006-311-D00194).
References 1. Oh, S.-K., Pedrycz, W.: Self-organizing Polynomial Neural Networks Based on PNs or FPNs : Analysis and Design. Fuzzy Sets and Systems 142 (2) (2004) 163-198 2. Michalewicz, Z.: Genetic Algorithms + Data Structures = Evolution Programs. SpringerVerlag, Berlin Heidelberg New York (1996) 3. Jong, D.K.A.: Are Genetic Algorithms Function Optimizers?. Parallel Problem Solving from Nature 2, Manner, R. and Manderick, B. eds., North-Holland, Amsterdam 4. Oh, S.K., Pedrycz, W.: Fuzzy Polynomial Neuron-Based Self-Organizing Neural Networks. Int. J. of General Systems 32 (2003) 237-250 5. Jang, J. S. R.: ANFIS: Adaptive-Network-Based Fuzzy Inference System. IEEE Trans. on Systems, Man and Cybernetics 23 (3) (1993) 665-685 6. Pedrycz, W., Reformat, M.: Evolutionary Fuzzy Modeling. IEEE Trans. On Fuzzy Systems 11 (5) (2003) 652-665 7. Oh, S.K., Pedrycz, W.: The design of Self-organizing Polynomial Neural Networks. Information Science 141 (2002) 237-258 8. Sugeno, M., Yasukawa, T.: A Fuzzy-Logic-Based Approach to Qualitative Modeling. IEEE Trans. Fuzzy Systems 1 (1) (1993) 7-31 9. Park, B.-J., Pedrycz, W., Oh, S.-K.: Fuzzy Polynomial Neural Networks : Hybrid Architectures of Fuzzy Modeling. IEEE Transaction on Fuzzy Systems 10 (5) (2002) 607-621 10. Lapedes, A. S., Farber, R.: Non-linear Signal Processing Using Neural Networks: Prediction and System Modeling. Technical Report LA-UR-87-2662. Los Alamos National Laboratory, Los Alamos, New Mexico 87545 (1987) 11. Zadeh, L. A.: Toward a Theory of Fuzzy Information Granulation and its Centrality in Human Reasoning and Fuzzy Logic. Fuzzy Sets and Systems 90 (1997) 111-117 12. Park, B.J., Lee, D.Y., Oh, S.K.: Rule-based Fuzzy Polynomial Neural Networks in Modeling Software Process Data. Int. J. of Control Automation and Systems 1 (3) (2003) 321-331
The ANN Inverse Control of Induction Motor with Robust Flux Observer Based on ESO Xin Wang and Xianzhong Dai School of Automation, Southeast University Nanjing, 210096, P.R. China
[email protected],
[email protected]
Abstract. When flux and speed are measurable, the artificial neural network inverse system (ANNIS) can almost linearize and decouple (L&D) induction motor despite variation of parameters. In practice, the rotor flux cannot be measured and is difficult to estimate accurately due to parameters varying. The inaccurate flux can affect the ANNIS, coordinate transformation and outer rotor flux loop, and make the performance degrade further. Based on this, an artificial neural network inverse control (ANNIC) method of induction motor with robust flux observer based on extended state observer (ESO) is proposed. The observer can estimate the rotor flux accurately when uncertainty exists. The proposed control method is expected to enhance the robustness and improve the performance of whole control system. At last, the feasibility of proposed control method is confirmed by simulation. Keywords: neural network inverse, extended state observer, linearize and decouple, induction motor, robust, simulation.
1 Introduction In the last decades, lots of high performance control methods of induction motor were proposed, these methods include (field oriented control) FOC, (direct torque control) DTC and other nonlinear control methods [1], [2]. In original version of these methods, the variation of machine parameters are not taken into account, that is, they depend on the exactly known model of induction motor, when electrical parameters of AC drive varying, the performance will deteriorate. To overcome it, improved version of them [3], [4] were proposed to achieve robust control system, which are expected to obtain high performance when the variation of parameters happens under various operating conditions. The ANNIC of induction motor is one of them [5], compared with others, it has advantage as following: 1) it is more robust than analytic inverse system control. When parameters of controlled plant varying, the ANNIS still can almost L&D the induction motor system. 2) It extends the asymptotically decoupling and linearization of FOC to global one. 3) It is simpler than other nonlinear adaptive controller and robust to the variation of all parameters, unmodelled dynamics etc. in practical applications. In the other hand, like other high performance control method, the ANNIC also needs accurate estimated flux, the inaccurate one can influence the ANNIS ,the D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 196–205, 2007. © Springer-Verlag Berlin Heidelberg 2007
The ANN Inverse Control of Induction Motor with Robust Flux Observer
197
coordinate transformation and the outer flux loop, consequently make the performance of whole system degrade. So it is necessary to design the ANNIC with robust flux observer to improve the performance of whole system.
2 Rotor Flux Observer Based on ESO In past, various flux observers are proposed to estimate flux [6], [7]. In ideal condition, a good rotor flux observer owns the characters that: use little input as possible; robust to variation of parameters and lightest computing burden. Because of the value of rotor resistance ranging from zero to 100% due to the variation of temperature, so a rotor flux based on ESO will be designed firstly in this paper, it can estimate rotor flux accurately despite uncertainty in rotor flux dynamics. The ESO is the core of auto-disturbance rejection controller (ADRC) and used as an essential part of ADRC in high performance control of induction motor drives [9], [10]. The ESO is based on the concept of generalized derivatives and generalized functions. ESO is a nonlinear configuration for observing the states and disturbances of the system under control without the knowledge of the exact system parameters. In this section, the theory of ESO is used to handle the uncertainty in the rotor flux dynamic system of induction motor. The representation of induction motor based on α , β two phases fixed coordinate system is given by
⎧ di s dt = −γ i s + ( βτ − Jβ n pωm )ψ r + η u s ⎪ ⎨ dψ r dt = τ Lm i s − (τ − Jn pωm )ψ r ⎪ ⎩ d ωm dt = μ (ψ r ⊗ i s ) − Tl J
(1)
where
β=
n p Lm Lm L2 R R R ⎡0 −1⎤ 1 ,μ = , γ = m r2 + s ,η = ,τ = r , J = ⎢ ⎥ σ Ls Lr JLr σ Ls Lr σ Ls Lr σ Ls ⎣1 0 ⎦
Rs , Rr are stator and rotor resistance, Ls , Lr , Lm are stator, rotor and mutual inductance,
ωm
is rotor mechanical angular velocity, n p is the number of pole
pairs, Tl is load torque, J is motor-load inertia, σ = 1 − L2m ( Ls Lr ) is the leakage coefficient, us = ⎡⎣ usα
T
T
usβ ⎤⎦ , ψr = ⎡⎣ψ rα ψ rβ ⎤⎦ , i s = ⎡⎣isα
T
is β ⎤⎦ . Please note that the
α , β appeared in subscript are different with the α , β in text. In the equation (1), the parameters, especially rotor resistance are varying when the motor in operation, the first row of (1),that is the current dynamic equation is rewritten as following, which combine the term including Rr di s dt = − Rsη i s − β ⎡⎣τ Lm i s − (τ − Jn pωm )ψ r ⎤⎦ + η us
(2)
198
X. Wang and X. Dai
Let a (t) = τ L m i s − (τ − J n p ω m ) ψ r
(3)
which includes all the uncertainty aroused from related machine parameters, it takes into the variation of Rr into account. Let da(t) dt = b(t) , (2) can be extended to ⎧ d i s dt = − R sη i s − β a(t) + η u s ⎨ ⎩ d a(t) dt = b(t)
(4)
⎧⎪ d ˆi s dt = − R sη ˆi s − β a(t) ˆ + η u s + g 1 ( ˆi s − i s ) ⎨ ˆ dt = g 2 ( ˆi s − i s ) ⎪⎩ d a(t)
(5)
The ESO of (4) is:
where in general gi ( xˆ − x) = βi fal ( xˆ − x, α , δ ) , i = 1, 2 ⎧ ε α sgn(ε ), ε > δ ⎪ fal (ε , α , δ ) = ⎨ x ,ε ≤δ ⎪ ⎩ δ 1−α
(6)
where ε = xˆ − x , sgn(ε ) is the signum function. The exponential α , α ∈ ( 0,1) and the
scaling factor βi determine the convergence speed of ESO, the parameter δ determines the nonlinear region of the ESO. Generally, δ is set to be approximately 10% of the variation range of its input signal. For a(t) ,that is, the derivative of rotor flux, is varying in given region. Choosing α , β1 , β 2 , δ carefully, we can make (5) approximate the state i s and uncertain a(t) of practical system, and get ˆ r by integrating aˆ (t) . the estimated rotor flux ψ ˆ r = ∫0t a(t) ˆ dt ψ
(7)
One can see from (5) that the ESO doesn’t include rotor resistance Rr , so it is robust to Rr .The observed modulus and position of rotor flux are
ψˆ r = ψˆ r2α + ψˆ r2β
(8.1)
θˆs = ∫ ωˆ s dt = arctg (ψˆ r β ψˆ rα )
(8.2)
3 The ANNIC of Induction Motor The detailed introduction to design the ANNIC of a plant was given in [8]. The steps of designing an ANNIC of induction motor, assuming flux is measurable, are given below.
The ANN Inverse Control of Induction Motor with Robust Flux Observer
199
3.1 Analytic Inverse Expression of Induction Motor
The model of induction motor in M-T coordinated can be represented as (9) ⎞ ⎛ dism dt ⎞ ⎛ −γ ism + ωs ist + τβψ r + η usm ⎜ ⎟ ⎜ −ω i − γ i − n ω βψ + η u ⎟ st p m r st ⎟ ⎜ dist dt ⎟ = ⎜ s sm ⎟ ⎜ dψ r dt ⎟ ⎜ τ L i − τψ r ⎟ ⎜⎜ ⎟⎟ ⎜⎜ m sm ⎟ ⎝ d ωm dt ⎠ ⎝ μ istψ r − Tl J ⎠
(9)
where
ωs = n pωm + Lm Rr ist ( Lrψ r ) ωs is the rotor flux rotating velocity or synchronous rotating velocity, which is calculated in M-T axis in this section, it is equal to one calculated in section 2, ism , ist are the M-axis and T-axis components of stator currents, ψ r is rotor flux projected on M-axis or modulus of rotor flux , usm , ust are M-axis and T-axis components of stator voltages. In equation (9), let state vector X = [ x1
x2
x3
x4 ] = [ism , ist ,ψ r , ωm ]T ,the input T
vector us = [u1 , u2 ]T = [usm , ust ]T and the output vector y = [ y1 , y2 ]T = [ψ r , ωm ]T .Then the state equation (9) can be compacted as dX dt = f ( X, u s , θ)
(10)
y = [ y1 , y2 ]T = [ψ r , ωm ]T
(11)
The output equation is
where θ is vector of motor parameters, for convenience, it is assumed that Tl = 0 in this paper. The input-output type static analytic inverse expressions are expressed as (12) AB C Rr ( β Lm + 1) y1 Lr v1 ⎧ + ] ⎪u1 = σ Ls [ L − (n p y2 + Lm C ) y − Lm Lr Rr Lm ⎪ r 1 ⎨ ⎪u = σ L [ AC + n y ( B + n β y ) + v2 ] s p 2 p 1 ⎪⎩ 2 Lr y1 μ Jy1
(12)
where A = γ Lr + Rr , B =
T + Jy 2 Rr y1 + Lr y1 ,C = l Lm R r μJ
The compact form of the inverse system expression can be written as us = G ( y , y , v, θ) , v = y
(13)
200
X. Wang and X. Dai
3.2 The Design of ANNIC
According to (13), the relative degree of the system is 4, so the static neural network has 6 inputs and 2 outputs. Choosing a three-layer feedforward BP (backpropagation) neural network whose hidden- and output-layer activation functions are tansig() and purelin() respectively, the static neural network used to approximate the analytic inverse control is expressed as

$$u_{sNN} = NN(y, \dot{y}, v) = \mathrm{purelin}\big(W_2^T\,\mathrm{tansig}(W_1^T Y + B_1) + B_2\big) \qquad (14)$$

where $Y = [y, \dot{y}, v]^T$, $B_1, B_2$ are bias vectors, and $W = [W_1^T, W_2^T]^T$. The problem is thus transformed into searching for the optimum weight matrix $\hat{W}$ in the space $\Omega$ that satisfies

$$\hat{W} = \arg\min_{W\in\Omega}\Big(\sup_{Y\in D}\big\|\mathrm{purelin}\big(W_2^T\,\mathrm{tansig}(W_1^T Y + B_1) + B_2\big) - u_s\big\|\Big) \qquad (15)$$
where D is the space spanned by the input data used to train the neural network. An appropriate algorithm is chosen to train the network, the analytic inverse-system controller is replaced by the static neural network, and the number of additional integrators is then determined according to the relative degree. The designed input-output integrated ANNIS can approximately L&D (linearize and decouple) the induction motor into a flux subsystem and a speed subsystem when the flux and speed are measurable, so the flux and speed regulators can be designed separately with linear system theory. Comparing (14) with (13), we can conclude that by introducing the ANN the effect of θ is eliminated and replaced by the weight matrices of the ANN. Because a neural network is essentially an adaptive system, with advantages such as self-learning, fault tolerance, and robustness, replacing the analytic inverse system with the ANNIS improves the capability of rejecting parameter variations.
4 The ANNIC of Induction Motor with Robust Flux Observer

According to the method described in Section 3, in this section we propose an ANNIC of the induction motor with the robust rotor flux observer designed in Section 2, with the aim of making the control system more robust and improving its performance. The design of the ANNIC with the robust flux observer based on the ESO is as follows.

4.1 Design of Exciting Signals
The flux and speed exciting signals are designed according to the operational region of the motor and identification theory. Closed-loop identification is chosen to prevent the motor from running out of its operating region. First, step reference signals of flux and speed are fed to the analytic inverse control of the induction motor with the flux observer based on the ESO, and the steady-state and dynamic parameters are ascertained from the obtained response curves. Uniformly distributed random signals are then chosen by simulation inspection; the amplitudes of the exciting signals are 0.1-1 Wb for the flux and 0-150 rad/s for the speed, with variation periods of 1 s and 0.9 s respectively. The exciting signals added to
Fig. 1. The input and output of the excited induction motor: (a) the stator voltages added to the induction motor; (b) the flux and speed response of the induction motor
the induction motor are shown in Fig. 1(a), where the solid line is the M-axis stator voltage component and the dashed line is the T-axis stator voltage component; the output signals of the motor are shown in Fig. 1(b), where the solid line is the flux response curve and the dashed line is the rotor speed response curve.

4.2 Data Sampling and Handling
The input and output signals of the induction motor are sampled with a sampling rate much higher than that used in the subsequent control, and the first and second derivatives of the output variables $\{\hat{\psi}_r, \omega_m\}$ are calculated. Since the derivatives are computed offline, a good numerical differentiation algorithm can be chosen to ensure the accuracy of the derivatives. All the sampled and calculated data are reassembled to form the training data set $\{\hat{\psi}_r, \dot{\hat{\psi}}_r, \ddot{\hat{\psi}}_r, \omega_m, \dot{\omega}_m, \ddot{\omega}_m\}$ and $\{u_{sm}, u_{st}\}$; the former and the latter are, respectively, the input data and the desired outputs of the static ANN. For easier convergence, the data sets are normalized to the range [-4, +4].
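As an illustration of this data-handling step, the following sketch computes the first two derivatives offline and normalizes the assembled data to [-4, +4]. The use of central differences (numpy.gradient) and the array names psi_hat, omega, u_sm, u_st are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np

def first_two_derivatives(y, dt):
    """Offline central-difference estimates of the 1st and 2nd derivatives."""
    dy = np.gradient(y, dt)
    d2y = np.gradient(dy, dt)
    return dy, d2y

def normalize(data, lo=-4.0, hi=4.0):
    """Linearly rescale each column to the interval [lo, hi]."""
    dmin, dmax = data.min(axis=0), data.max(axis=0)
    return lo + (data - dmin) * (hi - lo) / (dmax - dmin)

def build_dataset(psi_hat, omega, u_sm, u_st, dt):
    """Assemble (inputs, targets) for the static ANN of Eq. (14)."""
    dpsi, d2psi = first_two_derivatives(psi_hat, dt)
    domega, d2omega = first_two_derivatives(omega, dt)
    X = np.column_stack([psi_hat, dpsi, d2psi, omega, domega, d2omega])
    U = np.column_stack([u_sm, u_st])
    return normalize(X), normalize(U)
```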
4.3 Training and Testing of ANN

The ANN structure is determined according to Section 3.2. The choice of the number of hidden-layer neurons is a compromise among output precision, training time, and the generalization capability of the ANN. With too few neurons, the network takes a long time to converge or does not converge to a satisfactory error; with too many neurons, or if the training error is made very small, the ANN memorizes the training vectors and gives a large error on generalization vectors. The number of hidden-layer neurons is therefore determined by trial and error, and the final structure of the ANN is 6-13-2. The sampled and processed data are divided into two groups, one used for training and the other for testing. The LM (Levenberg-Marquardt) algorithm is chosen to train the static ANN offline with 2000 training steps. Finally, an ANN with a training error of 5.23832e-5 is obtained.
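A minimal sketch of the 6-13-2 static network of Eq. (14) is shown below, with tanh (tansig) hidden units and a linear (purelin) output layer. The weights here are random placeholders; the paper trains them offline with the Levenberg-Marquardt algorithm in MATLAB, which is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# 6-13-2 network of Eq. (14): tansig hidden layer, purelin output layer.
W1 = rng.normal(scale=0.5, size=(6, 13))   # input-to-hidden weights
b1 = np.zeros(13)
W2 = rng.normal(scale=0.5, size=(13, 2))   # hidden-to-output weights
b2 = np.zeros(2)

def ann_inverse(Y):
    """Y: (n_samples, 6) array of [psi, dpsi, d2psi, omega, domega, d2omega]."""
    hidden = np.tanh(Y @ W1 + b1)          # tansig
    return hidden @ W2 + b2                # purelin -> [u_sm, u_st]

def training_mse(Y, U):
    """Mean squared training error between network outputs and target voltages."""
    err = ann_inverse(Y) - U
    return np.mean(err ** 2)
```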
4.4 Design of Flux and Speed Regulators
The designed ANNIS using the estimated flux can adaptively L&D the motor into a flux subsystem $G_\psi(s) \approx 1/s^2$ and a speed subsystem $G_\omega(s) \approx 1/s^2$. Proportional-derivative (PD) regulators are chosen to adjust the two subsystems, with parameters $K_{p\omega} = K_{p\psi} = 1200$ and $K_{d\omega} = K_{d\psi} = 50$, where the subscripts $\omega, \psi$ denote the regulators of the speed and flux subsystems respectively. The ANNIC of the induction motor with the observer based on the ESO is depicted in Fig. 2.
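A sketch of one such PD regulator, using the gains quoted above, might look as follows; the discrete-time derivative approximation and the class interface are illustrative assumptions.

```python
class PDRegulator:
    """PD regulator for one of the linearized 1/s^2 subsystems."""
    def __init__(self, kp=1200.0, kd=50.0):
        self.kp, self.kd = kp, kd
        self.prev_err = 0.0

    def __call__(self, ref, meas, dt):
        err = ref - meas
        derr = (err - self.prev_err) / dt
        self.prev_err = err
        return self.kp * err + self.kd * derr   # command v fed to the ANN inverse

flux_reg = PDRegulator()    # K_p_psi = 1200, K_d_psi = 50
speed_reg = PDRegulator()   # K_p_omega = 1200, K_d_omega = 50
```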
Fig. 2. The input-output type ANNIC of induction motor with robust flux observer based on ESO

Table 1. The parameters of the induction motor

Rated power          1.1 kW         Stator resistance    5.9 Ω
Rated speed          146.6 rad/s    Rotor resistance     5.6 Ω
Pairs of poles       2              Stator inductance    0.574 H
Motor-load inertia   0.0021 kg·m²   Rotor inductance     0.580 H
Rated load torque    7.5 N·m        Mutual inductance    0.55 H
5 Simulation

The proposed control system is studied by simulation. The simulation algorithm is ode45, the sampling period is 1e-4 s, the motor parameters are listed in Table 1, and the parameters of the ESO are $\alpha = 0.5$, $\delta = 0.1$, $\beta_1 = 75$, $\beta_2 = 375$. The ANNIC with the flux observer is compared with the one using the real flux under two conditions in this section.

5.1 Comparison Under Rr Constant
The simulation results of the ANNI control of the induction motor with the observer and of the one using the real flux value when the rotor resistance is constant are shown in Fig. 3. Fig. 3(a) shows the flux response of the ANNI control system with the real flux value (dashed line) and of the one with the observer (dotted line); the solid line represents the reference flux. Fig. 3(b) shows the speed response of the ANNI control system with the real flux value (dashed line) and of the one with the observer (dotted line); the solid line represents the reference speed.
Fig. 3. The comparison of the ANNI control system with the real flux value and the one with the flux observer when Rr is constant: (a) the flux response curves of the two control systems; (b) the speed response curves of the two control systems
From the results, we can conclude that when the motor model is exactly known, approximate L&D is obtained both by the ANNI control of the induction motor with the flux observer and by the one with the real flux value. The performance of the system with the flux observer based on the ESO is comparable to that of the system with the real flux value; only negligible coupling appears.

5.2 Comparison Under Rr Varying
The simulation results of the ANNI control of the induction motor with the robust observer and of the one using the real flux value when the rotor resistance varies according to (16) are shown in Fig. 4. Fig. 4(a) shows the flux response of the ANNI control system with the real flux value (dashed line) and of the one with the observer (dotted line); the solid line represents the reference flux. Fig. 4(b) shows the speed response of the ANNI control system with the real flux value (dashed line) and of the one with the observer (dotted line); the solid line represents the reference speed.
Fig. 4. The comparison of the ANNIC system with the real flux value and the one with the flux observer when Rr is varying: (a) the flux response curves of the two control systems; (b) the speed response curves of the two control systems
From the above results, we conclude that when the rotor resistance varies, the control performance of the system with the observer is little affected by the rotor resistance and degrades only slightly more than that of the system with the real flux; the coupling that appears in both systems is acceptable, and the inclusion of the observer does not cause instability. These properties bring the ANNIC closer to practical implementation.

$$R_r = \begin{cases} 5.6 + 4t, & t \le 1.5 \\ 11.6, & t > 1.5 \end{cases} \qquad (16)$$
6 Conclusions

In this paper, to address the problems that the rotor flux of the motor cannot be measured and that the machine parameters, especially the rotor resistance, increase with temperature during operation, an ANNIC of the induction motor with a robust flux observer based on the ESO, which does not depend strongly on the model of the induction motor, was proposed. A comparative study between the ANNIC of the induction motor with the flux observer and the one using the real flux value was carried out in Matlab/Simulink. The simulation results show that the ANNIC method with the robust observer can almost achieve L&D control of the motor and obtains good tracking performance despite the varying rotor resistance. The ANNI control with the flux observer based on the ESO therefore possesses strong robustness, and the proposed control system is closer to the conditions encountered in a practical implementation.

Acknowledgments. This work is supported by the National Natural Science Foundation of China (No. 60574097), the Specialized Research Fund for the Doctoral Program of Higher Education (No. 20050286029), and in part by the National Basic Research Program of China under Grant No. 2002CB312204.
References

1. Bodson, M., Chiasson, J., Novotnak, T.: High-performance Induction Motor Control via Input-output Linearization. IEEE Contr. Syst. Mag. 14(4) (1994) 25-33
2. Taylor, D.: Nonlinear Control of Electric Machines: An Overview. IEEE Contr. Syst. Mag. 14(6) (1994) 41-51
3. Marino, R., Peresada, S., Tomei, P.: Global Adaptive Output Feedback Control of Induction Motors with Uncertain Rotor Resistance. IEEE Trans. Automatic Control 44(5) (1999) 967-983
4. Kwan, C., Lewis, F. L.: Robust Backstepping Control of Nonlinear Systems Using Neural Networks. IEEE Trans. Systems, Man and Cybernetics, Part A 30(6) (2000) 753-766
5. Dai, X., Zhang, X., Liu, G., Zhang, L.: Decoupling Control of Induction Motor Based on Neural Networks Inverse (in Chinese). Proceedings of the CSEE 24(1) (2004) 114-117
6. Du, T., Vas, P., Stronach, F.: Design and Application of Extended Observers for Joint State and Parameter Estimation in High-performance AC Drives. Proc. IEE-Elect. Power Applicat. 142(2) (1995) 71-78
7. Soto, G. G., Mendes, E., Razek, A.: Reduced-Order Observers for Flux, Rotor Resistance and Speed Estimation for Vector Controlled Induction Motor Drives Using the Extended Kalman Filter Technique. Proc. IEE-Elect. Power Applicat. 146(3) (1999) 282-288
8. Dai, X., He, D., Zhang, X., Zhang, T.: MIMO System Invertibility and Decoupling Control Strategies Based on ANN α-th Order Inversion. IEE Proceedings Control Theory and Applications 148(2) (2001) 125-136
9. Feng, G., Liu, Y. F., Huang, L. P.: A New Robust Algorithm to Improve the Dynamic Performance on the Speed Control of Induction Motor Drive. IEEE Trans. Power Electronics 19(6) (2004) 1614-1627
10. Fei, L., Zhang, C. P., Song, W. C., Chen, S. S.: A Robust Rotor Flux Observer of Induction Motor with Unknown Rotor and Stator Resistance. Industrial Electronics Society, IECON '03 (2003) 738-741
Design of Fuzzy Relation-Based Polynomial Neural Networks Using Information Granulation and Symbolic Gene Type Genetic Algorithms SungKwun Oh1, InTae Lee1, Witold Pedrycz2, and HyunKi Kim1 1
Department of Electrical Engineering, The University of Suwon, San 2-2 Wau-ri, Bongdam-eup, Hwaseong-si, Gyeonggi-do, 445-743, South Korea
[email protected] 2 Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB T6G 2G6, Canada and Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland
Abstract. In this study, we introduce and investigate genetically optimized fuzzy relation-based polynomial neural networks designed with the aid of information granulation (IG_gFRPNN), and develop a comprehensive design methodology involving mechanisms of genetic optimization with a symbolic gene type. With the aid of information granules obtained by C-Means clustering, we can determine the initial locations (apexes) of the membership functions and the initial values of the polynomial functions used in the premise and consequence parts of the fuzzy rules, respectively. The GA-based design procedure applied at each layer of the IG_gFRPNN leads to the selection of preferred nodes with specific local characteristics (such as the number of input variables, the order of the polynomial, a collection of the specific subset of input variables, and the number of membership functions) available within the network. The performance of the proposed model is contrasted with that of conventional intelligent models reported in the literature.
1 Introduction

While the theory of traditional equation-based approaches is well developed and successful in practice (particularly in linear cases), there has been a great deal of interest in applying model-free methods such as neural and fuzzy techniques to nonlinear function approximation [1]. GMDH was introduced by Ivakhnenko in the early 1970s [2], and GMDH-type algorithms have been used extensively since the mid-1970s for prediction and modeling of complex nonlinear processes. While providing a systematic design procedure, GMDH comes with some drawbacks. To alleviate the problems associated with GMDH, Self-Organizing Neural Networks (SONN, here called "FRPNN") were introduced by Oh and Pedrycz [3], [4], [5] as a new category of neural networks or neuro-fuzzy networks. Although the FRPNN has a flexible architecture whose potential can be fully utilized through a systematic design, it is difficult to obtain a structurally and parametrically optimized network because of the limited design of the nodes located in each layer of the FRPNN.

D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 206–215, 2007. © Springer-Verlag Berlin Heidelberg 2007
In this study, considering the above problems of the conventional FRPNN, we introduce a new structure and organization of fuzzy rules as well as a new genetic design approach. The fuzzy rules acquire a new meaning: information granules are embedded into the fuzzy rules, so that, in a nutshell, each fuzzy rule describes the related information granule. The determination of the optimal values of the parameters available within an individual FRPN (viz. the number of input variables, the order of the polynomial, a collection of preferred nodes, and the number of MFs) leads to a structurally and parametrically optimized network through the genetic approach.
2 FRPNN with Fuzzy Relation-Based Polynomial Neuron (FRPN)

The FRPN consists of two basic functional modules. The first one, labeled by F, is a collection of fuzzy sets that form an interface between the input numeric variables and the processing part realized by the neuron. The second module (denoted here by P) concerns the function-based nonlinear (polynomial) processing. The detailed FRPN involving a certain regression polynomial is shown in Table 1. The choice of the number of input variables, the polynomial order, the input variables, and the number of MFs available within each node helps select the best model with respect to the characteristics of the data, the model design strategy, nonlinearity, and predictive capabilities.

Table 1. Different forms of regression polynomial building a FRPN

Order of the polynomial    No. of inputs: 1    2                3
0 (Type 1)                 Constant            Constant         Constant
1 (Type 2)                 Linear              Bilinear         Trilinear
2 (Type 3)                 Quadratic           Biquadratic-1    Triquadratic-1
2 (Type 4)                 Quadratic           Biquadratic-2    Triquadratic-2
(1: Basic type, 2: Modified type)
Proceeding with the FRPNN architecture, essential design decisions have to be made with regard to the number of input variables and the order of the polynomial forming the conclusion part of the rules, as well as a collection of the specific subset of input variables.

Table 2. Polynomial type according to the number of input variables in the conclusion part of fuzzy rules

Input vector                                          Type T    Type T*
Selected input variables in the premise part          A         A
Selected input variables in the consequence part      A         B

Notation: A is the vector of the selected input variables (x1, x2, ..., xi); B is the vector of the entire system input variables (x1, x2, ..., xi, xj, ...); Type T denotes f(A) = f(x1, x2, ..., xi), the polynomial function standing in the consequence part of the fuzzy rules; Type T* denotes f(B) = f(x1, x2, ..., xi, xj, ...), the polynomial function occurring in the consequence part of the fuzzy rules.
3 The Structural Optimization of IG_gFRPNN

3.1 Information Granulation by Means of C-Means Clustering Method

Information granulation is defined informally as linked collections of objects (data points, in particular) drawn together by the criteria of indistinguishability, similarity or functionality [6]. Granulation of information is a procedure for extracting meaningful concepts from insignificant numerical data and an inherent activity of human beings carried out with the intent of better understanding the problem. We extract information about the real system with the aid of the Hard C-Means (HCM) clustering method [7], which deals with conventional crisp sets. Through HCM, we determine the initial locations (apexes) of the membership functions and the initial values of the polynomial functions used in the premise and consequence parts of the fuzzy rules, respectively. The fuzzy rules of the IG_gFRPNN are given as follows:

$$R^j: \text{If } x_1 \text{ is } A_{j1} \text{ and } \ldots \text{ and } x_k \text{ is } A_{jk} \text{ then } y_j - M_j = f_j\{(x_1 - v_{j1}), (x_2 - v_{j2}), \ldots, (x_k - v_{jk})\},$$

where $A_{jk}$ denotes the fuzzy set whose apex is defined as the center point of the information granule (cluster), and $M_j$ and $v_{jk}$ are the center points of the new output and input variables created by information granulation.

3.2 Genetic Optimization of IG_gFRPNN

Let us briefly recall that GAs are a stochastic search technique based on the principles of evolution, natural selection, and genetic recombination, simulating a process of "survival of the fittest" in a population of potential solutions to the given problem. The main features of genetic algorithms concern individuals viewed as strings, population-based optimization, and a stochastic search mechanism (selection and crossover). In order to enhance the learning of the IG_gFRPNN and augment its performance, we use genetic algorithms to obtain the structural optimization of the network by optimally selecting such parameters as the number of input variables (nodes), the order of the polynomial, the input variables, and the number of MFs within the IG_gFRPNN. Here, the GA uses a serial method of symbolic type, roulette-wheel selection, one-point crossover, and a uniform operator for mutation [8].
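As a concrete illustration of the HCM step of Section 3.1, the following sketch computes crisp cluster prototypes whose coordinates can serve as the MF apexes v_jk and output centers M_j. The simple Lloyd-style iteration shown is a generic hard C-means, not the authors' exact implementation.

```python
import numpy as np

def hard_c_means(Z, c, n_iter=100, seed=0):
    """Plain hard C-means (crisp clustering) of the rows of Z into c clusters."""
    Z = np.asarray(Z, dtype=float)
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), size=c, replace=False)].copy()
    for _ in range(n_iter):
        # assign every sample to its nearest prototype
        labels = np.argmin(((Z[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        for j in range(c):
            members = Z[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return centers

# Clustering the joint input-output data gives prototypes whose input coordinates
# become the MF apexes (v_jk) and whose output coordinate becomes M_j.
```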
4 The Algorithm and Design Procedure of IG_gFRPNN

The IG_gFRPNN comes with a highly versatile architecture both in the flexibility of the individual nodes as well as the interconnectivity between the nodes and organization of the layers. Evidently, these features contribute to the significant flexibility of the networks yet require a prudent design methodology and well-thought
learning mechanisms. The framework of the design procedure of the genetically optimized Fuzzy Relation-based Polynomial Neural Networks (gFRPNN) based on information granulation comprises the following steps.

[Step 1] Determine the system's input variables.

[Step 2] Form the training and testing data. The input-output data set (xi, yi) = (x1i, x2i, ..., xni, yi), i = 1, 2, ..., N (with N being the total number of data points) is divided into two parts, that is, a training and a testing dataset.

[Step 3] Decide the apexes of the MFs by information granulation (HCM). As described in Section 3.1, the new apexes of the MFs are obtained by information granulation.

[Step 4] Decide the initial information for constructing the FRPNN structure. Here we decide upon the essential design parameters of the FRPNN structure, including (a) the initial specification of the fuzzy inference method and the fuzzy identification, and (b) the initial specification for the decision of the FRPNN structure.

[Step 5] Decide upon the FRPNN structure with the use of genetic design. We divide the chromosome used for genetic optimization into four sub-chromosomes, as shown in Fig. 1. The 1st sub-chromosome contains the number of input variables, the 2nd sub-chromosome includes the input variables coming to the corresponding node (FRPN), the 3rd sub-chromosome contains the number of membership functions (MFs), and the last sub-chromosome (the remaining bits) involves the order of the polynomial of the consequence part of the fuzzy rules. All these elements are optimized by running the GA.
Fig. 1. The FRPN design used in the FRPNN architecture – structural considerations and a mapping of the structure onto a chromosome
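As an illustration of the symbolic chromosome of Fig. 1, the sketch below generates one candidate node description containing the four sub-chromosomes listed in Step 5. The value ranges (2 or 3 MFs, polynomial Types 1-4, at most Max inputs) follow Table 3, while the dictionary representation itself is only a convenient assumption.

```python
import random

def random_chromosome(n_candidates, max_inputs=5):
    """Symbolic (integer-coded) chromosome for one FRPN, cf. Fig. 1."""
    r = random.randint(1, max_inputs)                  # 1) no. of input variables
    inputs = random.sample(range(n_candidates), r)     # 2) which inputs feed the FRPN
    n_mfs = [random.choice([2, 3]) for _ in inputs]    # 3) no. of MFs per input
    poly_type = random.randint(1, 4)                   # 4) polynomial order (Type 1-4)
    return {"inputs": inputs, "n_mfs": n_mfs, "poly_type": poly_type}

# Example: one candidate node drawn from 6 candidate inputs.
print(random_chromosome(6))
```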
[Step 6] Carry out fuzzy inference and estimate the coefficient parameters for fuzzy identification in the selected node.

i) Simplified inference. The consequence part of the simplified inference mechanism is a constant. Using information granulation, the new rules read in the form

$$R^n: \text{If } x_1 \text{ is } A_{n1} \text{ and } \ldots \text{ and } x_k \text{ is } A_{nk} \text{ then } y_n - M_n = a_{n0}, \qquad (1)$$
where $R^n$ is the n-th fuzzy rule, $x_l$ (l = 1, 2, ..., k) is an input variable, $A_{jl}$ (j = 1, ..., n; l = 1, ..., k) is a membership function of the fuzzy sets, $M_j$ (j = 1, ..., n) is the center point related to the newly created output variable, and n denotes the number of rules.

$$\hat{y}_i = \frac{\sum_{j=1}^{n}\mu_{ji}\,y_i}{\sum_{j=1}^{n}\mu_{ji}} = \frac{\sum_{j=1}^{n}\mu_{ji}\,(a_{j0}+M_j)}{\sum_{j=1}^{n}\mu_{ji}} = \sum_{j=1}^{n}\hat{\mu}_{ji}\,(a_{j0}+M_j), \qquad (2)$$

$$\mu_{ji} = A_{j1}(x_{1i}) \wedge \ldots \wedge A_{jk}(x_{ki}), \qquad (3)$$
where $\hat{\mu}_{ji}$ is the normalized value of $\mu_{ji}$, and (2) is the value $\hat{y}_i$ inferred from (1). The consequence parameters $a_{j0}$ are produced by the standard least squares method, that is,

$$a = (X^T X)^{-1} X^T Y, \quad X = [x_1, x_2, \ldots, x_m]^T, \quad x_i = [\hat{\mu}_{1i}, \hat{\mu}_{2i}, \ldots, \hat{\mu}_{ni}]^T, \quad a = [a_{10}, \ldots, a_{n0}]^T,$$
$$Y = \Big[\,y_1 - \sum_{j=1}^{n} M_j\hat{\mu}_{j1}, \;\; y_2 - \sum_{j=1}^{n} M_j\hat{\mu}_{j2}, \;\; \ldots, \;\; y_m - \sum_{j=1}^{n} M_j\hat{\mu}_{jm}\Big]^T. \qquad (4)$$
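A compact numerical rendering of the least-squares estimate in (4) is given below; the array shapes are assumptions chosen for illustration.

```python
import numpy as np

def estimate_consequents(mu_hat, y, M):
    """Least-squares estimate of the constants a_j0 in Eq. (4).
    mu_hat: (m, n) normalized activation levels of n rules on m data points,
    y:      (m,)   target outputs,
    M:      (n,)   output centers of the information granules."""
    X = mu_hat                          # design matrix, one column per rule
    Y = y - mu_hat @ M                  # remove the granule offsets M_j
    a, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return a                            # equivalent to a = (X^T X)^{-1} X^T Y
```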
ii) Regression polynomial inference. The consequence part can be expressed by a linear, quadratic, or modified quadratic polynomial equation as shown in Table 1. The use of the regression polynomial inference method gives rise to the expression

$$R^n: \text{If } x_1 \text{ is } A_{n1} \text{ and } \ldots \text{ and } x_k \text{ is } A_{nk} \text{ then } y_n - M_n = f_n\{(x_1 - v_{n1}), (x_2 - v_{n2}), \ldots, (x_k - v_{nk})\}, \qquad (5)$$
where Rn is the n-th fuzzy rule, xl (l=1, 2, …, k) is an input variable, Ajl (j=1, …, n; l=1, …, k) is a membership function of fuzzy sets, vjl (j=1, …, n; l=1, …, k) is the center point related to the new created input variable, Mj (j=1, …, n) is the center point related to the new created output variable, n denotes the number of the rules, fi(⋅) is a regression polynomial function of the input variables as shown in Table 1.
The calculation of the numeric output of the model is carried out in the well-known form

$$\hat{y}_i = \frac{\sum_{j=1}^{n}\mu_{ji}\big\{a_{j0} + a_{j1}(x_{1i}-v_{j1}) + \ldots + a_{jk}(x_{ki}-v_{jk}) + M_j\big\}}{\sum_{j=1}^{n}\mu_{ji}} = \sum_{j=1}^{n}\hat{\mu}_{ji}\big\{a_{j0} + a_{j1}(x_{1i}-v_{j1}) + \ldots + a_{jk}(x_{ki}-v_{jk}) + M_j\big\}, \qquad (6)$$
where i (i = 1, ..., m) indexes the data, $a_{jl}$ (j = 1, ..., n; l = 0, ..., k) is a coefficient of the conclusion part of the fuzzy rule, and $\mu_{ji}$ is the same as in (3). The coefficients of the consequence part of the fuzzy rules are obtained by the least squares method (LSM) as

$$a = (X^T X)^{-1} X^T Y. \qquad (7)$$

[Step 7] Select the nodes (FRPNs) with the highest predictive capability and construct the corresponding layer. To evaluate the performance of the FRPNs (nodes) constructed using the training dataset, the testing dataset is used. Based on this performance index, we calculate the fitness function, which reads as
$$F(\text{fitness function}) = \frac{1}{1 + EPI}, \qquad (8)$$

where EPI denotes the performance index for the testing data (or validation data). In this case, the model is obtained from the training data and EPI is obtained from the testing data (or validation data) of the IG_gFRPNN model constructed with the training data.

[Step 8] Check the termination criterion. The termination condition that controls the growth of the model consists of two components, namely the performance index and the size of the network (expressed in terms of the maximal number of layers). As far as the performance index is concerned (reflecting the numeric accuracy of the layers), termination is straightforward and comes in the form

$$F_1 \le F^*, \qquad (9)$$
where F1 denotes a maximal fitness value occurring at the current layer whereas F* stands for a maximal fitness value that occurred at the previous layer. As far as the depth of the network is concerned, the generation process is stopped at a depth of less than three layers. This size of the network has been experimentally found to achieve a sound compromise between the high accuracy of the resulting model and its complexity as well as generalization abilities. In this study, we use a measure (performance index) of Root Mean Squared Error (RMSE)
$$E\,(PI \text{ or } EPI) = \frac{1}{N}\sum_{p=1}^{N}\big(y_p - \hat{y}_p\big)^2, \qquad (10)$$
where $y_p$ is the p-th target output datum and $\hat{y}_p$ stands for the p-th actual output of the model for this specific data point, N is the number of training (PI) or testing (EPI) input-output data pairs, and E is an overall (global) performance index defined as a sum of the errors over the N pairs.

[Step 9] Determine new input variables for the next layer. If the termination criterion has not been met, the model is expanded. The outputs of the preserved nodes ($z_{1i}, z_{2i}, \ldots, z_{Wi}$) serve as new inputs to the next layer ($x_{1j}, x_{2j}, \ldots, x_{Wj}$), j = i + 1. This is captured by the expression

$$x_{1j} = z_{1i}, \; x_{2j} = z_{2i}, \ldots, x_{Wj} = z_{Wi}. \qquad (11)$$
The IG_gFRPNN algorithm is carried out by repeating steps 4-9.
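A schematic rendering of the layer-growing loop (Steps 7-9) is sketched below. The node objects and their evaluate/output methods are hypothetical placeholders for FRPNs already trained as in Step 6; only the fitness of (8) and the selection of the W best nodes follow the text.

```python
def grow_layer(candidate_nodes, current_inputs, W=30):
    """One IG_gFRPNN layer: score candidate FRPNs, keep the W fittest,
    and pass their outputs forward as the next layer's inputs (Step 9)."""
    scored = []
    for node in candidate_nodes:
        epi = node.evaluate(current_inputs)        # testing-data error of the node
        scored.append((1.0 / (1.0 + epi), node))   # fitness of Eq. (8)
    scored.sort(key=lambda t: t[0], reverse=True)
    best = scored[:W]                              # preserved nodes of this layer
    next_inputs = [node.output(current_inputs) for _, node in best]
    best_fitness = best[0][0]
    return next_inputs, best_fitness               # growth stops when Eq. (9) holds
```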
5 Experimental Studies

We demonstrate how the IG-based gFRPNN can be utilized to predict future values of a chaotic Mackey-Glass time series. This time series is used as a benchmark in fuzzy and neurofuzzy modeling and is generated by the chaotic Mackey-Glass differential delay equation. To come up with a quantitative evaluation of the network, we use the standard RMSE performance index.

Table 3. Computational aspects of the genetic optimization of IG_gFRPNN
                                                              1st layer            2nd to 3rd layer
GAs         Maximum generation                                150                  150
            Total population size                             300                  300
            Selected population size (W)                      30                   30
            Crossover rate                                    0.65                 0.65
            Mutation rate                                     0.1                  0.1
            String length                                     Max*2+1              Max*2+1
            Maximal no. (Max) of inputs to be selected        1 ≤ l ≤ Max(4~5)     1 ≤ l ≤ Max(4~5)
IG_gFRPNN   Polynomial type (Type T) of the consequent
            part of fuzzy rules (#)                           1 ≤ T ≤ 4            1 ≤ T ≤ 4
            Consequent input type to be used for Type T (##)  Type T*              Type T
            Membership Function (MF) type                     Triangular, Gaussian Triangular, Gaussian
            No. of MFs per input                              2 or 3               2 or 3
l, T, Max: integers; # and ##: refer to Tables 1-2 respectively.
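For reference, the Mackey-Glass series used in such benchmarks is commonly generated by integrating dx/dt = a·x(t-τ)/(1+xⁿ(t-τ)) - b·x(t). The sketch below uses the usual a = 0.2, b = 0.1, n = 10 and assumes τ = 17 with simple Euler integration; none of these settings are stated explicitly in this section, so they should be read as illustrative assumptions. The lagged inputs x(t-30), ..., x(t) shown in Fig. 3 can then be extracted from the generated series.

```python
import numpy as np

def mackey_glass(n_samples, tau=17, dt=1.0, x0=1.2, a=0.2, b=0.1, n=10):
    """Euler integration of the Mackey-Glass delay differential equation."""
    history = int(tau / dt)
    x = np.zeros(n_samples + history)
    x[:history] = x0                                  # constant initial history
    for t in range(history, n_samples + history - 1):
        x_tau = x[t - history]                        # delayed state x(t - tau)
        x[t + 1] = x[t] + dt * (a * x_tau / (1.0 + x_tau ** n) - b * x[t])
    return x[history:]

series = mackey_glass(1000)
```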
Fig. 2 depicts the performance index of each layer of the IG_gFRPNN according to the increase of the maximal number of inputs to be selected. In Fig. 3, the left, middle, and right parts within A:(•;•;•)-B:(•;•;•) denote the optimal node numbers at each layer of the network, the polynomial order, and the number of MFs, respectively. Fig. 3 illustrates the detailed optimal topology of the IG_gFRPNN with Gaussian-like MFs for 3 layers when using Max = 5. As shown in Fig. 3, the proposed network enables the architecture to be a structurally more optimized and simplified network
Fig. 2. Performance index of the IG_gFRPNN with respect to the increase of the number of layers: (a) performance in the case of triangular membership functions ((a-1) PI, (a-2) EPI); (b) performance in the case of Gaussian-like membership functions ((b-1) PI, (b-2) EPI)

Fig. 3. Optimal network structure of the GA-based FRPNN (for 3 layers)
Table 4. Comparative analysis of the performance of the network; considered are models reported in the literature Model Wang’s model[10]
PI 0.044 0.013 0.010
PIs
EPIs
NDEI*
Cascaded-correlation NN[14] Backpropagation MLP[14] 6th-order polynomial[14] ANFIS[11] FNN model[12]
0.06 0.02 0.04 0.0016 0.0016 0.0015 0.007 0.014 0.009
Recurrent neural network[15]
0.0138
SuPFuNIS[16] NFI[17] Basic Case 1 (5th Case 2 layer) Type I Modified Case 1 SONN** (5th [13] Case 2 layer) Basic Case 1 Type (5th II Case 2 layer) Max= Triangular MFs IG_gFRPNN 4 Gaussian-like MFs (Our Model) Max= Triangular MFs 5 Gaussian-like MFs
0.016 0.014 0.004 0.0011 0.0011
0.005
0.0027 0.0028
0.011
0.0012 0.0011
0.005
0.0038 0.0038 0.016 0.0003 0.0005 0.0016 0.0002 0.0004 0.0011 8.09e-5 7.46e-5 2.40e-5 2.27e-5
3.77e-4 3.68e-4 6.28e-5 3.69e-5
than the conventional FRPNN. In the nodes (FRPNs) of Fig. 3, 'FRPNn' denotes the n-th FRPN (node) of the corresponding layer; the rectangular numeric values before a node (neuron) give the number of membership functions per input variable; the number on the left side denotes the number of nodes (inputs or FRPNs) coming into the corresponding node; and the number on the right side denotes the polynomial order of the conclusion part of the fuzzy rules used in the corresponding node.
6 Concluding Remarks

In this study, we introduced and investigated a new architecture and comprehensive design methodology of IG_gFRPNNs and discussed their topologies. The proposed IG_gFRPNN is constructed with the aid of the algorithmic framework of information granulation based on C-Means clustering and a symbolic gene type. In the design of the IG_gFRPNN, the characteristics inherent in the entire experimental data used in the construction of the gFRPNN architecture are reflected in the fuzzy rules available within a FRPN. The comprehensive experimental studies involving a well-known dataset quantify a superb performance of the network in comparison to existing fuzzy and neuro-fuzzy models.
Acknowledgements This work was supported by the Korea Research Foundation Grant funded by the Korean Government (MOEHRD) (KRF-2006-311-D00194).
References

1. Nie, J.H., Lee, T.H.: Rule-based Modeling: Fast Construction and Optimal Manipulation. IEEE Trans. Syst., Man, Cybern. 26 (1996) 728-738
2. Ivakhnenko, A.G.: Polynomial Theory of Complex Systems. IEEE Trans. on Systems, Man and Cybernetics SMC-1 (1971) 364-378
3. Oh, S.K., Pedrycz, W.: The Design of Self-organizing Polynomial Neural Networks. Information Science 141 (2002) 237-258
4. Oh, S.K., Pedrycz, W., Park, B.J.: Polynomial Neural Networks Architecture: Analysis and Design. Computers and Electrical Engineering 29 (2003) 703-725
5. Oh, S.K., Pedrycz, W.: Fuzzy Polynomial Neuron-Based Self-Organizing Neural Networks. Int. J. of General Systems 32 (2003) 237-250
6. Zadeh, L.A.: Toward A Theory of Fuzzy Information Granulation and Its Centrality in Human Reasoning and Fuzzy Logic. Fuzzy Sets and Systems 90 (1997) 111-117
7. Bezdek, J.C.: Pattern Recognition with Fuzzy Objective Function Algorithms. Plenum, New York (1981)
8. De Jong, K.A.: Are Genetic Algorithms Function Optimizers? Parallel Problem Solving from Nature 2, Manner, R., Manderick, B. (eds.), North-Holland, Amsterdam (1992)
9. Vachtsevanos, G., Ramani, V., Hwang, T.W.: Prediction of Gas Turbine NOx Emissions Using Polynomial Neural Network. Technical Report, Georgia Institute of Technology, Atlanta (1995)
10. Wang, L.X., Mendel, J.M.: Generating Fuzzy Rules from Numerical Data with Applications. IEEE Trans. Systems, Man, Cybern. 22(6) (1992) 1414-1427
11. Jang, J.S.R.: ANFIS: Adaptive-Network-Based Fuzzy Inference System. IEEE Trans. System, Man, and Cybern. 23(3) (1993) 665-685
12. Maguire, L.P., Roche, B., McGinnity, T.M., McDaid, L.J.: Predicting A Chaotic Time Series Using A Fuzzy Neural Network. Information Sciences 112 (1998) 125-136
13. Oh, S.K., Pedrycz, W., Ahn, T.C.: Self-organizing Neural Networks with Fuzzy Polynomial Neurons. Applied Soft Computing 2 (2002) 1-10
14. Crowder III, R.S.: Predicting the Mackey-Glass Time Series with Cascade-correlation Learning. In: Touretzky, D., Hinton, G., Sejnowski, T. (eds.): Proceedings of the 1990 Connectionist Models Summer School (1990) 117-123
15. Li, C. James, Huang, T.Y.: Automatic Structure and Parameter Training Methods for Modeling of Mechanical Systems by Recurrent Neural Networks. Applied Mathematical Modeling 23 (1999) 933-944
16. Paul, S., Kumar, S.: Subsethood-Product Fuzzy Neural Inference System (SuPFuNIS). IEEE Trans. Neural Networks 13(3) (2002) 578-599
17. Song, Q., Kasabov, N.K.: NFI: Neuro-Fuzzy Inference Method for Transductive Reasoning. IEEE Trans. Fuzzy Systems 13(6) (2005) 799-808
18. Park, B.J., Lee, D.Y., Oh, S.K.: Rule-based Fuzzy Polynomial Neural Networks in Modeling Software Process Data. Int. J. of Control, Automation, and Systems 1(3) (2003) 321-331
Fuzzy Neural Network Classification Design Using Support Vector Machine in Welding Defect Xiao-guang Zhang1,2, Shi-jin Ren 3, Xing-gan Zhang2, and Fan Zhao1 1
College of Mechanical and Electrical Engineering, China University of Mining and Technology, Xuzhou, 221008 2 Department of Electronic Science & Engineering, Nanjing university, Nanjing, 210093, China 3 Computer Science & Technology college, Xuzhou normal university, Xuzhou, 221116
[email protected]
Abstract. To cope with the variability of defect shadows and the complex relations between defect features and classes in welding images, and with the poor generalization of fuzzy neural networks (FNN), a support vector machine (SVM)-based FNN classification algorithm for welding defects is presented. The algorithm first adopts supervisory fuzzy clustering to obtain the rules of the input and output space, and a similarity probability is applied to calculate the importance of the rules. Then the parameters and structure of the FNN are determined through the SVM. Finally, the FNN is trained to classify the welding defects. Simulations on recognizing defects in welding images show the efficiency of the presented algorithm.
1 Introduction

FNN inherits the advantages of neural networks and fuzzy logic: it can make use of expert language, has self-learning ability, and is applied widely in machine learning. Most learning algorithms of FNN adopt BP and FCM clustering to obtain fuzzy rules and membership parameters from training data, but these algorithms cannot minimize both the empirical error and the expected error simultaneously. Besides, the training time is sensitive to the input dimension, and when there are redundant or conflicting rules the precision of the FNN is unsatisfactory. SVM can effectively deal with small samples and obtain the global optimum through quadratic optimization [1], and more and more researchers are paying attention to it. Therefore, we propose a new FNN algorithm in which SVM is used to determine the initial parameters and structure of the FNN. X-ray-based non-destructive inspection is an important method of controlling and inspecting welding quality. However, the recognition and evaluation of X-ray inspection welding images still mainly depend on human operators, which often leads to uncertain results. In the past 30 years, classifiers have been used to recognize defects in the research on defect recognition [2]-[3]; although they can obtain certain results, the correct recognition ratio is very low. Nowadays, neural networks improve the correct recognition ratio for all shapes of welding defects [4]. The main problem is that defects in welding images vary greatly and the relations between defect

D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 216–223, 2007. © Springer-Verlag Berlin Heidelberg 2007
features and classes are complex. Based on the characteristics of defects in welding images, a SVM-based FNN classification algorithm is proposed. Firstly, a supervisory fuzzy cluster is adopted to extract fuzzy rules and weighing algorithm determines the importance of rules making models be able to learn the rules selectively. Then, SVM is applied to determine the structure and initial parameters of FNN. Finally FNN is trained according to weighting cost function. In this paper, chapter 2 introduces the basic structure and realization method of FNN. Chapter 3 introduces the fuzzy cluster algorithm based on supervisory GK and an algorithm is put forward to denote the weight of rule importance. Chapter 4 introduces multi-class classification method of SVM and the training algorithm of FNN based on SVM. The simulated results of recognition of weld defects are discussed in chapter 5. The conclusion is reached in chapter 6.
2 FNN

The FNN adopted in this paper consists of three basic modules (fuzzification, fuzzy inference, and fuzzy judgment). Every feature extracted from the welding defects is modeled, and every fuzzy reference model is a qualitative description of a feature and of a class of welding defects. Suppose there are n rules and m input variables.

Rule j: if $x_1$ is $A_{j1}$ and ... and $x_m$ is $A_{jm}$ then $y$ is $d_j$, $j = 1, 2, \ldots, n$,

where $A_{ji}$ is the fuzzy set of the input variable $x_i$ and $d_j$ is the consequent parameter of $y$. For convenience of analysis, fuzzy rule 0 is added:

Rule 0: if $x_1$ is $A_{01}$ and ... and $x_m$ is $A_{0m}$ then $y$ is $d_0$.

For the m-dimensional input $x = [x_1, \ldots, x_m]$, the premise of the fuzzy model is evaluated with the product operator as the "and" operation:

$$A_j(x) = \prod_{i=1}^{m}\mu_{ji}(x_i), \qquad (1)$$
where $A_j$ is the multi-variable fuzzy set of the j-th rule and $\mu_{ji}$ is the membership function of a single variable. The model output is

$$y = \sum_{j=1}^{n}\hat{\mu}_j\,y_j + d_0, \qquad \hat{\mu}_i = \frac{A_i(x)}{\sum_{i'=1}^{n}A_{i'}(x)} = \frac{\prod_{j=1}^{m}\mu_{ij}(x_j)}{\sum_{i'=1}^{n}A_{i'}(x)}. \qquad (2)$$
X.-g. Zhang et al.
In this way, the equation can be written as follows: n
m
j =1
i =1
y = ∑ d j ∏ μ ij ( xi ) + d 0 .
(3)
Using FNN to train the samples will strengthen the mapping ability of the network, improve its expression ability and possess the simple and practical characteristics, such as self-learning, redundancy, strong classification ability and parallel processing. The neural network introduced fuzzy theory can improve correct recognition ratio under the condition of not adding new information.
3 Fuzzy Cluster with Supervision Now the fuzzy subset and membership function of FNN depend on the manual experience, which is difficult for high input dimension and large samples. How to extract fuzzy rules from sample data automatically is still an open problem. In this paper, we apply supervisory fuzzy cluster algorithm [5] to extract fuzzy rules. The practice proves that the method can make full use of class label information of samples and can cover the input and output space of samples enough. And also it can find important clusters and determine rational cluster number. GK fuzzy cluster algorithm is proved as effective cluster method used to recognize TS fuzzy model. It uses adaptive distance norm to test the clusters with different geometrical shapes [6]. Every cluster presents one rule in the rule database and the cluster is based on minimal object function. M
N
J = ∑∑ ( μ kj ) m d ki2 ,
(4)
i =1 k =1
satisfying the following condition: c
∑μ j =1
kj
= 1,
μ kj ≥ 0,
1≤ k ≤ n,
1 ≤ k ≤ n, I ≤ j ≤ c ,
where, m denotes the weight index of fuzzy cluster, and m > 1 , c is the number of clusters, n is the sample number of cluster space, μkj is the membership of the j sample
xk belonging to the cluster k , dki2 = Aj (x) = (xk − v j )T (Fj )−1(xk − vj ) is inner
xk is from the cluster center v j , F j is the diagonal matrix containing variance, x k ∈ R s , v j ∈ R s and s is the dimension number of input vectors. U = {μ kj } denotes n × c partition matrix and V = {v1 , v 2 , ", vc } denotes s × c matrix of cluster center. For {Z i = ( x i , y i )}i =1, 2 ,", N , the steps of the supervisory fuzzy cluster algorithm norm which denotes the Euclid distance that the sample
are as follows [5]:
FNN Classification Design Using Support Vector Machine in Welding Defect
219
Here repeated times l = 1 , the number of clusters is
M , the contribution ratio (0) threshold of rules, terminated error ε > 0 , random initial fuzzy division matrix U (1) Calculate cluster model N
N
k =1
k =1
v i(l ) = [∑ ( μ ki(l −1) ) m z k ] / ∑ ( μ ki(l −1) ) m .
(5)
(2) Calculate covariance matrix N
N
k =1
k =1
Fi = [∑(μki(l −1) ) m ( zk − vi(l ) )(zk − vi(l ) )T ] /∑(μki(l −1) ) m .
(6)
(3) Calculate the distance to the cluster
dki2 = (zk − vi(l ) )T Di (zk − vi(l ) ) , 1 ≤ i ≤ M , 1 ≤ k ≤ N ,
(7)
where Di = [det(Fi 1 /( n+1) Fi −1 )] . (4) Update division matrix To 1 ≤ i ≤ M , 1 ≤ k ≤ N , if
d ki = 0 ,
M
μ ki(l ) = 1 / ∑ ( d ki / d kj ) 2 /( m −1) ,
(8)
j =1
or else, if
d ki = 0 , μ ki(l ) = 1 .
(5) Run cluster reduction algorithm of orthogonal least square (OLS) [5] and find and save M s important clusters according to principle of maximal error change ratio.
M := M s , U ( l ) = [u i ] , i = 1, " , M s , and regularizing U (l ) . (6) If || U ( l ) − U ( l −1) ||< ε , l = l + 1 , return 1 and go on running. Since sample data contain noises and even isolated points, it is possible that several isolated points will form another cluster and the tight degrees of every cluster are different. So the importance degrees of fuzzy rules extracted from clusters are different. This factor should be considered in modeling, otherwise it will affect the accuracy of the last model. An algorithm denoting the weight importance of rules is put forward in this paper. The weight wi of rule i is: N
nx
k =1
j =1
wi = {(∑ ( μi ,k ) m ) / N }∏ (1/ 2π Fij ) .
(9)
Its meaning is that when the membership function is Gauss function and the rule i exists, the former part is transcendental probability of rule i and the latter part is the
220
X.-g. Zhang et al.
reciprocal of condition membership of rule i . In this way, it is easy to extract rules from clusters and calculate the importance weight of corresponding rules.
4 FNN Training Based on SVM For SVM, the most important problem is to choose appropriate kernel function according to real world. The kernel function is defined as follows: Theorem 1 [7]: To the sample x and μ ( x) : R → [0,1] is a norm function, the function
⎧n ⎪∏μ (x )μ (z ), K(z, x) = ⎨ i =1 j i j i ⎪0 , ⎩
z , if the membership function
x, z are in j − th cluster
(10)
other
is also a norm function and a Mercer kernel function. Suppose the dimension of samples is nx and the number of samples is
n . Use the
method of supervision fuzzy cluster mentioned in chapter 3 to establish initial fuzzy rules. Suppose the samples are divided into m cluster and the sample number of cluster i is ki . The following m clusters can be obtained.
cluster _ 1 = {( x11 , y11 ),", ( x1k1 , y 1k1 )} ,
"
cluster _ m = {( x1m , y1m ),", ( x km1 , y km1 )} , m
∑k i =1
i
=n. ⎛ K1
The corresponding kernel matrix is K = ⎜ ⎜
⎜0 ⎝
⎞ ⎟, % ⎟ K m ⎟⎠
0
K i is the ki × ki kernel matrix. The parameter of kernel function, corresponding to the samples of cluster i , is equivalent to cluster variance. In this way, SVM can be used to learn the parameters of FNN. In the case of 2 clusters, nonlinear classification hyperplane can be obtained through solve the following optimization problem. n
min L(α ) =|| w || 2 +C ∑ ei ξ i ,
(11)
i =1
satisfying the restrict condition: k
yi ( w ⋅ ϕ ( xi ) + d 0 ) ≥ 1 − ξ i , i = 1, " , ∑ k i , i =1
FNN Classification Design Using Support Vector Machine in Welding Defect
C is a constant,
221
ei is the importance weight of sample i . The method is mentioned
in chapter 3 and the weight of the samples belonging to the same cluster is consistent. Its dual problem is n
n
i =1
i , j =1
max L(α ) = ∑ α i − [ ∑ yi y j K ( xi , x j )]/ 2 ,
(12)
satisfying the restrict condition: n
α i ∈ [0, ei C ] , ∑ yiα i = 0 , i =1
K ( xi , x j ) is the fuzzy kernel function. Suppose there are n support vectors. Then n sv
d 0 = [ ∑ a i y i x i' x * (1) + i =1
nsv
∑
i =1
a i y i x i' x * ( − 1) ] / 2 ,
(13)
x * (1) and x * ( −1) belong to class 1 and class 2 respectively. nsv
Since the last equation is y = ∑ ai yi K ( x, xi ) + d 0 , the parameters of FNN can be i =1
regarded as d i = ai y i from (10). The center is the corresponding support vector and the parameter of membership function is the standard difference. The case of one class including many clusters is also done according to the method above. Since SVM can be used to realize 2-class classification, SVM should be rebuilt to adapt the multi-class classification aiming at the multi-class problems. Many scholars researched this problem, this paper adopts “one-against-all” multi-class classifier because it is easy to realize [8]. In this case, classification functions (total N) can be constructed between every class and the others. For example, the jth SVM can classify the jth class samples from the others. In this way, sample labels in the training set should be remodified. If the sample label of the jth class is 1, the others should be -1. In the classification of the training samples, the comparison method is adopted. Enter testing sample x into N two-class classifiers respectively and calculate the discriminated function values of every sub-classifiers. Choose the class corresponding to the maximal discriminated function value as the class of the testing data. Every classifiers of SVM regard the cluster belonging to one class as class 1 and the other as class 2. And the form used to choose parameters and the learning manner are the same with the one mentioned above. In this way, N SVM can be determined. To a new mode x , it should be trained by the N SVM and the decision function is
f ( x) = c, f c ( x) = max f i ( x) . i =1,", N
(14)
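A sketch of the decision rule (14) for the one-against-all scheme is given below; the classifiers are assumed to be callables returning the discriminant value f_i(x) of each trained binary SVM.

```python
import numpy as np

def one_against_all_predict(x, classifiers):
    """Eq. (14): evaluate all N binary SVM discriminant functions f_i(x)
    and return the class whose value is largest, together with the scores."""
    scores = np.array([f(x) for f in classifiers])
    return int(np.argmax(scores)), scores
```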
5 Simulation Experiment

According to [9], weld defects can be generally classified into 6 classes: crack, lack of penetration, lack of fusion, strip-shaped slag inclusion, spherical slag inclusion and
pore 6 classes. Different features, such as defect shape, location, boundary flatness and tip sharpness, are used to recognize and classify defects. In reference [2] and [9], 6 shape feature parameters, such as the ratio of long diameter and short diameter, tip sharpness, boundary flatness, the obliquity with welding direction, centroid coordinate relative to the weld center and symmetry, are chosen as feature parameters. X-ray inspection welding image is processed in the following way, such as preprocessing, segmentation and contour trace (track). Extract defect location parameter, defect perimeter, defect area, long diameter of the defect, short diameter of the defect and the ratio of them described in reference [10]. According to the definition of 6 feature parameters, feature parameters can be obtained using the defect parameters. In this paper, using standard image database of weld defects as experimental samples, choose 184 defect feature of weld image as the total samples (total 184×6 parameters). The feature parameters of practical input samples should be rearranged and regarded as input vectors xi = ( xi1 , xi 2 , " xi 8 )(i = 1, 2, " , 184) , which is composed of 44 pore samples, 40 spherical slag inclusion samples, 25 strip-shaped slag inclusion samples, 25 lack of penetration samples, 25 lack of fusion samples, 25 crack samples. Adopt 6 classes of defect samples (124) as training samples and 60 samples as testing samples. The thresholds of orthogonal least square (OLS) algorithm are ρ = 7% and ε = 0.001 . Through experiments, initial number of initial clusters is 12-6. Although the class number and class label are known, the appropriate number can not be known. The appropriate cluster number can be found after several experiments. Choose the cluster result whose fit errors of training and testing samples are minimal as the last result. The number of the last clusters is 7 and results are shown as follows: Table 1. The Cluster Results and Classification Composition
Defects
Pore
Cluster number 2 Class importance 0.823 Training samples 30 Testing samples 14 Classification 100 precision
Spherical slag inclusion 1 0.965 30 10 100
Strip-shaped slag inclusion
Lack of penetration
Lack of fusion
Crack
1 0.892 16 9
1 0.868 16 9
1 0.926 16 9
1 0.955 16 9
78
78
89
100
Use the “one-against-one” classification methods mentioned above and choose radius basis function as the kernel function. The LOO method is used to determine the punish coefficients of SVM and obtain wonderful classification effects. The detailed numbers of training and testing samples are shown in Table 1 and the recognition classification ratio of defects is 90% in average. The FNN trained by SVM possesses higher learning and testing accuracy. The simulation above has been done in Matlab and SVM is trained by the optimization package in the Matlab toolbox.
FNN Classification Design Using Support Vector Machine in Welding Defect
223
6 Conclusion

An SVM-based FNN classification algorithm for welding defects is proposed to overcome the shortcomings of existing FNN learning algorithms. By weighting the error terms so as to emphasize rules of different importance, the precision and anti-interference capability are improved. Simulation results show that the proposed algorithm can effectively model the complex relations between defect features and classes and has better classification performance for small sample sets.
Acknowledgments. The authors would like to express their appreciation for the financial support of the China Planned Projects for Postdoctoral Research Funds (grant No. 20060390277) and of the Jiangsu Planned Projects for Postdoctoral Research Funds (grant No. 0502010B). The paper is also supported by the Six Major Talent Peak project (grant No. 06-E-052) and by the China University of Mining and Technology Science Research Funds (grant No. 2005B005).
References

1. Vapnik, V.: The Nature of Statistical Learning Theory. Springer Press, New York (1995)
2. Zhou, W., Wang, C.: Research and Application of Automatic Recognition System to Weld Defects. Transactions of the China Welding Institution 13(1) (1992) 45-50
3. Silva, R.R. da, Siqueira, M.H.S., Caloba, L.P., et al.: Radiographic Pattern Recognition of Welding Defects Using Linear Classifiers. Insight 43(10) (2001) 669-674
4. Ren, D., You, Z., Sun, C.: Automatic Analysis System of X-ray Weld Real-time Imaging. Transactions of the China Welding Institution 21(1) (2000) 61-63
5. Setnes, M.: Supervised Fuzzy Clustering for Rule Extraction. IEEE Transactions on Fuzzy Systems 8(4) (2000) 416-424
6. Nauck, D., Kruse, R.: Obtaining Interpretable Fuzzy Classification Rules from Medical Data. Artificial Intelligence 16(2) (1999) 149-169
7. Lin, C.T., Yeh, C.M., Hsu, C.F.: Fuzzy Neural Network Classification Design Using Support Vector Machine. IEEE International Symposium on Circuits and Systems 5 (2004) 724-727
8. Sun, Z.H.: Study on Support Vector Machine and Its Application in Control. Dissertation, Zhejiang University (2003)
9. The National Standard of PRC: The Ray-cameras and Quantity Class of Fusion Welding Joint. GB3323-87 (1987)
10. Zhang, X.G.: The Extraction and Automatic Identification of Weld Defects with X-ray Inspection. National Defence Industry Press, Beijing (2004)
Multi-granular Control of Double Inverted Pendulum Based on Universal Logics Fuzzy Neural Networks* Bin Lu1 and Juan Chen2 1
Department of Computer Science & Technology, North China Electric Power University, 071003 Baoding, China
[email protected] 2 Department of Economic Management, North China Electric Power University, 071003 Baoding, China
[email protected]
Abstract. The control of a double inverted pendulum is one of the most difficult control problems, especially for the parallel-type pendulum, because of the high complexity of the control system. To attain a prescribed accuracy while reducing control complexity, a multi-granular controller for stabilizing a double inverted pendulum system is presented based on universal logics fuzzy neural networks. It is a universal multi-granular fuzzy controller that represents the process of reaching the goal at different levels of information granularity. When the prescribed accuracy is low, a coarse fuzzy controller can be used. As the process moves from the high level to the low level, the prescribed accuracy becomes higher and the information granularity of the fuzzy controller becomes finer. In this controller, a rough plan for reaching the final goal is generated first; the plan is then decomposed into sub-goals which are submitted to the next lower level of the hierarchy, and more refined plans for reaching these sub-goals are determined. If needed, this process of successive refinement continues until the prescribed accuracy is obtained. With the assistance of universal logics fuzzy neural networks, more flexible structures suitable for arbitrary controlled objects can be easily obtained, which greatly improves the performance of the controllers. Finally, simulation results indicate the effectiveness of the proposed controller.
1 Introduction

The double inverted pendulum is a classical and complex nonlinear system, which is often used as a benchmark for verifying the effectiveness of a new control method because of the simplicity of its structure. In general, to control a double inverted pendulum system stably, 6 input items are needed to cover the angular control of the two pendulums and the position control of the cart. The conventional fuzzy inference model, which puts all of the input items into the antecedent part of each fuzzy rule, has difficulty handling fuzzy rules with 6 input items. Even if such a fuzzy rule base is built, it greatly increases the complexity of the control system because of its
The research work is supported by the Ph. D Science Foundation (20041211) and the Postdoctoral Science Foundation (20041101) of North China Electric Power University.
huge size. How to reduce the size of the fuzzy rule base, and thus lower the control complexity, has become one of the main concerns of system designers. Yi, J.Q. [1] constructed a controller based on the single-input-rule-modules (SIRMs) dynamically connected fuzzy inference model, in which each input item is assigned a SIRM and a dynamic importance degree. Tal, C.W. [2] proposed a fuzzy adaptive approach to fuzzy controllers designed with the spatial model to reduce the complexity. Sun, Q. [3] presented a design method for the stabilization of multivariable complex nonlinear systems that can be represented by a fuzzy dynamic model in the decentralized control of large-scale systems, with an optimal fuzzy controller designed by genetic algorithms. Jinwoo, K. [4] employed a multi-resolutional search paradigm to design optimal fuzzy logic controllers in a variable-structure simulation environment, implemented with hierarchical distributed genetic-algorithm search agents solving problems at different degrees of abstraction. Beyond the above, there are many other studies of fuzzy controllers aimed at reducing computational complexity. Although these achievements improve controller performance to some extent, limitations unavoidably remain. In this paper, a multi-granular controller for stabilizing a double inverted pendulum system is presented based on universal logics fuzzy neural networks (ULFNN); it is very effective in reducing controller complexity and can guarantee the prescribed control accuracy for a certain class of uncertain systems. The fuzzy controller uses different levels of information granularity to attain the prescribed accuracy. When the prescribed accuracy is low, a fuzzy controller based on coarser granular information can be used. As the process moves from a high level to a low level, the prescribed accuracy becomes higher and the information granularity supplied to the fuzzy controller becomes finer. If needed, this process of successive refinement continues until the final prescribed accuracy is obtained. At the same time, by incorporating the ULFNN, the controller uses a flexible, open and adaptive family of operators, which covers all logical forms and inference patterns, parameterizes the basic fuzzy inference operators, and makes the integration of the rule premises, the rule activations and the rule outputs flexible. Therefore, the performance of the controller is improved greatly. In the following sections, the analysis and design of the fuzzy controller are discussed. Although a parallel-type inverted pendulum system is taken as the demonstration, the fuzzy controller can also be applied to series-type double inverted pendulum systems and to other control systems.
2 Parallel-Type Double Inverted Pendulum System Within the family of inverted pendulum systems, stabilization control of a parallel-type double inverted pendulum system is more difficult than that of single inverted pendulum systems, series-type double inverted pendulum systems, and so on. To stabilize a parallel-type double inverted pendulum is not only a challenging problem but also a useful way to show the power of a control method. As shown in Fig. 1, the double inverted pendulum system considered here consists of a straight rail, a cart moving on the rail, a longer pendulum 1, a shorter pendulum 2, and a driving unit.
Fig. 1. Double inverted pendulum
Here, the parameters M = 1.0 kg, m1 = 0.3 kg, m2 = 0.1 kg are the masses of the cart, the pendulum 1 and the pendulum 2, respectively. The parameter g = 9.8m/s2 is the gravity acceleration. Suppose the mass of each pendulum is distributed uniformly. Half the length of the longer pendulum 1 is given as l1 = 0.6 m, and half the length of the pendulum 2 is given as l2 = 0.2 m. The position of the cart from the rail origin is denoted as x, and is positive when the cart locates on the right side of the rail origin. The angles of the pendulum 1 and pendulum 2 from their upright positions are denoted separately as α and β, and clockwise direction is positive. The driving force applied horizontally to the cart is denoted as F (N), and right direction is positive. Also, suppose no friction exists in the pendulum system. Then the dynamic equation of such a double inverted pendulum system can be obtained by Lagrange’s equation of motion as
$$\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} \begin{bmatrix} \ddot{x} \\ \ddot{\alpha} \\ \ddot{\beta} \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix}. \quad (1)$$

Where the coefficients are given by

$$\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix} = \begin{bmatrix} M + m_1 + m_2 & m_1 l_1 \cos\alpha & m_2 l_2 \cos\beta \\ m_1 l_1 \cos\alpha & 4 m_1 l_1^2/3 & 0 \\ m_2 l_2 \cos\beta & 0 & 4 m_2 l_2^2/3 \end{bmatrix}, \quad (2)$$

and

$$\begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix} = \begin{bmatrix} F + m_1 l_1 \dot{\alpha}^2 \sin\alpha + m_2 l_2 \dot{\beta}^2 \sin\beta \\ m_1 l_1 g \sin\alpha \\ m_2 l_2 g \sin\beta \end{bmatrix}. \quad (3)$$
3 Multi-granular Fuzzy Control In the controller, starting from the initial state of the overall system, a rough plan is first generated to reach the final goal. The plan is then decomposed into many sub-goals, which are submitted to the next lower level of the hierarchy, and more refined plans to reach these sub-goals are determined. This process of successive refinement continues until the final prescribed accuracy is obtained. The structure of the controller is shown in Fig. 2, in which r denotes the desired output trajectory, e the error, y the actual output and u the control action.
Fig. 2. Block diagram of controller
Since the number of rules increases exponentially as the number of system variables increases, one of the most important aims of the controller is to reduce the size of the rule base. The idea of the controller is based on the behavior of a human operator or on human problem-solving methods: the operator would try to bring the controlled process variable 'roughly' to a desirable situation and then to a precisely desirable one. Thus, in a regulation problem, the controlled variable is first brought within a small deviation band around the set-point using a 'coarse' resolution, and then a finer information resolution is used.
Fig. 3. Switch of granularities
In Fig. 3, at each level of information granularity the goal is to reduce to zero an error defined on a universe of discourse [-εi, εi]. When the zero of the i-th level is reached, the granulation of information becomes finer: the interval on which the membership functions are defined becomes smaller, and the membership functions are then described on the smaller universe of discourse [-εi+1, εi+1]. This process continues until the prescribed accuracy is reached. Thus, the task decomposition is achieved by defining the membership functions on ever-decreasing universes of discourse.
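A minimal sketch of this switching scheme is given below, assuming a user-supplied fuzzy controller and error measurement; the loop structure and names are illustrative, not the authors' implementation.

def multi_granular_control(read_error, apply_action, fuzzy_controller, thresholds):
    # thresholds: decreasing accuracies [eps_1, eps_2, ..., eps_final]
    for i, eps in enumerate(thresholds):
        # work at level i until the error enters the next (finer) universe of discourse
        target = thresholds[i + 1] if i + 1 < len(thresholds) else thresholds[-1]
        e = read_error()
        while abs(e) > target:
            u = fuzzy_controller(e, eps)   # membership functions defined on [-eps, eps]
            apply_action(u)
            e = read_error()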
4 Analysis of ULFNN The ULFNN is a six-layer feed-forward net in which the AND, OR and implication operations are all realized with the Universal Logics operators [5], i.e., the parameterized families of operators including the zero-level universal AND operators (ZUAND), zero-level universal OR operators (ZUOR) and zero-level universal implication operators (ZUIMP). 4.1 Structure of ULFNN
Consider a multiple input-single output ULFNN. The knowledge base of the system is defined by the set of linguistic rules of the type: IF x1 = Ai1 AND … AND xn = Ain THEN y = Bi , i = 1, 2, … , M.
(4)
In the above, Aij are reference antecedent fuzzy sets of the n input variables x1, x2, …, xn and Bi are reference consequent fuzzy sets of the output variable y. Each xi is defined on the universe of discourse Xi, i = 1, …, n, and y is defined on the universe of discourse Y. M denotes the number of rules.
Fig. 4. Structure of ULFNN
An ULFNN computationally identical to this type of reasoning is shown in fig. 4, which is a six-layer feed-forward net in which each node performs a particular function on incoming signals as well as a set of parameters pertaining to this node. Note that the links in an adaptive network only indicate the flow direction of signals between nodes; no weights are associated with the links. 1)Layer 1 (the input layer) There are the r crisp input variables x1 , x2 , …, xr , which are defined on the universes
of discourse Xi respectively, i = 1, … , r. 2)Layer 2 (the fuzzification layer) Compare the input variables with the membership functions on the premise part to obtain the membership values of each linguistic label. The output of the node is the degree to which the given input satisfies the linguistic label associated to this node. Usually, we choose Gauss-shaped membership functions
$$A_i^j(x) = \exp\left[-\frac{(x - c_{ij})^2}{\sigma_{ij}^2}\right] \in F(X_i), \quad i = 1, \ldots, r, \quad (5)$$
to represent the linguistic terms, where {cij, σij} is the parameter set. As the values of these parameters change, the Gauss-shaped functions vary accordingly, thus exhibiting various forms of membership functions for the linguistic labels Aij. In fact, any continuous membership functions, such as trapezoidal or triangular-shaped ones, are also qualified candidates for the node functions in this layer. 3) Layer 3 (the firing strength layer). The membership values on the premise part are usually combined into the firing strength of each rule through a specific t-norm operator, such as Min or Probabilistic; here, however, the firing strength of the associated rule is computed through the parameterized ZUAND operators. The firing strength of the i-th rule is
$$\tau_i = T(x_1, x_2, \ldots, x_n, h_{T_1}) = \left(\max\left(0^{m_{T_1}},\; x_1^{m_{T_1}} + x_2^{m_{T_1}} + \cdots + x_n^{m_{T_1}} - (n-1)\right)\right)^{1/m_{T_1}}. \quad (6)$$
In the above, the real number m is related to the generalized correlation coefficient h by m = (3 − 4h)/(4h(1 − h)), h ∈ [0, 1], m ∈ R. Basic operators, such as Min and Probabilistic, can be derived from the ZUAND operators by specifying this parameter. If the premise part of a rule is connected with the logic connective OR, it can be replaced by the parameterized ZUOR operators, and the firing strength of the i-th rule becomes
$$\tau_i = S(x_1, x_2, \ldots, x_n, h_S) = 1 - \left(\max\left(0^{m_S},\; (1-x_1)^{m_S} + (1-x_2)^{m_S} + \cdots + (1-x_n)^{m_S} - (n-1)\right)\right)^{1/m_S}. \quad (7)$$

Basic operators, such as Max and Strong, can be derived from the ZUOR operators by specifying this parameter. 4) Layer 4 (the implication layer). Each node generates the qualified consequent of its rule, depending on the firing strength, through the parameterized ZUIMP operators. The consequence of the i-th rule is
$$F_i(y) = I(\tau_i, B_i(y), h_I) = \left(\min\left(1 + 0^{m_I},\; 1 - \tau_i^{m_I} + B_i^{m_I}(y)\right)\right)^{1/m_I}. \quad (8)$$
In the above, τ i is the firing strength of the i-th rule, and Fi ( y ) is a fuzzy set of the output of the i -th rule. The most often used fuzzy implication operators, such as Goguen, Lukasiewicz and so on, can be derived from the ZUIMP operators by specifying its parameter. 5)Layer 5 (the aggregation layer) Aggregate the qualified consequences to produce a fuzzy output. Replacing the logical connective ALSO with ZUAND operators because Bi ⊂ Fi according to the property of ZUIMP operators, the overall fuzzy output of the output variable y is
$$F(y) = T(F_1(y), F_2(y), \ldots, F_M(y), h_{T_2}) = \left(\max\left(0^{m_{T_2}},\; F_1^{m_{T_2}}(y) + F_2^{m_{T_2}}(y) + \cdots + F_M^{m_{T_2}}(y) - (M-1)\right)\right)^{1/m_{T_2}}. \quad (9)$$
6)Layer 6 (the defuzzification layer) A crisp output can be obtained with the defuzzification method, and usually we use the COA method to do it. That is
$$y^* = \frac{\int_Y y\,F(y)\,dy}{\int_Y F(y)\,dy}. \quad (10)$$
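For illustration, the zero-level universal operators of equations (6)-(8) can be coded as below. This is only a sketch: membership values are assumed to lie strictly between 0 and 1, the boundary cases h = 0, 0.75, 1 are not handled, and the term 1 + 0^m of equation (8) is simplified to 1.

def m_of_h(h):
    # generalized correlation coefficient h -> exponent m (see the relation above)
    return (3.0 - 4.0*h) / (4.0*h*(1.0 - h))

def zuand(xs, h):
    # zero-level universal AND, equation (6)
    m = m_of_h(h)
    s = sum(x**m for x in xs) - (len(xs) - 1)
    return s**(1.0/m) if s > 0 else 0.0

def zuor(xs, h):
    # zero-level universal OR, equation (7)
    m = m_of_h(h)
    s = sum((1.0 - x)**m for x in xs) - (len(xs) - 1)
    return 1.0 - (s**(1.0/m) if s > 0 else 0.0)

def zuimp(tau, b, h):
    # zero-level universal implication, equation (8), boundary term simplified
    m = m_of_h(h)
    return min(1.0, 1.0 - tau**m + b**m)**(1.0/m)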
4.2 Learning Algorithms
Having discussed the structure of the ULFNN, we now investigate how to determine a concrete controller. It is well known that neural networks possess a strong learning capability, which can be introduced into the fuzzy system so that the parameters of the universal logics fuzzy neural network controller are determined through training and meet the needs of different controlled objects. In order to balance the convergence and the speed of the learning process, a BP algorithm with an adaptive learning rate is given. Several important formulas are proved first. Formula 1:
$$\frac{\partial T(x_1, x_2, \ldots, x_n, h)}{\partial x_j} = A_T^{\frac{1}{m}-1} x_j^{m-1}.$$

In the above, $A_T = x_1^m + x_2^m + \cdots + x_n^m - (n-1)$.

Proof: Let $A_T = x_1^m + x_2^m + \cdots + x_n^m - (n-1)$; from ref. [5], we have

$$\frac{\partial T(x_1, x_2, \ldots, x_n, h)}{\partial x_j} = \frac{\partial \left(x_1^m + x_2^m + \cdots + x_n^m - (n-1)\right)^{1/m}}{\partial x_j} = \frac{1}{m}\left(x_1^m + x_2^m + \cdots + x_n^m - (n-1)\right)^{\frac{1}{m}-1} m\, x_j^{m-1} = A_T^{\frac{1}{m}-1} x_j^{m-1}.$$
Formula 2:

$$\frac{\partial T(x_1, x_2, \ldots, x_n, h)}{\partial h} = \begin{cases} A_T^{\frac{1}{m}-1}\left(A_T \ln A_T - m B_T\right)C, & h \in (0.75, 1),\ \text{or } h \in (0, 0.75) \text{ and } A_T > 0 \\ 0, & h \in (0, 0.75) \text{ and } A_T \le 0 \\ \text{does not exist}, & h = 0, 0.75, 1 \end{cases}$$

In the above, $B_T = x_1^m \ln x_1 + x_2^m \ln x_2 + \cdots + x_n^m \ln x_n$ and $C = 1 + \dfrac{3}{(4h-3)^2}$.
Proof: From ref. [5], when $h \in (0, 0.75)$ and $A_T \le 0$, or $h = 0, 0.75, 1$, the formula holds obviously. It remains to prove the case $h \in (0.75, 1)$, or $h \in (0, 0.75)$ and $A_T > 0$. Let $B_T = x_1^m \ln x_1 + x_2^m \ln x_2 + \cdots + x_n^m \ln x_n$; we have

$$\frac{\partial T(x_1, \ldots, x_n, h)}{\partial h} = \frac{\partial \left(x_1^m + \cdots + x_n^m - (n-1)\right)^{1/m}}{\partial h} = \frac{\partial \left(x_1^m + \cdots + x_n^m - (n-1)\right)^{1/m}}{\partial m} \cdot \frac{\partial m}{\partial h}$$
$$= \frac{1}{m^2}\left(x_1^m + \cdots + x_n^m - (n-1)\right)^{\frac{1}{m}-1}\Bigl(m\bigl(x_1^m \ln x_1 + \cdots + x_n^m \ln x_n\bigr) - \bigl(x_1^m + \cdots + x_n^m - (n-1)\bigr)\ln\bigl(x_1^m + \cdots + x_n^m - (n-1)\bigr)\Bigr)\frac{\partial m}{\partial h}$$
$$= \frac{1}{m^2} A_T^{\frac{1}{m}-1}\left(m B_T - A_T \ln A_T\right)(-C) = A_T^{\frac{1}{m}-1}\left(A_T \ln A_T - m B_T\right)C.$$

Formula 3:

$$\frac{\partial I(x_1, x_2, h)}{\partial x_j} = (-1)^j A_I^{\frac{1}{m}-1} x_j^{m-1}, \quad \text{where } A_I = 1 - x_1^m + x_2^m,\; j = 1, 2.$$

Proof: Let $A_I = 1 - x_1^m + x_2^m$; from ref. [5], we have

$$\frac{\partial I(x_1, x_2, h)}{\partial x_j} = \frac{\partial \left(1 - x_1^m + x_2^m\right)^{1/m}}{\partial x_j} = \frac{1}{m}\left(1 - x_1^m + x_2^m\right)^{\frac{1}{m}-1}(-1)^j m\, x_j^{m-1} = (-1)^j A_I^{\frac{1}{m}-1} x_j^{m-1}.$$

Formula 4:

$$\frac{\partial I(x_1, x_2, h)}{\partial h} = \begin{cases} A_I^{\frac{1}{m}-1}\left(A_I \ln A_I - m B_I\right)C, & h \in (0.75, 1),\ \text{or } h \in (0, 0.75) \text{ and } A_I < 1 \\ 0, & h \in (0, 0.75) \text{ and } A_I \ge 1 \\ \text{does not exist}, & h = 0, 0.75, 1 \end{cases}$$

In the above, $B_I = -x_1^m \ln x_1 + x_2^m \ln x_2$.

Proof: From ref. [5], when $h \in (0, 0.75)$ and $A_I \ge 1$, or $h = 0, 0.75, 1$, the formula holds obviously. It remains to prove the case $h \in (0.75, 1)$, or $h \in (0, 0.75)$ and $A_I < 1$. Let $B_I = -x_1^m \ln x_1 + x_2^m \ln x_2$; we have

$$\frac{\partial I(x_1, x_2, h)}{\partial h} = \frac{\partial \left(1 - x_1^m + x_2^m\right)^{1/m}}{\partial h} = \frac{\partial \left(1 - x_1^m + x_2^m\right)^{1/m}}{\partial m} \cdot \frac{\partial m}{\partial h}$$
$$= \frac{1}{m^2}\left(1 - x_1^m + x_2^m\right)^{\frac{1}{m}-1}\Bigl(m\bigl(-x_1^m \ln x_1 + x_2^m \ln x_2\bigr) - \bigl(1 - x_1^m + x_2^m\bigr)\ln\bigl(1 - x_1^m + x_2^m\bigr)\Bigr)\frac{\partial m}{\partial h}$$
$$= A_I^{\frac{1}{m}-1}\left(-A_I \ln A_I + m B_I\right)(-C) = A_I^{\frac{1}{m}-1}\left(A_I \ln A_I - m B_I\right)C.$$
The corresponding formulas for the ZUOR operators are omitted because of their similarity to those of the ZUAND operators. Let η be the learning rate of the adjustable parameters. The following two strategies should be adopted to adjust the learning rate during training:
- If the error measure undergoes 4 consecutive reductions, increase η.
- If the error measure undergoes 2 consecutive combinations of 1 increase and 1 reduction, decrease η.
In order to increase the convergence speed of BP algorithm, the initial values of the adjustable parameters should be set to about 0.5.
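These two heuristics can be implemented, for example, as a small helper that inspects the recent history of the error measure; the multiplicative factors below are illustrative choices, not values from the paper.

def adapt_learning_rate(eta, errors, up=1.1, down=0.9):
    # errors: error measure of the last epochs, most recent last
    if len(errors) < 5:
        return eta
    d = [errors[k + 1] - errors[k] for k in range(len(errors) - 5, len(errors) - 1)]
    if all(x < 0 for x in d):                    # 4 consecutive reductions
        return eta * up
    if d[0] > 0 > d[1] and d[2] > 0 > d[3]:      # 2 consecutive (increase, reduction) pairs
        return eta * down
    return eta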
5 Control Simulations The fuzzy controller takes the angle and angular velocity of the pendulum 1 and pendulum 2, the position and velocity of the cart as the input items, and takes the driving force F as the output item. Without losing generality, the rail origin is selected as the desired position of the cart. Then, the stabilization control of the double inverted pendulum system is to balance the two pendulums upright and move the cart to the rail origin in short time. If the six input items all converge to zero, then the stabilization control is apparently achieved. The membership functions of each variable are defined as Gauss-shaped. In each level of information granularity, the total number of fuzzy rules is reduced significantly in the controller. To verify the effectiveness of the proposed controller, many different simulations are done.
Fig. 5. Simulation results for initial angles 5° and 0° in (a), and both 5° in (b)
Fig. 5 shows the control results when the initial angles of the two pendulums are set to 5° and 0° in (a) and to 5° and 5° in (b), while the initial values of the other state variables are all set to zero; the sampling period is 0.01 s. In these figures, curves ①, ② and ③ denote the position of the cart, the angle of pendulum 1 and the angle of pendulum 2, respectively. As the control results show, the stabilization time of both simulations is about 6 s. Further experiments and simulation results are not listed here because of space limitations; however, all the simulation results show that the fuzzy controller can stabilize the parallel-type double inverted pendulum system in a relatively short time for a wide range of initial angles of the two pendulums. Since the conventional fuzzy inference model has difficulty in setting up all the fuzzy rules for six input items and in changing the control structure, the method proposed in this paper offers clear advantages in the stabilization control of the double inverted pendulum system.
6 Conclusions and Future Work In summary, the proposed controller is a universal fuzzy neural network controller for solving control problems, not only for the parallel-type double inverted pendulum system. It has a simple and intuitively understandable structure, and it can attain the prescribed accuracy for a certain class of uncertain systems while reducing control complexity. Moreover, with the assistance of the ULFNN, more flexible structures suitable for various controlled objects can be obtained easily, which greatly improves controller performance. In the future, our work will mainly focus on further improving the efficiency of fuzzy controllers.
References 1. Yi, J.Q., Naoyoshi, Y., Kaoru, H.: A New Fuzzy Controller for Stabilization of Parallel-type Double Inverted Pendulum System. Fuzzy Sets and Systems 126 (2002) 105-119 2. Tal, C.W., Taur, J.S.: Fuzzy Adaptive Approach to Fuzzy Controllers with Spacial Model. Fuzzy Sets and Systems 125 (2002) 61-77 3. Sun, Q., Li, R.H., Zhang, P.A.: Stable and Optimal Adaptive Fuzzy Control of Complex Systems Using Fuzzy Dynamic Model. Fuzzy Sets and Systems 133 (2003) 1-17 4. Jinwoo, K., Zeigler, B.P.: Designing Fuzzy Logic Controllers Using A Multiresolutional Search Paradigm. IEEE Trans. Fuzzy Systems 4(3) (1996) 213-226 5. He, H.C.: Principle of Universal Logics. Science Press. Beijing (2006)
The Research of Decision Information Fusion Algorithm Based on the Fuzzy Neural Networks Pei-Gang Sun1,2 , Hai Zhao1 , Xiao-Dan Zhang3 , Jiu-Qiang Xu1 , Zhen-Yu Yin1 , Xi-Yuan Zhang1 , and Si-Yuan Zhu1 1
School of Information Science & Engineering, Northeastern University, Shenyang 110004, P.R. China
2 Shenyang Artillery Academy, Shenyang 110162, P.R. China
3 Shenyang Institute of Aeronautical Engineering, Shenyang 110034, P.R. China
{sunpg,zhhai,xujq,cmy,zhangxy,zhusy}@neuera.com,
[email protected] http://www.netology.cn
Abstract. A new decision information fusion algorithm based on fuzzy neural networks is proposed; it introduces fuzzy comprehensive assessment into traditional decision information fusion technology under the "soft" decision architecture. The fusion process consists of a comprehensive operation and a global decision: the local decisions of multiple sensors are fused at the fusion center to obtain the global decision about the object of concern. In practice, the algorithm has been successfully applied in the temperature fault detection and diagnosis system of the hydroelectric simulation system of Jilin Fengman. In the analysis of field data, the algorithm outperforms the traditional diagnosis method.
1 Introduction
Information fusion is an information processing technology for combining data obtained from multiple sources, such as sensors, databases, knowledge bases and so on. It aims at obtaining a coherent explanation and description of the object and environment of concern by making the most of the multi-sensor resources and combining the redundant and complementary information obtained by the sensors, through the rational employment of each sensor and its data. Information fusion is a comprehensive, multi-angle and multi-layer analysis process applied to the object of concern [1], [2]. Information fusion can be classified into three levels according to the abstraction level of the data: pixel-level fusion, feature-level fusion and decision-level fusion [3]. Decision fusion is a high-level fusion process, and its result is often utilized as the basis for the system decision. Because decision-level fusion often concerns all kinds of factors besides the data obtained by the sensors, and because the evidence in the decision fusion process is often uncertain, it is very difficult to construct an accurate, highly reliable model for a given problem. But in practical applications, decision-level
fusion can bring special benefits, such as high robustness and the ability to process different classes of information, so it has attracted the attention of scientists and engineers and has become an important subject in the study of information fusion theory and applications [4], [5]. In this paper, a new decision-level fusion algorithm is investigated that considers the fuzzy nature of decision-level fusion and adopts the "soft" decision architecture of information fusion. The algorithm introduces fuzzy comprehensive assessment into the decision assessment during the fusion process. In practice, the algorithm has been successfully applied in the temperature fault detection and diagnosis system of the hydroelectric simulation system of Jilin Fengman [6], [7], [8]. In the analysis of field data, the algorithm outperforms the traditional diagnosis method.
2 Model of Fuzzy Comprehensive Assessment
Comprehensive assessment is one of the important methods and tools in decision making and analysis. Fuzzy comprehensive assessment applies fuzzy set theory to the comprehensive assessment of objects and phenomena that are influenced by multiple factors [9]. The method has been successfully applied to industrial processes, product evaluation, quality supervision and so on [10]. In fuzzy comprehensive assessment, (U, V, R) denotes the assessment model. The factor set U consists of all elements related to the assessment and can be represented as U = (u1, u2, · · · , um). In general, every factor has its own weight ai. The weight set A is a fuzzy set represented by a fuzzy vector, A = (a1, a2, · · · , am), where ai is the value of the membership function of the factor ui with respect to A; that is, it represents the importance of each factor in the comprehensive assessment. In general, it satisfies $\sum_{i=1}^{m} a_i = 1$, ai > 0, (i = 1, 2, 3, · · · , m).
The set V is the assessment set, which consists of the assessment grades of the object. It can be represented as V = (v1, v2, · · · , vn), where vi is an assessment grade. The matrix R = (rij)m×n is a fuzzy mapping from U to V, where rij expresses the possibility degree of the j-th assessment when considering the i-th factor, i.e., the membership degree from ui to vj. In the process of fuzzy comprehensive assessment, let A = (a1, a2, · · · , am) be the fuzzy set on the factor set U, in which ai is the weight of ui, and let B = (b1, b2, · · · , bn) be the fuzzy set on the assessment set V; the comprehensive assessment can then be represented as follows: B = A ◦ R = (b1, b2, · · · , bn)
(1)
in formula (1) the operator ◦ is often defined as the assessment arithmetic operator (∧∗, ∨∗), so formula (1) can be written as: ∀ bi ∈ B, bi = (a1 ∧ ∗ r1i ) ∨ ∗ (a2 ∧ ∗ r2i ) ∨ ∗ · · · ∨ ∗ (am ∧ ∗ rmi )
(2)
In general, the assessment arithmetic operator can be defined as common matrix operation (“multiplication” and “addition”) or Zadeh fuzzy operation (“and” and “or”) and so on, according to the practical applications. Following the comprehensive process, the synthetic evaluation of (b1 , b2 , · · · , bn ) is a defuzzifier process of making a fuzzy quantity to a precise quantity, the method, such as max membership principle[11], centroid method[12], weighted average method etc, can be adopted. In general, max-membership principle is also known as the height method, which is limited to peaked output. The centroid method is also called the center area or center of gravity; it is the most prevalent and physically appealing of all the defuzzifier methods. Weighted average method is only valid for symmetrical output membership functions, but is simple and convenient[13]. In practical application the exact method of synthetic evaluation usually depends upon the application.
3 The Decision Information Fusion Algorithm Based on the Fuzzy Neural Networks
3.1 The Architecture of the "Soft" Decision Information Fusion
The objects of decision information fusion are usually the local decisions of the sensors; that is, the process of decision information fusion is that of reaching a global decision on the basis of the local decisions of multiple sensors. The method or architecture of decision information fusion is usually classified as either "hard" decision or "soft" decision, according to the form of the local decision of the sensor. In the "hard" decision, the local decision of the sensor is usually a binary hypothesis test whose result is either zero or one, according to a threshold level, so the local decision sent directly to the fusion center is either zero or one. In the "soft" decision, the whole decision region of the sensor is usually divided into multiple regions, and the result of the sensor includes not only the decision region but also the possibility value of belonging to that region; the information sent to the fusion center is therefore the possibility of each hypothesis. In the "hard" decision process, the sensor cannot provide any information below or above the threshold level, so such information is lost during fusion at the fusion center. Compared with the "hard" decision, the "soft" decision provides both the decision region and its possibility, and at the fusion center both can be utilized in the fusion process. The architecture of the "soft" decision process under fuzzy comprehensive assessment is shown in Fig. 1.
3.2 The Description of the Algorithm
As shown in Fig. 1, the decision-level information fusion algorithm based on fuzzy neural networks adopts the "soft" decision architecture. In the algorithm,
Fig. 1. The architecture of the "soft" decision fusion under the fuzzy comprehensive assessment
we consider an information fusion system consisting of m sensors that observe the same phenomenon. Each sensor makes its local decision based on its observation; the local decision, which includes the decision region and its possibility value, is sent to the fusion center, where the global decision based on the local decisions of the m sensors is obtained. Let S denote the sensor set, S = (s1, s2, · · · , sm), and let the result of the fusion center be classified into n regions, called the assessment set Y, Y = (y1, y2, · · · , yn). In the "soft" decision process of each sensor, the result of the sensor is a possibility value over the assessment set Y; for the i-th sensor, the local decision can be described by the vector ri = (ri1, ri2, · · · , rin), and after normalization the input of the fusion center for the i-th sensor is the vector r̄i = (r̄i1, r̄i2, · · · , r̄in). For every si ∈ S, the vectors r̄i form the m × n matrix R, called the fusion matrix of the fusion center, which can be described as follows.
$$R = \begin{pmatrix} \bar{r}_{11} & \bar{r}_{12} & \cdots & \bar{r}_{1n} \\ \bar{r}_{21} & \bar{r}_{22} & \cdots & \bar{r}_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ \bar{r}_{m1} & \bar{r}_{m2} & \cdots & \bar{r}_{mn} \end{pmatrix} \quad (3)$$
Since the influence of each sensor in the fusion system is different, let A denote the sensor weight vector; it is a fuzzy set on the sensor set S, described as the normalized fuzzy vector A = (a1, a2, · · · , am) with ai = μ(si), i = 1, 2, · · · , m. In the comprehensive operation of the algorithm, the comprehensive result of the sensor weight vector and the fusion matrix is a fuzzy set on the assessment set. The result can be described as follows, B = A ◦ R = (b1, b2, · · · , bn)
(4)
For the comprehensive operator, the algorithm adopts the operator (∧∗, ∨∗) of fuzzy comprehensive assessment. In the global decision at the fusion center, the input is the vector (b1, b2, · · · , bn) resulting from the comprehensive operation; in this research, the max membership principle is adopted, that is, if ∃ i ∈ (1, 2, · · · , n) such that bi = max(b1, b2, · · · , bn), then the result of the global decision of the fusion center is bi.
4 Experiment Analysis
In the Hydroelectric Simulation System of Jilin Fengman, the generator system is an important component whose working condition greatly influences the stability of the whole system, so fault detection and diagnosis are necessary for the generator. At present, the detection systems in the hydroelectric system usually adopt the sensor-threshold method: each primary parameter of the equipment is supervised by a sensor, and the data are sent to the detection center, where a threshold level for the parameter is set in advance; when the gathered data exceed the threshold level, the corresponding fault alarm is triggered. The sensitivity of the whole detection system therefore depends on the threshold level. In practice, however, the threshold level is set manually: if it is too high, alarms may fail to be reported; if it is too low, the system may raise alarms even when the equipment is in order. To overcome this disadvantage of the traditional detection and diagnosis system, information fusion technology can be applied to fault detection and diagnosis. In a practical diagnosis system, multiple sensors are embedded in the equipment and gather the current data of its environment. At the fusion center, the redundant and complementary data are fully exploited, so a precise estimate of the equipment status can be achieved, the belief in the diagnosis can be enhanced, and the fuzziness of the status is decreased. The application of information fusion thus improves detection performance by making full use of the resources of multiple sensors [14], [15]. In the simulation system, we have applied the new decision information fusion algorithm to the temperature fault detection and diagnosis of the generator. In this diagnostic system, three embedded temperature sensors are installed in the generator, and the temperature of the equipment is gathered periodically [16], [17]. The sensor set can be defined as S = (s1, s2, s3). It has been found in the practical operation of the system that the causes of a temperature alarm of the generator can be classified as faults of the circulating-water equipment, faults of the cooling-water equipment, operator error, and so on. The assessment set of the temperature fault diagnosis system can therefore be defined as Y = (y1, y2, y3, y4, y5) = {circulation water valve shut down by error, low pressure of circulation, cooling water valve shut down by error, cooling water pump loses pressure and backup pump not switched in, other undefined reason}.
The influence of the three sensors in the diagnosis system differs because of their positions, precision and so on, so in the practical application the weight vector is allocated according to experience. That is, A = (a1, a2, a3) = (0.4400, 0.2300, 0.3300)
(5)
The three embedded sensors gather the data and make their local decisions; each local decision, which is the possibility value of each fault, is sent to the fusion center. The diagnosis process at the fusion center is as follows: first, the local decisions of the sensors are normalized, and the normalized results of all sensors constitute the fusion matrix; second, the comprehensive operation is performed between the sensor weight vector and the decision matrix; finally, the global decision about the fault is made from the result of the comprehensive operation under the max membership principle. For example, in one diagnosis the normalized local decisions of the sensors are as given in Table 1. Table 1. Experiment data of the diagnosis system
     O1      O2      O3      O4      O5
S1   0.3750  0.2200  0.0000  0.4050  0.0000
S2   0.2712  0.4386  0.0000  0.2902  0.0000
S3   0.1450  0.3338  0.0000  0.5212  0.0000
In this research, the comprehensive operation of the fusion center adopts the fuzzy set conjunction and disjunction operations, i.e., the max-min operator, so the result of the comprehensive operation is B = (0.3750, 0.3300, 0.0000, 0.4050, 0.0000). After normalization, B = (0.3378, 0.2973, 0.0000, 0.3649, 0.0000) is obtained. In the global decision, according to the max membership principle, the fault is diagnosed as "cooling water pump loses pressure and backup pump not switched in".
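The numbers above can be reproduced with a few lines of code; the following standalone sketch applies the max-min composition to the weight vector (5) and the data of Table 1 (the variable names are illustrative).

A = [0.44, 0.23, 0.33]                      # sensor weight vector, equation (5)
R = [                                       # normalized local decisions from Table 1
    [0.3750, 0.2200, 0.0000, 0.4050, 0.0000],   # s1
    [0.2712, 0.4386, 0.0000, 0.2902, 0.0000],   # s2
    [0.1450, 0.3338, 0.0000, 0.5212, 0.0000],   # s3
]

# max-min composition B = A o R
B = [max(min(a, row[j]) for a, row in zip(A, R)) for j in range(len(R[0]))]
# B == [0.375, 0.33, 0.0, 0.405, 0.0]

total = sum(B)
B_norm = [b / total for b in B]             # ~ [0.3378, 0.2973, 0.0, 0.3649, 0.0]

# max membership principle: index of the largest component
decision = max(range(len(B_norm)), key=B_norm.__getitem__)
print("global decision: y%d" % (decision + 1))   # prints y4 for this example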
5 Conclusion
In this paper, a new decision information fusion algorithm based on fuzzy neural networks is proposed. The fusion process consists of a comprehensive operation and a global decision: the local decisions of multiple sensors are fused at the fusion center to obtain the global decision about the object of concern. In practice, the algorithm has been successfully applied in the temperature fault detection and diagnosis system of the Hydroelectric Simulation System, and it outperforms the traditional diagnosis method.
Acknowledgments. The authors acknowledge the support of the Natural Science Foundation of P.R. China (NSF No. 69873007) and the National High-Tech Research and Development Plan of P.R. China (NHRD No. 2001AA415320) for this project, and the cooperation of the Fengman hydropower plant of Jilin province of China in developing and running this system.
References 1. Liu, T.M., Xia, Z.X., Xie, H.C.: Data Fusion Techniques and its Applications. National Defense Industry Press, 1999. 2. Hopfield, J.J., Tank D.W.: Neural Computation of Decisions in Optimization Problems. Biological Cybernetics 52 (1985) 141-152. 3. Carvalho, H.S., Heinzelman, W.B., Murphy, A.L., Coelho, C.J.N.: A General Data Fusion Architecture. Proceedings of Information Fusion 2003 2 (2003) 1465-1472. 4. Yu, N.H., Yin Y.: Multiple Level Parallel Decision Fusion Model with Distributed Sensors Based on Dempster-Shafer Evidence Theory. Proceedings of 2003 International Conference on Machine Learning and Cybernetics 5 (2003) 3104-3108. 5. Wang, X., Foliente, G., Su, Z., Ye, L.: Multilevel Decision Fusion in a Distributed Active Sensor Network for Structural Damage Detection. Structural Health Monitoring, 5(1) (2006) 45-58. 6. Zhang X.D., Zhao H., Wang G., Wei S.Z.: Fusion Algorithm for Uncertain Information by Fuzzy Decision Tree. Journal of Northeastern University(Natural Science) 25(7) (2004) 657-660. 7. Wang G., Zhang D.G., Zhao H.: Speed Governor Model Based on Fuzzy Information Fusion. Journal of Northeast University (Natural Science) 23(6) (2002) 519-522. 8. Zhang D.G., Zhao H.: General Hydropower Simulation System Based on Information Fusion. Journal of System Simulation 14(10) (2002) 1344-1347. 9. Hall D.: Mathematical Techniques in Multisensor Data Fusion. Artech House Press, London (1992) 235-238. 10. Waltz E.L.: Multisensor Data Fusion. Artech House Press, Norwood (1991) 101-105. 11. Wei S.Z., Zhao H., Wang G., Liu H.: Distributed Fusion Algorithms in Embedded Network On-line Fusion System. Proceedings of Information Fusion’2004, Stockholm, Sweden (2004) 622-628. 12. Hou Z.Q., Han C.Z., Zheng L.: A Fast Visual Tracking Algorithm Based on Circle Pixels Matching. Proceedings of Information Fusion’2003, 1 (2003) 291-295. 13. Yager, R.R.: The Ordered Weighted Averaging Operators: Theory and Applications. Kluwer Academic Publishers, (1997) 10-100. 14. Jlinals J.: Assessing the Performance of Multisensor Fusion System. Proceedings of the International Society for Optical Engineering 1661 (1992) 2-27. 15. Kai F.G.: Conflict Resolution using Strengthening and Weakening Operations in Decision Fusion. Proceedings of The 4th International Conference on Information Fusion 1 (2001) 19-25. 16. Satoshi M.: Theoretical Limitations of a Hopfield Network for Crossbar Switching. IEEE Transactions on Neural Networks 12(3) (2001) 456-462. 17. Wang G., Zhang D.G., Zhao H.: Speed Governor Model Based on Fuzzy Information Fusion. Journal of Northeastern University(Natural Science) 23(6) (2002) 519-522.
Equalization of Channel Distortion Using Nonlinear Neuro-Fuzzy Network Rahib H. Abiyev1, Fakhreddin Mamedov2, and Tayseer Al-shanableh2 1
Near East University, Department of Computer Engineering, Lefkosa, North Cyprus
[email protected] 2 Near East University, Department of Electrical and Electronic Engineering, Lefkosa, North Cyprus
Abstract. This paper presents the equalization of channel distortion by using a Nonlinear Neuro-Fuzzy Network (NNFN). The NNFN is constructed on the basis of fuzzy rules that incorporate nonlinear functions, and its learning algorithm is presented. The NNFN is applied to the equalization of the distortion of time-invariant and time-varying channels. The developed equalizer recovers the transmitted signal efficiently. The performance of the NNFN-based equalizer is compared with that of other nonlinear equalizers, and the effectiveness of the proposed system is evaluated through simulation results.
1 Introduction In digital communications, channels are affected by both linear and nonlinear distortion, such as intersymbol interference and channel noise. Various equalizers have been applied to equalize these distortions and recover the original transmitted signal [1,2]. Linear equalizers cannot reconstruct the transmitted signal when channels have significant nonlinear distortion [3]. Since nonlinear distortion is often encountered on time-variant channels, linear equalizers do not perform well on such channels. When a channel has time-varying characteristics and the channel model is not precisely known, adaptive equalization is applied [4]. Nowadays neural networks are widely used for the equalization of nonlinear channel distortion [5-12]. One class of adaptive equalizers is based on multilayer perceptrons (MLP) and radial basis functions (RBF) [5-10]. MLP equalizers require a long training time and are sensitive to the initial choice of network parameters [5,8,9]. RBF equalizers are simple and require less time for training, but usually require a large number of centers, which increases the computational complexity [6,7,10]. An application of neural networks for adaptive equalization of nonlinear channels is given in [11], where the equalization of communication systems is simulated using the 16-QAM (quadrature amplitude modulation) scheme. In [12] a neural decision-feedback equalizer is developed using an adaptive filter algorithm and applied to the equalization of nonlinear communication channels. One of the effective ways to develop adaptive equalizers for nonlinear channels is the use of fuzzy technology. This type of adaptive equalizer can process
numerical data and linguistic information in natural form [13,14,15]. Human experts determine fuzzy IF-THEN rules using input-output data pairs of the channel, and these rules are used to construct the filter for the nonlinear channel. In these systems the incorporation of linguistic and numerical information improves the adaptation speed and the bit error rate (BER) [13]. Fuzzy logic has been used to implement a Bayesian equalizer that eliminates co-channel interference [16,17]. A TSK-based decision-feedback fuzzy equalizer has been developed with an evolutionary algorithm and applied to a QAM communication system [18]. Sometimes the construction of proper fuzzy rules for equalizers is difficult, and one effective technology for constructing the equalizer's knowledge base is the use of neural networks. Much effort has been devoted to the development and improvement of fuzzy neural network models. The structures of most neuro-fuzzy systems implement TSK-type or Mamdani-type fuzzy reasoning mechanisms. The adaptive neuro-fuzzy inference system (ANFIS) implements a TSK-type fuzzy system [19], whose consequent parts are linear functions, so that the considered problem is described by a combination of linear functions. When modeling complex nonlinear processes, such fuzzy systems sometimes need more rules to obtain the desired accuracy, and increasing the number of rules increases the number of neurons in the hidden layer of the network. To improve the computational power of the neuro-fuzzy system, we use nonlinear functions in the consequent part of each rule. Based on these rules, the structure of the nonlinear neuro-fuzzy network (NNFN) is proposed. Because of these nonlinear functions, the NNFN has more computational power and can describe nonlinear processes with the desired accuracy. In this paper, the NNFN is used for the equalization of nonlinear channel distortion. The NNFN allows the equalizer to be trained in a short time and gives better bit error rates, at the cost of computational effort. This paper is organized as follows. In Section 2 the architecture and learning algorithm of the NNFN are presented. In Section 3 the simulation of the NNFN-based channel equalization system is presented. Section 4 concludes the paper.
2 Nonlinear Neuro-Fuzzy Network The kernel of a fuzzy inference system is the fuzzy knowledge base, in which the information consisting of input-output data points of the system is interpreted as linguistically interpretable fuzzy rules. In this paper, fuzzy rules of IF-THEN form constructed with nonlinear quadratic functions are used; the use of nonlinear functions increases the computational power of the neuro-fuzzy system. They have the following form. If x1 is Aj1 and x2 is Aj2 and…and xm is Ajm Then
$$y_j = \sum_{i=1}^{m}\left(w_{1ij} x_i^2 + w_{2ij} x_i\right) + b_j, \quad (1)$$
Here x1, x2, …, xm are input variables, yj (j = 1, .., n) are output variables which are nonlinear quadratic functions, and Aji is the membership function of the j-th rule for the i-th input, defined as a Gaussian membership function. w1ij, w2ij and bj (i = 1, .., m, j = 1, …, n) are parameters of the network. The fuzzy model described by the IF-THEN rules can be obtained by modifying the parameters of the conclusion and premise parts of the rules; in this paper, a gradient method is used to train the parameters of the rules in the neuro-fuzzy network structure. Using the fuzzy rules of equation (1), the architecture of the NNFN is proposed (Fig. 1). The NNFN includes seven layers. In the first layer the number of nodes is equal to the number of input signals; these nodes distribute the input signals. In the second layer each node corresponds to one linguistic term; for each input signal entering the system, the membership degree to which the input value belongs to a fuzzy set is calculated. To describe the linguistic terms, the Gaussian membership function is used.
$$\mu_{1j}(x_i) = e^{-\frac{(x_i - c_{ij})^2}{\sigma_{ij}^2}}, \quad i = 1, \ldots, m,\; j = 1, \ldots, J. \quad (2)$$
layer 7
Fig. 1. The NNFN architecture
Here m is number of input signals, J is number of linguistic terms assigned for external input signals xi. cij and σij are centre and width of the Gaussian membership functions of the j-th term of i-th input variable, respectively. μ1j(xi) is the membership function of i-th input variable for j-th term. m is number of external input signals. In the third layer, the number of nodes corresponds to the number of the rules (R1, R2,…,Rn). Each node represents one fuzzy rule. To calculate the values of output signals, the AND (min) operation is used. In formula (3), Π is the min operation
$$\mu_l(x) = \prod_{j} \mu_{1j}(x_i), \quad l = 1, \ldots, n,\; j = 1, \ldots, J. \quad (3)$$
The fourth layer is the consequent layer. It includes n nonlinear functions (NF), denoted NF1, NF2, …, NFn. The output of each nonlinear function in Fig. 1 is calculated by

$$y_j = \sum_{i=1}^{m}\left(w_{1ij} x_i^2 + w_{2ij} x_i\right) + b_j, \quad j = 1, \ldots, n. \quad (4)$$
In the fifth layer, the output signals of the third layer μl(x) are multiplied with the output signals of the nonlinear functions. In the sixth and seventh layers, defuzzification is performed to calculate the output of the entire network:

$$u = \frac{\sum_{l=1}^{n} \mu_l(x)\, y_l}{\sum_{l=1}^{n} \mu_l(x)}. \quad (5)$$
Here yl is the outputs of fourth layer that are nonlinear quadratic functions, u is the output of whole network. After calculating the output signal of the NNFN, the training of the network starts. Training includes the adjustment of the parameter values of membership functions cij and σij (i=1,..,m, j=1,..,n) in the second layer (premise part) and parameter values of nonlinear quadratic functions w1ij, w2ij, bj (i=1,..,m, j=1,..,n) in the fourth layer (consequent part). At first step, on the output of network the value of error is calculated.
$$E = \frac{1}{2}\sum_{i=1}^{O}\left(u_i^d - u_i\right)^2. \quad (6)$$
Here O is number of output signals of network (in given case O=1), u id and u i are the desired and current output values of the network, respectively. The parameters w1ij, w2ij, bj (i=1,..,m, j=1,..,n) and cij and σij (i=1,..,m, j=1,..,n) are adjusted using the following formulas.
$$w_{1ij}(t+1) = w_{1ij}(t) + \gamma \frac{\partial E}{\partial w_{1ij}} + \lambda\left(w_{1ij}(t) - w_{1ij}(t-1)\right); \quad (7)$$
$$w_{2ij}(t+1) = w_{2ij}(t) + \gamma \frac{\partial E}{\partial w_{2ij}} + \lambda\left(w_{2ij}(t) - w_{2ij}(t-1)\right); \quad (8)$$
$$b_j(t+1) = b_j(t) + \gamma \frac{\partial E}{\partial b_j} + \lambda\left(b_j(t) - b_j(t-1)\right); \quad (9)$$
$$c_{ij}(t+1) = c_{ij}(t) + \gamma \frac{\partial E}{\partial c_{ij}}; \qquad \sigma_{ij}(t+1) = \sigma_{ij}(t) + \gamma \frac{\partial E}{\partial \sigma_{ij}}. \quad (10)$$
Here γ is the learning rate, λ is the momentum, m is number of input signals of the network (input neurons) and n is the number of rules (hidden neurons), i=1,..,m, j=1,..,n.
The values of derivatives in (7-8) are determined by the following formulas.
$$\frac{\partial E}{\partial w_{1ij}} = \left(u(t) - u^d(t)\right)\frac{\mu_l}{\sum_{l=1}^{n}\mu_l}\, x_i^2; \qquad \frac{\partial E}{\partial w_{2ij}} = \left(u(t) - u^d(t)\right)\frac{\mu_l}{\sum_{l=1}^{n}\mu_l}\, x_i; \qquad \frac{\partial E}{\partial b_j} = \left(u(t) - u^d(t)\right)\frac{\mu_l}{\sum_{l=1}^{n}\mu_l}. \quad (11)$$
The derivatives in (10) are determined by the following formulas.
$$\frac{\partial E}{\partial c_{ij}} = \sum_{l}\frac{\partial E}{\partial u}\frac{\partial u}{\partial \mu_l}\frac{\partial \mu_l}{\partial c_{ij}}; \qquad \frac{\partial E}{\partial \sigma_{ij}} = \sum_{l}\frac{\partial E}{\partial u}\frac{\partial u}{\partial \mu_l}\frac{\partial \mu_l}{\partial \sigma_{ij}}. \quad (12)$$
$$\frac{\partial u}{\partial \mu_l} = \frac{y_l - u}{\sum_{l=1}^{n}\mu_l}, \quad i = 1, \ldots, m,\; j = 1, \ldots, n,\; l = 1, \ldots, n. \quad (13)$$
Here $\dfrac{\partial E}{\partial u} = u(t) - u^d(t)$,
$$\frac{\partial \mu_l(x_j)}{\partial c_{ji}} = \begin{cases} \mu_l(x_j)\dfrac{2(x_j - c_{ji})}{\sigma_{ji}^2}, & \text{if node } j \text{ is connected to rule node } l \\ 0, & \text{otherwise} \end{cases}$$
$$\frac{\partial \mu_l(x_j)}{\partial \sigma_{ji}} = \begin{cases} \mu_l(x_j)\dfrac{2(x_j - c_{ji})^2}{\sigma_{ji}^3}, & \text{if node } j \text{ is connected to rule node } l \\ 0, & \text{otherwise} \end{cases} \quad (14)$$
Taking into account the formulas (11) and (14) in (7)-(10) the learning of the parameters of the NNFN is carried out.
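A compact sketch of the forward computation (2)-(5) is given below. It assumes, for simplicity, that rule l uses the l-th linguistic term of every input (one Gaussian per input per rule); this wiring, the names and the NumPy interface are illustrative assumptions rather than the authors' exact implementation.

import numpy as np

def nnfn_forward(x, c, sigma, w1, w2, b):
    # x: (m,) inputs; c, sigma: (m, n) Gaussian centres/widths (input i, rule l)
    # w1, w2: (n, m) consequent weights; b: (n,) biases
    mu = np.exp(-((x[:, None] - c)**2) / sigma**2)   # equation (2)
    tau = mu.min(axis=0)                             # equation (3), min over the inputs
    y = (w1 * x**2 + w2 * x).sum(axis=1) + b         # equation (4), quadratic consequents
    return float((tau * y).sum() / tau.sum())        # equation (5), weighted average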
3 Simulation
The architecture of the NNFN based equalization system is shown in Fig. 2. The random binary input signals s(k) are transmitted through the communication channel. Channel medium includes the effects of the transmitter filter, transmission medium, receiver filter and other components. Input signals can be distorted by noise and intersymbol interference. Intersymbol interference is mainly responsible for linear distortion. Nonlinear distortions are introduced through converters, propagation environment. Channel output signals are filtrated and entered to the equalizer for equalizing the distortion. During simulation the transmitted signals s(k) are known input samples with an equal probability of being –1 and 1. These signals are corrupted by additive noise n(k). These corrupted signals are inputs for the equalizer. In channel equalization, the problem is the classification of incoming input signal of equalizer onto feature space which is divided into two decision regions. A correct decision of equalizer occurs if s (k ) = s (k ) . Here s(k) is transmitted signal, i.e. channel input, s (k ) is the output of equalizer. Based on the values of the transmitted signal s(k) (i.e. ±1) the channel
Fig. 2. The architecture of the NNFN based equalization system
state can be partitioned into two classes R+ and R-.Here R+={x(k)⏐s(k)=1} and R={x(k)⏐s(k)=-1}. x(k) is the channel output signal. In this paper the NNFN structure and its training algorithm are used to design equalizer. Simulations have been carried out for the equalization of linear and nonlinear channels. In the first simulation, we use the following nonminimum-phase channel model.
x(k) = a1 (k)s(k) + a2 (k)s(k - 1 ) + a3 (k)s(k - 2 ) + n(k),
(15)
where a1(k) = 0.3482, a2(k) = 0.8704 and a3(k) = 0.3482, and n(k) is additive noise. This type of channel is encountered in real communication systems. During equalizer design, the sequence of transmitted signals is given to the channel input; 200 symbols are used for training and 10^3 signals for testing. They are assumed to be an independent sequence taking values from {-1, 1} with equal probability. The additive Gaussian noise n(k) is added to the transmitted signal. At the output of the equalization system, the deviation of the original transmitted signal from the current equalizer output is determined, and this error e(k) is used to adjust the network parameters. Training is continued until the error over the whole training sequence is acceptably low. During simulation, the input signals to the equalizer are the channel outputs x(k), x(k-1), x(k-2), x(k-3). The computer simulation of the equalization system has been performed using the NNFN, ANFIS [19], and feedforward neural networks, with 27 rules (hidden neurons) in the NNFN, 27 hidden neurons in the feedforward neural network and 36 rules (hidden neurons) in the ANFIS-based equalizer. The equalizers are trained on 3000 samples. After simulation, the bit error rate (BER) versus signal-to-noise ratio (SNR) characteristics of all equalizers are determined for different noise levels. Fig. 3 shows the performance of the equalizers based on the NNFN, ANFIS and feedforward neural networks: the solid line is the NNFN-based equalizer, the dashed line the ANFIS-based equalizer, and the dash-dotted line the feedforward neural network based equalizer. As shown in the figure, at low
Fig. 3. Performance of the NNFN (solid line with ‘+’), ANFIS (dashed line with ‘o’) and feedforward neural network (dash-doted line with ‘*’) based equalizers
SNR (high level of noises) the performance of NNFN based equalizer is better than other ones. In the second simulation, the following nonlinear channel model was used
x(k) = a1(k)s(k) + a2 (k)s(k - 1 ) - 0.9 ⋅ (a1(k)s(k) + a2 (k)s(k - 1 ))3 + n(k),
(16)
where a1(k) = 1 and a 2 (k) = 0.5 . We consider the case when the channel is time varying, that is a1(k) and a 2 (k) coefficients are time-varying coefficients. These are generated by using second-order Markov model in which white Gaussian noise source drives a second-order Butterworth low-pass filter [4,22]. In simulation a second order Butterworth filter with cutoff frequency 0.1 is used. The colored Gaussian sequences which were used as time varying coefficients ai are generated with a standard deviation of 0.1. The curves representing the time variation of the channel coefficients are
Fig. 4. Time-varying coefficients of channel
Fig. 5. Performance of the NNFN (solid line with ‘+’), ANFIS (dashed line with ‘o’) and feedforward neural network (dash-doted line with ‘*’) based equalizers
Fig. 6. Error plot
depicted in Fig. 4. The first 200 symbols are used for training and 10^3 signals for testing. The simulations are performed using the NNFN, ANFIS and feedforward neural networks, with 36 neurons in the hidden layer of each network. Fig. 5 illustrates the BER performance of the equalizers for channel (16), averaged over 10 independent trials; as shown in the figure, the performance of the NNFN-based equalizer is better than that of the others. In Fig. 6, the error plot of the learning of the NNFN equalizer is given. The channel states are plotted in Fig. 7: Fig. 7(a) shows the noise-free channel states, Fig. 7(b) the channel states with additive noise, and Fig. 7(c) the channel states after equalization of the distortion, obtained after 3000 learning iterations. The obtained results confirm the effectiveness of applying the NNFN technology to channel equalization.
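For reference, training and test data of the kind used in the first experiment can be generated from the channel model (15) as sketched below; the way the noise variance is derived from the SNR is an illustrative choice, not taken from the paper.

import numpy as np

def make_channel_data(n_symbols, snr_db, a=(0.3482, 0.8704, 0.3482), seed=0):
    rng = np.random.default_rng(seed)
    s = rng.choice([-1.0, 1.0], size=n_symbols)                           # transmitted symbols
    x = a[0]*s + a[1]*np.r_[0.0, s[:-1]] + a[2]*np.r_[0.0, 0.0, s[:-2]]   # channel (15), noise-free
    noise_var = np.mean(x**2) / (10.0**(snr_db / 10.0))                   # additive Gaussian noise n(k)
    x = x + rng.normal(scale=np.sqrt(noise_var), size=n_symbols)
    return s, x

s_train, x_train = make_channel_data(200, snr_db=10)            # 200 training symbols
s_test, x_test = make_channel_data(10**3, snr_db=10, seed=1)    # 10^3 test symbols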
Fig. 7. Channel states: a) noise free, b) with noise, c) after equalization
4 Conclusion The development of an NNFN-based equalizer has been carried out, and a learning algorithm is applied to find its parameters. Using the developed equalizer, the equalization of linear and nonlinear time-varying channels in the presence of additive distortion has been performed. The simulation results of the NNFN-based equalizer are compared with those of equalizers based on feedforward neural networks, and it was found that the NNFN-based equalizer has better BER performance in noisy channels. The comparative simulation results confirm the efficiency of applying the NNFN to adaptive channel equalization.
References [1] Proakis, J.: Digital Comunications. New York, McGraw-Hill (1995) [2] Qureshi, S.U.H.: Adaptive Equalization. Proc.IEEE, 73 (9) (1985) 1349-1387 [3] Falconer, D.D.: Adaptive Equalization of Channel Nonlinearites in QAM Data Transmission Systems. Bell System Technical Journal 27 (7) (1978)
[4] Cowan, C.F.N., Semnani, S.: Time-Variant Equalization Using Novel Nonlinear Adaptive Structure. Int.J.Adaptive Contr. Signal Processing 12 (2) (1998) 195-206. [5] Chen, S., Gibson, G.J., Cowan, C.F.N., Grant, P.M.: Adaptive Equalization of Finite Non-Linear Channels Using Multiplayer Perceptrons. Signal Process 20 (2) (1990) 107-119 [6] Chen, S., Gibson, G.J., Cowan, C.F.N., Grant, P.M.: Reconstruction of Binary Signals Using an Adaptive Radial-Basis Function Equalizer. Signal Processing 22 (1) (1991) 77-93 [7] Chen, S., Mclaughlin, S., Mulgrew, B.: Complex Valued Radial Based Function Network, Part II:Application to Digital Communications Channel Equalization. Signal Processing 36 (1994) 175-188 [8] Peng, M., Nikias, C.L., Proakis, J.: Adaptive Equalization for PAM and QAM Signals with Neural Networks. in Proc. Of 25th Asilomar Conf. On Signals, Systems & Computers 1 (1991) 496-500 [9] Peng, M., Nikias, C.L., Proakis, J.: Adaptive Equalization with Neural Networks: New Multiplayer Perceptron Structure and Their Evaluation. Proc.IEEE Int. Conf.Acoust., Speech, Signal Proc., vol.II (San Francisco,CA) (1992) 301-304 [10] Lee, J.S., Beach, C.D., Tepedelenlioglu, N.: Channel Equalization Using Radial Basis Function Neural Network. Proc.IEEE Int. Conf.Acoust., Speech, Signal Proc., 1996, vol.III (Atlanta, GA) (1996) 1719-1722 [11] Erdogmus, D., Rende, D., Principe, J., Wong, T.F.: Nonlinear Channel Equalization Using Multiplayer Perceptrons with Information-Theoretic Criterion. In Proc. of 2001 IEEE Signal Processing Society Workshop (2001) 443-451 [12] Chen, Z., Antonio, C.: A New Neural Equalizer for Decision-Feedback Equalization. IEEE Workshop on Machine Learning for Signal Processing ( 2004) [13] Wang, L.X., Mendel, J.M.: Fuzzy Adaptive Filters, with Application to Nonlinear Channel Equalization. IEEE Transaction on Fuzzy Systems 1 (3) (1993) [14] Sarwal, P., Srinath, M.D.: A Fuzzy Logic System for Channel Equalization. IEEE Trans. Fuzzy System 3 (1995) 246-249 [15] Lee, K.Y.: Complex Fuzzy Adaptive Filters with LMS Algorithm. IEEE Transaction on Signal Processing 44 (1996) 424-429 [16] Patra, S.K., Mulgrew, B.: Efficient Architecture for Bayesian Equalization Using Fuzzy Filters. IEEE Transaction on Circuit and Systems II 45 (1998) 812-820 [17] Patra, S.K., Mulgrew, B.: Fuzzy Implementation of Bayesian Equalizer in the Presence of Intersymbol and Co-Channel Interference. Proc. Inst. Elect. Eng. Comm.145 (1998) 323-330 [18] Siu, S., Lu, C., Lee, C.M.: TSK-Based Decision Feedback Equalization Using an Evolutionary Algorithm Applied to QAM Communication Systems. IEEE Transactions on Circuits and Systems 52 (9) 2005 [19] Jang, J., Sun, C., Mizutani, E.: Neuro-fuzzy and Soft Computing: a Computational Approach to Learning and Machine Intelligence. Prentice-Hall, NJ (1997) [20] Choi, J., Antonio, C., Haykin, S.: Kalman Filter-Trained Recurrent Neural Equalizers for Time-Varying Channels. IEEE Transactions on Communications 53 (3) (2005) [21] Abiyev, R., Mamedov, F., Al-shanableh, T.: Neuro-Fuzzy System for Channel Noise Equalization. International Conference on Artificial Intelligence.IC-AI’04, Las Vegas, Nevada, USA, June 21-24 (2004)
Comparative Studies of Fuzzy Genetic Algorithms

Qing Li1, Yixin Yin1, Zhiliang Wang1, and Guangjun Liu2

1 School of Information Engineering, University of Science and Technology Beijing, 100083 Beijing, China
{liqing,yyx}@ies.ustb.edu.cn, zhiliang [email protected]
2 Department of Aerospace Engineering, Ryerson University, M5B 2K3 Toronto, Canada
[email protected]
Abstract. Many adaptive schemes for controlling the probabilities of crossover and mutation in genetic algorithms with fuzzy logic have been reported in recent years. However, no comparative study of these algorithms has been reported. In this paper, several fuzzy genetic algorithms are briefly summarized first, and they are then compared with each other under the same simulation conditions. The simulation results are analyzed in terms of search speed and search quality.

Keywords: genetic algorithm, crossover probability, mutation probability, fuzzy logic.
1 Introduction
It is well known that the probabilities of crossover and mutation in a genetic algorithm (GA) have a great influence on its performance (e.g., search speed and search quality), and the correct setting of these parameters is not an easy task. In the last decade, numerous fuzzy-logic-based approaches for the adjustment of crossover and mutation probabilities have been reported, such as [1] to [7]. Song et al. [1] propose a fuzzy logic controlled genetic algorithm (FCGA) for the regulation of crossover probability and mutation probability, where the changes of the average fitness value between two consecutive generations are selected as the input variables. Yun and Gen [2] improve the work of Song et al., in which some fuzzy inference rules are modified and a scaling factor for normalizing the input variables is introduced. Li et al. [3] investigate another fuzzy genetic algorithm (FGA), where the information of both the whole generation and particular individuals is used for controlling the crossover and mutation probabilities. Subbu et al. [4] suggest a fuzzy logic controlled genetic algorithm (FLC-GA), and the FLC-GA uses two kinds of diversity information (genotypic diversity and phenotypic diversity) as the input. A new fuzzy genetic algorithm using PD (Population Diversity) measurements is designed by Wang in [5], and experiments have
Corresponding author, currently a visiting scholar with the Department of Aerospace Engineering, Ryerson University.
demonstrated that premature convergence can be avoided by this method. Liu et al. [6] develop a hybrid fuzzy genetic algorithm (HFGA), in which the average fitness value and the best fitness value of each generation are adopted for dynamically tuning the crossover and mutation probabilities. Recently, an improved fuzzy genetic algorithm (IFGA) was proposed by Li et al. in [7]. The differences in the average fitness value and standard deviation between two consecutive generations are selected as the input variables, and two adaptive scaling factors are introduced for normalizing the input variables. Moreover, new rules based on domain heuristic knowledge are introduced for fuzzy inference. Although most of these fuzzy genetic algorithms have demonstrated their effectiveness in the respective works, comparative studies and performance analyses have not been reported previously. The aim of this paper is to compare the performance of the above-mentioned algorithms under the same conditions. Three fuzzy genetic algorithms are selected for comparative studies and the simulation results are analyzed. The comparison results illustrate that the IFGA leads to improved performance in terms of search speed and search quality compared with the other two genetic algorithms on the same test functions. The numerical simulation studies of the three selected fuzzy genetic algorithms using the same test functions are presented in Section 2, followed by the conclusions and future work in Section 3.
2 Comparative Studies and Performance Analysis
In this section, three fuzzy genetic algorithms (FCGA in [2], FGA in [3] and IFGA in [7]) are selected for comparative studies. The detailed procedures of each algorithm are not introduced in this paper because of the page limitation. Three test functions are applied for the numerical simulation studies, similarly as in [2].

Test function 1 (T1) is called "Binary f6" and it has a global maximum 1.0 at the point x1 = x2 = 0 in its search range [-100, 100]. Its expression is as follows:

f(x_1, x_2) = 0.5 - \frac{\left(\sin\sqrt{x_1^2 + x_2^2}\right)^2 - 0.5}{\left(1.0 + 0.001(x_1^2 + x_2^2)\right)^2}.    (1)

Test function 2 (T2) is called the "Rosenbrock function" and it has a global minimum 0 at x1 = x2 = 1 within the range from -2.048 to 2.048. Its expression is as follows:

f(x_1, x_2) = 100(x_1^2 - x_2)^2 + (1 - x_1)^2.    (2)

Test function 3 (T3) is called the "Rastrigin function" and it has a global minimum 0 at the point x1 = x2 = x3 = x4 = x5 = 0 within the range [-5.12, 5.12]. It can be expressed as follows:

f(x_1, x_2, x_3, x_4, x_5) = 15 + \sum_{i=1}^{5} \left(x_i^2 - 3\cos(2\pi x_i)\right).    (3)
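For readers who want to reproduce the benchmark setup, the three test functions translate directly into code. The sketch below is our own Python transcription of Eqs. (1)–(3) (the function names are ours and do not come from the paper).

```python
import numpy as np

def binary_f6(x1, x2):
    """Test function T1 (Eq. 1): global maximum 1.0 at x1 = x2 = 0, search range [-100, 100]."""
    r2 = x1 ** 2 + x2 ** 2
    return 0.5 - (np.sin(np.sqrt(r2)) ** 2 - 0.5) / (1.0 + 0.001 * r2) ** 2

def rosenbrock(x1, x2):
    """Test function T2 (Eq. 2): global minimum 0 at x1 = x2 = 1, search range [-2.048, 2.048]."""
    return 100.0 * (x1 ** 2 - x2) ** 2 + (1.0 - x1) ** 2

def rastrigin5(x):
    """Test function T3 (Eq. 3): global minimum 0 at the origin, search range [-5.12, 5.12]^5."""
    x = np.asarray(x, dtype=float)
    return 15.0 + np.sum(x ** 2 - 3.0 * np.cos(2.0 * np.pi * x))

# Quick check of the stated optima:
print(binary_f6(0.0, 0.0), rosenbrock(1.0, 1.0), rastrigin5([0.0] * 5))  # -> 1.0 0.0 0.0
```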
The parameters of each algorithm are set as follows: population size 20, maximum generation 2000, initial crossover probability 0.5 and mutation probability
0.05. The roulette wheel selection operator, the uniform arithmetic crossover operator and the uniform mutation operator in [8] are adopted as the genetic operators in the recombination process. Twenty runs were executed to eliminate the randomness of the searches, and an elitism strategy is also used to preserve the best individual of each generation. If the pre-defined maximum generation is reached or an optimal solution is located, the evolution process is stopped. Two indices are used to compare the performances of the three algorithms. One is the "average number of generations", defined as the average number of generations needed to reach the given stop conditions. The other is the "number of obtaining the optimal solution", which represents the number of runs (out of 20) in which the optimal solution is located. The former index indicates the search speed and the latter stands for the search quality. All the simulation programs are executed on an Acer notebook (AMD Turion 64, 512 MB DDR) and programmed in MATLAB. The simulation results are listed in Table 1.

Table 1. Simulation results of three test functions

                                             Algorithms
                                        FCGA      FGA     IFGA
Average number of generations     T1  1783.6   2137.3   1537.5
                                  T2  1407.7   1534.9   1205.2
                                  T3  1121.3   2285.6    976.8
Number of obtaining the           T1      16       12       18
optimal solution                  T2      15       14       18
                                  T3      18       10       19
Fig. 1. Behaviors of average fitness value in T2 (average fitness value vs. generation; curves for IFGA, FCGA and FGA)
Fig. 2. Behaviors of standard deviation in T2 (standard deviation vs. generation; curves for IFGA, FCGA and FGA)
Fig. 3. Behaviors of crossover probability in T2 (crossover probability vs. generation; curves for IFGA, FCGA and FGA)
In terms of the "average number of generations" in Table 1, the IFGA outperforms FCGA and FGA because it requires fewer generations to locate the optimal solution. In terms of the "number of obtaining the optimal solution", the IFGA also outperforms FCGA and FGA because it locates the optimal solution more often than the others. From the simulation results, we see that the IFGA shows better performance in terms of both the "average number of generations" and the "number of obtaining the optimal solution" compared with FCGA and FGA.
Fig. 4. Behaviors of mutation probability in T2 (mutation probability vs. generation; curves for IFGA, FCGA and FGA)
For a more detailed comparison between the adaptive schemes, the average fitness value, standard deviation, crossover probability and mutation probability for test function 2 (T2) over the first 50 generations are shown in Figures 1, 2, 3 and 4. From Figs. 1 and 2, we see a lower average fitness value and a higher standard deviation for the IFGA than for FCGA and FGA, implying that the IFGA is more efficient in search quality and exploration ability. From Figs. 3 and 4, we can see that the probabilities of crossover and mutation (especially the mutation probability) of the IFGA fluctuate more than those of FCGA and FGA during the search process, which shows that the IFGA has an enhanced self-adaptive adjusting ability compared with FCGA and FGA.
3 Conclusions and Future Work
Three fuzzy genetic algorithms are compared and analyzed under the same simulation conditions in this paper. The numerical simulation results show that the IFGA provides faster search speed, better search quality and better self-adaptability compared with FCGA and FGA. There are at least two tasks to be performed in the near future: (1) higher-dimensional and higher-order functions are to be applied to test the generality of the conclusion; and (2) other fuzzy genetic algorithms are to be taken into consideration for further comparative studies.
Acknowledgments. This work is supported by the NSFC (National Natural Science Foundation of China, Grant #60374032) and the CSC (China Scholarship Council).
References 1. Song Y., Wang G., Wang P., Johns A.: Environmental/Economic Dispatch Using Fuzzy Logic Controlled Genetic Algorithm. In: IEE Proceedings on Generation, Transmission and Distribution, Vol.144. The Institution of Engineering and Technology, London (1997) 377-382 2. Yun Y., Gen M.: Performance Analysis of Adaptive Genetic Algorithm with Fuzzy Logic and Heuristics. Fuzzy Optimization and Decision Making, 2 (2003) 161-175 3. Li Q., Zheng D., Tang Y., Chen Z.: A New Kind of Fuzzy Genetic Algorithm. Journal of University of Science and Technology Beijing, 1 (2001) 85-89 4. Subbu R., Sanderson A.C., Bonissone P.P.: Fuzzy Logic Controlled Genetic Algorithms Versus Tuned Genetic Algorithms: An Agile Manufacturing Application. In: Proceedings of the 1998 IEEE ISIC/CIRA/ISAS Joint Conference, New Jersey: (1998) 434-440 5. Wang K.: A New Fuzzy Genetic Algorithm Based on Population Diversity. In: Proceedings of the 2001 International Symposium on Computational Intelligence in Robotics and Automation, New Jersey: (2001) 108-112 6. Liu H., Xu Z., Abraham A.: Hybrid Fuzzy-Genetic Algorithm Approach for Crew Grouping. In: Nedjah N., Mourelle L.M., Vellasco M.M.B.R., Abraham A., Koppen M. (eds.): Proceedings of the 2005 5th International Conference on Intelligence Systems Design and Applications. IEEE Computer Society, Washington, DC: (2005) 332-337 7. Li Q., Tong X., Xie S., Liu G.: An Improved Adaptive Algorithm for Controlling the Probabilities of Crossover and Mutation Based on a Fuzzy Control Strategy. In: L. O’Conner (eds.): Proceedings of the 6th International Conference on Hybrid Intelligent Systems and 4th Conference on Neuro-Computing and Evolving Intelligence. IEEE Computer Society, Washington, DC: (2006) 50-50 8. Michalewicz, Z.: Genetic Algorithms + Data Structures = Evolution Programs. 3rd edn. Springer-Verlag, Berlin Heidelberg New York (1996)
Fuzzy Random Dependent-Chance Bilevel Programming with Applications

Rui Liang1, Jinwu Gao2, and Kakuzo Iwamura3

1 Economy, Industry and Business Management College, Chongqing University, Chongqing 400044, China
2 School of Information, Renmin University of China, Beijing 100872, China
3 Department of Mathematics, Josai University, Sakado, Saitama 350-0248, Japan
Abstract. In this paper, a two-level decentralized decision-making problem is formulated as a fuzzy random dependent-chance bilevel programming problem. We define the fuzzy random Nash equilibrium of the lower-level problem and the fuzzy random Stackelberg-Nash equilibrium of the overall problem. In order to find the equilibria, we propose a hybrid intelligent algorithm, in which a neural network, acting as an uncertain function approximator, plays a crucial role in saving computing time, and a genetic algorithm is used for optimization. Finally, we apply the fuzzy random dependent-chance bilevel programming model to a hierarchical resource allocation problem to illustrate the modelling idea and the effectiveness of the hybrid intelligent algorithm.
1 Introduction
Decentralized decision-making becomes more and more important for contemporary decentralized organizations in which each department seeks its own interest, while the organization seeks the overall interest. In order to deal with such problems, multilevel programming (MLP) was proposed by Bracken and McGill [4][5] in the early 1970s. Thereafter, despite its inherent NP-hardness [3], MLP has been applied to a wide variety of areas including economics [2][6], transportation [33][36], engineering [7][31], and so on. For detailed expositions, the reader may consult the review papers [34][35] and the books [9][19]. When multilevel programming is applied to real-world problems, some system parameters are often subject to fluctuations and difficult to measure. By assuming them to be random variables, Patriksson and Wynter [30] and Gao et al. [11] discussed stochastic multilevel programming together with numerical solution methods. Meanwhile, Gao and Liu [12]-[14] discussed fuzzy multilevel programming models with hybrid intelligent algorithms under the assumption of fuzzy parameters. However, in many situations, the system parameters involve both randomness and fuzziness. For instance, in an economic system, the demand consists of multiple demand sources, amongst which some are characterized by random variables, others (e.g., new demand sources or demand sources in some
This work was supported by National Natural Science Foundation of China (No.70601034) and Research Foundation of Renmin University of China.
unsteady states) are characterized by fuzzy variables. Then the total demand is the sum of some random and fuzzy variables, and is characterized by a fuzzy random variable. Kwakernaak [17][18] first introduced the notion of a fuzzy random variable. The concept of the chance measure of a fuzzy random event was first given in [22], and the fuzzy random dependent-chance programming model was initialized by Liu [23]. The underlying philosophy is to select the decision with maximal chance to meet the fuzzy random event. In this paper, we formulate a two-level decentralized decision-making problem as a fuzzy random dependent-chance bilevel programming (FRDBP) model, and present a numerical solution method by integrating neural network and genetic algorithm. For that purpose, the paper is organized as follows. Firstly, we give some basic results of fuzzy random theory in Section 2. Then we formulate a two-level decentralized decision-making problem in fuzzy random environments as an FRDBP model in Section 3. Thirdly, in Section 4, we propose a hybrid intelligent algorithm by integrating fuzzy random simulation, neural network and genetic algorithm. In Section 5, as an application, a hierarchical resource allocation problem with fuzzy random parameters is formulated by FRDBP, and the computational results further illustrate the idea of the FRDBP and the effectiveness of the hybrid intelligent algorithm. Lastly, we give a concluding remark.
2 Preliminaries
Let Θ be a nonempty set, P(Θ) the power set of Θ, and ξ a fuzzy variable with membership function μ. Then the credibility measure Cr of a fuzzy event A ∈ P(Θ) was defined by Liu and Liu [24] as

\mathrm{Cr}(A) = \frac{1}{2}\Big( \sup_{x \in A} \mu(x) + 1 - \sup_{x \in A^c} \mu(x) \Big).

Definition 1. (Liu and Liu [24]) A fuzzy random variable is a function ξ defined on a probability space (Ω, Σ, Pr) taking values in a set of fuzzy variables such that Cr{ξ(ω) ∈ B} is a measurable function of ω for any Borel set B of ℜ.

Definition 2. (Gao and Liu [10]) Let ξ be a fuzzy random variable, and B a Borel set of ℜ. Then the chance of the fuzzy random event ξ ∈ B is a function from (0, 1] to [0, 1], defined as

\mathrm{Ch}\{\xi \in B\}(\alpha) = \sup_{\Pr\{A\} \ge \alpha} \; \inf_{\omega \in A} \mathrm{Cr}\{\xi(\omega) \in B\}.
Example 1. A fuzzy random variable ξ is said to be triangular, if for each ω, ξ(ω) is a triangular fuzzy variable.
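To make the credibility formula concrete, the short sketch below (our own illustration, not from the paper) evaluates Cr{ξ ≤ r} for a triangular fuzzy variable ξ = (a, b, c) directly from the definition above, approximating the two suprema on a grid over the support.

```python
def triangular_membership(x, a, b, c):
    """Membership function of the triangular fuzzy variable (a, b, c)."""
    if a <= x <= b:
        return (x - a) / (b - a) if b > a else 1.0
    if b < x <= c:
        return (c - x) / (c - b) if c > b else 1.0
    return 0.0

def credibility_leq(r, a, b, c, grid=10001):
    """Cr{xi <= r} = 0.5*(sup_{x<=r} mu(x) + 1 - sup_{x>r} mu(x)), evaluated on a grid."""
    lo, hi = min(a, r) - 1.0, max(c, r) + 1.0
    xs = [lo + i * (hi - lo) / (grid - 1) for i in range(grid)]
    pos_event = max((triangular_membership(x, a, b, c) for x in xs if x <= r), default=0.0)
    pos_complement = max((triangular_membership(x, a, b, c) for x in xs if x > r), default=0.0)
    return 0.5 * (pos_event + 1.0 - pos_complement)

# Example: for xi = (7, 8, 9), Cr{xi <= 8} is about 0.5 and Cr{xi <= 9} is 1.0 (up to grid resolution).
print(credibility_leq(8.0, 7.0, 8.0, 9.0), credibility_leq(9.0, 7.0, 8.0, 9.0))
```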
3 Fuzzy Random Dependent-Chance Bilevel Programming
Consider a decentralized decision system with a two-level structure. The lower level consists of m decision makers called followers. Symmetrically, the decision maker
at the upper level is called the leader. Each decision maker has his own decision variables and objective. The leader can only influence the reactions of the followers through his own decision variables, while the followers have full authority to decide how to optimize their own objective functions in view of the decisions of the leader and the other followers. In order to model the fuzzy random decentralized decision-making problem, we give the following notations:
– i = 1, 2, · · · , m: index of followers;
– x: control vector of the leader;
– y_i: control vector of the ith follower;
– ξ = (ξ_1, ξ_2, · · · , ξ_n): n-ary fuzzy random vector into which the problem parameters are arranged;
– f_0(x, y_1, · · · , y_m, ξ): objective function of the leader;
– f_i(x, y_1, · · · , y_m, ξ): objective function of the ith follower;
– g_0(x, ξ): constraint function of the leader;
– g_i(x, y_1, · · · , y_m, ξ): constraint function of the ith follower.
Following the philosophy of fuzzy random dependent-chance programming [23], we formulate this problem as an FRDBP model in the following. Firstly, we assume that the leader's decision x and the other followers' decisions y_1, · · · , y_{i-1}, y_{i+1}, · · · , y_m are given, and that the ith follower is concerned with the event of his objective function achieving a prospective value \bar{f}_i. Then the rational reaction of the ith follower is the set of optimal solutions of the dependent-chance programming model

\max_{y_i} \ \mathrm{Ch}\{ f_i(x, y_1, y_2, \dots, y_m, \xi) \ge \bar{f}_i \}(\alpha_i)
\text{subject to: } \ g_i(x, y_1, y_2, \dots, y_m, \xi) \le 0,    (1)

where α_i is a predetermined confidence level. It is obvious that each follower's rational reaction depends not only on the leader's decision x but also on the other followers' decisions y_1, · · · , y_{i-1}, y_{i+1}, · · · , y_m.

Definition 3. An array (y*_1, y*_2, · · · , y*_m) is called a Nash equilibrium with respect to a given decision x of the leader, if

\mathrm{Ch}\{ f_i(x, y^*_1, \dots, y^*_{i-1}, y_i, y^*_{i+1}, \dots, y^*_m, \xi) \ge \bar{f}_i \}(\alpha_i) \le \mathrm{Ch}\{ f_i(x, y^*_1, y^*_2, \dots, y^*_m, \xi) \ge \bar{f}_i \}(\alpha_i)    (2)

subject to the uncertain environment g_i(x, y_1, y_2, · · · , y_m, ξ) ≤ 0, i = 1, 2, · · · , m, for any feasible (y*_1, y*_2, · · · , y*_{i-1}, y_i, y*_{i+1}, · · · , y*_m) and i = 1, 2, · · · , m.

Secondly, if the leader has given a confidence level α_0 and wants to maximize the chance of his objective function achieving a prospective value \bar{f}_0, then the leader's problem is formulated as the following dependent-chance programming model
\max_{x} \ \mathrm{Ch}\{ f_0(x, y^*_1, y^*_2, \dots, y^*_m, \xi) \ge \bar{f}_0 \}(\alpha_0)
\text{subject to: } \ g_0(x, \xi) \le 0,    (3)

where (y*_1, y*_2, · · · , y*_m) is the Nash equilibrium with respect to x. Now, we present the concept of the Stackelberg-Nash equilibrium, defined as follows.

Definition 4. An array (x*, y*_1, y*_2, · · · , y*_m) is called a Stackelberg-Nash equilibrium, if

\mathrm{Ch}\{ f_0(x, y_1, y_2, \dots, y_m, \xi) \ge \bar{f}_0 \}(\alpha_0) \le \mathrm{Ch}\{ f_0(x^*, y^*_1, y^*_2, \dots, y^*_m, \xi) \ge \bar{f}_0 \}(\alpha_0)    (4)

subject to the uncertain environment g_0(x, ξ) ≤ 0, for any x and the Nash equilibrium (y_1, y_2, · · · , y_m) with respect to x.

Finally, we assume that the leader first chooses his control vector x, and that the followers' rational reactions always form a Nash equilibrium. In order to maximize the chance functions of the leader and the followers, we have the following dependent-chance bilevel programming model:

\max_{x} \ \mathrm{Ch}\{ f_0(x, y^*_1, y^*_2, \dots, y^*_m, \xi) \ge \bar{f}_0 \}(\alpha_0)
\text{subject to: } \ g_0(x, \xi) \le 0,
\text{where } (y^*_1, y^*_2, \dots, y^*_m) \text{ solves the problems}
\quad \max_{y_i} \ \mathrm{Ch}\{ f_i(x, y_1, y_2, \dots, y_m, \xi) \ge \bar{f}_i \}(\alpha_i)
\quad \text{subject to: } \ g_i(x, y_1, y_2, \dots, y_m, \xi) \le 0, \quad i = 1, 2, \dots, m.    (5)
4 Hybrid Intelligent Algorithm
Since the bilevel programming problem is NP-hard [3], successful implementations of multilevel programming rely largely on efficient numerical algorithms. As an extension of bilevel programming, FRDBP further increases this difficulty. In this section, we integrate fuzzy random simulation, neural network and genetic algorithm to produce a hybrid intelligent algorithm for solving the FRDBP model.

4.1 Fuzzy Simulation
By uncertain functions we mean functions with fuzzy random parameters such as

U : (x, y_1, y_2, \dots, y_m) \mapsto \mathrm{Ch}\{ f(x, y_1, y_2, \dots, y_m, \xi) \ge \bar{f} \}(\alpha).    (6)

Due to their complexity, we resort to the fuzzy random simulation technique for computing the uncertain functions. Here we shall not go into details, and the interested reader may consult the book [26] by Liu.
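As an illustration of what such a fuzzy random simulation can look like, the sketch below estimates Ch{f(x, y, ξ) ≥ f̄}(α) by sampling the random part and, for each sample, running a fuzzy simulation of the credibility. This follows the general structure of Definition 2, but the sampling scheme, sample sizes and helper names are our own assumptions, not a prescription from [26].

```python
import math

def fuzzy_sim_credibility(event_holds, membership, sample_support, n_fuzzy=2000):
    """Estimate Cr{event} = 0.5*(Pos{event} + 1 - Pos{not event}) by sampling the fuzzy support."""
    pos_event, pos_not = 0.0, 0.0
    for _ in range(n_fuzzy):
        z = sample_support()                 # a candidate realization of the fuzzy part
        nu = membership(z)                   # its possibility (membership) degree
        if event_holds(z):
            pos_event = max(pos_event, nu)
        else:
            pos_not = max(pos_not, nu)
    return 0.5 * (pos_event + 1.0 - pos_not)

def chance(event_holds, membership_given_omega, sample_support_given_omega,
           sample_omega, alpha, n_random=200):
    """Estimate Ch{event}(alpha) = sup_{Pr(A)>=alpha} inf_{omega in A} Cr{event | omega}."""
    creds = []
    for _ in range(n_random):
        omega = sample_omega()               # sample the random part
        creds.append(fuzzy_sim_credibility(lambda z: event_holds(z, omega),
                                           lambda z: membership_given_omega(z, omega),
                                           lambda: sample_support_given_omega(omega)))
    creds.sort(reverse=True)
    # approximate the sup/inf by the ceil(alpha * n_random)-th largest credibility value
    k = max(1, math.ceil(alpha * n_random))
    return creds[k - 1]
```

In a concrete model, event_holds would test f(x, y, ·) ≥ f̄ for one realization, and the membership and support samplers would come from the possibility distributions of the fuzzy random parameters.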
4.2 Uncertain Function Approximation
A neural network is essentially a nonlinear mapping from the input space to the output space. It is known that a neural network with an arbitrary number of hidden neurons is a universal approximator for continuous functions [8][16]. Moreover, it has a high speed of operation after it is well trained on a set of input-output data. In order to speed up the solution process, we train neural networks to approximate the uncertain functions, and then use the trained neural networks to evaluate the uncertain functions in the solution process. For training a neural network to approximate an uncertain function, we must first generate a set of input-output data (x^{(k)}, y^{(k)}, z^{(k)}), k = 1, 2, · · · , M, where x and y are control vectors of the leader and the followers, respectively, and z^{(k)} are the corresponding function values calculated by fuzzy random simulation. Then, we train a neural network on the set of input-output data by using the popular backpropagation algorithm. Finally, the trained network, characterized by U(x, y, w), where w denotes the network weights produced by the training process, can be used to evaluate the uncertain function. Thus, much computing time is saved. For a detailed discussion of uncertain function approximation, the reader may consult the book [26] by Liu.
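A minimal sketch of this approximation step is given below, with scikit-learn's MLPRegressor standing in for the backpropagation-trained network. The data-generating function is a placeholder for the fuzzy random simulation, and none of the names or sizes come from the paper.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)

# Stand-in for fuzzy random simulation: replace with the simulated chance values z^(k).
def uncertain_function(xy):
    return np.exp(-np.sum(xy ** 2, axis=1))

# Generate input-output data (x^(k), y^(k)) -> z^(k).
XY = rng.uniform(0.0, 1.0, size=(500, 4))   # concatenated leader/follower decision vectors
Z = uncertain_function(XY)

# Train a feedforward network by backpropagation to approximate the uncertain function.
net = MLPRegressor(hidden_layer_sizes=(10,), max_iter=5000, random_state=0)
net.fit(XY, Z)

# During the genetic search, query the trained network instead of re-running the simulation.
print(net.predict(XY[:3]), Z[:3])
```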
4.3 Computing Nash Equilibrium
Define the symbols y_{-i} = (y_1, y_2, · · · , y_{i-1}, y_{i+1}, · · · , y_m), i = 1, 2, · · · , m. For any decision x revealed by the leader, if the ith follower knows the strategies y_{-i} of the other followers, then the optimal reaction of the ith follower is represented by a mapping y_i = r_i(y_{-i}) that solves the subproblem defined in equation (1). It is clear that the Nash equilibrium of the m followers will be the solution of the system of equations

y_i = r_i(y_{-i}), \quad i = 1, 2, \dots, m.    (7)

In other words, we should find a fixed point of the vector-valued function (r_1, r_2, · · · , r_m). This task may be achieved by solving the following dependent-chance programming model:

\min \ R(y_1, y_2, \dots, y_m) = \sum_{i=1}^{m} \| y_i - r_i(y_{-i}) \|
\text{subject to: } \ g_i(x, y_1, y_2, \dots, y_m, \xi) \le 0, \quad i = 1, 2, \dots, m.    (8)

If the optimal solution (y*_1, y*_2, · · · , y*_m) satisfies

R(y^*_1, y^*_2, \dots, y^*_m) = 0,    (9)

then y*_i = r_i(y*_{-i}) for i = 1, 2, · · · , m. That is, (y*_1, y*_2, · · · , y*_m) must be a Nash equilibrium for the given x.
In a numerical solution process, if a solution (y*_1, y*_2, · · · , y*_m) satisfies

R(y^*_1, y^*_2, \dots, y^*_m) \le \varepsilon,    (10)

where ε is a small positive number, then it can be regarded as a Nash equilibrium for the given x. Otherwise, we should continue the computing procedure. Since the objective function involves the m mappings r_i(y_{-i}), the optimization problem (8) may be very complex. So we employ a genetic algorithm to search for the Nash equilibrium.

Genetic Algorithm for Nash Equilibrium:
Step 1. Input a feasible control vector x.
Step 2. Generate a population of chromosomes y^{(j)}, j = 1, 2, · · · , pop size, at random from the feasible set.
Step 3. Calculate the objective values of the chromosomes.
Step 4. Compute the fitness of each chromosome according to the objective values.
Step 5. Select the chromosomes by spinning the roulette wheel.
Step 6. Update the chromosomes by crossover and mutation operations.
Step 7. Repeat Steps 3–6 until the best chromosome satisfies inequality (10).
Step 8. Return the Nash equilibrium y* = (y*_1, y*_2, · · · , y*_m).
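A compact sketch of Steps 2–7 is given below. It is our own simplified real-coded implementation: each follower's decision is treated as a scalar, the best-response mappings r_i and the variable bounds are placeholders to be supplied by the model, and the constraints g_i ≤ 0 are assumed to be absorbed into those mappings and bounds.

```python
import numpy as np

rng = np.random.default_rng(0)

def ga_nash(x, best_responses, lower, upper, pop_size=30, pc=0.3, pm=0.2,
            eps=1e-3, max_gen=2000):
    """Search for y = (y_1,...,y_m) (scalars) with R(y) = sum_i |y_i - r_i(x, y)| <= eps."""
    m = len(best_responses)
    pop = rng.uniform(lower, upper, size=(pop_size, m))

    def residual(y):
        return sum(abs(y[i] - r(x, y)) for i, r in enumerate(best_responses))

    best = pop[0]
    for _ in range(max_gen):
        scores = np.array([residual(y) for y in pop])
        best = pop[scores.argmin()].copy()
        if scores.min() <= eps:
            break                                    # approximate Nash equilibrium found
        fit = 1.0 / (1.0 + scores)                   # fitness from objective values
        idx = rng.choice(pop_size, size=pop_size, p=fit / fit.sum())   # roulette wheel
        pop = pop[idx].copy()
        for i in range(0, pop_size - 1, 2):          # arithmetic crossover
            if rng.random() < pc:
                lam = rng.random()
                a, b = pop[i].copy(), pop[i + 1].copy()
                pop[i], pop[i + 1] = lam * a + (1 - lam) * b, lam * b + (1 - lam) * a
        mask = rng.random(pop.shape) < pm            # uniform mutation
        fresh = rng.uniform(lower, upper, size=pop.shape)
        pop[mask] = fresh[mask]
        pop[0] = best                                # elitism
    return best
```

In practice, each best_responses[i] would itself solve model (1) for follower i, for example by querying a trained uncertain-function approximator.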
4.4 Hybrid Intelligent Algorithm
For any feasible control vector x revealed by the leader, denote the Nash equilibrium with respect to x by (y*_1, y*_2, · · · , y*_m). Then, the Stackelberg-Nash equilibrium can be obtained by solving the leader's problem defined in (3). Since its objective function involves not only the uncertain parameters ξ but also the complex mapping x → (y*_1, y*_2, · · · , y*_m), the optimization problem may be very difficult to solve. A genetic algorithm is a good candidate, although it is relatively slow. Now we integrate fuzzy simulation, neural network, and genetic algorithm to produce a hybrid intelligent algorithm for solving general FRDBP models.

Hybrid Intelligent Algorithm for Stackelberg-Nash Equilibrium:
Step 1. Generate input-output data of uncertain functions like (6).
Step 2. Train neural networks by the backpropagation algorithm.
Step 3. Initialize a population of chromosomes x^{(i)}, i = 1, 2, · · · , pop size, randomly.
Step 4. Compute the Nash equilibrium for each chromosome.
Step 5. Compute the fitness of each chromosome according to the objective values.
Step 6. Select the chromosomes by spinning the roulette wheel.
Step 7. Update the chromosomes by crossover and mutation operations.
Step 8. Repeat Steps 4–7 for a given number of cycles.
Step 9. Return the best chromosome as the Stackelberg-Nash equilibrium.
5 Hierarchical Resource Allocation Problem
Consider an enterprise composed of a center, which markets products and supplies resources, and two factories as subsystems, each of which produces two kinds of products by consuming allocated resources. The center makes a decision on the amounts of the resources so as to maximize its total profit in marketing the products, while each factory desires to attain its production activity goal based on efficiency, quality, and performance. Some notations are given as follows:
– x_{mj}: the amount of resource j allocated to factory m;
– y_{mj}: the amount of product j produced by factory m;
– Y_j: the total amount of marketed product j, where Y_j = y_{1j} + y_{2j};
– f_0(Y): the profit function of marketing Y = (Y_1, Y_2)^T;
– f_m(y_m): the objective function expressing the goal of factory m, where y_m = (y_{m1}, y_{m2})^T;
– y*_{mj}(x), Y*_j(x): the parametric optimal values of y_{mj} and Y_j, respectively, with respect to the resource allocation x = (x_{11}, x_{12}, x_{21}, x_{22})^T.

The objective functions of the two factories are

f_1(y_1(x)) = (y_{11} - 4.0)^2 + (y_{12} - 13.0)^2, \quad f_2(y_2(x)) = (y_{21} - 35.0)^2 + (y_{22} - 2.0)^2,

and the profit function of the center is

f_0(Y) = (\xi_1 - Y_1(x)) Y_1(x) + (\xi_2 - Y_2(x)) Y_2(x),

where ξ_1 is a triangular fuzzy random variable with normal distribution, denoted by (N(200, 4^2) - 10, N(200, 4^2), N(200, 4^2) + 10), and ξ_2 is a triangular fuzzy random variable with normal distribution, denoted by (N(160, 3^2) - 10, N(160, 3^2), N(160, 3^2) + 10). We note that the prototype of the above example comes from [1]. Here we fuzzy-randomize only two system parameters for the convenience of comparison. When ξ_1 and ξ_2 are substituted by their mean values 200 and 160, respectively, we get the problem in Ref. [1], whose optimal solution is known to be

x^* = (x^*_{11}, x^*_{12}, x^*_{21}, x^*_{22}) = (7.00, 3.00, 12.00, 18.00),

and the optimal reactions of the two factories are

(y^*_{11}, y^*_{12}) = (0.00, 10.00)
and

(y^*_{21}, y^*_{22}) = (30.00, 0.00).

The optimal objective of the center is f_0(Y(x^*)) = 6600. That is, the center can achieve a profit level of 6600 from the point of view of the mean values. Due to the fuzzy randomness of the system parameters ξ_1 and ξ_2, the objective (profit) function of the center is fuzzy random too. Suppose that the center has set a profit level 6200 and a probability level 0.9, and wants to maximize the chance of its profit function achieving 6400. Then we have the following FRDBP model:
\max_{x} \ \mathrm{Ch}\{ (\xi_1 - Y^*_1(x)) Y^*_1(x) + (\xi_2 - Y^*_2(x)) Y^*_2(x) \ge 6400 \}(\alpha)
\text{s.t.} \ x_{11} + x_{12} + x_{21} + x_{22} \le 40
\quad 0 \le x_{11} \le 10, \ 0 \le x_{12} \le 5
\quad 0 \le x_{21} \le 15, \ 0 \le x_{22} \le 20
\text{where } y^*_1, y^*_2 \text{ solve the problems}
\quad \min_{y_{11}, y_{12}} \ (y_{11} - 4.0)^2 + (y_{12} - 13.0)^2
\quad \text{s.t.} \ 4y_{11} + 7y_{12} \le 10x_{11}, \ 6y_{11} + 3y_{12} \le 10x_{12}, \ 0 \le y_{11}, y_{12} \le 20
\quad \min_{y_{21}, y_{22}} \ (y_{21} - 35.0)^2 + (y_{22} - 2.0)^2
\quad \text{s.t.} \ 4y_{21} + 7y_{22} \le 10x_{21}, \ 6y_{21} + 3y_{22} \le 10x_{22}, \ 0 \le y_{21}, y_{22} \le 40.    (11)

Running the hybrid intelligent algorithm for 200 generations, we get the best solution x* = (6.20, 3.58, 12.18, 18.03), and the corresponding chance is 0.71. For the allocation x*, the optimal solution and objective of factory 1 are (1.30, 8.12) and 31.09, respectively; the optimal solution and objective of factory 2 are (30.05, 0.00) and 28.48, respectively. That is, the center can achieve the profit level 6200 with credibility 0.71 at the given probability level 0.90. However, this is at the expense of the objective value of factory 1.
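The lower level of model (11) can be checked independently of the fuzzy random part: for a fixed allocation x, the two follower problems are small quadratic programs. The sketch below solves them with scipy.optimize.minimize (the solver choice and function names are ours); for the crisp optimum x* = (7, 3, 12, 18) it should reproduce, approximately, the reactions (0, 10) and (30, 0) quoted above.

```python
from scipy.optimize import minimize

def follower_reactions(x11, x12, x21, x22):
    """Rational reactions of the two factories in model (11) for a given allocation x."""
    def solve(target, a_rhs, b_rhs, ub):
        # min (y1 - t1)^2 + (y2 - t2)^2  s.t.  4y1 + 7y2 <= a_rhs,  6y1 + 3y2 <= b_rhs,  0 <= y <= ub
        obj = lambda y: (y[0] - target[0]) ** 2 + (y[1] - target[1]) ** 2
        cons = [{'type': 'ineq', 'fun': lambda y: a_rhs - 4 * y[0] - 7 * y[1]},
                {'type': 'ineq', 'fun': lambda y: b_rhs - 6 * y[0] - 3 * y[1]}]
        res = minimize(obj, x0=[0.0, 0.0], bounds=[(0, ub), (0, ub)],
                       constraints=cons, method='SLSQP')
        return res.x
    y1 = solve((4.0, 13.0), 10 * x11, 10 * x12, 20.0)     # factory 1
    y2 = solve((35.0, 2.0), 10 * x21, 10 * x22, 40.0)     # factory 2
    return y1, y2

print(follower_reactions(7.0, 3.0, 12.0, 18.0))
```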
6 Conclusions
In this paper, we proposed the FRDBP model as well as a hybrid intelligent algorithm. As shown in their application to the hierarchical resource allocation problem, they
could be used to solve two-level decentralized decision-making problems in fuzzy random environments, such as those arising in government policy making and engineering.
References 1. Aiyoshi E., and Shimizu K.: Hierarchical decentralized system and its new sollution by a barrier method. IEEE Transactions on System, Man, and Cybernetics SMC 11 (1981) 444–449 2. Bard J.F., Plummer J., and Sourie J.C.: A bilevel programming approach to determining tax credits for biofuel production. European Journal of Operational Research 120 (2000) 30–46 3. Ben-Ayed O., Blair C.E.: Computational difficulties of bilevel linear programming. Operations Research 38 (1990) 556–560 4. Bracken J., McGill J.M.: Mathematical programs with optimization problems in the constraints. Operations Research 21 (1973) 37–44 5. Bracken J., McGill J.M.: A method for solving Mathematical programs with nonlinear problems in the constraints. Operations Research 22 (1974) 1097–1101 6. Candler W., Fortuny-Amat W. and McCarl B.: The potential role of multi-level programming in agricultural economics. American Journal of Agricultural Economics 63 (1981) 521–531 7. Clark P.A., Westerberg A.: Bilevel programming for chemical process design— I. Fundamentals and algorithms. Computer and Chemical Engineering 14 (1990) 87–97 8. Cybenko G.: Approximations by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2 (1989), 183–192 9. Dempe S.: Foundations of bilevel programming Kluwer Academic Publishers, Dordrecht, 2002 10. Gao J., Liu B.: New primitive chance measures of fuzzy random event. International Journal of Fuzzy Systems 3 (2001) 527–531 11. Gao J., Liu B. and Gen M.: A hybrid intelligent algorithm for stochastic multilevel programming. IEEJ Transactions on Electronics, Information and Systems 124-C (2004) 1991-1998 12. Gao J., Liu B.: On crisp equivalents of fuzzy chance-constrained multilevel programming. Proceedings of the 2004 IEEE International Conference on Fuzzy Systems Budapest, Hungary, July 26-29, 2004, pp.757-760 13. Gao J., Liu B.: Fuzzy multilevel programming with a hybrid intelligent algorithm. Computer & Mathmatics with applications 49 (2005) 1539-1548 14. Gao J., Liu B.: Fuzzy dependent-chance multilevel programming with application to resource allocation problem. Proceedings of the 2005 IEEE International Conference on Fuzzy Systems Reno, Nevada, May 22-25, 2005, pp.541-545 15. Gao J., Liu Y.: Stochastic Nash equilibrium with a numerical solution method. In: Wang J. et al, (eds.): Advances in Neural Networks-ISNN2005. Lecture Notes in Computer Science, Vol. 3496. Springer-Verlag, Berlin Heidelberg New York (2005) 811–816 16. Hornik K., Stinchcombe M. and White H.: Multilayer feedforward networks are universal approximators. Neural Networks, 2 (1989), 359–366 17. Kwakernaak H.: Fuzzy random variables–I: Defnitions and theorems. Information Sciences 15 (1978) 1–29
18. Kwakernaak H.: Fuzzy random variables–II: Algorithms and examples for the discrete case. Information Sciences 17 (1979) 253–278 19. Lee E.S., Shih H.S.: Fuzzy and Multi-level Decision Making Springer-Verlag, London, 2001 20. Liu B.: Stackelberg-Nash equilibrium for multi-level programming with multiple followers using genetic algorithm. Comput. Math. Appl. 36 (1998) 79–89 21. Liu B.: Dependent-chance programming in fuzzy environments. Fuzzy Sets and Systems 109 (2000) 95–104 22. Liu B.: Fuzzy random chance-constrained programming. IEEE Transactions on Fuzzy Systems 9 (2001) 713–720 23. Liu B.: Fuzzy random dependent-chance programming. IEEE Transactions on Fuzzy Systems 9 (2001) 721–726 24. Liu B., Liu Y.: Expected value of fuzzy variable and fuzzy expected value models, IEEE Transactions on Fuzzy Systems 10 (2002) 445–450 25. Liu B.: A Survey of Entropy of Fuzzy Variables. Journal of Uncertain Systems, 1 (2007) 1–10 26. Liu B.: Uncertainty Theory, 2nd ed., Springer-Verlag, Berlin, 2007. 27. Liu Y., Gao J.: Convergence criteria and convergence relations for sequences of fuzzy random variables. Lecture Notes in Artificial Intelligence 3613 (2005) 321–331 28. Liu Y.: Convergent results about the use of fuzzy simulation in fuzzy optimization problems. IEEE Transactions on Fuzzy Systems 14/2 (2006) 295–304 29. Liu Y., Gao J.: The dependence of fuzzy variables with applications to fuzzy random optimization. International Journal of Uncertainty, Fuzziness & KnowledgeBased Systems to be published 30. Patriksson M., Wynter L.: Stochastic mathematicl programs with equilibrium constraints. Operations research letters 25 (1999) 159–167 31. Sahin K.H., and Ciric A.R.: A dual temperature simulated annealing approach for solving bilevel programming problems. Computers and Chemical Engineering 23 (1998) 11–25 32. Shimizu K., Aiyoshi E.: A new computational method for Stackelberg and minmax problems by use a penalty method. IEEE Transactions on Automatic Control AC-26 (1981) 460–466 33. Suh S., Kim T.: Solving nonlinear bilevel programming models of the equilibrium network desing problem: A comparative review. Annals Operations Research 34 (1992) 203–218 34. Vicente L., Calamai P.H.: Bilevel programming and multi-level programming: A bibliography review. Journal of Global Optimization 5 (1994) 35. Wen U.P.: Linear bilevel programming problems—A review. Journal of the Operational Research Society 42 (1991) 125–133 36. Yang H., Bell M.G.H.: Transport bilevel programming problems: recent methodological advances. Transportation Research: Part B: 35 (2001) 1–4 37. Zhao R., Liu B.: Renewal Process with Fuzzy Interarrival Times and Rewards. International Journal of Uncertainty, Fuzziness & Knowledge-Based Systems 11 (2003) 573–586 38. Zhao R., Tang W.: Some Properties of Fuzzy Random Processes. IEEE Transactions on Fuzzy Systems 14/2 (2006) 173–179
Fuzzy Optimization Problems with Critical Value-at-Risk Criteria

Yan-Kui Liu1,2, Zhi-Qiang Liu2, and Ying Liu1

1 College of Mathematics & Computer Science, Hebei University, Baoding 071002, Hebei, China
[email protected], [email protected]
2 School of Creative Media, City University of Hong Kong, Hong Kong, China
[email protected]
1
Introduction
It is known that production games [16] feature transferable utility and strong cooperative incentives, they are appealing in several aspects such as the characteristic function can be explicitly defined and easy to compute. In stochastic decision systems, production games were extended to accommodate uncertainty about events not known ex ante, and planning then took the form of two-stage stochastic programming [17]. In fuzzy decision systems, based on possibility theory [2,14,18], fuzzy linear production programming games were presented by Nishizaki and Sakawa [15], but they belong to static production games. Fuzzy two-stage production games rely on the optimization model developed in [13] as well as the work in this paper. In literature, two-stage and multistage stochastic programming problems have been studied extensively [5], and applied to many real world decision problems, especially decision problems involving risk [4]. Our objective in this paper is to take credibility theory [7,8,9,10,11,12] as the theoretical foundation of fuzzy optimization [1,3,6,13], and present a new class of two-stage fuzzy optimization problem with critical VaR criteria in the objective. In the proposed fuzzy optimization problem, infeasibility of first-stage decisions is accepted, but has D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 267–274, 2007. c Springer-Verlag Berlin Heidelberg 2007
268
Y.-K. Liu, Z.-Q. Liu, and Y. Liu
to be compensated for afterward, hence second-stage or recourse actions are required. Because two-stage fuzzy optimization problems are inherently infinitedimensional optimization problems that can rarely be solved directly, algorithms to solve such optimization problems must rely on intelligent computing and approximation scheme, which results in approximating finite-dimensional optimization problems. This fact motivates us to present an approximation approach to critical VaR objective and combine it with GA and NN to solve the proposed optimization problem. In the following section we formulate a new class of two-stage fuzzy optimization problem with critical VaR criteria in the objective. Section 3 discusses the issue of approximating critical VaR function and deals with the convergence of the approximation method. In Section 4, we design an HA based on the approximation scheme to solve the proposed fuzzy optimization problems, and provide a numerical example to show the effectiveness of the HA. Finally, we draw conclusions in Section 5.
2
Problem Formulation
Consider the following fuzzy linear programming min cT x + q T (γ)y subject to: T (γ)x + W (γ)y = h(γ) x ∈ X, y ∈ n+2 .
⎫ ⎪ ⎪ ⎬ ⎪ ⎪ ⎭
(1)
We assume that all ingredients above have conformal dimensions, that X ⊂ n1 is a nonempty closed polyhedron, and that some components of q(γ), h(γ), T (γ) and W (γ) are fuzzy variables defined on a credibility space (Γ, P(Γ ), Cr), where Γ is the universe, P(Γ ) the power set of Γ , and Cr the credibility measure defined in [9]. Decision variables are divided into two groups: first-stage variable x to be fixed before observation of γ, and second-stage variables y to be fixed after observation of γ. Given x ∈ X and γ ∈ Γ , denote Q(x, γ) = min{q T (γ)y | W (γ)y = h(γ) − T (γ)x, y ∈ n+2 }.
(2)
According to linear programming theory, the function Q(x, γ) is real-valued on m2 almost sure with respect to γ provided that W (γ)(n+2 ) = m2 and {u ∈ m2 | W (γ)T u ≤ q(γ)} = ∅ almost sure with respect to γ, which will be assumed throughout the paper. With a preselected threshold φ0 ∈ , the excess credibility functional QC (x) = Cr γ ∈ Γ | cT x + Q(x, γ) > φ0 measures the credibility of facing total fuzzy objective values exceeding φ0 . For instance, if φ0 is a critical cost level, then the excess credibility is understood as the ruin credibility.
Fuzzy Optimization Problems with Critical Value-at-Risk Criteria
269
However, excess credibility does not quantify the extend to which objective value exceeds the threshold. The latter can be achieved by another risk measure, the critical VaR. Denote by Φ(x, φ) = Cr γ ∈ Γ | cT x + Q(x, γ) ≤ φ the credibility distribution of the fuzzy variable cT x+Q(x, γ). With a preselected credibility 0 < α < 1, the critical VaR at α is defined by QαVaR (x) = inf {φ | Φ(x, φ) ≥ α} . As a consequence, a two-stage fuzzy programming with VaR objective reads min {QαVaR (x) : x ∈ X} .
(3)
Since we will discuss the issue of approximation of the problem (3) when the distribution of γ is continuous and approximated by a discrete one, we are interested in the properties of the α-VaR QαVaR as a function of x as well as the distribution of γ. Toward that end, it will be convenient to introduce the induced credibility measure Cˆr = Cr ◦ ξ −1 on N , and reformulate the optimization problem (3) as follows min {QαVaR (x, Cˆr) : x ∈ X} where
(4)
ˆ ≤φ ≥α , QαVaR (x, Cˆr) = inf φ | Cˆr ξˆ ∈ Ξ cT x + Q(x, ξ)
ˆ is defined as the second-stage value function Q(x, ξ)
ˆ = min q T (ξ)y ˆ W (ξ)y ˆ = h(ξ) ˆ − T (ξ)x, ˆ y ∈ n2 , Q(x, ξ) +
(5)
ˆ hT (ξ), ˆ W1· (ξ), ˆ . . . , Wm · (ξ), ˆ T1· (ξ), ˆ . . . , Tm · (ξ)) ˆ T is the realizaand ξˆ = (q T (ξ), 2 2 tion value of fuzzy vector ξ such that Wi· is the ith row of the matrix W, and Ti· is the ith row of the matrix T.
3
Approximation Approach to VaR
To solve the proposed fuzzy optimization problem (4), it is required to calculate the following VaR at α,
ˆ ≤φ ≥α U : x → QαVaR (x, Cˆr) = inf φ | Cˆr ξˆ ∈ Ξ cT x + Q(x, ξ) (6) repeatedly, where Ξ is the support of ξ described in Section 2. For simplicity, ˆ ≡ W. we assume in this section the matrix W is fixed, i.e., W (ξ) mthat 2 +n2 +m2 n1 Suppose that Ξ = i=1 [ai , bi ] with [ai , bi ] the supports of ξi , i = 1, 2, · · · , m2 + n2 + m2 n1 , respectively. In the following, we adopt the approximation method proposed in [13] to approximate the possibility distribution of ξ by a sequence of possibility distributions of primitive fuzzy vectors ζn , n = 1, 2, · · ·. The method can be described as follows.
For each integer n, define ζ_n = (ζ_{n,1}, ζ_{n,2}, · · · , ζ_{n,m_2+n_2+m_2 n_1})^T as

\zeta_n = h_n(\xi) = (h_{n,1}(\xi_1), h_{n,2}(\xi_2), \dots, h_{n,m_2+n_2+m_2 n_1}(\xi_{m_2+n_2+m_2 n_1}))^T,

where the fuzzy variables ζ_{n,i} = h_{n,i}(ξ_i), i = 1, 2, · · · , m_2 + n_2 + m_2 n_1,

h_{n,i}(u_i) = \max \Big\{ \frac{k_i}{n} \ \Big|\ k_i \in Z, \ \frac{k_i}{n} \le u_i \Big\}, \quad u_i \in [a_i, b_i],

and Z is the set of all integers. As a consequence, the possibility of ζ_{n,i}, denoted by ν_{n,i}, is as follows:

\nu_{n,i}\Big(\frac{k_i}{n}\Big) = \mathrm{Pos}\Big\{ \zeta_{n,i} = \frac{k_i}{n} \Big\} = \mathrm{Pos}\Big\{ \frac{k_i}{n} \le \xi_i < \frac{k_i + 1}{n} \Big\}

for k_i = [n a_i], [n a_i] + 1, · · · , K_i. By the definition of ξ_i, one has ξ_i(γ) − 1/n < ζ_{n,i}(γ) ≤ ξ_i(γ) for all γ ∈ Γ and i = 1, 2, · · · , m_2 + n_2 + m_2 n_1, which implies that the sequence {ζ_n} of discrete fuzzy vectors converges uniformly to the fuzzy vector ξ on Γ. In what follows, the sequence {ζ_n} of primitive fuzzy vectors is referred to as the discretization of the fuzzy vector ξ. For each fixed n, the fuzzy vector ζ_n takes K = K_1 K_2 · · · K_{m_2+n_2+m_2 n_1} values, denoted by

\hat{\zeta}_n^k = (\hat{\zeta}_{n,1}^k, \dots, \hat{\zeta}_{n,m_2+n_2+m_2 n_1}^k), \quad k = 1, \dots, K.
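To illustrate the discretization for a single coordinate, the sketch below builds the grid points k/n and their possibilities ν_{n,i}(k/n) for a triangular fuzzy variable. Evaluating Pos over each cell [k/n, (k+1)/n) by maximizing the membership on a fine sub-grid is our own implementation choice, not part of the paper.

```python
import math

def triangular_mu(x, a, b, c):
    if a <= x <= b:
        return (x - a) / (b - a) if b > a else 1.0
    if b < x <= c:
        return (c - x) / (c - b) if c > b else 1.0
    return 0.0

def discretize(a, b, c, n, sub=50):
    """Return the grid points k/n covering [a, c] and their possibilities Pos{k/n <= xi < (k+1)/n}."""
    points, poss = [], []
    k = math.floor(n * a)
    while k / n <= c:
        lo, hi = k / n, (k + 1) / n
        # Pos over the cell, approximated by maximizing the membership on a sub-grid
        nu = max(triangular_mu(lo + j * (hi - lo) / sub, a, b, c) for j in range(sub))
        points.append(lo)
        poss.append(nu)
        k += 1
    return points, poss

pts, nus = discretize(7.0, 8.0, 9.0, n=4)
print(list(zip(pts, nus)))
```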
We now replace the possibility distribution of ξ by that of ζ_n, and approximate Q_{αVaR}(x, \hat{Cr}) by Q_{αVaR}(x, \hat{Cr}_n), with \hat{\mathrm{Cr}}_n = \mathrm{Cr} \circ \zeta_n^{-1}, provided n is sufficiently large. Toward that end, denote

\nu_k = \nu_{n,1}(\hat{\zeta}_{n,1}^k) \wedge \nu_{n,2}(\hat{\zeta}_{n,2}^k) \wedge \cdots \wedge \nu_{n,m_2+n_2+m_2 n_1}(\hat{\zeta}_{n,m_2+n_2+m_2 n_1}^k)

for k = 1, 2, · · · , K, where ν_{n,i} are the possibility distributions of ζ_{n,i}, i = 1, 2, · · · , m_2 + n_2 + m_2 n_1, respectively. For each integer k, we solve the second-stage linear programming problem (5) via the simplex method, and denote the optimal value by Q(x, \hat{\zeta}_n^k). Letting φ_k = c^T x + Q(x, \hat{\zeta}_n^k), the α-VaR Q_{αVaR}(x, \hat{Cr}_n) can be computed by

U(x) = \min \{ \varphi_k \mid c_k \ge \alpha \},    (7)

where

c_k = \frac{1}{2}\big( 1 + \max\{ \nu_j \mid \varphi_j \le \varphi_k \} - \max\{ \nu_j \mid \varphi_j > \varphi_k \} \big).    (8)

The process for computing the α-VaR Q_{αVaR}(x, \hat{Cr}) is summarized as

Algorithm 1 (Approximation Algorithm)
Step 1. Generate the K points \hat{\zeta}_n^k = (\hat{\zeta}_{n,1}^k, \dots, \hat{\zeta}_{n,m_2+n_2+m_2 n_1}^k) uniformly from the support Ξ of ξ for k = 1, 2, · · · , K.
Step 2. Solve the second-stage linear programming problem (5), denote the optimal value by Q(x, \hat{\zeta}_n^k), and set φ_k = c^T x + Q(x, \hat{\zeta}_n^k) for k = 1, 2, · · · , K.
Step 3. Set ν_k = ν_{n,1}(\hat{\zeta}_{n,1}^k) \wedge ν_{n,2}(\hat{\zeta}_{n,2}^k) \wedge \cdots \wedge ν_{n,m_2+n_2+m_2 n_1}(\hat{\zeta}_{n,m_2+n_2+m_2 n_1}^k) for k = 1, 2, · · · , K.
Step 4. Compute c_k = \hat{\mathrm{Cr}}_n\{ c^T x + Q(x, \zeta_n) \le \varphi_k \} for k = 1, 2, · · · , K according to formula (8).
Step 5. Return U(x) via the estimation formula (7).
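Once the pairs (φ_k, ν_k) from Steps 1–3 are available, Steps 4–5 reduce to a direct transcription of formulas (7) and (8), as in the sketch below (the sample values are illustrative placeholders, not from the paper).

```python
def var_from_samples(phis, nus, alpha):
    """Estimate the alpha-VaR via (7)-(8): the smallest phi_k whose credibility c_k reaches alpha."""
    candidates = []
    for phi_k in phis:
        pos_below = max((nu for phi, nu in zip(phis, nus) if phi <= phi_k), default=0.0)
        pos_above = max((nu for phi, nu in zip(phis, nus) if phi > phi_k), default=0.0)
        c_k = 0.5 * (1.0 + pos_below - pos_above)        # formula (8)
        if c_k >= alpha:
            candidates.append(phi_k)
    return min(candidates) if candidates else None       # formula (7)

# Illustrative values phi_k = c^T x + Q(x, zeta_n^k) and possibilities nu_k:
phis = [10.0, 12.0, 15.0, 18.0]
nus = [0.3, 1.0, 0.8, 0.2]
print(var_from_samples(phis, nus, alpha=0.9))   # -> 15.0
```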
The convergence of Algorithm 1 is ensured by the following theorem. As a consequence, the α-VaR Q_{αVaR}(x, \hat{Cr}) can be estimated by formula (7) provided that n is sufficiently large.

Theorem 1. Consider the two-stage fuzzy programming problem (4). Suppose W is fixed, ξ = q or (h, T) is a continuous fuzzy vector, and β ∈ (0, 1) is a prescribed confidence level. If ξ is a bounded fuzzy vector, and the sequence {ζ_n} of primitive fuzzy vectors is the discretization of ξ, then for any given x ∈ X, we have

\lim_{n \to \infty} Q_{\beta\mathrm{VaR}}(x, \hat{\mathrm{Cr}}_n) = Q_{\beta\mathrm{VaR}}(x, \hat{\mathrm{Cr}})
provided that β is a continuity point of the function QαVaR (x, Cˆr) at α = β. Proof. By the suppositions of Theorem 1, and the properties of Q(x, ξ), the proof of the theorem is similar to that of [10, Theorem 2].
4 HAs and Numerical Example
In the following, we incorporate the approximation method, NN, and GA to produce an HA for solving the proposed fuzzy optimization problem. First, we generate a set of training data for Q_{αVaR}(x, \hat{Cr}) by the approximation method. Then, using the generated input-output data, we train an NN by the fast BP algorithm to approximate Q_{αVaR}(x, \hat{Cr}). We repeat the BP algorithm until the error for all vectors in the training set is reduced to an acceptable value, or until the specified number of training epochs has been performed. After that, we use new data (which have not been learned by the NN) to test the trained NN. If the test results are satisfactory, then we stop the training process; otherwise, we continue to train the NN. After the NN is well trained, it is embedded into a GA to produce an HA. During the solution process, the output values of the trained NN are used to represent the approximate values of Q_{αVaR}(x, \hat{Cr}). Therefore, it is not necessary to compute Q_{αVaR}(x, \hat{Cr}) by the approximation method during the solution process, so that much time can be saved. The process of the HA for solving the proposed fuzzy optimization problem is summarized as

Algorithm 2 (Hybrid Algorithm)
Step 1. Generate a set of input-output data for the critical VaR function U : x \mapsto Q_{αVaR}(x, \hat{Cr}) by the proposed approximation method;
Step 2. Train an NN to approximate the critical VaR function U(x) by the generated input-output data;
Step 3. Initialize pop size chromosomes at random;
Step 4. Update the chromosomes by crossover and mutation operations;
Step 5. Calculate the objective values for all chromosomes by the trained NN;
Step 6. Compute the fitness of each chromosome according to the objective values;
Step 7. Select the chromosomes by spinning the roulette wheel;
Step 8. Repeat Step 4 to Step 7 for a given number of cycles;
Step 9. Report the best chromosome as the optimal solution.

We now give a numerical example to show the effectiveness of the designed HA.

Example 1. Consider the following two-stage fuzzy programming problem, in which q and h contain fuzzy variables:

\min_{x} \ Q_{0.9\mathrm{VaR}}(x)
\text{s.t.} \ x_1 + x_2 + 2x_3 \le 15, \quad 2x_1 - x_2 + x_3 \le 6, \quad -2x_1 + 2x_2 \le 8, \quad x_1, x_2, x_3 \ge 0,

where c^T x + Q(x, γ) = 3x_1 + 2x_2 − 4x_3 + Q(x, γ),

Q(x, \gamma) = \min \ q_1(\gamma) y_1 + q_2(\gamma) y_2 + y_3 + q_4(\gamma) y_4 + y_5
\text{s.t.} \ y_1 + y_2 - 3y_4 - 2y_5 = h_1(\gamma) + x_1 - x_3
\quad 18y_1 - 8y_2 + 6y_3 = h_2(\gamma) - x_1 + 2x_2 - x_3
\quad -y_1 - 9y_2 + 14y_3 + 8y_5 = h_3(\gamma) + x_1 - x_2
\quad y_k \ge 0, \ k = 1, 2, \dots, 5,

and q_1, q_2, q_4, h_1, h_2, and h_3 are mutually independent triangular fuzzy variables (7, 8, 9), (5, 6, 7), (9, 10, 11), (23, 24, 25), (16, 17, 18), and (20, 21, 22), respectively.

For any given feasible solution x, we use 10000 samples in the approximation method to estimate the 0.9-VaR Q_{0.9VaR}(x). Using this method, we first produce 3000 input-output data x_j → Q_{0.9VaR}(x_j), j = 1, · · · , 3000; then we use the data to train an NN to approximate the VaR function Q_{0.9VaR}(x) (3 input neurons representing the value of the decision x, 10 hidden neurons, and 1 output neuron representing the value of Q_{0.9VaR}(x)). After the NN is well trained, it is embedded into a GA to produce an HA that searches for the optimal solutions. To identify the influence of the parameters on solution quality, we compare the solutions obtained under careful variation of the GA parameters. The computational results are reported in Table 1, where the parameters of the GA include the population size pop size, the crossover probability Pc, and the mutation probability Pm.
Table 1. Comparison solutions of Example 1

pop size   Pc    Pm    Optimal solution            Optimal value
30         0.3   0.2   (0.0000, 1.0000, 7.0000)    112.232580
30         0.3   0.1   (0.0000, 1.0000, 7.0000)    112.232517
30         0.2   0.2   (0.0000, 1.0000, 7.0000)    112.234515
30         0.1   0.3   (0.0000, 1.0000, 7.0000)    112.234459
20         0.1   0.3   (0.0000, 1.0000, 7.0000)    112.234479
20         0.3   0.2   (0.0000, 1.0000, 7.0000)    112.234522
20         0.3   0.1   (0.0000, 1.0000, 7.0000)    112.234615
20         0.2   0.2   (0.0000, 1.0000, 7.0000)    112.234518
From Table 1, we can see that the optimal solutions and the optimal objective values change little when different GA parameters are selected, which implies that the HA is robust to the parameter settings and effective in solving this fuzzy two-stage programming problem.
5 Conclusions
In this paper, we have formulated a novel class of two-stage fuzzy programming with recourse problems based on VaR criteria. In order to compute the critical VaR objective, we presented an approximation approach for fuzzy variables with infinite supports, and discussed the convergence of the approximation scheme. Furthermore, we designed an HA, which combines the approximation approach, GA and NN, to solve the proposed fuzzy optimization problem, and provided a numerical example to show the effectiveness of the HA. Acknowledgements. This work was partially supported by the National Natural Science Foundation of China under Grant No.70571021, the Natural Science Foundation of Hebei Province under Grant No.A2005000087, and the CityUHK SRG 7001794 & 7001679.
References 1. Chen, Y., Liu, Y.K., Chen, J.: Fuzzy Portfolio Selection Problems Based on Credibility Theory. In: Yeung, D.S., Liu, Z.Q., et al. (eds.): Advances in Machine Learning and Cybernetics. Lecture Notes in Artificial Intelligence, Vol.3930, SpringerVerlag, Berlin Heidelberg (2006) 377-386 2. Dubois, D., Prade, H.: Possibility Theory. Plenum Press, New York (1988) 3. Gao, J., Liu, B.: Fuzzy Multilevel Programming with a Hybrid Intelligent Algorithm. Computer & Mathematics with Applications 49 (2005) 1539-1548 4. Hogan, A.J., Morris, J.G., Thompson, H.E.: Decision Problems under Risk and Chance Constrained Programming: Dilemmas in the Transition. Management Science 27 (1981) 698-716 5. Kibzun, A.I., Kan, Y.S.: Stochastic Programming Problems with Probability and Quantile Functions. Wiley, Chichester (1996)
6. Liu, B.: Theory and Practice of Uncertain Programming. Physica-Verlag, Heidelberg (2002) 7. Liu, B.: Uncertainty Theory: An Introduction to Its Axiomatic Foundations. Springer-Verlag, Berlin Heidelberg New York (2004) 8. Liu, B: A Survey of Entropy of Fuzzy Variables. Journal of Uncertain Systems 1 (2007) 1-11 9. Liu, B., Liu, Y.K.: Expected Value of Fuzzy Variable and Fuzzy Expected Value Models. IEEE Trans. Fuzzy Syst. 10 (2002) 445-450 10. Liu, Y.K.: Convergent Results About the Use of Fuzzy Simulation in Fuzzy Optimization Problems. IEEE Trans. Fuzzy Syst 14 (2006) 295-304 11. Liu, Y.K., Liu, B., Chen, Y.: The Infinite Dimensional Product Possibility Space and Its Applications. In: Huang, D.-S. Li, K., Irwin, G.W. (eds.): Computational Intelligence. Lecture Notes in Artificial Intelligence, Vol.4114, Springer-Verlag, Berlin Heidelberg (2006) 984-989 12. Liu, Y.K. Wang, S.: Theory of Fuzzy Random Optimization. China Agricultural University Press, Beiing (2006) 13. Liu, Y.K.: Fuzzy Programming with Recourse. Int. J. Uncertainty Fuzziness Knowl.-Based Syst. 13 (2005) 381-413 14. Nahmias, S.: Fuzzy Variables. Fuzzy Sets Syst. 1 (1978) 97-101 15. Nishizaki, I., Sakawa, M.: On Computional Methods for Solutions of Multiobjective Linear Production Programming Games. European Journal of Operational Research 129 (2001) 386-413 16. Owen, G.: On the Core of Linear Production Games. Math. Programming 9 (1975) 358-370 17. Sandsmark, M.: Production Games under Uncertainty. Comput. Economics 14 (1999) 237-253 18. Zadeh, L.A.: Fuzzy Sets as a Basis for a Theory of Possibility. Fuzzy Sets Syst. 1 (1978) 3-28
Neural-Network-Driven Fuzzy Optimum Selection for Mechanism Schemes

Yingkui Gu and Xuewen He

School of Mechanical & Electronical Engineering, Jiangxi University of Science and Technology, Ganzhou, Jiangxi 341000, China
[email protected]
Abstract. Product conceptual design is an innovative activity in which the schemes of products are formed and optimized. Identification of the best conceptual design candidate is a crucial step, since design information is incomplete and design knowledge is minimal at the conceptual design stage. It is necessary to select the best scheme from the feasible alternatives through comparison and filtering. In this paper, the evaluation system for mechanism schemes is first established based on a performance analysis of the mechanism system and the opinions of experts. Then, a fuzzy optimum selection model for mechanism scheme evaluation is provided. Combining the fuzzy optimum selection model with neural network theory, a rational pattern for determining the topological structure of the network is provided, together with a weight-adjusted BP model of the neural network for the fuzzy optimum selection of mechanism schemes. Finally, an example is given to verify the effectiveness and feasibility of the proposed method.
1 Introduction

Mechanism scheme design is the core of mechanical product conceptual design. Conceptual design is a process of developing design candidates based on design requirements. At the conceptual design stage, a number of design candidates are usually generated, all of which satisfy the design requirements. Therefore, identification of the best conceptual design candidate is a crucial step, since design information is incomplete and design knowledge is minimal at the conceptual design stage. The evaluation and selection of schemes are important tasks in mechanism conceptual design. How to establish a reasonable evaluation system and how to establish an effective selection model are the key problems for designers to study. In recent years, many methods have been presented to evaluate mechanism schemes. In particular, recent advances in soft computing techniques, including fuzzy sets [1-8], neural networks [9-11] and genetic algorithms, provide new tools for developing intelligent systems with the capabilities of modeling uncertainty and learning under fuzzy and uncertain development environments. Applications of soft computing in mechanism scheme optimum selection have resulted in computerized systems.
Chen, Cai and Song [12] introduced a case-based reasoning product conceptual design system. In this system, similar product cases can be evaluated based on design and manufacturing knowledge, so the optimum solution of product conceptual design can be acquired. Jiang and Hsu [13] presented a manufacturability scheme evaluation decision model based on fuzzy logic and multiple attribute decision-making under the concurrent engineering environment. Huang, Li and Xue [14] used fuzzy synthetic evaluation to evaluate and select the optimal grinding machining scheme. Huang, Tian and Zuo [15] introduced an intelligent interactive multiobjective optimization method to evaluate reliability design schemes based on the physical programming theory proposed by Messac [16]. Sun, Kalenchuk, Xue and Gu [17] presented a method for design candidate evaluation and identification using neural network-based fuzzy reasoning. Xue and Dong [18] developed a fuzzy-based design function coding system to identify design candidates from design functions. Bahrami, Lynch and Dagli [19] used fuzzy associative memory, a two-layer feedforward neural network, to describe the relationships between customer needs and design candidates. Sun, Xie and Xue [20] presented a drive type decision system based on the one-against-one mode of the support vector machine through identification of the characteristics and the type decisions. Huang, Bo and Chen [21] presented an integrated computational intelligence approach to generate and evaluate concept design schemes, where neural networks, fuzzy sets and genetic algorithms are used to evaluate and select the optimal design scheme. Although the methods proposed above are effective and feasible in evaluating mechanism schemes, there still exist some disadvantages, such as calculation difficulty, strong subjectivity and low evaluation efficiency. In this paper, a neural-network-driven fuzzy optimum selection method is introduced, based on the fuzzy optimum selection theory proposed by Chen [22-24], for solving the problems of modeling uncertainty and improving computational efficiency in the process of identifying mechanism schemes. The evaluation system of mechanism schemes is first established based on a performance analysis of the mechanism system and the opinions of experts. Then, the fuzzy optimum selection model for mechanism scheme evaluation is provided. Combining the fuzzy optimum selection model with neural network theory, a rational pattern for determining the topological structure of the network is provided, together with a weight-adjusted BP neural network model derived from the fuzzy optimum selection model for mechanism schemes. Results show that the proposed method offers a new way to evaluate and select the optimum mechanism scheme from a scheme set.
2 The Fuzzy Optimum Selection of Mechanism Schemes
2.1 Establishment of the Evaluation Index System
A mechanism scheme is usually composed of several sub-systems. In the conceptual design stage, it is necessary to select the best scheme from the feasible alternatives through comparison and filtering. Therefore, a reasonable and effective evaluation index system should be established to evaluate and optimize the mechanism scheme set. Based on the performance analysis of the mechanism system and the opinions of experts, the evaluation index system of mechanism schemes is established as shown in Figure 1.
Fig. 1. The evaluation system of mechanism scheme
In Figure 1, U is the satisfaction degree. R1 is the basic function: R11 is the kinematic precision and R12 is the transmission precision. R2 is the working function: R21 is the operation speed, R22 is the adjustability and R23 is the loading capacity. R3 is the dynamical function: R31 is the maximal acceleration, R32 is the noise, R33 is the reliability and R34 is the anti-abrasion. R4 is the economical performance: R41 is the design cost, R42 is the manufacturing cost, R43 is the sensitivity to manufacturing errors, R44 is the convenience of adjustment and R45 is the energy consumption. R5 is the structural performance: R51 is the dimension, R52 is the weight and R53 is the complexity of the structure. The evaluation index system is an objective set that the mechanism scheme should attain; therefore, the system should have the characteristics of integrality, independency and quantifiability.

2.2 The Fuzzy Optimum Selection Model of Mechanism Schemes
It is assumed that there are n mechanism schemes satisfying the constraint conditions. Each scheme is evaluated according to m evaluation objectives. Let $x_{ij}$ be the eigenvalue of the $i$th objective of the $j$th scheme, and $r_{ij}$ be the relative membership degree of the objective eigenvalue $x_{ij}$. The objective eigenvalues can be categorized into the following two categories:

(1) The larger, the better. Let $x_{i\max} = x_{i1} > x_{i2} > \cdots > x_{in}$; then $r_{ij} = x_{ij} / x_{i\max}$.

(2) The smaller, the better. Let $x_{i\min} = x_{i1} < x_{i2} < \cdots < x_{in}$; then $r_{ij} = x_{i\min} / x_{ij}$.
The relative membership degree matrix of the n mechanism schemes can be expressed as

$$R = \begin{bmatrix} r_{11} & r_{12} & \cdots & r_{1n} \\ r_{21} & r_{22} & \cdots & r_{2n} \\ \vdots & \vdots & & \vdots \\ r_{m1} & r_{m2} & \cdots & r_{mn} \end{bmatrix} = (r_{ij}), \quad i = 1, 2, \ldots, m, \; j = 1, 2, \ldots, n . \quad (1)$$
The relative membership degree vector of the $j$th scheme is $r_j = (r_{1j}, r_{2j}, \ldots, r_{mj})^T$. We can define the relative membership degree vectors of the optimum scheme and the bad scheme as $(1, 1, \ldots, 1)^T$ and $(0, 0, \ldots, 0)^T$, respectively. The Hamming distance between the $j$th scheme and the optimum scheme is

$$d_{jg} = \sum_{i=1}^{m} w_{ij} (1 - r_{ij}) = 1 - \sum_{i=1}^{m} w_{ij} r_{ij} , \quad (2)$$
where $w_{ij}$ is the weight of the $i$th objective of the $j$th scheme. For each scheme $j$, the weights should satisfy the constraint

$$\sum_{i=1}^{m} w_{ij} = 1 . \quad (3)$$
The Hamming distance between the $j$th scheme and the bad scheme is

$$d_{jb} = \sum_{i=1}^{m} w_{ij} (r_{ij} - 0) = \sum_{i=1}^{m} w_{ij} r_{ij} . \quad (4)$$
Let the relative membership degree of the $j$th scheme to the optimum scheme be $u_j$, and its relative membership degree to the bad scheme be $u_j^c$; then

$$u_j^c = 1 - u_j . \quad (5)$$
The weighted distance between the $j$th scheme and the optimum scheme is

$$D_{jg} = u_j d_{jg} . \quad (6)$$
The weighted distance between the $j$th scheme and the bad scheme is

$$D_{jb} = u_j^c d_{jb} = (1 - u_j) d_{jb} . \quad (7)$$
In order to obtain the optimum value of the relative membership degree of the $j$th scheme, the optimization criterion is established as follows [22]:

$$\min F(u_j) = D_{jg}^2 + D_{jb}^2 = u_j^2 \left(1 - \sum_{i=1}^{m} w_{ij} r_{ij}\right)^2 + (1 - u_j)^2 \left(\sum_{i=1}^{m} w_{ij} r_{ij}\right)^2 . \quad (8)$$
Let $dF(u_j)/du_j = 0$.
The optimization model expressed by the Hamming distance can then be given as follows [24]:

$$u_j = \frac{1}{1 + \left[\dfrac{1 - \sum_{i=1}^{m} w_{ij} r_{ij}}{\sum_{i=1}^{m} w_{ij} r_{ij}}\right]^2} . \quad (9)$$
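To make the computation in Eqs. (1) and (9) concrete, the following minimal Python sketch normalizes raw criterion values into relative membership degrees and evaluates the optimum membership degree of each scheme. The criterion values, number of schemes and equal weights are invented purely for illustration and are not taken from the paper.

```python
import numpy as np

def relative_membership(x, larger_is_better):
    """Normalize raw criterion values x (one criterion, n schemes) to r in (0, 1]."""
    x = np.asarray(x, dtype=float)
    return x / x.max() if larger_is_better else x.min() / x

def fuzzy_optimum_degree(R, W):
    """u_j = 1 / (1 + [(1 - sum_i w_ij r_ij) / (sum_i w_ij r_ij)]**2), Eq. (9)."""
    s = np.sum(W * R, axis=0)          # sum_i w_ij r_ij for each scheme j
    return 1.0 / (1.0 + ((1.0 - s) / s) ** 2)

# toy example: 3 criteria (rows) x 2 schemes (columns), equal weights
R = np.vstack([
    relative_membership([0.9, 0.7], larger_is_better=True),
    relative_membership([120, 150], larger_is_better=False),
    relative_membership([0.8, 0.8], larger_is_better=True),
])
W = np.full_like(R, 1.0 / 3.0)         # each column of weights sums to 1
print(fuzzy_optimum_degree(R, W))      # one optimum membership degree per scheme
```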
3 BP-Neural-Network-Driven Fuzzy Optimum Selection Model
The evaluation method based on neural networks is an evaluation method based on examples. It only requires the user to offer enough samples for training the network; the evaluation results can then be obtained from the trained network. Because the back-propagation neural network has the ability to learn from examples, it has been used in pattern matching, pattern classification and pattern recognition. Therefore, it can be used to establish the neural-network-driven fuzzy optimum selection model for mechanism schemes. A back-propagation (BP) neural network is a multi-layer network with an input layer, an output layer, and some hidden layers between the input and output layers. Each layer has a number of processing units, called neurons. A neuron simply computes the sum of its weighted inputs, subtracts its threshold from the sum, and passes the result through its transfer function. One of the most important characteristics of BP neural networks is their ability to learn from examples. With proper training, the network can memorize the knowledge required for problem solving in a particular domain [25]. Back-propagation neural networks are named after their training algorithm, known as error back-propagation or the generalized delta rule. The training of such a network starts with assigning random values to all the weights. An input is then presented to the network and the output from each neuron in each layer is propagated forward through the entire network to reach an actual output. The error for each neuron in the output layer is computed as the difference between the actual output and its corresponding target output. This error is then propagated backwards through the entire network and the weights are updated. The weights for a particular neuron are adjusted in direct proportion to the error in the units to which it is connected. In this way the error is reduced and the network learns. As shown in Figure 2, a three-layer BP neural network is selected to reflect the established fuzzy optimum selection model. The network has m input nodes, l hidden nodes and one output node. The number of input-layer nodes is the number of evaluation objectives of the fuzzy optimum selection, and the input of the neural network is the relative membership degree of each objective. The output of the neural network is the relative membership degree of the evaluated scheme. In the input layer, the input and output of the $i$th node are $r_{ij}$ and $u_{ij}$ respectively, where $i = 1, 2, \ldots, m$, $j = 1, 2, \ldots, n$. In the hidden layer, the input and output of the $k$th node are $I_{kj}$ and $u_{kj}$ respectively. $w_{ik}$ is the joint weight between the $i$th node and the $k$th node.
Fig. 2. BP-neural-network-driven fuzzy optimum selection model
There is only one node $p$ in the output layer, and its input and output are $I_{pj}$ and $u_{pj}$ respectively. $w_{kp}$ is the joint weight between the hidden layer and the output layer. The inputs and outputs of the network are listed in Table 1.

Table 1. The input and output of the fuzzy optimum selection BP neural network
Node | Input | Output | Joint weight
$i$th node of the input layer | $r_{ij}$ | $u_{ij} = r_{ij}$ | —
$k$th node of the hidden layer | $I_{kj} = \sum_{i=1}^{m} w_{ik} r_{ij}$ | $u_{kj} = 1 / \{1 + [(\sum_{i=1}^{m} w_{ik} r_{ij})^{-1} - 1]^2\}$ | $\sum_{i=1}^{m} w_{ik} = 1$, $w_{ik} \ge 0$
node $p$ of the output layer | $I_{pj} = \sum_{k=1}^{l} w_{kp} u_{kj}$ | $u_{pj} = 1 / \{1 + [(\sum_{k=1}^{l} w_{kp} u_{kj})^{-1} - 1]^2\}$ | $\sum_{k=1}^{l} w_{kp} = 1$, $w_{kp} \ge 0$
The actual output $u_{pj}$ is the response of the fuzzy optimum selection BP neural network to the input $r_{ij}$. Let the expected output of the $j$th scheme be $M(u_{pj})$; its squared error is

$$E_j = \frac{1}{2}\left[u_{pj} - M(u_{pj})\right]^2 . \quad (10)$$
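The node equations in Table 1 can be traced with a short sketch. The following Python fragment (not from the paper) performs one forward pass through the three-layer structure with randomly initialized, column-normalized weights and evaluates the squared error of Eq. (10); the dimensions follow the case study in Section 4 (17 inputs, 5 hidden nodes), while the input values and target are made up.

```python
import numpy as np

def forward(r_j, w_ik, w_kp):
    """One forward pass of the fuzzy optimum selection network for scheme j (Table 1)."""
    I_k = w_ik.T @ r_j                              # hidden-node inputs, sum_i w_ik * r_ij
    u_k = 1.0 / (1.0 + (1.0 / I_k - 1.0) ** 2)      # hidden-node outputs
    I_p = w_kp @ u_k                                # output-node input, sum_k w_kp * u_kj
    u_p = 1.0 / (1.0 + (1.0 / I_p - 1.0) ** 2)      # output: relative membership degree
    return u_p

def squared_error(u_p, target):
    """E_j = 0.5 * (u_pj - M(u_pj))**2, Eq. (10)."""
    return 0.5 * (u_p - target) ** 2

m, l = 17, 5                                        # 17 criteria, 5 hidden nodes (as in the case study)
rng = np.random.default_rng(0)
w_ik = rng.random((m, l)); w_ik /= w_ik.sum(axis=0) # columns sum to 1, weights >= 0
w_kp = rng.random(l);      w_kp /= w_kp.sum()
r_j = rng.uniform(0.5, 1.0, m)                      # relative membership degrees of one scheme
u_p = forward(r_j, w_ik, w_kp)
print(u_p, squared_error(u_p, 0.9))
```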
4 Case Study
To investigate the model developed above, an example of optimum design scheme selection for the cutting paper machine is given. The design requirements are as follows:
a) Scheme 1    b) Scheme 2    c) Scheme 3
Fig. 3. The scheme set of the cutting paper machine
(1) The speed of cutting paper is constant. (2) The reliability is high. (3) The structure of the machine is simple and easy to design and manufacture. Through detailed analysis, three schemes are presented, as shown in Figure 3. Applying the evaluation system proposed in Section 2 and the neural-network-driven fuzzy optimum model presented in Section 3, a three-layer BP neural network is established as shown in Figure 4. The network has 17 input nodes, 5 hidden nodes and one output node. The input of the neural network is the relative membership degree of each objective, and the output is the relative membership degree of the evaluated scheme. The input values of each scheme are listed in Table 2, and the output values of the relative membership degree of each scheme are listed in Table 3. By comparison, the first scheme has a higher relative membership degree than the other two schemes and is adopted as the optimum scheme of the cutting machine.
Fig. 4. A three-layer BP neural network model for the fuzzy optimum selection of the cutting paper machine schemes
Table 2. The input value of each scheme
Scheme | r11 | r12 | r21 | r22 | r23 | r31 | r32 | r33 | r34 | r41 | r42 | r43 | r44 | r45 | r51 | r52 | r53
j=1 | 1.0 | 0.75 | 0.75 | 0.75 | 0.75 | 0.75 | 1.0 | 0.5 | 0.5 | 1.0 | 0.5 | 0.75 | 0.75 | 0.75 | 0.75 | 0.75 | 0.75
j=2 | 0.75 | 0.75 | 0.75 | 0.75 | 0.75 | 0.75 | 0.5 | 0.5 | 0.75 | 1.0 | 0.75 | 0.75 | 0.75 | 0.5 | 0.75 | 0.75 | 0.5
j=3 | 0.75 | 0.75 | 0.75 | 0.75 | 0.75 | 0.75 | 0.5 | 0.75 | 0.75 | 1.0 | 0.75 | 0.75 | 0.75 | 0.75 | 0.5 | 0.75 | 0.75
Table 3. The output value of the relative membership degree of each scheme

Scheme | Relative membership degree | Order
j=1 | 0.8654 | 1
j=2 | 0.7548 | 3
j=3 | 0.8012 | 2
5 Conclusions
The problem of mechanism scheme evaluation is a kind of expert decision problem that requires repeated evaluation, and its essential characteristics are fuzziness and uncertainty. The experience of experts has a very important influence on the evaluation result. The evaluation method proposed in this paper can describe the property values of the evaluation and the non-linear relationships among the evaluation results well. Applying the proposed method decreases the complexity and subjectivity of scheme evaluation and improves the rationality of the evaluation results. Neural-network-driven fuzzy optimum selection offers a new way for the evaluation of mechanism schemes.
Acknowledgment This research was partially supported by China Postdoctoral Science Foundation under Grant 20060391029.
References 1. Huang, H.Z., Zuo, M.J., Sun, Z.Q.: Bayesian Reliability Analysis for Fuzzy Lifetime Data. Fuzzy Sets and Systems 157 (2006) 1674-1686 2. Huang, H.Z., Wang, P., Zuo, M.J., Wu W.D., Liu, C.S.: A Fuzzy Set Based Solution Method for Multiobjective Optimal Design Problem of Mechanical and Structural Systems Using Functional-Link Net. Neural Computing & Applications 15 (2006) 239-244 3. Huang, H.Z., Wu, W.D., Liu, C.S.: A Coordination Method for Fuzzy Multi-Objective Optimization of System Reliability. Journal of Intelligent and Fuzzy Systems 16 (2005) 213-220 4. Huang, H.Z., Li, H.B.: Perturbation Fuzzy Finite Element Method of Structural Analysis Based on Variational Principle. Engineering Applications of Artificial Intelligence 18 (2005) 83-91
5. Huang, H.Z., Tong, X., Zuo, M.J.: Posbist Fault Tree Analysis of Coherent Systems. Reliability Engineering and System Safety 84 (2004) 141-148 6. Huang, H.Z.: Fuzzy Multi-Objective Optimization Decision-Making of Reliability of Series System. Microelectronics and Reliability 37 (1997) 447-449 7. Huang, H.Z.: Reliability Analysis Method in the Presence of Fuzziness Attached to Operating Time. Microelectronics and Reliability 35 (1995) 1483-1487 8. Zhang, Z., Huang, H.Z., Yu, L.F.: Fuzzy Preference Based Interactive Fuzzy Physical Programming and Its Application in Multi-objective Optimization. Journal of Mechanical Science and Technology 20 (2006) 731-737 9. Xue, L.H., Huang, H.Z., Hu, J., Miao, Q., Ling, D.: RAOGA-based Fuzzy Neural Network Model of Design Evaluation. Lecture Notes in Artificial Intelligence 4114 (2006) 206-211 10. Huang, H.Z., Tian, Z.G.: Application of Neural Network to Interactive Physical Programming. Lecture Notes in Computer Science 3496 (2005) 725-730 11. Li, H.B., Huang, H.Z., Zhao, M.Y.: Finite Element Analysis of Structures Based on Linear Saturated System Model. Lecture Notes in Computer Science 3174 (2004) 820-825 12. Song, Y.Y., Cai, F.Z., Zhang, B.P.: One of Case-Based Reasoning Product Conceptual Design Systems. Journal of Tsinghua University 38 (1998) 5-8 13. Jiang, B., Hsu, C.H.: Development of a Fuzzy Decision Model for Manufacturability Evaluation. Journal of Intelligent Manufacturing 14 (2003) 169-181 14. Huang, H.Z., Li, Y.H., Xue, L.H.: A Comprehensive Evaluation Model for Assessments of Grinding Machining Quality. Key Engineering Materials 291-292 (2005) 157-162 15. Huang, H.Z., Tian, Z.G., Zuo, M.J.: Intelligent Interactive Multiobjective Optimization Method and Its Application to Reliability Optimization. IIE Transactions on Quality and Reliability 37 (2005) 983-993 16. Messac, A., Sukam, C.P., Melachrinoudis, E.: Mathematical and Pragmatic Perspectives of Physical Programming. AIAA Journal 39 (2001) 885-893 17. Sun, J., Kalenchuk, D.K., Xue, D., Gu, P.: Design Candidate Identification Using Neural Network-Based Fuzzy Reasoning. Robotics and Computer Integrated Manufacturing 16 (2000) 383-396 18. Xue, D., Dong, Z.: Coding and Clustering of Design and Manufacturing Features for Concurrent Design. Computers in Industry 34 (1997) 139-53 19. Bahrami, A., Lynch, M., Dagli, C.H.: Intelligent Design Retrieval and Packing System: Application of Neural Networks in Design and Manufacturing. International Journal of Production Research 33 (1995) 405-426 20. Sun, H.L., Xie, J.Y., Xue, Y.F.: Mechanical Drive Type Decision Model Based on Support Vector Machine. Journal of Shanghai Jiao Tong University 39 (2005) 975-978 21. Huang, H.Z., Bo, R.F., Chen, W.: An Integrated Computational Intelligence Approach to Product Concept Generation and Evaluation. Mechanism and Machine Theory 41 (2006) 567-583 22. Chen, S.Y.: Engineering Fuzzy Set Theory and Application. National Defence Industry Press, Beijing (1998) 23. Chen, S.Y., Nie, X.T., Zhu, W.B., Wang, G.L.: A Model of Fuzzy Optimization Neural Networks and Its Application. Advances in Water Science 10 (1999) 69-74 24. Chen, S.Y.: Multi-Objective Decision-Making Theory and Application of Neural Network with Fuzzy Optimum Selection. Journal of Dalian University of Technology 37 (1997) 693-698 25. Zhang, Y.F., Fuh, J.Y.H.: A Neural Network Approach for Early Cost Estimation of Packaging Products. Computers and Industrial Engineering 34 (1998) 433-450
Atrial Arrhythmias Detection Based on Neural Network Combining Fuzzy Classifiers Rongrong Sun and Yuanyuan Wang Department of Electronic Engineering, Fudan University, Postfach 20 04 33, Shanghai, China {041021082,yywang}@fudan.edu.cn
Abstract. Accurate detection of atrial arrhythmias is important for implantable devices to treat them. A novel method is proposed to identify sinus rhythm, atrial flutter and atrial fibrillation. Here three different feature sets are firstly extracted based on the frequency-domain, the time-frequency domain and the symbolic dynamics. Then a classifier with two sub-layers is proposed. Three fuzzy classifiers are used as the first layer to perform pre-classification task corresponding to different feature sets respectively. A multilayer perceptron neural network is used as the final classifier. The performance of this algorithm is evaluated with two databases. One is the MIT-BIH arrhythmia database and the other is the endocardial electrogram database. A comparative assessment of the performance of the proposed classifier with individual fuzzy classifier shows that the algorithm can improve the overall accuracy for atrial arrhythmias classification. The implementation of this algorithm in implantable devices may provide accurate detection of atrial arrhythmias.
1 Introduction
Cardiac arrhythmias are alterations of cardiac rhythm that disrupt the normal synchronized contraction sequence of the heart and reduce pumping efficiency. Among them, atrial fibrillation (AF) is the most common arrhythmia and is associated with a considerable risk of morbidity and mortality [1]. Recently, automatic external defibrillators introduced for home use, as well as automatic implantable device therapies for atrial arrhythmias, have become more sophisticated in their ability to deliver several modes of therapy, such as antitachycardia pacing and defibrillation, depending on the specific rhythm. If a false positive (FP) occurs, for example when a normal sinus rhythm is misinterpreted as AF, an unnecessary shock will be given, which can damage the heart and cause inconvenience to the patient. It is therefore critical to accurately detect tachycardias that can potentially be terminated by pacing [2]. Several research groups have been working on the detection problem, and a number of detection and analysis techniques have evolved in the time domain [3-5], the frequency domain [6, 7], the time-frequency domain [8], and nonlinear dynamics and chaos theory [9]. However, most of these methods are based on a single feature,
in which only one parameter is extracted to depict the signal. The feature is then compared directly with a chosen threshold to discriminate between different arrhythmias, which may lead to a higher error rate. Other multi-feature-based algorithms improve the classification accuracy only to a limited extent [10], since these features are usually extracted from just one aspect of the signal. In order to overcome these problems, data fusion models have been introduced, because they can exploit information from different sources [11]. In this study, a novel method which fuses different feature sets is proposed for atrial arrhythmias detection. Three feature sets are first extracted based on the frequency domain, the time-frequency domain and the symbolic dynamics of the signals, respectively. Then three parallel fuzzy clustering classifiers are used to perform the pre-classification task, using the three feature sets as inputs. Finally, a multilayer perceptron (MLP) neural network is used to combine the three parallel fuzzy classifiers to make a final decision.
2 Data Acquisition
Two databases of electrogram signals are studied in this paper. One is the MIT-BIH arrhythmia database and the other is the canine endocardial database. From the MIT-BIH arrhythmia database, sinus rhythm (SR), atrial flutter (AFL), and atrial fibrillation (AF) recordings are selected, digitized at a sampling frequency of 360 Hz. The canine endocardial electrograms are obtained by an 8×8 electrode array (with a 2 mm inter-electrode distance) sewn on the atrial surface of six dogs. During SR, AFL, and AF, 20-second simultaneous recordings from each dog are digitized at a sampling frequency of 2000 Hz with 16-bit resolution.
All data are split into 2-second segments for the analysis. As an example, a segment of the SR, AFL, and AF signals in the MIT-BIH database is shown in Figure 1. The MIT-BIH database includes 150 segments each of SR, AFL, and AF, and the canine database includes 300 segments each of SR, AFL, and AF.
Fig. 1. A segment of AFL, SR, and AF signals in the MIT-BIH database
3 Feature Extraction
Most previous methods focus on a single feature of electrogram signals, resulting in low accuracy. In this study, three sets of features are extracted from the frequency domain, the time-frequency domain and symbolic dynamics, respectively. The three feature sets are the input vectors for the three parallel fuzzy clustering classifiers.

3.1 Frequency-Domain Features
The first set of features consists of the coefficients of a fifth-order autoregressive model of the signal, which reflect the information of the signal in the frequency domain.

3.2 Time-Frequency Domain Features
The second set of features is obtained from the time-frequency domain of the signal after the wavelet transformation. First, signals are transformed into the time-frequency domain using the wavelet decomposition on scales a = 1~5 with Daubechies-4 as the basic wavelet function. The wavelet coefficient matrix of the signal is obtained in the time-frequency domain. Since singular values are an inherent property of a matrix, the singular values of the wavelet coefficient matrix are taken as features of the signal.
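As an illustration of the frequency-domain feature set, the sketch below estimates the coefficients of a fifth-order autoregressive model by ordinary least squares on lagged samples. It is only a hedged approximation (the paper does not specify its estimation method), and the 720-sample synthetic segment simply stands in for a 2-second recording at 360 Hz.

```python
import numpy as np

def ar_coefficients(x, order=5):
    """Least-squares estimate of AR(order) coefficients of a 1-D signal segment."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    # lagged design matrix: x[n] ~ a1*x[n-1] + ... + a_p*x[n-p]
    X = np.column_stack([x[order - k - 1: len(x) - k - 1] for k in range(order)])
    y = x[order:]
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

segment = np.sin(np.linspace(0, 20 * np.pi, 720)) + 0.1 * np.random.default_rng(1).standard_normal(720)
print(ar_coefficients(segment))        # 5 frequency-domain features for this segment
```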
3.3 Symbolic Dynamics Features
The traditional techniques of data analysis in the time and frequency domains are often not sufficient to characterize the complex dynamics of the electrocardiogram (ECG). In this study, symbolic dynamics is used to analyze the nonlinear dynamics of the ECG. The concept of symbolic dynamics is based on the elimination of detailed information in order to keep the robust properties of the dynamics by a coarse-graining of the measurements [12]. In this way, the time series is transformed into a symbol sequence $S_n$ according to Equation (1). Here, the symbol set $\Omega = \{0, 1, 2\}$ is used. Figure 2 presents examples of the transformation. The transformation is based on the mean value $\mu$ of each analyzed time series and on a non-dimensional parameter $\alpha$ that characterizes the ranges in which the symbols are defined:

$$S_n = \begin{cases} 0 & \text{if } b_n > \left(1 + \tfrac{\alpha}{2}\right)\mu , \\ 1 & \text{if } \left(1 - \tfrac{\alpha}{2}\right)\mu < b_n \le \left(1 + \tfrac{\alpha}{2}\right)\mu , \\ 2 & \text{if } b_n \le \left(1 - \tfrac{\alpha}{2}\right)\mu . \end{cases} \quad (1)$$

Here $n = 1, 2, \ldots, N$, where $N$ is the number of samples of the signal and $b_n$ are the values of the time series.
In order to characterize the symbol strings obtained by transforming the time series into $S_n$, the probability distribution of words of length l = 3 is analyzed. The words consist of three symbols, giving a total of 3^l = 27 possible word types; the number of overlapping symbols in consecutive words is one. The occurrence probability of each word type is taken as the third feature set.
Fig. 2. Description of the basic principle of symbolic dynamics, the symbol extraction from a time series and the construction of words
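A minimal sketch of the symbolic-dynamics features follows: the series is symbolized according to Eq. (1) and the occurrence probabilities of the 27 three-symbol words (consecutive words overlapping by one symbol) are collected. The value of α and the synthetic segment are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from itertools import product

def symbolize(b, alpha=0.1):
    """Map a time series to symbols {0, 1, 2} around its mean, Eq. (1)."""
    mu = b.mean()
    s = np.ones(len(b), dtype=int)                  # default symbol 1
    s[b > (1 + alpha / 2) * mu] = 0
    s[b <= (1 - alpha / 2) * mu] = 2
    return s

def word_distribution(s, length=3, step=2):
    """Occurrence probability of each 3-symbol word (consecutive words overlap by one symbol)."""
    words = [tuple(s[i:i + length]) for i in range(0, len(s) - length + 1, step)]
    counts = {w: 0 for w in product(range(3), repeat=length)}   # all 27 word types
    for w in words:
        counts[w] += 1
    total = len(words)
    return np.array([counts[w] / total for w in sorted(counts)])

rng = np.random.default_rng(0)
segment = 1.0 + 0.2 * rng.standard_normal(4000)     # stand-in for a 2-second electrogram segment
features = word_distribution(symbolize(segment))
print(features.shape, features.sum())               # 27 probabilities summing to 1
```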
4 Multi-parallel Fuzzy Clustering Classifiers
After feature extraction, three sets of features based on the frequency domain, the time-frequency domain and symbolic dynamics are obtained. They differ significantly in what they represent, which makes it difficult to accommodate them in a single classifier. Furthermore, three sets of features can result in a feature vector of high dimensionality for a single classifier, which may increase the computational complexity and cause accuracy problems. Additionally, appropriately scaling the three sets of features could be a difficult task in itself. In order to overcome these problems, multiple classifiers based on the different feature sets are used, and their outputs have similar properties (e.g., confidence values) which can be combined with relative ease. Here, three parallel fuzzy clustering classifiers are used, corresponding to the three sets of features, and they output membership values for each class. Suppose there are N classifiers C1, …, CN and M classes S1, …, SM. For each set of features, the mean feature vectors ci = [ci1, ci2, …, cin] of each class are taken as the center vectors estimated from the training data, 1 ≤ i ≤ M, where n is the dimensionality of the feature vector. xj = [xj1, xj2, …, xjn] represents the feature vector of the testing data set, 1 ≤ j ≤ p, where p is the number of testing data. U_k ∈ R^{M×p} denotes the membership matrix of the kth fuzzy
clustering classifier, the element μ ij in the matrix represents the membership value of
the feature vector $x_j$ to the $i$th class decided by the $k$th classifier. $\mu_{ij}$ is calculated as follows:

$$\mu_{ij} = \frac{\left(1 / \|x_j - c_i\|^2\right)^{1/(b-1)}}{\sum_{k=1}^{M} \left(1 / \|x_j - c_k\|^2\right)^{1/(b-1)}} , \quad i = 1, 2, \ldots, M, \; j = 1, 2, \ldots, P . \quad (2)$$

$\|x_j - c_i\|$ is the distance between the feature vector $x_j$ and the center vector $c_i$ of class $i$, and $b$ is a parameter that controls the degree of fuzziness; here $b = 2$. The fuzzy clustering classifiers pre-classify the input vector $x_j$ to all classes with different membership values, and $\mu_{ij}$ provides the degree of confidence that a fuzzy classifier associates with the proposition $x_j \in S_i$. So for each input feature vector, the output of each of the N classifiers can be completely represented by an M-dimensional vector $V_i = (v_{i1}, v_{i2}, \ldots, v_{iM})$, $1 \le i \le N$, where each component $v_{ij}$ in the vector is a label associated with class $S_j$ given by classifier $C_i$. After fuzzy clustering, the location of the input vector $x_j$ in the feature vector space is more precise, which is easy for human beings to interpret.
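Equation (2) can be implemented in a few lines. The sketch below computes the membership matrix of a set of test vectors with respect to the class centers; the toy dimensions and random data are assumptions made for illustration, and a small epsilon is added to avoid division by zero when a test vector coincides with a center.

```python
import numpy as np

def membership_matrix(X_test, centers, b=2.0, eps=1e-12):
    """mu[i, j]: membership of test vector x_j to class i, Eq. (2) with fuzziness b."""
    # squared Euclidean distances, shape (M classes, P test vectors)
    d2 = ((centers[:, None, :] - X_test[None, :, :]) ** 2).sum(axis=2) + eps
    inv = (1.0 / d2) ** (1.0 / (b - 1.0))
    return inv / inv.sum(axis=0, keepdims=True)     # columns sum to 1

# toy data: 3 classes, 4-dimensional features, 5 test vectors
rng = np.random.default_rng(0)
centers = rng.random((3, 4))                        # class mean vectors from training data
X_test = rng.random((5, 4))
U = membership_matrix(X_test, centers)
print(U.shape, U.sum(axis=0))                       # (3, 5), each column sums to 1
```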
5 Classifiers Combination Using MLP
The outputs of the individual fuzzy classifiers are not redundant; they can be combined to form a multi-classifier decision that takes advantage of the strengths of the individual classifiers and diminishes their weaknesses in solving the same problem.
Fig. 3. The structure of the classifier using MLP neural network to combine classifiers
Here, a classifier combination method based on the MLP neural network is proposed. The MLP does not have a hidden layer, and the membership values V1, …, VN from the three parallel fuzzy clustering classifiers form its input vector; here N = 3. Z = (z1, z2, …, zM) is its output, which is responsible for the final classification of atrial arrhythmias. Here zi is a confidence value positively associated with the decision on class Ci: the higher the value of zi, the higher the associated degree of confidence. The whole structure of the classifier is shown in Figure 3. This network is trained by back-propagation, minimizing the mean square error (MSE); the transfer function is the sigmoid. The advantage of such a network is that each weight has an apparent meaning in the role that each classifier plays in the combination: the weight ω_ijk is the contribution to class Sk when classifier Ci assigns membership vij to class Sj. After the training procedure of the whole network with the training data is finished, the cluster centers of the three parallel fuzzy clustering layers as well as the weights of the neural network are frozen and ready for use in the retrieval mode. For each input signal, the three sets of features are first extracted as the inputs of the three parallel fuzzy clustering classifiers, which generate the membership values. The membership value vector activates the MLP network, and the output of the network indicates the final membership of the input signal to the appropriate class of atrial arrhythmias. The signal is decided to belong to the class from which the largest membership value comes.
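The combining network described above — no hidden layer, sigmoid transfer, trained by back-propagation on the MSE — can be sketched as a single sigmoid layer over the stacked membership vectors. The code below is an illustrative reconstruction, not the authors' implementation; the bias term, learning rate, epoch count and toy data are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_combiner(V, T, lr=0.5, epochs=2000, seed=0):
    """Single-layer sigmoid combiner: V is (samples, N*M) stacked memberships, T is (samples, M) targets."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(V.shape[1], T.shape[1]))
    bias = np.zeros(T.shape[1])
    for _ in range(epochs):
        Z = sigmoid(V @ W + bias)                   # confidence value for each class
        grad = (Z - T) * Z * (1 - Z)                # dMSE/dZ * sigmoid'
        W -= lr * V.T @ grad / len(V)
        bias -= lr * grad.mean(axis=0)
    return W, bias

def combine(V, W, bias):
    return sigmoid(V @ W + bias).argmax(axis=1)     # class with the largest confidence

# toy example: N = 3 fuzzy classifiers, M = 3 classes, 6 training samples
rng = np.random.default_rng(1)
V = rng.random((6, 9))
T = np.eye(3)[rng.integers(0, 3, 6)]                # one-hot targets
W, bias = train_combiner(V, T)
print(combine(V, W, bias))
```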
In this paper, all analysis is performed on a PC with a P-IV 2.80 GHz CPU and 504 MB RAM using Matlab 7.1.
6 Experimental Results
100 episodes of each class of rhythm in the MIT-BIH database are randomly selected as the initial training data of the algorithm and the others as testing data. For the canine database, the numbers of training and testing data are 200 and 100, respectively. Evaluation of the sensitivity (SE), specificity (SP), and accuracy (AC) of the method for arrhythmia classification is carried out with the two databases. Each individual fuzzy clustering classifier is also used to classify the signals, and the results are compared with those obtained by the MLP which combines the classifiers. Tables 1-8 show the experimental results.

Table 1. Performance of the fuzzy clustering classifier based on frequency-domain features with the MIT-BIH database

Actual type | SR | AF | AFL | Total | SE | SP | AC
SR  | 30 |  1 | 19 | 50 | 60.0 | 34.0 | 42.7
AF  | 28 | 19 |  3 | 50 | 38.0 | 99.0 | 78.7
AFL | 38 |  0 | 12 | 50 | 24.0 | 78.0 | 60.0
Table 2. Performance of the fuzzy clustering classifier based on time-frequency domain features with the MIT-BIH database

Actual type | SR | AF | AFL | Total | SE | SP | AC
SR  | 35 | 0 | 15 | 50 | 70.0 | 92.0 | 84.7
AF  |  1 | 9 | 40 | 50 | 18.0 | 96.0 | 70.0
AFL |  7 | 4 | 39 | 50 | 78.0 | 45.0 | 56.0
Table 3. Performance of the fuzzy clustering classifier based on symbolic dynamics features with the MIT-BIH database

Actual type | SR | AF | AFL | Total | SE | SP | AC
SR  | 38 |  1 | 11 | 50 | 76.0 | 89.0 | 84.7
AF  |  7 | 41 |  2 | 50 | 82.0 | 93.0 | 89.3
AFL |  4 |  6 | 40 | 50 | 80.0 | 87.0 | 84.7
Table 4. Performance of the MLP combining classifiers with the MIT-BIH database

Actual type | SR | AF | AFL | Total | SE | SP | AC
SR  | 48 |  1 |  1 | 50 |  96.0 | 99.0 | 98.0
AF  |  0 | 50 |  0 | 50 | 100.0 | 97.0 | 98.0
AFL |  1 |  2 | 47 | 50 |  94.0 | 99.0 | 97.3
Table 5. Performance of the fuzzy clustering classifier based on frequency-domain features with the canine database

Actual type | SR | AF | AFL | Total | SE | SP | AC
SR  | 83 |  6 | 11 | 100 | 83.0 | 67.5 | 72.7
AF  | 34 | 14 | 52 | 100 | 14.0 | 91.0 | 65.3
AFL | 31 | 12 | 57 | 100 | 57.0 | 68.5 | 64.7
Table 6. Performance of the fuzzy clustering classifier based on time-frequency domain features with the canine database

Actual type | SR | AF | AFL | Total | SE | SP | AC
SR  | 66 |  4 | 30 | 100 | 66.0 | 77.5 | 73.7
AF  | 42 | 39 | 19 | 100 | 39.0 | 97.0 | 77.7
AFL |  3 |  2 | 95 | 100 | 95.0 | 75.5 | 82.0
Table 7. Performance of the fuzzy clustering classifier based on symbolic dynamics features with the canine database

Actual type | SR | AF | AFL | Total | SE | SP | AC
SR  | 78 |  6 | 16 | 100 | 78.0 | 65.5 | 69.7
AF  | 63 | 22 | 15 | 100 | 22.0 | 89.0 | 66.7
AFL |  6 | 16 | 78 | 100 | 78.0 | 84.5 | 82.3
Table 8. Performance of the MLP combining classifiers with the canine database

Actual type | SR | AF | AFL | Total | SE | SP | AC
SR  | 96 |  3 |   1 | 100 |  96.0 | 99.0 | 98.0
AF  |  2 | 97 |   1 | 100 |  97.0 | 98.5 | 98.0
AFL |  0 |  0 | 100 | 100 | 100.0 | 99.0 | 99.3
As shown in Tables 1-8, the performance of each individual fuzzy clustering classifier demonstrates that the three sets of features are complementary. A comparative assessment of the proposed method against the individual fuzzy clustering classifiers shows that more reliable results are obtained with the MLP neural network that combines the classifiers for the classification of atrial arrhythmias.
7 Conclusion
The new algorithm for atrial arrhythmia classification applies three sets of features, based on the frequency domain, the time-frequency domain, and symbolic dynamics, to characterize the signals. This paper therefore focuses on ways in which the information from different features can be combined in order to improve the classification accuracy. Here, an MLP neural network is used to combine classifiers to improve the classification accuracy. The algorithm is composed of two layers connected in cascade. The three parallel fuzzy clustering classifiers form the first layer; they use the three sets of features respectively and perform the pre-classification task. An MLP neural network which combines the former classifiers forms the second layer, and it makes the final decision on the ECG signals. The fuzzy clustering layer first analyses the distribution of the data and groups them into classes with different membership values. The neural network takes these membership values as its input vector and classifies the atrial arrhythmias into the appropriate class. This technique incorporates the fuzzy clustering method with back-propagation learning and combines their advantages. The two experimental databases used for evaluation of the method include not only ECG signals obtained by the standard 12-lead configuration on the human body surface but also endocardial electrograms obtained from the canine atrial surface. This demonstrates the generalizability of the method in distinguishing among various atrial arrhythmias in different types of databases. The algorithm can therefore provide accurate detection of atrial arrhythmias and can easily be implemented not only in automatic external defibrillators but also in automatic implantable devices.
Acknowledgement
This work was supported by the National Basic Research Program of China under Grant 2005CB724303, the Natural Science Foundation of China under Grant 30570488, and the Shanghai Science and Technology Plan, China, under Grant 054119612.
References 1. Chugh, S.S., Blackshear, J.L., Shen, W.K., Stephen, C.H., Bernard, J.G.: Epidemiology and Natural History of Atrial Fibrillation: Clinical Implications. Journal of the American College of Cardiology 37 (2) (2001) 371-377 2. Wellens, H.J., Lau, C.P., Luderitz, B., Akhtar, M., Waldo, A.L., Camm, A.J., Timmermans, C., Tse, H.F., Jung, W., Jordaens, L., Ayers, G.: Atrioverter: An Implantable Device for the Treatment of Atrial Fibrillation. Circulation 98 (16) (1998) 1651-1656 3. Sih, H.J., Zipes, D.P., Berbari, E.J., Olgin, J.E.: A High-temporal Resolution Algorithm for Quantifying Organization During Atrial Fibrillation. IEEE Transactions on Biomedical Engineering 46 (4) (1999) 440-450 4. Narayan, S.M., Valmik, B.: Temporal and Apatial Phase Analyses of the Electrocardiogram Stratify Intra- Atrial and Intra-ventricular Organization. IEEE Transactions on Biomedical Engineering 51 (10) (2004) 1749-1764 5. Faes, L., Nollo, G., Antolini, R.: A Method for Quantifying Atrial Fibrillation Organization Based on Wave-morphology Similarity. IEEE Transactions on Biomedical Engineering 49 (12) (2002) 1504-1513 6. Khadra, L., Al-Fahoum, A.S., Binajjaj, S.: A Quantitative Analysis Approach for Cardiac Arrhythmia Classification Using Higher Order Spectral Techniques. IEEE Transactions on Biomedical Engineering 52 (11) (2005) 1840-1845 7. Everett, T.H., KoK, L.C., Vaughn, R.H., Moorman, J.R., Haines, D.E.: Frequency Domain Algorithm for Quantifying Atrial Fibrillation Organization to Increase Defibrillation Efficacy. IEEE Transactions on Biomedical Engineering 48 (9) (2001) 969 -978 8. Stridth, M., Sornmo, L., Meurling, C.J., Olsson, S.B.: Characterization of Atrial Fibrillation Using the Surface ECG: Time-dependent Spectral Properties. IEEE Transactions on Biomedical Engineering 48 (1) (2001) 19-27 9. Zhang, X.S., Zhu, Y.S., Thakor, N.V.: Detecting Ventricular Tachycardia and Fibrillation by Complex Measure. IEEE Transactions on Biomedical Engineering 46 (5) (1999) 548-555 10. Xu, W.C., Tse, H.F., Chan, F.H.Y., Fung, P.C. W., Lee, K. L. F., Lau, C. P.: New Bayesian Discriminator for Detection of Atrial Tachyarrhythmias. Circulation 105 (12) (2002) 1472-1479 11. Gupta, L., Chung, B., Srinath, M.D., Molfese, D.L., Kook, H.: Multichannel Fusion Models for the Parametric Classification of Differential Brain Activity. IEEE Transactions on Biomedical Engineering 52 (11) (2005) 1869-1881 12. Baumert, M., Walther, T., Hopfe, J., Stepan, H., Faber, R., Voss, A.: Joint Symbolic Dynamic Analysis of Beat-to-beat Interactions of Heart Rate and Systolic Blood Pressure in Normal Pregnancy. Medical & Biological Engineering and Computing 40 (2002) 241-245
A Neural-Fuzzy Pattern Recognition Algorithm Based Cutting Tool Condition Monitoring Procedure Pan Fu1 and A.D. Hope2 1
Mechanical Engineering Faculty, Southwest JiaoTong University Chengdu 610031, China
[email protected] 2 Systems Engineering Faculty, Southampton Institute Southampton SO14 OYN, U.K.
[email protected]
Abstract. An intelligent tool wear monitoring system for the metal cutting process is introduced in this paper. The system is equipped with four kinds of sensors, signal transforming and collecting apparatus, and a microcomputer. A knowledge-based intelligent pattern recognition algorithm has been developed. The fuzzy driven neural network can carry out the integration and fusion of multi-sensor information. The weighted approaching degree measures the differences between signal features accurately, and the ANNs successfully recognize the tool wear states. The algorithm has strong learning and noise suppression abilities. This leads to successful tool wear classification under a range of machining conditions.
1 Introduction
Modern advanced machining systems in the "unmanned" factory must possess the ability to automatically change tools that have been subjected to wear or damage. This can ensure machining accuracy and reduce production costs. Coupling various transducers with intelligent data processing techniques to deliver improved information relating to tool condition makes optimization and control of the machining process possible. Many tool wear sensing methods have been suggested, but only some of these are suitable for industrial application. The research work of Lin and Yang [1] showed that both the normal cutting force coefficient and the friction coefficient could be represented as functions of tool wear. An approach was developed for in-process monitoring of tool wear in milling using frequency signatures of the cutting force [2]. An analytical method was developed for the use of three mutually perpendicular components of the cutting forces and vibration signature measurements [3]. The ensuing analyses in the time and frequency domains showed some components of the measured signals to correlate well with the accrued tool wear. A tool condition monitoring system was then established for cutting tool-state classification [4]. The investigation concentrated on tool-state classification using a single wear indicator and progressing to two wear indicators. In another study, the input features were
derived from measurements of acoustic emission during machining and the topography of the machined surfaces [5]. Li et al. showed that the frequency distribution of vibration changes as the tool wears, so the r.m.s. values of the different frequency bands measured indicate the tool wear condition [6]. Tool breakage and wear conditions were monitored in real time according to the measured spindle and feed motor currents, respectively [7]. Models of the relationships between the current signals and the cutting parameters were established under different tool wear states. Many kinds of advanced sensor fusion and intelligent data processing techniques have been used to monitor tool condition. A new on-line fuzzy neural network (FNN) model with four parts was developed [8]; these parts have the functions of classifying tool wear by using fuzzy logic, normalizing the inputs, and using a modified least-squares back-propagation neural network to estimate flank and crater wear. Parameters including forces, AE-rms, skew and kurtosis of force bands, as well as the total energy of forces, were employed as inputs. A new approach for online and indirect tool wear estimation in turning using neural networks was developed [9]. This technique uses a physical process model describing the influence of cutting conditions (such as the feed rate) on measured process parameters (here: cutting force signals) in order to separate signal changes caused by variable cutting conditions from signal changes caused by tool wear. Two methods using hidden Markov models, as well as several other methods that directly use force and power data, were used to establish the health of a drilling tool [10]. In order to increase the reliability of these methods, a decision fusion center algorithm (DFCA) was proposed which combines the outputs of the individual methods to make a global decision about the wear status of the drill. Experimental results demonstrated the high effectiveness of the proposed monitoring methods and the DFCA. In this study, a unique neural-fuzzy pattern recognition algorithm was developed to accomplish multi-sensor information integration and tool wear state classification. It combines the strong interpretation power of fuzzy systems and the adaptation and structuring abilities of neural networks. The monitoring system that has been developed provides accurate and reliable tool wear classification over a range of cutting conditions.
2 The Tool Condition Monitoring System
As shown in Fig. 1, the tool wear monitoring system is composed of four kinds of sensors, signal amplifying and collecting devices, and the microcomputer. Part of the condition monitoring experiments were carried out at the Advanced Manufacturing Lab of Southampton Institute, U.K. The experiments were carried out on a Cincinnati Milacron Sabre 500 (ERT) Vertical Machining Centre with computer numerical control. Sensors were installed around the workpiece being machined and four kinds of signal were collected to reflect the tool wear state comprehensively. Tool condition monitoring is a pattern recognition process in which the characteristics of the tool to be monitored are compared with those of the standard models. The process is composed of the following parts: determination of the membership functions of signal features, calculation of fuzzy distances, learning, and tool wear classification.
[Fig. 1 components: AE sensor, 2101PA preamplifier, 2102 analogue module, ADC 200 digital oscilloscope, KISTLER 9257B dynamometer, KISTLER 5807A charge amplifier, EX205 extension board, accelerometer, charge amplifier, PC 226 A/D board, current sensor, low-pass filter, main computer.]
Fig. 1. The tool condition monitoring system
3 Feature Extraction
Features are extracted from the time domain and the frequency domain. Only those features that are really relevant to the tool wear state are eventually selected for the subsequent
[Fig. 2 panels: (a) mean value of the power consumption signal; (b) standard deviation of the vibration signal; (c) spectra of the AE signal; (d) spectra of the cutting force (Fx) signal — each plotted against flank wear VB = 0 to 0.5 mm under cutting condition 1]
Fig. 2. Some sensor signal features
pattern recognition, as follows: for the power consumption signal, the mean value; for the AE-RMS signal, the mean value, skew and kurtosis; for the cutting force, AE and vibration signals, the mean value, the standard deviation and the mean power in 10 frequency ranges. As an example, Fig. 2 shows several features in the time and frequency domains (under cutting condition 1*). It can be seen that both the amplitude and the distribution pattern of these features change in a consistent manner along with the development of tool flank wear (VB).
4 The Similarity of Fuzzy Sets
The fuzzy approaching degree and the fuzzy distance can be used as quantitative indexes to represent the similarity of two fuzzy sets A and B. The features of the sensor signals of the tool condition monitoring system can reflect the tool wear states. For the standard models (cutting tools with standard flank wear values), the $j$th feature of the $i$th model can be considered as a fuzzy set $A_{ij}$. Theoretical analysis and experimental results show that these features can be regarded as modified normal distribution fuzzy sets.
4.1 Approaching Degree
Assume that $F(X)$ is the fuzzy power set of a universal set $X$ and that the map $N: F(X) \times F(X) \to [0, 1]$ satisfies: (a) $\forall A \in F(X)$, $N(A, A) = 1$; (b) $\forall A, B \in F(X)$, $N(A, B) = N(B, A)$; (c) if $A, B, C \in F(X)$ satisfy $|A(x) - C(x)| \ge |A(x) - B(x)|$ $(\forall x \in X)$, then $N(A, C) \le N(A, B)$. Then the map $N$ is an approaching degree on $F(X)$, and $N(A, B)$ is called the approaching degree of $A$ and $B$. It can be calculated by different methods; here the inner and outer products are used. Assume that $A, B \in F(X)$:
$A \bullet B = \vee\{A(x) \wedge B(x) : x \in X\}$ is defined as the inner product of $A$ and $B$, and $A \oplus B = \wedge\{A(x) \vee B(x) : x \in X\}$ is defined as the outer product of $A$ and $B$. Finally, for the map $N: F(X) \times F(X) \to [0, 1]$, the approaching degree of $A$ and $B$ is

$$N(A, B) = (A \bullet B) \wedge (A \oplus B)^c . \quad (1)$$
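For discrete membership vectors, the inner product, outer product and approaching degree of Eq. (1), as well as the Hamming-type fuzzy distance used later (Eqs. (2) and (3) with p = 1), reduce to simple array operations, as in the following sketch; the two membership vectors are invented purely for illustration.

```python
import numpy as np

def approaching_degree(A, B):
    """N(A, B) = (A . B) AND (A + B)^c for discrete membership vectors, Eq. (1)."""
    inner = np.max(np.minimum(A, B))        # inner product: sup of pointwise min
    outer = np.min(np.maximum(A, B))        # outer product: inf of pointwise max
    return min(inner, 1.0 - outer)

def fuzzy_distance(A, B, p=1):
    """M_p(A, B) over a discrete universe; p = 1 is used in the paper."""
    return np.sum(np.abs(A - B) ** p) ** (1.0 / p)

# membership values of one signal feature for a standard model and a monitored tool
A = np.array([0.1, 0.6, 1.0, 0.6, 0.1])
B = np.array([0.2, 0.8, 0.9, 0.4, 0.1])
print(approaching_degree(A, B), fuzzy_distance(A, B))
```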
4.2 Fuzzy Distance
If $A \in F(X)$, when $X = \{x_1, x_2, \ldots, x_n\}$, the membership values of $A$, $(A(x_1), A(x_2), \ldots, A(x_n))$, can be regarded as a point in the $n$-dimensional Euclidean space, so the distance between two fuzzy sets can be defined in the same way as the distance in Euclidean space. When $X = [a, b]$ and $A(x)$ is a bounded function on $[a, b]$, the distance between two fuzzy sets can be defined as follows. Suppose $M_p: F(X) \times F(X) \to [0, +\infty)$ ($p$ is a positive real number), $\forall (A, B) \in F(X) \times F(X)$. When $X = \{x_1, x_2, \ldots, x_n\}$,

$$M_p(A, B) = \left[\sum_{i=1}^{n} |A(x_i) - B(x_i)|^p\right]^{1/p} . \quad (2)$$

When $X = [a, b]$,

$$M_p(A, B) = \left[\int_a^b |A(x) - B(x)|^p \, dx\right]^{1/p} . \quad (3)$$
M p is fuzzy distance on F (X ) , M p ( A, B ) is the fuzzy distance between fuzzy set A and B . In general situation, p can take the value of 1. 4.3 Two Dimensional Weighted Approaching Degree In the conventional fuzzy pattern recognition process, the approaching degree or fuzzy distance between corresponding features of the object to be recognized and different models are first calculated. Combining these results can determine the fuzzy similarity between the object and different models. The object should be classified to one of the models that have the highest approaching degree or shortest fuzzy distance with it. This process can be further improved by developing a method that can assign suitable weights to different features to reflect their specific influences in the pattern recognition process. The two fuzzy similarity measures can also be combined to describe the closeness of fuzzy sets more comprehensively. Approaching degree and fuzzy distance reflects the closeness of two fuzzy sets from different angles. For two intersecting membership function, approaching degree reflects the geometric position of the intersecting point and the fuzzy distance shows the area of the intersecting space. Approaching degree and fuzzy distance between different sensor signal features also have changing importance in the practical pattern recognition process. In this study, artificial neural networks (ANNs) are employed to integrate approaching degree and fuzzy distance and assign them with suitable weights to provide a two dimensional weighted approaching degree. This makes more accurate and reliable tool wear classification possible.
5 Fuzzy Driven Neural Network
ANNs have the ability to classify inputs. The weights between neurons are adjusted automatically in the learning process to minimize the difference between the desired and actual outputs. ANNs can continuously classify and also update classifications. In this study, an ANN is connected with the fuzzy logic system to establish a fuzzy driven neural network pattern recognition algorithm. Its principle is shown in the following
figure. Here a back-propagation ANN is used to carry out multi-sensor information integration and tool wear classification. The approaching degree and fuzzy distance calculation results are the inputs of the ANN. The associated weights can be updated as $w_i(\text{new}) = w_i(\text{old}) + \alpha \delta x_i$, where $\alpha$, $\delta$ and $x_i$ are the learning constant, the associated error measure and the input to the $i$th neuron, respectively. In this updating process, the ANN recognizes the patterns of the features corresponding to a certain tool wear state, so that in the practical machining process the feature pattern can be accurately classified to that of one of the models. In effect, the ANN combines the approaching degree and the fuzzy distance and assigns each feature a properly synthesized weight, and the output of the ANN is the two-dimensional weighted approaching degree. This enables the classification process to be more reliable.
[Fig. 3 blocks: time- and frequency-domain feature extraction (force, load, AE, vibration); fuzzy membership function calculation; fuzzy distance and fuzzy approaching degree calculation; encoder (training input, training target, data to be encoded); ANN with error graph (training, test and inquiry inputs); decoder; inquiry output: new / normal / worn]
Fig. 3. The fuzzy driven neural networks
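The weight update rule quoted above, $w_i(\text{new}) = w_i(\text{old}) + \alpha \delta x_i$, can be illustrated with a toy single-neuron sketch in which the per-feature approaching degrees and (complemented) fuzzy distances form the input vector. The sigmoid output, the choice of δ as a sigmoid-derivative error term, and the conversion of distances into similarities are assumptions made for illustration; the paper does not specify these details.

```python
import numpy as np

def delta_rule_step(w, x, target, alpha=0.1):
    """One update w_new = w_old + alpha * delta * x, with delta derived from the output error."""
    y = 1.0 / (1.0 + np.exp(-np.dot(w, x)))         # sigmoid output: weighted approaching degree
    delta = (target - y) * y * (1.0 - y)
    return w + alpha * delta * x, y

# per-feature approaching degrees and fuzzy distances for one tool vs. one wear model (toy values)
approach = np.array([0.9, 0.7, 0.8])
distance = np.array([0.2, 0.5, 0.3])
x = np.concatenate([approach, 1.0 - distance])      # both similarity measures as network inputs
w = np.full(len(x), 0.5)
for _ in range(50):
    w, y = delta_rule_step(w, x, target=1.0)        # this tool is assumed to match the model
print(y)                                            # two-dimensional weighted approaching degree
```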
6 Tool Wear State Classification
In the practical tool condition monitoring process, the tool with an unknown wear value is the object, and it will be recognized as a "new tool", "normal tool" or "worn tool". The membership functions of all the features of the object are determined first. The approaching degree and fuzzy distance of the corresponding features of the standard models and the object to be recognized are then calculated and become the inquiry input of the ANN. One of the pre-trained ANNs is then chosen to calculate the two-dimensional weighted approaching degree. Finally, the tool wear state is classified to the model that has the highest weighted approaching degree with the tool being monitored. In a verification experiment, fifteen tools with unknown flank wear values were used in milling operations. Fig. 4 shows the classification results under cutting condition
1*. It can be seen that all the tools were classified correctly with a confidence higher than 80%. Experiments under other cutting conditions showed similar results.
[Fig. 4 data: classification confidence (%) for tools with flank wear values 0.05, 0.05, 0.07, 0.07, 0.08, 0.25, 0.25, 0.27, 0.27, 0.28, 0.55, 0.55, 0.56, 0.56 and 0.57 mm, each classified as new, normal or worn]
Fig. 4. Tool wear states classification results
7 Conclusions An intelligent tool condition monitoring system has been established. Tool wear classification is realized by applying fuzzy driven neural network based pattern recognition algorithm. On the basis of this investigation, the following conclusions can be made. (1) Power consumption, vibration, AE and cutting force sensors can provide healthy signals to describe tool condition comprehensively. (2) Many features extracted from time and frequency domains were found to be relevant to the changes of tool wear state. This makes accurate and reliable pattern recognition possible. (3) The combination of ANN and fuzzy logic system integrates the strong learning and classification ability of the former and the superb flexibility of the latter to express the distribution characteristics of signal features with vague boundaries. This methodology indirectly solves the weight assignment problem of the conventional fuzzy pattern recognition system and let it have greater representative power, higher training speed and be more robust. (4) The introduction of the two dimensional weighted approaching degree can make the pattern recognition process more reliable. The fuzzy driven neural network effectively fuses multi-sensor information and successfully recognizes the tool wear states. (5) Armed with the advanced pattern recognition methodology, the established intelligent tool condition monitoring system has the advantages of being suitable for different machining conditions, robust to noise and tolerant to faults.
(6) Future work should be focused on developing data processing methods that produce feature vectors which describe tool condition more accurately, improving the fuzzy distances calculation methods and optimizing the ANNs structure. * Cutting condition 1( for milling operation): cutting speed - 600 rev/min, feed rate 1 mm/rev, cutting depth - 0.6 mm, workpiece material - EN1A, cutting inserts Stellram SDHT1204 AE TN-42.
References 1. Lin, S.C., and Yang, R.J.: Force-based Model for Tool Wear Monitoring in Face Milling, Int. J. Machine Tools and Manufacturing 9 (1995) 1201-1211 2. Elbestawi, M.A., Papazafiriou, T.A., and Du, R.X.: In-process Monitoring of Tool Wear in Milling Using Cutting Force Signature, Int. J. Machine Tools Manufacturing 1 (1991) 55-73 3. Dimla, D.E., Lister, P.M.: On-line metal cutting tool condition monitoring. I: force and vibration analyses, Int. J. of Machine Tools and Manufacturing 5 (2000) 739-768. 4. Dimla, D.E., Lister, P.M.: On-line metal cutting tool condition monitoring. II: tool-state classification using multi-layer perceptron neural networks, Int. J. of Machine Tools and Manufacturing 5 (2000) 769-781 5. Wilkinson, P., Reuben, R.L., Jones, J.D.C.: Tool wear prediction from acoustic emission and surface characteristics via an artificial neural network, Mechanical Systems and Signal Processing 6 (1999) 955-966 6. Li, X., Dong, S., Venuvinod, P.K.: Hybrid learning for tool wear monitoring, Int. J. of Advanced Manufacturing Technology 5 (2000) 303-307 7. Li, X.L., Tso, S.K., Wang, J: Real-time tool condition monitoring using wavelet transforms and fuzzy techniques, IEEE Transactions on Systems, Man and Cybernetics Part C: Applications and Reviews 3 (2000) 352-357 8. Chungchoo, C., Saini, D.: On-line tool wear estimation in CNC turning operations using fuzzy neural network model, Int. J. of Machine Tools and Manufacture 1 (2002) 29-40 9. Sick, B.: Tool wear monitoring in turning: A neural network application, Measurement and Control 7 (2001) 207-211+222 10. Ertunc, H.M., Loparo, K.A.: A decision fusion algorithm for tool wear condition monitoring in drilling, Int. J. of Machine Tools and Manufacture 9 (2001) 1347-1362
Research on Customer Classification in E-Supermarket by Using Modified Fuzzy Neural Networks Yu-An Tan1, Zuo Wang1, and Qi Luo2 1 Department of Computer Science and Engineering, Beijing Institute of Technology, 100081 Beijing, China
[email protected],
[email protected] 2 Department of Information & Technology, Central China Normal University, 430079, Wuhan, China
[email protected]
Abstract. With the development of network technology and E-commerce, more and more enterprises have accepted the management pattern of E-commerce. In order to meet the personalized needs of customers in E-supermarket, customer classification based on their interests is a key technology for developing personalized E-commerce. Therefore, it is highly needed to have a personalized system for extracting customer features effectively, and analyzing customer interests. In this paper, we proposed a new method based on the modified fuzzy neural network to group the customers dynamically according to their Web access patterns. The results suggest that this clustering algorithm is effective and efficacious. Taking one with another, this new proposed approach is a practical solution to make more visitors become to customers, improve the loyalty degree of customer, and strengthen cross sale ability of websites in E-commerce. Keywords: customer classification, E-supermarket, modified fuzzy neural networks, personalized needs, Web access.
1 Introduction With the development of network technology and E-commerce, more and more enterprises have transferred to the management pattern of E-commerce [1]. The management pattern of E-commerce may greatly save the cost in the physical environment and bring conveniences to customers. People pay more and more attention to E-commerce day by day. Therefore, more and more enterprises have set up their own E-supermarket websites to sell commodities or issue information service. But the application of these websites is difficult to attract customer’ initiative participation. Only 2%-4% visitors purchase the commodities on E-supermarket websites [2]. The investigation indicates that personalized recommendation system that selecting and purchasing commodities is imperfect. The validity and accuracy of providing commodities are low. If E-supermarket websites want to attract more visitors to customers, improve the loyalty degree of customers and strengthen the cross sale ability of D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 301–306, 2007. © Springer-Verlag Berlin Heidelberg 2007
302
Y.-A. Tan, Z. Wang, and Q. Luo
websites, the idea of personalized design should be needed. It means that commodities and information service should be provided according to customers’ needs. The key of personalized design is how to classify customers based on their interests. In the paper, we presents a system model that dynamically groups customers according to their Web access and transactional data, which consist of the customers’ behavior on web site, for instance, the purchase records, the purchase date, amount paid, etc. The proposed system model is developed on the base of a modified fuzzy ART neural network, and involves two sequential modules including: (1) trace customers’ behavior on web site and generate customer profiles, (2) classify customers according customer profile using neural network.
2 System Model The system model is characterized as Fig. 1, which is applied to E-supermarket in our experiment. The idea is that customer interests could be extracted by observing customer behavior, including the transaction records, the transaction time and the products pages customer browsed. Then the results of first module are organized in a hierarchical structure and utilized to generate customer profile respectively. Finally customer profile could be grouped into different teams using modified fuzzy ART neural network. The system model includes three modules: customer behavior recording, customer profile generating and customer grouping. A hierarchical structure Customer behavior recording Transaction records Transaction time
Customer profile generating
Customer grouping
Products pages
Modified fuzzy ART neural network
Fig. 1. System model
(1) Customer behavior recording. The customer behavior is divided two types: transaction record and customer operation. Customer operation is composed of browsing time, frequency and so on. According to our early research, visiting duration of a product page is a good way to measure the customer interests. Hence, in our paper, each product page, whose visiting time is longer than a threshold, is analyzed. (2) Customer profile generating. A tree-structured is represented for customer profile. We could organize customer preference in a hierarchical structure according to customer interests. The structure is shown as follows Fig. 2. (3) Customer grouping. Customer could be grouped to different teams according their profiles by using adaptive neural network.
Research on Customer Classification in E-Supermarket
303
Level 1: User preference Tree
Level 2: Class
Level 3: Subclass 1
Level n: Subclass N
Fig. 2. Structure of customer profile
3 Modified Fuzzy ART Network The Fuzzy ART network is an unsupervised neural network with ART architecture for performing both continuous-valued vectors and binary-valued vectors [3]. It is a pure winner-takes-all architecture able to instance output nodes whenever necessary and to handle both binary and analog patterns. Using a vigilance parameter as a threshold of similarity, Fuzzy ART can determine when to form a new cluster. This algorithm uses an unsupervised learning and feedback network. It accepts input vector and classifies it into one of a number of clusters depending upon which it best resembles. The single recognition layer that fires indicates its classification decision. If the input vector does not match any stored pattern; it creates a pattern that is like the input vector as a new category. Once a stored pattern is found that matches the input vector within a specified threshold (the vigilance ρ ∈ [0,1] ), that pattern is adjusted to make it accommodate the new input vector. The adjective fuzzy derives from the functions it uses, although it is not actually fuzzy. To perform data clustering, fuzzy ART instances the first cluster coinciding with the first input and allocating new groups when necessary (in particular, each output node represents a cluster from a group). In the paper, we employ a modified Fuzzy ART proposed by Cinque al. to solve some problems of traditional Fuzzy ART. Function choice used in the algorithm is characterized as follows. 2
choice ( C j , V j ) =
(C
j
∧ Vi
C j Vi
)
2
⎛ n ⎞ ⎜ ∑ Zr ⎟ = ⎝n r =1 n ⎠ . ∑ Cr ∑Vr r =1
(1)
r =1
It computes the compatibility between a cluster and an input to find a cluster with greatest compatibility. The input pattern V j is an n-elements vector transposed, C j is
304
Y.-A. Tan, Z. Wang, and Q. Luo
the weight vector of cluster J (both are n-dimensional vectors). “ ∧ ” is fuzzy set intersection operator, which is defined by: x ∧ y = min{x, y} (2) X ∧ Y = ( x1 ∧ y1 ," , xn ∧ yn ) = ( z1 , z2 , " zn ) Function match is the following: n
match ( C * ∧ Vi ) =
C ∧ Vi *
C*
=
∑Z
r
∑C
* r
r =1 n
r =1
.
(3)
This computes the similarity between the input and the selected cluster. The match process is passed if this value is greater than, or equal to, the parameter ρ ∈ [0,1] . Intuitively, ρ indicates how similar the input has to be to the selected cluster to allow it to be associated with the customer group the cluster represents. As a consequence, a greater value of ρ implies smaller clusters, a lower value means wider clusters. Function adaptation is the selected cluster adjusting function, which algorithm is shown as following: * adaptation ( C* ,Vi ) = Cnew ( Cold* ∧ Vi ) + (1 − β ) Cold* ,
(4)
where the learning parameter ρ ∈ [0,1] , weights the new and old knowledge respec* * tively. It is worth observing that this function is not increasing, that is Cnew < Cold . The energy values of all leaf nodes in a customer profile consist an n-elements vector representing a customer pattern. Each element of the vector represents a product category. If a certain product category doesn’t include in customer profile, the corresponding element in the vector is assigned to 0. Pre-processing is required to ensure the pattern values in the space [0, 1], as expected by the fuzzy ART.
4 Experiment On the foundation of research, we combine with the cooperation item of personalized service system in community. The author constructs E-supermarket website to provide personalized recommendation. The experiment simulated 15 customers behavior on E-supermarket over a 20-day period, and they were grouped to 5 teams. The experimental web site is organized in a 4-level hierarchy that consists of 5 classes and 40 subclasses, including 4878 commodities. As performance measures, we employed evaluation metrics as follows [4]. , , .
(5) (6) (7)
Research on Customer Classification in E-Supermarket
305
The experiment results were compared with SOM, k-means and traditional fuzzy ART [5]. It was used in the fast learning asset (with β =1) with α set to zero. Values for the vigilance parameter β were found by trials. In the simulation of k-means, parameter K representing the number of clusters is assigned to 7. In particular, we used a rectangular map with two training stages: the first was made in 750 steps, with 0.91 as a learning parameter and a half map as a neighborhood, and the second in 400 steps, with 0.016 as a learning parameter and three units as a neighborhood. Map size was chosen by experiments. In the proposed system, decaying factor λ is assigned to 0.93, aging factor ψ is set to 0.02, β is set to 1, and vigilance parameter ρ is assigned to 0.89. With the growth of vigilance parameter, the amount of clusters is increased too. Fig. 3 shows the increase in the number of clusters with increased vigilance parameter values ranging from 0.85 to 0.95.
Fig. 3. The vigilance parameter increase with the clusters increasing
Fig. 4 illustrates the comparisons of three algorithms mentioned before, including precision, recall and F1 . The average for precision, recall and F1 measures using the SOM classifier are 80.7%, 75.3%, 78.9%, respectively. The average for precision, recall and F1 measures using the traditional fuzzy ART classifier are 88.3%, 85.8%, 87%, respectively. And the average for precision, recall and F1 measures using the
Fig. 4. The comparison of SOM, Traditional ART, K-means and modified fuzzy ART algorithm
306
Y.-A. Tan, Z. Wang, and Q. Luo
5 Conclusions In summary, a new approach that uses a modified fuzzy neural network based on adaptive resonance theory to group customers dynamically based on their Web access patterns is proposed in the paper. A new method is applied to E-supermarket to provide personalized service. The results manifest that this clustering algorithm is effective. Thus, we thought it might be a practical solution to make more visitors become to customers, improve the loyalty degree of customer, and strengthen cross sale ability of websites in E-commerce. I wish that this article’s work could give references to certain people.
References 1. Wu, Z.H.: Commercial Flexibility Service of Community Based on SOA. In the proceedings of the fourth Wuhan International Conference on E-Business, Wuhan: IEEE Press (2005) 467-471 2. Li, Y., Liu, L.: Comparison and Analysis on E-Commence Recommendation Method in China. System Engineering Theory and Application 24 (8) (2004) 96-98 3. Haykin, S.: Neural Networks: A Comprehensive Foundation. Canada: Prentice-Hall (2001) 4. Li, D., Cao, Y.D.: A New Weighted Text Filtering Method. In the proceedings of the International Conference on Natural Language Processing and Knowledge Engineering, Wuhan: IEEE Press (2005) 695-698 5. Yang, Z.: Net Flow Clustering Analysis Based on SOM Artificial Neural Network. Computer Engineering 32 (16) (2006) 103-105
Recurrent Fuzzy Neural Network Based System for Battery Charging R.A. Aliev1, R.R. Aliev2, B.G. Guirimov1, and K. Uyar3 1
Azerbaijan State Oil Academy, 20 Azadlig avenue, Baku, Azerbaijan
[email protected] 2 Eastern Mediterranean University
[email protected] 3 Near East University
[email protected]
Abstract. Consumer demand for intelligent battery charges is increasing as portable electronic applications continue to grow. Fast charging of battery packs is a problem which is difficult, and often expensive, to solve using conventional techniques. Conventional techniques only perform a linear approximation of a nonlinear behavior of a battery packs. The battery charging is a nonlinear electrochemical dynamic process and there is no exact mathematical model of battery. Better techniques are needed when a higher degree of accuracy and minimum charging time are desired. In this paper we propose soft computing approach based on fuzzy recurrent neural networks (RFNN) training by genetic algorithms to control batteries charging process. This technique does not require mathematical model of battery packs, which are often difficult, if not impossible, to obtain. Nonlinear and uncertain dynamics of the battery pack is modeled by recurrent fuzzy neural network. On base of this FRNN model, the fuzzy control rules of the control system for battery charging is generated. Computational experiments show that the suggested approach gives least charging time and least Tend-Tstart results according to the other intelligent battery charger works.
1 Introduction There are several research works on application of new technologies, namely fuzzy, neural, genetic, neuro-fuzzy approaches for battery charging. Unlike conventional schemes using constant current or a few trip points, the intelligent charger monitors battery parameters continuously, and alters charge current as frequently as required to prevent overcharging, to prevent exceeding temperature limits, and to prevent exceeding safe charge current limits. This allows a high charge to be applied during the initial stages of charging. The charge current is appropriately reduced during the later stages of charging based on the battery parameters. Authors in [1] implement three different approaches for controlling a complex electrochemical process using MATLAB. They compared the results of fuzzy, neurofuzzy systems with conventional PID control by simulating the formation (loading) of a battery. These systems designed using absolute temperature (T) and temperature D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 307–316, 2007. © Springer-Verlag Berlin Heidelberg 2007
308
R.A. Aliev et al.
gradient (dT/dt) as inputs and current (I) as output. Although [1] explains the duration of charging unfortunately neither gives information about the type of battery nor any information about temperature increase level during charging. Paper [2] focuses on the design of a super fast battery charger based on NeuFuz technology. In this application they have used a NiCd battery pack as the test vehicle and measured values were T, Voltage (U) and I. The results are 5 degrees Celsius difference between ending temperature and starting temperature (Tend - Tstart) and charging time is 20 to 30 minutes. The charging time is too long compared to other researches. In [3] the authors gives Tend - Tstart result as 35 up to 60o C and the method increases the life time to 3000 charging time. The paper does not explain how long it takes to charge the battery. The work does not give from how many cycle this method increased the life time to 3000 cycle. The Tend - Tstart result is too high according to other research papers also. Paper [4] considers on a fuzzy controller for rapid NiCd batteries charger using adaptive Neuro-Fuzzy inference system model. The NiCd batteries were charged at different rates between 8 and 0.05 charging rate (C) and for different durations. The two input variables identified to control the C are T, dT/dt. The equivalent ANFIS architecture for the system under consideration is created in MATLAB. Although this work gives the best result with less charging time, the ANFIS gives a high level 50 degrees Celsius Tend - Tstart result. Paper [9] presents a genetic algorithm approach to optimize a fuzzy rule-based system for charging high power NiCd batteries. Unfortunately, as it is mentioned in [5], little progress has been made in creating of intelligent control system of batteries charging those provides optimal trade-off between charging time and battery overheating and there is a big potential to increase efficiency of battery charging system by using more effective technologies. In this paper we propose soft computing approach based on fuzzy recurrent neural networks trained by genetic algorithms to control batteries charging process. This approach does not require mathematical model of battery packs, which are often difficult, if not impossible, to obtain. The work distinguishes as containing fuzzy recurrent neural network modeling non linearity and high degree of uncertainty of battery packs. This FRNN model allows generation of fuzzy rule-base for intelligent battery charging control system. The main advantage of the proposed intelligent control system is that it provides more advantages (minimum charging time and minimum overheating) as compared to existing methods. The rest of this paper is organized as follows. Section 2 describes the fuzzy RNN for battery modeling and battery charging control system. Section 3 describes the Soft Computing based battery charging control system. In section 4 simulations and experimental results are discussed. Section 5 gives conclusion of this paper.
2 Fuzzy Recurrent Neural Network and Its Learning The structure of the proposed fuzzy recurrent neural network is presented in Fig. 1. The box elements represent memory cells that store values of activation of neurons at previous time step, which is fed back to the input at the next time step.
Recurrent Fuzzy Neural Network Based System for Battery Charging
Layer 0 (input)
Layer 1 (hidden)
309
Layer L (output)
y11 (t ) y11 (t − 1)
x1l (t ) 0 1
x (t )
y1L (t − 1)
y1L (t )
y 1i (t − 1) y i1 (t )
xil (t )
x 0j (t ) y NL L (t ) y 1N1 (t )
Fig. 1. The structure of FRNN
where
x lj (t ) is j-th fuzzy input to the neuron i at layer l at the time step t, y il (t ) is
the computed output signal of the neuron at the time step t,
wij is the fuzzy weight of
the connection to neuron i from neuron j located at the previous layer, bias of neuron i, and
θi is the fuzzy
y (t − 1) is the activation of neuron j at the time step (t-1), vij l j
is the recurrent connection weight to neuron i from neuron j at the same layer. The activation F for a total input to the neuron s (Fig. 2) is calculated as:
F (s) =
s . 1+ | s |
(1)
F(s) 0.8 0.6 0.4 0.2
s
0 -8
-6
-4
-2
0
2
4
6
-0.2 -0.4 -0.6 -0.8 -1
Fig. 2. The activation function F(s)
8
310
R.A. Aliev et al.
So, the output of neuron i at layer l is calculated as follows:
θ il +
∑ x (t )w + ∑ y (t − 1)v l j
l ij
l j
j
yil (t ) = 1+
θ il
+
l ij
j
∑
x lj (t ) wijl
j
+
.
∑
y lj (t
(2)
− 1)vijl
j
All fuzzy signals and connection weights and biases are general fuzzy numbers that with any required precision can be represented as T(L0,L1,...,Ln-1,Rn-1,Rn-2,...R0). Fig. 3 shows an example of fuzzy number when n=4 (if n=2, we get traditional trapezoid ( L1 < R1 ) numbers and triangle numbers ( L1 = R1 )).In case the original learning patterns are crisp, we need to sample data into fuzzy terms, i.e. to fuzzify the learning patterns. The fuzzifiers can be created independently for specific problems. For learning of FRNN we use GA [7,8]. To apply genetic algorithm based approach for FRNN training, all adjustable parameters i.e. connection weights and biases are coded as bitstrings. A combination of all weight and bias bitsrings compose a genome (sometimes called a chromosome) representing a potential solution to the problem. During the genetic evolution a population consisting of a set of individuals or genomes (usually 50-100 genomes) undergo a group of operations with selected genetic operators. crossover and mutation are the most often used operators. Applying genetic operators results in generating many offsprings (new individuals or genomes). When bitstrings are decoded back to weights and biases, presenting different FRNN solutions, some may present good network solutions and some bad. Good genomes (i.e. those corresponding to good solutions) have more chances to stay within the populations for upcoming generations while bad genomes have more chances to be discarded during the future selection processes. Whether a genome is good or bad is evaluated by a fitness function. The fitness function is an evaluator function (can also be fuzzy) numerically evaluating the quality of the genome and the representing a solution. In case of a neural network learning, the purpose is to minimize the network error performance index. Thus, the selection of best genomes from the population is done on the basis of the genome fitness value, which is calculated from the FRNN error performance index. The calculation of the fitness value of a particular genome require restoration of the coded genome bits back to fuzzy weight coefficients and biases of FRNN, in other words, we need to get a phenotype from the genotype. The FRNN error performance index can be calculated as follows:
Etot = ∑∑ D ( y pi , y des pi ) , p
(3)
i
where Etot is the total error performance index for all output neurons i and all learning data entries p. We shall assume Y is a finite universe Y={y1,y2,...,yn}; D is an error function such des
as the distance measure between two fuzzy sets, the desired y pi and the computed
y pi outputs. The efficient strategy is to consider the difference of all the points of the
Recurrent Fuzzy Neural Network Based System for Battery Charging
311
used general fuzzy number (Fig. 3). The considered distance measure is based on Hamming distance Δ j = y pij − y pij , des
D (T1 , T2 ) =
n
Δ j ∈ [0,1] : D = ∑ Δ j , j =1
i = n −1
i = n −1
i =0
i =0
∑
k i | LT 1i − LT 2i | +
∑k
i
| RT 1i − RT 2i | ,
where D (T1 , T2 ) is the distance measure between two fuzzy numbers
(4)
T1 ( y des pi ) and
T2 ( y pi ), 0≤k0≤k1... ≤kn-2≤kn-1 are some scaling coefficients. Once the total error performance index for a combination of weights has been calculated the fitness f of the corresponding genome is set as:
f =
1 . 1 + Etot
(5)
1
0.75
0.5
0.25
0 -5
0
5
10
15
Fig. 3. An example of n-point fuzzy number
As can be seen, the fitness function value for a genome (coding a network solution) is based on a distance measure comparing two sets of fuzzy values. Scaling coefficients are included to add sensitivity to high membership areas of a fuzzy number. The GA-based training process is schematically shown in Fig. 4 The GA used here can be described as follows: 1. Prepare the genome structure according to the structure of FRNN; 2. Incase of existence of a good genome (an existing network solution), put it into population; else generate a random network solution and put it into population; 3. Generate at random new PopSize-1 genomes and put them into population; 4. Apply genetic crossover operation to PopSize genomes in the population; 5. Apply mutation operation to the generated offsprings; 6. Get phenotype and rank (i.e. evaluate and assign fitness values to) all the offsprings;
312
R.A. Aliev et al.
Fig. 4. GA based training of a FRNN network
7. Create new population with Nbest best parent genomes and (PopSize- Nbest) best offsprings; 8. Display fitness value of the best genome; If termination condition is met go to Step 9; Else go to step 4; 9. Get phenotype of the best genome in the population. Store network weights file; 10. Stop. In the above algorithm PopSize is minimum population size and Nbest is the number of best parent genomes always kept in the newly generated population. The learning may be stopped once we see the process does not show any significant change in fitness value during many succeeding regenerations. In this case we can specify new mutation (and maybe crossover) probability and continue the process. If the obtained total error performance index or the behavior of the obtained network is not desired, we can restructure the network by adding new hidden neurons, or do better sampling (fuzzification) of the learning patterns.
3 Description of Soft Computing Based Battery Charging Control System The purpose battery control system is to charge the whole battery pack, consisting of 6 batteries, to hold 9.6V. The initial charge level is 1.37V and temperature is 21.6°C.
Recurrent Fuzzy Neural Network Based System for Battery Charging
313
Just after the battery reaches 1.6V (the target for one battery: 1.6V×6 batteries=9.6V), it becomes overheated and we can observe loss of charge due to some chemical processes inside the battery. The purpose of control is to charge the battery to hold 1.6V in a possibly shorter time while preventing the battery from overheating. We can apply different charging currents as control input with values ranging from 0A to 6A. The input signals of suggested control system for batteries charging T and U are measured by temperature and voltage sensors . Output of the sensors are crisp current values of temperature and voltage. Other input signals of battery charging controller are first derivatives of U (dU/dt) and T (dT/dt). All these input signals U, dU/dt, T and dT/dt are fuzzified by fuzzifiers. Generated in advance fuzzy knowledge base of controller is implemented by RFNN approximately. Receiving current fuzzy values of U, dU/dt, T and dT/dt controller performs fuzzy inference and determines fuzzy values of control signal. As only crisp control signals are applied to battery, the fuzzy control signal from RFNN must be defuzzified. This signal is then applied to battery [6]. As it is mentioned above, there is no exact mathematical model. Because of this, instead of design of a nonlinear differential equation model we prefer to use neurofuzzy genetic method to define a model. To design the charger a FRNN is used to learn the behavior of the battery charging and to generate a set of fuzzy rules and membership functions and then acquired knowledge into new fuzzy logic system. The system creates the voltage and temperature models. The FRRN designed for battery charging controller had 4 inputs, 20 hidden neurons, and 1 output. The three used inputs represented temperature (T), change of temperature (dT), voltage (V) and change of voltage (dV). The output of the controller is the current (I) applied for charging the battery. All weights and biases of the FRNN are coded as 64 bits long genes. The control rules used for learning of battery control system are listed in table 1. Table 1. The control rules T LOW LOW LOW ... ... ... LOW LOW MED
dT LOW MED HIGH ... ... ... MED HIGH HIGH
V LOW LOW LOW ... ... ... MED MED MED
dV LOW LOW LOW ... ... ... HIGH HIGH HIGH
I HIGH HIGH HIGH ... ... ... HIGH HIGH LOW
4 Simulation Results The network was trained by the above fuzzy rules. Fig. 5 shows the graph of the charging process under the control of the learned FRNN. The GA learning was done with population size 100, probability of multi-point crossover 0.25, and probability of
R.A. Aliev et al. Table 2. Comparison of different charging controllers Charging controller FRNN based (our approach) FL [3] FG [9] ANFIS [4] NeuFuz [2]
Time (sec) 860 959 900 1200-1800
Tend-Tstart 2,85 35-60 9 50 5
V o lta g e v s T im e 2 1 .9 U (V)
1 .8 1 .7 1 .6 1 .5 1 .4 1 .3 0
500
1000
1500
2000
1500
2000
1500
2000
t (s )
T (degree C)
T e m p e r a tu r e v s T im e 40 38 36 34 32 30 28 26 24 22 20 0
500
1000 t (s )
C u r r e n t v s T im e 8 7 6 I (A)
314
5 4 3 2 1 0 0
500
1000 t (s )
Fig. 5. Battery charging control process
Recurrent Fuzzy Neural Network Based System for Battery Charging
315
mutation 0.05. After the crossover and mutation operations [7], every 90 best offspring genomes plus 10 best parent genomes make a new population of 100 genomes. The selection of 100 best genomes is done on the basis of the genome fitness value [7]. The FRNN based control system allows very quick and effective charge of the battery: the charging time is reduced from more than 2000 seconds (with applied constant charge current 2A) to 860 seconds (or even less, if the temperature limit is set higher than 25ºC) with dynamically changed (under the control of the RFNN) input current (Fig. 5). Also the battery is protected from overheating and a long utilization time of the battery can be provided by adequately adjusting the fuzzy rules describing the desired charging process. The results of FRNN based charging controller compared with other battery chargers are given in Table 2. The NiCd charger with GA based training of FRNN gives less charging time and less Tend-Tstart result than other controllers.
5 Conclusions This work proposes Soft Computing approach based on recurrent fuzzy neural network to control batteries charging process. Dynamics of the battery pack is described by recurrent fuzzy neural networks on base of which the fuzzy control rules are generated. Genetic algorithm is used for tuning fuzzy neural network. Computational experiments show that the suggested approach gives least charging time and least Tend – T start results according to the other intelligent battery charger works. This approach is general and can be extended to design controllers for quickly charging different battery types.
References 1. Castillo, O., Melin, P.: Soft Computing for Control of Non-Linear Dynamical Systems. Springer, Germany (2001) 2. Ullah, M. Z., Dilip, S.: Method and Apparatus for Fast Battery Charging using Neural Network Fuzzy Logic Based Control. IEEE Aerospace and Electronic Systems Magazine 11 (6) (1996) 26-34 3. Ionescu, P.D., Moscalu, M., Mosclu, A.: Intelligent Charger with Fuzzy Logic. Int. Symp. on Signals, Circuits and Systems (2003) 4. Khosla, A., Kumar, S., Aggarwal, K. K.: Fuzzy Controller for Rapid Nickel-cadmium Batteries Charger through Adaptive Neuro-fuzzy Inference System (ANFIS) Architecture. 22nd International Conference of the North American Fuzzy Information Processing Society, NAFIPS. (2003) 540 – 544 5. Diaz, J., Martin-Ramos, J.A., Pernia, A.M., Nuno, F., Linera. F.F.: Intelligent and Universal Fast Charger for Ni-Cd and Ni-MH Batteries in Portable Applications IEEE Trans. On Industrial Electronics 51 (4) (2004) 857-863 6. Jamshidi, M.: Large-Scale systems: Modeling, Control and Fuzzy Logic. Englewood Cliffs, NJ: Prentice Hall (1996)
316
R.A. Aliev et al.
7. Aliev, R.A., Aliev, R.R.: Soft Computing and Its Applications. World Scientific, New Jersey (2001) 8. Jamshidi, M., Krohling, R. A., Coelho, DOSS., Fleming, P.: Robust Control Design Using Genetic Algorithms. CRC Publishers, Boca Raton, FL (2003) 9. Surmann, H.: Genetic Optimization of a Fuzzy System for Charging Batteries. IEEE Trans. on Industrial Electronics 43 (5) (1996) 541-548
Type-2 Fuzzy Neuro System Via Input-to-State-Stability Approach* Ching-Hung Lee and Yu-Ching Lin Department of Electrical Engineering, Yuan Ze University Chung-li, Taoyuan 320, Taiwan
[email protected]
Abstract. This paper proposes the type-2 fuzzy neural network system (type-2 FNN) which combines the advantages of type-2 fuzzy logic systems (FLSs) and neural networks (NNs). For considering the system uncertainties, we use the type2 FLSs to develop a type-2 FNN system. The previous results of type-1 FNN systems can be extended to a type-2 one. Furthermore, the corresponding learning algorithm is derived by input-to-state-stability (ISS) approach. Nonlinear system identification is presented to illustrate the effectiveness of our approach.
1 Introduction In recent years, intelligent systems including fuzzy control, neural networks, and genetic algorithm, etc, has been developed and applied widely, especially in the field of fuzzy neural network (FNN) [1-5]. In literature [1-3], the FNN system has the properties of parallel computation scheme, easy to implement, fuzzy logic inference system, and parameters convergence. The fuzzy rules and the membership functions (MFs) can be designed and trained from linguistic information and numeric data. Thus, it is then easy to design an FNN system to achieve a satisfactory level of accuracy by manipulating the network structure and learning algorithm of the FNN. The concept of type-2 fuzzy sets was initially proposed by Zadeh as an extension of ordinary fuzzy sets (called type-1) [6]. Subsequently, Mendel and Karnik developed a complete theory of type-2 fuzzy logic systems (FLSs) [7-11]. These systems are characterized by IF-THEN rules and type-2 fuzzy rules are more complex than the type-1 fuzzy rules because some differences are their antecedents and their consequent sets are type-2 fuzzy sets [8-10]. In this paper, the so-called type-2 FNN is proposed, which is an extension of the FNN. By the concept of literature [7-10], the type-2 FNN system is used to handle uncertainty. The proposed type-2 FNN is a multilayered connectionist network for realizing type-2 fuzzy inference system, and it can be constructed from a set of type-2 fuzzy rules. The type-2 FNN consists of type2 fuzzy linguistic process as the antecedent and consequent part. The consequent part of type-2 fuzzy rules means the output through type-reduction and defuzzification. Based on input-to-state-stability (ISS) approach, rigorous proofs are presented to guarantee the convergence of the type-2 FNN. *
This work was support partially by National Science Council, Taiwan, R.O.C. under NSC-942213-E-155- 039.
D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 317–327, 2007. © Springer-Verlag Berlin Heidelberg 2007
318
C.-H. Lee and Y.-C. Lin
This paper is organized as follows. In Section 2, we briefly introduce the type-2 FNN system and the corresponding learning algorithm by input-to-state-stability approach. Section 3 presents the application results of nonlinear system identification to illustrate the effectiveness of our approach. Finally, conclusion is summarized.
2 Type-2 Fuzzy Neural Network Systems 2.1 The Systems Structure The FNN system is a type of fuzzy inference system in neural network structure [1-5]. The construction of the type-2 FNN system is shown in Fig. 1. Obviously, the type-2 FNN is constructed by IF-THEN rule [1, 5]. The main difference is to replace the type-1 fuzzy sets to type-2 fizzy ones. Herein, we firstly introduce the basic function of every node in each layer. In the following symbols, the subscript ij indicates the jth term of the ith input Oij(k ) , where j = 1,…, l , and the super-script (k) means the k-th layer.
x1
~ G ~ G
x2
~ G ~ G ~ G
xn
~ G ~ G ~ G
[f 1
[w 1
1
[w
[f
Layer
2
∏ j
Layer
]
w1
∑
∏
~ G
Layer
]
f1
∏
f
3
j
j
w
j
yˆ
]
] Layer
4
Fig. 1. The construction of MISO type-2 FNN system
(a)
(b)
Fig. 2. Type-2 fuzzy MFs, (a) uncertain mean; (b) uncertain variance
Layer 1: Input Layer For the ith node of layer 1, the net input and the net output are represented as Oi(1) = wi(1) xi(1) where weights w , i = 1, , n are set to be unity. (1) i
(1)
Type-2 Fuzzy Neuro System Via Input-to-State-Stability Approach
319
Layer 2: Membership Layer Each node performs a type-2 membership function (MF), i.e., interval fuzzy sets, as shown in Fig. 2. Then, we described two kinds of the output of layer 2 respectively [8, 11]. For Case 1- Gaussian MFs with uncertain mean, as shown in Fig. 2(a) ⎡ 1 (Oi(1) − mij )2 ⎤ ⎧Oij( 2 ) as mij = mij Oij( 2 ) = exp⎢ − = (σ ij )2 ⎥⎥⎦ ⎨⎩O ij( 2) as mij = mij ⎢⎣ 2
(2)
Case 2- Gaussian MFs with uncertain variance, as shown in Fig. 2(b), we have ⎡ 1 (Oi(1) − mij )2 ⎤ ⎧Oij( 2 ) as σ ij = σ ij Oij( 2 ) = exp ⎢ − = (σ ij )2 ⎥⎦⎥ ⎨⎩O ij( 2) as σ ij = σ ij ⎣⎢ 2
(3)
where mij and σ represent the center (or mean) and the width (or standard-deviation) respectively. The type-2 MFs can be represented as interval bound by upper MF and lower MF, denoted by μ F~ and μ F~ , as shown Fig. 2. Thus, the output Oij( 2 ) is ij
[
i
]
(2)
i
represented as an interval O ij , Oij( 2 ) .
Layer 3: Rule Layer This layer are used to implement the antecedent matching. Here, the operation is chosen as a simple PRODUCT operation. Therefore, for the jth input rule node n ⎧ ( 3) O j = ∏ (wij( 3)Oij( 2 ) ) ⎪ ⎪ i =1 = ∏ (wij( 3) Oij( 2 ) ) = ⎨ n i =1 ⎪ O (j3 ) = ∏ wij( 3 ) O ij( 2 ) ⎪⎩ i =1 n
O (j 3)
(
(4)
)
where the weights wij( 3) =1. Thus, the output Oij( 3) is represented as [O (j3) , O j( 3) ] . Layer 4: Output Layer This links in this layer are used to implement the consequence matching and typereduction and the linear combination [7, 9, 10]. Thus, O ( 4 ) + OL( 4 ) yˆ = O ( 4 ) = R (5) 2 where
(
) ∑ (O
O R( 4 ) = ∑ ( f jR w j( 4 ) ) = ∑ O j w j( 4 ) + l
R
j =1
and l
j =1
(
OL( 4 ) = ∑ f jL w j j =1
(4)
( 3)
) = ∑ (O L
l
( 3) j
k = R +1
) + ∑ (O
wk( 4 ) )
(6)
)
(7)
l
( 3) j
(4)
wj
j =1
k = L +1
( 3) j
(4)
wk
In order to obtain the values O L( 4 ) and O R( 4 ) , find coefficients R and L, firstly. Assume ( 4)
that the pre-computed w j( 4 ) and w j (4) 1
w
≤w
(4) 2
≤
≤w
(4) l
R1: Compute O R( 4 )
let y r ≡ OR( 4 ) .
and w1 ≤ w 2 ≤ (4)
( 4)
are arranged in ascending order, i.e.,
≤ wl
(4)
[7, 10]. Then, 1 ( 3) in (6) by initial setting f jR = O j( 3) + O j for i = 1, 2
(
)
, l , and
320
C.-H. Lee and Y.-C. Lin
R2: Find R (1 ≤ R ≤ l − 1) such that wR( 4 ) ≤ y r ≤ wR( 4+1) . R3: Compute OR( 4) in (6) with f jR = O (j3) for j ≤ R and f jR = O j( 3) for j > R , and
let y r′ = OR( 4 ) . R4: If y ′r ≠ y r , then go to step R5. If y ′r = y r , then stop and set OR( 4 ) = y r′ . R5: Set y′r equal to yr, and return to step R2.
Subsequently, the computation of O L( 4 ) is similar to the above procedure [10]. Thus, the input/output representation of type-2 FNN system with uncertain mean is
(
)
(
)
(
)
l L l 1 R ( 3) (4) ( 3) (4) yˆ (mij , m ijσ ij , w j , w j ) = [ ∑ O j w j( 4 ) + ∑ (Ok( 3) wk( 4 ) ) + ∑ O j( 3) w j + ∑ O k w k ]. (8) 2 j =1 k = R +1 j =1 k = L +1
The type-2 FNN with uncertain variance, as Fig. 2(b), can be simplified as 1 l ( 3) yˆ (mij , σ ij , σ ij , w j ) = ∑ O j + O j( 3) w (j 4 ) . 2 j =1
[(
) ]
(9)
2.2 The Input-to-State-Stability Learning Algorithm
Input-to-state stability (ISS) is one elegant approach to analyze stability besides Lyapunov method [12]. For case of Gaussian MFs with uncertain variance, the qth output of type-2 FNN can be expressed as yˆ q =
⎡ n ⎛ ( xi − mij ) 2 ⎞ n ⎛ ( xi − mij ) 2 ⎞⎤ . 1 l ⎟+ ⎟⎥ wqj ⎢∏ exp⎜ − exp⎜ − ∑ ∏ 2 2 ⎜ ⎟⎥ ⎜ ⎟ i =1 2 j =1 2σ ij 2σ ij ⎢⎣ i =1 ⎝ ⎠ ⎝ ⎠⎦
The object of the type-2 FNN modeling is to find the center values of B~1 j ,
(10) ~ Bmj , as
well as the MFs A~1 j , A~nj , such that the output Yˆ (k ) of the type-2 FNN (10). Let us define identification error as e(k ) = Yˆ (k ) − Y ( k ) .
(11)
We will use the modeling error e(k) through algorithm to train the type-2 FNN on-line such that Yˆ (k ) can approximate Y (k ), ∀k . According to function approximation theories of fuzzy neural networks, the identification can be represented as yˆ q =
⎛ ( x − m* ) 2 ⎞⎤ ⎛ ( xi − mij* ) 2 ⎞ n 1 l *⎡ n ⎟ + ∏ exp⎜ − i * ij ⎟⎥ − Δ q wqj ⎢∏ exp⎜ − ∑ * 2 ⎜ ⎟ ⎜ 2 j =1 2(σ ij ) ⎠ i=1 2(σ ij ) 2 ⎟⎠⎥⎦ ⎢⎣ i =1 ⎝ ⎝
(12)
where wqj* , mij* , σ ij* , and σ *ij are unknown parameters which may minimize the unmodeled dynamic Δ q . In the case of four independent variables, a smooth function f has Taylor formula as ⎡ ∂ ∂ ∂ ∂ ⎤ f ( x1 , x2 , x3 , x 4 ) = ⎢( x − x10 ) + ( x − x20 ) + ( x − x30 ) + ( x − x40 ) ⎥ f + Rl ∂ x ∂ x ∂ x ∂ x4 ⎦ 1 2 3 ⎣
(13)
where Rl is the remainder of Taylor formula. If we let [x1 x2 x3 x4 ] = [wqj mij σ ij σ ij ] and [x10 x20 x30 x40 ] = [wqj∗ mij∗ σ ij∗ σ ∗ij ], we have
Type-2 Fuzzy Neuro System Via Input-to-State-Stability Approach l ⎛ O ( 3) + O (j3) ⎞ l n ∂ ⎟ + ∑∑ yq + Δ q = yˆ q + ∑ wqj* − wqj ⎜ j yˆ m* − mij ⎜ ⎟ j =1 i =1 ∂mij q ij 2 j =1 ⎝ ⎠ l n l n ∂ ∂ * * + ∑∑ yˆ q σ ij − σ ij + ∑ ∑ yˆ q σ ij − σ ij + R1 j =1 i =1 ∂σ ij j =1 i =1 ∂σ ij
(
)
(
(
(
)
321
)
(14)
)
where R1 is the approximation error. Using chain rule, we obtain ∂yˆ q ∂mij
=
∂yˆ q ∂O j( 3) ∂O
( 3) j
∂mij
+
xi − mij ⎞ 1 ⎛ ( 3) xi − mij ⎞ 1 ⎛ ⎟ + wqj ⎜ O j ⎟ = wqj ⎜ O j( 3) ⎜ ∂mij 2 ⎝ σ ij 2 ⎟⎠ 2 ⎜⎝ σ ij 2 ⎟⎠
∂yˆ q ∂O (j3) ∂O
( 3) j
=
wqj ⎛ ( 3) xi − mij ( 3 ) xi − mij ⎜Oj +Oj 2 ⎜⎝ σ ij 2 σ ij 2
⎞ ⎟ ⎟ ⎠
(15)
∂yˆ q ∂yˆ q ∂O j( 3) ∂yˆ q ∂O (j3) wqj ( 3) (xi − mij ) = + = Oj ( 3) ∂σ ij ∂O j( 3) ∂σ ij 2 ∂O j ∂σ ij σ ij 3
2
(16)
∂yˆ q ∂yˆ q ∂O j( 3) ∂yˆ q ∂O (j3) wqj ( 3) (xi − mij ) . = + = Oj ∂σ ij ∂O j( 3) ∂σ ij ∂O (j3) ∂σ ij 2 σ ij 3 2
(17)
Thus, (14) can be re-written as l ⎛ O j( 3) + O (j3) ⎞ l n wqj ⎛ ( 3) xi − mij ⎞ * ( 3) x i − mij ⎟ + ∑∑ ⎜O ⎟(mij − mij ) y q + Δ q = yˆ q + ∑ (wqj* − wqj )⎜ +Oj 2 2 ⎜ ⎟ j =1 i =1 2 ⎜ j ⎟ 2 σ σ j =1 ij ij ⎝ ⎠ ⎝ ⎠ 2 2 l n ⎡w l n ⎡w ⎤ ⎤ ( x − m ) ( x − m ) ( 3) * qj i ij qj i ij * + ∑∑ ⎢ O j( 3) Oj ⎥ (σ ij − σ ij ) + ∑∑ ⎢ ⎥ σ ij − σ ij + R1 σ ij 3 ⎦⎥ σ ij 3 ⎦⎥ j =1 i =1 ⎣ j =1 i =1 ⎣ ⎢ 2 ⎢ 2
(
)
Rewrite it, we obtain .
[
T
]
( 3) ⎡ O (3) + O1( 3) Ol (3) + O l ⎤ where , l ×1 Wq = wq1 wql ∈ ℜ1×l Z( k ) = ⎢ 1 ⎥ ∈ℜ 2 2 ⎣ ⎦ ( 3) (3) ( 3) ~ ⎡ wq1 O1( 3) ⎡ wq1O1 wql Ol ⎤ wql O l ⎤ 1×l , 1×l , Wq = Wq − Wq* , D D ZLq = ⎢ ⎥ ∈ℜ ⎥ ∈ℜ ZRq = ⎢ 2 ⎥⎦ 2 ⎥⎦ ⎢⎣ 2 ⎢⎣ 2 x n − mn1 ⎡ x1 − m11 * (mn1 − mn*1 )⎤⎥ 2 ⎢ (σ )2 (m11 − m11 ) , ( ) σ 11 n 1 ⎢ ⎥ l×n C R (k ) = ⎢ ∈ ℜ ⎥ xn − mnl ⎢ x1 − m1l * * ⎥ ( m − m ) ( m − m ) 1l 1l ⎢ (σ )2 (σ nl )2 nl nl ⎥⎦ ⎣ 1l
⎡ x1 − m11 ⎢ (σ )2 ⎢ 11 C L (k ) = ⎢ ⎢ x1 − m1l ⎢ (σ )2 1l ⎣ ⎡ ( x1 − m11 )2 ⎢ 3 ⎢ (σ 11 ) B R (k ) = ⎢ ⎢ (x − m )2 1l ⎢ 1 3 ⎣⎢ (σ 1l )
(m
* − m11
(m
1l
− m1*l
(σ
11
− σ 11*
)
(σ
1l
− σ 1*l
)
11
)
)
xn − mn1
(m
)
⎤ − mn*1 ⎥ , ⎥ l×n ∈ ℜ ⎥ xn − mnl ⎥ mnl − mnl* ⎥ 2 (σ nl ) ⎦ (xn − mn1 )2 σ − σ * ⎤ n1 n1 ⎥ , (σ n1 )3 ⎥ l×n ⎥ ∈ℜ (xn − mnl )2 σ − σ * ⎥⎥ nl nl (σ nl )3 ⎦⎥
(σ n1 )2
n1
(
)
(
)
(
)
(18) ,
322
C.-H. Lee and Y.-C. Lin ⎡ ( x1 − m11 )2 * σ 11 − σ 11 ⎢ 3 ( σ ) 11 ⎢ B L (k ) = ⎢ ⎢ (x − m )2 * 1l σ 1l − σ 1l ⎢ 1 4 ( ) σ 1l ⎣⎢
(
(
(xn − mn1 )2 (σ − σ * )⎤ n1 n1 ⎥ (σ n1 )3 ⎥
) )
xn − mnl
(σ nl )3
(σ
− σ nl *
nl
)
.
l×n ⎥ ∈ℜ ⎥ ⎥ ⎦⎥
The identification error is defined as eq = yˆ q − yq , using (18) we have ~ eq = Z(k ) Wq + D ZRq C Rk E + D ZLq C Lk E + D ZRq B Rk E + D ZLq B Lk E + Δ q − R1
(19)
~ e( k ) = Wk Z( k ) + D ZR (k )C Rk E + D ZL ( k )C Lk E + D ZR ( k )B Rk E + D ZL ( k )B Lk E + ς (k )
(20)
and
where
e(k ) = [e1
⎡ w11O1( 3) ⎢ 2 D ZR (k ) = ⎢⎢ ( 3) ⎢ wm1O1 ⎢⎣ 2 Δ = [Δ 1
em ] ∈ ℜ m×1
* ⎡ w11 − w11 ~ ⎢ Wk = ⎢ ⎢ wm1 − wm* 1 ⎣
,
T
⎡ w11 O1(3) w1l Ol ( 3) ⎤ ⎢ ⎥ , 2 ⎥ ⎢ 2 m×l ∈ ℜ D = ZL ⎢ ⎥ ( 3) wml Ol ( 3) ⎥ ⎢ wm1 O1 ⎢ ⎥ 2 ⎦ ⎣ 2
Δ m ] ∈ ℜ m×1 , R 1 = [R11
R1m ] ∈ ℜ m×1 .
T
w1l − w1*l ⎤ ⎥ m×l ⎥ ∈ℜ * ⎥ wml − wml ⎦
(3) w1l O l ⎤ ⎥ , 2 ⎥ m×l ∈ ℜ ⎥ ( 3) wml O l ⎥ 2 ⎥⎦
,
ς (k ) = Δ − R 1 ,
T
By the bound of the Gaussian function and the plant are BIBO stable, Δ and R1 in (19) are bounded. Therefore, ς (k ) in (20) is bounded. The following theorem gives a stable algorithm for discrete-time type-2 FNN. Theorem 1. If we use the type-2 FNN to identify nonlinear plant, the following backpropagation algorithm makes identification error e(k) bounded ⎧ Wk +1 = Wk − η ISS e(k )[Z ( k )]T ⎪ ⎪ m (k + 1) = m (k ) − η wqj ⎛⎜ O (3) xi − mij + O ( 3) xi − mij j ij ISS j ⎪ ij 2 ⎜⎝ σ ij 2 σ ij 2 ⎪⎪ 2 wqj ( 3) (xi − mij ) ⎨ (yˆ q − yq ) 3 ⎪σ ij (k + 1) = σ ij ( k ) − η ISS 2 O j σ ij ⎪ 2 ⎪ wqj ( 3) (xi − mij ) Oj (yˆ q − yq ) ⎪σ ij (k + 1) = σ ij ( k ) − η ISS 3 2 σ ij ⎪⎩
where η ISS =
and
ηk 2
2
1 + Z + 2 D ZR
2
+ 2 D ZL
⎞ ⎟(yˆ − y ) . q ⎟ q ⎠
0 < ηk ≤ 1
(21)
(22)
⋅ denotes 1-norm. In addition, the average of identification error satisfies J = lim sup T →∞
where π = η k
(1 + λ )
2
(
, Z > 0 λ = max k
2
1 T
T
∑e
2
(k ) ≤
k =1
+ 2 DZR
2
ηk ⋅ς π
+ 2 DZL
2
(23)
), ς = max[ς (k )]. 2
k
■
Type-2 Fuzzy Neuro System Via Input-to-State-Stability Approach
323
3 Applications in Nonlinear System Identification Example 1 Consider the BIBO nonlinear plant [3]
y(k ) . (24) 1 + y 2 (k ) In training the type-2 FNN, we use 8 rules to construct the FNN on-line. The initial values of parameters are chosen as y ( k + 1) = u 3 ( k ) +
5 ⎡ mi = ⎢− 1 − 7 ⎣
−
3 7
−1 7
1 7
3 7
5 ⎤, 1 7 ⎥⎦
σ ij =
4 6 2 , σ ij = , σ ij = , w j = 0 . 7 7 7
In addition, the testing input signal u(k) as following equation is used to determine the identification results
mod(k,50) ⎧ ⎪-0.7 + 40 ⎪ rands(1,1) ⎪ u (k ) = ⎨ 0.7 - mod (k,180 ) ⎪ 180 ⎪ ⎛ πk ⎞ ⎪ 0.6 ⋅ cos⎜ ⎟ ⎝ 50 ⎠ ⎩
k ≤ 80 80 < k ≤ 130
(25)
130 < k ≤ 250 k > 250.
Note that, the optimal leaning rate will be invalid when the initial weight is wj=0. According to research [5], we have the optimal leaning rate ⎡
−2 ⎛ ∂yˆ ⎞ ⎤ . ⎟ ⎥ ⎝ ∂W ⎠ ⎦⎥
ηW∗ = min ⎢1, ⎜ ⎣⎢
(26)
Herein, we have different ηISS when ηk is different. We give ten different values, i.e., η k = 0.1, 0.2, ,1 . The simulation results are described in Table 1. We can easily find the best performance when ηk=1. Then, we fixed ηk=1 and have a comparison with ηW∗ . The simulation is shown in Fig. 3(a). The dotted line is plant actual output, the dash-dotted line is the testing output using type-2 FNN with learning rate ηISS ( RMSE = 3.7403 × 10 −3 ), and the real line is the testing output using type-2 FNN with optimal learning rate η ∗ ( RMSE = 3.1305 ×10 −3 ). Figure 3(b) shows the on-line identification performance using type-1 FNN and type-2 FNN. The both FNN systems are with optimal learning rate. The dotted line is plant actual output, the dashdotted line is the testing output using type-1 FNN ( RMSE = 6.9241× 10 −3 ), and the real line is the testing output using type-2 FNN ( RMSE = 3.2047 ×10 −3 ). Example 2 Consider the Duffing forced oscillator system [4, 5] ⎡ x1 ⎤ ⎡0 1⎤ ⎡ x1 ⎤ ⎡0⎤ ⎡ x1 ⎤ , (27) ⎢ x ⎥ = ⎢0 0⎥ ⎢ x ⎥ + ⎢1⎥( f + u + d ) y = [1 0]⎢ x ⎥ ⎦⎣ 2 ⎦ ⎣ ⎦ ⎣ 2⎦ ⎣ ⎣ 2⎦ where f = −c1 x1 − c2 x2 − (x1 )3 + c3cos(c4 t ) and d denotes the external disturbance and is
assumed to be a square-wave with amplitude ± 0.5 and period 2π . Here, we give the
324
C.-H. Lee and Y.-C. Lin Table 1. Comparison of RMSE
ηk 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1
RMSE as training 9.3919×10-3 9.8335×10-3 1.0339×10-2 1.0289×10-2 1.0506×10-2 1.1036×10-2 1.1539×10-2 1.1988×10-2 1.2495×10-2 1.5122×10-2
RMSE as testing 3.7403×10-3 4.7220×10-3 7.1634×10-3 6.5308×10-3 6.1930×10-3 6.9446×10-3 8.2179×10-3 9.9165×10-2 1.1588×10-2 1.2019×10-2
(a)
(b) Fig. 3. Identification results of example 1, (a) type-2 FNN with η ISS and η ∗ ; (b) type-1 FNN with η ∗ and type-2 FNN with η ∗
coefficients are C = [c1 c2 c3 c4 ] = [1 0 12 1] . In training the type-2 FNN, we use 8 rules to construct the FNN on-line. The initial values of parameters are chosen as 5 ⎡ ⎢−1 − 7 ⎢ 20 mij = ⎢− 4 − 7 ⎢ ⎢ − 8 − 40 ⎢⎣ 7 ⎡ 4 16 σi = ⎢ ⎣7 7
3 7 12 − 7 24 − 7 −
T
1 7 4 − 7 8 − 7 −
32 ⎤ ⎡6 , σi = ⎢ 7 ⎥⎦ ⎣7
1 7 4 7 8 7 24 7
3 7 12 7 24 7
5 7 20 7 40 7 T
⎤ 1⎥ , ⎥ 4⎥ ⎥ 8⎥ ⎥⎦
48 ⎤ ⎡2 , σi = ⎢ 7 ⎥⎦ ⎣7
w j = rands(1,1) ,
8 7
16 ⎤ 7 ⎥⎦
T
Type-2 Fuzzy Neuro System Via Input-to-State-Stability Approach
325
Besides, the testing input signal u(k) as following equation is used to determine the identification results ⎧ ⎛ πt ⎞ ⎪cos⎜ 10 ⎟ ⎪ ⎝ ⎠ u (k ) = ⎨ 1 ⎪ −1 ⎪ ⎩
t < 10 sec
(28)
10 sec . ≤ t < 20 sec t ≥ 20 sec .
By the same way, we also give ten different values, i.e., η k = 0.1, 0.2, ,1 , to simulate and pick one value for best choice. The simulation results are described in Table 2. We can easily find the best performance when η k = 0.1 . Then, we fixed η k = 0.1 and have a comparison with ηW∗ by the result of [5]. The simulation is shown in Fig. 4(a). Table 2. Comparison of RMSE
ηk 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1
RMSE as training 8.3333×10-3 8.8069×10-3 9.1803×10-3 1.0604×10-2 1.2220×10-2 1.3241×10-2 1.4374×10-2 1.6746×10-2 2.3208×10-2 2.3521×10-2
RMSE as testing 9.2382×10-3 1.1091×10-2 1.1408×10-2 1.1792×10-2 1.2577×10-2 1.2943×10-2 1.4100×10-2 1.5480×10-2 1.7419×10-2 1.7862×10-2
(a)
(b) Fig. 4. Identification results, (a) type-2 FNN with η ISS (η k = 1) and η ∗ ; (b) type-1 FNN with η ∗ and type-2 FNN with η ∗
326
C.-H. Lee and Y.-C. Lin
The dotted line is plant’ actual output, the dash-dotted line is the testing output using type-2 FNN with learning rateη ISS ( RMSE = 9.954 ×10 −3 ), and the real line is the testing output using type-2 FNN with optimal learning rate η ∗ ( RMSE = 8.175 × 10−3 ). Figure 4(b) shows the on-line identification performance using type-1 FNN and type-2 FNN. The both FNN systems are with optimal learning rate, as [5]. The dotted line is plant’ actual output, the dash-dotted line is the testing output using type-1 FNN ( RMSE = 1.550 × 10 −2 ), and the real line is the testing output using type-2 FNN ( RMSE = 8.175 × 10−3 ). Herein, we choose the number of type-2 fuzzy rule is half of type-1 one. Hence, we can get the simulation results, as type-1 FNN ( RMSE = 1.550 × 10 −2 ), and type-2 FNN ( RMSE = 1.367 × 10 −2 ). We can find that even we reduce the number of parameters in type-2 FNN, we also can get better performance on identification.
4 Conclusions This paper has presented a type-2 FNN system and the corresponding adaptive learning algorithm by ISS approach. In ISS approach, we derived only one learning rate for all parameters of type-2 FNN system. Compare with the results of literature [5], we have to calculate the optimal learning rate for each parameter of type-2 FNN system. The simulations show the ability of type-2 FNN system for nonlinear system identification with different approaches. Even we reduce the number of parameters in type-2 FNN, we also can get the better performance when total parameters in type-2 FNN less than type-1 one. Several simulation results of nonlinear system identification were proposed to verify the ability of function mapping ability of the type-2 FNN system.
Reference 1. Chen, Y.C., Teng, C.C.: A Model Reference Control Structure Using A Fuzzy Neural Network. Fuzzy Sets and Systems 73 (1995) 291–312 2. Jang, J.S.R., Sun, C.T., Mizutani, E.: Neuro-fuzzy and Soft-computing. Prentice-Hall, Upper Saddle River, NJ (1997) 3. Lin, C.T., Lee, C.S.G.: Neural Fuzzy Systems. Prentice Hall: Englewood Cliff (1996) 4. Lee, C.H., Lin, Y.C.: System Identification Using Type-2 Fuzzy Neural Network (Type-2 FNN) Systems. IEEE Conf. Computer, Intelligent, Robotics, and Automation,CIRA03, Japan (2003) 1264 -1269 5. Lee, C.H.: Stabilization of Nonlinear Nonminimum Phase Systems: An Adaptive Parallel Approach Using Recurrent Fuzzy Neural Network. IEEE Trans. Systems, Man, and Cybernetics Part B 34(2) (2004) 1075-1088 6. Zadeh, L.A.: The Concept of a Linguistic Variable and Its Application to Approximate Reasoning. Information Sciences 8 (1975) 199-249 7. Liang, Q., Mendel, J.: Interval Type-2 Fuzzy Logic Systems: Theory and Design. IEEE Trans. Fuzzy Systems 8(5) (2000) 535-550
Type-2 Fuzzy Neuro System Via Input-to-State-Stability Approach
327
8. Karnik, N., Mendel, J., Liang, Q.: Type-2 Fuzzy Logic Systems. IEEE Trans. Fuzzy Systems 7(6) (1999) 643-658 9. Mendel, J., John, R.: Type-2 Fuzzy Sets Made Simple. IEEE Trans. Fuzzy Systems 10(2) (2002) 117-127 10. Mendel, J.M.: Uncertain Rule-Based Fuzzy Logic Systems: Introduction and New Directions. Prentice-Hall: NJ (2001) 11. Wang, C.H., Cheng, C.S., Lee, T.T., Dynamical Optimal Training for Interval Type-2 Fuzzy Neural Network (T2FNN). IEEE Trans. Systems, Man, and Cybernetics Part B 34(3) (2004) 1462-1477 12. Grune, L.: Input-to-state Stability and Its Lyapunov Function Characterization. IEEE Trans. Automatic Control 47(9) (2002) 1499-1504
Fuzzy Neural Petri Nets* Hua Xu1, Yuan Wang1,2, and Peifa Jia1,2 1
State Key Laboratory of Intelligent Technology and Systems, Tsinghua University, Beijing, 100084, P.R. China 2 Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, P.R. China {xuhua,yuanwang05,dcsjpf}@mail.tsinghua.edu.cn
Abstract. Fuzzy Petri net (FPN) is a powerful modeling tool for fuzzy production rules based knowledge systems. But it is lack of learning mechanism, which is the main weakness while modeling uncertain knowledge systems. Fuzzy neural Petri net (FNPN) is proposed in this paper, in which fuzzy neuron components are introduced into FPN as a sub-net model of FNPN. For neuron components in FNPN, back propagation (BP) learning algorithm of neural network is introduced. And the parameters of fuzzy production rules in FNPN neurons can be learnt and trained by this means. At the same time, different neurons on different layers can be learnt and trained independently. The FNPN proposed in this paper is meaningful for Petri net models and fuzzy systems.
1 Introduction Characterized as concurrent, asynchronous, distributed, parallel, nondeterministic, and/or stochastic [1, 2], Petri nets (PN) have gained more and more applications these years. In order to model and analyze uncertain and fuzzy knowledge processing in intelligent systems or discrete event systems, fuzzy Petri nets [3,4] are proposed and have been an area of vigorous theoretical and experimental studies that results in a number of formal models and practical findings, cf. Fuzzy Petri nets (FPN) [3,5]. These models attempted to address an issue of partial firing of transitions, continuous marking of input and output places and relate such models to the reality of environments being inherently associated with factors of uncertainty. However, when FPN is used to model intelligent systems, cf. expert systems, autonomous systems, they are lack of powerful self-learning capability. Based on the powerful self-adaptability and self-learning of Neural Networks (NN), the self-study ability is extended into PN. The study ability extension of PN most frequently found in the literature associate fuzzy firing to transitions. Generally speaking, two classes can be identified. The first one, usually called generalized fuzzy Petri nets, originated from the proposal found in [3]. The places and transitions of the net are represented as OR and AND type and DOMINANCE neurons, respectively. *
This work is jointly supported by the National Nature Science Foundation (Grant No: 60405011, 60575057) and China Postdoctoral Science Fund (Grant No: 20040350078).
D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 328–335, 2007. © Springer-Verlag Berlin Heidelberg 2007
Fuzzy Neural Petri Nets
329
The second approach, usually called BP based fuzzy Petri nets, originated from [6] and is based on the association of the fuzzy description to each transition firing. The fuzzy knowledge is refreshed on the base of BP learning offline. The object of this work is to extend the generalized fuzzy Petri nets by adding fuzzy neurons into FPN. The fuzzy neurons not only own self-adapting and selflearning ability but also can be regarded as independent fuzzy neuron components in PN. The application areas of Petri nets being vigorously investigated involve knowledge representation and discovery [6, 7], robotics [8], process control [9], diagnostics [10], grid computation [11], traffic control [12], to name a few high representative domains. The research undertaken in this study directly extends independent neuron components into PN models developed and used therein by defining independent neuron components with self-learning ability. The paper is organized as the following. In Section 2, we review the existing models of Petri nets that are developed in the framework of fuzzy sets and neural network concepts. Section 3 concentrates on the underlying formalism of the detailed model where independent neuron components defined in Petri nets. Section 4 concentrates on the issue of learning in independent neuron components. A robot track decision application example is discussed in Section 5. Finally, conclusions are covered in Section 6.
2 Fuzzy Petri Nets (FPN) 2.1 Petri Nets Petri net is a particular kind of directed graph, together with an initial state called the initial marking, M0. The underlying graph N of a Petri net is a directed, weighted, bipartite graph consisting of two kinds of nodes, called places and transitions, where arcs are either from a place to a transition or from a transition to a place. In graphical representation, places are drawn as circles, transitions as bars or boxes. Arcs are labeled with their weights (positive integers), where a k-weighted arc can be interpreted as the set of k parallel arcs. A common Petri net can be defined as the following. Definition 1: A Petri net is a 5-tuple, PN=(P, T, A, W, M0) where z z z z z z
P={p1,p2,…,pn} is a finite set of places, T={t1,t2,…,tn} is a finite set of transitions, A={P╳T} {T╳P} is a set of arcs, W: A {1,2,3,…} is a weight function, M0: p {0,1,2,3,…} is the initial marking, P T= and P T
→ → ∩ Φ
∪
∪ ≠Φ.
■
A transition t is said to be enabled if each input palace p of t is marked with at least w(p,t) tokens, where w(p,t) is the weight of the arc from p to t. An enabled transition may or may not be fired depending on whether or not the event actually takes place.
330
H. Xu, Y. Wang, and P. Jia
A firing of an enabled transition t removes w(p,t) tokens from each input place p to t, where w(p,t) is the weight of the arc from p to t, and adds w(t,p) tokens to each output place p of t, where w(t,p) is the weight of the arc from t to p. 2.2 Fuzzy Petri Nets On the base of common PN, fuzzy Petri net can be defined as the following. Definition 2: A fuzzy Petri net is a 9-tuple, FPN={P,T,D,A,M0,th,f,W,β}, where z z z z z
z z
P, T, A and M0 are similar to those in PN. D={d1,d2,…,dn} is the finite set of propositions and di [0,1]. th: T [0,1] is the mapping from transitions to thresholds. f: T [0,1] is the mapping from transitions to confidence level value. W: P [0,1] is the mapping from propositions represented by places to truth values. It represents the supporting level of every place representing the corresponding proposition condition for transition firing. The corresponding truth value set is {ω1, ω2,…, ωn}. β: P D is the mapping from places to propositions. P T D= and |P|=|D|.
∈
→ → →
→ ∩∩ Φ
■
∈
In FNPN, the truth value of the place pi, pi P is denoted by the weight W(pi), W(pi)= ωi and ωi [0,1]. If W(pi)= ωi and β(pi)=di. This configuration states that the confidence level of the proposition di is ωi. A transition ti with only one input place is enabled, if for the input place pj I(ti), ωi th(ti)= λi where λi is the threshold value. If the transition ti fires, the truth value of its output place is ωi μi,where μi is the confidence level value of ti. For instance, the following fuzzy reduction rule can be modeled and subsequently fired as shown in Fig. 1. For more details of FPN, reference [3] has given a comprehensive discussion.
∈
∈
≥
·
IF dj THEN dk (CF=μj), λj, ωi.
(1)
The truth value ωj of the place Pj and the confidence level value μj of the transition ti are aggregated through the algebraic product yk=ωi μj, where yk is the truth value of its output place.
·
Fig. 1. A Simple Typical FPN Model
3 Fuzzy Neural Petri Nets (FNPN) Besides the basic parts, the building blocks of a FNPN also include neurons. A neuron is a coarse grained subnet of a FNPN, which can also be regarded as basic
Fuzzy Neural Petri Nets
331
components similar to those in CTPN [13]. The neuron models describe its fuzzy information processing procedure in form of FNPN sub-nets or components. A simple typical FNPN based neuron is illustrated in Fig. 2. In Fig. 2, the input signal of a neuron and threshold function are realized by the input places pi (i=1,…,n) and the transitions tj (j=1,…,n). The integration place Pj calculates the transferring function output according to the neuron transferring function. If the result is not less than the threshold, the threshold transition Tk will be fired.
y1
ω1
ym
Fig. 2. A Typical FNPN based Neuron Model
Fuzzy neural Petri net can be defined as the following. Definition 3: A fuzzy neural Petri net is an 11-tuple FNPN=(P,T,D,A,M0,Kp,Kt, th,f,W, β): z z z
P,T,D,A,M0,th,f,W and β are similar to those in definition 2, Kp is the state set of hidden layer and output layer, Kt is the mapping from T to rule sets. P1
Pn
Input Layer t1
… ωn
… tn
Hidden Layer
Output Layer
λ1,μ1
ω11 ωn1 ω1j ωnj
y11
P1
…
T1
λj,μj
Pj
y1m yj1 yjm
…
Top Hierarchy
ym
Tj Mapping and Unfolding
Bottom Hierarchy
Fig. 3. Abstracting Neurons in FNPN
■
332
H. Xu, Y. Wang, and P. Jia
FNPN can abstract the realization details of neurons illustrated in Fig. 3. According to the modeling and analyzing requirements, the neurons in different layers can be abstracted as FNPN based neuron component. At the same time, the FNPN based model can simplify the modeling and analyzing procedure of neuron hierarchical model by abstracting neurons or sub neural networks with independent self-learning ability. On the other hand, the abstract neurons with self learning ability can also be unfolded in the whole model directly without changing the connection relation. The model with unfolded neuron FNPN subnets is a complex one, however the abstract model is a simple hierarchical one.
4 Learning in FNPN Suppose the FNPN model to be studied is n-layered with b ending places pj, where j=1,…,b. r learning samples are used to train the FNPN model. The performance evaluation function is defined as the following: r
E =
b
∑∑ i=1
j =1
′
(M i( p j) − M
i
( p j ))
2
,
(2)
2
′
where Mi(pj) and Mi (pj) represent the actual marking value and the expected one of the ending place pj respectively. Suppose ti(n) is one transition on the nth layer ti(n) Tn. The weights of the (n) (n) (n) corresponding input arcs are ωi1 , ωi2 ,…,ωim . Its threshold is λi(n) and its truth (n) (n) value is μi . If the place pj is one of the output places of the transition ti(n), obviously it is also the ending place. The BP based learning algorithm [14] is used in FNPN. dE dω
(n) ix
∈
δ
=
(n)
dE dμ
(n) i
dE d λ (i n )
δ
d (M
×
(n)
dω
( p j ))
=
δ
×
d (M
=
δ (n) ×
d (M
(n)
=
(n)
(n)
dμ
dλ
(n)
( p j )) ,
(n) i
(n)
dE d (M
x=1,2,…,m-1,
(n) ix
( p j ))
(n) i
. ( p j ))
,
(3)
(4)
(5)
(6)
According to the BP learning algorithm [14], the parameters of the (n-1)th,…, 1st layer can be calculated. δ(q), dE/dωix(q), dE/dμi(q) and dE/dλi(q), where x=1,2,…,m-1, q=n2,…,1. The adjusting algorithm of the parameters of the transition of ti(q) can be got as the following: ωix(q)(k+1)= ωix(q)(k) - ηdE/dωix(q) ,
(7)
Fuzzy Neural Petri Nets
where x=1,…,m-1, q=n,…,1 and
∑ω
333
(q) ix =1.
μi(q)(k+1)= μi(q)(k) - ηdE/dμi(q) ,
(8)
λi(q) (k+1)= λi(q) (k) - ηdE/λi(q) .
(9)
In the above equations, η is the learning rate.
5 A Simple Example 5.1 Rim Judgment Example in Arc Weld Robots In ship building manufacture, arc weld robots are always used to process the steel plate according to the required form. In the processing procedure, the control system of arc weld robot needs to judge the rim of processed steel plate so as to make decisions of processing track plan. However, the steel plate is always irregular in space. Only are the coordinates of the points to be processed used to judge the rim. It requires self-adaptive algorithm to complete the uncertain reasoning. So FNPN is used to model the rim judgment procedure. Suppose the coordinate of the measured point on the steel plate is (x, y, z). The steel plates are irregular, so only the difference between neighbor point coordinates can be referenced to judge whether it is a point on the steel plate or not. The judgment decision model is constructed on the base of FNPN in Fig.4. As the model input, the coordinates are used to conduct fuzzy reasoning in two neurons (PN1 and PN2) with self-learning ability. Then the decision results can be got. If the output of p4 equals to 1, it represents the point is on the steel plate. Otherwise, if that of p5 is 1, it represents the point is outside. The corresponding FNPN model is illustrated in Fig. 4.
ω 12
y 12
Fig. 4. The FNPN Model for Steel Plate Rim Judgment
For initializing model, the initial parameters are set as the following according to the processing experiments: ω11=0.3, ω21=0.2, ω31=0.5; ω12=0.3, ω22=0.2, ω32=0.5; λ1=0.7, μ1=0.6; λ2=0.7, μ2=0.6; y11=0.5, y21=0.5; y12=0.5, y22=0.5. The FNPN model is trained on the base of 100 groups of testing data, where b=1000, η=0.03. After the training, the fuzzy reasoning is conducted on the base of the trained FNPN. The reasoning results are listed in Table 1.
334
H. Xu, Y. Wang, and P. Jia Table 1. The Actual Output and the Expected Output No . 1 2 3 4 5 6 7 8 9 10
P4 Actual Output 0.9898 0.9898 0.9896 0.9896 0.9896 0.9896 0.9896 0.9898 0.9896 -0.0130
P5 Expected Output 1 1 1 1 1 1 1 1 1 0
Actual Output -0.0085 -0.0085 -0.0087 -0.0087 -0.0087 -0.0087 -0.0087 -0.0085 -0.0086 0.9774
Expected Output 0 0 0 0 0 0 0 0 0 1
According to the reasoning results of FNPN in Table.1, the FNPN model has been demonstrated to be effective to model the fuzzy knowledge based systems for actual applications. 5.2 Application Analysis From the view of the former FNPN based example, compared with FPN and NN, FNPN manifests the following advantages: z
z
As an independent self-learning component, neuron is introduced into the FPN. It can model the complex systems, which include several steps with independent self-learning NNs or neurons. In FNPN, complex neurons or NNs can also be abstracted as FNPN components, when the system model needs to be analyzed in the system level. The abstraction and hierarchy is another outstanding feature for FNPN.
6 Conclusions Fuzzy is a usual phenomenon in knowledge-based expert systems especially in the system with fuzzy production rules. FPN is a powerful modeling tool to model fuzzy systems or uncertain discrete event systems. In order to extend FPN with self-learning capability, this paper proposes the fuzzy neural Petri net on the base of FPN and neural networks. As a kind of FNPN component, neurons are introduced into FNPN. Neurons in FNPN are FNPN sub-nets with BP based self-learning ability. The parameters of fuzzy production rules in every neuron component can be trained in its own layer. Neurons are depicted in different layers, which is meaningful for representing FNPN models with multi-rank NNs. State analysis needs to be studied in the future. Xu [15] has proposed an extended State Graph to analyze the state change of objects/components based models. With the temporal fuzzy sets introduced into PN, the confidence level about transition firing (state changing) needs to be considered in the state analysis.
Fuzzy Neural Petri Nets
335
References [1] Murata, T.: Petri Nets: Properties, Analysis and Applications. Proceedings of IEEE 77 (1989) 541-580 [2] Peterson, J.L.: Petri Net Theory and the Modeling of Systems. Prentice-Hall, New York (1991) [3] Pedrycz, W., Gomide, F.: A Generalized Fuzzy Petri Net Model. IEEE Trans. Fuzzy Ststems 2 (1994) 295-301 [4] Pedrycz, W., Camargo, H.: Fuzzy Timed Petri Nets. Fuzzy Sets and Systems 140 (2003) 301-330 [5] Scarpelli, H., Gomide, F., Yager, R.: A Reasoning Algorithm for High Level Fuzzy Petri Nets. IEEE Trans. Fuzzy Systems 4 (1996) 282-293 [6] Manoj, T.V., Leena, J., Soney, R.B.: Knowledge Representation Using Fuzzy Petri Netsrevisited. Knowledge and Data Engineering, IEEE Transactions on 10 (4) (1998) 666- 667 [7] Jong, W., Shiau, Y., Horng, Y., Chen, H., Chen, S.: Temporal Knowledge Representation and ReasoningTechniques Using Ttime Petri Nets. Systems, Man and Cybernetics, Part B, IEEE Transactions on 29 (4) (1999) 541 - 545 [8] Zhao, G., Zheng, H., Wang, J., Li, T.: Petri-net-based Coordination Motion Control for Legged Robot. Systems, Man and Cybernetics, 2003. IEEE International Conference on 1 (2003) 581 - 586 [9] Tang, R., Pang, G.K.H., Woo, S.S.: A Continuous Fuzzy Petri Net Tool for Intelligent Process Monitoring and Control. Control Systems Technology, IEEE Transactions on 3 (3) (1995) 318- 329 [10] Szücs, A., Gerzson, M., Hangos, K. M.: An Intelligent Diagnostic System Based on Petri Nets. Computers & Chemical Engineering 22 (9) (1998) 1335-1344 [11] Han, Y., Jiang, C., Luo, X.: Resource Scheduling Model for Grid Computing Based on Sharing Synthesis of Petri Net. Computer Supported Cooperative Work in Design, 2005. Proceedings of the Ninth International Conference on 1 (2005) 367 - 372 [12] Wang, J., Jin, C., Deng, Y.: Performance Analysis of Traffic Networks Based on Stochastic Timed Petri Net Models. Engineering of Complex Computer Systems, 1999. ICECCS '99. Fifth IEEE International Conference on (1999) 77 - 85 [13] Wang, J., Deng, Y., Zhou, M.: Compositional Time Petri Nets and Reduction Rules. Systems, Man and Cybernetics, Part B, IEEE Transactions on 30 (4) (2000) 562-572 [14] Gallant, S.: Neural Network Learning and ExpertSystems. Cambridge, Mass. : MIT Press, c1993 [15] Xu, H.; Jia. P.: Timed Hierarchical Object-Oriented Petri Net-Part I: Basic Concepts and Reachability Analysis. Lecture Notes In Artificial Intelligence (Proceedings of RSKT2006) 4062 (2006) 727-734
Hardware Design of an Adaptive Neuro-fuzzy Network with On-Chip Learning Capability Tzu-Ping Kao, Chun-Chang Yu, Ting-Yu Chen, and Jeen-Shing Wang Department of Electrical Engineering, National Cheng Kung University, Tainan 701, Taiwan, R.O.C
[email protected]
Abstract. This paper aims for the development of the digital circuit of an adaptive neuro-fuzzy network with on-chip learning capability. The on-chip learning capability was realized by a backpropagation learning circuit for optimizing the network parameters. To maximize the throughput of the circuit and minimize its required resources, we proposed to reuse the computational results in both feedforward and backpropagation circuits. This leads to a simpler data flow and the reduction of resource consumption. To verify the effectiveness of the circuit, we implemented the circuit in an FPGA development board and compared the performance with the neuro-fuzzy system written in a MATLAB® code. The experimental results show that the throughput of our neuro-fuzzy circuit significantly outperforms the NF network written in a MATLAB® code with a satisfactory learning performance.
1 Introduction A neural-fuzzy (NF) system is well-known for its capability to solve complex applications. However, its highly computational demand hinders itself from many real-time applications. Realization of NF systems into hardware circuits is a good solution to remove the hindrance. How to design the circuits that can efficiently process the network computation and economically allocates hardware resources becomes an important research topic. When designing a digital NF network circuit, there are several issues to consider. These include: 1) how to implement nonlinear functions as linguistic term sets, 2) how to realize a complex parameter learning algorithm, and 3) how to design a highly efficient circuit. Several researchers have proposed different approaches to implement membership functions in digital circuits. For example, Ref. [1] has developed a VLSI fuzzy logic processor with isosceles triangular functions in the digital circuit for controlling the idle speed of engines. A look-up table was proposed to substitute the direct implementation of nonlinear functions in [2]. The look-up table is easy to realize into a hardware device; however, the precision of computational results using a look-up table may not be satisfactory if memory resource is limited. In the second issue, learning capability is a desirable property in NF networks but a learning algorithm is usually complicated for hardware realization. Three methods, offline learning, chip-in-the-loop learning, and on-chip learning have been proposed D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 336–345, 2007. © Springer-Verlag Berlin Heidelberg 2007
Hardware Design of an Adaptive Neuro-fuzzy Network
337
for the hardware realization of parameter learning algorithm [3]. On-chip learning is more attractive for real-time applications because the parameter training can be performed in an online mode. However, the complexity of such hardware design is higher than those of the other two methods. In [4], a real-time adaptive neural network has developed to provide the instant update for the parameters of the network under continuous operations. Ref. [5] has proposed a logic-oriented neural network with a backpropagation algorithm. The drawback of this approach is that the network training is not stable due to the quantization of the weights and neuron outputs into integer values. In order to improve the execution performance, Ref. [6] has proposed an online backpropagation algorithm with a pipelined adaptation structure that separate the parameter learning into several stages and the computation in each stage can be performed simultaneously. However, the efficiency of pipeline operations is lower than that of pipeline scheduling using a dataflow graph. To develop a highly efficient neuro-fuzzy circuit, this study aimed at the integration of the following three concepts into the hardware design: data sharing, optimal scheduling, and pipeline architecture. Each of these possesses its significance in increasing the computational speed as well as the efficiency of NF circuits. For the data sharing, the computational results of each layer in the feedforward circuit were stored in buffers to establish a database that can be used for the calculation of error gradients in the backpropagation algorithm. This idea not only simplifies the data flow of the whole algorithm but reduces the network resource requirement. We used an integer linear programming [7] approach to obtain an optimal scheduling. Moreover, we investigated the resource consumption and performance of different pipeline architectures to increase the throughput and efficiency of the circuit. The rest of this paper is organized as follows. In Section 2, we introduce our adaptive neuro-fuzzy network with its functionalities of each layer. The hardware implementation of the NF network is presented in Section 3. Section 4 provides the hardware verification of the network as well as simulations using the NF circuit as a controller for a path following problem. Finally, conclusions are summarized in the last section.
2 Adaptive Neuro-fuzzy Network We realize a four-layer NF system shown in Fig. 1 into a circuit due to its structural
simplicity. The computation of the network includes two procedures: feedforward and backpropagation. 2.1 Feedforward Computation The detailed function explanation of the nodes in each layer is as follows. z z
Layer 1: The nodes in this layer only transmit input values to the second layer. Layer 2: Each node in this layer represents a membership function which is an isosceles triangle-shaped function.
338
T.-P. Kao et al.
μij(2) ( xi ) = 1 − 2
| xi − aij | bij
(1)
. th
where xi, aij and bij are the input data of the i input node, centers and widths of an
z
isosceles triangle membership function. The index j indicates the labels of membership nodes. Layer 3: The nodes in this layer represent fuzzy logic rules and the function of each node is as follows.
μ (3)l = ∏μij(2) , j ∈{μij(2) with connection to lth node} . n
1≤l ≤m
z
(2)
i =1
Layer 4: The inference results in the previous layer are multiplied respectively by specific weights and divided by the sum of the outputs in layer 3 as a defuzzification process.
∑ ∑
m
y = μ (4) =
l =1 m
μl(3) wl
l =1
μl(3)
(3)
.
2.1 Backpropagation (BP) Learning Algorithm Based on the architecture in Fig. 1, a BP algorithm is utilized to update the centers and widths of the membership function, and the weights of the output layers. First, we define the error function which we want to minimize. 1 ( y − y d )2 . (4) 2 where y is the actual output and yd is the desired output. We express the error function by substituting (1), (2) and (3) into (4) and the function can be expressed as: E=
1 ∑ μ (x) wl 1 E = ( l =m1 − y d )2 = ( (3) 2 ∑ μl ( x ) 2 m
(3) l
l =1
n
∑ (∏ μ m
l =1
(2) ij
) wl
i =1 n
∑ j =1 (∏ μ ) M
i =1
− y d )2 .
(5)
(2) ij
The corresponding error signals for the adjustable parameters are derived as follows: (2) ∂E ∂E ∂μl(3) ∂μij ⎪⎧⎡ 1 ⎤ ⎡ ⎤ 1 ⎪⎫ 2 = (3) (2) =⎨⎢( y − yd ) × ⎢∑wl μl(3) − y∑μl(3) ⎥ × ⎬× (2) × sign(xi − aij ). (6) ⎥ ∂aij ∂μl ∂μij ∂aij ⎪⎩⎣ ACC ⎦ ⎣ l l ⎦ bij ⎪⎭ μij (2) ∂E ∂E ∂μl(3) ∂μij ⎪⎧⎡ 1 ⎤ ⎡ ⎤ 1 ⎪⎫ 1 = (3) (2) = ⎨⎢( y − yd ) × ⎢∑wl μl(3) − y∑μl(3) ⎥ × ⎬×( (2) −1). (7) ⎥ ∂bij ∂μl ∂μij ∂bij ⎪⎩⎣ ACC ⎦ ⎣ l l ⎦ bij ⎪⎭ μij
∂E ∂E ∂y ⎡ 1 ⎤ = = ( y − yd ) × μk(3) . ∂wk ∂y ∂wk ⎢⎣ ACC ⎥⎦ The parameter update rules are described as follows. ∂E aij (t + 1) = aij (t ) − η . ∂aij
(8)
(9)
Hardware Design of an Adaptive Neuro-fuzzy Network
339
bij (t + 1) = bij (t ) − η
∂E . ∂bij
(10)
wij (t + 1) = wij (t ) − η
∂E . ∂wij
(11)
η is the leaning rate. yp
y1 Layer 4
w11
wPM
R1 Layer 3
Layer 2
R2
Π
/\
RM
Π
/\
Π
x − aij
i /\ 1 − 2 b j i
/\
Layer 1
x1
xn
Fig. 1. Structure of neuro-fuzzy network
3 Hardware Implementation of Adaptive Neuro-Fuzzy Networks 3.1 Circuit Architecture and Computational Results Sharing The hardware design of the NF network is divided into a datapath design phase and a control-path design phase. The datapath design includes a feedforward circuit design and a backpropagation circuit design. In order to reduce computational complexity in the datapath and to simplify control signals in the control path, we adopted two approaches in our design. First, we analyzed the calculation regularity of the NF network to decompose the circuit into several modules and to avoid redundant operations. Second, we accelerated the computational process by sharing the computational results that are required in both circuits. To realize which computational result can be shared, we analyze the equations of the feedforward and backpropagation procedures to extract the mathematical terms of the equations that appear in both procedures. That is, we store the computational results that are obtained in the feedforward circuit and will be used in the backpropagation circuit in specific memory locations. Such storages can avoid a great amount of redundant computations. We partitioned the feedforward computation into three primary modules: membership function module, fuzzy inference engine, and defuzzifier. Based on this partition, datapaths are scheduled to achieve as more concurrent executions as possible without
340
T.-P. Kao et al.
any violation on the restriction of data dependency and resource sharing. We employed an integer linear programming approach to achieve optimal scheduling. After the datapath analysis, we divided the feedforward computation into three asynchronous parts. Each part is realized by a synchronous fine-grain pipeline architecture to accelerate the computational speed of the circuit. The backpropagation algorithm, however, is realized by a synchronous pipeline circuit because of continuous update process and resource limitation. During the data transformation, each module should process/transfer the information containing data values, data indexes, and calculation flags from/to its previous/following module. In addition, the data communication follows handshaking logic to ensure the logical order of the circuit events and to avoid race conditions. Based on these issues and considerations on optimal scheduling and allocation analysis, we proposed a new control approach that integrates asynchronous and synchronous design methodologies. That is, we construct synchronous circuits for functional modules and design an asynchronous circuit for the communication between three modules. The islands of synchronous units are connected by an asynchronous communication network as illustrated in Fig. 2. We named this architecture a globally-asynchronous locally-synchronous circuit.
Req
Req
Ack
HS Start
Done
Req Ack
Req
Ack
HS Start
Ack Done
HS Start
Done
Input
Output R1
Module Register
F1
R2
F2
R3
F3
Function Units
Fig. 2. Asynchronous communication approach with islands of synchronous units
3.2 Dataflow of Backpropagation Algorithm The backpropagation learning algorithm can be expressed as the form in Fig. 3. In the equations, we use several labels to represent the buffers that store the computational results obtained in the feedforward circuit. These data storages enable the backpropagation circuit to calculate the error signals for adjustable parameters efficiently due to the data sharing and the omission of the same data computations. In addition, the data sharing leads to a simpler data flow and a reduction of resource consumption without any increasing cost. Here, we provide an example to illustrate our idea of the data sharing. From a circuit point of view, the term, ⎛⎜ ∑ wl μl(3) − y ∑ μl(3) ⎞⎟ × 1 × 1(2) , in the update rule of aij is ⎝ l l ⎠ bij μij complicated to implement. In order to efficiently utilize the resources, some buffers are designed for the feedforward circuit to store the computational results that can be
Hardware Design of an Adaptive Neuro-fuzzy Network
341
shared in the backpropagation algorithm. According to the idea of data sharing, the terms, 1 , ∑ wl μl(3) , and ∑ μl(3) calculated in the feedforward computation can be bij × μij(2)
l
l
stored in the buffers temporarily and be retrieved to reduce the design complexity of the learning rules for tuning aij in the backpropagation circuit. Similarly, the other two update formulas in Fig. 3 can be simplified by the same idea of the first update formula. Note that the data dependency between these formulas is changed because of the data sharing. This constraint should be considered in the scheduling optimization. E aij
(3) l
y
=
y
(2) ij
(3) l (2) ij
E
aij
1 AC C
d
E bij
E (3) l
y
=
E wk
y
d
E y = y wk
(3) l
1 bij
(3) l
y
l
err AC C (3) l (2) ij
rule _ buf
wl
l
ReuseRul
2 (2) ij
sign( xi
aij ).
Inv _ b
(2) ij
bij 1 ACC
wl
( 3) l
( 3) l
y
l
l
1 bij
1 (2) ij
(1
( 2) ij
).
Reuse _ tune _ a
y
yd
1 ACC
(3) k
err AC C
Fig. 3. Learning rules for sharing computational results
3.3 Pipeline Architecture of Backpropagation Circuit In the backpropagation learning circuit, the datapath of updating wl is designed as two pipeline stages with just one clock latency. The throughput of the path is equal to the clock rate. Two choices can be selected for the update path of aij and bij: 1) a nonpipeline circuit with two multipliers, and 2) a structure with one pipeline latency. The first choice takes 70 control steps while the second choice takes only 18 control steps but needs three additional multipliers. In the second choice, the computation of the circuit is the fastest. However, the update of weights wl takes 50 control steps. The cost of control steps are determined by the update procedure of weights. That means the update procedure has to wait 32 control steps until the update of weights finishes. Hence, this result is not desirable. Fig. 4 shows the final pipeline scheduling. There are two pipeline latencies in the update datapaths of aij and bij. The execution process takes 32 control steps. Table 1. Performance Analysis based to Different Pipeline Latency Parameters Pipeline Latency Multiplier Execution Step
Backpropagation Learning Circuit aij and bij Yes No Yes 1 0 1 1 1 4 49+1= 50 Steps 5×14= 70 Steps 1×14+4=18 Steps wl
Yes 2 2 2×14+4= 32 Steps
342
T.-P. Kao et al. err rule _ buf y ReuseRul e ACC wl ReuseWst ReuseWs Wt
b
err ACC A CC
(2) ij
one
>>
aiji
<< >>
wl
aij
bij >>
bij
Fig. 4. The DFG of pipeline scheduling (latency = 1 for wl, latency = 2 for aij and bij)
NF top-level
FSM controller
Forward path register file
Backward path register file
Sharing memory of Computation result
ALU
Main FSM
ALU FSM
Handshaking FSM
Multiplication
Division
Fig. 5. Block diagram of the modular NF structure
Although the cost of the execution step is larger than 18, it is still smaller than the update procedure of the parameters wl. Furthermore, this structure only requires additional one multiplier to obtain a better circuit performance. Fig. 5 shows the modular block diagram of the NF circuit. The results of performance analysis are provided in Table 1.
4 Hardware Simulations and Verification The proposed architecture has been coded in Verilog by using register transfer level (RTL) model. Before the RTL code of the NF network circuit is synthesized, we used MATLAB® to establish a software simulation platform for function verification. This simulation platform is used to simulate the NF circuit as a controller to learn how to drive a vehicle to follow a planned trajectory. Fig. 6 illustrates the car-driving system where the NF circuit was implemented in an FGPA device and served as a forward controller, and the rest blocks are simulated in a PC. In order to online
Hardware Design of an Adaptive Neuro-fuzzy Network
FPGA
ud(t)
NeuralFuzzy Controller
343
(t)
Communication Interface
Trajectory Planner
ARM
+ e(t)
+
P Controller
-
PC
+
Car System
PC u(t)
Fig. 6. Simulation platform of the car-driving system Table 2. Comparison of Hardware Execution and Software Simulation
Register
Decimal
System
0.123194 -37.9291 28.8083 27.9886 40 10.3094 6.4178 -8.8147 9.7273 36.1794 11.28102
output
Average
Hardware
Error
2018 -621430 471995 458565 655360 168909 105149 -144420 159372 592763
2024 -621395 472039 458620 655355 168931 105201 144384 159255 592723
-6 -35 -44 -55 5 -22 -52 -36 117 40
Error in Decimal -0.00037 -0.002136 -0.00269 -0.00336 0.000305 -0.00134 -0.00317 -0.002197 0.007141 0.002441
184828.1
213713.7
-8.8
-0.00054
Shifted Binary
train the NF parameters, we used a proportional controller as an auxiliary controller not only to compensate the insufficiency of the forward controller for achieving a satisfactory trajectory-following accuracy but to provide an error signal for tuning the parameters of the NF circuit. The planned trajectory was generated by a path planning algorithm that is able to find a shortest path from a given initial location to a final destination and to avoid obstacles in a globally optimal manner. The training patterns of the NF circuit were obtained by discretizing the trajectory with a fixed sampling time. The learning objective of the NF circuit was to follow the planned trajectory with a minimal error. The learning process was performed iteratively to tune the NF parameters and was stopped once the total mean square error achieves a pre-specified criterion. This platform is used to verify the effectiveness of the NF network circuit and to compare the efficiency of the NF network implemented in a hardware device and
344
T.-P. Kao et al.
a software system. Because we only used 14 bits to represent the decimal values, sometimes the output values were not as accurate as those of the software simulation platform. Some examples of numerical errors are illustrated in Table 2 to show the difference between the software simulation and hardware execution. The same generated trajectory inputs were used in both RTL and MATLAB® simulations. In our empirical experience, this small error did not cause any instability during parameter learning or degradation in learning performance. Fig. 7 illustrates the learning result of the NF controller implemented in an FPGA device to drive the car to follow the path generated by the path planning algorithm. From the figure, we can see that the actual path is very close to the optimal path after several iterations of parameter learning. In Table 3, the throughput rate of the NF network implemented in an FPGA is much higher than the software simulation. Especially, the performance of the backpropagation learning is excellent because of the effectiveness of our pipeline architecture and data sharing.
Fig. 7. Comparison of the learning results obtained by using the NF circuit in an FPGA
device and the software written in a MATLAB® code for the trajectory following application Table 3. Throughput of Execution on MATLAB® and FPGA
Feedforward Circuit Throughput Rate Backpropagation Circuit Throughput Rate
MATLAB® 0.438 KHz (period 2.28 ms) 0.1 KHz (period 9.18 ms)
FPGA 308.64 KHz (period 3.24 μs) 510.21 KHz (period 1.96 μs )
Hardware Design of an Adaptive Neuro-fuzzy Network
345
5 Conclusion This paper presents a digital hardware implementation of an adaptive neuro-fuzzy network with on-chip learning capability. We proposed an idea of data sharing to reduce the complexity of hardware implementation for a backpropagation learning algorithm. Without the repetition of performing the same computation, we believe that the consumption of hardware resource is greatly reduced while the throughput rate can be increased significantly. Finally, we implemented the circuit in an FPGA device to serve as a controller for driving a car to follow a desired trajectory. The simulation results show that the throughput of our NF circuit significantly outperforms the NF network written in a MATLAB® code with satisfaction in learning performance.
References [1] Jin, W.W., Jin, D.M., Zhang, X.: VLSI Design and Implementation of a Fuzzy Logic Controller for Engine Idle Speed. Proc. of 7th IEEE Int’l. Conf. on Solid-State and Integrated Circuits Technology 3 (2004) 2067-2070 [2] Marchesi, M., Orlandi, G., Piazza, F., Pollonara, L., Uncini, A.: Multi-layer Perceptrons with Discrete Weights. Int’l. Joint Conf. on Neural Networks 2 (1990) 623-630 [3] Reyneri, L.M.: Implementation Issues of Neuro-Fuzzy Hardware: Going Toward HW/SW Codesign. IEEE Trans. Neural Networks 14 (2003) 176-179 [4] Yi, Y., Vilathgamuwa, D.W., Rahman, M.A.: Implementation of an Artificial-NeuralNetwork-Based Real-Time Adaptive Controller for an Interior Permanent-Magnet Motor Drive. IEEE Trans. Industry Applications 39 (2003) 96-104 [5] Kamio, T., Tanaka, S., Morisue, M.: Backpropagation Algorithm for Logic Oriented Neural Networks. Proc. of the IEEE Int’l. Joint Conf. on Neural Networks 2 (2000) 123-128 [6] Girones, R.G., Salcedo, A.M.: Systolic Implementation of a Pipelined On-Line Backpropagation. Proc. of the 7th Int. Conf. on Microelectronics for Neural, Fuzzy and BioInspired Systems (1999) 387-394 [7] Hwang, C.T., Lee, J.H., Hsu, Y.C.: A Formal Approach to the Scheduling Problem in High Level Synthesis. IEEE Trans. Computer-Aided Design 10 (1991) 464-475
Stock Prediction Using FCMAC-BYY Jiacai Fu1, Kok Siong Lum2, Minh Nhut Nguyen2, and Juan Shi1 1
Research Centre of Automation, Heilongjiang Institute of Science and Technology, Harbin, China 2 School of Computer Engineering, Nanyang Technological University Singapore 639798
Abstract. The increasing reliance on Computational Intelligence applications to predict stock market positions have resulted in numerous researches in financial forecasting and trading trend identifications. Stock market price prediction applications are required to be adaptive to new incoming data as well as have fast learning capabilities due to the volatility nature of market movements. This paper analyses stock market price prediction based on a Fuzzy Cerebellar Model Articulation Controller – Bayesian Ying Yang (FCMAC-BYY) neural network. The model is motivated from the Chinese ancient Ying-Yang philosophy which states that everything in the universe can be viewed as a product of a constant conflict between opposites, Ying and Yang. A perfect status is reached if Ying and Yang achieves harmony. The analyzed experiment on a set of real stock market data (Singapore Airlines Ltd – SIA) in the Singapore Stock Exchange (SGX) and Ibex35 stock index shows the effectiveness of the FCMAC-BYY in the universal approximation and prediction.
1 Introduction Charting has been the main analysis approach for stock market prediction for a long time and many mathematical methods have been used to forecast the stock market movements. Sornette and Zhou [1] have proposed a mathematical method based on a theory of imitation between the stock market investors and their herding behavior. However, due to the volatile nature of stock market, its movement would generally not follow any mathematic formula. This has hence limited the accuracy of the prediction for mathematical models. Neural network has long been utilized for this purpose due to its excellent ability to depict and replicate complicated stock market patterns. Cerebellar Model Articulation Controller (CMAC) is a type of associative memory neural network that was first proposed by Albus in 1975 [2]. CMAC imitates human’s cerebellum, which allows it to learn fast and carry out local generalization efficiently. However, the associative memory nature of CMAC does not provide the differential information between input and output and requires excessive memory requirement [9]. Chiang and Lin [3] proposed a fuzzy CMAC (FCMAC) to introduce fuzzy sets as the input clusters into neural network. The differentiable property between the input D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 346–351, 2007. © Springer-Verlag Berlin Heidelberg 2007
Stock Prediction Using FCMAC-BYY
347
state and the actual output value can be obtained by computing the differentiation of the output with respect to its input. In addition, Bayesian Ying-Yang (BYY) [4] is applied in the fuzzification layer to improve the approximation of the input clusters and consequently, the FCMAC-BYY model was proposed in our previous work [5]. The motivation of this paper comes from the popular status of various tools to predict stock market data. The remaining of this paper is structured as follows. The next section would describe the FCMAC-BYY structure while Section 3 would illustrate experiment results with the benchmark dataset (Ibex35) as well as the real life data (SIA). Last but not the least, Section 4 would consist of the conclusion for this paper.
2 FCMAC –BYY Model In Figure 1, the FCMAC-BYY neural network is observed as a five-layer hierarchical structure namely Input layer, Fuzzification layer, Association Layer, Post association Layer and Output Layer. Post Association Layer Association Layer Output Response Unit Fuzzy Layer
q(y)
Weights Wj
p(y|x) p(x)
y0
q(x|y)
x0
xj
Input Layer
Fig. 1. The structure of FCMAC-BYY
2.1 FCMAC-BYY Structure Input Layer is the layer where the input data is obtained from the retrieved information. Bayesian Ying-Yang fuzzification is performed on the input training dataset in Fuzzification Layer in order to obtain fuzzy clusters. Association Layer is the rule layer where each association cell represents a fuzzy rule. A cell is activated only when all the inputs to it are fired. The Association Layer is then mapped to a Post Association Layer. A cell in this layer will be fired if any of its connected inputs is activated. Adapting Credit Assigned-FCMAC [11] methodology, a variable, named f_freq, was added to each cell to count the number of times in which the cell was fired. Using this approach, the cells which were fired more frequently would be learning at a reduced rate. Prior to this switch, the weights were updated according to the total
348
J. Fu et al.
number of fired cells instead. The following formula is applied to the updating of the weights in CA-FCMAC:
α⎛⎜∑ f (l) f (l)⎞⎟ m
ω(ji) = ω(ji−1) + ⎝l=1
ϕ
⎠ (y − y ) d j
(1)
m ⎧m ⎫ where ϕ = ∑ ⎨∑ f (l ) f (l )⎬ , ω (ij ) is the weight of the jth cell after i iterations. α is l =1 ⎩ l =1 ⎭ the learning rate while yd and yj are the desired and calculated output respectively, and f(l) returns the variable f_freq. Using the derived ϕ , Eq(1) reduces proportionally the learning rate of a cell as its fired frequency increases. Finally, the defuzzification center of area (COA) method [6] is used to compute the output in the Output Layer.
2.2 A Ying-Yang Approach to Fuzzification In this research, the BYY fuzzification is performed on the input data patterns to obtain input clusters. Treating both x and y as random processes, the joint distribution can be calculated by either of these two formulas: p( x, y) = p( y | x) p( x) (2)
q( x, y) = q( x | y)q( y)
(3)
The breakdown of Eq(2) follows the Yang concept with the visible domain by p(x) regarded as a Yang space and the forward pathway by p(y | x) as a Yang pathway. Similarly, Eq(3) is regarded as a Ying space and the backward pathway q(x | y) by as a Ying pathway. Both equations should return the same result for the joint distribution, however, this is the case only when the solution is the optimal. The forward/training model or Yang model and the backward/running model or Ying model can be computed using the Eq(4) and (5) respectively. q ( x ) = ∫ q ( x | y ) q ( y ) dy
(4)
p ( y ) = ∫ p ( y | x) p ( x)dx
(5)
Eq(4) focuses on the mapping function of the input data x into a cluster representation y via a forward propagation distribution p(y | x) while the Ying model focuses on the generation function of the input data x from a cluster representation y via a backward propagation distribution q(x | y). Under the Ying-Yang harmony principle, the difference between the two Bayesian representations in Eq(2) and (3) would be minimized. Thus, the trade-off between the forward/training model and the backward/running model is optimized. It means that the input data are well mapped into the clusters and at the same time the clusters also well cover the input data. Eventually, the Eq(2) and (3) will produce the same results when Ying and Yang achieves harmony and FCMAC-BYY will then have the highest generalization ability. For further details, reader may refer to [5].
Stock Prediction Using FCMAC-BYY
349
3 Experimental Results Two experiments were conducted using data obtained from price value of Ibex35 index as well as SIA stock. The results of the tests are as follows. 3.1 Ibex35 Index Data
The Ibex35 is a capitalization-weighted stock market index, comprised of the 35 most liquid Spanish stocks traded in the continuous market, and is the benchmark index for the Bolsa de Madrid. The extensive Spanish Ibex35 daily stock price data [7] was chosen because it is a popular index which is also used as a benchmark test for other prediction tools [10]. 1000 samples were chosen for training and 500 for testing. The dataset was put through the various neural networks (Multi-layer perceptron-MLP, conventional CMAC, FCMAC, CA-FCMAC and FCMAC-BYY) for comparison. The results are as shown in Figure 2 and their performance in Table 1.
Comparison Chart Actual Output
Ibex35 Index
9800
CMAC Output
9600
FCMAC Output
9400 9200
CA-FCMAC Output
9000
MLP
8800 8600 Time
FCMACBYY
Fig. 2. Comparison Chart on Ibex35 index using various neural networks
The three-layer MLP used four neurons in a hidden layer in a 4-4-1 layout. Gaussian function was used for the clustering of data in FCMAC. The MLP produced highly accurate results and has low memory requirement. However, it is hard to determine the optimal number of hidden layer neurons and MLP operates like a black box with its computation of data hidden from the users. From Figure 2, CMAC prediction is not as accurate when compared to FCMAC. Furthermore, FCMAC was able to produce the prediction using less memory. CA-FCMAC capitalized on the FCMAC structure to build the results in less computation cycles. Last but not the least; FCMAC-BYY was able to produce similar results with even less memory requirement.
350
J. Fu et al. Table 1. Comparison Table
Model
MSE
Iterations
Memory used
MLP CMAC FCMAC CA-FCMAC FCMAC_BYY
0.00037 0.00225 0.00119 0.00092 0.00090
93 29 48 43 38
8 130321 6336 6336 4096
3.2 SIA Stock Data
The second test was conducted on SIA stock which is listed on the Mainboard in Singapore Exchange (SGX). Stock prices were collected through data collected from the SGX website [8] at 5 minutes interval from 15 September 2006 till 15 October 2006. The information was then parsed and analyzed using the FCMAC-BYY system. In total, 350 data samples were used as the training dataset and 150 data samples were used as the testing dataset.
SIA Prediction Chart
Actual Price
16.0 15.0
FCMACBYY
Price
14.0 13.0
CMAC
12.0 11.0 10.0
FCMAC Time
Fig. 3. FCMAC-BYY prediction of SIA stock
From Figure 3, it can be observed that FCMAC-BYY was able to closely follow the movement of the stock prices. Mean Square error of 0.00104 was achieved using 625 cells. Using BYY to cluster the input dataset, less cells were needed and thus the memory requirement reduced. Furthermore, the computation cycles required improved while the accuracy of the prediction was not compromised. In all, FCMAC-BYY was able to improve the overall efficiency of the prediction through the proficient BYY clustering methodology.
Stock Prediction Using FCMAC-BYY
351
4 Conclusion Stock prediction application had come a long way and at the same time shown great progress. This paper proposes an Associative Memory structure, which contains two modules: a BYY input space clustering module and a FCMAC neural network approximation system. BYY is based on the ancient Ying-Yang philosophy and aims to find the clusters to represent the input data. On the other hand, the proposed FCMAC system includes a non-constant differentiable Gaussian basis function to preserve the derivative information so that a gradient descent method capable of serving as learning rules. Together, FCMAC-BYY had been used here successfully to model the stock market movement and project accurate predictions. Its great ability to adequately cluster the input patterns has allowed accurate prediction to be carried out and less memory usage. Experimental results indicate that FCMAC-BYY has a high learning speed while maintaining low memory requirement. FCMAC-BYY was able to perform non-linear approximations and on-the-fly updates while keeping memory requirement lower than conventional CMAC structures as well as the original FCMAC.
References 1. Sornette, D., Zhou, W.X.: The US 2000-2002 Market Descent: How Much Longer and Deeper? Taylor and Francis Journals Quantitative Finance, 2 (2002) 468-481 2. Albus, J. S.,: Data Storage in the Cerebellar Model Articulation Controller (CMAC). Transaction of the ASME, Dynamic Systems Measurement and Control, 97 (1975) 228-233 3. Lin, C.S., Chiang, C.-T.: Learning Convergence of CMAC Technique. IEEE Trans. Neural Networks, 8 (1997) 1281–1292 4. Xu, L.: Advances on BYY Harmony Learning: Information Theoretic Perspective, Generalized Projection Geometry, and Independent Factor auto Determination. IEEE Trans. Neural Networks 15 (2004) 885-902 5. M.N Nguyen, D.Shi, C.Quek, FCMAC-BYY: Fuzzy CMAC Using Bayesian Ying-Yang Learning. IEEE Trans. Syst. Man Cybern B: Cybernetic, 36 (2006) 1180-1190 6. Lee, E. S., Zhu, Q.: Fuzzy and Evidence Reasoning: Physica-Verlag (1995) 7. Spain Ibex35 historical daily closing stock price (Online). Available: Yahoo! Finance wensite. URL-http://finance.yahoo.com/q?s=%5EIBEX&d=t. 8. SIA stock price value. Available: Singapore Exchange website. URLttp://www.ses.com.sg/. 9. Hu, J., Pratt, F.: Self-orgarnizing CMAC Neural Networks and Adaptive Dynamic Control, IEEE International Conference on Intelligent Control, Cambridge, MA, (1999) 15-17 10. Górriz, Juan M., Puntonet, Carlos G., Salmerón, Moisés, Lang, E.W.: Time Series Prediction using ICA Algorithms. IEEE International Workshop on Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications 8-10 September (2003), Lviv, Ukraine 11. Su, S-F., Tao, T., Hung, T-H.: Credit Assigned CMAC and Its Application to Online Learning Robust Controllers. IEEE Trans. Syst. Man Cybern. B: Cybernetics. 33 (2003)
A Hybrid Rule Extraction Method Using Rough Sets and Neural Networks Shufeng Wang, Gengfeng Wu, and Jianguo Pan Department of Computer Science, Shanghai University, 149 Yanchang road Shanghai, P.R. China 200072
[email protected],
[email protected],
[email protected]
Abstract. Rough sets and neural networks are two common techniques applied to rule extraction from data table. Integrating the advantages of two approaches, this paper presents a Hybrid Rule Extraction Method (HREM) using rough sets and neural networks. In the HREM, the rule extraction is mainly done based on rough sets, while neural networks are only served as a tool to reduce the decision table and filter its noises when the final knowledge (rule sets) is generated from the reduced decision table by rough sets. Therefore, the HREM avoids the difficult of extracting rules from a trained neural network and possesses the robustness which the rough sets based approaches are lacking. The effectiveness of HREM is verified by comparing the experiment results with the approaches of traditional rough sets and neural networks.
1 Introduction One important issue of data mining is classification which has attracted great attentions of researchers [11]. Rough sets and neural networks are two technologies frequently applied to data mining tasks [12, 13]. The common advantage of the two approaches is that they do not need any additional information about data like probability in statistics or grade of membership in fuzzy set theory. Rough sets theory introduced by Pawlak in 1982 is a mathematical tool to deal with vagueness and uncertainty of information. It has been proved to be very effective in many practical applications. However, in rough sets theory, the deterministic mechanism for the description of error is very simple. Therefore, the rules generated by rough sets are often unstable and have low classification accuracy. Neural networks are considered as the most powerful classifier for their low classification error rates and robustness to noise. But neural networks have two obvious shortcomings when applied to data mining problems. The first is that neural networks require long time to train the huge amount of data of large databases. Secondly, neural networks lack explanation facilities for their knowledge. The knowledge of neural networks is buried in their structures and weights. It is often difficult to extract rules from a trained neural network. The combination of rough sets and neural networks is very natural for their complementary features. One typical approach is to use rough set approach as a pre-processing tool for the neural networks [1, 2]. By eliminating the redundant data D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 352–361, 2007. © Springer-Verlag Berlin Heidelberg 2007
A Hybrid Rule Extraction Method Using Rough Sets and Neural Networks
353
from database, rough sets methods can greatly accelerate the network training time and improve its prediction accuracy. In [4], Rough sets method was also applied to generate rules from trained neural networks. In these hybrid systems, neural networks are the main knowledge bases and rough sets are used only as a tool to speedup or simplify the process of using neural networks for mining knowledge from the databases. In [3], a rule set, a part of knowledge, is first generated from a database by rough sets. Then a neural network is trained by the data from the same database. In the prediction phase, a new object is first predicted by the rule set. If it does not match any of the rules, then is fed into the neural networks to get its result. Although the hybrid model can get high classification accuracy, part of prediction knowledge is still hidden among the neural networks and is not comprehensible for user. In this paper, from a new perspective we develop a Hybrid Rule Extraction Method (HREM) using rough sets and neural networks to mine classification rules from large databases. Compared with previous research works, our study has the following contributions. (1) We reduce attributes of decision table by three steps. In the first step, irrelevant and redundant attributes are removed from the table by rough sets approach without loss of any classification information. In the second step, a neural network is used to eliminate noisy attributes in the table while the desirable classification accuracy is maintained. In the third step, the final knowledge which is mainly represented as classification rules are generated from the reduced decision table by rough sets. (2) In our HREM, neural networks are used only as a tool to reduce the decision table and filter its noises. The final classification rules are generated from the reduced decision table by rough sets, not from the trained neural networks.
2 Preliminaries 2.1 Binary Discernibility Matrix
T = U , C ∪ D, V , f be a decision table. In general, D can be transformed into a set that has only one element without changing the classification for U that is, D = {d } . Every value of d corresponds to one equivalence class of U / ind ( D) , Let
which is also called the class label of object. A binary discernibility matrix represents the discernibility between pairs of objects in a decision table. Let M be the binary discernibility matrix of S , it element
M (( s, t ), i ) indicates the discernibility between two objects x s and xt with different class labels by a single condition attribute ci , which is defined as follows:
⎧1.......ci ( x s ) ≠ ci ( xt ), M (( s, t ), i ) = ⎨ ⎩0.......otherwise, s < t ≤ m and d ( x s ) ≠ d ( xt ) i ∈ {1,2,..., n} . It can be seen that M has n columns and its maximal number of rows is m(m − 1) / 2 . Each column of M represents a single condition attribute and each row of M represent an object pair having different d values. Where 1 ≤
354
S. Wang, G. Wu, and J. Pan
2.2 Attribute Reduction by Rough Sets and Neural Networks Attribute reduction is a process of finding an optimal subset of all attributes according to some criterion so that the attribute subset are good enough to represent the classification relation of data. Attributes deleted in attribute reduction can be classified into two categories. One category contains irrelevant and redundant attributes that have no any classification ability. An irrelevant attribute does not affect classification in any way and a redundant feature does not add anything new to classification. These attributes represent some classification ability, but this ability will disturb the mining of true classification relation due to the effect of noise. In general, rough sets theory provides useful techniques to reduce irrelevant and redundant attributes from a large database with a lot of attributes. However, it is not so satisfactory for the reduction of noisy attributes because the classification region defined by rough sets theory is relatively simple and rough sets based attribute reduction criteria lack effective validation method. For example, the dependency γ and information entropy H are two most common attribute reduction measures in rough sets theory. When using them to measure attribute subsets, an attribute subset with a high γ may contain some noisy attribute and degrade the generalization of classification, and H may make the noise overestimated and delete useful attributes. Neural networks have the ability to approach any complex function and possess good robustness to noise. Therefore we think that the nonlinear mapping ability for classification relations and cross-valid mechanism provided by neural networks can give us more chance to eliminate noisy attributes and reserve useful attributes. However, the neural networks will take long training time for attribute reduction when treating a large amount of attributes. 2.3 Rule Extraction by Rough Sets and Neural Networks To extract rules using neural networks is usually difficult because of the nonlinear and complicated nature of data transformation conducted in the multiple hidden layers. Although neural networks researchers have proposed many methods to discover symbolic rules from a trained neural network, these methods are still very complicated when the network is large. The algorithms to extract rules from trained neural networks were summarized in [10]. Compared to the neural network approaches, rule extraction by rough sets is relatively simples and straightforward and without extra computational procedures before rules being extracted.
3 Development of HREM 3.1 The Procedures of HREM The HREM consists of three major phases: 1. attributes reduction done by rough sets. Using rough sets approach, a reduct of condition attributes of decision table is obtained. Then a reduct table is derived from the decision table by removing those attributes that are not in the reduct.
A Hybrid Rule Extraction Method Using Rough Sets and Neural Networks
355
2. further reduction of decision table done by neural networks. Through a neural network approach, noisy attributes are eliminated from the reduct. Thus the reduct table is further reduced by removing noisy attributes and by removing those objects that can not be classified accurately by the network. 3. all rules in decision table extracted by rough sets. Applying rough sets method, the final knowledge-a rule set is generated from the reduced decision table. Fig 1 shows the procedures of HREM.
Original DT (Decision Table)
Reduct DT
Reduced DT
Attribute Reductio n by RS
Rule generatio n by RS Data
C,F
Rules set
Fig. 1. The procedures of HERM
3.2 Algorithms in HREM We first develop an algorithm of Rough Set Attribute Reduction (RSAR) based on a binary discernibility matrix, which replaces complex set operations by simple bit-wise operations in the process of finding reduct and provides a more simple and intelligible measure for the importance of attributes. Even if the initial number of attributes is very large, using the measure can effectively delete irrelevant and redundant attributes in a relatively short time. Secondly, we employ the neural network feature selection (NNFS) algorithm in [5] to further reduce attributes in the reduct. In this approach, the noisy input nodes (attributes) along with their connections are removed iteratively from the network without decreasing obviously the network’s classification ability. The approach is very effective for a wide variety of classification problems including both artificial and real-world datasets, which was verified by a lot of experiments. Making use of the robustness to noise and generalization ability of the neural network method, these attributes and objects polluted by noisy can be reduced from decision table. Thirdly, we present an Extraction Algorithm of Approximate Sequence Decision Rules (EAASDR) which extracted concise rule from reduced table, remove values of those attributes, then extracted rule from border region, until all rules are extracted from reduced table.
3.2.1 RSAR Algorithm
We assume that the context of the decision table is the only information source; our objective is to find a reduct with a minimal number of attributes. Based on the definition of the binary discernibility matrix, we propose our rough set attribute reduction (RSAR) algorithm to find a reduct of a decision table. RSAR is outlined below.

RSAR algorithm
Input: a decision table T = (U, C ∪ D), U = {u_1, u_2, ..., u_n}, C = {c_1, c_2, ..., c_m};
Output: a reduct of T, denoted as Red;
1. Construct the binary discernibility matrix M of T;
2. Delete the rows in M which are all 0's (pairs of inconsistent objects), Red = ∅;
3. while (M ≠ ∅) {
   (1) select an attribute c_i in M with the highest discernibility degree (if several c_i share the highest discernibility degree, choose one of them randomly);
   (2) Red ← Red ∪ {c_i};
   (3) remove the rows which have "1" in the c_i column from M;
   (4) remove the c_i column from M;
   } endwhile
/* the following steps remove redundant attributes from Red */
4. Suppose that Red = {r_1, r_2, ..., r_k} contains k attributes sorted by the order of entering Red: r_k is the first attribute chosen into Red and r_1 is the last one chosen into Red.
5. Get the binary discernibility matrix MR of the decision table TR = (U, Red ∪ D);
6. Delete the rows in MR which are all 0's;
7. For i = 2 to k {
   remove the r_i column from MR;
   if (no row in MR is all 0's) then Red ← Red − {r_i};
   else put the r_i column back into MR;
   endif
   } Endfor.
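A minimal sketch of the RSAR idea is given below: it builds the binary discernibility matrix as bit rows, repeatedly picks the attribute that distinguishes the most uncovered rows, and finally drops redundant attributes. Names and data are illustrative, and the handling of inconsistent object pairs and of the backward pass is simplified relative to the listing above.

def discernibility_rows(objects, attrs, decision):
    """One binary row per pair of objects with different decisions;
    bit a = 1 iff the pair differs on attribute a."""
    rows = []
    for i in range(len(objects)):
        for j in range(i + 1, len(objects)):
            x, y = objects[i], objects[j]
            if x[decision] == y[decision]:
                continue
            row = [1 if x[a] != y[a] else 0 for a in attrs]
            if any(row):               # all-zero rows (inconsistent pairs) are dropped
                rows.append(row)
    return rows

def rsar(objects, attrs, decision):
    rows = discernibility_rows(objects, attrs, decision)
    red, remaining = [], list(rows)
    while remaining:
        # discernibility degree = how many uncovered rows an attribute distinguishes
        best = max(range(len(attrs)), key=lambda a: sum(r[a] for r in remaining))
        red.append(attrs[best])
        remaining = [r for r in remaining if r[best] == 0]
    # backward pass: remove attributes whose deletion still covers every row
    for a in list(reversed(red)):
        trial = [x for x in red if x != a]
        idx = [attrs.index(x) for x in trial]
        if all(any(r[k] for k in idx) for r in rows):
            red = trial
    return red

table = [
    {"a1": 0, "a2": 1, "a3": 0, "d": "yes"},
    {"a1": 1, "a2": 1, "a3": 0, "d": "no"},
    {"a1": 0, "a2": 0, "a3": 1, "d": "no"},
]
print(rsar(table, ["a1", "a2", "a3"], "d"))   # e.g. ['a1', 'a2']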
3.2.2 EAASDR Algorithm
A reduced table can be seen as a rule set where each rule corresponds to one object of the table. The rule set can be generalized further by applying the rough set value reduction method. Unlike most value reduction methods, which neglect the border region rules among the classification capabilities of condition attributes, we first extract concise
rules from the reduced table, remove the values of those attributes, then extract rules from the border region, until all rules are extracted from the reduced table. The steps of our rough set rule generation algorithm, called EAASDR (an Extraction Algorithm of Approximate Sequence Decision Rules), are presented below.

Input: decision table S = (U, C ∪ D); Output: rule sets;
1. The relative core CORE_D(C) in S = (U, C ∪ D) is calculated; /* the relative core is obtained by calculating the significance σ_CD(C') of each condition attribute to the decision attribute */
2. If CORE_D(C) ≠ ∅ Then P_1 = CORE_D(C) and E = P_1; Else P_1 = {c_1} and E = P_1; /* for ∀c ∈ C calculate the dependency degree γ_[c](D) between c and D; the attribute with γ_[c1](D) = max{γ_[c](D), c ∈ C} is selected as the initial attribute set */
3. The decision classification U/D = {Y_1, Y_2, ..., Y_d} is calculated;
4. P = {P_1}; i = 1; U* = U; B = ∅; Rule = ∅;
5. U*/IND(P_i) = {X_i1, X_i2, ..., X_ik};
6. B' = {X_k ∈ U*/IND(P_i) | X_k ⊆ Y_j, where Y_j ∈ U/D, j ∈ {1, 2, ..., d}}; Rule' = ∅; ∀X_k ∈ B', Rule' = {des_Pi(X_k) → des_D(Y_j)}, where Y_j ∈ U/D and Y_j ⊇ X_k;
7. Rule = Rule ∪ Rule'; B = B ∪ B'; B* = ∪_{X ∈ B'} X; If B = U Then goto Step 8 Else { U* = U* − B*; i = i + 1; for ∀c ∈ C − E, the significance σ_({c}∪E)D({c}) is calculated; if σ_({c2}∪E)D({c_2}) = max{σ_({c}∪E)D({c}), c ∈ C − E} then P_i = P_i−1 ∪ {c_2}, and P_i is an equivalence class in P; goto Step 5; }
8. B is the result of the dynamic classification, and Rule is the decision rule set.
3.3 General Algorithm
In summary, we give the general algorithm to generate rules from a decision table as below.

The general algorithm
Input: a decision table T = (U, C ∪ D), U = {u_1, u_2, ..., u_n}, C = {c_1, c_2, ..., c_m};
Output: a rule set RULE;
1. Apply the RSAR algorithm to get a reduct of T, denoted as Red;
2. Remove those attributes that are not in Red from T;
3. Apply the NNFS algorithm to obtain an important attribute subset IMP of Red. Suppose OBJ is the set of objects that were classified wrongly by the network;
4. Remove those attributes that are not in IMP, remove the objects in OBJ from T, and merge identical objects into one object;
5. Apply the EAASDR algorithm to extract the rule set RULE from the reduced decision table.
It should be noted that identical objects are not merged after step 2, because the probability distribution of all objects is needed in the subsequent neural network training phase; identical objects correspond to the same rule in the rule generation phase, so they are merged in step 4.
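The overall flow of the general algorithm can be summarized by the orchestration sketch below. The helpers rsar, nnfs_select and eaasdr are hypothetical stand-ins for the three algorithms described above and are passed in by the caller; this is a sketch of the control flow only, not the authors' implementation.

def hrem(decision_table, condition_attrs, decision_attr, rsar, nnfs_select, eaasdr):
    """Hypothetical end-to-end HREM pipeline; the three helpers are assumed given."""
    # Steps 1-2: rough-set attribute reduction, then project the table onto the reduct.
    red = rsar(decision_table, condition_attrs, decision_attr)
    table = [{a: row[a] for a in red + [decision_attr]} for row in decision_table]

    # Step 3: neural-network feature selection, which also reports
    # the objects the trained network misclassifies.
    imp, misclassified = nnfs_select(table, red, decision_attr)

    # Step 4: drop noisy attributes and objects, then merge identical objects.
    table = [{a: row[a] for a in imp + [decision_attr]}
             for row in table if row not in misclassified]
    table = [dict(t) for t in {tuple(sorted(r.items())) for r in table}]

    # Step 5: rule generation from the reduced decision table.
    return eaasdr(table, imp, decision_attr)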
4 Experiments and Results
We did a series of experiments to test our method. First, for comparison with traditional methods, we applied our approach to eight data mining problems [9] and six standard datasets from the UCI repository that were used in [7] and [8], respectively. Secondly, to test HREM under noisy conditions, we made the relevant experiments on the MONK3 dataset by randomly adding different levels of noise to the data. In this paper, the rule set accuracy and the rule set comprehensibility were used as the evaluation criteria for the rule extraction approaches. The accuracy of a rule set is indicated by the accuracy of the generated rules on the testing set, and the comprehensibility of a rule set includes two measures: the number of rules and the average number of conditions of each rule.
4.1 Comparison Between HREM and NNFS
Ten classification problems were defined on datasets having nine attributes in [9]. We selected eight problems (all except function 8) with different complexities for our experiments. As in [9], the values of the attributes of each object were generated randomly and a perturbation factor of 5% was added. The class labels were determined according to the rules that defined each function. For every experiment, 3000 objects were generated, among which 2000 objects were used as the training set and the other 1000 as the testing set. The attribute values were initially discretized and coded by the methods proposed in [7]; the nine attributes were then transformed into 37 binary attributes. We ran 30 tests for each problem. Table 1 reports the results of the eight classification problems using our hybrid approach based on rough sets and neural networks (HREM). Experimental results from [7], obtained by an approach based on neural networks (NNFS), are compared for the same problems. The results show that HREM is comparable in both accuracy and comprehensibility with NNFS. Moreover, the rule extraction time of HREM is much shorter than that of NNFS, for the reason mentioned previously.
Table 1. Comparison of performance of the rules generated by HREM and NNFS

F | Average accuracy (%)       | Average no. of rules      | Average no. of conditions
  | HREM          NNFS         | HREM         NNFS         | HREM         NNFS
1 | 99.92(0.48)   99.91(0.36)  | 2.14(0.58)   2.03(0.18)   | 2.01(0.58)   2.23(0.50)
2 | 99.01(0.87)   98.13(0.78)  | 6.78(1.56)   7.13(1.22)   | 4.20(0.78)   4.37(0.66)
3 | 98.81(1.37)   98.18(1.56)  | 7.60(0.81)   6.70(1.15)   | 2.25(0.04)   3.18(0.28)
4 | 93.09(1.92)   95.45(0.94)  | 10.90(1.83)  13.37(2.39)  | 2.79(0.27)   4.17(0.88)
5 | 98.16(0.78)   97.16(0.86)  | 22.89(9.78)  24.40(10.1)  | 4.95(1.20)   4.68(0.87)
6 | 90.89(0.23)   90.78(0.43)  | 12.45(3.56)  13.13(3.72)  | 4.20(0.98)   4.61(1.02)
7 | 91.82(0.57)   90.50(0.92)  | 5.13(2.34)   7.43(1.76)   | 1.71(0.47)   2.94(0.32)
9 | 91.58(0.98)   90.86(0.60)  | 10.21(1.96)  9.03(1.65)   | 3.21(0.40)   3.46(0.36)
4.2 Comparison Between HREM and RS
To compare with rough set based approaches (RS), we applied our approach to six UCI datasets that were used in [8]. As in [8], we randomly separated each dataset into two parts: two thirds as the training set and the rest as the testing set. The continuous attributes were initially discretized using the equal-width binning method. We ran 20 tests for each case and present the averages of the results in Table 2. The data in the RS columns were given in [8] and have no standard deviations. We can see that HREM outperforms RS in accuracy on all six datasets and that the rule set of HREM is more concise than that of RS. This is because HREM can effectively filter the noise in the data, which makes the generated rules more accurate and simpler.

Table 2. Comparison of performance of the rules generated by HREM and RS
Data sets  | Average accuracy (%)  | Average no. of rules | Average no. of conditions
           | HREM          RS      | HREM        RS       | HREM         RS
Australian | 85.70(0.39)   85.54   | 3.00(2.15)  6.7      | 1.34(0.62)   2.5
breast     | 94.82(0.77)   92.38   | 5.30(0.18)  7.8      | 1.91(0.15)   1.6
diabetes   | 73.92(2.07)   73.32   | 6.50(2.46)  6        | 4.35(3.56)   1.5
German     | 72.41(2.95)   70.48   | 4.35(3.56)  4.7      | 2.16(1.02)   1.4
glass      | 63.89(1.24)   60.42   | 22.5(2.58)  24.5     | 1.82(0.58)   2.2
iris       | 95.78(0.79)   95.10   | 3.15(1.78)  3.55     | 1.52(0.57)   1.29
4.3 Experiments Under Noisy Conditions
In order to demonstrate the robustness of our approach, the MONK3 dataset was selected for our experiments. The dataset contains 432 objects, and each is described by 6 attributes. All objects are classified into two classes by the following rules:
Class 1: (jacket_color = green and holding = sword) or (jacket_color ≠ blue and body_shape ≠ octagon).
Class 0: otherwise.
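For concreteness, the labeling rule above can be written directly as a small function. The attribute value strings are illustrative, since the paper does not list the exact encodings used.

def monk3_class(jacket_color, holding, body_shape):
    """Class-1 rule quoted above; returns 1 or 0."""
    if (jacket_color == "green" and holding == "sword") or \
       (jacket_color != "blue" and body_shape != "octagon"):
        return 1
    return 0

print(monk3_class("green", "sword", "octagon"))  # 1
print(monk3_class("blue", "flag", "octagon"))    # 0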
We constructed three classification problems on the dataset by randomly adding 6, 12 and 18 noisy objects to the training objects, respectively. We set an object as a noise by changing its class label; for example, an object originally labeled "1" was relabeled as "0". In every experiment, the dataset was divided randomly into two equal sets: one used as the training set and the other as the testing set. Table 3 shows the results of the three problems; each problem was run 30 times. It can be seen that under the different noise-level conditions the generated rule set remained relatively stable, and that HREM can effectively filter the noise in the data while deleting relatively few objects (the number of objects deleted was not more than twice the number of true noisy objects). This guarantees that concise and accurate rules are generated.

Table 3. Result of robustness experiments on MONK3 dataset with HREM

Data set   | Average accuracy (%) | Average no. of rules | Average no. of conditions
6 noises   | 97.84(1.19)          | 3.57(0.97)           | 1.34(0.14)
12 noises  | 97.78(1.13)          | 4.30(2.45)           | 1.44(0.34)
18 noises  | 95.56(6.09)          | 3.97(3.01)           | 1.40(0.42)
5 Conclusions
In this paper, we present a hybrid approach integrating rough sets and neural networks to mine classification rules from large datasets. Through the rough sets approach, a decision table is first reduced by removing redundant attributes without losing any classification information; then a neural network is trained to delete noisy attributes from the table. Those objects that cannot be classified accurately by the network are also removed from the table. Finally, all classification rules are generated from the reduced decision table by rough sets. In addition, based on a binary discernibility matrix, a new algorithm, RSAR, for finding a reduct and a new algorithm, EAASDR, for generating all rules from a decision table were proposed. HREM was applied to a series of classification problems that include artificial and real-world problems. The results of the comparison experiments show that our approach can generate more concise and more accurate rules than the traditional neural network based approach and the rough set based approach. The results of the robustness experiments indicate that HREM works well under different noise-level conditions.
References 1. Jelonek, J., Krawiec, K., Stowinski, R.: Rough Set Reduction of Attributes and Their Domains for Neural Networks. Computational Intelligence 11 (1995) 339-347 2. Swiniarski, R., Hargis, L.: Rough Set as a Front End of Neural-Networks Texture Classifiers. Neurocomputing 36 (2001) 85-102 3. Ahn, B., Cho, S., Kim, C.: The Integrated Methodology of Rough Set Theory and Artificial Neural Network for Business Failure Predication. Expert Systems with Application 18 (2000) 65-74
4. Yasdi, R.: Combining Rough Sets Learning and Neural Network Learning Method to Deal with Uncertain and Imprecise Information. Neurocomputing 7 (1995) 61-84 5. Setiono, R., Liu, H.: Neural-Network Feature Selector. IEEE Trans. Neural Networks 8 (1997) 554-662 6. Towell, G., Shavlik, J.W.: Interpretation of Artificial Neural Networks: Mapping Knowledge-based Neural Networks into Rules. In Advances in Neural Information Processing Systems 4, Moody, J.E., Hanson, S.J., Lippmann, R.P. eds., San Mateo, CA: Morgan Kaufmann (1992) 977-984 7. Lu, H., Setiono, R., Liu, H.: Effective Data Mining using Neural Networks. IEEE Trans. Knowledge and Data Engineering 8 (1996) 957-961 8. Chen, X., Zhu, S., Ji, Y.: Entropy based Uncertainty Measures for Classification Rules with Inconsistency Tolerance. In: Proc. IEEE Int. Conf. Systems, Man and Cybernetics (2000) 2816-2821 9. Agrawal, R., Imielinski, T., Swami, A.: Database Mining: A Performance Perspective. IEEE Trans. Knowledge and Data Engineering 5 (1993) 914-925 10. Andrews, R., Diederich, J.,Tickle, A.B.: Survey and Critique of Techniques for Extracting Rules from Trained Artificial Neural Networks. Knowledge Based System 8 (1995) 373-389 11. Chen, M., Han, J., Yu, P.: Data Mining: An Overview from a Database Perspective. IEEE Trans. Knowledge and Date Engineering 8 (1996) 866-883.
A Novel Approach for Extraction of Fuzzy Rules Using the Neuro-fuzzy Network and Its Application in the Blending Process of Raw Slurry* Rui Bai1, Tianyou Chai1,2, and Enjie Ma1 1
Key Laboratory of Integrated Automation of Process Industry, Ministry of Education, Northeastern University, Shenyang 110004 2 Research Center of Automation, Northeastern University, Shenyang 110004, China
Abstract. A novel approach is proposed to extract fuzzy rules from input-output data using a neuro-fuzzy network combined with an improved c-means clustering algorithm. Interpretability, which is one of the most important features of fuzzy systems, is obtained using this approach, and the number of fuzzy sets of the variables can also be determined appropriately. Finally, the proposed approach is applied to the blending process of raw slurry in the alumina sintering production process. The fuzzy system, which is used to determine the set values of the flow rates of the materials, is extracted from the error-of-production-index and adjustment-of-flow-rate data. Application results show that the fuzzy system not only improves the quality of the raw slurry but also has good interpretability.
1 Introduction
Since fuzzy sets were proposed by L.A. Zadeh in 1965, fuzzy systems have been applied widely in many fields including modeling, control, pattern recognition, fault diagnosis, and so on. One of the important design issues for fuzzy systems is how to construct a set of appropriate fuzzy rules. There are two major approaches: manual rule generation and automatic rule generation. Most of the reported fuzzy systems have resorted to a trial-and-error method for constructing fuzzy rules. This not only limits the applications of fuzzy systems, but also forces system designers to spend a great deal of time constructing and tuning fuzzy rules. Moreover, the manual approach becomes even more difficult if the required number of rules increases or domain knowledge is not easily available. To resolve these difficulties, several automatic approaches for extracting fuzzy rules from input-output data have recently been proposed, including the look-up table approach [1], data mining approaches [2,3], the GA approach [4], the clustering approach [5],*
This project is supported by the National Fundamental Research Program of China (Grant No. 2002CB312201), the State Key Program of the National Natural Science Foundation of China (Grant No. 60534010), the Funds for Creative Research Groups of China (Grant No. 60521003), and the Program for Changjiang Scholars and Innovative Research Team in University (Grant No. IRT0421).
D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 362–370, 2007. © Springer-Verlag Berlin Heidelberg 2007
and the neural network approach [6,10]. However, all these approaches focus only on fitting the data with the highest possible accuracy, neglecting the interpretability of the obtained fuzzy systems, which is a primary advantage of fuzzy systems and the most prominent feature that distinguishes fuzzy set theory from many other theories used in modeling and control. Another disadvantage of these approaches is that the numbers of fuzzy sets and fuzzy rules must be determined manually and beforehand. In order to improve the interpretability of fuzzy systems, similar fuzzy sets are merged based on similarity analysis [7, 8, 9, 13]; however, similar or incompatible fuzzy rules are not considered, which also decreases the interpretability of the fuzzy systems. It should be noted that most of the reported approaches can only realize T-S fuzzy rules, whose consequent is a constant or a linear combination of the inputs and which are more difficult to interpret linguistically than normal fuzzy rules whose consequents are fuzzy sets. To resolve this problem, normal fuzzy rules are extracted using neuro-fuzzy networks in [9, 10]. However, in [9, 10] every weight of the output layer of the neuro-fuzzy network represents a fuzzy set of the output variable, which makes the fuzzy sets of the output variables excessive and reduces the interpretability of the fuzzy rules. In order to determine the numbers of fuzzy rules and fuzzy sets appropriately, Kwang Bo Cho [10] used a hierarchically self-organizing learning (HSOL) algorithm to automatically determine the number of fuzzy rules; however, the number and initial parameters of the fuzzy sets are determined subjectively and randomly. Rui Pedro Paiva et al. determined the number of fuzzy rules by means of clustering analysis of the input-output data [9, 11]; however, in these approaches the number of clusters is equal to the number of fuzzy rules and also to the number of fuzzy sets of the input or output variables, so the fuzzy rule base is not complete. To resolve the above problems, this paper improves the c-means clustering algorithm, and a novel extraction approach for fuzzy rules is proposed using a neuro-fuzzy network combined with the improved c-means clustering algorithm. The interpretability of the fuzzy rules is increased and the number of fuzzy sets of the variables can be determined appropriately. The proposed approach is applied to the raw slurry blending process in alumina production. Fuzzy control rules are extracted from the error-of-production-index and adjustment-of-flow-rate data. Application results show that the fuzzy system not only improves the quality of the raw slurry, but also has good interpretability.
2 The Novel Approach for Extracting Fuzzy Rules from Input-Output Data
In this paper, the number and the initial parameters of the fuzzy sets are determined using the improved c-means clustering, and the neuro-fuzzy network is used to train the parameters. After training, the weights of the output layer are clustered to determine the fuzzy sets of the output variables. Finally, fuzzy rules are extracted from the neuro-fuzzy
network and rule regulation is carried out. The main steps of the proposed approach are as follows:
Step 1. Differing from [9, 11], clustering analysis is applied to every variable instead of to the whole input-output data, and the improved c-means algorithm proposed in this paper is used. The fuzzy sets of every input variable are determined based on the clustering results; moreover, the initial weights and the numbers of nodes of the neuro-fuzzy network are determined appropriately.
Step 2. The back-propagation algorithm with variable step size is adopted to train the neuro-fuzzy network.
Step 3. Improved c-means clustering analysis is applied to the output-layer weights. The membership functions of the fuzzy sets of the output variables are determined based on the clustering results, and the fuzzy rules are extracted from the trained neuro-fuzzy network. Differing from [9, 10], the numbers of fuzzy sets of the output variables are decreased.
Step 4. Regulation of the fuzzy rules is carried out, including merging similar fuzzy sets and deleting similar or incompatible fuzzy rules.

2.1 Determine the Initial Fuzzy Sets of Input Variables
Let us assume that the given input-output data set is as follows:
(x_{1,p}, ..., x_{m,p}; y_{1,p}, ..., y_{n,p}),   p = 1, ..., P    (1)
where m is the number of input variables, n is the number of output variables, and there are P input-output data pairs. The traditional c-means algorithm has the disadvantage that the number of clusters and the initial cluster centers must be determined subjectively beforehand. To overcome this disadvantage, the improved c-means algorithm is proposed in this paper, and the initial fuzzy sets of the input variables are determined appropriately with it. All data of the ith input variable are clustered, and the main steps of the improved c-means are as follows:
Step 1. The distance between x_i and the clusters is defined as
d_{ip,j} = [(x_{i,p} − c_{i,j})^2]^{1/2},   i = 1, ..., m,  p = 1, ..., P,  j = 1, ..., r_i    (2)

where r_i is the number of existing clusters and c_{i,j} is the center of the jth cluster.
Step 2. Let k = 0, and select x_{i,1} as the first cluster, noted as W^k_{i,1}. Let c^k_{i,1} = x_{i,1}, and r_i = 1.
Step 3. Compute the distance between x_{i,2} and W^k_{i,1}; if d_{i2,1} > T, a new cluster W^k_{i,2} is obtained, and we let c^k_{i,2} = x_{i,2}, r_i = 2. Otherwise, x_{i,2} is assigned into W^k_{i,1}.
Step 4. Assume that there are r_i cluster centers, and compute the distances between x_{i,p} and the existing clusters. If the minimum of d^k_{ipj} is greater than T, let r_i = r_i + 1, and x_{i,p} is selected as a new cluster, noted as W^k_{i,ri}, with c^k_{i,ri} = x_{i,p}.
Step 5. If all data of x_i are assigned into clusters, the procedure is over; otherwise, turn to Step 4. Finally, x_i is divided into r_i clusters, noted as W^k_{i,j}, j = 1, ..., r_i. We redefine the center of every cluster as the average of all data in the cluster.
Steps 1 to 5 are the first phase of the improved c-means algorithm: the number and the initial centers of the clusters are determined. Based on these results, x_i is clustered again using the traditional c-means algorithm in Steps 6 and 7, which are the second phase of the improved c-means algorithm.
Step 6. Compute the distances between x_{i,p} and the existing clusters in turn; if x_{i,p} is closest to the lth cluster, x_{i,p} is assigned into the lth cluster, and the new lth cluster, noted as W^{k+1}_{i,j}, comes into being. Consequently, x_i is divided into r_i clusters noted as W^{k+1}_{i,j}, and the centers of the clusters are noted c^{k+1}_{i,j}.
Step 7. If c^{k+1}_{i,j} = c^k_{i,j} (j = 1, ..., r_i), the procedure is over. Otherwise, let k = k + 1 and turn to Step 6.
After Steps 1 to 7, x_i is divided into r_i clusters, and the centers of the clusters, i.e., c_{i,j}, are obtained. Based on these results, r_i fuzzy sets of input variable x_i are determined appropriately, and the membership functions of the fuzzy sets are as follows:

μ_{A_{i,j}}(x_i) = exp[ −(x_i − c_{i,j})^2 / σ_{i,j}^2 ],   j = 1, ..., r_i    (3)

σ_{i,j} = (c_{i,j+1} − c_{i,j}) / 2.5,   j = 1, ..., r_i    (4)
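A compact sketch of the two-phase procedure for a single variable is given below. The threshold T and the sample data are illustrative, and the width of the last fuzzy set (which Eq. (4) leaves undefined) is simply reused from the previous one as an assumption.

import numpy as np

def improved_cmeans(x, T):
    """Phase 1: threshold-based seeding of the cluster count and centers.
    Phase 2: ordinary c-means refinement starting from those seeds."""
    x = np.asarray(x, dtype=float)
    centers = [x[0]]
    members = [[x[0]]]
    for v in x[1:]:
        d = [abs(v - c) for c in centers]
        j = int(np.argmin(d))
        if d[j] > T:                      # too far from every cluster: open a new one
            centers.append(v)
            members.append([v])
        else:
            members[j].append(v)
    centers = np.array([np.mean(m) for m in members])

    while True:                           # phase 2: standard c-means iterations
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        new = np.array([x[labels == k].mean() for k in range(len(centers))])
        if np.allclose(new, centers):
            return np.sort(new)
        centers = new

def gaussian_membership(v, centers):
    """Membership functions of Eqs. (3)-(4): sigma_j = (c_{j+1} - c_j) / 2.5."""
    sig = np.diff(centers) / 2.5
    sig = np.append(sig, sig[-1])         # assumption: reuse the last width
    return np.exp(-((v - centers) ** 2) / sig ** 2)

data = [0.1, 0.12, 0.5, 0.55, 0.9, 0.95]
c = improved_cmeans(data, T=0.2)
print(c, gaussian_membership(0.5, c))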
2.2 Design and Train the Neuro-fuzzy Network

2.2.1 Determine the Structure and the Initial Weights of the Neuro-fuzzy Network
Fig. 1 shows the schematic diagram of the neuro-fuzzy network. The input layer has m nodes, which represent the m input variables in equation (1). There are Q nodes in the second layer, i.e., the fuzzification layer, where

Q = Σ_{i=1}^{m} r_i    (5)
[Figure: a five-layer neuro-fuzzy network with input nodes x1, ..., xm, fuzzification nodes μ, inference and normalization nodes R1, ..., RN, output-layer weights ω, and output nodes y1, ..., yn.]

Fig. 1. Neuro-fuzzy network
The activation function of every neuron in the fuzzification layer is the corresponding membership function, i.e., equations (3)-(4); so c_{i,j} and σ_{i,j} are selected as the initial parameters of the fuzzification layer. There are N nodes in the third layer, i.e., the inference layer. Every node in this layer represents a fuzzy rule, and the output of a node is the product of all its inputs, with

N = Π_{i=1}^{m} r_i    (6)

The normalization layer also has N nodes. The output layer has n nodes, which represent the n output variables in equation (1), and the output of every node is computed as follows:

y_i = Σ_{j=1}^{N} ω_{ij} R_j,   i = 1, 2, ..., n    (7)

where ω_{ij} is the weight between the ith node of the output layer and the jth node of the normalization layer. The nearest-neighbor principle is adopted to determine the initial ω_{ij}. For example, ω_{1,1} corresponds to c_{1,1}, c_{2,1}, ..., c_{m,1}; assuming the lth input-output data pair in data set (1) is the closest to the vector {c_{1,1}, c_{2,1}, ..., c_{m,1}}, the first output value of the lth input-output data pair is selected as the initial ω_{1,1}.

2.2.2 Training the Neuro-fuzzy Network
The back-propagation algorithm with variable step size is adopted to train the neuro-fuzzy network. The error function E is defined as follows:

E = (1/2) Σ_{p=1}^{P} Σ_{i=1}^{n} (y_i − y_i^d)^2    (8)

where y_i is the actual output of the neuro-fuzzy network and y_i^d is the desired output.
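The layer computations described above (Gaussian fuzzification, product inference, normalization, and the weighted output of Eq. (7)) can be sketched as a single forward pass. Shapes and sample parameters are illustrative only.

import itertools
import numpy as np

def forward(x, centers, sigmas, W):
    """x: (m,) input vector; centers/sigmas: per-variable fuzzy-set parameters;
    W: (n, N) output-layer weights, where N = product of the r_i."""
    # Fuzzification layer: Gaussian membership of each input in each of its sets.
    mu = [np.exp(-((x[i] - centers[i]) ** 2) / sigmas[i] ** 2)
          for i in range(len(x))]
    # Inference layer: one node per combination of fuzzy sets (one set per variable).
    R = np.array([np.prod(combo) for combo in itertools.product(*mu)])
    R = R / R.sum()                      # normalization layer
    return W @ R                         # Eq. (7): y_i = sum_j w_ij R_j

centers = [np.array([-1.0, 0.0, 1.0]), np.array([0.0, 1.0])]   # r_1 = 3, r_2 = 2
sigmas  = [np.array([0.4, 0.4, 0.4]),  np.array([0.5, 0.5])]
W = np.random.randn(1, 3 * 2)            # one output variable, N = 6 rules
print(forward(np.array([0.2, 0.7]), centers, sigmas, W))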
The learning algorithm for updating c_{ij} is as follows:

c_{ij}(k+1) = c_{ij}(k) − α(k) ∂E/∂c_{ij}(k)    (9)

α(k) = 2^λ α(k−1)    (10)

λ = sgn[ ∂E/∂c_{ij}(k) × ∂E/∂c_{ij}(k−1) ]    (11)

where α(k) is the learning rate and λ is the step coefficient. Using the same algorithm, we can also update σ_{ij} and ω_{ij}; [12] gives the detailed computation of ∂E/∂c_{ij}(k).
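A sketch of the update rule of Eqs. (9)-(11) for a single parameter is shown below. The gradients are supplied by the caller, and the numerical values are illustrative.

import numpy as np

def variable_step_update(c, grad, prev_grad, alpha_prev):
    """Eqs. (9)-(11): alpha(k) = 2**lambda * alpha(k-1),
    lambda = sign(dE/dc(k) * dE/dc(k-1))."""
    lam = np.sign(grad * prev_grad)      # +1 doubles the step, -1 halves it
    alpha = (2.0 ** lam) * alpha_prev
    return c - alpha * grad, alpha

c, alpha = 0.5, 0.01
prev_grad = 0.0
for grad in [0.4, 0.35, -0.1]:           # gradients from three successive epochs
    c, alpha = variable_step_update(c, grad, prev_grad, alpha)
    prev_grad = grad
    print(c, alpha)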
2.3 Extraction of the Fuzzy Rules
After the neuro-fuzzy network is trained, we can extract fuzzy rules from it. The main steps are as follows:
Step 1: We select the trained c_{ij}, σ_{ij} as the parameters of the membership functions of the input variables.
Step 2: Assume that the trained ω_{ij} can be divided into q_i clusters using the improved c-means algorithm, whose centers are d_{i,j}. Therefore, the output variable y_i is divided into q_i fuzzy sets B_{i,j}, whose membership-function centers are d_{i,j}.
Step 3: Every node in the inference layer represents a fuzzy rule. For example, the first node of the inference layer corresponds to the fuzzy sets A_{1,1}, ..., A_{m,1}, and the weights ω_{i,1} correspond to fuzzy sets B_{i,j}; so the fuzzy rule that the first node represents is:
Rule 1: If x_1 is A_{1,1} and x_2 is A_{2,1} and ... x_m is A_{m,1}, then y_1 is B_{1,j1}, y_2 is B_{2,j2}, ..., y_n is B_{n,jn}.
Step 4: All fuzzy rules represented by the nodes are extracted.
368
R. Bai, T. Chai, and E. Ma
similar fuzzy sets. In order to increase the interpretability, the similarity measure is defined as follows: m
Ss ( A, B ) =
∑
min( μ A ( x i ), μ B ( x i ))
∑
max( μ A ( x i ), μ B ( x i ))
i =1 m
(12)
i =1
If
S s > ξ s , i.e., the fuzzy sets are very similar, two fuzzy sets A and B should be
merged to create a new fuzzy set C. where
ξs
is a predefined threshold. The parame-
ters of newly merged fuzzy set C from A and B are defined as: cA + cB ⎧ ⎪cC = 2 ⎪ ⎨ 2 σ A + σ ⎪σ = C 2 .5 ⎩⎪
(13)
2 B
2.4.2 Delete the Similar and Incompatible Fuzzy Rules Considering two fuzzy rules:
Ri : If x1 is A1, i and … xm is Am,i , then y1 is B1,i , and … , yn is Bn, i . R j : If x1 is A1, j and … xm is Am, j , then y1 is B1, j , and … , yn is Bn, j . The similarity measure of the antecedent part and consequent part of fuzzy rules are determined as follows: S r _ if ( R i , R j ) =
S r _ then ( R i , R j ) =
S s ( A1 , i , A1 j ) + m
S s ( B 1 , i , A1 j ) +
If S r _ if > ξ r _ if and S r _ then > ξ r _ then , and S r _ then
< ξ r _ then
,
+ S s ( A m ,i , A m , j )
+ S s ( B m ,i , Am , j ) n
Ri and R j are similar fuzzy rules. If
(14)
(15) S r _ if > ξ r _ if
Ri and R j are incompatible fuzzy rules. If two fuzzy rules are
similar or incompatible, we should delete one of them.
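Building on Eq. (12), the rule-level tests of Eqs. (14)-(15) can be sketched as below. Rules are represented simply as lists of (center, width) pairs of Gaussian sets, and the grid and thresholds are illustrative.

import numpy as np

xs = np.linspace(-1, 1, 201)
gauss = lambda c, s: np.exp(-((xs - c) ** 2) / s ** 2)

def set_sim(a, b):
    ma, mb = gauss(*a), gauss(*b)
    return np.minimum(ma, mb).sum() / np.maximum(ma, mb).sum()

def rule_relation(ante_i, cons_i, ante_j, cons_j, xi_if=0.7, xi_then=0.7):
    s_if = np.mean([set_sim(a, b) for a, b in zip(ante_i, ante_j)])    # Eq. (14)
    s_then = np.mean([set_sim(a, b) for a, b in zip(cons_i, cons_j)])  # Eq. (15)
    if s_if > xi_if:
        return "similar" if s_then > xi_then else "incompatible"
    return "distinct"

Ri = ([(0.0, 0.2), (0.5, 0.2)], [(1.0, 0.3)])
Rj = ([(0.05, 0.2), (0.5, 0.2)], [(-1.0, 0.3)])
print(rule_relation(*Ri, *Rj))   # antecedents agree, consequents differ -> incompatible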
3 Application in the Blending Process of Raw Slurry
In the alumina sintering production process, lime, ore, red slurry and alkali are blended to form the raw slurry. In this blending process, the three most important quality indices of the raw slurry are determined by the flow rates of the four raw materials. The traditional manual operation manner, in which operators adjust the flow rates based on the errors of the quality indices, cannot produce high-quality raw slurry. Therefore, a fuzzy system is proposed to replace the manual operation, and the fuzzy rules are constructed using the approach proposed
in this paper. In this fuzzy system, the input variables are e1, e2 and e3, which represent the errors of the quality indices, respectively, and the output variables are Δx1, Δx2, Δx3 and Δx4, which represent the adjustments of lime, ore, red slurry and alkali, respectively. The input and output data set can be obtained by means of history data and experience data:

(e_{1,i}, e_{2,i}, e_{3,i}; Δx_{1,i}, Δx_{2,i}, Δx_{3,i}, Δx_{4,i}),   i = 1, ..., 200    (16)
At first, using the improved c-means algorithm, we obtain the initial fuzzy sets of e1, e2 and e3. The numbers of nodes in the layers of the neuro-fuzzy network are 3, 11, 45, 45, and 4, respectively. Using the data set (16) to train the neuro-fuzzy network, we obtain the trained fuzzy sets. The initial and trained fuzzy sets of the input variables are shown in Table 1. Using the similarity measure, we find that the two fuzzy sets PM and PB of e1 are similar, and we use the fuzzy set P of e1 to replace them. After training and regulation, the final fuzzy sets of the input and output variables are shown in Table 2.

Table 1. The initial and trained fuzzy sets of e1, e2 and e3
Variables | Initial fuzzy sets (c_i, σ_i) | Trained fuzzy sets (c_i, σ_i)
e1 | ZE(0.02,0.22), PM(0.56,0.16), PB(0.97,0.2) | ZE(0.01,0.17), PM(0.75,0.31), PB(0.82,0.22)
e2 | NB(-0.28,0.052), NS(-0.15,0.048), ZE(-0.03,0.06), PS(0.12,0.066), PB(0.285,0.06) | NB(-0.29,0.07), NS(-0.12,0.053), ZE(-0.01,0.045), PS(0.092,0.071), PB(0.28,0.056)
e3 | N(-0.12,0.06), ZE(0.03,0.042), P(0.135,0.06) | N(-0.14,0.05), ZE(0.01,0.05), P(0.139,0.04)
Table 2. The final fuzzy sets of variables

Variables | Final fuzzy sets (c_i, σ_i)
e1  | ZE(0.01,0.17), P(0.785,0.152)
e2  | NB(-0.29,0.07), NS(-0.12,0.053), ZE(-0.01,0.045), PS(0.092,0.071), PB(0.28,0.056)
e3  | N(-0.14,0.05), ZE(0.01,0.05), P(0.139,0.04)
Δx1 | NB(-9.21,156), NS(-5.3,2.13), ZE(0.02,2.1)
Δx2 | NB(-4.42,0.49), NM(-3.2,0.76), NS(-1.3,0.51), ZE(0.03,0.58), PS(1.42,0.65), PM(3.051,0.54), PB(4.41,0.5)
Δx3 | NB(-4.3,0.51), NM(-3.02,0.7), NS(-1.27,0.51), ZE(-0.01,0.604), PS(1.5,0.56), PM(2.98,0.608), PB(4.5,0.6)
Δx4 | NB(-5.92,1.17), NM(-3,0.588), NS(-1.53,0.616), ZE(0.01,0.59), PS(1.49,0.612), PM(3.02,0.045), PB(5.84,1.13)
At last, we obtain thirty fuzzy rules, for example:
Rule 1: If e1 is ZE and e2 is NB and e3 is NB, then Δx1 is ZE, Δx2 is NS, Δx3 is PS, Δx4 is PM.
...
Rule 30: If e1 is P and e2 is PB and e3 is P, then Δx1 is NB, Δx2 is PB, Δx3 is NB, Δx4 is NM.
The quality of the slurry is improved greatly when the fuzzy system replaces the operator.
4 Conclusions
This paper improves the c-means algorithm and uses a neuro-fuzzy network combined with the improved c-means algorithm to extract fuzzy rules from input-output data. The initial parameters and structure of the neuro-fuzzy network can be determined appropriately, and regulation of the fuzzy rules is carried out to increase their interpretability. The proposed approach is applied to construct a set of fuzzy rules for the raw-slurry blending process, and the results show its validity.
References 1. Wang, L.X., Mendel, J.M.: Generating fuzzy rules by learning from examples. IEEE Transactions on Fuzzy Systems 9 (2001) 426-442 2. Wang, Y.F., Chai, T.Y.: Mining fuzzy rules from data and its system implementation. Journal of System Engineering 20 497-503 3. Hu, Y.C., Chen, R.S.: Finding fuzzy classification rules using data mining techniques. Pattern Recognition Letters 24 (2003) 509-519 4. Wong, C.C., Lin, N.S.: Rule extraction for fuzzy modeling. Fuzzy sets and systems (1997) 23-30 5. Gomez-skarmeta, A.F., Delgado, M., Vila, M. A.: About the use of fuzzy clustering techniques for fuzzy model identification. Fuzzy sets and systems (1999) 179-188 6. Xiong, X., Wang, D.X.: Effective data mining based fuzzy neural networks. Journal of Systems Engineering 15 32-37 7. Xing, Z.Y., Jia, L.M. etc: A Case Study of Data-driven Interpretable Fuzzy Modeling. ACTA AUTOMATICA SINICA 31 (2005) 815-824 8. Jin, Y.C., Sendhoff, B.: Extracting Interpretable Fuzzy Rules from RBF Networks. Neural Process Letters 17 (2003) 149-164 9. Paiva, R.P.: Interpretability and learning in neuro-fuzzy systems. Fuzzy sets and systems (2004) 17-38 10. Cho, K.B., Wang, B.H.: Radial basis function based adaptive fuzzy systems and their applications to system identification and prediction. Fuzzy sets and systems (1996) 325-339 11. Oh, S.K., Pedrycz, W., Park, H.S.: Hybrid identification in fuzzy-neural networks . Fuzzy Sets and Systems (2003) 399-426 12. Sun, Z.Q.: Intelligent control theory and technology. Tsinghua Universtiy Press 1997 13. Setnes, M., Babuska, R.: Similarity Measures in Fuzzy Rule Base Simplification. IEEE Transactions on system, man, and cybernetics-Part B 28 376-386
Neural Network Training Using Genetic Algorithm with a Novel Binary Encoding Yong Liang1 , Kwong-Sak Leung2 , and Zong-Ben Xu3 1
Department of Computer Science and Ministry of Education National Key Laboratory on Embedded Systems, College of Engineering, Shantou University, Shantou, Guangdong, China
[email protected] 2 Department of Computer Science and Engineering, The Chinese University of Hong Kong, HK
[email protected] 3 School of Science, Xi’an Jiaotong University Xi’an, Shaanxi, China
[email protected]
Abstract. Genetic algorithms (GAs) are widely used in the parameter training of neural networks (NN). In this paper, we investigate GAs based on our proposed novel genetic representation to train the parameters of an NN. A splicing/decomposable (S/D) binary encoding is designed based on theoretical guidance and existing recommendations. Our theoretical and empirical investigations reveal that the S/D binary representation is more proper than other existing binary encodings for GA searching. Moreover, a new genotypic distance on the S/D binary space is equivalent to the Euclidean distance on the real-valued space during GA convergence. Therefore, GAs can reliably and predictably solve problems of bounded complexity, and methods that depend on the Euclidean distance for solving different kinds of optimization problems can be used directly on the S/D binary space. This investigation demonstrates that GAs based on our proposed binary representation can efficiently and effectively train the parameters of an NN.
1 Introduction
Most real-world problems can be encoded by different representations, but genetic algorithms (GAs) may not be able to successfully solve the problems based on their phenotypic representations, unless problem-specific genetic operators are used. In particular, GAs are widely used in the parameter training of neural networks (NN), where the NN parameters must be transformed from the real encoding into binary strings. Therefore, a proper genetic representation is necessary when using GAs on real-world problems [1], [8], [12]. A large number of theoretical and empirical investigations on genetic representations have been made over the last decades, and they have shown that the behavior and performance of GAs is strongly influenced by the representation used. Originally, the schema theorem and the building block hypothesis were proposed by [1]
D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 371–380, 2007. © Springer-Verlag Berlin Heidelberg 2007
and [3] to model the performance of GAs to process similarities between binary bitstrings. The most common binary representations are the binary, gray and unary encodings. According to three aspects of representation theory (redundancy, scaled building block and distance distortion), Rothlauf [9] studied the performance differences of GAs by different binary representations for real encoding. Analysis on the unary encoding by the representation theory reveals that encoding is redundant, and does not represent phenotypes uniformly. Therefore, the performance of GAs with the unary encoding depends on the structure of the optimal solution. Unary GAs fail to solve integer one-max, deceptive trap and BinInt problems [4], unless larger population sizes are used, because the optimal solutions are strongly underrepresented for these three types of problems. Thus, the unary GAs perform much worse than GAs using the non-redundant binary or gray encoding [9]. The binary encoding uses exponentially scaled bits to represent phenotypes. Its genotype-phenotype mapping is a one-to-one mapping and encodes phenotypes without redundancy. However, for non-uniformly binary strings and competing Building Blocks (BBs) for high dimensional phenotype space, there are a lot of noise from the competing BBs lead to a reduction on the performance of GAs. In addition, the binary encoding has the effect that genotypes of some phenotypical neighbors are completely different. As a result, the locality of the binary representation is partially low, i.e. Hamming cliff [10]. In the distance distortion theory, an encoding preserves the difficulty of a problem if it has perfect locality and if it does not modify the distance between individuals. The analysis reveals that the binary encoding changes the distance between the individuals and therefore changes the complexity of the optimization problem. Thus, easy problems can become difficult, and vice versa. The binary GAs are not able to reliably solve problems when mapping the phenotypes to the genotypes. The non-redundant gray encoding [10] was designed to overcome the problems with the Hamming cliff of the binary encoding. In the gray encoding, every neighbor of a phenotype is also a neighbor of the corresponding genotype. Therefore, the difficulty of a problem remains unchanged when using mutation-based search operators that only perform small step in the search space. As a result, easy problems and problems of bounded difficulty are easier to solve when using the mutation-based search with the gray coding than that with the binary encoding. Although the gray encoding has high locality, it still changes the distance correspondence between the individuals with bit difference of more than one. When focused on crossover-based search methods, the analysis of the average fitness of the schemata reveals that the gray encoding preserves building block complexity less than the binary encoding. Thus, a decrease in performance of gray-encoded GAs is unavoidable for some kind of problems [2], [12]. Up to now, there is no well set-up theory regarding the influence of representations on the performance of GAs. To help users with different tasks to search good representations, over the last few years, some researchers have made recommendations based on the existing theories. For example, Goldberg [1] has
proposed two basic design principles for encodings: (i) the principle of minimal alphabets: the alphabet of the encoding should be as small as possible while still allowing a natural representation of solutions; and (ii) the principle of meaningful building blocks: the schemata should be short, of low order, and relatively unrelated to schemata over other fixed positions. The principle of minimal alphabets advises us to use a bit-string representation. Combining it with the principle of meaningful building blocks (BBs), we construct uniform-salient BBs, which consist of equally scaled and splicing/decomposable alleles. This paper is organized as follows. Section 2 introduces the novel splicing/decomposable (S/D) binary representation and its genotypic distance. Section 3 describes a new genetic algorithm based on the S/D binary representation, the splicing/decomposable genetic algorithm (SDGA). Section 4 provides the simulation results of SDGA for NN training and comparisons with other binary GAs. The conclusions are summarized in Section 5.
2 A Novel Splicing/Decomposable Binary Genetic Representation
Based on the above investigation results and recommendations, Leung et al. have proposed a new genetic representation which is proper for GA searching [5], [13]. In this section, we first introduce the novel splicing/decomposable (S/D) binary encoding, then define the new genotypic distance for the S/D encoding, and finally give the theoretical analysis of the S/D encoding based on the three elements of genetic representation theory (redundancy, scaled BBs and distance distortion).

2.1 A Splicing/Decomposable Binary Encoding
In [5], Leung et al. proposed a novel S/D binary encoding for real-valued variables. Assume the phenotypic domain Φp of an n-dimensional problem can be specified by

Φp = [α_1, β_1] × [α_2, β_2] × ··· × [α_n, β_n].    (1)

Given the length l of a binary string, the genotypic precision is h_i(l) = (β_i − α_i) / 2^(l/n), i = 1, 2, ..., n. Any real-valued variable x = (x_1, x_2, ..., x_n) ∈ Φp can be represented by a splicing/decomposable (S/D) binary string b = (b_1, b_2, ..., b_l); the genotype-phenotype mapping f_g is defined as

x = (x_1, x_2, ..., x_n) = f_g(b) = ( Σ_{j=0}^{l/n} 2^(l/n−j) × b_{j×n+1}, Σ_{j=0}^{l/n} 2^(l/n−j) × b_{j×n+2}, ..., Σ_{j=0}^{l/n} 2^(l/n−j) × b_{(j+1)×n} ),    (2)-(3)

where

Σ_{j=0}^{l/n} 2^(l/n−j) × b_{j×n+i} ≤ (x_i − α_i) / h_i(l) < Σ_{j=0}^{l/n} 2^(l/n−j) × b_{j×n+i} + 1.    (4)
That is, the significance of each bit of the encoding can be clearly and uniquely interpreted (hence, each BB of the encoded S/D binary string has a specific meaning). As shown in Figure 1, take Φp = [0, 1] × [0, 1] and the S/D binary string b = 100101 as an example (in this case, l = 6, n = 2, and the genotypic precisions h1 (l) = h2 (l) = 18 ). Let us look how to identify the S/D binary string b and see what each bit value of b means. In Figure 1-(a), the phenotypic 1
domain Φp is bisected into four Φp2 (i.e., the subregions with uniform size 12 ). According to the left-0 and right-1 correspondence rule in each coordinate di1 rection, these four Φp2 then can be identified with (00), (01), (10) and (11). As the phenotype x lies in the subregion (10) (the gray square), its first building block (BB) should be BB1 = 10. This leads to the first two bits of the S/D 1
binary string b. Likewise, in Figure 1-(b), Φp is partitioned into 22×2 Φp4 , which 1
are obtained through further bisecting each Φp2 along each direction. Particu1
1
larly this further divides Φp2 = (BB1 ) into four Φp4 that can be respectively labelled by (BB1 , 00), (BB1 , 01), (BB1 , 10) and (BB1 , 11). The phenotype x is in (BB1 , 01)-subregion (the dark square), so its second BB should be BB2 = 01 and the first four positions of its corresponding S/D binary string b is 1001. 1
In the same way, Φp is partitioned into 22×3 Φp8 as shown in Figure 11
1
(c), with Φp4 = (BB1 , BB2 ) particularly partitioned into four Φp8 labelled by (BB1 , BB2 , 00), (BB1 , BB2 , 01), (BB1 , BB2 , 10) and (BB1 , BB2 , 11). The phenotype x is found to be (BB1 , BB2 , 01), that is, identical with S/D binary string b. This shows that for any three region partitions, b = (b1 , b2 , b3 , b4 , b5 , b6 ), each bit value bi can be interpreted geometrically as follows: b1 = 0 (b2 = 0) means the
phenotype x is in the left half along the x-coordinate direction (the y-coordinate direction) in Φp partition with 12 -precision, and b1 = 1 (b2 = 1) means x is in the right half. Therefore, the first BB1 = (b1 , b2 ) determine the 12 -precision location 1
1
of x. If b3 = 0 (b4 = 0), it then further indicates that when Φp2 is refined into Φp4 , 1
the x lies in the left half of Φp2 in the x-direction (y-direction), and it lies in the right half if b3 = 1 (b4 = 1). Thus a more accurate geometric location (i.e., the 1 4 -precision location) and a more refined BB2 of x is obtained. Similarly we can explain b5 and b6 and identify BB3 , which determine the 18 -precision location of x. This interpretation holds for any high-resolution l bits S/D binary encoding. 2.2
A New Genotypic Distance on the Splicing/Decomposable Binary Representation
For measuring the similarity of binary strings, the Hamming distance is widely used on the binary space. The Hamming distance describes how many bits differ in two binary strings, but it cannot account for the scaled property of non-uniformly scaled binary representations. Thus, the distance distortion between the genotypic and the phenotypic spaces makes a phenotypically easy problem more difficult. Therefore, to make sure that GAs are able to reliably solve easy problems and problems of bounded complexity, the use of equivalent distances is recommended. For this purpose, we define a new genotypic distance on the S/D binary space to measure the similarity of S/D binary strings.

Definition 1. Suppose binary strings a and b belong to the S/D binary space Φg. The genotypic distance ||a − b||_g is defined as

||a − b||_g = Σ_{i=1}^{n} Σ_{j=0}^{l/n−1} |a_{j×n+i} − b_{j×n+i}| / 2^(j+1),
where l and n denote the length of the S/D binary strings and the dimension of the real-encoded phenotypic space Φp, respectively. For any two S/D binary strings a, b ∈ Φg, we can define the Euclidean distance of their corresponding phenotypes,

||a − b||_p = sqrt( Σ_{i=1}^{n} ( Σ_{j=0}^{l/n−1} a_{j×n+i} / 2^(j+1) − Σ_{j=0}^{l/n−1} b_{j×n+i} / 2^(j+1) )^2 ),
as the phenotypic distance between the S/D binary strings a and b.

Theorem 1. The phenotypic distance ||·||_p and the genotypic distance ||·||_g are equivalent in the S/D binary space Φg, because the inequality

||·||_p ≤ ||·||_g ≤ √n × ||·||_p
is satisfied in the S/D binary space Φg, where n is the dimension of the real-encoded phenotypic space Φp.
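To make the mapping of Eqs. (2)-(4) and the distance of Definition 1 concrete, a small sketch is given below for the unit-box domain. The function names are illustrative, and the decoding uses fractional weights 2^-(j+1) per group, which reproduces the geometric interpretation of the example string b = 100101 with n = 2.

def sd_decode(b, n, lower, upper):
    """Decode an S/D binary string b (l/n groups of n bits) into a real vector:
    group j carries one bit per dimension with weight 2**-(j+1)."""
    l = len(b)
    x = list(map(float, lower))
    for j in range(l // n):
        for i in range(n):
            x[i] += b[j * n + i] * (upper[i] - lower[i]) * 2.0 ** -(j + 1)
    return x

def genotypic_distance(a, b, n):
    """Definition 1: sum over dimensions of |a_{jn+i} - b_{jn+i}| / 2**(j+1)."""
    l = len(a)
    return sum(abs(a[j * n + i] - b[j * n + i]) / 2.0 ** (j + 1)
               for i in range(n) for j in range(l // n))

a = [1, 0, 0, 1, 0, 1]          # the example string b = 100101 with n = 2
print(sd_decode(a, 2, [0, 0], [1, 1]))                # -> [0.5, 0.375]
print(genotypic_distance(a, [0, 0, 0, 0, 0, 0], 2))   # -> 0.875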
According to the distance distortion element of genetic representation theory, using the new genotypic distance ||·||_g can guarantee that a GA reliably and predictably solves problems of bounded complexity.

2.3 Theoretical Analysis of the Splicing/Decomposable Binary Encoding
In our previous work [6], [7], we introduced the delicate feature of the S/D representation, a building-block-significance-variable property. It is seen from the above interpretation that the first n bits of an encoding are responsible for the location of the n-dimensional phenotype x in a global way (particularly, with O(1/2)-precision); the next group of n bits is responsible for the location of phenotype x in a less global (might be called 'local') way, with O(1/4)-precision, and so forth; the last group of n bits then locates phenotype x in an extremely local (might be called 'microcosmic') way (particularly, with O(1/2^(l/n))-precision). Thus, as the encoding length l increases, the representation

(b_1, b_2, ···, b_n, b_{n+1}, b_{n+2}, ···, b_{2n}, ···, b_{l−n}, b_{l−n+1}, ···, b_l) = (BB_1, BB_2, ···, BB_{l/n})    (5)-(7)
can provide a successive refinement (from global, to local, to microcosmic) and a more and more accurate representation of the problem variables. In each BB_i of the S/D binary string, which consists of the bits (b_{i×n+1}, b_{i×n+2}, ···, b_{(i+1)×n}), i = 0, ···, l/n−1, the bits are uniformly scaled. We refer to this delicate feature of BB_i as the uniform-salient BB (USBB). Furthermore, splicing different numbers of USBBs can describe rough approximations of the problem solutions with different precisions, so the intra-BB difficulty (within building blocks) and the inter-BB difficulty (between building blocks) [1] of USBBs are low. The theoretical analysis reveals that GAs searching on USBBs can explore the high-quality bits faster than GAs on non-uniformly scaled BBs. The S/D binary encoding is a redundancy-free representation, because using S/D binary strings to represent real values is a one-to-one genotype-phenotype mapping. The whole S/D binary string is constructed as a non-uniformly scaled sequence of USBBs; domino convergence of GAs occurs and USBBs are solved sequentially from high to low scaled. The BB-significance-variable and uniform-salient-BB properties of the S/D binary representation embody much information useful to GA searching. We exploit this information to design a new GA based on the S/D binary representation in the subsequent sections.
3 A New S/D Binary Genetic Algorithm (SDGA)
The above interpretation reveals that, for the non-uniformly scaled binary strings and competing building blocks (BBs) of the binary and gray encodings, the noise from the competing BBs leads to a reduction in the performance of GAs.
Input: N — population size, m — number of USBBs, g — number of generations to run;
Termination condition: population fully converged;
begin
  g ← 0; m ← 1;
  Initialize P_g; Evaluate P_g;
  while (not termination condition) do
    for t ← 1 to N/2
      randomly select two individuals x1_t and x2_t from P_g;
      crossover and selection of x1_t, x2_t into P_{g+1};
    end for
    mutation operation on P_{g+1};
    Evaluate P_{g+1};
    if (USBB_m fully converged) m ← m + 1;
  end while
end

Fig. 2. Pseudocode for the SDGA algorithm
To avoid this problem, we propose a new splicing/decomposable GA (SDGA) based on the delicate properties of the S/D binary representation from our previous work [6], [7]. In the SDGA, the genetic operators are applied sequentially from the high-scaled to the low-scaled USBBs. For two individuals x1 and x2 randomly selected from the current population, the crossover point is randomly set in the convergence-window USBB and the crossover operator generates two children c1, c2. The parents x1, x2 and their children c1, c2 can be divided into two pairs {x1, c1} and {x2, c2}; in each pair {xi, ci} (i = 1, 2), the parent and child have the same low-scaled USBBs. The selection operator conserves the better one of each pair into the next generation according to the fitness calculated on the whole S/D binary string for high accuracy. Thus, the bits contributing to high fitness in the convergence-window USBB are preserved, and the diversity at the low-scaled side is maintained. The mutation operates on the convergence-window and not-yet-converged USBBs according to the mutation probability to increase the diversity in the population; these low-salient USBBs will converge through GA searching, which avoids genetic drift. The implementation outline of the SDGA is shown in Fig. 2. Identifying high-quality bits in the convergence-window USBB is faster than for GAs on non-uniformly scaled BBs, while no genetic drift occurs; thus, the population can efficiently converge to the high-quality BB at the position of the convergence-window USBB, which is a component of the overrepresented optimum of the problem. According to the theoretical results of Thierens [11], the overall convergence time complexity of the new GA with the S/D binary representation
is approximately of order O(l/√n), where l is the length of the S/D binary string and n is the dimension of the problem. This is much faster than working on the binary strings as a whole, where GAs have an approximate convergence time of order O(l). The gain is especially significant for high-dimensional problems.
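The pairing rule described in this section (each parent is kept together with the child that shares its low-scaled tail, and the better of each pair survives) can be sketched as follows. The representation details, the fitness function and the assumption that lower fitness is better are illustrative simplifications, not the authors' implementation.

import random

def sdga_step(x1, x2, window_start, window_end, fitness):
    """One crossover-and-selection step of SDGA: cut inside the convergence
    window, pair each parent with the child that keeps its low-scaled tail,
    and keep the better of each pair (here, lower fitness is better)."""
    cut = random.randint(window_start, window_end)
    c1 = x1[:cut] + x2[cut:]             # c1 shares the tail (low-scaled USBBs) of x2
    c2 = x2[:cut] + x1[cut:]             # c2 shares the tail of x1
    survivor1 = min((x2, c1), key=fitness)   # same tail within this pair
    survivor2 = min((x1, c2), key=fitness)
    return survivor1, survivor2

f = lambda s: abs(sum(b * 2 ** -(k + 1) for k, b in enumerate(s)) - 0.3)
print(sdga_step([1, 0, 1, 0, 0, 0], [0, 1, 1, 1, 0, 1], 0, 1, f))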
4 Simulations and Comparisons
GAs in the NN area can be used for searching weight values, for topology design, for NN parameter settings, and for selection and ordering of input and output vectors for the training and testing sets. We focus only on weight searching by GAs. The structure of the NN is fixed and is not changed throughout all experiments: a feedforward NN with one hidden layer of 20 hidden neurons, the sigmoidal transfer function tansig(x) for the hidden neurons and the linear transfer function purelin(x) for the output neuron. We used the NN to approximate the nonlinear functions f1 − f3, respectively:

f1(x) = (1 − x^2) e^(−x^2/2),   x ∈ [−2, 2];    (8)

f2(x) = Σ_{j=1}^{5} j cos{(j + 1)x + j},   x ∈ [−10, 10];    (9)

f3(x1, x2) = 1 / (1 + |(x1 + i·x2)^6 − 1|),   x1, x2 ∈ [−2, 2].    (10)
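For the weight-search experiments, a fitness evaluation along the following lines is implied. The paper does not list code, so the exact weight packing, the bias handling and the use of mean squared error (lower is better) are assumptions; only the 1 x 20 x 1 tanh/linear structure and the target f1 come from the text above.

import numpy as np

def nn_output(w, x, hidden=20):
    """One-hidden-layer network with tanh hidden units and a linear output;
    w packs all weights and biases for a single-input, single-output net."""
    W1 = w[:hidden].reshape(hidden, 1)
    b1 = w[hidden:2 * hidden]
    W2 = w[2 * hidden:3 * hidden]
    b2 = w[3 * hidden]
    h = np.tanh(W1 @ np.atleast_2d(x) + b1[:, None])
    return W2 @ h + b2

def fitness(w, xs, ys):
    """Approximation error of a candidate weight vector (lower is better)."""
    pred = nn_output(w, xs).ravel()
    return np.mean((pred - ys) ** 2)

xs = np.linspace(-2, 2, 100)
ys = (1 - xs ** 2) * np.exp(-xs ** 2 / 2)        # target function f1
w = np.random.randn(3 * 20 + 1) * 0.1
print(fitness(w, xs, ys))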
The standard GA (SGA) using the binary, gray, unary and S/D encodings, and the SDGA, were used to train the NN in order to compare their performance. We performed 50 runs, and each run was stopped after 1000 generations. For fairness of comparison, we implemented the SGAs with the different binary encodings and the SDGA with the same parameter settings and the same initial population of 500 individuals, in which each variable is represented by a 20-bit binary string. For the SGA, we used a one-point crossover operator (crossover probability = 0.8), a one-point mutation operator (mutation probability = 0.05) and tournament selection without replacement of size two. All algorithms were implemented in the MATLAB environment. Figure 3 presents the results for the problems f1 − f3: the plots show, for the SGAs with different representations and for the SDGA, the best fitness with respect to the generations. Table 1 summarizes the experimental results for all
Fig. 3. The comparison results for the problems f1 − f3 .(◦:SDGA; :unary SGA; ×:binary SGA, +: gray SGA; •:S/D encoding SGA)
the test problems f1 − f3. The best fitness of each problem is calculated as the average of the fitness when the GAs fully converged over the different runs. As Figure 3 and Table 1 show, the SGAs with the differently scaled binary representations, including the binary, gray and S/D encodings, suffer from domino convergence, genetic drift and noise from competing BBs. Due to the redundancy problems of the unary encoding, which result in an underrepresentation of the optimal solution, the performance of the SGA using the unary encoding is significantly worse than when using the binary, gray or S/D encodings. The SGA with the gray encoding performs worse than with the binary encoding for f1. As expected, the SGA using the S/D encoding performs better than those using the binary and gray encodings on all the test problems, because in the S/D encoding the more salient bits are contiguous and form short, highly fit BBs, which are easily identified by the SGA. This confirms that the S/D encoding is proper for GA searching. However, the lower-salient bits in the S/D binary string are randomly fixed by genetic drift and noise from competing BBs, so the performance of the SGA with the S/D encoding is not significantly better than with the binary and gray encodings. As shown in Figure 3, the convergence of the SDGA is much faster than that of the SGAs. This reveals that the performance of the SDGA is significantly better than the SGAs with the different encodings, because no premature convergence or drift occurs. On the other hand, GAs search on the USBBs of the S/D binary encoding faster than on non-uniformly scaled BBs, and domino convergence, which occurs only over the non-uniform sequence of USBBs, is weak.

Table 1. Comparison of results of SGA with different binary representations and SDGA for the problems f1 − f3. (Numbers in parentheses are the standard deviations.)

Best fitness | Unary SGA    | Binary SGA   | Gray SGA     | S/D SGA       | SDGA
f1           | 0.51 (0.17)  | 0.25 (0.13)  | 0.33 (0.12)  | 0.14 (0.083)  | 0.057 (0.029)
f2           | 4.3 (1.6)    | 3.2 (1.6)    | 2.9 (1.8)    | 2.4 (0.95)    | 0.14 (0.052)
f3           | 0.30 (0.19)  | 0.21 (0.11)  | 0.18 (0.086) | 0.17 (0.093)  | 0.042 (0.034)

5 Conclusions
Genetic algorithms (GAs) are widely used in the parameter training of Neural Network (NN). In this paper, we investigate GAs based on our proposed novel genetic representation to train the parameters of NN. A splicing/decomposable (S/D) binary encoding is designed based on some theoretical guidance and existing recommendations. Our theoretical and empirical investigations reveal that the S/D binary representation is more proper than other existing binary encodings for GAs’ searching. Moreover, a new genotypic distance on the S/D binary space is equivalent to the Euclidean distance on the real-valued space during GAs convergence. Therefore, GAs can reliably and predictably solve problems of bounded complexity and the methods depended on the Euclidean distance for solving different kinds of optimization problems can be directly used on the
S/D binary space. This investigation demonstrates that GAs based on our proposed binary representation can efficiently and effectively train the parameters of an NN.
Adaptive Training of a Kernel-Based Representative and Discriminative Nonlinear Classifier Benyong Liu, Jing Zhang, and Xiaowei Chen College of Computer Science and Technology, Guizhou University, Huaxi 550025, Guiyang, China
[email protected],
[email protected],
[email protected]
Abstract. Adaptive training of a classifier is necessary when feature selection and sparse representation are considered. Previously, we proposed a kernel-based nonlinear classifier for simultaneous representation and discrimination of pattern features. Its batch training has a closed-form solution. In this paper we implement an adaptive training algorithm using an incremental learning procedure that exactly retains the generalization ability of batch training. It naturally yields a sparse representation. The feasibility of the presented methods is illustrated by experimental results on handwritten digit classification.
1 Introduction
Adaptive training of a classifier is necessary when feature selection and sparse representation are considered. Generally it is realized by incremental learning, a procedure that adaptively updates the parameters when a new datum arrives, without reexamining the old ones. Many incremental learning methods have been devised so far [1], [2]. Some of them improve computational efficiency at the cost of decreasing the generalization capability of batch learning. In this paper, we design an incremental learning procedure to adaptively train a previously proposed classifier named Kernel-based Representative and Discriminative Nonlinear Classifier (KNRD) [3]. In our discussion, it is required that the incremental learning result be exactly equal to that of batch learning including the new datum, so that the same generalization ability is maintained [4]. Based on the procedure, a technique for reducing the training set to obtain a sparse KNRD is derived. The validity of the presented adaptive training procedure and set-reducing technique is demonstrated by experimental results on handwritten digit recognition.
The rest of this paper is organized as follows. Section 2 briefly reviews the previously proposed classifiers, a Kernel-based Nonlinear Representor (KNR) and a Kernel-based Nonlinear Discriminator (KND), and combines them into a KNRD. Section 3 presents an incremental learning procedure that implements adaptive training of a KNRD classifier, and addresses a set-reduction technique to obtain a sparse KNRD. Experimental results on handwritten digit recognition are given in Section 4. Conclusions are drawn in Section 5 and the related proofs are put into three appendices.

(Footnote) The related work is supported by the Key Project of Chinese Ministry of Education (No.105150) and the Foundation of ATR Key Lab (51483010305DZ0207). Thanks to Prof. H. Ogawa of Tokyo Institute of Technology for helpful discussions.
2 KNRD: A Kernel-Based Representative and Discriminative Nonlinear Classifier
Our discussion is limited to finding an optimal approximation to a desirable decision function f_0(x). We assume that f_0 is defined on C^N, a complex N-dimensional vector space, and that it is an element of a reproducing kernel Hilbert space with kernel function k. Generally only M sampled values of f_0 are known beforehand and they constitute a teacher vector y, where

y = A f_0,   (1)

and A is the sampler. We also assume that y is an element of the M-dimensional space C^M. The study goal is to find a certain inverse operator X of A, so that

f = X y   (2)

becomes an optimal approximation to f_0 [4]. When a classifier is designed for optimal representation of a target class c, we can minimize the distance between f and f_0 by deriving X from

X_R^{(c)} = \arg\min_{X^{(c)}} \{ \mathrm{tr}[(I - X^{(c)} A^{(c)})(I - X^{(c)} A^{(c)})^{*}] \},   (3)

where R denotes representation and (c) means that the operators correspond to Class c, while tr and * denote the trace and the adjoint operation of an operator, respectively. The solution is named a KNR [5]. On the other hand, if a classifier is designed to optimally discriminate Class c from other classes, it is required that the inverse operator X satisfy

X_D^{(c)} = \arg\min_{X^{(c)}} \{ \mathrm{tr}(X^{(c)} Q (X^{(c)})^{*}) \},   (4)

where D denotes discrimination and Q is given by

Q = \frac{1}{C-1} \sum_{i=1, i \neq c}^{C} Q^{(i)},   (5)

wherein C is the total number of classes and

Q^{(i)} = \overline{y^{(i)}} \otimes y^{(i)},   (6)

with y^{(i)} the teacher vector of class i, \otimes the Neuman-Schatten product [6], and \overline{y^{(i)}} the complex conjugate of y^{(i)}. A solution to this criterion results in a KND [7].
Using a parameter \lambda to control the balance between representation and discrimination, the above two criteria were combined into the so-called R-D criterion, to simultaneously represent and discriminate pattern features, as follows [3]:

X_R^{(c)} = \arg\min_{X^{(c)}} \{ \mathrm{tr}[(I - X^{(c)} A^{(c)})(I - X^{(c)} A^{(c)})^{*} + \lambda X^{(c)} Q (X^{(c)})^{*}] \}.   (7)

A solution to the R-D criterion results in the following kernel-based representative and discriminative nonlinear classifier (KNRD) (Ref. [3] may be consulted for more information):

f(x) = \sum_{j=1}^{M} a_j k(x, x_j),   (8)

where \{x_j\}_{j=1}^{M} is the set of the training feature vectors and k is an associated kernel function. The coefficient set in the above representation has the following closed-form solution [3]:

a = [a_1, a_2, \dots, a_M]^T = (U^{(c)})^{+} y,   (9)

where T denotes the transpose of a vector or a matrix, (U^{(c)})^{+} is the Moore-Penrose pseudoinverse of U^{(c)}, with

U^{(c)} = K^{(c)} + \lambda Q,   (10)

and K^{(c)} is the kernel matrix determined by k and the M training feature vectors of Class c, as follows:

K^{(c)} = \begin{bmatrix} k(x_1, x_1) & k(x_2, x_1) & \cdots & k(x_M, x_1) \\ k(x_1, x_2) & k(x_2, x_2) & \cdots & k(x_M, x_2) \\ \cdots & \cdots & \cdots & \cdots \\ k(x_1, x_M) & k(x_2, x_M) & \cdots & k(x_M, x_M) \end{bmatrix}.   (11)
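A minimal numpy sketch of the batch solution in Eqs.(8)-(11) may help fix the notation: build the kernel matrix K^{(c)}, form Q from the teacher vectors of the other classes, and obtain the coefficients through the Moore-Penrose pseudoinverse. The Gaussian kernel, its width, and the toy data below are illustrative placeholders, not the paper's experimental settings.

```python
import numpy as np

def gaussian_kernel(a, b, width=1.0):
    # Placeholder kernel; the paper only assumes a reproducing kernel k.
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * width ** 2))

def knrd_coefficients(X_c, y_c, teacher_others, lam=0.2, width=1.0):
    """Batch KNRD solution a = (K^(c) + lam*Q)^+ y  (Eqs. (9)-(11))."""
    M = X_c.shape[0]
    K = np.array([[gaussian_kernel(X_c[i], X_c[j], width) for j in range(M)]
                  for i in range(M)])
    # Q = (1/(C-1)) * sum of outer products of the other classes' teacher vectors (Eqs. (5)-(6)).
    Q = sum(np.outer(np.conj(t), t) for t in teacher_others) / len(teacher_others)
    U = K + lam * Q
    return np.linalg.pinv(U) @ y_c

def knrd_decision(x, X_c, a, width=1.0):
    # f(x) = sum_j a_j k(x, x_j)   (Eq. (8))
    return sum(a_j * gaussian_kernel(x, x_j, width) for a_j, x_j in zip(a, X_c))

# Tiny illustration with made-up 2-D features for the target class and two other classes.
rng = np.random.default_rng(0)
X_c = rng.normal(size=(5, 2))
y_c = np.ones(5)                       # teacher vector of the target class
others = [np.zeros(5), np.zeros(5)]    # teacher vectors of the other classes
a = knrd_coefficients(X_c, y_c, others)
print(knrd_decision(X_c[0], X_c, a))
```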
3 Adaptive Training of a KNRD
Adaptive training of a classifier is necessary when feature selection and sparse representation are considered. This kind of work has been done for KNR and KND classifiers [8], [9]. In this section, we design a similar algorithm for adaptive training of KNRD.

3.1 Adaptive Training with Incremental Learning
In neural network training, the process to adaptively adjust the trained result by a new sample is called incremental learning [1]. Many incremental learning methods have been devised so far [1], [2]. Some of them improve memory and computation efficiency at the cost of decreasing the generalization ability of batch learning. In our discussion, it is required that the incremental learning
result is exactly equal to that of batch learning including the novel sample, so that the same generalization ability is retained [4]. Although several new instances may become available in a later stage, in this paper we consider a relearning procedure that processes only one instance, i.e., one training feature vector per class.
Now we turn our focus to a variable number of training data. For clearness, subscripts are used to denote variation. For example, y_m^{(c)} denotes the actual output vector of Class c after m instances are trained, and a_{m+1}^{(c)} denotes the coefficient vector obtained after the (m+1)-th instance becomes available, etc.
For the KNRD of Class c, the objective is to express a_{m+1}^{(c)} by a_m^{(c)} and y_{m+1}^{(c)}, the desirable output of Class c at Stage (m+1), for c = 1, 2, ..., C. We use the following m-dimensional vector and scalar to describe the traits of the desirable outputs of all classes other than Class c:

q_{m+1} = \frac{1}{C-1} \sum_{i=1, i \neq c}^{C} y_m^{(i)} y_{m+1}^{(i)},   (12)

\sigma_{m+1} = \frac{1}{C-1} \sum_{i=1, i \neq c}^{C} y_{m+1}^{(i)} y_{m+1}^{(i)}.   (13)

Furthermore, we define the following m-dimensional vectors

s_{m+1} = [k(x_1^{(c)}, x_{m+1}^{(c)}), k(x_2^{(c)}, x_{m+1}^{(c)}), \dots, k(x_m^{(c)}, x_{m+1}^{(c)})]^T,   (14)

t_{m+1} = s_{m+1} + \lambda q_{m+1},   (15)

\tau_{m+1} = (U_m^{(c)})^{+} t_{m+1},   (16)

and scalars

\alpha_{m+1} = k(x_{m+1}^{(c)}, x_{m+1}^{(c)}) + \lambda \sigma_{m+1} - \langle \tau_{m+1}, t_{m+1} \rangle,   (17)

\beta_{m+1} = 1 + \langle \tau_{m+1}, \tau_{m+1} \rangle,   (18)

\gamma_{m+1} = (y_{m+1}^{(c)} - \langle y_m^{(c)}, \tau_{m+1} \rangle) / \beta_{m+1},   (19)

where \langle \cdot, \cdot \rangle denotes the inner product of C^m. Then we have the following lemmas, proofs of which are put into Appendices A and B, respectively.

Lemma 1. As for t_{m+1} in Eq.(15) and \alpha_{m+1} in Eq.(17), we have

t_{m+1} \in \mathcal{R}(U_m^{(c)}),   (20)

and

\alpha_{m+1} \geq 0.   (21)
Lemma 2. The operators (U_{m+1}^{(c)})^{+} and (U_m^{(c)})^{+} have the following relation:

(i) when \alpha_{m+1} = 0,

(U_{m+1}^{(c)})^{+} = \begin{bmatrix} T_{m+1}(U_m^{(c)})^{+} T_{m+1} & \frac{T_{m+1}(U_m^{(c)})^{+} \tau_{m+1}}{\beta_{m+1}} \\ \frac{(T_{m+1}(U_m^{(c)})^{+} \tau_{m+1})^T}{\beta_{m+1}} & \frac{\langle (U_m^{(c)})^{+} \tau_{m+1}, \tau_{m+1} \rangle}{\beta_{m+1}^2} \end{bmatrix},   (22)

where

T_{m+1} = I_m - \frac{\tau_{m+1} \otimes \tau_{m+1}}{\beta_{m+1}}.   (23)

(ii) when \alpha_{m+1} > 0,

(U_{m+1}^{(c)})^{+} = \begin{bmatrix} (U_m^{(c)})^{+} + \frac{\tau_{m+1} \otimes \tau_{m+1}}{\alpha_{m+1}} & \frac{-\tau_{m+1}}{\alpha_{m+1}} \\ \frac{-\tau_{m+1}^T}{\alpha_{m+1}} & \frac{1}{\alpha_{m+1}} \end{bmatrix}.   (24)
Lemma 2 shows that we can avoid directly calculating the Moore-Penrose pseudoinverse of U^{(c)}. In addition, Lemma 2 and Eq.(9) naturally lead us to the following theorem, proof of which is put into Appendix C.

Theorem 1. The coefficients of the KNRD of Class c (c = 1, 2, ..., C) can be adaptively trained as follows:

(i) when \alpha_{m+1} = 0,

a_{m+1}^{(c)} = \begin{bmatrix} a_m^{(c)} - \eta_{m+1}\tau_{m+1} + \gamma_{m+1}(U_m^{(c)})^{+} \tau_{m+1} \\ \eta_{m+1} \end{bmatrix},   (25)

where

\eta_{m+1} = \frac{\gamma_{m+1}\langle (U_m^{(c)})^{+} \tau_{m+1}, \tau_{m+1} \rangle + \langle \tau_{m+1}, a_m^{(c)} \rangle}{\beta_{m+1}}.   (26)

(ii) when \alpha_{m+1} > 0,

a_{m+1}^{(c)} = \begin{bmatrix} a_m^{(c)} - \frac{\gamma_{m+1}\beta_{m+1}}{\alpha_{m+1}} \tau_{m+1} \\ \frac{\gamma_{m+1}\beta_{m+1}}{\alpha_{m+1}} \end{bmatrix}.   (27)

3.2 Sparse Representation of KNRD
Theorem 1 shows that the effect of every training sample on a KNRD can be evaluated during adaptive training, and hence samples of little importance can be discarded one by one if necessary. Henceforth, we can obtain a technique for sparse representation of a KNRD; it is briefly discussed as follows.
For Class c, we adopt the following distance to evaluate the importance of the novel training feature vector x_{m+1}^{(c)} at Stage (m+1):

\delta_{m+1}^{(c)} = |f_{m+1}^{(c)}(x_{m+1}^{(c)}) - f_m^{(c)}(x_{m+1}^{(c)})|,   (28)

where |\cdot| denotes the absolute value of a number. If \delta_{m+1}^{(c)} is less than a predetermined threshold \varepsilon, a positive number trading generalization ability for sparseness, then x_{m+1}^{(c)} will be discarded. Theorem 1 and Eqs.(8) and (28) yield

\delta_{m+1}^{(c)} = \begin{cases} |\eta_{m+1}| \, |k_{m+1,m+1} + \sum_{j=1}^{m} \nu_{m+1}(j) k_{j,m+1}| & \text{if } \alpha_{m+1} = 0, \\ \left|\frac{\gamma_{m+1}\beta_{m+1}}{\alpha_{m+1}}\right| \, |k_{m+1,m+1} - \sum_{j=1}^{m} \tau_{m+1}(j) k_{j,m+1}| & \text{if } \alpha_{m+1} > 0, \end{cases}   (29)

where

k_{m+1,m+1} = k(x_{m+1}^{(c)}, x_{m+1}^{(c)}),   (30)

k_{j,m+1} = k(x_{m+1}^{(c)}, x_j^{(c)}),   (31)

\tau_{m+1}(j) is the j-th element of the vector \tau_{m+1}, and \nu_{m+1}(j) is that of the following vector:

\nu_{m+1} = \frac{\gamma_{m+1}}{\eta_{m+1}} (U_m^{(c)})^{+} \tau_{m+1} - \tau_{m+1}.   (32)

For the KNRD of Class c, the above adaptive training procedure and the training set reduction technique are summarized into an algorithm as follows.

Algorithm
1. Begin.
2. Decide on the reproducing kernel k(x, x') and the threshold \varepsilon for data reduction.
3. Initialize to zero: m = 0, a_0^{(c)} = 0, and (U_0^{(c)})^{+} = 0.
4. For the new training feature set \{x_{m+1}^{(i)}\}_{i=1}^{C}, decide on the corresponding desirable output values of the classifier, say y_{m+1}^{(c)} = 1 and y_{m+1}^{(i)} = 0 for i \neq c, and
   (a) Calculate the actual output vectors \{y_m^{(i)} = [f^{(i)}(x_1), \dots, f^{(i)}(x_m)]^T\}_{i=1}^{C} using a_m^{(c)} and Eq.(8), where M is substituted by m.
   (b) Calculate the vector q_{m+1} using Eq.(12) and the scalar \sigma_{m+1} using Eq.(13).
   (c) Calculate the vectors s_{m+1} and t_{m+1} using Eqs.(14) and (15), respectively.
   (d) Calculate the vector \tau_{m+1} using Eq.(16).
   (e) Calculate the scalars \alpha_{m+1}, \beta_{m+1}, and \gamma_{m+1} using Eqs.(17), (18), and (19), respectively.
   (f) Calculate the weight vector a_{m+1}^{(c)} using Eqs.(25) and (26), or Eq.(27).
   (g) Calculate \delta_{m+1}^{(c)} by Eq.(29). If \delta_{m+1}^{(c)} is less than \varepsilon, then discard the new data set.
   (h) If there is still a new training feature set, calculate the operator (U_{m+1}^{(c)})^{+} using Eqs.(22) and (23), or Eq.(24), substitute m by (m+1) and return to Step 4. Otherwise, let M = m + 1 and go to Step 5.
5. Output a_M^{(c)}.
6. End.

In the sequel, the feasibility of the above algorithm is demonstrated by experimental results on handwritten digit classification.
4 Experiments on Handwritten Digit Recognition
For convenience of comparison, we experiment with the dataset used by Jain et al. [10] and in our previous works [7], [8], [9]. It provides features of handwritten digits ("0"-"9") extracted from a collection of Dutch utility maps. For each digit class, there are two hundred patterns. The dataset contains six feature sets, respectively consisting of
• 76 Fourier coefficients of the character shape,
• 216 profile correlations,
• 64 Karhunen-Loeve coefficients,
• 240 pixel averages in 2 x 3 windows,
• 47 Zernike moments, and
• 6 morphological features.
We adopt the Gaussian kernel, with the kernel width, together with the value of \lambda, crudely estimated from experience [7]. The test set consists of the last one hundred feature vectors of each class and is fixed in the following experiments.
For comparison with the results of other classifiers in Ref.[10], for each feature set we first consider training size 10 x 50, i.e., ten classes and fifty patterns per class, which are randomly selected so that there is no intersection between the training set and the test set. Ten different runs are conducted and the averaged classification error rates, over classes and runs, are listed in the second row of Tab.1, in which the values printed in bold denote the best ones among the results of our method and of the methods conducted in Ref.[10]. In Ref.[10], twelve methods were applied to this dataset, and for the six feature sets the best results (error rates in percent) are given by the Parzen classifier (17.1), the linear Bayes normal classifier (3.4), the Parzen classifier (3.7), the 1-NN rule, the k-NN rule and the Parzen classifier (3.7), the linear Bayes normal classifier (18.0), and the linear Fisher discriminator (28.2), respectively. Our KNRD classifier performs the best for feature Sets 3 and 4, and nearly the best for feature Sets 1 and 5. That is, in comparison with the twelve classifiers conducted in [10] on the six feature sets, an adaptively trained KNRD performs almost the best.
The third row of Tab.1 lists the error rates of an experiment on the efficiency of the proposed sparse representation technique. In this experiment, the predetermined positive thresholds for the six feature sets are respectively estimated in a manner similar to that used for estimating the kernel widths. The listed number of remained feature vectors is an averaged value over the ten runs and the ten classes. The results in Tab.1 show that training set reduction is obtained at the cost of an increased classification error rate. For example, an increase of more than 4.0 points in error rate is paid for around 20.0% reduction of the training set (Set 3). Notice that the results in Tab.1 compare favorably to those of a quadratic support vector classifier (SVC), which gives error rates of 21.2, 5.1, 4.0, 6.0, 19.3, and 81.1, respectively, for the six feature sets, and these results are better than those of the linear SVC [10].

Table 1. Classification error rates of the adaptively trained KNRD (CER1) and the sparse KNRD (CER2), on training size 10 x 50, i.e., 500 feature vectors, where \lambda = 0.2

                           Set 1   Set 2   Set 3   Set 4   Set 5   Set 6
CER1                       17.3    7.8     3.6     3.7     18.6    69.1
CER2                       19.6    21.6    7.8     7.6     31.6    69.0
Threshold \varepsilon      0.3     0.1     0.2     0.4     0.1     0.1
Remained feature vectors   434     398     401     375     406     431
5 Conclusions
We designed an incremental learning procedure for adaptive training of KNRD, a previously proposed kernel-based nonlinear classifier for simultaneous representation and discrimination of pattern features. The procedure reduces the computational load of batch training because it avoids directly calculating the Moore-Penrose generalized inverse of a matrix, and it results in a sparse representation of KNRD. The validity of the presented methods was demonstrated by experimental results on handwritten digit classification.
References 1. Fu, L.M., Hsu, H.H., Principe, J.C.: Incremental Backpropagation Learning Network. IEEE Trans. Neural Networks 7 (1996) 757-761 2. Park, D.C., Sharkawi, A.E., Marks II, R.J.: An Adaptively Trained Neural Network. IEEE Trans. Neural Networks 2 (1991) 334-345 3. Liu, B.Y., Zhang, J.: Face Recognition Applying a Kernel-Based Representative and Discrimnative Nonlinear Classifier to Eigenspectra. Proc. IEEE Int. Conf. Communications, Circuits and Systems, HongKong 2 (2005) 964-968 4. Vijayakumar, S., Ogawa, H.: A Functional Analytic Approach to Incremental Learning in Optimally Generalizing Neural Networks. Proc. IEEE Int. Conf. Neural Networks, Perth, Western Australia 2 (1995) 777-782 5. Liu, B.Y., Zhang, J.: Eigenspectra Versus Eigenfaces: Classification with a KernelBased Nonlinear Representor. LNCS 3610 (2005) 660-663 6. Schatten, R.: Norm Ideals of Completely Continuous Operators. Springer, Berlin (1970) 7. Liu, B.Y.: A Kernel-based Nonlinear Discriminator with Closed-form Solution Discriminator. Proc. IEEE Int. Conf. Neural Network and Signal Processing, Nanjing, China 1 (2003) 41-44 8. Liu, B.Y., Zhang, J.: An Adaptively Trained Kernel-Based Nonlinear Representor for Handwritten Digit Classification. J. Electronics (China) 23 (2006) 379-383 9. Liu, B.Y.: Adaptive Training of a Kernel-Based Nonlinear Discriminator. Pattern Recognition 38 (2005) 2419-2425 10. Jain, A.K., Duin, R.P.W., Mao, J.C.: Statistical Pattern recognition: A Review. IEEE Trans. Pattern Analysis and Machine Intelligence 22 (2000) 4-37 11. Albert, A.: Conditions for Positive and Nonnegative Definiteness in Term of Pseudoinverses. SIAM J. Appl. Math. 17 (1969) 434-440
Appendix A: Proof of Lemma 1

From Eqs.(5), (10), and (11) we know that U_{m+1}^{(c)} is positive semi-definite, that is, U_{m+1}^{(c)} \geq 0. Furthermore, Eqs.(5) and (10)-(15) yield

U_{m+1}^{(c)} = \begin{bmatrix} U_m^{(c)} & t_{m+1} \\ t_{m+1}^T & \mu_{m+1} \end{bmatrix},   (33)

where

\mu_{m+1} = k(x_{m+1}^{(c)}, x_{m+1}^{(c)}) + \lambda \sigma_{m+1}.   (34)

From Theorems 1 and 2 of Ref.[11] we know that

U_m^{(c)} (U_m^{(c)})^{+} t_{m+1} = t_{m+1},   (35)

\mu_{m+1} \geq \langle (U_m^{(c)})^{+} t_{m+1}, t_{m+1} \rangle.   (36)

Eq.(35) is equivalent to Eq.(20) because U_m^{(c)}(U_m^{(c)})^{+} = P_{\mathcal{R}(U_m^{(c)})}, the orthogonal projection operator onto the range of U_m^{(c)}, and Eq.(36) is equivalent to Eq.(21) because of Eqs.(34), (16), and (17).
Appendix B: Proof of Lemma 2

In order to prove Lemma 2, Theorem 3 of Ref.[11] is restated in the following proposition.

Proposition A. Suppose U_m^{(c)} \geq 0 and

U_{m+1}^{(c)} = \begin{bmatrix} U_m^{(c)} & t_{m+1} \\ t_{m+1}^T & \mu_{m+1} \end{bmatrix}.   (37)

Let \tau_{m+1} = (U_m^{(c)})^{+} t_{m+1}, \alpha_{m+1} = \mu_{m+1} - t_{m+1}^T (U_m^{(c)})^{+} t_{m+1}, \beta_{m+1} = 1 + \|\tau_{m+1}\|^2, and T_{m+1} = I_m - \tau_{m+1}\tau_{m+1}^T / \beta_{m+1}. Then
(i) U_{m+1}^{(c)} \geq 0 if and only if U_m^{(c)} \tau_{m+1} = t_{m+1} and \alpha_{m+1} \geq 0.
(ii) In this case,

(U_{m+1}^{(c)})^{+} = \begin{cases} \begin{bmatrix} (U_m^{(c)})^{+} + \frac{\tau_{m+1}\tau_{m+1}^T}{\alpha_{m+1}} & -\frac{\tau_{m+1}}{\alpha_{m+1}} \\ -\frac{\tau_{m+1}^T}{\alpha_{m+1}} & \frac{1}{\alpha_{m+1}} \end{bmatrix} & \text{if } \alpha_{m+1} > 0, \\ \begin{bmatrix} T_{m+1}(U_m^{(c)})^{+} T_{m+1} & \frac{T_{m+1}(U_m^{(c)})^{+} \tau_{m+1}}{\beta_{m+1}} \\ \frac{(T_{m+1}(U_m^{(c)})^{+} \tau_{m+1})^T}{\beta_{m+1}} & \frac{\tau_{m+1}^T (U_m^{(c)})^{+} \tau_{m+1}}{\beta_{m+1}^2} \end{bmatrix} & \text{if } \alpha_{m+1} = 0. \end{cases}   (38)
Proof of Lemma 2. Since t_{m+1}^T (U_m^{(c)})^{+} t_{m+1} = t_{m+1}^T \tau_{m+1} = \langle t_{m+1}, \tau_{m+1} \rangle, \tau_{m+1}\tau_{m+1}^T = \tau_{m+1} \otimes \tau_{m+1}, \|\tau_{m+1}\|^2 = \langle \tau_{m+1}, \tau_{m+1} \rangle, and \tau_{m+1}^T (U_m^{(c)})^{+} \tau_{m+1} = \langle (U_m^{(c)})^{+} \tau_{m+1}, \tau_{m+1} \rangle, from Eqs.(5), (10), and (11) we know that U_m^{(c)} \geq 0. Furthermore, Eqs.(16) and (35) yield U_m^{(c)} \tau_{m+1} = t_{m+1}. Finally, Lemma 1 shows that \alpha_{m+1} \geq 0. These conditions, Eq.(33), and Proposition A lead us to Lemma 2.
Appendix C: Proof of Theorem 1
Notice that y_{m+1}^T = [y_m^T, y_{m+1}^{(c)}]; hence Eq.(9) and Lemma 2 directly yield Theorem 1.
Indirect Training of Grey-Box Models: Application to a Bioprocess Francisco Cruz, Gonzalo Acuña, Francisco Cubillos, Vicente Moreno, and Danilo Bassi Facultad de Ingeniería, Universidad de Santiago de Chile, USACH Av. Libertador Bernandor O’Higgins 3363, Santiago, Chile
[email protected],
[email protected]
Abstract. Grey-box neural models mix differential equations, which act as white boxes, and neural networks, used as black boxes. The purpose of the present work is to show the training of a grey-box model by means of indirect backpropagation and Levenberg-Marquardt in Matlab®, extending the black box neural model in order to fit the discretized equations of the phenomenological model. The obtained grey-box model is tested as an estimator of a state variable of a biotechnological batch fermentation process on solid substrate, with good results.
1 Introduction

The determination of relevant variables or parameters to improve a complex process is a demanding and difficult task. This gives rise to the need to estimate the variables that cannot be measured directly, which in turn requires a software sensor to estimate those variables that cannot be measured on line [1]. An additional problem arises when the model has parameters that vary in time, because a strategy must be applied to identify such parameters on line and in real time [2]. A methodology used in these cases, especially in the field of chemical and biotechnological processes, is that of the so-called grey-box models [3]. These are models that include a limited phenomenological model complemented with parameters obtained by means of neural networks. The learning or training strategies used so far for grey-box neural models assume the existence of data for the parameters obtained by the neural model [4], but most of the time this is not possible. This paper proposes a training process that does not use learning data for the neural network part, instead backpropagating through the phenomenological model the error at its output, as will be detailed below. The creation of the proposed model, the training and the simulations were all carried out using the Matlab development tool.

2 Grey-Box Models

Grey-box neural models are used for systems in which there is some a priori knowledge, i.e., some physical laws are known, but some parameters must be determined from the observed data.
2 Grey-Box Models Grey-box neural models are used for systems in which there is some a priori knowledge, i.e., some physical laws are known, but some parameters must be determined from the observed data. D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 391–397, 2007. © Springer-Verlag Berlin Heidelberg 2007
392
F. Cruz et al.
Acuña et al. [5] distinguish between two methods of training. The first one corresponds to direct training (Fig. 1(a)), which uses the error originated at the output of the neural network for the correct determination of their weights. The second method is indirect training (Fig. 1(b)), which uses the error originated at the model's output for the purpose of learning by the neural network. Indirect training can be carried out in two ways, one by minimizing an objective function by means of a nonlinear optimization technique, and the other is by backpropagating the output error over the weights of the neural network taking into account the discretized equations of the phenomenological model.
Fig. 1. (a) Grey-box model with direct training. (b) Grey-box model with indirect training.
In this paper the second indirect training method is used, calculating the error at the output of the phenomenological model and backpropagating it from there to the model's neural (black-box) part. The backpropagation process considers a network with m inputs and p outputs, n neurons in an intermediate layer, and d data for training. The computed gradients, depending on the activation and transfer functions used, are shown in Table 1 and Table 2, respectively, where w_{ij}^k is the weight of the connection from neuron i to neuron j in layer k, A_i^k is the activation value of neuron i of layer k, and Z_i^k is the transfer value of neuron i from layer k (the output of neuron i).

Table 1. Gradients depending on the activation functions used in the neurons

                               Sum              Product
\partial A_c^{k+1} / \partial Z_j^k    w_{cj}^{k+1}     (\prod_q w_{cq}^{k+1}) (\prod_{q \neq j} Z_q^k)
\partial A_j^k / \partial w_{ij}^k     Z_i^{k-1}        (\prod_{q \neq i} w_{qj}^k) (\prod_q Z_q^{k-1})

Table 2. Gradients depending on the transfer functions used in the neurons

                               sigmoid              tanh               inverse            identity
\partial Z_j^k / \partial A_j^k    Z_j^k (1 - Z_j^k)    1 - (Z_j^k)^2      -Z_j^k Z_j^k       1
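A small numpy sketch of the gradient expressions in Tables 1 and 2, written per neuron (a simplification of the tables' layer indices); the function names and the nonzero-input assumption for the product node are my own.

```python
import numpy as np

def transfer_grad(kind, z):
    """dZ/dA expressed through the neuron output Z (Table 2)."""
    if kind == "sigmoid":
        return z * (1.0 - z)
    if kind == "tanh":
        return 1.0 - z ** 2
    if kind == "inverse":
        return -z * z
    return 1.0                                   # identity

def sum_activation_grads(w, z):
    """A = sum_q w_q z_q : dA/dz_q = w_q and dA/dw_q = z_q (Table 1, 'Sum' column)."""
    return w, z

def product_activation_grads(w, z):
    """A = prod_q (w_q z_q) : dA/dz_q and dA/dw_q (Table 1, 'Product' column).

    Inputs are assumed nonzero so the simple ratio form can be used.
    """
    full = np.prod(w * z)
    return full / z, full / w

w = np.array([0.5, -1.2, 2.0])
z = np.array([0.3, 0.8, 1.5])
print(sum_activation_grads(w, z))
print(product_activation_grads(w, z))
print(transfer_grad("tanh", 0.7))
```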
3 Biotechnological Process

In this paper a grey-box neural model is proposed for the simulation of a batch fermentation bioprocess on a solid substrate, corresponding to the production of gibberellic acid from the filamentous fungus Gibberella fujikuroi. A simplified model describes the evolution of the main variables [6]. This phenomenological model, based on mass conservation laws, considers 8 state variables: active biomass (X), measured biomass (Xmeasu), urea (U), intermediate nitrogen (NI), starch (S), gibberellic acid (GA3), carbon dioxide (CO2) and oxygen (O2). Only the last two variables can be measured directly on line. The model's equations, discretized by Euler's method and considering discrete times t and t+1, are the following:
X_{measu}(t+1) = X_{measu}(t) + \mu X(t) \Delta t,   (1)

X(t+1) = X(t) + (\mu X(t) - k_d X(t)) \Delta t,   (2)

U(t+1) = U(t) + (-k) \Delta t,   (3)

N_I(t+1) = \begin{cases} N_I(t) + \left(0.47 k - \mu \frac{X(t)}{Y_{X/N_I}}\right) \Delta t, & \text{if } U \geq 0, \\ N_I(t) + \left(-\mu \frac{X(t)}{Y_{X/N_I}}\right) \Delta t, \; U(t) = 0, & \text{if } U < 0, \end{cases}   (4)

S(t+1) = S(t) + \left(-\frac{\mu X(t)}{Y_{X/S}} - m_s X(t)\right) \Delta t,   (5)

GA3(t+1) = GA3(t) + (\beta X(t) - k_p GA3(t)) \Delta t,   (6)

CO2(t+1) = CO2(t) + \left(\mu \frac{X(t)}{Y_{X/CO2}} + m_{CO2} X(t)\right) \Delta t,   (7)

O2(t+1) = O2(t) + \left(\mu \frac{X(t)}{Y_{X/O2}} + m_{O2} X(t)\right) \Delta t.   (8)
The measured outputs are the following:

y_1 = CO2(t+1),   (9)

y_2 = O2(t+1).   (10)
On the other hand, the parameters that are difficult to obtain and that will be estimated by the model's neural part are \mu and \beta, corresponding to the specific growth rate and the specific production rate of gibberellic acid, respectively. The remaining parameters were identified on the basis of specific practices and experimental conditions. Their values under controlled temperature and water activity conditions (T = 25 C, Aw = 0.992) can be found in [6].
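The discretized model of Eqs.(1)-(8) can be stepped as below once \mu and \beta are supplied (in the grey-box model they come from the neural part). The yield coefficients and rate constants here are placeholders, since the paper takes their values from Ref. [6], and the switching condition on urea is interpreted loosely.

```python
import numpy as np

# Yield and rate constants are placeholders; the paper takes their values from Ref. [6].
P = dict(kd=0.01, k=0.001, Yxni=5.0, Yxs=0.5, ms=0.01, kp=0.01,
         Yxco2=1.0, mco2=0.01, Yxo2=1.0, mo2=0.01)

def euler_step(state, mu, beta, dt, p=P):
    """One Euler step of Eqs. (1)-(8); state = [Xmeasu, X, U, NI, S, GA3, CO2, O2]."""
    Xm, X, U, NI, S, GA3, CO2, O2 = state
    Xm_n  = Xm + mu * X * dt                                   # (1)
    X_n   = X + (mu * X - p["kd"] * X) * dt                    # (2)
    U_n   = U - p["k"] * dt                                    # (3)
    if U >= 0:                                                 # (4), urea still available
        NI_n = NI + (0.47 * p["k"] - mu * X / p["Yxni"]) * dt
    else:                                                      # (4), urea exhausted, clamp U at 0
        NI_n, U_n = NI - mu * X / p["Yxni"] * dt, 0.0
    S_n   = S + (-mu * X / p["Yxs"] - p["ms"] * X) * dt        # (5)
    GA3_n = GA3 + (beta * X - p["kp"] * GA3) * dt              # (6)
    CO2_n = CO2 + (mu * X / p["Yxco2"] + p["mco2"] * X) * dt   # (7)
    O2_n  = O2 + (mu * X / p["Yxo2"] + p["mo2"] * X) * dt      # (8)
    return np.array([Xm_n, X_n, U_n, NI_n, S_n, GA3_n, CO2_n, O2_n])

x0 = np.array([0.0, 0.0040, 0.0040, 0.5e-4, 0.0040, 0.0, 0.0, 0.0])
print(euler_step(x0, mu=0.1, beta=0.01, dt=1.0))
```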
4 Proposed Solution

The proposed solution is a grey-box neural model whose phenomenological part can be described jointly with its black-box part by means of an extended neural network containing both the discretized equations of the phenomenological model and the time-varying parameters modeled by the black-box part (Fig. 2). This hybrid neural network has the capacity to fix weights in the training phase, so that it can act as a grey-box model. The weights in Fig. 2 that have a fixed value correspond to the model's phenomenological part. The weights for which no value is given correspond to the model's neural part. These weights were initially assigned pseudo-random values obtained by the initialization method of Nguyen & Widrow [7].
In Fig. 2, one of the weights corresponding to the white-box or phenomenological part is drawn as a dotted line. This line represents the switching phenomenon seen in the fourth state variable (NI) in the mathematical model, i.e., if the urea (U) is greater than or equal to zero, this weight has the indicated value; otherwise, if urea (U) is less than zero, this weight has a value of zero.
Therefore, the multilayer perceptron inserted in the model estimates the values of the two parameters that are difficult to obtain, which are in turn combined with the phenomenological part of the model to obtain its output. For the black-box neural part, the hyperbolic tangent was used as transfer function in the intermediate layer and the identity function in the output layer, while for the phenomenological part the identity function was used as transfer function. The activation function most commonly used was the sum of the inputs, except for the two neurons immediately after the output of the black-box neural part, for which a product was used as activation function in order to follow the discretized phenomenological equations.
The training algorithm used corresponds to backpropagation with a Levenberg-Marquardt optimization method. As already stated, the algorithm has the capacity to modify only the weights that are indicated, therefore leaving a group of fixed weights which represent the model's phenomenological part in the training phase.

Fig. 2. Grey-box model for the solid substrate fermentation process. Fixed weights represent the discretized phenomenological model. The black-box part that models the unknown time-varying parameters \mu and \beta has variable weights. The dotted line represents a switch on the model of the state variable (NI).

For the validation of the proposed grey-box neural model, quality indexes such as IA (Index of Agreement), RMS (Root Mean Square) and RSD (Relative Standard Deviation) are calculated, and the values considered acceptable for these indexes are IA > 0.9, RMS < 0.1 and RSD < 0.1. The quality index equations are the following:
IA = 1 - \frac{\sum_{i=1}^{N}(o_i - p_i)^2}{\sum_{i=1}^{N}(|o_i'| + |p_i'|)^2}, \quad RMS = \sqrt{\frac{\sum_{i=1}^{N}(o_i - p_i)^2}{\sum_{i=1}^{N} o_i^2}}, \quad RSD = \sqrt{\frac{\sum_{i=1}^{N}(o_i - p_i)^2}{N}},   (11)
where oi and pi are the observed and predicted values, respectively, at time i, and N is the total number of data. Then, pi’=pi-om and oi’=oi-om, where om is the mean value of the observations.
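A small helper computing the three indexes; the square roots in RMS and RSD follow the index names, since the radical signs did not survive extraction, and the sample data are made up.

```python
import numpy as np

def quality_indexes(o, p):
    """IA, RMS and RSD as in Eq. (11); o = observed, p = predicted."""
    o, p = np.asarray(o, float), np.asarray(p, float)
    om = o.mean()
    ia  = 1.0 - np.sum((o - p) ** 2) / np.sum((np.abs(p - om) + np.abs(o - om)) ** 2)
    rms = np.sqrt(np.sum((o - p) ** 2) / np.sum(o ** 2))
    rsd = np.sqrt(np.sum((o - p) ** 2) / o.size)
    return ia, rms, rsd

obs = np.array([0.010, 0.015, 0.020, 0.026])
est = np.array([0.011, 0.014, 0.021, 0.024])
print(quality_indexes(obs, est))
```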
5 Simulation and Results

For the simulation, tests were carried out for data with 5% error and an erroneous initial biomass value. The simulation was made using one thousand examples. The initial conditions under normal operation were the following:

X(0) = [0, 0.0040, 0.0040, 0.5 x 10^{-4}, 0.0040, 0, 0, 0].
Case 1: Simulation with 5% noise. The first case evaluated corresponds to the simulation with 5% noise in the input data. Fig. 3(a) shows the real and estimated biomass. In this case the quality indexes obtained were the following: IA = 0.99, RMS = 0.5E-1 and RSD = 0.8E-3.
Case 2: Simulation with noise in the data and an erroneous initial condition in the biomass. The second case evaluated corresponds to the simulation with 5% noise in the data in addition to an erroneous initial biomass condition with an error of 250%, to verify the convergence of the model acting as a software sensor of biomass. Fig. 3(b) shows the model's response, where the real and the estimated biomass are seen. The quality indexes were IA = 0.93, RMS = 0.24 and RSD = 0.41E-2. In this case the RMS index does not fulfill the acceptable condition (RMS < 0.1), but this is due to the nature of the introduced error, because the model takes about 300 iterations to fit the estimated curve to the real value. This simulation, however, shows the convergence of the software sensor based on the grey-box model, so it is considered acceptable.

Fig. 3. (a) Real and estimated biomass for Case 1. (b) Real and estimated biomass for Case 2.
6 Conclusions

Grey-box neural models are a real alternative for modeling real-world processes. They have advantages over black-box models because they are supported by the a priori knowledge available on the process. The model proposed in this paper combines the phenomenological equations of the process with a multilayer perceptron neural network that estimates the unknown time-varying parameters, within an extended neural network used for carrying out the backpropagation process. The use of fixed and variable weights and new activation functions was needed in order to fit the discretized phenomenological model, thus slightly changing the standard backpropagation method usually found in available neural network software. Good results were shown when using the trained model as a software sensor for estimating biomass concentration in a biotechnological process.
Acknowledgement The authors gratefully acknowledge the partial financial support of this work by Fondecyt under Project 1040208.
References 1. James, S., Legge, R., Budman, H.: Comparative Study of Black-box and Hybrid Estimation Methods in Fed-batch Fermentation. Journal of Process Control 12 (2000) 113-121 2. Hensen, R., Angelis, G., van de Molengraft, M., de Jager, A., Kok, J.: Grey-box Modeling of Friction: An Experimental Case-study. European Journal of Control 6 (2000) 258-267 3. Chen, L., Bernard, O., Bastin, G., Angelov, P.: Hybrid Modelling of Biotechnological Processes Using Neural Networks. Control Engineering Practice 8 (2000) 821-827 4. Aguiar, H., Filho, R.: Neural Network and Hybrid Model: A Discussion about Different Modeling Techniques to Predict Pulping Degree with Industrial Data. Chemical Engineering Science 56 (2001) 565-570 5. Acuña, G., Cubillos, F., Thibault, J., Latrille, E.: Comparison of Methods for Training Grey-box Neural Network Models. Computers and Chemicals Engineering Supplement (1999) 561-564 6. Gelmi, C., Pérez-Correa, R., Agosin, E.: Modelling Gibberella fujikuroi growth and GA3 production in solid-state fermentation. Process Biochemistry 37 (2002) 1033-1040 7. Nguyen, D., Widrow, B.: Improving the Learning Speed of 2-Layer Neural Networks by Choosing Initial Values of the Adaptive Weights. Proceedings IEEE IJCNN, San Diego (1990) 21-26
FNN (Feedforward Neural Network) Training Method Based on Robust Recursive Least Square Method
JunSeok Lim (1) and KoengMo Sung (2)
(1) Department of Electronics Engineering, Sejong University, 98 Kwangjin Kunja, Seoul, Korea, 143-747, [email protected]
(2) School of Electrical Engineering and Computer Science, Seoul National University
Abstract. We present a robust recursive least squares algorithm for multilayer feed-forward neural network training. So far, recursive least squares (RLS) has been successfully applied to training multilayer feedforward neural networks. However, the RLS method has a tendency to diverge due to instability in the recursive inversion procedure. In this paper, we propose a numerically robust recursive least squares type algorithm using prewhitening. The proposed algorithm improves the performance of RLS in infinite numerical precision as well as in finite numerical precision. Computer simulation results for various precision cases show that the proposed algorithm improves the numerical robustness of RLS training.
1 Introduction
In an artificial neural network, the most fundamental property is the ability of the network to learn from its environment and to improve its performance through learning. Multilayer feedforward neural networks (FNNs) have attracted a great deal of interest due to their rapid training, generality and simplicity. In the past decade, the use of the recursive least squares (RLS) algorithm for training FNNs has been investigated extensively [1], [2]. In the RLS problem, the underlying assumption is that the input vector is exactly known and all the errors are confined to the observation vector. However, due to its lack of numerical robustness, RLS causes trouble in real implementations with finite precision. The source of these numerical instabilities in RLS schemes is the update of the inverse of the exponentially windowed input signal autocorrelation matrix,

R_{xx}(n) = \lambda^n \delta I + \sum_{k=1}^{n} \lambda^{n-k} x(k) x^T(k),   (1)

where x(k) = [x_1(k), x_2(k), \dots, x_N(k)]^T, k \geq 1, is the input signal vector sequence, N is the number of system parameters, \delta is a small positive constant, and \lambda, 0 < \lambda < 1, is the forgetting factor. Several authors have shown that
the primary source of numerical instability is the loss of symmetry in the numerical representation of R_{xx}(n) [3]. This fact has spurred the development of alternative QR-decomposition-based least-squares (QRD-LS) methods [3] that employ the QR decomposition of the weighted data matrix

X^T(n) = [(\lambda^n \delta)^{1/2} I, \lambda^{(n-1)/2} x(1), \lambda^{(n-2)/2} x(2), \cdots, x(n)]   (2)

to calculate an upper-triangular square-root factor or Cholesky factor R(n) of R_{xx}(n) = R^T(n) R(n) = X^T(n) X(n). QRD-LS methods possess good numerical properties [3]. Even so, QRD-LS methods are less popular because they require N divide and N square-root operations at each time step. These calculations are difficult to perform on typical DSP hardware. In [4], a new RLS algorithm has been proposed. It is similar to a recently developed natural-gradient prewhitening procedure [5], [6]. In this paper, we propose a prewhitening-based robust recursive least squares (RRTLS) algorithm for training multilayer feed-forward neural networks and show that it improves the estimation performance in both infinite precision and finite precision.
2 Training of Multilayer Feedforward Neural Networks

Fig. 1 shows the layout of a multilayer feedforward neural network with a single hidden layer.
x1
2
1
r ( n) y1
x2
3
2
y (n)
Unknown System
x (n)
+
y2 x3
y (n)
4
x( n) Input layer of source nodes
Layer of hidden neurons
Layer of output neurons
+
Neural Network
-
+
e( n )
(a) FNN with one hidden layer
(b) Equivalent model with input and output noises
Fig. 1. Training of Neural Networks
An FNN is trained by adjusting its weights according to an ongoing stream of input-output observations {x_k(n), y_k(n) : n = 1, ..., N; k = 1, ..., p}, where p denotes the number of training sample sequences. The objective is to obtain a set of weights such that the neural network will predict future outputs accurately. This training process is the same as a parameter estimation process. Thus, training an FNN can be considered as a nonlinear identification problem where the weight values are unknown and need to be identified for the given set of input-output vectors.
Considering the training of neural networks in Fig. 1(a), the training problem can be regarded as parameter identification, and the dynamic equation for the neural network can be expressed as follows:

y(n) = f[w(n), x(n)] + \varepsilon(n),   (3)

where w(n) is the parameter of the FNN at time step n, y(n) is the observed output vector of the network, and \varepsilon(n) is white Gaussian noise with zero mean. In many cases, it is assumed that the input is clean and only the output has noise. However, in many cases the measured input also has noise, originating from quantizers, etc. Therefore, the model in Fig. 1(b), which contains both input and output measurement noise, is more practical, and a robust training algorithm is necessary to handle noisy hidden-layer inputs and noisy outputs of the neural network in Fig. 1(a).
3 Robust Recursive Least-Squares with Prewhitening
In this section, we apply the prewhitened recursive least-squares procedure to the FNN training given by

J(w(t)) = \sum_{i=1}^{t} \lambda^{t-i} \left| d(i) - h^H(i) w(t) \right|^2,   (4)

where h^H(k) = [\varphi(w_1 z_1 + b_1), \cdots, \varphi(w_{\tilde{N}} z_N + b_{\tilde{N}})]. This is the exponentially weighted least squares cost function, which is well studied in adaptive filtering [1]. J(w(t)) is minimized by w(t) = C_{hh}^{-1}(t) c_{hd}(t), where

C_{hh}(t) = \sum_{i=1}^{t} \lambda^{t-i} h(i) h^H(i), \qquad c_{hd}(t) = \sum_{i=1}^{t} \lambda^{t-i} d(i) h(i).   (5)

The estimation error is

e(n) = d(n) - h^H(n-1) w(n).   (6)

The conventional RLS algorithm in [1] for adjusting w(n) is expressed as

w(n) = w(n-1) + e(n) k(n),   (7)

k(n) = \frac{C_{yy}^{-1}(n-1) h(n)}{\lambda + h^H(n) C_{yy}^{-1}(n-1) h(n)}.   (8)

We can replace C_{yy}^{-1}(n-1) in (8) with P^T(n-1) P(n-1) to obtain

k(n) = \frac{P^T(n-1) P(n-1) h(n)}{\lambda + h^T(n) P^T(n-1) P(n-1) h(n)}.   (9)
Applying the least-squares prewhitening algorithm in [5] yields

P^T(n) P(n) = C_{hh}^{-1}(n).   (10)

Consider the update for C_{hh}^{-1}(n) in the conventional RLS algorithm, given by

C_{hh}^{-1}(n) = \frac{1}{\lambda} \left[ C_{hh}^{-1}(n-1) - \frac{C_{hh}^{-1}(n-1) h(n) h^T(n) C_{hh}^{-1}(n-1)}{\lambda + h^T(n) C_{hh}^{-1}(n-1) h(n)} \right].   (11)

Substituting P^T(n) P(n) for C_{hh}^{-1}(n) in (11) yields

P^T(n) P(n) = \frac{1}{\sqrt{\lambda}} P^T(n-1) \left[ I - \frac{v(n) v^T(n)}{\lambda + \|v(n)\|^2} \right] P(n-1) \frac{1}{\sqrt{\lambda}},   (12)

where the prewhitened signal vector v(n) is

v(n) = P(n-1) h(n).   (13)

The vector v(n) enjoys the following property: if h(n) is a wide-sense stationary sequence, then

\lim_{n \to \infty} E\{v(n) v^T(n)\} \cong (1 - \lambda) I.   (14)

That is, as P(n) converges, the elements of v(n) are approximately uncorrelated with variance (1 - \lambda). We decompose the matrix in large brackets on the RHS of (12) as

B^T(n) B(n) = I - \frac{v(n) v^T(n)}{\lambda + \|v(n)\|^2}.   (15)

Then P(n-1) can be updated using

P(n) = \frac{1}{\sqrt{\lambda}} B(n) P(n-1).   (16)

The matrix on the RHS of (15) has one eigenvalue equal to \lambda/(\lambda + \|v(n)\|^2) in the direction of v(n) and N-1 unity eigenvalues in the null space of v(n), or

B^T(n) B(n) = I - \frac{v(n) v^T(n)}{\|v(n)\|^2} + \frac{\lambda}{\lambda + \|v(n)\|^2} \cdot \frac{v(n) v^T(n)}{\|v(n)\|^2}.   (17)

Due to the symmetry and orthogonality of the two terms in (17), a symmetric square-root factor of B^T(n) B(n) is

B(n) = I - \frac{v(n) v^T(n)}{\|v(n)\|^2} + \sqrt{\frac{\lambda}{\lambda + \|v(n)\|^2}} \cdot \frac{v(n) v^T(n)}{\|v(n)\|^2}.   (18)
Substituting (18) into (16) yields

P(n) = \frac{1}{\sqrt{\lambda}} \left[ P(n-1) - \varsigma(n) v(n) u^T(n) \right],   (19)

where

u(n) = P^T(n-1) v(n),   (20)

\varsigma(n) = \frac{1}{\|v(n)\|^2} \left[ 1 - \sqrt{\frac{\lambda}{\lambda + \|v(n)\|^2}} \right].   (21)

Equations (8), (13) and (20) represent the new adaptive square-root factorization algorithm, with the gain

k(n) = \frac{u(n)}{\lambda + \|v(n)\|^2}.   (22)

Therefore, combining (22) with (6) and (7), we obtain a new numerically robust RLS training algorithm. The algorithm's complexity is similar to that of the conventional RLS training algorithm; however, the former algorithm's numerical properties are much improved.

Fig. 2. (a) FNN model. (b) Learning curves of RLS and the proposed algorithm (RLS: dotted; proposed algorithm: solid).
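One update of the resulting training rule can be sketched as below, combining Eqs.(6), (7), (13) and (19)-(22); the toy linear identification problem stands in for the FNN hidden-layer outputs h(n), and the initialization of P is an assumption.

```python
import numpy as np

def robust_rls_step(w, P, h, d, lam=0.99):
    """One update of the prewhitening-based robust RLS (Eqs. (6)-(7), (13), (19)-(22))."""
    v = P @ h                                       # prewhitened regressor, Eq. (13)
    nv2 = v @ v
    u = P.T @ v                                     # Eq. (20)
    k = u / (lam + nv2)                             # gain, Eq. (22)
    e = d - h @ w                                   # a priori error, cf. Eq. (6)
    w_new = w + e * k                               # Eq. (7)
    zeta = (1.0 - np.sqrt(lam / (lam + nv2))) / nv2 # Eq. (21)
    P_new = (P - zeta * np.outer(v, u)) / np.sqrt(lam)  # Eq. (19)
    return w_new, P_new

# Toy identification of a 3-tap linear system; h(n) plays the role of the hidden-layer outputs.
rng = np.random.default_rng(1)
w_true = np.array([0.5, -0.2, 0.1])
w, P = np.zeros(3), np.eye(3) * 10.0                # P(0) proportional to delta^{-1/2} I (assumption)
for _ in range(500):
    h = rng.normal(size=3)
    d = h @ w_true + 1e-3 * rng.normal()
    w, P = robust_rls_step(w, P, h, d)
print(w)
```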
4 Simulation
Figure 2(a) presents the FNN model for identifying the unknown system. It has one hidden layer and one output layer, with 1 source node, 25 hidden neurons, and 1 output neuron. The bias of the hidden layer ranges from -6.0 to 6.0, which is the dynamic range of the input data x. In this simulation, we assume that the relationship between input x and output y of the unknown system is

y = \sin(3x), \quad -6 \leq x \leq 6.   (23)

Fig. 3. Comparison of learning curves in the 32-bit, 30-bit, 28-bit, 26-bit, 25-bit and 24-bit precision cases (RLS: dotted; proposed algorithm: solid).
In this FNN model, the output of the first hidden layer is

\varphi = [\varphi_1(x - 6.0), \varphi_2(x - 5.5), \cdots, \varphi_{25}(x + 6.0)]^T,   (24)

where \varphi_i = 1/(1 + e^{-(b_i + x)}) and b_i is the i-th bias in Fig. 2(a). The output of the total FNN model is \hat{y} = w^H \varphi, where

w = [w_1, w_2, \cdots, w_{25}]^T.   (25)
We can apply the proposed algorithm to (25). In the first simulation, we assume that the numerical resolution is infinite and that the SNR is as high as 20 dB. Fig. 2 shows that the proposed algorithm outperforms the conventional RLS-based training algorithm. In the second simulation, we assume that noise of 10^{-4} exists in the input and output. Fig. 3 shows the results for finite precision cases from 24 bits to 32 bits, comparing normal RLS and the proposed algorithm. As observed from Fig. 3, the performance of the proposed algorithm remains almost the same, with much less dependency on finite precision, while the performance of the RLS-based algorithm degrades severely at shorter precision. In the RLS-based training, performance begins to degrade from 26-bit precision. On the contrary, the proposed algorithm continues to show good performance even at shorter numerical precision. From the above results, it is clear that the proposed algorithm is robust enough to prevent the instability effect seen in the conventional RLS-based training algorithm.
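The hidden-layer feature map of Eqs.(24)-(25) can be reproduced as below; a batch least-squares fit is used here only to illustrate the structure, whereas the paper adapts the output weights w online with the proposed robust RLS.

```python
import numpy as np

def hidden_layer(x, biases):
    # phi_i(x) = 1 / (1 + exp(-(b_i + x))), Eq. (24); 25 units with biases from -6.0 to 6.0.
    return 1.0 / (1.0 + np.exp(-(biases + x)))

biases = np.linspace(-6.0, 6.0, 25)
rng = np.random.default_rng(0)
xs = rng.uniform(-6.0, 6.0, size=300)
ys = np.sin(3.0 * xs)                          # unknown system, Eq. (23)

# Batch least-squares fit of w in y_hat = w^T phi(x), Eq. (25), for illustration only.
Phi = np.array([hidden_layer(x, biases) for x in xs])
w, *_ = np.linalg.lstsq(Phi, ys, rcond=None)
print(np.mean((Phi @ w - ys) ** 2))
```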
5 Conclusions
In this paper, a robust recursive least squares algorithm with prewhitening was presented for neural network training. This algorithm was found to outperform the RLS algorithm, showing a much smaller instability effect. To validate its performance, we applied the algorithm to training a multilayer feedforward neural network under various numerical precisions and confirmed its robustness.
References 1. Wong, K.W., Leung, C.S.: On-line Successive Synthesis of Wavelet Networks. Neural Processing Letter 7 (1998) 91-100 2. Leung, C.S., Tsoi, A.C.: Two-regularizers for Recursive Least Squared Algorithms in Feedforward Multilayered Neural Networks. IEEE Trans. Neural Networks 12 (2001) 1314-1332
3. Haykin, S.: Adaptive Filter Theory, 3rd edn. Prentice-Hall, Upper Saddle River, NJ, (1996) 4. Douglas, S.C.: Numerically - Robust. O(N2 ) Recursive Least-Squares Estimation Using Least Squares Prewhitening. Proceeding of International Conference of Acoustics, Speech, and Signal Processing (ICASSP00), 1 (2000) 412-415 5. Dasilva, F.M., Almeida, L.B.: A Distributed Decorrelation Algorithm. In: Gelenbe, E.(eds.): Neural Networks: Advances and Applications, Elsevier Science, Amsterdam (1991) 145-163 6. Douglas, S.C., Cichocki, A.: Neural Networks for Blind Decorrelation of Signals. IEEE Trans. Signal Processing 45 (1997) 2829-2842
A Margin Maximization Training Algorithm for BP Network Kai Wang and Qingren Wang Institute of Machine Intelligence, Nankai University, 300071 Tianjin, China wangkai
[email protected]
Abstract. The generalization problem is a key problem in the neural network (NN) community; it can be grouped into two classes: the generalization problem with unlimited size of training sample and that with limited size of training sample. The generalization problem with limited size of training sample is considered in this paper. Similar to the margin maximization criterion in SVM, we propose a margin maximization training algorithm for BP networks to further improve their generalization ability. Experimental results show that the margin maximization training algorithm proposed in this paper does improve the performance of the BP network and shows performance comparable to SVM.
1 Introduction
The Back-Propagation (BP) algorithm is easy to implement and shows good performance in neural network (NN) training, so it has been studied extensively and widely used in practice. As is well known, the object of classifier design is "building the model with a given training sample, which can reflect the essential nature of the sample and make correct decisions for any independent (testing) sample", instead of "learning the training sample as precisely as possible for the sake of the same learning set". Therefore, even if the trained classifier can classify all patterns in the training set correctly, it cannot be ensured to classify an independent testing sample at optimal accuracy. The expected performance of the classifier on testing samples is called generalization ability, which is measured by the average accuracy on testing samples. The higher the accuracy is, the stronger the generalization ability of the classifier is. A classifier with weak generalization ability is much less useful than one with strong generalization ability. The problem of "how to estimate and improve generalization ability during classifier design" is called the generalization problem, which is a key problem in the NN community. Previous research on the generalization problem of NNs can be grouped into the following two classes:
1) The generalization problem with unlimited size of training sample. In early research in pattern recognition, the Bayesian classifier was found to be theoretically optimal. However, it is never easy to realize in practical recognition
A Margin Maximization Training Algorithm for BP Network
407
problems. In most cases, the optimal result can only be achieved “asymptotically in probability”. 2) The generalization problem with limited size of training sample During the classifier design, the number of training patterns is limited. How to design the classifier to achieve strong generalization ability is the major problem met in applications. A large number of research works have been conducted in an attempt to resolve the generalization problem with limited size of training samples. Some methods were proposed to improve the generalization ability, including construction, pruning, regularization, early stopping, noise injection, randomly expanded training set, and so on. In this paper we only consider the generalization problem with limited size of training sample. An improved NN training algorithm based on margin maximization, called M2 algorithm for short, is proposed, which can be combined with previous works to further improve the performance of NN. Without loss of university we only consider two classes problem (Any pattern recognition problem can be decomposed into one or more two classes’ problems). The rest of the paper is organized as follows. Section 2 gives the basic idea of M2 algorithm. Section 3 gives a practical implementation of M2 algorithm. Section 4 reports our simulation, which provides an experimental support for M2 algorithm. Summary and conclusion are given in section 5.
2 2.1
The Basic Idea of M2 Algorithm Margin Maximization Criterion of Support Vector Machine (SVM)
The basic idea of SVM can be clearly described with the simplest case: linear classifiers trained on linear separable data, as is shown in Fig.1 [1] . Solid points and hollow points stand for samples of two classes, respectively. Assumed H is the decision hyper-plane. Those patterns, whose removal will change the solution, are called support vectors, as is shown in the figure by extra circles. Support vectors form two hyper-planes H1 and H2 , which are parallel with H. The margin of a decision hyper-plane is defined as the distance from H1 to H2 . For the linear separable case, the training algorithm of SVM simply looks for the optimal decision hyper-plane that maximizes the minimum distance from H1 to H and from H2 to H. In another word, SVM maximizes the minimum distance from the decision hyper-plane to support vectors. 2.2
Margin Maximization Criterion of M2 Algorithm
For the sake of discussion, we first give the following definition. Definition 1. The distance from a point in the feature space to a class, calculated by some criterions, is called class-distance.
408
K. Wang and Q. Wang
Fig. 1. The decision hyper-plane of SVM
The value of class-distance depends on concrete calculation criterions, e.g. class-distance of SVM is the minimum distance from the decision hyper-plane to support vectors in each class. To realize the margin maximization criterion, the decision hyper-plane of SVM should have an equal class-distance from two classes. SVM is designed based on support vectors, whose optimization function contains the distance from the decision hyper-plane to support vectors, thus the margin maximization criterion is easy to implement for SVM. However, it is never easy to implement the margin maximization for BP network because it is difficult to achieve the distance from the decision boundary to training patterns, so that adjusting points set is introduced, all points in which have an equal class-distance from two classes. During the training of BP network, the decision boundary should always move toward adjusting points to realize the margin maximization criterion.
3
A Practical Implementation of M2 Algorithm
One detailed implementation based on the ideas introduced above is presented in this section, which consists of the following four steps: 1) Generating boundary patterns Similar to [2], for each pattern in training set T, we define a Voronoi polyhedron that consists of all those points to which this pattern is the nearest one in T. All Voronoi polyhedrons constitute Voronoi graph (Fig.2). Obviously, the decision boundary of nearest neighbor classifier (thick line in Fig.2) consists of the boundary between all the neighboring Voronoi polyhedron pairs in which the two patterns belong to two different classes. We call every such Voronoi polyhedron a boundary polyhedron, all of which constitute the decision boundary of nearest neighbor classifier. Correspondingly, each pattern in a boundary polyhedron is called a boundary pattern that spread the decision boundary, which is
A Margin Maximization Training Algorithm for BP Network
409
Fig. 2. Voronoi graph and the decision boundary of nearest neighbor classifier (thick line)
similar with the support vector in SVM. Here we apply BDPATCH method [3] to generate boundary patterns. 2) Generating adjusting points candidate set S We generate some points in the boundary region, whose nearest neighbor patterns should also be a boundary pattern, as adjusting points candidate set S. 3) Generating adjusting points set V For each pattern C in S, we calculate the distance Di between C and its nearest neighbor pattern in class ωi (i = 1, 2), as the class-distance. Then we calculate the class interval DC of pattern C according to formula (1), and constitute adjusting points set V with the first M patterns in S that have maximum class interval: 1 if D1 − D2 = 0, DC = |D1 −D2 | (1) ∞ otherwise. The intuitive meaning of formula (1) is that, the smaller the class interval of a pattern is, the larger the minimum class-distance of the pattern is. 4) Training the neural networks We take formula (2) as optimization target of ANN design. E=
c∈T
|tc − y c | +
tvi =
1 v |t − y v |, M
(2)
v∈V
0 if yiv > yjv for ∀j = i, 1 otherwise,
(3)
where M is the size of adjusting points set, tk is the expected output of the k th pattern, y k is the network output of the k th pattern, tki is the ith component of tk , and yik is the ith component of y k . We set the value of tvi with formula (3) to ensure the decision boundary of ANN always moving toward the adjusting points.
4 Experimental Studies
4.1 Data Preparation
Artificial Data. Without loss of generality, we only consider two-class problems with equal prior probabilities in two- and three-dimensional feature spaces. 1) Two-Dimensional Gaussian Distribution. In the two-dimensional simulation, the distribution density functions of the two classes are given by formula (4) and formula (5), respectively:

$$f_1(x, y) = \frac{1}{2\pi \cdot 0.4 \cdot 0.8} \exp\left[-\frac{1}{2}\left(\frac{(x-1)^2}{0.4^2} + \frac{(y-1)^2}{0.8^2}\right)\right] + \frac{1}{2\pi \cdot 0.4 \cdot 0.8} \exp\left[-\frac{1}{2}\left(\frac{(x-3)^2}{0.4^2} + \frac{(y-1)^2}{0.8^2}\right)\right], \qquad (4)$$

$$f_2(x, y) = \frac{1}{2\pi \cdot 0.4 \cdot 0.8} \exp\left[-\frac{1}{2}\left(\frac{(x-2)^2}{0.4^2} + \frac{(y+1)^2}{0.8^2}\right)\right] + \frac{1}{2\pi \cdot 0.4 \cdot 0.8} \exp\left[-\frac{1}{2}\left(\frac{(x-4)^2}{0.4^2} + \frac{(y+1)^2}{0.8^2}\right)\right]. \qquad (5)$$
It can be easily calculated that the Bayesian accuracy is R* = 94.575%, and thus the Bayesian error rate is e* = 5.425%. Both training and testing samples were generated by a program based on formula (4), formula (5) and the Monte-Carlo method. We generated 80 training samples in various experiments with sizes N = 96, 192, 384 and 768, respectively. The testing sample was the same for all the experiments, its size being M = 8000. 2) Three-Dimensional Gaussian Distribution. In the three-dimensional experiments, the distribution density functions of the two classes are given by formula (6) and formula (7), respectively:

$$f_1(x, y, z) = \frac{1}{2(1.15\sqrt{2\pi})^3} \exp\left[-\frac{(x-2)^2 + y^2 + (z-1)^2}{2 \cdot 1.15^2}\right] + \frac{1}{2(1.05\sqrt{2\pi})^3} \exp\left[-\frac{(x+2)^2 + y^2 + (z-1)^2}{2 \cdot 1.05^2}\right], \qquad (6)$$

$$f_2(x, y, z) = \frac{1}{2(0.75\sqrt{2\pi})^3} \exp\left[-\frac{x^2 + (y-2)^2 + (z+1)^2}{2 \cdot 0.75^2}\right] + \frac{1}{2(0.65\sqrt{2\pi})^3} \exp\left[-\frac{x^2 + (y+2)^2 + (z+1)^2}{2 \cdot 0.65^2}\right]. \qquad (7)$$
It can also be easily calculated that the Bayesian accuracy is R* = 95.295%, and thus the Bayesian error rate is e* = 4.705%. Both training and testing samples were generated by a program based on formula (6), formula (7) and the Monte-Carlo method. We generated 80 training samples in various experiments with sizes N = 96, 192, 384 and 768. The testing sample was the same for all the experiments, its size being M = 8000.
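As an aside, a generator for the two-dimensional artificial data could look like the sketch below. It assumes each class density is an equal-weight mixture of its two Gaussian components; everything else (function names, the random seed) is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_class1_2d(n):
    """Draw n points from the class-1 density of formula (4): an equal-weight
    mixture of two axis-aligned Gaussians with (sigma_x, sigma_y) = (0.4, 0.8)."""
    means = np.array([[1.0, 1.0], [3.0, 1.0]])
    component = rng.integers(0, 2, size=n)                # choose a mixture component
    noise = rng.normal(size=(n, 2)) * np.array([0.4, 0.8])
    return means[component] + noise

def sample_class2_2d(n):
    """Same for the class-2 density of formula (5)."""
    means = np.array([[2.0, -1.0], [4.0, -1.0]])
    component = rng.integers(0, 2, size=n)
    noise = rng.normal(size=(n, 2)) * np.array([0.4, 0.8])
    return means[component] + noise
```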
Real-world Data. Two classification problems are used to evaluate the performance of the M2 algorithm: the breast cancer problem and the heart disease problem, each consisting of 100 training sets and 100 corresponding testing sets, which can be obtained from [4]. For the breast cancer dataset, there are 200 patterns in each training set and 77 patterns in each testing set; each pattern is classified into one of two classes according to nine features. For the heart disease dataset, there are 170 patterns in each training set and 100 patterns in each testing set; each pattern is classified into one of two classes according to thirteen features.
4.2 Experimental System
To avoid the effect of weight initialization and to minimize the training error rate, we adopt a weight initialization method based on random optimization, whose basic idea is to search for the global optimum by a genetic algorithm and for the local optimum by a gradient algorithm. On the other hand, several studies [5]-[7] have shown that NNs designed with pre-edited training samples perform better than classical NNs. Therefore, we conduct our simulation with pre-edited training sets, for which the holdout method [8] is adopted. We implemented the following two networks in the simulation: GA-BP network — a classical BP network with the weights initialized by a genetic algorithm; GA-M2-BP network — a GA-BP network trained by the M2 algorithm proposed in this paper. The BP network has been introduced in many textbooks and is not discussed further here; we only give the detailed parameter settings for each experiment. (a) A single hidden layer with 50 nodes (large enough to learn the training sample sufficiently). (b) Weights initialized with the uniform distribution on the interval [-0.5, 0.5]. (c) An improved BP algorithm with a momentum term and a varied learning rate. (d) Stopping condition: the training error falls below a predefined threshold or has been stable for a number of iterations.
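Setting (c) — back-propagation with a momentum term and a varied learning rate — together with stopping condition (d) can be summarised by the sketch below. This is a generic illustration of the standard momentum rule, not the authors' exact implementation; the parameter values are placeholders.

```python
import numpy as np

def bp_momentum_step(weights, grad, velocity, lr, momentum=0.9):
    """One weight update with a momentum term; the caller may vary lr between epochs."""
    velocity = momentum * velocity - lr * grad
    return weights + velocity, velocity

def should_stop(errors, threshold=1e-3, patience=50, tol=1e-6):
    """Stopping condition (d): error below a threshold, or stable for `patience` iterations."""
    if errors[-1] < threshold:
        return True
    recent = errors[-patience:]
    return len(recent) == patience and max(recent) - min(recent) < tol
```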
4.3 Experimental Results
Experimental results on the artificial data are shown in Table 1. It can be seen that the GA-M2-BP network performs better than the GA-BP network; that is to say, the M2 algorithm does improve the performance of the BP network. Note that, with increasing training sample size, the difference between the performance of the GA-M2-BP network and that of the GA-BP network becomes smaller. In particular, for the two-dimensional data the difference vanishes when the training sample size is 768. This phenomenon is consistent with intuition: the larger the training sample is, the smaller the distance between two neighboring patterns is, and thus the more tightly the decision boundary is constrained.
Table 1. Experimental results on artificial data (testing error, %)

Size of Training Sample   2D: GA-BP Network   2D: GA-M2-BP Network   3D: GA-BP Network   3D: GA-M2-BP Network
96                        8.55                7.75                   8.22                7.04
192                       6.99                6.70                   7.54                6.43
384                       6.19                6.16                   6.41                5.80
768                       5.87                5.87                   5.51                5.25
Experimental results on the real-world data are shown in Table 2. The GA-M2-BP network and the GA-BP network are the two classifiers implemented by us; their results are compared with the benchmark results of other classifiers in [4]. It can be seen that the GA-M2-BP network performs better than most of the other classifiers and shows performance comparable to SVM.

Table 2. Experimental results on real-world data (testing error, %)

Classifier                    Breast Cancer   Heart Disease
GA-M2-BP Network              25.82           16.11
GA-BP Network                 26.77           16.54
RBF Network                   27.64           17.55
AdaBoost with RBF-Network     30.36           20.29
LP Reg-AdaBoost               26.79           17.49
QP Reg-AdaBoost               25.91           17.17
AdaBoost Reg                  26.51           16.47
SVM with RBF-Kernel           26.04           15.95
Kernel Fisher Discriminant    24.77           16.14
5 Conclusion
The generalization problem is a key problem in the NN community; it can be grouped into two classes: the generalization problem with an unlimited training sample size and that with a limited training sample size. The generalization problem with a limited training sample size is considered in this paper. Similar to the margin maximization criterion in SVM, we propose a margin maximization training algorithm for the BP network to further improve its generalization ability. Experimental results show that the proposed margin maximization training algorithm does improve the performance of the BP network and yields performance comparable to SVM. It can be proved that the GA-M2-BP network implemented in this paper is asymptotically optimal in probability, which will be discussed in another paper. On the other hand, the method we adopt in the practical implementation is rather simple and can be further improved by incorporating human experience. In the
future, we will apply the algorithm proposed in this paper to more applications to further verify its validity.
References
1. Zhang, X. G.: Introduction to Statistical Learning Theory and Support Vector Machines. Acta Automatica Sinica 26 (2000) 32-42
2. Shamos, M. I.: Geometric Complexity. In: Proceedings of the 7th ACM Symposium on the Theory of Computing, Albuquerque, New Mexico (1975) 224-233
3. Wang, Q. R.: A CNN Classification Design with Boundary Patching. Acta Automatica Sinica 12 (1988) 415-438
4. http://ida.first.fhg.de/projects/bench/benchmarks.htm
5. Rosin, P. L., Fierens, F.: Improving Neural Network Generalisation. In: Proceedings of the International Geoscience and Remote Sensing Symposium, 2, Firenze, Italy (1995) 1255-1257
6. Choi, S. H., Rockett, P.: The Training of Neural Classifiers with Condensed Datasets. IEEE Transactions on Systems, Man, and Cybernetics, Part B 32 (2002) 202-206
7. Hara, K., Nakayama, K., Kharaf, A. A. M.: A Training Data Selection in On-Line Training for Multilayer Neural Networks. In: Proceedings of the IEEE International Joint Conference on Neural Networks, Anchorage, USA (1998) 2247-2252
8. Ferri, F. J., Albert, J. V., Vidal, E.: Considerations about Sample-size Sensitivity of a Family of Edited Nearest-neighbor Rules. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics 29 (1999) 667-672
Learning Bayesian Networks Based on a Mutual Information Scoring Function and EMI Method Fengzhan Tian1 , Haisheng Li2 , Zhihai Wang1 , and Jian Yu1 1
School of Computer & Information Technology, Beijing Jiaotong University, Beijing 100044, P.R. China {fzhtian,Zhhwang,jianyu}@bjtu.edu.cn 2 Baoding Training Center, Hebei Electric Power Corporation, Baoding 071000, P.R. China
[email protected]
Abstract. At present, most algorithms for learning Bayesian Networks (BNs) use the EM algorithm to deal with incomplete data. They are of low efficiency because the EM algorithm has to perform an iterative process of probabilistic reasoning to complete the incomplete data. In this paper we present an efficient BN learning algorithm, which uses the combination of the EMI method and a scoring function based on mutual information theory. The algorithm first uses the EMI method to estimate, from incomplete data, probability distributions over local structures of BNs, then evaluates BN structures with the scoring function and searches for the best one. The detailed procedure of the algorithm is depicted in the paper. The experimental results on the Asia and Alarm networks show that, while achieving high accuracy, the algorithm is much more efficient than two EM based algorithms, the SEM and EM-EA algorithms.
1 Introduction In recent years, learning Bayesian Networks (BNs) from data has become an attractive and active research issue in artificial intelligence and data mining [6]. BN learning algorithms fall into two categories. The first is called search & scoring based algorithms, on which most of the research on BN learning has been focused. This kind of algorithm views the learning problem as searching for a structure, in the space of graphical models, that fits the dataset best. Search & scoring based algorithms include the Bayesian method [6], entropy based methods, the Minimal Description Length (MDL) based method [12], the Maximal Mutual Information based method [15], etc. The second is called dependency analysis based algorithms. They use dependencies between variables to infer the structure, which are usually determined through Conditional Independence (CI) tests [4]. SGS [11], PC [11], TPDA [1], SLA [1], etc. all belong to this kind of algorithm. Some research has attempted to learn BNs by combining the two kinds of algorithms [7]. The algorithms mentioned above concentrate on learning BNs from complete data, while learning BNs from incomplete data is much more difficult. There have now been some advances in learning BNs from incomplete data. One earlier piece of research on BN learning from incomplete data was made by Chickering [2],
which deals with missing data by the Gibbs sampling method. The most significant advance was the SEM algorithm [5]; the SEM algorithm turns the learning problem with incomplete data into the more tractable learning problem with complete data using the EM algorithm. Another line of research was made by W. Myers et al., who complete the incomplete data using genetic operations and evolve the network structures and missing data at the same time [9]. Drawing on the ideas of Friedman and Myers, we put forward the EM-EA algorithm [13], which completes data using the EM algorithm and evolves BNs using an evolutionary algorithm (EA). The above advances all concern search & scoring based algorithms. As for dependency analysis based algorithms, we presented a method, named EMI, for estimating (conditional) mutual information from incomplete data, and extended J. Cheng's work to the case of incomplete data [14]. EM based algorithms are of low efficiency due to the iterative process of the EM algorithm for probabilistic reasoning when completing the incomplete data, as well as the many calls of the EM algorithm. Another disadvantage of the EM algorithm is that it usually converges to a local maximum. EMI is a deterministic method that can be used for estimating probability distributions from incomplete data. In this paper, we use the EMI method instead of the EM algorithm to deal with incomplete data so as to avoid the inefficiency caused by iterative reasoning. For this purpose, we adopt a new scoring function based on mutual information theory to evaluate BN structures, and then use the EMI method to estimate the probability distributions over local structures used to compute mutual information and learn BNs based on the search & scoring strategy. In the algorithm, local probability distributions are stored in main memory to take advantage of previous computational results and avoid repetitive computation. Finally, we conduct an experimental analysis on the Asia and Alarm networks with respect to SEM, EM-EA and our algorithm. The rest of this paper is organized as follows. In Section 2, we introduce a new scoring function for evaluating BNs. In Section 3, we briefly review the EMI method. In Section 4, we describe our algorithm in detail and give its storage and time complexity analysis. The experimental analysis and conclusions are given in Section 5 and Section 6, respectively.
2 A Scoring Function Based on Mutual Information
Suppose that R is the underlying true distribution over the variables X = {X_1, ..., X_n} and Q is a distribution over these variables defined by some Bayesian network; then the K-L cross entropy between R and Q, C_{KL}(R, Q), is a distance measure of how close Q is to R and is defined by the following equation:

$$C_{KL}(R, Q) = \sum_{x} R(x) \log_2 \frac{R(x)}{Q(x)}. \qquad (1)$$
The goal of BN learning is to search for the BN model with minimal C_{KL}(R, Q) [8]. Also in [8], Lam and Bacchus proved the following theorem: Theorem 1. The K-L cross entropy C_{KL}(R, Q) is a monotonically decreasing function of $\sum_{i=1}^{n} I(X_i, \Pi_i)$. Hence, it is minimized if and only if the sum is maximized.
In the above theorem, Π_i denotes the set of parents of X_i in a BN. Each element of the sum, I(X_i, Π_i), measures the mutual information of a local structure, X_i and its parents Π_i, and is defined as follows:

$$I(X_i, \Pi_i) = \sum_{x_i, \pi_i} P(x_i, \pi_i) \log \frac{P(x_i, \pi_i)}{P(x_i) P(\pi_i)},$$

where x_i and π_i respectively represent the value of X_i and the instantiation of Π_i. According to the above theorem, given a dataset D, we can use the following equation as a scoring function of a BN structure S and search for the BN structure with the maximal value of this equation:

$$Score(S : D) = \sum_{i=1}^{n} I(X_i, \Pi_i) = \sum_{i=1}^{n} \sum_{x_i, \pi_i} P(x_i, \pi_i) \log \frac{P(x_i, \pi_i)}{P(x_i) P(\pi_i)}. \qquad (2)$$
The computation of Eq. (2) only depends on the multivariate probability distribution P(x_i, π_i). Note that $P(x_i) = \sum_{\pi_i} P(x_i, \pi_i)$ and $P(\pi_i) = \sum_{x_i} P(x_i, \pi_i)$. But when the dataset is incomplete, P(x_i, π_i) cannot be computed exactly, so we have to estimate it from the dataset.
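For illustration, the scoring function of Eq. (2) can be computed directly from the estimated local joint tables, as in the sketch below. The array layout (rows index x_i, columns index parent configurations π_i) and the function names are our own choices for the example.

```python
import numpy as np

def local_mutual_information(joint):
    """I(X_i, Pi_i) from a joint table P(x_i, pi_i):
    rows index the values of X_i, columns index the parent configurations."""
    px = joint.sum(axis=1, keepdims=True)       # P(x_i)
    ppi = joint.sum(axis=0, keepdims=True)      # P(pi_i)
    mask = joint > 0                            # skip zero-probability cells
    return float(np.sum(joint[mask] * np.log(joint[mask] / (px @ ppi)[mask])))

def structure_score(local_joints):
    """Eq. (2): the sum of the local mutual-information terms over all nodes."""
    return sum(local_mutual_information(j) for j in local_joints)
```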
3 Probability Distribution Estimation by the EMI Method
The EMI method first computes interval estimates of the joint probabilities of a variable set, and then computes a point estimate (i.e., a probability distribution estimate) via a convex combination of the extreme points. Here we give only a brief review; the details can be found in [14]. Let X be a variable set having r possible states x^1, x^2, ..., x^r, and let the parameter vector θ = (θ_1, θ_2, ..., θ_r), with $\sum_i \theta_i = 1$, be associated with the probabilities P(X = x^i), i = 1, 2, ..., r, and have a prior Dirichlet distribution. Let N(x^i) be the frequency of complete instances with X = x^i in the dataset and N*(x^i) be the artificial frequency of incomplete instances that can be completed as x^i. Assume that we have completed the dataset by filling all possible incomplete cases of X with the value x^i, and denote the completed dataset as $D_X^i$. Then, given $D_X^i$, the prior distribution of θ is updated into the posterior distribution using Bayes' theorem:

$$P_i(\theta \mid D_X^i) = Dir(\theta \mid a_1', a_2', \ldots, a_r'), \quad i = 1, 2, \ldots, r, \qquad (3)$$
where $a_i' = a_i + N(x^i) + N^*(x^i)$, $a_k' = a_k + N(x^k)$ for all k ≠ i, and a_k (k = 1, 2, ..., r) are the super parameters of the prior Dirichlet distribution. From Eq. (3), we obtain the maximal Bayes estimate of P(X = x^i):

$$P^*(X = x^i) = E(\theta_i) = \frac{a_i + N(x^i) + N^*(x^i)}{a + N^*(x^i) + \sum_{k=1}^{r} N(x^k)}, \qquad (4)$$

and a unique probability for the other states of X:

$$P_i^*(X = x^l) = E(\theta_l) = \frac{a_l + N(x^l)}{a + N^*(x^i) + \sum_{k=1}^{r} N(x^k)}, \qquad (5)$$
where l = 1, 2, ..., r, l ≠ i, and $a = \sum_{k=1}^{r} a_k$. Now we can define the minimal Bayes estimate of P(X = x^i) as follows:

$$P_*(X = x^i) = \min_k \left( P_k^*(X = x^i) \right) = \frac{a_i + N(x^i)}{a + \max_{l \neq i} N^*(x^l) + \sum_{k=1}^{r} N(x^k)}, \quad i = 1, 2, \ldots, r. \qquad (6)$$
Note that when the dataset is complete, we will have

$$P^*(X = x^i) = P_*(X = x^i) = \frac{a_i + N(x^i)}{a + \sum_{k=1}^{r} N(x^k)}, \quad i = 1, 2, \ldots, r,$$

the Bayes estimate of P(X = x^i) that can be obtained from the complete dataset. The minimal and maximal Bayes estimates of P(X = x^i) form the lower and upper bounds of its estimation interval. Next we use these intervals to compute point estimates as a convex combination of their lower and upper bounds as follows:

$$P(X = x^i) = \lambda_i P^*(X = x^i) + \sum_{k \neq i} \lambda_k P_k^*(X = x^i), \quad i = 1, 2, \ldots, r. \qquad (7)$$
This point estimate is the expected Bayes estimate that would be obtained from the complete dataset when the mechanism generating the missing data can be described by the λ_i, with $\sum_{i=1}^{r} \lambda_i = 1$. The estimates obtained according to Eq. (7) define a probability distribution, since $\sum_{i=1}^{r} P(x^i) = 1$; the proof can be found in [14]. When no information on the mechanism generating the missing data is available, any pattern of missing data is equally likely, i.e.,

$$\lambda_i = \frac{1}{r}, \quad i = 1, 2, \ldots, r. \qquad (8)$$

This case corresponds to the assumption that data is Missing Completely at Random (MCAR) [10]. When data is Missing at Random (MAR) [10], we will have:

$$\lambda_i = \frac{a_i + N(x^i)}{a + \sum_{k=1}^{r} N(x^k)}, \quad i = 1, 2, \ldots, r. \qquad (9)$$
We can estimate all P (xi , πi ) using the above method. On this basis, we can evaluate network structures by Eq. (2).
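The following sketch shows how the point estimate of Eqs. (4)-(9) might be computed for a single variable set. The array-based formulation and the function name are ours; the MAR/MCAR switch simply selects between the weights of Eq. (9) and Eq. (8).

```python
import numpy as np

def emi_point_estimate(a, n_complete, n_missing, mar=False):
    """Point estimate of P(X = x^i) by the EMI method (Eqs. (4)-(9)).
    a[i]          : Dirichlet super parameter a_i
    n_complete[i] : N(x^i), frequency of complete instances equal to x^i
    n_missing[i]  : N*(x^i), frequency of incomplete instances completable as x^i"""
    a = np.asarray(a, float)
    N = np.asarray(n_complete, float)
    Ns = np.asarray(n_missing, float)
    r, a0, Ntot = len(a), a.sum(), N.sum()

    # P[k, i] = P_k^*(X = x^i): Bayes estimate when every incomplete case is filled with x^k
    P = np.empty((r, r))
    for k in range(r):
        P[k] = (a + N) / (a0 + Ns[k] + Ntot)                   # Eq. (5)
        P[k, k] = (a[k] + N[k] + Ns[k]) / (a0 + Ns[k] + Ntot)  # Eq. (4)

    # convex-combination weights: Eq. (9) under MAR, Eq. (8) under MCAR
    lam = (a + N) / (a0 + Ntot) if mar else np.full(r, 1.0 / r)
    return lam @ P        # Eq. (7): P(x^i) = sum_k lambda_k * P_k^*(x^i)
```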
4 Learning BNs by Probability Distribution Estimates In this section, we will first describe our learning algorithm in detail, and then analyze its storage and time complexity. The learning algorithm using probability distribution estimates is as follows. Input: Incomplete dataset Dinc ; Output: A Bayesian network (S, θ);
Initialize a BN S_c with a tree structure using Chow-Liu's algorithm [3];
REPEAT
  Determine the search frontier F of S_c with some search strategy;
  Let $P = \bigcup_{S' \in F} Prob(S')$, where Prob(S') = {P(x_i, π_i) : 1 ≤ i ≤ n} represents the set of local probability distributions needed to evaluate S';
  Estimate each P(x_i, π_i) in P with the method in Section 3;
  Let Nets(P) = {S : Prob(S) ⊆ P} represent the set of networks that can be evaluated with the local probability distributions in P;
  Let $S = \arg\max_{S' \in Nets(P)} Score(S' : P)$;
  If Score(S : P) > Score(S_c : P), then let S_c = S;
UNTIL the termination conditions are met
Compute the parameter vector θ of S_c;
Output (S_c, θ).
When performed, the above procedure maintains a current best network structure (the one that currently has the highest score). With some search strategy, the algorithm creates a set of candidate structures, called the search frontier of the current structure. In the procedure, Score(S : P) is our scoring function, which can be calculated by Eq. (2); the termination conditions can be set by users. The general rule is to terminate the REPEAT loop when the frontier does not change any more. Theoretically, any search strategy could be used in the above procedure. No matter which strategy is used, the search frontier can be regarded as the "neighbors" of the current structure, which differ from the current structure only in certain local structures. In other words, almost all local structures are the same between the current structure and its neighbors, so only a few different local structures need to be newly estimated. Suppose we use the greedy strategy in the learning procedure. The greedy search strategy starts from an initial structure and changes one arc (adds, deletes, or reverses an arc) at each move without creating a directed cycle. The first operation is equivalent to adding a parent X_j to some variable X_i; the local probability distribution on X_i and its parents then changes to P(x_i, π_i ∪ x_j), where π_i is the configuration of the old parent set of X_i, so P(x_i, π_i ∪ x_j) needs to be re-estimated. The second operation is equivalent to deleting a parent X_j of some variable X_i; the local probability distribution on X_i and its parents then changes to $P(x_i, \pi_i - x_j) = \sum_{x_j} P(x_i, \pi_i)$, so the local probability distribution on X_i and its parents does not need to be re-estimated. The third operation can be achieved by performing both the first and the second operations: suppose we reverse an arc from X_i to X_j, then the local probability distribution on X_i and its parents must be re-estimated and the local probability distribution on X_j and its parents must be marginalized. In the above procedure, the time consumption is mainly on estimating P(x_i, π_i) from the incomplete dataset D_inc. Suppose there are n nodes and m arcs in the underlying correct network. Because the initial network in the procedure is a tree with n − 1 arcs, the number of arcs of each current network in the learning process is usually less than m. So in the worst case, we have to re-estimate n(n − 1) + m local probability distributions to evaluate the candidates in the frontier. Considering that the time consumption of estimating one probability distribution is roughly O(N), proportional to the size of the
dataset [14], in the worst case the time complexity of executing one REPEAT loop is O(N(n² + m)). If each REPEAT loop makes one correct change to the current network, that is, correctly adds, deletes or reverses one arc per iteration, then at most m REPEAT loops have to be performed and the time complexity of the procedure is O(N(n²m + m²)) in the worst case. The storage consumption of our algorithm depends on the number of local probability distributions needed to evaluate the candidates in the frontier; it reaches a maximum and does not increase any more after a certain number of networks have been evaluated.
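A possible realization of the greedy search with cached local-distribution estimates is sketched below. The helper functions passed in as arguments (the initial tree, the frontier generator, the EMI estimator, and the scoring function) are placeholders standing for the components described above, not code from the paper.

```python
def greedy_bn_search(initial_tree, frontier, estimate_joint, score, max_iter=1000):
    """Greedy structure search with a cache of local distributions P(x_i, pi_i),
    so that unchanged local structures are never re-estimated."""
    cache = {}

    def local(node, parents):
        key = (node, tuple(sorted(parents)))
        if key not in cache:                    # estimate only when not seen before
            cache[key] = estimate_joint(node, parents)
        return cache[key]

    current = initial_tree
    for _ in range(max_iter):
        best, best_score = current, score(current, local)
        for cand in frontier(current):          # one-arc changes, no directed cycles
            s = score(cand, local)
            if s > best_score:
                best, best_score = cand, s
        if best is current:                     # the frontier no longer improves the score
            break
        current = best
    return current
```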
5 Experimental Analysis
To validate our algorithm, we compare it with the SEM and EM-EA algorithms in terms of storage consumption, accuracy and efficiency.
5.1 Experimental Method
In the experiments, we consider two well-known networks: Asia, which has 8 nodes and 8 arcs, and Alarm, which has 37 nodes and 46 arcs. We first generate complete random samples of different sizes, and then throw away 10%, 20% and 30% of the values, respectively, from each complete sample at random. The incomplete datasets obtained in this way follow the assumption that data is Missing Completely at Random, so we compute the weights λ_i by Eq. (8). In the experiments, both the SEM algorithm and our algorithm adopt the greedy search strategy, the MDL metric, and the non-informative prior Dirichlet distribution, i.e., a_i = 1 (i = 1, 2, ..., r). The other experimental parameters are set as follows: the mutation and crossover probabilities in the EM-EA algorithm are set to 0.1 and 0.5, respectively; the group size during the evolution is set to 20; and the stopping criterion for the EM-EA algorithm is that the group does not change any more. We implement the SEM, EM-EA and our algorithms in a Java based system and run them given the node ordering. All experiments were conducted on a Pentium 2.4 GHz PC with 512 MB of RAM running under Windows 2003. The datasets are stored in an SQL Server 2000 database. Every experimental outcome is the average of the results of 10 runs for each experimental condition. We evaluate the performance of the three algorithms by measuring the average log-loss of each learned network B_S on another test set of size N, that is, $\frac{1}{N}\sum_{i=1}^{N} \log B_S(x^i)$.
5.2 Experimental Results
As for storage consumption, we count only the space storing the probability distribution records in our algorithm and the sufficient statistic records in the SEM and EM-EA algorithms. Because the storage has nothing to do with the level of missing data, we do these experiments only on datasets with 20% missing values. The experimental result is shown in Fig. 1. From Fig. 1, the three algorithms allocate almost constant memory despite the increase of the sample size, and our algorithm needs no more storage than
Fig. 1. Storage consumption of SEM, EM-EA and our algorithm on Asia and Alarm networks
the SEM and EM-EA algorithms with respect to both the Asia and Alarm networks. This is because the storage depends only on the number of probability distribution records or sufficient statistic records. Even with respect to the Alarm network, our algorithm consumes only about 21 KB of memory, which is too small to be worth worrying about. Fig. 2 and Fig. 3 show the log loss of SEM, EM-EA and our algorithm for each level of missing data with respect to the Asia and Alarm networks, respectively; the lower the log loss score, the better. As can be seen from the figures, the predictive accuracy of the three algorithms degrades as the missing data increase. For small datasets (with 500 instances), the predictive accuracy of our algorithm is worse than that of the SEM and EM-EA algorithms, while it is significantly better than that of the SEM and EM-EA algorithms for both medium (with 1000 instances) and large datasets (with 2000 instances) at all levels of missing data, especially at 30% missing data. This verifies that the tendency of the EM algorithm to converge to a parametric local maximum decreases the accuracy of EM based algorithms to some extent. Table 1 and Table 2 show the running time (seconds) of the SEM, EM-EA and our algorithms on the Asia and Alarm networks with different levels of missing data and sample sizes. From the two tables, we can see that the larger the dataset, the more running time is needed. Given the level of missing data, the running time of our algorithm is roughly linear in the size of the dataset, while that of the SEM and EM-EA algorithms increases much more quickly. Given the size of the dataset, the running time of the SEM and EM-EA algorithms increases steadily with the amount of missing data, while that of our algorithm shows no obvious change over the different levels of missing data. For large datasets (such as 2000 instances) with a high level of missing data (such as 30% missing values), the running time of the SEM and EM-EA algorithms reaches dozens of times that of our algorithm. This demonstrates that the need of EM based algorithms to complete the incomplete datasets using Bayesian inference also causes very high computational cost.
Fig. 2. Learning accuracy of SEM, EM-EA and our algorithm on Asia network
Fig. 3. Learning accuracy of SEM, EM-EA and our algorithm on Alarm network
Table 1. Running time (seconds) of SEM, EM-EA and our algorithm on Asia network

Sample Size   Algorithm       Complete Data   10% Missing   20% Missing   30% Missing
500           SEM             0.72            7.60          11.45         12.41
500           EM-EA           2.12            8.20          9.86          10.62
500           Our algorithm   0.72            0.72          0.73          0.73
1000          SEM             0.93            10.81         15.70         16.40
1000          EM-EA           2.29            12.22         13.51         14.81
1000          Our algorithm   0.93            0.92          0.94          0.95
2000          SEM             1.31            32.12         39.62         42.02
2000          EM-EA           2.64            36.49         41.32         42.42
2000          Our algorithm   1.31            1.32          1.31          1.32
Table 2. Running time (seconds) of SEM, EM-EA and our algorithm on Alarm network

Sample Size   Algorithm       Complete Data   10% Missing   20% Missing   30% Missing
500           SEM             82              1094          1645          1854
500           EM-EA           212             1205          1558          1676
500           Our algorithm   82              82            83            84
1000          SEM             103             1698          2449          2789
1000          EM-EA           229             1803          2419          2629
1000          Our algorithm   103             104           105           104
2000          SEM             151             4625          6249          6974
2000          EM-EA           264             4651          6098          6721
2000          Our algorithm   151             150           150           151
6 Conclusions
The EM algorithm is a widely used method for estimating parameters and calculating the expectation of sufficient statistics, but it has to perform an iterative optimization process and usually converges to a local maximum. This reduces the efficiency and accuracy of BN learning algorithms based on the EM algorithm to some extent. This paper presents an efficient algorithm, which adopts a new scoring function based on mutual information theory to evaluate BN structures, uses the EMI method to estimate probability distributions over local structures, and learns BNs with a search & scoring strategy. The experimental results on the Asia and Alarm networks show that our algorithm is much more efficient than the SEM and EM-EA algorithms and has higher predictive accuracy than the SEM and EM-EA algorithms for medium (with 1000 instances) and large datasets (with 2000 instances). Next we will further evaluate our algorithm on other BN models. Acknowledgments. This work is supported by NSF of China under Grant No. 60503017 and No. 60673089 and the Science Foundation of Beijing Jiaotong University under Grant No. 2005SM012. We thank the anonymous reviewers for helpful comments.
References
1. Cheng, J., Greiner, R., Kelly, J., Bell, D.A., Liu, W.: Learning Bayesian Networks from Data: An Information-Theory based Approach. The Artificial Intelligence Journal 137 (2002) 43-90
2. Chickering, D.M., Heckerman, D.: Efficient Approximations for the Marginal Likelihood of Bayesian Networks with Hidden Variables. Machine Learning 29 (1997) 181-212
3. Chow, C.K., Liu, C.N.: Approximating Discrete Probability Distributions with Dependence Trees. IEEE Trans. Information Theory 14 (1968) 462-467
4. Dash, D., Druzdzel, M.J.: Robust Independence Testing for Constraint-based Learning of Causal Structure. UAI (2003) 167-174
5. Friedman, N.: The Bayesian Structural EM Algorithm. Fourteenth Conf. on Uncertainty in Artificial Intelligence (1998)
6. Heckerman, D.: Bayesian Networks for Data Mining. Data Mining and Knowledge Discovery 1 (1997) 79-119
7. Ioannis, T., Laura, E., Constantin, F.A.: The Max-Min Hill-Climbing Bayesian Networks from Data. To appear in Machine Learning (2006)
8. Lam, W., Bacchus, F.: Learning Bayesian Belief Networks: An Approach based on the MDL Principle. Computational Intelligence 10 (1994) 269-293
9. Myers, J., Laskey, K., Levitt, T.: Learning Bayesian Networks from Incomplete Data with Stochastic Search Algorithms. Fifth Conf. on Uncertainty in Artificial Intelligence (UAI) (1999)
10. Singh, M.: Learning Bayesian Networks from Incomplete Data. The 14th National Conf. on Artificial Intelligence (1997)
11. Spirtes, P., Glymour, C., Scheines, R.: Causation, Prediction, and Search. MIT Press, Cambridge, MA, USA, second edition (2001)
12. Suzuki, J.: Learning Bayesian Belief Networks based on the MDL Principle: An Efficient Algorithm using Branch and Bound Techniques. In: Lorenza ed. Proceedings of the 13th International Conference on Machine Learning. Bari: Morgan Kaufmann (1996) 462-470
13. Tian, F., Lu, Y., Shi, C.: Learning Bayesian Networks with Hidden Variables using the Combination of EM and Evolutionary Algorithm. Proceedings of the 5th Pacific-Asia Conference on Knowledge Discovery and Data Mining, Heidelberg: Springer-Verlag (2001) 568-574
14. Tian, F., Zhang, H., Lu, Y.: Learning Bayesian Networks from Incomplete Data based on EMI Method. Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM 2003), Melbourne, Florida, USA (2003) 323-330
15. Zhang, S., Wang, X.: Algorithm for Bayesian Networks Structure Learning based on Information Entropy. Mini-Micro Computer Systems 26 (2005) 983-986
Learning Dynamic Bayesian Networks Structure Based on Bayesian Optimization Algorithm* Song Gao1, Qinkun Xiao2, Quan Pan1, and Qingguo Li3 1
School of Automatization, Northwestern Polytechnical University, 710072 Xi’an, China 2 School of Electronic Information Engineering, Xi’an Technological University, 710032 Xi’an, China 3 School of Engineering Science, Simon Fraser University, Burnaby, BC V5A 1S6, Canada
[email protected]
Abstract. An optimization algorithm for dynamic Bayesian networks (DBN) based on the Bayesian optimization algorithm (BOA) is developed for learning and constructing the DBN structure. In this paper, we first introduce some basic theories and concepts of probability-model evolutionary algorithms. Then we describe the basic mode for constructing the DBN diagram and the mechanism of DBN structure learning based on BOA. The DBN structure learning based on BOA consists of two parts: the first is to obtain the structure and parameters of a DBN from a good solution group, and the second is to produce new groups according to the obtained DBN structure. In this paper, the DBN learning is achieved by a genetic algorithm based on a greedy mechanism, and the DBN inference is performed by a forward-simulation algorithm. Simulation results are provided to demonstrate the effectiveness of the proposed algorithm.
1 Introduction Bayesian network structure learning is a combinatorial optimization problem. By introducing time dependence, a Bayesian network is extended into a dynamic Bayesian network (DBN). Because of the dynamics and time dependence of DBNs, DBN structure learning becomes more complicated [1, 2], and better and faster search algorithms must be found. Recently, probability-model evolutionary algorithms (PMEA) based on the Genetic Algorithm (GA) have attracted many scholars all over the world [3-7]. The Bayesian optimization algorithm (BOA) [8, 9], based on Bayesian network learning, incorporates the probability-model evolutionary algorithm into a graphical model. BOA requires relatively little memory and effectively avoids premature convergence. Owing to these advantages, a DBN structure learning optimization algorithm based on BOA is developed in this paper. Firstly, the overall idea of structure learning based on BOA
Supported by NSFC (Grant No. 60404011 and No.60634030).
is discussed; secondly, the main focus is on DBN learning and inference, which involves two main algorithms: (1) a GA based on a greedy search mechanism (GA-GS); (2) a forward-simulation network inference model.
2 Overview of the DBN Structure Optimization Algorithm
The DBN structure optimization algorithm based on BOA carries out the evolutionary process in four steps: (1) Establishing a fine solution group S(t): using some selection mechanism to select S(t) from the current population. (2) Finding a DBN matching S(t): using the BD metric to find a better network structure according to S(t). (3) Producing new candidate solutions: using the joint probability distribution of the network graph to produce a new individual group. (4) Creating the next population: using the new individuals to replace some chromosomes of the old group and obtain the updated population. The four steps are repeated until the termination rule is met. The termination rule of the algorithm is that either a good solution has been found in the population, the number of evolutionary generations is large enough, or the rate of evolution has become slow enough. The DBN structure learning optimization algorithm proceeds as follows:
(1) Randomly generate an initial population P(0). (2) Select individuals with higher average fitness values from P(t) as S(t). (3) Construct a DBN matching S(t) according to the BD metric. (4) Produce O(t) according to the joint probability distribution of the DBN coding. (5) Replace some solutions of P(t) with O(t) to produce the next-generation population P(t+1). (6) Repeat steps (1) to (5) until the termination rule is met. The flowchart of the algorithm is illustrated in Fig. 1, and a code sketch of this loop is given below. Two key features of BOA are: (1) learning the DBN's structure and parameters according to S(t); (2) producing new individual groups from the learned DBN's structure and parameters. We will discuss these two key features in the following sections.
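As a rough illustration of steps (1)-(6), the control flow might look like the following Python sketch. The helper functions passed in (the GA-GS structure learner of Section 3, the forward-simulation sampler of Section 4, and a convergence test) are placeholders for components described later, not code from the paper.

```python
def boa_dbn_search(init_population, fitness, learn_dbn, sample_dbn, converged,
                   n_generations=1000, select_frac=0.5, replace_frac=0.5):
    """Top-level BOA loop: select a fine solution group S(t), learn a DBN that
    matches it, sample new individuals O(t) from the DBN, and replace part of
    the population with them."""
    population = list(init_population)
    for _ in range(n_generations):
        population.sort(key=fitness, reverse=True)
        good = population[:int(select_frac * len(population))]   # S(t)
        model = learn_dbn(good)                                   # BD-metric learning
        offspring = [sample_dbn(model)                            # O(t)
                     for _ in range(int(replace_frac * len(population)))]
        population[-len(offspring):] = offspring                  # build P(t+1)
        if converged(population):                                 # termination rule
            break
    return max(population, key=fitness)
```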
Fig. 1. Diagram of DBN structure learning algorithm based on BOA
3 Learning DBN Using GA-GS
3.1 Encoding
A DBN includes two network frames, namely a prior (earlier experience) network and a transfer network: B = (B0, B→). We can express B0 and B→ as chromosomes C0 and C→, and the combination of the encodings of chromosomes C0 and C→ expresses a DBN chromosome.
Fig. 2. DBN and its encoding
As an example, a network structure graph over time slices 0, t, and t+1 is shown in Fig. 2. In this transfer network, none of the nodes at time t have father nodes, so we only encode the nodes at time t+1. We label the first bit in each associated group as X, and the following bits encode the father nodes of X, labeled Π(X). (C0, C→) is a DBN chromosome. If we use 0 or 1 to represent a gene bit, (C0, C→) can be written as a binary-encoded gene string. The encoding process is illustrated in Table 1.

Table 1. Table of DBN encoding

Network   Example of given code   After simplification
B0        101111100               011100
B→        100100110001101100      001001000101100
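A very small sketch of such a bit-string encoding is given below. The exact bit layout used by the authors is only partially specified in the text, so this example simply encodes, for every time-(t+1) node, one bit per potential father node; treat the layout as an assumption made for illustration.

```python
def encode_parents(parent_sets, n_nodes):
    """Encode the parent structure of the time-(t+1) nodes as a flat bit list:
    for node i, bit j is 1 iff node j is a father of node i."""
    bits = []
    for i in range(n_nodes):
        bits.extend(1 if j in parent_sets.get(i, set()) else 0
                    for j in range(n_nodes))
    return bits

# Example: in a 3-node slice, node 2 has fathers {0, 1}
chromosome = encode_parents({2: {0, 1}}, 3)   # -> [0,0,0, 0,0,0, 1,1,0]
```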
3.2 Fitness Function
For a DBN, there exist two structure measures, namely the Bayesian Dirichlet metric (BD) and the Bayesian information criterion (BIC). In this paper, we use BD to measure a network; BD is described as follows [6]:

$$BD(B \mid D) = p(B \mid D) \prod_{i=0}^{n-1} \prod_{\pi_{X_i}} \frac{\Gamma(m'(\pi_{X_i}))}{\Gamma(m'(\pi_{X_i}) + m(\pi_{X_i}))} \prod_{x_i} \frac{\Gamma(m'(x_i, \pi_{X_i}) + m(x_i, \pi_{X_i}))}{\Gamma(m'(x_i, \pi_{X_i}))}. \qquad (1)$$
Here D is the data corresponding to S(t), and B is the network graph matching S(t). The i-th node is represented as X_i, and π_{X_i} denotes the values of X_i's father nodes. The
value of X_i is x_i; for the binary encoding, x_i ∈ {0,1}. m(π_{X_i}) = Σ_{x_i} m(x_i, π_{X_i}), and correspondingly m'(x_i, π_{X_i}) and m'(π_{X_i}) express the prior experience information about the network graph. The higher the BD score is, the better the DBN structure is.
3.3 Select Initial Population
In general, the initial population can be generated randomly or given by experts. Another way to generate the initial population is to use the variant technique developed by Madigan [8]. The technique combines network structure hypotheses with a prior experience probability distribution. In this technique, a computer program helps users to produce a complete database of hypotheses. Based on image data given by the expert and the complete database of hypotheses, some "good" structures are selected to generate an initial population.
3.4 Design of Genetic Arithmetic Operators
Selection scheme. After arranging the selection scheme, we decide the probability of each individual becoming a father by its fitness value. Evidently, the selection probability is not
Fig. 3. DBN crossover genes operation diagram
directly related to the absolute value of the fitness function. Therefore, this scheme can avoid premature convergence caused by the existence of super individuals.
Crossover scheme. The crossover operation increases the population's average quality, and the improvement comes from applying selection operators to stochastic gene pairings. When producing the filial generation, two DBN structures of the father generation are crossed according to the crossover probability p_c through a uniform parameterization system. The crossover operation is illustrated in Fig. 3. Similar to gene blocks, local structures with higher fitness values are crossed over from the father generation, and therefore network structures with higher fitness values are obtained.
Mutation scheme. New gene states are generated by the mutation operation, which provides a mechanism to avoid premature convergence. In this paper, two kinds of mutation operations are developed: adding a "1" to, or removing a "1" from, a chromosome; these two operations correspond to adding or removing an arc in the DBN, as shown in Fig. 4.
Fig. 4. DBN mutation operation process diagram
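In code, the two operators can be sketched as below. We read the "uniform parameterization system" as uniform crossover and express arc addition/removal as a bit flip; both readings are our interpretation of the description above, not the authors' implementation.

```python
import random

def mutate(chromosome, p_mut=0.02):
    """Mutation: with probability p_mut per gene, add or remove a '1' (flip the bit),
    i.e. add or remove the corresponding arc of the DBN."""
    return [bit ^ 1 if random.random() < p_mut else bit for bit in chromosome]

def crossover(parent_a, parent_b, p_cross=0.8):
    """Uniform crossover of two DBN chromosomes with crossover probability p_cross."""
    if random.random() > p_cross:
        return parent_a[:], parent_b[:]
    mask = [random.random() < 0.5 for _ in parent_a]
    child_a = [a if m else b for a, b, m in zip(parent_a, parent_b, mask)]
    child_b = [b if m else a for a, b, m in zip(parent_a, parent_b, mask)]
    return child_a, child_b
```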
After the crossover and mutation operations, one evolutionary period is finished. Before starting the next evolutionary period, the following scenarios need to be considered: (1) the structure generated in the evolutionary period may be unrealistic. One way to deal with this problem is to check whether the new network is legal with an is-cyclic-or-not test [12] and assign an unreasonable network a lower fitness value. Alternatively, we could restrict the number of father nodes: for every node, we randomly select only k father nodes.
3.5 Setting the Genetic Parameters
The population size n is set in the range 30~160, and the crossover probability is set to 0.8~0.9. The mutation probability is lower and is set to 0.01~0.2. The choice of the number of generations depends on the network structure complexity and the required evolutionary precision; in general, the larger the number of generations, the better the structure.
3.6 Constructing Algorithm
The GA-GS can be described as follows: (1) Obtain the initial population C1, C2, …, Cj (j denotes the size of the population); if every chromosome is made of n gene bits, the encoding can be expressed as Cj = Xi | Π(Xi).
4 DBN Inference
After the structure and parameters of the DBN are found, new units are obtained according to the joint probability distribution of the associated network encoding, and this inference is carried out by a forward-simulation algorithm [11]. The algorithm consists of two steps: (1) compute the genetic (topological) order of the nodes; (2) produce all variable values of a new unit in the computed order. Given the values of a variable's father nodes, the value of the variable is drawn from the associated conditional probability distribution. After performing the second step, a new chromosome unit is generated.
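The two steps amount to ancestral sampling. A minimal sketch, assuming binary variables and conditional probability tables indexed by the tuple of father values, is shown below; the data layout is ours.

```python
import random

def forward_simulate(order, parents, cpts):
    """Forward-simulation inference: visit nodes in topological (genetic) order
    and sample each variable given the already-sampled values of its fathers.
    cpts[node][father_values] = P(node = 1 | father values); binary variables assumed."""
    values = {}
    for node in order:
        pa_vals = tuple(values[p] for p in parents[node])
        values[node] = 1 if random.random() < cpts[node][pa_vals] else 0
    return values
```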
5 Simulation
To verify the proposed BOA based DBN structure optimization algorithm, experiments are performed in MATLAB. In the experiments, we select a DBN with 8 variables, to be matched against a database, as the optimization object. The database is obtained from the BNT Structure Learning Package [12]. We compare the evolutionary curves of three algorithms,
which are shown in Fig. 5. In the experiments, the crossover probability is 0.8, the mutation probability is 0.02, and the maximum number of evolutionary generations is 1000. The simulation results show that, in the BOA based DBN optimization algorithm, the population fitness becomes stable in a shorter time, and the optimum fitness value stabilizes within 300 generations. As a comparison, PMEA requires 400 generations and GA needs 900 generations. In the BOA based algorithm, the optimum fitness values closely track the average fitness values. The experimental results indicate that the presented algorithm based on BOA has a better convergence property.
Fig. 5. Comparison of the structure optimization algorithms: (a) DBN structure found using GA; (b) DBN structure found using PMEA; (c) DBN structure found using BOA
6 Conclusion
The major contributions of this paper can be summarized as follows: (1) we developed an algorithm to construct a DBN structure matching a database, together with a DBN structure learning and constructing mechanism based on BOA; (2) we presented a DBN learning model based on a greedy-mechanism GA, covering DBN structure encoding, fitness function design and genetic operator design; (3) we verified the proposed algorithm by simulation experiments.
References
1. Sanghai, S., Domingos, P., Weld, D.: Relational Dynamic Bayesian Networks. Journal of Artificial Intelligence Research 2 (2005) 106-112
2. Kok, S., Domingos, P.: Learning the Structure of Markov Logic Networks. In: ICML'05, Bonn, Germany, ACM Press (2005)
3. Bernard, A., Hartemink, A.J.: Informative Structure Priors: Joint Learning of Dynamic Regulatory Networks from Multiple Types of Data. In: PSB (2005) 459-470
4. Harik, G.R., Lobo, F.G., Goldberg, D.E.: The Compact Genetic Algorithm. IEEE Trans. Evolutionary Computation 3 (1999) 287-297
5. Pelikan, M., Goldberg, D.E., Cantu-Paz, E.: The Bayesian Optimization Algorithm. Illinois, University of Illinois at Urbana-Champaign, IlliGAL Report 98013 (1998)
6. Pelikan, M., Goldberg, D.E., Cantu-Paz, E.: The Bayesian Optimization Algorithm, Population Sizing, and Time to Convergence. Illinois, University of Illinois at Urbana-Champaign, IlliGAL Report 2000003 (2000)
7. Husmeier, D.: Sensitivity and Specificity of Inferring Genetic Regulatory Interactions from Microarray Experiments with Dynamic Bayesian Networks. Bioinformatics 19 (2003) 2271-2282
8. Lauger, F.G.: Artificial Intelligence: Structures and Strategies for Complex Problem Solving. Fourth Edition. Addison Wesley (2002)
9. Očenášek, J.: Parallel Estimation of Distribution Algorithms. Doctoral Thesis. Brno University of Technology (2001)
10. Pelikan, M.: Bayesian Optimization Algorithm: From Single Level to Hierarchy. Doctoral dissertation, IlliGAL Report 2002023 (2002)
11. Pelikan, M., David, E.G., Sastry, K.: Bayesian Optimization Algorithm, Decision Graphs and Occam's Razor. IlliGAL Report 2000020 (2000)
12. Philippe, L. et al.: BNT Structure Learning Package [2005.5.23]. http://banquiseasi.insa-rouen.fr/projects/bnt-slp/ (2005)
An On-Line Learning Algorithm of Parallel Mode for MLPN Models D.L. Yu1, T.K. Chang1, and D.W. Yu2 1
Control Systems Research Group, Liverpool John Moores University, UK
[email protected] 2 Department of Automation, Northeast University at Qinhuangdao, China
Abstract. An on-line learning algorithm in parallel mode for multi-layer perceptron network (MLPN) model is proposed. The MLPN is on-line trained directly in a parallel mode. The on-line learning algorithm is based on the Extended Kalman Filter (EKF) algorithm. This network is able to learn the nonlinear dynamic behaviour of unknown time-varying systems and perform multistep-ahead prediction for control purpose. The performance of this model is evaluated in modelling a multi-variable non-linear continuous stirred tank reactor (CSTR).
1 Introduction The application of NNs in process modelling [1] and control [2] has been studied intensively in recent years. It has been proved that certain NN architectures, such as the MLPN and the Radial Basis Function network (RBFN), can represent any non-linear mapping between input and output with arbitrary accuracy, given suitable weighting factors and architecture [3]. For time-delay representation and model predictive control, multi-step-ahead prediction is necessary and important [4]. An accurate parallel model can widen the potential use of NNs in system identification and improve the performance of control strategies that incorporate a process model, such as non-linear predictive control. Therefore, a real-time parallel model for time-varying non-linear systems is in demand. A feed-forward network with external feedback from the network output has been used for dynamic modelling [3]; this kind of network is called a parallel network or external recurrent network. In most publications, e.g. [1], the NN-based parallel model is trained in series-parallel form. After sufficient training has reduced the modelling error to a desired level, the network is implemented in parallel form, where the past model output is fed back. However, this training approach is unsuitable for time-varying non-linear systems. Hence, in this paper, a new parallel MLPN learning approach is proposed. The proposed direct parallel MLPN on-line learning algorithm is based on the EKF algorithm. The learning approach is applied to an unknown non-linear CSTR process, and the simulation results show the effectiveness of the parallel training algorithm.
2 Parallel MLPN Model A general model structure suitable for representing the dynamics of a wide range of non-linear systems is the NARX model defined by
$$y(t) = f[u(t-k-1), \ldots, u(t-k-n_u), y(t-1), \ldots, y(t-n_y)] + e(t), \qquad (1)$$
where u(t) ∈ R^m and y(t) ∈ R^p are the sampled process input and output at time instant t, e(t) ∈ R^p is the equation error, n_u and n_y are the input order and output order respectively, k represents the process dead time, and f(.) is an unknown non-linear function to be identified. An on-line trained parallel MLPN model is configured to represent the non-linear function f(.). Thus, the parallel MLPN has the following structure:

$$y_{NN}(t) = NN[u(t-k-1), \ldots, u(t-k-n_u), y_{NN}(t-1), \ldots, y_{NN}(t-n_y)], \qquad (2)$$
where yNN(t)∈Rp is the NN output. nu and ny are determined by simulation experiment. In this paper, a standard three layers (one hidden layer) MLPN with q hidden layer nodes is used. The MLPN can be described as
$$y_{NN}(t) = W^y h(t) = W^y g(W^h x(t)), \qquad (3)$$
where g(.) is a sigmoid function and x is the input vector of the network given in (2)
3 EKF-Based On-Line Learning Algorithm
In order to train the parallel MLPN on-line, consider a parameter vector w that collects all the elements of both W^h(t) and W^y(t), where w(t) ∈ R^{(q·n + p·q) × 1}. The relationship of the vector w with the matrices W^h and W^y is formulated as follows:

$$w = \begin{bmatrix} w^h \\ w^y \end{bmatrix}, \qquad (4)$$

with

$$w^h = \begin{bmatrix} (W_1^h)^T \\ \vdots \\ (W_q^h)^T \end{bmatrix}, \qquad w^y = \begin{bmatrix} (W_1^y)^T \\ \vdots \\ (W_p^y)^T \end{bmatrix},$$

where W_i^h is the i-th row of W^h and W_i^y is the i-th row of W^y. Considering the weight matrices of the MLPN to be constant, the parallel MLPN model with the unknown parameter vector w can be described as follows:

$$w(t+1) = w(t), \qquad (5)$$
$$y(t) = y_{NN}(t) + e_m(t), \qquad (6)$$
where e_m(t) ∈ R^p is the modelling error. The EKF technique offers non-linear parameter estimation and is used here to estimate the unknown parameter w(t). The EKF-based parallel MLPN on-line learning algorithm is derived as

$$w(t) = w(t-1) + K(t)\left[y(t) - y_{NN,t|t-1}\right], \qquad (7)$$
$$E(t) = [I - K(t)C(t)]E(t-1), \qquad (8)$$
$$K(t) = E(t-1)C(t)^T\left[R(t) + C(t)E(t-1)C(t)^T\right]^{-1}, \qquad (9)$$

where

$$y_{NN,t|t-1} = NN[w(t-1), x(t)], \qquad C(t) = \frac{\partial y_{NN,t|t-1}}{\partial w}.$$
R(t) is an unknown a priori covariance matrix and can be estimated on-line by using the following recursion [5]:

$$R(t) = R(t-1) + \frac{1}{t}\left\{\left[y(t) - y_{NN,t|t-1}\right]\left[y(t) - y_{NN,t|t-1}\right]^T - R(t-1)\right\}. \qquad (10)$$
C(t) can be expressed as follows:

$$\frac{\partial y_{NN,t|t-1}}{\partial w} = \left[\frac{\partial y_{NN,t|t-1}}{\partial w^h} \quad \frac{\partial y_{NN,t|t-1}}{\partial w^y}\right]. \qquad (11)$$

The term $\partial y_{NN,t|t-1}/\partial w^h$ can be expressed as

$$\frac{\partial y_{NN,t|t-1}}{\partial w^h} = W^y(t-1)\frac{\partial h_{t|t-1}}{\partial w^h} = W^y(t-1)\,\mathrm{diag}\!\left\{\frac{\partial h_{t|t-1}}{\partial z_{t|t-1}}\right\}\frac{\partial W^h(t-1)x(t)}{\partial w^h}, \qquad (12)$$

where $h_{t|t-1} = g[z_{t|t-1}]$, $z_{t|t-1} = W^h(t-1)x(t)$, and

$$\frac{\partial W^h(t-1)x(t)}{\partial w^h} = \begin{bmatrix} x(t)^T & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & x(t)^T \end{bmatrix}_{q\times(q\cdot n)} + W^h(t-1)\begin{bmatrix} 0_{(m\cdot n_u)\times(q\cdot n)} \\ \partial y_{NN}(t-1)/\partial w^h \\ \vdots \\ \partial y_{NN}(t-n_y)/\partial w^h \end{bmatrix}. \qquad (13)$$

The term $\partial y_{NN}(t)/\partial w^h$ is needed in the calculation of Eq. (13) at the next time step and can be determined by replacing W^h(t-1) and W^y(t-1) in $\partial y_{NN,t|t-1}/\partial w^h$ with W^h(t) and W^y(t), respectively. The term $\partial y_{NN,t|t-1}/\partial w^y$ can be expressed as

$$\frac{\partial y_{NN,t|t-1}}{\partial w^y} = \begin{bmatrix} (h_{t|t-1})^T & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & (h_{t|t-1})^T \end{bmatrix}_{p\times(p\cdot q)} + W^y(t-1)\,\mathrm{diag}\!\left\{\frac{\partial h_{t|t-1}}{\partial z_{t|t-1}}\right\} W^h(t-1)\begin{bmatrix} 0_{(m\cdot n_u)\times(p\cdot q)} \\ \partial y_{NN}(t-1)/\partial w^y \\ \vdots \\ \partial y_{NN}(t-n_y)/\partial w^y \end{bmatrix}. \qquad (14)$$

The term $\partial y_{NN}(t)/\partial w^y$ is needed in the calculation of Eq. (14) at the next time step and can be determined by replacing W^h(t-1) and W^y(t-1) in $\partial y_{NN,t|t-1}/\partial w^y$ with W^h(t) and W^y(t), respectively.
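A compact sketch of one EKF update (Eqs. (7)-(10)) is given below; computing the Jacobian C(t) recursively via Eqs. (11)-(14) is left to the caller, and the function name and interface are ours.

```python
import numpy as np

def ekf_step(w, E, R, y, y_pred, C, t):
    """One EKF update of the MLPN weight vector.
    w: weight vector, E: weight error covariance, R: output noise covariance,
    y: measured output, y_pred: network prediction y_NN(t|t-1), C: dy_NN/dw."""
    innov = y - y_pred
    R = R + (np.outer(innov, innov) - R) / t              # Eq. (10)
    K = E @ C.T @ np.linalg.inv(R + C @ E @ C.T)          # Eq. (9)
    w = w + K @ innov                                     # Eq. (7)
    E = (np.eye(len(w)) - K @ C) @ E                      # Eq. (8)
    return w, E, R
```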
4 Parallel MLPN Applications on CSTR
The CSTR process is a typical dynamic process used in the chemical and biochemical industries. A second-order endothermic chemical reaction 2A → B takes place in the tank. The process works in the following way: a solution of substance A flows into the tank, the chemical reaction takes place in the tank depending on the temperature, and the reaction absorbs heat energy. As a result, the reaction influences the temperature, the concentration of the outflow solution and the liquid level. With the liquid level kept constant, the two inputs and the two outputs are chosen as follows:

$$u = \begin{bmatrix} q_i \\ T_i \end{bmatrix}, \qquad y = \begin{bmatrix} c \\ T_r \end{bmatrix}.$$
The following equations can be derived to describe the process dynamics:

$$\frac{dc(t)}{dt} = \frac{1}{Ah}\left\{c_i(t)q_i(t) - \frac{h}{R_v}c(t) - 2Ahk_o c^2(t)\exp\!\left[-\frac{E_A}{RT_r(t)}\right]\right\}, \qquad (15)$$

$$\frac{dT_r(t)}{dt} = \frac{1}{\rho_r\sigma_r Ah}\left\{\rho_r\sigma_r\left[q_i(t)T_i(t) - \frac{h}{R_v}T_r(t)\right] - \Delta H\,Ahk_o c^2(t)\exp\!\left[-\frac{E_A}{RT_r(t)}\right] + U_1[T_h - T_r(t)] - U_2[T_r(t) - T_x]\right\}. \qquad (16)$$
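For reference, the right-hand sides of Eqs. (15)-(16) could be coded as below. The physical constants are not given in this excerpt, so they are passed in as a parameter dictionary; the function name and interface are illustrative only.

```python
import numpy as np

def cstr_derivatives(c, Tr, qi, Ti, p):
    """Right-hand sides of Eqs. (15)-(16); p holds the physical constants
    (A, h, Rv, ko, EA, R, rho_r, sigma_r, dH, U1, U2, Th, Tx, ci), whose
    values are not specified in this excerpt."""
    rate = p["ko"] * c**2 * np.exp(-p["EA"] / (p["R"] * Tr))   # reaction term
    Ah = p["A"] * p["h"]
    dc = (p["ci"] * qi - (p["h"] / p["Rv"]) * c - 2 * Ah * rate) / Ah
    dTr = (p["rho_r"] * p["sigma_r"] * (qi * Ti - (p["h"] / p["Rv"]) * Tr)
           - p["dH"] * Ah * rate
           + p["U1"] * (p["Th"] - Tr) - p["U2"] * (Tr - p["Tx"])) \
          / (p["rho_r"] * p["sigma_r"] * Ah)
    return dc, dTr
```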
The ranges of the input variables are q_i(t) ∈ [2, 5] (1/sec) and T_i(t) ∈ [273, 480] (Kelvin). In the simulation investigation, n_u = 1, n_y = 3 and k = 0 in Eq. (2) are determined, and 10 hidden layer nodes are found to give the best parallel MLPN modelling structure for the CSTR:

$$y_{NN}(t) = NN_{(8:10:2)}[u(t-1), y_{NN}(t-1), \ldots, y_{NN}(t-3)]. \qquad (17)$$
Fig. 1. Parallel MLPN prediction
Two MLPNs with the same structure as in (17) are trained: one in series-parallel mode and the other in the direct parallel mode. Both MLPNs learn on-line from the first 1000 samples of data and are then used to predict in parallel mode (2) for the remaining 1000 samples. The prediction results of the parallel model are shown in Fig. 1. The series-parallel model is also used to predict the data. The sum squared errors (SSE) of the predictions for both models are: for c(t), SSE_parallel = 0.02986, SSE_series = 0.19794; for T_r(t), SSE_parallel = 0.13151, SSE_series = 0.22305. The parallel MLPN has a much smaller prediction error than the series-parallel MLPN.
5 Conclusions
In this paper, a direct on-line learning algorithm for the MLPN in parallel mode is proposed. The new on-line learning algorithm is based on the EKF parameter estimation algorithm, and some partial derivatives are computed recursively. The multi-step-ahead prediction performance is tested on a non-linear dynamic CSTR process. The results show that the parallel MLPN trained in the direct parallel mode is much more accurate than that trained in the series-parallel mode. The proposed parallel MLPN on-line learning algorithm is very useful for multi-step-ahead predictive control of time-varying non-linear systems that involve long time delays.
References
1. Tan, Y.H., Cauwenberghe, A.V.: Non-linear One-step-ahead Control using Neural Networks: Control Strategy and Stability Design. Automatica 32 (1996) 1701-1706
2. Hunt, K.J., Sbarbaro, D.: Neural Networks for Non-linear Internal Model Control. IEE Proc. Part D: Control Theory and Applications 138 (1991) 431-438
3. Narendra, K.S., Parthasarthy, K.: Identification and Control of Dynamical Systems using Neural Networks. IEEE Transactions on Neural Networks 1 (1990) 4-27
4. Yu, D.L., Gomm, J.B.: Implementation of Neural Network Predictive Control to a Multivariable Chemical Reactor. Control Engineering Practice 11 (2003) 1315-1323
5. Ljung, L., Soderstrom, T.: Theory and Practice of Recursive Identification. Cambridge, MA: M.I.T. (1983)
An Robust RPCL Algorithm and Its Application in Clustering of Visual Features Zeng-Shun Zhao, Zeng-Guang Hou, Min Tan, and An-Min Zou Key Laboratory of Complex Systems and Intelligence Science, Institute of Automation, The Chinese Academy of Sciences, P.O. Box 2728, Beijing 100080, China
[email protected], {zengguang.hou,min.tan}@ia.ac.cn
Abstract. Clustering in the neural-network literature is generally based on the competitive learning paradigm [4]. This paper presents a new clustering algorithm that is robust against initialization and can meanwhile find the natural prototypes in the input data; in particular, it can partly handle the problems that the Rival Penalized Competitive Learning (RPCL) algorithm has. Simulation results on synthesized data sets show that the proposed method is effective and robust. The application of the proposed robust RPCL algorithm to the indexing of visual features is discussed.
1 Introduction
Clustering in the neural-network literature is generally based on the competitive learning paradigm [4, 5, 7], where the prototypes correspond to the weights of neurons. The purpose of competition is to find the one prototype (winner) that is closest to the sample vector, and the winner is shifted toward the sample vector according to the weight update rule. The learning scheme, according to which the winner is found by the minimum-distance rule and is shifted toward the sample vector while other prototypes remain unchanged, is referred to as the winner-take-all (WTA) learning scheme. Classic competitive learning can be characterized as the WTA learning scheme. Grossberg's adaptive resonance theory (ART) [3] series and Kohonen's Self-Organizing Feature Mapping (SOFM) [1, 2] are two main developments of classical competitive learning. However, the conventional competitive learning algorithm is associated with two major issues, namely a strong dependence on initialization and the lack of any guarantee that the resulting clusters are natural prototypes. WTA can possibly lead to the problem of "dead nodes": if initial prototypes are located close to each other and far away from the sample data, some important clusters may be missed [4, 8]. Actually, the problem of "dead nodes" is one kind of sensitivity to initialization. In general, selecting the appropriate number of prototypes is a difficult task, as we do not really know anything about the density or the compactness of our clusters in many cases. The traditional approach to determining the natural number of clusters is to pick the value that optimizes a global validity measure. Such
validity measures can be difficult to devise, since one needs to take into account such unknowns as the variability in cluster shape, density, and size. Fuzzy Competitive Learning (FCL) can handle the problem of noise and outliers and the case of overlapping clusters, but it is still sensitive to the initialization of the prototype locations and to the other parameters used by the algorithm. Another solution to tackle the problem of dependence on initialization is Frequency Sensitive Competitive Learning (FSCL) [10]. FSCL lets the less frequent winners increase their chance to win the next time, so it can solve the problem of dead units well. However, when an inappropriate prototype is taken as the winner, one may encounter the problem of "shared clusters", meaning that a number of prototypes are attracted by the same cluster. The rival penalized competitive learning (RPCL) algorithm was designed by L. Xu [8] to solve this problem of "shared clusters". The basic idea of RPCL is to shift the second-place winner, the rival, away from the sample vector, preventing its interference in the competition. Thus, even when the initial number of units is larger than the actual number of clusters, RPCL can converge the units to the actual centers of the input pattern. However, RPCL has some problems of its own [4]. The rest of the paper is organized as follows. The problems to be tackled and the strategy we propose are presented in the following section. Details of the proposed algorithm, which is more robust against initialization and the penalizing rate, are described in Section 3, together with a summary. Section 4 gives a detailed simulation on a synthesized data set, and the application to clustering of visual features is discussed in the following section. Finally, conclusions are drawn in Section 6.
2 Problem Presentation and Strategy of the Proposed Method
In many cases, we do not have a priori knowledge about the distribution of the data points. And in cases of classification of clusters with largely different sizes, a well-partitioned large cluster may still have the largest distortion, which may trick CL into generating or reserving redundant prototypes to share this cluster. A well-known critical problem of competitive learning is the difficulty in determining the optimum number of clusters. Some researchers focus on avoiding this problem by employing a certain validity measure involving variations of a splitting strategy, an agglomeration strategy, or combinations of these two strategies. Such algorithms include self-splitting competitive learning (SSCL) [5], robust competitive agglomeration (RCA) [11], etc. However, how to choose this validity measure is another largely unsolved problem because of the lack of prior knowledge about the input pattern. Moreover, even if the ideal number of clusters is given, there is no guarantee that the clusters found correspond to natural clusters in the input data. Many algorithms impose an artificial structure on the data set; that is, such algorithms depend on the formulation of the objective function. A natural cluster might be erroneously divided into several parts, while at the same time several natural classes are wrongly merged into one cluster. So even when the optimum
number of classes is known beforehand, it is not guaranteed that natural clusters will be obtained. Although RPCL shows many advantages in the competitive learning literature, it has its own inherent problems: the penalizing rate should be selected appropriately, otherwise the resulting prototypes will not converge to the natural classes; when different groups in the input data have largely different volumes, overlap heavily, or the input vectors contain dependent components, the performance of RPCL degrades rapidly. RPCL is also affected to some extent by the initial number of cluster centers. We improve RPCL with the following considerations:
(1) Re-scale the original data: each variable of the feature vectors is normalized.
(2) To eliminate the correlation between variables, principal component analysis (PCA) is applied to the data set.
(3) The whole data set is divided by the predefined number of clusters. In each subset, a representative prototype is selected randomly.
(4) The learning rate and penalizing rate change gradually within a reasonable range as the learning cycles increase.
(5) After learning is finished, we attempt to merge similar clusters according to the variance of each cluster, until all remaining clusters are distinct from each other.
3 Details of Proposed Algorithm
The RPCL algorithm is modified to overcome its sensitivity to initialization and its dependence on an appropriate penalizing rate. It is also desirable to obtain better performance in cases where different clusters overlap heavily, or in cases of clusters with various shapes, densities, etc. In practice it is hard to select a correct number of clusters in a complex high-dimensional feature space.
1. Preprocessing: normalization and transformation by PCA. After each variable of the feature vectors is normalized, PCA is applied. The goal of applying the PCA transformation to the sample set is to eliminate the correlation between variables, which partly alleviates the problems of the original RPCL algorithm. At the same time, performing dimensionality reduction with minimum loss of discriminant information accelerates the learning process.
2. Predefine a reasonably large value N for the number of prototypes.
3. The whole data set is divided into N subsets. In each subset, a representative prototype is selected randomly. We refer to this strategy as "limited randomness", which decreases the chance of inappropriate initialization.
4. Randomly select a sample x from the dataset D and, for i ∈ [1, k], determine the winner and the rival according to their distances to the cluster centers:
$$i = \begin{cases} c, & \text{if } \gamma_c \|x - \omega_c\|^2 = \min_j \gamma_j \|x - \omega_j\|^2, \\ r, & \text{if } \gamma_r \|x - \omega_r\|^2 = \min_{j \neq c} \gamma_j \|x - \omega_j\|^2, \end{cases} \quad (1)$$
where $\gamma_c$ and $\gamma_r$ are the winning frequencies of units $c$ and $r$, with $\gamma_j = n_j / \sum_{i=1}^{k} n_i$, and $n_i$ is the cumulative number of occurrences of $\mu_i = 1$. $\alpha_c$ and $\alpha_r$ are the learning rates for the winner and the rival, respectively, usually $\alpha_c \gg \alpha_r$.
5. For this sample, identify the winner and the rival. Let
$$\mu_i = \begin{cases} 1, & \text{if } i = c, \\ -1, & \text{if } i = r, \\ 0, & \text{otherwise}. \end{cases} \quad (2)$$
6. Update the weight vector $\omega_i$:
$$\Delta\omega_i = \begin{cases} \alpha_c (x - \omega_i), & \text{if } \mu_i = 1, \\ -\alpha_r (x - \omega_i), & \text{if } \mu_i = -1, \\ 0, & \text{otherwise}. \end{cases} \quad (3)$$
7. Merge two clusters if they are sufficiently similar. First it is necessary to decide which clusters are close enough to be reasonably agglomerated, and a criterion is proposed to determine whether this step should be performed. As above, $\omega_i$ is defined as the weight vector of cluster $i$; in this case it is also the location of the prototype resulting from the previous step. Let $\sigma_i$ be its standard deviation. The merging condition is the following criterion:
$$\|\omega_i - \omega_j\| \le \tfrac{1}{2}(\sigma_i + \sigma_j). \quad (4)$$
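The procedure above can be summarized in a minimal sketch, assuming fixed learning and penalizing rates instead of the gradually changing schedule used later, a per-cluster standard deviation for the merging test of Eq. (4), and illustrative names (robust_rpcl, wins) that are not from the paper; the normalization and PCA preprocessing of steps 1-2 are omitted.

```python
import numpy as np

def robust_rpcl(X, n_prototypes=6, alpha_c=0.05, alpha_r=0.005, n_epochs=30, seed=None):
    """Sketch of the modified RPCL loop: frequency-weighted winner/rival
    selection, winner attraction, rival penalization, and a final merging pass."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # "limited randomness" initialization: one prototype drawn from each data subset
    subsets = np.array_split(rng.permutation(n), n_prototypes)
    W = np.array([X[rng.choice(s)] for s in subsets], dtype=float)
    wins = np.ones(n_prototypes)                       # cumulative win counts n_i
    for _ in range(n_epochs):
        for x in X[rng.permutation(n)]:
            gamma = wins / wins.sum()                  # winning frequencies gamma_j
            dist = gamma * np.sum((W - x) ** 2, axis=1)
            c = int(np.argmin(dist))                   # winner, Eq. (1)
            dist[c] = np.inf
            r = int(np.argmin(dist))                   # rival (second winner)
            W[c] += alpha_c * (x - W[c])               # attract winner, Eq. (3)
            W[r] -= alpha_r * (x - W[r])               # penalize rival
            wins[c] += 1
    # merging pass: fuse prototypes that satisfy the criterion of Eq. (4)
    labels = np.argmin(((X[:, None, :] - W[None, :, :]) ** 2).sum(-1), axis=1)
    sigma = np.array([X[labels == k].std() if np.any(labels == k) else 0.0
                      for k in range(n_prototypes)])
    keep, merged = list(range(n_prototypes)), True
    while merged:
        merged = False
        for a in keep:
            for b in keep:
                if a < b and np.linalg.norm(W[a] - W[b]) <= 0.5 * (sigma[a] + sigma[b]):
                    W[a] = 0.5 * (W[a] + W[b])         # fuse the pair
                    keep.remove(b)
                    merged = True
                    break
            if merged:
                break
    return W[keep]
```

The frequency weighting implements the conscience mechanism of Eq. (1), while the final loop repeatedly fuses any prototype pair that meets the merging criterion until all remaining clusters are distinct.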
4 Simulations on Synthesized Data Set
The examples used in this section consist of four groups of synthesized two-dimensional data. The initial value of the learning rate $\gamma_c$ was set to 0.03. Initially, 6 seeds were selected according to the "limited randomness" strategy. In all examples, we tried many different locations of the 6 initial prototypes to demonstrate the robustness. Fig. 1 shows a synthetic data set consisting of 4 Gaussian clusters of various sizes with variance 0.1: one group of 20 sample points and three other clusters of 50 samples each. In Fig. 1, the "◦" signs indicate the initial locations of the prototypes, the "∗" signs indicate the locations of the prototypes after learning, and the pentagrams mark the final cluster centers after the merging step. The 6 curves painted in various colors represent the evolving trajectories of the 6 initial seeds. These curves all start at the respective initial locations indicated by the "◦" signs and end where the "∗" signs are located. We test the proposed algorithm on this data set under two conditions. In the first case, the learning rate $\gamma_c$ changes gradually from 0.05 to 0.005 and the penalizing rate $\gamma_r$ from 0.005 to 0.001. Fig. 1 shows the initial 6 prototypes superimposed on the data set. The 6 evolving trajectories give the results when the balance of the composite effect of competition and penalization on all 6 initial prototypes is achieved. Whereas competitive learning can be considered a positive reinforcement that encourages the winner to represent the sample vector, the penalizing mechanism on the
Fig. 1. Simulation result when the merging condition is met
Fig. 2. Simulation result when 2 seeds are repelled
contrary can be treated as a negative reinforcement that drives a seed away from the sample vector. In this condition no prototypes drift away, as a result of under-penalization: the penalizing impact on the redundant prototypes (seeds) is insufficient. Consequently, with the proposed preprocessing steps plus the original RPCL algorithm, the resulting clusters cannot correspond to the distribution of the natural data samples. But according to the modified algorithm, the top-right two prototypes merge into one cluster, and the bottom-right two merge into one as well. Thus we obtain the right number of classes and nearly exactly the right locations of the four prototypes. This means that even with inappropriate parameters for RPCL, the proposed algorithm can still work.
Fig. 3. The case where repelling and merging conditions appear at the same time
Fig. 4. Repelling and merging conditions appear at the same time when 12 prototypes are given as the initial seeds
After 15 iterations, the resulting prototypes show no significant change as the number of iterations increases. This means that the method is not as sensitive to the choice of learning cycles (iterations) as the original RPCL. Fig. 2 shows the learning result on the same data set with the same 6 initial prototypes as in Fig. 1, but using different learning parameters. In this case, fixed learning and penalizing rates are adopted instead of parameters that change with the learning cycles. The learning rate $\gamma_c$ is 0.05, and the penalizing rate $\gamma_r$ is tried over many different values ranging from 0.0001 to 0.02. With the increase of $\gamma_r$, the
Fig. 5. Examples of SIFT features in a reference image in the database
Fig. 6. Examples of SIFT features in an adjacent view to Fig. 5
penalizing effect becomes more significant. It can be seen in Fig. 2 that two seeds were repelled out when $\gamma_r = 0.013$ and the learning cycles reached 18, while the other four seeds work reliably, so that those prototypes can represent the distribution of the data samples. No prototypes meet the merging criterion. Fig. 3 gives the case where the drifting and merging conditions occur at the same time. The top-right two prototypes merge into one cluster, and the blue
Fig. 7. Examples of SIFT features clustered by the proposed RPCL algorithm
learning trajectory shows how this redundant prototype is pushed out of the range of the sample vectors. The penalizing rate $\gamma_r$ is 0.013 and the learning cycles reach 18. Fig. 4 shows the evolving trajectories of 12 initial prototypes superimposed on the same data set when the learning cycles reach 30. In this case, the learning rate $\gamma_c$ changes gradually from 0.01 to 0.005 and the penalizing rate $\gamma_r$ from 0.005 to 0.001. Although two seeds are pushed far away from the natural distribution as the effect of penalization, in this condition there are still 6 redundant seeds oscillating among the samples, which means that the penalizing impact on the redundant seeds is insufficient. According to the modified algorithm, the top-right four prototypes, the bottom-right two, and the top-left two meet the merging criterion. Thus the natural structure is revealed, and at the same time we obtain nearly exactly the right centers of the 4 classes of different sizes. This also shows that the proposed algorithm can overcome the shortage of conventional RPCL in the case of inappropriate learning parameters.
5 Discussion on Robust RPCL Based Clustering of Similar Features
In the literature of content-based information retrieval in image databases, a good indexing structure for the extracted visual features is very important for computationally efficient matching. Without a properly designed indexing structure, the retrieval of information is reduced to a linear exhaustive search. The situation gets worse when dealing with large databases. It is often argued that local feature-based representations are more robust to scene dynamics and partial occlusion than globally derived representations [12]. A common approach to obtaining features with the above attributes is known as "key-point" or "interest point" extraction. With key-points detected across images, correspondences/matches between a test image and the models can be
found by comparing their descriptors. In this paper, the feasibility of applying the proposed robust RPCL algorithm to place recognition is explored. The SIFT (Scale Invariant Feature Transform) [13, 14] detector and descriptor are adopted as local visual features for their effective invariance properties. The modified RPCL algorithm is used to build the indexing structure of the local feature database with respect to individual places. We may view the feature vectors extracted from the model images in the image database as points in a high-dimensional vector space, and consider each query local feature as a random variable. The task of image retrieval is then divided into sub-tasks that retrieve each local feature vector extracted from the query image. For nearest neighbor retrieval, it is natural to group similar feature vectors together, which leads to clustering based on feature similarity. According to the indexing structure, a nearest neighbor query x is compared to all the cluster centers; all the vectors belonging to the cluster whose center is closest to x are retrieved, and the result is then obtained via a voting scheme. As analyzed above, when all the features associated with the same meaning are grouped together into the same data set, the very similar features should be clustered and the redundant features discarded. But the unknown number of clusters cannot be determined manually, so the robust RPCL algorithm can be adopted to build the indexing structure of local features. Fig. 5 and Fig. 6 give some examples of SIFT features in two different views of the same place; they also demonstrate the viewpoint insensitivity of SIFT features. All the reference images of the same location unsurprisingly possess repetitive local visual features. Fig. 7 shows examples of SIFT features clustered by the proposed RPCL algorithm. By using the robust RPCL algorithm to cluster similar features, not only can most unnecessary matchings be removed, but we also gain the advantage of a reduction in the overall volume of the model database.
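As a rough sketch of the retrieval scheme just described — compare a query feature only against the RPCL cluster centers, search inside the winning cluster, and vote over image identities — one might write something like the following; all function and argument names are illustrative assumptions, not from the paper.

```python
import numpy as np

def retrieve_by_cluster_index(query_descriptors, centers, indexed_descriptors, indexed_image_ids):
    """Sketch of cluster-indexed nearest-neighbour voting.

    centers            : cluster prototypes found by the robust RPCL algorithm
    indexed_descriptors: list of 2-D arrays, database SIFT descriptors grouped by cluster
    indexed_image_ids  : list of arrays, image id of each descriptor (same grouping)
    """
    votes = {}
    for q in query_descriptors:
        # compare the query feature to the cluster centers only
        k = int(np.argmin(np.sum((centers - q) ** 2, axis=1)))
        cand, ids = indexed_descriptors[k], indexed_image_ids[k]
        if len(cand) == 0:
            continue
        # exhaustive search restricted to the selected cluster
        j = int(np.argmin(np.sum((cand - q) ** 2, axis=1)))
        votes[ids[j]] = votes.get(ids[j], 0) + 1
    # the image that accumulates the most matched features wins the vote
    return max(votes, key=votes.get) if votes else None
```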
6 Conclusions
In this paper, we have described a new clustering algorithm that is insensitive to initialization, cluster sizes, and learning parameters. Simulation results on synthesized data sets show that the proposed method is effective and robust. The application of the proposed robust RPCL algorithm to indexing of visual features has been discussed.
Acknowledgements. This research has been supported in part by the National Natural Science Foundation of China (Grant Nos. 60205004, 60635010, 50475179 and 60334020), the National Basic Research Program (973) of China (Grant No. 2002CB312200), and the Hi-Tech R&D Program (863) of China (Grant Nos. 2006AA04Z258 and 2005AA420040).
References
1. Kohonen, T.: Self-Organizing Maps. 2nd edn. Springer-Verlag, Berlin, Germany (1997)
2. Pal, N.R., Bezdek, J.C., Tsao, E.C.: Generalized Clustering Networks and Kohonen's Self-organizing Scheme. IEEE Transactions on Neural Networks 4 (1993) 549-557
3. Carpenter, G.A., Grossberg, S., Reynolds, J.H.: ARTMAP: Supervised Real-time Learning and Classification of Nonstationary Data by a Self-organizing Neural Network. Neural Networks 3 (1990) 129-152
4. Liu, Z.Q., Glickman, M., Zhang, Y.J.: Soft-competitive Learning Paradigms. In: Soft Computing and Human-Centered Machines. Springer-Verlag, New York (2000) 131-161
5. Zhang, Y.J., Liu, Z.Q.: Self-splitting Competitive Learning: A New On-line Clustering Paradigm. IEEE Transactions on Neural Networks 13 (1993) 369-380
6. Diamantaras, K.I., Kung, S.Y.: Principal Component Neural Networks: Theory and Applications. John Wiley and Sons, New York (1996)
7. Haykin, S.: Neural Networks: A Comprehensive Foundation. 2nd edn. Prentice-Hall, New Jersey (1999)
8. Xu, L., Krzyzak, A., Oja, E.: Rival Penalized Competitive Learning for Clustering Analysis, RBF Net, and Curve Detection. IEEE Transactions on Neural Networks 4 (1993) 636-648
9. Chung, F.L., Lee, T.: Fuzzy Competitive Learning. Neural Networks 7 (3) (1994) 539-551
10. Ahalt, S.C., Krishnamurty, A.K., Chen, P., Melton, D.E.: Competitive Learning Algorithms for Vector Quantization. Neural Networks 3 (3) (1990) 277-291
11. Frigui, H., Krishnapuram, R.: A Robust Competitive Clustering Algorithm with Applications in Computer Vision. IEEE Transactions on Pattern Analysis and Machine Intelligence 21 (1999) 450-465
12. Mikolajczyk, K., Schmid, C.: A Performance Evaluation of Local Descriptors. Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition 2 (2003) 257-263
13. Lowe, D.: Object Recognition from Local Scale Invariant Features. International Conference on Computer Vision 2 (1999) 1150-1157
14. Lowe, D.: Distinctive Image Features from Scale-invariant Keypoints. International Journal of Computer Vision 60 (5) (2004) 91-110
An Evolutionary RBFNN Learning Algorithm for Complex Classification Problems

Jin Tian, Minqiang Li, and Fuzan Chen

School of Management, Tianjin University, Tianjin 300072, P.R. China
[email protected]
Abstract. A self-optimizing approach for complex classifications is proposed in this paper to construct dynamical radial basis function neural network (RBFNN) models based on a specially designed genetic algorithm (GA). The algorithm adopts a matrix-form mixed encoding and specifically designed genetic operators to optimize the decay-radius selected clustering (DRSC) process by co-evolving all of the parameters of the network's layout. The individual fitness is evaluated as a multi-objective optimization task, and the weights between the hidden layer and the output layer are calculated by the pseudo-inverse algorithm. Experimental results on eight UCI datasets show that the GA-RBFNN can produce a higher classification accuracy with a much simpler network structure and outperform neural network models based on other training methods.
1 Introduction

The Radial Basis Function Neural Network (RBFNN) can be viewed as a three-layer feed-forward neural network with linear output mapping. Many different variants of RBFNNs are used for classification tasks due to a number of advantages compared with other types of artificial neural networks (ANN), such as better classification capabilities, simpler network structures, and faster learning algorithms. The RBFNN in most classification applications should be pre-specified, and the main difficulty in its configuration lies in the determination of the hidden layer structure. A variety of approaches have been developed [1,2]. However, most of them suffer from the fact that the formulation of the hidden layer is based on local search methods and cannot guarantee the selection of the optimal number of hidden nodes. Compared with conventional algorithms, evolutionary algorithms (EAs) are usually considered more efficient in the sense that EAs provide a higher opportunity for obtaining the globally optimal solution because of their robust global search capabilities. The most popular EA is the Genetic Algorithm (GA). Recently, some studies have focused on training ANNs with the GA method, where the optimal solutions are obtained through a co-evolving process [3,4]. The proposed algorithm, named GA-RBFNN, is based on a specially designed GA with a matrix-form mixed encoding which includes the network centers, the radius widths and a control vector. The initial network is defined by a decay-radius selected clustering (DRSC) method, and the optimal scheme is iteratively obtained by
co-evolving the optimum structure of the hidden layer and the radius widths. The individual fitness is evaluated as a multi-objective optimization task, and the weights between the hidden layer and the output layer are calculated by the pseudo-inverse method. The GA-RBFNN is then tested on eight datasets from the UCI repository. The rest of the paper is structured as follows: Section 2 briefly presents the RBFNN architecture. In Section 3, the proposed algorithm is described in detail. Section 4 empirically illustrates the new algorithm's efficiency on eight UCI datasets in comparison with other training techniques. Finally, Section 5 summarizes the key points of the paper and presents the concluding remarks.
2 An Overview of RBFNN

A standard RBFNN forms a special feed-forward neural network architecture which consists of three layers, namely the input, hidden and output layers. The RBFNN topological structure defines the relation between the $m$-dimensional input vector $x \in R^m$ and the $n$-dimensional output vector $y \in R^n$: $f: x \to y$. The response of a hidden node is produced by the node activity (the Euclidean distance between the input vector and the center) through a radial basis function. The most commonly used radial basis function is the symmetrical Gaussian function:
$$\varphi_j(x) = \exp\left\{ -\frac{\|x - \mu_j\|^2}{\sigma_j^2} \right\} \quad (1)$$
where $x = (x_1, x_2, \ldots, x_m)^T$ is the input vector, $\mu_j$ is the center, and $\sigma_j$ is the radius width of the $j$th hidden node. Finally, the output layer is linear and the $i$th output may be expressed as a linear combination of the $k$ radial basis functions: $y_i = \sum_{j=1}^{k} w_{ji}\,\varphi_j(x)$.
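A minimal sketch of this mapping, assuming the Gaussian basis of Eq. (1) and illustrative argument names:

```python
import numpy as np

def rbfnn_forward(x, centers, widths, weights):
    """RBFNN mapping of Section 2 (Eq. (1) plus the linear output layer).

    x       : (m,)    input vector
    centers : (k, m)  hidden-node centers mu_j
    widths  : (k,)    radius widths sigma_j
    weights : (k, n)  output weights w_ji
    """
    # Gaussian activations phi_j(x) = exp(-||x - mu_j||^2 / sigma_j^2)
    phi = np.exp(-np.sum((centers - x) ** 2, axis=1) / widths ** 2)
    # linear output layer: y_i = sum_j w_ji * phi_j(x)
    return phi @ weights
```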
3 Configuration of RBFNN Using GA

The GA is an iterative stochastic methodology for complicated optimization problems based on the mechanics of natural selection and natural genetics. As the algorithm proceeds, the solutions represented by the individuals in the population are gradually improved [5]. In this paper, a special matrix-form mixed encoding method is proposed, and the traditional genetic operators have to be modified to maintain the special structure of the chromosome.

3.1 Matrix-Form Mixed Encoding
According to the characteristics of RBFNN, we design a matrix-form mixed encoding genotype representation, which assigns the hidden centers and the radius widths as real-valued encoding matrices and a control vector as a binary string as well.
The whole population contains $L$ individuals, $C_1, C_2, \ldots, C_L$, and each individual $C_l$ represents a matrix of size $D \times (m+2)$, an integrated RBFNN, as below:
$$C_l = [c^l \;\; \sigma^l \;\; b^l] = \begin{bmatrix} c^l_{11} & c^l_{12} & \cdots & c^l_{1m} & \sigma^l_1 & b^l_1 \\ c^l_{21} & c^l_{22} & \cdots & c^l_{2m} & \sigma^l_2 & b^l_2 \\ \vdots & \vdots & & \vdots & \vdots & \vdots \\ c^l_{D1} & c^l_{D2} & \cdots & c^l_{Dm} & \sigma^l_D & b^l_D \end{bmatrix} \quad (2)$$
where $L$ is the population size, $D$ is the maximum number of hidden nodes, and $m$ is the dimension of the input vectors. $c^l = [c^l_{ij}]_{D \times m}$ contains the hidden layer centers of the $l$th RBFNN, and $\sigma^l = [\sigma^l_i]_{D \times 1}$ holds the corresponding radius widths. $b^l = [b^l_1, b^l_2, \ldots, b^l_D]^T$ is the $D \times 1$ control vector, where $b^l_i = 0$ means the $i$th hidden node of the $l$th RBFNN is invalid and not considered in the design of the network structure, whereas $b^l_i = 1$ denotes that it is valid and stays in the network structure.

3.2 Initialization
The initial positions of the hidden layer centers, $[c^*_{ij}]_{d^* \times m}$, are generated by the decay-radius selected clustering (DRSC) method [6], where $d^*$ is the initial number of hidden nodes. The minimum and maximum values of a dimension are denoted by $x_{j,\min}$, $x_{j,\max}$ $(j = 1, 2, \ldots, m)$, and the initial population is created as $c^l_{ij} = c^*_{ij} + r^l_{ij} \cdot \delta x_j$, where $\delta x_j = (x_{j,\max} - x_{j,\min}) / m_n$, $m_n$ is a constant integer not smaller than 2, and $r^l_{ij}$ is a randomly generated real number with $r^l_{ij} \in [-1, 1]$. The training algorithm begins with the initial population, and the corresponding bits of the control vectors are assigned as 1. The initial values of $\sigma$ are calculated based on the K-means method [7], and the output weights of the RBFNN are calculated directly by the pseudo-inverse method [8].

3.3 Evaluation and Selection of the Chromosomes
The individual fitness is evaluated as a multi-objective optimization task in this algorithm because solutions based on Pareto optimality can guarantee the diversity [9] of an explorative population. In addition, it offers a framework for adding other task-oriented objectives to our model without modifying the general structure. Two objectives are selected here for the fitness evaluation and described as follows:
1) Classification accuracy: The first objective is commonly used in conventional supervised learning algorithms, and it is measured as the ratio of samples of the evaluating set classified correctly by the network: $A_r(C_l) = N_s(C_l)/N_e$, where $N_s(C_l)$ is the number of correctly classified samples of the evaluating set for $C_l$, and $N_e$ is the size of the evaluating set. However, similar accuracy rates give rise to little selection pressure in the population. To overcome this, a slightly modified objective is employed:
$$f_1(C_l) = \alpha (1 - \alpha)^{I(C_l)}, \quad (3)$$
where $I(C_l)$ is the sorting order of $C_l$'s classification accuracy $A_r(C_l)$ after all individuals are ranked from largest to smallest in terms of classification accuracy, and $\alpha$ is a pre-designed real number between 0 and 1 ($\alpha = 0.4$ in this paper).
2) Shared performance: This objective enforces the networks to classify different patterns [10]. In this way, the individuals that contribute to the network by accurately classifying samples that are incorrectly classified by many others are rewarded. Each sample in the evaluating set receives a weight, namely $w_i = R_i / L$, where $R_i$ is the number of individuals that classify the $i$th sample correctly and $L$ is the size of the population. The value assigned to an individual for this objective is given by
$$f_2(C_l) = \sum_{i=1}^{N_e} e_{li} \times (1 - w_i), \quad (4)$$
where $e_{li}$ is 1 if the $i$th sample is correctly classified by the network constructed by $C_l$, and 0 otherwise. This idea is similar to the boosting scheme, encouraging the learning of difficult patterns by raising their probability of being sampled. The weights remain unchanged during one run. In total, the fitness of an individual $C_l$ is calculated by aggregating formulas (3) and (4):
$$fit_l = a \times f_1(C_l) + b \times f_2(C_l), \quad (5)$$
where $a$ and $b$ are pre-defined coefficients of objective importance, $a, b \in [0,1]$ and $a + b = 1$. Roulette wheel selection is utilized and the selection probability is proportional to each individual's fitness, which makes it possible for the worst individual to be selected with a small probability and ensures that individuals with a higher selection probability get more copies in reproduction. Hence, the algorithm is able to keep diversity and avoid premature convergence efficiently. In addition, elitist selection is adopted so that the best solution always survives from one generation to the next.
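A small sketch of this two-objective fitness evaluation, assuming the exponential-ranking reading of Eq. (3), equal aggregation coefficients a = b = 0.5 (the paper does not state their values), and illustrative names:

```python
import numpy as np

def evaluate_population(predictions, targets, alpha=0.4, a=0.5, b=0.5):
    """predictions: (L, Ne) class labels produced by the L candidate networks
    on the Ne evaluating samples; targets: (Ne,) true labels."""
    L, Ne = predictions.shape
    correct = (predictions == targets[None, :])          # e_li indicator matrix
    accuracy = correct.mean(axis=1)                      # A_r(C_l)
    # objective 1: exponential ranking of classification accuracy, Eq. (3)
    rank = np.empty(L, dtype=int)
    rank[np.argsort(-accuracy)] = np.arange(L)           # 0 = most accurate individual
    f1 = alpha * (1.0 - alpha) ** rank
    # objective 2: shared performance, Eq. (4), with sample weights w_i = R_i / L
    w = correct.sum(axis=0) / L
    f2 = (correct * (1.0 - w)[None, :]).sum(axis=1)
    # aggregated fitness, Eq. (5)
    return a * f1 + b * f2
```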
3.4 Crossover

Uniform crossover is used in this paper. The individuals that undergo the crossover operation are grouped into pairs, and for every pair a binary mask string with the same length as the individual is generated randomly. Scanning the mask string from left to right, if the current bit is 1, then the gene at the corresponding position in the first parent is selected; otherwise, the gene at the corresponding position in the second one is selected. Thus one offspring is produced. The second offspring is produced similarly by repeating the process with the roles of 0 and 1 reversed in the mask string.
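A minimal sketch of this uniform crossover on flattened chromosomes, with illustrative names and NumPy as an assumed implementation choice:

```python
import numpy as np

def uniform_crossover(parent_a, parent_b, seed=None):
    """Uniform crossover of Section 3.4: a random binary mask picks each gene
    from one of the two parents; the complementary mask yields the second child."""
    rng = np.random.default_rng(seed)
    mask = rng.integers(0, 2, size=parent_a.shape).astype(bool)
    child1 = np.where(mask, parent_a, parent_b)   # mask bit 1 -> gene from parent A
    child2 = np.where(mask, parent_b, parent_a)   # roles of 0 and 1 reversed
    return child1, child2
```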
3.5 Mutation

The mutation is an auxiliary but also a significant operation. The standard mutation operators have to be modified to accommodate the special matrix-form mixed encoding structure. Here two different types of mutation are introduced.

3.5.1 Non-structure Mutation Operator

For every gene of a chromosome $C_l$, a random number $r \in [0,1]$ is selected. If $p_m > r$, the gene value is replaced by a new value. If the gene is binary, the operation inverts
the bit (if the original bit is 0, it is replaced by 1, and vice versa). If the gene is real-valued, it is replaced by a new value: $c^{l\,\prime}_{ij} = c^l_{ij} + r^{l\,\prime}_{ij} \cdot |c^l_{ij} - \bar{c}^l_{ij}|$, $\sigma^{l\,\prime}_{ij} = \sigma^l_{ij} + r^{l\,\prime}_{ij} \cdot |\sigma^l_{ij} - \bar{\sigma}^l_{ij}|$, where $c^l_{ij}$, $\bar{c}^l_{ij}$ are the hidden center values in the previous two iterations, $\sigma^l_{ij}$, $\bar{\sigma}^l_{ij}$ are the corresponding radius widths, and $r^{l\,\prime}_{ij}$ is a random number uniformly distributed on $[-1, 1]$. In order to ensure the validity of mutation, a dynamic mutation rate is used, i.e., an individual whose fitness value is above the average level is treated with a lower probability, while one below the average level is treated with a higher probability.

3.5.2 Structure Mutation Operator

Since the probability of crossover and non-structure mutation is usually low, we try to introduce some additional flexibility by using the so-called structure mutation operator, which can add or prune some hidden node centers to obtain a different network. A binary value $r_b$ and a real number $r \in (0,1)$ are generated randomly for each
chromosome $C_l$. If $r_b = 0$ and $p_{ad} > r$, all the genes below a randomly selected position are deleted, and the corresponding bits of the control vector are assigned as 0. If $r_b = 1$ and $p_{ad} > r$, a random number of non-zero vectors, $c^l_{ij} = x_{j,\min} + t_{ij} \cdot \delta x_j$, are used to replace an equal number of rows whose control vector bits were formerly 0, taking into account that the total number of rows should not exceed $D$; the relevant control bits are then set to 1. $t_{ij}$ is a constant integer between 1 and $m_n$, and the meanings of the other parameters are the same as those described in Section 3.2.
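The structure mutation can be sketched as follows, assuming a chromosome stored as a (D, m+2) array with the control bit in the last column; the helper name, the default value of mn and the way the add/delete branch is drawn are illustrative assumptions.

```python
import numpy as np

def structure_mutation(C, p_ad, x_min, x_max, mn=4, seed=None):
    """Structure mutation of Section 3.5.2: either invalidate the rows below a
    random position, or re-activate some invalid rows with fresh centers."""
    rng = np.random.default_rng(seed)
    D, cols = C.shape
    m = cols - 2
    if rng.random() >= p_ad:
        return C                                       # no structural change this time
    C = C.copy()
    if rng.integers(0, 2) == 0:
        # delete: clear the control bits of all rows below a random position
        pos = rng.integers(0, D)
        C[pos:, m + 1] = 0
    else:
        # add: give currently invalid rows new centers c_ij = x_j,min + t_ij * delta_x_j
        invalid = np.flatnonzero(C[:, m + 1] == 0)
        for i in rng.permutation(invalid)[: rng.integers(0, len(invalid) + 1)]:
            t = rng.integers(1, mn + 1, size=m)        # t_ij in [1, mn]
            C[i, :m] = x_min + t * (x_max - x_min) / mn
            C[i, m + 1] = 1
    return C
```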
4 Experimental Study

In order to evaluate the performance of GA-RBFNN, we applied the proposed method and conventional methods to eight UCI datasets. They are real-world problems with different numbers of available patterns, from 178 to 990, different numbers of classes, from 2 to 11, and different kinds of inputs. Each dataset was divided into three subsets: 50% of the patterns were used for learning, 25% of them for validation, and the remaining 25% for testing. There are two exceptions, the Sonar and Vowel problems, as the patterns of these two problems are prearranged in two subsets due to their specific features.

4.1 Experiment 1
The experiments were carried out with the aim of testing our GA-RBFNN model against some traditional training algorithms, such as the DRSC, K-means, the probabilistic neural network (PNN) and the K-nearest neighbor algorithm (KNN). These methods generate the RBFNN without a separate validation set, by merging the validation set into the training set. For each dataset, 30 runs of the algorithm were performed. The GA parameters were set as follows. The population size L was 50, and the number of generations G was 200. The probability of crossover pc was 0.5. The higher non-structure mutation rate pm1 was 0.4, the lower one pm2 was 0.2, and the structure mutation rate pad was 0.2. The average accuracies of classification and the average numbers of hidden nodes over the 30 runs are shown in Table 1.
Table 1. Comparison with other algorithms on eight UCI datasets. The accuracies of the GA-RBFNN on the evaluating set are omitted for simplicity of comparison, since the data were only divided into two subsets in the compared algorithms. The t-test compares the average testing accuracy of the GA-RBFNN with that of each compared algorithm.

Method             Cancer   Glass    Heart    Iono     Pima     Sonar    Vowel    Wines    Ave      Nc
GA-RBFNN  Train    0.9629   0.7318   0.8475   0.9368   0.7747   0.9369   0.8636   0.9790   0.8791   25.25
          Test     0.9688   0.6913   0.8172   0.9326   0.7625   0.7785   0.7371   0.9680   0.8320
DRSC      Train    0.9673   0.7988   0.8862   0.9191   0.7897   0.8147   0.7720   0.9753   0.8654   34.86
          Test     0.9671   0.6246   0.8015   0.9189   0.7372   0.7144   0.6700   0.9515   0.7982
          t-test   0.3882   6.043    1.276    4.109    3.821    5.641    14.01    2.843
K-means   Train    0.9641   0.7362   0.8424   0.8958   0.7609   0.7981   0.5285   0.9744   0.8125   30
          Test     0.9634   0.6610   0.8054   0.8970   0.7415   0.7295   0.4650   0.9667   0.7787
          t-test   0.9477   3.219    0.8420   4.904    3.012    4.443    39.89    0.8365
PNN       Train    1.000    1.000    1.000    1.000    1.000    1.000    1.000    1.000    1.000    311.6
          Test     0.9494   0.6686   0.7309   0.9443   0.6950   0.5349   0.9515   0.9447   0.8024
          t-test   4.581    4.497    7.145    -0.6540  10.96    27.81    -30.40   4.952
KNN       Train    0.9736   0.7863   0.8653   0.8203   0.8073   0.7654   0.8924   0.9756   0.8608   311.6
          Test     0.9686   0.6566   0.8000   0.7871   0.7181   0.7192   0.7799   0.9697   0.7999
          t-test   0.0357   5.024    3.488    19.04    7.010    5.901    -1.238   -0.2480
Table 2. Results of previous works using the same datasets. The result of the best method is recorded among the algorithms tested in each paper (1: k-fold cross-validation, 2: hold-out).

Datasets      GA-RBFNN   [11]1
Cancer        0.9688     0.9580
Glass         0.6913     0.7050
Heart         0.8172     0.8370
Ionosphere    0.9326     0.8970
Pima          0.7625     0.7720
Sonar         0.7785     0.7850
Vowel         0.7371     0.8170
Wines         0.9680     -

[12]1  0.9470  0.6710  0.7400  0.7920
[13]1  0.7095  0.8296  0.7660  0.9657
[14]1  0.9620  0.7620  0.9370  0.4830  -
[15]2  0.6837  0.8817  0.6872  0.9444
[16]1  0.9650  0.7510  0.8030  0.7560  -
[17]1  0.9780  0.7050  0.8370  0.9310  0.7430  0.8300  0.6520  0.9290
[18]1  0.9490  0.7000  0.7890  0.9060  0.7400  0.7650  0.7810  -
The comparison presented in Table 1 shows that GA-RBFNN yielded accuracies close to the best accuracy on most datasets, with the hidden nodes being adjusted dynamically. The t-test results indicate that, with a confidence level of 95%, there are significant differences between the GA-RBFNN and the compared conventional algorithms in most cases. The GA-RBFNN improved the average testing accuracy by 4.23% and reduced the number of hidden nodes by 27.57% compared with the DRSC algorithm, which is used to determine the initial network of GA-RBFNN. Furthermore, K-means needs many trials to find a suitable number of hidden nodes, whereas the GA-RBFNN can design the network structure dynamically and needs only one run to obtain the optimal solution. PNN and KNN need a significantly larger number of hidden nodes, although they outperform GA-RBFNN on the Sonar and Vowel datasets. These results show that the GA-RBFNN algorithm is able to obtain a significantly higher accuracy and produce a smaller network structure than the compared methods.
Moreover, the proposed method is competitive when compared with other works on these datasets. Table 2 shows a summary of the results reported in papers devoted to other classification methods. Comparisons must be made cautiously, as the experimental setup differs in many papers. Some of the papers use tenfold cross-validation on some of the datasets and obtain a more optimistic estimation. However, we did not utilize tenfold cross-validation because it does not fit the three-subset sample partition. Table 2 shows that on the Cancer, Ionosphere, Pima and Wines datasets our algorithm achieves a performance that is better than or at least identical to all the results reported in the cited papers.

4.2 Experiment 2
In order to test the impact of the parameters upon the performance of the proposed method, we carried out another experiment by assigning different values to the genetic parameters. First, the effect of the probability of crossover, pc, was considered; it varied from 0.1 to 0.9, while the other parameters were assigned as follows: G = 200, L = 50, pm1 = 0.4, pm2 = 0.2 and pad = 0.2. We performed ten runs of the algorithm for each pc, and the results in Table 3 are the average classification accuracies over the ten runs.

Table 3. Average testing accuracies for various probabilities of crossover pc

pc    Cancer   Glass    Heart    Iono     Pima     Sonar    Vowel    Wines    Ave      S*       Ratio
0.1   0.9691   0.6973   0.8065   0.9014   0.7632   0.7769   0.7029   0.9712   0.8236   0.0021   17.82
0.2   0.9692   0.6828   0.8123   0.8946   0.7613   0.7413   0.7017   0.9826   0.8182   0.0043   12.52
0.3   0.9683   0.6893   0.8079   0.9332   0.7548   0.7567   0.7094   0.9416   0.8202   0.0031   14.83
0.4   0.9592   0.6872   0.8168   0.9196   0.7587   0.7634   0.7181   0.9485   0.8214   0.0021   17.77
0.5   0.9703   0.6908   0.8197   0.9332   0.7626   0.7769   0.7311   0.9689   0.8317   0.0003   48.77
0.6   0.9663   0.7002   0.8050   0.9264   0.7574   0.7798   0.7218   0.9507   0.8260   0.0014   21.94
0.7   0.9654   0.6879   0.8079   0.9264   0.7515   0.7807   0.7203   0.9462   0.8233   0.0019   18.70
0.8   0.968    0.6973   0.8094   0.9286   0.7632   0.7673   0.7246   0.9348   0.8242   0.0026   16.02
0.9   0.9703   0.6748   0.8182   0.9263   0.7450   0.7750   0.7181   0.9689   0.8246   0.0014   21.92
The column Ave in Table 3 is the average classification accuracy over the eight datasets for each pc. Due to the numerous experimental results, we introduce an additional test variable, S*, which denotes the difference between the testing accuracy for every pc and the maximum accuracy of each dataset in Table 3. The last two columns in Table 3, S*_i and Ratio_i, are calculated as follows:
$$S^*_i = \sum_{j=1}^{Num_{set}} (Accu_{ij} - Max_j)^2, \qquad Ratio_i = \frac{Ave_i}{S^*_i} \quad (6)$$
where $i = 1, 2, \ldots, Num_p$, $j = 1, 2, \ldots, Num_{set}$, $Accu_{ij}$ is the testing accuracy of the $j$th dataset for the $i$th value of pc, $Max_j$ is the maximum accuracy of the $j$th dataset, $Num_p$ is the number of different pc values and $Num_{set}$ is the number of datasets. Note that it is more suitable and convincing to consider both Ave and Ratio than only
considering the former. As Table 3 shows, both the testing accuracy and the ratio reach their maximum when pc = 0.5. We also carried out an experiment to test the influence of the population size with L = 10, 40, 70, 100, and 130, keeping pc = 0.5; the other parameters are assigned as above. The average classification accuracies over the ten runs are shown in Table 4. The meanings of the last three columns are defined as in Table 3.

Table 4. Average testing accuracies for various population sizes L

L     Cancer   Glass    Heart    Iono     Pima     Sonar    Vowel    Wines    Ave      S*       Ratio
10    0.9726   0.6790   0.8012   0.9305   0.7536   0.7619   0.6561   0.9495   0.8131   0.0073   9.537
40    0.9685   0.6941   0.8100   0.9418   0.7573   0.7792   0.7341   0.9677   0.8316   0.0001   99.32
70    0.9645   0.6621   0.8070   0.9316   0.7646   0.7590   0.7276   0.9405   0.8196   0.0024   16.76
100   0.9697   0.6809   0.8026   0.9271   0.7620   0.7590   0.7168   0.9404   0.8198   0.0019   18.74
130   0.9691   0.6734   0.8012   0.9316   0.7630   0.7677   0.7049   0.9586   0.8212   0.0017   19.96
Note that in Table 4, not only Ave but also Ratio reaches its peak when L = 40. In some of the problems, namely Cancer and Pima, enlarging the population size produces an improvement in the performance of the model which is not significant in view of the increased complexity of the model. A t-test has been conducted, and with a confidence level of 95% there are no significant differences in these cases. We can conclude that the GA-RBFNN does not perform better with bigger population sizes, which may lead to inbreeding without making any improvement in the network performance.
5 Conclusions

A GA-RBFNN method for complex classification tasks has been presented, which adopts a matrix-form mixed encoding and specifically designed genetic operators to optimize the RBFNN parameters. The individual fitness is evaluated as a multi-objective optimization task, and the weights between the hidden layer and the output layer are computed by the pseudo-inverse algorithm. Experimental results over eight UCI datasets show that the GA-RBFNN can output a much simpler network structure with better generalization and prediction capability. To sum up, the GA-RBFNN is a quite competitive and powerful algorithm for complicated classification problems. Two directions are to be investigated in our further work. One is to combine some feature-selecting algorithms with the proposed method to improve the performance further, and the other is to introduce fuzzy techniques to increase the self-adaptation ability of the relevant parameters.

Acknowledgments. The work was supported by the National Science Foundation of China (Grant No. 70171002, No. 70571057) and the Program for New Century Excellent Talents in Universities of China (NCET).
References
1. Zhu, Q., Cai, Y., Liu, L.: A Global Learning Algorithm for a RBF Network. Neural Networks 12 (1999) 527-540
2. Leonardis, A., Bischof, H.: An Efficient MDL Based Construction of RBF Networks. Neural Networks 11 (1998) 963-973
3. Arifovic, J., Gencay, R.: Using Genetic Algorithms to Select Architecture of a Feedforward Artificial Neural Network. Physica A: Statistical Mechanics and its Applications 289 (2001) 574-594
4. Sarimveis, H., Alexandridis, A., et al.: A New Algorithm for Developing Dynamic Radial Basis Function Neural Network Models Based on Genetic Algorithms. Computers and Chemical Engineering 28 (2004) 209-217
5. Li, M.Q., Kou, J.S., et al.: The Basic Theories and Applications in GA. Science Press, Beijing (2002)
6. Berthold, M.R., Diamond, J.: Boosting the Performance of RBF Networks with Dynamic Decay Adjustment. In: Tesauro, G., Touretzky, D.S., Leen, T.K. (eds.): Advances in Neural Information Processing Systems 7. MIT Press, Denver, Colorado (1995) 512-528
7. Zhao, W.X., Wu, L.D.: RBFN Structure Determination Strategy Based on PLS and GAs. Journal of Software 13 (2002) 1450-1455
8. Burdsall, B., Giraud-Carrier, C.: GA-RBF: A Self-Optimizing RBF Network. In: Smith, G.D., et al. (eds.): Proceedings of the Third International Conference on Artificial Neural Networks and Genetic Algorithms (ICANNGA'97). Springer-Verlag, Norwich (1997) 348-351
9. Bosman, P.A.N., Thierens, D.: The Balance between Proximity and Diversity in Multiobjective Evolutionary Algorithms. IEEE Transactions on Evolutionary Computation 7 (2003) 174-188
10. Liu, Y., Yao, X., Higuchi, T.: Evolutionary Ensembles with Negative Correlation Learning. IEEE Transactions on Evolutionary Computation 4 (2000) 380-387
11. Frank, E., Wang, Y., Inglis, S., et al.: Using Model Trees for Classification. Machine Learning 32 (1998) 63-76
12. Cantú-Paz, E., Kamath, C.: Inducing Oblique Decision Trees with Evolutionary Algorithms. IEEE Transactions on Evolutionary Computation 7 (2003) 54-68
13. Guo, G.D., Wang, H., Bell, D., et al.: KNN Model-Based Approach in Classification. In: Meersman, R., Tari, Z., Schmidt, D.C. (eds.): CoopIS, DOA, and ODBASE - OTM Confederated International Conferences. Lecture Notes in Computer Science 2888. Springer-Verlag, Berlin Heidelberg New York (2003) 986-996
14. Friedman, J., Hastie, T., Tibshirani, R.: Additive Logistic Regression: A Statistical View of Boosting. The Annals of Statistics 28 (2000) 337-407
15. Draghici, S.: The Constraint Based Decomposition (CBD) Training Architecture. Neural Networks 14 (2001) 527-550
16. Webb, G.I.: Multiboosting: A Technique for Combining Boosting and Wagging. Machine Learning 40 (2000) 159-196
17. Yang, J., Parekh, R., Honavar, V.: DistAI: An Inter-pattern Distance-based Constructive Learning Algorithm. Intelligent Data Analysis 3 (1999) 55-73
18. Frank, E., Witten, I.H.: Generating Accurate Rule Sets Without Global Optimization. In: Shavlik, J.W. (ed.): Proceedings of the Fifteenth International Conference on Machine Learning. Morgan Kaufmann, San Francisco, CA (1998) 144-151
Stock Index Prediction Based on Adaptive Training and Pruning Algorithm

Jinyuan Shen(1), Huaiyu Fan(1), and Shengjiang Chang(2)

(1) School of Information Engineering, Zhengzhou University, Zhengzhou, China
(2) Institute of Modern Optics, Nankai University, Tianjin, China
[email protected]
Abstract. A tapped delay neural network (TDNN) with an adaptive learning and pruning algorithm is proposed to predict the nonlinear stock index time series. The TDNN is trained by recursive least squares (RLS), in which the learning-rate parameter can be chosen automatically; this results in fast convergence of the network. Subsequently, the architecture of the trained neural network is optimized by a pruning algorithm to reduce the computational complexity and enhance the network's generalization, and the optimized network is retrained so that it has optimum parameters. Finally, the test samples are predicted by the resulting network. The simulation and comparison show that this optimized neural network model can not only reduce the computational complexity greatly, but also improve the prediction precision. In our simulation, the computational complexity is reduced to 0.0556 and the mean square error of the test samples reaches 8.7961 × 10^-5.
1 Introduction
The stock indexes are influenced by many factors, so it is very difficult to forecast their changes with an accurate mathematical expression; forecasting stock indexes accurately is thus a typical nonlinear dynamic modeling problem. Due to its powerful nonlinear processing ability, the neural network has recently been applied widely to forecasting stock indexes to improve prediction ability, and some network models have made greater progress than traditional statistical methods [1]. But most research concentrates on the Back Propagation (BP) algorithm [2], the Radial-Basis Function (RBF) method [3], genetic algorithms [4], the Support Vector Machine (SVM) [5] and their improved variants. However, these models have some shortcomings: it is well known that local minima cannot be avoided in the BP algorithm and the number of neurons in the hidden layer is determined blindly; it is difficult for the RBF network to choose the best central vectors as well as the number of central vectors; SVM is suitable for the small-sample situation, but it is still unclear how to choose the best kernel function. The architecture and the learning algorithm of the neural network are crucial for predicting the stock indexes accurately. Therefore, it is very important to choose an optimal topological architecture. A tapped delay neural network (TDNN) with an adaptive learning and
pruning algorithm is adopted to predict the nonlinear stock index time series in this paper. First, an adaptive algorithm based on recursive least squares is employed to train the network. Second, the architecture of the primary neural network is optimized by a pruning algorithm which prunes the redundant neurons of the hidden layer and input layer. The optimized network is then retrained to obtain optimum parameters, and finally the test samples are predicted by the resulting network. In our model, the learning step size can be determined automatically, so the network converges fast. The pruning algorithm is widely used to prune redundant neurons in the hidden layer. However, how many delays are best for time series prediction, and how many input neurons are optimal for a TDNN forecasting the stock indexes? These questions have not been studied before. We adopt the pruning algorithm not only in the hidden layer but also in the input layer by defining a new energy function. The computational complexity is therefore reduced and the generalization is improved greatly: the computational complexity is reduced to 0.0556 and the mean square error of the test samples reaches 8.7961 × 10^-5.
2 The Architecture of the Neural Network
A three-layer feed-forward neural network can approximate nonlinear continuous functions arbitrarily well. A sketch of a three-layer TDNN is shown in Fig. 1. The input of the TDNN is a delayed time series, and the single output neuron adopts a linear activation function.

Fig. 1. The model of the tapped delay neural network

The functional relation between input and output is described by
$$\hat{x}(n) = \sum_i \omega_{2,i}\, f(\omega_{1,i} X(n-1)) - \theta. \quad (1)$$
where $\theta$ is the bias of the output neuron and $X(n-1)$ is the input vector at time $n-1$:
$$X(n-1) = [x(n-1), x(n-2), \ldots, x(n-p), 1]^T. \quad (2)$$
$N_n$, which includes the threshold cell, is the total number of neurons in the $n$-th layer. $\omega_{n,ij}(t)$ is the weight between neuron $(n, j)$ (the $j$-th neuron in the $n$-th layer) and neuron $(n+1, i)$:
$$\omega_n = [(\omega_{n,1})^T, (\omega_{n,2})^T, \ldots, (\omega_{n,N_{n+1}-1})^T], \qquad \omega_{n,i} = [\omega_{n,i1}, \omega_{n,i2}, \ldots, \omega_{n,iN_n}], \quad (3)$$
and
$$\omega = [(\omega_1)^T, (\omega_2)^T]. \quad (4)$$
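A minimal sketch of the forward mapping of Eqs. (1)-(2), assuming a sigmoid hidden activation f (the paper does not specify it here) and illustrative argument names:

```python
import numpy as np

def tdnn_output(x_hist, W1, w2, theta):
    """TDNN output of Eq. (1).

    x_hist : (p,)      past samples [x(n-1), ..., x(n-p)]
    W1     : (H, p+1)  input-to-hidden weights (last column acts on the bias element 1)
    w2     : (H,)      hidden-to-output weights
    """
    X = np.append(x_hist, 1.0)                    # X(n-1) with bias element, Eq. (2)
    hidden = 1.0 / (1.0 + np.exp(-W1 @ X))        # hidden activations f(.)
    return w2 @ hidden - theta                    # linear output neuron, Eq. (1)
```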
3 Adaptive Training
The weight vector $\omega$ is regarded as a stable nonlinear dynamic system in the TDNN model. Suppose the $n$-th training pattern is input to the TDNN model; the system should satisfy the following state equations:
$$\omega(n+1) = \omega(n) = \omega_0, \qquad d(n) = h(\omega(n)) + e(n), \quad (5)$$
where $d(n)$ is the desired output, $h(\omega(n))$ is the output of the network and $e(n)$ is the modeling error. The error function is
$$\xi(n) = \sum_{j=1}^{n} |d(j) - h(\omega(n))|^2\, \lambda^{n-j}, \quad (6)$$
where $\lambda$ is a forgetting factor that satisfies $0 < \lambda < 1$ and is close to 1. Expanding $h(\omega(n))$ in a Taylor series around $\hat\omega(n-1)$, we obtain
$$h(\omega(n)) = h(\hat\omega(n-1)) + H(n)(\omega(n) - \hat\omega(n-1)) + \ldots \quad (7)$$
where $H(n) = \frac{\partial h(\omega)}{\partial \omega}\big|_{\omega=\hat\omega(n-1)}$. With these state equations, according to identification theory [6], the recursion equations of the estimate $\hat\omega(n)$ can be obtained as follows:
$$\hat\omega(n) = \hat\omega(n-1) + K(n)\,(d(n) - h(\hat\omega(n-1))), \quad (8)$$
$$K(n) = \lambda^{-1} P(n-1) H(n)\,[I + \lambda^{-1} H^T(n) P(n-1) H(n)]^{-1}, \quad (9)$$
$$P(n) = \lambda^{-1} P(n-1) - \lambda^{-1} K(n) H^T(n) P(n-1). \quad (10)$$
The estimated weight $\hat\omega(n)$ should minimize the error function $\xi(n)$. $K(n)$ is the gain matrix and $P(n)$ is the error covariance matrix of the recursive least squares algorithm. $\hat\omega(0)$ and $P(0)$ are determined according to prior knowledge, or else
$$\hat\omega(0) = [0, 0, \ldots, 0]^T, \qquad P(0) = \delta^{-1} I, \quad (11)$$
where $\delta$ is a small positive number.
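A single step of this RLS/EKF-style update (Eqs. (8)-(10)) might be sketched as follows for a generic differentiable network; the function names and the way the Jacobian is supplied are assumptions for illustration.

```python
import numpy as np

def rls_update(w, P, x, d, forward, jacobian, lam=0.999):
    """One RLS weight update. `forward(w, x)` returns the network output h(w)
    and `jacobian(w, x)` returns H = dh/dw with shape (n_weights, n_outputs)."""
    H = jacobian(w, x)
    y = forward(w, x)                               # h(w(n-1)) for the current pattern
    S = np.eye(H.shape[1]) + (H.T @ P @ H) / lam
    K = (P @ H / lam) @ np.linalg.inv(S)            # gain matrix, Eq. (9)
    w = w + K @ (d - y)                             # weight update, Eq. (8)
    P = (P - K @ H.T @ P) / lam                     # covariance update, Eq. (10)
    return w, P
```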
4 Pruning Algorithm
4.1 Pruning the Neurons of the Hidden Layer
One of the important key problems is how to choose a suitable network scale. If the scale of the network is oversized, the generalization ability of the network may be very bad; on the other hand, the network may converge too slowly, or not at all, if the scale is excessively small. An effective method to solve this problem is to prune weights adaptively. After the $n$-th training sample is input, the energy function of the network is defined as
$$E = \frac{1}{2}\left[\sum_{j=1}^{n} (d(j) - h(\omega, x(j)))^T (d(j) - h(\omega, x(j))) + \omega^T P(0)^{-1} \omega\right]. \quad (12)$$
According to the pruning algorithm [7], suppose the initial value of the covariance matrix is a diagonal matrix $P(0) = \delta^{-1} I$, where $I$ is the identity matrix and $\delta > 0$; then the energy change of the network brought by a change of the weights $\Delta\hat\omega$ is
$$\Delta E = \frac{1}{2}\Delta\omega^T P(\infty)^{-1} \Delta\omega. \quad (13)$$
The importance of $\omega_j$ is calculated from
$$\Delta E_j = \frac{1}{2} [P(\infty)^{-1}]_{jj}\, \hat\omega(\infty)_j^2, \quad (14)$$
where $\hat\omega(\infty)$ and $P(\infty)$ are the weights and covariance matrix of the converged network, and $[P(\infty)^{-1}]_{jj}$ denotes the $j$-th diagonal element. According to the equations above, the process of pruning weights is as follows:
(a) After training the network by the RLS algorithm, the importance of all weights is estimated according to Eq. (14). The weights are queued from small to large according to $\Delta E_j$. Suppose the queue order is $[\pi_i]$, so that $\Delta E_{\pi_m} \le \Delta E_{\pi_k}$ for $m < k$.
(b) Let $[\Delta\omega]_{\pi_k} = [\omega]_{\pi_k}$ for $1 \le k \le k'$ and $[\Delta\omega]_{\pi_k} = 0$ for $k > k'$; the $\Delta E$ caused by pruning the weights $\omega_{\pi_1}$ to $\omega_{\pi_{k'}}$ can then be estimated according to Eq. (13).
(c) If $\Delta E \le \alpha E$ $(0 < \alpha < 1)$, let $k' = k' + 1$ and return to step (b); otherwise prune the weights $\omega_{\pi_1}$ to $\omega_{\pi_{k'-1}}$.
It is worth pointing out that, for a three-layer feed-forward neural network with one output, if a weight between the hidden layer and the output layer is pruned, the hidden neuron joined to that weight is pruned as well, and then the input-layer weights joined to the pruned neuron are pruned too. After pruning neurons in the hidden layer, the computational complexity is reduced. According to the literature [8], if the number of hidden neurons before pruning is $H_0$ and the number of remaining hidden neurons is $H_1$, the ratio of the computational complexity is $(H_1/H_0)^2$.
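A sketch of this saliency-based pruning pass, assuming the weights are collected in one vector and using illustrative names; it ranks weights by Eq. (14) and enlarges the deleted set while the energy change of Eq. (13) stays below alpha * E.

```python
import numpy as np

def prune_weights(w, P_inf, alpha, total_energy):
    """Select and zero out the least important weights, steps (a)-(c) above."""
    P_inv = np.linalg.inv(P_inf)
    saliency = 0.5 * np.diag(P_inv) * w ** 2          # dE_j for every weight, Eq. (14)
    order = np.argsort(saliency)                      # least important weights first
    pruned = []
    for k in range(1, len(order) + 1):
        dw = np.zeros_like(w)
        idx = order[:k]
        dw[idx] = w[idx]                              # candidate deletion set
        dE = 0.5 * dw @ P_inv @ dw                    # energy change, Eq. (13)
        if dE > alpha * total_energy:
            break                                     # keep the previous set
        pruned = idx
    w_new = w.copy()
    w_new[pruned] = 0.0                               # remove the selected weights
    return w_new, pruned
```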
4.2 Pruning the Input Layer
No one has studied how many delays are best for time series prediction; in other words, how many input neurons are optimal for a TDNN predicting the stock indexes? A new energy function is defined, and the pruning algorithm is also employed to prune the redundant neurons in the input layer. Suppose the weight between the $i$-th neuron in the input layer and the $j$-th neuron in the hidden layer is $\omega_{1,ij}$. We define the energy function
$$E_{1,i} = \sum_{j=0}^{n} (\omega_{1,ij})^2, \quad (15)$$
where $n$ is the number of neurons in the hidden layer. Then we obtain an array $E = [E_{1,1}, E_{1,2}, \ldots, E_{1,m}]^T$, $i = 1, 2, \ldots, m$, where $m$ is the number of neurons in the input layer. The $E_{1,i}$ are queued from low to high according to Eq. (15). Suppose $\Delta E_1 = \sum_{k=1}^{k'} E_{1,k}$ $(1 \le k' < m)$ and $E_1 = \sum_{i=1}^{m} E_{1,i}$. If $\Delta E_1 / E_1 = \beta$, then the front $k'$ neurons in the input layer will be pruned.
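The input-layer pruning rule can be sketched as follows, assuming the input-to-hidden weights are stored as a matrix with one row per input neuron; names are illustrative.

```python
import numpy as np

def prune_input_neurons(W1, beta):
    """Rank input neurons by their summed squared outgoing weights (Eq. (15))
    and drop the weakest ones whose accumulated energy share stays within beta."""
    # W1[i, j]: weight from input neuron i to hidden neuron j
    energy = np.sum(W1 ** 2, axis=1)                  # E_{1,i} for each input neuron
    order = np.argsort(energy)                        # weakest inputs first
    cum_share = np.cumsum(energy[order]) / energy.sum()
    n_drop = int(np.searchsorted(cum_share, beta, side="right"))
    keep = np.sort(order[n_drop:])                    # surviving input neurons
    return keep, W1[keep]
```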
5 Computer Simulation

5.1 Training
We use 650 daily data points of the Shanghai Composite Index from Mar. 23, 2001 to Dec. 17, 2003 as samples, take the first 300 data points as training samples, and use the 301st-500th as retraining samples. The remaining data are taken as the test samples. First we normalize the sample data:

\hat{X}_i = \frac{X_i - \min(X_i)}{\max(X_i) - \min(X_i)}.   (16)
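A small sketch of the normalization of Eq. (16), together with the mean square error used later to compare the models, is given below (helper names are ours, not the paper's).

```python
import numpy as np

def minmax_normalize(x):
    """Scale a series into [0, 1] as in Eq. (16)."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def mse(errors):
    """Mean square error over the sample set: sum of e^2(n) divided by N."""
    errors = np.asarray(errors, dtype=float)
    return np.sum(errors ** 2) / errors.size
```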
The initial architecture of the time delay neural network is 12-15-1. The TDNN is then trained by the RLS algorithm. The initial values of the parameters are \hat{\omega}(0) = [0, 0, \ldots, 0]^T, P(0) = 60 \times I, and \lambda = 0.999. Only 36 iterations are needed, i.e., the network converges very fast. The prediction errors of TDNNs with different architectures are shown in Table 1. The mean square error, MSE = \sum_n e^2(n) / N, is equal to 1.4142 \times 10^{-4} for the training samples.

5.2 Pruning the Hidden Layer
According to Eq. (14), all of the weights are ranked by their importance, and the unimportant weights at the front of the queue are pruned. Fig. 2 shows the relation between energy and weights. Clearly, the first 130 weights are unimportant, corresponding to E = 1268. Ten of these 130 weights lie between the hidden layer and the output layer. Therefore, the number of hidden neurons becomes 5 after these 130 weights are pruned, i.e., the network architecture becomes 12-5-1.
Table 1. The comparison of the MSE of TDNN models

Samples                          Architecture   MSE
training samples                 12-15-1        1.4142 x 10^-4
retraining samples (un-pruned)   12-15-1        1.3898 x 10^-4
retraining samples (pruned)      12-9-1         1.2996 x 10^-4
                                 12-6-1         1.2307 x 10^-4
                                 12-5-1         1.2003 x 10^-4
                                 12-4-1         1.3689 x 10^-4
test samples (un-pruned)         12-15-1        1.3307 x 10^-4
test samples (pruned)            12-9-1         1.1582 x 10^-4
                                 12-6-1         1.0986 x 10^-4
                                 12-5-1         9.7160 x 10^-5
                                 12-4-1         1.0211 x 10^-4
Fig. 2. The relational curve between energy and weights
The computational complexity is reduced to (5/15)^2 = 0.1111. This indicates that the topology of the network can be optimized effectively by the pruning algorithm. The prediction errors for different values of \Delta E are compared in Table 1. The MSE is smallest when 10 neurons in the hidden layer are pruned, i.e., when the number of hidden neurons is 5.

5.3 Pruning the Input Layer
As shown in Table 2, according to Eq. (15) and the ratio \Delta E_1 / E_1, when \beta = 0.2 we obtain the least mean square error, equal to 8.7961 \times 10^{-5}, for the test samples.
Table 2. The comparison of the different networks

Architecture of the network            Mean square error
12-15-1                                1.3307 x 10^-4
6-5-1                                  9.5924 x 10^-5
6-5-1 (obtained by pruning 12-5-1)     8.7961 x 10^-5
The network architecture becomes 6-5-1 and the computational complexity is reduced to 0.1111/2 = 0.0556.

5.4 Retraining and Predicting
The 301st-500th data samples are used to retrain the final TDNN and obtain the optimum weights, and the test samples (the 501st-650th samples) are then forecast. To examine whether the network architecture is optimal, different architectures (i.e., different numbers of hidden neurons) are also simulated besides the optimum architecture 6-5-1. The results are shown in Table 1 and Table 2. The prediction curve of the test samples with the 6-5-1 TDNN is shown in Fig. 3.
Fig. 3. The prediction curve of the 6-5-1 TDNN (actual vs. forecast data)
6
Conclusions
From the simulation results we can see that the convergence rate is fast; therefore the TDNN with the RLS learning algorithm can basically satisfy the requirements of on-line forecasting. The adaptive training and pruning reduce both the computational complexity and the VC dimension, which improves the generalization ability of the network, so the network can predict the test samples more accurately. In addition, by presenting a new energy function we prune the redundant neurons not only in the hidden layer but also in the input layer. This selects the useful input factors self-adaptively: we can both reduce the computational complexity of the network and extract the useful features from noisy inputs. Our extended pruning method can therefore serve as an effective tool for preprocessing the input data.
Acknowledgment
This work is supported by the Outstanding Youth Fund of Henan Province (grant No. 512000400), the Henan Province Cultivation Project for University Innovation Talents, and the Project sponsored by SRF for ROCS, SEM.
References
1. Refenes, A.N., Zapranis, A., Francis, G.: Stock Performance Modeling Using Neural Networks: A Comparative Study with Regression Models. Neural Networks 5 (1994) 961-970
2. Chang, B.R., Tsai, S.F.: A Grey-Cumulative LMS Hybrid Predictor with Neural Network Based Weighting for Forecasting Non-Periodic Short-Term Time Series. IEEE International Conference on Systems, Man and Cybernetics 6 (2002) 5
3. Lee, R.S., Jade, T.: Stock Advisor: An Intelligent Agent Based Stock Prediction System Using Hybrid RBF Recurrent Network. IEEE Trans. Systems, Man and Cybernetics-A 34 (2004) 421-428
4. Grosan, C., Abraham, A.: Stock Market Modeling Using Genetic Programming Ensembles. Studies in Computational Intelligence 13 (2006) 131-146
5. Ince, H., Trafal, I.: Kernel Principal Component Analysis and Support Vector Machines for Stock Price Prediction. IEEE International Joint Conference on Neural Networks Proceedings 3 (2004) 2053-2058
6. Shah, S., Palmieri, F., Datum, M.: Optimal Filtering Algorithm for Fast Learning in Feed-Forward Neural Networks. Neural Networks 5 (1992) 779-787
7. LeCun, Y., Denker, J.S., Solla, S.A.: Optimal Brain Damage. Advances in Neural Information Processing 2 (1989) 598-605
8. Chen, S., Chang, S.J., Yuan, J.H.: Adaptive Training and Pruning for Neural Networks: Algorithms and Application. Acta Physica Sinica 50 (2001) 674-681
An Improved Algorithm for Elman Neural Network by Adding a Modified Error Function

Zhang Zhiqiang1, Tang Zheng1, Tang GuoFeng1, Catherine Vairappan1, Wang XuGang2, and Xiong RunQun3

1 Faculty of Engineering, Toyama University, Gofuku 3190, Toyama-shi, 930-8555 Japan
[email protected]
2 Institute of Software, Chinese Academy of Sciences, Beijing 100080, China
3 Key Lab of Computer Network and Information Integration, Southeast University, Nanjing 210096, China
Abstract. The Elman neural network has been widely used in various fields, ranging from a temporal version of the Exclusive-OR function to the discovery of syntactic categories in natural language data. However, one of the problems often associated with this type of network is the local minima problem, which usually occurs during learning. To solve this problem, we propose an error function that harmonizes the updates of the weights connected to the hidden layer and those connected to the output layer by adding one term to the conventional error function; it avoids the local minima problem caused by this disharmony. We apply the method to Boolean Series Prediction Question (BSPQ) problems to demonstrate its validity. The results show that the proposed method avoids the local minima problem, greatly accelerates convergence, and obtains good results on the prediction tasks.
1
Introduction
The Elman neural network (ENN) is a type of partial recurrent neural network, a family that also includes Jordan networks [1], [2]. An ENN consists of a two-layer back-propagation network with an additional feedback connection from the output of the hidden layer to its input. The advantage of this feedback path is that it allows the ENN to recognize and generate temporal patterns as well as spatial patterns. This means that after training, interrelations between the current input and the internal states are processed to produce the output and to represent the relevant past information in the internal states [3], [4]. The ENN is a locally recurrent network, so when learning a problem it needs more hidden neurons in its hidden layer than are actually required for a solution by other methods. Since the ENN uses back propagation (BP) to deal with the various signals, it has been shown to suffer from a sub-optimal (local minima) problem [5], [6], [7]. In order to resolve this issue, many improved ENN algorithms have been suggested in the literature to increase the performance of the ENN with simple
modifications [8], [9], [10]. One typical modification, proposed by Pham and Liu, adds self-connection weights (fixed between 0.0 and 1.0 before the training process) to the context layer. The modifications of the ENN suggested in the literature have mostly been able to improve certain kinds of problems, but it is not yet clear which network architecture is best suited to dynamic system identification or prediction [11]. At the same time, these methods tend to change or add other elements or connections in the network, which increases the complexity of the computation. In this paper, we interpret neuron saturation in the hidden layer as an update disharmony between the weights connected to the hidden layer and those connected to the output layer. We then propose a modified error function for the ENN algorithm in order to avoid the local minima problem with fewer neuron units and less training time than the conventional ENN. Finally, simulation results are presented to substantiate the validity of the modified error function. Since a three-layered network is capable of forming an arbitrarily close approximation to any continuous nonlinear mapping [12], we use three layers for all training networks.
2
ENN’s Structure
Fig. 1 shows the structure of a conventional ENN. After the hidden units are calculated, their values are used to compute the output of the network and are also all stored as "extra inputs" (called context units) to be used the next time the network is operated. Thus, the recurrent contexts provide a weighted sum of the previous values of the hidden units as input to the hidden units. As in the definition of the original ENN, the activations are copied from the hidden layer to the context layer on a one-for-one basis, with a fixed weight of 1.0 (w = 1.0). The forward connection weights between the context units and the hidden units are trainable.
Fig. 1. Structure of ENN
Training such a network is not straightforward, since the output of the network depends on the current inputs and also on all previous inputs to the network. One approach used in machine learning is to unroll the process through time, as shown in Fig. 2.
Fig. 2 represents a long feed-forward network in which back propagation can calculate the derivatives of the error (at each output unit) by unrolling the network to the beginning. At the next time step t + 1 an input is presented; this time the context units contain values which are exactly the hidden unit values at time t, so these context units provide the network with memory [6]. In our paper, the time element is updated by the iteration of changing connection weights, so we do not consider the time factor much in the computing formulas.
Fig. 2. Unroll the ENN Through Time
3
Motivation
In the ENN, the sigmoid function is usually used as the activation function. Our proposed method is based on an analysis of the standard sigmoid function shown in Eq. (1):

f(x) = \frac{1}{1 + e^{-x}}.   (1)
The shape of the sigmoid function is shown in Fig. 3. Since we use the sigmoid function, the saturation problem is inevitable; such a phenomenon is caused by the activation function [13]. The derivative of the sigmoid function is

f'(x) = (1 - f(x)) \, f(x).   (2)
In Fig. 3 we can see two extreme areas, A and B. Once the activity level of the whole hidden layer approaches these two extreme areas (the outputs f(x) of all neurons are close to the extreme values 1 or 0), f'(x) is almost 0. For the ENN, the change in weights is determined by the sigmoid derivative, which can thus be nearly 0. So for some training patterns, the weights connected to the hidden layer and those connected to the output layer are modified inharmoniously; that is, all the hidden neurons' outputs are rapidly driven to the extreme areas before the output starts to approximate the desired value. The hidden layer then loses its sensitivity to the error, and the local minimum problem may occur. To overcome this problem, the neuron outputs in the output layer and those in the hidden layer should be considered together during the iterative update procedure.
Fig. 3. Sigmoid function
Motivated by this, we add one term concerning the outputs of the hidden layer to the conventional error function [14]. In this way, the weights connected to the hidden layer and those connected to the output layer can be modified harmoniously.
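A small numeric sketch of the saturation effect is given below: when a hidden unit's net input is pushed far into area A or B, the derivative of Eq. (2) that scales the back-propagated updates becomes vanishingly small (illustration only, not from the paper).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    f = sigmoid(x)
    return (1.0 - f) * f          # Eq. (2)

for net in [0.0, 2.0, 5.0, 10.0]:
    print(f"net={net:5.1f}  f={sigmoid(net):.4f}  f'={sigmoid_derivative(net):.6f}")
# f' drops from 0.25 at net=0 to about 4.5e-5 at net=10: a saturated hidden
# layer contributes almost nothing to the weight updates.
```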
4
Proposed Algorithm
For the conventional ENN algorithm, the error function is given by

E_A = \frac{1}{2} \sum_{p=1}^{P} \sum_{j=1}^{J} (t_{pj} - o_{pj})^2,   (3)
where P is the number of training patterns and J is the number of neurons in the output layer; t_{pj} is the target value (desired output) of the j-th component of the outputs for pattern p, and o_{pj} is the output of the j-th neuron of the actual output layer. To minimize the error function E_A, the ENN algorithm uses the following delta rule, as in the back-propagation algorithm:

\Delta w_{ji} = -\eta_A \frac{\partial E_A}{\partial w_{ji}},   (4)
where w_{ji} is the weight connecting neurons i and j, and \eta_A is the learning rate. For the improved ENN algorithm, the modified error function is given by

E_{new} = E_A + E_B = \frac{1}{2} \sum_{p=1}^{P} \sum_{j=1}^{J} (t_{pj} - o_{pj})^2 + \frac{1}{2} \sum_{p=1}^{P} \Big( \sum_{j=1}^{J} (t_{pj} - o_{pj})^2 \Big) \times \Big( \sum_{j=1}^{H} (y_{pj} - 0.5)^2 \Big).   (5)

The new error function consists of two terms: E_A is the conventional error function and E_B is the added term, where y_{pj} is the output of the j-th neuron in the hidden layer and H is the number of neurons in the hidden layer. The quantity

\sum_{j=1}^{H} (y_{pj} - 0.5)^2   (6)
can be defined as the degree of saturation of the hidden layer for pattern p. The added term keeps the degree of saturation of the hidden layer small while E_A is large (the output layer has not yet approximated the desired signals). As the output layer approximates the desired signals, the effect of the term E_B diminishes and eventually becomes zero. Using the above error function as the objective function, we can rewrite the update rule for the weight w_{ji} as

\Delta w_{ji} = -\eta_A \frac{\partial E_A}{\partial w_{ji}} - \eta_B \frac{\partial E_B}{\partial w_{ji}}.   (7)
For pattern p, the derivative \partial E_A / \partial w_{ji} is computed exactly as for the conventional error function. Thus we can easily obtain \partial E_B / \partial w_{ji} as follows. For weights connected to the output layer:

\frac{\partial E_B^p}{\partial w_{ji}} = \frac{\partial E_A^p}{\partial w_{ji}} \sum_{j=1}^{H} (y_{pj} - 0.5)^2.   (8)
For weights connected to the hidden layer:

\frac{\partial E_B^p}{\partial w_{ji}} = \frac{\partial E_A^p}{\partial w_{ji}} \sum_{j=1}^{H} (y_{pj} - 0.5)^2 + \sum_{j=1}^{J} (t_{pj} - o_{pj})^2 \, (y_{pj} - 0.5) \frac{\partial y_{pj}}{\partial w_{ji}}.   (9)

Because y_{pj} = f(net_{pj}) and net_{pj} = \sum_i w_{ji} o_{pi},

\frac{\partial y_{pj}}{\partial w_{ji}} = \frac{\partial y_{pj}}{\partial net_{pj}} \frac{\partial net_{pj}}{\partial w_{ji}} = f'(net) \, o_{pi},   (10)
where opi is the i − th input for pattern p, and netpj is the net input to neuron j produced by the presentation of pattern p. In order to verify the effectiveness of the modified error function, we applied the algorithm to the BSPQ problems.
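The extra term of Eqs. (5)-(10) only rescales the usual back-propagated gradients, so it can be sketched on top of a standard ENN training step as follows. This is a rough illustration with our own variable names; it assumes the conventional gradients `dEA_out` and `dEA_hid` have already been computed by ordinary back propagation.

```python
import numpy as np

def modified_gradients(t, o, y, dEA_out, dEA_hid, dydw_hid):
    """Gradients of the added term E_B for one pattern, Eqs. (8)-(9).

    t, o     : target and actual outputs, shape (J,)
    y        : hidden-layer outputs, shape (H,)
    dEA_out  : dE_A/dw for output-layer weights, shape (J, H)
    dEA_hid  : dE_A/dw for hidden-layer weights, shape (H, I)
    dydw_hid : dy_j/dw_ji for hidden-layer weights, shape (H, I)
    """
    sat = np.sum((y - 0.5) ** 2)                 # degree of saturation, Eq. (6)
    err = np.sum((t - o) ** 2)

    dEB_out = dEA_out * sat                      # Eq. (8)
    dEB_hid = dEA_hid * sat + err * ((y - 0.5)[:, None] * dydw_hid)  # Eq. (9)
    return dEB_out, dEB_hid

# Weight update, Eq. (7): dw = -eta_A * dEA - eta_B * dEB
```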
5
Simulations
The Boolean Series Prediction Question (BSPQ) is a time-sequence prediction problem; its definition is as follows [15]. Suppose that we want to train a network with an input P and target T as defined below:
P = 1 0 1 1 1 0 1 1
and
T = 0 0 0 1 1 0 0 1
Here T is defined to be 0, except when two consecutive 1's occur in P, in which case T is 1; we call this the "11" problem (one kind of BSPQ). Similarly, when "00" or "111" (two consecutive 0's or three consecutive 1's) is to be detected, the task is named the "00" or "111" problem. In this paper we define the prediction set P1 as the following random 20-bit sequence:
P1 = 1 1 1 0 1 0 0 0 1 0 1 1 0 1 1 1 0 0 1 1
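A small sketch of how the "11" targets can be generated from a binary input sequence is given below (our own helper, written only to make the task definition concrete).

```python
def bspq_targets(p, pattern="11"):
    """Target bit is 1 whenever the last len(pattern) inputs equal the pattern."""
    k = len(pattern)
    t = []
    for i in range(len(p)):
        window = "".join(str(b) for b in p[max(0, i - k + 1):i + 1])
        t.append(1 if i >= k - 1 and window == pattern else 0)
    return t

P = [1, 0, 1, 1, 1, 0, 1, 1]
print(bspq_targets(P, "11"))   # -> [0, 0, 0, 1, 1, 0, 0, 1]
```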
Table 1. Experiment results for the "11" question with 7 neurons in the hidden layer (1-7-1 network)

Methods            Success rate (100 trials)   Iterations (average)   Average CPU time (s)
                   E=0.1     E=0.01            E=0.1     E=0.01       E=0.1    E=0.01
Conventional ENN   45%       35%               15365     33274        9.5      20.5
Improved ENN       100%      100%              322       636          0.18     0.29
In order to test the effectiveness of the proposed method, we compare its performance with that of the conventional ENN algorithm on a series of BSPQ problems, including the "11", "111" and "00" problems. In our simulations, we use the modified back-propagation algorithm with momentum 0.9. In order to keep the two algorithms comparable, the learning rates \eta_A = \eta_B = 0.9 are used in all experiments, and the weights and thresholds are initialized randomly from (0.0, 1.0). Three aspects of training performance—success rate, number of iterations and training time—are assessed for each algorithm. Simulations were implemented in Visual C++ 6.0 on a Pentium 4 2.8 GHz (1 GB) machine. A training run was deemed successful if the network error E fell below a given level (E = 0.1 or E = 0.01), where E is the sum-of-squares error over the full training set. When the network reaches this error precision, all patterns in the training set are within a tolerance of 0.05 for each target element. The well-trained network is then used to predict the sequence P1 to test its prediction capacity. For all trials, 150 patterns were provided to keep the training set balanced and, at the same time, to ensure a reasonable running time for all algorithms. The upper limit on iterations was set to 50,000 for both algorithms.

5.1 "11" Question
Firstly, we deal with the "11" question and analyze the effect of the memory provided by the context layer. From the comparison in Table 1 we can see that the improved method not only succeeds in 100% of trials but also reaches the convergence point quickly. Although the conventional ENN is able to predict the requested input test sequence P1, its training success rate is low, only 45% when E is set to 0.1. Fig. 4 and Fig. 5 show the training error curves of the two algorithms with the same weight initialization for the 1-7-1 network, when E is set to 0.1. Comparison of Fig. 4 and Fig. 5 shows that the improved ENN needs only 49 iterations to succeed, while the conventional ENN needs 1957 iterations to reach the desired goal. For the prediction set P1, the corresponding expected results T1 are:
T1 = 0 1 1 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 1
Fig. 6 shows the prediction results for P1 of the "11" question with the improved ENN; the two lines represent the expected T1 and the actual prediction result
Fig. 4. Training error curve of the conventional ENN algorithm
Fig. 5. Training error curve of the improved ENN algorithm
Fig. 6. Comparison of the expected output and prediction result with improved ENN
Table 2. Experiment results for the "111" question with different numbers of neurons in the hidden layer

Structure          Method             Success rate (100 trials)   Iterations (average)   Time (s)
                                      E=0.1     E=0.01            E=0.1     E=0.01       E=0.1   E=0.01
Network (1-10-1)   Conventional ENN   72%       70%               13545     24555        11.3    17.9
                   Improved ENN       99%       99%               2279      3698         1.6     3.1
Network (1-12-1)   Conventional ENN   80%       75%               9823      17090        7.9     14.1
                   Improved ENN       100%      97%               2001      3710         1.9     2.5
line. Apparently the tolerance for every pattern in P1 is very small. Based on these findings, it can be concluded that the improved ENN has sufficient capability for the prediction of the given task.

5.2 "111" and "00" Questions
By changing the type of BSPQ, we can further verify the validity of the improved ENN algorithm. Table 2 compares the specific results of the conventional ENN and the improved ENN algorithm on the "111" problem. From Table 2 we can see that the improved ENN avoids the local minima problem with an almost 100% success rate and fewer iterations than the conventional ENN algorithm. Table 3 gives the corresponding comparison between the conventional ENN and the improved ENN algorithm for the "00" BSPQ problem.

Table 3. Experiment results for the "00" BSPQ problem with different numbers of neurons in the hidden layer

Structure          Method             Success rate (100 trials)   Iterations (average)   Time (s)
                                      E=0.1     E=0.01            E=0.1     E=0.01       E=0.1   E=0.01
Network (1-7-1)    Conventional ENN   92%       81%               4177      9852         4.0     9.6
                   Improved ENN       99%       97%               1661      3589         1.7     3.8
Network (1-10-1)   Conventional ENN   95%       92%               13074     16333        11.3    16.5
                   Improved ENN       100%      100%              5205      6944         4.4     5.8

6 Conclusion
In this paper, we proposed a modified error function with two terms for the ENN algorithm. The modified error function harmonizes the updates of the weights connected to the hidden layer and those connected to the output layer in order to avoid the local minima problem in the learning process. Moreover, the modified error function requires no additional computation and does not change the network topology. Finally, the algorithm has been applied to the BSPQ problems, including the "11", "111" and "00" problems. The analysis of the results on the various BSPQ problems shows that the proposed algorithm is effective at avoiding the local minima problem
in less time and obtains good prediction results. However, more analysis of other types of problems and a more detailed discussion of the parameter settings are still required. Therefore, we will continue studying the improvement of the ENN.
References
1. Elman, J.L.: Finding Structure in Time. Cognitive Science 14 (1990) 179-211
2. Jordan, M.I.: Attractor Dynamics and Parallelism in a Connectionist Sequential Machine. Proceedings of the 8th Conference on Cognitive Science (1986) 531-546
3. Omlin, C.W., Giles, C.L.: Extraction of Rules from Discrete-Time Recurrent Neural Networks. Neural Networks 9 (1) (1996) 41-52
4. Stagge, P., Sendhoff, B.: Organisation of Past States in Recurrent Neural Networks: Implicit Embedding. Mohammadian, M. (Ed.), Computational Intelligence for Modelling, Control & Automation, IOS Press, Amsterdam (1999) 21-27
5. Pham, D.T., Liu, X.: Identification of Linear and Nonlinear Dynamic Systems Using Recurrent Neural Networks. Artificial Intelligence in Engineering 8 (1993) 90-97
6. Smith, A.: Branch Prediction with Neural Networks: Hidden Layers and Recurrent Connections. Department of Computer Science, University of California, San Diego, La Jolla, CA 92307 (2004)
7. Cybenko, G.: Approximation by Superposition of a Sigmoid Function. Mathematics of Control, Signals, and Systems 2 (1989) 303-314
8. Kwok, D.P., Wang, P., Zhou, K.: Process Identification Using a Modified Elman Neural Network. International Symposium on Speech, Image Processing and Neural Networks (1994) 499-502
9. Gao, X.Z., Gao, X.M., Ovaska, S.J.: A Modified Elman Neural Network Model with Application to Dynamical Systems Identification. Proceedings of the IEEE International Conference on System, Man and Cybernetics 2 (1996) 1376-1381
10. Chagra, W., Abdennour, R.B., Bouani, F., Ksouri, M., Favier, G.: A Comparative Study on the Channel Modeling Using Feedforward and Recurrent Neural Network Structures. Proceedings of the IEEE International Conference on System, Man and Cybernetics 4 (1998) 3759-3763
11. Kalinli, A., Sagiroglu, S.: Elman Network with Embedded Memory for System Identification. Journal of Information Science and Engineering 22 (2006) 1555-1668
12. Servan-Schreiber, C., Printz, H., Cohen, J.: A Network Model of Neuromodulatory Effects: Gain, Signal-to-Noise Ratio and Behavior. Science 249 (1990) 892-895
13. Cybenko, G.: Approximation by Superposition of a Sigmoid Function. Mathematics of Control, Signals, and Systems 2 (1989) 303-314
14. Wang, X., Tang, Z.: An Improved Backpropagation Algorithm to Avoid the Local Minima Problem. Neurocomputing 56 (2004) 455-460
15. http://www.mathworks.com/access/helpdesk/help/helpdesk.shtml
Regularization Versus Dimension Reduction, Which Is Better?

Yunfei Jiang1 and Ping Guo1,2

1 Laboratory of Image Processing and Pattern Recognition, Beijing Normal University, Beijing 100875, China
2 School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
[email protected], [email protected]
Abstract. There are two main solutions for the classification of high-dimensional data with a small number of samples. One is to classify the data directly in the high-dimensional space with regularization methods; the other is to reduce the data dimension first and then classify them in the feature space. But which one is actually better? In this paper, comparative studies of regularization and dimension reduction approaches are presented on two typical sets of high-dimensional real-world data: Raman spectroscopy signals and stellar spectra data. Experimental results show that in most cases the dimension reduction methods can obtain acceptable classification results and cost less computation time. When the number of training samples is insufficient and the class distribution is seriously unbalanced, the performance of some regularization approaches is better than that of the dimension reduction ones, but the regularization methods cost more computation time.
1
Introduction
In the real world there are data, such as Raman spectroscopy and stellar spectra data, for which the number of variables (wavelengths) is much higher than the number of samples. When classification (recognition) tasks are applied, ill-posed problems arise. For such ill-posed problems there are mainly two solutions. One is to classify the data directly in the high-dimensional space with regularization methods [1]; the other is to classify them in a feature space after dimension reduction. Many approaches have been proposed to solve the ill-posed problem [1,2,3,4,5,6,7,8]. Among these methods, Regularized Discriminant Analysis (RDA), the Leave-One-Out Covariance matrix estimate (LOOC) and the Kullback-Leibler Information Measure based classifier (KLIM) are regularization methods. RDA [2] is a method based on Linear Discriminant Analysis (LDA) which adds the identity matrix as a regularization term to solve the problem in matrix estimation, and LOOC [3] brings in diagonal matrices to solve the singularity problem. The KLIM estimator
Corresponding author.
is derived by Guo and Lyu [4] based on the Kullback-Leibler information measure. Regularized Linear Discriminant Analysis (R-LDA), Kernel Direct Discriminant Analysis (KDDA) and Principal Component Analysis (PCA) are dimension reduction methods. R-LDA was proposed by Lu et al. [6]; it introduces a regularized Fisher's discriminant criterion and, by optimizing this criterion, addresses the small sample size problem. KDDA [7] can be seen as an enhanced kernel Direct Linear Discriminant Analysis (kernel D-LDA) method. PCA [8] is a linear transformation that transforms the data to a new coordinate system such that the greatest variance of any projection of the data lies on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. In this paper, comparative studies on regularization and dimension reduction approaches are carried out with two typical sets of high-dimensional real-world data, Raman spectroscopy signals and stellar spectra data. The correct classification rate (CCR) and time cost are used to evaluate the performance of each method. The rest of this paper is organized as follows. Section 2 gives a review of discriminant analysis. Section 3 introduces the regularization approaches. Section 4 discusses the dimension reduction approaches. Experiments are described in Section 5. We then make some discussions in Section 6. Finally, conclusions are given in the last section.
2
Discriminant Analysis
Discriminant analysis assigns an observation x \in R^N with unknown class membership to one of k classes C_1, \ldots, C_k known a priori. There is a learning data set A = \{(x_1, c_1), \ldots, (x_n, c_n) \mid x_j \in R^N, c_j \in \{1, \ldots, k\}\}, where the vector x_j contains N explanatory variables and c_j indicates the index of the class of x_j. The data set allows one to construct a decision rule which associates a new vector x \in R^N with one of the k classes. The Bayes decision rule assigns the observation x to the class C_{j^*} which has the maximum a posteriori probability, which is equivalent, in view of the Bayes rule, to minimizing a cost function d_j(x):

j^* = \arg\min_j d_j(x), \quad j = 1, 2, \ldots, k,   (1)

d_j(x) = -2 \log(\pi_j f_j(x)).   (2)
Here \pi_j is the prior probability of class C_j and f_j(x) denotes the class conditional density of x, \forall j = 1, \ldots, k. Some classical discriminant analysis methods can be obtained by combining additional assumptions with the Bayes decision rule. For instance, Quadratic Discriminant Analysis (QDA) [1,5] assumes that the class conditional density f_j for class C_j is Gaussian N(\hat{m}_j, \hat{\Sigma}_j), which leads to the discriminant function

d_j(x) = (x - \hat{m}_j)^T \hat{\Sigma}_j^{-1} (x - \hat{m}_j) + \ln|\hat{\Sigma}_j| - 2 \ln \hat{\alpha}_j.   (3)
Here \hat{\alpha}_j is the prior probability, \hat{m}_j is the mean vector, and \hat{\Sigma}_j is the covariance matrix of the j-th class. If the prior probability \hat{\alpha}_j is the same for all classes, the term 2\ln\hat{\alpha}_j can be omitted and the discriminant function reduces to a simpler form. The parameters in the above equations can be estimated with the traditional maximum likelihood estimator:

\hat{\alpha}_j = \frac{n_j}{N},   (4)

\hat{m}_j = \frac{1}{n_j} \sum_{i=1}^{n_j} x_i,   (5)

\hat{\Sigma}_j = \frac{1}{n_j} \sum_{i=1}^{n_j} (x_i - \hat{m}_j)(x_i - \hat{m}_j)^T.   (6)

In practice, this method is penalized in high-dimensional spaces since it requires estimating many parameters. In the small-sample-number case it leads to an ill-posed problem: the parameter estimates can be highly unstable, giving rise to high variance in classification accuracy. By employing a method of regularization, one attempts to improve the estimates by biasing them away from their sample-based values towards values that are deemed to be more "physically plausible". For this reason, particular variants of QDA exist in order to regularize the estimation of \hat{\Sigma}_j. We can also assume that the covariance matrices are equal, i.e. \hat{\Sigma}_j = \Sigma, which yields the framework of LDA [9]. This method makes linear separations between the classes.
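For illustration, the plug-in QDA classifier of Eqs. (3)-(6) can be sketched as follows. This is a minimal NumPy version with our own function names; it ignores numerical safeguards for singular covariance matrices, which is exactly the problem the regularized estimators in the next section address.

```python
import numpy as np

def fit_qda(X, c, n_classes):
    """Maximum-likelihood estimates of Eqs. (4)-(6) per class."""
    c = np.asarray(c)
    N = X.shape[0]
    params = []
    for j in range(n_classes):
        Xj = X[c == j]
        alpha = Xj.shape[0] / N                  # Eq. (4)
        m = Xj.mean(axis=0)                      # Eq. (5)
        diff = Xj - m
        Sigma = diff.T @ diff / Xj.shape[0]      # Eq. (6)
        params.append((alpha, m, Sigma))
    return params

def qda_predict(x, params):
    """Assign x to the class minimizing the discriminant of Eq. (3)."""
    scores = []
    for alpha, m, Sigma in params:
        d = x - m
        scores.append(d @ np.linalg.inv(Sigma) @ d
                      + np.log(np.linalg.det(Sigma)) - 2 * np.log(alpha))
    return int(np.argmin(scores))
```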
3
Regularization Approaches
Regularization techniques have been highly successful in the solution of ill-posed and poorly-posed inverse problems. RDA, LOOC and KLIM are such methods; the crucial difference between them is the form of the covariance matrix estimator. We give a brief review of these methods below.

3.1 RDA
RDA is a regularization method proposed by Friedman [2]. RDA is designed for the small-sample-number case, where the covariance matrix in Eq. (3) takes the following form:

\Sigma_j(\lambda, \gamma) = (1 - \gamma)\Sigma_j(\lambda) + \frac{\gamma}{d} \mathrm{Trace}[\Sigma_j(\lambda)] \, I_d,   (7)

with

\Sigma_j(\lambda) = \frac{(1 - \lambda)\, n_j \hat{\Sigma}_j + \lambda N \hat{\Sigma}}{(1 - \lambda)\, n_j + \lambda N}.   (8)

The two parameters \lambda and \gamma, which are restricted to the range 0 to 1, are regularization parameters to be selected by maximizing the leave-one-out correct classification rate (CCR). \lambda controls the amount by which the \hat{\Sigma}_j are shrunk towards \hat{\Sigma}, while \gamma controls the shrinkage of the eigenvalues towards equality, as \mathrm{Trace}[\Sigma_j(\lambda)]/d is equal to the average of the eigenvalues of \Sigma_j(\lambda).
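A sketch of the RDA covariance estimate of Eqs. (7)-(8) is given below; `Sigma_j` and `Sigma` are assumed to be the per-class and pooled sample covariance matrices computed as in Section 2, and the grid search over (lambda, gamma) by leave-one-out CCR is not shown.

```python
import numpy as np

def rda_covariance(Sigma_j, Sigma, n_j, N, lam, gamma):
    """Regularized covariance of class j, Eqs. (7)-(8)."""
    d = Sigma_j.shape[0]
    # Eq. (8): shrink the class covariance towards the pooled covariance
    S_lam = ((1 - lam) * n_j * Sigma_j + lam * N * Sigma) / ((1 - lam) * n_j + lam * N)
    # Eq. (7): shrink the eigenvalues towards their average
    return (1 - gamma) * S_lam + (gamma / d) * np.trace(S_lam) * np.eye(d)
```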
3.2
LOOC
Another covariance matrix estimation formula was proposed by Hoffbeck and Landgrebe [3]. They examine the diagonal sample covariance matrix, the diagonal common covariance matrix, and some pair-wise mixtures of those matrices. The proposed estimator has the form

\Sigma_j(\xi_j) = \xi_{j1}\, \mathrm{diag}(\hat{\Sigma}_j) + \xi_{j2}\, \hat{\Sigma}_j + \xi_{j3}\, \hat{\Sigma} + \xi_{j4}\, \mathrm{diag}(\hat{\Sigma}).   (9)

The elements of the mixing parameter \xi_j = [\xi_{j1}, \xi_{j2}, \xi_{j3}, \xi_{j4}]^T are required to sum to unity: \sum_{l=1}^{4} \xi_{jl} = 1. In order to reduce the computation cost, only three cases are considered: (\xi_{j3}, \xi_{j4}) = 0, (\xi_{j1}, \xi_{j4}) = 0, and (\xi_{j1}, \xi_{j2}) = 0. The covariance matrix estimator is called LOOC because the mixing parameter \xi is optimized by the Leave-One-Out Cross validation method.

3.3 KLIM
The matrix estimation formula of KLIM is

\Sigma_j(h) = h I_d + \hat{\Sigma}_j,   (10)

where h is a regularization parameter and I_d is a d \times d identity matrix. This class of formula solves the matrix singularity problem in high-dimensional settings. In fact, as long as h is not too small, \Sigma_j^{-1}(h) exists with a finite value and the estimated classification rate will be stable.
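The KLIM estimator of Eq. (10) is simply a ridge-type correction of the sample covariance, as the following sketch shows (illustrative code; the choice of h, e.g. by cross validation, is left out).

```python
import numpy as np

def klim_covariance(Sigma_j, h):
    """KLIM covariance estimate, Eq. (10): add h to every diagonal element."""
    d = Sigma_j.shape[0]
    return h * np.eye(d) + Sigma_j

# Even if Sigma_j is singular (rank-deficient in the small-sample case),
# h * I_d + Sigma_j is positive definite for any h > 0 and can be inverted
# safely inside the QDA discriminant of Eq. (3).
```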
4
Dimension Reduction Approaches
Dimension reduction is another way to solve the ill-posed problem arising in the high-dimension, small-sample-number setting. R-LDA, KDDA and PCA are three common dimension reduction methods. R-LDA and KDDA can be considered variations of D-LDA. R-LDA introduces a regularized Fisher's discriminant criterion; the regularization helps to decrease the importance of the highly unstable eigenvectors, thereby reducing the overall variance. KDDA introduces a nonlinear mapping from the input space to an implicit high-dimensional feature space, where the nonlinear and complex distribution of patterns in the input space is "linearized" and "simplified" so that conventional LDA can be applied. PCA finds a p-dimensional subspace whose basis vectors correspond to the maximum-variance directions in the original data space. We give a brief review of R-LDA and KDDA below; since PCA is a well-known method, we do not describe it in detail in this paper.

4.1 R-LDA
The purpose of R-LDA [6] is to reduce the high variance related to the eigenvalue estimates of the within-class scatter matrix at the expense of potentially increased bias. The regularized Fisher criterion can be expressed as follows:

\Psi = \arg\max_{\Psi} \frac{|\Psi^T S_B \Psi|}{|\eta(\Psi^T S_B \Psi) + (\Psi^T S_W \Psi)|},   (11)
where S_B is the between-class scatter matrix, S_W is the within-class scatter matrix, and 0 \le \eta \le 1 is a regularization parameter. Determine the set U_m = [u_1, \ldots, u_m] of eigenvectors of S_B associated with the m \le c - 1 non-zero eigenvalues \Lambda_B. Define H = U_m \Lambda_B^{-1/2}, then compute the M (\le m) eigenvectors P_M = [p_1, \ldots, p_M] of H^T S_W H with the smallest eigenvalues \Lambda_W. Finally, combining these results, we obtain \Psi = H P_M (\eta I + \Lambda_W)^{-1/2}, which is considered a set of optimal discriminant feature basis vectors.

4.2 KDDA
The KDDA method [7] implements an improved D-LDA in a high-dimensional feature space using a kernel approach. Let R^N be the input space and assume that A and B represent the null spaces of the between-class scatter matrix S_B and the within-class scatter matrix S_W respectively; the complement spaces of A and B can be written as A' = R^N - A and B' = R^N - B. The optimal discriminant subspace sought by the KDDA algorithm is then the intersection space (A' \cap B), where A' is found by diagonalizing the matrix S_B. The feature space F is kept implicit by using kernel methods, where dot products in F are replaced with a kernel function in R^N so that the nonlinear mapping is performed implicitly.
5
Experiments
Two typical sets of real-world data, namely Raman spectroscopy and stellar spectra data, are used in our study. The Raman spectroscopy data set used in this work is the same data set as in reference [10]. This data set consists of three classes of substance: acetic acid, ethanol and ethyl acetate. After data preprocessing, all the data have been cut to 134 dimensions. There are 50 samples of acetic acid, 30 samples of ethanol and 290 samples of ethyl acetate, i.e., 370 samples in total. The stellar spectrum data used in the experiments are from the Astronomical Data Center (ADC) [11]. They are drawn from a standard stellar library for evolutionary synthesis. The data set consists of 430 samples.

Fig. 1. The typical three types of stellar spectra lines
The stellar data set can be divided into 3 classes. The numbers of samples in the classes are 88, 131 and 211, respectively. Each spectrum has 1221 wavelength points covering the range from 9.1 to 160000 nm. The typical distribution of these spectrum lines in the range from 100 nm to 1600 nm is shown in Fig. 1.
In the experiments, the data set is randomly partitioned into a training set and a testing set with no overlap between them. In the Raman data experiment, 15 samples are chosen randomly from each class and used as training samples to estimate the mean vector and covariance matrix; the remaining 310 samples are the test samples used to verify the classification accuracy. In the stellar data experiment, 40 samples are randomly chosen from each class for training, and the rest are used for testing.
In this study, we first investigate regularization methods, that is, classifying the data directly in the high-dimensional space with regularization methods. The other aspect of the experiments is to apply the R-LDA, KDDA and PCA methods for dimension reduction, respectively. With the reduced-dimension data set, we choose QDA as the classifier to obtain the correct classification rate (CCR) in the feature space. The results with the PCA method are obtained with the data reduced to 10 dimensions. All the experiments are repeated for 20 runs with different randomly partitioned data subsets, and all results reported in the tables of this paper are the average values over the twenty runs. In the experiments, we record the CCR and time cost for each method.
Table 1 shows the classification results with the different approaches. It should be pointed out that the dimension of the raw stellar data is too high compared with its sample number, so it is unstable to compute the CCR directly in such a high-dimensional space. We therefore reduce the dimension of the stellar data to 100 with the PCA method first, and consider the result still to be a sufficiently high-dimensional space for the problem under investigation. In the tables presented in this paper, the CCR is reported as a decimal fraction. Furthermore, the notation N/A indicates that the covariance matrix is singular, in which case reliable results cannot be obtained.

Table 1. The classification results with different approaches

Data     Evaluation   RDA      LOOC     KLIM     R-LDA    KDDA     PCA
Raman    CCR          N/A      N/A      0.8448   0.6536   0.7374   0.7625
         Time         99.399   39.782   178.57   0.2423   2.6132   0.4166
Stellar  CCR          0.9490   0.7786   0.9653   0.9677   0.9591   0.9531
         Time         150.1    40.058   194.3    0.1672   2.6678   4.157
For further comparison, we apply the PCA method to reduce the data to different dimensions before classification, still using the same QDA classifier to compute the CCR. The dimension of the two data sets is reduced to four different levels: 40, 20, 10 and 2 dimensions. For convenience of comparison, Table 2 shows the classification results for the different dimensions of the Raman spectroscopy and stellar spectra data together.
6 Discussion
Table 1 gives a quantitative comparison of the mean CCR and time cost obtained by direct classification with regularization approaches in the high-dimensional space versus classification after dimension reduction with a non-regularized classifier (QDA). As can be seen from the table, the time cost of classifying the high-dimensional data directly with regularization approaches is usually from 20 to 1000 times higher than that of classification after dimension reduction. And in most cases, the CCR obtained by classification with dimension reduction approaches is more acceptable than that obtained by classifying directly with regularization approaches.
From Table 1, we can find that when the training samples are insufficient and the distribution is seriously unbalanced, if the regularization parameters are too small, the ill-posed problem cannot be fully solved even with regularized classifiers. This phenomenon is very obvious for the Raman data, where applying the RDA and LOOC classifiers still runs into the covariance matrix singularity problem. Meanwhile, we also find that KLIM is a very effective regularization approach. For the Raman data, it obtains the best results among the three regularization approaches and the three dimension reduction approaches. If a data set is insufficient and seriously unbalanced, such as the Raman data, KLIM gives better CCR results than the dimension reduction approaches, but it costs more computation time than the other classifiers.

Table 2. The classification results for different dimensionality

Data     Evaluation   d=40     d=20     d=10     d=2
Raman    CCR          N/A      N/A      0.7625   0.6446
         Time         6.358    4.2011   0.4166   0.3014
Stellar  CCR          N/A      0.9574   0.9531   0.8963
         Time         39.2     13.913   4.157    2.2212
Does classification with dimension reduction always give more acceptable results than direct classification with regularization classifiers? The answer is not always positive. As illustrated in Table 2, the lower the reduced dimension, the less the computation time cost. We see that the results for the Raman data are worse than those for the stellar data, owing to the training samples being insufficient and seriously unbalanced. For the Raman data, even when we reduce the data to 20 dimensions, the ill-posed problem still exists, and the classification results for the Raman data are much worse than those for the stellar data. From the experiments we also find that the mean classification accuracy with principal components (PCs) is still acceptable even with only 2 PCs, although the classification accuracy degrades noticeably. When we reduce the data dimension to 2 PCs, for the stellar data the CCR obtained with the QDA classifier is lower than the CCR obtained with RDA. We consider that this is because a reduction in the number of features leads to a loss of discriminant ability for some data sets. In order to cut down the computational time and obtain a satisfactory classification
accuracy at the same time, a careful choice of the dimension level to which the data are reduced is needed. However, how to select a suitable dimension level is still an open problem.
7
Conclusions
In this paper, we presented comparative studies of regularization and dimension reduction with real-world data sets under the same working conditions. From the results, we can draw some conclusions: (1) Dimension reduction approaches often give acceptable CCR results; meanwhile, they reduce the computational time cost and use less memory than classification directly in high dimension with regularization methods. (2) The choice of the dimension level to reduce to is very important. There exists an appropriate dimension level at which we can obtain satisfactory results with as little computational time and memory as possible; however, it is very difficult to choose such a proper dimension level. (3) If the chosen dimension is not sufficiently low, the ill-posed problem still cannot be avoided; and if the reduced dimension is too low, there will be a loss in discriminant ability and consequently a degradation of the classification accuracy. (4) If the number of data samples is insufficient and the sample distribution is seriously unbalanced, as with the Raman spectroscopy data, some regularization approaches like KLIM may be more effective than the dimension reduction approaches.
Acknowledgments
The research work described in this paper was fully supported by a grant from the National Natural Science Foundation of China (Project No. 60675011). The authors would like to thank Fei Xing and Ling Bai for their help in part of the experimental work.
References
1. Aeberhard, D., Coomans, D., De Vel, O.: Comparative Analysis of Statistical Pattern Recognition Methods in High Dimensional Settings. Pattern Recognition 27 (1994) 1065-1077
2. Friedman, J.H.: Regularized Discriminant Analysis. J. Amer. Statist. Assoc. 84 (1989) 165-175
3. Hoffbeck, J.P., Landgrebe, D.A.: Covariance Matrix Estimation and Classification with Limited Training Data. IEEE Trans. Pattern Analysis and Machine Intelligence 18 (1996) 763-767
4. Guo, P., Lyu, M.R.: Classification for High-Dimension Small-Sample Data Sets Based on Kullback-Leibler Information Measure. In: Proceedings of the 2000 International Conference on Artificial Intelligence, H. R. Arabnia (2000) 1187-1193
5. Webb, A.R.: Statistical Pattern Recognition. Oxford University Press, London (1994)
6. Lu, J., Plataniotis, K.N., Venetsanopoulos, A.N.: Regularization Studies of Linear Discriminant Analysis in Small Sample Size Scenarios with Application to Face Recognition. Pattern Recognition Letters 26 (2005) 181-191
7. Lu, J., Plataniotis, K.N., Venetsanopoulos, A.N.: Face Recognition Using Kernel Direct Discriminant Analysis Algorithms. IEEE Trans. Neural Networks 14 (2003) 117-126
8. Jolliffe, I.T.: Principal Component Analysis. Springer-Verlag (1996)
9. Fisher, R.A.: The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics 7 (1936) 179-188
10. Guo, P., Lu, H.Q., Du, W.M.: Pattern Recognition for the Classification of Raman Spectroscopy Signals. Journal of Electronics and Information Technology 26 (2002) 789-793 (in Chinese)
11. Stellar Data: ADC website: http://adc.gsfc.nasa.gov/adc/sciencedata.html
Integrated Analytic Framework for Neural Network Construction

Kang Li1, Jian-Xun Peng1, Minrui Fei2, Xiaoou Li3, and Wen Yu4

1 School of Electronics, Electrical Engineering & Computer Science, Queen's University Belfast, Belfast BT9 5AH, UK
{K.Li,J.Peng}@qub.ac.uk
2 Shanghai Key Laboratory of Power Station Automation Technology, School of Mechatronics and Automation, Shanghai University, Shanghai 200072, China
3 Departamento de Computación, CINVESTAV-IPN, A.P. 14-740, Av. IPN 2508, México D.F., 07360, México
4 Departamento de Control Automático, CINVESTAV-IPN, A.P. 14-740, Av. IPN 2508, México D.F., 07360, México
Abstract. This paper investigates the construction of a wide class of singlehidden layer neural networks (SLNNs) with or without tunable parameters in the hidden nodes. It is a challenging problem if both the parameter training and determination of network size are considered simultaneously. Two alternative network construction methods are considered in this paper. Firstly, the discrete construction of SLNNs is introduced. The main objective is to select a subset of hidden nodes from a pool of candidates with parameters fixed ‘a priori’. This is called discrete construction since there are no parameters in the hidden nodes that need to be trained. The second approach is called continuous construction as all the adjustable network parameters are trained on the whole parameter space along the network construction process. In the second approach, there is no need to generate a pool of candidates, and the network grows one by one with the adjustable parameters optimized. The main contribution of this paper is to show that the network construction can be done using the above two alternative approaches, and these two approaches can be integrated within a unified analytic framework, leading to potentially significantly improved model performance and/or computational efficiency.
1 Introduction

The single-hidden layer neural network (SLNN) represents a large class of flexible and efficient structures owing to its excellent approximating capabilities [1][2][12]. An SLNN is a linear combination of basis functions that are arbitrary (usually nonlinear) functions of the neural inputs. Depending on the type of basis functions used, various SLNNs have been proposed. Two general categories of SLNNs exist in the literature: 1) the first category includes SLNNs with no tunable parameters in the activation functions, such as the Volterra and polynomial neural nets [13][14]; 2) the second category includes SLNNs with tunable parameters [3]-[7][11]; these include radial basis function neural nets (RBFNNs), probabilistic RBFNNs, and MLPs
with single hidden layer, or heterogeneous neural networks whose activation functions are heterogeneous [15]. In some extreme cases, the tunable parameters in the activation functions are discontinuous [16]. There are two important issues in the applications of SLNNs. One issue is the network learning, i.e. to optimize the parameters; the other is to determine the network structure or the number of hidden nodes, based on the parsimonious principle. These two issues are closely coupled, and it is a mixed integer hard problem if the two issues are considered simultaneously for many SLNNs. Given the two general categories of the SLNNs, their constructions can be quite different. The construction of the first type of SLNNs is to select a small subset of neural nodes from a candidate pool and it is naturally a discrete SLNN construction procedure, since there is no adjustable parameter in the hidden nodes. The main difficulty is however that the candidate pool can be extremely large, the computational complexity to search for the best subset is extremely high [7][9][12]. To alleviate the computational burden, forward subset selection methods have claimed to be one of a few efficient approaches. Forward subset selection algorithms, such as orthogonal least squares (OLS) method [4][5][6] or the fast recursive algorithm (FRA) [7], select one basis function each time from the candidates which maximizes the reduction of the cost function, e.g. the sum of squared errors (SSE). This procedure is repeated until the desired number of, say n, basis functions have been selected. If n is unknown a 'prior', some selection criteria could be applied to stop the network construction, such as the Akaike’s information criterion (AIC) [8]. For the second category of SLNNs, the network construction is a very complex problem and can be quite time-consuming due to the fact that these adjustable parameters in the activation functions have to be optimized. One solution is to convert it to a discrete neural net construction problem, i.e. generate a pool of candidate hidden neurons with the nonlinear parameters taking various discrete values, then use subset selection methods to select the best hidden neurons. One of the last development in this area is the Extreme Learning Machine concept (ELM) proposed by Huang [10]. In ELM, the nonlinear parameters are assigned with some random values ‘a priori’ and the only set of parameters to be solved is the linear output weights. This concept has been successfully applied to a wide range of problems, and it is quite effective for less complicated problems, and particularly useful for some neural nets that the activation functions are discontinuous [16]. However, ELM has two potential disadvantages. Firstly, since the parameters in the activation functions are determined ‘a priori’, the candidate neuron pool is therefore discrete in nature and may not necessarily contain the best neurons with the optimal parameters in the parameter space. Secondly, since a small network is usually desirable, the network construction is then to select a subset of neurons from a large pool of candidates, which has been described above. This again can be computationally very expensive. Continuous construction of the second category of SLNNs is to optimize the tunable parameters on the whole parameter space along the network construction procedures. This is a very complicated process if both the determination of the network size
Integrated Analytic Framework for Neural Network Construction
485
and parameter learning are considered simultaneously, since it is a mixed integer hard problem. Despite that few analytic methods are available to efficiently and effectively address this problem, the two separate issues of parameter training and determination of the number of hidden nodes have been studied extensively in the literature. For example, for the training of different SLNNs like RBF or MLP, various supervised, unsupervised and hybrid methods have been extensively studied in the last decades [2] [11]. Guidelines and criterions have also been proposed for network growing and pruning [3]. This paper introduces the two alternative approaches for the construction of SLNNs, namely discrete and continuous approaches. Each method has their advantages and disadvantages. It will then be shown that these two approaches can be integrated within one analytic framework, leading to potentially significantly reduced computational complexity and/or improved model performance. This paper is organized as follows. Section 2 briefly introduce the discrete construction of SLNNs. Section 3 will show that after appropriate modification of the discrete construction method, a continuous construction method can be derived. Section 4 presents a simulation example to illustrate these two methods, and Section 5 is the conclusion.
2 Discrete Construction of SLNNs The main objective of discrete SLNN construction is to select a subset of hidden nodes from a pool of candidates using subset selection method. This is called discrete construction since the parameters need not to be trained. The corresponding SLNNs either have no adjustable parameters in the activation function of hidden nodes, or the parameters are assigned values ‘a priori’. This process can be described as follows. Suppose a set of M candidate basis functions {φ i (t ), i = 1,2," , M } , and a set of N samples are used for the construction and training of SLNNs, leading to the following full regression matrices
Φ = [φ1 , φ2 ," , φM ],
φi = [φi (1), φi (2)," , φi ( N )]T , i = 1, 2," , M .
(1)
Now, consider the discrete construction of SLNNs, i.e. {φ i (t ), i = 1,2," , M } have no tunable parameters, such as the Volterra series, or the tunable parameters in these basis functions are assigned with some values ‘a priori’. The main objective is then to select n significant basis functions, denoted as p1 , p 2 , " , p n , which form a selected regression matrix P = [ p1, p2 ,", pn ]
(2)
producing the network output of y = Pθ + e
(3)
486
K. Li et al.
best fitting the data samples in the sense of least-squares, i.e. the sum squared-errors (SSE) is minimized J (P) =
min
Φ n∈Φ ,θ ∈ℜn
{e T e} =
min
Φ n∈Φ ,θ ∈ℜn
{( y − Φ nθ ) T ( y − Φ nθ )} ,
(4)
where Φ n is an N × n matrix composing of n columns from Φ , θ denotes the output weights, and the selected regression matrix P = [ p1 , p 2 , " , p n ] .
(5)
If the selected regression matrix P is of full column-rank, the least-squares estimation of the output weights in (4) is given by θ = ( P T P ) −1 P T y .
(6)
Theoretically, each subset of n terms out of the M candidates forms a candidate neural net, and there are M ! / (n! /( M − n)!) possible combinations. Obviously, to obtain the optimal subset is computationally very expensive or impossible if M is a very large number, and part of this is also referred to as the curse of dimensionality. To overcome the difficulty, the forward stepwise model selection methods select basis functions one by one with the cost function being maximally reduced each time. Obviously a series of intermediate models are generated during the forward stepwise selection process. To formulate this forward selection process, the regression matrix of the kth intermediate network (with k basis functions having been selected) is denoted as Pk = [ p1 , p 2 , " , p k ], k = 1,2," , n .
(7)
Obviously, the cost function (4) becomes
J ( Pk ) = y T y − y T Pk ( PkT Pk ) −1 PkT y = y T ( I − Pk ( PkT Pk ) −1 PkT ) y .
(8)
Now suppose one more basis function pk +1 is selected, the net decrease in the cost function is given by
ΔJ k +1 ( p k +1 ) = J ( Pk ) − J ([ Pk , p k +1 ]) ,
(9)
where Pk +1 = [ Pk , p k +1 ] is the regression matrix of the (k+1)th intermediate model. In model selection, each selected term achieves the maximum contribution among all remaining candidates, i.e.
ΔJ k +1 (p k +1 ) = max{ΔJ k +1 (ϕ ), ϕ ∈ Φ, ϕ ≠ p j , j = 1, ", k} .
(10)
Since the number of candidate basis functions can be very large, an efficient algorithm is required to perform the optimization problem (10). To introduce the forward selection algorithm, a matrix series is defined
⎧I − Pk (PkT Pk )−1 PkT , 0 < k ≤ n, Rk Δ ⎨ k = 0. ⎩ I,
(11)
The matrix R_k is a residue matrix coming directly from (8), which projects the output into the residue space. This residue matrix series has several interesting properties [7][9][11]. In particular, the following hold for R_k:
R_{k+1} = R_k - \frac{R_k p_{k+1} p_{k+1}^T R_k^T}{p_{k+1}^T R_k p_{k+1}}, \quad k = 0, 1, \ldots, n-1,   (12)
R_k^T = R_k, \quad R_k R_k = R_k, \quad k = 0, 1, \ldots, n,   (13)
R_i R_j = R_j R_i = R_i, \quad i \ge j;\; i, j = 0, 1, \ldots, n,   (14)
R_k \varphi = \begin{cases} 0, & \text{if } \operatorname{rank}([p_1, \ldots, p_k, \varphi]) = k, \\ \varphi^{(k)} \neq 0, & \text{if } \operatorname{rank}([p_1, \ldots, p_k, \varphi]) = k + 1, \end{cases}   (15)
where \varphi^{(k)} = R_k \varphi. The net contribution of p_{k+1} to the cost function is now given by
\Delta J_{k+1}(p_{k+1}) = y^T (R_k - R_{k+1})\, y.   (16)
To further simplify (16), define
\varphi_i^{(k)} \triangleq R_k \varphi_i, \quad i = 1, \ldots, M, \qquad y^{(k)} \triangleq R_k y,   (17)
p_j^{(k)} \triangleq R_k p_j, \quad j = 1, \ldots, n.   (18)
Then it holds that
\varphi^{(k+1)} = R_{k+1}\varphi = \varphi^{(k)} - \frac{(p_{k+1}^{(k)})^T \varphi^{(k)}}{(p_{k+1}^{(k)})^T p_{k+1}^{(k)}}\, p_{k+1}^{(k)},   (19)
\Delta J_{k+1}(p_{k+1}) = \frac{\big[(y^{(k)})^T p_{k+1}^{(k)}\big]^2}{(p_{k+1}^{(k)})^T p_{k+1}^{(k)}}.   (20)
Equation (20) expresses the net contribution of a selected basis function to the cost function, based on which the discrete construction of SLNNs proceeds as follows.
Algorithm 1 (A1): Discrete construction of SLNNs
Step 1: Initialisation phase. Select N training samples and generate a candidate pool of hidden nodes, denoted T_pool, with corresponding full regression matrix Φ. For the first category of SLNNs, the candidates are all possible basis functions without tunable parameters. For the second category of SLNNs, the tunable parameters are
assigned random values as in the ELM method [10][16], or assigned according to 'a priori' information. Define the stop criterion of network construction; this is usually a given level of the minimum desired contribution of the basis functions, δ_E, or a criterion such as AIC [8]. Set the count of selected basis functions k = 0.
Step 2: Selection phase. The candidate basis function φ_i, i = 1, ..., M, in T_pool that satisfies the following criterion is selected:
p_{k+1}: \quad \max_{\varphi_i \in T_{pool}} \Delta J_{k+1}(\varphi_i) = \big[(y^{(k)})^T \varphi_i^{(k)}\big]^2 \big/ \big[(\varphi_i^{(k)})^T \varphi_i^{(k)}\big].   (21)
Step 3: Check phase. Check whether the network construction criterion is satisfied, e.g.
\Delta J_{k+1}(p_{k+1}) \le \delta_E.   (22)
If (22) is true then stop; otherwise, continue.
Step 4: Update phase. Add p_{k+1} into the network and remove it from T_pool. Update the intermediate variables according to (19), and let k = k + 1.
Step 5: Go to Step 2.
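A compact numpy sketch of Algorithm 1 is given below. It is an illustration only, not the authors' implementation: it keeps the residues φ_i^(k) and y^(k) of (17)-(19) explicitly instead of forming R_k, selects the candidate with the largest net contribution (21), and stops when the contribution falls below δ_E. The function name and its arguments (`Phi`, `y`, `delta_E`) are ours.

```python
import numpy as np

def forward_select(Phi, y, delta_E):
    """Discrete SLNN construction (Algorithm 1), illustrative sketch.

    Phi : (N, M) full regression matrix of candidate basis functions.
    y   : (N,) target vector.
    Returns the indices of the selected columns, in selection order.
    """
    Phi_k = Phi.copy()                 # residues phi_i^(k), updated via (19)
    y_k = y.copy()                     # residue y^(k)
    remaining = list(range(Phi.shape[1]))
    selected = []
    while remaining:
        # net contribution (21) of every remaining candidate
        num = (y_k @ Phi_k[:, remaining]) ** 2
        den = np.einsum('ij,ij->j', Phi_k[:, remaining], Phi_k[:, remaining])
        dJ = num / np.maximum(den, 1e-12)
        best = int(np.argmax(dJ))
        if dJ[best] <= delta_E:        # stop criterion (22)
            break
        j = remaining.pop(best)
        selected.append(j)
        p = Phi_k[:, j].copy()         # p_{k+1}^{(k)}
        pp = p @ p
        # residue update for the next step, cf. (19)
        Phi_k -= np.outer(p, (p @ Phi_k) / pp)
        y_k -= p * (p @ y_k) / pp
    return selected
```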
3 Continuous Construction of SLNNs
Continuous construction of the second category of SLNNs optimizes the tunable parameters over the whole parameter space along the network construction procedure. In the following, it will be shown that, after appropriate modification of the above discrete SLNN construction method, a continuous method can be derived. For continuous construction of SLNNs with tunable parameters, the basis functions {φ_i(t), i = 1, 2, ..., M} in (1) can be redefined as φ_i(x(t), ω), where x(t) is the network input vector, ω is the tunable parameter vector defined in the continuous parameter space, and t is the time instant. The notation φ_i(x(t), ω) can be further simplified to φ_i(ω). Obviously this representation covers a wider class of neural networks. The continuous construction of SLNNs optimizes the parameters ω in the continuous parameter space. In comparison with the discrete construction, it starts the construction with no candidate basis functions, so the computational complexity and memory storage can be significantly reduced. For simplicity, it is supposed that there is one type of basis function, such as the Gaussian function or the tangent sigmoid. The network construction procedure grows the network by adding basis functions one by one, each time maximizing the reduction of the cost function defined in (9). Based on (17)-(20), the net change of the cost function caused by adding one more basis function is a function of ω:
\Delta J_{k+1}(\omega) = C^2(\omega) / D(\omega),   (23)
where
C(\omega) = y^T R_k \phi(\omega) = \sum_{t=1}^{N} y^{(k)}(t)\, \phi^{(k)}(x(t), \omega), \qquad D(\omega) = [\phi(\omega)]^T R_k \phi(\omega) = \sum_{t=1}^{N} \big[\phi^{(k)}(x(t), \omega)\big]^2.   (24)
The maximum contribution obtainable by adding one more basis function \phi(\omega) is identified as
\Delta J_{k+1}(\omega_{k+1}) = \max\{\Delta J_{k+1}(\omega),\ \omega \in \mathbb{R}^{n+1}\}.   (25)
Now (25) is an unconstrained continuous optimization problem, and a number of first-order and second-order search algorithms can be applied, such as Newton's method, the conjugate gradient method, etc.
Algorithm 2 (A2): Continuous construction of SLNNs
Step 1: Initialization. Let k = 0 and the cost function J = y^T y; define the stop criterion of network construction, usually a given level of the minimum desired contribution of the basis functions, δ_E, or a criterion such as AIC.
Step 2: Search for the optimum parameter ω_{k+1} for the (k+1)th hidden node using a conventional first-order or second-order search algorithm with the first- and second-order derivative information.
Step 3: Check phase. Check whether the network construction criterion is satisfied, e.g.
\Delta J_{k+1}(\omega_{k+1}) \le \delta_E.   (26)
If (26) is true then stop; otherwise, let k = k + 1 and continue.
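As an illustration of Step 2, the sketch below maximizes ΔJ(ω) of (23)-(24) for a single Gaussian basis function with a scalar input. The parameterization (one centre plus one log-width) and the use of a derivative-free Nelder-Mead search in place of the first- or second-order methods mentioned in the text are our assumptions for the sake of a short example.

```python
import numpy as np
from scipy.optimize import minimize

def residue(v, P):
    """Apply the residue matrix R_k = I - P (P^T P)^{-1} P^T to v."""
    if P is None or P.shape[1] == 0:
        return v
    return v - P @ np.linalg.lstsq(P, v, rcond=None)[0]

def add_gaussian_node(x, y, P, omega0):
    """One step of continuous construction (Algorithm 2), illustrative sketch.

    x      : (N,) scalar network inputs.
    y      : (N,) training targets.
    P      : (N, k) regression matrix of the k nodes already selected (or None).
    omega0 : initial guess [centre, log_width] for the new node.
    """
    y_k = residue(y, P)                                   # y^(k)

    def neg_dJ(omega):
        c, log_w = omega
        phi = np.exp(-((x - c) / np.exp(log_w)) ** 2)     # Gaussian basis
        phi_k = residue(phi, P)                           # phi^(k) = R_k phi
        C = y_k @ phi_k
        D = phi_k @ phi_k + 1e-12
        return -(C * C) / D                               # negative of (23)

    res = minimize(neg_dJ, np.asarray(omega0, float), method='Nelder-Mead')
    return res.x, -res.fun                                # omega_{k+1}, Delta J
```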
4 Simulation
Consider the following nonlinear function, to be approximated using an RBF neural network:
f(x) = \frac{\sin(x)}{x}, \quad -10 \le x \le 10.   (27)
400 data points were generated using y = f(x) + ξ, where x was uniformly distributed within the range [-10, 10] and the noise ξ ~ N(0, 0.2). The first 200 points were used for network construction and training and the rest for validation.
Table 1. Test performance
Network size (m) | SSE, training data | NPE (%), validation | Running time (s)
                 |   A1       A2      |   A1       A2       |   A1       A2
       1         |  20.39    8.78     |  71.00    47.60     |  0.344    0.093
       2         |  18.41    8.01     |  69.35    47.18     |  0.359    0.094
       3         |  14.29    7.31     |  60.69    44.97     |  0.391    0.235
       4         |  11.24    6.71     |  53.99    43.71     |  0.390    0.312
Fig. 1. Top: Equi-height contour of cost function with respect to centre and width for the 1st neuron; Bottom: input and output signals to be modelled using the first neuron
Fig. 2. Top: Equi-height contour of cost function with respect to centre and width for the 4th neuron; Bottom: input and output signals to be modelled using the 4th neuron
Both the discrete (A1) and continuous (A2) construction methods were used to produce the RBF networks. For discrete construction of the RBFNN, all 200 training data samples were used as the candidate centres, and the widths were predetermined as σ² = 200 by a series of tests. Networks of sizes from 1 to 6 were produced. The final
cost function (sum of squared errors) over the training data set and the running time of both algorithms are listed in Table 1 for comparison. The networks produced by the two algorithms were then tested on the validation data set, and the normalized prediction errors (NPE) over the validation data set are also listed in Table 1. NPE in Table 1 is defined as
\mathrm{NPE} = \Big[\sum_{t=1}^{N} (\hat{y}(t) - y(t))^2 \Big/ \sum_{t=1}^{N} y^2(t)\Big]^{1/2} \times 100\%,   (28)
where ŷ(t) denotes the network output. Figs. 1 and 2 illustrate the equi-height contours of the cost function (SSE) with respect to the centre and width of the first and fourth hidden nodes in the RBF network. The input and output signals to be modelled by the two hidden neurons are also shown in the diagrams. It is obvious that the search space is quite complex, and pre-determined widths and centres for hidden nodes may not produce a good neural model. Fig. 2 and Table 1 also reveal that increasing the number of RBF hidden nodes beyond 4 has little impact on the network performance, as the signals to be modelled tend to be simply noise.
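For reference, the data set of (27) and the NPE measure of (28) can be reproduced as follows. The random seed, the 50/50 split, and the interpretation of 0.2 as the noise variance are our assumptions; the paper does not state these details.

```python
import numpy as np

rng = np.random.default_rng(0)                 # seed is our choice, not the paper's
x = rng.uniform(-10.0, 10.0, 400)
y = np.sinc(x / np.pi) + rng.normal(0.0, np.sqrt(0.2), 400)   # y = sin(x)/x + xi
x_train, y_train = x[:200], y[:200]
x_val, y_val = x[200:], y[200:]

def npe(y_hat, y):
    """Normalized prediction error (28), in percent."""
    return 100.0 * np.sqrt(np.sum((y_hat - y) ** 2) / np.sum(y ** 2))
```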
5 Conclusion
An integrated framework has been proposed for the construction of a wide range of single-hidden-layer neural networks (SLNNs) with or without tunable parameters. Firstly, a discrete SLNN construction method has been introduced. After a proper modification, a continuous construction method has also been derived. Each of the two alternative methods has its advantages and disadvantages. It has been shown that the two methods can be performed within one analytic framework.
References
1. Igelnik, B., Pao, Y.H.: Additional Perspectives of Feedforward Neural-nets and the Functional-link. IJCNN '93, Nagoya, Japan (1993)
2. Adeney, K.M., Korenberg, M.J.: Iterative Fast Orthogonal Search Algorithm for MDL-based Training of Generalized Single-layer Network. Neural Networks 13 (2000) 787-799
3. Huang, G.-B., Saratchandran, P., Sundararajan, N.: A Generalized Growing and Pruning RBF (GGAP-RBF) Neural Network for Function Approximation. IEEE Trans. Neural Networks 16 (2005) 57-67
4. Chen, S., Billings, S.A.: Neural Network for Nonlinear Dynamic System Modelling and Identification. International Journal of Control 56 (1992) 319-346
5. Zhu, Q.M., Billings, S.A.: Fast Orthogonal Identification of Nonlinear Stochastic Models and Radial Basis Function Neural Networks. Int. J. Control 64 (5) (1996) 871-886
6. Chen, S., Cowan, C.F.N., Grant, P.M.: Orthogonal Least Squares Learning Algorithm for Radial Basis Functions. IEEE Trans. Neural Networks 2 (1991) 302-309
7. Li, K., Peng, J., Irwin, G.W.: A Fast Nonlinear Model Identification Method. IEEE Trans. Automatic Control 50 (8) (2005) 1211-1216
8. Akaike, H.: A New Look at the Statistical Model Identification. J. R. Statist. Soc. Ser. B 36 (1974) 117-147
9. Li, K., Peng, J., Bai, E.-W.: A Two-stage Algorithm for Identification of Nonlinear Dynamic Systems. Automatica 42 (7) (2006) 1189-1197
10. Huang, G.B., Chen, L., Siew, C.K.: Universal Approximation Using Incremental Constructive Feedforward Networks with Random Hidden Nodes. IEEE Trans. Neural Networks 17 (4) (2006) 879-892
11. Peng, J., Li, K., Huang, D.S.: A Hybrid Forward Algorithm for RBF Neural Network Construction. IEEE Trans. Neural Networks 17 (6) (2006) 1439-1451
12. Li, K., Peng, J., Fei, M.: Real-time Construction of Neural Networks. Artificial Neural Networks – ICANN 2006, Lecture Notes in Computer Science, LNCS 4131, Springer-Verlag (2006) 140-149
13. Adeney, K.M., Korenberg, M.J.: On the Use of Separable Volterra Networks to Model Discrete-time Volterra Systems. IEEE Trans. Neural Networks 12 (1) (2001) 174-175
14. Nikolaev, N., Iba, H.: Learning Polynomial Feedforward Networks by Genetic Programming and Backpropagation. IEEE Trans. Neural Networks 14 (2) (2003) 337-350
15. Weingaertner, D., Tatai, V.K., Gudwin, R.R., Von Zuben, F.J.: Hierarchical Evolution of Heterogeneous Neural Networks. Proceedings of the 2002 Congress on Evolutionary Computation (CEC2002) 2 (2002) 1775-1780
16. Huang, G.B., Zhu, Q.Y., Mao, K.Z., Siew, C.K., Saratchandran, P., Sundararajan, N.: Can Threshold Networks be Trained Directly? IEEE Trans. Circuits and Systems-II: Express Briefs 53 (3) 187-191
A Novel Method of Constructing ANN
Xiangping Meng1, Quande Yuan2, Yuzhen Pi2, and Jianzhong Wang2
1 School of Electrical Engineering & Information Technology, Changchun Institute of Technology, 130012, Changchun, China
[email protected], http://www.ccit.edu.cn
2 School of Information Engineering, Northeast Dianli University, 132012, Jilin, China
{yuanquande,piyuzhen}@gmail.com, [email protected]
Abstract. Artificial Neural Networks (ANNs) are powerful computational and modeling tools; however, there are still some limitations in ANNs. In this paper, we give a new method to construct artificial neural networks, based on multi-agent theory and a reinforcement learning algorithm. All nodes in the new neural network are represented as agents, and these agents have learning ability through a reinforcement learning algorithm. The experiment results show that this method is effective.
1 Introduction
Artificial Neural Networks (ANNs) are powerful computational tools that have found extensive acceptance in many disciplines for solving complex real-world problems. An ANN may be defined as a structure comprised of densely interconnected adaptive simple processing elements (called artificial neurons or nodes) that are capable of performing massively parallel computations for data processing and knowledge representation [1][2]. Although ANNs are drastic abstractions of their biological counterparts, the idea of ANNs is not to replicate the operation of biological systems but to make use of what is known about the functionality of biological networks for solving complex problems. ANNs have gained great success; however, there are still some limitations, such as:
1. Most ANNs are not really distributed, so their nodes or neurons cannot work in parallel.
2. Training time is long.
3. The number of nodes is limited by the capability of the computer.
To solve these problems, we try to reconstruct the NN using Multi-Agent System theory.
Supported by the Key Project of the Ministry of Education of China for Science and Technology Research (ID: 206035).
Multi-agent technology is a hotspot in recent research on artificial intelligence. The concept of an agent is a natural analogy of the real world. In the new network, the units, represented as agents, can run on different computers. We use a reinforcement learning algorithm as the learning rule of this new neural network.
2 Multi-Agent System and Reinforcement Learning
2.1 Multi-Agent System
Autonomous agents and multi-agent systems (MASs) are rapidly emerging as a powerful paradigm for designing and developing complex software systems. In fact, the architecture of a multi-agent system can be viewed as a computational organization. Surprisingly, there is no agreement on what an agent is: there is no universally accepted definition of the term agent. This problem has been discussed in detail in [3] and other papers. We agree that an agent should have the following characteristics:
1. Autonomy: the agent is capable of acting independently and exhibits control over its internal state.
2. Reactivity: it maintains an ongoing interaction with its environment and responds to changes that occur in it (in time for the response to be useful).
3. Pro-activity: the agent generates and attempts to achieve goals, and is not driven solely by events.
4. Social ability: the ability to interact with other agents (and possibly humans) via some kind of agent-communication language, and perhaps cooperate with others.
As an intelligent agent, an agent should have another important ability, learning, which can help it adapt to a dynamic environment. A multi-agent system contains a number of agents which interact through communication, are able to act in an environment, and are linked by other (organizational) relationships. A MAS can do more than a single agent. From this perspective, every neuron in an artificial neural network can be viewed as an agent, which takes input and decides what to do next according to its own state and policy. Through the agents' interaction, the MAS produces an output. Then the supervisor gives feedback to the output agent, and the output agent gives feedback to the other agents.
2.2 Reinforcement Learning
Learning behaviors in a multi-agent environment are crucial for developing and adapting multi-agent systems. Reinforcement learning (RL) has been successful in finding optimal control policies for a single agent operating in a stationary environment, specifically a Markov decision process (MDP). RL finds optimal strategies only
through trial and error; it can also be applied offline, as a pre-processing step during the development of a game, and then be continuously improved online after its release. Stochastic games extend the single-agent Markov decision process to include multiple agents whose actions all impact the resulting rewards and next state. Stochastic games are a generalization of MDPs to multiple agents, and can be used as a framework for investigating multi-agent learning. Reinforcement learning has opened the way for designing autonomous agents capable of acting in unknown environments by exploring different possible actions and their consequences. Q-learning is a standard reinforcement learning technique. In single-agent systems, Q-learning possesses a firm foundation in the theory of Markov decision processes. The basic idea behind Q-learning is to try to determine which actions, taken from which states, lead to rewards for the agent (however these are defined), and which actions, from which states, lead to the states from which those rewards are available, and so on. The value of each action that could be taken in each state, i.e., its Q-value, is a time-discounted measure of the maximum reward available to the agent by following a path through state space of which the action in question is a part. A typical Q-learning model is shown in Fig. 1.
Fig. 1. A typical reinforcement learning model
Q-learning consists of iteratively computing the values of the action-value function, using the following update rule:
Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha\, [r + \beta V(s')],   (1)
where α is a positive step-size parameter and β is a discount factor with 0 ≤ β < 1, which represents the relative importance of future against immediate rewards, and V(s') = max_{a'} Q(s', a'). Q-learning will converge to a best response independently of the agents' behavior as long as the conditions for convergence are satisfied: if α decreases appropriately with time and each state-action pair is visited infinitely often in the limit, then the algorithm converges to a best response for all s ∈ S and a ∈ A(s) with probability one. In some multi-agent environments, however, basic Q-learning is not enough for an agent, since multi-agent environments are inherently non-stationary: the other agents are free to change their behavior as they also learn and adapt. In research on multi-agent Q-learning, most researchers adopt the framework of general-sum stochastic games. In multi-agent Q-learning, the Q-function of an agent is defined over states and joint action vectors a = (a_1, a_2, \ldots, a_n), rather than state-action pairs. The agents start with arbitrary Q-values, and the Q-values are updated as follows:
Q^i_{t+1}(s, a) = (1 - \alpha)\, Q^i_t(s, a) + \alpha\, [r^i_t + \beta \cdot V^i(s_{t+1})],   (2)
where V^i(s_{t+1}) is the state value function, and
V^i(s_{t+1}) = \max_{a_i \in A} f^i\big(Q^i_t(s_{t+1}, a)\big).   (3)
In this generic formulation, the key elements are the learning policy, i.e., the method for selecting the action a, and the determination of the value function V^i(s_{t+1}), with 0 ≤ α < 1. However, the number of agents in a MAS is usually very large, and it is difficult to obtain and maintain the information of all agents because computer resources are limited. In fact, agents need not know the states and actions of all the others; interacting with their neighbors is enough.
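A minimal single-agent version of the update in (1) can be written as follows. The tabular representation and the ε-greedy action choice are standard illustrative choices, not details taken from the paper.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, beta=0.9):
    """Tabular Q-learning update, cf. (1): Q <- (1-a)Q + a[r + b max Q(s', .)]."""
    Q[s, a] = (1.0 - alpha) * Q[s, a] + alpha * (r + beta * np.max(Q[s_next]))
    return Q

def epsilon_greedy(Q, s, epsilon, rng):
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))
```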
3 Artificial Neural Network Based on MAS
As discussed above, we sometimes need a neural network with a huge number of neurons, which is difficult to build using traditional methods. The new method we propose is based on MAS theory: the neural network can be viewed as a multi-agent system, which maps input to output through the interaction of agents. There are four types of agents: input node agents, output node agents, hidden layer agents, and a manager agent. These agents can run not only on the same computer but also on different ones, so the number of agents is not limited. Every agent has learning ability through reinforcement learning. Now we can construct the ANN using MAS theory: each unit or node is an agent. We constructed a three-layer network. Firstly, we create a manager agent, which manages the other agents, including the agent types, the number of agents of each type, and the agents' locations and IDs. The other agents can then find the node agents they need to link to. A simple new three-layer BP NN topological diagram is shown in Fig. 2. Every node agent has its own internal state, mapping its input to an output.
Fig. 2. Topological diagram of a new ANN
3.1 Manager Agent
The manager agent (MA) is a platform or container, and all other agents run on it. A MAS can have many manager agents, but only one main manager agent. These manager agents maintain the NN's global information, including the number of agents of each type, their locations, and their unique IDs. The MA provides services to the other agents. For example, when an input agent needs its neighboring output node agents, it sends a message to the manager agent, and the MA returns a list of agent IDs. The MA also receives registrations from other agents: when a new agent is created, a message must be sent to the manager agent stating that a node agent has been created, together with the node agent's ID, location, and other information. Before an agent dies, it sends the manager agent a message too, and the manager agent removes the corresponding record.
3.2 Unit Agents
The unit agents are classified into three types: the input agents, the hidden agents, and the output agents; the most important is the second one. Suppose there are m layers in the ANN, n_l denotes the number of nodes in layer l, and y_k^{(l)} is the output of agent k in layer l.
\hat{y}_k^{(l)} = w_k^{(l)} \cdot y^{(l-1)} = \sum_{j=1}^{n_{l-1}} W_{kj}^{(l)} y_j^{(l-1)},   (4)
y_k^{(l)} = f\big(\hat{y}_k^{(l)}\big), \quad k = 1, 2, \ldots, n_l,   (5)
where w_k^{(l)} is the weight vector between layer l-1 and layer l, and Y^{(0)} = X.
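Equations (4)-(5) amount to an ordinary weighted sum followed by an activation. A minimal sketch of a unit agent performing this mapping is given below; the class name, the sigmoid activation, and the random weight initialization are our illustrative assumptions, since the paper does not fix them.

```python
import numpy as np

class UnitAgent:
    """A hidden/output node agent computing (4)-(5) on the inputs it receives."""

    def __init__(self, n_inputs, rng):
        self.w = rng.normal(0.0, 0.1, n_inputs)   # weight vector w_k^(l)

    def activate(self, y_prev):
        """y_prev: outputs of the previous layer's agents, y^(l-1)."""
        net = self.w @ y_prev                      # (4): weighted sum
        return 1.0 / (1.0 + np.exp(-net))          # (5): f(.) taken as a sigmoid
```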
Given the supervised information, the weights between agents are modified to minimize E(w):
E(w) = \frac{1}{2}\,\|Y - \hat{Y}\|^2 = \frac{1}{2} \sum_{k=1}^{n_m} \big(Y_k - \hat{Y}_k\big)^2,   (6)
The unit agents' learning algorithm is given in Table 1.
Table 1. The Process of Reinforcement Learning of Input Agent
1. Initialize:
   (a) Select the initial learning rate α and discount factor β, and let t = 0;
   (b) Initialize the states S and actions A respectively;
   (c) For all states s and actions a, initialize Q^i_0(s^{(0)}, a_s^{(1)}, a_s^{(2)}, ..., a_s^{(n)}), and let π^i_0(s^{(0)}, a_s^{(i)}) = 1/n;
2. Repeat the following process (for each episode):
   (a) Get action a for the current state s using the policy π derived from Q;
   (b) Execute action a, observe reward r and the new state s';
   (c) Update Q^i_t using formula (1);
   Until for all states Δ(s) < ε.
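Read procedurally, Table 1 is an episode loop around the update in (1). A self-contained sketch of that loop is shown below; the `env` interface (reset/step), the ε-greedy policy standing in for π, and the convergence tolerance are our assumptions.

```python
import numpy as np

def learn(env, n_states, n_actions, alpha=0.1, beta=0.9,
          epsilon=0.1, tol=1e-4, max_episodes=1000, seed=0):
    """Episode loop of Table 1 (illustrative sketch). `env` is assumed to
    provide reset() -> s and step(s, a) -> (r, s_next, done)."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))              # 1. initialise Q
    for _ in range(max_episodes):                    # 2. repeat for each episode
        Q_old = Q.copy()
        s, done = env.reset(), False
        while not done:
            if rng.random() < epsilon:               # (a) action from policy pi
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(Q[s]))
            r, s_next, done = env.step(s, a)         # (b) observe r and s'
            Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + beta * np.max(Q[s_next]))
            s = s_next                               # (c) update Q via (1)
        if np.max(np.abs(Q - Q_old)) < tol:          # until Delta(s) < tolerance
            break
    return Q
```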
4 Experiment and Results
We construct a simple three-layer BP network and train it to learn XOR. The experiment results are shown in Fig. 3. From this figure, we can see that the network quickly learns the correct classification.
Fig. 3. The error rates vs. training episodes (error rate on the vertical axis, episodes on the horizontal axis)
5 Conclusion
In this paper, we discussed MAS and reinforcement learning, and proposed a new method to construct artificial neural networks, inspired by agent theory. The results show that this method is effective. However, there is still more work to do, such as: 1. The communication between node agents should be improved; 2. How many neighbor agents a unit agent should know. We will continue to work on these aspects, and more research will be done in the future.
References
1. Hecht-Nielsen: Neurocomputing. Addison-Wesley, Reading, MA (1990)
2. Schalkoff, R.J.: Artificial Neural Networks. McGraw-Hill, New York (1997)
3. Jennings, N.R., Sycara, K.P., Wooldridge, M.: A Roadmap of Agent Research and Development. Journal of Autonomous Agents and Multi-Agent Systems 1 (1) (1998) 7-36
4. Kaelbling, L.P., Littman, M.L., Moore, A.W.: Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research 4 (2) (1996) 237-285
5. Littman, M.L.: Friend-or-foe: Q-learning in General-sum Games. Proceedings of the Eighteenth International Conference on Machine Learning (2001) 322-328
6. Bowling, M., Veloso, M.: Multiagent Learning using a Variable Learning Rate. Artificial Intelligence 136 (2002) 215-250
7. Maarten, P.: A Study of Reinforcement Learning Techniques for Cooperative Multi-Agent Systems. Vrije Universiteit Brussel, Computational Modeling Lab, Department of Computer Science (2002-2003)
8. Watkins, C.J.C.H., Dayan: Q-learning. Machine Learning 8 (3/4) (1992) 279-292
9. Littman, M.L.: Markov Games as a Framework for Multiagent Reinforcement Learning. Proceedings of the 11th International Conference on Machine Learning, New Brunswick, NJ (1994) 157-163
Topographic Infomax in a Neural Multigrid
James Kozloski, Guillermo Cecchi, Charles Peck, and A. Ravishankar Rao
IBM T.J. Watson Research Center, Yorktown Heights, NY 10598
{kozloski,gcecchi,cpeck,ravirao}@us.ibm.com
Abstract. We introduce an information maximizing neural network that employs only local learning rules, simple activation functions, and feedback in its functioning. The network consists of an input layer, an output layer that can be overcomplete, and a set of auxiliary layers comprising feed-forward, lateral, and feedback connections. The auxiliary layers implement a novel "neural multigrid," and each computes a Fourier mode of a key infomax learning vector. Initially, a partial multigrid computes only low frequency modes of this learning vector, resulting in a spatially correlated topographic map. As higher frequency modes of the learning vector are gradually added, an infomax solution emerges, maximizing the entropy of the output without disrupting the map's topographic order. When feed-forward and feedback connections to the neural multigrid are passed through a nonlinear activation function, infomax emerges in a phase-independent topographic map. Information rates estimated by Principal Components Analysis (PCA) are comparable to those of standard infomax, indicating the neural multigrid successfully imposes a topographic order on the optimal infomax-derived bases.
1 Introduction
Topographic map formation requires an order-embedding, by which a set of vectors X in some input space is mapped onto a set of vectors Y in some output space such that the ordering of vectors in Y , when embedded within some alternate lower-dimensional coordinate system (usually 2D) preserves as much as possible the partial ordering of vectors in X in the input space. An important additional objective of topographic map formation is that the volume defined by Y be maximized so as to avoid trivial mappings. This second objective ignores the ordering of inputs and outputs in their respective spaces and instead attempts to maximize the mutual information between X and Y , I(X; Y ). We observe that order embedding need not necessarily impact information, as the ordering of outputs imposes no constraint on the volume that they define. One of the most influential approaches to topographic map formation, Kohonen's self-organizing map (SOM) [1], has its origins in another of Kohonen's algorithms, Learning Vector Quantization (LVQ) [2]. LVQ has as its stated goal density estimation, or that the number of input vectors assigned to each output vector be equalized. It aims to accomplish this as an approximation of k-means clustering, which minimizes mean squared error between input vectors and output prototypes. As such, neither algorithm guarantees the desired equalization
between inputs and output vectors for nonuniform input spaces. Modifications to the traditional LVQ algorithm improve density estimation, but require ongoing adjustments of still more parameters that govern the rates at which inputs are assigned to each output vector [3]. For SOM and its many variations, these modifications are complicated by a neighborhood function which constrains the assignment of inputs to output vectors and is shrunk during learning to create a smooth mapping (for example, see [4]). While LVQ and SOM support overcomplete bases (wherein the number of outputs exceeds the number of inputs), the problem of density estimation is not addressed by adding more output vectors. Information maximization is a well-characterized optimization that results in density estimation and equalization of output probabilities. Originally expressed for the case of multi-variate Gaussian inputs in the presence of noise [5], a significant extension of this solution accommodates input spaces of arbitrary shape for the noiseless condition [6]. The original derivations of infomax used critically sampled bases (wherein the number of outputs equals the number of inputs), either for notational simplicity [5] or by necessity [6]. While subsequent derivations of infomax [7] and related sparse-coding and probabilistic modeling strategies [8,9] incorporate overcomplete bases, none do so in the context of topographic mapping. Algorithms that maximize Shannon information rates subject to a topographic constraint [10] rely on non-locally computed learning rules and do not apply to arbitrary input spaces and the noiseless condition. Finally, because one of the main operational principles of infomax is to make outputs independent by eliminating redundancy, standard topographic mapping algorithms, which necessarily create dependencies between map neighbors, should seem incompatible with infomax. Here we present a network capable of performing the infomax optimization over an arbitrary input space (in our case, natural images) for the noiseless condition, using either critically sampled or overcomplete bases, while simultaneously creating a topographic map with either a phase-dependent or a phaseindependent order embedding. Changes to learning rates or neighborhood sizes are not required for convergence. These capabilities derive from a novel neural multigrid, configured to estimate Fourier modes of the infomax antiredundancy learning vector using feed-forward, lateral, and feedback connections. Section 2 introduces the infomax network from which ours derives [11], which we have termed a Linsker network. In subsequent sections we present, step by step, each of the modifications to the Linsker network necessary to achieve these capabilities. Section 3 shows how to implement a multilayer Linsker network with feedback, which generates topography, but fails to achieve infomax. Section 4 refines the feedback network and introduces the neural multigrid to address low frequency redundancy reduction. Section 5 shows how an overcomplete basis finally achieves the infomax optimum in a neural multigrid. Section 6 demonstrates that a modified multigrid can achieve the infomax optimum for a critically sampled basis, provided a phase-independent order embedding is specified. Finally section 7 discusses the experimental results in the context of a new principle of information maximization which we term topographic infomax.
2 A Three-Stage Infomax Network
A Linsker network comprises three stages. Stage one selects a vector x from the input ensemble, x ∈ X, and computes the input vector x = q −1/2 ( x − x0 ), where x0 is an input bias vector that continually adapts with each input according to Δx0 = βx0 [ x − x0 ],1 , and q −1/2 is the pre-computed whitening matrix, where q = ( x − x0 )( x − x0 ) . Pre-whitening of inputs is not required for infomax, but speeds convergence. For results shown here, X was a set of image segments drawn at random from natural images [12]. Stage two learns the input weight matrix C and computes u ≡ Cx, where each element ui is the linear output of a stage two unit i to a corresponding stage three unit. In addition, each stage two unit computes an element of the output vector y, such that yi = σ(ui ), where σ(·) denotes a nonlinear squashing function. We used the logistic transfer function for σ(·) as in [6]: y = 1/1 + e−(u+w0 ) , where w0 is an output bias vector that continually adapts with each input according to Δw0 = βw0 [1 − 2y]. The ensemble of all network output vectors is then y ∈ Y , and the objective to maximize I(X; Y ). Because the network is deterministic, maximizing I(X; Y ) is equivalent to maximizing H(Y ), since I(X; Y ) = H(X)+ H(Y ) − H(Y |X), H(X) is fixed, and, for the noiseless condition, H(Y |X) = 0. Let us now consider stage three, whose outputs comprise a learning vector with the same dimension as u. When applied in stage two to learning C, this learning vector yields the anti-redundancy term (C )−1 of Bell and Sejnowski (ie., the inverse of the transpose of theinput weight matrix) [6,11]. Consider the chain rule H(Y ) = ni=1 H({yi )})− ni=1 I({yi }; {yi−1 }, . . . , {y1 }). Then, maximizing H(Y ) is achieved by constraining the entropy of the elements yi (as expressed by the first sum) while minimizing their redundancy (as expressed by the second). In a Linsker network, the constraint imposed by σ(·) and the learning vector produced by stage three perform these functions, and are sufficient to guide C learning to maximize H(Y ). Thus, the learning vector produced by stage three is responsible for redundancy minimization. We refer to it (and its subsequent derivatives) as the anti-redundancy learning vector, hereafter denoted ψ. In a Linsker network, local learning rules and a linear activation function iterating over a single set of lateral connections compute ψi for each unit i of a single stage three layer. Units in this layer are connected by the weight matrix Q, → Q ≡ uu . For a given whose elements undergo Hebbian learning such that Q input presentation, feed-forward and lateral connections modify elements of an t−1 . To learn auxiliary vector v at each iteration t according to vt = vt−1 +u−αQv Regardless Q locally, we set v0 = u and use the learning rule ΔQ = βQ [v0 v0 − Q]. of initial v, and assuming Q = Q and the scalar α is chosen so that v converges, by Jacobi αv∞ = Q−1 u. The constraint 0 < α < 2/Q+ must be satisfied for v to converge, where Q+ is the largest eigenvalue of Q [11]. In Linsker’s network, α is computed by a heuristic [5]. We devised a dynamic, local computation of 1
Note that all learning rates in the network are constant, and denoted by β with subscript. For this and subsequent learning rates, we used βx0 = 0.0001, βw0 = 0.0021, βQ = 0.0007, and βC = 0.0021.
Q+ based on power iteration, from which α can be computed precisely. Let e represent an activity vector propagated through the lateral network Q̂ in the absence of the normal stage three forcing term v_{t-1} + u, such that e_t = -αQ̂e_{t-1}. Precalculating α = 1/‖e_t‖ for each t ensures ‖e‖ → Q+ and α → 1/Q+, thus satisfying the convergence criterion of v. In practice, a finite number of iterations is sufficient to approximate αv∞,² and therefore anti-redundancy learning for a given input weight C_ij can depend on the locally computed element of the learning vector, ψ_i = αv_i, and the local input to stage two, x_j. The final infomax learning rule for the network is then ΔC = β_C [(ψ + 1 - 2y) xᵀ] [11]. Use of a standard Linsker network produces the expected infomax result reliably with a fixed number of Jacobi iterations and learning rates (Fig. 1A).
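For readers who want to prototype the stage-three computation, the sketch below runs a fixed number of Jacobi iterations of v_t = v_{t-1} + u - αQv_{t-1}, with α set from a parallel power iteration tracking the largest eigenvalue of Q, as described above. It is an illustration of the idea, not the authors' implementation; the normalization scheme of the power-iteration vector is our choice.

```python
import numpy as np

def anti_redundancy_vector(Q, u, n_iter=4, rng=None):
    """Estimate psi = alpha * v_infinity ~= Q^{-1} u by Jacobi iteration,
    with alpha -> 1/Q_plus from a parallel power iteration (illustrative)."""
    rng = np.random.default_rng() if rng is None else rng
    e = rng.normal(size=u.shape)           # power-iteration vector for Q_plus
    v = u.copy()                           # v_0 = u
    alpha = 1.0
    for _ in range(n_iter):
        e = Q @ e
        alpha = 1.0 / np.linalg.norm(e)    # alpha -> 1/Q_plus
        e = e * alpha                      # keep the power-iteration vector normalised
        v = v + u - alpha * (Q @ v)        # v_t = v_{t-1} + u - alpha * Q * v_{t-1}
    return alpha * v                       # psi = alpha * v
```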
Fig. 1. A: Learned bases of Linsker network (standard infomax). This and all subsequent figures derive from training over a set of publicly available natural images (http://www.cis.hut.fi/projects/ica/data/); B: PCA of outputs u (eigenvalues of Q shown on a continuous function). Curves labeled a and b result from A and C, and show maximization of the output volume (successful infomax). Curves labeled c, d, and e, result from multigrid infomax over low to high frequency modes, infomax from feedback, and multigrid infomax over low frequency modes only, and show failed infomax. C: Learned bases of overcomplete, topographic infomax.
3 Infomax from Feedback
Jacobi iteration in stage three of a Linsker network aims to solve the equation αQv = u. We observe that certain linear transformations of this equation likewise yield the infomax anti-redundancy term. For example, stage three might include a second layer, or grid (denoted h1 ), in which a fixed, symmetric, full rank weight 2
For most problems, we found 4 Jacobi iterations to be sufficient.
matrix S linearly transforms u, yielding a new auxiliary vector uh1 = Su = SCx. As in a Linsker network, units in h1 are connected by a lateral weight matrix, h1 , that undergoes Hebbian learning such that Q h1 → Qh1 ≡ uh1 uh1 = Q h1 SQS . Jacobi iteration in h1 proceeds similarly as above, vth1 = vt−1 + u h1 − h 1 h1 h1 −1 α Q vt−1 . To recover the infomax anti-redundancy term (C ) , a derivation −1 −1 similar to that of Linsker yields: I = Qh1 SQS , (S )−1 = Qh1 SCxx C , −1 −1 (S )−1 (C )−1 = Qh1 SCxx , and (C )−1 = S Qh1 SCxx . Hence, (C )−1 = h1 h1 Sα v∞ x , and anti-redundancy learning for a given input weight now depends on an element of the feedback learning vector ψ = S αh1 v h1 computed in the stage three’s input layer and the corresponding local input to stage two. Theoretically, the proposed feedback network is equivalent to a Linsker network, and should therefore not change the infomax optimization it performs. In practice, however, the choice of S can influence the optimization dramatically, h1 since Jacobi iteration is used to estimate αh1 v∞ . If, for example, S represents a low-pass convolution filter and thus low frequency modes are emphasized in both uh1 and Qh1 ,3 Jacobi iteration in h1 can fail to provide a solution that is accurate enough for infomax to succeed, given some fixed number of Jacobi iterations, since low frequency modes of the solution are notoriously slow to converge [13]. In fact, we observed a failure of this network to achieve infomax (Fig. 1B[d]). However, we consistently observed a stable topographic ordering of the outputs in the 2D coordinate system of the network after each failed infomax optimization, suggesting that errors derived from incomplete convergence of low frequency Fourier modes in ψ are sufficient to generate a topographic map. Next we set out to eliminate these errors.
4 A Neural Multigrid
The second layer of the feedback infomax network described in the preceding section aims to solve the equation αh1 Qh1 v h1 = uh1 by means of Jacobi iteration. It fails to do so accurately because of the dominant low frequency modes of the problem, rendering ψ unsuitable for anti-redundancy learning that depends on accurate solutions in these modes. Hence infomax fails. We explored using multigrid methods to better estimate this solution, since multigrid methods in general speed convergence and accuracy of Jacobi iteration by decomposing it into a series of iterative computations performed sequentially over a set of grids, each solving different Fourier modes of the original problem [13]. Multigrid casts the problem into a series of smaller and smaller grids, such that low frequency modes in the original problem can converge quickly and accurately in the form of high frequency modes in a restricted problem. The multigrid method we implement here in a neural network is nested iteration, though the network design can easily accommodate other multigrid methods such as ”V-cycle” and ”Full Multigrid” [13]. Similar to the feedback network described in the previous section, Jacobi iteration in h1 aims to solve αh1 Qh1 v h1 = uh1 , but now only after 3
We used a low-pass, 2D gaussian kernel with SD=0.85.
v h1 is initialized with the result of a preceding series of nested iterative computations over a set of smaller grids wherein lower frequency modes of the solution have already been computed. The set of grids, hk , is enumerated by the set of wavelengths of the Fourier modes of the problem that each solves, for example k ∈ {1, 2, 4, . . .}. The iterative computation performed in each grid is similar to that in a Linsker network, hn hn v hn , n ∈ k. As in traditional multigrid now denoted vthn = vt−1 + uhn − αhn Q t methods, we chose powers of two for the neural multigrid wavelengths, such that if the Linsker network and grid h1 are 11 × 11 layers, grid h2 is a 5 × 5 layer, and h4 a 2 × 2 layer. Feed-forward connections initially propagate and restrict each uhn to each lower dimensional grid h2n , such that uh2n = S hn uhn ∀n ∈ k, where S hn denotes the restriction operator (in our neural multigrid, a rectangular feed-forward weight matrix) from grid hn to h2n . As in traditional multigrid methods, restriction here applies a stencil (in our neural multigrid, a neighbor1 hn hood function), such that the restriction from grid hn to h2n is uhx2n ,y = 16 [4ux,y + n n n n n n n n uhx+1,y+1 + uhx+1,y−1 + uhx−1,y+1 + uhx−1,y−1 + 2(uhx,y+1 + uhx,y−1 + uhx+1,y + uhx−1,y )], where (x, y) are coordinates in hn , and (x , y ) are the corresponding transformed coordinates in h2n . Jacobi iteration proceeds within coarse grids first, followed by finer grids. Feedback propagates and smoothly interpolates the result of a coarse grid iteration, αh2n v h2n , to the next finer grid, where it replaces v hn prior to Jacobi iteration within the finer grid: v hn ← S hn αh2n v h2n . In this way, higher frequency mode iteration refines the solution provided by lower frequency mode iteration. The process continues until αh1 v h1 is computed by iteration at the second layer of stage three, and finally ψ is derived from feedback to stage three’s input layer as described in the previous section. While restriction and interpolation of activity vectors in the neural multigrid are easily accomplished through feed-forward and feedback connections described by S hn and S hn , how is Qhn computed for each grid using only local learning rules? Consider the problem of restricting to a coarser grid the matrix Qhn , defined for all multigrid methods as Qh2n = S hn Qhn S hn [13]. By substitution, Qh2n = S hn uhn uhn S hn = S hn uhn uhn S hn = uh2n uh2n . Hence, any restricted matrix Qh2n can be computed by Hebbian learning over a lateral weight h2n such that Q h2n → Qh2n ≡ uh2n uh2n .4 matrix Q In the preceding experiments with feedback infomax networks, the failure to compute low frequency modes of the solution to αh1 Qh1 v h1 = uh1 was responsible for a topographic ordering (since infomax learning was unable to eliminate low frequency spatial redundancy in the map). Next, we hypothesized that limiting our solution to these same low frequency modes could have a comparable topographic influence while providing a means to complete the infomax optimization. Minimizing redundancy in low frequency Fourier modes is equivalent to minimizing redundancy between large spatial regions of the map. In the absence of competing anti-redundancy effects from other Fourier modes, redundancy within these large regions should therefore increase, as units learn 4
Any optimization over a matrix A requiring a solution to Ax = b can be implemented using a neural multigrid if, and only if, A is strictly a function of b.
based on a smooth, interpolated anti-redundancy learning vector derived from coarse grids only. In fact, the topographic effect was pronounced. Again, however, the network failed to achieve infomax (Fig. 1B[e]), now because high frequency modes of the solution were neglected by our partial multigrid. We therefore devised a network that gradually incorporates the iterative computations of each grid of the neural multigrid, from coarse grids to fine, into the computation of ψ. Initially, iteration proceeded only at the two coarsest grids, with m inputs presented to the network in this configuration.5 Infomax learning proceeded with feedback vectors computed by interpolating the solutions from the coarsest grid through each multigrid layer, then feeding the result back to stage three’s input layer, where finally ψ was computed. Subsequent layers in the neural multigrid were activated one at a time at intervals of m input presentations. The number of input presentations for which a grid hn had been active was phn , and only when phn ≥ m∀n ∈ k were all Fourier modes of the solution to αh1 Qh1 v h1 = uh1 present in ψ, and the multigrid complete. Prior to this, ψ was computed in the partial multigrid as a linear combination of the feedback vectors from the twofinest active grids, ha and hb , fed back through all intervening layers: a ψ = S n=1 S hn [β ha αha v ha +(1−β ha )S hb αhb v hb ], where β ha = pha /m ∈ [0, 1]. Gradual incorporation of each grid of the neural multigrid resulted in a topographic map, but failed to achieve infomax (Fig. 1B[c]), suggesting a more radical approach was required to recover an infomax solution.
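For concreteness, the full-weighting restriction stencil described in this section (1/16 of [4 x centre + 2 x edge neighbours + 1 x corner neighbours]) can be written directly; the boundary handling and the exact fine-to-coarse coordinate map below are our assumptions, since the paper does not spell them out.

```python
import numpy as np

def restrict(u_fine):
    """Full-weighting restriction of a 2D activity grid to a coarser grid.

    Each coarse node receives (4*centre + 2*edge neighbours + corner
    neighbours) / 16 from the corresponding fine node, cf. the stencil
    in the text. Fine centres are sampled every other row/column here.
    """
    n = u_fine.shape[0]
    u = np.pad(u_fine, 1, mode='edge')       # simple boundary handling (our choice)
    idx = np.arange(0, n, 2) + 1             # padded coordinates of coarse centres
    out = np.empty((len(idx), len(idx)))
    for a, x in enumerate(idx):
        for b, y in enumerate(idx):
            out[a, b] = (4 * u[x, y]
                         + 2 * (u[x + 1, y] + u[x - 1, y] + u[x, y + 1] + u[x, y - 1])
                         + (u[x + 1, y + 1] + u[x + 1, y - 1]
                            + u[x - 1, y + 1] + u[x - 1, y - 1])) / 16.0
    return out
```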
5 Overcomplete Topographic Infomax
To recover an infomax solution within a topographic map, we first devised a network that uses a neural multigrid to compute an anti-redundancy learning vector based on pooled outputs from nine separate critically sampled bases (each initially embedded in an 11 × 11 grid as above). Each basis therefore represented a separate infomax problem, and each was then re-embedded within a single ”overcomplete” 2D grid (now 33 × 33, denoted hoc ) as follows: x ← r + 3x, y ← s + 3y, where (r, s) represents a unique pair of offsets applied to each 11 × 11 grid’s corresponding coordinates in order to embed it within the 33 × 33 grid, r ∈ [0, 2], s ∈ [0, 2]. The pooling of elements from each basis was achieved by restricting the output of hoc to the neural multigrid’s first layer, such that uh1 = Soc uhoc , where h1 is an 11 × 11 grid, and Soc is a rectangular feed-forward weight matrix.6 The computation of ψ proceeded as above, with grids incorporated into the multigrid gradually, from coarse to fine. After the multigrid was completely active, a set of nine, independent lateral networks within the overcomplete grid became active. Each lateral network included only connections between those 5 6
We used m = 2, 000, 000. We scaled the previous low-pass, 2D gaussian kernel proportionally to SD=2.55 in order to create the new restriction matrix, Soc .
units comprising a single critically sampled basis, and the overcomplete grid’s lateral network thus comprised overlapping, periodic, lateral connections. The results of iteration over each of these independent Linsker networks represented nine separate solutions to nine fully determined problems Qv = u. These solutions were gradually combined with the feedback learning vector from the multigrid as described in the previous section, and yielded an infomax solution wherein each critically sampled basis was co-embedded in a single topographic map (Fig. 1B[b],C).
6 Absolute Redundancy Reduction
Next, we aimed to generate a phase-independent order embedding in our topographic map using the neural multigrid. Drawing upon the work of Hyv¨ arinen et al., [14] we reasoned that certain nonlinear transformations of the inputs to the neural multigrid might produce a topographic influence independent of phase selectivity in output units, as has been observed in primary visual cortex. Unlike Hyv¨ arinen et al., we maintained the goal of maximizing mutual information between the input layer and the first layer of our multilayer network. The nonlinear transformation applied was the absolute value, such that uh1 = S|u|. Given this transformation, feedback from the multigrid was no longer consistent with anti-redundancy learning at the input layer of stage three. We reasoned that the input weights to unit i, Cij ∀j, should be adjusted to eliminate redundancy in multigrid units, which derives from pooled absolute activation levels over i. Cij must then be modified in a manner dependent on each unit i’s contribution to this redundancy, i.e., its absolute activation level. Hence, at each unit i, the multigrid feedback vector was multiplied by ωi prior to its incorporation into ψ, where ωi takes the value 1 if ui ≥ 0 and −1 otherwise. The computation of ψ was modified during learning as above, with grids incorporated into the multigrid gradually, from coarse to fine. In the end, the network employed a linear combination of the multigrid feedback vector and the infomax anti-redundancy vector αv, such that computed elements of ψ locally were determined by ψi = βαvi + (1 − β)ωi k Sik αh1 vkh1 .7 The results show that a phase-independent topographic map does result from nonlinearly mapping the problem αQv = u onto a gradually expanded neural multigrid (Fig. 2A), and that infomax was readily achieved in these experiments (Fig. 2B[b]). Interestingly, when multigrid inputs were transformed in this way, the constraints imposed by initial learning, biased topographically by the partial multigrid, were not severe enough to prevent infomax from emerging within the map, even for a critically sampled basis. Finally, we applied the nonlinear mapping to an overcomplete basis as above and found that phase-independent, overcomplete, topographic infomax was readily achieved (Fig. 2B[b],C). 7
We limited β to 0.25 in order to maintain a smooth mapping, given that infomax emerged readily at this weighting, and did not require β → 1.
Fig. 2. A: Learned bases of phase-independent, topographic infomax; B: PCA of outputs u (eigenvalues of Q shown on a continuous function). Curves labeled a and b result from Linsker network (standard infomax) and from phase-independent topographic infomax (A and C overlapping), and show maximization of the output volume (successful infomax). C: Learned bases of phase-independent, overcomplete, topographic infomax.
7 Discussion
The infomax algorithm implemented here in a modified multi-layered Linsker network maximizes I(X; Y ), the mutual information between network inputs and outputs. Rather than doing so directly, however, the novel network configuration uses first a partial neural multigrid to induce spatial correlations in its output, then a complete multigrid to perform the infomax optimization. The manner in which the multigrid guides infomax to a specific topographically organized optimum constitutes a new principle of information maximization, which we refer to as topographic infomax. The problem solved by the partial neural multigrid first is that of eliminating redundancy in low frequency Fourier modes of the output. While solving this problem for these modes and no others, the network operates as if these modes contain all output redundancy, which they clearly do not. Redundancy between large regions of the output map is minimized, even though doing so increases redundancy in higher frequency modes (ie., between individual units). In the completed multigrid, iteration in coarse grids precedes iteration in finer grids for any input, and thus higher frequency redundancy reduction is heavily constrained by any previous minimization of redundancy in lower frequency modes. For standard infomax, redundancy minimization can be achieved by redundancy reduction in any or all modes simultaneously. Topographic infomax instead aims to first eliminate low frequency redundancy and thus imposes a topographic order on the output map.
The constraints imposed by low frequency redundancy reduction can prevent infomax from emerging in the completed multigrid. Two approaches to relaxing these constraints and recovering the infomax solution have yielded topographic infomax: first, the use of an overcomplete basis, and second, the use of a phaseindependent order-embedding. We anticipate that many parallels between the network configuration employed here and those observed in biological structures such as primate primary visual cortex remain to be drawn, and that topographic infomax represents a mechanism by which these structures, constrained developmentally by local network connection topologies, can achieve quantities of mutual information between inputs and outputs comparable to what is achieved in more theoretical, fully-connected networks of equal dimension.
Acknowledgments
We thank Ralph Linsker and John Wagner for many helpful discussions.
References
1. Kohonen, T.: Self-Organizing Maps. Springer-Verlag, Berlin (1997)
2. Kohonen, T.: Learning Vector Quantization. Neural Networks 1, Supplement 1 (1988) 303
3. Desieno, D.: Adding a Conscience to Competitive Learning. Proc. Int. Conf. on Neural Networks I (1988) 117-124
4. Bednar, J.A., Kelkar, A., Miikkulainen, R.: Scaling Self-Organizing Maps to Model Large Cortical Networks. Neuroinformatics 2 (2004) 275-302
5. Linsker, R.: Local Synaptic Learning Rules Suffice to Maximise Mutual Information in a Linear Network. Neural Computation 4 (1992) 691-702
6. Bell, A.J., Sejnowski, T.J.: An Information-Maximisation Approach to Blind Separation and Blind Deconvolution. Neural Computation 7 (1995) 1129-1159
7. Shriki, O., Sompolinsky, H., Lee, D.D.: An Information Maximization Approach to Overcomplete and Recurrent Representations. 12th Conference on Neural Information Processing Systems (2000) 87-93
8. Olshausen, B.A., Field, D.J.: Sparse Coding with an Overcomplete Basis Set: A Strategy Employed by V1. Vision Research 37 (1996) 3311-3325
9. Lewicki, M.S., Sejnowski, T.J.: Learning Overcomplete Representations. Neural Computation 12 (2000) 337-365
10. Linsker, R.: How to Generate Ordered Maps by Maximizing the Mutual Information between Input and Output Signals. Neural Computation 1 (1989) 402-411
11. Linsker, R.: A Local Learning Rule that Enables Information Maximization for Arbitrary Input Distributions. Neural Computation 9 (1997) 1661-1665
12. Hyvärinen, A., Hoyer, P.O.: A Two-Layer Sparse Coding Model Learns Simple and Complex Cell Receptive Fields and Topography From Natural Images. Vision Research 41 (2001) 2413-2423
13. Briggs, W.L., Henson, V.E., McCormick, S.F.: A Multigrid Tutorial. Society for Industrial and Applied Mathematics, Philadelphia, PA
14. Hyvärinen, A., Hoyer, P.O., Inki, M.: Topographic Independent Component Analysis. Neural Computation 13 (2001) 1527-1528
Genetic Granular Neural Networks
Yan-Qing Zhang1, Bo Jin1, and Yuchun Tang2
1 Department of Computer Science, Georgia State University, Atlanta, GA 30302-3994, USA
[email protected], [email protected]
2 Secure Computing Corporation, Alpharetta, GA 30022, USA
[email protected]
Abstract. To perform interval-valued granular reasoning efficiently and to optimize interval membership functions based on training data effectively, a new Genetic Granular Neural Network (GGNN) is designed. Simulation results have shown that the GGNN is able to extract useful fuzzy knowledge effectively and efficiently from training data and to achieve high training accuracy.
1 Introduction
Recently, granular computing techniques based on computational intelligence techniques and statistical methods have found various applications [2-9][13-15]. Type-2 fuzzy systems and interval-valued fuzzy systems have been investigated by extending type-1 fuzzy systems [1][10-12][16]. It is hard to define and optimize type-2 or interval-valued fuzzy membership functions subjectively and objectively. In other words, the first challenging problem is how to design an effective learning algorithm that can optimize type-2 or interval-valued fuzzy membership functions based on training data. Usually, type-2 fuzzy systems and interval-valued fuzzy systems can handle fuzziness better than type-1 fuzzy systems in terms of reliability and robustness, but type-2 fuzzy reasoning and interval-valued fuzzy reasoning take much longer than type-1 fuzzy reasoning. So the second challenging problem is how to speed up type-2 and interval-valued fuzzy reasoning. In summary, the two long-term challenging problems are related to the effectiveness and the efficiency of granular fuzzy systems, respectively. To solve the first, effectiveness, problem, learning methods are used to optimize type-2 or interval-valued fuzzy membership functions based on given training data. Liang and Mendel present a method to compute the input and antecedent operations for interval type-2 FLSs, introduce the upper and lower membership functions, and transform an interval type-2 fuzzy logic system into two type-1 fuzzy logic systems for membership function parameter adjustment [10]. To handle different opinions from different people, Qiu, Zhang and Zhao use a statistical linear regression method to construct low and high fuzzy membership functions for an interval fuzzy system [16]. Here, a new interval reasoning method using granular sets is designed to make granular reasoning fast.
2 Granular Neural Networks
The n-input-1-output GNN with m granular IF-THEN rules uses granular sets such that

IF x_1 is A_1^k and ... and x_n is A_n^k THEN y is B^k,   (1)

where x_i and y are input and output granular linguistic variables, respectively, and the granular linguistic values A_i^k and B^k are defined as follows:

A_i^k = \int_R [\underline{\mu}_{A_i^k}(x_i), \bar{\mu}_{A_i^k}(x_i)] / x_i,   (2)

\underline{\mu}_{A_i^k}(x_i) = \exp[-((x_i - a_i^k) / \sigma_i^k)^2],   (3)

\bar{\mu}_{A_i^k}(x_i) = \exp[-((x_i - a_i^k) / (\sigma_i^k + \xi_i^k))^2],   (4)

B^k = \int_R [\underline{\mu}_{B^k}(y), \bar{\mu}_{B^k}(y)] / y,   (5)

\underline{\mu}_{B^k}(y) = \exp[-((y - b^k) / \eta^k)^2],   (6)

\bar{\mu}_{B^k}(y) = \exp[-((y - b^k) / (\eta^k + \nu^k))^2],   (7)

where a_i^k and b^k are the centers of the membership functions of x_i and y, respectively; \sigma_i^k and (\sigma_i^k + \xi_i^k) with \xi_i^k > 0 are the widths of the lower-bound and upper-bound membership functions of x_i; and \eta^k and (\eta^k + \nu^k) with \nu^k > 0 are the widths of the lower-bound and upper-bound membership functions of y, for i = 1, 2, ..., n and k = 1, 2, ..., m. The functions of the granular neurons in the different layers are described layer by layer as follows.

Layer 1: Input Layer. Input neurons I_i on layer 1 have the simple mapping functions

O_i = x_i,   (8)

where i = 1, 2, ..., n.

Layer 2: Compensation and Linear Combination Layer. In this layer there are two types of granular neurons: (1) lower-bound and upper-bound compensatory neurons, denoted by \underline{C}^k and \bar{C}^k, respectively, and (2) lower-bound and upper-bound linear combination neurons, denoted by \underline{L}^k and \bar{L}^k, respectively, for k = 1, 2, ..., m. The compensatory neurons on layer 2 have the compensatory mapping functions

O_{\underline{C}^k} = \left[ \prod_{i=1}^{n} \underline{\mu}_{A_i^k}(x_i) \right]^{1 - \gamma_k + \gamma_k / n},   (9)

O_{\bar{C}^k} = \left[ \prod_{i=1}^{n} \bar{\mu}_{A_i^k}(x_i) \right]^{1 - \gamma_k + \gamma_k / n},   (10)

where \gamma_k stands for the compensatory degree. The linear combination neurons on layer 2 have the linear mapping functions

O_{\underline{L}^k} = b^k + \frac{\eta^k}{n} \sum_{i=1}^{n} \frac{\underline{\psi}_i^k (x_i - a_i^k)}{\sigma_i^k},   (11)

O_{\bar{L}^k} = b^k + \frac{\eta^k + \nu^k}{n} \sum_{i=1}^{n} \frac{\bar{\psi}_i^k (x_i - a_i^k)}{\sigma_i^k + \xi_i^k}.   (12)
Layer 3: Normal Granular Reasoning Layer. The lower-bound and upper-bound granular reasoning neurons, denoted by \underline{R}^k and \bar{R}^k, respectively, on layer 3 have the product mapping functions

O_{\underline{R}^k} = O_{\underline{C}^k} O_{\underline{L}^k},   (13)

O_{\bar{R}^k} = O_{\bar{C}^k} O_{\bar{L}^k}.   (14)

Layer 4: Interval Summation Layer. The lower-bound and upper-bound compensatory summation neurons, denoted by \underline{CS} and \bar{CS}, respectively, have the mapping functions

O_{\underline{CS}} = \sum_{k=1}^{m} O_{\underline{C}^k},   (15)

O_{\bar{CS}} = \sum_{k=1}^{m} O_{\bar{C}^k}.   (16)

The lower-bound and upper-bound granular reasoning summation neurons, denoted by \underline{FRS} and \bar{FRS}, respectively, have the mapping functions

O_{\underline{FRS}} = \sum_{k=1}^{m} O_{\underline{R}^k},   (17)

O_{\bar{FRS}} = \sum_{k=1}^{m} O_{\bar{R}^k}.   (18)

Layer 5: Hybrid Output Layer. Finally, an output neuron OUT has the average mapping function

O_{OUT} = \left[ \frac{O_{\underline{FRS}}}{O_{\underline{CS}}} + \frac{O_{\bar{FRS}}}{O_{\bar{CS}}} \right] / 2.   (19)
For clarity, the output f(x_1, ..., x_n) of the hybrid output layer of the GNN is given below:

\underline{f}(x_1, ..., x_n) = \frac{\sum_{k=1}^{m} \left( b^k + \frac{\eta^k}{n} \sum_{i=1}^{n} \frac{\underline{\psi}_i^k (x_i - a_i^k)}{\sigma_i^k} \right) \underline{g}_k}{\sum_{k=1}^{m} \left[ \prod_{i=1}^{n} \underline{\mu}_{A_i^k}(x_i) \right]^{1 - \gamma_k + \gamma_k / n}},   (20)

where

\underline{g}_k = \left[ \prod_{i=1}^{n} \underline{\mu}_{A_i^k}(x_i) \right]^{1 - \gamma_k + \gamma_k / n},   (21)

\bar{f}(x_1, ..., x_n) = \frac{\sum_{k=1}^{m} \left( b^k + \frac{\eta^k + \nu^k}{n} \sum_{i=1}^{n} \frac{\bar{\psi}_i^k (x_i - a_i^k)}{\sigma_i^k + \xi_i^k} \right) \bar{g}_k}{\sum_{k=1}^{m} \left[ \prod_{i=1}^{n} \bar{\mu}_{A_i^k}(x_i) \right]^{1 - \gamma_k + \gamma_k / n}},   (22)

where

\bar{g}_k = \left[ \prod_{i=1}^{n} \bar{\mu}_{A_i^k}(x_i) \right]^{1 - \gamma_k + \gamma_k / n},   (23)

and the heuristic parameters \underline{\psi}_i^k and \bar{\psi}_i^k are defined by

\underline{\psi}_i^k = \begin{cases} \underline{\upsilon}_i^k & \text{for } x_i \le a_i^k \\ \underline{\omega}_i^k & \text{for } x_i > a_i^k \end{cases},   (24)

\bar{\psi}_i^k = \begin{cases} \bar{\upsilon}_i^k & \text{for } x_i \le a_i^k \\ \bar{\omega}_i^k & \text{for } x_i > a_i^k \end{cases}.   (25)

Finally, the output function f(x_1, ..., x_n) of the GNN is

f(x_1, ..., x_n) = \frac{\underline{f}(x_1, ..., x_n) + \bar{f}(x_1, ..., x_n)}{2}.   (26)
Interestingly, the output function f(x_1, ..., x_n) of the GNN also contains a linear combination of x_i for i = 1, 2, ..., n, since both the input and the output membership functions are the same Gaussian functions. In particular, if the input and output membership functions are different kinds of functions, such as triangular and Gaussian functions, the output of the GNN f(x_1, ..., x_n) may contain a nonlinear combination of x_i for i = 1, 2, ..., n.
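As a concrete illustration of Eqs. (20)-(26), the sketch below computes the GNN output for one input vector. It is not code from the paper; the vectorized data layout and parameter names are our own assumptions.

```python
import numpy as np

def gnn_output(x, a, b, sigma, xi, eta, nu, gamma, psi_lo, psi_hi):
    """Sketch of the GNN output, Eqs. (20)-(26).
    x: input vector of length n.
    a (m,n), b (m): rule centers; sigma (m,n), xi (m,n): lower-bound widths and width increments;
    eta (m), nu (m): output widths and increments; gamma (m): compensatory degrees;
    psi_lo, psi_hi: heuristic parameters of shape (m,n,2), column 0 for x_i <= a_i^k, column 1 otherwise."""
    m, n = a.shape
    f_lo_num = f_lo_den = f_hi_num = f_hi_den = 0.0
    for k in range(m):
        # lower/upper membership grades, Eqs. (3)-(4)
        mu_lo = np.exp(-((x - a[k]) / sigma[k]) ** 2)
        mu_hi = np.exp(-((x - a[k]) / (sigma[k] + xi[k])) ** 2)
        expo = 1.0 - gamma[k] + gamma[k] / n
        g_lo = np.prod(mu_lo) ** expo                # Eq. (21)
        g_hi = np.prod(mu_hi) ** expo                # Eq. (23)
        # heuristic slope parameters, Eqs. (24)-(25)
        side = (x > a[k]).astype(int)
        p_lo = psi_lo[k, np.arange(n), side]
        p_hi = psi_hi[k, np.arange(n), side]
        # rule consequents, Eqs. (11)-(12)
        y_lo = b[k] + eta[k] / n * np.sum(p_lo * (x - a[k]) / sigma[k])
        y_hi = b[k] + (eta[k] + nu[k]) / n * np.sum(p_hi * (x - a[k]) / (sigma[k] + xi[k]))
        f_lo_num += y_lo * g_lo; f_lo_den += g_lo
        f_hi_num += y_hi * g_hi; f_hi_den += g_hi
    f_lo = f_lo_num / f_lo_den                       # Eq. (20)
    f_hi = f_hi_num / f_hi_den                       # Eq. (22)
    return 0.5 * (f_lo + f_hi)                       # Eq. (26)
```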
3 Genetic Granular Learning
Suppose we are given n-dimensional input data vectors x^p (i.e., x^p = (x_1^p, x_2^p, ..., x_n^p)) and 1-dimensional output data y^p for p = 1, 2, ..., N. The energy function is defined by

E^p = \frac{1}{2} [f(x_1^p, ..., x_n^p) - y^p]^2.   (27)

For simplicity, let E and f^p denote E^p and f(x_1^p, ..., x_n^p), respectively.
A 3-phase evolutionary interval learning algorithm with constant compensatory rate γ_k = a (a ∈ [0, 1]) for k = 1, 2, ..., m is described below:

Step 1: Use the type-1 learning method to optimize the initial expected point-valued parameters of the GNN.
Step 2: Use genetic algorithms to optimize the initial interval-valued parameters.
Step 3: Use the compensatory interval learning algorithm to optimize the interval-valued parameters.
Step 4: Discover granular knowledge.

Once the learning procedure has been completed, all parameters of the GNN have been adjusted and optimized. As a result, all m granular rules have been discovered from the training data. Finally, the trained GNN can generate new values for new given input data.
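A minimal sketch of how Step 2 might evaluate a candidate set of interval parameters with the energy function of Eq. (27); the chromosome decoding and the split between fixed point-valued parameters and evolved interval parameters are our own assumptions, and gnn_output refers to the forward-pass sketch given after Eq. (26).

```python
def energy(xi, nu, fixed, X, Y):
    """Total energy of Eq. (27) over the training set for one candidate pair of
    interval parameters (xi, nu); 'fixed' holds the point-valued parameters
    obtained in Step 1. A GA would use the negated energy as fitness."""
    a, b, sigma, eta, gamma, psi_lo, psi_hi = fixed
    E = 0.0
    for xp, yp in zip(X, Y):
        fp = gnn_output(xp, a, b, sigma, xi, eta, nu, gamma, psi_lo, psi_hi)
        E += 0.5 * (fp - yp) ** 2
    return E
```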
4 Conclusions
To perform interval-valued granular reasoning efficiently and to optimize interval membership functions from training data effectively, a GGNN is designed. In the future, more effective and more efficient hybrid granular reasoning methods and learning algorithms will be investigated for complex applications such as bioinformatics, health, Web intelligence, security, etc.
References
1. Karnik, N.N., Mendel, J.M.: Operations on Type-2 Fuzzy Sets. Fuzzy Sets and Systems 122 (2001) 327-348
2. Fang, P.P., Zhang, Y.-Q.: Car Auxiliary Control System Using Type-2 Fuzzy Logic and Neural Networks. Proc. of WSC9, Sept. 20 - Oct. 8 (2004)
3. Jiang, F.H., Li, Z., Zhang, Y.-Q.: Hybrid Type-1-2 Fuzzy Systems for Surface Roughness Control. Proc. of WSC9, Sept. 20 - Oct. 8 (2004)
4. Lin, T.Y.: Granular Computing: Fuzzy Logic and Rough Sets. In: Zadeh, L., Kacprzyk, J. (eds.): Computing with Words in Information/Intelligent Systems (1999) 184-200
5. Karnik, N.N., Mendel, J.M., Liang, Q.: Type-2 Fuzzy Logic Systems. IEEE Trans. Fuzzy Systems 7 (1999) 643-658
6. Pedrycz, W.: Granular Computing: An Emerging Paradigm. Physica-Verlag, Heidelberg (2001)
7. Karnik, N.N., Mendel, J.M.: Applications of Type-2 Fuzzy Logic Systems to Forecasting of Time-series. Inf. Sci. 120 (1999) 89-111
8. Tang, M.L., Zhang, Y.-Q., Zhang, G.: Type-2 Fuzzy Web Shopping Agents. Proc. of IEEE/WIC/ACM-WI2004 (2004) 499-503
9. Tang, Y.C., Zhang, Y.-Q.: Intelligent Type-2 Fuzzy Inference for Web Information Search Task. In: Nikravesh, M., Zadeh, L.A., Kacprzyk, J. (eds.): Computing for Information Processing and Analysis, Studies in Fuzziness and Soft Computing 164. Physica-Verlag, Springer (2005)
10. Liang, Q., Mendel, J.M.: Interval Type-2 Fuzzy Logic Systems: Theory and Design. IEEE Trans. Fuzzy Systems 8 (2000) 535-550
11. Mendel, J.M.: Computing Derivatives in Interval Type-2 Fuzzy Logic Systems. IEEE Trans. Fuzzy Systems 12 (2004) 84-98
12. Wu, H., Mendel, J.M.: Uncertainty Bounds and Their Use in the Design of Interval Type-2 Fuzzy Logic Systems. IEEE Trans. Fuzzy Systems 10 (2002) 622-639
13. Zadeh, L.A.: Fuzzy Sets and Information Granulation. In: Gupta, N., Ragade, R., Yager, R. (eds.): Advances in Fuzzy Set Theory and Applications. North-Holland (1979) 3-18
14. Zhang, Y.-Q., Fraser, M.D., Gagliano, R.A., Kandel, A.: Granular Neural Networks for Numerical-Linguistic Data Fusion and Knowledge Discovery. Special Issue on Neural Networks for Data Mining and Knowledge Discovery, IEEE Trans. Neural Networks 11(3) (2000) 658-667
15. Zhang, Y.-Q.: Constructive Granular Systems with Universal Approximation and Fast Knowledge Discovery. IEEE Trans. Fuzzy Systems 13(1) (2005)
16. Qiu, Y., Zhang, Y.-Q., Zhao, Y.: Statistical Interval-Valued Fuzzy Systems via Linear Regression. Proc. of IEEE-GrC 2005, Beijing, July 25-27 (2005) 229-232
A Multi-Level Probabilistic Neural Network Ning Zong and Xia Hong School of Systems Engineering, University of Reading, RG6 6AY, UK
[email protected]
Abstract. Based on the idea of an important cluster, a new multi-level probabilistic neural network (MLPNN) is introduced. The MLPNN uses an incremental constructive approach, i.e. it grows level by level. The construction algorithm of the MLPNN is proposed such that the classification accuracy monotonically increases, ensuring that the classification accuracy of the MLPNN is higher than or equal to that of the traditional PNN. Numerical examples are included to demonstrate the effectiveness of the proposed new approach.
1 Introduction
A popular neural network for classification is the probabilistic neural network (PNN) [1]. The PNN classifies a sample by comparing a set of probability density functions (pdf) of the sample conditioned on different classes, where the probability density functions are constructed using a Parzen window [2]. Research on the PNN has concentrated on model reduction using various approaches, e.g. forward selection [3] and clustering algorithms [4,5]. The motivation of this paper is to investigate the possibility of further improving the classification accuracy of the PNN. We attempt to identify input regions with poor classification accuracy from a PNN and emphasize such regions as important clusters. A new multi-level probabilistic neural network (MLPNN) and the associated model construction algorithm are introduced based on the important cluster. The MLPNN uses an incremental constructive approach, i.e. it grows level by level. The classification accuracy over the training data set monotonically increases, ensuring that the classification accuracy of the MLPNN is higher than or equal to that of the traditional PNN. Two numerical examples are included to demonstrate the effectiveness of the proposed new approach. It is shown that the classification accuracy of the resultant MLPNN over the test data set also monotonically increases as the model level grows, for a finite number of levels.
2 Probabilistic Neural Network and Important Cluster
The structure of the probabilistic neural network (PNN) is shown in Figure 1. The input layer receives a sample x composed of d features x_1, ..., x_d. In the hidden layer, there is one hidden unit per training sample.
Fig. 1. The structure of a PNN
The hidden unit x_{ij} corresponds to the ith, i = 1, ..., N_j, training sample in the jth class, j = 1, ..., M. The output of the hidden unit x_{ij} with respect to x is expressed as

a_{ij}(x) = \frac{1}{(2\pi)^{d/2} \sigma^d} \exp\left\{ -\frac{(x - x_{ij})^T (x - x_{ij})}{2\sigma^2} \right\},   (1)

where σ denotes the smoothing parameter. In the output layer, there are M output units, one for each class C_j, j = 1, ..., M. The jth output is formed as

\hat{y}_j(x) = \frac{1}{N_j} \sum_{i=1}^{N_j} a_{ij}(x), \quad j = 1, ..., M.   (2)

The output layer classifies the sample x to the class C_k which satisfies

k = \arg\max_j \{ \hat{y}_j(x) \mid j = 1, ..., M \}.   (3)
The important cluster can be formed as some cluster, or “sub-region” containing some or all of the misclassified training samples of the conventional PNN constructed based on the “whole region”. In order to improve the classification accuracy of the conventional PNN, it is crucial that the classification accuracy over the important clusters is improved. Hence, in this study, we attempt to correct the misclassified training sample x through computing the discriminant functions based on only a small number of neurons around it, i.e. emphasizing the contributions of the neurons closer to it. An important cluster contains a smaller number of training samples. The classification accuracy over the important cluster may be improved by (i) initially constructing a new PNN using only the training samples in the important cluster as its neurons and then (ii) classifying the training samples in the important cluster using the new PNN.
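For concreteness, the following sketch (our own illustrative code, not taken from the paper) implements the Parzen-window PNN of Eqs. (1)-(3): one kernel per stored training sample, class outputs averaged per Eq. (2), and the argmax decision of Eq. (3).

```python
import numpy as np

def pnn_classify(x, samples_by_class, sigma):
    """Classify x with a traditional PNN, Eqs. (1)-(3).
    samples_by_class: list of arrays; entry j holds the N_j training samples of class C_j.
    sigma: smoothing parameter of the Parzen window."""
    d = x.shape[0]
    norm = 1.0 / ((2.0 * np.pi) ** (d / 2.0) * sigma ** d)
    y_hat = []
    for Xj in samples_by_class:
        sq_dist = np.sum((Xj - x) ** 2, axis=1)           # (x - x_ij)^T (x - x_ij)
        a = norm * np.exp(-sq_dist / (2.0 * sigma ** 2))  # Eq. (1), one value per hidden unit
        y_hat.append(a.mean())                            # Eq. (2)
    return int(np.argmax(y_hat))                          # Eq. (3): index k of the winning class
```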
3 The Structure of MLPNN
A MLPNN with L levels consists of K PNNs, denoted by PNN_k, k = 1, ..., K, each of which is constructed by using the training samples in a cluster G_k^{(L_k)} ⊆ G_tr as its neurons, L_k ∈ {1, ..., L}. The superscript L_k denotes the index of the level, and G_tr is some region that contains all the training samples. {G_1^{(L_1)}, G_2^{(L_2)}, ..., G_K^{(L_K)}} satisfies the following conditions:

1. L_1 = 1, 2 = L_2 ≤ L_3 ≤ ... ≤ L_{K-1} ≤ L_K = L.
2. G_i^{(L_i)} ∩ G_j^{(L_j)} = ∅ for any i ≠ j with L_i = L_j, where ∅ denotes the empty set.

By defining

G^{(l)} = \bigcup_{k=1}^{K} \chi(L_k = l) \, G_k^{(L_k)}, \quad l = 1, ..., L,   (4)

where χ(•) denotes an indicator function whose value is 1 when • is true and 0 otherwise, the lth level of the MLPNN is referred to as the collection of the PNNs among PNN_k, k ∈ {1, ..., K}, which correspond to G^{(l)}. The model structure of a MLPNN is depicted in Figure 2.
Fig. 2. The structure of a MLPNN
The "SWITCH" decides which PNN among PNN_k, k = 1, ..., K, is used to classify a sample x by calculating

I = \arg\max_k \{ L_k \mid x ∈ G_k^{(L_k)} \}.   (5)

The class label output of the MLPNN for the input x is thus

k_{MLPNN} = k_I,   (6)
where k_I is the class label output of PNN_I given the sample x. Therefore, it can be concluded that the MLPNN classifies a sample using the PNN corresponding to the cluster with the maximum level among all the clusters capturing this sample. In other words, if x ∈ G^{(L)}, it is classified by one of the PNNs in the Lth, i.e. top, level of the MLPNN. If x ∈ G^{(l-1)} \ G^{(l)}, where \ denotes the set minus operator, it is classified by one of the PNNs in the (l-1)th level of the MLPNN, l = 2, ..., L. Figure 3 illustrates the clusters of a MLPNN with 3 levels.
Fig. 3. The clusters of a MLPNN with 3 levels
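A small sketch of the "SWITCH" of Eqs. (5)-(6), under the assumption that each cluster exposes a membership test; the contains predicate and the callable-PNN interface below are hypothetical, as the paper does not specify how cluster membership is represented.

```python
def mlpnn_classify(x, clusters, pnns, levels, contains):
    """Pick the PNN whose cluster captures x at the highest level (Eq. (5))
    and return that PNN's class label for x (Eq. (6)).
    clusters[k], pnns[k], levels[k] describe G_k^(L_k), PNN_k and L_k;
    contains(cluster, x) is an assumed membership test;
    pnns[k] is assumed callable, e.g. a wrapped pnn_classify.
    Note: G_1^(1) = G_tr captures every sample, so 'capturing' is never empty."""
    capturing = [k for k in range(len(clusters)) if contains(clusters[k], x)]
    I = max(capturing, key=lambda k: levels[k])   # index of the maximum-level capturing cluster
    return pnns[I](x)                             # k_MLPNN = k_I
```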
4 The Learning Algorithm of MLPNN

4.1 The Construction Procedure of the MLPNN
The MLPNN is constructed by using an incremental learning approach, i.e. a new level of PNNs is constructed, aiming at improving the classification accuracy of the top level of the MLPNN, and added to the MLPNN to form a new top level. The construction procedure of the MLPNN is as follows (a simplified code sketch of one refinement round is given after Theorem 1).

1. Construct the first level (or first top level) of the MLPNN by constructing a traditional PNN based on the training samples in G_tr. Set PNN_1 as the traditional PNN and G_1^{(1)} as G_tr.
2. Apply PNN_1 over G_1^{(1)} for classification. Form s important clusters \tilde{G}_k^{(1)} ⊆ G_1^{(1)}, k = 1, ..., s, by clustering all the misclassified training samples using a clustering algorithm. Test PNN_1 by counting the number of misclassified training samples in \tilde{G}_k^{(1)} as n_{e,tr}^{(k)}, k = 1, ..., s.
3. Construct \tilde{PNN}_k, k = 1, ..., s, whose neurons are the training samples in \tilde{G}_k^{(1)}. Apply \tilde{PNN}_k over \tilde{G}_k^{(1)} for classification. Test \tilde{PNN}_k by counting the number of misclassified training samples in \tilde{G}_k^{(1)} as \tilde{n}_{e,tr}^{(k)}, k = 1, ..., s.
4. Compare n_{e,tr}^{(k)} and \tilde{n}_{e,tr}^{(k)}, k = 1, ..., s; if \tilde{n}_{e,tr}^{(k)} < n_{e,tr}^{(k)}, mark \tilde{G}_k^{(1)} as "pass"; otherwise, delete \tilde{G}_k^{(1)} and \tilde{PNN}_k. Count the number of "pass" as n_p. If n_p > 0, set s as n_p and construct the second level of the MLPNN by adding s new PNNs, i.e. G_2^{(2)} = \tilde{G}_1^{(1)}, ..., G_{1+s}^{(2)} = \tilde{G}_s^{(1)} and PNN_2 = \tilde{PNN}_1, ..., PNN_{1+s} = \tilde{PNN}_s, to the MLPNN to form a new top level. (Note that for notational simplicity, the passed \tilde{G}_k^{(1)} with \tilde{n}_{e,tr}^{(k)} < n_{e,tr}^{(k)} and their corresponding \tilde{PNN}_k are still denoted as \tilde{G}_k^{(1)} and \tilde{PNN}_k, k = 1, ..., s, respectively.) Set l as 2 and K as 1 + s, and continue. If n_p = 0, return with the derived MLPNN with 1 level.
5. For each G_k^{(l)}, k = K - s + 1, ..., K: (1) Apply PNN_k over G_k^{(l)} for classification and form an important cluster \tilde{G}_k^{(l)} ⊆ G_k^{(l)} by clustering all the misclassified training samples. Test PNN_k by counting the number of misclassified training samples in \tilde{G}_k^{(l)} as n_{e,tr}^{(k)}. (2) Construct \tilde{PNN}_k whose neurons are the training samples in \tilde{G}_k^{(l)} and apply \tilde{PNN}_k over \tilde{G}_k^{(l)} for classification. Test \tilde{PNN}_k by counting the number of misclassified training samples in \tilde{G}_k^{(l)} as \tilde{n}_{e,tr}^{(k)}.
6. Compare n_{e,tr}^{(k)} and \tilde{n}_{e,tr}^{(k)}, k = K - s + 1, ..., K; if \tilde{n}_{e,tr}^{(k)} < n_{e,tr}^{(k)}, mark \tilde{G}_k^{(l)} as "pass"; otherwise, set \tilde{G}_k^{(l)} = G_k^{(l)} and \tilde{PNN}_k = PNN_k. Count the number of "pass" as n_p. If n_p > 0, construct the (l+1)th level of the MLPNN by adding s new PNNs, i.e. G_{K+1}^{(l+1)} = \tilde{G}_{K-s+1}^{(l)}, ..., G_{K+s}^{(l+1)} = \tilde{G}_K^{(l)} and PNN_{K+1} = \tilde{PNN}_{K-s+1}, ..., PNN_{K+s} = \tilde{PNN}_K, to the MLPNN to form a new top level, set l = l + 1 and K = K + s, and go to step 5. If n_p = 0, return with the derived MLPNN with L = l levels.

The following theorem shows that the classification accuracy of the MLPNN over the training data set monotonically increases with the number of levels.

Theorem 1: Denote the MLE of the misclassification error rate of the MLPNN with l levels as \hat{P}_e^{(l)}; then \hat{P}_e^{(l)} < \hat{P}_e^{(l-1)}.

Proof: see [6].

A MLPNN with 1 level is equivalent to the traditional PNN. It is shown in [6] that the classification performance of the MLPNN is higher than or equal to that of the traditional PNN.
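The sketch below performs one refinement round corresponding to steps 2-4 (and, with the current top-level clusters as input, steps 5-6). The clustering routine and the data layout are placeholders, since the paper leaves the clustering algorithm unspecified; the sketch only illustrates the accept/reject ("pass") logic.

```python
def refine_level(top_clusters, build_pnn, cluster_misclassified, count_errors):
    """One refinement round of the MLPNN construction procedure.
    top_clusters: list of (samples, pnn) pairs for the current top-level clusters;
    build_pnn(samples): builds a PNN whose neurons are exactly those samples;
    cluster_misclassified(pnn, samples): groups the misclassified samples into
        important clusters (clustering algorithm left open by the paper);
    count_errors(pnn, samples): number of misclassified samples.
    Returns the accepted (important cluster, new PNN) pairs for the next level."""
    accepted = []
    for samples, pnn in top_clusters:
        for important in cluster_misclassified(pnn, samples):
            n_err_old = count_errors(pnn, important)      # errors of the existing PNN
            new_pnn = build_pnn(important)                # PNN built on the important cluster
            n_err_new = count_errors(new_pnn, important)
            if n_err_new < n_err_old:                     # "pass": keep the refinement
                accepted.append((important, new_pnn))
    return accepted
```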
4.2 Comparison with Other Approaches
The MLPNN shares some common characteristics with some other approaches. For example, the boosting [7] and the piecewise linear modelling (PLM) [8,9] also consist of a set of models. The main differences between the MLPNN and other approaches including the boosting [7] and the PLM [8,9] are as follows. 1. Models in the PLM are usually defined on a set of disjoint subsets of the training set. Models in the boosting are all defined on the whole training set. In the MLPNN, PNNs are defined on the important clusters which are disjoint when they are in the same level or overlapped when they are in the different levels. 2. Various approaches have been developed to construct the models in the PLM, such as building hyperplane using linear discriminant function [10], building
subtree using tree growing and pruning algorithm [8] and building linear model using linear system identification algorithm [9]. In the boosting, new model is trained based on the whole training set which is reweighted to deemphasize the training samples correctly classified by the existing models. In the MLPNN, new PNNs are constructed based on the important clusters which are formed by clustering the misclassified training samples of the top level of the MLPNN. 3. For a sample, the boosting combines the outputs of all the models to a final output using the weighted majority vote while the PLM and the MLPNN classify the sample according to the location of this sample, i.e. find a subset or important cluster which captures the sample and apply the corresponding local model to produce an output. 4. There are also some connections between the MLPNN and the improved stochastic discrimination (SD) [11,12]. For example, the improved SD also forms an important cluster by clustering the misclassified training samples of the existing models. However, the improved SD trains a set of new models based on the important cluster using the random sampling while in the MLPNN, new PNNs are constructed based on the important clusters. Moreover, to determine whether a new model is kept or not, the improved SD applies discernibility and uniformity test [13,14,15] while the MLPNN checks the classification accuracy.
5 Numerical Examples
In order to demonstrate the effectiveness of the proposed MLPNN, two examples are presented in this section. The numbers of misclassified training and test samples of the traditional PNN and of the proposed MLPNN were compared to demonstrate the advantages of the latter. Example 1: In this example, samples composed of 2 features x_1 and x_2 are uniformly distributed in some circular areas of a 2-dimensional space. A training set with 500 training samples and a test set with 500 test samples were generated. The training samples are plotted in the left subplot of Figure 4 and the test samples in the right subplot of Figure 4. Samples of class 1 are represented by "+" and those of class 2 by "·". A MLPNN was constructed using the proposed algorithm. The number of important clusters per level, s, was chosen as 6. The value of the smoothing parameter σ was set to 1. The numbers of misclassified training samples n_{e,tr}^{(L)} and test samples n_{e,te}^{(L)} of the constructed MLPNN, as functions of the number of levels L, are plotted with the solid line in the left and right subplots of Figure 5, respectively. It can be observed from Figure 5 that the classification accuracy of the constructed MLPNN monotonically increases as L grows. The constructed MLPNN is terminated at 3 levels because, when L > 3, no newly constructed PNNs and corresponding important clusters are kept and the learning procedure is automatically stopped.
Fig. 4. Training samples and test samples in Example 1

Fig. 5. Number of misclassified training samples (left) and test samples (right) of MLPNN in Example 1. s = 6.
To investigate the effect of s on the classification accuracy of the MLPNN, we increased s to 15 and plotted the corresponding performance curves in Figure 6. It can be observed from Figure 6 that only on the training set does the classification accuracy of the constructed MLPNN monotonically increase as L grows, while on the test set the classification accuracy fails to increase after L reaches some point.
Fig. 6. Number of misclassified training samples (left) and test samples (right) of MLPNN in Example 1. s = 15.

Fig. 7. Number of misclassified training samples (left) and test samples (right) of MLPNN in Example 2
One feasible explanation is that too large a value of s means too many small important clusters in the MLPNN. Hence information in the training set is overemphasized and the MLPNN may fit the noise of the training set. Fitting the noise of the training set usually impairs the model's generalization capability. However, L and s can be determined empirically through the general approach of cross-validation. Because the traditional PNN is the first level of the MLPNN, it can be observed from Figure 5 and Figure 6 that the constructed MLPNNs have higher classification accuracy than the traditional PNNs.
Example 2: In this example, the BUPA liver disorders data set obtained from the repository at the University of California at Irvine [16] was used. The data set contains 345 samples of 2 classes, with each sample having 6 features and 1 class label. The first 200 samples were selected as training samples and the remaining 145 samples were used as test samples. With a predetermined value σ = 50, a set of MLPNNs was trained, where the number of important clusters per level s was determined through cross-validation as 4. The simulation results are shown in Figure 7. It is seen that the MLPNN improves the classification accuracy until L = 3.
6 Conclusions
A new MLPNN has been introduced to improve the classification accuracy of the traditional PNN, based on the concept of an important cluster. The construction algorithm of MLPNN has been introduced. Numerical examples have shown that the proposed MLPNN offers improvement on the classification accuracy over the conventional PNNs.
References
1. Specht, D.F.: Probabilistic Neural Networks. Neural Networks 3 (1990) 109-118
2. Duda, R.O., Hart, P.E.: Pattern Classification and Scene Analysis. Wiley, New York (1973)
3. Mao, K.Z., Tan, K.C., Ser, W.: Probabilistic Neural-Network Structure Determination for Pattern Classification. IEEE Transactions on Neural Networks 3 (2000) 1009-1016
4. Specht, D.F.: Enhancements to the Probabilistic Neural Networks. In: Proc. IEEE Int. Conf. Neural Networks, Baltimore, MD (1992) 761-768
5. Zaknich, A.: A Vector Quantization Reduction Method for the Probabilistic Neural Networks. In: Proc. IEEE Int. Conf. Neural Networks, Piscataway, NJ (1997)
6. Zong, N.: Data-based Models Design and Learning Algorithms for Pattern Recognition. PhD thesis, School of Systems Engineering, University of Reading, UK (2006)
7. Breiman, L.: Arcing Classifiers. Annals of Statistics 26 (1998) 801-849
8. Gelfand, S.B., Ravishankar, C.S., Delp, E.J.: Tree-Structured Piecewise Linear Adaptive Equalization. IEEE Trans. on Communications 41 (1993) 70-82
9. Billings, S.A., Voon, W.S.F.: Piecewise Linear Identification of Nonlinear Systems. Int. J. Control 46 (1987) 215-235
10. Sklansky, J., Michelotti, L.: Locally Trained Piecewise Linear Classifiers. IEEE Trans. on Pattern Analysis and Machine Intelligence PAMI-2 (1980) 101-111
11. Zong, N., Hong, X.: On Improvement of Classification Accuracy for Stochastic Discrimination - Multi-class Classification. In: Proc. of Int. Conf. on Computing, Communications and Control Technologies, CCCT'04 3 (2004) 109-114
12. Zong, N., Hong, X.: On Improvement of Classification Accuracy for Stochastic Discrimination. IEEE Trans. on Systems, Man and Cybernetics, Part B: Cybernetics 35 (2005) 142-149
13. Kleinberg, E.M.: Stochastic Discrimination. Annals of Mathematics and Artificial Intelligence 1 (1990) 207-239
14. Kleinberg, E.M.: An Overtraining-Resistant Stochastic Modeling Method for Pattern Recognition. Annals of Statistics 24 (1996) 2319-2349
15. Kleinberg, E.M.: On the Algorithmic Implementation of Stochastic Discrimination. IEEE Trans. on Pattern Analysis and Machine Intelligence 22 (2000) 473-490
16. ftp://ftp.ics.uci.edu/pub/machine-learning-databases/liver-disorders
An Artificial Immune Network Model Applied to Data Clustering and Classification

Chenggong Zhang and Zhang Yi

Computational Intelligence Laboratory, School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 610054, P.R. China
{zcg,zhangyi}@uestc.edu.cn
Abstract. A novel tree structured artificial immune network is proposed. The trunk nodes and leaf nodes represent memory antibodies and non-memory antibodies, respectively. A link is set up between two antibodies immediately after one has been reproduced by the other. By introducing well designed immune operators such as clonal selection, cooperation, suppression and topology updating, the network evolves from a single antibody into clusters that are well consistent with the local distribution and local density of the original antigens. The framework of the learning algorithm and several key steps are described. Experiments are carried out to demonstrate the learning process and classification accuracy of the proposed model.
1 Introduction
Over the past few years, the Artificial Immune Network (AIN) has emerged as a novel bio-inspired computational model that provides favorable characteristics for a variety of application areas. The AIN is inspired by the immune network theory proposed by Jerne in 1974 [1], which states that the immune system is composed of B cells and the interactions between them; the B cells receive antigenic stimulus and maintain interactions through mutual stimulation or suppression; thus the immune system acts as a self-regulatory mechanism that can recognize antigens and memorize the characteristics of such antigens even in the absence of their stimulation. Several artificial immune network models have been proposed based on Jerne's theory and applied to a variety of application areas [2,3,4,5,6]. In this paper we propose a novel AIN model - the Tree Structured Artificial Immune Network (TSAIN). By implementing novel immune operators on the antibody population, such as clonal selection, cooperation and suppression, the network evolves into clusters of controlled size that are well consistent with the local distribution and local density of the original antigens. Compared with former models [2,3], the network topology plays a more important role in our method. The topology grows along with the evolution of the antibody population. A topological link is set up between two antibodies immediately after one has been reproduced by
another. Hence there is no need to define a threshold like NAT in [2] to judge whether two antibodies should be connected. A further advantage of adopting the tree structure is that the mutual cooperation between antibodies provides the network with self-organizing capacity. The mutual suppression and topology updating make the topological structure consistent with the clusters in shape space. The parameters of the learning algorithm are time-varying, which simplifies the stopping criterion; the final convergence of the network is also ensured.
2 Tree Structured Artificial Immune Network
We first give an overview of the learning algorithm of our proposed model:

Algorithm 1: The learning algorithm of TSAIN
1: Initialize the antibody population with a single non-memory antibody;
2: gen = 0;
3: while gen++ < maximum generation do
4:   Randomly choose an antigen ag;
5:   Calculate aff_{r,ag} for each antibody r, where aff_{r,ag} = 1 / (1 + ||r - ag||);
6:   best = argmax_r (aff_{r,ag});
7:   if best is a non-memory antibody then
8:     best.stimulation++;
9:     if best.stimulation == clonal selection threshold then
10:      best goes through cloning and produces children antibodies OS;
11:      mutate each ab ∈ OS;
12:      set up topological links between best and each ab ∈ OS;
13:      convert best to a memory antibody;
14:    end if
15:  end if
16:  Antibody cooperation;
17:  Antibody suppression;
18:  Topology updating;
19: end while
20: Delete all non-memory antibodies;
2.1 Clonal Selection
The antibody population is divided into non-memory antibodies, represented by leaf nodes, and memory antibodies, represented by trunk nodes. The non-memory antibodies serve as the candidates for memory antibodies and as the medium for relaying cooperation signals; the memory antibodies stand for the formed immune memory of antigens which have already been presented. Once an antigen arrives, the antibody with the highest affinity against that antigen is selected as best (see Algorithm 1) and increases its stimulation.
Further, if best is non-memory and its stimulation attains a certain threshold called sti, then it goes through the clonal selection process:

1. Generate children antibodies OS with the size calculated by

|OS| = \max\left(1, \; \frac{nt\,(1 - aff)(1 - mc)}{aff\,(1 - nt)} + mc\right),   (1)

where aff is the affinity of best against the current antigen, mc ≥ 1 is the predefined maximum size of OS, and nt ∈ [0, 1) is the predefined affinity threshold. If aff ≤ nt, the size of OS will be 1. Each newab_i ∈ OS is an identical copy of best.

2. Each newab_i ∈ OS goes through the mutation process

newab_i = newab_i + var · N(0, 1),   (2)

where N(0, 1) is the standard normal distribution and var controls the intensity of the mutation.

3. Convert best to a memory antibody. It then enters a dormant phase in which its stimulation level will not be increased any more, and it will not reproduce children antibodies in future evolution. In other words, if we regard the chance of reproduction as a kind of resource, then when best has finished the reproduction, the resource it holds is bereaved and passed to its children.

By using the clonal selection process, the antibodies with higher affinity gradually increase their proportion in the whole population.
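A short sketch of the clone-size rule of Eq. (1) and the Gaussian mutation of Eq. (2); the rounding of the clone size to an integer and the scaffolding around the two formulas are our own assumptions.

```python
import numpy as np

def clone_and_mutate(best, aff, mc, nt, var, rng=np.random):
    """Clonal selection step: reproduce 'best' into |OS| mutated children.
    best: antibody vector; aff: its affinity against the current antigen (assumed > 0);
    mc >= 1: maximum clone size; nt in [0, 1): affinity threshold; var: mutation intensity."""
    size = max(1, int(round(nt * (1.0 - aff) * (1.0 - mc) / (aff * (1.0 - nt)) + mc)))  # Eq. (1)
    children = []
    for _ in range(size):
        child = best + var * rng.standard_normal(best.shape)   # Eq. (2)
        children.append(child)
    return children
```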
2.2 Antibody Cooperation
When the clonal selection has finished, the algorithm enters the cooperation phase, in which each antibody ab_i moves according to four factors: the position of the current antigen, its topological distance from best, the current learning rate and the current neighborhood width. That is,

ab_i = ab_i + \lambda_{gen} \cdot e^{-d_i^2 / (2 \delta_{gen})} \cdot (ag - ab_i),   (3)

where gen is the current generation number, \lambda_{gen} ≤ 1 is the current learning rate, d_i is the topological distance between ab_i and best, and \delta_{gen} > 0 is the current neighborhood width that controls the influence zone of best. In each generation, \lambda_{gen} and \delta_{gen} are determined by

\lambda_{gen} = (\lambda_1 - \lambda_0) \cdot (gen / G)^k + \lambda_0,   (4)

\delta_{gen} = (\delta_1 - \delta_0) \cdot (gen / G)^k + \delta_0,   (5)

where 0 < \lambda_1 < \lambda_0 ≤ 1 and \delta_0 > \delta_1 > 0. G is the maximum generation number. k > 0 is used to control the convergence rate of \lambda_{gen} and \delta_{gen}.
From Eq. (3) we can see that all antibodies seek to approach the current antigen, i.e., they move in the same direction as best, and the intensity of this movement decreases with their topological distance from best. In fact, the moving of the antibodies can be regarded as a form of reaction; hence we can say that the antibodies react cooperatively to the current antigen. This is why we call this mechanism "cooperation". Notice that we adopt a tree structure as the network topology. Thus the topological distance between any two antibodies is definite, since there is exactly one path between any two antibodies. Consequently, the cooperation intensity between antibodies is also definite.
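As an illustration of Eqs. (3)-(5), the sketch below moves every antibody toward the current antigen with a strength that decays with its tree distance from best; the tree-distance function is assumed to be available, and the exponent uses 2·δ_gen as read off Eq. (3).

```python
import numpy as np

def cooperate(antibodies, ag, best_idx, tree_distance, gen, G, k,
              lam0=0.5, lam1=0.01, delta0=30.0, delta1=0.5):
    """Cooperative reaction to the current antigen ag (Eqs. (3)-(5)).
    antibodies: 2-D array of antibody vectors (modified in place);
    tree_distance(i, j): topological distance in the network tree (assumed given)."""
    lam = (lam1 - lam0) * (gen / G) ** k + lam0          # Eq. (4), decreasing learning rate
    delta = (delta1 - delta0) * (gen / G) ** k + delta0  # Eq. (5), shrinking neighbourhood
    for i in range(len(antibodies)):
        d = tree_distance(i, best_idx)
        strength = lam * np.exp(-d ** 2 / (2.0 * delta))  # neighbourhood factor of Eq. (3)
        antibodies[i] += strength * (ag - antibodies[i])
    return antibodies
```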
2.3 Antibody Suppression
We implement a population control mechanism by using mutual suppression based on topological links. For any two antibodies ab_i and ab_j, if the suppression condition is satisfied, i.e. they do not have a lineal relationship and their affinity is larger than the suppression threshold st, then the one with the larger offspring size is the winner. Let ab_i be the winner; then it imposes one of the following suppressing operators on ab_j:

1. Delete ab_j and all of its offspring with probability 1 - p.
2. Remove the link between ab_j and its father and then create a link between ab_i and ab_j with probability p.

We define p = gen / G. It means that in the initial phase the suppression inclines to shape the network, while in the ending phase the suppression inclines to adjust the network topology in a non-reducing manner. Each antibody goes through the suppression until no pair of antibodies satisfies the suppression condition. Notice that after the second type of suppression, the network structure is still a tree. By using the suppression, the size of the sub-population in each cluster is kept under control. In each iteration, st is updated by

st_{gen} = (st_1 - st_0) \cdot (gen / G)^k + st_0, \quad 0 < st_1 < st_0 < 1.   (6)

2.4 Topology Updating
When the suppression has finished, the topology updating is performed. In this phase, the links between antibodies with affinity smaller than ct are removed. By using this mechanism, there will be more independent branches in the tree structured network, which represent different clusters. In each iteration, ct is updated by

ct_{gen} = (ct_1 - ct_0) \cdot (gen / G)^k + ct_0, \quad 0 < ct_0 < ct_1 < 1.   (7)
Table 1. Parameter settings for the experiments

                 G      I  k    sti  mc  nt     λ0   λ1    δ0  δ1   st0    st1    ct0    ct1
Artificial data  24000  1  0.2  20   3   0.935  0.5  0.01  30  0.5  0.995  0.985  0.806  0.935
Real problem     4000   1  0.2  10   3   0.91   0.5  0.01  30  0.5  0.91   0.74   0.54   0.83
3 Simulations

3.1 Artificial Dataset
We first use a 2-dimensional artificial data set (Fig. 1(a)) to show the learning process of TSAIN. The original data set contains 3 clusters of different shapes in the unit square. Each cluster has 640 samples, produced by adding noise to standard curves. There are 80 additional noise samples independently distributed in the unit square. The parameter settings are listed in Table 1.
Fig. 1. The artificial data set and the network obtained through the learning process: (a) the artificial data set; (b) the final network
Fig. 1(b) shows the final resultant network, which contains a total of 501 memory antibodies distributed in 10 clusters. Two of the 10 clusters are observably located in the noise areas and contain 10 memory antibodies. These 501 memory antibodies are used to represent the original antigen population. Fig. 2 visually demonstrates the network in different generations. From it we can see that the network evolves from a single antibody into a tree structured network containing a number of antibodies whose positions are consistent with the local distribution and local density of the original antigen population.

3.2 Real Problem
The second experiment is based on the Wisconsin Breast Cancer Database [7]. The original database contains 699 instances, each with 9 numeric-valued attributes. Since there are 16 instances that contain missing attribute values, we only use the remaining 683 instances in our experiment.

Fig. 2. The evolution process of antibody population and network topology (snapshots at iterations 3000 to 24000 in steps of 3000)

Table 2. Comparative classification accuracy

(a) Our result

Time  Average accuracy (%)
      Train  Validate
1     97.3   96.8
2     97.2   96.6
3     97.2   95.9
4     97.1   96.2
5     97.2   96.4
6     97.2   96.4

(b) Historical results

Method                             Reported accuracy (%)
C4.5 [8]                           94.74
RIAC [9]                           94.99
LDA [10]                           96.80
NEFCLASS [11]                      95.06
Optimized-LVQ [12]                 96.70
Supervised fuzzy clustering [13]   95.57
The instances are divided into 2 classes: class 0 (tested benign) contains 444 (65.0%) instances; class 1 (tested malignant) contains 239 (35.0%) instances. We apply 10-fold cross-validation 6 times. The attributes are normalized before the experiment. Table 1 lists the parameter settings used in the experiment. We use two separate antibody populations, each representing one cancer class (0 or 1). In each training process, both populations are evolved independently using their corresponding antigen populations. When both populations are obtained, the final resultant network is the intersection of them. When an unseen antigen is presented, a best antibody is selected (see the definition of best in Algorithm 1),
and the antigen is classified as the class to which best belongs. Table 2(a) lists the final classification accuracy in each of the 6 runs. The overall average accuracy on the training set is 97.2%, and the overall average accuracy on the validation set is 96.4%. Table 2(b) lists results reported on the same data set using 10-fold cross-validation in former research. We can see that our model outperforms some former methods in terms of validation accuracy.
4 Conclusions
In this paper, we proposed a new artificial immune network model. The basic components of our model are antibodies and the topological links between them. With the help of clonal selection and cooperation, the network exhibits a self-organizing property. By using the suppression, the antibodies compete for occupancy of the cluster areas. The introduction of topology updating ensures the consistency of the network topology with the distribution of clusters. Experimental results show that the learning algorithm exhibits good learning capacity. In future work, more experiments on complicated data sets will be carried out.
References
1. Jerne, N.: Towards a Network Theory of the Immune System. Ann. Immunol. 125 (1974) 373-389
2. Timmis, J., Neal, M.: A Resource Limited Artificial Immune System for Data Analysis. Knowledge-Based Systems 14 (2001) 121-130
3. Castro, L.N.D., Zuben, F.J.V.: aiNet: An Artificial Immune Network for Data Analysis. Int. J. Computational Intelligence and Applications 1 (3) (2001)
4. Knight, T., Timmis, J.: A Multi-layered Immune Inspired Machine Learning Algorithm. In: Lotfi, A., Garibaldi, M. (eds.): Applications and Science in Soft Computing. Springer (2003) 195-202
5. Nasaroui, O., Gonzalez, F., Cardona, C., Rojas, C., Dasgupta, D.: A Scalable Artificial Immune System Model for Dynamic Unsupervised Learning. In: Cantú-Paz, E. et al. (eds.): Proceedings of GECCO 2003. Lecture Notes in Computer Science 2723, Springer-Verlag, Berlin Heidelberg (2003) 219-230
6. Neal, M.: Meta-Stable Memory in an Artificial Immune Network. In: Timmis, J., Bentley, P., Hart, E. (eds.): Proceedings of ICARIS 2003. Lecture Notes in Computer Science 2787, Springer-Verlag, Berlin Heidelberg (2003) 168-180
7. Newman, D.J., Hettich, S., Blake, C.L., Merz, C.J.: UCI Repository of Machine Learning Databases. University of California, Irvine, Dept. of Information and Computer Sciences (1998)
8. Quinlan, J.R.: Improved Use of Continuous Attributes in C4.5. J. Artif. Intell. Res. 4 (1996) 77-90
9. Hamilton, H.J., Shan, N., Cercone, N.: RIAC: A Rule Induction Algorithm Based on Approximate Classification. Technical Report CS 96-06, University of Regina
10. Ster, B., Dobnikar, A.: Neural Networks in Medical Diagnosis: Comparison with Other Methods. In: Proceedings of the International Conference on Engineering Applications of Neural Networks (1996) 427-430
11. Nauck, D., Kruse, R.: Obtaining Interpretable Fuzzy Classification Rules from Medical Data. Artif. Intell. Med. 16 (1999) 149-169
12. Goodman, D.E., Boggess, L., Watkins, A.: Artificial Immune System Classification of Multiple-class Problems. In: Proceedings of the Artificial Neural Networks in Engineering (2002) 179-183
13. Abonyi, J., Szeifert, F.: Supervised Fuzzy Clustering for the Identification of Fuzzy Classifiers. Pattern Recognition Lett. 24 (2003) 2195-2207
Sparse Coding in Sparse Winner Networks

Janusz A. Starzyk¹, Yinyin Liu¹, and David Vogel²

¹ School of Electrical Engineering & Computer Science, Ohio University, Athens, OH 45701
{starzyk,yliu}@bobcat.ent.ohiou.edu
² Ross University School of Medicine, Commonwealth of Dominica
[email protected]
Abstract. This paper investigates a mechanism for reliable generation of sparse code in a sparsely connected, hierarchical, learning memory. Activity reduction is accomplished with local competitions that suppress activities of unselected neurons so that costly global competition is avoided. The learning ability and the memory characteristics of the proposed winner-take-all network and an oligarchy-take-all network are demonstrated using experimental results. The proposed models have the features of a learning memory essential to the development of machine intelligence.
1 Introduction

In this paper we describe a learning memory built as a hierarchical, self-organizing network in which many neurons activated at lower levels represent detailed features, while very few neurons activated at higher levels represent objects and concepts in the sensory pathway [1]. By recognizing the distinctive features of patterns in a sensory pathway, such a memory may be made to be efficient, fault-tolerant, and to a useful degree, invariant. Lower level features may be related to multiple objects represented at higher levels. Accordingly, the number of neurons increases up the hierarchy with the neurons at lower levels making divergent connections with those on higher levels [2]. This calls to mind the expansion in number of neurons along the human visual pathway (e.g., a million geniculate body neurons drive 200 million V1 neurons [3]). Self-organization is a critical aspect of the human brain in which learning occurs in an unsupervised way. Presentation of a pattern activates specific neurons in the sensory pathway. Gradually, neuronal activities are reduced at higher levels of the hierarchy, and sparse data representations, usually referred to as "sparse codes", are built. The idea of "sparse coding" emerged in several earlier works [4][5]. In recent years, various experimental and theoretical studies have supported the assumption that information in real brains is represented by a relatively small number of active neurons out of a large neuronal population [6][7][3]. In this paper, we implement the novel idea of performing pathway selections in sparse network structures. Self-organization and sparse coding are obtained by means
of localized, winner-take-all (WTA) competitions and Hebbian learning. In addition, an oligarchy-take-all (OTA) concept and its mechanism are proposed to produce redundant, fault-tolerant information coding. This paper is organized as follows. In Section 2, a winner network is described that produces sparse coding and activity reduction in the learning memory. In Section 3, an OTA network is described that produces unsupervised, self-organizing learning with distributed information representations. Section 4 demonstrates the learning capabilities of the winner and the OTA networks using experimental results. Finally, our method of sparse coding in sparse structures is summarized in Section 5.
2 The Winner Network

In the process of extracting information from data, we expect to predictably reduce neuronal activities at each level of a sensory pathway. Accordingly, a competition is required at each level. In unsupervised learning, we need to find the neuron in the network that has the best match to the input data. In neural networks, such a neuron is usually determined using a WTA network [8][9]. A WTA network is usually implemented as a competitive neural network in which inhibitory lateral links and recurrent links are utilized, as shown in Fig. 1. The outputs iteratively suppress each other's signal strength, and the neuron with the maximum signal strength remains the only active neuron when the competition is done. For a large memory, with many neurons on the top level, a global WTA operation is complex, inaccurate and costly. Moreover, the average competition time increases as the likelihood of similar signal strengths increases in large WTA networks.
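A minimal sketch of the iterative lateral-inhibition WTA described above (a MAXNET-style update); the inhibition constant eps is our own choice and not a value given in the paper.

```python
import numpy as np

def wta_iterative(strengths, eps=None, max_iter=1000):
    """Iterative WTA with uniform lateral inhibition: each unit repeatedly subtracts a
    fraction of its competitors' summed activity until only one unit stays positive."""
    y = np.asarray(strengths, dtype=float).copy()
    if eps is None:
        eps = 1.0 / len(y)            # a common choice for the inhibition strength
    for _ in range(max_iter):
        active = y > 0
        if active.sum() <= 1:
            break
        y = np.maximum(0.0, y - eps * (y.sum() - y))  # subtract competitors' activity
    return int(np.argmax(y))           # index of the surviving (winner) unit
```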
Fig. 1. WTA network as a competitive neural network
The use of sparse connections between neurons can, at the same time, improve efficiency and reduce energy consumption. However, sparse connections between neurons on different hierarchical levels may fail to transmit enough information along the hierarchy for reliable feature extraction and pattern recognition. In a local network model for cognition, called an "R-net" [10][11], secondary neurons, with random connections to a fraction of primary neurons in other layers, effectively provide almost complete connectivity between primary neuron pairs. While R-nets provide large
capacity associative memories, they were not used for feature extraction and sparse coding in the original work. The R-net concept is expanded in this work by using secondary neurons to fully connect primary neurons on lower levels to primary neurons on higher levels through the secondary neurons of a sparsely connected network. The network has an increasing number of neurons on the higher levels, and all neurons on the same level have an equal number of input links from neurons on the lower level. The number of secondary levels between primary levels affects the overall network sparsity. More secondary levels can be used to increase the network sparsity. Such a sparsely connected network with secondary levels is defined as the winner network and is illustrated in Fig. 2.
Fig. 2. Primary level and secondary level in the winner network
The initial random input weights to each neuron are scaled to have a sum of squared weights equal to 1, which places them on the unit multidimensional sphere. Because a neuron becomes active when its input weight vector is similar to its input pattern, spreading the input weights uniformly on the unit-sphere increases the memory capacity of the winner network. Furthermore, the normalization of the weights maintains the overall input signal level so that the output signal strength of neurons, and accordingly the output of the network, will not be greatly affected by the number of input connections. In a feed-forward computation, each neuron combines its weighted inputs using a thresholded activation function. Only when the signal strength is higher than the activation threshold can the neuron send a signal to its post-synaptic neurons. Eventually, the neurons on the highest level will have different levels of activation, and the most strongly activated neuron (the global winner) is used to represent the input pattern. In this work, the competition to find the global winner is replaced by small-scale WTA circuits in local regions in the winner network as described next. In a sparsely connected network, each neuron on the lower level connects to a group of neurons on the next higher level. The winning neuron at this level is found by comparing neuronal activities. In Hebbian learning, weight adjustments reduce the plasticity of the winning neuron’s connections. Therefore, a local winner should not
only have the maximum response to the input, but also its connections should be flexible enough to be adjusted towards the input pattern so that the local winner is,
only have the maximum response to the input, but its connections should also be flexible enough to be adjusted towards the input pattern, so that the local winner is

s_{winner}^{level+1} = \max_{j \in N_i^{level+1}} \left\{ \left( \sum_{k \in N_j^{level}} w_{jk} \, s_k^{level} \right) \cdot \rho_{ji} \right\}, \quad i = 1, 2, ..., N^{level},   (1)

where N_i^{level+1} is the set of post-synaptic neurons on level (level+1) driven by a neuron i, N_j^{level} is the set of pre-synaptic neurons that project onto neuron j on level (level), and ρ_{ji} denotes the plasticity of the link between pre-synaptic neuron i and post-synaptic neuron j, as shown in Fig. 3(a).
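A sketch of the local competition of Eq. (1): for a pre-synaptic neuron i, each candidate post-synaptic neuron j is scored by its plasticity-weighted activation, and the highest-scoring j wins. The data structures (weight matrix, connection lists, plasticity matrix) are our own assumptions.

```python
def local_winner(i, post_of, pre_of, w, s, rho):
    """Pick the local winner among the post-synaptic neurons of neuron i (Eq. (1)).
    post_of[i]: indices of neurons on level (level+1) driven by i;
    pre_of[j]: indices of neurons on level (level) projecting onto j;
    w[j, k]: connection weight; s[k]: signal strength on the lower level;
    rho[j, i]: plasticity of the link from i to j."""
    def score(j):
        act = sum(w[j, k] * s[k] for k in pre_of[j])   # weighted input of candidate j
        return act * rho[j, i]                         # favour still-plastic links
    return max(post_of[i], key=score)
```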
Fig. 3. (a) Interconnection structure to determine a local winner; (b) the winner network
Such local competition can be easily implemented using a current-mode WTA circuit [12]. A local winner neuron, for example N_4^{level+1} in Fig. 3(a), will pass its signal strength to its pre-synaptic neuron N_4^{level}, and all other post-synaptic branches connecting neuron N_4^{level} with the losing nodes will be logically cut off. Such local competition is done first on the highest level. The signal strengths of neurons which win their corresponding local competitions propagate down to the lower levels, and the same procedure continues until the first input layer is reached. The global winning neuron on the top level depends on the results of all local competitions. Subsequently, the signal strength of the global winner is propagated down to all lower-level neurons which connect to the global winner. Most of the branches not connected to the global winner are logically cut off, while the branches of the global winner are kept active. All the branches that propagate the local winner signal down the hierarchy form the winner network, as shown in Fig. 3(b). Depending on the connectivity structure, one or more winner networks can be found. By properly choosing the connectivity structure, we may guarantee that all of the input neurons are in a single winner network so that the output level contains a single winner. Let us use a 3-layer winner network (1 input level, 2 secondary levels and 1 output level) as an example. The network has 64 primary input neurons and 4096 output neurons, with 256 and 1024 secondary neurons, respectively. The number of active neurons in the top level decreases with increasing numbers of input connections. As shown in Fig. 4, when the number of input links to each neuron is more than 8, a single winner neuron in the top level is achieved.
Since the branches logically cut off during local competition will not contribute to post-synaptic neuronal activities, the synaptic strengths are recalculated only for branches in the winner network. As all the branches of the winner network are used, the signal strength of pathways to the global winner is not reduced. However, due to the logically disconnected branches, the signal strength of pathways to other output neurons is suppressed. As a result, an input pattern activates only some of the neurons in the winner networks. The weights are adjusted using Hebbian learning only for links in winner networks, to reinforce the activation level of the global winner. After updating, the weights are scaled so that they still lie on the unit-sphere. In general, the winner network with secondary neurons and sparse connections builds sparse representations in three steps: sending data up through the hierarchy, finding the winner network and global winner by using local competitions, and training. The winner network finds the global winner efficiently, without the iterations usually adopted in MAXNET [8][9]. It provides an effective and efficient solution to the problem of finding global winners in large networks. The advantages of sparse winner networks are significant for large size memories.
Fig. 4. Effect of the number of input connections to neurons (number of active neurons on the top level vs. number of input links to each neuron)
3 Winner Network with Oligarchy-Take-All

The recognition using a single-neuron representation scheme in the winner network can easily fail because of noise, faults, variant views of the same object, or learning of other input patterns due to an overlap between activation pathways. In order to have distributed, redundant data representations, an OTA network is proposed in this work to use a small group of neurons as input representations. In an OTA network, the winning neurons in the oligarchy are found directly in a feed-forward process instead of the 3-step procedure used in the winner network as described in Section 2. Neurons in the 2nd layer combine weighted inputs and use a threshold activation function as in the winner network. Each neuron in the 2nd layer competes in a local competition. The projections onto losing nodes are logically cut off. The same Hebbian learning as is used in the winner network is carried out on the
logically connected links. Afterwards, the signal strengths of the 2nd level are recalculated considering only effects of the active links. The procedure is continued until the top level of hierarchy is reached. Only active neurons on each level are able to send the information up the hierarchy. The group of active neurons on the top level provides redundant distributed coding of the input pattern. When similar patterns are presented, it is expected that similar groups of neurons will be activated. Similar input patterns can be recognized from the similarities of their highest level representations.
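The paper recognizes patterns from the similarity of their top-level OTA representations; one simple way to realize this comparison (the overlap measure below is our assumption, not a detail given in the text) is to compare active-neuron sets directly.

```python
def ota_match(active_set, stored_sets):
    """Return the label of the stored pattern whose top-level active-neuron set
    overlaps most with that of the presented pattern (assumed similarity measure).
    active_set: set of active top-level neuron indices for the presented pattern;
    stored_sets: dict mapping a label to its stored set of active neuron indices."""
    def overlap(a, b):
        return len(a & b) / max(1, len(a | b))   # Jaccard-style overlap
    return max(stored_sets, key=lambda label: overlap(active_set, stored_sets[label]))
```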
4 Experimental Results

The learning abilities of the proposed models were tested on the 3-layer network described in Section 2. The weights of connections were randomly initialized within the range [-1, 1]. A set of handwritten digits from the benchmark database [13], containing data in the range [-1, 1], was used to train the winner network or OTA networks. All patterns have 8 by 8 grey pixel inputs, as shown in Fig. 5. Each input pattern activates between 26 and 34 out of 4096 neurons on the top level. The groups of active neurons in the OTA network for each digit are shown in Table 1. On average, each pattern activates 28.3 out of 4096 neurons on the top level, with a minimum of 26 neurons and a maximum of 34 neurons.
Fig. 5. Ten typical patterns for each digit Table 1. Active neuron index in the OTA network for handwritten digit patterns
digit | active neuron indices in the OTA network
0 | 72, 91, 365, 371, 1103, 1198, 1432, 1639, …
1 | 237, 291, 377, 730, 887, 1085, 1193, 1218, …
2 | 294, 329, 339, 771, 845, 1163, 1325, 1382, …
3 | 109, 122, 237, 350, 353, 564, 690, 758, …
4 | 188, 199, 219, 276, 307, 535, 800, 1068, …
5 | 103, 175, 390, 450, 535, 602, 695, 1008, …
6 | 68, 282, 350, 369, 423, 523, 538, 798, …
7 | 237, 761, 784, 1060, 1193, 1218, 1402, 1479, …
8 | 35, 71, 695, 801, 876, 1028, 1198, 1206, …
9 | 184, 235, 237, 271, 277, 329, 759, 812, …
The ability of the network to classify was tested by changing 5 randomly selected bits of each training pattern. Comparing the OTA neurons obtained during training with those activated by the variant patterns, we find that the OTA network successfully recognizes 100% of the variant patterns. It is expected that changing more bits of the original patterns will degrade recognition performance. However, the tolerance of the OTA network for such change is expected to be better than that of the winner network.
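The paper does not spell out the matching rule used to compare top-level codes; one plausible reading, sketched below with hypothetical names, is to assign a noisy pattern to the stored digit whose active-neuron set overlaps most with the code the network produces for it.

```python
def recognize(code, stored_codes):
    """Classify a noisy pattern by the overlap of its top-level active-neuron
    set with the codes stored during training (illustrative assumption;
    `stored_codes` maps a digit label to its set of active neuron indices)."""
    def overlap(a, b):
        return len(a & b) / len(a | b)          # Jaccard similarity of the two codes
    return max(stored_codes, key=lambda d: overlap(code, stored_codes[d]))
```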
Fig. 6 compares the performance of the winner network and the OTA network for different numbers of changed bits in the training patterns, based on 10 Monte-Carlo trials. We note that increasing the number of changed bits in the patterns quickly degrades the winner network's performance on this recognition task. When the number of changed bits is larger than 20, the recognition accuracy stays around 10%. However, 10% is the accuracy level of random recognition for the 10 digit patterns; this means that when the number of changed bits is over 20, the winner network is unable to perform useful recognition. As anticipated, the OTA network has much better fault tolerance and is far more resistant to this degradation of recognition accuracy.
Fig. 6. Recognition performance of the OTA network and the winner network
5 Conclusions
This paper investigates a mechanism for reliably producing sparse coding in sparsely connected networks and for building a high-capacity memory with redundant coding in sensory pathways. Activity reduction is accomplished with local rather than global competition, which reduces the hardware requirements and the computational cost of self-organizing learning. High memory capacity is obtained by means of layers of secondary neurons with optimized numbers of interconnections. In the winner network, each pattern activates a dominant neuron as its representation. In the OTA network, a pattern triggers a distributed group of neurons. With OTA, information is redundantly coded so that recognition is more reliable and robust. The learning ability of the winner network is demonstrated using experimental results. The proposed models produce features of a learning memory that may prove essential for developing machine intelligence.
References 1. Starzyk, J.A., Liu, Y., He, H.: Challenges of Embodied Intelligence. Proc. Int. Conf. on Signals and Electronic Systems, ICSES'06, Lodz, Poland, Sep. 17-20 (2006) 2. Kandel, E.R., Schwartz, J.H., Jessell, T.M.: Principles of Neural Science. McGraw-Hill Medical 4th edition (2000) 3. Anderson, J.: Learning in Sparsely Connected and Sparsely Coded System. Ersatz Brain Project working note (2005) 4. Barlow, H.B.: Single Units and Sensation: A Neuron Doctrine for Perceptual Psychology? Perception 1 (1972) 371-394 5. Amari, S.: Neural Representation of Information by Sparse Encoding, Brain Mechanisms of Perception and Memory from Neuron to Behavior. Oxford University Press (1993) 630-637 6. Földiak, P., Young, M.P.: Sparse Coding in the Primate Cortex, The Handbook of Brain Theory and Neural Networks, The MIT Press (1995) 895-898 7. Olshausen, B.A., Field, D.J.: Sparse coding of sensor inputs, Current Opinions in Neurobiology 14 (2004) 481-487 8. Haykin, S.: Neural Networks: A Comprehensive Foundation. Prentice Hall (1999) 9. Zurada, J.M.: Introduction to Artificial Neural Systems. West Publishing Company (1992) 10. Vogel, D.D., Boos, W.: Sparsely Connected, Hebbian Networks with Strikingly Large Storage Capacities. Neural Networks 10(4) (1997) 671-682 11. Vogel, D.D.: A Neural Network Model of Memory and Higher Cognitive Functions. Int J Psychophysiol 55 (1) (2005) 3-21 12. Starzyk, J.A., Fang, X.: A CMOS Current Mode Winner-Take-All Circuit with both Excitatory and Inhibitory Feedback. Electronics Letters 29 (10) (1993) 908-910 13. LeCun, Y., Cortes, C.: The MNIST Database of Handwritten Digits. http://yann.lecun.com/exdb/mnist/
Multi-Valued Cellular Neural Networks and Its Application for Associative Memory Zhong Zhang, Takuma Akiduki, Tetsuo Miyake, and Takashi Imamura Toyohashi University of Technology 1-1 Hibarigaoka, Tempaku-cho, Toyohashi 441-8580, Japan
[email protected] Abstract. This paper discusses the design of multi-valued output functions of Cellular Neural Networks (CNNs) implementing associative memories. The output function of the CNNs is a piecewise linear function which consists of a saturation and non-saturation range. The new structure of the output function is defined, and is called the “basic waveform”. The saturation ranges with n levels are generated by adding n − 1 basic waveforms. Consequently, creating an associative memory of multivalued patterns has been successful, and computer experiment results show the validity of the proposed method. The results of this research can expand the range of applications of CNNs as associative memories.
1
Introduction
Cellular Neural Networks (CNNs), proposed by Chua and Yang in 1988 [1,2], are one type of interconnected neural networks. CNNs consist of nonlinear elements that are called cells and each cell is connected to its neighborhood cells. The state of each cell changes in parallel based on a differential equation, and converges to an equilibrium state. Thus, CNNs can be designed to be associative memories by the dynamics of the cells [3] and have been applied in various fields, such as character recognition, medical diagnosis and machine failure detection system [4,5,6]. The purpose of our study is to create an abnormal diagnosis system which detects anomalous behavior in man-machine systems by pattern classification using the the CNN as an associative memory. To realize this system, it is important to have a wide variety of diagnosis patterns. In order to acquire accuracy of processing results, two methods are considered: the first which increases the number of cells, and the second which adds more output levels to each cell. However, the first method would decrease the computational efficiency due to the expansion of the CNN’s scale. On the other hand, the second method has the advantage that there is no need to expand the scale. The output levels of conventional CNNs is two-levels or three-levels. When the CNNs conduct abnormality diagnosis, Kanagawa et al. classified the results of blood tests into three states, and made them the patterns for diagnosis, which were either ”NORMAL”, ”LIGHT EXCESS” or ”HEAVY EXCESS”. Other CNNs which have multi-valued output functions for image processing have also been proposed in the past [7], but the evaluation of them as an associative storage medium having arbitrary output levels has not been conducted yet. D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 542–551, 2007. c Springer-Verlag Berlin Heidelberg 2007
In this paper, we discuss the design of multi-valued output functions of CNNs for associative memory. The output function of a CNN is a piecewise linear function which consists of a saturation and non-saturation range. We define a new structure of the output function, and compose a basic waveform using the piecewise linear function. The basic waveform creates the multiple ranges by adding itself together. Hence we can compose the multi-valued output function by adding the basic waveform. Our method’s effectiveness is evaluated by computer experiment using random patterns.
Fig. 1. A Cellular Neural Network and a corresponding example associative memory. Each cell is connected to its r-neighborhood cells.
2
Cellular Neural Networks
Figure 1 shows a Cellular Neural Network with its memory patterns. In this section, the definition of “cells” in the Cellular Neural Network and its dynamics are described first, and second, the design method for CNNs as an associative memory is described. 2.1
Dynamics of Cells
We first consider the following CNN, which is composed of an M × N array of nonlinear elements that are called cells. The dynamics of cell C(i, j) in the ith row and jth column is expressed as follows:
\[ \dot{x}_{ij} = -x_{ij} + T_{ij} * y_{ij} + I_{ij}, \qquad y_{ij} = \mathrm{sat}(x_{ij}) \qquad (1 \le i \le M,\ 1 \le j \le N) \tag{1} \]
where x_{ij} and y_{ij} represent the state variable and the output variable, respectively, while T_{ij} represents the matrix of coupling coefficients, I_{ij} represents the threshold, and * is the composition operator. When the (i, j) cell is influenced
Fig. 2. Piecewise linear output characteristic
from neighborhood cells r-units away (shown in Figure 1), T_{ij} * y_{ij} is expressed as in the following equation:
\[ T_{ij} * y_{ij} = \sum_{k=-r}^{r} \sum_{l=-r}^{r} t_{ij(k,l)}\, y_{i+k,j+l}, \tag{2} \]
The output of each cell y_{ij} is given by the piecewise linear function of the state x_{ij}, and when the output level is binary, the function is expressed as in the following equation:
\[ y_{ij} = \tfrac{1}{2}\left( |x_{ij} + 1| - |x_{ij} - 1| \right). \tag{3} \]
This output function has two saturated levels (as depicted in Figure 2).
2.2 Design of the CNN for Associative Memory
When we express the differential equation of each cell given in Eq.(1) in vector notation, two-dimensional CNNs having M rows and N columns are represented by the following:
\[ \dot{x} = -x + T y + I, \qquad y = \mathrm{sat}(x) \tag{4} \]
where m = MN, x = (x_{11}, x_{12}, ..., x_{1N}, x_{21}, x_{22}, ..., x_{MN})^T = (x_1, x_2, ..., x_k, ..., x_m)^T, y = (y_{11}, y_{12}, ..., y_{1N}, y_{21}, y_{22}, ..., y_{MN})^T = (y_1, y_2, ..., y_k, ..., y_m)^T, and I = (I_{11}, I_{12}, ..., I_{1N}, I_{21}, I_{22}, ..., I_{MN})^T = (I_1, I_2, ..., I_k, ..., I_m)^T.
The matrix T = [T_{ij}] ∈ ℝ^{m×m} is the template matrix, composed of row vectors whose elements are zero when the corresponding cells have no connections. The state vectors to be memorized by the CNN correspond to the stable equilibrium points of the system of differential equations in Eq.(4). Here, Eq.(4), which is a system of interval linear equations, has a number of asymptotically stable equilibrium points. We can make the network memorize patterns by making the patterns correspond to the asymptotically stable equilibrium points. Following Liu and Michel [3], we are given q vectors α^1, α^2, ..., α^q ∈ {x ∈ ℝ^m : x_i = 1 or −1, i = 1, ..., m} which are to be stored as reachable memory vectors for the CNN, and we then assume vectors β^1, β^2, ..., β^q such that
\[ \beta^i = k\alpha^i, \tag{5} \]
where the real number k is an equilibrium point arrangement coefficient and the β^i (i = 1, ..., q) are asymptotically stable equilibrium points of the cells. It is evident that the output vectors are the α^i. Therefore, the CNN designed to have α^1, α^2, ..., α^q as memory vectors has a template matrix T and a threshold vector I which satisfy the following equations simultaneously:
\[ -\beta^i + T\alpha^i + I = 0, \qquad i = 1, 2, \ldots, q. \tag{6} \]
Here we set the following matrices:
\[ Y = (\alpha^1 - \alpha^q,\ \alpha^2 - \alpha^q,\ \ldots,\ \alpha^{q-1} - \alpha^q), \qquad Z = (\beta^1 - \beta^q,\ \beta^2 - \beta^q,\ \ldots,\ \beta^{q-1} - \beta^q). \]
We have
\[ Z = T Y, \tag{7} \]
\[ I = \beta^q - T\alpha^q. \tag{8} \]
3
Multivalued Function for the CNN
In this section, we propose a design method of the multi-valued output function for associative memory CNNs. We first introduce some notation which shows how to relate Eq.(3) to the multi-valued output function. The output function of Eq.(3) consists of a saturation and non-saturation range. We define the structure of the output function such that the length of the non-saturation range is L, the
546
Z. Zhang et al.
length of the saturation range is cL, and the saturated level is |y| = H which is a positive integer (refer to Figure 3). Moreover, we assume equilibrium points which are |xe | = kH. Here, the Eq.(3) can be rewritten as follows: y=
H L L (|x + | − |x − |). L 2 2
(9)
Then, the equilibrium point arrangement coefficient is expressed as k = ( L2 + cL)/H by the above-mentioned definition. When H = 1, L = 2, c > 0, Eq.(9) is equal to Eq.(3). We will call the waveform of Figure 3 (a) a “basic waveform”. Next we give the theorem for designing the output function. Theorem 1. Both L > 0 and c > 0 are necessary conditions for convergence to an equilibrium point. Proof. We consider the cell model Eq. (1) where r = 0, I = 0. The cell behaves according to the following differential equation: x˙ = −x + ky.
(10)
y=sat(x) H
cL
-kH
L
x 0
kH
cL
-H
(a)
(b) Fig. 3. Design procedure of the multivalued output function. (a) shows a basic waveform, (b) shows the multivalued output function which is formed from (a).
Multi-Valued CNNs and Its Application for Associative Memory
In the range of |x| < L2 , the output value of a cell is y = (a)). Eq. (10) is expressed by the following: x˙ = −x + k
2H , L
2H L x
547
(refer to Figure 3
(11)
The solution of the equation is: x(t) = x0 e(
2kH L
−1)t
,
(12)
where x0 is an initial value at t = 0. The exponent in Eq. (12) must be 2kH L −1 > 0 for transiting from one state in the non-saturation range to a state in the saturation range. Here, by the above-mentioned definition, the equilibrium point arrangement coefficient is expressed as: 1 L k = (c + ) , (13) 2 H Therefore, parameter conditions c > 0 can be obtained from the Eqs. (12) and (13). In the range of L ≤ |x| ≤ kH, the output value of a cell is y = ±H. Then Eq. (10) is expressed by the following: x˙ = −x ± kH,
(14)
x(t) = ±kH + (x0 ∓ kH)e−t .
(15)
The solution of the equation is: When t → ∞, Eq. (15) proves to be xe = ±kH which is not L = 0 in Eq. (13). The following expression is derived from the above: L > 0 ∧ c > 0.
(16)
Secondly, we give the method of constructing the multi-valued output function based on the basic waveform. The saturation ranges with n levels are generated by adding n−1 basic waveforms. Therefore, the n-valued output function sat_n(·) is expressed as follows:
\[ \mathrm{sat}_n(x) = \frac{H}{(n-1)L} \sum_i (-1)^i \left( |x + A_i| - |x - A_i| \right), \tag{17} \]
where
\[ A_i = \begin{cases} A_{i-1} + 2cL & (i:\ \text{odd}) \\ A_{i-1} + L & (i:\ \text{even}) \end{cases} \]
However, i and k are defined as follows: for n odd, i = 0, 1, ..., n−2, A_0 = L/2, and k = (n−1)(c + 1/2) L/H; for n even, i = 1, 2, ..., n−1, A_1 = cL, and k = (n−1)(2c + 1) L/(2H). Figure 4 shows the output waveforms which result from Eq. (17). The results demonstrate the validity of the proposed method, because the saturation ranges of the n levels have been made in the n-valued output function sat_n(·).
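As an illustration of the idea of stacking basic waveforms, the sketch below builds an n-level output function as a sum of n−1 shifted saturating ramps with overall saturation ±H; the spacing of the ramp centres and the normalisation are assumptions chosen for clarity and may differ from the exact parameterisation of Eq. (17).

```python
import numpy as np

def sat_n(x, n, L=0.5, c=1.0, H=1.0):
    """n-level piecewise-linear output function built from n-1 basic
    waveforms (illustrative construction, not necessarily Eq. (17) verbatim).
    Each waveform is a ramp of width L followed by saturation, and the ramps
    are spaced so that plateaus separate consecutive levels."""
    x = np.asarray(x, dtype=float)
    amp = H / (n - 1)                        # each waveform saturates at +/- amp
    spacing = L + c * L                      # assumed ramp-to-ramp distance
    centers = (np.arange(n - 1) - (n - 2) / 2.0) * spacing
    y = np.zeros_like(x)
    for m in centers:
        y += amp * (np.abs(x - m + L / 2) - np.abs(x - m - L / 2)) / L
    return y                                 # takes n plateau values in [-H, H]
```

For n = 2 this reduces to Eq. (9), which is one way to check that the construction is consistent with the two-level case.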
Fig. 4. The output waveforms of the saturation function. (a), (b), (c), and (d) represent, respectively sat2 , sat3 , sat4 and sat5 . Here, the parameters of the multivalued function are set to L = 0.5, c = 1.0.
4
Computer Experimentation
In this section, a computer experiment is conducted using numerical software in order to show the effectiveness of the proposed method.
Fig. 5. The memory patterns for the computer experiment. These random patterns of 5 rows and 5 columns have elements of {−2, −1, 0, 1, 2}, and are used for creation of the associative memory.
4.1 Experimental Procedure
For this memory recall experiment, the desired patterns to be memorized are fed into the CNN, which are then associated by the CNN. In this experiment, we use random patterns which have five values for generalizing the result as memory patterns. To test recall, noise is added to the patterns shown in Figure 5 and the
(c) Fig. 6. Results of the computer experiments. Each figure shows (a) the result when L = 0.1, (b) the result when L = 0.5, and (c) the result when L = 1.0.
resulting patterns are used as initial patterns. The initial patterns are represented as follows:
\[ x_0 = k\alpha^i + \varepsilon, \tag{18} \]
where α^i ∈ {x ∈ ℝ^m : x_i = −H, −H/2, 0, H/2, or H, i = 1, ..., m}, and ε ∈ ℝ^m is a noise vector drawn from the normal distribution N(0, σ²). These initial patterns are presented to the CNN, and the output is evaluated to see whether the memorized patterns can be recalled correctly. Then, the number of correct recalls is converted into a recall probability, which is used as the CNN's performance measure. The parameter L of the output function is set in turn to L = 0.1, L = 0.5, and L = 1.0, and parameter c is changed in steps of 0.5 in the range of 0 to 10. Moreover, the noise level is a constant σ = 1.0, and the experiments are repeated for 100 trials at each parameter combination (L, c).
4.2 Experimental Results
Figure 6 shows the results of the experiments. Each figure shows the relationship between the parameter c and both time and the recall probability. The horizontal axis is parameter c and the vertical axes are the mean recall rate (the mean recall probability;%) and mean recall time (measured in time steps). As can be seen in the experiment results, the recall rate increases as parameter c increases. The reason is that c is the parameter which determines the size of a convergence range. Therefore, the mean recall rate improves by increasing c. On the other hand, if the length L of the non-saturated range is short, convergence to the right equilibrium point becomes difficult because the distance between equilibrium points is small. Additionally, as shown in Figure 6 (a), the mean recall rate is lower than in Figures 6 (b),(c). Therefore, the length of the saturation range and the non-saturation range needs to be set at a suitable ratio. Moreover, in order for each cell to converge to the equilibrium points, both c > 0 and L > 0 must hold.
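The recall experiment itself can be summarised by a short Monte-Carlo loop. The sketch below (forward-Euler integration, a hypothetical `sat` callable, and an assumed comparison tolerance) only illustrates the procedure of Eq. (18) followed by relaxation of the dynamics (4) and a comparison with the stored pattern; it is not the numerical software used by the authors.

```python
import numpy as np

def recall_rate(T, I, alphas, k, sat, sigma=1.0, trials=100, dt=0.1, steps=400):
    """Monte-Carlo estimate of the recall probability (illustrative sketch)."""
    rng = np.random.default_rng(0)
    alphas = np.asarray(alphas, dtype=float)
    correct = 0
    for _ in range(trials):
        a = alphas[rng.integers(len(alphas))]
        x = k * a + rng.normal(0.0, sigma, size=a.shape)   # noisy initial state, Eq. (18)
        for _ in range(steps):
            x += dt * (-x + T @ sat(x) + I)                # forward Euler on Eq. (4)
        if np.allclose(sat(x), a, atol=1e-3):              # did the output settle on the memory?
            correct += 1
    return correct / trials
```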
5
Conclusions
In this paper, we proposed a novel design method of the multi-valued output function for CNNs as an associative memory, and conducted computer experiment with five-valued random patterns. Consequently, memorization of the multi-valued patterns has been successful, and the results showed the validity of our method. The method requires only two parameters, L, and c. These two parameters must be L > 0, c > 0, because the length of the saturation and non-saturation range is required for allocating equilibrium points. When noise is added to the initial pattern, the parameters affect the recall probability and recall time. Therefore, the optimal value of the parameters changes according to the noise level. Future research will focus on creating a multi-valued output function of more than five values, and on evaluating its performance with the CNN. Moreover, we will apply the CNNs in an abnormality detection system.
Multi-Valued CNNs and Its Application for Associative Memory
551
References 1. Chua, L.O., Yang, L.: Cellular Neural Networks: Theory. IEEE Trans. Circuits Syst. 35(10) (1988) 1257-1272 2. Chua, L.O., Yang L.: Cellular Neural Networks: Application. IEEE Trans. Circuits Syst. 35(10) (1988) 1273-1290 3. Liu, D., Michel, A.N.: Cellular Neural Networks for Associative Memories. EEE Trans. Circuits Syst. 40(2) (1993) 119-121 4. Zhang, Z., Namba, M., Kawabata, H.: Cellular Neural Networks and Its Application for Abnormal Detection. T.SICE 39(3) (2003) 209-217 5. Tetzlaff, R.(Eds.): Celular Neural Networks and Their Applications. World Scientific (2002) 6. Kanagawa, A., Kawabata, H., Takahashi, H.: Cellular Neural Networks with Multiple-Valued Output and Its Application. IEICE Trans. E79-A(10) (1996) 16581663 7. Yokosawa, K., Nakaguchi, T., Tanji, Y., Tanaka, M.: Cellular Neural Networks With Output Function Having Multiple Constant Regions. IEEE Trans. Circuits Syst. 50(7) (2003) 847-857
Emergence of Topographic Cortical Maps in a Parameterless Local Competition Network A. Ravishankar Rao, Guillermo Cecchi, Charles Peck, and James Kozloski IBM T.J. Watson Research Center Yorktown Heights, NY 10598, USA
[email protected],
[email protected]
Abstract. A major research problem in the area of unsupervised learning is the understanding of neuronal selectivity, and its role in the formation of cortical maps. Kohonen devised a self-organizing map algorithm to investigate this problem, which achieved partial success in replicating biological observations. However, a problem in using Kohonen’s approach is that it does not address the stability-plasticity dilemma, as the learning rate decreases monotonically. In this paper, we propose a solution to cortical map formation which tackles the stability-plasticity problem, where the map maintains stability while enabling plasticity in the presence of changing input statistics. We adapt the parameterless SOM (Berglund and Sitte 2006) and also modify Kohonen’s original approach to allow local competition in a larger cortex, where multiple winners can exist. The learning rate and neighborhood size of the modified Kohonen’s method are set automatically based on the error between the local winner’s weight vector and its input. We used input images consisting of lines of random orientation to train the system in an unsupervised manner. Our model shows large scale topographic organization of orientation across the cortex, which compares favorably with cortical maps measured in visual area V1 in primates. Furthermore, we demonstrate the plasticity of this map by showing that the map reorganizes when the input statistics are chanaged.
1
Introduction
A major research problem in the area of unsupervised learning in neural networks is the understanding of neuronal selectivity and the formation of cortical maps [2][pg. 293]. In the vertebrate brain, in areas such as the visual cortex, individual neurons have been found to be selective for different visual cues such as ocular dominance and orientation [10]. Furthermore, these selective neurons are arranged in an orderly 2D fashion known as a cortical map [2][pg. 293], and such maps have been observed and extensively studied in the primate cortex [4]. A natural question is to ask how such maps are formed, and what are the underlying computational processes at work. Understanding cortical map formation is a central problem in computational neuroscience, and impacts our ability to D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 552–561, 2007. c Springer-Verlag Berlin Heidelberg 2007
understand processes operating across the entire brain. Since the visual cortex is the best studied, we will restrict our attention to visual phenomena, and the formation of orientation maps in particular [4]. We pose the following requirements that a computational model of cortical map formation should satisfy. – The model should use biologically realistic visual inputs as stimuli rather than abstract variables representing features of interest such as orientation or ocular dominance. The cortical units in the model should learn their synaptic weights. – The model should exhibit the formation of stable maps that resemble experimental measurements such as those made in [4]. Some features of observed cortical orientation maps include pinwheels, fractures and linear zones. – The model should be able to address the stability-plasticity dilemma. Though the cortical maps are stable, they retain plasticity when the statistics of the input space are changed, such as through a change in cortical connectivity or input statistics [1,11]. This allows the cortical map to faithfully represent the external world. – The model should involve as little parameterization as possible. This requirement allows the model to be widely applicable under different conditions, such as different input spaces and different sizes of the cortical maps. Many computational theories have been developed to explain the formation of such cortical maps [9], especially the formation of orientation columns. However, no single model appears to satisfactorily meet the requirements described above. For instance, Carreira-Perpinan et al [5] use abstract variables for input, such as orientation and frequency, which are not derived from real images. Miikkulainen et al [3] describe a method for self-organization to obtain orientation columns in a simulated patch cortex. However, their method requires significant parameterization and the use of a carefully applied schedule [7]. The main contribution of this paper is to demonstrate how Kohonen’s selforganizing map (SOM) algorithm can be modified to employ only local competition, and then combined with a recently published technique to eliminate the traditional parameterization required [8]. This combination is novel, and achieves the formation of realistic cortical orientation maps with inputs consisting of visual images of randomly oriented lines. Furthermore, the cortical map is plastic, as we demonstrate by changing the statistics of the input space multiple times, by varying the statistical distribution of orientations. If the input statistics are constant, the map converges to a stable representation, as defined by an error measure. This effectively addresses the stability-plasticity problem. Our model is computationally simple, and its behavior is intuitive and easy to understand and verify. Due to these reasons, it meets all the imposed requirements, and hence should prove to be a useful technique to practitioners in computational neuroscience.
2 Background
Kohonen’s self-organizing map (SOM) has been widely used in a number of domains [6]. An area where it has had considerable impact is that in computational neuroscience, in the modeling of the formation of cortical maps [9,7]. The traditional Kohonen SOM requires the use of a schedule to gradually reduce the neighborhood size over which weight updates are applied, and to reduce the learning rate. This requires careful modification of these key parameters over the course of operation of the algorithm. For instance, Bednar has shown the formation of cortical orientation maps through the use of a rigid schedule [7]. Recently, Berglund and Sitte [8] presented a technique for automatically selecting the neighborhood size and learning rate based on a measure of the error of fit. Though they did not state it, it appears quite plausible that such a computation can be carried out by the cortex, as it is a local computation. All that is required is that the error between the weight vector (synaptic weights) and the input vector be computed. This allows the neuron to adjust its learning rate over time. The role of inputs is critical in the process of self-organization. Hubel et al [10] showed that rather than being genetically predetermined, the structure of cortical visual area V1 undergoes changes depending on the animal’s visual experience, especially during the critical period of development. Sharma et al [13] showed that rewiring the retinal output to the auditory cortex instead of the visual cortex resulted in the formation of orientation-selective columns in the auditory cortex. It is thus likely that the same self-organization process is taking place in different areas of the cortex. The nature of the cortical maps then becomes a function of the inputs received. In order to demonstrate this cortical plasticity, we have created a computational model that responds to changing input space statistics. Certain classes of inputs are sufficient to model V1. For instance, Bednar [7] used input stimuli consisting of elongated Gaussian blobs. Other researchers have used natural images [12] as inputs to self-organizing algorithms. In this paper, we use sine-wave gratings of random orientation for the sake of simplicity, and to demonstrate the essential characteristics of our solution.
3
Experimental Methods
We model the visual pathway from the retina to the cortex as shown in Figure 1. The retina projects to the lateral geniculate nucleus (LGN), which in turn projects to the cortex. There are two channels in the LGN, which perform on-center and off-center processing of the visual input. The cortical units are interconnected through a lateral network which is responsible for spreading the weights of the winner. 3.1
Algorithm for Weight Updates
A significant contribution of this paper is to provide a natural extension of Kohonen’s algorithm to allow local competition in a larger cortex, such that multiple
Fig. 1. Illustrating the network connectivity. (A) The input units are arranged in a twodimensional grid, and can be thought of as image intensity values. The cortical units also form a 2D grid. Each input unit projects via the LGN in a feedforward topographic manner to the cortical grid. (B) shows the lateral connectivity in the cortex.
winners are possible. In the traditional Kohonen algorithm, the output layer is fully connected, and all the output units receive the same input. There is only one global winner in this case. We have modified the algorithm such that there is limited connectivity between output units, and each output unit receives input from a restricted area of the retinal input. This allows the possibility of multiple winners in the output layer. Learning is driven by winners in local neighborhoods, determined by the extent of lateral connectivity. A simple Hebbian rule is used to update synaptic weights. The basic operation of the network is as follows. Let X1 denote the input vector from the on-center LGN channel and X2 the input vector from the off-center LGN channel to a cortical unit. Each cortical unit receives projections from only a restricted portion of the LGN. Let w1ij denote a synaptic weight, which represents the strength of the connection between the ith on-center LGN unit and the jth unit in the cortex. Similarly, w2ij represents weights between the off-center LGN and cortical units. The output yj of the jth cortical unit is given by
\[ y_j = \sum_{i \in L_j} w_{1ij} X_{1i} + \sum_{i \in L_j} w_{2ij} X_{2i}. \tag{1} \]
Here the cortical unit combines the responses from the two LGN channels, and Lj is the neighborhood of LGN units that project to this jth cortical unit. The next step is for each cortical unit to determine whether it is a winner within its local neighborhood. Let Nj denote the local cortical neighborhood of the jth cortical unit (which excludes the jth unit). Let m index the cortical units within Nj. Thus, unit j is a local winner if
\[ y_j > y_m \quad \forall m \in N_j. \tag{2} \]
This is a local computation for a given cortical unit. Once the local winners are determined, their weights are updated to move them closer to the input vector. If cortical unit j is the winner, the update rule is
\[ w_{1ij} \leftarrow w_{1ij} + \mu \left( X_{1i} - w_{1ij} \right), \tag{3} \]
where i indexes those input units that are connected to the cortical unit j, and μ is the learning rate. μ is typically set to a small value, so that the weights are incrementally updated over a large set of input presentations. A similar rule is used to learn w2ij. In addition, the weights of the cortical units within the neighborhood Nj, denoted by the index m, are also updated to move closer to their inputs, but with a weighting function f(d(j, m)), where d(j, m) is the distance from the unit m to the local winner j. This is given by
\[ w_{1im} \leftarrow w_{1im} + f(d(j, m))\, \mu \left[ X_{1i} - w_{1im} \right]. \tag{4} \]
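A compact sketch of Eqs. (1)-(4), including the weight normalisation mentioned in the next paragraph, is given here; the data structures (index lists `L_idx`, `N_idx`, a distance function `dist` and a neighbourhood profile `f`) are assumptions introduced for illustration and are not the authors' code.

```python
import numpy as np

def cortical_update(X1, X2, W1, W2, L_idx, N_idx, dist, f, mu):
    """One presentation of the locally competitive map (sketch of Eqs. (1)-(4)).

    W1, W2 : afferent weights from the on-/off-centre LGN channels (2-D arrays).
    L_idx[j] : LGN indices feeding cortical unit j.
    N_idx[j] : lateral neighbourhood of unit j (excluding j itself).
    """
    m = len(L_idx)
    y = np.array([W1[j, L_idx[j]] @ X1[L_idx[j]] +
                  W2[j, L_idx[j]] @ X2[L_idx[j]] for j in range(m)])      # Eq. (1)
    winners = [j for j in range(m) if np.all(y[j] > y[N_idx[j]])]          # Eq. (2)
    for j in winners:
        updates = [(j, 1.0)] + [(v, f(dist(j, v))) for v in N_idx[j]]
        for u, g in updates:
            i = L_idx[u]
            W1[u, i] += g * mu * (X1[i] - W1[u, i])                        # Eqs. (3)-(4)
            W2[u, i] += g * mu * (X2[i] - W2[u, i])
            norm = np.linalg.norm(np.concatenate([W1[u, i], W2[u, i]]))
            if norm > 0:                                                   # normalise incident weights
                W1[u, i] /= norm
                W2[u, i] /= norm
    return y, winners
```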
Finally, the incident weights at each cortical unit are normalized. The cortical dynamics and learning are thus based on Kohonen's algorithm. Typically, the size of the neighborhood Nj and the learning rate μ are gradually decreased according to a schedule such that the resulting solution is stable. However, this poses a problem in that the cortex cannot remain plastic, as the learning rate and neighborhood size for the weight updates may become very small over time. One of the novel contributions of this paper is to solve this stability-plasticity dilemma in cortical maps through an adaptation of the parameterless SOM technique of Berglund and Sitte [8]. Their formulation called for the adjustment of parameters based on a normalized error measure between the winner's weight vector and the input vector. We modify their formulation as follows. First, since there can be multiple local winners in the cortex, we compute an average error measure. Second, we use temporal smoothing based on a trace learning mechanism. This ensures that the learning rate is varied smoothly. There appears to be biological support for trace learning, as pointed out by Wallis [14]. Let ε_n(i) denote the error measure at the ith cortical winner out of a total of M_n winners at the nth iteration:
\[ \epsilon_n(i) = \| W_{1i} - X_{1i} \| + \| W_{2i} - X_{2i} \|, \tag{5} \]
where ‖·‖ denotes the L2 norm. The average error measure is defined as follows:
\[ \bar{\epsilon}(n) = \frac{\sum_i \epsilon_n(i)}{M_n}. \tag{6} \]
Let
\[ r(n) = \max\left( \bar{\epsilon}(n),\, r(n-1) \right), \tag{7} \]
where
\[ r(0) = \bar{\epsilon}(0). \tag{8} \]
The normalized average error measure is then defined to be
\[ \tilde{\epsilon}(n) = \frac{\bar{\epsilon}(n)}{r(n)}. \tag{9} \]
The time-averaged error measure η(n) is defined by the following trace equation:
\[ \eta(n) = \kappa\, \tilde{\epsilon}(n) + (1-\kappa)\, \eta(n-1). \tag{10} \]
We used κ = 0.05 in our simulation. The learning rate μ and neighborhood size N were varied as follows:
\[ \mu = \mu_0\, \eta(n); \qquad N = N_0\, \eta(n), \tag{11} \]
where μ0 = 0.05 and N0 = 15. The rationale behind Equation 11 is that the learning rate and neighborhood size decrease as the error between the winners' weight vectors and input vectors decreases, which happens while a stable representation of an input space is being learnt. If the input space statistics change, the error increases, causing the learning rate and neighborhood size to increase. This allows learning of the new input space. This effectively solves the stability-plasticity dilemma.
3.2 Network Configuration
We used an input layer consisting of 30x30 retinal units. The images incident on this simulated retina consisted of sinusoidal gratings of random orientation and phase. The LGN was the same size as the retina. A radius of r = 9 was used to generate a topographic mapping from the LGN into the cortex. We modeled the cortex with an array consisting of 30x30 units. The intra-cortical connectivity was initialized with a radius of rCC = 15. For the weight updates, the function f was chosen to be a Gaussian that tapers to approximately zero at the boundary of the local neighborhood, i.e., at rCC. The learning rules in Section 3.1 were applied to learn the afferent weights. The learning rate μ was set as in Equation 11. The entire simulation consisted of 100,000 iterations. In order to test cortical plasticity, we varied the statistics of the input space as follows. For the first 33,000 iterations, we used sinusoidal gratings of random orientation. From iteration 33,000 to 66,000 we changed the inputs to be purely horizontal. Then from iteration 66,000 to 100,000 we changed the inputs back to gratings of random orientation. In order to contrast the behavior of the parameterless SOM, we also show the results of running the same learning algorithm with a modified Kohonen algorithm that allows local winners and follows a fixed schedule which uses exponentially decaying learning rates and neighborhood sizes.
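The adaptive schedule of Eqs. (5)-(11) can be packaged in a few lines. The sketch below assumes an initial value of η(0) = 1 (not specified in the text) and otherwise follows the equations directly; class and method names are illustrative.

```python
import numpy as np

class ParameterlessSchedule:
    """Adaptive learning rate and neighbourhood size following Eqs. (5)-(11).

    Feed it the per-winner errors ||W1 - X1|| + ||W2 - X2|| of the current
    presentation; it returns (mu, N) for the next weight update."""
    def __init__(self, mu0=0.05, N0=15, kappa=0.05):
        self.mu0, self.N0, self.kappa = mu0, N0, kappa
        self.r = None          # running maximum r(n), Eqs. (7)-(8)
        self.eta = 1.0         # trace eta(n), assumed to start at 1

    def step(self, winner_errors):
        eps = float(np.mean(winner_errors))                         # Eq. (6)
        self.r = eps if self.r is None else max(eps, self.r)        # Eqs. (7)-(8)
        eps_norm = eps / self.r if self.r > 0 else 0.0              # Eq. (9)
        self.eta = self.kappa * eps_norm + (1 - self.kappa) * self.eta  # Eq. (10)
        return self.mu0 * self.eta, self.N0 * self.eta              # Eq. (11)
```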
4 Experimental Results
We present the results in the form of a map of the receptive fields of cortical units. The receptive field is shown as a grayscale image of the weight matrix incident on each cortical unit. In order to save space, we show the weight matrices connecting only the on-center LGN channels to the cortex. (The weight matrices of the offcenter LGN channels appear as inverse images of the on-center channels). Figures 2 - 4 show demonstrate that the modified parameterless SOM exhibits plasticity in accommodating changing input statistics, whereas a scheduled SOM is non-plastic.
Fig. 2. Map of receptive fields for each cortical unit. Only the on-center LGN channel is shown. This is the result after 33,000 presentations of sinuoisoidal gratings of random orientation. Note that the receptive fields show typical organization that is seen in biologically measured cortical orientation maps [4]. Features that are present in this map are pinwheels, fractures and linear zones. (A) Shows the modified parameterless SOM. (B) Shows the map with a traditional schedule.
Figure 5 shows how the error measure generally decreases as the iteration number increases. As can be seen, the error measure suddenly increases when the input statistics are changed. This causes an increase in the learning rate and the size of the neighborhood. The map eventually settles to a stable configuration at 100,000 iterations (when the simulation was terminated) as the error measure becomes small. Thus we have demonstrated the stability of the cortical map through an error measure which decreases.
We note that the input disturbances are introduced before the maps have converged, as these two factors are independent of each other. In other words, the input disturbance does not have any knowledge of the configuration or stability of the map.
Fig. 3. Map of receptive fields for each cortical unit at 66,000 iterations. This shows the cortical map after the input statistics were changed at iteration number 33000, such that only horizontal lines were presented. (A) With the modified parameterless SOM, the receptive fields of the cortical units are purely horizontal now, reflecting an adaptation to the input space. (B) However, the traditional SOM with a schedule fails to adapt to the new input space. Very few receptive fields have changed to represent horizontal lines.
Fig. 4. Map of receptive fields for each cortical unit at 100,000 iterations. The input statistics were changed again at 66000 iterations to create lines of random orientation. (A) With the modified parameterless SOM, the receptive fields of the cortex now contain lines of all orientations in a characteristic pattern as observed in Figure 2. (B) The traditional SOM following a schedule continues to retain its original properties, and does not exhibit plasticity.
Fig. 5. A plot of the logarithm of the error measure ln() as a function of the number of iterations. The input statistics are changed twice, indicated by the arrow marks.
5
Conclusions
In this paper, we developed a systematic approach to modeling cortical map formation in the visual cortex. We presented a solution that satisfies the following key requirements: self-organization is driven by visual image input; the cortical map converges to a stable representation, and yet exhibits plasticity to accommodate changes in input statistics. Furthermore, our computational approach is simple and involves minimal parameterization, which lends itself to easy experimentation. Our solution is based on modifying the traditional Kohonen SOM to use localized lateral connectivity that results in local winners, and to use the parameterless SOM [8] to solve the stability-plasticity problem. This combination of techniques is novel in the literature. We demonstrated the power of our solution by varying the input statistics multiple times. Each time, the cortical map exhibited the desired plasticity, and converged to a stable representation of the input space. The significance of this result is that it shows how a modified Kohonen SOM can be used to explain the dual phenomena of cortical map formation and cortical plasticity. By bringing together these two capabilities in a simple model, we pave the way for more complex models of cortical function involving multiple maps.
References 1. Carpenter, G.A., Grossberg, S.: The ART of Adaptive Pattern Recognition by a Self-organizing Neural Network. Computer 21(3) (1988) 77-88 2. Dayan, P., Abbott, L.F.: Theoretical Neuroscience: Computational and Mathematical Modeling of Neural Systems. MIT Press, Cambridge, MA (2001)
Emergence of Topographic Cortical Maps
561
3. Miikkulainen, R., Bednar, J.A., Choe, Y., Sirosh, J.: Computational Maps in the Visual Cortex. Springer, Berlin (2005) 4. Obermayer, K., Blasdel, G.: Geometry of Orientation and Ocular Dominance Columns in Monkey Striate Cortex. J. Neuroscience 13 (1993) 4114-4129 5. Carreira-Perpinan, M.A., Lister, R.J., Goodhill, G.J., A Computational Model for the Development of Multiple Maps in Primary Visual Cortex Cerebral Cortex 15 (2005) 1222-1233 6. Kohonen, T.: The Self-organizing Map. Proceedings of the IEEE 78(9) (1990) 1464-1480 7. Bednar, J.A.: Learning to See: Genetic and Environmental Influences on Visual Development. PhD thesis, Department of Computer Sciences, The University of Texas at Austin (2002) Technical Report AI-TR-02-294 8. Berglund, E., Sitte, J.: The Parameterless Self-organizing Map Algorithm. IEEE Trans. Neural Networks 17(3) (2006) 305-316 9. Erwin, E., Obermayer, K., Schulten, K.: Models of Orientation and Ocular Dominance Columns in the Visual Cortex: A Critical Comparison. Neural Computation 7(3) (1995) 425-468 10. Hubel, D.H., Wiesel, T.N., Levay, S.: Plasticity of Ocular Dominance Columns in Monkey Striate Cortex. Phil. Trans. R. Soc. Lond. B 278 (1977) 377-409 11. Buonomano, D.V., Merzenich, M.M.: Cortical Plasticity: From Synapses to Maps. Annual Review of Neuroscience 21 (1998) 149-186 12. Hyv¨ arinen, A., Hoyer, P.O., Hurri, J.: Extensions of ICA as Models of Natural Images and Visual Processing. Nara, Japan (2003) 963–974 13. Sharma, J., Angelucci, A., Sur, M.: Induction of Visual Orientation Modules in Auditory Cortex. Nature 404 (2000) 841-847 14. Wallis, G.: Using Spatio-temporal Correlations to Learn Invariant Object Recognition. Neural Networks (1996) 1513-1519
Graph Matching Recombination for Evolving Neural Networks Ashique Mahmood, Sadia Sharmin, Debjanee Barua, and Md. Monirul Islam Bangladesh University of Engineering and Technology, Department of Computer Science and Engineering, Dhaka, Bangladesh
[email protected], {aumi buet,rakhee buet}@yahoo.com,
[email protected] http://www.buet.ac.bd/cse
Abstract. This paper presents a new evolutionary system using genetic algorithm for evolving artificial neural networks (ANNs). Existing genetic algorithms (GAs) for evolving ANNs suffer from the permutation problem. Frequent and abrupt recombination in GAs also have very detrimental effect on the quality of offspring. On the other hand, Evolutionary Programming (EP) does not use recombination operator entirely. Proposed algorithm introduces a recombination operator using graph matching technique to adapt structure of ANNs dynamically and to avoid permutation problem. The complete algorithm is designed to avoid frequent recombination and reduce behavioral disruption between parents and offspring. The evolutionary system is implemented and applied to three medical diagnosis problems - breast cancer, diabetes and thyroid. The experimental results show that the system can dynamically evolve compact structures of ANNs, showing competitiveness in performance.
1
Introduction
Stand-alone weight learning of artificial neural networks (ANNs) with fixed structures is not adequate for many real-world problems. The success of solving problems by ANNs largely depends on their structures. A fixed structure may not contain the optimal solution in its search space. Therefore, learning only the weights may result into a solution that is convergent to a local optimum. On the other hand, devising an algorithm that searches an optimal structure for a given problem is a very challenging task. Many consequent problems have arisen [1],[2] in constructing ANNs, many of which are yet unresolved. Thus designing an optimal structure remains a challenge to the ANN researchers for decades. Genetic algorithms (GAs) [3] and evolutionary programming (EPs) [4] both have been applied for evolving ANNs. GA-based approaches rely on dual representation [5], one (phenotypic) for applying training algorithm like Back Propagation (BP) for weight adaptation, another (genotypic) for structural evolution. This duality introduces a deceptive mapping problem between genotype and D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 562–568, 2007. c Springer-Verlag Berlin Heidelberg 2007
phenotype, namely the permutation problem or many-to-one mapping problem [2]. This problem remains unresolved. Moreover, frequent and abrupt recombination between parents in GA processes has a very detrimental effect on the quality of offspring. Frequent recombination breaks ANN structures before they are trained to maturity, which is essential before evaluating them. Abrupt recombination between ANN structures drastically affects the already built-up performance gain and makes it difficult to rebuild. Many recent efforts are EP approaches [6]. EP-based approaches do not use dual representation, thus avoiding the permutation problem, and rely only on statistical perturbation as the sole reproductive operator. As this statistical perturbation is mutation in nature, the direction of evolution extracted from the recombination operation in GA-based approaches is absent here. This paper presents a GA-based approach with a permutation-problem-free recombination operator for evolving ANNs. Here, a genotypic representation and its accompanying recombination operator are introduced. To make the operator free of the permutation problem, it uses a graph matching technique between two parent ANNs. Moreover, the GA process is designed to avoid the detrimental effects of frequent and abrupt recombination found in traditional GA-based approaches.
2
Evolution of ANNs Using Graph Matching Recombination
The proposed algorithm is a GA-based approach for the evolution of ANNs. The encoding scheme and recombination operator are designed so that the algorithm can avoid the permutation problem and the problems that follow from frequent and abrupt recombination. The complete algorithm, though it uses a GA, does not recombine parents rigorously. ANNs that are learning at a good rate are not subject to recombination. This idea is taken from an EP approach, namely EPNet [6], to reduce the detrimental effect of restructuring immature ANNs through early recombination. The recombination operator itself also does not produce offspring that differ from their parents greatly. In the following, the algorithm is elaborated:
Step 1. Generate an initial population of networks at random. Nodes and connections are uniformly generated at random within a certain range. Weights are also generated uniformly at random within a certain range.
Step 2. Train each network partially for a certain number of epochs using the backpropagation algorithm. After training, if the training error has not been significantly reduced, the network is marked with 'failure'; otherwise, the network is marked with 'success'.
Step 3. If the stopping criteria are met, stop the process and identify the best network. Otherwise, go to Step 4.
Step 4. Choose a network marked with 'failure'. If no network with a 'failure' mark is left to choose, go to Step 2.
Step 5. Choose a network among the networks marked with 'success' uniformly at random.
Step 6. Set the first network (marked with 'failure') as the weak parent (i.e., the one to be updated) and the second network (marked with 'success') as the strong parent (which gives the update suggestions and towards which the first network is to be changed). Apply the encoding scheme, i.e., form AGs (defined later) for the ANNs, find the MG (defined later) from the AGs, and find a nice maximal clique to identify their common subgraph. Make use of the clique in the recombination operator to generate an updated ANN. To perform the update, delete some unique nodes of the weak parent with a low probability, add some unique connections of the strong parent to the weak parent with a low probability, and delete a connection from their common part (mutation) with a very low probability.
Step 7. Partially train the resulting network for a certain number of epochs. If the resulting network performs better, in terms of validation error, than its parent, then replace the parent with it; otherwise, discard the resulting network and keep the parent. Go to Step 4.
2.1 ANN to AG Encoding
AG is a graph in which vertices (and, possibly edges also) have attributes and each vertex is associated with some values of those attributes. The encoding is defined in Definition 1. Definition 1. Convert an ANN to an AG < V, E, P > in such a way that, – (V ) Each connection of ANN becomes a vertex of AG. – (E) If two connections of ANN are incident on the same node, then there is an edge between the corresponding vertices of AG. The edge should be labeled with the layer number of the incident node. – (P ) Each vertex of AG has two properties: • Weight, w associated with the corresponding connection of AG. • A set S of ordered pairs < n, w >, each of which corresponds to the incidence of another connection with this connection; n is the layer number of the incident neuron and w is the weight of the adjacent connection. An application of such encoding on an ANN is shown in Figure 1. Only the two hidden layers are shown for the ANN.
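A minimal sketch of this encoding is shown below; it assumes an ANN given as a list of weighted connections between (layer, index) node pairs, and the container format for vertices and edges is an illustrative choice rather than the authors' data structure.

```python
import itertools

def ann_to_ag(connections):
    """Encode an ANN as an attributed graph per Definition 1 (sketch).

    `connections` is a list of (source_node, target_node, weight) tuples,
    where each node is a (layer, index) pair.  AG vertices correspond to
    connections; two vertices share an edge, labelled with the layer number,
    when their connections are incident on the same node."""
    vertices = [{"w": w, "S": set()} for (_, _, w) in connections]
    edges = {}
    for (a, ca), (b, cb) in itertools.combinations(enumerate(connections), 2):
        shared = {ca[0], ca[1]} & {cb[0], cb[1]}   # nodes both connections touch
        for node in shared:
            layer = node[0]
            edges[(a, b)] = layer                  # edge labelled with the layer number
            vertices[a]["S"].add((layer, cb[2]))   # <layer, weight of adjacent connection>
            vertices[b]["S"].add((layer, ca[2]))
    return vertices, edges
```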
Fig. 1. An ANN (left) and its encoded AG (right) according to the scheme
2.2
AG to M G
Match graph (M G) is a graph, formed on the basis of matches found in two different AGs [7]. Rules of matching depend on the particular definition used. One is as in Definition 2. Definition 2. An M G is formed from two AGs (say, AG1 and AG2 ) with following characteristics: – Vertices of M G are assignments from AG1 and AG2 , – An edge in M G exists between two of its vertices if corresponding assignments are compatible. Now, assignment and compatibility of assignments are defined in Definition 3 and 4. Definition 3. v1 , a vertex from AG1 and v2 , a vertex from AG2 form an assignment if all of its attributes are similar. Definition 4. One assignment (and thus, one vertex of an M G, in effect) a1 between v1 from AG1 and v1 from AG2 is compatible with another assignment a2 between v2 from AG1 and v2 from AG2 if, all the relationships between v1 and v2 from AG1 is compatible with the relationships between v1 and v2 from AG2 . For the instance of problem, similarity and compatibility of relationships can be defined as in Definition 5. Definition 5. Similarities and compatability can be defined as: Similarity of weight. Two weights of two vertices are similar if there absolute difference is below some threshold value. Similarity of Set S. Two S sets from two vertices are similar if any two ordered pairs from each set are similar.
Similarity of ordered pair. Two ordered pairs of two S sets are similar if their first values (layer numbers) are exactly the same and their second values (weights) are similar (Similarity of weight).
Compatibility of edge. Two edges are compatible if they are labeled with the same (layer) number.
2.3 Clique
Finding the largest clique of a graph is N P-complete. To find the largest clique in a graph, an exhaustive search via backtracking provides the only real solution. Here, a maximal clique is searched using one simpler version of Qualex-MS [8] - namely New-Best-In Weighted, a maximal clique finding approximation algorithm, which finds a solution in polynomial time. 2.4
Recombination Step
Now that the similarity between these two ANNs has been found, the next step is to recombine them. To perform recombination, connections unique to AG1 (named the weak parent) are deleted from AG1, and connections unique to AG2 (the strong parent) are added to AG1. The resulting offspring is basically the ANN of AG1 with modifications directed towards AG2. Such modification can produce offspring lying between the structural loci from AG1 to AG2. Only limited classes of structures would be allowed in the process if all offspring lay only on these loci, reducing the region of exploration. To overcome this tendency towards overcrowding, a connection from the similar portion is deleted from AG1 with a very low probability. This is a mutation step, which retains the diversity of the population.
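The recombination step can be summarised as set operations on the parents' connection sets once the matched common subgraph is known. The sketch below uses hypothetical probabilities and treats only connections (the paper also deletes unique nodes of the weak parent), so it is an approximation of the operator described above rather than the exact procedure.

```python
import random

def recombine(weak, strong, matched_pairs, p_del=0.1, p_add=0.1, p_mut=0.01):
    """Recombination step (sketch): `weak` and `strong` are sets of connection
    identifiers for the two parents; `matched_pairs` is the common subgraph
    found via the maximal clique of the match graph, given as
    (weak_connection, strong_connection) pairs."""
    common_weak = {a for a, _ in matched_pairs}
    common_strong = {b for _, b in matched_pairs}
    child = set(weak)
    for conn in weak - common_weak:          # drop some links unique to the weak parent
        if random.random() < p_del:
            child.discard(conn)
    for conn in strong - common_strong:      # add some links unique to the strong parent
        if random.random() < p_add:
            child.add(conn)
    for conn in common_weak:                 # rare mutation inside the common part
        if random.random() < p_mut:
            child.discard(conn)
    return child
```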
3
Experimental Studies
Performance is evaluated on well-known benchmark problems - breast cancer, diabetes and thyroid. The datasets representing these problems were obtained from the UCI machine learning benchmark repository. The detailed descriptions of datasets are available at ics.uci.edu (128.195.11) in directory /pub/machinelearning-databases. 3.1
Experimental Setup
Experiment is standardized on dataset, partitioning and benchmark rules according to Proben1 [9]. A population of 20 individuals has been used. Each individual is a neural network of two hidden layers. Number of connections between the hidden layers is chosen and set uniformly from 70% to 100% that of the fully connected network. Initial weight is between -0.5 to 0.5. Number of epochs for training is 100 for each run on each set of problems. For each set of problems, a number of 10 runs are used to accumulate the results.
Graph Matching Recombination for Evolving Neural Networks
3.2
567
Results
Table 1 shows the accuracies for the three problems over training set, validation set and test set. Within each set, first column is the error which is minimized by BP and the second column is the classification error. Table 1. Mean, standard deviation, minimum and maximum value of training, validation and testing errors for different problems
Mean SD Min Max Mean Diabetes SD Min Max Mean Heart SD Disease Min Max Mean Thyroid SD Min Max Breast Cancer
Training Set error error rate 2.7667 0.0286 0.0662 0.0003 2.0460 0.0200 2.8960 0.0314 13.9678 0.2063 0.7363 0.0182 10.23 0.1484 21.92 0.3307 7.4503 0.0899 1.7097 0.0229 2.6750 0.0283 12.31 0.1783 1.0977 0.0049 0.6414 0.0021 0.3412 0.0014 7.7310 0.0186
Validation Set error error rate 2.3837 0.0391 0.2342 0.0030 1.8460 0.0229 3.4340 0.0457 16.1721 0.2170 0.3677 0.0101 14.58 0.1927 23.22 0.2552 15.6445 0.1937 1.1991 0.0164 12.49 0.1391 20.99 0.2739 1.8006 0.0079 0.5793 0.0019 0.9904 0.0033 7.8040 0.0189
Test Set error error rate 1.6512 0.0230 0.193 0.0019 1.14 0.0115 3.144 0.0345 16.9480 0.2458 0.6330 0.0169 15.5900 0.1875 23.0500 0.3646 17.0669 0.2076 1.1399 0.0143 14.06 0.1609 22.61 0.2652 1.8111 0.0087 0.6361 0.0023 0.9513 0.0039 7.9370 0.0244
Evolution of the structure can be observed by connection Vs generation curves. Figure 2(a) shows the curve of the mean of average number of connections Vs generations for breast cancer problem, Figure 2(b) shows the same for diabetes problem.
(a)
(b)
Fig. 2. Evolution of structure of ANNs for problems (a) breast cancer and (b) diabetes
568
A. Mahmood et al.
The experimental results show that a comparatively good performance for diabetes and thyroid problem while results for breast cancer problem are also competitive. It also evolves structurally compact ANNs which explains its well generalization capability.
4
Conclusions
Here, one particular evolutionary system is used to describe the potential of the devised recombination operator. It was carefully designed to avoid permutation problem and line up matching blocks. Result shows that this effort can dynamically adapt structures, which also validates its suitability as an operator. This recombination operator can also be experimented by incorporating it with other evolutionary flows having different choices of ’success’ and ’failure’ marks, strong and weak parents and other parameters. Its applicability to other processes makes it a better choice of operator for GA-based evolutionary approaches.
References
1. Storn, R., Price, K.: Differential Evolution - a Simple and Efficient Adaptive Scheme for Global Optimization over Continuous Spaces. Technical Report TR-95-012, ICSI, March (1995) ftp.icsi.berkeley.edu
2. Hancock, P.J.B.: Genetic Algorithms and Permutation Problems: A Comparison of Recombination Operators for Neural Net Structure Specification. Proc. COGANN Workshop, Baltimore, MD (1992) 108-122
3. Holland, J.H.: Adaptation in Natural and Artificial Systems. Ann Arbor, MI: Univ. Michigan Press (1975)
4. Fogel, L., Owens, A., Walsh, M., Eds.: Artificial Intelligence Through Simulated Evolution. New York: Wiley (1966)
5. Fogel, D.B.: Phenotypes, Genotypes, and Operators in Evolutionary Computation. Proc. 1995 IEEE Int. Conf. Evolutionary Computation, Piscataway, NJ (1995) 193-198
6. Yao, X., Liu, Y.: A New Evolutionary System for Evolving Artificial Neural Networks. IEEE Trans. Neural Networks 8(3) (1997) 694-713
7. Schalkoff, R.J.: Pattern Recognition: Statistical, Structural and Neural Approaches. John Wiley & Sons, New York (1992)
8. Busygin, S.: A New Trust Region Technique for the Maximum Weight Clique Problem. Discrete Applied Mathematics 154(15) (International Symposium on Combinatorial Optimization CO'02) (2006) 2080-2096
9. Prechelt, L.: Proben1 - A Set of Neural Network Benchmark Problems and Benchmarking Rules. Fakultat fur Informatik, Univ. Karlsruhe, Karlsruhe, Germany, Tech. Rep. 21/94, Sept. (1994)
Orthogonal Least Squares Based on QR Decomposition for Wavelet Networks
Min Han and Jia Yin
School of Electronic and Information Engineering, Dalian University of Technology, Dalian, 116023, China
[email protected]
Abstract. This paper proposes an orthogonal least squares algorithm based on QR decomposition (QR-OLS) for the selection of hidden-layer neurons in wavelet networks. The new algorithm divides the original neuron matrix into several parts to avoid comparisons among the poor neurons and uses QR decomposition to select the significant ones, so a large amount of meaningless calculation is avoided. The algorithm is applied to wavelet networks with the analysis of variance (ANOVA) expansion for one-step-ahead prediction of the Mackey-Glass delay-differential equation and of the annual sunspot data set. The results show that the QR-OLS algorithm relieves the heavy computational load and has good performance.
1 Introduction
The idea of combining wavelets with neural networks has led to the development of wavelet networks (WNs), where wavelets are introduced as activation functions. The wavelet analysis procedure is implemented with dilated and translated versions of a mother wavelet, which contains much redundant information, so the calculation of the WNs is heavy and complicated in some cases, especially for high-dimensional models. It is therefore necessary to use an efficient method to select the hidden neurons in order to relieve the computational load. Several methods have been developed for selecting terms. Battiti et al. [1] used mutual information to select the hidden neurons, Gomm et al. [2] proposed piecewise linearization based on Taylor decomposition, and Alonge et al. [3] applied a genetic algorithm for selecting the wavelet functions. The orthogonal least squares (OLS) algorithm was developed by Billings et al. [4]. However, these methods are time-consuming and, therefore, more efficient approaches have been investigated. In the OLS algorithm, to select a correct hidden neuron, the vectors formed by the candidate neurons must be processed by orthogonal methods, which becomes a heavy burden for the WNs. In the present paper, an orthogonal least squares algorithm based on QR decomposition (QR-OLS) is proposed, which divides the candidate neurons into sub-blocks to avoid comparisons among the poor neurons and uses the forward orthogonal least squares algorithm based on the QR decomposition approach to select the hidden neurons of the WNs. The paper is organized as follows. Section 2 briefly reviews some preliminary knowledge of WNs. The QR-OLS algorithm is described in Section 3.
In section 4, two examples are simulated to illustrate the performance of the new algorithm. Finally, the conclusions of this paper are given in section 5.
2 Wavelet Networks
The most popular wavelet decomposition restricts the dilation and translation parameters to dyadic lattices [4, 5], and the output of the WN can be expressed as

y = \sum_{j=j_0}^{J_0} \sum_{k \in K_j} c_{j,k} \psi_{j,k}(x) = \sum_{j=j_0}^{J_0} \sum_{k \in K_j} c_{j,k} \cdot 2^{-j/2} \psi(2^j x - k)    (1)
where x is the one-dimensional input of the network, y is the one-dimensional output of the network, c_{j,k} is the coefficient of the wavelet decomposition (the weight of the WN), j can be regarded as the dilation parameter, k can be regarded as the translation parameter, \psi_{j,k}(x) = 2^{-j/2}\psi(2^j x - k) and \psi(\cdot) is the wavelet function. In Eq. (1), j_0 is the coarsest resolution, J_0 is the finest resolution and k \in K_j, where K_j is a subset of the integers that depends on the dilation parameter. According to Eq. (1), the structure of the WNs used in this paper is similar to that of radial basis function networks, except that the activation functions are wavelet functions rather than radial basis functions. The WNs can be trained using least-squares methods. The result for the one-dimensional case described previously can be extended to high dimensions [4, 6]. First, the n-dimensional wavelet function can be expressed as
\psi^{[n]}_{j,k}(x) = \psi^{[n]}_{j,k}(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} \psi_{j,k}(x_i), \quad x = [x_1, x_2, \ldots, x_n], \; i = 1, 2, \ldots, n    (2)
where the superscript [n] stands for the dimension of the wavelet and x is the multi-dimensional input of the WNs. The analysis of variance (ANOVA) decomposition [4] is then used to simplify the n-dimensional wavelet function. The main idea of ANOVA, shown in Eq. (3), is to decompose the high-dimensional function into lower-dimensional ones.
\psi^{[n]}_{j,k}(x) = \sum_{1 \le l_1 \le n} \psi^{[1]}_{j,k}(x_{l_1}) + \sum_{1 \le l_1 \le l_2 \le n} \psi^{[2]}_{j,k}(x_{l_1}, x_{l_2}) + \sum_{1 \le l_1 \le l_2 \le l_3 \le n} \psi^{[3]}_{j,k}(x_{l_1}, x_{l_2}, x_{l_3}) + \cdots + e    (3)
where e is the error of the ANOVA decomposition and l_1, l_2, l_3 = 1, 2, \ldots, n. Then Eq. (1) can be extended as

y = \sum_{1 \le l_1 \le n} f^{[1]}_{l_1}(x_{l_1}) + \sum_{1 \le l_1 \le l_2 \le n} f^{[2]}_{l_1 l_2}(x_{l_1}, x_{l_2}) + \cdots + \sum_{1 \le l_1 \le \cdots \le l_i \le n} f^{[i]}_{l_1 l_2 \cdots l_i}(x_{l_1}, x_{l_2}, \ldots, x_{l_i}) + e,

f^{[i]}_{l_1 l_2 \cdots l_i}(x_{l_1}, x_{l_2}, \ldots, x_{l_i}) = \sum_{j=j_i}^{J_i} \sum_{k \in K_j} \psi^{[i]}_{j,k}(x_{l_1}, x_{l_2}, \ldots, x_{l_i}), \quad i = 1, 2, \ldots, n; \; 1 \le l_1 \le \cdots \le l_i \le n    (4)
where j1, j2, j3 are the coarsest resolutions, and J1, J2, J3 are the finest resolutions.
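To make the dyadic-lattice construction in Eq. (1) concrete, the following short Python sketch evaluates a one-dimensional wavelet network output for given coefficients. The Mexican hat mother wavelet, the coefficient values and the lattice ranges are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def mexican_hat(u):
    """Mexican hat (Ricker) mother wavelet."""
    return (1.0 - u**2) * np.exp(-0.5 * u**2)

def wn_output(x, coeffs, j0=0, J0=3, psi=mexican_hat):
    """Evaluate y = sum_j sum_k c_{j,k} * 2^{-j/2} * psi(2^j * x - k)  (Eq. 1).

    coeffs: dict mapping (j, k) -> c_{j,k}; the translation set K_j is
    taken to be the keys present for each dilation level j.
    """
    y = 0.0
    for (j, k), c in coeffs.items():
        if j0 <= j <= J0:
            y += c * 2.0**(-j / 2.0) * psi(2.0**j * x - k)
    return y

# Hypothetical coefficients on a small dyadic lattice
coeffs = {(0, 0): 0.5, (1, 0): -0.2, (1, 1): 0.3, (2, 3): 0.1}
print(wn_output(0.7, coeffs))
```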
3 OLS Algorithm Based on QR Decomposition
The matrix P_all consists of the M_all vectors formed by the hidden neurons whose activation functions are the wavelet functions in Eq. (4). Several OLS algorithms have been
developed for selecting the vectors, i.e., the wavelet functions with parameters j and k, such as the classical Gram-Schmidt (CGS) algorithm, the modified Gram-Schmidt (MGS) algorithm [8, 9] and the Householder algorithm. The core of these algorithms is the way they deal with P_all. In the CGS algorithm, P_all is decomposed as

P_all = W \cdot A    (5)

where A is a unit upper triangular matrix and W is a matrix with M orthogonal columns w_1, w_2, \ldots, w_M. The MGS algorithm performs basically the same operations as CGS, only in a different sequence. However, both methods are sensitive to round-off error. In the Householder algorithm, the Householder transformation is applied for the orthogonalization procedure. The three methods share the same drawback: to select a hidden neuron, all the vectors in P_all must be processed by orthogonal methods, which is time-consuming and complicated. To avoid repeated decompositions of P_all, the new algorithm divides P_all into several sub-blocks, and QR decomposition is then applied to every sub-block, which also helps avoid ill conditioning. The algorithm proceeds as follows. First, a sub-block P with M (M < M_all) columns is taken from P_all. Assuming that P has full column rank, P can be decomposed as

P = Q \cdot R    (6)

where Q = [Q_1 \; Q_2], R = [R_1 \; 0]^T, Q_1 is an N \times M matrix, Q_2 is an N \times (N - M) matrix, and R_1 is a square matrix with M columns. Therefore, Eq. (6) can be simplified to

P = Q_1 \cdot R_1    (7)
where Q_1 = [q_1, q_2, \ldots, q_M]. The model of the WN can then be expressed as

Y = P \cdot W + \Xi    (8)

where Y is the output matrix of the WN, W is the weight matrix to be solved, and \Xi is the sum of the model error and noise. According to Eq. (7), Eq. (8) becomes

Y = P \cdot R_1^{-1} \cdot R_1 \cdot W + \Xi.    (9)

Define G = R_1 \cdot W = [g_1, g_2, \ldots, g_M]^T. Then Eq. (9) can be rewritten as

Y = Q_1 \cdot G + \Xi.    (10)

If \Xi is ignored, G can be solved by the least-squares algorithm as

G = [Q_1^T \cdot Q_1]^{-1} \cdot Q_1^T \cdot Y, \quad g_i = q_i^T Y / (q_i^T q_i) = q_i^T Y.    (11)
Assuming that \Xi is uncorrelated with the past output of the system, Eq. (10) can be expressed as

\frac{1}{N} Y^T Y = \frac{1}{N} \sum_{i=1}^{M} g_i^2 + \frac{1}{N} \Xi^T \Xi.    (12)
The parameter ERR_i is introduced, which is defined as

ERR_i = g_i^2 / (Y^T Y) = (q_i^T Y)^2 / (Y^T Y).    (13)
Several vectors qi (i=1, 2, ⋅ ⋅ ⋅ , M) with larger ERRi are selected as hypo-optimized neurons. The same procedure is applied to other sub-blocks, and then the optimized neurons are selected from the hypo-optimized ones with the same algorithm.
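The sub-block selection step can be illustrated with a small numerical sketch. The code below is a simplified interpretation of the procedure, written with NumPy; the block size, threshold and number of selected neurons are placeholder values, and the error-reduction ratio is computed from the reconstructed Eq. (13), so it should be read as an assumption rather than the authors' exact implementation.

```python
import numpy as np

def select_neurons(P_all, Y, block_size, n_select, eps=1e-6):
    """Select significant candidate-neuron columns of P_all block by block.

    For each sub-block, a thin QR decomposition is computed and the columns
    with the largest error-reduction ratio ERR_i = (q_i^T Y)^2 / (Y^T Y) are
    kept as hypo-optimized neurons; a final pass over the survivors picks
    the optimized ones.
    """
    yty = float(Y.T @ Y)

    def best_in_block(cols):
        Q, _ = np.linalg.qr(P_all[:, cols])          # orthogonal columns q_i
        err = (Q.T @ Y).ravel() ** 2 / yty           # Eq. (13)
        norms = np.linalg.norm(Q, axis=0)
        err[norms < eps] = 0.0                       # drop ill-conditioned columns
        order = np.argsort(err)[::-1][:n_select]
        return [cols[i] for i in order]

    blocks = [list(range(s, min(s + block_size, P_all.shape[1])))
              for s in range(0, P_all.shape[1], block_size)]
    hypo = [idx for b in blocks for idx in best_in_block(b)]
    return best_in_block(hypo)                       # final selection

# Toy usage with random candidate neurons
rng = np.random.default_rng(0)
P_all = rng.standard_normal((200, 60))
Y = rng.standard_normal((200, 1))
print(select_neurons(P_all, Y, block_size=20, n_select=5))
```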
4 Simulations
Two examples are provided to verify the performance of the QR-OLS algorithm. Both data sets are normalized based on the wavelet operating domain rather than on physical insight. In addition, the OLS algorithm based on CGS (CGS-OLS) is also used in the simulations for comparison with the proposed algorithm.
4.1 Mackey-Glass Delay-Differential Equation
This data set is generated by the Mackey-Glass delay-differential equation

\frac{dx(t)}{dt} = -0.1\, x(t) + \frac{0.2\, x(t-\tau)}{1 + x^{10}(t-\tau)}    (14)
where the time delay \tau is chosen to be 30 in this example. Setting the initial condition x(t) = 0.9 for 0 \le t \le \tau, a Runge-Kutta integration algorithm is applied to solve Eq. (14) with an integration step \Delta t = 0.01, and 1000 equi-spaced samples x(t), t = 1, 2, \ldots, 1000, are extracted with a sampling interval of T = 0.06 time units. The data set is divided into two parts: 500 data points are used to train the WNs and the others are reserved for testing the network. The model can be expressed as

y(m) = \sum_{1 \le l_1 \le 6} f^{[1]}_{l_1}(x_{l_1}) + \sum_{1 \le l_1 \le l_2 \le 6} f^{[2]}_{l_1 l_2}(x_{l_1}, x_{l_2}) + \sum_{1 \le l_1 \le l_2 \le l_3 \le 6} f^{[3]}_{l_1 l_2 l_3}(x_{l_1}, x_{l_2}, x_{l_3})    (15)
where x_{l_i} = x(t - l_i), i = 1, 2, 3 and 1 \le l_1 \le l_2 \le l_3 \le 6. The 1-D, 2-D and 3-D compactly supported Mexican hat wavelets are used in this example to approximate the uni-variate functions f^{[1]}_{l_1}, the bi-variate functions f^{[2]}_{l_1 l_2} and the tri-variate functions f^{[3]}_{l_1 l_2 l_3}, respectively, with the coarsest resolutions j_1 = j_2 = j_3 = 0 and the finest resolutions J_1 = 3, J_2 = 1 and J_3 = 0. The value of M is 285, so P_all is divided into sub-blocks with 285 columns. To avoid ill conditioning, a candidate neuron is eliminated if ||q_i|| in Eq. (11) is less than a predetermined threshold \varepsilon; here \varepsilon is chosen to be 6 to guarantee good conditioning, based on the data set and the normalization used in the WNs. Then 14 hypo-optimized neurons are selected from every sub-block, and finally the 14 most significant neurons are selected from the hypo-optimized ones. Table 1 gives a comparison of the CPU time required to select the hidden neurons; the simulation is realized in MATLAB. It can be seen that the algorithm is more efficient and accurate than the CGS-OLS algorithm.
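For readers who want to reproduce the data set, the following sketch generates a Mackey-Glass series with τ = 30. It uses simple Euler integration with the delayed value taken from the stored history, rather than the Runge-Kutta scheme mentioned above, so it is only an approximate illustration; the burn-in length is an assumption.

```python
import numpy as np

def mackey_glass(n_samples=1000, tau=30.0, dt=0.01, sample_T=0.06,
                 x0=0.9, burn_in=5000):
    """Euler integration of dx/dt = -0.1 x(t) + 0.2 x(t-tau) / (1 + x(t-tau)^10)."""
    delay_steps = int(round(tau / dt))
    steps_per_sample = int(round(sample_T / dt))
    total_steps = burn_in + n_samples * steps_per_sample

    # history buffer; x(t) = x0 for t <= tau (initial condition)
    x = np.full(total_steps + delay_steps + 1, x0)
    for t in range(delay_steps, total_steps + delay_steps):
        x_tau = x[t - delay_steps]
        x[t + 1] = x[t] + dt * (-0.1 * x[t] + 0.2 * x_tau / (1.0 + x_tau**10))

    series = x[delay_steps + burn_in::steps_per_sample][:n_samples]
    return series

data = mackey_glass()
train, test = data[:500], data[500:]
print(len(train), len(test), float(train.mean()))
```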
Table 1. Comparison of the CGS-OLS and QR-OLS algorithms

Method    RMSE       Time (s)
CGS-OLS   3.5×10^-3  419.1400
QR-OLS    2.8×10^-3  23.7500

Fig. 1. Prediction with the QR-OLS algorithm for the Mackey-Glass series (predicted vs. actual values over sampling indices 500-1000)
4.2 The Sunspot Time Series
This example uses the Wolf sunspot data series recording the annual sunspot indices from 1700 to 1999. The data set is separated into two parts: the training set consists of 270 data points and the test set consists of 30 data points. According to [6], y(t-1), y(t-2) and y(t-9) are selected as the most significant variables, and the model order is chosen to be 9. The model can then be expressed as

y(t) = \sum_{1 \le l_1 \le 9} f^{[1]}_{l_1}(x_{l_1}) + \sum_{1 \le l_1 \le l_2 \le 9} f^{[2]}_{l_1 l_2}(z_{l_1}, z_{l_2}) + f^{[3]}_{129}(x_1, x_2, x_9)    (16)
where x_{l_i} = y(t - l_i) for l_i = 1, 2, \ldots, 9, z_{l_i} = y(t - l_i) for l_i = 1, 2 and z_3 = y(t - 9). The 1-D, 2-D and 3-D compactly supported Gaussian wavelets are used in this example to approximate the uni-variate functions f^{[1]}_{l_1}, the bi-variate functions f^{[2]}_{l_1 l_2} and the tri-variate function f^{[3]}_{129}, respectively, with the coarsest resolutions j_1 = j_2 = j_3 = 0 and finest resolutions J_1 = 2, J_2 = J_3 = 0. P_all is divided into sub-blocks with 160 columns and \varepsilon in this example is chosen to be 0.085. The prediction results of the WNs compared with CGS-OLS are shown in Figure 2 and Table 2. The new algorithm can
Table 2. Comparison of the CGS-OLS and QR-OLS algorithms

Method    RMSE
CGS-OLS   16.4899
QR-OLS    13.3800

Fig. 2. Predictions with the CGS-OLS and QR-OLS algorithms for the annual sunspot time series
capture the characteristics of the real-world chaotic system and has better performance. For this smaller problem size, the improvement in computation time is modest.
5 Conclusions
In this paper, a new OLS algorithm based on QR decomposition for WNs is proposed. The QR-OLS algorithm avoids a great deal of calculation by dividing the original group of candidate neurons into several parts and using QR decomposition to select the significant ones. The more candidate neurons there are, the greater the advantage of the QR-OLS algorithm over the other OLS algorithms. The results obtained from the examples, which include the Mackey-Glass delay-differential equation and the sunspot time series, demonstrate its effectiveness and accuracy.
Acknowledgements
This research is supported by the National Natural Science Foundation of China under Projects 60674073 and 60374064. All of these supports are appreciated.
References
[1] Battiti, R.: Using Mutual Information for Selecting Features in Supervised Neural Net Learning. IEEE Transactions on Neural Networks 5 (4) (1994) 537-550
[2] Gomm, J.B., Yu, D.L.: Order and Delay Selection for Neural Network Modelling by Identification of Linearized Models. International Journal of Systems Science 31 (10) (2000) 1273-1283
[3] Alonge, F., D'Ippolito, F., Raimondi, F.M.: System Identification via Optimized Wavelet-Based Neural Networks. IEE Proc.-Control Theory Appl. 150 (2) (2003) 147-154
[4] Billings, S.A., Wei, H.L.: A New Class of Wavelet Networks for Nonlinear System Identification. IEEE Transactions on Neural Networks 16 (4) (2005) 862-874
[5] Cao, L.Y., Hong, Y.G., Fang, H.P., Hai, G.W.: Predicting Chaotic Time Series with Wavelet Networks. Physica D 85 (1995) 225-238
[6] Wei, H.L., Billings, S.A., Liu, J.: Term and Variable Selection for Nonlinear System Identification. Int. J. Control 77 (2004) 86-110
[7] Paulito, P.P., Taichi, H., Shiro, U.: Mutation-Based Genetic Neural Network. IEEE Transactions on Neural Networks 16 (3) (2005) 587-600
[8] Achiya, D.: A Modified Gram-Schmidt Algorithm with Iterative Orthogonalization and Column Pivoting. Linear Algebra and its Applications 310 (2005) 25-42
[9] Ling, F.Y., Dimitris, M., John, G.P.: A Recursive Modified Gram-Schmidt Algorithm for Least-Squares Estimation. IEEE Transactions on Acoustics, Speech, and Signal Processing ASSP-34 (4) (1986) 829-836
Implementation of Multi-valued Logic Based on Bi-threshold Neural Networks
Qiuxiang Deng and Zhigang Zeng
School of Automation, Wuhan University of Technology, Wuhan, Hubei, 430070, China
[email protected], [email protected]
Abstract. The implementation of multi-valued logic with a three-layer feedforward neural network is proposed. The hidden layer is constituted by bi-threshold neurons instead of traditional simple threshold neurons. According to the results obtained in this paper, if a perceptron with a simple threshold neuron is used in the output layer, then the logical map of {0, 1, · · · , n} to {0, 1} can be obtained; if a linear neuron is used in the output layer, then the logical map of {0, 1, · · · , n} to {0, 1, · · · , m} can be obtained. The algorithm used to design the three-layer feedforward neural network improves on traditional two-valued digital logic. An example shows that the design procedure of the network is simple and effective.
1 Introduction
Recently, a great deal of attention has been paid to the implementation of digital logic based on neural networks, and numerous research methods and results have been proposed [1-15]. This is partly because neural networks possess many merits such as the ability to learn, broad applicability, low complexity, simple configuration and easy integration. In order to reduce the complexity and limitations of neural networks, the ideas of the Karnaugh map and minterm inhibition are introduced using a three-layer neural network in [1]. Based on the results obtained in [1], a network is presented in [2] to lower complexity and limitations and to counteract slight disturbances of digital voltages. However, all these results focus on the logical map of {0, 1} to {0, 1} or the map of {−1, 1} to {−1, 1}. In these three-layer feedforward neural networks, the logical map is implemented with two values, as all neurons have a simple threshold and offer two states. Some studies suggest that neural networks can achieve not only the typical logic of two values, but also multi-valued logic and inexact logic such as fuzzy logic. Ojha presents a method for enumerating linear threshold functions of n-dimensional inputs in [3]. The linear threshold functions are the transfer functions of McCulloch-Pitts model neurons. The problem is to enumerate the distinct LTFs which can be computed by varying the weights and the bias [4-6]. A multi-threshold neuron to implement multi-valued logic is suggested in [7], based on parallel hyperplanes which divide the Euclidean space into regions. However, along with an increasing number of variables, the number
of thresholds of a multi-threshold neuron also increases. The contribution of this paper is to put bi-threshold neurons into a three-layer feedforward neural network. The bi-threshold neurons are used in the hidden layer. A neural network with bi-threshold neurons and other neurons can reduce the total number of neurons and can implement an arbitrary logical map. The rest of this paper is organized as follows. In Section 2, the neurons used in the network, including the simple threshold neuron and the bi-threshold neuron, and their transfer functions are introduced. In Section 3, a construction of the three-layer network with bi-threshold neurons is proposed to implement arbitrary logic. Bi-threshold neurons are used as the neurons of the hidden layer. According to the obtained results, if a perceptron with a simple threshold neuron is used in the output layer, then the logical map of {0, 1, · · · , n} to {0, 1} can be derived; in addition, if a linear neuron is used in the output layer, then the logical map of {0, 1, · · · , n} to {0, 1, · · · , m} can be obtained. In Section 4, we give an example to show that the structure of this kind of neural network is simple, reliable and easy to use to implement arbitrary logic. Finally, concluding remarks are included in Section 5.
2 Threshold Neuron
2.1 Simple Threshold Neuron
A simple threshold neuron, also called a perceptron, is the simplest neuron in neural networks. The model of this neuron is given in Fig. 1, where x_i is the
Fig. 1. The model of a simple threshold neuron
Fig. 2. The transfer function of a simple threshold neuron
i-th input, w_i is the weight connecting the outer inputs to this neuron and \theta is the only threshold of this neuron. Its transfer function is shown in Fig. 2. The output function of this neuron can be expressed as follows:

y = f\Big(\sum_{i=1}^{n} w_i x_i - \theta\Big), \quad f(u) = \begin{cases} 1, & u > 0 \\ 0, & \text{otherwise} \end{cases}    (1)

2.2 Bi-threshold Neuron
A bi-threshold neuron is a linear combination of two simple threshold neurons. The "exclusive-OR" problem, which cannot be implemented by a single simple threshold neuron, can be solved with one bi-threshold neuron. The model of a bi-threshold neuron is given in Fig. 3, where x_i is the i-th input, w_i is the weight connecting the outer inputs to this neuron, \theta_1 is the lower threshold of this neuron and \theta_2 is the upper threshold of this neuron. Its transfer function is shown in Fig. 4. If the weighted sum of the input signals lies between the thresholds \theta_1 and \theta_2, then the activation signal 1 is produced; otherwise, the restrained signal 0 is produced. The output function of this neuron can be expressed as follows:

y = f\Big(\sum_{i=1}^{n} w_i x_i\Big), \quad f(u) = \begin{cases} 1, & \theta_1 \le u \le \theta_2 \\ 0, & \text{otherwise} \end{cases}    (2)

Fig. 3. The model of a bi-threshold neuron
Fig. 4. The transfer function of a bi-threshold neuron
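As a quick illustration of Eq. (2), the sketch below shows a single bi-threshold neuron solving the exclusive-OR problem. The particular weights and thresholds (w_1 = w_2 = 1, θ_1 = 0.5, θ_2 = 1.5) are an illustrative choice, not values prescribed by the paper.

```python
def bi_threshold_neuron(x, w, theta1, theta2):
    """Output 1 if theta1 <= sum_i w_i x_i <= theta2, else 0 (Eq. 2)."""
    s = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if theta1 <= s <= theta2 else 0

# XOR: fires only when exactly one of the two binary inputs is 1
for x1 in (0, 1):
    for x2 in (0, 1):
        y = bi_threshold_neuron((x1, x2), w=(1, 1), theta1=0.5, theta2=1.5)
        print(x1, x2, "->", y)
```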
3 Implementation of Multi-valued Logic
3.1 Implementation of the Logical Map of {0, 1, · · · , n} to {0, 1}
The construction of a neural network to realize the logical map of {0, 1, · · · , n} to {0, 1} is shown in Fig. 5, where x = (x_1, x_2, · · · , x_n)^T is the input vector, w_{1ij} is the weight connecting the input neurons and the hidden neurons, h_j is the output of the j-th hidden neuron, and w_{2j} is the weight connecting the hidden neurons and the output neuron. The network has three layers; the neurons in the hidden layer are bi-threshold neurons with different upper and lower thresholds, and there is only one neuron in the output layer. In this section, in order to realize the logical map of {0, 1, · · · , n} to {0, 1}, a simple threshold neuron is used in the output layer. The output functions of this neural network satisfy the following equations:

s_j = \sum_{i=1}^{n} w_{1ij} x_i, \quad h_j = f(s_j) = \begin{cases} 1, & \theta_{j1} \le s_j \le \theta_{j2} \\ 0, & \text{otherwise} \end{cases}, \qquad o = \sum_{j=1}^{L} w_{2j} h_j, \quad Y = \begin{cases} 1, & o \ge 0 \\ 0, & o < 0 \end{cases}    (3)

where s_j and o are the inputs of the j-th hidden neuron and of the output neuron, respectively, and \theta_{j1} and \theta_{j2} are the lower and upper thresholds of the j-th neuron in the hidden layer, respectively. The number of neurons in the hidden layer equals the number of samples whose output value is 1. The algorithm for this three-layer network to realize the logical map of {0, 1, · · · , n} to {0, 1} can be summarized as follows:
– Initialization: Let the number of input neurons be the number of digital logic variables, let the number of neurons in the hidden layer be the number of samples whose output value is 1, and let the output layer have one simple threshold neuron whose threshold is zero.
Fig. 5. The structure of a neural network to implement the logical map of {0,1,. . . ,n} to {0,1}
– Sample learning: Learning only takes place for the sample vectors whose output value is 1. The learning rules are as follows. Assume x_r is the r-th input sample vector whose function value is 1, and let the samples whose function value is 1 compose a set U. If x_r \in U, then x_r = [x_{r1}, x_{r2}, · · · , x_{rm}], where x_{ri} \in \{0, 1, · · · , n\}. The learning algorithm for the weights and thresholds in the neural network is as follows:

w_{1ij} = (n+1)^i, \quad \theta_{j1} = \sum_{i=1}^{m} (n+1)^i x_i - 0.5, \quad \theta_{j2} = \sum_{i=1}^{m} (n+1)^i x_i + 0.5, \quad w_{2j} = 1, \quad T = 0.    (4)
– Repeat the sample learning until all samples whose function values are 1 have been trained. Then the learning algorithm ends.
Lemma. In the hidden layer, only one neuron is in the excited state at any time, and the others are in the restrained state. The output of the network equals the function value of the corresponding sample.
Proof. Let x = x_r \in U, [x_1, · · · , x_r, · · · , x_L] \subset U, where x_r = (x_{r1}, x_{r2}, · · · , x_{rj}, · · · , x_{rm})^T and x_{rj} \in \{0, 1, · · · , n\}. As stated before, the number of input variables is m, and the value of each variable is between 0 and n. Hence, there are m neurons in the input layer and L neurons in the hidden layer. According to (3) and (4), if j = r, then

s_j = s_r = \sum_{i=1}^{m} w_{1ij} x_i = \sum_{i=1}^{m} (n+1)^i x_{ri}, \quad h_j = 1.

If x is any other sample, s_j does not belong to [\theta_{j1}, \theta_{j2}] and h_j = 0. Hence, the output of the whole network is 1 according to expressions (3) and (4). Therefore, if bi-threshold neurons are put into the hidden layer, the input values can extend beyond two values. That is to say, the neural network overcomes the restriction of the input values being fixed to two values such as {0, 1} or {−1, 1}.
3.2 Implementation of the Logical Map of {0, 1, · · · , n} to {0, 1, · · · , m}
The construction of the neural network to realize the logical map of {0, 1, · · · , n} to {0, 1, · · · , m} is also shown in Fig. 5. The output layer no longer contains a simple threshold neuron but a linear neuron, whose output function can be expressed as Y = f(u) = u. In the sample learning, the training of the weights and thresholds of the hidden layer is the same as detailed above, so in the hidden layer only one neuron is in the excited state at any time and the others are in the restrained state. The training of the weights connecting the hidden layer and the output
layer must be done differently, since the output is no longer two-valued but multi-valued. The rule for training the weights between the hidden layer and the output layer follows the expression w_{2j} = t_j, where t_j is the function value of the j-th input sample. Hence, the output of the entire network is obtained as

Y = \sum_j w_{2j} h_j = 1 \cdot t_j, \quad t_j \in \{0, 1, · · · , m\}.

As t_j belongs not to {0, 1} but to {0, 1, . . . , m}, the output value can vary from 0 to m. Hence, this three-layer network can realize the logical map of {0, 1, . . . , n} to {0, 1, . . . , m}.
4 Application to Three-Valued Logic with a Three-Layer Neural Network
In three-valued logic operations, there are three input variables and three output values. Take the AND operation as an example; using this three-layer network, the operation can be described perfectly. First, the truth table of this operation is shown in Table 1.
Table 1. Truth table of the three-valued AND operation with three variables
x1 x2 x3 | output    x1 x2 x3 | output    x1 x2 x3 | output
 0  0  0 | 0          1  0  0 | 0          2  0  0 | 0
 0  0  1 | 0          1  0  1 | 0          2  0  1 | 0
 0  0  2 | 0          1  0  2 | 0          2  0  2 | 0
 0  1  0 | 0          1  1  0 | 0          2  1  0 | 0
 0  1  1 | 0          1  1  1 | 1          2  1  1 | 1
 0  1  2 | 0          1  1  2 | 1          2  1  2 | 1
 0  2  0 | 0          1  2  0 | 0          2  2  0 | 0
 0  2  1 | 0          1  2  1 | 1          2  2  1 | 1
 0  2  2 | 0          1  2  2 | 1          2  2  2 | 2
From the table, one can see that there are 27 samples in this logical operation and 8 items whose output value is not zero. As stated before, the input layer has 3 neurons, the hidden layer has 8 neurons and the output layer has one linear neuron. The construction of the neural network for this operation, based on the above algorithm, is shown in Fig. 6. According to the algorithm, w_{11j} = 3^0, w_{12j} = 3^1, w_{13j} = 3^2, where j = 1, 2, · · · , 8. The thresholds of the neurons in the hidden layer are given in Table 2. The weights connecting the hidden layer and the output layer are trained in the same way, that is, w_{21} = w_{22} = · · · = w_{27} = 1 and w_{28} = 2. Obviously, one can obtain the three-valued AND operation with three variables through this network. In addition, one can see that this construction of the network to implement
Fig. 6. The structure of the neural network to implement the three-valued logic
Table 2. The thresholds of the neurons in the hidden layer
Hidden neuron j | Lower threshold θj1 | Upper threshold θj2 | Sample (x1, x2, x3) | Expected output
1               | 12.5                | 13.5                | (1, 1, 1)           | 1
2               | 13.5                | 14.5                | (1, 1, 2)           | 1
3               | 15.5                | 16.5                | (1, 2, 1)           | 1
4               | 16.5                | 17.5                | (1, 2, 2)           | 1
5               | 21.5                | 22.5                | (2, 1, 1)           | 1
6               | 22.5                | 23.5                | (2, 1, 2)           | 1
7               | 24.5                | 25.5                | (2, 2, 1)           | 1
8               | 25.5                | 26.5                | (2, 2, 2)           | 2
logic is simple and efficient. The algorithm for constructing the three-layer network can be extended to solve any multi-valued logic problem with multiple variables.
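The construction of Section 3 and the worked example above can be checked with a short script. The sketch below builds one bi-threshold hidden neuron per sample with a nonzero target, using powers of (n+1) as input weights and thresholds of ±0.5 around the sample's weighted sum, and a linear output neuron whose weights are the target values; it then verifies the three-valued AND (the minimum of the inputs). The weight ordering 3^0, 3^1, 3^2 over x_1, x_2, x_3 follows the text above, so the resulting threshold values may be ordered differently from Table 2; this is a sketch of the idea, not the authors' code.

```python
from itertools import product

def build_network(samples, n):
    """One bi-threshold hidden neuron per sample with nonzero target.

    samples: dict mapping input tuple -> target value in {0, ..., m}.
    Returns a list of (weights, theta1, theta2, output_weight) tuples.
    """
    hidden = []
    for x, t in samples.items():
        if t == 0:
            continue
        w = [(n + 1) ** i for i in range(len(x))]      # base-(n+1) encoding
        s = sum(wi * xi for wi, xi in zip(w, x))
        hidden.append((w, s - 0.5, s + 0.5, t))        # thresholds s +/- 0.5, w2j = t
    return hidden

def predict(hidden, x):
    """Linear output neuron: sum of output weights of the neurons that fire."""
    y = 0
    for w, th1, th2, w2 in hidden:
        s = sum(wi * xi for wi, xi in zip(w, x))
        if th1 <= s <= th2:
            y += w2
    return y

# Three-valued AND with three variables: output = min(x1, x2, x3)
n = 2
truth = {x: min(x) for x in product(range(n + 1), repeat=3)}
net = build_network(truth, n)
assert all(predict(net, x) == t for x, t in truth.items())
print("network reproduces the truth table with", len(net), "hidden neurons")
```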
5 Concluding Remarks
After discussing the connection between digital logic and the bi-threshold neuron, an algorithm for a three-layer feedforward neural network that realizes multi-valued logic is presented in this paper. The algorithm overcomes the limitation of logic research that is restricted to two values. The network is simple and efficient, and the results obtained in this paper can be widely used in the study of digital circuits, coding and pattern recognition.
Acknowledgement
This work was supported by the Natural Science Foundation of China under Grant 60405002.
References
1. Donald, L.G., Anthony, N.M.: A Training Algorithm for Binary Feedforward Neural Networks. IEEE Trans. Neural Networks 2 (1992) 176-194
2. Ma, X.M., Hu, Z.P.: Design of Digital Logic Using Neural Network. Journal of Circuits and System 3 (1998) 51-58
3. Piyush, C.O.: Enumeration of Linear Threshold Functions from the Lattice of Hyperplane Intersections. IEEE Trans. Neural Networks 4 (2000) 839-850
4. Winder, R.O.: Enumeration of Seven-argument Threshold Functions. IEEE Trans. Electron. Comput. 14 (1965) 315-325
5. Zuev, Y.A.: Asymptotics of the Logarithm of the Number of Threshold Functions of the Algebra of Logic. Sov. Math. Dok. 39 (1989) 512-513
6. Siu, K.Y., Roychowdhury, V., Kailath, T.: Discrete Neural Computation: A Theoretical Foundation. Englewood Cliffs, NJ: Prentice-Hall (1995)
7. Wang, S.J., Zhao, G.L., Liu, Y.Y.: Research on Logic Operation by Multi-threshold Neural Network. BIC-CA2006 (2006) 335-342
8. Sun, C., Feng, C.: Global Robust Exponential Stability of Interval Neural Networks with Delays. Neural Processing Letters 17 (2003) 107-115
9. Sun, C., Feng, C.: On Robust Exponential Periodicity of Interval Neural Networks with Delays. Neural Processing Letters 20 (2004) 53-61
10. Huang, D.S., Ip, H.H.S., Law, K.C.K., Chi, Z.R.: Zeroing Polynomials Using Modified Constrained Neural Network Approach. IEEE Trans. Neural Networks 3 (2005) 721-732
11. Huang, D.S., Ip, H.H.S., Chi, Z.R.: A Neural Root Finder of Polynomials Based on Root Moments. Neural Computation 8 (2004) 1721-1762
12. Huang, D.S.: A Constructive Approach for Finding Arbitrary Roots of Polynomials by Neural Networks. IEEE Trans. Neural Networks 2 (2004) 477-491
13. Xu, B.J., Liu, X.Z., Liao, X.X.: Global Exponential Stability of High Order Hopfield Type Neural Networks. Applied Mathematics and Computation 1 (2006) 98-116
14. Liu, M.Q.: Global Exponential Stability Analysis for Neutral Delay-differential Systems: An LMI Approach. International Journal of Systems Science 11 (2006) 777-783
15. Liu, M.Q.: Dynamic Output Feedback Stabilization for Nonlinear Systems Based on Standard Neural Network Models. International Journal of Neural Systems 4 (2006) 305-317
Iteratively Reweighted Fitting for Reduced Multivariate Polynomial Model
Wangmeng Zuo 1, Kuanquan Wang 1, David Zhang 2, and Feng Yue 1
1 School of Computer Science and Technology, Harbin Institute of Technology, 150001 Harbin, China
[email protected], [email protected]
2 Department of Computing, Hong Kong Polytechnic University, Kowloon, Hong Kong
Abstract. Recently a class of reduced multivariate polynomial models (RM) has been proposed that performs well in classification tasks involving few features and many training data. The RM method, however, adopts a ridge least-square estimator, overlooking the fact that least square error usually does not correspond to minimum classification error. In this paper, we propose an iteratively reweighted regression method and two novel weight functions for fitting the RM model (IRF-RM). The IRF-RM method iteratively increases the weights of samples prone to misclassification and decreases the weights of samples far from the decision boundary, making the IRF-RM model more suitable for efficient pattern classification. A number of benchmark data sets are used to evaluate the IRF-RM method. Experimental results indicate that IRF-RM achieves a higher or comparable classification accuracy compared with RM and several state-of-the-art classification approaches.
1 Introduction
Pattern classification, which assigns a class label to an unseen instance from a set of attributes describing that instance, plays a key role in many applications such as image retrieval, medical diagnosis, and bioinformatics. Over the years, various classification methods have been proposed. These approaches can be grouped into two major categories, generative learning approaches and discriminative learning approaches. In a generative learning approach, a generative model is learned from the training data and is then used to predict the class label of an unknown instance using the Bayes rule. In a discriminative learning approach, a decision function or decision boundary is learned by optimizing some performance criterion, such as the classification accuracy or generalization. Unlike generative learning, discriminative learning makes no assumptions about the distributions of samples, but instead attempts to directly compute the sample-class mapping. These approaches have been successful in many application tasks. For example, the performance of support vector machines (SVMs) in handwritten character recognition [11] is state-of-the-art. To ensure that it is able to represent decision boundaries of any shape, the decision function in a discriminative learning framework should be nonlinear, yet the demand
for good generalization implies that the decision function should be restricted in its complexity. One approach to treating this dilemma is to map the original data to a high-dimensional feature space and then use a simple decision function (e.g., linear) on the feature space. This class of algorithms includes the SVM, the Φ-machine, etc. A natural choice for use with discriminative learning models is the multivariate polynomial model (MPM). MPM first transforms the data into a high-dimensional polynomial feature space. A linear regression method is then applied in the transformed feature domain to fit an optimal polynomial model [5]. Although an MPM model can represent a nonlinear decision function, the number of feature dimensions increases exponentially with the growth of the rank r, requiring a great amount of data to ensure that the polynomial model is not over-fitted. Recently a class of reduced multivariate polynomial models (RM) has been proposed for pattern classification [10]. Unlike MPM, RM transforms the data into a reduced polynomial feature space whose dimensionality is significantly less than that of the multivariate polynomial model. RM has performed well in classification tasks that involve few features and many training data. In the neural network community, polynomial feedforward neural networks (PFNN) have also been used to generate high-order multivariate polynomial mappings. Neural learning or evolutionary computation approaches have been proposed to learn the neural network architecture, including the sampled polynomial terms and the corresponding coefficients [13]. One disadvantage of RM is that it adopts a least-square estimator and neglects the fact that the outputs are intrinsically discrete response variables. Least-square regression assumes that the predicted output has a Gaussian distribution with the theoretical output as its mean. However, this assumption is generally not tenable; the samples close to the boundary play a more important role in estimating the decision function. In this paper we propose to use an iteratively reweighted fitting method for the RM model (IRF-RM), which is more suitable for classification tasks. The iteratively reweighted fitting algorithm is an efficient method for solving nonlinear optimization problems, and has been widely applied to train SVMs [7] and neural networks. We use 42 data sets to evaluate the classification performance of IRF-RM. To provide rational comparisons of classifiers over multiple data sets, we adopt the Wilcoxon signed ranks test for comparison of two classifiers, and use the Friedman test for comparison of multiple classifiers [3]. IRF-RM achieves a higher or comparable classification accuracy compared with several state-of-the-art classifiers.
2 The Reduced Multivariate Polynomial Model
2.1 The Multivariate Polynomial Model
The multivariate polynomial model provides an efficient way to describe complex nonlinear relationships between instances and class labels. Assume that we are given a set of training data X = [x_1, \ldots, x_m]^T, x_i \in R^N, with corresponding labels y = [y_1, \ldots, y_m]^T, y_i \in \{0, 1\}. For this two-class problem, the goal of an MPM model is to fit an r-rank polynomial model
g(x) = b + \sum_{i=1}^{K-1} w_i x_1^{r_1} x_2^{r_2} \cdots x_N^{r_N}    (1)
Let \beta = [b \; w^T]^T, z_i = [1, \ldots, x_{i,1}^{r_1} x_{i,2}^{r_2} \cdots x_{i,N}^{r_N}, \ldots]^T, and Z = [z_1, z_2, \ldots, z_m]^T. We then compute the optimal estimate \hat{\beta} by minimizing the least-square error

\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{m} (y_i - \beta^T z_i)^2.    (2)

Assuming Z^T Z is nonsingular, we obtain the unique solution

\hat{\beta} = (Z^T Z)^{-1} Z^T y.    (3)
Generally we cannot guarantee that the matrix Z^T Z is nonsingular. In such cases, ridge regression can be used to address the singularity of the matrix Z^T Z. Ridge regression is a regularization technique for least squares regression. One usual way of regularization is to define a penalized least-square error

\Delta E(\beta) = \sum_{i=1}^{m} (y_i - \beta^T z_i)^2 + \mu^2 \beta^T \beta,    (4)

where \mu is the regularization parameter and I is an identity matrix. The optimal vector \hat{\beta} can then be calculated by

\hat{\beta} = (Z^T Z + \mu^2 I)^{-1} Z^T y.    (5)
2.2 The Reduced Multivariate Polynomial Model
A reduced multivariate polynomial model (RM) has recently been proposed to relieve the dimension explosion of the MPM model. In the MPM model, the number of feature dimensions increases exponentially with the growth of the rank r. This increase in dimensions would increase computational complexity and even decrease classification performance. To alleviate this problem, Toh et al. proposed an RM model by efficiently sampling in polynomial feature space [10], defined as

g_{RM}(x) = a_0 + \sum_{k=1}^{r} \sum_{i=1}^{N} a_{ki} x_i^k + \sum_{k=1}^{r} a_{rN+k} \Big(\sum_{i=1}^{N} x_i\Big)^k + \sum_{k=2}^{r} \Big(\sum_{i=1}^{N} a_{r(N+1)+(k-2)N+i}\, x_i\Big) \Big(\sum_{j=1}^{N} x_j\Big)^{k-1}    (6)
The feature dimension of RM, 1 + r + (2r - 1)N, is usually much less than that of MPM, C_{N+r}^{r}. For example, if the dimension of a sample x is 9, the feature dimension of the rank-6 MPM is 5005, but that of the RM is 106. Using the one-versus-all strategy [9], the RM method can be generalized to solve multi-class problems. Let C denote the number of classes. We form an m×C indicator response matrix Y = [y_1, y_2, \ldots, y_C], where y_{i,j} is 1 if the class label of the corresponding sample x_j is i, and y_{i,j} = 0 otherwise. For each y_i, we can train an RM model β_i
Table 1. Procedure of the iteratively reweighted fitting method

The Iteratively Reweighted Fitting Method
Step 1. Let t = 0; initialize \beta^{(t)} = (Z^T Z)^{-1} Z^T y or \beta^{(t)} = Z^T y.
Step 2. Let t = t + 1; update W using W_{ii}^{(t+1)} = \partial\varphi\{(y_i - \beta^{(t)T} z_i)^2\} / \partial(y_i - \beta^{(t)T} z_i)^2.
Step 3. Update \beta using \beta^{(t+1)} = (Z^T W^{(t+1)} Z)^{-1} Z^T W^{(t+1)} y.
Step 4. Repeat Steps 2 and 3 until convergence.
\beta_i = (Z^T Z + \sigma^2 A^{-1})^{-1} Z^T y_i    (7)
In the classification stage, we compute the estimated outputs of a sample x as [g_{RM}(x, \beta_1), \ldots, g_{RM}(x, \beta_C)], and then classify x according to the maximum value of g_{RM}(x, \beta_i).
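To make the reduced feature expansion in Eq. (6) concrete, the sketch below builds the RM feature vector of a sample so that g_RM(x) is the dot product of a coefficient vector with the features; its length is 1 + r + (2r−1)N, e.g. 106 for N = 9 and r = 6. This is one straightforward reading of the term ordering in Eq. (6), written here for illustration rather than taken from the authors' code.

```python
import numpy as np

def rm_features(x, r):
    """Reduced multivariate polynomial feature vector (Eq. 6).

    Returns [1, x_i^k terms, (sum x)^k terms, x_i*(sum x)^(k-1) terms]
    so that g_RM(x) is the dot product with the coefficient vector.
    """
    x = np.asarray(x, dtype=float)
    s = x.sum()
    feats = [1.0]                                        # a_0
    for k in range(1, r + 1):
        feats.extend(x**k)                               # a_{ki} x_i^k
    for k in range(1, r + 1):
        feats.append(s**k)                               # a_{rN+k} (sum_i x_i)^k
    for k in range(2, r + 1):
        feats.extend(x * s**(k - 1))                     # cross terms
    return np.array(feats)

x = np.random.rand(9)
phi = rm_features(x, r=6)
print(len(phi))            # 1 + r + (2r - 1) * N = 106 for N = 9, r = 6
```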
3 Iteratively Reweighted Fitting for the RM Method
3.1 Iteratively Reweighted Fitting
The iteratively reweighted fitting method is used to compute the optimal \hat{\beta} which minimizes the criterion

J(\beta) = \sum_{i=1}^{m} \varphi\{(y_i - \beta^T z_i)^2\}    (8)
where the function \varphi(\cdot) should be defined according to the application tasks. The optimal vector \beta can be obtained by solving the equation

\partial \sum_{i=1}^{m} \varphi\{(y_i - \beta^T z_i)^2\} / \partial\beta = \sum_{i=1}^{m} 2\varphi_i' \left(\beta^T z_i z_i^T - z_i^T y_i\right) = 0,    (9)

\varphi_i' = \frac{\partial \varphi\{(y_i - \beta^T z_i)^2\}}{\partial [(y_i - \beta^T z_i)^2]}.    (10)
To the best of our knowledge, the optimal vector \beta cannot be directly solved in one step. Fortunately, if the values of \varphi_i' have been determined, the optimal vector \hat{\beta} can be obtained using a weighted least squares estimator

\hat{\beta} = (Z^T W Z)^{-1} Z^T W y    (11)

where the diagonal element W_{ii} = \varphi_i' is defined as the weight function. We thus use an iteratively reweighted fitting method to compute W and \hat{\beta}, as shown in Table 1.
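A minimal NumPy sketch of the procedure in Table 1 is given below. The weight function here is a generic placeholder (it simply down-weights large residuals), and the actual Asymm-I/II functions of Section 3.2 would be plugged in instead; the ridge term, tolerance and iteration limit are assumptions added for numerical stability, not part of the original description.

```python
import numpy as np

def placeholder_weight(residual_sq):
    """Stand-in for phi'(.): smaller weight for samples with large residuals."""
    return 1.0 / np.sqrt(1.0 + residual_sq)

def irf_fit(Z, y, weight_fn=placeholder_weight, ridge=1e-6,
            max_iter=50, tol=1e-6):
    """Iteratively reweighted fitting (Table 1): beta^(t+1) = (Z'WZ)^-1 Z'Wy."""
    m, d = Z.shape
    beta = np.linalg.solve(Z.T @ Z + ridge * np.eye(d), Z.T @ y)   # Step 1
    for _ in range(max_iter):
        residual_sq = (y - Z @ beta) ** 2
        W = np.diag(weight_fn(residual_sq))                        # Step 2
        beta_new = np.linalg.solve(Z.T @ W @ Z + ridge * np.eye(d),
                                   Z.T @ W @ y)                    # Step 3
        if np.linalg.norm(beta_new - beta) < tol:                  # Step 4
            beta = beta_new
            break
        beta = beta_new
    return beta

# Toy usage on random data
rng = np.random.default_rng(1)
Z = rng.standard_normal((100, 5))
y = (Z[:, 0] + 0.1 * rng.standard_normal(100) > 0).astype(float)
print(irf_fit(Z, y)[:3])
```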
3.2 Weight Functions
Weight functions have a decisive effect on the classification accuracy of IRF-RM. If some training instances contain noise or are mislabeled (outlier instances), they would seriously influence the performance of the model to be fitted. Some weight functions can be used to reduce the influence of outlier instances, such as the LAR and the bisquare functions [4]. These weight functions usually are even- or odd-symmetric. Asymmetric weight functions, however, are more efficient for solving a classification problem. We propose two asymmetric weight functions, Asymm-I and Asymm-II. The Asymm-I weight function takes the value (1 + (|t|/\eta)^2)^{1/2} over a range of t that depends on the class label y, and the constant \gamma otherwise.    (12)
Unlike Asymm-I, the Asymm-II weight function simplifies the computation of the weight for t ≤ 0 of class 0 and t ≥ 1 of class 1. In Section 4, we show that this simplification does not cause performance degradation relative to the Asymm-I weight function. The Asymm-II weight function is defined as follows:

Asymm-II: W(t) = \begin{cases} \big(1 + (|t|/\eta)^2\big)^{1/2}, & \text{if } (0 < t \le 1 \text{ and } y = 0) \text{ or } (-1 \le t < 0 \text{ and } y = 1) \\ \gamma, & \text{otherwise} \end{cases}    (13)
One significant difference between asymmetric and symmetric weight functions is that symmetric weight functions usually are independent of class labels but asymmetric weight functions are not. From the definition of Asymm-I and Asymm-II, we can observe that, given the same t, the weight W(t) for class 0 and class 1 might be different. The Asymm-I and -II weight functions also emphasize the influence of training samples close to the decision boundary.
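Read this way, the Asymm-II weight can be dropped directly into the iterative fitting loop. The snippet below follows the piecewise form reconstructed in Eq. (13); since that equation was recovered from a garbled source, treat the exact branch conditions, and the choice of η and γ, as assumptions.

```python
import numpy as np

def asymm2_weight(t, y, eta=1.0, gamma=0.1):
    """Asymm-II weight as reconstructed in Eq. (13).

    t: residual-like quantity for each sample, y: class label in {0, 1}.
    Samples near or across the decision boundary for their class get the
    boosted weight; confidently correct samples get the small constant gamma.
    """
    t = np.asarray(t, dtype=float)
    y = np.asarray(y)
    boosted = np.sqrt(1.0 + (np.abs(t) / eta) ** 2)
    active = ((t > 0) & (t <= 1) & (y == 0)) | ((t >= -1) & (t < 0) & (y == 1))
    return np.where(active, boosted, gamma)

print(asymm2_weight([0.4, -0.3, 1.5], [0, 1, 1]))
```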
4 Experimental Results and Discussions
4.1 The Experimental Setup
The IRF-RM method is tested on 42 benchmark data sets from the UCI Machine Learning Repository [1], and StatLog data sets [2]. These data sets include 16 2-class problems, 12 3-class problems and 14 multi-class problems, and cover a wide range of applications such as medical diagnosis, image analysis, and character recognition. The choice of these data sets is mainly according to [6], [8], and [10], where experimental results of some state-of-the-art classifiers are reported on these data sets using a similar experimental setup.
For each of the 42 data sets, the experimental setup used to test the accuracy of IRF-RM is as follows:
(1) Following [10], for six data sets, attitude towards smoking restrictions (Smoking), waveform, thyroid disease (Thyroid), StatLog DNA, StatLog satellite image (StatLog SatImage), and LED, we calculate the classification rates using the training and test sets described in [10]. Each of the other data sets is randomly split into 10 folds, and a 10-fold cross-validation method is used to determine the classifier parameters and the classification rate. To reduce bias in evaluating the performance, we calculate the average and standard deviation of the classification rate over 10 runs of 10-fold cross-validation.
(2) Normalization. For all the data sets, each input attribute is normalized to values within [0, 1], except the first two attributes of the Monk1 data set, which are normalized to values within [0, 10].
(3) Performance evaluation. To compare the performance differences of multiple classifiers, it is usual to select a number of data sets to test the individual performance scores. It is then possible, based on these individual performance scores, to use various approaches to evaluate the overall performance of a classifier. One straightforward way is to calculate the average over all data sets [10]. This approach, however, does not provide statistical evidence and is seldom adopted. In our experiments, we use the Wilcoxon signed ranks test to compare IRF-RM and RM, and use the Friedman test with the corresponding post-hoc tests to compare IRF-RM and other classifiers.
4.2 Comparisons with the RM Model
When using IRF-RM, we should determine four hyper-parameters: the model order r, the regularization parameter σ, and two weight function parameters, γ and η. In our experiments, we adopt a stepwise strategy to determine the optimal values of these hyper-parameters using 10-fold cross-validation (cv). To reduce bias in the performance evaluation, we further compute the average and standard deviation (std.) of the classification rate over 10 runs of 10-fold cv. These experimental results indicate that the Asymm-II weight function achieves a competitive or better recognition rate than the Asymm-I function. Thus, in the subsequent experiments, the classification rates of IRF-RM are those obtained using the Asymm-II function. Using the average classification rates (ARR) on the test data sets, we carry out a Wilcoxon signed-ranks test to compare the performance of IRF-RM and RM [3]. Let Δ_i be the difference between the classification rates of the two classifiers on the i-th data set. We rank the differences Δ_i according to their absolute values; if several difference values are equal, we assign the average rank to all of them. We then define R^+ as the sum of ranks for the data sets on which IRF-RM outperforms RM, and R^- as the sum of ranks for the data sets on which RM outperforms IRF-RM. If R^- < R^+, we further calculate, over the N_d data sets, the statistic

z = \frac{R^- - N_d(N_d+1)/4}{\sqrt{N_d(N_d+1)(2N_d+1)/24}}    (14)
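The statistic in Eq. (14) is easy to compute directly; the small helper below mirrors the procedure described above (ranking absolute differences with average ranks for ties, then summing the ranks of the losing side). It is an illustrative sketch rather than the authors' script, and the example accuracies are made up.

```python
import numpy as np
from scipy.stats import rankdata

def wilcoxon_z(acc_a, acc_b):
    """z statistic of Eq. (14) comparing classifier A against B over N_d data sets."""
    diff = np.asarray(acc_a) - np.asarray(acc_b)
    ranks = rankdata(np.abs(diff))            # average ranks for ties
    r_plus = ranks[diff > 0].sum()            # data sets where A wins
    r_minus = ranks[diff < 0].sum()           # data sets where B wins
    n = len(diff)
    r = min(r_plus, r_minus)
    return (r - n * (n + 1) / 4.0) / np.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)

# Toy example with hypothetical accuracies on five data sets
print(wilcoxon_z([0.91, 0.88, 0.95, 0.70, 0.99], [0.90, 0.89, 0.93, 0.65, 0.99]))
```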
Table 2. Comparison of ARRs between IRF-RM and RM using the Wilcoxon signed-ranks test

Data Set           RM [10] (%)  IRF-RM (%)  Difference (%)  Rank
Shuttle            95.78        98.47       +2.69           35
BUPA               72.74        73.65       +0.91           24
Monk-1             98.67        99.02       +0.35           13.5
Monk-2             76.69        76.05       -0.64           20
Monk-3             91.50        92.73       +1.23           29
Pima               77.55        77.47       -0.08           7
Tic-tac-toe        98.35        98.33       -0.02           4
BC-Wisconsin       97.00        96.91       -0.09           8.5
Statlog Heart      84.41        84.13       -0.28           12
Credit             86.42        86.64       +0.22           11
Vote               95.43        96.11       +0.68           21
Mushroom           100.0        100.0       0.00            2
Wdbc               96.09        97.21       +1.12           28
Wpbc               81.89        81.80       -0.09           8.5
Inonsphere         88.82        91.50       +2.68           34
Sonar              74.75        75.72       +0.97           26
Iris               97.60        96.83       -0.77           23
Balance            92.90        99.42       +6.52           40
TA                 58.07        57.46       -0.61           19
Thyroid-new        93.67        93.63       -0.04           5
Abalone            66.60        66.55       -0.05           6
Contraceptive      54.81        54.81       0.00            2
Boston housing     78.35        78.85       +0.50           17
Wine               98.75        98.88       +0.13           10
Smoking            69.50        70.10       +0.60           18
Waveform           83.30        85.33       +2.03           33
Thyroid            93.99        96.97       +2.98           37
Statlog DNA        94.52        95.78       +1.26           30
Car                87.17        96.08       +8.91           41
Statlog Vehicle    82.29        85.65       +3.36           39
Soybean small      95.00        95.74       +0.74           22
Nursery            90.33        93.52       +3.19           38
Statlog Satimage   88.15        88.55       +0.40           15
Glass              64.98        65.33       +0.35           13.5
Zoo                96.91        96.91       0.00            2
Statlog Segment    94.11        96.07       +1.96           32
Ecoli              87.43        87.89       +0.46           16
LED                72.75        73.75       +1.00           27
Yeast              61.01        61.97       +0.96           25
Pendigit           95.73        98.70       +2.97           36
Optdigit           95.32        96.91       +1.59           21
Letter             74.14        86.33       +12.19          42
Average            85.32        86.76       +1.44           -
If there are many data sets, the distribution of z can be approximated by a standard normal distribution. Given a pre-determined significance level α and threshold z_0, if z < z_0, we can conclude that one method significantly outperforms the other in terms of classification rate at level α.
Using the Wilcoxon signed-ranks test, we compare the performance of RM and IRF-RM. Table 2 lists the ARRs of IRF-RM and RM, the value of their difference, and the rank of the absolute value of the difference on each data set. The overall ARR of IRF-RM over all data sets is 86.76%, which is higher than that of RM, at 85.32%. The comparison of classifiers based on the overall ARRs, however, is unreliable and does not provide statistical evidence. We therefore calculate the sum of ranks where IRF-RM outperforms RM, R^+ = 787, the sum of ranks where RM outperforms IRF-RM, R^- = 116, and then the z statistic of the Wilcoxon signed-ranks test, z = -4.19. According to [12], if z < -3.21, IRF-RM would be significantly superior to RM in terms of classification rate at α = 0.01.
Table 3. Comparisons of six classifiers using the ARRs and the average ranks
Classifier  ARR  Average Rank
ICPL 87.7% 4
RT3 87.4% 4.03
KNN 88.4% 3.42
SVM 88.7% 3.20
C4.5 83.9% 3.90
IRF-RM 90.2% 2.45
Fig. 1. Comparisons of six classifiers using the Bonferroni-Dunn test
4.3 Comparisons with Existing Approaches
We compare the performance of IRF-RM with those of several state-of-the-art classifiers, such as k-nearest neighbor (KNN), decision trees, the Φ-machine, and SVM. We take the experimental results from [6, 8] because they adopt similar experimental setups. In [6], Lam proposed four variants of the integrated concept prototype learner (ICPL), and compared the best ICPL with four classifiers: an instance pruning technique RT3, KNN, C4.5, and SVM. We compare the performance of IRF-RM with these methods using 30 of the 35 data sets. In [8], Precup proposed a Φ-machine-based method, CLEF, and compared its performance with five classifiers: C4.5, Φ-regression tree (Φ-RT), dynamic node creation (DNC), Φ-DNC, and SVM. We compare the performance of IRF-RM with these classifiers using 10 of the 20 data sets. Would IRF-RM achieve comparable or better performance than these approaches? We use the Friedman test to test this null-hypothesis. The Friedman test first ranks the classifiers for each data set. Let r_i^j be the rank of the j-th of k algorithms on the i-th of N_d data sets. Then we calculate the average rank \bar{r}_j of each classifier over all the data sets, and calculate the Friedman statistic χ_F^2. Further, Iman and Davenport proposed a superior F statistic [12]. If the Friedman test rejects the null-hypothesis, we further carry out a Bonferroni-Dunn test to reveal the performance differences of several
Table 4. Comparisons of seven classifiers using the ARRs and the average ranks
Classifier  ARR  Average Rank
C4.5 81.8% 4.20
CLEF 87.3% 3.10
Φ-RT 77.0% 5.45
Φ-DNC 76.8% 5.90
DNC 78.6% 5.25
SVM 86.1% 2.65
IRF-RM 91.7% 1.45
Fig. 2. Comparisons of seven classifiers using the Bonferroni-Dunn test
classifiers. The Bonferroni-Dunn test is used to compare a number of classifiers with a control classifier by checking whether the difference in average ranks is higher than the critical difference CD_B. Using the Friedman test, we evaluate the performance of IRF-RM by comparing it with the classifiers used in [6]. Table 3 shows the overall ARRs and average ranks of IRF-RM and the other five classifiers. The overall ARR of IRF-RM is 90.2%, higher than the overall ARRs of the other approaches; the average rank of IRF-RM is also lower than the ranks of the other approaches. The Friedman test is then used to evaluate the performance differences of these six classifiers. The Friedman statistic of the average ranks is 16.20, and the F statistic of the average ranks is 3.51, which is higher than 2.66 (the critical value of F(5, 145) for α = 0.05). Thus we can reject the null-hypothesis that there is no significant difference between these six classifiers at α = 0.05. We further use the Bonferroni-Dunn test to reveal the differences between IRF-RM and the other classifiers. Fig. 1 shows the results of the test with α = 0.10. At α = 0.10, the performance of IRF-RM is statistically better than those of KNN, RT3, ICPL, and C4.5. From the results of the Bonferroni-Dunn test, we cannot state that there is a statistical performance difference between IRF-RM and SVM. Table 4 shows the overall ARRs and average ranks of IRF-RM and the other six classifiers in [8]. The overall ARR of IRF-RM is 91.7% and its average rank is 1.45, both of which are superior to those of the other approaches. We then calculate the Friedman statistic (35.25) and the F statistic (12.82) of the average ranks. From the Friedman and F statistics, we reject the null-hypothesis that there is no significant difference between these classifiers at α = 0.05. We further use the Bonferroni-Dunn test to check the difference between IRF-RM and the other classifiers. Figure 2 shows the results of the test of these seven classifiers with α = 0.05. At α = 0.05, the performance of IRF-RM is significantly better than those of Φ-DNC, Φ-RT, DNC, and C4.5. From the Bonferroni-Dunn test, we cannot state that there is a statistical performance difference between IRF-RM, SVM and CLEF.
5 Conclusion
In this paper, we propose an iteratively reweighted least squares method for fitting the RM model (IRF-RM). IRF-RM, which iteratively increases the weights of samples close to the decision boundary and decreases the weights of samples far away from the decision boundary, is more suitable for efficient pattern classification. We use 42 data sets from the UCI repository and two other benchmark repositories to evaluate the IRF-RM method. Experimental results indicate that IRF-RM is statistically superior to the RM model in terms of classification rate, and achieves a classification rate higher than or comparable to several state-of-the-art approaches.
Acknowledgement The work is supported in part by the National Natural Science Foundation of China (NSFC) under the contracts No. 60332010 and No. 90209020.
References
1. Blake, C.L., Merz, C.J.: UCI Repository of Machine Learning Databases. Dept. of Information and Computer Sciences, Univ. of Calif., Irvine, http://www.ics.uci.edu/mlearn/MLRepository.html, 1998
2. Brazdil, P.: STATLOG Datasets. Inst. for Social Research at York Univ., http://www.niaad.liacc.up.pt/old/statlog/datasets.html, 1999
3. Demšar, J.: Statistical Comparisons of Classifiers Over Multiple Data Sets. Journal of Machine Learning Research 7 (2006) 1-30
4. Huber, P.J.: Robust Statistics. John Wiley & Sons, New York, 1981
5. Kuntner, M.K., Nachtsheim, C.J., Neter, J.: Applied Linear Regression Models. 4th edn. McGraw-Hill/Irwin, New York, 2004
6. Lam, W., Keung, C.-K., Liu, D.: Discovering Useful Concept Prototypes for Classification Based on Filtering and Abstraction. IEEE Trans. PAMI 24 (2002) 1075-1090
7. Pérez-Cruz, F., Bousoño-Calzón, C., Artés-Rodríguez, A.: Convergence of the IRWLS Procedure to the Support Vector Machine Solution. Neural Computation 17 (2005) 7-18
8. Precup, D., Utgoff, P.E.: Classification Using Φ-machines and Constructive Function Approximation. Machine Learning 55 (2004) 31-52
9. Rifkin, R., Klautau, A.: In Defense of One-vs-all Classification. Journal of Machine Learning Research 5 (2004) 101-141
10. Toh, K.-A., Tran, Q.-L., Srinivasan, D.: Benchmarking a Reduced Multivariate Polynomial Pattern Classifier. IEEE Trans. PAMI 26 (2004) 740-755
11. Vapnik, V.N.: Statistical Learning Theory. Wiley, New York, 1998
12. Zar, J.H.: Biostatistical Analysis. 4th edn. Prentice Hall, Upper Saddle River, NJ, 1999
13. Nikolaev, N.Y., Iba, H.: Learning Polynomial Feedforward Neural Networks by Genetic Programming and Backpropagation. IEEE Trans. Neural Networks 14 (2003) 337-350
Decomposition Method for Tree Kernels
Peng Huang and Jie Zhu
Electronic Engineering Department, Shanghai Jiao Tong University, 800 Dongchuan Rd., Shanghai, China
[email protected], [email protected]
Abstract. The tree decomposition task arises frequently in tree kernel computation, and the decomposition tends to vary under different tree mapping constraints. In this paper, we first introduce a general tree decomposition function and compare three variants of the function corresponding to different tree mappings. We then give a framework that generalizes kernels based on tree-to-tree decomposition using the decomposition function.
1 Introduction
The kernel method, which has been extensively studied in recent years, is a very powerful technique for many problems in machine learning, pattern recognition, computational biology and many other fields [3]. The tree kernel is a necessary and important kernel when dealing with structured or semi-structured data. This paper focuses on the decomposition method in tree kernel computation. In computing tree kernels, we often need to decompose the original problem into sub-problems, that is, decompose the trees or forests into sub-trees and sub-forests, so the decomposition method determines the complexity of the algorithm for calculating the tree kernel. In prior work, Collins and Duffy [2] introduced the parse tree kernel and a framework for computing this kind of tree kernel, which is based on a decomposition of the trees. Kashima and Koyanagi [4] extended the parse tree kernel to the elastic tree kernel, but still with the same rigid decomposition method. Recently, Kuboyama and Shin [1] presented recursive expressions for flexible tree kernels under the constraint used in the tree edit distance problem [5], but they still did not give a clear elucidation of the decomposition used in the computation. Dulucq and Touzet [11] analyzed decomposition and cover strategies, but their decomposition strategy was limited to deleting one node at a time. In this paper, we give a clear definition of the decomposition function and present three concrete functions under the Tai tree mapping [6], the constrained tree mapping [7] and the less-constrained tree mapping [8]. We then give a framework that generalizes kernels based on tree-to-tree decomposition using the decomposition function.
2 Background
2.1 Terminology
In the following, a tree is a rooted, labeled, ordered tree in which each node is labeled from a finite alphabet and the left-to-right order among siblings is given. An ancestor of a node is either the node itself or an ancestor of the parent of the node. A forest is a list of trees in which the left-to-right order between the trees is also given. Following [1] [10], the notation x ≤ y denotes that x is an ancestor of y, and x < y if x ≤ y and x ≠ y. We write lca(x, y) for the least common ancestor of x and y. x ≺ y denotes that x is to the left of y in the tree, i.e., x, y < lca(x, y) and the child of lca(x, y) on the path to x is to the left of the child of lca(x, y) on the path to y. The tree edit distance problem asks for the minimum cost of a sequence of operations that transforms one tree into another. Three edit operations are usually considered: (1) changing, (2) deleting, and (3) inserting. We denote an edit operation as a → b, where a and b can be any node label from the alphabet Σ or the null symbol λ. The edit operation is a changing if a ≠ λ and b ≠ λ, a deleting if b = λ, and an inserting if a = λ.
2.2 The Kernel Methods
The kernel method is a popular mathematical tool; together with the support vector machine [12] [13], it is widely used in the machine learning, data mining, and pattern recognition communities. A group of algorithms based on combining a kernel with some common method, such as principal component analysis or Gaussian processes, has emerged as a promising way to deal with problems in both traditional and new domains. We usually refer to them as kernel methods [14]. With kernel methods, we enjoy the convenience of handling data in a high-dimensional space while avoiding an explicit mapping of the data into that space, which is often a costly and difficult procedure. Many kernels have been validated in real-life domains [14], and they are powerful for dealing with huge and complex data sets. However, kernels are traditionally used for data in the form of ordinary vectors, also called attribute-value data, which cannot always be applied to more complex real-life problems. Therefore, kernels for many kinds of structured data have been introduced [15]. While the inductive logic programming community [16] has traditionally used logic programs to represent structured data, direct computation methods for such kernels have recently been proposed [1].
2.3 Tree Mappings
The notion of tree mapping is first introduced by Tai [6], which means a kind of correspondence between some patterns of two trees. The tree mapping is widely used for the tree edit distance problem with some constraints, and we often refer to the fundamental mapping defined by Tai as Tai mapping. There are also other
tree mappings based on the Tai mapping but with different constraint conditions. The constrained [7] tree mapping and the less-constrained [8] tree mapping are the two major ones among them. We now give the definitions of the three tree mappings; naturally, the notion of tree mapping extends to forests.
Definition 1 (Tai Mapping). A mapping M ⊆ {(a, b) | a ∈ V(T1), b ∈ V(T2)} is a Tai mapping if and only if, for all (a1, b1), (a2, b2) ∈ M, (1) the one-to-one relation: a1 = a2 ⇔ b1 = b2, (2) the ancestor relation: a1 < a2 ⇔ b1 < b2, and (3) the left-to-right relation: a1 ≺ a2 ⇔ b1 ≺ b2 are satisfied.
Definition 2 (Constrained Mapping). A Tai mapping is said to be a constrained mapping if and only if lca(a1, a2) = lca(a1, a3) ⇔ lca(b1, b2) = lca(b1, b3) holds for any (a1, b1), (a2, b2), (a3, b3) ∈ M.
Definition 3 (Less-Constrained Mapping). A Tai mapping is said to be a less-constrained mapping if and only if lca(a1, a2) ≤ lca(a1, a3) ⇔ lca(b1, b2) ≤ lca(b1, b3) holds for any (a1, b1), (a2, b2), (a3, b3) ∈ M.
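To make the three conditions of Definition 1 concrete, the following Python sketch (not part of the original paper) checks whether a candidate mapping is a Tai mapping; the helper predicates is_ancestor and is_left_of are hypothetical and must be supplied for the two trees at hand.

# Illustrative sketch: check the Tai mapping conditions of Definition 1.
# is_ancestor(x, y) should return True iff x < y; is_left_of(x, y) iff x ≺ y.
def is_tai_mapping(M, is_ancestor, is_left_of):
    for (a1, b1) in M:
        for (a2, b2) in M:
            if (a1 == a2) != (b1 == b2):                    # one-to-one relation
                return False
            if is_ancestor(a1, a2) != is_ancestor(b1, b2):  # ancestor relation
                return False
            if is_left_of(a1, a2) != is_left_of(b1, b2):    # left-to-right relation
                return False
    return True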
3 Decomposition Functions
3.1 General Decomposition Function
Haussler presented a general decomposition method for the convolution kernel [9] by defining a relation between a structure and its parts, i.e., its sub-structures. Suppose x ∈ X is a decomposable structure and x1, x2, . . . , xN are its parts, where xi is in the set Xi for each i, 1 ≤ i ≤ N. We assume each Xi is nonempty, and the relation R is a Boolean function defined on the set X1 × X2 × . . . × XN × X [9]:
R(x1, x2, . . . , xN, x) = 1 iff x1, x2, . . . , xN are parts of x, and 0 otherwise
(1)
Then we have the inverse relation, which decomposes a structure into its sub-structures:
R⁻¹(x) = {(x1, x2, . . . , xN) | R(x1, x2, . . . , xN, x) = 1}
(2)
Equation (2) can be viewed as the general expression of the decomposition function, subject to two conditions:
1. Xi ≠ ∅, 1 ≤ i ≤ N
2. X1 ∪ X2 ∪ . . . ∪ XN = X
3.2 Decomposition Functions for Tree and Forest
In this subsection, first, we reduce the notion of the general decomposition functions to the tree and forest decomposition functions. Next, we give the functions under the Tai tree mapping, constrained tree mapping and less-constrained tree mapping.
Definition 4 (Basic Decomposition Function for Tree and Forest). Let XN denote the set of all trees or forests with N nodes. We define D : XN → X1 × XN−1 (with X1 = Σ) as the basic decomposition function for tree and forest if, for F ∈ XN, F1 ∈ X1 and F2 ∈ XN−1, we have F − F1 = F2. Here F − F1 means deleting F1 from F according to some ordering, and F2 must be embedded [10] in F. In the rest of this paper, for convenience, we represent trees and forests by the set of their top nodes, i.e., the nodes with no proper ancestor. For example, we represent the tree in Fig. 1 as {v}; if we decompose the tree into F1 ∈ X1 and F2 ∈ XN−1, i.e., delete v from the tree, then F1 is the root of the tree, a single node v ∈ Σ. So we have F1 = v and F2 = {v1, v2, . . . , vn}. Here we use the different expressions v and {v} to distinguish a single node from the tree rooted at it; if v has no child, then v and {v} are equivalent.
Fig. 1. Decomposition of a tree: the root is always deleted under the Tai mapping
If we continue to decompose, we obtain {v1} and {v2, v3, . . . , vn} as the result of the decomposition (Fig. 2). However, we cannot always execute the decomposition in this way, for two reasons: this kind of decomposition is not unique, and it may omit some of the sub-problems needed to solve the tree kernel or the edit distance problem. The intuitive idea is that a fast and good decomposition tends to delete more nodes at once while still covering all the sub-problems. Therefore, we should impose some constraint to improve the decomposition. We now derive the expressions of the decomposition functions under the Tai mapping, the constrained mapping, and the less-constrained mapping.
Tree Decomposition Function for Tai Mapping. Because a tree has only one top node, and the Tai mapping respects the ancestor-descendant relationship, we always have the following expression, whose correctness is obvious. Suppose {v} has n children, denoted ch(v)1, ch(v)2, . . . , ch(v)n:
D({v}) = [v, {ch(v)1, ch(v)2, . . . , ch(v)n}]
(3)
Fig. 2. Decomposition of a forest under the Tai mapping: one top node is deleted at a time
Forest Decomposition Function for Tai Mapping. Under the Tai mapping, the decomposition of a forest still deletes one node at a time, so it is very similar to the decomposition of a tree. Suppose v1 has n1 children and M is the set of node pairs involved in the tree mapping:
D({v1, v2, . . . , vn}) = [{ch(v1)1, ch(v1)2, . . . , ch(v1)n1}, {v2, . . . , vn}] if v1 ∉ M, and [D({v1}), {v2, . . . , vn}] if v1 ∈ M   (4)
Proof. Consider the tree kernel problem K({v1, v2, . . . , vn}, {u1, u2, . . . , um}). Under the Tai mapping, if (a1, b1) ∈ M and another pair (a2, b2) ∈ M satisfies a2 ∈ {a1}, then b2 ∈ {b1} must hold to satisfy the ancestor-descendant relationship, and vice versa. With (v1, u1) ∈ M fixed, and for some function g(x, y) that combines the sub-problems, we have K({v1, v2, . . . , vn}, {u1, u2, . . . , um}) = g(K(D({v1}), D({u1})), K({v2, . . . , vn}, {u2, . . . , um})). So we have D({v1, v2, . . . , vn}) = [D({v1}), {v2, . . . , vn}].
Tree Decomposition Function for Constrained and Less-Constrained Mapping. Compared with the Tai mapping, the constrained and less-constrained mappings preserve the same top-down relationship, so we still have (supposing node v has n children):
D({v}) = [v, {ch(v)1, ch(v)2, . . . , ch(v)n}]
(5)
Forest Decomposition Function for Constrained Mapping Di ({v1 , v2 , . . . , vn }) = [{v1 , . . . , vi−1 , vi+1 , . . . , vn }, {vi }]
(6)
Proof. For the kernel problem K({v1 , v2 , . . . , vn }, {u1, u2 , . . . , um }) and the Constrained Mapping, if there are (a1 , b1 ) ∈ M and a1 ∈ {vi }, b1 ∈ {uj }, then for any (a2 , b2 ) ∈ M we have a2 ∈ {vi } ⇔ b2 ∈ {uj }.The meaning of this restriction [7] is the tree-to-tree mapping relation is fixed in the Constrained Mapping. So we have K({v1 , v2 , . . . , vn }, {u1 , u2 , . . . , um }) = g(K({vi }, {uj })), i.e. kernel problem is decomposed into the subproblems only related to only one tree of the forest. So the decomposition function for Constrained Mapping is as equation (6).
Forest Decomposition Function for Less-Constrained Mapping.
Di,j({v1, v2, . . . , vn}) = [{v1, . . . , vi−1, vj+1, . . . , vn}, {vi, . . . , vj}], 1 ≤ i ≤ j ≤ n   (7)
Proof. For the less-constrained mapping, the restriction on the tree-to-tree mapping is relaxed to allow one-to-many tree mappings. With Lemmas 10 and 11 from [8], K({v1, v2, . . . , vn}, {u1, u2, . . . , um}) = g(K({vi1, . . . , vi2}, {uj1, . . . , uj2})). So equation (7) is proven.
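As an illustration of the decomposition functions above (a sketch added here, not taken from the paper), the following Python code represents a tree as a (label, children) tuple and a forest as a tuple of trees, and implements equations (3)-(7) directly; the data structures themselves are assumptions made for readability.

# Illustrative sketch: the tree/forest decomposition functions of Sect. 3.2.
def decompose_tree(forest):
    # Eqs. (3)/(5): delete the root v of the single tree {v}
    (v,) = forest
    return v[0], v[1]

def decompose_forest_tai(forest, v1_in_mapping):
    # Eq. (4): Tai mapping, one node deleted at a time from the first tree
    v1, rest = forest[0], forest[1:]
    if not v1_in_mapping:
        return v1[1], rest                  # [{ch(v1)_1..ch(v1)_n1}, {v2..vn}]
    return decompose_tree((v1,)), rest      # [D({v1}), {v2..vn}]

def decompose_forest_constrained(forest, i):
    # Eq. (6): split off the i-th tree (0-based index)
    return forest[:i] + forest[i + 1:], (forest[i],)

def decompose_forest_less_constrained(forest, i, j):
    # Eq. (7): split off the consecutive trees i..j (0-based, inclusive)
    return forest[:i] + forest[j + 1:], forest[i:j + 1]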
4 Framework for Tree Kernels Based on Decomposition
With the decomposition functions above, we can give a generalized framework for the tree-to-tree problem, just with different decomposition function. f (F1 , F2 ) = g(f ([D(F1 )]i , [D(F2 )]j )), i, j ∈ {1, 2}
(8)
Collins and Duffy [2] introduced the well-known parse tree kernel, which is based on decomposing the trees. With the decomposition function, we can re-express the convolution kernel [9] as
K({v1}, {v2}) = δ([D({v1})]1, [D({v2})]1) · K([D({v1})]2, [D({v2})]2)   (9)
K({v1, v2, . . . , vn}, {u1, u2, . . . , um}) = (1 + K({v1}, {u1})) × K({v2, . . . , vn}, {u2, . . . , um}) if m = n ≠ 0, and 0 if m ≠ n   (10)
Then we have a uniform expression for (9) and (10):
K(F1, F2) = (1 + K([D({v1})]1, [D({v2})]1)) × K([D({v1})]2, [D({v2})]2) if |top(F1)| = |top(F2)| ≠ 0, and 0 otherwise   (11)
Here we use the decomposition function for the Tai mapping, and for single nodes K(v1, v2) = δ(v1, v2), so equation (11) is a concrete instance of the general expression (8). Recently, Kuboyama and Shin [1] introduced a flexible kernel based on counting functions, which builds on the sparse kernel [2] and tree mappings [7] [8]. It also fits the framework of equation (8), as we show next. Here we follow [1] and use σ : X1 × X1 → R+ to denote the similarity of two nodes (a single node can be viewed as an ordinary vector). From [1], we have the recursive tree kernel expression under the Tai mapping:
K^TAI(F, φ) = K^TAI(φ, F) = 0   (12)
K^TAI(v1(F1) · F1′, v2(F2) · F2′)
= σ(v1, v2) × (1 + K^TAI(F1, F2)) × (1 + K^TAI(F1′, F2′))
+ K^TAI(v1(F1) · F1′, F2 · F2′) + K^TAI(F1 · F1′, v2(F2) · F2′)
− K^TAI(F1 · F1′, F2 · F2′)   (13)
With the decomposition function, we re-express equations (12) and (13) as the following equations (14) and (15):
K^TAI({v1, v2, . . . , vn}, φ) = K^TAI(φ, {v1, v2, . . . , vn}) = 0
(14)
K^TAI({v1, v2, . . . , vn}, {u1, u2, . . . , um})
= σ(v1, u1) × (1 + K^TAI({ch(v1)1, ch(v1)2, . . . , ch(v1)n1}, {ch(u1)1, ch(u1)2, . . . , ch(u1)m1})) × (1 + K^TAI({v2, . . . , vn}, {u2, . . . , um}))
+ K^TAI({ch(v1)1, ch(v1)2, . . . , ch(v1)n1, v2, . . . , vn}, {u1, u2, . . . , um})
+ K^TAI({v1, v2, . . . , vn}, {ch(u1)1, ch(u1)2, . . . , ch(u1)m1, u2, . . . , um})
− K^TAI({ch(v1)1, . . . , ch(v1)n1, v2, . . . , vn}, {ch(u1)1, . . . , ch(u1)m1, u2, . . . , um})
(15)
K^TAI(F1, F2)
= K^TAI([[D(F1)]1]1, [[D(F2)]1]1) × K^TAI([D(F1)]2, [D(F2)]2) × (1 + K^TAI([[D(F1)]1]2, [[D(F2)]1]2))
+ K^TAI(F1, [D(F2)]2) + K^TAI([D(F1)]2, F2)
− K^TAI([D(F1)]2, [D(F2)]2)   (16)
From equation (16) we see that the Tai mapping tree kernel can also be solved within the framework of equation (8). We continue by deriving the tree kernels under the constrained mapping, with expressions also adopted from [1]. We obtain equations (17), (18), and (19), which describe three different situations in the process of tree kernel computation.
K({v1}, {u1}) = σ(v1, u1) × (1 + K({ch(v1)1, ch(v1)2, . . . , ch(v1)n}, {ch(u1)1, ch(u1)2, . . . , ch(u1)m}))
+ K({v1}, {ch(u1)1, ch(u1)2, . . . , ch(u1)m}) + K({ch(v1)1, ch(v1)2, . . . , ch(v1)n}, {u1})
+ K({ch(v1)1, ch(v1)2, . . . , ch(v1)n}, {ch(u1)1, ch(u1)2, . . . , ch(u1)m})
− 2K({ch(v1)1, ch(v1)2, . . . , ch(v1)n}, {ch(u1)1, ch(u1)2, . . . , ch(u1)m})   (17)
K({v1}, {u1, u2, . . . , um}) = K({v1}, {u1}) − K({ch(v1)1, ch(v1)2, . . . , ch(v1)n1}, {u1})
+ K({v1}, {u2, . . . , um}) − K({ch(v1)1, ch(v1)2, . . . , ch(v1)n1}, {u2, . . . , um})
+ K({ch(v1)1, ch(v1)2, . . . , ch(v1)n1}, {u2, . . . , um})   (18)
P. Huang and J. Zhu
K({v1, v2, . . . , vn}, {u1, u2, . . . , um}) = K({v1}, {u1}) × (K({v2, . . . , vn}, {u2, . . . , um}) + 1)
+ K({v1}, {u1, u2, . . . , um}) + K({v1, v2, . . . , vn}, {u1})
+ K({v2, . . . , vn}, {u1, u2, . . . , um}) + K({v1, v2, . . . , vn}, {u2, . . . , um})
+ K({v2, . . . , vn}, {u2, . . . , um}) − overlaps   (19)
From equations (17) and (19), we can conclude that both are expressed in terms of the kernels of the intersections of their sub-parts, which are the results of the decomposition function. We can then derive a general expression of (17) and (19) in equation (20), which also obeys the framework of equation (8).
K(F1, F2) = K([D(F1)]1, [D(F2)]1) × (1 + K([D(F1)]2, [D(F2)]2))
+ K(F1, [D(F2)]1) + K([D(F1)]1, F2) + K(F1, [D(F2)]2) + K([D(F1)]2, F2)
+ K([D(F1)]2, [D(F2)]2) − overlaps   (20)
Looking carefully at equation (18), the difference between (18) and (20) is the term K([D(F1)]1, [D(F2)]1) × (1 + K([D(F1)]2, [D(F2)]2)). This situation is forbidden under the constrained mapping, so this sub-problem is simply set to zero, and equation (18) also conforms to equation (20). In the same way, we can convert the tree kernels for the constrained mapping (also called semi-accordant mapping [1]) and the less-constrained mapping (also called alignable mapping [1]) into the decomposition function form. Due to space limitations, they are not stated in this paper.
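For concreteness, the recursion of equations (9)-(10) can be sketched as follows (illustrative Python, not from the paper); the base case K(φ, φ) = 1 for two empty forests is an assumption needed to make the product well defined.

# Illustrative sketch of Eqs. (9)-(10); trees are (label, children) tuples.
def delta(a, b):
    return 1.0 if a == b else 0.0

def tree_kernel(t1, t2):
    # Eq. (9): match the two roots, then compare their child forests
    return delta(t1[0], t2[0]) * forest_kernel(t1[1], t2[1])

def forest_kernel(f1, f2):
    # Eq. (10): rigid left-to-right alignment; 0 when the sizes differ
    if len(f1) != len(f2):
        return 0.0
    if not f1:                  # assumed base case: two empty forests give 1
        return 1.0
    return (1.0 + tree_kernel(f1[0], f2[0])) * forest_kernel(f1[1:], f2[1:])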
5 Conclusion and Further Work
This paper gives a generalization of the decomposition method used in tree-to-tree comparison, especially for tree kernels based on decomposition. We give a clear definition of the decomposition function and, with the notion of tree mapping, present its specific forms for the Tai, constrained, and less-constrained tree mappings, which are widely used in recent work on constructing promising tree kernels [1]. We then design a framework based on the decomposition function to deal with the sparse tree kernel and related problems. Obviously, the complexity of the tree kernel computation depends on the decomposition function, so the selection of the decomposition function is the key step in tree kernel computation. In future work, we plan to test the efficiency and effectiveness of different decomposition functions in solving tree kernel problems. Useful and fast decomposition functions, combined with different tree mapping constraints, are also needed for real-world problems.
References
1. Kuboyama, T., Shin, K., Kashima, H.: Flexible Tree Kernels Based on Counting the Number of Tree Mappings. Proceedings of the International Workshop on Mining and Learning with Graphs (2006) 61–72
2. Collins, M., Duffy, N.: Convolution Kernels for Natural Language. Advances in Neural Information Processing Systems 14 (2001) 625–632
3. Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press (2004)
4. Kashima, H., Koyanagi, T.: Kernels for Semi-Structured Data. Proceedings of the 19th International Conference on Machine Learning (2002) 291–298
5. Zhang, K., Shasha, D.: Simple Fast Algorithms for the Editing Distance between Trees and Related Problems. SIAM Journal on Computing 18 (1989) 1245–1262
6. Tai, K.C.: The Tree-to-Tree Correction Problem. JACM 26 (1979) 422–433
7. Zhang, K.: Algorithms for the Constrained Editing Distance between Ordered Labeled Trees and Related Problems. Pattern Recognition 28 (1995) 463–474
8. Lu, C.L., Su, Z.Y., Tang, C.Y.: A New Measure of Edit Distance between Labeled Trees. Lecture Notes in Computer Science 2108 (2001) 338–348
9. Haussler, D.: Convolution Kernels on Discrete Structures. Technical Report UCSC-CRL-99-10, Dept. of Computer Science, Univ. of California at Santa Cruz (1999)
10. Kuboyama, T., Shin, K., Miyahara, T.: A Theoretical Analysis of Edit Distance Measures. Lecture Notes in Computer Science 3701 (2005) 323–337
11. Dulucq, S., Touzet, H.: Analysis of Tree Edit Distance Algorithms. Lecture Notes in Computer Science 2676 (2003) 83–95
12. Boser, B.E., Guyon, I.M., Vapnik, V.N.: A Training Algorithm for Optimal Margin Classifiers. In D. Haussler (Ed.), Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory (1992) 144–152
13. Vapnik, V.: The Nature of Statistical Learning Theory. Springer-Verlag (1995)
14. Schölkopf, B., Smola, A.J.: Learning with Kernels. MIT Press (2002)
15. Gärtner, T., Lloyd, J., Flach, P.A.: Kernels and Distances for Structured Data. Machine Learning 57 (2004) 205–232
16. Dzeroski, S., Lavrac, N. (Eds.): Relational Data Mining. Springer-Verlag (2001)
An Intelligent Hybrid Approach for Designing Increasing Translation Invariant Morphological Operators for Time Series Forecasting
Ricardo de A. Araújo¹, Robson P. de Sousa², and Tiago A.E. Ferreira²
¹ Center for Informatics, Federal University of Pernambuco, Brazil
² Statistics and Informatics Department, Catholic University of Pernambuco, Brazil
[email protected],
[email protected],
[email protected]
Abstract. In this paper, an intelligent hybrid approach is presented for designing increasing translation invariant morphological operators for time series forecasting. It consists of an intelligent hybrid model composed of a Modular Morphological Neural Network (MMNN) and an improved Genetic Algorithm (GA) with optimized genetic operators to accelerate its search convergence. The improved GA searches for the minimum number of time lags for a correct time series representation, as well as for the initial weights, architecture, and number of modules of the MMNN; each element of the improved GA population is then trained via the Back Propagation (BP) algorithm to further improve the parameters supplied by the improved GA. An experimental analysis of the proposed method is conducted using two real-world time series and five well-known performance measures, demonstrating the good performance of this kind of morphological system for time series forecasting.
1 Introduction
Many efforts have been made in the development of linear and nonlinear statistical models for time series forecasting [1,2,3,4,5]. However, these statistical models usually require interaction with a specialist to validate the predictions generated by the model, limiting the development of automatic forecasting systems. To overcome this limitation of statistical models, approaches based on artificial Neural Networks (NNs) have been proposed for nonlinear modelling of time series [6]. An important class of NNs are the Morphological Neural Networks (MNNs). Sousa [7] presented a particular MNN, referred to as the Modular Morphological Neural Network (MMNN), based on the Matheron Decomposition Theorem [8], which states that every increasing translation invariant morphological operator can be decomposed as a union of erosions or an intersection of dilations. In the context of morphological systems, an interesting work was presented by Araújo et al. [9], which defines an evolutionary morphological approach for financial time series forecasting. Building on Araújo et al. [9], an intelligent hybrid approach for designing increasing translation invariant morphological operators, via the Matheron Decomposition [8], is presented here for time series forecasting. It consists of an intelligent
hybrid model composed of an MMNN [7] and an improved Genetic Algorithm (GA) [10], which searches for: (1) the minimum number of time lags (and their corresponding specific positions) necessary to represent the time series, based on the Takens Theorem [11]; and (2) the initial weights, architecture and number of modules of the MMNN to solve the time series forecasting problem; then each element of the improved GA population is trained via Back Propagation (BP) algorithm to further improve the parameters supplied by the improved GA. Furthermore, experimental results are presented for two real world time series: Dow Jones Industrial Average (DJIA) Index and Sunspot. The results are discussed according to five well-known performance measurements: Mean Square Error (MSE), Mean Absolute Percentage Error (MAPE), Normalized Mean Square Error (NMSE), Prediction Of Change In Direction (POCID) and Average Relative Variance (ARV).
2 Time Series Forecasting Problem
A time series is a set of points, generally time equidistant, defined by, Xt = {xt ∈ R | t = 1, 2, 3 . . . N },
(1)
where t is the temporal index and N is the number of observations. Therefore, Xt is a sequence of temporal observations that is ordered and equally spaced. The main objective when applying forecasting techniques to a given time series is to identify regular patterns present in the data set in order to create a model capable of generating the next temporal patterns. In this context, a crucial factor for good forecasting performance is the correct choice of the time lags used to represent the given time series. Such relationship structures among historical data constitute a d-dimensional phase space, where d is the dimension capable of representing such relationships. Takens [11] proved that if d is sufficiently large, the phase space built in this way is homeomorphic to the phase space that generated the time series. A crucial problem in reconstructing the original state space is the correct choice of the variable d or, more specifically, the correct choice of the time lags.
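As an illustration (not part of the original paper), the phase space can be built by time-lag embedding as in the following Python sketch; the function name lag_embed and the toy series are hypothetical.

import numpy as np

# Illustrative sketch: build the input patterns from a set of selected time
# lags (e.g. the lags chosen by the improved GA) and the one-step-ahead target.
def lag_embed(series, lags):
    series = np.asarray(series, dtype=float)
    max_lag = max(lags)
    X = np.column_stack([series[max_lag - l:len(series) - l] for l in lags])
    y = series[max_lag:]
    return X, y

# Example with hypothetical lags 1, 2 and 4 on a toy series
X, y = lag_embed([1, 2, 3, 4, 5, 6, 7, 8], lags=[1, 2, 4])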
3 Background
3.1 Mathematical Morphology
The following equations are used in MMNN for designing increasing translation invariant morphological operators [7]: Dilation: δk = max(x + ak );
(2)
Erosion: εk = min(x − ak);
(3)
where x is the input signal and ak represents the structuring elements. In the particular case of time series forecasting, input signal x represents the time lags of the time series.
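A minimal sketch of these two operations in Python (added here for illustration, not from the paper) is:

import numpy as np

# Illustrative sketch of Eqs. (2)-(3): x holds the selected time lags,
# a_k is one structuring element.
def dilation(x, a_k):
    return np.max(x + a_k)    # delta_k = max(x + a_k)

def erosion(x, a_k):
    return np.min(x - a_k)    # epsilon_k = min(x - a_k)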
3.2 Improved Genetic Algorithm
The improved Genetic Algorithm (GA) used here is based on the work of Leung et al. [10], where special crossover and mutation operators are applied to accelerate the search convergence of the GA. The crossover operator is used to exchange information between two parents, chromosomes p1 and p2, obtained in the selection process via the roulette wheel method [10]. The parents generate four sons, C1, C2, C3 and C4, defined by the following equations [10]:
C1 = (p1 + p2)/2   (4)
C2 = pmax (1 − w) + max(p1 , p2 )w,
(5)
C3 = pmin (1 − w) + min(p1 , p2 )w,
(6)
C4 = ((pmin + pmax)/2)(1 − w) + C1 w   (7)
where w ∈ [0, 1] denotes the crossover weight (the closer w is to 1, the greater the direct contribution from the parents), min(p1, p2) (max(p1, p2)) denotes the vector whose elements are the minimum (maximum) of the corresponding elements of p1 and p2, and pmin (pmax) denotes the minimum (maximum) gene values. Among the four generated sons, the one with the largest fitness value is chosen as the son generated by the crossover process and is denoted C^best. After the crossover operator, C^best undergoes a mutation process in which three new sons are generated, defined by the following equation [10]:
(Mj)_k = C^best_k + γ_k ΔM_k, k = 1, 2, . . . , n,
(8)
where γ_k can take only the value 0 or 1, ΔM_k are randomly generated numbers such that pmin ≤ C^best_k + ΔM_k ≤ pmax, and n denotes the number of genes in the chromosome. The first mutation son (M1) is obtained according to (8) with only one term γ_k set to 1 (k is randomly selected within the range [1, n]) and the remaining terms set to 0. The second mutation son (M2) is obtained according to (8) with some randomly chosen terms γ_k set to 1 and the remaining terms set to 0. The third mutation son (M3) is obtained according to (8) with all γ_k set to 1.
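The crossover and mutation operators of equations (4)-(8) can be sketched in Python as follows (illustrative only; the uniform sampling of ΔM_k within the gene bounds is an assumption, since the paper only requires pmin ≤ C^best_k + ΔM_k ≤ pmax).

import numpy as np

# Illustrative sketch of the improved-GA operators, Eqs. (4)-(8).
# p_min / p_max are the per-gene bounds, w the crossover weight.
def crossover(p1, p2, p_min, p_max, w=0.9):
    c1 = (p1 + p2) / 2.0                              # Eq. (4)
    c2 = p_max * (1 - w) + np.maximum(p1, p2) * w     # Eq. (5)
    c3 = p_min * (1 - w) + np.minimum(p1, p2) * w     # Eq. (6)
    c4 = (p_min + p_max) / 2.0 * (1 - w) + c1 * w     # Eq. (7)
    return c1, c2, c3, c4                             # keep the fittest as C_best

def mutate(c_best, p_min, p_max, gamma):
    # Eq. (8): gamma is a 0/1 vector selecting which genes are perturbed
    delta = np.random.uniform(p_min - c_best, p_max - c_best)
    return c_best + gamma * delta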
4 MMNN Fundamentals
4.1 MMNN Architecture for the Matheron Decomposition
The following equations define the MMNN architecture via Matheron Decomposition [8] by dilations according to [7]. vk = δk = max(x + ak ),
(9)
An Intelligent Hybrid Approach
605
where x is the input signal of the MMNN. Network output: Y = min(v),
(10)
v = (v1 , v2 , . . . , vk ).
(11)
where The weight matrix, A, of the MMNN is defined by A = (a1 ; a2 ; . . . ; ak ),
(12)
where ak ∈ Rk , k = 1, 2, . . . , N , represent the MMNN weights (i.e. rows composed by structuring elements ak ). In a dual manner, the MMNN architecture via Matheron Decomposition [8] by erosions is defined by substituting dilations by erosions and the MMNN output Y = min(v) by Y = max(v). As an example, Figure 1 presents the MMNN architecture for the Matheron decomposition [8] by dilations.
Fig. 1. MMNN architecture used for the Matheron Decomposition via dilations
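A minimal sketch of the forward pass of this architecture (added for illustration, not from the paper) is given below; the weight matrix A is assumed to hold one structuring element per row.

import numpy as np

# Illustrative sketch: MMNN forward pass via the Matheron decomposition,
# Eqs. (9)-(11); the dual form uses erosions instead of dilations.
def mmnn_forward(x, A, by_dilations=True):
    if by_dilations:
        v = np.max(x + A, axis=1)    # v_k = max(x + a_k), one dilation per module
        return np.min(v)             # Y = min(v)
    v = np.min(x - A, axis=1)        # dual: one erosion per module
    return np.max(v)                 # Y = max(v)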
4.2 MMNN Training for the Matheron Decomposition
Based on the BP algorithm and the MMNN architecture defined in Section 4.1, Sousa [7] presented the MMNN training for the Matheron Decomposition [8] as follows:
A(n + 1) = A(n) − μ ∇A J(A), n = 0, 1, . . .   (13)
where μ is the learning rate and ∇A J(A) is the gradient matrix of some objective function J(A) (to be minimized with respect to the weight matrix A). For a given training set
{(xm, dm), m = 1, 2, . . . , M},   (14)
where dm is the desired output for a given input xm, J(A) is defined by
J(A) = (1/2) em²,   (15)
where em = dm − ym is the difference between the desired output and the actual output for the input xm, m = 1, 2, . . . , M. The gradient in equation (13) is given by
∂J/∂ak = −e (∂y/∂vk)(∂vk/∂ak), k = 1, 2, . . . , N.   (16)
According to Sousa [7], the partial derivatives in equation (16) are estimated by the methodology of Pessoa and Maragos [12] via rank indicator vectors c and smooth impulse functions Qσ, and are given by
∇A J(A) = −e · diag(c) · C,   (17)
where c = Qσ(y·1 − v) / (Qσ(y·1 − v)·1ᵀ) and C = (c1; c2; . . . ; cN), with ck = Qσ(vk·1 − x + ak) / (Qσ(vk·1 − x + ak)·1ᵀ).
5 The Proposed Approach
The approach proposed in this paper is based on an intelligent hybrid training mechanism for time series forecasting using increasing translation invariant morphological operators via the Matheron Decomposition [8]. The proposed method is based on the definition of the two main elements necessary for building an accurate forecasting system according to Ferreira [13]: (a) the minimum number of time lags adequate for representing the time series, and (b) a model structure capable of representing such underlying information. It is important to consider the minimum possible number of time lags for the correct representation of the series, because the model must be as parsimonious as possible. Following this principle, the proposed method consists of an intelligent hybrid model composed of an MMNN [7] and an improved GA [10], which searches for: (1) the minimum number of time lags (and their corresponding specific positions) to represent the time series (initially, a maximum number of time lags (MaxLags) is defined and the improved GA then searches for the number of time lags in the range [1, MaxLags] for each individual of the population); and (2) the initial weights, architecture and number of modules of the MMNN (initially, a maximum number of MMNN modules (MaxMod) is defined and the improved GA then chooses, for each candidate individual, the most adequate MMNN architecture (Section 4.1) and the number of MMNN modules in the range [1, MaxMod]); each element of the improved GA population is then trained via the Back Propagation (BP) algorithm (Section 4.2) to further improve the parameters supplied by the improved GA. Each element of the improved GA population represents an MMNN. As an example, Figure 2 represents an element of the improved GA population, where se(i), i = 1, 2, . . . , N, represent the structuring elements (MMNN weights). The terms arch, mod and lags represent the MMNN architecture, the number of MMNN modules and the number of time lags (and their corresponding specific positions) for the time series representation, respectively.
Fig. 2. Coding of the improved GA chromosome
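One possible encoding of this chromosome (an illustrative sketch, not the authors' implementation; the dataclass layout is an assumption) is:

from dataclasses import dataclass, field
import numpy as np

# Illustrative sketch: one way to encode the chromosome of Fig. 2.
@dataclass
class Chromosome:
    se: np.ndarray                              # structuring elements (MMNN weights)
    arch: int                                   # MMNN architecture (e.g. 0 = by dilations)
    mod: int                                    # number of MMNN modules, in [1, MaxMod]
    lags: list = field(default_factory=list)    # selected time lags, in [1, MaxLags]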
6 Performance Evaluation
Most works in the time series prediction literature employ only one performance criterion for model evaluation. The measure used is usually the MSE (Mean Squared Error),
MSE = (1/N) Σ_{j=1}^{N} (target_j − output_j)²,   (18)
where N is the number of patterns, target_j is the desired output for pattern j and output_j is the predicted value for pattern j. Although the MSE measure may be used to drive the prediction model in the training process, it cannot be considered alone as a conclusive measure for the comparison of different prediction models [14]. For this reason, other performance criteria should be considered to allow a more robust performance assessment. A second relevant measure is the MAPE (Mean Absolute Percentage Error), given by
MAPE = (100/N) Σ_{j=1}^{N} |(target_j − output_j) / X_j|,   (19)
where X_j is the time series at point j. A third performance measure is the NMSE (Normalized Mean Square Error), which is given by
NMSE = Σ_{j=1}^{N} (target_j − output_j)² / Σ_{j=1}^{N} (output_j − output_{j+1})²,   (20)
which associates the model performance with a random walk model [15]. Another relevant evaluation measure considers the correctness of the Prediction Of Change In Direction, or POCID for short, which is given by
POCID = 100 Σ_{j=1}^{N} D_j / N,   (21)
where
D_j = 1 if (target_j − target_{j−1})(output_j − output_{j−1}) > 0, and 0 otherwise.   (22)
There is also another relevant evaluation measure, the ARV (Average Relative Variance), which is given by
ARV = Σ_{j=1}^{N} (output_j − target_j)² / Σ_{j=1}^{N} (output_j − target̄)²,   (23)
which associates the model performance with the mean of the time series. The term target̄ denotes the mean of the time series.
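For reference, the five measures can be computed directly from their definitions as in the following Python sketch (not from the paper); the handling of the first pattern in POCID and the sample alignment in MAPE are simplifying assumptions.

import numpy as np

# Illustrative sketch of the performance measures of Eqs. (18)-(23).
def mse(target, output):
    return np.mean((target - output) ** 2)

def mape(target, output, series):
    return 100.0 * np.mean(np.abs((target - output) / series))

def nmse(target, output):
    return np.sum((target - output) ** 2) / np.sum(np.diff(output) ** 2)

def pocid(target, output):
    d = (np.diff(target) * np.diff(output)) > 0
    return 100.0 * np.mean(d)

def arv(target, output):
    return np.sum((output - target) ** 2) / np.sum((output - np.mean(target)) ** 2)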
7 Simulations and Experimental Results
The fitness function of the improved GA is defined by
fitness function = POCID / (1 + MSE + MAPE + NMSE + ARV).
(24)
The improved GA parameters used here are an initial population of 10 elements, a maximum of 1000 generations, a crossover weight w = 0.9, a mutation probability of 0.1, a maximum number of lags MaxLags = 10, and a maximum number of MMNN modules MaxMod = 100. The improved GA stopping criteria were the number of GA generations, an increase in the validation error (≥ 5%), and a decrease in the training error (≤ 10⁻⁶). The structuring elements (MMNN weights) were normalized to the range [−1, 1]. Each element of the improved GA population is then trained via the BP algorithm for 100 epochs, using the smooth rank function Qσ(x) = exp(−½(x/σ)²) with smoothing parameter σ = 0.05 and convergence factor μ = 0.01. A set of two time series was used as a test bed for the evaluation of the proposed method: the Dow Jones Industrial Average (DJIA) Index and the Sunspot series. All series investigated were normalized to lie within the range [0, 1] and divided into three sets according to Prechelt [16]: training set (50% of the points), validation set (25% of the points) and test set (25% of the points). The improved GA parameters were the same for all the time series. To build a reference performance level, results of the random walk (RW) model [15] are also presented.
7.1 Dow Jones Industrial Average (DJIA) Index Series
Table 1. Experimental results for the DJIA series
                   RW Model        Proposed Model
MSE                8.3911 · 10⁻⁴   8.2990 · 10⁻⁴
MAPE               9.68%           9.72%
NMSE               1.0000          0.9890
POCID              46.30%          53.01%
ARV                3.5051 · 10⁻²   3.4667 · 10⁻²
fitness function   3.9509          4.5136

The Dow Jones Industrial Average (DJIA) Index series corresponds to daily observations from January 1st 1998 to August 26th 2003, constituting a database of 1,420 points. For the prediction of the DJIA series (with a one-step-ahead prediction horizon), the proposed model automatically chose the time lags 1, 2, 4, 7, 9 and 10 to represent the time series and selected the MMNN architecture via Matheron
Decomposition by dilations (arch = 0) with 8 modules (mod = 8). Table 1 shows the results for all performance measurements. Figure 3 shows the actual DJIA values (solid line) and the predicted values generated by the proposed model (dashed line) for the last 100 points of the test set.
Fig. 3. Prediction results for the DJIA Index series (test set): actual values (solid line) and predicted values (dashed line)
7.2 Sunspot Series
The Sunspot series consisted of the total annual measures of the sun spots from the years of 1700 to 1988, constituting a database of 289 points. For the prediction of the Sunspot series (with a one-step-ahead prediction horizon), the proposed model automatically chose the time lags 1, 2, 4 and 10 to represent the time series and selected the MMNN architecture via Matheron Decomposition by dilations (arch = 0) with 10 modules (mod = 10). Table 2 shows the results for all performance measurements. Figure 4 shows the actual Sunspot values (solid line) and the predicted values generated by the proposed model (dashed line) for the 70 points of the test set.

Table 2. Experimental results for the Sunspot series
                   RW Model        Proposed Model
MSE                2.7003 · 10⁻²   1.3443 · 10⁻²
MAPE               55.27%          30.48%
NMSE               1.0000          0.4681
POCID              76.81%          84.05%
ARV                0.4050          0.2016
fitness function   1.3325          2.6132
Fig. 4. Prediction results for the Sunspot series (test set): actual values (solid line) and predicted values (dashed line)
8 Conclusion
An intelligent hybrid approach was presented for designing increasing translation invariant morphological operators, via the Matheron Decomposition, for time series forecasting. The experimental results used five different metrics: Mean Square Error (MSE), Mean Absolute Percentage Error (MAPE), Normalized Mean Square Error (NMSE), Prediction Of Change In Direction (POCID) and Average Relative Variance (ARV). The proposed method was applied to one real-world time series from the financial market, with all its dependence on exogenous and uncontrollable variables (the Dow Jones Industrial Average (DJIA) Index), and to a natural phenomenon time series (Sunspot). The proposed model obtained better behavior (in terms of the fitness function) than the random walk model [15] for the analyzed natural phenomenon time series. However, for the analyzed financial time series, the proposed model obtained behavior only slightly better (in terms of the fitness function) than a random walk model: the prediction for the analyzed financial time series is shifted one step ahead of the original values. This observation is also in consonance with the work of Sitte and Sitte [17] and Araújo et al. [9], which showed that predictions of financial time series exhibit a characteristic one-step shift with respect to the original data; they argued that such financial time series behave like a random walk model. However, Ferreira [13] showed that this random-walk-like behavior for financial time series can be corrected by a phase prediction adjustment. Future work will consider the development of a parallel implementation of the proposed model's training procedure for financial time series forecasting, where the random walk dilemma is expected to be overcome without the need for the phase adjustment procedure proposed by Ferreira [13].
References
1. G. E. P. Box, G. M. Jenkins, and G. C. Reinsel. Time Series Analysis: Forecasting and Control. Prentice Hall, New Jersey, third edition, 1994.
2. T. Ozaki. Nonlinear Time Series Models and Dynamical Systems, volume 5 of Handbook of Statistics. North-Holland, Amsterdam, 1985.
3. M. B. Priestley. Non-Linear and Non-Stationary Time Series Analysis. Academic Press, 1988.
4. T. Subba Rao and M. M. Gabr. Introduction to Bispectral Analysis and Bilinear Time Series Models, volume 24 of Lecture Notes in Statistics. Springer, Berlin, 1984.
5. D. E. Rumelhart and J. L. McClelland. Parallel Distributed Processing, Explorations in the Microstructure of Cognition, volume 1 & 2. MIT Press, 1987.
6. G. Zhang, B. E. Patuwo, and M. Y. Hu. Forecasting with artificial neural networks: The state of the art. International Journal of Forecasting, 14:35–62, 1998.
7. R. P. Sousa. Design of translation invariant operators via neural network training. PhD thesis, UFPB, Campina Grande, Brazil, 2000.
8. G. Matheron. Random Sets and Integral Geometry. Wiley, New York, 1975.
9. R. A. Araújo, F. Madeiro, R. P. Sousa, L. F. C. Pessoa, and T. A. E. Ferreira. An evolutionary morphological approach for financial time series forecasting. In Proceedings of the IEEE Congress on Evolutionary Computation, Vancouver, Canada, 2006.
10. F. H. F. Leung, H. K. Lam, S. H. Ling, and P. K. S. Tam. Tuning of the structure and parameters of the neural network using an improved genetic algorithm. IEEE Transactions on Neural Networks, 14(1):79–88, January 2003.
11. F. Takens. Detecting strange attractors in turbulence. In A. Dold and B. Eckmann, editors, Dynamical Systems and Turbulence, volume 898 of Lecture Notes in Mathematics, pages 366–381, New York, 1980. Springer-Verlag.
12. L. F. C. Pessoa and P. Maragos. MRL-filters: A general class of nonlinear systems and their optimal design for image processing. IEEE Transactions on Image Processing, 7:966–978, 1998.
13. T. A. E. Ferreira. A New Hybrid Intelligent Methodology for Time Series Forecasting. PhD thesis, Federal University of Pernambuco, 2006.
14. M. P. Clements and D. F. Hendry. On the limitations of comparing mean square forecast errors. Journal of Forecasting, 12(8):617–637, Dec. 1993.
15. T. C. Mills. The Econometric Modelling of Financial Time Series. Cambridge University Press, Cambridge, 2003.
16. Lutz Prechelt. Proben1: A set of neural network benchmark problems and benchmarking rules. Technical Report 21/94, 1994.
17. Renate Sitte and Joaquin Sitte. Neural networks approach to the random walk dilemma of financial time series. Applied Intelligence, 16(3):163–171, May 2002.
Ordering Grids to Identify the Clustering Structure Shihong Yue, Miaomiao Wei, Yi Li, and Xiuxiu Wang School of Electric Engineering and Automation, Tianjin University, Tianjin300072, China {shyue1999,liyips,wmm1976}@tju.edu.cn http://www.tju.edu.cn/colleges/automate/php
Abstract. Almost all well-known clustering algorithms require input parameters, and these parameters may be difficult to determine. OPTICS (Ordering Points To Identify the Clustering Structure) is a primary semi-clustering method used to visualize the data structure and to determine the input parameters of a given clustering algorithm. However, OPTICS has a complexity of O(n² log n), which is too high to apply to large datasets of n points. In this paper, we present a new semi-clustering method that partitions the data space into a number of grids and Orders all Grids To Identify the Clustering Structure; accordingly, the new method is called OGTICS. OGTICS has only linear complexity O(n) and is thus much faster than OPTICS. Consequently, OGTICS can be applied to very large datasets.
1 Introduction
Larger and larger amounts of data are collected and stored in databases, increasing the need for efficient and effective analysis methods that make use of the information implicitly contained in the data. Clustering analysis is a fundamental tool that helps a user understand the natural grouping or structure in a data set. Therefore, the development of improved clustering algorithms has received a lot of attention in the last two decades [1, 2]. A clustering task aims to group the objects of a database into a set of meaningful subclasses such that similar data are assigned to the same group and dissimilar data are separated as far as possible. However, almost all clustering algorithms require values for input parameters that are hard to determine, especially for real-world data sets containing high-dimensional objects. For example, the Hard c-Means (HCM) algorithm [3] requires a prior number of clusters and a good initialization as input. Moreover, HCM is very sensitive to these parameter values, often producing very different partitionings of the data set even for slightly different parameter settings. In addition, high-dimensional real data sets often have a very skewed distribution that cannot be revealed by a clustering algorithm using only one global parameter setting. To overcome this problem, an effective method hierarchically outputs different clustering results at different parameter settings to ensure that no useful clustering result is missed. OPTICS (Ordering Points To Identify the Clustering Structure) is a primary semi-clustering method used to visualize the data structure and determine the input parameters of a given clustering algorithm; it does not produce a clustering of a data set explicitly, but instead creates an augmented ordering of the database
representing its density-based clustering structure. In principle, this cluster ordering contains information equivalent to the density-based clusterings corresponding to a broad range of parameter settings. As a primary semi-clustering algorithm, OPTICS must be fast and effective; it is a versatile basis for both automatic and interactive cluster analysis. For medium-sized data sets, the feasibility of OPTICS has been demonstrated and the result can be represented graphically [4, 5]. Because of the ever-increasing amount of data in datasets, most recent research related to the clustering task has been directed towards efficiency, and OPTICS is too slow to be applied to very large datasets. In comparison, grid-clustering approaches do not require predetermining the number of clusters, choosing appropriate cluster centers in an initial step, or choosing a similarity measure suited to the shapes of the data; they have therefore achieved great success and are a main concern [6]. A grid-clustering approach produces a hierarchical structure on the underlying dataset: it partitions the data space into a great number of grids, assigns each data point to a unique grid, and finally merges each group of similar grids into a cluster, instead of operating directly on the original data points as HCM and similar algorithms do. Consequently, the runtime depends only on the number of grids rather than the number of data points, yielding linear computational complexity. This is a further reason why grid-based clustering is very attractive in many applications [7–11]. In this paper, we introduce a new grid-based semi-clustering algorithm to replace OPTICS. The new algorithm has only linear computational complexity and thus can be applied to very large datasets. The rest of the paper is organized as follows. Section 2 briefly reviews OPTICS and presents the basic notions and performance of our new algorithm, OGTICS. The application of the new algorithm for cluster analysis is demonstrated by the experiments in Section 3. Section 4 concludes the paper.
2 Proposed Method: OGTICS
In this section we first recall the fundamental OPTICS method, including its basic properties and disadvantages, then introduce the OGTICS method and stress its advantages over OPTICS, and finally analyze the complexity of OGTICS.
2.1 OPTICS Method
OPTICS is a semi-clustering method that aims to identify the data structure under different parameter settings. OPTICS defines three basic notions: core object (data), density-reachable (directly density-reachable), and reachability-distance:
1) core object: an object p is called a core object (data) if it satisfies the condition Card(Nε(p)) ≥ MinPts, where Card(N) denotes the cardinality of the set N, MinPts is a user-specified threshold for the set of objects D, ε is the minimum neighborhood radius of p, and Nε(q) is the subset of D contained in the ε-neighborhood of q;
2) density-reachable: an object p is directly density-reachable from a core object q with respect to ε and MinPts in a set of objects D if A) p ∈ Nε(q) and B) Card(Nε(q)) ≥ MinPts. An object p is density-reachable from an object q with respect to ε and MinPts if there is a chain of objects p1, ..., pn, p1 = q, pn = p, such that pi ∈ D and pi+1 is directly density-reachable from pi with respect to ε and MinPts;
3) reachability-distance: the reachability-distance of an object p with respect to another object o is the smallest distance such that p is directly density-reachable from o, provided o is a core object.
OPTICS creates an ordering of a dataset, additionally storing the core-distance and a suitable reachability-distance for each object. A two-dimensional coordinate system acts as the visible entry to the data structure: all data points are projected onto a queue on the x-axis and the following steps are performed [5]: 1) the current object is inserted into the queue on the x-axis if it is a core object; 2) objects connected to the current core object are inserted into the queue of existing objects on the x-axis for further expansion; the objects contained in the queue are sorted by their reachability-distance to the closest core object from which they are directly density-reachable; 3) if the current object is a core object, further candidates for expansion may be inserted into the queue on the x-axis; 4) objects already in the queue are moved further towards the top of the queue if their new reachability-distance is smaller than their previous reachability-distance. After these steps a statistical histogram is created, consisting of the queue on the x-axis and the corresponding reachability distances on the y-axis. The curse of OPTICS is the computation of the spherical neighborhood of each object: constructing the neighborhoods of the n objects requires repeatedly scanning the entire dataset, at a cost of O(n² log n). Hence, we replace the spherical neighborhood of an object with a grid-shaped neighborhood: to find the grid-shaped neighborhoods, the entire dataset only needs to be scanned once, and the computation cost is reduced greatly.
2.2 Main Performances of OGTICS
In this paper we explore a new semi-clustering approach to analyze the data structure of a dataset. The dataset is given by the matrix M = [x1, x2, …, xn] ∈ R^(d×n), where each column of M, xi = (xi1, xi2, ..., xid)^T ∈ R^d, is a single data point. The proposed OGTICS method consists of the following steps: 1) divide the data space into a number of grids and assign all data points to these grids; 2) partition all grids into a few groups; 3) compute the center of each grid; 4) order all grids into a queue on the x-axis; 5) generate a statistical histogram of the number of data points across all grids; 6) determine the optimal number of clusters and partitions. The six steps are detailed as follows: 1) Compute the minimal regular polyhedron that encloses all data, GRID = [l, r]^d. After dividing the interval [l, r] in each dimension (M−1) times, M^d original grids are obtained. Each data point is assigned to a unique grid, and all empty grids are deleted.
Any two overlapping grids may have 2⁰, 2¹, 2², …, 2^(d−1) common vertexes in d-dimensional data space, where 2⁰ refers to only one common vertex of the two grids. Fig. 1 shows all overlapping cases of any two grids in 3-dimensional data space, where two groups of data points are located in the two grids respectively. Usually, the center distance of two grids can characterize these different degrees of overlap. In fact, the maximal center distance of any two overlapping grids must be less than √n·a, where a is the edge length of a grid. So √n·a can play a margin role in judging whether any two grids overlap. As the center distance of two grids decreases to √(n−1)·a, √(n−2)·a, …, √2·a, the two grids have closer and stronger similarity.
Fig. 1. All overlapping cases of any two grids in 3-dimensional data space
2) Any two grids gridi {[li1, ri1]; …; [lid, rid]} and gridj {[lj1, rj1]; …; [ljd, rjd]} have at least one common vertex only if the following condition is satisfied
lik = rjk or rik = ljk or lik = ljk, k = 1, 2, ..., d
(1)
In light of the condition, we use the following procedures in Table 1 to partition all grids in Ω into different groups. The procedure is similar to the step that groups all data by directly density-reachable distance in the OPTICS. Table 1. The procedure of partitioning all grids into different groups
For p = 1 to d   // the ε and MinPts are specified
    G1 = Ω
    Count the left neighbor of gridi{([l1; h1]; … ; [ls; hs])}
    Let G2 = {grids with the same left neighbor}
    Count the right neighbor of gridi{([l1; h1]; … ; [ls; hs])} in G2
    Let G3 = {grids with the same right neighbor}
    Query all groups of connected original grids by (1) to create a chain of grids
3) The center of each grid is taken as the arithmetic average of all vertexes of the grid, and the set of all grid centers is denoted G = {g1, g2, ..., gn}. 4) Following the behavior of OPTICS, we order all grids that have been grouped in step 2) into a queue on the x-axis with a sequence number Sequence(). The ordering has no essential difference from that of OPTICS, except that the grids are inserted randomly instead of by the closest search used in OPTICS, and the grids are ordered within each group rather than over the entire set of data (see Table 2). 5) Generate a statistical histogram of the number of data points across all grids, where the grids on the x-axis are arranged by the value of Sequence(gridi) from small to large and the y-axis is the number of data points in the i-th grid. 6) Determine the optimal number of clusters and the corresponding partitions.
Table 2. Ordering all grids into the queue on the x-axis
Compute the margin distance of any two grids: σ (σ = √n·d)
Create a function for the position of each grid on the x-axis: Sequence(gridh)
Let grid-(i+1) be an unlabeled grid with center gi+1 in Ω; i = 1, i++
Compute the distance from gi+1 to every grid in cluster 1, cluster 2, …, cluster i
If d_{i,s1}, d_{i,s2}, ..., d_{i,sm} ≤ σ
    Merge all clusters associated with s1, s2, …, sm into the cluster of s1
    Insert gridi+1 into the queue of grids on the x-axis; Sequence(gridi) = Sequence(gridi) + 1
Else assign a new cluster label cluster(i+1) to gridi+1
Decrease σ to σ = √(n−1)·d, √(n−2)·d, …, d; repeat.
2.3 Time Cost of OGTICS
The time cost of OGTICS consists of two parts: partitioning the data space into grids and assigning all data points to them, and ordering all grids to generate the statistical histogram under different parameter settings. The former is equivalent to the time cost of grid-clustering methods, that is, O(n); the latter is smaller than the former, since the number of grids is usually small compared with the number of data points. In sum, the two parts have linear runtime. Furthermore, if the ordering is repeated a number of times for different parameters, say m times, the total runtime is almost O(mn).
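A minimal sketch of the grid partitioning and counting behind steps 1) and 5) (added for illustration, not from the paper) is:

import numpy as np
from collections import defaultdict

# Illustrative sketch: divide the bounding box [l, r]^d into M^d grids, assign
# every point to its grid, and keep only non-empty grids. The histogram of
# step 5 is then just these per-grid counts in grid-sequence order.
def assign_to_grids(X, M):
    """X has one data point per row; returns {grid index tuple: point count}."""
    l, r = X.min(axis=0), X.max(axis=0)
    width = (r - l) / M
    width = np.where(width > 0, width, 1.0)                  # guard constant dimensions
    idx = np.minimum(((X - l) / width).astype(int), M - 1)   # clip boundary points
    counts = defaultdict(int)
    for cell in map(tuple, idx):
        counts[cell] += 1                                     # empty grids never appear
    return counts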
3 Experiments
All experiments in this paper were performed in the following environment: Windows 2000, Matlab 6.5, a 3.2 GHz CPU, and 512 MB of memory. In what follows, we call the new algorithm OGTICS. In this experiment, we apply HCM to two artificial datasets and the LETTER dataset from the UCI machine learning repository [12]. All clusters in the artificial datasets are generated by randn() and other functions in the Matlab toolbox. The first of the two artificial datasets contains 584 data points distributed over 6 clusters and is used to simulate a dataset of
even-density and different-size clusters. The second contains 2036 data points in 3 clusters and is used to simulate a dataset of arbitrarily shaped (spiral) clusters. LETTER is a well-known dataset with 20,000 records distributed over 26 clusters in a 16-dimensional data space and is used to test the efficiency of OGTICS. However, the original LETTER dataset cannot be used to test a clustering algorithm, since there are no clear boundaries among the different clusters. This can be verified by projecting the records of LETTER onto a 2-dimensional space, as shown in Fig. 2: the two clusters (indicated by different colors) have no clear boundaries. Hence we modify the LETTER dataset by the rule that two records are deleted if they are close enough but have different cluster labels. Consequently, the modified LETTER dataset retains only 15,426 records distributed over 14 clusters that have clear boundaries to each other (some small clusters are deleted).
Fig. 2. Data structure and (data number)-Sequence statistical histogram of the LETTER dataset
Fig. 3. Structures and (data number)-Sequence statistical histogram of the six-cluster dataset
Fig. 4. Structure and (data number)-Sequence statistical histogram of the three-cluster dataset

Table 3. Performances of OPTICS and OGTICS (*/*/* corresponds to the three datasets)
          Suggested number of clusters                      Real number    Runtimes (s)
          σ = √2·a     σ = √(n/2)·a    σ = √n·a             of clusters    (average values)
OPTICS    (9/8/10)     (6/4/14)        (2/3/10)             (6/3/14)       (0.34/18.26/67.22)
OGTICS    (7/5/10)     (5/4/14)        (2/3/14)             (6/3/14)       (0.29/1.86/3.44)
We have compared the performance of OGTICS and OPTICS on the three datasets, as described below. Fig. 3 shows that the six-cluster dataset has six clear aggregating parts in the statistical histogram created by OGTICS. Similarly, the three-cluster dataset has three clear aggregating parts in its statistical histogram (see Fig. 4), and the modified LETTER dataset has fourteen clear aggregating parts in its statistical histogram based on OGTICS. Furthermore, as the number of data points in the three datasets increases, the advantage of OGTICS over OPTICS becomes clearer and clearer (see Table 3). Table 3 also shows that OGTICS can give a structural output of the data structure when the partition size of the grids is chosen well. However, OGTICS has the same shortcoming as the class of grid-clustering algorithms, namely that the partition size of the data space must be chosen in advance.
4 Conclusion To identify the data structure of very large dataset, some semi-clustering algorithms have proposed. So far, none of them are succeeded since there are too high computational complexity rather than linear computational complexity. In contrast, the new proposed OGTICS has linear complexity and is succeeded in very large dataset (LETTER) to great extent. The author of OPTICS had pointed that an interesting and
important direction is to modify OPTICS for a large gain in efficiency at the expense of a limited amount of accuracy. Consequently, OGTICS provides an instructive solution in this category. However, it is still infeasible to apply OGTICS to a database containing several million high-dimensional objects. This is a common challenge for current clustering and semi-clustering algorithms.
Acknowledgment
This work was supported by the National Science Foundation of China under Grants Nos. 60572065 and 60532020.
References
1. Huang, Z., Ng, M.K., Rong, H.: Automated Variable Weighting in k-Means Type Clustering. IEEE Trans. Patt. Anal. Mach. Intell. 27 (2005) 657-668
2. Xu, R., Wunsch, D.: Survey of Clustering Algorithms. IEEE Trans. Neural Networks 16 (3) (2005) 645-661
3. Tibshirani, R., Walther, G., Hastie, T.: Estimating the Number of Clusters in a Dataset via the Gap Statistic. J. Royal Society B 355 1393 (2000) 135-146
4. Ankerst, M., Breunig, M., Kriegel, H.P.: Ordering Points to Identify the Clustering Structure. Proc. ACM SIGMOD'99 Int. Conf. on Management of Data, PA (1999)
5. Kim, Y.I., Kim, D.W., Lee, D.: A Cluster Validation Index for GK Cluster Analysis Based on Relative Degree of Sharing. Information Sci. 168 (1-4) (2004) 225-242
6. Lange, T.: Stability-Based Validation of Clustering Solutions. Neural Comp. 16 (2004) 1299-1323
7. Yue, S., Li, P.: In the Validity of Clustering Algorithm. J. Chinese Electronic 13 (1) (2005) 841-847
8. Eden, W., Ma, M., Tommy, W.S.: A New Shifting Grid Clustering Algorithm. Patt. Recog. 37 (2004) 503-514
9. Bouguessa, M., Wang, S., Hao, S.: An Object Solution to Clustering Index. Pattern Recognition Letters 27 (2006) 1419-1430
10. Yu, J.: A General C-Means Clustering Approach. IEEE Trans. Patt. Anal. Mach. Intell. 25 (8) (2003) 944-952
11. Yu, J., Chen, Q.: The Search Range of the Optimal Number of Clusters. Science in China E 32 (2002) 275-280
12. UCI Machine Learning Repository: ftp://ftp.cs.cornell.edu/pub/smart/
An Improve to Human Computer Interaction, Recovering Data from Databases Through Spoken Natural Language Omar Florez-Choque and Ernesto Cuadros-Vargas National University of San Agustin, Arequipa-Peru San Pablo Catholic University, Arequipa-Peru
[email protected],
[email protected] http://socios.spc.org.pe/omarflorez/
Abstract. The fastest and most straightforward way of communication for mankind is the voice. Therefore, the best way to interact with computers should also be the voice. That is why people are currently searching for new ways to interact with computers. This interaction is improved if the words spoken by the speaker are organized in natural language. In this article, a model is proposed to recover information from databases through queries in Spanish natural language, using the voice as the means of communication. This model incorporates a hybrid intelligent system based on genetic algorithms and a Kohonen Self-Organizing Map (SOM) to recognize the phonemes present in a word over time. This approach allows us to reconstruct a word with speaker independence. Furthermore, the use of a compiler with a type-2 grammar in the Chomsky hierarchy is proposed to support the syntactic and semantic structure of the Spanish language. Our experiments suggest that spoken natural language notably improves human-computer interaction when compared with traditional input methods such as the mouse or keyboard.
1 Introduction
According to Wickens [1], the information in human-computer interaction is conveyed by means of two modalities, sounds and images. The user performs several tasks at the same time; if these tasks require similar resources from the user, interference effects take place, which increase the user's mental load. An interface based on natural language to recover information from a database allows us to reduce the user's mental load, since data are input by voice and the processed information is presented in a visual interface. This model allows a high degree of interaction between the user and the computer. Laurel [2] defines this degree of interaction in proportion to the anthropomorphic abilities the user attributes to the computer during the interaction. In that sense, Isolated Word Recognition systems were some of the first applications in voice processing, due to the ease of recognizing words directly from the voice spectrum. However, these systems depended on the speaker
and involved storing and training the whole word, which affected the scalability of the system in very extensive domains such as natural language. For that reason, nowadays words are recognized through the identification of phonemes, from which it is possible to reconstruct a word. This task involves several phases, such as identifying the limits where words begin and end inside the voice spectrum, normalizing the data, applying sliding windows to split the signal into segments (each segment can store a phoneme), and finally extracting the main features of each segment. Atal demonstrated that the short-time power spectrum of speech at different frequencies presents a high degree of correlation and thus may be very useful for speech recognition with speaker independence [3]. One of the contributions of our work lies in proposing the use of a sequence of 12 Mel-Frequency Cepstrum Coefficients (MFCC) over a short period of time (5 milliseconds) as training vectors of a hybrid intelligent system based on neural networks and genetic algorithms for word recognition in real time with speaker independence. This new approach arises from the observation that, for the selected segments, some MFCC coefficients showed a high correlation across different words, which indicates the presence of similar phonemes in different words. The SOM neural network is very useful for reducing the dimensionality of each feature vector and for visualizing the clusters of phonemes on a 2-dimensional map, as shown in Figure 1. Voice2SQL is particularly useful for users with difficulties in the use of conventional input devices such as keyboards or pointing devices. These users, generally with disabilities, will be able to carry out queries on a database
Fig. 1. An illustration of the trajectory of the word "kohonen" on the SOM. The arrowheads represent the location of the neuron with the smallest distance function with respect to the short-time feature vectors every 5 milliseconds. Notice the presence of clusters of phonemes organized according to the color of the neuron on the map. Similar colors represent similar phonemes.
through a simple spoken user interface. In the following pages, we propose a system for the generation of SQL sentences based on spoken natural language. The rest of this paper is organized as follows. Section 3 describes the steps performed to obtain a feature vector from the voice signal. Section 4 presents another of our contributions, the linguistic criteria used in the conversion of natural language to SQL sentences. In Section 5 we discuss the results obtained. Lastly, Section 6 provides our conclusions.
2 Previous Works
Previous works in the voice recognition implied the detection of features from a signal in the time-domain (waveform) [4], [5], [6], [7], these features included measures in the time-domain as the energy of the signal, the average rate of zero-crossing, and the correlation of features using a short-time power spectrum of speech in different frequencies. Two of the main problems in the voice recognition are the variability of the spectrum energy and the search of patterns to form possible words. The problem of variability can be partially solved by the coefficients of Cepstral. These coefficients attenuate the components of the spectrum. Some of the methods proposed in the past that fulfilles this task were the average subtraction of Cepstrals [8] and the RASTA filtering [9]. The use of a group of phonemes represented by Hidden Markov models, is also broadly used [10], [11], [12]. Also, for features extraction, the creators of Sphinx II [10] has shown that Cepstral coefficients provide a better recognition performance because these coefficients are evaluated in a logarithmic and equispaced scale. On the other hand, the professor Teuvo Kohonen in 1988, introduced a new architecture of neural networs called Self-Organizing Map (SOM), Kohonen used SOM for the recognition of phonetic units as a speaker adaptative system with an unlimited vocabulary. Finally, along these years a kind of Intelligent Hybrid System denominated evolutionary neural networks have been used in the design of the architecture of a neural network to solve the classic XOR problem ([13], [14], [15]) with interesting theoretical results. Another application is the use of genetic algorithms to select the training data and to interpret the behavior of the outputs of the neural network. Schaffer, Whitley and Eshelman (1992) have extensively studied these areas.
3 Feature Extraction from the Speech Signal
It is possible to obtain information from the signal in the short-time power spectrum, so the number of extracted features will be smaller than the total number of points obtained based on the signal without processing. We propose
the use of MFCC to extract the signal features and a later data normalization to obtain meaningful distances among vectors.
3.1 Mel-Frequency Cepstrum Coefficients
The Mel-Frequency Cepstrum Coefficients (MFCC) characterize the voice spectrum according to the number of chosen coefficients; 12 or 13 are usually good choices for speech recognition [12]. The MFCC are defined by the real part of the cepstrum coefficients of a short-time signal, obtained after a segmentation process using sliding windows (such as Hamming windows) and Fourier and cosine transforms on each segment of the signal. The MFCC are an improvement over the conventional cepstral coefficients: they take into account a logarithmic frequency scale called the Mel-frequency scale. The Mel-frequency scale was proposed by Stevens, Volkmann and Newman in 1937 and is based on a perceptual musical pitch scale. Above 500 Hz, exponentially spaced frequency intervals are perceived as if they were linearly spaced. This scale may be calculated by means of equation (1):

Mel(f_s) = 2595 \log_{10}(1 + f_s / 700)    (1)

Then it is possible to define a bank of M filters, where each filter is a triangular filter with the following transfer function:

H_m[k] = 0,  k < f[m-1]
H_m[k] = \frac{2(k - f[m-1])}{(f[m+1] - f[m-1])(f[m] - f[m-1])},  f[m-1] \le k \le f[m]
H_m[k] = \frac{2(f[m+1] - k)}{(f[m+1] - f[m-1])(f[m+1] - f[m])},  f[m] \le k \le f[m+1]
H_m[k] = 0,  k > f[m+1]

As depicted in Figure 2, each filter computes the average spectrum around its center frequency with growing bandwidth; the area inside each filter is always the same. Let f_1 and f_h be the lowest and highest frequencies covered by the filter bank, measured in Hertz, let f_s be the sampling frequency in Hertz, and let M be the number of filters in the bank. The boundary points of the triangular filters in Figure 2 are evenly spaced on the Mel scale by means of

f[m] = (N / f_s) \, Mel^{-1}\left( Mel(f_1) + m \frac{Mel(f_h) - Mel(f_1)}{M + 1} \right)    (2)

where the Mel scale is defined by

Mel(f) = \tilde{f} = 1125 \ln(1 + f / 700)    (3)
Fig. 2. Bank of triangular filters used to calculate the Cepstrum coefficients in the Mel scale. Notice that the area is the same for each filter.
and its inverse by

Mel^{-1}(\tilde{f}) = 700 (e^{\tilde{f}/1125} - 1)    (4)

Then it is possible to calculate the log energy at the output of each filter by means of

E[m] = \ln\left( \sum_{k=0}^{N-1} |F[k]|^2 H_m[k] \right)    (5)

Finally, the cepstrum coefficients on the Mel-frequency scale are the Discrete Cosine Transform (DCT) of the outputs of the M filters:

MFCC[n] = \sum_{m=0}^{M-1} E[m] \cos\left( \pi n \frac{m + 1/2}{M} \right), \quad 0 \le n \le M    (6)
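For illustration, the following Python sketch computes the MFCC vector of a single short frame following Eqs. (1)-(6). It is not the authors' implementation; the frame length, sampling rate, number of filters and the simple boundary handling are assumed values.

```python
# Illustrative sketch: MFCCs of one 5 ms frame via a triangular Mel filter bank.
import numpy as np

def mel(f):          # Eq. (3)
    return 1125.0 * np.log(1.0 + f / 700.0)

def mel_inv(m):      # Eq. (4)
    return 700.0 * (np.exp(m / 1125.0) - 1.0)

def mfcc_frame(frame, fs=12800, n_filters=20, n_coeffs=12, f_low=0.0, f_high=None):
    if f_high is None:
        f_high = fs / 2.0
    N = len(frame)
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(N)))            # |F[k]|
    # Filter boundary points, evenly spaced on the Mel scale (Eq. (2))
    mels = mel(f_low) + np.arange(n_filters + 2) * (mel(f_high) - mel(f_low)) / (n_filters + 1)
    bins = np.floor((N / fs) * mel_inv(mels)).astype(int)
    # Triangular filters and log filter energies (Eq. (5))
    E = np.zeros(n_filters)
    for m in range(1, n_filters + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        H = np.zeros(len(spectrum))
        H[left:center + 1] = 2.0 * (np.arange(left, center + 1) - left) / ((right - left) * (center - left) + 1e-12)
        H[center:right + 1] = 2.0 * (right - np.arange(center, right + 1)) / ((right - left) * (right - center) + 1e-12)
        E[m - 1] = np.log(np.sum(spectrum ** 2 * H) + 1e-12)
    # DCT of the log energies (Eq. (6)); keep the first n_coeffs coefficients
    n = np.arange(n_coeffs)[:, None]
    m_idx = np.arange(n_filters)[None, :]
    return (np.cos(np.pi * n * (m_idx + 0.5) / n_filters) * E).sum(axis=1)

# Example: a 5 ms frame (64 samples at 12.8 kHz) of synthetic audio
coeffs = mfcc_frame(np.random.randn(64))
```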
The work of Huang and Hon [16] demonstrates that the representation based on MFCC is especially useful for voice recognition.
3.2 SOMs as Phoneme Recognizers
Genetic Algorithms (GAs) and Neural Networks (NNs) are two biologically motivated computational models. For that reason, it is not surprising that several researchers have explored the idea of evolutionary neural networks. This approach, based on GAs, involves encoding possible solutions as binary chromosome strings; setting up an initial population of chromosomes and applying genetic operators such as selection, crossover and mutation allows us to arrive at a SOM with better fitness. This fitness represents the quality of the topology of the map, keeping in mind that each SOM finds different similarities among the same training vectors. Therefore, the GA finds the best topology, that is, the topology with the smallest distance with respect to the phonemes characterized as floating-point vectors. Let N be the size of the Kohonen map and let a_n be the activity of each neuron, that is, the distance between the value of that neuron and the current input vector. We then use the following equation to compute the fitness of each map:

fitness = \frac{1}{1 + \sum_{n=1}^{N} a_n}    (7)
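As an illustration of how a candidate map could be scored in the genetic search, the following sketch evaluates Eq. (7) with the neuron activity taken as the Euclidean distance to each training vector; the map size, vector dimension and the summation over a whole training set are assumptions rather than the authors' exact procedure.

```python
# Illustrative sketch: fitness of one candidate SOM codebook as in Eq. (7).
import numpy as np

def som_fitness(weights, training_vectors):
    """weights: (N, d) codebook of a candidate SOM; training_vectors: (T, d) MFCC vectors."""
    total_activity = 0.0
    for x in training_vectors:
        total_activity += np.linalg.norm(weights - x, axis=1).sum()  # sum of neuron activities
    return 1.0 / (1.0 + total_activity)                              # Eq. (7)

# A GA would rank candidate maps by this fitness and apply selection/crossover/mutation.
candidate = np.random.rand(64, 12)          # 8x8 map, 12-dim MFCC vectors (assumed)
fitness = som_fitness(candidate, np.random.rand(100, 12))
```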
Fig. 3. Illustration of the SOM at different iterations: a. 0, b. 100, c. 500, d. 1000, e. 3000, f. 5000. Notice that as the number of iterations increases, changes in the topology decrease, because the learning rate and the neighbourhood radius also decrease as a function of the number of iterations.
Fig. 4. Illustration of the SOM with different learned phonemes. Note that some neurons have specialized in certain phonemes, and special phonemes such as '', 'y' and 'z' are located in isolated clusters.
Figure 3 shows the evolution of the SOM at different iterations. It is possible to observe how the SOM iterates to form regions of similar color based on the presence of common features, correlations, similarities and differences among the input patterns of the neural network. These patterns are formed by the MFCCs extracted from the sentences spoken by the user. Therefore, regions with similar colors indicate the presence of similar phonemes. According
to Figure 4, we can see that some neurons have specialized in learning different phonemes. It is then possible to generalize on the Kohonen map to form words from phonemes.
4 Natural Language Processing
Another of our contributions is the use of linguistic criteria to identify the main entities that participate in a sentence of natural language, such as actions, tables, attributes, values and conditions. First, words related to query actions are recognized (choose, calculate, enumerate, select, and so on), together with their combinations in first, second and third person and in passive voice. Then, indirect objects are recognized through the presence of prepositions (in, of, with, from, to, and so on), so the noun in the nominal phrase identifies the possible presence of an entity; for instance, in the following example "students" is recognized as an entity:
– Enumerate the name of those students...
Then the presence of prepositions and comparators (greater than, lower than, equal to, different from) between a name (attribute name) and an identifier (attribute value) is detected:
– Enumerate the name of the students of the course of AI
– Select those marks greater than B
It is therefore possible to associate the two syntactic elements in the following way:
– course = AI
– mark > B
Then ordinal numbers are recognized (first, second, third, and so on), as in
– Enumerate the first place...
to recognize structures of the type
– place = 1
Finally, the presence of adjectives is detected. Adjectives represent the features and conditions of recognized entities:
– Enumerate the best students...
– Calculate the most studious students...
to recognize structures of this type:
– best(students)
– studious(students)
Since these adjectives can be subjective to each person, they should be represented according to the domain of the problem. Fuzzy logic can help to model the degree of subjectivity of each adjective. Finally, some outputs of the system for queries in natural language are shown:
– Voice entry: "Calculate the average mark in the students of the course of AI"
– Output: Select AVG(Mark) From Students Where Course = 'AI'
– Voice entry: "Select the students of fifth cycle that are not registered in the course of AI"
– Output: Select * From Students Where Course <> 'AI' and Cycle = 5
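The following minimal sketch illustrates how recognized actions, comparators and entities of this kind could be mapped to a SQL string. It is not the authors' type-2 grammar compiler; the keyword lists, regular expressions and the fixed table name are hypothetical.

```python
# Illustrative sketch: a tiny rule-based mapping from a recognized sentence to SQL.
import re

ACTIONS = ("enumerate", "select", "calculate", "list")
COMPARATORS = {"greater than": ">", "lower than": "<", "equal to": "=", "different from": "<>"}

def to_sql(sentence, table="Students"):
    s = sentence.lower()
    conditions = []
    for phrase, op in COMPARATORS.items():
        m = re.search(rf"(\w+) {phrase} '?(\w+)'?", s)
        if m:
            conditions.append(f"{m.group(1).capitalize()} {op} '{m.group(2).upper()}'")
    m = re.search(r"of the course of (\w+)", s)
    if m:
        conditions.append(f"Course = '{m.group(1).upper()}'")
    select = "AVG(Mark)" if "average" in s else "*"
    sql = f"Select {select} From {table}"
    if conditions:
        sql += " Where " + " and ".join(conditions)
    return sql

print(to_sql("Calculate the average mark in the students of the course of AI"))
# -> Select AVG(Mark) From Students Where Course = 'AI'
```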
5 Experimental Results
The speech data was uttered by two male and two female Peruvian speakers. Each one read the Spanish alphabet three times in different orders to train the SOM. Each dataset contained 63 phonemes and was divided into 5 codebooks (abcde, fgijk, lmno, pqrst and uvxyz), each one with 256 feature vectors; each feature vector contained 12 MFCCs. Therefore each user training dataset contained 1280 training patterns. The sound was digitized at 16 bits, sampled at 12.8 kHz, and segmented into 5-millisecond frames. Our experiments showed that vowels and plosives are easily clustered, but fricatives are more difficult.

Table 1. The accuracy rate in phoneme recognition using SOM, LVQ and MLP as classifiers

characteristic   classifier   accuracy
MFCC             SOM          78.7 %
MFCC             LVQ          92.3 %
MFCC             MLP          68.1 %

Table 2. The error rate in phoneme recognition using SOM, LVQ and MLP as classifiers

characteristic   classifier   error
MFCC             SOM          26.3 %
MFCC             LVQ          6.0 %
MFCC             MLP          28.2 %
5.1 Results
The obtained results using three different classifiers are listed in Tables 1 and 2. The classifiers were the Self-Organizing Map (SOM), Learning Vector Quantization (LVQ) and the Multi-Layer Perceptron (MLP).
6 Conclusions
With this paper we have concluded the following. A foreseeable advantage is related to spoken natural language: the use of voice and natural language allows a better experience with the computer. Furthermore, it allows including users with disabilities, such as blind people and people with Parkinson's disease. An approach based on natural language is closely tied to the chosen language: languages other than Spanish with strongly different grammars (Cantonese, German, English, and so on) will require a different approach to handle verbs, prepositions, adjectives, etc. The SOM allows representing a complex and stochastic phenomenon such as the voice in a straightforward way. The approach based on phonemes makes it possible to scale the system to recognize any word in the Spanish language; therefore it is possible to work with unlimited dictionaries. Future work will involve using semantic and fuzzy logic techniques to reduce ambiguity and represent the subjectivity of adjectives. Furthermore, the use of Hidden Markov Models (HMMs) will improve word recognition accuracy, since HMMs take into account the transition probabilities among phonemes.
Acknowledgments The author wants to thank very especially the teachings and advices from teachers of the National University of San Agust´ın at Arequipa, specially Dr. Ernesto Cuadros-Vargas and Luis Alfaro Casas. Besides, I would like to express my deep gratitude to Dr. Rosa Alarc´ on Choque, University of Chile at Santiago, and Marco A. Alvarez, State University of Utah, for their invaluable encouragement.
References 1. C. D. Wickens and J. G. Hollands, Engineering Psychology and Human Performance (3rd Edition). Prentice Hall, September 1999. [Online]. Available: http://www.amazon.de/ exec/obidos/ASIN/0321047117 2. B. Laurel, “Interface agents: Metaphors with character,” 1999, pp. 355–365. 3. B. Atal, “Automatic recognition of speakers from their voices,” in Proceedings of the IEEE, vol. 64, April 1976, pp. 460–475. 4. M. Hunt, “Spectral signal processing for asr,” 1999.
5. L. Gu and K. Rose, “Perceptual harmonic cepstral coefficients for speech recognition in noisy environment,” 2001. 6. M. Gales, “Model-based techniques for noise robust speech recognition,” 1996. [Online]. Available: citeseer.ist.psu.edu/gales95modelbased.html 7. R. Schlter and H. Ney, “Using phase spectrum information for improved speech recognition performance,” 1998. 8. S. Johnson, P. Jourlin, G. Moore, K. S. Jones, and P. Woodland, “The cambridge university spoken document retrieval system,” in Proc ICASSP ’99, vol. 1, Phoenix, AZ, 1999, pp. 49–52. [Online]. Available: citeseer.ifi.unizh.ch/ johnson99cambridge.html 9. H. Hermansky and N. Morgan, “RASTA processing of speech,” in IEEE Transactions on Speech and Acoustics, vol. 2, October 1994, pp. 587–589. 10. X. Huang, F. Alleva, H.-W. Hon, M.-Y. Hwang, and R. Rosenfeld, “The SPHINX-II speech recognition system: an overview,” Computer Speech and Language, vol. 7, no. 2, pp. 137–148, 1993. [Online]. Available: citeseer.ifi.unizh.ch/ huang92sphinxii.html 11. D. J. Kershaw, “Phonetic context-dependency in a hybrid ann/hmm speech recognition system,” 1996. [Online]. Available: citeseer.ifi.unizh.ch/175909.html 12. J. Neto, L. Almeida, M. Hochberg, C. Martins, L. Nunes, S. Renals, and A. Robinson, “Speakeradaptation for hybrid hmm-ann continuous speech recognition system,” 1995. [Online]. Available: citeseer.ifi.unizh.ch/neto95speakeradaptation.html 13. L. D. Whitley, S. Dominic, and R. Das, “Genetic reinforcement learning with multilayer neural networks.” in ICGA, 1991, pp. 562–569. 14. G. F. Miller, P. M. Todd, and S. U. Hegde, “Designing neural networks using genetic algorithms.” in ICGA, 1989, pp. 379–384. 15. S. G. Romaniuk, “Evolutionary growth perceptrons.” in ICGA, 1993, pp. 334–341. 16. H. Huang, A. Acero, and H. Hon, Spoken Language Processing - A Guide to Theory, Algorithms and Systems Development. Prentice - Hall, 2001.
3D Reconstruction Approach Based on Neural Network Haifeng Hu and Zhi Yang Department of Electronics and Communication Engineering Sun Yat-sen University, Guangzhou, P.R. China, 510275 {huhaif,issyz}@mail.sysu.edu.cn
Abstract. In this paper, a new 3D reconstruction approach for a neuro-vision system is presented. Firstly, an RBF network (RBFN) is used to provide effective methodologies for solving the camera calibration and stereo rectification problems. The RBFN works mainly in two aspects: (1) an RBFN is adopted to learn and memorize the nonlinear relationship in the stereovision system; (2) another RBFN is trained to search for the correspondent lines in the two images, so that stereo matching can be performed in one dimension. Secondly, a new matching method based on a Hopfield neural network (HNN) is presented. The energy function is built on the basis of uniqueness, compatibility and similarity constraints. It is then mapped onto a 2-D neural network for minimization, whose final stable state indicates the possible correspondence of the matching units. The depth map can be acquired by performing the above operation on all epipolar lines. Experiments have been performed on common stereo pairs and the results are accurate and convincing.
1 Introduction Stereo vision is a passive technique for building a 3D description of a scene observed from several viewpoints. Usually two cameras are used to observe the same scene with two slightly different viewpoints and the stereo image pairs are obtained. After the stereo pair is matched, 3-D description of the scene can be built. Generally, the reconstruction process involves two tasks: camera calibration and stereo matching. Camera calibration is a pre-requirement for most application in computer vision. In many cases, the overall performance of the vision system strongly depends on the accuracy of the camera calibration. The classic calibration approach that originates from the field of photogrammetry solves the problem through minimizing a nonlinear error function [1]. However, due to slowness and computational burden of this technique, closed-form solutions have been also suggested [2]. In these methods, parameter values are computed directly through a non-iterative algorithm based on a closed-form solution. Since no iteration is required, the algorithms are fast. But there are two disadvantages. First, camera distortion cannot be incorporated, and therefore, distortion effects cannot be corrected. Second, the accuracy of the final solution is relatively poor. There are also calibration procedures where both nonlinear minimization and a closed form solution are used [3, 4]. Tsai method [3] is the representative of this type, which uses a series of equations to determine camera parameters in two stages with a simplified model of radial lens distortion. D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 630–639, 2007. © Springer-Verlag Berlin Heidelberg 2007
Stereo matching is the most difficult problem in stereo vision, which can be roughly divided into two categories according to the disparity selection method: local methods and global methods. Local methods [5, 6] select the disparities of image pixels locally using the winner-takes-all (WTA) method and, therefore, are faster than global methods in general. However, local methods have difficulty in dealing with image ambiguity owing to insufficient or repetitive texture. On the other hand, global methods seek a disparity surface that minimizes a global cost function defined by making explicit smoothness assumptions. Various optimization techniques such as dynamic programming [7], graph cut [8], and belief propagation (BP) [9] are used to get minimum cost solutions. Therefore, global methods can deal with inherent image ambiguity more effectively than local methods. In this paper, we adopt two different neural networks for solving the camera calibration and stereo matching problems in respective. First, RBF network is introduced to calibrate a stereovision system. A four-input-two-output-architecture network is adopted to directly map the stereo image points to their correspondent spatial points. So the nonlinear relationship of the stereovision system and physical parameters of the camera can be fully learned based on a training data set. In stereovision system, traditional rectification is realized through using some known physical camera parameters obtained from calibration. In our method, we prove that some relationship between pairs of lines exists if some condition is satisfied. The new method is presented for searching correspondent lines utilizing a two-input-twooutput RBF network. Second, a new matching method is presented based on Hopfield network. We construct a new energy function on the basis of uniqueness, compatibility and similarity constraints, which reflects the constraint relations of all matching units of the same lines. It is then mapped onto a 2-D Hopfield neural network for minimization, whose final stable state indicates the possible correspondence of the matching units. The proposed matching algorithm has two traits: (1) individual pixel point but not scene point or edge line is adopted as matching unit and dense depth map could be obtained directly; (2) the external input of the nodes is not constant again, which reflects gray similarity of correspondent points. The paper is organized as follows. Section 2 introduces how to use RBFN to calibrate a stereoscopic system. Section 3 proves that a mapping between pairs of lines exists if some condition is satisfied, and a RBFN is adopted to learn such relationship. In section 4, the theory behind using a Hopfield network to solve the correspondence problem is developed. The last two sections present the experimental results to corroborate the proposed theory and give some brief conclusion in turn.
2 Camera Calibration Using RBF Network
Fig. 1 illustrates the basic geometry of a stereovision system. p1 and p2 are the projections of the spatial point P onto the retinal planes π1 and π2, respectively. From the literature [1], we have the following linear transformation equations in homogeneous coordinates:
(u_1 m^1_{31} - m^1_{11})X + (u_1 m^1_{32} - m^1_{12})Y + (u_1 m^1_{33} - m^1_{13})Z = m^1_{14} - u_1 m^1_{34}
(v_1 m^1_{31} - m^1_{21})X + (v_1 m^1_{32} - m^1_{22})Y + (v_1 m^1_{33} - m^1_{23})Z = m^1_{24} - v_1 m^1_{34}
(1)
(2)
where (u1 , v1 ,1) and (u2 , v2 ,1) are the homogeneous coordinate of p1 and p2 in the π 1 , π 2 respectively. ( X , Y , Z ,1) is the homogeneous coordinate of P in the world coordinate, mijk (k = 1, 2; i = 1...,3; j = 1..., 4) are the camera calibration parameters. P( X , Y , Z )
p1 (u1 , v1 )
p 2 ( u 2 , v2 )
S1
S2 O2
O1
Fig. 1. Stereo imaging geometry
From the view of analytic geometry, we can determine the position of arbitrary spatial P with a least-square solution as P is the intersection of O1 p1 and O2 p2 . In actual condition, camera distortion cannot be ignored and P should be determined through a very complex nonlinear mapping [1-4]. It has been shown that the RBF network is a X Y Z universal approximation for continuous functions, provided that the number of the hidden nodes is sufficiently large. This property makes the network a ¦ ¦ ¦ powerful tool for dealing with many real world problems [11]. In our calibration method, the learning and memory functions of the RBF neural network are fully used. The training samples are composed of the given spatial points and their images. Nonlinear relation of the stereovision system and the parameters of the camera are learnt by the neural network and they are stored in the weights for future needs. ul ur vl vr The structure of the adopted RBF network (denoted as network A) is illustrated in figure 2. It is Fig. 2. RBFN for calibration 3-layer RBF neural network. The first layer is the input-layer with 4 neurons that correspond to the left image coordinate (ul , vl ) and the right image coordinate (ur , vr ) . The last layer is the output layer with three neurons that correspond to the spatial point coordinate ( X , Y , Z ) . The hidden layer is used to store the nonlinear relationship between the 2-D image and the 3-D object space and camera parameters. RBF
RBF
633
3 Rectification of Stereo Pairs The stereo matching can be solved much more efficiently if images are rectified, i.e. transforming the images so that the epipolar lines are aligned horizontally. In this case, stereo matching algorithm can easily take advantage of the epipolar constraint and reduce the search space to one dimension. In the following section, a new approach is proposed which can find conjugate epipolar lines through using RBFN. 3.1 The Relationship of the Correspondent Lines Suppose there exists a line l1 : v1 = k1u1 + t1 in π 1 (see fig.1), we modify the equation (1), and get the following expression: u1 =
a11 X + b11Y + c11Z + d11 , a31 X + b31Y + c31 Z + d 31
v1 =
a21 X + b21Y + c12 Z + d 21 . a31 X + b31Y + c31 Z + d 31
(3)
Then the line l1 : v1 = k1u1 + t1 could be modified as: AX + BY + CZ + D = 0 ,
(4)
where A, B, C , D are the linear combination of k1 , t1 and a , b , c , d (i = 1, 2,3) . According to the basic knowledge of stereovision, a point (u1 , v1 ) on l1 must have a correspondent point (u2 , v2 ) in π 2 which satisfies the following equation: 1 i
u2 =
a12 X + b12Y + c12 Z + d12 , a32 X + b32Y + c32 Z + d 32
v2 =
1 i
1 i
1 i
a22 X + b22Y + c22 Z + d 22 . a32 X + b32Y + c32 Z + d 32
(5)
Rearranging equation (5) and we can get u2 =
PY + Q1Z + R1 1 PY + Q3 Z + R3 3
,
v2 =
PY + Q2 Z + R2 2 PY + Q3 Z + R3 3
,
(6)
where Pi , Qi , Ri are the linear combination of ai2 , bi2 , ci2 , d i2 (i = 1, 2,3) and A, B, C , D . From Equation (4) and (6), we can eliminate Y and get v2 =
T1 W − W2T1 T2 T W − W1T1 T2 u2 + 1 u2 + ( S1 − 1 S 2 ) + 1 S2 , T2 T2 Z + W2 T2 T2 Z + W2
(7)
where S_1 = P_2 / P_3, S_2 = P_1 / P_3, T_1 = Q_2 - Q_3 P_2 / P_3, T_2 = Q_1 - Q_3 P_1 / P_3, W_1 = R_2 - R_3 P_2 / P_3, W_2 = R_1 - R_3 P_1 / P_3.
(8)
If the object depth varies in a small range, i.e., |Z - C_m| ≤ ε (where ε is a small value and C_m is a constant) holds under certain conditions, then equation (7) can be approximated by a linear form, which means that l2 in π2 reduces to an epipolar line. A network can then be used to learn the relationship between l1 and l2.
3.2 Searching the Correspondent Line Using a RBF Network We use a two-input-two-output RBF network to learn the relationship between the epipolar lines. Input vector is the slope and intercept of l1 , output vector is the slope and intercept of the correspondent line l2 . The new RBFN is denoted as network B.
4 Stereo Matching Based on HNN
Stereo matching is one of the classic difficult problems in computer vision, and its complexity and precision limit the capability of a vision system to reconstruct the 3-D scene. Here we present a novel stereo matching method based on the HNN. Our approach gives new descriptions of the similarity relation of matching pairs and of the smoothness property of neighboring matching points.
4.1 Designing the Energy Function
Fig. 3. The plot of the 2-D HNN
We construct a two-dimensional Hopfield network which is used to find the correspondence on the conjugate epipolar lines (shown in Fig. 3). el (er) denotes the left (right) epipolar line. The network can be regarded as an N × N array of neurons, where N is the total number of image points on el or er. A neuron n_xi explores the hypothesis that the xth point of el corresponds to the ith point of er; the meaning of n_yj is similar. The connection between two neurons, W_xiyj, represents how
compatible the two correspondences are. This connection may be excitatory if they are compatible or inhibitory if they are incompatible. I_xi (or I_yj) is the external input to the neuron n_xi (or n_yj). We construct the energy function as follows, using the constraints that must be satisfied in the stereo matching process, i.e., the uniqueness, epipolar-line, continuity and geometric constraints:

E = -(P/2) \sum_x \sum_i \sum_y \sum_j C_{xiyj} V_{xi} V_{yj} + (Q/2) \sum_x \sum_i \sum_j V_{xi} V_{xj} + (R/2) \sum_i \sum_x \sum_y V_{xi} V_{yi} - \sum_x \sum_i S_{xi} V_{xi}    (9)

where P, Q and R are positive constants and V_xi is the output of the neuron n_xi. C_xiyj is called the compatibility measure and is used to enforce the smoothness constraint. S_xi is called the similarity measure of correspondent points. The first term in (9) represents the degree of compatibility of two matched points, while the second and third terms tend to enforce the uniqueness constraint. The last term discourages a neuron from being at a zero state, since that causes the energy to increase as the output decreases. So the above energy function meets the required constraints of the matching process. The key point is how to define C_xiyj and S_xi.
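For illustration, the following sketch evaluates the energy of Eq. (9) for a given network state; the array-based representation is an assumption, while the constants P, Q and R are the values reported later in the experiments.

```python
# Illustrative sketch: evaluating Eq. (9) for a binary state V of the N x N network,
# with precomputed compatibility C and similarity S arrays.
import numpy as np

def energy(V, C, S, P=10.0, Q=5.0, R=20.0):
    """V, S: (N, N) arrays; C: (N, N, N, N) array indexed as C[x, i, y, j]."""
    term1 = -(P / 2.0) * np.einsum('xiyj,xi,yj->', C, V, V)
    term2 = (Q / 2.0) * np.einsum('xi,xj->', V, V)   # sum_x sum_i sum_j Vxi Vxj
    term3 = (R / 2.0) * np.einsum('xi,yi->', V, V)   # sum_i sum_x sum_y Vxi Vyi
    term4 = -np.sum(S * V)
    return term1 + term2 + term3 + term4

# A relaxation scheme would then update neuron outputs so as to reduce E.
N = 8
V = (np.random.rand(N, N) > 0.5).astype(float)
E = energy(V, np.random.rand(N, N, N, N) * 2 - 1, np.random.rand(N, N))
```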
4.2 Definition of the Compatibility Measure C_xiyj
The compatibility measure C_xiyj basically computes how compatible the two correspondences represented by the neurons n_xi and n_yj are. It is computed as follows. Let p, p' (or q, q') be the coordinate vectors of the image points x, y (or i, j) on the line el (or er). G_d is the disparity gradient, defined by

G_d = \frac{\| (p' - p) - (q' - q) \|}{\| (p' - p) + (q' - q) \|}
(10)
where \| \cdot \| denotes the Euclidean norm. The above equation shows that G_d reflects the depth difference of spatially neighboring points, because the disparity contains the depth information of a spatial point. If the points (such as p, p', etc.) belong to the same object, then G_d should be very small, owing to object rigidity and surface smoothness. In fact, G_d is always smaller than or equal to 1, according to psychology experiments [10]. C_xiyj is then defined by

C_{xiyj} = 2 \exp\left( -\frac{(G_d - G_0)^2}{\lambda^2} \right) - 1
(11)
where λ , G0 are controlled parameters. It could be shown that the nonlinear function used in (11) scales the compatibility measure smoothly between -1 and 1, which means the value C xiyj is bounded by -1 and 1. The more compatible the two correspondences are, the closer C xiyj is to 1, the lower is the energy of the system, and the opposite also holds true. 4.3 Definition of the Similar Measure Let z x (or zi ) be the gray value of image point x (or i ) and X be a point in the scene. Then the similar measure is given by: S xi = P (( x, i) | X ) = P ( x | X ) P (i | X ) ,
(12)
where P(x|X) is the probability density that represents the likelihood of x assuming it originated from X, defined by

P(x|X) = \left| 2\pi C_x \right|^{-1/2} \exp\left( -\tfrac{1}{2} (z - z_x)' C_x^{-1} (z - z_x) \right)    (13)

where z is the unknown true value and C_x is the covariance matrix associated with the error z - z_x. Since the true value z is unknown, we approximate it by the maximum-likelihood estimate \hat{z} obtained from the measurement pair (x, i):

z \approx \hat{z} = \frac{C_x^{-1} z_x + C_i^{-1} z_i}{C_x^{-1} + C_i^{-1}}    (14)

where C_x (or C_i) is the covariance associated with measurement z_x (or z_i).
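The following sketch (not the authors' code) evaluates the compatibility and similarity measures of Eqs. (10)-(14) for a pair of candidate correspondences, using the scalar form of Eq. (13); λ and G0 are the values used later in the experiments, and the error variances are assumed.

```python
# Illustrative sketch: compatibility (Eqs. (10)-(11)) and similarity (Eqs. (12)-(14)).
import numpy as np

def compatibility(p, p2, q, q2, G0=0.05, lambda_=0.3):
    """p->q and p2->q2 are two candidate matches (2-D points on el and er)."""
    num = np.linalg.norm((p2 - p) - (q2 - q))
    den = np.linalg.norm((p2 - p) + (q2 - q)) + 1e-12
    Gd = num / den                                                   # Eq. (10)
    return 2.0 * np.exp(-((Gd - G0) ** 2) / lambda_ ** 2) - 1.0      # Eq. (11)

def similarity(zx, zi, cx=1.0, ci=1.0):
    """Gray values zx, zi of the two candidate points; cx, ci are error variances."""
    z_hat = (zx / cx + zi / ci) / (1.0 / cx + 1.0 / ci)              # Eq. (14)
    def likelihood(z_obs, c):                                        # Eq. (13), scalar case
        return (2.0 * np.pi * c) ** -0.5 * np.exp(-0.5 * (z_hat - z_obs) ** 2 / c)
    return likelihood(zx, cx) * likelihood(zi, ci)                   # Eq. (12)

# Example: two neighboring candidate matches with similar disparities
c = compatibility(np.array([3.0, 0.0]), np.array([4.0, 0.0]),
                  np.array([5.0, 0.0]), np.array([6.0, 0.0]))
s = similarity(120.0, 118.0)
```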
5 Experimental Results and Discussion
Fig. 4 shows the schematic diagram of the experimental setup for calibration. Two Panasonic TV CCD cameras with normal lenses were mounted on an ordinary z-raiser. The baseline and vergence angle of the two cameras are about 150 mm and 20°, respectively. The focal length and the diagonal visual angle of the two lenses are 18 mm and about 30°, respectively. A calibration pattern produced by a laser printer is a normal chessboard map which contains many grid corners. A subtle method was used to get the exact position of each corner to serve as the calibration samples ([4]).
Fig. 4. Schematic diagram of experimental setup for calibration
5.1 Calibrating the Stereovision System Using Network A
As shown in Fig. 4, the distance between two adjacent corners is 10 mm, i.e. a = 10 mm. The depth between two adjacent levels is 20 mm, i.e. d = 20 mm. In our test, we use two different sets of points: one for system calibration and the other for testing the generalization ability of the network. Table 1 lists the average 3D position measurement errors for different training data sets; Sc is the learning parameter (defined in [11]). The number of points used for computing the measurement errors is 500. The results indicate that the best generalization ability of the network is obtained when the training number is 500 and Sc is 1.2, and the mean error is then below 1 mm.
Table 1. Average 3D position measurement errors for different training data sets (mm)
Training number   Training error   Sc=0.5    Sc=0.8   Sc=1.0   Sc=1.2
500               0.05             1.4051    1.4086   1.3970   1.3924
500               0.02             0.9552    0.9512   0.9538   0.9543
500               0.016            0.9705    0.9678   0.9570   0.9543
150               0.032            1.4933    1.5495   1.5231   1.4415
150               0.016            1.4554    1.4467   1.4373   1.4087
150               0.012            1.4422    1.3345   1.3126   1.3534
150               0.008            1.1329    1.0784   0.9829   1.0309
150               0.005            0.9984    1.0109   0.9842   0.9946
150               0.002            1.4661    1.9983   2.3583   1.2058
50                0.02             10.7781   6.7019   2.2350   1.4637
50                0.01             8.3398    6.4572   2.4528   1.5932
50                0.005            6.2914    4.5785   2.0436   1.6810
50                0.002            6.7811    4.5082   2.7297   3.7880
50                0.001            6.6809    3.0409   3.7685   3.0149
5.2 Stereo Image Rectification Using Network B
To learn the relationship between two correspondent lines in the stereo image planes, 180 lines on three different z planes are selected to train the network. To test the performance of the trained network, we select some correspondent image points from the original stereo pairs (marked by crosses in Fig. 5) and compute their difference in the longitudinal direction after rectification. Table 2 shows the test errors, which indicate that the mean rectification error of our method is below 0.65 pixels.

Table 2. Average measurement errors and max error after rectification

Image pair   Data number   Average error (pix)   Max error (pix)
Phone        38            0.3585                1.294
Grid         96            0.6278                1.452

(a) Stereo pair of phone   (b) Stereo pair of grid map
Fig. 5. The stereo images used for testing rectification
5.3 Stereo Matching Based on HNN We have tested the HNN matching approach with two image pairs captured by our lab’s vision system. All images shown here are composed of about 512× 512 pixels. During the matching processes, the image pairs are rectified and pruned such that the correspondent points on the same horizontal line. The parameter value is set as P = 10 , Q = 5 , R = 20 , λ = 0.3 and G0 = 0.05 . The first example is a model of a pair of pliers shown in figure 6(a). This example is designed to test the capability of constructing curved surfaces from stereogram. Fig. 6 (c) shows the rectified stereo pair. After stereo matching using the proposed HNN method, the correspondence points or disparity map can be obtained as shown in 6(e). The result illustrates the sharp depth discontinuities that can be obtained with the algorithm. However, the disparity map exhibits a depth discontinuity that is aligned with some neighboring scanlines. This is not the case in practice because Hopfield network is guaranteed to find global minima but not necessarily the same one for each scanline, which cause the misalignment of the vertical depth discontinuities. The threedimensional plot of disparity map for the pliers is shown in figure 6(g), which indicates the curved surface constructed with this approach is satisfying.
Fig. 6. Experiment results where (a)-(b) are the original stereo pair of pliers, cup lid respectively which are captured by our stereovision system. (c)-(d) are the rectified and pruned images. (e)-(f) are the depth maps based on our HNN method. (g)-(h) are the reconstruction configuration of two different stereo pairs.
The second example is a cup lid stereogram shown in Fig. 6(b). In the same way, the left and right images in the stereo pair are taken at parallel positions (Fig. 6(d)). This example is designed to test the approach in a situation where there is only a slight depth difference in the scene. Fig. 6(f) shows the result of applying the proposed algorithm to the cup lid pair. Significant detail is obtained and the fine depth difference of the lid brim is well determined. However, the disparity map is poor, with many artifacts present. Investigation of this problem revealed that the intensity values at corresponding points showed significant non-zero biases. Figure 6(h) shows the 3D figure of the cup lid obtained using the trained network A from the calibration stage. It can be seen that the shapes and surfaces are well recovered, which demonstrates the effectiveness of our method.
6 Conclusions In this paper, two different neural networks are adopted in neuro-vision system for 3D reconstruction. One is RBF neural network, which works in three stages: firstly, it is adopted to learn the nonlinear relationship of vision system in the calibration stage; secondly, it is used to search the correspondent lines in rectification stage; finally, it is
employed to reconstruct the 3D object figure in reconstruction stage. Hopfield neural network is established for solving stereo correspondence problem which is formulated as a minimization of an energy function. Based on the gray similar property of corresponding points and the continuity property of neighboring points, the energy function is constructed with satisfying some necessary constraints. By minimizing the energy function, the corresponding points between stereo pair can be obtained. Through the practical experiments with common image pairs, the 3-D configuration and shape of the objects can be well reconstructed, which indicates our neural network method is efficient, accurate and straightforward to implement in real stereovision system.
References 1. Isaguirre, A., Pu, P., Summers J.: A New Development in Camera Calibration Calibrating a Pair of Mobile Cameras. In: Proc. IEEE Int. Conf. RA (1985) 74–79 2. Ganapathy S.: Decomposition of Transformation Matrices for Robot Vision. In: Proc. IEEE Int. Conf. RA (1984) 130–139 3. Tsai, R.Y.: A Versatile Camera Calibration Technique for High-Accuracy 3D Machine Vision Metrology Using Off-the-Shelf TV Cameras and Lenses. IEEE Trans. on RA. 3 (4) (1987) 323–344 4. Weng, J., Cohen, P., Herniou, M.: Camera Calibration with Distortion Models and Accuracy Evaluation. IEEE Trans. on PAMI 14 (10) (1992) 965–980 5. Veksler, O.: Fast Variable Window for Stereo Correspondence Using Integral Images. CVPR 1 (2003) 556–561 6. Yoon, K. J., Kweon, I. S.: Adaptive Support-weight Approach for Correspondence Search. IEEE Trans on PAMI 28 (4) (2006) 650–656 7. Bobick, A.F., Intille, S.S.: Large Occlusion Stereo. Int. J. of Computer Vision 33 (3) (1999) 181–200 8. Kolmogorov, V., Zabih, R.: Computing Visual Correspondence with Occlusions via Graph Cuts. ICCV 2 (2001) 508–515 9. Sun, J., Li, Y., Kang, S.B., Shum, H.Y.: Symmetric stereo matching for occlusion handling. CVPR 2 (2005) 399–406 10. Pollard, S.B., Mayhew, J.E.W., Frisby, J.P.: PMF: A Stereo Correspondence Algorithm Using A Disparity Gradient Constraint. Perception 14 (1985) 449-470 11. Chen, S., Cowan, C.F.N., Grant, P.M.: Orthogonal Least Squares Learning Algorithm for Radial Basis Function Network. IEEE Trans. on Neural Network 2 (2) (1991) 302–3
A New Method of IRFPA Nonuniformity Correction Shaosheng Dai, Tianqi Zhang, and Jian Gao School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications (CQUPT), Chongqing 400065, China
[email protected]
Abstract. In order to meet the demand for real-time nonuniformity correction of Infrared Focal Plane Arrays (IRFPA), a new algorithm based on neural networks (NN) is proposed. Compared to the traditional NN algorithm, the new algorithm uses the weighted average of the four adjacent pixel values to calculate the expected output value, and uses the weighted average of the four adjacent corrected output values as the final corrected value. With the help of a high-performance image-processing system based on the TMS320C6201 DSP, the nonuniformity of the IRFPA is corrected in real time. In this paper the mathematical model of the new NN algorithm is established, and the specific hardware correction process is described. Finally, experimental results are given. The results show that the new NN algorithm is more effective in increasing the quality of the corrected image.
1 Introduction
Nowadays infrared (IR) imaging technology has been making rapid progress with the development of microelectronics. IR sensors have also developed from single detectors to large-area IRFPAs. An IRFPA imaging system has a series of advantages, including simple configuration, high sensitivity and high frame-frequency response, so it will gradually become the leading direction of IR imaging technology [1]. Affected by the manufacturing process and the environment, the detectors of an IRFPA exhibit response nonuniformity [2][3], i.e., different response levels under the same blackbody radiation. The nonuniformity of the IRFPA causes a fixed-pattern noise over the resulting images and greatly limits the widespread application of IR imaging systems (IRIS). So nonuniformity correction of the IRFPA is a critical task for achieving higher performance in an IRIS. Some nonuniformity correction methods have been proposed to reduce the fixed-pattern noise persistent in IRFPAs. For example, the widely used two-point calibration method is based on a linear model of the detector response and employs two blackbody sources at different temperatures to calculate the gain and offset parameters of each detector. To perform the two-point calibration, the imaging process of the IRIS must be halted, so it is difficult to realize continuous correction of the IRFPA. The scene-based nonuniformity correction method, on the other hand, does not require calibration, but has lower precision. During operation, the scene information captured by the IRIS is used to estimate the parameters of all detectors of the IRFPA. In order to obtain confident parameter estimates, a large number of image frames must be computed, but the
computing process is too time-consuming. In order to meet the need for real-time nonuniformity correction of the IRFPA, a new NN algorithm is proposed which can self-adaptively correct the IRFPA during real-time imaging. In order to increase the real-time correction speed, we utilize a powerful image-processing system based on the TMS320C6201 DSP to run the new NN algorithm. The DSP processor has eight highly independent functional units with hardware multipliers and adders, which are very suitable for processing a large amount of image data. In this paper, the formulas of the new NN algorithm are derived, and the specific correction steps using the DSP hardware are listed.
2 Conventional NN Algorithm
According to human vision theory, the light-receiving cells of the human retina always have some differences, but the horizontal cells of the human eye are able to self-adjust these differences according to the original image. Every horizontal cell is connected to several adjacent light-receiving cells. Based on this principle, some researchers use a periodic NN structure to self-adaptively correct nonuniformity. The specific method is as follows: first, every neuron is linked to an array unit; then a hidden layer is designed, in which every neuron is connected to some adjacent array units like a horizontal cell; the average output value of several adjacent array units is calculated and fed back to the upper layer to correct nonuniformity; the steepest-descent learning rule is then used to obtain the correction gain and offset; finally, the corrected value is calculated with the NN correction formula. Suppose x is an original nonuniform image and y is the corrected image; then the relationship between y and x is described as: y = Gx + O,
(1)
where G is the correction gain factor and O is the correction offset factor. The process of self-adaptive correction based on the NN is as follows: G and O are continuously adjusted according to the real-time sampled original image until the quality of the corrected image is optimized. If W is the weight vector, the equation is as follows:
y ( n ) = W T ( n) X ( n ) ,
(2)
where n denotes the n -th frame image, W (n) = (G (n), O(n))T , X (n) = ( x(n),1)T . Suppose expected output value subtracts real output value to get error value, and error value is identified as e(n) , whose expression is as follows: e ( n ) = f ( n ) − y ( n ) = f ( n) − W T ( n ) X ( n ) .
(3)
Therefore the self-adaptive correction based on NN is the process of continuously adjusting weight coefficient, the weight coefficient adjustment will not halted until the mean square error of e(n) reaches minimum.
Assume f ij is expected correction output, and take the average of four adjacent pixels’ value as example, the equation would be: f ij = ( xi , j +1 + xi −1, j + xi , j −1 + xi +1, j ) / 4 .
(4)
Then the error function is F(G, O) = (Gx + O - f)^2. According to the steepest-descent method, the iterative formulas for G and O can be given: G_{n+1} = G_n - 2ax(y - f),
(5)
On +1 = On − 2a( y − f ) ,
(6)
where n is frame, a is descending pace. The iterative correction equation can be obtained: yn +1 = Gn +1 xn +1 + On +1 .
(7)
3 A New NN Algorithm
The conventional NN algorithm uses the plain average of the four adjacent pixels as the expected output value, which excludes the influence of each adjacent pixel's gray level. It is suitable for correcting images with weak noise, but for images with strong noise it hardly improves image quality. The new NN algorithm fully considers the influencing factors, including the gray levels of the adjacent pixels and the noise of the original images. When calculating the expected output value, the influence of an adjacent pixel's gray level is quantitatively expressed by a weight coefficient. The weight coefficient is determined by the reciprocal of the difference between the gray level of pixel x_{i,j} and that of its adjacent pixel: the smaller the difference, the larger the weight. When processing images with strong noise, the corrected output value y_{i,j} of x_{i,j} is first obtained, and then the weighted average of the four adjacent corrected pixels is used as the final correction output value. On the basis of the expressions of the conventional NN algorithm, the mathematical models of the new NN algorithm are described as follows: 1) Using the weighted average of the four adjacent pixels to express the expected output value f_{i,j}:
ω_1 = 1 / (|x_{i,j+1} - x_{i,j}| · 255 + 1),  ω_2 = 1 / (|x_{i-1,j} - x_{i,j}| · 255 + 1),
ω_3 = 1 / (|x_{i,j-1} - x_{i,j}| · 255 + 1),  ω_4 = 1 / (|x_{i+1,j} - x_{i,j}| · 255 + 1),
ω_5 = ω_1 + ω_2 + ω_3 + ω_4,
f_{i,j} = (ω_1 x_{i,j+1} + ω_2 x_{i-1,j} + ω_3 x_{i,j-1} + ω_4 x_{i+1,j}) / ω_5.
(8)
According to iterative equations (5) (6), the iterative equations of the new NN algorithm are given out: Gi , j (n + 1) = Gi , j (n) − 2axi , j ( yi , j − fi , j ) ,
(9)
Oi , j (n + 1) = Oi , j (n) − 2a( yi , j − f i , j ) .
(10)
According to the corrected equation (7), the correction output expression yi , j is: yi , j (n + 1) = Gi , j (n + 1) xi , j (n + 1) + Oi , j (n + 1) .
(11)
2) The final correction output equation is obtained based on expression (11):
ω'_1 = 1 / (|y_{i,j+1} - y_{i,j}| · 255 + 1),  ω'_2 = 1 / (|y_{i-1,j} - y_{i,j}| · 255 + 1),
ω'_3 = 1 / (|y_{i,j-1} - y_{i,j}| · 255 + 1),  ω'_4 = 1 / (|y_{i+1,j} - y_{i,j}| · 255 + 1),
ω'_5 = ω'_1 + ω'_2 + ω'_3 + ω'_4,
y'_{i,j} = (ω'_1 y_{i,j+1} + ω'_2 y_{i-1,j} + ω'_3 y_{i,j-1} + ω'_4 y_{i+1,j}) / ω'_5.
(12)
Here y'_{i,j} is the weighted average of the four pixels adjacent to y_{i,j}. Using y'_{i,j} in place of y_{i,j} reduces the residual noise of the corrected image. Equation (12) is the final correction output expression of the new NN algorithm.
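As an illustration, the following single-frame sketch applies Eqs. (8)-(12) to one normalized frame; it is not the authors' DSP implementation. The learning step, the circular boundary handling and applying the updated gain and offset to the same frame (rather than the next one) are simplifying assumptions.

```python
# Illustrative sketch: one iteration of the new NN correction on a frame x,
# with per-pixel gain G and offset O.
import numpy as np

def correct_frame(x, G, O, a=0.01):
    """x, G, O: (H, W) float arrays with x normalized to [0, 1]."""
    def shifted(img, di, dj):                        # neighbor via circular shift
        return np.roll(np.roll(img, di, axis=0), dj, axis=1)
    def weighted_neighbor_mean(img):
        total, wsum = np.zeros_like(img), np.zeros_like(img)
        for di, dj in ((0, 1), (-1, 0), (0, -1), (1, 0)):
            nb = shifted(img, di, dj)
            w = 1.0 / (np.abs(nb - img) * 255.0 + 1.0)   # Eq. (8) weights
            total += w * nb
            wsum += w
        return total / wsum
    f = weighted_neighbor_mean(x)                    # expected output, Eq. (8)
    y = G * x + O                                    # current corrected output
    G = G - 2.0 * a * x * (y - f)                    # Eq. (9)
    O = O - 2.0 * a * (y - f)                        # Eq. (10)
    y = G * x + O                                    # Eq. (11)
    return weighted_neighbor_mean(y), G, O           # final output y', Eq. (12)

# Example on a synthetic 128x128 frame with unit gain and zero offset
frame = np.random.rand(128, 128)
corrected, G, O = correct_frame(frame, np.ones((128, 128)), np.zeros((128, 128)))
```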
4 Hardware Implementing Nonuniformity Correction 4.1 Hardware System Constitution
The nonuniformity correction hardware system is composed of IRFPA, Analog to Digital Convertor (A/D), Field-programmable gate array (FPGA), external data memory (ERAM), TMS320C6201 DSP processor and PC. Fig. 1 is the block diagram of the correction system constitution.
Fig. 1. Block diagram of the correction system constitution
644
S. Dai, T. Zhang, and J. Gao
According to characteristics of image processing, TMS320C6201 DSP chip [4] [5] from TI company is selected, which is based on very long instruction word (VLIW) architecture. The TMS320C6201 DSP is a fixed-point microprocessor with 8 parallel processing units running up to 200 MHZ (1600 MIPS), which is competent for IRFPA self-adaptive nonuniformity correction algorithm based on NN. 4.2 Hardware Correction Realization
The IRFPA nonuniformity correction process based on TMS320C6201 DSP hardware is described as follows: Firstly obtaining the original correction gains and offsets of all array units in IRFPA. Before using the new NN algorithm to correct IRFPA nonuniformity, the original correction gains and offsets of all pixels must be obtained for iterative calculation. We use the two-point correction [6] method to compute the original correction gains and offsets. The document [6] has presented the equations used to compute gains and offsets. The next step is to get all pixels’ response output value of the current frame and compute correction gains and offsets of the subsequent frame. In order to get subsequent frame correction gains and offsets, all pixels’ response output value of the current frame must be recorded. Using expected correction expression to figure out the expected correction value of current frame, then using correction expression to figure out correction output value of the current frame. At last we can get the correction gains and offsets of the subsequent frame according to iterative equations. In the end, continuous correction for subsequent frames can be real time gone on according to self-adaptive nonuniformity correction formula based on NNs. The specific implementation steps are: 1) In PC, making a simple program to sample and record every pixel response output value under different temperature blackbody radiation, and by two-point correction equation the correction gain Gi , j (n) and offset Oi , j (n) are figured out. 2) After system power-on, the TMS320C6201 DSP starts A/D to continuously sample, and save sampling data as the current frame into ERAM. Then DSP follows nonuniformity correction formula (11) based on NN to compute the current frame correction value and stores them into ERAM. By equation (8) the expected correction value of current frame is calculated, and then according to iterative equations (9) and (10) the subsequent frame correction gains and offsets are also figured out and saved. 3) When A/D continues to sample the subsequent frame image, DSP continuously corrects data by equation (11). In order to reduce image noise, the final correction data of the subsequent frame are produced by equation (12) and sent into PC to display. Thus IRFPA real time nonuniformity correction is realized. 4.3 Experimental Results
A nonuniformity correction experiment with the new NN algorithm was carried out on a 128×128-pixel IRFPA, with satisfactory results. Figs. 2, 3 and 4 show the same infrared image of a human hand saved during real-time nonuniformity correction.
Fig. 2. Image with noise Fig. 3. Corrected by conventional NN Fig. 4. Corrected by new NN
Fig. 2 is the original noisy image; Fig. 3 is the image corrected by the conventional NN algorithm, which is faint; Fig. 4 is the image corrected by the new NN algorithm, which is clear.
5 Conclusions
For the nonuniformity correction of the IRFPA, a powerful DSP-based image processing system is used and the algorithm is optimized in linear assembly language, which greatly improves the real-time performance of the imaging system. The new NN algorithm improves image quality more than the conventional NN algorithm. The experiments show that the new NN correction algorithm is easy to realize on a high-performance DSP system and, above all, that a clearer IR image is obtained.
Acknowledgements. This work is supported by the National Natural Science Foundation of China (No. 60602057), the Natural Science Foundation of Chongqing University of Posts and Telecommunications (CQUPT) (No. A2006-04, No. A2006-86), the Natural Science Foundation of Chongqing Municipal Education Commission (No. KJ060509), and the Natural Science Foundation of Chongqing Science and Technology Commission (No. CSTC2006BB2373).
References 1. Hui, X., Weiping, Y., Zhenkang, S.: Study on Infrared FPA Nonuniformity Correction. Systems Engineering and Electronic Technology 20 (12) (2001) 40-43 2. Xiaomei, H.: Study on IRFPA Nonuniformity and Its Correction. Infrared and Laser Engineering 28 (3) (1999) 9-12 3. Baoguo, C., Zhiwei, Z., Shike, H.: IRFPA Nonuniformity Correction Using the FPGA Technology. Infrared and Laser Engineering 29 (4) (2000) 55-57 4. Lixiang, R.: DSPS Principle and Application of Tms320c6201. Electronic Industry Publishing Company, Beijing (2000) 5. Tms320c62xx CPU and Instruction Set Reference Guide. Texas Instruments Incorporated (1999) 6. Shaosheng, D., Xianghui, Y.: The research on infrared image nonuniformity real time correction new technology. Optics and Precision Engineering, 12(2)(2004) 201-204
Novel Shape-From-Shading Methodology with Specular Reflectance Using Wavelet Networks
Lei Yang and Jiu-qiang Han
School of Electronics and Information Engineering, Xi'an Jiaotong University, Xi'an 710049, China
[email protected]
Abstract. This paper proposes a novel direct 3-D reconstruction methodology, namely Shape-From-Shading with specular reflectance using wavelet networks. The idea of this approach is to optimize a proper reflectance model by learning the parameters of wavelet networks. Hybrid reflection models, which contain diffuse reflectance and specular reflectance, are used to formulate the reflectance map equation because they are closer to reality. The approach uses wavelet networks as a parametric representation of the unknown surface to be reconstructed. After the orientation expressed by the parametric form of the surface is substituted into the hybrid reflection model, the shape-from-shading problem is formulated as the minimization of the total intensity error function over the network weights. A gradient-descent method is used to update the parameters of the wavelet networks. The heights of the surface can then be obtained from the wavelet networks after supervised learning. Experiments on both synthetic and real images demonstrate the performance of the proposed SFS method.
1 Introduction
Shape reconstruction is a classical problem in computer vision. The techniques are called shape-from-X, where X denotes the specific information, such as shading, stereo, motion, and texture. Shape-From-Shading (SFS), one of the major reconstruction approaches, reconstructs 3-D shape from one 2-D image [1]. The brightness of an image depends on several factors: the orientation of the light source, the location of the camera, the orientation of the object, and the reflectance property of the surface [2]. SFS reconstructs the 3-D shape of an object from the shading variation in the 2-D image, based on satisfying the reflectance map equation at each imaged point [3]. The variation of brightness, namely shading, is used to estimate the orientation of the surface and then to calculate the height of the object. The development of SFS mainly depends on two aspects, namely the research of suitable reflectance models and the investigation of effective SFS algorithms [4]. The importance of suitable reflectance models lies in whether or not the models accurately describe the surface shape associated with the image intensity. The original SFS algorithm was based on the principle of variations. The first SFS was proposed by Horn in the early 1970s and was enhanced by himself and other researchers [1, 2]. Their approaches are based on the Lambertian reflectance model and minimize a total error function consisting of one or several terms: the brightness constraint, the smoothness constraint, the integrability constraint, the
intensity gradient constraint, and so on [1, 2, 5]. Zhang [3] categorized traditional SFS approaches into four categories: minimization, propagation, local, and linear approaches. Early on, SFS reconstruction was viewed as an ill-posed problem [6]. Minimizing the total error function takes a long computational time and does not easily converge to the global optimum solution, so regularization methods are used [7]. In the late 1980s, direct 3-D SFS approaches were proposed [8]. Recently, a statistical approach to 3-D SFS reconstruction has been proposed [9]. Neural networks have also recently been employed to tackle the SFS problem [10, 11]. Further advances in 3-D SFS reconstruction concern both methodology and applications. 3-D SFS reconstruction based on a more general, varying reflectance model was proposed in 2005 [12]. More recently, a perspective 3-D SFS reconstruction method was proposed [13]. A color 3-D SFS reconstruction method was also brought forward in [4]. And 3-D shape reconstruction techniques using fusing strategies, which combine SFS with other approaches such as photometric stereo or image series, were proposed more recently [14, 15]. SFS is widely applied in automatic inspection, terrain analysis (such as of the moon and ocean), and so on [16-18]. Neural networks (NNs) were used to deal with SFS by Q.W. Guo [10]. Another approach, using feed-forward neural networks to model the reflectance model, was proposed by Cho and Chow [4]; a neural-based hybrid reflectance model was also developed by them. Recently, a new neural-network-based adaptive hybrid-reflectance 3-D surface reconstruction model was proposed in 2005 [12]. That model considers the characteristics of each point and the varying albedo; the operation of the neural network is regarded as a non-linear regression, and knowledge of the illumination is not required, but more than three images are needed. Wavelet neural networks (WNs) were proposed by Q. Zhang [19]. The structure of this network is similar to that of the RBF network, except that the radial basis functions are replaced by orthonormal scaling functions. The properties of WNs in function learning and estimation were discussed in [20]; the results showed that WNs compare favorably with MLP and RBF networks. The main disadvantages of Cho's method are as follows. Firstly, it is a two-step method: the shape is calculated from the gradients determined by NNs, so the reconstruction error of shape from gradients is inevitable. Secondly, the method is based on Horn's iterative calculation, but the reconstruction performance of the latter is not satisfactory. Lastly, the parameter update algorithms in NNs are prone to local minima. On the other hand, wavelet networks have good local approximation ability, which may be very suitable for describing the local variation of a shape to be reconstructed. So we use 2-D wavelet networks to express the shape to be reconstructed in order to avoid the above problems. When the shape is expressed parametrically by wavelet networks, the direction vectors of the shape derived from the wavelet networks are substituted into the reflectance map. We also use a minimization method to update the parameters of the wavelet networks; the objective of learning is to minimize the mean square error between the reflectance map and the image. After supervised learning, the surface shape described by the wavelet networks is obtained. From the above analysis, the proposed method may be superior to Cho's and Guo's methods in the following respects.
Because our method belongs to the direct methods, the step of shape integration from surface normals is omitted. The proposed method is not based on Horn's iterative principle. Moreover, the advantages of wavelet networks over neural networks may make them more suitable
to deal with SFS. Besides their local expression ability, the learning of wavelet networks is much faster than that of neural networks because the weights between the input layer and the hidden layer are fixed; the only parameters to be tuned are the weights between the hidden layer and the output layer. A regularization technique is not necessary in the proposed method. Experiments also show that the wavelet-network-based SFS is superior to the earlier network-based SFS of Cho and Guo. The remainder of the paper is organized as follows. The hybrid reflectance model of the surface is discussed in the second section. WNs and their expression ability are reviewed in the third section. The proposed SFS method is described in detail in the fourth section. Section five gives two experiments on synthesized and real images to illustrate the performance of the proposed method. In the last section, the conclusion is drawn.
2 Reflectance Map Equation
The goal of SFS is to recover 3-D shape information from a 2-D image, which is achieved by considering image brightness as a clue to surface geometry. A suitable reflectance model and an effective SFS algorithm determine the success of SFS. A right reflectance model should accurately describe the image intensity variation corresponding to the reflection characteristics of the surface. The reflectance map equation (1) associates the image brightness with the reconstructed surface.
I(x, y) = R(p, q),   (1)
where I(x, y) is the image brightness, R(p, q) is the reflectance map function based on the reflection model, and p, q denote the x- and y-partial derivatives of the reconstructed surface z(x, y) with respect to the image coordinates x and y, respectively:

(p, q) = (∂z/∂x, ∂z/∂y).   (2)
Since the reflectance model is a nonlinear function, the reflectance map equation is a first-order partial differential equation (PDE). The reflectance map depends on the radiance of the light source and the bidirectional reflectance distribution function (BRDF) of the surface. Usually, a point light source is assumed to be located at an infinite distance. In the following, n^T = (p, q, −1) denotes the surface normal, n_i^T = (p_i, q_i, −1) is the light source vector, and n_o^T = (p_o, q_o, −1) is the observing camera vector.
Many empirical reflection models and mathematical reflectance models have been formulated in optical radiation. The reflection characteristic is described by the BRDF. Two kinds of extreme reflection models, namely diffuse reflection and specular reflection, are usually considered in computer vision. For most traditional SFS algorithms, the reflectance model is assumed to be a Lambertian (diffuse) reflectance model [2, 3].
The reflectance map function of a Lambertian surface illuminated by a single point light source (with light strength E) is given by

R_d(n, n_i, n_o) = E · n^T n_i.   (3)
On the other hand, the Torrance–Sparrow model [4] uses a Gaussian distribution (4) to deal with specular reflectance phenomena:

R_{s1}(n, n_i, n_o) = E / (n^T n_o) · exp( − (tan arccos(n^T n_s))² / (2σ²) ),   (4)

where the vector n_s = (n_i + n_o) / |n_i + n_o| is called the halfway specular reflectance direction, and represents the normalized vector sum of the light source direction and the observing camera direction. The factor σ is the standard deviation, which can be viewed as a measurement of the roughness of the surface. Another specular model is Phong's model [12], which is represented as

R_s(n, n_i, n_o) = E · (n^T n_s)^K,   (5)
where K is a constant. Different values of K denote different kinds of surfaces that are more or less mirror-like. We use (5) rather than (4) to describe the specular reflectance characteristic due to its simpler form. But real surfaces are neither purely Lambertian nor purely specular; instead, they are a combination of diffuse and specular components. A hybrid model consisting of three components, a diffuse lobe, a specular lobe, and a specular spike, was proposed by Tagare [21]. A linear combination model of diffuse and specular reflectance is

R_h = (1 − w)R_d + wR_s,   (6)
where R_h is the total intensity of the surface, R_d and R_s are the diffuse intensity and the specular intensity, respectively, and w is the weight of the specular component, determined empirically. We use the hybrid reflectance model (6) to formulate the proposed SFS. The normalized intensities of the image are
I_n(x, y) = (I(x, y) − I_min) / (I_max − I_min),   (7)

where I_max and I_min are the maximum and minimum values of the image I(x, y).
When the gray level and reflectance function are both normalized, we get the normalized reflectance map equation associating the image brightness, shown as

I_n(x, y) = R_h(x, y) = (1 − w)R_d + wR_s
 = (1 − w) (p p_0 + q q_0 + 1) / ( √(p² + q² + 1) √(p_0² + q_0² + 1) )
 + w [ (p p_h + q q_h + 1) / ( √(p² + q² + 1) √(p_h² + q_h² + 1) ) ]^K.   (8)
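As an illustration, the following is a small NumPy sketch of the hybrid reflectance map (8). The values of w and K are only example settings taken from the experiments later in the paper, and clipping the specular lobe at zero is a common convention assumed here rather than something stated by the authors.

```python
import numpy as np

def hybrid_reflectance(p, q, p0, q0, ph, qh, w=0.2, K=10):
    """Hybrid (Lambertian + Phong-style specular) reflectance map, Eq. (8).

    p, q     : surface gradient components (arrays or scalars)
    (p0, q0) : light-source direction parameters
    (ph, qh) : halfway-direction parameters
    w, K     : specular weight and exponent (example values)
    """
    norm = np.sqrt(p ** 2 + q ** 2 + 1.0)
    diffuse = (p * p0 + q * q0 + 1.0) / (norm * np.sqrt(p0 ** 2 + q0 ** 2 + 1.0))
    specular = (p * ph + q * qh + 1.0) / (norm * np.sqrt(ph ** 2 + qh ** 2 + 1.0))
    specular = np.clip(specular, 0.0, None)   # assumed convention: no negative lobe
    return (1.0 - w) * diffuse + w * specular ** K
```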
3 Surface Parametric Expression of Wavelet Networks
The parameterized surface z(x, y) is expressed by wavelet networks in this section. A function φ(x) ∈ L²(R) is a scaling function. Define the functions

φ_{j,k}(x) = 2^{j/2} φ(2^j x − k),  j, k ∈ Z.   (9)

Let {φ_{j,k}(x)}_{k∈Z} formulate the standard orthonormal basis of V_j, where j and k are called the dilation and translation of the scaling function, respectively. V_j is called the scaling subspace of L²(R), and for all j ∈ Z the V_j form a nested chain of closed subspaces … ⊂ V_{−1} ⊂ V_0 ⊂ V_1 ⊂ … such that ∩_{j∈Z} V_j = {0} and ∪_{j∈Z} V_j = L²(R). The orthonormal supplementary space of V_j with respect to V_{j+1} is W_j, namely

V_{j+1} = V_j ⊕ W_j;  V_j ⊥ W_j;  ∀j ∈ Z, L²(R) = ⊕_{j=−∞}^{+∞} W_j,   (10)

where W_j = span{ψ_{j,k}(x) = 2^{j/2} ψ(2^j x − k), k ∈ Z}, ψ(x) is called the mother wavelet function, and {ψ_{0,k}(x), k ∈ Z} form the standard orthonormal basis of W_0. For a given J ∈ Z, the multi-resolution analysis may be written as

L²(R) = V_J ⊕ W_J ⊕ W_{J+1} ⊕ … .   (11)

Using (11), any f(x) ∈ L²(R) can be decomposed as

f(x) = f_{V_J} + f_{W_J} + f_{W_{J+1}} + … ,   (12)

where f_{V_J} ∈ V_J and f_{W_J} ∈ W_J. If J is sufficiently large, then f(x) ≈ f_{V_J}. For finite support and finite precision, we have

f(x) = Σ_{k=1}^{l} 2^{J/2} a_{J,k} φ(2^J x − k).   (13)

Moreover, d-dimensional scaling functions can be written as the product of one-dimensional scaling functions,

φ(X) = Π_{j=1}^{d} φ(x^j),   (14)

where X = (x^1, x^2, …, x^d)^T. So the shape function z(x, y) with two variables can be expressed as

z(x, y) = Σ_{i=1}^{K1} Σ_{j=1}^{K2} 2^{(J1+J2)/2} a_{i,j} [ φ(2^{J1} x − i) φ(2^{J2} y − j) ].   (15)

The parameters of the surface z(x, y) are a_{i,j}. Using (2), the orientations of the surface can also be parameterized by a_{i,j}.
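A minimal sketch of the surface parameterization (15) and of the gradients p and q needed later in (19) is given below, assuming the Mexican-hat scaling function φ(x) = (1 − x²)exp(−x²/2) that the authors use in their experiments; the scale values J1, J2 and the translation grid are illustrative assumptions.

```python
import numpy as np

def phi(t):
    # Mexican-hat scaling function used in the experiments (Section 5).
    return (1.0 - t ** 2) * np.exp(-t ** 2 / 2.0)

def dphi(t):
    # Analytical derivative of phi, needed for the surface gradients p and q.
    return (t ** 3 - 3.0 * t) * np.exp(-t ** 2 / 2.0)

def surface_and_gradients(x, y, a, J1=0, J2=0):
    """Evaluate z(x, y) of Eq. (15) and its partial derivatives (Eq. (19)).

    a is the K1 x K2 array of network weights a_{i,j}.
    """
    K1, K2 = a.shape
    s = 2.0 ** ((J1 + J2) / 2.0)
    z = p = q = 0.0
    for i in range(K1):
        bx, dbx = phi(2.0 ** J1 * x - i), dphi(2.0 ** J1 * x - i) * 2.0 ** J1
        for j in range(K2):
            by, dby = phi(2.0 ** J2 * y - j), dphi(2.0 ** J2 * y - j) * 2.0 ** J2
            z += s * a[i, j] * bx * by
            p += s * a[i, j] * dbx * by   # dz/dx
            q += s * a[i, j] * bx * dby   # dz/dy
    return z, p, q
```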
4 Training Algorithm of Wavelet Networks and Proposed SFS
A framework diagram of the wavelet-network-based SFS is shown in Fig. 1. The normalized image brightness I(x, y) is the reference signal used to train the WNs. R(p(x, y), q(x, y)) is the reflectance model (RM). The surface z(x, y) is expressed by the WNs, so the reflectance model is also parameterized by the wavelet networks. E(W) is the total error function of R(p(x, y), q(x, y)) and I(x, y), where W denotes the parameters of the network, namely {a_{i,j}}. When solving the SFS problem by WNs, the parameters are updated by supervised training to minimize the error function using the back-propagation algorithm. The error function E(W) is usually defined as

E(W) = ∫∫_D [ I(x, y) − R_h(p, q, W) ]² dx dy,   (16)

where D is the image domain. After training, the surface can be obtained from the parameterized expression. In the following, the details of the proposed SFS are discussed.
Fig. 1. A learning framework diagram of the wavelet-network-based SFS (the WN produces z(x, y), the RM produces R(x, y), and the error e(x, y) between I(x, y) and R(x, y) drives the learning)
We assume the resolution of the image is M × N. The support domain of the image is normalized as in [20]. The back-propagation algorithm is used for supervised training of the WNs to minimize the discrete form of the error function (16), which is defined as

E(W) = Σ_{i=1}^{M} Σ_{j=1}^{N} ( I_{i,j} − R_{i,j}(W) )²,   (17)
where I_{i,j} denotes the (i, j)-th intensity value of the original 2-D image, and R_{i,j} is

R_{i,j}(W) = (1 − w) (p_{i,j} p_0 + q_{i,j} q_0 + 1) / ( √(p_{i,j}² + q_{i,j}² + 1) √(p_0² + q_0² + 1) )
 + w [ (p_{i,j} p_h + q_{i,j} q_h + 1) / ( √(p_{i,j}² + q_{i,j}² + 1) √(p_h² + q_h² + 1) ) ]^K,   (18)
where p_{i,j} and q_{i,j}, defined in (19), are calculated using (15):

p_{i,j}(W) = ∂z(x, y)/∂x |_{(x,y)=(i,j)},  q_{i,j}(W) = ∂z(x, y)/∂y |_{(x,y)=(i,j)}.   (19)
For a given image, the forward pass calculates the output values of all nodes in the network, starting from the initialized connection weights, to obtain the synthesized reflectance map. The backward pass then updates the weight values. The next iteration again calculates the reflectance map and updates the weights, until the stopping criteria are satisfied. A gradient optimization algorithm containing a momentum term [12], given by (20), is used to train the network:

Δa_{i,j}(k+1) = α ( −∂E/∂a_{i,j} ) + β Δa_{i,j}(k),   (20)

where a_{i,j} denotes the adjustable parameters of the network, α denotes the learning rate, and β is a momentum term to damp possible oscillations. This paper uses the values 0.4 and 0.6 for α and β. Additionally, during the iterations, the learning rate α should be decreased whenever the total error function shows any oscillations; we reduce it by 10 percent of its current value.

Initialization: input the image I_{i,j} and randomize the parameters {a_{i,j}}.
Iterations:
  Forward calculation: calculate the orientation of the surface using (19), then the hybrid reflectance map (18) and the error function E(W);
  Back-propagation: update the connection weights {a_{i,j}} using (20);
  Stop criteria: continue the iterations until the criteria are reached.
Output the 3-D shape.

Fig. 2. Framework of the proposed 3-D SFS method
After supervised learning, the reconstructed surface is expressed by the weights of the WNs using (15). Based on the above analysis, the framework of the proposed neural-network-based SFS method with the hybrid reflection model is illustrated in Fig. 2.
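The following is a rough Python sketch of the training loop of Fig. 2 under the update rule (20). It estimates ∂E/∂a_{i,j} numerically, which is an assumption made for brevity; the paper back-propagates the gradient analytically through the wavelet network.

```python
import numpy as np

def train_sfs_wavelet_network(I, a, reflectance, gradients,
                              alpha=0.4, beta=0.6, n_iter=40, eps=1e-4):
    """Supervised training of the wavelet-network SFS (Fig. 2), a sketch.

    I           : normalized M x N image
    a           : initial weight array a_{i,j} (randomized)
    reflectance : callable (p, q) -> synthesized reflectance map R_{i,j}, Eq. (18)
    gradients   : callable (a) -> (p, q) maps from Eq. (19)
    alpha, beta : learning rate and momentum (values used in the paper)
    """
    delta = np.zeros_like(a)
    prev_err = np.inf
    for _ in range(n_iter):
        p, q = gradients(a)
        err = np.sum((I - reflectance(p, q)) ** 2)        # Eq. (17)
        if err > prev_err:
            alpha *= 0.9                                  # damp oscillations (10% cut)
        prev_err = err
        grad = np.zeros_like(a)
        for idx in np.ndindex(a.shape):                   # numerical dE/da_{i,j} (assumed)
            a[idx] += eps
            p2, q2 = gradients(a)
            grad[idx] = (np.sum((I - reflectance(p2, q2)) ** 2) - err) / eps
            a[idx] -= eps
        delta = alpha * (-grad) + beta * delta            # Eq. (20)
        a = a + delta
    return a
```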
5 Experiment and Discussion
Two series of experiments were performed to evaluate the performance of the proposed SFS. The wavelet scale function φ(x) = (1 − x²)exp(−x²/2) is used. The number of
nodes in the hidden layer is 121. The reconstructed height was compared with the original height to examine the performance. Criteria such as the mean error (ME) and the root mean square error (RS) were calculated to evaluate the performance of SFS. Experiments using previous methods were also carried out and their performances are used as references. The first experiment gives a numerical comparison between the proposed SFS method and classical ones on a synthesized image produced by the proposed hybrid reflectance model. The second concerns a real image containing specular reflectance. All the algorithms were realized under the following conditions: hardware, AMD 1.7 GHz CPU with 256 MB RAM; software, Windows 2000 and Matlab 7.0.
5.1 Experiment One: Synthesized Image Containing Specular Reflectance
Numerical comparison results were obtained by reconstructing a synthetic hemisphere. The image of the hemisphere was generated mathematically by (8). The light was located in the direction (0, 0, 1), and K and w in (8) were 10 and 0.2, respectively. The number of pixels of the images is 100×100. The reconstructed surface and its errors with respect to the original hemisphere using the proposed SFS are shown in Fig. 3(a). Figs. 3(b), 3(c) and 3(d) are the reconstruction results of three methods: Cho's [11], Wei's [10], and
Fig. 3. Reconstructed shape and error of hemisphere: (a) by proposed method; (b) by Cho's method; (c) by Wei's method; (d) by Horn's method

Table 1. Comparing result of reconstructed shape of synthesized hemisphere

Index of comp   | Proposed | Cho's   | Wei's   | Horn's
ME of height    | -1.065   | -1.5216 | -2.1367 | -4.8800
RS of height    | 1.2451   | 2.6569  | 3.4528  | 6.1146
CPU time (sec)  | 149.62   | 180.93  | 482.26  | 364.83
Horn’s [2] separately as comparing. Table one is the numeric comparing result illuminated in figure 3(a-d). Iterative times of proposed SFS are 40. 5.2 Experiment Two: Real Image with Specular Reflectance Figure 4(a) shows an open image of a real surface. Figure 4(b, c, d) are the reconstructed results using proposed method with hybrid reflectance model, Cho’s and Wei’s method under same iterative condition. The number of pixels of image is 128*128 . K and w in (8) were selected as10, 0.3 respectively.
Fig. 4. Reconstructed shape of real image: (a) image of real surface; (b) by proposed method; (c) by Cho’s method; (d) by Wei’s method
The numerical comparison in experiment one shows that the proposed method is more accurate and faster than the classical ones. The error of Horn's method in experiment one is large. The error of Cho's method arises because it is a two-step method based on Horn's iterative calculation. The good local approximation ability and fast convergence of WNs make the proposed method faster than existing neural-network-based SFS such as Wei's and Cho's. Experiment two shows that the error of the reconstructed shape is mainly located in the specular regions, because the acquired image should describe the highlights correctly, while in real situations the ideal image-capturing conditions are difficult to satisfy. Two possible reasons are the non-ideal point light source and the inaccurate reflectance model.
6 Conclusions
A novel wavelet-network-based SFS with a hybrid reflection model is proposed. Hybrid reflection models are used to formulate the reflectance map equation. Wavelet networks are used as the parametric representation of the unknown surface to be reconstructed. When the orientation expressed by the parametric form of the surface is substituted into the reflection model, the shape-from-shading problem is formulated as the minimization of an intensity error function over the network weights. The height of the surface can be obtained from the WNs after supervised learning. Experiments on both synthetic and real images show that the proposed SFS algorithm is fast and accurate. Further research directions of SFS include color SFS and the fusion of SFS with other shape reconstruction methods.
References 1. Horn, B. K. P., Brooks, M. J.: The Variational Approach to Shape from Shading. Computer Vision Graphics Image Process 33 (1986) 174-208 2. Horn, B. K. P.: Height and Gradient from Shading. Int. J. Computer Vision 5 (1) (1990) 37-75 3. Zhang, R., Tsai, P. S., Cryer, J. E., et al.: Shape from Shading: A Survey. IEEE Trans. PAMI 21 (8) (1999) 690-706 4. Cho, S.Y., Chow, T. W. S.: A New Color 3D SFS Methodology Using Neural-based Color Reflectance Models and Iterative Recursive Method. Neur. Comput. 14 (2002) 2751-2789 5. Ikeuchi, K., Horn, B. K. P.: Numerical Shape from Shading and Occluding Boundaries. Artificial Intelligence 17 (1–3) (1981) 141-184 6. Oliensis, J.: Uniqueness in Shape from Shading. International Journal of Computer Vision 6 (2) (1991) 75-104 7. Zhang, R., Tsai, P. S., J. Cryer, J. E., et al.: Analysis of Shape from Shading Techniques. Proceedings of Computer Vision Pattern Recognition (1994) 377–384 8. Leclerc, Y.G., Bobick, A.F.: The Direct Computation of Height from Shading. IEEE Proc. Computer Vision and Pattern Recognition (1991) 552-558 9. Adrian, G. B., Edwin, R. H., Richard, C. W.: Terrain Analysis Using Radar Shape-fromShading. IEEE Trans. PAMI 25 (8) (2003) 974-992 10. Wei, G. Q., Hirzinger, G.: Learning Shape from Shading by a Multilayer Network. IEEE Trans. Neural Netw. 17 (1996) 985-995 11. Cho, S. Y., Chow, T. W. S.: Neural Computation Approach for Developing a 3-D Shape Reconstruction Model. IEEE Trans. Neural Netw. 12 (2001) 1204-1214 12. Lin, C. T., Cheng, W. C., Liang, S. F.: Neural-network-based Adaptive Hybrid-reflectance Model for 3-D Surface Reconstruction. IEEE Trans. Neural Netw. 16 (6) (2005) 1601-1615 13. Tankus, A., Sochen, N., Yeshurun, Y.: Shape-from-Shading under Perspective Projection. International Journal of Computer Vision 63 (1) (2005) 21–43 14. Antonio, R. K., Hancock, E. R.: A Graph-spectral Approach to Shape-from-Shading. IEEE Trans. Image Processing 13 (7) (2004) 912-926 15. Cryer, J., Tsai, P., Shah, M.: Integration of Shape from X Modules: Combining Stereo and Shading. IEEE Proc. Computer Vision and Pattern Recognition (1993) 720-721 16. Du, Q. Y., Chen, S. B., Lin, T.: An Application of Shape from Shading. 8th international conference on control, automation, robotics and vision (2004) 184-189 17. Song, L.M., Qu, X.H., Xu, K.X., et al.: Novel SFS-NDT in the Field of Defect Detection. International NDT&E 38 (2005) 381–386 18. Mohammad, A., Rajabi, J. A. R.: Optimization of DTM Interpolation Using SFS with Single Satellite Imagery. The Journal of Supercomputing 28 (2004) 193-213 19. Zhang, Q.: Wavelet Networks. IEEE Trans. Neural Netw. 3 (6) (1992) 889-898 20. Zhang, J., Walter, G. G., Miao, Y., et al.: Wavelet Neural Networks for Function Learning. IEEE Trans. Signal Processing 43 (1995) 1485-1497 21. Nayar, S. K., Ikeuchi, K., Kanade, T.: Surface Reflection: Physical and Geometrical Perspectives. IEEE Trans. PAMI 13 (7) (1991) 611-634
Attribute Reduction Based on Bi-directional Distance Correlation and Radial Basis Network*
Li-Chao Chen, Wei Zhang, Ying-Jun Zhang, Bin Ye, Li-Hu Pan, and Jing Li
Institute of Computer Science and Technology, Taiyuan University of Science and Technology, No.66, Waliu Road, Wanbolin District, Taiyuan, Shanxi Prov., 030024, China
[email protected],
[email protected],
[email protected],
[email protected],
[email protected],
[email protected]
Abstract. Attribute reduction is one of the important means to improve the efficiency and the quality of data mining, especially for high-dimensional data. From the viewpoint of distance and correlation, a bi-directional distance correlation (BDDC) method is presented, which can be used to measure the importance of data attributes. Moreover, a revised decrease-increase combination strategy is used to reduce dimensionality, and a radial basis neural network is used to validate the subset. The method adopts an appropriate correlation function according to the sample characteristics, which avoids the limitation of the IOC method. Since both the longitudinal input-output connection and the horizontal difference between attribute and target are taken into account, the measure of attribute importance is more rational, so that quality data are supplied for the subsequent data mining process.
1 Introduction
Data mining is the process of discovering interesting knowledge from large amounts of data stored in databases, data warehouses, or other information repositories [1]. Data preprocessing is usually a necessary step in the knowledge discovery process, since quality decisions must be based on quality data. Removing noisy and redundant data, detecting data anomalies and rectifying them early, and reducing the data to be analyzed can improve the efficiency and quality of data mining. Attribute reduction (also called attribute selection), an important data reduction method, selects the most important and most representative attribute subset, which has the most influence on the mining goal and retains most of the information expressed by the original attribute set. The selection criteria may vary, since they rely on an importance measurement. Information Gain, Information Gain Ratio, Mutual Information [2], Sensitivity [3], Distance, and Relevance (or Correlation) are commonly used measurements. Among them, the former three are only feasible for discrete data analysis, while Sensitivity works only for consistent numeric data; G statistics, Distance, and Relevance are among the most commonly used straightforward measurements, especially for classification cases. Literature [4] proposed the input
* Supported by the Natural Science Foundation of Shanxi Province, China (No. 20051044).
output correlation (IOC) method based on Distance and Relevance. However, only samples with an obvious discrete classification target attribute can be analyzed by it, which limits its application. Quality decisions depend not only on quality data but also on good mining techniques. The neural network is suitable for processing nonlinear and noisy data, but present network-based attribute selection methods must train on every possible attribute set and select the subset by comparing training precision each time, which directly results in poor efficiency [5]. In this paper, data cleaning, normalization and target attribute conformation are first carried out according to domain knowledge and the mining purpose. Then the data attributes are ranked by importance using the bi-directional distance correlation (BDDC) method proposed in this paper. Subsequently, an improved decrease-increase combination strategy (IDICS) and an appropriate radial basis neural network (RBNN) are applied to select the most important attribute subset based on the ranking results. The above process substantially provides high-quality sample data for later mining.
2 RBNN Attribute Reduction Based on BDDC and IDICS
Neural-network-based data mining also includes data preprocessing (including data reduction), data mining, and result analysis and evaluation; moreover, the three stages are alternately overlapped rather than isolated, especially the former two. Before attribute selection, some transformation should be done. Data cleaning is first carried out according to domain knowledge, which includes filling in missing values, smoothing out noise, and identifying and correcting outliers. Then the data are normalized according to the mining approach. Third, the target attribute value is conformed by a certain network in terms of the mining purpose.
2.1 Attribute Importance Measurement
2.1.1 Input Output Correlation Importance Measurement
The input output correlation (IOC) method takes the accumulated sum of the output change caused by the input change, called the IOC value, as the attribute importance measurement. For a certain attribute, the bigger the output change caused by its input change, the greater the importance of the attribute to the output; otherwise, the smaller the significance. Its principle can be denoted by formula (1):

C(k) = |x(i,k) − x(j,k)| × sign|y(i) − y(j)|,   (1)
658
L.-C. Chen et al.
attribute measure, function sign|y(i)-y(j)| only acknowledges the output difference caused by input difference but neglects the difference size and direction of output change. As to classification data ,such as the Iris ,Breast, German-Org and Australian databases used in document[4], the IOC works well as the output value is parallel, difference between class 1and class 2 is equal to that between class1 and class5. But when introduced to another kind, such as city competition power databases of which the output are different not only on quality but quantity, its performance is not very excellent. Take the 1st attribute x(i,1) and output y i of certain cities in Table 1 for instance: Set (i,j) equal to (1,2),(6,5),(2,3),(3,5),(1,2)and(4,3),the limitations of expression |x(i,k)-x(j,k)|×sign|y(i)-y(j)| can be obviously discovered. Therefore the IOC can only be applied in simple parallel classification condition with counter classification output. In other words, formula (1) only considers vertical quality difference and neglects the quantity difference. Besides, formula (1) only paies attention to the relation of vertical input-output variety but neglects the direct influence of the input value to the output value. As to the Chinese city competition power sorting data(refer to Table 2.), the special significance of certain indexes will be disregarded.
()
Table 1. Samples from the Chinese city competition power databases of 2006
i/sample 1/Hongkong 2/Hangzhou
y(i)/target attribute 1
x(i,1)/attribute 1 1
11
14
3/Dongyuan
23
50
4/Chongqing 5/Changzhou
40 35
37 28
6/Nanjing
25
23
2.1.2 Bi-directional Distance and Correction Importance Measurement Based on distance and correction thought, this article makes overall evaluation on vertical and horizontal difference range and proposes the bi-directional distance and correction(BDDC) as the importance measurement. The BDDC value of attribute k is calculated by the following formulas: R(k)=αH(k)+(1-α)V(k),
(2)
H(k)=1-||x(i)-y||=1- ∑(x(i,k)-y(i))2 ,
(3)
V(k)= ∑radbas[(x(i,k)-x(j,k))-(y(i)-y(j)),
(4)
V(k)= ∑|x(i,k)-x(j,k)|×sign|y(i)-y(j)|,
(5)
Attribute Reduction Based on BDDC and Radial Basis Network
659
∈
where R(k) is the importance value of attribute k, α is a constant (α [0,1]). H(k),V(k) are respectively the horizontal and vertical BDDC value of attribute k, while x(i) is the attribute vector i, and radbas(x)=exp(-x2) is the radial basis function. 2.2 Radial Basis Neural Network Attribute Reduction The topology of function based radial basis neural network(RBNN) constitutes three
layers: input layer, hidden layer and output layer. The later hidden uses distance function as basis function, and radial basis function(like Gaussian function) as activation function, while the output function depends on application. For continuous prediction sample, this paper introduces linear output RBNN to select the attribute subset and takes formula(6) as error performance function. As to classification sample, just uses the special probability neural network(PNN)and formula (7): Refer to reference[5,6] for detailed design and training. E1= ∑(t(i)-y(i))2 /N, E2=Ne/(Ngu).
(6) (7)
Among the above formulas, t is actual test output, y is target output, N is the input sample number, u is the number of the output nerve cell, Ne is the number of output that disagree with the target output. Selects the best attribute subset using improved decrease-increase combination strategy(IDICS) and RBNN based on the descending order of BDDC values. The procedure starts the RBF learning with subset of certain most important attributes(generally the former half, N/2) and examines its forecast precision. At each step, it removes the last selected attribute or adds its neighbor in the left subset to the subset, and compares the performance. Selects the subset with the best performance as the best subset. then thinks other attributes weak-relevant, irrelevant or redundant. For further neural network classification or rule extraction, just learns on the selected subset rather than the whole attribute set, which will subsequently improve the mining efficiency and quality, meanwhile it reduces the number of attributes appearing in the discovered patterns, helping to make the patterns easier to understand.
3 Attribute Reduction Algorithm Based on BDDC and RBNN The following pseudo codes express the procedure of the supervised algorithms. Attribute reduction() {getinput
( P) ;
datacleaning(P); normalization(P);
660
L.-C. Chen et al.
if(classificaton) {if(target attribute unavailable) clustering(P,Y)
//Y is the classification output
α=0; bddc-calculation(R); // formula (2), (3), (5) net=PNN; } else {α=1; bddc-calculation(R); //formula (2), (3), (4) net=RBNN; } ranking(R); do{ set[i]=sub_sellect(); //select subset according IDICS and R; precision[i]=net(P,Y,set[i]); compare(precision); }While(new_min(precision)); Save subset[j] corresponding to min(precision); }
4 Experiment and Conformation
《
This paper makes simulation on the Chinese city competitiveness data of 2006 from The city competitive power blue skin book: The Chinese city competitive power report No: to verify the feasibility and efficiency of the reduction algorithm based on BDDC and RBNN. The data provides sorting list of fifteen sub-items and one overall index(refer to the first two columns in Table 2). Since the whole sixteen attributes are of the same
Ⅳ》
Attribute Reduction Based on BDDC and Radial Basis Network
661
Table 2. The BDDC value(R(k)) and IOC value(C(k)) of the fifteen input attributes
No
V(k)
H(k)
R(k)/α=0.5
C(k)
1
attribute name Talented person
0.886
0.8863
0.8862
0.7943
2
Business enterprise
0.7145
0.67
0.6922
0.9
3
Industry
0.6146
0.5739
0.5942
0.8374
4
Public department
0.7667
0.7277
0.7472
0.8572
5
Living environment
0.8221
0.8066
0.8143
0.7452
6
Business environment
0.762
0.7357
0.7489
0.763
7
Innovation environment
0.9
0.8922
0.8961
0.8048
8
Social milieu
0.4953
0.446
0.4707
0.8329
9
Growth
0.1
0.1
0.1
0.7994
10
Scale
0.8857
0.9
0.8929
0.6314
11
Efficiency
0.8493
0.6152
0.7323
0.1
12
Benefit
0.6149
0.5662
0.5906
0.5812
13 14 15
Structure
0.8339 0.8454 0.5501
0.8033 0.7379 0.4525
0.8186 0.7916 0.5013
0.5276 0.318 0.2805
Quality Employment
,
quantity measure without parallelism between each output, so the function approaching thought are applied here. Table 2 Table 3 and Fig 1 just describe the simulation results of the two method in contrast. Table 2 lists the BDDC value calculated by formula (2).(3),(4) and the IOC value by formula (1); Table 3 represents the selection conformation results of the two methods based on the same RBNN learning parameters and same error performance function. Fig 1 illustrates the performance difference between the BDDC and IOC in plot. It is obviously that that error curve of BDDC method is smoother and flatter than that of IOC method, which means the former performance changes more steadily than the later while the input subset changes. The error of BDDC method reaches its minimum when the input subset size is 10,and steadily increases when the subset size is bigger and smaller the 10. As there are many peaks and valleys on the IOC performance curve which means unsteady variety trend with multifarious rise and fall, it’s hardly to make sure the best performance point. Besides, the error value of the BDDC is smaller than that of IOC almost on all the points. The comparison identifies that the BDDC method is more suitable for learning of the city competition data than the IOC method. The simulation confirms that combination of the important attributes 7,10,1,13,5,14,6,4,11,2 is the best input attribute set. Advanced learning can be carried on this subset not the whole set of fifteen attributes.
662
L.-C. Chen et al.
Table 3. Ranking and confirmation results of BDDC and IOC
rank
No
1 2 3 4 5 6 7
7 10 1 13 5 14 6
8
4
9
11
10
2
11 12 13 14 15
3 12 15 8 9
BDDC subset 7 7,10 7,10,1 7,10,1,13 7,10,1,13,5 7,10,1,13,5,14 7,10,1,13,5,14,6 7,10,1,13,5,14, 6,4 7,10,1,13,5, 14,6,4,11 7,10,1,13,5, 14,6,4,11,2 … … … … total set
error
IOC subset
No
7.77352 0.031008 0.009202 0.006156 0.016189 0.007324 0.005157
2 4 3 8 7 9 1
0.002503
6
0.00168
5
0.001419
10
0.002284 0.010607 0.010708 0.011607 0.01189
12 13 14 15 11
2 2,4 2,4,3 2,4,3,8 2,4,3,8,7 2,4,3,8,7,9 2,4,3,8,7,9,1 2,4,3,8,7,9, 1,6 2,4,3,8,7,9, 1,6,5 2,4,3,8,7,9, 1,6,5,10 … … … … total set
error 11.2403 0.707995 0.02134 0.007067 0.003624 0.006563 0.019981 0.035907 0.0051 0.033536 0.012049 0.012319 0.014836 0.014832 0.014863
Performance curve of BDDC and IOC 0.05
0.04
←IOC Error
0.03
0.02
←BDDC
0.01
0
2
4
←0.001419 6 8 10 12 Accumulate subset size
14
Fig. 1. Performance plot of BDDC and IOC
5 Conclusion From the view of distance and correction as well as overall consideration on vertical and horizontal difference, this article proposes the bi-directional distance correction method to evaluate the attribute importance, then performs attribute reduction with radial basis network and improved decrease-increase combination strategy based on importance ranking, which will benefit the subsequent classification, prediction and
Attribute Reduction Based on BDDC and Radial Basis Network
663
other rule extraction. This method not only overcomes the shortcomings and limitations of IOC measurement, but avoids the disadvantage of existing network attribution selection having to train on the neural total attribute. In addition, the algorithm is feasible to both discrete value and continuous value. The experiment proves the good feasibility and performance of the supervised attribute reduction.
References 1. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. US: Morgan Kaufmann Publisher (2000) 2. Kwak, N. Choi C.-H.: Input Feature Selection for Classification Problem IEEE Trans on Neural Networks 13 (1) (2002) 143-159 3. Engelbrecht, A.P.: A New Pruning Heuristic Based on Variance Analysis of Sensitivity Information IEEE Trans on Neural Networks 12 (6) (2001) 1386-1399 4. Wen, Z. Wang, Z.: An Efficient RBF Neural Network Attribute Selection Method Based on Data Attribute Important Ranking. Computer Applications 23 (8) (2003) 34-36 5. Wang, X., Ye, B., Liu, Y.: Prediction of Technical and Economic Norm of the Fully Mechanized Face Based on Neural Network. Systems Engineering–Theory & Practice 21 (7) (2001) 129-133 6. Xiu, J., Wang, L.: Rule Extraction Based on Data Dimensionality Reduction Using RBF Neural Networks ICONIP 2001 Proceedings 8th International Conference on Neural Information Processing, Shanghai, China 1 (2001) 149-153
,
Appendix Author introduction: Chen Li-Chao (1961- ), male, born in Wanrong Shanxi China, professor, doctor, mainly studies directions: artificial intelligence, pattern recognition, data mining; Zhang Wei (1983- ), female, born in Tai'an Shandong China, master postgraduate student, mainly studies in: data mining; Zhang Ying-Jun(1969- ), male, born in Shanxi CHINA, Senior Engineer, master, mainly studies in artificial intelligence, Artificial life; Ye Bin(1971- ),male, born in Pingding Shanxi China, Associate professor, doctor, mainly studies in artificial intelligence and its application , pattern recognition and intelligent system; Pan Li-Hu(1974-),male, Lecturer, master, mainly studies in internet management and internet security, data mining. Li Jing(1974-),female, born in Zouping Shandong China, master postgraduate student, mainly studies in data mining.
Unbiased Linear Neural-Based Fusion with Normalized Weighted Average Algorithm for Regression Yunfeng Wu1 and S.C. Ng2 1
School of Information Engineering, Beijing University of Posts and Telecommunications Xi Tu Cheng Road 10, Haidian District, 100876 Beijing, China
[email protected] 2 School of Science and Technology, The Open University of Hong Kong 30 Good Shepherd Street, Homantin, Kowloon, Hong Kong
[email protected]
Abstract. Regression is a very important data mining problem. In this paper, we present a new unbiased linear fusion method that combines component predictors so as to solve regression problems. The fusion weighted coefficients assigned are normalized, and updated by estimating the prediction errors between the component predictors and the desired regression values. The empirical results of our regression experiments on five synthetic and four benchmark data sets show that the proposed fusion method improves prediction accuracy in terms of mean-squared error, and also provides the regression curves with better fidelity with respect to normalized correlation coefficients, compared with the popular simple average and weighted average fusion rules.
1 Introduction Data regression is an important task of data analysis which develops a number of mathematical models so as to describe the functional relationships between dependent and independent variables. The practical applications are abundant in a variety of disciplines such as economics, time-series prediction, and biological sciences [1]. Although the conventional statistical techniques, like multiple linear regression models, can make a good error-bar prediction in low dimensions, most approaches suffer from drawbacks of unreliability in dealing with high-dimensional data points contaminated by ambiguous noise. One possible solution is to use a multiple predictor system that can rectify local warps of its component regression models. Fusion or ensemble methods can be utilized to construct such a multiple predictor system in which a group of component predictors are combined to form the overall prediction [2]. In other words, the ensemble methods can fuse the knowledge generated by component predictors in order to make a consensus decision which is supposed to be superior to the one attained by an individual predictor working solely. The merits of the ensemble methods for design of multiple learner systems have been widely accepted by the professional community [3], and this type of machine learning approach is also considered to be promising for solving regression problems [4]. The pioneering ensemble algorithms in the literature are Boosting [5] and Bagging [6]. Boosting works by repeatedly implementing a given weak-learning machine on D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 664–670, 2007. © Springer-Verlag Berlin Heidelberg 2007
Unbiased Linear Neural-Based Fusion with Normalized Weighted Average Algorithm
665
different distributed training data sets, and then combining their outputs. The distribution of the training data in the current iteration depends on the performance of prior predictors. The first version of the Boosting algorithm developed by Schapire is the Boosting-by-filtering [7], which involves a data filtering procedure with a weaklearning algorithm. Unfortunately, it requires a large size of training data, which is not possible in many practical applications. In order to overcome such a drawback, Freund and Schapire proposed the AdaBoost [5] to find a typical mapping function or hypothesis with a low error rate relative to a given probability distribution of the training data. For regression, Freund and Schapire developed the AdaBoost.R [5]. In spite of the effectiveness, the Boosting algorithms still cause some pitfalls. First, they have to project each regression data set into many classification sets, and the number of projected classification instances grows intensively larger after just a few boosted iterations. Second, the loss function changes in different iterations, and even differs between instances in the same iteration. In addition, the Boosting algorithms are very sensitive to outliers and sometimes results in overfitting [8]. On the other hand, the Bagging algorithm introduces the bootstrap sampling procedure [9] into the construction of component predictors, so as to generate enough independent variance among them [6]. The bias of the Bagging ensemble would converge by averaging while the variance gets much smaller than that of each component predictor. At present, linear combination models [10], [11], functioned with the fusion rules [12], are frequently used in the Bagging, AdaBoost, and other popular ensemble learning algorithms. However, the Simple Average (SA) fusion rule treats all component predictors equally and does not efficiently utilize the information regenerated by them. The Weighted Average (WA) method suffers from estimating weights according to the possibility density function of the added errors from noisy data, so its superiority cannot always be guaranteed in practical applications [13]. In this paper, we propose a new training rule that updates the normalized weighted coefficients assigned to component predictors in an unbiased linear fusion model. The rest of this paper is organized as follows. Section 2 provides a description of the proposed normalized weighted average algorithm. Section 3 presents the regression experiments and empirical results on different synthetic and benchmark data sets. Section 4 concludes the paper with an emphasis of the merits of the proposed fusion method, and also comments on future work.
2 Normalized Weighted Average (Normwave) Algorithm
The goal of regression is to create a statistical model that characterizes the functional relationship between one dependent variable and multivariate inputs. For analytical convenience, let S = {(x_i, y_i), i = 1, …, N, x_i ∈ ℜ^d} be the set of regression data; then the standard regression equation can be expressed as a function plus a residual, i.e.

y = f(x) + ε,   (1)
where y is a scalar dependent variable and ε is the residual (a random error term with zero mean and finite variance). For standard statistical inference, the input attributes are
independently and identically distributed under certain stationary distributions, and the theory of error analysis commonly assumes that the residual is Gaussian noise. Now, let us divert our attention to the linear fusion. Suppose we have a total of K component predictors for the unbiased linear fusion, each having a normalized weighted coefficient w_k assigned, i.e.

F(x) = Σ_{k=1}^{K} w_k f̂^k(x),  Σ_{k=1}^{K} w_k = 1,   (2)
where F(x) and f̂^k(x) denote the final fusion prediction and the output of the k-th component predictor, respectively. The appropriate fusion coefficients should minimize the mean-squared error (MSE) between the final fusion prediction and the desired values, specifically,

E_fusion = (1/N) Σ_{i=1}^{N} [ (1/2) ( y_i − Σ_{k=1}^{K} w_k f̂^k(x_i) )² ].   (3)
Thus, we can compute the gradient of the fusion MSE with respect to the weighted coefficients as follows,

∇_{w^k} E_fusion = ∂/∂w^k { (1/N) Σ_{i=1}^{N} [ (1/2) ( y_i − Σ_{k=1}^{K} w^k f̂^k(x_i) )² ] }
 = (1/2N) Σ_{i=1}^{N} ∂/∂w^k [ y_i² − 2 y_i Σ_{k=1}^{K} w^k f̂^k(x_i) + ( Σ_{k=1}^{K} w^k f̂^k(x_i) )² ]
 = (1/N) Σ_{i=1}^{N} [ − y_i f̂^k(x_i) + f̂^k(x_i) Σ_{l=1}^{K} w^l f̂^l(x_i) ]
 = − (1/N) Σ_{i=1}^{N} f̂^k(x_i) [ y_i − Σ_{l=1}^{K} w^l f̂^l(x_i) ].   (4)
By substituting (2) and (3) into the steepest-descent adaptation rule [14] for updating the weighted coefficients, we have

w^{k*} = w^k + μ [ −∇_{w^k} E_fusion ]
 = w^k + (μ/N) Σ_{i=1}^{N} f̂^k(x_i) [ y_i − Σ_{l=1}^{K} w^l f̂^l(x_i) ]
 = w^k + (μ/N) Σ_{i=1}^{N} f̂^k(x_i) Σ_{l=1}^{K} w^l ( y_i − f̂^l(x_i) )
 = w^k + (μ/N) Σ_{i=1}^{N} f̂^k(x_i) Σ_{l=1}^{K} w^l e^l(x_i),   (5)

where μ is the learning rate parameter (typically 0 < μ < 1), and e^l(x_i) represents the difference between the target value and the prediction of the l-th component predictor for the i-th instance, i.e.

e^l(x_i) = y_i − f̂^l(x_i).   (6)
We may rewrite (4) in matrix form, and then obtain the normalized weighted average (Normwave) algorithm below:

w^{k*} = w^k + μ w^T e f̂^k,   (7)

where w ∈ ℜ^{K×1}, e ∈ ℜ^{K×N}, and f̂^k ∈ ℜ^{N×1}.
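A small NumPy sketch of the Normwave update (7) is given below; re-normalizing the weights after each epoch is our reading of the constraint in (2), and the learning-rate value and epoch count are only examples.

```python
import numpy as np

def normwave_update(w, F, y, mu=0.1, n_epochs=50):
    """Normalized weighted average (Normwave) coefficient update, Eq. (7).

    w : (K,) initial fusion weights summing to one
    F : (K, N) matrix of component-predictor outputs f^k(x_i)
    y : (N,) target values
    """
    w = np.asarray(w, dtype=float).copy()
    for _ in range(n_epochs):
        e = y[None, :] - F                       # (K, N) errors e^l(x_i), Eq. (6)
        for k in range(F.shape[0]):
            w[k] = w[k] + mu * w @ e @ F[k]      # Eq. (7): w^k += mu * w^T e f^k
        w = w / np.sum(w)                        # keep the weights normalized (assumed)
    return w

# Example use: fit the weights on a validation set, then fuse test predictions.
# F_val = np.vstack([m.predict(X_val) for m in models])
# w = normwave_update(np.ones(len(models)) / len(models), F_val, y_val)
# fused_test = w @ np.vstack([m.predict(X_test) for m in models])
```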
3 Experiments and Results 3.1 Regression Data Sets We applied the proposed Normwave algorithm to the unbiased linear fusion for a multiple learner system in dealing with regression problems. In the experiments, a total of night regression sets (five synthetic and four benchmark data sets) were used. The details of function expressions and variable descriptions concerning the synthetic data sets are tabulated in Table 1. The Zigzag and Rhythm data was used for testing regression performance of the Bagging-based least-mean-square fusion algorithm which was present in [15]. The 2-D Mexican Hat, 3-D Mexican Hat, and Gabor data was used by Zhou et al. [16] for investigating in which situation the ensemble Table 1. Descriptions of the synthetic regression data sets Data Set
Expression
Variables
Zigzag
y = sin x 2 cos x 2 − 0.25 x + ε
Rhythm
⎡ ( mod( x,11) − 5 ) ⎤ y=⎢ ⎥ +ε 8 ⎣ ⎦
x ∈ U [ 0, 3]
ε ∼ N (0, 0.2) x ∈ U [ 0, 20]
3
y=
2-D Mexican Hat
y=
3-D Mexican Hat Gabor
y=
π 2
ε ∼ N (0, 0.1) x ∈ U [ −2π , 2π ]
sin | x | +ε | x|
sin x12 + x22 x +x 2 1
2 2
ε ∼ N (0, 0.1) xi ∈ U [ −4π , 4π ]
+ε
ε ∼ N (0, 0.1)
exp ⎡⎣ −2 ( x12 + x22 ) ⎤⎦ cos ⎣⎡ 2π ( x1 + x2 ) ⎦⎤ + ε
xi ∈ U [ 0, 1]
ε ∼ N (0, 0.3)
Number of Instances 2000 2000 2000 2000 2000
Table 2. Benchmark regression data sets

Data Set        | Number of Instances | Number of Attributes | Source
Abalone         | 4177                | 7                    | UCI
Boston Housing  | 506                 | 13                   | UCI
Auto mpg+       | 392                 | 7                    | UCI
CPS 85 Wages    | 534                 | 10                   | StatLib
+ Six instances with missing values were removed.
methods is not able to work effectively. The benchmark data (see Table 2) were obtained from the UCI Machine Learning Repository1 and the StatLib2, respectively. 3.2 Experiments In our experiment, each dataset was randomly partitioned into two disjointed subsets: one of 75% data size for training and the other of 25% size for testing. The training subset was sampled by Efron’s bootstrap approach [9] in order to generate a number of replication subsets which were utilized for modeling the component predictors. The testing subset was used to update the unbiased fusion weighted coefficients, and then to measure the prediction quality of component predictors, along with the fusion system. The linear fusion system consists of five component predictors, including two feedforward neural networks and three robust linear regression models. The neural networks are with the same architecture (10 neurons in only one hidden layer), whereas two networks are activated by the radial symmetric Gaussian function and the thin plate spline function, respectively. We didn’t implement the early stopping during the training, because the bootstrap sampling helps the fusion system overcome overfitting. Three robust linear regression models were set up by different versions of the iteratively least-squares algorithm [17], in which the coefficients are estimated by applying the logistic, cauchy, and welsch functions, respectively, to the residual derived from the previous iteration. Each experiment was repeated 10 times for statistical analysis. For comparison purpose, we also implemented the SA and WA fusion rules in all experiments. The results were evaluated by the measures of MSE and normalized correlation coefficient (NCC). The latter is defined by
NCC = Σ_{i=1}^{N} y_i F̂(x_i) / √( Σ_{i=1}^{N} y_i² · Σ_{i=1}^{N} F̂(x_i)² ).   (8)
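For reference, a one-function NumPy sketch of the NCC measure (8):

```python
import numpy as np

def normalized_correlation_coefficient(y, f):
    """Normalized correlation coefficient (NCC) of Eq. (8) between the
    desired outputs y_i and the fused predictions F(x_i)."""
    y = np.asarray(y, dtype=float)
    f = np.asarray(f, dtype=float)
    return np.sum(y * f) / np.sqrt(np.sum(y ** 2) * np.sum(f ** 2))
```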
i
3.3 Empirical Results
The regression results of the three fusion rules studied are summarized in Table 3 and Table 4. It is worth noting that both the WA and Normwave methods provide lower regression errors than the SA in all 9 experiments. We also consider the relative accuracy improvement ratio (RAIR) defined by

RAIR_{a,b} = (MSE_b − MSE_a) / MSE_b × 100%.   (9)
The Normwave remarkably improves the prediction accuracy versus the SA in terms of RAIR, in particular, 85.4454%, 54.3651%, and 61.1326%, for the 2-D Mexican Hat, 3-D Mexican Hat, and Gabor regression sets, respectively. In the meanwhile, the corresponding RAIR values of the WA versus the SA on these data sets are 73.7767%, 13.6905%, and 32.8614%, respectively. The WA only outperforms the Normwave for the Boston Housing data. 1 2
Online available: http://www.ics.uci.edu/~mlearn/MLRepository.html Online available: http://lib.stat.cmu.edu/datasets/
Consider now evaluation with respect to the NCC, which is the most frequently used measure of association in time-series prediction [18]. It can be observed that the WA and Normwave both characterize the nature of regression curves consistently better than the SA. In addition, the Normwave is slightly superior to the WA, except for the Abalone and Boston Housing sets.

Table 3. Mean-squared errors of regression (Mean with Std. in parentheses; Std. is the abbreviation of standard deviation)

Data Sets        | Simple Average    | Weighted Average  | Normwave
Zigzag           | 0.0967 (0.0061)   | 0.0861 (0.0070)   | 0.0784 (0.0072)
Rhythm           | 0.0172 (0.0012)   | 0.0169 (0.0012)   | 0.0159 (0.0012)
2-D Mexican Hat  | 0.0797 (0.0052)   | 0.0209 (0.0013)   | 0.0116 (0.0013)
3-D Mexican Hat  | 0.0504 (0.0055)   | 0.0435 (0.0070)   | 0.0230 (0.0105)
Gabor            | 0.2331 (0.0076)   | 0.1565 (0.0061)   | 0.0906 (0.0061)
Abalone          | 4.9559 (0.2511)   | 4.9368 (0.2560)   | 4.9167 (0.2580)
Boston Housing   | 27.9955 (9.7945)  | 26.2835 (9.7877)  | 26.3956 (9.9128)
Auto mpg         | 10.8716 (2.5544)  | 10.7072 (2.5452)  | 10.5197 (2.5583)
CPS 85 Wages     | 20.1284 (6.4115)  | 20.0768 (6.3869)  | 20.0284 (6.3955)
Table 4. Normalized correlation coefficient results

Data Sets          Simple Average    Weighted Average    Normwave
Zigzag             0.8473            0.8664              0.8795
Rhythm             0.5407            0.5494              0.5681
2-D Mexican Hat    0.9170            0.9759              0.9799
3-D Mexican Hat    0.8156            0.8375              0.8667
Gabor              0.8970            0.9003              0.9013
Abalone            0.9772            0.9773              0.9773
Boston Housing     0.9768            0.9783              0.9782
Auto mpg           0.9915            0.9916              0.9918
CPS 85 Wages       0.9046            0.9049              0.9054
4 Conclusion

The proposed Normwave algorithm works effectively by using the prediction errors of the component predictors to update the normalized weighting coefficients in the unbiased linear fusion. The experimental results on the synthetic and benchmark regression data demonstrate that the proposed Normwave method achieves a tangibly larger improvement in prediction accuracy than the prevailing SA and WA fusion rules. Furthermore, the Normwave method reproduces most regression curves with the highest fidelity and also overcomes some local warps of the component predictors. Future work will be directed toward a study of the linear fusion model based on the Normwave algorithm for multiple classifier systems.
Acknowledgment

This work was supported in part by the National Natural Science Foundation of China under Grant No. 60575034, the Doctoral Program Foundation of the Ministry of Education of China under Grant No. 20060013007, and the 2005 Innovation Research Funds from the Graduate School, Beijing University of Posts and Telecommunications.
References

1. von Eye, A.: Regression Analysis for Social Sciences. Academic Press, San Diego, CA, USA (1998)
2. Kittler, J., Hatef, M., Duin, R.P.W., Matas, J.: On Combining Classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (1998) 993-1001
3. Kuncheva, L.I.: Combining Pattern Classifiers: Methods and Algorithms. Wiley, New York, NY, USA (2004)
4. Opitz, D., Maclin, R.: Popular Ensemble Methods: An Empirical Study. Journal of Artificial Intelligence Research 11 (1999) 169-198
5. Freund, Y., Schapire, R.E.: A Decision-Theoretic Generalization of On-line Learning and an Application to Boosting. Journal of Computer and System Sciences 55 (1997) 119-139
6. Breiman, L.: Bagging Predictors. Machine Learning 24 (1996) 123-140
7. Schapire, R.E.: The Strength of Weak Learnability. Machine Learning 5 (1990) 197-227
8. Ridgeway, G.: The State of Boosting. Computing Science and Statistics 31 (1999) 172-181
9. Efron, B., Tibshirani, R.: An Introduction to the Bootstrap. Chapman and Hall, New York, NY, USA (1993)
10. Fumera, G., Roli, F.: A Theoretical and Experimental Analysis of Linear Combiners for Multiple Classifier Systems. IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (2005) 942-956
11. Tumer, K., Ghosh, J.: Analysis of Decision Boundaries in Linearly Combined Neural Classifiers. Pattern Recognition 29 (1996) 341-348
12. Kuncheva, L.I.: A Theoretical Study on Six Classifier Fusion Strategies. IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (2002) 281-286
13. Wu, Y.F., Wang, C., Ng, S.C., Madabhushi, A., Zhong, Y.X.: Breast Cancer Diagnosis Using Neural-Based Linear Fusion Strategies. Proc. 13th Int'l Conf. Neural Information Processing, Lecture Notes in Computer Science 4234 (2006) 165-175
14. Haykin, S.: Neural Networks: A Comprehensive Foundation, 2nd ed. Prentice Hall PTR, Englewood Cliffs, NJ, USA (1998)
15. Wu, Y.F., Wang, C., Ng, S.C.: Bagging.LMS: A Bagging-Based Linear Fusion with Least-Mean-Square Error Update for Regression. Proc. 2006 IEEE Region 10 Conf. (2006) 393-396
16. Zhou, Z.H., Wu, J., Tang, W.: Ensembling Neural Networks: Many Could Be Better than All. Artificial Intelligence 137 (2002) 239-263
17. Street, J.O., Carroll, R.J., Ruppert, D.: A Note on Computing Robust Regression Estimates via Iteratively Reweighted Least Squares. The American Statistician 42 (1988) 152-154
18. Marques de Sa, J.P.: Applied Statistics using SPSS, STATISTICA, and MATLAB. Springer-Verlag, Berlin, Germany (2003)
Discriminant Analysis with Label Constrained Graph Partition

Peng Guan, Yaoliang Yu, and Liming Zhang

Department of Electronic Engineering, Fudan University, Shanghai 200433, China
[email protected],
[email protected],
[email protected]
Abstract. In this paper, a space partition method called "Label Constrained Graph Partition" (LCGP) is presented to address the Sample-Interweaving-Phenomenon in the original space. We first divide the entire training set into subclasses by means of LCGP, so that the scopes of the subclasses do not overlap in the original space. Then the "Most Discriminant Subclass Distribution" (MDSD) criterion is proposed to select the best partition result. Finally, the typical LDA algorithm is applied to obtain the feature space and an RBF neural network classifier is utilized to make the final decision. Computer simulations and comparisons are given to demonstrate the performance of our method.
1 Introduction

Feature extraction and pattern classification are two important stages in pattern recognition. Neural network classifiers, such as the RBF classifier [5], are generally considered powerful tools at the decision level. Unfortunately, the existing projection methods may not separate the samples well in the feature space. One reason is that the samples in the original space are already interwoven. Although the well-known LDA algorithm can obtain a good classification feature space, it assumes that the samples of each class are generated from an underlying multivariate normal distribution with a common covariance matrix but different means [1]. In fact, this assumption cannot be satisfied in most cases. It is possible that one class is formed by several clusters (several Gaussians) or that different classes are severely interwoven, as shown in Fig. 1(a). Under these circumstances, the performance of LDA deteriorates quickly. In order to solve this problem, many methods have been proposed, such as Nonparametric Discriminant Analysis (NDA) [7], the approximate Pairwise Accuracy Criterion (aPAC) [8], and Penalized DA (PDA) [9], in which the assumption of LDA can be relaxed. But these methods are aimed at particular issues and do not generalize. Recently, Manli Zhu et al. introduced the idea of Subclass Discriminant Analysis (SDA) [2] (PAMI, 2006), in which each class is divided into k subclasses and then the LDA algorithm is applied to these subclasses. Regarding the choice of the number k, they proposed both the SDA-loot and the SDA-stability methods. The problem of SDA is that every class is divided into exactly the same number of subclasses, which does not accord with the true distributions of the classes. Furthermore, the subclasses obtained by SDA may still be interwoven from the
perspective of the entire training set, because SDA considers the distribution of each class individually. In order to overcome the limitations of LDA and avoid the drawbacks of SDA, we propose a space partition method called LCGP. In this scheme, we first obtain all the possible subclass partitions iteratively according to the topological structure of the entire training set. Then the MDSD criterion is proposed in this paper to select the best partition result. Since the MDSD criterion considers both the compactness of each subclass and the distances between every two subclasses, it is a reliable indication of the goodness of a subclass partition. Finally, the typical LDA algorithm and an RBF neural network classifier are applied to classify the samples. This paper is organized as follows: in Section 2, SDA is reviewed briefly. LCGP is introduced in Section 3. Experimental results are shown in Section 4 and Section 5 gives the conclusions.
Fig. 1. In (a), class 2 is formed by two separate Gaussians and the scopes of class 1 and class 2 overlap. In (b), every class is partitioned into 2 subclasses.
2 LDA and SDA

Given a training set X = {x_1, x_2, ..., x_N} including N samples that belong to C classes, where x_i ∈ R^p, i = 1, 2, ..., N. The ith class has n_i samples, and ∑_{i=1}^{C} n_i = N. μ_i and μ are the means of class i and of the entire training set, respectively, and x_{ij} is the jth sample in class i. LDA aims to find the vector bases V so that the following function is maximized:

J(v) = \frac{v^T S_b v}{v^T S_w v}   (1)
where

S_b = \sum_{i=1}^{C} (\mu_i - \mu)(\mu_i - \mu)^T \quad and \quad S_w = \frac{1}{N} \sum_{i=1}^{C} \sum_{j=1}^{n_i} (x_{ij} - \mu_i)(x_{ij} - \mu_i)^T .

If the distribution of the dataset contradicts the assumption of LDA, the result will be very disappointing, as in the case of Fig. 1(a). In SDA [2], the largest possible number of subclasses, h_max, that each class can be divided into must first be defined. In SDA-loot, the authors use all but one sample as the training set (N − 1 training samples). For every h ∈ [1, h_max], they first adopt Nearest-Neighbor-based Clustering [2] to evenly divide each class into h subclasses, then solve (1) using the extended classes to obtain the vector bases, and finally test whether the sample left out is correctly classified. The procedure is repeated N times, and the optimal h corresponds to the highest correct rate. SDA-stability rules out the leave-one-out test procedure; instead, it introduces the notion of a Stability Criterion [3]. For every h ∈ [1, h_max], they compute the Stability Criterion index, and the optimal h corresponds to the minimal Stability Criterion index. The complexity of SDA-stability is of the order of N²p + h_max p³, while that of SDA-loot exceeds N³p + N h_max p³. Note that the computation of the distances of all sample pairs in NN-Clustering requires N²p operations and the eigenvalue decomposition requires p³.
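As an illustration of Eq. (1), the scatter matrices above can be formed directly and the vector bases taken as the leading generalized eigenvectors. The sketch below uses numpy/scipy, assumes S_w is nonsingular, and is not the authors' implementation; all names are ours.

```python
import numpy as np
from scipy.linalg import eigh

def lda_bases(X, labels, n_components):
    """Compute S_b and S_w as defined above and return the vector bases V
    maximizing Eq. (1), via the generalized eigenproblem S_b v = lambda S_w v."""
    N, p = X.shape
    mu = X.mean(axis=0)
    Sb = np.zeros((p, p))
    Sw = np.zeros((p, p))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mu_c = Xc.mean(axis=0)
        Sb += np.outer(mu_c - mu, mu_c - mu)
        Sw += (Xc - mu_c).T @ (Xc - mu_c) / N
    # the largest generalized eigenvalues give the most discriminant directions
    eigvals, eigvecs = eigh(Sb, Sw)
    return eigvecs[:, ::-1][:, :n_components]
```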
3 Discriminant Analysis with Label Constrained Graph Partition

We first create a 1-NN Label Constrained Graph (LCG) in the original space and divide it into several connected sub-graphs, each of which represents a subclass. Then an iterative algorithm is proposed to obtain the partition result of the k-NN LCG. Furthermore, we put forward the MDSD criterion to choose the best partition result. Finally, the feature space can be obtained by Eq. (1).

3.1 Constructing the 1-NN LCG

The entire training set can be viewed as an undirected graph G = (V, E), where V = {x_1, ..., x_N} is the node set, in which each node corresponds to a serial number in (1∼N), and E is the edge set. Notably, each node (or training sample) has a label that decides which class the node belongs to; for example, Label(x_i) = 3 means x_i belongs to the third class. The labels of the nodes are available from the very beginning, since we have the detailed information of the training set. An edge (x_i, x_j) in E denotes that x_i and x_j are connected. Initially, E = ∅. First, the Distance Matrix D ∈ R^{N×N} is constructed with elements d_{ij} = ||x_i − x_j||². The Nearest Relationship Matrix Q ∈ R^{N×(N−1)} is then built by sorting the elements in every row of D, where Q_{ij} is the serial number of the jth nearest neighbor of the ith node. Let G_1 be the 1-NN LCG. For every node x_i in V, we find its nearest neighbor x_{Q_{i1}}. If x_i and x_{Q_{i1}} have the same label, we add the edge (x_i, x_{Q_{i1}}) into E and
say that x_i and x_{Q_{i1}} are connected. In this way, we get the 1-NN LCG (G_1). We can generalize to the k-NN LCG (G_k), in which every node is connected to its first k nearest neighbors under the "same label" restriction. It is important that the connectivity of the LCG is related to both the nearest-neighbor relationship and the label information, which ensures that the scopes of different subclasses do not overlap. The subclass partition result G_1 can easily be obtained by means of a graph connectivity estimation algorithm [4].

3.2 Iterative Scheme for the k-NN LCG
Given G_1, an iterative method is proposed to compute G_1, G_2, ..., G_k easily; G_{k+1} is obtained simply by merging subclasses of G_k. The algorithm is as follows:

Algorithm: Updating G_k
Suppose the training set has N samples that belong to C classes and G_k has been divided into m subclasses, G_k = {sub_1, sub_2, ..., sub_m}. Label(x_i) denotes which class x_i belongs to, while SubClass(x_i) denotes which subclass x_i is currently assigned to.
Begin
1  G_{k+1} ← G_k
2  for each i ∈ [1, N]
3    do if Label(x_{Q_{i,k+1}}) = Label(x_i) && SubClass(x_{Q_{i,k+1}}) ≠ SubClass(x_i)
4      then merge SubClass(x_{Q_{i,k+1}}) and SubClass(x_i) in G_{k+1}
End

Compared with G_k, every node in G_{k+1} is connected to one more neighbor x_{Q_{i,k+1}} under the "same label" restriction. For every node x_i, its (k+1)th nearest neighbor is x_{Q_{i,k+1}}. If x_{Q_{i,k+1}} and x_i are in the same class, x_{Q_{i,k+1}} is considered a candidate node; otherwise we leave it out of account. For every candidate node we check whether x_{Q_{i,k+1}} and x_i are in the same subclass; if not, we merge the two subclasses that contain x_{Q_{i,k+1}} and x_i, respectively. An example of going from G_1 to G_2 is shown in Fig. 2: Fig. 2(a) demonstrates the partition result of G_1 and Fig. 2(b) shows the situation of G_2. In this example, nodes a, b, and c are in class 1 and the other nodes are in class 2. G_1 is partitioned into five subclasses according to the 1-NN LCG. Although g is the nearest node to c, g and c cannot be connected because they have different labels. During the updating process, sub_1 and sub_2 are merged, and sub_3 and sub_4 are also merged. Thus, G_2 has only three subclasses. This procedure is repeated until the LCG has exactly C subclasses, which means the samples in each class are all connected. When G_1, G_2, ..., G_k are available, the MDSD criterion is applied to select the best partition result. A compact sketch of this construction is given below.
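The following Python sketch is one way to realize Sects. 3.1-3.2; it is an illustration rather than the authors' implementation, and it uses a simple union-find structure in place of the graph-connectivity estimation algorithm cited as [4].

```python
import numpy as np

def lcgp_partitions(X, labels, k_max):
    """Build the nearest-relationship matrix Q once, then grow G_1, ..., G_{k_max}
    by connecting each node to one more neighbour per step under the
    "same label" restriction. Returns one subclass assignment per k."""
    X = np.asarray(X, dtype=float)
    N = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # distance matrix
    Q = np.argsort(D, axis=1)[:, 1:]        # Q[i, j-1]: j-th nearest neighbour of node i

    parent = np.arange(N)                   # union-find over the nodes
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i
    def union(i, j):
        parent[find(i)] = find(j)

    partitions = []
    for k in range(k_max):                  # step k builds G_{k+1}
        for i in range(N):
            j = Q[i, k]
            if labels[i] == labels[j]:      # "same label" restriction
                union(i, j)
        partitions.append(np.array([find(i) for i in range(N)]))
    return partitions
```

Each returned array assigns every sample the root index of its connected sub-graph, which plays the role of the subclass label of G_k.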
Fig. 2. (a) The partition result of G_1 with five subclasses. (b) The partition result of G_2 with three subclasses.
3.3 Most Discriminant Subclass Distribution (MDSD) Criterion

When we have obtained all the possible subclass partitions through Sect. 3.2, we need a criterion to choose the best partition result. We propose the MDSD index to judge the goodness of the different partition results.

Preliminaries:
1) V = {x_1, x_2, ..., x_N}, x_i ∈ R^p, i = 1, 2, ..., N.
2) class(x_i) denotes the class that x_i belongs to. Two subsets associated with x_i are defined as V_c = {x_i ∈ V | class(x_i) = c} and V̄_c = {x_i ∈ V | class(x_i) ≠ c}, where c is the class label.
3) n_c = |V_c| is the number of samples in V_c.
4) The distance between samples x_i and x_j is d(x_i, x_j) = ||x_i − x_j||².
5) The maximal distance between x_i and the other samples in the same class as x_i is x_i^⊂ = max_j d(x_i, x_j), x_j ∈ V_{class(x_i)}.
6) The minimal distance between x_i and the samples not in the same class as x_i is x_i^⊄ = min_j d(x_i, x_j), x_j ∈ V̄_{class(x_i)}.

The MDSD index is defined as

MDSD = \sum_{i=1}^{N} \left( \frac{n_{class(x_i)}}{N} \right)^{m} \times \frac{x_i^{⊄}}{x_i^{⊂}}   (2)
where m is a control factor. The partition result that maximizes (2) is taken as the optimal subclass partition. A larger x_i^⊄ means class i is far from the other classes, which is desired; a smaller x_i^⊂ means class i is fairly compact, which is also desired. It should be pointed out that when there are only a few samples in class i, x_i^⊂ is prone to be small. Therefore (n_{class(x_i)}/N)^m is introduced into the MDSD index, which means we do not encourage too
many small subclasses. Thus, x_i^⊂ and (n_{class(x_i)}/N)^m balance each other. Importantly, since we have already computed D in Sect. 3.1, there is no need to recalculate the distance of each sample pair, so the MDSD index can be obtained very quickly. The computational cost of LCGP is around N²p + N lg N + p³, which is an order of magnitude smaller than SDA-loot's N³p + N h_max p³ and comparable to SDA-stability's N²p + h_max p³.
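A small sketch of Eq. (2) follows. It reads class(x_i) as the subclass label of x_i in the candidate partition being scored, which is our interpretation of the notation, and it recomputes the squared-distance matrix locally for self-containment even though the paper reuses D from Sect. 3.1.

```python
import numpy as np

def mdsd_index(X, sub_labels, m=1.0):
    """MDSD index of Eq. (2) for one candidate partition (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    sub_labels = np.asarray(sub_labels)
    N = len(X)
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared distances, as in Sect. 3.1
    counts = {c: int((sub_labels == c).sum()) for c in np.unique(sub_labels)}
    total = 0.0
    for i in range(N):
        same = sub_labels == sub_labels[i]
        same[i] = False                         # exclude x_i itself
        other = sub_labels != sub_labels[i]
        if not same.any() or not other.any():
            continue                            # degenerate subclass, contributes nothing here
        d_in = D[i, same].max()                 # farthest sample of the same subclass
        d_out = D[i, other].min()               # nearest sample of a different subclass
        total += (counts[sub_labels[i]] / N) ** m * (d_out / d_in)
    return total
```

The best G_k among the candidates is then simply the one with the largest index, e.g. `max(partitions, key=lambda p: mdsd_index(X, p))`.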
4 Experimental Results

We use two experiments to verify the validity of our method: (1) artificial samples of different distributions, and (2) 12 datasets from the UCI database [6].

4.1 Artificial Samples
Three artificial data sets are considered here, shown in Fig. 3(a)∼(c): Multi-Gaussians (210 samples per class), Doughnut (125 samples per class), and Two Spirals (300 samples per class). We randomly choose 10-80 and 10-55 samples from each class of Multi-Gaussians and Doughnut, respectively, to form the training set, and reserve the remaining samples for testing.
Fig. 3. (a) Multi-Gaussians (b) Doughnut (c) Two Spirals
In the Two Spirals test, the training set is the whole data set with added random noise in the range of 0.1, while the testing set is formed by adding random noise with a larger range α ∈ [0.2, 4]. The proposed LCGP is adopted to obtain the feature space and an RBF neural network is utilized to classify the testing samples. The results are shown in Fig. 4, in which the y-axis represents the incorrect classification rate (also referred to as the error rate) and the x-axis is the number of training samples per class for Fig. 4(a) and (b), and the noise intensity for Fig. 4(c). On each of the three datasets, the performance of LDA is poor because all of the datasets contradict the assumption of LDA. In the Multi-Gaussians and Doughnut tests, SDA-stability also fails because it wrongly estimates the true distribution of the dataset. Our method is the best among all, since it successfully recovers the underlying distribution of all the datasets. Although SDA-loot achieves the second best result, its computational cost is much heavier than that of our method.
Fig. 4. Average incorrect classification rate over 10 test runs
4.2 Results on UCI Database [6]
We use 12 datasets for the experiments; their descriptions can be found in ref. [6]. The ratio of training to testing samples is about 1:1 without overlapping for every dataset. For the sake of comparison, we use both the nearest neighbor classifier and the RBF neural network classifier. The results are shown in Table 1. Note that our methods (LCGPDA+NN and LCGPDA+RBF) achieve the best correct classification rates. The
reason is that we use LCGPDA to divide the original space into several subspaces without overlapping, which ensures that different subspaces have separate scopes. The feature space is derived from the extended classes, in which the feature representations rarely interweave. On WDBC, MDD, LSD, and ORHD our methods are 1-3 percent better than SDA; on PRHD, SD, MC1, and MC2 our methods are 3-10 percent better than SDA. The reason is simple: in the first four datasets the samples do not interweave frequently, while in the remaining datasets the distributions are more complicated. From the results of Table 1, we also find that although the subclass partition result is obtained from the training set, it generalizes well to the testing set.

Table 1. Correct classification rate (%) on the UCI database, where NN represents the nearest neighbor classifier and RBF represents the RBF neural network classifier. Every result is the average of 10 test runs. The results with * are from ref. [2], which uses NN as the classifier.

UCI database  LDA+NN  SDA-loot+NN  SDA-stability+NN  LDA+RBF  LCGPDA+NN  LCGPDA+RBF
WDBC          94*     94*          94*               93       95         96
MDD-pix       93*     96*          96*               94       96         97
MDD-fou       81*     81*          83*               82       85         88
MDD-fac       97*     96*          96*               96       98         98
MDD-kar       96*     97*          97*               92       98         98
MDD-zer       79*     81*          79*               80       83         85
LSD           84      87           88                83       89         91
ORHD          95      96           95                95       97         97
PRHD          95      94           95                94       98         98
SD            86      87           84                84       90         89
MC1           65      63           60                68       71         73
MC2           91      93           93                91       97         98
5 Conclusions

A novel space partition method called LCGP is proposed to address the Sample-Interweaving-Phenomenon. The contribution of this paper lies in three aspects: 1) we propose the Label Constrained Graph Partition strategy to obtain a more reasonable partition of the original space; 2) an iterative scheme is put forward to speed up the algorithm; 3) to evaluate the partition results, we propose the MDSD criterion to choose the best division. Compared with SDA, the computational cost of our method is fairly small.
Acknowledgment

This work is supported by a grant from the National Science Foundation of China (NSF 60571052).
References

1. Fisher, R.A.: The Statistical Utilization of Multiple Measurements. Annals of Eugenics 8 (1938) 376-386
2. Zhu, M.L., Martinez, A.M.: Subclass Discriminant Analysis. IEEE Trans. Pattern Analysis and Machine Intelligence 28(8) (2006)
3. Martinez, A.M., Zhu, M.: Where Are Linear Feature Extraction Methods Applicable? IEEE Trans. Pattern Analysis and Machine Intelligence 27(12) (2005) 1934-1944
4. Lu, K.C., Lu, H.M.: Graph Theory and Its Applications (Second Edition). Tsinghua University Press, Beijing (1995)
5. Er, M.J., Wu, S.Q., Lu, J.W., Toh, H.L.: Face Recognition with Radial Basis Function (RBF) Neural Networks. IEEE Trans. Neural Networks 13(3) (2002)
6. Blake, C.L., Merz, C.J.: UCI Repository of Machine Learning Databases. Dept. of Information and Computer Sciences, University of California, Irvine, http://www.ics.uci.edu/mlearn/MLRepository.html (1998)
7. Fukunaga, K.: Introduction to Statistical Pattern Recognition, 2nd ed. Academic Press (1990)
8. Loog, M., Duin, R.P.W., Haeb-Umbach, T.: Multiclass Linear Dimension Reduction by Weighted Pairwise Fisher Criteria. IEEE Trans. Pattern Analysis and Machine Intelligence 23(7) (2001) 762-766
9. Hastie, T., Buja, A., Tibshirani, R.: Penalized Discriminant Analysis. Annals of Statistics 23 (1995) 73-102
The Kernelized Geometrical Bisection Methods

Xiaomao Liu¹, Shujuan Cao¹, Junbin Gao², and Jun Zhang³

¹ Department of Mathematics, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
[email protected]
² School of Information Technology, Charles Sturt University, Bathurst, NSW 2795, Australia
³ State Key Laboratory for Multi-Spectral Information Processing Technologies, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
Abstract. In this paper, we developed two new classifiers: the kernelized geometrical bisection method and its extended version. The derivation of our methods is based on the so-called “kernel trick” in which samples in the input space are mapped onto almost linearly separable data in a high-dimensional feature space associated with a kernel function. A linear hyperplane can be constructed through bisecting the line connecting the nearest points between two convex hulls created by mapped samples in the feature space. Computational experiments show that the proposed algorithms are more competitive and effective than the well-known conventional methods.
1 Introduction
Support vector machines (SVMs), developed by Vapnik [1,2] on the basis of the structural risk minimization principle, are widely used in pattern recognition. SVMs have good generalization ability, with which the overfitting problem in learning can be effectively avoided, and they also offer better classification precision than conventional methods. In recent years SVMs have attracted much attention [3] and have been used as powerful tools for solving practical classification problems. Let us consider typical binary classification problems. The essence of SVMs is to learn a hyperplane to classify patterns; the hyperplane is constructed in such a way that it separates patterns belonging to different classes. Bennett et al. [4] proved that the linearly separable SVM (LSVM, the maximum margin method), which maximizes the margin and correctly classifies the two sets, exists only if the two sets are linearly separable or, equivalently, their convex hulls do not overlap. They also argued geometrically that the perceptron can be constructed by finding the two closest points in the convex hulls of the two sets and choosing the hyperplane bisecting the line between these nearest points. The method is usually called the geometrical bisection method (GBM). This geometrical explanation has appeared in various other forms [4,5,6]. Generally speaking, LSVM is mathematically equivalent to GBM in the separable case. Bennett et al. [6] extended this argument
to the approximately linearly separable case by using the concept of reduced convex hulls and proposed the extended geometrical bisection method (EGBM). Kernel theory plays an important role in SVMs; the SVM is the first kernel-based method, proposed in [7]. This approach is based on the main result of statistical learning theory, i.e., the generalization ability of a machine depends on both the empirical risk on a training set and the complexity of the machine. Any kernel method comprises two parts: a module that performs the mapping into the feature space and a learning algorithm designed to discover linear patterns in that space [8]. In this paper we aim to extend the GBM and EGBM to the general case, including totally linearly inseparable datasets. The main idea is to map two totally linearly inseparable sets into two linearly separable or approximately linearly separable sets in a feature space by using an appropriate kernel function, and then to apply GBM or EGBM in the feature space to obtain our classifiers. The empirical test results show that the new algorithms are very effective and competitive compared with the primal GBM and EGBM.
2 Fundamental Theory
The basic problem to be addressed in this paper is a general binary classification problem. Given a set of labeled training data

T = {(x_1, y_1), (x_2, y_2), . . . , (x_l, y_l)},   (1)

with input patterns x_i ∈ X = R^n and outputs y_i ∈ Y = {1, −1} indicating the class of the pattern x_i (i = 1, . . . , l), let A = {x_i | y_i = 1}, B = {x_i | y_i = −1}, l_+ = #(A) and l_− = #(B). We assume l_+ > 0 and l_− > 0, so m = max(l_+, l_−) > 0. There are three cases for the binary classification problem with the dataset T.
Linearly separable case. There exists a hyperplane that separates the two given sample sets A and B in the X space.
Approximately linearly separable case. There exists some separating hyperplane in X and the percentage of misclassification is relatively low for this separating hyperplane.
Nonlinearly separable case. The percentage of misclassification is always relatively high for any separating hyperplane.

2.1 The Geometrical Bisection Method (GBM)
For a given set of samples X = {x_1, . . . , x_k} ⊆ R^n, its convex hull S is defined as

S = \left\{ x = \sum_{j=1}^{k} \lambda_j x_j : \sum_{j=1}^{k} \lambda_j = 1, \; \lambda_j \ge 0, \; j = 1, 2, \dots, k \right\}.
The GBM [4] is one approach for solving classification problems with a linearly separable training dataset, in which case there exist two closest points of the two convex hulls determined by the two classes of samples. By connecting the two closest points with a line, a hyperplane, which is orthogonal to the line
segment and bisects it, can be determined, as shown in Fig. 1(a). Intuitively this hyperplane is the "best" for the purpose of classifying samples, in the sense that the two sets are as far away from the separating plane as possible. The algorithm is as follows.

Algorithm 1 (GBM)
(i) Suppose the given training set T is linearly separable.
(ii) Solve the optimization problem

\min_{\alpha} \; \frac{1}{2} \left\| \sum_{y_i=1} \alpha_i x_i - \sum_{y_i=-1} \alpha_i x_i \right\|^2
s.t. \; \sum_{y_i=1} \alpha_i = 1, \; \sum_{y_i=-1} \alpha_i = 1, \; 0 \le \alpha_i \le 1, \; i = 1, 2, \dots, l

with the optimal solution α̂ = (α̂_1, . . . , α̂_l)^T.
(iii) Construct the two closest points c = \sum_{y_i=1} \hat\alpha_i x_i and d = \sum_{y_i=-1} \hat\alpha_i x_i.
(iv) Define the optimal hyperplane as (w · x) + b = 0, where the weight w = c − d is the normal of the hyperplane, and the threshold b = −\frac{1}{2}((c − d) · (c + d)) is the distance from the origin to the point halfway between the two closest points along the normal w.
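The quadratic program in step (ii) can be solved with any off-the-shelf optimizer. The sketch below is ours, not the solver used in the paper; it relies on scipy's SLSQP routine, and the optional bound D < 1 anticipates the reduced-hull variant of Algorithm 2.

```python
import numpy as np
from scipy.optimize import minimize

def gbm_hyperplane(X, y, D=1.0):
    """Find the nearest points c, d of the two (reduced) convex hulls and
    bisect the segment joining them. X: (l, n) patterns; y: labels in {+1, -1};
    D: upper bound on each alpha_i (D = 1 gives Algorithm 1)."""
    pos, neg = X[y == 1], X[y == -1]
    n_pos = len(pos)

    def objective(alpha):
        diff = alpha[:n_pos] @ pos - alpha[n_pos:] @ neg       # c - d
        return 0.5 * diff @ diff

    alpha0 = np.concatenate([np.full(n_pos, 1.0 / n_pos),
                             np.full(len(neg), 1.0 / len(neg))])
    cons = [{'type': 'eq', 'fun': lambda a: a[:n_pos].sum() - 1.0},
            {'type': 'eq', 'fun': lambda a: a[n_pos:].sum() - 1.0}]
    res = minimize(objective, alpha0, method='SLSQP',
                   bounds=[(0.0, D)] * len(X), constraints=cons)
    c = res.x[:n_pos] @ pos                                    # closest point of hull A
    d = res.x[n_pos:] @ neg                                    # closest point of hull B
    w = c - d                                                  # normal of the hyperplane
    b = -0.5 * (c - d) @ (c + d)                               # threshold
    return w, b
```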
Fig. 1. The basic GBMs. (a) The two closest points of the convex hulls determine the separating plane; (b) the inseparable convex hulls and the reduced convex hulls with D = 1/2.
For the linearly separable case, the above method is equivalent to the primal standard LSVM (maximum margin method), and the GBM is geometrically more intuitive. The vital weakness of the GBM is that it fails in the inseparable cases.

2.2 The Extended Geometrical Bisection Method (EGBM)
The EGBM [6] is an alternative approach to address the problem caused by approximately linearly separable datasets. If T is approximately linearly separable,
the two convex hulls of the two sets intersect slightly (see Fig. 1(b)); we may then reduce the convex hulls by putting a chosen upper bound D on the multiplier in the convex combination for each point, making sure that the two reduced hulls no longer intersect, and proceed as in the separable case. The reduced convex hulls of the two sets are

S_{\pm} = \left\{ x : x = \sum_{y_i=\pm1} \alpha_i x_i, \; \sum_{y_i=\pm1} \alpha_i = 1 \right\},

where 0 ≤ α_i ≤ D, 0 < D ≤ 1, i = 1, 2, . . . , l, and D satisfies D_min ≤ D ≤ 1 with D_min = 1/m > 0. Note that both hulls are reduced if D < 1 and nonempty if D ≥ D_min. The smaller the parameter D, the smaller the reduced convex hulls. Intuitively, the reduced convex hulls are obtained by removing the outliers in the dataset, see Fig. 1(b).
Algorithm 2 (EGBM)
(i) Suppose that T is approximately linearly separable.
(ii) Choose an appropriate parameter D_min ≤ D ≤ 1 and solve the optimization problem in Algorithm 1 with 0 ≤ α_i ≤ 1 replaced by 0 ≤ α_i ≤ D.
(iii) Construct the two closest points c = \sum_{y_i=1} \hat\alpha_i x_i and d = \sum_{y_i=-1} \hat\alpha_i x_i.
(iv) Define the optimal hyperplane as (w · x) + b = 0, where the weight w = c − d and the threshold b = −\frac{1}{2}((c − d) · (c + d)).

Obviously Algorithm 2 is an extension of Algorithm 1. However, it is hard to judge whether the training set is approximately linearly separable or not, and there has been no efficient way to choose the parameter D. A feasible parameter D should satisfy two conditions: the reduced convex hulls must both be nonempty and must not intersect each other, though it is difficult to verify whether two hulls intersect. If D is too large, the convex hulls certainly overlap, resulting in the meaningless solution w = 0; if D is too small, the problem becomes infeasible and has no solution at all. There must therefore be an appropriate range (D_min, D_max) for D. Even if we know D_max, it is still hard to find a suitable D in the feasible interval that ensures a low percentage of misclassification. The reason is that Algorithm 2 is over-sensitive to the parameter D: a tiny variation of D may cause a great variation of the shapes of the reduced convex hulls, and consequently the normal w and the threshold b may change greatly, since they depend heavily on the shapes of the reduced hulls. As the parameter D decreases, the outcome sometimes becomes better and sometimes becomes worse. The only way to get a better value for D is to search in the feasible interval, and random searching means increased computational overhead.
2.3 Kernel Function and Feature Space
One typical way of handling a linearly inseparable training dataset T is to map the input x from the original X space into the so-called feature space, denoted by H [2]. This allows us to search for linear classifiers in the feature space H. The kernel
trick works well in this case: the mapping is implicitly determined by the kernel function K(x, x') = (ϕ(x) · ϕ(x')), where "·" denotes the inner product of the mapped data in the H space. Many popular kernel functions can be found in [8], for example the RBF kernel K(x_i, x_j) = exp{−||x_i − x_j||²/σ²} (σ > 0).
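For later reference, a small numpy helper for the RBF Gram matrix (a standard computation, not code from this paper):

```python
import numpy as np

def rbf_gram(X1, X2, sigma):
    """Gram matrix of the RBF kernel K(x, x') = exp(-||x - x'||^2 / sigma^2)."""
    sq_dists = np.sum(X1 ** 2, axis=1)[:, None] \
             + np.sum(X2 ** 2, axis=1)[None, :] - 2.0 * X1 @ X2.T
    return np.exp(-np.maximum(sq_dists, 0.0) / sigma ** 2)
```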
3 The Kernelized GBM and EGBM
Though the GBM and the EGBM are valid for linearly separable and approximately linearly separable problems, respectively, neither method can solve totally linearly inseparable problems. Usually the linear separability is unknown and is difficult to verify in advance, and in practice the majority of classification problems are linearly inseparable. To address this problem, we use the so-called kernel trick to solve the problem in the feature space instead of in the data space.

3.1 The Kernelized GBM (KGBM)
Applying the kernel trick to the GBM gives a useful algorithm which can be used for any T whose linear separability in the original space is unclear but whose image is linearly separable in the feature space induced by an appropriate kernel function.

Algorithm 3 (KGBM)
(i) Let T be a given dataset and choose a proper kernel function K(x, x') such that the image of T is linearly separable in the feature space H.
(ii) Solve the optimization problem

\min_{\alpha} \; \frac{1}{2} \left\| \sum_{y_i=1} \alpha_i \varphi(x_i) - \sum_{y_i=-1} \alpha_i \varphi(x_i) \right\|^2 = \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j K(x_i, x_j)
s.t. \; \sum_{y_i=1} \alpha_i = 1, \; \sum_{y_i=-1} \alpha_i = 1, \; 0 \le \alpha_i \le 1, \; i = 1, 2, \dots, l,

where ϕ(·) is the mapping related to K(x, x'). Denote the optimal solution by α̂ = (α̂_1, . . . , α̂_l)^T.
(iii) Construct the two closest points in H: c = \sum_{y_i=1} \hat\alpha_i \varphi(x_i) and d = \sum_{y_i=-1} \hat\alpha_i \varphi(x_i).
(iv) Define the optimal hyperplane by (w · x) + b = 0 with

w = c - d = \sum_{y_i=1} \hat\alpha_i \varphi(x_i) - \sum_{y_i=-1} \hat\alpha_i \varphi(x_i) = \sum_{i=1}^{l} y_i \hat\alpha_i \varphi(x_i),   (2)

b = -\frac{1}{2}(c - d) \cdot (c + d) = -\frac{1}{2}\left( \sum_{y_i=1} \sum_{y_j=1} \hat\alpha_i \hat\alpha_j K(x_i, x_j) - \sum_{y_i=-1} \sum_{y_j=-1} \hat\alpha_i \hat\alpha_j K(x_i, x_j) \right).   (3)

With w and b, a decision function in the original space is f(x) = sgn((w · ϕ(x)) + b).
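A sketch of Algorithm 3 in terms of the Gram matrix is given below. It reuses the rbf_gram helper from Sect. 2.3, again relies on scipy's SLSQP rather than a dedicated QP solver, and with the upper bound lowered to some D < 1 it doubles as the KEGBM of the next subsection; it is an illustration, not the implementation used in the paper.

```python
import numpy as np
from scipy.optimize import minimize

def kgbm_train(X, y, sigma, D=1.0):
    """Solve the problem of step (ii) via the Gram matrix and return (alpha, b)."""
    K = rbf_gram(X, X, sigma)
    yKy = (y[:, None] * y[None, :]) * K

    def objective(alpha):
        return 0.5 * alpha @ yKy @ alpha

    pos, neg = y == 1, y == -1
    alpha0 = np.where(pos, 1.0 / pos.sum(), 1.0 / neg.sum())
    cons = [{'type': 'eq', 'fun': lambda a: a[pos].sum() - 1.0},
            {'type': 'eq', 'fun': lambda a: a[neg].sum() - 1.0}]
    res = minimize(objective, alpha0, method='SLSQP',
                   bounds=[(0.0, D)] * len(y), constraints=cons)
    alpha = res.x
    # threshold b of Eq. (3), written with the Gram matrix only
    b = -0.5 * (alpha[pos] @ K[np.ix_(pos, pos)] @ alpha[pos]
                - alpha[neg] @ K[np.ix_(neg, neg)] @ alpha[neg])
    return alpha, b

def kgbm_predict(X_train, y_train, alpha, b, X_new, sigma):
    """Decision function f(x) = sgn((w . phi(x)) + b) evaluated through the kernel."""
    K_new = rbf_gram(X_new, X_train, sigma)
    return np.sign(K_new @ (y_train * alpha) + b)
```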
Algorithm 3 is the kernelized version of Algorithm 1 and is equivalent to Algorithm 1 when a linear kernel function is chosen. Choosing a proper kernel function is much easier, because for most of the known kernel functions (excluding the linear kernel and the polynomial kernel) the related feature space is infinite-dimensional, and thus the VC dimension of the set of linear functions in the feature space is infinite. Hence any training set with finitely many samples can be well separated linearly in the feature space [2].

3.2 The Kernelized EGBM (KEGBM)
Algorithm 4 (KEGBM)
(i) Let T be the given dataset and choose a proper kernel function K(x, x') and a parameter D satisfying D_min ≤ D ≤ D_max ≤ 1 such that T is approximately linearly separable in the feature space H.
(ii) Solve the optimization problem in Algorithm 3 with the constraints replaced by 0 ≤ α_i ≤ D.
(iii) Construct the two closest points c = \sum_{y_i=1} \hat\alpha_i \varphi(x_i) and d = \sum_{y_i=-1} \hat\alpha_i \varphi(x_i).
(iv) Define the optimal hyperplane by (w · x) + b = 0 with w = c − d given by (2) and b = −\frac{1}{2}(c − d) · (c + d) given by (3), using the new α̂ found in step (ii).

Similar to Algorithm 3, Algorithm 4 is easy to implement with an appropriate kernel. It can be considered a good supplement to Algorithm 3 for the case where the training set is only approximately linearly separable in the feature space.
4 Computational Experiments

Here we empirically evaluate and compare the performance of Algorithms 1-4. We chose two groups of benchmark data from the UCI databases [9]: the Iris data and the Glass data. Since we concentrate on two-category classification problems, we only used part of the data in the experiments.
The RBF kernel function K(x_i, x_j) = exp{−||x_i − x_j||²/σ²} was used in our experiments. The feasible set of D, the σ values employed, l_+ and l_− (the sizes of classes A and B in the training set), n_+ and n_− (the sizes of the two classes in the testing set), the classification precision p_1 on the training set, and the classification precision p_2 on the testing samples are given in the tables. For Algorithms 2 and 4, the feasible set of D is an interval and its upper bound is obtained by searching.
Iris12 experiments: This experiment tests Algorithm 3 against Algorithm 1 on a linearly separable problem. We chose part of the samples (11-90) of the first and the second class of the Iris data as training samples, and used the remaining samples (1-10, 91-100) of the two classes as testing samples. The training set is linearly separable, as shown by the result of Algorithm 1 in Table 1. The results of applying Algorithms 1 and 3 are given in Table 1.
Iris23 experiments: This experiment was designed to compare the performance of Algorithms 2, 3, and 4 on an approximately linearly separable problem. We chose part of the samples (61-140) of the second and the third class of the Iris data as
Table 1. Performance of Algorithm 1 and Algorithm 3 on the linearly separable Iris12 data

Algorithm     l+/l−   n+/n−   Parameter    p1     p2
Algorithm 1   40/40   10/10   —            100%   100%
Algorithm 3   40/40   10/10   σ = 2, 3     100%   100%
training samples, and used the remaining samples (51-60, 141-150) of the two classes as testing samples. The training set is approximately linearly separable, because Algorithm 1 gives the degenerate solution w = 0 and is invalid in this case; the result of Algorithm 2 shown in Table 2 also verifies this: the feasible values of D are relatively small, and both the classification precision p_1 on the training set and the classification precision p_2 on the testing samples are relatively low. The results of applying Algorithms 2, 3 and 4 are given in Table 2.
Glass12 experiments: This experiment compares the performance of Algorithms 2, 3, and 4 on a linearly inseparable problem. We chose part of the samples (31-110) of the first and the second class of the Glass data as training samples, and used the remaining samples (1-30, 111-146) of these two classes as testing samples. The training set is linearly inseparable, because Algorithm 1

Table 2. Performance of Algorithms 2, 3 and 4 on the approximately linearly separable Iris23 data (the feasible set of D for Algorithm 2 is [0.025, 0.749] and the feasible set of D for Algorithm 4 is [0.025, 1])
Algorithm     l+/l−   n+/n−   Parameter                      p1        p2
Algorithm 2   40/40   10/10   D = 0.749                      22.5%     40%
                              D = 0.747                      95%       95%
                              D = 0.745                      93.75%    100%
                              D = 0.74                       17.75%    35%
                              D = 0.7                        88.75%    80%
                              D = 0.6                        20%       10%
                              D = 0.5                        90%       95%
                              D = 0.4                        57.5%     65%
                              D = 0.3                        96.25%    100%
                              D = 0.2                        96.25%    100%
                              D = 0.1                        97.5%     100%
Algorithm 3   40/40   10/10   σ = 2, 3, 4                    100%      85%
Algorithm 4   40/40   10/10   σ = 2, D = 1, 0.8, 0.6, 0.45   100%      85%
                              σ = 2, D = 0.4                 97.5%     90%
                              σ = 2, D = 0.2                 96.25%    100%
                              σ = 3, D = 1, 0.8, 0.6, 0.45   100%      85%
                              σ = 3, D = 0.4                 97.5%     90%
                              σ = 3, D = 0.2                 96.25%    100%
                              σ = 4, D = 1, 0.8, 0.6, 0.45   100%      85%
                              σ = 4, D = 0.4                 97.5%     90%
                              σ = 4, D = 0.2                 96.25%    100%
Table 3. Performance of Algorithms 2, 3 and 4 on the nonlinearly separable Glass12 data (the feasible set of D for Algorithm 2 is [0.025, 0.506] and the feasible set of D for Algorithm 4 is [0.025, 1])

Algorithm     l+/l−   n+/n−   Parameter                     p1       p2
Algorithm 2   40/40   30/36   D = 0.506                     65.0%    34.85%
                              D = 0.505                     57.5%    39.39%
                              D = 0.5                       57.5%    34.85%
                              D = 0.49                      42.5%    48.48%
                              D = 0.41                      66.25%   66.67%
                              D = 0.4                       23.75%   39.39%
                              D = 0.3                       76.25%   56.06%
                              D = 0.2                       40%      54.55%
                              D = 0.1                       56.25%   59.09%
Algorithm 3   40/40   30/36   σ = 2, 3, 4                   100%     62.12%
Algorithm 4   40/40   30/36   σ = 2, D = 1, 0.8, 0.6, 0.4   100%     62.12%
                              σ = 2, D = 0.2                98.75%   62.12%
                              σ = 3, D = 1, 0.8, 0.6, 0.4   100%     62.12%
                              σ = 3, D = 0.2                97.5%    62.12%
                              σ = 4, D = 1, 0.8, 0.6, 0.4   100%     62.12%
                              σ = 4, D = 0.2                97.5%    62.12%
gives the degenerate solution w = 0 and is thus invalid in this case; the result of Algorithm 2 shown in Table 3 also verifies this: a small feasible set and poor classification precisions p_1 and p_2 on the training and testing data, respectively. The results of applying Algorithms 2, 3 and 4 are given in Table 3. Table 2 shows that Algorithm 2 achieves high classification precision on the approximately linearly separable problem only if an appropriate D is chosen, but such a D is very difficult to find because the precision does not behave in an orderly fashion as D varies. This indicates that Algorithm 2 is valid but not very effective for approximately linearly separable problems, owing to the difficulty of choosing D appropriately. Table 3 shows that the classification precision of Algorithm 2 is very low no matter what value of D is chosen, which confirms that Algorithm 2 is especially unsuitable for linearly inseparable problems; the essential reason is that Algorithm 2 seeks a linear separating plane for a totally linearly inseparable problem. As for Algorithm 3 and Algorithm 4, the empirical results in Table 2 and Table 3 also verify our previous analyses: Algorithm 3 and Algorithm 4 behave very well on both the approximately linearly separable problem (Iris23 experiment) and the linearly inseparable problem (Glass12 experiment) with the RBF kernel, because the degree of linear separability in the related feature space improves greatly as long as an appropriate kernel is chosen. Even if the reduced parameter of Algorithm 4 is chosen to be its maximum value 1 in our experiments, the training samples are still linearly separable in the high-dimensional feature space related to the RBF kernel. Furthermore, variations of both the kernel parameter σ and the reduced parameter D do not influence the classification precision much. The good stability of these two algorithms makes it easy to ensure high classification precision without worrying too much about the choice of the parameters.
The above discussion demonstrates that our algorithms generalize and improve the conventional algorithms, with higher training accuracy and more stable computation.
5 Conclusions
In this paper, we combined the idea of the GBM with the kernel trick to classify two-category training samples. First we transform linearly inseparable training samples in the original space into (approximately) linearly separable ones in a feature space related to an appropriately chosen kernel, and then solve the (approximately) linearly separable problem and search for a linear classifier in the feature space using the known GBM or EGBM. The algorithms of this paper break through the limitations of the known GBM and EGBM. The modified algorithms are also superior, even for (approximately) linearly separable problems, compared with the known GBM and EGBM, with higher training accuracy and more stable computation.
Acknowledgments This work was supported by the National Natural Science Foundation of China via the grants NSFC 60373090.
References

1. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer-Verlag, New York (1998)
2. Vapnik, V.N.: Statistical Learning Theory. John Wiley and Sons, New York, NY (1998)
3. Burges, C.J.C.: A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery 2 (2) (1998) 144-152
4. Bennett, K.P., Bredensteiner, E.: Geometry in Learning. Web Manuscript, http://www.rpi.edu/~bennek (1996)
5. Keerthi, S.S., Shevade, S.K., Bhattacharyya, C., et al.: A Fast Iterative Nearest-point Algorithm for Support Vector Machine Classifier Design. IEEE Transactions on Neural Networks 11 (1) (2000) 124-136
6. Bennett, K., Bredensteiner, E.: Duality and Geometry in SVM Classifiers. Proc. of Seventeenth Intl. Conference on Machine Learning, Morgan Kaufmann, San Francisco (2000) 57-64
7. Boser, B.E., Guyon, I.M., Vapnik, V.: A Training Algorithm for Optimal Margin Classifiers. Proc. 5th Annu. ACM Workshop on Computational Learning Theory, Pittsburgh, USA (1992) 144-152
8. Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. China Machine Press, Beijing (2005)
9. Blake, C.L., Merz, C.J.: UCI Repository of Machine Learning Databases. http://www.ics.uci.edu/~mlearn/databases/ (2005)
Design and Implementation of a General Purpose Neural Network Processor

Yi Qian, Ang Li, and Qin Wang

School of Information and Engineering, University of Science and Technology Beijing, Beijing, China
[email protected]
Abstract. A general-purpose neural network processor must support most neural network algorithms and therefore needs the ability to process data of variable bit length. This paper proposes a processor based on a SIMD (Single Instruction Multiple Data) architecture with three data bit modes: 8-bit, 16-bit and 32-bit, so that the memory and the ALUs are used efficiently when the bit mode changes. The processor is designed in a 0.25-micron process technology and can be synthesized at 50 MHz with PKS from Cadence Inc. The experimental results show that the processor can implement neural networks in a highly parallel way.
1 Introduction

ANN (Artificial Neural Network) algorithms come in many kinds and are widely used in diverse application fields. A general purpose neural network processor is designed for most neural networks and embodies their common characteristics. Especially in real-time applications, such a processor can compute in a highly parallel way and can be programmed arbitrarily, so it provides an ideal emulation platform for ANNs. The data length required by the various ANN algorithms differs. When shorter data run in a processor with a fixed bit length, the memory units and the ALU waste precious silicon area, because the data are padded with leading zeros. The processor is therefore required to process data of variable bit length. Some general purpose neural network processors have a variable-bit-length capability, but they operate on 1-bit, 4-bit, 8-bit and 16-bit data only [1], or in the range from 1 to 16 bits [2], or from 1 bit to 64 bits [3], and so on. As the data bit lengths of most processors specially designed for one kind of ANN are now 8-bit [1], 16-bit [5,7] or 32-bit [4,8], the support for shorter or longer bit lengths in the general purpose neural network processors mentioned above is not well suited to current ANN algorithms; furthermore, the processor's performance also drops when such an ANN is implemented. In this paper we design a processor based on the SIMD architecture, with three bit modes: 8-bit, 16-bit and 32-bit. It uses the memory and the ALUs efficiently when data of the above bit lengths run on it. Every multiplier can perform one 32-bit multiplication, two 16-bit multiplications or four 8-bit multiplications per cycle, and the processor can run in a highly parallel way. The remainder of this paper is organized as follows: Section 2 describes ANN algorithms and analyzes them from the viewpoint of hardware implementation; Section 3 introduces the
processor architecture and instruction set, putting emphasis on the modules that can work in the three bit modes; Section 4 gives an example; Section 5 shows the experimental results; and Section 6 concludes the paper.
2 Hardware Implementation Analysis of ANN Algorithms

There are hundreds of ANN algorithms, which are either conventional iterative algorithms or novel non-iterative ones [12,13,14]. As the conventional iterative algorithms are suited for parallel implementation and are widely used, the CPU in this paper is designed for them. The architecture of an ANN is parallel and distributed, and is built from interconnected artificial neurons and synapses. These artificial neurons have local memories and simulate the behavior of biological neurons by summing the weighted input signals. Usually the mathematical expression of an artificial neuron is o = f(∑_i x_i w_ij), where the activation function f is a linear, saturated linear, step or sigmoid function. The training of an ANN follows the Hebb learning rule

w_ij(t+1) = w_ij(t) + α o_i(t) o_j(t)

or the Delta learning rule

w_ij(t+1) = w_ij(t) + α (y_j − o_j(t)) o_i(t).
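In software terms, the neuron model and the two learning rules quoted above amount to the following few lines of Python; this is a reference sketch only, whereas the processor realizes them in fixed-point hardware.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def neuron_output(x, w, f=sigmoid):
    """o = f(sum_i x_i * w_ij): weighted sum of the inputs through an activation f."""
    return f(np.dot(x, w))

def hebb_update(w_ij, o_i, o_j, alpha):
    """Hebb learning rule: w_ij(t+1) = w_ij(t) + alpha * o_i(t) * o_j(t)."""
    return w_ij + alpha * o_i * o_j

def delta_update(w_ij, o_i, o_j, y_j, alpha):
    """Delta learning rule: w_ij(t+1) = w_ij(t) + alpha * (y_j - o_j(t)) * o_i(t)."""
    return w_ij + alpha * (y_j - o_j) * o_i
```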
To match the characteristics of the ANN algorithms mentioned above, the architecture of a neural network processor is usually SIMD; it must perform multiplication, addition and the activation function, and it must handle the massive data exchange caused by the interconnection of neurons. We give a detailed analysis of the PU (Processing Unit) and of the data communication from the viewpoint of hardware implementation in the remainder of Section 2.

2.1 Hardware Implementation Analysis of the Processing Unit

Because the neurons have local memory, the PUs mapped to the neurons should also have local memory to store the weights, the neuron states, the intermediate results, the final results and so on. In a digital IC, the arithmetic modules occupy a large silicon area. If the 8-bit mode is taken as the basic mode of our processor, the arithmetic modules should be reused as much as possible in the 16-bit and 32-bit modes to make full use of the silicon resources. The main operations of a neuron are multiply-accumulate and multiply-add. Because there is no resource correlation between the two operations, the multiplier can be reused for both. If the multiply-accumulate instruction were based on register operations, as in general purpose CPU design, a data correlation would occur between two successive multiply-accumulates, which would increase the time spent and decrease the efficiency of the pipeline. This correlation can be avoided, because the operand address and the number of multiply-accumulate steps can be determined at programming time. We may set up a special data path for multiply-accumulation and replace a program segment by a single instruction which starts the SM (state machine) to control the multiply-accumulation. The hardware implementations of the linear, saturated linear and step functions are simple, and their modules can be put into each ALU or realized by instructions. The silicon area of the sigmoid function module is larger, so we design only one F module to realize it. All the modules mentioned above should work in the 8-bit, 16-bit and 32-bit modes.
2.2 Hardware Implementation Analysis of the Data Communication

In most cases, in a distributed shared memory SIMD architecture the LM (local memory) can provide the data needed by its PU for calculation. All the PUs run synchronously and the ANN algorithm can achieve high processing efficiency. However, there are two kinds of data communication that influence the parallel work of all PUs during processing, and an interconnection network must be designed into the processor's architecture to support them.
1) Data communication among the PUs. For example, in the training phase all PU_j need the state data of PU_i in the same cycle to update w_ij; at the end of each training phase, the ALUs often need to calculate local errors that are accumulated into the total error. That is, the calculated results of some PUs are sent to the input ports of other specified PUs.
2) Data communication among the processors. Multi-chip expansion is needed to set up a processing system when a single chip is not sufficient for the ANN algorithm.
In some special-purpose neural network processors the interconnection network is static: the network needs no program control and is not flexible, the connections among the PUs are fixed, and the logic implementation is simple. In a general purpose neural network processor the interconnection network is usually dynamic and the switch connections are controlled by the program; the hardware implementation is complex, but the connection is more flexible. The bus, the crossbar switch and the multi-level network are the commonly used dynamic networks. The interconnection network of our processor should support the needed data communication in the 8-bit, 16-bit and 32-bit modes, and should decrease the communication overhead as much as possible in order to increase the processor's parallelism. The peripheral circuit should be simple for system expansion.
3 The Processor's Architecture

The processor is based on a distributed shared memory SIMD architecture and works in 8-bit, 16-bit and 32-bit modes, as shown in Fig. 1. There are 24 PUs in the processor; the 8-bit LM of each PU stores the original data, the weights, the final results and other data not suitable for the register file. The CU (Control Unit) fetches an instruction from the Program_cache, decodes it, generates the control signals and part of the global data, such as immediate data and memory-unit addresses, and transfers them to the PUs in parallel. With the control signals, the 24 PUs can work in parallel or serially. The Program_cache is a 23-bit ROM with 4K memory units for storing instruction codes. F is the circuit module of the sigmoid function, based on a LUT; the length of its output data is 32 bits, and the PU truncates it to the proper bit length according to the bit mode. The FIFO is the buffer of the F module: the calculated result of the F module is buffered into the FIFO (First In First Out) and then broadcast to the LMs of the specified PUs. Commonly, the original data in the Extern_data_ram are the sample set and the initial values of the neural network provided by the user; some or all PUs use these data to complete the training and the other phases of the neural network. In the initialization phase of the program, the data in the Extern_data_ram are allocated to the specified LMs
Fig. 1. The processor’s architecture
with the control signals. According to the mask-bit signal provided by the MRC (Mask Register in the Control unit), each PU decides whether to receive the data or not. The final results are also stored into this memory so that they can be accessed by the peripheral circuit. The design of the PU, the interconnection network and the instruction set of the architecture are described below.

3.1 The Design of the PU

There are a register file and an ALU in each PU. The 32 8-bit registers in the register file store intermediate results temporarily; the fixed-point arithmetic modules and the logic modules in the ALU are 8-bit, except for the multiplier. The fixed-point arithmetic modules are used in the neural computation. The logic modules complete the operations that cannot be realized by arithmetic, such as calculating the statistics of the recognition. Because the logic modules in the ALU perform bit operations, they are not affected when the bit mode changes. The carries from the low 8 bits to the high 8 bits of the adder and the subtracter must be taken into account; we control them with a dedicated state-machine module in the CU, so the two modules can be chained in the 16-bit and 32-bit modes. The multiplier is based on the Booth algorithm, and neighboring PUs share a multiplier, as shown in Fig. 2. In the 32-bit mode each multiplier performs one 32×32-bit multiplication per cycle, in the 16-bit mode two 16×16-bit multiplications, and in the 8-bit mode four 8×8-bit multiplications. The data access rules for the data in the register file and the LM in the three bit modes are as follows:
8-bit mode: an 8-bit datum is stored into the LM or a register at the address or register number specified by the program. Each PU represents one neuron, so the processor represents 24 neurons. The ALU calculates with the local data of the neuron.
16-bit mode: every 16-bit datum is divided into two parts that are stored in two neighboring PUs, and the two neighboring PUs represent one neuron (for example, a memory unit of PU0 stores the low 8 bits of the datum, and the memory unit with the same address in PU1 stores the high 8 bits). The processor is then equivalent to 12 neurons.
Fig. 2. The data and memory map of the multiplier at different bit modes
32-bit mode: four neighboring PUs represent one neuron; the 32-bit datum is divided into four parts that are stored in the memory units with the same address in the PUs it is mapped to. The processor is then equivalent to 6 neurons.
The special data path for multiply-accumulation proposed in Section 2.1 is shown in Fig. 3. This path removes the data correlation that occurs during the multiply-accumulate operation. The register file sends the initial LM address and the number of multiply-accumulate steps to the SM; the SM then generates the LM address, the read and write signals and the multiply-accumulate enable signal every cycle. Thus the multiply-accumulation can operate continuously.
Fig. 3. The connection between the multiply-accumulation module and its associated modules
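The lane behaviour of the shared multiplier described above can be illustrated with a small behavioural model. It is not RTL, it ignores signedness and the Booth recoding, and it assumes, following the 16-bit example in the text, that PU0 holds the least significant byte.

```python
def multiplier_lanes(a_bytes, b_bytes, bitmode):
    """Behavioural sketch of the shared multiplier of Fig. 2.
    a_bytes, b_bytes: the four 8-bit operand slices held by PU0..PU3
    (index 0 = least significant byte). bitmode: 8, 16 or 32."""
    assert len(a_bytes) == len(b_bytes) == 4
    if bitmode == 8:                       # four independent 8x8 products
        return [a * b for a, b in zip(a_bytes, b_bytes)]
    if bitmode == 16:                      # two 16x16 products
        a_lo, a_hi = a_bytes[0] | a_bytes[1] << 8, a_bytes[2] | a_bytes[3] << 8
        b_lo, b_hi = b_bytes[0] | b_bytes[1] << 8, b_bytes[2] | b_bytes[3] << 8
        return [a_lo * b_lo, a_hi * b_hi]
    if bitmode == 32:                      # one 32x32 product
        a = sum(byte << (8 * i) for i, byte in enumerate(a_bytes))
        b = sum(byte << (8 * i) for i, byte in enumerate(b_bytes))
        return [a * b]
    raise ValueError("bitmode must be 8, 16 or 32")

# e.g. multiplier_lanes([2, 0, 3, 0], [5, 0, 7, 0], 8) -> [10, 0, 21, 0]
```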
3.2 The Design of the Interconnection Network

1) The interconnection network among the PUs. Since the bus can broadcast data, only one PU is allowed to write data onto the bus, and all PUs may read the data from the bus according to the mask signal in the MRC. The bus can therefore accomplish the data communication among the PUs in the 8-bit mode. In the other bit modes, the data length is longer than the bus width; if we used the bus to transmit 16-bit or 32-bit data, each PU would wait for 2 or 4 cycles. The added time is comparable to the time one PU spends calculating a neuron's state, so it would become a communication bottleneck. In this paper, the cross-switch grouping interconnection network shown in Fig. 4 is designed for the data communication among the PUs inside the processor in the 16-bit and 32-bit modes. With this network, the PUs can access the data of a PU in one cycle.
Fig. 4. Across switch grouping internet
At the 16-bit mode, the internet selects the data from PU2i and PU2i+1 (for example, PU0 and PU1, PU2 and PU3, ...) and assigns them to the corresponding output ports (for example, the PU0 data to out0 and out2, and the PU1 data to out1 and out3). At the 32-bit mode, the internet selects the data from PU4i, PU4i+1, PU4i+2, and PU4i+3 (for example, PU0, PU1, PU2, PU3, ...) and assigns them to the corresponding output ports (the PU0 data to out0, the PU1 data to out1, the PU2 data to out2, and the PU3 data to out3). The mask signal in the MRC of the CU decides which PUs receive the data.

With this internet, several PUs can access the same memory unit of a specified PU simultaneously at the 16-bit or 32-bit mode; the PUs then compute in parallel and the efficiency of the processor is highest. The local errors computed by the individual PUs may instead be sent sequentially to one specified PU to accumulate the total error; in that case one PU reads from another while the remaining PUs are idle, and the efficiency is lowest. This case, however, only arises for the neurons of the output layer and lasts few cycles, so it has little influence on the total running time. Furthermore, the bit-mode control signal of the internet can be selected flexibly, making full use of its ability to transmit four 8-bit data from different PUs in one cycle and thereby reducing the communication time. A detailed example is given in Section 4.

2) The multi-chip expansion. The buffer register reg_expand on the Data_bus inside the processor serves as the I/O port to the next processor, as shown in Fig. 1. Processors are serially connected into a ring to form a multi-chip system; the ring connection is simple and easy to program.

3.3 The Design of the Instruction Set

To suit the architecture, the instruction set is a RISC (Reduced Instruction Set Computer) design with three kinds of instructions: I, R, and J. Each instruction is 23 bits long and encodes its operands: register numbers, an immediate operand, a branch address, and so on. Execution is divided into a five-stage pipeline: instruction fetch, decode and register read, ALU operation, memory access, and write back. The instruction set includes data transfer, arithmetic, logic, branch, and control instructions.
Table 1. Parts of the instruction set
instruction class   instruction                  description
data transfer       MOV Rt, @(Rs+#immediate)     Rt ← @(Rs + #immediate)
                    MOV Rt, #immediate           Rt ← #immediate
                    MOV @(Rs+#immediate), Rt     @(Rs + #immediate) ← Rt
branch              JNE #immediate, Rs, A        if Rs = #immediate then jump to A
arithmetic          ADD Rt, Rs                   Rt ← Rs + Rt
                    SUB Rt, Rs                   Rt ← Rt - Rs
                    MAC n, Ra, Rb, Rc            multiply-accumulate n times; Ra/Rb/Rc are the addresses of A/B/C
control             MOV #immediate, MRC          the mask of the PUs is stored in MRC
                    Bitmode #immediate           controls the bit-mode signal of the internet
Some of these instructions are listed in Table 1.

The multiply-accumulate instruction MAC appears frequently in programs, usually as the core operation of a loop. While a MAC is being processed, the PU would otherwise simply wait for its result, and leaving the other arithmetic modules idle would lower the efficiency of the ALU. Since there is no data dependence between the MAC and the instruction that follows it, the subsequent instruction can be fetched and executed during the MAC; this raises the ALU utilization and shortens the program's running time. Analyzing the timing of a commonly used program segment containing MAC, shown in Fig. 5, we can see that the execution time of the MAC dominates the total execution time of the program. Because the jump instruction finishes before the MAC operation does, no further optimization is applied to its empty slots.
Fig. 5. Time sequence chart
4 The Design for an Example

In this example, a 25×8×8 BP network is implemented. The neurons of the middle layer are mapped to PU0~PU7 and the neurons of the output layer to PU8~PU15. The data are 8-bit and are stored in the LM as shown in Fig. 6.
[Figure 6 shows the layout of the 25×8×8 BP network data in the LMs: the input-to-middle weights w_{i,j}, the middle-to-output weights v_{j,t}, the inputs a_i, the targets y_t, the middle-layer outputs b_j, and the error terms e_j and d_t, distributed over the LMs of the PUs.]
Fig. 6. The data of 25×8×8 BP neural network in LM
The original data, such as a_i and y_t, are read from Extern_data_ram into every LM at the initialization phase of the network; b_j and e_j are produced during the learning phase. The learning process is as follows:

1) PU0~PU7 compute their sums s_j = \sum_i w_{ij} a_i in parallel and keep s_j in their registers. Then s_j (j = 0, 1, ..., 7) is sent serially to the F module to compute the middle-layer output b_j = f(s_j), and b_j is stored in the LM of PUj.

2) PU8~PU15 compute l_t = \sum_j v_{jt} b_j simultaneously. First, b_j is read from PUj over the bus, sent to the registers of PU8~PU15, and multiply-accumulated with v_{jt} in each PU. l_t (t = 0, 1, ..., 7) resides in the register of the PU holding output neuron t and is sent serially to the F module. After the output-layer state is computed as c_t = f(l_t), it is distributed back to the PU registers.

3) The correction error of each output-layer neuron is computed as d_t = (y_t - c_t) c_t (1 - c_t) in the PU holding neuron t and kept in its register. The weights between the middle layer and the output layer are then updated as v_{jt} = v_{jt} + d_t b_j; the access pattern for b_j is the same as in step 2), i.e., the eight PUs compute in parallel. The total error is obtained by sending the local errors of the other PUs to PU0 over the bus to accumulate the sum; this part is not parallel.

4) The correction error of each middle-layer neuron is computed as e_j = (\sum_t d_t v_{jt}) b_j (1 - b_j) in PUj. PU8~PU15 compute the products d_t v_{jt} simultaneously. The 'Bitmode #32' instruction is then executed so that the internet is at the 32-bit mode, and the data of PU12~PU15 are sent to PU8~PU11 for summation. Next the 'Bitmode #16' instruction switches the internet to the 16-bit mode, and the partial sums of PU10 and PU11 are sent to PU8 and PU9 for summation. Finally the partial sums from PU9 and PU8 are added, the result e_j is stored in PU0 over the bus, and the other e_j are obtained in the same way. Once the e_j are available, we can update
the weights of the neurons between the input layer and the middle layer as w_{ij} = w_{ij} + e_j a_i. The access pattern is the same as for b_j: the eight PUs multiply-accumulate in parallel over the bus.

5) The next training pair is then fed into the network and the procedure is repeated. The recall (association) process is the same as steps 1) and 2) of the training process and is not described further.
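The update rules above are the standard back-propagation equations for a 25-8-8 network. The following Python sketch restates them in floating point purely to make the data flow explicit; it ignores the fixed-point formats, the PU mapping, and the bus transfers of the actual processor, and the learning-rate parameter `eta` is an added assumption.

```python
import numpy as np

def train_step(a, y, W, V, f, eta=1.0):
    """One BP step for a 25-8-8 network: a is the input, y the target,
    W (25x8) the input-to-middle weights, V (8x8) the middle-to-output weights."""
    s = a @ W                      # s_j = sum_i w_ij * a_i   (middle-layer sums)
    b = f(s)                       # b_j = f(s_j)             (middle-layer outputs)
    l = b @ V                      # l_t = sum_j v_jt * b_j   (output-layer sums)
    c = f(l)                       # c_t = f(l_t)             (output-layer outputs)
    d = (y - c) * c * (1 - c)      # output-layer error d_t
    e = (V @ d) * b * (1 - b)      # middle-layer error e_j
    V += eta * np.outer(b, d)      # v_jt <- v_jt + d_t * b_j
    W += eta * np.outer(a, e)      # w_ij <- w_ij + e_j * a_i
    return c

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
W, V = np.random.rand(25, 8) * 0.1, np.random.rand(8, 8) * 0.1
out = train_step(np.random.rand(25), np.eye(8)[0], W, V, sigmoid)
```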
5 Experiment Result

The processor is designed in a 0.25-micron process technology and can be synthesized at 50 MHz with PKS from Cadence Inc. It uses a five-stage pipeline with the RISC instruction system; the instruction cycle and the multiply-add cycle are both 20 ns. The processor can execute 24 instructions in parallel, so its processing power is 1200 MIPS. Table 2 compares it with the neural chip NM6403 and the DSP TMS320C62, both widely used at present.

Table 2. Comparison of the CPS of the three processors
processor     frequency (MHz)   8-bit    16-bit   32-bit
NM6403        50                1200M    200M     50M
TMS320C62     200               -        400M     -
Ours          50                1200M    600M     300M
Notes: CPS means connection per second defined as the rate of MAC operations per second.
Although the clock rate of the TMS320C62 is higher, it has only two multipliers and no dedicated MAC instruction, so its performance drops when implementing ANN algorithms. Moreover, our processor has a dedicated sigmoid function module, whereas the other chips must implement the sigmoid in software, for example with a look-up table (LUT); a LUT may need about 20 instructions for a 16-bit sigmoid. Our processor therefore clearly shortens the execution time. If the pipeline and the instruction schedule are further optimized, its run time for ANN algorithms will decrease even more.
6 Conclusions

The processor is designed around the common characteristics of ANN algorithms on a SIMD architecture and supports three bit modes. It can process 8-bit, 16-bit, and 32-bit data with a high degree of parallelism and makes full use of the memory and the ALUs when the data have variable bit lengths. With its high computing performance and convenient interface, it can be used in neural network accelerators for PCs and as the basic element of large parallel neuro-computer systems. Since the design is still at an early stage, further optimization of the hardware and software is possible, and its performance may be improved further.
References
1. Dias, F.M., Antunes, A., Mota, A.M.: Artificial Neural Networks: A Review of Commercial Hardware. Engineering Applications of Artificial Intelligence 17(8) (2004) 945-952
2. Duranton, M.: Image Processing by Neural Networks. IEEE Micro 16(5) (1996) 12-19
3. Borisov, Y., Tchernikov, V., Fomine, D., Vixne, P.: VLIW/SIMD Neuro-Matrix Core. WSEAS Transactions on Systems 2(3) (2003) 572-581
4. Vitabile, S., Gentile, A., Sorbello, F.: A Neural Network Based Automatic Road Signs Recognizer. Proceedings of the 2002 International Joint Conference on Neural Networks, IJCNN'02, 3 (2002) 2315-2320
5. Rybarczyk, A., Szulc, M.: The Concept of a Microcontroller with Neural-Matrix Coprocessor for Control Systems that Exploits Reconfigurable FPGAs. Proceedings of the Third International Workshop on Robot Motion and Control, RoMoCo'02 (2002) 123-132
6. Kim, D., Kim, H., Kim, H., Han, G., Chung, D.: A SIMD Neural Network Processor for Image Processing. Advances in Neural Networks - ISNN 2005, Second International Symposium on Neural Networks, Proceedings, Part II, 2 (2005) 665-672
7. de la Roca, B.M., Randon, E.: Design of a Parallel Neural Processor. Proceedings of the IEEE International Caracas Conference on Devices, Circuits and Systems, ICCDCS (1998) 109-112
8. McBader, S., Lee, P., Sartori, A.: The Impact of Modern FPGA Architectures on Neural Hardware: A Case Study of the TOTEM Neural Processor. 2004 IEEE International Joint Conference on Neural Networks, 4 (2004) 3149-3154
9. Chen, K.H., Chiueh, T.D., Chang, S.C., Luh, P.B.: A 1600-MIPS Parallel Processor IC for Job-Shop Scheduling. IEEE Trans. Industrial Electronics 52(1) (2005) 291-299
10. Chen, G.L.: The Architecture of Parallel Computers. Higher Education Publication (2002)
11. Jiang, Z.L.: Artificial Neural Network Introduction. Higher Education Publication (2001)
12. Huang, G.B., et al.: Extreme Learning Machine: Theory and Applications. Neurocomputing 70 (2006) 489-501
13. Huang, G.B., et al.: Can Threshold Networks Be Trained Directly? IEEE Trans. Circuits and Systems-II 53(3) (2006) 187-191
14. Huang, G.B., et al.: Universal Approximation Using Incremental Constructive Feedforward Networks with Random Hidden Nodes. IEEE Trans. Neural Networks 17(4) (2006) 879-892
A Forward Constrained Selection Algorithm for Probabilistic Neural Network Ning Zong and Xia Hong School of Systems Engineering, University of Reading, RG6 6AY, UK
[email protected] Abstract. A new probabilistic neural network (PNN) learning algorithm based on forward constrained selection (PNN-FCS) is proposed. An incremental learning scheme is adopted such that at each step, new neurons, one for each class, are selected from the training samples and the weights of the neurons are estimated so as to minimize the overall misclassification error rate. In this manner, only the most significant training samples are used as the neurons. It is shown by simulation that the resultant networks of PNN-FCS have good classification performance compared to other types of classifiers, but much smaller model sizes than conventional PNN.
1 Introduction
The probabilistic neural network (PNN) is a popular neural network for classification [1,2]. The traditional PNN learning approach simply assigns each training sample to a new neuron, which may lead to conventional PNNs with very large model sizes when large numbers of training samples are provided. Associated with an overly large model size is the disadvantage of high computational cost when classifying test samples. There has been research on PNN model reduction, e.g., learning vector quantization [3] and the forward orthogonal algorithm [4]. Alternatively, for mixture-of-experts model construction, a forward constrained selection (FCS) algorithm was introduced [5]. In this study a new PNN-FCS method is introduced that selects only the most important neurons, thus deriving a small model size for the PNN, based on a similar idea to FCS. Given the full training set as candidate neurons, at each step of PNN-FCS one of the most significant training samples for each class is included in the PNN model as a new neuron, whilst the weights are determined so as to minimize the overall misclassification error rate. The process continues until a PNN with an appropriate model size and classification performance is obtained. Simulation results illustrate that, compared to conventional PNNs, the PNNs constructed by the new PNN-FCS algorithm have smaller model sizes while keeping good classification performance.
2 A New Probabilistic Neural Network
The structure of the conventional probabilistic neural network (PNN) is shown in Fig. 1. The input layer receives a sample x composed of d features x_1, ..., x_d.
Fig. 1. The structure of a conventional PNN
In the hidden layer, there is one hidden unit per training sample. The hidden unit x_{ij} corresponds to the ith (i = 1, ..., N_j) training sample in the jth class (j = 1, ..., M). The output of the hidden unit x_{ij} with respect to x is expressed as

a_{ij}(x) = \frac{1}{(2\pi)^{d/2}\sigma^{d}} \exp\Big\{ -\frac{(x - x_{ij})^{T}(x - x_{ij})}{2\sigma^{2}} \Big\}    (1)

where \sigma denotes the smoothing parameter. In the output layer, there are M output units, one for each class C_j, j = 1, ..., M. The jth output is formed as

\hat{y}_{j}(x) = \frac{1}{N_j} \sum_{i=1}^{N_j} a_{ij}(x),  j = 1, ..., M.    (2)

The output layer classifies the sample x into the class C_k that satisfies

k = \arg\max_{j} \{ \hat{y}_{j}(x) \mid j = 1, ..., M \}.    (3)
In order to classify a sample x, one needs to compute the N_j neuron outputs a_{ij}(x), i = 1, ..., N_j, for each class C_j, j = 1, ..., M. Hence, when a large number of training samples are provided as neurons, applying the conventional PNN to classify new data samples becomes computationally expensive; this is a disadvantage of the conventional PNN. It has been argued that many of the training samples may contribute little to the overall classification accuracy, so their corresponding neurons can be removed from the traditional PNN model [4]. Thus an important problem in training a PNN is model reduction, i.e., selecting only the most significant training samples as neurons. Several approaches for PNN model reduction have been developed to select a small number of representative or most significant neurons [3,4].
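Equations (1)-(3) can be summarised in a few lines of NumPy. The sketch below is only an illustration of the conventional PNN decision rule (the function name `pnn_classify` and the data layout are assumptions, not the authors' code):

```python
import numpy as np

def pnn_classify(x, class_samples, sigma):
    """class_samples: list over classes, each an (N_j, d) array of stored neurons."""
    d = x.shape[0]
    norm = (2 * np.pi) ** (d / 2) * sigma ** d
    outputs = []
    for Xj in class_samples:                       # one output unit per class C_j
        diff = Xj - x                              # (N_j, d)
        a = np.exp(-np.sum(diff * diff, axis=1) / (2 * sigma ** 2)) / norm  # Eq. (1)
        outputs.append(a.mean())                   # Eq. (2): average of the N_j kernels
    return int(np.argmax(outputs))                 # Eq. (3)
```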
In this paper we propose a PNN-FCS approach which constructs a new PNN model with discriminant functions

\hat{y}_{j}^{(K)}(x) = \sum_{i=1}^{K} \gamma_{ij}^{(K)} a_{ij}(x),  j = 1, ..., M    (4)

where the \gamma_{ij}^{(K)} are K weights associated with the K selected neurons x_{ij}, satisfying the convex constraints (for notational simplicity, the selected neurons of class C_j are still denoted with the subscripts 1j, 2j, ...)

\gamma_{ij}^{(K)} \ge 0,   \sum_{i=1}^{K} \gamma_{ij}^{(K)} = 1.    (5)

The PNN-FCS approach uses a subset of K (K < N_j) neurons selected from the N_j training samples of each class C_j, j = 1, ..., M, to save computational cost when classifying new data sets. In order to keep the good classification performance of the conventional PNN, the neurons are selected according to their significance in contributing to the overall classification accuracy. The details of the PNN-FCS approach are introduced as follows.
3 New PNN-FCS Algorithm
The FCS method has been developed for regression problems to construct regression models with small model sizes and good approximation performance [5]. In FCS for regression (FCR), the most significant experts are selected one by one from a set of candidate experts and their weights (subject to convex constraints) are estimated so as to minimize the overall mean square error (MSE). In this section, the basic idea of FCS is applied to PNN learning to form a new PNN-FCS algorithm. The PNN-FCS algorithm adopts an incremental learning scheme such that at each step new neurons, one for each class, are selected from the training samples and their weights (subject to convex constraints) are estimated so as to minimize the overall misclassification error rate. The procedure terminates as soon as a PNN model with satisfactory classification performance is obtained. Hence, usually, only the most significant training samples are included in the PNN model as neurons. As a result, compared to conventional PNNs, the PNNs constructed by the PNN-FCS approach may have smaller model sizes while keeping good classification performance.

Based on a training set D_N = \{x_h, y_h\}_{h=1}^{N}, a maximum likelihood estimate (MLE) of the misclassification error rate of a classifier can be expressed as [6]

\hat{P}_e = n_{etr} / N    (6)

where n_{etr} denotes the number of misclassified training samples. For a PNN, n_{etr} can be calculated by

n_{etr} = \sum_{h=1}^{N} \chi\big( \arg\max_{j} \{ \hat{y}_{j}(x_h) \mid j = 1, ..., M \} \ne y_h \big)    (7)
where \chi(\cdot) denotes an indicator function whose value is 1 if its argument is true and 0 otherwise. Therefore, supposing that a PNN defined by Eq. (4) and Eq. (5) has been constructed, the MLE of its misclassification error rate takes the form

\hat{P}_e^{(K)} = \frac{1}{N} \sum_{h=1}^{N} \chi\big( \arg\max_{j} \{ \hat{y}_{j}^{(K)}(x_h) \mid j = 1, ..., M \} \ne y_h \big)    (8)

where \hat{y}_{j}^{(K)}(x_h) is given by Eq. (4) and Eq. (5). The new PNN-FCS algorithm can be described as follows.

1. At the first step, the PNN model contains only M neurons, one for each class. Each of the N_j training samples x_{ij}, i = 1, ..., N_j, in each class C_j, j = 1, ..., M, is used in turn to form a candidate PNN model with discriminant functions

\hat{y}_{j}^{(1)}(x) = \frac{1}{(2\pi)^{d/2}\sigma^{d}} \exp\Big\{ -\frac{(x - x_{ij})^{T}(x - x_{ij})}{2\sigma^{2}} \Big\}.    (9)

Note that there are \sum_{j=1}^{M} N_j such candidate PNN models. For each of them, calculate the resulting MLE of the misclassification error rate

\hat{P}_e^{(1)} = \frac{1}{N} \sum_{h=1}^{N} \chi\big( \arg\max_{j} \{ \hat{y}_{j}^{(1)}(x_h) \mid j = 1, ..., M \} \ne y_h \big).    (10)

The "best" training sample in each class C_j, j = 1, ..., M, i.e., the one for which \hat{P}_e^{(1)} reaches its minimum value, is then selected as a neuron.

2. At the kth step, k \ge 2, the PNN model contains k neurons in each class. Each of the remaining (N_j - k + 1) training samples (i.e., excluding the (k-1) training samples already selected as neurons in previous steps) in each class C_j, j = 1, ..., M, is added in turn and combined with the existing PNN model \hat{y}_{j}^{(k-1)}(x) to form a candidate PNN model with discriminant functions

\hat{y}_{j}^{(k)}(x) = \mu_j \frac{1}{(2\pi)^{d/2}\sigma^{d}} \exp\Big\{ -\frac{(x - x_{ij})^{T}(x - x_{ij})}{2\sigma^{2}} \Big\} + (1 - \mu_j)\, \hat{y}_{j}^{(k-1)}(x)    (11)

where the values of \mu_j, j = 1, ..., M, vary in the range [0, 1]. Note that there are \sum_{j=1}^{M} (N_j - k + 1) such candidate PNN models. For each of them, calculate the resulting MLE of the misclassification error rate

\hat{P}_e^{(k)} = \frac{1}{N} \sum_{h=1}^{N} \chi\big( \arg\max_{j} \{ \hat{y}_{j}^{(k)}(x_h) \mid j = 1, ..., M \} \ne y_h \big).    (12)

The "best" training sample in each class C_j and the associated "best" value of \mu_j, denoted by \mu_j^{(k-1)}, j = 1, ..., M, i.e., those for which \hat{P}_e^{(k)} reaches its minimum value, are then selected. These selected training samples are included into the
existing PNN model as the new neurons. The weights \gamma_{ij}^{(k)} associated with the selected neurons x_{ij}, i = 1, ..., k, j = 1, ..., M, are updated by

\gamma_{kj}^{(k)} = \mu_j^{(k-1)},   \gamma_{lj}^{(k)} = (1 - \mu_j^{(k-1)})\, \gamma_{lj}^{(k-1)},  l = 1, ..., k-1.    (13)

It has been proven (see Lemma 1 of [5]) that the weights \gamma_{ij}^{(k)} calculated by Eq. (13) satisfy the convex constraints given by Eq. (5).

3. Set k = k + 1 and repeat the above procedure until a PNN model with an appropriate model size and a small value of \hat{P}_e^{(k)} is obtained.

The new PNN-FCS method shares some common characteristics with other approaches such as the FCR approach [5] and the forward orthogonal least squares (FOLS) algorithm [7]. Some similarities and differences among PNN-FCS, FCR, and FOLS are briefly discussed as follows.

1. PNN-FCS, FCR, and FOLS all construct models by forward selection, i.e., the most significant neurons (or experts) are selected one by one from a set of candidates according to some performance criterion. Hence, all three methods are usually capable of deriving models with appropriate model sizes and good modelling performance.

2. The PNN-FCS and FCR methods construct models whose neurons (or experts) are subject to the constraint of convex combination. The original FOLS algorithm is usually employed to construct general linear-in-the-parameters models without convex constraints on the combination parameters.

3. The PNN-FCS and FCR methods determine the weights of the selected neurons (or experts) by direct search, while the FOLS algorithm estimates the model parameters based on the least squares (LS) criterion.

4. The PNN-FCS approach aims to minimize the misclassification error rate of the PNN model, while the FCR method minimizes the MSE of the regression model.
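The core quantities of the algorithm (the constrained discriminant of Eqs. (4)-(5), the error-rate estimate of Eq. (8), and the convex weight update of Eq. (13)) can be sketched as follows. This is a minimal illustration in Python under assumed data structures; the outer greedy search over candidate samples and a grid of \mu values (steps 1 and 2 of the algorithm) is omitted for brevity.

```python
import numpy as np

def kernel(x, c, sigma):
    """Gaussian PNN kernel, Eq. (1)."""
    d = x.shape[0]
    diff = x - c
    return np.exp(-(diff @ diff) / (2 * sigma ** 2)) / ((2 * np.pi) ** (d / 2) * sigma ** d)

def discriminant(x, centers, gammas, sigma):
    """Constrained class discriminant, Eqs. (4)-(5)."""
    return sum(g * kernel(x, c, sigma) for c, g in zip(centers, gammas))

def error_rate(model, X, y, sigma):
    """MLE of the misclassification rate, Eq. (8); model maps class -> (centers, gammas)."""
    wrong = 0
    for xh, yh in zip(X, y):
        pred = max(model, key=lambda j: discriminant(xh, model[j][0], model[j][1], sigma))
        wrong += int(pred != yh)
    return wrong / len(y)

def add_neuron(centers, gammas, new_center, mu):
    """Eq. (13): the new neuron gets weight mu, the old weights shrink by (1 - mu),
    so the weights remain a convex combination."""
    return centers + [new_center], [g * (1 - mu) for g in gammas] + [mu]
```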
4 An Illustrative Example
A real-world data set, the Titanic data set, was obtained from the benchmark repository at the Intelligent Data Analysis (IDA) Group of FIRST [8]. The Titanic data set includes 150 training samples and 2051 test samples of 2 classes. Each sample in Titanic data set has 3 features and 1 class label. The PNN-FCS approach was employed and the classification performance of the constructed PNN-FCS and other commonly adopted classification algorithms are compared in Table 1. The classification performance of other approaches was quoted from the benchmark repository at the Intelligent Data Analysis (IDA) Group of
Table 1. A comparison of different classification methods

Algorithm           Classifier size              Classification error rate over test set
PNN-FCS             2 (neurons)                  22.57%
PNN                 150 (neurons)                22.9%
RBF network         4 (centres)                  26.67%
AdaBoost with RBF   200 (RBFs) × 4 (centres)     22.9%
SVM                 not reported                 22.9%
FIRST [8]. It is observed that the PNN-FCS algorithm is capable of constructing a much smaller model with good classification performance. More experiments on both simulated and real data sets were conducted [9], from which similar conclusions were drawn.
5 Conclusions
A new PNN-FCS approach has been introduced to construct PNNs with small model sizes. An incremental learning scheme has been proposed to select the most significant neurons, one for each class, from the training samples. The weights of the neurons are simultaneously estimated in the proposed algorithm so as to minimize the overall misclassification error rate. Finally an illustrative example is utilized to demonstrate the efficacy of the proposed approach.
References
1. Specht, D.F.: Probabilistic Neural Networks. Neural Networks 3 (1990) 109-118
2. Specht, D.F.: Enhancements to the Probabilistic Neural Networks. In: Proc. IEEE Int. Conf. Neural Networks, Baltimore, MD (1992) 761-768
3. Burrascano, P.: Learning Vector Quantization for the Probabilistic Neural Network. IEEE Transactions on Neural Networks 2 (1991) 458-461
4. Mao, K.Z., Tan, K.C., Ser, W.: Probabilistic Neural-Network Structure Determination for Pattern Classification. IEEE Transactions on Neural Networks 3 (2000) 1009-1016
5. Hong, X., Harris, C.J.: A Mixture of Experts Network Structure Construction Algorithm for Modelling and Control. Applied Intelligence 16 (2002) 59-69
6. Duda, R.O., Hart, P.E.: Pattern Classification and Scene Analysis. Wiley, New York (1973)
7. Chen, S., Billings, S.A., Luo, W.: Orthogonal Least Squares Methods and Their Applications to Non-linear System Identification. Int. J. Control 50 (1989) 1873-1896
8. http://ida.first.fhg.de/projects/bench/titanic/titanic.results
9. Zong, N.: Data-based Models Design and Learning Algorithms for Pattern Recognition. PhD thesis, School of Systems Engineering, University of Reading, UK (2006)
Probabilistic Motion Switch Tracking Method Based on Mean Shift and Double Model Filters Risheng Han, Zhongliang Jing, and Gang Xiao Institute of Aerospace Science & Technology, Shanghai Jiao Tong University, Shanghai 200030, P.R. China {hanrs,zljing,xiaogang}@sjtu.edu.cn
Abstract. Mean shift tracking fails when the velocity of the target is so large that the target's kernel window in the previous frame cannot cover the target in the current frame. The combination of mean shift and a single Kalman filter also fails when the target's velocity changes suddenly. To deal with the problem of tracking an image target with a large and changing velocity, an efficient image tracking method integrating mean shift and double model filters is proposed, in which two motion models can switch between each other through a probabilistic likelihood. Experimental results show that the method can keep tracking the target whether its velocity is large or small, changing or constant, with a modest requirement on computational resources.
1 Introduction

The whole process of image tracking includes two stages of operation. The first stage is object detection; many detection methods have been proposed, such as adaptive background models [1, 2, 3], SVM [4, 5], AdaBoost, and so on. The second stage is tracking: while the object persists in the scene, the tracking algorithm should allow it to be tracked, whether it is stationary or moves around. Although many methods have been proposed for image tracking, it is still a challenging problem in computer vision. In our study we focus on the second stage and propose a robust image tracking method based on double model filters and mean shift.

In general, tracking methods can be divided into three classes: the feature-based approach, the model-based approach, and the combination approach. Feature-based tracking relies on the persistence of certain image features such as image intensity, curves, or histograms [7, 8, 9]. The model-based approach generates object hypotheses and tries to verify them using the image; in other words, the image content is only evaluated at the hypothetical positions [6]. For example, template matching and mean shift tracking belong to the feature-based approach, while Condensation and other methods using various filters belong to the model-based approach. The combination approach uses the model-based and feature-based approaches together to obtain better tracking performance [13]. Our method belongs to the combination approach.

The paper is organized as follows. In Section 2 we
briefly review the mean shift tracking algorithm. In Section 3, state filters for image tracking are discussed. In Section 4, the integration of mean shift with double model filters is described in detail. Experimental results of the proposed method are shown in Section 5.
2 Brief Review of Mean Shift Tracking

The mean shift tracking algorithm needs two models: the reference target model and the target candidate model. The reference target model is represented by its histogram in the feature space; the most popular feature is the colour histogram of the target. Without loss of generality the target model can be considered as centred at the spatial location Y_0. In the subsequent frame a target candidate is defined at location Y_1 and is characterized by its colour histogram. What is special about mean shift is that it masks the target model and the candidate model with an isotropic kernel; detailed knowledge about kernel density estimation can be found in [10]. The target model and the target candidate model are

Q(Y_0) = \{ q_u(Y_0) \}_{u=1,...,m},  with  \sum_{u=1}^{m} q_u(Y_0) = 1,    (1)

P(Y_1) = \{ p_u(Y_1) \}_{u=1,...,m},  with  \sum_{u=1}^{m} p_u(Y_1) = 1.    (2)

Mean shift realizes target tracking by maximizing the Bhattacharyya coefficient, a popular likelihood between two histograms chosen for its simplicity and effectiveness; the mean shift iteration is derived from a second-order Taylor expansion of the Bhattacharyya coefficient, which is defined as

B(Y_1, Y_0) = B(P(Y_1), Q(Y_0)) = \sum_{u=1}^{m} \sqrt{ p_u(Y_1)\, q_u(Y_0) }.    (3)
In the mean shift tracking algorithm, the object localization procedure starts from the position Y_0 of the object in the previous frame and searches in its neighborhood. The mean shift is a vector that points towards the position of the target in the current frame, and the method has successfully coped with the task of tracking non-rigid objects [11, 12]. However, if the centre Y_1 of the target in the current frame does not remain inside the image area covered by the target model of the previous frame, the local maximum of the Bhattacharyya coefficient is no longer a reliable indicator of the new target location. Thus, when the velocity of the target is so large that the target model of the previous frame cannot cover the target in the current frame, the mean shift tracker fails. To overcome this problem, a dynamic motion model and a state filter are needed.
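As a small illustration of Eq. (3), the sketch below compares two histograms with the Bhattacharyya coefficient. Plain (unweighted) histograms are used here as a simplifying assumption; the paper's tracker uses kernel-weighted colour histograms.

```python
import numpy as np

def histogram(patch, bins=16):
    """Normalised grey-level histogram of an image patch (simplified model)."""
    hist, _ = np.histogram(patch, bins=bins, range=(0, 256))
    return hist / max(hist.sum(), 1)

def bhattacharyya(p, q):
    """Eq. (3): B = sum_u sqrt(p_u * q_u); 1 means identical histograms."""
    return float(np.sum(np.sqrt(p * q)))

# toy usage: compare the model histogram with a candidate at a shifted window
frame = np.random.randint(0, 256, size=(240, 320))
model = histogram(frame[100:140, 100:140])
candidate = histogram(frame[102:142, 101:141])
print(bhattacharyya(model, candidate))
```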
3 State Filters for Image Tracking

The core idea of filtering theory is to eliminate the uncertainty caused by complicated stochastic factors that can be neither controlled nor modelled deterministically. Depending on a filter's ability to eliminate the system's uncertainty, the resulting estimate is optimal or suboptimal, and a properly designed filter yields a better tracking result. Two models are needed for a filter: the dynamic process model and the measurement model.

Dynamic process model:

X_k = f(X_{k-1}, v_{k-1})    (4)

where X_k is the system state, f is the (linear or nonlinear) process model, and v_k is the process noise.

Measurement model:

Y_k = h(X_k, n_k)    (5)

where h is the (linear or nonlinear) measurement model, n_k is the measurement noise, and Y_k is the noisy measurement.

In the field of target tracking, the Kalman filter is probably the most popular tool. It assumes that the noise sequences v_k and n_k are independent Gaussian and that the functions (4) and (5) are linear. In this situation the dynamic process model and the measurement model can be written as

X_k = A X_{k-1} + v_{k-1},    (6)
Y_k = H X_k + n_k,    (7)

and Q and R denote the process noise covariance and the measurement noise covariance, respectively. The Kalman filter is essentially a set of mathematical equations implementing a predictor-corrector estimator that is optimal in the sense of minimizing the estimated error covariance when the system is linear and the noise is independent Gaussian. For the convenience of discussing multiple filters later, the Kalman filter equations are [14]:

Prediction of the state estimate:
X_k^- = A \hat{X}_{k-1}.    (8)

Prediction of the error covariance:
P_k^- = A P_{k-1} A^T + Q.    (9)

K_k = P_k^- H^T (H P_k^- H^T + R)^{-1},    (10)
\hat{X}_k = X_k^- + K_k (Z_k - H X_k^-),    (11)
P_k = (I - K_k H) P_k^-.    (12)
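Equations (8)-(12) translate directly into a few lines of NumPy. The following is only a generic sketch of one predict/correct cycle (matrix shapes and variable names are assumptions, not the authors' implementation):

```python
import numpy as np

def kalman_step(x_est, P, z, A, H, Q, R):
    """One predict/correct cycle, Eqs. (8)-(12)."""
    x_pred = A @ x_est                              # Eq. (8)  state prediction
    P_pred = A @ P @ A.T + Q                        # Eq. (9)  covariance prediction
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)             # Eq. (10) Kalman gain
    x_new = x_pred + K @ (z - H @ x_pred)           # Eq. (11) corrected state
    P_new = (np.eye(P.shape[0]) - K @ H) @ P_pred   # Eq. (12) corrected covariance
    return x_new, P_new, x_pred, P_pred
```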
Fig. 1. General process of a state filter. The horizontal sequence is the iterative estimation over the time steps k; the vertical direction shows the filter algorithm combining the measurement Z_k with the prediction of the dynamic process model at each time step k.
4 Integrate Mean Shift and Double Model Filters

Multiple model filters can adaptively estimate the target's state by integrating several state evolution models. The approach was proposed in [15, 16] in the radar community and has received little attention in the visual tracking community. The basic idea of multiple model filters is to perform state estimation in parallel with several different models: the observation data update each model's state estimate, and the final estimate is a probabilistic fusion of all models' estimates.

4.1 Motion Models for Image Tracking

In our study, the role of the filter's prediction is to push the kernel window towards the correct location, so that the centre of the target remains inside the image area covered by the kernel window of the previous frame. However, the real movement of the target may change suddenly, so a single motion model is not adequate to keep tracking, and the mean shift method loses the target when a single filter's prediction is used as its input. Another choice is a second-order motion model, but the second-order model gives even worse results than the first-order model when the target's velocity changes suddenly. To overcome this difficulty, we design two motion models in the framework of multiple model filters.

Dynamic process of model one:

\begin{pmatrix} x_k \\ y_k \\ \dot{x}_k \\ \dot{y}_k \end{pmatrix} =
\begin{bmatrix} 1 & 0 & dt & 0 \\ 0 & 1 & 0 & dt \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
\begin{pmatrix} x_{k-1} \\ y_{k-1} \\ \dot{x}_{k-1} \\ \dot{y}_{k-1} \end{pmatrix} + v,    (13)

where dt is the measurement interval and v is the process noise.
Dynamic process of model two:

\begin{pmatrix} x_k \\ y_k \\ \dot{x}_k \\ \dot{y}_k \end{pmatrix} =
\begin{bmatrix} 1 & 0 & dt & 0 \\ 0 & 1 & 0 & dt \\ 0 & 0 & -1 & 0 \\ 0 & 0 & 0 & -1 \end{bmatrix}
\begin{pmatrix} x_{k-1} \\ y_{k-1} \\ \dot{x}_{k-1} \\ \dot{y}_{k-1} \end{pmatrix} + v.    (14)

Both models have the same measurement model, which is defined as

\begin{pmatrix} x_k \\ y_k \end{pmatrix} =
\begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix} X_k + n,    (15)
where n is the measurement noise, whose covariance is defined as R. Two Kalman filters are employed to estimate the state under the two models respectively. Because of the similarity of the two process models, both Kalman filters are given the same initial setting. The initial state can be set manually or by a detection algorithm such as background subtraction, AdaBoost, or a support vector machine. The initial value of the state error covariance P in each Kalman filter can also be set from experience.

4.2 Probabilistic Likelihood of a Motion Model

During tracking, the selected prediction of the double model filters is used as the input of the mean shift algorithm, and the result of the mean shift algorithm is regarded as the measurement of the double model filters. The following conditional probability density function is used as an indicator of the likelihood of a particular model. All models form the set M = \{\mu_1, ..., \mu_r\}; in our study r = 2, with \mu_1 corresponding to model 1 and \mu_2 to model 2:
f(Z_k \mid \mu_i) = \frac{1}{2\pi \sqrt{|C|}} \exp\Big\{ -\frac{1}{2} (Z_k - H_{\mu_i} X^-_{k,\mu_i})^T C^{-1} (Z_k - H_{\mu_i} X^-_{k,\mu_i}) \Big\},    (16)

where

C = H_{\mu_i} P^-_{k,\mu_i} H_{\mu_i}^T + R,  i = 1, 2,    (17)

Z_k is the result of mean shift tracking, and

H_{\mu_i} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix},  i = 1, 2.    (18)

P^-_{k,\mu_i} comes from the corresponding Kalman filter.
The conditional likelihood is generated from every model's residual: the correct model has small residuals, which make its likelihood large, while the wrong models suffer from large residuals, which make their likelihoods small. Figure 2 shows the framework of the double model filters used in our study.
Fig. 2. Framework of Double Model Filters
4.3 Integrated Algorithm of Mean Shift and Double Model Filters

The measurement is provided by mean shift, so there are two candidate measurements, of which one must be selected as the final measurement.

Algorithm 1: Selection of measurements. Denote by Z_k(1) and Z_k(2) the two results of mean shift based on the two models' predictions.
If B(P(Z_k(1)), Q(Y_0)) \le B(P(Z_k(2)), Q(Y_0)) then Z_k = Z_k(2); else Z_k = Z_k(1). The output is Z_k.

Here B(\cdot) is the Bhattacharyya coefficient defined in (3). Given the new measurement at time step k, each model's probabilistic likelihood is computed simply as

p_j(k) = \frac{ f(Z_k \mid \mu_j) }{ \sum_{h=1}^{r} f(Z_k \mid \mu_h) },  j = 1, 2.    (19)
The final result of every time step is a probabilistic selection over all models' state estimates and error covariances. In the tracking process it is critical to select the proper model's estimate from the two models, so a selection method based on the likelihood is proposed.

Algorithm 2: Probabilistic switch based on the selected measurement.
Input: p_1(k) and p_2(k).
If p_1(k) \le p_2(k) then
  X_k = \hat{X}_{k,\mu_2},
  P_{k,\mu_1} = \sum_{j=1}^{r} p_j(k) [ P_{k,\mu_j} + \varepsilon_{\mu_j} \varepsilon_{\mu_j}^T ];
else
  X_k = \hat{X}_{k,\mu_1},
  P_{k,\mu_2} = \sum_{j=1}^{r} p_j(k) [ P_{k,\mu_j} + \varepsilon_{\mu_j} \varepsilon_{\mu_j}^T ].
The output is the selected X_k.

Algorithm 2 means that the selected model keeps its own error covariance at the current time step; only when a model's estimate is not selected is its error covariance updated using

P_{k,\mu_i} = \sum_{j=1}^{r} p_j(k) [ P_{k,\mu_j} + \varepsilon_{\mu_j} \varepsilon_{\mu_j}^T ],    (20)

where \varepsilon_{\mu_j} = X_k - \hat{X}_{k,\mu_j}. Based on the above equations and algorithms, the integrated process of the double model filters and mean shift tracking is given in Algorithm 3.

Algorithm 3: Mean shift tracking using double model filters.
Input: the target model q_u (u = 1, ..., m) and its location Y_0 in the previous frame.
Step 1: Compute the predicted locations of the two models using (13) and (14).
Step 2: Obtain two measurements using two mean shift trackers as defined in [9].
Step 3: Select the correct measurement using Algorithm 1.
Step 4: Compute the two conditional likelihoods using (19).
Step 5: Select the correct model's estimate as the final estimate using Algorithm 2.
Output: the new location Y_1 of the target in the current frame, where Y_1 = [X_k(1), X_k(2)] and X_k = (x_k, y_k, \dot{x}_k, \dot{y}_k)^T.
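The likelihood evaluation and the switch of Algorithm 2 reduce to a few lines of NumPy. The sketch below is only an illustration of Eqs. (16)-(19) and the model selection; the names `model_likelihood` and `switch`, and the tuple layout of `preds`, are assumptions rather than the authors' code.

```python
import numpy as np

def model_likelihood(z, x_pred, P_pred, H, R):
    """Eqs. (16)-(17): Gaussian likelihood of measurement z under one motion model."""
    C = H @ P_pred @ H.T + R
    r = z - H @ x_pred
    return np.exp(-0.5 * r @ np.linalg.solve(C, r)) / (2 * np.pi * np.sqrt(np.linalg.det(C)))

def switch(z, preds):
    """Eq. (19) and Algorithm 2: normalise the likelihoods and pick the model with the
    larger probability. `preds` is a list of (x_pred, P_pred, H, R) tuples, one per model."""
    f = np.array([model_likelihood(z, x_pred, P_pred, H, R)
                  for x_pred, P_pred, H, R in preds])
    p = f / f.sum()                               # Eq. (19)
    best = int(np.argmax(p))                      # probabilistic switch
    return best, p
```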
5 Experiment Study

A sequence of a falling and bouncing ball is used as the test sequence. The ball's velocity is not only changing but also large: mean shift cannot track it successfully, and a single Kalman filter's estimate is also incorrect because of the velocity changes. The proposed combination of mean shift and double model filters is employed to deal with this difficulty. Figure 3 shows the tracking results and the comparison on the falling and bouncing ball sequence. Because two Kalman filters are employed, the computational complexity is higher than that of mean shift alone or mean shift integrated with a single Kalman filter. However, the
Fig. 3. Tracking processes of mean shift, mean shift integrated with a single Kalman filter, and mean shift integrated with double model filters. The first row is the tracking process of mean shift, which fails because of the large velocity. The second row is the tracking process of mean shift integrated with a single Kalman filter; it gets better results in the first few frames, but when the ball's movement changes suddenly it also fails. The third row is the tracking process of mean shift integrated with double model filters; the proposed method successfully keeps tracking while the ball's velocity is large and changing.

Figure 4 shows each model's probabilistic likelihood during the tracking process.
Fig. 4. Probabilistic likelihood of each motion model. The likelihood determines which model is selected during tracking and reflects when the correct model is selected to adapt to the ball's movement.

Figure 5 shows each model's error covariance during the tracking process.
double model filter approach is worth its cost because of the better tracking performance. Considering the speed of current personal computers (typically higher than 1 GHz), the proposed method is fast enough to be used as a real-time tracking method.
Fig. 5. Each model's error covariance during tracking. The first two plots are the error covariance of model 1 in the x and y directions; the last two plots are the error covariance of model 2 in the x and y directions. According to Kalman filter theory, both models work well throughout the tracking process.
6 Conclusion

An efficient image tracking method based on double model filters and mean shift has been implemented. Mean shift tracking fails when the velocity of the target is so large that the target's kernel window in the previous frame cannot cover the target in the current frame. Using a Kalman filter's prediction as the input of mean shift can overcome this problem; however, when the Kalman filter's prediction is incorrect, the tracking performance becomes even worse than that of mean shift alone. A second-order model can change its velocity estimate during tracking, but in the prediction stage it still uses the previous velocity even when the real velocity of the target has changed in the current frame, so it also fails to solve the problem. In our study, the two models switch between each other through a probabilistic likelihood based on the selected measurement. Experiments show that the proposed method can keep tracking the target whether its velocity is large or small, changing or constant. In addition, the method works well with a modest requirement on computational resources.
References
1. Stauffer, C., Grimson, W.E.L.: Adaptive Background Mixture Models for Real-Time Tracking. In: Proc. IEEE Computer Society Conf. on Computer Vision and Pattern Recognition 2 (1999) 246-252
2. Elgammal, A., Harwood, D., Davis, L.: Non-Parametric Model for Background Subtraction. In: Proc. European Conf. on Computer Vision, Dublin, Ireland, II (2000) 751-767
3. Magee, D.: Tracking Multiple Vehicles using Foreground, Background and Motion Models. Image and Vision Computing 22 (2004) 143-155
4. Papageorgiou, C., Oren, M., Poggio, T.: General Framework for Object Detection. Journal of Engineering and Applied Science (1998) 555-562
5. Oren, M., Papageorgiou, C., Sinha, P., Osuna, E., Poggio, T.: Pedestrian Detection using Wavelet Templates. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (1997) 193-199
6. Isard, M., Blake, A.: Condensation - Conditional Density Propagation for Visual Tracking. Int. J. Computer Vision 29 (1998) 5-28
7. Yilmaz, A., Shafique, K., Shah, M.: Target Tracking in Airborne Forward Looking Infrared Imagery. Image and Vision Computing 21 (2003) 623-635
8. Nguyen, H.T., Worring, M., Van den Boomgaard, R.: Occlusion Robust Adaptive Template Tracking. IEEE Int. Conf. on Computer Vision 1 (2001) 678-683
9. Comaniciu, D., Ramesh, V., Meer, P.: Real-Time Tracking of Non-Rigid Objects using Mean Shift. In: IEEE Proceedings of Computer Vision and Pattern Recognition, Hilton Head Island, South Carolina, 2 (2000) 142-149
10. Comaniciu, D., Ramesh, V.: Mean Shift and Optimal Prediction for Efficient Object Tracking. IEEE International Conference on Image Processing 3 (2000) 70-73
11. Comaniciu, D., Ramesh, V., Meer, P.: Kernel-Based Object Tracking. IEEE Trans. Pattern Analysis and Machine Intelligence 25 (2003) 564-577
12. Comaniciu, D., Meer, P.: Mean Shift: A Robust Approach Toward Feature Space Analysis. IEEE Trans. Pattern Analysis and Machine Intelligence 24 (2002) 603-619
13. Shan, C.F., Wei, Y.C., Tan, T., Ojardias, F.: Real Time Hand Tracking by Combining Particle Filtering and Mean Shift. Sixth IEEE International Conference on Automatic Face and Gesture Recognition (2004) 669-674
14. Grewal, M.S., Andrews, A.P.: Kalman Filtering: Theory and Practice Using MATLAB (Second Edition). John Wiley & Sons, Inc. (2001) 163-165
15. Bar-Shalom, Y., Chang, K.C., Blom, H.A.P.: Tracking a Maneuvering Target using Input Estimation Versus the Interacting Multiple Model Algorithm. IEEE Trans. Aerosp. Electron. Syst. 2 (1989) 296-300
16. Mazor, E., Averbuch, A., Bar-Shalom, Y., Dayan, J.: Interacting Multiple Model Methods in Target Tracking: A Survey. IEEE Trans. Aerosp. Electron. Syst. 34 (1998) 103-123
Human Action Recognition Using a Modified Convolutional Neural Network* Ho-Joon Kim1, Joseph S. Lee1, and Hyun-Seung Yang2 1
School of Computer Science and Electronic Engineering Handong University, Pohang, 791-708, Korea
[email protected] 2 Department of Computer Science, KAIST Daejeon, 305-701, Korea
[email protected]
Abstract. In this paper, a human action recognition method using a hybrid neural network is presented. The method consists of three stages: preprocessing, feature extraction, and pattern classification. For feature extraction, we propose a modified convolutional neural network (CNN) which has a three-dimensional receptive field. The CNN generates a set of feature maps from the action descriptors which are derived from a spatiotemporal volume. A weighted fuzzy min-max (WFMM) neural network is used for the pattern classification stage. We introduce a feature selection technique using the WFMM model to reduce the dimensionality of the feature space. Two kinds of relevance factors between features and pattern classes are defined to analyze the salient features.
1 Introduction

Recognition of human actions is very significant for various practical applications such as intelligent autonomous systems, human-computer interaction, and visual surveillance. However, one of the difficulties in developing an action recognition system is handling the translations and distortions of features among different patterns that belong to the same action class. Previous work on action representation and recognition has suggested several approaches to overcome this constraint. Davis and Bobick [2] developed a view-based approach for the representation and recognition of actions by constructing motion-history images, in which the intensity of a pixel represents the recency of motion. Yamato et al. [3] used a Hidden Markov Model, which can analyze time series with spatiotemporal variability, by transforming a set of time-sequential images into a symbol sequence by vector quantization. Recently, Yilmaz and Shah proposed a novel action representation named the action sketch, generated from a view-invariant action volume by stacking only the object regions of consecutive input frames [4]. Our work is motivated by this technique for the representation of temporal templates.
* This research is supported by the ubiquitous computing and network project, the Ministry of Information and Communication 21st century frontier R&D program in Korea.
In this paper, we propose a modified convolutional neural network (CNN) model that has a three-dimensional receptive field to extract translation invariant features from a three-dimensional action volume. CNNs are bio-inspired hierarchical multilayered neural networks that achieve some degree of shift and deformation invariance using three ideas: local receptive fields, shared weights, and spatial subsampling [5, 6]. In our earlier work [10], we proposed a weighted fuzzy min-max (WFMM) neural network for pattern classification based on Simpson’s model [7]. The WFMM model is a hyperbox-based pattern classifier and provides a simple and powerful learning algorithm. The model has an incremental learning capability and can be utilized for a feature selection technique and reduce the dimensionality of the feature space. Two kinds of relevance factors between features and pattern classes are defined to analyze the feature saliency. The remainder of this paper is structured as follows. Section 2 provides an overview of the proposed action recognition system. In section 3, the feature extraction method using action volumes and the modified CNN is presented. Section 4 describes an action pattern classification technique using the WFMM model. Experimental results including the feature analysis is demonstrated in Section 5. Section 6 concludes our work.
2 Multi-stage Action Recognition As shown in Fig. 1, the underlying action recognition system consists of three stages: preprocessing, feature extraction, and pattern classification. In the preprocessing stage, an adaptive background segmentation technique proposed by Stauffer and Grimson [1] has been used. The threshold values and the reference
Fig. 1. Overview of the proposed action recognition system
background image are adaptively updated by the training process. From the segmented images, an action volume is generated and a set of action descriptors are extracted from it. The feature map extractor is based on a CNN model which has two types of sublayers called convolution layer and subsampling layer. Each layer of the network extracts successively larger features in a hierarchical set of layers. The pattern classifier is implemented using a WFMM neural network. More details of the feature extractor and pattern classifier are described in the following two sections.
3 Feature Extraction For the feature extraction stage, we use a spatiotemporal volume (STV) called action volume [4]. When an object performs an action in 3D space, the outer boundary of the object is projected as a 2D contour in the image plane. A sequence of such 2D contour with respect to time generates the STV. This STV becomes the input of the convolutional neural network. From this action volume template, a set of action descriptors is extracted.
Fig. 2. An example of three different actions and their action descriptors, the first row shows the action volumes, the second row shows a temporal view of the action volume by x fixed at some constant position, and the third row shows action descriptors where temporal domain orientation was set to 3π/4. (a) sit down, (b) kicking, (c) surrender
We have used a three-dimensional Gabor filter to extract action descriptors which reflect the changes in direction, speed, and shape of parts of the image volume [8]. Action descriptors are extracted using a bank of 3D Gabor filters with different orientation in spatial and temporal domains. Fig. 2 shows examples of action volumes and
their corresponding action descriptors. The STV provides view invariant features for action recognition. However, there still may be existing variance in feature locations in three-dimensional space after normalization. Fig. 3 is an illustration of the variance of feature locations for the same action.
Fig. 3. Same actions showing variance of feature locations of a 2D spatiotemporal response profile at a time fixed at t=20
In order to solve this problem, we propose a modified convolutional neural network with a three-dimensional receptive field. As shown in Fig. 4, each layer of the network includes two types of sub-layers, the convolution layer and the subsampling layer. The network extracts successively larger features in a hierarchical set of layers. For the feature extractor, a set of action descriptors on a three-dimensional structure is generated.
Fig. 4. Structure of the convolutional neural network with 3D receptive fields
The size of an initial feature map is 20×20×20. Each unit in the initial feature maps is connected to a 3×3×3 neighborhood of the input pattern. In the subsampling layer, the feature map has half the number of rows and columns of the input
data. Therefore the second layer has C(4,2) = 6 feature maps of size 10×10×10, and the second subsampling layer generates C(6,2) = 15 feature maps. The final feature maps are three-dimensional volumes of size 5×5×5, and each unit of these maps becomes input data for the action classifier. The number of input features can be reduced through the feature analysis technique described in the next section.
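One convolution/subsampling stage of the 3D feature extractor can be sketched as follows. This is only an illustration of the shared-weight 3×3×3 convolution and the halving subsampling described above; the choice of the tanh squashing function and of average pooling over 2×2×2 blocks are assumptions, since the text does not specify them.

```python
import numpy as np
from scipy.ndimage import convolve

def conv3d_feature_map(volume, kernel):
    """Convolution layer with a 3x3x3 receptive field and shared weights,
    followed by a squashing nonlinearity; output has the same shape as `volume`."""
    return np.tanh(convolve(volume, kernel, mode="constant"))

def subsample3d(feature_map):
    """Subsampling layer: average non-overlapping 2x2x2 blocks, halving each dimension."""
    t, h, w = (s // 2 for s in feature_map.shape)
    v = feature_map[:2 * t, :2 * h, :2 * w]
    return v.reshape(t, 2, h, 2, w, 2).mean(axis=(1, 3, 5))

volume = np.random.rand(20, 20, 20)                       # action descriptor volume
kernel = np.random.rand(3, 3, 3) * 0.1                    # shared weights of one feature map
fmap = subsample3d(conv3d_feature_map(volume, kernel))    # -> (10, 10, 10)
```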
4 Action Pattern Classification and Feature Analysis As shown in Fig. 1 we have employed a WFMM neural network [10] for the action pattern classification. Fig. 5 shows the structure of the WFMM model.
Fig. 5. The structure of the WFMM model
The intermediate layer of the network consists of hyperbox nodes whose membership function is defined by

B_j = \{ X, U_j, V_j, C_j, F_j, f(X, U_j, V_j, C_j, F_j) \},  \forall X \in I^n,

where U_j and V_j are the vectors of the minimum and maximum values of hyperbox j, respectively, C_j is the set of mean points of the feature values, and F_j is the set of frequencies of feature occurrences within the hyperbox. The membership of each hyperbox node is computed as

b_j(A_h) = \frac{1}{\sum_{i=1}^{n} w_{ji}} \sum_{i=1}^{n} w_{ji} \big[ \max(0, 1 - \max(0, \gamma_{jiv} \min(1, a_{hi} - v_{ji}))) + \max(0, 1 - \max(0, \gamma_{jiu} \min(1, u_{ji} - a_{hi}))) - 1.0 \big]

with

\gamma_{jiu} = \gamma / R_U,   \gamma_{jiv} = \gamma / R_V,
R_U = \max(s, u_{ji}^{new} - u_{ji}^{old}),   R_V = \max(s, v_{ji}^{old} - v_{ji}^{new}).
.
720
H.-J. Kim, J.S. Lee, and H.-S. Yang
In the above equation, wji is the connection weight which means the relevance factor between the ith feature and jth hyperbox. s is a positive constant to prevent the weight from having too high value when the feature range is too small. The learning process of the model consists of three sub-processes: hyperbox creation, expansion, and contraction. If the expansion criterion shown in equation n
nθ ≥ ∑ (max(v ji , xhi ) − min(u ji , xhi )) i =1
has been met for hyperbox Bj, then fji, uji, vji, are adjusted using the equation
⎧ f jinew = f jiold + 1 ⎪ new old ⎨u ji = min(u ji , xhi ) ⎪ v new = max(v old , x ) ji hi ⎩ ji
∀i = 1,2,…, n ,
and the mean points are adjusted by
c new = (c ji f jiold + x hi ) / f jinew . ji During the learning process the weight values are determined by w ji =
αf ji R
R = max( s, v ji − u ji ) .
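The membership function, the expansion criterion, and the weight rule above can be sketched compactly. This Python fragment is only an illustration under assumed array-based data structures (the helper names `membership`, `try_expand`, and `weight` are hypothetical):

```python
import numpy as np

def membership(a, u, v, w, gamma_u, gamma_v):
    """Weighted hyperbox membership b_j(A_h); all arguments are length-n arrays."""
    above = np.maximum(0, 1 - np.maximum(0, gamma_v * np.minimum(1, a - v)))
    below = np.maximum(0, 1 - np.maximum(0, gamma_u * np.minimum(1, u - a)))
    return float(np.sum(w * (above + below - 1.0)) / np.sum(w))

def try_expand(u, v, f, x, theta):
    """Expansion sub-process: grow hyperbox (u, v) to include sample x if the
    criterion n*theta >= sum_i (max(v_i, x_i) - min(u_i, x_i)) holds."""
    if np.sum(np.maximum(v, x) - np.minimum(u, x)) <= len(x) * theta:
        return np.minimum(u, x), np.maximum(v, x), f + 1, True
    return u, v, f, False

def weight(alpha, f, u, v, s):
    """w_ji = alpha * f_ji / R with R = max(s, v_ji - u_ji)."""
    return alpha * f / np.maximum(s, v - u)
```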
For feature analysis, we define two kinds of relevance factors using the WFMM model: RF1(x_i, C_k) is the relevance factor between a feature value x_i and class C_k, and RF2(X_i, C_k) is the relevance factor between a feature type X_i and class C_k. The first measure is defined as

RF1(x_i, C_k) = \Big( \frac{1}{N_k} \sum_{B_j \in C_k} S(x_i, (u_{ji}, v_{ji})) \cdot w_{ji} - \frac{1}{N_B - N_k} \sum_{B_j \notin C_k} S(x_i, (u_{ji}, v_{ji})) \cdot w_{ji} \Big) \Big/ \sum_{B_j \in C_k} w_{ji}.    (1)
In the equation, constant NB and Nk are the total number of hyperboxes and the number of hyperboxes that belong to class k, respectively. Therefore if the RF1(xi, Ck) has a positive value, it indicates the excitatory relationship between the feature xi and the class k. But a negative value of RF1(xi, Ck) means an inhibitory relationship between them. A list of interesting features for a given class can be extracted using the RF1 for each feature. In equation (1), the feature value xi can be defined as a fuzzy interval which consists of min and max values on the ith dimension out of the n-dimension feature space. For an arbitrary feature xi, xiL and xiU are the min and max value, respectively. The function S is a similarity measure between two fuzzy intervals.
The second measure RF2 can be defined in terms of RF1 as

RF2(X_i, C_k) = (1/L_i) Σ_{x_l ∈ X_i} RF1(x_l, C_k).    (2)
L_i is the number of feature values that belong to the i-th feature. RF2 in equation (2) represents the degree of importance of a feature in classifying a given class; therefore, it can be utilized to select a more relevant feature set for action pattern recognition.
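As an illustration of how the relevance factors of equations (1) and (2) can be computed from a trained set of hyperboxes, a small sketch follows. The hyperbox container and the interval-overlap similarity used in place of S are assumptions; the paper does not spell out its implementation here.

```python
import numpy as np

# Sketch of RF1 (Eq. 1) and RF2 (Eq. 2). Each hyperbox is assumed to be a
# dict with per-feature vectors 'u', 'v', 'w' and a class 'label'; the
# interval-overlap 'similarity' merely stands in for the measure S.

def similarity(x_iv, box_iv):
    """Assumed overlap-based similarity between two 1-D fuzzy intervals."""
    (xl, xu), (bl, bu) = x_iv, box_iv
    overlap = max(0.0, min(xu, bu) - max(xl, bl))
    return overlap / max(xu - xl, bu - bl, 1e-12)

def rf1(x_iv, i, k, boxes):
    """Relevance factor between feature value x_iv (feature i) and class k."""
    in_k = [b for b in boxes if b["label"] == k]
    out_k = [b for b in boxes if b["label"] != k]
    pos = np.mean([similarity(x_iv, (b["u"][i], b["v"][i])) * b["w"][i] for b in in_k])
    neg = (np.mean([similarity(x_iv, (b["u"][i], b["v"][i])) * b["w"][i] for b in out_k])
           if out_k else 0.0)
    return (pos - neg) / sum(b["w"][i] for b in in_k)

def rf2(feature_values, i, k, boxes):
    """Relevance of feature type i for class k: average RF1 over its values."""
    return float(np.mean([rf1(iv, i, k, boxes) for iv in feature_values]))
```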
5 Experimental Results We have conducted three types of experiments using a set of video sequences of six different actions: walking, sit down, falling, stand up, surrender, and kicking. The first experiment was done to test the feature map generation process using the proposed CNN model. Fig. 6 shows intermediate 10x10 feature maps which are derived from two different patterns of the same action shown in Fig. 3. As shown in Fig. 6, the feature extractor is capable of generating more invariant feature sets for translation or deformation from the base features. The second experiment is the feature analysis using the WFMM model. Table 1 shows an illustration of the feature analysis results. In the table, the relevance factor
Fig. 6. Feature extraction results: two similar feature maps generated from the two different input data shown in Fig. 3

Table 1. Feature analysis results

Sit down                                      Surrender
Location in feature map (t, x, y)   RF2       Location in feature map (t, x, y)   RF2
(4, 4, 4)                           0.47      (5, 2, 1)                           0.51
(3, 3, 2)                           0.28      (3, 3, 2)                           0.35
(2, 3, 2)                           0.27      (2, 3, 2)                           0.29
(1, 3, 3)                           0.20      (4, 4, 4)                           0.22
(5, 2, 1)                           0.00      (1, 3, 3)                           0.20
Table 2. Action recognition results for 50 test patterns of 5 different subjects

Actions      Walking   Falling   Sit down   Stand up   Surrender   Kicking
Walking         7         0         0          0           1          1
Falling         0         6         1          0           0          0
Sit down        0         3         7          0           0          0
Stand up        0         0         0          8           0          0
Surrender       0         0         0          1           7          0
Kicking         1         0         0          0           0          7
values and feature locations are listed. As shown in the table, the most relevant features can be adaptively selected for a given action pattern. These data can be utilized to reduce the number of nodes of the pattern classifier. Finally, we tested the action recognition performance. Five data sets for each action type were used for the learning process, and 50 arbitrary action patterns of 5 different subjects were used for the test stage. Table 2 shows the recognition results. Overall, the system proves effective in classifying actions, with a total recognition rate of 84%, although some confusion remains: 'sit down' actions were relatively hard to distinguish from 'falling' actions due to their similar feature distributions. This can be improved by fine-tuning the parameters of the classification model.
6 Conclusion The three-dimensional receptive field structure of the CNN model provides translation invariant feature extraction capability, and the use of shared weight also reduces the number of parameters in the action recognition system. The action volumes and action descriptors are invariant to the viewing angle of the camera. Therefore, the system can perform view-independent action recognition. The WFMM neural network is capable of utilizing the feature distribution and frequency in the learning process as well as in the classification process. Since the weight factor effectively reflects the relationship between feature range and its distribution, the system can prevent undesirable performance degradation which may be caused by noisy patterns. The feature relevance measures computed through the feature analysis technique can be utilized to design an optimal structure of the action classifier.
References 1. Stauffer, C., Grimson, W.E.L.: Adaptive Background Mixture Models for Real-Time Tracking. Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (1999) 246-252 2. Davis, J.W., Bobick, A.F.: The Representation and Recognition of Action Using Temporal Templates. Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (1997) 928-934
3. Yamato, J., Ohya, J., Ishii, K.: Recognizing Human Action in Time-Sequential Images Using Hidden Markov Model. Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition (1992) 379-385 4. Yilmaz, A., Shah, M.: Actions Sketch: A Novel Action Representation. IEEE Computer Society Conference on Computer Vision and Pattern Recognition 1 (2005) 984-989 5. Garcia, C., Delakis, M.: Convolutional Face Finder: A Neural Architecture for Fast and Robust Face Detection. IEEE Trans. Pattern Analysis and Machine Intelligence 26(11) (2004) 1408-1423 6. Lawrence, S., Giles, C.L., Tsoi, A.C., Back, A.D.: Face Recognition: A Convolutional Neural Network Approach. IEEE Trans. Neural Networks 8(1) (1997) 98-113 7. Simpson, P.K.: Fuzzy Min-Max Neural Networks Part 1: Classification. IEEE Trans. Neural Networks 3(5) (1991) 776-786 8. MacLennan, B.: Gabor Representations of Spatiotemporal Visual Images. Technical Report CS-91-144, Computer Science Department, University of Tennessee September (1991) 9. Kim, H.J., Cho, II-G., Yang, H.S.: Face Detection and Tracking using a Modified Convolutional Neural Network. The 2005 International Conference on Artificial Intelligence (2005) 10. Kim, H.J., Lee, J.H., Yang, H.S.: A Weighted FMM Network and Its Application to Face Detection. Lecture Notes in Computer Science 4233 (2006) 177-186
Neural Networks Based Image Recognition: A New Approach Jiyun Yang1, Xiaofeng Liao1, 3, Shaojiang Deng1, Miao Yu2, and Hongying Zheng1 1
College of Computer Science and Engineering, Chongqing University Chongqing 400044, China
[email protected] 2 Dept. of Optoelectronic Engineering, Chongqing University Chongqing 400044, China 3 The Key Laboratory of Optoelectric Technology & Systems Ministry of Education, China
Abstract. In this paper, a new application algorithm for image recognition based on neural networks is proposed. The algorithm, which includes a recognition algorithm and a training algorithm for a BP neural network, can recognize a continually changing large gray-scale image. It has been applied to deflection measurement in bridge health monitoring with great success.
1 Introduction
In recent years, bridge health monitoring technology [1], [2] has gradually developed and matured; an important index of it is deformation measurement [3], [4]. There have been quite a few deflection measurement methods so far, among which image deflection measurement [5] is comparatively representative. Fig. 1 describes the principle of image deflection measurement. A measurement target is mounted on the bridge, and an image of the measurement target is formed on the receiving plane of a CCD (Charge-Coupled Device) through the optical system. When the bridge vibrates because of load, the image, as well as the measurement target, moves. So we can obtain the displacement of the monitoring point from the displacement of the image by computing Eq. (1):

X = Y·L/L'    (1)
Here, X is the displacement of the monitoring point, Y is the displacement of the measurement target's image on the CCD plane, L is the distance between the monitoring point and the lens, and L' is the distance between the receiving plane and the lens. Fig. 2 shows the image of the measurement target on the CCD plane; the light is the measurement target mentioned above, and Y in Eq. (1) is the distance between the center of the light and the top of the image. In Eq. (1), L' and L are constants, so if Y is known we can obtain X. Our task is therefore to figure out Y, which means we must precisely recognize the light. There have been quite a few image recognition methods so far. Many methods apply neural networks to image recognition, such as the BP neural network [6] [7], the Hopfield neural network [8] [9] [10], and so on.
Fig. 1. Principle of image deflection measurement (measurement target on the bridge, lens and CCD plane; displacements X and Y, distances L and L')

Fig. 2. Image of measurement target
When we use those methods for image recognition, we usually let the neural network memorize a few fixed target patterns by learning, and the network can then recognize noisy input images. This kind of application is only suitable for a few fixed target patterns, and the recognized image cannot be too large. In Fig. 2, the light moves continuously in the receiving area of the CCD along the vertical direction, and the image is 768 pixels wide and 576 pixels high. If we used a neural network to find the position of the light as above, every position of the light would be a target pattern, so there would be far too many target patterns to memorize. In addition, because the image is so large, the neural network would need a great many neurons. Therefore, a new application method is required.
2 Main Result
The new application of the neural network includes a training period and a recognizing period. For the sake of convenience, we define:

- Row gray value matrix R = [b_1 b_2 ... b_W], where b_i is the gray value of the i-th pixel in one row and W is the width of the whole image;
- Light area gray value matrix L = [R_1 R_2 ... R_L]^T, where R_i is a row gray value matrix and L is the height of the light area;
- Training area gray value matrix C = [R_B R_{B+1} ... R_E]^T, E − B = L, where B is the beginning row index of the training area and E is the end row index of the training area;
- Trying area gray value matrix T = [R_B R_{B+1} ... R_E]^T, E − B = L, where B is the beginning row index of the trying area and E is the end row index of the trying area.

2.1 Training Neural Network
In the training period, we compose a series of training matrices C as defined above. Every training matrix C is composed of the pixel gray values of a part of the image whose size equals that of the light area, and all pixels in a training area are contiguous. One of the training matrices is composed of the pixel gray values of the light area itself; that is, the light area gray value matrix also belongs to the set of training matrices. Every training matrix has a different matching degree to the light area. Since the size of a training matrix exceeds 768×20, letting the BP neural network memorize all the data of the training matrices would make it comparatively complex. We therefore use a PCA neural network to obtain a feature matrix for each training matrix, and then use the feature matrices to train the BP neural network.
Fig. 3. PCA neural networks
We compose the PCA neural network to obtain the feature matrix, as shown in Fig. 3. The input data of this neural network are all elements of a training matrix or a trying matrix, and the output data are the feature matrix of that training or trying matrix, where M is the size of the feature matrix, which has an important effect on recognition accuracy, and N = W × L. This neural network adopts Eq. (2) to adjust the weight matrix:

Δw_ji(n) = η [ y_j(n) x_i(n) − y_j(n) Σ_{k=1}^{j} w_ki(n) y_k(n) ]    (2)
When the process of adjusting the weight matrix has been completed, the neural network outputs the feature matrix:

y_j(n) = Σ_{i=1}^{m} w_ji x_i(n)    (3)
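A minimal sketch of this PCA network update (equation (2), a generalized Hebbian rule) and of the feature output of equation (3) is shown below; the learning-rate value and the matrix shapes are assumptions made for illustration.

```python
import numpy as np

# Sketch of the PCA network of Fig. 3: Eq. (2) adjusts the M x N weight
# matrix W for one input vector x, and Eq. (3) gives the feature output.
# The learning rate and iteration scheme are assumptions.

def pca_step(W, x, eta=0.01):
    """One adjustment: dw_ji = eta*(y_j x_i - y_j sum_{k<=j} w_ki y_k)."""
    y = W @ x                               # Eq. (3): y_j = sum_i w_ji x_i
    lower = np.tril(np.outer(y, y))         # y_j * y_k for k <= j
    return W + eta * (np.outer(y, x) - lower @ W)

def feature_matrix(W, area):
    """Feature of a (flattened) training or trying area gray value matrix."""
    return W @ np.asarray(area, dtype=float).ravel()
```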
Then we compose the following multilayer neural network to memorize the matching degree, as shown in Fig. 4.

Fig. 4. BP neural networks
The input data of this neural network is the feature matrix of a training matrix, and the output data is the matching degree of that feature matrix. The following algorithm is adopted to train the BP neural network:

i) Set the control parameter n = 1;
ii) Obtain a training matrix C whose rows R_{E−n} to R_E are equal to rows R_{L−n} to R_L of L, and whose other L − n rows correspond to the L − n rows above the light area;
iii) Use the PCA neural network to obtain the feature matrix F of C;
iv) Set the target output t = n/L;
v) Use F and t to train the BP neural network;
vi) Set n = n + 1;
vii) If n ≤ L, go to ii); otherwise, end.
After the BP neural network has been trained, it can be used to recognize the light according to the following algorithm:

i) Set the migration M = 0 and the moving length S = I_s;
ii) Obtain the trying matrix T whose first row index is B = M;
iii) Use the PCA neural network to obtain the feature matrix F of T:
F = [ b_{B,1} b_{B,2} ... b_{B,M} ; b_{B+1,1} b_{B+1,2} ... b_{B+1,M} ; ... ; b_{E,1} b_{E,2} ... b_{E,M} ],  0 < M < W,  E − B = L;

iv) Use the BP neural network to obtain the matching degree α according to F;
v) If α < 1, modify S as in Eq. (4) and M as in Eq. (5), then go to ii);
vi) End.

S = ((γ − α) I_s − α) / γ   if α ≤ γ;   S = 1   if α > γ    (4)

where γ is a threshold which has an important effect on recognition accuracy.

M = M + S    (5)
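The recognition steps i)-vi) above can be summarized by the following sketch of the scanning loop. The helper functions stand for the trained PCA and BP networks, and the termination safeguard at the bottom of the image is an assumption added for illustration.

```python
# Sketch of the recognition loop (steps i-vi). `pca_feature` and
# `bp_matching_degree` stand for the trained PCA and BP networks; gamma and
# I_s follow the text, while the bottom-of-image safeguard is an assumption.

def locate_light(image, L, I_s, gamma, pca_feature, bp_matching_degree):
    M = 0                                   # i)   migration (first row index)
    while M + L <= image.shape[0]:
        T = image[M:M + L, :]               # ii)  trying area, B = M
        alpha = bp_matching_degree(pca_feature(T))   # iii)-iv) matching degree
        if alpha >= 1.0:                    # light area found
            break
        if alpha <= gamma:                  # v)  Eq. (4): adapt the step length
            S = ((gamma - alpha) * I_s - alpha) / gamma
        else:
            S = 1
        M += max(1, int(round(S)))          # Eq. (5): M = M + S
    return M                                # vi) end: row index of the light
```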
3 Application Examples
The light recognition method described above has been applied to the health monitoring project of the Masangxi Changjiang River Bridge (Chongqing, China). With the help of this method, the deflection measurement of this project turned out to be a great success. Fig. 5 shows the deflection curve of one monitoring point.

Fig. 5. Deflection curve of one point over the time sequence 8/10/2005 to 11/10/2005
4 Conclusions
In this paper, we proposed a new method for applying neural networks to image recognition. The method can recognize a continually changing image and also adapts well to the natural environment in actual applications.
Acknowledgement The work presented in this paper was supported by grants from the National Natural Science Foundation of China (No. 60573047), and the Natural Science Foundation of Chongqing (No. CSTC. 2006B2230).
References 1. Imran Rafiq, M., Marios, K., Chryssanthopoulos, T.O.: Performance Updating of Concrete Bridges Using Proactive Health Monitoring Methods. Reliability Engineering and System Safety 86 (2004) 247-256 2. Ko, J. M., Ni, Y.Q.: Technology Developments in Structural Health Monitoring of Largescale Bridges. Engineering Structures 27 (2005) 1715-1725 3. Xie, Y., Yan, P.Q., Yang, Q.: Measurement of Bridge Dynamic Deflection with Seismic Low Frequency Vibration Transducer. Chinese Science Abstracts Series A 14 (1995) 74 4. Nassif, H., Gindy, M., Davis, J.: Comparison of Laser Doppler Vibrometer with Contact Sensors for Monitoring Bridge Deflection and Vibration. NDT and E International 38 (2005) 213-218 5. Dong, H., Chen, W.M.: Method of Laser& Imaging Deflection Measurement. Journal of Transducer Technology 23 (2004) 25-26 6. Qiao, Y.J., Wu, G., Wang, X.: Quality Assessment of Venenum Bufonis by Neural Networks Pattern Recognition. Chinese Science Abstracts Series B 14 (1995) 14 7. Roukhe, A., Nachit, A.: Adaptive Algorithm of Image Restoration by Connectionnist Method. Comptes Rendus de lAcademie des Sciences Series IIB Mechanics Physics Chemistry Astronomy 326 (1998) 263-271 8. Lee, D.L.: Pattern Sequence Recognition Using a Time-varying Hopfield Network. IEEE Transaction on Neural Networks 13 (2002) 15-19 9. Huang, J.S., Liu, H.C.: Object Recognition Using Genetic Algorithms with Hopfield’s Neural Model. Experts Systems with Applications 13 (1997) 191-199 10. Li, W.J., Lee, T.: Object Recognition and Articulated Object Learning by Accumulative Hopfield Matching. Pattern Recognition 35 (2002) 1933-1948
Human Touching Behavior Recognition Based on Neural Networks Joung Woo Ryu, Cheonshu Park, and Joo-Chan Sohn Electronics and Telecommunication Research Institute, 161 Gajeong-dong, Yuseong-gu, Daejeon, 305-700, Korea {ryu0914,bettle,jcshon}@etri.re.kr
Abstract. Of the possible interactions between human and robot, touch is an important means of providing human beings with emotional relief. However, most previous studies have focused on interactions based on voice and images. In this paper, a method of recognizing human touching behaviors is proposed for developing a robot that can naturally interact with humans through touch. In this method, the recognition process is divided into a pre-process phase and a recognition phase. In the pre-process phase, recognizable characteristics are calculated from the data generated by the touch detector, which was fabricated using force sensors; the force sensor used is an FSR (force sensing resistor). The recognition phase classifies human touching behaviors using a multi-layer perceptron, a neural network model. We measured three different human touching behaviors, 'hitting', 'stroking', and 'tickling', for six men. In the test conducted with recognizers generated for each user, the average recognition rate was 93.8%, while the test conducted with a single recognizer showed a 79.8% average recognition rate. These results show the feasibility of the proposed human touching behavior recognition method.
1 Introduction
Reviewing the development of robotics, the trend is moving from robots which provide humans with information or carry out tasks for them towards robots which provide humans with emotional relief, such as AIBO and PARO [1, 2]. The interactive technologies between humans and robots have also been extended from sound and image recognition to more natural interaction. In particular, interaction through touching behaviors is attracting interest as a potential means of providing emotional relief to human beings. In the field of robotics, touch sensors are mostly used at robots' finger tips so that they can handle objects, or attached to the soles of their feet so that they can maintain balance while moving. The touch sensors currently being developed for these purposes include one-axis and three-axes touch sensors in film form [3], three-axes touch sensors made with silicon [4], and sensors made with synthetic rubber (PDMS) by KAIST. Recently, touch sensors have been developed to enable interaction between humans and robots through touching behaviors. As a representative study, the MIT Media Laboratory is developing, through the Huggable Project, a sensitive skin which will enable robots to 'feel' touch all over their body [5].
The Huggable Project is focused on a method of recognizing human touching behaviors with touch sensors, as well as on developing the sensitive skin, which is a touch sensor. Human touching behaviors were recognized using an MLP (multi-layer perceptron) and were defined as nine behaviors: tickling, poking, scratching, slapping, petting, patting, rubbing, squeezing, and contact. These behaviors were recognized from six characteristics, including the strength of the touch, the change of the strength, the direction of movement, and the change of direction of the movement on the contact surface. The result showed an average recognition rate of 53.37% for eight behaviors, excluding the slapping behavior, which was difficult to distinguish from the others [6]; such a low recognition rate was obtained because of the small amount of learning data. As another study, the human-robot PIFACT (physical interference and intended contact) definition system was developed at Waseda University for WENDY, a robot co-existing with human beings [7]. PIFACT defines 31 touching behaviors which may occur between humans and robots, but only ten behaviors were considered in order to evaluate the performance of the PIFACT definition system: beating, hitting, pushing, grasping, poking, pinching, tapping, stroking, scraping, and scratching. In the PIFACT definition system, the sensor data for recognizing the touch behaviors are generated by a touch detector using FSR sensors, and an MCP (modified counterpropagation) network is used as the recognizer. MCP is a neural network, a supervised learning model extended from the SOM (self-organizing map). As the characteristics by which the neural network model can classify touching behaviors, nine variables were calculated: maximum force, time elapsed to reach the maximum force, contact time, contact frequency, average distribution of force, average force, maximum contact area, maximum distance between two points in the contact area, and the direction of the contact area movement. In experiments conducted ten times for each combination, the highest average recognition rate among three combinations of the said nine characteristics was 81%. This paper suggests a human touching behavior recognition method that will enable robots to form 'emotions' in accordance with the touching behaviors of humans. In the method, a recognizer using an MLP is employed, and a pre-process phase is designed in order to calculate recognizable characteristics from the continuous sensor data generated by a touch detector. The characteristics are calculated from the strength and distribution of the force contacting the touch detector, which was designed and fabricated using the FSR sensor, a force sensor. The remainder of this paper is organized as follows: Chapter 2 is a brief description of the touch detector used in this study; Chapters 3 and 4 concern the proposed method and the results of the experiments conducted to validate it, respectively; finally, Chapter 5 provides conclusions and further works.
2 Touch Detector In this study, touching behaviors were detected with the FSR sensors arranged as shown in Fig. 1. The contact area of the touch detector is a thin plastic plate with dimensions of 13.5cm × 8.5cm × 1mm. Underneath the contact area are five FSR
sensors, on which 4mm-high pillars support the plastic plate. Using the pillars and the thin plastic plate enables the detection of the contact force exerted on the sensor. If the force is exerted outside the pillar, the plastic plate transfers the force to the pillar and the sensor. Therefore, contacts on a wider area can be detected with fewer sensors. The FSR sensor is a polymer film whose electric resistance decreases according to the increase in force exerted on its surface. The FSR sensor used in this study was Part No.402 [8], with a diameter of 12mm.
Fig. 1. Touch detector using FSR sensor elements
The signal processor and the pillars on the sensor elements of the touch detector were fabricated by TechStorm Inc. The 12-bit resolution A/D converter was incorporated in the signal processor unit. The range of the detectable force was 0gf to 500gf. Forces below or exceeding the limits will generate zero or maximum signals. The scan frequency in the signal process was 47Hz, generating the outputs of each sensor element 47 times a second. That is, an output of the sensor element is generated once per 0.02 seconds. The outputs were defined as the sampling data.
3 Method of Recognizing Touching Behaviors In the suggested method, the touching behavior recognition process can be divided into a pre-process phase and a recognition phase, as illustrated in Fig. 2. The preprocess phase converts the sampling data generated in the touch detector into the input data of the recognizer. The recognition phase classifies the input data into touching behaviors using the recognizer.
Fig. 2. Touch behavior recognition process
3.1 Preprocess Phase
The pre-process phase is divided into normalization, pattern extraction, and characteristics generation steps, each of which is applied as necessary. The sampling data entering the pre-process phase is expressed as the vector S = (s_1, s_2, …, s_n), where the dimension n of the vector space is the number of sensors of the touch detector. Normalization converts the raw data into a defined range; in this paper, the range of the sampling data values is defined as [0, 1] using the Min-Max method, which generates the normalized data S' = (s'_1, s'_2, …, s'_n) by mapping the values. By summing the elements of the normalized data vector and expressing the sums in a histogram, as shown in Fig. 3, the activation of the sensor elements can be seen: if the sum is larger than zero, a sensor element has been stimulated. However, noise may produce a value larger than zero even without stimulation. Therefore, any noise has to be detected prior to recognition and filtered out in order to obtain the values of the sensors actually activated by the stimulation.
Fig. 3. Histogram before and after filtering
Next, as the filtering method adopted in this study against noise, an activated sensor value smaller than a preset value is considered to be generated by noise; the preset boundary value is the threshold, and the result of the filtering depends on how the threshold is determined. The threshold is set for each sensor element as in (1), using the average output value and standard deviation calculated over a certain period of time when there is no stimulation:

θ_i = s̄'_i + α σ_i,  α ≥ 3,    (1)

where s̄'_i and σ_i are the average output value and standard deviation of the i-th sensor element, respectively, and α is a threshold control constant whose value is at least three. When α is too large, even sensor values activated by stimulation may be filtered out. In this paper, α is defined as a real number larger than three to reduce the influence of noise to less than 0.5%. Assuming that the sensor output values generated by noise without stimulation follow a normal distribution, with α equal to three, as shown in Fig. 4, the confidence interval is approximately 99% and the significance level is 1%; therefore, the probability of an output value being caused by noise is 0.5%.
Fig. 4. Threshold of i-th sensor element
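A short sketch of the normalization and threshold filtering just described is given below; the Min-Max bounds are assumed to be the detector's output range, and alpha = 3 follows equation (1).

```python
import numpy as np

# Sketch of the pre-processing described above. `raw` is an array of sampling
# vectors from the n sensor elements; `rest` holds samples recorded with no
# stimulation and is used to set the per-sensor thresholds of Eq. (1).
# The Min-Max bounds lo/hi are assumed to be the detector's output range.

def min_max_normalize(raw, lo, hi):
    """Map raw sensor outputs into [0, 1]."""
    return np.clip((np.asarray(raw, float) - lo) / (hi - lo), 0.0, 1.0)

def noise_thresholds(rest_norm, alpha=3.0):
    """theta_i = mean_i + alpha * std_i for each sensor element (Eq. 1)."""
    return rest_norm.mean(axis=0) + alpha * rest_norm.std(axis=0)

def filter_noise(s_norm, theta):
    """Keep only sensor values that exceed their noise threshold."""
    return np.where(s_norm > theta, s_norm, 0.0)
```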
Patterns are extracted from the normalized data in which the influence of noise has been reduced by filtering, as shown in Fig. 3(b), in order to recognize the touching behaviors. In this paper, a pattern is defined as a set of continuous normalized data in which touch stimulation exists, and it is represented as P = {S'_t, S'_{t+1}, …, S'_{t+m}}, where t is the time at which the stimulation began and m is the number of continuous normalized data included in the pattern. m also represents the duration of the stimulation, and its value differs every time a touching behavior is generated. After extracting the patterns, one input data vector for the recognizer is generated from each pattern. The input data is expressed as the vector D = (d_1, d_2, …, d_k), whose components are the characteristics used to classify a touching behavior. The characteristics are defined as the average stimulation magnitude of each sensor element of the touch detector within a pattern, as in (2); therefore, k equals n, the number of sensor elements of the touch detector, and the input data vector D carries information about the distribution of the stimulation magnitude on the touch detector:

d_i = (1/m) Σ_{j=1}^{m} s'_ij,    (2)

where s'_ij is the i-th element of the j-th normalized data included in the pattern P.

3.2 Recognition Phase
In the recognition phase, touching behaviors are classified by the recognizer with the input data vector D which is generated in the pre-process phase. In this study, a MLP model is used to classify touching behaviors according to the magnitude and distribution of stimulation, and the EBP (error back-propagation) method is used as the learning method.
Fig. 5. MLP structure
As shown in Fig. 5, the MLP in general has three layers: input, hidden, and output. It is a feedforward neural network model in which each layer is fully connected to the next layer. Each node computes its output value by applying an activation function to the weighted sum of its input values. The neural network provides a non-linear mapping function between input (D) and output (O), as expressed in (3):

O = F(D : W, A),    (3)

where W denotes the weights of the neural network and A represents the neural network structure. When an input is presented, the weights of the neural network are adjusted so that the target output value, i.e., the value required for that input, can be generated; this is the learning process of the neural network. The EBP method, one of the representative learning methods for the MLP, determines the weights using the gradient descent technique to minimize the error E in (4) between the target output values and the values calculated for the present input, where T_k and O_k are the target output value and the actual output value of the model at the k-th output node, respectively:

E = (1/2) Σ_k (O_k − T_k)².    (4)
During the learning process, the change of the weights at a certain time point t can be calculated as

Δw_ji ∝ − ∂E/∂W.    (5)

Therefore, assuming that the learning rate is η, the general equations for the weight changes in each layer can be expressed as follows:

- weight change between the output and hidden layers:

  w_kj(t+1) = w_kj(t) + η δ_k O_j,   δ_k = (O_k − T_k) O_k (1 − O_k);    (6)

- weight change between the hidden and input layers:

  w_ji(t+1) = w_ji(t) + η δ_j O_i,   δ_j = Σ_k δ_k w_kj(t) O_j (1 − O_j).    (7)
In the MLP model used in the experiments for confirming the feasibility of the proposed method, the number of nodes in the input layer was set at five due to the touch detector with five FSR sensor elements and that of the output layer was set at three due to classifying only three kinds of behavior: “stroking,” “tickling,” and “hitting.” That is, the value of n is five and the value of k is three. Therefore, a target output T is given by a triplet. For example, “stroking” is defined as (1, 0, 0). The number of
nodes in the hidden layer was set at six by experiment. For the learning of the recognizer, the learning rate was set at 0.3 and the number of learning cycles was set at 50,000. The simulation for the experiments was performed with Weka 3.4.8 (www.cs.waikato.ac.nz/ml/weka/).
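For illustration, a minimal sketch of the 5-6-3 recognizer with sigmoid units and one EBP update is given below. The layer sizes and the learning rate of 0.3 follow the text; the weight initialization, the omission of bias terms, and the conventional gradient-descent sign are assumptions of this sketch (the experiments themselves were run in Weka).

```python
import numpy as np

# Sketch of the 5-6-3 MLP recognizer with one EBP update per call.
# Bias terms are omitted and the usual gradient-descent sign is used;
# initialization and these simplifications are assumptions of the sketch.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_weights(seed=0):
    rng = np.random.default_rng(seed)
    return rng.normal(0, 0.5, (6, 5)), rng.normal(0, 0.5, (3, 6))

def ebp_step(W1, W2, d, t, eta=0.3):
    """One error back-propagation update for input d and target triplet t."""
    o_hid = sigmoid(W1 @ d)
    o_out = sigmoid(W2 @ o_hid)                               # cf. Eq. (3)
    delta_out = (o_out - t) * o_out * (1 - o_out)             # cf. Eq. (6)
    delta_hid = (W2.T @ delta_out) * o_hid * (1 - o_hid)      # cf. Eq. (7)
    W2 = W2 - eta * np.outer(delta_out, o_hid)                # descend on E
    W1 = W1 - eta * np.outer(delta_hid, d)
    return W1, W2, o_out

# e.g. a 'stroking' example with target (1, 0, 0) as defined in the text:
# W1, W2 = init_weights()
# W1, W2, out = ebp_step(W1, W2, np.array([.2, .4, .1, .0, .3]),
#                        np.array([1., 0., 0.]))
```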
4 Experiments
In the experiments, sampling data was generated with the touch detector described in Chapter 2 in order to verify the feasibility of the proposed touching behavior recognition method. A total of 23,713 sampling data were generated by six men performing three types of touching behaviors, as shown in Table 1.

Table 1. Number of sampling data

           User-1   User-2   User-3   User-4   User-5   User-6   Total
Stroking    1,314    1,411    1,065    1,155    1,108    1,168    7,221
Tickling    1,309    1,410    1,410    1,288    1,133    1,313    7,863
Hitting     1,410    1,922    1,410    1,410    1,260    1,217    8,629
As an initial study of an emotional robot which can interact with human beings through touch, the proposed method considered the three basic touching behaviors of 'stroking', 'tickling', and 'hitting'. As shown in Table 2, 1,096 patterns were generated from the sampling data. The table shows that 'stroking' and 'tickling' have cases with fewer patterns but large standard deviations. Since the number of sampling data per pattern represents the duration of touch contact, this means that, for these two touching behaviors, there were cases in which the touching time was long. When the touching time is long, the proposed method does not recognize the touching behavior until there is no longer any stimulation at the touch detector.

Table 2. Patterns extracted from the sampling data (Ave. and SD denote the average and standard deviation of the number of sampling data per pattern)

          Stroking                  Tickling                  Hitting
          Patterns   Ave.     SD    Patterns   Ave.     SD    Patterns   Ave.    SD
User-1       56       9.9    3.4      138       4.8    5.3       76       1.6   0.7
User-2       35      14.5    8.3       81       6.8    7.1       64       2.0   1.1
User-3        2     497.0  488.0       78      12.3   15.5       54       1.7   0.8
User-4       83       5.6    2.2      147       5.8    5.5       26       1.2   0.4
User-5       10      92.5  222.6        6     179.3  296.6       45       3.9   2.0
User-6       62       9.5    3.3       89       8.2    3.3       44      10.3   8.4
Total       248        -      -       539        -      -       309        -     -
In the next step, recognition was performed by generating the input data from the average output values of each sensor element; therefore, the number of input data equals the number of patterns. We conducted two experiments using the same input data. Both were run five times, each time with arbitrarily changed initial weights of the recognizer. All of the input data were used as learning data because of the difference in pattern counts among the three classes (stroking, tickling, and hitting). In the first experiment, a recognizer was prepared and evaluated for each user; in the second experiment, only one recognizer was prepared and evaluated for all users. In this paper, the former is defined as 'toucher dependent' and the latter as 'toucher independent'.

Table 3. Results of the toucher dependent

           RMSE     Recognition Rate (%)   Stroking (%)   Tickling (%)   Hitting (%)
User-1     0.229          91.2                 97.5           97.8          81.8
User-2     0.122          94.0                100.0           94.6         100.0
User-3     0.187          94.3                100.0           94.6          92.2
User-4     0.177          94.4                 89.6           97.6          92.3
User-5     0.004         100.0                100.0          100.0         100.0
User-6     0.244          88.7                 96.8           92.6          70.9
Total        -            93.8                 95.7           96.2          89.5
* RMSE: root mean square error
Fig. 6. Touching behavior histogram of user-6

Table 4. Results of the toucher independent

RS      RMSE     Recognition Rate (%)   Stroking (%)   Tickling (%)   Hitting (%)
0       0.323          79.6                 71.8           86.3          74.1
1       0.326          79.5                 64.5           90.2          72.8
2       0.321          80.1                 71.0           86.5          76.4
3       0.321          81.4                 71.8           86.5          80.3
4       0.332          78.2                 61.7           83.5          82.2
Total     -            79.8                 68.2           86.6          77.2
* RS: random seed
Table 3 shows the test results of the toucher dependent case. In the table, each row shows the average result of the five runs conducted with different initial weights for that user. For the six users, the average recognition rate was 93.8%. Regarding user-6,
who produced the lowest rate, the 'hitting' patterns were generated with a shape and strength similar to those of 'stroking' and 'tickling', as shown in the dotted circle in Fig. 6. This figure is a histogram of the sums of the components of sampling data selected from the data generated by the touch detector for user-6. In the toucher independent test, the average recognition rate was 79.8%, as shown in Table 4. This rate is lower than that of the toucher dependent case because the strength of the touching behavior varies from user to user; in addition, the five FSR sensor elements of the touch detector were not sufficient to classify behaviors by the distribution of force.
5 Conclusion
This paper presented a touching behavior recognition method that allows a robot to generate emotions according to the touching behavior of a human. The method comprises a pre-process phase and a recognition phase. In the pre-process phase, the data generated by the touch detector are converted into data which can be processed by the recognizer; in the recognition phase, the converted input data are classified into human touching behaviors by the MLP recognizer. The magnitude and distribution of force at the touch detector were defined as the characteristics for classifying touching behaviors. In this study, the three touching behaviors of 'stroking', 'tickling', and 'hitting' were tested as an initial study of touching behavior recognition. The test results showed an average recognition rate of 93.8% for the toucher dependent method, in which a different recognizer was generated for each user, and 79.8% for the toucher independent method, in which the same recognizer was used for all users. The difference of 14% arises from the differences in the magnitude of the touching force of each user. On the basis of the test results, further studies on emotional robots should consider three factors. Firstly, if the pattern is defined from the sampling data by the existence of stimulation, as in the suggested method, and the contact time is long, then the robot's emotion cannot be generated because the touching is not recognized while the stimulation continues; the suggested method should be extended to solve this problem. Secondly, since the magnitude of the touching force differs by user, characteristics that are robust in the toucher independent case should be investigated further. Thirdly, the feasibility of the suggested method should be further validated with additional touching behaviors such as 'touching' and 'grasping'.
Acknowledgement This work was supported in part by MIC & IITA through IT Leading R&D Support Project.
References 1. Kazuyoshi, W., Takanori, S., Tomoko, S., Kazuo, T.: Effects of Three Months of RobotAssisted Activity on Depression among Elderly People Residing at a Health Service Facility for the Aged. SICE Annual Conf (2004) 2709-2714 2. Patrizia, M., Leonardo, G., Alessandro, P., Alessia, R.: Experiencing the Flow: Design Issues in Human-Robot Interaction. Joint SOC-EUSAI Conf (2005) 69-74 3. Kim J.H., Lee J.I., Lee H.J., Park Y.K, Kim M.S., Gang D.I.: Development and Application of Touch Sensors. Journal of Korea Precision Engineering Society 21 (00) (2004) 4. Kim, Y.K., Kim, K.N., Lee, K.Y., Lee, D.S., Kim, W. H., Cho, N.K., Park, K.B., Park, H.D., Ju, B.K., Kim, S.W.: CMOS Based Silicon Tactile Sensor Array for Finger-Mounted Applications. International Conference on Ubiquitous Robots and Ambient Intelligence (2006) 389-392 5. Walter, D.S., Jeff, L., Cynthia, B., Louis, B., Levi, L., Micheal W.: The Design of the Huggable: A Therapeutic Robotic Companion for Relational, Affective Touch. AAAI Fall Symposium on Caring Machines: AI in Eldercare (2006) 6. Walter, D.S., Cynthia, B.: Affective Touch for Robotic Companions. ACII 2005, LNCS 3748 (2005) 747-754 7. Hiroyasu, I., Shigeki, S.: Human-Robot-Contact-State Identification Based on Tactile Recognition. IEEE Transactions on Industrial Electronics 52 (6) (2005) 1468-1477 8. Interlink Electronics: FSR Force Sensing Resister Integration Guide and Evaluation Parts Catalog. http://www.interlinkelectronics.com
Kernel Fisher NPE for Face Recognition Guoqiang Wang1, Zongying Ou1, Fan Ou1, Dianting Liu1, and Feng Han2 1
Key Laboratory for Precision and Non-traditional Machining Technology of Ministry of Education, Dalian University of Technology, Dalian 116024, P. R. China
[email protected],
[email protected], {york_ou,diantingliu}@yahoo.com.cn 2 Changchun Railway Vehicles Limited Company, Changchun 130062, China
[email protected]
Abstract. Neighborhood Preserving Embedding (NPE) is a subspace learning algorithm. Since NPE is a linear approximation to the Locally Linear Embedding (LLE) algorithm, it has good neighborhood-preserving properties. Although NPE has been applied in many fields, it has limitations in solving recognition tasks. In this paper, a novel subspace method, named Kernel Fisher Neighborhood Preserving Embedding (KFNPE), is proposed. In this method, discriminant information as well as the intrinsic geometric relations of the local neighborhoods is preserved according to prior class-label information. Moreover, complex nonlinear variations of real face images are represented by nonlinear kernel mapping. Experimental results on the ORL face database demonstrate the effectiveness of the proposed method.
1 Introduction
Face recognition has been studied extensively over the past decade due to the recent emergence of applications such as security access control, visual surveillance, public security, and advanced human-to-computer interaction [1], and much progress has been made in the past few years [2]. The appearance-based method, one of the most successful techniques for face recognition, is still developing. When using appearance-based methods, a two-dimensional image of size w × h pixels is represented by a vector in a w × h-dimensional space. This space is called the sample space or the image space. However, its dimension is typically too high to allow robust and fast face recognition. A natural attempt to resolve this problem is to use dimensionality reduction techniques. Two of the most important techniques for dimensionality reduction are Principal Component Analysis (PCA) [3], [4] and Linear Discriminant Analysis (LDA) [5]. Recently, many research efforts have shown that face images possibly reside on a nonlinear submanifold. However, both PCA and LDA consider only the Euclidean structure; they fail to discover the underlying structure if the face images lie on a nonlinear submanifold hidden in the image space. Several nonlinear techniques were proposed to discover the nonlinear structure of the manifold, such as LLE [6], Isometric feature mapping [7] and Laplacian Eigenmaps [8]. They all utilized local
neighborhood relations to learn the global structure of nonlinear manifolds. However, they cannot deal with the out-of-sample problem directly: they yield maps that are defined only on the training data points, and how to naturally evaluate the maps at novel test points remains unclear. Therefore, these nonlinear manifold learning techniques might not be suitable for the face recognition task. In order to overcome this drawback, He et al. proposed a linear subspace method named Neighborhood Preserving Embedding (NPE) [9], which aims at preserving the local manifold structure. Although NPE is successful in many domains, it deemphasizes discriminant information, which matters for pattern recognition tasks. Moreover, NPE often fails to deliver good performance when face images are subject to complex nonlinear variations such as expression, lighting and pose, for it is a linear algorithm in nature. LDA is a well-known supervised learning algorithm; it encodes discriminating information by minimizing the within-class scatter and maximizing the between-class scatter in the projective subspace. Kernel-based techniques can generate nonlinear maps and have been used successfully in support vector machines. In this paper, based on the idea of LDA and the kernel trick [10], we propose a novel subspace analysis approach named Kernel Fisher Neighborhood Preserving Embedding (KFNPE) to further improve the face recognition performance of NPE.
2 Outline of NPE
NPE is a linear approximation to LLE [6]. Given a set of points {x_1, x_2, …, x_N} in R^D, the linear transformation A can be obtained by minimizing the objective function [9]

min_A Σ_{i=1}^{N} || y_i − Σ_{j=1}^{K} W_ij y_j ||²    (1)

where y_i = A^T x_i. Since LLE seeks to preserve the intrinsic geometric properties of the local neighborhoods, it assumes that the same weights which reconstruct the point x_i by its neighbors in the high dimensional space will also reconstruct its image y_i by its corresponding neighbors in the low dimensional space. Then, the weights can be computed by minimizing the following objective function:

min Σ_{i=1}^{N} || x_i − Σ_{j=1}^{K} W_ij x_j ||²   with constraint   Σ_{j=1}^{K} W_ij = 1.    (2)
For more details of NPE and the weight matrix, please see [9]. The minimization problem can be converted to solving a generalized eigenvalue problem as follows:
X M X^T A = λ X X^T A    (3)
where M = (I − W)^T (I − W) and I = diag(1, …, 1). It is easy to check that M is symmetric and positive semi-definite.
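To make the procedure of equations (1)-(3) concrete, a small sketch of linear NPE is given below. The neighbour search, the regularization terms and the use of scipy's symmetric generalized eigensolver are assumptions of this illustration rather than details given in [9].

```python
import numpy as np
from scipy.linalg import eigh

# Sketch of linear NPE: reconstruction weights W from K nearest neighbours
# (Eq. 2), then the generalized eigenproblem X M X^T a = lambda X X^T a
# (Eq. 3). Regularization terms are assumptions added for numerical safety.

def npe(X, K, d):
    """X: D x N data matrix; K: number of neighbours; d: embedding dimension."""
    D, N = X.shape
    W = np.zeros((N, N))
    for i in range(N):
        dist = np.linalg.norm(X - X[:, [i]], axis=0)
        nbrs = np.argsort(dist)[1:K + 1]            # K nearest neighbours of x_i
        Z = X[:, nbrs] - X[:, [i]]                  # centred neighbours
        G = Z.T @ Z + 1e-6 * np.eye(K)              # local Gram matrix
        w = np.linalg.solve(G, np.ones(K))
        W[i, nbrs] = w / w.sum()                    # enforce sum_j W_ij = 1
    M = (np.eye(N) - W).T @ (np.eye(N) - W)
    vals, vecs = eigh(X @ M @ X.T, X @ X.T + 1e-6 * np.eye(D))
    return vecs[:, :d]                              # smallest eigenvectors span A
```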
3 Kernel Fisher NPE Algorithm
NPE is a linear method in nature, and it is inadequate to represent the nonlinear face space. Moreover, NPE deemphasizes discriminant information, while discriminant information is important for the face recognition task. In this paper, a novel subspace method named Kernel Fisher NPE (KFNPE) is proposed. Firstly, a nonlinear function Φ is used to map the data into a high-dimensional feature space F: Φ(X) = [Φ(x_1), …, Φ(x_N)]. Then, in the feature space F, a projecting transformation A_Φ is sought that preserves the discriminant information together with the local geometric structure by minimizing the intrapersonal variance and maximizing the interpersonal variance. The objective function of KFNPE is defined as

min_{A_Φ}  [ Σ_{c=1}^{C} Σ_{i=1}^{N_c} || y_i^c − Σ_{j=1}^{N_c} W_ij^c y_j^c ||² ] / [ Σ_{i=1}^{C} || m_i^Φ − Σ_{j=1}^{C} B_ij m_j^Φ ||² ],    (4)

where C is the number of face classes, N_c is the number of samples in the c-th class, y_i^c = A_Φ^T Φ(x_i^c) is the projection of Φ(x_i^c) onto A_Φ, Φ(x_i^c) is the nonlinear mapping of the i-th sample in the c-th class, m_i^Φ is the mean vector of the mapped training samples of the i-th class, and W_ij^c and B_ij are weight matrices.
Wijc and Bij are weight matrix.
By simple algebra formulation, the numerator of the objective function can be reduced to C
⎡ Nc ⎤ ⎢ Wijc ( yic − y cj )⎥ ⎥ i =1 ⎢ ⎣ j =1 ⎦ Nc
2
∑∑ ∑ c=1
Nc ⎡ T ⎤ c c T c = ∑∑ ⎢ AΦ Φ ( x i ) − ∑ Wij AΦ Φ ( x j ) ⎥ c =1 i =1 ⎣ j =1 ⎦ Nc
C
C
=
∑ AΦT Φ( X c )( I − Wc )T ( I − Wc )Φ( X c )T AΦ c =1 C
=
2
∑A c =1
T Φ
Φ ( X c ) M c Φ( X c ) T AΦ
= AΦ Φ ( X ) MΦ ( X ) T
T
AΦ
(5)
where Φ(X) = [Φ(X_1), Φ(X_2), …, Φ(X_C)], Φ(X_c) = [Φ(x_1^c), Φ(x_2^c), …, Φ(x_{N_c}^c)], M_c = (I − W_c)^T (I − W_c), and M = diag(M_1, M_2, …, M_C). The weight matrix W_c can be computed by minimizing the following objective function:

min Σ_{i=1}^{N_c} || Φ(x_i^c) − Σ_{j=1}^{N_c} W_ij^c Φ(x_j^c) ||²   with constraint   Σ_{j=1}^{N_c} W_ij^c = 1.
The denominator of the objective function can be simplified as

Σ_{i=1}^{C} || m_i^Φ − Σ_{j=1}^{C} B_ij m_j^Φ ||²
= A_Φ^T [ Σ_{i=1}^{C} || Φ(m_i) − Σ_{j=1}^{C} B_ij Φ(m_j) ||² ] A_Φ
= A_Φ^T Γ (I − B)^T (I − B) Γ^T A_Φ
= A_Φ^T Γ G Γ^T A_Φ,    (6)

where Γ = [Φ(m_1), Φ(m_2), …, Φ(m_C)], Φ(m_i) is the mean of the i-th class in the feature space F, i.e., Φ(m_i) = (1/N_i) Σ_{k=1}^{N_i} Φ(x_k^i), and G = (I − B)^T (I − B).
The weight matrix B can be computed by minimizing the following objective function:

min Σ_{i=1}^{C} || Φ(m_i) − Σ_{j=1}^{C} B_ij Φ(m_j) ||²   with constraint   Σ_{j=1}^{C} B_ij = 1.
Because the linear transformation A_Φ should lie in the span of Φ(x_1), …, Φ(x_N), there exists a coefficient vector α = [α_1, …, α_N]^T such that

A_Φ = Σ_{i=1}^{N} α_i Φ(x_i) = Φ(X) α.    (7)
Then, substituting Eq. (5), Eq. (6) and Eq. (7) into the objective function, the KFNPE subspace is spanned by a set of vectors α satisfying

α = arg min_α  (α^T K M K α) / (α^T K_XM G K_XM^T α),    (8)

where K is the kernel matrix with elements K_ij = Φ(x_i) ⋅ Φ(x_j), i, j = 1, …, N, and K_XM is the Gram matrix formed between the training samples X and the class means. The transformation space can be obtained in a way similar to the LDA algorithm. In our approach, a two-phase algorithm is implemented: KPCA is employed first to
remove most noise. Next, NPE algorithm based on Fisher criterion is implemented on KPCA transformed space. Then, M and G can be computed on this space without explicit nonlinear function Φ .
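Once the matrices K, M, K_XM and G of equation (8) have been formed, the vectors α can be obtained from a generalized eigenproblem, as sketched below. The polynomial-kernel helper and the regularization of the denominator matrix are assumptions of this sketch, not the authors' code.

```python
import numpy as np
from scipy.linalg import eigh

# Sketch of Eq. (8): the ratio is minimized by the generalized eigenvectors
# of (K M K, K_XM G K_XM^T) with the smallest eigenvalues. The kernel helper
# and the regularization term are assumptions for illustration.

def poly_kernel(X, Y, a=1.0, d=2):
    """k_P(x, y) = (a * x.y)^d for the columns of X (D x N1) and Y (D x N2)."""
    return (a * (X.T @ Y)) ** d

def kfnpe_alphas(K, M, K_XM, G, n_dirs):
    A_num = K @ M @ K                                      # numerator of Eq. (8)
    A_num = 0.5 * (A_num + A_num.T)                        # enforce symmetry
    B_den = K_XM @ G @ K_XM.T + 1e-6 * np.eye(K.shape[0])  # regularized denominator
    vals, vecs = eigh(A_num, B_den)
    return vecs[:, :n_dirs]                                # alphas spanning the subspace
```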
4 Experimental Results
An experiment was carried out on the ORL face database [11] to verify the effectiveness of the proposed method. The database was built at the Olivetti Research Laboratory in Cambridge, UK. It consists of a total of 400 face images, 10 for each of 40 distinct subjects. For some subjects, the images were captured at different times. The facial expressions (open or closed eyes, smiling or non-smiling) and facial details (glasses or no glasses) were also varied. The images were taken with a tolerance for some tilting and rotation of the face of up to 20 degrees, and there was also some variation in scale of up to about 10%. All images are grayscale and normalized so that the two eyes are aligned at the same position. For computational efficiency, all images were resized to 32 × 32 pixels. Ten sample images of one individual are displayed in Fig. 1.
Fig. 1. Sample face images from the ORL database
In the experiment, for each subject, t ( = 3,4,5) images were randomly selected for training and the rest were used for testing. This procedure was repeated 10 times by randomly choosing different training and testing sets, and for each given t , we averaged the results over 10 random splits. In this paper, polynomial kernel
k_P(x, y) = Φ(x) ⋅ Φ(y) = (a(x ⋅ y))^d was adopted for simplicity and consistency, and the nearest-neighbor classifier was applied for classification. Table 1 contains a comparative analysis of the mean and standard deviation of the obtained recognition rates.

Table 1. Mean and standard deviation on the ORL database (recognition rate (%))

Algorithm      t = 3            t = 4            t = 5
KFNPE       91.54 ± 1.89     94.27 ± 1.44     96.41 ± 1.62
NPE         88.26 ± 2.17     91.90 ± 1.42     94.29 ± 1.33
KLDA        88.67 ± 1.94     93.14 ± 1.53     95.5 ± 1.47
LDA         85.99 ± 2.56     91.63 ± 1.51     93.9 ± 1.56
KPCA        80.11 ± 2.23     84.71 ± 2.59     87.35 ± 2.31
PCA         79.47 ± 1.58     84.23 ± 2.43     86.7 ± 2.0

From Table 1, it can be seen that the KFNPE algorithm outperforms the other methods (PCA, LDA, NPE, KPCA, KLDA). It demonstrates
that the performance is improved because KFNPE preserves the discriminant local structure in subspace and takes into account nonlinear information by kernel trick.
5 Conclusions In this paper, we propose a novel subspace analysis method, named Kernel Fisher Neighborhood Preserving Embedding (KFNPE). KFNPE attempts to preserve discriminant information with intrinsic geometric relations in term of Fisher criterion. Moreover, nonlinear variations of real face images are represented by kernel trick. KFNPE can not only gain a perfect approximation of face manifold, but also enhance local class relations. Experimental results on the ORL face database show that the proposed approach is robust and effective.
Acknowledgement
The authors would like to thank the anonymous reviewers for their helpful comments and suggestions. This work is supported by research funds of the Dalian University of Technology and Shenyang Institute of Automation, Chinese Academy of Sciences, joint research program.
References 1. Zhao, W., Chellappa, R., Rosenfeld, A., Phillips, P. J.: Face Recognition: a Literature Survey. Technical Report CAR-TR-948, University of Maryland, College Park, 2000. 2. Chellapa, R., Wilson, C. L., Sirohey, S.: Human and Machine Recognition of Faces: a Survey. Proc. IEEE 83 (1995) 705-740. 3. Turk, M., Pentland, A. P.: Face Recognition using eigenfaces. IEEE Conf. Computer Vision and Pattern Recognition, 1991. 4. Turk, M., Pentland, A. P.: Eigenfaces for Recognition. J. Cognitive Neuroscience 3 (1991) 71-86. 5. Belhumeur, P. N., Hespanha, J. P., Kriengman, D. J.: Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection. IEEE Trans. Pattern Analysis and Machine Intelligence 19(7) (1997) 711-720. 6. Roweis, S. T., Saul, L. K.: Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science 290 (2000) 2323-2326. 7. Tenenbaum, J., Silva, V. de, Langford J. C.: A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science 290 (2000) 2319-2323. 8. Belkin, M., Niyogi, P.: Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering. Processings of Adavances in Neural Information Processing System 14, Vancouver, Canada, December 2001. 9. He, X. F., Cai, D., Yan, S. C., Zhang, H. J.: Neighborhood Preserving Embedding. Proceedings of Tenth International Conference on Computer Vision, 2005, 2, 1208-1213. 10. Vapnik, V.: The Nature of Statistical Learning Theory. New York: Springer, 1995. 11. The ORL Database of Faces. http://www.uk.research.att.com:pub/data/.
A Parallel RBFNN Classifier Based on S-Transform for Recognition of Power Quality Disturbances Weiming Tong and Xuelei Song Department of Electrical Engineering, Harbin Institute of Technology, Harbin 150001, China
[email protected]
Abstract. This paper proposes a novel parallel RBFNN (Radial Basis Function Neural Network) classifier based on S-transform for recognition and classification of PQ (Power Quality) disturbances. S-transform is used to extract feature vectors, while the constructed parallel RBFNN classifier is used to recognize and classify PQ disturbances according to the extracted feature vectors. The parallel RBFNN classifier consists of eight sub-networks, each of which is only able to recognize one type of disturbance. In order to improve the convergence performance of RBFNN and optimize the number of hidden layer nodes, a dynamic clustering algorithm which clusters all training samples to determine the number of hidden layer nodes is proposed. Simulation and test results demonstrate that the method proposed to recognize and classify PQ disturbances is correct and feasible, and that the RBFNN classifier based on the dynamic clustering algorithm has a faster convergence speed and a higher correct identification rate.
1 Introduction
PQ (Power Quality) has recently become a major concern to both electricity suppliers and electricity customers. One reason is that PQ has been heavily disturbed by the increasing number of polluting loads (such as non-linear loads, time-variant loads, fluctuating loads, unbalanced loads, etc.) [1][2][3]; the other is that intelligent electrical devices have put forward more rigorous requirements for PQ. Therefore, PQ urgently needs to be monitored and improved. The key problem, however, is how to extract feature vectors automatically and classify PQ disturbances accurately from massive PQ disturbance data. Generally, there are two classes of PQ disturbances: stationary disturbances and non-stationary disturbances. Recognition and classification of PQ disturbances is a complex problem. Carried out manually, the traditional method is costly and inefficient and cannot implement automatic recognition and classification. Hence, some methods based on artificial intelligence for automatic recognition and classification, such as ANN (Artificial Neural Network) [4], ES (Expert System) [5], FC (Fuzzy Classification) [6], and HMM (Hidden Markov Model) [7], have recently been widely studied. The disadvantages of the ES-based method are mainly that the knowledge of PQ is very difficult to extract and that a combinatorial explosion
may occur as the number of PQ disturbance types increases. Fuzzy technology, which implements recognition through simple "IF-THEN" statements, has a high correct identification rate, but for many PQ disturbances, such as harmonic distortions and voltage transients, it is difficult to establish knowledge rules in "IF-THEN" form. ANN, with its simple structure and problem-solving ability, is an excellent tool for pattern recognition and has been successfully applied to the recognition and classification of PQ disturbances. However, feeding samples directly into an ANN results in a high input dimension, a huge network structure, a heavy training burden, and difficult convergence. Therefore, it is necessary to preprocess the samples and extract feature vectors, which are then input into the ANN. The universal methods for extracting feature vectors include the Fourier transform, the wavelet transform, the dq transform, etc. The Fourier transform is a traditional signal analysis tool, but it suffers from problems such as spectrum aliasing, spectrum leakage and the grille effect, and it is not suitable for analyzing non-stationary signals because it is a global transform, performed entirely in the time domain or entirely in the frequency domain. The wavelet transform is a popular time-frequency analysis tool that is very suitable for analyzing non-stationary signals because of its ability of local time-frequency analysis, which means higher frequency resolution in low-frequency bands and higher time resolution in high-frequency bands. The method based on wavelets and ANN has been applied to the recognition and classification of PQ disturbances [8][9][10] and has achieved good results. However, for several PQ disturbances, such as harmonic distortion, voltage sags and voltage swells, the detection results of the wavelet transform are worse. In addition, the wavelet transform also has disadvantages, including a large calculation quantity and a slow calculation speed. The S-transform, as a generalization of the CWT (Continuous Wavelet Transform) and the STFT (Short Time Fourier Transform) [11][12], is another time-frequency analysis tool. It introduces a Gaussian window function, whose width varies inversely with frequency, as the kernel function. It has excellent time-frequency analysis performance and can utilize the FFT (Fast Fourier Transform) algorithm, whose calculation quantity is less than that of the Mallat wavelet algorithm. Hence the S-transform is very suitable for detecting PQ disturbances and extracting feature vectors. In addition, RBFNN (Radial Basis Function Neural Network) has been widely applied in many fields because of its excellent performance [13]. Therefore, this paper proposes a novel parallel RBFNN classifier based on S-transform for recognition and classification of PQ disturbances. The S-transform is mainly used to extract feature vectors, and the parallel RBFNN classifier is mainly used to recognize and classify PQ disturbances. The feature vector is composed of the modulus extrema of the S-transform coefficients.
2 Topology Structure of the Proposed Method

The topology structure of the proposed method, based on the S-transform and the parallel RBFNN classifier, for recognition and classification of PQ disturbances is shown in Fig. 1.
Fig. 1. Topology structure of the proposed method (off-line module: preprocessing, feature extraction through the S-transform, and training of the parallel RBFNN classifier; on-line module: preprocessing, feature extraction, recognition of PQ disturbances, output, and relearning from wrongly classified test samples)
It can be seen from Fig. 1 that the topology consists of two submodules: an off-line module and an on-line module. The off-line module is mainly used to train the parallel RBFNN classifier, while the on-line module is mainly used to test it. The principle of the proposed method is as follows (a code sketch of this flow is given after this list):

(1) Generate the training samples and the test samples of PQ disturbances with EMTP (Electromagnetic Transient Program) software.
(2) Denoise and decouple the training samples in the preprocessing unit after inputting them into the off-line module.
(3) Extract the feature vectors of the preprocessed training samples through the S-transform.
(4) Train the constructed parallel RBFNN classifier with the extracted feature vectors of the training samples.
(5) Denoise and decouple the test samples in the preprocessing unit after inputting them into the on-line module.
(6) Extract the feature vectors of the preprocessed test samples through the S-transform.
(7) According to the feature vectors of the test samples, recognize and classify the type of PQ disturbance with the trained parallel RBFNN classifier.
(8) Output the result of recognition and classification if it is correct; otherwise return to step (2) and retrain the parallel RBFNN classifier with the wrongly classified test samples, so as to improve the correct identification rate and the anti-interference performance of the classifier.
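To make the off-line/on-line flow concrete, the following Python sketch mirrors the eight steps above. It is only an illustration of the control flow: preprocess, s_transform_features and the classifier object are placeholders for the operations described in this paper, not routines the paper defines.

def train_offline(training_samples, classifier):
    # steps (2)-(4): preprocess, extract S-transform features, train the classifier
    for sample in training_samples:
        x = preprocess(sample.waveform)            # denoise and decouple (placeholder)
        v = s_transform_features(x)                # modulus extrema of S-transform coefficients (placeholder)
        classifier.train(v, sample.label)

def test_online(test_samples, classifier):
    # steps (5)-(8): preprocess, extract features, classify, collect wrong results
    wrongly_classified = []
    for sample in test_samples:
        v = s_transform_features(preprocess(sample.waveform))
        if classifier.classify(v) != sample.label:
            wrongly_classified.append(sample)
    if wrongly_classified:                         # relearning loop of step (8)
        train_offline(wrongly_classified, classifier)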
3 S-Transform

The S-transform, put forward by Stockwell in 1996 [11], is another time-frequency analysis tool; it can be regarded as a generalization combining the CWT and the STFT. The S-transform S(τ, f) of a signal x(t) is defined as

    S(τ, f) = ∫_{−∞}^{∞} x(t) g_f(τ − t) exp(−j2πft) dt                                   (1)

    g_f(τ − t) = (|f| / √(2π)) exp[ −f²(τ − t)² / 2 ]                                      (2)

where g_f(τ − t) is the Gaussian window function, τ is the shift parameter, which adjusts the position of the Gaussian window along the time axis, and f is the scale parameter. The S-transform is invertible, and its inverse is

    x(t) = ∫_{−∞}^{∞} [ ∫_{−∞}^{∞} S(τ, f) dτ ] exp(j2πft) df                              (3)

Define X(f) as the Fourier transform of x(t). Then the relationship between the S-transform and the Fourier transform is

    X(f) = ∫_{−∞}^{∞} S(τ, f) dτ                                                           (4)

and the S-transform can be written as a function of X(f):

    S(τ, f) = ∫_{−∞}^{∞} X(v + f) exp( −2π²v² / f² ) exp(j2πvτ) dv,   f ≠ 0                (5)

Thus the S-transform can be implemented through the FFT (Fast Fourier Transform) algorithm. Converting equation (5) to discrete form gives

    S[m, n] = ∑_{k=0}^{N−1} X[n + k] exp( −2π²k² / n² ) exp( j2πkm / N ),   n ≠ 0
    S[m, n] = (1/N) ∑_{k=0}^{N−1} X[k],                                      n = 0         (6)

    X[k] = (1/N) ∑_{p=0}^{N−1} x[p] exp( −j2πkp / N )                                      (7)

For a sample sequence x[p] (p = 0, 1, 2, ..., N − 1), the S-transform can therefore be computed from equations (6) and (7). The result is a two-dimensional matrix whose elements are the S-transform amplitudes; its rows correspond to the frequency points and its columns to the time points. Compared with the STFT, the advantage of the S-transform is that the width and height of the time-frequency window vary with frequency. Like the wavelet transform, the S-transform is suitable for analyzing non-stationary signals because of its good local time-frequency analysis, i.e., higher frequency resolution in low frequency bands and higher time resolution in high frequency bands. Eight kinds of disturbances are usually of most concern, namely harmonic distortions, flickers, voltage sags, voltage swells, voltage interruptions, voltage notches, voltage impulses and voltage transients. The S-transform coefficients of these disturbance types differ from one another, so each kind of PQ disturbance can be discriminated according to its S-transform coefficients. In the proposed method, the modulus extrema of the S-transform coefficients are taken as the feature vectors from which PQ disturbances are recognized and classified by the parallel RBFNN classifier.
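As a minimal illustration of how equations (6) and (7) can be evaluated with the FFT, the following Python/NumPy sketch computes the discrete S-transform of a sampled signal. It is not code from the paper; in particular, the n = 0 row is filled with the signal mean, following the common convention, and the feature-selection step (taking the modulus extrema) is not shown.

import numpy as np

def stockwell_transform(x):
    """Discrete S-transform of x[0..N-1] following Eqs. (6)-(7).
    Returns an N x N complex matrix: row n is the frequency index, column m the time index."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    X = np.fft.fft(x) / N                      # Eq. (7): X[k] = (1/N) sum_p x[p] exp(-j 2*pi*k*p / N)
    S = np.zeros((N, N), dtype=complex)
    k = np.arange(N)
    for n in range(1, N):                      # Eq. (6), n != 0
        gauss = np.exp(-2.0 * np.pi ** 2 * k ** 2 / n ** 2)   # frequency-domain Gaussian window
        shifted = X[(n + k) % N]                               # X[n + k] with circular indexing
        # sum_k X[n+k] * gauss[k] * exp(j 2*pi*k*m / N)  ==  N * ifft(shifted * gauss)
        S[n, :] = N * np.fft.ifft(shifted * gauss)
    S[0, :] = np.mean(x)                       # n = 0 row set to the signal mean (assumed convention)
    return S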
4 Construction of the Parallel RBFNN Classifier

4.1 The Structure of the Parallel RBFNN Classifier
There are eight kinds of common PQ disturbances: harmonic distortions, flickers, voltage sags, voltage swells, voltage interruptions, voltage notches, voltage impulses and voltage transients. In order to identify these disturbances, the RBFNN classifier adopts the parallel network structure shown in Fig. 2. The input of the classifier is the feature vector X of the PQ disturbance, extracted through the S-transform, and its output is the classification result vector Y. Consisting of eight sub-networks, the parallel RBFNN classifier can identify the eight common PQ disturbances; each sub-network is an RBFNN that identifies a single disturbance type. All items of the classification result vector Y are binary (0 or 1). The relationship between the classification result vector Y and the standard PQ disturbance types is shown in Table 1. It is worth noting that the training of the sub-networks does not affect one another; furthermore, if a new disturbance type needs to be identified, it suffices to add a corresponding new sub-network, which does not affect the performance of the other sub-networks.

Fig. 2. The structure of the parallel RBFNN classifier (eight RBFNN sub-networks map the feature vector X = (x1, ..., xn) to the outputs y1, ..., y8 that form the classification result vector Y)
Table 1. Relationship between the classification result vector Y and PQ disturbance types

Type of PQ disturbance   y1  y2  y3  y4  y5  y6  y7  y8
Harmonic                 1   0   0   0   0   0   0   0
Flicker                  0   1   0   0   0   0   0   0
Sag                      0   0   1   0   0   0   0   0
Swell                    0   0   0   1   0   0   0   0
Interruption             0   0   0   0   1   0   0   0
Notch                    0   0   0   0   0   1   0   0
Impulse                  0   0   0   0   0   0   1   0
Transient                0   0   0   0   0   0   0   1
4.2 The Mathematical Model of the RBF Neuron

Like the BP network, the RBFNN is a feed-forward network composed of three layers: an input layer, a hidden layer and an output layer, each containing many neurons. The RBFNN generally performs better than the BP network: it can be trained efficiently, does not get stuck in local optima, and its training time is fairly short. By contrast, a BP network with sigmoid activation functions requires a long training time, and its convergence to the minimum-error solution cannot be guaranteed. The Gaussian function is chosen as the activation function of the RBFNN:

    f(x) = exp[ −‖X − c_i‖² / (2σ_i²) ],   i = 1, 2, ..., m                                (8)

where X = [x1, x2, ..., xn]^T denotes the input vector, c_i denotes the center of the ith radial basis function, σ_i denotes the width of the ith radial basis function, and m denotes the number of neurons. The output of the RBFNN can be written as

    y(n) = ∑_{i=1}^{M} p_i(n) θ_i + e(n)                                                   (9)

where y(n) denotes the expected output, θ_i denote parameters, p_i(n) denote regressors, and e(n) denotes the residual.

4.3 Dynamic Clustering Algorithm
Selecting efficient centers for the radial basis functions is very important for the performance of the RBFNN. Several methods are commonly used to determine the centers, such as random selection, the Kohonen self-organizing mapping algorithm, the HCM algorithm, the K-NN algorithm, the fuzzy C-means algorithm and the nearest-neighbor clustering algorithm. On the basis of these methods, this paper proposes a dynamic clustering algorithm to determine the radial basis functions of the RBFNN, which accelerates learning. The learning procedure of the dynamic clustering algorithm is as follows:

(1) Initialization: set a proper radius r and an initial clustering number k = 1, and take the first input sample as the initial value of the clustering barycenter z_1.
(2) Calculate the distance r_21 between the second input sample and the first clustering barycenter. If r_21 > r, then the clustering number becomes k = 2 and the second input sample becomes the second clustering barycenter; otherwise the second input sample belongs to the first class, whose barycenter becomes z_1 = (1/2) ∑_{i=1}^{2} x_i, where x_i denotes the ith input sample.
(3) Calculate the distance r_ij between the ith (i = 3, 4, ..., N) input sample and each clustering barycenter. If r_ij > r for every existing cluster, then the clustering number becomes k = k + 1 and the ith input sample becomes a new clustering barycenter; otherwise the ith input sample belongs to the jth class, whose barycenter becomes z_j = (1/m) ∑_{i=1}^{m} x_i, where x_i denotes the ith input sample of the jth class and m denotes the number of input samples in that class.
(4) Calculate D_T(S). If D_T′(S) − D_T(S) < ε, go to step (5); otherwise go back to step (1).
(5) Stop. The final k is the clustering number and the final z_j are the corresponding clustering barycenters.

Through the above dynamic clustering algorithm, the number of hidden layer nodes of the RBFNN is determined: it is simply the final clustering number of the input samples. According to the input samples of each class, the output of the hidden layer nodes is obtained through the Gaussian function

    u_j = exp[ −‖X − z_j‖² / (2σ²) ]                                                       (10)

where σ is the maximal Euclidean distance among all the clustering barycenters. Since it is not necessary to specify the clustering number in advance, the algorithm avoids a large number of iterative calculations and accelerates the learning procedure.
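The following Python/NumPy sketch, written under the assumption of Euclidean distances and a single pass over the samples, illustrates steps (1)-(3) and the hidden-layer output of equation (10); the D_T(S)-based stopping test of step (4) is omitted for brevity, so this is an illustration rather than the paper's exact procedure.

import numpy as np

def dynamic_clustering(samples, r):
    """Single-pass sketch of the dynamic clustering used to place RBF centres (Sect. 4.3).
    samples: (N, d) array; r: distance threshold. Returns the list of cluster barycentres."""
    samples = np.asarray(samples, dtype=float)
    centres = [samples[0].copy()]              # step (1): first sample starts the first cluster
    members = [[samples[0]]]
    for x in samples[1:]:                      # steps (2)-(3)
        dists = [np.linalg.norm(x - c) for c in centres]
        j = int(np.argmin(dists))
        if dists[j] > r:                       # too far from every barycentre: open a new cluster
            centres.append(x.copy())
            members.append([x])
        else:                                  # assign to cluster j and recompute its barycentre
            members[j].append(x)
            centres[j] = np.mean(members[j], axis=0)
    return centres

def hidden_layer_output(x, centres):
    """Eq. (10): Gaussian response of each hidden node, with sigma taken as the
    maximal distance between barycentres (one shared width)."""
    centres = np.asarray(centres)
    sigma = max(np.linalg.norm(a - b) for a in centres for b in centres) or 1.0
    return np.exp(-np.sum((x - centres) ** 2, axis=1) / (2.0 * sigma ** 2))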
5 Simulation, Test and Analysis

To verify the correctness and feasibility of the proposed method, EMTP (Electromagnetic Transient Program) software is used to create samples of the eight kinds of PQ disturbances, namely harmonic distortions, flickers, voltage sags, voltage swells, voltage interruptions, voltage notches, voltage impulses and voltage transients; 500 samples are generated for each kind, 4000 samples in total. For each kind of PQ disturbance, 350 samples are used to train the parallel RBFNN classifier and the remaining samples are used to test it. To examine the robustness of the proposed method, white noise with a signal-to-noise ratio of 30-50 dB is superimposed on each sample. The training error curve of the RBFNN is shown in Fig. 3, and the test results of the parallel RBFNN classifier are shown in Table 2. It can be seen from Fig. 3 that the RBFNN needs only about 80 iterations to finish training; the parallel RBFNN classifier therefore provides fast convergence and a short training time.
Fig. 3. Training error curve of RBFNN (mean-squared error versus number of training iterations)

Table 2. Test results of the parallel RBFNN classifier

Type of PQ disturbance   Number of test samples   Number of samples correctly identified   Correct identification rate (%)
Harmonic                 150                      143                                      95.3
Flicker                  150                      142                                      94.7
Sag                      150                      146                                      97.3
Swell                    150                      146                                      97.3
Interruption             150                      144                                      96.0
Notch                    150                      139                                      92.7
Impulse                  150                      138                                      92.0
Transient                150                      140                                      93.3
Total                    1200                     1138                                     94.83
As the test results in Table 2 show, the proposed classification method based on the S-transform and the parallel RBFNN classifier achieves a high correct identification rate: the rates for voltage sags and voltage swells exceed 97%, and those for the other PQ disturbances all exceed 92%. Because several PQ disturbances have similar features and are therefore difficult to distinguish, misclassification occasionally occurs; even with these errors taken into account, the average correct identification rate reaches 94.83%.
6 Conclusion

In this paper, a novel method based on the S-transform and a parallel RBFNN classifier is proposed for the recognition and classification of PQ disturbances. The S-transform is used to extract feature vectors of PQ disturbances, and the constructed parallel RBFNN classifier classifies PQ disturbances according to the extracted feature vectors. The simulation and test results indicate that: (1) the proposed method is correct and feasible, and can recognize and classify PQ disturbances effectively, accurately and reliably; (2) the S-transform can extract feature vectors of PQ disturbances effectively and rapidly; (3) the constructed parallel RBFNN classifier has a high correct identification rate and good convergence performance; (4) the dynamic clustering algorithm learns quickly and optimizes the number of hidden layer nodes, which improves network efficiency; (5) the input vector of the parallel RBFNN classifier has a simple form and a low dimension, so each sub-network has a simple structure and a fast implementation. Further research will focus on applying this method to the algorithm design of PQ monitoring devices and on evaluating its performance in the field environment.
References 1. Loredana, C., Alessandro, F., Simona, S.: A Distributed System for Electric Power Quality Measurement. IEEE Trans. on Instrumentation and Measurement 51(4) (2002) 776-781 2. Kezunovic, M., Liao, Y.: A Novel Software Implementation Concept for Power Quality Study. IEEE Trans. on Power Delivery 17(2) (2002) 998-1001 3. Youssef, A.M., Abde-Gali, T.K., El-Saadany, E.F., et al: Disturbances Classification Utilizing Dynamic Time Warping Classifier. IEEE Trans. on Power Delivery 19(1) (2004) 272-278 4. Ghosh, A.K., Lubkeman, D.L.: The Classification of Power System Disturbance Waveforms Using A Neural Network Approach. IEEE Trans. on Power Delivery 10(1) (1995) 109-115 5. Dash, P.K., Mishra, S., Salama, M., et al: Classification of Power System Disturbances Using a Fuzzy Expert System and a Fourier Linear Combiner. IEEE Trans. on Power Delivery 15(2) (2000) 472-477. 6. Chilukuri, M.V., Dash, P.K.: Multiresolution S-transform-based Fuzzy Recognition System for Power Quality Events. IEEE Trans. on Power Delivery 19(1) (2004) 323-330
7. Chung, J., Powers, E.J., Grady, W.M. et al: Power Disturbance Classifier Utilizing a Ruled-based Method and Wavelet Packet-based Hidden Markov Model. IEEE Trans. on Power Delivery 17(1) (2002) 738-743 8. Perunicic, B., Malini, M., Wang, Z., Liu, Y.: Power Quality Disturbance Detection and Classification Using Wavelets and Artificial Neural Networks. In: Proceedings of the 8th ICHQP, Vol. 24. (1998) 77-82 9. Santoso, S., Grady, M.W., Powers, J.E.: Power Quality Disturbance Waveform Recognition Using Wavelet-based Neural Classifier-Part 2: Application. IEEE Trans. on Power Delivery 15(1) (2000) 229-235 10. Borras, D., Castilla, M., Moreno, N.: Wavelet and Neural Structure: A New Tool for Diagnostic of Power System Disturbances. IEEE Trans. on Industry Application 37(1) (2001) 184-190 11. Stockwell, R.G., Mansinha, L., Lowe, R.P.: Localization of the Complex Spectrum: The S Transform. IEEE Trans. on Signal Process 44 (1996) 998-1001 12. Lee, I.W.C., Dash, P.K.: S-transform-based Intelligent System for Classification of Power Quality Disturbance Signals. IEEE Trans. on Power Delivery 18(3) (2003) 800-805 13. Abe, Y., Figuni, Y.: Fast Computation of RBF Coefficients for Regularly Sampled Inputs. IEE Electronics Letters 39(6) (2003) 543-544
Recognition of Car License Plates Using Morphological Features, Color Information and an Enhanced FCM Algorithm

Kwang-Baek Kim1, Choong-shik Park2, and Young Woon Woo3

1 Dept. of Computer Engineering, Silla University, Busan, Korea
[email protected]
2 Dept. of Computer Engineering, Youngdong University, Chungcheongbuk-Do, Korea
[email protected]
3 Dept. of Multimedia Engineering, Dong-Eui University, Busan, Korea
[email protected]
Abstract. Nowadays, because of the rapid increase in the number of cars, it is very hard to regulate violations of traffic lights, speed limits and parking rules, and to manage cars in parking lots. In this paper, we propose an intelligent car license plate recognition system to mitigate these problems. The processing sequence of the proposed algorithm is as follows. First, a license plate segment is extracted from an acquired car image using morphological features and color information, and noise is eliminated from the extracted segment using a line scan algorithm and a grass fire algorithm; individual codes are then extracted from the license plate segment using a 4-directional edge tracking algorithm. Finally, the extracted individual codes are recognized by an enhanced FCM algorithm. The enhanced FCM algorithm improves on conventional clustering algorithms, which can produce undesirable clustering results because of the distribution of patterns in the cluster space. To evaluate the segment extraction and code recognition performance of the proposed method, we used 150 car images in experiments. The results verify that the proposed method is more efficient and that its recognition performance is improved in comparison with conventional car license plate recognition methods.
1 Introduction
The traffic environment has recently faced many problems. Heavy traffic reduces vehicle speeds and seriously threatens safety, and inefficient movement of cars wastes energy and increases exhaust emissions. To resolve these matters efficiently and quickly, many countries are working hard to develop ITS (Intelligent Transport Systems). In one of the ITS research fields, car license plate recognition systems are being developed as a core technique [1].
Many car license plate recognition systems have been developed at home and abroad. Methods using contrast transformation characteristics and methods using the RGB and HSI color spaces have been studied for car license plate segment extraction [2][3][4][5]. Conventional approaches often fail to extract the license plate segment from low-contrast images because they rely only on contrast and color information rather than structural features. In this paper, we propose a method that extracts a car license plate segment using the morphological features of license plates in order to remedy this problem. Noise caused by the fixing pins of the license plate and by degradation of the acquired image must be removed before the codes can be found in the extracted segment, so we use a line scan algorithm and a grass fire algorithm to remove the noise and an enhanced FCM algorithm to recognize the codes extracted from the segment. Conventional FCM algorithms can produce undesirable clustering results because of the distribution of patterns in the cluster space. The enhanced FCM algorithm resolves these problems by using the variation of cluster intervals and cluster centers based on cluster locations, together with symmetry characteristics and fuzzy theory.
2 The Proposed Method to Extract a Car License Plate Segment
In this paper, we propose a method to extract a car license plate segment using morphological features in order to improve the extraction performance on low-contrast car images. The process flow for extracting a car license plate segment using vertical edge information is shown in Fig. 1. Noise features of an object extracted by the 4-directional edge tracking algorithm are as follows.

– The vertical or horizontal length of the vertical edge is longer than a third of the length of the entire image.
– The vertical or horizontal length of the vertical edge is minute.
– The horizontal length of the vertical edge is longer than its vertical length.
– The vertical length of the vertical edge is longer than 1.5 times its horizontal length.

Morphological features for extracting candidate license plate segments are as follows.

– The variation of the vertical distance between two objects must be within 20 percent.
– The centers of candidate objects must lie within 1.8 to 2.25 times the length of a standard object from its center coordinates.
– The top and bottom coordinates of candidate objects must lie within about 25 percent of the top and bottom coordinates of a standard object.

The final candidate segment is then selected from the candidate segments using the color information of the license plate: the green component in the plate region is much stronger than the red or blue component, because the background color of Korean car license plates is green.
Fig. 1. Flow of extracting a car license plate segment
Fig. 2. Process of extracting a license plate segment
Using this color feature, the final license plate segment is extracted from the car image; a small sketch of such a candidate check follows this paragraph. The proposed process for extracting a license plate segment from a car image is shown in Fig. 2.
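As a small illustration of this kind of candidate filtering, the following Python/NumPy sketch checks a candidate region for a plate-like aspect ratio and a dominant green background. It is only an assumption-laden illustration: the aspect-ratio bounds are invented for the example and are not values given in the paper.

import numpy as np

def looks_like_plate(region_rgb, width, height, min_aspect=2.0, max_aspect=6.0):
    """Illustrative filter: keep a candidate region only if its aspect ratio is plate-like
    and its mean green component dominates red and blue (Korean plates have a green
    background). The aspect-ratio bounds are assumptions, not values from the paper."""
    region_rgb = np.asarray(region_rgb, dtype=float)
    aspect = width / float(height)
    r = region_rgb[..., 0].mean()
    g = region_rgb[..., 1].mean()
    b = region_rgb[..., 2].mean()
    return (min_aspect <= aspect <= max_aspect) and g > r and g > b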
3 Extraction of Individual Codes by Edge Tracking
It is difficult to extract individual codes from a license plate segment because of noise caused by the fixing pins of the license plate and by degradation of the acquired image. In this paper we therefore apply a line scan algorithm and a grass fire algorithm
after binarizing the extracted license plate segment, in order to remove noise from it. The extracted segment is binarized using an interval threshold computed from its average contrast value. Horizontal runs longer than a certain length detected by the line scan algorithm are considered noise and removed. The noise removal process using the line scan algorithm is shown in Fig. 3.
Fig. 3. Noise removal using a line scan algorithm
Objects are extracted from the noise-removed segment by a horizontal line scan using a grass fire algorithm, and objects other than code objects are then eliminated using the structural features of individual codes. Individual codes are extracted from the noise-removed segment using the 4-directional edge tracking algorithm. The consonant part and the vowel part of a Korean character are combined into a single code using the morphological features of license plates, and the individual codes are finally obtained. The process of extracting individual codes is shown in Fig. 4, and a flood-fill sketch of the grass fire labelling step is given after Fig. 4.
Fig. 4. Process to extract individual codes
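The grass fire step mentioned above is essentially connected-component labelling by flood fill. The following Python sketch shows one standard way to implement it; it is a generic formulation, not code taken from the paper.

from collections import deque

def grass_fire_labels(binary):
    """Grass-fire (flood-fill) connected-component labelling of a binary image given as
    a list of lists of 0/1 values. Returns a label map of the same size (0 = background)."""
    h, w = len(binary), len(binary[0])
    labels = [[0] * w for _ in range(h)]
    next_label = 0
    for y0 in range(h):
        for x0 in range(w):
            if binary[y0][x0] and not labels[y0][x0]:
                next_label += 1                       # ignite a new fire at an unlabelled pixel
                labels[y0][x0] = next_label
                queue = deque([(y0, x0)])
                while queue:
                    y, x = queue.popleft()
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):   # 4-connectivity
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w and binary[ny][nx] and not labels[ny][nx]:
                            labels[ny][nx] = next_label
                            queue.append((ny, nx))
    return labels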
4 Recognition of a Car License Plate Using an Enhanced FCM Algorithm
In the proposed recognition method, an enhanced FCM algorithm is applied to measure the similarity between individual code patterns and clusters. Conventional FCM algorithms optimize an objective function based on similarity measured by the distances between the input vectors and each cluster center [7]. The extracted individual codes, normalized as input patterns, are shown in Fig. 5.
Fig. 5. Extracted individual codes
Conventional FCM algorithms can therefore produce undesirable clustering results because of the distribution of patterns in the cluster space, since only the distances between the measured patterns and the cluster centers are used [8][9]. For example, problems arise when the patterns form an ellipse, or lie on the edge of a cluster while the basic cluster shapes intersect each other; in such cases the Euclidean distances between the centers of the basic shapes and the patterns can lead to incorrect clustering. In this paper, individual codes are recognized using an enhanced FCM algorithm that exploits the variation of cluster intervals and the cluster centers determined by cluster locations, together with symmetry characteristics and fuzzy theory. The symmetry measure used in the enhanced FCM algorithm is

    Symmetric(x_i, c) = max_{j ∀ pattern, i ≠ j} [ (1 − α)(1 − deg(x_i, x_j, c)/180) − α · ratio_d(x_i, x_j, c) ]     (1)

In equation (1), deg(x_i, x_j, c) is the angle between x_i and x_j with respect to c, ratio_d(x) is given by equation (2), and α is a weight from fuzzy theory:

    ratio_d(x) = d(x_j, c)/d(x_i, c)  if d_i > d_j
    ratio_d(x) = d(x_i, c)/d(x_j, c)  if d_i < d_j                                          (2)

where d(x, c) is the Euclidean distance, and

    α = d(c_i, c_j) / √(D_m)                                                                (3)

The centers of the clusters are computed after defining the value of equation (1) as μ(x). The calculation of the cluster centers is shown in equation (4):

    v^(p) = ∑_k μ(x) x_k / ∑_k μ(x)                                                         (4)

The similarity U between the cluster centers and the current patterns in the enhanced FCM algorithm is calculated by

    U = ∑_{i=0}^{k−1} (x_i − c_i)²                                                          (5)
The learning process of the enhanced FCM algorithm is shown in Fig. 6.

Fig. 6. Learning process of the enhanced FCM algorithm
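For comparison with the enhancement described above, the following Python/NumPy sketch shows one update step of the conventional FCM that the paper improves upon (distance-only memberships and weighted centers). The enhanced algorithm's symmetry weighting of equations (1)-(3) is deliberately not reproduced here, since only its outline is given above.

import numpy as np

def fcm_step(X, V, m=2.0):
    """One update of conventional FCM: memberships from distances, then weighted centres.
    X: (N, d) data matrix; V: (C, d) current cluster centres; m: weighting exponent."""
    d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2) + 1e-12   # (N, C) distances
    inv = d ** (-2.0 / (m - 1.0))
    U = inv / inv.sum(axis=1, keepdims=True)                            # membership matrix
    Um = U ** m
    V_new = (Um.T @ X) / Um.sum(axis=0)[:, None]                        # membership-weighted centres
    return U, V_new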
5 Experiments and Analyses
We used an IBM-compatible personal computer with an Intel Pentium-IV 2 GHz CPU and 256 MB of main memory, and 150 front-side car images (resolution: 640 x 480) for the experiments. A sample image used in the experiments is shown in Fig. 7. Conventional methods using contrast transformation characteristics or HSI color information can fail to extract a license plate segment from images with complex decorations or characters around the license plate, images with a green car body or a green background, or images with low contrast. The experiments verify that the extraction rate of the proposed method is improved in comparison with these conventional methods. Table 1 shows the number of extracted license plates and the number of extracted codes for the proposed method and the conventional methods.
Fig. 7. A sample car image
Table 1. Comparison of extraction results by three methods

                        HSI method    Contrast method    Proposed method
Extraction of plate     132 / 150     130 / 150          147 / 150
Number                  786 / 792     778 / 780          882 / 882
Character               122 / 132     126 / 130          147 / 147
The proposed method outperformed the conventional methods even on low-contrast images, because candidate segments are first selected by the morphological features of a license plate and the color information of the plate is then applied only to those candidates. If there are plate-like areas at the front of a car, they are removed as noise by the color information of the license plate. Nevertheless, there were some failures in extracting the license plate segment, caused by areas that, like a license plate, contain many vertical edges. A sample image for which plate extraction failed is shown in Fig. 8.
Fig. 8. A sample image of extraction failure
In order to evaluate the learning and recognition performance of the enhanced FCM algorithm in the proposed method, we used 100 numbers and 87 characters, out of the 882 numbers and 147 characters extracted from the 150 car images, as learning patterns. The parameters of the enhanced FCM algorithm used for learning are shown in Table 2, where m is the weighting exponent and ε is the parameter for terminating the learning process.

Table 2. Parameters in the enhanced FCM algorithm

Parameter    FCM (Character)   FCM (Number)   Enhanced FCM (Character)   Enhanced FCM (Number)
m            30                30             1000                       3
ε            0.01              0.01           0.01                       0.01
Table 3 shows the learning and recognition results of the enhanced FCM algorithm and of a conventional FCM algorithm.
Table 3. Comparison of individual code recognition results

                          Enhanced FCM                            FCM
                          Number     Character   Total            Number    Character   Total
# of clusters             10         140         150              10        27          37
# of recognized codes     874/876    147/147     1021/1023        864/864   137/144     1001/1008
Recognition rate          99.7%      100%        99.8%            100%      95%         99.3%
Fig. 9. Sample images of recognition failure
As shown in Table 3, the enhanced FCM algorithm is more efficient than the conventional FCM algorithm in recognizing individual codes. There were some recognition failures with the enhanced FCM algorithm because codes were deformed during binarization owing to damaged license plates. Sample images of individual code recognition failures are shown in Fig. 9.
6 Conclusions
We proposed a method for recognizing car license plates, a key technique in intelligent transport systems. To extract the license plate segment, the input car image is first converted to a gray image and vertical edges are detected with a Prewitt mask; the edge image is binarized with a threshold value, objects are extracted by the 4-directional edge tracking algorithm, noisy objects are removed, candidate license plate segments are extracted using morphological features of the license plate shape, and the final segment is selected using color information. To recognize the individual codes in the plate, a line scan algorithm and a grass fire algorithm are first used to remove noise from the extracted segment, individual codes are extracted using the 4-directional edge tracking algorithm, and an enhanced FCM algorithm finally recognizes the extracted codes. In experiments on 150 car images, license plate segments were extracted correctly from about 99% of the images, and the enhanced FCM algorithm achieved about a 99% recognition rate on the individual codes extracted from the plate segments. However, some extraction failures remain because a license plate can be confused with other areas containing many vertical edges. Future work will improve the plate extraction rate through further research on exploiting the horizontal and vertical edges inherent in car license plates.
764
K.-B. Kim, C.-s. Park, and Y.W. Woo
References 1. Hwang, Y.H., Park, J.W., Choi, H.S.: A Study on Recognition of Car License Plate. Proceedings of Korea Signal Processing Society 7(1) (1994) 433-437. 2. Heo, N.S., Cho, H.J., Kim, K.B.: A Study on Car License Plate Extraction Using Variation of Contrast in Gray Images. Proceedings of Korea Multimedia Society (1998) 1353-1356. 3. Kim, K.B., Youn, H.W., Noh, Y.W.: Parking Management System Using Color Information and Fuzzy C-Means Algorithm. Journal of Korea Intelligent Information System Society 8(1) (2002) 87-102. 4. Nam, M.Y., Lee, J.H., Kim, K.B.: Extraction of Car License plate Using Enhanced HSI Color Information. Proceedings of Korea Multimedia Society (1999) 345-349. 5. Lim, E.K., Kim, K.B.: A Study on Recognition of Car License Plate Using Improved Fuzzy ART Algorithm. Journal of Korea Multimedia Society 3(5) (2000) 433-444. 6. Kim, K.B., Kim, C.G., Kim, J.W.: A Study on Recognition of English Name Card Using Edge Tracking Algorithm and Improved ART1. Journal of Korea Intelligent Information System Society 8(2) (2002) 105-116. 7. Arun D.K.: Computer Vision and Fuzzy-Neural Systems. Prentice Hall, 2001. 8. Bezdek, J.: A Convergence Theorem for the Fuzzy ISODATA Clustering Algorithm. IEEE Trans. PAMI, 1980. 9. Kim, K.B., Lee, D.U., Sim, K.B.: Performance Improvement of Fuzzy RBF Networks. LNCS 3610 (2005) 237-244.
Modified ART2A-DWNN for Automatic Digital Modulation Recognition Xuexia Wang, Zhilu Wu, Yaqin Zhao, and Guanghui Ren School of Electronics and Information Technology, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China {wangxuexia,wuzhilu,yaqinzhao,rgh}@hit.edu.cn
Abstract. A modified ART2A-DWNN for automatic digital modulation recognition is proposed in this paper. Daubechies wavelet “db9” is chosen instead of “morlet” wavelet as the mother wavelet in ART2A-DWNN because of its compactness and orthonormality. Simulations have been carried out with the modulated signals corrupted by Gaussian noise to evaluate the performance of the proposed method. Recognition capability, noise immunity and convenience of accommodating new patterns of the modified ART2A-DWNN are simulated and analyzed. The experimental results have indicated the advantages of the modified method. Comparing the performance of the two ART2A-DWNNs, the modified ART2A-DWNN has higher recognition capability than the one with “morlet” wavelet.
1 Introduction

With the number of modulation schemes increasing, automatic digital modulation recognition has become more important in research on software-defined radio (SDR). It can also be used in civil as well as military applications, such as interference identification, spectrum management, and electronic warfare [1]. Artificial neural networks (ANNs) are a widely used method for nonlinear pattern recognition. In a discrete wavelet neural network (DWNN), the nonlinearity is approximated by a superposition of discrete wavelet functions. The structure of a DWNN is similar to that of a BP network, but the activation functions of the units in the hidden layer and the output layer are replaced by wavelet functions and linear summing functions, respectively [2]. Thus, a DWNN benefits from fast training without local minima and achieves a high probability of correct pattern recognition. However, as the number of modulation types increases, the convergence and classification capability of the DWNN deteriorates and it has to be retrained. Adaptive resonance theory 2A (ART2A) is a modified ART2 network, a category learning system that self-organizes a sequence of binary or analog inputs into recognition classes. References [3], [4] give the mechanism of ART2A in detail. A modified ART2A network adopts a Euclidean measurement of similarity and skips the length normalization of inputs in the preprocessing and
adaptation stages. When a new pattern arrives at the ART2A network, it is categorized into a new class without any influence on the old ones. However, the classification capability of ART2A weakens when there are many patterns. To solve these problems, a new type of neural network, ART2A-DWNN, was proposed in reference [5]; it employs an improved unsupervised ART2A network to sort a large number of input patterns into several classes and then uses a three-layer supervised DWNN after each class node in the output layer of ART2A for further classification. It is thus feasible to use ART2A for coarse classification and DWNNs for further recognition. Yet the performance of ART2A-DWNN still needs to be improved, so this paper puts forward a modified ART2A-DWNN with a different wavelet function for automatic modulation recognition.
2 Modified ART2A-DWNN

2.1 Principles and Algorithm of ART2A-DWNN

ART2A-DWNN is composed of an ART2A layer and a DWNN layer, as shown in Fig. 1. Input patterns are first clustered into classes by the ART2A layer; at this stage, coarse classification is carried out so that patterns with similar features are clustered together. The patterns in each class are then fed as inputs to the corresponding DWNN for further classification.

Fig. 1. The structure of ART2A-DWNN (an ART2A layer — input layer F0, recognition layer F1, coding subsystem and orienting subsystem with vigilance ρ — performs winner-take-all clustering of the input patterns, and a DWNN follows each class node for further classification)
The ART2A architecture, depicted below the broken line in Fig. 1, consists of an input layer F0, a recognition layer F1, a coding subsystem and an orienting subsystem that controls the stability-plasticity trade-off. The interconnection weights between F0 and F1 constitute the long-term memory (LTM). The ART2A network follows a winner-take-all competitive learning rule. The modified self-organizing processing of the ART2A network consists of a preprocessing stage, a searching stage and an adaptation stage; the searching stage is a procedure of choice, match and reset. This modified ART2A network adopts a Euclidean measurement of similarity and skips the length normalization of inputs in
the preprocessing and adaptation stages. All input vectors X should be scaled to the interval [0, 1]. For each node j in the F1 layer, the choice function T_j is defined by T_j = 1 − ‖X − W_j‖/√N, where W_j is the feed-forward connection weight vector of node j, j = 1, 2, 3, …, N. The winner node is indexed by J, where T_J = max{T_j : node j in F1}. A mismatch reset happens when the network fails to locate a winner category after the first input is presented, or when the choice score does not reach the vigilance value, T_J < ρ (0 < ρ < 1). In that case a new category K is created by copying X as its weight vector, W_K = X. Otherwise, the network reaches resonance and the coding subsystem updates the weight vector W_J according to W_J(t+1) = W_J(t) + η[X(t) − W_J(t)] (0 < η < 1). The computational complexity and the dynamics of ART2A are thus determined by the vigilance parameter ρ and the learning rate η; a small vigilance value ρ is beneficial to the stability of ART2A for coarse classification, so ART2A is chosen as the first layer of ART2A-DWNN. In the DWNN, the nonlinearity is approximated by a superposition of discrete wavelet functions. The structure of the DWNN is similar to that of a BP network, but the activation functions of the units in the hidden layer and the output layer are replaced by wavelet functions Ψ_jk(x) and linear summing functions, respectively [6]. The output of the nth unit in the output layer is

    y_n = ∑_{t=1}^{M} w_{nt} 2^{j/2} Ψ( 2^{j/2} ∑_{s=1}^{I} u_{ts} X(s) − k ),   j, k ∈ Z     (1)
where w_{nt} is the weight between the nth node of the output layer and the tth node of the hidden layer, u_{ts} is the weight between the tth node of the hidden layer and the sth node of the input layer, j is the dilation parameter and k is the translation parameter. The numbers of nodes in the hidden layer and the input layer are M and I, respectively. An error vector E can be defined as

    E = 0.5 ∑_{n=1}^{N} (d_n − y_n)²                                                        (2)
where d is the desired output vector. The training is based on the minimization of E, performed by an iterative conjugate-gradient method. For each iteration, the parameters w_{nt} and u_{ts} are modified using the conjugate gradient with momentum according to

    Δθ_{t+1} = −η ∂E/∂θ_t + α Δθ_t,    θ_{t+1} = θ_t + Δθ                                   (3)
where η is the learning rate, α is the momentum term, and θ denotes the parameters w_{nt} and u_{ts} to be trained. Because of the summing activation function of the output layer, the time-frequency analysis property of the wavelet, and error back-propagation learning with momentum, the DWNN benefits from fast training without local minima and has a high probability of recognizing intertwined signals. However, simulation results show that the more modulation types there are, the worse the convergence and classification capability of the DWNN becomes. In the ART2A-DWNN classifier, normalized input vectors are mapped from an R^n space to an R^2 space, so the ART2A-DWNN can classify two intertwined patterns, which neither the ART2A network nor the DWNN alone can do [7].
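To make the ART2A coarse-classification rules above concrete, here is a small Python/NumPy sketch of presenting one pattern to the ART2A layer. It assumes the choice function is normalized by √N (the normalization is not fully legible in the source), and the learning-rate value is an illustrative assumption; the vigilance value 0.78 is the one reported in the experiments below.

import numpy as np

def art2a_step(x, weights, rho=0.78, eta=0.1):
    """One presentation of a pattern x (values in [0, 1]) to the modified ART2A layer,
    following the choice/match/update rules above. 'weights' is a list of category
    weight vectors and is modified in place; eta is an assumed learning rate."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    if not weights:
        weights.append(x.copy())               # first input creates the first category
        return 0
    scores = [1.0 - np.linalg.norm(x - w) / np.sqrt(n) for w in weights]
    J = int(np.argmax(scores))                 # winner-take-all choice
    if scores[J] < rho:                        # mismatch reset: open a new category
        weights.append(x.copy())
        return len(weights) - 1
    weights[J] = weights[J] + eta * (x - weights[J])   # resonance: slow learning update
    return J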
When a new example pattern is added, ART2A categorizes it either into one of the existing classes or into a new class, and only the class to which this new pattern belongs needs to be retrained. This property is very important from the extendibility viewpoint; if only a DWNN were used, it would have to be retrained with all the old and new patterns. After clustering by ART2A, fewer patterns are located in the same class, so when the DWNN is applied to each class the training time is dramatically reduced and, as analyzed above, the problem of converging to local minima is also diminished. Therefore, the DWNN layer can produce satisfactory results.

2.2 Modified ART2A-DWNN

In reference [5], the "morlet" function is employed in the DWNN layer; it is defined as

    φ(t) = e^{−t²/2} e^{jω_0 t}                                                             (4)
Function (4), which has high resolution in both the frequency domain and the time domain, is selected as the mother wavelet in [5], but the "morlet" wavelet does not exactly satisfy the admissibility condition. The ART2A-DWNN with the "morlet" function can successfully classify the five common and similar modulation schemes (2FSK, 4FSK, BPSK, QPSK and GMSK) with an average accuracy rate greater than 94%; however, the accuracy rate still needs to be improved. Hence, different mother wavelet functions are tried in order to obtain a better ART2A-DWNN. Ingrid Daubechies, one of the brightest stars in the world of wavelet research, invented what are called compactly supported orthonormal wavelets, making discrete wavelet analysis practicable. The names of the Daubechies family wavelets are written dbN, where N is the order and db is the "surname" of the wavelet. In this paper the Daubechies wavelet "db9" is chosen as the mother wavelet for the DWNN layer. The wavelet functions of "morlet" and "db9" are shown in Fig. 2; analyses and simulations show that the Daubechies wavelet is more suitable than the "morlet" wavelet.

Fig. 2. Figures of wavelet functions of "morlet" and "db9"
3 Experimental Results

In an automatic digital modulation recognition system, the wireless communication signals are received by an RF front-end that performs amplification, filtering and IF or baseband down-conversion, followed by analogue-to-digital conversion and digital signal processing. The next processing module is the automatic digital modulation recognition module, composed of feature extraction and the ART2A-DWNN classifier proposed above. The features chosen to characterize the digitally modulated signals are γmax, σap, σdp, σaa and σaf, as defined by Azzouz and Nandi [8]. At the end of its training, the network performs a binary classification on each given input pattern: the value of each node in the output layer is designed to be '1' or '0', and together they form the output vector that expresses the different signals. In this paper, we measure the performance of the ART2A-DWNN classifier using band-limited digitally modulated 2FSK, 4FSK, BPSK, QPSK and GMSK signals corrupted with white Gaussian noise. According to the sampling theory of band-limited signals, the carrier frequency fc, the sampling rate fs, and the symbol rate rs are assigned the values 525 kHz, 300 kHz, and 12.5 kHz, respectively. The vigilance parameter of the ART2A layer is 0.78. After coarse classification by the ART2A layer, 2FSK, 4FSK and QPSK are located in one class, while BPSK and GMSK are located in two further classes; a DWNN is then applied to classify the different signals within the same class. We then use the "morlet" function and the "db9" function, respectively, as the mother wavelet in the ART2A-DWNN classifier to recognize the constant-envelope modulation signals mentioned above. The performance of the classifiers, evaluated over 10000 realizations for each modulation type at SNR = 8 dB, is displayed in Table 1. The results show that the accuracy rates of the modified ART2A-DWNN classifier are improved.

Table 1. The accuracy rates for classifiers with different mother wavelets at 8 dB SNR
function \ signal     2FSK      4FSK      4PSK      2PSK      GMSK
"morlet" wavelet      97.47%    97.26%    96.86%    94.78%    96.52%
"db9" wavelet         98.94%    98.65%    98.08%    97.22%    98.44%
Table 2. The accuracy rates for classifiers of modified ART2A-DWNN at different SNRs

SNR \ signal   2FSK      4FSK      4PSK      2PSK      GMSK
15 dB          99.02%    98.98%    98.25%    97.53%    98.67%
8 dB           98.94%    98.65%    98.08%    97.22%    98.44%
5 dB           98.07%    98.63%    98.10%    97.17%    98.23%
Simulations at 5 dB and 15 dB SNR have also been carried out to measure the noise immunity of the modified method. As the SNR decreases from 15 dB to 5 dB, the ART2A-DWNN still gives highly accurate results, and there is no obvious change in the accuracy rates across the different SNRs, as shown in Table 2. Furthermore, in another simulation, new 16QAM signal data are added to the recognition system to evaluate its capability of accommodating new patterns. The simulation results show that the new 16QAM data are self-organized into a new class and only the new DWNN needs to be trained; that is, the ART2A-DWNN is extended, without forgetting the old classes, simply by adding the new class and the corresponding DWNN to which the new patterns belong.
4 Conclusions In this paper, we have presented a modified ART2A-DWNN for automatic digital modulation recognition. Daubechies wavelet “db9” is chosen instead of “morlet” wavelet as the mother wavelet in ART2A-DWNN because Daubechies wavelets are compactly supported orthonormal wavelets. Simulations have been conducted, using signals as realistic as possible. The experimental results have indicated the advantages of ART2A-DWNN, such as the higher recognition capability, noise immunity and convenience of accommodating new patterns. We may claim that the modified ART2A-DWNN is a better alternative for automatic digital modulation recognition. Since the features of input signals are very important for the recognition performance, our future work will continue in this direction to achieve a better classifier.
Acknowledgement This work was supported by Development Program for Outstanding Young Teachers in Harbin Institute of Technology.
References 1. Louis, C., Sehier, P.: Automatic Modulation Recognition with a Hierarchical Neural Network. IEEE Military Communications Conference 3 (1994) 713-717 2. Carpenter, G.A., Grossberg, S., Rosen, D.: ART 2-A: An Adaptive Resonance Algorithm for Rapid Category Learning and Recognition. IEEE IJCNN-91-Seattle International Joint Conference on Neural Networks 2 (1991) 151-156 3. Thomas, F., Karl-Friedrich, K., Torsten, K.: Comparative Analysis of Fuzzy ART and ART-2A Network Clustering Performance. IEEE Transactions on Neural networks 9 (1998) 544-559 4. Huang, Y.C., Huang, C.M.: Evolving Wavelet Networks for Power Transformer Condition Monitoring. IEEE Transactions on Power Delivery 17 (2002) 412-416 5. Wu, Z.L., Wang, X.X., Liu, C.Y., Ren, G.H.: Automatic Digital Modulation Recognition Based on ART2A-DWNN. Advances in Neural Networks – ISNN 2005 LNCS 3497 (2005) 381-386
6. Li, H.L., Xiao, D.M., Chen, Y.Z.: Wavelet ANN Based Transformer Fault Diagnosis Using Gas-in-oil Analysis. IEEE Proceedings of the 6th International Conference on Properties and Applications of Dielectric Materials 1 (2000) 147-150 7. Wu, Y.T., Tai, H.M., Reynolds, A.C.: An ART2-BP Neural Net and Its Application to Reservoir Engineering. IEEE International Conference on Neural Networks 5 (1994) 3289-3294 8. Azzouz, E.E., Nandy, A.K.: Procedure for Automatic Modulation Recognition of Analog and Digital Modulations. IEEE Pro-Commun 143 (1996)259-266
Target Recognition of FLIR Images on Radial Basis Function Neural Network Jun Liu, Xiyue Huang, Yong Chen, and Naishuai He Automation College, Chongqing University, Chongqing 400030, China
[email protected]
Abstract. Recognizing small targets at low SNR (signal-to-noise ratio) is a key problem in processing forward-looking infrared (FLIR) image information. Eight object features, based on IR radiation characteristics and on wavelets, are presented. These features are fed to a radial basis function (RBF) network as inputs for learning and classification. The proposed recognition algorithm is invariant to translation, rotation, and scale changes of a shape. Experiments on real infrared images and noisy images are performed, and the recognition results show that the method is very effective.
1 Introduction
Today infrared images are widely applied in many domains, and FLIR image recognition is a key technology in the field of precision guidance. The main difficulties for detection are: 1) the objects exhibit low thermal contrast compared to their surroundings; 2) the signal-to-noise ratio (SNR) is low, since infrared image quality inevitably declines during imaging, transmission and conversion, so infrared images usually contain much noise; 3) for small targets, there is little geometric, spatial-distribution or statistical information to exploit. Artificial neural networks (ANNs) have proved to be a good tool for extracting useful information and revealing inherent relationships in massive, complicated data or vague, incomplete information. The advantage of ANNs lies in their inherent ability to incorporate nonlinear and cross-product terms into the model; moreover, they do not require the mathematical form of the function to be known in advance. As a result of their adaptability, ANNs provide good solutions for an ever-increasing range of problems [1]. In this paper, we present eight object features based on IR radiation characteristics, extract these features from the objects in an IR image, and feed them to an RBF network as inputs for learning and classification. Combining the advantages of feature extraction and the RBF neural network makes target recognition more reliable, and the proposed recognition algorithm is invariant to translation, rotation, and scale changes of a shape. Experiments on real infrared images and noisy images show that the method is very effective and reliable.
2 Imaging Model of FLIR Image
The general IR target acquisition strategy is depicted in Fig. 1 [2]. Small targets are always merged into noisy and cluttered environments, which makes them difficult to detect. In FLIR image sequences, the imaging model can be described as

    f(x, y, t) = f_T(x, y, t) + f_B(x, y, t) + n(x, y, t)                                   (1)
where f(x, y, t) denotes the intensity at location (x, y) at time t, f_T(x, y, t) and f_B(x, y, t) denote the target intensity and the background intensity respectively, and n(x, y, t) denotes the measurement noise.
Fig. 1. Acquisition algorithm structure
3 The FLIR Image Feature Extraction
Selection of good features is the first important task; it provides the essential input for the RBF classifier.

3.1 Size Features [3]
The following size features are selected.

1) Area. Since the segmented FLIR image is binary, it is convenient and efficient to measure area by pixel count; we simply use the total number of pixels inside the target as its area.
2) Perimeter. The perimeter of an object is particularly useful for discriminating between objects with simple and complex shapes. We calculate the perimeter by counting the boundary pixels of a target based on the 4-connected neighborhood.
3) Length and width. The horizontal and vertical extent of a target are easy to compute; we use the minimum enclosing rectangle (MER) around the target.

3.2 Shape Features [4]
Frequently, the objects of one class can be distinguished from other objects by their shapes. In the local image, shape features are used in combination with size measurements.

1) Rectangularity. A measurement that reflects the rectangularity of an object is the rectangle fit factor R = A0/AR, where A0 is the object's area and AR is the area of the object's MER. R represents how well an object fills its MER: it takes the maximum value of 1.0 for rectangular objects, assumes the value π/4 for circular objects, and becomes small for slender, curved objects. The rectangle fit factor is bounded between 0 and 1.
2) Aspect ratio. This feature is related to rectangularity: a = W/L, the ratio of the width to the length of the MER. It distinguishes slender objects from roughly square or circular objects.
3) Circularity. Circularity reflects the complexity of the boundary being measured. We use the most common circularity measure C = P²/A, where P is the perimeter and A is the area of the object. The feature takes its minimum value of 4π for a circular shape; more complex shapes yield higher values. The circularity measure C is roughly correlated with the subjective notion of boundary complexity.

3.3 Wavelet-Based Feature Extraction
We propose a more reasonable and efficient method, based on the wavelet transform, to extract invariant features of the infrared image. The method combines the energy feature in the spatial-frequency domain with invariant moments that represent the statistical-geometrical characteristics of the image, and it is suitable for infrared image recognition when target images differ only in shape or in texture. The method is as follows:

1) Normalize the target image; this makes the energy feature vector independent of scaling.
2) Decompose the target image at multiple scales to obtain the low-frequency sub-image at each scale.
3) Calculate the energy of the low-frequency image at each scale according to the method proposed by Chang et al. [5]. Suppose a sub-image function is S_i(x, y), (x = 0, 1, ..., M − 1; y = 0, 1, ..., N − 1). Then its energy is

    e_i = (1/(MN)) ∑_{x=0}^{M−1} ∑_{y=0}^{N−1} [S_i(x, y)]²,   (1 ≤ i ≤ k)                  (2)

where M × N is the dimension of the discrete image function and k is the total number of low-frequency sub-images obtained after decomposition.
4) Calculate the invariant moments (φ1 and φ2) of the low-frequency sub-image. The moment represents the statistical distribution of the gray levels of the image relative to its center of mass. The literature shows that only invariant moments based on second-order moments are independent of translation, rotation and scaling when describing a two-dimensional object; higher-order moments are very sensitive to phase errors and small distortions introduced during image processing. Among the invariant moments, only two are based on second-order moments (the other five are based on third-order moments); φ1 and φ2 are

    φ1 = η20 + η02                                                                          (3)

    φ2 = (η20 − η02)² + 4η11²                                                               (4)

5) The feature vector consists of the energy vector and the moment vector. It does not depend on the translation, rotation or scaling of the image, and it describes the target properly and efficiently.
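As a minimal sketch of steps 3) and 4), the following Python/NumPy functions compute the sub-image energy of equation (2) and the two second-order invariant moments of equations (3)-(4). The multi-scale decomposition producing the low-frequency sub-images is assumed to come from any 2-D discrete wavelet transform (for example a package such as PyWavelets) and is not shown here.

import numpy as np

def subimage_energy(S):
    """Eq. (2): mean squared value of an M x N sub-image."""
    S = np.asarray(S, dtype=float)
    return np.mean(S ** 2)

def phi1_phi2(img):
    """Eqs. (3)-(4): the two second-order invariant moments of a grey-level image."""
    img = np.asarray(img, dtype=float)
    y, x = np.mgrid[0:img.shape[0], 0:img.shape[1]]
    m00 = img.sum()
    xbar, ybar = (x * img).sum() / m00, (y * img).sum() / m00
    def eta(p, q):                              # normalised central moment
        mu = (((x - xbar) ** p) * ((y - ybar) ** q) * img).sum()
        return mu / m00 ** ((p + q) / 2.0 + 1.0)
    e20, e02, e11 = eta(2, 0), eta(0, 2), eta(1, 1)
    return e20 + e02, (e20 - e02) ** 2 + 4.0 * e11 ** 2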
4 Design of the RBF Recognition [6]

4.1 RBF Neural Network
The basic structure of RBF neural networks is shown in Fig. 2.
Fig. 2. RBF neural networks
The output of the ith RBF unit is

    OR_i = R_i(X) = R_i( ‖X − C_i‖ / σ_i ),   i = 1, 2, ..., u                              (5)

where X is an r-dimensional input vector, C_i is a vector of the same dimension as X, u is the number of hidden units, and R_i(·) is the ith RBF unit response, with a single maximum at the origin. Typically, R_i(·) is chosen as a Gaussian function

    R_i(X) = exp( −‖X − C_i‖² / σ_i² )                                                      (6)

The jth output f_j(X) of an RBF neural network is

    f_j(X) = b(j) + ∑_{i=1}^{u} OR_i × ω_2(j, i) = b(j) + ∑_{i=1}^{u} R_i(X) × ω_2(j, i)     (7)

where ω_2(j, i) is the weight or strength of the ith receptive field to the jth output and b(j) is the bias of the jth output. In the following analysis, the bias is not considered, in order to reduce the network complexity, so the jth output of the RBF neural network is

    f_j(X) = ∑_{i=1}^{u} R_i(X) × ω_2(j, i)                                                 (8)
4.2 Structure Determination and Initialization
The number of inputs is equal to the number of selected features, eight (area, perimeter, length and width, rectangularity, aspect ratio, circularity, and the invariant moments φ1 and φ2), while the number of outputs is set to the number of classes (see Fig. 2). The number of RBF nodes is set to twelve.
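A minimal sketch of the forward pass of Eqs. (5)-(8) is given below (an assumption of ours in Python/NumPy; the centers, widths and weights would come from training, and the dimensions r = 8 and u = 12 follow the text):

import numpy as np

def rbf_forward(x, centers, sigmas, w2):
    # Eq. (6): Gaussian response of each of the u RBF units
    r = np.exp(-np.sum((x - centers) ** 2, axis=1) / sigmas ** 2)
    # Eq. (8): weighted sum of the unit responses (bias omitted)
    return w2 @ r

# illustrative dimensions from the paper: r = 8 features, u = 12 hidden units
rng = np.random.default_rng(0)
centers = rng.random((12, 8))     # would come from training in practice
sigmas  = np.full(12, 0.5)        # assumed widths
w2      = rng.random((1, 12))     # one output score per class
score = rbf_forward(rng.random(8), centers, sigmas, w2)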
5
Experimental Results and Analysis
The proposed recognition algorithm is used for target detection and recognition in cluttered backgrounds. The aim of the experiments is the recognition of target ships on a river. The main difficulty is that the background is very complex: on a river the background consists not only of the sky and the water surface, but also of mountains, bridges, and so on. This is the main difference between a river
Fig. 3. Tested original images for the experiment
Fig. 4. Recognition results for the original images
and the sea. The target ships are therefore often merged into noisy, cluttered surroundings, which makes them difficult to detect and recognize. In the experiment, the processed objects are FLIR images (320×240) of target ships, shown in Fig. 3. Pictures (a), (b) and (c) are the original images. In picture (a), one group of target ships, which cannot be detected in the usual way because they are merged into the mountain beside the river bank, is near the bank and another group is in the distance. There are two ships in picture (b), but the ship near the pier cannot be detected in the usual way because the boat and the pier overlap, and there is a similar overlap in picture (c). It is thus very difficult to detect and recognize target ships on a river using conventional methods. The proposed recognition algorithm, however, extracts eight features based on IR radiation characteristics and the wavelet transform. These features are used as the input to an RBF network for learning and classification, with r = 8, u = 12 and s = 1 in formulas (5)-(8). Experiments on real infrared images and noisy images were performed, and the recognition results for the original images (see Fig. 4) show that the method is very effective.
6
Conclusion
In this paper, we proposed a new method of infrared feature extraction based on invariant moments, size and shape. The infrared images were identified from the target features with an RBF network. The experimental results show that the method is reliable and efficient and that the extracted features represent the target classes well. The method has strong anti-noise ability and can be applied to recognize infrared targets efficiently and reliably, so it has valuable applications in target recognition for precision guidance. Acknowledgments. This work is partially supported by CSTC Grant CSTC, 2006BA6016.
References 1. Reyneri, L.M.: Weighted Radial Basis Functions for Improved Pattern Recognition and Signal Processing. Neural Processing Letters 2 (1995) 2-6 2. Huang, D., Cho, W., Tommy, W.S.: A People-Counting System Using a Hybrid RBF Neural Network. Neural Processing Letters 18 (2003) 97-113 3. Jain, A.K.: Fundamentals of Digital Image Processing. New Jersey: Prentice-Hall (1989) 4. Hilditch, J., Rutovitz, D.: Chromosome Recognition. Ann. New York Acad. Sci. 157 (1969) 339-364 5. Chang, T., Kuo, C.J.: Texture Analysis and Classification with Tree-structured Wavelet Transform. IEEE Transactions on Image Processing. Digital Object Identifier (1993) 429-441 6. Yan, P.F., Zhang, C.S.: Artificial Neural Networks and Evolutionary Computering. 2th. TUP Beijing (2005)
Two-Dimensional Bayesian Subspace Analysis for Face Recognition Daoqiang Zhang Department of Computer Science and Engineering Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
[email protected]
Abstract. Bayesian subspace analysis (BSA) has been successfully applied in data mining and pattern recognition. However, due to the use of a probabilistic measure of similarity, it often needs many more projective vectors for good performance, which makes the compression ratio very low. In this paper, we propose a novel 2D Bayesian subspace analysis (2D-BSA) method for face recognition at high compression ratios. The main difference between the proposed 2D-BSA and BSA is that the former adopts a new Image-as-Matrix representation for face images, as opposed to the Image-as-Vector representation in original BSA. Based on the new representation, 2D-BSA seeks two coupled sets of projective vectors corresponding to the rows and columns of the difference face images, and then uses them for dimensionality reduction. Experimental results on the ORL and Yale face databases show that 2D-BSA is much more appropriate than BSA for recognizing faces at high compression ratios.
1 Introduction Subspace analysis has attracted much attention in machine learning, data mining and pattern recognition over the last decade. The essence of subspace analysis is to find a set of projective vectors through which to represent original high-dimensional faces in a low-dimensional space. Principal component analysis (PCA, also known as Eigenface) [5], linear discriminant analysis (LDA) [1][2] and Bayesian subspace analysis (BSA) [3][4] are the three mainstreams of subspace analysis methods in the field. Among them, PCA does not consider the class label information and hence is unsupervised. LDA uses the class label information but its decision boundaries are crisp and simple (linear) in nature. The BSA method also uses supervised information, but in a way different from LDA: it tries to construct the similarity model (i.e., intrapersonal space) of the same individual in a soft (probabilistic) way. This makes it easier to adapt to unknown samples, and it has been shown that BSA outperforms both PCA and LDA [4]. However, one of the limitations of BSA is that it needs relatively many projective vectors to compute the probabilistic measure of similarity with good performance, which makes the compression ratio very low. Here the compression ratio is defined as the ratio between the total number of training face image pixels and the number of projected face components plus the size of the projective vectors. More specifically, suppose there are M training face images, each of size m by n; then the total number of training
face image pixels is Mmn. If d projective vectors are used in BSA, the size of the projective vectors is dmn, the number of projected face components is Md, and the compression ratio is computed as (Mmn)/(Md+dmn). Clearly, as the number of projective vectors increases, the compression ratio decreases. On the other hand, as the number of projective vectors increases, the size of the projected face vector also grows, which means the processing time for recognizing faces also increases. This limitation is especially severe on large face databases. In this paper, we propose a novel 2D Bayesian subspace analysis (2D-BSA) method for face recognition at high compression ratios. As opposed to the classical Image-as-Vector representation in original BSA, we adopt a new Image-as-Matrix representation [6] for face images in 2D-BSA. Based on the new representation, 2D-BSA seeks two sets of projective vectors corresponding to the rows and columns of the difference face images, and then uses them for dimensionality reduction. According to the type of projective vectors used, 2D-BSA can be divided into three concrete forms, i.e. unilateral 2D-BSA using only the projective vectors corresponding to the rows or columns of the difference face images (denoted as U2D-BSA-row and U2D-BSA-col respectively) and bilateral 2D-BSA using both sets of projective vectors (denoted as B2D-BSA). Experimental results on the ORL and Yale face databases show that the proposed 2D-BSAs (especially B2D-BSA) outperform original BSA in recognizing faces at high compression ratios. The rest of this paper is organized as follows: In Section 2, we briefly introduce the original BSA method. We propose the 2D-BSA methods in Section 3. The experimental results on the ORL and Yale face databases are given in Section 4. Finally, we conclude in Section 5.
2 Bayesian Subspace Analysis The main idea of Bayesian subspace analysis lies in developing a probabilistic measure of similarity based on a Bayesian (MAP) analysis of face image differences. Consider a feature space of the difference vectors Δ = I1 − I2 between two images. Define two classes of facial image variations: intrapersonal variations ΩI (corresponding to different facial expressions of the same individual) and extrapersonal variations ΩE (corresponding to variations between different individuals). The similarity measure S(Δ) can then be expressed in terms of the intrapersonal a posteriori probability of Δ belonging to ΩI given by the Bayesian rule [4]:
S(Δ) = P(ΩI | Δ) = P(Δ | ΩI) P(ΩI) / [ P(Δ | ΩI) P(ΩI) + P(Δ | ΩE) P(ΩE) ]   (1)

The densities of both classes are modeled as high-dimensional Gaussians [4]:

P(Δ | ΩE) = exp(−½ Δᵀ ΣE⁻¹ Δ) / [ (2π)^{D/2} |ΣE|^{1/2} ] ,   P(Δ | ΩI) = exp(−½ Δᵀ ΣI⁻¹ Δ) / [ (2π)^{D/2} |ΣI|^{1/2} ]   (2)

where ΣE and ΣI are the covariance matrices of ΩE and ΩI respectively.
To compute the likelihoods P(Δ | ΩE) and P(Δ | ΩI), the database images Ij are preprocessed with the whitening transformation. Each image is converted and stored as a set of two whitened subspace coefficients, i.e. y_j^{φI} for the intrapersonal space and y_j^{φE} for the extrapersonal space [4]:

y_j^{φI} = ΛI^{−1/2} VI Ij ,   y_j^{φE} = ΛE^{−1/2} VE Ij   (3)

where ΛI, VI and ΛE, VE are the matrices of the largest eigenvalues and corresponding eigenvectors of the covariance matrices ΣI and ΣE respectively. From Eq. (3), Eq. (2) can be rewritten as [4]:

P(Δ | ΩE) = exp(−‖y^{φE} − y_j^{φE}‖²) / [ (2π)^{kE/2} |ΣE|^{1/2} ] ,   P(Δ | ΩI) = exp(−‖y^{φI} − y_j^{φI}‖²) / [ (2π)^{kI/2} |ΣI|^{1/2} ]   (4)

where kE and kI are the reduced dimensions of ΩE and ΩI respectively, and y^{φE} and y^{φI} are the whitened coefficient vectors for the test image I.
From Eq. (4), the maximum a posteriori (MAP) similarity defined in Eq. (1) can be easily computed. However, in practice, the MAP similarity is often replaced with the following maximum likelihood (ML) similarity [4]:
S′(Δ) = P(Δ | ΩI) = exp(−‖y^{φI} − y_j^{φI}‖²) / [ (2π)^{kI/2} |ΣI|^{1/2} ]   (5)
In Eq. (5), only the intrapersonal class is evaluated, and it has been shown that the ML similarity measure has a similar performance to that of the MAP similarity measure. For that reason and for simplicity, we only consider the ML similarity measure defined in Eq. (5) throughout the paper.
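A minimal NumPy sketch of the whitening of Eq. (3) and the ML similarity of Eq. (5) is given below (our illustration, not the author's code). It assumes that the leading k_I eigenvalues Lam_I and eigenvectors V_I (as rows) of the intrapersonal covariance are already available, that images are passed as vectors, and that the exponent uses the squared distance between whitened coefficient vectors; the determinant is approximated by the product of the retained eigenvalues:

import numpy as np

def whiten(I, Lam_I, V_I):
    # Eq. (3): project the vectorized image onto the intrapersonal eigenvectors
    # and rescale by the inverse square roots of the eigenvalues
    return np.diag(Lam_I ** -0.5) @ (V_I @ I)

def ml_similarity(y_test, y_train, Lam_I):
    # Eq. (5): ML similarity evaluated in the whitened intrapersonal subspace
    k_I = Lam_I.size
    dist2 = np.sum((y_test - y_train) ** 2)
    norm = (2 * np.pi) ** (k_I / 2) * np.sqrt(np.prod(Lam_I))
    return np.exp(-dist2) / norm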
3 2D Bayesian Subspace Analysis Suppose that there are M training face images, denoted by m by n matrices Ak (k = 1, 2, ..., M). Let ΩI = {Bi}_{i=1}^{N} denote the set of difference images from the same individual. Concatenating the N matrices Bi into an m by nN matrix:

ML = [B1, B2, ..., BN] = {bj}_{j=1}^{nN}   (6)

where the bj are the m by 1 column vectors of the Bi.
Let ΛLd (a d by d diagonal matrix) and L = [l1, l2, ..., ld] (an n by d matrix) be the d largest eigenvalues and corresponding eigenvectors of Eq. (6). For each training image Ak and any test image A, the column whitening transformation is as follows:

y_k^{φL} = ΛLd^{−1/2} Lᵀ Ak ,   y^{φL} = ΛLd^{−1/2} Lᵀ A   (7)

From Eq. (7), the ML similarity becomes:

SL(Δ = A − Ak) = exp(−‖y^{φL} − y_k^{φL}‖²) / [ (2π)^{d/2} |ΣL|^{1/2} ]   (8)
In this paper, we call the 2D-BSA method based on Eq. (8) unilateral 2D-BSA with column whitening, denoted as U2D-BSA-col. Similarly, concatenating the N matrices Bi into an mN by n matrix:

MR = [B1ᵀ, B2ᵀ, ..., BNᵀ]ᵀ = {cj}_{j=1}^{mN}   (9)

where the cj are the 1 by n row vectors of the Bi.

Let ΛRd (a d by d diagonal matrix) and R = [r1, r2, ..., rd] (an m by d matrix) be the d largest eigenvalues and corresponding eigenvectors of Eq. (9). For each training image Ak and any test image A, the row whitening transformation is as follows:

y_k^{φR} = Ak R ΛRd^{−1/2} ,   y^{φR} = A R ΛRd^{−1/2}   (10)

And the ML similarity becomes:

SR(Δ = A − Ak) = exp(−‖y^{φR} − y_k^{φR}‖²) / [ (2π)^{d/2} |ΣR|^{1/2} ]   (11)
We call the 2D-BSA method based on Eq. (11) unilateral 2D-BSA with row whitening, denoted as U2D-BSA-row. Finally, suppose we have obtained the aforementioned ΛLd, L = [l1, l2, ..., ld], and ΛRd, R = [r1, r2, ..., rd]. For each training image Ak and any test image A, we can define the following bilateral whitening transformation:

y_k^{φB} = ΛLd^{−1/2} Lᵀ Ak R ΛRd^{−1/2} ,   y^{φB} = ΛLd^{−1/2} Lᵀ A R ΛRd^{−1/2}   (12)

And the ML similarity measure is computed as:

SB(Δ = A − Ak) = exp(−‖y^{φB} − y_k^{φB}‖²) / [ (2π)^{d/2} |Σ|^{1/2} ]   (13)

We call the 2D-BSA method based on Eq. (13) bilateral 2D-BSA, denoted as B2D-BSA.
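As an illustration (ours, not the author's code), the bilateral whitening of Eq. (12) and the similarity of Eq. (13) reduce to a few lines of NumPy once the projection matrices L, R and the eigenvalue vectors Lam_Ld, Lam_Rd are available:

import numpy as np

def bilateral_whiten(A, L, Lam_Ld, R, Lam_Rd):
    # Eq. (12): whiten the image matrix A from both sides
    Dl = np.diag(Lam_Ld ** -0.5)
    Dr = np.diag(Lam_Rd ** -0.5)
    return Dl @ L.T @ A @ R @ Dr                 # d x d matrix of whitened coefficients

def b2dbsa_similarity(A, Ak, L, Lam_Ld, R, Lam_Rd):
    # Eq. (13): ML similarity between a test image A and a training image Ak;
    # the normalizing constant is omitted because it is the same for every Ak
    y, yk = (bilateral_whiten(X, L, Lam_Ld, R, Lam_Rd) for X in (A, Ak))
    return np.exp(-np.sum((y - yk) ** 2))

Since the normalizing constant of Eq. (13) is shared by all training images, dropping it does not change the ranking used by the nearest neighbor classifier.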
4 Experimental Results In this section, we test the proposed U2D-BSA-col, U2D-BSA-row and B2D-BSA methods, compared with the original BSA method, on two commonly used face databases, the ORL and Yale face databases. The ORL database contains images from 40 individuals, each providing 10 different images. The Yale database contains images from 15 individuals, each providing 11 different images. For both databases, the first 5 images per person are used for training, and the rest for testing. The well-known nearest neighbor (1-NN) classifier is used for classification after extracting the features using the above-mentioned methods. Figure 1 and Fig. 2 show comparisons of the recognition accuracy under different compression ratios of the four methods on the ORL and Yale face databases respectively. Here the compression ratio is defined as the ratio between the total number of training face image pixels and the number of projected face components plus the size of the projective vectors. More specifically, the compression ratios of BSA, U2D-BSA-col,
Fig. 1. Comparisons of the recognition accuracy under different compression ratios of the four methods on ORL face database
Fig. 2. Comparisons of the recognition accuracy under different compression ratios of the four methods on Yale face database
U2D-BSA-row and B2D-BSA are (Mmn)/(Md+dmn), (Mmn)/(Mdn+dm), (Mmn)/(Mdm+dn) and (Mmn)/(Md²+dm+dn) respectively, where M is the number of training images, m and n are the size of the face image, and d is the number of projective vectors. Note that M, m and n are fixed for a given face database, so the compression ratio is directly related only to the reduced dimension d. From Fig. 1 and Fig. 2, it is impressive to see that B2D-BSA has the best accuracy no matter what compression ratio is chosen. Both U2D-BSA-col and U2D-BSA-row outperform BSA, but they are inferior to B2D-BSA. As the compression ratio increases, the accuracy of BSA decreases rapidly. On the other hand, B2D-BSA can still retain a relatively high accuracy even when the compression ratio is larger than 100.
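The effect of d on the four compression ratios can be checked with a few lines of arithmetic; the values of M, m and n below are illustrative assumptions for an ORL-like setting (200 training images of 112 x 92 pixels):

# Compression ratios of the four methods for an assumed ORL-like setting
M, m, n = 200, 112, 92
for d in (2, 5, 10):
    bsa = M*m*n / (M*d + d*m*n)
    col = M*m*n / (M*d*n + d*m)
    row = M*m*n / (M*d*m + d*n)
    bi  = M*m*n / (M*d*d + d*m + d*n)
    print(d, round(bsa, 1), round(col, 1), round(row, 1), round(bi, 1))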
5 Conclusion In this paper, we propose a novel 2D Bayesian subspace analysis (2D-BSA) method for face recognition at high compression ratios. The main difference between the proposed 2D-BSA and BSA is that the former adopts a new Image-as-Matrix representation for face images, as opposed to the Image-as-Vector representation in original BSA. Experimental results show the effectiveness of the proposed method. Both BSA and the proposed 2D-BSA methods assume a Gaussian likelihood function, which often does not hold for real data, so an interesting issue is to extend the proposed methods to the non-Gaussian case through kernel transformations [7]. In future work, we will also compare our methods with existing face recognition methods such as Eigenface [5] and Fisherface [1].
Acknowledgements The authors would like to thank the anonymous referees for their helpful comments and suggestions. This work was supported by the National Science Foundation of China (No. 60505004) and the Jiangsu Science Foundation (BK2006521).
References 1. Belhumeur, P., Hespanha, J., Kriegman, D.: Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection. IEEE Trans. Pattern Analysis and Machine Intelligence 19(7) (1997) 711-720 2. Kong H., Wang, L., Teoh, E.K., Wang, J.G., Venkateswarlu, R.: A Framework of 2D Fisher Discriminant Analysis: Applications to Face Recognition with Small Number of Training Samples. In: IEEE Conf. CVPR, 2005 3. Moghaddam, B., Pentland, A.: Probabilistic Visual Learning for Object Representation. IEEE Trans. Pattern Analysis and Machine Intelligence 19(7) (1997) 696-710 4. Moghaddam, B., Jebara, T. and Pentland, A.: Bayesian Face Recognition. Pattern Recognition 33(11) (2000) 1771-1782 5. Turk, M. A. and Pentland, A. P.: Face Recognition Using Eigenfaces. In: Proc. of IEEE Conf. on Computer Vision and Pattern Recognition (1991) 586-591 6. Zhang, D., Chen, S., Liu, J.: Representing image matrices: Eigenimages vs. Eigenvectors. Proceedings of the 2nd International Symposium on Neural Networks (ISNN'05), Chongqing, China, LNCS 3497: (2005) 659-664 7. Zhang, D., Chen, S., Liu, Zhou, Z.-H.: Recognizing Face or Object from a Single Image: Linear vs. Kernel Methods on 2D Patterns. Proceedings of the Joint IAPR International Workshops on Structural and Syntactic Pattern Recognition and Statistical Techniques in Pattern Recognition (S+SSPR'06), Hong Kong, China, LNCS 4109: (2006) 889-897
A Wavelet-Based Neural Network Applied to Surface Defect Detection of LED Chips Hong-Dar Lin and Chung-Yu Chung Department of Industrial Engineering and Management, Chaoyang University of Technology, 168 Jifong E. Rd., Wufong Township, Taichung County 41349, Taiwan (R.O.C.)
[email protected]
Abstract. This research explores the automated detection of surface defects that fall across two different background textures in a light-emitting diode (LED) chip. Water-drop defects, commonly found on chip surface, impair the appearance of LEDs as well as their functionality and security. Automated inspection of a water-drop defect is difficult because the defect has a semiopaque appearance and a low intensity contrast with the rough exterior of the LED chip. Moreover, the blemish may fall across two different background textures, which further increases the difficulties of defect detection. We first use the one-level Haar wavelet transform to decompose a chip image and extract four wavelet characteristics. Then, the Multi-Layer Perceptron (MLP) neural network with back-propagation (BPN) algorithm is applied to integrate the multiple wavelet characteristics. Finally, the wavelet-based neural network approach judges the existence of water-drop defects. Experimental results show that the proposed method achieves an above 96.8% detection rate and a below 4.8% false alarm rate. Keywords: Computer vision system, LED chip, surface defect detection, Wavelet decomposition, Multi-layer perceptron neural network with backpropagation algorithm.
1 Introduction A light-emitting diode (LED) is a semiconductor device that emits visible light when an electric current passes through the semiconductor chip. Compared with incandescent and fluorescent illuminating devices, LEDs have lower power requirements, higher efficiency, and longer lifetimes. Typical applications of LED components include indicator lights, LCD panel backlighting, fiber optic data transmission, etc. To meet consumer and industry needs, LED products are being made in smaller sizes, which increases the difficulty of product inspection. Surface defects impair the appearance of LEDs as well as their functionality and security. As inspecting surface defects by human eyes is ineffective and inefficient, this research aims to develop an automated vision system for detecting one common type of LED surface defect, water-drop blemishes formed by the steam generated during the production process. Automated inspection of a water-drop blemish is difficult because the blemish has a semi-opaque appearance and a low intensity
contrast with the rough exterior of the LED chip. With a width of 12.6 μm, an LED chip comprises an aluminum-pad (bonding pad) in the central area and a metal oxide semiconductor (emitting area) in the outer area, as shown in Fig. 1(a). Texture of the central area has a random pattern while that of the outer area has a uniform appearance. A water-drop defect may fall across the two areas of significantly different textures, which further increases the difficulties of defect detection. Figure 1(b) displays the LED chip images with water-drop blemishes.
(a) LED chip without defect
(b) LED chip with water-drop defect
Fig. 1. LED chip images
Defect detection techniques, generally classified into the spatial domain and the frequency domain, compute a set of textural features in a sliding window and search for significant local deviations among the feature values. Siew et al. [1] apply the cooccurrence matrix method, a traditional spatial domain technique, to assess carpet wear by using two-order gray level statistics to build up probability density functions of intensity changes. For another spatial domain example, Latif-Amet et al. [2] present wavelet theory and co-occurrence matrices for detection of defects encountered in textile images and classify each sub-window as defective or nondefective with a Mahalanobis distance. As to techniques in the frequency domain, Tsai and Hsiao [3] propose a multiresolution approach for inspecting local defects embedded in homogeneous textured surfaces. By properly selecting the smooth subimage or the combination of detail subimages in different decomposition levels for backward wavelet transform, regular, repetitive texture patterns can be removed and only local anomalies are enhanced in the reconstructed image. Tsai and Wu [4] adopt Gabor transform to determine three texture features (scale, frequency and orientation) and use Gabor energy differences to discriminate defect locations. Also, Lin and Ho [5] develop a novel approach that applies discrete cosine transform based enhancement for the detection of pinhole defects on passive component chips. Regarding defect detection applications in the electronic industry, Lin and Chiu [6] use multivariate Hotelling T2 statistic to integrate different coordinates of color models for MURA-type defect detection on Liquid Crystal Displays (LCD), and apply ant colony algorithm and back-propagation neural network techniques to develop an automatic inspection procedure. Lu and Tsai [7] propose a global approach for automatic visual inspection of micro defects such as pinholes, scratches, particles and fingerprints. The Singular Value Decomposition (SVD) adopted by Lu and Tsai suits the need for detecting defects on the TFT-LCD images of highly periodical textural structures. Furthermore, in the recent decade, many vision systems have been
developed for the inspection of surface defects on semiconductor wafers [8-9]. For instance, Fadzil and Weng [10] implement a vision inspection system that achieves a 90% probability of accurately classifying defects, scratches, contamination, blemishes, off center defects, etc. in the encapsulations of diffused LED products. The aforementioned techniques perform well in anomaly detection, but most of them do not detect defects with the properties of water-drop blemishes [2-10]. This research has been motivated by the need for an efficient and effective technique that detects semi-opaque and low-contrast water-drop blemishes falling across two different background textures.
2 Research Method To detect water-drop blemishes of LED chips, this research adopts the one-level Haar wavelet transform to conduct image pre-processing and extract wavelet characteristics. Then, we apply the Multi-Layer Perceptron (MLP) neural network with back-propagation (BPN) algorithm to judge the existence of water-drop blemishes in an image. 2.1 Wavelet Decomposition and Characteristics Wavelet transform provides a convenient way to obtain a multiresolution representation, from which texture features can be easily extracted. The merits of using wavelet transform include local image processing, simple calculations, high speed processing and multiple image information [11-13]. The Haar wavelet transform is one of the simplest and basic wavelet transformations [13]. A standard decomposition of a two-dimensional image can be done by first applying the 1-D Haar wavelet transform to each row of pixel values, treating these transformed rows as if they were themselves an image, and then performing another 1-D wavelet transform to each column. The Haar transform can be computed stepwise by the mean value and half of the differences of the tristimulus values of two contiguous pixels. Based on the transfer concept of the 1-D space, the Haar wavelet transform can process a 2-D image of (M x N) pixels in the following way: Row transfer :
f_R(i, j) = [ (f(i, 2j) + f(i, 2j+1)) / 2 ] ,
f_R(i, j + [N/2]) = [ (f(i, 2j) − f(i, 2j+1)) / 2 ] ,
where 0 ≤ i ≤ M − 1, 0 ≤ j ≤ [N/2] − 1, and [ ] is the Gauss symbol.

Column transfer:
f_C(i, j) = [ (f_R(2i, j) + f_R(2i+1, j)) / 2 ] ,
f_C(i + [M/2], j) = [ (f_R(2i, j) − f_R(2i+1, j)) / 2 ] ,
where 0 ≤ i ≤ [M/2] − 1, 0 ≤ j ≤ N − 1 .   (1)
In the above expressions (Eq. (1)), f (i, j ) represents an original image, f R (i, j ) the row transfer function of f (i, j ) , and f C (i, j ) the column transfer function of f R (i, j ) . As f C (i, j ) is also the outcome of the wavelet decomposition of f (i, j ) , the outcomes of a wavelet transform can be defined as:
A(i, j) = f_C(i, j) ;   D1(i, j) = f_C(i, j + [N/2]) ;
D2(i, j) = f_C(i + [M/2], j) ;   D3(i, j) = f_C(i + [M/2], j + [N/2]) ;
where 0 ≤ i ≤ [M/2] − 1, 0 ≤ j ≤ [N/2] − 1 .   (2)
One level of wavelet decomposition generates one smooth subimage and three detail subimages that contain fine structures with horizontal, vertical, and diagonal orientations. An image is decomposed by the wavelet transform into one approximation subimage (A) and three detail subimages (D1, D2 and D3). These four subimages, each of which has a size of (M/2 x N/2) pixels, form the wavelet characteristics. 2.2 Wavelet-Based Neural Network Approach
The wavelet-based neural network approach decomposes an image of (M x N) pixels into a set of subimages, each of which has a size of (m x n) pixels and is a wavelet processing unit. The original image thus has g x h (i.e. M/m x N/n) wavelet processing units. For each wavelet processing unit, the wavelet transform is applied to the region of (m x n) pixels to obtain the four wavelet characteristics A, D1, D2 and D3. This research uses the MLP neural network with the back-propagation (BPN) algorithm [14-15] to detect defective regions containing water-drop blemishes. The four wavelet characteristics of a wavelet processing unit, which describe the surface variations of gray level uniformity, are used as the input values of the MLP neural network model to identify regions with water-drop defects. If the size of a wavelet processing unit is 2x2 pixels, an image of 256 x 256 pixels will have 16,384 sets of wavelet characteristics. Each set of wavelet characteristics can be judged as in-control or out-of-control, and the output layer of the network uses 0 and 1 to represent the in-control and out-of-control decisions. The data of the input patterns must be scaled first: a linear transformation is used to set the range of the input values to [0, 1], preventing extreme values from affecting the network training results. Some parameters of the network model, such as the learning rate (η), the number of training cycles, the error tolerance, and the number of hidden layer nodes, need to be carefully set to achieve good model performance. Uniformly distributed random numbers in the range [-1, 1] are used to initialize the interconnection weights and the biased weight vectors (θ) for the training patterns of the model. The sigmoid function is used in the model and its output range is [0, 1]:

f(x) = 1 / (1 + e^{−net_j})   (3)
The standard energy function below is used to calculate the variation between expected output and network output.
E = (1/2) Σ_j (T_j − Y_j)²   (4)
The stop criterion of the proposed model is based on the proposition of Hush and Horne [16], who use methods of Root Mean Square Error and fixed learning cycles to set the parameters. Figure 2 shows the network structures of the proposed model.
Fig. 2. Network structure of the proposed wavelet-based neural network approach
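The following Python sketch (our illustration; the function names are ours) shows how one 2x2 wavelet processing unit yields the four characteristics A, D1, D2 and D3 of Eqs. (1)-(2), and how an image is cut into non-overlapping units; the MLP of Fig. 2 is assumed to be trained separately on these four-dimensional inputs:

import numpy as np

def haar_unit(block):
    # one-level Haar transform of a 2x2 wavelet processing unit (Eqs. (1)-(2));
    # integer division mirrors the Gauss-symbol (floor) used in the paper
    a, b = int(block[0, 0]), int(block[0, 1])
    c, d = int(block[1, 0]), int(block[1, 1])
    A  = ((a + b) // 2 + (c + d) // 2) // 2     # smooth sub-image
    D1 = ((a - b) // 2 + (c - d) // 2) // 2     # detail sub-image D1
    D2 = ((a + b) // 2 - (c + d) // 2) // 2     # detail sub-image D2
    D3 = ((a - b) // 2 - (c - d) // 2) // 2     # detail sub-image D3
    return np.array([A, D1, D2, D3], dtype=float)

def units(image):
    # slide a non-overlapping 2x2 window over the image: 256x256 -> 16384 units
    h, w = image.shape
    return np.array([haar_unit(image[i:i+2, j:j+2])
                     for i in range(0, h, 2) for j in range(0, w, 2)])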
3 Experiments and Analyses Experiments are conducted on real LED chips to evaluate the performance of the proposed approach. We test 85 LED images, of which 20 have no defects and 65 have various water-drop defects. For precisely locating water-drop defects, we found the most appropriate size of a wavelet processing unit to be 2 x 2 pixels. At this size, the proposed approach achieves the best performance considering the sample training time, the recognition time of the testing period, and the size of the defect area. In the middle area of an LED chip is a bonding pad which contains random particles like pepper noise. The more similar the gray levels of the particles on the bonding pad and the water-drop defect, the more difficult it is to distinguish the defect from the random particles. The median filter [17] is used to smooth the particles on the random texture; a mask of size 11 x 11 pixels is capable of smoothing all the random particles in the testing samples. The wavelet transform is then applied to the filtered images to extract wavelet characteristics. After the wavelet characteristics have been extracted from the wavelet processing units of the testing images, the BPN model is applied to detect the water-drop defects on the two different background textures in LED chip images. We combine the wavelet characteristics and the BPN model to establish a water-drop defect detection system for detecting surface variations on two different textures. The input patterns of the BPN model, each including four wavelet characteristics, are obtained from the 16,384 sets of a testing image. The training patterns and testing patterns are two-thirds and one-third of the total images, respectively. The testing results of the BPN models can be affected by many factors, such as the parameter settings, the number of training samples, the input patterns of the network, and so on. After conducting
various experiments, we find that the best parameter settings of the BPN model for water-drop defect detection in the bonding area are: 1) number of hidden layers = 1; 2) number of hidden layer nodes = 6; 3) learning rate = 0.5; 4) momentum = 0.5; and 5) iteration cycles = 40; and those of the BPN model for the emitting area are: 1) number of hidden layers = 1; 2) number of hidden layer nodes = 3; 3) learning rate = 1; 4) momentum = 0.5; and 5) iteration cycles = 10. The index RMSE (Root Mean Square Error) is used to evaluate the performance of the network models. The RMSE indices of the two BPN models with the given parameter settings are 0.063 and 0.074 for the bonding area and the emitting area, respectively. Figure 3 shows partial results of detecting water-drop defects by the Otsu method [18], the proposed wavelet-based neural network approach, and a professional inspector, respectively. The wavelet-based BPN method detects most of the water-drop blemishes while the Otsu method misses some defect regions. The performance evaluation indices, (1-α) and (1-β), are used to represent correct detection judgments; the higher the two indices, the more accurate the detection results. The type I error α is the probability of incorrectly judging normal regions as defects. The type II error β is the probability of failing to alarm on real defects.
Fig. 3. Partial detection results of Otsu method, BPN method, and professional inspector
The average detection rates over all testing samples by the two methods are 96.8% (the wavelet-based BPN method) and 87.9% (the Otsu method), respectively. The proposed wavelet-based neural network approach has higher detection rates than the traditional method applied to LED chip images, and excels in its ability to correctly discriminate water-drop blemishes from normal regions.
Fig. 4. A ROC plot of the Otsu method and the proposed wavelet-based neural network model
When different detection methods are compared, their pairs of false alarm rates and detection rates are plotted as points on a Receiver Operating Characteristic (ROC) plot. The ROC plots of the wavelet-based BPN and the Otsu methods are presented in Fig. 4, whose upper-left corner indicates a 100% detection rate and a 0% false alarm rate. The closer the ROC plot approaches the upper-left corner, the better the test performs. In industrial practice, a detection rate above 90% and a false alarm rate below 10% are a good rule of thumb for performance evaluation of a vision system. Accordingly, the wavelet-based BPN method, with its ROC plot closest to the upper-left corner, outperforms the traditional method.
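For reference, the two evaluation indices can be computed from binary ground-truth and prediction masks as follows (a small sketch of ours; the inputs are assumed to be NumPy arrays of 0/1 defect labels):

import numpy as np

def rates(truth, pred):
    tp = ((truth == 1) & (pred == 1)).sum()
    fn = ((truth == 1) & (pred == 0)).sum()
    fp = ((truth == 0) & (pred == 1)).sum()
    tn = ((truth == 0) & (pred == 0)).sum()
    detection = tp / (tp + fn)       # 1 - beta
    false_alarm = fp / (fp + tn)     # alpha
    return detection, false_alarm

truth = np.array([1, 1, 0, 0, 0, 1])
pred  = np.array([1, 0, 0, 1, 0, 1])
print(rates(truth, pred))            # illustrative toy values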
4 Concluding Remarks This research applies wavelet transform and neural network techniques to detect water-drop defects that fall across two different background textures of LED chips. The proposed approach uses the multi-layer perceptron neural network with the back-propagation algorithm to judge the existence of water-drop blemishes through multivariate processing that combines image characteristics from the wavelet decomposition of local image blocks. Experimental results show that the wavelet-based BPN approach achieves an above 96.8% detection rate and a below 4.8% false alarm rate in detecting water-drop defects across two different background textures. As indicated by the ROC plot analysis, the wavelet-based BPN method has lower false alarm rates and better detection rates than the Otsu method. Regarding directions for future research, the proposed approach can be extended to the detection of other semi-opaque and low-intensity-contrast image defects falling across two different background textures.
Acknowledgments This study was partially supported by the National Science Council of Taiwan (R.O.C.), Project No. NSC 95-2221-E-324-034-MY2.
References 1. Siew, L.H., Hodgson, R.M., Wee, L.K.: Texture Measures for Carpet Wear Assessment. IEEE Transactions on Pattern Analysis and Machine Intelligence 10 (1988) 92-150 2. Latif-Amet, A., Ertüzün, A., Ercil, A.: An Efficient Method for Texture Defect Detection: Sub-band Domain Co-occurrence Matrices. Image and Vision Computing 18 (2000) 543-553 3. Tsai, D.M., Hsiao, B.: Automatic Surface Inspection Using Wavelet Reconstruction. Pattern Recognition 34 (2001) 1285-1305 4. Tsai, D.M., Wu, S.K.: Automated Surface Inspection Using Gabor Filters. International Journal of Advanced Manufacturing Technology 16 (2000) 474-482 5. Lin, H.D., Ho, D.C.: Detection of Tiny Surface Defects Using DCT Based Enhancement Approach In Computer Vision Systems, IEEE/ASME International Conference on Advanced Intelligent Mechatronics (IEEE/ASME AIM-2005) 373-378 6. Lin, H.D., Chiu, S.W.: Computer-Aided Vision System for MURA-Type Defect Inspection in Liquid Crystal Displays, Lecture Notes in Computer Science 4319 (2006) 442-452 7. Lu, Chi-Jie, Tsai, Du-Ming.: Defect Inspection of Patterned TFT-LCD Panels Using a Fast Sub-image Based SVD. International Journal of Production Research 42 (2004) 4331-4351 8. Shankar, N.G., Zhong, Z.W.: A Rule-based Computing Approach for the Segmentation of Semiconductor Defects. Microelectronics Journal 37 (2006) 500-509 9. Shankar, N.G., Zhong, Z.W.: Defect Detection on Semiconductor Wafer Surfaces. Microelectronic Engineering 77 (2005) 337-346 10. Fadzil, M.H., Ahmed, Weng, C.J.: LED Cosmetic Flaw Vision Inspection System. Pattern Analysis & Application 1 (1998) 62-70 11. Arivazhagan, S., Ganesan, L.: Texture Segmentation Using Wavelet Transform. Pattern Recognition Letters 24 (2003) 3197-3203 12. Bashar, M.K., Matsumoto, T., Ohnishi, N.: Wavelet Transform-based Locally Orderless Images for Texture Segmentation. Pattern Recognition Letters 24 (2003) 2633-2650 13. Gonzalez, Rafael C., Woods, Richard E.: Digital Image Processing. 2nd edn. PrenticeHall, Upper Saddle River, NJ (2002) 349-403 14. Kang, B.S., Park, S.C.: Integrated Machine Learning Approaches for Complementing Statistical Process Control Procedures. Decision Support Systems 29 (2000) 59-72 15. Smith, A.E.: X-bar and R Control Chart Interpretation Using Neural Computing. International Journal of Production Research 32 (1994) 309-320 16. Hush, D.R., Horne, B.G.: Progress in Supervised Neural Networks. IEEE Signal Processing Magazine, January (1993) 8-39 17. Jain, R., Kasturi R., Schunck, B.G.: Machine Vision. International edn. McGRAW-Hill, New York (1995) 80-83 18. Otsu, N.: A threshold selection method from gray-level histogram. IEEE Transactions on Systems, Man, Cybernetics 9 (1979) 62-66
Graphic Symbol Recognition of Engineering Drawings Based on Multi-Scale Autoconvolution Transform* Chuan-Min Zhai1 and Ji-Xiang Du1,2 1
Department of Computer Science and Technology, Huaqiao University, China Department of Automation, University of Science and Technology of China
[email protected],
[email protected]
2
Abstract. In this paper, a novel graphic symbol recognition of scanned engineering drawing method based on multi-scale autoconvolution transform and radial basis probabilistic neural network (RBPNN) is proposed. Firstly, the recently proposed affine invariant image transform called Multi-Scale Autoconvolution (MSA) is adopted to extract invariant features. Then, the orthogonal least square algorithm (OLSA) is used to train the RBPNN and the recursive OLSA is adopted to optimize the structure of the RBPNN. The experimental result shows that, compared with another affine invariant technique, this new method provides a good basis for the scanned engineering drawing recognition task where the disturbances of graphic symbol can be approximated with spatial affine transformation.
1 Introduction Automatic scanning, recognition, reconstruction and interpretation of engineering drawings are now widely used in industry for automatic input into CAD or GIS systems. It is very fast and easy to convert an engineering drawing into a digital raster form. However, it is very hard to convert raster images into high-level models, and doing so usually requires highly developed software. The aim of engineering drawing scanning and recognition is to obtain a representation of the engineering drawing image in terms of universal engineering (CAD/CAM) entities: graphic symbols, contour lines, symmetry axes, hatched areas, dimensions, blocks and others. There are already many papers and systems devoted to automatic input and interpretation of engineering drawings [1, 2, 3], most of which are devoted to image vectorization and recognition of simple primitives. In this paper, we focus on the recognition of graphic symbols of scanned engineering drawings under different geometric transformations, such as rotation and scale transformations (for example, an arrow symbol at different orientations and sizes), which is a key problem in engineering drawing recognition. It is well known that descriptors like signatures, Fourier descriptors [4], affine moment invariants [5] or Global Affine
This work was supported by the Scientific Research Foundation of Huaqiao University (NO.06BS217), the Postdoctoral Science Foundation of China (NO.20060390180), and the Youth Technological Talent Innovative Project of Fujian Province (NO.2006F3086).
Transformation Correlation [6] can often be made invariant under various geometric distortions in the spatial domain. However, in most cases either the computational complexity or the recognition accuracy limits their applicability. The Multi-Scale Autoconvolution (MSA) method recently introduced in [7, 8] offers a novel way of approaching the problem just described. The method provides affine invariant features with only moderate computational complexity and without the somewhat difficult object segmentation. Thus, in this paper, the new affine invariant features of MSA are applied. Then, to implement the classification task, a Radial Basis Probabilistic Neural Network (RBPNN) classifier is used [9, 10]. The paper is organized so that in Section 2 we present the Multi-Scale Autoconvolution (MSA) transform method. In Section 3, the Radial Basis Probabilistic Neural Network (RBPNN) classifier is described, and then Sections 4 and 5 describe the experiments performed and explain the results.
2 Multi-Scale Autoconvolution (MSA) Transform The MSA transform derived in this section relies on the coordinate mapping defined by any three image points called a basis-triplet. The MSA image transform presented here does not require any segmentation of interest points; it treats the image as a probability density function of the interest points, denoted by p(x, y). The simplest way of forming p(x, y) is to normalize the image intensity function f(x, y) in the following manner:

p(x, y) = f(x, y) / ∫_{D²} f(x, y) dx dy   (1)

where D² is the domain of f. Clearly, p(x, y) can also be defined based on some filtered version of f(x, y) where certain properties of the image have been enhanced or suppressed.
Let f be a function in L¹(R²) ∩ L²(R²) with f ≥ 0, and let X0, X1 and X2 be independent random variables with values in R², so that

P(Xj = xj) = f(xj) / ‖f‖_{L¹}   (2)

For α, β ∈ R, define a random variable

U_{α,β} = X0 + α(X1 − X0) + β(X2 − X0)   (3)

Then one has U_{α,β} = αX1 + βX2 + γX0, where γ = 1 − α − β. Now it can be easily shown that U_{α,β} has a probability density function

P(U_{α,β} = u) = (1 / ‖f‖³_{L¹}) (f_α ∗ f_β ∗ f_γ)(u)   (4)

where f_a(x) = a⁻² f(x/a) for a ≠ 0, and f_a = ‖f‖_{L¹} δ₀ for a = 0.
For α, β ∈ R, define the MSA transform of f by

F(α, β) = E{ f(U_{α,β}) }   (5)
Writing this out in terms of the probability density function gives

F(α, β) = ∫ f(u) P(U_{α,β} = u) du = (1 / ‖f‖³_{L¹}) ∫ f(u) (f_α ∗ f_β ∗ f_γ)(u) du
        = (1 / (‖f‖³_{L¹} (αβγ)²)) ∫∫∫ f(u) f((u − x − y)/γ) f(x/α) f(y/β) dx dy du   (6)

if α, β, γ ≠ 0, with straightforward modifications if one of these numbers is zero. Taking the Fourier transform and using the convolution and correlation theorems, one has
F(α, β) = (1 / (2π)²) (1 / f̂(0)³) ∫ f̂(−ξ) f̂(αξ) f̂(βξ) f̂(γξ) dξ   (7)
which holds for all α, β. As we can see from (5) and (6), the transform is based on multi-scale convolution kernels, and thus it is called Multi-Scale Autoconvolution (MSA). From the above description, some important properties of Multi-Scale Autoconvolution can be derived: Property 1: The transform coefficients F(α, β) are invariant against affine transformations of the image coordinates. Property 2: Given the 2D Fourier transforms p̂_α(w), p̂_β(w), p̂_γ(w) and f̂(w) of p_α(u), p_β(u), p_γ(u) and f(u), respectively, the transform coefficients F(α, β) are given by
F(α, β) = (1 / (αβγ)²) ∫∫_{−∞}^{∞} p̂_α(w) p̂_β(w) p̂_γ(w) f̂*(w) dw1 dw2   (8)
Property 3: Multi-Scale Autoconvolution can be generalized to cover a group of transforms that are invariant against other linear transformations of the image coordinates.
A discrete version of MSA is needed with digital images. By discretizing (5) and (6) and using the convolution property of the discrete Fourier transform (DFT) it follows that
F(α, β) = (1/N²) Σ_i p̂_α(w_i) p̂_β(w_i) p̂_γ(w_i) f̂*(w_i)   (9)

where the sum runs over all N × N DFT frequency samples w_i, and p̂ and f̂ are the discrete Fourier transform pairs of p and f, respectively. Notice that p̂_{−α}(w_i) = p̂_α*(w_i), and p̂_0(w_i) = 1.
Assuming that the image
f ( x ) is an M × M matrix, the wrap-around error can be
avoided by selecting the transform length N ≥ (α + β + γ )M − 2 . In order to avoid
large FFT lengths, α and β should be reasonably small numbers, but not necessarily integers. For example, the range α, β ∈ (−1.5, 2) is often enough for object recognition purposes. Furthermore, it turns out that classification can be performed based on an even smaller set of coefficients that are chosen carefully.
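The sketch below is not the FFT-based implementation of Eq. (9); it is a simple Monte-Carlo estimate of the definition in Eq. (5), drawing the basis points X0, X1 and X2 from the image interpreted as a density as in Eq. (1). It is our own illustration (the function name and sample count are assumptions), and it treats the image as zero outside its support:

import numpy as np

def msa_coefficient(f, alpha, beta, n_samples=20000, rng=None):
    # Monte-Carlo estimate of Eq. (5): F(alpha, beta) = E{ f(U_{alpha,beta}) }
    rng = rng or np.random.default_rng(0)
    gamma = 1.0 - alpha - beta
    p = f.ravel().astype(float)
    p /= p.sum()
    idx = rng.choice(p.size, size=(3, n_samples), p=p)           # X0, X1, X2 samples
    pts = np.stack(np.unravel_index(idx, f.shape), axis=-1).astype(float)
    u = alpha * pts[1] + beta * pts[2] + gamma * pts[0]           # Eq. (3)
    u = np.rint(u).astype(int)
    inside = ((u >= 0) & (u < np.array(f.shape))).all(axis=1)
    vals = np.zeros(n_samples)
    vals[inside] = f[u[inside, 0], u[inside, 1]]                  # f is zero outside
    return vals.mean()

# toy usage on a random 32x32 "image"
F = msa_coefficient(np.random.default_rng(1).random((32, 32)), 0.5, -0.5)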
3 Radial Basis Probabilistic Neural Network (RBPNN) Model The RBPNN model [9-12], as shown in Fig. 1, was derived from the radial basis function neural network (RBFNN) and the probabilistic neural network (PNN). Hence it possesses the advantages of the above two networks while reducing their demerits.
Fig. 1. The topology scheme of radial basis probabilistic neural network
In mathematics, for an input vector x, the actual output value of the ith output neuron of the RBPNN, y_i^α, can be expressed as:

y_i^α = Σ_{k=1}^{M} w_{ik} h_k(x)   (10)

h_k(x) = Σ_{i=1}^{n_k} φ_i(‖x − c_{ki}‖₂) ,   k = 1, 2, ..., M   (11)
Here h_k(x) is the kth output value of the second hidden layer of the RBPNN, and φ_i(·) is the kernel function, which is generally a Gaussian kernel function and can be written as
φ_i(‖x − c_{ki}‖₂) = exp(−‖x − c_{ki}‖₂² / σ_i²)   (12)

The purpose of training the network is to ensure that the synaptic weights change in the direction that minimizes the squared error between the teacher signal and the actual output. Generally, the training algorithms for the RBPNN include the orthogonal least squares algorithm (OLSA), the recursive least squares algorithm (RLSA), etc. These two methods share the advantages of fast convergence and good convergence accuracy. The RLSA, which requires good initial conditions, however, is suited to problems with large training sample sets. As the OLSA makes full use of matrix computations, such as the orthogonal decomposition of matrices, its training speed and convergence accuracy are higher than those of the RLSA. So the OLSA is preferred in this paper, and its details can be found in the literature [10]. Two methods for the structure optimization of the RBPNN can be found in the literature [11][12]: an optimization method based on genetic algorithms (GA) was proposed in [11], and one based on the recursive orthogonal least squares algorithm (ROLS) in [12]. Compared with ROLS, GA is a global search method, but it usually requires more computations and takes a longer training time. On the other hand, the ROLS is a backward selection algorithm: the philosophy of this method is to sequentially remove from the network one hidden node at a time, i.e., the center that causes the smallest increase in training error. The details of the ROLS method are described in the literature [12]. In this paper the ROLS is preferred. The key point of optimizing the RBPNN is the selection of the hidden centers of the first hidden layer, which involves not only how many hidden centers are selected, but also at what locations in space the hidden centers are placed. Usually, we wish the number of selected centers to be as small as possible, for fewer hidden centers will not only simplify the training and testing of the network but also improve its generalization capability. On the other hand, the locations of the hidden centers in space are of utmost importance to the performance of the network: with the number of hidden centers fixed, different locations for the hidden centers can lead to different network performance. In this paper, the ROLS is used to select the hidden centers of the first hidden layer of the RBPNN and to optimize the structure of the RBPNN.
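A compact sketch of the RBPNN forward pass of Eqs. (10)-(12) is given below (our illustration in NumPy; for simplicity a single kernel width sigma is assumed instead of per-kernel widths σ_i, and center_class records which class each selected first-layer center belongs to):

import numpy as np

def rbpnn_forward(x, centers, center_class, sigma, W):
    # first hidden layer, Eq. (12): Gaussian kernel at every selected center
    phi = np.exp(-np.sum((centers - x) ** 2, axis=1) / sigma ** 2)
    # second hidden layer, Eq. (11): sum the kernels belonging to each class
    M = W.shape[1]
    h = np.array([phi[center_class == k].sum() for k in range(M)])
    # output layer, Eq. (10): weighted sum of the class-wise sums
    return W @ h

# illustrative dimensions from the experiments: 20-D MSA features,
# about 96 selected centers, 30 symbol classes
rng = np.random.default_rng(0)
centers = rng.random((96, 20))
center_class = rng.integers(0, 30, 96)
W = rng.random((30, 30))        # output weights (trained by the OLSA in the paper)
out = rbpnn_forward(rng.random(20), centers, center_class, 2.0, W)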
4 Experimental Results In order to verify the performance of the proposed method based on the MSA transform, we performed an experiment comparing it with a well-known affine invariant method, the four polynomials of the 2nd and 3rd order affine moment invariants [5]. The graphic symbol image database used in the following experiment was segmented from scanned engineering drawings and preprocessed; it includes 30 kinds of commonly used graphic symbols, 976 images in total. The size of each image is scaled to 16 × 16. Each kind includes about 30 images, 10 of which are randomly
selected as the training samples and the rest as the test samples. A subset of the training samples is shown in Fig. 2. All the algorithms were programmed in Matlab 6.5 and run on a Pentium 4 with a 3.0 GHz clock and 512 MB of RAM under Microsoft Windows XP. Each of the following experiments is repeated 20 times, and the average result is reported.
Fig. 2. A subset of the training samples
Firstly, we adopted only the MSA transform to extract the affine invariant image features, whose dimension is 20. 300 training samples (20-dimensional vectors) were randomly selected as the hidden centers of the first hidden layer. The number of second hidden layer neurons is set to 30, and the number of output layer neurons is also set to 30. In order to prune the RBPNN, the recursive orthogonal least squares algorithm was used to optimize its structure. As a result, the number of selected hidden centers of the first hidden layer is reduced from 300 to 96. Repeating this experiment 20 times, the average number of optimized selected hidden centers of the first hidden layer is 92, and the average correct recognition rate on the testing samples is 90.71%. Secondly, for comparison with the RBPNN classifier, with the same training and testing samples, by selecting all 300 training samples as the initial hidden centers and optimizing with the same recursive orthogonal least squares algorithm as used for the RBPNN, the average correct recognition rate of the RBFNN over 20 runs is 88.76%, and the average number of optimized selected hidden centers of the first hidden layer is 107. The average recognition rate of the BP neural network (BPNN) on these data is 85.83%, with the number of hidden neurons set to 50. These results show that the recognition rates of these classifiers are lower than that of the RBPNN. Finally, to demonstrate the superiority of the MSA transform method, an experiment was performed to compare it with the well-known affine invariant moment method. In this experiment, for both methods, the same 300 training samples were

Table 1. Classification performance comparison between RBPNN and other classifiers for the MSA method
Classifiers             Average Recognition Rate (%)
RBPNN (20-92-30-30)     90.71 ± 0.06
RBFNN (20-107-30)       88.76 ± 0.07
BPNN (20-50-30)         85.83 ± 0.05
randomly selected, and the rest used as the original test samples. Different from the above experiments, the new test sample sets were formed from the original test samples mixed with 5 groups of zero-mean Gaussian white noise with variances in the interval [0, 0.1). The RBPNN was adopted for both the MSA features and the affine invariant moment features. At each tested noise level, the experiment was repeated 10 times. The average results are shown in Table 2. As we can see from the results, the overall performance of the MSA transform approach is superior to the affine invariant moment method at all tested noise levels.

Table 2. Recognition performance comparison between the MSA and affine invariant moment methods
Noise Variances     MSA transform (%)     Affine invariant moment (%)
0.01                89.74 ± 0.04          86.14 ± 0.01
0.02                87.20 ± 0.07          84.61 ± 0.04
0.04                84.88 ± 0.02          78.11 ± 0.05
0.06                78.88 ± 0.06          70.32 ± 0.01
0.08                70.01 ± 0.08          61.01 ± 0.06
From the above experimental results, it can be observed that, for the graphic symbol recognition of scanned engineering drawings, the MSA transform method combined with the RBPNN classifier can achieve a higher statistical recognition rate, even at different noise levels.
5 Conclusions This paper proposed a novel method for graphic symbol recognition in scanned engineering drawings based on the multi-scale autoconvolution transform and the radial basis probabilistic neural network (RBPNN). The orthogonal least squares algorithm (OLSA) is used to train the RBPNN and the recursive OLSA is adopted to optimize its structure. The Multi-Scale Autoconvolution (MSA) transform is used to extract affine invariant image features. The experimental results show that this approach is effective, efficient and feasible, and also demonstrate that the MSA transform is promising for image processing and that the RBPNN is a very promising neural network model for practical applications.
References 1. Joseph, S.H., Pridmore, T.P.: Knowledge-directed Interpretation of Mechanical Engineering Drawings. IEEE Trans. on Pattern Analysis and Machine Intelligence 14 (1992) 928-940 2. Ablameik, S.V., Bereishik, V.I., Frantskevich, O.V. , Melnik, E.L., Khomenko, ML , Paramonova, N.L: Interpretation of Engineering Drawings: Techniques and Experimental Results. Pattern Recognition and Image Analysis 5 (1995) 380-401
3. Dong, Y., Zhao, H., Wang P.: Analysis of Research Status of Engineering Drawings Recognition and Interpretation. Journal of Hefei University of Technology, 28 (2005) 29-33 4. Gonzales, R.C., Woods, R.E.: Digital Image Processing. Addison-Wesley, Readings, MA, 1993. 5. Flusser, J., Suk. T.: Pattern Recognition by Affine Moment Invariants. Pattern Recognition 26 (1993)167–174 6. Ben-Arie., J., Wang, Z.: Pictorial Recognition of Objects Employing Affine Invariance in the Frequency Domain. IEEE Trans. Pattern Analysis and Machine Intelligence 20 (1998) 604–618 7. Heikkilä J. A.: Multi-scale Autoconvolution for Affine Invariant Pattern Recognition. 16th Int’l Conf. on Pattern Recognition 1 (2002)119 –122 8. Rahtu, E., Heikkilä, J.: Object Classification With Multi-scale Autoconvolution. 17th International Conference on Pattern Recognition (ICPR 2004), Cambridge, UK, 3 (2004) 37-40 9. Huang, D.S.: Radial Basis Probabilistic Neural Networks: Model and Application. International Journal of Pattern Recognition and Artificial Intelligence 13 (1999) 1083-1101 10. Huang, D.S.: Systematic Theory of Neural Networks for Pattern Recognition. Publishing House of Electronic Industry of China, Beijing (1996) 11. Zhao, W.B., Huang, D.S.: The Structure Optimization of Radial Basis Probabilistic Networks Based on Genetic Algorithm. In: IJCNN2002, Hilton Hawaiian Village Hotel, Honolulu, Hawaii, (2002) 1086-1091 12. Zhao, W.B., Huang, D.S.: Application of Recursive Orthogonal Least Squares Algorithm to the Structure Optimization of Radial Basis Probabilistic Neural Networks. In: ICSP 2002, Beijing, China, (2002) 1211-1214
Driver Fatigue Detection by Fusing Multiple Cues Rajinda Senaratne, David Hardy, Bill Vanderaa, and Saman Halgamuge Dynamic Systems and Control Research Group, Department of Mechanical and Manufacturing Engineering, The University of Melbourne, Australia
[email protected],
[email protected],
[email protected],
[email protected]
Abstract. A video-based driver fatigue detection system is presented. The system automatically locates the face in the first frame, and then tracks the eyes in subsequent frames. Four cues which characterise fatigue are used to determine the fatigue level. We used Support Vector Machines to estimate the percentage eye closure, which is the strongest cue. Improved results were achieved by using Support Vector Machines in comparison to a Naive Bayes classifier. The performance was further improved by fusing all four cues using fuzzy rules.
1
Introduction
Driver fatigue is one of the major causes of the increasing number of road accidents; it is found that driver fatigue accounts for 35-45% of all vehicle accidents [1]. Therefore, developing systems to automatically detect driver fatigue, and thereby prevent accidents by warning the driver in advance, has received increased interest in the research community. Computer vision is considered a more suitable and user-friendly approach for driver fatigue detection in comparison to other approaches such as electroencephalograph (EEG) methods, which are considered intrusive [2]. Several video-based techniques have been proposed to detect driver fatigue [3], [4], [2]. In [3], two cues were used separately: percentage eye closure (PERCLOS) and gaze. However, they were not fused to achieve an improved performance. In [4] and [2], infra-red light was projected onto the face to produce a glow effect on the eyes, by which the eyes were tracked. However, these methods work only under low light conditions [3], [5], and lose tracking when the eyes are closed [4]. In this work, we propose a driver fatigue detection system which fuses four cues to achieve improved performance: PERCLOS, head nodding frequency, and two new cues, slouching frequency and postural adjustment (PA) frequency. Encouraging results were achieved by fusing all four cues using fuzzy rules. The main steps of the system are: - face localization (locating the face automatically in the first frame), - locating and tracking the eyes in subsequent frames,
- estimating the PERCLOS, and the slouching, PA, and nodding frequencies,
- and fusing them to determine the fatigue level of the driver.
To determine PERCLOS, we experimented with two classifiers: Naive Bayes (NB) and Support Vector Machines (SVM). SVM gave higher accuracies in classifying the open/closed state of the eyes. The organization of the paper is as follows. Section 2 describes the face localization and the eye tracking steps. Section 3 details the estimation of the four cues and how they are fused. Section 4 presents the results.
2 Face Localization and Eye Tracking
As the size and the location of the face can vary for different drivers, a reliable method is required to determine them automatically. Therefore, to find the size and the location of the face, we used the face localization phase of the fully automatic face recognition method known as Landmark Model Matching (LMM) [6], [7]. It uses a search mechanism known as Particle Swarm Optimization (PSO) [8] to efficiently find the size and the location of the face. PSO [8], [9], [10] is an evolutionary computation technique which can be used to find optima in complex functions. It relies on the exchange of information between random solutions, called particles, of the population, called the swarm. A particle may consist of several dimensions. Particles fly through the search space with velocities which are dynamically adjusted according to their historical behavior. Each particle adjusts its trajectory towards its own previous best position, called the personal best, and towards the best previous position attained by any particle of the swarm, called the global best. In LMM [7], a face is represented by a Landmark Model (LM) consisting of N nodes corresponding to N facial landmarks. A Landmark Distribution Model (LDM) is created from a few training face images. In our experiments, 3 images from 3 different training video sequences recorded from 3 different subjects were used to create the LDM. The size and the location of a new face in the first frame of a new sequence can be found automatically by obtaining the optimal LM that fits this new face. To obtain the optimal LM, a new LM is first computed for this new face, and it is then fitted to the new face by an iterative matching procedure where its geometrical structure is deformed (based on the principal components of the LDM node locations) until the model similarity between the LM and the LDM is maximised using a PSO algorithm. In the PSO algorithm, a particle corresponds to an LM. For example, the nodes corresponding to a particle during the iteration process are shown in Fig. 1. Once the face has been located, it is necessary to find the eyes accurately, as this is vital for estimating the PERCLOS. Since more emphasis is given to finding the head boundary in the face localization step, for certain faces the eyes may occasionally not be located accurately by LMM. Even though LMM can be extended to accurately find the eyes, this increases the computational cost considerably. Therefore, a more computationally efficient method is preferred to accurately locate the eyes in the first frame, and thereafter to
Fig. 1. Nodes of an LM that corresponds to a particle during initialization and after iteration 1, 2, 3, 4, 6, 8, and 30
track them in subsequent frames. Thus, we used an edge map to find the approximate y-coordinate of the eyes in a similar way to that in [11], and then an adaptive thresholding mechanism for segmentation and a Connected-Component Algorithm (CCA) to accurately find the iris of the eyes (using the fact that the darker regions are candidates for the iris), as illustrated in Fig. 2. This method worked well for both open and closed eyes, since a closed eye also forms a dark region by its eye lashes. To track the eyes in subsequent frames, each eye was searched within a rectangular window around its location in the previous frame.
Fig. 2. Eye localization process: (a) result from LMM (b) masked edge map (c) edge projection (d) segmented regions within rectangular windows around the approximately located eyes (e) accurately located eyes after using the CCA
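The PSO search used by LMM for face localization can be illustrated with a small sketch. The following Python fragment is a generic PSO loop, not the authors' implementation: the inertia weight, acceleration constants, swarm size, and the toy objective standing in for the negative LM/LDM similarity are illustrative assumptions only.

```python
import numpy as np

def pso(objective, dim, n_particles=20, n_iters=30,
        w=0.7, c1=1.5, c2=1.5, bounds=(-1.0, 1.0), seed=0):
    """Minimal PSO: every particle tracks its personal best, the swarm tracks
    a global best, and velocities are pulled toward both attractors."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    x = rng.uniform(lo, hi, (n_particles, dim))        # particle positions (one candidate LM each)
    v = np.zeros((n_particles, dim))                   # particle velocities
    pbest = x.copy()                                   # personal best positions
    pbest_val = np.array([objective(p) for p in x])    # personal best scores
    gbest = pbest[pbest_val.argmin()].copy()           # global best position

    for _ in range(n_iters):
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        # inertia + pull toward personal best + pull toward global best
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
        x = np.clip(x + v, lo, hi)
        vals = np.array([objective(p) for p in x])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = x[improved], vals[improved]
        gbest = pbest[pbest_val.argmin()].copy()
    return gbest, float(pbest_val.min())

# Toy usage: minimize a quadratic standing in for the negative LM/LDM similarity.
best, score = pso(lambda p: float(np.sum((p - 0.3) ** 2)), dim=4)
print(best, score)
```

Each particle keeps a personal best and the swarm keeps a global best, the two attractors described above for the LM fitting.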
3 Fatigue Detection
We analysed the ocular and the head movement behavior to detect the fatigue level. The ocular measure we computed was the PERCLOS, and the head movement measures we considered were the slouching, PA, and nodding frequencies.
3.1 Estimation of PERCLOS
PERCLOS is the percentage of the time duration that the eyes are closed 80% or more (of the area of the iris and white sclera), within a predefined time
period. PERCLOS is considered one of the strongest cues in detecting driver fatigue [4]. As the fatigue level of a driver increases, his eye closure increases, resulting in a high PERCLOS value. To calculate PERCLOS, each eye in every frame has to be classified into one of two classes: open or closed. If the eye is closed 80% or more, it is considered a closed eye. In order to classify the eyes, we used the image intensity values of a rectangular image portion (RIP) of 15×10 pixels surrounding each eye. Examples of RIPs of open and closed eyes are shown in Fig. 3 (a) and (c). RIPs were histogram equalized to minimize the effects of lighting variations, and then quantized to 8 intensity levels, as shown in Fig. 3 (b) and (d). Then, to classify them, we tested two classifiers: NB and SVM. No other work has attempted to use either an NB or an SVM classifier to determine PERCLOS. Our NB classifier used an approach similar to that in [12].
Fig. 3. Examples of RIPs: (a) RIP of an open left eye (b) its equalized and quantized result (c) RIP of a closed left eye (d) its equalised and quantised result
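As a rough illustration of the RIP preparation described above (histogram equalization followed by quantization to 8 intensity levels), the sketch below turns a 15×10 grayscale patch into a 150-dimensional feature vector. It is not the authors' code; the plain cumulative-histogram equalization and the random placeholder patch are assumptions.

```python
import numpy as np

def preprocess_rip(rip):
    """Histogram-equalize a 15x10 uint8 eye patch and quantize it to 8 levels,
    mirroring the RIP preparation used before open/closed classification."""
    rip = np.asarray(rip, dtype=np.uint8)
    hist = np.bincount(rip.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0].min()
    # classic histogram equalization via the cumulative distribution function
    equalized = np.round((cdf[rip] - cdf_min) / max(cdf[-1] - cdf_min, 1) * 255).astype(np.uint8)
    quantized = equalized // 32                    # 256 intensities -> 8 levels (0..7)
    return quantized.ravel().astype(np.float32)    # 150-dimensional feature vector

patch = (np.random.rand(15, 10) * 255).astype(np.uint8)  # stand-in for a cropped eye RIP
print(preprocess_rip(patch).shape)  # (150,)
```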
Given a set of training pairs (x_i, y_i), i = 1, ..., l, where x_i ∈ R^n and y_i ∈ {1, −1}, SVM [13] finds a hyperplane which separates the training data by a maximal margin in a higher dimensional feature space by minimizing (1/2) w^T w + C Σ_{i=1}^{l} ξ_i, subject to the constraints y_i (w^T φ(x_i) + b) ≥ 1 − ξ_i and ξ_i ≥ 0, where w is a vector perpendicular to the hyperplane, C is the penalty of the error term, b is the bias, and ξ_i is the slack variable. The kernel K can be expressed as K(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩, where the function φ maps the training vectors to the feature space. The SVM performance varies with different kernels. Three popular kernels are the linear kernel: K = x_i^T x_j; the polynomial kernel: K = (γ x_i^T x_j + r)^d, where d is the degree of the polynomial and γ > 0; and the radial basis function (rbf) kernel: K = exp(−σ ||x_i − x_j||^2), where σ > 0. We tested these three kernels using the Matlab SVM library LIBSVM [14]. PERCLOS was calculated as the ratio of the number of frames where both eyes were closed within a predefined time window (600 frames or 20 seconds in our experiments), to the total number of frames within that window. If the PERCLOS is higher than a predefined threshold, the driver is warned.
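To make the classification and PERCLOS computation concrete, the sketch below trains an SVM with a degree-1 polynomial kernel on RIP vectors and then counts closed-eye frames over a sliding 600-frame window. It uses scikit-learn instead of the Matlab LIBSVM interface cited in the paper, and the training data are random placeholders; C = 1, d = 1, γ = 0.008, the 600-frame window, and the 0.06 alert threshold follow the values reported in the text.

```python
import numpy as np
from sklearn.svm import SVC

# Placeholder training data: 150-dim RIP vectors labelled 1 (closed) or 0 (open).
rng = np.random.default_rng(0)
X_train = rng.random((2000, 150))
y_train = rng.integers(0, 2, 2000)

# Degree-1 polynomial kernel with C=1, as in the reported best configuration.
clf = SVC(kernel="poly", degree=1, gamma=0.008, C=1.0)
clf.fit(X_train, y_train)

def perclos(frame_features, window=600):
    """PERCLOS over a sliding window: fraction of frames classified as closed."""
    closed = clf.predict(frame_features)          # 1 = eyes closed in that frame
    out = []
    for t in range(len(closed)):
        start = max(0, t - window + 1)
        out.append(closed[start:t + 1].mean())
    return np.array(out)

video_feats = rng.random((1200, 150))             # stand-in for per-frame RIP features
p = perclos(video_feats)
alerts = p > 0.06                                 # PERCLOS threshold reported in Sect. 4
```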
3.2 Analysis of Head Movements, and Fusion of the Cues
None of the previous head movement analyses [3], [4], [2] has considered slouching or postural adjustments (PAs). We propose to use these two cues based on the psychophysiological studies in [15], [16]. These studies have shown that when a driver becomes fatigued, he tends to shrivel or slouch in his seat. During a slouch, the mean head position lowers slowly and continues to stay at a lower position for a while. PAs are the reactions made by the driver to arouse himself
when he realizes that he is becoming fatigued. The most common of these is lifting the upper body to adjust the sitting position. In such a behavior, the vertical position of the head increases quickly, and comes back approximately to the normal position. Head nodding can be detected by a quick decrease in the vertical position of the head followed by a quick return to the normal position. The slouching, PA, and nodding frequencies within a predefined time period (3000 frames in our experiments) can be used as cues to detect fatigue. We used a fuzzy system to fuse the four cues. We chose the fuzzy method due to its well-known ability to model linguistic concepts. A Fatigue Level Index (FLI) was calculated for each frame as the indicator of the fatigue level of the driver. For each of the four inputs and the output (FLI), 3 membership classes were chosen: low, moderate, and high. Fuzzy sets of triangular and trapezoidal shapes were used. Fuzzy rules were manually selected, and 20 rules were used. For example, one of them was: if PERCLOS is high AND slouching frequency is low AND PA frequency is low AND nodding frequency is not high, then FLI is moderate. To alert the driver, we experimentally found that a value of 0.2 is suitable for the FLI threshold.
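A minimal sketch of the fuzzy fusion step is shown below. The triangular membership functions, their ranges, the two illustrative rules, and the weighted-average defuzzification are assumptions made for the example; the actual system uses 20 manually selected rules over four inputs, with trapezoidal as well as triangular sets. Rule 1 transcribes the example rule quoted above.

```python
def tri(x, a, b, c):
    """Triangular membership function with feet a, c and peak b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

def fli(perclos, slouch_f, pa_f, nod_f):
    """Toy Fatigue Level Index: fire two hypothetical rules with min-AND,
    then defuzzify by a weighted average of output-class centres."""
    # Illustrative membership values (ranges are made up for the sketch).
    perclos_high = tri(perclos, 0.04, 0.12, 0.20)
    slouch_low   = tri(slouch_f, -1.0, 0.0, 2.0)
    pa_low       = tri(pa_f,     -1.0, 0.0, 2.0)
    nod_high     = tri(nod_f,     2.0, 5.0, 8.0)

    # Rule 1: PERCLOS high AND slouching low AND PA low AND nodding not high -> FLI moderate
    r_moderate = min(perclos_high, slouch_low, pa_low, 1.0 - nod_high)
    # Rule 2 (hypothetical): PERCLOS high AND nodding high -> FLI high
    r_high = min(perclos_high, nod_high)

    centres = {"low": 0.0, "moderate": 0.5, "high": 1.0}
    num = r_moderate * centres["moderate"] + r_high * centres["high"]
    den = r_moderate + r_high
    return num / den if den > 0 else centres["low"]

print(fli(perclos=0.10, slouch_f=0.5, pa_f=0.5, nod_f=1.0))  # alert if FLI > 0.2
```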
4 Experimental Results
In order to assess the robustness of the tracking algorithm under real-world conditions, we tested it on a real-world driving video sequence of 11,700 frames (6.5 minutes). If the located eye position was within the iris (when the eye is open) or on the center part of the line formed by the eyelashes (when the eye is closed), it was considered accurate, and otherwise an error. Out of the 11,700 frames, the numbers of frames with errors in tracking the left and the right eyes were 267 (2.28%) and 68 (0.58%), respectively. A few tracking results from this sequence are shown in Fig. 4.
Fig. 4. Tracking results of a few frames from the real-world driving video sequence
Indoor video recordings of 5 different subjects were used in our fatigue analysis experiments. In these simulated recordings, the subjects mimic fatigue symptoms, similar to those in [3], [4], and [2]. Therefore, fatigue does not truly exist; however, it can be inferred from the visual behavior. To train the SVM/NB classifier, 2000 frames from 2 training sequences of 2 different subjects (each sequence consisting of 1000 frames) were used. For testing fatigue detection, 5 sequences of 5 different subjects (2 of them were the same subjects as in the training set) were used, each consisting of 14,400 frames (8 minutes, at 30 frames/s). Each of the 7 sequences had been recorded on a different day.
To compare the classification results, the RIP of each eye in every frame was manually classified as either open or closed. The classification accuracy of the eye state for a sequence was calculated as accuracy = 100×(sensitivity + specificity)/2, where sensitivity = no. of true positives/(no. of true positives + no. of false negatives), and specificity = no. of true negatives/(no. of true negatives + no. of false positives). The classification accuracies for the left and the right eyes were calculated separately and averaged. The results are given in Table 1.

Table 1. Classification Accuracies

Sequence   Using NB         Using SVM classifier (%)
No.        classifier (%)   linear kernel   polynomial kernel   rbf kernel
1          95.7             95.9            96.3                95.6
2          73.2             83.6            85.2                83.0
3          88.0             88.7            96.8                91.3
4          74.4             75.4            80.4                83.5
5          77.8             88.0            89.1                84.4
average    81.8             86.3            89.5                87.5
The given SVM accuracies are for C=1, since each SVM kernel produced its highest mean accuracy for C=1. The given accuracies for the polynomial kernel are for d=1 and γ=0.008, since these parameter values produced the highest mean accuracy for the polynomial kernel. The SVM polynomial kernel gave the highest mean accuracy compared to the NB classifier and the other SVM kernels. The errors may be caused by the uncertainties in manually classifying the training RIPs, as well as the testing RIPs, since the 80% threshold may not be clearly identified. The PERCLOS estimated using the SVM polynomial kernel is compared with the true PERCLOS for one subject in Fig. 5. True PERCLOS is calculated based on the manual classification of the open/closed state of the eye. We experimentally found that a PERCLOS threshold value of 0.06 was more suitable to alert the driver.
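The balanced accuracy measure defined above can be transcribed directly; the counts in the example call are hypothetical.

```python
def eye_state_accuracy(tp, fn, tn, fp):
    """accuracy = 100 * (sensitivity + specificity) / 2, as defined in the text."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return 100.0 * (sensitivity + specificity) / 2.0

print(eye_state_accuracy(tp=410, fn=20, tn=950, fp=60))  # hypothetical counts
```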
Fig. 5. The true PERCLOS and the system-estimated PERCLOS for one subject
Table 2 shows the performance of estimating the four cues. PERCLOS was estimated by the polynomial kernel. PERCLOS error is the error in the system-estimated PERCLOS, compared to the true PERCLOS. To measure the fatigue detection performance, we manually identified the periods where the subject shows fatigue symptoms, similar to that in [4]. Manually
Table 2. Performance of Estimating the Cues

Sequence   Slouches        PAs             Nods            PERCLOS
No.        (a)  (b)  (c)   (a)  (b)  (c)   (a)  (b)  (c)   error (%)
1          8    8    0     9    9    0     3    3    0     06.0
2          7    7    0     2    2    0     7    7    0     15.1
3          6    6    0     7    7    0     1    1    1     10.1
4          6    6    0     3    3    0     3    3    1     24.8
5          6    6    0     7    7    0     1    1    1     06.0
average                                                    12.4
columns: (a) number of true events (slouches/PAs/nods) occurring in the sequence, (b) number of events accurately detected by the system, (c) number of false positives
identifying the periods where the driver is either fatigued or non-fatigued, based on the physical signs, is considered reliable [1]. The variation of the FLI agreed well with the variation of the manually identified periods of fatigue in the sequences. Fatigue detection results are given in Table 3. The fatigue detection error is the ratio of the number of frames where the system-estimated fatigue state (either fatigued or non-fatigued) is different from the manually identified state, to the total number of frames in that sequence. By fusing all 4 cues, the fatigue detection error was reduced in comparison to that achieved by using only PERCLOS.

Table 3. Fatigue Detection Results

Sequence   Fatigue detection error (%)
No.        using only PERCLOS   using all cues
1          10.4                 09.1
2          15.0                 14.9
3          18.3                 14.8
4          23.7                 16.3
5          08.6                 08.3
average    15.2                 12.7
When deciding the PERCLOS threshold and the FLI threshold, all 5 subjects were taken into account rather than just one or two subjects. This requires a compromise, as the PERCLOS may vary for different people, either fatigued or non-fatigued. Instead of manually selecting the fuzzy rules, they can be automatically generated by using tools such as FuNe [17]. Our future work will also investigate the possibility of extending the fuzzy system based on the work in [18], [19], and [20]. For the system to work at night, the same approach may be used with night-vision cameras. The system is under continuous improvement in the Matlab environment, and the programming codes
are not efficiently written. It can run at 6 frames/s on a 3.2 GHz Pentium 4 PC. We believe that implementing it in C with efficient programming would easily enable it to run much faster. Acknowledgments. The authors wish to thank Mr. Peter Sparkes and Mr. Cameron Joss for their support given throughout the project. This project is partially funded by the Australian Research Council.
References 1. Lal, S.K.L., Craig, A.: Driver Fatigue: Electroencephalography and Psychological Assessment. Psychophysiology 39(3) (2002) 313-321 2. Ji, Q., Zhu, Z.W., Lan, P.L.: Real-time Nonintrusive Monitoring and Prediction of Driver Fatigue. IEEE Trans. on Vehicular Technology 53(4) (2004) 1052-1068 3. Smith, P., Shah, M., Lobo, N.D.: Determining Driver Visual Attention with One Camera. IEEE Trans. on Intelligent Transportation Systems 4(4) (2003) 205-218 4. Bergasa, L.M., Nuevo, J., Sotelo, M.A., Barea, R., Lopez, M.E.: Real-time System for Monitoring Driver Vigilance. IEEE Trans. on Intelligent Transportation Systems 7(1) (2006) 63-77 5. Hartley, L., Horberry, T., Mabbott, N., Krueger, G. P.: Review of Fatigue Detection and Prediction Technologies. National Road Transport Commision, Melbourne (2000) 6. Senaratne, R., Halgamuge, S.: Optimised Landmark Model Matching for Face Recognition. In 7th International Conference on Automatic Face and Gesture Recognition (2006) 120-125 7. Senaratne, R., Halgamuge, S.: Optimal Weighting of Landmarks for Face Recognition. Journal of Multimedia 1(3) (2006) 31-41 8. Kennedy, J., Eberhart, R.C.: Particle Swarm Optimization. IEEE International Conference on Neural Networks 4 (1995) 1942-1948 9. Shi, Y., Eberhart, R.C.: Empirical Study of Particle Swarm Optimization. In Proc. of the IEEE Congress on Evolutionary Computation (1999) 1945-1950 10. Ratnaweera, A., Halgamuge, S.K., Watson, H.C.: Self-organizing Hierarchical Particle Swarm Optimizer with Time-varying Acceleration Coefficients. IEEE Trans. on Evolutionary Computation 8(3) (2004) 240-255 11. Stringa, L.: Eyes Detection for Face Recognition. Applied Artificial Intelligence. 7(4) (1993) 365-382 12. Baluja, S.: Using Labeled and Unlabeled Data for Probabilistic Modeling of Face Orientation. International Journal of Pattern Recognition and Artificial Intelligence 14(8) (2000) 1097-1107 13. Cortes, C., Vapnik, V.: Support-vector Networks. Machine Learning 20(3) (1995) 273-297 14. Chang, C.C., Lin, C.J.: LIBSVM: A Library for Support Vector Machines. Software available at http://www.csie.ntu.edu.tw/$\sim$cjlin/libsvm (2001) 15. Popieul, J.C., Simon, P., Loslever, P.: Using Driver’s Head Movements Evolution as A Drowsiness Indicator. In IEEE Intelligent Vehicles Symposium (2003) 616-621 16. Roge, J., Pebayle, T., Muzet, A.: Variations of the Level of Vigilance and of Behavioural Activities during Simulated Automobile Driving. Accident Analysis and Prevention 33(2) (2001) 181-186
17. Halgamuge, S.K.: Self-evolving Neural Networks for Rule-based Data Processing. IEEE Trans. on Signal Processing 45(11) (1997) 2766-2773 18. Halgamuge, S.K., Poechmueller, W., Glesner, M.: An Alternative Approach for Generation of Membership Functions and Fuzzy Rules based on Radial and Cubic Basis Function Networks. International Journal of Approximate Reasoning 12(3-4) (1995) 279-298 19. Halgamuge, S.K., Glesner, M.: Fuzzy Neural Networks - Between Functional Equivalence and Applicability. International Journal of Neural Systems 6(2) (1995) 185-196 20. Halgamuge, S.K.: A Trainable Transparent Universal Approximator for Defuzzification in Mamdani-type Neuro-fuzzy Controllers. IEEE Trans. on Fuzzy Systems 6(2) (1998) 304-314
Palmprint Recognition Using a Novel Sparse Coding Technique Li Shang, Fenwen Cao, Zhiqiang Zhao, Jie Chen, and Yu Zhang Dept. of Electronic Information Engineering, Suzhou Vocational University, Jiangsu 215104, China
[email protected], {cfw,zzq,cj,zhyu}@jssvc.edu.cn
Abstract. This paper proposes a novel recognition method for palmprints using a new sparse coding (SC) algorithm proposed by us. This algorithm exploits the maximum kurtosis as the sparseness measure criterion; at the same time, a fixed variance term for the sparse coefficients is used to yield a fixed information capacity. Experimental results show that the feature basis vectors of palmprint images can be successfully extracted by using our SC algorithm. Using the radial basis probabilistic neural network (RBPNN), the classification task can be implemented easily. Finally, compared with the methods of principal component analysis (PCA) and classical SC, simulation results show that our algorithm is indeed efficient and effective in performing the palmprint recognition task.
1 Introduction
More and more new approaches for recognizing palmprints have been explored, such as eigenpalms [1], the Fourier transform [2], the wavelet transform [3], principal component analysis (PCA) and independent component analysis (ICA) [4], etc. The significant advantage of PCA and ICA is that they rely only on the statistical properties of the input data. However, PCA can only separate pairwise linear dependencies between pixels, while higher-order dependencies will still show up in the joint distribution of the PCA coefficients. In contrast, ICA is very sensitive to these higher-order statistics. In particular, when ICA is applied to natural images, it is just a particular form of SC [4]. However, ICA emphasizes independence over sparsity in the output coefficients, while SC requires that the output coefficients be sparse and as independent as possible. In fact, because of the sparse structure of natural images, SC is more suitable for processing natural images than ICA. Hence, the SC method has been widely used in natural image processing [5-7]. The contribution of this paper is that a novel sparse coding (SC) algorithm, based on the maximum-kurtosis sparseness measure criterion and a determinative initialization of the basis functions, is proposed. This algorithm can fix the signal-to-noise ratio in the representation and yield a fixed information capacity. In addition, the modified Amari natural gradient descent algorithm with an amnesic factor is exploited to update the coefficients [8]. Compared with the classical SC algorithm, our SC algorithm balances well between redundancy reduction and redundant representation. Finally, the experimental results
show that the palmprint features are successfully extracted by using our method, and using the radial basis probabilistic neural network (RBPNN) model, the task of palmprint recognition is easily implemented.
2 The Extended Sparse Coding Algorithm
2.1 Modeling the Objective Function
Referring to the classical SC algorithm [5], and combining the minimum image reconstruction error with kurtosis and a fixed variance, we construct the following cost function for the minimization problem:

J(A, S) = (1/2) Σ_{x,y} [ X(x,y) − Σ_i a_i(x,y) s_i ]² − λ1 Σ_i |kurt(s_i)| + λ2 Σ_i [ ⟨s_i²⟩ / σ_t ]² .   (1)

where the symbol ⟨·⟩ denotes the mean, X = (x1, x2, …, xn)^T denotes the n-dimensional input data, A = (a1, a2, …, am) denotes the feature basis vectors, and S = (s1, s2, …, sm)^T denotes the m-dimensional sparse coefficients (commonly m ≤ n; note that only the case m = n is considered here, i.e., X is a square matrix). Parameters λ1 and λ2 are positive constants, and σ_t² is the scale of the coefficient variance, generally set to the variance of an image. In Eqn. (1), the first term is the image reconstruction error and it ensures a good representation for a given image. The second term is the sparseness measure based on the absolute value of the kurtosis, |kurt(s_i)|. The last term, a fixed variance term, penalizes the case in which the coefficient variance of the ith vector, ⟨s_i²⟩, deviates from its target value σ_t². Without this term, the variance could become so small that only the sparseness constraint is satisfied, and the image reconstruction error would become large, which is not desirable either.
2.2 Learning Rules
To ensure convergence and speed up the search for the optimal weights, we use the modified Amari natural gradient descent algorithm with an amnesic factor [9] to update the weight matrix W (S = WX, W = [w1, w2, …, wn]). The updating formula is defined as follows:

dW/dt = −μ1(t) { [∂J(A,W)/∂W] (W(t))^T W(t) + β γ(t) W(t) } .   (2)
subject to the constraints μ1(t) > 0, β > 0, and γ(t) > 0, where μ1 is the learning rate; β
is the scale selected; t denotes the sampling time; J is the cost function in Eqn. (1)
and ∂J(A,W)/∂W is the gradient with respect to W; and γ(t) is the forgetting factor, which is written as follows:

γ(t) = −tr( (W(t))^T [∂J(A,W)/∂W] (W(t))^T W(t) ) .   (3)
In practice, the well-known real-time, discrete-time version of Eqn. (2) is given as follows:

W(k+1) = W(k) + η_k [ W(k) − F(S(k)) (S(k))^T W(k) − β γ(k) W(k) ]   (4)

where F(S) = −[∂J(A,W)/∂W] W^T, and γ(k) = tr( W(k) Γ(k) ). Here, Γ(k) is defined as:

Γ(k) = [ W(k) − F(S(k)) (S(k))^T ]^T W(k)   (5)
And the gradient with respect to W is written as:

∂J(A,W)/∂W = −A^T (I − AW) X X^T − λ1 α ∂kurt(W)/∂W + (4λ2/σ_t²) ⟨S²⟩ W X X^T
           = −A^T (I − AW) X X^T − λ1 α [ ⟨S³ X^T⟩ − 3 ⟨S²⟩ ⟨S X^T⟩ ] + (4λ2/σ_t²) ⟨S²⟩ W X X^T   (6)

where α = sign(kurt(s_i)); for super-Gaussian signals α = 1, and for sub-Gaussian signals α = −1. Thus, for natural image data, which are super-Gaussian, α is equal to 1. In addition, the feature basis function A is updated using the normal gradient descent algorithm, and thus the updating rule can be written as:

A(k+1) = A(k) + [ I − A(k) W(k) ] X X^T W^T   (7)
In the main loop, we update W and A in turn. First, holding A fixed, we update W, which is the inner loop. Then, holding W fixed, we update A, which is the outer loop.
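The alternating scheme can be sketched as follows. This is a schematic, plain-gradient rendering rather than a faithful transcription of Eqns. (2)-(7): it omits the Amari natural gradient and the amnesic factor, the learning rates and iteration counts are arbitrary, and the input is random placeholder data.

```python
import numpy as np

def sc_train(X, n_outer=20, n_inner=10, lr_w=1e-3, lr_a=1e-3,
             lam1=0.1, lam2=0.1, sigma_t=1.0, seed=0):
    """Schematic alternating minimization of the kurtosis / fixed-variance
    sparse-coding cost: gradient steps on W with A held fixed (inner loop),
    then an update of A with W held fixed (outer loop)."""
    rng = np.random.default_rng(seed)
    n, N = X.shape
    W = rng.standard_normal((n, n)) * 0.01
    A = rng.standard_normal((n, n)) * 0.01
    for _ in range(n_outer):
        for _ in range(n_inner):                       # inner loop: update W
            S = W @ X
            m2 = (S ** 2).mean(axis=1)                 # <s_i^2> per coefficient
            kurt = (S ** 4).mean(axis=1) - 3.0 * m2 ** 2
            alpha = np.sign(kurt)                      # +1 for super-Gaussian rows
            SX = (S @ X.T) / N                         # <s_i x^T>
            g_recon = -(A.T @ (X - A @ S)) @ X.T       # gradient of the reconstruction error
            g_sparse = -lam1 * alpha[:, None] * (4.0 * (S ** 3) @ X.T / N
                                                 - 12.0 * m2[:, None] * SX)
            g_var = (4.0 * lam2 / sigma_t ** 2) * m2[:, None] * SX
            W -= lr_w * (g_recon + g_sparse + g_var)
        S = W @ X                                      # outer loop: update A
        A += lr_a * (X - A @ S) @ S.T
    return A, W

A, W = sc_train(np.random.default_rng(1).standard_normal((8, 500)))
print(A.shape, W.shape)
```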
3 Radial Basis Probabilistic Neural Network (RBPNN)
The RBPNN model [11] is shown in Fig. 1. The first hidden layer is a nonlinear processing layer, generally consisting of centers selected from the training samples. The second hidden layer selectively sums the outputs of the first hidden layer according to the categories to which the hidden centers belong. Namely, the connection weights between the first hidden layer and the second hidden layer are 1's or 0's. For pattern recognition problems, the outputs of the second hidden layer need to be normalized. The last layer of the RBPNN is just the output layer.
Fig. 1. The structure of the radial basis probabilistic neural network
Mathematically, for an input vector x, the actual output value of the ith output neuron of the RBPNN, y_i^a, is expressed as:

y_i^a = Σ_{k=1}^{M} w_ik h_k(x) .   (8)

h_k(x) = Σ_{i=1}^{n_k} φ_i( ||x − c_ki||_2 ) ,   k = 1, 2, 3, …, M .   (9)

where h_k(x) is the kth output value of the second hidden layer of the RBPNN; w_ik is the synaptic weight between the kth neuron of the second hidden layer and the ith neuron of the output layer of the RBPNN; c_ki is the ith hidden center vector for the kth pattern class of the first hidden layer; n_k represents the number of hidden center vectors for the kth pattern class of the first hidden layer; ||·||_2 is the Euclidean norm; and M denotes the number of neurons of the output layer and the second hidden layer, or the number of pattern classes in the training samples set; φ_i(·) is the kernel function, which is generally the Gaussian kernel function. φ_i( ||x − c_ki||_2 ) is written as:

φ_i( ||x − c_ki||_2 ) = exp[ − ||x − c_ki||_2² / σ_i² ] .   (10)
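A small sketch of the forward pass defined by Eqns. (8)-(10): Gaussian kernels around per-class hidden centers, selective per-class summation in the second hidden layer, and a linear output layer. The centers, the output weights, and the input vector are placeholders; only the kernel width of 650 follows the value quoted later in Sect. 4.

```python
import numpy as np

def rbpnn_forward(x, centers, sigmas, W):
    """centers: list of (n_k, d) arrays, one per pattern class;
    sigmas: list of per-class kernel widths; W: (M, M) output weights.
    Returns the M output-neuron values y_i = sum_k w_ik * h_k(x)."""
    h = []
    for c_k, sigma_k in zip(centers, sigmas):
        d2 = ((c_k - x) ** 2).sum(axis=1)           # squared distances ||x - c_ki||^2
        h.append(np.exp(-d2 / sigma_k ** 2).sum())  # Eqns. (9)/(10): summed Gaussian kernels
    h = np.asarray(h)
    return W @ h                                    # Eq. (8)

# Toy example: 3 classes, 85-dimensional inputs (as in the palmprint experiments).
rng = np.random.default_rng(0)
centers = [rng.standard_normal((4, 85)) for _ in range(3)]
sigmas = [650.0, 650.0, 650.0]                      # shape parameter value quoted in Sect. 4
W = np.eye(3)
print(rbpnn_forward(rng.standard_normal(85), centers, sigmas, W))
```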
Generally, the training algorithms for the RBPNN include the orthogonal least squares algorithm (OLSA) and the recursive least squares algorithm (RLSA) [11], etc. These two methods have the common advantages of fast convergence and good convergence accuracy. The RLSA, however, requires good initial conditions and is suited to problems with large training sample sets. As the OLSA makes full use of matrix
computation, such as the orthogonal decomposition of matrices, its training speed and convergence accuracy are higher than those of the RLSA. Therefore, the OLSA is preferred to train the RBPNN in this paper. For N training samples corresponding to M pattern classes, Eqn. (8) can be written in matrix form as [11]:

Y^a = H W .   (11)

where Y^a and H are both N × M matrices, and W is a square matrix of size M × M. The matrix W can be solved as follows:

W = R^{-1} Ŷ .   (12)

where R is an M × M upper triangular matrix with the same rank as H, and Ŷ is an M × M matrix. Both of them can be obtained as follows:

H = Q × [ R ; 0 ] ,   Q^T × Y = [ Ŷ ; Ỹ ] .   (13)

where [ · ; · ] denotes vertical stacking, Q is an N × N orthogonal matrix with orthogonal columns satisfying Q Q^T = Q^T Q = I, and Ỹ is an (N − M) × M matrix.
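The orthogonal least squares solution of Eqns. (11)-(13) amounts to a QR factorization of the hidden-layer response matrix H followed by a triangular solve. The numpy sketch below uses random placeholder matrices and illustrates only this linear algebra, not the full OLSA center-selection procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 300, 100                        # training samples and pattern classes, as in the experiments
H = rng.standard_normal((N, M))        # second-hidden-layer responses for all training samples
Y = np.eye(M)[rng.integers(0, M, N)]   # one-of-M target outputs

# Economy QR factorization: H = Q R, with Q (N x M) orthonormal and R (M x M) upper triangular.
Q, R = np.linalg.qr(H)
Y_hat = Q.T @ Y                        # the M x M block of Q^T Y appearing in Eq. (13)
W = np.linalg.solve(R, Y_hat)          # W = R^{-1} Y_hat, Eq. (12)

residual = np.linalg.norm(H @ W - Y)   # least-squares fit of Eq. (11), Y^a = H W
print(W.shape, residual)
```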
4 Experiments
4.1 Data Preprocessing
In this part, we make use of the Hong Kong Polytechnic University (PolyU) palmprint database, widely used in palmprint processing, to perform palmprint recognition. This database contains 600 palm images of size 128×128 pixels from 100 users, with 6 images from each individual. For each person, the first three images were used as training data while the remaining three were treated as test data. For convenience of calculation, before applying our sparse coding algorithm, PCA is used to whiten the training data and to reduce the dimension from 128² to an appropriate dimension, denoted by k. In our experiments, the number of principal components is selected as k = 85. It was found that the first 85 principal components account for over 92% of the variance in the images. Let P_k denote the matrix containing the first k principal component axes in its columns and let X denote the data set of zero-mean images (each column is an image). Then, the principal component coefficient matrix R_k is given by the formula R_k = X^T P_k. When setting k to 16, the first 16 principal component axes of the image set (columns of P_k) are shown in Fig. 2. The coefficients R_k^T comprised the columns of the input data matrix, where each coefficient had zero mean. The representation of the training images was therefore contained in the columns of the coefficient matrix U: U = W ∗ R_k^T. Here the weight matrix W was k × k, resulting in k
coefficients in U for each palmprint image, consisting of the outputs of each of the weight filters (an image filter f(x) is defined as f(x) = wx). The representation of the test images was obtained in the columns of U_test as follows:

U_test = W ∗ R_test^T = W ∗ (X_test^T ∗ P_k)^T .   (14)
and the basis vectors were obtained from the columns of P_k ∗ W^{-1}. A sample of the basis images is shown in Fig. 3. In this approach, each column of the weight matrix W^{-1} found by the SC algorithm attempts to get close to a cluster of images that look similar across pixels. Thus, this approach tends to generate basis images that look more palmprint-like than the basis images generated by PCA, in that the bases found by the SC algorithm will average only images that look alike. Moreover, this approach is very similar to architecture Ⅱ of ICA [4], which finds statistically independent coefficients for the input images.
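The preprocessing chain just described (zero-mean images, projection onto the first k principal axes, SC filtering of the coefficients, and recovery of the basis images) can be sketched as follows. The PCA here is taken from a plain SVD of the data matrix, W is a random stand-in for the learned SC filter matrix, and the image data are placeholders; only the sizes (128×128 images, 300 training images, k = 85) follow the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n_pixels, n_train, k = 128 * 128, 300, 85       # 128x128 palmprints, 300 training images, k = 85

X = rng.standard_normal((n_pixels, n_train))    # stand-in for zero-mean training images (one per column)
X -= X.mean(axis=1, keepdims=True)

# First k principal component axes P_k (columns), from an SVD of the data matrix.
U_svd, _, _ = np.linalg.svd(X, full_matrices=False)
P_k = U_svd[:, :k]                              # n_pixels x k

R_k = X.T @ P_k                                 # principal-component coefficients, R_k = X^T P_k
W = rng.standard_normal((k, k))                 # stand-in for the k x k SC filter matrix
U_train = W @ R_k.T                             # training representation, U = W * R_k^T

X_test = rng.standard_normal((n_pixels, 300))
U_test = W @ (X_test.T @ P_k).T                 # Eq. (14): U_test = W * (X_test^T P_k)^T

basis_images = P_k @ np.linalg.inv(W)           # basis vectors in the columns of P_k W^{-1}
print(U_train.shape, U_test.shape, basis_images.shape)
```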
Fig. 2. First 16 principal component axes of the palmprint image set, ordered left to right, top to bottom, by the magnitude of the corresponding eigenvalues
Fig. 3. First 16 basis vectors generated by our sparse coding algorithm, ordered left to right, top to bottom, by the magnitude of the corresponding eigenvalues
4.2 Palmprint Recognition Performance
In performing the recognition task, the features of the palmprints were extracted using our SC algorithm. Three classifiers were tested, i.e., Euclidean distance, the RBPNN, and the PNN. Euclidean distance is the simplest distance-matching algorithm of all. The RBPNN classifier, proposed by us, possesses the advantages of the RBFNN and the PNN, and is very suitable for classification problems [11]. First, to determine the appropriate feature length, we used the three types of classifiers to perform the recognition task using PCA with different numbers k of principal components. The recognition results are shown in Fig. 4. It should be noted that,
when using the RBPNN classifier, we selected 300 training samples as the hidden centers of the first hidden layer. The number of neurons in the second hidden layer is set to 100; thus, the number of output layer neurons is also set to 100. According to [11], the shape parameter is set to 650. The OLSA is used to train the RBPNN model. Likewise, using parameters similar to those mentioned above, we use the recursive OLSA (ROLSA) to optimize and prune the structure of the RBPNN. As a result, the number of selected hidden centers of the first hidden layer is greatly reduced from 300 to 80, while the recognition rates of PCA with different numbers of principal components remain unchanged. This shows that the RBPNN model has good classification performance.
Fig. 4. The recognition rates of PCA with different principal components
From Fig. 4, it is clear that PCA with 85 principal components yields the best performance; the recognition rate drops after this point, and after about 90 principal components it remains almost invariant as the number of principal components increases. Therefore, this feature length of 85 is then used as the input to our SC algorithm. The recognition rates obtained by using our SC algorithm are shown in Table 1. In addition, we compared our SC method with the classical SC algorithm [5] and the PCA method with 85 principal components, and the comparison results are also shown in Table 1. It is clearly seen that the recognition rate of our SC algorithm is better than those of the methods of PCA and standard SC (or ICA).

Table 1. Recognition rate of our SC algorithm using three types of different classifiers with 85 principal components

Recognition Methods (k=85)     RBPNN (%)   PNN (%)   Euclidean distance (%)
PCA                            94.97       93.50     91.33
Standard sparse coding (ICA)   96.34       94.97     92.82
Our sparse coding              97.78       96.65     93.67
Table 2. Verification rate of different algorithms considered here

Algorithms                     FAR (%)   FRR (%)   TSR (%)
PCA                            3.0707    3.000     96.93
Standard sparse coding (ICA)   2.8890    2.7081    97.67
Our sparse coding              2.4283    1.9802    98.16
At the same time, it can also be observed that Euclidean distance is the worst among the three classifiers, and that the recognition performance of the RBPNN is higher than those of the PNN and Euclidean distance. Further analysis of the experimental results above can be performed by calculating standard error rates such as the false acceptance rate (FAR) and the false rejection rate (FRR). Both FAR and FRR are general measures for biometric recognition systems. Table 2 shows the verification rates of the recognition methods of PCA, standard SC, and our SC using the RBPNN, corresponding to 85 principal components. The experimental results show again that our SC algorithm indeed outperforms the other two algorithms considered here.
5 Conclusions
In this paper, a new palmprint recognition method based on the sparse coding (SC) algorithm proposed by us was presented. This sparse coding algorithm ensures that the natural image structure captured by the kurtosis is not only sparse but also independent. At the same time, a fixed variance term for the coefficients is used to yield a fixed information capacity. On the other hand, in order to improve the convergence speed, we used a determinative basis function, obtained by a fast fixed-point independent component analysis (FastICA) algorithm, as the initialization of the feature basis functions of our sparse coding algorithm instead of a random initialization matrix. In addition, the learning rule for the coefficient weights exploits the modified Amari natural gradient descent algorithm with a forgetting factor. Then, utilizing the cluster structure of sparse coding, we developed a novel palmprint recognition method by using the sparse coding algorithm and the radial basis probabilistic neural network (RBPNN) classifier. The RBPNN was trained by the orthogonal least squares algorithm (OLSA) and optimized by the recursive OLSA. Comparing our sparse coding algorithm with PCA and the classical sparse coding algorithm with 85 principal components, it can be concluded that our sparse coding algorithm outperforms the other two algorithms in performing the palmprint recognition task. At the same time, it can be found that the RBPNN model is very suitable for classification, and it has a higher recognition rate than the PNN and Euclidean distance classifiers.
References
1. Lu, G., David, Z., Wang, K.: Palmprint Recognition Using Eigenpalm Features. Pattern Recognition Letters 24 (2003) 1473-1477
2. Li, W., David, Z., Xu, Z.: Palmprint Identification by Fourier Transform. Int. J. Pattern Recognition and Artificial Intelligence 16 (2002) 417-432
3. Kumar, A., Shen, H.C.: Recognition of Palmprints Using Wavelet-based Features. Proc. Intl. Conf. Sys., Cybern., SCI-2002, Orlando, Florida (2002)
4. Connie, T., Teoh, A., Goh, M., Ngo, D.: Palmprint Recognition with PCA and ICA. Image and Vision Computing New Zealand 2003, Palmerston North, New Zealand, 3 (2003) 232-227
5. Olshausen, B.A., Field, D.J.: Emergence of Simple-cell Receptive Field Properties by Learning a Sparse Code for Natural Images. Nature 381 (1996) 607-609
6. Hyvärinen, A., Oja, E., Hoyer, P., Hurri, J.: Image Feature Extraction by Sparse Coding and Independent Component Analysis. In Proc. Int. Conf. on Pattern Recognition (ICPR'98), Brisbane, Australia, 2 (1998) 1268-1273
7. Olshausen, B.A., Field, D.J.: Natural Image Statistics and Efficient Coding. Network: Computation in Neural Systems 7 (1996) 333-339
8. Atick, J.J., Redlich, A.N.: Convergent Algorithm for Sensory Receptive Field Development. Neural Computation 5 (1993) 45-60
9. Georgiev, P., Cichocki, A., Amari, S.: Nonlinear Dynamical System Generalizing the Natural Gradient Algorithm. In Proceedings of NOLTA 2001, Japan (2001) 391-394
10. Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. Wiley Interscience, New York (2001)
11. Huang, D.S.: Radial Basis Probabilistic Neural Networks: Model and Application. International Journal of Pattern Recognition and Artificial Intelligence 13 (1999) 1083-1101
Radial Basis Probabilistic Neural Networks Committee for Palmprint Recognition*
Jixiang Du1,2,3, Chuanmin Zhai1, and Yuanyuan Wan2,3
1 Department of Computer Science and Technology, Huaqiao University, China
2 Department of Automation, University of Science and Technology of China
3 Intelligent Computing Lab, Hefei Institute of Intelligent Machines, Chinese Academy of Sciences, P.O.Box 1130, Hefei, Anhui 230031, China
[email protected]
Abstract. In this paper, a novel and efficient method for recognizing palmprints based on a radial basis probabilistic neural networks committee (RBPNNC) is proposed. The RBPNNC consists of several independent neural networks trained on different feature domains of the original images. The final classification result represents a combined response of the individual networks. The Hong Kong Polytechnic University (PolyU) palmprint database is used to test our approach. The experimental results show that the RBPNNC achieves higher recognition accuracy and better classification efficiency than any single feature domain.
1 Introduction
At present, many researchers have paid much attention to biometric personal identification, which is emerging as a powerful means for automatically recognizing a person's identity with high confidence. Palmprint verification is such a technology, which recognizes a person based on unique features in his palm, such as the principal lines, wrinkles, ridges, minutiae points, singular points, and texture. Many recognition methods, such as principal component analysis (PCA) and independent component analysis (ICA) [1], the wavelet transform [2], the Fourier transform [3], the Fisher classifier [4], and neural network (NN) methods [5,6], have been proposed. For a typical palmprint recognition system based on a single neural network, some significant features are first extracted in order to reduce the data dimension and computational burden. Then, recognition is performed by the single neural network (NN). Kittler pointed out that, owing to the capability of a neural networks committee, the combination of an ensemble of classifiers is able to achieve higher performance than the best performance achievable by employing a single classifier [7]. So, if the classification results in different feature domains are
* This work was supported by the Postdoctoral Science Foundation of China (NO.20060390180), Scientific Research Foundation of Huaqiao University (NO.06BS217), and the Youth Technological Talent Innovative Project of Fujian Province (NO.2006F3086).
combined to achieve the final classification result, the final recognition accuracy must be higher than that of the results in the best single feature domain. To demonstrate the effectiveness of feature-ensemble-based palmprint recognition, a novel method, a multi-feature ensemble based on the radial basis probabilistic neural network committee (RBPNNC) model, is proposed to perform the palmprint recognition task. This paper adopts five different feature domains, which are commonly used for extracting features from input images: principal component analysis (PCA), kernel principal component analysis (KPCA), independent component analysis (ICA), Fisher's linear discriminant (FLD), and the kernel Fisher linear discriminant (KFLD). This paper is organized as follows. Section 2 briefly presents the five palmprint image feature domains; Section 3 introduces the radial basis probabilistic neural networks committee (RBPNNC); Section 4 presents several experimental results on the Hong Kong Polytechnic University (PolyU) palmprint database; finally, several conclusions are given in Section 5.
2 Feature Domains
At present, the Fourier transform [3], the wavelet transform [2], principal component analysis (PCA), and independent component analysis (ICA) [1] are commonly used to extract features of palmprint images. The Fourier and wavelet transforms have strong mathematical foundations and fast implementations, but they do not adapt to the particular data. The significant advantage of PCA and ICA is that they depend only on the statistical properties of the image data. The PCA technique is usually suitable for second-order statistics, while the ICA method can be used for multi-dimensional data. The Fisher's linear discriminant (FLD) is also a commonly used method for image recognition, such as face recognition, so it is adopted in our approach. In particular, kernel-based methods have recently become very popular and have shown powerful performance for pattern classification. Thus, the kernel PCA and kernel FLD are also used in our method. Here, we use a fast fixed-point algorithm for independent component analysis (FastICA) to extract features of palmprint images, since it is a neural algorithm that is particularly efficient and light from the point of view of computational effort. According to [6], there are two types of implementation architectures for ICA in the image recognition task. The first architecture treats images as random variables and pixels as observations, i.e., each row of the input data matrix denotes an image, and its goal is to find a set of statistically independent basis images. The other architecture treats pixels as random variables and images as observations, i.e., each column of the input data denotes an image, and its goal is to find a representation in which all coefficients are statistically independent. Reference [6] demonstrated that the second architecture outperforms the first architecture in classification with a single RBPNN classifier. Thus, we also adopt the second architecture. The details of this algorithm can be found in [6].
3 Radial Basis Probabilistic Neural Networks Committee (RBPNNC)
3.1 Radial Basis Probabilistic Neural Networks (RBPNN)
The RBPNN model [8, 9], shown in Fig. 1, was derived from the radial basis function neural network (RBFNN) and the probabilistic neural network (PNN). Hence it possesses the advantages of these two networks while avoiding their demerits.
Fig. 1. The topology scheme of radial basis probabilistic neural network
In mathematics, for an input vector x, the actual output value of the ith output neuron of the RBPNN, y_i^α, can be expressed as the following equation:

y_i^α = Σ_{k=1}^{M} w_ik h_k(x)   (1)

h_k(x) = Σ_{i=1}^{n_k} φ_i( ||x − c_ki||_2 ) ,   k = 1, 2, …, M   (2)

Here h_k(x) is the kth output value of the second hidden layer of the RBPNN; φ_i(·) is the kernel function, which is generally the Gaussian kernel function and can be written as

φ_i( ||x − c_ki||_2 ) = exp( − ||x − c_ki||_2² / σ_i² )   (3)
3.2 Neural Networks Committee Machines
The idea of committee machines is based on a simple engineering principle, namely the idea referred to as ''divide and conquer'' [10,11]. By ''divide and conquer'' it is meant that a complex computational task is divided into a set of less complex subtasks that can be readily solved. The solutions corresponding to these subtasks are then combined at a later stage to produce the final result for the original complex problem. Recently, many combination strategies for committee machines have been developed. In our experiments, a plurality voting strategy [11] is adopted for combining the committee members. In this combination strategy, the final decision is
the classification result reached by more classifier members than any other, and the class label is obtained by the following formula:

i = arg max_{j=1,…,c} ( K_j )   (4)
where K_j denotes the number of classifiers which support class j. When performing classification using a neural networks committee adopting a voting integration strategy, it is inefficient if the committee contains too few members. So, in our experiments, the original images are first divided into 30 sub-images by Gabor filters at 5 different scales and 6 different orientations to increase the number of committee members [12]. Each sub-image can produce five committee members' samples. The system architecture of the proposed committee machine is a feedforward structure, as shown in Fig. 2.
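The plurality voting rule of Eq. (4) counts, for each test sample, how many committee members predict each class and returns the most supported class. A minimal sketch with placeholder member predictions (the five feature-domain members and 100 classes follow the setup described in the text):

```python
import numpy as np

def plurality_vote(member_predictions, n_classes):
    """member_predictions: (n_members, n_samples) array of predicted class labels.
    Returns, for each sample, the class supported by the most committee members (Eq. 4)."""
    n_members, n_samples = member_predictions.shape
    votes = np.zeros((n_samples, n_classes), dtype=int)
    for preds in member_predictions:
        votes[np.arange(n_samples), preds] += 1     # K_j: number of members supporting class j
    return votes.argmax(axis=1)

# Toy example: 5 feature-domain RBPNNs (PCA, ICA, FLD, KPCA, KFLD) voting over 100 classes.
rng = np.random.default_rng(0)
members = rng.integers(0, 100, size=(5, 10))
print(plurality_vote(members, n_classes=100))
```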
Fig. 2. System architecture of the radial basis probabilistic neural networks committee (RBPNNC)
4 Experimental Results and Discussions
We used the Hong Kong Polytechnic University (PolyU) palmprint database, available from http://www.comp.polyu.edu.hk/~biometrics, to verify our RBPNN algorithm. This database includes 600 palmprint images of size 128×128 from 100 individuals, with 6 images from each. In all cases, three training images per person (thus 300 total training images) were randomly taken for training, and the remaining three images (300 total images) were taken for testing. In our experiment, the PCA method turns each image into an 85-dimensional vector, and the KPCA method turns each original image into a 98-dimensional vector. The Fisher's linear discriminant (FLD) turns each original image into an 80-dimensional vector, and the KFLD method turns each original image into a 90-dimensional vector. The ICA method turns each image into an 85-dimensional vector. Firstly, the Gabor filter was not applied; the original single image was used. The experimental results show that the correct classification accuracy can reach the average value of 96.87% and 96.54%, respectively, when we implemented the classification task using a single RBPNN classifier in the ICA or KFLD feature domain. The PCA method got the lowest classification accuracy. When using the RBPNNC to implement
the classification, the highest classification accuracy was achieved, rising by 0.44 percentage points over the best single feature method. Then, the Gabor filter was applied to increase the number of committee members. From the experimental results, it can be seen that all the classification accuracies rise to different degrees when using the single RBPNN classifier in each feature domain. However, the best single-domain classification accuracy rises by only 0.33 percentage points compared with using the single image, whereas for the RBPNNC a rise of 0.87 percentage points was achieved. This demonstrates that the committee is inefficient if it contains too few members, and that the Gabor filter, aimed at producing more committee members' samples, is necessary and efficient. Note that there are other methods to increase the number of committee members, such as dividing the original image into several blocks, or the wavelet transform. With the same training and testing data, the results were also compared with the single RBFNN and the single RBPNN, as shown in Table 1. It can be clearly seen that the recognition rate of the RBPNNC is the highest. Even for the single RBPNN and RBFNN classifiers, the experimental results demonstrate that the RBPNN is a better choice for palmprint recognition than the RBFNN.

Table 1. Recognition rate of different classifiers

                                Feature
Classifiers                     PCA     ICA     FLD     KPCA    KFLD
Single Image        RBPNN      94.98   96.87   95.11   95.32   96.54
                    RBFNN      93.77   94.61   93.31   93.65   95.11
                    RBPNNC     97.31 (all feature domains combined)
Gabor Image Group   RBPNN      95.41   97.20   96.26   96.37   97.11
                    RBFNN      94.17   95.33   95.70   95.20   96.74
                    RBPNNC     98.18 (all feature domains combined)
5 Conclusions
In this paper, a novel palmprint recognition method was developed by using multiple feature domains classified with a radial basis probabilistic neural networks committee (RBPNNC). The multi-feature-domain method was compared experimentally with the single-feature-domain methods. From the experimental results, it can be concluded that our palmprint recognition method based on multiple feature domains achieves a higher statistical recognition rate than any one single feature domain. Our proposed method indeed improves the classification accuracy, and it is effective and efficient.
References 1. Connie, T., Teoh, A., Goh, M. etc: Palmprint Recognition with PCA and ICA, Image and Vision Computing New Zealand 2003, Palmerston North, New Zealand, (2003) 232–227 2. Kumar, A., Shen, H.C.: Recognition of Palmprints using Wavelet-based Features, Proceedings of the International Conference on Systems and Cybernetics, SCI-2002, Orlando, Florida, July 2002, 371–376
3. Li, W., David, Z., Xu, Z.: Palmprint Identification by Fourier transform, Int. J. Pattern Recognition Art. Intell. 16 (2002) 417–432 4. Wu, X., Zhang, D., Wang, K.: Fisherpalms based on Palmprint Recognition, Pattern Recognition Letter 24 (2003) 2829–2938 5. Shang, L., Huang, D.S., Du, J.X., Huang, Z.K.: Palmprint Recognition Using ICA Based on Winner-Take-All Network and Radial Basis Probabilistic Neural Network, ISNN 2006, LNCS 3972, (2006) 216 – 221 6. Shang, L., Huang, D.S., Du, J.X., Zheng, C.H.: Palmprint Recognition using FastICA Algorithm and Radial Basis Probabilistic Neural Network, Neurocomputing 69 (2006) 1782–1786 7. Kittler, J., Hatef, M., Duin, R.P.W., Matas, J.: On Combining classifier. IEEE Transactions on Pattern Analysis and Machine Intelligence 20 (1998) 226–239 8. Huang, D.S.: Radial Basis Probabilistic Neural Networks: Model and Application. International Journal of Pattern Recognition and Artificial Intelligence 13 (1999) 1083-1101 9. Huang, D.S.: Systematic Theory of Neural Networks for Pattern Recognition. Publishing House of Electronic Industry of China, Beijing (1996) 10. Su, M., Basu, M.: Gating Improves Neural Network Performance, Proceedings IJCNN’01, 3 (2001) 2159 -2164 11. Hanaen, L.K., Salamon, P.: Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence 12 (1990) 993–1001 12. Manjunath, B.S., Ma, W.Y.: Texture Features for Browsing and Retrieval of Large Image Data, IEEE Transactions on Pattern Analysis and Machine Intelligence 18 (1996) 837-842
A Connectionist Thematic Grid Predictor for Pre-parsed Natural Language Sentences
João Luís Garcia Rosa
Computer Engineering Faculty - Ceatec, Pontifical Catholic University of Campinas - PUC-Campinas, Campinas, São Paulo, Brazil
[email protected]
Abstract. Inspired by psycholinguistics and neuroscience, a symbolic-connectionist hybrid system called θ-Pred (Thematic Predictor for natural language) is proposed, designed to reveal the thematic grid assigned to a sentence. Through a symbolic module, which includes anaphor resolution and relative clause processing, a parsing of the input sentence is performed, generating logical formulae based on events and thematic roles for Portuguese language sentences. Previously, a morphological analysis is carried out. The parsing displays, for grammatical sentences, the existing readings and their thematic grids. In order to disambiguate among possible interpretations, there is a connectionist module, comprising, as input, a featural representation of the words (based on verb/noun WordNet classification and on classical semantic microfeature representation), and, as output, the thematic grid assigned to the sentence. θ-Pred employs a biologically inspired training algorithm and architecture, adopting a psycholinguistic view of thematic theory.
1 Introduction
The system θ-Pred (Thematic Predictor for natural language) combines a symbolic approach, through a logical parser based on a Portuguese language grammar fragment, with a connectionist module, which accepts sentences coded in a semantic representation based on the WordNet classification for verbs and nouns [1, 2] (see Figure 1). The sentences are parsed in the first module, which generates a logical representation based on events and thematic roles, disambiguating meanings through the production of as many formulae as there are possible readings. The second module is responsible for the prediction of sentences not presented to the first module, provided that the connectionist architecture is trained with representative patterns, thereby allowing generalization over the input sentence. The output of a successful propagation should be the correct thematic grid assigned to that sentence.
2 Thematic Roles
Thematic roles are the semantic relations between a predicate and its arguments [3, 4]. A predicate (usually the verb) assigns a thematic grid to a sentence, the
Fig. 1. The two modules of θ-Pred system. The words are entered into the symbolic module for parsing. Ungrammatical sentences are discarded (left). Grammatical sentences have their logical forms generated (right). In addition, a semantic microfeature representation of the grammatical sentence is presented to the connectionist module. The system provides the thematic grids for recognized sentences.
structure containing every single thematic role of that sentence. For instance, the verb judge, in the sense evaluate, would assign an experiencer (i) and a theme (j), no matter in which sentence it occurs, like in [I]i cannot judge [some works of modern art]j. There are verbs, however, which assign different thematic grids to different sentences, for instance the verb hit in sentence (1), in the sense cause to move by striking, and in sentence (2), in the sense come into sudden contact with. So, based on an episodic logic [5], a parser based on events (e) and thematic roles can reveal the possible readings for sentences (1) and (2), displaying two different logical forms, one for each sentence. The man hit the ball.
(1)
Logical form: ∃(x): man(x) ∧ ∃(y): ball(y) ∧ ∃(e, simple past): hit(e) ∧ agent(e,x) ∧ patient(e,y) The car hit a tree.
(2)
Logical form: ∃(x): car(x) ∧ ∃(y) : tree(y) ∧ ∃(e, simple past): hit(e) ∧ cause(e,x) ∧ patient(e,y) To the sentences (1) and (2), although the same verb is employed, are assigned different thematic grids. In one possible reading of sentence (1), the thematic grid
assigned is [agent, patient] and in sentence (2), [cause, patient]. The reason is that the man, in the intended reading of sentence (1), is supposed to have the control of action, that is, the intention of hitting. The same does not occur in sentence (2). The car is not willing to hit anything. Verbs that assign different thematic grids to different sentences are called here thematically ambiguous. The thematic role notion employed here is what some researchers call abstract thematic roles [6].
3 The Symbolic Module
Departing from an episodic logic based on events, a Portuguese language Montagovian grammar fragment [7] considering different classes of adverbs is proposed here. According to Ilari et al. [8], Portuguese adverbs of the spoken language may be classified into many types, including predicative and non-predicative. Predicative adverbs modify the meaning of the verb or adjective, implying a higher order predication, because the adverb predicates a property of the quality or action attributed to the subject. When the adverb does not alter the meaning of the verb or adjective, it is called non-predicative.
3.1 The Grammar
θ-Pred’s lexicon contains several adverbs, according to Ilari et al. [8], including qualitative and intensifier predicative adverbs, sentential (modal and aspectual), and non-predicative (negation). Since the analysis presents logical forms based on events, the adverb that comes with the verb, the noun, or the adjective is called adjunct. Predicative adverbs correspond to second order predication, and the parser is implemented in a first order predicate logic, based on events and thematic roles, that do not support higher order predication. Only non-predicative adverbs should be treated as first order arguments or logical operators. The sentences are formed according to a phrasal grammar, considering adverbs as adjuncts, prepositional phrases, adjectives, relative clauses, anaphora resolution, and phrases connected by the conjunction and. The grammar includes sentence conjunction, allowing anaphora employment (personal pronouns) in the second sentence of the conjunction. It includes also prepositional phrases, through the so-called with-NPs, that is, a noun phrase beginning with the word with. This allows the analyzer process the ambiguous sentence (3). In this case, two logical forms are obtained: the first, where binoculars are the instrument of the verb see and the second, where the girl owns them. T he man saw the girl with the binoculars.
(3)
Logical form 1: ∃(x): man(x) ∧ ∃(z): binoculars(z) ∧ ∃(y): girl(y) ∧ ∃(e, simple past): see(e) ∧ experiencer(e,x) ∧ theme(e,y) ∧ instrument(e,z) Logical form 2: ∃(x): man(x) ∧ ∃(z): binoculars(z) ∧ ∃(y): girl(y) ∧ ∃(e, simple past): see(e) ∧ experiencer(e,x) ∧ theme(e,y) ∧ own(y,z)
If different sentences contain the same thematically ambiguous verb, like hit in sentences (1) and (2), they can be assigned different thematic grids. But for one ambiguous sentence, like sentence (3), different thematic grids are also assigned, one for each possible interpretation. Besides ordinary verbs and thematically ambiguous verbs, the θ-Pred lexicon contains two-sense verbs with only one thematic grid (for instance, love: according to WordNet, the verb love has four senses; two of them are employed here, enjoy (sentence 4) and be in love with (sentence 5), and for both the thematic grid is [experiencer, theme]).
I love western movies.
(4)
Mary loves her husband.
(5)
3.2 Computational Implementation of the Symbolic Parser
The computational implementation of a context-free grammar fragment with adverbs, based on events and thematic roles, is performed through the logic programming language Prolog, whose statements are transpositions of first-order predicate logic formulae. A semantic analyzer supplies all possible logical forms of Portuguese declarative sentences, analyzing the determiner employed and giving the adequate quantifier. The first version of the parser also includes a morphological analysis, which classifies each regular verb in tense, number, and person, and each noun, adjective, etc., in gender and number¹. Some irregular verbs are included, like ser/estar (to be). A small lexicon is implemented, where only singular forms of nouns and infinitive forms of verbs are considered (the morphological analysis discovers the number, in the case of nouns, and the tense, number, and person, in the case of verbs). This analysis is based on a phrasal grammar [9]. If the sentence is ungrammatical, the parser rejects it.
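To make the event-based representation concrete, the following purely illustrative Python sketch (not the authors' Prolog implementation) builds the kind of logical form shown for sentence (1); the Event class and the dictionary layout are assumptions introduced only for illustration.

```python
from dataclasses import dataclass

# Hypothetical data structure for an event-based logical form such as
# "exists x: man(x), exists y: ball(y), exists e: hit(e) & agent(e,x) & patient(e,y)".

@dataclass
class Event:
    predicate: str      # the verb, e.g. "hit"
    tense: str          # e.g. "simple past"
    roles: dict         # thematic roles mapped to entity variables

entities = {"x": "man", "y": "ball"}
lf = Event(predicate="hit", tense="simple past",
           roles={"agent": "x", "patient": "y"})
print(entities, lf)
```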
4
The Biologically Plausible Connectionist Module
This section presents the second module of the θ-Pred system: the way the words are represented, the connectionist architecture of the system, and the employment of a biologically plausible supervised learning algorithm, with simulation experiments.
4.1 Word Representation
In order to classify verbs and nouns, θ-Pred employs a representation based on classical semantic microfeature distributed representation [10] and on WordNet².
¹ In Portuguese, verbs have, besides tense and number, person too; that is, there are different forms for verbs with different persons, no matter which the tense is. Portuguese adjectives agree with the noun they describe, so they feature gender and number. The morphological analysis gives the correct form of the word.
² WordNet version 2.1: http://wordnet.princeton.edu/obtain.
WordNet is a lexical data base (an ontology based on semantics [11]) of the English language [1, 2] which contains around 120,000 synonym sets (synsets) of nouns, verbs, adjectives, and adverbs, each one representing a lexicalized concept. The verbs chosen from WordNet represent all kinds of semantic relationships the system intends to treat. Twenty five dimensions with two binary units each account for each verb (see Table 1) and thirty dimensions for each noun (see Table 2).

Table 1. The semantic microfeature dimensions for verbs according to WordNet and to a thematic frame [10]

body                  change                cognition           communication
competition           consumption           contact             creation
emotion               motion                perception          possession
social                stative               weather             control of action
process triggering    direction of action   impacting process   change of state
psychological state   objective action      effective action    intensity of action
interest on process

Table 2. The semantic microfeature dimensions for nouns, based mainly on WordNet

action         life          element         property         corporeal
social         nature        miscellaneous   size             consistency
form           fragility     instrument      adulthood        gender
body           change        cognition       communication    competition
consumption    contact       creation        emotion          motion
perception     possession    social          stative          weather
Since the aim of the presented system is to deal with thematic relationships between words in a sentence, the microfeatures chosen for verbs attempt to cover the semantic issues considered relevant in a thematic frame. The microfeatures outside this context are meaningless [12].
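The following sketch illustrates, under stated assumptions, how a word could be encoded as a microfeature vector with two binary units per semantic dimension; the helper name, the "2 units = present / absent" coding and the sample active dimensions for hit are illustrative choices, not taken from the paper.

```python
import numpy as np

# Illustrative microfeature encoding: 25 verb dimensions x 2 binary units each.
VERB_DIMS = ["body", "change", "cognition", "communication", "competition",
             "consumption", "contact", "creation", "emotion", "motion",
             "perception", "possession", "social", "stative", "weather",
             "control of action", "process triggering", "direction of action",
             "impacting process", "change of state", "psychological state",
             "objective action", "effective action", "intensity of action",
             "interest on process"]

def encode(active_dims, all_dims=VERB_DIMS):
    """Return a 2-units-per-dimension binary vector for one word (assumed coding)."""
    units = []
    for d in all_dims:
        units.extend([1, 1] if d in active_dims else [0, 0])
    return np.array(units)                 # length 50 for a verb slot

hit = encode({"contact", "motion", "impacting process", "control of action"})
print(hit.shape, int(hit.sum()))
```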
4.2 The Connectionist Architecture
θ-Pred employs a bi-directional three-layer connectionist architecture with a hundred input units, fourteen hidden units, and seven output units, one for each of the thematic roles: agent, patient, experiencer, theme, location, cause, and value (see figure 2). In this case, according to Sun [13], the architecture can be classified as single-module employing distributed representation. For each sentence, the words are presented sequentially to their specific slot (verb or noun) in input layer. The data used in experiments are realistic in the way they reflect situations found “in the wild.” The method used for generating sentences for training and test (i.e. by filling out the slots of sentence frames) creates a compelling set of
830
J.L.G. Rosa
Fig. 2. The connectionist architecture of θ-Pred. The sentence is presented to the input layer A and its thematic grid is revealed at output layer C. Notice that there are different slots for verbs and nouns. In the hidden layer B there are the conjunction of verb inputs in HV and the conjunction of noun inputs in HN. These two units are connected to one unit, regarding a specific thematic role, in the output layer C. Notice the bi-directional links between hidden (B) and output (C) layers, while there are unidirectional links from input (A) to hidden (B) layer. Legend for the output layer C (thematic roles): A = agent, P = patient, E = experiencer, T = theme, L = location, C = cause, and V = value.
training or test instances, because the chosen frames are representative of the kinds of sentences θ-Pred intends to deal with.
4.3 Biologically Plausible Supervised Learning
In each sentence presentation an output is computed, based on an input pattern and on current values of net weights. The actual output can be quite different from the “expected” output, i.e. the values that it should have in the correct reading of the sentence, that is, the correct thematic grid assigned to the input sentence. During training, each output is compared to the correct reading, supplied as a “master input.” This master input should represent what a real language learner would construct from the context in which the sentence occurs. Learning may be described as the process of changing the connection weights to make the system output correspond, as close as possible, to the master input. The learning algorithm used in θ-Pred is inspired by Recirculation [14] and GeneRec algorithms [15] . This algorithm is considered biologically more plausible since it supports bidirectional propagation, among other items [16]. The algorithm consists of two phases: minus and plus (figure 3). In the minus phase, the semantic microfeature representation of the first word of a sentence is presented to the input layer A. Then, there is a propagation of these stimuli to the output through the hidden layer B (bottom-up propagation). There is also a propagation of the previous actual output, which is initially empty, from output layer C back to the hidden layer B (top-down propagation). Then, a hidden minus activation is generated (sum of the bottom-up and top-down propagations),
Fig. 3. The two phases of GeneRec algorithm. In the minus phase, when input x is presented to input layer A, there is propagation of these stimuli to the hidden layer B (1). Then, a hidden minus signal is generated based on input and previous output stimuli o(t − 1) (2 and 3). Then, these hidden signals propagate to the output layer C (4), and an actual output o(t) is obtained (5). In the plus phase, input x is presented to layer A again; there is propagation to hidden layer (1). After this, expected output y (2) is presented to the output layer and propagated back to the hidden layer B (3), and a hidden plus signal is generated, based on input and on expected output. Recall that the architecture is bi-directional, so it is possible for the stimuli to propagate either forwardly or backwardly.
through the sigmoid logistic activation function σ (equation 6). Finally, the current actual output is generated through the propagation of the hidden minus activation to the output layer (equation 7) [17].

h_j^- = σ( Σ_{i=0}^{A} w_ij · x_i + Σ_{k=1}^{C} w_jk · o_k(t − 1) ),   (6)

o_k(t) = σ( Σ_{j=1}^{B} w_jk · h_j^- ).   (7)
In the plus phase, there is a propagation from input layer A to the hidden layer B (bottom-up). After this, there is the propagation of the expected output to the hidden layer (top-down). Then a hidden plus activation is generated, summing these two propagations (equation 8). For the other words, presented one at a time, the same procedure (minus phase first, then plus phase) is repeated. Recall that since the architecture is bi-directional, it is possible for the stimuli to propagate either forwardly or backwardly [17].

h_j^+ = σ( Σ_{i=0}^{A} w_ij · x_i + Σ_{k=1}^{C} w_jk · y_k ).   (8)
In order to make learning possible the synaptic weights are updated (equations 9 and 10), considering only the local information made available by the
synapse. The learning rate η used in the algorithm is considered an important variable during the experiments [18].

Δw_jk = η · (y_k − o_k(t)) · h_j^-,   (9)

Δw_ij = η · (h_j^+ − h_j^-) · x_i.   (10)
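The sketch below is a minimal, hedged illustration of one minus/plus training step following equations (6)-(10); the layer sizes follow Section 4.2 (100 input, 14 hidden, 7 output units), but the weight initialization, the absence of an explicit bias unit and all variable names are assumptions, not the author's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def generec_step(x, y, o_prev, W_ih, W_ho, eta=0.1):
    """One GeneRec-style update for input x, expected output y, previous output o_prev."""
    # Minus phase: bottom-up input plus top-down previous output (eq. 6 and 7)
    h_minus = sigmoid(x @ W_ih + o_prev @ W_ho.T)
    o = sigmoid(h_minus @ W_ho)
    # Plus phase: bottom-up input plus top-down expected output (eq. 8)
    h_plus = sigmoid(x @ W_ih + y @ W_ho.T)
    # Local weight updates (eq. 9 and 10)
    W_ho = W_ho + eta * np.outer(h_minus, y - o)
    W_ih = W_ih + eta * np.outer(x, h_plus - h_minus)
    return o, W_ih, W_ho

rng = np.random.default_rng(0)
A, B, C = 100, 14, 7                        # input, hidden, output units
W_ih = rng.normal(0.0, 0.1, (A, B))         # input -> hidden weights (w_ij)
W_ho = rng.normal(0.0, 0.1, (B, C))         # hidden <-> output weights (w_jk)
x = rng.integers(0, 2, A).astype(float)     # one word's microfeature vector
y = np.eye(C)[0]                            # expected thematic-role output
o, W_ih, W_ho = generec_step(x, y, np.zeros(C), W_ih, W_ho)
```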
4.4 Simulation Experiments
The sentences presented to the net are generated by filling each category slot of sentence frames. Each frame specifies a verb, a noun set and a list of possible fillers of each noun. So, the sentence frame the human buys the thing is a generator for sentences in which the subject human is replaced by one of the words in the human list, like man, and thing is replaced by one of the words in the list of things, like car, since buy assigns the following thematic roles: an agent (the one who buys) and a theme (the thing that is bought). Then the sentence the man bought the car could be generated. And the output for this sentence would be the assigned thematic grid [agent, theme]. If all possible inputs and outputs are shown to a connectionist network employing a supervised training procedure, the net will find a weight set that approximately maps the inputs to the outputs. For many artificial intelligence problems, however, it is impossible to provide all possible inputs. To solve this problem, the training algorithm uses the generalization mechanism, i.e. the network will interpolate when inputs, which have never been received before, are supplied. In the case of this system, since words are described by microfeatures arrays, there are words with related meanings (like, for instance, man and boy). These words are expected to contain many microfeatures in common, so the distance between their microfeatures arrays is small, favoring generalization. The system is trained to learn the correct thematic grids assigned to input sentences. The training set was chosen in order to contain representative verbs and nouns of each thematic category present in θ-Pred. For the system evaluation, test sentences are generated automatically. These sentences are different from the sentences generated by the training sentence generator, although their thematic frames are basically the same (the difference relies on the choice of the words involved). In this case, only the default readings for thematically ambiguous verbs are generated, simulating a user entering sentences to be analyzed. The user does not need to know which thematic reading is expected for the verb; θ-Pred will decide, based on sentence context, which will be the correct reading and, consequently, arrive at the expected thematic grid for that sentence. In relation to accuracy, the connectionist module of the system presents recall and precision rates of 94%3 , since only seven words revealed inadequate thematic roles in 120 words belonging to a limited, but sufficient, set of test sentences. 3
According to Jurafsky and Martin [19], recall is defined by the number of correct answers given by the system divided by the total number of possible correct answers in the text, while precision is the number of correct answers given by system divided by the number of answers given by the system. Since θ-Pred is fed only by correct sound sentences, in this case recall and precision coincide.
5
Concluding Remarks
The purpose of this paper is to present a symbolic-connectionist hybrid system consisting of two modules: a symbolic parser based on events, employing a grammar which takes into consideration classes of adverbs, according to Ilari et al. [8], in addition to transitive and intransitive verbs, and a biologically plausible connectionist thematic grid predictor. Since most adverbs modify the meaning of a verb or an adjective, they exhibit a kind of second-order predication. For this reason, a parser based on events is chosen. In connectionist Natural Language Processing (NLP) systems, the words belonging to a sentence must be represented in such a way as to keep the meaning of the words and, at the same time, to be useful for the network to develop significant internal representations. The representation of semantic features adopted in this system would also easily allow for new words to be entered in order to increase its lexicon, provided that their semantic microfeature arrays are supplied. θ-Pred presents as a novelty a more biologically plausible architecture and training procedure based on neuroscience [15], which comprises a bi-directional connectionist architecture, to account for the chemical and electrical synapses that occur in the cerebral cortex, and a training procedure that makes use of this architecture.
References 1. Fellbaum, C.: English Verbs as a Semantic Net. Intl. J. of Lexicography 3 (1990) 278-301 2. Miller, G.A.: Nouns in Wordnet. In Fellbaum, C., ed.: WordNet: An Electronic Lexical Database. MIT Press, Cambridge, Massachusetts (1998) 3. Chomsky, N.: Lectures on Government and Binding: the Pisa Lectures. Holland: Foris Pub. (1981) 4. Chomsky, N.: Knowledge of Language: its Nature, Origin, and Use. New York: Praeger Pub. (1986) 5. Schubert, L.K., Hwang, C.H.: Episodic Logic Meets Little Red Riding Hood - a Comprehensive Natural Representation for Language Understanding. In Iwanska, L.M., Shapiro, S.C., eds.: Natural Language Processing and Knowledge Representation - Language for Knowledge and Knowledge for Language. AAAI Press / The MIT Press (2000) 111-174 6. Gildea, D., Jurafsky, D.: Automatic Labeling of Semantic Roles. Computational Linguistics 28 (2002) 245-288 7. Dowty, D.R., Wall, R.E., Peters, S.: Introduction to Montague Semantics. Reidel Pub. Co. (1981) 8. Ilari, R., de Castilho, A.T., de Castilho, C.M., Franchi, C., de Oliveira, M.A., Elias, M.S., de Moura Neves, M.H., Possenti, S.: Considera¸co ˜es sobre a posi¸ca ˜o dos adv´erbios. In: Gram´ atica do Portuguˆes Falado - Volume I: A Ordem. Editora da Unicamp/Fapesp, Campinas, SP, Brazil (1990) 63-141 9. Pereira, F.C.N., Warren, D.H.D.: Definite Clause Grammars for Language Analysis - a Survey of the Formalism and a Comparison with Augmented Transition Networks. Artificial Intelligence 13 (1980) 231-278
10. McClelland, J.L., Kawamoto, A.H.: Mechanisms of Sentence Processing: Assigning Roles to Constituents of Sentences. In McClelland, J.L., Rumelhart, D.E., eds.: Parallel Distributed Processing, Volume 2 - Psychological and Biological Models. A Bradford Book, MIT Press (1986) 11. O’Hara, T.P.: Empirical Acquisition of Conceptual Distinctions via Dictionary Definitions. PhD thesis, NMSU CS (2004) 12. Rosa, J.L.G., da Silva, A.B.: Thematic Role Assignment through a Biologically Plausible Symbolic-connectionist Hybrid System. In: Proceedings of the Intl. Joint Conf. on Neural Networks - IJCNN 2004, Budapest, Hungary (2004) 1457-1462 13. Sun, R.: Hybrid Connectionist/Symbolic Systems. In Arbib, M.A., ed.: The Handbook of Brain Theory and Neural Networks. 2 edn. A Bradford Book, MIT Press (2003) 543-547 14. Hinton, G.E., McClelland, J.L.: Learning Representations by Recirculation. In Anderson, D.Z., ed.: Neural Information Processing Systems. American Institute of Physics, New York (1988) 358-366 15. O’Reilly, R.C.: Biologically Plausible Error-driven Learning Using Local Activation Differences: the Generalized Recirculation Algorithm. Neural Computation 8 (1996) 895-938 16. O’Reilly, R.C.: Six Principles for Biologically-based Computational Models of Cortical Cognition. Trends in Cognitive Science 2 (1998) 455-462 17. Rosa, J.L.G.: A Biologically Inspired Connectionist System for Natural Language Processing. In: Proceedings of the 2002 VII Brazilian Symposium on Neural Networks - SBRN 2002, Recife, Brazil, IEEE Computer Society Press (2002) 243-248 18. Haykin, S.: Neural Networks - a Comprehensive Foundation. 2 edn. Prentice Hall (1999) 19. Jurafsky, D., Martin, J.H.: Speech and Language Processing - an Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall (2000)
Perfect Recall on the Lernmatrix Israel Román-Godínez, Itzamá López-Yáñez, and Cornelio Yáñez-Márquez Centro de Investigación en Computación, Juan de Dios Bátiz s/n esq. Miguel Othón de Mendizábal, Unidad Profesional Adolfo López Mateos, Del. Gustavo A. Madero, México, D. F., México
[email protected],
[email protected],
[email protected]
Abstract. The Lernmatrix, which is the first known model of associative memory, is a heteroassociative memory that presents the problem of incorrect pattern recall, even in the fundamental set, depending on the associations. In this work we propose a new algorithm and the corresponding theoretical support to improve the recalling capacity of the original model.
1
Introduction
The Lernmatrix is a relevant model of associative memory [1], [2]. The transcendence of the Lernmatrix [3] is evidenced by an affirmation by Kohonen [4], where he points out that correlation matrices substitute Steinbuch's Lernmatrix. The Lernmatrix suffers from one problem: the phenomenon of saturation. In this work, a modification to the original Lernmatrix's recall phase is presented in order to avoid this problem when working with fundamental patterns. The rest of the paper is organized as follows: section 2 is devoted to some background on the original Lernmatrix, while in section 3 the proposed modification and its theoretical support is presented. Section 4 contains some experimental results and section 5 presents some conclusions and future work.
2
The Steinbuch’s Lernmatrix
Here we use basic concepts about associative memories presented in [5]. An associative memory M is a system that relates input patterns and output patterns as follows: x −→ M −→ y. Each input vector x forms an association with a corresponding output vector y. The k-th association will be denoted as (x^k, y^k). Associative memory M is represented by a matrix whose ij-th component is m_ij, and is generated from an a priori finite set of known associations, called the fundamental set of associations. If μ is an index, the fundamental set is represented as: {(x^μ, y^μ) | μ = 1, 2, ..., p}, with p being the cardinality
of the set. The patterns that form the fundamental set are called fundamental patterns. If it holds that x^μ = y^μ ∀μ ∈ {1, 2, ..., p}, then M is autoassociative, otherwise it is heteroassociative. In this latter case it is possible to establish that ∃μ ∈ {1, 2, ..., p} for which x^μ ≠ y^μ. If, when presenting an unknown fundamental pattern x^ω with ω ∈ {1, 2, ..., p} to associative memory M, it happens that the output corresponds exactly to the associated pattern y^ω, we say that recall is correct. The Lernmatrix is a heteroassociative memory, but it can act as a binary pattern classifier depending on the choice of the output patterns [6]; it is an input-output system that gets a binary input pattern x^μ ∈ A^n, where A = {0, 1}, n ∈ Z+, and produces the class (from p different classes) codified with the simple method of one-hot [7]: assigning for the output binary pattern y^μ the following values: y_k^μ = 1, and y_j^μ = 0 for j = 1, 2, ..., k − 1, k + 1, ..., p, where k ∈ {1, 2, ..., p}. In the learning phase, each component m_ij of Lernmatrix M is initialized to zero, and is updated according to the following rule: m_ij = m_ij + Δm_ij, where

Δm_ij = +ε if y_i^μ = 1 and x_j^μ = 1;  −ε if y_i^μ = 1 and x_j^μ = 0;  0 otherwise,

and ε is any positive constant, chosen previously. The recalling phase consists of finding the class to which an input pattern x^ω ∈ A^n belongs. This means constructing vector y^ω ∈ A^m which corresponds to x^ω, according to the building method of all y^μ. The class should be obtained without ambiguity. The i-th component y_i^ω of the class vector is obtained as follows, with ∨ being the maximum operator:

y_i^ω = 1 if Σ_{j=1}^{n} m_ij x_j^ω = ∨_h [ Σ_{j=1}^{n} m_hj x_j^ω ], and y_i^ω = 0 otherwise.

However, there is a problem called saturation, which the Lernmatrix can present during the recalling phase. This occurs when y^ω has a 1 in two or more components, which is not a correct one-hot vector, thus it cannot be a correct association. Saturation could be caused by two different reasons: when there is more than one input pattern associated to one output pattern, and when for μ ≠ ξ, x_i^μ ≤ x_i^ξ ∀i and ∃j such that x_j^μ < x_j^ξ. In either of these cases, correct recall is not guaranteed.
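A minimal Python sketch of the learning and recalling phases just described follows; the function names, the toy patterns and the value of eps are assumptions for illustration, not the authors' code.

```python
import numpy as np

def lernmatrix_learn(X, Y, eps=1.0):
    """X: p x n binary input patterns; Y: p x p one-hot output patterns."""
    p, n = X.shape
    M = np.zeros((Y.shape[1], n))
    for x, y in zip(X, Y):
        # +eps where y_i = 1 and x_j = 1, -eps where y_i = 1 and x_j = 0
        M += np.outer(y, np.where(x == 1, eps, -eps))
    return M

def lernmatrix_recall(M, x):
    """Original recalling phase: mark every row that reaches the maximum response."""
    responses = M @ x                         # sum_j m_ij * x_j for each class i
    z = (responses == responses.max()).astype(int)
    return z                                  # several 1's indicate saturation

X = np.array([[1, 0, 1, 0], [1, 1, 1, 0], [0, 0, 1, 1]])
Y = np.eye(3, dtype=int)
M = lernmatrix_learn(X, Y)
print(lernmatrix_recall(M, X[0]))             # ambiguous, since X[0] <= X[1]
```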
3
Our Proposal and Its Theoretical Support
Our proposed algorithm is an addendum to the Lernmatrix's recalling phase, since the new algorithm is applied only if the original Lernmatrix recalling phase is not capable of delivering a valid output pattern without ambiguity. The only condition is that the associations of the fundamental set must be one-input-to-one-output pairs without repeating any pattern.
Let M be a Lernmatrix and {(x^μ, y^μ) | μ = 1, 2, ..., p} be its fundamental set, where x^μ ∈ A^n and y^μ ∈ A^p, with A = {0, 1}, and n ∈ Z+. Once we have finished with the Lernmatrix's recalling process, we obtain a y representing the class associated with a specific fundamental pattern x. If the class vector does not present the saturation problem, the correct class has been found. Otherwise, we need to create an additional column vector s ∈ A^p which will be useful in the proposed recalling phase. This vector will contain in its i-th component the sum of the positive values on the i-th row of the M matrix:

s_i = Σ_{j=1}^{n} m_ij such that m_ij > 0.
Once the s vector has been built, the next step is to take the output class vector from the Lernmatrix's recalling phase and create a new one based on the algorithm presented here. Let z ∈ A^p be the class vector resulting from the Lernmatrix's recalling phase, and y ∈ A^p be the class vector given by our proposal. Each component in the new class column vector y is given as:

y_i = 1 if s_i = ∧_{k=1}^{p} { s_k such that z_k = 1 }, and y_i = 0 otherwise.

After this process, the new class vector y represents the correct association from the fundamental set. Below are presented the definitions, lemmas, and a theorem that support the proposed algorithm.

Definition 1. Let A = {0, 1} and x^h ∈ A^n be an input pattern. We denote the sum of the values of the positive components of x^h as: U_h = Σ_{i=1}^{n} x_i^h such that x_i^h > 0.

Definition 2. Let M be a Lernmatrix, and {(x^μ, y^μ) | μ = 1, 2, ..., p} be its fundamental set, where x^μ ∈ A^n and y^μ ∈ A^p are two patterns, with A = {0, 1}, and n ∈ Z+. The i-th component of the cardinality vector s of M will be given by: s_i = Σ_{j=1}^{n} m_ij such that m_ij > 0.
Definition 3. Let A = {0, 1} and x^α, x^β ∈ A^n be two vectors, with n ∈ Z+; then x^α < x^β ←→ ∀i x_i^α ≤ x_i^β and ∃j such that x_j^α < x_j^β.

Definition 4. Let A = {0, 1} and x^α, x^β ∈ A^n be two vectors, with n ∈ Z+; then x^α ≤ x^β ←→ x_i^α ≤ x_i^β ∀i ∈ {1, 2, ..., n}, as presented in [5]. In other words, x^α ≤ x^β if and only if ∀i ∈ {1, 2, ..., n} it holds that x_i^α = 1 −→ x_i^β = 1.
Lemma 1. Let x^i be a pattern, randomly taken from the fundamental set. During the Lernmatrix's learning phase, x^i only contributes at the i-th row of M, and contributes with U_i 1's and n − U_i 0's.

Proof. Let x^k ∈ A^n and y^k ∈ A^p be two fundamental patterns forming the k-th association (x^k, y^k) of Lernmatrix M, with A = {0, 1}. According to the manner in which the y^k vector has been built and the way the Lernmatrix learns, it is clear that the Lernmatrix M is only affected by the x^k input vector in its k-th row. Furthermore, since the fundamental set is built in such a way that to each different input vector corresponds one different output vector, the x^k vector will contribute at one and only one row of M, with as many +ε as ones it has, and as many −ε as zeroes it has. That is, the k-th row of M will have U_k times the value +ε and n − U_k times the value −ε.

Lemma 2. Let s be the cardinality vector of the Lernmatrix M; then for each component of s, s_i = εU_i.

Proof. If s is the cardinality vector of Lernmatrix M, then s_i = Σ_{j=1}^{n} m_ij such that m_ij > 0. As we know by lemma 1, each x^i vector with i ∈ {1, 2, ..., p} only contributes at the i-th row of M, and by the form in which the fundamental set has been built, ∀j x_j^i = 1 → m_ij = ε. Therefore the i-th row of the Lernmatrix M will contain as many ε's as there are 1's in the x^i vector and, as we know by definition 1, we can say that M will have a number of ε's equal to U_i. Therefore s_i = Σ_{j=1}^{n} m_ij such that m_ij > 0 −→ s_i = εU_i.
Lemma 3. Let M be a Lernmatrix, {(x^μ, y^μ) | μ = 1, 2, ..., p} be its fundamental set, and x^ω ∈ A^n be a pattern from the fundamental set, which is being presented as input to M, with A = {0, 1}. After the recalling phase, this Lernmatrix will give as output a class vector z ∈ A^p that will present 1's in every component whose index i is the index of a row in M which corresponds to a fundamental pattern greater than or equal to x^ω: ∀i z_i = 1 → x^i ≥ x^ω, x^i ∈ {(x^μ, y^μ) | μ = 1, 2, ..., p}.

Proof. Due to the Lernmatrix's original recalling phase we know that z_i = 1 −→ Σ_{j=1}^{n} m_ij x_j^ω = ∨_h [ Σ_{j=1}^{n} m_hj x_j^ω ]. Given that the operation m_ij x_j^ω discards the values of the components m_ij for which x_j^ω has a value of 0, the maximum result of Σ_{j=1}^{n} m_ij x_j^ω will happen for the patterns which have the most 1's in the same positions as x^ω, thus recalling the patterns which have 1's in the same components as x^ω, regardless of what they have in their other components. Then, the recalled patterns will be either the correct pattern (i.e. x^k = x^ω) or some pattern which has 1's where x^ω has 0's. It is clear that those spurious patterns which are recalled will be greater than the correct one: ∀i z_i = 1, x^i ≠ x^ω → x^i > x^ω. Put differently:

∀i z_i = 1 −→ ∀j (x_j = 1 −→ m_ij = ε) ∧ (m_ij = −ε −→ x_j = 0), or
∀i z_i = 1 −→ ∀j (x_j = 1 −→ m_ij = ε) ∧ (∃k, x_k = 0 ∧ m_ik = ε).
Theorem 1. Let M be a Lernmatrix, {(xμ , yμ ) | μ = 1, 2, ...p} be its fundamental set, built with associations where no input nor output pattern is repeated. Let x ∈ An be a pattern from the fundamental set, which is being presented as input to M, and z ∈ Ap be the class vector resulting from the Lernmatrix’s recalling phase of M, with A ∈ {0, 1}. Then the proposed modification to the recalling phase of the Lernmatrix will always show correct recall; that is, with the proposed algorithm we will always obtain the corresponding y without ambiguity for any xμ in the fundamental set. Proof. The proposed algorithm will put all components of y ω to 0 except where the cardinality vector for those rows recalled by the original recalling algorithm is minimum:
y_i = 1 if s_i = ∧_{k=1}^{p} { s_k such that z_k = 1 }, and y_i = 0 otherwise.

Then, for this algorithm to fail, it would be necessary that one of the spurious recalled patterns had a corresponding s_i less than or equal to that of the correct pattern. This shall be proved to be false by contradiction. First, we assume that x^α ∈ A^n is the correct fundamental pattern and x^β ∈ A^n is an arbitrary spurious recalled pattern, with corresponding s_i values s_α and s_β, respectively. Now, we assume the negation of what we want to prove: s_β ≤ s_α.
(1)
Now, by lemma 2, we know that s_i = εU_i, which means that s_α = εU_α and s_β = εU_β, thus εU_β ≤ εU_α. Now, by dividing by ε, we have U_β ≤ U_α.
(2)
On the other side, lemma 3 shows that for every spurious recalled pattern x^i, where x^ω is the correct recalled pattern, ∀i z_i = 1, x^i ≠ x^ω → x^i > x^ω. Particularly speaking,

x^β > x^α.   (3)

By definition 3 we know that x^α < x^β ←→ ∀i x_i^α ≤ x_i^β and ∃j such that x_j^α < x_j^β. This means that x^β will have at least one component greater than x^α. Since x^α, x^β ∈ A^n, all their components are binary, and can take only two possible values: 0 and 1. Then, for the latter to happen, it is necessary that x_j^α = 0 ∧ x_j^β = 1. Thus, x^β will have at least one more 1 than x^α. By definition 1, this would mean that

U_β > U_α.   (4)
However, according to equation 2, U_β ≤ U_α, which is a contradiction. Then, U_β ≤ U_α cannot be true, therefore U_β > U_α for every spurious recalled pattern, since x^β was chosen arbitrarily.
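The sketch below illustrates the proposed modified recalling phase of Section 3 under stated assumptions: when the original recall is ambiguous, only the recalled rows whose cardinality s_i is minimal are kept. It builds on the lernmatrix_learn / lernmatrix_recall sketch given at the end of Section 2; the toy matrix and function name are illustrative.

```python
import numpy as np

def modified_recall(M, x):
    responses = M @ x
    z = (responses == responses.max()).astype(int)    # original recall
    if z.sum() == 1:                                   # unambiguous: done
        return z
    s = np.where(M > 0, M, 0).sum(axis=1)              # cardinality vector s
    s_min = s[z == 1].min()                            # minimum over recalled rows
    return ((z == 1) & (s == s_min)).astype(int)

# Toy memory learned from X = [[1,0,1,0], [1,1,1,0], [0,0,1,1]] with eps = 1:
M = np.array([[1., -1., 1., -1.], [1., 1., 1., -1.], [-1., -1., 1., 1.]])
print(modified_recall(M, np.array([1, 0, 1, 0])))      # -> [1 0 0]
```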
4
Experimental Results
A series of experiments were done to illustrate the efficiency of the algorithm, as demonstrated by the theorem presented above and its proof. A software tool was developed to randomly create a finite number (p) of binary input patterns (of dimension n) that are used to build the fundamental set. With these patterns and their associated classes, the Lernmatrix M was created and assessed by presenting each element from the fundamental set as an input pattern. The software shows the cases in which the proposed algorithm delivered a correct recall in the first step (that is, the Lernmatrix's original recalling phase), and the cases in which it was necessary to use our proposed modified algorithm. In Table 1 we can see the error percentages from three experiments made with the software. It is clear that the original algorithm presents more errors than the one proposed in this paper.

Table 1. Experimental results

Experiment Number   n    p    Original Algorithm Error (%)   Modified Algorithm Error (%)
1                   11   35   10.65                          0.0
2                   15   30   7.0                            0.0
3                   6    25   20.2                           0.0

5
Conclusions and Future Work
In the current paper, a modification to the original Lernmatrix recalling phase algorithm has been presented, along with its theoretical foundation. By means of the presented theorem and some illustrative experiments, it is shown that the proposed algorithm will yield correct recall for every fundamental pattern. The direct consequence this result will have in the practical use of the Lernmatrix is that, through the use of the proposed algorithm, it can be guaranteed that every pattern that has been learned by the Lernmatrix will be correctly recalled, regardless of any condition. Also, some patterns that are not fundamental can be correctly recalled. However, the conditions for this correct recall on non-fundamental patterns have not yet been characterized. As future work, we will investigate which are the conditions that allow our algorithm to show correct recall for non-fundamental patterns.
Acknowledgments The authors would like to thank the Instituto Polit´ecnico Nacional (Secretar´ıa Acad´emica, COFAA, SIP, and CIC), the CONACyT, and SNI for their economical support to develop this work.
References 1. Y´ an ˜ez-M´ arquez, C., D´ıaz-de-Le´ on Santiago, J.L.: Lernmatrix de Steinbuch, IT-48, Serie Verde, CIC-IPN, M´exico (2001) 2. Steinbuch, K.: Die Lernmatrix, Kybernetik. 1 (1) (1961) 36-45 3. Steinbuch, K., Frank, H.: Nichtdigitale Lernmatrizen als Perzeptoren, Kybernetik. 1 3 (1961) 117-124 4. Kohonen, T.: Correlation Matrix Memories. IEEE Transactions on Computers. 21 (4) (1972) 353-359 5. Y´ an ˜ez-M´ arquez, C.: Associative Memories Based on Order Relations and Binary Operators (In Spanish). PhD Thesis. Center for Computing Research, M´exico (2002) 6. Rom´ an-God´ınez, I., L´ opez-Y´ an ˜ez, I., Y´ an ˜ez-M´ arquez, C.: A New Classifier Based on Associative Memories. IEEE Computer Society 10662, Proc. 15th International Conference on Computing, CIC 2006. ISBN: 0-7695-2708-6 (2006) 55-59 7. Chren, W.A.: One-hot Residue Coding for High-speed Non-uniform Pseudo-random Test Pattern Generation. Circuits and Systems (1995)
A New Text Detection Approach Based on BP Neural Network for Vehicle License Plate Detection in Complex Background Yanwen Li1, 2, Meng Li1,3, Yinghua Lu1, Ming Yang1,3, and Chunguang Zhou2 1
Computer School, Northeast Normal University, Changchun, Jilin Province, China 2 College of Computer Science and Technology, Jilin University, Changchun, Jilin Province, China 3 Key Laboratory for Applied Statistics of MOE, China {liyw085,lim923,luyh}@nenu.edu.cn
Abstract. With the development of Intelligent Transport Systems (ITS), automatic license plate recognition (LPR) plays an important role in numerous applications in reality. In this paper, a coarse to fine algorithm to detect license plates in images and video frames with complex background is proposed. First, the method based on Component Connect (CC) is used to detect the possible license plate regions in the coarse detection. Second, the method based on texture analysis is applied in the fine detection. Finally, a BP Neural Network is adopted as classifier, parts of the features is selected based on statistic diagram to make the network efficient. The average accuracy of detection is 95.3% from the images with different angles and different lighting conditions.
1 Introduction

With the development of Intelligent Transport Systems (ITS), automatic license plate recognition (LPR) plays an important role in numerous applications in reality [1-3]. The license plate detection method might be applied for electronic tolls to help identify violating vehicles. How to find the license plate region in complex scenes is the key component of LPR, since it directly affects the system's overall performance. A large number of scholars have carried out research and development of this technology recently, and a number of techniques have been proposed for locating the desired plate through visual image processing, such as the methods based on edge extraction [4], Hough transform [5], color features [6], and histogram analysis [7]. But most previous works have in some way restricted their working conditions, such as limiting them to indoor scenes, stationary backgrounds, fixed illumination, prescribed driveways or limited vehicle speeds. In this paper, as few constraints as possible on the working environment are considered. An LPR system is mainly composed of three processing modules, that is, license plate detection, character segmentation, and character recognition. Among them, license plate detection is considered the most crucial stage in the whole LPR system. License plate detection is the first step of automatic license plate identification, and also the key step; its result directly influences the final recognition result.
A color-based approach is normally useful and fast. Because the color of the characters on the license plate area is distinct from the color of the background, the physical characteristics such as the texture, color, geometric and shape information of the license plate area become the main basis of license plate locating methods. Having reviewed the methods mentioned above, and fully considering the rich texture variety and color information of the license plate area, this paper presents a coarse-to-fine algorithm to locate license plates in images and video frames with complex background. First, the method based on Component Connect (CC) is used to locate the possible license plate regions in the coarse detection. Second, the method based on texture analysis is applied in the fine detection. Finally, a BP Neural Network is adopted as classifier; part of the features are selected based on statistical diagrams to make the network efficient. The average accuracy of detection is 95.3%. The rest of the paper is organized as follows. The coarse detection method based on the connected component is presented in Section 2 and the fine detection process based on texture analysis is described in Section 3. Experimental results are presented in Section 4. The paper is concluded with a discussion of future work in Section 5.
2 Coarse Detection Based on Component Connect

Coarse detection is to find all the possible license plate regions in an image. In this procedure, edge detection is used to locate the strong edge pixels and density-based region growing is applied to connect the strong edge pixels into regions.

2.1 Edge Detection

Most license plates are designed to be easily read, and the color of the characters on the license plate area is distinct from the color of the background, so the boundaries of the characters must be strong edges. A Sobel edge detector is used to detect the characters in the license plate at first. The edges of the license plate can be detected if the lighting condition is favorable. Because characters usually have their own characteristics, four directions (0°, 45°, 90°, 135°) are used to detect their edges respectively. At the same time, a threshold Ts is given, and the edge pixels whose responses are greater than Ts are defined to be 'strong edge pixels'. Then the four detection results are merged by the 'or' operation to form the image of strong edge pixels.

2.2 Density-Based Region Growing

When the strong edge pixels are dense, merging them to form a connected region, which is called a candidate region, is considered. A pixel P will be a seed pixel if the percentage of candidate pixels in its neighborhood is larger than the threshold Tp. A pixel P' is considered to be connected with pixel P if P' is within the neighborhood of P and P is a seed pixel [8]. In this paper, the size of the neighborhood is 3 × 7 pixels and Tp is set to 0.45. In this section, almost every license plate-like region can be preserved; because the edge detector is sensitive, all the candidate regions must be processed further.
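A hedged Python sketch of this coarse stage is given below: four directional Sobel-style kernels produce strong edges that are merged by OR, and seed pixels are then found with the density rule. The exact kernels, the threshold values and the use of scipy are assumptions based on the description, not the authors' implementation.

```python
import numpy as np
from scipy import ndimage

KERNELS = {
    0:   np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], float),   # horizontal edges
    90:  np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float),   # vertical edges
    45:  np.array([[0, 1, 2], [-1, 0, 1], [-2, -1, 0]], float),   # diagonal edges
    135: np.array([[-2, -1, 0], [-1, 0, 1], [0, 1, 2]], float),   # anti-diagonal edges
}

def strong_edge_map(gray, Ts=100.0):
    """Merge the four directional responses by OR and keep pixels above Ts."""
    strong = np.zeros(gray.shape, dtype=bool)
    for k in KERNELS.values():
        response = np.abs(ndimage.convolve(gray.astype(float), k))
        strong |= response > Ts
    return strong

def seed_pixels(strong, Tp=0.45, size=(3, 7)):
    """A pixel is a seed if the strong-edge density in its 3x7 neighborhood exceeds Tp."""
    density = ndimage.uniform_filter(strong.astype(float), size=size)
    return density > Tp
```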
3 Fine Detection Based on Texture Analysis

The candidate regions with abrupt variation may be falsely detected as license plates, for instance, textures like leaves. In this section, texture features are extracted to identify the true license plate among the candidate regions.

3.1 Region Filtering Depending on Heuristic Information

Before extracting the texture features, some heuristic information is used to filter out regions which cannot be text. Because the characters on the license plate are text, the criteria are chosen according to the text detection method used in our other works. The criteria are as follows:

1. If the height of the candidate region is greater than 2/3 of the original image, the candidate region is discarded.
2. If the ratio of the width and the height of the candidate region is lower than a threshold Tr (Tr = 1.2), the candidate region is discarded.
3. If the size of the candidate region is smaller than 20 × 10 pixels, the candidate is discarded.

The first criterion says that if the region is too big in the original image, it is considered as background, like forests and so on. The ratio in the second criterion is heuristic information. For texts aligned horizontally, the ratio must be greater than 1, because the texts which the method detects are not a single character but at least a word. In license plate detection, according to the standard ratio of the license plate, Tr is set to 1.2, considering abnormal license plates as well. The threshold in the third criterion is based on human vision: if a region is smaller than this threshold, a human cannot recognize the text in it, so a region smaller than the threshold is discarded. After filtering out some false license plate regions by the criteria above, texture features are extracted as described in the next section. All the thresholds are set according to experimental results from 3777 images which contain texts.

3.2 Feature Extraction

Generally, most of the texture features are affected by the size of the candidate regions, so before extracting features, the candidate regions must be normalized. In this paper, the normal size of the candidate region is set to 64 × 128 pixels.

3.2.1 Gabor Filter (Four Different Directions)

In general, the directional characteristic is one of the important elements for texture discrimination. Under this premise, four different directions are chosen, which are 0°, 45°, 90°, and 135°, to cover the general case. Then four images are obtained after convolving the candidate region with the Gabor filter. 6 texture features will be computed in each of these 4 images, and 24 features are obtained in all. The 6 features are mean, standard deviation, energy, entropy, inertia and local homogeneity, which are defined as follows:
μ = (1/(m×n)) Σ_{i=1}^{m} Σ_{j=1}^{n} G(i, j),   (1)

σ = √( (1/(m×n)) Σ_{i=1}^{m} Σ_{j=1}^{n} (G(i, j) − μ)² ),   (2)

E_g = Σ_{i,j} G²(i, j),   (3)

E_t = − Σ_{i,j} G(i, j) · log G(i, j),   (4)

I = Σ_{i,j} (i − j)² G(i, j),   (5)

H = Σ_{i,j} G(i, j) / (1 + (i − j)²).   (6)
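The following sketch is one possible interpretation of equations (1)-(6) computed on a single Gabor response image G; it assumes G is scaled to (0, 1] so that the entropy term is well defined, and is not the authors' implementation.

```python
import numpy as np

def texture_features(G):
    """Six texture features of a Gabor response image G (assumed scaled to (0, 1])."""
    m, n = G.shape
    mu = G.sum() / (m * n)                               # mean, eq. (1)
    sigma = np.sqrt(((G - mu) ** 2).sum() / (m * n))     # standard deviation, eq. (2)
    energy = (G ** 2).sum()                              # energy, eq. (3)
    entropy = -(G * np.log(G + 1e-12)).sum()             # entropy, eq. (4)
    i, j = np.indices(G.shape)
    inertia = (((i - j) ** 2) * G).sum()                 # inertia, eq. (5)
    homogeneity = (G / (1.0 + (i - j) ** 2)).sum()       # local homogeneity, eq. (6)
    return np.array([mu, sigma, energy, entropy, inertia, homogeneity])
```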
3.2.2 Multi-resolution Gabor Filter

Two assumptions should be noted. First, the texture image is supposed to preserve its main information through the down-sampling process. This means the major information will still be retained under the multi-resolution process. Second, we assume that there exists a set of parameters that contains the most dominant features in the general Gabor filter method [9].
Fig. 1. The detailed method of Gabor filters under a certain resolution (the level-0 candidate region is filtered by 0°, 45°, 90° and 135° Gabor filters, and 6 features are extracted from each filtered image to form the level-0 feature set)
The Gabor filters are generally used by several sets of parameters. They are chosen by the octave concept (Jain and Farrokhnia, 1991). These filters are distinguished from low-to high-pass filters. We assume that the lowest-pass filter dominates other higher filters. By this, the multi-resolution concept can be used. The low-pass filter at the finest resolution (level 0) is used and the next higher-pass filter at level 1, which is the down-sampled image of the source image, is applied, and so on [9]. The candidate region is down-sampled twice, and then 72 features are obtained. A simple sketch of our process routine is given in Fig. 1 and Fig. 2.
Fig. 2. Procedure of the multi-resolution Gabor filter (the level-0 candidate region yields the level-0 feature set; it is down-sampled to level 1 and level 2, each yielding its own feature set, and feature selection is then applied to all of them)
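A hedged sketch of this multi-resolution step follows: the candidate region is processed at three levels (level 0 and two down-sampled copies), and the six features are computed for each of the four directions at every level, giving 3 × 4 × 6 = 72 features. It reuses texture_features from the previous sketch; gabor_filter stands for any standard Gabor filtering routine and is an assumption.

```python
import numpy as np

def multiresolution_features(region, gabor_filter,
                             directions=(0, 45, 90, 135), levels=3):
    feats = []
    level_img = region
    for _ in range(levels):
        for theta in directions:
            feats.extend(texture_features(gabor_filter(level_img, theta)))
        level_img = level_img[::2, ::2]        # simple down-sampling by 2
    return np.array(feats)                      # 72 features in total
```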
3.3 Feature Selection

72 features were extracted in the previous section. Although all of these features can be used to distinguish true license plates from false license plates, some features may contain more information than others. Using only a small set of the most powerful
Fig. 3. Parts of the statistical diagrams for single features on 3777 samples. (a) The statistical diagram drawn from the fourth feature (entropy) of the image after Gabor filtering in the 135° direction, (b) the statistical diagram drawn from the second feature (deviation) of the image after Gabor filtering in the 45° direction.
features will reduce the time for feature extraction and classification. So some more powerful features have to be selected. A statistical diagram for every feature is drawn, in which 3777 samples are used. Fig. 3 shows two of them. In Fig. 3, the horizontal axis is the numerical value of a certain feature. The blue points are drawn from the positive samples (true license plates); the red ones are drawn from the negative samples (false license plates). In Fig. 3(b), the common field (I2) is larger than I1 in Fig. 3(a). So the feature that Fig. 3(a) denotes has more useful information than the one Fig. 3(b) denotes. So R is defined as follows:
R(α, i) = S_c / S_t,   i = 1, 2, ..., 6.   (7)
Table 1. The R of every feature in level 0

Direction   Average   Deviate   Energy   Entropy   Contrast   Homogeneity
0           0.2667    0.75      0.4      0.5357    0.46       0.6667
45          0.1754    0.2632    0.1857   0.8333    0.35       0.4667
90          0.3333    0.35      0.3597   0.1125    0.1750     0.412
135         0.04      0.2931    0.125    0.8       0.375      0.5
Table 2. Features used for text/non-text classification

Resolution   Direction              Number of features   Number of features selected
Level 0      0°, 45°, 90°, 135°     24                   6
Level 1      0°, 45°, 90°, 135°     24                   6
Level 2      0°, 45°, 90°, 135°     24                   6
Fig. 4. The error of the BP neural network. (a) BP trained with the features obtained only from the four-direction Gabor filters, (b) BP trained with the features obtained from the multi-resolution Gabor filters after feature selection.
Table 3. The comparison of experimental results

                            The method in [10]   Our method
The accuracy of detection   92.4%                95.3%
Fig. 5. The results of the license plate detection. (a) The result of detecting a distorted license plate, (b) the result of detecting a tilted license plate, (c)-(d) the extraction results of (a) and (b) respectively, (e)-(f) the results of detecting multiple license plates, (g)-(k) the extraction results of (e) and (f) respectively.
Sc denotes the size of the common region of the true license plate and the false license plate, St denotes the size of the region of the text, α denotes the direction of the Gabor filter, i denotes the number of the formula.
The smaller the R is, the higher the probability of distinguishing the true license plate samples from the false license plate samples by this feature. According to the table, the features whose R is smaller than a threshold T are selected. Finally, 18 features are selected over the three levels to feed the BP Neural Network, as shown in Table 2.

3.4 Training BP Neural Network

The BP Neural Network is trained on a dataset consisting of 1262 positive samples (true license plates) and 2515 negative samples (false license plates). The error of the BP Neural Network that is trained by the selected features is much smaller than that of the non-selected one (Fig. 4). The error of the BP trained only by the 72 features extracted from four directions in level 0 (Fig. 6) is greater than 10^-4, but the one trained by the selected features from the multi-resolution Gabor can reach 10^-15, which is far better than the former one.
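A hedged sketch of this classification stage is given below: a back-propagation network trained on the 18 selected features to separate plate from non-plate regions. The topology, solver settings and the synthetic stand-in data are assumptions; the paper does not specify the exact network configuration.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X_train = rng.random((200, 18))               # stand-in for the 18 selected features
y_train = (X_train[:, 0] > 0.5).astype(int)   # stand-in labels: 1 = plate, 0 = non-plate

bp = MLPClassifier(hidden_layer_sizes=(10,), activation='logistic',
                   solver='sgd', learning_rate_init=0.1, max_iter=2000)
bp.fit(X_train, y_train)
print(bp.predict(X_train[:5]))
```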
4 Experimental Results

The proposed method was implemented on a personal computer with an Intel Pentium 4-1.6GHz CPU/256M RAM. The images in our database contain various types of license plates collected from several outdoor parking places with different angles and different lighting conditions. As shown in Fig. 4, the vehicle license plates are located in every condition. The rate of success is 95.3%. Nevertheless, the algorithm gives good results on our database, and it is relatively robust to variations of the lighting conditions and different orientations. The results of detecting license plates are shown in Fig. 5.
5 Conclusions and Future Work

In this paper, a coarse-to-fine algorithm to locate license plates in images and video frames with complex background is proposed. First, the method based on Component Connect (CC) is used to locate the possible license plate regions in the coarse detection. Second, the method based on texture analysis is applied in the fine detection. Finally, a BP Neural Network is adopted as classifier; part of the features are selected based on statistical diagrams to make the network efficient. The average accuracy of detection is 95.3%. The feature selection procedure finds effective texture features to represent the license plate pattern. Although the algorithm is designed mainly for locating a single license plate in one color image, it can work well in locating multiple license plates in one color image and in one video frame. Generally, we only provide a license plate detection method. However, the characters on the license plate still need to be recognized. Special techniques should be investigated in future work to segment the characters from the background before putting them into the OCR software.
Acknowledgement This paper is supported by the National Nature Science Foundation of China under Grants 60433020 and 60673099, the Key Laboratory for Symbol Computation and Knowledge Engineering of the National Education Ministry of China, and the Project ‘985’: Science Technique Innovation Platform of Computation and Software Science. This work is also supported by science foundation for young teachers of Northeast Normal University, China under Grant 20061003.
References 1. Naito, T.: Robust Recognition Methods for Inclined License Plates Under Various Ilumination Conditions Outdoors. Proc. IEEE/IEEJ/JSAI Int. Conf. on Intelligent Transportation Systems (1999) 697-702 2. Busch, C., Domer, R., Freytag, C., Ziegler, H.: Feature Based Recognition of Traffic Video Streams for Online Toute Tracing. Proc. IEEE Conf. on Vehicular Technology 3 (1998) 1790-1794 3. Zunino, R., Rovetta, S.: Vector Quantization for License-plate Location and Image Coding. IEEE Trans. Ind. Electron. 47 (2000) 159-167 4. Gonzalez, R.C., Woods, R.E.: Digital Image Processing (Second Edition). Prentice-Hall (2002) 5. Kim, K. M., Lee, B. J., Lyou, K.: The Automatic Coefficient and Hough Transform. Journal of Control, Automatic and System Engineering 3 (5) (1997) 511-519 6. Zhu W.G., Hou G.J., Jia X.: A Study of Locating Vehicle License Plate Based on Color Feature and Mathematical Morphology. Signal Processing 1 (2002) 748-751 7. Cho, D.U., Cho, Y.H.: Implementation of Preprocessing Independent of Environment and Recognition of Car Number Plate Using Histogram and Template Matching. The journal of the Korean Institute of Communication Sciences 23 (1998) 94-100 8. Ye, Q.X., Huang, Q.M., Gao, W., Zhao, D.B.: Fast and Robust Text Detection in Images and Video Frames. Image and Vision Computing 23 (2005) 565-576 9. Chen, C.C., Chen, D.C.: Muti-resolution Gabor in Texture Analysis. Pattern Recognition Letters 17 (1996) 1069-1076 10. Hsieh, C.T., Juan, Y.S., Hung, K.M.: Multiple License Plate Detection for Complex Background. IEEE Proceedings of the 19th International Conference on Advanced Information Networking and Applications (2005)
Searching Eye Centers Using a Context-Based Neural Network Jun Miao1, Laiyun Qing2, Lijuan Duan3, and Wen Gao1 1
Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080, China {jmiao,wgao}@ict.ac.cn 2 School of Information Science and Engineering, Graduate University of the Chinese Academy of Sciences, Beijing 100049, China
[email protected] 3 College of Computer Science and Technology, Beijing University of Technology, Beijing 100022, China
[email protected]
Abstract. Location of human features, such as human eye centers, is very important for face image analysis and understanding. This paper proposes a context-based method for human eye center search. A neural network learns the contexts between human eye centers and their environment in images. For some initial positions, the distances between them and the labeled eye centers in the horizontal and vertical directions are learned and remembered respectively. Given a new initial position, the system will predict the eye centers' positions according to the contexts that the neural network learned. Two experiments on human eye center search showed promising results.
1 Introduction

Human facial feature search and location is very important for human face image analysis, description, model coding, understanding and recognition. Many methods have been published for facial feature detection. Among facial features, eye features receive the most attention because of their important positions in human faces. In general, the methods for eye feature search can be divided into three categories. The first one is the knowledge-based method, such as [1-2], which locates eyes' positions through analysis of grey (or grey-variance) projection with related knowledge. The performance of this method is generally not robust, being easily affected by facial expression variations. The second category is the template matching method, such as the 3-D eye template [3], deformable templates [4-8], ASM [9] and AAM [10], most of which often involve an iterative search procedure according to objective function computation, and the procedure is usually time-consuming. The third category is the discrimination method, such as [11-12], which views eye feature detection as a binary classification problem, and whose performance depends on the preceding stage of face detection. Most of the above methods, when used for visual object search, seldom concern contexts between visual features, which seem quite important for human beings' visual function. Some research works [13-18] have utilized context for object search and
recognition, for example, Torralba et.al.[16-18] introduced probability and statistical framework to detect object with context. Here we propose a neural network method for context based eye centers search. The neural network consists of two parts: local visual pattern recognition and object position prediction. In the following paragraphs, the entire neural network structure is given in section 2. Then the two parts of the system are introduced respectively in section 2.1 and 2.2. Section 3 gives the detailed context based searching and learning mechanism. In section 4, experiments on eye centers searching are discussed. Conclusion and future directions are given in last section.
2 A Context-Based Neural Network

Fig. 1 and 2 illustrate a neural network system for object position prediction or eye center search according to contexts.
Fig. 1. A context-based neural network for object position (x, y) prediction
The neural network consists of two parts. One is a local image recognition structure, which inputs a local image from a group of visual fields with corresponding resolutions and then recognizes the current local image according to features such as gray level and edges. The second part is an object position prediction structure, which predicts the position of the object in terms of horizontal and vertical shift distances (x, y) from the center position (0, 0) of the current local image. The two structures naturally incorporate into an entire one and cooperate to recognize and predict in a repeated mode from a global low resolution to a local high resolution until the system's position prediction no longer changes (shift distances x and y are both zero).
Fig. 2. System framework for eye centers search and location (flowchart: the local image centered at position (x, y) is input; simple features respond and compete; the local image pattern is recognized; x and y are cleared to 0 and the eye center position (x, y) is predicted; if the predicted position is not (0, 0), the center is moved to (x, y), the visual field scale is changed to a smaller one, and the loop repeats; otherwise the search ends)
2.1 Local Image Recognition Part

With reference to Fig. 1, this part consists of three layers of neurons: the first layer, input neurons; the second layer, feature neurons; and the third layer, recognition neurons. With reference to Fig. 2, the first layer inputs local images centered at a position (0, 0) from visual fields with corresponding resolutions. The second layer extracts features such as gray level and edges. These features are involved in competition, and only winners contribute to the responses of neurons in the next layer. The third layer is composed of neurons which recognize different local image patterns. Fig. 3 illustrates the features that neurons in the second layer extract, in which two types of features are given: brightness and contrast. There is one feature pattern for brightness (f0) and there are 12 patterns for contrast (f1 ~ f12), respectively. Among them, the 12 contrast features actually represent 3 kinds of geometrical features, which are points, line segments and arcs with different positions or orientations. A small gray box in a feature pattern in Fig. 3 represents an excitatory input with a positive weight, and a black box represents an inhibitory input with a negative weight, from input neurons (corresponding to an image window of 2x2 pixels) into a feature neuron. Thus a feature pattern can be represented by a vector with a group of weights (here 4 weights). Generally, all weights in each feature vector are normalized to length 1 for unified feature response/similarity computation and comparison (with reference to Fig. 4).
Fig. 3. Feature patterns (2x2 pixels) that feature neurons extract
Let vector $\vec{x}_i = (x_{i1}, x_{i2}, x_{i3}, x_{i4})$ represent the i-th image window and vector $\vec{f}_{ij} = (a_{j1}, a_{j2}, a_{j3}, a_{j4})$ represent the j-th feature extracting pattern for the i-th image window $\vec{x}_i$; then the feature response $r_{ij} = f_{ij}(\vec{x}_i)$ can be obtained by orthogonal projection or inner product computation (with reference to Fig. 4):

$$r_{ij} = f_{ij}(\vec{x}_i) = \langle \vec{f}_{ij}, \vec{x}_i \rangle = a_{j1}x_{i1} + a_{j2}x_{i2} + a_{j3}x_{i3} + a_{j4}x_{i4}$$

Generally, a neuron fires only if its response is larger than a threshold, for example, threshold = 0. Thus the actual response of a neuron is:

$$r_{ij} = f_{ij}(\vec{x}_i) = \begin{cases} \langle \vec{f}_{ij}, \vec{x}_i \rangle & \text{if } \langle \vec{f}_{ij}, \vec{x}_i \rangle > 0 \\ 0 & \text{otherwise} \end{cases}$$
Fig. 4. Feature neurons' responses $r_j = f_j(\vec{x})$
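To make the feature-response computation concrete, the following Python sketch computes rectified inner-product responses for one 2x2 image window. It is only an illustration under assumptions: the feature names and the small subset of patterns (taken from the numeric examples in the text) are chosen for readability and are not the authors' exact implementation.

```python
import numpy as np

def unit(v):
    """Normalize a flattened feature pattern to unit length."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

# A few hypothetical 2x2 feature patterns, flattened to 4-vectors (cf. Fig. 3)
features = {
    "brightness": unit([1, 1, 1, 1]),      # f_0
    "point":      unit([-3, 1, 1, 1]),     # a contrast pattern
    "edge":       unit([-1, -1, 1, 1]),    # a contrast pattern
    "arc":        unit([3, -1, -1, -1]),   # a contrast pattern
}

def feature_responses(window_2x2):
    """Rectified inner products r_ij = max(<f_ij, x_i>, 0) for one image window."""
    x = np.asarray(window_2x2, dtype=float).ravel()
    return {name: max(float(f @ x), 0.0) for name, f in features.items()}

# Example: a window that is brighter on its bottom row
print(feature_responses([[0.2, 0.2], [0.9, 0.9]]))
```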
Mathematically these features constitute a set of non-orthogonal bases in the local feature vector space for describing image window patterns. For example, with reference to Fig. 3, $\vec{f}_{i0} = (1,1,1,1)/\sqrt{4}$, $\vec{f}_{i1} = (-3,1,1,1)/\sqrt{12}$, $\vec{f}_{i5} = (-1,-1,1,1)/\sqrt{4}$, $\vec{f}_{i9} = (3,-1,-1,-1)/\sqrt{12}$, in which the brightness feature vector $\vec{f}_{i0}$ is orthogonal to any one of the contrast feature vectors $\vec{f}_{i1} \sim \vec{f}_{i12}$. Generally, the brightness feature $\vec{f}_{i0}$ has the largest response to any image window input $\vec{x}_i$, except that in a few cases the contrast feature of "point" or "arc" has the largest response. If we select the two largest-responding features $\vec{f}_{i0}$ and $\vec{f}_{ik}$ (k=1~12), then the image window pattern $\vec{x}_i$ can be approximately reconstructed by a sum of weighted $\vec{f}_{i0}$ and weighted $\vec{f}_{ik}$ (k=1~12), i.e.:

$$\vec{x}_i \approx b_{i0}\vec{f}_{i0} + b_{ik}\vec{f}_{ik}, \quad (k=1\sim 12)$$

where $b_{i0} = r_{i0} = f_{i0}(\vec{x}_i)$ and $b_{ik} = r_{ik} = f_{ik}(\vec{x}_i)$ (k=1~12). In other words, the image window pattern $\vec{x}_i$ can be represented by the two reconstruction coefficients $b_{i0}$ and $b_{ik}$, or by the two feature neurons' responses $f_{i0}(\vec{x}_i)$ and $f_{ik}(\vec{x}_i)$ (k=1~12). From the point of view of feature reduction in pattern recognition, the first m features ($\vec{f}'_{i1}, \vec{f}'_{i2}, \dots, \vec{f}'_{im}$) that have the largest responses ($r'_{i1} = f'_{i1}(\vec{x}_i), r'_{i2} = f'_{i2}(\vec{x}_i), \dots, r'_{im} = f'_{im}(\vec{x}_i)$) to the image window pattern $\vec{x}_i$ can approximately describe or represent $\vec{x}_i$ at the cost of a minimum reconstruction error. Generally, m is less than the pixel number, i.e., the dimension of the image window input $\vec{x}_i$. In our system, as illustrated in Fig. 3, the size of the image window input or receptive field of the feature neurons is 2x2 = 4 pixels. Thus the dimension of the image window pattern $\vec{x}_i$ is 4. For the purpose of feature reduction, m is set to 2, which is less than the number of pixels of the image window input $\vec{x}_i$, i.e.:

$$\vec{x}_i \approx \sum_{j=0}^{m} b'_{ij}\vec{f}'_{ij}$$
Fig. 5. Local image recognition structure
Fig. 5 shows the local image recognition structure, in which the k-th recognition neuron receives inputs (weighted with strengths $w_{k,ij}$) from the ij-th feature neurons (with responses $r'_{ij}$) that respond to the brightness and geometrical contrast features of their i-th image window input $\vec{x}_i$. So the recognition neuron's response $R_k = F(\vec{x})$, for the local image $\vec{x} = (\vec{x}_1, \vec{x}_2, \dots, \vec{x}_N)$ which is composed of the image window inputs $\vec{x}_i$, is:

$$R_k = F(\vec{x}) = F((\vec{x}_1, \vec{x}_2, \dots, \vec{x}_N)) = \sum_{i=1}^{N}\sum_{j=1}^{m} w_{k,ij} f'_{ij}(\vec{x}_i) = \sum_{i=1}^{N}\sum_{j=1}^{m} w_{k,ij} r'_{ij}$$

where the weights $w_{k,ij}$ are acquired in the learning stage according to the Hebbian rule $w_{k,ij} = \alpha R_k r'_{ij}$, in which $R_k$ is set to 1 to represent the response of the k-th recognition neuron when it is generated for a new local image pattern, and $\alpha$ is also set to 1 for simplification. All the weights are normalized to length 1 for unified similarity computation and comparison (with reference to Fig. 4).
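A minimal sketch of how the recognition-layer response and the Hebbian generation of a new recognition neuron could be implemented is given below; the matrix/vector layout is an assumption made for illustration, not the paper's code.

```python
import numpy as np

def recognition_responses(W, r_prime):
    """Responses R_k = sum_ij W[k, ij] * r'_ij for all recognition neurons.

    W is a (num_recognition_neurons, num_selected_features) weight matrix and
    r_prime is the flattened vector of winning feature responses r'_ij.
    """
    return W @ r_prime

def add_recognition_neuron(W, r_prime, alpha=1.0):
    """Hebbian generation of a new recognition neuron: w_k,ij = alpha * R_k * r'_ij,
    with R_k = 1, followed by normalization of the new weight vector to unit length."""
    w_new = alpha * 1.0 * np.asarray(r_prime, dtype=float)
    w_new = w_new / (np.linalg.norm(w_new) + 1e-12)
    return np.vstack([W, w_new]) if W.size else w_new[None, :]
```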
2.2 Position Prediction Part
The position prediction structure consists of two layers of neurons: recognition neurons and position neurons (Fig. 6). The recognition neurons, as discussed in the last section, recognize different local image patterns. The position neurons are divided into X-position and Y-position neurons, which represent the object's position (x, y) relative to the origin, i.e., the center position (0, 0) of the current local image that the system takes as input from a visual field with a corresponding resolution.
Fig. 6. Position prediction structure
For the current local image input, the recognition neuron with the maximum response wins out through competition with the other recognition neurons and represents the current local image pattern. In the learning stage, if the k-th recognition neuron has the maximum response and the object's position is (x, y) relative to the center of the current local image, two connections are generated between the k-th recognition neuron and two position neurons: the x-position and y-position neurons (Fig. 6). The weights $w_{k,x}$ and $w_{k,y}$ on the two connections are learned with the Hebbian rule:

$$w_{k,x} = \alpha R_k R_x, \quad w_{k,y} = \alpha R_k R_y$$

where the responses $R_k$, $R_x$, and $R_y$ are all set to 1 to represent the responses of the k-th recognition, x-position, and y-position neurons, respectively, when they are newly generated, and $\alpha$ is also set to 1 for simplification.
3 Context-Based Searching and Learning Mechanism

The system's object searching and locating consist of a series of local image pattern recognition and object position prediction procedures carried out according to learned contexts, which begin with an initial center position (0, 0) and end with a final predicted position of (0, 0). The local image pattern recognition procedure is achieved by the recognition neuron with the largest response among all recognition neurons becoming the winner through competitive interaction. The object position prediction procedure is achieved by the winning recognition neuron activating the x- and y-position neurons
according to the learned contexts. The two procedures cooperate to recognize and predict repeatedly, from a global low resolution to a local high resolution, until the system's position prediction no longer changes (the predicted shifts x and y are both zero). The system's learned contexts are preserved in the neural network's weights. The Hebbian rule is the fundamental learning rule, i.e., $w_{ij} = \alpha R_i R_j$, where $w_{ij}$ is the connecting weight, $\alpha$ is the learning rate, and $R_i$ and $R_j$ are the responses of the two mutually connected neurons. The learning mechanism is as follows (a minimal sketch of the search phase of this loop is given after the list):
1. Input a local image from a visual field with a corresponding resolution, take the current center position of the input as the origin, and predict the object's position (x, y);
2. If the prediction result is not correct, generate a new recognition neuron (let its response R = 1); otherwise go to 4;
3. Compute the connecting weights between the new recognition neuron and the feature neurons, and between the new recognition neuron and the two position neurons (let the responses Rx = Ry = 1), using the Hebbian rule $w_{ij} = \alpha R_i R_j$;
4. Move the current center position to the position of the object in the current visual field, and change the visual field and its resolution to a smaller and a higher one, respectively;
5. Go to 1, until all scales of visual fields and all given initial center positions are learned.
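The following Python sketch illustrates the search phase described above, moving from a coarse visual field to finer ones until the predicted shift is (0, 0). The helper functions extract_local_image and predict_shift, as well as the scale list, are hypothetical placeholders, not the authors' implementation.

```python
def search_eye_center(image, start_xy, predict_shift, extract_local_image,
                      scales=(256, 128, 64, 32, 16)):
    """Coarse-to-fine context-based search for an eye center."""
    cx, cy = start_xy
    for scale in scales:                       # from global low to local high resolution
        local = extract_local_image(image, (cx, cy), scale)
        dx, dy = predict_shift(local)          # winning recognition neuron drives position neurons
        if (dx, dy) == (0, 0):                 # prediction no longer changes: stop
            break
        cx, cy = cx + dx, cy + dy              # move the center toward the predicted eye position
    return cx, cy
```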
4 Experiments

The system is applied to searching for and locating human eye centers in still images (320x214 pixels) from the face database of the University of Bern.

4.1 System Structure
A group of visual fields at 5 different scales (16x16, 32x32, 64x64, 128x128 and 256x256 pixels) is used to input local images from the training and test images (320x214 pixels). For each scale or resolution, with reference to Fig. 1 and Fig. 5, there is a corresponding 16x16 input neuron array with a different sampling interval, so there are in total 5×16×16 = 1280 neurons in the first layer. With reference to Fig. 2 and Fig. 5, the size of the image window or receptive field of the feature neurons is 2x2 pixels, and there are 13 types of such features. Thus there are in total 5×13×[(16/2)×2−1]² = 14625 feature neurons, of which only 14625×(2/13) = 2250 neurons (the first m largest-responding neurons, m = 2, see Section 2.1) win the competition and contribute to activating the recognition neurons in the third layer. The number of recognition neurons in the third layer depends on the natural categories of local image patterns that the system has learned. The number of position neurons in the fourth layer is 2×16 = 32, which represents 16 positions in the x and y directions, respectively, and corresponds to the 16x16 input neuron array in the first layer.
4.2 Experiments
Two experiments are conducted on the face images (320x214 pixels) from the face database of the University of Bern, which contains 300 images of 30 people (ten images per person) at different poses.
Fig. 7. Context learning for sequentially searching two eye centers from a group of evenly distributed initial positions
Fig. 8. Testing for sequentially searching two eye centers from a group of randomly distributed initial positions
As illustrated in Fig. 7 and Fig. 8, context learning uses a group of evenly distributed initial positions, while testing uses a group of randomly distributed initial positions. Given an initial position, the system is trained or tested to search for and locate the left eye center first and then the right eye center on the basis of the left eye center's search result. In the first experiment, 30 images of 30 people (one frontal image per person) are learned with 368 initial positions on each image, and the remaining 270 images are tested at 48 random initial positions on each image. The average searching error is 5.74 pixels for left eye centers and 8.43 pixels for right eye centers. In the second experiment, 90 images of 9 people (10 images each) are learned with 1944 initial positions on each image, and the remaining 210 images are tested at 48 random initial positions on each image. The searching error is 8.43 pixels for left eye centers and 9.86 pixels for right eye centers.
[Plot: cumulative correct location rate (0–1) vs. relative error, with curves for left and right eye centers]
Fig. 9. Test results in the 1st experiment
Fig. 10. Test results in the 2nd experiment
Fig. 9 and Fig. 10 show the statistical results for the search for the two eye centers, in which the horizontal axis represents the distance between the search result and the ground truth as a percentage of the distance between the two real eye centers. The vertical axis represents the cumulative correct searching/location rate. In our experiments, the right eye center search is designed to follow the search for the left eye center. As a result of this dependence, the performance for right eye centers is lower than that for left eye centers. The experiments also show that the generalization ability in experiment 1, in which the training and testing faces are different poses of the same persons, is somewhat better than that in experiment 2, in which the training and testing faces are from different persons.
5 Conclusion

This paper proposed a context-based neural network for automatic object searching. The system is applied to searching for human eye centers, and the experiments show promising results. The system's generalization ability could be enhanced by adding more feature-invariant representations or middle-layer clustering. In the future, the performance of the system could be further improved with more robust feature representations and a more compact neural network structure.
Acknowledgement

This research is partially sponsored by the Natural Science Foundation of China under contracts No. 60673091, No. 60332010, and No. 60473043, the Hi-Tech Research and Development Program of China (No. 2006AA01Z122), the Natural Science Foundation of Beijing (No. 4072023), the "100 Talents Program" of CAS, the Program for New Century Excellent Talents in University (NCET-04-0320), and ISVISION Technologies Co., Ltd.
References
1. Feng, G., Yuen, P.: Variance Projection Function and Its Application to Eye Detection for Human Face Recognition. Pattern Recognition Letters 19 (1998) 899-906
2. Zhou, Z., Geng, X.: Projection Functions for Eye Detection. Pattern Recognition (2004)
3. Huang, W., Yin, B., Jiang, C., Miao, J.: A New Approach for Eye Feature Extraction Using 3D Eye Template. Proc. International Symposium on Intelligent Multimedia, Video and Speech Processing (2001)
4. Yuille, A.: Deformable Templates for Face Detection. Journal of Cognitive Neuroscience 3 (1991) 59-70
5. Huang, C.L., Chen, C.W.: Human Face Feature Extraction for Face Interpretation and Recognition. Pattern Recognition 25 (1996) 1435-1444
6. Lam, K., Yan, H.: Locating and Extracting the Eye in Human Face Images. Pattern Recognition 29 (1996) 771-779
7. Deng, J.Y., Lai, F.: Region-based Template Deformation and Masking for Eye-Feature Extraction and Description. Pattern Recognition 30 (1997) 403-419
8. Jeng, S.H., Liao, H.: Facial Feature Detection Using Geometrical Face Model: An Efficient Approach. Pattern Recognition 31 (1998) 273-282
9. Cootes, T.F., Taylor, C.J., Cooper, J.: Active Shape Models - Their Training and Application. Computer Vision and Image Understanding 61 (1995) 38-59
10. Cootes, T.F., Edwards, G.J., Taylor, C.J.: Active Appearance Models. Proc. The 5th European Conference on Computer Vision (1998) 484-498
11. Jesorsky, O., Kirchberg, K., Frischholz, R.: Robust Face Detection Using the Hausdorff Distance. Proc. The 3rd International Conference on Audio- and Video-Based Biometric Person Authentication (2001) 91-95
12. Niu, Z., Shan, S., Yan, S., Chen, X., Gao, W.: 2D Cascaded AdaBoost for Eye Localization. Proc. International Conference on Pattern Recognition 2 (2006) 1216-1219
13. Kruppa, H., Santana, M., Schiele, B.: Fast and Robust Face Finding via Local Context. Proc. Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance (VS-PETS'03), Nice, France (2003)
14. Paletta, L., Greindl, C.: Context Based Object Detection from Video. LNCS 2626 - Proc. International Conference on Computer Vision Systems (2003) 502-512
15. Strat, T.M., Fischler, M.A.: Context-based Vision: Recognizing Objects Using Information from Both 2D and 3D Imagery. IEEE-PAMI 13 (1991) 1050-1065
16. Torralba, A., Sinha, P.: Statistical Context Priming for Object Detection. Proc. IEEE International Conference on Computer Vision (2001)
17. Torralba, A.: Modeling Global Scene Factors in Attention. Journal of the Optical Society of America A, Special Issue on Bayesian and Statistical Approaches to Vision 20 (2003) 1407-1418
18. Torralba, A., Murphy, K.P., Freeman, W.T.: Contextual Models for Object Detection Using Boosted Random Fields. Advances in Neural Information Processing Systems (2004)
A Fast New Small Target Detection Algorithm Based on Regularizing Partial Differential Equation in IR Clutter*

Biyin Zhang, Tianxu Zhang, and Kun Zhang

Institute for Pattern Recognition and Artificial Intelligence, State Key Laboratory for Multispectral Information Processing Technology, Huazhong University of Science and Technology, Wuhan, 430074, China
[email protected]

* The work was supported by National Natural Science Foundation of China (No. 60135020).
Abstract. To detect and track moving dim targets against the complex cluttered background in infrared (IR) image sequences is still a difficult issue because the nonstationary structured background clutter usually results in low target detectability and high probability of false alarm. A brand-new adaptive Regularizing Anisotropic Filter based on Partial Differential Equation (RAFPDE) is proposed to detect and track a small target in such strong cluttered background. A regularization operator is employed to adaptively eliminate structured background and simultaneously enhance target signal. The proposed algorithm’s performance is illustrated and compared with a two-dimensional least mean square adaptive filter algorithm and a BP neural network prediction algorithm on real IR image data. Experimental results demonstrate that the proposed novel method is fast and effective.
1 Introduction

A crucial problem in Infrared Search and Track (IRST) surveillance systems today is the detection and recognition of weak moving targets embedded in nonstationary cluttered backgrounds. The problem of low-observable small target detection and tracking arises in remote surveillance applications where the target signal amplitude is weak relative to the background clutter and noise. The effects of atmospheric radiation, sunlit bright clouds, or the earth's surface background usually produce non-stationary, textured clutter, so the targets are typically buried in highly structured background clutter and have a very low signal-to-clutter ratio (SCR) [1]. Traditionally, the detection and tracking of small targets in image sequences have been treated separately using the following processing steps: 1) image preprocessing, 2) target detection, and 3) multi-target tracking. Among them, high-performance clutter suppression and target enhancement is critical to detecting weak targets. Temporal filters [2], spatial filters [3], frequency-domain filters [4], three-dimensional filters [5], matched filters [6], and nonlinear neural networks [7] have been proposed. It is assumed that the appropriate application-dependent image preprocessing has already been performed. However, because of
highly structured background clutter, the preprocessing techniques are not completely satisfactory in smoothing edges caused by image texture, which leads to degraded detectability and high false alarm rates [1, 8]. To solve this problem, the challenge is to design methods which can adaptively reduce the clutter/noise without losing significant features. Considering the IR imaging model of dim small targets [1], the Partial Differential Equation (PDE) approach [9] is the best choice. Since it was introduced by Perona and Malik [9], a great deal of research has been devoted to its theoretical and practical understanding [10] in image restoration, edge detection, denoising, image enhancement, etc. The primary motivation of this work is to propose a novel algorithm that further smooths edge texture and improves the ability to detect weaker targets. In Sec. 2 we introduce the Perona and Malik PDE model. In Sec. 3 we develop a novel nonlinear adaptive anisotropic filter (RAFPDE) for background clutter suppression and target enhancement. In Sec. 4 we present the application of the new algorithm and compare it with the two-dimensional least mean square adaptive filter (TDLMS) [4] and the BP neural network prediction algorithm (BPNF) [7], which have relatively good performance for small target detection against complex clutter. The paper is concluded in Sec. 5.
2 The Perona and Malik Anisotropic Diffusion Equation

Perona and Malik proposed a nonlinear diffusion method for avoiding the blurring and localization problems of linear diffusion filtering. They applied an inhomogeneous process that reduces the diffusivity at those locations which have a larger likelihood of being edges. It is governed by the anisotropic diffusion equation:

$$\partial u(x, y, t)/\partial t = \mathrm{div}\big(c(\lvert\nabla u(x, y, t)\rvert)\,\nabla u(x, y, t)\big) \quad (1)$$
where $u(x, y, t)$ is the diffused image, $t$ determines the diffusion time, $\nabla u$ denotes the local gradient, and $c(t) = \varphi'(t)/2t$ is called the weighting function, with $\varphi(t)$ a proper potential function. Extensive work on anisotropic diffusion PDEs has proposed several conditions on $c(t)$ for edge-preserving regularization [10]: i) isotropic smoothing in homogeneous areas, ii) preservation of edges, iii) strictly decreasing to avoid instabilities. The following weighting functions were recommended by Perona and Malik:

$$c(t) = \exp\{-t^2/k^2\} \quad \text{or} \quad c(t) = 1/(1 + t^2/k^2), \qquad k > 0 \quad (2)$$
Eq. (1) can be discretized as follows, using a 4-nearest-neighbor discretization of the weighted Laplacian operator:

$$u_s^{l+1} = u_s^l + \lambda \sum_{p \in \eta_s} g(\nabla u_{s,p})\,\nabla u_{s,p} \quad (3)$$
where $u_s^l$ is the discretely sampled image, $s$ denotes the pixel position, and $l$ is the discrete time step (iteration). The constant $\lambda \in \mathbb{R}^+$ is a scalar that determines the diffusion rate, and $\eta_s$ represents the spatial neighborhood of pixel $s$. Perona and Malik linearly approximated the gradient using nearest-neighbor differences in a particular direction as

$$\nabla u_{s,p} = u_p - u_s \quad (4)$$
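As a concrete reading of Eqs. (3)-(4), here is a minimal Python/NumPy sketch of Perona-Malik diffusion with the exponential weighting function of Eq. (2); the parameter values (k, lambda, iteration count) and the periodic boundary handling are illustrative assumptions, not values from the paper.

```python
import numpy as np

def perona_malik(u, n_iter=20, k=10.0, lam=0.2):
    """Discrete Perona-Malik diffusion, Eqs. (3)-(4), with c(t) = exp(-(t/k)^2).

    u is a 2-D float array; lam should satisfy 0 < lam <= 0.25 for stability
    with a 4-neighbor scheme. np.roll gives simplified periodic borders.
    """
    u = u.astype(float).copy()
    for _ in range(n_iter):
        # nearest-neighbor differences toward the N, S, E, W neighbors, Eq. (4)
        dN = np.roll(u, -1, axis=0) - u
        dS = np.roll(u,  1, axis=0) - u
        dE = np.roll(u, -1, axis=1) - u
        dW = np.roll(u,  1, axis=1) - u
        # diffusivity c(|grad|) for each direction, Eq. (2)
        cN, cS = np.exp(-(dN / k) ** 2), np.exp(-(dS / k) ** 2)
        cE, cW = np.exp(-(dE / k) ** 2), np.exp(-(dW / k) ** 2)
        # explicit update u^{l+1} = u^l + lam * sum_p c * grad, Eq. (3)
        u += lam * (cN * dN + cS * dS + cE * dE + cW * dW)
    return u
```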
3 RAFPDE-Based Small Target Detection Algorithm

Target detection algorithms have been steadily improving, whereas many of them fail to work robustly in applications involving the changing backgrounds that are frequently encountered. In general, a small target embedded in a cloudy background appears as a gray spot in the image, which may also contain brightly illuminated terrain or sunlit clouds. In this case, the clutter is often much more intense than both the sensor noise and the target signal, and therefore adaptive filters (such as local mean removal or differencing operations, etc.), being isotropic filters, are insufficient to discriminate the target from this bright clutter. Here we present a novel algorithm based on regularizing PDEs (RAFPDE) to overcome those shortcomings. Three significant improvements are made: 1) a new principle for the conditions on the weighting function c(t) is presented for clutter-removing and target-preserving regularization; 2) the expensive computational complexity of Eq. (3) is reduced to obtain much faster performance; 3) the conventional two sequential processing steps of the clutter removal procedure (background estimation and target enhancement) are merged into a single step in which the tasks of clutter removal and target preservation are achieved simultaneously.

3.1 The Conditions of Clutter-Removing and Target-Preserving Regularization

Since the clutter removal procedure is used to reduce the effects of the non-stationary background on detection performance, it must satisfy two basic requirements: 1) it must remove the background structures in the image in order to reduce the number of false alarms in the detection step, and 2) it must maintain a high SCR to avoid a reduction in detection probability. According to the small target imaging model [1, 11, 12], in order to encourage smoothing within a region and across boundaries and to discourage smoothing in the signal of interest, we propose the following modified conditions on the weighting function $c(t) = \varphi'(t)/2t$:
① $\varphi'(t)/2t$ is continuous and strictly monotonically increasing on $[0, +\infty)$, to avoid instabilities;
② $\lim_{t \to +\infty} \varphi'(t)/2t = M$, $M \in [0, +\infty)$: anisotropic diffusion of edges is used to reduce structured background clutter;
③ $\lim_{t \to +0} \varphi'(t)/2t = 0$: isotropic smoothing in homogeneous areas to remove the background;   (5)
④ $\varphi(t) \ge 0$, $\forall t$, and $\varphi(0) = 0$; ⑤ $\varphi(t) = \varphi(-t)$; ⑥ $\varphi(t)$ is continuously differentiable; and ⑦ $\varphi'(t) \ge 0$, $\forall t \ge 0$.
Conditions ① to ③ are the three conditions for clutter removal. The basic assumptions ④ to ⑦ define the basic limits. The characteristic of the new principle is investigated in the following subsection.
3.2 The RAFPDE Filter's Formulation

According to the above new principle, we first start from the Geman-McClure regularization [13] to analyze the smoothing effect of the regularization functional J on the pixel $u(i, j)$. The regularization term J is formulated as:
$$J(u) = \int \varphi(u_x, u_y)\,dx\,dy \quad (6)$$
It is discretized on the 4-nearest-neighbor lattice $D_u$ as

$$J(u) = \sum_{i,j \in D_u} \big[\varphi(D^x_{i,j}(u)) + \varphi(D^y_{i,j}(u))\big] \quad (7)$$

where $D^x_{i,j}(u) = (u_{i,j+1} - u_{i,j})$ and $D^y_{i,j}(u) = (u_{i+1,j} - u_{i,j})$; then the effect of a change of pixel $u(i, j)$ on $\partial J/\partial u_{i,j}$ is formulated as

$$\partial J/\partial u_{i,j} = \frac{\partial}{\partial u_{i,j}}\big[\varphi(u_{i,j+1} - u_{i,j}) + \varphi(u_{i,j} - u_{i,j-1}) + \varphi(u_{i+1,j} - u_{i,j}) + \varphi(u_{i,j} - u_{i-1,j})\big]$$
$$= -\varphi'(u_{i,j+1} - u_{i,j}) + \varphi'(u_{i,j} - u_{i,j-1}) - \varphi'(u_{i+1,j} - u_{i,j}) + \varphi'(u_{i,j} - u_{i-1,j}) \quad (8)$$

Using $\varphi'(t) = 2t[\varphi'(t)/2t] = 2t \times c(t)$, Eq. (8) can be written as

$$\partial J/\partial u_{i,j} = -2\{\lambda_E u_{i,j+1} + \lambda_W u_{i,j-1} + \lambda_N u_{i+1,j} + \lambda_S u_{i-1,j} - \lambda_\Sigma u_{i,j}\} \quad (9)$$

and

$$\lambda_E = c(u_{i,j+1} - u_{i,j}), \quad \lambda_W = c(u_{i,j-1} - u_{i,j}), \quad \lambda_N = c(u_{i+1,j} - u_{i,j}), \quad \lambda_S = c(u_{i-1,j} - u_{i,j}), \quad \lambda_\Sigma = \lambda_E + \lambda_W + \lambda_N + \lambda_S \quad (10)$$

In other words, the derivative of J at the pixel (i, j) is obtained by convolving the original image u with the kernel $C'_w$. That is,

$$\hat{u} = C'_w * u, \quad \text{and} \quad C'_w = \begin{bmatrix} 0 & \lambda_N & 0 \\ \lambda_W & -\lambda_\Sigma & \lambda_E \\ 0 & \lambda_S & 0 \end{bmatrix} \quad (11)$$

where $\hat{u}$ is the filtered image. Therefore, $C'_w$ is a locally adaptive weighted Laplacian filter whose weights are given by the weighting function c(t). According to conditions ①-⑦, the following weighting functions $\phi(t)$ in Eq. (12) are feasible for small target detection using clutter-removing and target-preserving regularization in IR clutter:

$$\phi(t) = 1 - 1/[1 + (t/k)^2] \quad \text{or} \quad 1 - \exp[-(t/k)^2] \quad \text{or} \quad 1 - 1/\sqrt{1 + (t/k)^2} \quad (12)$$
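To illustrate Eqs. (10)-(12), the Python sketch below evaluates the locally adaptive weighted-Laplacian response at every pixel, using the weighting function phi(t) = 1 - exp[-(t/k)^2] from Eq. (12) directly as the directional weight (in the spirit of Sec. 3.3); the value of k and the simplified border handling are assumptions.

```python
import numpy as np

def rafpde_filter(u, k=10.0):
    """Locally adaptive weighted-Laplacian response of Eqs. (10)-(11),
    with directional weights phi(t) = 1 - exp(-(t/k)^2) from Eq. (12)."""
    u = u.astype(float)
    phi = lambda t: 1.0 - np.exp(-(t / k) ** 2)
    # directional differences to the E, W, N, S neighbors (simplified periodic borders)
    dE = np.roll(u, -1, axis=1) - u
    dW = np.roll(u,  1, axis=1) - u
    dN = np.roll(u, -1, axis=0) - u
    dS = np.roll(u,  1, axis=0) - u
    # adaptive weights lambda_E, lambda_W, lambda_N, lambda_S of Eq. (10)
    lE, lW, lN, lS = phi(dE), phi(dW), phi(dN), phi(dS)
    # weighted Laplacian of Eq. (11): weighted neighbors minus lambda_sum * center
    return (lE * np.roll(u, -1, axis=1) + lW * np.roll(u, 1, axis=1)
            + lN * np.roll(u, -1, axis=0) + lS * np.roll(u, 1, axis=0)
            - (lE + lW + lN + lS) * u)
```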
3.3 The Characteristics of the RAFPDE Filter in Response to Different Signals

1) The first case is a homogeneous area of the image: all gradients around the pixel (i, j) are close to zero. Because $\phi(t)$ meets condition ③, all weights around the pixel $u_{i,j}$ are approximately zero. The operator $C'_w(\varepsilon)$ is shown in Fig. 1(a), where $\lim_{\varepsilon \to 0} C'_w(\varepsilon) = 0$. Thus $u_{i,j}$ is completely smoothed (removed) as stationary clutter background.
2) Another case is that of similar large gradients around $u_{i,j}$: all gradients are equal to $t = \nabla_0$. According to condition ②, $\lim_{t \to \nabla_0} \phi(t) = m$, and the kernel $C'_w(m)$ is given in Fig. 1(b). Then Eq. (9) is expressed as:

$$\partial J/\partial u_{i,j} = 4\phi(t) \times t \propto t \quad (13)$$

It is seen that Eq. (13) increases with the increase of t because of condition ①; that is, filtered signals with large gradients will be enhanced. It is known that a small target signal can be modeled as a two-dimensional (2-D) additive pulse of small spatial extent, which is well approximated by the sensor point spread function (PSF) and occupies only a few adjacent columns in the IR image [1, 11, 12]. Under such conditions, $u_{i,j}$ with similarly high variations (assuming they are generated by small targets) will be enhanced prominently by $C'_w(m)$.
3) The last case is an edge or boundary. For example, a line edge passes through $u_{i,j}$. The corresponding $C'_w(e)$ is shown in Fig. 1(c). Therefore, $u_{i,j}$ with variations assumed to be due to edges will not be enhanced as much as in the case of Fig. 1(b); in other words, the structured clutter is smoothed more.
$$C'_w(\varepsilon) = \begin{bmatrix} 0 & \varepsilon & 0 \\ \varepsilon & -4\varepsilon & \varepsilon \\ 0 & \varepsilon & 0 \end{bmatrix} \;(a) \qquad C'_w(m) = \begin{bmatrix} 0 & m & 0 \\ m & -4m & m \\ 0 & m & 0 \end{bmatrix} \;(b) \qquad C'_w(e) = \begin{bmatrix} 0 & e & 0 \\ 0 & -2e & 0 \\ 0 & e & 0 \end{bmatrix} \;(c)$$
Fig. 1. The filtering kernel Cw′ around pixel (i, j) in different conditions: (a) in a homogeneous area, (b) in a similar large gradient (target-related) area, (c) in an edge or boundary structured area
From these analyses, it is shown that $C'_w$ is a locally adaptive anisotropic filter. In contrast with conventional filters (LMR, TDLMS, etc.), the RAFPDE filter's role is twofold: in stationary areas with small gradients, $C'_w$ is an isotropic diffusion filter that eliminates stationary clutter background; in nonstationary areas with large gradients, $C'_w$ becomes an anisotropic diffusion filter that smooths the background structure further, while at the same time the signal of interest remains sharp and stable. In short, the signal of interest is enhanced while complicated texture clutter is reduced, locally and adaptively. Such an a priori constraint is called "target-preserving" and "clutter-removing" regularization.

3.4 RAFPDE-Based Small Target Detection Algorithms

The proposed algorithm consists of two basic steps: (1) filter the original image u by convolving it with $C'_w$ using Eqs. (10) and (11); (2) thresholding: use a threshold to separate the target signal in the filtered image $\hat{u}$. This processing finds target candidates in every frame of the image sequence. We can then use a multi-frame accumulation method, autocorrelation, time prediction, or velocity filtering, etc., to suppress random noise. After the cluttered background and random noise are suppressed, dynamic programming, pipeline filtering, Hough transform, or trajectory matching algorithms, etc., can be used to estimate the target trajectories.
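A rough sketch of the two-step detection procedure follows, reusing the rafpde_filter function from the previous sketch; the mean-plus-k-sigma threshold rule is an assumption made for illustration, since the paper does not specify its thresholding scheme.

```python
import numpy as np

def detect_candidates(frame, k=10.0, thresh_sigma=4.0):
    """Single-frame candidate detection: RAFPDE filtering followed by thresholding."""
    filtered = rafpde_filter(frame, k=k)          # step (1): adaptive filtering, Eqs. (10)-(11)
    t = filtered.mean() + thresh_sigma * filtered.std()
    mask = filtered > t                           # step (2): threshold the filtered image
    ys, xs = np.nonzero(mask)
    return list(zip(xs.tolist(), ys.tolist()))    # candidate target pixel coordinates
```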
4 Experimental Results

4.1 Performance Criteria

Two criteria are defined to measure the different filters' capability of preserving the target signal and removing the background structures, respectively; in addition, the computational complexity (elapsed time, EST) is also used. The capability of target preservation is measured using the Improvement of SCR (ISCR), which compares the SCR obtained after clutter removal (SCRout) with the original SCR
(SCRin) in the original image [5]. The ability to remove the background structures is measured by the background suppression factor (BSF). They are defined as:

$$SCR = (\mu_{bt} - \mu_b)/\sigma_c \quad (14)$$
$$ISCR = SCR_{out}/SCR_{in} \quad (15)$$
$$BSF = \sigma_{out}/\sigma_{in} \quad (16)$$
where $\mu_{bt}$ is the intensity peak value of the target, $\mu_b$ is the average intensity of the pixels in the neighborhood around the target, and $\sigma_c$ is the standard deviation of the background plus noise. $\sigma_{in}$ and $\sigma_{out}$ are the background standard deviations of the original image and the filtered image, respectively.

4.2 Experiments Using Real IR Images

We evaluate the performance of the RAFPDE method by comparison with TDLMS. For these experiments, we select typical IR images from different image sequences.
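For illustration, the following Python sketch evaluates Eqs. (14) and (15) given a target location and a local background region; how the target peak and its surrounding background neighborhood are delimited here is an assumption, not a specification from the paper.

```python
import numpy as np

def scr(img, target_yx, bg_slice):
    """SCR = (mu_bt - mu_b) / sigma_c, Eq. (14).

    target_yx is the (row, col) of the target peak; bg_slice is an index
    expression (e.g., a tuple of slices) selecting the local background region.
    """
    mu_bt = img[target_yx]                 # intensity peak value of the target
    bg = img[bg_slice]                     # pixels in the neighborhood around the target
    return (mu_bt - bg.mean()) / bg.std()

def iscr(original, filtered, target_yx, bg_slice):
    """ISCR = SCR_out / SCR_in, Eq. (15)."""
    return scr(filtered, target_yx, bg_slice) / scr(original, target_yx, bg_slice)
```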
Fig. 2. Different filters’ performance analyses. A1, B1, C1: original IR images in different conditions; A2, B2, C2: the TDLMS’s filtered results of A1, B1, C1; A3, B3, C3: the BPNF’s filtered results of A1, B1, C1; A4, B4, C4: the RAFPDE’s filtered results of A1, B1, C1, respectively;
Table 1. Comparison of performance of different methods in different clutter
Images              TDLMS                    BPNF                     RAFPDE
No.  Tsize  SCRin   ISCR   BSF    EST(s)     ISCR   BSF    EST(s)     ISCR    BSF    EST(s)
A1   5x5    1.769   5.101  2.393  0.040      4.718  2.122  0.030      11.289  2.925  0.010
B1   3x3    0.985   1.414  3.084  0.045      1.596  2.686  0.032      2.190   4.074  0.010
C1   1x1    1.560   6.900  2.402  0.040      7.138  2.170  0.030      17.189  3.768  0.010
Fig. 3. Performance comparison of TDLMS and the proposed RAFPDE. (a) the first frame of the image sequence. (b1) the first frame of the TDLMS’s filtered results. (b2) the results of TDLMS’s filtered results projecting on t-coordinate. (b3) segmented result of (b2). (c1) the first frame of RAFPDE’s filtered results. (c2) the results of RAFPDE’s filtered results projecting on t-coordinate. (c3) segmented result of (c2).
Fig. 4. Experimental comparative analysis on the image sequence: (a) ISCR with respect to the frame number of the image sequence; (b) BSF with respect to the frame number of the image sequence
Fig. 2 shows the filtering effect on several real IR images with different clutter backgrounds. The experimental results are listed in Table 1. It is obvious that RAFPDE maintains better performance for small target detection under different backgrounds and diverse target sizes (Tsize = 1x1~5x5); especially in seriously cluttered backgrounds, such as Fig. 2 (B1, C1), RAFPDE suppresses the background structures much better than TDLMS or BPNF. Table 1 also shows that RAFPDE takes much less computational time.

4.3 Experiments Using Image Sequences

To further evaluate the performance, RAFPDE is tested on IR image sequences. The sequence contains 28 frames, the target's size is about 3x3 pixels, and the shift magnitude is 2~4 pixels per frame. Fig. 3 depicts the performance comparison of TDLMS and RAFPDE. Fig. 4 gives the ISCR and BSF values for every frame of the sequence filtered by the two algorithms, respectively. It is shown that RAFPDE preserves the target signal and at the same time removes the background structures much better than TDLMS.
5 Conclusion

In this paper, we propose a novel adaptive anisotropic diffusion filter for small target detection in cluttered IR images. It is based on a modified partial differential equation combined with clutter-removing and target-preserving regularization. We also give a heuristic study of new conditions on the weighting function for such a regularization operator. Such a filter's role is therefore twofold: the signal of interest is enhanced while complicated structured clutter is adaptively removed locally. We illustrate performance comparisons of the proposed method and existing methods applied to IR images under real-world conditions. The experimental results demonstrate that our method can efficiently improve the detectability of dim small targets in strongly structured clutter backgrounds and provide robust, real-time performance.
References
1. Steven, D.B., Haydn, S.R.: A Sequential Detection Approach to Target Tracking. IEEE Trans. Aerosp. Electron. Syst. 30 (1994) 197-212
2. Fernandez, M., Randolph, A., et al.: Optimal Subpixel-level Frame-to-frame Registration. Signal and Data Processing of Small Targets 1991, Proc. SPIE 1481 (1991) 172-179
3. Chan, D.S.K., Langan, D.A., Staver, D.A.: Spatial Processing Techniques for the Detection of Small Targets in IR Clutter. Proc. SPIE 1305 (1990) 53-62
4. Lin, J.N., Nie, X., Unbehauen, R.: Two-dimensional LMS Adaptive Filter Incorporating a Local-mean Estimator for Image Processing. IEEE Trans. Circuits and Systems II 40 (1993) 417-428
5. Li, M., Zhang, T., et al.: Moving Weak Point Target Detection and Estimation with Three-dimensional Double Directional Filter in IR Cluttered Background. Opt. Eng. 44 (2005)
6. Reed, I., Gagliardi, R., Stootts, L.: Optical Moving Target Detection Based on Adaptive Predictions of IR Background Clutter. Laser & Infrared 34 (2004) 478-480
7. Silva, D.M., et al.: Optimal Detection of Small Targets in a Cluttered Background. Opt. Eng. 37 (1998) 83-92
8. Perona, P., Malik, J.: Scale-space and Edge Detection Using Anisotropic Diffusion. IEEE Trans. Pattern Anal. Machine Intell. 12 (1990) 629-639
9. Sylvie, T., Laure, B.C., et al.: Variational Approach for Edge-preserving Regularization Using Coupled PDEs. IEEE Trans. Image Processing 7 (1998) 387-397
10. Xue, D.H.: An Extended Track-before-Detect Algorithm for Infrared Target Detection. IEEE Trans. Aerosp. Electron. Syst. 33 (1997) 1087-1092
11. Chan, D.S.K., Langan, D.A., Staver, D.A.: Spatial Processing Techniques for the Detection of Small Targets in IR Clutter. Proc. SPIE-Int. Soc. Opt. Eng. 1305 (1990) 53-62
12. Geman, S., McClure, D.E.: Bayesian Image Analysis: An Application to Single Photon Emission Tomography. In Proc. Statistical Computation Section, Amer. Statistical Assoc., Washington, DC (1985) 12-18
The Evaluation Measure of Text Clustering for the Variable Number of Clusters

Taeho Jo 1 and Malrey Lee 2,*

1 Advanced Graduate Education Center of Jeonbuk for Electronics and Information Technology-BK21
[email protected]
2 The Research Center of Industrial Technology, School of Electronics & Information Engineering, ChonBuk National University, 664-14, 1Ga, DeokJin-Dong, JeonJu, ChonBuk, 561-756, South Korea
[email protected], Fax: 82-63-270-2394

* Corresponding author.
Abstract. This study proposes an innovative measure for evaluating the performance of text clustering. When the K-means algorithm or Kohonen Networks are used for text clustering, the number of clusters is fixed initially by configuring it as a parameter, while when the single pass algorithm is used for text clustering, the number of clusters is not predictable. Using labeled documents, the result of text clustering with the K-means algorithm or Kohonen Networks can be evaluated by setting the number of clusters to the number of the given target categories, mapping each cluster to a target category, and using the evaluation measures of text categorization. But when the single pass algorithm is used, if the number of clusters is different from the number of target categories, such measures are useless for evaluating the result of text clustering. This study proposes an evaluation measure of text clustering based on intra-cluster similarity and inter-cluster similarity, called CI (Clustering Index) in this article.
1 Introduction

Text clustering refers to the process of partitioning a collection of documents into several sub-collections of documents based on the similarity of their contents. In the result of text clustering, each sub-collection is called a cluster and includes documents that are similar in content. The desirable principle of text clustering is that documents should be similar to those within their own cluster and different from those in other clusters. Text clustering is an important tool for organizing documents automatically based on their contents. The organization of documents is necessary for managing documents efficiently in any textual information system. For example, web documents, such as HTML, XML, and SGML, need to be organized for better web service, and emails should be organized based on their contents for easy access to them. Unsupervised learning algorithms, such as the k-
means algorithm, the single pass algorithm, Kohonen Networks, and NTSO, have been applied to text clustering as approaches [1][2][3][4][5]. The evaluation of their text clustering performance should be based on the principle stated above. Evaluation measures of text categorization, such as accuracy, recall, precision, and the F1 measure, were used to evaluate the performance of text clustering in previous research on text clustering [1][2]. Accuracy is the rate of correctly classified documents over all documents in the given test bed. This measure is the simplest evaluation measure for classification problems, including text categorization, and is directly applicable to multi-class classification problems. Note, however, that recall, precision, and the F1 measure are directly applicable only to binary classification problems. To evaluate classification performance using them, the given problem should be decomposed into binary classification problems. In a multi-class classification problem, each class corresponds to a binary classification problem, where the positive class indicates "belonging to the class" and the negative class indicates "not belonging to the class". These evaluation measures focus only on the positive class in each binary classification. In text categorization, recall refers to the rate of correctly classified positive documents over all true positive documents, precision refers to the rate of correctly classified positive documents over all documents classified as positive, and the F1 measure is the combination of recall and precision given by equation (1), as follows.
$$F_1\text{-}measure = \frac{2 \times recall \times precision}{recall + precision} \quad (1)$$
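For reference, a small Python helper evaluating equation (1); it is purely illustrative.

```python
def f1_measure(recall, precision):
    """F1 measure of equation (1): the harmonic mean of recall and precision."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

# e.g. f1_measure(0.45, 0.3665) ~= 0.404, matching the Table 2 example in Section 3
```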
Previous research on text clustering proposed and evaluated state-of-the-art approaches to text clustering using evaluation measures of text categorization. In 1998, O. Zamir and O. Etzioni proposed the suffix tree algorithm as an approach to text clustering and evaluated it using precision. They showed that the suffix tree algorithm has higher precision than the single pass algorithm and the k-means algorithm in text clustering [6]. In 1998, S. Kaski and his colleagues proposed a text clustering system, called WEBSOM, where Kohonen Networks are applied as the approach to text clustering [3]. Without evaluating their approach in comparison with other approaches, they demonstrated the visual results of the WEBSOM system. In 2000, T. Kohonen and his colleagues revised the WEBSOM system to improve its speed on massive collections of documents by modifying the data structures of the documents [4]. Although the revised version of WEBSOM is even ten times faster, both the previous version and the revised version were evaluated using accuracy. In 2000, V. Hatzivassiloglou and his colleagues applied several clustering algorithms, such as the single link algorithm, the complete link algorithm, group-wise average, and the single pass algorithm, to text clustering with and without linguistic features [2]. They evaluated these approaches in both cases, with and without linguistic features, based on the cost of detection, which combines miss and false alarm rates. If text categorization based evaluation measures, such as accuracy, the F1 measure, and the cost of detection, are used to evaluate the performance of approaches to text clustering, two conditions are required. First, all documents in the given test bed must be labeled; they must have their own target categories. In the real world, it is more difficult to obtain labeled documents than unlabeled documents, and from the practical point of view, the process of labeling documents follows that of clustering documents. The process of preparing labeled documents for the evaluation of approaches to text clustering is time consuming. Second, the number of clusters should be consistent with the number of target categories. For example, if a series of documents with the same target category is segmented into two or more clusters, text categorization based evaluation measures are useless in that situation.
process of labeling documents follows that of clustering documents in the practical view. The process of preparing labeled documents for the evaluation of approaches to text clustering is time consuming. For second, the number of clusters should be consistent with the number of their target categories. For example, if a series of documents with their same target category is segmented into more than two clusters, text categorization based evaluation measures are useless in that situation. In 2001, T. Jo proposed an innovative measure of evaluating the result of text clustering [7]. Its advantage over text categorization based evaluation measures is that above two conditions are not required. It does not require the test bed consisting of labeled documents nor the consistency between the number of clusters and the number of their target categories. But it may evaluate the result of text clustering inaccurately, if labeled documents are used as the test bed, because his evaluation measure is computed by analyzing unlabeled documents only lexically. In other words, the similarity between two documents in their same target category may be estimated into its small value. In this case, his proposed evaluation measure is not reliable for evaluating the result of clustering labeled documents. This study proposes another innovative evaluation measure of text clustering, which is applicable to both labeled and unlabeled documents. In using this evaluation measure of text clustering to labeled documents, the similarity between two documents is given as a binary value, one or zero. If both of them belong to their same target category, their similarity is estimated as one. Otherwise, it is estimated as zero. In using it in unlabeled documents, the similarity between two documents is estimated as a continuous real value between zero and one, using the equations described in the next section by encoding them into one of structured data. Therefore, the proposed evaluation measure solves the problems not only from text categorization based ones but also from the evaluation method proposed in [7]. In the structure of this article, the second section describes the process of evaluating the result of text clustering using the proposed evaluation measure. The third section presents several results of text clustering and their value of their evaluation using the proposed measure in the collection of labeled documents.
2 Proposed Evaluation Measure

This section describes the evaluation measure of text clustering using labeled documents. The policy of this evaluation is: the better the clustering, the higher the intra-cluster similarity and the lower the inter-cluster similarity. Within a cluster, documents should be as similar as possible, while between clusters, documents should be as different as possible. This section proposes an evaluation measure reflecting this policy, called the clustering index, which indicates the ratio of the intra-cluster similarity to the sum of the intra-cluster and inter-cluster similarities. The clustering index is given as a normalized value between zero and one. The value 1.0 indicates completely desirable clustering, where the intra-cluster similarity is 1.0 and the inter-cluster similarity is 0.0. The value 0.0 indicates completely poor clustering, where the average intra-cluster similarity is 0.0, whatever the average inter-cluster similarity may be.
Using a corpus of labeled documents for the evaluation of text clustering, the similarity between two documents is a binary value, zero or one. If two documents belong to the same target category $c_t$, the similarity between them is 1.0; otherwise, the similarity is 0.0. The computation of the similarity between two labeled documents $d_i$ and $d_j$ is expressed by equation (2):

$$sim(d_i, d_j) = \begin{cases} 1 & \text{if } d_i, d_j \in c_t \\ 0 & \text{otherwise} \end{cases} \quad (2)$$
A cluster $c_k$ includes a series of documents and is denoted as a set of documents by $c_k = \{d_{k1}, d_{k2}, \dots, d_{k|c_k|}\}$. The intra-cluster similarity $\sigma_k$ of the cluster $c_k$ is computed using equation (3) and indicates the average similarity over all pairs of different documents included in the cluster $c_k$:

$$\sigma_k = \frac{2}{|c_k|(|c_k| - 1)} \sum_{i > j} sim(d_{ki}, d_{kj}) \quad (3)$$
If the set of clusters resulting from text clustering is denoted by $C = \{c_1, c_2, \dots, c_{|C|}\}$, the average intra-cluster similarity $\sigma$ is computed using equation (4), by averaging the intra-cluster similarities of the given clusters:

$$\sigma = \frac{1}{|C|} \sum_{k=1}^{|C|} \sigma_k \quad (4)$$
The inter-cluster similarity $\delta_{kl}$ between two clusters $c_k$ and $c_l$ is computed using equation (5) and indicates the average similarity over all possible pairs of documents belonging to the two different clusters:

$$\delta_{kl} = \frac{1}{|c_k||c_l|} \sum_{i=1}^{|c_k|} \sum_{j=1}^{|c_l|} sim(d_{ki}, d_{lj}) \quad (5)$$
The average inter-cluster similarity $\delta$ is computed using equation (6), by averaging over all possible pairs of different clusters:

$$\delta = \frac{2}{|C|(|C| - 1)} \sum_{k > l} \delta_{kl} \quad (6)$$
(6)
σ
and
the average inter-cluster similarity, δ , over the given clusters are obtained. Therefore, the clustering index, CI is computed using the equation (7).
CI =
σ2 σ +δ
(7)
Equation (7) yields a normalized value between zero and one for the clustering index. CI = 1.0 indicates that the average intra-cluster similarity is 1.0 and the average inter-cluster similarity is 0.0. If the average intra-cluster similarity is 0.0, CI is exactly 0.0. Equation (7) implies that both the intra-cluster similarity and the inter-cluster similarity should be considered when evaluating the result of text clustering.
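As an illustration of equations (2)-(7), the following Python sketch computes the clustering index from a list of clusters, where each cluster is given as the list of its documents' target-category labels; this representation is an assumption made for readability, not the authors' implementation.

```python
from itertools import combinations

def clustering_index(clusters):
    """Clustering index CI = sigma^2 / (sigma + delta), equations (2)-(7).

    `clusters` is a list of clusters, each a list of target-category labels
    (one label per document); sim(d_i, d_j) = 1 iff the labels are equal.
    """
    sim = lambda a, b: 1.0 if a == b else 0.0
    # average intra-cluster similarity sigma, equations (3)-(4)
    intra = []
    for c in clusters:
        pairs = list(combinations(c, 2))
        intra.append(sum(sim(a, b) for a, b in pairs) / len(pairs) if pairs else 0.0)
    sigma = sum(intra) / len(clusters)
    # average inter-cluster similarity delta, equations (5)-(6)
    inter = []
    for ck, cl in combinations(clusters, 2):
        inter.append(sum(sim(a, b) for a in ck for b in cl) / (len(ck) * len(cl)))
    delta = sum(inter) / len(inter) if inter else 0.0
    return sigma ** 2 / (sigma + delta) if sigma + delta > 0 else 0.0

# For the ideal case (each cluster equals one category), CI = 1.0:
print(clustering_index([["a"] * 3, ["b"] * 3, ["c"] * 3]))
```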
3 Results of Evaluating Text Clustering

There are two experiments using the collection of labeled documents in this section, covering the cases where the number of clusters is and is not consistent with the number of target categories. In the first experiment, the proposed measure is compared with the text categorization based evaluation measures: accuracy, recall, precision, and F1 measure. These evaluation measures are compared with each other in two cases: the desired clustering, where documents are arranged according to their target categories, and several cases of random clustering, where documents are arranged at random regardless of their target categories, but the number of clusters is the same as the number of categories. The collection of labeled documents used in this experiment includes four hundred news articles, each labeled with one of four categories and stored as ASCII text files. The predefined categories in this collection are "corporate news", "criminal law enforcement", "economical index", and "Internet". This collection was obtained by copying news articles from the web site www.newspage.com and pasting them as individual ASCII text files. Each category includes one hundred news articles. In this experiment, the number of clusters is set to the number of target categories; four clusters are given. In the desired clustering, each cluster corresponds to one of the target categories and each document is arranged into its corresponding cluster. In a random clustering, each document is arranged into one of these four clusters at random. By doing this, four sets of random clustering are built. The evaluation measure for each set of text clustering is computed using equation (7). In the desired clustering, the value of the proposed evaluation measure expressed by equation (7) is 1.0, since the average intra-cluster similarity is 1.0 and the average inter-cluster similarity is 0.0 according to equation (2). If it is evaluated using the text categorization based evaluation measures, accuracy, precision, recall, and F1 measure all have the value 1.0. Therefore, both the proposed evaluation measure and the text categorization based ones evaluate the result of the desired clustering identically. A result of text clustering is presented in Table 1. The number of clusters is identical to the number of target categories of the documents, and each cluster contains the same number of documents as each target category. To apply the text categorization based method, each cluster must correspond to one of the target categories exclusively. According to the majority of each cluster and one-to-one correspondence, cluster 1, cluster 2, cluster 3, and cluster 4 correspond to corporate news, criminal law enforcement, Internet, and economic index, respectively. Cluster 1, cluster 2, and cluster 3 were matched with the target categories according to their majorities, but cluster 4 was matched with economic index exceptionally, since no cluster was allowed to correspond to an already assigned category in the one-to-one correspondence. Under this condition, all of the text categorization based evaluation measures, such as accuracy, recall, precision, and F1 measure, resulted in 0.475 uniformly. In the proposed
evaluation method, the average intra-cluster similarity is 0.38, using equations (2), (3), and (4), and the average inter-cluster similarity is 0.1808, using equations (2), (5), and (6). Therefore, the clustering index is 0.2574, using equation (7).

Table 1. A Result of Clustering News Articles
                          cluster 1  cluster 2  cluster 3  cluster 4  Total
corporate news                70         10         10         10      100
criminal law enforcement      15         50          5         30      100
economic index                 5         30         40         25      100
Internet                      10         10         45         35      100
Total                        100        100        100        100      400
Table 2 presents another result of clustering news articles. In this result, cluster 1 and cluster 4 have 150 and 50 documents respectively, differing from the target category sizes. This leads to a difference between recall and precision. According to the majority of each cluster and one-to-one correspondence, cluster 1, cluster 2, cluster 3, and cluster 4 correspond to corporate news, economic index, Internet, and criminal law enforcement, in order to use the text categorization based evaluation measures. The accuracy and recall of this result are both 0.45. The precision and F1 measure are 0.3665 and 0.4039, respectively. In the proposed evaluation measure, the average intra-cluster similarity is 0.4153 and the average inter-cluster similarity is 0.2054. The clustering index is estimated as 0.2776, indicating that these news articles are clustered better than in random clustering, at least. Note that the proposed evaluation measure does not require such a correspondence between each cluster and each target category.

Table 2. A Result of Clustering News Articles
                          cluster 1  cluster 2  cluster 3  cluster 4  Total
corporate news                70          5         15         10      100
criminal law enforcement      60         40          0          0      100
economic index                15         50         25         10      100
Internet                       5          5         60         30      100
Total                        150        100        100         50      400
If the number of clusters is not the same as the number of target categories, the text categorization based evaluation measures become useless, since the correspondence between clusters and target categories cannot be one to one. If the collection of news articles is partitioned into five clusters, where three clusters are exactly the same as three of the target categories and a particular target category is partitioned into two clusters, the average intra-cluster similarity is 1.0, but the average inter-cluster similarity is 0.1.
There are ten pairs of clusters among the five clusters and one of the ten pairwise similarities is 1.0, so the average inter-cluster similarity is 0.1. Therefore, the clustering index is computed as 0.9090 using equation (7). On the contrary, two target categories may be merged into one cluster. For example, suppose two target categories are identical to two clusters in their distribution of documents, but the remaining categories are merged into one cluster in this collection of news articles. The average intra-cluster similarity is 0.8324 and the average inter-cluster similarity is 0.0 in this case. Therefore, the clustering index is computed as 0.8324, using equation (7). Table 3 presents a more realistic result of text clustering, where the number of clusters is different from the number of target categories of the documents, in the second experiment. As mentioned above, the text categorization based evaluation measures are not applicable, since the clusters cannot correspond to the target categories one to one. In the result illustrated in Table 3, the average intra-cluster similarity is 0.3203 and the average inter-cluster similarity is 0.2170. Using equation (7), the clustering index is 0.1909.

Table 3. A Result of Clustering News Articles
                          cluster 1  cluster 2  cluster 3  Total
corporate news                70         20         10      100
criminal law enforcement      30         30         40      100
economic index                40         50         10      100
Internet                      10         70         20      100
Total                        150        170         80      400
These two experiments using labeled documents as the test bed for text clustering show, on two points, that the proposed evaluation method is more suitable for text clustering than the text categorization based evaluation methods. The first point is that the text categorization based evaluation methods require a one-to-one correspondence between clusters and target categories, but the proposed method does not require it. When the number of clusters is the same as the number of target categories, each cluster has to be matched with a category exclusively; when the number of clusters is different from the number of target categories, the text categorization based evaluation methods are useless. The second point is that the text categorization based evaluation measures do not consider the similarities between clusters. This ignores the second principle of text clustering, that "documents in different clusters should be as different as possible". The proposed evaluation measure considers the similarities of documents not only within a particular cluster but also between two different clusters.
4 Conclusion

This paper proposed an innovative evaluation measure of text clustering. This measure is based on the following principle:
Documents within a particular cluster should be as similar as possible, and documents in different clusters should be as different as possible. Based on this principle, this study proposed the process of computing the intra-cluster similarity, using equations (2), (3), and (4), and the inter-cluster similarity, using equations (2), (5), and (6). The final evaluation measure of text clustering is computed from these two measures using equation (7). When the number of clusters is the same as the number of target categories, the proposed measure was compared with the text categorization based evaluation measures in the previous section. The experiments in that section showed two advantages of the proposed method over the text categorization based ones. The first advantage is that each category does not need to be matched with a cluster when the proposed evaluation measure of text clustering is used. This means that it is applicable even when the number of clusters is different from the number of target categories. The second advantage over the text categorization based methods is that the proposed evaluation measure considers both intra-cluster similarity and inter-cluster similarity. Text categorization based measures, such as accuracy, recall, precision, and F1 measure, evaluate the result of text clustering based only on intra-cluster similarity. There is one more advantage of the proposed evaluation measure over the text categorization based ones: the proposed measure is applicable even to unlabeled documents, if a process for computing a semantic similarity between two documents is defined. In the real world, it is far easier to obtain unlabeled documents than labeled documents, and the assumption underlying text clustering is that no document is labeled initially. Therefore, the effort of obtaining labeled documents for the evaluation of text clustering is not necessary when the proposed evaluation measure is used. In the real world, almost every document is labeled with more than one category. In the collection of news articles called Reuters 21578, which is used as a standard test bed for the evaluation of text categorization, each news article has more than one category. Although overlapping clustering, where a document is allowed to be arranged into two or more clusters, is more practical than exclusive clustering in the real world, previous research on text clustering focused on exclusive clustering for ease of evaluation. The proposed evaluation measure may remain applicable depending on how the similarity between documents expressed in equation (2) is defined. In further research, the proposed evaluation method of text clustering will be modified to be applicable to collections of unlabeled documents, collections of documents labeled with more than one category like Reuters 21578, and hybrid collections of labeled and unlabeled documents. There are several strategies for applying the proposed evaluation method to hybrid collections. In one strategy, unlabeled documents are classified with reference to the labeled documents, so that all documents become labeled. In another strategy, the similarity is computed using equation (2) if both documents are labeled, and the similarity is computed differently otherwise. By modifying the proposed evaluation measure to be applicable to collections of various kinds of documents, the flexibility of the proposed evaluation measure is expected to be improved.
Acknowledgment This research was supported by the MIC (Ministry of Information and Communication), Korea under the ITRC (Information Technology Research Center) support program supervised by the IITA (Institute of Information Technology Assessment) IITA-2006-C1090-0603-0024.
Clustering-Based Reference Set Reduction for k-Nearest Neighbor Seongseob Hwang and Sungzoon Cho Seoul National University, San 56-1, Shillim-dong, Kwanak-gu, 151-744, Seoul, Korea {hss9414,zoon}@snu.ac.kr
Abstract. Response modeling is concerned with computing the likelihood that a customer will respond to a marketing campaign. A major problem encountered in response modeling is the huge volume of data or patterns. The k-nearest neighbor (k-NN) classifier has been used in various classification problems for its simplicity and ease of implementation. However, it has not been applied to problems that require fast classification, since the classification time rapidly increases with the size of the reference set. In this paper, we propose a clustering-based preprocessing step in order to reduce the size of the reference set. The experimental results showed an 85% decrease in classification time without a loss of accuracy.
1 Introduction
Direct marketing is concerned with identifying likely buyers of certain products or services and promoting the products to those potential buyers through various channels [1]. In order to decide who will receive a promotion, the potential customers are divided into two groups or classes: buyers and non-buyers. Response modeling is concerned with computing the likelihood that a customer will respond to a marketing campaign. A major problem encountered in response modeling is the gigantic volume of data or patterns. Generally speaking, retailers keep a huge amount of customer data, and new data keep arriving. Even though data mining algorithms are designed to deal with this problem, it is always desirable to sample the data and work on a subset of the huge data set. The nearest neighbor classifier is a simple yet powerful supervised concept learning scheme. An unseen (i.e., unclassified) instance is classified by finding the closest previously observed instance, taking note of its class, and predicting this class for the unseen instance [2,3]. Learners that use this scheme are also termed case-based, instance-based, lazy, or memory-based learners. They suffer when harmful and superfluous instances are stored indiscriminately, because such instances can become neighbors and lead to wrong classification results. Numerous studies have been reported to improve accuracy and speed. The most general method to overcome this weakness is k-NN editing (see Fig. 1) [4]. Replacing a dataset with a usually smaller dataset in order to improve the
accuracy of an NN classifier belongs to a set of techniques called dataset editing. The most popular technique in this category is Wilson editing [4,5]. Wilson [4] proposed an edited k-NN rule to improve the performance of the 1-NN rule. In his rule, each pattern in the reference set is classified using the k-NN rule and all misclassified patterns are deleted from the reference set. A test pattern is then classified using the 1-NN rule based on the edited reference set. Wilson’s edited k-NN rule has yielded good results in many finite-sample-size problems, although its asymptotic optimality has been disproved [6]. Hattori [6] proposed a new edited k-nearest neighbor rule: for every pattern x in the edited reference set, all k nearest neighbors of x must be in the class to which x belongs. In his method, high classification accuracy is preferred to a small number of patterns in the reference set. On average, classification accuracy improved, but the improvement was unstable over the parameter k. Also, every instance in the database has to be evaluated to find the k nearest neighbors, so the runtime complexity remains high. Dasarathy [7] developed a condensing method to edit the reference set. His rule provides the minimal consistent subset, which is used as the edited reference set: all patterns in the minimal consistent subset can be correctly classified using the 1-NN rule with the initial reference set. His method reduces the classification time by condensing the reference set, but it also reduces the classification accuracy.
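To make the editing idea concrete, the following Python sketch illustrates Wilson-style editing as described above. It is an illustrative reimplementation (the function and variable names are ours, not from the paper), assuming scikit-learn is available.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def wilson_edit(X, y, k=3):
    """Wilson-style editing: drop every reference pattern that is
    misclassified by the k-NN rule applied to the remaining patterns."""
    X, y = np.asarray(X), np.asarray(y)
    keep = np.ones(len(X), dtype=bool)
    for i in range(len(X)):
        mask = np.arange(len(X)) != i            # leave pattern i out
        knn = KNeighborsClassifier(n_neighbors=k).fit(X[mask], y[mask])
        if knn.predict(X[i:i + 1])[0] != y[i]:   # misclassified -> remove
            keep[i] = False
    return X[keep], y[keep]

# Usage: edit the reference set once, then classify test patterns with 1-NN.
# X_ref, y_ref = wilson_edit(X_ref, y_ref, k=3)
```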
Fig. 1. Reference set of (a) classical k-NN and (b) k-NN editing
We propose to condense the reference set by clustering. All the patterns in the reference set are divided into clusters, and a test pattern is then classified based on a few clusters close to it. The purpose of our work is to reduce the classification time without a loss of accuracy. The organization of this paper is as follows. The following section introduces the proposed method. Section 3 describes the DMEF4 dataset and the experimental settings. In Section 4, we provide experimental results, where various k-NN based classifiers are compared in terms of accuracy and run time. Finally, conclusions and future work are discussed in Section 5.
2 Proposed Method
The k-NN classifier takes most of its run time in computing distances from a new pattern to all the patterns in the reference set. If the reference set is clustered beforehand, it might suffice to consider only those patterns in the cluster that is closest to the new pattern. If that is the case, the run time can be reduced by a factor of K, where K is the number of clusters, assuming that each cluster is of a similar size. However, there is a danger of ignoring patterns in nearby clusters. A safer approach is to consider not only the nearest cluster but also a few more nearby clusters. The trick here is not to consider all the patterns from these additional clusters, but only those patterns located in their peripheries.

[Preprocessing Step]
A. Clustering
  begin
    initialize K, μ1, μ2, ..., μK   /* put the n patterns into K clusters; μi is the mean vector of the i-th cluster */
    do
      assign each pattern xc to Ci*, where i* = argmin_i |xc − μi| and Ci is the i-th cluster set
      recompute μi for i = 1 to K
    until no change in μi
    return μ1, μ2, ..., μK
  end
B. "Core set" and "peripheral set"
  begin
    for c = 1 to N do
      if xc ∈ Ci and dist(xc, μi) ≤ 2 × mean dist over all xj ∈ Cj of (xj, μj)
        then assign xc to the core set Ci^c
        else assign xc to the peripheral set Ci^p
    /* Ci^c is the i-th core set, Ci^p is the i-th peripheral set, Ci = Ci^c ∪ Ci^p */
  end

[Classification Step]
A. Reduced reference set
  identify the L nearest cluster centers from a new example xt: μt(1), μt(2), ..., μt(L),
  where μt(1) is the nearest, μt(2) the second nearest, and so on;
  if dist(xt, μt(1)) ≤ 2 × mean dist over xj ∈ Ct(1) of (xj, μt(1))
    then define the reference set R as Ct(1)
    else define R as Ct(1) ∪ Ct(2)^p ∪ Ct(3)^p ∪ ... ∪ Ct(L)^p
B. Classification
  classify the new example xt with the k-NN classifier using reference set R

Fig. 2. Proposed method
The proposed method consists of two steps: preprocessing and classification. The algorithm is depicted in Fig. 2. In the preprocessing step, the reference set is partitioned into clusters; K-means clustering is used because it is relatively quick. The patterns assigned to each cluster are then split into a “core set” and
“peripheral set.” The patterns located within a certain distance from the cluster center are put into the “core set,” while the rest are put into the “peripheral set.” In the classification step, we first calculate the distances from a new pattern xt to the cluster centers. The patterns from the closest cluster and from the peripheral sets of adjacent clusters are put into the reference set. Finally, k-NN is performed with the reference set just obtained. Figure 3 illustrates the preprocessing with the proposed method. Fig. 3(a) shows the original reference set. The set is partitioned into nine clusters (see Fig. 3(b)). We calculate the distance from a new test pattern to the cluster centers, and the patterns from the closest cluster and from the peripheral sets of adjacent clusters are put into the reference set (see Fig. 3(c)-(d)). In the classification step, k-NN is performed with the obtained reference set. Table 1 depicts the size of the reference set in terms of the location of the test pattern.

Fig. 3. Reducing the reference set: (a) original reference set; (b) clustering results; (c) Case 1: the test pattern is located near the core area of the nearest cluster; (d) Case 2: the test pattern is located near the peripheral area of the nearest cluster

Table 1. Size of the reference set in terms of the location of the test pattern
Nearest cluster:         1     2     3     4     5     6     7     8     9
At the core set:        4/60  7/60  6/60  5/60  6/60 10/60  7/60  8/60  7/60
At the peripheral set:  5/60  8/60  7/60  6/60  8/60 11/60 10/60  9/60 10/60
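As an illustration of the preprocessing and classification steps just described, the sketch below gives one possible Python implementation; it is a simplified reconstruction (names and the scikit-learn dependency are our own choices, not the authors' code).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

def preprocess(X_ref, n_clusters=10):
    """Cluster the reference set and split each cluster into core/peripheral sets."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(X_ref)
    dists = np.linalg.norm(X_ref - km.cluster_centers_[km.labels_], axis=1)
    threshold = 2.0 * dists.mean()            # same 2 x mean-distance rule as in Fig. 2
    core = dists <= threshold                 # True -> core set, False -> peripheral set
    return km, core

def classify(x_t, X_ref, y_ref, km, core, n_near=3, k=5):
    """Classify a test pattern using only the reduced reference set."""
    d_centers = np.linalg.norm(km.cluster_centers_ - x_t, axis=1)
    near = np.argsort(d_centers)[:n_near]     # L nearest clusters
    in_cluster = km.labels_ == near[0]
    if d_centers[near[0]] <= 2.0 * np.linalg.norm(
            X_ref[in_cluster] - km.cluster_centers_[near[0]], axis=1).mean():
        mask = in_cluster                                    # near the core: nearest cluster only
    else:                                                    # otherwise add peripheral sets
        mask = in_cluster | (np.isin(km.labels_, near[1:]) & ~core)
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_ref[mask], y_ref[mask])
    return knn.predict(x_t.reshape(1, -1))[0]
```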
3 Dataset and Experimental Settings
3.1 DMEF4 Dataset
A catalogue mailing task involving the DMEF4 dataset [8] was analyzed. It concerns an up-scale gift business that mails general and specialized catalogs to its customers several times each year. The original problem is to estimate how much each customer will spend during the test period, from September 1992 to December 1992, based on the base time period, from December 1971 to June 1992. From the original problem, a classification problem is formulated in which the target class labels are +1 for respondents, who spent a non-zero amount, and −1 for non-respondents, who did not spend at all. The dataset contains 101,532 customers, each of whom is described by 91 input variables. The response rate is 9.4%, with 9,571 respondents and 91,961 non-respondents. While selecting or extracting relevant variables is very important, it is not our main concern. Malthouse [9] extracted 17 of the 91 input variables for this dataset, and Ha et al. [10] used 15 of them, removing two variables whose variations are negligible. In this paper, these 15 variables, listed in Table 2, were used as input variables. Stratified sampling assigned 60% of the data to the reference set and the rest to the test set.

3.2 Response Models
We compared the proposed method with classical k-NN [2] and Wilson’s k-NN editing [4]. Classical k-NN and k-NN editing have no parameter to specify except the number of neighbors; we used k = 1, 3, 5, 7 as the number of nearest neighbors for every method. In order to implement the proposed model, a particular set of parameters should be selected in advance: one should predetermine the number of clusters K and the number of reference clusters L. In our experiment, we set K to 10 and L to 3 (the rounded integer of the square root of 10). Note that it is beyond the scope of this paper to find the optimal K and L. Given a response model and instances, there are two types of errors, i.e., false positives (FP) and false negatives (FN) [11], as presented in Table 3. In order to depict the tradeoff between the false positives and false negatives of classifiers, a receiver operating characteristics (ROC) graph has long been used in signal detection theory [12,13,14]; it plots (FP, TP) pairs. We employed the “ROC distance” in Eq. (1) as the criterion, similar to [15], and as a performance measure [16]. The ROC distance indicates how far the result of a model is from the perfect classification in the ROC chart. To achieve a small ROC distance, both FP and FN should have low values; the more correct a model is, the smaller the ROC distance becomes:

ROC distance = √( (FN/NR)² + (FP/NNR)² ),    (1)
Table 2. 15 input variables, some original and some derived, were used in response modeling for the DMEF4 dataset

ORIGINAL VARIABLES
Name      Description
Purseas   Number of seasons with a purchase
Falord    LTD fall orders
Ordtyr    Number of orders this year
Puryear   Number of years with a purchase
Sprord    LTD spring orders

DERIVED VARIABLES
Name      Description                                           Formulation
Recency   Order days since 10/1992
Tran53                                                          I(180 ≤ recency ≤ 270)
Tran54                                                          I(270 ≤ recency ≤ 366)
Tran55                                                          I(366 ≤ recency ≤ 730)
Tran38                                                          1/recency
Comb2     Number of product groups purchased from this year     Σ_{m=1}^{14} ProdGrp_m
Tran46                                                          √comb2
Tran42    Interaction between the numbers of orders             log(1 + ordtyr × falord)
Tran44    Interaction between LTD orders and LTD spring orders  √(ordhist × sprord)
Tran25    Inverse of latest-season items                        1/(1 + lorditm)
Table 3. Confusion matrix: according to actual and predicted responses, true positives (TP), false positives (FP), false negatives (FN), and true negatives (TN) are computed. Note that N = TP+FP+FN+TN, NR = TP+FN, and NNR = FP+TN.

                          Actual
Predicted          Respondent   Non-respondent
Respondent             TP             FP
Non-respondent         FN             TN
A measure that gives a balanced assessment of the two classes has to be adopted, such as the balanced classification rate (BCR) [16,1], which incorporates TP and TN in the following way:

BCR = (TP/NR) · (TN/NNR).    (2)
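To make these two criteria concrete, here is a small Python helper that computes the ROC distance of Eq. (1) and the BCR of Eq. (2) from a confusion matrix. It is an illustrative sketch; the square root in the ROC distance follows our reading of Eq. (1).

```python
import math

def roc_distance(tp, fp, fn, tn):
    """Distance from perfect classification in the ROC chart, Eq. (1)."""
    n_r, n_nr = tp + fn, fp + tn            # respondents / non-respondents
    return math.sqrt((fn / n_r) ** 2 + (fp / n_nr) ** 2)

def bcr(tp, fp, fn, tn):
    """Balanced classification rate, Eq. (2)."""
    n_r, n_nr = tp + fn, fp + tn
    return (tp / n_r) * (tn / n_nr)

# Example with a hypothetical confusion matrix:
# print(roc_distance(tp=3500, fp=2000, fn=300, tn=34000), bcr(3500, 2000, 300, 34000))
```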
4 Experimental Results
4.1 Classification Accuracy
Table 4 shows the ROC distance of each method. We found that there is little or no difference between the methods in terms of the ROC distance.
Table 4. ROC distance
Method                   k=1      k=3      k=5      k=7
Classical k-NN           0.1133   0.0846   0.0656   0.0680
Wilson's k-NN editing    0.1133   0.0874   0.0672   0.0716
Proposed method          0.1133   0.0846   0.0656   0.0680

Table 5. Balanced classification rate (BCR)
Method                   k=1      k=3      k=5      k=7
Classical k-NN           0.9352   0.9498   0.9597   0.9583
Wilson's k-NN editing    0.9352   0.9483   0.9586   0.9562
Proposed method          0.9352   0.9498   0.9597   0.9583
Fig. 4. Run time for (a) k=1, (b) k=3, (c) k=5, and (d) k=7
Regarding the parameter k, 5 worked much better than the other values. Table 5 shows the BCR of each method; the results are analogous to those from the ROC distance, and the results of the proposed method are equivalent to the others. Note that the results of the classical k-NN and the proposed method are exactly identical: in this application, the reduced reference set produced by the proposed method actually contains all of the 1, 3, 5 and 7 nearest neighbors of a test pattern, so the identical neighbors were picked. This resulted from the choice of a relatively small number of clusters; of course, we have to investigate its effect in the future.

4.2 Run Time
We measured preprocessing time as well as classification time. The classical k-NN does not have any preprocessing, while k-NN editing does. In the proposed method, the sum of the clustering time and the time to divide each cluster into the “core set” and “peripheral set” was treated as preprocessing time. The experiments were conducted in Matlab 7.0 on an Intel Pentium 4 3.0 GHz machine with 2 GB of RAM. Figure 4 presents the run time (unit: sec) of the experiments. The overall run-time pattern is surprisingly similar for different k values. For preprocessing, k-NN editing took about 4,600 secs while the proposed method took only 10 secs. For classification, the classical k-NN took about 2,100 secs and k-NN editing about 2,000 secs, whereas the proposed method took about 300 secs, or 85% less. It should be noted that preprocessing takes place only once, while classification could take place many more times.
5 Conclusions and Discussion
Response modeling is concerned with computing the likelihood that a customer will respond to a marketing campaign. A major problem encountered in response modeling is the huge volume of data or patterns. The classical k-NN is effective when the probability distributions of the feature variables are not known. However, it has not been applied to problems that need fast classification, since the classification time rapidly increases with the size of the reference set. We proposed a clustering-based preprocessing step in order to reduce the size of the reference set: all the patterns in the reference set are divided into clusters, and a test pattern is then classified based on a few clusters close to it. The experimental results showed that classification time was reduced by about 85% without a loss of accuracy. The future work is as follows. First, the dataset used in our experiments is imbalanced in class labels, so experiments using various datasets are needed. Secondly, the effect of the proposed method may vary as the number of clusters changes, so the effects of the parameters have to be studied. Finally, an incremental approach to clustering that dynamically adds one cluster center at a time needs to be investigated [17]; such an approach can help our response model to be more accurate.
Acknowledgement This work was supported by grant No. R01-2005-000-103900-0 from Basic Research Program of the Korea Science and Engineering Foundation, the Brain Korea 21 program in 2006 and partially supported by Engineering Research Institute of SNU.
References 1. Shin, H.J., Cho, S.: Response modeling with support vector machines. Expert Systems with Applications 30 (2006) 746-760 2. Cover, T.M., Hart, P.E.: Nearest neighbor pattern classification. IEEE. Transactions on Information Theory IT 13 (1967) 21-27 3. Brighton, H.: Advances in Instance Selection for Instance-Based Learning Algorithms. Data Mining and Knowledge Discovery 6 (2002) 153-172 4. Wilson, D.L.: Asymptotic Properties of Nearest Neighbor Rules Using Edited Data. IEEE Transactions on Systems, Man, and Cybernetics 2 (1972) 408-420 5. Eick, C.F., Zeidat, N., Vilalta, R.: Using Representative-Based Clustering for Nearest Neighbor Dataset Editing. Fourth IEEE International Conference on Data Mining (2004) 375-378 6. Hattori, K., Takahashi, M.: A new edited k-nearest neighbor rule in the pattern classification problem. Pattern Recognition 33 (2000) 521-528 7. Dasarathy, B.V.: Minimal Consistent Set (MCS) Identification for Optimal Nearest Neighbor Decision Systems Design. IEEE Transaction on System Man and Cybernetics 24 (1994) 511-517 8. The Direct Marketing Association. Available at http://www.the-dma.org/dmef/ dmefdset.shtml 9. Malthouse, E.C.: Assessing the performance of direct marketing scoring models. Journal of Interactive Marketing 15 (2001) 49-62. 10. Ha, K., Cho, S., MacLachlan, D.: Response models based on bagging neural networks. Journal of Interactive Marketing 19 (2005) 17-30. 11. Golfarelli, M., Maio, D., Maltoni, D.: On the Error-Reject Trade-off in Biometric Verification Systems. IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (1997) 786-796 12. Egan, J.P.: Signal detection theory and ROC analysis. Series in Cognition and Perception. Academic Press. New York (1975) 13. Swets, J.A., Dawes, R.M., Monahan, J.: Better decisions through science. Scientific American 283 (2000) 82-87 14. Fawcett, T.: An introduction to ROC analysis. Pattern Recognition Letters 27 (2006) 861-874 15. He, C., Girolami, M., Ross, G.: Employing optimized combinations of one-class classifiers for automated currency validation. Pattern Recognition 37 (2004) 1085-1096 16. Yu, E., Cho, S.: Constructing response model using ensemble based on feature subset selection. Expert Systems with Applications 30 (2006) 352-360 17. Likas, A., Vlassis, N., Verbeek, J.: The global K-means clustering algorithm. Pattern Recognition 36 (2003) 451-461
A Contourlet-Based Method for Wavelet Neural Network Automatic Target Recognition
Xue Mei 1,2, Liangzheng Xia 1, and Jiuxian Li 1
1 School of Automation Control, Southeast University, Nanjing, China, 210096
2 School of Automation Control, Nanjing University of Technology, Nanjing, China, 210096
Abstract. An object recognition algorithm is put forward based on the statistical character of the contourlet transform and a multi-object wavelet neural network (MWNN). A contourlet-based feature extraction method is proposed, which forms the feature vector by taking advantage of the statistical attributes in each sub-band of the contourlet transform. The extracted features are then weighted according to the dispersion degree of their data. A WNN is used as the classifier, which combines the ability of the wavelet transform to extract local singularities with the adaptivity of artificial neural networks. Applied in an aircraft recognition system, the experimental data showed the efficiency of this algorithm for automatic target recognition.
Keywords: Automatic target recognition, Wavelet neural network, Contourlet transform, Feature extraction.
1 Introduction
For object recognition, it is greatly important to find structures with singularities and irregularities to serve as features. Wavelets form an optimal basis for object functions with point singularities and can effectively reflect the positions and characteristics of singular points; they are widely and successfully used in image feature extraction and pattern recognition. The major drawback of wavelets in two dimensions (2-D) is their limited ability to capture directional information, so the 2-D wavelet has difficulty expressing high-dimensional geometric features such as line singularities and curve singularities. Researchers have recently considered multi-scale and directional representations that can capture intrinsic geometrical structures such as smooth contours in natural images. Candes, Donoho and Do proposed multi-scale geometric analysis (MGA) [1] methods such as the ridgelet, curvelet and contourlet, which have the characteristics of multi-directional selectivity and anisotropy needed to effectively capture and represent the geometric features of images. The wavelet neural network (WNN) [8] was recently proposed based on wavelet analysis; it uses a non-linear wavelet basis instead of the non-linear sigmoid function, takes full advantage of the good localization of the wavelet transform, and combines it with the self-learning ability of neural networks. In this paper, an object recognition algorithm was put forward based on the contourlet transform and
multi-object wavelet neural network (MWNN). Features are extracted from the multi-scale sub-bands of the contourlet transform and are weighted according to their degree of dispersion; a multi-object wavelet neural network is then used to recognize the objects. After reviewing the contourlet transform and presenting the feature extraction method in Section 2, we describe the MWNN in Section 3. In Section 4, we conduct experiments on aircraft recognition and obtain a high recognition rate. Section 5 concludes the paper.
2 Feature Extraction Based on the Contourlet Transform
2.1 Contourlet Transform
Inspired by curvelets, Do and Vetterli developed the contourlet transform [2][3], based on an efficient two-dimensional multiscale and directional filter bank that can deal effectively with images having smooth contours. The contourlet transform is a true 2-D transform of images, which not only possesses the main features of wavelets but also offers a high degree of directionality and anisotropy. It is implemented via a 2-D filter bank that decomposes an image into several directional sub-bands at multiple scales, which is accomplished by combining the Laplacian pyramid (LP) with a directional filter bank (DFB) at each scale. As Fig. 1 shows, the LP [5] decomposes the original image into low-frequency and high-frequency sub-bands. The former are obtained by sampling the rows and columns of the 2-D low-pass filtered original image, forming low-frequency components of the same size as the original image. The number of directional sub-bands differs with the resolution and doubles as the scale increases. Repeating this process on the low-frequency sub-bands realizes a multi-resolution, multi-direction decomposition of the image. A statistical feature extraction method based on the contourlet transform is proposed in this section, consisting of feature extraction and feature weighting.
Fig. 1. Contourlet transform schematic diagram
2.2 Feature Extraction
The statistics of the contourlet transform coefficients follow a generalized Gaussian distribution (GGD) [3], which is peaked and heavy-tailed. The generalized Gaussian density is

f(x; α, β) = β / (2αΓ(1/β)) · e^(−(|x|/α)^β).    (1)
In Equation (1), α is the scale parameter of the GGD; it depends on the variance of the random variable and controls the width of f(x; α, β). β is the shape parameter of the GGD and controls the shape of f(x; α, β). Γ(x) is the gamma function. The features of an image can thus be described by the statistical parameters α and β, which can be estimated by the moment method or the maximum likelihood method [6].

2.3 Feature Weighting
The extracted features are local features of the image, so the classification abilities of the different contourlet sub-bands differ. The features are therefore weighted according to the importance of the sub-band information: if the dispersion of the gray levels in a sub-band is larger, the influence of the corresponding features is enhanced; otherwise it is weakened. Suppose there are L kinds of object images and the contourlet transform of each image yields J scales, j ∈ {1, 2, ..., J}, where j is the scale index. Each scale has
K_j sub-bands. The feature vector of a sub-band is denoted W_tm, where t ∈ {1, 2, ..., L} is the sample index. The features obtained from the contourlet transform at different scales and directions can be weighted, where the standard deviation of each sub-band is used to express its dispersion degree. This standard deviation does not need to be computed separately: α, the scale parameter of the GGD, already reflects the variance of the random variable and controls the width of f(x; α, β), so it is used instead; in this way repeated computation is avoided and efficiency is improved. Let R^t_jk be the weight of image t in sub-band k of scale j. The weights of the sub-band features at the different scales and directions are computed as

R^t_jk = K · α_jk / (Σ Σ α_jk),    (2)

where the double sum runs over the sub-bands at the different scales and directions.
R^t_jk, as a sub-band weight, expresses the classification ability of the contourlet features at each scale and direction. A larger weight indicates that the data dispersion of the sub-band is larger, i.e., the classification ability of its features is better; a smaller weight indicates the opposite. When the features of every channel are weighted by R^t_jk, the dispersion of the feature values with good classification ability is enlarged, which improves classification, while the effect of feature values with smaller dispersion is weakened.
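As an illustration of Sections 2.2-2.3, the sketch below estimates (α, β) per sub-band with a simple moment-matching scheme and then weights the features using the α values as in Eq. (2). It is a hypothetical implementation: the moment-matching details and function names are ours, and the contourlet sub-band coefficients are assumed to be given.

```python
import numpy as np
from scipy.special import gamma
from scipy.optimize import brentq

def ggd_moment_fit(coeffs):
    """Moment-matching estimate of GGD parameters (alpha, beta) for one sub-band."""
    c = np.ravel(coeffs)
    m1, m2 = np.mean(np.abs(c)), np.mean(c ** 2)
    ratio = lambda b: gamma(2.0 / b) ** 2 / (gamma(1.0 / b) * gamma(3.0 / b))
    beta = brentq(lambda b: ratio(b) - m1 ** 2 / m2, 0.05, 5.0)   # solve the moment equation
    alpha = m1 * gamma(1.0 / beta) / gamma(2.0 / beta)
    return alpha, beta

def weighted_features(subbands):
    """Build a weighted (alpha, beta) feature vector for one image, Eq. (2) style."""
    params = [ggd_moment_fit(sb) for sb in subbands]              # one (alpha, beta) per sub-band
    alphas = np.array([a for a, _ in params])
    weights = len(subbands) * alphas / alphas.sum()               # R_jk proportional to alpha_jk
    return np.concatenate([w * np.array([a, b]) for w, (a, b) in zip(weights, params)])
```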
3 MWNN
A WNN makes the best use of the good localization of the wavelet transform and the self-learning ability of neural networks. The determination of the wavelet neurons and of the whole network structure rests on reliable theory, and the function learning and generalization abilities are strong. Moreover, the linear distribution of the network weight coefficients and the convexity of the objective function allow the network training to avoid non-linear optimization problems such as local optima.
3.1 WNN Model
In a WNN, the wavelet neurons in the first layer extract and select the time-frequency characteristics of the input signals, while the sigmoid neurons in the second layer complete the pattern classification. The adjustable network parameters include the scale and shift factors of the wavelet neurons and the connection weights between neurons.
3.2 Multi-object Wavelet Neural Network
In most neural network pattern recognition, the number of inputs matches the dimension of the object features, while the number of outputs is determined by the number of object classes. As a consequence, the training set must be huge, because all sorts of objects have to be recognized by one network. We therefore apply a multi-object wavelet neural network (MWNN) to classify multiple objects. The MWNN is composed of several sub-nets, each of which recognizes one kind of object; the number of sub-nets equals the number of object classes. The output of each sub-net is the matching degree between the unseen sample and the corresponding pattern. All sub-net outputs are compared, and the object class is the one corresponding to the maximum output, so the whole network has a single output: the object class. Fig. 2 shows the structure of the MWNN. When all the sub-net outputs are smaller than a preset threshold, the result is rejected. Moreover, according to the number of object classes, the number of sub-nets can be dynamically increased or decreased; the sub-nets can be trained dynamically and the network structure adjusted on demand.
Fig. 2. The structure of the multi-object wavelet neural network (the input features feed sub-nets WNN 1, WNN 2, ..., WNN k, one per object kind; the maximum sub-net output gives the result)
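The decision rule of the MWNN (maximum sub-net output with rejection) can be sketched as follows. This is an illustrative fragment, with the sub-net scoring function left abstract since the wavelet sub-net internals are not reproduced here.

```python
def mwnn_classify(feature_vec, subnets, threshold=0.5):
    """Each sub-net scores the match between the input and its own object kind;
    the predicted class is the sub-net with the maximum output, unless every
    output falls below the rejection threshold."""
    outputs = [net(feature_vec) for net in subnets]   # one matching degree per object kind
    best = max(range(len(outputs)), key=lambda i: outputs[i])
    if outputs[best] < threshold:
        return None                                   # rejected: no sub-net is confident
    return best                                       # index of the recognized object kind
```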
4 Experiments
In our experiments, six kinds of aircraft with similar shapes are used, and they are already segmented. Fig. 3 shows an example: the edge of a segmented object, thinned to one pixel.
4.1 Object Standardization
To obtain features that are invariant to translation, scaling and rotation, the objects have to be standardized. Translation does not change the GGD parameters of the sub-bands, so our main concern is the performance of the system under rotation and scaling. The methods in [8] and [9] are used to obtain rotation and scaling invariance.
Fig. 3. Result of edge detection of object 1
Fig. 4. Contourlet transform of object 1
4.2 Feature Extraction
In the contourlet transform, the 9-7 filter is chosen for image decomposition, and the decomposition has 4 scales. The numbers of sub-bands of the scales are 4, 8, 8 and 16, respectively. Fig. 4 shows the sub-band images of one object after the transform. The feature vector consists of the weighted values of α and β of the GGD model computed from 8 sub-bands in the last two scales.
4.3 Neural Network Design and Parameter Choice
Each sub-net of the MWNN can be designed separately, and the wavelet bases of the sub-nets can be the same or different; the network parameters are correlated only with the object pattern features. The values of the sequence wavelet transform of a signal reflect the relationship between the signal and the wavelet function: the bigger the wavelet transform value, the smaller the difference between the signal and the wavelet function. After comparison, the Daubechies wavelet of order 3 was chosen as the network wavelet function. The input of a sub-net is the feature vector of the corresponding pattern class. When an output is bigger than the threshold 0.5, the object belongs to that class; when all the sub-net outputs are smaller than the threshold, the result is rejected.
4.4 Experimental Results
In the experiment, each kind of object has 100 images, including images at different rotation angles and different zoom rates. Fig. 5 shows the six plane objects and several samples used in the simulation, and Table 1 presents the recognition results. The rejected objects are the ones with smaller scale that are similar in appearance, such as objects 3 and 4, and objects 2 and 6.
Fig. 5. (a)-(f) The six plane objects; (g) several samples in the experiments

Table 1. Experimental results
Pattern sort            Obj. 1  Obj. 2  Obj. 3  Obj. 4  Obj. 5  Obj. 6
Recognition rate (%)      96      94      93      92      96      92
5 Conclusions
We proposed a multi-object wavelet neural network automatic target recognition method based on the contourlet transform, together with an improved feature extraction algorithm. In this algorithm, features are extracted from the statistical parameters of the contourlet transform at multiple scales and sub-bands, and the extracted features are weighted according to their degree of dispersion. The multi-directionality of the contourlet transform fits the characteristics of human vision, and using the statistical parameters of the transform coefficients as features makes the classification robust. The sub-band feature weights improve the classification effect of the components with larger dispersion. In the MWNN system, each sub-net takes charge of recognizing one kind of object, and the connection weights and all parameters of the wavelet neurons can be optimized by training. The recognition algorithm not only has good classification ability for similar targets, but is also invariant to translation, scaling and rotation of the objects.
Acknowledgement. This work was supported by United Project of Yang-Zi Delta Integration under grant number 2005E60007.
References [1] Li-Cheng Jiao, Shan Tan. Development and prospect of image multi-scale geometric analysis. Acta Electronica Sinica 31 (2003) 1975-1981 in Chinese [2] M.N.Do, M. Vetterli. The contourlet transform: An efficient directional multiresolution image representation. IEEE Transaction on Image Processing 14 (2005) 2091-2106 [3] D.D.-Y.Po, M.N.Do, Directional multiscale modeling of images using the contourlet transform, IEEE Transactions on Image Processing 15 (2006) 1610-1620 [4] Field D J. Relations between the statistics of natural images and the response properties of cortical cells. Journal of Optical Society Am, Series A, 4 (1987) 2379-2394 [5] M. N. Do, M. Vetterli. Framing pyramids. IEEE Transaction on Signal Processing 51 (2003) 2329-2342 [6] M. N. Do, M. Vetterli, Wavelet-based texture retrieval using generalized gaussian density and Kullback-Leibler distance, IEEE Transactions on Image Processing 11 (2002) 146-158 [7] B.S.Manjunath, W.y.Ma. Texture feature for browsing and retrieval of Image Data. IEEE Trans. Pattern Analysis and Machine Intelligence 18 (1996) 837-842 [8] Hong Pan ATR Based on Wavelet Moment and Wavelet Neural Network. [Ph. D. dissertation]. Southeast University, Nanjing, 2004(in Chinese) [9] Xue Mei, Jiu-xian Li, Image recognition based on moment and multiresolution analysis, Journal of Nanjing University of Technology 25 (2003) 50-53 (in Chinese)
Facial Expression Analysis on Semantic Neighborhood Preserving Embedding Shuang Xu, Yunde Jia, and Youdong Zhao School of Computer Science and Technology Beijing Institute of Technology, Beijing 100081, P.R. China {xushuang,jiayunde,zyd458}@bit.edu.cn
Abstract. In this study, an expression manifold is constructed by Neighborhood Preserving Embedding (NPE) based on an expression semantic metric, giving a global representation of all possible facial expression images. On this learned manifold, images with semantically ‘similar’ expressions are mapped onto nearby points even when their lighting, pose and individual appearance are quite different. The proposed manifold extracts the universal expression feature and reveals the intrinsic semantic global structure and the essential relations of the expression data. Experimental results demonstrate the effectiveness of our approach.
1 Introduction
Facial expression provides meaningful information about the internal emotional states, psychological activities, or intentions of people, and plays a major role in the social communication and interaction of our daily life. Within the past decade or two, significant effort has been devoted to developing methods that automatically perform facial expression recognition, and facial expression has become an active and challenging research topic in computer vision (see surveys [1, 2]). People can recognize facial expressions easily, even though expression varies a lot across the human population: different individuals have different facial appearances and different manners of displaying expression, owing to different cultures or personalities. Ekman [3] gave evidence to support the universality of facial expression, which can usually be categorized as happiness, sadness, anger, fear, surprise, or disgust. However, it is a challenge for a machine to automatically recognize expression across different individuals; and even for the same individual, under varying pose, lighting, or context, the expression varies greatly and is difficult for a machine to recognize. Furthermore, facial expression is a distinctive visual pattern: it may be blended and can be classified quantitatively into multiple categories, and it is temporal and dynamic, varying gradually from one expression state to another. Therefore, to successfully perform automatic facial expression recognition, it is important to extract universal expression features and to reveal the blending and evolution of expressions. In recent years, psychology and computer vision research have witnessed a growing interest in discovering the underlying low-dimensional manifold of perceptual observations from high-dimensional data [4]. Based on the observation that the
images of an individual under all possible facial deformations form a smooth manifold embedded in a high-dimensional image space, Chang et al. [5,6] initially proposed the manifold concept of facial expression and successfully analyzed single-subject facial expression manifolds, on which similar expressions are points in a local neighborhood and sequences of basic emotional expressions evolve as paths, with blended expression points lying between those basic paths. Although the blending and dynamic evolution of expression can be successfully revealed on a single-subject expression manifold, extracting the universal expression feature remains a challenge. Using the Euclidean metric, expression images from different individuals always reside on different geometrically motivated manifolds owing to the variation of facial appearance across individuals. Each of these subject-specific expression manifolds has its own complicated structure, such as a different center region of the neutral face and different stretching and twisting of the expression-evolving directions. Additionally, images of one individual with the same expression under different pose and lighting may also scatter far apart on one manifold, which makes semantic analysis difficult. In order to extract universal expression features from the data set, it is necessary to generalize the different individual expression manifolds into a global representation of all possible facial expressions for semantic analysis. In this study, a uniform expression manifold is learned by constructing the neighborhood relationship of the expression data based on expression semantic similarity instead of the Euclidean distance. On the learned semantic expression manifold, images with ‘similar’ expressions are mapped onto nearby points even though their lighting, pose, subjects and contexts are quite different. This requires quantitatively weighing and comparing the expression similarity of different data points: an expression point in the gallery set must be identified with particular expression information, including its classification confidence for each category and its expression intensity. Unfortunately, in many cases of video-based facial expression recognition, the expression annotation of the gallery set simply labels the basic expression sequences, without determining intensity, and it would be exhaustive work to manually define the intensity of each facial expression image. Furthermore, if any blended expressions are collected in the gallery set, it is a formidable task even for psychologists to quantitatively assign the blended expression points to multiple categories and to determine their intensities. To compare the expression semantic similarity of the data points, it is therefore necessary to determine the particular expression information of each data point in the gallery set.
In this study, based on the Euclidean distance in the original high-dimensional image space, the expression manifold of each individual is first explored by geometrically motivated unsupervised manifold learning such as LLE [7], Isomap [8], or the Laplacian Eigenmap [9]. On each single-subject manifold, data points in a local neighborhood usually share similar expressions, and sequences of similar expressions become topologically similar paths, so it is easy to find the evolution gradient of the typical expression sequences, which belong to the basic emotional categories and whose traces extend from neutral to the apex. Along these dynamic evolution gradients, the expression intensities of the points in the typical expression sequences are determined and the corresponding expression information is obtained. By fuzzy linear neighborhood propagation,
the expression information of the basic expression points can be propagated to the blended expression points that lie between the basic expression paths, and to random expression sequences that may not start from neutral or extend to the apex but have varied durations and intensities. After this learning, the expression information in the gallery set has been exploited and the expression of each point is presented and aligned for semantic analysis. Finally, the neighborhood relationship of the whole gallery set is constructed based on the learned expression semantic similarity instead of the Euclidean distance, and the Neighborhood Preserving Embedding (NPE) [10] method is used to learn a uniform general expression manifold that extracts the universal expression feature from the data set. This semantic mapping reveals the intrinsic structure of the expression data: not only are images with similar expressions located in the same neighborhood on the manifold, but the stretching curves of the data on the manifold represent the semantic expression evolutions. The mapping is invariant to complicated non-linear transformations of the inputs, such as individual appearance differences, pose and lighting, and models a semantic expression space in which the expression semantics, intensity and dynamic evolution of new input data can be analyzed.
2 Manifold Gradient Learning and Fuzzy Linear Neighborhood Propagation
In this study, the data points are quantitatively classified into six emotional categories. Each point is represented as ξ = [E, I], where the expression membership vector is E = (e1, e2, ..., e6)^T, 1 ≤ i ≤ 6, each component ei indicating the confidence of assigning x to the i-th class, and I determines the facial expression intensity.
2.1 Manifold Gradient Learning for Basic Expression Determination
The expression manifold of each individual in the gallery set can be revealed in the Euclidean space by geometrically motivated unsupervised nonlinear manifold learning methods such as Locally Linear Embedding (LLE) [7], Isomap [8], and the Laplacian Eigenmap [9]. The gallery set contains some typical expression video sequences of the basic emotional categories, evolving from onset to apex to relaxation, with the neutral expression points and the maximal points of each basic expression clustered respectively. The intensities of all neutral expression points are set to zero, and those of all maximal basic expression points are set to one. For each basic emotional category, taking the neutral expression points and the maximal expression points as the two classes of a binary classification problem, a classifier can be learned; this binary classification amounts to searching for the gradient direction of the expression evolution on the manifold. Because the data points on a single-subject expression manifold are distributed approximately linearly, the fisherface method [11] is suitable: by a linear transformation it finds the direction that minimizes the ratio of the within-class to the between-class scatter, i.e., the discriminant direction, which can be regarded as an approximation of the expression-evolving gradient direction, as shown in Fig. 1.
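As a concrete illustration of this gradient learning, the following sketch computes a Fisher discriminant direction between the neutral and maximal-expression points of one category on a single-subject manifold and uses the projection onto it as the intensity coordinate. It is a simplified stand-in: the regularization and normalization choices are ours, not the authors'.

```python
import numpy as np

def expression_gradient(neutral_pts, apex_pts):
    """Fisher (LDA) direction from neutral to apex points on one subject's manifold."""
    mu0, mu1 = neutral_pts.mean(0), apex_pts.mean(0)
    Sw = np.cov(neutral_pts.T) + np.cov(apex_pts.T)           # within-class scatter
    w = np.linalg.solve(Sw + 1e-6 * np.eye(Sw.shape[0]), mu1 - mu0)
    return w / np.linalg.norm(w)

def intensity_along_gradient(points, w, neutral_pts, apex_pts):
    """Project points on the gradient and rescale so neutral -> 0, apex -> 1."""
    lo, hi = neutral_pts.mean(0) @ w, apex_pts.mean(0) @ w
    return np.clip((points @ w - lo) / (hi - lo), 0.0, 1.0)
```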
Fig. 1. Single-subject expression manifold gradient learning examples. (A) Three happy sequences on one subject's expression manifold, with neutral and maximal happy points marked. (B) Neutral and maximal happy faces are clustered respectively on the manifold learned by applying Isomap to the Frey face database [8].
Along these basic expression evolution gradients, the intensity of each basic expression point is determined. The expression membership vector E = (e1, e2, ..., e6)^T of a basic expression point x belonging to the j-th class is represented by setting its j-th component ej to 1 and the others to zero; its intensity I (0 ≤ I ≤ 1) is determined along the gradient.

2.2 Linear Neighborhood Propagation for Membership Function Transformation
In some cases, random expression video sequences are also collected in the gallery set, including expression sequences that may not begin from neutral or
evolve to the apex but have varied durations and intensities, or that contain blended and transitional expressions. For these random expressions it is difficult to determine the particular expression information manually. Our approach is to propagate the expression information from the basic expression points to the blended ones. Furthermore, it is reasonable to adopt the membership function of fuzzy theory [12] to describe the fact that blended expression points should be quantitatively assigned to multiple categories. Therefore, to predict the expression information of the blended and random expression points, the fuzzy membership function is incorporated into Linear Neighborhood Propagation (LNP) [13].
Let X = {x1, x2, ..., xl, xl+1, ..., xn} be a set of n data points in ℜ^D. Considering the membership of one class over the data points, the membership of the first l points xi (i ≤ l) is determined as ci, 0 ≤ ci ≤ 1, while the class membership of the remaining points xu (l + 1 ≤ u ≤ n) is undetermined. The objective is to predict the membership of the remaining undetermined data. The recovered linear reconstruction weights Wij of LLE [7] are used to predict the membership functions of the undetermined points; this prediction is based on the fact that the weights reflect the likelihood that neighboring data points have similar membership functions. Let F denote the set of classifying functions defined on X: every f ∈ F assigns a real value fi to each point xi for its membership to a category, and the membership of an undetermined data point xu is predicted by fu = f(xu). Supposing the membership of each data point can be optimally reconstructed from its linear neighborhood, the membership values are obtained by minimizing the total reconstruction error of all the data points' membership values from their neighbors' memberships:

min_f η = Σ_{i=1}^{n} | f_i − Σ_{x_j ∈ N(x_i)} W_ij f_j |²,   s.t.  f_i = c_i (1 ≤ i ≤ l).    (1)
,
ℜ . The generic problem of linear dimensionality reduction is thatD given a set of points x1 ,… , xm in ℜ , find a transformation matrix A that maps embedded in
D
Facial Expression Analysis on Semantic Neighborhood Preserving Embedding
901
y1 ,… , ym in ℜd , yi = AT xi . A reasonable crite-
these m points to a set of points
rion for choosing a “good” map is to find the following minimization:
arg min a
T
XX
T
T
a
T
XMX
a
(4)
a =1
Where M = ( I − W ) ( I − W ) and I = diag (1,… ,1) . The transformation vector a is given by the minimum eigen-values solution to the following generalized eigenvector problem: T
T XMX T a = λXX a
(5)
Facial Expression Recognition on the manifold: Upon the learning of section 2, the expression information in the gallery set is exploited and the particular expression information of each point is presented and aligned into an expression vector as ξ ∈ [ E , I ] , which can be further mapped into a spheric coordinate expression space V as shown in Fig 2 .Setting the direction of the basic expressions as the axes, when the angle of an expression vector to a basic expression axe is determined by its member-ship vector E, and the radius of the point is determined by its intensity, an expression point can be represented into a vector in space V as
ψ = (r ,θ1 ,θ 2 ,......θ 6 ) , where
r = Ι , θ i = sin
−1
6
(ei /
∑
j =1
(6)
e j ) , 1 ≤ i ≤ 6 .To weigh and determine the
similarity between two data points in the gallery set, the distance of their expression vector in the spheric space can be adopted. Obviously, the neutral face is the origin of the expression space V. Constructed the embedding graph of NPE based on the similarity metric in the expression space V instead of the Euclidean metric of the original high dimensional space, images with ‘similar’ expression are connected whatever their lighting, pose and subject individuals and contexts are quite different. For the semantic similar x 10
8
3
Basic expression 2
1
2
2
individual 1 happy
1
x1
individual 2 happy individual 3 happy individual 4 happy individual 5 happy individual 6 happy individual 1 surprise individual 2 surprise individual 3 surprise individual 4 surprise individual 5 surprise individual 6 surprise neutral faces
0
x2 1
-1
2 Basic expression 1
-2
-3 12 11
-6.9
10
-6.95 -7
9 8
x 10
-7.05 -7.1
8
-7.15 -7.2
7
-7.25 6
-7.3 5
Fig. 2. The expression presentation space V, x1 , x2 are the expression vectors of two data points, the distance of d measure the expression similarity between x1 , x2
9
x 10
-7.35 -7.4
Fig. 3. A uniform expression manifold learned from six subjected happy and surprise expression data in CK data base[15]
902
S. Xu, Y. Jia, and Y. Zhao
0.4 0
0.6 0.8 1
-0.5
1.2 1.4 -1
1.6 1.8 2
-1.5 9
x 10
-2
-2.5
-3 -5
9
x 10
08 x 10
surprise points in gallery set happy points in gallery set neutral points in gallery set surprise points in probe set happy points in probe set .
5
Fig. 4. The semantic NPE for facial expression recognition (6 subjected gallery set and 1 subjected probe set of CK data base, the probe subject not appear in gallery set)
expression points in the original high dimensional space ℜ are connected, a uniform semantic expression manifold is learned by solving Eq (5). Fig 3 shows a uniform semantic expression manifold learned from six subjected happy and surprise expression data in the CK database [14].To recognize the expression in the probe set, expression sequences in the probe set are mapped onto the semantic expression manifold learned from the gallery set based on the above method and are analyzed by the nearest neighbor rule as the example shown in Fig 4. D
4 Experiment Results The face regions in the images are all segmented from the background and normalized to 60×50-pixels patches; one subject cross all and ten folds validation testing paradigm is implemented. Through expression manifold gradient learning and fuzzy linear neighborhood propagation, the particular expression of each data point in the data set is determined and presented into an aligned expression vector ξ in the expression Table 1. T he recognition confusion matrix on CK-data base Category Happy Surprise Sadness Anger Disgust Fear
Happy 89.20% 8.70% 0 7.50% 12.50% 7.50%
Surprise 0 91.30% 0 0 0 0
Sadness 0 0 51.80% 9.70% 0 7.50%
Anger 10.80% 0 12.50% 46.50% 14.30% 0
Disgust 0 0 27.30% 26.30% 73.20% 7.50%
Fear 0 0 8.40% 0 0 77.50%
Table 2. The recognition correct Correct rate Semantic NPE PCA
happy 80.5% 65.3%
surprise 87.20% 71.90%
Happy+surprise 76.30% 52.30%
Facial Expression Analysis on Semantic Neighborhood Preserving Embedding
903
neutral happy surprise blended
Fig. 5. Examples in our own data base
representation space V as Fig 2. The recognition evaluations in Table 1 and 2 are done by selecting some typical expression frames with intensity I ≥ 2 / 3 . Experiment results on the Cohn-Kanade Facial Expression Data Sets: For the expression sequences in CK dataset [14] are basic emotional categories, the recognition p
is determined by arg max1≤l ≤6 el with confusion matrix in Table 1. Experiment Results on Blended Expression: We build our own data set containing basic and blended expression from our colleagues as shown in Fig 5. Expression sequences from seven subjects including happy and surprise two basic expression as well as some random data points blended with happy and surprise expression, are collected to conduct the experiment. Mapped a probe point with expression vector ξ onto the learned semantic expression manifold, a recognition vector Let
ξ ′ is
obtained;
d = ξ ′ − ξ , if d ≤ ε , the recognition result is taken to be correct, where ε is
a positive control factor. The recognition result of our approach (Semantic NPE) and PCA are compared in Table 2.
5 Conclusion and Future Work In this study, we propose a learning approach to learn a uniform expression manifold by semantic neighborhood preserving embedding, which not only extracts the essential universal expression feature from the data set for discriminant analysis, but also reveals the intrinsic structure and the essential relation of the expression data including the expression blending and the expression dynamic evolving. It shows promise as a unified framework for facial expression analysis. Yet building a more diversified and spontaneous expression database to improve our study, and tracking the expression on the manifold to integrate the expression dynamic into the expression visual cues recognition, etc, are open topics to put more effort in the future work.
References 1. Pantic, M., Rothkrantz, L.: Automatic Analysis of Facial Expressions: The State of the Art. IEEE Trans. Pattern Analysis and Machine Intelligence 22 (12) (2000) 2. Fasel, B., Luettin, J.: Automatic Facial Expression Analysis: A Survey. Pattern Recognition, 36 (2003) 259-275 3. Ekman, P.: Emotion in the Human Face. Cambridge University Press, New York (1982)
904
S. Xu, Y. Jia, and Y. Zhao
4. Seung, H.S., Lee, D.D.: The Manifold Ways of Perception. Science 290 (2000) 2268-2269 5. Chang, Y., Hu, C., Turk, M.: Manifold of Facial Expression. Proceedings of IEEE International Workshop on Analysis and Modeling of Faces and Gestures, Nice, France (2003) 6. Elgammal, A., Lee, C.: Separating Style and Content on a Nonlinear Manifold. In Proc. Computer Vision and Pattern Recognition Conf, Washington (2004) 7. Roweis, S., Saul, L.K.: Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science 290 (2000) 8. Tenenbaum, J.B., Silva, V., Langford, J.C.: A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science 290 (2000) 9. Belkin, M., Niyogi, P.: Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Computation 15 (6) (2003) 1373-1396 10. He, X.F., Cai, D., Yan, S.C., Zhang, H.J.: Neighborhood Preserving Embedding. IEEE Conf. on ICCV’05 2 (2005) 1208-1213 11. Belhumeur, P.N., Hespanda, J., Kiregeman, D.: Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection. IEEE Trans. on PAMI 19 (7) (1997) 711-720 12. Kandel, A.: Fuzzy Techniques in Pattern Recognition. Wiley, New York (1982) 13. Wang, F., Wang, J., Zhang, C., Shen, H.C.: Semi-Supervised Classification Using Linear Neighborhood Propagation. Proceedings of Int. Conf. on Computer Vision and Pattern Recognition (2006) 14. Kanade, T., Cohn, J., Tian, Y.: Comprehensive Database for Facial Expression Analysis. In Proc. IEEE Inter. Conf. on Face and Gesture Recognition (2000) 46–53
Face Recognition from a Single Image per Person Using Common Subfaces Method Jun-Bao Li1, Jeng-Shyang Pan2, and Shu-Chuan Chu3 1
Department of Automatic Test and Control, Harbin Institute of Technology, Harbin, China
[email protected] 2 Department of Electronic Engineering, National Kaohsiung University of Applied Sciences, Kaohsiung, Taiwan 3 Department of Information Management, Cheng Shiu University, Kaohsiung, Taiwan
Abstract. In this paper, we propose a face recognition method from a single image per person, called the common subfaces method, to solve the "one sample per person" problem. First, the single image per person is divided into multiple sub-images, which are regarded as the training samples for feature extraction. Then we propose a novel formulation of common vector analysis, from a space isomorphic mapping view, for feature extraction. In the recognition procedure, the common vector of the subfaces from the test face image is derived by the same procedure and compared with the common vector of each class to predict the class label of the query face. The experimental results suggest that the proposed common subfaces approach provides a better representation of individual common features and achieves a higher recognition rate in face recognition from a single image per person compared with traditional methods.
1 Introduction
Face recognition is one of the most active research areas in computer vision and pattern recognition, with practical applications that include forensic identification, access control, and human-computer interfaces. Many face recognition algorithms have been developed, such as the eigenface method and linear discriminant analysis (LDA) [1][2]. Recently, a pattern recognition method called the common vector approach was applied to isolated word recognition [3]. Environmental effects and personal differences are removed by deriving a common vector from a spoken word, which represents the common properties of that word. The approach has since been applied to face recognition [4][5], but most such methods suffer a serious performance drop, or even fail to work, if only one training sample per person is available to the system; this is called the "one sample per person" problem. In this paper, we propose a novel face recognition method, common subfaces, to solve the "one sample per person" problem. First, the one image per person is divided into many sub-images by sampling the original face image; second, we extend linear discriminant analysis (LDA) to common vector discriminant analysis from a new viewpoint for feature extraction.
The rest of the paper is organized as follows. In Section 2, the proposed algorithm is presented. Experimental results and conclusions are given in Sections 3 and 4, respectively.
2 Common Subfaces Method
Given a stored database of faces, the goal is to identify a person from the database later in time, in different and unpredictable poses, lighting, etc., from just one image; this is the "one sample per person" problem. In this section, we describe the common subfaces method in detail. First, the one image per person is divided into many sub-images to form the training set. On this training set of sub-images, we apply our common vector discriminant analysis to extract the features of the face image of each person.

2.1 Subfaces
Given C face images from C persons, that is, one sample per person for training, each image is first divided into multiple subfaces by sampling the single face image. The algorithm is briefly described as follows. Suppose the size of the image is M × N; the image is divided into p × q smaller blocks of size m × n (m < M, n < N). We then construct new sub-images by randomly sampling the blocks: each pixel of a new sub-image is randomly sampled from the corresponding block, one pixel per block, so that each sub-image has p × q pixels. The original image is thus divided into m × n sub-images. One example is shown in Fig. 1: the cropped face image of size 100 × 100 (M = 100, N = 100) is divided into 50 × 50 (p = 50, q = 50) blocks of size 2 × 2 (m = 2, n = 2); thus one image is divided into 4 sub-images, each of size 50 × 50. Formally, let I_block^{(e,f)}(g,h) (e = 1,...,p, f = 1,...,q, g = 1,...,m, h = 1,...,n) denote the pixels of block (e,f) of the original face image I(i,j) (i = 1,...,M, j = 1,...,N). Then the sub-images I_sub^{(g,h)}(k,l) (k = 1,...,p, l = 1,...,q) are obtained as

    I_sub^{(a,b)}(e,f) = I_block^{(e,f)}(a,b),    (1)

where a is a random number between 1 and p and b is a random number between 1 and q. Using this random sampling method, a new training set containing multiple training samples in each class is obtained.

Fig. 1. Example of the procedure of creating the sub-face images. (The original face image of size 100 × 100 is divided into 50 × 50 blocks of size 2 × 2.)
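The following is a minimal NumPy sketch of this sampling step (not from the paper), assuming the indexing just described: an M × N image is split into p × q blocks of size m × n, and each of the m × n sub-images takes one randomly chosen pixel from every block. The function name is illustrative.

import numpy as np

def make_subfaces(image, m, n, rng=None):
    """Split `image` (M x N) into p x q blocks of size m x n and build
    m*n sub-images of size p x q, each pixel drawn at random from the
    matching block."""
    rng = np.random.default_rng() if rng is None else rng
    M, N = image.shape
    p, q = M // m, N // n
    subfaces = np.empty((m * n, p, q), dtype=image.dtype)
    for s in range(m * n):
        # random within-block offsets, one pair per block
        a = rng.integers(0, m, size=(p, q))
        b = rng.integers(0, n, size=(p, q))
        for e in range(p):
            for f in range(q):
                subfaces[s, e, f] = image[e * m + a[e, f], f * n + b[e, f]]
    return subfaces

# Example: a 100 x 100 face image yields 4 sub-images of size 50 x 50.
face = np.random.rand(100, 100)
subs = make_subfaces(face, m=2, n=2)
print(subs.shape)  # (4, 50, 50)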
2.2 Common Subfaces
In this section, we describe the common subfaces method in detail. We extend linear discriminant analysis (LDA) to common vector discriminant analysis from a new viewpoint. Different from the traditional common vector analysis [4],[5], we propose a novel formulation of common vector analysis from a space isomorphic mapping view. Given C face images from C persons, a single image per person, each image is divided into N sub-images. Based on LDA, the within-class scatter matrix and the between-class scatter matrix are defined respectively as

    S_W = \sum_{i=1}^{C} \sum_{j=1}^{N} (x_i^j - m_i)(x_i^j - m_i)^T,    (2)

    S_B = \sum_{i=1}^{C} N (m_i - m)(m_i - m)^T.    (3)

We know that S_W and S_B are both positive operators on a Hilbert space H. In the Hilbert space H, the Fisher criterion function can be defined by

    J(\varphi) = \frac{\varphi^T S_B \varphi}{\varphi^T S_W \varphi}.    (4)

In the special case where \varphi^T S_W \varphi = 0, the Fisher criterion degenerates into the following between-class scatter criterion:

    J_b(\varphi) = \varphi^T S_B \varphi,   (||\varphi|| = 1).    (5)

The between-class scatter makes the data well separable when the within-class scatter is zero. The criterion given in (5) is very intuitive, since it is reasonable to use the between-class scatter to measure the discriminatory ability of a projection axis when the within-class scatter is zero. Now let us study the criterion given in (5) in the Hilbert space. The eigenvector system of a compact and self-adjoint operator on the Hilbert space H forms an orthonormal basis for H (H = \Psi \oplus \Psi^\perp) [7]. That is, an arbitrary vector \varphi \in H can be uniquely represented in the form \varphi = \phi + \zeta with \phi \in \Psi and \zeta \in \Psi^\perp, so let us define the mapping P: H \to \Psi by \varphi = \phi + \zeta \to \phi, where \phi is called the orthogonal projection of \varphi onto \Psi. It is easy to verify that P is a linear operator from H onto its subspace \Psi. Under the mapping P: H \to \Psi determined by \varphi = \phi + \zeta \to \phi, the Fisher criterion satisfies J_b(\varphi) = J_b(\phi).
Let \varphi = P\theta be an isomorphic mapping from R^m onto \Psi. Then \varphi^* = P\theta^* is a stationary point of J_b(\varphi) if and only if \theta^* is a stationary point of J_b(\theta). So if \theta_1, \theta_2, ..., \theta_d is a set of stationary points of the function J_b(\theta), then \varphi_1 = P\theta_1, \varphi_2 = P\theta_2, ..., \varphi_d = P\theta_d is a set of optimal discriminant vectors with respect to the Fisher criterion J_b(\varphi). Now we split the space R^m into two subspaces: the null space and the range space of S_W. We then use the Fisher criterion to derive the regular discriminant vectors from the range space and use the between-class scatter criterion to derive the irregular discriminant vectors from the null space. Given the orthonormal eigenvectors \alpha_1, \alpha_2, ..., \alpha_m of S_W, \Omega_w = span{\alpha_1, \alpha_2, ..., \alpha_q} is the range space and \bar{\Omega}_w = span{\alpha_{q+1}, \alpha_{q+2}, ..., \alpha_m} is the null space of S_W, with R^m = \Omega_w \oplus \bar{\Omega}_w and q = rank(S_W). Since \Omega_w and \bar{\Omega}_w are isomorphic to the Euclidean spaces R^q and R^p (p = m - q) respectively, and letting P_1 = (\alpha_1, \alpha_2, ..., \alpha_q) and P_2 = (\alpha_{q+1}, \alpha_{q+2}, ..., \alpha_m), we can define the corresponding isomorphic mapping by

    \varphi = P_2 \theta.    (6)

Under this mapping, J_b(\varphi) is converted into

    J_b(\theta) = \theta^T S_{ob} \theta,   (||\theta|| = 1),    (7)

where S_{ob} = P_2^T S_B P_2. The stationary points \mu_1, ..., \mu_d (d \le C - 1) of J_b(\theta) are the orthonormal eigenvectors of S_{ob} corresponding to the d largest eigenvalues. According to equations (5) and (6), the optimal irregular discriminant vectors \varphi_1, \varphi_2, ..., \varphi_d with respect to J_b(\varphi) can be acquired by \varphi_i = P_2 \mu_i (i = 1, ..., d). For a sample x, the irregular discriminant feature vector can be obtained as

    y = (\varphi_1, \varphi_2, ..., \varphi_d)^T x.    (8)
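As an illustration of Eqs. (2)-(8), the following hedged NumPy sketch (not the authors' implementation) builds S_W and S_B from vectorised sub-face samples, takes an approximate null-space basis P_2 of S_W, and diagonalises S_ob = P_2^T S_B P_2; the eigenvalue tolerance and the function name are assumptions.

import numpy as np

def irregular_discriminant_vectors(X, labels, d):
    """X: (n_samples, n_features) sub-face vectors; labels: class ids."""
    classes = np.unique(labels)
    m_all = X.mean(axis=0)
    dim = X.shape[1]
    S_W = np.zeros((dim, dim))
    S_B = np.zeros((dim, dim))
    for c in classes:
        Xc = X[labels == c]
        mc = Xc.mean(axis=0)
        S_W += (Xc - mc).T @ (Xc - mc)
        S_B += len(Xc) * np.outer(mc - m_all, mc - m_all)
    # orthonormal eigenvectors of S_W; the null space is spanned by
    # eigenvectors with (numerically) zero eigenvalue
    w_vals, w_vecs = np.linalg.eigh(S_W)
    P2 = w_vecs[:, w_vals < 1e-10 * w_vals.max()]
    S_ob = P2.T @ S_B @ P2
    b_vals, b_vecs = np.linalg.eigh(S_ob)
    mu = b_vecs[:, np.argsort(b_vals)[::-1][:d]]   # d largest eigenvalues
    return P2 @ mu                                  # phi_i = P2 mu_i

# The feature vector of a sample x is then y = Phi.T @ x, as in Eq. (8).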
The stationary points \mu_1, ..., \mu_d (d \le C - 1) of J_b(\theta) are the orthonormal eigenvectors of S_{ob} corresponding to the d largest eigenvalues, so

    S_{ob} \mu_i = \lambda \mu_i,   i = 1, 2, ..., d,    (9)

which immediately leads to

    P_2 S_{ob} \mu_i = \lambda P_2 \mu_i,   i = 1, 2, ..., d.    (10)

Since P_2 = (\alpha_{q+1}, \alpha_{q+2}, ..., \alpha_m), we know that P_2^T P_2 = c, where c is a constant. So

    P_2 S_{ob} (P_2^T P_2) \mu_i = \lambda (P_2^T P_2) P_2 \mu_i,   i = 1, 2, ..., d,    (11)

that is,

    (P_2 S_{ob} P_2^T) P_2 \mu_i = \lambda (P_2 P_2^T) P_2 \mu_i,   i = 1, 2, ..., d.    (12)

Let w_i = P_2 \mu_i and \lambda_w = c\lambda; we obtain

    (P_2 S_{ob} P_2^T) w_i = \lambda_w w_i,   i = 1, 2, ..., d.    (13)

We can see that w_i is an eigenvector of \tilde{S}_b = P_2 S_{ob} P_2^T corresponding to one of the d largest eigenvalues, where

    \tilde{S}_b = \sum_{i=1}^{C} N (P_2 P_2^T m_i - P_2 P_2^T m)(P_2 P_2^T m_i - P_2 P_2^T m)^T.    (14)

In the projected feature space, y_i^j = P_2 P_2^T x_i^j, so it is easy to obtain u_i = P_2 P_2^T m_i and u = P_2 P_2^T m. Since \bar{\Omega}_w = span{\alpha_{q+1}, \alpha_{q+2}, ..., \alpha_m} is the null space of S_W and P_2 = (\alpha_{q+1}, \alpha_{q+2}, ..., \alpha_m), we have P_2^T S_W P_2 = 0, so it is easy to obtain

    P_2 P_2^T S_W P_2 P_2^T = \sum_{i=1}^{C} \sum_{j=1}^{N} (y_i^j - u_i)(y_i^j - u_i)^T = 0.    (15)

Let Y_C = [y_1^1 - u_1, y_1^2 - u_1, ..., y_C^N - u_C]; then Y_C Y_C^T = 0. That is, for any sample y_i^j in the ith class we obtain the same unique vector u_i for all samples of that class. Equation (14) can therefore be rewritten as

    \tilde{S}_b = \sum_{i=1}^{C} N (u_i - u)(u_i - u)^T.    (16)

Let x_{com}^i = P_2 P_2^T x_i^j; then the above equation can be rewritten as

    S_{com} = \sum_{i=1}^{C} N (x_{com}^i - u_{com})(x_{com}^i - u_{com})^T,    (17)

where u_{com} = (1/C) \sum_{i=1}^{C} x_{com}^i. From equation (8), for a test sample x, the discriminant feature vector can be obtained as

    y = (w_1, w_2, ..., w_d)^T x,    (18)

where w_1, w_2, ..., w_d (d \le C - 1) are the orthonormal eigenvectors of S_{com}.
The procedure of the common subfaces algorithm is described as follows.
Step 1. Given C face images from C persons for training, a single image per person, divide each image into N sub-images x_i^j, i = 1, 2, ..., C, j = 1, 2, ..., N.
Step 2. Compute the within-class scatter matrix S_W using equation (2). Then compute the orthonormal eigenvectors \alpha_1, \alpha_2, ..., \alpha_m of S_W, and let P_2 = (\alpha_{q+1}, \alpha_{q+2}, ..., \alpha_m), where q = rank(S_W).
Step 3. Create the common between-class scatter matrix S_{com} from the common subfaces x_{com}^i = P_2 P_2^T x_i^j, i = 1, 2, ..., C, j = 1, 2, ..., N.
Step 4. Compute the orthonormal eigenvectors w_1, w_2, ..., w_d (d \le C - 1) of S_{com}.
Step 5. For a test sample x, divide x into N sub-images x_{sub}^j, j = 1, 2, ..., N, and compute the common subfaces y_{com}^j = W^T x_{sub}^j, j = 1, 2, ..., N, where W = [w_1, w_2, ..., w_d].
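A hedged sketch of Steps 3-5 is given below, reusing the null-space basis P_2 from the previous sketch; the helper name and the nearest-centre comparison at the end are illustrative assumptions, not taken from the paper.

import numpy as np

def common_subface_features(X, labels, P2, d):
    """X: (n_samples, n_features) sub-face vectors; P2: null-space basis of S_W."""
    proj = P2 @ P2.T                      # projection onto null(S_W)
    x_com = X @ proj.T                    # x_com = P2 P2^T x for every sample
    classes = np.unique(labels)
    centers = np.vstack([x_com[labels == c].mean(axis=0) for c in classes])
    u = centers.mean(axis=0)              # u_com
    S_com = sum(len(X[labels == c]) * np.outer(uc - u, uc - u)
                for c, uc in zip(classes, centers))
    vals, vecs = np.linalg.eigh(S_com)
    W = vecs[:, np.argsort(vals)[::-1][:d]]   # w_1 ... w_d
    return W, centers @ W                      # per-class common feature vectors

# A test face would be split into sub-images, projected with W, and assigned to
# the class whose common feature vector is nearest (e.g. Euclidean distance).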
3 Experimental Results
In this section, we evaluate the performance of the proposed common subfaces method on two face databases, the ORL face database [6] and the Yale face database [2]. In our experiments, to reduce the computational complexity, we resize the original ORL face images, sized 112 × 92 pixels with 256 gray levels, to 48 × 48 pixels. We randomly select one image from each subject, 40 images in total, for training, and the remaining 360 images are used to test the performance. Similarly, the images from the Yale database are cropped to 100 × 100 pixels; one randomly selected image per person is regarded as the training sample, while the remaining 10 images per person are used to test the performance of the algorithms. We also implement the popular Eigenfaces and Fisherfaces methods in the experiments. In the procedure of creating the subfaces, for the Yale face database the cropped face image of size 100 × 100 (M = 100, N = 100) is divided into 50 × 50 (p = 50, q = 50) blocks of size 2 × 2 (m = 2, n = 2), so one image is divided into 4 sub-images of size 50 × 50. For the ORL face database, the cropped face image of size 48 × 48 (M = 48, N = 48) is divided into 24 × 24 (p = 24, q = 24) blocks of size 2 × 2 (m = 2, n = 2), so one image is divided into 4 sub-images of size 24 × 24. As shown in Table 1, our common subfaces method gives a higher recognition accuracy than the Sub-Eigenfaces and Sub-Fisherfaces methods applied to the same subfaces. Additionally, our method gives a higher recognition rate than the Eigenfaces method, as shown in Table 2.

Table 1. Performance of the common sub-faces method
Methods              ORL face database   Yale face database
Sub-Eigenfaces       0.5528              0.4200
Sub-Fisherfaces      0.5556              0.4667
Common Sub-faces     0.5778              0.5800
Table 2. Sub-Commonfaces vs. Eigenface methods
Methods              Eigenfaces   Common Sub-faces
ORL face database    0.5583       0.5778
Yale face database   0.5533       0.5800
4 Conclusion
A novel face recognition method, common subfaces, is proposed to solve the "one sample per person" problem. The main contributions are summarized as follows. 1) We propose a novel method to create subfaces from a single training image per person. 2) We propose a novel formulation of common vector analysis from a space isomorphic mapping view for feature extraction: a common vector, called the common subface, is derived for each person from the subface images of that person's single face image, which aims to solve the "one sample per person" problem.
References 1. Martinez, A.M., Kak, A.C.: PCA Versus LDA. IEEE Trans. Pattern Analysis and Machine Intelligence 23 (2001) 228-233 2. Belhumeur, P.N, Hespanha, J.P., Kriengman, D.J.: Eigenfaces vs Fisherfaces: Recognition Using Class Specific Linear Projection. IEEE Trans. Pattern Analysis and Machine Intelligence 19 (1997) 711–720
3. Gülmezoglu, M.B, Dzhafarov, V., Keskin, M., Barkana, A.: A Novel Approach to Isolated Word Recognition. IEEE Trans. Speech Audio Process 7 (1999) 620–628 4. He, Y.H., Zhao, L., Zou, C.R., Face Recognition Using Common Faces Method. Pattern Recognition 39 (2006) 2218 – 2222 5. Cevikalp, H., Neamtu, M., Wilkes, M., Barkana, A.: Discriminative Common Vectors for Face Recognition. IEEE Trans. Pattern Analysis and Machine Intelligence 27 (2005) 4-13 6. Samaria, F., Harter, A.,: Parameterisation of a Stochastic Model for Human Face Identification. Proceedings of 2nd IEEE Workshop on Applications of Computer Vision, Sarasota FL, December 1994 7. Yang, J., Frangi, A.F., Yang, J.Y., Zhang, D., Jin, Z.: KPCA Plus LDA: A Complete Kernel Fisher Discriminant Framework for Feature Extraction and Recognition. IEEE Trans. Pattern Analysis and Machine Intelligence 27 (2005) 230-244
A Structural Adapting Self-organizing Maps Neural Network Xinzheng Xu1 , Wenhua Zeng2 , and Zuopeng Zhao1 1
School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, 221008, China {xxzheng,zzpeng}@cumt.edun.cn 2 School of software, Xiamen University, Xiamen, 361005, China
[email protected]
Abstract. A genetic algorithm is introduced into network optimization to overcome the limitations of the conventional SOM network. Based on this idea, a new model of structural adapting self-organizing neural network is proposed. In this model, each neuron is regarded as an individual of the evolutionary population, and three operators are constructed: a growing operator, a pruning operator and a stochastic creating operator. In the algorithm, the accumulative error of a neuron is selected as the fitness function in each iteration, and the neurons on the compete layer are generated or deleted adaptively according to the values of the fitness function until there is no further change of the neurons on the compete layer. Simulation experiments indicate that this structurally adaptive network has better performance than the conventional SOM network.
1
Introduction
Self-organizing maps (SOM) neural networks [1] have been used as a tool for mapping high-dimensional data into a low-dimensional feature map. The main advantage of such a mapping is that it is possible to gain some idea of the structure of the data by observing the map, due to the topology-preserving nature of the SOM [2]. It has been theoretically proved that the SOM in its original form does not provide complete topology preservation, and several researchers have attempted to overcome this limitation [3]. Several dynamic neural network models have been developed in the past which attempt to overcome the limitation of a fixed-structure network. Some of these models are as follows. 1) Growing Cell Structures (GCS) [4] was presented by B. Fritzke; its main advantage is that it can find an appropriate structure automatically. The GCS used a drawing method which worked well with relatively low-dimensional data, but the mapping could not be guaranteed to be planar for high-dimensional data, which caused problems in visualizing high-dimensional data. 2) Z. Wu et al. presented a structural adapting self-organizing neural network (SASONN) [5]. SASONN overcomes incorrect mapping, neuron underuse and the boundary effect,
but the rule for growing and pruning neurons still needed further specification. 3) Growing SOM [6] was presented by D. Alahakoon; it carries out node generation interleaved with the self-organization, using a spread factor to control the growth of the network. The shortcoming of the Growing SOM is that it does not provide a rule for pruning neurons, which can cause the network's growth to terminate prematurely when the scale of the network is limited. To overcome these shortcomings of the SOM network, the essence of a genetic algorithm is introduced into network optimization to guide the growth and pruning of neurons. Based on this idea, the model of the Structural Adaptive Self-organizing Maps Neural Network (SASOMNN) is proposed. The paper is organized as follows. Section II discusses the new model of SASOMNN in detail. The algorithm of SASOMNN is described in Section III. In Section IV, computer simulations are given. Finally, the conclusions of this study are presented in Section V.
2
Principle of SASOMNN
In the model of SASOMNN, the accumulative error is used as the fitness function during the evolution, and growing neurons are selected by the roulette wheel strategy [7]. After new nodes are grown, the weight values of these nodes are assigned according to a certain rule. The process of training is divided into three phases: the initialization phase, the growing phase, and the smoothing phase.
2.1 Initialization Phase
The network is initialized with four nodes (Fig. 1) because: 1) it is a good starting position to implement a two-dimensional lattice structure; 2) all starting nodes are boundary nodes, thus each node has the freedom to grow in its own direction at the beginning. The starting four nodes are initialized with random values from the input vector space, or the input vector value range. Since the input vector attributes are normalized to the range 0 − 1, the initial weight vector attributes can take random values in this range [6].
Fig. 1. Initial SOM
2.2 Growing Phase
Firstly, all samples are trained by the initial SOM using the conventional algorithm, and then the winning number and accumulative error of each neuron are obtained. The error value of each neuron is calculated as

    E_i(t + 1) = E_i(t) + \sum_{j=1}^{D} (X_{pj} - W_{ij})^2,    (1)
for neuron i at time (iteration) t, where D is the number of dimensions (attributes) of the input data, M is the number of neurons on the compete layer, and X_{pj} and W_{ij} are the input and weight vectors, respectively. Thus, for each winner node, the difference between the weight vector and the input vector is accumulated as an error value.
Growing Operator. The neurons to grow in the next iteration are selected by the roulette wheel strategy. In this process, the selection probability of each neuron is proportional to its fitness value. The sum over all neurons on the compete layer, E_total, is calculated as

    E_{total} = \sum_{i=1}^{M} E_i.    (2)
Thus the proportion E_i/E_total of the fitness of individual i is regarded as its selection probability, and the neurons to grow in the next generation are selected according to this probability. In every generation, the growth of each neuron occurs only once, while new neurons are created around it.
– New node generation. New nodes are always grown from a boundary node (Fig. 2). A boundary node is one that has at least one of its immediate neighboring positions free of a node. In our model, each node has four immediate neighbors, so a boundary node can have from one to three neighboring positions free. If a node is selected for growth, new nodes grow on all of its free neighboring positions, as this is computationally easier to implement than calculating the exact position of the new node. This will create some redundant nodes, but these nodes can easily be identified and removed by the pruning operator described below.
Fig. 2. New node's generation around a boundary node
– Weight initialization of new nodes. The initial weight values of newly grown nodes must be assigned. If new nodes are initialized with random values in the range 0–1, these values will probably not match their neighborhoods. Therefore, a crossover-and-mutation strategy is used that takes the smoothness properties of the existing map into account and initializes the new weights to match their neighborhoods. There are two situations to consider: one is that there are one, two or three nodes around the growing node (Fig. 3), and the
Fig. 3. Weight initialization of a new node
other is that there is no node around the growing node. In Fig. 3(a), there is one node around the growing node; W1 and W2 are the weights of the growing node and of one of its neighbors, respectively. Wnew, which represents the weight of the new node, is calculated by the following strategy. First, select a crossing point randomly in the weight vectors W1 and W2; then, through two-point crossover, two new weight vectors are obtained, one of which is assigned to Wnew. If there are two or three nodes around the growing node, select one of them as the node to be crossed with it; Wnew is then obtained in the same way as above. In the other case, where there is no node around the growing node, as shown in Fig. 3(b), select one or several components of the weight vector of the growing node and replace the selected components with random values in the range 0–1. In this way the weight value of the new node is obtained, and the weight values of other new nodes can be calculated using a similar method. In addition, mutation also occurs with a small probability, and the number of mutated weight-vector components can be set to 1 or 2. (A sketch of the growing-operator selection and of this weight initialization is given at the end of this subsection.)
Pruning Operator. In each iteration cycle, the winning numbers of the neurons are not equal. When the winning number of a neuron on the compete layer is always equal to zero, this neuron can be removed using the pruning operator, and the weight vectors connecting to this neuron are removed as well.
Stochastic Creating Operator. In the genetic algorithm process, the stochastic creating operator can generate a new isolated neuron to increase the number of classes represented by the neurons on the compete layer. The weight values between the new neuron and the neurons on the input layer are assigned values between 0 and 1.
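A rough NumPy sketch of the selection and weight-initialization steps above is given below. It uses the accumulated errors directly as fitness, and it implements one reading of the crossover/mutation strategy; the function names and the default mutation probability are illustrative assumptions, not taken from the paper.

import numpy as np

def select_growing_nodes(errors, rng=None):
    """Roulette-wheel selection: E_i / E_total is the selection probability;
    the wheel is spun M times, M being the number of compete-layer neurons."""
    rng = np.random.default_rng() if rng is None else rng
    E = np.asarray(errors, dtype=float)
    probs = E / E.sum()
    M = len(E)
    return rng.choice(M, size=M, replace=True, p=probs)

def init_new_weight(w_grow, w_neighbor=None, p_mut=0.05, rng=None):
    """Weight vector for a newly grown node (assumes len(w_grow) >= 3)."""
    rng = np.random.default_rng() if rng is None else rng
    if w_neighbor is None:
        # no node around the growing node: replace one or two random
        # components of its weight vector with random values in [0, 1]
        w_new = w_grow.copy()
        idx = rng.choice(len(w_new), size=rng.integers(1, 3), replace=False)
        w_new[idx] = rng.random(len(idx))
    else:
        # two-point crossover between the growing node and one neighbour;
        # one of the two offspring is kept as the new node's weights
        cuts = rng.choice(np.arange(1, len(w_grow)), 2, replace=False)
        c1, c2 = sorted(cuts)
        w_new = np.concatenate([w_grow[:c1], w_neighbor[c1:c2], w_grow[c2:]])
    # mutation with small probability on one component
    if rng.random() < p_mut:
        j = rng.integers(len(w_new))
        w_new[j] = rng.random()
    return w_new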
2.3 Smoothing Phase
The smoothing phase occurs after the new node growing phase. The growing phase stops when new node growth saturates, which can be identified by the low frequency of new node growth. No new nodes are added during this phase. The purpose is to smooth out any existing quantization error, especially in the nodes grown at the latter stages of the growing phase. The starting learning rate in this phase is less than in the growing phase, since the weight values should not fluctuate too much without converging. The smoothing phase is stopped when the error values of the nodes in the map become very small.
3 The Algorithm of SASOMNN
The algorithm of SASOMNN is described as follows.
Step 1. Initialization phase.
– Initialize the pruning probability Pd, the stochastic creating probability Pr, and the iteration degree.
– The network is initialized with four nodes whose weight values are normalized to the range 0–1.
Step 2. Growing phase.
– Train all samples through the network; calculate the winning number and accumulative error E(t).
– Regard each neuron on the compete layer as an individual, the population being the neurons on the compete layer.
– Calculate E_total and E_i/E_total, which is regarded as the selection probability of each individual.
– Select new neurons by the roulette wheel strategy: the wheel runs M circuits and M neurons are selected. The structure of the network is adjusted by the growing operator, pruning operator and stochastic creating operator until the iteration number is reached.
– Start the next iteration and repeat the above four steps until no new nodes are generated.
Step 3. Smoothing phase. Train the network by the conventional algorithm until the error values of the nodes become very small.
4
Simulation Results
The animal data set was originally introduced by Ritter and Kohonen to illustrate the SOM on a high-dimensional data set. It consists of the descriptions of 16 animals by binary property lists, tabulated in Table 1 [8]. From Table 1 we know these animals can be partitioned into three classes: birds containing seven animals, carnivores containing six animals, and herbivores containing three animals. There are a total of 13 attributes available; if an attribute applies to an animal, the corresponding table entry is one, otherwise zero. To demonstrate the performance of our method, we first use the conventional SOM algorithm to project the animal data set. The 13 properties of the animal data set constitute the input vector to a network of 11 × 11 neurons.
918
X. Xu, W. Zeng, and Z. Zhao Table 1. Animal Names and Binary Attributes
Dove Hen Duck Goose Owe Hawk Eagle Fox Dog Wolf Cat Tiger Lion Horse Zebra Cow
SmallMediumBig2Legs4LegsHairHoovesManeFeathersHuntRunFlySwim 1 0 0 1 0 0 0 0 1 0 0 1 0 1 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 1 1 0 0 1 0 0 0 0 1 0 0 1 1 1 0 0 1 0 0 0 0 1 1 0 1 0 1 0 0 1 0 0 0 0 1 1 0 1 0 0 1 0 1 0 0 0 0 1 1 0 1 0 0 1 0 0 1 1 0 0 0 1 0 0 0 0 1 0 0 1 1 0 0 0 0 1 0 0 0 1 0 0 1 1 0 1 0 1 1 0 0 1 0 0 0 1 1 0 0 0 1 0 0 0 0 0 1 0 1 1 0 0 0 1 1 0 0 0 0 1 0 1 1 0 1 0 1 1 0 0 0 0 1 0 1 1 1 1 0 0 1 0 0 0 0 1 0 1 1 1 1 0 0 1 0 0 0 0 1 0 1 1 1 0 0 0 0 0 0
Fig. 4. The zoo data mapped to the SOM

Fig. 5. The zoo data mapped to the SASOMNN
3000. After the end of the iterations, the structure is shown in Fig. 5. From Fig. 5 we know the neurons on the compete layer are divided into three segments, but each segment includes fewer neurons than the corresponding segment in Fig. 4, and the cluster centers become more concentrated. Table 2 gives the neuron numbers of each
Table 2. Comparison of neuron numbers on the compete layer between SOM and SASOM
          Birds   Carnivores   Herbivores
SOM       47      52           22
SASOM     24      16           10
segment. From Table 2 we see that birds contain 24 neurons, carnivores 17 neurons, and herbivores 10 neurons. The number of neurons on the compete layer of the latter structure does not reach half that of the former, so the utilization rate of the neurons becomes higher, and SASOMNN performs better than the conventional SOM. In addition, the neurons generated dynamically in the growing phase keep the same regular two-dimensional lattice structure as the initial structure, so the visualization effect of SASOMNN is good.
5
Conclusion
The model of SASOMNN is presented to overcome the shortcoming of the fixed structure of the conventional SOM. In the algorithm of SASOMNN, a genetic algorithm is introduced into network optimization: the growing operator, pruning operator and stochastic creating operator guide the generation and pruning of neurons on the compete layer. Simulation experiments indicate that the SASOMNN network has better performance than the conventional SOM network. However, when the genetic algorithm is used to optimize the structure of the SOM, the iteration degree must be increased to ensure sufficient evolution of the individuals, so the SASOMNN algorithm takes more time, especially when the evolution operators are not suitable. Future work therefore includes reducing the complexity of the algorithm by using more efficient operators and increasing the efficiency of its execution. In addition, the feasibility of the algorithm should be studied further by applying it in more domains so that it can be popularized and applied widely.
References 1. Kohonen, T.: Self-Organization and Associative Memory, 3rd ed. Berlin: SpringerVerlag (1989) 2. Kohonen, T.: Self-Organizing Maps. Berlin, Germany: Springer-Verlag (1995) 3. Villmann, T., Der, R., Hermann, M., Martinetz, M.: Topology Preservation in SelfOrganizing Feature Maps: Exact Definition and Measurement. IEEE Trans. Neural Networks 8 (1997) 256-266 4. Fritzke, B.: Growing Cell Structures-a Self-Organizing Network for Unsupervised and Supervised Learning. Neural Network 7 (1994)1441-1460 5. Wu, Z., Yan, P.F.: A Study on Structural Adapting Self-Organizing Neural Network. Acta Electronic Sinica 27 (1999) 55-58 6. Alahakoon, D., Halgamuge, S.K.: Dynamic Self-Organizing Maps with Controlled Growth for Knowledge Discovery. IEEE Trans. Neural Networks 12 (2000) 153-158
7. Goldberg, D.E.: Genetic Algorithm in Search Optimization and Machine Learning. Reading, MA, Addison-Wesley Publishing (1989) 8. Ritter, H.J., Kohonen, T.: Self-Organizing Semantic Maps. Biol. Cybern. 61 (1989) 241-254 9. Chun, M., Chang, H.T.: Fast Self-Organizing Feature Map Algorithm. IEEE Trans. Neural Networks 11 (2000) 721-733
How Good Is the Backpropogation Neural Network Using a Self-Organised Network Inspired by Immune Algorithm (SONIA) When Used for Multi-step Financial Time Series Prediction? Abir Jaafar Hussain and Dhiya Al-Jumeily Liverpool John Moores University, Byrom Street, Liverpool, L3 3AF, UK {a.hussain,d.aljumeily}@ljmu.ac.uk
Abstract. In this paper, a novel application of the backpropagation network using a self-organised layer inspired by immune algorithm is used for the prediction of financial time series. The simulations assess the data from two time series: Firstly the daily exchange rate between the US dollar and the Euro for the period from the 3rd January 2000 until the 4th November 2005, giving approximately 1525 data points. Secondly the IBM common stock closing price for the period from the 17th May 1961 until the 2nd November 1962, establishing 360 trading days as data points. The backpropagation network with the self-organising immune system algorithm produced an increase in profits of approximately 2% against the standard back propagation network, in the simulation, for the prediction of the IBM common stock price. However there was a slightly lower profit for the US dollar/Euro exchange rate prediction.
1 Introduction
The efficient market hypothesis states that a stock price, at any given time, reflects the state of the environment for that stock at that time. That is, the stock price depends on many variables, such as news events, other stock prices, exchange rates, etc. The hypothesis suggests that future trends are completely unpredictable and subject to random occurrences, thus making it infeasible to use historical data or financial information to produce above-average returns [9]. However, in reality, market responses are not always instantaneous. Markets may be slow to react due to poor human reaction time or other psychological factors associated with the human actors in the system. Therefore, in these circumstances, it is possible to predict financial data based on previous results [12]. There is a considerable body of evidence showing that markets do not work in a totally efficient manner. Much of the research shows that stock market returns are predictable by various methods, such as time series data analysis on financial and economic variables [11]. Up until now, stochastic methods based on the statistical analysis of the signals within the market system were used for the prediction of financial time series [1-4].
The nonlinear nature of financial data has inspired many researchers to use neural networks as a modelling approach [5], replacing explicit linearity-in-the-parameters dependencies with implicit semi-parametric models [6]. When the networks are trained on financial data with a multivariate function, they become minimum average function approximators [1]. Whilst ease of use and the capability to model dynamical data are appealing general features of typical neural networks, there are concerns about generalisation ability and parsimony. Cao and Francis [7] showed that when using multilayer perceptrons (MLP) trained with the backpropagation learning algorithm, the normalised mean square error (NMSE) will decrease on the validation data for the first few epochs and increase for the remaining epochs, indicating that MLP networks trained using the backpropagation algorithm suffer from overfitting. Hence, the use of neural networks for financial time series prediction encounters many problems, including [8]:
1. Different neural network models can perform significantly differently when trained and tested on the same data sets, because there are artefacts that influence the predictive ability of the models. Yet, it would be reasonable to suppose that well-founded models would produce similar inferences regardless of the detailed architecture of the particular neural network used.
2. For any given type of neural network, the network is sensitive to the topological choice and the size of the data set. Neural networks suffer from overfitting problems, and as a result researchers need to take extra care when selecting the network architecture, the learning parameters and the training data in order to achieve good generalisation, since this is critical when using neural networks for financial time series.
3. The nonstationary nature and the changing trending behaviour of financial time series, between oscillatory and monotonic trends, can prevent a single neural network from accurately forecasting an extended trading period even if it forecasts changes in the testing data well.
To improve the recognition and generalisation capability of backpropagation neural networks, Widyanto et al. [11] used a hidden layer inspired by an immune algorithm for the prediction of a sinusoidal signal and time-temperature-based food quality data. Their simulations indicated an improvement of 1/17 in the approximation error for the sinusoidal signal in comparison to backpropagation, and an 18% improvement in recognition capability for the time-temperature-based food quality data. In this paper, we propose the use of backpropagation neural networks with a hidden layer inspired by an immune algorithm for financial time series prediction. Two financial time series are used to test the performance of the network: the exchange rates between the US dollar and the Euro, and the IBM common stock closing price.
The remainder of the paper is organised as follows: Section 2 presents the backpropagation neural network with the hidden layer inspired by an immune algorithm. Section 3 outlines the financial time series used for the simulations together with the pre-processing steps and the metrics used for benchmarking the performance of the neural networks. Section 4 is dedicated to the simulation results and discussion, while Section 5 concludes the paper.
2 The Self-Organised Network Inspired by the Immune Algorithm (SONIA)
The immune algorithm, first introduced by Timmis [12], has attracted a lot of attention. Widyanto et al. [11] introduced a method to improve the recognition as well as the generalisation capability of backpropagation by suggesting a self-organised hidden layer inspired by the immune algorithm, which is called the SONIA network. The input vector and the hidden layer of the SONIA network are considered as the antigen and the recognition ball, respectively. The recognition ball, which is generated by the immune system, is used for hidden unit creation. In time series prediction, the recognition balls are used to address the overfitting problem. In the immune system, the recognition ball has a single epitope and many paratopes, where the epitope is attached to a B cell and the paratopes are attached to antigens, so that a single B cell represents several antigens. In the SONIA network, each hidden unit records the number of input vectors attached to it and a centre formed from those input vectors; to avoid the overfitting problem, each centre has a value which represents the strength of the connections between the input units and their corresponding hidden unit. The SONIA network consists of three layers: the input, self-organised and output layers [11]. In what follows, the dynamic equations of the SONIA network are considered. The ith input unit receives a normalized external input S_i, where i = 1, ..., N_I and N_I represents the number of inputs. The output of a hidden unit is determined by the Euclidean distance between the outputs of the input units and the connection strengths between the input units and the jth hidden unit. The use of the Euclidean distance enables the SONIA network to exploit locality information of the input data, which can improve the recognition capability [11]. The output of the jth hidden unit is determined as follows:
    X_{Hj} = f( \sum_{i=1}^{N_I} (w_{Hij} - x_{Ii})^2 ),   j = 1, ..., N_H,
where WHij represents the strength of the connection from the ith input unit to the jth hidden unit, and f is a nonlinear transfer function.
The outputs of the hidden units are the inputs to the output layer. The network output is determined as follows:

    y_k = g( \sum_{j=1}^{N_H} w_{ojk} X_{Hj} + b_{ok} ),   k = 1, ..., N_o,
where w_{ojk} represents the strength of the connection from the jth hidden unit to the kth output unit, b_{ok} is the bias associated with the kth output unit, and g is a nonlinear transfer function.
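As a concrete illustration (not from the original paper), a minimal NumPy sketch of this forward pass is given below; the choice of tanh for f and of a linear g is an assumption.

import numpy as np

def sonia_forward(x, W_H, W_o, b_o, f=np.tanh, g=lambda z: z):
    """x: (N_I,) input; W_H: (N_H, N_I) hidden centres; W_o: (N_H, N_o) weights."""
    # each hidden unit responds to the squared Euclidean distance between the
    # input and its centre, passed through the transfer function f
    x_H = f(np.sum((W_H - x) ** 2, axis=1))
    # each output unit combines the hidden outputs linearly, adds a bias,
    # and applies the transfer function g
    y = g(x_H @ W_o + b_o)
    return x_H, y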
2.1 Training the SONIA Network
In this subsection, the training algorithm of the SONIA network will be shown. Furthermore, a B cell construction based hidden unit creation will be described. For the immune algorithm, inside the recognition ball there is a single B cell which represents several antigens. In this case the hidden unit is considered as the recognition ball of the immune algorithm. Let d(t+1) represent the desired response of the network at time t+1. The error of the network at time t+1 is defined as:
e(t + 1) = d (t + 1) − y (t + 1)
(1)
The cost function of the network is the squared error between the original and the predicted value, that is:
    J(t + 1) = (1/2) [e(t + 1)]^2.    (2)
The aim of the learning algorithm is to minimise the squared error by a gradient descent procedure. Therefore, the change for any specified element woij of the weights matrix is determined according to the following equation:
    \Delta w_{oij}(t + 1) = -\eta \frac{\partial J(t + 1)}{\partial w_{oij}},    (3)
where i = 1, ..., N_H, j = 1, ..., N_o, and η is a positive real number representing the learning rate. The change for any specified element b_{oj} of the bias matrix is determined as follows:

    \Delta b_{oj}(t + 1) = -\eta \frac{\partial J(t + 1)}{\partial b_{oj}},    (4)
where j = 1, ..., N_o. The initial values of w_{oij} are set to zero and the initial values of b_{oj} are given randomly.
2.2 B Cell Construction Based Hidden Unit Creation
The purpose of hidden unit creation is to form clusters from input data and to determine the centroid of each cluster formed. These centroids are used to extract local
characteristics of the training data and enable the SONIA network to memorize the characteristics of the training data only, and not of the testing data; the overfitting problem can be prevented using this approach. Furthermore, the use of the Euclidean distance to measure the distance between the input data and these centroids enables the network to exploit local information of the input data, which may improve the recognition capability for pattern recognition problems. For each hidden unit, two values are recorded: the number of input vectors associated with the jth hidden unit, and the cluster centroid of those input vectors, which represents the strength of the connection between the input units and the jth hidden unit. Let (d_m, y_m) represent a given set of pairs of input and output to be learned. In the initialisation process, the first hidden unit (t_1, w_{H1}) is created with t_1 = 0, and w_{H1} is taken arbitrarily from the input vectors. The following procedure, derived from the immune algorithm [12], is used for the hidden layer creation; it is repeated until all inputs have found their corresponding hidden unit [11]:
1. For j = 1 to N_H, determine the distance between the mth input and the centroid of the jth hidden unit:

    dist_{mj} = \sum_{i=1}^{N_I} (y_{mi} - w_{Hij})^2
2. Select the shortest distance, c = arg min_j (dist_{mj}).
3. If the shortest distance dist_{mc} is below a stimulation level sl (where sl is selected between 0 and 1), the input has found its corresponding hidden unit, and t_c = t_c + 1, w_{Hc} = w_{Hc} + h d_{mc}, where h is a learning rate. Otherwise a new hidden unit is added with t_{N_H} = 0, the values of t_k for k = 1 to N_H are set to 0, and the procedure goes back to step 1.
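The following is a hedged sketch of the hidden-unit creation procedure above, assuming NumPy arrays. Reading the update w_Hc ← w_Hc + h·d_mc as moving the winning centre toward the current input is an interpretation, and the function name is illustrative.

import numpy as np

def create_hidden_units(inputs, sl=0.5, h=0.1):
    """inputs: (n_samples, N_I) normalised training vectors."""
    centres = [inputs[0].copy()]         # w_H1 taken from the input data
    counts = [0]                         # t_1 = 0
    all_assigned = False
    while not all_assigned:
        all_assigned = True
        for x in inputs:
            dists = [np.sum((x - c) ** 2) for c in centres]
            c_idx = int(np.argmin(dists))
            if dists[c_idx] < sl:        # input found its hidden unit
                counts[c_idx] += 1
                centres[c_idx] = centres[c_idx] + h * (x - centres[c_idx])
            else:                        # otherwise create a new hidden unit
                centres.append(x.copy())
                counts = [0] * len(centres)
                all_assigned = False
    return np.array(centres), counts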
3 Financial Time Series Prediction Using the SONIA Neural Network
The SONIA neural network was used to predict two financial time series: the daily exchange rates between the US Dollar and the Euro (US/EU) in the period between 3 January 2000 and 4 November 2005, which contain approximately 1525 data points, and the IBM common stock closing price dated from 17th May 1961 to 2nd November 1962, giving a total of 360 trading days, obtained from a historical database provided by DataStream [15]. These time series were fed to the neural networks to capture the underlying rules of the movement in the financial markets. Since financial time series are highly nonlinear and nonstationary signals, they need adequate pre-processing before being presented to a neural network. To smooth out the noise and to reduce the trend, the nonstationary raw data is usually transformed into a stationary series.
Table 1. Calculations for Input and Output Variables
Indicator          Calculation
Input variables:
  EMA15            p(i) - EMA_15(i)
  RDP-5            (p(i) - p(i-5)) / p(i-5) × 100
  RDP-10           (p(i) - p(i-10)) / p(i-10) × 100
  RDP-15           (p(i) - p(i-15)) / p(i-15) × 100
  RDP-20           (p(i) - p(i-20)) / p(i-20) × 100
Output variable:
  RDP+5            (p(i+5) - p(i)) / p(i) × 100, with p(i) = EMA_3(i)
EMA_n(i) is the n-day exponential moving average of the i-th day; p(i) is the closing price of the i-th day.
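A small pandas/NumPy sketch of these transformations follows (illustrative, not the authors' code); pandas' ewm(span=n) is used here as the n-day exponential moving average, which is an assumption about the exact EMA definition.

import pandas as pd

def make_rdp_features(close):
    """close: sequence or Series of daily closing prices."""
    p = pd.Series(close, dtype=float)
    ema15 = p.ewm(span=15, adjust=False).mean()
    ema3 = p.ewm(span=3, adjust=False).mean()
    feats = pd.DataFrame({
        "EMA15":  p - ema15,
        "RDP-5":  p.pct_change(5) * 100,
        "RDP-10": p.pct_change(10) * 100,
        "RDP-15": p.pct_change(15) * 100,
        "RDP-20": p.pct_change(20) * 100,
    })
    # target: 5-day-ahead relative difference of the 3-day smoothed price
    feats["RDP+5"] = (ema3.shift(-5) - ema3) / ema3 * 100
    return feats.dropna()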
The original closing prices were transformed into the five-day relative difference in percentage of price (RDP) [13]. The advantage of this transformation is that the distribution of the transformed data follows a normal distribution more closely. The input variables were determined from four lagged RDP values based on five-day periods (RDP-5, RDP-10, RDP-15, and RDP-20) and one transformed closing price (EMA15), which is obtained by subtracting a 15-day exponential moving average from the closing price. The optimal length of the moving average is not critical, but it should be longer than the forecasting horizon of five days [13]. Since the use of RDP to transform the original series may remove some useful information embedded in the data, EMA15 was used to retain the information contained in the original data. It has been argued in [14] that smoothing both input and output data by using either a simple or an exponential moving average is a good approach and can generally enhance the prediction performance. The forecast horizon is 5 days, so the output variable represents the price five days ahead. The output variable RDP+5 was obtained by first smoothing the closing price with a 3-day exponential moving average and is presented as a relative difference in percent for five days ahead. Because statistical information of the previous 20 trading days is used for the definition of the input vector, the transformed series is shorter than the original by 20 points. The calculations for the transformation of input and output variables are presented in Table 1. The RDP series were scaled using the standard minimum and maximum normalization method, which produces a new bounded dataset; one of the reasons for using data scaling is to handle outliers, which are sample values that occur outside the normal range [14]. In financial forecasting parlance, accuracy is related to profitability; therefore, it is important to consider out-of-sample profitability as well as forecasting accuracy. The prediction performance of our networks was evaluated using the financial and statistical metrics shown in Table 2.
Table 2. Performance Metrics and Their Calculations

Normalised Mean Square Error (NMSE):
    NMSE = \frac{1}{\sigma^2 n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2,
    \sigma^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2,
    \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i

Signal to Noise Ratio (SNR):
    SNR = 10 \log_{10}(sigma),
    sigma = \frac{m^2 n}{SSE},
    SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2,
    m = \max(y)

Directional Symmetry (DS):
    DS = \frac{1}{n} \sum_{i=1}^{n} d_i,
    d_i = 1 if (y_i - y_{i-1})(\hat{y}_i - \hat{y}_{i-1}) \ge 0, and 0 otherwise

Annualised Return (AR):
    AR = 252 \times \frac{1}{n} \sum_{i=1}^{n} R_i,
    R_i = y_i if y_i \hat{y}_i \ge 0, and -y_i otherwise

n is the total number of data patterns; y and \hat{y} represent the actual and predicted output values.
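A hedged NumPy sketch of the four metrics in Table 2 is given below; the function names are illustrative, and the directional symmetry is computed over consecutive pairs.

import numpy as np

def nmse(y, yhat):
    sigma2 = np.var(y, ddof=1)                    # (1/(n-1)) sum (y_i - ybar)^2
    return np.mean((y - yhat) ** 2) / sigma2

def snr(y, yhat):
    sse = np.sum((y - yhat) ** 2)
    return 10 * np.log10(np.max(y) ** 2 * len(y) / sse)

def directional_symmetry(y, yhat):
    d = (np.diff(y) * np.diff(yhat)) >= 0         # same sign of consecutive moves
    return d.mean()

def annualised_return(y, yhat):
    r = np.where(y * yhat >= 0, y, -y)            # gain when the sign is predicted
    return 252 * r.mean()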
4 Simulation Results
This work is concerned with financial time series prediction, so throughout the extensive experiments conducted the primary goal was not to assess the predictive ability of the SONIA neural networks against the backpropagation models, but rather to determine the profitable value contained in the network. As a result, the focus was on how the network generates profits: the neural network structure which
produces the highest percentage of annualised return on out-of-sample data is considered the best model. Table 3 displays the average results of 20 simulations obtained on unseen data from the neural networks, while Fig. 1 shows part of the prediction of the IBM common stock closing price and the US/EU exchange rate time series on out-of-sample data. As can be seen in Table 3, the average performance of the SONIA network in terms of annualised return demonstrated that using the network to predict the IBM common stock closing price resulted in better profit in comparison to the MLP network, with an average increase of 1.72% using 11 hidden units. In the MLP network, the objective of the backpropagation is to minimise the error over the whole dataset, while for the SONIA network the learning concentrates on the local properties of the signal: the aim of the network is to adapt to the local properties of the observed signal using the self-organised hidden layer inspired by the immune algorithm. Thus the SONIA networks have a more detailed mapping of the underlying structure within the data and are able to respond more readily to any greater changes or regime shifts which are common in non-stationary data. This accounts for the observed better performance of the SONIA networks, in comparison to the MLP networks, when used to predict the IBM common stock closing price.

Table 3. The Average Results Over 20 Simulations for the MLP and the SONIA Neural Networks

US/EU Exchange Rate              MLP       SONIA (20 hidden units)
  AR (%)                         87.88     87.24
  DS (%)                         65.69     64.20
  NMSE                           0.2375    0.2628
  SNR                            23.81     23.37

IBM Common Stock Closing Price   MLP       SONIA (11 hidden units)
  AR (%)                         88.54     90.26
  DS (%)                         63.53     64.70
  NMSE                           0.3522    0.384
  SNR                            21.45     21.05
For the prediction of the US Dollar/Euro exchange rate, the simulation showed that the MLP network fared slightly better than the SONIA network, with a 0.64% increase in the annualised return. For the MLP network, extensive tests were carried out beforehand to determine the number of hidden units (between 3 and 10) that delivered the best network performance, whereas for the SONIA network the optimum number of hidden units was decided by the system itself. In attempting to understand why the SONIA network failed to generate better profit than the MLP network, the properties of the US dollar and Euro exchange rate time series were studied. The dataset that was used has 59.51% small changes containing 43.66% of the potential profit and 40.49% of higher-value
changes containing 56.34% of the profit. This means that there is a large percentage of potential return within the small changes. As the purpose of the MLP network is to minimise the error over the whole dataset, and as it works better when the data contains more potential return within small changes, the MLP network can perform better than the SONIA network on the annualised return when used to predict the dynamic US/EU exchange rate.

Fig. 1. (a) Part of the prediction of the IBM common stock closing price using the SONIA network in the period from 17th May 1961 to 2nd November 1962. (b) Part of the prediction of the daily exchange rate between the US dollar and the Euro using the SONIA network in the period from 3rd January 2000 to 7th November 2005.
5 Conclusion In this paper, a novel application of the SONIA neural network for financial time series prediction is presented. Two time series are used in these simulations which are the daily exchange rates between the US Dollar and the Euro in the period between 3rd January 2000 to 4th November 2005, which contain approximately 1525 data points, and the IBM common stock closing price dated from 17th May 1961 to 2nd November 1962, giving a total of 360 trading days. The simulation results showed that the SONIA network produced profit from the predictions based on the two time series.
References 1. Sitte, R., Sitte, J.: Analysis of the Prediction of Time Delay Neural Networks Applied to the S&P 500 Time Series”, IEEE Transactions on Systems, Man and Cybernetics 30 (2000) 568-57 2. Lindemann, A.., Dunis, C.L., Lisboa, P.: Level Estimation, Classification and Probability Distribution Architectures for Trading the EUR/USD Exchange Rate, Neural Computing & Applications, forthcoming, 2005
3. Lindemann, A.., Dunis, C.L., Lisboa, P.: Probability Distributions, Trading Strategies and Leverage: An Application of Gaussian Mixture Models, Journal of Forecasting, 23 ( 8) (2004) 559-585 4. Dunis, C., Williams, M.: Applications of Advanced Regression Analysis for Trading and Investment”, in C. Dunis, J. Laws and P. Naïm [eds.], Applied Quantitative Methods for Trading and Investment, John Wiley, Chichester, 2003 5. Zhang, G.Q., Michael, Y.H.: Neural network forecasting of the British Pound/U.S. Dollar Exchange Rate, Omega 26 ( 4) (1998) 495–506. 6. Haykin, S.: Neural Networks: a Comprehensive Foundation. Englewood Cliffs, NJ: Prentice-Hall, 1999. 7. Cao, L.J., Francis, E.H.T.: Support Vector Machine With Adaptive Parameters in Financial Time Series Forecasting, IEEE Transactions on Neural Network 14 (6) (2003) 15061518 8. Versace, M., Bhatt, R., Hinds, O., Shiffer, M.: Predicting the Exchange Traded Fund DIA with a Combination of Genetic Algorithms and Neural Networks”, Expert Systems with Applications 27 (2004) 417-425 9. Knowles, A.: Higher Order and pipelined Networks for Time Series Prediction of Currency Exchange Rates, MPhil, Liverpool John Moores University, 2006 10. Fama, E.F., French, K.R.: Business Conditions and Expected Returns on Stocks and Bonds, J. Financial Econ 25 (1989) 23–49 11. Widyanto, M.R., Nobuhara, H., Kawamoto, K., Hirota, K., Kusumoputro, B.: Improving Recognition and Generalization Capability of Back-Propagation NN using Self-Organized Network Inspired by Immune Algorithm (SONIA), Applied Soft computing 6 (2005) 72-84 12. Timmis, J.I.: Artificial Immune Systems: a Novel Data Analysis Technique Inspire by the Immune Network Theory, Ph.D. Dissertation, University of Wales, Aberystwyth, 2001 13. THOMASON, M.: The practitioner method and tools, Journal of Computational Intelligence in Finance 7 ( 3) (1999) 36-45 14. Kaastra, I., Boyd, M.: Designing a neural network for forecasting financial and economic time series, Neurocomputing 10 (1996) 215-236 15. R.J. Hyndman (n.d.), “Time Series Data Library,” downloaded from: http://www-personal.buseco.monash.edu.au/~hyndman/TSDL/. Original source from: McCleary & Hay, Applied Time Series Analysis for the Social Sciences, Sage Publications, 1980.
Edge Detection Combined Entropy Threshold and Self-Organizing Map (SOM) Kun Wang, Liqun Gao, Zhaoyu Pian, Li Guo, and Jianhua Wu College of Information Science & Engineering, Northeastern University, P.O. Box 135, 110004, Shenyang, China {Kun.Wang,yogo w}@163.com http://www.springer.com/lncs
Abstract. An edge detection method combining image entropy and the Self-Organizing Map (SOM) is proposed in this paper. First, according to information theory, image entropy is used to separate the smooth regions from the regions where the gray level changes abruptly. Then we transform the gray-level image into an ideal binary pattern of pixels. We define six classes of edges and six edge prototype vectors. These edge prototype vectors are fed into the input layer of the Self-Organizing Map (SOM). By classifying the edge type through this network, the edge image is obtained. Finally, the speckle edges are discarded from the edge image. Experimental results show that the method gains a better edge image compared with the Canny edge detection method.
1
Introduction
Edges are one of the most important visual clues for interpreting images [1]. Most previous edge detection techniques used first-order derivative operators [2] such as the Sobel, Prewitt and Roberts edge operators. The Laplacian operator is a second-order derivative operator for functions of two dimensions and is used to detect edges at the locations of zero crossings. However, these zero-crossing points are not necessarily edge points and can only be confirmed as edge points by further detection. Another gradient operator is the Canny operator, which is used to determine a class of optimal filters for different types of edges [3]. In this paper, we first compute the entropy of each 3 × 3 neighborhood. We then select a threshold on these entropies: the part larger than the threshold has sharp variation of gray levels, and the part less than the threshold has gentle change. Second, we use a 3 × 3 ideal binary pattern of pixels to determine the edge magnitude and direction and find the 3 × 3 ideal binary pattern of each pixel to classify the edge. Lastly, we use a self-organizing map (SOM) to obtain the edges. When constructing the network, we consider the difference between edge points and noise points, so the neural-network-based detection has a better ability of noise rejection.
2
Image Entropy
According to Shannon information theory, the larger the entropy, the more information an image carries [4]; for a two-dimensional image, the clearest image has the largest entropy. Suppose the gray-level range of the image fk(x, y) is [0, G] and construct the gray-level histogram of the image. The gray-level entropy of the image is then defined as

Ek = − Σ_{g=0}^{G} Pk(g) log Pk(g),  k = 1, 2, · · · , M    (1)

where Pk(g) is the probability of gray level g in the k-th image within some window ω. The image entropy describes the degree of dispersion of the gray-scale distribution of the pixels in the image. In this paper we use this property of image entropy to find the regions that contain edges, which decreases the amount of calculation.
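The following sketch illustrates how the windowed entropy of Eq. (1) might be computed; the window size, the number of gray levels and the function names are our own assumptions and are not taken from the paper.

import numpy as np

def window_entropy(window, num_levels=256):
    # Histogram of gray levels in the window, normalized to probabilities.
    hist, _ = np.histogram(window, bins=num_levels, range=(0, num_levels))
    p = hist / hist.sum()
    p = p[p > 0]                      # ignore empty bins so the log is defined
    return -np.sum(p * np.log2(p))    # Eq. (1) for one window

def entropy_map(image, win=3, num_levels=256):
    # Entropy of every win x win neighborhood (the image border is skipped).
    h, w = image.shape
    out = np.zeros((h, w))
    r = win // 2
    for y in range(r, h - r):
        for x in range(r, w - r):
            out[y, x] = window_entropy(image[y-r:y+r+1, x-r:x+r+1], num_levels)
    return out

Pixels whose neighborhood entropy exceeds the chosen threshold are treated as candidate edge regions; the rest are considered smooth.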
3
Edge Classification
We use a 3×3 ideal binary pattern of pixels to determine the edge classification. We first take the mean intensity of the 3×3 pixel block as a threshold and then, using the binary values 0 and 1, divide the pixels of the block into two groups [5]: 1 is assigned to the pixels that belong to the same group as the center pixel and 0 to all the others. This method can reduce smoothing across edges as well as the computational complexity. Fig. 1 shows an example of this process.
For example, the intensity block
7 15 18
5 17 20
9  8  3
has mean ≈ 11.3 and is converted into the binary pattern
0 1 1
0 1 1
0 0 0
Fig. 1. Example of a 3 × 3 ideal binary pattern of pixels
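A small sketch of this binarization step (Fig. 1) might look as follows; the function name and the use of NumPy are our own assumptions.

import numpy as np

def ideal_binary_pattern(block):
    """Binarize a 3x3 intensity block by thresholding at its mean (Fig. 1)."""
    return (block > block.mean()).astype(int)

# Example from Fig. 1:
# ideal_binary_pattern(np.array([[7, 15, 18], [5, 17, 20], [9, 8, 3]]))
# -> [[0, 1, 1], [0, 1, 1], [0, 0, 0]]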
Having obtained the 3×3 ideal binary pattern of pixels as above, the next step is to determine the direction [6,7]. Fig. 2 shows the 3×3 neighborhood of pixels around the center pixel p5 as well as the four edge directions that may appear. The directional summed magnitudes of the binary differences between p5 and its neighbors, designated d1, d2, d3 and d4 for directions 1, 2, 3 and 4 shown in Fig. 2, are calculated by

d1 = |p4 − p5| + |p6 − p5|,  d2 = |p3 − p5| + |p7 − p5|
d3 = |p2 − p5| + |p8 − p5|,  d4 = |p1 − p5| + |p9 − p5|    (2)
Fig. 2. Pixels p1–p9 and the four edge directions (1–4) in the 3 × 3 neighborhood of the center pixel p5
For each pixel of the input image that is not on the outer boundary of the image, we define its four-dimensional feature vector over the four directions of its neighborhood as x = (d1, d2, d3, d4). According to this feature vector we classify pixels into 6 classes. Four typical neighborhood situations are shown in Fig. 3. The summed magnitudes of differences of class 1 are "0" in direction 1 and "1" in directions 2, 3 and 4; those of class 2 are "0" in direction 2 and "1" in directions 1, 3 and 4. Likewise we can
obtain the summed magnitudes of differences of class 3 and class 4. The background class is used for pixels whose neighborhood has "0" in all four directions; the speckle edge class is used for pixels whose neighborhood has summed magnitudes of differences of "1" in all four directions. Any neighborhood situation of a pixel thus determines a feature vector over the four directions shown in Fig. 2. We construct 6 prototype vectors C0, . . . , C5 to be the respective centers of the 6 classes; their construction is listed in Table 1.
Fig. 3. Typical situations for the edge classes: class 1 (0 1 1 1), class 2 (1 0 1 1), class 3 (1 1 0 1), class 4 (1 1 1 0)
Table 1. The edge classification and the corresponding prototype vectors
Class 0 (background)    C0 = (1 1 1 1)
Class 1 (edge)          C1 = (0 1 1 1)
Class 2 (edge)          C2 = (1 0 1 1)
Class 3 (edge)          C3 = (1 1 0 1)
Class 4 (edge)          C4 = (1 1 1 0)
Class 5 (speckle edge)  C5 = (× × × ×)
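Continuing the sketch, the directional feature vector of Eq. (2) and a nearest-prototype classification against Table 1 could be computed as follows; the row-major numbering of p1–p9, the clipping of each di to {0, 1}, and the omission of the speckle prototype are assumptions on our part.

def direction_vector(pattern):
    """Feature vector (d1, d2, d3, d4) of Eq. (2) for a 3x3 binary pattern."""
    p = [v for row in pattern for v in row]        # p1..p9 assumed row-major
    d1 = abs(p[3] - p[4]) + abs(p[5] - p[4])       # direction 1: p4, p6
    d2 = abs(p[2] - p[4]) + abs(p[6] - p[4])       # direction 2: p3, p7
    d3 = abs(p[1] - p[4]) + abs(p[7] - p[4])       # direction 3: p2, p8
    d4 = abs(p[0] - p[4]) + abs(p[8] - p[4])       # direction 4: p1, p9
    return tuple(min(d, 1) for d in (d1, d2, d3, d4))   # clip to {0, 1} (assumption)

# Prototype vectors C0-C4 as listed in Table 1; class 5 (speckle edge) is marked
# with 'x' entries in the table and is therefore omitted from this sketch.
PROTOTYPES = {0: (1, 1, 1, 1), 1: (0, 1, 1, 1), 2: (1, 0, 1, 1),
              3: (1, 1, 0, 1), 4: (1, 1, 1, 0)}

def nearest_class(x, prototypes=PROTOTYPES):
    """Return the class whose prototype is closest (Euclidean) to x."""
    return min(prototypes, key=lambda c: sum((a - b) ** 2
                                             for a, b in zip(x, prototypes[c])))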
4
Self-Organizing Map (SOM)
Kohonen's Self-Organizing Map (SOM) provides us with classification rules. The SOM combines competitive learning with dimensionality reduction by smoothing clusters with respect to an a priori grid. With the SOM, clustering is generated by having several units compete for the (training) data: the unit whose weight vector is closest to the data becomes the winner and moves even closer to the input data, and the weights of the winner as well as those of its nearest neighbors are adjusted. This is called the Winner-Takes-All (WTA) approach. The organization is said to form a SOM map because similar inputs are expected to be placed close to each other [8]. The search and organization of the representation vectors on the map can be described by the following regressive equations, where t = 1, 2, · · · is the step index, x is an observation, mi(t) is the representation vector of node i at step t, c is the winner index, and hc(x),i is the neighborhood updating function [9]. When the SOM receives an observation vector x, the input is compared with the representation vectors of all map nodes. The node whose vector has the smallest distance to the input vector is chosen as the winner (equation (3)). The training process attracts all representation vectors toward the input vector, but the attraction differs between neurons: the winner is attracted most strongly toward the input vector, while remote nodes are affected less in proportion to their distance to the winner (equation (4)) [10]. The attraction strength is controlled by the neighborhood function. After many training iterations the map becomes ordered, and an ordered SOM can be used as a classifier based on equation (3).

‖x − mc(t)‖ ≤ ‖x − mi(t)‖,  ∀i    (3)

mi(t + 1) = mi(t) + hc(x),i · (x − mi(t))    (4)
The simplified SOM model used in this paper is shown in Fig. 4. There are nine neurons (p1, p2, · · · , p9) in the input layer, representing the pixels of the 3×3 neighborhood (the pixel and its 8-neighborhood). There are six neurons in the competitive layer, corresponding to the six edge classes (C0 indicates background, C1 class 1 edge, C2 class 2 edge, C3 class 3 edge, C4 class 4 edge, and C5 speckle edge). 4.1
The Study Process of SOM
After confirming the input and output of the network, we can classify via the SOM. The study process is as follows:
Step 1: Initialization. Assign small connection weights between the input and output neurons. The set Sj is the 'neighboring neuron' set of output neuron j; Sj(0) denotes the 'neighboring neurons' of the j-th neuron at t = 0, and Sj(t) denotes the 'neighboring neurons' at time t. Sj(t) decreases as time increases.
Step 2: Input a new vector X = {xi(t)}, i = 1, 2, · · · , 9.
Fig. 4. Simplified model of the SOM: nine input-layer neurons p1–p9 are connected to six output-layer neurons C0–C5 through a winner-takes-all competitive mechanism
Step 3: Compute the Euclidean distance dj between the input vector and each output neuron j:

dj = ‖X − Wj‖ = ( Σ_{i=1}^{N} [xi(t) − ωij(t)]² )^{1/2}    (5)

where ωij is a connection weight of the SOM, and select the neuron k with the minimum distance as the winner: dk = minj(dj).
Step 4: Define a neighborhood Sk(t) around the winner.
Step 5: Update the weights of the winner neuron and of the nodes in its neighborhood using the following Kohonen rule:

ωij(t + 1) = ωij(t) + η(t)[xi(t) − ωij(t)]    (6)

where η(t) is a gain that decreases to zero over time, for example

η(t) = 1/t  or  η(t) = 0.2(1 − t/10000)    (7)

Step 6: According to the winner-takes-all rule, compute the output ok:

ok = f(minj ‖X − Wj‖)    (8)

where f(·) is a 0–1 function.
Step 7: Repeat the above learning process for each new input vector.
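A compact sketch of this learning process is given below; the small map size, the learning-rate schedule of Eq. (7) and the treatment of the neighborhood as the two adjacent output units are simplifying assumptions on our part.

import numpy as np

def train_edge_som(samples, n_outputs=6, epochs=10, seed=0):
    """Competitive training of the 9-input / 6-output map (Eqs. (5)-(7)).

    samples: array of shape (num_samples, 9) with binary neighborhood patterns.
    """
    rng = np.random.default_rng(seed)
    W = rng.random((n_outputs, 9)) * 0.1            # small initial weights (Step 1)
    t = 1
    for _ in range(epochs):
        for x in samples:                           # Step 2: present an input
            d = np.linalg.norm(W - x, axis=1)       # Eq. (5): distances to all units
            k = int(np.argmin(d))                   # winner (Step 3)
            eta = 0.2 * max(1 - t / 10000, 0.0)     # Eq. (7), second variant
            # Steps 4-5: update the winner and its immediate neighbors.
            for j in (k - 1, k, k + 1):
                if 0 <= j < n_outputs:
                    W[j] += eta * (x - W[j])        # Eq. (6)
            t += 1
    return W

def classify(W, x):
    # Step 6: winner-takes-all output for a new 9-dimensional pattern.
    return int(np.argmin(np.linalg.norm(W - x, axis=1)))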
4.2 Despeckling
For each pixel in the edge image, if it is an isolated single or double speckle, the pixel is changed to black, i.e., removed from the edge image.
5
Image Experiments
We tested the proposed method on the cameraman image and, for comparison, used the Canny edge detector, whose parameters [ls, hs] (low and high thresholds) were tuned manually. As can be seen from Fig. 5(b), (c) and (d), when [ls, hs] is set to [0.01, 0.5] the result lacks many details and is the worst; when the thresholds are tuned to [0.01, 0.05] the Canny detector gives its best result. Comparing our method with the best Canny result, Fig. 5(d) shows that our method detects the meaningful edges (e.g., the tower and the man's face) in complex scenes without generating too many redundant edges (e.g., in the ground area). Fig. 6(a) shows the source image corrupted with salt & pepper noise. The detection results for different Canny parameters are shown in Fig. 6(b) and (c), and Fig. 6(d) shows the result of our method. As can be seen from these figures, under noisy conditions our method can still obtain a better edge image, although some noise remains in it; the noise has little effect on recognizing the image contour and many details. In contrast, the edge image obtained with the Canny method contains many false edges caused by the noise, so that some details cannot be recognized and the contour, especially of the distant scene, is difficult to discern.
Fig. 5. Detection results for the noise-free image: (a) original image, (b) Canny [0.01, 0.5], (c) Canny [0.01, 0.05], (d) our result
Fig. 6. Detection results for the noisy image: (a) original image with salt & pepper noise, (b) Canny [0.01, 0.02], (c) Canny [0.01, 0.05], (d) our result
6
Conclusions
In this paper, we first find the regions where the gray level changes abruptly according to the entropy from information theory; this step reduces the subsequent computation. We then transform the gray-level image into a binary pattern of pixels within each 3×3 neighborhood, define six edge classes and six edge prototype vectors, and obtain the edge image by classifying the edge type with the Self-Organizing Map (SOM). Finally, we discard the speckles in the edge image. Experimental results show that the proposed edge detection method is superior to the Canny method with different parameter settings, and that under noisy conditions it performs better than the Canny method.
References
1. Haddon, J.F., Boyce, J.F.: Image Segmentation by Unifying Region and Boundary Information. IEEE Trans. on Pattern Analysis and Machine Intelligence 12 (1990)
2. Fan, J.P., Yau, D.K.Y., Elmagarmid, A.K., Aref, W.G.: Automatic Image Segmentation by Integrating Color-Edge Extraction and Seeded Region Growing. IEEE Trans. on Image Processing 10(10) (2001)
3. Canny, J.: A Computational Approach to Edge Detection. IEEE Trans. Pattern Anal. Mach. Intell. 8(6) (1986) 679-687
4. Xu, X.H., Zhang, A.: Entropic Thresholding Method Based on Particle Swarm Optimization for Image Segmentation. Computer Engineering and Applications 10 (2006) 8-11
5. Kim, D.S., Lee, W.H., Kweon, I.S.: Automatic Edge Detection Using 3×3 Ideal Binary Pixel Patterns and Fuzzy-Based Edge Thresholding. Pattern Recognition Letters 25 (2004) 101-106
6. Liang, L.R., Looney, C.G.: Competitive Fuzzy Edge Detection. Applied Soft Computing 3 (2003) 123-137
7. Wang, R., Gao, L.Q., Yang, S., Liu, Y.C.: An Edge Detection Method by Combining Fuzzy Logic and Neural Network. Proceedings of the Fourth International Conference on Machine Learning and Cybernetics, Guangzhou (2005) 4539-4543
8. Dokur, Z.: A Unified Framework for Image Compression and Segmentation by Using an Incremental Neural Network. Expert Systems with Applications 5 (2006) 1-9
9. Kohonen, T.: Self-Organizing Maps. Springer-Verlag (1995)
10. Haykin, S.: Neural Networks: A Comprehensive Foundation, 2nd Edition. Prentice Hall (1999)
Hierarchical SOMs: Segmentation of Cell-Migration Images Chaoxin Zheng1, Khurshid Ahmad1, Aideen Long2, Yuri Volkov2, Anthony Davies2, and Dermot Kelleher2 1
Department of Computer Science, O’Reilly Institute, Trinity College Dublin, Dublin 2, Ireland {chaoxin.zheng,khurshid.ahmad}@cs.tcd.ie 2 Department of Clinical Medicine, Trinity College & Dublin Molecular Medicine Centre, St. James’s Street, Dublin 8, Ireland {along,yvolkov,amitche,dermot.kelleher}@tcd.ie
Abstract. The application of hierarchical self-organizing maps (HSOM) to the segmentation of cell migration images, obtained during high-content screening in molecular medicine, is described. The segmentation is critical to our larger project of developing methods for the automatic annotation of cell migration images. The HSOM appears to perform better than the conventional computer-vision methods of histogram thresholding and edge detection, and than the newer techniques involving single-layer SOMs. However, the HSOM techniques have to be complemented by region-based techniques to improve the quality of the segmented images.
1 Introduction
Self-organizing maps (SOM) have been used in creating maps for organizing large collections of documents [6] and in organizing image collections, especially biomedical images. There are claims that the SOM is good for (bio-medical) image segmentation as it does not depend on the 'presumptive heuristic rules derived anatomical meta-knowledge of how a classification decision should be made' [20]. Furthermore, the use of SOM allows the preservation of knowledge obtained during the segmentation of a 'prototypical reference database' for subsequently segmenting hitherto unseen images. HSOM's are preferred to single-layer SOM's because the single-layer configuration requires pre-knowledge of the number of desired segments for determining the number of competitive units in the output layer [2]. In the context of image segmentation a hierarchical SOM facilitates 'data abstraction', and the segmented image is the final domain abstraction of the input image [5]. More complex images, for instance bacteria in a culture, have been segmented using Tree-SOM [8]. Automatic segmentation of fluorescent micrographs, using both supervised and unsupervised networks, has also been demonstrated [11]. In this paper we describe the initial results of a collaborative project between molecular medicine experts and computer scientists for automatically annotating images of migrating cells. In cell migration studies one is talking about hundreds of images produced for one experimental condition. These are highly specialized images that
record the rearrangement and turnover of the constituents of a specialized cell under different experimental conditions. We report on a study carried out to define the scope of the collaborative project. The cell migration images comprise regions of specific interest to molecular medicine experts. Various techniques are currently being used to segment these regions with various degrees of success; we present a comparison of established segmentation techniques with SOM based segmentation. Once the candidate objects have been identified, we will use a SOM-based multinet system to automatically annotate the image, and possibly its constituent objects, using the descriptions that are collateral to an image [1]. The terminology and ontology for an annotation thesaurus has been derived from a random collection of papers in cell biology. However, the focus of this paper is on image segmentation based on SOM.
2 The Problem Domain: Cell Migration and High Content Analysis
One key emerging challenge in molecular medicine is to accurately express subtle differences in the images of cells obtained from fluorescent and/or electron microscopes. The important class of images that concerns us here is that of cells responding and behaving under different experimental conditions, for example drug perturbations. The record of response and behavior may run into thousands of images, frequently complicating their coherent description and subsequent interpretation by scientists working in different specialized areas. These images are to be categorized, or screened, and the categories help the molecular and systems biologists to make a range of decisions covering diagnosis, therapy and after care (as well as novel deductions as to the molecular mechanisms of migration in basic research). Of critical interest to us is cell migration, especially T-cell migration, which is initiated by our immune systems to 'patrol' the body for antigens and enter sites within the body that have been inflamed [10, 19]. Dissection of the molecular mechanisms of T cell migration will facilitate the discovery and design of novel anti-inflammatory therapies. Cell migration represents one of the most complex types of biological responses observed in living cells, being accompanied by multiple repeated changes in overall cell morphology and underlying re-arrangements in internal migration-supporting systems. Techniques of high content analysis are being developed to support what is otherwise a manual and expensive process of studying these multiple interactions. The cell migration images are captured in real time and stills (more usually images of fixed cells) are made for analysis; we are interested in the degree and type of the active movement of cells, the so-called cell motility. The cells comprise three polymer systems (the actin filaments, intermediate filaments and microtubules) that make up the cytoskeleton. The cytoskeleton undergoes turnovers and re-arrangements in response to probes and signals from other cells; these rearrangements and the molecular turnover of cytoskeletal elements are the driving force in the process of cell motility. Under varied conditions (with different fluorescent materials and antibodies, for instance) the different polymer systems can be viewed and descriptive morphological inferences can be made. The key here is that there are significant changes in shape and adhesion features of the three polymer systems captured using high-magnification microscopes. There is a requirement for a
system for facilitating the recognition of shape changes and for facilitating the segmentation of cells that appear in clusters literally glued together [10]. Many commercial high content screening systems for capturing, analyzing and storing microscopy images are available [7]; they offer visual feature analysis and focus on segmentation based on shape and texture. However, a 'robust and automated software solution for screening a great variety of cell-based assays is still lacking' [23]. Initial analysis of the visual information is also frequently hindered by the complexity of the recorded patterns. Indeed, it appears that a variety of domain-specific knowledge is required in order to interpret the cell migration images: Zhou et al. have used heuristics that combine image features and their own in-depth knowledge of screening.
Table 1. Heuristics used in cell segmentation of images [23]
Visual Feature | Feature Value      | Inferred cell type                                | Wavelength used in analysis
Intensity      | Low to invisible   | Interphase cells                                  | Blue (DNA)
Shape          | Round              | Interphase cells with monopolar spindles          | Red (Actin) or Green (Microtubule)
Shape          | Bar shaped         | Metaphase or anaphase cells with bi-polar spindle | Blue (DNA)
Shape          | Oval or bar shaped | Anaphase cells                                    | Red (Actin)
From the very inception of HCS, the stress has been on the quality of image segmentation results [3]: all subsequent steps rely on this quality. Traditional techniques for image segmentation, including histogram-based, region-based, edge-based and graph-based techniques [12] and hybrids of these techniques, involve domain-specific heuristics, and systems using them have no capability of learning and generalization. Hence our emphasis is on SOM's and other neural computing techniques.
3 Neural Network for Image Analysis and Annotation
The staining of cellular images requires the segmentation of colour images. It has been argued that 'In the segmentation of colour images, unsupervised learning is preferred to supervised learning because the latter requires a set of training samples, which may not always be available' [21]. Hierarchical SOM's have been recommended for segmenting colour images for two reasons: first, HSOM's are faster than single-layer SOM's, and second, a large-dimension SOM will produce many classes, causing the image to be over-segmented [5]. However, there is an emphasis on pre-processing or post-processing the input image or the segmented image using conventional hybrid techniques for improving the results. A brief note on hybrid segmentation follows.
3.1 Hybrid-Based Image Segmentation
The most common hybrid technique combines region-based and edge-based segmentation [16]: a splitting-and-merging technique generates regions, and the region boundaries are moved towards more accurate ones by examining the edge information. A more sophisticated technique was developed in [12], where a watershed was used to pre-segment the images with over-segmented results, and a graph-based technique was then applied to correct the over-segmentation by merging similar regions together. Here a hybrid segmentation technique is proposed that combines the hierarchical SOM with histogram-based and region-based techniques; this technique has been used to segment cell-migration images. The HSOM learns the feature vectors of pixels and groups similar pixels into preliminary regions. These regions are further processed by the region-based and histogram-based techniques to form the actual cell regions.
3.2 Image Segmentation Using HSOM
Pixel Feature Selection. The selection of pixel features typically involves deciding between grey-scale and colour images.
Fig. 1. Illustration of constructing the feature vectors, where a stands for a pixel, a′ indicates the intensity of the pixel, a′′ is the median intensity, and m, n are the dimensions of the image
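A sketch of this feature construction is given below; the function name and the use of NumPy are our own choices, not part of the original system.

import numpy as np

def pixel_feature_vectors(image):
    """Build the (intensity, 3x3-median) feature pair for every pixel (Fig. 1)."""
    h, w = image.shape
    features = np.zeros((h, w, 2), dtype=float)
    features[:, :, 0] = image                      # a': pixel intensity
    for y in range(h):
        for x in range(w):
            y0, y1 = max(0, y - 1), min(h, y + 2)  # clip the 3x3 window at the border
            x0, x1 = max(0, x - 1), min(w, x + 2)
            features[y, x, 1] = np.median(image[y0:y1, x0:x1])   # a'': median
    return features.reshape(-1, 2)                 # one 2-D vector per pixel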
For the cell migration images, the grey-scale pixel features, comprising the intensity and its median over a 3×3 kernel, produce better segmentation, and more quickly, than colour pixel features based on the RGB space. We have, nevertheless, included a comparison between grey-scale pixel features and colour features for segmenting an image. The median value was selected on the basis of earlier experimentation on macroscopic objects, where the choice of the 3×3 kernel appeared to reduce errors produced by noise [22]. Figure 1 shows the procedure of constructing the feature vectors from the grey-scale images for the hierarchical SOM.
Architecture of Hierarchical SOM. A hierarchical SOM is a layered arrangement of nodes where an antecedent layer provides input to the next layer. The size of the layers is reduced in powers of 2: if the input vector is mapped onto a 2^n×2^n output plane, then the winners in this layer act as input to the next layer comprising 2^(n-1)×2^(n-1) neurons, and so on, ultimately down to a 2×2 layer. The weight change algorithm is the same for all the layers:

Δwi(t) = αt hi(t) d(i) [x(t) − wi(t)]    (1)

where w is the weight; i indicates the i-th node; α stands for the learning rate; h denotes the neighbourhood function; x is the input vector; t means the t-th iteration of the training; and d(i) stands for the distance function, which is 1 for the best matching unit (BMU) or winner node, 0.5 for the nodes located within the distance of the neighbourhood function, and 0 for the other nodes. The neighbourhood function hi(t) is initially half the number of rows or columns in the map (8, 4, 2 and 1, respectively), decreasing exponentially to 1 at the end of training.
Training and Testing of Hierarchical SOM. During training, an input vector containing the features of a pixel is presented to the network. The best matching unit, which has the smallest distance to the current input, is located, and the weights of the nodes in the current hierarchical SOM are updated according to Equation 1. The training starts from the pixel at the top-left and ends at the bottom-right, and the procedure was repeated 10 times. After training of the hierarchical SOM, the same images are again used for testing: the BMU for every pixel is found, and the image segmentation is accomplished by merging together the neighbouring pixels that belong to the same BMU.
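The following sketch outlines such a training pass over the hierarchy; the layer sizes, the constant learning rate, the one-dimensional neighbourhood used for d(i), and the function names are illustrative assumptions and do not reproduce the authors' exact implementation.

import numpy as np

def train_hsom(features, layer_sizes=(16, 8, 4, 2), epochs=10, alpha=0.5, seed=0):
    """Train a stack of SOM layers; the winners of one layer feed the next."""
    rng = np.random.default_rng(seed)
    inputs = features                               # (num_pixels, dim)
    layers = []
    for size in layer_sizes:
        n_nodes = size * size
        W = rng.random((n_nodes, inputs.shape[1]))
        for _ in range(epochs):
            for x in inputs:
                d = np.linalg.norm(W - x, axis=1)
                bmu = int(np.argmin(d))
                # d(i) from Eq. (1): 1 for the BMU, 0.5 for its neighbours, else 0.
                for i in range(n_nodes):
                    if i == bmu:
                        di = 1.0
                    elif abs(i - bmu) == 1:         # simplified 1-D neighbourhood
                        di = 0.5
                    else:
                        continue
                    W[i] += alpha * di * (x - W[i])
        layers.append(W)
        # The BMU vectors of this layer become the inputs of the next, smaller layer.
        bmus = np.argmin(np.linalg.norm(inputs[:, None, :] - W[None, :, :], axis=2), axis=1)
        inputs = W[bmus]
    return layers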
4 Experimental Results and Discussion
We have used a four-layer HSOM for segmenting cell-migration images (see Fig. 2): we present images segmented from grey-scale features (middle column, Fig. 2) and from colour features (right column) through a hierarchy of SOM's varying from 16×16 nodes down to 2×2 nodes (left column, Fig. 2). Note that at each level the segmentation from the colour image features produces more regions. One explanation for this over-segmentation is that the hierarchical SOM learns and discriminates the colour feature vectors into more resultant classes. The segmentation results for the grey-scale feature vectors are poor because some of the cell regions are segmented into two different regions and some cell regions are mistaken for the background.
Fig. 2. The hierarchical organization of our HSOM (left) and the segmentation results from each layer by using grey-scale feature vectors (middle) and color feature vectors (right)
Techniques like region merging can avoid the segmentation of an object into different segments, but there appears to be no suitable technique for resolving the mistaken background. Even though many more regions than expected are produced (over-segmentation) from the third layer, it should be noted that one can still distinguish the cell regions from the background. We have tested the hierarchical SOM on twelve cell migration images and calculated the number of regions produced by each layer (see Figure 3).
Fig. 3. Number of regions produced (log scale) by the 16×16, 8×8, 4×4 and 2×2 node layers of the hierarchical SOM for each analyzed image
It can be seen that the number of regions is gradually reduced from the lower layers to the upper layers, and that it drops substantially from the 8×8 node layer to the 4×4 node layer (note that the number of regions is shown in log space). We have compared the performance of our HSOM with other image segmentation techniques, including one-dimensional histogram thresholding [14], two-dimensional histogram thresholding [22], edge detection [4], and a single-layer SOM, on the cell migration images. Results are shown in Table 2, with the boundaries of regions plotted as white lines. Although the 2DHT is better than the 1DHT, there are still some large parts of the cell regions that are mistaken for the background. Canny's edge detection technique provided a better outline of the cell regions; however, it also produced a large number of inconsistent lines inside the cell regions, and post-processing of these undesired lines is still an unsolved challenge for edge detection techniques. Segmentation results of the single-layer self-organizing map are not satisfactory either, as the image is over-segmented, and the post-processing of this over-segmented image would involve significant computation time. We will choose the over-segmented image from the 4×4 layer of the hierarchical SOM for our future work on the hybrid-based image segmentation technique, which will combine the hierarchical SOM with region-based techniques.
Table 2. Segmentation results using different techniques
(Table 2 presents the images segmented by 1D Histogram Thresholding [14], 2D Histogram Thresholding [22], Edge Detection [4], and a Single-Layer Self-Organizing Map.)
This result also suggests that pure histogram thresholding or classification-based techniques are not sufficient for image segmentation, as these techniques operate at the pixel level and each pixel is treated separately. Pixels may be linked to each other purely on the basis of contiguity, or there may be a semantic link between the pixels. For instance, pixels in the microtubules of the cells have a relationship with the cell boundary: although the two are separated, microtubules are invariably enclosed in a cell. Conventional segmentation analysis cannot cross this semantic gap. A further process of region merging might narrow this semantic gap by merging neighboring semantically or visually similar regions.
5 Conclusion and Future Work
A hierarchical self-organizing map (SOM) was proposed for the segmentation of cell-migration images. Segmentation results demonstrated that the grey-scale feature vectors produce better results than colour ones. We also compared the hierarchical SOM with other segmentation methods, including one-dimensional and two-dimensional histogram thresholding, edge detection, and a single-layer SOM, and the hierarchical SOM seemed to be the one that gave the most promising results. However, post-processing of the images appears to lead to better quality of segmentation, and this is what we are pursuing currently.
References
1. Ahmad, K., Vrusias, B., Zhu, M.: Visualising an Image Collection? In: Banisi, E., et al. (eds.) Proc. of the 9th Int. Conf. Information Visualisation (London, 6-8 July 2005). Los Alamitos: IEEE Computer Society Press (2005) 268-274
2. Bhandarkar, S.M., Koh, J., Suk, M.: Multiscale Image Segmentation Using a Hierarchical Self-Organizing Map. Neurocomputing 14 (1997) 241-272
3. Bhanu, B., Lee, S., Ming, J.: Adaptive Image Segmentation Using a Genetic Algorithm. IEEE Transactions on Systems, Man, Cybernetics 25 (1995) 1543-1567
4. Canny, J.: A Computational Approach to Edge Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 8 (1986) 679-698
5. Endo, M., Ueno, M., Tanabe, T.: A Clustering Method Using Hierarchical Self-Organizing Maps. Journal of VLSI Signal Processing 32 (2002) 105-118
6. Kohonen, T.: Self-Organization and Associative Memory: 3rd Edition. New York: Springer-Verlag (1989)
7. Koop, R.: Combinatorial Biomarkers: From Early Toxicology Assays to Patient Population Profiling. Drug Discovery Today 10 (2005) 781-788
8. Kyan, M., Guan, L., Liss, S.: Refining Competition in the Self-organising Tree Map for Unsupervised Biofilm Image Segmentation. Neural Networks 18 (2005) 850-860
9. Lau, K.T., McAlernon, P., Slater, M.: Discrimination of Chemically Similar Organic Vapours Mixtures Using the Kohonen Network. The Analyst 125 (2000) 65-70
10. Long, A., Mitchell, S., Kashanin, D., Williams, V., Mello, A.P., Shvets, I., Kelleher, D., Volkov, Y.: A Multidisciplinary Approach to the Study of T Cell Migration. Ann. N.Y. Acad. Sci. 1028 (2004) 313-319
11. Nattkemper, T.W., Wersing, H., Schubert, W., Ritter, H.: A Neural Network Architecture for Automatic Segmentation of Fluorescence Micrographs. Neurocomputing 48 (2002) 357-367
12. Navon, E., Miller, O., Averbuch, A.: Color Image Segmentation Based on Adaptive Local Thresholds. Image and Vision Computing 23 (2005) 69-85
13. Ong, S.H., Yeo, N.C., Lee, K.H., Venkatesh, Y.V., Cao, D.M.: Segmentation of Color Images Using a Two-stage Self-organizing Network. Image and Vision Computing 20 (2002) 279-289
14. Otsu, N.: A Threshold Selection Method From Gray-level Histogram. IEEE Transactions on System, Man, and Cybernetics 9 (1979) 62-66
15. Pal, N.R., Pal, S.K.: A Review on Image Segmentation Techniques. Pattern Recognition 26 (1993) 1277-1294
16. Pavlidis, T., Liow, Y.: Integrating Region Growing and Edge Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 12 (1990) 225-233
17. Smet, P.D., De Vleeschauwer, D.: Performance and Scalability of a Highly Optimized Rainfalling Watershed Algorithm. Proc. of the International Conference on Imaging Science, Systems and Technology, CISST 98, Las Vegas, NV, USA, July (1998) 266-273
18. Vincent, L., Soille, P.: Watershed in Digital Space: An Efficient Algorithm Based on Immersion Simulation. IEEE Transactions on Pattern Analysis and Machine Intelligence 13 (1991) 583-598
19. Volkov, Y., Long, A., McGrath, S., Ni Eidhin, D., Kelleher, D.: Crucial Importance of PKC-β(I) in LFA-1-mediated Locomotion of Activated T cells. Nature Immunology 2 (2001) 508-514
20. Wismuller, A., Vietze, F., Behrends, J., Meyer-Baese, A., Reiser, M., Ritter, H.: Fully Automated Biomedical Image Segmentation by Self-Organized Model Adaptation. Neural Networks 17 (2004) 1327-1344
21. Yeo, N.C., Lee, K.H., Venkatesh, Y.V., Ong, S.H.: Colour Image Segmentation Using the Self-organizing Map and Adaptive Resonance Theory. Image and Vision Computing 23 (2005) 1060-1079
22. Zheng, C., Sun, D.W., Zheng, L.: Segmentation of Beef Joint Images Using Histogram Thresholding. Journal of Food Process Engineering 29 (2006) 574-591
23. Zhou, X., Cao, X., Perlman, Z., Wong, S.T.C.: A Computerized Cellular Imaging System for High Content Analysis in Monastrol Suppressor Screens. Journal of Biomedical Informatics 39 (2006) 115-125
Network Anomaly Detection Based on DSOM and ACO Clustering Yong Feng, Jiang Zhong, Zhong-yang Xiong, Chun-xiao Ye, and Kai-gui Wu College of Computer Science, Chongqing University, Chongqing, 400044, China
[email protected] http://www.cqu.edu.cn/
Abstract. An approach to network anomaly detection is investigated, based on dynamic self-organizing maps (DSOM) and ant colony optimization (ACO) clustering. The basic idea of the method is to produce the clusters by DSOM and ACO; with the classified data instances, anomalous clusters can easily be identified by the normal-cluster ratio, and the identified clusters can then be used in real data detection. Traditional clustering-based intrusion detection algorithms cluster using a simple distance-based metric and detect based on the centers of the clusters, which generally degrades detection accuracy and efficiency. Our approach based on DSOM and ACO clustering can settle these problems effectively. The experiment results show that our approach can detect unknown intrusions efficiently in real network connections.
1
Introduction
Anomaly detection can be considered as a two-class classification problem (normal versus abnormal) where samples of only one class (the normal class) are used for training. Basically there are three different approaches to anomaly detection: negative characterization [1], [2], positive characterization [3], [4], [5], and artificial anomaly generation [6], [7]. Clustering techniques have been applied successfully to the anomaly detection problem. However, traditional clustering-based intrusion detection algorithms cluster using a simple distance-based metric and detect based on the centers of the clusters, which generally degrades detection accuracy and efficiency. In this paper, we present a new type of anomaly detection algorithm to address these problems. The algorithm has four main stages: DSOM network growth, ACO clustering, labelling clusters, and detection. A spread factor is used in the stage of DSOM network growth, which can control the accuracy of clustering. The clusters are produced by ACO clustering. In the stage of detection, the usage of a-posteriori probabilities makes the detection independent of the centers of the clusters, so that the detection accuracy and efficiency can be improved. The experiment results show that our approach can detect unknown intrusions efficiently in real network connections. The remainder of the paper is organized as follows. Section 2 presents the detailed algorithms of our approach. Experiment results are reported in Section 3. Finally, concluding remarks are made in Section 4.
2
Anomaly Detection Based on DSOM and ACO Clustering
Unsupervised anomaly detection algorithms make two assumptions about the data which motivate the general approach [8]. The first assumption is that the number of normal instances vastly outnumbers the number of intrusions. The second assumption is that the intrusions themselves are qualitatively different from the normal instances. Our approach based on the two assumptions has four main steps:
Step 1: DSOM network growth
Step 2: ACO clustering
Step 3: Labeling clusters
Step 4: Detection
Fig. 1 shows the principle of DSOM and ACO clustering.
Fig. 1. Principle of DSOM and ACO clustering
2.1
DSOM Network Growth Algorithm
The DSOM determines the shape as well as the size of the network during training, which is a good solution to the problems of the SOM [9]. The DSOM is an unsupervised neural network, which is initialized with four nodes and grows nodes to represent the input data [10]. During the node growth, the weight values of the nodes are self-organized according to a method similar to the SOM.
Def. 1. The winner neuron b is defined by

‖v − wb‖ ≤ ‖v − wq‖,  ∀q ∈ N    (1)
where v, w are the input and weight vectors, q is the position vector for nodes, N is the set of natural numbers, and ‖·‖ is the Euclidean distance.
Def. 2. E is the error distance between b and v, defined as

E = Σ_{j=1}^{d} (vj − wb,j)²    (2)

where d is the dimension of the vector v.
Def. 3. GT is the growth threshold of the DSOM. For node i to grow a new node, it is required that E ≥ GT. It can be deduced from (2), 0 ≤ vj, wb,j ≤ 1 and 0 ≤ E ≤ d that 0 ≤ GT < d. Therefore, it becomes necessary to identify a different GT value for data sets with different dimensionality, which is a difficult task. The spread factor SF can be used to control and calculate GT for the DSOM:

GT = d × f(SF)    (3)

where SF ∈ R, 0 ≤ SF ≤ 1, and f(SF) is a function of SF.
Def. 4. f(SF) is defined as

f(SF) = Sigmoid_{n(t)}(1 − SF) = (1 − e^{−n(t)·(1−SF)}) / (1 + e^{−n(t)·(1−SF)})    (4)
where n(t) is the total number of nodes at the t-th iteration. f(SF) gradually saturates as network training proceeds, so that GT becomes stable and the DSOM algorithm reaches convergence. The DSOM algorithm is described as follows:
1) Initialization phase.
a) V = {v1, v2, . . . , vn} is the set of input vectors, with vi = {xi1, xi2, . . . , xid}, where 1 ≤ i ≤ n and d is the dimension of the vector vi. Standardize V to V′; if vi is a continuous variable, the method can be described by [11]:

x′ij = (xij − min(xij)) / (max(xij) − min(xij)),  1 ≤ i ≤ n, 1 ≤ j ≤ d    (5)

Alternatively, vi may be coded in binary and then processed as in (5).
b) Initialize the weight vectors of the starting nodes with random numbers between 0 and 1.
c) Calculate GT for the given data set.
2) Growing phase.
a) Present an input to the network.
b) Find the winner neuron b according to (1).
c) Calculate the error distance Ei between b and vi according to (2). If Ei ≥ GT, go to d) to grow a node; otherwise go to e) to adjust the weight values of the neighborhood.
d) Grow the new node m and set wm = vi.
e) The weight adjustment of the neighborhood can be described by

wj(k + 1) = wj(k),                           j ∉ Nk+1
wj(k + 1) = wj(k) + LR(k) × (xk − wj(k)),    j ∈ Nk+1    (6)

When k → ∞ (k ∈ N), the learning rate LR(k) → 0. wj(k) and wj(k+1) are the weight vectors of node j before and after the adjustment, and Nk+1 is the neighborhood of the winning neuron at the (k+1)-th iteration.
f) Adjust the learning rate according to the following formula:

LR(t + 1) = LR(t) × β    (7)

where β is the learning rate reduction, implemented as a constant value 0 < β < 1.
g) Repeat steps b) to f) until all inputs have been presented and node growth is reduced to a minimum level.
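The sketch below illustrates the core of this growing procedure, using the threshold GT from Eqs. (3)-(4), the growth test of Def. 3 and the learning-rate decay of Eq. (7); the data structures, the restriction of the weight update to the winner, and the parameter values are simplified assumptions made for illustration only.

import numpy as np

def growth_threshold(dim, n_nodes, sf):
    # Eqs. (3)-(4): GT = d * f(SF) with f(SF) = Sigmoid_{n(t)}(1 - SF).
    z = n_nodes * (1.0 - sf)
    return dim * (1.0 - np.exp(-z)) / (1.0 + np.exp(-z))

def grow_dsom(V, sf=0.7, lr=0.5, beta=0.9, epochs=5, seed=0):
    """Minimal DSOM growth loop on min-max standardized data V (Eq. (5))."""
    rng = np.random.default_rng(seed)
    V = (V - V.min(axis=0)) / (V.max(axis=0) - V.min(axis=0) + 1e-12)
    d = V.shape[1]
    W = list(rng.random((4, d)))                    # four starting nodes
    for _ in range(epochs):
        for v in V:
            gt = growth_threshold(d, len(W), sf)    # recomputed as the net grows
            b = int(np.argmin([np.linalg.norm(v - w) for w in W]))   # Eq. (1)
            err = np.sum((v - W[b]) ** 2)                            # Eq. (2)
            if err >= gt:
                W.append(v.copy())                  # grow a new node, w_m = v_i
            else:
                W[b] += lr * (v - W[b])             # Eq. (6), winner only here
        lr *= beta                                  # Eq. (7)
    return np.array(W)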
2.2 ACO Clustering
ACO clustering has two main phases. First, each ant chooses an object at random and picks up, moves or drops the object according to the picking-up or dropping probability in the output layer of the DSOM. Second, clusters are collected from the output layer of the DSOM.
Def. 5. Swarm similarity is the integrated similarity of a data object with the other data objects within its neighborhood. A basic formula for measuring the swarm similarity is

f(oi) = Σ_{oj ∈ Neigh(r)} [1 − d(oi, oj)/α]    (8)
where Neigh(r) denotes the local region, usually a circular area with radius r, and d(oi, oj) denotes the distance between data objects oi and oj in the attribute space. α is defined as the swarm similarity coefficient: if α is too large, the algorithm converges quickly, whereas if α is too small, the algorithm converges slowly.
Network Anomaly Detection Based on DSOM and ACO Clustering
951
Fig. 2. The changing trend of Pp (left) and Pd (right)
The ACO clustering algorithm is described as follows: 1) Initialize α, ant-number, maximum iterative times n, slope k, and other parameters. 2) Give each ant initial objects, initial state of each ant is unloaded. 3) Ant clustering phase. a) Compute f (oi ) within a local region with radius r by formula (8). b) If the ant is unloaded, compute Pp by formula (9). Compare Pp with a random probability Pr , if Pp < Pr , the ant does not pick up this object, another data object is randomly given the ant, else the ant pick up this object, the state of the ant is changed to loaded. c) If the ant is loaded, compute Pd by formula (10). Compare Pd with Pr , if Pd > Pr , the ant drops the object, the pair of coordinate of the ant is given to the object, the state of the ant is changed to unloaded, another data object is randomly given the ant, else the ant continue moving loaded with the object. 4) Clusters production phase. a) If this pattern is an outlier, label it as an outlier. b) Else label this pattern a cluster serial number; recursively label the same serial number to those patterns whose distance to this pattern is smaller than a short distance dist. i.e. collect the patterns belong to a same cluster on the output layer of DSOM. 2.3
Labelling Clusters Algorithm
Under the two assumptions of our approach it is highly probable that clusters containing normal data will have a much larger number of instances than would clusters containing anomalies. Therefore, the maximum quantitative difference Di and the labelling clusters threshold N are defined to label the clusters. The Di and N can be defined as [15]: Di = (Qi − Qmin )2 /(Qmax − Qmin )2 , N = SF/(1 + 1/S).
(11)
Where 0 ≤ Di ≤ 1, 0< N <1. Qi is the number of instances in Ci , 1 ≤ i ≤ S. Qmax is the maximum of {Qi }. Qmin is the minimum of {Qi }. S is the number of the clusters.
The labelling clusters algorithm is described as follows:
1) Calculate the number of instances Qi, Di and N for {C1, C2, . . . , CS}.
2) Set i = 1 and repeat the following.
3) If Di > N, label Ci as a 'normal' cluster; otherwise label Ci as an 'anomalous' cluster.
4) Set i = i + 1.
5) Until i > S.
The labelling clusters algorithm has good accuracy because Di and N are determined by the DSOM and ACO clustering rather than by parameters input by the users.
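A direct sketch of this labelling step, using Eq. (11), is shown below; the representation of clusters by their sizes and the function name are illustrative assumptions.

def label_clusters(cluster_sizes, sf):
    """Return 'normal'/'anomalous' labels for clusters from their sizes (Eq. (11))."""
    q_max, q_min = max(cluster_sizes), min(cluster_sizes)
    s = len(cluster_sizes)
    n_threshold = sf / (1.0 + 1.0 / s)            # labelling threshold N
    labels = []
    for q in cluster_sizes:
        d_i = ((q - q_min) ** 2) / ((q_max - q_min) ** 2 + 1e-12)   # D_i
        labels.append("normal" if d_i > n_threshold else "anomalous")
    return labels

# Example: five clusters, one of which is much larger than the rest.
# label_clusters([950, 12, 8, 20, 10], sf=0.7) labels only the large cluster 'normal'.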
2.4 Detection Algorithm
Given a network data packet X to be detected, X is standardized to X′. X is assigned to the class label Ci such that P(Ci|X) is maximal, where P(Ci|X) is the a-posteriori probability. The method depends on Bayes' theorem, which is defined as follows:

P(Ci|X) = P(X|Ci) · P(Ci) / P(X)    (12)

where P(X) is constant for all classes and P(Ci) is the relative frequency of class Ci; the Ci for which P(Ci|X) is maximum is the Ci for which P(X|Ci)·P(Ci) is maximum. The detection algorithm is described as follows:
1) Standardize X to X′.
2) Set i = 1 and repeat the following.
3) Calculate P(X|Ci)·P(Ci) for {C1, C2, . . . , CS}.
4) Let pi = P(X|Ci)·P(Ci), where pi is an array variable.
5) Set i = i + 1.
6) Until i > S.
7) Let pj be the maximum of {pi}, where j ∈ [1, S].
8) If Cj is labeled as a 'normal' cluster, then X is a normal data packet; otherwise X is an intrusive data packet.
Bayesian classification has the lowest error rate compared with other classification algorithms. The detection algorithm of our approach depends on Bayes' theorem, which makes the detection independent of the centers of the clusters, so that the detection accuracy and efficiency can be improved.
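A minimal sketch of this Bayes-based decision, assuming Gaussian class-conditional densities per cluster (our assumption; the paper does not specify how P(X|Ci) is estimated), might look as follows.

import numpy as np

def fit_cluster_models(clusters):
    """clusters: dict label -> (instances array, 'normal'/'anomalous' tag)."""
    total = sum(len(data) for data, _ in clusters.values())
    models = {}
    for label, (data, tag) in clusters.items():
        prior = len(data) / total                       # P(Ci)
        mean = data.mean(axis=0)
        var = data.var(axis=0) + 1e-6                   # diagonal covariance
        models[label] = (prior, mean, var, tag)
    return models

def detect(x, models):
    """Assign x to the cluster maximizing P(X|Ci) * P(Ci) (Eq. (12), P(X) ignored)."""
    def log_score(prior, mean, var):
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)
        return np.log(prior) + log_lik
    best = max(models, key=lambda c: log_score(*models[c][:3]))
    return "normal" if models[best][3] == "normal" else "intrusion"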
3
Experiment
The experimental data we used is the KDD Cup 1999 Data [16]. It consists of two datasets: training dataset (KDD-TND) and test dataset (KDD-TTD). According to the first assumption of the unsupervised anomaly detection algorithms (UADA), we need to generate the training dataset D from KDD-TND
Fig. 3. Number of attacks in the test sets
by filtering it for attacks. D consisted of 1% to 1.5% intrusion instances and 98.5% to 99% normal instances. To evaluate the algorithm we are interested in three major indicators of performance: DR (Detection Rate), FPR (False Positive Rate) and FNR (False Negative Rate). In the test experiment, we adopt 5 test sets from KDD-TTD: DS1, DS2, DS3, DS4 and DS5. Each dataset contains 1,000 instances. Fig. 3 shows the number of attacks in the test sets.
Fig. 4. Performance of our approach, K-NN and SVM
Fig. 5. Average performance comparison
In the training experiment, we adjusted SF from 0.3 to 0.8 (in steps of 0.1) and α from 0.85 down to 0.2 (in steps of 0.05), and obtained the best performance with SF = 0.7 and α = 0.25. Therefore, we adopt the same SF and α in the test experiment. The test results are reported in Fig. 4(a). In reference [17], the experimental data is also the KDD Cup 1999 Data; Fig. 4(b) shows the performance of K-NN in [17] over the test sets and Fig. 4(c) shows the performance of SVM in [17] over the test sets. Fig. 5 shows the average performance comparison between our approach and the existing UADA over the test sets.
4
Conclusions
This paper proposes an anomaly detection model based on a DSOM network and ACO clustering. Experimental results show that the average DR, FPR and FNR of our approach remain better than those of SVM and K-NN.
Acknowledgments. This work is supported by the Graduate Student Innovation Foundation of Chongqing University of China (Grant No. 200506Y1A0230130), the Research Fund for the Doctoral Program of Higher Education of China (Grant No. 20040611002) and the National Natural Science Foundation of China (Grant No. 30400446).
References
1. Forrest, S., Perelson, A., Allen, L., Cherukury, R.: Self-Nonself Discrimination in a Computer. In: Proc. IEEE Symp. on Research in Security and Privacy (1994)
2. Singh, S.: Anomaly Detection using Negative Selection Based on the R-contiguous Matching Rule. In: 1st International Conference on Artificial Immune Systems (ICARIS) (2002) 99-106
3. Lane, T., Brodley, C.E.: An Application of Machine Learning to Anomaly Detection. In: Proc. 20th NIST-NCSC National Information Systems Security Conference (1997)
4. Lane, T., Brodley, C.E.: Sequence Matching and Learning in Anomaly Detection for Computer Security. In: AI Approaches to Fraud Detection and Risk Management (Fawcett, Haimowitz, Provost, Stolfo, eds.), AAAI Press (1997) 43-49
5. Mahoney, M., Chan, P.: Learning Nonstationary Models of Normal Network Traffic for Detecting Novel Attacks. In: Proc. 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2002) 23-26
6. Fan, W., Lee, W., Miller, M., Stolfo, S., Chan, P.: Using Artificial Anomalies to Detect Unknown and Known Network Intrusions. In: Proc. 1st IEEE International Conference on Data Mining (2001)
7. Gonzalez, F., Dasgupta, D.: Neuro-Immune and Self-Organizing Map Approaches to Anomaly Detection: A Comparison. In: 1st International Conference on Artificial Immune Systems (2002)
8. Portnoy, L., Eskin, E., Stolfo, S.J.: Intrusion Detection with Unlabeled Data using Clustering. In: Proc. ACM CSS Workshop on Data Mining Applied to Security (DMSA-2001), Philadelphia, PA (2001)
9. Kohonen, T.: Self-Organizing Maps. Springer-Verlag, Berlin, Germany (1995)
10. Alahakoon, L.D., Halgamuge, S.K., Srinivasan, B.: A Structure Adapting Feature Map for Optimal Cluster Representation. In: Proc. Int. Conf. Neural Information Processing (1998) 809-812
11. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. High Education Press, Morgan Kaufmann Publishers (2001)
12. Wu, B., Shi, Z.: A Clustering Algorithm Based on Swarm Intelligence. IEEE International Conferences on Info-tech & Info-net Proceedings, Beijing (2001) 58-66
13. Lumer, E., Faieta, B.: Diversity and Adaptation in Populations of Clustering Ants. In: Proc. 3rd International Conference on Simulation of Adaptive Behavior: From Animals to Animats, Cambridge (1994) 499-508
14. Feng, Y., Wu, Z.F., Wu, K.G.: An Unsupervised Anomaly Intrusion Detection Algorithm Based on Swarm Intelligence. 2005 International Conference on Machine Learning and Cybernetics, ICMLC 2005, Guangzhou (2005)
15. Feng, Y., Wu, K.G., Wu, Z.F.: Intrusion Detection Based on Dynamic Self-Organizing Map Neural Network Clustering. Lecture Notes in Computer Science (2005)
16. KDD99: KDD99 cup dataset. http://kdd.ics.uci.edu/databases/kddcup99 (1999)
17. Eskin, E., Arnold, A., Prerau, M.: A Geometric Framework for Unsupervised Anomaly Detection: Detecting Intrusions in Unlabeled Data. In: Data Mining for Security Applications, Kluwer (2002)
Hybrid Pipeline Structure for Self-Organizing Learning Array Janusz A. Starzyk, Mingwei Ding, and Yinyin Liu School of Electrical Engineering & Computer Science Ohio University, Athens, OH 45701
[email protected]
Abstract. In recent years, many efforts have been made to apply the concept of reconfigurable computing to neural networks. In our previous pursuits, an innovative self-organizing learning array (SOLAR) was developed. However, the traditional multiplexer method of achieving reconfigurable connections has its limits for larger networks. In this paper, we propose a novel pipeline structure, which offers flexible, possibly large numbers of dynamically configurable connections and which utilizes each node's computing ability. The hardware resource demand of the proposed structure is a linear function of the network size, which is especially useful for building a large network that can handle complicated real-world applications.
1 Introduction
Reconfigurable computing has become an attractive research topic during the past decade due to its good tradeoff between performance and flexibility [1][2][3]. At the same time, a significant effort has been made to introduce reconfigurable computing into hardware implementations of neural networks [4][5][6]. A novel data-driven self-organizing learning array (SOLAR) was proposed in [7] with the aim to develop hardware structures for machine intelligence. Our ultimate goal is to build a modular 3D SOLAR system consisting of hundreds or thousands of neurons. In our previous work [8][9], a dynamic reconfigurable hardware implementation of the SOLAR algorithm was constructed based on the Xilinx picoBlaze core [10]. This requires a significant amount of silicon dedicated to wiring: since the number of possible connections among n nodes grows as O(n²) and the average wire length increases as O(n^0.5) [11], the total design area occupied by wires grows as O(n^2.5). In addition, the growing network size requires an increasing number of wires to configure. To solve this problem, a new wiring structure is needed to achieve the same reconfigurability with less global wiring. In this paper, we propose a novel hybrid pipeline structure focused on dynamically changing connectivity. The basic idea is to utilize the computing ability of each node in the network to perform "soft" connections. The advantage of this structure is that the connections between nodes are fully configurable inside the corresponding nodes, thus saving global wiring and avoiding complicated routing algorithms.
The rest of this paper is organized as follows: Section 2 gives a detailed description of the proposed structure and pipeline dataflow, including the node structure that implements the "soft" connections; Section 3 presents the simulation results; Section 4 concludes the paper with a summary of the proposed structure and future work.
2 Hybrid Pipeline with Sequential Channel
2.1 SOLAR Overview
SOLAR is a regular 2D/3D array of identical processing neurons connected to programmable routing channels. Each neuron chooses its input signals from the adjacent routing channels, chooses the arithmetic function to perform, and sends its output signals to the routing channels. The arithmetic functions available to a neuron in the current design include half, identity, logarithm, exponent, sigmoid, addition and subtraction. The functions and connections are reconfigurable as a result of learning. The SOLAR implementation presented in this paper employs a feed-forward structure with all the neurons arranged in a 2D array, as depicted in Fig. 1.
Fig. 1. SOLAR Structure Overview
A SOLAR structure in many ways resembles the organization of a CNN [12]. Like a CNN, its architecture is defined by an array of identical cells which adapt their behavior to the input data. Its neurons are cellular automata, which can be programmed to perform different computational tasks based on data received from their neighbors. Neurons can be either static or dynamic, as defined in [13], depending on the type of equation solved. However, unlike in a CNN, its connectivity structure is not fixed. In a CNN, the interconnect structure is defined by templates, which limits its learning ability, while in SOLAR the interconnect structure is an element of learning and can be dynamically changed even during the network's operation. Thus a CNN can be considered a special case of the SOLAR structure. An efficient and flexible routing scheme plays a key role in implementing SOLAR, and the following sections detail a new hybrid pipeline structure to address this issue.
2.2 Hybrid Pipeline Structure Overview
The whole network is constructed as a rectangular 2D array. Each column of the array consists of a long shift register and several processing nodes (neurons) attached to the
shift register to perform read/write operations. Each processing node has four working modes: idle, reading, processing and writing. The long shift register implements the routing channel through which all neurons are connected to the primary input nodes or to other neurons. Each input data item is first repeated several times and then fed to the network sequentially. After the first column finishes data processing, the results are written back to the shift registers and the data in the shift register are shifted to the next column through the routing channel. The next column then works on the data from the routing channel while the previous column starts processing new input data. Therefore, we call this organization a hybrid pipeline structure, illustrated in Fig. 2.
Fig. 2. Pipeline structure overview
Before we continue with the detailed description of the structure and its data operations, let us first introduce some terminology for the network parameters:
c – input data copy ratio; defines how many times a data item is repeated before the next data item is fed. The copy ratio only applies to the 1st column of the network.
k – the number of processing nodes in each column.
N – the number of input data items.
L – the total length of one column of the shift register, L = c × N.
R – the input range for a node (designed to be the same for all nodes); it specifies the maximum range of nodes from the previous layer that a node can reach.
{P1, P2, …, Pk} – a vector of length k; the i-th element of this vector specifies the read/write position for the i-th node in the current column.
In the existing design of SOLAR, each neuron can possibly read the data from any other neuron in the previous layers of the network through the routing channel and write the processed results back to the same slots of the shift register. Repeating each data item c times in the 1st column statistically provides neurons the opportunity to read the data across several layers. On the other hand, if the SOLAR structure is
strictly hierarchical, there is no need to connect across several layers and c is set to 1. The copy ratio c should increase with the average number of neuron inputs; it also affects the effective size of the routing channel. k is usually related to N in a 2D design, and R determines the neighborhood size of each neuron. For locally connected neurons, R is small. The optimum choices of the design parameters c, k, R, etc. are application specific and will not be discussed in this paper. In the next section, a more detailed description of the dataflow in the implemented interconnect scheme is given.

2.3 Data Flow Description
First of all, we base the timing circuitry on the clock that drives the shift registers. All operations are synchronized to the rising edge of this clock. The operation of this structure is pipelined from column to column. All the nodes are in idle mode before cycle Pk, after which the nodes begin reading data from their designated slots. All the input data will have been shifted into the 1st column after L cycles. At cycle L, the switch at the top of the column switches to the feedback position and the data begin to circulate in the 1st column. All the nodes enter the processing mode after reading the input data and should finish their computing tasks no later than cycle L + Pk (if the longest combined reading and computing time of all the nodes exceeds L cycles, additional L cycles must be added to complete these two operations); at that point all the nodes begin to write their processed results back to the specified slots, as shown in Fig. 3. By the end of cycle 2L + Pk, the nodes should finish writing and enter the idle mode, as illustrated in Fig. 4. At cycle 3L, the switch switches back to the data position, the next N input data items begin feeding into the 1st column, and the content of the 1st column is copied to the 2nd column, as shown in Fig. 5. From the above description, we conclude that the pipeline delay between two columns is 3L cycles. This computing scheme, in which pipelined data are transported sequentially, yields performance slightly lower than a fully parallel hardware implementation, but significantly higher than sequential operation on a single processor. In general, if a node requires p cycles to process the input data, then a fully parallel implementation requires p + r cycles, where r is the average number of a neuron's inputs. The proposed pipeline scheme requires 2L + ⌈p/L⌉L cycles, while a sequential implementation requires Np + r cycles to complete. For large p, the performance of the proposed pipeline structure approaches that of the fully parallel implementation, while for small p its performance is a function of the channel size. Therefore, it is a good compromise between hardware cost and performance; a rough cycle-count comparison is sketched below.
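The following small Python sketch (our own illustration, not part of the original design) evaluates the three cycle counts above for a choice of parameters; the ceiling in the pipelined case reflects the assumption that processing longer than L cycles costs additional whole passes of the channel.

import math

def cycles_parallel(p, r):
    """Fully parallel hardware: read r inputs, then process for p cycles."""
    return p + r

def cycles_pipelined(p, L):
    """Hybrid pipeline: 2L for the read/write passes, plus whole extra passes
    of the channel when processing exceeds one pass."""
    return 2 * L + math.ceil(p / L) * L

def cycles_sequential(p, r, N):
    """Single sequential processor handling all N data items."""
    return N * p + r

if __name__ == "__main__":
    c, N, r = 2, 75, 2          # copy ratio, input items, average inputs per neuron
    L = c * N                   # channel length, L = c * N
    for p in (10, 150, 600):    # short, medium and long processing times
        print(p, cycles_parallel(p, r), cycles_pipelined(p, L),
              cycles_sequential(p, r, N))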
2.4 Node Operations

In our design, each node is implemented with a Xilinx picoBlaze core processor. To operate correctly, the node must run at a higher speed than the shift register, because within one shift-register clock period the node needs to read the timing information and, if necessary, read the data present at that time. For a clear description of the operation, the period of this higher-speed clock is denoted a node-cycle.
Fig. 3. Shift Register Data Flow (1st column)
Fig. 4. Shift Register Data Flow (2nd column)
After a node finishes reading all the slots of required data, it begins working in the processing mode. Based on Figs. 3-5 and the previous description, the node has L − c·R cycles to perform its computing task. Fig. 6 illustrates how each processing node is attached to the shift registers to perform read/write operations. Register 1 and register 2 are parts of the routing channel of the hybrid pipeline structure. The sel signal normally lets data flow from register 1 to register 2, and is switched only when the node decides to output its processed result; a behavioral sketch of this multiplexing is given below.
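A minimal software analogy of the single-node attachment in Fig. 6 (the real design is a hardware multiplexer, so this is only an illustration of the sel behavior):

def shift_step(register1_value, node_result=None):
    """One shift-register clock tick as seen by a single node.

    By default the value in register 1 simply flows on to register 2
    (sel = pass-through).  When the node has a result to output for the
    current slot, sel selects the node output instead.
    """
    if node_result is None:
        return register1_value        # sel: register 1 -> register 2
    return node_result                # sel: node output -> register 2

# Pass-through for most slots, overwrite when the node writes back.
print(shift_step(47))        # -> 47
print(shift_step(47, 51))    # -> 51 (node writes its processed result)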
Fig. 5. Shift Register Data Flow (1st and 2nd columns)
Fig. 6. Single Node Read/Write Structure
As stated above, the picoBlaze core runs at a frequency higher than the shift-register clock, and the ratio between the two clocks is denoted m. Based on the previous analysis, a node's computation time (in node-cycles) must be a multiple of m in order for the whole pipeline to work correctly. Thus, optimization of the computing time is necessary, especially for complicated operations such as exponent, logarithm and sigmoid.
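For instance, assuming m node-cycles per shift-register cycle, a node routine can be padded so that its length is a multiple of m, as in this small sketch (an illustration of the alignment constraint only, not the actual picoBlaze assembler code):

def aligned_cycles(node_cycles, m):
    """Round a routine's length in node-cycles up to a multiple of m,
    so that it ends exactly on a shift-register clock edge."""
    return -(-node_cycles // m) * m   # ceiling division times m

# e.g. m = 4 node-cycles per shift-register cycle
for t in (9, 12, 30):
    print(t, "->", aligned_cycles(t, 4))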
3 Simulation Results

In this section, we first give the simulation results of a single node performing read/write and add operations. Then a 4-row array is built to read and process the Iris database. Finally, four arrays of different sizes are constructed and their design areas are compared. The read/write waveform for a single node is illustrated in Fig. 7. The node is configured to read data from slots 4 and 5, and perform a modified add operation,
Fig. 7. Single Node Read/Write Waveform (node reading and node writing phases)
and then write the results back to slots 4 and 5. As we can see, the node reads the data values 47 and 57 at slots 4 and 5. According to the modified add function designed for our network, which halves each operand before adding (see the sketch below), the processed result should be 47/2 + 57/2 = 51 under integer halving. The processed result 51 is then output by the node when it "sees" slots 4 and 5 again. As an example, a 4x3 SOLAR array with pseudo-random connections was built to process data from the Iris database [14]. The training set consists of 75 samples with 4 features each. The connections and arithmetic functions chosen by the neurons after learning are shown in Fig. 8.
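A one-line model of this modified add (half-and-add; the interpretation that halving keeps an 8-bit result from overflowing is our reading and is not stated explicitly in the paper):

def modified_add(a, b):
    """Half-and-add used by the node: integer-halve both operands, then sum."""
    return a // 2 + b // 2

assert modified_add(47, 57) == 51   # matches the waveform example in Fig. 7
print(modified_add(47, 57))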
Fig. 8. 4x3 Array processing 4-feature data
In Fig. 8, the numbers on top of the individual nodes are the neurons' processed results. The names on the arcs specify which arithmetic function each node performs. The final output is read from the last column and is verified against Matlab simulation results. It follows that, using the proposed hybrid pipeline structure, the network implemented in the FPGA can correctly perform all calculations. To demonstrate the advantages of the proposed pipeline approach for larger networks, additional arrays of 4x6, 4x12 and 4x24 were constructed and synthesized
Fig. 9. Design area vs. network size
targeting a Xilinx Virtex II chip. The design area (number of slices) is observed to be a linear function of the network size (number of nodes), as shown in Fig. 9, while the maximum system clock is consistently kept at 81.1 MHz.
4 Conclusions and Future Work

In this paper, a novel hybrid pipeline scheme is presented, and its flexibility in dynamic connectivity configurations is demonstrated by checking the hardware response against the simulation results. This structure is characterized by a linear increase of the hardware resources with respect to the network size. In addition, the proposed structure reduces global wiring to a minimum and shows good modularity, which eases the placement and routing task in larger networks. The shift register cost can be mitigated by using Xilinx FPGA technology, which possesses abundant register resources and allows shift registers to be efficiently implemented by a chain of look-up tables. The proposed structure has been successfully applied to implement the SOLAR algorithm and has been tested in hardware on the Iris database. With its excellent scaling properties, it shows potential for implementing larger SOLAR networks in hardware, targeting applications such as data clustering, image processing and pattern recognition. Finally, we would like to point out that some variants of the proposed structure can achieve higher pipeline speed at the cost of limiting the range of input signals that each node may receive.
References

1. Tessier, R., Burleson, W.: Reconfigurable Computing and Digital Signal Processing: A Survey. Journal of VLSI Signal Processing 28 (2001) 7-27
2. Hartenstein, R.: A Decade of Reconfigurable Computing: A Visionary Retrospective. Proceedings of Int'l Conf. on Design, Automation and Test in Europe (DATE'01), Munich, Germany (2001) 642-649
3. Singh, H., Lee, M., Lu, G., Kurdahi, F.J., Bagherzadeh, N.: MorphoSys: An Integrated Reconfigurable System for Data-Parallel and Computation-Intensive Applications. IEEE Trans. Computers 49 (5) (2000) 465-481
4. Mishra, J., Mitra, S.: Neural Networks in Hardware: A Survey. IEEE Trans. Syst. Man Cybern. C (2003)
5. Tempesti, G., Mange, D., et al.: The BioWall: An Electronic Tissue for Prototyping Bio-inspired Systems. Proceedings of the NASA/DoD Conference on Evolvable Hardware (EH'2002), Los Alamitos, CA (2002) 221-230
6. Spaanenburg, L., Alberts, R., et al.: Natural Learning of Neural Networks by Reconfiguration. Proceedings of SPIE 5119, Spain (2003) 273-284
7. Starzyk, J.A., Zhu, Z., Liu, T.: Self-Organizing Learning Array. IEEE Trans. Neural Networks 16 (2) (2005) 355-363
8. Starzyk, J.A., Guo, Y.: Dynamically Self-Reconfigurable Machine Learning Structure for FPGA Implementation. Proc. Int. Conf. on Engineering of Reconfigurable Systems and Algorithms, Las Vegas, Nevada (2003) 296-299
9. Starzyk, J.A., Guo, Y., Zhu, Z.: SOLAR and Its Hardware Development. Proc. Computational Intelligence and Natural Computing (CINC'03) (2003)
10. Xilinx PicoBlaze Soft Processor, http://www.xilinx.com
11. Donath, W.E.: Wire Length Distribution for Placement of Computer Logic. IBM Journal of Research and Development (1981) 152-155
12. Chua, L.O., Yang, L.: Cellular Neural Networks: Theory. IEEE Trans. Circuits and Systems 35 (10) (1988) 1257-1272
13. Gupta, M.M., Jin, L., Homma, N.: Static and Dynamic Neural Networks: From Fundamentals to Advanced Theory. John Wiley & Sons (2003)
14. Iris Database, http://www.ics.uci.edu/~mlearn/MLRepository.html
CSOM for Mixed Data Types Fedja Hadzic and Tharam S. Dillon Faculty of Information Technology, University of Technology Sydney, Australia {fhadzic,tharam}@it.uts.edu.au
Abstract. In our previous work we presented a variation of the Self-Organizing Map (SOM), called CSOM, that applies a different learning mechanism, useful for situations where the aim is to extract rules from a data set characterized by continuous input features. The main change is that the weights on the network links are replaced by ranges, which allows for direct extraction of the underlying rule. In this paper we extend our work by allowing the CSOM to handle mixed data types and continuous class attributes. These extensions called for an appropriate adjustment of the network pruning method, which uses the Symmetrical Tau (τ) criterion for measuring the predictive capability of cluster attributes. Publicly available real-world data sets were used for evaluating the proposed method, and the results demonstrate the effectiveness of the method as a whole for extracting optimal rules from a trained SOM.
1 Introduction

In the past, Neural Networks (NN) were rarely used for data mining tasks because the acquired knowledge was not represented symbolically, but in the form of weights on links between processing units of a network. When used for decision support, it was impossible for the user to verify the suggested decision against a knowledge model, as the knowledge is hidden in the network itself. This problem with neural networks is known as the 'black box' critique. However, since the development of symbolic rule extraction methods from NNs, the confidence in using NNs for data mining purposes has risen, as the acquired knowledge can now be explained and verified. Most of the developed methods for rule extraction from NNs analyze the weights on the links between network units in order to determine the relationships between the input features and the concept being learned [1, 2, 3, 4, 5, 6, 7]. Of all the different types of NNs, in this paper we narrow our focus to the Self-Organizing Map (SOM). The SOM [8] is an unsupervised neural network that effectively creates spatially organized "internal representations" of the features and abstractions detected in the input space. It is one of the most popular clustering techniques, based on the competition among the cells in the map for the best match against a presented input pattern. The goal is to represent the points from a high-dimensional input space in a lower-dimensional output space through a topology-preserving mapping. When used for data mining problems, the SOM is commonly integrated with a type of supervised learning in order to assign appropriate class labels to the clusters. To determine which data objects are covered by a cluster, an appropriate rule discovery technique has to be applied to associate each cluster with a rule or a pattern. A common approach has
been to associate some measure of significance with the cluster variables to determine the necessary and sufficient set of cluster constraints [9, 10, 11, 12]. For example, in [9] the measure used is the correlation between the cluster structure revealed by all the variables together and by each variable alone. Rauber and Merkl [10] use the quantization errors of all input variables as a guide to their relevance, and a threshold is used to disregard irrelevant features. Our approach to the problem of rule extraction from SOM is based on the BRAINNE method described in [12]. The 'Unsupervised BRAINNE' [2, 12] method extracts a set of symbolic knowledge structures in the form of concepts and concept hierarchies from a trained SOM. The Hybrid BRAINNE [1] method combines unsupervised and supervised neural networks in order to directly extract disjunctive rules. After the supervised learning is complete, each cluster will be labeled by its predicting class and will have a rule or pattern associated with it that determines the data objects covered by that cluster. In the above-mentioned methods, rules are extracted using the threshold or the breakpoint technique. In basic terms, the threshold technique selects one or more of the largest components of a weight vector as contributory to a particular output, whereas the breakpoint technique picks an obvious point where the weights in the weight vector substantially differ and considers all inputs above that point as contributory. For continuous data the method produces conjunctive rules in which the attributes usually have an upper and a lower bound. To optimize these bounds, a post-processing step takes place that progressively increases/decreases the bounds in order to achieve a better coverage rate (CR) and reduce the misclassification rate (MR). To avoid this post-processing step for range optimization we developed the Continuous Self-Organizing Map (CSOM) [14], which, rather than using a single weight, associates a range with the links between the input and output layer. This modification caused an adjustment to the traditional SOM learning mechanism. By making this change one is able to automatically obtain the appropriate ranges of cluster attributes, simply by querying the network links. The Symmetrical Tau (τ) [15] feature selection criterion was used for the purpose of network pruning and rule simplification.
In [14] some preliminary results using simple datasets were provided which demonstrated the potential of the learning mechanism change for efficient rule extraction in continuous domains. In the current work we aim to apply the CSOM to a more complex and wider range of data. For this purpose the CSOM implementation was extended to handle both continuous and categorical features. Another required extension was for the cases where the class variable to be predicted is of a continuous type. Rather than discretizing the class variable we adopt a similar approach with ranged links during supervised training and thereby try to obtain a more realistic range. Furthermore, to appropriately use the τ feature selection criterion when there are categorical inputs and/or continuous class attributes, a different approach was taken and some of the encountered problems and implications are discussed. The rest of the paper is organized as follows. In Section 2 an overview of traditional SOM is given together with some of the common challenges for its effective use for data mining. Our technique is described in detail in Section 3. Section 4 provides some experimental evaluation of the proposed technique, and the paper is concluded in Section 5.
2 General SOM Properties and Common Challenges

This section gives a brief overview of the SOM and then proceeds with a discussion of some commonly faced challenges when the SOM is to be effectively applied for data mining purposes. The SOM [8] is an unsupervised neural network that effectively creates spatially organized "internal representations" of the features and abstractions detected in the input space. It consists of an input layer and an output layer in the form of a map (see Fig. 1). It is based on the competition among the cells in the map for the best match against a presented input pattern. Each node in the map has a weight vector associated with it, consisting of the weights on the links emanating from the input layer to that particular node.
Fig. 1. SOM consisting of two input nodes and 3 * 3 map
When an input pattern is imposed on the network, a node is selected from among all the output nodes as having the best response according to some criterion. This output node is declared the 'winner' and is usually the cell having the smallest Euclidean distance between its weight vector and the presented input vector. The 'winner' and its neighboring cells are then updated to match the presented input pattern more closely. The neighborhood size and the magnitude of the update shrink as the training proceeds. After the learning phase, cells that respond in a similar manner to the presented input patterns are located close to each other, and so clusters can be formed in the map. Existing similarities in the input space are revealed through the ordered, topology-preserving mapping of high-dimensional input patterns into a lower-dimensional set of output clusters.

Rule extraction and optimization
The importance of human comprehension has been emphasized before, and hence the rules to be extracted must be represented in a symbolic and meaningful way. However, due to inherent complexity, the number and the length of the rules can become quite large. Simpler rules are preferred because they are easier to understand and perform better on unseen data. A rule set can be evaluated based upon its predictive accuracy on an unseen data set, and this enables further rule optimization. In this process, it is common that a trade-off needs to be made between a decrease in misclassification rate (MR), an increase in recognition rate (RR) and an improved generalization power [16]. The trade-off occurs especially when the data set is characterized by continuous features, where a valid constraint on the attribute range needs to be determined for a particular rule. The process of rule refinement is a type of uncertain reasoning technique, and several different approaches have been developed in the literature [17, 18, 19].
Network pruning
A simplified network is much easier for human analysis and is also expected to have better generalization power. Network pruning can occur prior to learning, where the inputs suspected of being irrelevant to the concept being learned are removed from the learning process in order to avoid any interference with the learning mechanism. A statistical measure is commonly used to indicate the predictive capability of an attribute, and attributes with low predictive capability are then removed from the learning process. Another approach is to train the network to completion and then inspect the links between particular network units in order to determine their relevance [20]. This approach is useful for rule simplification and for the removal of attributes whose usefulness has been lost through the learning. Generally speaking, when a network is simplified it becomes more efficient and easier to analyze, and its generalization power is increased by not having irrelevant attributes interfere with the learning and concept-forming mechanism. Most of the methods for symbolic rule extraction referenced in the previous section use some kind of pruning technique to increase the performance and produce simpler rules. The contribution of each unit in the network is determined, and a unit is removed if the performance of the network does not decrease after the removal. This is often referred to as sensitivity analysis in NNs and is one of the common techniques for network pruning [20, 21, 22].

Handling of different data types
The values of an attribute in the data set can be binary, discrete or continuous. Binary attributes have two possible values, whereas discrete attributes can have a larger number of fixed values. Continuous attributes cover a continuous range of values and introduce more difficulty during the process of rule extraction. It is difficult to determine an appropriate constraint on the attribute range for a particular rule, as this kind of information is not contained in the network. Most current approaches involve progressive adjustment of the ranges until some pre-defined optimum is reached [1, 13]. A trade-off occurs, since increasing an attribute range usually leads to an increase in RR but at the cost of an increase in MR, and vice versa. Regularization approaches have been applied to balance out these conflicting criteria [16].
3 Continuous Self-Organizing Map for Mixed Data

This section describes the proposed approach to the problem of rule extraction and feature selection in SOM. The knowledge acquisition process as a whole is similar to that of Hybrid BRAINNE [1], as disjunctive rules are extracted by combining supervised with unsupervised learning. The main differences lie in the adjustment to the SOM's learning algorithm, the rule extraction method, the simultaneous rule optimization during supervised learning and the integration of the Symmetrical Tau criterion for network pruning. Here we provide a brief overview of the method and then proceed to explain each aspect in more detail. The main difference between the traditional SOM learning algorithm and the Continuous SOM learning algorithm is that the weights on the links between the input layer and the Kohonen layer are replaced in CSOM by ranges. This difference has of course caused a change in the
Table 1. Overview of the proposed method
update function and the way neurons compete amongst each other. The brief overview given in Table 1 assumes continuous attributes, since for categorical attributes the only difference would be in step (2), where the approach taken is the same as in the traditional SOM.

Data preprocessing. The data first needs to be transformed into an appropriate format suitable for the input layer. All continuous attributes are normalized, and categorical attributes are split into binary attributes corresponding to the possible attribute values. The value '1' is used to indicate the occurrence of a particular value and '0' its absence. Any missing values encountered were manually replaced by the mean value for that particular attribute (continuous) or the most probable value with respect to the other attribute values of the instance (categorical).

Learning Mechanism. Here we discuss the update functions used in the CSOM for continuous and categorical attributes. Let 'a(t)' denote the adaptation gain between 0 and 1, 'x(t)' the input value at time t, 'Nc(t)' the neighborhood set at time t, 'm' the
node and 'i' the link being updated. The weight update function used for categorical attributes is the same as in the traditional SOM [8] and is given by:

    m_i(t+1) = m_i(t) + a(t)[x(t) − m_i(t)],   if i ∈ Nc(t),
    m_i(t+1) = m_i(t),                         if i ∉ Nc(t).          (1)
The update function for continuous attributes is more complex because of the three different possibilities that need to be accounted for. At time t, let 'm' denote the node and 'i' the link being updated, 'x(t)' be the input value, 'a(t)' be the adaptation gain between 0 and 1, 'Nc(t)' be the neighborhood set, 'INc(t)' be the inhibiting neighborhood set, 'u(t)' be the update factor (the amount that the winner had to change by), 'Um_i(t)' be the upper range limit for link i, and 'Lm_i(t)' be the lower range limit for link i. The update function can then be written as [14]:

If x(t) > Um_i(t):
    for the winner, u(t) = a(t)(x(t) − Um_i(t));
    Um_i(t+1) = Um_i(t) + a(t)·u(t) and Contract(Lm_i(t)),   if i ∈ Nc(t),
    Um_i(t+1) = Um_i(t) − a(t)·u(t),                         if i ∈ INc(t),
    Um_i(t+1) = Um_i(t),                                     if i ∉ Nc(t) and i ∉ INc(t).

If x(t) < Lm_i(t):
    for the winner, u(t) = a(t)(Lm_i(t) − x(t));
    Lm_i(t+1) = Lm_i(t) − a(t)·u(t) and Contract(Um_i(t)),   if i ∈ Nc(t),
    Lm_i(t+1) = Lm_i(t) + a(t)·u(t),                         if i ∈ INc(t),
    Lm_i(t+1) = Lm_i(t),                                     if i ∉ Nc(t) and i ∉ INc(t).

If Lm_i(t) < x(t) < Um_i(t), the update occurs only for i ∈ INc(t):
    if (x(t) − Lm_i(t)) > (Um_i(t) − x(t)):  Um_i(t+1) = Um_i(t) − a(t)·u(t),
    else:                                    Lm_i(t+1) = Lm_i(t) + a(t)·u(t).      (2)

Note that the method Contract(range) is used to contract the range in one direction when it is expanded in the other direction. This is necessary for good convergence. The part of the update function in which inhibiting neighbors are updated is not always required, but in our experiments we found that performance increases when nodes far away from the winner are inhibited further away from the input. A minimal sketch of this range update is given after the contraction method below.

Contraction Method
Each node in the map keeps a record of the sorted values that have occurred when that particular node was activated. Each value has a weight associated with it, indicating the confidence of its occurrence. With this information, the following approach is adopted to determine the value that the range should contract to:
- Initially, we contract to the point where the first/last value occurred;
- At later stages, a recursive approach is adopted in which we contract past the last value if its weight is below a pre-specified threshold and the difference between the last and the next occurring value is above a certain threshold.
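To make the three cases of Eq. (2) concrete, the following Python sketch updates a single continuous link; it is our own simplification (the contraction here simply pulls the opposite limit toward the input, and the winner's update factor u is passed in explicitly), not the authors' implementation.

def update_range(lower, upper, x, a, role, u):
    """One CSOM range update for a single continuous link.

    role is 'winner'/'neighbor', 'inhibit', or 'other';
    u is the winner's update factor a*(distance of x from the violated limit).
    Returns the new (lower, upper) pair.
    """
    def contract(limit):
        return limit + a * (x - limit)     # pull the opposite limit toward x

    if x > upper:                          # input above the range
        if role in ("winner", "neighbor"):
            return contract(lower), upper + a * u
        if role == "inhibit":
            return lower, upper - a * u
    elif x < lower:                        # input below the range
        if role in ("winner", "neighbor"):
            return lower - a * u, contract(upper)
        if role == "inhibit":
            return lower + a * u, upper
    else:                                  # input inside the range: only inhibiting nodes move
        if role == "inhibit":
            if (x - lower) > (upper - x):
                return lower, upper - a * u
            return lower + a * u, upper
    return lower, upper

def range_distance(lower, upper, x):
    """Distance used for winner selection: zero inside the range,
    otherwise the distance to the nearer range limit."""
    if x < lower:
        return lower - x
    if x > upper:
        return x - upper
    return 0.0

# Example: the input lies above the winner's current range.
a, x = 0.3, 0.9
lower, upper = 0.2, 0.6
u = a * (x - upper)                        # winner's update factor
print(update_range(lower, upper, x, a, "winner", u))
print(range_distance(lower, upper, 0.4))   # -> 0.0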
Best match calculation
The best matching unit (winner) is determined by the smallest Euclidean distance to the input vector. The Euclidean distance (ED) corresponds to the sum of squared differences between the input vector and the weight vector of a particular node. After the ranges on links coming from continuous attributes have been initialized, the usual way of calculating the ED needs to be changed: the differences now correspond to the difference from the range limit that the input value is closest to, and if the input value falls within the range the difference is zero.

Network pruning. The Symmetrical Tau (τ) criterion [15] has been used for the purpose of removing the links emanating from nodes that are irrelevant for a particular cluster. These links correspond to attributes whose absence has no effect on predicting the output defined by the cluster. The τ calculation occurs after supervised training, during which the occurring input values and target values are stored for the attributes that define the constraints of a particular cluster. For categorical attributes the occurring values are either '1' or '0'. For continuous attributes the values can come from a large range; hence, occurring input values that are close to each other are merged, so that the value object represents a range of values instead. The same approach of merging close values is adopted when the target attribute is continuous, so that different ranges correspond to different classes of the target attribute. Each value object has a weight vector (WV) associated with it, which stores the weights to the occurring target values (see Fig. 2). The information collected corresponds to the information contained in a contingency table between an input attribute and the target attribute. Let there be I rows and J columns in the table, let the probability that an individual belongs to row category i and column category j be represented as P(ij), and let P(i+) and P(+j) be the marginal probabilities of row category i and column category j, respectively. The Symmetrical Tau measure for the capability of attribute A in predicting the class of attribute B is then defined as [15]:
    τ = [ Σ_{j=1..J} Σ_{i=1..I} P(ij)²/P(+j) + Σ_{i=1..I} Σ_{j=1..J} P(ij)²/P(i+) − Σ_{i=1..I} P(i+)² − Σ_{j=1..J} P(+j)² ]
        / [ 2 − Σ_{i=1..I} P(i+)² − Σ_{j=1..J} P(+j)² ]                                                              (3)
For the purposes of this work, A can be viewed as a cluster constraint attribute and B as the target class. The cluster attributes are then ranked in order of decreasing τ value, and a cut-off point is determined below which all the attributes are considered irrelevant for that particular cluster. A suitable criterion appeared to be that the cut-off point occurs at an attribute whose τ value is less than half of the previous attribute's τ value (a small sketch of this ranking and cut-off is given below). CSOM is then retrained with all the irrelevant links removed, and the result is that the newly formed clusters are simpler in terms of attribute constraints. This improves the performance, as simpler rules are more likely to have good generalization power. This process is illustrated in Fig. 2, where we show an example cluster with its corresponding attribute constraints (Lr < A < Ur and Lr < B < Ur) together with its weighted links to the class attribute (T with values tv1 and tv2).
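As an illustration (using a made-up contingency table rather than data from the paper), the Symmetrical Tau of Eq. (3) and the halving cut-off rule can be computed as follows:

import numpy as np

def symmetrical_tau(table):
    """Symmetrical Tau for an I x J contingency table of counts (Eq. 3)."""
    P = table / table.sum()
    p_row = P.sum(axis=1)                          # P(i+)
    p_col = P.sum(axis=0)                          # P(+j)
    num = ((P ** 2 / p_col).sum()                  # sum_ij P(ij)^2 / P(+j)
           + (P ** 2 / p_row[:, None]).sum()       # sum_ij P(ij)^2 / P(i+)
           - (p_row ** 2).sum() - (p_col ** 2).sum())
    den = 2 - (p_row ** 2).sum() - (p_col ** 2).sum()
    return num / den

def select_attributes(taus):
    """Keep attributes ranked by decreasing tau until one drops below
    half of the previous attribute's tau value."""
    order = sorted(taus, key=taus.get, reverse=True)
    kept = [order[0]]
    for prev, cur in zip(order, order[1:]):
        if taus[cur] < 0.5 * taus[prev]:
            break
        kept.append(cur)
    return kept

# Hypothetical attribute-vs-class contingency tables.
tables = {
    "A": np.array([[30, 5], [4, 31]]),    # strongly predictive attribute
    "B": np.array([[18, 17], [16, 19]]),  # weakly predictive attribute
}
taus = {name: symmetrical_tau(t) for name, t in tables.items()}
print(taus, select_attributes(taus))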
Issues of Value Merging
As mentioned above, when the occurring values stored in the value list of an attribute object are close together, they are merged and the new value object then represents a range of values. A threshold has to be chosen which determines when the difference among the value objects is sufficiently small for merging to occur. This is important for an appropriate τ calculation as well as for good automatic discretization of the continuous class attribute. Ideally, a good merge threshold would be picked with respect to the value distribution of that particular attribute. However, this is not always known beforehand, and hence in our approach we pick a general merge threshold of 0.02 that is used for all attributes, including the class. This has some implications for the calculated τ value, since when the number of categories of an attribute A is increased, more is known about attribute A and the error in predicting another attribute B may decrease; however, A becomes more complex and more difficult to predict. This was the main reason for developing the symmetrical τ measure for feature selection, as opposed to just using Goodman and Kruskal's asymmetrical measure [15]. In the case of continuous class attributes, the class objects stored in a cluster's target vector usually differ in quantity and range from the class objects stored in the target vectors of the value objects of cluster attributes. Therefore, when obtaining the information needed for the τ calculation, extra care had to be taken to make sure that the weight retrieved from the target vector of a value object is the sum of the weights of all class objects that fall within the range of the class object from the cluster target vector.

Issues for Categorical Attributes
Since the categorical attributes have been transformed into binary subsets to suit the input layer, some extra processing is needed to obtain the correct contingency table information. When calculating the τ measure for categorical attributes, they are merged into one cluster attribute that has the previous binary subsets as value objects. The weights and target vectors of the new value objects are set based upon the weight and the target vector of the binary-subset value object representing the value '1' (i.e., '1' indicates the presence of an attribute value).
Fig. 2. Feature selection process in CSOM after supervised training
Rule Extraction. Once the training is completed, clusters can be formed from nodes that respond to the input space in a similar manner. The weight vector of a particular node represents the attribute constraints for that node. Nodes allocated to a cluster are those nodes whose weight vector is a small ED away from the other nodes belonging to the same cluster. For continuous attributes, the rule assigned to a cluster is always the highest upper and lower range attribute constraints that occurred amongst its nodes. Hence, for continuous attributes the rule extraction process is very simple, since the attribute range constraints are already present on the network links. For categorical attributes we have used the threshold technique adopted from [13], where an input (i) is considered contributory (1) to an output node (o) if the difference between the weight on the link from i to o and the maximum weight component in the weight vector of o is below a pre-specified threshold. In our case the maximum weight component is only calculated among the categorical attributes. Note also that a threshold is chosen so that if all the weight components in the weight vector of an output node are below this threshold, then none of the inputs are considered contributory. An input is considered inhibitory (0) if its weight is below a pre-specified threshold (Tinh), which is commonly chosen to be close to zero. The inputs that do not satisfy any of the above conditions are considered a "don't care" (-1) in the related rule. Whenever a (-1) occurs for a cluster attribute, the cluster is split into two clusters, one with a value of '0' and one with a value of '1' for the corresponding attribute.

Rule Optimization. Once the initial rules have been assigned to each cluster, the supervised learning starts by feeding the input data on top of the cluster set, activating those clusters with the smallest ED to the input instance. When a cluster is activated, a link is formed between the cluster and the particular target value. After sufficient training we can determine which particular target value each cluster is predicting by inspecting the weights on the links between the clusters and the target values. If a cluster is mainly activated for one particular target value, then the cluster rule implies that particular target value. When a cluster has weighted links to multiple target values, a rule validating approach is adopted in order to split up the rule further until each sub-rule (sub-cluster) predicts only one target value. The supervised learning is continued for a few iterations until each cluster points to only one target value. This method was motivated by psychological studies of concept formation [23] and the need for a system capable of validating its knowledge and adapting it to changes in the domain. During the validating approach, when the winning cluster captures an instance it should not have (i.e., a misclassification occurs), a child cluster is created which deviates from the original cluster in those attribute values that occurred in the misclassified instance. The attribute constraints in the child cluster are mutually exclusive from the attribute constraints of the parent, so that an instance is captured either by the child or by the parent cluster, not both. After an iteration there could be many child clusters created from a parent cluster. If a child cluster points to other target values with high confidence, it becomes a new cluster (rule); otherwise it is merged back into the parent cluster.
During the process the clusters that are not activated any more are deleted and similar clusters are merged based upon ED. Once a new cluster set is obtained it is retrained and the same process repeats until all the clusters point to only
one target value with high confidence, or until the total number of optimizing iterations is reached. An example of the structure represented graphically is shown in Fig. 3. The reasoning mechanism described would merge DC to C1, DC1 to C2, DC2 to C3 and DC3 to C3, because they are still frequently triggered for the same target value as their parent's predicting class, and their links to other target values have only a small weight. DC2 from C2 and DC1 from C3 become new clusters, since they point with high weight to target values different from those of their parents.
Fig. 3. Example structure after supervised training (notation OR – original rule, DC – deviate child, TV – target value)
4 Testing

This section provides an experimental evaluation of the proposed method and discusses some of the major issues encountered. The tests were performed using publicly available data sets from the UCI machine learning repository [24]. The chosen datasets contain a mix of continuous, categorical and binary attributes, which makes them suitable for testing the current extension of the CSOM method. Where necessary, preprocessing took place as described at the start of the previous section. Different sets of learning parameters were chosen for each data set. The gain term (α) is used to control the amount of update in the cells. The default update factor (du) is used when the winner's attribute constraint exactly matches the input, so that the neighbors and inhibiting neighbors are updated by this default factor. The initial neighborhood (Ni) corresponds to the size of the area around the winning cell that will be updated to match the input vector more closely. The contract factor (cf) is used for contracting the range in the direction opposite to the update, until the contraction method described in the previous section is adopted. All of the above-mentioned parameters decrease during the training. There are a few other parameters which do not decrease during the training; these are the thresholds for assigning nodes to a cluster and for the merging of clusters and value objects. A common threshold chosen for these parameters is 0.02.
Auto-mpg. This dataset consists of 3 multi-valued discrete attributes and 4 continuous attributes, and the class itself is continuous. The data concern city-cycle fuel consumption in miles per gallon, to be predicted in terms of cylinders, displacement, horsepower, weight, acceleration, model year and origin. The data set consists of 398 instances, of which 265 were used for training and 133 for testing. The following learning parameters were chosen: α = 0.3; du = 0.02; Ni = 6; cf = 0.14. The map size was set to 7 × 7 and it was trained for 24500 iterations. The rule optimization stage using the supervised training set was performed for 100 iterations. The obtained rule set is displayed in Table 2; on the training set, cluster 4 misclassified one instance, which had a class value of 0.95. When tested on the unseen test set, cluster 2 misclassified one instance whereas cluster 4 misclassified three instances. The misclassified instances in all cases had a target value higher than 0.84, and the fact that there was only one such instance in the training set explains why the characteristics of these higher values were not projected on the map. On the other hand, it is also possible that these characteristics overlap with some of the detected clusters, and even if more instances were available they would still be captured by the same cluster.

Table 2. Extracted cluster rules for Auto-mpg data
C0 rule: Class Label: (0.16 – 0.3)
C1 rule: Class Label: (0.18 – 0.36)
C2 rule: Class Label: (0.26 – 0.77)
C3 rule: Class Label: (0.39 – 0.66)
C4 rule: Class Label: (0.29 – 0.75)
C5 rule: Class Label: (0.59 – 0.81)
The class ranges have been automatically determined by merging values that are close to one another (as described in Section 3). In this process the range of the predicting class of a cluster may grow quite large, which strongly depends on the threshold chosen for merging to occur. Hence, it is quite hard to measure the exact number of misclassifications that occurred. We detected 4 misclassified instances because they were more than the merge threshold away from the range of the predicting class for that particular cluster. If a value falls within the merge threshold, it is not treated as a misclassification, since in reality we would like to adjust our knowledge model to the newly arriving data. We have performed a few other test cases where the parameters were set in such a way that the target values and clusters are merged less often. However, this affected the rule optimizing process and we ended up with a much larger number of rules covering smaller class ranges. For simplicity
purposes we have chosen the present test case, in which we detected the most common characteristics that occur frequently together in the input space. It is hard to know whether enough characteristics were isolated and whether some of the grouped characteristics should be split further. For such guidance a domain expert would be useful, and the learning parameters could be adjusted accordingly.

Credit-Screening. This data set is concerned with credit card applications. It consists of 6 continuous and 9 categorical attributes, and the task is to predict whether the credit application was positive or negative. Some of the categorical attributes have many possible categories, which resulted in a total of 43 units in the CSOM input layer. It was noted in [25] that this dataset is both scanty and noisy, causing complex knowledge models that are not very accurate on unseen test cases. The data set consists of 690 instances, of which 460 were used for training and 230 for testing. The following learning parameters were chosen: α = 0.9; du = 0.02; Ni = 8; cf = 0.14. The map size was set to 9 × 9 and it was trained for 3000 iterations. The rule optimization stage using the supervised training set was performed for 100 iterations. The obtained rule set is displayed in Table 3. We can see that many attributes were detected as irrelevant in this case. Many clusters have only a subset of the total of 15 attributes as their rule constraints. Cluster 5 has all the attributes in its rule, but when its constraints and rule coverage are checked we can see that it is too specific, and its existence is probably caused by some noise present in the dataset. When testing the predictive accuracy, 14% of the instances from the whole unseen test set were incorrectly classified. These results are comparable to the results obtained by other inductive learners.

Table 3. Extracted cluster rules for Credit-Screening data
C0 rule: Class Label: ( + )
C1 rule: Class Label: ( - )
C2 rule: Class Label: ( - )
C3 rule: Class Label: ( + )
C4 rule: Class Label: ( - )
C5 rule:
Class Label: ( - )
5 Concluding Remarks

Overall, the results show that the proposed approach is successful in extracting symbolic rules from domains containing mixed data types. For continuous class
attributes, the approach converges to a range that is captured by similar instances rather than setting a range manually. The use of the τ criterion for network pruning was useful in simplifying some of the extracted rules. There were many learning parameters whose change could affect the results greatly, and there is usually some trade-off when a parameter is changed. For example, if the threshold used for assigning map cells to clusters is too small, too many clusters will be formed, and if it is too large, too few will be formed, with a high misclassification rate. This caused us to adopt an approach where the threshold is initially set to a small value and, once clusters are formed, merging between similar clusters and deletion of non-predicting clusters takes place. This was a more favorable approach than searching for an optimal threshold. Overall, the idea of replacing the weights by ranges on links emanating from continuous attributes has proven useful, since the symbolic information is contained in the network links themselves. It would be interesting to see whether a similar idea can be applied to other neural network types.
References 1. Bloomer, W.F., Dillon, T.S., Witten, M.: Hybrid Brainne: A Method for Developing Symbolic Rules from a Hybrid Neural Network. IEEE International Conference on Systems Man and Cybernetics, Beijing China (1996) 14-17 2. Dillon, T.S., Sestito, S., Witten, M., Suing, M.: Automated Knowledge Acquisition Using Unsupervised Learning. Proceedings of the Second IEEE Workshop on Engineering Technology and Factory Automation (EFTA ’93), Cairns (1993) 119-128 3. Gupta, A., Park, S., S. Lam, M.: Generalized Analytic Rule Extraction for Feedforward Neural Networks. IEEE Transactions on Knowledge and Data Engineering 11 (1998) 985–991 4. Hammer, B., Rechtien, A., Strickert, M., Villmann, T.: Rule Extraction from Selforganizing Networks. In: J. R. Dorronsoro (ed.): ICANN’02 (2002) 5. McGarry, K.J., Wermter, S., MacIntyre, J.: Knowledge Extraction from Radial Basis Function Networks and Multi-layer Perceptrons. International Joint Conference on Neural Networks, Washington D.C. (1999) 6. Setiono, R., Leow W.K., Zurada, J.M.: Extraction of Rules from Artificial Neural Networks for Linear Regression. IEEE Transactions on Neural Networks 13 ( 3) (2002) 564 – 577 7. Taha, I.A., Ghosh, J.: Symbolic Interpretation of Artificial Neural Networks. IEEE Transactions on Knowledge and Data Engineering 11 (1998) 448–462 8. Kohonen, T.: The Self-Organizing Map. Proceedings of the IEEE 78 ( 9) (1990) 1464-1480 9. Kaski, S., Nikkilä, J., Kohonen, T.: Methods for Interpreting a Self-Organized Map in Data Analysis. In Michel Verleysen, (ed.): Proceedings of ESANN'98, 6th European Symposium on Artificial Neural Networks, Bruges, D-Facto, Brussels, Belgium (1998) 185-190 10. Rauber, A., Merkl, D.: Automatic Labeling of Self-Organizing Maps: Making a TreasureMap Reveal its Secrets. In Proceedings of the 3rd Pasific-Area Conference on Knowledge Discovery and Data Mining (1999) 11. Siponen, M., Vesanto, J., Simula, O., Vasara, P.: An Approach to Automated Interpretation of SOM. In: Advances in Self-Organizing Maps, Springer (2001) 89-94
12. Ultsch, A.: Knowledge Extraction from Self-Organizing Neural Networks. In: Opitz, O., Lausen, B., Klar, R. (eds.): Information and Classification. Springer-Verlag, Berlin, Germany (1993) 301-306
13. Sestito, S., Dillon, T.S.: Automated Knowledge Acquisition. Prentice Hall of Australia Pty Ltd, Sydney (1994)
14. Hadzic, F., Dillon, T.S.: CSOM: Self Organizing Map for Continuous Data. 3rd International IEEE Conference on Industrial Informatics, Perth (2005)
15. Zhou, X., Dillon, T.S.: A Statistical-Heuristic Feature Selection Criterion for Decision Tree Induction. IEEE Transactions on Pattern Analysis and Machine Intelligence 13 (8) (1991) 834-841
16. Wang, D.H., Dillon, T.S., Chang, E.: Trading Off between Misclassification, Recognition and Generalization in Data Mining with Continuous Features. In: Hendtlass, T., Ali, M. (eds.): Developments in Applied Artificial Intelligence (Proceedings of the Fifteenth International Conference on Industrial & Engineering Applications of Artificial Intelligence & Expert Systems), Lecture Notes in Artificial Intelligence, LNAI 2358, Springer, Cairns (2002) 303-313
17. Abe, S., Sakaguchi, K.: Generalization Improvement of a Fuzzy Classifier with Ellipsoidal Regions. In: Proc. of the 10th IEEE International Conference on Fuzzy Systems (FUZZ-IEEE 2001), Melbourne (2001) 207-210
18. Chen, Z.: Data Mining and Uncertain Reasoning: An Integrated Approach. John Wiley & Sons, Inc., New York (2001)
19. Engelbrecht, A.P.: Computational Intelligence: An Introduction. J. Wiley & Sons, Hoboken, New Jersey (2002)
20. Lecun, Y., Denker, J., Solla, S.: Optimal Brain Damage. In: Touretzky, D.S. (ed.): Advances in Neural Information Processing Systems 2, Morgan Kaufmann, San Mateo, CA (1990) 598-605
21. Goh, T.H.: Semantic Extraction Using Neural Network Modeling and Sensitivity Analysis. Proceedings of the 1993 International Joint Conference on Neural Networks (1993) 1031-1034
22. Tsaih, R.: Sensitivity Analysis, Neural Networks, and the Finance. IEEE International Joint Conference on Neural Networks 6 (1999) 3830-3835
23. Bruner, S., Goodnow, J.J., Austin, G.A.: A Study of Thinking. John Wiley & Sons, Inc., New York (1956)
24. Blake, C., Keogh, E., Merz, C.J.: UCI Repository of Machine Learning Databases. Irvine, CA: University of California, Department of Information and Computer Science (1998) http://www.ics.uci.edu/~mlearn/MLRepository.html
25. Quinlan, J.R.: Simplifying Decision Trees. International Journal of Man-Machine Studies 27 (1987) 221-234
The Application of ICA to the X-Ray Digital Subtraction Angiography Songyuan Tang1,2, Yongtian Wang1, and Yen-wei Chen2 1
Department of Opto-electronic Engineering, Beijing Institute of Technology 100081, Beijing, P.R. China {sytang,wyt}@bit.edu.cn 2 Colleges of Information Science and Engineering, Ritsumeikan University Nojihigashi, Kusatsu, Shiga, Japan
[email protected]
Abstract. The traditional enhancement of X-ray digital subtraction angiography (DSA) is to subtract the mask image from the living image so as to remove the background, such as ribs, spine, catheters and organs, and obtain enhanced vessel trees. However, DSA images suffer from serious motion artifacts, poor local contrast and noise; when the subtraction technique is used, some tiny vessels appear broken or even disappear when visualized. To attack this problem, we use independent component analysis instead of the subtraction technique. This technique proves very efficient for enhancing vessels. Experimental results on simulated data and several clinical data sets show that the proposed method is robust and can obtain good vessel trees.
1 Introduction

X-ray Digital Subtraction Angiography (DSA) is a widely used technique to visualize and examine blood vessels in the human body [1]. Especially for the assessment of coronary artery disease and as a reference during operations, it still remains the "gold standard" today [2]. In this technique, a sequence of two-dimensional (2D) digital X-ray projection images is obtained while a contrast medium is injected into the vessels of interest. The images at the beginning of the sequence do not include any vessels, since the contrast medium has not yet reached them; these are called mask images. The vessels then appear gradually in the rest of the sequence as the contrast medium flows through them; these are called living images. In these X-ray projection images, blood vessels are hardly visible due to the mixture of background structures such as ribs, spine, catheters and organs. Usually, subtraction of the mask image from the living image can remove the background and obtain visible vessels, provided the mask image and the living image are registered and have equal gray-level distributions. However, this is generally not the case, because of human body motion, fluctuations of the X-ray power, and noise in the images. In current commercial DSA devices, only manual pixel shifting is provided for the correction of global translational motion, which is not suitable for coronary DSA. Therefore, many methods have been developed to attack this problem [3]. The template-matching based method [4][5] is one of them and has been proven robust. After the
mask image is registered to the living image, subtraction is currently the sole technique used to enhance the vessels. In this paper, we propose a novel method to enhance the vessels instead of the subtraction technique. The proposed method mainly includes two steps. First, a template-matching method is used to register the mask image to the living image; the registered mask image then serves as the background image. Then, independent component analysis (ICA) is adopted to separate the vessels and the background from the living image and the background image. We assume that the vessel image and the background image are independent signals, and that the living image is a mixture of the vessel signal and the background signal. The problem then becomes how to decompose the vessel signal and the background signal from the living image signal, which is a classical blind source separation problem [6]. Independent component analysis is an effective method for solving the blind source separation problem and has been widely applied. Therefore, we use ICA to separate the background and the vessels. To evaluate the performance of the proposed method, we have compared its results with those acquired by Markov random field segmentation [9] and multiscale filter enhancement [10]. Experts were asked to give visual inspections, since X-ray digital subtraction angiography is mainly used by surgeons to observe the vessels of the human body. The visual inspections show that the result acquired from the proposed method removes the background to a large extent, obtains good vessel trees and keeps the vessels continuous, while Markov random field segmentation and multiscale filter enhancement still retain much of the background, such as catheters and bones.
2 Method

2.1 Template-Matching Method

The template-matching method is based on the assumption that a pixel's displacement in the mask image can be estimated by defining a certain window containing the pixel and by finding the corresponding window in the living image, as Fig. 1 shows. Usually the pixel to be computed is at the center of the window in the mask image, as arrow A indicates in Fig. 1(a); when the corresponding window in the living image is found, the computed pixel moves to the new position indicated by arrow B in Fig. 1(b), and its displacement is determined by the positions A and B. An intensity-based image registration method [7] is used to find the matching window in the living image. When the DSA is acquired from a rigid part of the human body, such as the brain, arms or legs, a rigid transformation is used. If the DSA is taken from the heart, a similarity or affine transformation is used. The energy of the histogram of differences (EHD) measure is selected, since it has been shown to be most adequate for registration in X-ray angiography [3]. The EHD is defined as follows:
    M_EHD = (1/N) Σ_X ( M(X) − L(X) )² ,                                   (1)
where M(X) and L(X) are the corresponding windows in the mask image and the living image, with the intensities in the windows normalized, and N is the total number of pixels in the window. The smaller the EHD, the better the images are registered.
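A minimal Python sketch of this window matching is given below; it implements Eq. (1) as printed (mean of squared differences over normalized windows, with z-score normalization as our assumption) and searches a small displacement neighbourhood by brute force on synthetic data, not on clinical images.

import numpy as np

def ehd(window_m, window_l):
    """Eq. (1): mean squared difference of the two intensity-normalized windows."""
    m = (window_m - window_m.mean()) / (window_m.std() + 1e-8)
    l = (window_l - window_l.mean()) / (window_l.std() + 1e-8)
    return np.mean((m - l) ** 2)

def best_displacement(mask, living, cy, cx, half, search=5):
    """Exhaustively search a (2*search+1)^2 neighbourhood in the living image
    for the displacement of the control point (cy, cx) of the mask image."""
    ref = mask[cy - half:cy + half + 1, cx - half:cx + half + 1]
    best, best_d = np.inf, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = cy + dy, cx + dx
            cand = living[y - half:y + half + 1, x - half:x + half + 1]
            score = ehd(ref, cand)
            if score < best:
                best, best_d = score, (dy, dx)
    return best_d

# Synthetic demo: the "living" image is the "mask" image shifted by (2, -1).
rng = np.random.default_rng(0)
mask = rng.random((64, 64))
living = np.roll(mask, shift=(2, -1), axis=(0, 1))
print(best_displacement(mask, living, 32, 32, half=7))   # -> (2, -1)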
If the displacement of every pixel in the image were calculated, the computational cost would be very high. To reduce the cost, only a limited number of windows are selected, and only a limited number of pixel displacements are computed; these pixels are called control points. The displacements of the remaining pixels in the mask image are interpolated from the displacements of these control points. A uniform control-point grid and thin-plate spline interpolation are used in the proposed method.
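As a sketch of this interpolation step (our own illustration with made-up control-point displacements; it assumes SciPy >= 1.7, whose RBFInterpolator provides a thin-plate spline kernel), the dense displacement field can be obtained as follows:

import numpy as np
from scipy.interpolate import RBFInterpolator   # assumes SciPy >= 1.7

def dense_displacement(control_points, displacements, shape):
    """Interpolate control-point displacements to every pixel with a
    thin-plate spline, one output channel per displacement component."""
    tps = RBFInterpolator(control_points, displacements,
                          kernel="thin_plate_spline")
    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    grid = np.column_stack([ys.ravel(), xs.ravel()])
    return tps(grid).reshape(shape[0], shape[1], 2)

# Hypothetical 3x3 uniform grid of control points on a 65x65 image,
# with small made-up displacements measured by template matching.
pts = np.array([[y, x] for y in (0, 32, 64) for x in (0, 32, 64)], float)
disp = np.random.default_rng(1).normal(0, 1.5, size=(9, 2))
field = dense_displacement(pts, disp, (65, 65))
print(field.shape)    # (65, 65, 2): one displacement vector per pixel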
Fig. 1. (a) a window in mask image (b) the corresponding window in living image
2.2 Vessel Separation by Independent Component Analysis

ICA is a statistical method developed in recent years. The basic ICA problem assumes a linear relation between the observation X and the source S, which can be expressed as:

    X = AS,   X ∈ R^n,  S ∈ R^m,  A ∈ R^(n×m).                             (2)
Each component of S is assumed to have zero mean, to be mutually independent of the others, and to be drawn from a different probability distribution, which is non-Gaussian except for at most one. The ICA technique is to find a transformation W given by:

    Y = WX,   Y ∈ R^m,  X ∈ R^n,  W ∈ R^(m×n).                             (3)
The components of Y should be statistically independent and approach the sources. There are many methods to solve this problem [3]. The fixed-point algorithm [8] is one of them and is computationally very efficient; in this algorithm, non-Gaussianity is measured using approximations to negentropy, which are robust and fast to compute. Therefore, we use it to separate the vessels and the background. In our application, the two-dimensional image signals are turned into vectors of pixel values by row-by-row scanning. The living image and the background image are the observed signals, which constitute the observation X. Here X is 2 × N, where N is the total number of pixels in an image. When the transformation W is obtained by the fixed-point algorithm, the statistically independent Y can be computed, which is also 2 × N and approaches the vessel signal and the background signal.
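The separation step can be sketched in a few lines of Python; here scikit-learn's FastICA stands in for the fixed-point algorithm of [8], and the "background" and "vessel" images are synthetic stand-ins for real DSA data, so the numbers are purely illustrative.

import numpy as np
from sklearn.decomposition import FastICA   # fixed-point ICA, as in [8]

rng = np.random.default_rng(0)
h = w = 64

# Synthetic sources: a smooth "background" and a thin "vessel".
background = rng.normal(0.5, 0.1, size=(h, w))
vessel = np.zeros((h, w))
vessel[:, 30:33] = 1.0

# Observations: the registered mask (background) image and the living image,
# each flattened row by row into a vector of pixel values -> X is 2 x N.
living = 0.9 * background + 0.6 * vessel + rng.normal(0, 0.01, size=(h, w))
X = np.vstack([background.ravel(), living.ravel()])

ica = FastICA(n_components=2, random_state=0)
Y = ica.fit_transform(X.T).T     # Y is 2 x N: the estimated independent components

# One recovered component should correlate strongly with the vessel image.
for comp in Y:
    print(round(abs(np.corrcoef(comp, vessel.ravel())[0, 1]), 2))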
3 Experiment Results
Simulated Data. The simulated data are shown in Fig. 2. Fig. 2(a) is a simulated vessel, (b) is a simulated background, and (c) is the mixture of (a) and (b); here, Fig. 2(b) and (c) play the roles of the registered mask image and the living image. Fig. 2(d) is the result of the subtraction technique: it is easily seen that the simulated vessel is broken at some positions. Fig. 2(e) shows the vessel separated by ICA; the result is very good and the simulated vessel remains continuous.
Fig. 2. Simulated images. (a) simulated vessel; (b) simulated background, which plays the role of the mask image; (c) mixture of vessel and background, which plays the role of the living image; (d) result obtained from the subtraction technique; (e) result obtained by ICA
Fig. 3. Vessel is enhanced by (a) subtraction technique, (b) ICA method
Clinical Data. The clinical data were acquired with a Philips medical system; the image size is 864 × 864. Thirty pairs of images from three sequences are used to test the proposed method. Fig. 1 is an example: Fig. 1(a) and (b) show the mask image and the living image, respectively. Fig. 3(a) and (b) are the results obtained from the subtraction technique and from the independent component analysis method after image registration. On visual inspection, it is easily seen that the vessel is broken at many places in Fig. 3(a); the arrow points to one of them. To show this clearly, we used a global threshold to segment the vessel. The results are shown in Fig. 4. Fig. 4(a) shows
Fig. 4. Vessel extracted by global gray threshold, (a) from subtraction technique, (b) from ICA method
Fig. 5. Vessel is extracted (a) by Markov random field segmentation, (b) by multiscale enhancement filter
the result from the subtraction technique, while Fig. 4(b) shows that from ICA. As arrows A, B and C indicate, the vessel is obviously broken in Fig. 4(a) while it remains continuous in Fig. 4(b). We have also compared the results obtained from Markov random field segmentation and the multiscale enhancement filter, shown in Fig. 5(a) and (b). It is easily seen that the catheters cannot be removed by these methods.
4 Conclusion
The proposed method can effectively enhance blood vessels in X-ray digital subtraction angiography. We have demonstrated that ICA can separate the vessels and the background effectively, keep the vessels continuous and produce good vessel trees.
Acknowledgment This work was partly supported by the National Key Basic Research and Development Program (973) Grant No. 2003CB716105.
References
1. Katzen, B.T.: Current Status of Digital Angiography in Vascular Imaging. Radiologic Clinics of North America 33 (11) (1995) 1-14
2. Cavaye, D.M., White, R.A.: Imaging Technologies in Cardiovascular Interventions. J. Cardiovasc. Surg. 34 (1) (1993) 13-22
3. Meijering, E.H.W., Niessen, W.J., Viergever, M.A.: Retrospective Motion Correction in Digital Subtraction Angiography: A Review. IEEE Transactions on Medical Imaging 18 (1) (1999) 2-21
4. Meijering, E.H.W., Zuiderveld, K.J., Viergever, M.A.: Image Registration for Digital Subtraction Angiography. International Journal of Computer Vision 31 (2/3) (1999) 227-246
5. Taleb, N., Bentoutou, Y., Deforges, O., Taleb, M.: A 3D Space-Time Motion Evaluation for Image Registration in Digital Subtraction Angiography. Computerized Medical Imaging and Graphics 25 (2001) 223-233
6. Hyvarinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. Wiley-Interscience, John Wiley & Sons, Inc. (2001)
7. Zitova, B., Flusser, J.: Image Registration Methods: A Survey. Image and Vision Computing 21 (2003) 977-1000
8. Hyvarinen, A.: Fast and Robust Fixed-Point Algorithms for Independent Component Analysis. IEEE Trans. Neural Networks 10 (3) (1999) 626-634
9. Berthod, M., Kato, Z., Yu, S., Zerubia, J.: Bayesian Image Classification Using Markov Random Fields. Image and Vision Computing 14 (1996) 285-295
10. Frangi, A.F., Niessen, W.J., Vincken, K.L., Viergever, M.A.: Multiscale Vessel Enhancement Filtering. In: Medical Image Computing and Computer-Assisted Intervention, Lecture Notes in Computer Science 1496 (1998) 130-137
Relative Principal Component and Relative Principal Component Analysis Algorithm*

Cheng-Lin Wen¹, Jing Hu², and Tian-Zhen Wang³

¹ Institute of Information and Control, Hangzhou Dianzi University, 310018 Hangzhou, China
[email protected]
² Department of Computer and Information Engineering, Henan University, 475001 Kaifeng, China
[email protected]
³ Department of Electrical Automation, Shanghai Maritime University, 200135 Shanghai, China
Abstract. Aiming at the problems encountered in practical applications of traditional Principal Component Analysis (PCA), the concept of the Relative Principal Component (RPC) and the method of Relative Principal Component Analysis (RPCA) are put forward, and related concepts such as the Relative Transform (RT) and "Rotundity" Scatter are introduced. The new algorithm overcomes some disadvantages of traditional PCA for compressing data when the data exhibit "Rotundity" Scatter. A simulation is used to demonstrate the effectiveness and practicability of the proposed algorithm. The RPCs selected by RPCA are more representative, and the way of choosing RPCs is more flexible, so the new algorithm should find wide application.
1 Introduction
The classical Principal Component Analysis (PCA), applied to a random matrix built from finite time sequences of process variables, is one of the most important methods for statistical control of multivariate processes. Its central idea is to set up a few derived variables, called Principal Components (PCs), while retaining as much as possible of the variation in the original variables [1]. PCA can be used not only to compress and analyze data, but also for fault diagnosis, signal processing, pattern recognition and so on [2]. However, classical PCA suffers from the following problems. (1) The PCs are obtained from the eigenvalues and eigenvectors of the covariance matrix of finite sequences of process variables and are ordered by the magnitude of these eigenvalues, whereas the magnitude of each eigenvalue is tightly tied to the magnitude of the covariance of the corresponding variables, which in turn depends on the corresponding units (for example, one
* Supported by the National Natural Science Foundation of China (No. 60434020, No. 60374020), International Cooperation Item of Henan Province (No. 0446650006), and Henan Province Outstanding Youth Science Fund (No. 0312001900).
meter equals 100 centimeters). Because of these different units, the variables with the largest covariance do not necessarily turn out to be the most important. (2) The number of PCs, and their power to capture the information contained in the original random matrix or its covariance matrix, depend on the degree of difference between the larger and the smaller eigenvalues. Because the random matrices obtained from real systems often fall geometrically into an approximate hyperball, it is difficult to select significant PCs by classical PCA. In this paper, the method of Relative Principal Component Analysis based on the Relative Transform is discussed, which consists of two steps. First, the system random matrix is standardized to remove the scale dependence of PCA. Then weights are introduced on some or all variables, chosen to reflect a priori ideas of the relative importance of the variables.
2 Principal Component Analysis (PCA)
Usually, the essential characteristics and most of the variability of a complex dynamic system that is fully described by many process variables can be captured by a few PCs [3, 4]. Consider the n-dimensional random variables of a dynamic system
x(k) ≡ [x_1(k), x_2(k), …, x_n(k)]^T ∈ R^{n×1};  (1)
a random matrix composed of finite time sequences of the process variables is
X ≡ X(k, k+N−1) = [x(k), x(k+1), …, x(k+N−1)].  (2)
If x_i is defined as the i-th variable time sequence,
x_i = [x_i(k), x_i(k+1), …, x_i(k+N−1)],  (3)
then X(k, k+N−1) can also be written as
X(k, k+N−1) ≡ [x_1, x_2, …, x_n]^T.  (4)
Let the random matrix X have the covariance matrix
Σ_X = E{ [X(k, k+N−1) − E{X(k, k+N−1)}] [X(k, k+N−1) − E{X(k, k+N−1)}]^T }.  (5)
Then the eigenvalues λ_i, i = 1, 2, …, n, and the corresponding orthonormal eigenvectors p_i = [p_i(1), p_i(2), …, p_i(n)]^T are obtained from
|λI − Σ_X| = 0,  (6)
and
[λ_i I − Σ_X] p_i = 0.  (7)
It is convenient to assume, as we may, that λ_1 ≥ λ_2 ≥ … ≥ λ_n ≥ 0 [5].
With the p_i from equation (7), we have
V = P^T X,  (8)
where P = [p_1, p_2, …, p_n]. Analyzing the statistical characteristics of X is then equivalent to analyzing those of V. The first m (m < n) vectors of V are selected as Principal Components (PCs), with which the system can be analyzed.
Property 1. Var(v_i) = p_i^T Σ_X p_i = λ_i, i = 1, 2, …, n; Cov(v_i, v_j) = p_i^T Σ_X p_j = 0, i ≠ j.
Definition 1. The energy of a dynamic system based on the random matrix X is defined as the norm of X, i.e. ||X||₂² = Σ_{i=1}^n Σ_{k=1}^N ||x_i(k)||₂², where ||x_i(k)||₂² = E{x_i(k)²}.
Property 2. PCA conserves the energy of the system random matrix, i.e. ||X||₂² = ||V||₂².
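As a minimal illustration of Eqs. (5)-(8) (not part of the original paper), the PCs of a variables-in-rows matrix can be computed by eigendecomposition of the covariance matrix:

```python
import numpy as np

def pca_scores(X, m):
    """X: n x N random matrix (n variables, N samples). Returns the first m
    principal-component sequences V = P^T X and the sorted eigenvalues."""
    Xc = X - X.mean(axis=1, keepdims=True)     # remove E{X}
    Sigma = Xc @ Xc.T / X.shape[1]             # covariance matrix Sigma_X, Eq. (5)
    lam, P = np.linalg.eigh(Sigma)             # eigenvalues/eigenvectors, Eqs. (6)-(7)
    order = np.argsort(lam)[::-1]              # sort lambda_1 >= ... >= lambda_n
    lam, P = lam[order], P[:, order]
    V = P.T @ Xc                               # Eq. (8)
    return V[:m], lam
```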
3 Standardization Analysis
Aiming at the problem of different units noted in the introduction, many scholars have carried out extensive research, as described in reference [6]. The approach centers on standardizing the random matrix of finite time sequences,
x_i*(k) = ( x_i(k) − E[x_i(k)] ) / [Var(x_i(k))]^{1/2},  i = 1, 2, …, n;  k = 1, 2, …, N.  (9)
The random matrix X* made up of the x_i*(k) is called the standardized random matrix of X.
Property 3. Standardization equalizes the energy of the standardized variables x_i*, i.e. the energy of every variable sequence x_i* is exactly equal: ||x_i*||₂² = ||x_j*||₂², i ≠ j.
Remark 1. The system energy after the transformation, ||X*||₂², does not always equal ||X||₂².
Definition 2 ("Rotundity" Scatter, RS). A random matrix X ∈ M_{n,N} is said to be of "Rotundity" Scatter if and only if the eigenvalues λ_1, λ_2, …, λ_n from equation (6) are approximately equal.
Remark 2. "Rotundity" Scatter is a different concept from the uniform distribution [4]. A random matrix with "Rotundity" Scatter has the following property.
Property 4. A vector set {E[x(k)], E[x(k+1)], …, E[x(k+N−1)]} that is subject to "Rotundity" Scatter forms a hyperball in R^n space.
In a word, the standardization process does eliminate the influence of different units; however, it brings new challenges: the conservation of energy cannot be ensured, and the standardized X* is almost of "Rotundity" Scatter precisely because of the standardization.
4 Relative Principal Component Analysis (RPCA)
Armed with the analysis above, the new concept of the relative principal component (RPC) is now presented, together with the method of relative principal component analysis.
4.1 Relative Transform (RT)
Consider the following matrix
X(1, N) = [ x_1(1), x_1(2), …, x_1(N); x_2(1), x_2(2), …, x_2(N); … ; x_n(1), x_n(2), …, x_n(N) ],
and let E{X} = 0 without loss of generality.
Definition 3 (Relative Transform). Denote
X^R = W · X = diag(w_1, w_2, …, w_n) · X = [ x_1^R(1), …, x_1^R(N); … ; x_n^R(1), …, x_n^R(N) ],  (10)
that is,
x_i^R = w_i x_i,  i = 1, 2, …, n,  (11)
where
w_i = μ_i / m_i.  (12)
We refer to Eq. (10) as the relative transform (RT) of X, where W and X^R are the corresponding RT operator and relative random matrix, respectively. μ_i is the proportion coefficient, which reflects the importance of the variable x_i(k), and m_i is the standardization factor for each original variable x_i(k) or vector x_i. There are several ways to determine m_i, of which Eq. (9) is the most commonly used. The process of RT is shown in Fig. 1.
Fig. 1. The relative transform model: the RT operator W, built from the proportion coefficients μ_i and the standardizing factors m_i, maps X to X^R
Define the correlation coefficient as
ρ{x_i(k), x_j(m)} = Cov{x_i(k), x_j(m)} / √( Var{x_i(k)} Var{x_j(m)} ).  (13)
Property 5. RT does not change the correlation between variables, i.e. ρ{x_i^R(k), x_j^R(m)} = ρ{x_i(k), x_j(m)}.
Condition 1. According to the conservation-of-energy theorem for signals, any transformation from A to B keeps the energy unchanged, namely ||X||₂² = ||X^R||₂².
Although the RT operator itself is unknown, the relative proportion k_1 : k_2 : … : k_n between the coefficients can be obtained as prior knowledge or definite information, for instance by training neural networks. Let
μ_1 : μ_2 : … : μ_n = αk_1 : αk_2 : … : αk_n.  (14)
Here, α is used to ensure the conservation of energy:
||X||₂² = ||X^R||₂² = Σ_{i=1}^n Σ_{k=1}^N [x_i(k)]² = Σ_{i=1}^n Σ_{k=1}^N [x_i^R(k)]² = α² ( Σ_{i=1}^n (k_i² / Var[x_i(k)]) Σ_{k=1}^N {x_i(k) − E[x_i(k)]}² ).  (15)
Furthermore,
α = ( Σ_{i=1}^n Σ_{k=1}^N [x_i(k)]² / Σ_{i=1}^n (k_i² / Var[x_i(k)]) Σ_{k=1}^N {x_i(k) − E[x_i(k)]}² )^{1/2}.  (16)
Some properties concerning the choice of μ_i are listed in the following. (1) The "Rotundity" Scatter of a multivariate sequence matrix X can be successfully adjusted by means of an appropriate relative transform. (2) The RPCs computed from the relative matrix X^R have better performance and a stronger ability to represent the actual system than the PCs computed from the matrix X. (3) The conservation of energy has to be assured, i.e. ||X||₂² = ||X^R||₂².
4.2 Computing RPCs
The RPCs v_1^R, v_2^R, …, v_n^R can be obtained by the following steps.
(1) Compute the covariance matrix Σ_{X^R} of X^R:
Σ_{X^R} = E{ [X^R − E{X^R}] [X^R − E{X^R}]^T }.  (17)
(2) Calculate the relative eigenvalues λ_i^R and the corresponding eigenvectors p_i^R = [p_i^R(1), p_i^R(2), …, p_i^R(n)]^T from
|λ^R I − Σ_{X^R}| = 0  and  [λ_i^R I − Σ_{X^R}] p_i^R = 0,  i = 1, 2, …, n,  (18)
where we suppose λ_1^R ≥ λ_2^R ≥ … ≥ λ_n^R > 0.
(3) Obtain the RPCs. Given the transformation
[v_1^R, v_2^R, …, v_n^R]^T = [p_i^R(j)] · [x_1^R, x_2^R, …, x_n^R]^T,  or  V^R = (P^R)^T X^R,  (19)
select m (m < n) vectors v_1^R, v_2^R, …, v_m^R as the RPCs.
Similar to PCA, the Cumulative Percent Variance (CPV) of the RPC v_i^R is
P_i^R % = ( λ_i^R / Σ_{i=1}^n λ_i^R ) × 100%.  (20)
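A minimal sketch of the whole procedure (our reading of Eqs. (12)-(20), not the authors' code) is given below; it assumes w_i = μ_i/m_i with m_i taken as the standard deviation of x_i, as suggested by Eq. (9), and that the proportion ratios k_i are known a priori.

```python
import numpy as np

def rpca(X, k, m):
    """X: n x N data matrix; k: length-n array of proportion ratios k_i;
    m: number of relative principal components to keep."""
    Xc = X - X.mean(axis=1, keepdims=True)
    var = Xc.var(axis=1)
    # Energy-conserving scale alpha, Eq. (16).
    alpha = np.sqrt((X ** 2).sum() / ((k ** 2 / var) * (Xc ** 2).sum(axis=1)).sum())
    w = alpha * k / np.sqrt(var)                 # RT weights, Eq. (12) with mu_i = alpha * k_i
    XR = w[:, None] * Xc                         # relative matrix X^R, Eqs. (10)-(11)
    Sigma_R = XR @ XR.T / X.shape[1]             # Eq. (17)
    lam, P = np.linalg.eigh(Sigma_R)
    order = np.argsort(lam)[::-1]
    lam, P = lam[order], P[:, order]
    VR = P.T @ XR                                # Eq. (19)
    cpv = 100.0 * lam / lam.sum()                # Eq. (20)
    return VR[:m], cpv
```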
5 RPCA with Application to Data Compression
RPCA can be used to reduce dimension, select auxiliary variables, compress data and extract characteristics of n-variable time sequences. In this section, an example of data compression and character extraction is given, with the objective of showing the influence of μ_i on the system. The parameter setting and simulation result are given in Tables 1 and 2, respectively. Fig. 2 gives a plot of 40 observations on two variables x_1, x_2. If we transform to PCs, we obtain the plot given in Fig. 3, where the ellipse approximates a circle in
Table 1. Parameter setting

System matrix   Observations (N)   Variables (n)   Standardizing
"rotundity"     40                 2               Eq. (9)
Table 2. Simulation result

μ1   μ2   λ1       λ2       λ1R       λ2R      P1 %      P2 %      P1R %     P2R %
1    5    1.2131   0.7869   25.0472   0.9528   60.6573   39.3427   96.3355   3.6645
5    1    1.2131   0.7869   25.0472   0.9528   60.6573   39.3427   96.3355   3.6645
Fig. 2. The distribution of X
Fig. 3. λ1 ≅ λ2 for PCs
Fig. 4. The distribution of X R
Fig. 5. μ 1 : μ 2 = 1 : 5 for RPCs
Fig. 6. μ 1 : μ 2 = 5 : 1 for RPCs
terms of λ1 ≅ λ2. When the proportion coefficients are chosen as μ1 : μ2 = 1 : 5, we obtain the relative eigenvalues λ1R > λ2R, and the first RPC accounts for 96.3355% of the total variation after RPCA. Plots of the data with respect to the relative matrix X^R and the RPCs are given in Fig. 4 and Fig. 5, respectively. It is clear that there is obviously greater variation in the direction of RPC1 than in either of the original variables or the PCs. Therefore, the first RPC alone can be used to interpret most of the variation in X. Similarly, Fig. 6 shows a plot of X^R for RPCs when μ1 : μ2 = 5 : 1 holds. Two observations follow from this simulation. 1. It is difficult or even impossible to pick out PCs when the data are "Rotundity" Scatter. By contrast, RPCA based on the relative transform can change the distribution from "Rotundity" to "Prominent" by changing the characteristic structure; as shown above, Fig. 2 is transformed into Fig. 4 geometrically, which produces more representative elements. 2. RPCA can be applied to a two-variable random matrix with μ1 : μ2 = α1 : α2 (α1, α2 ∈ R) or with μ1 : μ2 = α2 : α1; in both cases the CPVs are the same, as illustrated in Table 2.
6 Conclusion
In this letter, the concept of the RPC has been introduced and the method of RPCA has been effectively implemented. These developments are motivated by the problems encountered in classical PCA, such as the RS of the matrix X. The RPCA approach resolves these problems and possesses the following advantages. (1) RPCA avoids the shortcoming that the bigger the variance of a system variable is, the more it influences the selection of the system PCs; therefore the RPCs may have a stronger representational ability than PCs, while keeping the energy of the system conserved. (2) RPCA can still obtain RPCs when the multivariate sequence matrix X of a system is RS.
However, some problems remain in this new method, for example how to adaptively choose the proportion coefficient μ_i according to the importance of the different variables, and how to choose the standardization factor for different actual dynamic systems. Resolving these key problems effectively will be significant and will promote the development of RPCA in both theory and application.
References 1. Choi, W., Kurfess, T. R.: Dimensional Measurement Data Analysis, Part 1: A zone fitting algorithm. Journal of Manufacturing Science and Engineering 121 (1999) 238-256 2. Welsch, R.E., Is Cross-validation the Best Approach for Principal Component and Ridge Regression?. Proceedings of the 32nd Symposium on the Interface: Computing Science and Statistics, New Orleans (Louisiana) 5-8 April (2000) 3. Liu, Y.: Statistical Control of Multivariate Processes with Applications to Automobile Body Assembly. A dissertation submitted in partial fulfillment of the requirements for the Ph.D. degree in the University of Michigan (2002) 35-39 4. Johnson, R.A., Wichern, D.W.: Applied Multivariate Statistical Analysis, Fourth Edition, Prentice-Hall (1998) 347-387 5. Ding, S., Shi, Z., Liang, Y.: Information Feature Analysis and Improved Algorithm of PCA. Proceeding of the Fourth International Conference on Machine Learning and Cybernetics, Guangzhou, 18-21 August (2005) 6. Jolliffe, I.T.: Principal Component Analysis, Second Edition, Springer (2002) 10-77
The Hybrid Principal Component Analysis Based on Wavelets and Moving Median Filter*

Cheng-lin Wen¹, Shao-hui Fan², and Zhi-guo Chen²

¹ School of Automation, Hangzhou Dianzi University, Hangzhou 310018, China
[email protected]
² School of Computer and Information Engineering, Henan University, Kaifeng 475001, China
[email protected]
Abstract. The data obtained from any process may be corrupted with noise and outliers which may lead to false-alarm when applying conventional PCA to process monitoring. To overcome the above mentioned limitations of conventional PCA, an approach is developed by combining the ability of wavelets and moving median filter with PCA. This method utilizes the quality of wavelets and moving median filter to preprocess the data to eliminate noise and outliers. At last, this method is applied to fault detection and has a good effect which proves the method is effective and feasible.
1 Introduction
When PCA is used for real industrial process monitoring, the data used to build the PCA model are usually collected during normal process operation. However, the data obtained from any process may contain random errors or outliers. Consider the following measurement model:
z(k) = H(k) x(k) + v(k),  k = 1, 2, …,  (1)
where k is the sampling instant, z(k) ∈ R^{n×1} is the signal contaminated with noise, x(k) ∈ R^{n×1} is the signal of interest, H(k) ∈ R^{n×n} is the measurement matrix, and v(k) ∈ R^{n×1} is the measurement noise. If data with noise are used for PCA modeling, the PCA model parameters may be disturbed by noise and outliers. This can lead to false alarms and significantly compromise the reliability of the monitoring system [3]. In order to overcome the limitations of conventional PCA when processing data with noise and outliers, a method is proposed that preprocesses the process data before applying PCA, combining the ability of wavelets with a moving median filter to
* Supported by the National Natural Science Foundation of China (No. 60434020, No. 60374020), International Cooperation Item of Henan Province (No. 0446650006), and Henan Province Outstanding Youth Science Fund (No. 0312001900).
eliminate influences such as time-varying, uncertain and unsteady behaviors, thereby ensuring the validity and precision of the result.
2 Principal Component Analysis (PCA)
PCA projects the original high-dimensional information into a low-dimensional subspace while preserving the main information. Although all n components are needed to represent the full variability of the system, most of the variability can often be represented by m (m ≤ n) principal components. The data set, with columns representing the data from N different samples and rows representing the n different variables, can then be compressed into the data of m principal components from the N samples [5].
2.1 PCA Fundamentals
Consider a training data set [7]:
X = [x(1), …, x(N)],  x(k) = [x_1(k), …, x_n(k)]^T  (k = 1, …, N),  (2)
where n represents the number of variables and N the number of samples of each variable. First, the data matrix X ∈ R^{n×N} is scaled to zero mean and unit variance, yielding the normalized matrix X*. The matrix X* can be decomposed as the sum of the outer products of n pairs of vectors, namely
X* = [q_1, q_2, …, q_n] [r_1; r_2; …; r_n] = Σ_{i=1}^n q_i r_i,  (3)
where r_i ∈ R^{1×N} (i = 1, 2, …, n) is defined as the score vector (principal component) and q_i ∈ R^{n×1} as the loading vector. The kernel of the PCA technique is the SVD (Singular Value Decomposition). Performing the SVD on the matrix X*,
X* = Σ_{i=1}^n σ_i q_i γ_i^T,  (4)
where q i and γ i are the eigenvectors of X * ( X * ) T and ( X * ) T X * , respectively, and σ i are the singular values (the non-negative square-roots of the eigenvalues of X * ( X * ) T ).
Let σ_i γ_i^T = r_i; then
X* = Σ_{i=1}^n σ_i q_i γ_i^T = Σ_{i=1}^n q_i r_i = Σ_{i=1}^m q_i r_i + Σ_{i=m+1}^n q_i r_i = Σ_{i=1}^m q_i r_i + E,  (5)
where E is the error matrix, which can be ignored without any evident loss of useful information.
2.2 Determination of the Number of PCs and Control Limit for the T² Statistic
A key issue in PCA modeling is to select a proper number of PCs. If too few PCs are retained, the model is poor and gives an incomplete representation of the process. On the contrary, if more PCs are retained than necessary, the model is over-parameterized and includes a significant amount of noise. This paper uses the CPV (Cumulative Percent Variance) method to select the number of PCs. The CPV is a measure of the percent variance captured by the first m PCs:
CPV(m) = 100 ( Σ_{i=1}^m λ_i / Σ_{i=1}^n λ_i ) %,  (6)
where λ_i is the i-th eigenvalue of the covariance matrix. The Hotelling T² statistic is usually used for process monitoring after the PCA model has been built. If the real-time data and the modeling data are both obtained under normal operation, the T² statistic will stay below the control limit of the PCA model. Under normal operation, T² should satisfy
T² = Σ_{i=1}^m r_i² / S_{r_i}² < UCL,  (7)
where r_i is the i-th component of the score vector, S_{r_i}² is the estimated variance of r_i, and
UCL = [ m(n² − m) / (n(n − m)) ] · F_α(m, n − m).  (8)
UCL is the upper control limit for the T² statistic, m is the number of PCs, n is the number of variables, and F_α(m, n − m) is the F value with significance level α and degrees of freedom m and n − m. If the process is under abnormal operation, then T² > UCL.
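The monitoring scheme of Eqs. (6)-(8) can be sketched as follows; this is only an illustration (not the authors' code), and it follows the control-limit formula exactly as written in Eq. (8).

```python
import numpy as np
from scipy.stats import f

def build_t2_monitor(X, cpv_target=0.9, alpha=0.05):
    """X: n x N training data from normal operation, variables in rows."""
    mean, std = X.mean(axis=1, keepdims=True), X.std(axis=1, keepdims=True)
    Xs = (X - mean) / std                                 # zero mean, unit variance
    lam, P = np.linalg.eigh(Xs @ Xs.T / X.shape[1])
    order = np.argsort(lam)[::-1]
    lam, P = lam[order], P[:, order]
    m = int(np.searchsorted(np.cumsum(lam) / lam.sum(), cpv_target)) + 1  # CPV rule, Eq. (6)
    n = X.shape[0]
    ucl = m * (n ** 2 - m) / (n * (n - m)) * f.ppf(1 - alpha, m, n - m)   # Eq. (8)

    def t2(x):
        """Hotelling T^2 of one new sample x (length n), Eq. (7)."""
        r = P[:, :m].T @ ((x.reshape(-1, 1) - mean) / std).ravel()
        return float(np.sum(r ** 2 / lam[:m]))            # S_ri^2 estimated by lambda_i

    return t2, ucl
```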
3 De-noising Approach Based on Wavelet Analysis
Wavelet analysis is a local transform in both the time domain and the frequency domain; it provides good time-frequency localization and can therefore extract information from signals effectively [4]. Based on the multi-resolution framework, Mallat presented multi-resolution decomposition and reconstruction algorithms. Processed by MRA, process signals are decomposed into wavelet coefficients D_j (1 ≤ j ≤ L) and the coarsest scale coefficient C_L; D_j and C_L are used to obtain the scaling signals C_j (1 ≤ j ≤ L) by the reconstruction algorithm. Any signal can be decomposed as
f(t) = Σ_{k∈Z} C_{L,k} φ_{L,k}(t) + Σ_{j=1}^L Σ_{k∈Z} D_{j,k} ψ_{j,k}(t),  (9)
where φ(t) is the scaling function and ψ(t) is the mother wavelet. The coefficients are obtained by
C_j = H C_{j−1},  D_j = G C_{j−1}  (j = 1, 2, …, L),  (10)
where H and G are the low-pass and high-pass filters, respectively. The reconstruction formula is
C_{j−1} = H C_j + G D_j  (j = 1, 2, …, L).  (11)
At present, there are many wavelet-based de-noising methods for one-dimensional signals. This paper adopts the thresholding method [6]. The procedure has three steps: 1. Decomposition: choose the wavelet and the decomposition level, compute the wavelet decomposition of the signal down to level L, and obtain the wavelet coefficients. 2. Thresholding: choose a threshold for each level from the first level to level L and shrink the wavelet coefficients accordingly. 3. Reconstruction: compute the reconstructed signal from the modified wavelet coefficients and the coarsest scale coefficient. It is necessary to choose a proper wavelet, determine the best decomposition level and select a proper threshold; selecting the threshold is the most important. There are two ways to treat the wavelet coefficients: hard thresholding and soft thresholding. The key issue of both is to find a proper value for the threshold τ: coefficients that exceed τ are retained or shrunk, and those within τ are set to zero.
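A minimal sketch of these three steps (not the authors' code) using PyWavelets with a Daubechies wavelet and soft thresholding is given below; the universal threshold used here is one common choice and is an assumption, since the paper does not state its threshold rule.

```python
import numpy as np
import pywt

def wavelet_denoise(x, wavelet="db4", level=3):
    coeffs = pywt.wavedec(x, wavelet, level=level)            # 1. decomposition
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745            # noise estimate (assumption)
    tau = sigma * np.sqrt(2.0 * np.log(len(x)))               # universal threshold (assumption)
    coeffs = [coeffs[0]] + [pywt.threshold(d, tau, mode="soft") for d in coeffs[1:]]  # 2. shrink
    return pywt.waverec(coeffs, wavelet)[: len(x)]            # 3. reconstruction
```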
4 Moving Median Filter
The moving median (MM) filter is used to deal with signals contaminated with outliers; observations that exceed five standard deviations are considered outliers. In this nonlinear signal-processing technique, the median of a window containing an odd number of observations is computed while the window slides over the entire one-dimensional signal [1, 2]. The filter is described as follows. Let the window size be w = 2l + 1 and the number of observations be N, i.e., the observed values are x(1), x(2), …, x(N) with N >> w. As the window slides over the observation sequence, the output of the MM filter is
med(x(k)) = x̃(l + 1),  w = 2l + 1,  (12)
where x̃(l) denotes the l-th largest of the 2l + 1 observed values in the window. The MM filter thus re-orders the 2l + 1 observations in the window by magnitude and outputs the middle value of the ranked data. Based on this definition, the relation between the input x(k) and the output y(k) of the MM filter is
y(k) = med( x(k − l), …, x(k), …, x(k + l) ),  k ∈ Z.  (13)
To avoid special handling of the boundary, both ends of the input signal can be extended. If the length of the signal is N, the extended signal is
x(k)' = x(1) for 1 − l ≤ k ≤ 0;  x(k) for 1 ≤ k ≤ N;  x(N) for N + 1 ≤ k ≤ N + l.  (14)
Applying the MM filter to the extended signal gives the output
y(k) = med( x(k − l)', …, x(k)', …, x(k + l)' ),  1 ≤ k ≤ N.  (15)
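Eqs. (12)-(15) translate directly into a few lines of code; the sketch below (not the authors' code) repeats the end samples to realize the extension of Eq. (14).

```python
import numpy as np

def moving_median(x, l=2):
    """MM filter with window size w = 2l + 1."""
    xe = np.concatenate([np.full(l, x[0]), x, np.full(l, x[-1])])             # Eq. (14)
    return np.array([np.median(xe[k:k + 2 * l + 1]) for k in range(len(x))])  # Eq. (15)
```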
5 The Hybrid Principal Component Analysis Based on Wavelets and Moving Median Filter The method of this paper is: 1. Apply wavelets to de-noise the measured process signals contaminated with noise. Get the data when a process is under normal condition. For each variable in the data matrix, compute the wavelet decomposition and get the wavelet coefficients. This paper adopts Daubechies wavelets and uses non-linear soft thresholding approach to de-noise for each scale. Reconstruct the signal from the selected and thresholded coefficients for each variable. Put the reconstructed signal of all variables together.
2. Use MM filtering to reject outliers in the data of each variable from step 1. 3. After the process data have been preprocessed as above, normalize them: the initial data matrix is scaled to zero mean and unit variance. The PCA model can then be built from the normalized data matrix, and the CPV method is applied to determine the number of PCs.
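Putting the three steps together, a minimal sketch (not the authors' code) could reuse the wavelet_denoise and moving_median helpers sketched above, together with the build_t2_monitor routine sketched in Sect. 2:

```python
import numpy as np

def preprocess(X, level=3, l=2):
    """Apply wavelet de-noising and MM filtering to each variable (row) of X."""
    return np.vstack([moving_median(wavelet_denoise(row, level=level), l=l) for row in X])

# Usage sketch: build the model on preprocessed normal data, then flag faults.
# t2, ucl = build_t2_monitor(preprocess(X_train))
# fault = t2(x_new) > ucl
```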
6 Simulation Study
To show that the proposed method is effective and feasible, we apply it to a real system: the annular soldering between the electron tube and the electron tube yoke in the assembly of a rotation axletree. The input of the auto-assembling machine must be controlled within a given operating range so as to yield good soldering quality. To control this process, the engineer must measure four pivotal variables: x1: voltage (volts), x2: current (amps), x3: feed flow rate (in/min), x4: airflow (cfm). In the simulation, process data over a certain period are collected with a sampling interval of 5 seconds. The first 64 samples, collected under normal operating conditions, are used to build the PCA model. After that, further samples are collected as test data, the test data are preprocessed, and PCA is applied to the test data matrix. The variable x3 has a sensor fault after the 180th sample. The mean square error (MSE) is used to evaluate the performance of the algorithms:
MSE = (1/N) Σ_{k=1}^N ( x(k) − x̂(k) )²,  (16)
where x(k) is the original signal, x̂(k) is the de-noised signal and N is the number of samples. The smaller the MSE value, the better the performance of the algorithm. Here we use the test data of x4 to compute the MSE values. The window size of the MM filter is five and the decomposition level of the wavelet-based filter is three. Table 1 shows the MSE of the estimate of the original signal for the two algorithms.

Table 1. MSE comparison of the two algorithms

Corresponding algorithm        MSE
Wavelets method                0.0023
Wavelets + MM filter method    0.0009
It is evident that the proposed method has a lower MSE than the wavelet-only method; using both the wavelet-based filter and the MM filter clearly improves the performance.
We can get three T 2 charts monitored by conventional PCA, PCA based on wavelets, and PCA based on wavelets and moving median filter, respectively. The T 2 scores are shown as solid lines and the 95% control limits are given as dashed lines.
Fig. 1. T² statistics of conventional PCA
Fig. 2. T² statistics of PCA based on wavelets
Fig. 3. T² statistics of PCA based on wavelets and moving median filter
The following table compares the number of false-alarm points in the three plots above.
Table 2. Number of false-alarm points comparison

Method                          Number of false-alarm points
Conventional PCA                4
Wavelets + PCA                  1
Wavelets + MM filter + PCA      0
From the comparison it is easy to see that conventional PCA fires many false-alarm points that exceed the T² control limit while the process is under normal conditions; it is overly sensitive to normal process variation. The number of false-alarm points is reduced by using PCA based on wavelets, and is reduced to zero by using PCA based on wavelets and the MM filter. The last method, which uses both the moving median and wavelets, therefore performs best: it can reduce or remove false alarms and detect faults effectively.
7 Conclusion
To overcome the limitations of conventional PCA when processing data with noise and outliers, a method has been proposed that preprocesses the real process data before applying PCA, combining wavelets with a moving median filter to eliminate influences such as time-varying, uncertain and unsteady behaviors, thereby ensuring the validity and precision of the result. The data obtained from an industrial process inevitably contain such behaviors, and these sources of uncertainty may lead to incorrect conclusions if the data are analyzed without accounting for their effects; hence it is important to preprocess the process data. This paper has shown how to perform fault detection using PCA based on wavelets and the MM filter. The simulation results show that conventional PCA fires too many false alarms and that the proposed method overcomes this disadvantage well; the method is effective and feasible for fault detection.
References 1. Lago, C.L., Juliano, V.F., Kascheres, C.: Applying Moving Median Digital Filter to Mass Spectrometry and Potentiometric Titration. Analytica Chimica Acta, 310 (1995) 281-288 2. Doymaz, F., Bakhtazad, A., Romagnoli, J.A., Palazoglu, A.: Wavelet-Based Robust Filtering of Process Data. Computers and Chemical Engineering 25 (2001) 1549-1559 3. Yang, Q.: Model-Based and Data Driven Fault Diagnosis Methods with Applications to Process Monitoring. Case Western Reserve University (2004) 4. Li, J., Tang, Y.: The Application of Wavelets Analysis Method. Chongqing: Chongqing University Press (1999) 5. Zhang, J., Yang, X.: Multivariate Statistical Process Control. Beijing: Chemical Industry Press (2000) 6. Donoho, D.L.: De-Noising by Soft-Thresholding. IEEE Trans. on Information Theory 41 (3) (1995) 613-627 7. Cao, J.: Principal Component Analysis Based Fault Detection and Isolation. George Mason University (2004)
Recursive Bayesian Linear Discriminant for Classification D. Huang and C. Xiang Department of Electrical and Computer Engineering, National University of Singapore, Singapore
[email protected]
Abstract. Extracting proper features is crucial to the performance of a pattern recognition system. Since the goal of a pattern recognition system is to recognize a pattern correctly, a natural measure of “goodness” of extracted features is the probability of classification error. However, popular feature extraction techniques like principal component analysis (PCA), Fisher linear discriminant analysis (FLD), and independent component analysis (ICA) extract features that are not directly related to the classification accuracy. In this paper, we present two linear discriminant analysis algorithms (LDA) whose criterion functions are directly based on minimum probability of classification error, or the Bayes error. We term these two linear discriminants as recursive Bayesian linear discriminant I (RBLD-I) and recursive Bayesian linear discriminant II (RBLD-II). Experiments on databases from UCI Machine Learning Repository show that the two novel linear discriminants achieve superior classification performance over recursive FLD (RFLD).
1
Introduction
Feature extraction is a crucial step in the designing of a pattern recognition system. Since the goal of a pattern recognition system is to recognize a pattern correctly, a measure of "goodness" of the extracted features is the probability of classification error, i.e. the extracted set of features should be the one with which the classification result is as close to the minimum probability of classification error, or the Bayes error, as possible. Various feature extraction algorithms have been proposed in the past. Among them, linear subspace analysis techniques, such as principal component analysis (PCA) [1], Fisher linear discriminant analysis (FLD) [2,3], and independent component analysis (ICA) [4,5], have become popular due to their simplicity. PCA extracts features that minimize the reconstruction error. ICA extracts features that are statistically independent, or as independent as possible. While PCA and ICA are unsupervised techniques, FLD uses the class information to extract a subspace that maximizes the ratio of between-class scatter to the within-class scatter. In spite of the popularity of these linear subspace analysis techniques, none of them is based on a criterion that is directly related to the probability of classification error. As the optimal subspace should be selected such that the resulting
probability of classification error is minimal, in this paper we derive two linear subspace analysis algorithms whose criterion is based on the Bayes error. We first present the derivation of the two novel LDA's in Sections 2 and 3. Some discussion of the two LDA's is then given in Section 4. The superiority of the two novel LDA's is experimentally demonstrated in Section 5. The final section offers concluding remarks.
2
The Criterion Based on the Bayes Error
To derive a criterion function which is directly related to the Bayes error, we need to derive the mathematical expression for the probability of classification error. We first consider the simplest case of two normally distributed classes with equal covariance matrices. The probability of classification error in the direction of feature vector w can be expressed as follows:
F(w) = P(C1) ∫_{(x0 − μ1)/σ}^{∞} (1/√(2π)) exp(−x²/2) dx + P(C2) ∫_{(μ2 − x0)/σ}^{∞} (1/√(2π)) exp(−x²/2) dx,  (1)
where P (Ci ) is the a priori probability of class Ci , μi and σ 2 is the mean and variance after projection to the feature vector w: μi = wT μi ,
(2)
σ 2 = wT Σw,
(3)
where μ_i and Σ are the mean and covariance matrix of the class C_i. Without loss of generality, we assumed μ1 ≤ μ2 in (1). From Bayesian decision theory [3], x0 in (1) is determined by
P(C1) (1/√(2πσ²)) exp(−(x0 − μ1)²/(2σ²)) = P(C2) (1/√(2πσ²)) exp(−(x0 − μ2)²/(2σ²)),  (4)
which can be simplified to
x0 = (μ1 + μ2)/2 − σ² log( P(C2)/P(C1) ) / (μ2 − μ1).  (5)
Introducing (5) for x0 into (1), (1) can then be written in the following form:
F(w) = 1/2 − 1/2 [ P(C1) erf( (μ2 − μ1)/(√8 σ) − σ log(P(C2)/P(C1)) / (√2 (μ2 − μ1)) ) + P(C2) erf( (μ2 − μ1)/(√8 σ) + σ log(P(C2)/P(C1)) / (√2 (μ2 − μ1)) ) ],  (6)
where erf is the error function of the normal distribution. Minimizing the criterion function (6) is equivalent to maximizing the following criterion function:
J(w) = P(C1) erf( (μ2 − μ1)/(√8 σ) − σ log(P(C2)/P(C1)) / (√2 (μ2 − μ1)) ) + P(C2) erf( (μ2 − μ1)/(√8 σ) + σ log(P(C2)/P(C1)) / (√2 (μ2 − μ1)) ),  (7)
where J(w) + 1 is actually twice the probability of correct classification. The term log(P(C2)/P(C1)) is usually small, since the a priori probabilities of the different classes are not very different. Hence the criterion function (7) can be approximated as
J(w) = erf( (μ2 − μ1)/(√8 σ) ).  (8)
The criterion function in (8) is only for the case when w is a single feature vector. To generalize it for a subspace with dimension greater than one, note that the term (μ2 − μ1)/σ in (8) is the Mahalanobis distance between the 2 class means after projection onto w. For problems where the number of classes is more than 2, the criterion function (8) can be generalized as:
J(w) ≈ Σ_{i<j} erf( h_ij / √8 ),  (9)
where h_ij is the Mahalanobis distance between class means in the projection subspace:
h_ij = √( (μ_i − μ_j)^T Σ^{−1} (μ_i − μ_j) ).  (10)
J(w) in (9) represents a measure of the probability of correct classification. We can see from (9) that the probability of correct classification increases as the Mahalanobis distances between class means in the extracted subspace increase, which makes sense intuitively. Unfortunately, there is no closed form solution to the maximization of (9).
3
Recursive Bayesian Linear Discriminant (RBLD)
Although there’s no closed form solution for the maximization of criterion function (9), we can, however, obtain some approximate solution by approximating (9) by some simpler function [6]. And the new function would offer a closed form solution. In this section, we describe two approximate criterion functions of (9) and their closed form solutions.
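For reference, the exact criterion (9) is easy to evaluate for any fixed direction w; the following small sketch (not from the paper) computes it for the equal-covariance Gaussian model assumed above.

```python
import numpy as np
from scipy.special import erf

def criterion_J(w, means, Sigma):
    """Evaluate Eq. (9): means is a list of class mean vectors, Sigma the common covariance."""
    w = np.asarray(w, dtype=float)
    sigma = np.sqrt(float(w @ Sigma @ w))                    # projected std. deviation, Eq. (3)
    J = 0.0
    for i in range(len(means)):
        for j in range(i + 1, len(means)):
            h_ij = abs(w @ (means[i] - means[j])) / sigma    # projected Mahalanobis distance
            J += erf(h_ij / np.sqrt(8.0))
    return J
```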
3.1 RBLD-I
If we assume h_ij does not change much before and after projection, one very simple solution is to approximate the erf function in (9) as follows:
erf( h_ij/√8 ) ≈ [ erf( h_ij/√8 ) / h_ij² ] · h_ij² = [ erf( h_ij/√8 ) / h_ij² ] · (w^T S_ij w) / (w^T S_W w),  (11)
where S_ij and S_W are given by
S_ij = (μ_i − μ_j)(μ_i − μ_j)^T,  (12)
S_W = S_i = (1/N_i) Σ_{x∈C_i} (x − μ_i)(x − μ_i)^T.  (13)
With this approximation, (9) can then be transformed into the following form:
J(w) = Σ_{i<j} (w^T S^1_Bij w) / (w^T S_W w) = Σ_{i<j} [ w^T erf( h_ij/√8 ) h_ij^{−2} S_ij w ] / (w^T S_W w),  (14)
where S^1_Bij = erf( h_ij/√8 ) h_ij^{−2} S_ij.
Note that the assumption that the Mahalanobis distances are similar before and after projection into the extracted subspace may not always be valid. This problem can be solved by adopting a recursive algorithm: at each iteration, we extract a set of features and calculate the Mahalanobis distances between class means before and after the projection into the extracted subspace. If the Mahalanobis distance between any two class means changes, we use the newly extracted subspace as the starting point and go through another iteration; if the Mahalanobis distances between class means do not change, we halt the iteration.
3.2 RBLD-II
The derivation of the criterion function for RBLD-II is as follows:
∂J(w)/∂w = Σ_{i<j} ∂ erf( h_ij/√8 )/∂w = (2/√π) Σ_{i<j} ∂/∂w ∫_0^{h_ij/√8} e^{−x²} dx = (1/√(2π)) Σ_{i<j} e^{−h_ij²/8} (h_ij²)^{−1/2} ∂h_ij²/∂w.  (15)
Since ∂J(w)/∂w = 0, it follows that
Σ_{i<j} e^{−h_ij²/8} (h_ij²)^{−1/2} ∂h_ij²/∂w = 0.  (16)
Since it is very difficult to derive a closed form solution to (16), the following approximation is used:
Σ_{i<j} e^{−h_ij²/8} (h_ij²)^{−1/2} ∂h_ij²/∂w ≈ Σ_{i<j} e^{−h̄_ij²/8} (h̄_ij²)^{−1/2} ∂h_ij²/∂w = 0,  (17)
where h̄_ij denotes the Mahalanobis distance computed before projection, treated as a constant with respect to w. Then the solution can be written as
Σ_{i<j} e^{−h̄_ij²/8} (h̄_ij²)^{−1/2} ∂/∂w [ (w^T S_ij w)/(w^T S_W w) ] = ∂/∂w [ (w^T S_BB w)/(w^T S_W w) ] = 0,  (18)
where
S_BB = Σ_{i<j} S²_Bij,  (19)
and
S²_Bij = e^{−h̄_ij²/8} (h̄_ij²)^{−1/2} S_ij.  (20)
3.3 Incorporation of a Recursive Strategy
To overcome the feature number limitation inherent in FLD, we adopted the recursive feature extraction strategy termed RFLD proposed by Xiang et al. [7,8], in which a set of C − 1 features is extracted at each iteration.
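A minimal sketch of one extraction step (our reading of Eqs. (12)-(14) and (19)-(20), not the authors' code) builds the weighted between-class scatter and solves the generalized eigenproblem; the recursive re-weighting of Sections 3.1 and 3.3 would simply call this repeatedly on the residual feature space.

```python
import numpy as np
from scipy.special import erf
from scipy.linalg import eigh

def rbld_step(X, y, n_components, variant="I"):
    """X: (N, d) samples, y: labels. Returns a d x n_components projection."""
    classes = np.unique(y)
    means = [X[y == c].mean(axis=0) for c in classes]
    d = X.shape[1]
    # Within-class scatter under the paper's equal-covariance assumption.
    SW = sum(np.cov(X[y == c].T, bias=True) for c in classes) / len(classes)
    SW_inv = np.linalg.pinv(SW)
    SB = np.zeros((d, d))
    for i in range(len(classes)):
        for j in range(i + 1, len(classes)):
            diff = (means[i] - means[j])[:, None]
            h2 = float(diff.T @ SW_inv @ diff)         # squared Mahalanobis distance
            if variant == "I":
                wgt = erf(np.sqrt(h2 / 8.0)) / h2      # weight of Eq. (14)
            else:
                wgt = np.exp(-h2 / 8.0) / np.sqrt(h2)  # weight of Eq. (20)
            SB += wgt * (diff @ diff.T)
    # Sec. 5.1 rescales the weights by a common constant to avoid underflow;
    # this is omitted here because it does not change the eigenvectors.
    lam, W = eigh(SB, SW)                              # generalized eigenproblem S_B w = lam S_W w
    return W[:, np.argsort(lam)[::-1][:n_components]]
```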
4 Discussion
4.1 Comparison of the Two RBLD's and Their Relation to FLD
The criterion function of FLD is
J(w) = (w^T S_B w) / (w^T S_W w),  (21)
where the between-class scatter matrix S_B is defined as
S_B = Σ_{i<j} (μ_i − μ_j)(μ_i − μ_j)^T = Σ_{i<j} S_ij,  (22)
if the a priori probabilities of different classes are assumed to be equal. Comparing the formulation of SB of the two RBLD’s to that of FLD, we can observe that the two approximate discriminants put different weighting factors for Sij . The two weighting factors have the same property that they suppress the influence of far distant classes, or in other words, they put an emphasis on close classes. This makes sense intuitively as close classes are more likely to generate classification errors and therefore require more emphasis than distant classes. In contrast, FLD seeks projection directions that maximize the sum of squared Mahalanobis distances between class means. Hence, the projection directions
extracted by FLD are over-influenced by far apart classes. This is illustrated by a simple example in Fig.1(a). However, note that when there are only 2 classes, or the Mahalanobis distances between classes are roughly equal, the weighting factors do not play a role and the two RBLD’s become equivalent to FLD. To see the difference of these two weighting factors, we plot out their magnitudes with respect to the Mahalanobis distance, as shown in Fig.1(b). We can see from Fig.1(b). that RBLD-II puts a larger weight on close classes compared to RBLD-I.
Fig. 1. (a) Left: A simple 2-D illustration to show the superiority of the novel linear discriminants over FLD; (b) Right: The two weighting factor as a function of Mahalanobis distance
4.2
Limitation of the Two RBLD’s
In our derivation for the criterion function, we have made the assumption that all the classes have a Gaussian distribution with equal variances. This assumption is made to facilitate the analysis of pattern classification problems. But unfortunately, problems that we encounter in our real life often don’t satisfy this assumption. For example, for face recognition problems, while it is reasonable to believe that different classes have similar variances, the density distribution of the classes is not Gaussian. So it is important to bear in mind that, while features extracted by the two RBLD’s are nearly optimal for classification when all the classes have a Gaussian distribution with equal variances, the two RBLD’s may not be able to extract optimal features if the assumption is violated.
5 Experiments
5.1 An Implementation Problem Due to Numerical Error
Note that the terms h_ij^{−2} in (14) and e^{−h_ij²/8} (h_ij²)^{−1/2} in (20) approach 0 when h_ij becomes large. In our experiments we found that all these weighting factors were numerically 0, and as a result S_B was 0. The solution to this problem is to scale the weighting factors by multiplying them with a common constant, since only the relative magnitude of the weighting factors is significant to the solution of the approximate Bayesian recursive linear discriminants, and multiplying them by a common constant does not influence the solution. An appropriate choice for the two constants in the first and second approximation could be E{h_ij²} and E{h_ij} e^{E{h_ij²}/8}, respectively.
5.2
Experimental Results
The performance of the two RBLD's as well as RFLD is tested over databases from the UCI Machine Learning Repository [9]. To demonstrate the effectiveness of the two RBLD's, we have chosen 7 databases with sizes ranging from small to large: wine, vehicle, glass, optdigits, segmentation, zoo and iris. The "leave-one-out" strategy [1,3] is employed for the wine, zoo and iris databases, "5-fold cross validation" for glass, and "9-fold cross validation" for vehicle. For segmentation and optdigits, which come with separate training and test sets, we used the training set for training and the test set for performance evaluation. Because the purpose of this paper is to show the effectiveness of the Bayesian approach, we used the simple nearest neighbor classifier; the two RBLD's can readily be combined with more advanced classifiers such as neural networks or SVMs to improve the classification performance. Classification accuracy is used as the performance index for evaluating the different feature extraction algorithms. Classification accuracies for numbers of extracted features from 1 to N (N being the maximum number of features obtainable) are obtained from the experiments, and the maximum classification accuracies are tabulated in Table 1.

Table 1. Maximum classification accuracy (%) of different linear discriminant analysis algorithms. The last three columns are characteristics of the databases: NC is the number of classes, NF the number of features, and N the number of samples.

Databases      RFLD   RBLD-I  RBLD-II  NC  NF  N
wine           98.88  98.88   98.88    3   13  178
zoo            97.03  98.02   99.01    7   16  101
iris           96.67  98.67   96.67    3   4   150
vehicle        78.13  78.49   78.13    4   8   946
glass          60.28  66.36   67.76    6   9   214
optdigits      97.89  98.33   98.28    10  64  5620
segmentation   93.67  94.24   94.33    7   19  2310
From Table. 1, we can see that classification accuracies for wine database are the same for the three algorithms, while for the other 6 databases, the two RBLD’s give rise to a better performance. Thus, we can infer that the classes from the wine databases are probably equally separated, while not so for the other databases. To verify this, we select to visualize the sample distributions for wine, glass, and segmentation in Fig. 2, 3, and 4. To see the sample distributions in original space, we used PCA to extract the the first two principal components and project the samples into this 2D subspace. To visualize the sample distributions in the transformed space extracted by the three linear discriminants, we use the first two features extracted by the discriminants. The sample distributions in original space, in the subspaces extracted by RFLD, RBLD-I, and RBLD-II, are plotted respectively in the left upper, the right upper, the left lower, and
Fig. 2. The sample distribution from the wine database: original space (upper left) and the 2-D subspaces extracted by RFLD (upper right), RBLD-I (lower left) and RBLD-II (lower right); each panel plots Feature 1 vs. Feature 2 with the three classes marked
Fig. 3. The sample distribution from the glass database: original space and the 2-D subspaces extracted by RFLD, RBLD-I and RBLD-II (Feature 1 vs. Feature 2, six classes marked)
the right lower part of each figure. We can see from these figures that the three classes from wine databases are indeed roughly equally separated, while for the
Fig. 4. The sample distribution from the segmentation database: original space and the 2-D subspaces extracted by RFLD, RBLD-I and RBLD-II (Feature 1 vs. Feature 2, seven classes marked)
other databases, classes are not equally separated and as a result, RBLD-I and RBLD-II can improve the performance of RFLD. Comparing the performance of RBLD-I with RBLD-II for the other databases, we can see that RBLD-I is better than RBLD-II for databases iris, vehicle, and optdigits, and RBLD-II is superior for databases zoo, glass, and segmentation. Overall, we can say that RBLD-I and RBLD-II achieve comparable performance.
6
Conclusion
This paper deals with the important problem of extracting discriminant features for pattern classification. We first derived a criterion function which is directly based on minimum probability of classification error, or the Bayes error. Two recursive Bayesian linear discriminants are then derived by making some approximations to the original criterion function. Comparative analysis of the two RBLD’s with relation to FLD is presented, which shows that the Bayesian approach is more favorable than FLD if some classes are much further apart than others. The analysis also points out the domain of optimality of the two RBLD’s: all classes can be assumed to have a Gaussian distribution with similar variances. Our future work is: (1) to combine this algorithm with the clusterbased approach [8] to relax the assumption of Gaussian distribution made in this paper; (2) to use more advanced classifiers, like neural networks and SVM etc, to further improve the classification performance.
References 1. Belhumeur, P.N., Hespanha, J.P., Kriegman, D.J.: Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection. IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (1997) 711–720 2. Fisher, R.A.: The Use of Multiple Measures in Taxonomic Problems. Ann. Eugenics 7 (1936) 179–188 3. Duda, R., Hart, P.E., Stork, D.: Pattern Classification. 2nd edn. Wiley, New York (2001) 4. Bell, A.J., Sejnowski, T.J.: An Information-Maximization Approach to Blind Separation and Blind Deconvolution. Neural Computation 7 (1995) 1129–1159 5. Hyv¨ arinen, A.: The Fixed-Point Algorithm and Maximum Likelihood Estimation for Independent Component Analysis. Neural Processing Letters 10 (1999) 1–5 6. Huang, D., Xiang, C.: A Novel LDA Algorithm Based on Approximate Error Probability with Application to Face Recognition. Atlanta, U.S.A., 2006 IEEE International Conference on Image Processing (2006) 653–656 7. Xiang, C., Fan, X.A., Lee, T.H.: Face Recognition Using Recursive Fisher Linear Discriminant. IEEE Transactions on Image Processing 15 (2006) 2097–2105 8. Xiang, C., Huang, D.: Feature Extraction Using Recursive Cluster-Based Linear Discriminant with Application to Face Recognition. IEEE Transactions on Image Processing 15 (2006) 3824–3832 9. Newman, D., Hettich, S., Blake, C., Merz, C.: UCI Repository of Machine Learning Databases. (1998)
Histogram PCA P. Nagabhushan and R. Pradeep Kumar Department of Studies in Computer Science, University of Mysore, India
[email protected],
[email protected]
Abstract. Histograms are data objects that are commonly used to characterize media objects like image, video, audio etc. Symbolic Data Analysis (SDA) is a field which deals with extracting knowledge and relationship from such complex data objects. The current research scenario of SDA has contributions related to dimensionality reduction of interval kind data. This paper makes an important attempt to analyze a symbolic data set for dimensionality reduction when the features are of histogram type. The result of an in-depth analysis of such a histogram data set has lead to proposing basic arithmetic and definitions related to histogram data. The basic arithmetic has been used for dimensionality reduction modeling of histogram data set through Histogram PCA. The modeling procedure is demonstrated by experiments with 700x3 data, iris data and 80X data. The utility/applicability of Histogram PCA is validated by clustering the above data.
1 Introduction
In conventional data analysis, the objects are numerical vectors. The length of such vectors depends upon the number of features recorded, leading to the creation of a multidimensional feature space. Symbolic objects are extensions of the classical data type. In conventional data sets the objects are 'individualized', whereas in symbolic objects they are 'unified' by means of relationships. Various definitions and descriptions of symbolic objects are given by Diday and Gowda [3,4]. Features characterizing a symbolic object may take more than one value, an interval of values, or may be qualitative. In real life we quite often come across features of interval / duration / spread / span / distribution type; some such case studies can be found in [6]. Symbolic data arise from many sources, for instance when huge sets of data are summarized. As input, when large data sets are aggregated into smaller, more manageable sizes, we need more complex data tables called "symbolic data tables", because a cell of such a data table does not necessarily contain, as usual, a single quantitative or categorical value [1]. In a symbolic data table, a cell can contain a distribution, intervals, or several values linked by a taxonomy and logical rules, etc. The need to extend standard data analysis methods to symbolic data tables is increasing, in order to obtain more accurate information and to summarize the extensive data sets contained in databases [2]. The input to symbolic data analysis is a "symbolic data table"; such tables constitute the main input of a Symbolic Data Analysis. Their columns are "variables", which are
used in order to describe a set of units called "individuals". Rows are called "symbolic descriptions" of these individuals because they are not, as usual, only vectors of single quantitative or categorical values [1]. Each cell of such a symbolic data table contains data of different types, such as: (a) a single quantitative value; (b) a single categorical value; (c) multivalued data, of which (a) and (b) are special cases; (d) an interval; (e) multivalued data with weights, for instance a histogram or a membership function (notice that (a), (b), (c) and (d) are special cases of (e) where the weights are all equal to 1). A histogram is a distribution-type feature used for characterizing certain symbolic objects. The special attention given to histograms is due to their generic ability to characterize most of the other symbolic data types listed above, and recent research on symbolic data analysis has regarded the histogram as a generic representative of most symbolic object types (a-e) [1,6]. In most cases a symbolic data type depicts a summary of a collection of data, so symbolic data by themselves already achieve a reduction in data. But in recent times data collections have grown so huge that, even after creating symbolic data types to summarize the information, the amount of symbolic data itself has grown drastically. This demands data compression of symbolic data as well as dimensionality reduction of symbolic data types for low-complexity classification of symbolic objects. This paper focuses on the latter. The curse of dimensionality refers to the exponential growth of hypervolumes as a function of dimensionality: all problems become harder as the dimensionality increases [6], and the larger the dimensionality, the more severe the problems of storage and analysis. Hence a lot of importance has been attributed to the process of dimensionality reduction. Principal component analysis (PCA) is a quantitatively rigorous method for achieving this reduction. The method generates a new set of features, called principal components. Each principal component is a linear combination of the original features, and all the principal components are orthogonal to each other, so there is no redundant information; as a whole they form an orthogonal basis for the space of the data. There are an infinite number of ways to construct an orthogonal basis for several columns of data, but there is something special about the principal component basis [10]. The first principal component is a single axis in space; when each observation is projected onto that axis, the resulting values form a new feature, and the variance of this feature is the maximum among all possible choices of the first axis. The second principal component is another axis, perpendicular to the first; projecting the observations onto this axis generates another new feature whose variance is the maximum among all possible choices of this second axis. The full set of principal components is as large as the original set of features, but it is commonplace for the sum of the variances of the first few principal components to cover most of the variance of the original data [9, 10, 7]. By visualizing plots of these few new features, researchers develop a deeper understanding of the driving forces that generate the original data and characterize the system. In this paper we propose a method for computing principal components when the features are of a generic symbolic data type like the histogram.
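To make the conventional PCA procedure that Histogram PCA generalizes concrete, the following minimal sketch (our own illustration, not code from the paper; the NumPy-based implementation, function name and toy data are assumptions) computes principal components by eigendecomposition of the feature covariance matrix and projects the samples onto the axes of largest variance.

import numpy as np

def pca(X, n_components=2):
    # Conventional PCA: rows of X are samples, columns are features.
    Xc = X - X.mean(axis=0)                 # center each feature
    cov = np.cov(Xc, rowvar=False)          # feature covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]       # sort by decreasing variance
    components = eigvecs[:, order[:n_components]]
    scores = Xc @ components                # project samples onto the new axes
    explained = eigvals[order] / eigvals.sum()
    return scores, components, explained

# toy usage: 700 samples, 3 correlated features
rng = np.random.default_rng(0)
X = rng.normal(size=(700, 3)) @ np.array([[1.0, 0.5, 0.2],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 0.2]])
scores, comps, ratio = pca(X, n_components=2)
print(ratio[:2])  # fraction of variance captured by the first two components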
Section 2 introduces the histogram arithmetic, followed by the computational aspects of Histogram PCA in Section 3. We present illustrative experimental results pertaining to Histogram PCA in Section 4. Section 5 lists the future scope and summarizes the contribution. The performance of the symbolic PCA was validated through the complete linkage clustering algorithm.
2 Histogram: Definition and Arithmetic In this section we introduce basic definitions and arithmetic with reference to histograms. 2.1 Definitions and Arithmetic Histogram. A histogram is a count of the number of elements within a range or rectangular bin; the height of each bin represents the number of values that fall within its range. Graphical representation of a histogram:
Fig. 1. A sample histogram
Sequence representation of a histogram: H = {10, 20, 40, 50, 50, 30, 20}. The individual elements of a histogram are accessed/denoted as H(i), where i is the location or index: H(1) = 10, H(2) = 20, H(3) = 40, H(4) = 50, H(5) = 50, H(6) = 30, H(7) = 20. Unit Histo: a unit histogram UH is a histogram with all its elements equal to one. Null Histo: a null histogram NH is a histogram with all its elements equal to zero. 2.2 Histogram Arithmetic Given two histograms H1 and H2 defined over the same variable i with N bins, the basic arithmetic operations are defined as:
Histogram addition: H1 + H2 = Σ_{i=1}^{N} [H1(i) + H2(i)]
Histogram subtraction: H1 − H2 = Σ_{i=1}^{N} |H1(i) − H2(i)|
Histogram multiplication (scalar): k * H1 = Σ_{i=1}^{N} k * H1(i)
Histogram multiplication (vector): H1 * H2 = Σ_{i=1}^{N} H1(i) * H2(i)
Histogram division (scalar): H1 / k = Σ_{i=1}^{N} H1(i) / k
Histogram division (vector): H1 / H2 = Σ_{i=1}^{N} H1(i) / H2(i)

2.3 Histo Matrix
A histo matrix is a matrix all of whose elements are histograms. For instance, in a 2 x 2 histo matrix

  | H11  H12 |
  | H21  H22 |

H11, H12, H21 and H22 are histograms.
Identity Histo Matrix: an identity histo matrix is a histo matrix whose diagonal elements are unit histos and whose remaining elements are null histos.
Basic Histo-Matrix Operations: given two histo matrices HM1 and HM2,

  HM1 = | H111  H112 |        HM2 = | H211  H212 |
        | H121  H122 |              | H221  H222 |
Histo Matrix Addition:

  HM1 + HM2 = | H111+H211  H112+H212 |
              | H121+H221  H122+H222 |

Histo Matrix Subtraction:

  HM1 − HM2 = | |H111−H211|  |H112−H212| |
              | |H121−H221|  |H122−H222| |

Histo Matrix Multiplication (scalar):

  k * HM1 = | k*H111  k*H112 |
            | k*H121  k*H122 |

Histo Matrix Multiplication (vector):

  HM1 * HM2 = | H111*H211 + H112*H221   H111*H212 + H112*H222 |
              | H121*H211 + H122*H221   H121*H212 + H122*H222 |

Histo Matrix Division (scalar):

  HM1 / k = | H111/k  H112/k |
            | H121/k  H122/k |

Histo Matrix Division (vector):

  HM1 / HM2 = HM1 * HM2⁻¹
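A minimal sketch of the histogram and histo-matrix arithmetic defined above, representing each histogram as a NumPy array of bin counts (our own illustration; the array representation, the bin-wise interpretation of the operations — which is how they are applied in the Histogram PCA computation of Section 3 — and the variable names are assumptions, not part of the paper).

import numpy as np

H1 = np.array([10, 20, 40, 50, 50, 30, 20], dtype=float)
H2 = np.array([ 5, 15, 30, 45, 40, 25, 10], dtype=float)

# bin-wise histogram operations from Section 2.2
add      = H1 + H2            # histogram addition
sub      = np.abs(H1 - H2)    # histogram subtraction (absolute differences)
scal_mul = 3.0 * H1           # scalar multiplication
vec_mul  = H1 * H2            # vector (bin-wise) multiplication
scal_div = H1 / 3.0           # scalar division
vec_div  = H1 / H2            # vector (bin-wise) division

# a 2 x 2 histo matrix can be stored as a (2, 2, B) array of histograms
B = len(H1)
HM1 = np.stack([[H1, H2], [H2, H1]])      # shape (2, 2, B)
UH, NH = np.ones(B), np.zeros(B)
I_hist = np.stack([[UH, NH], [NH, UH]])   # identity histo matrix

HM_sum = HM1 + I_hist                     # histo-matrix addition is bin-wise
print(HM_sum.shape)                       # (2, 2, 7)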
3 Histogram PCA Problem statement: there are m samples in an n-dimensional space. Each feature fi of sample j is of symbolic distribution type (histogram), i.e., fij = H, where 1 <= j <= m and 1 <= i <= n. It is required to transform the given n-d histogram features fij into n-d histogram features Fij, where F = T(f) and T represents a feature transformation function which generates the principal components. Each histogram feature consists of B bins. Computational aspect: to describe the computational details of the principal component method on a histogram data set, let us consider an original 2-d data set D.
Let

  D =        f1    f2
        S1   H11   H12
        S2   H21   H22
where S1 and S2 are the samples and f1 and f2 are the histogram features. Let the resultant principal component data set scores be given as

  PCA Scores =        F1     F2
                S1   H*11   H*12
                S2   H*21   H*22
where S1 and S2 are the samples and F1 and F2 are the principal component histogram features. The variance and covariance of the 2-d data set are computed. Let matrix A be the variance-covariance matrix, defined as

  A = VarCov(D) = | E[(H11 − μ1)(H21 − μ1)]   E[(H21 − μ1)(H12 − μ2)] |
                  | E[(H21 − μ1)(H12 − μ2)]   E[(H12 − μ2)(H22 − μ2)] |

where E is the expectation and μ1 and μ2 are the mean histograms of f1 and f2, respectively: μ1 = (1/2)[H11 + H21] and μ2 = (1/2)[H12 + H22]. Let X be a column histo matrix of eigenvectors and λ be its corresponding histo vector of eigenvalues:

  [A][X] = λ[X]    (1)
Equation (1) can be rewritten as

  [A − λI][X] = 0    (2)

where I is an identity histo matrix of the same size as A. The solution of equation (2) can be obtained as follows:

  DET(A − λI) = 0    (3)
Since we are considering a 2-d data set, i.e. A is a 2 x 2 matrix, equation (3) gives B (the number of histogram bins) quadratic equations in λ. Let the roots of these quadratic equations be λ1i and λ2i, corresponding to the ith bin of the histogram. Thus we obtain B sets of (λ1, λ2). Now, corresponding to the ith bin, substituting λ1i into equation (2) and solving it gives an eigenvector of the form V1 = ai11 x1(i) + ai12 x2(i), where ai11, ai12 are the coefficients of the eigenvector of dimension 1 (F1) corresponding to the ith bin. Similarly, substituting λ2i into equation (2) and solving it gives an eigenvector of the form V2 = ai21 x1(i) + ai22 x2(i), where ai21, ai22 are the coefficients of the eigenvector of dimension 2 (F2) corresponding to the ith bin. Now, for the ith bin, multiplying the first coefficient of eigenvector V1 with the feature values of dimension 1 and the second coefficient with the feature values of dimension 2, and combining the two, gives the feature values in dimension 1 of the rotated co-ordinate system:
  H*11 = ai11 H11(i) + ai12 H12(i)
  H*12 = ai21 H11(i) + ai22 H12(i)

Similarly, for the ith bin, multiplying the first coefficient of eigenvector V2 with the feature values of dimension 1 and the second coefficient with the feature values of dimension 2, and combining the two, gives the feature values in dimension 2 of the rotated co-ordinate system:

  H*21 = ai11 H21(i) + ai12 H22(i)
  H*22 = ai21 H21(i) + ai22 H22(i)

For n-dimensional data, the first principal component histograms represent a large percentage of the total scene variance, and succeeding components (pc-2, pc-3, ..., pc-n) contain a decreasing percentage of the scene variance. Furthermore, because successive components are chosen to be orthogonal to all previous ones, the data are uncorrelated. In most cases the first few principal components are good enough for classification, thus resulting in dimensionality reduction. As can be understood from the computational procedure, Histogram PCA demands correspondence between the bins, i.e. the bins should be centered at the same positions for all histograms. This can be achieved by first normalizing the values along each feature and then synthesizing the histogram symbolic data set, which keeps the histogram spread between 0 and 1 for all histograms. Another constraint due to the requirement of bin correspondence is that all histograms have to span an equal spread, i.e. an equal number of bins. In our experimental set-up we have chosen histograms with 10 bins.
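The bin-wise computation described above can be sketched as follows (our own illustration under stated assumptions: all histograms share B corresponding bins, a conventional eigen-problem is solved independently for every bin, and each bin is mean-centered before projection, which the paper does not state explicitly; function and variable names are hypothetical).

import numpy as np

def histogram_pca(H, n_keep=2):
    # H has shape (m_samples, n_features, B_bins); every feature is a histogram.
    # For each bin i, PCA is performed on the m x n slice H[:, :, i] and the
    # rotation is applied to that bin of every sample.
    m, n, B = H.shape
    pc = np.empty_like(H)
    for i in range(B):
        slice_i = H[:, :, i]                       # m samples x n features for bin i
        mu = slice_i.mean(axis=0)
        cov = np.cov(slice_i - mu, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(cov)     # ascending eigenvalues
        order = np.argsort(eigvals)[::-1]
        V = eigvecs[:, order]                      # columns = eigenvectors
        pc[:, :, i] = (slice_i - mu) @ V           # rotated coordinates for bin i
    return pc[:, :n_keep, :]                       # first few PC histograms

# toy usage: 35 symbolic objects, 3 histogram features, 10 bins each
rng = np.random.default_rng(1)
H = rng.random((35, 3, 10))
pc_hists = histogram_pca(H, n_keep=2)
print(pc_hists.shape)   # (35, 2, 10)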
4 Experimentation Experimental results: 700x3 data. In order to demonstrate the working of the above formulations we have considered a few symbolic datasets synthesized from conventional data sets: (a) the 700x3 data (generated by expanding the data table introduced in [5] and subsequently in [6]), (b) the iris data, and (c) a multi-channel 80X data set. The 700x3 data set has 700 samples and three features. This is a conventional data set with 5 classes in which there is a clear overlap among the feature values, so that all three features are required for data modeling. We could achieve dimensionality reduction of this conventional data through PCA and could visualize the five classes with the first two principal components, whereas no two original features could classify the data into 5 classes. The dataset consists of 700 samples, with the first 140 belonging to class I, 141-280 to class II, 281-420 to class III, 421-560 to class IV and 561-700 to class V. Within each class, the sample set is divided into seven sample packets with 20 samples in each packet, and histograms are generated for each feature in each sample packet. Thus for each class we obtain seven symbolic objects with histogram features, resulting in 35 symbolic objects with three histogram features. Figure 2 portrays the 700x3 histogram dataset with 35 samples and 3 histogram features. Figure 4 conveys the variance of the histograms along each feature. As can be understood from the histogram arithmetic, the variance histograms can be generated by
computing the variance along the individual bins. Figure 3 depicts the principal component histograms. It can be observed from the results that the principal component histograms may take negative values as well. The main point to note is that there is no harm in using the principal component histograms as they are (i.e. with negative values) for further classification, provided we opt for conventional distance measures such as Euclidean or city-block distances between histograms. However, some distance measures may have been defined under the assumption that histogram bins always take positive values, because they represent frequencies of items. To deal with such situations, the principal component histograms can be subjected to a global normalization by considering the maximum and
Fig. 2. 700x3 Histogram Data
Fig. 3. 700x3 Principal Component Histograms
Fig. 4. 700x3 Var Hist
Fig. 5. Var of PC Hist
Fig. 6. Clustering with PCH1 and PCH2
minimum bin values over the entire set of principal component histograms. Having obtained the principal component histograms, we also compute their variance histograms, which are shown in Figure 5. It can be observed that the variance histograms of the first and second principal component histograms are high, while the third principal component histogram has very low variance values along its bins. This indicates that the first two principal component histograms are good enough to classify the symbolic sample set. The result is 5 classes with seven samples in each, as shown in Figure 6; the dendrogram depicted in Figure 6 is obtained through complete linkage clustering. Experimental results: IRIS data. The iris dataset consists of 150 samples with 4 features. This is again a conventional data set with 3 classes, namely setosa, versicolor and virginica. The first 50 samples belong to class I, the second 50 to class II and the third 50 to class III. From each class, 30 samples were drawn randomly and histograms were generated for the four features of the drawn samples. This random sampling was done 5 times for each class, yielding 15 symbolic objects with 4 histogram-type features. The sampling was tuned to cover all 50 samples belonging to each class. In this case, too, the values were normalized along each feature to generate uniform histogram spreads between 0 and 1.
Fig. 7. IRIS Histogram Data
Fig. 9. IRIS Var Hist
Fig. 8. IRIS Principal Component Histograms
Fig. 10. Var of PC Hist
Fig. 11. IRIS Clustering with PCH15
Figure 7 depicts the 15-object histogram symbolic data set and Figure 9 shows the corresponding variance histograms. The resultant histograms of the principal component analysis are portrayed in Figure 8, and the corresponding variance histograms in Figure 10 show high variance values in the first variance histogram and meager values in the remaining three. This indicates that the first principal component histogram is good enough to achieve classification among these 15 samples. The complete linkage clustering results are shown in Figure 11, with the first five samples belonging to setosa, the second five to versicolor and the last five to virginica. The overlap between versicolor and virginica seen in the conventional data is not observable here.
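For the complete linkage clustering used to validate the principal component histograms, a minimal sketch using SciPy's hierarchical clustering is given below (our own illustration; flattening each object's retained principal component histograms into one feature vector and using Euclidean distance are assumptions consistent with the distance-measure discussion above).

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# pc_hists: (n_objects, n_kept_pcs, B) principal component histograms,
# e.g. 15 IRIS symbolic objects, 1 kept PC, 10 bins (random placeholder data)
rng = np.random.default_rng(2)
pc_hists = rng.random((15, 1, 10))

# flatten each object's retained PC histograms into a single feature vector
features = pc_hists.reshape(pc_hists.shape[0], -1)

# complete (maximum) linkage on Euclidean distances, then cut into 3 clusters
Z = linkage(features, method="complete", metric="euclidean")
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)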
5 Conclusion and Discussion Recent developments in database technology have seen a wide variety of data being stored in huge collections. The histogram is one such data type, commonly used to characterize media content as features. In many cases, such histogram feature sets are very large, which drastically increases the computational complexity. Not much thought has been given to reducing the dimension of a histogram feature set in which each histogram is itself a data object defined over a hyperspace. This paper has made a foundational contribution towards reducing the dimensions of such complex data sets through simple means, and a substantial amount of experimentation has been conducted to validate the proposed model.
References
1. Diday, E.: An Introduction to Symbolic Data Analysis and the Sodas Software. Journal of Symbolic Data Analysis 1 (2002)
2. Diday, E.: The Symbolic Approach in Clustering, Classification and Related Methods of Data Analysis. In: Bock, H.H. (ed.), E S Publishers, North Holland (1988)
3. Gowda, K.C., Diday, E.: Symbolic Clustering Using a New Similarity Measure. IEEE Transactions on Systems, Man and Cybernetics 22 (2) (1992)
4. Gowda, K.C., Diday, E.: Symbolic Clustering Using a New Dissimilarity Measure. Pattern Recognition 24 (6) (1991)
5. Nagabhushan, P.: An Efficient Method for Classifying Remotely Sensed Data, Incorporating Dimensionality Reduction. Ph.D. Thesis, University of Mysore, India (1998)
6. Nagabhushan, P., Gowda, K.C., Diday, E.: Dimensionality Reduction of Symbolic Data. Pattern Recognition Letters 16 (1995) 219-223
7. Bock, H.H., Diday, E.: Analysis of Symbolic Data - Exploratory Methods for Extracting Statistical Information from Complex Data. Springer Verlag (1998)
8. Kumar, R.P., Nagabhushan, P.: Multiresolution Knowledge Mining Using Continuous Wavelet Transform. Proc. of the International Conference on Cognition and Recognition (2005)
9. Jolliffe, I.T.: Principal Component Analysis. 2nd edn. Springer-Verlag, New York (2002)
10. Manley, B.F.J.: Multivariate Statistical Methods: A Primer. Chapman & Hall, Bury St. Edmunds, Suffolk (1994)
Simultaneously Prediction of Network Traffic Flow Based on PCA-SVR Xuexiang Jin, Yi Zhang, and Danya Yao Department of Automation, Tsinghua University, Beijing, 100084, China
[email protected],
[email protected],
[email protected]
Abstract. The ability to predict traffic variables such as speed, travel time and flow, based on real-time and historical data collected by various systems in transportation networks, is vital to intelligent transportation systems (ITS). The present paper proposes a method based on Principal Component Analysis and Support Vector Regression (PCA-SVR) for short-term simultaneous prediction of network traffic flow, which is multidimensional in contrast with traditional single-point prediction. Data from a typical traffic network of Beijing City, China are used for the analysis. Other models such as ANN and ARIMA are also developed, and a comparison of the performance of these techniques is carried out to show the effectiveness of the proposed method.
1 Introduction Intelligent Transportation Systems (ITS) have gained importance in recent years as the level of demand for them increases quickly. There are three kinds of data in transportation systems: historical data, real-time data and short-term forecasting data. The ability to predict traffic variables such as speed, travel time or flow, based on real-time and historical data collected by various systems in transportation networks, is vital to ITS components such as in-vehicle route guidance systems (RGS), advanced traveler information systems (ATIS), and advanced traffic management systems (ATMS). Short-term traffic flow forecasting has therefore attracted much interest in the current literature because of its importance in both theoretical and empirical aspects of ITS deployment. An extensive variety of methods has been developed for dynamic traffic prediction, mainly including nonparametric statistical methods [1], time series models [2], neural-network models [3] and others. At present, research on dynamic traffic prediction is still developing and has not yet produced a mature theoretical system. Short-term traffic forecasting, especially forecasting several minutes into the future, has not yet achieved perfect results. The accuracy of models based on classical mathematical methodologies can hardly satisfy the requirements of real-time traffic control and guidance, as the traffic system is highly complicated and nonlinear. Neural-network models have found extensive application because of their advantages in self-learning and complicated
nonlinear mapping. However, neural networks are liable to over-fitting in the case of small samples, and the complexity of the network structure and of the samples has much influence on the complexity of the algorithm. The Support Vector Machine (SVM), based on statistical learning theory [4], is a learning machine that adheres to the principle of structural risk minimization (SRM), seeking to minimize an upper bound of the generalization error rather than the training error (the principle followed by neural networks). With the introduction of Vapnik's ε-insensitive loss function, SVM has been extended to solve nonlinear regression estimation problems; this technique, known as support vector regression (SVR) [5], has shown excellent performance and has been broadly applied to financial time sequence forecasting [6], tourism market forecasting [7] and many other practical problems. Based on the fact that traffic information forecasting is equivalent to function regression and approximation [8], SVR can be employed to treat the problem of traffic information forecasting. Until now, most research has focused on traffic flow prediction for a single link or intersection separately, without considering the structure of the flows in a network. However, simultaneous prediction of network traffic flow involves high-dimensional data structures. The most commonly used technique to analyze high-dimensional structures is Principal Component Analysis [9]. Given a high-dimensional object and its associated coordinate space, PCA finds a new coordinate space which is the "best" one to use for dimension reduction of the given object. Once the object is placed into this new coordinate space, projecting the object onto a subset of the axes can be done in a way that minimizes error. In this paper, we propose a short-term traffic prediction method based on PCA-SVR in which PCA is used to explore the intrinsic dimensionality and structure of the network traffic flow. Experiments show that the proposed method is efficient when applied to actual short-term forecasting. The remainder of this paper is organized as follows: the methodologies of PCA and SVR are introduced in Section 2. In Section 3, a method for short-term simultaneous prediction of network traffic flow based on PCA-SVR is proposed. The application to traffic flow forecasting for a real traffic network in Beijing City, China, is then presented in Section 4, along with a comparison to the commonly used ANN and ARIMA models. Section 5 gives the conclusions.
2 PCA-SVR Method 2.1 Dimension Reduction and Structure Extraction Based on PCA PCA has been widely used in system identification and dimensionality reduction in dynamic systems. It is an efficient approach for extracting features from sensory signals acquired from multiple sensors [9]. In general, multiple sensory signals can be viewed as a high-dimensional multivariate random matrix composed of several vectors formed by the different sensory signals. It is not feasible to input the above matrix to the prediction model without any feature extraction or dimensionality reduction procedure, because of the curse of dimensionality and the high correlation between the vectors. By applying PCA, the complexity of the modeling process can be reduced and new feature vectors can be constructed.
In general, the multiple sensory signals X_{P×N} may be represented as a matrix with N samples acquired from P sensors. In order to remove irrelevant or redundant information in the acquired signals, an orthogonal linear transform is introduced to convert the original matrix X into a new space denoted by vᵗX, where v = (v1, v2, v3, ..., vp)ᵗ. To preserve most of the variation in the data set X, the orthogonal vector v is chosen to maximize the variance of the projections vᵗX, estimated by

  Var(vᵗX) = vᵗΣv    (1)

where Σ is the covariance matrix of X. Hence, the above optimization problem amounts to maximizing vᵗΣv subject to the constraint vᵗv = 1. The optimization problem can be solved using Lagrange multipliers, and the corresponding Lagrange function is constructed as

  L(v, λ) = vᵗΣv − λ(vᵗv − 1)    (2)

where λ is the Lagrange multiplier. The solution of Eq. (2) can be obtained by partially differentiating with respect to v and λ, respectively:

  ∂L/∂v = 2Σv − 2λv = 0,   ∂L/∂λ = vᵗv − 1 = 0    (3)
It can be seen from Eq. (3) that this is a typical eigenvector problem, where v is an eigenvector of the matrix Σ with corresponding eigenvalue λ. These eigenvectors are referred to as the principal components and represent the directions of greatest variance in the multiple sensory signals. The complete PCA normally includes two procedures: estimation of the principal components by eigenvalue analysis of the covariance matrix, and projection of the original matrix into the lower-dimensional space constructed by the dominant principal components. 2.2 SVM for Regression Approximation The formulation of SVM embodies the structural risk minimization principle, which has been shown to be superior to the traditional empirical risk minimization principle employed by conventional neural networks [4]. By involving both the empirical and the expected risk in the training cost function, SVM successfully avoids stepping into locally optimal points, and the structure and parameters of an SVM are relatively easy to determine. In the SVM algorithm, the nonlinear input space is mapped into a high-dimensional feature space via a nonlinear mapping, and linear classification and regression are performed in that space. The SVM can be regarded as a two-layer network with the inputs transformed by kernels corresponding to a subset of the input data. The output of the SVM is a
linear function of the weights and the kernels. The weights and the structure of the SVM are obtained simultaneously by constrained minimization for a given precision level of the modeling error. In the constrained minimization, kernels corresponding to data points that are within the error bounds are removed. The SVR is formed by the retained kernels [10], and the data points associated with the retained kernels are referred to as the support vectors (SV) [9]. In most cases, using SVM for classification and function regression outperforms traditionally used neural network models. Consider a given set of N data points {x_n, y_n}, n = 1, ..., N, with input data x_n ∈ Rᵖ and output y_n ∈ R. In feature space F, the SVR model takes the form

  f(x) = w′φ(x) + b    (4)

where the nonlinear mapping φ(x) maps the input data into a higher-dimensional feature space. Note that the dimension of w is not specified. In least squares support vector regression the following optimization problem is formulated [11]:

  min_{w,e} J(w, e) = (1/2) w′w + (c/2) Σ_{n=1}^{N} e_n²
  s.t. y_n = w′φ(x_n) + b + e_n,  n = 1, ..., N    (5)

where e_n are the error variables and c is a regularization parameter. By employing Lagrange multipliers α and exploiting the Karush-Kuhn-Tucker conditions, the solution for b and α is given by the following linear equations:

  | 0    1′        | | b |   | 0 |
  | 1    K + c⁻¹I  | | α | = | y |    (6)

where y = [y_1, ..., y_N]ᵗ, 1 = [1, ..., 1]′, α = [α_1, ..., α_N]′, and Mercer's condition [9], K_nm = φ(x_n)′φ(x_m) = K(x_n, x_m), n, m = 1, ..., N, has been applied. This finally results in the following least squares SVR model for function estimation:

  f(x) = Σ_{n=1}^{N} α_n K(x, x_n) + b    (7)
For the choice of the kernel function K(·,·), one has several possibilities:

  Polynomial kernel: K(x, x_n) = (x_n′x + 1)ᵈ    (8)
  Sigmoid kernel: K(x, x_n) = tanh(β0(x_n′x) + β1)    (9)
  Radial basis function (RBF) kernel: K(x, x_n) = exp(−‖x − x_n‖²₂ / (2σ²))    (10)
where d , β0 , β1 and c are specified a priori. The polynomial kernel and RBF kernel always satisfy Mercer’s theorem while the sigmoid kernel satisfies it only for some values of β 0 and β1 [13]. The specific choice of a kernel function implicitly determines the mapping ϕ and the feature space F . In this paper, the RBF kernel was adopted since it was found to be appropriate to capture the nonlinearity of the considered system by testing the modeling performance of the kernel functions mentioned above.
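As a concrete illustration of support vector regression with an RBF kernel, a minimal sketch using scikit-learn's ε-SVR is given below (our own illustration rather than the least-squares formulation derived above; the toy data and hyperparameter values are placeholders, not those used in the paper).

import numpy as np
from sklearn.svm import SVR

# toy one-dimensional regression problem
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 6, size=(200, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=200)

# RBF-kernel SVR; C plays the role of the regularization parameter c,
# gamma = 1 / (2 * sigma^2) controls the kernel width
model = SVR(kernel="rbf", C=10.0, gamma=0.5, epsilon=0.05)
model.fit(X, y)

y_hat = model.predict(X)
print("training RMSE:", np.sqrt(np.mean((y - y_hat) ** 2)))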
3 Prediction of Network Traffic Flow Based on PCA-SVR Simultaneously PCA can explicitly recognize temporal and spatial correlations among time series of traffic flows observed at different locations of the network, without relying on any a priori assumption concerning the distribution of traffic flows. Let m be the number of traffic detectors located on a subset of the total set of arterial links of an urban network, t be the number of successive days (i.e. the respective periods) in which the detector data are collected and τ be the number of time intervals (e.g. of 15 min length) into which each day is partitioned. Then a matrix X can be defined, referred to here as the 'measurement matrix', with p rows and m columns, where p = t × τ. Thus each column i of matrix X denotes the i-th traffic flow time series, and each row j denotes the traffic flow readings (measurements) at that particular point in time (j). The estimation of the principal component v_i is carried out through the spectral decomposition of the matrix XᵗX, as follows:

  XᵗX v_i = λ_i v_i,  i = 1, 2, ..., m    (11)

where λ_i is the non-negative real scalar, known as the eigenvalue, corresponding to the principal component v_i. By convention, the eigenvalues are arranged in order of magnitude, from large to small, so that λ1 ≥ λ2 ≥ λ3 ≥ ... ≥ λm. Since the principal axes are arranged in order of contribution to the overall variation, the time-varying trend common to all flows along principal axis i can be represented through a column vector u_i of size p, referred to as the eigenflow of the i-th principal axis, as follows:

  u_i = X v_i / σ_i,  i = 1, 2, ..., m    (12)
where σ_i = √λ_i is the singular value corresponding to the i-th principal axis. The magnitude of the singular values demonstrates the overall variation attributable to each particular principal component. After PCA, we may find that only r singular values are non-negligible, which implies that X effectively resides on an r-dimensional subspace of Rᵖ. Then we can approximate the original X as:

  X̂ = Σ_{i=1}^{r} σ_i u_i v_iᵀ    (13)

where r < p is the effective intrinsic dimension of X. Now SVR is used to make predictions on the dataset of the r effective eigenflows. After SVR, the predicted eigenflows can be used to reconstruct the network traffic by means of Eq. (13), giving a prediction of the network traffic flow. A flow chart of the method is illustrated in Fig. 1.
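The eigenflow decomposition and low-rank reconstruction of Eqs. (11)-(13) can be sketched as follows (our own NumPy illustration; the measurement matrix here is synthetic, and in practice each eigenflow u_i would first be forecast with SVR before the reconstruction step).

import numpy as np

# synthetic measurement matrix: p = t * tau time intervals, m = 58 links
p, m, r = 96 * 30, 58, 3
rng = np.random.default_rng(0)
X = rng.random((p, r)) @ rng.random((r, m)) + 0.01 * rng.normal(size=(p, m))

# principal axes v_i and eigenvalues lambda_i of X^T X (Eq. 11)
eigvals, V = np.linalg.eigh(X.T @ X)
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]

sigma = np.sqrt(eigvals[:r])            # singular values of the r dominant axes
U = (X @ V[:, :r]) / sigma              # eigenflows u_i = X v_i / sigma_i (Eq. 12)

# low-rank approximation X_hat = sum_i sigma_i u_i v_i^T (Eq. 13)
X_hat = (U * sigma) @ V[:, :r].T
print("relative reconstruction error:",
      np.linalg.norm(X - X_hat) / np.linalg.norm(X))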
Fig. 1. Flow chart of proposed method
Fig. 2. Road network in Beijing. Position of focused region is denoted by an arrow.
4 Case Study We applied the proposed method to a field case, and several modeling approaches such as ANN and ARIMA are used for comparison. 4.1 Data and Parameter Selection
A region in Beijing City is selected for the case study. The region lies to the west of Beijing's third ring road and belongs to the west zone of the city's UTC_SCOOT system, as denoted by the arrow in Fig. 2. The region is chosen because it contains a typical networked road system and represents various features of Beijing's urban traffic. The network consists of 18 intersections and 58 links. Each link is numbered so that the four links to the same intersection are labeled with consecutive numbers, such as Link No. 34, 35, 36, and 37. The road network sketch map is shown in Fig. 3. Data used in this research come from the flow volumes of this network from June 1st to 30th, 2004, with a sample interval of 15 minutes. After passing through a preprocessing filter to remove invalid data points, our experimental dataset consists of volumes in 2793 consecutive time intervals for 58 links, determining 2793 points in 58-dimensional space.
Fig. 3. Road network sketch map of our focused region
In this case, for PCA, we choose m = 58 and p = 96 × 30 according to the real network structure, and r = 3 after a structural analysis of the historical network flow data [14]. For SVR, we select the radial basis function (RBF) kernel because it has few kernel parameters and high capability in nonlinear regression [5]. The hyperparameters of the kernel are determined automatically on a validation set when training the SVR. The number of lags can be determined by Automatic Relevance Determination, although this technique is known to work suboptimally in the context of recurrent models [15]. ΔMSE and ΔRMSE are chosen to evaluate and compare the performance of our method and other methods in this case:
  ΔMSE = (1/N) Σ_{n=1}^{N} ((x_n − x̂_n) / x_n)² × 100%
  ΔRMSE = √[ (1/N) Σ_{n=1}^{N} ((x_n − x̂_n) / x̂_n)² ] × 100%    (14)

where x̂_n is the predicted value for time n, x_n is the actual value at time n, and N is the amount of data used for prediction testing.
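A small sketch of the evaluation metrics in Eq. (14) (our own implementation of the formulas as reconstructed above; the exact normalization of the original equations is partly an assumption, and the sample values are placeholders).

import numpy as np

def delta_mse(x, x_hat):
    # mean squared relative error, in percent (Eq. 14, first formula)
    x, x_hat = np.asarray(x, float), np.asarray(x_hat, float)
    return np.mean(((x - x_hat) / x) ** 2) * 100.0

def delta_rmse(x, x_hat):
    # root mean squared relative error, in percent (Eq. 14, second formula)
    x, x_hat = np.asarray(x, float), np.asarray(x_hat, float)
    return np.sqrt(np.mean(((x - x_hat) / x_hat) ** 2)) * 100.0

# toy usage
x_true = np.array([120.0, 95.0, 150.0, 80.0])
x_pred = np.array([118.0, 99.0, 145.0, 83.0])
print(delta_mse(x_true, x_pred), delta_rmse(x_true, x_pred))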
Fig. 4. Unified eigenflow for the first principal in a day
Fig. 5. Unified eigenflow for the second principal
Fig. 6. Unified eigenflow for the third principal
Fig. 7. Comparison of the original and estimated eigenflow
Table 1. Prediction results comparison for an averaged single point (link)
  Method     ΔMSE    ΔRMSE   Calculating Time (s)
  PCA-SVR    2.721   3.112   8.1
  ANN        2.887   3.445   12.1
  ARIMA      3.112   3.552   110.5
4.2 Results
The results are summarized in a series of figures. Figs. 4, 5 and 6 show the unified eigenflows for the corresponding principal axes. Fig. 7 gives a comparison of the original and estimated eigenflow. Table 1 compares the final network traffic flow prediction results, averaged over single links, in terms of ΔMSE and ΔRMSE. As can be seen, our method has advantages in both computing time and accuracy.
5 Conclusion The present paper proposes a method based on Principal Component Analysis and Support Vector Regression (PCA-SVR) for short-term simultaneous prediction of network traffic flow, which is multidimensional in contrast with traditional single-point prediction. Data from a typical traffic network of Beijing City, China are used for the analysis. The use of this method effectively improves the precision of the prediction and reduces the processing time compared with other models such as ANN and ARIMA. The results of our research indicate that PCA-SVR is a good choice for short-term traffic flow prediction, especially when multi-dimensional network traffic flows need to be predicted simultaneously.
Acknowledgement This work is partially supported by National Natural Science Foundation of China (NSFC) under the grant 60374059 and National Basic Research Program of China (973 Program) under the grant 2006CB705500.
References 1. Stephen Clark: Traffic Prediction Using Multivariate Nonparametric Regression. Journal of Transportation Engineering. (2003) 161-168 2. Williams, M., Billy: Multivariate Vehicular Traffic Flow Prediction: Evaluation of ARIMAX Modeling. Transportation Research Board. (2001) 194-200 3. Zhang, H., Ritchie, S.G., Lo, Z.P.: Macroscopic Modeling of Freeway Traffic Using an Artificial Neural Network. Transportation Research Record (2000) 110-119 4. Vapnik, V. N.: The Nature of Statistical Learning Theory, Springer Verlag, New York (1995) 5. Vapnik,V.N., Golowich, S., Smola, A..: Support Vector Method for Function Approximation, Regression Estimation and Signal Processing., Advance in neural information processing system, 9 MIT Press, Cambridge, MA (1997) 281-287 6. Francis, EH., TAY, L., Cao, J.: Modified Support Vector Machines in Financial Time Series Forecasting. Neurocomputing, 48 (2002) 847-861 7. Rob Law, Norman Au: A Neural Network Model to Forecast Japanese Demand for Travel to Hong Kong. Tourism Management, 20 (1999) 87 -97 8. Yang, Z., Jiang G.: The Theoretical Models of Urban Traffic Flow Guidance Based on Higher-order Generalized Neural Network. Journal of Highway and Transportation Research and Development. 115 (1998) 16-19 (in Chinese) 9. Meyer, C.D.: Matrix Analysis and Applied Linear Algebra. SIAM (2000) 10. Scholkopf, B., Burges, C.J.C., Smola, A.J.: Advances in Kernel Methods-Support Vector Learning. MIT Press, Cambridge, MA (1997) 11. Suykens, J.A.K., Vandewalle, J., De Moor, B.: Optimal control by least squares support vector machines. Neural Networks, 14 (2001) 23-35 12. Haykin, S.: Neural Networks, Prentice-Harll, NJ (1999) 13. Kwon, Y.D., Evans, L.B.: A Coordinate Transformation Method for The Numerical Solution of Nonlinear Minimum-Time Control Problems. AIChE J., 21 (1975) 1158-1164 14. Jin, X., Zhang Y.: Structural Analysis of Network Traffic Flow, in press 15. Van Gestel, T., Suykens, J.A..K., De Moor, B., Vandewalle, J.: Automatic Relevance Determination for Least Squares Support Vector Machine Classifiers,. Proc. of the European Symposium on Artificial Neural Networks (ESANN 2001), Bruges, Belgium (2001) 13–18
An Efficient K-Hyperplane Clustering Algorithm and Its Application to Sparse Component Analysis Zhaoshui He1,2 and Andrzej Cichocki1 1
Laboratory for Advanced Brain Signal Processing, RIKEN Brain Science Institute, Wako-shi, Saitama 351-0198, Japan {he_shui,cia}@brain.riken.jp 2 School of Electronic and Information Engineering, South China University of Technology, Guangzhou, 510641, China
Abstract. Based on eigenvalue decomposition, a novel efficient K-HPC algorithm is developed in this paper, which is easy to implement. And it enables us to detect the number of hyperplanes and helps to avoid local minima by overestimating the number of hyperplanes. A confidence index is proposed to evaluate which estimated hyperplanes are most significant and which are spurious. So we can choose those significant hyperplanes with high rank priority and remove the spurious hyperplanes according to their corresponding confidence indices. Furthermore, a multilayer clustering framework called “multilayer K-HPC” is proposed to further improve the clustering results. The K-HPC approach can be directly applied to sparse component analysis (SCA) to develop efficient SCA algorithm. Two examples including a sparse component analysis example demonstrate the proposed algorithm.
1 Introduction Recently, K-hyperplane clustering (K-HPC) has found many applications in sparse component analysis (SCA), machine learning, data mining, knowledge acquisition, image processing, etc. [1]-[4][6]. K-HPC is also a promising method for blind separation of dependent sources. Consider the following K-HPC problem: given a set of m-dimensional points {x_t} ∈ Rᵐ, t = 1, ..., T (with some hidden structure), how can they be grouped into K different hyperplanes (see Fig. 1)? To achieve this goal, we have three sub-tasks: (1) detect the number K of hyperplanes; (2) determine the K hyperplanes P_k, k = 1, ..., K, where P_k := {x | x ∈ Rᵐ, xᵀv_k = 0}; (3) group the T samples into the K different hyperplanes. K-HPC is a special case of the generalized principal component analysis (GPCA) problem [1]. In fact, the above problem is "zero-means" K-HPC. Note that Bradley and Mangasarian [4] discussed the "nonzero-means" K-HPC problem before this, which
was called "K-plane clustering" in [4][10][20]. For the "nonzero-means" K-HPC problem, the K hyperplanes can be mathematically expressed as P_k := {x | x ∈ Rᵐ, xᵀv_k = γ_k}, where γ_k is the intercept of the k-th hyperplane and usually γ_k ≠ 0. It should be noted that our formulated zero-means K-HPC is much easier to solve than nonzero-means K-HPC. Moreover, m-dimensional nonzero-means K-HPC can be converted to (m+1)-dimensional zero-means K-HPC: from xᵀv_k = γ_k, we have [xᵀ, 1] · [v_kᵀ, −γ_k]ᵀ = 0. Let x̃ = [xᵀ, 1]ᵀ and ṽ_k = [v_kᵀ, −γ_k]ᵀ; then P_k := {x | x ∈ Rᵐ, xᵀv_k = γ_k} is equivalent to P̃_k := {x̃ | x̃ ∈ Rᵐ⁺¹, x̃ᵀṽ_k = 0}. So we consider only zero-means K-HPC in this paper, and "K-HPC" refers to zero-means K-HPC in the remainder of this paper. Similar to the general clustering problem, K-HPC is very challenging. Firstly, it is very difficult to accurately determine the number K of hyperplanes. Furthermore, many clustering methods are not robust because of a serious local minima (maxima) problem: the clustering results are very sensitive to the initialization.
Fig. 1. Two hyperplanes
In this paper we would like to emphasize the following features of our K-HPC method: (1) it is not only a special case of the GPCA problem, but is also related to the general GPCA problem (including K-subspace clustering), which can theoretically be decomposed into several K-HPC problems; in other words, GPCA can be solved by conversion to K-HPC. (2) It is possible to design efficient algorithms for K-HPC that are not very computationally expensive; K-HPC is not NP-hard, in contrast to the general GPCA problem, which is NP-hard. As mentioned above, Bradley and Mangasarian first discussed the K-HPC problem [4]. They developed a new algorithm which was applied to analyze the Wisconsin Breast
Prognosis Cancer database. Better results than those of K-means clustering were reported in [4]. However, they only discussed nonzero-means K-HPC, so their approach is relatively complicated. Moreover, they did not discuss two key issues in clustering: how to determine the number K of hyperplanes, and how to avoid local minima. In this paper, we present new aspects of K-HPC. In particular, we present an efficient way to detect the number K of hyperplanes and also show how local minima can largely be avoided. Furthermore, a more efficient K-HPC algorithm is developed.
2 K-Hyperplane Clustering Denote the K normals of the K hyperplanes as v_k ∈ Rᵐ, k = 1, ..., K, respectively. The aim of K-HPC is to determine the normal vectors of the hyperplanes. The algorithm is similar to K-means clustering in that it alternates between assigning points to the nearest cluster hyperplane P_k (cluster assignment step) and, for a given cluster k, computing a cluster hyperplane P_k that minimizes the sum of the absolute values of the normalized inner products d(x(t), v_k) (see the K-HPC algorithm below) of all points in the cluster (cluster update step). Considering that the number of hyperplanes is unknown, and the general local minima problem, we overestimate the number of hyperplanes. Besides, we introduce a confidence index f(P_k), k = 1, ..., K, to evaluate how reliable each cluster is. We then detect the number of hyperplanes and remove the spurious hyperplanes from the overestimated results. 2.1 New K-HPC Algorithm The K-HPC algorithm can be outlined as follows: (1) Initialization.
Randomly initialize V = [v_1, ..., v_K]. Normalize all observed samples x(t), t = 1, ..., T, such that ‖x(t)‖₂ = 1, t = 1, ..., T, if necessary.
(2) Cluster assignment step. Group the T samples x(t), t = 1, ..., T, into K clusters θ{v_k}, k = 1, ..., K, according to the following rule:

  x(t) ∈ θ{v_k} ⇔ k = arg min_i {d(x(t), v_i), i = 1, ..., K},  t = 1, ..., T; k = 1, ..., K    (1)

where d(x(t), v_i) = |⟨x(t), v_i⟩| / ‖x(t)‖₂. Assume that T_k entries are grouped into set θ{v_k}; denote them as x(t_1), ..., x(t_{T_k}), so that we have the matrix X_k = [x(t_1), ..., x(t_{T_k})].
(3) Cluster update step. Update the normals v_k, k = 1, ..., K, and compute their corresponding confidence indices f(P_k) as follows.
  For k = 1, ..., K: apply eigenvalue decomposition (EVD) to the matrix X_k (X_k)ᵀ = W_k · D_k · (W_k)ᵀ, where W_k = [w_k1, ..., w_km].
  End
Assume D_k = diag(d_k1, ..., d_km) with eigenvalues d_k1 ≤ ... ≤ d_km corresponding to eigenvectors w_k1, ..., w_km, respectively. Then the eigenvector w_k1 corresponding to the smallest eigenvalue d_k1 is used to update the normal v_k of hyperplane k, i.e., v_k and the confidence index f(P_k) are updated as

  v_k ← w_k1,  f(P_k) = Π_{i=2}^{m} d_ki,  k = 1, ..., K.    (2)
(4) If it is convergent1, go to (5); or else go to (2); (5) Output K normal vectors vk and the confidence indices f ( Pk ), k = 1, , K of K hyperplanes. Then we have normal matrix V = [v1 , , v K ] and confidence vector f = [ f ( P1 ), , f ( PK )]T . Remark. We can considerably reduce the risk of getting stuck in local minima by heavily overestimating the number K of hyperplanes because of the simple fact that generally the more the initial hyperplanes are, the larger the probability that all true hyperplanes are involved is. In addition, similar to the likelihood ratio statistics in [8], here the confidence index plays an important role in detecting the significant clusters. Also we can choose those hyperplanes corresponding to larger confidence indices with high rank priority and remove those hyperplanes corresponding to small confidence indices. 2.2 Multilayer K-HPC Clustering To obtain more accurate clustering results, similar to multilayer NMF [5] in some sense, here we also can consider multilayer K-HPC (see Fig.2): give an initialization V (0) , K-HPC outputs a set of clusters Pk , k = 1, , K and their corresponding confidence indices vector f ; remove the spurious clusters according to their confidence indices, and input the remaining clusters as the new initialization, K-HPC will produce another set of clusters and corresponding confidence indices. In the similar way to the methods in [7] and [9] in some degree, multilayer K-HPC scheme can confine the clustering results again and again until the desirable result is derived. 1
1 When the iterative procedure satisfies ‖V(iter) − V(iter−1)‖ ≤ ε (typically ε = 10⁻⁸), we stop the algorithm. In many situations, we can even take ε = 0 for K-HPC.
Fig. 2. Conceptual model of multilayer K-HPC
Especially for the noisy hyperplane clustering problem, multilayer K-HPC can usually produce significantly more robust identifications.
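A minimal sketch of one pass of the basic (single-layer) K-HPC iteration outlined in Section 2.1 is given below (our own NumPy illustration; the synthetic data, function name and the use of a fixed number of iterations instead of the convergence test are assumptions).

import numpy as np

def k_hpc(X, K, n_iter=50, rng=None):
    # X: (m, T) array of m-dimensional samples (no all-zero columns assumed);
    # returns K unit normals and the confidence index of each hyperplane.
    rng = np.random.default_rng(rng)
    X = X / np.linalg.norm(X, axis=0, keepdims=True)   # step (1): unit-norm samples
    m, T = X.shape
    V = rng.normal(size=(m, K))
    V /= np.linalg.norm(V, axis=0)                     # random unit normals
    conf = np.zeros(K)
    for _ in range(n_iter):
        # step (2): assign each sample to the hyperplane whose normal it is
        # most nearly orthogonal to, d(x, v) = |<x, v>|
        labels = np.argmin(np.abs(X.T @ V), axis=1)
        for k in range(K):
            Xk = X[:, labels == k]
            if Xk.shape[1] < m:                        # too few points, keep old normal
                conf[k] = 0.0
                continue
            # step (3): eigenvector of Xk Xk^T with the smallest eigenvalue is the
            # new normal; product of the remaining eigenvalues is the confidence index
            d, W = np.linalg.eigh(Xk @ Xk.T)
            V[:, k] = W[:, 0]
            conf[k] = np.prod(d[1:])
    return V, conf

# toy usage: noisy points on the planes z = 0 and x = 0 in R^3
rng = np.random.default_rng(0)
n1 = rng.normal(size=(3, 400)); n1 -= np.outer([0, 0, 1], n1[2])
n2 = rng.normal(size=(3, 400)); n2 -= np.outer([1, 0, 0], n2[0])
X = np.hstack([n1, n2]) + 0.01 * rng.normal(size=(3, 800))
V, conf = k_hpc(X, K=6)
print(np.round(V[:, np.argsort(conf)[::-1][:2]], 2))   # two most significant normals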
3 Fast Multilayer Initialization (FMI) The initialization plays a very important role in K-HPC, so we discuss how to produce a good initialization for K-HPC. Generally speaking, a hyperplane v_k can be identified by K-HPC if v_k falls into the ε-neighborhood of some initial hyperplane v_i⁽⁰⁾, i = 1, ..., K, i.e., v_k ∈ O(v_i⁽⁰⁾, ε). The number K₀ of unknown true normal vectors v_k, k = 1, ..., K₀, is finite and fixed. Theoretically, we are always able to make all v_k, k = 1, ..., K₀, be covered by the ε-network ∪_{k=1}^{K} O(v_k⁽⁰⁾, ε) by setting a very large K (e.g., K = 1000000). Without loss of generality, we assume the entries v_1k, ..., v_mk of the normal vector v_k to lie in the interval [-1, 1], i.e., v_k ∈ [-1, 1]ᵐ, because we can always achieve this by normalization. So we can sufficiently segment the m-dimensional geometrical body [-1, 1]ᵐ to obtain a very dense ε-grid, where each node of the ε-grid corresponds to an initial vector v_k⁽⁰⁾. In this way, the K₀ true normal vectors v_k, k = 1, ..., K₀, will theoretically be covered by the ε-network ∪_{k=1}^{K} O(v_k⁽⁰⁾, ε) if K is sufficiently large, so that the K₀ hyperplanes v_k, k = 1, ..., K₀, could theoretically be identified by K-HPC. In practice, we can produce such an ε-grid by generating uniformly distributed random numbers in the interval [-1, 1]. To considerably increase the probability that all true normals v_k, k = 1, ..., K₀, are covered by the ε-network ∪_{k=1}^{K} O(v_k⁽⁰⁾, ε), it is better to set a very large K, for example K = 1000. However, the computation would then be too heavy. To overcome this
problem, we employ the idea of multilayer or hierarchical systems [5][7] and extend it to a "fast multilayer initialization" (FMI) method that produces a more efficient initialization V⁽⁰⁾ for K-HPC. Suppose that we have a high-dimensional pre-initial normal matrix V̄⁽⁰⁾ = [v̄_1⁽⁰⁾, ..., v̄_N⁽⁰⁾], whose size is m×N (usually N >> n and N >> K; typically, N = 100000, K = 100 and n = 6). In many situations, we can generate V̄⁽⁰⁾ uniformly at random in the interval [-1, 1]. For simplicity, we normalize v̄_k⁽⁰⁾, k = 1, ..., N, such that ‖v̄_k⁽⁰⁾‖₂ = 1, k = 1, ..., N. The FMI algorithm is as follows (see Fig. 3):
x (t ) 2 = 1, t = 1,
X such that
, T if x (t ) ≠ 0 .
V (0) . In this step, we attempt to select K (0) (0) possibly optimal columns from V to construct the normal matrix V .
(3) Searching normal matrix (0) Set V(0) =null;
For i=1,…,M (0) Construct matrix Vi (0) = [V((0) ] , and obviously the size of Vi (0) is i −1) , Vi m×(K+Ni); Extract K possibly optimal columns from
Vi (0) to construct V((0) i ) (the
detailed operations are as follows); End To select K possibly optimal columns from Vi (0) = [vi(0) (1), , vi(0) ( K + N i )] , we perform the following operations: Compute all correlation coefficients c(vi(0) (k ), x (t )) = [vi(0) (k )]T [ x(t )] , t = 1, , T , k = 1, , K + N i . Assign all
samples
x (t ), t = 1,
Ω(v ( k )), k = 1, (0) i
satisfies k = arg
,T
to
the
K + Ni
different
sets
, K + N i ; x (t ) ∈ Ω (v ( k )) if and only if x (t ) (0) i
min { c( x (t ), vi(0) ( j )) } . Account the number of entries
j =1, , K + N i
in each set Ω (vi(0) ( k )), k = 1, construct m×K matrix
(0) , K + N i . Extract K columns from Vi to
V((0) i ) , where these K columns correspond to the K sets
that have most entries. (4) Output the m×K initial normal matrix
V (0) for K-HPC.
1038
Z. He and A. Cichocki
The FMI algorithm is not only very fast, but also it can take advantage of high (0)
dimensional matrix V to produce the relatively large scale K-HPC.
V (0) . It can produce efficient initializations for
∼
X
Select K possibly optimal (0 ) columns from V t o (0) 1 produce V .
Select K possibly opt imal (0) columns from V2 to (0) produce V . (2)
(1)
(0)
V(1)
V1
(0)
Select K possibly opt imal ( 0) columns from V t o M (0) produce V .
…
(0) V2 = [V(1) ,V2 ] V(2) (0)
(0)
(0)
…
(0)
VM
(0) = [V((0) M −1) ,VM ]
V2( 0)
The first layer
The second layer
VM( 0)
V
(0)
The Mth layer
Fig. 3. Fast multilayer initialization (FMI) scheme
4 SCA Based on K-HPC Consider the following sparse component analysis (SCA) model [2][10]-[19]:
X = AS , where X = [ x(1),
A = [a1 ,
(3)
, x(T )] ∈ R m×T (T >> m) is the given data (observation) matrix,
, an ] ∈ R m×n is unknown basis matrix and S ∈ R n×T is also unknown
matrix representing sparse sources or hidden components, T is the number of available sample, m the number of observations and n the number of sources. The objective of SCA is to find a reasonable basis matrix A such that the coefficients in matrix S are as sparse as possible. Here we consider the undetermined case m≤n and suppose that any square m×m submatrix of A is nonsingular. From [2], we know that the basis matrix A is identifiable uniquely up permutation and scaling of the columns if each column of the sparse source matrix S has at most m-1 nozero entries in (3). Moreover, T samples of X are respetively hidden in
Cnm −1
hyperplanes generated by the columns of A, respetively. We use the two-stage methods to solve above SCA problem: in the first stage estimate the basis matrix A and in the next stage to estimate the coefficient matrix S using the method in [2]. In order to estimate the basis matrix A, we first identify the
Cnm −1 hyperplanes
using K-HPC. Without the loss of generality, we denote the normal vectors of these
Cnm −1 hyperplanes as vk , k = 1, , Cnm−1 , respectively. Then each column of basis matrix A is respectively the one-dimensional (1-D) complementary space of
An Efficient K-Hyperplane Clustering Algorithm and Its Application to SCA
span{
1,
vk1 , v k 2 ,
1039
, v kC m−2 }, where the indices k1, k 2, , kCnm−−12 are taken from n −1
, Cnm −1 , respectively [2]. For this purpose, we can estimate the columns of A by
eigenvalue decomposition (EVD). So we can identify basis matrix A by calculating these non-null 1-D complementary spaces of all span{ v k1 , v k 2 , , v kC m−2 }. Some n −1
examples are given in next section.
5 Numerical Experiment and Result Analysis In order to demonstrate the performance of the K-HPC algorithm, here we give a sparse component analysis experiment [2]. To check how well the mixing matrix is estimated, we introduce the following Biased Angles Sum (BAS), defined as the sum of angles between the column vectors (of basis matrix) and their corresponding estimations: 180 n BAS ( A, Aˆ ) = ⋅ ∑ i =1 acos ai , aˆi (degree),
(4) π where acos (i) denotes the inverse cosine function, i,i denotes the inner product and A = [a1 ,
, an ] , Aˆ = [aˆ1 ,
, aˆ n ] .
Example 1: Consider the (hyper)planes clustering in 3-dimensional space in noise free case. 21 (hyper)planes with 20000 samples were randomly generated. “Four layer KHPC” was used, where K=200, K=100, K=40, and K=30 in each layer, respectively. We tried 20 Monte Carlo tests and uniformly random initializations were used in each test. The proposed multilayer K-HPC algorithm successfully identified all (hyper)planes without any errors in all tests. We also tried the “K-plane Clustering algorithm” in [4], which got stuck in local minima and failed in all tests. In addition, we also tried the initializations produced by FMI algorithm in this example. Combining with FMI algorithm, K-HPC could faster identified 21 planes. Example 2: Sparse component analysis example: the source matrix S ∈ R 4×30000 was generated artificially; there are uniformly two nonzero entries in each column of S ; and the basis matrix was randomly generated as follows: ⎛ -0.2507 0.1714 0.9637 0.2270 ⎞ ⎜ ⎟. A = ⎜ -0.9653 -0.6833 -0.0305 -0.2426 ⎟ (5) ⎜ 0.0726 0.7098 0.2652 0.9432 ⎟ ⎝ ⎠ So three mixtures X ∈ R 3×30000 were obtained from X = AS . Some white Gaussian noise was added to X . All SNRs are 30dB. To estimate the basis matrix A , we first cluster six (hyper)planes. K-HPC successfully achieved this goal and “three layer KHPC” was used. In the first layer, K was set as K=50 and the 3×50 normal matrix V was randomly initialized. The 3×50 Vˆ (1) was obtained in this layer. Then we removed the spurious columns columns from V (1) . In the second layer, K was set as K=40. From Fig.4, we can see that six hyperplanes are significant. So in third layer, K=6.
1040
Z. He and A. Cichocki
[
Fig. 4. The sorted confidence indices of the estimated hyperplanes
After convergence, the normal matrix $\hat{V}$ of the six hyperplanes was estimated. From $\hat{V}$, we further estimated the basis matrix A as
$\hat{A} = \begin{pmatrix} -0.2501 & 0.1713 & 0.9638 & 0.2270 \\ -0.9654 & -0.6832 & -0.0306 & -0.2437 \\ 0.0736 & 0.7098 & 0.2648 & 0.9429 \end{pmatrix}$.  (6)
By calculation, we found $\mathrm{BAS}(A, \hat{A}) = 1.4609$ degrees, so the basis matrix was accurately estimated. After this, the algorithm in [2] for recovering the sources was employed to estimate the sources. The SIRs of the four estimated sources are 18.6058 dB, 16.3066 dB, 28.1552 dB and 16.5722 dB, respectively. For the hyperplane estimation in this example, the clustering algorithm in [2] does not work because there are 30000 samples; in our experience, that algorithm works well only when the number of samples is less than 500.
6 Conclusions
K-hyperplane clustering was discussed in this paper, and an efficient K-HPC algorithm was developed. It is easy to implement, and by overestimating the number of hyperplanes it has advantages in detecting the true number of hyperplanes and avoiding local minima. We can choose the significant hyperplanes with high rank priority and remove the spurious hyperplanes according to their corresponding confidence indices. Furthermore, the proposed "multilayer K-HPC" helps to further refine the clustering results, and the FMI scheme can be used to produce more effective initializations for K-HPC.
References
[1] Vidal, R., Ma, Y., Sastry, S.: Generalized Principal Component Analysis (GPCA). IEEE Trans. Pattern Analysis and Machine Intelligence 27 (2005) 1945-1959
[2] Georgiev, P., Theis, F., Cichocki, A.: Sparse Component Analysis and Blind Source Separation of Underdetermined Mixtures. IEEE Trans. on Neural Networks 16 (2005) 992-996
[3] He, Z.S., Cichocki, A.: K-EVD Clustering and Its Applications to Sparse Component Analysis. Lecture Notes in Computer Science 3889 (2006) 90-97
[4] Bradley, P.S., Mangasarian, O.L.: k-Plane Clustering. Journal of Global Optimization 16 (2000) 23-32
[5] Cichocki, A., Zdunek, R.: Multilayer Nonnegative Matrix Factorization. Electronics Letters 42 (2006) 947-948
[6] Li, Y.Q., Cichocki, A., Amari, S.: Analysis of Sparse Representation and Blind Source Separation. Neural Computation 16 (2004) 1193-1234
[7] Matsuda, Y., Yamaguchi, K.: Linear Multilayer Independent Component Analysis for Large Natural Scenes. Advances in Neural Information Processing Systems 17 (2005) 897-904
[8] Neill, D., Moore, A.: Detecting Significant Multidimensional Spatial Clusters. Advances in Neural Information Processing Systems 17 (2005) 969-976
[9] Goldberger, J., Roweis, S.: Hierarchical Clustering of a Mixture Model. Advances in Neural Information Processing Systems 17 (2005) 505-512
[10] Georgiev, P., Ralescu, A.: Clustering on Subspaces and Sparse Representation of Signals. In: Proc. 2005 IEEE International 48th Midwest Symposium on Circuits and Systems (2005) 1843-1846
[11] Zibulevsky, M., Pearlmutter, B.A.: Blind Source Separation by Sparse Decomposition in a Signal Dictionary. Neural Computation 13 (2001) 863-882
[12] Bofill, P., Zibulevsky, M.: Underdetermined Blind Source Separation Using Sparse Representations. Signal Processing 81 (2001) 2353-2362
[13] Li, Y.Q., Cichocki, A., Amari, S.: Analysis of Sparse Representation and Blind Source Separation. Neural Computation 16 (2004) 1193-1234
[14] Lewicki, M.S., Sejnowski, T.J.: Learning Overcomplete Representations. Neural Computation 12 (2000) 337-365
[15] Girolami, M.: A Variational Method for Learning Sparse and Overcomplete Representations. Neural Computation 13 (2001) 2517-2532
[16] Cichocki, A., Amari, S.: Adaptive Blind Signal and Image Processing: Learning Algorithms and Applications. John Wiley & Sons, New York (2003)
[17] Aharon, M., Elad, M., Bruckstein, A.M.: The K-SVD: An Algorithm for Designing of Overcomplete Dictionaries for Sparse Representation. IEEE Trans. on Signal Processing 54 (2006) 4311-4322
[18] Tropp, J.A.: Greed is Good: Algorithmic Results for Sparse Approximation. IEEE Trans. Information Theory 50 (2004) 2231-2242
[19] Gribonval, R., Nielsen, M.: Sparse Decompositions in Unions of Bases. IEEE Trans. Inform. Theory 49 (2003) 3320-3325
[20] Washizawa, Y., Cichocki, A.: On Line K-Plane Clustering Learning Algorithm for Sparse Component Analysis. In: Proc. 2006 IEEE International Conference on Acoustics, Speech, and Signal Processing (2006)
A PCA-Combined Neural Network Software Sensor for SBR Processes
Liping Fan and Yang Xu
Automation Dept., Shenyang Institute of Chemical Technology, Shenyang, 110142, China
[email protected]
Abstract. High non-linearity, serious time-variability and uncertainty give rise to a number of very challenging problems in the monitoring and control of biological processes, and many important variables are difficult to measure during monitoring and control. Software sensors can estimate unmeasured state variables from the measured information provided by the online measuring instruments available in the system, which offers a feasible alternative for online measurement. A hybrid soft measurement model that combines principal component analysis with artificial neural networks is applied to monitor the sequencing batch reactor (SBR) process. Simulation results show that most of the unmeasured variables can be predicted and that the method captures the main trend of the data.
1 Introduction
Biological wastewater treatment processes are very complicated control systems: they are strongly coupled, highly non-linear and time varying. The control parameters in such systems influence each other and exhibit large lags, and at the same time each parameter changes as production changes. Due to the biological characteristics of the activated sludge process, it is difficult to obtain all state information using online sensors, and a large quantity of the state information required for control cannot be measured [1]. Most monitoring methods based on a precise mechanistic model of the biological wastewater treatment process are difficult to carry out, because a precise mechanistic model of such a complicated system is so far very difficult to obtain. Artificial neural networks (ANN) can model complex nonlinear systems easily without a mechanistic model, so ANN has received special attention from many researchers. Since an ANN model uses only input and output data observed from the target system, its prediction ability depends strongly on the quality of the training data; if the training data include noise or other unstable factors, the model can be misled and the monitoring strategy will break down. It is therefore essential to extract the required information from noisy data through data preprocessing. This paper focuses on the development of an ANN as a software sensor for biological wastewater treatment processes, combined with the conventional statistical method of principal component analysis (PCA) for data preprocessing. This
method simplifies the soft measurement structure while improving monitoring performance. The sequencing batch reactor (SBR) system for simultaneous nitrogen and phosphorus removal is taken as the target system.
2 Hybrid Software Measurement Model (PCA with ANN)
2.1 Principal Component Analysis (PCA)
PCA is a projection-based statistical tool traditionally used for reduction of dimensionality [2]. The purpose of PCA is to identify linear correlations between random variables aiming at dimensionality reduction. A few linear features called principal components, which describe major trends in the data, explain the distribution of the variables. Consider a p-dimensional data set X that consists of zero-mean elements, X1, X2,…, Xp. The maximum number of PCs is p and the jth principal component is
$Y_j = h_{1j} X_1 + h_{2j} X_2 + \cdots + h_{pj} X_p$, (j = 1, 2, ..., p)  (1)
where $h_{ij}$ is the weight that reflects the contribution of $X_i$ to the principal component $Y_j$. PCA then decomposes the data $X$ into a series of principal components consisting of score vectors ($t_r$) and loadings ($p_r$), plus a residual ($E$), that is
$X = \sum_{r=1}^{R} t_r p_r^T + E$.  (2)
The loading vectors ($p_r$) define the reduced-dimension space and are the directions of maximum variability. The score vectors ($t_r$) describe the data in the principal component space. The residual matrix ($E$) is as small as possible in a least-squares sense. Usually, a few principal components can express most of the variability in the data when there is a high degree of correlation among the data. PCA has a very wide range of applications; as in this study, the PCs extracted through PCA can be used as inputs to the next analysis step. Many recent studies have been concerned with PCA. This paper describes nonlinear processes by integrating PCA with a nonlinear modeling approach, namely an artificial neural network.
2.2 Artificial Neural Networks
Interest in neural networks has increased in recent years, due partly to significant research breakthroughs in architectures, learning algorithms and operational characteristics [3]. An artificial neural network is an abstract simulation of a real nervous system that contains a collection of processing units or processing elements communicating with each other via axon connections; such a model resembles the axons and dendrites of the nervous system. A fundamental feature of artificial neural networks is their adaptive nature, where learning by examples replaces programming in solving problems. This important feature makes such a computational paradigm very
appealing in application domains where one has little or incomplete understanding of the problem to be solved, and it is a powerful tool for dealing with nonlinear systems. The neural network used in this study is a multilayer perceptron (MLP), and the Levenberg-Marquardt method, an improved back-propagation (BP) algorithm, is adopted for training the network. Signal propagation in an MLP that contains one hidden layer can be described as
$y = \sum_{i=1}^{H} w_i f_i(x)$,  (3)
$f_i(x) = \tanh(v_i^T x) = \dfrac{1 - e^{-\alpha v_i^T x}}{1 + e^{-\alpha v_i^T x}}$,  (4)
where $x$ is the input vector, $y$ is the output value, $H$ is the number of hidden neurons, $w_i$ is the weight value, $v_i$ is the weight vector multiplied with the input, $f_i(x)$ is the activation function, and $\alpha$ is a constant. ANN requires data of good quality that reflect the dynamics of the target system accurately, which is hard to obtain from real wastewater treatment processes. In wastewater treatment processes, noise and measurement error are the main obstacles to setting up an ANN model based on measured data. To facilitate the use of ANN, this paper introduces a data pretreatment technique that reduces the level of noise in the data [4].
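As an illustration only (this code is not from the paper; the weights, the number of hidden units H and the constant alpha are assumed values), the following NumPy sketch evaluates Eqs. (3)-(4) for a single-hidden-layer MLP:

import numpy as np

def mlp_forward(x, V, w, alpha=1.0):
    """Forward pass of Eqs. (3)-(4): y = sum_i w_i f_i(x),
    f_i(x) = (1 - exp(-alpha v_i^T x)) / (1 + exp(-alpha v_i^T x))."""
    z = V @ x                                                  # v_i^T x for each hidden unit
    f = (1.0 - np.exp(-alpha * z)) / (1.0 + np.exp(-alpha * z))
    return float(w @ f)

# Assumed toy dimensions: 3 inputs (the 3 PCs), 8 hidden units, 1 output.
rng = np.random.default_rng(0)
y = mlp_forward(rng.normal(size=3), rng.normal(size=(8, 3)), rng.normal(size=8))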
3 Sequencing Batch Reactor (SBR)
The sequencing batch reactor (SBR) process is characterized by a series of process phases, such as FILL, MIX, AERATE, SETTLE, DISCHARGE and IDLE, each lasting for a defined period, as shown in Fig. 1. Only one reactor exists in an SBR process: the regulation of water quality and quantity, the microbiological degradation of organic substances, and the separation of the mixed liquor are all performed in the same reactor. Microorganisms in the activated sludge use various organic materials as nutrients, and various environmental conditions, such as anaerobic, aerobic and anoxic, can occur in the SBR. In the anaerobic mixing period, the phosphate-accumulating organisms (PAO) decompose the poly-phosphate stored in vivo and store poly-hydroxyalkanoates (PHA); the inorganic phosphorus generated by decomposing poly-phosphate is released out of the PAO, which is called anaerobic phosphorus release. In the aerobic period, the organic carbon in the wastewater is oxidized to carbon dioxide and water (carbon removal), and ammonia is oxidized to nitrate (nitrification); at the same time, the PAO decompose the PHA stored in them, and the released energy helps the PAO absorb phosphate and store it as poly-phosphate (aerobic phosphate uptake). In the anoxic period, nitrate is reduced to gaseous nitrogen, which is called denitrification. By the alternating operation of anaerobic, aerobic and anoxic phases, the biological removal of carbon, nitrogen and phosphorus is accomplished.
Fig. 1. Operation phases of generic SBR process
The SBR process has a complex biochemical reaction mechanism; its inherent strong non-linearity, limited cycle duration, unstable operation and so on bring special difficulties to process monitoring. The data sets used in monitoring often show strong coupling and are contaminated with substantial noise. If an appropriate method cannot be found to process such data, only limited information can be extracted from the data set, which impairs an accurate understanding of the operating condition of the SBR process. In this paper, PCA is used for data preprocessing, and an ANN is used to construct a soft measurement method to monitor the SBR process. A model is needed to evaluate the results of the monitoring strategies. Among the activated sludge models issued by the International Water Association (IWA) [5], only ASM2d contains the phosphorus removal process, so ASM2d is selected to describe the SBR system with simultaneous removal of nitrogen and phosphorus. ASM2d contains 19 state variables and 21 reaction sub-processes. Based on ASM2d, a MATLAB simulation model of the SBR system has been developed [6]. The soft measurement method is studied on this simulation model.
4 Simulation and Results
The data used in this research are collected from ASM2d. A fill-and-draw SBR is operated in a 6 h cycle mode, and each cycle consists of anaerobic (1.5 h), aerobic (2.5 h) and anoxic (1.5 h) phases. The training data and testing data are selected from normal dry-weather data comprising 750 samples in all; two-thirds are used as training data
Fig. 2. Hybrid neural network combined with principal component analysis model structure
Fig. 3. Software measurement simulation results
and the remaining samples are used as testing data. At present, the only chemical components that can be measured directly online are S_O, S_NH, S_NO and S_ALK, so these four variables determine the size of the network input layer and serve as the auxiliary variables. Because the other components cannot be measured directly, the 19 state variables of the model are candidates for the output layer. However, restricted by the system's degrees of freedom, the number of auxiliary variables cannot be smaller than the number of estimated variables, so the 19 variables cannot all be taken as outputs of the network at the same time. This paper therefore adopts the approach in which the variables are estimated
independently, which means that only one variable is selected as the output of the network each time; all variables can be obtained by running the procedure 19 times. In addition, an S-type activation function is chosen for the hidden layer and a linear activation function for the output layer. An ANN with one hidden layer of 8 neurons is employed to estimate the target value. In order to mitigate the sensitivity of the ANN to noise and enhance the prediction capability, the 4 wastewater quality parameters are reduced to 3 PCs, which become the inputs of the ANN, as shown in Fig. 2. After training the neural network and fixing the network structure, soft measurement can be carried out using the final weights, thresholds and transfer functions. Some main variables measured by the hybrid soft measurement method are shown in Fig. 3. The simulation results show that most of the unmeasured variables can be predicted and that the main trend of the data is captured.
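As an illustration of the overall procedure described above (this code is not from the paper; the data arrays, the scikit-learn usage and the network settings are assumptions), a minimal sketch of the PCA-plus-MLP soft sensor could look as follows: the four measured variables are reduced to three principal components, and one single-output network is trained per estimated variable.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPRegressor

# Assumed data: X_meas holds the 4 online-measurable variables (S_O, S_NH, S_NO, S_ALK)
# and Y_state holds the 19 ASM2d state variables at the same time instants.
rng = np.random.default_rng(0)
X_meas = rng.normal(size=(750, 4))            # placeholder for real SBR data
Y_state = rng.normal(size=(750, 19))

n_train = 500                                 # roughly two-thirds for training
pca = PCA(n_components=3)
Z_train = pca.fit_transform(X_meas[:n_train])
Z_test = pca.transform(X_meas[n_train:])

# One ANN (3 inputs, 8 hidden tanh neurons, 1 linear output) per estimated variable.
models = []
for j in range(Y_state.shape[1]):
    net = MLPRegressor(hidden_layer_sizes=(8,), activation='tanh',
                       solver='lbfgs', max_iter=2000, random_state=0)
    net.fit(Z_train, Y_state[:n_train, j])
    models.append(net)

Y_pred = np.column_stack([m.predict(Z_test) for m in models])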
5 Conclusion
With the hybrid artificial neural network soft measurement approach, the primary variables can basically reflect the variation tendency of the system, and a relatively high precision can be reached. As a soft measurement algorithm, this method has great potential in the monitoring of wastewater treatment processes.
Acknowledgements This paper is supported by the Science and Technology Fund provided by the Education Department of Liaoning Province (05L334).
References
1. Fan, L.P., Yu, H.B., Yuan, D.C.: State of the Art for Instruments of Wastewater Treatment Processes. Process Automation Instrumentation 26 (2005) 1-5
2. Shao, T., Jia, F., Martin, E.B., Morris, A.J.: Wavelets and Non-Linear Principal Components Analysis for Process Monitoring. Control Eng. Practice 7 (1999) 865-879
3. Kurtanjek, Ž.: Principal Component ANN for Modeling and Control of Baker's Yeast Production. J. Biotechnol. 5 (1998) 23-35
4. Chéruy, A.: Software Sensors in Bioprocess Engineering. Journal of Biotechnology 52 (1997) 193-199
5. Henze, M., Gujer, W.: Activated Sludge Models ASM1, ASM2, ASM2d, ASM3. IWA, London (2000)
6. Fan, L.P., Yu, H.B., Yuan, D.C.: Improved Optimal Control of SBR Biological Wastewater Treatment Systems. Control and Decision 17 (2005) 237-240
Symmetry Based Two-Dimensional Principal Component Analysis for Face Recognition
Mingyong Ding1, Congde Lu2, Yunsong Lin2, and Ling Tong2
1 School of Computer Science and Information Engineering, Chongqing Technology and Business University, Chongqing 400067, China
[email protected]
2 School of Automation Engineering, University of Electronic Science and Technology of China, Chengdu 610054, China
[email protected]
Abstract. Two-dimensional principal component analysis (2DPCA), proposed recently, overcomes a limitation of principal component analysis (PCA), namely its expensive computational cost. Symmetrical principal component analysis (SPCA) is also an effective feature extraction technique because it exploits the symmetry of the human face. This paper presents a symmetry based two-dimensional principal component analysis (S2DPCA), which combines the advantages of 2DPCA and of SPCA. The experimental results show that S2DPCA is competitive with or superior to 2DPCA and SPCA.
1 Introduction
In 1991, Turk and Pentland presented the well-known Eigenfaces method for face recognition [1]. Since then, principal component analysis (PCA) has been widely investigated and has become one of the most successful approaches in face recognition. However, PCA also has some drawbacks. The first is that the sensitivity of traditional PCA to outliers seriously affects its precision. Some researchers have analyzed the robustness of PCA and proposed corresponding improved algorithms; for example, Xu et al. [2] applied the statistical physics approach to tackle the robustness problem of PCA and turned several commonly used PCA self-organizing rules into robust versions. The second is that traditional PCA is based exclusively on second-order statistics with a smooth Gaussian distribution. To overcome this shortcoming, kernel principal component analysis (KPCA) was proposed by Schölkopf et al. [3][4] and has been applied to face recognition [5]. The KPCA algorithm uses a kernel function to capture arbitrary high-order correlations between input variables and finds the required principal components through inner products between input data. In 2002, Yang et al. [6] presented a Symmetrical Principal Component Analysis (SPCA) algorithm based on the symmetry of the human face. Recently, two-dimensional principal component analysis (2DPCA), presented by Yang et al. [7], alleviates the expensive computational cost of PCA because it does not need to transform image matrices into vectors, which are usually of very high dimensionality. This
paper presents a symmetry based 2DPCA (S2DPCA) algorithm, which combines the advantages of 2DPCA and of SPCA. The remainder of this paper is organized as follows. Section 2 describes the principle of SPCA in brief. In Section 3, we give a detailed description of S2DPCA. The experiments on the CBCL and ORL databases are reported in Section 4. Finally, conclusions are presented in Section 5.
2 Principle of SPCA
The SPCA algorithm [6] is derived from a simple idea of even-odd decomposition: any function $f(t)$ can be decomposed into an even function $f_e(t)$ and an odd function $f_o(t)$, i.e. $f(t) = f_e(t) + f_o(t)$, where $f_e(t) = \frac{f(t)+f_m(t)}{2}$ is an even function, $f_o(t) = \frac{f(t)-f_m(t)}{2}$ is an odd function, and $f_m(t) = f(-t)$ is the mirror function. Through further decomposition, $f_e(t)$ and $f_o(t)$ can be represented by linear combinations of a set of even/odd symmetrical basis functions. Then, any function can be composed of a set of even symmetrical basis functions and a set of odd symmetrical basis functions. We apply this principle to face images and define symmetry to be the horizontal mirror symmetry with the vertical midline of the image as its axis. Let $\{x_k, k = 1, 2, \ldots, M, x_k \in R^N\}$ denote a set of face images, where $M$ is the number of training samples and $N$ is the input dimensionality. According to the odd-even decomposition theory, $x_k$ can be decomposed as $x_k = x_{ek} + x_{ok}$, with $x_{ek} = \frac{x_k + x_{mk}}{2}$ denoting the even symmetrical image and $x_{ok} = \frac{x_k - x_{mk}}{2}$ denoting the odd symmetrical image, where $x_{mk}$ is the mirror image of $x_k$. Let $C$, $C_e$, $C_o$ denote the covariance matrices of the sets of face images $x_k$, $x_{ek}$, $x_{ok}$, respectively, i.e. $C = \frac{1}{M}\sum_{k=1}^{M} x_k x_k^T$, $C_e = \frac{1}{M}\sum_{k=1}^{M} x_{ek} x_{ek}^T$, $C_o = \frac{1}{M}\sum_{k=1}^{M} x_{ok} x_{ok}^T$. Because $x_k = x_{ek} + x_{ok}$, $C = \frac{1}{M}\sum_{k=1}^{M} x_k x_k^T = \frac{1}{M}\sum_{k=1}^{M} (x_{ek}+x_{ok})(x_{ek}+x_{ok})^T = \frac{1}{M}\sum_{k=1}^{M} (x_{ek}x_{ek}^T + x_{ek}x_{ok}^T + x_{ok}x_{ek}^T + x_{ok}x_{ok}^T)$, and since $\langle x_{ek}, x_{ok}\rangle = 0$, we get $C = \frac{1}{M}\sum_{k=1}^{M} (x_{ek}x_{ek}^T + x_{ok}x_{ok}^T) = C_e + C_o$. It is evident that the eigenvalue decomposition of $C$ is equivalent to the eigenvalue decompositions of $C_e$ and $C_o$. As a result, $x_k$ can be reconstructed linearly from the eigenvectors of $C_e$ and $C_o$.
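As a small illustration of this decomposition (not from the paper; the image array is an assumed placeholder), the following NumPy sketch splits a face image into its even and odd symmetrical parts about the vertical midline and checks that their vectorized inner product vanishes:

import numpy as np

def even_odd_decompose(img):
    """Horizontal mirror decomposition: img = even + odd, mirrored about the vertical midline."""
    mirror = img[:, ::-1]
    even = (img + mirror) / 2.0
    odd = (img - mirror) / 2.0
    return even, odd

img = np.random.default_rng(0).normal(size=(112, 92))   # placeholder face image
even, odd = even_odd_decompose(img)
assert np.allclose(even + odd, img)
print(float(even.ravel() @ odd.ravel()))                 # approximately zero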
3 Symmetry Based 2DPCA Algorithm
In 2DPCA, we hope to find a projection vector $V$, where $V$ denotes an n-dimensional unitary column vector. An image $X$ is then projected onto $V$ by the following linear transformation:
$Y = XV$.  (1)
Thus, an m-dimensional projected vector $Y$, which is called the projected feature vector of image $X$, can be obtained. How to determine a good projection
vector $V$? According to the proposition of Yang [7], the total scatter of the projected samples can be introduced to measure the discriminatory power of the projection vector $V$, and the total scatter of the projected samples can be characterized by the trace of the covariance matrix of the projected feature vectors. Thus, the following criterion can be adopted:
$J(V) = \mathrm{tr}(S_v)$,  (2)
where $S_v$ denotes the covariance matrix of the projected feature vectors of the training samples and $\mathrm{tr}(S_v)$ denotes the trace of $S_v$, with $\mathrm{tr}(S_v) = V^T \{E[(X - EX)^T (X - EX)]\} V$. Let us define the matrix
$C = E[(X - EX)^T (X - EX)]$,  (3)
where $C$ is called the image covariance matrix; $C$ can be computed directly from the training image samples. Now, we give a description of the S2DPCA algorithm as follows. Suppose that there are $N$ training image samples in total, the $i$-th training image is denoted by an $m \times n$ matrix $X_i$ ($i = 1, 2, \ldots, N$), and the average image of all training samples is denoted by $\bar{X}$. Then, $C$ can be evaluated by
$C = \frac{1}{N}\sum_{i=1}^{N} (X_i - \bar{X})^T (X_i - \bar{X})$.  (4)
According to the odd-even decomposition theory, $X_i$ can be decomposed as $X_i = X_{ei} + X_{oi}$, with $X_{ei} = \frac{X_i + X_{mi}}{2}$ denoting the even symmetrical image and $X_{oi} = \frac{X_i - X_{mi}}{2}$ denoting the odd symmetrical image, where $X_{mi}$ is the mirror image of $X_i$. Here, let $\bar{X}_e$ denote the average image of all even symmetrical samples and $\bar{X}_o$ denote the average image of all odd symmetrical samples. Then, $C$ can also be evaluated by
$C = C_e + C_o$,  (5)
where
$C_e = \frac{1}{N}\sum_{i=1}^{N} (X_{ei} - \bar{X}_e)^T (X_{ei} - \bar{X}_e)$,  (6)
and
$C_o = \frac{1}{N}\sum_{i=1}^{N} (X_{oi} - \bar{X}_o)^T (X_{oi} - \bar{X}_o)$.  (7)
So the eigenvalue decomposition of $C$ is equal to the eigenvalue decomposition of the right-hand side of equation (5), i.e., the eigenvalue decompositions of $C_e$ and $C_o$. Thus, the criterion in (2) can be expressed by
$J(V) = V^T C V$.  (8)
The optimal projection axis $V_{opt}$ is the unitary vector that maximizes $J(V)$, i.e., the eigenvector of $C$ corresponding to the largest eigenvalue. From the literature [6], the odd and even symmetrical principal components hold different amounts of energy in facial images. For face recognition under restricted environments (for example, when the viewing angle and the illumination do not vary significantly), the symmetry of faces overwhelms their asymmetry, although the asymmetry is quite valuable in some other fields such as thermal imaging of faces. Thus, the even symmetrical components take up more energy than the odd symmetrical components, which means the even symmetrical components are more important. Of course, this does not mean that we should completely discard the odd symmetrical components, because some of them also contain important information for face recognition. In conclusion, both the odd and the even symmetrical components should be utilized in recognition; at the same time, the even symmetrical components should be reinforced while the odd symmetrical components should be suppressed to some extent. In fact, if we select the components with more energy, or greater variance, the even symmetrical components will be selected naturally because they hold more energy. For feature selection in S2DPCA, we adopt a strategy similar to SPCA, i.e. order the eigenvectors according to their eigenvalues and then select the eigenvectors corresponding to the larger eigenvalues. Since the variance (corresponding to the eigenvalue) of the even symmetrical components $V_e$ is larger than that of the corresponding odd components $V_o$, it is natural to consider the even symmetrical components first, and then the odd symmetrical components if necessary. Based on the theoretical description above, the S2DPCA algorithm can be summarized as follows.
(1) Generate the mirror images $X_{mk}$ from the training images $X_k$. Then decompose $X_k$ into the even symmetrical images $X_{ek}$ and the odd symmetrical images $X_{ok}$ by $X_{ek} = \frac{X_k + X_{mk}}{2}$, $X_{ok} = \frac{X_k - X_{mk}}{2}$.
(2) Evaluate $C_e$ and $C_o$ by equations (6) and (7), respectively, and compute the eigenvectors $V_e$ and $V_o$ by eigenvalue decomposition of $C_e$ and $C_o$, respectively.
(3) Order the eigenvectors $V_e$, $V_o$ according to their eigenvalues, then select the eigenvectors with the greater eigenvalues as the feature transformation matrix $V = (V_e, V_o)$.
(4) Extract the principal components of a test sample $X$ using the equation $Y = XV$.
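The following NumPy sketch (not from the paper) illustrates steps (1)-(4) under simplifying assumptions: a small set of placeholder training images, a fixed number of selected eigenvectors, and no further heuristics beyond sorting by eigenvalue.

import numpy as np

def s2dpca_fit(images, n_components):
    """Steps (1)-(3): even/odd decomposition, image covariance matrices, eigenvectors.

    images : (N, m, n) array of training images. Returns an (n, n_components)
    projection matrix V built from the eigenvectors of C_e and C_o with the
    largest eigenvalues.
    """
    mirrors = images[:, :, ::-1]
    X_e = (images + mirrors) / 2.0
    X_o = (images - mirrors) / 2.0

    def image_cov(X):
        Xc = X - X.mean(axis=0)                        # subtract the average image
        return sum(x.T @ x for x in Xc) / len(Xc)      # Eq. (6)/(7)

    evals, evecs = [], []
    for C in (image_cov(X_e), image_cov(X_o)):
        w, U = np.linalg.eigh(C)                       # ascending eigenvalues
        evals.append(w[::-1])
        evecs.append(U[:, ::-1])
    w_all = np.concatenate(evals)
    U_all = np.concatenate(evecs, axis=1)
    order = np.argsort(w_all)[::-1][:n_components]     # largest eigenvalues overall
    return U_all[:, order]

def s2dpca_features(image, V):
    """Step (4): project an image onto the selected axes, Y = X V."""
    return image @ V

images = np.random.default_rng(0).normal(size=(10, 19, 19))   # placeholder training set
V = s2dpca_fit(images, n_components=5)
Y = s2dpca_features(images[0], V)                              # (19, 5) feature matrix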
4 The Experiments and Analysis
In order to compare the performance of S2DPCA with that of SPCA and 2DPCA, we perform experiments on two popular face databases: the CBCL database [9] for binary classification and the ORL database [10] for multi-category classification. The first experiment is designed to test their performance and execution time. The second experiment is designed to compare the top recognition rates of these
algorithms and the selection of even symmetrical features and odd symmetrical features. In all the experiments, the image data are first preprocessed, i.e. histogram equalized, normalized to zero mean, and linearly clipped to [-1,1]. Then, S2DPCA, 2DPCA and SPCA are used to extract features from the images. Finally, a linear support vector machine [8] is used for classification.
4.1 Experiments on CBCL Database for Binary Classification
To demonstrate the performance of S2DPCA, we conduct face/non-face classification experiments on the CBCL database. The images have been normalized to standard 19 × 19 images with 256-level gray scale. To demonstrate the effectiveness of S2DPCA, we compare it with SPCA and 2DPCA. The experiments were performed with 400 training images (200 face images and 200 non-face images) and 400 test images (200 face images and 200 non-face images); there is no overlap between the training samples and the test samples.

Table 1. Comparison of the recognition rate (%) of S2DPCA, 2DPCA and SPCA on the CBCL database

Training samples/class   1            2            3            4            5
SPCA                     79.5 (40)    85.8 (47)    91.4 (64)    93.5 (57)    94.1 (48)
2DPCA                    75.6 (19×3)  84.5 (19×4)  89.8 (19×5)  93.2 (19×5)  95.0 (19×4)
S2DPCA                   79.8 (19×3)  88.7 (19×4)  92.9 (19×5)  94.4 (19×5)  96.5 (19×4)
Table 2. Comparison of the average time for feature extraction using SPCA, 2DPCA and S2DPCA

Algorithms                        SPCA    2DPCA   S2DPCA
Time of feature extraction (s)    10.45   2.19    5.23
Recognition rates using SPCA, 2DPCA and S2DPCA are shown in Table 1, and a comparison of the average time for feature extraction using SPCA, 2DPCA and S2DPCA is presented in Table 2. From the experiment on the CBCL database for binary classification, it is evident that the classification performance of S2DPCA is higher than that of SPCA or 2DPCA. In Table 1, the lowest recognition error rate is reduced by 40.68%, from 5.9% to 3.5%, when SPCA is replaced by S2DPCA. Moreover, introducing symmetry in S2DPCA also raises the recognition accuracy considerably, from 95.0% to 96.5%, reducing the error rate by 30% in comparison with 2DPCA. According to Table 2, it should be pointed out that the method increases the computational burden, since it needs to compute the covariance matrices of both the even symmetrical images and the odd symmetrical images; this is the main disadvantage of the proposed method in comparison with 2DPCA.
4.2 Experiment on ORL Face Database for Multi-category Classification
The ORL database (Olivetti Research Laboratory database) consists of 400 face images from 40 different people, with 10 images per person. The images of one person differ in shooting time, expression (open eyes, closed eyes, smiling, etc.) or details (with glasses, without glasses, etc.). Sample images for one person in the ORL database are shown in Fig. 1. The characteristics of the ORL database are: each person has an equal number of face images; there is rich variation in expression, viewing angle and facial details; but there is little variation in illumination.
Fig. 1. Samples from the ORL face database

Table 3. Comparison of the recognition rate (%) of S2DPCA, 2DPCA and SPCA on the ORL database

Training samples/class   1             2             3             4             5
SPCA                     78.5 (39)     86.8 (45)     89.4 (66)     90.3 (50)     92.1 (46)
2DPCA                    75.6 (112×4)  85.2 (112×4)  89.2 (112×5)  92.3 (112×4)  94.5 (112×3)
S2DPCA                   79.1 (112×4)  89.6 (112×4)  91.5 (112×4)  94.6 (112×4)  96.4 (112×3)
Table 4. Comparison of the number of even symmetrical features (NESFs) and the number of odd symmetrical features (NOSFs) extracted using S2DPCA versus SPCA on the ORL face database, when the total number of features (sum of NFs) extracted is 40

Algorithms   NESFs   NOSFs   Sum of NFs
SPCA         32      8       40
S2DPCA       34      6       40
The original face images in the ORL database are all of size 92 × 112 with a 256-level gray scale; the gray scale was linearly normalized to lie within [-1,1]. We select samples randomly from the ORL database. For each person, the number of training samples equals the number of test samples, and there is no overlap between them. Specifically, for each person, 5 images are selected randomly from the 10 available; thus 200 images in total are selected as training samples, and the remaining 200 images are taken as test samples. The top recognition rates obtained using SPCA, 2DPCA and S2DPCA for feature extraction are presented in Table 3. From Table 3, it is again obvious that the classification performance of S2DPCA is superior to that of SPCA or 2DPCA. In addition, from Table 4, each eigenface is both symmetrical and asymmetrical to
Table 5. Comparison of time (s) for feature extraction with 1 training sample per class

Algorithms   SPCA    2DPCA   S2DPCA
Time (s)     14.42   10.76   12.36
some extent, but most eigenfaces are closer to symmetrical images and few have strong asymmetrical parts. Table 5 gives a comparison of the time for feature extraction with 1 training sample per class. From Table 5, we find that S2DPCA needs less time than SPCA; the reason is that S2DPCA directly uses image matrices to extract features rather than vectors, which are usually of very high dimensionality. At the same time, S2DPCA obtains a higher recognition rate than 2DPCA, because S2DPCA utilizes the symmetry of facial images. So S2DPCA combines the advantages of SPCA and of 2DPCA.
5 Conclusions
The S2DPCA algorithm presented in this paper combines the advantages of 2DPCA and of SPCA: it not only considers the symmetry property of the human face, but also alleviates the computational cost of PCA, which needs to transform image matrices into vectors. At the same time, from one image S2DPCA obtains two images, i.e., an even symmetrical image and an odd symmetrical image, by even-odd decomposition, so the number of samples is effectively doubled, which is very valuable when only a few samples are available. Thus, S2DPCA is competitive with or superior to 2DPCA and SPCA, especially with few samples. However, S2DPCA also has a weakness: it needs to compute the covariance matrices of both the even symmetrical images and the odd symmetrical images, which leads to an increase in computational cost. How to reduce this computational cost requires further analysis as future work.
Acknowledgement We would like to thank all the anonymous reviewers for their valuable comments. This work was fully supported by Chongqing Educational committee Foundation under Grant No. KJ060711.
References
1. Turk, M., Pentland, A.: Eigenfaces for Recognition. J. Cogn. Neurosci. 3 (1991) 71-86
2. Xu, L., Yuille, A.L.: Robust Principal Component Analysis by Self-Organizing Rules Based on Statistical Physics Approach. IEEE Trans. Neural Networks 6 (1995) 131-143
3. Schölkopf, B., Smola, A.J., Müller, K.R.: Kernel Principal Component Analysis. In: Gerstner, W., Germond, A., Hasler, M., Nicoud, J.-D. (eds.): 7th International Conference on Artificial Neural Networks, ICANN 97, Lausanne, Switzerland. Lecture Notes in Computer Science 1327, Springer, Berlin (1997) 583-588
4. Schölkopf, B., Smola, A., Müller, K.R.: Nonlinear Component Analysis as a Kernel Eigenvalue Problem. Neural Computation 10 (1998) 1299-1310
5. Kim, K.I., Jung, K., Kim, H.J.: Face Recognition Using Kernel Principal Component Analysis. IEEE Signal Processing Letters 9 (2002) 40-42
6. Yang, Q., Ding, X.: Symmetrical PCA in Face Recognition. IEEE ICIP 2002 Proceedings II (2002) 97-100
7. Yang, J., Zhang, D., Frangi, A.F., et al.: Two-Dimensional PCA: A New Approach to Appearance-Based Face Representation and Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (2004) 131-137
8. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer-Verlag (1995)
9. CBCL database, available from: http://www.ai.mit.edu/projects/cbcl/software-datasets
10. ORL database, available from: http://www.uk.research.att.com:pub/data/att_faces.tar.Z
A Method Based on ICA and SVM/GMM for Mixed Acoustic Objects Recognition
Yaobo Li1, Zhiliang Ren1, Gong Chen2, and Changcun Sun1
1 Dept. of Weaponry Eng., Naval Univ. of Engineering, Wuhan 430033, China
{Lyb358,rzl503}@163.com
2 Dept. of Electronic Information Eng., Institute of Communications Engineering of PLA University, Nanjing 210007, China
[email protected]
Abstract. With independent component analysis (ICA) used to realize blind separation of mixed acoustic objects, a recognition method based on support vector machines/Gaussian mixture models (SVM/GMM) is proposed, using linear prediction coefficient (LPC) features. It is shown that LPC features are consistently better than wavelet energy features, and that ICA is an efficient algorithm for estimating the unknown signals. The method uses the output of the GMM to adjust the probabilistic output of the SVM. The validity of the ICA and SVM/GMM model is verified via examples in a mixed acoustic objects recognition system.
1 Introduction
The recognition rate of field acoustic recognition systems can reach a satisfactory level; however, the performance degrades rapidly when the system inputs are contaminated with noise. On the battlefield, the inputs of the system are mixtures of several kinds of acoustic objects (such as tanks and helicopters). If the mixed objects can be separated before being fed to the recognition system, performance can be greatly enhanced. Besides, separating mixed acoustic objects makes it possible to decide the number and location of objects, and thus to estimate and analyze new acoustic objects. People have paid more and more attention to anti-tank and anti-helicopter warfare. In the past, much effort has been devoted to methods for eliminating background noise. Traditional filtering and wavelet-transform methods are the most common techniques for eliminating stationary noise from degraded acoustic objects; in fact, the spectrum of the noise overlaps with the spectrum of the acoustic objects, so the achievable improvement in signal-to-noise ratio is very limited and the acoustic objects can still be distorted. In this work, we separate acoustic objects using independent component analysis (ICA). Independent component analysis provides a linear representation that minimizes the statistical dependencies among its components, based on higher-order statistics of the data. Dependencies among higher-order features can be removed by isolating independent components, and the ability of ICA to handle higher-order statistics in addition to second-order statistics is useful in achieving an effective separation of the feature space for given data [1][2]. This paper presents a new method that combines a discriminative model and a generative model to make use of both kinds of models. The support vector machine has
recently proven to be a good discriminative classifier for many kinds of pattern recognition applications, and it also shows advantages in training and performance compared to artificial neural networks. Gaussian mixture models have become the dominant modeling approach in object recognition applications because GMMs represent the statistical characteristics of objects properly. In this paper, we address a method for acoustic target recognition based on ICA and SVM/GMM. With ICA used to realize blind separation of mixed acoustic targets, a recognition method based on SVM/GMM is proposed using LPC features. The method combines the SVM and the GMM to utilize the abilities of both models: the GMM output is used to adjust the probabilistic output of the SVM. In the training phase, the SVM and the GMM are trained independently; in the testing phase, the probability outputs of the GMM are used to adjust the posterior probability output of the SVM, and likelihood scoring is performed using the new hybrid model.
2 LPC-Based Feature Extraction
We choose tanks as the acoustic objects here. Over the observation time, tank objects can be modeled by the autoregressive (AR) model [4][5]:
$x(n) = \xi(n) - \sum_{i=1}^{p} a_i x(n-i)$,  (1)
where the $a_i$ are the linear prediction coefficients (LPC), $p$ is the order of the AR model, and $\xi(n)$ is the input excitation. LPC analysis essentially attempts to find an optimal fit to the envelope of the acoustic spectrum from a given sequence of object signal samples. The AR spectrum estimate is expressed by
$R(f) = \sigma_\xi^2 \Delta t \,\Big/\, \Big|1 + \sum_{i=1}^{p} a_i \exp(-j 2\pi f i \Delta t)\Big|^2$,  (2)
where $\Delta t$ is the sampling interval and $\sigma_\xi^2$ is the variance of the excitation. The AR spectrum is a high-resolution method for power spectrum estimation, and the power spectrum of tank signals reflects the distribution of the signal energy over frequency.
Fig. 1. Spectrum response of tank objects
From (2), the AR spectrum is tightly related to the LPC, so it is reasonable to extract features of tank objects from the LPC. The frequency response obtained using 20th-order LPC analysis is shown in Fig. 1.
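As an illustration of LPC feature extraction for a single signal frame (this code is not from the paper; the autocorrelation/normal-equation route and the frame parameters are assumptions), a minimal NumPy sketch could be:

import numpy as np

def lpc_coefficients(frame, p):
    """Estimate the AR/LPC coefficients a_1..a_p of Eq. (1) by solving the
    autocorrelation normal equations R a = -r (one common way to fit the AR model)."""
    frame = frame - frame.mean()
    full = np.correlate(frame, frame, mode='full')
    r = full[len(frame) - 1:][:p + 1]                  # autocorrelation lags 0..p
    R = np.array([[r[abs(i - j)] for j in range(p)] for i in range(p)])  # Toeplitz matrix
    return np.linalg.solve(R, -r[1:])                  # a_1..a_p

# Assumed usage: 20 ms frames of an 8 kHz signal, 10th-order LPC features per frame.
fs = 8000
signal = np.random.default_rng(0).normal(size=3000)    # placeholder acoustic signal
frame = signal[:int(0.02 * fs)]
features = lpc_coefficients(frame, p=10)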
3 ICA Algorithm
In independent component analysis (ICA), the measured samples (here, multi-dimensional tank objects) are assumed to be linear mixtures of some underlying sources. The goal of ICA is to find how the measured signals are formed from the underlying signals, assuming that the underlying signals are statistically as independent as possible. ICA is a linear model that describes the mixture of the source components $S = \{s_1(t), s_2(t), \ldots, s_m(t)\}^T$ by means of a mixing matrix $A$, producing the mixtures $X = \{x_1(t), x_2(t), \ldots, x_n(t)\}^T$. In matrix form, defining $S$ and $X$ to be the column vectors associated with the source and mixed signals, we have [6]
$X = AS$,  (3)
where $A$ is the unknown mixing matrix and $n \ge m$. The purpose of ICA is to separate the source signals from $X$:
$S = A^{-1} X = W^T X$,  (4)
where $S$ is an estimate of the sources and $W^T$ is a regularized estimate of the inverse of $A$.
3.1 Fast ICA
We apply the FastICA algorithm for feature extraction. The ICA algorithm controls convergence with a generating function, so as to estimate the non-Gaussian independent components. The kurtosis is defined as the fourth coefficient of the cumulant generating function:
$k(v) = E\{v^4\} - 3(E\{v^2\})^2$.  (5)
The kurtosis of $W^T X$ is expressed by
$k(W^T X) = E\{(W^T X)^4\} - 3[E\{(W^T X)^2\}]^2 = E\{(W^T X)^4\} - 3\|W\|^4$.  (6)
The generating function is
$J(W) = E\{(W^T X)^4\} - 3\|W\|^4 + F(\|W\|^2)$,  (7)
where $F$ is a cost function. Considering the convergence rate, we change it to
$W = E\{X (W^T X)^3\} - 3W$.  (8)
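As an illustration of this kurtosis-based fixed-point iteration (a simplified sketch, not the authors' implementation; whitened data, a single extracted component and the toy mixtures are assumptions):

import numpy as np

def fastica_one_unit(X, n_iter=200, tol=1e-8, seed=0):
    """One-unit kurtosis-based FastICA update, W <- E{X (W^T X)^3} - 3W, on whitened data X (d x T)."""
    rng = np.random.default_rng(seed)
    d, T = X.shape
    w = rng.normal(size=d)
    w /= np.linalg.norm(w)
    for _ in range(n_iter):
        wx = w @ X                                    # projections w^T x(t)
        w_new = (X * wx**3).mean(axis=1) - 3.0 * w    # fixed-point update of Eq. (8)
        w_new /= np.linalg.norm(w_new)
        if np.abs(np.abs(w_new @ w) - 1.0) < tol:     # converged up to sign
            w = w_new
            break
        w = w_new
    return w

# Assumed toy usage: two whitened mixtures of two independent sources.
rng = np.random.default_rng(1)
S = np.vstack([np.sign(rng.normal(size=5000)), rng.uniform(-1, 1, size=5000)])
X = np.array([[0.6, 0.4], [0.45, 0.7]]) @ S
X = X - X.mean(axis=1, keepdims=True)
cov = np.cov(X)
evals, evecs = np.linalg.eigh(cov)
X_white = (evecs @ np.diag(evals**-0.5) @ evecs.T) @ X   # whitening
w = fastica_one_unit(X_white)
recovered = w @ X_white                                   # one source, up to scale and sign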
3.2 Simulation
The experiment is performed with tank signals $s_1(t), s_2(t), s_3(t), s_4(t), s_5(t), s_6(t)$. The sampling frequency of the signals is 8000 Hz and the sample length is 3000. A mixing matrix is then applied to obtain the observation signals, and the ICA result shows that the approximate source signals are separated. Fig. 2 (a) shows the normalized observation signals $x_1(t), x_2(t), x_3(t), x_4(t), x_5(t), x_6(t), x_7(t)$ before
Fig. 2. Normalized observation signals (a), source signals (b) and separated signals (c)
mixing. Fig. 2 (b) shows the normalized source signals $s_1(t), \ldots, s_6(t)$, which correspond to tanks A, B, C, D, respectively. Fig. 2 (c) shows the normalized signals $y_1(t), \ldots, y_6(t)$ obtained with ICA. Being a statistical method, ICA can only recover signals similar to the sources: in Fig. 2 the peak-to-peak values and the ordering of the separated signals differ from those of the source signals, but the waveforms are consistent.
4 SVM/GMM Based Recognition
There are two major kinds of models for acoustic object recognition: discriminative models such as the SVM and generative models such as the GMM, and each of them can be used to construct models for object recognition tasks. Discriminative models have the good property of making full use of the discriminative information between different classes, while generative models use statistical information within each class. In a word, discriminative models use inter-class information and generative models use intra-class information. Since discriminative and generative models each have their own advantages, they also each have the disadvantage of not using the other kind of information.
4.1 GMM Based Modeling
A GMM can be regarded as an HMM with a single state. The Gaussian mixture density is defined as a weighted sum of M component densities:
$P(x_t) = \sum_{i=1}^{M} P_i b_i(x_t)$,  (9)
where
$b_i(x_t) = N(x_t, \mu_i, \Sigma_i) = \dfrac{1}{(2\pi)^{d/2}|\Sigma_i|^{1/2}} \exp\left(-\frac{1}{2}(x_t-\mu_i)^T \Sigma_i^{-1}(x_t-\mu_i)\right)$, $t = 1, 2, \ldots, T$,  (10)
and $P_i$, $\mu_i$ and $\Sigma_i$ are the weight, mean and covariance of the i-th component, respectively. The mixture weights satisfy the constraint $\sum_{i=1}^{M} P_i = 1$. The GMM reflects the intra-class information.
4.2 SVM Based Modeling
Let $x$ be a data point and $y$ the corresponding target class. The output of the standard SVM is
$y = \mathrm{sign}(f(x))$,  (11)
where $f(x) = \omega^T x + b$. One method of producing probabilistic outputs was proposed by Wahba, using a logistic link function. For an SVM there are two outputs corresponding to the two classes, so the posterior probability outputs of the SVM are
$P(C_{+1} \mid x) = \dfrac{1}{1 + e^{-f(x)}}$,  $P(C_{-1} \mid x) = \dfrac{1}{1 + e^{f(x)}}$,  (12)
where $f(x)$ can be viewed as the distance from $x$ to the support vectors. That is to say, the posterior probability outputs of SVMs are based on the distance between testing vectors and support vectors, so the outputs reflect the inter-class information. This method combines the SVM and GMM to utilize the abilities of both models: the GMM output is used to adjust the probabilistic output of the SVM. The GMM can be embedded into the SVM to adjust the acoustic object probabilistic output as follows:
$P(C_i \mid x) = \prod_{k=1}^{K} \dfrac{P_{SVM}(C_i \mid x_k)\, P_{GMM}(x_k \mid C_i)}{\sum_{j=1}^{N} P_{SVM}(C_j \mid x_k)\, P_{GMM}(x_k \mid C_j)}$,  (13)
where the probabilistic output of the i-th GMM is $P_{GMM}(x \mid C_i) = \sum_{m=1}^{M} c_{im} N(x, \mu_{im}, \Sigma_{im})$, with
$N(x, \mu, \Sigma) = (2\pi)^{-d/2} |\Sigma|^{-1/2} \exp\left[-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right]$,  (14)
and $c_{im}$, $\mu_{im}$ and $\Sigma_{im}$ are the weight, mean and covariance of the m-th component. $P_{SVM}(C_i \mid x)$ is the probabilistic output of the i-th SVM. The complete GMM object model is parameterised by the mean vectors, covariance matrices and mixture weights of all component densities, collectively denoted $\lambda = \{P_i, \mu_i, \Sigma_i \mid i = 1, 2, \ldots, M\}$.
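To make Eq. (13) concrete, the following NumPy sketch (not from the paper; the per-frame class probabilities are assumed to be available, e.g. from an SVM with probabilistic outputs and per-class GMMs fitted beforehand) combines the two model outputs frame by frame:

import numpy as np

def fuse_svm_gmm(p_svm, p_gmm):
    """Eq. (13): combine SVM posteriors and GMM likelihoods over K frames.

    p_svm : (K, N) array, P_SVM(C_j | x_k) for each frame k and class j.
    p_gmm : (K, N) array, P_GMM(x_k | C_j) for each frame k and class j.
    Returns fused class scores (log-domain product over frames, renormalized).
    """
    joint = p_svm * p_gmm                            # numerator terms per frame
    joint /= joint.sum(axis=1, keepdims=True)        # per-frame normalization
    log_score = np.log(joint + 1e-300).sum(axis=0)   # product over the K frames
    score = np.exp(log_score - log_score.max())
    return score / score.sum()

# Assumed toy usage with K = 5 frames and N = 3 classes.
rng = np.random.default_rng(0)
p_svm = rng.dirichlet(np.ones(3), size=5)
p_gmm = rng.random((5, 3))
print(fuse_svm_gmm(p_svm, p_gmm).argmax())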
5 Acoustic Objects Recognition
In this paper, with ICA used to realize blind separation of the mixed acoustic objects, the features of clean tanks are derived using 10th-order LPC analysis on a 20-
Fig. 3. The acoustic recognition system
millisecond frame every 5 milliseconds. In order to construct a small data set for training, the training data for each object were reduced to one centroid using the k-means clustering algorithm. The acoustic recognition system based on ICA and SVM/GMM is shown in Fig. 3.

Table 1. (a) Performance comparison of acoustic source and separated objects

                                  Gun    Machine gun   Artillery   Vehicle   Tank   Heli
Performance (separated targets)   100%   100%          98%         87%       100%   100%
Performance (source targets)      91%    94%           91%         81%       100%   98%

(b) Performance comparison of tank source and separated objects

                                  Tank A   Tank B   Tank C   Tank D   Tank E   Tank F
Performance (separated targets)   99%      100%     93%      88%      95%      100%
Performance (source targets)      93%      93%      90%      88%      90%      96%
Table 2. (a) Performance for acoustic objects with various orders of LPC (number of frames L = 30)

Order of LPC   Model     Gun    Machine gun   Artillery   Vehicle   Tank   Heli
5              GMM       95%    92%           20%         62%       69%    98%
               SVM       100%   100%          26%         80%       74%    61%
               SVM-GMM   99%    93%           27%         78%       100%   99%
10             GMM       88%    100%          38%         87%       55%    97%
               SVM       100%   100%          95%         87%       71%    77%
               SVM-GMM   100%   100%          93%         88%       100%   100%
20             GMM       84%    100%          79%         84%       61%    96%
               SVM       100%   100%          90%         87%       84%    91%
               SVM-GMM   100%   100%          98%         88%       100%   100%

(b) Performance for tank objects with various orders of LPC (number of frames L = 30)

Order of LPC   Model     Tank A   Tank B   Tank C   Tank D   Tank E   Tank F
5              GMM       95%      92%      20%      62%      69%      98%
               SVM       100%     100%     26%      80%      74%      61%
               SVM-GMM   99%      93%      27%      78%      100%     99%
10             GMM       88%      100%     38%      87%      55%      97%
               SVM       100%     100%     95%      87%      71%      77%
               SVM-GMM   100%     100%     93%      88%      100%     100%
20             GMM       84%      100%     79%      84%      61%      96%
               SVM       100%     100%     90%      87%      84%      91%
               SVM-GMM   100%     100%     98%      88%      100%     100%
Table 3. (a) Performance for acoustic objects with various numbers of frames (order of LPC P = 10)

Number of frames   Model     Gun    Machine gun   Artillery   Vehicle   Tank   Heli
10                 GMM       83%    91%           58%         79%       54%    96%
                   SVM       99%    91%           90%         77%       71%    71%
                   SVM-GMM   94%    95%           96%         83%       98%    96%
20                 GMM       88%    100%          38%         87%       55%    97%
                   SVM       100%   100%          95%         87%       71%    77%
                   SVM-GMM   100%   100%          93%         88%       100%   100%
30                 GMM       85%    100%          75%         88%       75%    100%
                   SVM       100%   100%          93%         86%       100%   99%
                   SVM-GMM   100%   100%          98%         87%       100%   100%

(b) Performance for tank objects with various numbers of frames (order of LPC P = 10)

Number of frames   Model     Tank A   Tank B   Tank C   Tank D   Tank E   Tank F
10                 GMM       43%      76%      41%      41%      94%      91%
                   SVM       86%      90%      78%      18%      76%      71%
                   SVM-GMM   67%      96%      86%      22%      98%      97%
20                 GMM       33%      100%     38%      100%     100%     99%
                   SVM       100%     100%     82%      50%      100%     94%
                   SVM-GMM   98%      100%     85%      81%      100%     99%
30                 GMM       85%      100%     73%      100%     100%     100%
                   SVM       100%     100%     84%      33%      100%     92%
                   SVM-GMM   99%      100%     93%      88%      100%     100%
Table 4. (a) Performance for acoustic objects with different features (order of LPC P = 10, number of frames L = 30)

Feature                 Model     Gun    Machine gun   Artillery   Vehicle   Tank   Heli
Wavelet packet energy   GMM       98%    100%          25%         50%       99%    87%
                        SVM       72%    79%           14%         16%       86%    58%
                        SVM-GMM   72%    87%           20%         14%       85%    65%
LPC                     GMM       85%    100%          75%         88%       75%    100%
                        SVM       100%   100%          93%         86%       100%   99%
                        SVM-GMM   100%   100%          98%         87%       100%   100%
(b) Performance for tank objects with different features (order of LPC P = 10, number of frames L = 30)

Feature                 Model     Tank A   Tank B   Tank C   Tank D   Tank E   Tank F
Wavelet packet energy   GMM       100%     100%     100%     100%     37%      100%
                        SVM       37%      89%      33%      100%     11%      90%
                        SVM-GMM   100%     100%     58%      99%      53%      99%
LPC                     GMM       85%      100%     73%      100%     100%     100%
                        SVM       100%     100%     64%      33%      100%     92%
                        SVM-GMM   99%      100%     93%      88%      100%     100%
Table 5. Performance for acoustic objects at different SNRs (order of LPC P = 10, number of frames L = 30)

SNR     Model     Gun    Machine gun   Artillery   Vehicle   Tank   Heli
40 dB   GMM       85%    100%          64%         87%       76%    100%
        SVM       100%   100%          93%         86%       96%    99%
        SVM-GMM   100%   100%          91%         87%       100%   100%
20 dB   GMM       87%    100%          48%         0%        70%    93%
        SVM       100%   100%          75%         0%        96%    90%
        SVM-GMM   100%   100%          80%         0%        100%   97%

Performance for tank objects at different SNRs (order of LPC P = 10, number of frames L = 30)

SNR     Model     Tank A   Tank B   Tank C   Tank D   Tank E   Tank F
40 dB   GMM       86%      100%     73%      100%     100%     100%
        SVM       99%      100%     64%      33%      100%     92%
        SVM-GMM   99%      100%     93%      88%      100%     100%
20 dB   GMM       80%      100%     64%      100%     100%     99%
        SVM       95%      100%     52%      52%      100%     90%
        SVM-GMM   99%      100%     93%      88%      100%     99%
The number of probability density functions (mixture components) is 8. The kernel function of the SVM is $K(x_i, x) = \exp(-\|x - x_i\|^2 / \sigma^2)$.
Table 1 shows the performance under different conditions. From Table 1, it can be seen that by incorporating ICA before the SVM/GMM input, the recognition and verification performance is only slightly degraded. Tables 2-4 show the performance based on SVM/GMM for clean tank signals, and Table 5 shows the accuracy averaged over 100 test experiments with field wind. From the above tables, it can be concluded that:
1) The performance varies with the order of LPC when the number of frames is constant; likewise, it varies with the number of frames when the feature order is constant.
2) In Table 2, the average performances based on GMM are 87.17% and 93%, those based on SVM are 96.33% and 84.83%, while the average performances based on SVM/GMM reach 97.5% and 96.67%. Because the GMM reflects the similarity within the same kind of data and the SVM finds differences between different kinds of data, the average performance with GMM is 9.16% lower than with SVM for the different acoustic objects; on the contrary, for the tank objects GMM shows better performance than SVM by 8.17%. SVM/GMM combines both advantages, and its performance is 10.33% and 3.67% higher than GMM, and 1.17% and 11.84% higher than SVM.
3) Under noisy conditions, SVM/GMM shows robust performance.
6 Conclusion
A novel recognition system has been proposed, integrating the advantages of ICA and SVM/GMM. Linear prediction coefficients are used to characterize tank objects. The recognition system achieves encouraging results for different training and testing configurations. The following conclusions can be drawn:
1) The performance of the method based on LPC features is consistently better than that based on wavelet packet energy features (Table 4).
2) An efficient algorithm using the ICA technique is employed to estimate the unknown signals; the separated objects obtained with ICA are similar to the source objects (Table 1).
3) The proposed SVM/GMM approach gives a more accurate solution for identifying field tanks than SVM or GMM alone.
Further investigations are being carried out to compare the performance of other algorithms for the recognition of field acoustic objects.
References
1. Amari, S., Cardoso, J.: Blind Source Separation - Semiparametric Statistical Approach. IEEE Trans. on Signal Processing 45 (11) (1997) 2692-2700
2. Amari, S., et al.: A New Learning Algorithm for Blind Separation of Sources. In: Advances in Neural Information Processing Systems 8. MIT Press, Cambridge (1996) 757-763
3. Wang, B., Qu, D., Peng, X.: Practical Speech Recognition Foundation. National Defense Industry Press, Beijing (2001)
4. Marple, S.L.: A New Autoregressive Spectrum Analysis Algorithm. IEEE Transactions on ASSP 28 (4) (1980) 441-453
5. Ji, X., Shi, W., Zhang, G.: An Effective Algorithm of Feature Extraction and Its Application. Journal of Shanghai Jiaotong University 37 (11) (2003)
6. Bell, A., Sejnowski, T.: An Information-Maximization Approach to Blind Separation and Blind Deconvolution. Neural Computation 7 (1995) 1129-1159
7. Reynolds, D.A.: Speaker Identification and Verification Using Gaussian Mixture Speaker Models. Speech Communication 17 (1995) 91-108
8. Cortes, C., Vapnik, V.: Support-Vector Networks. Machine Learning 20 (1995) 273-297
9. Rabiner, L., Wilpon, J.G., Juang, B.H.: A Segmental K-Means Training Procedure for Connected Word Recognition. AT&T Tech. J. 65 (3) (1986) 21-31
ICA Based Super-Resolution Face Hallucination and Recognition
Hua Yan, Ju Liu, Jiande Sun, and Xinghua Sun
School of Information Science and Engineering, Shandong University, Jinan, 250100, Shandong, China
{yhzhjg,juliu,jd_sun}@sdu.edu.cn,
[email protected]
Abstract. In this paper, we propose a new super-resolution face hallucination and recognition method based on Independent Component Analysis (ICA). First, ICA is used to build a linear mixing relationship between a high-resolution (HR) face image and independent HR source face images. The linear mixing coefficients are retained, so the corresponding low-resolution (LR) face image is represented by a linear mixture of the down-sampled source face images. Once the source face images have been obtained by training on a set of HR face images, unconstrained least squares is used to obtain the mixing coefficients of an LR image for hallucination and recognition. Experiments show that the accuracy of face recognition is insensitive to image size and to the number of HR source face images when the image size is larger than 8×8, and that the resolution and quality of the hallucinated face images are greatly enhanced over the LR ones, which is very helpful for human recognition.
1 Introduction
The faces of interest often appear very small in surveillance imagery because of the relatively large distance between the cameras and the scene. Face resolution becomes an important factor for recognition performance, so resolution enhancement techniques are generally needed. Super-resolution (SR) techniques in computer vision aim to enhance resolution, that is, to infer the lost high-resolution (HR) image from the low-resolution (LR) ones. In general, there are two classes of SR techniques: reconstruction-based (from the input images only) and learning-based (from other training images). Of particular interest is face hallucination, or learning a high-resolution face image from a low-resolution one. Face hallucination is a term introduced by Baker and Kanade [1], which implies that the high-frequency part of the face image must be purely fabricated from a parent structure by recognizing local features from the training set. Liu et al. [2] developed a two-step statistical modeling algorithm combining global and local parametric models, based on PCA and a nonparametric Markov network, for hallucinating faces; the two-step algorithm was improved by Li et al. [3]. In [4], PCA was also used to fit the input LR face image as a linear combination of the LR face images in the training set.
Replacing the LR training images with HR ones, while retaining the same combination coefficients, rendered the HR image. In this paper, we propose a new SR face hallucination and recognition method based on Independent Component Analysis (ICA). First, the ICA model is used to represent an HR face image as a linear mixture of some independent HR source face images. After the HR source face images are down-sampled to obtain the corresponding down-sampled source face images, and the same mixing coefficients are retained, a linear relation between LR face images and the down-sampled source face images is obtained. Thus, once the source face images have been trained from a set of HR face images by ICA, the unconstrained least squares method is utilized to obtain the mixing coefficients of an LR face image. Finally, face hallucination and recognition can be achieved by retaining the mixing coefficients and combining the trained source face images. Experiments show that when the image size is larger than 8×8, the accuracy of face recognition is insensitive to image size and to the number of source face images, and that the hallucinated face images closely approximate the original HR face images and are very helpful for recognition by a human being.
2 Image-Domain Independent Component Analysis (ICA)
Independent component analysis (ICA) can derive features that best represent an image via a set of source signals that are as statistically independent as possible. The main assumption behind ICA is that the observation signals X = [x_1, x_2, ..., x_m]^T can be modeled as a linear mixture of the statistically independent source signals S = [s_1, s_2, ..., s_n]^T through an unknown m × n mixing matrix A = [a_1, a_2, ..., a_n], where x_i and s_i denote the column vectors into which the observation signals and source signals are lexicographically ordered, and a_i denotes the mixing column vector containing the coordinates of observation signal x_i with respect to the source signals S:
$$X = AS. \qquad (1)$$
Usually, ICA can be performed on images under two different kinds of architectures [5]. In Architecture I, images are treated as random variables and pixels as outcomes, whereas in Architecture II, pixels are treated as random variables and images as outcomes. In this paper, Architecture I is chosen, and the image synthesis model based on ICA is shown in Fig. 1. In Architecture I, the data matrix is organized so that the images are in rows and the pixels are in columns, and each image has zero mean. In this approach, ICA finds a matrix W such that the rows of S = WX are as statistically independent as possible. The rows of S are then used as source face images to represent faces. Face image representations consist of the coordinates a of these images with respect to the source face images defined by the rows of S , as shown in Fig. 2. These coordinates are contained in the mixing matrix.
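Under this Architecture I setting, the decomposition X = AS can be computed with any ICA package. The following minimal sketch uses scikit-learn's FastICA as a stand-in for the FASTICA package cited in [6]; the array shapes and names are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of Architecture-I ICA on a set of vectorized face images.
# Rows of X_hr are HR face images (one lexicographically ordered image per row).
import numpy as np
from sklearn.decomposition import FastICA

def train_source_faces(X_hr, n_sources):
    """X_hr: (num_images, num_pixels) HR training faces."""
    mean_face = X_hr.mean(axis=0)            # M_H, removed before ICA (cf. Sect. 3.3)
    X0 = X_hr - mean_face
    # In Architecture I the pixels play the role of observations, so ICA runs on X0.T:
    # the recovered sources are the independent source face images (rows of S_H),
    # and mixing_ holds the coordinates a of each training image (the matrix A).
    ica = FastICA(n_components=n_sources, random_state=0)
    S_H = ica.fit_transform(X0.T).T           # (n_sources, num_pixels)
    A = ica.mixing_                           # (num_images, n_sources)
    return S_H, A, mean_face
```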
Fig. 1. Image synthesis model for Architecture I
Fig. 2. Independent source image representation consists of the coefficients a = [a_1, a_2, ..., a_n]^T for the linear combination of independent source images s_i
3 ICA-Based Super-Resolution Face Hallucination
3.1 Mathematical Model from HR to LR Image
Let I_H and I_L denote the HR and LR face images, respectively. If I_L is d times smaller than I_H in both the horizontal and vertical directions, I_L is computed by
$$I_L(m,n) = \frac{1}{d^2}\sum_{i=0}^{d-1}\sum_{j=0}^{d-1} I_H(dm+i,\ dn+j), \qquad (2)$$
where d, which is always an integer, denotes the down-sampling factor in the horizontal and vertical directions. Equation (2) combines a smoothing step and a down-sampling step, which is more consistent with image formation as integration over the pixel [1]. To simplify the notation, let I_H and I_L be lexicographically ordered into row vectors; then (2) can be rewritten in matrix-vector form as
$$I_L = I_H D, \qquad (3)$$
where I_L denotes an N-dimensional row vector of the LR image, I_H denotes a d^2 N-dimensional row vector of the HR image, and D denotes a d^2 N × N down-sampling matrix.
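As a concrete illustration of Eqs. (2)-(3), the following sketch implements the block-average down-sampling; it assumes square images whose side length is divisible by d and is not taken from the paper.

```python
import numpy as np

def downsample(image, d):
    """Average each d x d block of `image`, i.e. Eq. (2); returns the LR image."""
    h, w = image.shape
    return image.reshape(h // d, d, w // d, d).mean(axis=(1, 3))

# Written for lexicographically ordered row vectors, this operator is exactly the
# down-sampling matrix D of Eq. (3); applying it to each source face image yields
# the down-sampled sources of Eq. (7).
```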
3.2 ICA Model of HR and LR Images
We apply the ICA model under Architecture I to fit the HR face image I_H as a linear combination of n source face images:
$$I_H = a_1 s_H^1 + a_2 s_H^2 + \cdots + a_n s_H^n + M_H = a_H^{\mathrm{T}} S_H + M_H, \qquad (4)$$
where S_H = [s_H^1, s_H^2, ..., s_H^n]^T is the statistically independent source face image set in the HR space, M_H denotes the mean face image of the HR face images, and a_1, ..., a_n are the mixing coefficients, which form the mixing vector a_H = [a_1, a_2, ..., a_n]^T. Combining (3) and (4), the LR image can be written as
$$I_L = (a_1 s_H^1 + a_2 s_H^2 + \cdots + a_n s_H^n)D + M_H D. \qquad (5)$$
Since a_1, ..., a_n are scalar coefficients that give the coordinates of the HR face image with respect to the source face images, the down-sampling distributes over the sum:
$$I_L = a_1 s_H^1 D + a_2 s_H^2 D + \cdots + a_n s_H^n D + M_H D. \qquad (6)$$
We can then set up a correspondence between each HR source image and its down-sampled counterpart, and likewise between the HR and LR mean face images:
$$s_L^i = s_H^i D, \qquad M_L = M_H D, \qquad (7)$$
where s_L^i denotes a down-sampled source face image and M_L denotes the LR mean face image. Then
$$I_L = a_1 s_L^1 + a_2 s_L^2 + \cdots + a_n s_L^n + M_L = a_H^{\mathrm{T}} S_L + M_L. \qquad (8)$$
Equation (8) describes the linear relation between the LR face image and the down-sampled source face images.
3.3 Face Hallucination and Recognition
In the proposed method, we first remove the mean face image from the HR face image set. The zero-mean HR training face images are then used to obtain a set of independent source face images s_H^1, s_H^2, ..., s_H^n and the mixing matrix A in the HR space by FASTICA [6]. Next, the source face images are down-sampled according to Equation (7). Finally, s_H^1, ..., s_H^n, s_L^1, ..., s_L^n and A are stored for face hallucination and recognition. To hallucinate an HR face image from an LR one, we seek a mixing vector that minimizes the following cost function:
$$J = \left\| I_L - M_L - a_H^{\mathrm{T}} S_L \right\|. \qquad (9)$$
An unconstrained least squares method can be used to solve this problem:
$$a_H^{\mathrm{T}} = I_L S_L^{\mathrm{T}} \left( S_L S_L^{\mathrm{T}} \right)^{-1}. \qquad (10)$$
Finally, we retain a_H and linearly combine the corresponding HR source images, adding back the mean face term, to render the hallucinated HR face for the LR input:
$$\hat{I}_H = a_H^{\mathrm{T}} S_H + M_L D^{\mathrm{T}}. \qquad (11)$$
Face recognition can also be carried out using the following cosine similarity criterion:
$$c = \frac{a_H^{\mathrm{T}} a_i}{\| a_H \| \cdot \| a_i \|}, \qquad (12)$$
where a_i denotes the mixing vector containing the coordinates of training face image I_H^i with respect to the source face images S_H. The framework of face hallucination and recognition is shown in Fig. 3.
Fig. 3. Face hallucination and recognition
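A minimal sketch of the hallucination and recognition steps of Sect. 3.3 follows. It reuses the train_source_faces() and downsample() helpers sketched above, solves the least-squares fit of Eq. (9) with numpy.linalg.lstsq instead of the explicit normal equations of Eq. (10), and uses the HR mean face in place of the up-sampled LR mean of Eq. (11); all names and shapes are illustrative assumptions.

```python
import numpy as np

def prepare(X_hr, n_sources, d, hr_shape):
    S_H, A, mean_hr = train_source_faces(X_hr, n_sources)
    as_img = lambda v: v.reshape(hr_shape)
    S_L = np.vstack([downsample(as_img(s), d).ravel() for s in S_H])   # Eq. (7)
    mean_lr = downsample(as_img(mean_hr), d).ravel()
    return S_H, S_L, mean_hr, mean_lr, A

def hallucinate_and_match(lr_vec, S_H, S_L, mean_hr, mean_lr, A):
    # Mixing coefficients a_H: least-squares fit of Eq. (9) (cf. Eq. (10)).
    a_H = np.linalg.lstsq(S_L.T, lr_vec - mean_lr, rcond=None)[0]
    hr_hat = a_H @ S_H + mean_hr                     # Eq. (11), with the HR mean face
    # Eq. (12): cosine similarity against the stored training coefficients in A.
    sims = (A @ a_H) / (np.linalg.norm(A, axis=1) * np.linalg.norm(a_H) + 1e-12)
    return hr_hat, int(np.argmax(sims))
```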
4 Experiments and Results
Our experiments are conducted with 288 HR face images from the NUST 603 face database. The HR face images, which are aligned to 32 × 32, cover 96 individuals, with three face images from different sessions for each individual. The face images are blurred by averaging neighboring pixels and down-sampled to low-resolution images; the down-sampling factor d is set to 2 or 4 to obtain the corresponding LR images of size 16 × 16 or 8 × 8. All HR and LR face images are lexicographically ordered into row vectors.
First, we study the recognition performance using original HR face images and hallucinated ones. 192 HR face images from the 96 individuals, taken from two different sessions, are selected for training, and the other 96 HR images and their corresponding LR images are used for testing. Equation (12) is used for similarity measurement. The recognition accuracies for the original HR face images and the hallucinated HR face images, over different down-sampling factors and numbers of source images, are shown in Table 1. It can be seen that when the LR face image is as small as 8 × 8, the recognition accuracy with hallucinated HR face images drops by about 10% compared with the original HR images, whereas in the other cases the recognition accuracy changes only slightly; that is, when the image size is larger than 8×8, the accuracy of face recognition is insensitive to image size and to the number of source face images.
Next, the hallucination experiment is conducted on a data set containing 192 face images, two for each individual. Using the leave-one-out methodology, at each time one individual, who can be recognized in the recognition procedure, is selected for testing, and the remaining images are used for training. In terms of the recognition result, we select 100 source face images for a down-sampling factor of 2, and 30 for a factor of 4, to hallucinate the HR face images. Some hallucination results are shown in Fig. 4 and Fig. 5. Compared with the input LR face images and the cubic B-spline interpolation results, the hallucinated face images have much clearer detail features; they closely approximate the original high-resolution images and are very useful for human recognition.
Table 1. The result of face recognition (recognition rate, correct/96)

Number of source images   16×16 LR, original HR   16×16 LR, hallucinated   8×8 LR, original HR   8×8 LR, hallucinated
100                       80/96                   80/96                    --                    --
60                        79/96                   81/96                    79/96                 23/96
30                        77/96                   78/96                    77/96                 68/96
15                        71/96                   72/96                    71/96                 61/96
Fig. 4. Face hallucination of three groups for down-sampling factor of 2. From left to right for each group: original HR image, hallucinated HR image, Cubic B-spline interpolation, LR image.
Fig. 5. Face hallucination of three groups for down-sampling factor of 4. From left to right for each group: original HR image, hallucinated HR image, Cubic B-spline interpolation, LR image.
5 Conclusion
In this paper, we have proposed a new SR face hallucination and recognition method based on ICA. The ICA model is first used to build a linear relationship between an HR face image and independent HR source face images. The HR source face images are then down-sampled and the linear mixing coefficients are retained, so that the corresponding LR image is represented as a linear mixture of the down-sampled source face images. Thus, given an LR face image and the trained source face images, the unconstrained least squares method is used to obtain the mixing coefficients. Finally, the mixing coefficients are retained and face hallucination and recognition are carried out. Experiments show that the ICA-based hallucinated face images closely approximate the original HR face images and are very helpful for human recognition, and that when the image size is larger than 8×8 the accuracy of face recognition is insensitive to image size and to the number of source face images.
Acknowledgement The work is supported by Program for New Century Excellent Talents in University (NCET-05-0582), Specialized Research Fund for the Doctoral Program of Higher Education (Grant No. 20050422017) and the Project sponsored by SRF for ROCS, SEM ([2005]55). The corresponding author is Ju Liu (
[email protected]).
References 1. Baker, S., Kanade, T.: Hallucinating Faces. In Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition, (2000) 83-88. 2. Liu, C., Shum H.Y., Zhang, C.S.: A Two-Step Approach to Hallucinating Faces: Global Parametric Model and Local Nonparametric Model. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition 1 (2001) 192-198. 3. Li, Y., Lin, X.Y.: An Improved Two-Step Approach to Hallucinating Faces, In Proceedings Third International Conference on Image and Graphics (2004) 298-301. 4. Wang, X.G., Tang, X.O.: Hallucinating Face by Eigentransformation. IEEE Transactions on Systems, Man and Cybernetics, Part C 35 (3) (2005) 425-434. 5. Bartlett, M.S., Movellan, J.R., Sejnowski, T.J.: Face Recognition by Independent Component Analysis. IEEE Transactions on Neural Networks 13 (6) (2002) 1450-1464. 6. http://www.cis.hut.fi/projects/ica/fastica/
Principal Component Analysis Based Probability Neural Network Optimization
Jie Xing¹, Deyun Xiao¹, and Jiaxiang Yu¹,²
¹ Department of Automation, Tsinghua University, Beijing, 100084, China
[email protected]
² Department of Shipboard Weaponry, Dalian Naval Academy, Dalian, 116018, China
[email protected]
Abstract. The topological structure of a Probability Neural Network (PNN) is usually complex when it is trained with large-scale, highly redundant training samples. To address this problem, the PNN is analyzed and simplified using probability calculation and the multiplication formula. First, the input data of the training samples are statistically analyzed using Principal Component Analysis (PCA), and the PNN topological structure is optimized based on the statistical results. Subsequently, a complete learning algorithm is provided to avoid manual setting of the smoothing parameters, and a Simulated Annealing (SA) coefficient is introduced to increase learning speed and stability. Finally, the optimized PNN is applied to a real problem. The test results validate that the optimized PNN has a simpler structure and higher efficiency than a typical PNN in applications with large-scale, highly redundant training samples.
1 Introduction
The Probability Neural Network (PNN) is a four-layer feed-forward neural network based on the radial basis function neural network; Specht developed it from Bayesian decision theory [1]. The PNN constructs an estimate of the probability density functions, according to the Parzen-window method, by summing the outputs of radial basis function neurons [2]. The PNN has a compact mathematical foundation and a clear structure. Supplied with enough classified training samples, a PNN can be used directly without a training course and obtains good results in general pattern classification applications. However, a PNN is usually complex when it is supplied with large-scale, highly redundant training samples. As a solution, Principal Component Analysis (PCA) is introduced into PNN training. PCA is an important multivariate statistical analysis technique, which extracts the principal components and indicates the linear correlation between process variables. In recent years, there have been mainly two research directions concerning the combination of PCA and neural networks. On the one hand, PCA is used as input pretreatment for a neural network, reducing the input dimension; with fewer input variables, the neural network has a simpler structure, lower computational complexity to some extent, and higher pattern classification efficiency [3,4]. On the other hand, neural networks are used to perform nonlinear PCA, which simplifies the statistical computation and improves the robustness of
PCA [5,6]. In this paper, PCA is used not only for extracting the principal components, but also for eliminating the correlation between process variables. The statistical results of PCA guide the PNN construction: the pattern neurons of the PNN are grouped based on the eigenvalues obtained by PCA, which decreases the structural complexity of the PNN in applications with large-scale, highly redundant training samples. This paper is organized as follows. In Section 2, the PNN is analyzed in terms of probability computation, and a structure-optimizing strategy based on the multiplication formula is proposed. A PCA-based structure optimization strategy for the PNN and the input data pretreatment are described in Section 3. The computation course and the learning algorithm with a Simulated Annealing (SA) coefficient of the optimized PNN are presented in Section 4. In Section 5, the optimized PNN is applied to the prediction of Anode Effect (AE) in an aluminium reduction cell, and the results are compared with those of a typical PNN. Conclusions are given in Section 6.
2 PNN Structure Simplification Based on Multiplication Formula
The motivation of the PNN is the probability density function estimator developed by Parzen, which asymptotically approaches the underlying parent density given known random samples. The particular estimator is
$$f_A(x) = \frac{1}{(2\pi)^{p/2}\sigma^{p} S_A}\sum_{i=1}^{S_A}\exp\!\left[-\frac{(x-x_{ai})^{\mathrm{T}}(x-x_{ai})}{2\sigma^{2}}\right], \qquad (1)$$
where x_{ai} is the i-th sample of pattern A, S_A is the number of samples of pattern A, and σ is the so-called smoothing parameter, which determines the width of the Gaussian function centered at the corresponding training sample. With enough representative samples, this posterior probability density estimate approaches the prior density smoothly and continuously [1]. Eq. (1) can be expressed as a network computation, which is the PNN shown in Fig. 1. This PNN has an M-dimensional input and an N-dimensional output, meaning that N patterns need to be recognized; S1, S2, ..., SN are the numbers of training samples belonging to the N patterns, respectively. The PNN has four feed-forward layers: the input layer, pattern layer, summation layer, and output layer. The connection weights from the summation layer to the output layer are the proportions of the corresponding pattern's sample count to the total sample count; all other connection weights are unity. In the PNN, each pattern neuron represents a training sample: it computes the similarity between the input and the corresponding sample, i.e., the probability of the input matching that sample. Summation neurons sum the outputs of all pattern neurons belonging to the same pattern, and the probability that the input belongs to a certain pattern is the product of the corresponding connection weight and the summation output. Generally, when applied to problems without very large numbers of training samples, the PNN obtains good classification results.
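A minimal sketch of this classical PNN decision rule is given below; the class priors are taken proportional to the class sample counts as in Fig. 1, a single shared smoothing parameter is assumed, and the names are illustrative rather than the authors' code.

```python
import numpy as np

def pnn_classify(x, class_samples, sigma):
    """class_samples: one (S_k, p) array of training samples per class."""
    total = sum(len(s) for s in class_samples)
    scores = []
    for samples in class_samples:                    # one summation neuron per class
        d2 = ((samples - x) ** 2).sum(axis=1)        # pattern-neuron distances
        f = np.exp(-d2 / (2.0 * sigma ** 2)).mean()  # Parzen estimate of Eq. (1), up to a constant
        scores.append(f * len(samples) / total)      # output weight w_k = S_k / sum(S)
    return int(np.argmax(scores))
```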
Fig. 1. Topological structure of PNN. Each pattern neuron represents a training sample.
However, a shortcoming of the PNN cannot be ignored: the number of pattern neurons equals the number of training samples. When the PNN is applied to a problem with large-scale, highly redundant training samples, the number of pattern neurons becomes very large, the structure is complex, computation is slow, and classification efficiency decreases. In practice, some features are extracted to reduce the number and the redundancy of the training samples and to simplify the PNN structure [7,8]. However, deciding which features to extract, and how, requires the PNN user to have considerable experience with the specific application, which limits the applicability of the PNN. In this paper, a general PNN structure optimization strategy is proposed. The motivation is the multiplication formula in probability computation,
$$P(ABC) = P(C\mid AB)\,P(B\mid A)\,P(A). \qquad (2)$$
When events A and B are mutually independent, Eq. (2) can be simplified as
$$P(ABC) = P(C\mid AB)\,P(B)\,P(A). \qquad (3)$$
Assuming the input variables are mutually independent, the structure of the PNN can be simplified following Eq. (3), as shown in Fig. 2. Each pattern neuron of the typical PNN represents a training sample, whereas each pattern neuron of the simplified PNN represents a character state of the corresponding input variable. Multiplication neurons compute the product of the outputs of the corresponding pattern neurons. The probability of the event represented by a multiplication neuron is the product of the multiplication output and the output weight, which is a conditional probability; for example, if the connection weight is w11 = P(Z1 | X11 X21 ... XM1), then P(Z1) = w11 P(X11) P(X21) ... P(XM1). Output neurons sum the products of the multiplication neuron outputs and their corresponding weights, giving the probability that the input belongs to the corresponding pattern. In the simplified PNN, the connection weights from the multiplication layer to the summation layer are the conditional probabilities; all other connection weights are 1.
Fig. 2. Topological structure of simplified PNN by using probability multiplication formula. Each pattern neuron represents a character state of the corresponding input variable.
3 PCA Based PNN Structure Optimization
Only when the events are independent can Eq. (2) be simplified to Eq. (3). Thus the key precondition for simplifying the PNN structure using the multiplication formula is that the input variables are independent, which is rarely true in real problems. The correlation between input variables must therefore be eliminated by some pretreatment, and PCA is the simplest and most efficient choice. Furthermore, the number of pattern neurons corresponding to each input can be calculated under the guidance of the eigenvalues obtained by PCA, so the structure design of the PNN follows computation rules, which decreases structural uncertainty by avoiding manual settings. Suppose there are S training samples with M-dimensional input and N-dimensional output; the original training sample set is (x_1, x_2, ..., x_S ∈ R^M, y_1, y_2, ..., y_S ∈ R^N). PCA is applied to the input data of the training samples. The eigenvalues are λ_1, λ_2, ..., λ_M (λ_1 ≥ λ_2 ≥ ... ≥ λ_M ≥ 0), with corresponding eigenvectors l_1, l_2, ..., l_M. The first P eigenvalues are selected such that
$$\eta_P = \sum_{k=1}^{P}\lambda_k \Big/ \sum_{k=1}^{M}\lambda_k \ \ge\ 80\%.$$
The corresponding eigenvectors l_1, l_2, ..., l_P of the first P eigenvalues are stacked into the transform matrix L_P = [l_1'; l_2'; ...; l_P'] ∈ R^{P×M}. The new P-dimensional vectors p_i = L_P x_i (i = 1, 2, ..., S) contain the first P principal components. Using the new sample set (p_1, p_2, ..., p_S ∈ R^P, y_1, y_2, ..., y_S ∈ R^N) as the training set for the neural network, the correlation between input variables is eliminated, so the PNN can be simplified based on the multiplication formula. Meanwhile,
P < M, which means PCA reduces the PNN input dimension; the smaller input dimension decreases both the computation and the structural complexity. The pattern neurons of the simplified PNN are grouped according to the first P eigenvalues obtained by PCA. In PCA terms, a larger eigenvalue means a larger variance of the corresponding component variable, and hence more information; therefore, the input variable with a larger eigenvalue should be allocated more pattern neurons to increase the classification accuracy of the PNN. In the simplified PNN, if the total number of pattern neurons is Q, the number of pattern neurons corresponding to the input variables p_1, p_2, ..., p_P is
$$q_i = \operatorname{round}\!\left(Q\,\lambda_i \Big/ \sum_{k=1}^{P}\lambda_k\right), \qquad i = 1, 2, \ldots, P,$$
where round( ) rounds its argument to the nearest integer. The role of PCA in the PNN structure optimization is shown in Fig. 3. On the one hand, the original PNN input is pretreated using the transform matrix obtained by PCA: the correlation between input variables is eliminated, so the PNN can be simplified based on the multiplication formula, and at the same time the input dimension is reduced without losing much information. On the other hand, the number of pattern neurons corresponding to each input variable is set according to the numerical values of the eigenvalues. The structure design of the PNN is thus optimized by definite computation rules and no longer depends on artificial experience or large-scale tests.
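The following minimal sketch shows the PCA pretreatment and the pattern-neuron allocation just described; the 80% retention threshold and the rounding rule come from the text, while the centering and the covariance-based eigendecomposition are illustrative assumptions.

```python
import numpy as np

def pca_design(X, Q, retain=0.80):
    """X: (S, M) training inputs. Returns the transform L_P and the neuron counts q_i."""
    Xc = X - X.mean(axis=0)
    eigval, eigvec = np.linalg.eigh(np.cov(Xc, rowvar=False))
    order = np.argsort(eigval)[::-1]                    # lambda_1 >= ... >= lambda_M
    eigval, eigvec = eigval[order], eigvec[:, order]
    ratios = np.cumsum(eigval) / eigval.sum()
    P = int(np.searchsorted(ratios, retain)) + 1        # smallest P with eta_P >= 80%
    L_P = eigvec[:, :P].T                               # (P, M) transform matrix
    q = np.round(Q * eigval[:P] / eigval[:P].sum()).astype(int)   # neuron allocation q_i
    return L_P, q

# Pretreatment of an input x is then simply p = L_P @ x.
```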
4 Computation and Learning Algorithm of the Optimized PNN
The PCA-based optimized PNN has two computation courses, pretreatment and network computation, as shown in Fig. 3. Pretreatment performs input dimension reduction and correlation elimination using the transform matrix L_P; network computation is the basic feed-forward neural network computation. The pretreatment computation is p = L_P x, where x is the input vector, L_P is the transform matrix, and p is the PCA result. Layer 1 transfers the input values to the pattern neurons directly. The function of a pattern neuron in layer 2 is the Gaussian function
$$z_j^{(2)} = \frac{1}{\sqrt{2\pi}\,\sigma_j}\exp\!\left(-\frac{(u_j^{(2)}-\mu_j)^2}{\sigma_j^2}\right),$$
where u_j^(2) and z_j^(2) are the input and output of the j-th neuron in layer 2, and μ_j and σ_j are the mean and width of the Gaussian function. Layer 3 is the multiplication computation
$$z_k^{(3)} = \prod_{l=1}^{L_k} u_{kl}^{(3)},$$
Fig. 3. Input pretreatment and structure optimization of PNN by using PCA
where u_kl^(3) and z_k^(3) are the inputs and output of the k-th neuron in layer 3, and L_k is the number of inputs to that neuron. Layer 4 is the summation computation
$$y_n = z_n^{(4)} = \sum_{k=1}^{K} w_{nk}\, z_k^{(3)},$$
where y_n = z_n^(4) is the output of the n-th neuron in layer 4, i.e., the n-th output of the PNN; K is the number of neurons in layer 3; and w_nk is the connection weight from the k-th neuron in layer 3 to the n-th neuron in layer 4. The connections from layer 3 to layer 4 are full connections with corresponding weights. A typical PNN has no learning course: its parameters are set from artificial experience, which introduces some uncertainty. To address this problem, a learning algorithm is introduced whose objective is to minimize
$$E = \frac{1}{2}\sum_{s=1}^{S}\sum_{n=1}^{N}\big(y_n(s)-\hat{y}_n(s)\big)^2, \qquad (4)$$
where ŷ_n(s) is the expected output. In the optimized PNN, the parameters to be adjusted are w_nk, the connection weights from layer 3 to layer 4, and μ_j and σ_j, the center value and smoothing parameter of the pattern neurons in layer 2. For simplicity, and without loss of generality, a gradient-descent learning algorithm is employed. The update rule for w_nk is
$$\frac{\partial E}{\partial w_{nk}} = \sum_{s=1}^{S}\sum_{n=1}^{N}\big(y_n(s)-\hat{y}_n(s)\big)\,z_k^{(3)}(s), \qquad
w_{nk} = w_{nk} - \exp\!\left(\frac{E}{t}\right)\eta\,\frac{\partial E}{\partial w_{nk}},$$
where η is the learning-rate coefficient and exp(E/t) is the SA coefficient, with E the mean square error of Eq. (4) and t the learning step. Obviously, when E is larger, exp(E/t) is larger and the update of w_nk grows; when t is larger, exp(E/t) is smaller and the update of w_nk shrinks. Thus exp(E/t) makes the learning course more efficient in the early period and more stable in the later period. It is called the SA coefficient because of its origin in, and similarity to, the position-modification probability in the Simulated Annealing optimization algorithm. The update rules for μ_j and σ_j are
$$\frac{\partial E}{\partial z_k^{(3)}(s)} = \sum_{n=1}^{N}\big(y_n(s)-\hat{y}_n(s)\big)\,w_{nk},$$
$$\frac{\partial z_k^{(3)}(s)}{\partial \mu_j} = \delta_{kj}\,z_k^{(3)}(s)\,\frac{2}{\sigma_j^2}\big(u_j^{(2)}-\mu_j\big), \qquad
\frac{\partial z_k^{(3)}(s)}{\partial \sigma_j} = \delta_{kj}\,z_k^{(3)}(s)\,\frac{2}{\sigma_j^3}\big((u_j^{(2)}-\mu_j)^2-\sigma_j^2\big),$$
$$\mu_j = \mu_j - \exp\!\left(\frac{E}{t}\right)\eta\sum_{s=1}^{S}\sum_{k=1}^{K}\frac{\partial E}{\partial z_k^{(3)}(s)}\frac{\partial z_k^{(3)}(s)}{\partial \mu_j}, \qquad
\sigma_j = \sigma_j - \exp\!\left(\frac{E}{t}\right)\eta\sum_{s=1}^{S}\sum_{k=1}^{K}\frac{\partial E}{\partial z_k^{(3)}(s)}\frac{\partial z_k^{(3)}(s)}{\partial \sigma_j},$$
where δ_kj = 1 if the j-th neuron in layer 2 is connected to the k-th neuron in layer 3, and δ_kj = 0 otherwise.
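A minimal sketch of one batch update of the output weights with the SA coefficient exp(E/t) is given below; the forward pass follows the layer equations above, the μ/σ updates would follow the same pattern, and all shapes and names are illustrative assumptions.

```python
import numpy as np

def forward(p, mu, sigma, feat_of, groups, W):
    """p: (P,) pretreated input; mu, sigma: (J,) pattern-neuron parameters; feat_of: (J,)
    index of the input variable each pattern neuron belongs to; groups: one index list per
    multiplication neuron; W: (N, K) weights from layer 3 to layer 4."""
    z2 = np.exp(-((p[feat_of] - mu) ** 2) / sigma ** 2) / (np.sqrt(2 * np.pi) * sigma)
    z3 = np.array([np.prod(z2[g]) for g in groups])   # layer-3 products
    return W @ z3, z3                                  # layer-4 outputs y_n

def train_epoch(samples, targets, mu, sigma, feat_of, groups, W, eta, t):
    grad_W = np.zeros_like(W)
    E = 0.0
    for p, y_hat in zip(samples, targets):
        y, z3 = forward(p, mu, sigma, feat_of, groups, W)
        err = y - y_hat
        E += 0.5 * np.sum(err ** 2)                    # Eq. (4)
        grad_W += np.outer(err, z3)                    # accumulated dE/dw_nk
    W -= np.exp(E / t) * eta * grad_W                  # SA coefficient exp(E/t)
    return W, E
```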
5 Prediction of AE Based on the Optimized PNN
The Anode Effect (AE) is the most important operating fault of the aluminium reduction cell, the key piece of equipment in the aluminium-making industry. Due to the complex working states and field conditions, an aluminium reduction cell is a multivariable, coupled, time-varying, long-time-delay, large nonlinear system, which makes neural networks a feasible tool for AE prediction. In aluminium making, the basic period of online data recording (e.g., cell voltage) is one minute, whereas AE happens on a scale of several days or longer, so the samples for AE prediction are highly redundant. If a typical PNN were used, the number of pattern neurons would be very large, the structure would be complex, and computation speed and classification efficiency would be low. The PCA-based optimized PNN is therefore used to predict AE in an aluminium reduction cell. The training sample set is the data from a cell over an 8-hour period during which a normal AE happens. The input consists of online and offline data from the aluminium reduction cell, such as the inter-electrode gap, series current, cell voltage, and cell temperature; the expected output is the probability of AE decided from the experience of field workers. After the PCA-based structure design, the optimized PNN has only a 2-dimensional input, much smaller than the original 8-dimensional input. The size of the optimized PNN is
Table 1. Number of neurons and connections of the typical PNN and the optimized PNN

                   Samples   Neurons   Connections
Traditional PNN    958       968       8713
Optimized PNN      958       28        58
compared with that of the typical PNN in Table 1, which shows that PCA can greatly optimize the PNN structure in applications with large-scale, highly redundant training samples. The optimized PNN is tested using data from the same aluminium reduction cell over an 8-hour period during which an unwanted AE happens; the results are shown in Fig. 4. Fig. 4a shows that the cell operates normally at first and undergoes the sudden AE fault at about the 303rd minute, when the cell voltage increases rapidly from the normal 4.2 V to about 40 V, returning rapidly to the normal range after the AE is extinguished about 4 to 5 minutes later. Fig. 4b shows that the PNN-predicted probabilities of AE increase rapidly about 30 minutes before the real AE happens and reach the alert range about 10 minutes before. These prediction results show that the PNN can predict AE in time, which is helpful for economical and safe aluminium production. Furthermore, the PCA-based optimized PNN achieves prediction results similar to those of the typical PNN using a much simpler network structure.
Fig. 4. Anode effect prediction based on the optimized PNN
6 Conclusion
In this paper, the PNN was analyzed in terms of probability computation, and a structure optimization strategy based on the multiplication formula was proposed. First, the input training samples are analyzed using PCA: the correlation between input variables is eliminated, so the PNN can be optimized using the multiplication formula, and the input dimension is reduced, which decreases the PNN computation. Subsequently, according to the eigenvalues obtained by PCA, pattern neurons are allocated in different numbers to the different input variables, which, together with the pretreatment, optimizes the PNN structure. The computation course of the optimized PNN is then described; the learning algorithm decreases the parameter uncertainty by avoiding manual settings, and the SA coefficient is introduced to increase the efficiency and stability of the learning course. Finally, the optimized PNN was applied to predicting AE in an aluminium reduction cell, which validated the simplicity, efficiency, and reliability of the optimized PNN in applications with large-scale, highly redundant training samples.
Acknowledgment
This paper is sponsored by the National High Technology Research and Development Program of China (863 Program) No. 2002AA412510 and No. 2002AA412420. Grateful acknowledgement is given here for the financial aid to this research.
References 1. Specht, D.F.: Probabilistic Neural Networks for Classification, Mapping, or Associative Memory. IEEE International Conference on Neural Networks 1 (1988) 525-532 2. Labonte, G.: On the Efficiency of OLS Reduced Probabilistic Neural Networks for AircraftFlare Discrimination. Proceedings of the International Joint Conference on Neural Network 3 (2003) 2306-2311 3. Zhang, Y.C., Peng, L.H., Yao, D.Y., et al.: Principal Component Analysis Method for TwoPhase Flow Concentration Measurements. Journal of Tsinghua University (Science and Technology) 43 (2003) 400-401, 405 4. Oh, B.J.: Face Recognition by Using Neural Network Classifiers based on PCA and LDA. 2005 IEEE International Conference on Systems, Man and Cybernetics 2 (2005) 1699-1703 5. Wang, S., Xia, S.W.: Self-Organizing Algorithm of Robust PCA based on Single-Layer NN. Journal of Tsinghua University (Science and Technology) 37 (1997) 121-124 6. Kong, W., Yang, J.: Applications of Nonlinear PCA based on Neural Network in Prediction of Melt Index. Computer Simulation 20 (2003) 65-67 7. Albano, M., Caldon, R., Turri, R.: Voltage Sag Analysis on Three Phase System Using Wavelet Transform and Probabilistic Neural Network. Universities Power Engineering Conference 3 (2004) 948-952 8. Chen, C.H., Chu, C.T.: Low Complexity Iris Recognition based on Wavelet Probabilistic Neural Networks. Proceedings of. 2005 IEEE International Joint Conference on Neural Networks (IJCNN '05) 3 (2005) 1930-1935
A Multi-scale Dynamically Growing Hierarchical Self-organizing Map for Brain MRI Image Segmentation
Jingdan Zhang and Dao-Qing Dai
Center for Computer Vision and Department of Mathematics, Sun Yat-Sen (Zhongshan) University, Guangzhou, 510275 China
[email protected], [email protected]
Abstract. With Kohonen's self-organizing map based brain MRI image segmentation, there are still some regions that are not partitioned accurately, particularly in the transitional regions between gray matter and white matter, or between cerebrospinal fluid and gray matter. In this paper, we propose a dynamically growing hierarchical self-organizing map integrated with a multi-scale feature vector to overcome this problem; it uses the spatial relationships between image pixels and a multi-scale processing method to reduce the noise effect and the classification ambiguity. The efficacy of our approach is validated by extensive experiments using both simulated and real MRI images.
1 Introduction
In recent years, various imaging modalities have become available for acquiring complementary information on different aspects of anatomy. Because of the advantages of MRI over other diagnostic imaging modalities [1], the majority of research in medical image segmentation pertains to its use for MRI images. Automatic segmentation of brain MRI images into the three main tissue types, white matter (WM), gray matter (GM), and cerebrospinal fluid (CSF), is a topic of great importance and much research. It is known that volumetric analysis of different parts of the brain is useful in assessing the progress or remission of various diseases, such as Alzheimer's disease, epilepsy, sclerosis, and schizophrenia [2]. There are now many methods available for MRI image segmentation [2], and clustering methods naturally apply to this task [2], [3]. However, uncertainty is widely present in MRI image data because of the noise and blur in acquisition and the partial volume effects originating from the low sensor resolution. Therefore, neural-network-based segmentation can be used to overcome these adversities [4], [5]. Among these neural network techniques, Kohonen's self-organizing map (SOM) is used most in MRI segmentation [6]. But SOM has certain fundamental limitations in the context of image segmentation, especially for brain MRI images, because most brain MRI images always
Corresponding author.
present overlapping grey-scale intensities for different tissues, particularly in the transitional regions of GM and WM, or CSF and GM. Several improved SOM algorithms and SOM-related algorithms have been proposed in recent years to overcome its drawbacks. Hierarchical SOM is a variation of SOM [7], [8]. In this paper we address the segmentation problem in the context of isolating the brain tissues in MRI images. Kohonen’s self-organizing feature map is exploited as a competitive learning clustering algorithm in our work. However, there are still some regions which are not partitioned accurately, particularly in the transitional regions between GM and WM, CSF and GM. Therefore, a dynamically growing hierarchical SOM (DGHSOM) is proposed in our work to overcome this problem. Moreover, for image data, there is strong spatial correlation between neighboring pixels. To produce meaningful segmentation, we integrate a multi-scale feature vector with DGHSOM, called MDGHSOM, in which we consider the spatial relationships between image pixels and multi-scale processing method to reduce the noise effect and the classification ambiguity. The efficacy of our approach is validated by extensive experiments using both simulated and real MRI images. The rest of this paper is organized as follows. MDGHSOM for MRI image segmentation is proposed in Section 2. Experimental results are presented in Section 3 and we conclude this paper in Section 4.
2 SOM and the Proposed MDGHSOM
2.1 Kohonen's Self-Organizing Map (SOM)
SOM consists of an input layer and a single output layer of M neurons which usually form a two-dimensional array. The training of SOM is usually performed using the Kohonen algorithm [9]. Each neuron i has a d-dimensional feature vector w_i = [w_i1, ..., w_id]. At each training step t, a sample data vector x(t) is randomly chosen from the training set and the distances between x(t) and all feature vectors are computed. The winning neuron, denoted by c, is the neuron whose feature vector is closest to x(t):
$$c = \arg\min_{i}\ \| x(t) - w_i \|, \qquad i \in \{1,\ldots,M\}. \qquad (1)$$
A set of neighboring nodes of the winning node is denoted N_c; its neighborhood radius around the winning neuron decreases with time. We define N_t(c, i) as the neighborhood function around the winning neuron c at time t; it is a non-increasing function of time t and of the distance of neuron i from the winning neuron c in the output layer, and can be taken as N_t(c, i) = exp(−‖r_i − r_c‖² / 2N_c²(t)), where r_i is the coordinate of neuron i on the output layer and N_c(t) is a width parameter. The weight-updating rule in the sequential SOM algorithm can be written as
$$w_i(t+1) = w_i(t) + \alpha(t)\,N_t(c,i)\,[\,x(t)-w_i(t)\,], \qquad \forall\, i \in N_c. \qquad (2)$$
Both the learning rate α(t) and the neighborhood width N_c(t) decrease monotonically with time.
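A minimal sketch of one training step of Eqs. (1)-(2) on a 2-D map is given below; the exponential decay schedules for the learning rate and the neighborhood width are illustrative assumptions, not the schedules used in the paper.

```python
import numpy as np

def som_step(W, grid, x, t, T, alpha0=0.5, width0=3.0):
    """W: (M, d) feature vectors; grid: (M, 2) output-layer coordinates; x: (d,) sample."""
    c = np.argmin(((W - x) ** 2).sum(axis=1))                 # Eq. (1): winning neuron
    alpha = alpha0 * np.exp(-t / T)                           # decreasing learning rate
    width = width0 * np.exp(-t / T)                           # decreasing neighborhood width
    h = np.exp(-((grid - grid[c]) ** 2).sum(axis=1) / (2 * width ** 2))
    W += alpha * h[:, None] * (x - W)                         # Eq. (2) applied to all neurons
    return W
```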
2.2 A Multi-scale Feature Vector
In this section, we first propose a multi-scale, adaptive spatial feature vector as the input vector, which considers the spatial relationships between image pixels and uses multi-scale processing to reduce the noise effect and the classification ambiguity; it also combines local and global information. The intensity of the input pixel (Intensity) is very important for clustering, but if only the intensity is considered, some local details are neglected, particularly in the transitional regions. Thus, the gradient of the input pixel in its 3 × 3 neighborhood is computed as one element of the input vector (Gradient). If Gradient is small enough, the input pixel is homogeneous with its neighbors and may be an interior pixel of some tissue; if Gradient exceeds a given threshold, the input pixel may lie in a transitional region. To obtain a precise clustering result, multi-scale information is considered: the mean values (mean3, mean5) and variances (variance3, variance5) of the input pixel in neighborhoods of size 3 × 3 and 5 × 5 are computed, respectively. Comparing mean3 and mean5 with Intensity reveals the local variation tendency, and variance3 and variance5 give further local information about the input pixel. Based on the above, we construct the input vector as (Intensity, Gradient, mean3, variance3, mean5, variance5). Different elements of the input vector have different importance in each layer, and we assign them different weights.
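The six-element feature vector can be assembled per pixel as in the following sketch; the simple difference-based gradient magnitude and the plain window statistics are illustrative choices, not necessarily those used by the authors.

```python
import numpy as np

def pixel_features(img, r, c):
    """Return (Intensity, Gradient, mean3, variance3, mean5, variance5) for pixel (r, c)."""
    win3 = img[r - 1:r + 2, c - 1:c + 2].astype(float)
    win5 = img[r - 2:r + 3, c - 2:c + 3].astype(float)
    gx = win3[:, 2].sum() - win3[:, 0].sum()          # horizontal difference in the 3x3 window
    gy = win3[2, :].sum() - win3[0, :].sum()          # vertical difference
    return np.array([float(img[r, c]), np.hypot(gx, gy),
                     win3.mean(), win3.var(),
                     win5.mean(), win5.var()])
```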
2.3 MDGHSOM
SOM has certain fundamental limitations in the context of image segmentation, especially for brain MRI images, because most brain MRI images present overlapping grey-scale intensities for different tissues, particularly in the transitional regions. A dynamically growing SOM with two hierarchies is therefore integrated with the multi-scale feature vector to solve this problem; it is called MDGHSOM. A neuron at the higher level can dynamically generate a child SOM at the lower level according to that neuron's weights. This resembles GHSOM, where the network grows hierarchically under certain conditions, but MDGHSOM does not grow neurons horizontally, which simplifies the growing process and trains the network faster. Note that the number of neurons at the second level of MDGHSOM is adaptively determined, whereas in HSOM it is predefined. At each layer n, each neuron i has a six-dimensional weight vector w_i^n = [w_i1^n, ..., w_i6^n] corresponding to the input vector x_p = [x_p1, ..., x_p6]. Here w_i1^n, w_i2^n, w_i3^n, w_i4^n, w_i5^n, w_i6^n denote the intensity centroid, gradient centroid, 3 × 3 neighborhood mean centroid, 3 × 3 neighborhood variance centroid, 5 × 5 neighborhood mean centroid,
and 5 × 5 neighborhood variance centroid, respectively, of the pixels clustered into the i-th neuron of the n-th layer SOM. The MDGHSOM algorithm is summarized as follows:
1. Initialization: Set level n = 1 and the weights at the first level w_i^1 = [w_i1^1, w_i2^1, w_i3^1, w_i4^1, w_i5^1, w_i6^1]. The SOM in the first layer is trained with all data by invoking the function TrainSOM(x_p, n, W).
2. Recursive Loop: GenerateSOM(x_p, n, W).
If w_i2^1 is larger than g0 (a constant threshold chosen according to the experimental results; it is set to 64 in our experiments), the neighborhood gradient of the pixels classified into neuron i is too large, and these pixels may lie in a transitional region between different tissues. Of course, the large gradient value could also be caused by noise pixels. To handle this, the values of |w_i4^1 − w_i1^1| and |w_i6^1 − w_i1^1| are also considered. If |w_i4^1 − w_i1^1| < m0 and |w_i6^1 − w_i1^1| < m1 (m0, m1 are constant thresholds set to 6 and 8, respectively, in our experiments), we conclude that the intensity of the pixels classified into neuron i is similar to the mean values of their 3 × 3 and 5 × 5 neighborhoods, and the large gradient value is caused by noise. Otherwise, the pixels clustered into this neuron differ from their neighbors and may be transitional pixels between different tissues. To obtain an accurate segmentation result, a child SOM with two neurons (because there are only two classes of pixels in a transitional region) is generated in the second layer to partition these transitional pixels again. Based on the above analysis, we give the function for generating a SOM:
Function GenerateSOM(x_p, n, W): If w_i2^1 > g0, |w_i4^1 − w_i1^1| > m0, and |w_i6^1 − w_i1^1| > m1, neuron i of the first layer spawns two child neurons representing a child SOM in the second layer; the child SOM is then trained by TrainSOM(x_p, n, W).
Function TrainSOM(x_p, n, W): Train the SOM at the n-th level with the input vectors x_p and update the weights of the neurons. When training each SOM, we use the original SOM algorithm; the child SOMs are trained with the data associated with their mother neurons. MDGHSOM completes the training of the SOMs at the first level and then proceeds to train the SOMs at the second level. Fig. 1 shows the architecture of MDGHSOM.
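The growth test inside GenerateSOM can be summarized as the following sketch; the thresholds g0, m0, m1 are the values quoted above, and the tuple layout of the weight vector is an illustrative assumption.

```python
def should_spawn_child(w, g0=64.0, m0=6.0, m1=8.0):
    """w = (intensity, gradient, mean3, var3, mean5, var5) centroids of a first-layer neuron."""
    intensity, gradient, mean3, _, mean5, _ = w
    # Spawn a two-neuron child SOM only when the neighborhood gradient is large and the
    # intensity really deviates from both neighborhood means (so it is not mere noise).
    return (gradient > g0
            and abs(mean3 - intensity) > m0
            and abs(mean5 - intensity) > m1)
```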
3 Experimental Results
The number of tissue classes in the segmentation is set to three, which corresponds to CSF, GM and WM. Background pixels are ignored in the computation. Extra-cranial tissues are removed from all images prior to segmentation. For all segmentation experiments, the number of training steps T is set to 25 in the
Fig. 1. The architecture of MDGHSOM that grows neurons hierarchically when needed
first layer, and it is set to 50 in the second layer. The proposed algorithm was implemented in C and tested on both simulated MRI images obtained from the BrainWeb Simulated Brain Database¹, and on real MRI data obtained from the Internet Brain Segmentation Repository (IBSR)².
3.1 Results Analysis and Comparison
To illustrate this MDGHSOM approach, we compare the segmentation results of FCM segmentation, SOM clustering, and our method with the ground truth in Fig. 2c, d, e and f. The original image, simulated from normal brain phantoms with a 5% noise level, is shown in Fig. 2a, with its processed result after wavelet-based de-noising [10] in Fig. 2b. All segmentation results are obtained after the original image is de-noised. The FCM segmentation result (Fig. 2c) and the SOM clustering result (Fig. 2d) are partitioned inaccurately, particularly in the transitional regions of GM and WM, or CSF and GM. The result of our proposed MDGHSOM approach (Fig. 2e), taking both local and global information into account, clearly outperforms the results mentioned above (Fig. 2c and d). Fig. 2f is the ground truth of this slice. Fig. 2g, h, i, j, k and l show zoomed-in parts of Fig. 2a, b, c, d, e and f respectively, and the efficiency of our method can be clearly observed in the circle and square regions. In our experiments, three different indices (false positive ratio γ_fp, false negative ratio γ_fn, and similarity index ρ [11]) are computed for each of the three brain tissues as quantitative measures to compare FCM segmentation, SOM clustering, and our method with the ground truth. For a given brain tissue i, with i = 1, 2, 3 for CSF, GM and WM respectively, suppose that A_i and B_i represent the sets of pixels labeled into i by the ground truth and by our method respectively, and |A_i| denotes the number of pixels in A_i. The false positive ratio is defined as γ_fp = (|B_i| − |A_i ∩ B_i|)/|A_i|. Likewise, the false negative ratio is defined as γ_fn = (|A_i| − |A_i ∩ B_i|)/|A_i|.
¹ http://www.bic.mni.mcgill.ca/brainweb
² http://www.cma.mgh.harvard.edu/ibsr
Fig. 2. (a) Original image simulated from MRI brain phantom with 5% noise level, and its processed version with (b) wavelet-based de-noising. (c) FCM segmentation. (d) SOM clustering result. (e) MDGHSOM clustering result. (f) The ground truth of (a). (g), (h), (i), (j), (k) and (l) are parts of images after (a), (b), (c), (d), (e) and (f) zoomed in respectively.

Table 1. Comparing FCM clustering, SOM clustering, and our method with the ground truth (values in %)

         FCM                      SOM                      MDGHSOM
         γfp    γfn    ρ          γfp    γfn    ρ          γfp    γfn    ρ
WM       19.3   4.13   93.92      4.11   8.24   94.47      2.61   6.42   96.30
GM       9.66   9.29   90.08      15.7   2.06   91.19      9.77   5.29   93.09
CSF      5.46   17.3   88.28      6.65   15.2   89.30      2.96   8.22   94.26
The similarity index ρ is an intuitive and plain index that measures the matching area between A_i and B_i, defined as ρ = 2|A_i ∩ B_i|/(|A_i| + |B_i|). The comparison results using these indices are shown in Table 1. Our scheme produces a robust and precise segmentation.
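The three indices can be computed per tissue from boolean masks as in the following minimal sketch; the array names are illustrative.

```python
import numpy as np

def segmentation_indices(ground_truth_mask, predicted_mask):
    """Both arguments are boolean arrays marking one tissue class (A_i and B_i)."""
    A, B = ground_truth_mask, predicted_mask
    inter = np.logical_and(A, B).sum()
    gamma_fp = (B.sum() - inter) / A.sum()            # false positive ratio
    gamma_fn = (A.sum() - inter) / A.sum()            # false negative ratio
    rho = 2.0 * inter / (A.sum() + B.sum())           # similarity (Dice) index
    return gamma_fp, gamma_fn, rho
```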
3.2 Quantitative Validation
To quantitatively validate our method, test images with known ground truth are required. For this purpose, we use the simulated MRI images from the BrainWeb Simulated Brain Database with T1-weighted sequences, slice thickness of 1 mm, and noise levels of 3%, 5%, 7% and 9% respectively. The skull, fat, and unnecessary background are first removed with the guidance of ground truth. Fig. 3a, d, g and j are simulated MRI images with noise levels of 3%, 5%, 7% and 9% respectively. The corresponding segmentation results processed using
Fig. 3. Segmentation of simulated image from MRI brain phantom. (a), (d), (g) and (j) are images with noise levels of 3%, 5%, 7% and 9% respectively, with (b), (e), (h) and (k) their corresponding segmentation results using our proposed approach. (c), (f), (i) and (l) are the ground truth of (a), (d), (g) and (j) respectively.
Fig. 4. Validation results for different noise levels (similarity index of WM, GM and CSF versus noise level)
our approach on the original images de-noised [10] are shown in Fig. 3b, e, h and k with their ground truth of Fig. 3c, f, i and l respectively. The similarity index ρ [11] is exploited for each of three brain tissues as quantitative measure to validate the accuracy of our method. The validation results are shown in Fig. 4. The similarity index ρ > 70% indicates an excellent similarity [11]. In our experiments, the similarity indices ρ of all the tissues are
larger than 90% even for a bad condition with 9% noise level, which indicates an excellent agreement between our segmentation results and the ground truth.
3.3 Performance on Real MRI Data
To validate the efficiency of our proposed approach, we also test it on real MRI data obtained from the Internet Brain Segmentation Repository (IBSR). Extracranial tissues are removed from all images prior to segmentation, using the method proposed in [12]. Fig. 5a shows one slice of real T1-weighted MRI images. Fig. 5b is FCM segmentation result. Fig. 5c shows the clustering result using SOM. Using our proposed method, the result image is shown in Fig. 5d. Visual inspection shows that our approach produces better segmentation than others, especially in the transitional regions of gray matter and white matter, or cerebrospinal fluid and gray matter, such as in the circle regions.
Fig. 5. Segmentation of real MRI image. (a) Original image. (b) FCM segmentation. (c) SOM clustering result. (d) MDGHSOM clustering result.
4 Conclusion
In this paper we address the segmentation problem in the context of isolating the brain tissues in MRI images. SOM is exploited as a competitive learning clustering algorithm in our work. However, the transitional regions between tissues in MRI images are not clearly defined and their membership is intrinsically vague. Therefore, a dynamically growing hierarchical SOM is proposed in this paper to overcome the above problem. Moreover, for image data, there is strong spatial correlation between neighboring pixels. So a multi-scale feature vector is integrated with DGHSOM, called MDGHSOM, in which we consider the spatial relationships between image pixels and a multi-scale processing method to reduce the noise effect and the classification ambiguity. The efficacy of our approach is validated by extensive experiments using both simulated and real MRI images.
Acknowledgments This project is supported in part by NSF of China(60575004, 10231040), NSF of GuangDong(05101817) and Ministry of Education of China(NCET-04-0791).
References 1. Wells, W.M., Grimson, W.E.L., Kikinis R., Arrdrige S.R.: Adaptive Segmentation of MRI Data. IEEE Trans. Med. Imaging. 15 (1996) 429-442 2. Pham, D.L., Xu, C.Y., Prince, J.L.: Current Methods in Medical Image Segmentation. Ann. Rev. Biomed. Eng. 2 (2000) 315-337 3. Liew, A.W.C., Yan, H.: An Adaptive Spatial Fuzzy Clustering Algorithm for 3-D MR Image Segmentation. IEEE Trans. Med. Imaging. 22 (2003) 1063-1074 4. Reddick, W.E., Glass, J.O., Cook, E.N., Elkin, T.D., Deaton, R.: Automated Segmentation and Classification of Multispectral Magnetic Resonance Images of Brain using Artificial Neural Networks. IEEE Trans. Med. Imaging. 16 (1997) 911-918 5. Ozkan, M., Dawant, B.M., Maciunas, R.J.: Neural-Network-based Segmentation of Multi-Modal Medical Images: A Comparative and Prospective Study. IEEE Trans. Med. Imaging. 12 (1993) 534-544 6. Chuang, K.H., Chiu, M.J., Lin, C.C., Chen, J.H.: Model-Free Functional MRI Analysis using Kohonen Clustering Neural Network and Fuzzy C-Means. IEEE Trans. Med. Imaging. 18 (1999) 1117-1128 7. Marsland, S., Shapiro, J., Nehmzow, U.: A Self-Organizing Network that Grows when Required. Neural Networks 15 (2002) 1041-1058 8. Si, J., Lin, S., Vuong, M.A.: Dynamic Topology Representing Networks. Neural Networks 13 (2000) 617-627 9. Kohonen, T.: Self-Organizing Maps. New York: Springer-Verlag (1995) 10. Pizurica, A., Philips, W., Lemahieu, I., Acheroy, M.: A Versatile Wavelet Domain Noise Filtration Technique for Medical Imaging. IEEE Trans. Med. Imaging. 22 (2003) 323-331 11. Zijdenbos, A., Dawant, B.: Brain Segmentation and White Matter Lesion Detection in MR Images. Crit. Rev. Biomed. Eng. 22 (1994) 401-465 12. Kong, J., Zhang, J.D., Lu, Y.H., Wang, J.Z., Zhou, Y.J.: A Novel Approach for Adaptive Unsupervised Segmentation of MRI Brain Images. MICAI 2005, LNCS 3789 (2005) 918-927
A Study on How to Classify the Security Rating of Medical Information Neural Network*
Jaegu Song and Seoksoo Kim
Hannam University, Department of Multimedia Engineering, Postfach 306 791 133 Ojeong-Dong, Daedeok-Gu, Daejeon, Korea
{Song} [email protected], {Kim} [email protected]
Abstract. To provide intelligent medical services, it is necessary to understand the situation information generated in a hospital. There should be infrastructure technologies that can classify and control this information with clear standards for processing situation data, rather than merely collecting conceptual information. This paper, as a study to capture the information generated from medical situations more clearly, analyzes the properties of the data using a neural network and applies security ratings to the information, so that a system is established that provides the analyzed medical information to users appropriate to the designated rating. This will be an effective measure to enhance the effectiveness of the medical devices and backup data already introduced, and to understand the various medical data that will be generated from medical devices to be introduced.
1 Introduction
The development of mobile communication and medical technologies is providing many technologies to address the lack of clinical facilities in an aging society. Combined with ubiquitous systems, intelligent sensors, remote clinics, and similar technologies, these are evolving into more developed systems. The aim of current studies is to make it possible to check the conditions that need medical treatment in complicated situations. Such concepts are defined as pervasive computing, disappearing computing, and invisible computing. Pervasive computing technologies are being applied more and more in electronic products, bottles, chamber pots, mirrors, medicine containers, etc., equipped with microprocessors, through telecommunication and cooperation [1]. However, to provide these intelligent medical services, it is necessary to understand the situation information generated in a hospital. In addition, there should be infrastructure technologies that can classify and control this information with clear standards for processing situation data, rather than merely collecting conceptual information. It is also necessary, in preparation for the growing needs for medical information and its application, to identify the types of sensor and medical situations and to embody the classification ratings. This will be an effective measure to enhance the effectiveness of the medical devices and backup data already introduced, and to understand the various medical data that will be generated from medical devices to be introduced.
This work was supported by a grant from Security Engineering Research Center of Ministry of Commerce, Industry and Energy.
In this paper, as a study to capture the information generated from medical situations more clearly, the properties of the data are analyzed using a neural network and security ratings are applied to the information, so that a system is established that provides the analyzed medical information to users appropriate to the designated rating. Chapter 2, related research, explains situation information and how to classify it using a neural network. Chapter 3 proposes the method for classifying medical information security ratings using the neural network. Chapter 4 carries out a simple test using the classification system and analyzes its effectiveness. Finally, Chapter 5 presents the conclusion and future study subjects.
2 Related Research
2.1 Situation Information
Situation information is roughly defined in three ways:
1. Information indicating the individual situation of a user, place, or object; it is required in the interaction between the user and the applied service [2].
2. Data needed for location information, such as where an access request occurred and where the accessing object exists; for time information, such as when the access request occurred and at what intervals; and for certain actions [3].
3. Location information, such as site and domain, for access control [4].
Situation information is thus defined from different perspectives; this paper adopts the first definition. There are mainly two ways to classify situation information. One is RBAC (Role-Based Access Control), which groups users by name and assigns roles according to each individual's responsibility and authority, thereby controlling the use of resources; it develops security policies per organization and is effective for security control [5]. The other is access control using situation information. Its representative methods are GRBAC (Generalized Role-Based Access Control), which enables the time-based access control that role-based access control cannot provide, and xoRBAC, which addresses the limitations of role-based access control by checking previously defined conditions [6]. This paper classifies the situation information using role-based access control, which can easily be applied to the medical information from hospitals.
2.2 How to Apply a Neural Network
The most widely used neural network method is the multilayer perceptron (MLP), which, as a typical static neural network, is used for recognition by supervised learning, classification, function approximation, etc. The multilayer perceptron is a layer-structured neural network with one or more middle (hidden) layers between the input and output layers. As shown in Figure 1, a neural network generally consists of an input layer, a hidden layer, and an output layer. When input data x is presented, it is processed by the defined process in the hidden layer, and then the output Y is produced. In this way, each layer learns the process so that the output layer converges to the desired result value.
Fig. 1. General structure of a neural network (input, hidden, and output layers)
However, this neural network method only learns and outputs results corresponding to the given input/output patterns, so it has the problem of producing similar outputs for unknown input patterns. As most information generated in real life is time-dependent, past information needs to be updated to current information through recurrent connections in a dynamic representation. Considering this characteristic, a recurrent neural network should be used for autonomous dynamic representation. Figure 2 shows Jordan's SRN model, a recurrent neural network that can represent the temporal transitions of input patterns and their context-dependent relationships. To update medical treatment information in real time, this paper applies the recurrent neural network method [7].
Fig. 2. Jordan’s SRN
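A minimal sketch of this recurrent scheme is shown below; the tanh activations and weight shapes are illustrative assumptions, while the feedback of the previous output through context units follows Jordan's model as described above.

```python
import numpy as np

def jordan_srn(inputs, W_in, W_ctx, b_h, W_out, b_out):
    """Jordan SRN: the previous output is fed back as context input."""
    n_out = W_out.shape[0]
    context = np.zeros(n_out)            # context units start empty
    outputs = []
    for x in inputs:                     # one step per time point
        h = np.tanh(W_in @ x + W_ctx @ context + b_h)
        y = np.tanh(W_out @ h + b_out)
        outputs.append(y)
        context = y                      # feed the output back
    return np.array(outputs)
```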
3 Medical Information Security Rating Classification Method Using a Neural Network 3.1 Assumptions For the system design and composition, this paper assumes that the various medical information consists of the following simple items: - Sensor information (location information, body temperature, heart beat) - Medical information (medical record, medication details, written opinion, prescription details, personal information, medical charge)
- Situation information users (doctor, nurse, assistant nurse, patient, guardian) - Security ratings (divided into three levels to restrict access to information needed for medical acts and to information for patients and guardians: 1: basic rating, 2: medical common rating, 3: medical security rating). 3.2 Design of the Medical Information Security Rating Classification Method Using a Neural Network In this paper, medical situation information is provided to a final information identifier through the neural network structure shown in Figure 3. In the hidden layer, each piece of information is classified into sensor information and common medical information; a security rating is then applied to the classified information, which is stored to be identified by the final information identifier. To process information that changes dynamically over time, the outputs of the hidden layer and output layer are fed back to the input layer, forming a recurrent connection.
Fig. 3. Neural network structure for medical situation information classification
The neural network with recurrent connections shown in Figure 3 reflects the state of the past network when determining the state of the current network, so dynamic information provision is possible as medical information changes. In other words, as information acquired repeatedly from ongoing medical acts is input in real time, the state of the medical acts can be identified in real time and overlapping treatments can be prevented. The specific processing applied in the hidden layer of the neural network structure for medical situation information classification is shown in Figure 4. The process is as follows: (1) Medical data values are input. (2) The characteristics of the information are extracted. The information from hospitals is classified into sensor data and medical data according to the characteristics and origin of the information. Sensor data and medical data provide a basic category value for the place where each piece of information was generated, and patterns are extracted from the characteristics of the information by learning whether it can be distinguished consistently from already classified information using existing data classification methods. The extracted pattern value is provided as an additional standard, contributing to more exact data analysis later. (3) Based on the information extracted in step (2), a security rating is applied to each piece of information. Security ratings are applied according to an information guideline adjusted to the hospital's medical acts; the system learns the information classified along with its security rating, stores the security application patterns, and feeds the learned values back into the security guideline, contributing to more exact analysis later. (4) The characteristics of the medical information and the data with security ratings are stored, completing the preparation for information provision. (5) If the latest data in the information characteristics extraction, security rating application, or medical situation information database change over time, they are fed back to the initial stage in a recurrent manner to update past information.
Fig. 4. Medical information security rating classification process using a neural network
To train the neural network, medical information is defined by representative data properties or classification standards, and the output results are learned repeatedly. Medical data is used as the input of the neural network after many input samples have been extracted. All data is propagated to each neuron through the feedback connections, so the system is designed to enable dynamic analysis and classification of situation information.
4 Test and Analysis The medical situation information classification method using the neural network proposed in this paper was evaluated with a simple test using sensor information (location, body temperature, and heart beat) acquired through sensors, together with information entered as diagnoses. All information was applied as static text-based patterns, and we examined whether the result provides proper situation information to the final medical agent who receives the data, and the analysis information needed for the practical
medical procedure. Figure 5 shows the result of a test in which each medical data item was input at 1-second intervals for 100 seconds, repeated 4 times. As a test result, both the learning and the information with applied security ratings showed a higher error rate in the medical information characteristic extraction process, because classifying 400 situation information items into 8 categories is relatively more difficult than the security classification into 3 ratings. The test showed that values appropriate to the learned properties were obtained and that the error rate of medical information with applied security ratings decreases gradually. However, the identification rate on medical data such as complex medical information or unclassified terminology was very low. Additional definition of classification standards for such data can solve this problem at its root.
Fig. 5. Neural network learning error rate
5 Conclusion In order to specify standards for the classification of medical situation information, this paper applied rating standards based on the properties of medical information and on security, using a neural network. As a method to distinguish a special situation, medical information, Role-Based Access Control was used for situation information classification. To divide and provide real-time medical information, a recurrent neural network was used for information classification. This can satisfy the dynamic demands of current medical acts by considering time information when practical information is applied. As a result of a simple test with limited information, this study developed an applicable system by reducing the error rate through automatic classification of medical acts and repeated learning of the process. As a method to dynamically classify situation information generated in the future ubiquitous environment, this study can be used for the integration of similar situation information and the improvement of system performance. In the future, there should be a study on a system that performs autonomous learning considering the frequency of persistently occurring
information, as a measure for data not present in the training data, recognizing each item of hospital situation information as an object. Such a study may address situation information more flexibly.
References
1. Weiser, M.: The Computer for the Twenty-First Century. Scientific American (1991) 94-101
2. Dey, A.K., Abowd, G.D.: Understanding and Using Context. Personal and Ubiquitous Computing Journal 5(1) (2001) 4-7
3. Georgiadis, C.K., Mavridis, I., Pangalos, G., Thomas, R.K.: Flexible Team-based Access Control Using Contexts. In: ACM Symposium on Access Control Models and Technologies (SACMAT 2001) (2001) 21-30
4. Wilikens, M., Feriti, S., Sanna, A., Masera, M.: A Context-related Authorization and Access Control Method Based on RBAC: A Case Study from the Health Care Domain. In: 7th ACM Symposium on Access Control Models and Technologies (SACMAT 2002) (2002) 117-124
5. Sandhu, R.S., Coyne, E.J., Feinstein, H.L., Youman, C.E.: Role-Based Access Control Models. IEEE Computer 29(2) (1996) 38-47
6. Neumann, G., Strembeck, M.: An Approach to Engineer and Enforce Context Constraints in an RBAC Environment. In: 8th ACM Symposium on Access Control Models and Technologies (SACMAT 2003), Como, Italy (2003) 65-79
7. Jordan, M.: Serial Order: A Parallel Distributed Processing Approach (ICS Tech. Rep. No. 8604). University of California, San Diego, Department of Cognitive Science, La Jolla, CA (1986)
Detecting Biomarkers for Major Adverse Cardiac Events Using SVM with PLS Feature Selection and Extraction
Zheng Yin1, Xiaobo Zhou2,3, Honghui Wang4, Youxian Sun1, and Stephen T. C. Wong2,3
1 National Laboratory of Industrial Control Technology, Institute of Industrial Process Control, Zhejiang University, Hangzhou 310027, P.R. China
[email protected], [email protected]
2 Functional and Molecular Imaging Center, Brigham and Women's Hospital, Boston, MA 02121, USA
3 HCNR Center for Bioinformatics, Harvard Medical School, Boston, MA 02115, USA
{xiaobo_zhou,stephen_wong}@hms.harvard.edu
4 Radiology and Imaging Sciences, Clinical Center, National Institutes of Health, Bethesda, MD 20892, USA
Abstract. Detection of biomarkers capable of predicting a patient's risk of major adverse cardiac events (MACE) is of clinical significance. Due to the high dynamic range of protein concentrations in human blood, applying proteomics techniques for protein profiling can generate large arrays of data for the development of optimized clinical biomarker panels. The objective of this study is to discover an optimized subset of fewer than ten biomarkers for predicting the risk of MACE. In this paper, we connect a linear SVM with PLS feature selection and extraction. A simplified PLS algorithm selects a subset of biomarkers and extracts latent variables, and the prediction performance of the linear SVM is dramatically improved. The proposed method is compared with a widely used PLS-Logistic Discriminant solution and several other reported methods on the MACE prediction experiments.
1 Introduction The search for biomarker panels for MACE prediction was motivated by the original work in [1], which reported that assays of MPO (myeloperoxidase) levels in blood samples from 604 patients achieve accuracy over 60% in predicting the risk of MACE in the ensuing 30- and 180-day periods after presentation in the emergency room with chest pain [1]. MPO has been accepted as a biomarker of MACE, with a measurement kit approved by the FDA (http://www.fda.gov/cdrh/reviews/K050029.pdf); meanwhile, the detection of better biomarker groups to complement MPO is being pursued with the help of Mass Spectrometry (MS). In our study, the same plasma samples as in [1] are used to search for a biomarker set containing fewer than 6 proteins that provides better prediction accuracy.
The samples belong to 3 categories, and we focus on separating the MACE group from the Control group. The MACE group contains samples from patients with chest pain and consistently negative Troponin T who had MACE in the following 30 or 180 days, while the Control group contains patients with chest pain and consistently negative Troponin T who lived for the next 5 years without any major cardiac event or death. As described in [2], the plasma samples were fractionated into 6 fractions. SELDI MS profiles of fractions 1, 3, 4, 5 and 6 were acquired with Ciphergen's PBSIIc using IMAC and CM10 ProteinChips. Pre-processing and peak detection were done following [3], with the S/N ratio threshold set at 1.5. The feature selection and classification experiments are done on the peak map of the protein profile obtained from fraction 4 using IMAC protein chips (denoted IMAC-F4). The number of peaks (features) in the peak map is 200. The algorithms developed will be applied to other peak maps (made using different S/N ratio thresholds) and profiles (obtained from different fractions or using different protein chips). Biomarker selection is intrinsically linked to feature selection techniques in machine learning, and various feature selection and classification methods are applied in this project. The dataset involved is highly overlapped, i.e., samples from different categories have similar feature values and are hard to classify. Moreover, the sample number is lower than the feature number in this dataset. Given the highly overlapped nature of the dataset, kernel methods can be used to detect the intrinsic relations between features [4, 5], an improved genetic algorithm has also been proposed [2], and filter approaches can supply a guideline for dimension reduction with relatively low computational cost [6]. However, the linear SVM classifier performs poorly when connected to traditional feature selection methods on this dataset. A baseline solution featuring PLS-Logistic Discriminant Analysis [7] outperforms the linear SVM greatly, both in accuracy and in computational cost. To build an effective biomarker detection method, the SVM must therefore be connected to a method that can determine a feature ranking and form a less overlapped feature space in a relatively short time. Our solution is obtained by connecting a simplified Partial Least-Squares (PLS) method [8], designed for binary classification, to the linear SVM. PLS has proven to be useful in situations where the feature number is much greater than the sample number [5], and the reduced feature space obtained from PLS can be connected to classifiers such as LDA [9], logistic analysis, QDA [7] and SVM [4, 5]. Moreover, the component axes supplied by PLS have been proven to be a valid feature ranking criterion [9], which means that PLS intrinsically combines feature selection and feature extraction. This method selects the most informative features and forms a less overlapped feature space by forming latent variables according to the correlation between feature values and class labels. The classification performance of both the linear SVM and the baseline method is dramatically improved when applied in the feature space selected by PLS, and the linear SVM outperforms the baseline method in both sensitivity and specificity with less computational cost; moreover, it shows potential to help determine the component number used in the PLS realization.
The rest of this paper is organized as follows: the methods used for feature selection, extraction, and classification are described in Section 2, Section 3 is devoted to the experimental results, and a conclusion is drawn at the end.
2 Methods

Let $Y = [y_1, \ldots, y_n]^T$ denote the class labels of the $n$ training samples, where $y_i \in \{-1, 1\}$, $i = 1, \ldots, n$; $-1$ denotes the Control group and $+1$ denotes the MACE group. The protein peak values are summarized in an $n \times L$ matrix $X$, where each row of $X$ represents one of the $n$ samples; for the $i$th row, $x_i = [x_{i1}, x_{i2}, \ldots, x_{iL}]$, $i = 1, \ldots, n$, where $x_{ij}$ is the value of the $j$th peak for the $i$th sample. Our method consists of the steps of feature selection, PLS feature extraction, and SVM classification. Through cross-validation on randomly partitioned training sets, a feature ranking is obtained using the coefficients of the first PLS axis; the simplified PLS uses the top features of this ranking to form a latent feature space, and a linear SVM is applied in this space to supply prognostic values.

2.1 Feature Selection Using the First PLS Axis
Biologists often want statisticians to supply an interpretable model that simply answers which proteins can be used for diagnosis. Feature extraction methods, on the other hand, take advantage of all the available features to give a reduced feature space; the new components often give information or hints about the interactions and correlations among features rather than ranking single genes as in [10, 11]. Feature extraction is often questioned for losing information about single features and supplying models with poor interpretability. However, when trying to find optimal feature subsets [12], feature selection methods may suffer from overfitting and may also be difficult to interpret and implement, because they are based on computationally intensive iterative techniques like genetic algorithms. There is literature establishing the relationship between PLS, a feature extraction method, and feature selection. In our work, the features are ranked according to the square of their weights in the first PLS axis, denoted $w_{1j}^2$. The equivalence between $w_{1j}^2$, the widely used feature ranking criterion $BSS_j / WSS_j$, and F-test scores has been proven in [9]. Thus, feature selection and extraction can be handled together using PLS.

2.2 Simplified PLS for Feature Selection and Extraction
The partial least squares (PLS) method [13, 14] has been a popular modeling, regression, discrimination, and classification technique in its domain of origin, chemometrics. It creates score vectors (components, latent vectors) using the existing correlations between different sets of variables while keeping most of the variance.
Typically, PLS regression deals with multivariate matrices X and F, recursively extracting components from both matrices according to the correlations between the components of X and F. In the binary classification scenario defined earlier, Y is a single-variable vector consisting of binary class labels, so we can use Y throughout the PLS calculation rather than extract components from it in each iteration. We standardize both X and Y to zero mean and unit variance for PLS processing. The procedure for extracting altogether m PLS components from X starts by seeking a first PLS axis direction $w_1$ and finally outputs an $n \times m$ score matrix T.
$w_1$ is typically calculated as the eigenvector corresponding to the largest eigenvalue (denoted $\theta_1^2$) of a matrix reflecting the correlation of X and Y: $X^T Y Y^T X$. According to [8], with a single vector Y consisting of binary labels, we have

$$\theta_1^2 = \| X_0^T Y_0 \|^2, \qquad w_1 = \frac{1}{\theta_1} X_0^T Y_0 = \frac{X^T Y}{\| X^T Y \|}.$$

The square of each element of $w_1$ can be utilized as the feature selection criterion. Then, for each $h = 1, \ldots, m$, with $X_0 = X$ and $Y_0 = Y_1 = \cdots = Y_h$ (the label vector is not deflated):

$$w_h = \frac{X_{h-1}^T Y}{\| X_{h-1}^T Y \|}, \qquad t_h = X_{h-1} w_h, \qquad p_h = \frac{X_{h-1}^T t_h}{\| t_h \|^2}, \qquad X_h = X_{h-1} - t_h p_h^T,$$

where $t_h$ is the $h$th component (axis direction) extracted, i.e., the $h$th column of the score matrix T, $p_h$ is the loading of X on $t_h$, and $X_h$ is the residual matrix. Each component $t_h$ is a linear combination of the vectors in X:

$$t_h = X_{h-1} w_h = X \prod_{j=1}^{h-1} (I - w_j p_j^T)\, w_h,$$

thus, with

$$w_h^* = \prod_{j=1}^{h-1} (I - w_j p_j^T)\, w_h,$$

we have $t_h = X w_h^*$.
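A minimal sketch of this simplified PLS, with variable names mirroring the formulas above, is given below; it is an illustration under the stated assumptions (standardized X and Y), not the authors' implementation.

```python
import numpy as np

def simplified_pls(X, y, m):
    """Simplified PLS for binary classification.

    X: (n, L) standardized data matrix, y: (n,) standardized label vector.
    Returns the (n, m) score matrix T and the first axis w1 (for ranking)."""
    Xh = X.copy()
    T, W = [], []
    for h in range(m):
        w = Xh.T @ y
        w /= np.linalg.norm(w)            # w_h = X_{h-1}^T Y / ||X_{h-1}^T Y||
        t = Xh @ w                        # t_h = X_{h-1} w_h
        p = Xh.T @ t / (t @ t)            # p_h = X_{h-1}^T t_h / ||t_h||^2
        Xh = Xh - np.outer(t, p)          # deflate X, keep Y unchanged
        T.append(t)
        W.append(w)
    return np.column_stack(T), W[0]

# Feature ranking by the squared first-axis weights w_{1j}^2, e.g.:
# T, w1 = simplified_pls(X_std, y_std, m=2)
# ranking = np.argsort(w1 ** 2)[::-1]
```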
Typical PLS regression applies a linear regression of Y on $t_h$ to form a predicted $\tilde{Y}$. The loadings of Y on $t_h$ may also be determined using discriminant analysis approaches. In this paper, the score matrix T serves as the input of the logistic discriminant or the SVM to generate the estimated class label. An advantage of PLS feature extraction is the possibility to visualize the data by graphical representation; later, the structure of the IMAC-F4 dataset will be shown by plotting the first two PLS components using different colors for each class. There is no widely accepted procedure to determine the component number m. Methods based on cross-validation on the training set are proposed in [8], and boosting is applied to improve classification accuracy while seeking m in [9]. With its improved classification performance, the linear SVM can be involved in the selection of m: a leave-one-out (LOO) test is applied to each training set with different values of m, and the results given by the SVM are used to determine the m used for classifying the testing set.

2.3 Linear SVM for Classification
Over the past several years, the SVM model has been intensively investigated and applied as a highly reliable and flexible classifier in various scenarios. With the previously defined X and Y, the optimal hyperplane $f(x) = \sum_{i=1}^{n} y_i \alpha_i K(x_i, x) + b$ is used to classify each testing sample x by the sign of f(x); here K is the kernel function mapping x into a higher-dimensional space, and $\alpha_i$ and $b$ are solved for by the SVM algorithm [15]. Using the common inner product $\langle x_i, x \rangle$ as the kernel function, a linear SVM is defined. The linear SVM is designed to find the hyperplane whose minimal distance to the training samples is maximized [15]. The position of the SVM decision boundary is determined by the support vectors, i.e., the samples on the edge of each category with Lagrange multiplier $\alpha_i > 0$. If samples from different categories closely resemble each other, the number of support vectors will be large and the SVM will perform badly. The penalty coefficient C is introduced to handle overlap in the training set by imposing the limitation $0 < \alpha_i \leq C$; a lower value of C tolerates greater overlap between the classes, while a larger C fits the training set more tightly and raises the risk of overfitting (high classification accuracy on the training set but low accuracy on the testing set). Training SVMs involves solving quadratic programming problems, so using SVMs, especially those with non-linear kernels, in wrapper methods brings extra computational cost [16]; a linear SVM, though, has fewer coefficients to adjust, less risk of overfitting, and runs faster than its non-linear counterparts.
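A minimal sketch of the resulting PLS-linear SVM pipeline is shown below; it assumes scikit-learn's SVC for the linear SVM and reuses the simplified_pls helper sketched above, and it is only an illustration of the described method, not the authors' code.

```python
import numpy as np
from sklearn.svm import SVC

def pls_svm_predict(X_train, y_train, X_test, feat_idx, m=2, C=500):
    """Project the selected features onto m PLS components, then classify
    the test samples with a linear SVM (labels in {-1, +1})."""
    Xtr, Xte = X_train[:, feat_idx], X_test[:, feat_idx]
    mu, sd = Xtr.mean(axis=0), Xtr.std(axis=0)
    Xtr, Xte = (Xtr - mu) / sd, (Xte - mu) / sd          # standardize
    y_std = (y_train - y_train.mean()) / y_train.std()
    T_train, _ = simplified_pls(Xtr, y_std, m)
    # Recover W* from T_train = Xtr @ W* by least squares, then project the
    # test samples onto the same latent space (valid since t_h = X w_h^*).
    W_star, *_ = np.linalg.lstsq(Xtr, T_train, rcond=None)
    T_test = Xte @ W_star
    clf = SVC(kernel="linear", C=C).fit(T_train, y_train)
    return clf.predict(T_test)
```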
3 Experimental Results 3.1 Challenges from the Dataset The nature of the whole dataset is shown in Fig. 1, using the first two latent variables extracted from all 200 features. The two classes are highly overlapped.
Fig. 1. Visualization of IMAC-F4 samples using the first two components extracted from all 200 features
The performance of a classification method based on PLS feature extraction and logistic discrimination proposed in [7] serves as a baseline (denoted "Baseline" in the tables below). This method is accepted as an effective linear solution for tumor diagnosis with DNA microarray data. Table 1 shows the performance of the classifiers in a leave-one-out (LOO) test across the dataset without feature selection. In each LOO iteration, one sample is left out to test the classifier trained on the other 119 samples. Without feature selection and extraction, the SVM struggles on the highly overlapped dataset. More than 90% of the training samples are chosen as support vectors, and complicated but crisp classifiers are obtained. During the leave-one-out test over the whole dataset, very few samples are classified into the Control group, which causes a catastrophic specificity of less than 5%. In fact, the average number of support vectors during the LOO is more than 100, and the linear SVM simply makes the same decision ("MACE") in most iterations.

Table 1. The performance of two classifiers without feature selection
Classifier                   Features  Components  Accuracy  Sensitivity  Specificity
Baseline                     200       1           0.5167    0.5833       0.4500
Baseline                     200       2           0.5833    0.5500       0.6167
Baseline                     200       3           0.6083    0.6000       0.6167
Baseline                     200       4           0.5750    0.5500       0.6000
Baseline                     200       5           0.5833    0.6167       0.5500
Baseline                     200       6           0.6000    0.6167       0.5833
Linear SVM, C=500, no PLS    200       -           0.4750    0.9167*      0.0333*
*Only 7 samples are classified into the Control group during the leave-one-out test.
Feature extraction from all 200 features by PLS ensures an accuracy of around 50%-60% for the baseline method. However, the performance of the baseline method varies greatly as the component number varies from 1 to 6; according to [9], performance decreases as unnecessary components are added. On the other hand, it has been reported that common classifiers achieve accuracy around 62% using the top 5 features given by the t-test, and the accuracy of a wrapper method based on a standard GA is around 69% with its top 5 features [2].

3.2 Feature Selection

It can be seen from Table 1 that feature extraction based on all the features gives an even worse accuracy than the simple t-test feature ranking criterion [2]; worse still, a model involving 200 features is too complex to be accepted for the development of an immunoassay. Feature selection is necessary for better classification performance. In our study, we use the average rank of features given by $w_{1j}^2$ across 500 training sets randomly drawn from the dataset, each containing 70% of the observations; $w_{1j}$ is calculated for each feature j. The whole selection procedure takes less than 2 minutes, which is faster than most wrapper methods based on genetic algorithms. The top 4 features of this ranking are finally selected for PLS processing and classification, as listed in Table 2.

Table 2. The top 4 features selected using the average $w_{1j}^2$ criterion
No.   m/z (Da)   Avg. score $w_{1j}$
19    2639.04    0.1563
47    17543      0.1451
106   5230.4     0.1361
55    6516.3     0.1340
Throughout the feature selection procedure, these features always occupy the top 4 positions. The first column of Table 2 is the column number of the selected feature in the dataset, while the second column shows the m/z (mass over electric charge) value, which labels the nature of the related protein.

3.3 LOO Results on the Extracted Feature Space

The simplified PLS is applied to the shrunk dataset formed by the 4 selected features, and the first 1-4 latent variables are extracted to form new latent spaces. Different classifiers are applied to those feature spaces, and their performance in the LOO test across all 120 samples is recorded in Table 3. It seems harder for the linear SVM to handle the dataset with only 4 features left. However, when the PLS scores serve as the input, a less overlapped feature space is obtained and the performance of the linear SVM
becomes comparable with the baseline and gives better maxima of accuracy (68.33%), sensitivity (66.7%) and specificity (70%). It can be seen from the table that the performance of the linear SVM is more sensitive to the variation of the component number. As the sample number is not large, the variation of C has little influence on the SVM performance.

Table 3. The performance of classifiers in the 4-feature space
Classifier                  Features  Components  Accuracy  Sensitivity  Specificity
Baseline                    4         1           0.6667    0.6333       0.7000
Baseline                    4         2           0.6583    0.6333       0.6833
Baseline                    4         3           0.6583    0.6333       0.6833
Baseline                    4         4           0.6583    0.6333       0.6833
Linear SVM, C=500, no PLS   4         -           0.4917    0.9833*      0
Linear SVM, C=100           4         1           0.6583    0.6500       0.6667
Linear SVM, C=100           4         2           0.6583    0.6667       0.6500
Linear SVM, C=100           4         3           0.6417    0.6000       0.6833
Linear SVM, C=100           4         4           0.6083    0.5500       0.6667
Linear SVM, C=500           4         1           0.6667    0.6667       0.6667
Linear SVM, C=500           4         2           0.6833    0.6667       0.7000
Linear SVM, C=500           4         3           0.6250    0.5667       0.6833
Linear SVM, C=500           4         4           0.6083    0.5500       0.6667
Linear SVM, C=1000          4         1           0.6583    0.6500       0.6667
Linear SVM, C=1000          4         2           0.6833    0.6667       0.7000
Linear SVM, C=1000          4         3           0.6250    0.5667       0.5333
Linear SVM, C=1000          4         4           0.5917    0.5333       0.6500

*Only 1 sample is classified into the Control group, and even that one should be a MACE sample.
3.4 70%-Cross-Validation Results The performance of the linear SVM in the 4-feature space is further validated using 70% cross-validation on the IMAC-F4 dataset. For both the 200-feature dataset and the 4-feature dataset, 500 partitions into a training set containing 70% of the 120 observations and a test set with the remaining 30% are obtained. For each partition, the baseline method is applied with the extracted component number ranging from 1 to 6 for the 200-feature space and from 1 to 4 for the 4-feature space, and the linear SVM is applied without PLS feature extraction in the 200-feature space. The linear SVM can also be used to further refine the PLS solution, e.g., the determination of the component number. A common solution is to apply cross-validation on the training sets and compare the mean error rate (MER): when the MER begins to increase as new components are added, feature extraction should be stopped [9]. The linear SVM can be adopted to supply the mean error rate: for each training set, LOO tests are applied with m varying from 1 to 4, and the MERs are given by the linear SVM. In over 90% of the iterations, the MER reaches its minimum at m=2.
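A minimal sketch of this component-number selection is given below; it reuses the pls_svm_predict helper sketched above, the candidate range 1-4 follows the text, and everything else is an illustrative assumption.

```python
import numpy as np

def choose_m_by_loo(X_train, y_train, feat_idx, candidates=(1, 2, 3, 4), C=500):
    """Pick the PLS component number m that minimizes the leave-one-out
    mean error rate (MER) of the linear SVM on the training set."""
    n = len(y_train)
    mer = {}
    for m in candidates:
        errors = 0
        for i in range(n):                      # leave one sample out
            keep = np.arange(n) != i
            pred = pls_svm_predict(X_train[keep], y_train[keep],
                                   X_train[i:i + 1], feat_idx, m=m, C=C)
            errors += int(pred[0] != y_train[i])
        mer[m] = errors / n
    return min(mer, key=mer.get), mer
```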
The performance of the classifiers in the 70% cross-validation is summarized in Table 4. Using the top 4 features selected by the first PLS axis weights, the baseline method achieves its best accuracy of 71.06% with m=1, while the linear SVM has an average accuracy of 68.86% with m=2, also comparable with the performance of classifiers using the top 5 features selected by a standard GA [2].

Table 4. The performance of classifiers in 70% cross-validation (C=500 for SVMs)
Classifier    Features  Components   Testing accuracy (avg. over 500 partitions)
Baseline      200       1            0.5278
Baseline      200       2            0.6111
Baseline      200       3            0.6111
Baseline      200       4            0.4722
Baseline      200       5            0.4722
Baseline      200       6            0.3278
Baseline      4         1            0.7106
Baseline      4         2            0.6886
Baseline      4         3            0.6667
Baseline      4         4            0.6667
Linear SVM    200       - (no PLS)   0.4722
Linear SVM    4         2            0.6886
4 Conclusion In this paper, the PLS-linear SVM method is applied to the MACE biomarker detection dataset. A simplified PLS designed for binary classification helps the linear SVM out of the curse of the overlapped dataset and gives improved prognostic values; the PLS coefficients serve as a feature ranking criterion and handle feature selection and extraction together in a very short time. The linear SVM is also involved in the selection of the component number used in PLS to further improve the performance. This solution is an efficient alternative to MACE biomarker detection solutions based on evolutionary methods like genetic algorithms, and it also outperforms filter methods based on statistical scores in classification accuracy.
Acknowledgements This research is supported by the Chinese NSF No.60574019 and No.60474045, the Key Technology R&D Program of Zhejiang Province No.2005C21087, the Academician Foundation of Zhejiang Province No.2005A1001-13, and also funded by the HCNR Center for Bioinformatics Research Grant, HMS (STCW).
References
1. Brennan, M.L., Penn, M.S., Lente, F.V., Nambi, V., Shishehbor, M.H., Aviles, R.J., Goormastic, M., Pepoy, M.L., McErlean, E.S., Topol, E.J., Nissen, S.E., Hazen, S.L.: Prognostic Value of Myeloperoxidase in Patients with Chest Pain. The New England J. Med. 349 (2003) 1595-1604
2. Zhou, X., Wang, H., Wang, J., Hoehn, G., Azok, J., Brennan, M.L., Hazen, S.L., Li, K., Wong, S.T.C.: Biomarker Discovery for Risk Stratification of Cardiovascular Events Using an Improved Genetic Algorithm. Proc. 2006 IEEE/NLM Int. Symposium on Life Science and Multimodality, Washington, D.C.
3. Morris, J.S., Coombes, K.R., Koomen, J., Baggerly, K.A., Kobayashi, R.: Feature Extraction and Quantification for Mass Spectrometry in Biomedical Applications Using the Mean Spectrum. Bioinformatics 21 (2005) 1764-1775
4. Tenenhaus, A., Giron, A., Saporta, G., Fertil, B.: Kernel Logistic PLS: A New Tool for Complex Classification. Proc. 2005 ASMDA Applied Stochastic Models and Data Analysis, Brest, France (http://asmda2005.enst-bretagne.fr/IMG/pdf/proceedings/441.pdf)
5. Rosipal, R., Trejo, L.J., Matthews, B.: Kernel PLS-SVC for Linear and Nonlinear Classification. Proc. 2003 ICML, the Twentieth Int. Conf. on Machine Learning, Washington, D.C. (http://www.ofai.at/~roman.rosipal/Papers/icml03.pdf)
6. Peng, H., Long, F., Ding, C.: Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy. IEEE Trans. on Pattern Analysis and Machine Intelligence 27 (2005) 1226-1238
7. Nguyen, D., Rocke, D.: Tumor Classification by Partial Least Squares Using Microarray Gene Expression Data. Bioinformatics 18 (2002) 39-50
8. Wang, H.: Partial Least-Squares Regression - Method and Applications (in Chinese). National Defense Industry Press, Beijing (1999)
9. Boulesteix, A.-L.: PLS Dimension Reduction for Classification with Microarray Data. Statistical Applications in Genetics and Molecular Biology 3 (2004) Article 33 (Epub 2004 Nov 23)
10. Dudoit, S., Shaffer, J.P., Boldrick, J.C.: Multiple Hypothesis Testing in Microarray Experiments. Statistical Science 18 (2003) 71-103
11. Ben-Dor, A., Bruhn, L., Friedman, N., Nachman, I., Schummer, M., Yakhini, Z.: Tissue Classification with Gene Expression Profiles. J. of Computational Biology 7 (2000) 559-584
12. Bo, T.H., Jonassen, I.: New Feature Subset Selection Procedures for Classification of Expression Profiles. Genome Biology 3 (2002) R17
13. Wold, S.: Soft Modeling by Latent Variables: the Nonlinear Iterative Partial Least Squares Approach. In: Gani, J. (ed.), Perspectives in Probability and Statistics, Papers in Honour of M.S. Bartlett, 520-540. Academic Press, London (1975)
14. Wold, S., Ruhe, A., Wold, H., Dunn III, W.J.: The Collinearity Problem in Linear Regression: The Partial Least Squares (PLS) Approach to Generalized Inverses. SIAM Journal on Scientific and Statistical Computing 5 (1984) 735-743
15. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer, New York (1998)
16. Mao, Y., Zhou, X., Pi, D., Wong, S.T.C., Sun, Y.: Parameters Selection in Gene Selection Using Gaussian Kernel Support Vector Machines by Genetic Algorithm. J. of Zhejiang University SCIENCE 6B(10) (2005) 961-973
Hybrid Systems and Artificial Immune Systems: Performances and Applications to Biomedical Research Vitoantonio Bevilacqua, Cosimo G. de Musso, Filippo Menolascina, Giuseppe Mastronardi, and Antonio Pedone Department of Electronics and Electrical Engineering, Polytechnic of Bari, Via E. Orabona, 4 70125 – Bari, Italy
[email protected]
Abstract. In this paper we propose a comparative study of Artificial Neural Networks (ANN) and Artificial Immune Systems. Artificial Immune Systems (AIS) represent a novel paradigm in the field of computational intelligence based on the mechanisms that allow vertebrate immune systems to face attacks from foreign agents (called antigens). Several similarities as well as differences have been shown by Dasgupta in [1]. Here we present a comparative study of these two approaches considering evolutions of the concepts of ANN and AIS, respectively hybrid neural systems, Artificial Immune Recognition Systems (AIRS) and aiNet. We tried to establish a comparison among these three methods using a well known dataset, namely the Wisconsin Breast Cancer Database. We observed interesting trends in systems’ performances and capabilities. Peculiarities of these systems have been analyzed, possible strength points and ideal contexts of application suggested. These and other considerations will be addressed in the rest of this manuscript.
1 Introduction The nervous system and the immune system are probably the most complex systems in vertebrates. Both have been shown to be necessary components for adaptability and thus survival in the environment. Learning, memory, and associative retrieval are the keywords for these systems, and it is on these aspects that researchers have focused their interest in order to replicate such behaviors in artificial systems. Artificial Neural Networks grew in this context and are nowadays one of the most useful and powerful tools for data classification, clustering, and prediction. Starting from the '40s, several other bio-inspired models have been successfully proposed, including Genetic Algorithms (GA) [2] and Swarm Intelligence [3]. Artificial Immune Systems (AIS) [4] followed this scientific trend. Proposed for the first time by Farmer et al. in 1986 [5], the AIS field of research underwent a noticeable boost in the mid '90s with the research carried out by Dasgupta [1][6] and then with the pioneering work of de Castro, Timmis [7] and Hunt [8]. AIS are experiencing a remarkable spread in the scientific community because of their flexibility and potential. Someone could
argue that there is excessive emphasis on these systems, but it is not by chance. AIS have been shown to be a good choice in several fields of application and one of the most versatile tools researchers can use. On the other hand, in their most common implementations, they are characterized by low computational costs, a critical aspect in delicate contexts such as real-time computing. Just like Artificial Neural Networks, Artificial Immune Systems have seen different implementations and optimizations through the years, in supervised and unsupervised flavors, the former coming first, with resource competition and negative selection among the later developments. Hybrid Neural Systems are probably among the most notable evolutions of neural networks, being combinations of neural networks with other intelligent systems that are able to conjugate the strengths of all of the constituent systems. A neural-genetic approach to the problem of breast cancer classification has been described in [9]. However, different approaches have been shown to perform quite well on the same problem and, in particular, in this paper we focus on the comparison of novel immune system models and advanced hybrid neural systems. The common platform selected for this comparative study is the Wisconsin Breast Cancer Database (WBCD). The WBCD is fundamentally based on the flattening principle. Composed of 699 cases, each defined by 11 fields, this dataset collects breast cancer cases observed by W.H. Wolberg in the late '80s [5]. Sixteen cases lack one parameter. Database entries have the following structure: (ID, Parameters, Diagnosis), where ID is the primary key, the parameter fields contain the numerical values associated with 10 indicators, and the last field contains the medical diagnosis associated with the case, a binary value representing malignant/benign tumor. The first ten indicators are extracted by analyzing images obtained through Fine Needle Aspiration (FNA), a fast and easily repeatable breast biopsy exhaustively described in [10]. The systems selected for this comparison are a hybrid-neural approach and, for AIS, Artificial Immune Recognition Systems (AIRS, [11]) and aiNet [12]. All of these systems are described in the following sections of this paper. The comparison among the presented solutions has been assessed using the global accuracy metric. Interesting trends have been observed and reported; they mainly concern the specific capability of Artificial Immune Systems to perform better under certain conditions and the algorithms' computational costs. These and other peculiarities of the systems under investigation are addressed in the next paragraphs, which are organized as follows: first, an overview of the benchmark dataset is given with a preliminary statistical analysis, then the neural-genetic approach is explored. Descriptions of the AIRS and aiNet implementations follow. "Comparative study" collects the results of the three systems, and "Conclusions and Further Works" ends the manuscript with considerations and interpretations of the results, giving further cues for research in this field.
2 WBCD Dataset and Preliminary Data Analysis The WBCD dataset, as described above, is composed of 699 cases, each defined by 11 parameters. The classification process can be treated as a function analysis problem:
$x_{11} = f(x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_{10})$

with: x1 = radius, x2 = area, x3 = perimeter, x4 = texture, x5 = smoothness, x6 = compactness, x7 = concavity, x8 = concave points, x9 = symmetry, x10 = fractal dimension. Attention is then focused on the analysis of the multidimensional space defined by the n-tuple associated with each case. Several statistical analyses were carried out in order to improve knowledge of this space and create the set of information to be used in the following development steps. PCA, PFA, and ICA were carried out as preliminaries in order to obtain an adequate model; these results have been described previously [9].
3 Hybrid Neural-Genetic Systems IDEST (Intelligent Diagnosis Expert System) is a genetically optimized neural system for integrated diagnosis of breast cancer. This system is mainly based on a feed-forward Artificial Neural Network whose topology has been optimized using the approach illustrated in [13]. An ANN set up according to the results obtained in previous steps was trained using the back-propagation algorithm, in the variant that updates weights and biases according to gradient descent with momentum. The starting learning rate chosen was 0.3. This choice avoided the occurrence of step-back phenomena in the learning phase and gave the network sufficient energy to escape from local minima: all this resulted in the ANN's good aptitude for convergence. The stop criterion was set to an SSE of 7E-3 or to a limit of 50,000 training epochs. The system was able to complete the training phase in 13,000 epochs, thereby reaching the SSE target. The relatively small number of epochs needed to accomplish the training step confirms the correctness of the results obtained via linear and non-linear analysis and, in particular, the accuracy of the GA search. System validation was carried out by submitting to the network the 228 cases of the validation set and counting the misclassified ones. The results returned by this analysis proved to be quite good: no misleading prediction was made on the 228 cases analyzed. A comparison between the error distributions on the training and validation sets shows low variability and confirms high precision on both classes. The highest obtainable system accuracy was thus reached: it is evidently an indicative
result, but the potential of similar systems now seems to be supported by more concrete elements. In an a priori analysis we tried to estimate the impact of the most significant choices on the accuracy of the ANN's predictions. In this phase we observed how particular decisions contributed to the achievement of such a result. Leaving the phases of the process described so far unchanged, but employing training and validation sets assembled while ignoring the results of the linear and non-linear analysis (i.e., using sets obtained by simply dividing the original dataset into two subsets), an error on the validation set equal to 4 cases out of the 228 is observed. This corresponds to an accuracy of 98.6%, a competitive value indeed, which shows the importance of the observations, mostly regarding the intrinsic variance of the cases. Error oscillations, furthermore, become more evident for the cases characterized by high variance. Another interesting observation can be made by leaving the described process unchanged but eliminating the hybrid ranking inspired by the elitist method typical of evolutionary algorithms: suppressing this step incurs an error on the validation set of approximately 0.4-0.9%. The results obtained highlight the contribution to learning accuracy generated by the choices made in the data pre-processing phase [13]. The adopted devices allowed us to obtain a system capable of taking full advantage of the peculiar characteristics of the datasets and of the distribution of information within them.
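A minimal sketch of the weight update used in the training scheme described at the beginning of this section is given below; only the learning rate 0.3, the SSE target of 7E-3, and the 50,000-epoch limit come from the text, while the momentum coefficient and the function interfaces are illustrative assumptions.

```python
import numpy as np

def train_with_momentum(weights, grad_fn, sse_fn,
                        lr=0.3, momentum=0.9,      # momentum value assumed
                        sse_target=7e-3, max_epochs=50_000):
    """Back-propagation style update: gradient descent with momentum,
    stopped when the SSE target or the epoch limit is reached."""
    velocity = np.zeros_like(weights)
    for epoch in range(max_epochs):
        grad = grad_fn(weights)                    # gradient of SSE w.r.t. weights
        velocity = momentum * velocity - lr * grad
        weights = weights + velocity
        if sse_fn(weights) <= sse_target:
            break
    return weights, epoch + 1
```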
4 Artificial Immune Recognition System AIRS (Artificial Immune Recognition System) is a supervised learning algorithm inspired by the functioning of the biological immune system, designed to solve problems such as intrusion detection, data clustering, and classification. For this work we employed AIRS1, described in [14], with the goal of developing a binary classification system. In the AIRS environment the features to be recognized are represented by antigens, while the recognizers are a pool of antibodies called memory cells. All memory cells are created during the training stage and are representative of the training data. The lifecycle [16] of the AIRS system is represented in Figure 1.
Fig. 1. An overview of the AIRS algorithm
Classification. To classify an unknown antigen, the affinity between this feature vector and all memory cells is calculated; the class of the best-matching memory cell is assigned to the presented antigen. The parameter set for this algorithm is: Seed: 1; Clonal rate: 10.0; Hyper clonal rate: 2.0; Mutation rate: 0.1; Total resources: 650; Affinity threshold scalar: 0.1; Stimulation threshold: 0.99. Seed is the number of antigens selected to seed the initial pool of memory cells.
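A minimal sketch of this classification step is given below; the Euclidean affinity measure over the feature vectors is an illustrative assumption.

```python
import numpy as np

def airs_classify(antigen, memory_cells, memory_labels):
    """Assign to the unknown antigen the class of the best-matching
    (highest-affinity, i.e. smallest-distance) memory cell."""
    distances = np.linalg.norm(memory_cells - antigen, axis=1)
    return memory_labels[int(np.argmin(distances))]
```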
Fig. 2. Java application GUI
5 aiNet aiNet (Artificial Immune Network) was proposed to the scientific community by de Castro and Von Zuben in 2001 [12]. aiNet combines compression techniques with applications of graph theory, yielding an unsupervised classifier. The centroids extracted by the algorithm are analyzed by computing a Minimal Spanning Tree (MST), where the cost function is the distance between centroids. aiNet performs well as a filter for datasets of great dimensionality, describing the distribution of the data. The cells of the immune network are represented in a space of the same dimension as the input data. The dimension of the network, that is, the number of cells composing it, is defined by mechanisms inspired by the dynamics of the biological immune system. aiNet is based on two principles of the biological immune system: Clonal Selection and the Immune Network Theory [18]. Clonal Selection defines how the system reacts to the invasion of antigens: when an antigen is recognized by the system, a subset of the antibodies that recognized it undergoes cloning and mutation,
introducing diversity into the population and thereby adapting itself to the invaders. This principle allows repetitive patterns in a dataset to be extracted, because all antigens with a particular sequence of values will be recognized by the same antibody. The cells that compose the biological immune system interact with each other in the total absence of external stimuli. This gave rise to the idea of a model of such interactions, with a communication network that connects the various elements. In the biological system, chemical messages are exchanged, determining the survival or death of antibodies. In the computational model, the interaction between two antibodies is given by their relative distance; moreover, nearby antibodies would recognize similar antigens, so redundant ones can be removed from the network. The model of de Castro & Von Zuben is based on these two principles. The recognition of patterns drives the cloning, mutation, and selection tasks according to the principle of Clonal Selection, while the mutual recognition of elements of the same network determines network suppression, eliminating its redundancy. The proposed stop criterion is a maximum number of iterations/generations. These properties turn out to be important in the analysis of biomedical datasets, where much information is available for each patient but reduced sample numbers are usual. At the end of each run of aiNet, the extracted antibodies represent the system's internal representation of the spatial distribution of the antigens. In aiNet the affinity between antibody (Ab) and antigen (Ag) is given by their relative distance: the Euclidean distance is very common in this context, especially for real-valued data, while the Hamming distance is preferred for binary strings. Classification is then unsupervised, and this is a critical factor in the analysis of biomedical data, where information extraction most often tends to be explorative rather than confirmative.
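A minimal sketch of the network suppression step is given below; the Euclidean distance follows the description above, while the suppression threshold and the interfaces are illustrative assumptions.

```python
import numpy as np

def network_suppression(antibodies, threshold):
    """Remove redundant antibodies: keep a cell only if its Euclidean
    distance to every already-kept cell exceeds the threshold."""
    kept = []
    for ab in antibodies:
        if all(np.linalg.norm(ab - k) > threshold for k in kept):
            kept.append(ab)
    return np.array(kept)
```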
6 Results The aim of this paper is the comparison of three different approaches to solving the same problem. Two of these techniques, IDEST and AIRS, are supervised, while aiNet is an unsupervised learning algorithm. Results are reported in the table below:

Table 1. For each technique, the percentage of items correctly classified
             IDEST     AIRS (reduced training set)  AIRS      AINET
Training     100.00%   99.00%                       98.50%    95.00% (all items)
Validation   100.00%   94.00%                       100.00%
IDEST uses an ANN (Artificial Neural Network) to classify and recognize the data. The learning phase of this system is slower than the AIRS learning process, but it yields better results on both the training and validation sets. The ANN explores the feature space better than AIRS does, because the latter is a single-shot algorithm and for this reason faster. Using the same training and validation sets (502 and 185 items) for AIRS as well, this algorithm shows a rate of correctly recognized items that is greater for the validation set than for the training set. It could be argued
that this is because the training set is more than double the size of the validation set (see Tab. 3). With a 200-entry training set, the accuracy is better on the training set. In conclusion, when the learning set is reduced, AIRS maintains good results where IDEST starts to fail. For classification, IDEST needs a large number of training examples, because a neural network must process a minimum set of items to learn its space, while AIRS can work with a reduced data set, being an extension of AIS (Artificial Immune System, an unsupervised algorithm). From these investigations we consider AIRS to be better than a system based on an ANN (such as IDEST) when an on-line learning process must be implemented or when the number of examples is not large enough to train a neural network. Finally, we used an unsupervised system, aiNet, to understand how the items are distributed in the feature space; with this technique we obtained two clusters that correctly classify 95% of all items.
7 Conclusions and Further Works In this paper we compared different and evolved approaches from different fields of research: ANN-derived and Artificial Immune System-derived approaches were compared with each other. Although only small differences in accuracy levels were observed, with IDEST being the top-performing algorithm, some observations should be made. As shown in Tab. 2, the computational costs of these three algorithms are markedly different.

Table 2. Time (in seconds) needed to complete the training phase
              IDEST   AIRS   AINET
Training set  1223    96     135
Execution times were computed by taking the floor of the mean over 100 runs of each system. It is evident that the neural-genetic system is more accurate but more time-consuming. On the other hand, AIRS and aiNet require smaller computational resources, at the cost of a slight degradation in accuracy. However, it seems that well-planned fine tuning of the parameters of AIS-based systems can lead to a remarkable improvement of the results. Given their characteristics in terms of computational time and accuracy, immune-based approaches seem to be a good alternative to well-established paradigms. Questions about the sensitivity analysis of parameters and optimal feature sets are currently being investigated. Some interesting behaviours of immune-based systems are also under investigation: they mainly refer to the ability of such systems, under certain conditions, to outperform classical approaches (like ANN or SVM). We are trying to model these behaviours (as reported in Tab. 1, columns 3 and 4) and to understand the inner mechanisms that lead to them; it is obvious, in fact, that this aspect could turn out to be a strength of top-performing AIS-based systems in the biomedical field. Maintaining a low sample size allows the experimental costs of the research pipeline to be contained.
References
1. Dasgupta, D.: Artificial Neural Networks and Artificial Immune Systems: Similarities and Differences. Proc. of the IEEE SMC 1 (1997) 873-878
2. Holland, J.H.: Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor (1975)
3. Bonabeau, E., Dorigo, M., Theraulaz, G.: Swarm Intelligence: From Natural to Artificial Systems (1999)
4. de Castro, L.N., Timmis, J.I.: Artificial Immune Systems: A New Computational Intelligence Approach. Springer-Verlag, London, September (2002) 357 p.
5. Farmer, J.D., Packard, N., Perelson, A.: The Immune System, Adaptation and Machine Learning. Physica D 22 (1986) 187-204
6. Dasgupta, D.: Artificial Immune Systems and Their Applications. Springer-Verlag, Berlin, January (1999)
7. de Castro, L., Timmis, J.: Artificial Immune Systems: A New Computational Intelligence Approach (2001)
8. Timmis, J., Neal, M., Hunt, J.: An Artificial Immune System for Data Analysis. Biosystems 55 (2000) 143-150
9. Bevilacqua, V., Mastronardi, G., Menolascina, F.: Intelligent Information Structure Investigation in Biomedical Databases: the Breast Cancer Diagnosis Problem. ISC 2005
10. Wolberg, W.H., Street, W.N., Heisey, D.M., Mangasarian, O.L.: Computer Derived Nuclear Features Distinguish Malignant from Benign Breast Cytology. Cancer Cytopathology 81 (1997) 172-179
11. Watkins, A., Timmis, J., Boggess, L.: Artificial Immune Recognition System (AIRS): An Immune-Inspired Supervised Machine Learning Algorithm. Genetic Programming and Evolvable Machines 5(3) (2004) 291-317
12. de Castro, L.N., Von Zuben, F.J.: aiNet: An Artificial Immune Network for Data Analysis. In: Abbass, H.A., Sarker, R.A., Newton, C.S. (eds.), Data Mining: A Heuristic Approach, Chapter XII, 231-259. Idea Group Publishing, USA (2001)
13. Bevilacqua, V., Mastronardi, G., Menolascina, F., Pannarale, P., Pedone, A.: A Novel Multi-Objective Genetic Algorithm Approach to Artificial Neural Network Topology Optimisation: The Breast Cancer Classification Problem. IJCNN 2006
14. Watkins, A.: AIRS: A Resource Limited Artificial Immune Classifier. Department of Computer Science, Mississippi State University (2001)
15. Timmis, J., Neal, M., et al.: An Artificial Immune System for Data Analysis. BioSystems 55(1/3) (2000) 143-150
16. Brownlee, J.: Artificial Immune Recognition System (AIRS): A Review and Analysis. Centre for Intelligent Systems and Complex Processes (CISCP), Swinburne University of Technology (SUT) (2005)
17. Jerne, N.K.: Towards a Network Theory of the Immune System. Annals of Immunology (1973)
18. de Castro, L.N.: http://www.dca.fee.unicamp.br/~lnunes/
NeuroOracle: Integration of Neural Networks into an Object-Relational Database System Erich Schikuta and Paul Glantschnig Research Lab on Computational Technologies and Applications, Institute of Knowledge and Business Engineering, Faculty of Computer Science, University of Vienna, Rathausstraße 19/9, A-1010 Vienna, Austria
[email protected]
Abstract. Many different approaches for the modeling of neural networks have been presented in the literature (e.g. [4]). Generally, the object-oriented approach has proved most appropriate. It provides a concise but comprehensive framework for the design of neural networks in terms of their static and dynamic components, i.e., the information structure and its methods in the object-oriented sense. This paper presents a framework for the conceptual and physical integration of neural networks into object-relational database systems. The static components comprise the structural parts of a neural network: the neurons and connections, and higher topological structures such as layers, blocks, and network systems. The dynamic components are the behavioral characteristics, such as the creation, training, and evaluation of the network. Finally, the implementation of the new NeuroOracle system based on the proposed framework is presented.
1 Introduction
Object-oriented database management systems (OO-DBMS) have proved very useful for handling and administrating complex objects. We believe that the object-oriented approach is the most comfortable and natural design model for neural networks [4]. In the context of object-oriented database systems, neural networks are generally treated as complex objects. These systems have shown great value in handling and administrating such objects in different areas, such as computer-aided design, geographic databases, administration of component structures, etc. It is our objective to consider neural networks as conventional data in the database system. From the logical point of view, a neural network is a complex data value and can be stored as a normal data object. The usage of a database system as an environment for neural networks provides both quantitative and qualitative advantages. – Quantitative advantages. Modern database systems allow objects to be administered efficiently. This is provided by a 'smart' internal level of the system, which exploits well-studied and well-known data structures, access
paths, etc. A whole range of further concepts is inherent to these systems, such as models for transaction handling, recovery, multi-user capability and concurrent access. This puts an unmatched platform, in terms of speed and security, for the definition and manipulation of large data sets at the user's disposal. – Qualitative Advantages. The user has powerful tools and models at hand, like data definition and manipulation languages, report generators or transaction processing. These tools provide a unified framework for handling both neural networks and the input/output data streams of these networks. A homogeneous and comprehensive user interface is provided to the user, which spares awkward workarounds for analyzing the data of the database with a separate network simulator system. A further important aspect (which is beyond the scope of this paper) is the usage of neural networks as part of the rule component of a knowledge-based database system [7]. Neural networks inherently represent knowledge through the processing in their nodes [6]. Trained neural networks are similar to rules in the conventional symbolic sense. A very promising approach is therefore the embedding of neural networks directly into the generalized knowledge framework of a knowledge-based database system.
1.1 Integration of a Neural Network Simulator into an OO-DBMS
To solve the integration problem, the approach described in [8] is used. In this approach, the embedding of neural networks into database systems, the direction opposite to conventional approaches is followed: the neural networks are moved into the database system, rather than the data to the neural network simulators. In this paper we present the underlying data model of the NeuroOracle system, an artificial neural network simulation system integrated into an object-relational database system, namely the Oracle database management system.
1.2 Goals
The goals and basic requirements of the NeuroOracle system were specified as follows. – The system is a neural network simulator. Basic neural network functions, such as creating, modifying, deleting, training and evaluating neural networks, should be supported. – The system provides the main neural network paradigms, which can be classified into three groups: Backpropagation Nets, Self-Organizing Maps and Recurrent Nets. – The system is extensible, so it should be possible to add new neural networks as well as new training and evaluation algorithms. Moreover, new neural network paradigms can be added to the system. – The training and evaluation algorithms should be implemented in a high-level programming language. – The system is implemented within an object-relational database management system (DBMS).
1.3 Object-Relational Database Design
Why store complex data, such as the data of a neural network, in a flat relational database schema when an object-relational approach is available? In Oracle, the definition of Oracle object types has been possible since version 9i. These object types are user-defined types that make it possible to model real-world entities as objects in the database. In general, the object-type model is similar to the class mechanism found in C++ and Java. Like classes, objects make it easier to model complex entities and logic, and the reusability of objects makes it possible to develop database applications faster and more efficiently. By natively supporting object types in the database, an object-relational DBMS enables application developers to directly access the data structures used by their applications. No mapping layer is required between client-side objects and the relational database columns and tables that contain the data. Object abstraction and the encapsulation of object behavior also make applications easier to understand and maintain. Object types thus provide a great deal of extensibility. For these reasons, this approach is well suited to the NeuroOracle system: with a set of self-defined object types, a complex structure can be built inside the database that is both generic and extensible.
2 The NeuroOracle's Database Structure
The object-oriented model has proved extremely appropriate for the specification and modeling of neural network systems, and object-oriented database systems have proved very valuable for handling and managing complex objects. One defining property of object-oriented design is the hierarchy of types. A type comprises a set of objects which share common functions. Generalization and specialization define a hierarchical type structure, which organizes the individual types. Functions defined on a super-type are inherited by all of its sub-types along the type hierarchy. However, many state-of-the-art and widely used database systems provide only a relational data model framework and do not yet support the object-oriented paradigm. The relational approach [1] is declarative and value-oriented. Operations on relations are expressed by simple and declarative languages delivering their results as new relations. Today the relational approach is the model of choice in the community and is provided by nearly all available database systems. The neural network is represented by values in the database system, and the semantic information is expressed by relationships between these values.
2.1 Object Model
The data of a neural network stored within the NeuroOracle system can be divided into two groups: static components and dynamic components.
Fig. 1. NeuroObject’s object model
Static Components. The static neural network components comprise all information stored in relations, such as network-specific parameters, links, training objects and evaluation objects, corresponding to the entity-relationship diagram shown in Figure 1. The neural network object is a sub-type of the general object type of the database system. Sub-types can be classified into specialized neural network types according to their network paradigm. The network paradigm is defined by a specialization, i.e. a sub-type of the neural network type. This sub-type (which inherits all characteristics of its super-type) provides the specific attributes required by the network paradigm. Combined with the definition of the paradigm are the dynamics (the dynamic behavior) of the network. This approach is reflected in the Unified Modeling Language (UML) diagram of NeuroOracle's object model in Figure 1. A UML diagram is useful for describing the conceptual schema of the 'reality' in focus, and the transformation of such a model into a database realization is straightforward. Rectangles represent entity sets, circles attributes, and connections relationships between entities; for an in-depth explanation see [2]. All data of a specific net is stored inside the neuralnet_tab database table and its nested tables. As shown in Figure 1, neuralnet_tab contains four nested tables: trainings, evaluations, connections and layers, of which layers in turn contains the nested table neurons. One part of the neural network's structure (the type and number of layers, the number of neurons of a specific layer, their activation and output function types and their BIAS information) is mapped into the nested tables layers and neurons. The other part, the connections between the neurons, is stored within the nested table connections. The remaining two nested tables, trainings and evaluations, are used for storing training and evaluation results. The structure of a neural network, more precisely the layers, the neuron units of the layers, their activation and output functions and the referenced neural network type, forms the static components, whereas the connection weights between the neurons and the BIAS value of each neuron (if activated) are dynamic information and are usually adapted during the network's learning phase. Dynamic Components. The dynamic components of the neural network object are the typical operations on a neural network, namely the training and evaluation phases. The algorithms for these phases are realized by routines coded in the internal database application language (Oracle's PL/SQL procedures). Furthermore, these routines maintain certain consistency assertions on the static components after execution of specific phases, such as the insertion of link weights after a training run or of results after an evaluation phase.
2.2 Datastream Concept
The functional data stream definition allows the data sets to be specified in a comfortable way. It is not necessary to specify the data values explicitly; instead, the
data streams can be described by SQL statements. In the database component of the NeuroOracle system, the well-known apparatus of the SQL data manipulation language ([5]) is at hand. Thus the same tool can be used both for administration and for analysis of the stored information, so it is easily possible to use 'real world' data sets as training sets for neural networks and to analyze other (or the same) data with trained networks.
2.3 Extensibility
An important aspect is the extensibility of the system. This is achieved by a modularized paradigm approach. New modules have to follow a specific programming style so that they can be integrated easily into the NeuroOracle environment. Users thus have the possibility to shape the system to their needs by changing existing paradigms or adding new ones, and all these implementations can be done without leaving the comfortable database environment.
3 Interaction of the NeuroOracle's Components
For a better understanding of the system, a case study of a simple neural network illustrates the mapping between the neural network structure and the database tables, as well as the interaction of the defined database components. Before a new neural network can be added to the system, the table nntype_tab must contain at least one neural network type, since every neural network must refer to one specific network type. Once at least one network type is defined, a new neural network can be inserted into the neuralnet_tab table and its nested tables. This can be done by using a simple constructor method:

INSERT INTO neuralnet_tab VALUES (
  neuralnet(1, 'xor1', struct(2,3,1), feedforward()));

Figure 2 shows how the neural network's data is stored within the neuralnet_tab table and its nested tables layers and connections. The nested tables trainings and evaluations do not contain any data before a training or evaluation session has been performed. The attribute nettype refers to a specific neural network type in the nntype_tab table. The next task is to perform a training process, in which the dynamic components are adapted. By inserting new training parameters and specific training data into the nested table trainings, a database trigger is fired, which in turn starts the self-implemented training algorithm written in the PL/SQL programming language. The training algorithm then returns a set of results and stores them in the nested table trainings. A new evaluation session is created in the same way as a training session. Since trigger methods are not available on nested tables, automatic training or evaluation of data newly inserted into these nested tables is not possible.
Fig. 2. Mapping of neural network data
For that reason, two "dummy" tables with trigger constraints have been designed. The following SQL statement creates a new training session by inserting the data into the train_session table:

INSERT INTO train_session VALUES (
  doTraining('xor1-training1', 'xor1', '0;0;0;1;1;0;1;1', '0;1;1;0',
             backprop(400, 0.5, 0.01)));

The inserted training data is shown in Figure 3.
Fig. 3. Training data inserted into nested table
After insertion of the training data, a database trigger is fired that calls the proper training algorithm depending on the algorithm type inserted as a training parameter into the database table; in this case the backpropagation training function is used. The training parameters and the result of the training algorithm are then inserted into the nested table trainings of the neuralnet_tab table, as shown in Figure 4. The training result object stored in the trainresult column is of type backproptrainresult. It contains a nested table resultdetail that stores the total net error of specified epochs. Furthermore, the database trigger updates the dynamic components of the neural network.
Fig. 4. Training results inserted by training trigger
Figure 5 shows the graphical mapping of the neural network after the training session has been performed.
Fig. 5. Graphical sketch of the trained network
As already mentioned, the evaluation process works equivalently. New evaluation parameters are inserted into the other dummy table, eval_session; again a database trigger is fired and the result of the evaluation process is inserted into the nested table evaluations. The following SQL statement gives an example of how to create a new evaluation with the doEvaluation object. A name for the new evaluation, the name of the existing neural network, the input to be evaluated and the proper net training algorithm type have to be given; the object type of the algorithm is needed to determine the right evaluation algorithm for the neural network.

INSERT INTO eval_session VALUES (
  doEvaluation('eval1', 'xor1', '0;1', backprop()));

The newly created evaluation is stored in the nested table evaluations inside the neuralnet_tab table. The following SQL statement accesses this table and its entries:

SELECT e.*
FROM neuralnet_tab n, TABLE(n.evaluations) e
WHERE n.netid = 'xor1';

The results of this query are shown in Figure 6.
Fig. 6. Evaluation table entries
4 Implementation Issues
The NeuroOracle system was developed and tested with Oracle 10g. The NeuroOracle software package, together with comprehensive documentation, can be downloaded from [3]; a personal edition of the Oracle DBMS can be downloaded from the official Oracle website. Once the Oracle database is running, the NeuroOracle system can be installed in virtually any user schema by running the provided script createNeuroOracle.sql in SQL*Plus. We recommend executing the script createUser.sql first in order to create the default Oracle user for the NeuroOracle system. Executing the createNeuroOracle.sql script creates the database schema described above and also inserts three neural network types into the nntype_tab table by default: Feed-forward Networks with typeid 1, Recurrent Networks with typeid 2 and Self-Organizing Maps with typeid 3. No new network type is necessary to use the neural network simulator; the three predefined neural network types are sufficient for most applications. However, new neural network types can be created easily by inserting them into the nntype_tab table. These entries impose no restrictions on the structure of the networks inserted into the neuralnet_tab table or on the training algorithms used; they simply give the users a better overview of the existing networks, i.e. they are purely additional information.
5 Visions
The NeuroOracle database system provides a fundamental basis, the application logic, on which front-end applications can be built quite easily. Because of the object-relational approach, application developers do not need to create a mapping between the data stored in the database tables and the application's data structures. Due to the different kinds of interfaces provided by Oracle, such as Java, PL/SQL, Pro*C/C++, OCI or OLE, there are many possibilities for building a user front end for NeuroOracle. For example, a web-based application could be built using JSP, PL/SQL or a Java applet, or a standalone application could be built using the OCI interface. The NeuroOracle system can thus be extended by a user interface that makes its usage much easier, and new network paradigms as well as training and evaluation algorithms can be added to the system. Furthermore, the NeuroOracle system could be deployed on an Oracle Real Application Cluster to boost performance and may be enabled for grid computing in the future within the Oracle 10g database.
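As a purely illustrative example of such a front end (not taken from the paper, and using Python with the cx_Oracle driver instead of the interfaces listed above; the connection parameters, user name and column names are assumptions made only for this sketch), a small client could read the evaluation results stored in the nested table:

# Hypothetical NeuroOracle client; connection details are placeholders.
import cx_Oracle

connection = cx_Oracle.connect("neurooracle", "secret", "localhost/XE")
cursor = connection.cursor()
cursor.execute(
    "SELECT e.evalname, e.inputquery "
    "FROM neuralnet_tab n, TABLE(n.evaluations) e "
    "WHERE n.netid = :netid",
    netid="xor1",
)
for evalname, inputquery in cursor:
    print(evalname, inputquery)  # one row per stored evaluation
connection.close()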
Known Problems. At the time of writing, no referential constraints could be defined for nested table columns. Therefore the two dummy tables, train_session and eval_session, were needed for triggering newly defined training or evaluation objects. It was also not possible to associate the neurons of a net with the nested connections table, with the result that there could be connections between neurons that do not exist. As a consequence, applications built on NeuroOracle should ensure data integrity among the NeuroOracle objects.
6 Conclusions and Future Research
In this paper we presented an object-relational model for embedding neural networks into database systems. Based on this framework, the NeuroOracle system was developed: an extensible, comfortable and powerful neural network tool embedded into the object-relational Oracle database system. This approach provides the user with a homogeneous and natural environment for the administration and handling of neural networks.
References 1. Codd, E.: A Relational Model for Large Shared Data Banks. Communications of the ACM 13 (1970) 377–387 2. Date, C.: An Introduction to Database Systems. Addison-Wesley (1986) 3. Glantschnig, P., Schikuta, E.: NeuroOracle Package. http://www.cs.univie.ac.at/template.php?tpl=shared/studProjES.tpl, November (2005) 4. Heileman, G. et al.: A General Framework for Concurrent Simulation of Neural Networks Models. IEEE Trans. Software Engineering 18 (1992) 551–562 5. Melton, J., Simon, A.: Understanding the New SQL: A Complete Guide. Morgan Kaufmann Publishers (1993) 6. Pao, Y.H., Sobajic, D.: Neural Networks and Knowledge Engineering. IEEE Knowledge and Data Engineering 3 (1991) 185–192 7. Schikuta, E.: The Role of Neural Networks in Knowledge Based Systems. Int. Symp. on Nonlinear Theory and Applications, Hawaii, IEICE (1993) 8. Schikuta, E.: NeuDB'95: An SQL Based Neural Network Environment. In: Amari, S. et al. (eds.), Progress in Neural Information Processing, Proc. Int. Conf. on Neural Information Processing, ICONIP'96, Hong Kong, Springer-Verlag, Singapore (1996) 1033–1038
Discrimination of Coronary Microcirculatory Dysfunction Based on Generalized Relevance LVQ Qi Zhang1, Yuanyuan Wang1, Weiqi Wang1, Jianying Ma2, Juying Qian2, and Junbo Ge2 1
Department of Electronic Engineering, Fudan University, Shanghai 200433, P.R. China {051021084, yywang}@fudan.edu.cn 2 Department of Cardiology, Zhongshan Hospital of Fudan University, Shanghai 200032, P.R. China
[email protected]
Abstract. Few effective methods exist to accurately discriminate coronary microcirculatory dysfunction from the normal coronary microcirculation. Rather than the traditional approaches that consider only a single hemodynamic parameter, a novel scheme is proposed based on generalized relevance learning vector quantization (GRLVQ) using multiple parameters (features). Naturally integrating the tasks of feature selection and classification, this scheme applies GRLVQ iteratively to gradually prune the unimportant features according to their weighting factors. In each iteration, the prototypes are generated for classification and the classification accuracy is obtained. Finally, the feature subset with the highest classification accuracy is selected and the corresponding classifier is also obtained. This approach not only simplifies the classifier but also enhances the classification performance. The method is verified on physiological data collected from animals and proves superior to the traditional single-parameter method.
1 Introduction
Coronary microcirculation is suspected of being involved in a large number of cardiovascular diseases [1]. Therefore, the task of effectively classifying the coronary microcirculatory function, namely discriminating coronary microcirculatory dysfunction from the normal coronary microcirculation, has great significance in medical diagnosis. The traditional research direction has been to find a single clinical hemodynamic parameter for the classification, such as the coronary flow reserve (CFR) and the coronary resistance reserve (CRR) [2-4]. However, the discrimination performance of any single parameter is not satisfactory [2-4]. On the one hand, because a single hemodynamic parameter can hardly achieve exact classification, a new classifier scheme using multiple parameters is considered. On the other hand, integrating multiple sources of information from the physiological data often leads to an excess of parameters, which means a high dimensionality of features and increases the complexity of the classifier; in addition, the classification performance with so many features is not always excellent. Thus a
method automatically determining the intrinsic features and pruning the unimportant ones is needed. It follows from the above that an appropriate multiple-parameter method, with both the ability to determine the intrinsic features and powerful classification capability, is required for the classification of the coronary microcirculatory function. This can be divided into two successive tasks. The first task is feature selection, which selects a subset of the original features by an evaluation criterion, such as distance measures or information measures [5]; the following task is classifier design, i.e. designing a classifier with a minimal error rate. Usually these two tasks are not integrated: the feature selection does not directly yield a classification result, and the classifier cannot reduce the input dimensions itself. This increases the complexity of the whole classification system. A neural network called generalized relevance learning vector quantization (GRLVQ) caught our attention for its potential to naturally integrate feature selection and classification. As a modification of learning vector quantization (LVQ), GRLVQ preserves the merit of LVQ, namely a simple and accurate prototype-based classifier. Furthermore, GRLVQ introduces weighting factors for the data dimensions which are adapted automatically, so that the classification error rate becomes small and the intrinsic data dimensions can be explored [6]. In this paper, a novel and efficient scheme based on GRLVQ is proposed to effectively classify the microcirculatory function. GRLVQ is first utilized to rank all of the hemodynamic parameters (features) extracted from the physiological data and to prune the obviously unimportant ones. Then, for a subtle feature reduction, GRLVQ is applied iteratively to rank the residual features and prune the most irrelevant one, until the number of features is reduced to zero. In each iteration, the prototypes are also generated; they are used for classification by means of the nearest-neighbor rule, and the corresponding classification accuracy is obtained. Finally, the feature subset with the highest classification accuracy is determined. At that point both the feature selection and the classification have been achieved. This method is compared to the traditional single-parameter method via animal experiments and proves to be more effective.
2 Data Acquisition and Feature Extraction
Coronary microcirculatory dysfunction can be caused artificially by injecting different amounts of microspheres into animals to produce different degrees of intracoronary microembolization [7], so disease cases can be obtained from animal experiments more conveniently than from humans. Here we acquired 136 cases from 25 pigs. All the physiological data were collected at the Department of Cardiology, Zhongshan Hospital, Shanghai, via catheterization techniques including intravascular ultrasound (IVUS) imaging, the intracoronary Doppler technique and intracoronary pressure measurement, yielding IVUS video images, Doppler signals and pressure signals, respectively [7]. Subsequently, numerous hemodynamic parameters were extracted from the original physiological data using several methods of medical signal processing. Here 21 features were extracted, such as the aforementioned parameters CFR and CRR, the
2 Data Acquisition and Feature Extraction The coronary microcirculatary dysfunction can be artificially caused by injecting different amount of microspheres into animals to produce different intracoronary microembolization [7], so we can get disease cases from animal experiments more conveniently than humans. Here we acquired 136 cases from 25 pigs. All the physiological data were collected at Department of Cardiology, Zhongshan Hospital, Shanghai, via the catheterization techniques including intravascular ultrasound (IVUS) imaging, intracoronary Doppler technique and intracoronary pressure measurement, respectively getting the IVUS video images, Doppler signals and pressure signals [7]. As a following, plentiful hemodynamic parameters were extracted from the original physiological data using several methods of medical signal processing. Here 21 features were extracted, such as the aforementioned parameters CFR and CRR, the
Discrimination of Coronary Microcirculatory Dysfunction
1127
frequency-domain parameters of coronary artery pressure (Pf(k), k = 0, 1, 2, 3), coronary blood flow volume (Qf(k)), coronary artery impedance (Rf(k)), etc. It should be noticed that, in our former research, CRR has been extended from a single parameter to a set of parameters CRRf(k) (where CRR = CRRf(0)), which have more hemodynamic information of the coronary microcirculation [7].
3 Feature Selection and Classification In this section, we will first introduce the algorithm of GRLVQ, then explain the functions of weighting factors and prototypes, and finally describe our novel scheme for the feature selection and classification. 3.1 The GRLVQ Algorithm Let D = {(xi, yi) ∈ Rn × {1,…, C} | i = 1,…, m} be a training data set with ndimensional elements xi = (x1i,…, xni) to be classified and C classes. A set W = {w1,…, wM} of prototypes in data space with class labels ck (k = 1,…, M) is used for data representation, where wk = (w1k,…, wnk) ∈ Rn, and ck ∈ {1,…, C}. The general algorithm of GRLVQ consists in minimizing the classification cost function [6]
EGRLVQ = ∑ f ( μ λ ( x i ) ) . m
(1)
i =1
Choose f as the sigmoid function f(x) = sgd(x) = 1/(1 + exp(−x)) ∈ (0, 1), and μλ(xi) = (dλ+ − dλ−)/(dλ+ + dλ−). Here dλ+ is the squared distance on a certain metric between the data point xi and the nearest correctly classified prototype, say w+; similarly, dλ− is the distance between xi and the nearest wrongly classified prototype, say w−. Rather than the Euclidian metric which considers the input dimensions as equally scaled and equally important, the metric of distance in GRLVQ is an ununiform metric d λ = xi − w
2
λ
n
= ∑ λ j ( xij − w j ) 2 .
(2)
j =1
Eq. (2) introduces input weighting factors λ = (λ1,…, λn), λj ≥ 0, j = 1,…, n, ∑jλj = 1, in order to allow a different scaling of the input dimensions. Via a stochastic gradient descent, partial derivatives of EGRLVQ yield the update formulas for w+, w− and λ [8]: sgd′ ( μ λ ( x i ) ) d λ− + + Δw = η Λ( x i − w + ) , (3) + − 2 d + d ( λ λ) Δw − = −η −
sgd ′ ( μ λ ( x i ) ) d λ+
(d
+ λ
+d
)
− 2 λ
Λ( x i − w − ) ,
⎛ ⎞ d λ− d λ+ i + 2 i − 2 ⎟ Δλ j = −η ⋅ sgd′ ( μ λ ( x ij ) ) ⎜ x − w − x − w . ( ) ( ) j j j + − 2 ⎜ d+ + d− 2 j ⎟ d + d ( ) ( ) λ λ λ λ ⎝ ⎠
(4)
(5)
Here $\eta^+, \eta^-, \eta \in (0,1)$ are learning rates, $\mathrm{sgd}'$ is the derivative of the sigmoid function, and $\Lambda$ is the diagonal matrix with entries $\lambda_1,\ldots,\lambda_n$. Each time $\lambda$ is updated, the normalization $\sum_j \lambda_j = 1$ is enforced so as to avoid numerical instabilities.
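To make the update rules concrete, the following Python sketch implements one stochastic GRLVQ step for a single training sample. It is only an illustrative reimplementation of Eqs. (2)-(5) above; the function names and default learning rates are ours and not part of the original GRLVQ software, and at least one prototype of the correct class and one of another class are assumed to exist.

import numpy as np

def sgd(x):
    # sigmoid f(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def weighted_dist(x, w, lam):
    # Eq. (2): d_lambda = sum_j lambda_j (x_j - w_j)^2
    return np.sum(lam * (x - w) ** 2)

def grlvq_step(x, y, W, c, lam, eta_p=0.1, eta_m=0.1, eta_l=0.01):
    # One stochastic update for the labelled sample (x, y).
    # W: (M, n) prototypes, c: (M,) prototype labels, lam: (n,) relevances.
    d = np.array([weighted_dist(x, w, lam) for w in W])
    same = (c == y)
    jp = np.where(same)[0][np.argmin(d[same])]    # nearest correct prototype
    jm = np.where(~same)[0][np.argmin(d[~same])]  # nearest wrong prototype
    dp, dm = d[jp], d[jm]
    mu = (dp - dm) / (dp + dm)
    s = sgd(mu)
    g = s * (1.0 - s) / (dp + dm) ** 2            # sgd'(mu) / (d+ + d-)^2
    dw_p = eta_p * g * dm * lam * (x - W[jp])     # Eq. (3)
    dw_m = -eta_m * g * dp * lam * (x - W[jm])    # Eq. (4)
    dlam = -eta_l * s * (1.0 - s) * (dm * (x - W[jp]) ** 2
                                     - dp * (x - W[jm]) ** 2) / (dp + dm) ** 2  # Eq. (5)
    W[jp] += dw_p
    W[jm] += dw_m
    lam = np.clip(lam + dlam, 0.0, None)
    return W, lam / lam.sum()                     # renormalize so that sum(lambda) = 1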
3.2 Functions of Weighting Factors and Prototypes
In GRLVQ, the weighting factor $\lambda$ represents an adaptive metric for the input dimensions and is determined automatically via stochastic gradient descent, as Eq. (5) shows. When the algorithm converges, the final $\lambda$ ranks the input dimensions: the larger $\lambda_j$ is, the more important the corresponding feature and the greater its contribution to the classification. According to the ranking of the features, unimportant features can therefore be pruned. As a prototype-based algorithm, GRLVQ yields prototypes $w^k$ that represent the corresponding classes as accurately as possible by minimizing the cost function $E_{GRLVQ}$ [9]. The prototypes are subsequently used to design the classifier. In this paper, an element $x$ is assigned to the class $c_x$ according to the nearest-neighbor principle:
$$c_x = c^K, \quad \text{where } K = \arg\min_k \|x - w^k\|_\lambda^2. \qquad (6)$$
This means that $x$ is assigned to the class to which its nearest prototype belongs. The distance metric here is again the scaled adaptive one, substituting the traditional Euclidean metric.
3.3 GRLVQ-Based Feature Selection and Classification
Taking full advantage of the weighting factors and prototypes, a novel scheme that naturally integrates feature selection and classification is described as follows. The feature selection follows a top-down strategy, which means gradually removing unimportant features from the whole feature set and finding an optimal feature subset in the process. Two problems have to be solved: first, what is the principle of feature reduction, and second, what is the evaluation criterion for the optimal feature subset? For the first problem, we propose two principles utilizing the weighting factors, depending on the current feature dimensionality $n$. When $n$ is high, some features may carry much interfering noise. Considering that the uniform scaling is $\lambda_j = 1/n$, we define the feature reduction threshold as
$$\lambda_{thresh} = 1/(\gamma \cdot n). \qquad (7)$$
If $\lambda_j < \lambda_{thresh}$, the corresponding feature can be pruned. Here $\gamma > 1$ is a threshold-controlling factor; the smaller $\gamma$ is, the more dimensions are reduced. By this threshold method, several obviously unimportant or interfering features can be pruned at once, which improves the efficiency of the feature reduction. When $n$ is low, for a subtle feature reduction, we prune just one feature at a time. Let
$$J = \arg\min_j(\lambda_j); \qquad (8)$$
then the $J$th feature is pruned.
For the second problem, we directly use the classification accuracy $R_A$ (see Eq. (9)) as the evaluation criterion of the optimal feature subset, unlike conventional criteria such as distance measures or information measures [5]:
$$R_A = m_t^+ / m_t, \qquad (9)$$
where $m_t$ is the number of elements in the test set and $m_t^+$ is the number of correctly classified elements in the test set. With these two problems solved, the tasks of feature selection and classification can be naturally integrated into the following procedure:
1. With all $n$ features, use Eqs. (3) and (4) to obtain $w^k$ and Eq. (5) to obtain $\lambda$. Use Eq. (6) to design the classifier and Eq. (9) to obtain the current classification accuracy $R_A(n)$.
2. Use Eq. (7) to prune several obviously unimportant or interfering features; the feature dimensionality is reduced from $n$ to $n_r$.
3. With the reduced $n_r$ features, use Eqs. (3) and (4) to obtain $w^k$ and Eq. (5) to obtain $\lambda$. Use Eq. (6) to design the classifier and Eq. (9) to obtain the current classification accuracy $R_A(n_r)$.
4. Prune the least important feature according to Eq. (8). Let $n_r = n_r - 1$.
5. If $n_r = 0$, proceed to step 6; otherwise go back to step 3.
6. Find the maximum of $R_A$; the corresponding $n_r$ features are the finally selected features, which are most useful and essential for the classification, and the corresponding classifier is the optimal classifier of this course.
The above procedure can be summarized in three stages. The first stage consists of steps 1 and 2, for coarse feature reduction; the second stage consists of steps 3 to 5, for subtle feature reduction; and the third stage is step 6, for the final feature selection and classification.
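A compact Python sketch of this three-stage procedure is given below. It is purely illustrative: the helper callables train_grlvq (repeated application of the update step from Sect. 3.1) and classify (the rule of Eq. (6)) are assumed to be supplied by the caller and are not part of the original paper.

import numpy as np

def grlvq_feature_selection(X_tr, y_tr, X_te, y_te, train_grlvq, classify, gamma=5.0):
    # train_grlvq(X, y) -> (W, c, lam); classify(x, W, c, lam) -> predicted label
    features = list(range(X_tr.shape[1]))
    history = []  # (feature subset, accuracy, trained classifier)

    # Stage 1: coarse reduction with the threshold of Eq. (7)
    W, c, lam = train_grlvq(X_tr[:, features], y_tr)
    thresh = 1.0 / (gamma * len(features))
    features = [f for f, l in zip(features, lam) if l >= thresh]

    # Stage 2: subtle reduction, pruning one feature per iteration (Eq. (8))
    while features:
        W, c, lam = train_grlvq(X_tr[:, features], y_tr)
        preds = np.array([classify(x, W, c, lam) for x in X_te[:, features]])
        acc = np.mean(preds == y_te)                 # Eq. (9)
        history.append((list(features), acc, (W, c, lam)))
        del features[int(np.argmin(lam))]            # drop the least relevant feature

    # Stage 3: keep the subset (and classifier) with the highest accuracy
    return max(history, key=lambda item: item[1])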
4 Experiments and Results
The presented GRLVQ-based scheme is verified on hemodynamic data collected from animals and contrasted with the traditional single-parameter method. Here the number of classes is $C = 2$ and the total number of cases is $m = 136$, including $m_1 = 45$ cases with normal coronary microcirculation and $m_2 = 91$ cases with coronary microcirculatory dysfunction. The 21 hemodynamic parameters extracted include CFR, CRRf(k), Pf(k), Qf(k) and Rf(k) (where k = 0, 1, 2, 3), so the feature dimensionality is $n = 21$. For each experiment, the data are randomly divided in half, one half as the training set ($m_{1r} = 23$, $m_{2r} = 46$) and the other as the test set ($m_{1t} = 22$, $m_{2t} = 45$). We first investigate the classification performance of each single parameter with a simple threshold method, choosing the threshold $p_{thresh} = (m_{1r}\mu_1 + m_{2r}\mu_2)/(m_{1r} + m_{2r})$, where $\mu_i$, $i = 1, 2$, is the mean of the parameter on the training set for class $i$ [10]. After training and testing each parameter 50 times, the parameter CRRf(0) shows the best classification performance, but its accuracy is actually not very high and has a large variance. The 6 parameters with the highest accuracy on the test set are listed in Table 1.
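For reference, this single-parameter baseline reduces to a one-dimensional threshold rule. The following Python sketch is our own illustration of that rule; which class is assigned to which side of the threshold is an assumption, since the paper does not spell it out.

import numpy as np

def single_parameter_classifier(x_train, y_train, x_test):
    # p_thresh = (m1r*mu1 + m2r*mu2) / (m1r + m2r), a weighted mean of the class means
    x1 = x_train[y_train == 1]   # class 1: normal microcirculation
    x2 = x_train[y_train == 2]   # class 2: microcirculatory dysfunction
    p_thresh = (x1.size * x1.mean() + x2.size * x2.mean()) / (x1.size + x2.size)
    low_class = 1 if x1.mean() < x2.mean() else 2   # class whose mean lies below the threshold
    return np.where(x_test < p_thresh, low_class, 3 - low_class)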
We then investigate our GRLVQ-based scheme. The numbers of prototypes are set to $M_1 = 2$ and $M_2 = 5$, and the learning rates are kept constant, $\eta^+ = \eta^- = 0.1$, $\eta = 0.01$. At every stage of the feature reduction, the prototypes $w^k$ and the weighting factors $\lambda$ are computed 50 times to obtain more reliable results. The average weighting factors of the 21 parameters are first obtained for the coarse feature reduction:
$\lambda$ = (0.3778, 0.3105, 0.0626, 0.0576, 0.0454, 0.0271, …), where the negligible factors have been omitted. With $\lambda_{thresh} = 1/(5n) = 0.0095$, the 10 most unimportant parameters are pruned; the other 11 parameters are retained, including CRRf(0), CFR, CRRf(3), CRRf(2), CRRf(1) and Qf(1) (listed according to their rankings from high to low).
Table 1. The mean and standard deviation of classification accuracy on the test set with the single-parameter method
Parameter   Accuracy          Parameter   Accuracy
CRRf(0)     0.8499 ± 0.0384   CRRf(2)     0.6961 ± 0.0339
CFR         0.8215 ± 0.0368   Qf(0)       0.6791 ± 0.0644
CRRf(3)     0.7033 ± 0.0390   Qf(3)       0.6693 ± 0.0479

Table 2. The mean and standard deviation of classification accuracy on the test set with the GRLVQ-based scheme. Only the accuracy at feature dimensionalities of 21, 11, 8, 5, 3 and 1 is listed.
Feature Dimensionality   Accuracy          Feature Dimensionality   Accuracy
21                       0.8675 ± 0.0411   5                        0.8833 ± 0.0320
11                       0.8815 ± 0.0489   3                        0.8985 ± 0.0205
8                        0.8764 ± 0.0296   1                        0.8648 ± 0.0418
Fig. 1. The classification accuracy on the test set with the GRLVQ-based scheme; the accuracy varies with the feature dimensionality.
It is found that the importance of a single parameter is not equivalent to the importance of the same parameter when it works together with others. Subsequently, the subtle feature reduction is carried out to prune the remaining 11 features one by one, recording the current classification accuracy $R_A$ at each step. Finally, the maximum of $R_A$ is found to be 0.8985 (see Table 2 and Fig. 1), and the corresponding 3 features are determined as the ultimately selected features, which are, in order, CFR, CRRf(0) and CRRf(3). The final average weighting factors for these 3 parameters are
$\lambda$ = (0.3589, 0.3373, 0.3038). This indicates that CRRf(3) strongly increases its contribution to the classification. As Tables 1 and 2 show, the presented GRLVQ-based scheme increases the classification accuracy on the test set by 4.86% compared with the single-parameter method. It also decreases the variance of the accuracy, i.e. it enhances the stability of the classification.
5 Conclusions
In this paper, a novel scheme based on GRLVQ is proposed for discriminating coronary microcirculatory dysfunction from the normal microcirculation. Unlike the traditional single-parameter method, this GRLVQ-based scheme uses multiple parameters to enhance the classification performance. Taking full advantage of the weighting factors and prototypes in GRLVQ, the scheme naturally integrates feature selection and classification, so that it simplifies the classifier and also improves the classification accuracy. On physiological data from animals, the presented scheme is verified to be more effective than the traditional method. Since not enough human disease cases are available, we did not test our scheme on hemodynamic data from humans. We expect to accumulate human cases and investigate the performance of the scheme for clinical application in the future.
Acknowledgement This work was supported by the National Basic Research Program of China (No. 2006CB705700), Natural Science Foundation of China (No.30570488) and Shanghai Science and Technology Plan (No.054119612).
References 1. L'Abbate, A., Sambuceti, G., Haunso, S., Schneider-Eicke, J.: Methods for Evaluating Coronary Microvasculature in Humans. Eur Heart J. 20 (1999) 1300-1313 2. Kern, M.J., Lerman, A., Bech, J., Bruyne, B.D., et al: Physiological Assessment of Coronary Artery Disease in the Cardiac Catheterization Laboratory. Circulation. 114 (2006) 1321-1341 3. McGinn, A.L., White, C.W., Wilson, R.F.: Interstudy Variability of Coronary Flow Reserve. Influence of Heart Rate, Arterial Pressure, and Ventricular Preload. Circulation. 81 (1990) 1319-1330
4. Vassalli, G., Hess, O.M.: Measurement of Coronary Flow Reserve and Its Role in Patient Care. Basic Research in Cardiology. 93 (1998) 339-353 5. Liu, H., Yu, L.: Toward Integrating Feature Selection Algorithms for Classification and Clustering. IEEE Trans. Knowledge and Data Eng. 17 (2005) 491-502 6. Hammer, B., Villmann, T.: Generalized Relevance Learning Vector Quantization. Neural Networks. 15 (2002) 1059-1068 7. Luo, Z., Wang, Y., Wang, W., et al: Coronary Artery Impedance Estimation Based on the Intravascular Ultrasound Technique and Its Experimental Studies. Acta Acustica. 30 (2005) 15-20 8. Villmann, T., Schleif, F., Hammer, B.: Supervised Neural Gas and Relevance Learning in Learning Vector Quantization. Proc. of the Workshop on Self-Organizing Networks (WSOM). (2003) 47-52 9. Strickert, M., Seiffert, U., Sreenivasulu, N.: Generalized Relevance LVQ (GRLVQ) with Correlation Measures for Gene Expression Analysis. Neurocomputing. 69 (2006) 651-659 10. Bian, Z., Zhang, X., et al: Pattern Recognition. 2nd edn. Press of Tsinghua University, Beijing (2000) 87-90
Multiple Signal Classification Based on Genetic Algorithm for MEG Sources Localization* Chenwei Jiang1, Jieming Ma1, Bin Wang1,2, and Liming Zhang1,2 1
Department of Electronics Engineering, Fudan University, Shanghai 200433, China {0272015,042021026,wangbin,lmzhang}@fudan.edu.cn 2 The Research Center for Brain Science, Fudan University, Shanghai 200433, China
Abstract. How to locate neural activation sources effectively and precisely from magnetoencephalographic (MEG) recordings is a critical issue for clinical neurology and brain function research. The multiple signal classification (MUSIC) algorithm and the recursive MUSIC algorithm are widely used to locate multiple dipolar sources from MEG data. The drawback of these algorithms is that they run very slowly when scanning a three-dimensional head volume globally. In order to solve this problem, a novel MEG source localization scheme based on the genetic algorithm (GA) is proposed. First, the scheme uses the global-optimization property of GA to estimate the rough source location. Then, combined with a fine grid in a small area, accurate dipolar source localization is performed. Furthermore, we introduce adaptive crossover and mutation probabilities, a two-point crossover operator, periodical substitution and niche strategies to overcome the tendency of GA to occasionally fall into local optima. Experimental results show that the proposed scheme greatly improves the speed of source localization while keeping satisfactory accuracy.
1 Introduction
Magnetoencephalography is a noninvasive brain-measuring technique with the ability to estimate neural current locations at millisecond temporal resolution. In contrast with other brain imaging techniques, e.g. MRI, CT, SPECT and PET, the temporal resolution of MEG is far superior. How to use MEG signals to locate the neural current sources is therefore an essential issue for understanding both the spatial and the temporal behavior of the brain. Multiple signal classification (MUSIC) [1] and recursive multiple signal classification (R-MUSIC) [2] are two widely used methods for MEG source localization. These two methods search for the MEG sources by scanning every grid point, which unfortunately is quite time-consuming. For example, if the head is modeled as a sphere centered at the origin of the Cartesian coordinate system with a radius of 9 cm, the evaluation has to be repeated 729,000 times to locate one dipole with a precision of one millimeter in one quadrant. To overcome this problem, a scheme
* This research was supported by grants from the National Natural Science Foundation of China (No. 30370392 and No. 60672116).
based on the genetic algorithm (GA) is proposed here to locate current dipoles quickly and precisely. In our scheme, we present a two-step grid-scanning procedure. First, we divide the whole three-dimensional space into large grid cells for a coarse scan and, by means of GA, pick out the cell where the objective function attains its optimum as the scanning range of the next step. Second, we scan the selected area with a fine grid to locate the MEG source positions more precisely. Furthermore, we introduce adaptive crossover and mutation probabilities, a two-point crossover operator, periodical gene substitution and niche strategies into the GA to overcome its tendency to occasionally fall into local optima. We should point out that applying a two-step grid-scanning procedure to overcome the time-consumption problem constitutes the first original contribution of this work, and applying GA to the coarse scan over large grid cells to overcome the occasional convergence to local optima constitutes the second. Experimental results show that the proposed scheme greatly improves the speed of source localization while keeping satisfactory accuracy. The remainder of this paper is organized as follows. We briefly review the inverse localization methods in Section 2 and describe the proposed scheme in Section 3. In Section 4, we present simulation results showing the performance of the proposed scheme. Conclusions are given in Section 5.
2 Inverse Localization Methods
In this section, a least-squares method is applied to the MEG source localization problem:
$$F = \| B - B' \|, \qquad (1)$$
where $B$ is the magnetic field detected by the Superconducting Quantum Interference Device (SQUID) sensors outside the head, and $B'$ is the model magnetic field derived from the parameter $P\{x, y, z\}$ through the equation proposed in [1],
$$B = [\,G(p_1) \; \cdots \; G(p_K)\,]\, Q^T, \qquad (2)$$
where $Q^T = [\,Q_1^T, \ldots, Q_i^T, \ldots, Q_K^T\,]$ and $Q_i$ is the moment of the $i$th dipole. Using the SVD to decompose $Q_i^T$, we obtain $Q_i^T = u_i \sigma_i v_i^T$, so that $B_i$ can be expressed as
$$B_i = G(p_i)\, Q_i^T = G(p_i)\, u_i \sigma_i v_i^T = a(p_i, \theta_i)\, s_i^T, \qquad (3)$$
where $p_i$ represents the location of the dipole, $\theta_i$ its orientation, and $s_i^T$ its current strength. Decomposing $B_i$ as described in [2] yields subspace correlations between the subspaces spanned by $A$ and $\Phi_S$. We denote the function computing the subspace correlations by
$$\{c_1, c_2, \ldots, c_k\} = \mathrm{subcorr}\{A, \Phi_S\}, \qquad (4)$$
which returns the ordered set $\{c_1, c_2, \ldots, c_k\}$ of subspace correlations. The problem of Eq. (1) then reduces to finding a set of parameters $P\{x, y, z\}$ that maximizes the sum of $\{c_1, c_2, \ldots, c_k\}$. In [2], an enumerative method is adopted: to obtain high computational precision, a sufficiently dense grid has to be designed over the volume of the head, but it is quite time-consuming to evaluate every grid point.
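The subspace correlations of Eq. (4) are the cosines of the principal angles between the column spaces of the gain matrix A and the signal subspace Phi_S. The following Python sketch shows the standard SVD-based way of computing them; it is our own illustration and not code from [1] or [2].

import numpy as np

def subcorr(A, Phi_S):
    # Ordered subspace correlations between span(A) and span(Phi_S).
    # A: (m, p) gain matrix; Phi_S: (m, r) basis of the signal subspace.
    Qa, _ = np.linalg.qr(A)        # orthonormal basis of span(A)
    Qs, _ = np.linalg.qr(Phi_S)    # orthonormal basis of span(Phi_S)
    s = np.linalg.svd(Qa.T @ Qs, compute_uv=False)
    return np.sort(s)[::-1]        # c1 >= c2 >= ... (cosines of principal angles)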
3 The Proposed Scheme
3.1 Method Based on GA
The least-squares method mentioned above requires nonlinear multidimensional searches to find the unknown parameter $P\{x, y, z\}$. The grid-scanning method reviewed above leads to our proposed "two-step" grid-scanning approach: although the grid scan achieves high precision, it is quite time-consuming. Therefore, we present a GA-based MUSIC approach to locate the MEG sources. Before performing the refined grid scan, we first use the GA, taking advantage of its excellent global searching ability and high convergence speed, in a coarse grid scan to find the area where the least-squares criterion attains its optimum for the next scanning step. After that, we use the refined grid-scanning method to locate the MEG sources more accurately. This approach ensures that the final results have the same precision as the ordinary grid-scanning method because of the second, refined grid-scanning step, and it also favors the global optimum because of the first, GA-based grid-scanning step. As a result, the whole procedure greatly improves the speed of source localization while its accuracy remains satisfactory. The GA is used here to find the parameter $P\{x, y, z\}$ which maximizes the sum of $\{c_1, c_2, \ldots, c_k\}$. The GA procedure is as follows:
Step 1. Generate the initial population, which has N chromosomes in total.
Step 2. Evaluate the objective function and obtain each chromosome's fitness $P_i$.
Step 3. Find the best chromosome individual $N_{max}$.
Step 4. Check whether the fitness of the best answer satisfies $P_i > P_{req}$. If so, go to step 7.
Step 5. Apply the GA operators, comprising the roulette-wheel selection operator, the one-point crossover operator and the mutation operator.
Step 6. Check whether this is the last generation. If not, go to step 2.
Step 7. End the GA operation and begin the fine grid scan.
From the simulation experiments we found that premature convergence occasionally occurred and the GA-based MUSIC approach sometimes converged to a local minimum. In order to solve this problem, we improve the approach by adaptive crossover and mutation probabilities, a two-point crossover operator, periodical substitution and niche strategies.
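A skeleton of this two-step procedure in Python is sketched below. It is illustrative only: fitness (the sum of subspace correlations for a candidate source position) and evolve (one generation of selection, crossover and mutation, cf. Sect. 3.2) are assumed to be supplied by the caller, and the grid spacings are arbitrary example values.

import numpy as np

def locate_source(fitness, evolve, bounds, coarse=1.0, fine=0.1,
                  pop_size=100, generations=200):
    # Step 1: GA over coarse grid cells picks the most promising cell.
    rng = np.random.default_rng()
    pop = [np.array([rng.uniform(lo, hi) for lo, hi in bounds])
           for _ in range(pop_size)]
    for _ in range(generations):
        scores = [fitness(np.round(p / coarse) * coarse) for p in pop]
        pop = evolve(pop, scores)
    best = max(pop, key=lambda p: fitness(np.round(p / coarse) * coarse))
    cell = np.round(best / coarse) * coarse

    # Step 2: exhaustive fine-grid scan inside the selected cell.
    axes = [np.arange(c - coarse / 2, c + coarse / 2 + 1e-9, fine) for c in cell]
    grid = np.stack(np.meshgrid(*axes), axis=-1).reshape(-1, len(bounds))
    return grid[np.argmax([fitness(g) for g in grid])]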
3.2 Improved GA Method
3.2.1 Adaptive Crossover and Mutation Probability
We introduce a constant $k$, adjusted by experiment, to represent the maturity of the GA population; by this means the crossover and mutation probabilities are made adaptive. Here $f_{max}$ is the maximum fitness of the population and $f_{avg}$ is its average fitness. When $(f_{max} - f_{avg})/f_{avg} > k$, we consider the GA to be in the early stage of the evolution: the fitness of the population has a high diversity, so a large crossover probability and a small mutation probability can be used to accelerate convergence. When $(f_{max} - f_{avg})/f_{avg} \le k$, we consider the GA to be in the later stage of the evolution, where the fitness of the population has a low diversity, so a small crossover probability and a large mutation probability can be used to widen the search range and avoid falling into local optima. In the early stage of the evolution, $p_c$ and $p_m$ are given as follows:
$$P_c = P_{c0} \times \exp\!\left( \left( \frac{f_{max} - f_{avg}}{f_{avg}} - k \right) \Big/ \frac{f_{max} - f_{avg}}{f_{avg}} \right), \qquad (5)$$
$$P_m = P_{m0} \times \exp\!\left( -\left( \frac{f_{max} - f_{avg}}{f_{avg}} - k \right) \Big/ \frac{f_{max} - f_{avg}}{f_{avg}} \right). \qquad (6)$$
Here $P_{c0}$ and $P_{m0}$ are the initial values of $p_c$ and $p_m$. In the later stage of the evolution, $p_c$ and $p_m$ are given as follows:
$$P_c = P_{c0} \times \exp\!\left( \left( \frac{f_{max} - f_{avg}}{f_{avg}} - k \right) \Big/ k \right), \qquad (7)$$
$$P_m = P_{m0} \times \exp\!\left( -\left( \frac{f_{max} - f_{avg}}{f_{avg}} - k \right) \Big/ k \right). \qquad (8)$$
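Equations (5)-(8) translate directly into a small helper function. The Python sketch below is our own illustration; the default initial probabilities are the values used later in Sect. 4.

import math

def adaptive_probabilities(f_max, f_avg, k, pc0=0.4, pm0=0.05):
    # r measures the maturity (diversity) of the current population
    r = (f_max - f_avg) / f_avg
    if r > k:                                 # early stage of the evolution
        pc = pc0 * math.exp((r - k) / r)      # Eq. (5): enlarge crossover probability
        pm = pm0 * math.exp(-(r - k) / r)     # Eq. (6): shrink mutation probability
    else:                                     # later stage of the evolution
        pc = pc0 * math.exp((r - k) / k)      # Eq. (7): shrink crossover probability
        pm = pm0 * math.exp(-(r - k) / k)     # Eq. (8): enlarge mutation probability
    return pc, pm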
3.2.2 Two-Point Crossover Operator
When the GA falls into premature convergence, the crossover operator becomes ineffective: the genes remain the same as before after the crossover operation. In the two-point crossover operator, we pick two points in the chromosome at random; the genes between the two points are exchanged while the rest are not, which helps the GA escape from the local optimum when premature convergence occurs.
3.2.3 Periodical Substitution Strategy
In order to avoid converging to a local minimum, we also present a periodical substitution strategy. Every ten generations, we insert 10% new random chromosomes to substitute the old ones with the smallest fitness. This approach increases the population diversity.
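The two operators can be sketched as follows (Python, illustrative only; chromosomes are assumed to be binary strings represented as lists of bits, matching the 24-bit encoding used in Sect. 4).

import random

def two_point_crossover(parent_a, parent_b):
    # Exchange the genes between two randomly chosen cut points.
    i, j = sorted(random.sample(range(len(parent_a)), 2))
    child_a = parent_a[:i] + parent_b[i:j] + parent_a[j:]
    child_b = parent_b[:i] + parent_a[i:j] + parent_b[j:]
    return child_a, child_b

def periodical_substitution(population, fitness, generation, period=10, fraction=0.1):
    # Every `period` generations, replace the worst 10% by random chromosomes.
    if generation % period != 0:
        return population
    n_new = max(1, int(fraction * len(population)))
    ranked = sorted(population, key=fitness, reverse=True)   # best first
    length = len(population[0])
    newcomers = [[random.randint(0, 1) for _ in range(length)] for _ in range(n_new)]
    return ranked[:-n_new] + newcomers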
3.2.4 Niche Strategy
The main concept of the niche strategy is that only one chromosome may lie within a range L of another. The niche strategy therefore maintains the population diversity, keeps every chromosome at a distance from the others and lets the fitness be distributed over the whole three-dimensional head volume. Taking advantage of the niche strategy, we can avoid local optima effectively and enhance the searching ability of the whole algorithm. The operating procedure of the niche strategy is as follows:
Step 1. The total of M chromosomes in the population are linearly ordered by their fitness values and the first N are memorized.
Step 2. Apply the GA operators and generate the M chromosomes of the next generation.
Step 3. Arrange the total of M+N chromosomes, where M is the number of new chromosomes and N the number of memorized ones, in sequence according to their fitness values. The Hamming distance between every two chromosomes is computed as
$$\| X_i - X_j \| = \sum_{k=1}^{M} ( x_{ik} - x_{jk} )^2, \qquad (i = 1, 2, \ldots, M+N-1;\; j = i+1, \ldots, M+N). \qquad (9)$$
When $\| X_i - X_j \| < L$, we compare the fitness of these two chromosomes, and the one with the smaller fitness is reset as
$$F_{min}(x_i, x_j) = F_{min}(x_i, x_j) \times 10^{-3}. \qquad (10)$$
Step 4. Rearrange the new M chromosomes and the memorized N chromosomes in sequence by fitness values, and memorize the first M of them.
Step 5. Go to the next generation of the GA.
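A compact Python sketch of the niche penalty of Eqs. (9)-(10) is given below; it is our own illustration, with chromosomes decoded to real-valued vectors and the distance computed exactly as written in Eq. (9).

import numpy as np

def apply_niche_penalty(individuals, fitnesses, L):
    # Penalize the weaker of any pair of individuals closer than L.
    individuals = np.asarray(individuals, dtype=float)
    fitnesses = np.asarray(fitnesses, dtype=float).copy()
    for i in range(len(individuals) - 1):
        for j in range(i + 1, len(individuals)):
            dist = np.sum((individuals[i] - individuals[j]) ** 2)   # Eq. (9)
            if dist < L:
                weaker = i if fitnesses[i] < fitnesses[j] else j
                fitnesses[weaker] *= 1e-3                           # Eq. (10)
    return fitnesses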
4 Experimental Results
Simulation experiments are used to evaluate the performance of the proposed method. We implemented the program in Matlab 7 on a PC with a Pentium 4 1.7 GHz CPU. A standard arrangement of 37 radial SQUID sensors is used. The array has one sensor at θ = 0, a ring of six sensors at θ = π/8, ϕ = kπ/3, k = 0, …, 5, a
ring of twelve sensors at θ = π/4, ϕ = kπ/6, k = 0, …, 11, and a ring of eighteen sensors at θ = 3π/8, ϕ = kπ/9, k = 0, …, 17. They are distributed on the upper region of a 9 cm single-shell sphere, as shown in Fig. 1. In the simulation, a complete MEG model comprising the model of the primary current sources, a dipole-in-a-sphere head model, the magnetoconductivity, etc. [3]-[6] is utilized. The dipoles are assumed to have fixed locations and orientations, whereas the current strengths are allowed to change in time according to a parametric model. Based on this model, we obtain a 37 × 500 simulated spatio-temporal MEG data set. Both the R-MUSIC method and the proposed method are used in the experiment to locate the MEG sources. A 24-bit binary chromosome is adopted to represent a dipole position; every 8 bits represent one coordinate in one quadrant.
Fig. 1. The location distribution of 37 SQUID sensors
Table 1. Localization results by the R-MUSIC method (3 dipoles)
Data     True Locations (cm)            Estimated Locations (cm)
         X        Y        Z            X        Y        Z
Data 1   0.0038   0.0224   1.2285       0.1000   0.1000   1.3000
         0.4110   0.6985   1.5985       0.4000   0.7000   1.6000
         0.8934   2.0409   0.4404       0.9000   2.0000   0.5000
T (s): 11142.125

Table 2. Localization results by the proposed method (3 dipoles)
Data     True Locations (cm)            Estimated Locations (cm)
         X        Y        Z            X        Y        Z
Data 1   0.0038   0.0224   1.2285       0.0000   0.0000   1.2000
         0.4110   0.6985   1.5985       0.4000   0.7000   1.6000
         0.8934   2.0409   0.4404       0.9000   2.0000   0.4000
T (s): 475.265
Table 3. Comparison between the R-MUSIC method and the proposed method (3 dipoles)
Method                Generation Number   Population Number   Average Accuracy (cm)   Average Time (s)
R-MUSIC               /                   /                   0.0384                  11115
The proposed method   200                 100                 0.0309                  475
We randomly pick 10 groups of MEG data with 3 dipoles for the simulation experiment; one of them is listed in Tables 1 and 2. The chromosome population size is set to 100 and the number of generations to 200. The initial crossover probability pc is set
to 0.4 and the mutation probability pm to 0.05. From the results shown in Tables 1 and 2, our method achieves the same precision as the R-MUSIC method. From the comparison between the R-MUSIC method and the proposed method in Table 3, the proposed method is much faster in locating the MEG sources: it yields the same results as R-MUSIC, but the average execution time is only about 1/23 of that of R-MUSIC. Furthermore, as the number of sources increases, the advantage of the proposed method over the R-MUSIC method will grow.
5 Conclusions
In this paper, we proposed a MEG source localization scheme based on GA. The experimental results from the simulation show that the source localization can be speeded up greatly and that, combined with a fine grid in a small area, more accurate results can be obtained. Locating MEG sources precisely and quickly with GA will contribute to further applications of MEG.
References 1. Mosher, J. C., Lewis, P. S., Leahy R. M.: Multiple Dipole Modeling and Localization from Spatio-temporal MEG Data. IEEE Transactions on Biomedical Engineering 39 (6) (1992) 541-557 2. Mosher, J. C., Leahy, R. M.: Recursive MUSIC: A Framework for EEG and MEG Source Localization. IEEE Transactions on Biomedical Engineering 45 (11) (1998) 1342-1354 3. Cuffin, B. N.: Effects of Head Shape on EEG’s and MEG’s. IEEE Transactions on Biomedical Engineering 37 (1) (1990) 44-52 4. Crouzeix, A., Yvert, B., Bertrand, O., Pernier, J.: An Evaluation of Dipole Reconstruction Accuracy with Spherical and Realistic Head Models in MEG. Clinical Neurophysiology 110 (12) (1999) 2176-2188 5. Mosher, J. C., Leahy, R. M., Lewis, P. S.: EEG and MEG: Forward Solutions for Inverse Methods. IEEE Transactions on Biomedical Engineering 46 (3) (1999) 245-259 6. Aleksandar, D., Arye, N.: Estimating Evoked Dipole Responses in Unknown Spatially Correlated Noise with EEG MEG Arrays. IEEE Transactions on Signal Processing 48 (1) (2000)
Registration of 3D FMT and CT Images of Mouse Via Affine Transformation with Bayesian Iterative Closest Points Xia Zheng1, Xiaobo Zhou2,3, Youxian Sun1, and Stephen T.C. Wong2,3 1
Zhejiang University, National Laboratory of Industrial Control Technology, Hangzhou 310027, P.R. China
[email protected] 2 HCNR-CBI, Harvard Medical School and Brigham and Women’s Hospital, Boston, MA 02215, USA 3 Functional Molecular Imaging Center, Brigham and Women’s Hospital, MA 02115, USA
Abstract. It is difficult to directly co-register the 3D FMT (Fluorescence Molecular Tomography) image of a small tumor in a mouse, whose maximal diameter is only a few mm, with a larger CT image of the entire animal, which spans about ten cm. This paper proposes a new method that first registers the 2D flat image and the projected CT image in order to facilitate the registration between small 3D FMT images and large CT images. A novel algorithm, Bayesian Iterative Closest Point (BICP), is introduced and validated for 2D affine registration. The visualization of the alignment of the 3D FMT and CT images through 2D registration shows promising results that could lead to automated 3D registration.
1 Introduction

Mouse models of human cancer have dramatically improved over the past decade with the development of molecular imaging techniques, which characterize and measure molecular events in living animals with high sensitivity and spatial resolution [1]. Imaging these molecular events is achieved by using innovative imaging agents, which include "smart sensor probes" that can be activated upon interaction with their biological targets [2]. The progress of molecular imaging has been accelerated by the development of dedicated small animal imaging equipment for microcomputed tomography (CT) and optical imaging. Optical imaging has seen exciting developments in recent years, such as optical tomography, which, in contrast to reflectance imaging, is not surface-weighted [3, 4]. Fluorescence molecular tomography (FMT) is one such technique that is capable of resolving molecular functions in deep tissues by reconstructing the in vivo distribution of intravenously injected far red and near infrared fluorescent probes [5]. 3D FMT images, however, contain only tumor information in the mouse and carry little anatomical information. Thus, we need to align and fuse FMT images with full animal CT or MR images in order to reveal fine anatomical structures [6].
Images acquired from these modalities have different intensities and spatial resolutions in all three directions. To solve the registration problem of multimodal images, maximization of mutual information of voxel intensities has been proposed and demonstrated to be a powerful method, allowing fully automated, robust and accurate registration of multimodal images in a variety of applications without the need for segmentation or other preprocessing of the images [7]. As for registration of 3D FMT and CT images of the mouse, a 3D FMT image only reveals specific tumors without much anatomical information; its maximum diameter is only a couple of mm and it is much smaller than the CT image of the whole animal, which is about ten cm long. This makes directly aligning both images efficiently and precisely a challenging problem. More information is needed in order to facilitate the registration. Fortunately, 2D flat images (photographs) and 3D FMT images can be acquired in series using the planar fluorescence reflectance imaging/FMT system without moving the subject [8]. Therefore, their spatial relationship is known after acquisition, and FMT images can be superimposed onto the surface anatomical features to create a rough context. Our observation is that 2D flat images can be employed to bridge the gap between small 3D FMT images and large CT images of the same animal. Mutual information represents a leading technique to register multimodal images, but this method requires that the reference and moving images have somewhat similar features, either identical or at least statistically dependent [9]. Unfortunately, the flat image and the projected 3D CT image have no such similarity. Another registration approach is feature-based matching, which is typically applied when the local structural information is more significant than the information carried by the image intensities. It allows registering images of completely different natures and can handle complex between-image distortions. Hence, feature-based registration methods are employed here. The features of the two images are represented by their segmentation results -- two point sets which will be registered by our novel Bayesian Iterative Closest Points (BICP). The final registration results demonstrate that our method is effective in automatically aligning the flat and the projected 3D CT images. The remainder of this paper is organized as follows. The registration problem and formulation are stated in Section 2, and the proposed parameter estimation algorithm is presented in Section 3. The simulation and experimental results are shown in Section 4. Finally, Section 5 concludes this paper.
2 Problem Statement and Formulation

Since 3D FMT images contain only tumor information in the mouse and carry little anatomical information, it is difficult to co-register 3D FMT and 3D CT images. While 3D FMT images are acquired, flat images can be acquired and matched with the 3D FMT images to give an anatomical context of the mouse in 2D. If we register the 2D flat images with the 3D CT images, we can roughly align the 3D FMT and CT images in 2D. The resultant transformation can then be employed as a starting point for further alignment in all three dimensions. A 2D flat image is basically a photograph, i.e., a projected surface image, of the mouse. We project the 3D CT images to obtain a 2D projection image that is similar to the flat image. Only the boundary information in both 2D images is the same and can be used for registration. The registration procedure is described here in detail. The first step is to segment the 2D flat image. The boundary of the mouse, denoted by Cr, is obtained by segmenting the 2D flat image with the gradient vector flow (GVF) snake method proposed by Chenyang Xu and Jerry Prince [13]. The second step is to project the 3D CT images to obtain a 2D CT image corresponding to the flat image. The projected image is denoted by P(θ; I), where I is the 3D CT image, θ = (θ1, θ2) are the adjustable projection parameters (two rotation angles about axes in the coronal plane of the CT images), and P(θ; I) is the projection operation on the 3D CT image I. A proper θ is chosen to make the projected CT image as similar as possible to the flat image, as exemplified in Fig. 1: Fig. 1(b) resembles the 2D flat image more than Fig. 1(c). After projection, the CT image P(θ; I) is then segmented to obtain the boundary of the mouse for registration, denoted by Cm(θ).
Fig. 1. (a) 2D flat image, (b) projection of the 3D CT image with one θ, (c) projection of the 3D CT image with another θ
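The projection operation P(θ; I) is not given in code in the paper; the sketch below is one plausible realization with SciPy, rotating the CT volume about two in-plane axes and then taking a maximum-intensity projection. The axis choices and the use of a maximum projection are our assumptions.

import numpy as np
from scipy.ndimage import rotate

def project_ct(volume, theta1, theta2):
    # Hypothetical P(theta; I): rotate the 3D CT volume by two angles (degrees)
    # about axes in the coronal plane, then project along the viewing direction.
    rotated = rotate(volume, theta1, axes=(0, 2), reshape=False, order=1)
    rotated = rotate(rotated, theta2, axes=(1, 2), reshape=False, order=1)
    return rotated.max(axis=2)   # maximum-intensity projection (assumed viewing axis)

# Toy example with a random volume standing in for the thorax CT block.
ct = np.random.rand(64, 64, 64)
flat_like = project_ct(ct, 20.0, 13.0)
print(flat_like.shape)  # (64, 64)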
The third step is to register the two 2D images using the two point sets Cr and Cm(θ). We denote the reference point set and the moving point set by Cr = {r_i | i = 1, …, Nr} and Cm(θ) = {m(θ)_j | j = 1, …, Nm}, where r_i and m(θ)_j are pixel positions with their elements stored as vectors, and Nm < Nr. We assume that the relationship between the two point sets is a 2D Euclidean affine transformation comprising six parameters. The task of the registration is to determine the transformation parameters λ = [λ1, λ2, λ3, λ4, λ5, λ6] and the adjustable projection parameters θ introduced above which best align the moving point set and the reference point set.
Iteration is used to optimize our problem. For each iteration, the 3D CT image must be automatically projected according to the adjustable projection parameters θ. This poses another problem: automatically segmenting the projected CT image during the iteration. In order to get a good segmentation result, we initialize the contour position as close as possible to the true boundary. Morphological operations including opening and closing are performed on the binary version of the projected CT image, and the boundary of the resulting binary image is employed as the starting position of the GVF snake model. After segmentation of the images, we get two point sets and then register them. What makes the problem difficult is that the correspondences between the point sets are unknown a priori. A popular approach to this problem is the class of algorithms based on the Iterative Closest Point (ICP) technique introduced by Besl [14]. We employ a k-dimensional (k-d) tree to find, among the reference points, the closest point to each of the Nm moving points after the affine transformation; these correspondence points are denoted by Cc = {c_j | j = 1, …, Nm}. Now we can formulate our problem as follows:

Y = D λ + n_t .    (1)

That is,

[ m_{1,1}   m_{1,2}  ]     [ 1  c_{1,1}   c_{1,2}  ]   [ λ5  λ6 ]
[ m_{2,1}   m_{2,2}  ]  =  [ 1  c_{2,1}   c_{2,2}  ] × [ λ1  λ3 ] + [ n_1  n_2 ] ,    (2)
[   ...       ...    ]     [ ...   ...      ...    ]   [ λ2  λ4 ]
[ m_{Nm,1}  m_{Nm,2} ]     [ 1  c_{Nm,1}  c_{Nm,2} ]

where m_{i,1} and m_{i,2} denote the two components of a moving point m_i, λ_1 = [λ5, λ1, λ2]^T, λ_2 = [λ6, λ3, λ4]^T, and the two noise components are assumed to be anisotropic and normally distributed:

n_t ∼ N( [0, 0]^T , diag(σ1², σ2²) ) .    (3)

We assume here that λ = [λ_1, λ_2], σ = [σ1, σ2] and Y = [Y_1, Y_2], where Y_i is the i-th column of Y.
3 Parameter Estimation

3.1 Prior Distributions

The traditional ICP method is widely used in point set registration, but it is sensitive to noise and outliers. We found that there were too many identical points in the correspondence point set Cc when the traditional ICP failed. The contribution of our
BICP is to penalize this situation by introducing a hyper-parameter δ². We follow a Bayesian approach to estimate the parameters [15]. The overall parameter space is Θ = λ × σ × δ, where δ is a hyper-parameter which will be explained at the end of this section. Given the two point sets, our objective is to estimate Θ. The Bayesian inference of Θ is based on the joint posterior distribution p(λ, σ², δ² | Y, Cc). The joint distribution of all variables is:

p(λ, σ², δ², Y | Cc) = p(Y | λ, σ², δ², Cc) p(λ | σ², δ², Cc) p(σ² | δ², Cc) p(δ² | Cc) .    (4)

Under the assumption of independent moving points given (λ, σ², δ²), we have

p(Y | λ, σ², δ², Cc) = ∏_{i=1}^{2} p(Y_i | λ_i, σ², δ², Cc)
                     = ∏_{i=1}^{2} (2πσ_i²)^{−Nm/2} exp( −(1/(2σ_i²)) (Y_i − Dλ_i)^T (Y_i − Dλ_i) ) .    (5)

We assume the following structure for the prior distribution:

p(λ, σ², δ²) = p(λ_{1:2} | σ², δ²) p(σ² | δ²) p(δ²) = p(λ_{1:2} | σ², δ²) p(σ²) p(δ²) ,    (6)

where p(σ²) = ∏_{i=1}^{2} p(σ_i²) and each σ_i² is distributed according to the conjugate inverse-Gamma prior σ_i² ∼ IG(0, 0) = 1/σ_i². Given (σ², δ²) we introduce the following prior distribution:

p(λ | σ², δ²) = ∏_{i=1}^{2} |2πσ_i² Σ_i|^{−1/2} exp( −(1/(2σ_i²)) λ_i^T Σ_i^{−1} λ_i ) ,    (7)

where Σ_i^{−1} = δ_i^{−2} D^T D. This shows that, conditional upon (σ², δ²), the coefficients λ_i are assumed to be zero-mean Gaussian with covariance σ_i² Σ_i. As mentioned before, Cc, selected from the reference point set, is the correspondence point set of the moving point set. We use the iterative closest point method to determine Cc. Accordingly, it is possible that the correspondence points of several different moving points are identical due to noise, which makes the iteration converge to a local optimum. The extreme case is that all the moving points correspond to an identical point; in that situation the determinant of Σ_i^{−1} tends to zero. To penalize this situation we introduce the term δ² ∈ (R⁺)², with p(δ²) = ∏_{i=1}^{2} p(δ_i²), where δ_i² ∼ IG(α_{δ²}, β_{δ²}).
3.2 Formulation for Sampling
According to Bayes' theorem:

p(λ, σ², δ² | Y, Cc) ∝ p(Y | λ, σ², δ², Cc) p(λ, σ², δ²)
  ∝ ∏_{i=1}^{2} (2πσ_i²)^{−Nm/2} exp( −(1/(2σ_i²)) (Y_i − Dλ_i)^T (Y_i − Dλ_i) )
    × [ ∏_{i=1}^{2} |2πσ_i² Σ_i|^{−1/2} exp( −(1/(2σ_i²)) λ_i^T Σ_i^{−1} λ_i ) ] [ ∏_{i=1}^{2} (σ_i²)^{−1} ]
    × [ ∏_{i=1}^{2} (δ_i²)^{−(α_{δ²}+1)} exp( −β_{δ²}/δ_i² ) ] .    (8)

We can proceed to multiply the exponential terms to obtain

p(λ, σ², δ² | Y, Cc) ∝ [ ∏_{i=1}^{2} (2πσ_i²)^{−Nm/2−1} exp( −(1/(2σ_i²)) Y_i^T K_i Y_i ) ]
    × [ ∏_{i=1}^{2} |2πσ_i² Σ_i|^{−1/2} exp( −(1/(2σ_i²)) (λ_i − h_i)^T M_i^{−1} (λ_i − h_i) ) ]
    × [ ∏_{i=1}^{2} (δ_i²)^{−(α_{δ²}+1)} exp( −β_{δ²}/δ_i² ) ] ,    (9)

where:

M_i^{−1} = D^T D + Σ_i^{−1} ,   h_i = M_i D^T Y_i ,   K_i = I_{Nm} − D M_i D^T .    (10)

We recall that p(λ, σ², δ² | Y, Cc) = p(λ_{1:2} | σ², δ², Y, Cc) p(σ² | δ², Y, Cc) p(δ² | Y, Cc). It follows that for i = 1, 2, σ_i² and λ_i are distributed according to:

σ_i² | (δ², Y, Cc) ∼ IG( Nm/2 + 1 , Y_i^T K_i Y_i / 2 ) ,    (11)

λ_i | (σ², δ², Y, Cc) ∼ N( h_i , σ_i² M_i ) .    (12)

Finally, it is easy to derive the following expression from equation (8):

δ_i² | (λ, σ², Y, Cc) ∼ IG( α_{δ²} + 3/2 , β_{δ²} + (1/(2σ_i²)) λ_i^T D^T D λ_i ) .    (13)

Then we can sample the parameters according to equations (11)-(13) to estimate them.
3.3 Estimate of the Projection Parameter θ

In our case we have an additional adjustable projection parameter θ which contributes to many local minima. We therefore turn to global optimization methods, among which the Differential Evolution (DE) algorithm is a simple and efficient adaptive scheme for global optimization over continuous spaces [11]. We combine DE and BICP to estimate (λ, θ). That is, for every specified θ the two point sets are extracted; then BICP is used to register the 2D images, and the resultant registration error is regarded as the value of the DE objective function for that θ.

3.4 The Initial Position
Another problem we have to consider in the BICP is estimating the global initial values λ^(0), one component of which is the pair of translation parameters. Since our algorithm is based on iterative calculation, the initial alignment of {m(θ)_j}_{j=1}^{Nm} and {r_i}_{i=1}^{Nr} affects the convergence rate and precision. Registration of the geometric centers of both point sets solves their initial shift such that:

[λ5^(0), λ6^(0)] = (1/Nr) Σ_{i=1}^{Nr} r_i − (1/Nm) Σ_{j=1}^{Nm} m(θ)_j .    (14)

The other initial affine parameters [λ1^(0), λ2^(0), λ3^(0), λ4^(0)] are set to [1 0 0 1]. We summarize the components of the parameter estimation procedure as follows:
• DE method over θ, with the BICP method as its objective function.
• BICP method to register the 2D images for a specified θ:
  1. Initialization: set λ^(0), δ_i^{2(0)}, σ^(0) and i = 1.
  2. Iteration i:
     a. Find the correspondences for λ using the iterative closest point method.
     b. Sample σ, λ, δ from equations (11)-(13).
  3. i = i + 1 and go to step 2.
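The paper gives the BICP loop only as the pseudo-steps above; the sketch below is our own minimal interpretation of it in Python, using a k-d tree for the correspondence search and the conditional distributions (11)-(13) for the sampling. The function name, fixed iteration count, and the direction in which the geometric-centre initialization is applied are assumptions, not the authors' implementation.

import numpy as np
from scipy.spatial import cKDTree

def bicp(ref, mov, n_iter=50, alpha_d=2.0, beta_d=20.0, seed=0):
    # ref: (Nr, 2) reference points; mov: (Nm, 2) moving points.
    rng = np.random.default_rng(seed)
    Nm = mov.shape[0]
    lam = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # rows: [l5 l6], [l1 l3], [l2 l4]
    lam[0] = mov.mean(axis=0) - ref.mean(axis=0)           # geometric-centre alignment (cf. eq. (14))
    sigma2, delta2 = np.ones(2), np.ones(2)
    for _ in range(n_iter):
        predicted = np.c_[np.ones(len(ref)), ref] @ lam    # where each reference point currently maps to
        _, idx = cKDTree(predicted).query(mov)             # k-d tree closest-point correspondences
        D = np.c_[np.ones(Nm), ref[idx]]                   # design matrix of eq. (2)
        for i in range(2):
            Sigma_inv = (D.T @ D) / delta2[i]
            M = np.linalg.inv(D.T @ D + Sigma_inv)
            h = M @ D.T @ mov[:, i]
            K = np.eye(Nm) - D @ M @ D.T
            # eq. (11): sigma_i^2 | . ~ IG(Nm/2 + 1, Y_i' K_i Y_i / 2)
            sigma2[i] = 1.0 / rng.gamma(Nm / 2 + 1, 2.0 / (mov[:, i] @ K @ mov[:, i]))
            # eq. (12): lambda_i | . ~ N(h_i, sigma_i^2 M_i)
            lam[:, i] = rng.multivariate_normal(h, sigma2[i] * M)
            # eq. (13): delta_i^2 | . ~ IG(alpha + 3/2, beta + lambda_i' D'D lambda_i / (2 sigma_i^2))
            b = beta_d + lam[:, i] @ (D.T @ D) @ lam[:, i] / (2 * sigma2[i])
            delta2[i] = 1.0 / rng.gamma(alpha_d + 1.5, 1.0 / b)
    return lam, sigma2, delta2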
4 Experimental Results

In the experiments below we set α_{δ²} = 2 and β_{δ²} = 20.
4.1 Experiment with Simulated Image Data
First we confirm the effectiveness of our proposed BICP in affine registration. The simulated data are generated as follows:
a. Segment the flat image to get a reference point set Cr.
b. Select a part of the flat image and segment it to create another point set.
c. Apply an affine transformation (λ = [0.8 0 0 1.1 0 0.1]) to the above point set and add noise (zero-mean, normally distributed with standard deviation 0.01) to the transformed result to give a moving point set Cflat_m.
d. Register Cflat_m to Cr with our proposed method.
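A minimal sketch of step (c), assuming the point sets are stored as N×2 arrays; the parameter values follow the text, while the array names and the random stand-in for the segmented sub-part are ours.

import numpy as np

rng = np.random.default_rng(1)
part = rng.random((200, 2)) * 100.0          # stand-in for the segmented sub-part of the flat image

# Affine parameters [l1, l2, l3, l4, l5, l6] from the text, arranged as in eq. (2):
l1, l2, l3, l4, l5, l6 = 0.8, 0.0, 0.0, 1.1, 0.0, 0.1
A = np.array([[l1, l2], [l3, l4]])           # linear part
t = np.array([l5, l6])                       # translation

# Affine transform plus zero-mean Gaussian noise with standard deviation 0.01.
C_flat_m = part @ A.T + t + rng.normal(0.0, 0.01, size=part.shape)
print(C_flat_m.shape)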
The error of a registration is defined as its Euclidean distance from the optimal match. Here the registration error of our method is 1.32 pixels, compared with 1.34 for the traditional ICP. Next we verify our algorithm by registering a synthetic flat image and the 3D CT image. The synthetic flat image is generated as follows:
a. Project a 3D CT image after rotation about two axes according to the adjustable projection parameters θ; here we set them to reasonable values, (20, 13).
b. Transform the projected 2D image with affine transformation parameters λ = [1.2 0.1 0.2 0.95 -3 4].
c. Add random noise (zero-mean, normally distributed with standard deviation η) to the transformed 2D image to get a "practical" simulated flat image.
Following the above steps, the synthetic flat image is generated with η = 0.05, as illustrated in Fig. 2(b).
Fig. 2. (a) Projected image, (b) simulated flat image, and (c) registration result
Additionally, to show the performance under different noise levels, our algorithm is applied to the synthetic flat image and the 3D CT image with different values of η. The final results are shown in Table 1.

Table 1. Registration result with different noise levels

η      Error (pixel)   λ = [λ1, λ2, λ3, λ4, λ5, λ6]                    θ
0.15   0.5058          [1.2421 0.1065 0.2357 0.9660 -3.8202 4.8272]    [22.9 14.3]
0.1    0.4274          [1.2225 0.1065 0.2227 0.9360 -3.8322 4.5470]    [22.6 13.4]
0.05   0.3854          [1.2121 0.0954 0.1964 0.9488 -3.6229 4.2732]    [21.4 13.2]
4.2 Experiment with Real Data
In this experiment, the size of the real flat image is 231 × 341 with a spacing of 0.0185 × 0.0185 cm. For each mouse, we have three 3D CT image sets that cover the entire animal with an overlap. The 3D CT image sets are composed of head, thorax, and pelvis, which are high-resolution images. We only use the image set of the central part, which is our region of interest (ROI). The resolution of the thorax 3D CT image is 512 × 512 × 512, and its voxel spacing is 0.0072 × 0.0072 × 0.0072 cm. The real data are registered in physical coordinates. The final parameters of our algorithm applied to the real data are λ = [1.1959 0.0655 0.6935 -0.0758 1.0636 0.28790] and θ = [-7.1 13.5]. We demonstrate the registration result in Fig. 3. The blue line is the segmentation of the flat image, and the red line is the contour of the projected 3D CT image after rotation according to θ.
Fig. 3. Image registration result
Finally, we fuse the 3D FMT and CT images in the x-y plane using the above preliminary 2D registration result, as shown in Fig. 4.
Fig. 4. (a) Projection and affine transformation of the 3D CT image; (b) projection of the 3D FMT image; (c) fusion
5 Conclusions

This work has contributed to the registration of the flat image and the projected 3D CT image of the mouse in order to reduce the gap between the 3D FMT image and the 3D CT image of the animal. A novel algorithm combining DE and BICP is proposed to optimize this multi-modality image registration problem. We first validated the new registration method on simulated animal images and then applied it to real data obtained experimentally. Future work will investigate the alignment of the 3D FMT and CT images based on the 2D registration result.
Acknowledgement The authors would like to acknowledge the excellent collaboration with their molecular imaging collaborators in this research effort, and, in particular, Dr. Dunham, Joshua M in Ntziachristos’s lab. Research of Xiaobo Zhou is supported by the HCNR Center for Bioinformatics Research Grant, Harvard Medical School. This work is supported by the Academician Foundation of Zhejiang Province (No. 2005A1001-13).
References [1] Jan, G., David, G. K., Stephen, D. W., Carla, F. B. K., Philip, M. S., Vasilis, N., Tyler, J., Ralph, W.: Use of Gene Expression Profiling to Direct in Vivo Molecular Imaging of Lung Cancer. PNAS 102 (40) (2005) [2] Weissleder, R.: Molecular Imaging: Exploring the Next Frontier (Review). Radiology 212 (1999) 609-614 [3] Tung, CH., Mahmood ,U., Bredow, S., et al.: In Vivo Imaging of Proteolytic Enzyme Activity Using a Novel Molecular Reporter. Cancer Res. 60 (2000) 4953-4958 [4] Ntziachristos, V., Tung, CH., Bremer, C., Weissleder, R.: Fluorescence Molecular Tomography Resolves Protease Activity in Vivo. Nat Med 8 (7) (2002) 757-760 [5] Ntziachristos, V., Bremer, C., Graves, E.E., Ripoll, J., Weissleder, R.: In Vivo Tomographic Imaging of Near-infrared Fluorescent probes. Mol Imaging 1 (2002) 82-88 [6] Ntziachristos, V., Ripoll, J., Wang, L.V., Weissleder, R.: Looking and Listening to Light: the Evolution of Whole-body Photonic Imaging. Nat Biotechnol 23 (2005) 313-320 [7] Frederik, M., Dirk, V., Paul, S.: Comparative Evaluation of Multiresolution Optimization Strategies for Multimodality Image Registration by Maximization of Mutual Information. Medical Image Analysis 3 (4) (1999) 373-386 [8] Graves, E.E., et al.: A Submillimeter Resolution Fluorescence Molecular Imaging System for Small Animal Imaging. Med. Phys. 30 (3) (2003) 901-911 [9] Barbara, Z., Jan, F.: Image Registration Methods: A Survey. Image and Vision Computing 21 (2003) 977-1000, [10] Press, W., Teukolsky, S., Vetterling, W., Flannery, B.: Numerical Recipes in C++. Second ed., Cambridge University Press, Cambridge (2002) [11] Rainer, S., Kenneth, P.: Differential Evolution - A Simple and Efficient Adaptive Scheme for Global Optimization over Continuous Spaces. ICSI TECHINCAL Report tr-95-012, (1995) [12] Kass, M., Witkin, A., Terzopoulos, D.: Snake: Active Contour Models. Int. J. Computer Vision 1 (4) (1987) 321-331 [13] Chenyang, X., Jerry, L. P.: Snakes, Shapes, and Gradient Vector Flow. IEEE Transaction on Image Processing 7 (3) (1998) 359-369 [14] Paul, J.B., Neil, D.M.: A Method for Registration of 3D Shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence 14 (2) (1992) 239-255 [15] Andrieu, C., Freitas, J., Doucet, A.: Robust Full Bayesian Learning for Neural Networks. http://www.cs.berkeley.edu/jfgf/software.html (1999)
Automatic Diagnosis of Foot Plant Pathologies: A Neural Networks Approach

Marco Mora1, Mary Carmen Jarur1, Daniel Sbarbaro2, and Leopoldo Pavesi1

1 Department of Computer Science, Catholic University of Maule, Casilla 617, Talca, Chile {mora,mjarur,lpavesi}@spock.ucm.cl http://www.ganimides.ucm.cl/mmora/
2 Department of Electrical Engineering, University of Concepcion, Casilla 160-C, Concepcion, Chile
[email protected]
Abstract. Some foot plant pathologies, like cave and flat foot, are normally detected by a human expert by means of footprint images. Nevertheless, the lack of trained personnel to accomplish such massive first-screening detection efforts precludes routine diagnosis of the above-mentioned pathologies. In this work an innovative automatic system for diagnosing foot plant pathologies based on neural networks (NN) is presented. We propose the use of principal component analysis to reduce the number of inputs to the NN and thereby increase the efficiency of the training algorithm. The results achieved with this system show the feasibility of establishing automatic diagnosis systems based on the footprint image. Such systems are of great value, especially in remote areas, and are also suited to carrying out massive first-screening health campaigns.
1
Introduction
When the foot is planted, not all of the sole is in contact with the ground; the footprint is the surface of the foot plant in contact with the ground. Cave foot and flat foot are pathologies that appear in children around the age of three. If these foot malformations are not detected and treated in time, they become worse during adulthood, producing several disturbances, pain and posture-related disorders [12]. The characteristic form and zones of the footprint are shown in Figure 1a. Zones 1, 2 and 3 correspond to regions in contact with the surface when the foot is planted; these are called the anterior heel, posterior heel and isthmus respectively. Zone 4 does not form part of the contact surface and is called the footprint vault [12]. A simple method to obtain footprints is to step directly with the inked foot onto a paper on the floor. After obtaining the footprints, an expert analyzes them and assesses whether they present pathologies. Usually, in the diagnosis of these pathologies an instrument known as a podoscope is used to capture the footprints. A simple digital version of the podoscope based on a scanner has been proposed in [7]. Another basic instrument to obtain footprints is the pedobarograph [1]. Modern variants of the pedobarograph are proposed in [8,10].
Our goal is to develop a system to achieve massive screening for the early detection of pathologies such as flat foot and cave foot. For that reason, a simple, inexpensive, and easy-to-use system is needed. Considering the simplicity of a podoscope, we have developed a digital version based on a simple color digital camera. The instrument consists of a robust metallic structure with adjustable height and transparent glass in its upper part. The patient stands on the glass and the footprint image is obtained with a digital color camera in the interior of the structure. For adequate lighting, white bulbs are used. The target system includes regulation of the light intensity of the bulbs, which allows the amount of light to be adjusted for capturing images under different lighting conditions. With our digital podoscope we have built a database with more than 230 foot-plant optical images to support our research. These images were classified by an expert1. From the total sample, 12.7% are flat feet, 61.6% are normal feet and 25.7% are cave feet. On the other hand, a non-automatic method to segment footprints based on a simple sequence of traditional digital image processing techniques has been proposed in [2]. A method to segment footprints in optical color images of the sole by using neural networks is proposed in [6]. This paper describes the development of an automatic method to diagnose foot plant pathologies. First, we propose an original representation for the segmented footprint patterns. Second, to reduce the pattern dimensionality we perform a principal component transform. Finally, we formulate the diagnosis of foot plant pathologies as a pattern recognition problem, adopting a neural network for the footprint classification process. This paper is organized as follows. Section 2 introduces the foot plant pathologies and their diagnosis using neural networks. Section 3 describes the footprint representation and feature extraction. Section 4 presents the training of the neural network classifier. Section 5 shows the validation of the neural network classifier. Finally, Section 6 gives some conclusions and future studies.
2
Foot Plant Pathologies and Neural Network Diagnosis
It is possible to classify a foot by its footprint form and dimensions as a normal, flat or cave foot. Figure 1b shows an image of a flat foot, Figure 1c an image of a normal foot, and Figure 1d an image of a cave foot. Currently, an expert determines whether a patient has a normal, cave or flat foot by a manual exam called a photopodogram. A photopodogram is a chemical photograph of the part of the foot supporting the load. The expert determines the positions for two distances, measures them, calculates their ratio and classifies the foot. Even though the criteria for classifying footprints seem very simple, the use of a classifier based on neural networks (NN) offers the following advantages compared with more traditional approaches: (1) it is not simple to develop an algorithm to determine
The authors of this study acknowledge Mr.Eduardo ACHU, specialist in Kinesiology, Department of Kinesiology, Catholic University of Maule, Talca, Chile, for his participation as an expert in the classification of the database images.
Fig. 1. Images of the sole: (a) zones, (b) flat foot, (c) normal foot, (d) cave foot
with precision the right positions at which to measure the distances, and (2) it can be trained to recognize other pathologies or to improve its performance as more cases become available. The multilayer perceptron (MLP) and the training algorithm called backpropagation (BP) [11] have been successfully used in classification and functional approximation. An important characteristic of the MLP is its capacity to classify patterns grouped in classes that are not linearly separable. Besides that, there are powerful tools, such as the Levenberg-Marquardt optimization algorithm [3] and a Bayesian approach for defining the regularization parameters [5], which enable efficient training of MLPs. Even though this universal framework for building classifiers exists, as we will illustrate in this work, simple preprocessing can lead to smaller network structures without compromising performance.
3
Footprint Representation and Characteristics Extraction
Prior to classification, the footprint is isolated from the rest of the components of the sole image by using the method proposed in [6]. Figures 2a, 2b and 2c show the segmentation of a flat foot, a normal foot, and a cave foot respectively.
Fig. 2. Segmentation of the footprint without toes: (a) flat foot, (b) normal foot, (c) cave foot
After performing the segmentation, the footprint is represented by a vector containing the width in pixels of the segmented footprint, without toes, for each column in the horizontal direction. Because every image yields a width vector of different length, the vectors were normalized to have the same length. The value of each element was also normalized to the range 0 to 1. Figures 3a, 3b and 3c show the normalized vectors of a flat, a normal and a cave foot.
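A minimal sketch of this representation, assuming the segmented footprint is available as a binary image with the foot axis along the columns; the fixed target length of 100 samples matches the plots in Fig. 3, but the function name and interpolation choice are ours.

import numpy as np

def width_vector(binary_footprint, target_len=100):
    # Per-column foreground width, resampled to a fixed length and scaled to [0, 1].
    widths = binary_footprint.sum(axis=0).astype(float)   # footprint pixels in each column
    widths = widths[widths > 0]                           # keep only columns crossed by the footprint
    x_old = np.linspace(0.0, 1.0, len(widths))
    x_new = np.linspace(0.0, 1.0, target_len)
    resampled = np.interp(x_new, x_old, widths)           # common length for every image
    return resampled / resampled.max()                    # amplitude normalized to [0, 1]

# Toy binary image standing in for a segmented footprint (rows x columns).
img = (np.random.rand(120, 300) > 0.5).astype(np.uint8)
v = width_vector(img)
print(v.shape, v.max())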
Fig. 3. Representation of the footprint without toes: (a) flat footprint vector, (b) normal footprint vector, (c) cave footprint vector
As a method to reduce the dimensionality of the inputs to the classifier, a principal components analysis was used [4]. Given an eigenvalue λi associated to the covariance matrix of the width vector set, the percentage contribution γi (1) and the accumulated percentage contribution APCi (2) are calculated by the following expressions:

γi = λi / ( Σ_{j=1}^{d} λj )    (1)

APCi = Σ_{j=1}^{i} γj    (2)
Table 1 shows the value, percentage contribution and accumulated percentage contribution of the first nine eigenvalues. It is possible to note that from the 8th eigenvalue onward the contribution is close to zero, so it is enough to represent the width vector with the first seven principal components. Figure 4 shows a normalized width vector (rugged red signal) and the resulting approximation using the first seven principal components (smoothed blue signal) for the three classes.

Table 1. Contribution of the first 9 eigenvalues

Eigenvalue   Value    Percentual contribution   Accumulated contribution
λ1           0.991    63.44                     63.44
λ2           0.160    10.25                     73.7
λ3           0.139    8.95                      82.65
λ4           0.078    5.01                      87.67
λ5           0.055    3.57                      91.24
λ6           0.0352   2.25                      93.50
λ7           0.020    1.31                      94.82
λ8           0.014    0.94                      95.76
λ9           0.010    0.65                      96.42
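The eigenvalue contributions in (1)-(2) and Table 1 can be reproduced, in outline, as below; the width vectors are assumed to be stacked as rows of a matrix, and the variable names are ours.

import numpy as np

def pca_contributions(width_vectors):
    # width_vectors: (n_samples, d) array of normalized width vectors.
    cov = np.cov(width_vectors, rowvar=False)     # d x d covariance matrix
    eigvals = np.linalg.eigvalsh(cov)[::-1]       # eigenvalues, largest first
    gamma = eigvals / eigvals.sum()               # percentage contribution, eq. (1)
    apc = np.cumsum(gamma)                        # accumulated contribution, eq. (2)
    return eigvals, gamma, apc

X = np.random.rand(199, 100)                      # stand-in for the training width vectors
vals, gamma, apc = pca_contributions(X)
print(gamma[:7].sum(), apc[6])                    # share explained by the first 7 components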
4
Training of the Neural Network Classifier
A preliminary analysis of the segmented footprints showed very little presence of limit patterns among classes: flat feet almost normal, normal feet almost flat, normal feet almost cave and cave feet almost normal. Thus, the training set was enhanced with 4 synthetic patterns for each one of the limit cases.
Fig. 4. Principal components approximation: (a) flat class, (b) normal class, (c) cave class
The training set thus has a total of 199 images, 12.5% corresponding to flat feet, 63% to normal feet and 24.5% to cave feet. To build the training set, the first seven principal components were calculated for all the width vectors in the training set. For the classification of a foot as flat, normal or cave, an MLP trained with Bayesian regularization backpropagation was used. The structure of the NN is:
– Number of inputs: 7, one for each principal component.
– Number of outputs: 1. It takes a value of 1 if the foot is flat, a value of 0 when the foot is normal and a value of −1 when the foot is cave.
To determine the number of neurons in the hidden layer, the procedure described in [3] was followed. Batch learning was adopted and the initial network weights were generated by the Nguyen-Widrow method [9], since it increases the convergence speed of the training algorithm. The details of this procedure are shown in Table 2, where NNCO corresponds to the number of neurons in the hidden layer, SSE is the sum of squared errors and SSW is the sum of squared weights. From Table 2 it can be seen that from 4 neurons in the hidden layer onward, the SSE, SSW and the effective parameters stay practically constant. As a result, 4 neurons are used in the hidden layer. In Figure 5 it is possible to observe that the SSE, SSW and the effective parameters of the network are relatively constant over several iterations; this means that the training process has been carried out appropriately. Figure 6 shows the training error for each pattern of the training set. It is important to emphasize that the classification errors are not very small values. This behavior ensures that the network has not memorized the training set, and it will generalize well.

Table 2. Determining the number of neurons in the hidden layer

NNCO   Epochs     SSE              SSW     Effective parameters   Total parameters
1      114/1000   22.0396/0.001    23.38   8.49                   10
2      51/1000    12.4639/0.001    9.854   16.3                   19
3      83/1000    12.3316/0.001    9.661   19.7                   28
4      142/1000   11.3624/0.001    13.00   26.1                   37
5      406/1000   11.3263/0.001    13.39   28.7                   46
6      227/1000   11.3672/0.001    12.92   26.2                   55
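The paper's network was trained in a Bayesian-regularization framework; purely as a hedged illustration, the sketch below builds a 7-4-1 MLP of the same shape with scikit-learn, using plain L2 regularization as a stand-in for the Bayesian scheme (the library, solver and regularization choice are ours, not the authors').

import numpy as np
from sklearn.neural_network import MLPRegressor

# X: first seven principal components per footprint; y: 1 = flat, 0 = normal, -1 = cave.
rng = np.random.default_rng(0)
X = rng.normal(size=(199, 7))
y = rng.choice([-1.0, 0.0, 1.0], size=199)

net = MLPRegressor(hidden_layer_sizes=(4,),   # 7 inputs -> 4 hidden neurons -> 1 output
                   activation="tanh",
                   solver="lbfgs",
                   alpha=1e-2,                # L2 penalty standing in for Bayesian regularization
                   max_iter=1000)
net.fit(X, y)

# Classify by rounding the single output towards the nearest target value {-1, 0, 1}.
pred = np.clip(np.rint(net.predict(X)), -1, 1)
print((pred == y).mean())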
Fig. 5. Evolution of the training process for 4 neurons in the hidden layer. Top: SSE evolution (final SSE = 11.3624). Center: SSW evolution (final SSW = 13.003). Bottom: evolution of the effective number of parameters (26.09 effective parameters after 142 epochs).
Fig. 6. Classification error of the training set
5
Validation of the Neural Network Classifier
The validation set contains 38 new real footprint images classified by the expert, of which 13.1% correspond to flat feet, 55.3% to normal feet and 31.6% to cave feet. For each footprint in the validation set, the corresponding normalized width vector was calculated from the binary image of the segmented footprint, and then, after principal component decomposition, only the first 7 components were presented to the trained NN. Figure 7 shows the results of the classification; the outputs of the network and the targets are represented by circles and crosses respectively. Moreover, the figure shows the classification error, represented by a black continuous line. The results are very good, considering that the classification was correct for the complete set.
Fig. 7. Classification error of validation set and output/target of the net
Fig. 8. Foot plants and their segmented footprints

Fig. 9. Footprint vectors of the segmented footprints

Table 3. Classification of the validation set

ID footprint   Output of net   Target   Error of classification   Classification of net   Classification of expert
(a)             1.0457          1       -0.0457                   Flat                    Flat
(b)             1.0355          1       -0.0355                   Flat                    Flat
(c)             0.8958          1        0.1042                   Flat                    Flat
(d)             0.0126          0       -0.0126                   Normal                  Normal
(e)            -0.0395          0        0.0395                   Normal                  Normal
(f)             0.0080          0       -0.0080                   Normal                  Normal
(g)            -0.9992         -1       -0.0008                   Cave                    Cave
(h)            -1.0010         -1        0.0010                   Cave                    Cave
(i)            -0.9991         -1       -0.0009                   Cave                    Cave
In order to illustrate the whole process, we have chosen 9 patterns from the validation set. Figure 8 shows the foot images and their segmented footprints; Figures 8a-c, 8d-f and 8g-i correspond to flat, normal and cave feet respectively. Figure 9 shows the footprint vectors of the previous images.
Table 3 shows the quantitative classification results for the 9 examples selected from the validation set; as can be seen, the network classification and the expert classification are equivalent.
6
Final Remarks and Future Studies
This work has presented a method to detect footprint pathologies based on neural networks and principal component analysis. Our work shows a robust solution to a real-world problem; in addition, it contributes to automating a process that is currently performed by a human expert. By adding synthetic border patterns, the training process was enhanced. The footprint representation by a width vector and principal component analysis were used. By using an MLP trained with a Bayesian approach, all patterns of the validation set were correctly classified. The encouraging results of this study demonstrate the feasibility of implementing a system for early, automatic and massive diagnosis of the pathologies analyzed here. In addition, this study lays down the foundation for incorporating new foot pathologies that can be diagnosed from the footprint. Considering the experience obtained from this study, our interest is now centered on real-time footprint segmentation for the monitoring, analysis and detection of walking disorders. Additionally, during the research a database containing a large number of images from different patients was generated. These images have already been classified by an expert considering pathologies such as flat and cave foot. The authors are willing to make these images available to the entire research community, so that more knowledge and experience can be gained on a cooperative basis.
References 1. Chodera, J.: Pedobarograph-Apparatus for visual display of pressure between contacting surfaces of irregular shape. CZS Patent 104 514 30d, (1960) 2. Chu, W., Lee, S., Chu, W. , Wang, T., and Lee, M.: The use of arch index to characterized arch height : a digital image processing approach. IEEE Transaction on Biomedical Engineering 42 (1995) 3. Foresee, D., Hagan, M.,: Gauss-Newton Approximation to Bayesian Learning. Proceedings of the International Joint Conference on Neural Networks, (1997) 4. Jollife, I.: Principal Component Analysis. Springer-Verlag, (1986) 5. Mackay, D.: Bayesian Interpolation. Neural Computation 4 (1992) 6. Mora, M., Sbarbaro, D.: A Robust Footprint Detection Using Color Images and Neural Networks. Proceedings of the CIARP 2005, Lecture Notes in Computer Science 3773 (2005) 311-318 7. Morsy, A., Hosny, A.: A New System for the Assessment of Diabetic Foot Planter Pressure. Proceedings of the 26th Annual International Conference of the IEEE EMBS (2004) 1376-1379
8. Nakajima, K., Mizukami, Y., Tanaka, K.: Footprint-Based personal recognition. IEEE Transactions on Biomedical Ingineering 47 (2000) 9. Nguyen, D., Widrow, B.: Improving the Learning Speed of 2-Layer Neural Networks by Choossing Initial Values of the Adaptive Weights. Proceedings of the IJCNN 3 (1990) 21-26 10. Patil, K, Bhat, V., Bhatia, M., Narayanamurthy, V., Parivalan, R.: New online methods for analysis of foot preesures in diabetic neuropathy. Frontiers Me. Biol.Engg. 9 (1999) 49-62 11. Rumelhart, D., McClelland, J., and PDP group: Explorations in Parallel Distributed Processing. The MIT Press. 1 and 2 (1986) 12. Valenti, V.: Orthotic Treatment of Walk Alterations. Panamerican Medicine (in Spanish) (1979)
Phase Transitions Caused by Threshold in Random Neural Network and Its Medical Applications Guangcheng Xi and Jianxin Chen Key Laboratory of Complex Systems and Intelligence Science Institute of Automation, Chinese Academy of Sciences Beijing 100080, China {guangcheng.xi,jianxin.chen}@ia.ac.cn
Abstract. In this paper, we detect threshold-driven phase transitions in the homogeneous random neural network. When the neurons are arranged in one dimension, the critical threshold is two, while in the two-dimensional counterpart the critical threshold is four. We argue that the random neural network is a specific case of Abstract Neural Automata, and conclude that phase transitions in the random neural network can produce thought in the human brain. We successfully apply the network to interpret the relation between diseases and syndromes in Traditional Chinese Medicine.
1
Introduction
The human brain consists of nearly 10^11 neurons; the number of neurons is approximately infinite. Neural networks present a powerful strategy to disclose the mechanism of the brain. However, most neural networks, implemented in either software or hardware, are composed of finitely many neurons. In view of this, we have presented a more complete framework, Abstract Neural Automata [1], which is composed of infinitely many neurons. Philosophical concepts are thought to be the highest product of the brain, and phase transitions play a prominent role in modeling the brain, particularly in the process of thought. Hoshino, O., et al. showed that self-organized phase transitions contribute to the information processing mechanism of the brain [5]. Phase transitions of simple learning in a one-layer perceptron have also been studied [4]. We have shown phase transitions in the brain: transitions of concepts produce thought. Any concept can be uniquely expressed by basic concepts that are considered as the set of extreme points of the limit Gibbs measures of ANA [2]. In this paper, we detect the phase transitions driven by the threshold of the neuron in the homogeneous random neural network, whose limit distribution is the Poisson distribution, a specific kind of Gibbs distribution.
The work was supported by the National Basic Research Program of China (973 Program) under grant No. (2003CB517106) and NSFC Projects under Grant No. 60621001.
The phase transitions occur under two situations: the first is the case where the neurons are arranged along a line, i.e., the dimension is one; the second is where the neurons are arranged in a grid, i.e., the dimension is two. We found that the critical threshold for phase transitions is two in the line case and four in the grid case.
2
Phase Transitions Driven by Threshold in Random Neural Network
Random neural networks (RNN) [6],[7] are composed of a finite number of interconnected neurons that operate as threshold automata in an asynchronous manner. Each neuron is either firing or quiet, and emits an output line that may branch out after leaving the neuron; every branch must terminate as an input connection to another neuron, either excitatory or inhibitory. Any number of input connections may terminate at a neuron. Every neuron has the same threshold Z and fires if and only if the number of excitatory inputs is not less than Z and there is no inhibitory input to the neuron. There exists an input connection from a neuron to any one of the other neurons with a fixed probability u; an input connection is excitatory with probability v and inhibitory with probability 1−v. Any change of state of a neuron involves a response delay of the neuron concerned. Response delays are independent identically distributed random variables that follow a negative exponential distribution with rate r. The probability that a neuron transits from the quiet state to the firing state, given that the number of firing neurons is i, is denoted by g_i; accordingly, the probability that a neuron turns from firing to quiet given that i neurons are firing is denoted by h_i. According to the rule introduced above, if i is less than the threshold Z, then g_i = 0 and h_i = 1; if i is not less than Z, i.e., i ≥ Z, then

g_i = Σ_{k=Z}^{i} C(i, k) u^k (1 − u)^{i−k} v^k ,    (1)

h_i = 1 − g_i ,    (2)

where C(i, k) is the binomial coefficient.
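As a quick check of (1)-(2), the snippet below evaluates g_i and h_i for given u, v and Z; the function names are ours.

from math import comb

def g(i, Z, u, v):
    # Probability that a quiet neuron fires when i neurons are firing, eq. (1).
    if i < Z:
        return 0.0
    return sum(comb(i, k) * (u ** k) * ((1 - u) ** (i - k)) * (v ** k) for k in range(Z, i + 1))

def h(i, Z, u, v):
    # Probability that a firing neuron becomes quiet, eq. (2).
    return 1.0 if i < Z else 1.0 - g(i, Z, u, v)

print(g(10, 2, 0.1, 1.0), h(10, 2, 0.1, 1.0))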
We consider two state-space configurations of the RNN to discuss the relation between the threshold of the neurons and phase transitions: one is the rectangular-grid state space, whose random neural network is called the two-dimension network; the other is the line state space, called the one-dimension network. 2.1
Two Dimension Network
Suppose a random neural network is assembled from N neurons. Besides the self-organization of these neurons according to the rule described above, the operation of the network is also affected by M external inputs, whose states can be firing or quiet just like those of the ordinary neurons but whose operation is independent of them. The times each input sojourns in the quiet (firing) state are independent identically distributed random variables following a negative exponential distribution with rate m (m1). Although these M inputs behave independently of the ordinary neurons, the probability that an input connects to any one of the ordinary neurons is u, and an input connection is excitatory with probability v and inhibitory with probability 1−v. The behavior of the system can then be described by a Markov process {Q(t), W(t), t ≥ 0}, where Q(t) is the number of firing neurons and W(t) represents the number of firing inputs of the network at time t, with the state space {(i, j), 0 ≤ i ≤ N, 0 ≤ j ≤ M}, which can be represented by a rectangular grid. The state transition rates are shown in Fig. 1.
Fig. 1. Transition rates for the Markov process with the rectangular-grid state space. The rates are represented by eight symbols: A is (M − j)m, B is (j + 1)m1, C is irh_{i+j−1}, D is (N − i + 1)rh_{i+j}, E is (i + 1)rh_{i+j}, F is (N − i)rg_{i+j}, G is (M − j + 1)m, H is jm1.
If the number of inputs M is chosen less than the threshold Z, then the set {(0, j), 0 ≤ j ≤ M} is a closed state set, since the network would always be quiet; a state in the set cannot reach any state outside it and vice versa. So M is chosen to satisfy M ≥ Z; the Markov process is then a finite-state irreducible Markov process, for which there exists a steady-state probability distribution denoted by P = (P_{i,j}), where P_{i,j} = lim_{t→∞} P(Q(t) = i, W(t) = j). P_{i,j} must satisfy equations of the general form:

[(M − j)m + j m1 + i r h_{i+j−1} + (N − i) r g_{i+j}] P_{i,j}
  = (M − j + 1) m P_{i,j−1} + (j + 1) m1 P_{i,j+1} + (N − i + 1) r g_{i+j−1} P_{i−1,j} + (i + 1) r h_{i+j} P_{i+1,j} .    (3)

It is noted that when the general form is applied to the states on the border of the rectangular-grid state space, some terms are absent according to the boundary conditions. In addition to these (N + 1)(M + 1) equations, the limit probabilities P_{i,j} must meet the normalizing condition Σ_{i=0}^{N} Σ_{j=0}^{M} P_{i,j} = 1. So the steady-state probability distribution P can easily be obtained numerically. The parameters of the system are chosen as: N = 100, M = 10, u = 0.1, v = 1, m = m1 = 1, r = 5. The threshold Z is chosen from zero to ten. We find that for thresholds Z < 4 the corresponding distributions P are nearly the same; for Z > 4 the distributions also resemble each other, but are totally different from those for Z < 4. The probability distribution at Z = 4 is a critical case. Namely, the network has at least two phases, and Z = 4 is the critical point at which phase transitions occur. We take the mean firing rate of the network as a measure of the phase transition. Fig. 2 shows the variation of the network's mean firing rate versus the threshold, from which we can clearly see that the mean firing rate changes sharply around the critical threshold Z = 4.
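The steady-state distribution of the two-dimensional chain can be obtained numerically by writing down the generator of the (i, j) process and solving πQ = 0; the sketch below does this for a reduced problem size. The rate assignments mirror the general balance form above, while the function names and the small N and M are our own choices.

import numpy as np
from math import comb

def g(i, Z, u, v):
    if i < Z:
        return 0.0
    return sum(comb(i, k) * u**k * (1 - u)**(i - k) * v**k for k in range(Z, i + 1))

def h(i, Z, u, v):
    return 1.0 if i < Z else 1.0 - g(i, Z, u, v)

def steady_state_grid(N, M, Z, u, v, m, m1, r):
    idx = lambda i, j: i * (M + 1) + j
    n = (N + 1) * (M + 1)
    Q = np.zeros((n, n))                                   # generator: Q[s, s'] = rate s -> s'
    for i in range(N + 1):
        for j in range(M + 1):
            s = idx(i, j)
            if j < M: Q[s, idx(i, j + 1)] = (M - j) * m                     # an input starts firing
            if j > 0: Q[s, idx(i, j - 1)] = j * m1                          # an input goes quiet
            if i < N: Q[s, idx(i + 1, j)] = (N - i) * r * g(i + j, Z, u, v) # a neuron fires
            if i > 0: Q[s, idx(i - 1, j)] = i * r * h(i + j - 1, Z, u, v)   # a neuron goes quiet
            Q[s, s] = -Q[s].sum()
    A = np.vstack([Q.T, np.ones(n)])                       # solve pi Q = 0 with sum(pi) = 1
    b = np.zeros(n + 1); b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi.reshape(N + 1, M + 1)

P = steady_state_grid(N=20, M=4, Z=4, u=0.1, v=1.0, m=1.0, m1=1.0, r=5.0)
mean_firing = (np.arange(21)[:, None] * P).sum() / 20.0
print(mean_firing)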
Fig. 2. Mean firing rate versus the threshold
2.2
One Dimension Network
We further discuss the relation between phase transitions and the threshold under the condition that the state space of the network is represented by a line, namely a one-dimension state space. The neural network is composed of N neurons as before. The running of the network is affected by the arrival of stimulating pulses generated by an external source in the environment. The stimulating pulses arrive at each neuron according to a Poisson process with rate b; each arriving pulse connects to the neuron with probability u; a connected pulse is excitatory with probability v and inhibitory with probability 1−v. A quiet neuron will immediately fire if it is connected to an excitatory arriving pulse, and a firing neuron will instantaneously turn quiet if it is connected to an inhibitory arriving pulse. Otherwise, the network operates in the self-organized manner described above. The behavior of the neural network can also be described by a Markov process {Q(t), t ≥ 0} with state space {i, 0 ≤ i ≤ N} represented by a line, where Q(t) is the number of firing neurons of the network at time t. Transitions between the states rely on the self-organization of the network and the arriving pulses; Fig. 3 displays the transition rates for a state i. The Markov process is also a finite-state irreducible Markov process, so there exists a steady-state probability distribution P = [P_0, P_1, ..., P_N], where P must satisfy the following three equations:

N (buv + r g_0) P_0 = [bu(1 − v) + r h_0] P_1 ,    (4)
{i [bu(1 − v) + r h_{i−1}] + (N − i)(buv + r g_i)} P_i = (N − i + 1)(buv + r g_{i−1}) P_{i−1} + (i + 1)[bu(1 − v) + r h_i] P_{i+1} ,  for 1 ≤ i ≤ N − 1 ,    (5)

N [bu(1 − v) + r h_{N−1}] P_N = (buv + r g_{N−1}) P_{N−1} .    (6)

The solution to these N + 1 equations is obtained analytically:

P_i = C(N, i) P_0 ∏_{k=0}^{i−1} (buv + r g_k) / (bu(1 − v) + r h_k) ,    (7)

where C(N, i) is the binomial coefficient. P_0 is calculated according to the normalizing condition Σ_{i=0}^{N} P_i = 1:

P_0 = [ 1 + Σ_{i=1}^{N} C(N, i) ∏_{k=0}^{i−1} (buv + r g_k) / (bu(1 − v) + r h_k) ]^{−1} .    (8)
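A direct implementation of (7)-(8), assuming the g_i and h_i of (1)-(2); the names are ours. It can be used to reproduce the qualitative picture in Fig. 4 by comparing the distributions obtained for different thresholds Z.

import numpy as np
from math import comb

def g(i, Z, u, v):
    if i < Z:
        return 0.0
    return sum(comb(i, k) * u**k * (1 - u)**(i - k) * v**k for k in range(Z, i + 1))

def h(i, Z, u, v):
    return 1.0 if i < Z else 1.0 - g(i, Z, u, v)

def steady_state_line(N, Z, u, v, b, r):
    # Steady-state distribution [P_0, ..., P_N] from equations (7)-(8).
    ratios = np.array([(b * u * v + r * g(k, Z, u, v)) /
                       (b * u * (1 - v) + r * h(k, Z, u, v)) for k in range(N)])
    P = np.array([comb(N, i) * np.prod(ratios[:i]) for i in range(N + 1)])
    return P / P.sum()   # normalization fixes P_0 as in (8)

# Parameters from the one-dimension experiment: N = 19, u = 0.1, v = 1, b = 1, r = 1.
for Z in range(4):
    P = steady_state_line(19, Z, 0.1, 1.0, 1.0, 1.0)
    print(Z, np.argmax(P), P.max())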
In this situation, we found that the critical threshold is two. That is to say, if the thresholds of the neurons are less than two, the corresponding steady-state probability distributions are of the same kind; on the other hand, the distributions for thresholds larger than two are also approximately identical to each other. The distributions for Z < 2 are totally distinct from those for Z > 2. This indicates that when the threshold goes from a value less than two to a value larger than two, or vice versa, the limit distribution varies significantly. This is again a well-defined phase transition, which is also measured by the mean firing rate of the neural network, as shown in Fig. 4. A simple mathematical derivation can prove that the steady-state probability distribution of the random neural network follows a Poisson distribution. Therefore the random neural network is a Poisson neural network, a specific kind of Markov neural network, and hence Gibbs according to O. K. Kozlov's theory on the Gibbs description of point random fields [9]. We can say the random neural network is a specific kind of Abstract Neural Automata. It is of primary significance to point out that the limit states of the random neural network are not unique: from the above empirical analysis it is evident that the network has at least two limit distributions, and they can transit into each other depending on the threshold. This is homologous with the ANA's variability of structure.
Fig. 3. State transitions for the Markov process with one dimension state space. A is (N − i + 1)(buv + rgi−1 ), B is i[bu(1 − v) + rhi−1 ] , C is (N − i)(buv + rgi ), D is (i + 1)[bu(1 − v) + rhi ].
Fig. 4. The phase transitions in the one-dimension neural network. The parameters are chosen as: N = 19, u = 0.1, v = 1, b = 1, r = 1, and 0 ≤ Z ≤ 3.
3
Medical Application
As a specific kind of Abstract Neural Automata, the random neural network has medical applications, especially in Traditional Chinese Medicine (TCM), which is considered a classical treasure of China and is on its way to standardization [10],[11]. However, TCM is seen as somewhat occult; even expert TCM practitioners cannot explicitly explain, at the molecular level, how they diagnose. So many people, particularly in the West, do not regard TCM as being as scientific as Western medicine. Indeed, the distillation of TCM is "Bian Zheng Lun Zhi", which means that TCM experts first identify and determine which Zheng (called a syndrome) a patient has, based on information gathered by watching, smelling, inquiring, and feeling the pulse (the four procedures are denoted Si Zhen), and then they prescribe. The syndrome is the key in the system of Bian Zheng Lun Zhi, and the study of syndromes is the core of the study of the basic theory of TCM. Here, we formally state that syndromes are philosophical concepts that exist in the brains of TCM experts and that the phase transitions between them lead to the thought of the brain. In fact, a syndrome is a diagnostic concept produced by mapping symptoms (that is, the Si Zhen information) into the brains of TCM practitioners. A syndrome has a close relation with disease; for example, a patient who suffers from coronary heart disease often suffers from the blood stasis syndrome, which is characterized by some key symptoms [8]. As shown in Table 1, the blood stasis syndrome exists in three complex diseases: coronary heart disease, diabetes (or sugar diabetes) and cerebral infarction. However, the symptoms that correspond to each disease's syndrome differ considerably from each other. So each disease's syndrome is called a subtype of the blood stasis syndrome. We creatively apply the one-dimensional random neural network with phase transitions to interpret syndromes in TCM. We stated above that a syndrome is a concept
Fig. 5. The x-axis shows the states, which represent symptoms; the y-axis shows the corresponding limit probability of each state. Each probability distribution is a disease's syndrome subtype.
of the brain: it is abstract, while a symptom is concrete. In this application, a symptom is represented by a state of the Markov process that describes the behavior of the network, while a syndrome is manifested by the limit probability distribution of the Markov process. Since the number of phases of the neural network is two, each
Table 1. Symptoms of the three diseases' syndrome subtypes. The coronary heart disease (CHD) subtype has 12 symptoms, diabetes (SD) 11, and cerebral infarction (CI) 15. The first and second diseases have 4 symptoms in common, the first and the last have 5 in common, and the second and the last have 4.

Disease   Symptoms corresponding to blood stasis syndrome
CHD       Angina; Palpitation; Dyspnea; Lassitude; Dark lips; Squamous and dry skin; Dark eye orbit; Dysphoria with feverish sensation in the chest, palms and soles; Dark purple tongue marked with ecchymosis; Petechia on the tongue; Engorged sublingual veins; Wiry pulse
SD        Polyoresia; Emaciate; Gender; Smoking; Frequent nocturia; Dark lips; Palpitation; Darkish complexion; Dark purple tongue marked with ecchymosis; Engorged sublingual veins; Unsmooth pulse
CI        Hemiplegia; Headache; Vertigo; Pain in nape; Lethargy; Age; Profession; Sign of palate mucous; Squamous and dry skin; Emaciate; Dark purple tongue marked with ecchymosis; Petechia on the tongue; Engorged sublingual veins; Unsmooth pulse; Wiry pulse
time, two syndrome subtypes can be expressed by one network. We align all symptoms of the two subtypes and assign a state from the state space of the Markov process to each of them. Take the symptoms of sugar diabetes and cerebral infarction as an example: the two subtypes have 4 symptoms in common. The former has 11 symptoms and the latter 15, so the total is 22, which is taken as the number of neurons. The other parameters of the network are set as: connection probability u = 0.1; excitatory probability v = 1; arrival rate b = 1; response delay rate r = 1. The limit probability distribution is depicted in Figure 5. The symptoms are placed along the x-axis: the first 15 symptoms belong to cerebral infarction's syndrome subtype, while the last 11 belong to diabetes's subtype; the middle 4 are the overlap symptoms, namely Emaciate, Dark purple tongue marked with ecchymosis, Engorged sublingual veins and Unsmooth pulse. It is necessary to note that as long as the four overlap symptoms are represented by states 12, 13, 14 and 15 (regardless of order), the network can successfully interpret the syndrome. From Fig. 5 we can see that when Z = 1 the probabilities of the last 11 states are obviously larger than those of the first states, which are close to 0; this indicates that the distribution at Z = 1 can represent diabetes's syndrome subtype. At Z = 2, the other phase, the probabilities of the first 15 states are significantly larger than those of the last states, which nearly vanish. So we can conclude that the random neural network can successfully interpret disease and syndrome in TCM. The other combinations of the three diseases can easily be separated in this way.
4 Conclusions
This contribution is devoted to detecting phase transitions driven by the threshold of the neuron in the one-dimensional random neural network and its two-dimensional counterpart, respectively. In the former case, the critical threshold is two, while in the latter case it is four. The random neural network whose limit
distribution is Poisson is shown to be a special abstract neural automaton, the limit configuration of which is a Gibbs distribution. Finally, the one-dimensional network is applied to interpret the blood stasis syndrome of Traditional Chinese Medicine; the successful result suggests that the syndrome of TCM can be considered scientifically.
References
1. Xi, G.C.: Abstract Neural Automata. Kybernetes: The International Journal of Systems and Cybernetics 27 (1998) 81-86
2. Xi, G.C.: Variability of Structure of Abstract Neural Automata and the Ability of Thought. Kybernetes: The International Journal of Systems and Cybernetics 30 (2003) 1549-1554
3. Hoshino, O., Kashimori, Y., Kambara, T.: Self-organized Phase Transitions in Neural Networks as a Neural Mechanism of Information Processing. PNAS 93 (1996) 3303-3307
4. Hertz, J., Krogh, A., Thorbergsson, G.: Phase Transitions in Simple Learning. J. Phys. A: Math. Gen. 22 (1989) 2133-2150
5. Hoshino, O., Kashimori, Y., Kambara, T.: Self-organized Phase Transitions in Neural Networks as a Neural Mechanism of Information Processing. PNAS 93 (1996) 3303-3307
6. Gelenbe, E., Stafylopatis, A.: Global Behavior of Homogeneous Random Neural Systems. Appl. Math. Modelling 15 (1991) 534-541
7. Jo, S., Yin, J., Mao, Z.H.: Random Neural Networks with State-Dependent Firing Neurons. IEEE Transactions on Neural Networks 16 (2005) 980-983
8. Yao, K.W.: Quantitative Diagnosis of Blood Stasis Syndrome and Research on Combination of Syndrome with Disease. Doctoral Thesis, Chinese Academy of Traditional Chinese Medicine (2004)
9. Kozlov, O.K.: Gibbsian Description of Point Random Fields. Theory Prob. Appl. 21 (1976) 339-355
10. Xue, T.H., Roy, R.: Studying Traditional Chinese Medicine. Science 300 (2003) 740-741
11. Normile, D.: The New Face of Traditional Chinese Medicine. Science 299 (2003) 188-190
Multiresolution of Clinical EEG Recordings Based on Wavelet Packet Analysis Lisha Sun1, Guoliang Chang2, and Patch J. Beadle2 1
Key Lab of Intel. Manuf. Tech. of State Education Ministry, College of Engineering, Shantou University, Guangdong 515063, China
[email protected] 2 School of System Engineering, The University of Portsmouth, Portsmouth, U.K.
Abstract. A method for extracting specified rhythms of the clinical electroencephalogram (EEG) is proposed using the wavelet packet decomposition. Based on its ability to accurately resolve a signal into desired time-frequency components, EEG signals are preprocessed and decomposed into a series of rhythms for many clinical applications. Specified dynamic EEG rhythms can be accurately filtered with a designed wavelet structure. In addition, we present a wavelet packet entropy method for processing EEG signals. Both the relative wavelet packet energy and the wavelet packet entropy are presented as quantitative parameters to measure the complexity of the EEG signal. Several experiments with real EEG signals are carried out to show that the proposed method outperforms the common discrete wavelet decomposition. The presented procedure can isolate specific EEG rhythms accurately and can also be regarded as an efficient method for analyzing non-stationary signals in practice.
1 Introduction

The EEG contains important information about the potentials in the cortex or on the surface of the scalp generated by the physiological activities of the brain. EEG has become a common and effective tool for clinical analysis, since detecting changes in EEG signals is critical for understanding brain functions and for many applications in neuroscience. Although various clinical measurement tools are in increasingly wide use, the EEG, as a non-invasive method, still plays a key role in brain diagnosis and the analysis of brain function. Recently, many kinds of signal processing techniques have been proposed for studying dynamic EEG signals and the corresponding brain functions. Power spectral analysis via the Fourier transform has been widely used for quantitative analysis of EEGs, but spectral analysis is only appropriate for investigating stationary signals of simple dynamics that consist of a linear superposition of a few independent, strong, non-evolving periodicities [1,2]. In other words, traditional spectral analysis has severe drawbacks for analyzing practical EEG signals because of the transient periodicities and non-stationarity of practical EEGs, such as recordings corresponding to sleep stages, epileptic transients, and changes in the physiological state of the patient. Furthermore, evoked potentials (EPs) reflect event-related non-stationary records [3,4]. With the
development of modern signal processing techniques, the analysis of the transient characteristics of EEG recordings has become a widely accepted approach, for example for analyzing certain types of artifacts and epileptogenic transients in EEG signals. To study the time-dependent spectrum of non-stationary EEG signals, several approaches have been developed, including the short-time Fourier transform (STFT), the Wigner-Ville representation and time-varying parametric models. The STFT assumes stationarity of the signal within a temporal window to match the time-frequency resolution chosen for the spectral analysis; its main problem is the fixed time-frequency resolution trade-off that results from windowing the signal. The Wigner-Ville distribution (WVD) is another common procedure for time-varying spectral analysis, but its most significant drawback is the presence of cross-terms when analyzing multi-component signals. As a parametric approach, the time-varying AR model is used for transient spectral estimation of non-stationary signals; the significant limitation of time-varying parametric models is the difficulty of establishing the model properly for different practical signals, such as selecting the model order and the basis functions. With its variable window length, the wavelet transform (WT) can overcome some of the drawbacks indicated above and provides time-frequency filtering capabilities. Wavelet analysis has been applied with considerable success to non-stationary signals, but many problems associated with clinical EEG applications and feature extraction still need to be investigated further. Moreover, automatic methods generally do not stand comparison with traditional visual EEG analysis by trained physicians [5-8]. For this purpose, the aim of this paper is to present a new method for the effective detection of transients in EEG signals based on the wavelet transform. We mainly investigate the time-frequency characteristics of the different spontaneous brain rhythms and new techniques for extracting time-varying rhythms of the signals by employing the wavelet packet decomposition. In addition, the multi-channel time-varying rhythms are used to reconstruct the Dynamic Topographic Brain Mapping (DTBM), which enables physicians to follow the changes of multi-channel brain activity in a specific rhythm of the EEG recordings. Finally, the wavelet packet entropy is presented as a quantitative parameter to measure the complexity of the EEG signal.
2 Proposed Approach

2.1 Wavelet Transform

The wavelet transform is an effective method for the time-frequency analysis of non-stationary signals. It can decompose a temporal signal into a summation of time-domain basis functions of various frequency resolutions. A wavelet is a smooth and quickly vanishing oscillating function with good localization in both frequency and time [9,10]. Generally, the wavelet ψ(t) is a function of zero average,

$$\int_{-\infty}^{+\infty} \psi(t)\,dt = 0 \quad (1)$$
which is dilated with a scale parameter $a$ and translated by $b$:

$$\psi_{a,b}(t) = \frac{1}{\sqrt{a}}\,\psi\!\left(\frac{t-b}{a}\right) \quad (2)$$
where $a, b \in \mathbb{R}$, $a \neq 0$. The wavelet transform of a signal $f(t)$ at scale $a$ and position $b$ is computed by correlating $f(t)$ with the wavelet function:

$$W_f(a,b) = \int_{-\infty}^{+\infty} f(t)\,\frac{1}{\sqrt{a}}\,\psi^{*}\!\left(\frac{t-b}{a}\right)dt = \langle f, \psi_{a,b}\rangle \quad (3)$$
If the wavelet satisfies

$$c_\psi = \int_{-\infty}^{+\infty} \frac{|\hat{\psi}(\omega)|^{2}}{|\omega|}\, d\omega < \infty \quad (4)$$

where $\hat{\psi}(\omega)$ is the Fourier transform of $\psi(t)$, then $f(t)$ can be reconstructed by the following relationship:

$$f(t) = \frac{1}{c_\psi}\int_{-\infty}^{+\infty}\!\int_{-\infty}^{+\infty} W_f(a,b)\,\psi_{a,b}\,\frac{1}{a^{2}}\,da\,db \quad (5)$$
Wavelet analysis allows a simultaneous and varying time-frequency resolution which leads to a multi-resolution representation for non-stationary physical signals.

2.2 Wavelet Packet Transform
The main problem of the WT is that the frequency resolution is poor in the high-frequency region. In many applications, the wavelet transform may not generate a spectral resolution fine enough to meet the requirements of the problem. The wavelet packet transform is a generalization of the wavelet transform in which each octave frequency band of the wavelet spectrum is further subdivided into finer frequency bands by applying the two-scale relations repeatedly. The wavelet packet functions can be obtained by [11,12]:

$$\psi_{j+1}^{2i-1}(t) = \sqrt{2}\sum_{k=-\infty}^{\infty} h(k)\,\psi_{j}^{i}(2t-k) \quad (6)$$

$$\psi_{j+1}^{2i}(t) = \sqrt{2}\sum_{k=-\infty}^{\infty} g(k)\,\psi_{j}^{i}(2t-k) \quad (7)$$
The first wavelet $\psi(t)$ denotes the so-called mother wavelet function. The filters $h(k)$ and $g(k)$ represent the quadrature mirror filters associated with the scaling function and the mother wavelet function, respectively. The recursive relations between the $j$ level and the $j+1$ level are given as

$$f_{j+1}^{2i-1}(t) = \sum_{k=-\infty}^{\infty} h(k)\, f_{j}^{i}(2t-k) \quad (8)$$
$$f_{j+1}^{2i}(t) = \sum_{k=-\infty}^{\infty} g(k)\, f_{j}^{i}(2t-k) \quad (9)$$

The wavelet coefficients $c_{j}^{k}$ can be obtained from

$$c_{j}^{k} = \int_{-\infty}^{\infty} f(t)\,\psi_{j}^{i}(t)\,dt \quad (10)$$
Each wavelet packet subspace can be viewed as the output of a filter tuned to a particular basis. Thus, a signal can be decomposed into many wavelet packet components. A signal may also be represented by a selected set of wavelet packets for a given level of resolution. Different combinations of wavelet packets should be chosen for specific purposes.
[Figure 1: binary decomposition tree with the signal f(t) at the root, components f_1^1(t), f_1^2(t) at level j = 1, f_2^1(t)-f_2^4(t) at level j = 2, and f_3^1(t)-f_3^8(t) at level j = 3.]
Fig. 1. The wavelet packet decomposition of a time-domain signal
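To make the decomposition tree concrete, the following sketch builds the same three-level wavelet packet tree with the PyWavelets package; the test signal, the db4 wavelet and the sampling rate are illustrative assumptions rather than choices taken from the paper.

```python
# Sketch of the wavelet packet decomposition tree of Fig. 1 (levels j = 1..3),
# using PyWavelets. The test signal and the wavelet are illustrative choices only.
import numpy as np
import pywt

fs = 100.0                                    # sampling frequency (Hz), assumed
t = np.arange(0, 4, 1 / fs)
x = np.sin(2 * np.pi * 10 * t) + 0.5 * np.random.randn(t.size)

# Build the full wavelet packet tree down to level 3.
wp = pywt.WaveletPacket(data=x, wavelet='db4', mode='symmetric', maxlevel=3)

# Terminal nodes at level 3 correspond to f_3^1(t), ..., f_3^8(t) in Fig. 1,
# listed here in frequency order (low to high).
for node in wp.get_level(3, order='freq'):
    print(node.path, node.data.shape)
```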
2.3 Wavelet Packet Component Energy
Wavelet packet node energy is more robust in representing a signal than using the wavelet packet coefficients directly [13,14]. Define the signal energy as

$$E_f = \int_{-\infty}^{\infty} f^{2}(t)\,dt \quad (11)$$
The wavelet packet component energy $E_{f_{j}^{i}}$ can be defined as the energy stored in the component signal $f_{j}^{i}(t)$:

$$E_{f_{j}^{i}} = \int_{-\infty}^{\infty} \left|f_{j}^{i}(t)\right|^{2} dt \quad (12)$$
The total energy of the signal can be decomposed into a summation of wavelet packet component energies that correspond to different frequency bands, obtained by

$$E_{tot} = E_f = \sum_{i=1}^{2^{j}} E_{f_{j}^{i}} \quad (13)$$
In order to analyze a specific frequency region, an optimal tree structure should be selected. For example, as shown in Fig. 1, the signal can be covered by $f_{1}^{2}(t)$, $f_{2}^{1}(t)$ and $f_{2}^{2}(t)$, or by $f_{1}^{2}(t)$, $f_{2}^{2}(t)$, $f_{3}^{1}(t)$ and $f_{3}^{2}(t)$. By defining the energy of each sub-band as $E_l$, the normalized relative wavelet packet energy can be given as

$$P_l = \frac{E_l}{E_{tot}} \quad (14)$$
$P_l$ denotes the energy distribution over the wavelet packets. It is clear that this energy distribution is sensitive to energy changes in the signal components and represents the energy relation among the wavelet packets.

2.4 Wavelet Packet Entropy
As discussed, the Shannon entropy provides a way of measuring the distribution of the amount of disorder in a system, and it can be regarded as a measure of uncertainty regarding the information content of the system. Thus, following the definition of entropy given by Shannon, the wavelet packet entropy is defined as

$$S_{wp} = -\sum_{l} p_l \ln p_l \quad (15)$$
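The relative energies of Eq. (14) and the entropy of Eq. (15) can be computed directly from the component energies; the short sketch below assumes a vector of component energies such as the one produced in the previous sketch.

```python
# Sketch: relative wavelet packet energies P_l (Eq. 14) and wavelet packet
# entropy S_wp (Eq. 15) from a vector of component energies.
import numpy as np

def wavelet_packet_entropy(energies, eps=1e-12):
    p = np.asarray(energies, dtype=float)
    p = p / p.sum()                             # Eq. (14): P_l = E_l / E_tot
    return float(-np.sum(p * np.log(p + eps)))  # Eq. (15)

# Energy concentrated in one band gives entropy near 0; equal energy in all
# 8 bands gives the maximum value ln(8) ~ 2.08.
print(wavelet_packet_entropy([1, 0, 0, 0, 0, 0, 0, 0]))
print(wavelet_packet_entropy(np.ones(8)))
```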
If the signal is a mono-frequency signal, all the energy will lie within one frequency band and the energy of all other frequency bands will be nearly zero. As a result, the relative wavelet packet energies will be 1, 0, 0, ..., which leads to a zero or very low value of the wavelet packet entropy. On the other hand, for a very disordered signal such as a random process, the energy is distributed over every frequency band; the relative wavelet packet energies will be almost equal and lead to a maximum value of the wavelet packet entropy. The wavelet packet decomposition enables us to choose the best combination of components for representing the EEG rhythms. A particular choice of tree structure containing various components is used as a time-varying filter, giving four different filter banks corresponding to the four types of time-varying EEG rhythms. For instance, a six-level decomposition with a Daubechies wavelet function is applied to detect the basic rhythms of EEG signals. The lowest frequency resolution can be estimated as:
$$\Delta f = \frac{1}{2^{6}}\cdot\frac{f_s}{2} = 0.7812\ \mathrm{Hz} \quad (16)$$

The four common EEG rhythms, namely the β rhythm (13.28-30.47 Hz), α rhythm (7.812-13.28 Hz), θ rhythm (3.906-7.812 Hz) and δ rhythm (0.7812-3.906 Hz), can then be extracted. Several experiments are carried out to illustrate the time-varying filtering characteristics of the specified rhythms. Some clinical EEG signals are investigated through the wavelet packet transform to show the transients of the rhythms and the satisfactory filtering characteristics of the four kinds of rhythms.
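The band assignment can be illustrated numerically: the sketch below reproduces the frequency resolution of Eq. (16) for fs = 100 Hz and groups the 64 terminal nodes of a 6-level decomposition into the four rhythm bands quoted above; the small tolerance and the exact node grouping are illustrative assumptions.

```python
# Sketch: frequency resolution of Eq. (16) and an illustrative mapping of the
# terminal nodes of a 6-level wavelet packet decomposition onto the rhythm bands.
fs = 100.0
level = 6
df = (fs / 2) / 2 ** level                     # = 0.78125 Hz per terminal node
bands = {'delta': (0.7812, 3.906), 'theta': (3.906, 7.812),
         'alpha': (7.812, 13.28), 'beta': (13.28, 30.47)}
tol = 0.01                                     # tolerance for rounded band edges

for name, (lo, hi) in bands.items():
    # terminal-node indices (frequency order) whose sub-band lies inside [lo, hi]
    idx = [k for k in range(2 ** level)
           if k * df >= lo - tol and (k + 1) * df <= hi + tol]
    print(name, idx[0], idx[-1], round(idx[0] * df, 3), round((idx[-1] + 1) * df, 3))
```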
In order to study the time-varying characteristics of the different rhythms of multi-channel EEG signals for visual analysis, we develop an approach that constructs a dynamic EEG topography, which is very helpful for physicians to investigate the changes of multi-channel brain activity in a specific rhythm. The time-varying energy of the specified rhythm of the EEG signal is defined as

$$E^{(i)}(t) = \left|x^{(i)}(t)\right|^{2} \quad (17)$$
The time-varying EEG topography of the 14-channel brain activity in a specific rhythm is then obtained. We can display a specific rhythm simultaneously across channels for a selected short time period of interest. For example, the alpha rhythm transient reflects the main changes of brain electrical activity of a normal person. The time-varying energies of event-related brain rhythms can be tracked by observing the temporal variations of the squares of the wavelet packet coefficients.
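A small sketch of Eq. (17), computing the instantaneous energy of an extracted rhythm for each channel with an optional moving-average smoothing; the array shapes and the window length are assumptions, not the paper's settings.

```python
# Sketch: time-varying energy of Eq. (17) for each channel's extracted rhythm.
import numpy as np

def time_varying_energy(rhythm, win=25):
    """rhythm: (n_channels, n_samples) array of one extracted EEG rhythm."""
    energy = rhythm ** 2                               # E^(i)(t) = |x^(i)(t)|^2
    kernel = np.ones(win) / win
    return np.array([np.convolve(ch, kernel, mode='same') for ch in energy])

alpha = np.random.randn(14, 400)                       # placeholder data, 14 channels
E = time_varying_energy(alpha)
print(E.shape)
```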
[Figure 2(a): raw EEG trace and the extracted δ, θ, α and β rhythms at the C1 channel, plotted against sample index 0-400 with amplitudes of roughly ±200.]
Fig. 2. (a) Four kinds of time-varying EEG rhythms at the C1 channel. (b) The time-varying contour mapping of the alpha rhythm in (a).
3 Results and Discussion

In this section, real EEG signals from normal subjects were digitally collected with a standard commercial electroencephalograph (Model EEG 4400A, Nihon Kohden Corporation). Fourteen-channel EEGs were sampled at a frequency of 100 Hz with the international 10-20 system, recorded at the scalp locations Fp1, Fp2, F3, F4, C3, C4, P3, P4, O1, O2, F7, F8, T5 and T6 [13]. Fig. 2(a) shows the four time-varying EEG rhythms of normal EEG data at channel C1 obtained with a 6-level Daubechies wavelet packet decomposition. The transient rhythms are clearly exhibited in the experimental results. It can be seen that the alpha rhythm becomes the main rhythm when subjects are at rest with eyes closed. Fig. 2(b) shows the contour mapping of the time-varying BEAM of the alpha rhythm, displayed as a 2-D surface by representing the head's surface as an elliptical area. A two-dimensional interpolation method was applied to reconstruct the overall 14-channel map from the α rhythm. Thus, the time-varying characteristics of the α rhythm over all channels are clearly shown in Fig. 2(b). In addition, the time-varying contour mapping provides the possibility of computing scalp spectral density maps of the active areas of different brain rhythms on the cortex. The analysis above confirms that the alpha rhythm in the EEG is enhanced when the subject closes his eyes, which represents a different brain state compared with the eyes-open condition. To compare the two kinds of EEG, two segments of EEG signal were chosen: the first segment is a 2-second EEG recording with eyes open, and the other is a 2-second period with the subject's eyes closed. The wavelet packet decomposition was applied to both EEGs to extract the rhythms. As shown in Fig. 3, the four kinds of rhythms were comparable to each other when the subject opened his eyes. However, the alpha rhythm was enhanced and became the
[Figure 3: wavelet packet entropy (vertical axis, 0-1.4) plotted against time step (horizontal axis, 0-20).]
Fig. 3. The time-varying wavelet packet entropy of EEG signals with eyes open (solid line) and eyes closed (dotted line)
dominant rhythm when the subject closed his eyes. These preliminary experiments demonstrated that the proposed method can effectively detect the specified transient EEG rhythms with high time-frequency resolution. The parameters of the wavelet packet decomposition corresponding to the rhythms were successfully used to form the contour mapping, which has important clinical significance for EEG physicians. We also examined the time-dependent wavelet packet entropy, shown in Fig. 3. The time-varying wavelet packet entropies with eyes closed and eyes open are plotted in Fig. 3, together with the temporal average wavelet packet entropy, which can be viewed as a quantitative parameter of the system's complexity. It can be seen from the experiments that the brain activity in the eyes-closed state is less disordered than in the eyes-open state.
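The two-dimensional interpolation used above to turn the 14 electrode energies into a topographic contour map could be sketched as follows; the electrode coordinates, grid size and the use of scipy are illustrative assumptions, not the authors' implementation.

```python
# Sketch: one topographic-map frame from the instantaneous alpha-rhythm energy
# of 14 electrodes, by 2-D interpolation over an elliptical head outline.
import numpy as np
from scipy.interpolate import griddata

# Rough illustrative 2-D positions for Fp1, Fp2, F3, F4, C3, C4, P3, P4,
# O1, O2, F7, F8, T5, T6 on a unit head (placeholder values only).
pos = np.array([[-0.3, 0.9], [0.3, 0.9], [-0.4, 0.5], [0.4, 0.5],
                [-0.5, 0.0], [0.5, 0.0], [-0.4, -0.5], [0.4, -0.5],
                [-0.3, -0.9], [0.3, -0.9], [-0.8, 0.5], [0.8, 0.5],
                [-0.8, -0.5], [0.8, -0.5]])
values = np.random.rand(14)                      # alpha energy at one time instant

xi, yi = np.meshgrid(np.linspace(-1, 1, 64), np.linspace(-1, 1, 64))
topo = griddata(pos, values, (xi, yi), method='cubic')
topo[xi ** 2 + yi ** 2 > 1] = np.nan             # mask points outside the head area
print(np.nanmax(topo))
```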
4 Conclusions

This paper presents a nonlinear method based on wavelet packet analysis for the rhythm decomposition and entropy study of EEG signals and ERPs. The wavelet packet transform is applied to design filters with different frequency characteristics in order to extract different kinds of dynamic EEG rhythms. The relative wavelet packet energy and the wavelet packet entropy are calculated: the relative energy provides information about the energy contained in the different rhythms, while the wavelet packet entropy measures the degree of order/disorder of the clinical EEG signal. The proposed method outperforms the common discrete wavelet decomposition and can be used to isolate specific EEG and ERP rhythms more accurately for practical applications. The method presented in this paper is more flexible and accurate for designing specific filter banks because of its better match to the time-frequency characteristics of the EEG signal when extracting different EEG rhythms. Finally, our method can also be used as a new way of analyzing other kinds of medical signals.
Acknowledgement. This work is supported by the Natural Science Foundation of China (60271023 and 60571066) and the Natural Science Foundation of Guangdong.
References
1. Pardey, J., Roberts, S., Tarassenko, L.: A Review of Parametric Modeling Techniques for EEG Analysis. Med. Eng. Phys. 8(1) (1996) 2-11
2. Jung, T.P., et al.: Estimating Alertness from the EEG Power Spectrum. IEEE Transactions on Biomedical Engineering 44(1) (1997) 60-69
3. D'Attellis, C.E., et al.: Detection of Epileptic Events in Electroencephalograms Using Wavelet Analysis. Annals of Biomedical Engineering 25 (1997) 286-293
4. Blanco, S., et al.: Time-Frequency Analysis of Electroencephalogram Series 2. Gabor and Wavelet Transforms. Physical Review E 54(6) (1996) 6661-6672
5. Thakor, N.V., et al.: Multiresolution Wavelet Analysis of Evoked Potentials. IEEE Transactions on Biomedical Engineering 40(11) (1993) 1085-1093
6. Clark, I., et al.: Multiresolution Decomposition of Non-Stationary EEG Signals: A Preliminary Study. Comput. Bio. Med. 25(4) (1995) 373-382
7. Unser, M., Aldroubi, A.: A Review of Wavelets in Biomedical Applications. Proceedings of the IEEE 84(4) (1996) 626-638
8. Blinowska, K.J., Durka, P.J.: Application of Wavelet Transform and Matching Pursuit to the Time-Varying EEG Signals. Proc. of Conf. Artif. Neural Networks in Eng. (1994) 535-540
9. Schiff, S.J., et al.: Fast Wavelet Transformation of EEG. Electroencephalography and Clinical Neurophysiology 91(6) (1994) 442-455
10. Tseng, S., et al.: Evaluation of Parametric Methods in EEG Signal Analysis. Med. Eng. Phys. 17(1) (1995) 71-78
11. Pesquet, J., Krim, H., Carfantan, H.: Time-Invariant Orthonormal Wavelet Representations. IEEE Transactions on Signal Processing 44(8) (1996) 1964-1970
12. Daubechies, I.: Orthonormal Bases of Compactly Supported Wavelets. Communications on Pure and Applied Mathematics XII (1988) 909-996
13. Quiroga, R., et al.: Wavelet Entropy in Event-Related Potentials: A New Method Shows Ordering of EEG Oscillations. Biological Cybernetics 84 (2001) 291-299
14. Sun, Z.: Continuous Condition Assessment for Bridges Based on Wavelet Packets Decomposition. Proceedings of SPIE - The International Society for Optical Engineering 4337 (2001) 357-367
Comparing Analytical Decision Support Models Through Boolean Rule Extraction: A Case Study of Ovarian Tumour Malignancy M.S.H. Aung1, P.J.G Lisboa1, T.A. Etchells1, A.C. Testa2, B. Van Calster3, S. Van Huffel3, L. Valentin4, and D. Timmerman5 1
School of Computing and Mathematical Sciences, Liverpool John Moores University, UK {M.S.Aung,P.J.Lisboa,T.A.Etchells}@ljmu.ac.uk 2 Istituto di Clinica Ostetrica e Ginecologica, Università Cattolica del Sacro Cuore, Rome, Italy 3 Dept of Electrical Engineering, ESAT-SCD, Katholieke Universiteit Leuven, Belgium {ben.vancalster,sabine.vanhuffel}@esat.kuleuven.be 4 Dept of Obstetrics and Gynaecology, University Hospital Malmö, Sweden 5 Dept of Obstetrics and Gynaecology, University Hospitals, Katholieke Universiteit Leuven, Belgium
[email protected]
Abstract. The relative performances of different classifiers applied to the same data are typically analyzed using the Receiver Operator Characteristic (ROC) framework. This paper proposes a further analysis by explaining the operation of classifiers using low-order Boolean rules to fit the predicted response surfaces, using the Orthogonal Search Based Rule Extraction algorithm (OSRE). Four classifiers of malignant or benign ovarian tumours are considered. The models analyzed are two Logistic Regression models and two Multi-Layer Perceptrons with Automatic Relevance Determination (MLP-ARD), each applied to a specific alternative covariate subset. While all models have comparable classification rates by Area Under the ROC curve (AUC), the classification varies for individual cases, and so do the resulting explanatory rules. Two sets of clinically plausible rules are obtained which account for over one half of the malignancy cases, with near-perfect specificity. These rules are simple, explicit, and can be prospectively validated in future studies.
1 Introduction

The logic behind the behaviour of parametric classification models is often described with reference to decision boundaries within the data space of N dimensions, where N is the number of input attributes or variables. However, the often complex morphology of the boundaries makes it difficult to describe them explicitly. For this reason parametric classification models are often treated as black boxes and evaluated as such, typically using the Area Under the Receiver Operator Characteristic curve (AUC). This study shows how deeper insight can be gained about the decision boundaries if they are approximated using axis-parallel hyper-cubes in the data space. The axis-parallel morphology corresponds to a Boolean rule specifying the limits to each variable in the
hyper-cube. Consequently these statements are explanations of how a model classifies a data point into a particular class. While two models may have similar classification performances, the extracted rules will show whether the models are functionally similar or not. The rule extraction method used for this analysis is the Orthogonal Search Based Rule Extraction algorithm (OSRE) [5]. This is a fast and principled approach that fits hyper-cubes to the decision boundaries of selected models for the purpose of yielding low-order, mutually overlapping Boolean explanatory rules. This method has been used in a variety of applications [2] [4] including the medical domain. The explicit rules generated by the OSRE framework are especially important in safety-critical domains such as decision support for clinical medicine, as they enable direct validation against expert knowledge. This study is focused on a prospectively acquired dataset for ovarian cancer kindly provided by the International Ovarian Tumour Analysis group (IOTA) as part of collaborative research funded by the FP6 Network of Excellence: BIOPATTERN (www.biopattern.org). The International Ovarian Tumour Analysis group (IOTA) is a network of 9 clinical and academic centres from Belgium, Sweden, Italy, France and the United Kingdom that collected the data used in this study, known as the IOTA Phase I dataset. The group applied strict data acquisition protocols [6] specifically for the purpose of parametric modeling. The data consist of demographic variables, clinical signs and measurements derived from Doppler ultrasound scans for a cohort of 1066 patients diagnosed as having either a benign or a malignant ovarian tumour. There are over 40 explanatory variables measured for each patient. The database has been the object of several studies undertaken to produce classification models for diagnosing malignancy [1] [3] [6]. Four models with an AUC of over 0.9 [3] [6] were selected for further investigation with rule extraction: the first two models, labeled 'M1' and 'M2', are logistic regression models, and the latter two, labeled 'M3' and 'M4', are neural network classifiers, both based on a Bayesian Multi-Layer Perceptron method.
2 Orthogonal Search Based Rule Extraction

Orthogonal Search-based Rule Extraction (OSRE) is a framework for automatic rule extraction and pruning which efficiently extracts low-order rules from the smooth decision surfaces created by analytical classifiers. The OSRE framework uses the data that constructed the smooth model to search, in orthogonal directions, for hyper-cubes containing the regions of the data space for which the model prediction is in-class (Fig. 1). The hyper-boxes that capture in-class data are converted into conjunctive rules expressed by the boundary values for each covariate. As the algorithm is applied to each data item that the model predicts to be in-class, there are as many rules produced as there are predicted in-class data, giving the rule set Rn. This large number of rules is then automatically pruned according to predefined performance criteria, intended typically to maximize the proportion of data explained or the positive predictive value (PPV) of the rule set [5].
Fig. 1. An example of creating a hyper-cube: after orthogonally searching the data space from a sample point, the size of the cube is determined by the lengths of the search spans limited by the decision surface from the model or the extreme of the data space
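The orthogonal search can be illustrated with a toy sketch: starting from an in-class sample, each covariate is expanded in both directions until the classifier's prediction leaves the in-class region, yielding one hyper-box. This is a simplified illustration of the idea, not the published OSRE algorithm; the step size, data-space limits and the toy model are placeholders.

```python
# Toy sketch of the orthogonal-search idea behind OSRE (simplified, illustrative).
import numpy as np

def hyper_box(model_in_class, x0, lows, highs, step=0.05):
    """model_in_class(x) -> bool; x0: in-class sample; lows/highs: data-space extremes."""
    x0 = np.asarray(x0, dtype=float)
    box = np.column_stack([x0.copy(), x0.copy()])          # per-dimension [lower, upper]
    for d in range(x0.size):
        for sign, col, limit in ((-1, 0, lows[d]), (+1, 1, highs[d])):
            x = x0.copy()
            while True:
                nxt = x[d] + sign * step
                if (sign < 0 and nxt < limit) or (sign > 0 and nxt > limit):
                    break                                   # reached the data-space extreme
                x[d] = nxt
                if not model_in_class(x):
                    x[d] -= sign * step                     # step back inside the region
                    break
            box[d, col] = x[d]
    return box                                              # rows: (lower, upper) per covariate

def in_class(x):
    # toy in-class region: a disc around (0.5, 0.5)
    return np.sum((np.asarray(x) - 0.5) ** 2) < 0.1

print(hyper_box(in_class, x0=[0.5, 0.5], lows=[0, 0], highs=[1, 1]))
```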
All the rules are expressed in conjunctive form. For example, the hyper-cube in Fig. 1 can be stated as

$$r_1 = (1 \le a_1 \le 6) \wedge (1 \le a_2 \le 4) \wedge (3 \le a_3 \le 6), \quad (1)$$
where a1, a2 and a3 are input variables. The initial rule set is first reduced by removing any repeated rules. In some cases this single reduction strategy can reduce Rn to a very small rule set of conjunctive rules. However, if the data contains continuous variables then, in practice, further rule reduction techniques may be necessary. The next phase of refinement is a minimum specificity filter. The rules whose specificity value is less than some predefined value are removed from the list. In clinical diagnostic applications we would only be interested in rules that have a very high specificity value, normally not accepting rules whose specificity is less than 0.9. After this process a minimum sensitivity filter is applied to give a measure of the coverage of the rule, i.e. how much of the in-class data the rule covers. Sensitivity and specificity values can now be calculated for the disjunction of all of the remaining rules. It is important to note that the rules can be mutually overlapping. The sensitivity and specificity values of the individual conjunctive rules or the combined set of disjunctive rules can be represented as steps in a Receiver Operator Characteristic (ROC) plot. The final goal of the rule refinement process is to achieve a rule set whose global ROC point is closest to the point with unit sensitivity and specificity.
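A minimal sketch of these specificity and sensitivity filters, assuming each rule is represented as a hyper-box of per-covariate bounds; the function names and thresholds are illustrative.

```python
# Sketch of the rule-refinement filters: each conjunctive rule (hyper-box) is
# scored by sensitivity and specificity and kept only if it passes both thresholds.
import numpy as np

def rule_fires(X, box):
    """X: (n, d) data; box: (d, 2) lower/upper bounds. Returns a boolean mask."""
    return np.all((X >= box[:, 0]) & (X <= box[:, 1]), axis=1)

def filter_rules(rules, X, y, min_spec=0.9, min_sens=0.05):
    kept = []
    for box in rules:
        fired = rule_fires(X, box)
        tp = np.sum(fired & (y == 1))
        fp = np.sum(fired & (y == 0))
        sens = tp / max(np.sum(y == 1), 1)
        spec = 1.0 - fp / max(np.sum(y == 0), 1)
        if spec >= min_spec and sens >= min_sens:
            kept.append((box, sens, spec))
    return kept
```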
3 Data and Classification Models

The International Ovarian Tumour Analysis Group Phase I dataset consists of 1066 patients with adnexal tumours. The data has been prospectively collected according to a standardized protocol at the nine IOTA group centres [6]. Much attention was given to uniformity of protocols to ensure integrity and consistency. The dataset consists of over 40 clinical, biochemical, and ultrasound variables that have the potential for malignancy diagnosis. The clinical variables included family history, age, and any previous hormonal therapy. Also included were variables derived from transvaginal
scans and transabdominal ultrasonography. The presence or absence of pain during the scan was also recorded. The target vector for the analytical models is the outcome of the tumour (benign or malignant).

3.1 Description of Multi-centre Models
Four classification models from previous multi-centre studies that have demonstrated good classification rates are selected for rule extraction. These were two logistic regression models described in Timmerman et al [6] and two Multi-Layer Perceptrons with Automatic Relevance Determination described in Van Calster et al [3]. Both studies aimed to produce models for the diagnosis of malignant ovarian tumours trained on the IOTA Phase I dataset. The variable selection methods and parameter determination algorithms are described in detail in those studies.

M1: Logistic Regression 1 with twelve variables selected
Timmerman et al [6] describe a linear logistic regression model with a 12-variable subset of the set of 40 variables (Table 1). The selection process was stepwise multivariate logistic regression. The AUC value of M1 for a validation set is 0.94 (SE = 0.017).

M2: Logistic Regression 2 with a six-variable subset
This model is a simplified version of M1 using six variables, also presented in Timmerman et al [6]; these variables were the first to be entered into the stepwise selection process of the M1 set. The AUC value for M2 is 0.92 (SE = 0.018) for a validation set. The six variables are a subset of those in Table 1, namely Age, Ascites, PapFlow, MaxSolid, WallRegularity and Shadows.

M3: Multi-Layer Perceptron with Automatic Relevance Determination 1 with eleven variables selected
Described in Van Calster et al [3], this model uses the connectionist Multi-Layer Perceptron neural network for prediction. The network utilises a Bayesian framework

Table 1. Variables used in Logistic Regression 1 (M1)
Variable Name   | Description                                        | Variable Type | Univariate P Value
PerHistOvCa     | Personal History of Ovarian Cancer                 | Binary        | 0.0096
HormTherapy     | Current Hormonal Therapy                           | Binary        | 0.0477
Age             | Age of patient in years                            | Continuous    | <0.0001
MaxLes          | Size of largest lesion dimension in mm             | Continuous    | <0.0001
Pain            | Presence of pelvic pain during scan                | Binary        | 0.0034
Ascites         | Presence of fluid outside Pouch of Douglas         | Binary        | <0.0001
PapFlow         | Presence of blood flow within projections          | Binary        | <0.0001
Locul5          | Morphologic category of lesion is solid            | Binary        | <0.0001
MaxSolid        | Size of largest solid component dimension in mm    | Continuous    | <0.0001
WallRegularity  | Whether internal wall of lesion is irregular       | Binary        | <0.0001
Shadows         | Presence of acoustic shadows                       | Binary        | <0.0001
ColScore        | Subjective assessment of amount of blood flow      | 4 Categories  | <0.0001
that can automatically regularise the model and soft-prune unwanted variables. This framework is known as Automatic Relevance Determination (ARD). Evaluation of the M3 model showed an AUC of 0.94 (SE = 0.018) for a validation set.

Table 2. Variables used in MLP-ARD 1 (M3)
Variable Name   | Description                                             | Variable Type | Univariate P Value
PapNr           | Number of Papillations                                  | 5 Categories  | <0.0001
HormTherapy     | Current Hormonal Therapy                                | Binary        | 0.0477
Age             | Age of patient in years                                 | Continuous    | <0.0001
MaxLes          | Size of largest lesion dimension in mm                  | Continuous    | <0.0001
Locul4          | Morphologic category of lesion is multi-locular solid   | Binary        | <0.0001
Ascites         | Presence of fluid outside Pouch of Douglas              | Binary        | <0.0001
PapFlow         | Presence of blood flow within projections               | Binary        | <0.0001
Locul5          | Morphologic category of lesion is solid                 | Binary        | <0.0001
MaxSolid        | Size of largest solid component dimension in mm         | Continuous    | <0.0001
WallRegularity  | Whether internal wall of lesion is irregular            | Binary        | <0.0001
ColScore        | Subjective assessment of blood flow                     | 4 Categories  | <0.0001
M4: Multi-Layer Perceptron with Automatic Relevance Determination 2 with eleven variables selected
Also in Van Calster et al [3], a second MLP-ARD model with a good classification rate is described, with a differing variable set and an AUC of 0.93 (SE = 0.017) for a validation set. This model replaces the variables 'hormonal therapy', 'colour score', 'number of papillations' and 'locularity being: multi-locular solid' from M3 with 'pelvic pain', 'acoustic shadows', 'personal history of ovarian cancer' and 'locularity being unilocular'. Therefore, the variables for M4 are:

Table 3. Variables used in MLP-ARD 2 (M4)
Variable Name   | Description                                        | Variable Type | Univariate P Value
PerHistOvCa     | Personal History of Ovarian Cancer                 | Binary        | 0.0096
Pain            | Presence of pelvic pain during scan                | Binary        | 0.0034
Age             | Age of patient in years                            | Continuous    | <0.0001
MaxLes          | Size of largest lesion dimension in mm             | Continuous    | <0.0001
Locul1          | Morphologic category of lesion is unilocular       | Binary        | <0.0001
Ascites         | Presence of fluid outside Pouch of Douglas         | Binary        | <0.0001
PapFlow         | Presence of blood flow within projections          | Binary        | <0.0001
Locul5          | Morphologic category of lesion is solid            | Binary        | <0.0001
MaxSolid        | Size of largest solid component dimension in mm    | Continuous    | <0.0001
WallRegularity  | Whether internal wall of lesion is irregular       | Binary        | <0.0001
Shadows         | Presence of acoustic shadows                       | Binary        | <0.0001
4 Results

Two disjunctive sets of rules were generated for each model with different rule refinement criteria. The PPV criterion limits misclassifications but the sensitivity

Table 4. Set 1 and Set 2 rules for malignancy in descending order of PPV
Rule | Conjunctive Rule Statement | Individual rule: Sensitivity / Specificity / PPV | Cumulative in set: Sensitivity / Specificity / PPV

Set 1
1 | 69 <= MaxLes <= 410; Pain = 0; 44 <= MaxSolid <= 230; ColScore = 4 | 0.18 / 1 / 1 | 0.18 / 1 / 1
2 | 54 <= Age <= 94; Pain = 0; Ascites = 1; WallRegularity = 1; Shadows = 0; ColScore = 2 or 4 | 0.18 / 0.99 / 0.97 | 0.30 / 0.99 / 0.98
3 | 108 <= MaxLes <= 410; Pain = 0; 42 <= MaxSolid <= 230; WallRegularity = 1 | 0.22 / 0.99 / 0.97 | 0.39 / 0.99 / 0.97
4 | HormTherapy = 0; 46 <= Age <= 94; PapFlow = 1; 15 <= MaxSolid <= 230; ColScore = 3 or 4 | 0.19 / 0.99 / 0.95 | 0.49 / 0.99 / 0.96
5 | 31 <= Age <= 94; 77 <= MaxLes <= 410; Pain = 0; 30 <= MaxSolid <= 230; WallRegularity = 1; Shadows = 0; ColScore = 2 or 4 | 0.32 / 0.99 / 0.94 | 0.54 / 0.99 / 0.95

Set 2
1 | WallRegularity = 1; ColScore = 4; 51.6 <= Age <= 93.5 | 0.22 / 1 / 1 | 0.22 / 1 / 1
2 | HormTherapy = 0; WallRegularity = 1; ColScore = 4; 108 <= MaxLes <= 403 | 0.14 / 1 / 1 | 0.28 / 1 / 1
3 | PapNr = 5; PapFlow = 1; ColScore = 3 or 4; 59 <= MaxLes <= 401; 19.9 <= MaxSolid <= 227 | 0.18 / 1 / 1 | 0.39 / 1 / 1
4 | WallRegularity = 1; Locul5 = 1; 49.72 <= Age <= 92.25 | 0.20 / 0.99 / 0.96 | 0.49 / 0.99 / 0.98
5 | HormTherapy = 0; PapNr = 3, 4 or 5; ColScore = 3 or 4; 28.7 <= MaxSolid <= 224.2 | 0.17 / 0.99 / 0.97 | 0.52 / 0.99 / 0.97
Fig. 2. Visualization of the hyper-cube for Rule 1 from Set 2, which only contains malignant cases, represented by circles

Root Population — In Class: 266, Out Class: 800
Rule 1 Yes In Class: 49 Out Class: 0 LR+ >10
Rule 1 No In Class: 217 Out Class: 800 LR- = 0.82
Rule 2 Yes In Class: 32 Out Class: 1 LR+ >10
Rule 2 No In Class: 186 Out Class: 799 LR- = 0.82
Rule 3 Yes In Class: 24 Out Class: 2 LR+ >10
Rule 3 No In Class: 162 Out Class: 797 LR- = 0.78
Rule 4 Yes In Class: 26 Out Class: 2 LR+ >10
Rule 4 No In Class: 136 Out Class: 795 LR- = 0.8
Rule 5 Yes In Class: 14 Out Class: 2 LR+ >10
Rule 5 No In Class: 122 Out Class: 793 LR- = 0.69
Fig. 3. Hierarchical Rule Tree for Set 1: the positive and negative likelihood ratios for each rule are shown
Root Population — In Class: 266, Out Class: 800
Rule 1 Yes In Class: 60 Out Class: 0 LR+ >10
Rule 1 No In Class: 206 Out Class: 800 LR- = 0.77
Rule 2 Yes In Class: 15 Out Class: 0 LR+ >10
Rule 2 No In Class: 191 Out Class: 800 LR- = 0.861
Rule 3 Yes In Class: 29 Out Class: 0 LR+ >10
Rule 3 No In Class: 162 Out Class: 800 LR- = 0.81
Rule 4 Yes In Class: 28 Out Class: 2 LR+ >10
Rule 4 No In Class: 134 Out Class: 798 LR- = 0.79
Rule 5 Yes In Class: 8 Out Class: 1 LR +>10
Rule 5 No In Class: 126 Out Class: 797 LR- = 0.83
Fig. 4. Hierarchical Rule Tree for Set 2: the positive and negative likelihood ratios for each rule are shown
criterion leads to better coverage of the data. The PPV criterion resulted in better specificity for the overall disjunctive rule set, and was therefore the preferred approach. Moreover, two of the analytical models (M1 and M3) generated rules with near-perfect specificity. The ranked rules were then truncated when too few additional cases were explained. This resulted in the two rule sets listed in Table 4. Hierarchical rule trees describing the rule sets and their performance in mutually exclusive branches are shown in Fig. 3 and Fig. 4. All rules from both rule sets showed high specificity and thus a high positive likelihood ratio (LR+) for the detection of malignancy, given by

$$LR^{+} = \frac{\text{Sensitivity}}{1 - \text{Specificity}}, \quad (2)$$

and the negative likelihood ratio (LR-) is also calculated to quantify the predictive power of the rule complement for detecting benign cases, given by

$$LR^{-} = \frac{1 - \text{Sensitivity}}{\text{Specificity}}. \quad (3)$$
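For instance, the likelihood ratios can be computed directly from a rule's sensitivity and specificity, as in the short sketch below (the example uses the sensitivity and specificity of Rule 2 of Set 1).

```python
# Sketch: positive and negative likelihood ratios of Eqs. (2) and (3).
def likelihood_ratios(sens, spec, eps=1e-9):
    lr_pos = sens / max(1 - spec, eps)       # LR+ = sensitivity / (1 - specificity)
    lr_neg = (1 - sens) / max(spec, eps)     # LR- = (1 - sensitivity) / specificity
    return lr_pos, lr_neg

print(likelihood_ratios(0.18, 0.99))         # Rule 2 of Set 1: LR+ = 18, LR- ~ 0.83
```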
Rule Set 1 cumulatively identifies 144 of the 266 malignant cases with 2 misclassifications, compared to set 2 which identifies 140 malignant cases with 3 misclassifications. The overall sensitivities of Set 1 and 2 are 0.54 and 0.52, respectively. From the
ten selected rules shown, there are 4 rules that did not yield any misclassifications and hence have a specificity of 1; one of these is Rule 1 from Set 2, visualized in Fig. 2. The common variable in these four rules is the subjective categorization of the amount of blood flow, 'ColScore'. This demonstrates the value of this variable despite the inherent uncertainty due to its subjective nature. All of the rules were assessed by a clinical expert and found to be consistent with current knowledge.
5 Conclusions

The operation of alternative high-performing classifiers was described using multivariate hierarchical rule trees with a few low-order rules. This analysis framework is proposed as the basis for direct validation of analytical models against domain expertise, thus complementing and extending the traditionally used quantitative indices of aggregate performance. Over half of the malignant cases are explained by each rule set. The remaining cases may require a different classification paradigm whereby observations are allocated into ranked groups for risk of malignancy, so as to describe the clinical profiles representing more difficult classifications, which fall at intermediate values of the risk of malignancy. This approach is currently being developed.
6 Discussion

Rules are potentially more useful to the clinician than mathematical models because they can be taught and learned and used without computation. They might also prove useful for discovering previously undetected clinical correlations and might give more insight into the ultrasound morphology and clinical presentation of benign and malignant ovarian tumours. While simple rules based on single variables seem not to be reliable, or seem to be useful only for a small proportion of the patients, systematic rule extraction offers a much more extensive exploration of all possible rules hidden in a database. However, their clinical value needs to be assessed in future studies. A prospective validation is planned in IOTA Phase II.
References
1. Ameye, L.: Predictive Models for Classification in Gynecology. PhD Thesis, Faculty of Engineering, Katholieke Universiteit Leuven, Leuven, Belgium (2005)
2. Aung, M.S.H., Lisboa, P.J.G., Taktak, A.F.G., Damato, B.E.: Modelling Survival of Intraocular Melanoma using a Partial Logistic Artificial Neural Network with Automatic Relevance Determination and Orthogonal Search Based Rule Extraction. Proc. Computational Intelligence in Medicine (CIMED), Lisbon, Portugal (2005) 114-121
3. Van Calster, B., Timmerman, D., Nabney, I.T., Valentin, L., Van Holsbeke, C., Van Huffel, S.: Classifying Ovarian Tumors using Bayesian Multi-Layer Perceptrons and Automatic Relevance Determination: A Multi-Center Study. In: Proc. of the 28th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC 2006), New York City, USA (2006) 5342-5345
4. Etchells, T.A., Harrison, M.J.: Orthogonal Search-Based Rule Extraction for Modelling the Decision to Transfuse. Anesthesia 61(4) (2006) 335-338
5. Etchells, T.A., Lisboa, P.J.G.: Orthogonal Search-Based Rule Extraction (OSRE) for Trained Neural Networks: A Practical and Efficient Approach. IEEE Trans. Neural Netw. 17(2) (2006) 374-384
6. Timmerman, D., Testa, A.C., Bourne, T., Ferrazzi, E., Ameye, L., Konstantinovic, M.L., Van Calster, B., Collins, W.P., Vergote, I., Van Huffel, S., Valentin, L.: Logistic Regression Model to Distinguish Between the Benign and Malignant Adnexal Mass Before Surgery: A Multicenter Study by the International Ovarian Tumor Analysis Group. J. Clin. Oncol. (2005) 8794-8801
Human Sensibility Evaluation Using Neural Network and Multiple-Template Method on Electroencephalogram (EEG)* Dongjun Kim1, Seungjin Woo1, Jeongwhan Lee2, and Kyeongseop Kim2 1
School of Electronics & Information Engineering, College of Science & Engineering Cheongju University, Cheongju 360-764, Republic of Korea
[email protected],
[email protected] 2 School of Biomedical Engineering, College of Biomedical & Health Science Konkuk University, Chungju 380-701, Republic of Korea {jwlee,kyeong}@kku.ac.kr
Abstract. This study presents a human sensibility evaluation method using a neural network and a multiple-template method on the electroencephalogram (EEG). For our research objective, 10-channel EEG signals are collected from healthy subjects. After the necessary preprocessing is performed on the acquired signals, various EEG parameters are estimated and their discriminating performance is evaluated in terms of pattern classification capability. In our study, Linear Prediction (LP) coefficients are utilized as the feature parameters describing the characteristics of the EEG signal, and a multi-layer neural network is employed to indicate the degree of human sensibility. The estimation of human comfortableness is performed under varying temperature and humidity conditions, and our results show that the proposed scheme achieves good performance in evaluating human sensibility. Keywords: EEG signals, human sensibility evaluation, neural network, multiple-template method.
1 Introduction

Human sensibility ergonomics is a research field aimed at making products and human life more convenient and comfortable. Various methods for the qualitative and quantitative evaluation of human sensibility have been developed, and this field requires an understanding of the features of the human mind and sensitivity. For human sensibility evaluation, biomedical signals such as blood pressure, the electrocardiogram (ECG), pulse wave, skin temperature, and the electroencephalogram (EEG) are commonly used. Among them, the EEG signal may be the most suitable for estimating human sensibility because of its inherent capability to represent the behavior of the central nervous system. *
This work was supported by the Regional Research Center Program on the Ministry of Education & Human Resources Development in Korea.
There has been much research on human sensibility or emotion using EEG signals. Yoshida [1] developed a technique for evaluating brain activity using the frequency fluctuation of frontal alpha waves, which can estimate stress or feelings of comfort associated with products and environments. He showed that the right frontal region generally correlates with the 'feeling of arousal' and the left frontal region with 'positive-negative mood'. Davidson [2] proposed that the anterior regions of the two cerebral hemispheres are specialized for approach and withdrawal processes, with the left hemisphere specialized for the former and the right for the latter. Eysenck [3] published results indicating that the extrovertive/introvertive and the neuroticism/stability states are related to arousal of the reticulo-cortical and reticulo-limbic circuits, respectively. Schmidtke [4] reported that brain activities revealed asymmetry in the frontal and posterior regions according to the extrovertive/introvertive and neuroticism/stability states. Musha [5] proposed the emotion spectrum analysis method (ESAM), which can map EEG signals to four emotion indices using cross-correlation coefficients of the θ, α and β bands of FFT spectra. Anderson [6] tried to classify some mental tasks using auto-regressive (AR) modeling of 6-channel EEGs and neural networks. Some of these findings are very meaningful, but many of them are not sufficiently validated in a statistical sense or reveal theoretical weaknesses. This study proposes a new method for human sensibility evaluation that uses a neural network and the multiple-template method on EEG signals. In this method, Linear Prediction (LP) coefficients are utilized as the feature parameters of the EEG signal, and a multi-layer neural network is employed as the sensibility indicator.
2 EEG Signal Processing

2.1 EEG Signal Acquisition

The EEG signal acquisition system consisted of an electro-cap, an EEG amplifier, an A/D converter, and a desktop computer. The electro-cap (Electro-Cap International Inc., USA) contains 20 channels. The EEG amplifier (Jungsang-Techno Inc., Korea) and A/D converter (DT-9804, Data Translation Inc., USA) are used. The EEG signal is amplified with a gain of 2000 and digitized at a sampling frequency of 12 samples/sec with a resolution of 12 bits. Ten EEG channels (Fp1, Fp2, F3, F4, T3, T4, P3, P4, O1, O2) of the international 10/20 electrode system are collected from 10 healthy subjects, and the measurements were repeated three times for each subject. For the relaxed or comfortable states, room temperature is maintained at 23-26 ℃ and humidity at 50-60%. For the uncomfortable state, the temperature and humidity are above 30 ℃ and 60%, respectively.
2.2 Pre-processing

Since the amplitude of the EEG is very weak, it may be corrupted by external noise or artifacts. For instance, power line noise (60 Hz) and eye blinking artifacts are the most common noise sources. For accurate analysis of the EEG, these noises and artifacts must be reduced or eliminated. The power line noise can be removed using a
digital filter, and a technique based on the characteristics of the eye blinking signal is developed to remove the effects of eye blinking. Generally, the intensity of the evoked signal due to eye blinking is greater than that of the EEG signal. The technique for eliminating the eye blinking effects utilizes this difference in amplitude, and the details of the process are as follows. Firstly, the mean of the raw EEG signal is multiplied by four and this value is used as the threshold to distinguish eye blinking from the normal EEG signal. To find the signal portion affected by eye blinking, the peaks of the signal caused by eye blinking are detected. For peak picking, the slope of the absolute value of the raw data is used: when a peak value is greater than the threshold, it is regarded as an eye blinking peak. The end point of the eye blink is estimated from the sign of the signal; when the sign changes twice, that point is regarded as the end of the eye blink. To remove the transient effect of eye blinking, one average eye-blink interval is excluded from our analysis. To remove the dc offset and low frequencies below 4 Hz, a 4th-order IIR high-pass filter (HPF) is designed with the transfer function of equation (1). In this work we are especially interested in the θ, α, and β bands; generally, frequencies above 30 Hz are not used in human emotion or sensibility analysis. Thus, a 4th-order IIR low-pass filter (LPF) with a cutoff frequency of 30 Hz is also designed, and equation (2) shows its transfer function. Figs. 1 and 2 show the magnitude responses of the digital HPF and LPF, respectively.
$$H(z) = \frac{z^{-2} - 2z^{-1} + 1}{0.842z^{-2} - 1.981z^{-1} + 1.177} \cdot \frac{z^{-2} - 2z^{-1} + 1}{0.914z^{-2} - 1.98z^{-1} + 1.105} \quad (1)$$

$$H(z) = \frac{0.821(z^{-2} - 2z^{-1} + 1)}{0.281z^{-2} - 0.357z^{-1} + 3.362} \cdot \frac{0.821(z^{-2} - 2z^{-1} + 1)}{0.915z^{-2} - 0.357z^{-1} + 2.728} \quad (2)$$
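A sketch of applying the HPF of Eq. (1) as a cascade of its two second-order sections with scipy.signal; the coefficient ordering (constant term first) is transcribed from the equation, and the raw EEG array is a placeholder.

```python
# Sketch: applying the cascaded second-order sections of the HPF in Eq. (1).
import numpy as np
from scipy import signal

# Each section: (numerator, denominator) in ascending powers of z^{-1}.
hpf_sections = [
    ([1.0, -2.0, 1.0], [1.177, -1.981, 0.842]),
    ([1.0, -2.0, 1.0], [1.105, -1.980, 0.914]),
]

def apply_cascade(x, sections):
    y = np.asarray(x, dtype=float)
    for b, a in sections:
        y = signal.lfilter(b, a, y)
    return y

eeg = np.random.randn(1200)          # placeholder raw EEG segment
filtered = apply_cascade(eeg, hpf_sections)
print(filtered.shape)
```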
Fig. 1. Magnitude response of HPF
After pre-processing of signals is performed, the various EEG parameters such as LP and linear cepstral coefficients are estimated and their performance in terms of pattern classification capability is evaluated. It turned out that LP coefficients showed the best performance and consequently we selected LP coefficients as EEG parameters.
Fig. 2. Magnitude response of LPF
3 Sensibility Evaluation Algorithm

The sensibility evaluation algorithm designed in this study is based on the multiple-template method that is commonly used in speaker-independent speech recognition. The multiple templates are made up of two personality groups: the extrovertive and the introvertive group. The composition of our sensibility evaluation algorithm is depicted in Fig. 3.
[Figure 3 block diagram: a template-construction path (EEG signal → preprocessing & feature extraction → training of neural network → templates of the extrovert group / templates of the introvert group) and a sensibility-evaluation path (relaxation of subject and presentation of tasks → acquisition of EEG signal → preprocessing & feature extraction → selection of best template → neural network → sensibility evaluation).]
Fig. 3. Composition of sensibility evaluation algorithm based on personality-templates
In the template construction mode, first of all, the neural network is trained using EEG parameters to classify the sensibility states of typical relaxed, comfortable or uncomfortable conditions. The neural network model used in this study is a multi-layer perceptron (MLP). Typical states of each condition were selected by the subjects. The templates of each group are composed of the weights of the trained neural network. In the sensibility evaluation mode, before the test, subjects select their personality group. The subjects are then asked to relax, and using the EEG parameters recorded at that time, the neural network selects the best matched template (weights) among the
Fig. 4. Multi-layer perceptron neural network model for the sensibility indicator
selected group. Tasks, in terms of temperature and humidity, are then applied to the subjects for sensibility evaluation, and the neural network produces an output indicating the sensibility index. The network structure shown in Fig. 4 is adopted as the sensibility indicator. We assume an autoregressive (AR) model for EEG production, and 6th-order LP coefficients are used as the EEG feature parameters. The LP coefficients form the input of the neural network, and the output layer produces the comfortableness indices. The target vectors of the output layer for classifying the sensibility states of the relaxed, comfortable and uncomfortable conditions are [1, 0, 0], [0, 1, 0] and [0, 0, 1], respectively. The network was trained by the widely publicized backpropagation algorithm [7].
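The feature extraction and classifier could be sketched as follows: 6th-order LP coefficients obtained by the autocorrelation (Yule-Walker) method feed a small MLP with three output classes. The use of scikit-learn, the hidden-layer size and the placeholder data are assumptions, not the authors' implementation.

```python
# Sketch: 6th-order LP feature extraction (Yule-Walker) and a small MLP classifier
# for the three target classes (relaxed / comfortable / uncomfortable).
import numpy as np
from sklearn.neural_network import MLPClassifier

def lp_coefficients(x, order=6):
    x = np.asarray(x, dtype=float)
    r = np.correlate(x, x, mode='full')[x.size - 1:x.size + order]   # r(0)..r(order)
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])      # LP feature vector of length 6

# Placeholder training data: labelled EEG segments (labels 0/1/2 for the 3 states).
segments = [np.random.randn(4000) for _ in range(30)]
labels = np.random.randint(0, 3, size=30)
X = np.vstack([lp_coefficients(s) for s in segments])

mlp = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000)
mlp.fit(X, labels)
print(mlp.predict(X[:5]))
```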
4 Experimental Results

When a subject feels the most relaxed, comfortable or uncomfortable, he or she gives a sign with the index finger to the experiment assistant. A forty-second interval of EEG data, consisting of the 20-second sections before and after the most relaxed, comfortable, or uncomfortable instant, is used for human sensibility evaluation. Sensibility evaluation tests are performed in three different scenarios: (1) group match mode, in which the neural network selects the best matched template among the templates of the subject's own group; (2) random mode, in which the neural network selects the best matched template among all mixed

Table 1. Results of human sensibility evaluation using personality-group templates
Case           | Test data     | Templates     | No. of Data | Accuracy (%) | Avr. (%)
Group match    | Extrovertive  | Extrovertive  | 15          | 93.3         | 90.0
Group match    | Introvertive  | Introvertive  | 15          | 86.7         |
Random         | Random        | Random        | 30          | 76.7         | 76.7
Group mismatch | Extrovertive  | Introvertive  | 15          | 66.7         | 73.4
Group mismatch | Introvertive  | Extrovertive  | 15          | 80.0         |
templates; and (3) group mismatch mode, in which the neural network selects the best matched template among the templates of the opposite group. Table 1 shows the results of these three cases. As shown in Table 1, the group match case gives the best results. Fig. 5 shows an example of the neural network output drawn as a curve representing the sensibility evaluation indices.
Fig. 5. An example of sensibility evaluation curve
Fig. 6. An example of post-processed sensibility evaluation curve
The curve shows good performance in sensibility evaluation, but a bias toward the relaxation state could be observed. Thus, a post-processing technique is designed as shown in equations (3) and (4):

$$y'[n] = \{\log_{10}(y[n] \times 1000)\} / 3 \quad (3)$$

$$y''[n] = y'[n] - \overline{y'}[n] \quad (4)$$
Here, y[n], y'[n], and y''[n] represent the raw output, the intermediate data, and the final output, respectively. An example of a post-processed curve is shown in Fig. 6. The result shows an improved curve shape.
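A small sketch of this post-processing; note that Eq. (4) is read here as removing the mean of y'[n], which is an interpretation of the garbled original rather than a confirmed formula.

```python
# Sketch of the post-processing in Eqs. (3)-(4).
import numpy as np

def postprocess(y):
    y = np.asarray(y, dtype=float)
    y1 = np.log10(y * 1000.0) / 3.0        # Eq. (3); assumes y[n] > 0
    return y1 - y1.mean()                  # Eq. (4), assumed mean removal

print(postprocess([0.1, 0.5, 0.9]))
```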
5 Conclusions

This study proposes a human sensibility evaluation method using a neural network and a multiple-template method for two personality groups, based on the characteristics of the EEG. The proposed method uses LP coefficients as the feature parameters of the EEG, a multi-layer perceptron neural network, and pre-/post-processing techniques. Comfortableness evaluation tests under different room temperatures and humidities show good performance in human sensibility evaluation. Therefore, the proposed method is expected to be useful for evaluating sensibility (or emotion), especially for patients or infants. Acknowledgement. This work was supported by the Regional Research Center program of the Ministry of Education & Human Resources Development in Korea.
References
1. Yoshida, T.: The Estimation of Mental Stress by 1/f Frequency Fluctuation of EEG. Brain Topography (1998) 771-777
2. Davidson, R.J.: Anterior Cerebral Asymmetry and the Nature of Emotion. Brain and Cognition 20 (1992) 125-151
3. Eysenck, H.J., Eysenck, M.W.: Personality and Individual Differences: A Natural Science Approach. Plenum Press (1985)
4. Schmidtke, J.I., Heller, W.: Personality, Affect and EEG: Predicting Patterns of Regional Brain Activity Related to Extraversion and Neuroticism. Personality and Individual Differences (2003)
5. Musha, T., Terasaki, Y., Haque, H.A., Ivanisky, G.A.: Feature Extraction from EEGs Associated with Emotions. Intl. Sympo. Artif. Life Robotics (Invited Paper) 1 (1997) 15-19
6. Anderson, C.W., Sijercic, Z.: Classification of EEG Signals from Four Subjects during Five Mental Tasks. In: Solving Engineering Problems with Neural Networks, Proceedings of the Conference on Engineering Applications in Neural Networks (EANN) (1996) 407-414
7. Hagan, M.T., Demuth, H.B., Beale, M.: Neural Network Design. PWS Publishing Co. (1996)
8. Robinson, D.L.: The Technical, Neurological, and Psychological Significance of 'Alpha', 'Theta', and 'Delta' Waves Confounded in EEG Evoked Potentials. Personality and Individual Differences 28 (2000) 673-693
A Decision Method for Air-Pressure Limit Value Based on the Respiratory Model with RBF Expression of Elastance Shunshoku Kanae, Zi-Jiang Yang, and Kiyoshi Wada Department of Electrical and Electronic Systems Engineering, Graduate School of Information Science and Electrical Engineering, Kyushu University Motooka 744, Nishi-ku, Fukuoka, 819-0395 Japan
[email protected]
Abstract. The air-pressure limit value is an important conditional parameter of artificial respiration. Pulmonary characteristics differ considerably from person to person. To set ventilation conditions appropriate for each patient, it is necessary to establish a mathematical model describing the mechanism of the human respiratory system and to determine the pulmonary characteristics of each patient by identifying the model. For this purpose, two types of respiratory system models have been proposed by the authors. These models are expressed as second-order nonlinear differential equations with an air-volume-dependent elastic coefficient and an air-volume-dependent resistive coefficient. In the first type of model, the elastic coefficient is expressed as a polynomial function of air volume, while in the second type it is expressed by an RBF network. The model with polynomial expression of elastance has the advantage of a simple structure. On the other hand, the model with RBF expression of elastance has better numerical stability than the polynomial model. In this paper, a decision method for the air-pressure limit value based on the respiratory model with RBF expression of elastance is proposed. The method adopts a numerical technique to find the saturation starting point of the elastance curve, so direct calculation of the radius of curvature can be avoided. The proposed method is validated by an application to practical clinical data.
1 Introduction
Appropriate decision of the air-pressure limit value is very important in artificial respiration. If the pressure value is too small, the patient cannot breathe sufficiently. On the contrary, if the pressure value is too large, overexpansion of the lung may occur. Pulmonary characteristics differ considerably from patient to patient. To set ventilation conditions appropriate for each patient, it is necessary to establish a mathematical model describing the mechanism of the human respiratory system and to determine the pulmonary characteristics of each patient by identifying the model [1,2,3,4].
For this purpose, two types of respiratory system models have been proposed by the authors. These models are expressed as second-order nonlinear differential equations with an air-volume-dependent elastic coefficient and an air-volume-dependent resistive coefficient. In the first type of model, the elastic coefficient is expressed as a polynomial function of air volume [5], while in the second type it is expressed by an RBF network [6]. These models make it possible to estimate the pulmonary elastance and airway resistance from measurements of air pressure, flow, and volume taken at the ventilator side, without any sensor inserted into the body. The model with polynomial expression of elastance has the advantage of a simple structure, whereas the model with RBF expression of elastance has better numerical stability than the polynomial model [7]. In this paper, a decision method for the air-pressure limit value based on the respiration model with RBF expression of elastance is proposed. First, the respiration model with RBF expression of elastance is described in Section 2, and the parameter estimation algorithm is derived in Section 3. The proposed decision method for the air-pressure limit value is then explained in Section 4 and validated by an application to practical clinical data in Section 5. The final section concludes this paper.
2 Respiratory Model with RBF Expression of Elastance
In this paper, a case of mechanical ventilation in which spontaneous breathing is absent is considered. In the inspiration phase of mechanical ventilation, the air pressure on the ventilator side is higher than that inside the lung, so fresh air is driven into the lung by the pressure difference and the lung expands. In the expiration phase, the exhaust after gas exchange is excreted naturally by the shrinking force of the lung. The expansion and shrinkage caused by the variation of air pressure are characterized by the pulmonary elastance and the airway resistance. Medical knowledge and clinical data clearly show that there is nonlinearity in the elastance of the lung and the resistance of the airway, and that these are important factors in determining the respiratory dynamics. Taking the above factors into account, the following respiratory model of the lung has been proposed:

P(t) + a Ṗ(t) = E(V) V(t) + R(V) V̇(t) + h V̈(t) + ε(t),    (1)

E(V) = Σ_{i=1}^{n} b_i ψ_i(V(t)),    (2)

R(V) = Σ_{i=1}^{n} c_i ψ_i(V(t)),    (3)
where V(t) is the volume of the lung, V̇(t) and V̈(t) are the first- and second-order derivatives of the volume, P(t) is the airway pressure, and Ṗ(t) is the first-order derivative of the pressure. a and h are coefficients of the model, and ε(t) contains model errors and measurement noise. As mentioned above, the pulmonary elastance E(V) and the airway resistance R(V) are not constant; they are nonlinear functions of the volume V(t). In this model, the pulmonary elastic coefficient and the airway resistance coefficient are described by the RBF networks (2) and (3), whose input is the volume of the lung. Here, n is the number of nodes, and b_i, c_i (i = 1, ..., n) are the weights of the i-th node. ψ_i(V) is a radial basis function with center V_{0i} and deviation σ_i:

ψ_i(V) = exp( −(V − V_{0i})² / (2πσ_i²) ),   i = 1, ..., n.    (4)
Considering the relationship Q(t) = V̇(t) between Q(t) and V(t), the respiratory model (1) can be written as

P(t) + a Ṗ(t) = E(V) V(t) + R(V) Q(t) + h Q̇(t) + ε(t).    (5)
Substituting (2) and (3) into (5), we have

P(t) = −a Ṗ(t) + V(t) Σ_{i=1}^{n} b_i ψ_i(V) + Q(t) Σ_{i=1}^{n} c_i ψ_i(V) + h Q̇(t) + ε(t).    (6)
Setting the data vector ϕ(t) and the parameter vector θ in the above model as

ϕ(t) = [ Ṗ(t) | V(t)ψ_1(V) ··· V(t)ψ_n(V) | Q(t)ψ_1(V) ··· Q(t)ψ_n(V) | Q̇(t) ]^T,
θ = [ −a | b_1 ··· b_n | c_1 ··· c_n | h ]^T,

a compact expression of the model is obtained:

P(t) = ϕ^T(t) θ + ε(t).    (7)

3 Parameter Estimation
The continuous-time model (6) (or (7)) contains derivative terms. Generally speaking, it is not desirable to compute the derivatives directly from the measurements, because doing so may amplify the noise. In this section, a discrete-time identification model for respiration is derived based on Sagara's numerical integration technique [8]. Denote the sampling period of data collection as T. At time instant t = kT, integrate both sides of Eq. (7) over the interval [(k − ℓ)T, kT]. Let y(k) be the left-hand side of the resulting equation. Then y(k) can be calculated as

y(k) = ∫_{(k−ℓ)T}^{kT} P(τ) dτ ≈ Σ_{j=0}^{ℓ} d_j P(k − j),    (8)
where ℓ is a natural number that determines the window size of the numerical integration. The coefficients d_j (j = 0, 1, ..., ℓ) are determined by the formula of numerical integration. For example, when the trapezoidal rule is used, they are given as

d_0 = d_ℓ = T/2,   d_j = T,  j = 1, 2, ..., ℓ − 1.    (9)

The integral of ϕ(t) on the right-hand side of Eq. (7) can be calculated by

φ(k) = Σ_{j=0}^{ℓ} d_j ϕ(k − j)
     = [ P(k) − P(k − ℓ) | Σ_{j=0}^{ℓ} d_j V(k − j)ψ_1(V(k − j)) ··· Σ_{j=0}^{ℓ} d_j V(k − j)ψ_n(V(k − j)) | Σ_{j=0}^{ℓ} d_j Q(k − j)ψ_1(V(k − j)) ··· Σ_{j=0}^{ℓ} d_j Q(k − j)ψ_n(V(k − j)) | Q(k) − Q(k − ℓ) ]^T.    (10)
Here, analytical forms are used for the terms whose integrals can be calculated analytically. Collect the approximation error ∇e caused by the numerical integration and the integral of the error term ε(t) of Eq. (7) into e(k), namely

e(k) = ∇e + ∫_{(k−ℓ)T}^{kT} ε(τ) dτ.    (11)
Consequently, a discrete-time identification model of respiration is derived as follows:

y(k) = φ^T(k) θ + e(k).    (12)

From the measurements of air pressure P(k), flow Q(k), and volume V(k), it is easy to calculate y(k) by Eq. (8) and φ(k) by Eq. (10) at each time instant k = ℓ + 1, ..., N, so that N − ℓ regression equations can be collected as

y = Φθ + e,    (13)

where y = [y(ℓ+1) ··· y(N)]^T, Φ = [φ(ℓ+1) ··· φ(N)]^T, and e = [e(ℓ+1) ··· e(N)]^T. The least squares estimate that minimizes the criterion function J, defined as the sum of squared errors

J = ||y − Φθ||²,    (14)

is given by

θ̂ = (Φ^T Φ)^{−1} Φ^T y.    (15)
Then the estimate of the pulmonary elastance is obtained:

Ê(V) = Σ_{i=1}^{n} b̂_i ψ_i(V).    (16)
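The following sketch shows how the regressor (10) and the batch estimate (15)-(16) can be assembled from sampled pressure, flow, and volume records. It assumes a trapezoidal window of length ℓ, Gaussian RBF nodes of the form (4), and NumPy; all names and the synthetic data below are illustrative, not the authors' code.

```python
import numpy as np

def rbf(V, centers, sigmas):
    """psi_i(V) of Eq. (4): exp(-(V - V0i)^2 / (2*pi*sigma_i^2)); returns (len(V), n)."""
    V = np.atleast_1d(V)[:, None]
    return np.exp(-(V - centers[None, :]) ** 2 / (2.0 * np.pi * sigmas[None, :] ** 2))

def estimate_theta(P, Q, V, T, ell, centers, sigmas):
    """Assemble y and Phi of Eqs. (8), (10), (13) and solve Eq. (15)."""
    d = np.full(ell + 1, T)                      # trapezoidal weights, Eq. (9)
    d[0] = d[-1] = T / 2.0
    psi = rbf(V, centers, sigmas)
    rows, ys = [], []
    for k in range(ell, len(P)):
        idx = np.arange(k - ell, k + 1)          # samples k-ell ... k
        # d_j multiplies the sample k-j; the trapezoidal weights are symmetric,
        # so they can be applied directly in ascending index order.
        ys.append(d @ P[idx])                    # y(k): integral of P, Eq. (8)
        colV = ((d * V[idx])[:, None] * psi[idx]).sum(axis=0)
        colQ = ((d * Q[idx])[:, None] * psi[idx]).sum(axis=0)
        rows.append(np.concatenate(([P[k] - P[k - ell]],    # integral of dP/dt
                                     colV, colQ,
                                     [Q[k] - Q[k - ell]])))  # integral of dQ/dt
    Phi, y = np.asarray(rows), np.asarray(ys)
    theta = np.linalg.lstsq(Phi, y, rcond=None)[0]           # Eq. (15)
    b_hat = theta[1:1 + len(centers)]                        # weights of E(V), Eq. (16)
    return theta, b_hat

# Illustrative call with synthetic data (T = 0.005 s, ell = 20, n = 5 nodes).
t = np.arange(0, 2.35, 0.005)
V = 30 * (1 - np.cos(2 * np.pi * t / 2.35)) / 2
Q = np.gradient(V, 0.005)
P = 0.2 * V + 0.05 * Q + 0.1 * np.random.randn(t.size)
centers, sigmas = np.linspace(V.min(), V.max(), 5), np.full(5, 8.0)
theta, b_hat = estimate_theta(P, Q, V, 0.005, 20, centers, sigmas)
```

The estimated elastance curve Ê(V) of Eq. (16) can then be evaluated as rbf(V_grid, centers, sigmas) @ b_hat over the volume range of interest.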
The above algorithm is an off-line algorithm in which the calculation is carried out after the data of length N have been completely collected. In practical clinical cases, however, the data are recorded successively and the state of the lung may change, so an on-line algorithm is desirable. The on-line algorithm for calculating the above LS estimate is as follows [9]:

θ̂(k) = θ̂(k − 1) + L(k) ( y(k) − φ^T(k) θ̂(k − 1) ),
L(k) = S(k − 1) φ(k) / ( λ + φ^T(k) S(k − 1) φ(k) ),
S(k) = (1/λ) [ S(k − 1) − S(k − 1) φ(k) φ^T(k) S(k − 1) / ( λ + φ^T(k) S(k − 1) φ(k) ) ],    (17)

where λ (λ ≤ 1) is the forgetting factor, and the initial values of θ̂ and S are taken as θ̂(0) = 0 and S(0) = s²I (s is a sufficiently large real number).
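A compact sketch of the recursion (17), assuming the regressor φ(k) and the measurement y(k) are produced as in Eqs. (8) and (10); the forgetting factor, dimensions, and data below are illustrative placeholders.

```python
import numpy as np

def rls_step(theta, S, phi, y, lam=0.98):
    """One recursion of Eq. (17): least squares with a forgetting factor."""
    S_phi = S @ phi
    denom = lam + phi @ S_phi
    L = S_phi / denom                        # gain vector L(k)
    theta = theta + L * (y - phi @ theta)    # parameter update
    S = (S - np.outer(L, phi @ S)) / lam     # covariance update
    return theta, S

# Hypothetical use: 2n + 2 = 12 parameters for n = 5 RBF nodes.
dim = 12
theta, S = np.zeros(dim), 1e6 * np.eye(dim)  # theta(0) = 0, S(0) = s^2 I
rng = np.random.default_rng(0)
true_theta = rng.normal(size=dim)
for _ in range(200):
    phi_k = rng.normal(size=dim)             # placeholder regressor phi(k)
    y_k = phi_k @ true_theta                 # placeholder measurement y(k)
    theta, S = rls_step(theta, S, phi_k, y_k)
print(np.allclose(theta, true_theta, atol=1e-3))
```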
4 Decision Method for Air-Pressure Limit Value
Using the estimation method described in Section 3, the pulmonary elastance curve of the patient can be identified. Generally, it is considered that the air-pressure limit value should be set at the saturation starting point of the elastance curve. Mathematically speaking, the saturation starting point corresponds to the point at which the radius of curvature is minimal. From the respiratory model, the pulmonary elastance curve is identified as the function

P_e = f_E(V) = Ê(V) · V = V(t) Σ_{i=1}^{n} b̂_i ψ_i(V(t)).    (18)
The radius of curvature r of this curve can be calculated by the formula

r = ( 1 + (df_E/dV)² )^{3/2} / ( d²f_E/dV² ).    (19)

The pressure value P_em that minimizes this r is the air-pressure limit value to be found. However, the formula contains first- and second-order derivatives, so the calculation is rather complex. Here, we present a simple, practical numerical method. It is based on the fact that the point at which the radius of curvature of a curve attains its minimum corresponds to the maximum of the curve under an appropriate rotational transformation. The calculation procedure is as follows.

Step 1. Based on the identified pulmonary elastance function (18), calculate the P_e value at each V with a reasonable resolution over the respiration range [V_min, V_max], and store the data in a matrix

S_o = [ P_1  V_1 ; ... ; P_q  V_q ],

where the subscript q denotes the number of data points.
Step 2. Calculate the rotational transformation matrix T:

L = ( (P_l − P_1)² + (V_l − V_1)² )^{1/2},    (20)
s_a = (V_l − V_1) / L,    (21)
c_a = (P_l − P_1) / L,    (22)
T = [ c_a  −s_a ; s_a  c_a ].    (23)
Step 3. Perform the rotational transformation on the data S_o:

S_T = S_o T = [ P_{T1}  V_{T1} ; ... ; P_{Tq}  V_{Tq} ].    (24)

Step 4. Search for the maximum point in the range [V_{T1}, V_{Tq}], i.e., the index m at which P_{Tm} = max_{i=1,...,q} P_{Ti} attains its maximum. Then the point (P_m, V_m) with the same index m is the point of minimum radius of curvature on the pulmonary elastance curve, and the air-pressure limit value of artificial respiration is found.
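A small sketch of Steps 1-4, assuming the identified elastance function f_E is available as a callable. The extracted text is ambiguous about which rotated coordinate is searched in Step 4; the sketch takes the coordinate perpendicular to the chord between the first and last sampled points, since that is the coordinate whose extremum marks the knee after the rotation. The example curve is purely illustrative.

```python
import numpy as np

def pressure_limit(fE, V_min, V_max, q=1001):
    """Steps 1-4 (sketch): locate the knee of Pe = fE(V) via a chord rotation."""
    V = np.linspace(V_min, V_max, q)
    P = fE(V)                                      # Step 1: sample the curve
    L = np.hypot(P[-1] - P[0], V[-1] - V[0])       # Step 2: chord length, Eq. (20)
    sa, ca = (V[-1] - V[0]) / L, (P[-1] - P[0]) / L
    T = np.array([[ca, -sa], [sa, ca]])            # rotation matrix, Eq. (23)
    ST = np.column_stack([P, V]) @ T               # Step 3: rotated data, Eq. (24)
    # Step 4: after rotation the chord lies along one axis, so the knee is the
    # point of largest deviation in the perpendicular coordinate.
    m = np.argmax(np.abs(ST[:, 1] - ST[0, 1]))
    return P[m], V[m]

# Hypothetical saturating elastance curve, for illustration only.
Pm, Vm = pressure_limit(lambda V: 16.0 * (1.0 - np.exp(-V / 15.0)), 0.0, 60.0)
print(f"air-pressure limit about {Pm:.2f} cmH2O at volume {Vm:.1f} ml")
```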
[Figure 1: estimated static P/V curve, volume V [ml] versus pressure P [cmH2O]]
Fig. 1. Estimated pulmonary elastance and the air-pressure limit value decided by the proposed method
5 Validation by Clinical Examples
In this section, a clinical example is shown. The proposed method is used to decide the air-pressure limit value of artificial respiration for a neonate admitted to a neonatal intensive care unit. First, the pulmonary elastance of the patient is estimated by the algorithm described in Section 3. In this example, the sampling period is T = 0.005 s and the data length is N = 470. The measured data P(k) and V(k) are plotted as a solid line in Figure 1. The elastance term f_E(V) is approximated by an RBF network with n_E = 5 nodes. The estimated weights of the RBF nodes are

[b̂_1, b̂_2, b̂_3, b̂_4, b̂_5] = [1.444, −3.331, 4.496, −3.640, 1.662],

where the integration window size is taken as ℓ = 20. The static P/V characteristic curve calculated from the estimated parameters b̂_1, ..., b̂_5 is drawn as a dotted line in Figure 1. Second, the air-pressure limit value of artificial respiration is decided by the proposed method. The estimated elastance curve (Curve 1) is transformed into Curve 2 by the rotational transformation (see Figure 2). The number of data points is q = 1001, the maximum point of Curve 2 is at i = 743, and the corresponding air pressure is P_m = 8.5764 cmH2O (see Figure 1). This result is almost the same as the decision of a veteran doctor.
[Figure 2: Curve 1 (estimated elastance) and Curve 2 (rotation-transformed curve) with the maximum point marked]
Fig. 2. Rotational transformation and the maximum point of the transformed curve
6 Conclusion
The air-pressure limit value is an important conditional parameter of artificial respiration. In this paper, a decision method for the air-pressure limit value based on the respiration model with RBF expression of elastance has been proposed. The method adopts a simple numerical technique to find the saturation starting point of the elastance curve, so direct calculation of the radius of curvature can be avoided. The proposed method has been validated by an application to practical clinical data.
References 1. Matamis, D., Lemaire, F., Harf, A., Brun-Buisson, C., Ansquer, J.C., Atlan, G.: Total Respiratory Pressure-Volume Curves in the Adult Respiratory Distress Syndrome. Chest, 86 (1984), 58–66. 2. Ranieri, V.M., Giuliani, R., Fiore, T., Damhrosio, M., Milic-Emili, J.: VolumePressure Curve of the Respiratory System Predicts Effects of PEEP in ARDS: “Occlusion” Versus “Constant Flow” Technique. Am. J. Respir. Crit. Care Med. 149 (1994) 19–27. 3. Uhl, R.R., Lewis, F.J.: Digital Computer Calculation of Human Pulmonary Mechanics Using a Least Squares Fit Technique. Comput. Biomed. Res. 7 (1974) 489–495. 4. Muramatsu, K., Yukitake, K, Nakamura, M, Matsumoto, I., Motohiro, Y: Monitoring of Nonlinear Respiratory Elastance Using a Multiple Linear Regression Analysis. European Respiratory Journal 17 (2001) 1158–1166. 5. S. Kanae and K. Muramatsu, Z.J. Yang, K. Wada: Modeling of Respiration and Estimation of Pulmonary Elastance. The Proceedings of the 5th Asian Control Conference (ASCC2004), pp.648–651, Melbourne, Australia (2004). 6. S. Kanae and Z.J. Yang, K. Wada: Estimation of Pulmonary Elastance Based on RBF Expression. Advances in Neural Networks – ISNN 2004, Lecture Notes in Computer Science 3173, PartII, pp.507–512, Springer(2004). 7. S. Kanae and K. Maeda, Z.J. Yang, K. Wada: Parameter Estimation of Nonlinear Differential Equation Models of Respiratory System by Using Numerical Integration Technique. Preprints of the 14th IFAC Symposium on System Identification, Newcastle, Australia (2006). 8. Sagara, S., Zhao, Z.: Numerical Integration Approach to On-Line Identification of Continuous-Time Systems. Automatica 26 (1990) 63–74. 9. Ljung, L.: System Identification Theory for the User. Prentice-Hall (1987).
Hand Tremor Classification Using Bispectrum Analysis of Acceleration Signals and Back-Propagation Neural Network Lingmei Ai1, Jue Wang1, Liyu Huang2, and Xuelian Wang3 1
Key Laboratory of Biomedical Information Engineering of the Ministry of Education, Xi'an Jiaotong University, Xi'an 710049, China 2 School of Electronic Engineering, Xidian University, Xi'an 710071, China 3 Department of Neurosurgery, Tangdu Hospital, The Fourth Military Medical University, Xi'an 710038, China
[email protected] (Lingmei Ai),
[email protected],
[email protected]
Abstract. This paper presents a new approach to classifying three types of tremor (parkinsonian, essential, and physiological tremor) using bispectrum analysis of hand-tremor time series and a neural network. Acceleration signals of hand tremor were recorded from voluntary subjects, and features of the diagonal slice of the bispectrum were extracted. A simple BP artificial neural network classifier based on the LM algorithm was used for classification. The study indicates an accuracy rate of over 92.9%. The results show that the method performs better than other methods, such as time- or frequency-domain analysis, and it provides a new approach to tremor classification for clinical neurosurgeons.
1 Introduction
Tremor is an involuntary movement characterized by regular or irregular oscillations of one or several body parts [1]. Three types of tremor are often observed: physiological tremor (PT), essential tremor (ET), and parkinsonian tremor (PD). Tremor may be physiological or pathological; among the pathological cases, essential and parkinsonian tremor are the most frequently observed. Tremor affects the head, limbs, and voice, and it also degrades quality of life. Parkinson's disease is a growing problem, with 120-180 sufferers per 100,000 people. Most patients are over 40 years old, although the disease can also appear at a younger age. Essential tremor affects up to 5,000 people per 100,000 [2]. The types of tremor differ, and so do the methods of treatment. The cause of the disease is still unknown, and the clinical distinction between the tremor types can be difficult. At present, neurosurgeons diagnose the type of tremor according to clinical experience and the patient's symptoms; in particular, the misdiagnosis rate between PD and ET exceeds 25% [3]. In other words, clinical diagnosis alone is not accurate enough to classify tremor types. Therefore, researchers have tried to study tremor from electrophysiological signals such
as the electromyogram, the electroencephalogram, and other tremor signals such as hand-tremor acceleration, in the hope of interpreting the tremor phenomenon and distinguishing PD and ET correctly. Early studies recorded the electromyogram (EMG) of tremor subjects and measured and analysed its amplitude and frequency [4], estimated spectra [5], and tried to locate paroxysm sources via the EMG and the electroencephalogram (EEG) [6]; approximate entropy has also been used to analyse the acceleration signals of PD and normal (PT) subjects [7]. Although researchers have made some progress, methods that distinguish the tremor types correctly have not yet been established. Wharrad et al. [8] used acceleration (ACC) signals to distinguish ET and PT, but did not analyse the most frequently observed PD. Augustyn et al. [2] studied the discrimination of the three tremor types PD, ET, and PT, but so many features were extracted from the acceleration signals that the computation was too complex and too slow for on-line testing. Because of the above problems, a new approach is proposed in this paper: hand ACC signals of PD, ET, and PT subjects are measured in order to classify the tremor types. Features of the diagonal slice of the bispectrum are extracted and used as the input of the classifier, which is a back-propagation (BP) artificial neural network trained with the Levenberg-Marquardt (LM) algorithm; the results indicate good classification performance.
2 Theory Foundation
2.1 Definition of Bispectrum [9]
Higher-order spectra, and the bispectrum in particular, play an important role in digital signal processing because of their ability to preserve non-minimum-phase information. They are very useful tools for processing non-Gaussian, non-linear, and non-causal signals, for suppressing Gaussian noise, and for blind signal processing. Higher-order spectra describe a stochastic signal more completely in the probabilistic sense and can recover phase information that cannot be recovered from the power spectrum. Let the real stochastic time series {x(n)} (n ∈ Z) have zero mean and be kth-order stationary; the kth-order cumulant is defined as
C_{k,x}(τ_1, τ_2, ..., τ_{k−1}) = cum{ x(n), x(n + τ_1), ..., x(n + τ_{k−1}) },

where τ_i (i = 1, 2, ..., k − 1) are arbitrary time delays. Let C_{k,x}(τ_1, τ_2, ..., τ_{k−1}) be absolutely summable, namely

Σ_{τ_1=−∞}^{∞} Σ_{τ_2=−∞}^{∞} ··· Σ_{τ_{k−1}=−∞}^{∞} | C_{k,x}(τ_1, τ_2, ..., τ_{k−1}) | < ∞;    (1)
then the kth-order spectrum can be defined as the (k − 1)-dimensional Fourier transform of the kth-order cumulant:

S_{k,x}(ω_1, ω_2, ..., ω_{k−1}) = Σ_{τ_1=−∞}^{∞} Σ_{τ_2=−∞}^{∞} ··· Σ_{τ_{k−1}=−∞}^{∞} C_{k,x}(τ_1, τ_2, ..., τ_{k−1}) exp( −j(ω_1 τ_1 + ω_2 τ_2 + ··· + ω_{k−1} τ_{k−1}) ),    (2)
P (ω ) =
∞
∑ τ
= −∞
C 2 , x (τ ) exp( − j ωτ )
(3)
Let tremor signal {x(n)} is 3th-order stationarity, then 3th –order cumulants or 3th-order moment of real stochastic time series {x(n)} be defined as follows:
C 3 , x (τ 1 , τ 2 ) = E [ x ( n ) x ( n + τ 1 ) x ( n + τ 2 )]
(4)
Where τ 1 , τ 2 may be discretional time delay. E[ • ] denotes statistical expectation. The bispectrum S 3 , x (ω 1 , ω 2 ) of time series is defined as the two dimension fourier transform of formula (4): S 3 , x (ω 1 , ω 2 ) =
+∞
+∞
∑ τ∑ τ 1 = −∞
ω1 ≤ π , ω 2 ≤ π , ω
1
2 = −∞
+ω
2
C 3, x (τ 1 , τ 2 ) exp( − j (ω 1τ 1 + ω 2τ 2 ))
(5)
≤ π , formula (5) can also be written by
B x (ω 1 , ω 2 ) = X (ω 1 ) X (ω 2 ) X * (ω 1 + ω 2 )
(6)
Where X( ω ) is fourier transform of time series {x(n)},*express Complex Conjugation. Bispectrum is a special cases for high -order spectra at k=3. We often divide {x(n)} into L segments When operating, bispectrum is estimated by direct way as follows: B x (ω 1 , ω 2 ) =
1 L
L
∑
i =1
X i (ω 1 ) X i (ω 2 ) X
* i
(ω 1 + ω 2 )
(7)
We often use main diagonal slice of three-order cumulants. Let τ 1 = τ 2 = τ of formula (4), then main diagonal slice of three-order cumulants is defined as follows: C (τ ) = C 3 , x (τ , τ ) = E [ x ( n ) x ( n + τ ) x ( n + τ )]
(8)
Fourier transform of formula (8) be defined as main diagonal slice spectrum of
1 1 dimension spectrum. The estimating precision 2 1 is higher and operating scalar is smaller for 1 dimension spectrum than that of 2 bispectrum, which is also called
bispectrum.
Hand Tremor Classification Using Bispectrum Analysis
1205
2.2 Important Properties of Bispectrum (1). Bispectrum is blind for Gaussian process. Three–order cumulants or bispectrum of Gaussian signal is zero. Bispectrum can restrains noise from signals and can extracts useful information from non-Gaussian signals, furthermore it can also tests the deviation of signals from Gaussian distribution . (2). Bispectrum not only preserves amplitude information of signals but also preserves phase information. Thus, it is easy to obtain the features information from tremor signals with bispectrum analysis. (3). Bispectrum is a symmetric function [9]. 2.3 Artificial Neural Network Artificial neural network possess study and association memory function. It can be used for classifier to carry out pattern recognition.According to Kolmogorov theorem, BP neural network of three layer can approaches discretionarily continuous function, so it can be used for solving classification problem of tremor signals. The application of BP neural network is much extensive. It consists of two parts, the information is propagated forwards and the error is propagated backwards, at the same time the weights value of neural network is adjusted continuously, and he network error is designed to minimize the root mean squared error between the actual output and the desired output. LM arithmetic of BP neural network was used in this paper [10].
3 Acquiring Time Series Signal of Tremor Three different types Subjects of 26 participated in the experiment, include 6 patients with ET (4 male, 2 female, aged from 29-80 years old), 10 Patients with PD (7 male, 3 female, aged from 27-74 years old) and 10 normal subjects with PT (7 male, 3 female, aged from 22-71 years old ). All of them have been diagnosed by neurosurgeon.PD and ET patients came from outpatients of Department of Neurosurgery in Tangdu Hospital of xi’an.PT subjects came from undergraduate, folk of patients and volunteer. Subjects were seated in straight-backed chair with their feet flat on the floor. The sensor was secured on the dorsum of middle finger of their prefered hand. Subjects maintained the following upper limb position during data collection:arm held out straight, parallel to the ground with the shoulder at 90 , the elbow and wrist extended and the hand pronated (facing downwards) in a stretch posture. This arm position was determined from a pilot study and was found to be the most sensitive position to difference between PD, ET, and PT subjects of tremor. The subjects were asked to raise their arm and maintain it in the postion whilst the data were collected for 60 seconds, repeated three times with the same collecting data way. The tremor signals were registered by MMA7260Q 3-axis low-g accelerometer of Freescale then were amplified, filtered finally converted to the digital form in the analog-digital (A/D) converter with sampling frequency of 512Hz, saved them at
1206
L. Ai et al.
computer with text formatting. Because the paroxysm of the tremor patients has switch on-off symptom, namely, the tremor is discontinuous. 10 seconds data that occured tremor of z axis which paralleled ground were Selected for acting as a data segments, find 4 data segments for every subject , add up to 104 data segments for analysis. The sampling rate was reduced to 128Hz before the signals were analyzed. 1.5
1
0.5
0.5 0
0 -0.5 0
10 20 30 (a) 40 sample for PD
PFA
PFA
PFA
1
-0.5 0 20 40 (c ) 40 sample for PT
-0.5 0 20 40 (b) 24 sample for ET
40
1.5 1 0.5 0
Fig. 1. Non- Gaussian testing . (a) PD,(b) ET and (c)PT.
50
10 20 30 (a) 40 sample for PD
40
interquartile
interquartile
interquartile
100
0 0
100
200
150
100
0 0
10 20 (b) 24 sample for ET
30
50
0 0
10 20 30 (c) 40 sample for PT
40
Fig. 2. Non-linearity testing (“o” represents estimate value and “*” represents theoretic value ). (a) PD, (b) ET and (c) PT.
Since the most important feature of signals is Gaussian and linearity, so which must be tested firstly.Linear signals are expressed simply and Gaussian signals are expressed by 2th-order statistics enough(Power spectrum). 104 data segments of 26 subjects were tested by Hinich [11] method . Risk probability of non-gaussian signals were accepted by probability of false alarm (PFA). We think that signals are gaussian when PFA 0.05 or else are non-Gaussian signals. Then testing non-linearity of tremor signals based on the deviation between theory and estimated value of interquartile. If the estimated interquartile range is much larger or much smaller than the theoretical value, the linearity hypothesis should be rejected. Fig.1 presents the testing results at PFA=0.05 for PD, ET and PT signals, respectively. We find that 97.5% of the PD signals are non-Gaussian. At the same time, 97.5% of the ET signals are non-Gaussian and 55% of the PT tremor signals are also non-Gaussian. Fig.2 presents the testing results of theoretical and estimated value of interquartile of three types signals of PD, ET and PT. We find that interquartile range of estimated value is much larger or much smaller than that of the theoretical value for all the PD and ET whereas also find parts PT have the same results the same as PD and ET. The results confirm that the majority of tremor signals are non-Gaussian and
〉
Hand Tremor Classification Using Bispectrum Analysis
1207
non- linearity, thus limiting their descriptions to only 2th-order characteristic is not sufficient. High-order statistic especially bispectrum is needed to describe them more accurately and distinctively.
4 Extracting Features of Tremor Signals We have known that the apexes of bispectrum in double frequency domain are symmetrical distribution according to diagonal axis [9]. The position of apexes of bispectrum is an important parameter for reflecting bispectrum framework.The position of apexes of bispectrum may be different for different types of tremor signals, therefore the information on apexes of bispectrum must be extracted. Owing to the operation of bispectrum is larger,so the diagonal slice was used for expressing the signals in this paper. Due to the tremor signals are different, consequently the amplitude must be normalized firstly. Secondly, we extracted the largest amplitude of apexes of diagonal slice spectrum and the frequency at the largest amplitude, thirdly, second larger amplitude,third larger amplitude of apexes diagonal slice spectrum and frequency which correspond with second larger amplitude, third larger amplitude were also extracted by the same way, lastly the energy on diagonal slice spectrum was extracted, add up to 7 parameters as features. M
energy= ∑ diagonalslice spectrum
2
i =1
Where M is the length of data segments . Table 1 is the average value with seven features of subjects of PD, ET and PT . From Table 1 we find that the difference of amplitude ,frequency, and energy of diagonal slice spectrum among kinds tremor is significant.
5 Results and Analysis Fig.3 presents 10 seconds hand tremor signals for (a) PD, (b) ET, and (c) PT, respectively. Fig.4 presents three-dimensional graphics of bispectrum for (a) PD, (b) ET and (c) PT signals, respectively. From Fig.4 we find that the shape distribution of bispectrum is different for PD,ET and PT.Fig.5 is diagonal slice spectrum for (a) PD, (b) ET and (c) PT. We can also find that the better coherence is present by comparing Fig.4 with Fig.5 . Table 1. The average value of three types tremor features subject PD ET PT
largest amplitude/frequency(hz) second amplitude/frequency(hz) third amplitude/frequency(hz) 0.0494 / 4.7299 0.0037 / 7.1510 0.0014 / 7.4880 0.0178 / 8.0704 0.0039 / 7.0096 0.0020 / 5.9280 0.0037 / 11.1322 0.0020 / 9.5222 0.0013 / 9.1478
energy 0.0035 0.0006 0.0000
1208
L. Ai et al.
0.1
Fig. 3. Ten-second hand tremor signals for one representative subject from each group: (a) PD, (b) ET, and (c) PT; the amplitude has been normalised.
0.08
0.02
0.06
0.015
0.04
0.01
0.02
0.005
0 0
20
40 f/hz
60 (a)
80
0 0
4
x 10
2
20
40 f/hz (b)
60
0 0
80
20
40 f/hz
60
80
(c)
Fig. 5. The distribution of the diagonal slice of bispectrum for one subject representative for each group, (a) PD, (b) ET, and (c) PT, amplitude has been normalized Table 2. The relationship between neurons and accuracy rate of average classification numbers of neurons 2 3 4 5 6 7 8 9 10 11 accuracy of classification (%) 51 50 75 78.5 78.5 82.14 85.71 85.70 85.5 86.7
12
13
14
15
16
17
18
82.14 89.3 89.3 92.90 92.86 92.1 90.1
BP artificial neural network classifier based on LM adjustment arithmetic was adopted to carry out pattern recognition.Cross-validation method was used for classification,70% features data as training and another 30% features data as testing,namely, 26 subjects data were divided into 3 segments with training and testing ,respectively. Add up to 3 times. The input layer of classifier are seven features which are energy and the value of the largest amplitude ,second larger amplitude and third larger amplitude as well as three frequency which correspond with three amplitude,respectively . The output layer is three nodes which express
Hand Tremor Classification Using Bispectrum Analysis
1209
three types of tremor,respectively. The two-eighteen neurons of hidden layer was chosen to find best accuracy effect of classification.Table 2 presents the relationship between neurons and accuracy rate of average classification .The results indicate that the best accuracy rate of classification exceeds 92.9% when hidden layer neurons are 15.
6 Conclusions Bispectrum is a preferable tool to process non-Gaussian and non-linearity signals. First, hand tremor signals of subjects were measured by accelerometer and then were tested with Hinich methods, finally we find that PD of 97.5% and ET of 97.5% are non-Gaussian, non-linearity signals. Parts of the PT signals are also non-Gaussian and non-linearity signals adapt to bispectrum analysis.Three types of tremor signals were analysed by bispectrum , we find that there are distinct difference among them. Using the features of diagonal slice of bispectrum act as the input of artificial neural network of back-propagation to classify three kinds tremor signals, accuracy rate exceeds 92.9%, provide a new approach to classify three kinds tremor of PD, ET and PT for clinical neurosurgeon.
Acknowledgment The authors would like to express their sincere thanks for financial support by China National 863 High-Tech Project (Project Number: 2006AA04Z370).
References 1. Capello, A., Leardini, A., Benedetti, M. G., Liguori, R., Bertani, A,: Application of StereoPhotogrammetry to Total Body Three Dimensionalanalysis of Human Tremor. IEEE Transactions on Rehabilitation Engineering 5 (4) (1997) 388–393 2. Chwaleba, A., Jakubowski, J., Kwiatos, K.: The Measuring Set And Signal Processing Method For The Characterization Of Human Hand Tremor. CADSM’2003, February 1822 (2003) Lviv-Slasko, Ukraine 3. Spyers-Ashby, J.M., Stokes, M.J., Bain, P.G., Roberts, S.J.: Classification of Normal and Pathologica Tremors using a Multidimensional Electro-magnetic System. Medical Engineering & Physics 21 (1999) 713-723 4. Edwards, R., Beuter, A.: Indexes for Identification of Abnormal Tremor using Computer Tremor Evaluation systems. IEEE Transactions on Biomedical Engineering 6 (7) (1999) 895–898 5. Timmer, J., Lauk, M., Vach, W., lÜcking, C.H.: A Test for Difference between Spectral Peak Frequencies. Computational Statistic & Data Analysis 30 (1999) 45-55 6. Lauk, M., KÖster, B., Timmer, J., Guschbauer, B., Deuschl, G., lÜcking, C.H.: Side –to Side Correlation of Muscle Activity in Physiology and Pathological Tremors. Clinical neurophysiology 110 (1999) 1774-1783 7. Vaillancourt, D.E., Slifkin, A.B., Newell, K.M.: Regularity of Force Tremor in Parkinson’s Disease. Clinical Neurophysiology 112 (2001) 1594-1603
1210
L. Ai et al.
8. Wharrad, H.J., Jefferson, D.: .Distinguishing between Physiological and Essential Tremor using Discrimina and Cluster Analyses of Parameters Derived from the Frequency Spectrum. Human Movement Science 19 (2000) 319-339 9. Zhang, X.: Time Series Analysis-Higher-Order Statistic Method [M]. Beijing: Tsinghua University Press, 1996 10. Yang X., Zhen J.: Artificial Neural Network and Bland Siginal Processing [M]. Beijing: Tsinghua University Press, 2003 11. Hinich, M.J, Wilson, G.R,: Detection of Non-Gaussian Signals in Non-Gaussian Noise using the Bispectrum [J]. IEEETrans. Acoust., Speech, Signal Processing. ASSP38, 1126-1131
A Novel Ensemble Approach for Cancer Data Classification Yaou Zhao, Yuehui Chen, and Xueqin Zhang School of Information Science and Engineering University of Jinan, Jinan 250022, P.R. China yaou
[email protected],
[email protected]
Abstract. Micorarray data are often extremely asymmetric in dimensionality, such as thousands or even tens of thousands of genes and a few hundreds of samples. Such extreme asymmetry between the dimensionality of genes and samples presents several challenges to conventional clustering and classification methods. In this paper, a novel ensemble method based on correlation analysis is proposed. Firstly, in order to extract useful features and reduce dimensionality, different feature selection methods based on correlation analysis are used to form different feature subsets. Then a pool of candidate base classifiers is generated to learn the subsets which are re-sampling from the different feature subsets. At last, appropriate classifiers are selected to construct the classification committee using EDA (Estimation of Distribution Algorithms) algorithm. Experiments show that the proposed method produces the best recognition rates on two benchmark databases.
1
Introduction
Microarray technology has provided the ability to measure the expression levels of thousands of genes simultaneously in a single experiment [14]. Each spot on a microarray chip contains the clone of a gene from a tissue sample. Some mRNA samples are labelled with two different kinds of dyes, for example, Cy5 (red) and Cy3 (blue). After mRNA interact with the genes, i.e., hybridization, the color of each spot on the chip will change. The resulted image reflects the characteristics of the tissue at the molecular level. In recent years, research has showed that accurate cancer diagnosis can be achieved by performing microarray data classification. Various intelligent methods have been applied in this area. But the microarray data consists of a few hundreds of samples and thousands or even ten thousands of genes. It is extremely difficult to work in such a high dimension space using traditional classification methods directly. So gene selection methods have been proposed and developed to reduce the dimensionality. They include principal components analysis (PCA), Fisher ratio, t-test, and correlation analysis. Along with the feature selection methods, intelligent methods have been applied for microarray classification. Such as support vector machine (SVM) [1], Bayesian approaches [3], K nearest neighbor (KNN) [8], artificial neural network (ANN) [9], decision tree [7] D. Liu et al. (Eds.): ISNN 2007, Part II, LNCS 4492, pp. 1211–1220, 2007. c Springer-Verlag Berlin Heidelberg 2007
c1
SC
ED
Feature Subsets
Trianing Data
PC
CC Correlation Analysis
c2 c3
ck Re-Sampling
Classification Commitee
Y. Zhao, Y. Chen, and X. Zhang
Base Classifier Selection with EDA
1212
K classifiers
Fig. 1. PC, SC, ED and CC indicate the feature sets which are generated by the feature selection methods (Pearson Correlation, Cosine coefficient, Euclidean Distance and Spearman Correlation) respectively; C1 , C2 , C3 ...Ck denote K base classifiers. The feature selection approaches are firstly employed to reduce the dimensionality of training data and a pool of candidate base classifiers is generated to learn the subsets which re-sampling from the different feature subsets with PSO algorithm. Then appropriate classifiers are selected to construct the classification committee using EDA.
and flexible neural tree (FNT) [2]. But high accurate classification is difficult to achieve. Most intelligent classifiers are apt to be over-fitted. Recent years, ensemble approaches [5] has been proposed. It combines multiple classifiers together as a committee to make more appropriate decisions for classifying microarray data instances. It offers improved accuracy and reliability. Much research has showed that a sufficient and necessary condition the approach outperform its individual members is that the base classifiers should be accurate and diverse. An accurate classifier is one that has an error rate of better than randomly guessing classes for new instances, and two classifiers are diverse if they make different errors on common data instances [15]. So there are two aspects to aim for ensemble approaches. First aspect is how to generate diverse base classifiers. In traditional, resampling has been widely used to generating training datasets for base classifiers learning. This method is much too random and due to the small numbers of samples, the datasets may be greatly similar. In this paper, different correlation analysis methods are firstly applied to generate different feature subsets. Because of feature selection methods using different measures, such as Pearson Correlation (PC), Euclidean Distance (ED), the feature subsets must be diverse. Then re-sampling has been used to the different feature subsets to generate learning datasets. Owing to the datasets forming from different feature subsets, it may be much more various. The second aspect is how to combine the base classifiers. In this paper, an intelligent approach for constructing ensemble classifiers is proposed. The methods first training the base classifiers with particle swarm optimization (PSO) algorithm, and then select the appropriate classifiers to construct a high performance classification committee with
estimation of distribution (EDA) algorithm. Experiment show that the proposed methods produce the best recognition rates. The flowchart of our methods is in figure 1. The paper is organized as follows: The feature selection methods based on correlation analysis is introduced in section 2. The optimal design method for constructing ensemble classifiers is described in section 3. Section 4 gives the simulation results. Finally, we present some concluding remarks.
2 Feature Selection Based on Correlation Analysis
Since not all gene expression profiles are informative for understanding the difference between cancer and normal cases, feature selection is needed to exclude irrelevant genes. Two ideal feature markers are defined and used to score the similarity of each gene. The two ideal feature markers are negatively correlated and represent two different aspects of the classification boundary. Each marker is a binary vector of 0s and 1s: the first marker is 1 in class A and 0 in class B, and the second is 0 in class A and 1 in class B. The two markers are expressed as

M_ideal1 = (1, 1, 1, 1, 0, 0, 0, 0, 0, 0),
M_ideal2 = (0, 0, 0, 0, 1, 1, 1, 1, 1, 1).

The two feature markers are highly correlated with the classes. If a gene is similar to one of the markers, the gene is considered informative for classification. To calculate the distance of each gene from the markers, four measures are used in this paper:

Pearson Correlation (PC)
PC = Σ_{i=1}^{n} (ideal_i − μ_ideal)(g_i − μ_g) / sqrt( Σ_{i=1}^{n} (ideal_i − μ_ideal)² Σ_{i=1}^{n} (g_i − μ_g)² )    (1)

Spearman Correlation (SC)
SC = 1 − 6 Σ_{i=1}^{n} (ideal_i − g_i)² / ( n (n² − 1) )    (2)

Euclidean Distance (ED)
ED = sqrt( Σ_{i=1}^{n} (ideal_i − g_i)² )    (3)

Cosine Coefficient (CC)
CC = Σ_{i=1}^{n} ideal_i g_i / sqrt( Σ_{i=1}^{n} ideal_i² Σ_{i=1}^{n} g_i² )    (4)
where n is the number of samples, μ_g is the mean of the gene, μ_ideal is the mean of the ideal marker, g_i is the i-th value of the gene vector, and ideal_i is the corresponding i-th binary value of the ideal marker vector. The following two steps are employed to select informative features: 1) use one of the measures to score all genes in the data (a score is good if the measured distance is small); 2) choose the N/2 best-scoring genes for ideal marker one and the rest for ideal marker two, where N is the total number of features to be selected. After these two steps the genes are ranked according to their significance and form feature subsets corresponding to the different measures. The subsets generated by the Pearson Correlation, Cosine Coefficient, Euclidean Distance, and Spearman Correlation are denoted S_pc, S_cc, S_ed, and S_sc. The training sets are then randomly sampled from a set S (S ∈ {S_pc, S_cc, S_sc, S_ed}). As a result, after k runs of sampling, a pool of training sets is produced for training the base classifiers, which are then trained on these datasets with traditional machine learning approaches.
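A brief sketch of this two-step selection using the Pearson measure (1), assuming `X` is a samples-by-genes expression matrix and `labels` is a binary class vector; the other measures of Eqs. (2)-(4) can be substituted in the same place. Function and variable names are illustrative.

```python
import numpy as np

def select_by_ideal_markers(X, labels, n_features=60):
    """Rank genes by Pearson similarity to the two ideal markers, keep n_features."""
    m1 = (labels == 0).astype(float)      # ideal marker 1: 1 in class A, 0 in class B
    m2 = (labels == 1).astype(float)      # ideal marker 2: 0 in class A, 1 in class B

    def pearson(marker, gene):
        a, b = marker - marker.mean(), gene - gene.mean()
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

    s1 = np.array([pearson(m1, X[:, j]) for j in range(X.shape[1])])
    s2 = np.array([pearson(m2, X[:, j]) for j in range(X.shape[1])])
    half = n_features // 2
    best1 = np.argsort(-s1)[:half]                  # genes matching marker one
    best2 = np.argsort(-s2)[:n_features - half]     # genes matching marker two
    return np.concatenate([best1, best2])

# Illustrative use on random data: 38 samples, 500 genes, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(38, 500))
labels = rng.integers(0, 2, size=38)
print(select_by_ideal_markers(X, labels)[:10])
```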
3 Learning the Datasets with Neural Networks
There are many possible classification methods. In recent years most researchers have applied the SVM (Support Vector Machine) as a classifier to learn microarray datasets and have obtained very good results. However, the SVM is computationally expensive and its training takes a long time; if many SVMs are used as base classifiers in an ensemble, training becomes inefficient. Moreover, SVMs trained with the same learning algorithm tend to be similar, so an ensemble of them cannot efficiently increase the classification accuracy. Selecting the SVM as the base classifier is therefore not a good choice. In this paper we use artificial neural networks as the base classifiers and train them with the PSO (Particle Swarm Optimization) algorithm. An artificial neural network is simple to train and can easily realise different explicit classification functions by changing the number of hidden-layer nodes, and PSO is a global optimization algorithm, so good parameter values can be obtained.
3.1 Artificial Neural Networks
Funahashi [18] has shown that neural networks with at least one hidden-layer can approximate a variety of conditions. The neural networks typically consists of three neural layers: an input layer, a hidden layer and an output layer. (see figure 2) All the neurons in one layer are connected with all the neurons in the next layer. In this type of network, the input layer is determined by the incoming signals. This upper layer distributes the input signals to neurons in the hiddenlayer. Each hidden neuron sums all its input signals by a dot product between its input vector and its weight and then adds a ”bias” input. The final result is
Fig. 2. A 2-3-1 artificial neural network
then transformed by an "activation function" (in neural network terminology) to produce an input signal to the output layer. The output layer processes its input signals in the same fashion. The entire process can be written mathematically as

y_k = f_o( β_k + Σ_j ω_jk f_h( β_j + Σ_i ω_ij x_i ) ),    (5)
where x_i is the input signal, y_k is the output signal, ω_ij is the weight from input neuron i to hidden neuron j, and ω_jk is the weight from hidden neuron j to output neuron k. β_j and β_k are the biases of the hidden and output layers, and f_h and f_o are the activation functions of the hidden and output layers. The logistic function is defined as

f(x) = 1 / (1 + e^{−x}).
(6)
Parameter Optimization with PSO
The Particle Swarm Optimization (PSO) [13] conducts searches using a population of particles which correspond to individuals in evolutionary algorithm (EA). A population of particles is randomly generated initially. Each particle represents a potential solution and has a position represented by a position vector xi . A swarm of particles moves through the problem space, with the moving velocity of each particle represented by a velocity vector vi . At each time step, a function fi representing a quality measure is calculated by using xi as input. Each particle keeps track of its own best position, which is associated with the best fitness it has achieved so far in a vector pi . Furthermore, the best position among all the particles obtained so far in the population is kept track of as pg . In addition to this global version, another version of PSO keeps track of the best position among all the topological neighbors of a particle. At each time step t, by using the individual best position, pi , and the global best position, pg (t), a new velocity for particle i is updated by vi (t + 1) = vi (t) + c1 φ1 (pi (t) − xi (t)) + c2 φ2 (pg (t) − xi (t))
(7)
1216
Y. Zhao, Y. Chen, and X. Zhang
where c1 and c2 are positive constant and φ1 and φ2 are uniformly distributed random number in [0,1]. The term vi is limited to the range of ±vmax . If the velocity violates this limit, it is set to its proper limit. Changing velocity this way enables the particle i to search around its individual best position, pi , and global best position, pg . Based on the updated velocities, each particle changes its position according to the following equation: xi (t + 1) = xi (t) + vi (t + 1).
4
(8)
Optimal Design Method for Constructing Ensemble Classifiers
Select many classifiers for constructing the committee are better than all [6]. So we should select appropriate classifiers to form the classification committee. In traditional, many approaches can accomplish this task, such as greedy hill climbing. It evaluates all the possible local changes to the current set, such as adding one classifier to the set or removing one. It chooses the best or simply the first change that improves the performance of subset. Once a change is made for a subset, it is never reconsidered. But generally, it can not find the optimist solution. In this paper, we introduce a selection method using EDA algorithms. 4.1
Estimation of Distribution Algorithms (EDA)
The EDA was first introduced by Larranaga, P. and Lozano, J. A and price in 2002 [10]. It is a search method that eliminates crossover and mutation from the Genetic Algorithm (GA) and places more emphasis on the relation between gene loci. More precisely, it generates the next generation based on probability distribution of N superior population samples. In this way, the probability distribution estimated at each generation is progressively converted into a probability distribution that generates more superior individuals. The EDA algorithm is given follows: 1) Randomly generate a set of λ individuals (t = 0). 2) Evaluate the λ individuals. 3) Select μ individuals (where μ ≤ λ) to be parents. Develop a probability distribution/density function pt based on the parents. 4) Create λ offspring using pt . 5) Evaluate the offspring. 6) The λ offspring replace the μ parents (t = t + 1). 7) If not done, goto step 3. 4.2
Constructing Ensemble Classifiers by EDA
Suppose K base classifiers are generated after trained by the feature subsets. They expressed as C1 , C2 , C3 , ..., Ck . S is the subsets of {C1 , C2 , C3 , ..., Ck}. Binary vectors are introduced to denote S. If Ci is selected, the ith position of
A Novel Ensemble Approach for Cancer Data Classification
1217
the vector is 1; while Ci is not selected, the ith position is 0. Binary vectors are used to be chromosome of individuals and they can be evolved by EDA algorithm. In order to measure individuals, the fitness function should be created. We first generate the validation set V and then calculate the error Ev of each individual on V . 1/Ev is the fitness. Ev is depicted as follows: Evi =
K
pij × classif ierj
(9)
j=1
Here Evi is the error of the ith individual. K is the total number of base classifiers. Pij is the binary number of chromosome at the jth position. classif ierj is the error of the jth base classifier on V .
5
Experiments
We performed extensive experiments on two benchmark cancer datasets, namely the Leukemia and Colon database. The leukemia dataset was taken from a collection of leukemia patient samples reported by Golub et al. [16]. This well-known dataset often serves as benchmark for microarray analysis methods. It contains measurements corresponding to acute lymphoblast leukemia (ALL) and acute myeloid leukemia (AML) samples from bone marrow and peripheral blood. Two matrices are involved: One includes 38 samples (27 ALL vs. 11 AML, denoted as G1), and the other contains 34 samples (20 ALL vs. 14 AML, denoted as G2). Each sample is measured over 7,129 genes [4]. The samples from colon dataset were taken from colon adencarcinoma specimens snap-frozen in liquid nitrogen within 20 minutes of removal from patients [17]. From some of these patients, paired normal colon tissue also was obtained. The cell lines used (EB and EB-1) and the process of RNA extraction and hybridization to the array are described in [17]. The microarray dataset consists of 22 normal and 40 tumor colon tissue samples. In this dataset, each sample contains 2,000 genes. For this experiment, the normalization procedure is firstly used for preprocessing the raw data. Four steps were taken: 1) If a value is greater than the floor 16000 and smaller than the ceiling 100, this value is replaced by the ceiling/floor. 2) Leaving out the genes with (max − min) ≤ 500, here max and min refer to the maximum and minimum of the expression values of a gene, respectively. 3) Carrying out logarithmic transformation with 10 as the base to all the expression values. 4) For each gene i, subtract the mean measurement of the gene μi and divide by the standard deviation σi . After this transformation, the mean of each gene will be zero, and the standard deviation will be one. After this steps, the correlation selection method is employed to form the training and testing datasets respectively. 9 training datasets are generated and
1218
Y. Zhao, Y. Chen, and X. Zhang Table 1. Parameters used for experimets Common parameters for PSO M : population size 20 c1 , c2 : learning factor 2.0 vmax : the max velocity 1.8 xup : the upper boundary of x 3.0 xdown : the lower boundary of x -3.0 φ1 , φ2 : uniform random number (0,1) Common parameters for EDA λ: population size 20 μ: elite size 5 Common parameters for Neural Network Ni : the number of input layer 30 Nh : the number of hidden layer 6-10 No : the number of output layer 1 1 fh , fo : activate function 1+e−x
Table 2. Relevant works on Leukemia dataset Author Classification Rate (%) Our Method 95.8 - 97.2 Furey et al. [16] 94.1 Li et al. [19] 84.6 Ben-Dor et al. [11] 91.6 - 95.8 Nguyen et al. [12] 94.2 - 96.4
Table 3. Relevant works on Colon dataset Author Clssification Rate (%) Our Method 85.5 - 93.3 Furey et al. [16] 90.3 Li et al. [19] 94.1 Ben-Dor et al. [11] 72.6 - 80.6 Nguyen et al. [12] 87.1 - 93.5
60 informative features of each sample are extracted for training the 9 base classifiers. In our experiment, the neural network is employed to be the classifier and we use particle swarm optimization (PSO) to adjust the weights of each neural network. Then EDA was applied for selecting appropriate NNs to constructing the classification committee. Table 1 indicates the parameters used for experiments. A comparison of different feature extraction methods and different classification methods for leukemia dataset (average classification rate for 20 independent runs) is shown in Table 2. Table 3 depicts the classification performance of the ensemble classifiers by using the 60 features for colon dataset.
A Novel Ensemble Approach for Cancer Data Classification
6
1219
Conclusions
In this paper, a novel ensemble of classifiers based on correlation analysis is proposed for cancer classification. The leukemia and colon databases are used for conducting all the experiments. Gene features are first extracted by the correlation analysis technique which greatly reduces dimensionality as well as maintains the informative features. Then the EDA is employed to construct the classifier committee for classification. Compare the results with some advanced artificial techniques, the proposed method produces the best recognition rates.
Acknowledgments This research was supported by the NSFC under grant No. 60573065 and the Key Subject Research Foundation of Shandong Province.
References 1. Chu, F., Wang, L.: Appliations of Support Vector Machines to Cancer Classification with Microarray Data. International Journal of Neural Systems 15(6) (2005) 475-484 2. Chen, Y., Peng, L., Abraham, A.: Gene Expression Profiling Using Flexible Neural Trees. IDEAL 2006, Burgos, Spain, Lecture Notes on Computer Science 4224 (2006) 1121-1128 3. Roth, V., Lange, T.: Bayesian Class Discovery in Microarray Datasets. IEEE Trans. Biomed. Eng. 51(5) (2004) 707-818 4. Zhang, A.: Advanced Analysis of Gene Expression Microarray Data. World Scientific Press (2006) 183-184 5. Tan, A., Gilbert, D.: Ensemble Machine Learning on Gene Expression Data for Cancer Classification. Appl. Bioinform. 2(Suppl 3) (2003) 75-83 6. Zhou, Z.H., Wu, J., Tang, W.: Ensembling Neural Networks: Many Could Be Better Than All. Artificial Intelligence 137(1-2) (2002) 239-263 7. Camp, N., Slattery, M.: Classification Tree Analysis: A Statistical Tool to Investigate Risk Factor Interactions with an Example for Colon Cancer. Cancer Causes Contr 13(9) (2002) 813-823 8. Li, L., Weinberg, C., Darden, T., Pedersen, L.: Gene Selection for Sample Classification Based on Gene Expression Data: Study of Sensitivity to Choice of Parameters of the GA/KNN Method. Bioinformatics 17(12) (2001) 1131-1142 9. Azuaje, F.: A Computational Neural approach to Support the Discovery of Gene Function and Classes of Cancer. IEEE Trans. Biomed. Eng. 48(3) (2001) 332-339. 10. Larranaga, P., Lozano, J.A.: Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation. Kluwer Academic Publishers (2001) 11. Cho, S.-B.: Exploring Features and Classifiers to Classify Gene Expression Profiles Of acute Leukemia. Int. J. Pattern Recogn. Artif. Intell. 16(7) (2002) 1-13 12. Harrington, C.A., Rosenow, C., Retief, J.: Monitoring Gene Expression Using DNA Microarrays. Curr. Opin. Microbiol. 3 (2000) 285-291 13. Clerc, M., and Kennedy, J.: The Particle Swarm: Explosion, Stability, and Convergence in a Multidimensional Complex Space. IEEE Transactions on Evolutionary Computation 6 (2002) 58-73
14. Sarkar, I., Planet, P., Bael, T., Stanley, S., Siddall, M., DeSalle, R., et al.: Characteristic Attributes in Cancer Microarrays. J. Biomed. Inform. 35(2) (2002) 111-122
15. Dietterich, T.G.: Ensemble Methods in Machine Learning. Proceedings of the First International Workshop on Multiple Classifier Systems (2000) 1-15
16. Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., Lander, E.S.: Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science 286(12) (1999) 531-537
17. Alon, U., Barkai, N., Notterman, D.A., Gish, K., Ybarra, S., Mack, D., Levine, A.J.: Broad Patterns of Gene Expression Revealed by Clustering Analysis of Tumor and Normal Colon Tissues Probed by Oligonucleotide Arrays. Proc. Natl. Acad. Sci. 96(12) (1999) 6745-6750
18. Funahashi, K.: On the Approximate Realization of Continuous Mappings by Neural Networks. Neural Networks 2 (1989) 183-192
19. Eisen, M.B., Brown, P.O.: DNA Arrays for Analysis of Gene Expression. Methods Enzymol. 303 (1999) 179-205
20. Terrence, S.F., Nello, C., Nigel, D., David, W.B., Michel, S., David, H.: Support Vector Machine Classification and Validation of Cancer Tissue Samples Using Microarray Expression Data. Bioinformatics 16(10) (2000) 906-914
Biological Sequence Data Preprocessing for Classification: A Case Study in Splice Site Identification A.K.M.A. Baten1, S.K. Halgamuge1, B. Chang1, and N. Wickramarachchi2 1
Dynamic Systems and Control Research Group, DoMME, Faculty of Engineering, The University of Melbourne, Parkville 3010, Australia
[email protected],
[email protected],
[email protected] 2 Department of Electrical Engineering, The University of Moratuwa, Sri Lanka
[email protected]
Abstract. The increasing growth of biological sequence data demands better and more efficient analysis methods. Effective detection of various regulatory signals in these sequences requires knowledge of the characteristics, dependencies, and relationships of nucleotides in the region surrounding the regulatory signals. A higher-order Markov model is generally regarded as a useful technique for modeling higher-order dependencies of the nucleotides. However, its implementation requires estimating a large number of computationally expensive parameters. In this paper, we propose a hybrid method consisting of a first-order Markov model for sequence data preprocessing and a multilayer perceptron neural network for classification. The Markov model captures the compositional features and dependencies of nucleotides in terms of probabilistic parameters which are used as inputs to the classifier. The classifier combines the Markov probabilities nonlinearly for signal detection. When applied to the splice site detection problem using three widely used data sets, the proposed hybrid method is able to model higher-order dependencies with better classification accuracies.
1 Introduction

The recent advances and automation in sequencing technology have generated a vast amount of biological sequence data, and this has created strong needs for various sequence pattern analysis algorithms. In eukaryotic genomes, the detection of a coding region also depends on the precise identification of the exon-intron structures. However, the vast length and structural complexity of sequence data make this a very challenging task. Most of the eukaryotic protein coding genes consist of introns and exons. The exons are the protein coding regions of a gene and they are interspersed with intervening sequences of introns. Introns are termed the protein non-coding regions of a gene as their functions are not yet well known. The borders between introns and exons are termed splice sites. The splice site in the upstream part of an intron is called the donor splice site (in the direction 5' to 3') and the downstream part is termed the acceptor splice site (in the direction 3' to 5'). The acceptor and donor splice sites with
consensus AG (corresponding to the end of an intron) and GT (corresponding to the beginning of an intron), respectively, are known as canonical splice sites. The non-canonical splice sites are those with minor consensus such as GC and AC. Approximately 99% of the splice sites are canonical AG/GT splice sites [1]. As AG and GT represent possible acceptor and donor splice sites, every AG and GT is a candidate acceptor or donor splice site and needs to be classified as either a real (true) splice site or a pseudo (false) splice site. A number of computational methods have been developed in the past to identify splice sites, including probabilistic approaches [2, 3-10], neural network and support vector machine approaches [11-15], and methods based on discriminant analysis [16, 17]. These methods seek consensus patterns or features and try to identify the underlying relationships among nucleotides in a splice site and the surrounding region. Neural networks and support vector machines (SVM) learn the complex features of neighborhoods surrounding the consensus di-nucleotides AG/GT by a complex non-linear transformation. Probabilistic models estimate position-specific probabilities of splice sites by computing likelihoods of candidate signal sequences. Discriminant analysis uses several statistical measures to evaluate the presence of specific nucleotides, recognizing the splice sites without explicitly determining the probability distributions. Inspired by the presence of the apparent consensus AG and GT in the splicing junctions, earlier researchers attempted to predict splice sites by using the weight matrix method (WMM) [12, 18]. Later, WMM was adopted in methods such as NetPlantGene [19] and NNSplice [11]. The weight array model (WAM) developed by [7] describes the dependencies between adjacent nucleotides by an inhomogeneous first-order Markov model (MM1); however, its accuracy is not satisfactory. The maximal dependence decomposition (MDD) model in Genscan [3] is a decision tree process based on the dependencies among the nucleotides. The Bayes network model [9] and MDD showed an improvement over the previous splice site detection models. It has been suggested that a significant improvement in the detection of splice sites is possible if one of the base statistical models, such as WMM, MM1, or MDD, is combined with other signal/content methods [4]. GeneSplicer is a method of this category [4], where second-order Markov models (MM2) are combined with MDD. Similarly, Rajapakse and Ho [20] introduced a more complex splice site prediction system which combines mostly MM2 and backpropagation neural networks (BPNN). In this approach the Markov model serves as a preprocessing step for the BPNN and shows better prediction accuracy than GeneSplicer. However, it requires longer sequence windows for training. In this paper we propose a more efficient input preprocessing scheme by using MM1 and first-order weight matrix model (WMM1) preprocessing. Our proposed MM1-BPNN combination shows better classification performance with less computational complexity than in [20] and [3]. We also used the zero-order Markov model (MM0), the zero-order weight matrix model (WMM0), the radial basis function network (RBFN), and the different combinations WMM1-BPNN, MM1-RBFN, WMM1-RBFN, MM0-BPNN, and MM0-RBFN for classification performance comparisons. As MM0 and WMM0 imply the same model, we refer to them as MM0/WMM0.
The remainder of the paper is structured as follows: Section 2 describes the modeling of splice sites by Markov models and the integration of MM1 and neural
networks. Simulation results, performance comparisons, and discussions are presented in Section 3. Concluding remarks and future work are given in Section 4.
2 Method

2.1 Markov Model Preprocessing of Splice Site Data

Each nucleotide in a DNA sequence corresponds to a state in the Markov chain used, whose observed state variables are elements drawn from the alphabet $\Omega_{DNA} = \{A, C, G, T\}$. Let us define an arbitrary sequence of length $l$: $\{s_1, s_2, s_3, \ldots, s_l\}$ where $s_i \in \{A, C, G, T\}$, $\forall i \in \{1, \ldots, l\}$; then the nucleotide $s_i$ is a realization of the $i$-th state variable of the Markov chain, and apart from the transition from state $i$ to state $i+1$ there are no transitions from state $i$ to other states. Hence, the model consists of states ordered in series. It evolves from state $s_i$ to $s_{i+1}$ and emits symbols from the alphabet $\Omega_{DNA}$, where each state is characterized by a position-specific probability parameter. Assuming a Markov chain of order $k$, the likelihood of a sequence implied by the model can be given as

$$P(s_1, s_2, \ldots, s_l \mid M) = \prod_{i=1}^{l} P_i(s_i), \qquad (1)$$

where the Markovian probability $P_i(s_i) = P(s_i \mid s_{i-1}, s_{i-2}, \ldots, s_{i-k})$ denotes the conditional appearance of the nucleotide at location $i$ depending on its $k$ predecessors. Such a model is characterized by a set of parameters $\{P_i(s_i \mid s_{i-1}, \ldots, s_{i-k}) : s_i, s_{i-1}, \ldots, s_{i-k} \in \Omega_{DNA},\ i = 1, 2, \ldots, l\}$. In the proposed method, MM1 is used to represent a set of nucleotides in a sequence. The Markovian parameters are expressed in terms of position-specific first-order conditional probabilities ($k = 1$):

$$P_i(s_i) = P(s_i \mid s_{i-1}). \qquad (2)$$

The model is then characterized by the set of parameters $\{P_i(s_i \mid s_{i-1}) : s_i, s_{i-1} \in \Omega_{DNA},\ i = 1, 2, \ldots, l\}$.
Loi and Rajapakse [20] suggested that the sequence should be divided into upstream, signal, and downstream segments. The signal segment is modeled by a first-order Markov model, whereas the downstream and upstream segments are modeled by two second-order Markov models. If the lengths of the signal, upstream, and downstream segments are $s$, $u$, and $d$ respectively, then the corresponding conditional probabilities are given by

$$P(s_1, s_2, \ldots, s_s) = \prod_{i=1}^{s} P(s_i \mid s_{i-1}), \qquad (3)$$

$$P(s_1, s_2, \ldots, s_u) = \prod_{i=1}^{u} P(s_i \mid s_{i-1}, s_{i-2}), \qquad (4)$$

$$P(s_1, s_2, \ldots, s_d) = \prod_{i=1}^{d} P(s_i \mid s_{i-1}, s_{i-2}). \qquad (5)$$
If the length of the sequence is $L = u + s + d$, then the proposed MM1-BPNN method requires estimating $L \cdot 4^{k+1}$ Markovian parameters, where $k = 1$. On the other hand, the Loi-Rajapakse [20] method requires estimating $u \cdot 4^{k_1+1} + s \cdot 4^{k_2+1} + d \cdot 4^{k_1+1}$ Markovian parameters, where $k_1$ and $k_2$ are the orders of the Markov models with $k_1 = 2$ and $k_2 = 1$.

2.2 Neural Networks
Neural networks are inspired by biological neurons and have the ability to adapt or learn, to generalize, and to cluster or organize data. In this work two neural networks are used: a multilayer backpropagation neural network (BPNN) with fully connected weights and a radial basis function network (RBFN). The BPNN is trained by a backpropagation algorithm and captures the higher-order dependencies around the splice site [20]. The BPNN receives as inputs the Markovian probabilities generated by the MM1. Suppose the neural network has $n$ input nodes; if the input to the $j$-th input node is $x_j$, then $x_j = P_i(s_i)$, where $P_i(s_i) = P(s_i \mid s_{i-1}, s_{i-2}, \ldots, s_{i-k})$.

The BPNN has a hidden layer of $m$ units and one output unit. The network output $y$ predicts whether the input sequence contains an actual splice site or not, where $y$ is given by

$$y = f\left(\sum_{k=1}^{m} w_k\, f_k\left(\sum_{j=1}^{n} w_{kj}\, x_j\right)\right), \qquad (6)$$

where $f_k$, $k = 1, 2, \ldots, m$, and $f$ denote the activation functions of the hidden neurons and the output neuron, respectively, and $w_k$, $k = 1, 2, \ldots, m$, and $w_{kj}$, $k = 1, 2, \ldots, m$, $j = 1, 2, \ldots, n$, denote the weights connected to the output neuron and to the hidden layer neurons, respectively. The output activation function is a unipolar sigmoid and the hidden layer activation functions take the form of hyperbolic tangent sigmoids. The RBFN is also used for comparison purposes.
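As a concrete reading of Eq. (6), the minimal sketch below evaluates the network output for one input vector of Markovian probabilities; the layer sizes and weights are arbitrary placeholders, not values from the paper.

```python
import numpy as np

def bpnn_output(x, W_hidden, w_out):
    """Eq. (6): y = f( sum_k w_k * f_k( sum_j w_kj * x_j ) ).

    Hidden activations f_k are hyperbolic tangent sigmoids; the output
    activation f is a unipolar sigmoid, as stated in the text."""
    hidden = np.tanh(W_hidden @ x)                    # one value per hidden unit
    return 1.0 / (1.0 + np.exp(-(w_out @ hidden)))

# Illustrative sizes only: n Markovian-probability inputs, m hidden units.
n, m = 8, 4
rng = np.random.default_rng(1)
x = rng.random(n)                  # stand-in for the P_i(s_i) values of one window
W_hidden = rng.normal(size=(m, n))
w_out = rng.normal(size=m)
print(bpnn_output(x, W_hidden, w_out))
```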
2.3 Higher Order Markov Models
The low-order Markov chain provides a probabilistic description of signals. The neural network receives the Markov probabilities and combines them non-linearly in order to incorporate more complex and distant interactions among elements in the splice sites. In this section it is shown that, by connecting the outputs of low-order Markov models to the neural network, it is possible to achieve a higher-order Markov model. Schukat-Talamazzini et al. [22] introduced the interpolated Markov chain for stochastic language modeling. The higher-order conditional dependencies can be approximated by interpolation given a sequence $(s_1, s_2, \ldots, s_i)$:

$$P(s_i \mid s_1^{i-1}) \approx \frac{\sum_{j=0}^{i-1} a_j\, g_j(s_{i-j}^{i-1})\, \hat{P}(s_i \mid s_{i-j}^{i-1})}{\sum_{j=0}^{i-1} a_j\, g_j(s_{i-j}^{i-1})}, \qquad (7)$$
where $g_j$ is a sigmoid function. By using the chain rule of probabilities and by replacing conditional probabilities with probabilities conditioned on fewer elements,

$$P(s_1, \ldots, s_l) \approx P(s_1) \prod_{i=2}^{l} \sum_{j=1}^{i-1} b_{ij}\, \hat{P}(s_i \mid s_{i-j}^{i-1}), \qquad (8)$$
where $\{b_{ij} : i = 2, \ldots, l,\ j = 1, \ldots, i\}$ is a set of linear coefficients. That is, the non-linear relationship amongst the variables in the sequence can be represented by a polynomial of sufficient order. As supported by [23], a neural network with a single hidden layer, having a sufficient number of hidden neurons, is capable of approximating continuous multivariate functions defined on a hypercube $[0,1]^n$, and thereby the input-output relationship represented by any higher-order polynomial. So the neural network receives inputs from low-order Markov chains, and its output can be represented in the form

$$y = \sum_{m_1, \ldots, m_l \ge 0;\ m_1 + \cdots + m_l = l} c_{m_1, \ldots, m_l}\, P_1(s_1)^{m_1} \cdots P_l(s_l)^{m_l}, \qquad (9)$$

where $\{m_i;\ i = 1, 2, \ldots, l\}$ are non-negative integers, $\{c_{m_1, \ldots, m_l}\}$ is a set of real-valued coefficients, and $\{P_i(s_i) : i = 1, 2, \ldots, l\}$ are Markovian probabilities computed from the low-order model $M$. Observing Eq. (8) and Eq. (9), it can be deduced that the neural network output $y$ represents a higher-order Markov model, as also pointed out in [20].

3 Results and Discussion

3.1 Datasets

To evaluate the performance of the proposed MM1-BPNN model we conducted several simulations with three publicly available data sets of different sizes. The data sets are GS1115 [4], NN269 [11], and the Homo sapiens splice site data set HS3D [24]. Each of the data sets is divided randomly into a training data set and a test (control) set. The training and test data sets do not share any common sequences.
To evaluate the performance of the proposed MM1-BPNN model we conducted several simulations with three publicly available data sets of different sizes. The data sets are GS1115 [4], NN269 [11], and the homo sapiens splice site data set HS3D [24]. Each of the data sets is divided randomly into a training data set and a test (control) set. The training and test data sets do not share any common sequences between them. The dataset GS1115 consists of 1115 human genes which contain 5733 true acceptor sites, 5733 true donor sites, and 650099 false acceptor and 488983 false donor sites. Each of the false (pseudo) acceptor/donor sites has AG/GT in the splicing junction but is not a real splice site according to the annotation. The second dataset is known as NN269 [11], which consists of 1324 confirmed true acceptor sites, 1324 confirmed true donor sites, 5552 false acceptor sites and 4922 false donor sites collected from 269 human genes. This data set is split into a training set and a testing set. The training data set contains 1116 true acceptor, 1116 true donor, 4672 false acceptor, and 4140 false donor sites. The test data set contains 208 true acceptor sites, 208 true donor sites, 881 false acceptor sites, and 782 false donor sites.
The Homo sapiens splice site data set (HS3D) is a set of 2955 true acceptor sites and 2992 true donor sites with a window size of 140 nucleotides around each splice site. Also, 287,296 false acceptor sites and 348,370 false donor sites are included in this data set. Both training and test data sets are constructed by taking all the available true acceptor and donor sites together with an equal number of randomly selected false sites.

3.2 Model Implementation and Learning
The splice site detection problem is divided into two subproblems, namely acceptor splice site identification and donor splice site identification. A leave-one-out cross-validation procedure is applied to determine the accuracy of the model. The cross-validation is performed by randomly partitioning the data into five independent subsets, which do not share any repeated sequences. The model was trained using four of the subsets and was tested on the remaining subset. The average of the five prediction measures is taken as the final prediction. The training of a model was conducted in two stages: the estimation of the MM1 parameters and the training of the BPNN using the MM1 parameters. The training sequences were aligned with respect to the consensus dinucleotides prior to stage one. The estimates of the MM1 parameters are the ratios of the frequencies of each dinucleotide in each sequence position, as shown in (10). Only the true splice site training sequences were used to create the Markov model. The desired output level is set to +1 or -1 depending on the true or false splice site class label.

$$\hat{P}_i(s_i) = \frac{\#\left(s_{i-k}^{i}\right)}{\#\left(s_{i-k}^{i-1}\right)}, \qquad (10)$$

where $\#(\cdot)$ denotes the number of training sequences containing the given subsequence at that position.
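A minimal sketch of the two training stages just described is given below, assuming toy aligned sequences. The counting follows Eq. (10) with $k = 1$; the small pseudocount is an assumption added here to avoid division by zero, and is not stated in the paper.

```python
import numpy as np

ALPHABET = "ACGT"
IDX = {b: i for i, b in enumerate(ALPHABET)}

def estimate_mm1(true_site_seqs, pseudo=1e-3):
    """Position-specific first-order probabilities P_i(s_i | s_{i-1}), Eq. (10):
    ratio of dinucleotide counts ending at position i to counts at position i-1."""
    L = len(true_site_seqs[0])
    # counts[i, a, b] = number of training sequences with s_{i-1} = a and s_i = b
    counts = np.full((L, 4, 4), pseudo)
    for seq in true_site_seqs:
        for i in range(1, L):
            counts[i, IDX[seq[i - 1]], IDX[seq[i]]] += 1
    return counts / counts.sum(axis=2, keepdims=True)

def mm1_encode(seq, probs):
    """Encode a candidate window as the vector of P_i(s_i | s_{i-1}) values,
    which the text feeds to the BPNN as its inputs x_j."""
    return np.array([probs[i, IDX[seq[i - 1]], IDX[seq[i]]]
                     for i in range(1, len(seq))])

# Toy aligned true-splice-site windows (hypothetical, for illustration only).
train = ["ACGTAGGT", "ACGAAGGT", "TCGTAGGA", "ACGTAGCT"]
probs = estimate_mm1(train)
print(mm1_encode("ACGTAGGT", probs))   # feature vector of length L-1
```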
3.3 Accuracy Measures
The classification performance is defined by the sensitivity $(S_N)$, specificity $(S_P)$, and accuracy $(Acc)$ of the model:

$$Sensitivity\ (S_N) = \frac{TP}{TP + FN}, \qquad Specificity\ (S_P) = \frac{TN}{TN + FP}, \qquad Accuracy\ (Acc) = \frac{S_N + S_P}{2},$$

where TP, TN, FP, and FN denote the number of true positives, true negatives, false positives, and false negatives, respectively.
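For clarity, the three measures can be computed directly from the confusion counts; a minimal sketch with hypothetical counts (not results from the paper):

```python
def prediction_measures(TP, TN, FP, FN):
    """Sensitivity, specificity, and the averaged accuracy used in the text."""
    SN = TP / (TP + FN)
    SP = TN / (TN + FP)
    return SN, SP, (SN + SP) / 2.0

print(prediction_measures(TP=190, TN=820, FP=61, FN=18))
```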
3.4 Classification Performance and Discussion

3.4.1 Best Preprocessing Model Selection
We used several preprocessing schemes with BPNN and RBFN to compare their performances and to verify the usefulness of our proposed MM1-BPNN method. For instance, we combined BPNN and RBFN with the zero-order Markov model (MM0), which is also well known as the WMM model. In this study we created several models, including MM1-BPNN, WMM1-BPNN, MM0/WMM0-BPNN, MM1-RBFN, WMM1-RBFN, and MM0/WMM0-RBFN, and we applied all the models to splice site
identification. Based on the performance we found that MM1 serves as the best preprocessing model and MM1-BPNN is the best splice site prediction method (see the classification performance comparison below). Hence, in this study MM1-BPNN is used as our main splice site prediction method.

3.4.2 Classification Performance Comparison
Table 1 shows the comparison of performance between the MM1-BPNN method and the Loi-Rajapakse [20] and MDD [3] methods on the GS1115 dataset. The MM1-BPNN method identifies approximately 94% of the acceptor splice sites and 95% of the donor splice sites. This performance is superior to that of the MDD method, which identifies approximately 92% of acceptor splice sites and 93% of donor splice sites. However, the Loi-Rajapakse method, which is a combination of first- and second-order Markov models with BPNN, produces the best acceptor splice site prediction accuracy among all the models.

Table 1. Comparison between MM1-BPNN, the Loi-Rajapakse method, and the MDD method on data set GS1115

Data set   Splice site   Loi-Rajapakse accuracy   MDD accuracy   MM1-BPNN accuracy
GS1115     Acceptor      0.945                    0.921          0.942
GS1115     Donor         0.940                    0.938          0.952
To further verify the prediction accuracies of the MM1-BPNN method we used two other standard splice site data sets: NN269 and HS3D. We also used WMM1 and MM0/WMM0 preprocessing for the BPNN. As shown in Table 2, MM1-BPNN produces superior acceptor and donor splice site identification for both the NN269 and HS3D datasets. WMM1-BPNN produces the second best accuracy; however, MM0/WMM0-BPNN shows better performance for the HS3D acceptor splice site than the WMM1-BPNN method. Table 3 shows the performance of the MM1-RBFN, WMM1-RBFN, and MM0/WMM0-RBFN methods on the NN269 and HS3D datasets. Similarly to MM1-BPNN, MM1-RBFN produces better accuracies than the WMM1-RBFN and MM0/WMM0-RBFN methods (Table 3). From Table 2 and Table 3 we observe that MM1 is the best preprocessing method. We also find that MM1-BPNN is the best among all BPNN methods and MM1-RBFN is the best among all RBFN methods. However, comparing MM1-BPNN and MM1-RBFN, MM1-BPNN produces better prediction accuracies except for the NN269 donor splice site, where MM1-RBFN is marginally better than our proposed MM1-BPNN. Finally, we compared the performance of the MM1-BPNN method with standalone BPNN and standalone RBFN methods. As shown in Table 4, MM1-BPNN is clearly a better splice site prediction method than the two standalone methods. This indicates that MM1 preprocessing enhances the performance of classifiers such as BPNN and RBFN.
Table 2. Comparison of performance between the MM1-BPNN, WMM1-BPNN, and MM0/WMM0-BPNN methods

Dataset   Splice site   MM1-BPNN accuracy   WMM1-BPNN accuracy   MM0/WMM0-BPNN accuracy
NN269     Acceptor      0.956               0.931                0.927
NN269     Donor         0.960               0.938                0.931
HS3D      Acceptor      0.925               0.880                0.894
HS3D      Donor         0.927               0.895                0.892
Table 3. Comparison of performance between the MM1-RBFN, WMM1-RBFN, and MM0/WMM0-RBFN methods

Dataset   Splice site   MM1-RBFN accuracy   WMM1-RBFN accuracy   MM0/WMM0-RBFN accuracy
NN269     Acceptor      0.954               0.929                0.929
NN269     Donor         0.968               0.942                0.945
HS3D      Acceptor      0.930               0.878                0.856
HS3D      Donor         0.921               0.893                0.872
Table 4. Comparison of performance between the MM1-BPNN method and standalone BPNN and standalone RBFN methods without the preprocessing

Dataset   Splice site   MM1-BPNN accuracy   Standalone BPNN accuracy   Standalone RBFN accuracy
NN269     Acceptor      0.956               0.852                      0.862
NN269     Donor         0.960               0.887                      0.873
HS3D      Acceptor      0.925               0.821                      0.814
HS3D      Donor         0.927               0.824                      0.829
The overall performance of the MM1-BPNN method is encouraging. When compared with two other standard splice site detection methods, our proposed MM1-BPNN showed better performance, and the Loi-Rajapakse method showed the second best performance. The performance of MM1-BPNN is not significantly better than that of the Loi-Rajapakse method. However, the Loi-Rajapakse method mostly uses second-order Markov models and requires a longer sequence window for the training of the model. The computational complexity of a Markov model increases exponentially with its order, and a higher-order model requires more training data for the estimation of the Markovian parameters (see also Section 2.1). So the Loi-Rajapakse method can be computationally very expensive if we consider the large size of sequence data. The MM1-BPNN method shows improvement in splice site
prediction performance and at the same time reduces the computational complexity and requires fewer training samples.
4 Conclusion and Future Work

Identification of splice sites in DNA sequences requires modeling the underlying low-level sequential relationships between nucleotides. In this paper we demonstrated a useful first-order Markov encoding method that captures the sequence information and helps to improve the learning ability and classification performance of the classifier by 10-15%. We also showed that WMM preprocessing can assist classifiers such as BPNN and RBFN in producing better classification accuracies. We studied the splice site detection problem as a case study and observed a significant improvement in splice site identification. This method can serve as a useful component for gene finding methods as well as many other related problems such as transcription initiation site (TIS) and translation start site (TSS) detection. Future work includes the use of other classifiers as reported in [25] in combination with neuro-fuzzy methods [26,27,28] that can generate classification rules.
References 1. Burset, M., Seledtsov, A., Solovyeva, V.V.: Analysis of Canonical and Non-Canonical Splice Sites in Mammalian Genomes. Nucleic Acids Research 28 (2000) 4364-4375 2. Chen, T.M., Lu, C.C., Li, W.H.: Prediction of Splice Sites with Dependency Graphs and Their Expanded Bayesian Networks. Bioinformatics 21 (2005) 471-482 3. Burge, C., Karlin, S.: Prediction of Complete Gene Structure in Human Genomic DNA. Journal of Molecular Biology 268 (1997) 78-94 4. Pertea, M., L, X.Y., Salzberg, S.L.: GeneSplicer: A New Computational Method for Splice Site Detection. Nucleic Acids Research 29 (2001) 1185-1190 5. Marashi, S.A., Eslahchi, C., Pezeshk, H., Sadeghi, M.: Impact of RNA Structure on the Prediction of Donor and Acceptor Splice Sites. BMC Bioinformatics 7 (2006) 297 6. Salzberg, S.: A Method for Identifying Splice Sites and Translation Start Site in Eukaryotic mRNA. Computer Applications in the Biosciences 13 (1997) 384-390 7. Zhang, M., Marr, T.: A Weight Array Method for Splicing Signal Analysis. Comput Appl Biosci 9 (1993) 499-509 8. Castelo, R., Guigo, R., Splice Site Identification by idlBNs. Bioinformatics 20 (2004), 69-76 9. Cai, D., Delcher, A., Kao, B., Kasif, S.: Modeling Splice Sites with Bayes Networks. Bioinformatics 16 (2000) 152-158 10. Staden, R.: The Current Status and Portability of Our Sequence Handling Software. Nucleic Acids Research 14 (1986) 217-231 11. Reese, M.G., Eeckman, F., Kupl, D., Haussler, D.: Improved Splice Site Detection in Genie. Journal of Computational Biology 4 (1997) 311-324 12. Brunak, S., Engelbrecht, J., Knudsen, S.: Prediction of mRNA Donor and Acceptor Sites From the DNA Sequence. Journal of Molecular Biology 220 (1991) 49-65 13. Zhang, X., Katherine, A.H., Ilana, H., Christina, S.L., Lawrence, A.C.: Sequence Information for the Splicing of Human Pre-mRNA Identified by Support Vector Machine Classification. Genome Research 13 (2003) 2637-2650
14. Sun, Y.F., Fan, X.D., Li, Y.D.: Identifying Splicing Sites in Eukaryotic RNA: Support Vector Machine Approach. Computers in Biology and Medicine 33 (2003) 17-29
15. Sonnenburg, S.: New Methods for Detecting Splice Junction Sites in DNA Sequence. Master's Thesis, Humboldt University, Germany (2002)
16. Chuang, J.S., Roth, D.: Splice Site Prediction Using a Sparse Network of Winnows. Technical Report, University of Illinois, Urbana-Champaign (2001)
17. Zhang, L. et al.: Splice Site Prediction with Quadratic Discriminant Analysis Using Diversity Measure. Nucleic Acids Research 31 (2003) 6214-6220
18. Arita, M., Tsuda, K., Asai, K.: Modeling Splicing Sites with Pairwise Correlations. Bioinformatics 18 (2002) 27-34
19. Hebsgaard, S.M., Korning, P.G., Tolstrup, N., Engelbrecht, J., Rouze, P., Brunak, S.: Splice Site Prediction in Arabidopsis Thaliana Pre-mRNA by Combining Local and Global Sequence Information. Nucleic Acids Research 24 (1996) 3439-3452
20. Rajapakse, J.C., Ho, L.S.: Markov Encoding for Detecting Signals in Genomic Sequences. IEEE/ACM Trans. Computational Biology and Bioinformatics 2 (2005) 131-142
21. Loi, S.H., Rajapakse, J.C.: Splice Site Detection with a Higher-Order Markov Model Implemented on a Neural Network. Genome Informatics 14 (2003) 64-72
22. Schukat-Talamazzini, E.G., Gallwitz, F., Harbeck, S., Warnke, V.: Rational Interpolation of Maximum Likelihood Predictors in Stochastic Language Modeling. Proc. of the European Conference on Speech Communications and Technology 5 (1997) 2731-2734
23. Pinkus, A.: Approximation Theory of the MLP Model in Neural Networks. Acta Numerica (1999) 143-195
24. Pollastro, P., Rampone, S.: HS3D - Homo Sapiens Splice Sites Dataset. Nucleic Acids Research 2003 (Annual Database Issue)
25. Baten, A.K.M., Chang, B.C.H., Halgamuge, S.K., Li, J.: Splice Site Identification Using Probabilistic Parameters and SVM Classification. BMC Bioinformatics 7 (Suppl 5) (2006) S15
26. Halgamuge, S.K., Glesner, M.: Fuzzy Neural Networks Between Functional Equivalence and Applicability. Int. J. Neural Systems 6 (1995) 185-196
27. Halgamuge, S.K.: Trainable Transparent Universal Approximator for Defuzzification in Mamdani-type Neuro-Fuzzy Controllers. IEEE Trans. Fuzzy Systems 6 (1998) 304-314
28. Halgamuge, S.K., Glesner, M.: Neural Networks in Designing Fuzzy Systems for Real World Applications. Fuzzy Sets and Systems 65 (1994) 1-12
A Method of X-Ray Image Recognition Based on Fuzzy Rule and Parallel Neural Networks Dongmei Liu1 and Zhaoxia Wang2 1
Department of Science and Technology, Nanjing University of Science and Technology, Nanjing 210094, Jiangsu, China
[email protected] 2 Department of Science and Technology, Wuhan University of Science and Technology, Wuhan 430072, Hubei, China
[email protected]
Abstract. The detection of explosives and illicit material in passengers' luggage for the purpose of station security is an important area in public traffic security. This paper presents a method for X-ray image recognition based on fuzzy rules and parallel neural networks. Neural networks have been widely used in various fields; however, their computing efficiency decreases rapidly as the scale of the neural network increases. In this paper, a new method of X-ray image recognition based on a fuzzy-neuron system is proposed. In the fuzzy rule method, a test pattern may belong to several classes with different degrees. Each neural network classifier is dedicated to a single class and is used to confirm whether the pattern really belongs to the class suggested by the fuzzy rules; the two stages are combined to obtain the recognition result. The experimental results show that the new method performs well.
1 Introduction
The detection of explosives and illicit material in passengers' luggage for the purposes of station security is an important area in public traffic security. There are a number of methods for solving this problem, including X-ray techniques (scatter, dual-energy, and transmission imaging), X-ray-based computed tomography, vapour detection, quadrupole resonance analysis, and nuclear techniques [1]. Among all these methods, X-ray detection is the most common means of inspecting luggage at the station because it is safer for humans and luggage, cheaper to buy and operate, and well understood. However, one of the main disadvantages of using X-ray technology is the high false alarm rate (the system alarms frequently on harmless objects). A number of recent studies have investigated novel X-ray technologies, from hardware methods to software methods. The software methods use computer technologies, including signal processing and pattern recognition, for improving the quality of images or automatically detecting explosives and illicit material.
A number of studies have addressed the issue of automated detection of explosives and illicit material. As early as 1992, Bjorkholm [2] applied computer technologies to contraband detection using X-rays; in the same year, Cable [3] proposed that intelligent technologies should be applied to aviation security. The main aim of the study by Liu [4] was to automatically detect elongated objects such as detonators using Gabor filters, the Hough transform, and information fusion. These studies, however, only apply very basic image processing and pattern recognition tools; as Q. Lu [5] put it, "unfortunately, most current systems didn't develop the mature image processing". Advances in screening technologies themselves are therefore not enough for developing robust screening systems, given the lack of a real luminance level in X-ray images, so it is very important to improve image processing and image recognition technologies. The investigation of computer-based data analysis in the context of station security is very much an open-ended research topic, as it has not been studied in as much detail as one might expect. Singh [6,7] focused on image enhancement and image segmentation using a knowledge-based framework and applied advanced intelligent technologies, for example neural networks and machine learning, to the automated detection of dangerous explosives. Neural networks have been proven to possess many advantages for pattern recognition systems because of their learning ability and good generalization. Generally speaking, multi-layered networks, usually coupled with the backpropagation algorithm, are widely used for image recognition. However, when this algorithm is used to build an X-ray image recognition system on a large-scale training set, some problems are encountered, such as a lower recognition rate and very slow convergence in training. In order to solve this problem, a fuzzy rule and parallel neural network system is proposed in this paper. In the fuzzy rule method, a pattern may belong to several classes with different degrees. A value between zero and one is assigned to each pattern by a membership function; the application of fuzzy sets in a classification function makes class membership relative, and an object can belong to several classes at the same time. Each class has its own neural network classifier, and the parallel neural networks are composed of 3-layer BP neural networks. Details of this system are described in the remainder of this paper. Section 2 covers the preprocessing of the system. In Section 3, we present a method for X-ray image recognition based on fuzzy rules and parallel neural networks. In Section 4, experimental results of evaluating the developed techniques are presented. Finally, conclusions are summarized in Section 5.
2 Pre-processing

2.1 Image Pre-processing
Due to the imaging process and the influence of the outer environment, X-ray images are poor in legibility, uneven in luminance, and contain much noise; what is more, there are numerous kinds of articles in luggage and they are overlapped, enveloped, and mixed up, all of which make further processing and recognition of these images difficult. Therefore, pre-processing is needed to improve the quality of X-ray images, to
remove the noise of the background and objects, to enlarge the intensity difference between objects and background, and to enhance the edge information so that segmentation becomes easier. In pre-processing, we have developed a simple and effective method named gradually fuzzy enhancement (GFE), which combines the global histogram information and the local window information. The basic procedure of the GFE can be described by the following steps:

(1) Noise removal by background subtraction.
(2) Fuzzy weighted smoothing processing.
(3) Adaptive multi-peak histogram equalization processing.
(4) Repeat (2)-(3) until the processed image no longer changes.
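A rough sketch of this iteration is given below. The smoothing and equalization steps here are deliberately simplified stand-ins (a plain 3x3 weighted mean filter and global histogram equalization); the actual fuzzy weighted smoothing and adaptive multi-peak equalization are defined in [8] and are not reproduced here, and the image data is hypothetical.

```python
import numpy as np

def weighted_smooth(img):
    # Stand-in for "fuzzy weighted smoothing": 3x3 weighted mean filter.
    padded = np.pad(img, 1, mode="edge")
    kernel = np.array([[1, 2, 1], [2, 4, 2], [1, 2, 1]], dtype=float) / 16.0
    out = np.zeros_like(img, dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out

def hist_equalize(img):
    # Stand-in for "adaptive multi-peak histogram equalization": global equalization.
    vals = np.clip(img, 0, 255).astype(np.uint8)
    hist = np.bincount(vals.ravel(), minlength=256)
    cdf = np.cumsum(hist) / vals.size
    return cdf[vals] * 255.0

def gfe(image, background, max_iter=10, tol=1.0):
    img = np.clip(image.astype(float) - background, 0, 255)   # step (1)
    for _ in range(max_iter):                                  # repeat (2)-(3)
        new = hist_equalize(weighted_smooth(img))
        if np.mean(np.abs(new - img)) < tol:                   # stop when stable
            break
        img = new
    return img

rng = np.random.default_rng(2)
image = rng.integers(0, 256, size=(64, 64)).astype(float)
background = np.full((64, 64), 20.0)
print(gfe(image, background).shape)
```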
The details and discussions of the GFE method are presented in [8]. In order to obtain the integrated objects for recognition, we adopt the segmentation method presented in [9], a structural segmentation method based on ARG matching. This method consists of a series of graph-matching algorithms based on models under a similarity measure, the fuzzy similarity distance (FSD), which represents the similarity of the attributed relation between a vertex neighborhood and a certain model. Finally, the Number of Layer attribute of each region is obtained, and the integrated objects can be extracted using relational attributes and spatial information.

2.2 Color Descriptor
The color of an X-ray image is related to the material of the object; if the object is made up of different materials, the X-ray image consists of several different colors. In order to describe the color of images properly, we develop a new color descriptor which can express not only an object with an almost consistent hue but also one with different hues. The color descriptor is expressed by the following equation:

$$Color = C_1 N_1 / N + C_2 N_2 / N + \cdots + C_k N_k / N. \qquad (1)$$

It means the object has $N$ pixels and consists of $k$ different hues $C_i\ (i = 1..k)$ with $N_i\ (i = 1..k)$ pixels, which satisfy the constraints $N_1 \le N_2 \le \cdots \le N_k$ and $N_1 + N_2 + \cdots + N_k \approx N$. According to the above descriptor, we construct some color features as follows:

$$Ave\_Hue_\alpha = \sum_{i=1}^{M} C_i N_i \Big/ \sum_{i=1}^{M} N_i, \qquad (2)$$

$$Consistency\_Hue_\alpha = 1 / \left(1 + \eta (M - 1)^\beta\right), \qquad (3)$$

where $M$ is the maximum index that satisfies the constraint $N_1 + N_2 + \cdots + N_M \le \alpha N$.
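A small sketch of Eqs. (1)-(3) is given below, assuming the hues and pixel counts of an object have already been obtained from segmentation; the values of α, η, and β used here are placeholders rather than the paper's settings.

```python
import numpy as np

def color_features(hues, counts, alpha=0.9, eta=0.5, beta=1.0):
    """Eqs. (1)-(3). `hues` are the k hue values C_i of an object and `counts`
    their pixel counts N_i; they are sorted so that N_1 <= ... <= N_k."""
    order = np.argsort(counts)
    C, Nk = np.asarray(hues, float)[order], np.asarray(counts, float)[order]
    N = Nk.sum()

    color = np.sum(C * Nk / N)                          # Eq. (1): overall descriptor

    # M is the largest index with N_1 + ... + N_M <= alpha * N.
    cumulative = np.cumsum(Nk)
    M = max(int(np.searchsorted(cumulative, alpha * N, side="right")), 1)

    ave_hue = np.sum(C[:M] * Nk[:M]) / np.sum(Nk[:M])   # Eq. (2)
    consistency = 1.0 / (1.0 + eta * (M - 1) ** beta)   # Eq. (3)
    return color, ave_hue, consistency

# Hypothetical object made of three hue groups.
print(color_features(hues=[0.10, 0.35, 0.60], counts=[120, 300, 580]))
```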
2.3 Feature Extraction
Different hazardous articles have different important features, so for each object it is necessary to use different features for recognition. We make full use of the prior knowledge of each object and develop a simple weighted feature extraction (WFE) method based on this prior knowledge. Let us assume that we have $N$ training patterns $X_i$, each of which is described by $p$ continuous features as $X_i = (x_{i1}, x_{i2}, \ldots, x_{ip})$. We assume a class label $C_i \in [1, M]$ has already been assigned to each of the $N$ patterns. The patterns of each class can be expressed as $S_j = \{X_i \mid class(X_i) = j\},\ j = 1..M$. The basic procedure of the WFE method can be described by the following steps (a sketch of the procedure follows the list):

Step 1: Initially, each feature is assigned a value $w_K^k\ (k = 1..p)$ based on prior knowledge.
Step 2: The distance within each class is computed as follows:
(1) Randomly select a subset of patterns $S_j'$ from $S_j$;
(2) Compute the average value of each feature over $S_j'$ as $(\bar{x}_1^j, \bar{x}_2^j, \ldots, \bar{x}_p^j)$;
(3) Compute the squared deviation of each feature over $S_j'$ as $(\sigma_1^j, \sigma_2^j, \ldots, \sigma_p^j)$, where $\sigma_k^j = \sum_{X_i \in S_j'} (x_{ik} - \bar{x}_k^j)^2$;
(4) Repeat (1)-(3) several times and take the average of the squared values of each feature, $\bar{\sigma}_k^j$; the distance within each class can then be expressed as $(\bar{\sigma}_1^j, \bar{\sigma}_2^j, \ldots, \bar{\sigma}_p^j)$.
Step 3: The distance between classes is expressed as $(\sigma_1, \sigma_2, \ldots, \sigma_p)$, where $\sigma_k = \sum_{l,m=1}^{M} (\bar{x}_k^l - \bar{x}_k^m)^2$.
Step 4: Another value $w_S^k\ (k = 1..p)$ is obtained from the fuzzy membership function of the within-class distance and the between-class distance.
Step 5: Finally, the importance of each feature is computed by the following equation: $w_k = \alpha\, w_K^k + (1 - \alpha)\, w_S^k$.
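A compact sketch of the WFE procedure under stated assumptions: the within-class scatter is averaged over a few random subsets as in Step 2, and because the fuzzy membership function of Step 4 is not spelled out here, a simple normalized ratio of between-class to within-class distance is used as a stand-in for w_S. The data and prior weights are hypothetical.

```python
import numpy as np

def wfe_weights(X, y, prior_w, alpha=0.5, n_repeats=5, subset_frac=0.6, rng=None):
    """Weighted feature extraction sketch: combine prior-knowledge weights with a
    data-driven score. X is (N, p), y holds class labels, prior_w has length p."""
    if rng is None:
        rng = np.random.default_rng(0)
    classes = np.unique(y)
    p = X.shape[1]

    within = np.zeros(p)
    class_means = []
    for c in classes:
        Xc = X[y == c]
        class_means.append(Xc.mean(axis=0))
        # Step 2: average squared deviation from the subset mean over random subsets.
        acc = np.zeros(p)
        for _ in range(n_repeats):
            size = max(2, int(subset_frac * len(Xc)))
            sub = Xc[rng.choice(len(Xc), size, replace=False)]
            acc += ((sub - sub.mean(axis=0)) ** 2).sum(axis=0)
        within += acc / n_repeats
    class_means = np.array(class_means)

    # Step 3: between-class distance, summed over all class pairs, per feature.
    between = np.zeros(p)
    for l in range(len(classes)):
        for m in range(len(classes)):
            between += (class_means[l] - class_means[m]) ** 2

    # Step 4 stand-in: larger between/within ratio -> larger w_S, scaled to [0, 1].
    w_s = between / (between + within + 1e-12)
    w_s = w_s / w_s.max()

    # Step 5: blend prior weights with the data-driven score.
    return alpha * np.asarray(prior_w, float) + (1.0 - alpha) * w_s

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 6)) + np.repeat(np.arange(3)[:, None] * [1, 0, 0, 2, 0, 0], 20, axis=0)
y = np.repeat([0, 1, 2], 20)
print(wfe_weights(X, y, prior_w=np.ones(6)))
```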
3 Fuzzy Rule and Parallel Neural Networks
From a distance, when people look at complex or illegible objects, they cannot recognize the object accurately from only a few coarse features, but from the known features they can narrow down what the object may possibly be. When closer, they can find finer features and recognize what the object actually is. Inspired by this phenomenon, we develop a method for X-ray image recognition based on fuzzy rules and parallel neural networks. In the training process, we construct two different types of classifiers, the fuzzy classifier and the neural network classifier. The fuzzy classifier is made up of a series of fuzzy rules, and there are three different fuzzy classifiers for different types of features: color, shape, and texture. Each class has its own neural network classifier, and the parallel neural networks are composed
of 3-layer BP neural networks. In the recognition process, the unknown pattern first enters the three different fuzzy classifiers simultaneously, and each one produces some possible classes. Then, by combining the result of each fuzzy classifier, the judging mechanism derives the fuzzy recognition result. Finally, each neural network classifier is used to decide whether the pattern really belongs to the corresponding class or not. Fig. 1 illustrates the framework of our method.
Fig. 1. The framework of our fuzzy-neuron method
3.1 Fuzzy Recognition System
Recently, various approaches have been proposed for generating fuzzy if-then rules for image recognition problems. While many sophisticated approaches have been proposed, very simple fuzzy if-then rules have also been used as fuzzy classifiers. It is often said that the number of fuzzy if-then rules increases exponentially as the number of features increases. Thus, in order to reduce the number of fuzzy if-then rules, we select a small number of features for constructing a fuzzy classifier using the smallest information entropy method in [10]. In the learning of membership functions, first the membership functions of the antecedent fuzzy sets for the fuzzy if-then rules are evenly placed in the unit interval [0,1]. Then we increase the number of antecedent fuzzy sets that cause misclassification of an input pattern in an error-correction manner.

Fuzzy Rules. Let us assume that we have $N$ training patterns $Training = \{(X^1, C^1), (X^2, C^2), \ldots, (X^N, C^N)\}$, each of which is described by $p$ continuous features as $X^k = (x_1^k, x_2^k, \ldots, x_p^k),\ k = 1..N$, where $x_i^k$ is the attribute value of the $i$-th feature in the $k$-th pattern $X^k$. We also assume that a class label has already been assigned to each of the $N$ patterns as $C^k \in \{C_1, C_2, \ldots, C_M\}$.
Without loss of generality, each attribute value is normalized into a real number in the unit interval [0,1]. From the given training patterns, we generate fuzzy if-then rules $R_{q_1 q_2 \ldots q_d}$ of the following type: if $x_1^k$ belongs to $A_1^{q_1}$ and $x_2^k$ belongs to $A_2^{q_2}$ and ... $x_d^k$ belongs to $A_d^{q_d}$, then $X^k$ belongs to $SC_{q_1,q_2,\ldots,q_d} = \{C_i\}$ with $SCF_{q_1,q_2,\ldots,q_d} = \{CF_i\}$, where $R_{q_1,q_2,\ldots,q_d}$ is the rule label, $\{A_i^1, A_i^2, \ldots, A_i^{K_i}\}$ is the set of antecedent fuzzy sets of the $i$-th feature, $q_i \in \{1, 2, \ldots, K_i\}$, $A_i^{q_i}$ is the $q_i$-th antecedent fuzzy set of the $i$-th feature, $SC_{q_1,q_2,\ldots,q_d} = \{C_i\}$ is the class set of the fuzzy if-then rule $R_{q_1,q_2,\ldots,q_d}$, and $SCF_{q_1,q_2,\ldots,q_d} = \{CF_i\}$ is the corresponding set of certainty grades of each class of the fuzzy rule $R_{q_1 q_2 \ldots q_d}$. The fuzzy classification system is a coarse classifier and grants several classes to each object, so the conclusion of each rule is a class set.

Fuzzy Learning. A heuristic method for determining the class set and the corresponding certainty grade set from the given training patterns is proposed as follows:

Step 1: Initially, for each class $C_i\ (i = 1..M)$, compute the total compatibility as follows:

$$\alpha_{C_i} = \sum_{X^k \in C_i} \mu_1^{q_1}(x_1^k)\,\mu_2^{q_2}(x_2^k)\cdots\mu_d^{q_d}(x_d^k) \Big/ \sum_{X^k \in C_i} 1, \qquad (4)$$

where $\alpha_{C_i}$ is the total compatibility between the patterns $X^k \in C_i$ and the fuzzy rule $R_{q_1 q_2 \ldots q_d}$, and $\mu_i^{q_i}(x_i^k)$ is the membership function of the antecedent fuzzy set $A_i^{q_i}$.
Step 2: Sort $\alpha_{C_1}, \alpha_{C_2}, \ldots, \alpha_{C_M}$ in decreasing order.
Step 3: If $\alpha_{C_i} = 0$ for all $i = 1..M$, then $SC_{q_1,q_2,\ldots,q_d} = SCF_{q_1,q_2,\ldots,q_d} = NULL$ and the fuzzy rule $R_{q_1,q_2,\ldots,q_d}$ is invalid.
Step 4: For each $\alpha_{C_i} \neq 0$ with $\alpha_{C_i} > \bar{\alpha}$, add the class label $C_i$ to the class set $SC_{q_1,q_2,\ldots,q_d}$ in decreasing order.
Step 5: For each $C_i \in SC_{q_1,q_2,\ldots,q_d}$, compute the corresponding certainty grade as follows and add it to the certainty grade set $SCF_{q_1,q_2,\ldots,q_d}$:

$$CF_i = (\alpha_{C_i} - \bar{\alpha}) \Big/ \sum_{k=1}^{M} \alpha_{C_k}, \qquad (5)$$

where $\bar{\alpha} = \sum_{k=1, k \neq i}^{M} \alpha_{C_k} / (M - 1)$.
From the above method, we can come to some conclusions: (a) If every pattern in the fuzzy subspace $A_1^{q_1} \times A_2^{q_2} \times \cdots \times A_d^{q_d}$ belongs to a single class $C_X$, then $\alpha_{C_X} \neq 0$ and $\alpha_{C_i} = 0$ for all $i \neq X$, so $SC_{q_1,q_2,\ldots,q_d} = \{C_X\}$ and $SCF_{q_1,q_2,\ldots,q_d} = \{1\}$. (b) If there is no pattern in the fuzzy subspace $A_1^{q_1} \times A_2^{q_2} \times \cdots \times A_d^{q_d}$, then $\alpha_{C_i} = 0$ for all $i = 1, 2, \ldots, M$ and $SC_{q_1,q_2,\ldots,q_d} = SCF_{q_1,q_2,\ldots,q_d} = NULL$. (c) If the patterns belong to more than one class, then $\alpha_{C_i} \neq 0$ for several $i$ and there are several elements in $SC_{q_1,q_2,\ldots,q_d}$ and $SCF_{q_1,q_2,\ldots,q_d}$.

Fuzzy Reasoning. Let us assume that we have $L$ valid fuzzy rules generated from the given training patterns. A pattern $X = (x_1, x_2, \ldots, x_p)$ is classified by the
fuzzy classification system and labelled by the possible class set $S_X$. The process of fuzzy classification is proposed as follows:

Step 1: Initially, for each class $C_i\ (i = 1, 2, \ldots, M)$, compute the membership value

$$\beta_{C_i} = \max\left\{\mu_1^{q_1}(x_1)\,\mu_2^{q_2}(x_2)\cdots\mu_d^{q_d}(x_d)\,CF_i \;:\; C_i \in SC_{q_1,q_2,\ldots,q_d},\ CF_i \in SCF_{q_1,q_2,\ldots,q_d},\ R_{q_1,q_2,\ldots,q_d} \in S\right\}.$$

Step 2: Sort $\beta_{C_1}, \beta_{C_2}, \ldots, \beta_{C_M}$ in decreasing order.
Step 3: If $\beta_{C_i} = 0$ for all $i$, the pattern $X = (x_1, x_2, \ldots, x_p)$ belongs to an unknown class.
Step 4: For each $\beta_{C_i} \neq 0$ with $\beta_{C_i} > \bar{\beta}$, add the class label $C_i$ to the class set $S_X$ in decreasing order.
Step 5: Finally, the possible classes of the pattern $X = (x_1, x_2, \ldots, x_p)$ are those in $S_X$.
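The learning and reasoning steps above can be condensed into a small sketch. It assumes evenly spaced triangular membership functions on [0,1] and two features per rule (d = 2); these choices and the toy data are illustrative assumptions, not the paper's configuration.

```python
import numpy as np
from itertools import product

def tri_membership(x, centers):
    """Evenly placed triangular fuzzy sets on [0, 1]; returns mu for each set."""
    width = centers[1] - centers[0]
    x = np.asarray(x, float)
    return np.maximum(0.0, 1.0 - np.abs(x[:, None] - centers[None, :]) / width)

def learn_rules(X, y, n_sets=3, n_classes=2):
    centers = np.linspace(0.0, 1.0, n_sets)
    rules = {}
    for q in product(range(n_sets), repeat=X.shape[1]):     # one rule per fuzzy subspace
        alpha = np.zeros(n_classes)
        for c in range(n_classes):
            members = X[y == c]
            compat = np.prod([tri_membership(members[:, j], centers)[:, q[j]]
                              for j in range(X.shape[1])], axis=0)
            alpha[c] = compat.sum() / len(members)           # Eq. (4)
        if np.all(alpha == 0):
            continue                                         # Step 3: invalid rule
        abar = np.array([(alpha.sum() - alpha[i]) / (n_classes - 1)
                         for i in range(n_classes)])
        classes = [c for c in range(n_classes) if alpha[c] > 0 and alpha[c] > abar[c]]
        if classes:
            rules[q] = {c: (alpha[c] - abar[c]) / alpha.sum() for c in classes}  # Eq. (5)
    return rules, centers

def classify(x, rules, centers, n_classes=2):
    beta = np.zeros(n_classes)
    for q, cfs in rules.items():
        compat = np.prod([tri_membership([x[j]], centers)[0, q[j]] for j in range(len(x))])
        for c, cf in cfs.items():
            beta[c] = max(beta[c], compat * cf)              # reasoning Step 1
    return np.argsort(beta)[::-1], beta                      # candidate classes, sorted

# Toy two-feature, two-class data normalized into [0, 1].
rng = np.random.default_rng(4)
X = np.vstack([rng.uniform(0.0, 0.5, (30, 2)), rng.uniform(0.5, 1.0, (30, 2))])
y = np.repeat([0, 1], 30)
rules, centers = learn_rules(X, y)
print(classify(np.array([0.2, 0.3]), rules, centers))
```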
3.2 Parallel Neural Networks
Neural networks are non-parametric alternatives for image recognition [11]. The most popular multi-layer neural networks use the backpropagation algorithm for training. In this study, a fully connected NN with $p$ input neurons, 1 output neuron, and $p/2$ hidden neurons has been simulated.

Training Processing. Each NN classifier targets only one class, so the output layer consists of only one neuron. The number of input-layer nodes and the input features differ between classes. The input features are extracted through the weighted feature extraction (WFE) method above. Then the BP algorithm is performed in each neural network respectively, and it is repeated until the sum of squared errors becomes equal to or smaller than a certain value. The training rule adjusts the weights in order to move the network output closer to the targets. A variable learning rate has been used in our research: the performance of the learning algorithm can be improved if we allow the learning rate to change during the training process. First, the initial network output and error are calculated. At each epoch, new weights are calculated using the current learning rate, and the new output and error are then calculated. If the new error exceeds the old error, the new weights are discarded and the learning rate is decreased. Otherwise, the new weights are kept, and if the new error is less than the old error, the learning rate is increased.

Recognition Processing. When a test pattern is input into the parallel NN, some NNs have already been excluded based on the recognition result of the fuzzy classifiers; only those whose fuzzy membership exceeds the threshold value need to decide whether the test pattern belongs to their class or not. The recognition processes of the NNs run at the same time. This parallel NN is easy to expand: when classes are added, one only needs to add the corresponding NN classifiers.
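The adaptive learning-rate policy described above can be written as a small training loop. The sketch below uses a single-hidden-layer network trained by gradient descent on squared error; the increase/decrease factors and the data are arbitrary choices for illustration, not the paper's settings.

```python
import numpy as np

def train_variable_lr(X, t, n_hidden, epochs=200, lr=0.1, up=1.05, down=0.5, rng=None):
    """Backprop with the variable learning-rate policy: if an update increases the
    error, discard the new weights and decrease lr; otherwise keep them and increase lr."""
    if rng is None:
        rng = np.random.default_rng(0)
    W1 = rng.normal(scale=0.1, size=(X.shape[1], n_hidden))
    w2 = rng.normal(scale=0.1, size=n_hidden)

    def forward(W1, w2):
        h = np.tanh(X @ W1)
        return h, h @ w2

    h, y = forward(W1, w2)
    err = np.mean((y - t) ** 2)
    for _ in range(epochs):
        # Gradients of the squared error with respect to the current weights.
        dy = 2.0 * (y - t) / len(t)
        g_w2 = h.T @ dy
        g_W1 = X.T @ (np.outer(dy, w2) * (1.0 - h ** 2))
        new_W1, new_w2 = W1 - lr * g_W1, w2 - lr * g_w2
        new_h, new_y = forward(new_W1, new_w2)
        new_err = np.mean((new_y - t) ** 2)
        if new_err > err:
            lr *= down                      # discard update, decrease learning rate
        else:
            W1, w2, h, y, err = new_W1, new_w2, new_h, new_y, new_err
            lr *= up                        # keep update, increase learning rate
    return W1, w2, err

# Hypothetical 24-feature patterns with targets +1 (class member) / -1 (others).
rng = np.random.default_rng(5)
X = rng.normal(size=(80, 24))
t = np.sign(X[:, 0] + 0.5 * X[:, 1])
print(train_variable_lr(X, t, n_hidden=12)[2])
```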
4 Experiments and Discussion
In this work, experiments have been carried out using X-ray images. Three fuzzy classifiers and eleven parallel neural networks classifiers are structured based on 110 training patterns during the training process. There are 200 non-sample patterns to be tested.
We adopt the recognition rate and the falseness rate to assess the classifiers. The falseness rate consists of three parts: the loss rate, meaning a hazardous article is considered a normal one; the misinformation rate, meaning a normal article is considered a hazardous one; and the mistake rate, meaning one type of hazardous article is considered another type. The 6 features for the fuzzy classifiers are extracted by the smallest information entropy method in [10], and the 24 features for the neural network classifiers are extracted by the method in Section 2.3. In order to compare feature extraction methods, we also show the results of using all 52 features and of using the 34 features excluding those used in the fuzzy classifiers. Table 1 gives the recognition results on the training samples for the fuzzy classifiers, the neural network classifiers, and the combination of fuzzy classifiers and neural network classifiers. Table 2 gives the recognition results on non-sample patterns for the different classifiers. From the experimental results, we can draw the following conclusions: (a) the fuzzy classifiers have an accuracy of 100% both for sample and non-sample patterns; (b) the NN and Fuzzy-NN classifiers have an accuracy of 100% for sample patterns; (c) the feature extraction method is effective and the recognition rate for non-sample patterns can reach 95%. Table 1. The Recognition Results of sample
Table 2. The Recognition Results of non-sample
5 Conclusions
In this paper, we proposed a fuzzy-neuron system approach for X-ray image recognition. In the fuzzy rule method, a pattern may belong to several classes with different degrees. There is a neural network classifier for each class, which is used to confirm whether the pattern really belongs to the class suggested by the fuzzy rules; the two stages are combined to obtain the recognition result. The experimental results show that the new method performs well.
References
1. Singh, S., Singh, M.: Explosives Detection Systems (EDS) for Aviation Security: A Review. Signal Processing 83 (2003) 31-55
2. Bjorkholm, P., Wang, T.R.: Contraband Detection Using X-rays with Computer Assisted Image Analysis. Proceedings of the Symposium on Contraband and Cargo Inspection Technology (1992) 111-115
3. Cable, A.P.: Some Aspects of the Use of Intelligent Systems Engineering in the Design of Airport Security Programmes. Proceedings of the First International Conference on Intelligent Systems Engineering, Edinburgh (1992) 77-85
4. Liu, W.: Automatic Detection of Elongated Objects in X-ray Images of Luggage. Master's Thesis, Department of Electrical and Computer Engineering, Virginia Tech. and State University, Blacksburg, VA (1997)
5. Lu, Q.: The Utility of X-ray Dual-Energy Transmission and Scatter Technologies for Illicit Material Detection. Ph.D. Thesis, Department of Computer Engineering, Virginia Polytechnic Institute and State University, Blacksburg, VA (1999)
6. Singh, M., Singh, S.: Image Segmentation Optimization for X-ray Images of Airline Luggage. CIHSPS2004 - IEEE International Conference on Computational Intelligence for Homeland Security and Personal Safety, Venice, Italy (2004)
7. Singh, M., Singh, S., Partridge, D.: A Knowledge-Based Framework for Image Enhancement in Aviation Security. IEEE Trans. Systems, Man and Cybernetics B 34 (2004) 2354-2365
8. Liu, D.M.: A Simple and Effective Enhancement Algorithm to X-ray Image. Application Research of Computers (2007) in press
9. Wang, L.L.: Structural X-ray Image Segmentation for Threat Detection by Attribute Relational Graph Matching. International Conference on Neural Networks and Brain (2005) 1206-1210
10. Nakashima, T., Nakai, G., Ishibuchi, H.: Improving the Performance of Fuzzy Classification Systems by Membership Function Learning and Feature Selection. IEEE International Conference on Fuzzy Systems, May 12-17, 1 (2002) 488-493
11. Cios, K.J.: Image Recognition Neural Network - IRNN. Neurocomputing (1995) 159-185
Detection of Basal Cell Carcinoma Based on Gaussian Prototype Fitting of Confocal Raman Spectra

Seong-Joon Baek1, Aaron Park1, Sangki Kang2, Yonggwan Won1, Jin Young Kim1, and Seung You Na1

1 The School of Electronics and Computer Engineering, Chonnam National University, Gwangju, South Korea, 500-757
2 Telecommunication R&D Center, Samsung Electronics Co., LTD., South Korea, 426-791
Abstract. Confocal Raman spectroscopy is known to have strong potential for providing noninvasive dermatological diagnosis of skin cancer. According to previous work, various well-known methods, including the maximum a posteriori probability classifier (MAP), a linear classifier using the minimum squared error criterion (MSE), and a multilayer perceptron network classifier (MLP), showed competitive results for basal cell carcinoma (BCC) detection. The experimental results are hard to interpret, however, since the classifiers use global features obtained by principal component analysis (PCA). In this paper, we propose a method that can identify which regions of the spectra are discriminating for BCC detection. For this purpose, 5 and 7 Gaussian prototypes were built, located at the typical peak positions of BCC and normal (NOR) tissue spectra, respectively. Every spectrum is approximated by a linear combination of the Gaussian prototypes. A decision tree is then applied to identify which prototypes are important for the detection of BCC. Among the 12 prototypes, 5 discriminating prototypes were selected and the associated weights were used as an input feature vector. According to the experiments involving 216 confocal Raman spectra, support vector machines (SVM) gave 97.4% sensitivity, which confirms that the peak regions corresponding to the selected features are significant for BCC detection and that the proposed fitting method is effective.
1 Introduction
Skin cancer is one of the most common cancers in the world. Recently, the incidence of skin cancer has dramatically increased due to the excessive exposure of skin to UV radiation caused by ozone layer depletion, environmental contamination, and so on. If detected early, skin cancer has a cure rate of 100%. Unfortunately, early detection is difficult because diagnosis is still based on morphological inspection by a pathologist [1].
This work was supported by grant No. RTI-04-03-03 from the Regional Technology Innovation Program of the Ministry of Commerce, Industry and Energy(MOCIE) of Korea.
There are two common skin cancers: basal cell carcinoma (BCC) and squamous cell carcinoma (SCC). Among them, BCC is the most common skin neoplasm and is difficult to distinguish from surrounding noncancerous tissue. Clinical dermatologists have requested the accurate detection of BCC. The routine diagnostic technique used for the detection of BCC is the pathological examination of biopsy samples. It involves rather complex treatments and relies upon a subjective judgment, which depends on the level of experience of an individual pathologist and can lead to excessive biopsy of tissues. Thus, a fast and accurate diagnostic technique for the initial screening and selection of lesions for further biopsy is needed [2]. To resolve this problem, some researchers carried out BCC detection using Fourier transform (FT) Raman spectroscopy [3, 4]. However, FT Raman spectra from a long-wavelength excitation laser give a poor signal-to-noise ratio and suffer from background noise, so complicated statistical treatments are required for FT Raman spectra to eliminate the noise. Recently, a direct observation method based on the confocal Raman technique was presented for the dermatological diagnosis of BCC using a shorter-wavelength laser [2]. According to that study, confocal Raman spectra provide promising results for the detection of precancerous and noncancerous lesions without special treatments. Based on that result, we investigated various classification methods, including MAP, probabilistic neural networks, k-nearest neighbor, and MLP classification, for BCC detection [5]. According to the results, MAP and MLP gave a classification error rate of about 4-5%. An ambiguous category was also introduced for perfect classification. The experimental results showed that perfect classification was possible with a reasonable number of ambiguous data, i.e., about 8% of the test patterns in the case of MSE [6]. However, the experimental results are hard to interpret, and it is difficult to identify which protein bands are the most discriminating, since the classifiers use global features obtained by PCA. Hence we investigated a method that can identify which regions of the spectra are significant for the detection of BCC. Inspecting the peak distributions of BCC and NOR spectra, we built 5 and 7 Gaussian prototypes located at the typical peak positions of BCC and NOR spectra, respectively. Every spectrum was then approximated by a weighted sum of the Gaussian prototypes and the weights were used as a feature vector. A decision tree was applied to identify which prototypes are important for the detection of BCC. Five of the 12 prototypes were selected and the associated weights were used as an input feature vector. Experiments were carried out to show that the 5 peak regions associated with the selected prototypes are significant bands for BCC detection and that the proposed fitting method is effective.
2 Sample Preparation and Preprocessing
The tissue samples were prepared with conventional treatment. Details for the biological and chemical processes are described in [2]. A skin biopsy and spectral measurements were carried out in the perpendicular direction from the epidermis
to the dermis. Confocal Raman spectra of BCC tissues were measured at different spots with an interval of 30-40 μm. In this way, 216 Raman spectra were collected from 10 patients. After measurement, the Raman spectra were clipped at Raman shifts from 1750 to 1000 cm−1, the region known to contain all important protein bands, e.g., the amide III, lipid and protein, amide I, and phospholipid and nucleic acid modes. Then the spectra were normalized so that they fall in the interval [0,1]. To build Gaussian prototypes located at the peak positions, we first searched for peaks in the BCC and NOR spectra. Before peak picking, the spectra were smoothed by a moving average with a span of 25 cm−1. We marked all peak positions and obtained the peak distributions. The smoothed peak distributions are plotted in Fig. 1 and Fig. 2. Fig. 1 shows the 5 dominant peak positions of BCC and Fig. 2 shows the 7 dominant peak positions of NOR.
Fig. 1. Peak distribution of BCC spectra. Straight lines indicate the selected peaks.
After determining the dominant peak positions, we computed the average width, i.e., the standard deviation of the Gaussian prototypes located at those peak positions. All peak positions and widths of the Gaussian prototypes are listed in Table 1. Given center $\mu_i$ and width $\sigma_i$, the $i$-th Gaussian prototype is

$$g_i(x) = \frac{1}{\sqrt{2\pi}\,\sigma_i}\, e^{-\frac{1}{2}\left(\frac{x - \mu_i}{\sigma_i}\right)^2}.$$

Every spectrum is approximated by a linear combination of the Gaussian prototypes. Let the $i$-th BCC and NOR Gaussian prototypes be $g_i^b$ and $g_i^n$. A spectrum $y(x)$ is approximated by the following equations:
y(x) ≈ f_b(x) = Σ_{i=1}^{5} w_i^b g_i^b(x),   y(x) ≈ f_n(x) = Σ_{i=1}^{7} w_i^n g_i^n(x),
w^b = arg min ||y(x) − f_b(x)||²,   w^n = arg min ||y(x) − f_n(x)||²,
Fig. 2. Peak distribution of NOR spectra. Straight lines indicate the selected peaks.

Table 1. Gaussian prototypes for BCC and NOR spectra

        BCC                               NOR
Weight  w1b   w2b   w3b   w4b   w5b       w1n   w2n   w3n   w4n   w5n   w6n   w7n
Center  1098  1340  1410  1457  1588      1090  1255  1317  1450  1535  1595  1658
Width   55    60    35    30    55        40    40    40    25    25    25    25
where w^b and w^n are the weight vectors. The two weight vectors are concatenated to give a feature vector, whose dimension is therefore 12. Two approximation examples are shown in Fig. 3 and Fig. 4. Figure 3 shows a BCC spectrum fitted by the g_i^b, while Fig. 4 shows a NOR spectrum fitted by the g_i^n. The major peaks in the original spectra are preserved in the fitted spectra. Thus we can say that the Gaussian prototypes located at the typical peak regions successfully approximate the main characteristics of BCC and NOR spectra.
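The fitting step above can be realized, for example, by ordinary least squares over the prototype basis. The sketch below uses the centers and widths of Table 1; it is only one possible implementation of the arg min, since the paper does not name its solver.

```python
import numpy as np

BCC_CENTERS = [1098, 1340, 1410, 1457, 1588]
BCC_WIDTHS  = [55, 60, 35, 30, 55]
NOR_CENTERS = [1090, 1255, 1317, 1450, 1535, 1595, 1658]
NOR_WIDTHS  = [40, 40, 40, 25, 25, 25, 25]

def gaussian(x, mu, sigma):
    # Gaussian prototype g_i(x) with center mu and width sigma
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

def fit_weights(shift_axis, spectrum, centers, widths):
    """Least-squares weights of the Gaussian prototypes for one normalized spectrum."""
    G = np.column_stack([gaussian(shift_axis, m, s) for m, s in zip(centers, widths)])
    w, *_ = np.linalg.lstsq(G, spectrum, rcond=None)
    return w

def feature_vector(shift_axis, spectrum):
    """Concatenate the 5 BCC weights and 7 NOR weights into the 12-D feature vector."""
    wb = fit_weights(shift_axis, spectrum, BCC_CENTERS, BCC_WIDTHS)
    wn = fit_weights(shift_axis, spectrum, NOR_CENTERS, NOR_WIDTHS)
    return np.concatenate([wb, wn])
```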
3 Classification Methods and Experimental Results
A decision tree classifies a pattern through a sequence of questions, in which the next question asked depends upon the answer to the current question. Such a sequence of questions forms nodes connected by successive links or branches downward to other nodes. Each node selects the feature that makes the data in its descendant nodes as pure (homogeneous) as possible. With noisy data, however, this decision scheme is not always appropriate, so we used the decision tree as a feature selector rather than as a classifier. Using leave-one-out experiments with the decision tree, we obtained 3 prominent weights {w_4^b, w_6^n, w_7^n}. Two additional weights {w_2^b, w_3^n} were selected by visual inspection and experiments. These 5 weights are normalized to fall in [-1,1] and used
Fig. 3. A BCC spectrum fitted by Gaussian prototypes
Fig. 4. A NOR spectrum fitted by Gaussian prototypes
as a feature vector in the subsequent classification. Table 2 shows the protein bands corresponding to these weights.

Table 2. Protein bands corresponding to the 5 coefficients

Weight  Raman frequency (cm−1)  Vibrational description
w7n     1658                    Amide I mode
w6n     1588                    Amide I mode
w4b     1457                    Lipid and protein mode
w2b     1340                    Amide III mode
w3n     1317                    Amide III mode

The MLP is a powerful and flexible classifier, since it can adapt to arbitrarily complex posterior probability functions [7]. Each layer has several processing units, called nodes or neurons, which are generally nonlinear, except for the input nodes, which are simple bypass buffers. The unit operation is characterized by the equation o_k = f(net_k). The input to the unit, net_k, and the activation function f(net_k) are given by

net_k = Σ_i w_ik o_i + bias_k,   f(net_k) = 2/(1 + e^{−2 net_k}) − 1.

Since there are only two distinct classes, one output unit was used. MLP models were trained to output −1 for the NOR class and +1 for the BCC class using the back-propagation algorithm. The performance of the MLP varies with the initial conditions, so experiments were carried out 20 times and the results were averaged. For classification, the output value is hard-limited to give a classification result.

The SVM is also a powerful methodology for solving nonlinear classification problems. In the simplest pattern recognition tasks, an SVM uses a linear separating hyperplane to create a classifier with a maximal margin. In cases where the given classes cannot be linearly separated in the original input space, the SVM first nonlinearly transforms the original features into a higher-dimensional feature space. This transformation can be achieved by various nonlinear mappings: polynomial, sigmoid, and radial basis function (RBF). After the nonlinear transformation, a linear optimal separating hyperplane can easily be found. The resulting hyperplane is optimal in the sense of being a maximal-margin classifier with respect to the training data [8]. In the experiments, we used the following Gaussian RBF kernel:

K(x_i, x_j) = exp(−||x_i − x_j||² / σ²),
where σ² is set to 20. A least-squares SVM is used for the optimization, with the regularization parameter set to 1 [9]. The 216 spectra were divided into two groups, a training set and a test set: the data from 9 patients were used as the training set and the data from the one remaining patient were used as the test set. Once classification is complete, the data from another patient are removed from the training set and used as the new test data, while the previous test data rejoin the training set. In this way, the data from every patient were used as a test set. The average numbers of BCC and NOR spectra in the test set are 8 and 14, and those in the training set are 68 and 126, respectively.

The classification results with the whole 12 weights are summarized in Table 3. The number of hidden units in the MLP was set to 6. In the table, we can see that the average sensitivity is about 94.1% while the average specificity is about 95.4%. The results are comparable to those in [5] and confirm that the Gaussian prototype fitting gives discriminating features. The experimental results with the 5 selected weights are summarized in Table 4, where the number of hidden units in the MLP was set to 5. According to these results, the average sensitivity is about 96.8% and the average specificity is about 98.3%. The increased classification rates indicate that the selected weights are the most discriminating and that the Gaussian prototype fitting method is very effective. In terms of biomarker discovery, the selected weights can be considered as potential markers with which the abnormal cases (BCC) can be distinguished from the normal cases.

Table 3. Classification results with the whole 12 weights (%)

                             MLP            SVM
                          BCC    NOR     BCC    NOR
Decision of a     BCC     94.7   5.3     93.4   6.6
pathologist       NOR     5.0    95.0    4.3    95.7

Table 4. Classification results with the 5 selected weights (%)

                             MLP            SVM
                          BCC    NOR     BCC    NOR
Decision of a     BCC     96.1   3.9     97.4   2.6
pathologist       NOR     2.1    97.9    1.4    98.6
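A rough sketch of the leave-one-patient-out protocol described in this section is given below. scikit-learn's SVC (RBF kernel, gamma = 1/σ²) is used here as a stand-in for the least-squares SVM of [9]; the feature matrix, labels, patient identifiers and the C value are placeholders, not values taken from the paper.

```python
import numpy as np
from sklearn.svm import SVC

def leave_one_patient_out(X, y, patients, sigma2=20.0):
    """X: (n_spectra, 12) feature matrix, y: 1 for BCC / 0 for NOR, patients: patient id per spectrum."""
    preds = np.empty_like(y)
    for p in np.unique(patients):
        test = patients == p                         # all spectra of one patient form the test set
        clf = SVC(kernel="rbf", gamma=1.0 / sigma2, C=1.0)   # C=1.0 is a placeholder setting
        clf.fit(X[~test], y[~test])
        preds[test] = clf.predict(X[test])
    tp = np.sum((preds == 1) & (y == 1)); fn = np.sum((preds == 0) & (y == 1))
    tn = np.sum((preds == 0) & (y == 0)); fp = np.sum((preds == 1) & (y == 0))
    sensitivity = tp / (tp + fn)                     # accuracy on BCC spectra
    specificity = tn / (tn + fp)                     # accuracy on NOR spectra
    return sensitivity, specificity
```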
4 Conclusion
In this paper, we propose a method that can identify which bands of the spectra are significant for the detection of BCC. To this end, we built Gaussian prototypes located at the typical peak positions of BCC and NOR spectra. Every spectrum is then approximated by a linear combination of the Gaussian prototypes. Using a decision tree and visual inspection, we selected the 5 significant prototypes among the 12, and the 5 weights associated with them are used as an input feature vector. The experiments involving 216 confocal Raman spectra showed a sensitivity of 97.4% in the case of the SVM, which confirms that the peak regions corresponding to the selected weights are the most discriminating and that the proposed Gaussian fitting method is very effective.
Acknowledgments The authors are grateful to Prof. Jaebum Choo, Hanyang University, Ansan, Korea, for providing the valuable data.
References
[1] Nijssen, A., Schut, T.C.B., Heule, F., Caspers, P.J., Hayes, D.P., Neumann, M.H., Puppels, G.J.: Discriminating Basal Cell Carcinoma from its Surrounding Tissue by Raman Spectroscopy. Journal of Investigative Dermatology 119 (2002) 64–69
[2] Choi, J., Choo, J., Chung, H., Gweon, D.G., Park, J., Kim, H.J., Park, S., Oh, C.H.: Direct Observation of Spectral Differences between Normal and Basal Cell Carcinoma (BCC) Tissues using Confocal Raman Microscopy. Biopolymers 77 (2005) 264–272
[3] Sigurdsson, S., Philipsen, P.A., Hansen, L.K., Larsen, J., Gniadecka, M., Wulf, H.C.: Detection of Skin Cancer by Classification of Raman Spectra. IEEE Trans. on Biomedical Engineering 51 (2004) 1784–1793
[4] Nunes, L.O., Martin, A.A., Silveira Jr, L., Zampieri, M., Munin, E.: Biochemical Changes between Normal and BCC Tissue: a FT-Raman Study. Proceedings of the SPIE 4955 (2003) 546–553
[5] Baek, S.J., Park, A., Kim, J.Y., Na, S.Y., Won, Y., Choo, J.: Detection of Basal Cell Carcinoma by Automatic Classification of Confocal Raman Spectra. LNBI 4115 (2006) 402–411
[6] Baek, S.J., Park, A., Kim, D., Hong, S.H., Kim, D.K., Lee, B.H.: Screening of Basal Cell Carcinoma by Automatic Classifiers with an Ambiguous Category. LNCIS 345 (2006) 488–496
[7] Gniadecka, M., Wulf, H., Mortensen, N., Nielsen, O., Christensen, D.: Diagnosis of Basal Cell Carcinoma by Raman Spectra. Journal of Raman Spectroscopy 28 (1997) 125–129
[8] Kecman, V.: Learning and Soft Computing. The MIT Press (2001)
[9] Suykens, J.A.K., Gestel, T.V., Brabanter, J.D., Moor, B.D., Vandewalle, J.: Least Squares Support Vector Machines. World Scientific, Singapore (2002)
Prediction of Helix, Strand Segments from Primary Protein Sequences by a Set of Neural Networks

Zhuo Song, Ning Zhang, Zhuo Yang, and Tao Zhang*

Key Lab. of Bioactive Materials, Ministry of Education, and College of Life Science, Nankai University, Tianjin 300071, PR China
[email protected]
Abstract. In the prediction of the secondary structure of proteins there are always some suspected segments. These suspected segments confuse people and lower the accuracy of prediction methods. To deal with this problem, a set of neural networks (NNs) is built based on helix, strand and coil segments selected from the PDB. The performance of these NNs on the training data is, unsurprisingly, nearly perfect. However, the prediction on the test data is not good enough, because the training data lack representativeness. The results support the fact that close neighbor vectors have similar NN outputs. One can improve the representativeness of the training data without enlarging the data scale by selecting fewer data from dense regions and more from sparse regions, provided the distribution of the sample data is known.
1 Introduction

Prediction of the 3D structure from the primary protein sequence is an attractive problem in bioinformatics. If we believe that the secondary structures of a protein determine its 3D structure, prediction of secondary structure becomes an important issue. Since the late 1970s many computational methods have been developed. The data they use include primary amino acid (AA) sequences [1], 3-51 neighboring AAs through a moving window [2], chemical properties of AAs [3], and alignments of sequences in structure-known protein databases that match the query sequences [4]. However, the highest accuracy of these methods is less than eighty percent. Generally, there are 8 types of secondary structure in proteins: H (alpha-helix), G (3-helix/3_10-helix), I (5-helix/π-helix), B (residue in isolated beta-bridge), E (extended strand), T (hydrogen-bonded turn), S (bend), and "_" (any other structure). These secondary structures can be simplified into three groups: helix (H, i.e., "H" and "G"), strand (E, i.e., "E" and "B") and coil (C, all remaining types) [5]. Using the symbols H, E and C, primary protein sequences can be transferred to their corresponding secondary structure sequences, of the form …CCHHHHCCEEEEECCC…. Based on existing knowledge and prediction methods, however, there are always some segments which look like either (or both) H or (and) E. When a segment is
* Corresponding author.
supposed to be one kind of secondary structure but it is in doubt whether it is H or E, the segment becomes suspected. To determine the types of such suspected segments, a novel method based on neural networks has been developed.
2 Method

Architecture of the novel method
Helix, strand and coil segments are selected from a structure-known protein sequence database. The two ends of the segments are treated in different ways according to the types they belong to. For each protein segment, a 40-dimensional vector is used to describe the composition proportions and the position order of the 20 kinds of amino acid residues in the segment. The ratio of the principal structure content in a segment is its NN output value. To use the learning ability of NNs and limit their disadvantages, a set of six NNs is designed. For example, the NN of "H" is trained only by vectors of H whose output values are not less than 0.8; the NN of "HE" is trained by both H and E, in which E's value is defined as 0; the NN of "HEC" is trained by H, E and C, in which E's value is 0 and C's value is 0.5. If the three NNs work well, they should act in the same way: when helix vectors are input, they should return values close to 1. Besides, for the NN of "HE", when E vectors are input the returned value should be close to 0. Fig. 1 shows a detailed architecture of the method.
Fig. 1. A detailed architecture of this prediction method
DATASET

Selection of secondary structure segments from proteins
Protein data are selected from the ASTRAL SCOP 1.69 genetic domain sequence subsets based on PDB SEQRES records, with less than 40% identity to each other (http://astral.berkeley.edu/) [6]. Domain segments in the above subsets are removed and
only intact protein chains are selected, which belong to different structural classifications (all alpha, all beta, alpha/beta, alpha+beta, etc.). The position information of all secondary structures in the proteins is obtained from their PDB files (*.pdb). According to the PDB Contents Guide Ver. 2.1 (http://www.rcsb.org/pdb/docs/format/pdbguide2.2/part_2.html), proteins whose PDB files have format faults are removed too. Based on these conditions, 1820 proteins are selected and all their secondary structure segments are finally picked out.

Define the output value of secondary structure segments used in NNs
In practical applications, the positions of the two ends of long secondary structures cannot be determined exactly. To partly solve this problem, every H or E segment used contains one or two more AA residues on each side if the added AA residues belong to coil. To ensure that the principal structure content of these segments is more than 0.8, H and E are treated as follows: let L be the length of the segment; if 8 <= L <= 15, add one more AA residue on each side of the segment; if L >= 16, add two more AA residues on each end of the segment. For C, segments longer than 12 are picked out, and two AA residues are removed from them to make sure that the remaining length is longer than 8. Finally, 9403 H, 2999 E and 2069 C segments are used to build and test the NNs.

40D vectors of secondary structure segments used in NNs
The 40D vector of a secondary structure segment contains information on the component proportions (dimensions 1-20) and the position order (dimensions 21-40) of the 20 kinds of AA residues, respectively, according to their alphabetical order. No two different protein segments have the same vector (details and proof: see [6]). Dimensions 1-20 consist of the counts of the 20 AAs over all positions of the segment, normalized by dividing by the length of the sequence; the sum of all 1-20D values equals one. For dimensions 21-40, each dimension is the sum of the position numbers of its corresponding AA along the segment (position numbers start from 1), normalized by dividing by the length of the sequence.

NEURAL NETWORKS
The universal approximation theorem for NNs states that every continuous function which maps intervals of real numbers to some output interval of real numbers can be approximated arbitrarily closely by a multi-layer feed-forward network with one hidden layer [7]. So such networks are used to approximate the above functions.

Topology and transfer function of NNs
The network topology consists of 40 input nodes (corresponding to the 40D vectors), 81 hidden nodes and 1 output node. A non-standard transfer function is used in these neural networks. Traditionally, the multi-layer perceptron NN uses the s-form transfer function f(x) = e^x / (1 + e^x), but here g(x) is used, which is derived by integration of the uniform distribution p(x) ~ U(−a, a):

g(x) = ∫_{−∞}^{x} p(t) dt,

and based on experiments a = 6 is set.
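For concreteness, the 40-dimensional segment encoding described in the previous subsection can be computed as follows; the one-letter amino-acid alphabet and the example segment are illustrative assumptions rather than details taken from the paper.

```python
AA = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard amino acids, one-letter codes in alphabetical order

def encode_40d(segment):
    """40-D vector of an AA segment: composition (dims 1-20) and normalized position sums (dims 21-40)."""
    L = len(segment)
    comp = [0.0] * 20
    pos = [0.0] * 20
    for i, aa in enumerate(segment, start=1):   # position numbers start from 1
        k = AA.index(aa)
        comp[k] += 1.0 / L                      # normalized residue count
        pos[k] += i / L                         # normalized sum of position numbers
    return comp + pos

vec = encode_40d("AEEALKKAIEEL")                # a short, made-up helix-like segment
assert abs(sum(vec[:20]) - 1.0) < 1e-9          # dims 1-20 sum to one, as stated above
```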
Selection of training datasets for NNs
A set of six NNs needs to be built. They are denoted H (H>=0.8), HE (H>=0.8; E=0), HEC (H>=0.8; E=0; C=0.5), E (E>=0.8), EH (E>=0.8; H=0), and EHC (E>=0.8; H=0; C=0.5). The segments of H, E and C come from the 9403 helix, 2999 strand and 2069 coil segments, respectively. According to the Euclidean distance of these 40D vectors, vectors in the 40-dimensional space whose distances are not less than 2.5 are selected. In the training datasets, for example in HE, the distance from every vector in H to E is not less than 2.5; for HEC, the H and E data are the same as in HE and the distance from every vector in C to both H and E is not less than 2.5. Finally, 2882 H, 1064 E and 101 C segments are picked out to constitute the training sets of the six NNs. Thus H is a subset of HE, HE is a subset of HEC, E is a subset of EH, and EH is a subset of EHC. Segments of 6521 H, 1935 E and 1968 C remain as test data. The training dataset is too large to do jackknife tests.

Test method using sets of NNs
Test rules for H and E based on two types of 4-NN combinations are used. One consists of H, HE, HEC and EH, and the other of E, EH, EHC and HE. If H segments are input, the output values of the NNs of H, HE and HEC should be basically equal, and the returned value of EH should be near 0. Similarly, if E segments are input, the values of E, EH and EHC should be basically equal and the output of HE should be near 0. The degree of equality is calculated by using the training data to test the trained NNs. The mean value plus 3 times the standard deviation is used to eliminate calculation errors.
3 Results and Discussion

3.1 The Accuracy of the Trained NNs (Shown in Table 1)

Table 1. The accuracy of the six trained NNs
                      H        HE       HEC      E        EH       EHC
Average square error  0.00015  0.00033  0.00158  0.00039  0.00116  0.00279
Average accuracy      98.8%    98.2%    96%      98%      96.6%    94.7%
3.2 Use Training Data to Test the NNs and Determine the Rules of H and E
For the training data, the absolute values of the differences between group outputs were calculated under the trained NNs, namely H and HE, HE and HEC, HEC and H, E and EH, EH and EHC, and EHC and E. The values from EH and HE were also calculated. The means plus or minus 3 times the standard deviations were used to create the rules for determining H and E segments (shown in Table 2).

Table 2. Rules for determining H and E

H rule   H-HE: 0.0128 ± 0.0264   HE-HEC: 0.0125 ± 0.0292   HEC-H: 0.0114 ± 0.0355   EH: 0.0003 ± 0.0089
E rule   E-EH: 0.0205 ± 0.0262   EH-EHC: 0.0048 ± 0.021    EHC-E: 0.041 ± 0.1889    HE: 0.0009 ± 0.0172

3.3 Training Data and Test Data Under the Rules of H and E
Table 3 shows the accuracy on the training data under the rules of H and E. The results show that neither the 'H rule' nor the 'E rule' can pick out all of the real H or E segments, and the H and E segments they do select are not always true. This is partly because the selected training data cannot represent all the characteristics of H and E. NNs cannot know what they have not learned, so in the distance space there must be some singular vectors that are not contained in the training data. This can be partly solved by enlarging the training dataset, but then the time and accuracy of training become a problem. To resolve the dilemma, it is important to know whether, if a vector belongs to H or E, its close neighbor vectors belong to H or E as well. If this is true, and if the distribution of the data in a certain space is known, then by selecting less data in dense regions and more data in sparse regions, the representativeness of the data can be improved without enlarging the dataset. The following evidence supports this conjecture.

Table 3. The accuracy of data under the rules of H and E
                         H rule           E rule
Training helix (2882)    2764 (95.9%)     0 (0%)
Training strand (1064)   0 (0%)           999 (93.9%)
Test helix (6521)        1360 (20.9%)     744 (11.4%)
Test strand (1935)       105 (5.4%)       547 (28.3%)
Table 4. Average Euclidean distance between two groups

                              Training H               Training E
Test H_H rule (N=1360)        1.54 ± 0.28 (HH/H)*      /
Test H_H rule_No (N=5161)     1.98 ± 0.34 (HH_No/H)    /
Test H_E rule (N=744)         2.36 ± 0.24 (HE/H)       1.43 ± 0.20 (HE/E)*
Test E_E rule (N=547)         /                        1.37 ± 0.18 (EE/E)*
Test E_E rule_No (N=1388)     /                        1.76 ± 0.32 (EE_No/E)
Test E_H rule (N=105)         1.46 ± 0.22 (EH/H)*      2.34 ± 0.31 (EH/E)

"Test H_H rule" means segments selected from test H by the rule of H; "Test H_H rule_No" means segments that are not selected from test H by the rule of H. Others have similar meanings. The smaller distance of each compared pair of groups is marked with *.
3.4 Euclidean Distance of Data
The Euclidean distance of a vector to a group of vectors should be examined to support the above conjecture. The distance of a vector to a group of vectors is defined as the minimum distance of the vector to all vectors in the group. For example, if the conjecture
Fig. 2. Average Euclidean distances of two groups of data. **P < 0.0001. The number of each group is shown in Table 4. ■: smaller distance of the two groups; □: larger distance of the two groups.
is true, every vector in test H that is correctly selected by the rule of H will be close to the training H data, and then the mean distance of those correctly selected vectors in test H to training H should be small. In contrast, the mean distance of the vectors in test H that are not correctly selected under the rule of H should be larger than the former.
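A small sketch of the distance computation used in this comparison, following the definition above: the distance from a vector to a group is the minimum Euclidean distance to any member, and the figures in Table 4 are means of these minima over a test group. Function names are illustrative.

```python
import numpy as np

def dist_to_group(v, group):
    """Minimum Euclidean distance from a 40-D vector v to any vector in `group` (n x 40 array)."""
    return np.min(np.linalg.norm(group - v, axis=1))

def mean_dist(test_vectors, training_group):
    """Mean over the test vectors of their minimum distance to the training group."""
    return np.mean([dist_to_group(v, training_group) for v in test_vectors])
```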
Acknowledgements This work was supported by grants from the NBRPC (2004CB720300) and the NSFC (30470453, 30370386, 30640037).
References
1. Chou, P.Y., Fasman, G.D.: Prediction of the Secondary Structure of Proteins from Their Amino Acid Sequence. Adv Enzymol 47 (1978) 45-148
2. Rost, B., Sander, C.: Combining Evolutionary Information and Neural Networks to Predict Protein Secondary Structure. Proteins 19 (1) (1994) 55-72
3. Andrew, C.D., Penel, S., Jones, G.R., Doig, A.J.: Stabilizing Nonpolar/Polar Side-chain Interaction in the Alpha-helix. Proteins 45 (5) (2001) 449-55
4. Rost, B.: Protein Secondary Structure Prediction Continues to Rise. J Struct Biol 134 (2-3) (2001) 204-218
5. Ganapathiraju, M.K., Klein-Seetharaman, J., Balakrishnan, N., Reddy, R.: Characterization of Protein Secondary Structure. IEEE Signal Process Mag 21 (3) (2004) 78-87
6. Hornik, K., Stinchcombe, M., White, H.: MLP's are Universal Approximators. Neural Netw. 2 (1989) 359-66
7. Chandonia, J.M., Hon, G., Walker, N.S., Lo Conte, L., Koehl, P., Levitt, M., Brenner, S.E.: The ASTRAL Compendium in 2004. Nucleic Acids Research 32 (2004) D189-D192
A Novel EPA-KNN Gene Classification Algorithm

Haijun Wang 1,3, Yaping Lin 1,2, Xinguo Lu 1, and Yalin Nie 1

1 School of Computer and Communication, Hunan University, Changsha 410082, China
2 School of Software, Hunan University, Changsha 410082, China
3 School of Science, Henan University of Science and Technology, Luoyang 471003, China
[email protected]
Abstract. Accurate classification of samples using gene expression profiles is very important in cancer detection and treatment. In this paper, a novel EPA-KNN (Emerging Patterns Advanced - K Nearest Neighbors) gene classification algorithm is proposed. Bayes estimation is applied in the computation of entropy to improve its reliability, and the RCP (Random Cut Point) is presented to strengthen the generalization to unknown test samples. With these improvements, the EPAs are acquired. Then an EPA-based classifier is constructed, inspired by KNN. The experimental results show that the new algorithm is feasible and effective.
1 Introduction
Gene expression profiles based on microarrays record the expression levels of genes under specific experimental conditions, which provides a new approach to studying the relationship between cancer and genes [1]. The expression levels of all genes are recorded under each condition. However, in the high-dimensional gene data, only a few genes, which are called feature genes, are related to the cancer subtype. A small set of feature genes is essential for the development of inexpensive diagnostic tests [2]. Clustering is widely used in gene classification, for instance K-means, SOM, etc. [3]. However, being an unsupervised learning method, clustering has no mechanism for performing feature selection. A better approach is to use supervised classification. Khan used artificial neural networks to detect cancer [4]. Terrence et al. obtained good performance on a leukemia dataset using SVM [5]. However, these algorithms cannot discover biological gene rules, so the intrinsic relationships between cancer and genes cannot be explored clearly by biological researchers. Recently a new concept, emerging patterns (EPs), has been proposed by Li [6]. EPs are gene rules with a biological explanation, which can capture significant differences between two classes of things (e.g., between edible and poisonous mushrooms, between normal tissues and cancer tissues, and so on). The more significant the difference between classes is, the stronger the ability to distinguish cancer types.
In this paper, we propose a novel EPA-KNN gene classification algorithm, in which Bayes estimation is introduced into the computation of entropy to measure a cut point's ability to classify. In Li's method, the frequency of samples belonging to class Ci in a set Sj is used as an estimate of the corresponding probability. Aiming to reduce the deviation between frequency and probability for small samples, some virtual samples are added through Bayes estimation to enlarge the sample size. Also, in [6] a cut point is set as the average of some two adjacent gene expression values. In order to strengthen the generalization to unknown test samples, we propose a new concept, the RCP (Random Cut Point), which follows a uniform distribution. Then, we construct a new EPA-KNN classifier to predict cancer subtypes. Experimental results show that the new method can classify almost all samples from the two subtypes of human acute leukemia, ALL (acute lymphoblastic leukemia) and AML (acute myeloid leukemia), and that the new method is feasible and effective. The remainder of this paper is organized as follows. Section 2 introduces the concept of emerging patterns. In Section 3 we outline the ALL/AML dataset. Section 4 describes the discovery of EPAs and intermediate experimental results. Section 5 provides the EPA-KNN gene classification algorithm. Section 6 presents the results and analysis. Section 7 concludes the paper.
2 Emerging Patterns
Definition 1. gene_i@[c, d] is an item, which means that gene_i's expression value lies in the interval [c, d].

Definition 2. A pattern X is a set of several items, for example X = {gene_{i1}@[a_{i1}, b_{i1}], gene_{i2}@[a_{i2}, b_{i2}], …, gene_{ik}@[a_{ik}, b_{ik}]}, where i_t ≠ i_s for t ≠ s, 1 ≤ t, s ≤ k, if k > 1.

Definition 3. The support of a pattern X in a dataset D, denoted supp_D(X), is defined as supp_D(X) = count_D(X)/|D|, where count_D(X) is the number of elements in D matching the pattern X.

Definition 4. The growth rate of a pattern X from D1 to D2, denoted GrowthRate(X), is defined as

GrowthRate(X) = 0,                        if supp_D1(X) = 0 and supp_D2(X) = 0,
                ∞,                        if supp_D1(X) = 0 and supp_D2(X) ≠ 0,
                supp_D2(X)/supp_D1(X),    otherwise.
Definition 5. Given ρ > 1 as a growth-rate threshold, a pattern X is said to be a ρ-emerging pattern (ρ-EP, or simply EP) from D1 to D2 if GrowthRate(X) ≥ ρ.
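Definitions 3-5 translate directly into code. In the sketch below each sample is represented as the set of item indices it satisfies (as in Table 3); this representation is an assumption made for the example rather than the authors' implementation.

```python
def support(pattern, dataset):
    """Fraction of samples in `dataset` (a list of item-index sets) that contain `pattern` (a set)."""
    count = sum(1 for sample in dataset if pattern <= sample)
    return count / len(dataset)

def growth_rate(pattern, d1, d2):
    """Growth rate of `pattern` from d1 to d2 (Definition 4)."""
    s1, s2 = support(pattern, d1), support(pattern, d2)
    if s1 == 0 and s2 == 0:
        return 0.0
    if s1 == 0:
        return float("inf")
    return s2 / s1

def is_ep(pattern, d1, d2, rho):
    """Definition 5: X is a rho-EP from d1 to d2 if its growth rate is at least rho."""
    return growth_rate(pattern, d1, d2) >= rho
```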
3 The ALL/AML Dataset
The human acute leukemia (ALL/AML) gene expression profiles1 consist of 47 ALL samples and 25 AML samples. Each sample contains 7129 genes. The whole dataset is separated into training samples and test samples. Training samples include 27 ALL samples and 11 AML samples, and test samples include 20 ALL samples and 14 AML samples.
4 Emerging Patterns Advanced

4.1 Select Discriminatory Genes and Corresponding Cut Points
An entropy-based discretization method [7] uses entropy to select the important features. This method can remove many noisy features and produce items from the remaining features. Let T partition the set S of samples into the subsets S1 and S2. P(Ci, Sj) is the frequency of samples belonging to class Ci in Sj. The class entropy of a subset Sj, j = 1, 2, is defined as

Ent(Sj) = − Σ_{i=1}^{k} P(Ci, Sj) log P(Ci, Sj).    (1)
Suppose the subsets S1 and S2 are induced by partitioning a feature A at cut point T. Then the CIE (class information entropy) of the partition, denoted E(A, T; S), is given by

E(A, T; S) = (|S1|/|S|) Ent(S1) + (|S2|/|S|) Ent(S2).    (2)
The Minimal Description Length Principle is used to stop the process of finding a gene's cut points. Recursive partitioning within a set of values S stops iff

Gain(A, T; S) < log2(N − 1)/N + δ(A, T; S)/N,    (3)

where N is the number of values in the set S, Gain(A, T; S) = Ent(S) − E(A, T; S), δ(A, T; S) = log2(3^k − 2) − [k·Ent(S) − k1·Ent(S1) − k2·Ent(S2)], and ki is the number of class labels represented in the set Si. Using this discretization method, 866 of the 7129 features in the ALL/AML dataset are discriminatory genes [8]. According to Equations (1, 2), the smaller E(A, T; S) is, the more discriminatory the cut point T is. In the ideal situation, when T completely separates the two classes, E(A, T; S) reaches its minimal value 0. P(Ci, Sj) in Equation (1) is a frequency, not a probability, but entropy is strictly defined by probability. Bernoulli's law of large numbers tells us that frequency converges to probability when the
http://sdmc.lit.org.sg/GEDatasets/Datasets.html#ALL-AML Leukemia
sample size is infinite; however, there exists a deviation between frequency and probability when the sample size is small. To make P(Ci, Sj) closer to the real probability that a sample in Sj belongs to class Ci, Bayes estimation is introduced, namely m-estimation, as follows:

P(Ci, Sj) = (nij + mp) / (nj + m),    (4)
where nij is the number of samples belonging to class Ci in Sj and nj is the sample size of Sj; m is the number of virtual samples. Ideally p would be the real probability, but a typical choice of p is the uniform one when information is insufficient. Equation (4) can be seen as adding m virtual samples following the distribution p to expand the actual samples. m should not be too large, as that would make P(Ci, Sj) depend on p excessively, so we set m = |Sj|/4 and p = 1/2. Bayes estimation can relieve the problem caused by small samples and improve the reliability of the entropy. For example, assuming that cancer and normal samples occur with the same probability, if we sample 38 times independently, the expected number of cancer samples is 19, and the same for normal samples. But in reality it is possible that samples of one class occur more often than those of the other because of experimental conditions and so on; under extreme conditions, for example, cancer samples may occur more than 30 times and normal samples fewer than 8 times. Such an occurrence is especially likely with small samples, and it causes a deviation in the probability estimate, so that the entropy cannot measure the true classification information.

According to Equations (1)-(4), Bayes estimation is introduced into the computation of entropy in this paper, and 857 of the 7129 features in the ALL/AML dataset are selected as discriminatory genes by our method. Although only 857 genes are left, the computation is still huge, so we focus only on the 25 genes with the smallest CIE. Table 1 lists these 25 genes and makes a comparison with the 25 genes selected without Bayes estimation.

Table 1. The 25 discriminatory genes with the smallest CIE and their corresponding cut points, selected by the entropy-based discretization method with and without Bayes estimation

Out list  Accession number  Cut point  Accession number (Bayes)  Cut point (Bayes)
1         X95735            994.0      X95735                    994.0
2         M55150            1346.0     M55150                    1346.0
3         M31166            83.5       M31166                    83.5
4         M27891            1419.5     M27891                    1419.5
5         X70297            339.0      X70297                    339.0
6         P31483            80.5       P31483                    80.5
7         L09209            992.5      L09209                    992.5
8         U46499            156.5      U46499                    156.5
9         M16038            651.5      D88422                    658.0
10        M92287            1869.5     M21551                    398.5
11        D14874            185.0      M23197                    401.5
12        U50136            1341.0     U46751                    2909.5
13        U22376            1423.0     U50136                    1987.0
14        M27783            197.5      Y12670                    911.0
15        D88422            658.0      M27783                    357.5
16        M21551            398.5      M83652                    541.5
17        M23197            401.5      M98399                    245.0
18        U46751            2909.5     M54995                    156.5
19        Y12670            911.0      U02020                    381.5
20        M83652            541.5      M31523                    415.5
21        M98399            245.0      D14874                    185.0
22        M54995            156.5      M16038                    651.5
23        U02020            381.5      M92287                    1869.5
24        M31523            415.5      U22376                    1423.0
25        M81933            128.0      J05243                    245.0

The cut points separate the 25 discriminatory genes into a total of 50 intervals; accordingly, 50 items are involved. We index the 50 items, and the two intervals of the i-th gene are indexed as the (i*2−1)-th and (i*2)-th items, respectively. This indexing is convenient for reading and writing emerging patterns; for example, the pattern {2} represents {geneX95735@[994.0, +∞)}. From Table 1, referring to the genes and corresponding cut points selected by the entropy-based discretization method without Bayes estimation, we find that 88% of the genes and corresponding cut points are unchanged; these genes only change slightly in their CIE ranking because of Bayes estimation. For the strongest classification, namely the top 8 discriminatory genes and corresponding cut points, the two methods achieve 100% agreement. The remaining 12% of the discriminatory genes and corresponding cut points are changed. Table 2 lists the comparison of these genes and corresponding cut points based on the CIE on the test samples.

Table 2. Comparison of the CIE of discriminatory genes and corresponding cut points on the test samples (* marks genes and corresponding cut points selected with Bayes estimation)

                      Distribution in test samples
                      < cut point         ≥ cut point
Gene (cut point)      ALL      AML        ALL      AML        CIE
U50136 (1341.0)       18/25    7/25       2/9      7/9        0.57621
U50136 (1987.0)*      20/27    7/27       0/7      7/7        0.45446
M27783 (197.5)        11/13    2/13       9/21     12/21      0.58595
M27783 (357.5)*       12/16    4/16       8/18     10/18      0.62831
M81933 (128.0)        5/14     9/14       15/20    5/20       0.59916
J05243 (245.0)*       1/10     9/10       19/24    5/24       0.45684

From Table 2, we can find that the CIE of geneU50136 at cut point 1341.0 is 26.8% bigger than the CIE of the same gene at cut point 1987.0. geneM81933 and geneJ05243 are both ranked 25th in Table 1, but the CIE of geneM81933 is 31.2% bigger than that of geneJ05243. The CIE of geneM27783 at cut point 197.5 is 6.7% smaller than the CIE of the same gene at cut point 357.5. Taken together,
the Bayes estimation is effective in improving the selection of discriminatory genes and corresponding cut points. Bayes estimation improves the computation of entropy, but in the discovery of cut points the usual cut point is simply the mean of two adjacent gene expression values when the gene's expression profile is sorted. In reality, it is not certain that the best cut point is exactly the mean of some two adjacent gene expression values; a point near the mean may be better than the mean for the classification of test samples. Based on this, we make an assumption: suppose one gene's expression values are sorted as genesort(1) < genesort(2) < … < genesort(n), and the original cut point is cutpoint, where genesort(i) < cutpoint < genesort(i+1), 1 ≤ i ≤ (n−1). Then the cut point in this paper becomes a random variable, namely cutpoint', which is subject to the uniform distribution on the interval (genesort(a), genesort(b)), where a = max{1, (i−1)}, b = min{(i+2), n}. We define cutpoint' above as the RCP (Random Cut Point). Based on the last two columns of Table 1, the original cut points are transformed into RCPs according to this idea, and then we test the performance on the test samples. Fig. 1 illustrates the results of one experiment.
Fig. 1. Original cut point and RCP’s comparison of CIE in test samples
An RCP is a random variable that varies within a certain interval, and the original cut point can be seen as a special case of the RCP. From Fig. 1, compared with the original cut points, we find that the RCP makes the CIE of 40% of the genes smaller, leaves 48% equal, and makes 12% slightly bigger. The RCP is thus feasible and effective. We have now discussed the selection of discriminatory genes and corresponding cut points; the introduction of Bayes estimation and the RCP indeed improves the performance. The selection is more reasonable, and the generalization to test samples and the robustness are strengthened too. In the remainder of the paper, EPs based on the entropy-based discretization method [7] combined with Bayes estimation and the RCP are called EPAs.
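The two modifications of this subsection, the m-estimated class probabilities of Eq. (4) inside the class entropy of Eq. (1) and the RCP drawn uniformly around an original cut point, can be sketched as follows. The two-class setting, 0-based indexing and helper names are assumptions made for the example.

```python
import math
import random

def class_entropy(labels, m=None, p=0.5):
    """Class entropy of a subset (Eq. 1) with the m-estimated probabilities of Eq. (4); two classes 0/1."""
    n = len(labels)
    if m is None:
        m = n / 4.0                                   # m = |S_j| / 4, as chosen in the paper
    ent = 0.0
    for c in (0, 1):
        prob = (labels.count(c) + m * p) / (n + m)    # Bayes (m-) estimate of P(C_i, S_j)
        if prob > 0:
            ent -= prob * math.log2(prob)
    return ent

def random_cut_point(sorted_values, i):
    """RCP: a uniform draw on (genesort(a), genesort(b)) around the original cut point,
    which lies between sorted_values[i] and sorted_values[i + 1] (0-based index i)."""
    n = len(sorted_values)
    a = max(0, i - 1)
    b = min(i + 2, n - 1)
    return random.uniform(sorted_values[a], sorted_values[b])
```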
4.2 Discovery of EPA
When the discriminatory genes and corresponding cut points are determined, the EPAs are discovered by a border-based algorithm, MBD-LLBORDER [9]. The main idea of MBD-LLBORDER is as follows. First, compute the following sets: LARGEBORDER_δ(D1) = {X | X is an EPA and supp_D1(X) ≥ δ} and LARGEBORDER_θ(D2) = {X | X is an EPA and supp_D2(X) ≥ θ}; then MBD-LLBORDER(LARGEBORDER_δ(D1), LARGEBORDER_θ(D2)) returns all ρ-EPAs, where ρ = θ/δ.
Fig. 2. Principle of MBD-LLBORDER algorithm
For the two subtypes of human acute leukemia, ALL and AML, we discover all 1-EPAs based on the discriminatory genes and corresponding cut points in the last two columns of Table 1. There are a total of 17065 EPAs from AML to ALL, and 5654 EPAs in the opposite direction. Table 3 presents the top 20 EPAs for both subtypes. From the first row of Table 3, we find that geneX95735 contributes strongly to classification: geneX95735 at cut point 994.0 can separate the two subtypes in the training samples completely. If geneX95735 alone is used to classify the 34 test samples, experiments show that it misclassifies only 3 samples (two ALL, the 15th and 17th samples; one AML, the 31st sample). This also shows that emerging patterns have a strong classification ability.
5 Classification Based on EPA

A new EPA-KNN classifier based on EPAs is proposed. Given two training datasets DP and DN and a test sample T, the main idea of EPA-KNN is as follows. First, all EPAs are discovered by MBD-LLBORDER for both the DP and DN training datasets, and every EPA takes the maximal support from the two datasets as its weight. KNN measures the distances between a test sample and training samples of different subtypes in the same space. Inspired by KNN, EPA-KNN merges the EPAs of the different subtypes into one set union, aiming to have the same offset in the measure of "distance" between the test sample T and the EPAs of the different subtypes. All EPAs in the union are sorted by their weight. Then the K EPAs nearest to T in the union are selected to predict the class of T (K is odd). The algorithm of EPA-KNN is described as follows.

1: Select discriminatory genes and corresponding cut points based on the entropy-based discretization method [7] combined with Bayes estimation
2: Transform the cut points into RCPs
Table 3. The top 20 EPAs for both subtypes, in descending order of their support in the corresponding subtype

index  EPA        ALL/AML support   EPA        ALL/AML support
1      {1}        1/0               {2}        0/1
2      {1 5}      1/0               {2 4}      0/1
3      {1 7}      1/0               {2 9}      0/1
4      {1 17}     1/0               {2 12}     0/1
5      {1 19}     1/0               {2 14}     0/1
6      {1 21}     1/0               {2 16}     0/1
7      {1 23}     1/0               {2 42}     0/1
8      {1 25}     1/0               {2 44}     0/1
9      {1 27}     1/0               {2 45}     0/1
10     {1 29}     1/0               {2 47}     0/1
11     {1 31}     1/0               {2 4 9}    0/1
12     {1 33}     1/0               {2 4 12}   0/1
13     {1 5 7}    1/0               {2 4 14}   0/1
14     {1 5 17}   1/0               {2 4 16}   0/1
15     {1 5 19}   1/0               {2 4 42}   0/1
16     {1 5 21}   1/0               {2 4 44}   0/1
17     {1 5 23}   1/0               {2 4 45}   0/1
18     {1 5 25}   1/0               {2 4 47}   0/1
19     {1 5 27}   1/0               {2 9 12}   0/1
20     {1 5 29}   1/0               {2 9 14}   0/1
3: Discover EPAs based on MBD-LLBORDER
4: {Discover EPAs from the DP and DN training samples}
   EPA_DP = {X | X is an EPA, and supp_DP(X)/supp_DN(X) ≥ 1}
   EPA_DN = {X | X is an EPA, and supp_DN(X)/supp_DP(X) ≥ 1}
5: {Generate the sets of triples Trip_DP and Trip_DN based on the EPAs; "." is the operator that quotes an element of a triple}
   Trip_DP = {<Y, supp, class> | Y ∈ EPA_DP, supp = supp_DP(Y), class = P}
   Trip_DN = {<Y, supp, class> | Y ∈ EPA_DN, supp = supp_DN(Y), class = N}
6: {Merge all EPAs into EPA_Sort}
   EPA_Sort = {X_i | X_i ∈ Trip_DP ∪ Trip_DN, X_i.supp ≥ X_{i+1}.supp}
7: {Initialize variables}
   Score(T)_P = 0, Score(T)_N = 0, count = 0, i = 1
8: {Find the K nearest neighbors of T; Score(T)_P and Score(T)_N are the likelihoods that T is of class P or N, respectively}
   while count < K do
     if T matches X_i.EPA (X_i ∈ EPA_Sort) then
       count = count + 1
       if X_i.class == P then
         Score(T)_P = Score(T)_P + 1
       else
         Score(T)_N = Score(T)_N + 1
       end if
     end if
     i = i + 1
   end while
9: {Predict the class of T}
   if Score(T)_P > Score(T)_N then
     T is of class P
   else
     T is of class N
   end if

EPA-KNN is modeled on KNN. The K nearest neighbors of T are the first K EPAs matched by T, and the classes of these K nearest EPAs determine the class of T. Since K is odd, when the class of T is predicted the likelihood of one class is always strictly larger than that of the other; the two likelihoods can never be equal, which avoids misclassifications caused by ties in the classification algorithm.
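Steps 5-9 of the algorithm can be condensed into a short routine. In the sketch below, EPAs are kept as (item-set, support) pairs per class, merged and sorted by support, and the first K patterns matched by the test sample vote on its class; the discretization of the test sample into item indices is assumed to have been done already, and the names are illustrative.

```python
def epa_knn_predict(test_items, epa_dp, epa_dn, k=19):
    """test_items: set of item indices describing the test sample T.
    epa_dp / epa_dn: lists of (frozenset_of_items, support) pairs for classes P and N."""
    triples = [(items, supp, "P") for items, supp in epa_dp] + \
              [(items, supp, "N") for items, supp in epa_dn]
    triples.sort(key=lambda t: t[1], reverse=True)   # EPA_Sort: descending support
    score_p = score_n = count = 0
    for items, _supp, cls in triples:
        if items <= test_items:                      # T matches this EPA
            if cls == "P":
                score_p += 1
            else:
                score_n += 1
            count += 1
            if count == k:                           # stop after the K nearest (first K matched) EPAs
                break
    return "P" if score_p > score_n else "N"
```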
6 Experimental Results
We carry out our experiments with two protocols: (1) Leave-One-Out Cross-Validation (LOOCV) classification on all samples, and (2) an Independent Test (IT) that classifies the test samples based on the training samples. The K of EPA-KNN is 19 in both protocols, and each time we select the 25 discriminatory genes with the smallest CIE and their corresponding cut points to discover EPAs. The experiment was repeated 15 times independently, and the average number of misclassifications is used to measure performance. To test whether Bayes estimation and the RCP are effective in terms of overall performance, we remove Bayes estimation and the RCP from EPA-KNN; the remainder of the EPA-KNN algorithm is called EP-KNN. Table 4 presents the performance of EPA-KNN and EP-KNN and makes a comparison with other algorithms that are good at gene classification. Compared with the other methods, our method attains competitive performance. From Table 4, we can see that Bayes estimation and the RCP improve the selection of discriminatory genes and corresponding cut points, and are especially effective for the IT experiments because of the smaller training set. Compared with Neural Networks and SVM, our method achieves better performance and also extracts biological gene rules, which are helpful for studying cancer pathology and for drug discovery.

Table 4. The comparison of experimental results

Method                 Gene number   Average number of misclassifications
                                     LOOCV        IT
EPA-KNN (EPA)          25            0.73         1.07
EP-KNN (EP)            25            1.0          2.0
Neural Networks [8]    10            1.0          2.0
SVM [5][8]             25-1000       0            2.0-4.0
7 Conclusion
In this paper, we propose a novel EPA-KNN gene classification algorithm. In the process of producing EPAs, Bayes estimation is applied to improve the reliability of the entropy for small samples by adding virtual samples, and the RCP is used to strengthen the generalization to unknown test samples. Then a new EPA-KNN classifier is proposed. Experiments show that EPA-KNN can extract biological rules and improve the cancer recognition rate efficiently. Future research will include how to apply EPs to the parallel classification of multi-class data.
References
1. Kuramochi, M., Karypis, G.: Gene Classification Using Expression Profiles: A Feasibility Study. Int. J. Artificial Intelligence Tools 14 (2005) 641-660
2. Lu, X.G., Lin, Y.P., Yang, X.L., et al.: Using Most Similarity Tree Based Clustering to Select the Top Most Discriminating Genes for Cancer Detection. Proceedings of the Eighth International Conference on Artificial Intelligence and Soft Computing, Zakopane, Poland (2006) 931-940
3. Jiang, D.X., Tang, C., Zhang, A.D.: Cluster Analysis for Gene Expression Data: A Survey. IEEE Trans. Knowledge and Data Engineering 16 (2004) 1370-1386
4. Khan, J., Wei, J.S., Ringner, M., et al.: Classification and Diagnostic Prediction of Cancers Using Gene Expression Profiling and Artificial Neural Networks. Nature Medicine 7 (2001) 673-679
5. Furey, T.S., Cristianini, N., Duffy, N., et al.: Support Vector Machine Classification and Validation of Cancer Tissue Samples Using Microarray Expression Data. Bioinformatics 16 (2000) 906-914
6. Dong, G.Z., Li, J.Y.: Mining Border Descriptions of Emerging Patterns from Dataset Pairs. Knowledge and Information Systems 8 (2005) 178-202
7. Li, J.Y., Wong, L.: Emerging Patterns and Gene Expression Data. Proceedings of 12th Workshop on Genome Informatics, Tokyo, Japan (2001) 3-13
8. Tan, A.H., Pan, H.: Predictive Neural Networks for Gene Expression Data Analysis. Neural Networks 18 (2005) 297-306
9. Dong, G.Z., Li, J.Y.: Efficient Mining of Emerging Patterns: Discovering Trends and Differences. Proceedings of ACM SIGKDD Conference on Knowledge Discovery and Data Mining, San Diego, USA (1999) 43-52
A Novel Method for Prediction of Protein Domain Using Distance-Based Maximal Entropy

Shuxue Zou, Yanxin Huang, Yan Wang, Chengquan Hu, Yanchun Liang, and Chunguang Zhou

College of Computer Science and Technology, Jilin University, Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, 130012, China
{Shuxue Zou,sandror}@163.com, {Chunguang Zhou,cgzhou}@jlu.edu.cn
Abstract. Detecting the boundaries of protein domains has been an important and challenging problem in experimental and computational structural biology. In this paper domain detection is first treated as an imbalanced data learning problem. A novel undersampling method using distance-based maximal entropy in the feature space of SVMs is proposed. On multiple sequence alignments that are derived from a database search, multiple measures are defined to quantify the domain information content of each position along the sequence. The overall accuracy is about 87%, together with high sensitivity and specificity. Simulation results demonstrate that the method can help not only in predicting the complete 3D structure of a protein but also in machine learning systems on general imbalanced datasets.
1 Introduction

Domain structure is one of the structural levels of a protein and is considered the fundamental unit of protein structure, folding, function, evolution and design. Detecting the domain structure of a protein is a challenging problem. There are many methods for domain detection, such as methods that rely on expert experience with known protein structures to identify domains, e.g., CATH [1] and SCOP [2], and methods that try to infer domain boundaries by using the three-dimensional structure of proteins, e.g., PDP [3], DALI [4] and so forth. However, structural information is available for only a small portion of the protein space. With the current rapid growth in the number of sequences with unknown structures, it is very important not only to accurately define protein structural domains, but also to predict domain boundaries on the basis of the amino-acid sequence alone. A few methods based only on protein sequence information have been proposed, which use similarity searches and multiple alignments to delineate domain boundaries, e.g., Domainer [5] and DOMO [6]. In this paper, based on multiple sequence alignments derived from a database search, multiple measures are defined to quantify the domain information content of each position along the sequence. We note that boundary positions are far fewer than core-domain positions, and we therefore treat domain detection as an imbalanced data learning problem.
Learning classification from imbalanced datasets is an important topic. Support vector machines (SVMs) have been very successful in application areas ranging from image retrieval [7] to text classification [8]. Nevertheless, when faced with imbalanced datasets, the performance of SVMs drops significantly [9]. In the remainder of this paper, the negatives, i.e., the core-domain positions, are always taken to be the majority class and the positives, i.e., the boundary positions, are the minority class. We propose a novel undersampling method based on distance-based maximal entropy in the feature space of SVMs. The negatives which have the maximal entropy value with their counterpart positives are undersampled; in this way, the input data are no longer imbalanced. Thus the learned hyperplane is further away from the positive class, which can compensate for the skew associated with imbalanced datasets. The experimental results demonstrate that the method can help not only in predicting the complete 3D structure of a protein but also in machine learning systems on general imbalanced datasets.
2 Datasets and Feature Extraction

Given a query sequence, our algorithm starts by searching the protein sequence database and generating a multiple alignment of all significant hits. The columns of the multiple alignment are analyzed using a variety of sources to define scores that reflect the domain information content of alignment columns. Information-theoretic principles are employed to maximize the information content. An overview of our method is depicted in Figure 1.
Fig. 1. Overview of the domain prediction method
2.1 Datasets

The SCOP database, version 1.65, is employed in this paper; it includes 20,619 proteins and 54,745 chains. The datasets are selected according to the statistical results on single-domain proteins and proteins with two or more domains, respectively, as well as with consideration of protein homology.

2.2 Domain Definitions

For each protein chain we defined the domain positions to be the positions that are at least x residues away from a domain boundary. Domain boundaries are obtained from the SCOP definitions, where for a SCOP definition of the form (start1; end1)::(startn; endn) the domain boundaries are set to (end_i + start_{i+1})/2. All positions that are within x residues of a domain boundary are considered boundary positions. Clearly, the domain positions far outnumber the boundary positions, so protein domain boundary detection is a class-imbalance problem.
Fig. 2. Definition of the domain and boundary positions
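A sketch of the labeling rule of Section 2.2 with hypothetical inputs: SCOP domain fragments are given as (start, end) residue pairs, boundaries are placed midway between consecutive fragments, and every position within x residues of a boundary is labeled a boundary position.

```python
def label_positions(length, domain_fragments, x):
    """Label residues 1..length as 'D' (domain position) or 'B' (boundary position).

    domain_fragments: SCOP definition as ordered (start, end) residue pairs along the chain.
    """
    boundaries = [(end_i + start_next) // 2        # boundary midway between consecutive fragments
                  for (_, end_i), (start_next, _) in zip(domain_fragments, domain_fragments[1:])]
    labels = []
    for pos in range(1, length + 1):
        near = any(abs(pos - b) <= x for b in boundaries)
        labels.append("B" if near else "D")
    return labels

# e.g. a hypothetical 200-residue chain with two domains meeting near residue 100, x = 10
print("".join(label_positions(200, [(1, 98), (103, 200)], 10))[89:115])
```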
2.3 Feature Extraction

First, each protein in the selected dataset has to be aligned against sequence databases. To quantify the likelihood that a sequence position is part of a domain or at the boundary of a domain, we adopt six measures [10] based on the multiple alignments that reflect structural properties of proteins. Information-theoretic principles are employed to maximize the information content. For each column, amino acid entropy and class entropy are defined to measure the sequence conservation, and four features are extracted among the columns over a window. All appearances of a domain in database sequences will maintain the domain's integrity, which can be measured by quantifying the consistency and correlation, including asymmetric and symmetric correlation, of neighboring columns in an alignment. Regions of substantial structural flexibility in a protein often correspond to domain boundaries; in a multiple alignment of related sequences, positions with indels with respect to the seed sequence indicate such regions.
In addition, we adopt a recently proposed method [11] to predict domain boundaries from the protein sequence alone. This simple physical approach is based on the fact that the protein's unique three-dimensional structure results from the balance between the gain of attractive native interactions and the loss of conformational entropy. Considering the conformational entropy here as the number of degrees of freedom of the angles φ, ψ and χ for each amino acid along the chain, the method for domain boundary prediction relies on finding the extreme values in a latent entropy profile.
3 SVMs and Strategies for Imbalanced Classification

Support vector machines (SVMs) are statistical learning techniques for training classifiers based on polynomial functions, radial basis functions, neural networks, splines or other functions. Without loss of generality, we choose SVMs coupled with the RBF kernel, which is widely used in pattern recognition. Given a set of labeled instances X_train = {x_i, y_i}, i = 1, …, n, and a kernel function K, the SVM finds the optimal α_i for each x_i to maximize the margin γ between the hyperplane and the instances closest to it. The class prediction for a new test instance x is made through
sign(f(x)),   f(x) = Σ_{i=1}^{n} y_i α_i K(x, x_i) + b,   K(x, x_i) = exp(−||x − x_i||² / (2σ²)),    (1)
where b is the threshold. The 1-norm soft-margin SVM is used to minimize the primal Lagrangian

L_p = ||w||²/2 + C Σ_{i=1}^{n} ξ_i − Σ_{i=1}^{n} α_i [y_i(w·x_i + b) − 1 + ξ_i] − Σ_{i=1}^{n} r_i ξ_i,    (2)
where α_i ≥ 0 and r_i ≥ 0 [12]. The penalty constant C represents the trade-off between the empirical error ξ and the margin. In order to meet the Karush-Kuhn-Tucker (KKT) conditions, the value of α_i must satisfy

0 ≤ α_i ≤ C,  and  Σ_{i=1}^{n} α_i y_i = 0.    (3)
In [13], Akbani analyzes three causes of performance loss with imbalanced data: first, positive points lie further from the ideal boundary; second, the weakness of soft margins; and third, the imbalanced support vector ratio.

3.1 SVMs Fail on Imbalanced Classification
Veropoulos [14] suggests using different penalty factors C+ and C− for the positive and negative classes, reflecting their importance during training. Therefore, the L_p formulation has two loss functions for the two types of errors:
L_p = ||w||²/2 + C+ Σ_{i | y_i = +1} ξ_i^k + C− Σ_{j | y_j = −1} ξ_j^k − Σ_{i=1}^{n} α_i [y_i(w·x_i + b) − 1 + ξ_i] − Σ_{i=1}^{n} μ_i ξ_i.    (4)
If the SVM algorithm uses an L1 norm (k = 1) for the losses, its dual formulation gives the same Lagrangian as the original soft-margin SVM, but with different constraints on α_i, as follows:

0 ≤ α_i ≤ C+, if y_i = +1,  and  0 ≤ α_i ≤ C−, if y_i = −1.    (5)
It turns out that this biased-penalty method does not help SVMs as much as expected. From the KKT conditions (Eq. (3)), we can see that C+ imposes only an upper bound on α_i, not a lower bound; increasing C+ does not necessarily affect α_i. Moreover, the constraint in Eq. (3) imposes equal total influence from the positive and negative support vectors. Increases in some α_i on the positive side will inadvertently increase some α_i on the negative side to satisfy the constraint. These constraints can make the increase of C+ on the minority instances ineffective.

3.2 Strategies for Imbalanced Classification
A number of solutions to the class-imbalance problem have previously been proposed, both at the data level and at the algorithmic level. As the above analysis shows, when faced with imbalanced datasets the performance of SVMs cannot be raised simply by setting the parameters [9]. At the data level [15], these solutions include many different forms of resampling, such as random oversampling, random undersampling, directed oversampling, directed undersampling, oversampling with informed generation of novel samples, and combinations of the above techniques. In the case of undersampling, examples from the majority class are removed; the removed examples can be randomly selected, near-miss examples, or examples that are far from the minority class examples. In this paper, a novel undersampling method using distance-based maximal entropy in the feature space of SVMs is proposed. The unique learning mechanism of the SVM makes it an interesting candidate for dealing with imbalanced datasets, since SVMs only take into account the data close to the boundary, i.e., the support vectors, for building the model. More importantly, as a kernel-based method, the classification of SVMs is defined in the feature space, and so is our undersampling preprocessing. Let the data set be X = {x_1, x_2, …, x_{M+N}}, where x_i = (x_{i,1}, x_{i,2}, …, x_{i,s}), i = 1, 2, …, M+N, and s is the number of features. Here N is the number of positives and M is the number of negatives in the imbalanced data classification of SVMs, and M >> N. The entropy of x_i is defined as

E_i = − Σ_{i=1}^{M} Σ_{j=1}^{N} ( S(x_i, x_j) log₂(S(x_i, x_j)) + (1 − S(x_i, x_j)) log₂(1 − S(x_i, x_j)) ),    (6)
where S(x_i, x_j) = e^{-\alpha D(x_i, x_j)} is the similarity between data x_i and x_j, and α is the curvature of the function. D(x_i, x_j) is the Euclidean distance in the feature space of the SVMs between x_i and x_j. Suppose that u_i is the projection of input vector x_i in the feature space. We define φ(x_i)·φ(x_j) = k(x_i, x_j); then we can get a Euclidean distance in feature space:

D^2(u_i, u_j) = \left\|\varphi(x_i) - \varphi(x_j)\right\|^2 = \varphi^2(x_i) - 2\varphi(x_i)\varphi(x_j) + \varphi^2(x_j) = k(x_i, x_i) - 2k(x_i, x_j) + k(x_j, x_j) .     (7)
The value of k(x_i, x_j) can be obtained from Eq. (1). More importantly, the kernel parameter in k(x_i, x_j) has to be consistent with that of the SVM. S(x_i, x_j) varies in the range [0, 1], and the entropy term −[S(x_i, x_j)log₂S(x_i, x_j) + (1 − S(x_i, x_j))log₂(1 − S(x_i, x_j))] attains its maximal value 1 for S(x_i, x_j) → 0.5 and its minimal value 0 for S(x_i, x_j) → 0 or S(x_i, x_j) → 1. Accordingly, α_i = −ln(0.5)/D̄_i is estimated, where D̄_i is the mean distance between a given positive example and all the negatives. Therefore, negatives that are either very close to or very distant from a given positive would not be sampled: negatives too close to the learned hyperplane may skew it, while those far from it cannot become support vectors and are trained uselessly. In contrast, negatives whose distance is close to D̄_i contribute strongly. The negatives that have the maximal entropy value with their counterpart positives are undersampled; in this way, the input data are no longer imbalanced, and the learned hyperplane is pushed further away from the positive class. This compensates for the skew associated with imbalanced datasets, which pushes the hyperplane closer to the positive class. The undersampling algorithm using distance-based maximal entropy is as follows (a code sketch is given after the steps):
① Compute the distance from each positive example to all of the negatives with Eq. (7) and obtain D̄_i;
② Compute the corresponding entropy of each positive example according to Eq. (6);
③ Sort the entropies and non-repetitively choose the negative with the maximal entropy.
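The following is a minimal sketch, under our own assumptions (an RBF kernel, a fixed kernel width `gamma` matching the SVM, and a greedy, non-repetitive pair selection), of how steps ①–③ might be implemented; it is an interpretation of Eqs. (6)–(7), not the authors' code:

```python
# Hypothetical sketch of the distance-based maximal-entropy undersampling
# in the RBF feature space (interpretation of Eqs. (6)-(7) and steps 1-3).
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def undersample_max_entropy(X_pos, X_neg, gamma=0.5):
    K_pn = rbf_kernel(X_pos, X_neg, gamma=gamma)          # k(x_i, x_j)
    # Feature-space distance: D^2 = k(x_i,x_i) - 2k(x_i,x_j) + k(x_j,x_j) = 2 - 2k for RBF
    D = np.sqrt(np.maximum(2.0 - 2.0 * K_pn, 0.0))
    alpha = -np.log(0.5) / D.mean(axis=1, keepdims=True)  # alpha_i from the mean distance
    S = np.exp(-alpha * D)                                # similarity in [0, 1]
    S = np.clip(S, 1e-12, 1.0 - 1e-12)
    H = -(S * np.log2(S) + (1.0 - S) * np.log2(1.0 - S))  # pairwise entropy terms
    chosen, used = [], set()
    for i in range(X_pos.shape[0]):                       # one negative per positive,
        order = np.argsort(-H[i])                         # chosen non-repetitively
        j = next(j for j in order if j not in used)
        used.add(j)
        chosen.append(j)
    return X_neg[chosen]
```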
4 Analysis of Results In learning extremely imbalanced data, the overall classification accuracy ( ( true positive+ true negative) /the total of the samples) is often not an appropriate measure of performance. A trivial classifier that predicts every case as the majority class can still achieve very high accuracy. The medical community, and increasingly the machine learning community, use two metrics, the sensitivity and the specificity, when evaluating the performance of various tests. Sensitivity can be defined as the accuracy on the positive instances (true positives / (true positives + false negatives)), while specificity can be defined as the accuracy on the negative instances (true negatives / (true negatives + false positives)).
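The two metrics are simple ratios of confusion-matrix counts; the tiny helpers below (variable names are ours) also illustrate why overall accuracy is misleading for a trivial majority-class classifier:

```python
# Minimal helpers mirroring the definitions above (names are ours, not the paper's).
def sensitivity(tp, fn):
    return tp / (tp + fn)            # accuracy on the positive (minority) instances

def specificity(tn, fp):
    return tn / (tn + fp)            # accuracy on the negative (majority) instances

def overall_accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

# A trivial "always negative" classifier on 1000 negatives and 10 positives:
print(overall_accuracy(tp=0, tn=1000, fp=0, fn=10))  # ~0.99 even though sensitivity is 0
```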
To find the best pair of C and σ², the most important parameters involved in both the SVMs and the undersampling preprocessing, a grid search over some ranges using cross-validation is employed. Since only a limited number of instances can be obtained by the time-consuming alignment running on a PC, we adopt a k-fold (k = 5 in this study) cross-validation procedure for an unbiased evaluation, which is a common technique used in the machine learning community for model selection. All the examples in the dataset are eventually used for both training and testing. The grid search over the SVM parameters was iterated over the following values: C ∈ {2^0, …, 2^9, 2^10} and σ² ∈ {2^−2, 2^−4, …, 2^7, 2^8}. In our experiments, we compared the performance of our classifier with regular SVMs; part of the experimental results is shown in Table 1. The table shows the sensitivity (Se), specificity (Sp), time consumed, and accuracy of each algorithm. "Native SVMs" stands for using the imbalanced data directly as input, compared with the balancing preprocessing proposed in this paper.
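A hypothetical sketch of such a grid search with 5-fold cross-validation is given below; note that scikit-learn parameterizes the RBF kernel by gamma = 1/(2σ²), and the toy data and scoring choice are our assumptions, not the paper's:

```python
# Hypothetical sketch of the grid search described above.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

C_grid = [2.0 ** k for k in range(0, 11)]            # 2^0 ... 2^10
sigma2_grid = [2.0 ** k for k in range(-2, 9)]       # 2^-2 ... 2^8 (as quoted above)
gamma_grid = [1.0 / (2.0 * s2) for s2 in sigma2_grid]

# Toy imbalanced stand-in data; real use would plug in the alignment-derived features.
X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)

search = GridSearchCV(SVC(kernel="rbf"),
                      param_grid={"C": C_grid, "gamma": gamma_grid},
                      scoring="balanced_accuracy", cv=5)
search.fit(X, y)
print(search.best_params_)
```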
Table 1. Experimental results comparing regular SVMs with the proposed preprocessing (Se = sensitivity, Sp = specificity)

σ²     C     |  Regular SVMs:  Se      Sp      time   accuracy  |  With preprocessing:  Se     Sp     time  accuracy
0.25   1024  |                 0.0893  0.9964  1182   85.82%    |                       0.761  0.732  187   64.39%
0.5    2     |                 0.0216  0.9998  128    85.16%    |                       0.531  0.842  79    69.75%
1      512   |                 0.0649  0.9979  449    85.71%    |                       0.859  0.797  146   78.22%
2      1     |                 0       1       133    84.88%    |                       0.625  0.734  66    80.63%
4      8     |                 0.0059  1       136    84.95%    |                       0.787  0.532  81    86.78%
8      64    |                 0.0287  0.9996  135    85.21%    |                       0.953  0.901  94    80.25%
16     16    |                 0       1       135    84.83%    |                       0.944  0.996  74    87.46%
32     128   |                 0.0202  0.9998  131    85.13%    |                       0.895  0.926  111   76.87%
64     32    |                 0       1       130    84.77%    |                       0.875  0.892  86    88.14%
128    256   |                 0.0071  1       127    84.93%    |                       0.665  0.584  99    74.76%
256    4     |                 0       1       135    84.85%    |                       0.807  0.884  62    83.34%
Note that regular SVMs have almost perfect specificity but poor sensitivity, because they tend to classify everything as negative; for the same reason, regular SVMs are not sensitive to variations in the parameters C and σ². Some unfit pairs of C and σ² give lower accuracy values, because all of the data are tested in our classifier rather than the training set, which has equal numbers of negatives and positives. For example, with C = 1024 and σ² = 0.25, the accuracy is the lowest, as the generalization performance of the classifier decreases with the highest C. It is clear that the preprocessing not only reduces the size of the input data, which decreases the training time, but also improves both sensitivity and specificity. The results show that the proposed preprocessing method of balancing the input improves the classification performance.
5 Summary and Outlook In this paper we presented a promising method for detecting the domain structure of a protein from sequence information alone. In addition, the conformational entropy of the
seed sequence is considered. Furthermore, information-theory principles are used to optimize the scores. Notably, we first cast protein domain detection as an imbalanced data classification problem, and then propose a novel undersampling method using distance-based maximal entropy in the feature space of SVMs. Finally, support vector machines with an RBF kernel are trained, and a grid search using cross-validation is employed to identify the optimal C and σ². The prediction of protein domains from sequence reaches an accuracy of about 87% with this method, which hopefully represents a significant improvement for the study of biological macromolecular constructions in bioinformatics. To further validate our method, the datasets will be enlarged in the near future, and the feature analysis should be compared against its biological importance.
Acknowledgement This paper is supported by the National Natural Science Foundation of China (60433020, 60673099, 60673023) and “985” project of Jilin University.
References 1. Orengo, A., Michie, A.D., Jones, S., Jones, D.T., Swindells, M.B., Thornton J.M.: CATH-a Hierarchic Classification of Protein Domain Structures. Structure 5 (1997) 1093-1108 2. Murzin, G., Brenner, S.E., Hubbard, T., Chothia, C.: SCOP: a Structural Classification of Proteins Database for the Investigation of Sequences and Structures. J. Mol. Biol. 247 (1995) 536–540 3. Alexandrov, N., Shindyalov, I.: PDP: Protein Domain Parser, Bioinformatics 19 (3) (2003) 429-430 4. Holm, L., Sander, C.: Mapping the Protein Universe. Science 273 (1996) 595–602 5. Sonnhammer, E.L., Kahn, D.: Modular Arrangement of Proteins as Inferred from Analysis of Homology. Protein Sci. 3 (1994) 482-492) 6. Gracy, J., Argos, P.: Automated Protein Sequence Database Classification. I. Integration of Copositional Similarity Search, Local Similarity Search and Multiple Sequence Alignment. Bioinformatics 14 (2) (1998) 164-187 7. Tong, S., Chang, E.: Support Vector Machine Active Learning for Image Retrieval. Proceedings of ACM International Conference on Multimedia (2001) 107-118 8. Joachims, T.: Text Categorization with SVM: Learning with Many Relevant Features. Proceedings of ECML-98. 10th European Conference on Machine Learning (1998) 9. Wu, G., Chang, E.: Class-Boundary Alignment for Imbalanced Dataset Learning. In ICML 2003 Workshop on Learning from Imbalanced Data Sets II, Washington, DC. (2003) 10. Nagaragan, N., Yona, G.: Automatic Prediction of Protein Domains from Sequence Information Using a Hybrid Learn System. Bioinformatics 1 (2004) 1–27 11. Oxana, V., Galzitskaya, Bogdan, S.M.: Prediction of Protein Domain Boundaries from Sequence Alone. Protein Science 12 (2003) 696–701 12. Cristianini, N., Shawe-Taylor, J.: An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, Cambridge, UK. (2000)
13. Akbani, R., Kwek, S., Japkowicz, N.: Applying Support Vector Machines to Imbalanced Datasets. Proc. 15th. European Conf. Machine Learning (ECML) (2004) 39-50 14. Veropoulos, K., Campbell, C., Cristianini, N.: Controlling the Sensitivity of Support Vector Machines. Proceedings of the International Joint Conference on Artificial Intelligence (1999) 55–60 15. Kotsiantis, S., Kanellopoulos, D., Pintelas, P.: Handling imbalanced datasets: A review, GESTS International Transactions on Computer Science and Engineering 30 (1) (2006) 25-36
The Effect of Recording Reference on EEG: Phase Synchrony and Coherence Sanqing Hu1 , Matt Stead1 , Andrew B. Gardner2 , and Gregory A. Worrell1 1
Department of Neurology Division of Epilepsy and Electroencephalography Mayo Clinic, 200 First Street SW Rochester, MN 55905, USA
[email protected] 2 BioQuantix Corp. Atlanta, GA 30363
[email protected]
Abstract. In [1], we developed two methods to automatically identify the contribution of the recording reference signal from multi-channel intracranial Electroencephalography (iEEG) recordings. In this study, we subtract the reference recording contribution to iEEG and obtain corrected iEEG. We then investigate three commonly used iEEG metrics: spectral power, phase synchrony, and magnitude squared coherence (MSC) for common referential iEEG, corrected iEEG and bipolar montage iEEG. We find significant differences among the three iEEG metrics, and are able to determine the contribution from the recording reference to each metric. Generally, reference signals with smaller amplitude yield lower phase synchrony and reference signals with larger amplitude increase phase synchrony. Reference signals with spectral peaks increase coherence. Reference signal with low power may have no significant impact on calculated coherence. Bipolar EEG usually yields small phase synchrony or MSC values and may obscure the actual phase synchrony or MSC values between two local sources.
1 Introduction The temporal resolution of electroencephalography (EEG) has led to the widespread use of EEG by clinicians and scientists investigating physiologic and pathologic brain function. In particular, the use of EEG in the studies of neuronal assemblies and their oscillations has received wide attention [2]-[15]. It has been noted that certain types of assemblies are characterized by the synchronous activity of their constituent neurons and different EEG frequency components also reveal synchronies relating to different perceptual, motor or cognitive states [2]-[6]. In the literature, there are three common ways to analyze neuronal assemblies: spectral power, phase synchrony (such as mean phase coherence), and coherence (such as MSC). Spectral power measures the consequence of synchrony activity rather than synchrony activity itself. It is an indirect index of neural synchrony. The spectral power of EEG is an indirect measure of the degree of synchronization because weakly synchronized activity leads to destructive interference and lower measurable power at a given frequency [7]. Phase synchrony is a more direct index of neural synchrony and is defined as phase locking value, ranging from 0 (no synchronization) to 1 (perfect synchronization). Spectral power and phase synchrony have been used extensively to assess neural synchrony in human electrophysiological studies
[7] and [8]. Coherence identifies the synchrony of neuronal assemblies as a function of the correlation of EEG frequency components. Neurons in an assembly are presumably located in some proximity to the recording electrodes and exhibit oscillatory activity with common spectral properties. The typical finding is that in a given perceptual, cognitive, or motor task, EEG coherence increases (or decreases) in a certain band of EEG frequency spectrum [9]. It is well known that reference electrode is indispensable to recording multi-channel EEGs, but the fact that reference signal may heavily contaminate EEG recordings and have a significant effect on various EEG metrics is often ignored. Unfortunately, in all above cited references neuronal synchronization has been studied only based on either common referential EEG recordings or common reference-free EEG recordings such as bipolar EEG, average common reference EEG and Laplacian EEG. The potential confounding effect with common referential EEG recordings for spectral power analysis [7], coherence analysis [9] and [10], and phase synchrony analysis [7] and [11] is well established. The bipolar EEG, obtained by subtracting the potentials of two nearby electrodes, will remove all signals common to the two channels, including the common reference. However, one must realize that a given bipolar montage will completely miss dipoles with certain locations and tangential orientations, and not all signals common to the two electrodes are from the reference. Caution against the use of bipolar EEG for coherence analysis was given [12] and [13]. Although the average reference EEG and Laplacian EEG are also reference-free, caution against the use of them for synchronization analysis was given [14]. In [1] we proposed two methods to extract the scalp reference signal from clinical multi-channel iEEG recordings based on independent component analysis [15] and stated why the obtained signal is a “good” estimation of the real reference signal.
2 Methods One patient undergoing evaluation for epilepsy surgery with intracranial electrodes was investigated. This patient had a subdural grid of electrodes covering the left anterior temporal neocortex. The grid consisted of a silastic sheet embedded with 32 2.3 mm diameter Platinum- Iridium alloy electrodes separated by 10 mm in a 4 × 8 array. The iEEG was acquired using a stainless steel scalp suture placed in the vertex region of the scalp, midline between the Cz and Fz electrode positions (international 10–20). The data were recorded differentially, with the common reference being the scalp suture electrode. The scalp suture electrode is relatively isolated from the intracranial electrodes by the intervening layers of cerebrospinal fluid, bone, muscle, and scalp. These layers serve to distribute signal in such a way that approximately 7 cm2 of coordinated cortical activity is generally required to produce a clear deflection detectable on the scalp. In practice the reference electrode serves the purpose primarily of rejecting common mode potentials generated by muscular contraction and body movement, which are conducted to the intracranial vault. Unfortunately, it also introduces artifacts unique to the scalp site. The data were acquired with a DC-capable NeuralynxT M electrophysiology system, and digitized at 32556 Hz. For analysis in this paper the data were decimated offline
by first low-pass filtering to 400 Hz using a forward and reverse digital filter (Matlab "filtfilt" procedure) to avoid phase shift and then resampling to a frequency of 2000 Hz. The time-series measures estimated were the power spectral density (PSD), phase synchrony (mean phase coherence in this paper), and MSC. The PSD was estimated by using Welch's method with a 512-sample Hanning window and 256-sample overlap. The mean phase coherence [16] is defined as

R = \left|\frac{1}{N}\sum_{j=1}^{N} e^{i[\phi_x(t_j) - \phi_y(t_j)]}\right| ,

where φ_x(t) and φ_y(t) denote the phase variables of two oscillating signals x(t) and y(t), and N = 20000 is the window sample size with half-overlap. To calculate phase synchrony, we prefiltered the iEEG to the frequency band 1 Hz∼70 Hz. The MSC was estimated by using a 512-sample Hanning window with 256-sample overlap.
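The sketch below illustrates, under our own assumptions about filter design and phase extraction (a 4th-order Butterworth band-pass and the Hilbert transform), how the three measures described above could be computed; it is not the authors' implementation:

```python
# Hypothetical sketch: band-pass 1-70 Hz, instantaneous phase via the Hilbert
# transform, mean phase coherence R over N-sample half-overlapping windows, and
# magnitude squared coherence with a 512-sample Hann window and 256-sample overlap.
import numpy as np
from scipy.signal import butter, filtfilt, hilbert, coherence

fs = 2000  # Hz, after decimation

def bandpass(x, lo=1.0, hi=70.0, fs=fs, order=4):
    b, a = butter(order, [lo / (fs / 2), hi / (fs / 2)], btype="band")
    return filtfilt(b, a, x)            # forward-reverse filtering, zero phase shift

def mean_phase_coherence(x, y, n=20000):
    phx = np.angle(hilbert(bandpass(x)))
    phy = np.angle(hilbert(bandpass(y)))
    step = n // 2                       # half-overlapping windows
    R = [np.abs(np.mean(np.exp(1j * (phx[s:s + n] - phy[s:s + n]))))
         for s in range(0, len(x) - n + 1, step)]
    return np.array(R)                  # one R value per window, in [0, 1]

def msc(x, y):
    f, Cxy = coherence(x, y, fs=fs, window="hann", nperseg=512, noverlap=256)
    return f, Cxy
```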
3 Results The patient underwent iEEG monitoring using 32-contact subdural grid electrodes with a scalp reference electrode at sample frequency of 2000Hz. Four adjacent channel iEEG recordings (Grid2, Grid3, Grid4 and Grid5) are plotted in Figure 1A where only 20 seconds of 200 seconds are shown representatively. The reference signal in Figure 1A was calculated based on the second method [1] for the whole time period (200 seconds) and for the whole 32 channels where the rest of channels are omitted here. It should be pointed out that the whole data are artifact-free as seen in Figure 1A. In Figure 1A, the second four channels (Grid2∼Grid5) are corrected iEEG of the first four channels (Grid2∼Grid5). The corrected iEEG was obtained by subtracting the calculated reference signal from the original referential iEEG. One can see that the basic patterns of the original referential iEEG are preserved in the corresponding corrected iEEG. Two bipolar montage iEEG (Grid2−Grid3 and Grid4−Grid5) are also plotted in Figure 1A. To show the influence of the reference signal on iEEG recordings, Figure 1B∼1D describe the PSD for the reference signal, original referential Grid3, and corrected Grid3. One can easily see that the reference signal has activity of frequency near 60Hz over the whole time period in Figure 1B. This contribution was verified to come from the line voltage. This activity can be seen clearly for the referential Grid3 near 60Hz over the whole time period in Figure 1C and is removed in corrected Grid3 in Figure 1D. Figure 1E shows the spectral power for the reference signal (the dash-dot line), Grid2 and Grid3 where the solid lines correspond to the referential Grid2 and Grid3, the dashed lines correspond to the corrected Grid2 and Grid3, and the dotted lines correspond to the bipolar iEEG (Grid2−Grid3). It is easy to see that the referential iEEG (Grid2 and Grid3) and the reference signal have peaks near 60Hz and that the bipolar iEEG (Grid2−Grid3), the corrected iEEG Grid2 and Grid3 have no peak near 60Hz. This further verifies that the reference signal is removed out from the referential iEEG completely. In Figure 1F and 1G two measures: phase synchrony and MSC are analyzed where the solid lines correspond to the referential iEEG, the dashed lines correspond to the
corrected iEEG, and the dotted lines correspond to the bipolar iEEG. To analyze phase synchrony between different channels, we filtered the referential iEEG and corrected iEEG to the frequency band 1Hz∼70Hz. From Figure 1F, one can see that phase synchrony values for Grid2*Grid4 fall into the interval [0.25, 0.5] and are minimally different for the referential iEEG and corrected iEEG. Phase synchrony values for Grid2*Grid5 fall into the interval [0.15, 0.35] and are minimally different for the referential iEEG and corrected iEEG. However, phase synchrony values for Grid3*Grid4 or Grid3*Grid5 are significantly different for the referential iEEG and corrected iEEG: for instance, phase synchrony values for the corrected Grid3*Grid4 are all above 0.8, while phase synchrony values for the referential Grid3*Grid4 lie in the interval [0.4, 0.6]. It is notable that all phase synchrony values for the bipolar iEEG (Grid2−Grid3)*(Grid4−Grid5) are small, all less than 0.2. Another interesting observation is that the phase synchrony of the corrected iEEG is mostly larger than that of the referential iEEG, e.g. Grid3*Grid4, because the amplitude of the reference signal is mostly smaller than that of the referential or corrected iEEG, as seen in Figure 1A for this patient. Hence, the amplitude of the reference signal plays an important role in phase synchrony. From Figure 1G, one can see that MSC values for Grid2*Grid4 or Grid2*Grid5 are less than 0.3 from 7Hz to 70Hz for both the referential iEEG and corrected iEEG. MSC values for Grid3*Grid4 or Grid3*Grid5 are higher from 7Hz to 70Hz for the corrected iEEG, compared with the long downward trends before 55Hz and the big peaks after 55Hz for the referential iEEG. There are peaks near 60Hz for all the referential iEEG and no peak near 60Hz for the corrected iEEG, which further confirms that the peaks come from the common reference signal. Moreover, comparing the four peaks, one can see that the peaks for Grid3*Grid4 and Grid3*Grid5 are much larger than those for Grid2*Grid4 and Grid2*Grid5 because of the following fact: the power of the reference signal is larger than that of the corrected Grid3 near 60Hz, as shown in Figure 1E, and as a result the reference signal plays a dominant role and leads to big peak amplitudes. On the contrary, the power of the reference signal is smaller than that of the corrected Grid2 near 60Hz, as shown in Figure 1E, and as a result the reference signal plays a weak role and leads to small peak amplitudes. It is notable that all MSC values show little change from 0Hz to 7Hz for the referential iEEG and corrected iEEG because of the lower power of the reference signal, which can be understood from the comparison of the reference, Grid2 and Grid3 in Figure 1E. It is also interesting to note that all MSC values for the bipolar (Grid2−Grid3)*(Grid4−Grid5) from 7Hz to 70Hz are very close to zero. The higher MSC values for the bipolar (Grid2−Grid3)*(Grid4−Grid5) from 0Hz to 7Hz may come from the higher MSC values for Grid3*Grid4. Thus, in this case, it is obviously wrong to use MSC values for the bipolar (Grid2−Grid3)*(Grid4−Grid5) to reflect MSC values for Grid3*Grid4 or Grid3*Grid5 from 7Hz to 70Hz.
Hence, now we make the following conclusions: i) reference signal may change the observed phase synchrony values or MSC values and thus lead to an incorrect interpretation of the EEG even if the raw data are very clean, that is, artifact-free; ii) the commonly used bipolar iEEG usually leads to small phase synchrony values or MSC values and cannot reflect large phase synchrony values or MSC values between two local real sources; iii) the amplitude of the reference signal plays an important role
in phase synchrony at one time point; iv) the power of the reference signal plays an important role in MSC at a given frequency.
4 Discussion and Conclusions In this study, we mainly discuss phase synchrony and coherence of common referential, corrected, bipolar iEEGs. It is well known that reference electrode is indispensable to recording multi-channel EEG, but the reference signal may heavily contaminate EEG recordings. It is very desirable to find what is the reference signal. In our recent paper [1] we proposed two methods to extract the scalp reference signal from multiple iEEG channel recordings based on independent component analysis and the following assumption: the reference signal from the scalp reference electrode can be treated as independent from all of the sources recorded at each intracranial electrode. This assumption is basically true because the reference scalp electrode is relatively isolated from the intracranial electrodes by the three intervening layers of cerebrospinal fluid, bone, and scalp. This assumption was supported by simulation results from clinic EEG data [1]. In this study, we applied the second method to clinical iEEG data from one patient to find the reference signal. After removing the reference signal for the patient, we got corrected iEEG and made comparisons for phase synchrony and coherence of the referential iEEG, corrected iEEG and bipolar montage iEEG. We found significant differences for these observed values among three iEEGs in many cases. Here the iEEG data from the patient were largely artifact-free. To show why the obtained reference signal is a “good” estimation of the real reference signal, we compared the spectral power of the referential and corrected iEEG and the reference signal in time-frequency domain, and find that the high frequency activity near 60Hz for the reference signal and referential iEEG (Figures 1B and 1C ) is removed from the corrected iEEG (Figure 1D). This shows that the reference signal is almost completely extracted out from the referential iEEG. This fact is further verified by peaks in the spectral power of the referential iEEG which were removed from the corrected iEEG (Figure 1E). Reference signal may have important influence on values of phase synchrony and MSC of EEG in the following two ways: i) the amplitude of the reference signal plays an important role in phase synchrony at one time point. More precisely, the reference signal with smaller amplitude may decrease phase synchrony values. On the contrary, the reference signal with larger amplitude may increase phase synchrony values. ii) The power of the reference signal plays an important role in MSC values at one given frequency. The reference signal with larger power may increase MSC values and may lead to larger peak amplitude (see, e.g., the referential and correct Grid3*Grid5 near 60Hz in Figure 1G). The reference signal with lower power may cause no much change for MSC values (see, e.g., the referential and correct Grid3*Grid4 from 0Hz to 10Hz in Figure 1G). Hence, the reference signal may change the observed phase synchrony values or MSC values significantly and thus lead to an incorrect interpretation of EEG even if the raw data are artifact-free. The commonly used bipolar EEG can remove the common reference. The bipolar EEG is very useful to remove artifacts when these artifacts come from the reference electrode. However, one should note that bipolar EEG also removes all signals
common to the two channels, and not all signals common to the two electrodes are from the reference or artifact. Hence, a given bipolar montage will completely miss dipoles with certain locations and tangential orientations. From our simulation results, bipolar EEG usually leads to small phase synchrony values or MSC values and will underestimate real phase synchrony or MSC values between two different channels due to some local sources (see e.g., Figures 1F). Hence, bipolar EEG may cause distortion of phase synchrony and coherence and lead to misinterpretation of EEG. Figure 1: A) 20 seconds sample of four channel iEEG recorded from the subdural grid electrodes where the former four Grid2−Grid5 are the referential iEEG and the later four Grid2−Grid5 are the corrected iEEG. The reference signal is calculated by using the second method in our recent work. Grid2−Grid3 and Grid4−Grid5 are two bipolar iEEG. B) The PSD of the reference signal. C) The PSD of the referential Grid3. D) The PSD of the corrected Grid3. E) The spectral power for the referential (solid line) and corrected (dashed line) Grid2 and Grid3, reference signal (dash-dot line), and bipolar Grid2−Grid3 (dotted line). F) The mean phase coherence for referential (solid line) and corrected (dashed line) Grid2*Grid4, Grid2*Grid5, Grid3*Grid4, Grid3*Grid5, and bipolar (Grid2−Grid3)*(Grid4−Grid5) (dotted line). G) The magnitude squared coherence for the referential (solid line) and corrected (dashed line) Grid2*Grid4, Grid2*Grid5, Grid3*Grid4, Grid3*Grid5, and bipolar (Grid2−Grid3)* (Grid4−Grid5) (dotted line).
Acknowledgment This work was supported by NIH 5K23NS047495.
References 1. Hu, S., Stead, M., Worrell, G.A.: Automatic Identification and Removal of Scalp Reference Signal for Intracranial EEGs based on Independent Component Analysis. IEEE Transaction on Biomedical Engineering (in press) 2. Basar, E., Basar-Eroglu, C., Karakas, S., Schurmann, M.: Gamma, Alpha, Delta and Theta Oscillations Govern Cognitive Processes. Int. J. Psychophysiol 39 (2001) 241-248 3. Fuentemillaa L., Marco-Pallars, J., Graua, C.: Modulation of Spectral Power and of Phase Resetting of EEG Contributes Differentially to the Generation of Auditory Event-related Potentials. NeuroImage 30 (2006) 909-916 4. Knyazev, G. G., Savostyanov, A. N., Levin, E. A.: Alpha Synchronization and Anxiety: Implications for Inhibition Vs. Alertness Hypotheses. Int J Psychophysiol 59 (2006) 151-158 5. Sritharan, A., Line, P., Sergejew, A., Silberstein R., Egan, G., Copolov, D.: EEG Coherence Measures During Auditory Hallucinations in Schizophrenia. Psychiatry Res 136 (2005) 189-200 6. Yeragani, V. K., Cashmere, D., Miewald, J., Tancer, M., Keshavan, M.S.: Decreased Coherence in Higher Frequency Ranges (Beta and Gamma) between Central and Frontal EEG in Patients with Schizophrenia: A Preliminary Report. Psychiatry Res 141 (2005) 53-60 7. Trujillo, L. T., Peterson, M. A., Kaszniak, A. W., Allen, J. J.: EEG Phase Synchrony Differences Across Visual Perception Conditions May Depend on Recording and Analysis Methods. Clin Neurophysiol 116 (2005) 172-189
8. Palva, J. M., Palva, S., Kaila, K.: Phase Synchrony among Neuronal Oscillations in the Human Cortex. J Neurosci 25 (2005) 3962-3972 9. Duckrow, R. B., Zaveri, H. P.: Coherence of the Electroencephalogram during the First Sleep Cycle. Clin Neurophysiol 116 (2005) 1088-1095 10. Nunez, P. L.: Electric Fields of the Brain: The Neurophysics of EEG. Oxford University Press. New York: NY (1981) 11. Guevara, R., Velazquez, J. L., Nenadovic, V., Wennberg, R., Senjanovic, G., Dominguez, L. G.: Phase Synchronization Measurements using Electroencephalographic Recordings: What Can We Really Say about NeuronalSynchrony? Neuroinformatics 3 (2005) 301-314 12. Zaveri, H. P., Duckrow, R. B., Spencer, S. S.: The Effect of a Scalp Reference Signal on Coherence Measurements of Intracranial Electroencephalograms. Clin Neurophysiol 111 (2000) 1293-1299 13. Zaveri, H. P., Duckrow, R. B., Spencer, S. S.: On the Use of Bipolar Montages for Time-series Analysis of Intracranial Electroencephalograms . Clin Neurophysiol 117 (2006) 2102-2108 14. Schiff, S. J.: Dangerous Phase. Neuroinformatics 3 (2006) 315-318 15. Hyvarinen, A., Oja, E.: Independent Component Analysis: Algorithms and Applications. Neural Networks 12 (2000) 411-130 16. Mormann, F., Kreuz, T., Rieke, C., Andrzejak, R. G., Kraskov, A., David, P., Elger, C.E., Lehnertz, K.: On the Predictability of Epileptic Seizures. Clin Neurophysiol 116 (2005) 569-587
Biological Inspired Global Descriptor for Shape Matching* Yan Li, Siwei Luo, and Qi Zou Department of Computer Science Beijing Jiao Tong University Postcode 100044 Beijing China
[email protected],
[email protected],
[email protected]
Abstract. Shape description is the precondition for shape matching and retrieval. The most robust and stable primitives for describing shapes are global topological properties, but obtaining global topological properties is still an obstacle in computer vision. Motivated by the difference sensitivity of short-range connections in biological vision, we present a novel global descriptor that describes the entire topology of a simple closed 2D shape. We employ two novel strategies: the zigzag rule, which approximates a shape by an elaborate polygonal curve, and a cost function which combines global configurations as well as local information of the line stimulations as penalties. With these two key steps the descriptor is robust to translation, scaling and rotation. Experimental results show that the model achieves good performance on matching and retrieval for silhouettes. Even for images with occlusion the results are excellent and reasonable.
1 Introduction A problem of both theoretical and practical importance in shape matching and retrieval is how to describe a shape. As we know, simple cells in the V1 cortex obtain local features, such as position and orientation; however, how can global information, such as the correlation between local features and topology, be represented? An important finding of the 1960s is that the majority of neurons function as edge detectors: they react strongly to an edge or a line of a given orientation in a given position of the visual field [1], and the visual neural system manages information by the majority of its neurons rather than by the complexity of their functions [2, 3]. The neuron synapses form a network by which the features of visual objects can be detected and organized in less than a millisecond [4, 5]. This swift, synchronous processing of distributed information contributes greatly to complex visual tasks such as grouping and matching. The synchronous action is carried out by short-range and long-range connections [6]. Here, we focus on the contribution of short-range connections to planar closed curve description. *
Supported by the National Natural Science Foundation of China under Grant Nos. 60373029 and the National Research Foundation for the Doctoral Program of Higher Education of China under Grant Nos. 20050004001.
Motivated by neurophysiology, we build a network model to simulate the aggregation of simple cells in the V1 cortex. The optimally responding neuron records the relative position, orientation and frequency of a specific line stimulus. In this way, a meaningful stimulation turns into a connective network constructed by simple cells in the primary visual cortex. Each node in the network, corresponding to a single simple cell, records the local primary visual features, and the short-range connections record the differences among the local features. Our main hypothesis is that the topological configuration of the line stimuli can be recorded in the topology of the neural network. The basic idea of our method is very simple. First, we approximate a planar closed shape by a polygon based on our zigzag rule, obtaining at the same time the position, orientation and length of each edge, as simple cells do. Second, we use these local features to build our descriptor, in which the statistical distributions of relative distances and orientational differences are used. Third, combining the similarity distances among descriptors, the penalty from the difference in node number and the restriction of local similarity, our cost function is made up of three items to perform the matching.
2 Related Work The GD developed in this paper is closely related to two areas of current work. The first area concerns how to describe a shape. Some approaches use interest points and their luminance information to represent object images, such as [7, 8, 9, 10]. However, these approaches are not appropriate for shapes, because shapes carry configuration information that cannot be captured by a few points. In other words, global-attribute information is much more important than local-attribute information in shape representation. A great deal of research has been done on contour images. Since a contour can be represented as a single closed curve, there are many methods based on a parameterized perimeter. Arkin et al. [11] use the distance between the turning functions of two polygons. Latecki et al. [12] improved this work by computing the similarity between corresponding visual parts. Mokhtarian et al. [13, 14] use the CSS (curvature scale-space) image to represent the shapes of object contours. The above two approaches perform well [15]; however, they fail to account for the global configuration. Belongie et al. [16] represent the configuration of a contour by the global shape context, which captures the distribution of the remaining points relative to a reference point, and [17] extended the shape context as a robust method for capturing spatial relationships. In [18], Mortensen et al. combined SIFT and the shape context, which gave more accurate results in point matching. Our work is an improvement of [16]. The second area concerns the correspondence problem. For shape matching, the purpose should not only be to find the total minimum of the similarity between each pair of features, but also to preserve their spatial relationships in the correspondence. In [19], Scott and Nowak extend the standard assignment [20] so that it preserves the order of the point correspondences. By enforcing the order-preserving constraint, the accuracy and geometric fidelity of the point matching is increased, leading in turn to improved shape similarity assessment.
3 Simple Shape Evolution Based on the Zigzag Rule Since the contours of objects in digital images are often distorted, the obvious way to neglect the distortions is to approximate the shape by a polygonal curve that preserves sufficient configuration information of the original one. We first treat a contour as a finite point set and then group and merge the points into segments. In Latecki et al. [21], a discrete curve evolution process was generated with a stop parameter for the iterations. As a key contribution, we propose a novel evolution, the simple shape evolution, which does not require any parameter or iterations. Our purpose is to use as few segments as possible while representing the configuration of the contour as elaborately as we can. First, we obtain the contour by using code available from M. Dow et al. [22], which provides a set of points P = {p_1, p_2, …, p_n}, p_i = ((p_i)_x, (p_i)_y) ∈ R², in
which (p_i)_x, (p_i)_y are the x-coordinate and y-coordinate of the point, and the sequence number i is the order in which the points constitute the contour. Second, we group the points p_i, p_{i+1}, …, p_{i+d} ∈ P into first-order segments l_j (we abbreviate first-order segment to FS below) if their x-coordinates or y-coordinates are the same or proportionate. We define θ_j ∈ {−3π/4, −π/2, −π/4, 0, π/4, π/2, 3π/4, π} as the direction of the FS. This step is an interim process that forms an FS set L = {l_1, l_2, …, l_m}, with each segment l_j having length d_j and direction θ_j. The third step is the key step: merging several FSs into a final segment. Our merging rule asks whether the sequential segments l_j, l_{j+1}, …, l_{j+t} ∈ L have the same directions alternately. If they have, we merge them into a segment s_k. Since the figure of s_k looks like a zigzag, we call this the zigzag rule.
Zigzag rule. Sequential segments l j , l j +1 ,..., l j +t ∈ L can be merged to be a
segment s k for approximation if they have the same directions alternately. The direction θ k of s k is computed by the definition of slope. Fig. 1 shows the sample merging for the conditions.
Fig. 1. Sample merging. Entire contour and approximate polygon is in the left. The rectangle in the down right corner in left image is zoomed in on the right image. Points are the original contour, and the solid line is the merged segment, which can be represented by the starting point and ending point. The two points are surrounded by the rectangles.
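A minimal sketch of the zigzag merging step is given below, assuming the first-order segments are already available as (start point, end point, direction) tuples; requiring a run of at least three segments before merging is our interpretation of the rule, not a detail stated in the text:

```python
# Hypothetical sketch of the zigzag rule: a maximal run of consecutive first-order
# segments whose directions alternate between two values is merged into one segment
# whose direction is taken from the slope of the merged chord.
import math

def zigzag_merge(segments):
    """segments: consecutive first-order segments, each as (start, end, theta)."""
    merged, i = [], 0
    while i < len(segments):
        j = i + 2
        # grow the run while each direction repeats the one two positions earlier,
        # i.e. the directions alternate between theta_i and theta_{i+1}
        while j < len(segments) and segments[j][2] == segments[j - 2][2]:
            j += 1
        if j - i >= 3:                              # a genuine zigzag run: merge it
            start, end = segments[i][0], segments[j - 1][1]
            theta = math.atan2(end[1] - start[1], end[0] - start[0])
            merged.append((start, end, theta))
            i = j
        else:                                       # no zigzag here: keep the segment
            merged.append(segments[i])
            i += 1
    return merged
```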
4 Generating the Shape Descriptor As mentioned in Section 1, we build a model to simulate the network of simple cells in the V1 cortex. Because the network is fully connected, the differences and their corresponding local relations are stored in the topology of the network. In our model, we define the global descriptor to represent this topology network, and denote a single simple cell together with its topology with respect to all other cells by s_k. We use the histogram of binary relations of relative differences of positions and orientations (h_ij^{geodis}, h_ij^{ori}) to represent the statistical rules. Note that h_ij^{geodis} is not the Euclidean distance but the geodesic distance, which is computed along the contour. From the results in [17], we can conclude that the geodesic distance, which is more accurate than the inner-distance, is more effective for representing the position of a line stimulus, especially in non-convex shapes such as articulated shapes. The value range of h_ij^{geodis} is grouped into m intervals and that of h_ij^{ori} into n intervals. Their joint distribution (h_ij^{geodis}, h_ij^{ori}) generates m × n intervals, and any pair (h_ij^{geodis}, h_ij^{ori}) falls into exactly one of the m × n intervals. We compute the histogram h_ij = (h_ij^{geodis}, h_ij^{ori}) of relative differences of positions and orientations between cell s_k and the remaining k − 1 cells on the shape:

\mathrm{descriptor}(s_k) = \left\{ h_{ij}(s_k) \Big/ \sum_{i=1}^{m}\sum_{j=1}^{n} h_{ij}(s_k) \;\middle|\; i = 1,2,\ldots,m;\ j = 1,2,\ldots,n \right\} ,     (1)

h_{ij}(s_k) = \#\left\{ q \neq s_k : h_{ij}^{geodis} \in \mathrm{interval}_{geodis}(i) \ \&\ h_{ij}^{ori} \in \mathrm{interval}_{ori}(j) \right\} .     (2)
This normalized histogram is defined to be the shape descriptor; an example is shown in Fig. 2.
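The following is a hypothetical sketch of how such a descriptor histogram could be computed from the merged segments, assuming each segment is summarized by its arc length and direction along the closed contour; the bin counts m = 10 and n = 4 follow Fig. 2, everything else is our assumption:

```python
# Hypothetical sketch of the global descriptor of Eqs. (1)-(2): for a reference
# segment s_k, accumulate a normalized 2-D histogram of (geodesic distance,
# orientation difference) over all other segments.
import numpy as np

def global_descriptor(lengths, thetas, k, m=10, n=4):
    """lengths, thetas: per-segment arc length and direction along the closed contour."""
    perim = float(np.sum(lengths))
    # geodesic (along-contour) distance between segment midpoints, the shorter way round
    cum = np.cumsum(lengths) - lengths / 2.0
    d = np.abs(cum - cum[k])
    d = np.minimum(d, perim - d) / (perim / 2.0)           # normalized to [0, 1]
    dtheta = np.abs((thetas - thetas[k] + np.pi) % (2 * np.pi) - np.pi) / np.pi
    mask = np.arange(len(lengths)) != k                    # exclude s_k itself
    H, _, _ = np.histogram2d(d[mask], dtheta[mask],
                             bins=[m, n], range=[[0, 1], [0, 1]])
    return H / H.sum()                                     # Eq. (1): normalized histogram
```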
Fig. 2. Sample of the global descriptor. The approximate polygon of the contour is shown on the left, in which the bold line is s_k; its descriptor is shown on the right. The x-coordinate shows the 10 intervals of h_ij^{geodis}; within each interval, the different colors indicate the 4 intervals of h_ij^{ori}, and the y-coordinate shows the number in each bin.
4.1 Extension of Shape Context
To extend the shape context defined in [16], we redefined the bins with the difference of positions and orientations of the line stimulations. From biological view, in
primary visual cortex line stimulation is more reasonable than point in shape representation. As a result, we use line segments to represent the shape, and replace the polar angle θ in shape context with the difference of line stimulation orientations. At the same time, the Euclidean distance is replaced by the geodesic distance, which is more sensitive to shape transformation. In the following the shape context is called SC and our descriptor GD.
5 Measuring the Shape Similarity Including the distances among descriptors, the penalty from the difference of node number and the restriction of local similarity, our cost function is made up of three items D , Num and LS . Considering descriptor ( s1k ) reference to node s1k on one contour, and another node s 2 l on another contour with descriptor ( s 2 l ) , we define the distance using the
χ² test statistic to denote the difference between the two contours in (3):

d_{kl} \equiv \left\| \mathrm{descriptor}(s_{1k}) - \mathrm{descriptor}(s_{2l}) \right\| = \frac{1}{2}\sum_{i=1}^{m}\sum_{j=1}^{n}\frac{\left[h_{ij}(s_{1k}) - h_{ij}(s_{2l})\right]^2}{h_{ij}(s_{1k}) + h_{ij}(s_{2l})} .     (3)
Here, π (k ) is a permutation of minimum matching cost. This is a weighted bipartite matching problem. The input is a cost matrix with entries d kl , and the result is a minimum permutation π (k ) as well as the minimum cost D . In our experiments, we use the efficient algorithm of [22]. The parameters will be discussed in next section. Differ from the SC, which can normalize the shape with a determinate number of points, in order to maintain as most structure information as possible, since the node number on each contour is not always the same, so a “dummy” system is necessary. For two shapes with nodes M and N , where M ≥ N , we add dummy nodes to the second shape with an average matching cost of ε d = (∑kM=1 ∑lN=1 d kl ) / M * N . In this case, a node will be matched to a “dummy” whenever there is no real match available at the second shape. Our dummy node is not to handle the outliers, but to punish the node number difference of query shape and the reference shape. So the second item of cost function is Num = k Num ( M − N )ε d (k Num ≥ 0) ,
(4)
where k Num is a punishment coefficient. The third item of our cost function is about the local similarity. For a specific line stimulate in bi-level contour, length of the line stimulate is the local information we concerned. In the query and reference shape, a pair of corresponding line stimulate should have the similar length. In this consideration, we define the third item is (5) LS = k | length( s1 ) − length( s 2 ) | ( k ≥ 0) , LS
k
π (k )
LS
where length(⋅) is the length of a line stimulate, k LS is a punishment coefficient.
Our final cost function is

\mathrm{cost} = \min(D + Num + LS) = \min_{\pi \in \Pi}\left(\sum_{k=1}^{M} d_{k,\pi(k)}\right) + k_{Num}(M - N)\varepsilon_d + k_{LS}\left|\,\mathrm{length}(s_{1k}) - \mathrm{length}(s_{2\pi(k)})\,\right| .     (6)

In this cost function, the second item is determined once the cost matrix is computed, and the third item is appended after the best permutation has been found.
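As an illustration, the sketch below assembles the three items of Eq. (6) using a standard linear assignment with dummy columns; the paper's order-preserving COPAP solver is replaced here by scipy's Hungarian-style solver, so this is a simplification under our own assumptions, not the authors' matcher:

```python
# Hypothetical sketch of the matching cost of Eq. (6). Dummy columns carry the
# average cost eps_d; k_num and k_ls follow Eqs. (4)-(5). Descriptors are the
# per-node histograms of Eq. (1); lengths are per-node segment lengths.
import numpy as np
from scipy.optimize import linear_sum_assignment

def matching_cost(desc1, len1, desc2, len2, k_num=0.9, k_ls=0.1):
    M, N = len(desc1), len(desc2)                 # assume M >= N
    d = 0.5 * np.sum((desc1[:, None] - desc2[None]) ** 2
                     / (desc1[:, None] + desc2[None] + 1e-12), axis=(2, 3))
    eps_d = d.mean()
    cost = np.hstack([d, np.full((M, M - N), eps_d)])   # pad with dummy nodes
    rows, cols = linear_sum_assignment(cost)             # minimizing permutation
    D = cost[rows, cols].sum()
    real = cols < N                                       # matches to real (non-dummy) nodes
    Num = k_num * (M - N) * eps_d
    LS = k_ls * np.abs(len1[rows[real]] - len2[cols[real]]).sum()
    return D + Num + LS
```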
6 Matching In this section we illustrate some aspects of the similarity measure and the shape descriptor by computing them on contours with the method presented above. The images are downloaded from the Kimia data set. Our approach is simple and easy to apply, and every step serves its goal well. In the shape evolution stage, we use 50 images to compute the percentage of segment reduction of our method. Each image is normalized to 155 × 155 pixels, so that a shape consists of 220 to 600 points. After evolution, the number of final segments is about 18 to 60, and the percentage of reduction is about 91.3%. 6.1 Translation, Rotation and Scaling Invariance
A matching approach should be invariant under translation, scaling and rotation, and here we evaluate our shape descriptor matching by these criteria. Invariance to translation is inherent to our descriptor, since all the measurements are computed from points taken from the contour. The final computation of the descriptor uses relative distances, avoiding the effect of absolute numerical values. In our experiments, scaling is tested with 50 basic shapes and 6 shapes derived from each basic shape by scaling with factors 1.3, 1.2, 1.1, 0.9, 0.8 and 0.7. If the scaling factor deviates by more than 30%, invariance decreases a bit, probably because of discretization error. Fig. 3 shows the scaling sample.
Fig. 3. Sample of scaling invariance. The values under the images are the scaling factors (original, −30%, −20%, −10%, +10%, +20%, +30%).
Fig. 4. Sample of rotation invariance. The values under the images are the rotation angles (original, 90°, 180°, 270°).
As to rotation, our shape descriptor guarantees almost complete invariance when the rotation angle is kπ/2 (k = 0, 1, 2, 3). Fig. 4 shows the rotation sample. For the shape similarity measure, the average cost per segment for scaling and rotation is computed. With the GD normalized in [0, 1], the biggest distance between two GDs is 20. The results are shown in Table 1. We can conclude from the data that the larger the deviation of the scaling factor, the bigger the cost. For rotation, we provide almost complete invariance when the orientation difference is 90°, and a somewhat larger distance for other rotation angles.

Table 1. Average cost for scaling and rotation

Scaling   +30%   +20%   +10%   −10%   −20%   −30%      Rotation   90°    180°   270°
Cost      3.05   2.84   2.68   2.87   3.04   3.35      Cost       1.94   1.17   1.78
6.2 Retrieval on Kimia Database
To show the performance of the global descriptor in matching and retrieval, we test it on the Kimia data set. This test contains 25 images from 6 categories (Fig. 5); it has also been used in [21]. In our experiments, the last image is replaced by a similar one because the original is not suitable for extracting a simple closed contour. The parameters used in our experiments are as follows: L in COPAP [22] is 85% of min(M, N), k_Num = 0.9, and k_LS = 0.1·ε_d/(length(s_1k) + length(s_2π(k))).     (10)
These values show that the punishment for the node-number difference is much bigger than that for local similarity in our matching; the equality of the node numbers of two shapes is more important. The retrieval result is summarized as the number of 1st, 2nd and 3rd closest matches that fall into the correct category. Because we choose line stimuli rather than points to build our descriptor, discretization affects our results slightly. The performance of our descriptor is not the best, but the results show that it can represent the shape to a great extent. Compared with [2] and [21], the computation of our method is much easier and swifter, while attaining comparable performance. From a biological view, although using points may give better results, line stimuli are more reasonable for shape representation in the primary visual cortex.
Fig. 5. Kimia dataset: This dataset contains 25 instances of 6 categories
Table 2. Retrieval result on the Kimia dataset [19] (Fig. 5)
Method         Top 1   Top 2   Top 3
Sharvit [19]   23/25   21/25   20/25
G and W [7]    25/25   21/25   19/25
Belongie [2]   25/25   24/25   22/25
IDSC+DP [21]   25/25   24/25   25/25
GD             24/25   22/25   20/25
6.3 Matching with Occlusion
To show the performance of the global descriptor under occlusion, we also test it on the Kimia data set. This test contains 15 basic images and 3 shapes derived from each basic shape by rotation with a π/2 orientation difference and at most 30% occlusion; see Fig. 6 for an example. In our experiments, the parameters are the same as in Section 6.2. The retrieval result is summarized as the number of 10%-occlusion, 20%-occlusion and 30%-occlusion closest matches that fall among the shapes derived from the same basic one. The performance of our descriptor is 14/15, 12/15 and 10/15, respectively. We provide good robustness to rotation and to occlusion of less than 20%.
Fig. 6. Sample of occlusion. In each category, the first image is the basic image, and the other three are derived from it with different occlusion and rotation.
6.4 An Extreme Example
In this section we test the GD under extreme occlusion, from 10% to 80%. We use the same 15 images as in Section 6.3 and occlude (cut) them along the x-axis, the y-axis, and the main axis of the shape; see Fig. 7 for examples. A shape with many concavities and convexities is called a complex shape, as in Basic A; the other ones are called simple shapes, as in Basic B. In this test, we rank the occluded shapes by their similarity to the basic one in the same category; the result is in Fig. 8. At the same time, we rank them by human eyes, giving the results Eye A and Eye B. Fig. 9 shows the relation between the contour occlusion percentage and the reciprocal of similarity. Based on repeated experiments on the 15 images, we find that when the occlusion is small, the rankings of the GD and of the human eye are exactly the same. This means our method reasonably simulates human vision to some extent. As the occlusion grows, the human eye becomes confused, at the points labeled with underlines, just like the inflexion points in Fig. 9. In Fig. 9 we can see that the L-curve of a simple contour has an obvious inflexion point, meaning that however the occlusion percentage varies, the possibility of recognition shows little difference; on the contrary, the L-curve of a complex contour has no obvious inflexion point. This may be because complex contours have many features, such as concavities and convexities, so that missing several of them does not affect the entire contour. On the contrary, a simple contour has few features, and the loss of one or two of them may result in a bad recognition result, so the inflexion is much more obvious.
Fig. 7. Sample of extreme occlusion along the x-axis, the y-axis, and the main axis of the shape

Fig. 8. Rank results in extreme occlusion (rank orders of the occluded shapes for Basic A, Eye A, Basic B, and Eye B)
Fig. 9. The relation of contour occlusion percentage and reciprocal of similarity
7 Conclusion and Future Work We have presented a new, biologically inspired approach to shape matching and retrieval. We employ two novel strategies: the zigzag rule, which approximates a shape by an elaborate polygonal curve, and a cost function which combines global configurations as well as local information of the line stimulations as penalties. Combining these with the shape descriptor, which reflects the entire topology of the contour, our matching model is robust under translation, scaling and rotation. Compared with approaches based on local luminance features, our matching gives much more accurate results, and compared with other contour-based matching methods, our model is easier and more efficient. Since the segmentation is based on the pixel level, invariance under arbitrary rotation angles and arbitrary scaling factors is not always the same. Future improvements include making the descriptor more robust to scaling and arbitrary rotation, and extending the model to more complex shapes. The great significance of the GD is that it brings biologically inspired ideas into computer vision, and it has led to results at least comparable with current methods. We believe the
established model will be constantly improved by adding new biological and cognitive details.
References 1. Hubel, D.H., Wiesel, T.N.: Receptive Fields, Binocular Interaction, and Functional Architecture in the Cat’s Visual Cortex. J. Physiology (London) 160 (1962) 106-154 2. Wolfram, S.: University and Complexity in Cellular Automata. Physica D 10 (1) (1984) 1-35 3. Jackson, E.: Perspectives of Nonlinear Dynamics. Cambridge University Press, New York 2 (1990) 454-504 4. Riehle, A., Grum, S., Diesman, M., et al.: Spike Synchronization and Rate Modulation Differentially Involved in Motor Cortical Function. Science 278 (1997) 1959-1953 5. Mainen, Z. F., Sejnowski, T. J.: Reliability of Spike Timing in Neocortical Neurons. Science 268 (1995) 1503-1506 6. Amir, Y., Harel, M.: Cortical Hierarchy Reflected in the Organization of Intrinsic Connections in Macaque Visual Cortex. J. Comp neurol 334 (1) (1993) 19-46 7. Harris, C., Stephens, M.: A Combined Corner and Edge Detector. in Fourth Alvey Vision Conf. (1988) 147-151 8. Schmid, C., Mohr, R.: Local Grayvalue Invariants for Image Retrieval. IEEE Trans. PAMI 19 (1997) 530-534 9. Lowe, David G: Object Recognition from Local Scale-Invariant Features. In ICCV (1999) 1150-1157 10. Khotanzan, A., Hong, Y. H.: Invariant Image Recognition by Zernike Moments. IEEE Trans. PAMI 12 (1990) 489-497 11. Arkin, M., Chew, L.P., Huttenlocher, D.P., Kedem, K., Mitchell, J.S.B.: An Efficiently Computable Metric for Comparing Polygonal Shapes. IEEE Trans. PAMI 13 (1991) 209-206 12. Latecki, L. J., Lakämper, R.: Shape Similarity Measure Based on Correspondence of Visual Parts. IEEE Trans. PAMI 22 (10) (2000) 1185-1190 13. Mokhtarian, F., Abbasi, S., Kittler, J.: Efficient and Robust Retrieval by Shape Content Through Curvature Scale Space. Image Databases and Multi-media search, A.W.M. Smeulders and R. Jain, eds., 51-58, World Scientific, 1997 14. Mokhtarian, F., Mackworth, A. K.: A Theory of Multiscale Curvature-based Shape Representation for Planar Curves. IEEE Trans. PAMI 14 (1992) 789-805 15. Latecki, L. J., Lakamper, R., Eckhardt, U.: Shpae Descriptors for Non-Rigid Shapes with a Single Closed Contour. Proc. IEEE Conf. Computer Vision and Pattern Recognition (2000) 424-429 16. Belongie, S., Malik, J., Puzicha, J.: Shape Matching and Object Recognition Using Shape Contexts. IEEE Trans. PAMI 24 (4) (2002) 509-522 17. Ling, H., Jacobs, D.: Using the Inner-distance for Classification of Articulated Shapes. in Proc. IEEE Conf. Computer Vision and Pattern Recognition, San Diego, CA (2005) 18. Mortensen, E.N, Hongli, D., Shapiro, L.: A SIFT Descriptor with Global Context. in CVPR 1 (2005) 184-190 19. Scott, C., Nowak, R.: Robust Contour Matching via the Order Preserving Assignment Problem. accepted for publication in IEEE Transactions on Image Processing 20. Jonker, R., Volgenant, A.: A Shortest Augmenting Path Algorithm for Dense and Sparse Linear Assignment Problems. Computing 38 (1987) 325-340 21. Latecki, L. J., Lakamper, R.: Convexity Rule for shape Decomposition Based on Discrete Contour Evolution. Computer Vision and Image Understanding 73 (3) (1999) 441-454 22. Dow, M., Nunnally, R.: http: //lcni.uoregon.edu /~mark/ SS_Edges/ SS_Edges.html
Fuzzy Support Vector Machine for EMG Pattern Recognition and Myoelectrical Prosthesis Control
Lingling Chen, Peng Yang, Xiaoyun Xu, Xin Guo, and Xueping Zhang
School of Electrical Engineering and Automation, Hebei University of Technology, 300130 Tianjin, China
[email protected]
Abstract. Toward optional control of a trans-femoral prosthesis and a natural gait, an ongoing investigation of a lower limb prosthesis model with myoelectrical control is presented. In this research, surface electromyographic signals of the lower limb are extracted to serve as a switching signal and are translated into movement information. Considering each muscle's different physiological tendency, a fuzzy support vector regression method is applied to establish an intelligent black box that interprets the physiological signals into accurate information on the knee joint angle. It achieves a comparable or better performance than other methods and provides a more natural gait to the prosthesis user.
1 Introduction
Electromyography (EMG) detects the bioelectrical signals generated by muscles during contraction. Aside from its traditional use in detecting neuromuscular disease and muscle weakness and in modeling muscle movements, applications of EMG have increased tremendously, especially in the field of prosthetics and exoskeletons. Because the technology allows an electrical detection of muscle contractions, these signals can be used as inputs to control artificial limbs or even to help ease a heavy load by acting in accord with the human body. In this sense, surface EMG (SEMG) sensor technology has grown and will probably grow even further in the future. After a limb is amputated, the brain continues to send signals to the remainder of the limb; the EMG signals intended for the movement of the missing limb can potentially be interpreted and used to control a prosthesis [1]. Human locomotion is a complex biological process controlled by the cerebrum and the nerve centers. Although this type of approach has proved successful for arm movement control, it has mostly focused on upper limb prosthesis control. Walking, although seemingly stereotyped, is highly complex, as it integrates equilibrium constraints and forward propulsion in a multi-joint system. In the swing phase, the gait of the prosthesis must be made symmetric to that of the able-side limb to the greatest possible extent, so the crus prosthesis should have the same
acceleration and deceleration course as the able-side leg in the swing phase. The acceleration and deceleration movements are controlled by the muscles of the thigh and crus: the contractions of those muscles produce the knee moment that drives the knee joint, make the knee moment change according to a certain rule, control the flexion angle of the knee joint, and make the crus accelerate forward in the initial swing phase [2]. For a lower limb prosthesis, adjusting the flexion angle of the knee joint at will is the key to the optional movement control of the prosthesis. Another essential requirement is that the prosthesis be safe and reliable: it must avoid "giving way" and adapt to different kinds of emergencies. The support vector machine (SVM) is based on statistical learning theory [3][4] and can be used for pattern classification or regression in tasks such as object recognition, speech recognition, and isolated handwritten digit recognition [5][6]. Therefore, we apply an SVM to classify the on-off state of the artificial knee joint, and the support vector regression (SVR) method is applied to interpret the physiological signals into accurate information on the position and movement of the knee joint. It can achieve a comparable or better performance than other methods and provide a more natural gait to the prosthesis user. In this study, a lower limb prosthesis model with myoelectrical control is presented. This approach can avoid giving way of the prosthesis, and a fuzzy SVR (FSVR) model is established to predict the knee joint angle from the SEMG signals.
2 Myoelectrical Prosthesis
The controlled lower limb prosthesis has special features, some of which are not available, or not well tuned, in conventional prostheses. These features include controlled knee flexion at early stance, high stability during weight bearing with a single-axis knee, controlled knee release at late stance, controlled heel rise at early swing, damped full knee extension at late swing, a damped knee during the stance phase of stair descent and when sitting down, and adaptation to gait speed [7]. In order to address these problems, the structure of the control system is described by the block diagram shown in Fig. 1.
Fig. 1. The prosthesis model's main control source is the SEMG signal: SEMG and sensor signals from the stump and the able-side leg are classified into an on-off (self-lock) signal and used for gait-state identification, which determines the knee joint moment and the electric control of the prosthesis together with the physical quantities used for gait control
2.1 Prosthesis Model with Myoelectrical Control
Because the conditions of amputees differ, controlling the prosthesis with the stump's SEMG signal is very complex, so the SEMG signal sampled from the able-side
leg is mostly used as the control signal. As the essential control requirement of the prosthesis is real-time operation, the method of processing the raw SEMG signal must be simple, fast, and effective. Therefore, the linear envelope of the rectified signal is used as the input of the SVR model. This model can be regarded as a nonlinear function estimator: it accomplishes the nonlinear mapping from the physiological signals to the position and movement of the knee joint and gives an optimal estimate of the knee joint angle.
2.2 Giving Way
The main control task of the prosthesis is the control of the knee joint, in both the standing and the swing periods. The most important demand on a lower limb prosthesis is to avoid "giving way" and tumbling, namely a sudden flexion of the knee joint when the prosthesis is bearing weight (a slight flexion of the knee joint is allowed in the initial stage of standing, but it must be extended subsequently). For instance, the C-Leg uses a sensor under the tiptoe: when the tiptoe applies enough force, the self-locked knee joint opens and the amputee can walk. Here, the SEMG signal sampled from the stump acts as the switching control signal and performs the conversion from standing to walking. It can clearly distinguish the transition from standing to walking, and after classification it can serve as the on-off signal.
3 Gait-State Identification
The support vector machine represents an approach to pattern classification that has attracted a great deal of interest in machine learning. It has succeeded in solving many pattern recognition problems and has performed better than other nonlinear classifiers. One solution is the extraction of the statistical information enclosed in a black-box model. Since neural networks encode the model in a complex, nonlinear mathematical formula, they are not easy to interpret; in contrast to the back propagation network (BP-NN), the SVM is more appropriate for this purpose, given that the support vectors represent the critical samples for the classification task. The approach here is to develop an intelligent black box that takes the physiological signals and interprets them to give accurate information on the knee joint angle.
3.1 Support Vector Regression
Given a set of data points {(x_1, z_1), ..., (x_l, z_l)}, where x_i \in \mathbb{R}^n is an input and z_i \in \mathbb{R} is the target output, the standard form of support vector regression is

\min_{w, b, \xi, \xi^{*}, \varepsilon} \ \frac{1}{2} w^{T} w + C \Big( \nu \varepsilon + \frac{1}{l} \sum_{i=1}^{l} (\xi_i + \xi_i^{*}) \Big),   (1)

subject to

(w^{T} \phi(x_i) + b) - z_i \le \varepsilon + \xi_i, \qquad z_i - (w^{T} \phi(x_i) + b) \le \varepsilon + \xi_i^{*},   (2)

\xi_i, \xi_i^{*} \ge 0, \ i = 1, \dots, l, \qquad \varepsilon \ge 0.   (3)

The dual is

\min_{\alpha, \alpha^{*}} \ \frac{1}{2} (\alpha - \alpha^{*})^{T} Q (\alpha - \alpha^{*}) + z^{T} (\alpha - \alpha^{*}), \qquad \text{subject to } e^{T} (\alpha - \alpha^{*}) = 0,   (4)

e^{T} (\alpha + \alpha^{*}) \le C \nu, \qquad 0 \le \alpha_i, \alpha_i^{*} \le C/l, \ i = 1, \dots, l,   (5)

where Q_{ij} = K(x_i, x_j) \equiv \phi(x_i)^{T} \phi(x_j), e is the vector of all ones, C > 0 is the upper bound, and K(x_i, x_j) is the kernel. The training vectors x_i are mapped into a higher-dimensional space by the function \phi. The decision function is

f(x) = \sum_{i=1}^{l} (-\alpha_i + \alpha_i^{*}) K(x_i, x) + b.   (6)
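As an illustration of the regression machinery above (and not part of the original study), the following minimal sketch fits an RBF-kernel nu-SVR that maps SEMG envelope features to the knee joint angle, assuming scikit-learn is available. The feature and angle arrays are hypothetical placeholders; gamma stands in for the kernel parameter of Eq. (15) reported later (0.4), C = 1.5 matches the reported regularization parameter, and nu = 0.5 is an arbitrary illustrative choice.

```python
# Minimal sketch (not the authors' code): nu-SVR with an RBF kernel mapping
# SEMG envelope features to knee joint angle, assuming scikit-learn.
import numpy as np
from sklearn.svm import NuSVR

rng = np.random.default_rng(0)

# Hypothetical data: 500 samples, 4 muscle-channel envelope features each,
# and the corresponding knee joint angle in degrees.
X = rng.random((500, 4))          # SEMG envelope features (placeholder)
z = 60.0 * rng.random(500)        # knee joint angle targets (placeholder)

# gamma plays the role of the kernel parameter in
# K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2).
model = NuSVR(kernel="rbf", C=1.5, nu=0.5, gamma=0.4)
model.fit(X, z)

z_hat = model.predict(X)
rmse = np.sqrt(np.mean((z - z_hat) ** 2))
print(f"training RMSE: {rmse:.3f} degrees")
```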
3.2 Fuzzy Support Vector Regression
Each muscle makes a different contribution to the prediction of the knee joint angle. For standard SVR, however, all input vectors have the same influence on the model. Therefore, we divide the whole forecasting process into several subsystems according to the different muscles. A series of sub-models is applied to estimate the output of every subsystem, and the knee joint angle is the weighted sum of the subsystems' outputs. Using the FSVR to establish this multi-model estimation (Fig. 2) improves the modelling, heightens the accuracy of the model, and improves its predictive and generalization ability [9]. Utilizing fuzzy logic theory, the contribution of each muscle can be encoded in a fuzzy membership. To do this, we define a one-dimensional membership function p_i. The original decision values range from 0 to +1, where values closer to +1 indicate muscles that contribute most to the movement of the knee joint and values closer to 0 indicate smaller contributions.
Fig. 2. Translation of the SEMG signals recorded from several muscles into the knee joint angle by the FSVR model: each muscle channel, weighted by its fuzzy membership p_i, feeds a sub-model of the FSVR, and the overall output is y = \sum_{i=1}^{M} f_i(x, p_i). The choice of the fuzzy membership depends upon the expert system and the knee joint angle at the previous time step.
The membership p_i is an adjustable parameter that changes as the knee joint angle varies, where i indexes the corresponding muscle and M is the total number of muscles. The process output is the weighted sum of the M sub-models' outputs:

y_j = \sum_{i=1}^{M} p_i f_i(x_j), \quad j = 1, 2, \dots, l,   (7)

where \sum_{i=1}^{M} p_i = 1 and 0 \le p_i \le 1, i = 1, 2, \dots, M. The modeling problem can then be converted into solving

\min_{p_i, f_i} \ \sum_{j=1}^{l} \Big( y_j - \sum_{i=1}^{M} p_i f_i(x_j) \Big)^2.   (8)

Similar to the standard SVR algorithms, the linear loss function \sum_{j=1}^{l} (\xi_j + \xi_j^{*}) is applied in place of the quadratic loss function \sum_{j=1}^{l} (\xi_j^2 + (\xi_j^{*})^2). Using f_i(x_j) = \omega_i^{T} \varphi(x_j) + b and the \varepsilon-insensitive loss function, equation (8) is equivalent to

\min_{\omega_i, b, \xi, \xi^{*}} \ J = \frac{1}{2} \sum_{i=1}^{M} \omega_i^{T} \omega_i + C \sum_{j=1}^{l} (\xi_j + \xi_j^{*}),   (9)

subject to

y_j - \sum_{i=1}^{M} p_i (\omega_i^{T} \varphi(x_j) + b) \le \varepsilon + \xi_j, \qquad \sum_{i=1}^{M} p_i (\omega_i^{T} \varphi(x_j) + b) - y_j \le \varepsilon + \xi_j^{*},   (10)

\xi_j, \xi_j^{*} \ge 0, \quad j = 1, 2, \dots, l.   (11)

To solve this optimization problem, we construct the Lagrangian

L = \frac{1}{2} \sum_{i=1}^{M} \omega_i^{T} \omega_i + C \sum_{j=1}^{l} (\xi_j + \xi_j^{*}) - \sum_{j=1}^{l} \alpha_j \Big( \varepsilon + \xi_j - y_j + \sum_{i=1}^{M} p_i (\omega_i^{T} \varphi(x_j) + b) \Big) - \sum_{j=1}^{l} \alpha_j^{*} \Big( \varepsilon + \xi_j^{*} + y_j - \sum_{i=1}^{M} p_i (\omega_i^{T} \varphi(x_j) + b) \Big) - \sum_{j=1}^{l} (\eta_j \xi_j + \eta_j^{*} \xi_j^{*}).   (12)

The output of estimation sub-model i is

f_i(x, p_i) = \sum_{j=1}^{l} p_i (\alpha_j - \alpha_j^{*}) k(x_j, x) + b.   (13)

The output of the model is

y = \sum_{i=1}^{M} f_i(x, p_i).   (14)
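The per-muscle weighting can also be illustrated with a simplified stand-in for the FSVR (again not the authors' implementation): one independently trained SVR per muscle channel, combined by fixed fuzzy memberships p_i that sum to one, in the spirit of (7) and (14). The joint optimization (9)-(12) and the expert-system adaptation of p_i are not reproduced; the function names, parameters, and data below are illustrative assumptions.

```python
# Simplified sketch of the fuzzy multi-model idea: one SVR sub-model per
# muscle channel, combined by fuzzy membership weights p_i. Sub-models are
# trained independently rather than via the joint problem (9)-(12), and the
# memberships are fixed instead of adapted by an expert system.
import numpy as np
from sklearn.svm import SVR

def fit_fsvr(channel_features, y):
    """channel_features: list of (l, d_i) arrays, one per muscle; y: (l,) angles."""
    return [SVR(kernel="rbf", C=1.5, gamma=0.4).fit(Xi, y)
            for Xi in channel_features]

def predict_fsvr(models, channel_features, memberships):
    # weighted sum of the sub-model outputs, cf. (7) and (14)
    assert np.isclose(sum(memberships), 1.0)
    return sum(p * m.predict(Xi)
               for p, m, Xi in zip(memberships, models, channel_features))

# Hypothetical example with M = 4 muscles and l = 300 samples.
rng = np.random.default_rng(1)
Xs = [rng.random((300, 2)) for _ in range(4)]
y = 60.0 * rng.random(300)
p = [0.4, 0.3, 0.2, 0.1]                 # fuzzy memberships, sum to 1
models = fit_fsvr(Xs, y)
y_hat = predict_fsvr(models, Xs, p)
```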
4 Experiment Results
The SEMG signal was recorded with the Infiniti system, a biofeedback and physiological monitoring system. It can record the SEMG signal synchronized with the knee joint angle so that their temporal relationship can be studied. The SEMG signals are captured from four muscles of the lower limb: rectus femoris, vastus lateralis, biceps femoris, and tensor fasciae latae. The corresponding knee joint angles are also recorded. Five subjects were asked to walk at 40 steps per minute. The raw SEMG signals over 4 gait cycles are depicted in Fig. 3.
Fig. 3. The SEMG signals (μV) from the rectus femoris, vastus lateralis, biceps femoris, and tensor fasciae latae, together with the corresponding knee joint angle. The EMG data are preprocessed into signal features, and the signal envelopes are used as input to the SVM.
In this study, the root mean square (RMS) of the rectified EMG signal with a time constant of 25 ms is applied to minimize the non-reproducible part of the signal and to outline the mean trend of its development [8]. Normalization against a reference contraction is applied to overcome the "uncertain" character of the microvolt-scale parameters. Based on a threshold that defines when a muscle is "on", the on/off timing pattern of each muscle in the gait cycle can be applied to control the opening of the prosthesis. Simulation results show that the SEMG signal and the knee joint angle are strongly related and that this algorithm obtains a good forecast. Models were established by BP-NN, SVR, and FSVR, respectively. In order to compare their predictive performance, both SVR and FSVR used the RBF (radial basis function) kernel

K(x_i, x_j) = \exp(-\nu \| x_i - x_j \|^2).   (15)
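A sketch of this preprocessing chain (rectification, 25 ms moving-RMS smoothing, normalization to a reference contraction, and threshold-based on/off detection) might look as follows; the 1 kHz sampling rate and the 20% activation threshold are assumptions made only for the example.

```python
# Sketch of the SEMG preprocessing described above: rectification, moving-RMS
# smoothing with a 25 ms window, amplitude normalization against a reference
# contraction, and a simple on/off muscle-activity decision.
import numpy as np

def rms_envelope(emg, fs=1000, window_ms=25):
    """Moving RMS of the rectified EMG signal."""
    n = max(1, int(fs * window_ms / 1000))
    kernel = np.ones(n) / n
    return np.sqrt(np.convolve(np.abs(emg) ** 2, kernel, mode="same"))

def normalize(envelope, reference_envelope):
    """Express the envelope relative to a reference contraction."""
    return envelope / np.max(reference_envelope)

def muscle_on(envelope_norm, threshold=0.2):
    """On/off timing pattern usable as the switching (self-lock) signal."""
    return envelope_norm > threshold

# Hypothetical usage with simulated microvolt-scale data.
fs = 1000
emg = 50e-6 * np.random.default_rng(2).standard_normal(4 * fs)
env = rms_envelope(emg, fs)
env_n = normalize(env, env)
on_off = muscle_on(env_n)
```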
Fig. 4. Predicted knee joint angle (real vs. predicted, in degrees) and the corresponding prediction error over the sample number, by the SVR model (left) and by the FSVR model (right)
Table 1. Comparison of BP-NN, SVR, and FSVR

Error                              BP-NN     SVR       FSVR
Root Mean Square Error / Degree    9.954     6.1452    3.457
Maximum Positive Error / Degree    30.878    18.565    14.154
Maximum Negative Error / Degree    -26.489   -15.465   -12.154
The kernel parameter was ν = 0.4 and the regularization parameter C = 1.5. The predictions of the SVR and FSVR methods are depicted in Fig. 4, and their comparison is shown in Table 1. FSVR gives better predictions than the other methods: its root mean square error and maximum errors are clearly smaller. For predicting the knee joint angle it is superior in terms of the smoothness of the curve and generalization ability, but the predicted values still need correction. Since the SEMG signals are influenced by many uncertain factors, the prediction should be adjusted according to experience: if there is an abrupt spike in the angle curve, that prediction is not reliable and can be replaced by the average of the preceding and following values.
5 Conclusion
This myoelectric control system can satisfy the requirements of optional control and safety. The prime advantage of a myoelectrical prosthesis is that it uses the muscular system to control the prosthesis in a physiological way. This kind of prosthesis can adapt to external factors, the position of the prosthesis, and shifts of the body. However, there are still many problems in practical application, such as real-time control: the algorithm must be made faster to satisfy the prosthesis user's normal gait. At the same time, the judicious combination of EMG signals and physical signals is also an important problem.
Applying the FSVR algorithm to construct the angle estimator not only provides satisfactory approximation and generalization properties, but also achieves superior performance to the BP-NN and SVR modeling methods, because it can fully account for every muscle's physiological tendency. More attention, however, must be paid to the selection of the membership function, as well as to the model selection of the SVM: with more suitable model selection, a better classification result can be reached after training.

Acknowledgments. This work was supported by the National Natural Science Foundation of China (60575009). The Research Institute of Prosthetics and Orthotics of the Ministry of Civil Affairs of P. R. China provided great help with the experiments.
References
1. Mordaunt, P., Zalzala, A.S.M.: Towards an Evolutionary Neural Network for Gait Analysis. IEEE (2002) 1922-1927
2. Farina, D., Merletti, R., Nazzaro, M.: Effect of Joint Angle on EMG Variables in Leg and Thigh Muscles. IEEE Trans. Engineering in Medicine and Biology 20(6) (2001) 62-71
3. Burges, C.J.C.: A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery 2 (1998) 121-167
4. Cortes, C., Vapnik, V.N.: Support Vector Networks. Machine Learning 20 (1995) 273-297
5. Tanaka, K., Kimuro, Y.: Motion Sequence Scheme for Detecting Mobile Robots in an Office Environment. Computational Intelligence in Robotics and Automation 1 (2003) 145-150
6. Osowski, S., Hoai, L.T., Markiewicz, T.: Support Vector Machine Based Expert System for Reliable Heartbeat Recognition. IEEE Trans. Biomedical Engineering 51(4) (2004) 582-589
7. Cheron, G., Leurs, F., Bengoetxea, A., Draye, J.P., Destre, M., Dan, B.: A Dynamic Recurrent Neural Network for Multiple Muscles Electromyographic Mapping to Elevation Angles of the Lower Limb in Human Locomotion. Journal of Neuroscience Methods 129 (2003) 95-104
8. Ferdjallah, M., Myers, K., Starsky, A.: Dynamic Electromyography. Proc. Pediatric Gait Conference (2000) 99-108
9. Feng, R., Shen, W., Zhang, Y., Shao, H.: Multiple Modeling Approach Using Fuzzy Support Vector Machines. Control and Decision 18(6) (2003) 646-650
Classification of Obstructive Sleep Apnea by Neural Networks
Zhongyu Pang(1), Derong Liu(1), and Stephen R. Lloyd(2)
(1) Department of Electrical and Computer Engineering, University of Illinois at Chicago, Chicago, IL 60607-7053, USA
[email protected], [email protected]
(2) Center for Narcolepsy, Sleep, and Health Research, College of Nursing, University of Illinois at Chicago, Chicago, IL 60612-7350, USA
[email protected]
Abstract. Electroencephalogram (EEG) recordings are a common tool for exploring brain activity, from concentrated cognitive effort to sleepiness. Pupil behavior can provide some information regarding alertness, and sleepiness is also reflected in EEG energy: intrusion of EEG theta wave activity into the beta activity of active wakefulness has been interpreted as ensuing sleepiness. This paper develops a signal classification method that is capable of differentiating subjects with the sleep disorder of obstructive sleep apnea (OSA), which causes excessive daytime sleepiness (EDS), from normal control subjects who do not have a sleep disorder. Theta energy ratios are calculated from 2-second sliding windows by the Fourier transform. An artificial neural network based on a modified ART2 is utilized to identify subjects with OSA from a combined group of subjects including healthy controls. This grouping from the neural network is then compared with the actual diagnostic classification of subjects as OSA or healthy controls and is found to be 91% accurate in differentiating between the two groups.
1 Introduction

Sleep plays an important role in the history of neuroscience and in the lives of human beings. Excessive daytime sleepiness caused by sleep apnea can have a disruptive, embarrassing, or even dangerous impact on daily living activities. Symptoms of OSA include loud and irregular snoring, restless sleep, and daytime sleepiness [17]. In addition, episodes of sleep apnea are often associated with oxygen desaturation, and repeated episodes of desaturation can eventually lead to additional medical complications. Clinically [17], sleep apnea is diagnosed when those symptoms are present and an all-night sleep study reveals the presence of at least 5 episodes of apnea and/or hypopnea per hour of sleep. An apnea episode occurs when airflow is decreased by at least 50% and lasts for more than 10 seconds. A hypopnea episode occurs when airflow is reduced by at least 30%, the oxygen level in the blood is reduced by at least 4%, and the airflow reduction lasts for more than 10 seconds [23].
There are two classes of detection methods that have been applied to subjects with OSA. One class is based on breathing signals, while the other is based on electromyogram (EMG) signals. Several research papers [9,10,19,21,22] have examined methods belonging to the first class. Assessment of the respiratory signal is a key step in detecting OSA. Respiratory impedance can be determined by a forced oscillation technique (FOT) [10], which can be considered a proper noninvasive method to diagnose sleep apnea. FOT relies on applying an oscillatory pressure signal to the respiratory system and determines the respiratory impedance using nasal pressure and airflow signals. Currently, FOT is a promising noninvasive method for measuring respiratory impedance [9,19]. Using FOT during sleep, Yen et al. [22] estimated airway impedance with high specificity and reliability; they then used an artificial neural network to classify people with and without hypopnea/apnea based on this respiratory signal. Gumery et al. [12] developed a device to measure the surface EMG time latency reflex of the genioglossus muscle stimulated by time- and amplitude-calibrated negative pharyngeal pressure drops. Then, based on a Berkner transform, they built a multi-scale detector and tested those detectors in terms of accuracy and robustness using signals acquired from apneic patients and healthy controls. EEG measurement can be used to detect brain activity, since different mental activities produce different EEG patterns. Millán et al. [18] present a neural classifier to recognize mental tasks and achieve about 70% correct recognition. Researchers have found that fluctuations in wakefulness can be examined with EEG measurements from active subjects with eyes open and engaged in their usual awake activities [1,2,13]. In these situations, intrusions of theta activity into the beta activity of active wakefulness have been interpreted as ensuing sleepiness. Subjects with OSA and healthy controls may have different alertness levels under the same conditions, and the pupil response patterns of subjects with and without sleep disorders are different. In this paper, we develop a novel method to detect subjects with the sleep disorder of OSA based on EEG. We compare subjects with sleep disorders to healthy controls and find that they have different theta wave responses under the same conditions. This difference can be used for classification by artificial neural networks, specifically modified ART2 neural networks. We tested our algorithm using one set of subjects with OSA and healthy controls. This methodology may eventually lead to new diagnostic methods for the sleep disorder of OSA. This paper is organized as follows. In Section 2, we present the subjects and the experimental data collection. In Section 3, data preprocessing is described and our method for detecting excessive daytime sleepiness associated with sleep disorders is developed. In Section 4, simulation results are given. In Section 5, conclusions are presented, and future perspectives are discussed.
2 Subjects and Experimental Data

Data from 5 untreated OSA (obstructive sleep apnea) subjects and 6 healthy controls were collected approximately 12 h after their mid-sleep period to maximize the
probability of sleepiness occurring. This mid-afternoon increase in somnolence, commonly believed to be a post-prandial phenomenon, has been shown to be unrelated to food intake [5]. Data collection was performed at the Center for Narcolepsy Research at the University of Illinois at Chicago. The alertness level testing, conducted with a pupillometry system built at the Mayo Clinic [16], consists of 1 minute of recording of pupil diameter in the light followed by 14 minutes in a quiet, dark room. The analog pupil diameter data are digitized at a rate of 256 Hz using an A/D converter and saved to a PC in a binary format [16]. The EEG filter was set at 0.3 Hz for the high pass and 30 Hz for the low pass, and the EMG filters were set at 10 and 100 Hz, respectively. EEG/PSG (polysomnography) data were also digitized at 256 Hz and stored with the pupillometry data on a PC.
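For illustration, the acquisition settings above could be reproduced along the following lines, assuming SciPy; the Butterworth filter order is an assumption, since only the cutoff frequencies and the 256 Hz sampling rate are specified.

```python
# Sketch of the filtering settings described above, assuming SciPy.
# The Butterworth order (4) is an assumption; only the cutoffs and the
# 256 Hz sampling rate are given in the text.
import numpy as np
from scipy.signal import butter, filtfilt

FS = 256  # Hz, A/D sampling rate for pupil, EEG and EMG channels

def bandpass(signal, low_hz, high_hz, fs=FS, order=4):
    b, a = butter(order, [low_hz, high_hz], btype="bandpass", fs=fs)
    return filtfilt(b, a, signal)

rng = np.random.default_rng(3)
raw_eeg = rng.standard_normal(60 * FS)   # one minute of hypothetical EEG
raw_emg = rng.standard_normal(60 * FS)

eeg = bandpass(raw_eeg, 0.3, 30.0)       # EEG: 0.3 Hz high pass, 30 Hz low pass
emg = bandpass(raw_emg, 10.0, 100.0)     # EMG: 10-100 Hz
```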
3 Method for Detecting Sleep Disorders Based on ART2 Neural Networks

3.1 Data Pre-processing

A window of 2 seconds, which is a common technique in this field, is used to process the EEG data. Since the measurement records include pupil diameter and EEG in the specific environment, the data for the first 3 minutes of recording were eliminated from analysis, because the pupil dilates and oscillates when the lights are extinguished and can take 2-3 minutes of darkness to adapt and reach a larger stable diameter [15]. For this type of data, pupil diameters can provide some useful information about excessive daytime sleepiness, and some researchers focus their work on pupil size only. According to the definition of the theta wave in EEG, its main rhythm is between 4 Hz and 8 Hz. When people are awake, the main rhythm in the EEG is the beta and/or alpha wave; when a person becomes sleepy, the theta wave becomes the main rhythm. Therefore, theta wave activity can be considered an indicator of sleepiness, and the amount of theta wave activity has been shown to increase during episodes where people demonstrate a decreasing alertness level. Accordingly, theta energy was calculated in 2-second windows from the original data of the three EEG channels C3/A2, O1/A2, and P3/O1. The calculation was realized by the Fourier transform for the subjects with OSA and the accompanying controls; this procedure provides insight into the pattern of theta wave energy. Noise always exists in the EEG recording, so for a good representation of the theta energy the average energy is used, which indicates the amount of theta wave present. Theta energy ratios were then obtained relative to the mean power of the 4th minute; see Fig. 1 for the theta energy ratios of an OSA subject and a healthy control. In order to recognize the general change of the theta ratio, we use a regression method to capture changes over time under the specific circumstances. The regression analysis is based on the method of Chatterjee and Hadi [6], expressed by

Y = X\beta + \epsilon, \qquad \epsilon \sim N(0, \sigma^2 I),   (1)

where Y is the dependent variable (output), X is the independent variable (input data), and \epsilon is the error. Solving for \beta from (1) by least squares gives the predicted data.
Fig. 1. Theta energy ratios (amplitude of theta energy vs. time) of a subject with OSA (top) and a healthy control (bottom)
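A minimal sketch of this pre-processing, assuming NumPy, is given below: theta-band (4-8 Hz) power is computed by FFT over non-overlapping 2-second windows, normalized by the mean power of the 4th minute, and smoothed by a block-wise linear least-squares fit of (1). The window bookkeeping and the block length are illustrative assumptions rather than the authors' exact procedure.

```python
# Sketch (not the authors' code) of the theta-energy-ratio computation and
# the least-squares trend: 2-second FFT windows, theta band 4-8 Hz, ratio to
# the mean theta power of the 4th minute, and a piecewise linear regression.
import numpy as np

FS = 256          # Hz
WIN = 2 * FS      # 2-second window

def theta_power(window, fs=FS):
    spectrum = np.abs(np.fft.rfft(window)) ** 2
    freqs = np.fft.rfftfreq(len(window), d=1.0 / fs)
    band = (freqs >= 4.0) & (freqs <= 8.0)
    return spectrum[band].sum()

def theta_ratios(eeg):
    powers = np.array([theta_power(eeg[i:i + WIN])
                       for i in range(0, len(eeg) - WIN + 1, WIN)])
    # with 2-s windows, the 4th minute of the recording spans windows 90-119
    baseline = powers[90:120].mean() if len(powers) >= 120 else powers.mean()
    return powers / baseline

def regression_trend(ratios, n_points=30):
    """Block-wise least squares, beta = argmin ||X beta - y||, cf. (1)."""
    fitted = np.empty_like(ratios)
    for start in range(0, len(ratios), n_points):
        y = ratios[start:start + n_points]
        X = np.column_stack([np.ones(len(y)), np.arange(len(y))])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        fitted[start:start + n_points] = X @ beta
    return fitted
```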
3.2 ART2 Neural Networks
Adaptive resonance theory 2 (ART2) neural networks [4] were designed for both analog and digital inputs in 1987. ART2 has been widely used to identify patterns in various fields, e.g., Suzuki [20] used neural networks based on ART2 to recognize QRS-waves from electrocardiogram (ECG). The present paper is based on ART2 neural networks and modified learning functions to adapt to the input patterns. An ART2 neural network [4,8] consists of two subsystems: An attentional subsystem and an orienting subsystem. The attentional subsystem has two layers, F1 and F2. F1 is made up of three sub-layers. Here three sub-layers of F1 are necessary for analog input patterns since the differences between possible signals with particular patterns may be much smaller for analog inputs
than for binary inputs, which are used to represent features of signals. A few equations describe ART2; most of the equations we use are the same as those in the original paper [4]. When the resonant conditions existing in the network are below the threshold set by the vigilance parameter, the memory is activated and the long-term memory (top-down and bottom-up weights) adaptive process described next is also activated. The following equations describe the update relationship between the third layer of F1, driven by an input signal, and the activated category layer F2.

Bottom-up long-term memory trace (F1 → F2):

\frac{d}{dt} z_{ij} = f_{cn}(y_j) [p_i - z_{ij}],   (2)

where p_i is the ith output of the third layer of F1, y_j is the output of the jth activated category, and f_{cn}(y_j) is a function given in [4].

Top-down long-term memory trace (F2 → F1):

\frac{d}{dt} z_{ji} = f_{cn}(y_j) [p_i - z_{ji}].   (3)
When the resonant conditions existing in the network exceed the threshold set by the vigilance parameter, we modify the above update equations in order to avoid forgetting all the information obtained before: the memory is updated with the average of all long-term memory (LTM) traces associated with the same winner, while each individual input still obtains its own LTM through equations (2) and (3). The memory update is described as follows.

Bottom-up long-term memory trace (F1 → F2):

z_{ij} \Leftarrow \frac{n-1}{n} z_{ij} + \frac{1}{n} z_{ij}.   (4)

Top-down long-term memory trace (F2 → F1):

z_{ji} \Leftarrow \frac{n-1}{n} z_{ji} + \frac{1}{n} z_{ji}.   (5)

In (4) and (5), n is the number of subjects associated with the winner j, \frac{n-1}{n} z_{ij} represents the previous weights, and \frac{1}{n} z_{ij} is the new weight contributed by a new input. According to the original paper, there is a clear procedure along the feedforward and feedback paths in the layers of F1 for calculating all the signals in F1, but the procedure for obtaining a reset signal is not definite; in our system, the reset signal after the third layer of F1 is updated with the feedback signal from layer F2.

3.3 Our System and Parameter Selection

Since the vigilance parameter ρ decides the level of similarity required between input signals in the same category, more categories will be obtained when ρ is large (e.g., close to 1) and the other parameters are kept the same. The order of the input signals also has a certain effect on the final classification results, because the original ART2 algorithm has a forgetting property. In order to solve this problem, a large ρ is chosen so that only signals that are similar
enough will be grouped together. Based on equations (4) and (5), a mean signal can be obtained for each group. After that, a second ART2 with a different set of parameters is used to classify the grouped signals from the first ART2. The parameter choices for our method are based on the original ART2 papers [3,4], where their relationships have been derived and limits have been set.
Fig. 2. Architecture of the three ART2 networks in series: ART2 (I), (II), and (III), each with its own parameter set (a, b, c, d, e, ρ)
The vigilance parameter has a critical effect on the classification results. A larger value of the vigilance parameter tends to separate the inputs into more categories, in which case only very similar subjects can be grouped together; on the other hand, if its value is too small, most inputs go into one group. We therefore use three networks in a hierarchy, as in Fig. 2, for the subjects with OSA, in order to avoid missing subjects due to the choice of the vigilance parameter and to obtain more precise classifications. In the third ART2, however, we do not follow the traditional procedure; instead, we separate all subjects into two groups based on their similarity parameters, since the larger the similarity parameter is, the closer the subjects are to each other.
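The cascade idea can be illustrated with a highly simplified stand-in (it is not a full Carpenter-Grossberg ART2 implementation): a vigilance-gated nearest-prototype clusterer whose prototype update mirrors the running average of (4)-(5), applied in three stages with different vigilance values as in Fig. 2. The similarity measure, the vigilance values, and all names below are assumptions made only for illustration.

```python
# Simplified stand-in for the three-ART2 cascade (not a full ART2).
import numpy as np

def art_like_cluster(inputs, rho):
    """Vigilance-gated grouping: similarity below rho opens a new category."""
    prototypes, counts, labels = [], [], []
    for x in inputs:
        x = x / np.linalg.norm(x)
        sims = [float((p / np.linalg.norm(p)) @ x) for p in prototypes]
        j = int(np.argmax(sims)) if sims else -1
        if j >= 0 and sims[j] >= rho:
            n = counts[j] + 1
            # running-average LTM update, cf. (4)-(5): z <- (n-1)/n * z + 1/n * x
            prototypes[j] = (n - 1) / n * prototypes[j] + x / n
            counts[j] = n
        else:
            prototypes.append(x)
            counts.append(1)
            j = len(prototypes) - 1
        labels.append(j)
    return np.array(labels), prototypes

def cascade(inputs, rhos=(0.99, 0.97, 0.9)):
    """Three stages; each stage clusters the mean signals of the previous one."""
    data, trace = np.asarray(inputs, dtype=float), []
    for rho in rhos:
        labels, _ = art_like_cluster(data, rho)
        trace.append(labels)
        # replace each group by its mean signal for the next stage
        data = np.array([data[labels == k].mean(axis=0)
                         for k in range(labels.max() + 1)])
    return trace
```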
4 Results

A total of 11 subjects are used to test our neural network algorithm, including 6 healthy controls and 5 subjects with obstructive sleep apnea. The difference between the theta-wave energy ratio distributions of a healthy subject and a subject with OSA is not obvious, and it is not possible to distinguish them directly from the energy ratios, so we process them further with the linear regression method. Fig. 3 shows some regression results with different numbers of points: the two top panels show the regression results of a healthy subject with 20 and 30 points, and the bottom panels show those of a subject with OSA with 20 and 30 regression points. The theta-wave energy ratio is computed from the original data in 2-second sliding windows, from which artifacts such as eye blinking have been removed. The regression procedure makes the change of the theta energy ratio more evident.
Fig. 3. Regression results of a healthy subject and an OSA subject. The first two figures are for a healthy subject with regression points 20 and 30, respectively. The bottom two figures are for an OSA subject with regression points of 20 and 30, respectively.
Fig. 4. Performance of a single ART2 and of the three-ART2 series (correct classification rate vs. number of regression points). The solid line represents the performance of our system with three ART2 neural networks; the dashed line represents the performance of a single traditional ART2.

A single ART2 neural network, even followed by one additional ART2, cannot identify all subjects: more subjects go into the wrong group, because under a fixed vigilance parameter ρ some subjects are split into extra categories or missed as noise.
Therefore, we use three ART2 networks in series. If the parameter ρ is set slightly too high, more subjects remain unclassified; on the other hand, if ρ is set slightly too low, most subjects go into the same group and the precision of the classification is reduced. We find that the number of regression points also affects the results: the more points used in the regression, the higher the vigilance parameter ρ has to be set in order to get good classification results, because more points make the curves flatter and more stable. To compare one ART2 with the three-ART2 series, we plot their performance in Fig. 4, where the vertical axis is the percentage of successful classification and the horizontal axis is the number of points used in the regression. From Fig. 4, we find that the three-ART2 series achieves a much better result than a single ART2 under the optimal vigilance parameter ρ. The reason lies in the architecture of ART2, which reflects the similarity of signals: a single vigilance parameter ρ may be proper for two different inputs, but when the number of inputs is large one fixed vigilance parameter is not appropriate, whereas the three-ART2 series allows us to choose different parameters at different stages. The best performance is obtained with a regression of 30 points. In this case, the first ART2 obtains 8 categories, the second ART2 obtains 3 categories, and the third ART2 obtains 2 categories, and the system correctly identifies 10 of the 11 subjects. It misses one subject with OSA; on inspecting that subject's data, we noted a change of pupil diameter different from the other subjects: normally the pupil diameter becomes smaller with time during data collection, but for this subject it becomes largest in the last 2 minutes. The simultaneous EEG should show a similar response, so such a change might affect the final classification result.
The following example is from our simulation with a regression of 30 points. Three ART2 neural networks are used, and the 11 inputs come from the 11 subjects. After the first ART2 network, the 11 inputs fall into 8 categories, since some very similar inputs are grouped together. We take the average of the inputs in the same group to obtain 8 inputs for the second ART2. Three categories are obtained after the second ART2, including 2 large groups and 1 small group. The same averaging strategy is applied to these three groups to obtain three inputs for the third ART2. Finally, two groups are reached after the third ART2, and we check the status of each subject in the two groups to obtain the percentage of correct classification.
5 Conclusion

We have shown that ART2 neural networks in series can successfully classify subjects with and without OSA, based on the idea that patients with a sleep disorder have a different level of wakefulness from healthy people. Some studies of sleepiness in subjects with OSA have found that participants were unaware of the extent of their sleepiness under the same circumstances [7,11]. Patterns of the theta energy ratio in the EEG can reflect the difference between sleep disorder patients and healthy people, since there is good evidence that rising theta EEG activity is a sign of increasing sleepiness [14]. A series of ART2 neural networks is necessary to obtain a precise classification, in order to eliminate the effects of input ordering and to group the most similar subjects together. Our experiment shows that a hierarchical system of ART2 neural networks can improve the precision of classification over that of a single ART2 network, achieving 91% accuracy in separating subjects with OSA from controls. The reason is that our system has more flexibility to adapt to the input patterns.

Acknowledgments. The data collection for this research was supported by NIH grant R15 NR04030 and by Mr. J. A. Piscopo.
References
1. Akerstedt, T., Gillberg, M.: Subjective and Objective Sleepiness in the Active Individual. International Journal of Neuroscience 52 (1990) 29-37
2. Broughton, R., Yan, H., Boucher, B.: Effects of One Night of Sleep Deprivation on Quantified EEG Measures. Journal of Sleep Research 7, suppl. 2 (1998) 32
3. Carpenter, G.A., Grossberg, S.: A Massively Parallel Architecture for a Self-Organizing Neural Pattern Recognition Machine. Computer Vision, Graphics, and Image Processing 37 (1987) 54-115
4. Carpenter, G.A., Grossberg, S.: ART 2: Self-Organization of Stable Category Recognition Codes for Analog Input Patterns. Applied Optics 26(23) (1987) 4919-4930
5. Carskadon, M.A., Dement, W.C.: Multiple Sleep Latency Test During the Constant Routine. Sleep 15 (1992) 396-399
6. Chatterjee, S., Hadi, A.S.: Influential Observations, High Leverage Points, and Outliers in Linear Regression. Statistical Science 1 (1986) 379-393
7. Chervin, R.D., Guilleminault, C.: Obstructive Sleep Apnea and Related Disorders. Neurologic Clinics 14 (1996) 583-609
8. Davenport, M.P., Titus, A.H.: Multilevel Category Structure in the ART-2 Network. IEEE Transactions on Neural Networks 15(1) (2004) 145-158
9. Davis, K.A., Lutchen, K.R.: Respiratory Impedance Spectral Estimation for Digitally Created Random Noise. Annals of Biomedical Engineering, Boston, MA: Department of Biomedical Engineering, Boston University (1991) 179-195
10. DuBois, A.B., Brody, A.W., Lewis, D.H., Burgess, B.F.: Oscillation Mechanics of Lungs and Chest in Man. Journal of Applied Physiology 8 (1956) 587-594
11. Engleman, H.M., Douglas, W.S.: Under Reporting of Sleepiness and Driving Impairment in Patients with Sleep Apnea/Hypopnea Syndrome. Journal of Sleep Research 6 (1997) 272-275
12. Gumery, P.Y., Roux-Buisson, H., Meignen, S., Comyn, F.L., Dematteis, M., Wuyam, B., Pepin, J.L., Levy, P.: An Adaptive Detector of Genioglossus EMG Reflex Using Berkner Transform for Time Latency Measurement in OSA Pathophysiological Studies. IEEE Transactions on Biomedical Engineering 52(8) (2005) 1382-1389
13. Hasan, J.: Past and Future of Computer-Assisted Sleep Analysis and Drowsiness Assessment. Journal of Clinical Neurophysiology 13 (1996) 295-313
14. Horne, J.A., Reyner, L.A.: Driver Sleepiness. Journal of Sleep Research 4 (1995) 23-29
15. Kollarits, C.R., Kollarits, F.J., Schuette, W.H.: The Pupil Dark Response in Normal Volunteers. Current Eye Research 2(4) (1982) 255-259
16. McLaren, J.W., Fjerstad, W.H., Ness, A.B., Graham, M.D., Brubaker, R.F.: New Video Pupillometer. Optical Engineering 34(3) (1995) 676-682
17. Merritt, S.L., Schnyders, H.C., Patel, M., Basner, R.C., O'Neill, W.: Pupil Staging and EEG Measurement of Sleepiness. International Journal of Psychophysiology 52 (2004) 97-112
18. Millán, J.R., Mouriño, J., Franzé, M., Cincotti, F., Varsta, M., Heikkonen, J., Babiloni, F.: A Local Neural Classifier for the Recognition of EEG Patterns Associated to Mental Tasks. IEEE Transactions on Neural Networks 13 (2002) 678-686
19. Roth, P.R.: Effective Measurements Using Digital Signal Analysis. IEEE Spectrum (1971) 62-70
20. Suzuki, Y.: Self-Organizing QRS-Wave Recognition in ECG Using Neural Networks. IEEE Transactions on Neural Networks 6 (1995) 1469-1477
21. Varady, P., Micsik, T., Benedek, S., Benyo, Z.: A Novel Method for the Detection of Apnea and Hypopnea Events in Respiration Signals. IEEE Transactions on Biomedical Engineering 49(9) (2002) 936-942
22. Yen, F.C., Behbehani, K., Lucas, E.A., Burk, J.R., Axe, J.P.: A Noninvasive Technique for Detecting Obstructive and Central Sleep Apnea. IEEE Transactions on Biomedical Engineering 44(12) (1997) 1262-1268
23. American Academy of Sleep Medicine: Sleep Related Breathing Disorders in Adults: Recommendations for Syndrome Definition and Measurement Techniques in Clinical Research. Sleep 22(5) (1999) 667-689
Author Index
Abiyev, Rahib H. II-241 Acu˜ na, Gonzalo I-311, I-1255, II-391 Afzulpurkar, Nitin V. III-252 Ahmad, Khurshid II-938 Ahn, Tae-Chon II-186 Ai, Lingmei II-1202 Akiduki, Takuma II-542 Al-Jumeily, Dhiya II-921 Al-shanableh, Tayseer II-241 Aliev, R.A. II-307 Aliev, R.R. II-307 Almeida, An´ıbal T. de I-138, III-73 Amari, Shun-ichi I-935 Anitha, R. I-546 Ara´ ujo, Ricardo de A. II-602 Aung, M.S.H. II-1177 Bae, Hyeon III-641 Bae, JeMin I-1221 Baek, Gyeongdong III-641 Baek, Seong-Joon II-1240 Bai, Qiuguo III-1107 Bai, Rui II-362 Bai, Xuerui I-349 Bambang, Riyanto T. I-54 Bao, Zheng I-1303 Barua, Debjanee II-562 Bassi, Danilo II-391 Bastari, Alessandro III-783 Baten, A.K.M.A. II-1221 Bevilacqua, Vitoantonio II-1107 Bi, Jing I-609 Bin, Deng III-981 Bin, Liu III-981 Boumaiza, Slim I-582 Cai, ManJun I-148 Cai, W.C. I-786 Cai, Wenchuan I-70 Cai, Zixing I-743 Caiyun, Chen III-657, III-803 Calster, B. Van II-1177 Canu, St´ephane III-486 Cao, Fenwen II-810 Cao, Jinde I-941, I-958, I-1025
Cao, Shujuan II-680 Carpenter, Gail A. I-1094 Carvajal, Karina I-1255 Cecchi, Guillermo II-500, II-552 Cecchi, Stefania III-731, III-783 Celikoglu, Hilmi Berk I-562 Chacon M., Mario I. III-884 Chai, Lin I-222 Chai, Tianyou II-362 Chai, Yu-Mei I-1162 Chandra Sekhar, C. I-546 Chang, Bao Rong III-357 Chang, Bill II-1221 Chang, Guoliang II-1168 Chang, Hyung Jin III-506 Chang, Shengjiang II-457 Chang, T.K. II-432 Chang, Y.P. III-580 Chang, Zhengwei III-1015 Chao, Kuei-Hsiang III-1145 Che, Haijun I-480 Chen, Boshan III-123 Chen, Chaochao I-824 Chen, Dingguo I-183, I-193 Chen, Feng I-473, I-1303 Chen, Fuzan II-448 Chen, Gong II-1056 Chen, Huahong I-1069 Chen, Huawei I-1069 Chen, Hung-Cheng III-26 Chen, Jianxin I-1274, II-1159 Chen, Jie II-810 Chen, Jing I-1274 Chen, Jinhuan III-164 Chen, Joseph III-1165 Chen, Juan II-224 Chen, Lanfeng I-267 Chen, Le I-138 Chen, Li-Chao II-656 Chen, Lingling II-1291 Chen, Min-You I-528 Chen, Mou I-112 Chen, Mu-Song III-998 Chen, Ping III-426
Chen, Po-Hung III-26, III-1120 Chen, Qihong I-64 Chen, Shuzhen III-454 Chen, Tianping I-994, I-1034 Chen, Ting-Yu II-336 Chen, Wanming I-843 Chen, Weisheng I-158 Chen, Wen-hua I-112 Chen, Xiaowei II-381 Chen, Xin I-813 Chen, Xiyuan III-41 Chen, Ya zhu III-967 Chen, Yen-wei II-979 Chen, Ying III-311, III-973 Chen, Yong I-1144, II-772 Chen, Yuehui I-473, II-1211 Chen, Yunping III-1176 Chen, Zhi-Guo II-994, III-774 Chen, Zhimei I-102 Chen, Zhimin III-204 Chen, Zhong I-776, III-914 Chen, Zhongsong III-73 Cheng, Gang I-231 Cheng, Jian II-120 Cheng, Zunshui I-1025 Chi, Qinglei I-29 Chi, Zheru I-626 Chiu, Ming-Hui I-38 Cho, Jae-Hyun III-923 Cho, Sungzoon II-880 Choi, Jeoung-Nae III-225 Choi, Jin Young III-506 Choi, Seongjin I-602 Choi, Yue Soon III-1114 Chu, Shu-Chuan II-905 Chun-Guang, Zhou III-448 Chung, Chung-Yu II-785 Chung, TaeChoong I-704 Cichocki, Andrzej II-1032, III-793 Cruz, Francisco II-391 Cruz-Meza, Mar´ıa Elena III-828 Cuadros-Vargas, Ernesto II-620 Cubillos, Francisco A. I-311, II-391 Cui, Baotong I-935 Cui, Baoxia II-160 Cui, Peiling III-597 da Silva Soares, Anderson Dai, Dao-Qing II-1081 Dai, Jing III-607
III-1024
Dai, Ruwei I-1280 Dai, Shaosheng II-640 Dai, Xianzhong II-196, III-1138 Dakuo, He III-330 Davies, Anthony II-938 Dell’Orco, Mauro I-562 Deng, Fang’an I-796 Deng, Qiuxiang II-575 Deng, Shaojiang II-724 Dengsheng, Zhu III-1043 Dillon, Tharam S. II-965 Ding, Gang III-66 Ding, Mingli I-667, III-721 Ding, Mingwei II-956 Ding, Mingyong II-1048 Ding, Xiao-qing III-1033 Ding, Xiaoshuai III-117 Ding, Xiaoyan II-40 Dong, Jiyang I-776, III-914 Dou, Fuping I-480 Du, Ji-Xiang I-1153, II-793, II-819 Du, Junping III-80 Du, Lan I-1303 Du, Wei I-652 Du, Xin I-714 Du, Xin-Wei III-1130 Du, Yina III-9 Du, Zhi-gang I-465 Duan, Hua III-812 Duan, Lijuan II-851 Duan, Yong II-160 Duan, Zhemin III-943 Duan, Zhuohua I-743 El-Bakry, Hazem M. III-764 Etchells, T.A. II-1177 Fan, Fuling III-416 Fan, Huaiyu II-457 Fan, Liping II-1042 Fan, Shao-hui II-994 Fan, Yi-Zheng I-572 Fan, Youping III-1176 Fan, Yushun I-609 Fang, Binxing I-1286 Fang, Jiancheng III-597 Fang, Shengle III-292 Fang, Zhongjie III-237 Fei, Minrui II-483 Fei, Shumin I-81, I-222
Author Index Feng, Chunbo III-261 Feng, Deng-Chao III-869 Feng, Hailiang III-933 Feng, Jian III-715 Feng, Xiaoyi II-135 Feng, Yong II-947 Feng, Yue I-424 Ferreira, Tiago A.E. II-602 Florez-Choque, Omar II-620 Freeman, Walter J. I-685 Fu, Chaojin III-123 Fu, Jiacai II-346 Fu, Jun I-685 Fu, Lihua I-632 Fu, Mingang III-204 Fu, Pan II-293 Fuli, Wang III-330 Fyfe, Colin I-397 Gan, Woonseng I-176 Gao, Chao III-35 Gao, Cunchen I-910 Gao, Jian II-640 Gao, Jinwu II-257 Gao, Junbin II-680 Gao, Liang III-204 Gao, Liqun II-931, III-846 Gao, Ming I-935 Gao, Shaoxia III-35 Gao, Song II-424 Gao, Wen II-851 Gao, Zengan III-741 Gao, Zhi-Wei I-519 Gao, Zhifeng I-875 Gardner, Andrew B. II-1273 Gasso, Gilles III-486 Ge, Baoming I-138, III-73 Ge, Junbo II-1125 Geng, Guanggang I-1280 Ghannouchi, Fadhel M. I-582 Glantschnig, Paul II-1115 Gong, Shenguang III-672 Grossberg, Stephen I-1094 Gu, Hong II-1 Gu, Ying-kui I-553, II-275 Guan, Peng I-449, II-671 Guan, Zhi-Hong II-8, II-113 Guirimov, B.G. II-307 Guo, Chengan III-461
Guo, Guo, Guo, Guo, Guo, Guo, Guo, Guo, Guo,
1311
Chenlei I-723 Lei I-93, I-1054 Li I-1286, II-931, III-846 Ling III-434 Peng III-633, III-950 Ping II-474 Qi I-904 Wensheng III-80 Xin II-1291
Hadzic, Fedja II-965 Haifeng, Sang III-330 Halgamuge, Saman K II-801, II-1221, III-1087 Hamaguchi, Kosuke I-926 Han, Feng II-740 Han, Fengqing I-1104 Han, Jianda III-589 Han, Jiu-qiang II-646 Han, Min II-569 Han, Mun-Sung I-1318 Han, Pu III-545 Han, Risheng II-705 Han, SeungSoo III-246 Hao, Yuelong I-102 Hao, Zhifeng I-8 Hardy, David II-801 He, Fen III-973 He, Guoping III-441, III-812 He, Haibo I-413, I-441 He, Huafeng I-203 He, Lihong I-267 He, Naishuai II-772 He, Qing III-336 He, Tingting I-632 He, Xin III-434 He, Xuewen II-275 He, Yigang III-570, III-860, III-1006 He, Zhaoshui II-1032 He, Zhenya III-374 Heng, Yue III-561 Hirasawa, Kotaro I-403 Hoang, Minh-Tuan T. I-1077 Hong, Chin-Ming I-45 Hong, SangJeen III-246 Hong, Xia II-516, II-699 Hope, A.D. II-293 Hou, Weizhen III-812 Hou, Xia I-1247 Hou, Zeng-Guang II-438
Hsu, Arthur III-1087 Hsu, Chia-Chang III-1145 Hu, Chengquan I-652, II-1264 Hu, Dewen I-1061 Hu, Haifeng II-630 Hu, Jing II-985 Hu, Jinglu I-403 Hu, Jingtao III-277 Hu, Meng I-685 Hu, Ruifen I-685 Hu, Sanqing II-1273 Hu, Shiqiang III-950 Hu, Shou-Song I-1247 Hu, Wei III-277 Hu, Xiaolin III-194 Hu, Xuelei I-1211 Hu, Yun-an II-47 Huaguang, Zhang III-561 Huang, Benxiong I-1336, III-626 Huang, D. II-1002 Huang, Dexian III-219 Huang, Fu-Kuo III-57 Huang, Hong III-933 Huang, Hong-Zhong III-267 Huang, Jikun III-1058 Huang, Kai I-1183 Huang, Liangli III-407 Huang, Liyu II-1202 Huang, Peng II-593 Huang, Qingbao III-1097 Huang, Tingwen II-24 Huang, Xinsheng III-853 Huang, Xiyue II-772, III-553 Huang, Yanxin II-1264 Huang, Yuancan III-320 Huang, Yuchun I-1336, III-626 Huang, Zailu I-1336 Huang, Zhen I-733, I-824 Huffel, S. Van II-1177 Huo, Linsheng III-1182 Hussain, Abir Jaafar II-921 Huynh, Hieu T. I-1077 Hwang, Chi-Pan III-998 Hwang, Seongseob II-880 Imamura, Takashi II-542 Irwin, George W. I-496 Isahara, Hitoshi I-1310 Islam, Md. Monirul II-562 Iwamura, Kakuzo II-257
Jarur, Mary Carmen II-1150 Je, Sung-Kwan III-923 Ji, Geng I-166 Ji, Guori III-545 Jia, Hongping I-257, I-642 Jia, Huading III-497 Jia, Peifa I-852, II-328 Jia, Yunde II-896 Jia, Zhao-Hong I-572 Jian, Feng III-561 Jian, Jigui II-143 Jian, Shu III-147 Jian-yu, Wang III-448 Jiang, Chang-sheng I-112 Jiang, Chenwei II-1133 Jiang, Haijun I-1008 Jiang, Minghui I-952, III-292 Jiang, Nan III-1 Jiang, Tiejun III-350 Jiang, Yunfei II-474 Jiang, Zhe III-589 Jiao, Li-cheng II-120 Jin, Bo II-510 Jin, Huang II-151 Jin, Xuexiang II-1022 Jin, Yihui III-219, III-1058 Jin-xin, Tian III-49 Jing, Chunguo III-1107 Jing, Zhongliang II-705 Jinhai, Liu III-561 JiuFen, Zhao III-834 Jo, Taeho I-1201, II-871 Jos´e Coelho, Clarimar III-1024 Ju, Chunhua III-392 Ju, Liang I-920, I-1054 Ju, Minseong III-140 Jun, Ma I-512 Jun, Yang III-981 Junfeng, Xu III-17 Jung, Byung-Wook III-641 Jung, Young-Giu I-1318 Kanae, Shunshoku I-275, II-1194 Kang, Jingli III-157 Kang, Mei I-257 Kang, Min-Jae I-1015 Kang, Sangki II-1240 Kang, Y. III-580 Kao, Tzu-Ping II-336 Kao, Yonggui I-910
Author Index Kaynak, Okyay I-14 Ke, Hai-Sen I-285 Kelleher, Dermot II-938 Kil, Rhee Man I-1117, I-1318 Kim, Byungwhan I-602 Kim, Dae Young III-368 Kim, Dongjun II-1187 Kim, DongSeop III-246 Kim, Ho-Chan I-1015 Kim, Ho-Joon II-715 Kim, HyunKi II-206 Kim, Jin Young II-1240 Kim, Kwang-Baek II-756, III-923 Kim, Kyeongseop II-1187 Kim, Pyo Jae III-506 Kim, Seoksoo II-1090, III-140 Kim, Sungshin III-641 Kim, Tai-hoon III-140 Kim, Woo-Soon III-1114 Kim, Yong-Kab III-1114 Kim, Yountae III-641 Ko, Hee-Sang I-1015 Konako˘ glu, Ekrem I-14 Koo, Imhoi I-1117 Kozloski, James II-500, II-552 Kumar, R. Pradeep II-1012 Kurnaz, Sefer I-14 Lai, Pei Ling I-397 Lee, Ching-Hung I-38, II-317 Lee, Geehyuk II-104 Lee, InTae II-206 Lee, Jeongwhan II-1187 Lee, Jin-Young III-923 Lee, Joseph S. II-715 Lee, Junghoon I-1015 Lee, Malrey I-1201, II-871 Lee, Seok-Lae I-1045 Lee, SeungGwan I-704 Lee, Shie-Jue III-515 Lee, SungJoon III-246 Lee, Tsai-Sheng I-694 Lee, Yang Weon III-1192 Leu, Yih-Guang I-45 Leung, Kwong-Sak II-371 Li, Ang II-689 Li, Bin I-767, I-1087 Li, Chuandong II-24 Li, Chun-hua III-382 Li, Demin III-695
Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li, Li,
Guang I-685 Haibin I-994 Hailong III-9 Haisheng II-414 Hongnan III-1182 Hongru III-9 Ji III-686 Jianwei III-933 Jing II-47, II-656 Jiuxian II-889, III-392 Jun I-676 Jun-Bao II-905 Kang I-496, II-483 Li I-132 Liming III-407 Meng II-842, III-1077 Minqiang II-448 Ping II-33 Qing II-251 Qingdu II-96 Qingguo II-424 Qiudan I-1280 San-ping III-382 Shaoyuan I-505 Shutao III-407 Tao I-81, I-93, I-374, II-8 Weidong III-147 Weimin III-988 Xiao-Li I-87 Xiaodong I-176 Xiaomei II-170 Xiaoou I-487, II-483 Xiuxiu I-796 Xue III-741 Yan II-1281 Yang I-1286 Yangmin I-757, I-813 Yanwen II-842 Yaobo II-1056 Yi II-612 Yibin I-1087 Yinghong I-424 Yong-Wei III-633 Yongming I-1 Yongwei III-950 Yuan III-1130 Yue I-414 Yufei I-424 Yunxia III-758 Zhengxue III-117
1313
1314
Author Index
Li, Zhiquan III-311 Lian, Qiusheng III-454 Lian, Shiguo II-79 Liang, Dong I-572 Liang, Hua I-618, I-920, III-399 Liang, Huawei I-843 Liang, Jinling II-33 Liang, Rui II-257 Liang, Yanchun I-8, I-652, II-1264 Liang, Yong II-371 Liao, Longtao I-505 Liao, Wudai I-897, III-164 Liao, X.H. I-70, I-786 Liao, Xiaofeng I-1104, II-724 Liao, Xiaoxin I-897, II-143, III-164, III-292 Lim, Jun-Seok II-398, III-678 Lin, Chuan Ku III-998 Lin, Hong-Dar II-785 Lin, ShiehShing III-231 Lin, Sida I-968 Lin, Xiaofeng III-1097 Lin, Yaping II-1254 Lin, Yu-Ching II-317 Lin, Yunsong II-1048 Lin, Zhiling I-380 Ling, Zhuang III-213 Linhui, Cai II-151 Lisboa, P.J.G II-1177 Liu, Benyong II-381 Liu, Bin III-1107 Liu, Bo III-219, III-1058 Liu, Derong I-387, II-1299 Liu, Di-Chen III-1130 Liu, Dianting II-740 Liu, Dongmei II-1231 Liu, Fei III-1067 Liu, Guangjun II-251 Liu, Guohai I-257, I-642 Liu, Hongwei I-1303 Liu, Hongwu III-686 Liu, Ji-Zhen II-179 Liu, Jilin I-714 Liu, Jin III-751 Liu, JinCun I-148 Liu, Jinguo I-767 Liu, Ju II-1065 Liu, Jun II-772 Liu, Lu III-1176 Liu, Meiqin I-968
Liu, Meirong III-570 Liu, Peipei I-1069 Liu, Qiuge III-336 Liu, Shuhui I-480 Liu, Taijun I-582 Liu, Wen II-57 Liu, Wenhui III-721 Liu, Xiang-Jie II-179 Liu, Xiaohe I-176 Liu, Xiaohua III-751 Liu, Xiaomao II-680 Liu, Yan-Kui II-267 Liu, Ying II-267, III-1058 Liu, Yinyin II-534, II-956 Liu, Yongguo III-237 Liu, Yun III-1155 Liu, Yunfeng I-203 Liu, Zengrong II-16 Liu, Zhi-Qiang II-267 Liu, Zhongxuan II-79 Lloyd, Stephen R. II-1299 L¨ ofberg, Johan I-424 Long, Aideen II-938 Long, Fei I-292 Long, Jinling I-1110 Long, Ying III-1006 Loosli, Ga¨elle III-486 L´ opez-Y´ an ˜ez, Itzam´ a II-835, III-828 Lu, Bao-Liang I-1310, III-525 Lu, Bin II-224 Lu, Congde II-1048 Lu, Hong tao III-967 Lu, Huiling I-796 Lu, Wenlian I-1034 Lu, Xiaoqing I-986, I-1193 Lu, Xinguo II-1254 Lu, Yinghua II-842 Lu, Zhiwu I-986, I-1193 Luan, Xiaoli III-1067 Lum, Kok Siong II-346 Luo, Qi I-170, II-301 Luo, Siwei II-1281 Luo, Wen III-434 Luo, Yan III-302 Luo, Yirong I-455 Lv, Guofang I-618 Ma, Chengwei III-973 Ma, Enjie II-362 Ma, Fumin I-658
Author Index Ma, Honglian III-461 Ma, Jiachen III-840, III-877 Ma, Jianying II-1125 Ma, Jieming II-1133 Ma, Jinwen I-1183, I-1227 Ma, Liyong III-840, III-877 Ma, Shugen I-767 Ma, Xiaohong II-40, III-751 Ma, Xiaolong I-434 Ma, Xiaomin III-1 Ma, Yufeng III-672 Ma, Zezhong III-933 Ma, Zhiqiang III-1077 Mahmood, Ashique II-562 Majewski, Maciej III-1049 Mamedov, Fakhreddin II-241 Mao, Bing-yi I-867, III-454 Marwala, Tshilidzi I-1237, I-1293 Mastorakis, Nikos III-764 Mastronardi, Giuseppe II-1107 Matsuka, Toshihiko I-1135 May, Gary S. III-246 Mei, Tao I-843 Mei, Xue II-889 Mei, Xuehui I-1008 Men, Jiguan III-1176 Meng, Hongling II-88, III-821 Meng, Max Q.-H. I-843 Meng, Xiangping II-493 Menolascina, Filippo II-1107 Miao, Jun II-851 Min, Lequan III-147 Mingzeng, Dai III-663 Miyake, Tetsuo II-542 Mohler, Ronald R. I-183 Mora, Marco II-1150 Moreno, Vicente II-391 Moreno-Armendariz, Marco I-487 Musso, Cosimo G. de II-1107 Na, Seung You II-1240 Nagabhushan, P. II-1012 Nai-peng, Hu III-49 Nan, Dong I-1110 Nan, Lu III-448 Naval Jr., Prospero C. III-174 Navalertporn, Thitipong III-252 Nelwamondo, Fulufhelo V. I-1293 Ng, S.C. II-664 Ngo, Anh Vien I-704
1315
Nguyen, Hoang Viet I-704 Nguyen, Minh Nhut II-346 Ni, Junchao I-158 Nian, Xiaoling I-1069 Nie, Xiaobing I-958 Nie, Yalin II-1254 Niu, Lin I-465 Oh, Sung-Kwun II-186, II-206, III-225 Ong, Yew-Soon I-1327 Ortiz, Floriberto I-487 Ou, Fan II-740 Ou, Zongying II-740 Pan, Jeng-Shyang II-905 Pan, Jianguo II-352 Pan, Li-Hu II-656 Pan, Quan II-424 Pandey, A. III-246 Pang, Zhongyu II-1299 Park, Aaron II-1240 Park, Cheol-Sun III-368 Park, Cheonshu II-730 Park, Choong-shik II-756 Park, Dong-Chul III-105, III-111 Park, Jong Goo III-1114 Park, Sang Kyoon I-1318 Park, Yongsu I-1045 Pavesi, Leopoldo II-1150 Peck, Charles II-500, II-552 Pedone, Antonio II-1107 Pedrycz, Witold II-206 Pei, Wenjiang III-374 Peng, Daogang I-302 Peng, Jian-Xun II-483 Peng, Jinzhu I-592, I-804 Peng, Yulou III-860 Pi, Yuzhen II-493 Pian, Zhaoyu II-931, III-846 Piazza, Francesco III-731, III-783 Ping, Ling III-448 Pizzileo, Barbara I-496 Pu, Xiaorong III-237 Qi, Juntong III-589 Qian, Jian-sheng II-120 Qian, Juying II-1125 Qian, Yi II-689 Qianhong, Lu III-981 Qiao, Chen III-131 Qiao, Qingli II-72
1316
Author Index
Qiao, Xiao-Jun III-869 Qing, Laiyun II-851 Qingzhen, Li III-834 Qiong, Bao I-536 Qiu, Jianlong I-1025 Qiu, Jiqing I-871 Qiu, Zhiyong III-914 Qu, Di III-117 Quan, Gan II-151 Quan, Jin I-64 Rao, A. Ravishankar II-500, II-552 Ren, Dianbo I-890 Ren, Guanghui II-765, III-651 Ren, Quansheng II-88, III-821 Ren, Shi-jin II-216 Ren, Zhen II-79 Ren, Zhiliang II-1056 Rivas P., Pablo III-884 Roh, Seok-Beom II-186 Rohatgi, A. III-246 Rom´ an-God´ınez, Israel II-835 Ronghua, Li III-657 Rosa, Jo˜ ao Lu´ı Garcia II-825 Rossini, Michele III-731 Rubio, Jose de Jesus I-1173 Ryu, Joung Woo II-730 Sakamoto, Yasuaki I-1135 S´ anchez-Garfias, Flavio Arturo III-828 Sasakawa, Takafumi I-403 Savage, Mandara III-1165 Sbarbaro, Daniel II-1150 Schikuta, Erich II-1115 Senaratne, Rajinda II-801 Seo, Ki-Sung III-225 Seredynski, Franciszek III-85 Shang, Li II-810 Shang, Yan III-454 Shao, Xinyu III-204 Sharmin, Sadia II-562 Shen, Jinyuan II-457 Shen, Lincheng I-1061 Shen, Yanjun I-904 Shen, Yehu I-714 Shen, Yi I-952, III-292, III-840, III-877 Shen, Yue I-257, I-642 Sheng, Li I-935 Shi, Haoshan III-943 Shi, Juan II-346 Shi, Yanhui I-968
Shi, Zhongzhi III-336 Shiguang, Luo III-803 Shin, Jung-Pil III-641 Shin, Sung Hwan III-111 Shunlan, Liu III-663 Skaruz, Jaroslaw III-85 Sohn, Joo-Chan II-730 Song, Chunning III-1097 Song, David Y. I-70 Song, Dong Sung III-506 Song, Jaegu II-1090 Song, Jinya I-618, III-479 Song, Joo-Seok I-1045 Song, Kai I-671, III-721 Song, Qiankun I-977 Song, Shaojian III-1097 Song, Wang-Cheol I-1015 Song, Xiao xiao III-1097 Song, Xuelei II-746 Song, Y.D. I-786 Song, Yong I-1087 Song, Young-Soo III-105 Song, Yue-Hua III-426 Song, Zhuo II-1248 Sousa, Robson P. de II-602 Squartini, Stefano III-731, III-783 Starzyk, Janusz A. I-413, I-441, II-534, II-956 Stead, Matt II-1273 Stuart, Keith Douglas III-1049 Sun, Bojiao I-1346 Sun, Changcun II-1056 Sun, Changyin I-618, I-920, III-479 Sun, Fangxun I-652 Sun, Fuchun I-132 Sun, Haiqin I-319 Sun, Jiande II-1065 Sun, Lei I-843 Sun, Lisha II-1168 Sun, Pei-Gang II-234 Sun, Qiuye III-607 Sun, Rongrong II-284 Sun, Shixin III-497 Sun, Xinghua II-1065 Sun, Youxian II-1097, II-1140 Sun, Z. I-786 Sung, KoengMo II-398 Tan, Ah-Hwee I-1094 Tan, Cheng III-1176
Tan, Hongli III-853 Tan, Min II-438, III-1155 Tan, Yanghong III-570 Tan, Ying III-705 Tan, Yu-An II-301 Tang, Guiji III-545 Tang, GuoFeng II-465 Tang, Jun I-572 Tang, Lixin II-63 Tang, Songyuan II-979 Tang, Wansheng III-157 Tang, Yuchun II-510 Tang, Zheng II-465 Tao, Liu I-512 Tao, Ye III-267 Testa, A.C. II-1177 Tian, Fengzhan II-414 Tian, GuangJun I-148 Tian, Jin II-448 Tian, Xingbin I-733 Tian, Yudong I-213 Tie, Ming I-609 Timmerman, D. II-1177 Tong, Ling II-1048 Tong, Shaocheng I-1 Tong, Weiming II-746 Tsai, Hsiu Fen III-357 Tsai, Hung-Hsu III-904 Tsao, Teng-Fa I-694 Tu, Zhi-Shou I-358 Uchiyama, Masao I-1310 Uyar, K. II-307
Vairappan, Catherine II-465 Valentin, L. II-1177 Vanderaa, Bill II-801 Veludo de Paiva, Maria Stela III-1024 Vilakazi, Christina B. I-1237 Vo, Nguyen H. I-1077 Vogel, David II-534 Volkov, Yuri II-938 Wada, Kiyoshi I-275, II-1194 Wan, Yuanyuan II-819 Wang, Baoxian II-143 Wang, Bin II-1133 Wang, Bolin III-399 Wang, C.C. III-580 Wang, Chunheng I-1280 Wang, Dacheng I-1104
Wang, Dongyun I-897 Wang, Fu-sheng I-241, I-249 Wang, Fuli I-380 Wang, Fuliang I-257 Wang, Gaofeng I-632 Wang, Grace S. III-57 Wang, Guang-Jiang III-774 Wang, Guoqiang II-740 Wang, Haijun II-1254 Wang, Haila II-79 Wang, Hong I-93 Wang, Hongbo I-733, I-824 Wang, Honghui II-1097 Wang, Hongrui I-749 Wang, Hongwei II-1 Wang, Jiacun III-695 Wang, Jiahai III-184 Wang, Jian III-95, III-194 Wang, Jianzhong II-493 Wang, Jingming I-1087 Wang, Jue II-1202 Wang, Jun III-95, III-194 Wang, Jun-Song I-519 Wang, Kai II-406 Wang, Ke I-834 Wang, Kuanquan II-583 Wang, Kun II-931, III-846 Wang, Lan III-416 Wang, Le I-1183 Wang, Lei I-29, III-497 Wang, Liangliang I-1227 Wang, Ling III-219, III-1058 Wang, Lipo II-57 Wang, Meng-Hui III-1145 Wang, Nian I-572 Wang, Qi III-721 Wang, Qin II-689 Wang, Qingren II-406 Wang, Quandi I-528 Wang, Rubin I-1127 Wang, Ruijie III-80 Wang, Sheng-jin III-1033 Wang, Shiwei I-122 Wang, Shoujue III-616 Wang, Shoulin I-920 Wang, Shufeng II-352 Wang, Shuqin I-652 Wang, Shuzong III-350 Wang, Tian-Zhen II-985 Wang, Tianmiao III-535
Wang, Wei I-834, III-535 Wang, Weiqi II-1125 Wang, Xiang-ting II-120 Wang, Xiaohong I-952 Wang, Xiaohua III-860, III-1006 Wang, Xihuai I-658 Wang, Xin II-196 Wang, Xiuhong II-72 Wang, Xiumei I-652 Wang, Xiuqing III-1155 Wang, Xiuxiu II-612 Wang, Xuelian II-1202 Wang, Xuexia II-765 Wang, XuGang II-465 Wang, Yan I-652, II-1264 Wang, Yaonan I-592, I-804, III-469 Wang, Yen-Nien I-694 Wang, Yong I-22 Wang, Yongtian II-979 Wang, Yuan I-852, II-328 Wang, Yuan-Yuan II-284, II-819, II-1125, III-426 Wang, Yuechao I-767 Wang, Zeng-Fu I-1153 Wang, Zhancheng III-988 Wang, Zhaoxia II-1231 Wang, Zhen-Yu III-633 Wang, Zhihai II-414 Wang, Zhiliang II-251 Wang, Zuo II-301 Wei, Guoliang III-695 Wei, Hongxing III-535 Wei, Miaomiao II-612 Wei, Qinglai I-387 Wei, Ruxiang III-350 Wei, Wei I-292 Wei, Xunkai I-424 Wei, Yan Hao III-998 Weiqi, Yuan III-330 Wen, Bangchun I-29 Wen, Cheng-Lin I-319, II-994, II-985, III-774 Wen, Chuan-Bo III-774 Wen, Lei III-284 Wen, Shu-Huan I-863 Wen, Yi-Min III-525 Weng, Liguo I-74 Weng, Shilie I-213 Wenjun, Zhang I-512 Wickramarachchi, Nalin II-1221
Won, Yonggwan I-1077, II-1240 Wong, Hau-San III-894 Wong, Stephen T.C. II-1097, II-1140 Woo, Dong-Min III-105 Woo, Seungjin II-1187 Woo, Young Woon II-756 Worrell, Gregory A. II-1273 Wu, Aiguo I-380 Wu, Bao-Gui III-267 Wu, Gengfeng II-352 Wu, Jianbing I-642 Wu, Jianhua I-267, II-931, III-846 Wu, Kai-gui II-947 Wu, Ke I-1310 Wu, Lingyao I-1054 Wu, Qiang I-473 Wu, Qing-xian I-112 Wu, Qingming I-231 Wu, Si I-926 Wu, TiHua I-148 Wu, Wei I-1110, III-117 Wu, Xianyong II-8, II-113 Wu, Xingxing II-170 Wu, You-shou III-1033 Wu, Yunfeng II-664 Wu, Yunhua III-1058 Wu, Zhengping II-113 Wu, Zhilu II-765, III-651 Xi, Guangcheng I-1274, II-1159 Xia, Jianjin I-64 Xia, Liangzheng II-889, III-392 Xia, Siyu III-392 Xia, Youshen III-95 Xian-Lun, Tang III-213 Xiang, C. II-1002 Xiang, Changcheng III-553 Xiang, Hongjun I-941 Xiang, Lan II-16 Xiang, Yanping I-368 Xiao, Deyun II-1072 Xiao, Gang II-705 Xiao, Jianmei I-658 Xiao, Jinzhuang I-749 Xiao, Min I-958 Xiao, Qinkun II-424 Xiaoli, Li III-1043 Xiaoyan, Ma III-981 Xie, Haibin I-1061 Xie, Hongmei II-135
Xing, Guangzhong III-1107 Xing, Jie II-1072 Xing, Yanwei I-1274 Xingsheng, Gu I-536 Xiong, Guangze III-1015 Xiong, Min III-1176 Xiong, RunQun II-465 Xiong, Zhong-yang II-947 Xu, Chi I-626 Xu, De III-1155 Xu, Dongpo III-117 Xu, Guoqing III-988 Xu, Hong I-285 Xu, Hua I-852, II-328 Xu, Huiling I-319 Xu, Jian III-1033 Xu, Jianguo I-807 Xu, Jing I-358 Xu, Jiu-Qiang II-234 Xu, Ning-Shou I-519 Xu, Qingsong I-757 Xu, Qinzhen III-374 Xu, Shuang II-896 Xu, Shuhua I-1336, III-626 Xu, Shuxiang I-1265 Xu, Xiaoyun II-1291 Xu, Xin I-455 Xu, Xinhe I-267, II-160 Xu, Xinzheng II-913 Xu, Xu I-8 Xu, Yang II-1042 Xu, Yangsheng III-988 Xu, Yulin III-164 Xu, Zong-Ben II-371, III-131 Xue, Xiaoping I-879 Xue, Xin III-441 Xurong, Zhang III-834 Yan, Gangfeng I-968 Yan, Hua II-1065 Yan, Jianjun III-959 Yan, Qingxu III-616 Yáñez-Márquez, Cornelio II-835, III-828 Yang, Guowei III-616 Yang, Hongjiu I-871 Yang, Hyun-Seung II-715 Yang, Hong-yong I-241, I-249 Yang, Jiaben I-183, I-193 Yang, Jingming I-480
Yang, Jiyun II-724 Yang, Jun I-158 Yang, Kuihe III-342 Yang, Lei II-646 Yang, Luxi III-374 Yang, Ming II-842 Yang, Peng II-1291 Yang, Ping I-302 Yang, Wei III-967 Yang, Xiao-Song II-96 Yang, Xiaogang I-203 Yang, Xiaowei I-8 Yang, Yingyu I-1211 Yang, Yixian III-1 Yang, Yongming I-528 Yang, Yongqing II-33 Yang, Zhao-Xuan III-869 Yang, Zhen-Yu I-553 Yang, Zhi II-630 Yang, Zhi-Wu I-1162 Yang, Zhuo II-1248 Yang, Zi-Jiang I-275, II-1194 Yang, Zuyuan III-553 Yanxin, Zhang III-17 Yao, Danya II-1022 Ye, Bin II-656 Ye, Chun-xiao II-947 Ye, Mao III-741 Ye, Meiying II-127 Ye, Yan I-582 Ye, Zhiyuan I-986, I-1193 Yeh, Chi-Yuan III-515 Yi, Gwan-Su II-104 Yi, Jianqiang I-349, I-358, I-368, I-374, I-1274 Yi, Tinghua III-1182 Yi, Yang I-93 Yi, Zhang I-1001, II-526, III-758 Yin, Fuliang III-751 Yin, Jia II-569 Yin, Yixin II-251 Yin, Zhen-Yu II-234 Yin, Zheng II-1097 Yin-Guo, Li III-213 Ying, Gao III-17 Yongjun, Shen I-536 Yu, Changrui III-302 Yu, Chun-Chang II-336 Yu, D.L. II-432 Yu, D.W. II-432
Yu, Ding-Li I-122, I-339 Yu, Haocheng II-170 Yu, Hongshan I-592, I-804 Yu, Jian II-414 Yu, Jiaxiang II-1072 Yu, Jin-Hua III-426 Yu, Jinxia I-743 Yu, Miao II-724 Yu, Wen I-487, I-1173, II-483 Yu, Wen-Sheng I-358 Yu, Xiao-Fang III-633 Yu, Yaoliang I-449, II-671 Yu, Zhiwen III-894 Yuan, Chongtao III-461 Yuan, Dong-Feng I-22 Yuan, Hejin I-796 Yuan, Quande II-493 Yuan, Xiaofang III-469 Yuan, Xudong III-35 Yue, Dongxue III-853 Yue, Feng II-583 Yue, Heng III-715 Yue, Hong I-329 Yue, Shihong II-612 Yusiong, John Paul T. III-174 Zang, Qiang III-1138 Zdunek, Rafal III-793 Zeng, Qingtian III-812 Zeng, Wenhua II-913 Zeng, Xiaoyun III-1077 Zeng, Zhigang II-575 Zhai, Chuan-Min II-793, II-819 Zhai, Yu-Jia I-339 Zhai, Yuzheng III-1087 Zhang, Biyin II-861 Zhang, Bo II-40 Zhang, Chao III-545 Zhang, Chenggong II-526 Zhang, Daibing I-1061 Zhang, Daoqiang II-778 Zhang, Dapeng I-380 Zhang, David II-583 Zhang, Guo-Jun I-1153 Zhang, Hao I-302 Zhang, Huaguang I-387, III-715 Zhang, Jianhai I-968 Zhang, Jinfang I-329 Zhang, Jing II-381 Zhang, Jingdan II-1081
Zhang, Jinggang I-102 Zhang, Jinhui I-871 Zhang, Jiqi III-894 Zhang, Jiye I-890 Zhang, Jun II-680 Zhang, Jun-Feng I-1247 Zhang, Junxiong III-973 Zhang, Junying I-776 Zhang, Kanjian III-261 Zhang, Keyue I-890 Zhang, Kun II-861 Zhang, Lei I-1001, III-1077 Zhang, Lijing I-910 Zhang, Liming I-449, I-723, II-671, II-1133 Zhang, M.J. I-70, I-786 Zhang, Meng I-632 Zhang, Ming I-1265 Zhang, Ning II-1248 Zhang, Pan I-1144 Zhang, Pinzheng III-374 Zhang, Qi II-1125 Zhang, Qian III-416 Zhang, Qiang I-231 Zhang, Qizhi I-176 Zhang, Shaohong III-894 Zhang, Si-ying I-241, I-249 Zhang, Su III-967 Zhang, Tao II-1248 Zhang, Tengfei I-658 Zhang, Tianping I-81 Zhang, Tianqi II-640 Zhang, Tianxu II-861 Zhang, Tieyan III-715 Zhang, Wei II-656 Zhang, Xi-Yuan II-234 Zhang, Xiao-Dan II-234 Zhang, Xiao-guang II-216 Zhang, Xing-gan II-216 Zhang, XueJian I-148 Zhang, Xueping II-1291 Zhang, Xueqin II-1211 Zhang, Yan-Qing II-510 Zhang, Yanning I-796 Zhang, Yanyan II-63 Zhang, Yaoyao I-968 Zhang, Yi II-1022 Zhang, Ying-Jun II-656 Zhang, Yongqian III-1155 Zhang, You-Peng I-676
Zhang, Yu II-810 Zhang, Yu-sen III-382 Zhang, Yuxiao I-8 Zhang, Zhao III-967 Zhang, Zhaozhi III-1 Zhang, Zhikang I-1127 Zhang, Zhiqiang II-465 Zhang, Zhong II-542 Zhao, Bing III-147 Zhao, Chunyu I-29 Zhao, Dongbin I-349, I-368, I-374, I-1274 Zhao, Fan II-216 Zhao, Feng III-382 Zhao, Gang III-553 Zhao, Hai II-234 Zhao, Hongming III-535 Zhao, Jian-guo I-465 Zhao, Jianye II-88, III-821 Zhao, Lingling III-342 Zhao, Nan III-651 Zhao, Shuying I-267 Zhao, Xingang III-589 Zhao, Yaou II-1211 Zhao, Yaqin II-765, III-651 Zhao, Youdong II-896 Zhao, Zeng-Shun II-438 Zhao, Zhiqiang II-810 Zhao, Zuopeng II-913 Zheng, Chaoxin II-938 Zheng, Hongying II-724 Zheng, Huiru I-403 Zheng, Jianrong III-959 Zheng, Xia II-1140 Zhiping, Yu I-512 Zhon, Hong-Jian I-45 Zhong, Jiang II-947 Zhong, Shisheng III-66
Zhong, Ying-Ji I-22 Zhongsheng, Hou III-17 Zhou, Chunguang I-652, II-842, II-1264, III-1077 Zhou, Donghua I-1346 Zhou, Huawei I-257, I-642 Zhou, Jianting I-977 Zhou, Jie III-695 Zhou, Jin II-16 Zhou, Qingdong I-667 Zhou, Tao I-796 Zhou, Wei III-943 Zhou, Xianzhong III-434 Zhou, Xiaobo II-1097, II-1140 Zhou, Xin III-943 Zhou, Yali I-176 Zhu, Chongjun III-123 Zhu, Xun-lin I-241, I-249 Zhu, Jie II-593 Zhu, Jie (James) III-1165 Zhu, Lin I-904 Zhu, Qiguang I-749, III-311 Zhu, Qing I-81 Zhu, Si-Yuan II-234 Zhu, Xilin II-170 Zhu, Zexuan I-1327 Zhuang, Yan I-834 Zimmerman S., Alejandro III-884 Zong, Chi I-231 Zong, Ning II-516, II-699 Zou, An-Min II-438 Zou, Qi II-1281 Zou, Shuxue II-1264 Zuo, Bin II-47 Zuo, Wangmeng II-583 Zurada, Jacek M. I-1015 Zuyuan, Yang III-803